View Submission - EcoSta2023
A0604
Title: MDI+: A flexible feature importance framework for random forests
Authors: Tiffany Tang - University of California, Berkeley (United States) [presenting]
Abhineet Agarwal - University of California Berkeley (United States)
Ana Kenney - University of California, Irvine (United States)
Yan Shuo Tan - National University of Singapore (Singapore)
Bin Yu - UC Berkeley (United States)
Abstract: The mean decrease in impurity (MDI) is commonly used to evaluate feature importances in random forests (RF). It is shown that the MDI for a feature in each fitted tree in an RF is the unnormalized r-squared value in a linear regression of the response on the collection of local decision stumps corresponding to nodes that split on this feature. Building upon this r-squared interpretation of MDI, MDI+ is developed, which generalizes MDI and provides a flexible framework for computing feature importances using RFs. This MDI+ framework is based on a new predictive model, RF+, that allows the analyst to (1) replace the linear regression model and/or r-squared metric with regularized generalized linear models (GLMs) and metrics better suited to the given data structure and (2) incorporate additional features or knowledge to mitigate known biases of decision trees, such as their inefficiency in fitting additive or smooth models. Extensive data-inspired simulations show that MDI+ significantly outperforms popular feature importance measures in ranking and identifying relevant features across various settings. Then, in a real-world case study on drug response prediction, MDI+ extracts well-established predictive genes with greater stability and robustness than existing feature importance measures. Finally, possible extensions are discussed, along with use cases of MDI+ for extracting interpretable insights from causal forests and heterogeneous treatment effect estimation.
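The r-squared interpretation of MDI can be illustrated for a single split. The sketch below is not the authors' implementation. It assumes a squared-error impurity, a single split at the root node, and one particular scaling of the local decision stump; the function names and toy data are hypothetical. It compares the decrease in the sum of squared errors produced by the split with the sum of squares explained by regressing the response on the corresponding stump feature.

```python
def impurity_decrease(y, left_mask):
    """Drop in the sum of squared errors when y is split into left/right children."""
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    left = [v for v, is_left in zip(y, left_mask) if is_left]
    right = [v for v, is_left in zip(y, left_mask) if not is_left]
    return sse(y) - sse(left) - sse(right)


def stump_explained_ss(y, left_mask):
    """Sum of squares explained by a least-squares fit of y on one stump feature.

    Assumed stump scaling: +n_right / sqrt(n_left * n_right) for samples sent
    left and -n_left / sqrt(n_left * n_right) for samples sent right. This
    feature sums to zero, so fitting with or without an intercept explains the
    same sum of squares.
    """
    n_left = sum(1 for is_left in left_mask if is_left)
    n_right = len(y) - n_left
    scale = (n_left * n_right) ** 0.5
    psi = [n_right / scale if is_left else -n_left / scale for is_left in left_mask]
    dot = sum(v * p for v, p in zip(y, psi))
    norm_sq = sum(p * p for p in psi)
    # With one regressor, the explained sum of squares is <y, psi>^2 / <psi, psi>.
    return dot * dot / norm_sq


# Hypothetical toy data: the first three samples go to the left child.
y = [1.0, 2.0, 3.0, 10.0, 12.0]
left_mask = [True, True, True, False, False]

print(impurity_decrease(y, left_mask))
print(stump_explained_ss(y, left_mask))
```

Under these assumptions the two printed quantities agree up to floating-point error. Dividing either by the number of samples gives the sample-weighted impurity decrease for this split, while dividing by the total sum of squares of the response would give a normalized r-squared.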