CMStatistics 2023
B1134
Title: MDI+: a flexible random forest-based feature importance framework
Authors:  Tiffany Tang - University of California, Berkeley (United States)
Yan Shuo Tan - National University of Singapore (Singapore)
Abhineet Agarwal - University of California, Berkeley (United States)
Bin Yu - University of California, Berkeley (United States)
Ana Kenney - University of California, Irvine (United States) [presenting]
Abstract: The mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). It is shown that the MDI for a feature in each tree of an RF is equivalent to the unnormalized R-squared value in a linear regression of the response on the collection of local decision stumps corresponding to the nodes that split on this feature. This interpretation is used to propose a flexible feature importance framework called MDI+. Specifically, MDI+ generalizes MDI by allowing the analyst to replace the linear regression model and R-squared metric with regularized generalized linear models (GLMs) and metrics better suited to the given data structure. Moreover, MDI+ incorporates additional features to mitigate known biases of decision trees against additive or smooth models. Guidance is also provided on how practitioners can choose an appropriate GLM and metric based on the predictability, computability, and stability (PCS) framework for veridical data science. Extensive data-inspired simulations show that MDI+ significantly outperforms popular feature importance measures in identifying signal features. MDI+ is also applied to two real-world case studies, on drug response prediction and breast cancer subtype classification, where it is shown to extract well-established predictive genes with significantly greater stability than existing feature importance measures.
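
The equivalence stated in the abstract can be checked numerically on a single tree. The sketch below is illustrative and is not the authors' implementation. It assumes scikit-learn's `DecisionTreeRegressor` with its default squared-error impurity, a tree fit on the full sample without bootstrapping, synthetic data, and a particular stump encoding (n_R on the left child, -n_L on the right child, 0 elsewhere). "Unnormalized R-squared" is taken here to mean the explained sum of squares divided by n, which is also an assumption. The helper names `split_nodes`, `mdi_from_impurity`, and `mdi_from_stumps` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only, not from the abstract).
rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + 0.1 * rng.normal(size=n)

# One regression tree fit on the full sample (no bootstrap).
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
t = model.tree_
# in_node[i, j] = 1.0 if training sample i passes through node j.
in_node = model.decision_path(X).toarray().astype(float)


def split_nodes(k):
    """Internal nodes that split on feature k (leaves have child index -1)."""
    return [j for j in range(t.node_count)
            if t.children_left[j] != -1 and t.feature[j] == k]


def mdi_from_impurity(k):
    """Classical MDI: sample-weighted impurity decrease, summed over splits on k."""
    w, imp = t.weighted_n_node_samples, t.impurity
    total = 0.0
    for j in split_nodes(k):
        left, right = t.children_left[j], t.children_right[j]
        total += w[j] * imp[j] - w[left] * imp[left] - w[right] * imp[right]
    return total / n


def mdi_from_stumps(k):
    """Explained sum of squares / n from regressing y on the stumps for feature k."""
    nodes = split_nodes(k)
    if not nodes:
        return 0.0
    cols = []
    for j in nodes:
        left, right = t.children_left[j], t.children_right[j]
        n_left, n_right = in_node[:, left].sum(), in_node[:, right].sum()
        # Local decision stump: sums to zero over the training sample.
        cols.append(n_right * in_node[:, left] - n_left * in_node[:, right])
    Psi = np.column_stack(cols)
    y_centered = y - y.mean()
    coef, *_ = np.linalg.lstsq(Psi, y_centered, rcond=None)
    fitted = Psi @ coef
    return float(fitted @ fitted) / n


for k in range(p):
    print(k, mdi_from_impurity(k), mdi_from_stumps(k))
```

Under these assumptions the two columns printed for each feature should agree up to floating-point error, because the stumps are mutually orthogonal and orthogonal to the intercept. MDI+ as described in the abstract would go further, replacing the least-squares fit and R-squared metric with a regularized GLM and a metric chosen by the analyst; that step is not shown here.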