View Submission - EcoSta2023
A0573
Title: Variable importance for random forests: Inconsistency and practical solutions for MDA and Shapley effects
Authors: Clement Benard - Safran Tech (France) [presenting]
Abstract: Variable importance measures are the main tools for analyzing the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties, and its exact definition varies across the main random forest software packages. The objective is to analyze the behaviour of the main MDA implementations rigorously by establishing their limits as the sample size increases. These limits break down into three components: the first two terms are related to Sobol indices, which are well-defined measures of a covariate's contribution to the response variance, whereas the value of the third term increases with the dependence among covariates. Thus, it is theoretically demonstrated that the MDA does not target the right quantity when covariates are dependent, a fact that had previously been noticed experimentally. To address this issue, new importance measures for random forests are defined: the Sobol-MDA and SHAFF. The Sobol-MDA fixes the flaws of the original MDA and is appropriate for variable selection. SHAFF, on the other hand, is a fast and accurate estimate of Shapley effects, even when input variables are dependent, and is appropriate for ranking all variables for interpretation purposes. The consistency of the Sobol-MDA and SHAFF is proved, and both are shown empirically to outperform their competitors.
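The effect of covariate dependence on a permutation-based importance can be illustrated with a small simulation. The sketch below is not the method from the talk. It assumes a hypothetical linear Gaussian model Y = a*X1 + b*X2 + noise with correlation rho between X1 and X2, and it uses the true regression function in place of a fitted random forest, so that only the behaviour of the permutation scheme is visible. All function names and parameter values are illustrative assumptions. Under these assumptions, the unnormalized total Sobol index of X1 is a^2 * (1 - rho^2), which shrinks as dependence grows, while the permutation importance does not.

```python
import math
import random


def simulate(rho, n=20000, a=1.0, b=1.0, noise_sd=0.1, seed=0):
    """Draw (X1, X2, Y) from a toy linear Gaussian model (illustrative assumption)."""
    rng = random.Random(seed)
    x1, x2, y = [], [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        u = z1
        v = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # corr(X1, X2) = rho
        x1.append(u)
        x2.append(v)
        y.append(a * u + b * v + rng.gauss(0.0, noise_sd))
    return x1, x2, y


def permutation_importance_x1(rho, n=20000, a=1.0, b=1.0, seed=0):
    """Increase in mean squared error after permuting X1.

    The true regression function m(x) = a*x1 + b*x2 stands in for a fitted
    forest (an assumption made to keep the sketch dependency-free).
    """
    x1, x2, y = simulate(rho, n=n, a=a, b=b, seed=seed)
    base = sum((t - (a * u + b * v)) ** 2 for u, v, t in zip(x1, x2, y)) / n
    x1_perm = x1[:]
    random.Random(seed + 1).shuffle(x1_perm)  # breaks the link between X1 and (X2, Y)
    perm = sum((t - (a * u + b * v)) ** 2 for u, v, t in zip(x1_perm, x2, y)) / n
    return perm - base


def total_sobol_x1_unnormalized(rho, a=1.0):
    """E[Var(m(X) | X2)] for the toy model: a^2 * Var(X1 | X2) = a^2 * (1 - rho^2)."""
    return a ** 2 * (1.0 - rho ** 2)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, permutation_importance_x1(rho), total_sobol_x1_unnormalized(rho))
```

In this toy setting the permutation importance of X1 stays close to 2 * a^2 for every rho, whereas the total Sobol quantity decreases toward zero as rho approaches 1. The gap between the two widens with the strength of the dependence.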