View Submission - CFE-CMStatistics 2025
A0690
Title: Permutation-based multiple testing-controlled variable selection using random forests
Authors: Tim Mueller - Staburo GmbH (Germany) [presenting]
Roman Hornung - University of Munich (Germany)
Silke Szymczak - University of Luebeck (Germany)
Hannes Buchner - Staburo GmbH (Germany)
Abstract: Identifying relevant biomarkers is critical in clinical research and precision medicine, particularly when analyzing high-dimensional data. Random forests (RFs) are promising for such settings due to their flexibility, ease of use, and ability to handle datasets with more variables than samples. RFs assess the importance of each variable in predicting the outcome using variable importance (VIMP) scores. However, because VIMP scores have no known statistical distribution, standard statistical testing and the associated multiple testing adjustment cannot be applied directly for variable selection. The aim is to propose a novel method for multiple testing-controlled variable selection. The approach, similar to permutation testing, involves generating permuted counterparts for each variable and comparing their VIMPs with those of the original variables across iterations to calculate p-values. Unlike competing methods, however, the correlation structure between the covariates is preserved in the permutations to guard against biases. The method is evaluated against three competing RF variable selection approaches in simulations involving high- and low-dimensional data as well as correlated and categorical variables, with promising results. Moreover, it is applied to a real dataset to demonstrate its practical use. The method's results integrate seamlessly into standard VIMP plots, providing a flexible and transparent way to interpret results in a familiar format.
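
The general idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation; the abstract does not give the algorithmic details, so every specific choice here is an assumption: the function names `shadow_permutation_pvalues` and `bonferroni` are hypothetical, scikit-learn's impurity-based `feature_importances_` stands in for the VIMP score, permuted counterparts are built by shuffling whole rows of the covariate matrix (which keeps the correlation among the counterparts intact while breaking their link to the outcome), and a Bonferroni correction stands in for the unspecified multiple testing adjustment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def shadow_permutation_pvalues(X, y, n_iter=20, n_trees=50, seed=0):
    """Empirical p-values from comparing each variable with its permuted copy.

    Assumption: this is an illustrative sketch, not the method from the abstract.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    exceed = np.zeros(p)  # how often the permuted copy is at least as important
    for _ in range(n_iter):
        # Shuffle whole rows: permuted copies keep their mutual correlation
        # structure but lose any association with y.
        X_shadow = X[rng.permutation(n)]
        forest = RandomForestClassifier(
            n_estimators=n_trees,
            random_state=int(rng.integers(2**31 - 1)),
        )
        forest.fit(np.hstack([X, X_shadow]), y)
        vimp = forest.feature_importances_  # impurity-based importance (assumed VIMP)
        exceed += vimp[p:] >= vimp[:p]
    # Add-one correction so that no p-value is exactly zero.
    return (1.0 + exceed) / (1.0 + n_iter)


def bonferroni(pvalues):
    """Bonferroni adjustment, used here only as a simple placeholder."""
    return np.minimum(1.0, pvalues * len(pvalues))


# Toy data: only the first of six variables determines the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))
y = (X[:, 0] > 0).astype(int)

pvals = shadow_permutation_pvalues(X, y)
adjusted = bonferroni(pvals)
print(np.round(pvals, 3))
print(np.round(adjusted, 3))
```

With only 20 iterations the smallest attainable p-value is 1/21, so a realistic analysis would need many more iterations before any adjusted p-value could fall below a conventional threshold.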