A1139
Title: Machine learning methods: Probability of correct model selection using $R^2$ or AIC
Authors: Katherine Thompson - University of Kentucky (United States) [presenting]
Abstract: Although recent attention has focused largely on improving predictive models, less consideration has been given to how often traditional statistical methods select incorrect models. The difficulty of choosing a scientifically correct model is quantified through theoretical and simulation work, and the performance of traditional model selection techniques is compared with that of the feasible solutions algorithm, a recent machine learning method. Specifically, when data sets contain large numbers of explanatory variables, the model with the highest $R^2$ (or adjusted $R^2$) or lowest AIC is often not the scientifically correct model that produced the data, suggesting that traditional model selection techniques that rely on these criteria may be inappropriate. First, the probability of choosing the scientifically correct model when using $R^2$ or AIC is derived as a function of the regression model parameters. Next, simulation results show that these traditional model selection criteria are outperformed by the feasible solutions algorithm, a machine learning method that produces multiple candidate models for researchers' consideration. Lastly, these results are demonstrated through the analysis of a National Health and Nutrition Examination Survey data set.
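
As a rough illustration of the selection problem described in the abstract, the Python sketch below estimates by Monte Carlo how often the model with the highest $R^2$ is the one that generated the data. It is neither the authors' theoretical derivation nor the feasible solutions algorithm. Everything in it is an assumption made for illustration: the function names `r_squared` and `prob_correct_selection`, independent standard normal predictors, unit-variance Gaussian noise, and a search restricted to candidate models with the same number of variables as the true model. Under that last restriction, ranking by Gaussian AIC gives the same ordering as ranking by $R^2$, because every candidate has the same number of parameters.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)


def prob_correct_selection(n, p, beta, n_sims=200, seed=0):
    """Estimate P(highest-R^2 model of size k is the true model).

    Hypothetical setup (not from the abstract): the true model uses the
    first k = len(beta) of p independent standard normal predictors,
    and the noise is standard normal.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    k = len(beta)
    true_subset = tuple(range(k))
    # All candidate models with exactly k explanatory variables.
    candidates = list(combinations(range(p), k))
    hits = 0
    for _ in range(n_sims):
        X = rng.standard_normal((n, p))
        y = X[:, :k] @ beta + rng.standard_normal(n)
        # Pick the candidate subset with the highest R^2.
        best = max(candidates, key=lambda s: r_squared(X[:, list(s)], y))
        hits += int(best == true_subset)
    return hits / n_sims
```

Calling, for example, `prob_correct_selection(n=50, p=20, beta=[0.3, 0.3])` and then repeating the call with a larger `p` gives a feel for how the estimated probability changes as the number of explanatory variables grows. The exact values depend on the assumed effect sizes, sample size, and seed.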