Title: Cross-validation subset selection for regression
Authors: Dennis Kreber - Trier University (Germany) [presenting]
Abstract: A linear regression model is considered for which we assume that many of the observed regressors are irrelevant for the prediction. In order to avoid overfitting, we want to conduct a variable selection and include only the true predictors in the least squares fit. Best subset selection has gained considerable interest in recent years as a means of addressing this objective. In this method, a mixed-integer optimization problem is solved that finds the subset of size at most a given natural number $k$ that is optimal with respect to the in-sample error. In practice, a best subset selection is computed for each $k$, and the ideal $k$ is then chosen via validation. We argue that this notion of best subset selection might be misaligned with the statistical intention: only the sparsity level is chosen via validation, whereas the cardinality-constrained subset itself is selected according to the training error. We address this issue by proposing a discrete optimization formulation that conducts an in-model cross-validation. The proposed program is only allowed to fit coefficients to training data, but it can choose to switch variables on and off in order to minimize the validation error of the cross-validation. Moreover, we conduct a simulation study and provide evidence that the novel mixed-integer formulation yields more accurate predictions than best subset selection and other prominent sparse regression methods such as the Lasso.
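The classical procedure criticized in the abstract can be sketched as follows: for each cardinality $k$, the subset is chosen by the *training* error, and only $k$ itself is chosen by the validation error. The sketch below is a minimal illustration using brute-force enumeration on synthetic data (feasible only for small $p$); it is not the authors' mixed-integer formulation, and all data dimensions, coefficients, and the train/validation split are assumptions made for the example.

```python
import itertools
import numpy as np

def best_subset_path(X_train, y_train, X_val, y_val, max_k):
    """For each cardinality k, pick the subset minimizing the *training*
    error (classical best subset selection), then record that subset's
    validation error; finally choose the k with the smallest validation
    error. Brute-force enumeration stands in for the MIO solver."""
    p = X_train.shape[1]
    results = {}  # k -> (selected subset, validation error)
    for k in range(1, max_k + 1):
        best_sub, best_beta, best_train_err = None, None, np.inf
        for sub in itertools.combinations(range(p), k):
            cols = list(sub)
            beta, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
            train_err = np.mean((y_train - X_train[:, cols] @ beta) ** 2)
            if train_err < best_train_err:
                best_sub, best_beta, best_train_err = sub, beta, train_err
        val_err = np.mean((y_val - X_val[:, list(best_sub)] @ best_beta) ** 2)
        results[k] = (best_sub, val_err)
    best_k = min(results, key=lambda k: results[k][1])
    return best_k, results

# Synthetic sparse regression instance (hypothetical example data):
# only 3 of 8 regressors carry signal.
rng = np.random.default_rng(0)
n, p = 90, 8
true_beta = np.zeros(p)
true_beta[[0, 3, 5]] = [3.0, -2.0, 1.5]
X = rng.normal(size=(n, p))
y = X @ true_beta + 0.1 * rng.normal(size=n)

best_k, results = best_subset_path(X[:60], y[:60], X[60:], y[60:], max_k=p)
selected, val_err = results[best_k]
```

Note that the validation set influences only the choice among $p$ precomputed subsets, one per $k$; the proposed formulation instead lets the optimizer switch variables on and off directly against the cross-validation error, while coefficients remain fitted to training data only.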