CMStatistics 2023
B1354
Title: Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data
Authors: Khurram Nadeem - University of Guelph (Canada) [presenting]
Abstract: A novel variable selection algorithm is developed for regularized ordinary logistic regression (OLR) models under severe class imbalance in high-dimensional datasets with correlated signal and noise covariates. Class imbalance is resolved using response-based subsampling, which is also employed to stabilize variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods include Lasso, adaptive Lasso and ridge regression. The methodology is versatile in that it works effectively for regularization techniques involving both hard shrinkage (e.g. Lasso) and soft shrinkage (e.g. ridge) of the regression coefficients. Selection performance is assessed in a detailed simulation experiment with class-imbalance ratios ranging from moderate to severe and with highly correlated continuous and discrete signal and noise covariates. Simulation results show that the algorithm is robust against severe class imbalance under highly correlated covariates and consistently achieves stable and accurate variable selection with a very low false discovery rate. The methodology is illustrated using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that the framework provides a robust approach to variable selection in severely imbalanced big binary data.
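The general idea of response-based subsampling combined with an ensemble of regularized fits can be sketched as follows. This is a minimal illustration and not the author's algorithm: the abstract does not specify the subsampling scheme, the ranking statistic, or the selection rule, so everything here is an assumption. In particular, the sketch keeps all minority-class rows in each subsample, uses a plain Lasso penalty fitted by proximal gradient descent, ranks covariates by how often their coefficient is non-zero across the ensemble, and applies a hypothetical 0.8 frequency cut-off. The simulated data, function names, and tuning values are likewise invented for illustration.

```python
import math
import random

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero, clip small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso_logistic(X, y, lam=0.05, step=0.5, n_iter=150):
    """L1-penalized logistic regression via proximal gradient descent.
    Assumes roughly standardized covariates; the intercept is not penalized."""
    n, p = len(X), len(X[0])
    beta, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad, grad0 = [0.0] * p, 0.0
        for row, label in zip(X, y):
            eta = intercept + sum(b * x for b, x in zip(beta, row))
            eta = max(-30.0, min(30.0, eta))  # guard against overflow in exp
            resid = 1.0 / (1.0 + math.exp(-eta)) - label
            grad0 += resid
            for j in range(p):
                grad[j] += resid * row[j]
        intercept -= step * grad0 / n
        beta = [soft_threshold(b - step * g / n, step * lam)
                for b, g in zip(beta, grad)]
    return beta

def balanced_subsample(y, rng):
    """Assumed scheme: keep every minority-class row plus an equal-sized
    random draw (without replacement) of majority-class rows."""
    events = [i for i, label in enumerate(y) if label == 1]
    non_events = [i for i, label in enumerate(y) if label == 0]
    minority, majority = sorted((events, non_events), key=len)
    return minority + rng.sample(majority, len(minority))

def selection_frequencies(X, y, n_subsamples=10, lam=0.05, seed=0):
    """Fit one Lasso model per balanced subsample and return, for each
    covariate, the fraction of fits in which its coefficient is non-zero."""
    rng = random.Random(seed)
    counts = [0] * len(X[0])
    for _ in range(n_subsamples):
        idx = balanced_subsample(y, rng)
        beta = fit_lasso_logistic([X[i] for i in idx], [y[i] for i in idx], lam=lam)
        for j, b in enumerate(beta):
            counts[j] += int(b != 0.0)
    return [c / n_subsamples for c in counts]

def simulate(n, p, rng, rho=0.5, intercept=-4.0):
    """Hypothetical data: equicorrelated Gaussian covariates, only the first
    two carry signal, and a negative intercept makes events rare."""
    X, y = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        row = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
               for _ in range(p)]
        eta = intercept + 2.0 * row[0] - 2.0 * row[1]
        y.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0)
        X.append(row)
    return X, y

rng = random.Random(2023)
X, y = simulate(2000, 6, rng)
freq = selection_frequencies(X, y)
selected = [j for j, f in enumerate(freq) if f >= 0.8]  # hypothetical cut-off
print("event rate:", sum(y) / len(y))
print("selection frequencies:", freq)
print("selected covariates:", selected)
```

Ranking by selection frequency only makes sense for penalties that set coefficients exactly to zero. For ridge-type shrinkage, some other ensemble summary of the coefficients would be needed, and the abstract does not say which one the author uses.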