A0772
Title: Subbagging variable selection for massive data
Authors: Xian Li - The Australian National University (Australia) [presenting]
Tao Zou - The Australian National University (Australia)
Xuan Liang - The Australian National University (Australia)
Abstract: Massive datasets usually possess the features of large $N$ (the number of observations) and large $p$ (the number of variables). We propose a subbagging variable selection approach to select relevant variables from massive datasets. Subbagging (subsample aggregating) is an aggregation approach originally from the machine learning literature, and it is well suited to recent trends in massive data analysis and parallel computing. Specifically, we propose a subbagging loss function based on a collection of subsample estimators, which uses a quadratic form to approximate the full-sample loss function. Shrinkage estimation and variable selection can then be conducted based on this subbagging loss function. We theoretically establish the $\sqrt{N}$-consistency and selection consistency of this approach, and prove that the resulting estimator possesses the oracle property. However, its asymptotic variance exhibits inflation relative to that of the full-sample estimator. A modified BIC-type criterion is further developed to tune the hyperparameter of the method. An extensive numerical study illustrates the finite-sample performance and computational efficiency of the approach.
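The quadratic-form subbagging loss described in the abstract can be sketched under simplifying assumptions. The sketch below is not the authors' exact procedure: it assumes a linear model with plain OLS subsample estimators, uses each subsample's curvature matrix $X_b^\top X_b/n$ to build the quadratic loss, and substitutes a generic lasso-style penalty (solved by coordinate descent) for the paper's shrinkage step; all sizes ($N$, $p$, subsample size, number of subsamples, penalty level) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N observations, p variables, B subsamples of size n_sub
N, p, n_sub, B = 10_000, 10, 500, 20
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]          # sparse true coefficients
X = rng.standard_normal((N, p))
y = X @ beta_true + rng.standard_normal(N)

# Subsample estimators: OLS fit and curvature matrix on each subsample.
# The subbagging loss is the sum of quadratic forms
#   sum_b (beta - b_b)' A_b (beta - b_b),
# so only A_sum and A_sum-weighted estimates need to be accumulated.
A_sum = np.zeros((p, p))
Ab_sum = np.zeros(p)
for _ in range(B):
    idx = rng.choice(N, size=n_sub, replace=False)
    Xb, yb = X[idx], y[idx]
    Ab = Xb.T @ Xb / n_sub                          # subsample curvature
    bb = np.linalg.solve(Ab, Xb.T @ yb / n_sub)     # subsample OLS estimate
    A_sum += Ab
    Ab_sum += Ab @ bb

# Unpenalized minimizer of the subbagging loss: a curvature-weighted
# average of the subsample estimators.
beta_subbag = np.linalg.solve(A_sum, Ab_sum)

# Illustrative shrinkage/selection step: lasso penalty on the quadratic
# subbagging loss, solved by coordinate descent (a stand-in for the
# paper's shrinkage method; lam is an arbitrary choice).
def soft(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

lam = 0.1
A = A_sum / B
m = beta_subbag
beta = m.copy()
for _ in range(50):
    for j in range(p):
        # Partial residual excluding coordinate j
        r = A[j] @ (beta - m) - A[j, j] * (beta[j] - m[j])
        beta[j] = soft(m[j] - r / A[j, j], lam / A[j, j])

support = np.flatnonzero(np.abs(beta) > 1e-8)
print("selected variables:", support)
print("subbagging estimate:", np.round(beta_subbag, 2))
```

Because each subsample contributes only its estimate and curvature matrix, the loop parallelizes trivially across machines, which is the computational appeal noted in the abstract; the variance inflation relative to the full-sample estimator comes from aggregating estimators computed on subsamples rather than on all $N$ observations at once.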