B0733
Title: Active learning: Intelligent subsampling
Authors: Jesus Lopez-Fidalgo - University of Navarra (Spain) [presenting]
Alvaro Cia-Mina - University of Navarra (Spain)
Abstract: The sample sizes encountered in Big Data pose statistical and computational challenges for extracting useful information from data sets. Subsampling is widely used to reduce the data volume so that estimators in regression models can be computed. Usually, subsampling is performed by defining a weight for each point and selecting a subset according to these weights. The subsample can be chosen at random (Passive Learning), but to obtain better estimators, optimal experimental design theory can be used to search for an influential subsample (Active Learning). This approach has been developed in the literature for linear and logistic regression, yielding algorithms based on D-optimality and A-optimality. To the authors' knowledge, the distribution of the explanatory variables has never been considered for obtaining a subsample. We study the effect of the distribution of the explanatory variables on the estimation as well as on the optimal design. We first assume normality of the covariates and then measure the impact of skewness and kurtosis on the estimation and the optimal designs. Finally, we propose a novel method to obtain an optimal subsample through D-optimality, taking into account the marginal distribution of the covariates. The D-optimal design is computed by an exchange algorithm to obtain the subsample.
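As an illustration of the general idea of D-optimal subsampling by exchange, the following is a minimal sketch for a linear regression model. It is not the authors' method: it ignores the marginal distribution of the covariates and uses a plain greedy point exchange that maximizes log det(X_S^T X_S) over subsets S of a fixed size. The function names `log_det_information` and `d_optimal_exchange`, and the parameters `k`, `max_sweeps`, and `seed`, are hypothetical and chosen only for this example.

```python
import numpy as np


def log_det_information(X, idx):
    """Log-determinant of the information matrix X_S^T X_S for the rows in idx."""
    Xs = X[idx]
    sign, logdet = np.linalg.slogdet(Xs.T @ Xs)
    # A non-positive determinant means a singular (uninformative) subsample.
    return logdet if sign > 0 else -np.inf


def d_optimal_exchange(X, k, max_sweeps=20, seed=0):
    """Greedy point-exchange search for a size-k subsample (illustrative only).

    Starts from a random subsample (passive learning) and swaps one selected
    row for one unselected row whenever the swap increases the criterion.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = list(rng.choice(n, size=k, replace=False))
    best = log_det_information(X, selected)

    for _ in range(max_sweeps):
        improved = False
        for pos in range(k):
            in_set = set(selected)
            for j in range(n):
                if j in in_set:
                    continue
                trial = selected.copy()
                trial[pos] = j  # exchange the row at position pos for row j
                value = log_det_information(X, trial)
                if value > best + 1e-12:
                    selected, best = trial, value
                    in_set = set(selected)
                    improved = True
        if not improved:
            break  # no single exchange improves the criterion

    return np.array(sorted(selected)), best


# Usage on simulated data: intercept plus two normally distributed covariates.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
subsample, criterion = d_optimal_exchange(X, k=10)
```

Each trial recomputes the determinant from scratch, which keeps the sketch short but is too slow for genuinely large data sets; a practical implementation would presumably use rank-one determinant updates or restrict the candidate set.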