B0494
Title: Unsupervised and supervised exchange-methods for subset selection from big datasets
Authors: Chiara Tommasi - University of Milan (Italy) [presenting]
Abstract: In the era of big data, several sampling approaches have been proposed to reduce costs and time and to support informed decision-making. In particular, the theory of optimal design has been applied to select a subsample that contains the most information for the inferential goal. Unfortunately, big datasets are usually the result of passive observation, and thus they may include high-leverage covariate values or outliers in the response variable (denoted by Y). The most common selection criterion is D-optimality, but in the presence of high-leverage values, all of them would be wrongly selected, because the D-optimal design tends to lie on the boundary of the design region. An exchange procedure is proposed that selects a nearly D-optimal subset while avoiding the inclusion of high-leverage values. Avoiding high-leverage points, however, does not guard against all outliers in Y. Therefore, a second method is described, which exploits the information in the responses to avoid selecting abnormal Y-values. The former proposal is an unsupervised procedure, as it does not use the response observations, while the latter is a supervised exchange method. In addition, both exchange algorithms are extended to the I-criterion, which aims at providing accurate predictions over a set of covariate values.
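To make the idea of an unsupervised exchange concrete, the following is a minimal sketch and not the procedure presented in the talk, whose details the abstract does not give. It assumes a simple linear regression model y = b0 + b1*x, so that the D-criterion is the determinant of a 2x2 information matrix, and it assumes that "avoiding high-leverage values" can be approximated by excluding points whose full-data leverage exceeds a cap. The function names, the `leverage_cap` parameter, and the toy data are all hypothetical.

```python
import random


def d_criterion(xs):
    """det(X'X) for the assumed model y = b0 + b1*x, where rows of X are (1, x)."""
    k = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return k * sum_x2 - sum_x * sum_x


def leverages(xs):
    """Full-data leverage h_i = 1/n + (x_i - mean)^2 / S_xx (simple regression)."""
    n = len(xs)
    mean = sum(xs) / n
    s_xx = sum((x - mean) ** 2 for x in xs)
    return [1.0 / n + (x - mean) ** 2 / s_xx for x in xs]


def exchange_subset(xs, k, leverage_cap, max_sweeps=50, seed=0):
    """Greedy exchange: swap points in and out while det(X'X) increases.

    Only points with leverage <= leverage_cap are eligible (an assumed rule).
    The responses are never used, so the search is unsupervised.
    """
    rng = random.Random(seed)
    h = leverages(xs)
    candidates = [i for i, h_i in enumerate(h) if h_i <= leverage_cap]
    if len(candidates) < k:
        raise ValueError("fewer eligible points than the requested subset size")

    selected = rng.sample(candidates, k)  # random starting subset
    best = d_criterion([xs[i] for i in selected])

    for _ in range(max_sweeps):
        improved = False
        for pos in range(k):
            chosen = set(selected)
            for cand in candidates:
                if cand in chosen:
                    continue
                # Try replacing the point at position `pos` with `cand`.
                trial = selected[:pos] + [cand] + selected[pos + 1:]
                value = d_criterion([xs[i] for i in trial])
                if value > best * (1 + 1e-12):
                    chosen.discard(selected[pos])
                    chosen.add(cand)
                    selected, best = trial, value
                    improved = True
        if not improved:  # no single swap helps: local optimum reached
            break
    return sorted(selected), best


# Toy data (hypothetical): 200 ordinary covariate values plus 3 extreme ones.
data_rng = random.Random(1)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(200)] + [25.0, -30.0, 40.0]
subset, value = exchange_subset(xs, k=10, leverage_cap=0.05)
```

On this toy data the three extreme values are never eligible, so the selected points sit near the edges of the remaining covariate range rather than on the high-leverage observations. A supervised variant would additionally inspect the responses before accepting a swap; the abstract does not say how, so that step is not sketched here.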