A0727
Title: Balanced subsampling for big data with categorical predictors
Authors: Lin Wang - Purdue University (United States) [presenting]
Abstract: The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, which extracts useful information from a dataset, is a crucial step in big data analysis. We will introduce a balanced subsampling approach for big data with categorical predictors. The merits of the proposed approach are two-fold: (i) it is fast and easy to implement; (ii) the selected subsample allows robust effect estimation and prediction. Theoretical and extensive numerical results show that the proposed approach is superior to simple random subsampling. The advantages of the balanced subsampling approach are also illustrated through the analysis of real-life examples.
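The abstract does not spell out the selection rule, so the following is only an illustrative sketch and not the author's algorithm. It assumes that "balanced" means the levels of each categorical predictor appear about equally often in the subsample, and it uses a simple greedy rule to get there. The function name `balanced_subsample` and the toy data are hypothetical, and the quadratic-time loop is written for readability rather than for the speed the abstract claims.

```python
from collections import Counter


def balanced_subsample(rows, n):
    """Greedily pick n row indices so that, for every predictor, the
    levels are represented as equally as this simple rule allows.

    rows: list of tuples, one categorical level per predictor.
    Assumption (not from the abstract): balance = equal marginal level counts.
    """
    p = len(rows[0])
    counts = [Counter() for _ in range(p)]  # level counts per predictor so far
    remaining = set(range(len(rows)))
    chosen = []
    for _ in range(min(n, len(rows))):
        # Score a row by how often its levels are already in the subsample;
        # the lowest score wins, ties are broken by the smaller index.
        best = min(
            remaining,
            key=lambda i: (sum(counts[j][rows[i][j]] for j in range(p)), i),
        )
        chosen.append(best)
        remaining.remove(best)
        for j in range(p):
            counts[j][rows[best][j]] += 1
    return chosen


# Toy data with one heavily skewed predictor: 90 rows at level "a", 10 at "b".
rows = [("a",)] * 90 + [("b",)] * 10
idx = balanced_subsample(rows, 10)
print(Counter(rows[i][0] for i in idx))
```

A simple random subsample of the same size from this toy data would contain about one "b" row on average, which is the kind of imbalance that makes the effect of a rare level hard to estimate.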