Title: Subdata selection methods
Authors: John Stufken - University of North Carolina at Greensboro (United States) [presenting]
Abstract: The size of big data can cause challenges for even the simplest explorations of the data. Such challenges can, for example, be related to storage of the data or to the computation of even the simplest statistics. One way to deal with these challenges is to select a much smaller subdata set from the original full data set. Exploration or analysis then proceeds with the subdata. Such a subdata set can be selected through a sampling strategy or through a deterministic method that attempts to optimize a specified criterion. Whatever method of subdata selection is used, it is important that it is computationally feasible and efficient. It is also important that inferences or predictions based on the subdata are comparable to those that would have been obtained by using the full data. Ideally, this holds with as few assumptions as possible about the full data. After a brief discussion of different subdata selection methods, we will focus on a comparison of the methods, their strengths and weaknesses, and possible extensions.