CMStatistics 2023
B0863
Title: Active labeling for high-dimensional ridge regression with application in genome-wide association studies
Authors: Lin Wang - Purdue University (United States) [presenting]
Abstract: Despite the availability of extensive data sets, it is often impractical to collect labels for all data points in many applications due to various measurement constraints. Subsampling approaches can be employed to select a subset of design points from a large pool, resulting in substantial savings in experimental costs. However, existing subsampling methods are primarily designed for low-dimensional data or rely on the assumption of sparse significant covariates. A computationally tractable sampling method is proposed that enables the selection of a small subset from a large data set without assuming sparsity. The method acknowledges the possibility that the number of significant covariates can be as large as or even larger than the sample size of the full data set. Specifically, the focus lies on ridge regression, for which sampling probabilities are developed that minimize the mean squared prediction error on the full data set. The efficacy of the proposed approach is substantiated through theoretical analysis and extensive simulations. The results demonstrate its superiority over existing subsampling methods when dealing with high-dimensional data containing numerous significant covariates. Additionally, the advantages of the new approach are illustrated through its application to genome-wide association studies, highlighting its potential to yield valuable insights in this domain.
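
To make the setting concrete, the following is a minimal simulation sketch, not the author's method. It draws a labeled subset according to a vector of sampling probabilities, fits ridge regression on that subset, and measures the mean squared prediction error over the full pool. The abstract does not state the proposed sampling probabilities, so uniform probabilities stand in as a placeholder. The function names, the penalty `lam`, and the simulated dense coefficient vector are all illustrative assumptions.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y, computed with a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def subsample_ridge(X, y, probs, k, lam, rng):
    """Label k pool points drawn without replacement using probs, then fit ridge."""
    idx = rng.choice(X.shape[0], size=k, replace=False, p=probs)
    return idx, ridge_fit(X[idx], y[idx], lam)


def full_data_mspe(X, beta_hat, beta_true):
    """Mean squared prediction error over every row of the pool.

    Only computable in simulation, where beta_true is known.
    """
    return float(np.mean((X @ (beta_hat - beta_true)) ** 2))


rng = np.random.default_rng(0)
n, p, k = 200, 300, 50                 # more covariates than pool points
lam = 10.0                             # hypothetical ridge penalty
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)   # dense signal, no sparsity
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Placeholder probabilities: uniform, NOT the probabilities proposed in the talk.
probs = np.full(n, 1.0 / n)
idx, beta_hat = subsample_ridge(X, y, probs, k, lam, rng)
mspe = full_data_mspe(X, beta_hat, beta_true)
print(f"labeled {k} of {n} points, full-data MSPE = {mspe:.3f}")
```

Replacing `probs` with any other probability vector of length `n` allows a direct comparison of sampling schemes under the same error criterion.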