CFE-CMStatistics 2024
A0243
Title: Semi-supervised sparse Gaussian classification: Provable benefits of unlabeled data
Authors: Boaz Nadler - Weizmann Institute of Science (Israel) [presenting]
Eyar Azar - Weizmann Institute of Science (Israel)
Abstract: The premise of semi-supervised learning (SSL) is that combining labeled and unlabeled data yields significantly more accurate models. Despite empirical successes, the theoretical understanding of SSL is still far from complete. SSL is studied here for high-dimensional sparse Gaussian classification. A key task in constructing an accurate classifier is feature selection, that is, detecting the few variables that separate the two classes. For this SSL setting, information-theoretic lower bounds for accurate feature selection are analyzed, as well as computational lower bounds that assume the low-degree likelihood hardness conjecture. The key contribution is the identification of a regime in the problem parameters (dimension, sparsity, number of labeled and unlabeled samples) where SSL is guaranteed to be advantageous for classification. Specifically, in this regime an accurate SSL classifier can be constructed in polynomial time, whereas computationally efficient supervised or unsupervised learning schemes that use only the labeled or only the unlabeled data would fail. These results highlight the provable benefits of combining labeled and unlabeled data for classification and feature selection in high dimensions. Simulations are presented that complement the theoretical analysis.
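To make the setting concrete, the following is a minimal Python sketch of feature selection in a sparse two-class Gaussian model. Everything in it is an assumption made for illustration and is not taken from the abstract: the model x = y * mu + noise with identity covariance, the function names, the parameter values, and the "screen with labels, then take the leading eigenvector on unlabeled data" scheme. It is one simple way to combine the two data sources, not necessarily the authors' algorithm.

```python
import numpy as np

def make_sparse_mean(p, k, strength):
    """Hypothetical k-sparse mean vector: first k coordinates carry the signal."""
    mu = np.zeros(p)
    mu[:k] = strength
    return mu

def sample(mu, n, rng):
    """Draw n samples x = y * mu + standard Gaussian noise, with labels y in {-1, +1}."""
    y = rng.choice([-1, 1], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, mu.size))
    return x, y

def select_supervised(x_lab, y_lab, k):
    """Pick the k features with the largest absolute label-weighted mean."""
    score = np.abs((y_lab[:, None] * x_lab).mean(axis=0))
    return np.sort(np.argsort(score)[-k:])

def select_semi_supervised(x_lab, y_lab, x_unl, k, n_candidates):
    """Illustrative two-stage scheme (an assumption, not the paper's method):
    1) screen n_candidates features using the labeled data,
    2) rank them by the leading eigenvector of the unlabeled second-moment matrix.
    """
    score = np.abs((y_lab[:, None] * x_lab).mean(axis=0))
    candidates = np.argsort(score)[-n_candidates:]
    z = x_unl[:, candidates]
    second_moment = z.T @ z / z.shape[0]      # approx. I + mu mu^T on the candidates
    _, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues in ascending order
    leading = np.abs(eigvecs[:, -1])
    return np.sort(candidates[np.argsort(leading)[-k:]])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = make_sparse_mean(p=200, k=5, strength=2.0)
    x_lab, y_lab = sample(mu, 200, rng)
    x_unl, _ = sample(mu, 2000, rng)          # labels discarded for the unlabeled set
    print(select_supervised(x_lab, y_lab, k=5))
    print(select_semi_supervised(x_lab, y_lab, x_unl, k=5, n_candidates=20))
```

The parameter values above give a strong signal, so both selectors are expected to find the support. The sketch shows the mechanics only; it does not reproduce the regime described in the abstract, where methods restricted to one data source fail.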