A1279
Title: A statistical framework for learning from mixed supervision: labeled and unlabeled data integration
Authors: Jinung Choi - Yonsei University (Korea, South)
Ilmun Kim - Yonsei University (Korea, South)
Jongho Im - Yonsei University (Korea, South) [presenting]
Abstract: A unified statistical framework is developed for integrating multiple data sources under supervised and semi-supervised settings. We consider a general scenario in which one dataset contains fully observed covariate-outcome pairs $(x, y)$, while the other contains only covariates $x$. The goal is to estimate a parameter defined by a moment condition with respect to a target distribution. Depending on whether the target data are labeled or unlabeled, we propose different integration strategies that accommodate potential distributional differences between the source and target populations. Specifically, we develop a family of weighted estimating equations that incorporate density ratio or odds ratio corrections to address distributional shift and outcome missingness. In the unlabeled case, the identification problem is addressed through an odds ratio formulation and fractional imputation. We also derive the optimal weighting parameter that minimizes the mean squared error of the proposed estimator. Simulation studies demonstrate the efficiency gains and robustness of the proposed approach relative to existing methods under various degrees of label availability and distributional mismatch.
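
Illustration (not part of the abstract's results): the idea of a density-ratio-weighted estimating equation can be sketched in a toy setting. The example below is a minimal sketch under strong assumptions, not the authors' estimator: the target parameter is the target-population mean of $y$, defined by the moment condition $E_T[y - \theta] = 0$; the labeled source sample has $x \sim N(0, 1)$ and the unlabeled target sample has $x \sim N(1, 1)$; the conditional distribution of $y$ given $x$ is the same in both populations; and the density ratio $w(x) = p_T(x)/p_S(x) = \exp(x - 0.5)$ is treated as known rather than estimated. The function name `weighted_mean_estimate` and all simulation settings are hypothetical.

```python
import numpy as np


def weighted_mean_estimate(y, w):
    """Solve sum_i w_i * (y_i - theta) = 0 for theta (normalized weights)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * y) / np.sum(w))


# Toy simulation (all settings are assumptions made for illustration).
rng = np.random.default_rng(0)
n_source = 20_000

# Labeled source sample: x ~ N(0, 1), y = 1 + 2x + noise.
x_source = rng.normal(loc=0.0, scale=1.0, size=n_source)
y_source = 1.0 + 2.0 * x_source + rng.normal(size=n_source)

# Known density ratio of N(1, 1) to N(0, 1): exp(x - 0.5).
weights = np.exp(x_source - 0.5)

# The unweighted source mean targets E_S[y] = 1, not the target mean.
theta_naive = float(np.mean(y_source))

# The weighted estimating equation targets E_T[y] = 1 + 2 * 1 = 3.
theta_weighted = weighted_mean_estimate(y_source, weights)

print(f"naive source mean:      {theta_naive:.3f}")
print(f"density-ratio weighted: {theta_weighted:.3f}")
```

In this sketch the unweighted source mean is centered near 1, whereas the weighted solution is centered near the target mean of 3. In practice the density ratio would have to be estimated from the source and target covariates, and the odds ratio formulation, fractional imputation, and optimal weighting parameter described in the abstract are not covered here.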