A1530
Title: Tackling the efficiency paradox for data fusion with many external studies
Authors: Jingyue Huang - University of Pennsylvania (United States) [presenting]
Abstract: The problem of integrating individual-level data from an internal study with many summary statistics derived from separate external studies is considered. The research uncovers a paradox: using multiple external summary statistics can worsen the finite-sample performance of data fusion methods, even when the statistics are unbiased and have low variability. By introducing a linear regression representation for data-fused estimators, this paradox is characterized as an inherent trade-off: integrating fewer external studies yields smaller estimation variance but a larger discrepancy relative to the semiparametric efficiency bound. A lasso-type regularization method is further proposed to balance the trade-off. The theoretical analysis shows that the semiparametric efficiency bound remains achievable if the number of informative external studies does not grow too quickly. The applicability of the method to federated transfer learning with structural missingness, which may be of independent interest, is also demonstrated. The effectiveness of the proposed method is evidenced by simulations and a real-world study.