View Submission - CFE-CMStatistics 2025
A0877
Title: Variable selection in compositional data analysis
Authors: Jing Ma - Fred Hutchinson Cancer Center (United States) [presenting]
Kristyn Pantoja - Texas A&M University (United States)
David Jones - Texas A&M University (United States)
Abstract: Compositional data, where only relative abundances are available, are common in microbiome and other high-throughput sequencing studies. Log ratios between groups of variables serve as key biomarkers in these settings. However, selecting predictive log ratios is a combinatorial problem, and existing greedy search-based methods are computationally expensive, which limits their applicability to high-dimensional data. The supervised log ratio (SLR) method is introduced as a novel and efficient approach for selecting predictive log ratios in high-dimensional settings. SLR first screens active variables using univariate regression on log-ratio-transformed data and then applies principal balance analysis to the screened variables to define balance biomarkers. The approach leverages both the relationship between the response and the predictors and the correlations among the predictors to improve accuracy in variable selection and prediction. Through simulations and two case studies, one on inflammatory bowel disease (IBD) and another on colorectal cancer (CRC), it is demonstrated that SLR outperforms existing methods, particularly in high-dimensional settings.
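The two-stage idea described in the abstract (screen variables, then form a balance) can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several assumptions: the log ratio transform is taken to be the centered log ratio (CLR); screening uses the absolute Pearson correlation, which ranks variables the same way as the t-statistic of a univariate linear regression; and the principal balance step is replaced by a much simpler stand-in that splits the active variables by the sign of their correlation with the response. The function names (`clr`, `screen`, `balance`, `slr_sketch`), the threshold value, and the toy data are all hypothetical.

```python
import math


def clr(row):
    """Centered log ratio transform of one composition (all parts must be > 0)."""
    logs = [math.log(v) for v in row]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen(X, y, threshold):
    """Stage 1 (assumed form): keep parts whose CLR values correlate with y."""
    Z = [clr(row) for row in X]
    scores = [pearson([z[j] for z in Z], y) for j in range(len(X[0]))]
    active = [j for j, s in enumerate(scores) if abs(s) >= threshold]
    return active, scores


def balance(row, num, den):
    """Balance: scaled log ratio of the geometric means of two groups of parts."""
    r, s = len(num), len(den)
    mean_num = sum(math.log(row[j]) for j in num) / r
    mean_den = sum(math.log(row[j]) for j in den) / s
    return math.sqrt(r * s / (r + s)) * (mean_num - mean_den)


def slr_sketch(X, y, threshold):
    """Screen, then split active parts by correlation sign.

    The sign split is a simplification standing in for principal balance
    analysis; it is not the procedure used in the SLR method itself.
    """
    active, scores = screen(X, y, threshold)
    num = [j for j in active if scores[j] > 0]
    den = [j for j in active if scores[j] < 0]
    if not num or not den:
        raise ValueError("need at least one active part on each side")
    return num, den, [balance(row, num, den) for row in X]


# Toy data (hypothetical): parts 0 and 1 carry the signal, parts 2 and 3 are noise.
t = [-1.0, 1.0, -1.0, 1.0]
c = [-1.0, -1.0, 1.0, 1.0]
raw = [[math.exp(a), math.exp(-a), math.exp(b), math.exp(-b)] for a, b in zip(t, c)]
X = [[v / sum(row) for v in row] for row in raw]  # close each row to sum to 1
y = t

num, den, balances = slr_sketch(X, y, threshold=0.5)
print("numerator parts:", num, "denominator parts:", den)
```

In this toy setting the resulting balance could then be used as a single predictor in an ordinary regression on the response. Because a balance is a log ratio, it is unchanged when a row is rescaled, which is why it suits data where only relative abundances are observed.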