EcoSta 2024
A0825
Title: Optimizing sample size for statistical learning with bulk transcriptomic sequencing: A learning curve approach
Authors: Yunhui Qi - Iowa State University (United States)
Xinyi Wang - University of California at Davis (United States)
Li-Xuan Qin - Memorial Sloan Kettering Cancer Center (United States) [presenting]
Abstract: Accurate sample classification using transcriptomics data is crucial for personalized medicine. The success of such endeavors depends on determining a suitable sample size that ensures adequate statistical power without unnecessary resource allocation or ethical concerns. Current sample size calculation methods for sample classification rely on assumptions and algorithms that may not align with modern machine learning and deep learning techniques. This methodological gap is addressed by developing computational approaches to determine the number of samples required for accurate classification in transcriptomics studies using statistical learning. The approach establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. Its performance is evaluated for both microRNA and RNA sequencing using data from The Cancer Genome Atlas, considering various data characteristics (such as sample size, marker filtering, and sequencing depth normalization) and algorithm configurations (including model selection, hyperparameter tuning, and offline augmentation), based on a range of evaluation metrics. Python and R code implementing the proposed approach is freely available on GitHub. The adoption of statistical learning in biomedical transcriptomics studies is expected to advance and accelerate their translation into clinically useful classifiers.
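The learning-curve step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: the inverse power law form score(n) = a - b * n^(-c), the function names, and the numbers below are all assumptions chosen for illustration, and the input points are synthetic rather than results from The Cancer Genome Atlas. The sketch fits the curve to classifier scores observed at several sample sizes and then inverts it to estimate the sample size needed to reach a target score.

```python
import math


def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Fit score(n) ~ a - b * n**(-c) by least squares.

    ASSUMPTION: the inverse power law is a common learning-curve form,
    used here for illustration only. For a fixed exponent c the model is
    linear in (a, b), so we scan a grid of c values and solve the
    two-parameter linear fit in closed form for each.
    """
    if c_grid is None:
        c_grid = [k / 100 for k in range(5, 201)]  # c from 0.05 to 2.00
    m = len(sizes)
    y_mean = sum(scores) / m
    best = None
    for c in c_grid:
        x = [n ** (-c) for n in sizes]
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, scores))
        slope = sxy / sxx            # slope of score against n**(-c), equals -b
        a = y_mean - slope * x_mean  # asymptotic score as n grows
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def required_sample_size(a, b, c, target):
    """Smallest integer n with a - b * n**(-c) >= target.

    Returns None when the target is at or above the fitted asymptote a.
    """
    if target >= a:
        return None
    if b <= 0:
        return 1  # fitted curve never falls below a, so any n suffices
    return max(1, math.ceil((b / (a - target)) ** (1.0 / c)))


# Synthetic, noise-free illustration (made-up values, not study results):
# scores generated from a known curve with a=0.95, b=1.2, c=0.5.
sizes = [20, 40, 80, 160, 320]
scores = [0.95 - 1.2 * n ** -0.5 for n in sizes]

a, b, c = fit_inverse_power_law(sizes, scores)
n_needed = required_sample_size(a, b, c, target=0.88)
print(f"a={a:.3f}, b={b:.3f}, c={c:.2f}, required n={n_needed}")
```

In practice the points fed to the fit would come from repeated train/test evaluations at each sample size, possibly on augmented data as the abstract describes; a grid search over the exponent is used here only to keep the sketch free of third-party dependencies.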