View Submission - CFE-CMStatistics 2025
A0383
Title: Genomic prediction: A robustness comparison of machine learning approaches
Authors: Vanda Lourenco - NOVA University of Lisbon and NOVA.id.FCT (Portugal) [presenting]
Joseph O. Ogutu - Bioinformatics Unit - Institute of Crop Science - University of Hohenheim (Germany)
Piepho Hans-Peter - University of Hohenheim (Germany)
Abstract: Accurate estimation of genomic breeding values underpins effective genomic selection in plants and animals. Genomic prediction leverages dense SNP markers and demands models capable of handling extreme dimensionality, which makes machine-learning (ML) algorithms natural candidates. While many studies benchmark individual ML algorithms, comparisons across algorithmic families remain scarce, and even fewer explore how data contamination affects performance. Yet breeders routinely confront noisy phenotypes, making robustness as important as raw accuracy. This gap is filled by comparing three supervised ML families (regularized regression, ensemble learners, and instance-based methods) under pristine and contaminated conditions. Using a simulated animal-breeding population, escalating proportions of contaminated phenotypes are imposed, then predictive accuracy (PA) and mean-squared error are quantified. PA declines and MAPE rises with greater contamination and mean shifts; radial outliers impair predictions more than point-mass outliers. Random forest, GLasso, SGB, aENET, and SVM run markedly slower than the other methods. The results illuminate trade-offs between speed and robustness and reveal circumstances where penalized regressions outperform more complex alternatives. Guidance is provided for breeders selecting algorithms when data quality is uncertain, emphasizing the need to match model choice to anticipated contamination rather than relying solely on headline accuracy in the idealized datasets used for benchmarking.
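The evaluation loop described in the abstract (contaminate a growing share of phenotypes, refit, then score PA and mean-squared error) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' simulation design: the toy genotype generator, the sample sizes, the heritability, the mean-shift size, the ridge penalty `lam`, and the helper names `simulate`, `contaminate`, `ridge_fit_predict`, and `evaluate` are all hypothetical. Only mean-shift contamination is shown, not the radial or point-mass schemes, and a plain ridge regression stands in for the regularized-regression family.

```python
import numpy as np

def simulate(n=300, p=500, n_qtl=20, h2=0.5, seed=0):
    """Toy data (assumed design): SNPs coded 0/1/2, additive breeding values, noisy phenotypes."""
    rng = np.random.default_rng(seed)
    X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
    beta = np.zeros(p)
    qtl = rng.choice(p, size=n_qtl, replace=False)
    beta[qtl] = rng.normal(0.0, 1.0, size=n_qtl)
    g = X @ beta                                   # true breeding values
    noise_sd = np.sqrt(g.var() * (1.0 - h2) / h2)  # noise scaled to the target heritability
    y = g + rng.normal(0.0, noise_sd, size=n)
    return X, g, y

def contaminate(y, prop, shift_sd, seed=1):
    """Shift a random proportion `prop` of phenotypes upward by `shift_sd` standard deviations."""
    rng = np.random.default_rng(seed)
    y_c = y.copy()
    k = int(round(prop * len(y)))
    idx = rng.choice(len(y), size=k, replace=False)
    y_c[idx] += shift_sd * y.std()
    return y_c

def ridge_fit_predict(X_tr, y_tr, X_te, lam):
    """Ridge regression in dual form (an n x n solve), convenient when p > n."""
    mu_x, mu_y = X_tr.mean(axis=0), y_tr.mean()
    Xc = X_tr - mu_x
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(len(y_tr)), y_tr - mu_y)
    return mu_y + (X_te - mu_x) @ (Xc.T @ alpha)

def evaluate(prop, shift_sd=5.0, lam=100.0, n_tr=200):
    """Contaminate training phenotypes only, then score predictions against true breeding values."""
    X, g, y = simulate()
    y_tr = contaminate(y[:n_tr], prop, shift_sd)
    pred = ridge_fit_predict(X[:n_tr], y_tr, X[n_tr:], lam)
    pa = np.corrcoef(pred, g[n_tr:])[0, 1]         # PA as a Pearson correlation
    mse = np.mean((pred - g[n_tr:]) ** 2)
    return pa, mse

if __name__ == "__main__":
    for prop in (0.0, 0.05, 0.10, 0.20):
        pa, mse = evaluate(prop)
        print(f"contamination={prop:.2f}  PA={pa:.3f}  MSE={mse:.3f}")
```

Here PA is taken to be the Pearson correlation between predictions and simulated true breeding values, which is one common convention; the abstract does not state which definition is used. Swapping `ridge_fit_predict` for another learner would allow the same loop to compare algorithm families under identical contamination.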