A0578
Title: On the robustness of random forests for genomic prediction and selection in breeding studies
Authors: Vanda Lourenco - NOVA University of Lisbon and NOVA.id.FCT (Portugal) [presenting]
Miguel Braga - NOVA University of Lisbon (Portugal)
Joao Lita da Silva - NOVA University of Lisbon (Portugal)
Abstract: Real data analysis is challenged by violations of the underlying model assumptions, often caused by errors or outliers in the data. In linear regression, outliers can break the normality assumption, compromising parameter estimation and the inference that follows. Machine learning methods such as Random Forests (RF) are effective, yet their susceptibility to data contamination remains a concern. The literature acknowledges the need for robust statistical techniques to address these issues, particularly in high-dimensional data analysis involving variable selection and prediction. Making statistical methodologies more resilient is crucial for handling complex data scenarios and ensuring reliable results. While data contamination can occur at both the response and covariate levels, this project focuses primarily on contamination of the response. The performance of the classical RF method is assessed via simulation, and robust techniques are plugged in to enhance its resilience against data contamination. Specifically, a synthetic animal dataset from the literature is employed, into which various plausible contamination scenarios are introduced. The aim is to shed light on the implications of data contamination in genomic prediction and selection for breeding studies, offering insights into possible robust adaptations of RF that help mitigate the challenges posed by certain types of contamination.
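Illustrative sketch (not the authors' simulation design): a regression forest predicts by averaging training responses, so contaminated responses feed directly into its predictions. The toy example below uses only the Python standard library to show how shifting a fraction of responses moves a mean-based summary far more than a median-based one. The function name contaminate_responses, the 10% contamination rate, the shift of 15 and the simulated phenotypes are all hypothetical choices made for illustration; swapping the mean for the median is only one possible robust adaptation and is an assumption here, not a technique the abstract names.

```python
import random
import statistics


def contaminate_responses(y, fraction, shift, rng):
    """Return a copy of y with a random `fraction` of entries shifted by `shift`.

    Hypothetical helper: a simple shift-outlier scheme at the response level.
    """
    y_cont = list(y)
    n_out = int(round(fraction * len(y)))
    for i in rng.sample(range(len(y)), n_out):
        y_cont[i] += shift
    return y_cont


rng = random.Random(0)

# Hypothetical clean phenotypes: centre 10, unit Gaussian noise.
y_clean = [10.0 + rng.gauss(0.0, 1.0) for _ in range(200)]

# Assumed scenario: 10% of the responses are shifted upwards by 15 units.
y_cont = contaminate_responses(y_clean, fraction=0.10, shift=15.0, rng=rng)

# How far each location summary moves under contamination.
mean_shift = abs(statistics.mean(y_cont) - statistics.mean(y_clean))
median_shift = abs(statistics.median(y_cont) - statistics.median(y_clean))

print(f"shift in mean:   {mean_shift:.3f}")
print(f"shift in median: {median_shift:.3f}")
```

The mean moves by the contamination fraction times the shift (0.10 x 15 = 1.5), while the median moves much less. A fuller study would apply the same kind of contamination to the training responses of an actual forest and compare prediction accuracy and variable rankings.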