Title: Prediction of disease risk by high-dimensional genetic and environmental data
Authors: Norbert Krautenbacher - Technical University of Munich and Helmholtz Center Munich (Germany) [presenting]
Christiane Fuchs - Helmholtz Center Munich (Germany)
Fabian Theis - Institute of Computational Biology Helmholtz Center Munich (Germany)
Abstract: The aim is to investigate the setting of high-dimensional genetic and environmental data on individuals, where the goal is to build a prediction model for the risk of asthma. The study was also interested in the influence of a specific exposure variable, farm environment, so the sample needed to contain an adequate number of observations for each farm/asthma combination. Since both farm exposure and asthma are rare in the population, a simple random sample would have to be very large. In practice, such a large sample is not feasible, because collecting genomic data comprising hundreds of thousands to millions of single-nucleotide polymorphisms (SNPs) is cost-intensive. A stratified random sample was therefore taken from the population. Analyzing the resulting sample raises two main issues. First, one has to correct for the sample selection bias that arises when learning and evaluating on biased training and test data sets. Second, the genetic data, comprising 2.5 million SNPs, have to be incorporated as features, which calls for dimension reduction and feature selection techniques that in turn require special solutions.
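The selection bias introduced by stratified sampling can be illustrated with a minimal sketch. It assumes inverse-probability weighting, a common correction that the abstract does not name as the authors' method. The strata, population shares, and sample counts below are hypothetical values invented for illustration, and the helper names `stratum_weights` and `weighted_prevalence` are likewise assumptions.

```python
# Hypothetical strata defined by (farm exposure, asthma status).
# All numbers are invented for illustration; none come from the study.
population_share = {
    ("farm", "asthma"): 0.01,
    ("farm", "healthy"): 0.09,
    ("no_farm", "asthma"): 0.09,
    ("no_farm", "healthy"): 0.81,
}

# A stratified sample that draws the same number of individuals per stratum,
# so the rare farm/asthma combination is heavily over-represented.
sample_count = {stratum: 250 for stratum in population_share}


def stratum_weights(population_share, sample_count):
    """Weight per stratum = population share / sample share."""
    n = sum(sample_count.values())
    return {s: population_share[s] / (sample_count[s] / n) for s in sample_count}


def weighted_prevalence(sample_count, weights, outcome="asthma"):
    """Weighted share of sampled individuals whose stratum has the given outcome."""
    total = sum(sample_count[s] * weights[s] for s in sample_count)
    cases = sum(
        sample_count[s] * weights[s] for s in sample_count if s[1] == outcome
    )
    return cases / total


weights = stratum_weights(population_share, sample_count)

# Unweighted estimate: fraction of asthma cases in the biased sample.
naive = sum(c for s, c in sample_count.items() if s[1] == "asthma") / sum(
    sample_count.values()
)
# Weighted estimate: recovers the assumed population prevalence.
corrected = weighted_prevalence(sample_count, weights)

print(f"naive prevalence:     {naive:.2f}")
print(f"corrected prevalence: {corrected:.2f}")
```

In this toy setting the unweighted sample suggests that half of all individuals have asthma, whereas the weighted estimate returns the assumed population prevalence of 10%. The same per-observation weights could, in principle, be passed to a learner and to the performance metrics so that both training and evaluation reflect the population rather than the sampling design.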