A0684
Title: The selection and creation of benchmark data sets for comparison studies: Challenges and solutions
Authors: Silke Szymczak - University of Luebeck (Germany) [presenting]
Abstract: Benchmark data sets are crucial to ensure a fair and comprehensive evaluation and comparison of statistical methods. Ideally, a large number of diverse data sets that are representative of and relevant to the application area of interest should be used. One approach is to select real-world data sets from publicly available data repositories such as OpenML and UCI. However, a major limitation is the poor documentation of the data sets, including missing information on the original source, the main research question, and the interpretation and coding of variables. Specific resources such as TCGA are often used to evaluate approaches for multi-omics analyses, but they focus only on cancer, and it is unclear whether the results are transferable to other tissues and diseases. An alternative is to generate synthetic data sets based on predefined statistical models and scenarios. However, it is important that the simulated data do not favor any particular statistical method. They should also be as realistic as possible, for example, in terms of correlation structures, noise levels, and patterns of missing values. Some solutions from methodological machine learning research for tabular clinical and molecular data are presented and discussed.
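
The synthetic-data approach mentioned in the abstract can be illustrated with a minimal sketch. It is not the presenter's method. The function name simulate_tabular, all parameter names and defaults, the AR(1)-type correlation structure, the linear outcome model, and the missing-completely-at-random mechanism are assumptions chosen only to show how correlation structure, noise level, and missingness pattern can each be controlled explicitly.

```python
import numpy as np


def simulate_tabular(n_samples=500, n_features=10, rho=0.5,
                     noise_sd=1.0, missing_rate=0.1, seed=0):
    """Simulate a tabular data set (hypothetical helper, illustration only).

    Assumptions: Gaussian features with AR(1)-type correlation rho**|i-j|,
    a linear outcome driven by the first three features, additive Gaussian
    noise, and values missing completely at random (MCAR).
    """
    rng = np.random.default_rng(seed)

    # Correlation structure: entry (i, j) equals rho ** |i - j|.
    idx = np.arange(n_features)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(n_features), cov, size=n_samples)

    # Noise level: only the first three features carry signal.
    beta = np.zeros(n_features)
    beta[:3] = 1.0
    y = X @ beta + rng.normal(0.0, noise_sd, size=n_samples)

    # Missing values: each feature entry is masked independently (MCAR).
    mask = rng.random(X.shape) < missing_rate
    X_obs = X.copy()
    X_obs[mask] = np.nan
    return X_obs, y, cov


X_obs, y, cov = simulate_tabular()
print(X_obs.shape, y.shape, round(float(np.isnan(X_obs).mean()), 3))
```

A linear Gaussian model of this kind may itself favor linear methods, which is the concern raised in the abstract. In a comparison study, several data-generating scenarios would typically be varied rather than relying on a single one.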