B0348
Title: Learn2Evaluate: Predictive performance estimation with learning curves
Authors: Mark van de Wiel - Amsterdam University Medical Centers (Netherlands)
Jeroen Goedhart - Amsterdam UMC (Netherlands) [presenting]
Abstract: In high-dimensional prediction settings, i.e., when $p > n$, it remains challenging to estimate the test performance (e.g., AUC). Conventional resampling methods must balance the number of samples used to learn the model reliably against the number left to estimate its performance. We show that combining estimates from a trajectory of subsample sizes, rendering a learning curve, leads to several benefits. Firstly, the use of a smoothed curve can improve the performance point estimate. Secondly, whether the learning curve is still growing or has saturated indicates whether additional samples will boost the prediction accuracy. Thirdly, comparing the trajectories of different learners gives a more complete picture than comparing them at one sample size only. Fourthly, the learning curve allows computation of a useful lower confidence bound for the predictive performance: standard cross-validation suffers from a limited number of test samples, whereas the learning curve finds a better trade-off between training and test sample sizes. This confidence bound is proven to be valid. We show coverage results from a simulation and compare them to those of a state-of-the-art technique based on asymptotics and bootstrapping. Finally, we demonstrate the benefits of our approach by applying it to several classifiers of tumor location from blood platelet RNAseq data.
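The subsampling idea behind a learning curve can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes simulated $p > n$ data, uses a simple difference-of-class-means scorer as a hypothetical stand-in learner, and reports plain averages per subsample size rather than the smoothed curve or the confidence bound described in the abstract. All function names are illustrative.

```python
import numpy as np


def auc(scores, labels):
    """AUC as the Mann-Whitney statistic; ties count as 1/2."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def centroid_scores(X_train, y_train, X_test):
    """Hypothetical stand-in learner: project on the difference of class means."""
    w = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    return X_test @ w


def learning_curve(X, y, sizes, n_repeats=50, seed=0):
    """Mean test AUC for each training subsample size in `sizes`.

    For every size m, draw repeated random splits: m samples train the
    learner and the remaining n - m samples estimate its AUC.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    curve = []
    for m in sizes:
        aucs = []
        for _ in range(n_repeats):
            perm = rng.permutation(n)
            train, test = perm[:m], perm[m:]
            # Skip splits in which either part lacks one of the two classes.
            if np.unique(y[train]).size < 2 or np.unique(y[test]).size < 2:
                continue
            scores = centroid_scores(X[train], y[train], X[test])
            aucs.append(auc(scores, y[test]))
        curve.append(np.mean(aucs))
    return np.array(curve)


# Simulated high-dimensional data (assumption): n = 60, p = 200,
# with a mean shift in the first 10 features for class 1.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 200))
X[y == 1, :10] += 1.0

sizes = [10, 20, 30, 40, 50]
curve = learning_curve(X, y, sizes)
for m, a in zip(sizes, curve):
    print(f"train size {m:2d}: mean AUC {a:.3f}")
```

Small training sizes leave many test samples but a weaker model; large training sizes do the opposite. Plotting `curve` against `sizes` shows whether performance is still rising at the largest size, which is the kind of trajectory the abstract proposes to smooth and extrapolate.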