Title: Missing mass estimation in feature sampling
Authors: Fadhel Ayed - University of Oxford (United Kingdom)
Marco Battiston - Bocconi University (Italy)
Federico Camerlenghi - University of Milano-Bicocca and Collegio Carlo Alberto (Italy) [presenting]
Stefano Favaro - University of Torino and Collegio Carlo Alberto (Italy)
Abstract: Feature models generalize species sampling models by allowing each observation to belong to more than one species; in this setting, species are called features. These models are widely used in machine learning and have found applications in diverse areas, including biology and the biosciences. Given a sample of size $n$, a relevant statistical problem for these models is to estimate the conditional expected number of hitherto unseen features that will be displayed in a future observation. This is usually referred to as the missing mass problem. It is motivated by numerous applied settings in which sampling is expensive, in terms of time and/or financial resources, and further sampling can be justified only by the possibility of recording new, previously unobserved features. We introduce a simple, robust and theoretically sound nonparametric estimator of the missing mass, with provable guarantees on its performance, and we derive corresponding confidence intervals via concentration inequalities. Our approach is illustrated through the analysis of synthetic data and of SNP data from the ENCODE genome sequencing project.