A0635
Title: Besting Good-Turing: Optimality of non-parametric maximum likelihood for distribution estimation
Authors: Yanjun Han - New York University (United States)
Jonathan Niles-Weed - New York University (United States)
Yandi Shen - Carnegie Mellon University (United States) [presenting]
Yihong Wu - Yale University (United States)
Abstract: Estimating probability distributions is a fundamental statistical task with wide-ranging applications in fields such as ecology, immunology, and linguistics. In many of these settings, the support size of the target distribution is comparable to or even exceeds the available sample size. The Good-Turing estimator, developed over 70 years ago, has remained a benchmark for improving upon the empirical frequency estimator in such sparse data regimes. Despite its empirical success, practitioners have long noted its limitations, including a lack of robustness, numerical instability, and the need for ad hoc preprocessing. These issues reflect its lack of theoretical optimality, as highlighted in recent work within a competitive risk framework. A new estimator is proposed that integrates two foundational statistical ideas: empirical Bayes and nonparametric maximum likelihood. This estimator is proven to achieve (near) minimax optimality in the competitive framework, where Good-Turing is necessarily sub-optimal. Extensive experiments on both synthetic and real data drawn from natural language and demographic applications demonstrate its superior empirical performance.
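For readers unfamiliar with the baseline the abstract compares against, the following is a minimal sketch of the classic (unsmoothed) Good-Turing estimator. The function name and the fallback to empirical frequency are our own illustrative choices, not part of the paper; the fallback is one example of the ad hoc preprocessing the abstract alludes to.

```python
from collections import Counter

def good_turing_probs(samples):
    """Classic (unsmoothed) Good-Turing estimate.

    For a symbol seen r times out of n samples, the Good-Turing estimate
    of its probability is (r + 1) * N_{r+1} / (N_r * n), where N_r is the
    number of distinct symbols seen exactly r times. The total mass
    reserved for unseen symbols is N_1 / n.
    """
    n = len(samples)
    counts = Counter(samples)                # symbol -> r (its count)
    freq_of_freq = Counter(counts.values())  # r -> N_r (count of counts)
    probs = {}
    for symbol, r in counts.items():
        n_r1 = freq_of_freq.get(r + 1, 0)
        if n_r1 > 0:
            probs[symbol] = (r + 1) * n_r1 / (freq_of_freq[r] * n)
        else:
            # N_{r+1} = 0 makes the raw estimate degenerate to zero;
            # fall back to the empirical frequency (an ad hoc patch).
            probs[symbol] = r / n
    unseen_mass = freq_of_freq.get(1, 0) / n
    return probs, unseen_mass
```

For example, on the sample "aabbbc" (n = 6), the singleton "c" gets mass 2/6 rather than its empirical 1/6, and 1/6 of the total mass is set aside for symbols never observed.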