CMStatistics 2023
B0573
Title: Learning to approximately count with Bayesian non-parametrics
Authors: Mario Beraha - Università di Torino (Italy) [presenting]
Abstract: Given a sketch $(C_1, \ldots, C_J)$ of data obtained by compressing a sample $(X_1, \ldots, X_n)$ via a random hash function $h$, such that $C_j = \sum_{i=1}^n I(h(X_i) = j)$, the following questions are considered. (i) How many times did we see $X_{n+1}$ in the original sample? (ii) How many distinct symbols are there in $X_1, \ldots, X_n$? These questions are framed within a model-based framework by assuming that the $X_i$'s are an i.i.d. sample from an unknown discrete probability measure $P$. Inference proves to be challenging: in the frequentist setting, the natural estimators depend on quantities that cannot be estimated from the sketch. Relying on worst-case analysis, the resulting estimators are shown to be trivial; in particular, the original count-min sketch algorithm is recovered. In the Bayesian setting, a prior for $P$ is assumed, and the posterior distribution is shown to lead to combinatorial hurdles unless $P$ is a Dirichlet process; this characterizes the DP as the sole nonparametric prior for which Bayesian inference is tractable. Finally, smoothing the frequentist estimators via Bayesian nonparametric priors is proposed: this leads to simple(r) expressions that can actually be computed, while also depending on a handful of parameters that are easily estimated from the sketch.
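As a concrete illustration of the setup, the following minimal Python sketch shows the compression step $C_j = \sum_{i=1}^n I(h(X_i) = j)$ and the classic count-min answer to question (i). The bucket count `J`, the salted use of Python's built-in `hash`, and all function names are illustrative assumptions, not the construction analyzed in the talk:

```python
import random

def build_sketch(sample, J, seed=0):
    """Compress a sample into J bucket counts via a random hash h,
    so that C[j] = #{i : h(X_i) = j}.  (Illustrative hash: a random
    salt combined with Python's built-in hash, reduced modulo J.)"""
    salt = random.Random(seed).getrandbits(32)
    h = lambda x: hash((salt, x)) % J
    C = [0] * J
    for x in sample:
        C[h(x)] += 1
    return C, h

def cms_estimate(C, h, x):
    """Classic (single-row) count-min answer to question (i): the bucket
    count C[h(x)] always upper-bounds the true frequency of x, since
    every occurrence of x lands in bucket h(x)."""
    return C[h(x)]

# Example: true counts are a:5, b:3, c:2 in a sample of size n = 10.
sample = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
C, h = build_sketch(sample, J=8)
print(sum(C))                    # total mass equals n = 10
print(cms_estimate(C, h, "a"))   # at least 5 (exact unless "a" collides)
```

Hash collisions only inflate bucket counts, which is why the single-row estimate is an upper bound; the standard count-min sketch tightens it by taking the minimum over several independent hash rows.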