Title: Scaling up nonparametric Bayesian clustering with MCMC for big data applications
Authors: Boris Hejblum - Université de Bordeaux, Inserm BPH U1219, Inria SISTM, Vaccine Research Institute (France) [presenting]
Paul Kirk - University of Cambridge (United Kingdom)
Abstract: Nonparametric Bayesian mixture models, such as Dirichlet process mixture models (DPMMs), can be used to perform model-based clustering. One of their advantages is that they estimate the number of clusters directly from the data, avoiding the tricky issue of choosing it beforehand. While state-of-the-art Markov chain Monte Carlo (MCMC) algorithms allow efficient and exact inference for DPMMs, it is generally difficult to scale these algorithms up to hundreds of thousands of data points. Subsampling approaches present several drawbacks, especially when some clusters are quite rare. We propose instead a two-step strategy: (i) first summarize the dataset using a misspecified, largely over-parametrized but simple clustering algorithm (such as $k$-means); and then (ii) use the resulting weighted summary of the dataset to perform Bayesian inference for the DPMM via MCMC algorithms. We use numerical simulations as well as real single-cell cytometry data to investigate the properties of this strategy.
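Step (i) of the two-step strategy can be illustrated with a minimal sketch, which is not the authors' implementation. It runs a plain Lloyd's $k$-means with deliberately more centers than true clusters and returns each non-empty center together with the number of points it represents. The function name `kmeans_summary`, the simulated two-cluster data, and the choice of 30 centers are all assumptions made for illustration. How the weights enter the DPMM likelihood in step (ii) is not specified in the abstract, so that step is left out.

```python
import random

def kmeans_summary(points, k, n_iter=10, seed=0):
    """Hypothetical helper: Lloyd's k-means returning (centers, weights).

    Each weight is the number of original points assigned to that center,
    so the pair is a weighted summary of the dataset.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Initialise the centers at k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_iter):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in points:
            # Assign the point to its nearest center (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        # Move every non-empty center to the mean of its assigned points.
        for j in range(k):
            if counts[j] > 0:
                centers[j] = [s / counts[j] for s in sums[j]]
    # Drop centers that ended up with no points.
    kept = [j for j in range(k) if counts[j] > 0]
    return [centers[j] for j in kept], [counts[j] for j in kept]

# Simulated data (an assumption): one large cluster and one rare cluster.
rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]
data += [(rng.gauss(8.0, 0.5), rng.gauss(8.0, 0.5)) for _ in range(20)]

# Over-parametrized on purpose: far more centers than true clusters.
centers, weights = kmeans_summary(data, k=30)
print(len(centers), "weighted summary points for", len(data), "observations")
```

The MCMC sampler in step (ii) would then take the pairs `(centers, weights)` as input in place of the full dataset, which is where the computational saving would come from.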