A0571

Title: Bayesian nonparametric mixture inconsistency for the number of components: How worried should we be in practice?
Authors: Johan van der Molen Moris - Pontificia Universidad Católica de Chile (Chile) [presenting]
Paul Kirk - University of Cambridge (United Kingdom)
Anthony Davison - EPFL (Switzerland)
Yannis Chaumeny - EPFL (Switzerland)
Abstract: A Bayesian clustering approach is considered based on mixture of finite mixtures and Dirichlet process mixture models, which are popular because they provide uncertainty estimates for the number of clusters and admit efficient sampling methods. However, recent theoretical results show that Dirichlet process mixture models overestimate the number of clusters for large samples and that, under misspecification, both models give inconsistent estimates. Furthermore, Bayesian mixture models give inconsistent estimates of the number of clusters in some high-dimensional cases. In practice, Markov chain Monte Carlo summarization methods are used to obtain a representative clustering for interpretation, and their effect on the estimated number of clusters is not well studied. The consequences of summarization methods are investigated for practical scenarios in light of the asymptotic results, using a simulation study and an application to gene expression data. Results show that, for the situations considered, the Dirichlet process mixture model leads to limited overestimation of the number of clusters for finite samples, which is corrected by some summarization methods. Misspecification leads to considerable overestimation of the number of clusters, but results are still interpretable. It is shown that certain summarization methods also lead to overestimation of the number of clusters, even for accurate estimates. For high-dimensional data, an illustration of the underestimation of the number of clusters is given, suggesting careful interpretation in practice.