View Submission - COMPSTAT2024
A0216
Title: Goodness-of-fit testing in topic models
Authors: Anna Staszewska-Bystrova - University of Lodz (Poland) [presenting]
Victor Bystrov - University of Lodz (Poland)
Abstract: Topic models used for structural analysis of textual data are most often evaluated on the basis of the characteristics of the extracted topics. Apart from providing coherent topics, these models should also fit the data well. Standard goodness-of-fit tests are not suited to large corpora, which are characterized by a sparse distribution of terms. We propose a testing procedure that relies on averaging goodness-of-fit statistics across the documents in a corpus. The performance of the tests is evaluated in the latent Dirichlet allocation (LDA) model by means of Monte Carlo simulations under the assumption of known parameters. A bootstrap procedure for goodness-of-fit testing in the estimated LDA model is also proposed, and the size and power of the bootstrap tests are analysed.
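Illustration: the idea of averaging a goodness-of-fit statistic across documents and calibrating it by Monte Carlo simulation under known LDA parameters can be sketched as follows. This is a minimal sketch and not the authors' procedure. The abstract does not say which statistic is averaged, so Pearson's chi-square is assumed here, and all function names (term_probabilities, simulate_counts, averaged_pearson, monte_carlo_pvalue) are hypothetical. The sketch also assumes that every document is non-empty and that every term has positive probability in every document.

```python
import numpy as np

def term_probabilities(theta, beta):
    """Per-document term probabilities implied by LDA.

    theta: (D, K) document-topic proportions, beta: (K, V) topic-term probabilities.
    """
    probs = theta @ beta
    # Renormalise to remove floating-point drift in the row sums.
    return probs / probs.sum(axis=1, keepdims=True)

def simulate_counts(probs, doc_lengths, rng):
    """Draw one bag-of-words count vector per document."""
    return np.vstack([rng.multinomial(n, p) for n, p in zip(doc_lengths, probs)])

def averaged_pearson(counts, probs):
    """Pearson chi-square per document, averaged over the corpus (assumed statistic)."""
    expected = counts.sum(axis=1, keepdims=True) * probs
    per_doc = ((counts - expected) ** 2 / expected).sum(axis=1)
    return per_doc.mean()

def monte_carlo_pvalue(counts, theta, beta, n_sim=200, seed=0):
    """Monte Carlo p-value of the averaged statistic when theta and beta are known."""
    rng = np.random.default_rng(seed)
    probs = term_probabilities(theta, beta)
    doc_lengths = counts.sum(axis=1)
    observed = averaged_pearson(counts, probs)
    # Null distribution: corpora simulated from the same known parameters.
    simulated = np.array([
        averaged_pearson(simulate_counts(probs, doc_lengths, rng), probs)
        for _ in range(n_sim)
    ])
    # Add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(simulated >= observed)) / (n_sim + 1)
```

A large averaged statistic relative to the simulated null distribution yields a small p-value and indicates poor fit. With estimated rather than known parameters, this calibration would no longer be valid as written, which is the case the bootstrap procedure in the abstract addresses.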