A0677
Title: Testing the equality of topic distribution between documents of a corpus
Authors: Louisa Kontoghiorghes - King's College London (United Kingdom) [presenting]
Ana Colubi - University of Giessen (Germany)
Abstract: Topic modelling is a well-known text mining technique for identifying the themes covered in a set of documents. We introduce a methodology to test whether two documents of a given corpus are homogeneous with respect to the topics they cover. The suggested approach uses Latent Dirichlet Allocation (LDA) to estimate the topic distributions and the Kullback-Leibler divergence to measure the distance between them. Since the sampling distribution of the proposed statistic is unknown, a (frequentist) bootstrap test is suggested. The methodology is illustrated using scientific abstracts from the CMStatistics conference.
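Illustrative sketch (not the authors' implementation): the abstract does not specify the test statistic's exact form or the resampling scheme, so the Python code below is only one plausible reading under stated assumptions. It assumes each document is summarised by a list of per-token topic labels (a hypothetical stand-in for the output of a fitted LDA model), uses the one-directional divergence KL(a || b) with additive smoothing, and approximates the null distribution by resampling tokens from the pooled documents. The function names, the smoothing constant `alpha`, and the pooled resampling scheme are all assumptions made for illustration.

```python
import math
import random


def topic_proportions(assignments, n_topics, alpha=0.5):
    """Smoothed topic proportions from per-token topic labels.

    Assumption: `assignments` stands in for LDA output; `alpha` is an
    arbitrary smoothing constant that keeps every proportion positive.
    """
    counts = [alpha] * n_topics
    for topic in assignments:
        counts[topic] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bootstrap_kl_test(doc_a, doc_b, n_topics, n_boot=2000, seed=0):
    """Bootstrap p-value for H0: both documents share one topic distribution.

    Assumption: the null is mimicked by drawing both bootstrap documents,
    with replacement, from the pooled token-level topic labels.
    """
    rng = random.Random(seed)
    observed = kl_divergence(
        topic_proportions(doc_a, n_topics),
        topic_proportions(doc_b, n_topics),
    )
    pooled = list(doc_a) + list(doc_b)
    exceed = 0
    for _ in range(n_boot):
        star_a = rng.choices(pooled, k=len(doc_a))
        star_b = rng.choices(pooled, k=len(doc_b))
        stat = kl_divergence(
            topic_proportions(star_a, n_topics),
            topic_proportions(star_b, n_topics),
        )
        if stat >= observed:
            exceed += 1
    # The +1 correction keeps the estimated p-value strictly positive.
    p_value = (exceed + 1) / (n_boot + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy topic labels for two short documents over three topics.
    doc_a = [0] * 30 + [1] * 10 + [2] * 5
    doc_b = [0] * 8 + [1] * 12 + [2] * 25
    stat, p = bootstrap_kl_test(doc_a, doc_b, n_topics=3)
    print(f"KL statistic: {stat:.4f}, bootstrap p-value: {p:.4f}")
```

A small p-value would suggest that the two documents differ in topic coverage. Because the Kullback-Leibler divergence is asymmetric, a symmetrised variant could be substituted for `kl_divergence` without changing the rest of the sketch.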