A0618
Title: Testing the equality of topic distribution between documents of a corpus
Authors: Louisa Kontoghiorghes - King's College London (United Kingdom) [presenting]
Ana Colubi - University of Giessen (Germany)
Abstract: Topic modeling is a well-known text mining technique for identifying the themes covered in a set of documents. We introduce two methodologies to test whether two documents of a given corpus are homogeneous with respect to the topics they cover. The suggested approach uses Latent Dirichlet Allocation (LDA) to estimate the topic distributions. The Kullback-Leibler divergence and the chi-square distance are then used separately to measure the distance between the distributions, and their results are compared. Since the sampling distributions of the proposed statistics are unknown, a (frequentist) bootstrap test is suggested. The methodology is illustrated using scientific abstracts from the CMStatistics conference.
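Illustration: a minimal sketch of how such a test could be set up, not the authors' implementation. It assumes the topic distributions of the two documents have already been estimated (for example by LDA) and are given as probability vectors. The function names, the symmetric form of the chi-square distance, and the simplified bootstrap scheme (resampling topic counts from a pooled distribution, without re-estimating the LDA model) are all assumptions made for the example.

```python
import numpy as np


def _normalize(p, eps=1e-12):
    """Smooth a non-negative vector slightly and rescale it to sum to 1."""
    p = np.asarray(p, dtype=float) + eps
    return p / p.sum()


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    p, q = _normalize(p), _normalize(q)
    return float(np.sum(p * np.log(p / q)))


def chi_square_distance(p, q):
    """Symmetric chi-square distance (one common definition; an assumption here)."""
    p, q = _normalize(p), _normalize(q)
    return float(np.sum((p - q) ** 2 / (p + q)))


def bootstrap_p_value(p, q, n_p, n_q, statistic, n_boot=2000, seed=0):
    """Bootstrap p-value for H0: both documents share one topic distribution.

    p, q     : estimated topic proportions of the two documents
    n_p, n_q : document lengths (number of tokens); hypothetical inputs
    Simplification: topic counts are resampled from the pooled distribution,
    and the topic model itself is not re-estimated in each replicate.
    """
    rng = np.random.default_rng(seed)
    p, q = _normalize(p), _normalize(q)
    observed = statistic(p, q)
    # Pooled estimate of the common distribution under the null hypothesis.
    pooled = (n_p * p + n_q * q) / (n_p + n_q)
    exceed = 0
    for _ in range(n_boot):
        p_star = rng.multinomial(n_p, pooled) / n_p
        q_star = rng.multinomial(n_q, pooled) / n_q
        if statistic(p_star, q_star) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_boot + 1)
```

For instance, calling bootstrap_p_value(p, q, 200, 200, kl_divergence) with two clearly different topic vectors should return a small p-value, while passing chi_square_distance as the statistic gives the corresponding result for the second measure, so the two can be compared side by side.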