CMStatistics 2023
B1708
Title: Topic characterization and distinction using constrained latent Dirichlet allocation
Authors: Marco Stefanucci - University of Rome Tor Vergata (Italy) [presenting]
Alessio Farcomeni - University of Rome Tor Vergata (Italy)
Abstract: Topic models are widely used tools for identifying coherent underlying content in textual data. Among them, latent Dirichlet allocation (LDA) is one of the best known. LDA is a Bayesian latent variable model for categorical data that places Dirichlet priors on both the structure and the proportions of topics. One noteworthy limitation of the original LDA model is its limited ability to select words that distinctly characterize topics. Common words may have a substantial presence across multiple topics without truly distinguishing any one of them. Conversely, rare words may be associated exclusively with specific topics. A method is demonstrated for enhancing the LDA model so that it identifies words that effectively distinguish topics. This enhancement proves particularly valuable when topics overlap. To substantiate the findings, extensive simulations and a detailed case study are presented.
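As a rough illustration of the background described in the abstract, the sketch below samples a small corpus from the standard (unconstrained) LDA generative model and computes a simple per-word topic share. It is not the constrained method proposed by the authors, whose details are not given here. The function names `sample_lda_corpus` and `topic_share`, the symmetric hyperparameters `alpha` and `eta`, and the share measure itself are assumptions introduced only for this example.

```python
import numpy as np


def sample_lda_corpus(n_docs, doc_len, n_topics, vocab_size,
                      alpha=0.5, eta=0.5, seed=0):
    """Sample from the standard LDA generative model (illustrative only).

    alpha and eta are assumed symmetric Dirichlet hyperparameters for the
    document-topic proportions and the topic-word distributions.
    """
    rng = np.random.default_rng(seed)
    # Topic structure: one distribution over the vocabulary per topic.
    beta = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    # Topic proportions: one distribution over topics per document.
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs = []
    for d in range(n_docs):
        # Draw a topic for every word position, then a word from that topic.
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        words = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
        docs.append(words)
    return beta, theta, docs


def topic_share(beta):
    """Hypothetical helper: fraction of each word's total probability mass
    that falls in each topic (every column sums to 1)."""
    return beta / beta.sum(axis=0, keepdims=True)
```

For a toy topic-word matrix `[[0.5, 0.4, 0.1], [0.5, 0.1, 0.4]]`, the first word is frequent in both topics but its share is split evenly (0.5 and 0.5), so it distinguishes neither. The second word is less frequent overall but has 80% of its mass in the first topic. This is the kind of contrast between common and distinguishing words that the abstract refers to.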