View Submission - COMPSTAT2024
A0285
Title: Seeded Poisson factorization: Leveraging domain knowledge to fit topic models
Authors: Bernd Prostmaier - BMW AG (Germany) [presenting]
Bettina Gruen - Wirtschaftsuniversität Wien (Austria)
Paul Hofmarcher - University Salzburg (Austria)
Abstract: The latent variable model Seeded Poisson Factorization (SPF) is proposed to address text classification problems in which no labelled texts are available but each class is characterized by a set of relevant words. In many business contexts, in particular the assessment of consumer feedback, vast amounts of unlabelled text data are collected; conceptual frameworks outline potential categorization schemata, and domain experts are able to provide sets of relevant words for each category. SPF builds on the Poisson Factorization topic model, which assumes that term counts in documents are drawn independently from Poisson distributions whose rates result from combining topic-specific term distributions, weighted by the document-specific topic distributions. Seeding modifies the prior distribution of the topic-specific term distributions so that the relevant words have higher rates for their topic a priori. Estimation is based on computationally efficient variational inference using general-purpose stochastic gradient optimization. The use of SPF is illustrated on Amazon customer feedback data, classifying feedback items into categories that are known a priori. Empirical results indicate that SPF outperforms alternative topic models that allow seed words to be specified for topics, in terms of both computational cost and classification accuracy.
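The generative structure described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the gamma priors, the additive seed boost, all hyperparameter values, the toy vocabulary, and the function name `simulate_seeded_pf` are hypothetical choices made only for illustration. The abstract itself specifies only that counts are Poisson with rates combining document-topic and topic-term components, and that seed words have higher prior rates for their topic.

```python
import numpy as np

def simulate_seeded_pf(n_docs, vocab, seed_words, rng,
                       shape=0.3, rate=0.3, seed_shape=3.0):
    """Simulate term counts from a Poisson factorization with seeded topics.

    All priors and hyperparameters are illustrative assumptions.
    """
    topics = list(seed_words)
    index = {word: j for j, word in enumerate(vocab)}
    n_topics, n_terms = len(topics), len(vocab)

    # Document-specific topic intensities (n_docs x n_topics).
    theta = rng.gamma(shape, 1.0 / rate, size=(n_docs, n_topics))

    # Unseeded topic-specific term intensities (n_topics x n_terms).
    beta_base = rng.gamma(shape, 1.0 / rate, size=(n_topics, n_terms))

    # Seeding (assumed additive): seed words get an extra positive
    # gamma-distributed rate in their own topic, raising it a priori.
    beta = beta_base.copy()
    for k, topic in enumerate(topics):
        for word in seed_words[topic]:
            beta[k, index[word]] += rng.gamma(seed_shape, 1.0 / rate)

    # Poisson rate of term v in document d: sum_k theta[d, k] * beta[k, v].
    rates = theta @ beta
    counts = rng.poisson(rates)
    return counts, theta, beta, beta_base

# Hypothetical toy vocabulary and seed sets.
vocab = ["battery", "charge", "screen", "pixel", "price", "cheap"]
seed_words = {"power": ["battery", "charge"], "display": ["screen", "pixel"]}

rng = np.random.default_rng(0)
counts, theta, beta, beta_base = simulate_seeded_pf(5, vocab, seed_words, rng)
print(counts)
```

In this sketch only the seeded entries of `beta` differ from `beta_base`, so the prior alone pushes each topic toward its expert-supplied words. Fitting such a model to real data would require an inference procedure such as the variational approach mentioned in the abstract, which is not shown here.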