A0678
Title: A Bayesian semi-supervised approach to keyphrase extraction
Authors: Yichen Cheng - Georgia State University (United States) [presenting]
Abstract: In the era of big data, people benefit from tremendous amounts of information. However, the sheer volume of that information also poses great challenges, one of which is how to extract useful yet succinct information in an automated fashion. As one of the first efforts in this direction, keyword extraction methods summarize an article by identifying a list of keywords. Many existing keyword extraction methods focus on the unsupervised setting, with all keywords assumed unknown. In reality, a (small) subset of the keywords may be available for a particular article. A rigorous probabilistic model based on a semi-supervised setup is proposed to utilize such information. The method incorporates the graph-based information of an article into a Bayesian framework via an informative prior, so that the model facilitates formal statistical inference, which is often absent from existing methods. Both Markov chain Monte Carlo algorithms based on Gibbs samplers and variational Bayesian methods are developed to overcome the difficulty arising from high-dimensional posterior sampling. A false discovery rate (FDR) based approach is employed for selecting the number of keywords, whereas existing methods use ad hoc threshold values. The numerical results show that the proposed method compares favourably with state-of-the-art methods for keyword extraction.
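The abstract does not spell out the FDR procedure, so the following is only an illustrative sketch of one common Bayesian FDR rule, not the author's method. It assumes each candidate phrase already has a posterior probability of being a keyword (for example, averaged over Gibbs draws), ranks candidates by that probability, and keeps the largest set whose estimated FDR (the mean of 1 minus the posterior probability over the selected set) stays at or below a target level. The function name `select_by_bayesian_fdr`, the parameter `alpha`, and the example probabilities are hypothetical.

```python
import numpy as np


def select_by_bayesian_fdr(post_prob, alpha=0.1):
    """Return indices of candidates kept at estimated Bayesian FDR <= alpha.

    post_prob: posterior probability that each candidate is a keyword
               (assumed to come from some fitted Bayesian model).
    alpha:     target false discovery rate (hypothetical default).
    """
    post_prob = np.asarray(post_prob, dtype=float)
    # Rank candidates from most to least probable keyword.
    order = np.argsort(-post_prob, kind="stable")
    # Posterior probability that each ranked candidate is a false discovery.
    local_fdr = 1.0 - post_prob[order]
    # Estimated FDR of the top-k set is the running mean of local_fdr.
    running_fdr = np.cumsum(local_fdr) / np.arange(1, len(order) + 1)
    # The running mean of an ascending sequence is non-decreasing, so the
    # number of entries <= alpha is the largest admissible k.
    k = int(np.searchsorted(running_fdr, alpha, side="right"))
    return order[:k]


if __name__ == "__main__":
    probs = [0.99, 0.2, 0.95, 0.6, 0.9]  # made-up posterior probabilities
    print(select_by_bayesian_fdr(probs, alpha=0.1))
```

Under this rule the number of keywords falls out of the posterior probabilities and the chosen FDR level rather than from a fixed cutoff on a ranking score, which is the contrast with ad hoc thresholds that the abstract draws.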