CMStatistics 2015
B0691
Title: Processing text data with latent-variable grammars
Authors: Shay Cohen - University of Edinburgh (United Kingdom) [presenting]
Abstract: We face enormous growth in the amount of information available from various data sources. This growth is even more notable for text data; the number of pages on the internet, for example, is expected to double every five years, with billions of multilingual webpages already available. In order to make use of this information, much of it needs to be parsed into natural language structures, such as syntactic trees or semantic graphs. In natural language processing, two important tools are available for this kind of structured prediction: probabilistic grammar formalisms and latent-variable modelling. Probabilistic grammar formalisms are a family of statistical models that give a principled way to process textual data and predict various types of structures for it. Latent-variable modelling, on the other hand, helps to discover patterns in data that are hard to annotate manually. We describe work that combines these two ideas. We present an algorithm for estimating latent-variable grammars that stands in stark contrast to the algorithms used for such estimation thus far, such as the expectation-maximization algorithm. We then simplify it, yielding a fast, simple procedure for latent-variable grammar learning. We also describe three applications of these latent-variable grammars to text analysis: syntactic parsing, machine translation, and the analysis of text in online forums.
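To make the abstract's central object concrete, the following is a minimal sketch (in Python, not taken from the talk, and not the estimation algorithm it describes) of a latent-variable probabilistic context-free grammar: each nonterminal carries a hidden refinement state, rule probabilities become small tensors indexed by those states, and the probability of an observed parse tree is obtained by summing the hidden states out. All grammar symbols, state counts, and probabilities are invented for illustration.

    # A toy latent-variable PCFG. Every nonterminal A is refined with a hidden
    # state h in {0, ..., m-1}; rule probabilities become tensors over latent
    # states, and the score of a tree marginalises over those states.
    # The grammar, the number of states, and all probabilities are made up.
    import numpy as np

    m = 2                                   # latent states per nonterminal (assumed)
    rng = np.random.default_rng(0)

    def conditional(shape):
        """Random table normalised over all axes except the first (the parent state)."""
        t = rng.random(shape)
        return t / t.sum(axis=tuple(range(1, t.ndim)), keepdims=True)

    # Toy grammar: S -> NP VP; NP -> "dogs" | "cats"; VP -> "bark".
    # binary[(A, B, C)][h_A, h_B, h_C] = p(B, h_B, C, h_C | A, h_A)
    binary = {("S", "NP", "VP"): conditional((m, m, m))}
    # lexical[(A, w)][h_A] = p(w | A, h_A); word probabilities sum to one per state
    lexical = {("NP", "dogs"): np.array([0.7, 0.2]),
               ("NP", "cats"): np.array([0.3, 0.8]),
               ("VP", "bark"): np.ones(m)}

    def inside(node):
        """Inside scores over latent states for a tree written as
        (label, word) at the leaves and (label, left, right) elsewhere."""
        if len(node) == 2:                       # preterminal over a single word
            label, word = node
            return lexical[(label, word)]
        label, left, right = node
        t = binary[(label, left[0], right[0])]   # m x m x m rule tensor
        return np.einsum("abc,b,c->a", t, inside(left), inside(right))

    tree = ("S", ("NP", "dogs"), ("VP", "bark"))
    print("tree probability:", inside(tree).mean())   # uniform prior over root states

Because the refinement states are never observed in treebank annotation, estimating such tensors is the learning problem the abstract refers to; the expectation-maximization algorithm it mentions would alternate between computing posteriors over the hidden states and re-estimating the tables above.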