CMStatistics 2015
B0586
Title: Statistical analysis of the text of computer programs
Authors: Charles Sutton - University of Edinburgh (United Kingdom) [presenting]
Abstract: Billions of lines of source code have been written, many of which are freely available on the Internet. The text of this code can be analysed statistically like any other textual corpus, with the goal of identifying textual patterns that characterise software systems that are more reliable and whose code is easier to read. We describe three new tools based on statistical textual analysis that are designed to help software developers write better programs. First, Naturalize is a system that suggests more descriptive names for local variables and functions based on Markov models. Second, TASSAL is a system that summarises code by automatically hiding the regions of code that are least informative according to a latent Dirichlet model. Finally, HAGGIS is a system that learns textual patterns that have syntactic structure, such as for-loops that iterate over vectors. HAGGIS accomplishes this using a nonparametric Bayesian probabilistic grammar.
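
To make the idea of suggesting names with a Markov model concrete, the sketch below trains a first-order (bigram) model on a tiny corpus of tokenised code and ranks candidate identifiers by how probable the surrounding code becomes when each candidate is filled in. This is an illustrative toy, not the Naturalize implementation: the function names (`train_bigram`, `log_prob`, `suggest_name`), the `<NAME>` placeholder, the add-alpha smoothing, and the four-line corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def train_bigram(token_sequences):
    """Count unigrams and bigrams over tokenised lines of code."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_sequences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams, alpha=1.0):
    """Add-alpha smoothed log-probability of a token sequence (assumed smoothing)."""
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen tokens
    padded = ["<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigrams[(prev, cur)] + alpha
        denominator = unigrams[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total


def suggest_name(template, candidates, unigrams, bigrams):
    """Rank candidates by the probability of the template with each name filled in."""
    def score(name):
        filled = [name if tok == "<NAME>" else tok for tok in template]
        return log_prob(filled, unigrams, bigrams)
    return sorted(candidates, key=score, reverse=True)


# Hypothetical toy corpus; a real system would train on a large code base.
corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int i = 0 ; i < len ; i ++ )".split(),
    "for ( int j = 0 ; j < n ; j ++ )".split(),
    "int count = 0 ;".split(),
]
unigrams, bigrams = train_bigram(corpus)

template = "for ( int <NAME> = 0 ; <NAME> < n ; <NAME> ++ )".split()
ranking = suggest_name(template, ["count", "i", "j"], unigrams, bigrams)
print(ranking)
```

On this toy corpus the loop-counter name `i` ranks first because it is the name seen most often in this context, which mirrors the intuition that conventional names make the surrounding code more predictable.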