CMStatistics 2022
B1751
Title: From Gutenberg to BERT: How transformers can change information extraction from text
Authors: Daniela Ushizima - Lawrence Berkeley National Laboratory / UC San Francisco (United States) [presenting]
Eric Chagnon - University of California Davis (United States)
Abstract: Around 1440, Gutenberg revolutionized knowledge dissemination with the advent of the printing press, built from efficient mechanical devices. More than half a millennium later, access to text has shifted from scarcity to overabundance: the main challenge became how to distill information from huge amounts of textual data. Extracting knowledge from text has undergone a technological upheaval with text mining and natural language processing, but how have these innovations affected scientific activities, such as literature review? Our efforts toward designing algorithms for topic modeling and content recommendation over large sets of scientific articles are described. By using deep learning models, such as the Bidirectional Encoder Representations from Transformers (BERT), our Python-based code has been turning text data into information that helps us identify key topics within different science domains, for example, the most relevant technologies for materials analysis given a set of laboratories. The main advantages are high-level mechanisms for I/O, semantic similarity among articles that enables recommendations, and visualization of topic word scores and the evolution of topics over time for quick feedback.
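As an illustration of the kind of pipeline the abstract describes, the sketch below embeds article abstracts with a pretrained BERT model and ranks them by cosine similarity, the basis for content recommendation. The model name, mean-pooling strategy, and sample texts are assumptions chosen for demonstration, not the authors' actual code.

```python
# A minimal sketch of BERT-based semantic similarity for article
# recommendation. The model ("bert-base-uncased"), the mean-pooling
# strategy, and the sample abstracts are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(texts):
    """Mean-pool BERT's last hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (n, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)    # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

abstracts = [
    "Transformer language models for materials science literature mining.",
    "Convolutional neural networks for microscopy image segmentation.",
    "Semantic search over scientific articles with BERT embeddings.",
]
vectors = torch.nn.functional.normalize(embed(abstracts), dim=1)
similarity = vectors @ vectors.T  # cosine similarity matrix
print(similarity)  # high off-diagonal entries suggest related articles
```

In a fuller pipeline, the same document embeddings could also feed a clustering-based topic model to produce the topic word scores and topics-over-time views mentioned above; that step is omitted here for brevity.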