CMStatistics 2023
B1910
Title: Information extraction using transformers
Authors: Stefana Belbe - Babes Bolyai University, Endava (Romania) [presenting]
Daniel-Gabriel Susanu - Endava (Romania)
Andrada Vulpe - Endava (Romania)
Abstract: The aim is to present the results of an automated document classification and information extraction pipeline for scanned documents from the legal field. Novel natural language processing (NLP) techniques are used to transform and interpret multi-class legal documents and to extract the relevant information for each class. The pipeline consists of multiple tasks: splitting documents into images, detecting text in the images using optical character recognition (OCR), layout analysis, document classification, and information extraction. Starting from an initial batch of scanned documents from four U.S. states, different Transformer-based models are assessed and validated for retrieving essential legal information. The models are pre-trained and fine-tuned on specific document formats, their ability to generalize to unseen documents is examined, and different measures are introduced for assessing the quality of the information extraction and OCR modules.
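The abstract does not say which quality measures were introduced. As a minimal sketch only, the Python below shows two commonly used candidates: character error rate (CER) for OCR output, and exact-match F1 over extracted fields. The function names and the field names in the usage note are hypothetical and are not taken from the authors' pipeline.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def field_f1(expected: dict, predicted: dict) -> float:
    """Exact-match F1 over (field, value) pairs (assumed scoring scheme)."""
    true_pos = sum(1 for key, value in predicted.items()
                   if key in expected and expected[key] == value)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(expected)
    return 2 * precision * recall / (precision + recall)
```

For instance, `character_error_rate("abcd", "abxd")` compares an OCR output against a hand-checked transcription, and `field_f1({"grantor": "A", "county": "B"}, {"grantor": "A", "county": "C"})` scores one document's extracted fields against annotated ones. Exact matching is a strict choice; a real evaluation might normalize whitespace or case first.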