View Submission - HiTECCoDES2024
A0193
Title: Automatic detection of industry sectors in legal articles using machine learning approaches
Authors: Stella Hadjiantoni - University of Essex (United Kingdom) [presenting]
Berthold Lausen - University of Essex (United Kingdom)
Hui Yang - (United Kingdom)
Ruta Petraityte - Mondaq (United Kingdom)
Yunfei Long - University of Essex (United Kingdom)
Abstract: The ability to automatically identify industry sector coverage in articles on legal developments, or in news articles of any kind, can benefit both readers and content creators. With articles tagged by industry coverage, readers would be able to reach legal news specific to their region and professional industry. An industry analysis approach combining natural language processing (NLP) with machine learning (ML) techniques was investigated. A dataset of over 1,700 annotated legal articles was created for the identification of six industry sectors. Text-based and legal-based features were extracted from the articles. Both traditional ML methods (e.g. gradient boosting machines and decision-tree-based algorithms) and deep neural networks (e.g. transformer models) were applied to compare predictive performance. The system achieved promising results, with area under the receiver operating characteristic curve (AUC) scores above 0.90 and F-scores above 0.81 for the six industry sectors. The experimental results show that the suggested automated industry analysis allows large collections of text data to be processed in an easy, efficient, and scalable way. Traditional ML methods perform better than deep neural networks when only a small, domain-specific training dataset is available.
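As an illustration of the two evaluation metrics named in the abstract, the following minimal sketch computes the area under the ROC curve and the F-score for a single sector tag treated as a binary label. It is not the authors' evaluation code: the per-sector binary framing, the 0.5 decision threshold, the function names `roc_auc` and `f1_score`, and the toy labels and scores are all assumptions made for the example, and the F-score is taken to be F1.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up values for ONE hypothetical sector tag (not data from the study).
y_true = [1, 0, 1, 1, 0, 0]                 # 1 = article covers the sector
scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1]     # hypothetical classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

auc = roc_auc(y_true, scores)
f1 = f1_score(y_true, y_pred)
print(f"AUC = {auc:.3f}, F1 = {f1:.3f}")
```

Under this framing, the same two functions would be applied once per sector to obtain one AUC and one F-score for each of the six tags. AUC depends only on how the scores rank the articles, whereas the F-score also depends on the chosen threshold.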