A0152
Title: Textual data augmentation for rare classes
Authors: Dan Vilenchik - Ben-Gurion University (Israel) [presenting]
Abstract: Data augmentation is widely studied for visual tasks (e.g. image classification) but far less so for textual tasks. Two recent works are presented, each offering a novel approach for rare classes where off-the-shelf methods fail. The first deals with the task of modelling human personality, a field that relies heavily on labelled data, which may be expensive or impossible to obtain. In this context, a text-based data augmentation approach for human personality (PEDANT) is developed. PEDANT does not rely on the common type of labelled data but on a generative pre-trained model (GPT) combined with domain expertise. Tests on three different datasets provide results supporting the quality of the generated data. The second work deals with the task of hate speech detection, which hinges on the availability of rich and varied labelled data that is hard to obtain. A new approach for data augmentation is presented that takes as input real unlabelled data, carefully selected from online platforms where invited hate speech is abundant. It is shown that by automatically harvesting and processing this data, one can augment existing manually labelled datasets and improve the performance of hate speech classification models. The F1-score improves by 2.7\% to 9.5\%, depending on the task (in-domain or cross-domain) and on the model used.