EcoSta 2024: Submission A0514
Title: Word-level maximum mean discrepancy regularization for word embedding
Authors: Youqian Gao - The Chinese University of Hong Kong (Hong Kong) [presenting]
Ben Dai - The Chinese University of Hong Kong (China)
Abstract: Word embeddings are widely used in natural language processing (NLP) to represent the words of a textual dataset as numerical vectors. However, estimating word embeddings can suffer from severe overfitting because of the huge number of distinct words. To address this issue, a novel regularization framework is proposed that recognizes and accounts for the "word-level distribution discrepancy", a common phenomenon in a range of NLP tasks where word distributions are noticeably disparate under different labels. The proposed regularization, referred to as word-level MMD (wMMD), is a variant of maximum mean discrepancy (MMD) that serves a specific purpose: to preserve and enhance the distribution discrepancies in the embedding vectors and thus prevent overfitting. Theoretical analysis shows that wMMD effectively operates as a dimension-reduction technique for word embeddings, thereby significantly improving the robustness and generalization of NLP models. The numerical effectiveness of wMMD is demonstrated in simulated examples and on real datasets, including the Chile Earthquake T1 and BBC News datasets, with state-of-the-art NLP deep learning architectures.
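
The abstract does not spell out the wMMD estimator itself. As a rough illustration of the underlying quantity only, the sketch below computes a standard (biased) squared MMD with a Gaussian kernel between the embedding vectors of words observed under two labels; it is a minimal sketch under stated assumptions, not the authors' method. All names (gaussian_kernel, mmd2, emb_pos, emb_neg, lambda_reg), the kernel choice, and the sign convention for the regularizer are assumptions.

    # Illustrative sketch only: a generic (biased) Gaussian-kernel MMD^2
    # between embedding vectors of words appearing under two different labels.
    # The exact wMMD formulation is not given in the abstract; names and the
    # sign convention below are assumptions.
    import numpy as np

    def gaussian_kernel(x, y, bandwidth=1.0):
        """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all row pairs."""
        sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

    def mmd2(emb_pos, emb_neg, bandwidth=1.0):
        """Biased estimate of squared MMD between two samples of embeddings."""
        k_pp = gaussian_kernel(emb_pos, emb_pos, bandwidth).mean()
        k_nn = gaussian_kernel(emb_neg, emb_neg, bandwidth).mean()
        k_pn = gaussian_kernel(emb_pos, emb_neg, bandwidth).mean()
        return k_pp + k_nn - 2.0 * k_pn

    # Toy usage with hypothetical embeddings of words seen under label 1 vs. 0.
    rng = np.random.default_rng(0)
    emb_pos = rng.normal(0.5, 1.0, size=(100, 16))
    emb_neg = rng.normal(0.0, 1.0, size=(120, 16))
    # A discrepancy-preserving regularizer would reward a large MMD, e.g.
    # loss = task_loss - lambda_reg * mmd2(emb_pos, emb_neg)  (assumed form).
    print(mmd2(emb_pos, emb_neg))

Subtracting the MMD term (rather than adding it) reflects the abstract's stated goal of preserving distribution discrepancies across labels; whether wMMD enters the objective this way is an assumption.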