View Submission - COMPSTAT2024
A0472
Title: Categorical encoding as joint optimization in predictive models
Authors: Iris-Ioana Roatis - Imperial College London (United Kingdom) [presenting]
Ed Cohen - Imperial College London (United Kingdom)
Niall Adams - Imperial College London and University of Bristol (United Kingdom)
Abstract: Categorical variables are not inherently numerical, and handling them is a significant challenge in predictive modelling. Efficient methods for encoding these variables, particularly those with high cardinality, are therefore crucial. The literature typically treats prediction as two distinct stages, encoding followed by model training. A novel approach is proposed that jointly optimises the two steps, treating them as a single task. The method preserves model interpretability and removes the need to choose among existing encoding techniques. The embedding is viewed as a non-linear combination of chosen characteristics of the data. For example, for binary classification problems, the counts of positive and negative labels within each category are used, while for regression problems, the average and variance of all entries within that category are used. The resulting numerical representation and the remaining features are used to train the model for predicting the target variable, with the loss being backpropagated to jointly update the embedding of the categorical variables. The behaviour of the proposal is demonstrated through a series of experiments on simulated and real-life data, with promising outcomes.
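
The idea of backpropagating the prediction loss into the encoding can be illustrated with a minimal sketch for the binary classification case. This is not the authors' implementation: the log-scaling of the counts, the tanh encoder, the logistic regression model, the plain gradient descent loop, and all names (`category_stats`, `embed`, `fit_joint`, `simulate`) are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def category_stats(cats, y, n_cats):
    """Per-category counts of positive and negative labels (log-scaled, an assumption)."""
    pos = np.bincount(cats, weights=y, minlength=n_cats)
    neg = np.bincount(cats, weights=1.0 - y, minlength=n_cats)
    return np.log1p(np.stack([pos, neg], axis=1))  # shape (n_cats, 2)

def embed(stats, params):
    """Scalar embedding per category: a non-linear combination of the statistics."""
    return np.tanh(stats @ params["enc_w"] + params["enc_b"])

def fit_joint(cats, x, y, n_cats, lr=0.1, epochs=500, seed=0):
    """Train encoder and logistic model together by gradient descent on the log loss."""
    rng = np.random.default_rng(seed)
    s = category_stats(cats, y, n_cats)[cats]  # statistics for each row, shape (n, 2)
    params = {
        "enc_w": rng.normal(scale=0.1, size=2),            # encoder weights
        "enc_b": 0.0,                                       # encoder bias
        "beta": rng.normal(scale=0.01, size=x.shape[1] + 1),  # model weights, embedding first
        "beta0": 0.0,                                       # model intercept
    }
    losses = []
    for _ in range(epochs):
        # Forward pass: encode the category, then predict from [embedding, other features].
        e = np.tanh(s @ params["enc_w"] + params["enc_b"])
        feats = np.column_stack([e, x])
        p = sigmoid(feats @ params["beta"] + params["beta0"])
        losses.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))

        # Backward pass: the same loss gradient updates the model and the encoder.
        g = (p - y) / len(y)                    # d loss / d logit
        grad_beta = feats.T @ g
        grad_beta0 = g.sum()
        grad_pre = g * params["beta"][0] * (1.0 - e**2)  # chain rule through tanh
        grad_enc_w = s.T @ grad_pre
        grad_enc_b = grad_pre.sum()

        params["beta"] = params["beta"] - lr * grad_beta
        params["beta0"] = params["beta0"] - lr * grad_beta0
        params["enc_w"] = params["enc_w"] - lr * grad_enc_w
        params["enc_b"] = params["enc_b"] - lr * grad_enc_b
    return params, losses

def simulate(n=2000, n_cats=20, seed=0):
    """Toy data: one categorical variable with per-category effects plus one numeric feature."""
    rng = np.random.default_rng(seed)
    cats = rng.integers(0, n_cats, size=n)
    x = rng.normal(size=(n, 1))
    effect = rng.normal(scale=1.5, size=n_cats)
    y = (rng.random(n) < sigmoid(effect[cats] + 0.5 * x[:, 0])).astype(float)
    return cats, x, y

cats, x, y = simulate()
params, losses = fit_joint(cats, x, y, n_cats=20)
print(f"log loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In this sketch the per-category statistics are fixed inputs and only the mapping from statistics to embedding is learned, so the encoder has two weights and a bias regardless of the number of categories. For a regression target, the same structure would apply with the per-category average and variance as statistics and a squared-error loss.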