A0482
Title: Outlier detection in mixed data
Authors: Houda Gadacha - CNAM Paris (France) [presenting]
Patricia Kubicki - UTAC (France)
Ndeye Niang - CNAM (France)
Abstract: Outlier detection is crucial in fields such as insurance fraud detection, disease detection, and cybersecurity, where it helps to identify suspicious behavior and to make statistical models more robust. Most outlier detection methods are designed exclusively for numerical data. To detect outliers in data containing both numerical and categorical attributes, factor analysis for mixed data (FAMD) is proposed as a way to extract numerical components, which are then used for outlier detection. Outlier detection methods are applied in turn to the first components only, the last components only, and all FAMD components. Using simulated data, the results are compared with those of a traditional one-hot encoding (OHE) preprocessing approach. The simulated data include four outlier types: (a) global outliers, which deviate significantly from most data points; (b) local outliers, which are not necessarily extreme values but are abnormal within their specific context or neighborhood; (c) rare outliers, which have unexpected categories compared to the typical data distribution; and (d) mixed outliers, which can be both global and rare, or local and rare. The objective is to determine which method is most effective in terms of the outlier types detected. The results demonstrate the effectiveness of the proposed approach.
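The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract names neither an implementation nor a specific detector, so everything below is an assumption made for illustration: the function names `famd_components` and `outlier_scores` are hypothetical, FAMD is written out with NumPy in its usual formulation (standardized numerical columns, indicator columns divided by the square root of the category proportion, then PCA), and a Mahalanobis-type score on the retained components stands in for whichever detectors the authors actually used. The toy data contain a single planted global outlier.

```python
import numpy as np


def famd_components(X_num, X_cat):
    """Return FAMD principal coordinates and eigenvalues (illustrative sketch).

    X_num: (n, p) array of numerical attributes.
    X_cat: (n, q) array of category labels.
    """
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat)
    n = X_num.shape[0]

    # Numerical block: zero mean, unit variance.
    Z_num = (X_num - X_num.mean(axis=0)) / X_num.std(axis=0)

    # Categorical block: indicator columns scaled by 1/sqrt(proportion),
    # then centered, so that rare categories receive more weight.
    blocks = []
    for j in range(X_cat.shape[1]):
        levels = np.unique(X_cat[:, j])
        ind = (X_cat[:, j][:, None] == levels[None, :]).astype(float)
        scaled = ind / np.sqrt(ind.mean(axis=0))
        blocks.append(scaled - scaled.mean(axis=0))

    Z = np.hstack([Z_num] + blocks)

    # PCA of the combined table via SVD; drop numerically null components.
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    eig = s ** 2 / n
    keep = eig > 1e-10 * eig.max()
    return U[:, keep] * s[keep], eig[keep]


def outlier_scores(coords, eig, which="all", k=2):
    """Mahalanobis-type score on the first k, last k, or all components.

    This detector is an assumption; the abstract does not specify one.
    """
    if which == "first":
        idx = slice(0, k)
    elif which == "last":
        idx = slice(-k, None)
    else:
        idx = slice(None)
    return np.sum(coords[:, idx] ** 2 / eig[idx], axis=1)


# Toy mixed data: 200 regular points plus one planted global outlier.
rng = np.random.default_rng(0)
X_num = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
X_cat = np.vstack([rng.choice(["a", "b"], size=(200, 1)), [["a"]]])

coords, eig = famd_components(X_num, X_cat)
scores_all = outlier_scores(coords, eig, which="all")
scores_first = outlier_scores(coords, eig, which="first", k=2)
scores_last = outlier_scores(coords, eig, which="last", k=1)
print("Most anomalous index (all components):", int(np.argmax(scores_all)))
```

An OHE baseline in the same spirit would score the unscaled indicator columns alongside the standardized numerical ones. Comparing its ranking with the three FAMD variants on each simulated outlier type is the kind of experiment the abstract describes.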