A0522
Title: A unified approach to outlier identification for mixed type data
Authors: Efthymios Costa - Imperial College London (United Kingdom)
Christian Hennig - University of Bologna (Italy) [presenting]
Abstract: An approach for identifying outliers in data with both continuous and ordinal variables is presented, with a possible extension to nominal categorical variables. The approach relies on robust Mahalanobis distances (computed with FastMCD) and defines outliers as observations that lie in low-probability regions relative to a multivariate Gaussian distribution. To unify the contribution of continuous and ordinal variables to the robust Mahalanobis distance and to the definition of outliers, each ordinal variable is modeled as arising from thresholding its own latent Gaussian variable. Polychoric and polyserial correlations are used to estimate the covariance matrix of the underlying multivariate Gaussian distribution. Singularity issues can arise, particularly when a single category of a variable contains a large percentage of the observations, which may force FastMCD to estimate the variance of that variable as zero. For this reason, the covariance matrix is regularized. Nominal categorical variables can be incorporated by introducing dummy variables for their categories.
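The latent-Gaussian view of an ordinal variable can be illustrated with a minimal sketch. It assumes a standard normal latent variable and estimates the thresholds as normal quantiles of the cumulative category proportions, which is a common first step before fitting polychoric or polyserial correlations; it is not necessarily the authors' procedure. The function name `latent_thresholds` and the clipping constant `eps` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from statistics import NormalDist


def latent_thresholds(values, categories, eps=1e-6):
    """Estimate thresholds of a latent standard Gaussian for one ordinal variable.

    `categories` lists the ordered category labels. The k-th threshold is the
    standard normal quantile of the cumulative proportion of the first k
    categories. `eps` (an illustrative assumption) keeps proportions away from
    0 and 1, where the quantile is infinite.
    """
    n = len(values)
    counts = Counter(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    thresholds = []
    cumulative = 0
    # The last category has no upper threshold, so it is skipped.
    for category in categories[:-1]:
        cumulative += counts[category]
        p = min(max(cumulative / n, eps), 1.0 - eps)
        thresholds.append(std_normal.inv_cdf(p))
    return thresholds


# Hypothetical ordinal sample with three ordered categories.
example = [1] * 25 + [2] * 50 + [3] * 25
taus = latent_thresholds(example, categories=[1, 2, 3])
print(taus)
```

With cumulative proportions of 0.25 and 0.75, the two thresholds are symmetric around zero. If one category held nearly all observations, the remaining thresholds would be pushed far into the tails, which relates to the near-degenerate situation the abstract describes as a motivation for regularizing the covariance matrix.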