Title: Clustering categorical data using word embedding methods
Authors: Yeojin Chung - Kookmin University (Korea, South) [presenting]
Abstract: Clustering continuous data under Euclidean distance has been studied extensively with parametric and nonparametric statistical methods. However, these methods do not generalize directly to categorical data. In particular, clustering categorical attributes with high cardinality suffers from the curse of dimensionality. We propose converting nominal data into numerical data using word embedding methods such as CBOW or skip-gram, which were originally developed for natural language models. With this procedure, each level of a categorical attribute is represented in a real vector space, where similar (in some sense) categories are located closer together. Well-developed clustering algorithms for continuous data can then be applied to the vectorized categorical data. We compare this approach with existing algorithms for clustering categorical data, such as k-medoids and k-modes.
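
Illustrative sketch (not the authors' implementation): the pipeline described in the abstract could look roughly like the following, using only the Python standard library. Several choices here are assumptions made for illustration and do not come from the abstract: each record is treated as a "sentence" whose "words" are its attribute=level tokens, every other level in the same record counts as context, the skip-gram model uses a full softmax rather than negative sampling, a record is represented by the average of its level vectors, and plain k-means serves as the continuous clustering step. The toy data and the function names train_skipgram and kmeans are hypothetical.

```python
import math
import random


def train_skipgram(rows, dim=4, epochs=200, lr=0.05, seed=0):
    """Learn one vector per category level with a tiny full-softmax skip-gram."""
    rng = random.Random(seed)
    vocab = sorted({tok for row in rows for tok in row})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_vocab = len(vocab)
    # Input and output embedding tables; only the input table is returned.
    w_in = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in range(n_vocab)]
    w_out = [[0.0] * dim for _ in range(n_vocab)]
    # Assumption: every other level in the same record is a context "word".
    pairs = [(index[c], index[o]) for row in rows for c in row for o in row if c != o]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for c, o in pairs:
            h = w_in[c]
            scores = [sum(hk * wk for hk, wk in zip(h, w_out[j])) for j in range(n_vocab)]
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            grad_h = [0.0] * dim
            for j in range(n_vocab):
                # Softmax probability minus the one-hot target.
                err = exps[j] / z - (1.0 if j == o else 0.0)
                for k in range(dim):
                    grad_h[k] += err * w_out[j][k]
                    w_out[j][k] -= lr * err * h[k]
            for k in range(dim):
                h[k] -= lr * grad_h[k]
    return {tok: w_in[index[tok]] for tok in vocab}


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy records; each token is "attribute=level".
rows = [
    ["color=red", "shape=circle", "size=small"],
    ["color=red", "shape=circle", "size=medium"],
    ["color=orange", "shape=circle", "size=small"],
    ["color=blue", "shape=square", "size=large"],
    ["color=blue", "shape=square", "size=medium"],
    ["color=navy", "shape=square", "size=large"],
]

vectors = train_skipgram(rows)
# Assumption: a record is represented by the mean of its level vectors.
record_vecs = [
    [sum(col) / len(row) for col in zip(*(vectors[t] for t in row))]
    for row in rows
]
labels = kmeans(record_vecs, k=2)
print(labels)
```

In practice an established word2vec implementation and a library clustering routine would likely replace both hand-written functions; the sketch only shows how category levels become vectors that a continuous clustering algorithm can consume.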