CMStatistics 2023
B0928
Title: Clustering of categorical data via mutual information
Authors: Noemi Corsini - University of Padua (Italy) [presenting]
Giovanna Menardi - University of Padova (Italy)
Abstract: Although clustering is an ill-posed task, there is broad consensus on how to define clusters in the continuous setting, where the idea of similarity between subjects finds, to a greater or lesser extent, well-grounded counterparts in the notions of density and distance. With categorical data, by contrast, the lack of a total order among categories makes even the notion of distance somewhat controversial, and the resulting arbitrariness of the clustering target ultimately undermines the soundness of the associated methods. A novel notion of cluster is discussed that complies with natural intuition and relies on two concepts: high frequency and association between variables. Groups are defined as highly populated aggregations of cross-categories of the observed variables that make a large contribution to the mutual information. The first concept, high frequency, agrees with the notion of cluster in the modal formulation of the clustering problem, and this link is exploited by borrowing some of its operational tools. The proposed procedure jointly extends, conceptually if not formally, the ideas of connected sets, gradient ascent, and density that are typical of the nonparametric clustering setting.
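To make the phrase "contribution of mutual information" concrete, the following minimal sketch decomposes the empirical mutual information of two categorical variables into one term per cross-category, p(x,y) log[p(x,y) / (p(x) p(y))]. It is an illustration of that standard decomposition only, not the authors' clustering procedure; the function name `mi_contributions` and the toy data are assumptions introduced here.

```python
from collections import Counter
from math import log


def mi_contributions(pairs):
    """Per-cell terms of the empirical mutual information (natural log).

    `pairs` is a list of (x, y) observations of two categorical variables.
    Hypothetical helper for illustration; unobserved cells contribute 0
    and are omitted from the result.
    """
    n = len(pairs)
    joint = Counter(pairs)                 # counts of each cross-category
    px = Counter(x for x, _ in pairs)      # marginal counts of the first variable
    py = Counter(y for _, y in pairs)      # marginal counts of the second variable
    contrib = {}
    for (x, y), count in joint.items():
        p_xy = count / n
        contrib[(x, y)] = p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return contrib


# Toy data (made up): two frequent, strongly associated cells and two rare ones.
pairs = ([("a", "u")] * 40 + [("b", "v")] * 40
         + [("a", "v")] * 10 + [("b", "u")] * 10)

contrib = mi_contributions(pairs)
mutual_information = sum(contrib.values())  # the terms sum to the total MI

# Rank cross-categories by their contribution, largest first.
ranked = sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy table the two highly populated cells, ("a", "u") and ("b", "v"), carry positive contributions, while the sparse cells contribute negatively, which matches the intuition of groups as frequent cross-categories with strong association.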