A1039
Title: Explainable ensemble clustering through mutual information, with applications on high dimensional data
Authors: Federico Maria Quetti - University of Pavia (Italy) [presenting]
Elena Ballante - Department of Political and Social Sciences, University of Pavia (Italy)
Paolo Giudici - University of Pavia (Italy)
Silvia Figini - University of Pavia (Italy)
Abstract: Unsupervised learning techniques aim to uncover the intrinsic structure of data, with clustering being the process of grouping similar points together. A common limitation of many machine learning methods is their lack of explainability, as the process often operates as a black box. In clustering, most methods offer limited interpretability, providing little insight into which features drive the grouping, especially in high-dimensional settings. To address this limitation, a bagging-based clustering approach incorporating feature dropout is proposed, analogous to the supervised random forest methodology, aimed at decorrelating features in the partitioning steps. The involvement of each feature in the clustering is ranked using an index based on information theory. In each step, the mutual information $I(X;Y) = H(X)-H(X|Y)$ (where $H(X)$ is the Shannon entropy and $H(X|Y)$ the conditional entropy) between each feature involved and the estimated label obtained by the partitioning algorithm is evaluated. Then, an aggregated estimate is produced, weighting each step's contribution by a clustering validity index (e.g., Dunn, Silhouette) to emphasize well-formed partitions. Results are presented on simulated and real datasets, with applications in medicine.
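As an illustration only, the two quantities described in the abstract can be sketched in plain Python: the mutual information $I(X;Y) = H(X)-H(X|Y)$ between a feature and the estimated cluster labels, and a validity-weighted aggregate over ensemble steps. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the equal-width binning of continuous features, and the normalization by the summed weights of the steps in which a feature appears are all hypothetical choices; the abstract does not specify them. Weights are assumed non-negative (a Silhouette score, for instance, would need rescaling first).

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X), in nats, estimated from a discrete sample."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X|Y) = sum over y of p(y) * H(X | Y = y)."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) - H(X|Y) for a discrete feature x and labels y."""
    return entropy(x) - conditional_entropy(x, y)


def discretize(column, n_bins=4):
    """Equal-width binning of a continuous feature.

    Assumption: the abstract does not say how continuous features are
    handled; binning is one simple option among several.
    """
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in column]


def aggregate_importance(steps, n_features):
    """Validity-weighted mean of per-step mutual information.

    steps: iterable of (feature_indices, mi_values, weight), one entry per
    ensemble step, where weight is a non-negative validity index of that
    step's partition (hypothetical layout, chosen for illustration).
    Features never sampled in any step receive 0.0.
    """
    num = [0.0] * n_features
    den = [0.0] * n_features
    for feature_indices, mi_values, weight in steps:
        for j, mi in zip(feature_indices, mi_values):
            num[j] += weight * mi
            den[j] += weight
    return [num[j] / den[j] if den[j] > 0 else 0.0 for j in range(n_features)]
```

In an ensemble loop, each step would draw a bootstrap sample and a random feature subset, run the partitioning algorithm, call `mutual_information(discretize(column), labels)` for every sampled feature, and append the result together with the step's validity index to `steps`. Ranking the output of `aggregate_importance` then gives the feature ordering.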