CMStatistics 2023
B1136
Title: Measurement error and misclassification in clustering algorithms for mixed-type data
Authors: Valentina Veronesi - University of Milan-Bicocca; University at Buffalo (United States) [presenting]
Marianthi Markatou - University at Buffalo (United States)
Abstract: Addressing the challenge of clustering mixed-type data, the study compares the robustness of the KAMILA, PDQ, k-prototypes, HyDaP, and Modha-Spangler algorithms in the presence of measurement error and misclassification (MEM). Moreover, recognizing the need for additional methods for mixed-type data, two extensions are proposed. The first is an adaptation of the average silhouette width (ASW) algorithm proposed in a prior study. The second, to be discussed time permitting, is an extension of the KAMILA algorithm to mixed-type data affected by MEM, based on a deconvolution process that aims to separate the true data from the error components for both continuous and categorical variables. The performance of the existing algorithms and the effectiveness of the extended KAMILA and ASW algorithms are assessed through simulations and a real-world data application. The simulation study covers a wide variety of scenarios, including errors affecting continuous and/or categorical variables, different degrees of correlation, and different amounts of information conveyed by the variables. The benchmark analysis is intended to be thorough and complete, and evaluation metrics beyond the commonly used Adjusted Rand Index are employed. The study provides comprehensive guidelines to help users match the choice of clustering algorithm to the characteristics of their data and the MEM present.
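
As an illustration of the kind of setup the abstract describes, the following minimal sketch contaminates a continuous variable with additive measurement error, flips categorical labels at a fixed misclassification rate, and computes the Adjusted Rand Index between two partitions. It is not the authors' simulation design: the function names, the Gaussian error model, the uniform misclassification scheme, and all parameter values are assumptions made only for this example.

```python
import random
from collections import Counter
from math import comb


def add_measurement_error(values, sd, rng):
    """Classical additive error (assumed model): observed = true + N(0, sd^2)."""
    return [v + rng.gauss(0.0, sd) for v in values]


def misclassify(labels, levels, rate, rng):
    """With probability `rate`, replace a label by a different level chosen uniformly."""
    out = []
    for lab in labels:
        if rng.random() < rate:
            out.append(rng.choice([lv for lv in levels if lv != lab]))
        else:
            out.append(lab)
    return out


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two partitions given as label sequences (needs n >= 2)."""
    n = len(a)
    # Pair counts within the contingency table cells and within each margin.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical two-cluster data set with one continuous and one categorical variable.
rng = random.Random(0)
true_cluster = [0] * 50 + [1] * 50
x = [rng.gauss(0.0 if c == 0 else 4.0, 1.0) for c in true_cluster]
g = ["a" if c == 0 else "b" for c in true_cluster]

x_obs = add_measurement_error(x, sd=1.0, rng=rng)
g_obs = misclassify(g, levels=["a", "b"], rate=0.1, rng=rng)

# Agreement between the error-prone categorical variable and the true clusters.
print(adjusted_rand_index(true_cluster, g_obs))
```

In a benchmark of this type, `x_obs` and `g_obs` would be passed to each clustering algorithm, and the resulting partition compared with `true_cluster` for increasing values of `sd` and `rate`.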