B0602

Title: Kernel metric learning for variable relevancy in mixed-type data clustering via maximum-similarity cross-validation
Authors: John Thompson - University of British Columbia (Canada) [presenting]
Jesse Ghashti - University of British Columbia (Canada)
Abstract: Distance-based clustering and classification are widely used to group mixed numeric and categorical data, and they require a predefined metric that compares data points by their dissimilarity. Numerous metrics exist for data with numerical, ordered categorical, and unordered categorical attributes, but an optimal distance for mixed-type data remains an open problem because current methods may not balance the data types accurately when measuring distance. Many metrics either convert numerical attributes to categorical ones (or vice versa) so that the data can be handled as a single attribute type, or calculate a distance for each attribute separately and sum the results. A metric is proposed that uses mixed-type kernels to measure dissimilarity, with optimal kernel bandwidths selected by maximum-similarity cross-validation to determine variable relevancy for the dissimilarity. The proposed metric is shown to improve the accuracy of distance-based clustering algorithms on simulated and real-world datasets containing continuous, categorical, and mixed-type data. The method is applied to clustering mixed-type financial trading data and survey data to discover similarities in investor trading behaviour and to investigate the financial wellness of groups of Canadians.
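
To make the link between kernel bandwidths and variable relevancy concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the kernels or the exact form of the dissimilarity, so everything here is an assumption chosen for illustration: a Gaussian kernel for numeric variables, an Aitchison-Aitken kernel for unordered categorical variables, a product kernel across variables, and the dissimilarity k(x,x) + k(y,y) - 2k(x,y). The bandwidths are set by hand; the maximum-similarity cross-validation step that would select them is not reproduced. All function names are hypothetical.

```python
import math

def gaussian_kernel(a, b, h):
    """Numeric kernel, scaled so that k(a, a) = 1 (assumed choice)."""
    return math.exp(-0.5 * ((a - b) / h) ** 2)

def aitchison_aitken_kernel(a, b, lam, n_levels):
    """Unordered categorical kernel; lam ranges over [0, (c - 1) / c]."""
    return 1.0 - lam if a == b else lam / (n_levels - 1)

def product_kernel(x, y, spec):
    """Multiply per-variable kernels.

    spec holds one entry per variable: ("num", h) or ("cat", lam, n_levels).
    """
    k = 1.0
    for xi, yi, s in zip(x, y, spec):
        if s[0] == "num":
            k *= gaussian_kernel(xi, yi, s[1])
        else:
            k *= aitchison_aitken_kernel(xi, yi, s[1], s[2])
    return k

def kernel_dissimilarity(x, y, spec):
    """Assumed dissimilarity: k(x,x) + k(y,y) - 2 k(x,y); zero when x == y."""
    return (product_kernel(x, x, spec) + product_kernel(y, y, spec)
            - 2.0 * product_kernel(x, y, spec))

# Two points that differ only in the numeric variable.
x, y = (1.0, "a"), (3.0, "a")

# A small numeric bandwidth makes the numeric variable matter.
relevant = [("num", 1.0), ("cat", 0.1, 3)]
# A very large numeric bandwidth smooths the variable out of the metric.
irrelevant = [("num", 1e6), ("cat", 0.1, 3)]

d_relevant = kernel_dissimilarity(x, y, relevant)
d_irrelevant = kernel_dissimilarity(x, y, irrelevant)
```

Under these assumptions, `d_irrelevant` is close to zero while `d_relevant` is not: a bandwidth driven to its upper limit makes the kernel flat in that variable, so the variable stops contributing to the dissimilarity. The same holds for the categorical kernel when `lam` reaches (c - 1) / c, which is how a data-driven bandwidth can act as a relevancy weight.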