CMStatistics 2022
B1618
Title: Microclustering for record linkage applications
Authors: Brenda Betancourt - NORC at the University of Chicago (United States) [presenting]
Abstract: In database management, record linkage aims to identify multiple records that correspond to the same individual. It can be treated as a clustering problem in which one or more noisy database records are associated with a unique latent entity. In contrast to traditional clustering applications, a large number of clusters with only a few observations each is expected in this context. Hence, two new classes of prior distributions, based on exchangeable sequences of clusters and on allelic partitions, are proposed for the small-cluster setting of record linkage. The proposed priors facilitate the introduction of information about the cluster size distribution at different scales and naturally enforce sublinear growth of the maximum cluster size, known as the microclustering property. In addition, a set of novel microclustering conditions is introduced to impose further constraints on the cluster sizes a priori. The performance of the proposed classes of priors is evaluated using simulated data and official statistics data sets.
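Illustration: the microclustering property mentioned in the abstract concerns the ratio of the largest cluster size to the total number of records, which should shrink toward zero as the number of records grows. The sketch below is not the priors proposed in the talk. It is a minimal toy comparison, under assumptions of our own, between a Chinese restaurant process (a standard clustering prior whose largest cluster is known to grow linearly with the number of records) and a hypothetical partition whose cluster sizes are capped at a small constant. The function names `crp_cluster_sizes`, `small_cluster_sizes`, and `max_cluster_fraction`, and the parameters `alpha` and `max_size`, are illustrative and do not come from the abstract.

```python
import random


def crp_cluster_sizes(n, alpha, rng):
    """Sample cluster sizes for n records from a Chinese restaurant process."""
    sizes = []
    for i in range(n):
        # Record i opens a new cluster with probability alpha / (i + alpha);
        # otherwise it joins an existing cluster chosen proportionally to size.
        if rng.random() < alpha / (i + alpha):
            sizes.append(1)
        else:
            k = rng.choices(range(len(sizes)), weights=sizes)[0]
            sizes[k] += 1
    return sizes


def small_cluster_sizes(n, max_size, rng):
    """Toy partition (an assumption, not the proposed prior): draw cluster
    sizes uniformly from 1..max_size until all n records are assigned."""
    sizes = []
    remaining = n
    while remaining > 0:
        s = min(rng.randint(1, max_size), remaining)
        sizes.append(s)
        remaining -= s
    return sizes


def max_cluster_fraction(sizes):
    """Largest cluster size divided by the total number of records."""
    return max(sizes) / sum(sizes)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 1000, 5000):
        crp = max_cluster_fraction(crp_cluster_sizes(n, 1.0, rng))
        small = max_cluster_fraction(small_cluster_sizes(n, 4, rng))
        print(n, round(crp, 3), round(small, 4))
```

In the capped partition the ratio is at most `max_size / n` by construction, so it vanishes as `n` grows, whereas the ratio under the Chinese restaurant process typically stays bounded away from zero. This is the kind of behaviour that motivates priors designed specifically for the small-cluster setting.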