Title: On the use of clustering in a predictive model
Authors: Christophe Biernacki - Inria (France)
Matthieu Marbac - CREST - ENSAI (France) [presenting]
Mohammed Sedki - Paris-Sud University, Inserm, Pasteur, UVSQ (France)
Vincent Vandewalle - Inria (France)
Abstract: Many datasets in biostatistics contain sets of variables that make it possible to evaluate unobserved traits of the subjects (e.g., we ask how many pizzas, hamburgers, chips, etc. are eaten in order to assess how healthy the subjects' food habits are). Moreover, we often want to measure the relationships between these unobserved traits and some target variables (e.g., obesity). Thus, a two-step procedure is often used: first, the observations are clustered on each set of variables related to the same topic; second, the predictive model is fitted by plugging in the estimated partitions as covariates. In general, the estimated partitions are not exactly equal to the true ones. We investigate the impact of these measurement errors on the estimators of the regression parameters, and we explain when this two-step procedure is consistent. We also present a specific EM algorithm which simultaneously estimates the parameters of the clustering and predictive models.
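
The two-step procedure described in the abstract can be illustrated with a minimal simulation. This is only a sketch under strong simplifying assumptions and is not the authors' method: a single binary latent trait, one Gaussian observed variable, a linear target, and a basic one-dimensional 2-means clustering swapped in for the clustering step. All names (simulate, two_means_1d, slope) and all parameter values are hypothetical.

```python
import random
import statistics

def simulate(n, separation, beta=2.0, seed=0):
    """Draw a latent binary trait z, one observed variable x, and a target y."""
    rng = random.Random(seed)
    z = [rng.randint(0, 1) for _ in range(n)]
    # x carries information about z; larger separation means easier clustering.
    x = [rng.gauss(separation * zi, 1.0) for zi in z]
    # y depends on the latent trait only (assumed linear model, slope beta).
    y = [beta * zi + rng.gauss(0.0, 1.0) for zi in z]
    return z, x, y

def two_means_1d(x, n_iter=50):
    """Step 1: basic 2-means on a single variable; label 1 is the upper cluster."""
    lo, hi = min(x), max(x)
    for _ in range(n_iter):
        cut = (lo + hi) / 2
        lo = statistics.fmean(v for v in x if v <= cut)
        hi = statistics.fmean(v for v in x if v > cut)
    cut = (lo + hi) / 2
    return [1 if v > cut else 0 for v in x]

def slope(labels, y):
    """Step 2: least-squares slope of y on a binary covariate,
    which equals the difference between the two group means."""
    y1 = [yi for li, yi in zip(labels, y) if li == 1]
    y0 = [yi for li, yi in zip(labels, y) if li == 0]
    return statistics.fmean(y1) - statistics.fmean(y0)

# Poorly separated clusters: the estimated partition often differs from z.
z_weak, x_weak, y_weak = simulate(n=5000, separation=1.0)
oracle = slope(z_weak, y_weak)                      # uses the true partition
plug_in_weak = slope(two_means_1d(x_weak), y_weak)  # uses the estimated partition

# Well separated clusters: the estimated partition is almost the true one.
z_strong, x_strong, y_strong = simulate(n=5000, separation=6.0)
plug_in_strong = slope(two_means_1d(x_strong), y_strong)

print(f"true partition:         {oracle:.2f}")
print(f"plug-in, overlapping:   {plug_in_weak:.2f}")
print(f"plug-in, well separated: {plug_in_strong:.2f}")
```

In this toy setting, the plug-in slope is pulled toward zero when the clusters overlap and approaches the slope obtained with the true partition when they are well separated, which is the kind of measurement-error effect the abstract refers to.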