Title: Influence of missing data on the estimation of the number of components of a PLS regression
Authors: Nicolas Meyer - Universite de Strasbourg (France)
Frederic Bertrand - Universite de Strasbourg (France) [presenting]
Myriam Maumy-Bertrand - Universite de Strasbourg (France)
Abstract: Partial Least Squares regression (PLSR) is a multivariate model whose parameter estimates can be obtained with either of two algorithms, SIMPLS or NIPALS. The NIPALS algorithm has the useful property of providing estimates from incomplete data, and this property has been studied extensively for principal component analysis, for which NIPALS was originally devised. Nevertheless, the literature gives no clear indication of the amount and patterns of missing values that the algorithm can handle in PLSR, or of how reliable the resulting parameter estimates are. We study the behavior of NIPALS, when used to fit PLSR models, for various proportions and patterns of missing data (missing at random or missing completely at random), and compare it with multiple imputation. The tolerance of the NIPALS algorithm to incomplete data sets depends on the sample size, the proportion of missing data, and the chosen component selection method; a proportion of 30$\%$ of missing data can be given as an empirical maximum for reliable estimation of the number of components. Above this value, for every criterion considered except $Q_2$, the estimated number of components in PLSR is far from the true one and may hence lead to misleading conclusions.
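
To illustrate how NIPALS can provide estimates from incomplete data, the following is a minimal sketch for a single response (PLS1). It is not the implementation used in the study. It assumes that missing values occur only in the predictors and are coded as NaN, that every row and every column keeps at least one observed value, and that columns are centered on their observed entries. The function name `nipals_pls1` and its return layout are hypothetical. Each inner product is taken over the available entries only, so each weight, score and loading is a regression slope computed on the observed cells.

```python
import numpy as np

def nipals_pls1(X, y, n_components):
    """PLS1 by NIPALS; NaN cells of X are skipped in every inner product."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - np.nanmean(X, axis=0)        # center each column on its observed values
    y = y - y.mean()
    obs = ~np.isnan(X)                   # True where a cell is observed
    M = obs.astype(float)                # 0/1 indicator used to restrict the sums
    Xz = np.where(obs, X, 0.0)           # missing cells contribute 0 to every sum
    n, p = X.shape
    T = np.zeros((n, n_components))      # scores
    W = np.zeros((p, n_components))      # weights
    P = np.zeros((p, n_components))      # loadings
    c = np.zeros(n_components)           # regression coefficients of y on the scores
    for h in range(n_components):
        # Weight j: slope of column j on y, using the rows where column j is observed.
        w = (Xz.T @ y) / (M.T @ y**2)
        w /= np.linalg.norm(w)
        # Score i: slope of row i on w, using the columns observed in row i.
        t = (Xz @ w) / (M @ w**2)
        # Loading j: slope of column j on t, using the rows where column j is observed.
        load = (Xz.T @ t) / (M.T @ t**2)
        c[h] = (y @ t) / (t @ t)
        # Deflate X on the observed cells only, then deflate y.
        Xz = np.where(obs, Xz - np.outer(t, load), 0.0)
        y = y - c[h] * t
        T[:, h], W[:, h], P[:, h] = t, w, load
    return T, W, P, c
```

With complete data the sums run over all entries and the sketch reduces to the usual NIPALS recursions for PLS1. With missing cells the scores are no longer exactly orthogonal, which is one reason the reliability of the estimates under missingness needs to be studied empirically.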