CFE 2019 - CMStatistics
Title: Analysing administrative data using logistic regression modelling
Authors:
Maria de Fatima Salgueiro - Instituto Universitario de Lisboa (ISCTE-IUL) and Business Research Unit (BRU-IUL) (Portugal) [presenting]
Marcel Vieira - Universidade Federal de Juiz de Fora - Department of Statistics - Juiz de Fora (Brazil)
Peter W F Smith - University of Southampton - Southampton Statistical Sciences Research Institute (United Kingdom)
Abstract: A binary logistic regression model was estimated with large real register data, using population and sample values from CadUnico, a Brazilian administrative data source used to select low-income families for the anti-poverty Bolsa Familia programme. The target population includes over 27 million families. Samples were selected by alternative probability sampling designs, namely simple random sampling and stratified simple random sampling with equal, proportional and optimum allocation in the strata. Different sampling fractions were considered. A binary logistic regression model was estimated to explain the probability of a family receiving the benefit as a function of 16 covariates. In total, 31 parameters were estimated, and 324 seconds were required to achieve convergence (i5 processor, 16 GB of RAM). Probability samples of $1\%$ and $5\%$ were selected and the chosen population model was estimated. A forward selection procedure was considered to include covariates in the model. Results suggest that $5\%$ samples are enough to reproduce the odds ratio structure of the chosen population model, especially when simple random sampling or stratified simple random sampling with proportional allocation was adopted. Moreover, the adoption of sampling procedures led to a considerable reduction in computational time (down to 19 seconds for the $5\%$ simple random sample), allowing for a faster modelling decision-making process with a standard personal computer.
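The three stratified allocation schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that "optimum allocation" means the textbook Neyman allocation (stratum sample size proportional to stratum size times within-stratum standard deviation), which the abstract does not confirm. The function names, stratum sizes and standard deviations are hypothetical and are not taken from CadUnico.

```python
def equal_allocation(n, strata_sizes):
    """Split a total sample of size n equally across the strata."""
    return [n / len(strata_sizes) for _ in strata_sizes]


def proportional_allocation(n, strata_sizes):
    """Allocate n in proportion to each stratum's population size N_h."""
    population = sum(strata_sizes)
    return [n * size / population for size in strata_sizes]


def neyman_allocation(n, strata_sizes, strata_sds):
    """Allocate n in proportion to N_h * S_h (Neyman allocation).

    Assumption: this stands in for the 'optimum allocation' of the abstract.
    """
    weights = [size * sd for size, sd in zip(strata_sizes, strata_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical toy population of 10,000 units in three strata.
sizes = [1000, 3000, 6000]
sds = [0.5, 0.3, 0.1]          # hypothetical within-stratum standard deviations
n = round(0.05 * sum(sizes))   # a 5% sampling fraction gives n = 500

print(equal_allocation(n, sizes))
print(proportional_allocation(n, sizes))
print(neyman_allocation(n, sizes, sds))
```

With these toy inputs, proportional allocation assigns 50, 150 and 300 units to the three strata, while the Neyman rule assigns roughly 125, 225 and 150, shifting sample towards the smaller but more variable strata. The allocations are returned unrounded; a real design would round them and cap each at the stratum size.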