CMStatistics 2022
B1908
Title: Generalized linear models for massive data via sketching
Authors: Jason Hou-Liu - University of Waterloo (Canada) [presenting]
Ryan Browne - University of Waterloo (Canada)
Abstract: Generalized linear models are a popular analytics tool with interpretable results and broad applicability, but they require iterative estimation procedures that impose data transfer and computational costs that can be problematic under some infrastructure constraints. We propose a doubly-sketched stochastic approximation to the iteratively re-weighted least squares algorithm to estimate a variety of generalized linear models using a sequence of sketched surrogate datasets. A uniform sketch reduces data transfer costs, and a subsequent Clarkson-Woodruff sketch reduces local computation costs, yielding substantial wall-clock time savings. Regression coefficients and standard errors are produced, with comparison against single subsample and literature methods. Some theoretical properties of the proposed procedure are shown, with empirical results from simulated and real-world datasets. The efficacy of the proposed method is investigated across a variety of commodity computational infrastructure configurations accessible to practitioners. A highlight of the present work is the estimation of a Poisson-log generalized linear model across 1.67 billion observations on a personal computer in 25 minutes.
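To make the doubly-sketched idea concrete, the following is a minimal in-memory sketch in Python with NumPy. It is not the authors' implementation. It illustrates one plausible reading of the abstract for the Poisson-log case: at each iteration a uniform subsample is drawn (standing in for the data-transfer sketch), a Clarkson-Woodruff (CountSketch) transform compresses the weighted least squares problem, and the resulting solution is blended into the running estimate with a decreasing step size. The function names `countsketch` and `sketched_irls_poisson`, the parameters `sub`, `m`, and `n_iter`, and the step-size schedule `min(1, 5/t)` are all assumptions chosen for illustration. Standard errors are not computed here.

```python
import numpy as np


def countsketch(A, m, rng):
    """Clarkson-Woodruff (CountSketch) transform of the rows of A.

    Each row is multiplied by a random sign and added to one of m
    randomly chosen output rows, giving an (m x d) sketch of A.
    """
    n = A.shape[0]
    buckets = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    SA = np.zeros((m, A.shape[1]))
    # Unbuffered add so rows that share a bucket accumulate correctly.
    np.add.at(SA, buckets, signs[:, None] * A)
    return SA


def sketched_irls_poisson(X, y, n_iter=100, sub=10_000, m=1_000, seed=0):
    """Illustrative doubly-sketched IRLS for a Poisson-log GLM.

    Assumed scheme (not taken from the source): uniform subsample,
    then CountSketch, then a damped update of the coefficients.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sub = min(sub, n)
    beta = np.zeros(p)
    for t in range(1, n_iter + 1):
        # First sketch: uniform subsample of the rows.
        idx = rng.choice(n, size=sub, replace=False)
        Xs, ys = X[idx], y[idx]

        # Standard IRLS quantities for the Poisson model with log link.
        eta = Xs @ beta
        mu = np.exp(eta)
        w = mu                      # working weights
        z = eta + (ys - mu) / mu    # working response

        # Second sketch: compress the weighted design and response together
        # so that both share the same hash buckets and signs.
        sw = np.sqrt(w)
        A = np.column_stack([sw[:, None] * Xs, sw * z])
        SA = countsketch(A, m, rng)

        # Solve the small sketched least squares problem.
        beta_new, *_ = np.linalg.lstsq(SA[:, :p], SA[:, p], rcond=None)

        # Stochastic-approximation step: full steps early, damped later
        # (hypothetical schedule chosen for this illustration).
        step = min(1.0, 5.0 / t)
        beta = beta + step * (beta_new - beta)
    return beta
```

Calling `sketched_irls_poisson(X, y)` on a design matrix `X` (with an intercept column) and a count response `y` returns a coefficient vector. In this sketch, only `sub` rows are touched per iteration and the least squares solve operates on `m` rows, which mirrors the split between transfer cost and local computation cost described in the abstract.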