A0690
Title: Modeling with categorical features via exact fusion and sparsity regularization
Authors: Peter Radchenko - University of Sydney (Australia) [presenting]
Abstract: We study high-dimensional linear regression with categorical predictors that have many levels. We propose a new estimation approach that compresses the model through two mechanisms applied simultaneously: (a) clustering of the regression coefficients, which collapses some of the categorical levels together, and (b) sparsity of the regression coefficients. The estimator is formulated as the solution to a mixed integer program, and ways to speed up the computation are discussed. A fast approximate algorithm is also presented that obtains high-quality feasible solutions via block coordinate descent; its main building block is an exact solver for the univariate case. New theoretical guarantees are established for both the prediction and the cluster recovery performance of the estimator. Numerical experiments on synthetic and real datasets demonstrate that the proposed estimator tends to outperform state-of-the-art methods.
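As a rough illustration of why a univariate fusion problem can be solved exactly, the sketch below handles a simplified setting that is an assumption of this example and not the authors' formulation: a single categorical predictor, squared-error loss, and a penalty of `lam` for each distinct coefficient value, with the sparsity term left out. Under these assumptions the loss depends on the data only through the per-level response means and counts, the optimal clusters are contiguous once the levels are sorted by mean, and a dynamic program over the sorted levels finds the global minimizer. The function name `fuse_levels` and its arguments are hypothetical.

```python
import numpy as np

def fuse_levels(level_means, level_counts, lam):
    """Minimize sum_k n_k * (m_k - b_k)**2 + lam * (number of distinct b values).

    Simplified, assumed objective for illustration only (no sparsity term).
    level_counts must be strictly positive.
    """
    means = np.asarray(level_means, dtype=float)
    counts = np.asarray(level_counts, dtype=float)
    order = np.argsort(means, kind="stable")  # optimal clusters are contiguous in this order
    m, n = means[order], counts[order]
    K = len(m)

    # Prefix sums give the weighted SSE of any contiguous segment in O(1).
    w = np.concatenate([[0.0], np.cumsum(n)])
    s = np.concatenate([[0.0], np.cumsum(n * m)])
    q = np.concatenate([[0.0], np.cumsum(n * m * m)])

    def sse(i, j):
        # Weighted SSE around the segment mean for sorted positions i..j-1.
        W, S, Q = w[j] - w[i], s[j] - s[i], q[j] - q[i]
        return Q - S * S / W

    # best[j]: minimal penalized cost for the first j sorted levels.
    best = np.full(K + 1, np.inf)
    best[0] = 0.0
    cut = np.zeros(K + 1, dtype=int)
    for j in range(1, K + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + lam
            if cost < best[j]:
                best[j], cut[j] = cost, i

    # Backtrack: each segment's coefficient is its weighted mean.
    fitted = np.empty(K)
    j = K
    while j > 0:
        i = cut[j]
        fitted[i:j] = (s[j] - s[i]) / (w[j] - w[i])
        j = i

    coef = np.empty(K)
    coef[order] = fitted  # restore the original level order
    return coef
```

For example, `fuse_levels([5.0, 1.0, 5.2, 1.1], [10, 10, 10, 10], lam=1.0)` merges the two low-mean levels and the two high-mean levels into two clusters. In a block coordinate descent scheme of the kind the abstract mentions, a solver like this would be applied to one predictor at a time with the others held fixed; how the actual method combines fusion with sparsity is not specified here.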