View Submission - COMPSTAT2023
A0249
Title: Clustering for category variables in linear regression via generalized fused Lasso
Authors: Mineaki Ohishi - Tohoku University (Japan) [presenting]
Hirokazu Yanagihara - Hiroshima University (Japan)
Abstract: In linear regression, category variables are often used as explanatory variables. There are two types of category variable: a qualitative variable, and one obtained by splitting a quantitative variable into divisions. For the former, the finest available categories are usually used for modeling. The latter is a popular way of modeling values, such as those of real estate, and has the merit that a non-linear structure can be incorporated naturally into a linear model. In this case, the quantitative variable is usually split into divisions based on experience. However, categories that are too fine may cause overfitting and complicate the interpretation of the model, whereas unsuitably clustered categories may degrade the model fit. Hence, it is important to optimize the clustering of categories. To address this, we develop an estimation method that clusters categories via generalized fused Lasso. Starting from categories that are as fine as possible and estimating the parameters of categories with similar effects to be exactly equal, we can expect to obtain the optimal clustering of categories.
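
As an illustration of the idea, a generalized fused Lasso criterion for a single category variable might be written as below. This is a minimal sketch under stated assumptions, not the authors' exact formulation: the symbols $y_i$, $c(i)$, $\beta_j$, $\lambda$, and the pair set $E$ are introduced here for illustration and do not appear in the abstract, and any further explanatory variables or penalty weights are omitted.

```latex
% Assumed setup (hypothetical notation):
%   y_i      response of observation i, i = 1, ..., n
%   c(i)     index in {1, ..., m} of the category that observation i belongs to
%   \beta_j  effect of category j
%   E        set of category pairs whose effects may be fused
%   \lambda  non-negative tuning parameter
\[
  \hat{\beta}
  = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{m}}
    \left\{
      \frac{1}{2} \sum_{i=1}^{n} \bigl( y_i - \beta_{c(i)} \bigr)^{2}
      + \lambda \sum_{(j,k) \in E} \lvert \beta_j - \beta_k \rvert
    \right\}
\]
% One plausible choice of E (an assumption):
%   split quantitative variable: adjacent divisions, E = {(j, j+1) : j = 1, ..., m-1}
%   qualitative variable:        all pairs,          E = {(j, k) : 1 <= j < k <= m}
```

Because the penalty is an absolute value of differences, a sufficiently large $\lambda$ makes some estimates satisfy $\hat{\beta}_j = \hat{\beta}_k$ exactly, and categories sharing a common estimate can then be read as one cluster.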