Title: Feature selection for sparse mixtures with dependence structure
Authors: Annika Tillander - Linköping University (Sweden) [presenting]
Tetyana Pavlenko - KTH Royal Institute of Technology (Sweden)
Abstract: Including irrelevant features can degrade classification accuracy. For high-dimensional data such as gene expression measurements, few of the features are expected to be relevant for any given classification problem, hence the need to identify the informative ones. This is a challenging task when informative features are rare and weak. Accounting for the dependence between features can improve the chance of identifying the relevant information, which leads to block-wise feature selection. A three-step method is suggested: the first step learns the dependence structure between features, the second estimates a measure of information strength, and the third applies a thresholding procedure. For single-feature selection, Higher Criticism is a well-known thresholding method that is optimally adaptive, i.e. it performs well without knowledge of the sparsity and weakness parameters. This method is extended to handle thresholding for blocks of features. Further, it is compared with other goodness-of-fit tests based on sup-functionals of weighted empirical processes for thresholding. The relevance and benefits of feature selection for classification problems are demonstrated using both simulated and real data.
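As an illustration of the single-feature thresholding step mentioned in the abstract, the following is a minimal sketch of Higher Criticism thresholding in its commonly cited form. It is not the authors' block-wise extension. It assumes one z-score per feature that is standard normal under the null, two-sided p-values, and a search restricted to the smallest fraction alpha0 of the p-values. The function name, the default alpha0 = 0.10, and the simulated data are hypothetical choices made for this example.

```python
import math
import numpy as np


def higher_criticism_threshold(z, alpha0=0.10):
    """Sketch of single-feature Higher Criticism thresholding.

    z      : one z-score per feature (assumed N(0, 1) under the null)
    alpha0 : fraction of the smallest p-values searched (assumed default)
    Returns the threshold on |z| and the indices of the selected features.
    """
    z = np.asarray(z, dtype=float)
    n = z.size
    if n < 2 or not (0.0 < alpha0 < 1.0):
        raise ValueError("need at least 2 features and 0 < alpha0 < 1")

    # Two-sided p-values under the standard normal null: P(|Z| >= |z|).
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

    # Sort features from most to least significant.
    order = np.argsort(p)

    # HC objective at each rank i = 1..k_max:
    # sqrt(n) * (i/n - p_(i)) / sqrt(i/n * (1 - i/n)).
    k_max = max(1, int(alpha0 * n))
    i = np.arange(1, k_max + 1)
    frac = i / n
    hc = math.sqrt(n) * (frac - p[order[:k_max]]) / np.sqrt(frac * (1.0 - frac))

    # The threshold is the |z| value at the rank that maximizes the objective.
    i_star = int(np.argmax(hc))
    threshold = float(abs(z[order[i_star]]))
    selected = np.flatnonzero(np.abs(z) >= threshold)
    return threshold, selected


if __name__ == "__main__":
    # Simulated rare signals: 20 shifted features among 1000 (illustrative only).
    rng = np.random.default_rng(0)
    z = rng.standard_normal(1000)
    z[:20] += 4.0
    t, idx = higher_criticism_threshold(z)
    print("threshold on |z|:", t, "features kept:", idx.size)
```

The threshold is chosen from the data alone, which is what makes the procedure usable without knowing the sparsity or signal strength in advance. A block-wise variant would replace the per-feature z-scores with a per-block statistic; that step depends on the learned dependence structure and is not sketched here.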