Title: Influential features PCA for high dimensional clustering
Authors: Wanjie Wang - National University of Singapore (Singapore) [presenting]
Jiashun Jin - Carnegie Mellon University (United States)
Tracy Ke - University of Chicago (United States)
Abstract: Clustering is a major problem in statistics with many applications. In the Big Data era, it faces two main challenges: (1) the number of features is much larger than the sample size; (2) the signals are sparse and weak, masked by a large amount of noise. We propose a new tuning-free clustering procedure for large-scale data, Influential Features PCA (IF-PCA). IF-PCA consists of a feature selection step, a PCA step, and a k-means step. The first two steps reduce the data dimensions recursively while preserving the main information. As a consequence, IF-PCA is fast and accurate, and it performs competitively on 10 gene microarray data sets. We also propose a model that captures the rarity and weakness of the signals. Under this model, the statistical limits of the clustering problem and of IF-PCA have been found.
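The three steps named in the abstract (feature selection, PCA, k-means) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the selection statistic or the threshold, so the sketch assumes a Kolmogorov-Smirnov score against the standard normal and keeps a fixed number of features through a hypothetical `n_keep` parameter, which means that, unlike the procedure described above, it is not tuning-free. The use of the leading k-1 left singular vectors and all function names are likewise assumptions made for illustration.

```python
import math
import numpy as np

# Standard normal CDF, vectorised with the stdlib erf (avoids a SciPy dependency).
_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))


def standardize(X):
    """Centre and scale each column; constant columns are left at zero."""
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0
    return (X - X.mean(axis=0)) / sd


def ks_scores(Z):
    """Per-feature KS distance between the empirical CDF and N(0, 1), times sqrt(n)."""
    n = Z.shape[0]
    F = _norm_cdf(np.sort(Z, axis=0))
    upper = np.arange(1, n + 1)[:, None] / n   # ECDF just after each order statistic
    lower = np.arange(0, n)[:, None] / n       # ECDF just before each order statistic
    return np.sqrt(n) * np.maximum(upper - F, F - lower).max(axis=0)


def kmeans(U, k, n_init=10, n_iter=100, seed=0):
    """Plain Lloyd iterations with random restarts; returns the lowest-cost labels."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = U[rng.choice(len(U), size=k, replace=False)]
        for _ in range(n_iter):
            d = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            new_centers = np.array([
                U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels, cost = d.argmin(axis=1), d.min(axis=1).sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def if_pca_sketch(X, k, n_keep):
    """Select features, project onto leading singular vectors, then run k-means.

    n_keep is a hypothetical tuning parameter introduced for this sketch only.
    """
    Z = standardize(np.asarray(X, dtype=float))
    keep = np.argsort(ks_scores(Z))[-n_keep:]          # step 1: feature selection
    U, _, _ = np.linalg.svd(Z[:, keep], full_matrices=False)
    return kmeans(U[:, :max(k - 1, 1)], k)             # steps 2 and 3: PCA, k-means
```

Calling `if_pca_sketch(X, k=2, n_keep=20)` on an n-by-p data matrix `X` returns one integer cluster label per row. Selecting features before the PCA step is what keeps the many pure-noise columns from dominating the leading singular vectors when p is much larger than n.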