CMStatistics 2023
B1126
Title: DiSK: An efficient algorithm for distributed and streaming $k$-PCA
Authors: Muhammad Zulqarnain - Rutgers, The State University of New Jersey (United States) [presenting]
Waheed Bajwa - Rutgers University (United States)
Abstract: The high dimensionality of modern data often necessitates lower-dimensional, uncorrelated representations to improve the accuracy of downstream machine-learning algorithms. Principal component analysis (PCA) is a popular data representation technique widely used to exploit the low dimensionality of data and, under appropriate settings, to yield uncorrelated features. Because data are often distributed and streaming in nature, an improvement of the existing C-DIEGO (Consensus DIstributEd Generalized Oja) algorithm is proposed. C-DIEGO is based on Oja updates and estimates the dominant eigenvector of the population covariance matrix $\boldsymbol{\Sigma}$ within a network of computing machines that lacks a central server, relying on sufficiently many peer-to-peer message exchanges among neighboring machines. The improved algorithm, termed \textit{Distributed Streaming Krasulina (DiSK)}, incorporates the Gram-Schmidt process in every update to estimate the top $k$ dominant eigenvectors of $\boldsymbol{\Sigma}$ from data streaming into an arbitrarily connected network of computing machines. DiSK estimates the top-$k$ eigenvectors in a sample-efficient manner by using multiple communication rounds per iteration. The sample efficiency and convergence behavior of DiSK are demonstrated and compared with those of C-DIEGO through extensive numerical experiments on both synthetic and real datasets.
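The ingredients named in the abstract (a streaming Oja-type update, several rounds of peer-to-peer averaging per iteration, and an orthonormalization step after every update) can be illustrated with a minimal sketch. It is not the authors' DiSK or C-DIEGO implementation. The ring topology, the mixing matrix `W`, the `step / t` step size, the function names, and the use of a QR factorization (which yields the same orthonormal basis as Gram-Schmidt up to column signs) are all assumptions made for illustration.

```python
import numpy as np

def ring_weights(m):
    """Doubly stochastic mixing matrix for a ring of m nodes (assumed topology)."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] += 0.5
        W[i, (i - 1) % m] += 0.25
        W[i, (i + 1) % m] += 0.25
    return W

def subspace_error(Q, U):
    """Frobenius norm of the part of Q lying outside the column span of U."""
    return np.linalg.norm(Q - U @ (U.T @ Q))

def distributed_streaming_kpca(sample, d, k, m, W, iters, rounds, step, rng):
    """Illustrative consensus-based streaming k-PCA (not the authors' algorithm).

    sample(m, rng) must return an (m, d) array: one fresh sample per node.
    Returns an (m, d, k) array holding each node's orthonormal basis estimate.
    """
    # Every node starts from the same random orthonormal d x k basis.
    Q0, _ = np.linalg.qr(rng.standard_normal((d, k)))
    Q = np.stack([Q0] * m)
    for t in range(1, iters + 1):
        X = sample(m, rng)
        # Local Oja-type direction at node i: x_i x_i^T Q_i.
        G = np.einsum('id,ie,iek->idk', X, X, Q)
        # Several rounds of neighbor averaging per iteration (consensus).
        for _ in range(rounds):
            G = np.einsum('ij,jdk->idk', W, G)
        eta = step / t  # decaying step size (an assumption)
        for i in range(m):
            # Re-orthonormalize after every update; QR stands in for Gram-Schmidt.
            Q[i], _ = np.linalg.qr(Q[i] + eta * G[i])
    return Q
```

In this sketch, `rounds` controls how closely each node's update direction approximates the network-wide average, at the cost of extra communication per iteration. `subspace_error` can be used on synthetic data, where the true top-$k$ eigenvectors are known, to track how each node's estimate approaches the target subspace.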