CMStatistics 2021
B0791
Title: Identifiable variational autoencoders via sparse decoding

Authors: Gemma Moran - Columbia University (United States) [presenting]
Dhanya Sridhar - Columbia University (United States)
Yixin Wang - University of California Berkeley (United States)
David Blei - Columbia University (United States)
Abstract: Consider unsupervised representation learning: given datapoints of high-dimensional features, we want to learn low-dimensional factors -- a representation -- that capture the observed data. We consider sparse representation learning, where each latent factor influences only a subset of the features. This notion of sparsity often reflects underlying patterns in data; in movie-ratings data, for example, each movie (feature) is described by only a few genres (factors). To this end, we introduce the Sparse Variational Autoencoder (Sparse VAE), a deep generative model with priors that encourage each feature to depend on only a few factors. The main technical result is a proof that the Sparse VAE is identifiable: given data drawn from the model, there exists a unique optimal set of factors. This result sets the Sparse VAE apart from many deep generative models for representation learning, which are unidentifiable. One key assumption is the existence of "anchor features": for each factor, there exist features that depend on that factor alone. Importantly, these anchor features do not need to be known a priori. We empirically study the Sparse VAE on simulated data and show that it recovers the true latent factors when related methods do not. We also study movie-rating and text datasets, and show that the Sparse VAE predicts well both on held-out data and on data drawn from a different test distribution.
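
To make the sparse-decoding idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: a VAE with a linear decoder whose weight matrix W couples each observed feature to the latent factors, and an L1 penalty on W standing in for the paper's sparsity-inducing prior. All names, architecture choices, and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseVAESketch(nn.Module):
    """Toy VAE whose linear decoder weight W is pushed toward row-wise
    sparsity, so each feature depends on only a few latent factors."""

    def __init__(self, n_features: int, n_factors: int, hidden: int = 64):
        super().__init__()
        # Amortized encoder for q(z | x).
        self.enc = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, n_factors)
        self.enc_logvar = nn.Linear(hidden, n_factors)
        # W[j, k] couples feature j to factor k; a sparse row W[j] means
        # feature j is influenced by few factors (hypothetical stand-in
        # for the paper's prior on the decoder).
        self.W = nn.Parameter(0.01 * torch.randn(n_features, n_factors))
        self.bias = nn.Parameter(torch.zeros(n_features))

    def forward(self, x: torch.Tensor):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        x_hat = z @ self.W.t() + self.bias                    # sparse linear decoder
        return x_hat, mu, logvar

def loss_fn(x, x_hat, mu, logvar, W, lam: float = 1e-2):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()              # Gaussian reconstruction
    kl = -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(dim=1).mean()
    sparsity = lam * W.abs().sum()                            # L1 surrogate for the prior
    return recon + kl + sparsity

if __name__ == "__main__":
    torch.manual_seed(0)
    model = SparseVAESketch(n_features=50, n_factors=5)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(256, 50)                                  # placeholder data
    for _ in range(200):
        x_hat, mu, logvar = model(x)
        loss = loss_fn(x, x_hat, mu, logvar, model.W)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # How many factors remain "active" per feature after training.
    print((model.W.abs() > 0.05).sum(dim=1))
```

In this sketch, inspecting the rows of W after training shows which factors drive each feature; an anchor feature would correspond to a row with a single dominant entry. The actual Sparse VAE uses a neural decoder and a more carefully designed sparsity prior than the L1 surrogate used here.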