A0374
Title: When can we approximate wide contrastive models with neural tangent kernels and principal component analysis?
Authors: Pascal Esser - Technical University of Munich (Germany) [presenting]
Gautham Anil - Indian Institute of Technology Madras (India)
Debarghya Ghoshdastidar - Technical University of Munich (Germany)
Abstract: Contrastive learning is a paradigm for learning representations from unlabelled data, and several recent works have claimed that such models effectively learn spectral embeddings, establishing relations between (wide) contrastive models and kernel principal component analysis (PCA). However, it is not known whether trained contrastive models indeed correspond to kernel methods or PCA. The training dynamics of two-layer contrastive models with non-linear activation are analyzed, and it is determined when these models are close to PCA or kernel methods. In the supervised setting, it is well known that neural networks are equivalent to neural tangent kernel (NTK) machines and that the NTK of infinitely wide networks remains constant during training. The first NTK constancy results for contrastive losses are provided, and a nuanced picture emerges: the NTK of wide networks remains almost constant for cosine similarity-based contrastive losses, but not for losses based on dot-product similarity. The training dynamics of contrastive models are further studied with orthogonality constraints on the output layer, which is implicitly assumed in works relating contrastive learning to spectral embedding. The deviation bounds suggest that the representations learned by contrastive models are close to the principal components of a certain matrix computed from random features.
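To make the NTK constancy claim concrete, the following is a minimal illustrative sketch (not the authors' code): it trains a wide two-layer tanh network with a simple cosine-similarity contrastive loss on synthetic positive/negative pairs and measures how much the empirical NTK drifts during training. The width, data, learning rate, and the exact form of the loss are illustrative assumptions; swapping the cosine for an unnormalised dot product gives the contrasting case where a larger drift would be expected.

```python
# Illustrative sketch only: empirical NTK drift of a wide two-layer tanh
# network trained with a cosine-similarity contrastive loss.
# Width, data, loss form, and learning rate are illustrative assumptions.
import jax
import jax.numpy as jnp


def init_params(key, d_in=10, width=2048, d_out=4):
    k1, k2 = jax.random.split(key)
    # NTK parameterisation: O(1) weights, explicit 1/sqrt(fan-in) scaling in the forward pass.
    return {"W1": jax.random.normal(k1, (width, d_in)),
            "W2": jax.random.normal(k2, (d_out, width))}


def forward(params, x):
    h = jnp.tanh(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return params["W2"] @ h / jnp.sqrt(h.shape[0])


def cosine_contrastive_loss(params, x, x_pos, x_neg):
    # Pull positive pairs together, push negatives apart, using cosine similarity.
    def cos(a, b):
        return jnp.dot(a, b) / (jnp.linalg.norm(a) * jnp.linalg.norm(b) + 1e-8)
    f = jax.vmap(lambda z: forward(params, z))
    fx, fp, fn = f(x), f(x_pos), f(x_neg)
    return jnp.mean(jax.vmap(cos)(fx, fn) - jax.vmap(cos)(fx, fp))


def empirical_ntk(params, xs):
    # Gram matrix of the flattened parameter Jacobians of the network outputs.
    jac = jax.vmap(lambda z: jax.jacobian(forward)(params, z))(xs)
    leaves = jax.tree_util.tree_leaves(jac)          # each leaf: (n, d_out, *param_shape)
    n, d_out = leaves[0].shape[:2]
    flat = jnp.concatenate([l.reshape(n * d_out, -1) for l in leaves], axis=1)
    return flat @ flat.T                             # (n * d_out, n * d_out)


key_p, key_x, key_pos, key_neg = jax.random.split(jax.random.PRNGKey(0), 4)
params = init_params(key_p)
x = jax.random.normal(key_x, (16, 10))
x_pos = x + 0.1 * jax.random.normal(key_pos, x.shape)   # synthetic positive pairs
x_neg = jax.random.normal(key_neg, x.shape)             # independent negatives

ntk_init = empirical_ntk(params, x)
grad_fn = jax.jit(jax.grad(cosine_contrastive_loss))
for _ in range(200):                                     # plain gradient descent
    g = grad_fn(params, x, x_pos, x_neg)
    params = jax.tree_util.tree_map(lambda p, gp: p - 0.5 * gp, params, g)

rel_change = (jnp.linalg.norm(empirical_ntk(params, x) - ntk_init)
              / jnp.linalg.norm(ntk_init))
print(f"relative NTK change after training: {rel_change:.3e}")
```

In line with the result described in the abstract, one would expect the reported relative change to shrink as the width grows for the cosine-similarity loss, while the same experiment with an unnormalised dot-product similarity need not exhibit this behaviour.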