CONTRASTIVE LEARNING CAN FIND AN OPTIMAL BASIS FOR APPROXIMATELY VIEW-INVARIANT FUNCTIONS

Abstract

Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning kernel functions that approximate a fixed positive-pair kernel. We then prove that a simple representation obtained by combining this kernel with PCA provably minimizes the worst-case approximation error of linear predictors, under a straightforward assumption that positive pairs have similar labels. Our analysis is based on a decomposition of the target function in terms of the eigenfunctions of a positive-pair Markov chain, and a surprising equivalence between these eigenfunctions and the output of Kernel PCA. We give generalization bounds for downstream linear prediction using our Kernel PCA representation, and show empirically on a set of synthetic tasks that applying Kernel PCA to contrastive learning models can indeed approximately recover the Markov chain eigenfunctions, although the accuracy depends on the kernel parameterization as well as on the augmentation strength.
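The representation described above, obtained by applying Kernel PCA to a (learned) kernel matrix, can be illustrated with a generic NumPy sketch. This is the standard Kernel PCA procedure (eigendecomposition of the doubly-centered Gram matrix), not the paper's exact algorithm; the RBF kernel here merely stands in for a kernel that would in practice be learned contrastively.

```python
import numpy as np

def kernel_pca(K, k):
    """Top-k Kernel PCA representation from an n x n PSD kernel matrix K.

    Returns an (n, k) array whose columns are the leading kernel
    principal components (eigenvectors scaled by sqrt of eigenvalues).
    """
    n = K.shape[0]
    # Double-center the kernel matrix.
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    evals, evecs = np.linalg.eigh(Kc)          # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]          # keep the k largest
    evals, evecs = evals[idx], evecs[:, idx]
    return evecs * np.sqrt(np.clip(evals, 0.0, None))

# Toy usage: an RBF kernel on random points stands in for a learned kernel.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
sq_dists = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)
rep = kernel_pca(K, k=4)   # (20, 4) representation for downstream linear models
```

A downstream linear predictor is then fit on `rep` in place of the raw inputs.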

1. INTRODUCTION

When using a contrastive learning method such as SimCLR (Chen et al., 2020a) for representation learning, the first step is to specify the distribution of original examples z ∼ p(Z) within some space Z, along with a sampler of augmented views p(A|Z = z) over a potentially different space A.¹ For example, p(Z) might represent a dataset of natural images, and p(A|Z) a random transformation that applies random scaling and color shifts. Contrastive learning then consists of finding a parameterized mapping (such as a neural network) that maps multiple views of the same image (e.g. draws a₁, a₂ ∼ p(A|Z = z) for a fixed z) close together, and unrelated views far apart. This mapping can then be used to define a representation that is useful for downstream supervised learning.

The success of these representations has led to a variety of theoretical analyses of contrastive learning, including analyses based on conditional independence within latent classes (Saunshi et al., 2019), alignment of hyperspherical embeddings (Wang & Isola, 2020), conditional independence structure with landmark embeddings (Tosh et al., 2021), and spectral analysis of an augmentation graph (HaoChen et al., 2021). Each of these analyses is based on a single choice of contrastive learning objective. In this work, we go further by integrating multiple popular contrastive learning methods into a single framework, and by showing that this framework can be used to build minimax-optimal representations under a straightforward assumption about the similarity of labels between positive pairs.

Common wisdom for choosing the augmentation distribution p(A|Z) is that it should remove irrelevant information from Z while preserving the information necessary to predict the eventual downstream label Y; for instance, augmentations might be chosen to be random crops or color shifts that affect the semantic content of an image as little as possible (Chen et al., 2020a).
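The setup just described, i.e. drawing two views of the same example as a positive pair and training an encoder to place them close together relative to unrelated views, can be sketched in a few lines of NumPy. This is a minimal illustration of a SimCLR-style InfoNCE objective, not the paper's implementation; the `augment` function and the identity "encoder" are placeholder assumptions.

```python
import numpy as np

def sample_positive_pair(z, augment, rng):
    """Draw two augmented views a1, a2 ~ p(A|Z=z) of the same example z."""
    return augment(z, rng), augment(z, rng)

def info_nce_loss(f1, f2, temperature=0.5):
    """SimCLR-style InfoNCE loss on embeddings of paired views.

    f1, f2: (n, d) arrays; row i of f1 and row i of f2 embed views of the
    same original example (a positive pair); other rows act as negatives.
    """
    # Cosine similarities between the two sets of views.
    f1 = f1 / np.linalg.norm(f1, axis=1, keepdims=True)
    f2 = f2 / np.linalg.norm(f2, axis=1, keepdims=True)
    logits = f1 @ f2.T / temperature           # (n, n); positives on the diagonal
    # Each row is a softmax classification problem over the n candidates.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: the "augmentation" adds Gaussian noise to a vector example.
rng = np.random.default_rng(0)
augment = lambda z, rng: z + 0.1 * rng.standard_normal(z.shape)
zs = rng.standard_normal((8, 4))
pairs = [sample_positive_pair(z, augment, rng) for z in zs]
a1 = np.stack([p[0] for p in pairs])
a2 = np.stack([p[1] for p in pairs])
loss = info_nce_loss(a1, a2)   # identity encoder, for illustration only
```

Minimizing this loss over an encoder's parameters pulls the diagonal (positive) similarities up relative to the off-diagonal (negative) ones.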
The goal of representation learning is then to find a representation with which we can form good estimates of Y using only a few labeled examples. In particular, we focus on approximating a target function g : A → R^n for which g(a) represents the "best guess" of Y based on a. For regression tasks, we might be interested in a



¹We focus on finite but arbitrarily large Z and A, e.g. the set of 8-bit 32×32 images, and allow Z ≠ A.

