CONTRASTIVE LEARNING CAN FIND AN OPTIMAL BASIS FOR APPROXIMATELY VIEW-INVARIANT FUNCTIONS

Abstract

Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning kernel functions that approximate a fixed positive-pair kernel. We then prove that a simple representation obtained by combining this kernel with PCA provably minimizes the worst-case approximation error of linear predictors, under a straightforward assumption that positive pairs have similar labels. Our analysis is based on a decomposition of the target function in terms of the eigenfunctions of a positive-pair Markov chain, and a surprising equivalence between these eigenfunctions and the output of Kernel PCA. We give generalization bounds for downstream linear prediction using our Kernel PCA representation, and show empirically on a set of synthetic tasks that applying Kernel PCA to contrastive learning models can indeed approximately recover the Markov chain eigenfunctions, although the accuracy depends on the kernel parameterization as well as on the augmentation strength.

1. INTRODUCTION

When using a contrastive learning method such as SimCLR (Chen et al., 2020a) for representation learning, the first step is to specify the distribution of original examples z ∼ p(Z) within some space Z, along with a sampler of augmented views p(A|Z = z) over a potentially different space A.[1] For example, p(Z) might represent a dataset of natural images, and p(A|Z) a random transformation that applies random scaling and color shifts. Contrastive learning then consists of finding a parameterized mapping (such as a neural network) which maps multiple views of the same image (e.g. draws a1, a2 ∼ p(A|Z = z) for a fixed z) close together, and unrelated views far apart. This mapping can then be used to define a representation which is useful for downstream supervised learning. The success of these representations has led to a variety of theoretical analyses of contrastive learning, including analyses based on conditional independence within latent classes (Saunshi et al., 2019), alignment of hyperspherical embeddings (Wang & Isola, 2020), conditional independence structure with landmark embeddings (Tosh et al., 2021), and spectral analysis of an augmentation graph (HaoChen et al., 2021). Each of these analyses is based on a single choice of contrastive learning objective. In this work, we go further by integrating multiple popular contrastive learning methods into a single framework, and showing that it can be used to build minimax-optimal representations under a straightforward assumption about the similarity of labels between positive pairs.

Common wisdom for choosing the augmentation distribution p(A|Z) is that it should remove irrelevant information from Z while preserving the information necessary to predict the eventual downstream label Y; for instance, augmentations might be chosen to be random crops or color shifts that affect the semantic content of an image as little as possible (Chen et al., 2020a).
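As a concrete (toy) illustration of this setup, the following sketch instantiates p(Z), p(A|Z), and the induced positive-pair distribution on a space of small integers. The specific distributions here are our own illustrative choices, not ones used in the paper:

```python
import random

# Toy instantiation of the contrastive setup (illustrative only):
# Z = {0, ..., 9}, and an "augmentation" shifts z by -1, 0, or +1.
def sample_z():
    return random.randrange(10)

def sample_augmentation(z):
    # p(A | Z = z): a small random perturbation of z.
    return z + random.choice([-1, 0, 1])

def sample_positive_pair():
    # p+(a1, a2) = Σ_z p(a1|z) p(a2|z) p(z):
    # draw one z, then two independent augmented views of it.
    z = sample_z()
    return sample_augmentation(z), sample_augmentation(z)

a1, a2 = sample_positive_pair()
# Two views of the same z can differ by at most 2 in this toy setup.
assert abs(a1 - a2) <= 2
```

A contrastive model would then be trained to assign high similarity to draws from `sample_positive_pair` and low similarity to views of independently drawn examples.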
The goal of representation learning is then to find a representation with which we can form good estimates of Y using only a few labeled examples. In particular, we focus on approximating a target function g : A → R^n for which g(a) represents the "best guess" of Y based on a. For regression tasks, we might be interested in a target function of the form g(a) = E[Y | A = a]. For classification tasks, if Y is represented as a one-hot vector, we might be interested in estimating the probability of each class, again taking the form g(a) = E[Y | A = a], or the most likely label, taking the form g(a) = argmax_y p(Y = y | A = a). In either case, we are interested in constructing a representation for which g can be estimated well using only a small number of labeled augmentations (a_i, y_i).[2]

Since we usually do not have access to the downstream supervised learning task when learning our representation, our goal is to identify a representation that enables us to approximate many different "reasonable" choices of g. Specifically, we focus on finding a single representation which allows us to approximate every target function with a small positive-pair discrepancy, i.e. every g satisfying the following assumption:

Assumption 1.1 (Approximate View-Invariance). Each target function g : A → R satisfies

    E_{p+(a1,a2)}[(g(a1) - g(a2))^2] ≤ ε,

for some fixed ε ∈ [0, ∞), where p+(a1, a2) = Σ_z p(a1|z) p(a2|z) p(z).

This is a fairly weak assumption, because to the extent that the distribution of augmentations preserves information about some downstream label, our best estimate of that label should not depend much on exactly which augmentation is sampled: it should be approximately invariant to the choice of a different augmented view of the same example. More precisely, as long as the label Y is independent of the augmentation A conditioned on the original example Z (i.e. assuming augmentations are chosen without using the label, as is typically the case), we must have E[(g(A1) - g(A2))^2] ≤ 2 E[(g(A) - Y)^2] (see Appendix A). For simplicity, we work with scalar g : A → R and Y ∈ R; vector-valued Y can be handled by learning a sequence of scalar functions.
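The positive-pair discrepancy in Assumption 1.1 can be estimated by Monte Carlo for any candidate g. A minimal sketch, with toy distributions and example functions of our own choosing (not from the paper):

```python
import random

def positive_pair():
    # p+(a1, a2): sample z uniformly from {0,...,9}, then two
    # independent augmented views a = z + Gaussian noise.
    z = random.randrange(10)
    return z + random.gauss(0, 0.1), z + random.gauss(0, 0.1)

def discrepancy(g, n=10_000):
    # Monte Carlo estimate of E_{p+(a1,a2)}[(g(a1) - g(a2))^2].
    return sum((g(a1) - g(a2)) ** 2
               for a1, a2 in (positive_pair() for _ in range(n))) / n

random.seed(0)
# A smooth g is nearly view-invariant, so it satisfies
# Assumption 1.1 with a much smaller epsilon than a g that is
# sensitive to the augmentation noise.
eps_smooth = discrepancy(lambda a: a)        # Lipschitz in a
eps_rough = discrepancy(lambda a: a % 1.0)   # sensitive to the noise
assert eps_smooth < eps_rough
```

Any g with a small estimated discrepancy is, by the results below, well approximated by a linear predictor on the top eigenfunction representation.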
Our first contribution is to unify a number of previous analyses and existing techniques, drawing connections between contrastive learning, kernel decomposition, Markov chains, and Assumption 1.1. We start by showing that minimizing existing contrastive losses is equivalent to building an approximation of a particular positive-pair kernel, from which a finite-dimensional representation can be extracted using Kernel PCA (Schölkopf et al., 1997). We next discuss what properties a representation must have to achieve low approximation error for functions satisfying Assumption 1.1, and show that the eigenfunctions of a Markov chain over positive pairs allow us to re-express this assumption in a form that makes those properties explicit. We then prove that, surprisingly, building a Kernel PCA representation using the positive-pair kernel is exactly equivalent to identifying the eigenfunctions of this Markov chain, ensuring that this representation has the desired properties.

Our main theoretical result is that contrastive learning methods can be used to find a minimax-optimal representation for linear predictors under Assumption 1.1. Specifically, for a fixed dimension, we show that taking the eigenfunctions with the largest eigenvalues yields a basis for the linear subspace of functions that minimizes the worst-case quadratic approximation error across the set of functions satisfying Assumption 1.1, and we further give generalization bounds for the performance of this representation in downstream supervised learning. We conclude by studying the behavior of contrastive learning models on two synthetic tasks for which the exact positive-pair kernel is known, and investigating the extent to which the basis of eigenfunctions can be extracted from trained models. As predicted by our theory, we find that the same eigenfunctions can be recovered from multiple model parameterizations and losses, although the accuracy of the recovery depends on both the expressiveness of the kernel parameterization and the augmentation strength.
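On a small finite space, the claimed correspondence between Kernel PCA on the positive-pair kernel and the eigenfunctions of the positive-pair Markov chain can be checked numerically. In this sketch the toy distributions are our own; we take the positive-pair kernel to be K(a1, a2) = p+(a1, a2) / (p(a1) p(a2)) and the chain transition to be T(a2 | a1) = p+(a1, a2) / p(a1), which is our reading of the construction described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite spaces: build p+(a1, a2) = Σ_z p(a1|z) p(a2|z) p(z).
n_z, n_a = 4, 6
p_z = np.full(n_z, 1 / n_z)
p_a_given_z = rng.dirichlet(np.ones(n_a), size=n_z)  # rows: p(a|z)
p_pos = (p_a_given_z.T * p_z) @ p_a_given_z          # p+(a1, a2)
p_a = p_pos.sum(axis=1)                              # marginal p(a)

# Positive-pair Markov chain: T[a1, a2] = p+(a1, a2) / p(a1).
T = p_pos / p_a[:, None]
evals_T = np.linalg.eigvals(T)

# Symmetrized kernel matrix diagonalized by (population) Kernel PCA:
# S = D^{1/2} K D^{1/2} with K(a1, a2) = p+(a1, a2) / (p(a1) p(a2)),
# i.e. S = D^{-1/2} P+ D^{-1/2} where D = diag(p(a)).
S = p_pos / np.sqrt(np.outer(p_a, p_a))
evals_S = np.linalg.eigvalsh(S)

# The spectra coincide: each Kernel PCA direction v corresponds to a
# Markov chain eigenfunction f = D^{-1/2} v with the same eigenvalue.
assert np.allclose(np.sort(evals_T.real), np.sort(evals_S), atol=1e-8)
```

Because T and S are similar matrices (S = D^{1/2} T D^{-1/2}), this agreement is exact rather than approximate, which is the finite-space version of the equivalence used in our analysis.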

2. CONTRASTIVE LEARNING IS SECRETLY KERNEL LEARNING

Standard contrastive learning approaches can generally be decomposed into two pieces: a parameterized model that takes two augmented views and assigns them a real-valued similarity, and a contrastive loss function that encourages the model to assign higher similarity to positive pairs than to negative pairs. In particular, the InfoNCE / NT-XEnt objective proposed by Van den Oord et al. (2018) and Chen et al. (2020a) and used with the SimCLR architecture, the NT-Logistic objective also considered by Chen et al. (2020a) and theoretically analyzed by Tosh et al. (2021), and the Spectral Contrastive Loss introduced by HaoChen et al. (2021) all have this structure.

[1] We focus on finite but arbitrarily large Z and A, e.g. the set of 8-bit 32x32 images, and allow Z ≠ A.

[2] If we have a dataset of labeled un-augmented examples (z_i, y_i), we can build a dataset of labeled augmentations by sampling one or more augmentations of each example in our original dataset.

