CONTRASTIVE LEARNING CAN FIND AN OPTIMAL BASIS FOR APPROXIMATELY VIEW-INVARIANT FUNCTIONS

Abstract

Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning kernel functions that approximate a fixed positive-pair kernel. We then prove that a simple representation obtained by combining this kernel with PCA provably minimizes the worst-case approximation error of linear predictors, under a straightforward assumption that positive pairs have similar labels. Our analysis is based on a decomposition of the target function in terms of the eigenfunctions of a positive-pair Markov chain, and a surprising equivalence between these eigenfunctions and the output of Kernel PCA. We give generalization bounds for downstream linear prediction using our Kernel PCA representation, and show empirically on a set of synthetic tasks that applying Kernel PCA to contrastive learning models can indeed approximately recover the Markov chain eigenfunctions, although the accuracy depends on the kernel parameterization as well as on the augmentation strength.

1. INTRODUCTION

When using a contrastive learning method such as SimCLR (Chen et al., 2020a) for representation learning, the first step is to specify the distribution of original examples $z \sim p(Z)$ within some space $Z$ along with a sampler of augmented views $p(A|Z = z)$ over a potentially different space $A$. For example, $p(Z)$ might represent a dataset of natural images, and $p(A|Z)$ a random transformation that applies random scaling and color shifts. Contrastive learning then consists of finding a parameterized mapping (such as a neural network) which maps multiple views of the same image (e.g. draws $a_1, a_2 \sim p(A|Z = z)$ for a fixed $z$) close together, and unrelated views far apart. This mapping can then be used to define a representation which is useful for downstream supervised learning.

The success of these representations has led to a variety of theoretical analyses of contrastive learning, including analyses based on conditional independence within latent classes (Saunshi et al., 2019), alignment of hyperspherical embeddings (Wang & Isola, 2020), conditional independence structure with landmark embeddings (Tosh et al., 2021), and spectral analysis of an augmentation graph (HaoChen et al., 2021). Each of these analyses is based on a single choice of contrastive learning objective. In this work, we go further by integrating multiple popular contrastive learning methods into a single framework, and showing that it can be used to build minimax-optimal representations under a straightforward assumption about similarity of labels between positive pairs.

Common wisdom for choosing the augmentation distribution $p(A|Z)$ is that it should remove irrelevant information from $Z$ while preserving information necessary to predict the eventual downstream label $Y$; for instance, augmentations might be chosen to be random crops or color shifts that affect the semantic content of an image as little as possible (Chen et al., 2020a).
The goal of representation learning is then to find a representation with which we can form good estimates of $Y$ using only a few labeled examples. In particular, we focus on approximating a target function $g : A \to \mathbb{R}^n$ for which $g(a)$ represents the "best guess" of $Y$ based on $a$. For regression tasks, we might be interested in a target function of the form $g(a) = \mathbb{E}[Y | A = a]$. For classification tasks, if $Y$ is represented as a one-hot vector, we might be interested in estimating the probability of each class, again taking the form $g(a) = \mathbb{E}[Y | A = a]$, or the most likely label, taking the form $g(a) = \operatorname{argmax}_y\, p(Y = y | A = a)$. In either case, we are interested in constructing a representation for which $g$ can be estimated well using only a small number of labeled augmentations $(a_i, y_i)$.

Since we usually do not have access to the downstream supervised learning task when learning our representation, our goal is to identify a representation that enables us to approximate many different "reasonable" choices of $g$. Specifically, we focus on finding a single representation which allows us to approximate every target function with a small positive-pair discrepancy, i.e. every $g$ satisfying the following assumption:

Assumption 1.1 (Approximate View-Invariance). Each target function $g : A \to \mathbb{R}$ satisfies

$$\mathbb{E}_{p_+(a_1, a_2)}\left[\big(g(a_1) - g(a_2)\big)^2\right] \le \varepsilon$$

for some fixed $\varepsilon \in [0, \infty)$, where $p_+(a_1, a_2) = \sum_z p(a_1|z)\, p(a_2|z)\, p(z)$.

This is a fairly weak assumption, because to the extent that the distribution of augmentations preserves information about some downstream label, our best estimate of that label should not depend much on exactly which augmentation is sampled: it should be approximately invariant to the choice of a different augmented view of the same example. More precisely, as long as the label $Y$ is independent of the augmentation $A$ conditioned on the original example $Z$ (i.e. assuming augmentations are chosen without using the label, as is typically the case), we must have $\mathbb{E}\left[(g(A_1) - g(A_2))^2\right] \le 2\,\mathbb{E}\left[(g(A) - Y)^2\right]$ (see Appendix A). For simplicity, we work with scalar $g : A \to \mathbb{R}$ and $Y \in \mathbb{R}$; vector-valued $Y$ can be handled by learning a sequence of scalar functions.

Our first contribution is to unify a number of previous analyses and existing techniques, drawing connections between contrastive learning, kernel decomposition, Markov chains, and Assumption 1.1. We start by showing that minimizing existing contrastive losses is equivalent to building an approximation of a particular positive-pair kernel, from which a finite-dimensional representation can be extracted using Kernel PCA (Schölkopf et al., 1997). We next discuss what properties a representation must have to achieve low approximation error for functions satisfying Assumption 1.1, and show that the eigenfunctions of a Markov chain over positive pairs allow us to re-express this assumption in a form that makes those properties explicit. We then prove that, surprisingly, building a Kernel PCA representation using the positive-pair kernel is exactly equivalent to identifying the eigenfunctions of this Markov chain, ensuring this representation has the desired properties.

Our main theoretical result is that contrastive learning methods can be used to find a minimax-optimal representation for linear predictors under Assumption 1.1. Specifically, for a fixed dimension, we show that taking the eigenfunctions with the largest eigenvalues yields a basis for the linear subspace of functions that minimizes the worst-case quadratic approximation error across the set of functions satisfying Assumption 1.1, and further give generalization bounds for the performance of this representation for downstream supervised learning.
We conclude by studying the behavior of contrastive learning models on two synthetic tasks for which the exact positive-pair kernel is known, and investigating the extent to which the basis of eigenfunctions can be extracted from trained models. As predicted by our theory, we find that the same eigenfunctions can be recovered from multiple model parameterizations and losses, although the accuracy depends on both kernel parameterization expressiveness and augmentation strength.

2. CONTRASTIVE LEARNING IS SECRETLY KERNEL LEARNING

Standard contrastive learning approaches can generally be decomposed into two pieces: a parameterized model that takes two augmented views and assigns them a real-valued similarity, and a contrastive loss function that encourages the model to assign higher similarity to positive pairs than negative pairs. In particular, the InfoNCE / NT-XEnt objective proposed by Van den Oord et al. (2018) and Chen et al. (2020a) and used with the SimCLR architecture, the NT-Logistic objective also considered by Chen et al. (2020a) and theoretically analyzed by Tosh et al. (2021), and the Spectral Contrastive Loss introduced by HaoChen et al. (2021) all have this structure.

Figure 1: The positive-pair kernel $K_+$ assigns high similarity to likely positive pairs. Contrastive learning methods learn parameterized kernels $K_\theta$ which assign high similarity to nearby points in a learned embedding space.

NT-XEnt (Chen et al., 2020a; Van den Oord et al., 2018)
  Loss:    $\mathbb{E}\left[-\log \dfrac{K_\theta(a_1^+, a_2^+)}{K_\theta(a_1^+, a_2^+) + \sum_{a_i^-} K_\theta(a_1^+, a_i^-)}\right]$
  Kernel:  $K_\theta(a_1, a_2) = \exp\big(h_\theta(a_1)^\top h_\theta(a_2)/\tau\big)$
  Minimum: $K^*(a_1, a_2) = \dfrac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)} \cdot C_{[a_1]}$

NT-Logistic (Chen et al., 2020a; Tosh et al., 2021)
  Loss:    $\mathbb{E}\left[-\log \sigma\big(\log K_\theta(a_1^+, a_2^+)\big)\right] + \mathbb{E}\left[-\log \sigma\big(-\log K_\theta(a_1^-, a_2^-)\big)\right]$
  Kernel:  $K_\theta(a_1, a_2) = \exp\big(h_\theta(a_1)^\top h_\theta(a_2)/\tau\big)$
  Minimum: $K^*(a_1, a_2) = \dfrac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)}$

Spectral (HaoChen et al., 2021)
  Loss:    $\mathbb{E}\left[-2\,K_\theta(a_1^+, a_2^+)\right] + \mathbb{E}\left[\big(K_\theta(a_1^-, a_2^-)\big)^2\right]$
  Kernel:  $K_\theta(a_1, a_2) = h_\theta(a_1)^\top h_\theta(a_2)$
  Minimum: $K^*(a_1, a_2) = \dfrac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)}$

Table 1: Existing contrastive learning objectives, reinterpreted as learning parameterized approximations of $K_+$. "Minimum" denotes the population minimum of the loss over all kernel functions (not necessarily representable using the shown parameterization). $C_{[a_1]}$ is an equivalence-class-dependent proportionality constant, with $C_{[a_1]} = C_{[a_2]}$ whenever $p_+(a_1, a_2) > 0$.
See Appendix B for derivations and discussion of other related objectives.

The similarity between the two augmented views is commonly taken to be the dot product of outputs of a neural network, e.g. as $h_\theta(a_1)^\top h_\theta(a_2)$. However, a surprising pattern emerges if we instead interpret the exponentiated dot product $\exp(h_\theta(a_1)^\top h_\theta(a_2)/\tau)$ as the similarity for the NT-XEnt and NT-Logistic objectives, treating the exponential and temperature term $\tau$ as part of the model instead of part of the objective. As shown in Table 1, the three losses now share the same population minimum: the probability ratio $p_+(a_1, a_2)/p(a_1)p(a_2)$.

Intriguingly, the expressions $h_\theta(a_1)^\top h_\theta(a_2)$ and $\exp(h_\theta(a_1)^\top h_\theta(a_2)/\tau)$ both satisfy the definition of a Mercer kernel (also called a positive-definite kernel): each implicitly computes the inner product between feature vectors $\langle \phi(a_1), \phi(a_2) \rangle$ under some transformation $\phi : A \to \mathbb{R}^d$ (where $d$ may be infinite). Furthermore, the probability ratio $p_+(a_1, a_2)/p(a_1)p(a_2)$ can be interpreted as a Mercer kernel as well:

Definition 2.1. The positive-pair kernel associated with distributions $p(z)$ and $p(a|z)$ is the ratio

$$K_+(a_1, a_2) = \frac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)} = \langle \phi_+(a_1), \phi_+(a_2) \rangle,$$

where $p_+(a_1, a_2) = \sum_z p(a_1|z)\,p(a_2|z)\,p(z)$, and $\phi_+ : A \to \mathbb{R}^{|Z|}$ is the transformation

$$\phi_+(a) = \left[\frac{p(a|z_1)\sqrt{p(z_1)}}{p(a)},\ \frac{p(a|z_2)\sqrt{p(z_2)}}{p(a)},\ \cdots,\ \frac{p(a|z_{|Z|})\sqrt{p(z_{|Z|})}}{p(a)}\right]^\top. \qquad (2)$$

Here, the magnitude of the dot product between vectors $\phi_+(a_1)$ and $\phi_+(a_2)$ reflects the relative likelihood of data points $a_1$ and $a_2$ being drawn from a positive pair vs. a negative pair. In particular, if $a_1$ and $a_2$ have zero probability of being a positive pair, $\phi_+(a_1)$ and $\phi_+(a_2)$ are orthogonal.
As shown in Figure 1 , we can thus reinterpret the three contrastive learning methods in Table 1 as kernel learning methods, in that they produce parameterized positive-definite kernel functions which approximate this positive-pair kernel. By investigating properties of this kernel, we can thus hope to build a better understanding of the behavior of contrastive learning.
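To make Definition 2.1 concrete, the following is a minimal NumPy sketch for finite $Z$ and $A$. The function name `positive_pair_kernel`, the array layout (rows index $z$, columns index $a$), and the random toy distributions are illustrative assumptions, not part of the original experiments. It builds $K_+$ from $p(z)$ and $p(a|z)$ and forms the feature map $\phi_+$ of Equation 2, so that the identity $K_+(a_1, a_2) = \langle \phi_+(a_1), \phi_+(a_2) \rangle$ can be checked numerically.

```python
import numpy as np

def positive_pair_kernel(p_z, p_a_given_z):
    """Build K_+ for finite Z and A (hypothetical helper, not from the paper).

    p_z:          shape (|Z|,), probabilities of original examples.
    p_a_given_z:  shape (|Z|, |A|), each row is p(a | z).
    """
    p_a = p_z @ p_a_given_z                                  # marginal p(a)
    p_plus = p_a_given_z.T @ (p_z[:, None] * p_a_given_z)    # p_+(a1, a2)
    k_plus = p_plus / np.outer(p_a, p_a)                     # Definition 2.1
    # Feature map of Equation 2: phi[a, z] = p(a|z) sqrt(p(z)) / p(a)
    phi = (p_a_given_z * np.sqrt(p_z)[:, None]).T / p_a[:, None]
    return k_plus, phi, p_a, p_plus

# Random toy problem with strictly positive probabilities.
rng = np.random.default_rng(0)
p_z = rng.dirichlet(np.ones(4))
p_a_given_z = rng.dirichlet(np.ones(6), size=4)
k_plus, phi, p_a, p_plus = positive_pair_kernel(p_z, p_a_given_z)

# Two examples whose augmentations never overlap: views of different
# examples are never a positive pair, so their kernel value is zero.
k_disjoint, _, _, _ = positive_pair_kernel(
    np.array([0.5, 0.5]),
    np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]),
)
```

In the disjoint toy case, views of the same example get a kernel value above 1 (more likely than chance to be a positive pair), while views of different examples get exactly 0, matching the orthogonality remark following Equation 2.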

3. KERNEL PRINCIPAL COMPONENTS ARE MARKOV CHAIN EIGENFUNCTIONS

We start by investigating the geometric structure of the data under $K_+$, and how we could use Kernel PCA to build a natural representation based on this structure. We next ask what properties we would want this representation to have, and use a Markov chain over positive pairs to decompose those properties in a convenient form. We then show that, surprisingly, the representation derived from $K_+$ exactly corresponds to the Markov chain decomposition and thus has precisely the desired properties.

Figure 2: At each step we condition on an augmentation $a_t$ (middle row) to sample an uncorrupted example $z_t$ (top row), then sample $a_{t+1}$ from $z_t$ so that $(a_t, a_{t+1})$ is a positive pair. Below, we plot the five slowest-varying eigenfunctions $f_1, \ldots, f_5$ at each step of the chain, labeled with their eigenvalues $\lambda_1, \ldots, \lambda_5$. Weaker augmentations lead to slower mixing, smoother eigenfunctions, and eigenvalues closer to 1.

3.1. SUMMARIZING K + WITH KERNEL PRINCIPAL COMPONENTS ANALYSIS

Recall that for any $a_1$ and $a_2$, we have $K_+(a_1, a_2) = \langle \phi_+(a_1), \phi_+(a_2) \rangle$ where $\phi_+$ is defined in Equation 2. Unfortunately, the "features" $\phi_+(a) \in \mathbb{R}^{|Z|}$ are very high dimensional, with dimension equal to the cardinality of $Z$. A natural approach for building a lower-dimensional representation is to use principal components analysis (PCA): construct the (uncentered) covariance matrix $\Sigma = \mathbb{E}_{p(a)}[\phi_+(a)\phi_+(a)^\top] \in \mathbb{R}^{|Z| \times |Z|}$ and then diagonalize it as $\Sigma = U D U^\top$ to determine the principal components $\{(u_1, \sigma_1^2), (u_2, \sigma_2^2), \ldots\}$ of our transformed data distribution $\phi_+(A)$, i.e. the directions capturing the maximum variance of $\phi_+(A)$. We can then project the transformed points $\phi_+(a)$ onto these directions to construct a sequence of "projection functions" $h_i(a) := u_i^\top \phi_+(a)$.

Conveniently, it is possible to estimate these projection functions given access only to the kernel function $K_+(a_1, a_2) = \langle \phi_+(a_1), \phi_+(a_2) \rangle$, by using Kernel PCA (Schölkopf et al., 1997). Kernel PCA bypasses the need to estimate the covariance in feature space, directly producing the set of principal component projection functions $h_1, h_2, \ldots$ and corresponding eigenvalues $\sigma_1^2, \sigma_2^2, \ldots$, such that $h_i(a)$ measures the projection of $\phi_+(a)$ onto the $i$th eigenvector of the covariance matrix, and $\sigma_i^2$ measures the variance along that direction.

The sequence of principal component projection functions gives us a view into the geometry of our data when mapped through $\phi_+(A)$. It is thus natural to construct a $d$-dimensional representation $r : A \to \mathbb{R}^d$ by taking the first $d$ such functions $r(a) = [h_1(a), h_2(a), \ldots, h_d(a)]$, and then use this for downstream learning, as is done in (kernel) principal component regression (Rosipal et al., 2001; Wibowo & Yamamoto, 2012). In practice, we can also substitute our learned kernel $K_\theta$ in place of $K_+$.
Note that we are free to choose d to trade off between the complexity of the representation and the amount of variance of ϕ + (A) that we can capture.
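The construction above can be sketched numerically for a finite augmentation space. The snippet below is an illustrative assumption rather than the implementation used in the experiments: the helper name `kernel_pca_population` and the toy distributions are hypothetical. It computes the projection functions $h_i$ in two ways, once by explicit PCA on the features $\phi_+(a)$ and once from the kernel matrix alone, using the standard fact that $\Sigma = B^\top B$ and $D^{1/2} K_+ D^{1/2} = B B^\top$ share nonzero eigenvalues when $B = D^{1/2}\Phi$ and $D = \mathrm{diag}(p(a))$.

```python
import numpy as np

def kernel_pca_population(k, p_a, d):
    """Population Kernel PCA over a finite space weighted by p_a.

    Uses only kernel evaluations k[a1, a2]; returns the first d projection
    functions (columns of h) and their variances. Hypothetical helper.
    """
    w = np.sqrt(p_a)
    m = w[:, None] * k * w[None, :]            # D^{1/2} K D^{1/2}
    var, v = np.linalg.eigh(m)                 # ascending eigenvalues
    order = np.argsort(var)[::-1][:d]          # keep the d largest
    var, v = var[order], v[:, order]
    h = v * np.sqrt(np.clip(var, 0.0, None)) / w[:, None]
    return h, var

# Toy distributions (illustrative only).
rng = np.random.default_rng(0)
p_z = rng.dirichlet(np.ones(4))
p_a_given_z = rng.dirichlet(np.ones(6), size=4)
p_a = p_z @ p_a_given_z

# Explicit feature-space PCA with the features of Equation 2.
phi = (p_a_given_z * np.sqrt(p_z)[:, None]).T / p_a[:, None]   # (|A|, |Z|)
cov = phi.T @ (p_a[:, None] * phi)             # uncentered covariance
evals, u = np.linalg.eigh(cov)
evals, u = evals[::-1], u[:, ::-1]
h_feat = phi @ u[:, :3]                        # h_i(a) = u_i^T phi_+(a)

# Kernel-only route.
k_plus = phi @ phi.T
h_kern, var = kernel_pca_population(k_plus, p_a, d=3)
```

The two routes agree up to the sign of each projection function, which PCA leaves undetermined. In this toy problem the leading variance is 1 and the leading projection function is constant, anticipating the Markov chain interpretation in Section 3.2.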

3.2. DECOMPOSING INVARIANCE WITH THE POSITIVE-PAIR MARKOV CHAIN

What properties might we want this representation to have? If we wish to estimate functions $g$ satisfying Assumption 1.1, for which $\mathbb{E}_{p_+(a_1,a_2)}[(g(a_1) - g(a_2))^2]$ is small, we might hope that $\mathbb{E}_{p_+(a_1,a_2)}\|r(a_1) - r(a_2)\|_2^2$ is small. But this is not sufficient to ensure we can estimate $g$ with high accuracy; as an example, a constant representation is unlikely to work well. Good representation learning approaches must ensure that the learned representations are also expressive enough to approximate $g$ (for instance by using negative samples or by directly encouraging diversity as in VICReg (Bardes et al., 2021)), but it is not immediately obvious what it means to be "expressive enough" if all we know about $g$ is that it satisfies Assumption 1.1.

We can build a better understanding of the quality of a representation by expanding $g$ in terms of a basis in which $\mathbb{E}_{p_+(a_1,a_2)}[(g(a_1) - g(a_2))^2]$ admits a simpler form. In particular, a convenient decomposition arises from considering the following Markov chain (shown in Figure 2): starting with an example $a_t$, sample the next example $a_{t+1}$ proportional to how likely $(a_t, a_{t+1})$ would be a positive pair, i.e. according to $p_+(a_{t+1}|a_t) = \sum_z p(a_{t+1}|z)\,p(z|a_t)$. Note that, to the extent that some function $g$ satisfies Assumption 1.1, we would also expect the value of $g$ to change slowly along trajectories of this chain, i.e. that $g(a_1) \approx g(a_2) \approx g(a_3) \approx \cdots$, and thus that in general $g(a_t) \approx \mathbb{E}_{p_+(a_{t+1}|a_t)}[g(a_{t+1})]$.

This motivates solving for the eigenfunctions of the Markov chain, which are functions that satisfy $\mathbb{E}_{p_+(a_{t+1}|a_t)}[f_i(a_{t+1})] = \lambda_i f_i(a_t)$ for some $\lambda_i \in [0, 1]$. As shown by Levin & Peres (2017, Chapter 12), the $f_i$ form an orthonormal basis for the set of all functions $A \to \mathbb{R}$ under the inner product $\langle f, g \rangle = \mathbb{E}_{p(a)}[f(a)g(a)]$, in the sense that $\mathbb{E}_{p(a)}[f_i(a)f_i(a)] = 1$ and $\mathbb{E}_{p(a)}[f_i(a)f_j(a)] = 0$ for $i \ne j$.
Then, for any $g : A \to \mathbb{R}$, if we let $c_i = \mathbb{E}_{p(a)}[f_i(a)g(a)]$, we must have $g(a) = \sum_i c_i f_i(a)$ and $\mathbb{E}_{p(a)}[g(a)^2] = \sum_i c_i^2$. Furthermore, this particular choice of orthonormal basis has the following appealing property:

Proposition 3.1. If $g : A \to \mathbb{R}$ and $c_i = \mathbb{E}[f_i(a)g(a)]$, then

$$\mathbb{E}_{p_+(a_1,a_2)}\left[\big(g(a_1) - g(a_2)\big)^2\right] = \sum_i (2 - 2\lambda_i)\, c_i^2. \qquad (3)$$

See Appendix C for a proof, along with a derivation of the orthonormality of the basis. One particular consequence of this fact is that the eigenfunctions with eigenvalues closest to 1 are also the most view-invariant. Specifically, setting $g = f_i$ reveals that

$$\mathbb{E}_{p_+(a_1,a_2)}\left[\big(f_i(a_1) - f_i(a_2)\big)^2\right] = 2 - 2\lambda_i. \qquad (4)$$

More generally, if $g$ satisfies Assumption 1.1, Equation 3 implies that $2\sum_i (1 - \lambda_i)\, c_i^2 \le \varepsilon$, and thus $g$ must have coefficients $c_i$ concentrated on eigenfunctions with $\lambda_i$ close to 1. Indeed, if $\varepsilon = 0$ (i.e. if $g$ is perfectly invariant to augmentations) then the only eigenfunctions with nonzero weights must be those with $\lambda_i = 1$. If we want to approximate $g$ using a small finite-dimensional representation, we should then prefer representations that allow us to estimate any linear combination of the eigenfunctions for which $\lambda_i$ is close to 1.
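Proposition 3.1 can be checked numerically on a small finite problem. The sketch below is illustrative only: the toy distributions and variable names are assumptions. It builds the positive-pair transition matrix, obtains its eigenfunctions through the symmetrized matrix $D^{-1/2} P_+ D^{-1/2}$ with $D = \mathrm{diag}(p(a))$ (which has the same eigenvalues as the transition matrix), and compares both sides of the identity for a random $g$.

```python
import numpy as np

rng = np.random.default_rng(0)
p_z = rng.dirichlet(np.ones(5))
p_a_given_z = rng.dirichlet(np.ones(7), size=5)          # rows: p(a | z)
p_a = p_z @ p_a_given_z                                  # marginal p(a)
p_plus = p_a_given_z.T @ (p_z[:, None] * p_a_given_z)    # p_+(a1, a2)

# Transition matrix of the positive-pair chain: P[a_t, a_next].
P = p_plus / p_a[:, None]

# Symmetrized form; eigenvectors v_i give eigenfunctions f_i = v_i / sqrt(p).
sym = p_plus / np.sqrt(np.outer(p_a, p_a))
lam, v = np.linalg.eigh(sym)
lam, v = lam[::-1], v[:, ::-1]                           # descending order
f = v / np.sqrt(p_a)[:, None]                            # columns are f_i

# Expand an arbitrary target function in the eigenfunction basis.
g = rng.normal(size=7)
c = f.T @ (p_a * g)                                      # c_i = E_p[f_i g]

# Left- and right-hand sides of Proposition 3.1.
lhs = np.sum(p_plus * (g[:, None] - g[None, :]) ** 2)
rhs = np.sum((2.0 - 2.0 * lam) * c ** 2)
```

Here `lhs` is the positive-pair discrepancy of `g` computed directly from $p_+$, and `rhs` is the eigenvalue-weighted sum of squared coefficients; the two coincide up to floating-point error.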

3.3. KERNEL PCA RECOVERS THE BASIS OF POSITIVE-PAIR EIGENFUNCTIONS

Surprisingly, it turns out that the representation built from Kernel PCA in Section 3.1 has precisely the desired property. In fact, performing Kernel PCA with $K_+$ (over the full population) is exactly equivalent to identifying the eigenfunctions of the Markov transition matrix $P$.

Theorem 3.2. The output $(h_1, \sigma_1^2), (h_2, \sigma_2^2), \ldots$ of population-level Kernel PCA under $K_+$ and the orthonormal basis of eigenfunctions $f_i$ of $P$ with eigenvalues $\lambda_i$ satisfy $\sigma_i^2 = \lambda_i$ and $h_i(a) = \sigma_i f_i(a) = \lambda_i^{1/2} f_i(a)$ for all $i$ and all $a \in A$ (up to reordering and multiplicity of eigenspaces).

See Appendix C for a proof. This theorem reveals a deep connection between the (co)variance of the dataset under our kernel $K_+$ and the view-invariance captured by Assumption 1.1. In particular, if we build a representation $r(a) = [h_1(a), h_2(a), \ldots, h_d(a)]$ using the first $d$ principal component projection functions, this representation will directly capture the $d$ eigenfunctions with eigenvalues closest to 1, and allow us to approximate any linear combination of those eigenfunctions using a linear predictor.
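As a sanity check of Theorem 3.2 on a finite toy problem, the following sketch computes the two objects independently: the eigenfunctions of the transition matrix $P$ through a general (non-symmetric) eigensolver, and the principal component projection functions through explicit PCA on the features $\phi_+$. The toy distributions, the choice $d = 3$, and the variable names are assumptions made for illustration; the random problem is assumed to have distinct leading eigenvalues, which holds generically.

```python
import numpy as np

rng = np.random.default_rng(1)
n_z, n_a, d = 5, 7, 3
p_z = rng.dirichlet(np.ones(n_z))
p_a_given_z = rng.dirichlet(np.ones(n_a), size=n_z)
p_a = p_z @ p_a_given_z
p_plus = p_a_given_z.T @ (p_z[:, None] * p_a_given_z)

# Route 1: eigenfunctions of the positive-pair Markov chain.
P = p_plus / p_a[:, None]
lam, f = np.linalg.eig(P)
lam, f = lam.real, f.real                 # spectrum is real for this chain
order = np.argsort(lam)[::-1][:d]
lam, f = lam[order], f[:, order]
f = f / np.sqrt(p_a @ f ** 2)             # normalize so that E_p[f_i^2] = 1

# Route 2: PCA on the explicit features of Equation 2.
phi = (p_a_given_z * np.sqrt(p_z)[:, None]).T / p_a[:, None]
cov = phi.T @ (p_a[:, None] * phi)
var, u = np.linalg.eigh(cov)
var, u = var[::-1][:d], u[:, ::-1][:, :d]
h = phi @ u                               # h_i(a) = u_i^T phi_+(a)
```

Under the theorem, `var` should match `lam`, and each column of `h` should equal $\lambda_i^{1/2}$ times the corresponding column of `f`, up to a sign flip.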

4. EIGENFUNCTION REPRESENTATIONS ARE MINIMAX OPTIMAL

We now give a more precise analysis of the quality of this representation for downstream supervised fine-tuning. We focus on the class of linear predictors on top of a k-dimensional representation E p+ ĝ(a 1 ) -ĝ(a 2 ) 2 . (5) Simultaneously, F r d * minimizes the (quadratic) approximation error for the worst-case target function satisfying Assumption 1.1 for any fixed ε: F r d * = argmin dim(F )=d max g∈Sε min ĝ∈F E p(a) g(a) -ĝ(a) 2 . ( ) Equation 5 states that the function class F r d * has an implicit regularization effect: it contains the functions that change as little as possible over positive pairs, relative to their norm. Equation 6 reveals that this function class is also the optimal choice for least-squares approximation of a function satisfying Assumption 1.1. Together, these findings suggest that this representation should perform well as long as Assumption 1.1 holds. We make this intuition precise as follows. Consider the loss function ℓ(ŷ, y) := |ŷ -y|, and the associated risk R(g) = E [ℓ(g(A), Y )] of a predictor g : A → R. Let g * ∈ argmin R(g) be the lowest-risk predictor over the set of functions A → R, with risk R * = R(g * ), and assume it satisfies Assumption 1.1. Expand g * as a linear combination of the basis of eigenfunctions  (f i ) |A| i=1 as g * = |A| i=1 β * i f i , ∥β∥2≤R n -1 n i=1 |⟨β, r * d (A i )⟩ -Y i |. Then the expected excess risk of βR is bounded by: E E( βR ) ≤ 2dR √ n + √ d(∥β * ∥ 2 -R) + + ε 2(1 -λ d+1 ) where E(β) := R(β) -R * is the excess risk and (x) + := max{x, 0}.

Note that $\|\beta^*\|_2^2 = \sum_{i=1}^{d} \beta_i^{*2} \le \sum_{i=1}^{|A|} \beta_i^{*2} = \mathbb{E}[g^*(a)^2]$, so if we choose $R^2 \ge \mathbb{E}[g^*(a)^2]$ then the second term vanishes. The third term bounds the error incurred by using only the first $d$ eigenfunctions, since $\beta_i^{*2} \le \frac{\varepsilon}{2(1 - \lambda_i)}$ by Equation 3, and motivates choosing $d$ to be large enough that $\lambda_{d+1}$ is small. See Appendix D for proofs of Theorem 4.1 and Proposition 4.2.

5. RELATED WORK

Our work is closely connected to the spectral graph theory analysis by HaoChen et al. (2021) , which focuses on the eigenvectors of the normalized adjacency matrix of an augmentation graph, introduces the spectral contrastive loss, and shows that its optimum recovers the top eigenvectors up to an invertible transformation. We go further by arguing that the Spectral Contrastive, NT-XEnt, and NT-Logistic losses can all be viewed as approximating K + , and showing that the resulting eigenfunctions are minimax-optimal for reconstructing target functions satisfying Assumption 1.1. (Indeed, the eigenvectors analyzed by HaoChen et al. are equal to our eigenfunctions scaled by p(a) 1/2 ; see Appendix C.3.) Our work is also related to the analysis of non-linear CCA given by Lee et al. (2020) , which can be seen as an asymmetric variant of Kernel PCA with K + . We note that Assumption 1.1 can be recast in terms of the Laplacian matrix of the augmentation graph; similar assumptions have been used before for label propagation (Bengio et al., 2006) and Laplacian filtering (Zhou & Srebro, 2011) . This assumption is also used as a "consistency regularizer" for semi-supervised learning (Sajjadi et al., 2016; Laine & Aila, 2016) and self-supervised learning (Bardes et al., 2021) . There have been a number of other attempts to unify different contrastive learning techniques in a single theoretical framework. Balestriero & LeCun (2022) describe a unification using the framework of spectral embedding methods, and draw connections between SimCLR and Kernel ISOMAP. Tian (2022) provides a game-theoretic unification, and shows that contrastive losses are related to PCA in the input space for the special case of deep linear networks. Techniques such as SpIN (Pfau et al., 2018) and NeuralEF (Deng et al., 2022) have been proposed to learn spectral decompositions of kernels. 
When applied to the positive-pair kernel K + , it is possible to rewrite their objective in terms of paired views instead of requiring kernel evaluations, and the resulting decomposition is exactly the orthogonal basis of Markov chain eigenfunctions f i . Interestingly, modifying Neural EF in this manner yields an algorithm that closely resembles the Variance-Invariance-Covariance regularization (VICReg) self-supervised learning method proposed by Bardes et al. (2021) , as we discuss in Appendix E.2. See also Appendix B.4 for discussion of other connections between the positive-pair kernel and objectives considered by prior work.

6. EXPERIMENTS

It remains to determine how well learned approximations of the positive-pair kernel succeed at recovering the eigenfunction basis in practice. We explore this question on two synthetic testbed tasks for which the true kernel $K_+$ can be computed in closed form.

Datasets. Our first dataset is a simple "overlapping regions" toy problem, visualized in Figure 4. We define $A$ to be a set of grid points, and $Z$ to be a set of rectangular regions over the grid (shaded). We set $p(Z)$ to be a uniform distribution over regions, and $p(A|Z = z)$ to choose one grid point contained in $z$ at random. For a more natural distribution of data, our second dataset is derived from MNIST (LeCun et al., 2010), but with a carefully chosen augmentation process so that computing $K_+$ is tractable. Specifically, we choose $p(Z)$ to uniformly select from a small subset $Z$ of MNIST digits, and define $p(A|Z = z)$ by first transforming $z$ using one of a finite set of possible rotations and translations, then sampling a subset of $k$ pixels with replacement from the transformed copy. The finite set of allowed transformations and the tractable probability mass function of the multinomial distribution together enable us to compute $K_+$ by summing over all possible $z \in Z$. (Samples from this distribution for different values of $k$ are shown in Figure 2.)

Model training and eigenfunction estimation. We train contrastive learning models for each of these datasets using a variety of kernel parameterizations and loss functions. For kernel parameterizations, we consider both linear approximate kernels $K_\theta(a_1, a_2) = h_\theta(a_1)^\top h_\theta(a_2)$, where $h_\theta$ may be normalized to have constant norm $\|h_\theta(a)\|_2 = c$ or be left unconstrained, and hypersphere-based approximate kernels $K_\theta(a_1, a_2) = \exp(h_\theta(a_1)^\top h_\theta(a_2)/\tau + b)$, where $\|h_\theta(a)\|_2 = 1$ and $b \in \mathbb{R}$, $\tau \in \mathbb{R}_+$ are learned. We also explore either replacing $b$ with a learned per-example adjustment $s_\theta(a_1) + s_\theta(a_2)$ or fixing it to zero.
For losses, we investigate the XEnt, Logistic, and Spectral losses shown in Table 1, and explore using a downweighted Logistic loss as a regularizer for the XEnt loss to eliminate the underspecified proportionality constant in the minimizer of the XEnt loss.

For each approximate kernel and also for the true positive-pair kernel $K_+$, we apply Kernel PCA to extract the eigenfunctions $\hat f_i$ and their associated eigenvalues $\hat\lambda_i$. We use full-population Kernel PCA for the overlapping regions task, and combine the Nyström method (Williams & Seeger, 2000) with a large random subset of augmented examples to approximate Kernel PCA for the MNIST task. We additionally investigate training a NeuralEF model (Deng et al., 2022) to directly decompose the positive-pair kernel into principal components using a single model instead of separately training a contrastive learning model, as mentioned in Section 5. We modify the NeuralEF objective slightly to make it use positive-pair samples instead of kernel evaluations and to increase numerical stability. See Appendix E for additional discussion of the eigenfunction approximation process.

We then measure the alignment of each eigenfunction $\hat f_i$ of our learned models with the corresponding eigenfunction $f_j$ of $K_+$ using the formula $\mathbb{E}_{p(a)}[\hat f_i(a) f_j(a)]^2$, where the square is taken since eigenfunctions are only defined up to a change in sign. We also estimate the positive-pair discrepancy $\mathbb{E}_{p_+}[(\hat f_i(a_1) - \hat f_i(a_2))^2]$ for each approximate eigenfunction.

Results. We summarize a set of qualitative observations which are relevant to our theoretical claims in the previous sections. A representative subset of the results is also visualized in Figure 5. See Appendix F.1 for additional experimental results and a more thorough discussion of our findings.

Linear kernels, hypersphere kernels, and NeuralEF can all produce good approximations of the basis of eigenfunctions with sufficient tuning, despite their different parameterizations.
The relationship between approximate eigenvalues and positive-pair discrepancies also closely matches the prediction from Equation 4. Both of these relationships emerge during training and do not hold for a randomly initialized model. Additionally, for a fixed kernel parameterization, multiple losses can work well. In particular, we were able to train good hypersphere-based models on the toy regions task using either the XEnt loss or the spectral loss, although the latter required additional tuning to stabilize learning.

Constraints on the kernel approximation degrade eigenfunction and eigenvalue estimates. We find that introducing constraints on the output head tends to produce worse alignment between eigenfunctions, and leads to eigenvalues that deviate from the expected relationship in Equation 4. Such constraints include reducing the dimension of the output layer, rescaling the output layer for a linear kernel parameterization to have a fixed L2 norm (as proposed by HaoChen et al. (2021)), or fixing $b$ to zero for a hypersphere kernel parameterization (as is done implicitly by Chen et al. (2020a)).

Weaker augmentations make eigenfunction estimation more difficult. Finally, for the MNIST task, we find that as we decrease the augmentation strength (by increasing the number of kept pixels $k$), the number of successfully recovered eigenfunctions also decreases. This may be partially due to Kernel PCA having worse statistical behavior when there are smaller gaps between eigenvalues (specifically, more eigenvalues close to 1), since we observe this trend even when applying Kernel PCA to the exact closed-form $K_+$. However, we also observe that the relationship predicted by Equation 4 starts to break down for some kernel approximations under weak augmentation strengths.

7. DISCUSSION

We have shown that existing contrastive objectives approximate the kernel K + , and that Kernel PCA yields an optimal representation for linear prediction under Assumption 1.1. We have further demonstrated on two synthetic tasks that running contrastive learning followed by Kernel PCA can yield good approximate representations across multiple parameterizations and losses, although constrained parameterizations and weaker augmentations both reduce approximation quality. Our analysis (in particular Theorem 4.1) assumes that the distribution of views p(A) is shared between the contrastive learning task and the downstream learning task. In practice, the distribution of underlying examples p(Z) and the augmentation distribution p(A|Z) often change when finetuning a self-supervised pretrained model. An interesting future research direction would be to quantify how the minimax optimality of our representation is affected under such a distribution shift. Our analysis also focuses on the standard "linear evaluation protocol", which determines representation quality based on the accuracy of a linear predictor. This measurement of quality may not be directly applicable to tasks other than classification and regression (e.g. object detection and segmentation), or to other downstream learning methods (e.g. fine-tuning or k-nearest-neighbors). In these other settings, our theoretical framework is not directly applicable, but we might still hope that the view-invariant features arising from Kernel PCA with K + would be useful. Additionally, our analysis precisely characterizes the optimal representation if all we know about our target function g is that it satisfies Assumption 1.1, but in practice we often have additional knowledge about g. In particular, Saunshi et al. (2022) show that inductive biases are crucial for building good representations, and argue that standard distributions of augmentations are approximately disjoint, producing many eigenvalues very close to one. 
Interestingly, this is exactly the regime where the correspondence between the approximate kernels and the positive-pair kernel $K_+$ begins to break down in our experiments. We believe this opens up a number of exciting opportunities for research, including studying the training dynamics of parameterized kernels $K_\theta$ under a weak augmentation regime, analyzing the impact of additional assumptions about $g$ and inductive biases in $K_\theta$ on the minimax optimality of PCA representations, and exploring new model parameterizations and objectives that trade off between inductive biases and faithful approximation of $K_+$. More generally, the authors believe that the connections drawn in this work between contrastive learning, kernel methods, principal components analysis, and Markov chains provide a useful lens for theoretical study of self-supervised representation learning and also give new insights toward building useful representations in practice.

In this section we expand on the justification of Assumption 1.1 given in Section 1. Recall that during contrastive learning we have a distribution $p(Z)$ of original examples, and a distribution $p(A|Z = z)$ of augmented views for each original example $z$. Then, at downstream supervised learning time, we additionally have a distribution $p(Y = y|Z = z)$ of labels $y \in \mathbb{R}$ for each original example $z$. (For simplicity we assume that the distribution $p(Z)$ remains unchanged during the downstream learning step. We also assume $y$ is a scalar; the vector case can be derived by independently estimating each element of $Y$ with a separate target function.) Since the distribution of augmented views is typically defined via a random perturbation of $Z$ without using $Y$, the augmented views $A$ are conditionally independent of the label $Y$ given the original example $Z$, so we have joint distribution $p(a, y, z) = p(z)\,p(a|z)\,p(y|z)$.
When we draw a pair of augmented views for the same original example, we instead have the joint distribution $p(a_1, a_2, y, z) = p(z)\,p(a_1|z)\,p(a_2|z)\,p(y|z)$. We also have some target function $g : \mathcal{A} \to \mathbb{R}$ that we are trying to approximate. For instance, we might want to approximate a Bayes-optimal predictor of $Y$ under some loss function, such as $g(a) = \mathbb{E}[Y \mid A = a]$ for the quadratic loss. Common wisdom states that $p(A|Z)$ should remove irrelevant information from $Z$ without removing (much) information about $Y$ (most of the time). This means that, if augmentations are chosen appropriately, we can expect $\mathbb{E}[(g(A) - Y)^2]$ to be small under our joint probability distribution for an appropriate choice of $g$. We can then expand the left-hand side of Assumption 1.1 as
$$\begin{aligned} \mathbb{E}\big[(g(A_1) - g(A_2))^2\big] &= \mathbb{E}\big[\big((g(A_1) - Y) - (g(A_2) - Y)\big)^2\big] \\ &= \mathbb{E}\big[(g(A_1) - Y)^2 - 2(g(A_1) - Y)(g(A_2) - Y) + (g(A_2) - Y)^2\big] \\ &= 2\,\mathbb{E}\big[(g(A) - Y)^2\big] - 2\,\mathbb{E}\big[(g(A_1) - Y)(g(A_2) - Y)\big]. \end{aligned}$$
Furthermore, due to the law of total expectation combined with our conditional independence assumptions, we know that
$$\begin{aligned} \mathbb{E}\big[(g(A_1) - Y)(g(A_2) - Y)\big] &= \mathbb{E}_{Z,Y}\Big[\mathbb{E}_{A_1,A_2}\big[(g(A_1) - Y)(g(A_2) - Y) \mid Z, Y\big]\Big] \\ &= \mathbb{E}_{Z,Y}\Big[\mathbb{E}_{A_1}\big[g(A_1) - Y \mid Z, Y\big]\;\mathbb{E}_{A_2}\big[g(A_2) - Y \mid Z, Y\big]\Big] \\ &= \mathbb{E}_{Z,Y}\Big[\mathbb{E}_{A}\big[g(A) - Y \mid Z, Y\big]^2\Big] \ge 0. \end{aligned}$$
We can then conclude that
$$\mathbb{E}\big[(g(A_1) - g(A_2))^2\big] = 2\,\mathbb{E}\big[(g(A) - Y)^2\big] - 2\,\mathbb{E}\big[(g(A_1) - Y)(g(A_2) - Y)\big] \le 2\,\mathbb{E}\big[(g(A) - Y)^2\big].$$
Thus, Assumption 1.1 holds whenever $\mathbb{E}[(g(A) - Y)^2] \le \tfrac{1}{2}\varepsilon$. Intuitively, if it is possible to estimate $Y$ with high accuracy using a deterministic function, then that function must satisfy Assumption 1.1. Note that Assumption 1.1 itself may still hold even if $g$ is not a good estimate of $Y$, and is well defined even if we never specify $Y$ and work only with the joint distribution $p(A_1, A_2)$. This is particularly useful because it allows us to quantify over all possible choices of $g$ in a sensible way with minimal knowledge about the label distribution itself.
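The bound above can be sanity-checked on a small finite problem. The following is a minimal sketch, not part of the original analysis: the probability tables, the target function `g`, and the helper name `invariance_terms` are all hypothetical choices made for illustration. It computes the three expectations exactly and lets one confirm the identity and the resulting inequality.

```python
import numpy as np

# Hypothetical toy problem: 3 original examples z, 4 views a, binary label y.
p_z = np.array([0.5, 0.3, 0.2])                    # p(z)
p_a_given_z = np.array([[0.7, 0.2, 0.1, 0.0],      # p(a | z), rows indexed by z
                        [0.1, 0.6, 0.2, 0.1],
                        [0.0, 0.1, 0.3, 0.6]])
y_values = np.array([0.0, 1.0])
p_y_given_z = np.array([[0.1, 0.9],                # p(y | z), rows indexed by z
                        [0.5, 0.5],
                        [0.9, 0.1]])


def invariance_terms(g):
    """Return (E[(g(A1)-g(A2))^2], E[(g(A)-Y)^2], E[(g(A1)-Y)(g(A2)-Y)])."""
    # Positive-pair distribution p_+(a1, a2) = sum_z p(z) p(a1|z) p(a2|z).
    p_pos = p_a_given_z.T @ np.diag(p_z) @ p_a_given_z
    lhs = np.sum(p_pos * (g[:, None] - g[None, :]) ** 2)

    resid = g[:, None] - y_values[None, :]                     # (a, y): g(a) - y
    joint = (p_z[:, None, None] * p_a_given_z[:, :, None]
             * p_y_given_z[:, None, :])                        # (z, a, y)
    mse = np.sum(joint * resid[None, :, :] ** 2)

    # Views are independent given (z, y), so the cross term is a squared mean.
    mean_resid = p_a_given_z @ resid                           # (z, y)
    cross = np.sum(p_z[:, None] * p_y_given_z * mean_resid ** 2)
    return lhs, mse, cross


g = np.array([0.8, 0.6, 0.4, 0.1])                 # an arbitrary target function
lhs, mse, cross = invariance_terms(g)
```

Here `lhs` should equal `2 * mse - 2 * cross`, and since `cross` is a weighted sum of squares it is nonnegative, which gives `lhs <= 2 * mse`.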

B EXISTING OBJECTIVES ARE MINIMIZED BY THE POSITIVE-PAIR KERNEL

In this section, we show that the positive-pair kernel is the optimum of the objectives shown in Table 1 (although this optimum is not unique for the cross-entropy loss), and discuss some connections to other objectives considered by previous work. Throughout this section, we use $p(Z)$ to denote the true data distribution over unperturbed examples, $p(A|Z)$ to denote the distribution of augmented views conditioned on a particular unperturbed example, and $p(A)$ to denote the marginal distribution of augmented views, i.e. $p(A = a) = \sum_{z \in \mathcal{Z}} p(Z = z)\,p(A = a \mid Z = z)$. We use $p_+(A_1, A_2)$ to denote the positive pair distribution induced by $p(Z)$ and $p(A|Z)$, defined by
$$p_+(A_1 = a_1, A_2 = a_2) = \sum_{z \in \mathcal{Z}} p(Z = z)\,p(A = a_1 \mid Z = z)\,p(A = a_2 \mid Z = z).$$
For notational convenience, we will use the shorthand $p(z)$ for $p(Z = z)$, $p(a)$ for $p(A = a)$, and $p_+(a_1, a_2)$ for $p_+(A_1 = a_1, A_2 = a_2)$.

The NT-Xent objective is
$$L_{\text{NT-Xent}}(\theta) = \mathbb{E}_{(a_1^+, a_2^+) \sim p_+(A_1, A_2),\; a_i^- \sim p(A)} \left[ -\log \frac{\exp\!\big(h_\theta(a_1^+)^\top h_\theta(a_2^+)/\tau\big)}{\exp\!\big(h_\theta(a_1^+)^\top h_\theta(a_2^+)/\tau\big) + \sum_{a_i^-} \exp\!\big(h_\theta(a_1^+)^\top h_\theta(a_i^-)/\tau\big)} \right].$$
For simplicity, we assume that all of the negative samples $a_i^-$ are drawn independently from the marginal distribution when computing the loss for $a_1^+$ and $a_2^+$. (In practice, implementations often generate negative samples by taking elements of other positive pairs, e.g. $(a_1^-, a_2^-) \sim p_+(a_1^-, a_2^-)$, $(a_3^-, a_4^-) \sim p_+(a_3^-, a_4^-)$, and so on.) We can decompose this objective into two parts: an InfoNCE-like loss (Van den Oord et al., 2018)
$$L_{\text{InfoNCE}}(K_\theta) = \mathbb{E}_{(a_1^+, a_2^+) \sim p_+(A_1, A_2),\; a_i^- \sim p(A)} \left[ -\log \frac{K_\theta(a_1^+, a_2^+)}{K_\theta(a_1^+, a_2^+) + \sum_{a_i^-} K_\theta(a_1^+, a_i^-)} \right]$$
combined with a particular parameterized function
$$K_\theta(a_1, a_2) = \exp\!\big(h_\theta(a_1)^\top h_\theta(a_2)/\tau\big),$$
where $h_\theta : \mathcal{A} \to \mathbb{R}^{n+1}$ maps inputs to points on the $n$-dimensional hypersphere. We first observe that $K_\theta(a_1, a_2)$ defines a positive definite kernel, within the family of "dot product kernels".
Indeed, when $h_\theta$ is restricted to the unit hypersphere, this parameterization is equivalent to the squared-exponential kernel (also called the radial-basis-function kernel)
$$K_\theta(a_1, a_2) = \exp\!\left(\frac{1 - \tfrac{1}{2}\|h_\theta(a_1) - h_\theta(a_2)\|^2}{\tau}\right) = \exp(1/\tau)\,\exp\!\left(-\frac{\|h_\theta(a_1) - h_\theta(a_2)\|^2}{2\tau}\right),$$
using the identity $\|h_\theta(a_1) - h_\theta(a_2)\|^2 = 2 - 2\,h_\theta(a_1)^\top h_\theta(a_2)$. Although this kernel is positive definite, it has "infinite dimension" and cannot be expressed as an inner product of finite-dimensional embedding vectors; nevertheless, we can still run algorithms such as Kernel PCA on a finite dataset. (See Bach (2021, Chapter 7) for some additional background on positive-definite kernels, and Scetbon & Harchaoui (2021) for discussion of other dot product kernels on the unit hypersphere.)

We next discuss the minimum of the NT-Xent objective, under the unconstrained setting where we allow $K_\theta(a_1, a_2)$ to be an arbitrary symmetric function. The InfoNCE loss, in its more general form, is not necessarily symmetric: it is based on a distribution of contexts $p(C)$, positive samples $p(A|C)$, and negative samples drawn from the marginal distribution $p(A)$, and is given by
$$L_{\text{InfoNCE}}(f_\theta) = \mathbb{E}_{c \sim p(C),\; a^+ \sim p(A \mid C = c),\; a_i^- \sim p(A)} \left[ -\log \frac{f_\theta(c, a^+)}{f_\theta(c, a^+) + \sum_{a_i^-} f_\theta(c, a_i^-)} \right].$$
Van den Oord et al. (2018) show that every minimizer of this objective is of the form
$$f^*(c, a) = \frac{p(a \mid c)}{p(a)} \cdot b(c) = \frac{p(a, c)}{p(a)\,p(c)} \cdot b(c)$$
for some function $b(c)$ that does not depend on $a$. In other words, holding $c$ fixed, $f^*(c, a) \propto \frac{p(a, c)}{p(a)\,p(c)}$. Intuitively, this is because the exact probability of $(c, a^+)$ being the positive pair given $c$ and the set $\{a^+, a_1^-, \ldots, a_K^-\}$ is also proportional to this density ratio, and the InfoNCE objective is a cross-entropy objective for identifying the positive pair. (See also Poole et al. (2019) for a different proof.)
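The equivalence between the dot-product form and the squared-exponential form of the kernel is easy to check numerically. The snippet below is a small illustrative sketch; the function names, the embedding dimension, and the temperature value are assumptions chosen for the example, and random unit vectors stand in for the outputs of $h_\theta$.

```python
import numpy as np


def dot_product_kernel(h1, h2, tau):
    """exp(h1 . h2 / tau), the kernel used by NT-Xent."""
    return np.exp(h1 @ h2 / tau)


def rbf_form(h1, h2, tau):
    """exp(1/tau) * exp(-||h1 - h2||^2 / (2 tau)); equal to the above on the unit sphere."""
    sq_dist = np.sum((h1 - h2) ** 2)
    return np.exp(1.0 / tau) * np.exp(-sq_dist / (2.0 * tau))


rng = np.random.default_rng(0)
h = rng.normal(size=(2, 5))
h /= np.linalg.norm(h, axis=1, keepdims=True)   # project embeddings onto the unit sphere
tau = 0.5                                        # hypothetical temperature

k_dot = dot_product_kernel(h[0], h[1], tau)
k_rbf = rbf_form(h[0], h[1], tau)
```

The two values agree only because the embeddings are normalized; without the normalization step the identity $\|h_1 - h_2\|^2 = 2 - 2 h_1^\top h_2$ no longer holds.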
In the case of the NT-Xent contrastive objective, we choose the context $C$ to be one of the augmentations $A_1^+$, and the positive sample to be the other augmentation $A_2^+$ drawn according to $p_+(A_2^+ \mid A_1^+)$. We furthermore restrict our attention to symmetric functions $K_\theta$, i.e. functions for which $K_\theta(a_1, a_2) = K_\theta(a_2, a_1)$. In this case, the minimizer is
$$K^*(a_1, a_2) = \frac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)} \cdot b(a_1) = \frac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)} \cdot b(a_2),$$
so we must have $b(a_1) = b(a_2)$ for any pair $(a_1, a_2)$ for which $p_+(a_1, a_2) > 0$. If we assume that the positive pair Markov chain is irreducible, i.e. that there is a single communicating class and it is possible to reach any augmentation in $\mathcal{A}$ from any other augmentation over a long enough trajectory, then the function $b(a)$ must be constant everywhere, and thus
$$K^*(a_1, a_2) = f^*(a_1, a_2) = \frac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)} \cdot B$$
for some $B \in \mathbb{R}_+$. In this case, $K^*$ is equivalent to $K_+$ up to a scaling constant. If the Markov chain has multiple communicating classes (e.g. if the augmentations can be partitioned so that positive pairs always come from the same partition), the function $b(a)$ may assign a different value to different communicating classes. Nevertheless, any such minimizer is still a kernel, since we can write it as
$$K^*(a_1, a_2) = \left\langle b(a_1) \begin{bmatrix} \frac{p(a_1|z_1)\sqrt{p(z_1)}}{p(a_1)} \\ \frac{p(a_1|z_2)\sqrt{p(z_2)}}{p(a_1)} \\ \vdots \\ \frac{p(a_1|z_{|\mathcal{Z}|})\sqrt{p(z_{|\mathcal{Z}|})}}{p(a_1)} \end{bmatrix},\; b(a_2) \begin{bmatrix} \frac{p(a_2|z_1)\sqrt{p(z_1)}}{p(a_2)} \\ \frac{p(a_2|z_2)\sqrt{p(z_2)}}{p(a_2)} \\ \vdots \\ \frac{p(a_2|z_{|\mathcal{Z}|})\sqrt{p(z_{|\mathcal{Z}|})}}{p(a_2)} \end{bmatrix} \right\rangle = \big\langle b(a_1) \cdot \phi_+(a_1),\; b(a_2) \cdot \phi_+(a_2) \big\rangle.$$
Indeed, such a minimizer is equivalent to $K_+$ except that it scales the inner product by the value of $b(a)$ for each communicating class.
It is still possible to extract the set of Markov chain eigenfunctions from the set of principal components of this kernel, although one must correct for the scaling factor when computing the eigenvalues; see Appendix C.4 for details. Alternatively, one can ensure a unique minimum by combining the NT-Xent/InfoNCE loss with either the spectral or logistic losses (discussed below).

B.2 LOGISTIC LOSSES AND NT-LOGISTIC

Logistic losses have also been proposed for contrastive learning, including the NT-Logistic objective as described in Chen et al. (2020a) and other versions described by Mikolov et al. (2013) and Tosh et al. (2021). Such losses take the form
$$L_{\text{Logistic}}(f_\theta) = \mathbb{E}_{(a_1^+, a_2^+) \sim p_+(A_1, A_2)}\big[-\log \sigma(f_\theta(a_1^+, a_2^+))\big] + \mathbb{E}_{a_1^- \sim p(A),\; a_2^- \sim p(A)}\big[-\log \sigma(-f_\theta(a_1^-, a_2^-))\big],$$
where negative samples are drawn independently from the marginal distribution $p(A)$. Tosh et al. (2021) motivate this loss based on a binary classification task: choose a label $Y$ to be 0 or 1 with probability 1/2 each, sample a positive pair $(a_1, a_2) \sim p_+(A_1, A_2)$ if $Y = 1$ and a negative pair $a_1 \sim p(A)$, $a_2 \sim p(A)$ if $Y = 0$, then use a learned model to predict $Y$ given the pair. The minimizer of this loss is then the conditional log-odds-ratio
$$f^*(a_1, a_2) = \log \frac{p(Y = 1 \mid a_1, a_2)}{p(Y = 0 \mid a_1, a_2)} = \log \frac{p(a_1, a_2 \mid Y = 1)\,p(Y = 1)}{p(a_1, a_2 \mid Y = 0)\,p(Y = 0)} = \log \frac{p_+(a_1, a_2) \cdot \tfrac{1}{2}}{p(a_1)\,p(a_2) \cdot \tfrac{1}{2}} = \log \frac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)}.$$
For the particular case of the NT-Logistic objective, we parameterize $f_\theta$ as
$$f_\theta(a_1, a_2) = \log K_\theta(a_1, a_2) = h_\theta(a_1)^\top h_\theta(a_2)/\tau,$$
where we again define $K_\theta(a_1, a_2) = \exp\!\big(h_\theta(a_1)^\top h_\theta(a_2)/\tau\big)$. The optimum (if we ignore the constraints of this particular form of $K_\theta$ and minimize over all functions of two variables) is then
$$K^*(a_1, a_2) = \frac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)}.$$
Note that in this case there is no proportionality constant.
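To make the minimizer concrete, the sketch below evaluates the population logistic loss exactly on a small finite problem and compares the log density ratio against randomly perturbed critics. The probability tables and the names `logistic_loss`, `f_star`, and `perturbed` are hypothetical; the toy distribution is chosen so that every positive pair has nonzero probability, which keeps the log ratio finite.

```python
import numpy as np

# Hypothetical toy problem: 3 original examples z, 4 views a.
p_z = np.array([0.5, 0.3, 0.2])
p_a_given_z = np.array([[0.7, 0.2, 0.1, 0.0],
                        [0.1, 0.6, 0.2, 0.1],
                        [0.0, 0.1, 0.3, 0.6]])
p_a = p_z @ p_a_given_z
p_pos = p_a_given_z.T @ np.diag(p_z) @ p_a_given_z   # positive pairs p_+(a1, a2)
p_neg = np.outer(p_a, p_a)                            # independent pairs p(a1) p(a2)


def logistic_loss(f):
    """Population logistic loss for a critic f given as an |A| x |A| matrix."""
    # -log sigmoid(x) = log(1 + exp(-x)) = logaddexp(0, -x)
    return (np.sum(p_pos * np.logaddexp(0.0, -f))
            + np.sum(p_neg * np.logaddexp(0.0, f)))


f_star = np.log(p_pos / p_neg)        # claimed minimizer: the log density ratio
rng = np.random.default_rng(0)
perturbed = [logistic_loss(f_star + 0.1 * rng.normal(size=f_star.shape))
             for _ in range(20)]
best = logistic_loss(f_star)
```

Because the loss is strictly convex in each entry of `f`, every perturbed critic should have a strictly larger loss than `f_star`, and `np.exp(f_star)` recovers the positive-pair kernel with no proportionality constant.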

B.3 THE SPECTRAL CONTRASTIVE LOSS

HaoChen et al. (2021) propose the Spectral Contrastive Loss as an alternative to other contrastive losses with provable performance guarantees. The loss is defined as
$$L_{\text{Spectral}}(K_\theta) = -2 \cdot \mathbb{E}_{(a_1^+, a_2^+) \sim p_+(A_1, A_2)}\big[K_\theta(a_1^+, a_2^+)\big] + \mathbb{E}_{a_1^- \sim p(A),\; a_2^- \sim p(A)}\big[K_\theta(a_1^-, a_2^-)^2\big],$$
where they choose $K_\theta(a_1, a_2) = h_\theta(a_1)^\top h_\theta(a_2)$ for a learned embedding function $h_\theta : \mathcal{A} \to \mathbb{R}^d$. We note that this directly satisfies the definition of a kernel, in that it is an inner product in a transformed space. One interesting property of this kernel approximation is that it can be negative, whereas the exponential-based kernel approximations in the previous sections are always nonnegative. HaoChen et al. show that this loss can be rewritten as
$$\begin{aligned} L_{\text{Spectral}}(K_\theta) &= \sum_{a_1, a_2} \Big[ -2\,p_+(a_1, a_2)\,K_\theta(a_1, a_2) + p(a_1)\,p(a_2)\,K_\theta(a_1, a_2)^2 \Big] \\ &= \sum_{a_1, a_2} \left[ \frac{p_+(a_1, a_2)^2}{p(a_1)\,p(a_2)} - 2\,p_+(a_1, a_2)\,K_\theta(a_1, a_2) + p(a_1)\,p(a_2)\,K_\theta(a_1, a_2)^2 \right] - \sum_{a_1, a_2} \frac{p_+(a_1, a_2)^2}{p(a_1)\,p(a_2)} \\ &= \sum_{a_1, a_2} \left( \frac{p_+(a_1, a_2)}{\sqrt{p(a_1)\,p(a_2)}} - \sqrt{p(a_1)\,p(a_2)}\;K_\theta(a_1, a_2) \right)^2 - C, \end{aligned}$$
where $C = \sum_{a_1, a_2} \frac{p_+(a_1, a_2)^2}{p(a_1)\,p(a_2)}$ is a constant independent of the model. If we again ignore the constraints on $K_\theta$, the minimum of the spectral loss must occur when
$$\frac{p_+(a_1, a_2)}{\sqrt{p(a_1)\,p(a_2)}} = \sqrt{p(a_1)\,p(a_2)}\;K_\theta(a_1, a_2)$$
for all $a_1$ and $a_2$, or in other words, when
$$K_\theta(a_1, a_2) = \frac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)} = K_+(a_1, a_2).$$
HaoChen et al. continue by expanding their definition of $K_\theta$ for a fixed representation dimension $d$, and showing that it relates to the spectral decomposition of a particular augmentation graph. It turns out that this decomposition is equivalent to our decomposition in terms of eigenfunctions except for a scaling factor of $p(a)^{1/2}$; we discuss this connection more in Appendix C.3.
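The completed-square form of the spectral loss can be verified directly on a finite problem. The following sketch uses hypothetical probability tables and hypothetical helper names (`spectral_loss`, `completed_square`); it evaluates both forms of the population loss for an arbitrary symmetric matrix and for the positive-pair kernel itself.

```python
import numpy as np

# Hypothetical toy problem: 3 original examples z, 4 views a.
p_z = np.array([0.5, 0.3, 0.2])
p_a_given_z = np.array([[0.7, 0.2, 0.1, 0.0],
                        [0.1, 0.6, 0.2, 0.1],
                        [0.0, 0.1, 0.3, 0.6]])
p_a = p_z @ p_a_given_z
p_pos = p_a_given_z.T @ np.diag(p_z) @ p_a_given_z   # p_+(a1, a2)
p_neg = np.outer(p_a, p_a)                            # p(a1) p(a2)
C = np.sum(p_pos ** 2 / p_neg)                        # model-independent constant


def spectral_loss(K):
    """Population spectral contrastive loss for a kernel matrix K."""
    return -2.0 * np.sum(p_pos * K) + np.sum(p_neg * K ** 2)


def completed_square(K):
    """The same loss written as a sum of squares minus the constant C."""
    root = np.sqrt(p_neg)
    return np.sum((p_pos / root - root * K) ** 2) - C


K_plus = p_pos / p_neg                                # positive-pair kernel
rng = np.random.default_rng(0)
noise = rng.normal(size=K_plus.shape)
K_other = K_plus + 0.1 * (noise + noise.T)            # some other symmetric candidate
```

At `K_plus` every squared term vanishes, so the loss attains its lower bound `-C`; any other candidate such as `K_other` leaves a strictly positive sum of squares.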

B.4 OTHER RELATED OBJECTIVES AND CONNECTIONS TO THE POSITIVE-PAIR KERNEL

We note that the log-probability ratio $\log \frac{p(u,v)}{p(u)\,p(v)}$ has an information-theoretic interpretation as the pointwise mutual information between two random variates $u$ and $v$. We can thus view the positive-pair kernel $K_+$ as being an exponentiated version of the pointwise mutual information between two views $A_1$ and $A_2$. This ratio has been shown to be an optimal critic for mutual information estimation (Poole et al., 2019; Nowozin et al., 2016; Hjelm et al., 2018). Moustakides & Basioti (2019) describe several sample-based estimators for probability ratios between arbitrary densities or mass functions. These estimators can be seen as generalizations of the contrastive losses in Table 1, with the goal of estimating the ratio between the positive and negative pair distributions. The VICReg self-supervised learning technique can be reinterpreted in terms of the positive-pair kernel as a particular form of kernel decomposition method. See Appendix E.2 for discussion of this connection. Although the parameterized kernels in Table 1 have a fairly simple form, some prior work has considered more sophisticated parameterizations for learning kernels with neural networks (Wilson et al., 2016; Sun et al., 2018). There have also been recent works related to reinterpreting existing neural network models as kernels (Jacot et al., 2018; Shankar et al., 2020; Amid et al., 2022). It would be interesting to compare the properties of these other kernel parameterizations with the implicit kernels involved in contrastive learning methods. Finally, we note that early approaches to contrastive learning such as DrLIM (Hadsell et al., 2006) were motivated in part by removing limitations of previous spectral embedding techniques, which required explicitly selecting an input-space distance metric or kernel function (e.g. Bengio et al. (2003)).
Our analysis reveals that, for modern contrastive learning methods, choosing the distribution of augmentations can still be seen as implicitly defining a kernel function of this form. Conveniently, however, we do not need to be able to evaluate this kernel to train contrastive learning models; we only need to sample augmentations.

EIGENFUNCTIONS

In this section, we describe the relationship between the positive-pair kernel principal components and the Markov chain eigenfunctions in more detail.

C.1 NOTATION

We start by introducing some notation that will be useful. Throughout, we will identify functions $f : \mathcal{A} \to \mathbb{R}$ with vectors $f \in \mathbb{R}^{\mathcal{A}}$, which will allow us to use matrix notation for many of the relevant quantities. We will also use $e_i$ to represent the vector that has a one at the $i$th position and zeros in all other positions. We will let
$$D_Z = \operatorname{diag}\big(p(z_1), p(z_2), \ldots, p(z_{|\mathcal{Z}|})\big), \qquad D_A = \operatorname{diag}\big(p(a_1), p(a_2), \ldots, p(a_{|\mathcal{A}|})\big)$$
be diagonal matrices containing the marginal probabilities of each element in $\mathcal{Z}$ and $\mathcal{A}$, respectively, under the true data distribution. We will assume that the distribution has full support, and thus both $D_Z$ and $D_A$ are invertible. We also define the matrices
$$[P_{Z,A}]_{z,a} = p(z, a), \qquad [P_{Z \to A}]_{z,a} = p(a|z), \qquad [P_{Z \leftarrow A}]_{z,a} = p(z|a),$$
$$[P_{A,Z}]_{a,z} = p(z, a), \qquad [P_{A \leftarrow Z}]_{a,z} = p(a|z), \qquad [P_{A \to Z}]_{a,z} = p(z|a).$$
Equivalently,
$$P_{Z \to A} = D_Z^{-1} P_{Z,A}, \qquad P_{Z \leftarrow A} = P_{Z,A} D_A^{-1}, \qquad P_{A \leftarrow Z} = (P_{Z \to A})^\top, \qquad P_{A \to Z} = (P_{Z \leftarrow A})^\top.$$
From these, we can construct the positive pair probability matrix
$$P_{A,A} = P_{A \leftarrow Z}\, D_Z\, P_{Z \to A}, \qquad [P_{A,A}]_{i,j} = p(A_1 = i, A_2 = j) = \sum_z p(A = i \mid z)\,p(z)\,p(A = j \mid z).$$
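For a finite problem these matrices are straightforward to build. The sketch below is illustrative only: the function name `build_matrices`, the variable names, and the toy probability tables are assumptions, with rows indexed by $z$ and columns by $a$ to match the $[\cdot]_{z,a}$ convention above.

```python
import numpy as np


def build_matrices(p_z, p_a_given_z):
    """Build the matrices of this section from p(z) and p(a | z)."""
    D_Z = np.diag(p_z)
    P_Z_to_A = p_a_given_z                        # [z, a] = p(a | z)
    P_ZA = D_Z @ P_Z_to_A                         # [z, a] = p(z, a)
    p_a = P_ZA.sum(axis=0)                        # marginal p(a)
    D_A = np.diag(p_a)
    P_Z_from_A = P_ZA @ np.linalg.inv(D_A)        # [z, a] = p(z | a)
    P_AA = P_Z_to_A.T @ D_Z @ P_Z_to_A            # [i, j] = p_+(a_i, a_j)
    return D_Z, D_A, P_ZA, P_Z_to_A, P_Z_from_A, P_AA


# Hypothetical toy problem: 3 original examples z, 4 views a.
p_z = np.array([0.5, 0.3, 0.2])
p_a_given_z = np.array([[0.7, 0.2, 0.1, 0.0],
                        [0.1, 0.6, 0.2, 0.1],
                        [0.0, 0.1, 0.3, 0.6]])
D_Z, D_A, P_ZA, P_Z_to_A, P_Z_from_A, P_AA = build_matrices(p_z, p_a_given_z)
```

The positive pair matrix `P_AA` is symmetric, sums to one, and has the marginal $p(a)$ as its row sums, which is what makes the positive-pair Markov chain reversible with stationary distribution $p(a)$.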

C.2 PROOF OF CORRESPONDENCE BETWEEN KERNEL PCA AND MARKOV CHAIN EIGENFUNCTIONS

The positive-pair kernel. Writing the positive-pair kernel $K_+$ in matrix form, such that $[K_+]_{i,j} = K_+(i, j)$, our definition $K_+(a_1, a_2) = \frac{p_+(a_1, a_2)}{p(a_1)\,p(a_2)}$ becomes the matrix equation
$$K_+ = D_A^{-1} P_{A,A} D_A^{-1}.$$
One way to expand $K_+$ is as a product
$$K_+ = D_A^{-1} P_{A \leftarrow Z}\, D_Z\, P_{Z \to A} D_A^{-1} = \big(D_Z^{1/2} P_{Z \to A} D_A^{-1}\big)^\top \big(D_Z^{1/2} P_{Z \to A} D_A^{-1}\big) = \Phi_+^\top \Phi_+,$$
where $\Phi_+ \in \mathbb{R}^{\mathcal{Z} \times \mathcal{A}}$ is a matrix whose columns are given by $\phi_+$:
$$\Phi_+ = D_Z^{1/2} P_{Z \to A} D_A^{-1} = \begin{bmatrix} \phi_+(a_1) & \phi_+(a_2) & \cdots & \phi_+(a_{|\mathcal{A}|}) \end{bmatrix} = \begin{bmatrix} \frac{p(a_1|z_1)\sqrt{p(z_1)}}{p(a_1)} & \frac{p(a_2|z_1)\sqrt{p(z_1)}}{p(a_2)} & \cdots & \frac{p(a_{|\mathcal{A}|}|z_1)\sqrt{p(z_1)}}{p(a_{|\mathcal{A}|})} \\ \frac{p(a_1|z_2)\sqrt{p(z_2)}}{p(a_1)} & \frac{p(a_2|z_2)\sqrt{p(z_2)}}{p(a_2)} & \cdots & \frac{p(a_{|\mathcal{A}|}|z_2)\sqrt{p(z_2)}}{p(a_{|\mathcal{A}|})} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{p(a_1|z_{|\mathcal{Z}|})\sqrt{p(z_{|\mathcal{Z}|})}}{p(a_1)} & \frac{p(a_2|z_{|\mathcal{Z}|})\sqrt{p(z_{|\mathcal{Z}|})}}{p(a_2)} & \cdots & \frac{p(a_{|\mathcal{A}|}|z_{|\mathcal{Z}|})\sqrt{p(z_{|\mathcal{Z}|})}}{p(a_{|\mathcal{A}|})} \end{bmatrix}.$$
Equivalently, we have $\phi_+(a) = \Phi_+ e_a$. Since $K_+(a_1, a_2) = \phi_+(a_1)^\top \phi_+(a_2)$, $\phi_+$ is called a feature map for $K_+$. Note that there are multiple possible feature maps for $K_+$: given any orthogonal matrix $Q$, the function $\phi_Q(a) = Q\phi_+(a)$ is also a feature map for the kernel, since
$$\phi_Q(a_1)^\top \phi_Q(a_2) = \phi_+(a_1)^\top Q^\top Q\, \phi_+(a_2) = \phi_+(a_1)^\top \phi_+(a_2) = K_+(a_1, a_2).$$
Performing kernel PCA under $K_+$ is equivalent to performing ordinary PCA over any of its feature maps (since the principal component projection functions are independent of the particular feature map chosen). We thus focus on analyzing the principal component projection functions for the feature map $\phi_+$. The population-level principal components are the eigenvectors of the (uncentered) covariance matrix
$$\Sigma = \mathbb{E}_{p(a)}\big[\phi_+(a)\phi_+(a)^\top\big] = \mathbb{E}_{p(a)}\big[\Phi_+ e_a e_a^\top \Phi_+^\top\big] = \Phi_+\, \mathbb{E}_{p(a)}\big[e_a e_a^\top\big]\, \Phi_+^\top = \Phi_+ D_A \Phi_+^\top.$$
Note that we are working with the uncentered principal components, as is common for kernel PCA: we do not subtract the mean before computing the covariance.
Since $\Sigma$ is positive semidefinite, it can be diagonalized as
$$\Sigma = U \operatorname{diag}(\sigma^2)\, U^\top = \sum_i \sigma_i^2\, u_i u_i^\top,$$
where $U = [u_1, u_2, \ldots, u_k]$ is orthogonal and $\operatorname{diag}(\sigma^2) = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2)$ is a diagonal matrix of eigenvalues (here $k = |\mathcal{Z}|$ is the dimension of the feature map). Each of the vectors $u_i$ is one of the population principal components of the transformed distribution $\phi_+(A)$, giving the directions of maximum variance, and the $\sigma_i^2$ measure the variance in that direction. Given a new augmentation $a \in \mathcal{A}$, we can then compute the projection of $\phi_+(a)$ onto each of these principal component directions as $h_i(a) = u_i^\top \phi_+(a)$.

The Markov chain. We now turn our attention to the positive pair Markov chain. The Markov chain transition matrix is defined by $[P_{A \leftarrow A}]_{a_1, a_2} = p_+(a_1 \mid a_2)$, or in matrix form $P_{A \leftarrow A} = P_{A,A} D_A^{-1}$. We are interested in the left eigenvectors $f_i^\top P_{A \leftarrow A} = \lambda_i f_i^\top$ of this matrix $P_{A \leftarrow A}$, or equivalently the right eigenvectors of its transpose $P_{A \leftarrow A}^\top = P_{A \to A}$, given by $P_{A \to A} f_i = \lambda_i f_i$. Observe that then $D_A^{-1} P_{A,A} f_i = \lambda_i f_i$, so equivalently
$$D_A^{-1/2} \big(D_A^{-1/2} P_{A,A} D_A^{-1/2}\big) D_A^{1/2} f_i = \lambda_i\, D_A^{-1/2} D_A^{1/2} f_i.$$
It follows that $D_A^{1/2} f_i$ is an eigenvector of the symmetric matrix $M = D_A^{-1/2} P_{A,A} D_A^{-1/2}$ with the same eigenvalue $\lambda_i$. (We note that the matrix $M$ is exactly the symmetrized adjacency matrix described by HaoChen et al. (2021).) We can now diagonalize $M$ as $M = V \Lambda V^\top$ where $V$ is orthogonal, and then write
$$V = \begin{bmatrix} D_A^{1/2} f_1 & D_A^{1/2} f_2 & \cdots & D_A^{1/2} f_k \end{bmatrix} = D_A^{1/2} \begin{bmatrix} f_1 & f_2 & \cdots & f_k \end{bmatrix} = D_A^{1/2} F,$$
where $F = [f_1\; f_2\; \cdots\; f_k]$. Consider an arbitrary function $g : \mathcal{A} \to \mathbb{R}$, and let $g \in \mathbb{R}^{\mathcal{A}}$ be its vector form, so that $g(a) = g_a = g^\top e_a$. Also define $c_i = \mathbb{E}[g(a) f_i(a)]$ and $c = [c_1\; c_2\; \cdots\; c_k]^\top$. Then
$$c = \mathbb{E}\big[g(a)\, F^\top e_a\big] = \mathbb{E}\big[F^\top e_a e_a^\top g\big] = F^\top D_A\, g = V^\top D_A^{1/2} g,$$
so we must have
$$g = D_A^{-1/2} \big(V^\top\big)^{-1} c = D_A^{-1/2} V c = F c,$$
and thus $g(a) = \sum_i c_i f_i(a)$.
Additionally, we see that
$$\mathbb{E}\big[g(a)^2\big] = \mathbb{E}\big[g^\top e_a e_a^\top g\big] = g^\top D_A\, g = c^\top F^\top D_A F c = c^\top \big(D_A^{1/2} F\big)^\top \big(D_A^{1/2} F\big) c = c^\top V^\top V c = c^\top c = \sum_i c_i^2.$$
Note that this also implies that the functions $f_i$ are orthonormal under the base measure $p(a)$, i.e. $\mathbb{E}[f_i(a)^2] = 1$ and $\mathbb{E}[f_i(a) f_j(a)] = 0$ for $i \ne j$. We can now prove our main results from Section 3.

Proposition 3.1. If $g : \mathcal{A} \to \mathbb{R}$ and $c_i = \mathbb{E}[f_i(a) g(a)]$, then
$$\mathbb{E}_{p_+(a_1, a_2)}\big[(g(a_1) - g(a_2))^2\big] = \sum_i (2 - 2\lambda_i)\, c_i^2.$$

Proof. Observe that
$$\mathbb{E}_{p_+}\big[(g(a_1) - g(a_2))^2\big] = \mathbb{E}_{p_+}\big[g(a_1)^2 - 2 g(a_1) g(a_2) + g(a_2)^2\big] = 2\,\mathbb{E}\big[g(a)^2\big] - 2\,\mathbb{E}_{p_+}\big[g(a_1) g(a_2)\big] = 2 \sum_i c_i^2 - 2\,\mathbb{E}_{p_+}\big[g(a_1) g(a_2)\big].$$
Expanding the second term, we have
$$\begin{aligned} \mathbb{E}_{p_+}\big[g(a_1) g(a_2)\big] &= \mathbb{E}_{p_+}\big[g^\top e_{a_1} e_{a_2}^\top g\big] = g^\top\, \mathbb{E}_{p_+}\big[e_{a_1} e_{a_2}^\top\big]\, g = g^\top P_{A,A}\, g \\ &= g^\top D_A^{1/2} M D_A^{1/2} g = g^\top D_A^{1/2} V \Lambda V^\top D_A^{1/2} g \\ &= c^\top F^\top D_A^{1/2} V \Lambda V^\top D_A^{1/2} F c = c^\top \big(D_A^{1/2} F\big)^\top V \Lambda V^\top \big(D_A^{1/2} F\big) c \\ &= c^\top V^\top V \Lambda V^\top V c = c^\top \Lambda c = \sum_i \lambda_i c_i^2. \end{aligned}$$
We conclude that $\mathbb{E}_{p_+}\big[(g(a_1) - g(a_2))^2\big] = 2 \sum_i (1 - \lambda_i)\, c_i^2$.

Theorem 3.2. The output $(h_1, \sigma_1^2), (h_2, \sigma_2^2), \ldots$ of population-level Kernel PCA under $K_+$ and the orthonormal basis of eigenfunctions $f_i$ of $P$ with eigenvalues $\lambda_i$ satisfy $\sigma_i^2 = \lambda_i$ and $h_i(a) = \sigma_i f_i(a) = \lambda_i^{1/2} f_i(a)$ for all $i$ and all $a \in \mathcal{A}$ (up to reordering and multiplicity of eigenspaces).

Proof. Consider the matrix $B = \Phi_+ D_A^{1/2}$, where $\Phi_+ = D_Z^{1/2} P_{Z \to A} D_A^{-1}$ is as described above. Take the singular value decomposition $B = U \Lambda^{1/2} V^\top$, where $U$ and $V$ are orthogonal and $\Lambda^{1/2}$ is diagonal. Now observe that
$$B B^\top = U \Lambda U^\top = \Phi_+ D_A \Phi_+^\top = \Sigma,$$
$$B^\top B = V \Lambda V^\top = D_A^{1/2} \Phi_+^\top \Phi_+ D_A^{1/2} = D_A^{1/2} K_+ D_A^{1/2} = D_A^{1/2} D_A^{-1} P_{A,A} D_A^{-1} D_A^{1/2} = D_A^{-1/2} P_{A,A} D_A^{-1/2} = M.$$
Thus, $\Sigma$ and $M$ must have the same nonzero eigenvalues. Diagonalize $\Sigma$ and $M$ in terms of $U$ and $V$ and define $h_i(a)$ and $f_i(a)$ according to that diagonalization. We then have
$$\begin{aligned} h_i(a) &= u_i^\top \phi_+(a) = e_i^\top U^\top \Phi_+ e_a = e_i^\top U^\top B D_A^{-1/2} e_a = e_i^\top U^\top U \Lambda^{1/2} V^\top D_A^{-1/2} e_a \\ &= e_i^\top \Lambda^{1/2} V^\top D_A^{-1/2} e_a = e_i^\top \Lambda^{1/2} \big(D_A^{-1/2} V\big)^\top e_a = e_i^\top \Lambda^{1/2} F^\top e_a \\ &= e_i^\top \Lambda^{1/2} \begin{bmatrix} f_1 & f_2 & \cdots & f_k \end{bmatrix}^\top e_a = \lambda_i^{1/2} f_i^\top e_a = \lambda_i^{1/2} f_i(a). \end{aligned}$$

One interesting consequence of this relationship is that the positive-pair kernel is fully determined by the eigenfunctions and their eigenvalues, and we can write the kernel function directly as a weighted dot product of this representation:

Proposition C.1. For any $a_1, a_2 \in \mathcal{A}$, we have $K_+(a_1, a_2) = \sum_i \lambda_i f_i(a_1) f_i(a_2)$, where the $f_i$ and $\lambda_i$ are as defined in Section 3.

Proof. Using the matrix notation and definitions described above, algebraic manipulation shows that
$$\begin{aligned} K_+(a_1, a_2) &= e_{a_1}^\top K_+ e_{a_2} = e_{a_1}^\top D_A^{-1} P_{A,A} D_A^{-1} e_{a_2} = e_{a_1}^\top D_A^{-1/2} M D_A^{-1/2} e_{a_2} = e_{a_1}^\top D_A^{-1/2} V \Lambda V^\top D_A^{-1/2} e_{a_2} \\ &= e_{a_1}^\top D_A^{-1/2} \big(D_A^{1/2} F\big) \Lambda \big(D_A^{1/2} F\big)^\top D_A^{-1/2} e_{a_2} = e_{a_1}^\top F \Lambda F^\top e_{a_2} = \sum_i \lambda_i f_i(a_1) f_i(a_2). \end{aligned}$$
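Theorem 3.2 and Proposition C.1 can both be checked numerically on a finite problem. The sketch below is one possible way to do so under stated assumptions: the toy probability tables and the function name `kernel_pca_vs_markov` are hypothetical, and the check relies on the eigenvalues of this particular toy chain being distinct, so that eigenvectors are determined up to sign.

```python
import numpy as np

# Hypothetical toy problem: 3 original examples z, 4 views a.
p_z = np.array([0.5, 0.3, 0.2])
p_a_given_z = np.array([[0.7, 0.2, 0.1, 0.0],
                        [0.1, 0.6, 0.2, 0.1],
                        [0.0, 0.1, 0.3, 0.6]])


def kernel_pca_vs_markov(p_z, p_a_given_z):
    """Return population Kernel PCA output and Markov chain eigenfunctions."""
    p_a = p_z @ p_a_given_z
    P_AA = p_a_given_z.T @ np.diag(p_z) @ p_a_given_z

    # Feature map Phi_+ = D_Z^{1/2} P_{Z->A} D_A^{-1}; column a is phi_+(a).
    Phi = np.sqrt(p_z)[:, None] * p_a_given_z / p_a[None, :]
    Sigma = Phi @ np.diag(p_a) @ Phi.T             # uncentered covariance
    sigma2, U = np.linalg.eigh(Sigma)
    sigma2, U = sigma2[::-1], U[:, ::-1]           # sort in descending order
    H = U.T @ Phi                                  # H[i, a] = h_i(a)

    # Symmetric matrix M = D_A^{-1/2} P_AA D_A^{-1/2} and f_i = D_A^{-1/2} v_i.
    M = P_AA / np.sqrt(np.outer(p_a, p_a))
    lam, V = np.linalg.eigh(M)
    lam, V = lam[::-1], V[:, ::-1]
    F = V / np.sqrt(p_a)[:, None]                  # column i is f_i

    K_plus = Phi.T @ Phi                           # positive-pair kernel matrix
    return sigma2, H, lam, F, K_plus


sigma2, H, lam, F, K_plus = kernel_pca_vs_markov(p_z, p_a_given_z)
```

In this example $|\mathcal{Z}| = 3$ and $|\mathcal{A}| = 4$, so `M` has one extra eigenvalue equal to zero that has no counterpart in `Sigma`; the comparison of `H[i]` with `np.sqrt(lam[i]) * F[:, i]` is only meaningful up to a sign flip, since eigenvectors are defined up to sign.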

C.3 RELATIONSHIP TO THE EIGENVECTORS OF THE SYMMETRIZED ADJACENCY MATRIX

Interestingly, the matrix $M$ described above is exactly the symmetrized adjacency matrix discussed by HaoChen et al. (2021). As shown in Appendix C.2, its eigenvectors are $D_A^{1/2} f_i$, so they differ from the Markov chain eigenfunctions $f_i$ only by a reweighting factor of $p(a)^{1/2}$. One way of thinking about this reweighting is as a change of measure. The eigenvectors of $M$ are orthonormal with respect to the counting measure over $\mathcal{A}$, i.e. if you sum squared values over all of $\mathcal{A}$, you obtain 1, and the dot product of different eigenvectors is zero. On the other hand, the eigenvectors $f_i$ of the Markov chain (or, equivalently, the eigenfunctions $f_i$) are orthonormal with respect to the measure $p(A)$, i.e. if you take the expectation of squared values over random augmentations, you obtain 1, and the uncentered covariance of different eigenfunctions is zero. We believe that using $p(A)$ as a measure is a natural choice, since it allows us to reason about expected values in a straightforward way. From this perspective, the $p(a)^{1/2}$ scaling terms are a desirable feature of the learned representation that allow us to directly reason about optimality with respect to Assumption 1.1. We note that our Assumption 1.1 could alternatively be expressed in terms of the probability-weighted Laplacian matrix of the augmentation graph, given by $L = D_A - P_{A,A}$. Indeed, we have
$$\mathbb{E}_{p_+}\big[(g(a_1) - g(a_2))^2\big] = 2\, g^\top L\, g.$$
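The Laplacian form of Assumption 1.1 and the spectral form from Proposition 3.1 should agree with the direct expectation over positive pairs. The following sketch computes all three on a finite problem; the probability tables, the function `g`, and the helper name `invariance_three_ways` are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical toy problem: 3 original examples z, 4 views a.
p_z = np.array([0.5, 0.3, 0.2])
p_a_given_z = np.array([[0.7, 0.2, 0.1, 0.0],
                        [0.1, 0.6, 0.2, 0.1],
                        [0.0, 0.1, 0.3, 0.6]])


def invariance_three_ways(g, p_z, p_a_given_z):
    """Compute E_{p+}[(g(a1) - g(a2))^2] directly, via L, and via eigenvalues."""
    p_a = p_z @ p_a_given_z
    P_AA = p_a_given_z.T @ np.diag(p_z) @ p_a_given_z

    # 1. Direct expectation over positive pairs.
    direct = np.sum(P_AA * (g[:, None] - g[None, :]) ** 2)

    # 2. Probability-weighted Laplacian L = D_A - P_AA.
    L = np.diag(p_a) - P_AA
    laplacian = 2.0 * g @ L @ g

    # 3. Eigenfunction expansion: sum_i (2 - 2 lambda_i) c_i^2 with c = F^T D_A g.
    M = P_AA / np.sqrt(np.outer(p_a, p_a))
    lam, V = np.linalg.eigh(M)
    F = V / np.sqrt(p_a)[:, None]
    c = F.T @ (p_a * g)
    spectral = np.sum((2.0 - 2.0 * lam) * c ** 2)
    return direct, laplacian, spectral


g = np.array([0.8, 0.6, 0.4, 0.1])        # an arbitrary target function
direct, laplacian, spectral = invariance_three_ways(g, p_z, p_a_given_z)
```

Here the coefficients `c` use the $p(A)$-weighted inner product, which is exactly where the $p(a)^{1/2}$ reweighting relative to the eigenvectors of $M$ enters.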

C.4 RECOVERING PROPORTIONALITY CONSTANTS

As described in Appendix B.1, minimizing the NT-Xent / InfoNCE loss may not exactly produce the positive-pair kernel $K_+$, but may instead learn a scaled version
$$K^*(a_1, a_2) = \big\langle z(a_1) \cdot \phi_+(a_1),\; z(a_2) \cdot \phi_+(a_2) \big\rangle = z(a_1)\, z(a_2)\, K_+(a_1, a_2),$$
where $z : \mathcal{A} \to \mathbb{R}_+$ is some function which is constant on each communicating class of the Markov chain. (Since $K_+(a_1, a_2) = 0$ whenever $a_1$ and $a_2$ are in separate communicating classes, we could equivalently say $K^*(a_1, a_2) = z(a_1)^2 K_+(a_1, a_2) = z(a_2)^2 K_+(a_1, a_2)$.) When the Markov chain has one communicating class, $K^*$ is simply a scaled version of $K_+$. In this case, all of the principal component projection functions for $K^*$ are still the eigenfunctions of the Markov chain, but the eigenvalues may be scaled by that constant. The true eigenvalues of the eigenfunctions can then be estimated using Equation 4, which states that $\mathbb{E}\big[(f_i(a_1) - f_i(a_2))^2\big] = 2(1 - \lambda_i)$. When the Markov chain has multiple communicating classes, we can partition the eigenfunctions so that each eigenfunction is nonzero on a single communicating class. Since the scaling function $z$ acts as a scaling factor for each communicating class, the principal component functions will then be scaled copies of these partitioned eigenfunctions. We can then similarly estimate the true eigenvalues for each of these eigenfunctions using Equation 4.
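For the single-communicating-class case, the recovery procedure can be sketched as follows. This is an illustrative example rather than the procedure used in the experiments: the toy probability tables, the scale factor `3.7`, and the function name `recover_eigenvalues` are assumptions, and the positive-pair expectation is computed exactly here, whereas in practice it would be estimated from sampled positive pairs.

```python
import numpy as np

# Hypothetical toy problem: 3 original examples z, 4 views a.
p_z = np.array([0.5, 0.3, 0.2])
p_a_given_z = np.array([[0.7, 0.2, 0.1, 0.0],
                        [0.1, 0.6, 0.2, 0.1],
                        [0.0, 0.1, 0.3, 0.6]])
p_a = p_z @ p_a_given_z
P_AA = p_a_given_z.T @ np.diag(p_z) @ p_a_given_z
K_plus = P_AA / np.outer(p_a, p_a)


def recover_eigenvalues(K_star, p_a, P_AA):
    """Population Kernel PCA on K_star, then eigenvalues via Equation 4."""
    s = np.sqrt(p_a)
    # Eigendecomposition of D_A^{1/2} K_star D_A^{1/2}; eigenvalues carry the scale.
    scaled, V = np.linalg.eigh(s[:, None] * K_star * s[None, :])
    F = V / s[:, None]                     # eigenfunctions with E_{p(a)}[f_i^2] = 1
    # Equation 4: E_{p+}[(f_i(a1) - f_i(a2))^2] = 2 (1 - lambda_i).
    diffs = F[:, None, :] - F[None, :, :]  # indexed by (a1, a2, i)
    gap = np.einsum('ab,abi->i', P_AA, diffs ** 2)
    return scaled, 1.0 - gap / 2.0


K_star = 3.7 * K_plus                      # kernel known only up to a constant
scaled, recovered = recover_eigenvalues(K_star, p_a, P_AA)
true_lam = np.linalg.eigvalsh(P_AA / np.sqrt(np.outer(p_a, p_a)))
```

The raw Kernel PCA eigenvalues `scaled` are off by the unknown constant, while `recovered` depends only on the eigenfunctions and so matches the Markov chain eigenvalues.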

D GENERALIZATION PROPERTIES OF THE EIGENFUNCTION REPRESENTATION

D.1 MIN-MAX OPTIMALITY OF EIGENFUNCTIONS

We now prove the min-max optimality of the eigenfunctions with respect to their L2 norm. (We note that these results are closely related to the Courant-Fischer-Weyl min-max principle (Bhatia, 2013, Chapter III), which characterizes the eigenvalues and eigenvectors of a Hermitian matrix in terms of a similar adversarial game.)
$$\mathcal{F}_{r_*^d} = \operatorname*{argmin}_{\dim(\mathcal{F}) = d}\;\; \max_{\hat g \in \mathcal{F},\; \mathbb{E}[\hat g(a)^2] = 1}\; \mathbb{E}_{p_+}\big[(\hat g(a_1) - \hat g(a_2))^2\big]. \tag{5}$$
Simultaneously, $\mathcal{F}_{r_*^d}$ minimizes the (quadratic) approximation error for the worst-case target function satisfying Assumption 1.1 for any fixed $\varepsilon$:
$$\mathcal{F}_{r_*^d} = \operatorname*{argmin}_{\dim(\mathcal{F}) = d}\;\; \max_{g \in S_\varepsilon}\; \min_{\hat g \in \mathcal{F}}\; \mathbb{E}_{p(a)}\big[(g(a) - \hat g(a))^2\big]. \tag{6}$$

Proof. We will start by deriving Equation 6, and derive Equation 5 afterward. We can think of Equation 6 as equivalent to the following adversarial game:
1. Player chooses a dimension-$d$ subspace $\mathcal{F} \subset (\mathcal{A} \to \mathbb{R})$ of functions.
2. Adversary chooses a function $g \in (\mathcal{A} \to \mathbb{R})$ with a fixed level of invariance $\mathbb{E}\big[(g(a_1) - g(a_2))^2\big] = 2 g^\top (D_A - P_{A,A}) g = \varepsilon$. Without loss of generality, we let $\varepsilon = 2$ so that $g^\top (D_A - P_{A,A}) g = 1$; other values of $\varepsilon$ will just lead to scaling the function $g$.
3. Player chooses the best $\hat g \in \mathcal{F}$ to minimize $\mathbb{E}\big[(\hat g(a) - g(a))^2\big]$.

We can analyze this game by working backward from the innermost step, step 3. Given the function class $\mathcal{F}$ and adversarially chosen target function $g$, choosing $\hat g$ to minimize the expected squared error is equivalent to finding the orthogonal projection of $g$ into $\mathcal{F}$ with respect to the measure $p(A)$, i.e. with respect to the weighted L2 norm $L(2; p(A))$. More precisely, we want
$$\hat g = \operatorname*{argmin}_{\hat g \in \mathcal{F}} \mathbb{E}\big[(\hat g(a) - g(a))^2\big] = \operatorname*{argmin}_{\hat g \in \mathcal{F}} (\hat g - g)^\top D_A (\hat g - g) = \operatorname*{argmin}_{\hat g \in \mathcal{F}} \big(D_A^{1/2} \hat g - D_A^{1/2} g\big)^\top \big(D_A^{1/2} \hat g - D_A^{1/2} g\big).$$
But this is just finding the $\hat g \in \mathcal{F}$ which minimizes $\|D_A^{1/2} \hat g - D_A^{1/2} g\|^2$. This is given by the orthogonal projection of the vector $D_A^{1/2} g$ into $D_A^{1/2} \mathcal{F}$, under the ordinary L2 norm. We can now define $R$ as the orthogonal projection operator on $D_A^{1/2} \mathcal{F}$, such that $R h \in D_A^{1/2} \mathcal{F}$ (i.e. $D_A^{-1/2} R h \in \mathcal{F}$), and for $D_A^{1/2} f \in D_A^{1/2} \mathcal{F}$ (i.e. $f \in \mathcal{F}$), we have $D_A^{1/2} f = R D_A^{1/2} f$ (i.e. $f = D_A^{-1/2} R D_A^{1/2} f$). Observe that $R$ is real and symmetric, and has eigenvalue 1 with multiplicity $d$ and all other eigenvalues are 0. Since $R$ characterizes the subspace, we will find it convenient to redefine our objective for the initial player as choosing $R$, and then letting $\mathcal{F} = \{D_A^{-1/2} R D_A^{1/2} g : g \in (\mathcal{A} \to \mathbb{R})\}$. We then have
$$\hat g = \operatorname*{argmin}_{\hat g \in \mathcal{F}} \mathbb{E}\big[(\hat g(a) - g(a))^2\big] = D_A^{-1/2} R D_A^{1/2} g,$$
and the cost is
$$\begin{aligned} \mathbb{E}\big[(\hat g(a) - g(a))^2\big] &= \big(D_A^{1/2} \hat g - D_A^{1/2} g\big)^\top \big(D_A^{1/2} \hat g - D_A^{1/2} g\big) = \big(R D_A^{1/2} g - D_A^{1/2} g\big)^\top \big(R D_A^{1/2} g - D_A^{1/2} g\big) \\ &= \big(D_A^{1/2} g\big)^\top (R - I)^\top (R - I) \big(D_A^{1/2} g\big) = g^\top D_A^{1/2} \big(R^\top R - 2R + I\big) D_A^{1/2} g = g^\top D_A^{1/2} (I - R) D_A^{1/2} g. \end{aligned}$$
We next consider step 2. Given $R$, what $g$ should the adversary pick? Letting $L = D_A - P_{A,A}$, the adversary is constrained to pick $g$ such that $g^\top L g = (L^{1/2} g)^\top (L^{1/2} g) = 1$. We note that $L$ is not full rank: in particular, any eigenvector $v$ of $M$ with eigenvalue 1 gives a vector $D_A^{-1/2} v$ in the null space of $L$.
Any function $g$ chosen by the adversary must then be the sum of two parts:
• a component in the range of $L$, of the form $(L^\dagger)^{1/2} u$ where $\|u\|_2 = 1$ and $\dagger$ represents the Moore-Penrose pseudoinverse,
• and a component in the null space of $L$.
Overall, we can thus write $g = (L^\dagger)^{1/2} u + h$ where $\|u\|_2 = 1$ and $h^\top L h = 0$. Similarly, the response $\hat g$ must also have two components, one in the range of $L$ and one in the null space of $L$. There are then two cases. If $\mathcal{F}$ does not span the entire null space of $L$, the adversary can force an arbitrarily high approximation error by choosing $h$ to be in the null space of $L$ but not in $\mathcal{F}$. On the other hand, if $\mathcal{F}$ spans the entire null space of $L$, the player can always perfectly approximate $h$, and so the adversary is forced to maximize cost by using $u$. In particular, they will pick
$$u = \operatorname*{argmax}_{\|u\|_2 = 1} \big((L^\dagger)^{1/2} u\big)^\top D_A^{1/2} (I - R) D_A^{1/2} \big((L^\dagger)^{1/2} u\big) = \operatorname*{argmax}_{\|u\|_2 = 1} u^\top (L^\dagger)^{1/2} D_A^{1/2} (I - R) D_A^{1/2} (L^\dagger)^{1/2} u = \operatorname*{argmax}_{\|u\|_2 = 1} u^\top \mathbf{A}\, u,$$
where $\mathbf{A}$ is the matrix $(L^\dagger)^{1/2} D_A^{1/2} (I - R) D_A^{1/2} (L^\dagger)^{1/2}$. The optimal choice for $u$ is an eigenvector of $\mathbf{A}$ with maximal eigenvalue, and the cost is then that maximal eigenvalue. But observe that $\mathbf{A}$ is similar to the following:
$$\mathbf{A} \sim \big((L^\dagger)^{1/2} D_A^{1/2}\big)^{-1} \mathbf{A}\, \big((L^\dagger)^{1/2} D_A^{1/2}\big) = (I - R)\, D_A^{1/2} L^\dagger D_A^{1/2} = (I - R) \big(D_A^{-1/2} L D_A^{-1/2}\big)^\dagger =: \mathbf{A}'.$$
Similar matrices have the same eigenvalues, so the maximum cost attainable by the adversary is the maximal eigenvalue of $\mathbf{A}'$. Finally, we consider step 1. Which $R$ should our player choose to minimize this maximum cost? They should first ensure the cost is finite, by choosing $R$ to span the null space of $D_A^{-1/2} L D_A^{-1/2}$. (Note that if $d$ is less than the dimension of this null space, there is no choice that ensures a finite cost; in this case every representation has unbounded worst-case approximation error.) Afterward, they should ensure that $\mathbf{A}'$ has the smallest maximum eigenvalue.
The sorted vector of eigenvalues of $\mathbf{A}'$ is bounded below by the vector obtained by matching the largest eigenvalues of $\big(D_A^{-1/2} L D_A^{-1/2}\big)^\dagger$ with the smallest eigenvalues of $(I - R)$ (Bhatia, 2013, Exercise III.6.14). Furthermore,
$$D_A^{-1/2} L D_A^{-1/2} = D_A^{-1/2} (D_A - P_{A,A}) D_A^{-1/2} = I - D_A^{-1/2} P_{A,A} D_A^{-1/2} = I - M,$$
so we are looking for the eigenvectors of $M$ with the largest eigenvalues, where $M$ is the matrix described in Appendix C. Thus, the player should choose $\mathcal{F}$ such that $D_A^{1/2} \mathcal{F}$ spans the top $d$-dimensional eigenspace of $M$, i.e. they should choose functions of the form $D_A^{-1/2} v_i$ where the $v_i$ are the eigenvectors of $M$ with largest eigenvalue. But these are exactly the left eigenvectors $f_i$ of the positive pair Markov chain, which is how $r_*^d$ is defined. We conclude that $\mathcal{F}_{r_*^d}$ is the optimal choice for the player, and thus Equation 6 holds. Indeed, we can conclude something further: if $\lambda_{d+1}$ is the $(d+1)$th eigenvalue of the positive pair Markov chain (the variance along the $(d+1)$th principal component of the positive-pair kernel), then as long as $d \ge d_*$, $(1 - \lambda_{d+1})^{-1}$ is the $(d - d_* + 1)$th eigenvalue of $\big(D_A^{-1/2} L D_A^{-1/2}\big)^\dagger$, which is exactly the worst-case approximation error for $\mathcal{F}_{r_*^d}$ against any function with $\mathbb{E}\big[(g(a_1) - g(a_2))^2\big] = 2 g^\top L g = 2$. Scaling by $\varepsilon$, if $\mathbb{E}\big[(g(a_1) - g(a_2))^2\big] = \varepsilon$ then the worst-case error is $\tfrac{1}{2}\varepsilon / (1 - \lambda_{d+1})$. In other words,
$$\max_{g \in S_\varepsilon}\; \min_{\hat g \in \mathcal{F}_{r_*^d}}\; \mathbb{E}_{p(a)}\big[(g(a) - \hat g(a))^2\big] = \frac{\varepsilon}{2(1 - \lambda_{d+1})}.$$

We now return our attention to Equation 5. This equation can also be formulated as an adversarial game:
1. Player chooses a rank-$d$ subspace $\mathcal{F} \subset (\mathcal{A} \to \mathbb{R})$ of functions.
2. Adversary chooses a function $\hat g \in \mathcal{F}$ with unit norm $\mathbb{E}[\hat g(a)^2] = \hat g^\top D_A \hat g = 1$ to maximize $\mathbb{E}\big[(\hat g(a_1) - \hat g(a_2))^2\big] = 2 \hat g^\top L \hat g$.

We can again identify the choice of $\mathcal{F}$ with the choice of the orthogonal projection matrix $R$ on $D_A^{1/2} \mathcal{F}$. We know $\hat g \in \mathcal{F}$, so we can write $\hat g = D_A^{-1/2} R D_A^{1/2} \hat g$. Also note that for any $h$ (not even necessarily in $\mathcal{F}$), $D_A^{-1/2} R D_A^{1/2} h \in \mathcal{F}$. Now suppose we choose an $h$ so that $\mathbb{E}[h(a)^2] = h^\top D_A h = 1$, and define $\hat g = D_A^{-1/2} R D_A^{1/2} h$. Then
$$\hat g^\top D_A \hat g = \big(h^\top D_A^{1/2} R D_A^{-1/2}\big) D_A \big(D_A^{-1/2} R D_A^{1/2} h\big) = h^\top D_A^{1/2} R^2 D_A^{1/2} h = h^\top D_A^{1/2} R D_A^{1/2} h \le h^\top D_A^{1/2} I D_A^{1/2} h = 1$$
because $R$ has eigenvalues at most 1. So, the following are equivalent:
• choosing $\hat g \in \mathcal{F}$ with $\mathbb{E}[\hat g(a)^2] \le 1$
• choosing $h \in \mathbb{R}^{\mathcal{A}}$ with $\mathbb{E}[h(a)^2] \le 1$ and letting $\hat g = D_A^{-1/2} R D_A^{1/2} h$
We also note that there is no advantage to the adversary from picking a function such that $\mathbb{E}[\hat g(a)^2] < 1$.

So we can reframe step 2 as choosing $h$ so that $E[h(v)^2] = 1$, to maximize

$$2 \left(D_A^{-1/2} R D_A^{1/2} h\right)^\top L \left(D_A^{-1/2} R D_A^{1/2} h\right).$$

We can further reparameterize by letting $h = D_A^{-1/2} u$, so that $E[h(v)^2] = 1$ is equivalent to $\|u\|_2^2 = 1$. We then have cost

$$C = 2 \hat{g}^\top L \hat{g} = 2 \left(h^\top D_A^{1/2} R D_A^{-1/2}\right) L \left(D_A^{-1/2} R D_A^{1/2} h\right) = 2 \left(u^\top D_A^{-1/2}\right) D_A^{1/2} R D_A^{-1/2} L D_A^{-1/2} R D_A^{1/2} \left(D_A^{-1/2} u\right) = 2 u^\top R D_A^{-1/2} L D_A^{-1/2} R u.$$

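The choice that maximizes the cost is then an eigenvector of

$$B = R D_A^{-1/2} L D_A^{-1/2} R = \left(L^{1/2} D_A^{-1/2} R\right)^\top \left(L^{1/2} D_A^{-1/2} R\right)$$

with maximal eigenvalue, and the cost is 2 times the maximum eigenvalue of $B$. But note that $B$ has the same nonzero eigenvalues as

$$B' = \left(L^{1/2} D_A^{-1/2} R\right) \left(L^{1/2} D_A^{-1/2} R\right)^\top = L^{1/2} D_A^{-1/2} R D_A^{-1/2} L^{1/2}$$

since $R^2 = R$, so the optimal cost is determined by the largest eigenvalue of $B'$. In turn, $B'$ has the same nonzero eigenvalues as $B'' = D_A^{-1/2} L D_A^{-1/2} R$. When $F = F_{r_*^d}$, $R$ projects onto the eigenvectors of $D_A^{-1/2} L D_A^{-1/2} = I - M$ with the $d$ smallest eigenvalues $1 - \lambda_1, \dots, 1 - \lambda_d$, so the largest eigenvalue of $B''$ is $1 - \lambda_d$ and the cost is this value times two. We obtain

$$\max_{\hat{g} \in F_{r_*^d},\; E[\hat{g}(a)^2] = 1} E_{p_+}\left[(\hat{g}(a_1) - \hat{g}(a_2))^2\right] = 2(1 - \lambda_d).$$

D.2 GENERALIZATION BOUND FOR LINEAR PREDICTION WITH THE EIGENFUNCTION REPRESENTATION

Proposition 4.2. Let $(A_i, Y_i)_{i=1}^n$ be i.i.d. samples, choose $R \geq 0$, and consider the constrained empirical risk minimizer

$$\hat{\beta}_R \in \operatorname*{argmin}_{\|\beta\|_2 \leq R} \; n^{-1} \sum_{i=1}^n \left|\langle \beta, r_*^d(A_i) \rangle - Y_i\right|.$$

Then the expected excess risk of $\hat{\beta}_R$ is bounded by:

$$E\left[\mathcal{E}(\hat{\beta}_R)\right] \leq \frac{2dR}{\sqrt{n}} + \sqrt{d}\,(\|\beta^*\|_2 - R)_+ + \sqrt{\frac{\varepsilon}{2(1 - \lambda_{d+1})}}$$

where $\mathcal{E}(\beta) := \mathcal{R}(\beta) - \mathcal{R}^*$ is the excess risk and $(x)_+ := \max\{x, 0\}$.

Proof. We start by decomposing the excess risk as:

$$\mathcal{R}(\hat{\beta}_R) - \mathcal{R}^* = \left(\mathcal{R}(\hat{\beta}_R) - \inf_{\|\beta\|_2 \leq R} \mathcal{R}(\beta)\right) + \left(\inf_{\|\beta\|_2 \leq R} \mathcal{R}(\beta) - \mathcal{R}(\beta^*)\right) + \left(\mathcal{R}(\beta^*) - \mathcal{R}^*\right)$$

The first term is the estimation error, which we can readily bound by first noting that

$$E\left[\|r_*^d(A)\|_2^2\right] = E\left[\sum_{i=1}^d f_i(A)^2\right] = \sum_{i=1}^d E\left[f_i(A)^2\right] = d,$$

then noticing that our loss is 1-Lipschitz, and finishing with the standard Rademacher complexity argument for constrained linear classes (Kakade et al., 2008) to get:

$$\mathcal{R}(\hat{\beta}_R) - \inf_{\|\beta\|_2 \leq R} \mathcal{R}(\beta) \leq \frac{2dR}{\sqrt{n}} \tag{7}$$

The second term is an approximation error term due to the use of a constrained linear class instead of the full linear class. Define $\tilde{\beta}^* := \beta^* / \max\{\|\beta^*\|_2 / R, 1\}$. Then we can bound this second term by:

$$\inf_{\|\beta\|_2 \leq R} \mathcal{R}(\beta) - \mathcal{R}(\beta^*) \leq \mathcal{R}(\tilde{\beta}^*) - \mathcal{R}(\beta^*) \leq E\left[\left|\langle r_*^d(A), \tilde{\beta}^* - \beta^* \rangle\right|\right] \leq E\left[\|r_*^d(A)\|_2\right] \|\tilde{\beta}^* - \beta^*\|_2 \leq \sqrt{d}\,(\|\beta^*\|_2 - R)_+ \tag{8}$$

where the second inequality follows from the 1-Lipschitzness of the loss, the third by the Cauchy-Schwarz inequality, and the last by Jensen's inequality.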
The third and last term is an approximation error term due to the use of the function class given by the span of the first $d$ eigenfunctions $(f_i)_{i=1}^d$. We can bound it as follows:

$$\mathcal{R}(\beta^*) - \mathcal{R}^* \leq E\left[\left|\langle r_*^d(A), \beta^* \rangle - g^*(A)\right|\right] \leq \sqrt{E\left[\left(\langle r_*^d(A), \beta^* \rangle - g^*(A)\right)^2\right]} \leq \sqrt{\frac{\varepsilon}{2(1 - \lambda_{d+1})}} \tag{9}$$

where the first inequality follows from the 1-Lipschitzness of the loss, the second from Jensen's inequality, and the last by first noticing that the function $h(a) := \langle r_*^d(a), \beta^* \rangle$ satisfies $h = \operatorname{argmin}_{g \in F_{r_*^d}} E\left[(g(A) - g^*(A))^2\right]$ (since it is the projection of $g^*$ onto $F_{r_*^d}$ under the norm $\|x\|^2 := E[x^2(A)]$), then appealing to the proof of Proposition 4.1. Combining the bounds of equations (7), (8), and (9), we obtain the stated generalization bound.
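The identities used in the two proofs above can be checked numerically on a small finite example. The following is a minimal sketch, assuming a hypothetical 4-view joint distribution `P` standing in for $p_+$ (the matrix values and all variable names are illustrative and not taken from the paper). It verifies that $D_A^{-1/2} L D_A^{-1/2} = I - M$, that the functions $f_i = D_A^{-1/2} v_i$ are orthonormal under $p(a)$, and that their positive-pair discrepancy equals $2(1 - \lambda_i)$.

```python
import numpy as np

# Hypothetical 4-view example: a symmetric joint distribution p_+(a1, a2).
P = np.array([
    [0.20, 0.05, 0.00, 0.00],
    [0.05, 0.15, 0.05, 0.00],
    [0.00, 0.05, 0.15, 0.05],
    [0.00, 0.00, 0.05, 0.20],
])

p = P.sum(axis=1)                      # marginal p(a), the diagonal of D_A
D_inv_sqrt = np.diag(p ** -0.5)
L = np.diag(p) - P                     # L = D_A - P_{A,A}
M = D_inv_sqrt @ P @ D_inv_sqrt        # symmetrized Markov chain operator

# D_A^{-1/2} L D_A^{-1/2}, which the proof shows equals I - M.
normalized_laplacian = D_inv_sqrt @ L @ D_inv_sqrt

# Eigenvectors v_i of M sorted by decreasing eigenvalue; f_i = D_A^{-1/2} v_i.
eigvals, V = np.linalg.eigh(M)
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], V[:, order]
F = D_inv_sqrt @ V                     # column i holds f_i(a) for every view a

# E[f_i(a) f_j(a)] under p(a).
second_moment = F.T @ np.diag(p) @ F
# E_{p_+}[(f_i(a1) - f_i(a2))^2] = 2 f_i^T L f_i for each i.
discrepancy = 2.0 * np.einsum("ai,ab,bi->i", F, L, F)
```

Comparing `discrepancy` against `2 * (1 - lam)` reproduces the relationship between eigenvalues and view invariance that drives both bounds.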

E EIGENFUNCTION ESTIMATION TECHNIQUES

In practice, we do not generally have access to the closed form for $K_+$ or the full population of our examples, but instead only have access to a dataset of positive pairs $(a_1, a_2)$ drawn from the distribution $p_+(a_1, a_2)$ (or, more commonly, a dataset of examples $z$ and a sampling algorithm for $p(A|Z)$). In this section we discuss some approaches for approximating the optimal eigenfunction representation from these samples.

E.1 COMBINING CONTRASTIVE LEARNING WITH KERNEL PCA

Our analysis in Sections 2 and 3 motivates the following procedure:

1. Train a contrastive learning model using an existing contrastive learning objective.

2. Using the equations in Table 1, identify the approximate kernel $K_\theta$, which will hopefully be similar to the positive-pair kernel $K_+$ assuming we have converged to a solution to the objective.

3. Perform (or approximate) Kernel PCA using $K_\theta$ and a large set of individual views drawn from $p(A)$.

4. Use the first $d$ extracted principal component projection functions $h_i : A \to \mathbb{R}$ to construct a representation, possibly normalizing by $\sigma_i$ to obtain the orthonormal basis $f_i(a) = \sigma_i^{-1} h_i(a)$.

We note that applying a rotation matrix to the optimal representation does not affect the expressivity of that representation. It is thus sufficient to identify the subspace spanned by the first $d$ principal component projection functions.

If the representation dimension $d$ is known in advance, it may be possible to adjust the contrastive learning method to accomplish this without requiring a separate PCA step. In particular, when using the spectral contrastive loss with a $d$-dimensional linear kernel head, the population minimizer of the loss will exactly correspond to the best $d$-dimensional approximation of $K_+$. This means that the output layer representation will be exactly the set of principal component projection functions rotated by some orthogonal matrix. On the other hand, including the PCA step makes it possible to decouple the representation dimension from the kernel approximation method, which may be advantageous if the learning dynamics of a different parameterization or loss are better, or if $d$ is not known in advance. The PCA step also makes it possible to diagnose how well the learned representation is capturing properties of $K_+$ by checking the extent to which Equation 4 is satisfied.

Directly applying kernel PCA can be expensive for large datasets, due to the need to decompose the full Gram matrix of kernel similarities. A more computationally-friendly approach is to first approximate $K_\theta$ with a lower-rank approximation, such as the Nyström method (Williams & Seeger, 2000), and then perform PCA over that approximation (Sriperumbudur & Sterge, 2017; Ullah et al., 2018; Sterge et al., 2020). This can be especially useful when the dataset is much larger than the number of desired eigenfunctions.
We note that approaches based on Kernel PCA may be numerically unstable in the presence of many eigenvalues close to 1, since small kernel estimation errors may be amplified by the eigendecomposition process. Although we were able to apply these techniques to models trained on our two synthetic datasets, we have been so far unable to obtain a reliable estimate of the eigenfunctions for real-world datasets such as those considered by Chen et al. (2020a) . In particular, we attempted to apply the Nyström method to a pretrained SimCLR model but ran into numerical precision issues and were unable to form a good approximation of the learned hypersphere-based kernel K θ . See also Appendix F.5 for a preliminary analysis of a model with a constrained-norm linear kernel head on ImageNet; although we were able to run Kernel PCA with this model, it does not appear to be a good approximation of K + .
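As an illustration of the Kernel PCA step in the procedure above, here is a minimal sketch for a finite view space where the kernel can be tabulated exactly. It assumes a hypothetical 3-view positive-pair distribution `P_plus`, uses uncentered population Kernel PCA weighted by $p(a)$, and takes $\sigma_i^2 = \lambda_i$ when relating $h_i$ to $f_i$; the function name `population_kernel_pca` and all values are illustrative rather than the paper's implementation.

```python
import numpy as np

def population_kernel_pca(K, p, d):
    """Uncentered kernel PCA over a finite set of views weighted by p(a)."""
    sqrt_p = np.sqrt(p)
    # Symmetrized kernel operator D^{1/2} K D^{1/2}.
    S = sqrt_p[:, None] * K * sqrt_p[None, :]
    eigvals, V = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:d]
    lam, V = eigvals[order], V[:, order]
    f = V / sqrt_p[:, None]                    # orthonormal under p(a)
    h = f * np.sqrt(np.clip(lam, 0.0, None))   # projections with E[h_i^2] = lam_i
    return lam, f, h

# Hypothetical positive-pair distribution over 3 views (symmetric, sums to 1).
P_plus = np.array([
    [0.30, 0.10, 0.00],
    [0.10, 0.20, 0.10],
    [0.00, 0.10, 0.10],
])
p = P_plus.sum(axis=1)
K_plus = P_plus / np.outer(p, p)   # K_+(a1, a2) = p_+(a1, a2) / (p(a1) p(a2))

lam, f, h = population_kernel_pca(K_plus, p, d=2)
transition = P_plus / p[:, None]   # p(a2 | a1), the positive pair Markov chain
```

With the exact kernel, `transition @ f` equals `f * lam`, so the recovered functions are eigenfunctions of the Markov chain; with a learned $K_\theta$ in place of `K_plus` the same check gives a rough diagnostic of approximation quality.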

E.2 DIRECT EIGENFUNCTION EXTRACTION, AND CONNECTIONS TO VICREG

An alternative strategy for estimating the eigenfunctions $f_i$ is to combine the contrastive learning and Kernel PCA steps into a single parameterized model. This is possible using parameterized spectral decomposition techniques such as SpIN (Pfau et al., 2018) or NeuralEF (Deng et al., 2022). We note that these techniques are usually motivated as extracting the eigenfunctions of the kernel operator

$$T[f](a_1) = \int K(a_1, a_2) f(a_2) p(a_2) \, da_2,$$

or in other words, solving the eigenfunction equation

$$\lambda_i f_i(a_1) = \int K(a_1, a_2) f_i(a_2) p(a_2) \, da_2.$$

In this case, substituting the form of $K_+$ reveals that this is equivalent to finding the eigenfunctions of the positive pair Markov chain:

$$\lambda_i f_i(a_1) = \sum_{a_2} K_+(a_1, a_2) f_i(a_2) p(a_2) = \sum_{a_2} \frac{p_+(a_1, a_2)}{p(a_1) p(a_2)} f_i(a_2) p(a_2) = \sum_{a_2} \frac{p_+(a_1, a_2)}{p(a_1)} f_i(a_2) = \sum_{a_2} p(a_2 | a_1) f_i(a_2) = E_{a_2 | a_1}[f_i(a_2)]$$

Connections between NeuralEF and VICReg. We now describe in more detail how to apply the NeuralEF technique to estimate the basis of eigenfunctions of the positive pair Markov chain. The NeuralEF approach approximates the eigenfunctions $f_i$ of a kernel $K$ by solving an asymmetric set of constrained optimization problems

$$\hat{f}_j = \operatorname*{argmax}_{\hat{f}_j} \; R_{jj} - \sum_{i=0}^{j-1} \frac{R_{ij}^2}{R_{ii}}$$

subject to the constraint that $C_j = 1$, where

$$R_{ij} = \sum_{a_1, a_2} \hat{f}_i(a_1) K(a_1, a_2) \hat{f}_j(a_2) p(a_1) p(a_2), \qquad C_j = \sum_a \hat{f}_j(a)^2 p(a) = E\left[\hat{f}_j(a)^2\right].$$

Substituting the positive-pair kernel $K_+(a_1, a_2) = \frac{p_+(a_1, a_2)}{p(a_1) p(a_2)}$ reveals an alternative form for the $R_{ij}$ terms, allowing us to apply NeuralEF using samples from $p_+$:

$$R_{ij} = \sum_{a_1, a_2} \hat{f}_i(a_1) \frac{p_+(a_1, a_2)}{p(a_1) p(a_2)} \hat{f}_j(a_2) p(a_1) p(a_2) = E_{p_+(a_1, a_2)}\left[\hat{f}_i(a_1) \hat{f}_j(a_2)\right].$$

Interestingly, the resulting algorithm closely resembles the Variance-Invariance-Covariance regularization (VICReg) self-supervised learning method proposed by Bardes et al. (2021): the $C_j = 1$ constraint ensures each function has sufficient variance, the $R_{ij}^2$ term reduces the covariance between features, and maximizing $R_{jj} = E[\hat{f}_j(a_1) \hat{f}_j(a_2)]$ over positive pairs leads to representations that are as invariant as possible between positive pairs. (Note, however, that the asymmetric weighted covariance penalties $R_{ij}^2 / R_{ii}$ in NeuralEF ensure that eigenfunctions are recovered in the correct order without interfering with one another.)

Stabilizing NeuralEF for contrastive learning. Although the NeuralEF-based approach works well when $R_{ii}$ is large for all $i$, the method becomes numerically unstable when $R_{ii}$ is small. And since $R_{ii} \approx \lambda_i$, this makes it difficult to recover eigenfunctions whose eigenvalues $\lambda_i$ are close to zero. To enable recovery of all eigenfunctions, including those with $\lambda_i = 0$, we do not directly apply NeuralEF to $K_+$ in our experiments. Instead, we construct a modified kernel with the help of a modified positive pair distribution

$$p_{\text{mix}}(a_1, a_2) = 0.5 \, p_+(a_1, a_2) + 0.5 \, p(a_1) \, 1[a_2 = a_1]$$

where $1[a_2 = a_1]$ is 1 when $a_1 = a_2$ and 0 otherwise. Conceptually, with probability 50%, we sample a positive pair $(a_1, a_2)$ as normal, and otherwise, we sample a single augmentation $a_1$ and then choose $a_2 = a_1$.

[...] by random noise as eigenvalues become closer together. In particular, as augmentation strength decreases, there are more eigenvalues close to 1, and it becomes more difficult to identify the most significant principal components. Due to the stochasticity of Kernel PCA, it is difficult to accurately identify eigenfunctions even when given direct access to $K_+$, and two runs of Kernel PCA begin to diverge as eigenvalues decrease. Eigenfunction quality degrades even faster when comparing results of Kernel PCA for a learned model and for $K_+$: the learned models only allow recovery of a few principal components accurately at larger augmentation strengths.
More expressive kernel approximations recover more eigenfunctions. We observe that the linear kernel head and Neural EF method are able to recover a larger number of eigenfunctions accurately compared to hypersphere kernels, and have eigenvalues that lie closer to the Equation 4 prediction. Additionally, we find that adding a per-example scale function $s_\theta$ to the hypersphere kernel leads to more correctly-recovered eigenfunctions and fewer outlier eigenvalues.

Learned models have faster eigenvalue decay than $K_+$. In general, the eigenvalues of learned kernels decay faster than the eigenvalues of the true positive-pair kernel $K_+$. Interestingly, however, the eigenvalues still appear to follow the relationship predicted by Equation 4 for sufficiently expressive models and sufficiently strong augmentations. This suggests that the learned models are approximately capturing a subset of the positive-pair eigenfunctions.

We note that both the linear-kernel-head model and the NeuralEF model exhibit a sharp change in eigenvalue near the 100th eigenfunction: the first shows a sudden drop to zero, whereas the second shows "jumps" to larger eigenvalues. We believe this corresponds to a failure to identify additional directions of variation, leading to a lower-rank kernel approximation than expected. For NeuralEF, this manifests as essentially repeating earlier eigenfunctions instead of finding new orthogonal directions. In particular, we see this as evidence that the parameterizations we used are not able to form good approximations of the true minimizer of the respective objectives. We believe that one promising research direction for finding better self-supervised learning techniques would be to develop more stable or better-conditioned linear approximations of the positive-pair kernel, building on the spectral contrastive loss or NeuralEF.
We hope that such techniques could be developed by combining ideas from the kernel methods and self-supervised learning communities, and that they would lead to useful representations for downstream supervised learning as suggested by our analysis.
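To make the modified positive-pair distribution $p_{\text{mix}}$ from Appendix E.2 concrete, the sketch below builds it for a hypothetical 3-view example and compares the Markov chain eigenvalues before and after mixing. Mixing with the identity pairing is the standard "lazy chain" construction, so each eigenvalue $\lambda_i$ maps to $0.5 + 0.5\lambda_i$ while the eigenfunctions are unchanged; the matrix `P_plus` and the helper `chain_eigenvalues` are assumptions made for illustration.

```python
import numpy as np

def chain_eigenvalues(P_joint):
    """Eigenvalues of the Markov chain induced by a symmetric joint distribution."""
    marg = P_joint.sum(axis=1)
    M = P_joint / np.sqrt(np.outer(marg, marg))   # D^{-1/2} P D^{-1/2}
    return np.sort(np.linalg.eigvalsh(M))[::-1]

# Hypothetical positive-pair distribution over 3 views (symmetric, sums to 1).
P_plus = np.array([
    [0.30, 0.10, 0.00],
    [0.10, 0.20, 0.10],
    [0.00, 0.10, 0.10],
])
p = P_plus.sum(axis=1)

# p_mix(a1, a2) = 0.5 p_+(a1, a2) + 0.5 p(a1) 1[a2 = a1]
P_mix = 0.5 * P_plus + 0.5 * np.diag(p)

lam_plus = chain_eigenvalues(P_plus)
lam_mix = chain_eigenvalues(P_mix)
```

Because the mixed eigenvalues are at least 0.5 whenever the original ones are nonnegative, the $R_{ii}$ terms in the NeuralEF objective stay away from zero; the original eigenvalues can be recovered afterwards as $2\lambda_{\text{mix}} - 1$.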

F.2 TRAINING DETAILS: OVERLAPPING REGIONS TASK

For each of the models visualized in Figure 6, we use a simple three-layer MLP with hidden layer sizes [64, 128, 256] which maps from the two-dimensional location of each grid point to the final kernel-dependent output embedding. We train all methods for 12,000 steps using a batch size of 1024 independently-sampled positive pairs per iteration, using the Adam optimizer (Kingma & Ba, 2014) with a cosine-decay learning rate schedule. For the hypersphere kernel head $K_\theta(a_1, a_2) = \exp(h_\theta(a_1)^\top h_\theta(a_2)/\tau + b)$, we include a small regularization penalty encouraging $b$ to be small, which stabilizes training with the XEnt loss (since the loss is otherwise independent of $b$). When using the "XEnt + Logistic" loss mixture, we combine the two losses using a weight of 0.9 for XEnt and 0.1 for Logistic.

For all methods other than Neural EF, we used batch normalization for the first 6,000 iterations, then interpolated between the current batch statistics and the average from previous batches for 2,000 more iterations, and finally trained for 4,000 iterations using frozen batch norm statistics alone (i.e. in "inference" mode), which we found slightly improves eigenfunction quality. For Neural EF, we keep batch normalization at all steps, and in particular use L2 batch normalization for the output embedding as proposed by Deng et al. (2022).

To extract eigenfunction estimates from our kernel models, we compute estimates of the eigenfunctions and eigenvalues by performing population kernel PCA over the values of the kernel across all elements of $A$, weighted by $p(a)$. For the NeuralEF model, we instead directly use the model's outputs as the eigenfunctions, and use running averages of $R_{ii}$ to approximate eigenvalues.

For Figure 6, we compute the alignment between eigenfunctions by taking their squared uncentered covariance $E[f_i(a) \hat{f}_j(a)]^2$. Note that by Theorem 3.2 this is equivalent to the square of the coefficient $c_i$ for the function $\hat{f}_j$ expanded in terms of the basis of eigenfunctions $f_i$; consequently we have $\sum_i E[f_i(a) \hat{f}_j(a)]^2 = E[\hat{f}_j(a)^2] = 1$. Since the specific choice of eigenfunctions is not uniquely determined when there are multiple eigenfunctions with the same eigenvalue, we sum the alignments for all eigenfunctions that have the same eigenvalue, leading to the block-diagonal structure shown in Figure 6. We computed the positive-pair discrepancy for each approximate principal component function by analytically summing over all possible positive pairs.
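The alignment computation described above can be sketched as follows. This is an illustrative example rather than the code used for Figure 6: the marginal `p`, the basis `f_true`, and the rotated estimate `f_est` are all synthetic, and `alignment_matrix` is a hypothetical helper name.

```python
import numpy as np

def alignment_matrix(f_true, f_est, p):
    """Squared uncentered covariance E[f_i(a) fhat_j(a)]^2 under p(a).

    f_true has shape (|A|, m) with columns f_i; f_est has shape (|A|, n)
    with columns fhat_j. Entry (i, j) of the result is the alignment.
    """
    cov = f_true.T @ (p[:, None] * f_est)
    return cov ** 2

rng = np.random.default_rng(0)
n_views = 6
p = rng.dirichlet(np.ones(n_views))             # synthetic marginal p(a)

# A complete basis that is orthonormal under p, built from a random
# orthogonal matrix Q by rescaling each row by p(a)^{-1/2}.
Q, _ = np.linalg.qr(rng.normal(size=(n_views, n_views)))
f_true = Q / np.sqrt(p)[:, None]

# A synthetic "estimate": rotate the first two basis functions by 30 degrees,
# mimicking the ambiguity within a shared eigenspace.
theta = np.pi / 6
rotation = np.eye(n_views)
rotation[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]]
f_est = f_true @ rotation

align = alignment_matrix(f_true, f_est, p)
block_total = align[:2, :2].sum()               # summed alignment in the 2x2 block
```

Each column of `align` sums to 1 when `f_true` is a complete orthonormal basis, and summing over the rotated block recovers the full alignment, which is the reason for aggregating alignments within eigenspaces of equal eigenvalue.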

Population look-up table variant

In Figure 4, we use a modified procedure to improve visualization quality. Instead of using an MLP, we directly learn a lookup table of positions $v_a \in \mathbb{R}^2$ and scale modifiers $s_a \in [-5.0, 5.0]$ for each point $a \in A$, and use a rational quadratic kernel head

$$K_\theta(a_1, a_2) = e^{s_{a_1}} e^{s_{a_2}} \left(1 - \frac{\|v_{a_1} - v_{a_2}\|^2}{2\alpha}\right)^{-\alpha},$$

where $\alpha$ is a learnable parameter. (The size of each marker in Figure 4 represents the learned scale; we find that using the scale modifier improves eigenfunction quality, and note that more tightly-clustered points tend to have slightly smaller scales.)

We train using a full-batch variant of the XEnt and Logistic losses. For the XEnt loss, we compute

$$L_{\text{XEnt}}(\theta) = -\sum_{a_1, a_2} p_+(a_1, a_2) \log p_\theta(a_2 | a_1)$$

where $p_\theta(a_2 | a_1) \propto p(a_2) K_\theta(a_1, a_2)$. (Here weighting the kernel by $p(a_2)$ can be viewed as the population equivalent to sampling a set of negative pairs as the number of negative pairs approaches infinity.) For the logistic loss, we analytically compute

$$L_{\text{Logistic}}(\theta) = E_{p_+(a_1^+, a_2^+)}\left[-\log \sigma(\log K_\theta(a_1^+, a_2^+))\right] + E_{p(a_1^-) p(a_2^-)}\left[-\log \sigma(-\log K_\theta(a_1^-, a_2^-))\right]$$

by summing over all possible positive and negative pairs. We use relative weights of $10 L_{\text{XEnt}} + 1 L_{\text{Logistic}}$.

We use population kernel PCA over the set $A$ to identify the principal component functions. We then extend the principal component functions $f_i : A \to \mathbb{R}$ across the full embedding space $h_i : \mathbb{R}^2 \to \mathbb{R}$ (ignoring the scale parameter for simplicity) according to

$$h_i(v) = K_\theta(v, A)^\top K_\theta(A, A)^\dagger f_i(A),$$

where $K_\theta(v, A)$ is the vector

$$K_\theta(v, A) = \begin{bmatrix} e^{s_{a_1}} \left(1 - \frac{\|v - v_{a_1}\|^2}{2\alpha}\right)^{-\alpha} \\ e^{s_{a_2}} \left(1 - \frac{\|v - v_{a_2}\|^2}{2\alpha}\right)^{-\alpha} \\ \vdots \\ e^{s_{a_{|A|}}} \left(1 - \frac{\|v - v_{a_{|A|}}\|^2}{2\alpha}\right)^{-\alpha} \end{bmatrix},$$

$K_\theta(A, A)$ is the Gram matrix of the sequence $[a_1, \dots, a_{|A|}]$ (in other words, the matrix elements are defined by $[K_\theta(A, A)]_{ij} = K_\theta(a_i, a_j)$), $\dagger$ denotes the Moore-Penrose pseudoinverse, and $f_i(A)$ is the vector $[f_i(a_1) \; f_i(a_2) \; \cdots \; f_i(a_{|A|})]^\top$; this equation implicitly projects each point onto the appropriate principal component of the kernel. For numerical stability reasons, we regularize the pseudoinverse $K_\theta(A, A)^\dagger$ by additionally removing eigenvalues very close to zero.
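The extension formula $h_i(v) = K_\theta(v, A)^\top K_\theta(A, A)^\dagger f_i(A)$ with a regularized pseudoinverse can be sketched as below. To keep the example self-contained it uses a Gaussian kernel as a stand-in for the learned kernel head, which is an assumption and not the paper's parameterization; the positions `v_A`, the values `f_A`, the threshold `rel_tol`, and the function names are likewise illustrative.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Stand-in kernel (an assumption; not the paper's kernel head)."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def regularized_pinv(G, rel_tol=1e-6):
    """Pseudoinverse of a symmetric PSD matrix, dropping tiny eigenvalues."""
    w, U = np.linalg.eigh(G)
    keep = w > rel_tol * w.max()
    return (U[:, keep] / w[keep]) @ U[:, keep].T

def extend(v_query, v_A, f_A, kernel):
    """Evaluate h(v) = K(v, A)^T K(A, A)^+ f(A) for each row v of v_query."""
    return kernel(v_query, v_A) @ regularized_pinv(kernel(v_A, v_A)) @ f_A

rng = np.random.default_rng(0)
v_A = rng.normal(size=(8, 2))      # hypothetical learned positions v_a
f_A = rng.normal(size=(8, 3))      # hypothetical values f_i(a) on the set A

# Evaluate the extension on a grid of new positions and on A itself.
grid = np.stack(np.meshgrid(np.linspace(-2, 2, 5), np.linspace(-2, 2, 5)), axis=-1)
h_grid = extend(grid.reshape(-1, 2), v_A, f_A, gaussian_kernel)
h_on_A = extend(v_A, v_A, f_A, gaussian_kernel)
```

Dropping eigenvalues below `rel_tol` times the largest one means the extension projects `f_A` onto the well-conditioned part of the Gram matrix, so applying `extend` to its own output on `A` leaves it unchanged.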

F.3 TRAINING DETAILS: MNIST WITH MULTINOMIAL AUGMENTATIONS

Our goal in designing our task was to construct distributions $p(Z)$ and $p(A|Z)$ such that the exact positive-pair kernel could be computed, and so that the Markov chain would mix between different unperturbed dataset examples $z \in Z$ without changing the label too frequently. To this end, we constructed our task as follows:

• Define $Z$ to be a particular subset of the MNIST dataset, and choose $p(Z)$ to be the uniform distribution over $Z$. We consider two choices for $Z$: randomly selecting 512 images from each of the ten digit classes (used for comparisons between models), and randomly selecting 1024 images from the digits 0, 1, and 2 (used for visualizations of the eigenfunctions).

• For each image, generate 64 perturbed copies, by randomly blurring, translating, and rotating digits by a small amount. Add a small constant to each pixel so that no pixel has value zero, then normalize each such copy so that its pixel values sum to 1.

• To generate an augmentation of an image $z \in Z$ according to $p(A|Z = z)$, first choose one of the 64 copies of $z$, then sample a set of $k$ pixel locations with replacement from the distribution represented by that copy, where $k$ determines the augmentation strength. This means we are more likely to sample pixel locations which were within the original digit, although due to the perturbations described above the pixels may be scattered around the digit.

Our set $A$ is thus the set of all 28 × 28 images for which all pixels have a nonnegative integer value, and the total of all pixels is $k$. (Due to the low pixel density, to improve visibility in our figures we render each pixel as a box 5x its original size, with shading indicating overlap of these boxes. However, when giving input to the model, we directly input this sparse pixel representation, without the 5x multiplier.)

Given a particular augmented example $a$, we can compute $p(a|z)$ for any $z \in Z$ by summing over each of the 64 copies of $z$ and using the closed-form probability mass function of a multinomial distribution.
We can then compute p(z|a) by normalizing over all possible z, and use this to exactly compute the positive-pair kernel feature map ϕ + . We selected the perturbation parameters such that there was nontrivial uncertainty in z given each a; in other words, we made sure the positive pair Markov chain mixed well between examples.
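The computation of $p(a|z)$ and $p(z|a)$ can be sketched on a toy problem as follows. The sketch assumes that the perturbed copy is chosen uniformly at random (so the likelihood is an average over copies) and that $p(Z)$ is uniform; the 3-pixel "images", the two copies per image, and the function names are illustrative stand-ins for the 28 × 28 images and 64 copies described above.

```python
import numpy as np
from math import lgamma

def multinomial_log_pmf(counts, probs):
    """Log PMF of a multinomial with the given counts and cell probabilities."""
    counts = np.asarray(counts, dtype=float)
    k = counts.sum()
    log_coef = lgamma(k + 1) - sum(lgamma(c + 1) for c in counts)
    return log_coef + float(np.sum(counts * np.log(probs)))

def likelihood(a, copies):
    """p(a | z): average the multinomial PMF over the perturbed copies of z."""
    return float(np.mean([np.exp(multinomial_log_pmf(a, q)) for q in copies]))

def posterior(a, copies_per_z):
    """p(z | a) under a uniform prior over the images z."""
    lik = np.array([likelihood(a, copies) for copies in copies_per_z])
    return lik / lik.sum()

# Toy setting: 2 images, 2 perturbed copies each, 3 "pixels", k = 2 samples.
copies_per_z = [
    np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]),
    np.array([[0.1, 0.2, 0.7], [0.1, 0.3, 0.6]]),
]
a = np.array([2, 0, 0])            # both sampled pixels landed on pixel 0
post = posterior(a, copies_per_z)
```

For the full task one would work with log-probabilities throughout (for example via a log-sum-exp over copies) rather than exponentiating each term, since multinomial probabilities over 784 pixels are extremely small.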

F.3.1 MNIST MODEL ARCHITECTURES

For the majority of our experiments, we used three-block variants of a ResNet-18 model followed by two 128-dimensional fully-connected layers and a final output layer.

• Linear kernel: We used kernel parameterization $K_\theta(a_1, a_2) = h_\theta(a_1)^\top h_\theta(a_2)$, with $h_\theta$ as the output of the final layer. We set the dimension of the final layer to 512. We trained this model using the spectral contrastive loss.

• Hypersphere kernel, global scale: We used kernel parameterization $\exp(h_\theta(a_1)^\top h_\theta(a_2)/\tau + b)$, where $h_\theta$ is computed by normalizing the output of the final layer to lie on the unit hypersphere, and $b$ is a learned scalar. We set the dimension of the final layer to 32. We optimized the temperature $\tau$ and scale $b$ jointly with the model parameters. For the loss function, we used a weighted combination of 0.9 times the NT-XEnt loss and 0.1 times the NT-Logistic loss.

• Hypersphere kernel, individual scale: We used kernel parameterization $\exp(h_\theta(a_1)^\top h_\theta(a_2)/\tau + s_\theta(a_1) + s_\theta(a_2))$. We set the dimension of the final layer to 33, and defined $h_\theta$ by taking the first 32 entries and normalizing them to lie on the unit hypersphere. We then defined $s_\theta$ to be $5 \times \tanh(v_{33})$ where $v_{33}$ is the 33rd entry of the final layer. We again optimized the temperature $\tau$ jointly with the model parameters. The scale parameter allows the model to adjust the magnitude of the kernel on a per-example basis, which can be used to scale the eigenvalues of the principal components or to correct for differences in likelihood between points (since the magnitude of $K_+$ depends on the marginal likelihood of each point). For the loss function, we again used a weighted combination of 0.9 times the NT-XEnt loss and 0.1 times the NT-Logistic loss.

• Neural EF model: We set the dimension of the final layer to 512, and used L2 batch normalization on this final layer to ensure that the L2 norm of each output was 1 (i.e. that $C_j = 1$), as described by Deng et al. (2022). We used the modified version of the Neural EF objective described in Appendix E.2.

We trained our models using the Adam optimizer (Kingma & Ba, 2014) over 50,000 training iterations and a batch size of 4096 positive pairs per iteration, implemented using the JAX and FLAX libraries (Bradbury et al., 2018; Heek et al., 2020). For methods other than Neural EF, we used batch normalization for the first 35,000 iterations, then interpolated between the current batch statistics and the average from previous batches for 2,000 more iterations, and finally trained for 13,000 iterations using frozen batch norm statistics alone (i.e. in "inference" mode); we found that this increased the quality of the extracted principal components. For Neural EF, we keep batch normalization at all steps due to the constraint that $C_j = 1$.
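The three kernel head parameterizations listed above can be written compactly as functions of the final-layer outputs. The sketch below is a NumPy illustration operating on precomputed output vectors, not the JAX/FLAX model code; the function names and the random inputs are assumptions.

```python
import numpy as np

def normalize(x):
    """Project vectors onto the unit hypersphere along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def linear_kernel(out1, out2):
    """K(a1, a2) = h(a1)^T h(a2), with h the raw final-layer output."""
    return np.sum(out1 * out2, axis=-1)

def hypersphere_kernel_global(out1, out2, tau, b):
    """K(a1, a2) = exp(h(a1)^T h(a2) / tau + b) with normalized h."""
    return np.exp(np.sum(normalize(out1) * normalize(out2), axis=-1) / tau + b)

def hypersphere_kernel_individual(out1, out2, tau):
    """K(a1, a2) = exp(h(a1)^T h(a2) / tau + s(a1) + s(a2)).

    The first 32 entries give h (after normalization); the 33rd entry v
    gives the per-example scale s = 5 * tanh(v).
    """
    h1, h2 = normalize(out1[..., :32]), normalize(out2[..., :32])
    s1, s2 = 5.0 * np.tanh(out1[..., 32]), 5.0 * np.tanh(out2[..., 32])
    return np.exp(np.sum(h1 * h2, axis=-1) / tau + s1 + s2)

rng = np.random.default_rng(0)
outputs_a1 = rng.normal(size=(4, 33))   # hypothetical final-layer outputs
outputs_a2 = rng.normal(size=(4, 33))
k_individual = hypersphere_kernel_individual(outputs_a1, outputs_a2, tau=0.5)
```

Because $s_\theta$ is bounded in $[-5, 5]$, the per-example scale can change the kernel magnitude by at most a factor of $e^{10}$ in either direction relative to the global-scale head with $b = 0$.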

F.3.2 EXTRACTION AND ANALYSIS OF PRINCIPAL COMPONENTS

To estimate the eigenfunctions of the true kernel, we performed PCA using the explicit form $\phi_+$ of the positive-pair kernel feature map, where the population covariance was approximated by averaging over 256 augmentations for each of the images in $Z$. We then constructed the principal component projection functions using that covariance. Our alignment plots in Figure 7 for "True kernel" compare two PCA decompositions using independent random estimates of the population covariance.

For our approximate kernels $K_\theta$, we first constructed an approximation of the feature map for $K_\theta$ using the Nyström method (Williams & Seeger, 2000): we sampled a subset $S$ of augmentations by randomly selecting 25% of $Z$ and sampling one augmentation for each image, computed the Gram matrix $K_\theta(S, S)$ for that subset, then set $\hat{\phi}(a) = K_\theta(S, S)^{-1/2} K_\theta(S, a)$ where $K_\theta(S, a)$ is the vector of similarities of $a$ to each reference augmentation in $S$. The result is a feature map such that $\hat{\phi}(a_1)^\top \hat{\phi}(a_2) \approx K_\theta(a_1, a_2)$. We then again performed PCA using this feature map, using 256 samples per example in $Z$ to compute the covariance.

For the Neural EF model, we again directly used the model's outputs as the eigenfunctions, and used running averages of $R_{ii}$ to approximate eigenvalues. We note that the Neural EF method did not find a fully orthogonal basis (as indicated by nonzero $R_{ij}$ terms for $i \neq j$), and some of its later eigenfunction estimates were correlated with earlier eigenfunctions; we did not attempt to correct for this in our plots in Figure 7. We believe this is the cause of the "jumps" from smaller eigenvalue approximations to larger eigenvalue approximations. (In contrast, the approximate eigenfunctions from the kernel PCA approaches are by construction uncorrelated over the sampled augmentations, due to being derived from eigenvectors of the sample covariance.)

We normalized all principal component projection functions to have unit uncentered variance, i.e. $E[f_i(a)^2] = 1$ and $E[\hat{f}_i(a)^2] = 1$. As for the overlapping regions task, we then computed alignments by taking the squared covariance $E[f_i(a) \hat{f}_j(a)]^2$. We estimated the positive-pair discrepancy for each principal component function by taking the sample average of $(f_i(a_1) - f_i(a_2))^2$ over 16 randomly sampled augmentation pairs for each image in $Z$.
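The Nyström feature map $\hat{\phi}(a) = K_\theta(S, S)^{-1/2} K_\theta(S, a)$ described above can be sketched as follows. The example uses a low-dimensional linear kernel as a stand-in for a learned $K_\theta$, for which the approximation happens to be exact; the embeddings, the eigenvalue threshold `rel_tol`, and the function names are illustrative assumptions.

```python
import numpy as np

def linear_kernel(X, Y):
    """Stand-in for a learned kernel: K(a1, a2) = h(a1)^T h(a2)."""
    return X @ Y.T

def nystrom_feature_map(kernel, S, rel_tol=1e-10):
    """Return phi with phi(a) = K(S, S)^{-1/2} K(S, a), applied row-wise."""
    G = kernel(S, S)
    w, U = np.linalg.eigh(G)
    keep = w > rel_tol * w.max()            # drop numerically-zero directions
    G_inv_sqrt = (U[:, keep] / np.sqrt(w[keep])) @ U[:, keep].T
    return lambda X: kernel(X, S) @ G_inv_sqrt

rng = np.random.default_rng(0)
landmarks = rng.normal(size=(10, 4))        # hypothetical embeddings for views in S
queries = rng.normal(size=(5, 4))           # hypothetical embeddings of new views

phi = nystrom_feature_map(linear_kernel, landmarks)
approx = phi(queries) @ phi(queries).T      # phi(a1)^T phi(a2)
exact = linear_kernel(queries, queries)

# Ordinary PCA on the features then plays the role of Kernel PCA.
features = phi(queries)
cov = features.T @ features / len(features)  # uncentered sample covariance
pca_eigvals = np.linalg.eigvalsh(cov)[::-1]
```

For kernels that are not exactly low-rank, such as the hypersphere kernels, the quality of the approximation depends on how well the landmark set $S$ covers the data and on the conditioning of $K_\theta(S, S)$, which is consistent with the numerical precision issues reported in Appendix E.1.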

F.3.3 THREE-CLASS MNIST RATIONAL QUADRATIC MODEL

For Figure 1, we additionally trained a ResNet-18 model on only the MNIST digits 0, 1, and 2, using a scaled two-dimensional rational quadratic kernel:

$$K_\theta(a_1, a_2) = s_\theta(a_1) s_\theta(a_2) \left(1 - \frac{\|f_\theta(a_1) - f_\theta(a_2)\|^2}{2\alpha}\right)^{-\alpha}.$$

Here $f_\theta : A \to \mathbb{R}^2$ embeds inputs into the plane, $s_\theta : A \to \mathbb{R}^+$ is a learned scale function, and $\alpha$ is a learned shape parameter. We set the output dimension of the ResNet-18 model to 3, and took the first two elements as $f_\theta$; $s_\theta$ was defined as $\exp(5 \times \tanh(x))$ where $x$ is the third element. We also parameterize $\alpha = \exp(\gamma)$ and learn the scalar parameter $\gamma$. The model has a base hidden dimension of 128. The model was trained for 50,000 training iterations. We used the cross entropy InfoNCE loss (as described in Appendix B.1) and the Adam optimizer, with a batch size of 4096 positive pairs per iteration.

F.4 DOWNSTREAM SUPERVISED LEARNING ON MNIST

To better understand the performance of the true and approximate eigenfunctions for downstream supervised learning, we took our multinomial-sampling MNIST task, and compared the quality of various representations for two downstream prediction tasks: classification with a linear layer, and linear least-squares regression on the one-hot indicator vectors for each digit class. We considered three types of representation: the PCA projection functions for $K_+$, the PCA projection functions for each learned kernel $K_\theta$, and the intermediate layer embedding vector between the ResNet layers and the projection head for each model as proposed by Chen et al. (2020a).

The results are shown in Table 2. Performance is fairly similar across representations, suggesting that the positive-pair kernel $K_+$ captures much of the variability between augmentation strengths. Notably, at low augmentation strengths ($k = 50$), the true eigenfunctions have the smallest error, but eigenfunction approximations using Kernel PCA do not always outperform the intermediate layer representations, suggesting that inductive biases may play a role.

In more detail, to generate our supervised training and validation sets, we sampled one augmentation of each of 16 random images from each digit class, labeled with the original label, for a total of 160 labeled augmentations in each set. For the test set, we took 170 distinct images from each digit class and generated one augmentation from each image, without overlap with the training or validation sets, for a total of 1700 labeled augmentations.

For the classification task, we fit a logistic regression classifier on the training set using scikit-learn. For PCA representations, we swept over 40 logarithmically-spaced L2 regularization strengths from $10^{-4}$ to $10^{1}$, and also swept over representation dimension $d$, taking the first $d$ principal components for $d$ between 1 and 256. For ResNet embedding representations, we swept over 150 logarithmically-spaced L2 regularization strengths from $10^{-4}$ to $10^{10}$; we found that higher regularization strengths were necessary to attain a good solution. We selected the hyperparameters based on which setting gave the highest top-1 accuracy on the validation set.

For the regression task, we used a centered version of the one-hot indicator vector, e.g. the target vector for an example from digit 2 was [-0.1, -0.1, 0.9, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1]. The purpose of this centering was to ensure that the expected value of the label vector was the zero vector. We then fit a predictor using ridge regression (L2-regularized least-squares regression) in NumPy. As for the classification task, for PCA representations we swept over 40 logarithmically-spaced L2 regularization strengths from $10^{-4}$ to $10^{1}$ and over each representation dimension $d$ between 1 and 256, and for ResNet embedding representations we swept over 150 L2 regularization strengths from $10^{-4}$ to $10^{10}$. We selected the hyperparameters based on which setting gave the lowest squared error on the validation set.
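The regression protocol above can be sketched in NumPy as follows. This is an illustrative reconstruction under stated assumptions: no intercept term is fit, the squared error is summed over the ten outputs and averaged over examples, and the helper names (`centered_one_hot`, `ridge_fit`, `select_by_validation`) are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def centered_one_hot(labels, num_classes=10):
    """One-hot targets shifted so that a uniform label mix has zero mean."""
    return np.eye(num_classes)[labels] - 1.0 / num_classes

def ridge_fit(X, Y, l2):
    """Closed-form L2-regularized least squares (no intercept, an assumption)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)

def select_by_validation(X_tr, Y_tr, X_val, Y_val, l2_grid, dims):
    """Sweep representation dimension and L2 strength; keep the best on validation."""
    best = None
    for d in dims:
        for l2 in l2_grid:
            W = ridge_fit(X_tr[:, :d], Y_tr, l2)
            err = np.mean(np.sum((X_val[:, :d] @ W - Y_val) ** 2, axis=1))
            if best is None or err < best[0]:
                best = (err, d, l2, W)
    return best

# 40 logarithmically-spaced regularization strengths from 1e-4 to 1e1.
l2_grid = np.logspace(-4, 1, 40)
target_for_digit_2 = centered_one_hot([2])[0]
```

Here the columns of `X_tr` and `X_val` are assumed to be ordered by decreasing eigenvalue, so that slicing the first `d` columns corresponds to taking the first `d` principal components.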

F.5 PRINCIPAL COMPONENTS ANALYSIS OF SPECTRAL-LOSS LINEAR-KERNEL MODEL ON IMAGENET

In this section, we discuss some preliminary results from applying our Kernel PCA analysis to a real-world contrastive learning model. We started by training a variant of the SimCLR v2 model (Chen et al., 2020b) using the spectral contrastive loss and a linear kernel head, normalizing the output layer to have a constrained norm ∥h θ (a)∥ 2 = c as described by HaoChen et al. (2021). We explored automatically learning this norm c using the Adam optimizer and a separately-tuned learning rate. We found that the norm c reliably increased during training, but training tended to destabilize and produce NaN weights once c ≈ 90, and we were unable to successfully train a model with a higher output layer norm. We trained a model for 100 epochs (≈ 30,000 iterations) on the ImageNet 2012 dataset, using the default augmentation parameters and other hyperparameters for SimCLR v2, obtaining a similar supervised classification accuracy to previous results by HaoChen et al. (2021).

Next, we performed kernel PCA by using ordinary PCA with the explicit form of the kernel features (the output layer with normalization applied), since kernel PCA and ordinary PCA are equivalent for a linear kernel parameterization. We computed the covariance over a sample of 16 augmentations of each training dataset example, averaged across all examples. We then constructed the principal component functions $\hat{f}_i$ by normalizing based on the eigenvalues, and computed positive-pair discrepancies for each function over a sample of 8 augmentation pairs for each training dataset example.

Figure 8 shows the results of comparing the eigenvalues and positive-pair discrepancies, relative to the predicted relationship from Equation 4. For comparison, we also reproduce the corresponding figure for this kernel head parameterization and loss function on our overlapping regions task. We see that, across both tasks, the norm constraint causes estimated eigenvalues to be smaller than Equation 4 would predict, but there still appears to be an inverse correlation between the eigenvalue and positive-pair discrepancy. (On the overlapping regions task, it is close to a constant shift of the linear Equation 4 relationship. On the ImageNet task, the relationship is still somewhat linear, but with multiple irregularities, and a somewhat different slope than Equation 4 predicts; some of this may be due to increases in the norm constraint during training.)

We also observe that, in both tasks, the sum of the eigenvalues from Kernel PCA is exactly equal to the norm constraint c. This suggests that the norm constraint is "capping" the sum of the eigenvalues, forcing the model to only learn a subset of eigenfunctions despite having capacity for more. We conjecture that stabilizing the learning dynamics might enable us to remove the norm constraint c and thus capture additional eigenfunctions, leading to potentially superior representations for future self-supervised methods.
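The observation that the eigenvalues sum to the norm constraint follows from the trace of the uncentered covariance, as the short sketch below illustrates. It assumes the convention that the squared norm of each output equals $c$ (if instead the unsquared norm were fixed to $c$, the sum would be $c^2$); the random embeddings and the value of `c` are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 4.0
raw = rng.normal(size=(1000, 16))           # hypothetical output-layer activations

# Rescale every embedding so that its squared L2 norm equals c (an assumption
# about the norm convention).
h = np.sqrt(c) * raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Uncentered covariance of the features; for a linear kernel head, ordinary
# PCA on this matrix coincides with kernel PCA.
cov = h.T @ h / len(h)
eigvals = np.linalg.eigvalsh(cov)[::-1]

# The trace identity: sum of eigenvalues = E[||h(a)||^2].
eigval_sum = eigvals.sum()
```

Since the sum is pinned to $E[\|h_\theta(a)\|^2]$ regardless of what the model learns, a small norm budget must be shared among all recovered components, which matches the "capping" behavior described above.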



We focus on finite but arbitrarily large Z and A, e.g. the set of 8-bit 32x32 images, and allow Z ≠ A.
If we have a dataset of labeled un-augmented examples (z_i, y_i), we can build a dataset of labeled augmentations by sampling one or more augmentations of each example in our original dataset.
As long as we scale them appropriately and choose orthogonal functions within each eigenspace. In other words, when some eigenvalues have multiplicity > 1, the h_i and f_i are not uniquely determined, but we are free to choose them such that they satisfy this relationship.
See also https://math.stackexchange.com/questions/573583/eigenvalues-of-the-product-of-two-symmetric-matrices



Figure 2: Samples from the positive-pair Markov chain for MNIST k-pixel-subsampling augmentations at three strengths (k = 10, 20, 50). At each step we condition on an augmentation a_t (middle row) to sample an uncorrupted example z_t (top row), then sample a_{t+1} from z_t so that (a_t, a_{t+1}) is a positive pair. Below, we plot the five slowest-varying eigenfunctions f_1, ..., f_5 at each step of the chain, labeled with their eigenvalues λ_1, ..., λ_5. Weaker augmentations lead to slower mixing, smoother eigenfunctions, and eigenvalues closer to 1.


Theorem 4.1. Let F_r = {a → β^⊤ r(a) : β ∈ R^d} be the subspace of linear predictors from representation r, and S_ε be the set of functions satisfying Assumption 1.1. Let r_*^d(a) = [f_1(a), f_2(a), ..., f_d(a)] be the representation consisting of the d eigenfunctions of the positive-pair Markov chain with the largest eigenvalues. Then F_{r_*^d} maximizes the view invariance of the least-invariant unit-norm predictor in F_{r_*^d}:

Figure 4: (a) Toy "overlapping regions" contrastive learning task, where A is the set of cross markers, and Z is the set of shaded blue rectangles. (b) 2D embedding space learned by minimizing a contrastive loss with a rational quadratic kernel head K_θ. (c) The first 14 eigenfunctions of the true positive-pair Markov chain. (d) The first 14 principal component projection functions for the learned kernel K_θ extracted using Kernel PCA. Diagonal colored lines denote alignment between the learned functions and the true eigenspaces, which increases as K approaches K_+. (e) Evaluations of the learned projection functions over the entire latent embedding space using Kernel PCA, showing that it can embed points not seen during training.

Figure 5: Eigenfunction and eigenvalue estimation accuracy for a selection of models on our two synthetic tasks. Top row: Alignment between the eigenfunctions of K_+ and those of each kernel approximation, with perfect alignment shown as a block diagonal matrix. Bottom row: Relationship between learned kernel eigenvalue λ and the corresponding positive-pair discrepancy E_{p_+}[(f(a_1) − f(a_2))²], with the relationship predicted by Equation 4 shown with a dashed line.

A Justification of Assumption 1.1
B Existing objectives are minimized by the positive-pair kernel
    B.1 NT-XEnt and the InfoNCE objective
    B.2 Logistic losses and NT-Logistic
    B.3 The Spectral Contrastive Loss
    B.4 Other Related Objectives And Connections to the Positive-Pair Kernel
C Relationship between positive-pair kernel and Markov chain eigenfunctions
    C.1 Notation
    C.2 Proof of Correspondence Between Kernel PCA And Markov Chain Eigenfunctions
    C.3 Relationship to the eigenvectors of the symmetrized adjacency matrix
    C.4 Recovering proportionality constants
D Generalization properties of the eigenfunction representation
    D.1 Min-max optimality of eigenfunctions
    D.2 Generalization bound for linear prediction with the eigenfunction representation
E Eigenfunction estimation techniques
    E.1 Combining Contrastive Learning With Kernel PCA
    E.2 Direct Eigenfunction Extraction, And Connections to VICReg
F Experiment Details
    F.1 Additional Eigenfunction Estimation Results
    F.2 Training Details: Overlapping Regions Task
    F.3 Training Details: MNIST with Multinomial Augmentations
    F.4 Downstream Supervised Learning On MNIST
    F.5 Principal Components Analysis of Spectral-Loss Linear-Kernel Model on ImageNet

A JUSTIFICATION OF ASSUMPTION 1.1

by HaoChen et al. (2021). HaoChen et al. motivate their loss as estimating the eigenvectors of M up to a scaling term by p(a)^{1/2}, due to prior work showing that eigenvalues give information about clustering structure in graphs. The connection between the symmetrized adjacency matrix M and the positive-pair Markov chain is well known; indeed, HaoChen et al. briefly discuss the positive-pair Markov chain in their Section 2, and Levin & Peres (2017, Chapter 12) introduce the matrix M when discussing the spectral decomposition of a general symmetric Markov chain.


6. Let d* be the dimension of the null space of D_A^{-1/2} L D_A^{-1/2}. Then I − R has d eigenvalues with value 0 (i.e. 0 is an eigenvalue with multiplicity d). Of these, d* must be used to span this null space, and the remaining d − d* (if any) can be used to reduce the eigenvalues of A′. Thus, the largest eigenvalue of A′ is always at least as big as the (d − d* + 1)-th largest eigenvalue of D We can attain this bound by setting R to exactly capture the (d − d*)-dimensional subspace of D along with the d*-dimensional null space of D

and B′′ have the same eigenvalues. We now consider step 1. What should the player choose for R? By a similar eigenvalue-of-product argument as used for Equation 6, regardless of the choice of R the largest eigenvalue of B′′ must always be at least as big as the dth smallest eigenvalue of D has eigenvalue 1 with multiplicity d. We can attain this minimum cost by choosing R to project into the eigenspace spanned by the d eigenvectors of D eigenvalues. But observe that the smallest eigenvalues and corresponding eigenvectors of D are exactly the largest eigenvalues and corresponding eigenvectors of M = D , which as we argued above, is exactly the set of eigenfunctions f_i used to construct r_*^d.

For each, we fit a regularized linear predictor on 160 labeled training examples (16 augmented samples from each class), using 160 additional validation examples to tune the regularization strength. For PCA representations, we additionally tune the representation dimension d, choosing only the first d principal component functions.
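A minimal sketch of this kind of model selection is given below, assuming a regression target and closed-form ridge regression as the regularized linear predictor. The source does not specify the regularizer or the search grid, so ridge regression, the candidate values, the synthetic data, and the function names here are all hypothetical stand-ins. Columns of the feature matrix play the role of principal component functions ordered by decreasing eigenvalue, so taking the first d columns corresponds to choosing the first d components.

```python
import numpy as np

def fit_ridge(x, y, reg):
    """Closed-form ridge regression: argmin_w ||x w - y||^2 + reg * ||w||^2."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + reg * np.eye(d), x.T @ y)

def select_dim_and_reg(train_x, train_y, val_x, val_y, dims, regs):
    """Pick the (d, reg) pair with the lowest validation squared error,
    using only the first d feature columns for each candidate d."""
    best = None
    for d in dims:
        for reg in regs:
            w = fit_ridge(train_x[:, :d], train_y, reg)
            val_err = float(np.mean((val_x[:, :d] @ w - val_y) ** 2))
            if best is None or val_err < best["val_err"]:
                best = {"val_err": val_err, "d": d, "reg": reg, "w": w}
    return best

rng = np.random.default_rng(0)

# Synthetic stand-in: 160 training and 160 validation examples with 32
# features, where the target depends only on the first 4 features.
true_w = np.zeros(32)
true_w[:4] = [1.0, -2.0, 0.5, 1.5]
train_x = rng.normal(size=(160, 32))
val_x = rng.normal(size=(160, 32))
train_y = train_x @ true_w + 0.1 * rng.normal(size=160)
val_y = val_x @ true_w + 0.1 * rng.normal(size=160)

best = select_dim_and_reg(
    train_x, train_y, val_x, val_y,
    dims=[1, 2, 4, 8, 16, 32],
    regs=[1e-3, 1e-1, 1e1],
)
print(best["d"], best["reg"], best["val_err"])
```

Tuning d and the regularization strength jointly on the same validation set keeps the procedure simple, at the cost of a slightly optimistic validation error for the selected pair.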

[Figure panel: ResNet (SimCLR v2) model on ImageNet 2012; axis label: discrepancy E[(f(a_1) − f(a_2))²]]

Figure 8: Relationship between kernel PCA eigenvalue and positive-pair discrepancy for a norm-constrained linear kernel head and the spectral contrastive loss, across two datasets. Relationship predicted by Equation 4 is shown with a dashed line. Eigenvalues smaller than 10^{-6} are omitted.

R^d, e.g. we will approximate g with a parameterized function ĝ_β(a) = β^⊤ r(a). It turns out that the representation consisting of the d eigenfunctions {f_1, ..., f_d} with largest eigenvalues {λ_1, ..., λ_d} is the best choice under two simultaneous criteria.

Classification error (fraction misclassified) and regression error (squared error) on MNIST task with multinomial augmentations, across augmentation strengths k = 10, k = 20, k = 50.

ACKNOWLEDGEMENTS

We would like to thank David Duvenaud, Daniel Tarlow, and Lechao Xiao for providing feedback on this manuscript, and the ICLR 2023 reviewers for their additional suggestions and comments. We would also like to thank the reviewers and organizers of the ICML 2022 workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward for providing a venue for an earlier version of this work.

Published as a conference paper at ICLR 2023

The corresponding positive-pair Markov chain transition matrix can be written as P_mix = 0.5P + 0.5I. It follows that the eigenfunctions for this modified distribution are the same as those for our original positive-pair distribution, but the eigenvalues are transformed according to λ_i^mix = 0.5λ_i + 0.5. We can thus apply Neural EF to the corresponding kernel K_+^mix(a_1, a_2) = p_mix(a_1, a_2) / (p(a_1) p(a_2)), and recover the original λ_i by inverting this transformation.

Substituting this into the Neural EF objective, we obtain

\begin{align*}
R^{\mathrm{mix}}_{ij}
&= \sum_{a_1, a_2} f_i(a_1) \, \frac{p_{\mathrm{mix}}(a_1, a_2)}{p(a_1) p(a_2)} \, f_j(a_2) \, p(a_1) p(a_2) \\
&= 0.5 \sum_{a_1, a_2} f_i(a_1) \, \frac{p_+(a_1, a_2)}{p(a_1) p(a_2)} \, f_j(a_2) \, p(a_1) p(a_2)
 + 0.5 \sum_{a_1, a_2} f_i(a_1) \, \frac{p(a_1) \mathbb{1}[a_2 = a_1]}{p(a_1) p(a_2)} \, f_j(a_2) \, p(a_1) p(a_2) \\
&= 0.5 \sum_{a_1, a_2} f_i(a_1) f_j(a_2) \, p_+(a_1, a_2)
 + 0.5 \sum_{a_1, a_2} f_i(a_1) f_j(a_2) \, p(a_1) \mathbb{1}[a_2 = a_1] \\
&= 0.5 \, \mathbb{E}_{p_+(a_1, a_2)}\left[ f_i(a_1) f_j(a_2) \right]
 + 0.5 \, \mathbb{E}_{p(a_1)}\left[ f_i(a_1) f_j(a_1) \right].
\end{align*}

We then estimate R_ij^mix in each minibatch by averaging over minibatch positive pairs for the first term and over all minibatch augmentations for the second term. (In practice, we drop the 0.5 scaling factor in the loss.) We find that this modification greatly stabilizes the Neural EF objective when estimating large numbers of eigenfunctions, and in particular makes it possible to learn eigenfunctions of K_+ with eigenvalues that are very small or even zero.
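The two steps above, the eigenvalue shift under P_mix and the minibatch estimate of R_ij^mix, can be illustrated with a short NumPy sketch. The 3-state transition matrix and the function names are hypothetical, chosen only so the behavior is easy to check; this is not the Neural EF training code. The estimator follows the final line of the derivation: positive pairs for the first term, all minibatch augmentations for the second.

```python
import numpy as np

def estimate_r_mix(f_a1, f_a2):
    """Minibatch estimate of R^mix_ij.

    f_a1, f_a2: arrays of shape (batch, k) holding f_1..f_k evaluated on the
    two views (a1, a2) of each positive pair in the minibatch.
    """
    batch = f_a1.shape[0]
    # First term: average of f_i(a1) f_j(a2) over minibatch positive pairs.
    pair_term = f_a1.T @ f_a2 / batch
    # Second term: average of f_i(a) f_j(a) over all minibatch augmentations.
    all_f = np.concatenate([f_a1, f_a2])
    self_term = all_f.T @ all_f / all_f.shape[0]
    return 0.5 * pair_term + 0.5 * self_term

def recover_eigenvalues(lam_mix):
    """Invert lambda_mix = 0.5 * lambda + 0.5."""
    return 2.0 * lam_mix - 1.0

# Toy symmetric transition matrix standing in for the positive-pair chain.
P = np.array([
    [0.80, 0.15, 0.05],
    [0.15, 0.70, 0.15],
    [0.05, 0.15, 0.80],
])
P_mix = 0.5 * P + 0.5 * np.eye(3)

lam, vecs = np.linalg.eigh(P)          # eigenvalues in ascending order
lam_mix, _ = np.linalg.eigh(P_mix)     # same ordering, shifted values

print(lam)
print(recover_eigenvalues(lam_mix))
```

Because P_mix v = 0.5 P v + 0.5 v, every eigenvector of P is also an eigenvector of P_mix, which is why only the eigenvalues need to be mapped back.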

F EXPERIMENT DETAILS

In this section we describe our experiments and their results in more detail. We start by presenting the full set of results summarized in Section 6 and Figure 5. We then describe details regarding model training. We conclude with some additional preliminary results regarding supervised learning with Kernel PCA representations on MNIST and eigenfunction extraction on ImageNet.

F.1 ADDITIONAL EIGENFUNCTION ESTIMATION RESULTS

Overlapping regions task. Results for our full set of models on the "overlapping regions" task are shown in Figure 6.

We find that, under suitable losses, linear kernels, hypersphere kernels, and NeuralEF can all produce good approximations of the basis of eigenfunctions, despite their different parameterizations. Specifically, unconstrained-norm linear kernels with the spectral loss, hypersphere kernels with a learned bias under either a XEnt-Logistic loss mixture or the spectral loss, and the NeuralEF eigenfunction estimation method all produce eigenfunction estimates that are reasonably aligned to the true eigenfunctions, and eigenvalue estimates that are close to the true eigenvalues. However, especially for eigenspaces with similar eigenvalues, the eigenfunctions occasionally appear in the incorrect order, and have a small amount of mixing between eigenspaces. The relationship between approximate eigenvalues and positive-pair discrepancies also closely matches the prediction from Equation 4.

We note also that alignment with the basis of eigenfunctions emerges during training, and is not simply a property of the model architecture, as evidenced by the lack of alignment when applying Kernel PCA to randomly initialized models.

[Figure caption fragment: "... 4 shown as a gray dashed line. Bottom row: Ordered eigenvalues for each approximation, relative to those of the true kernel (dashed line)."]

The loss function used influences the learning dynamics and final result. Using the XEnt loss alone produces a reasonably well-aligned eigenfunction decomposition, but has eigenvalues scaled by a constant, since the XEnt loss only measures ratios between kernel evaluations and thus only recovers the kernel up to a constant factor (discussed in Appendix B.1). The Logistic loss alone appears to lead to inferior decomposition quality, although eigenvalues are in the right order of magnitude.
Interestingly, the spectral loss seems to work even for hypersphere kernel approximations, although we found it to be the most unstable; successfully training a hypersphere kernel with the spectral loss required initializing the bias b to a large negative number.

Constraints on the kernel approximation degrade eigenfunction and eigenvalue estimates. For the hypersphere kernel, fixing the bias b to zero led to eigenvalues that were abnormally small, whereas reducing the dimensionality of the hypersphere from 32 to 3 both degraded eigenfunction alignment and led to deviations from Equation 4. For the linear kernel, using a smaller dimension led to estimating only the eigenfunctions with larger eigenvalues, and imposing a norm constraint of ∥h_θ(a)∥² = 10 both reduced the number of accurately-captured eigenfunctions and caused the eigenvalues to be smaller than predicted by Equation 4.

MNIST task. Results for our full set of models at three augmentation strengths are shown in Figure 7.

We compare two types of hypersphere kernel parameterization: a "global scale" version using exp(h_θ(a_1)^⊤ h_θ(a_2)/τ + b) and a "local scale" version using exp(h_θ(a_1)^⊤ h_θ(a_2)/τ + s_θ(a_1) + s_θ(a_2)). The second is more expressive, but the first is closer to that considered by prior work such as Chen et al. (2020a). (Note that most work with hypersphere-based models fixes b = 0, but also uses the XEnt loss alone, which is not affected by the value of b. We include b and include the Logistic loss to assess how well the models can recover the correct values of the eigenvalues.)

Note that, although we can exactly evaluate K_+ on any pair of augmented views, the space A (containing all multisets of k pixels) is too large to enumerate, preventing us from computing the exact eigenfunctions of K_+. Instead, we use Kernel PCA over a large set of samples to estimate the "ground truth" eigenfunctions.
To better understand the impact of this step, we also include a comparison between two independent runs of Kernel PCA on K_+.

Weaker augmentations make principal component estimation difficult. We find that Kernel PCA can more reliably recover principal components with large gaps between eigenvalues, and is influenced

