JOINT EMBEDDING SELF-SUPERVISED LEARNING IN THE KERNEL REGIME Anonymous

Abstract

The fundamental goal of self-supervised learning (SSL) is to produce useful representations of data without access to any labels for classifying the data. Modern SSL methods, which form representations based on known or constructed relationships between samples, have been particularly effective at this task. Motivated by a rich line of work in kernel methods performed on graphs and manifolds, we show that SSL methods likewise admit a kernel regime in which embeddings are constructed by linear maps acting on the feature space of a kernel, and we find the optimal form of the output representations for contrastive and non-contrastive loss functions. This procedure produces a new representation space whose inner product, denoted the induced kernel, generally correlates points that are related by an augmentation in kernel space and decorrelates points otherwise. We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and gain theoretical insights into their performance on downstream tasks.

1. INTRODUCTION

Self-supervised learning (SSL) algorithms are broadly tasked with learning from unlabeled data. In the joint embedding framework of SSL, mainstream contrastive methods build representations by reducing the distance between inputs related by an augmentation (positive pairs) and increasing the distance between inputs not known to be related (negative pairs) (Chen et al., 2020; He et al., 2020; Oord et al., 2018; Ye et al., 2019). Non-contrastive methods only enforce similarities between positive pairs but are designed carefully to avoid collapse of representations (Grill et al., 2020; Zbontar et al., 2021). Recent SSL algorithms have performed remarkably well, reaching performance similar to baseline supervised learning algorithms on many downstream tasks (Caron et al., 2020; Bardes et al., 2021; Chen & He, 2021).

In this work, we study SSL from a kernel perspective, motivated by the rich history of kernel algorithms on graphs and manifolds (Smola & Kondor, 2003; Ando & Zhang, 2006; Bellet et al., 2013; Belkin & Niyogi, 2004). Our primary aim is to extend this line of work to cover commonly used loss functions in modern SSL settings, potentially providing useful insights into their properties and performance. In standard SSL tasks, inputs are fed into a neural network and mapped into a feature space which encodes the final representations used in downstream tasks (e.g., classification tasks). In the kernel setting, inputs are embedded in a feature space corresponding to a kernel, and representations are constructed via an optimal mapping from this feature space to the vector space for the representations of the data. Here, the task can be framed as one of finding an optimal "induced" kernel, which is a mapping from the original kernel in the input feature space to an updated kernel function acting on the vector space of the representations.
Our results show that such an induced kernel can be constructed using only manipulations of kernel functions and data that encodes the relationships between inputs in an SSL algorithm (e.g., adjacency matrices between the input datapoints). More broadly, we make the following contributions:
• For a contrastive and a non-contrastive loss, we provide closed-form solutions when the algorithm is trained over a single batch of data. These solutions form a new "induced" kernel which can be used to perform downstream supervised learning tasks.
• We show that a version of the representer theorem in kernel methods can be used to formulate kernelized SSL tasks as optimization problems. As an example, we show how to optimally find induced kernels when the loss is enforced over separate batches of data.
• We empirically study the properties of our SSL kernel algorithms to gain insights about the training of SSL algorithms in practice. We study the generalization properties of SSL algorithms and show that the choice of augmentation and the adjacency matrices encoding relationships between the datapoints are crucial to performance.

We proceed as follows. First, we provide a brief background of the goals of our work and the theoretical tools used in our study. Second, we show that kernelized SSL algorithms trained on a single batch admit a closed-form solution for commonly used contrastive and non-contrastive loss functions. Third, we generalize our findings to provide a semi-definite programming formulation to solve for the optimal induced kernel in more general settings and provide heuristics to better understand the form and properties of the induced kernels. Finally, we empirically investigate our kernelized SSL algorithms when trained on various datasets (code included in supplemental material).

1.1. NOTATION AND SETUP

We denote vectors and matrices with lowercase ($x$) and uppercase ($X$) letters respectively. The vector 2-norm and the matrix operator norm are denoted by $\|\cdot\|$. The Frobenius norm of a matrix $M$ is denoted as $\|M\|_F$. We denote the transpose and conjugate transpose of $M$ by $M^\top$ and $M^\dagger$ respectively. We denote the identity matrix as $I$ and the vector with each entry equal to one as $\mathbf{1}$. For a diagonalizable matrix $M$, its projection onto the eigenspace of its positive eigenvalues is $M_+$.

For a dataset of size $N$, let $x_i \in \mathcal{X}$ for $i \in [N]$ denote the elements of the dataset. Given a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, let $\Phi(x) = k(x, \cdot)$ be the map from inputs to the reproducing kernel Hilbert space (RKHS) denoted by $\mathcal{H}$ with corresponding inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and RKHS norm $\|\cdot\|_{\mathcal{H}}$. Throughout, we denote by $K_{s,s} \in \mathbb{R}^{N \times N}$ the kernel matrix of the SSL dataset, where $(K_{s,s})_{ij} = k(x_i, x_j)$. We consider linear models $W : \mathcal{H} \to \mathbb{R}^K$ which map features to representations $z_i = W\Phi(x_i)$. Let $Z \in \mathbb{R}^{N \times K}$ be the representation matrix which contains the representations $z_i = W\Phi(x_i)$ as its rows. This linear function space induces a corresponding RKHS norm which can be calculated as $\|W\|_{\mathcal{H}} = \sqrt{\sum_{i=1}^{K} \langle W_i, W_i \rangle_{\mathcal{H}}}$, where $W_i \in \mathcal{H}$ denotes the $i$-th component of the output of the linear mapping $W$. This linear mapping constructs an "induced" kernel denoted as $k^*(\cdot, \cdot)$, as discussed later.

The driving motive behind modern self-supervised algorithms is to maximize the information of given inputs in a dataset while enforcing similarity between inputs that are known to be related. The adjacency matrix $A \in \{0, 1\}^{N \times N}$ (which can also be generalized to $A \in \mathbb{R}^{N \times N}$) connects related inputs $x_i$ and $x_j$ (i.e., $A_{ij} = 1$ if inputs $i$ and $j$ are related by a transformation), and $D_A$ is a diagonal matrix where entry $i$ on the diagonal is equal to the number of nonzero elements of row $i$ of $A$.
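The objects defined above can be made concrete with a small sketch. The RBF kernel, the helper names `rbf_kernel` and `pairwise_adjacency`, the toy points, and the convention that consecutive points form a positive pair are all illustrative assumptions, not part of the paper's code.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def pairwise_adjacency(n_pairs):
    """Adjacency matrix where points 2i and 2i+1 form a positive pair."""
    N = 2 * n_pairs
    A = np.zeros((N, N))
    first = np.arange(0, N, 2)
    A[first, first + 1] = 1.0
    A[first + 1, first] = 1.0
    return A

# Toy SSL dataset (hypothetical): three originals, each followed by a
# shifted copy standing in for its augmentation.
X = np.array([[0.0, 0.0], [0.5, 0.0],
              [3.0, 0.0], [3.5, 0.0],
              [0.0, 3.0], [0.5, 3.0]])

K_ss = rbf_kernel(X, X)                      # (K_ss)_ij = k(x_i, x_j)
A = pairwise_adjacency(3)                    # A_ij = 1 iff x_i and x_j are related
D_A = np.diag(np.count_nonzero(A, axis=1))   # nonzero entries per row of A
L = D_A - A                                  # graph Laplacian D_A - A
```

With pairwise augmentations every row of A has exactly one nonzero entry, so D_A is the identity here.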

2. RELATED WORKS

In this section, we briefly summarize some of the related works for this study. We include a more detailed related works section in Appendix A.

Modern self-supervised learning approaches: Joint embedding approaches to SSL produce representations by comparing representations of inputs via known relationships. Methods are denoted as non-contrastive if the loss function is only a function of pairs that are related (Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021; Bardes et al., 2021). Contrastive methods also penalize similarities of representations that are not related. Popular algorithms include SimCLR (Chen et al., 2020), SwAV (Caron et al., 2020), NNCLR (Dwibedi et al., 2021), contrastive predictive coding (Oord et al., 2018), spectral contrastive loss (HaoChen et al., 2021), and many others. Separate from the joint embedding framework, many methods in SSL form representations by predicting held-out portions of the data, typically in a generative model setting. Commonly used algorithms incorporate autoencoder approaches (He et al., 2022; Radford et al., 2019; Vincent et al., 2010; Dosovitskiy et al., 2020) which are used in both natural language processing and image processing tasks.

Figure 1: To translate the SSL setting into the kernel regime, we aim to find the optimal linear function which maps inputs from the RKHS into the K-dimensional feature space of the representations. This new feature space induces a new optimal kernel denoted the "induced" kernel. Relationships between datapoints are encoded in an adjacency matrix (the example matrix shown here contains pairwise relationships between datapoints).

Neural tangent kernels and Gaussian processes: Prior work has connected the outputs of infinite width neural networks to a corresponding Gaussian process (Williams & Rasmussen, 2006; Neal, 1996; Lee et al., 2017).
When trained using continuous time gradient descent, these infinite width models evolve as linear models under the so-called neural tangent kernel (NTK) regime (Jacot et al., 2018; Arora et al., 2019a). The discovery of the NTK opened a flurry of exploration into the connections between so-called lazy training of wide networks and kernel methods (Yang, 2019; Chizat & Bach, 2018; Wang et al., 2022; Bietti & Mairal, 2019). Though the training dynamics of the NTK have previously been studied in the supervised setting, one can analyze an NTK in a self-supervised setting by using that kernel in the SSL algorithms that we study here. We perform some preliminary investigation in this direction in our experiments.

Kernel and metric learning: Various works have proposed graph-based kernels that have ties to semi-supervised learning and appealing regularization properties over the geometric structure of the data (Smola & Kondor, 2003; Belkin et al., 2006; Vishwanathan et al., 2010). Links between spectral properties of the graphs and representations are often formed during the process of learning (Belkin & Niyogi, 2004; Zhao & Liu, 2012). Throughout our study, we note connections to these related works. From an algorithmic perspective, perhaps the closest lines of work are related to kernel and metric learning (Bellet et al., 2013; Yang & Jin, 2006). Since our focus is on directly kernelizing SSL methods to eventually analyze and better understand SSL algorithms, our end goal is not to improve over these methods but instead to extend them to the SSL setting. These works are further summarized in Appendix A.

3. CONTRASTIVE AND NON-CONTRASTIVE KERNEL METHODS

Stated informally, the goal of SSL in the kernel setting is to start with a given kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (e.g., an RBF kernel or a neural tangent kernel) and map this kernel function to a new "induced" kernel $k^* : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is a function of the SSL loss function and the SSL dataset. For two new inputs $x$ and $x'$, the induced kernel $k^*(x, x')$ generally outputs correlated values if $x$ and $x'$ are correlated in the original kernel space to the same datapoint in the SSL dataset, or correlated to separate but related datapoints in the SSL dataset as encoded in the graph adjacency matrix. If no relations are found in the SSL dataset between $x$ and $x'$, then the induced kernel will generally output an uncorrelated value.

To kernelize SSL methods, we consider a setting generalized from the prototypical SSL setting, where representations are obtained by minimizing distances between related (augmented) samples and maximizing distances between unrelated samples. Translating this to the kernel regime, as illustrated in Figure 1, our goal is to find a linear mapping $W^* : \mathcal{H} \to \mathbb{R}^K$ which obtains the optimal representation of the data for a given SSL loss function and minimizes the RKHS norm. This optimal solution produces an "induced kernel" $k^*(\cdot, \cdot)$ which is the inner product of the data in the output representation space. Once constructed, the induced kernel can be used in downstream tasks to perform supervised learning.

Due to a generalization of the representer theorem (Schölkopf et al., 2001), we can show that the optimal linear function $W^*$ must lie in the span of the features of the data. This implies that the induced kernel can be written as a function of the kernel between datapoints in the SSL dataset.

Proposition 3.1 (Form of optimal representation). Given a dataset $x_1, \dots, x_N \in \mathcal{X}$, let $k(\cdot, \cdot)$ be a kernel function with corresponding map $\Phi : \mathcal{X} \to \mathcal{H}$ into the RKHS $\mathcal{H}$.
Let $W : \mathcal{H} \to \mathbb{R}^K$ be a function drawn from the space $\mathcal{W}$ of linear functions mapping inputs in the RKHS to the vector space of the representation. For a risk function $\mathcal{R}(W\Phi(x_1), \dots, W\Phi(x_N)) \in \mathbb{R}$ and any strictly increasing function $r : [0, \infty) \to \mathbb{R}$, consider the optimization problem

$$W^* = \arg\min_{W \in \mathcal{W}} \mathcal{R}(W\Phi(x_1), \dots, W\Phi(x_N)) + r\left(\|W\|_{\mathcal{H}}\right). \tag{1}$$

The optimal solutions of the above take the form

$$\text{optimal representation: } W^*\Phi(x) = M k_{X,x}, \qquad \text{induced kernel: } k^*(x, x') = (M k_{X,x})^\top M k_{X,x'} = k_{X,x}^\top M^\top M k_{X,x'}, \tag{2}$$

where $M \in \mathbb{R}^{K \times N}$ is a matrix that must be solved for and $k_{X,x} \in \mathbb{R}^N$ is a vector with entries $[k_{X,x}]_i = k(x_i, x)$.

Proposition 3.1, proved in Appendix B.1, provides a prescription for finding the optimal representations or induced kernels: one must search over the set of matrices $M \in \mathbb{R}^{K \times N}$ to find an optimal matrix. This search can be performed using standard optimization techniques as we will discuss later, but in certain cases, the optimal solution can be calculated in closed form, as shown next for both a contrastive and a non-contrastive loss function.

Non-contrastive loss

Consider a variant of the VICReg (Bardes et al., 2021) loss function below:

$$\mathcal{L}_{VIC} = \left\| Z^\top \left( I - \tfrac{1}{N} \mathbf{1}\mathbf{1}^\top \right) Z - I \right\|_F^2 + \beta \, \mathrm{Tr}\left[ Z^\top L Z \right], \tag{3}$$

where $\beta \in \mathbb{R}^+$ is a hyperparameter that controls the invariance term in the loss and $L = D_A - A$ is the graph Laplacian of the data. We note that the second term in the above loss function takes a role akin to the Laplacian or manifold regularization term studied in kernel methods over graphs (Ando & Zhang, 2006; Belkin et al., 2006).
When the representation space has dimension $K \geq N$ and the kernel matrix of the data is full rank, the induced kernel of the above loss function is:

$$k^*(x, x') = k_{x,s} K_{s,s}^{-1} \left( I - \tfrac{1}{N} \mathbf{1}\mathbf{1}^\top - \tfrac{\beta}{2} L \right)_+ K_{s,s}^{-1} k_{s,x'}, \tag{4}$$

where $(\cdot)_+$ projects the matrix inside the parentheses onto the eigenspace of its positive eigenvalues, $k_{x,s} \in \mathbb{R}^{1 \times N}$ is the kernel row-vector with entry $i$ equal to $k(x, x_i)$ with $k_{s,x}$ equal to its transpose, and $K_{s,s} \in \mathbb{R}^{N \times N}$ is the kernel matrix of the training data for the self-supervised dataset where entry $i, j$ is equal to $k(x_i, x_j)$. When we restrict the output space of the self-supervised learning task to be of dimension $K < N$, the induced kernel only incorporates the top $K$ eigenvectors of $I - \tfrac{1}{N} \mathbf{1}\mathbf{1}^\top - \tfrac{\beta}{2} L$:

$$k^*(x, x') = k_{x,s} K_{s,s}^{-1} C_{:,\leq K} D_{\leq K,\leq K} C_{:,\leq K}^\top K_{s,s}^{-1} k_{s,x'}, \tag{5}$$

where $CDC^\top$ is the eigendecomposition of $I - \tfrac{1}{N} \mathbf{1}\mathbf{1}^\top - \tfrac{\beta}{2} L$ including only positive eigenvalues sorted in descending order, $C_{:,\leq K}$ denotes the matrix consisting of the first $K$ columns of $C$, and $D_{\leq K,\leq K}$ denotes the $K \times K$ matrix consisting of the entries in the first $K$ rows and columns of $D$. Proofs of the above are in Appendix B.2.
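A minimal sketch of the non-contrastive closed form, assuming an RBF base kernel, a binary pairwise adjacency matrix, and a full-rank kernel matrix. The function names (`positive_part`, `noncontrastive_induced_kernel`) and the toy data are hypothetical and only meant to illustrate the formula; `top_k` plays the role of a restricted representation dimension K < N.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def positive_part(S, top_k=None):
    """Project symmetric S onto its positive eigenspace (optionally top_k only)."""
    evals, evecs = np.linalg.eigh(0.5 * (S + S.T))
    order = np.argsort(evals)[::-1]          # eigenvalues in descending order
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > 1e-10
    if top_k is not None:
        keep[top_k:] = False
    return (evecs[:, keep] * evals[keep]) @ evecs[:, keep].T

def noncontrastive_induced_kernel(X_ssl, A, beta, gamma=1.0, top_k=None):
    """Return k*(., .) and the projected matrix (I - 11^T/N - beta/2 L)_+."""
    N = X_ssl.shape[0]
    K_ss = rbf_kernel(X_ssl, X_ssl, gamma)
    L = np.diag(A.sum(axis=1)) - A           # Laplacian D_A - A for binary A
    target = positive_part(
        np.eye(N) - np.ones((N, N)) / N - 0.5 * beta * L, top_k
    )
    K_inv = np.linalg.inv(K_ss)              # assumes K_ss is full rank
    B = K_inv @ target @ K_inv

    def k_star(X1, X2):
        return rbf_kernel(X1, X_ssl, gamma) @ B @ rbf_kernel(X_ssl, X2, gamma)

    return k_star, target

# Toy data (hypothetical): three positive pairs (0,1), (2,3), (4,5).
X = np.array([[0.0, 0.0], [0.5, 0.0],
              [3.0, 0.0], [3.5, 0.0],
              [0.0, 3.0], [0.5, 3.0]])
A = np.kron(np.eye(3), np.array([[0.0, 1.0], [1.0, 0.0]]))

k_star, target = noncontrastive_induced_kernel(X, A, beta=0.4)
K_star_train = k_star(X, X)   # on the SSL points this reduces to `target`
```

On the SSL datapoints themselves the factors of the kernel matrix cancel, so the induced kernel matrix equals the projected matrix; new inputs are handled through their kernel vectors against the SSL set.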

Contrastive loss

For contrastive SSL, we can also obtain a closed-form solution to the induced kernel for a variant of the spectral contrastive loss (HaoChen et al., 2021):

$$\mathcal{L}_{sc} = \left\| ZZ^\top - (I + A) \right\|_F^2, \tag{6}$$

where $A$ is the adjacency matrix encoding relations between datapoints. When the representation space has dimension $K \geq N$, this loss results in the optimal induced kernel:

$$k^*(x, x') = k_{x,s} K_{s,s}^{-1} (I + A)_+ K_{s,s}^{-1} k_{s,x'}, \tag{7}$$

where $(I + A)_+$ is equal to the projection of $I + A$ onto its eigenspace of positive eigenvalues. In the standard SSL setting where relationships are pairwise (i.e., $A_{ij} = 1$ if $x_i$ and $x_j$ are related by an augmentation), $I + A$ has only positive or zero eigenvalues, so the projection can be ignored. If $K < N$, then we similarly project the matrix $I + A$ onto its top $K$ eigenvalues and obtain an induced kernel similar to the non-contrastive one:

$$k^*(x, x') = k_{x,s} K_{s,s}^{-1} C_{:,\leq K} D_{\leq K,\leq K} C_{:,\leq K}^\top K_{s,s}^{-1} k_{s,x'}, \tag{8}$$

where, as before, $CDC^\top$ is the eigendecomposition of $I + A$ including only positive eigenvalues sorted in descending order, $C_{:,\leq K}$ consists of the first $K$ columns of $C$, and $D_{\leq K,\leq K}$ is the $K \times K$ matrix of the first $K$ rows and columns of $D$. Proofs of the above are in Appendix B.3. Note that the induced kernels generally correlate data over the top eigenvalues of the graph adjacency matrix, in line with findings from previous works in kernel methods and spectral graph theory (Zhao & Liu, 2007; Belkin & Niyogi, 2004).
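The contrastive closed form can be sketched in the same way, this time building the matrix M of Proposition 3.1 explicitly so that representations can be extracted. The names `contrastive_map` and `represent`, the RBF kernel, and the toy data are illustrative assumptions; the factorization M = D^{1/2} C^T K^{-1} is one choice that reproduces the stated induced kernel.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def contrastive_map(X_ssl, A, gamma=1.0, top_k=None):
    """Return M such that the representation of x is M k_{s,x}."""
    N = A.shape[0]
    evals, evecs = np.linalg.eigh(np.eye(N) + A)
    order = np.argsort(evals)[::-1]          # eigenvalues in descending order
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > 1e-10                     # positive eigenvalues only
    if top_k is not None:
        keep[top_k:] = False
    C, D = evecs[:, keep], evals[keep]
    K_inv = np.linalg.inv(rbf_kernel(X_ssl, X_ssl, gamma))  # assumes full rank
    return (np.sqrt(D)[:, None] * C.T) @ K_inv   # M = D^{1/2} C^T K^{-1}

def represent(M, X_ssl, X_new, gamma=1.0):
    """Rows of the result are the representations z = M k_{s,x}."""
    return (M @ rbf_kernel(X_ssl, X_new, gamma)).T

# Toy data (hypothetical): three positive pairs (0,1), (2,3), (4,5).
X = np.array([[0.0, 0.0], [0.5, 0.0],
              [3.0, 0.0], [3.5, 0.0],
              [0.0, 3.0], [0.5, 3.0]])
A = np.kron(np.eye(3), np.array([[0.0, 1.0], [1.0, 0.0]]))

M = contrastive_map(X, A)
Z = represent(M, X, X)
K_star = Z @ Z.T                             # induced kernel on the SSL points
loss = np.linalg.norm(K_star - (np.eye(6) + A), "fro") ** 2
```

For pairwise augmentations I + A is positive semi-definite, so in this toy case the induced kernel on the SSL points matches I + A and the loss vanishes up to floating-point error: related points have kernel value 1 and unrelated points 0.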

3.1. GENERAL FORM AS SDP

The closed-form solutions for the induced kernel obtained above assumed the loss function was enforced across a single batch. Of course, in practice, data are split into several batches. This batched setting may not admit a closed-form solution, but by Proposition 3.1, we know that any optimal induced kernel takes the general form:

$$k^*(x, x') = k_{x,s} B k_{s,x'},$$

where $B \in \mathbb{R}^{N \times N}$ is a positive semi-definite matrix. With constraints properly chosen so that the solution for each batch is optimal (Balestriero & LeCun, 2022; HaoChen et al., 2022), one can find the optimal matrix $B^*$ by solving a semi-definite program (SDP). We perform this conversion to an SDP for the contrastive loss here and leave proofs and further details, including the non-contrastive case, to Appendix B.4.

Introducing some notation to deal with batches, assume we have $N$ datapoints split into $n_{batches}$ batches of size $b$. We denote the $i$-th datapoint within batch $j$ as $x_i^{(j)}$. As before, $x_i$ denotes the $i$-th datapoint across the whole dataset. Let $K_{s,s} \in \mathbb{R}^{N \times N}$ be the kernel matrix over the complete dataset where $[K_{s,s}]_{i,j} = k(x_i, x_j)$, let $K_{s,s_j} \in \mathbb{R}^{N \times b}$ be the kernel matrix between the complete dataset and batch $j$ where $[K_{s,s_j}]_{a,b} = k(x_a, x_b^{(j)})$, and let $A^{(j)}$ be the adjacency matrix for inputs in batch $j$. With this notation, we now aim to minimize the loss function adapted from Equation (6), including a regularizing term for the RKHS norm:

$$\mathcal{L} = \sum_{j=1}^{n_{batches}} \left\| K_{s_j,s} B K_{s,s_j} - \left( I + A^{(j)} \right) \right\|_F^2 + \alpha \, \mathrm{Tr}(B K_{s,s}),$$

where $\alpha \in \mathbb{R}^+$ is a weighting term for the regularizer. Taking the limit $\alpha \to 0$, we can find the optimal induced kernel for a representation of dimension $K > b$ by enforcing that optimal representations are obtained in each batch:

$$\min_{B \in \mathbb{R}^{N \times N}} \mathrm{Tr}(B K_{s,s}) \quad \text{s.t.} \quad K_{s_j,s} B K_{s,s_j} = \left( I + A^{(j)} \right)_+ \ \ \forall j \in \{1, 2, \dots, n_{batches}\}, \quad B \succeq 0, \quad \mathrm{rank}(B) = K,$$

where, as before, $(I + A^{(j)})_+$ is equal to the projection of $I + A^{(j)}$ onto its eigenspace of positive eigenvalues.
Relaxing the problem by removing the constraint that $\mathrm{rank}(B) = K$ results in an SDP which can be efficiently solved using existing optimizers. Further details and a generalization of this conversion to other representation dimensions are given in Appendix B.4.
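The trace objective and the batch constraints in the program above can be motivated by a short calculation. The sketch below assumes the parametrization of Proposition 3.1, i.e. each component of W is a combination of the features of the SSL points with coefficients given by the rows of M, and it takes the squared norm of W to be the sum of the squared RKHS norms of its components; both are our reading of the setup rather than a restatement of the appendix proofs.

```latex
% Assumption: W_i = \sum_{a=1}^{N} M_{ia} \Phi(x_a) for i = 1, ..., K  (Proposition 3.1).
\begin{align*}
\|W\|_{\mathcal{H}}^2
  &= \sum_{i=1}^{K} \langle W_i, W_i \rangle_{\mathcal{H}}
   = \sum_{i=1}^{K} \sum_{a,b=1}^{N} M_{ia} M_{ib}\, k(x_a, x_b) \\
  &= \operatorname{Tr}\!\left( M K_{s,s} M^{\top} \right)
   = \operatorname{Tr}\!\left( M^{\top} M K_{s,s} \right)
   = \operatorname{Tr}\!\left( B K_{s,s} \right),
   \qquad B = M^{\top} M \succeq 0 .
\end{align*}
% The representations of batch j are Z^{(j)} = K_{s_j,s} M^{\top}, so their Gram matrix is
\begin{equation*}
Z^{(j)} Z^{(j)\top} = K_{s_j,s} M^{\top} M K_{s,s_j} = K_{s_j,s} B K_{s,s_j} .
\end{equation*}
```

Under these assumptions, minimizing the trace corresponds to minimizing the RKHS norm, each equality constraint asks the Gram matrix of a batch to match its per-batch optimum, and the rank of B is at most K because M has K rows.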

3.2. INTERPRETING THE INDUCED KERNEL

The top eigenvectors of the adjacency matrix form the representations in the SSL tasks studied here, consistent with related approaches in kernel methods and spectral graph theory (Zhao & Liu, 2007; Belkin & Niyogi, 2004). More generally, as a loose rule, the induced kernel will correlate points that are close in the kernel space or related by augmentations in the SSL dataset and decorrelate points otherwise. Stated in the framework of Wang & Isola (2020), the induced kernel increases alignment by enforcing correlation between related datapoints in the training set and achieves uniformity by setting elements of the representation space to be orthogonal eigenvectors over the dataset.

As an example, note that in the contrastive setting (Equation (7)), if one calculates the induced kernel $k^*(x_i, x_j)$ between two points in the SSL dataset indexed by $i$ and $j$ that are related by an augmentation (i.e., $A_{ij} = 1$), then the kernel between these two points is $k^*(x_i, x_j) = 1$. More generally, if the two inputs to the induced kernel are close in kernel space to different points in the SSL dataset that are known to be related by $A$, then the kernel value will be close to 1. We formalize this intuition below for the standard setting with pairwise augmentations.

Proposition 3.2. Given a kernel function $k(\cdot, \cdot)$ with corresponding map $\Phi(\cdot)$ into the RKHS $\mathcal{H}$, let $\{x_1, x_2, \dots, x_N\}$ be an SSL dataset normalized such that $k(x_i, x_i) = 1$ and formed by pairwise augmentations (i.e., every element has exactly one neighbor in $A$) with kernel matrix $K_{s,s}$. Given two points $x$ and $x'$, if there exist two points in the SSL dataset indexed by $i$ and $j$ which are related by an augmentation ($A_{ij} = 1$) and

$$\|\Phi(x) - \Phi(x_i)\|_{\mathcal{H}} \leq \frac{\Delta}{5 \|K_{s,s}^{-1}\| \sqrt{N}} \quad \text{and} \quad \|\Phi(x') - \Phi(x_j)\|_{\mathcal{H}} \leq \frac{\Delta}{5 \|K_{s,s}^{-1}\| \sqrt{N}},$$

then the induced kernel for the contrastive loss satisfies $k^*(x, x') \geq 1 - \Delta$.

We prove the above statement in Appendix B.5.
The bounds in the above statement, which depend on the number of datapoints $N$ and the kernel matrix norm $\|K_{s,s}^{-1}\|$, are not very tight and are solely meant to provide intuition for the properties of the induced kernel. In more realistic settings, strong correlations will be observed under much weaker assumptions. In light of this, we visualize the induced kernel values and their relations to the original kernel function in Section 4.1 on a simple 2-dimensional spiral dataset. There, it is readily observed that the induced kernel better connects points along the data manifold that are related by the adjacency matrix.

3.3. DOWNSTREAM TASKS

In downstream tasks, one can apply the induced kernels directly in supervised algorithms such as kernel regression or SVM. Alternatively, one can also extract representations directly by obtaining the representation as $k^*(x, \cdot) = M k_{s,x}$ as shown in Proposition 3.1 and employ any learning algorithm on these features. As an example, in kernel regression, we are given a dataset of size $N_t$ consisting of input-output pairs $\{x_i^{(t)}, y_i\}$ and aim to train a linear model to minimize the mean squared error loss of the outputs (Williams & Rasmussen, 2006). The optimal solution using an induced kernel $k^*(\cdot, \cdot)$ takes the form:

$$f^*(x) = \left[ k^*(x, x_1^{(t)}), k^*(x, x_2^{(t)}), \dots, k^*(x, x_{N_t}^{(t)}) \right] \cdot \left( K^*_{t,t} \right)^{-1} y,$$

where $K^*_{t,t}$ is the kernel matrix of the supervised training dataset with entry $i, j$ equal to $k^*(x_i^{(t)}, x_j^{(t)})$ and $y$ is the concatenation of the targets as a vector. Note that since kernel methods generally have complexity that scales quadratically with the number of datapoints, such algorithms may be infeasible in large-scale learning tasks unless modifications are made.

Figure 2: The induced kernel is computed based on Equation (4), and the graph Laplacian matrix is derived from the inner product neighborhood in the RBF kernel space, i.e., using the neighborhoods as data augmentation. (a) We plot three randomly chosen points' kernel entries with respect to the other points on the manifolds. When the neighborhood augmentation range used to construct the Laplacian matrix is small enough, the SSL-induced kernel faithfully learns the topology of the entangled spiral manifolds. (b) When the neighborhood augmentation range used to construct the Laplacian matrix is too large, it creates a "short-circuit" effect in the induced kernel space. Each subplot in the second row is normalized by its largest absolute value for better contrast.

A natural question is when and why one should prefer the induced kernel of SSL to a kernel used in the standard supervised setting, perhaps including data augmentation. Kernel methods generally fit a dataset perfectly, so an answer to the question more likely arises from studying generalization. In kernel methods, generalization error typically tracks with the norm of the classifier, captured by the complexity quantity $s_N(K)$ defined as (Mohri et al., 2018; Steinwart & Christmann, 2008):

$$s_N(K) = \frac{\mathrm{Tr}(K)}{N} \, y^\top K^{-1} y,$$

where $y$ is a vector of targets and $K$ is the kernel matrix of the supervised dataset. For example, the generalization gap of an SVM algorithm can be bounded with high probability by $O\left(\sqrt{s_N(K)/N}\right)$ (see example proof in Appendix C.1) (Meir & Zhang, 2003; Huang et al., 2021). For kernel functions $k(\cdot, \cdot)$ bounded in output between 0 and 1, the quantity $s_N(K)$ is minimized in binary classification when $k(x, x') = 1$ for $x, x'$ drawn from the same class and $k(x, x') = 0$ for $x, x'$ drawn from distinct classes. If the induced kernel works ideally, in the sense that it better correlates points within a class and decorrelates points otherwise, then the entries of the kernel matrix approach these optimal values. This intuition is also supported by the hypothesis that self-supervised and semi-supervised algorithms perform well by connecting the representations of points on a common data manifold (HaoChen et al., 2021; Belkin & Niyogi, 2004). To formalize this somewhat, consider such an ideal, yet fabricated, setting where the SSL induced kernel has complexity $s_N(K^*)$ that does not grow with the dataset size.

Proposition 3.3 (Ideal SSL outcome).
Given a supervised dataset of $N$ points for binary classification drawn from a distribution with $m_{-1}$ and $m_{+1}$ connected manifolds for classes with labels $-1$ and $+1$ respectively, if the induced kernel matrix of the dataset $K^*$ successfully separates the manifolds such that $k^*(x, x') = 1$ if $x, x'$ are in the same manifold and $k^*(x, x') = 0$ otherwise, then $s_N(K^*) = m_{-1} + m_{+1} = O(1)$.

The simple proof of the above is in Appendix C. In short, we conjecture that SSL should be preferred in settings where the relationships between datapoints are "strong" enough to connect similar points in a class on the same manifold. We analyze the quantity $s_N(K)$ in Appendix D.4 to add further empirical evidence behind this hypothesis.
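The complexity quantity and the ideal outcome of Proposition 3.3 can be checked numerically on a toy labeling. This is a sketch under one explicit assumption: because the ideal kernel matrix is singular (it has one all-ones block per manifold), the inverse is replaced by the Moore-Penrose pseudo-inverse. The helper names and the manifold assignment are hypothetical.

```python
import numpy as np

def complexity(K, y):
    """s_N(K) = Tr(K)/N * y^T K^+ y, with a pseudo-inverse for singular K."""
    N = K.shape[0]
    return np.trace(K) / N * float(y @ np.linalg.pinv(K) @ y)

def ideal_kernel(manifold_ids):
    """k*(x, x') = 1 if x and x' lie on the same manifold, 0 otherwise."""
    ids = np.asarray(manifold_ids)
    return (ids[:, None] == ids[None, :]).astype(float)

# Nine points on three manifolds: two manifolds carry label -1 and one
# carries label +1, so m_{-1} = 2 and m_{+1} = 1.
manifold_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
labels_of_manifold = np.array([-1.0, -1.0, 1.0])
y = labels_of_manifold[manifold_ids]

s_ideal = complexity(ideal_kernel(manifold_ids), y)   # approx. 3 = m_{-1} + m_{+1}
s_identity = complexity(np.eye(len(y)), y)            # approx. 9 = N
```

The identity kernel, which treats every point as unrelated to every other, gives a complexity equal to the dataset size, while the ideal kernel gives the number of manifolds regardless of how many points lie on each.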

4. EXPERIMENTS

In this section, we empirically investigate the performance and properties of the SSL kernel methods on a toy spiral dataset and portions of the MNIST and EMNIST datasets for hand-drawn digits and characters (Cohen et al., 2017). As with other works, we focus on small-data tasks where kernel methods can be performed efficiently without modifications needed for handling large datasets (Arora et al., 2019a; Fernández-Delgado et al., 2014). For simplicity and ease of analysis, we perform experiments here with respect to the RBF kernel. Additional experiments reinforcing these findings and also including analysis with neural tangent kernels can be found in Appendix D.

4.1. VISUALIZING THE INDUCED KERNEL ON THE SPIRAL DATASET

For an intuitive understanding, we provide a visualization in Figure 2 to show how the SSL-induced kernel disentangles manifolds in the representation space. In Figure 2, we study two entangled 1-D spiral manifolds in a 2D space with 200 training points uniformly distributed on the spiral manifolds. We use the non-contrastive SSL-induced kernel, following Equation (4), to demonstrate this result, whereas a contrastive SSL-induced kernel is qualitatively similar and left to Appendix D.1. In the RBF kernel space shown in the first row of Figure 2, the value of the kernel is captured purely by distance.

Figure 3 (panels: contrastive-MNIST, contrastive-EMNIST, non-contrastive-MNIST, non-contrastive-EMNIST): Depiction of MNIST and EMNIST full test set performances using the contrastive and non-contrastive kernels (in red), benchmarked against the supervised case with labels on all samples (original + augmented) in black and with labels only on the original samples in blue, with the number of original samples given by N ∈ {16, 64, 256} (each column) and the number of augmented samples (in log2) on the x-axis. The first row corresponds to Gaussian blur data-augmentation (poorly aligned with the data distributions) and the second row corresponds to random rotations (-10,10), translations (-0.1,0.1) and scaling (0.9,1.1). We set the SVM ℓ2 regularization to 0.001 and use the RBF kernel (NTK kernels in Appendix D.3). In all cases, the kernel representation dimensions are unconstrained. Two key observations emerge. First, whenever the augmentation is not aligned with the data distribution, the SSL kernels fall below the supervised case, especially as N increases. Second, when the augmentation is aligned with the data distribution, both SSL kernels are able to get close to and even outperform the supervised benchmark with augmented labels.
Next, to consider the SSL setting, we construct the graph Laplacian matrix $L$ by connecting vertices between any pair of training points with $k(x_i, x_j) > d$, i.e., for $i \neq j$, $L_{ij} = -1$ if $k(x_i, x_j) > d$ and $L_{ij} = 0$ otherwise. The diagonal entries of $L$ are equal to the degrees of the corresponding vertices. This construction can be viewed as using the Euclidean neighborhoods of each point $x$ as the data augmentation. We choose $\beta = 0.4$; other choices within a reasonable range lead to similar qualitative results. In the second row of Figure 2, we show the induced kernel between selected points (marked with an x) and other training points in the SSL-induced kernel space. When $d$ is chosen properly, as observed in the second row of Figure 2(a), the SSL-induced kernel faithfully captures the topology of the manifolds. However, the augmentation has to be chosen carefully: Figure 2(b) shows that when the neighborhood range is too large, the two manifolds become mixed in the representation space.
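The thresholded-Laplacian construction described above can be sketched as follows. The spiral parametrization, the kernel bandwidth, and the two neighborhood ranges are illustrative guesses rather than the settings used for Figure 2; `two_spirals`, `threshold_laplacian`, and `cross_arm_edges` are hypothetical helper names.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def two_spirals(n_per_arm=100, turns=1.5):
    """Two interleaved spiral arms; the second is the first rotated by 180 degrees."""
    t = np.linspace(0.25, 1.0, n_per_arm) * turns * 2.0 * np.pi
    arm = np.stack([t * np.cos(t), t * np.sin(t)], axis=1) / (turns * 2.0 * np.pi)
    return np.concatenate([arm, -arm]), np.repeat([0, 1], n_per_arm)

def threshold_laplacian(K, d):
    """L_ij = -1 if k(x_i, x_j) > d for i != j; L_ii = degree of vertex i."""
    adj = (K > d).astype(float)
    np.fill_diagonal(adj, 0.0)
    return np.diag(adj.sum(axis=1)) - adj

def cross_arm_edges(L, arm_ids):
    """Count edges that connect points lying on different spiral arms."""
    adj = -(L - np.diag(np.diag(L)))
    different = arm_ids[:, None] != arm_ids[None, :]
    return int(adj[different].sum() // 2)

X, arm_ids = two_spirals()                 # 200 points, 100 per arm
K = rbf_kernel(X, X, gamma=1.0)

# With gamma = 1, k(x, x') > exp(-r^2) is the same as ||x - x'|| < r.
d_small_range = np.exp(-0.15**2)           # neighbors within distance 0.15
d_large_range = np.exp(-0.50**2)           # neighbors within distance 0.50
L_small = threshold_laplacian(K, d_small_range)
L_large = threshold_laplacian(K, d_large_range)
```

In this toy geometry the small range only links neighbors along the same arm, while the large range also links points across arms, which is the kind of "short-circuit" the text warns about. Either Laplacian can then be used in the non-contrastive induced kernel of Equation (4).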

4.2. CLASSIFICATION EXPERIMENTS

We explore in Figure 3 the supervised classification setting of MNIST and EMNIST, which consist of 28 × 28 grayscale images. (E)MNIST provides a strong baseline to evaluate kernel methods because the absence of background in the images makes kernels such as RBF better aligned with input similarity. In this setting, we explore two different data-augmentation (DA) policies, one aligned with the data distribution (rotation+translation+scaling) and one largely misaligned with the data distribution (aggressive Gaussian blur). Because our goal is to understand how much DA impacts the SSL kernel compared to a fully supervised benchmark, we consider two (supervised) benchmarks: one that employs the labels of the sampled training set and all the augmented samples, and one that only employs the sampled training set and no augmented samples. We explore small training set sizes from N = 16 to N = 256, and for each case we produce a number of augmented samples for each datapoint so that the total number of samples does not exceed 50,000, which is a standard threshold for kernel methods. We observe in Figure 3 that the SSL kernel is able to match and even outperform the fully supervised case when employing the correct data-augmentation, while with the incorrect data-augmentation, the performance does not even match the supervised case that did not see the augmented samples.

To better understand the impact of different hyperparameters on the two kernels, we also study in Figure 4 the MNIST test set performances when varying the representation dimension K, the SVM's ℓ2 regularization parameter, and the non-contrastive kernel's β parameter. We observe that although β is an additional hyperparameter to tune, its tuning plays a similar role to K, the representation dimension. Hence, in practice, the design of the non-contrastive model can be controlled with a single parameter, as in the contrastive setting.
Interestingly, we also observe that K is a preferable parameter to tune to prevent over-fitting, as opposed to the SVM's ℓ2 regularizer.

5. DISCUSSION

Our work explores the properties of SSL algorithms when trained via kernel methods. Connections between kernel methods and neural networks have gained significant interest in the supervised learning setting (Neal, 1996; Lee et al., 2017) for their potential insights into the training of deep networks. As we show in this study, insights into the training properties of SSL algorithms can similarly be gained from an analysis of SSL algorithms in the kernel regime. Our theoretical and empirical analysis, for example, highlights the importance of the choice of augmentations and encoded relationships between data points for downstream performance. Looking forward, we believe that the connections between this kernel regime and the deep networks trained in practice can be strengthened, particularly by analyzing the neural tangent kernel. In line with similar analysis in the supervised setting (Yang et al., 2022; Seleznova & Kutyniok, 2022; Lee et al., 2020), neural tangent kernels and their corresponding induced kernels in the SSL setting may shed light on some of the theoretical properties of the finite width networks used in practice.

A.1 JOINT EMBEDDING SSL ALGORITHMS IN PRACTICE

Joint embedding approaches form representations by comparing representations of jointly chosen inputs that have known relations. These methods can also hold out portions of the data, but this is not a requirement. Any joint-embedding SSL algorithm requires a properly chosen loss function and access to a set of observations together with known pairwise positive relations between those observations. Methods are denoted as non-contrastive if the loss function depends only on pairs that are related (Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021). One common method using the VICReg loss (Bardes et al., 2021), for example, takes the form

$$L_{\mathrm{vic}} = \alpha \sum_{k=1}^{K} \max\left(0,\, 1 - \mathrm{Cov}(Z)_{k,k}\right) + \beta \sum_{j=1,\, j \neq k}^{K} \mathrm{Cov}(Z)_{k,j}^{2} + \frac{\gamma}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} (A)_{i,j} \left\| Z_{i,\cdot} - Z_{j,\cdot} \right\|_2^2 .$$

We adapt the above for the non-contrastive loss we study in our work. Contrastive methods also penalize similarities of representations that are not related. Popular algorithms include SimCLR (Chen et al., 2020), SwAV (Caron et al., 2020), NNCLR (Dwibedi et al., 2021), contrastive predictive coding (Oord et al., 2018), and many others. The spectral contrastive loss, for example, takes the form (HaoChen et al., 2021):

$$L_{\mathrm{sc}} = -2\, \mathbb{E}_{x, x^{+}}\left[ f(x)^{\top} f(x^{+}) \right] + \mathbb{E}_{x, x^{-}}\left[ \left( f(x)^{\top} f(x^{-}) \right)^{2} \right],$$

where $x, x^{+}$ are positive pairs and $x, x^{-}$ are negative pairs. We consider a variant of the spectral contrastive loss in our work.
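To make the two losses above concrete, the following is a minimal NumPy sketch that evaluates them on finite batches. The function names are hypothetical, the use of `np.cov` (which normalizes by N − 1) for Cov(Z) is an assumption of this sketch, and the off-diagonal covariance penalty is summed over all pairs k ≠ j; the expectations in the spectral contrastive loss are replaced by batch averages.

```python
import numpy as np


def vicreg_style_loss(Z, A, alpha=1.0, beta=1.0, gamma=1.0):
    """VICReg-style loss for representations Z of shape (N, K).

    A is the (N, N) adjacency matrix of positive pairs.
    Assumption: Cov(Z) is estimated with np.cov (N - 1 normalization).
    """
    n = Z.shape[0]
    cov = np.atleast_2d(np.cov(Z, rowvar=False))
    # Variance term: hinge on the diagonal of the covariance.
    variance_term = np.maximum(0.0, 1.0 - np.diag(cov)).sum()
    # Covariance term: squared off-diagonal entries.
    off_diagonal = cov - np.diag(np.diag(cov))
    covariance_term = (off_diagonal ** 2).sum()
    # Invariance term: squared distances between related representations.
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    invariance_term = (A * sq_dists).sum() / n
    return alpha * variance_term + beta * covariance_term + gamma * invariance_term


def spectral_contrastive_loss(f_x, f_pos, f_neg):
    """Batch estimate of the spectral contrastive loss; inputs have shape (B, K)."""
    positive = (f_x * f_pos).sum(axis=1).mean()
    negative = ((f_x * f_neg).sum(axis=1) ** 2).mean()
    return -2.0 * positive + negative
```

Representations with identity covariance whose positive pairs coincide drive the VICReg-style loss to zero, which is a quick sanity check on the three terms.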

A.2 THEORETICAL STUDIES OF SSL

In tandem with the success of SSL in deep learning, a host of theoretical tools have been developed to help understand how SSL algorithms learn (Arora et al., 2019b; Balestriero & LeCun, 2022; HaoChen et al., 2022; Lee et al., 2021) . Findings are often connected to the underlying graph connecting the data distribution (Wei et al., 2020; HaoChen et al., 2021) or the choice of augmentation (Wen & Li, 2021) . Wang & Isola (2020) study the theoretical properties of good SSL algorithms showing that alignment and uniformity are two important criteria for successfully building representations. We also note that prior work on SSL has analyzed the importance of the choice of augmentation in building positive pairs (Zheng et al., 2021; Von Kügelgen et al., 2021) . Building representations from known relationships between datapoints is also studied in spectral graph theory (Chung, 1997). We employ findings from this body of literature to provide intuition behind the properties of the algorithms discussed here.

A.3 KERNEL METHODS

Kernel methods have been connected to properties of graphs and manifolds in a long line of work. Smola & Kondor (2003) and Vishwanathan et al. (2010), among others, introduce a family of kernels on graphs which offer favorable regularization properties for their corresponding RKHS. Similar methods have been proposed for semi-supervised learning via graph-based methods (Zhu, 2005; Hofmann et al., 2008) and manifold regularization (Belkin et al., 2006). These methods generally introduce terms into a loss function which regularize the solution depending on the geometric structure of the data. Properties of the features have also been tied to results in spectral graph theory (Zhao & Liu, 2007; 2012; Dong et al., 2012), where the top features generally correspond to the top eigenvalues of the graph Laplacian. More recently, kernel methods have been employed to more efficiently regularize data based on the Laplacian and improve semi-supervised learning algorithms (Cabannes et al., 2021; Cabannes, 2022). As stated in the main text, perhaps the closest lines of work are related to kernel and metric learning (Bellet et al., 2013; Yang & Jin, 2006). In kernel learning, prior works have proposed constructing a kernel via a learning procedure, e.g., via convex combinations of kernels (Cortes et al., 2010), kernel alignment (Cristianini et al., 2001), and unsupervised kernel learning to match local data geometry (Zhuang et al., 2011). Prior work in distance metric learning using kernel methods aims to produce representations of data in unsupervised or semi-supervised settings by taking advantage of links between data points. For example, Baghshah & Shouraki (2010) and Hoi et al. (2007) learn to construct a kernel by optimizing distances between points embedded in Hilbert space according to a similarity and dissimilarity matrix. Yeung & Chang (2007) perform kernel distance metric learning in a semi-supervised setting where pairwise relations between data and labels are provided.
Xia et al. (2013) propose an online procedure to learn a kernel which maps similar points closer to each other than dissimilar points. Many of these works also use semi-definite programs to find the optimal kernel.

Let $x_i \in \mathcal{X}$ for $i \in [N]$ be elements of a dataset of size $N$, $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function with corresponding map $\Phi : \mathcal{X} \to \mathcal{H}$ into the RKHS $\mathcal{H}$, and $r : [0, \infty) \to \mathbb{R}$ be any strictly increasing function. Let $W : \mathcal{H} \to \mathbb{R}^{K}$ be a linear function mapping inputs to their corresponding representations. Given a regularized loss function of the form

$$R\left(W\Phi(x_1), \dots, W\Phi(x_N)\right) + r\left(\|W\|_{\mathcal{H}}\right),$$

where $R(W\Phi(x_1), \dots, W\Phi(x_N))$ is an error function that depends on the representations of the dataset, the minimizer $W^*$ of this loss function will be in the span of the training points $\{\Phi(x_i),\ i \in \{1, \dots, N\}\}$, i.e., for any $\phi \in \mathcal{H}$: $W^*\phi = 0$ if $\langle \phi, \Phi(x_i) \rangle_{\mathcal{H}} = 0$ for all $i \in [N]$.

Proof. Decompose $W = W_{\parallel} + W_{\perp}$, where $W_{\perp}\Phi(x_i) = 0$ for all $i \in [N]$. For a loss function $\mathcal{L}$ of the form listed above, we have

$$\begin{aligned}
\mathcal{L}(W_{\parallel} + W_{\perp}) &= R\left((W_{\parallel} + W_{\perp})\Phi(x_1), \dots, (W_{\parallel} + W_{\perp})\Phi(x_N)\right) + r\left(\|W_{\parallel} + W_{\perp}\|_{\mathcal{H}}\right) \\
&= R\left(W_{\parallel}\Phi(x_1), \dots, W_{\parallel}\Phi(x_N)\right) + r\left(\|W_{\parallel} + W_{\perp}\|_{\mathcal{H}}\right),
\end{aligned}$$

where, in the error term, we used the property that $W_{\perp}$ vanishes on the span of the data. For the regularizer term, we note that

$$r\left(\|W_{\parallel} + W_{\perp}\|_{\mathcal{H}}\right) = r\left(\sqrt{\|W_{\parallel}\|_{\mathcal{H}}^{2} + \|W_{\perp}\|_{\mathcal{H}}^{2}}\right) \geq r\left(\|W_{\parallel}\|_{\mathcal{H}}\right).$$

Therefore, enforcing $W_{\perp} = 0$ minimizes the regularizer while leaving the rest of the cost function unchanged. As a consequence, all optimal solutions must have support over the span of the data. This directly results in the statement shown in Proposition 3.1.

Remark B.2. The representer theorem renders optimization tractable over a finite dataset for a given SSL loss function (Hofmann et al., 2008; Soman et al., 2009). In certain cases, as we show in the main text, such optimization problems have closed-form solutions.
Nevertheless, approximate optimization methods can be used to find solutions for loss functions where closed-form solutions do not exist or are not expected to exist. Many optimizers exist for this purpose, including approximations that potentially allow for the incorporation of larger datasets (Hofmann et al., 2008; Deisenroth & Ng, 2015; Jain et al., 2012; Menon, 2009; Rasmussen, 2003).
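The decomposition argument in the proof above can be illustrated numerically in a finite-dimensional feature space. The sketch below is only an illustration under stated assumptions: the features are random Gaussian vectors with D > N, the target representations Z are arbitrary, and all variable names are hypothetical. It checks that adding a component orthogonal to the data leaves the representations unchanged while increasing the norm.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 5, 12, 3                 # dataset size, feature dimension, representation dimension
X = rng.normal(size=(N, D))        # row i plays the role of Phi(x_i)
Z = rng.normal(size=(N, K))        # arbitrary target representations

X_pinv = np.linalg.pinv(X)         # shape (D, N)
W_par = (X_pinv @ Z).T             # shape (K, D); rows lie in the span of the data
projector = X_pinv @ X             # orthogonal projector onto the span of the data
# A component that annihilates every data point.
W_perp = rng.normal(size=(K, D)) @ (np.eye(D) - projector)
W_full = W_par + W_perp

same_outputs = np.allclose(W_full @ X.T, W_par @ X.T)
norm_par = np.linalg.norm(W_par)
norm_full = np.linalg.norm(W_full)
```

Here `same_outputs` is true and `norm_full` exceeds `norm_par`, so any strictly increasing regularizer prefers `W_par`.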

B.2 CLOSED FORM NON-CONTRASTIVE LOSS

Throughout this section, for simplicity, we define $M := I - \frac{1}{N}\mathbf{1}\mathbf{1}^{\top}$. With slight abuse of notation, we also denote by $X \in \mathbb{R}^{N \times D}$ the matrix whose $i$-th row contains the features of $\Phi(x_i)$:

$$X = \begin{pmatrix} -\ \Phi(x_1)^{\top}\ - \\ -\ \Phi(x_2)^{\top}\ - \\ \vdots \\ -\ \Phi(x_N)^{\top}\ - \end{pmatrix}. \tag{20}$$

Note that if the RKHS is infinite dimensional, one can apply the general form of the solution as shown in Proposition 3.1 to reframe the problem into the finite-dimensional setting below. As a reminder, we aim to minimize the VICReg cost function:

$$C^* = \min_{W \in \mathbb{R}^{K \times D}} \left\| W X^{\top} M X W^{\top} - I \right\|_F^2 + \beta\, \mathrm{Tr}\left[ W X^{\top} L X W^{\top} \right]. \tag{21}$$

By applying the definition of the Frobenius norm and some algebra, we obtain:

$$\begin{aligned}
C(W) &= \left\| W X^{\top} M X W^{\top} - I \right\|_F^2 + \beta\, \mathrm{Tr}\left[ W X^{\top} L X W^{\top} \right] \\
&= \left\| W X^{\top} M X W^{\top} - I \right\|_F^2 + \mathrm{Tr}\left[ W X^{\top} (\beta L + 2M - 2M) X W^{\top} \right] \\
&= K + \left\| W X^{\top} M X W^{\top} \right\|_F^2 - 2\, \mathrm{Tr}\left[ W X^{\top} M X W^{\top} \right] + \mathrm{Tr}\left[ W X^{\top} (\beta L + 2M - 2M) X W^{\top} \right] \\
&= K + \left\| W X^{\top} M X W^{\top} \right\|_F^2 - \mathrm{Tr}\left[ W X^{\top} (2M - \beta L) X W^{\top} \right] \\
&= K + \left\| X W^{\top} W X^{\top} M \right\|_F^2 - \mathrm{Tr}\left[ X W^{\top} W X^{\top} M (2M - \beta L) \right].
\end{aligned} \tag{22}$$

The above formulation has the same optimum as the following optimization problem, defined as $C'(W)$:

$$C'(W) = \left\| X W^{\top} W X^{\top} M - (M - \beta L / 2) \right\|_F^2. \tag{23}$$

Since $M$ is a projector and $M(2M - \beta L) = 2M - \beta L$ (the all-ones vector is in the kernel of $L$), we can solve the above by employing the Eckart-Young theorem and matching the $K$-dimensional eigenspace of $X W^{\top} W X^{\top} M$ with the top-$K$ eigenspace of $M - \beta L / 2$. One must be careful in choosing this optimum, as $W X^{\top} M X W^{\top}$ can only take non-negative eigenvalues. Therefore, this is achieved by choosing the optimal $W^*$ to project the data onto this eigenspace as

$$W^* = \left( X^{(s)+}\, C_{:, \leq K}\, (I - D)^{1/2}_{\leq K, \leq K} \right)^{\top}, \tag{24}$$

where we set the eigendecomposition $C D C^{\top} = \frac{1}{N}\mathbf{1}\mathbf{1}^{\top} + \frac{\beta}{2} L$ and $C_{:, \leq K}$ is the matrix consisting of the first $K$ columns of $C$. Similarly, $(I - D)^{1/2}_{\leq K, \leq K}$ denotes the top-left $K \times K$ block of $(I - D)^{1/2}$. Also in the above, $X^{(s)+}$ denotes the pseudo-inverse of $X$.
If the diagonal matrix $I - D$ contains negative entries, which can only happen when $\beta$ is set to a large value, then the values of $(I - D)^{1/2}$ for those entries are undefined. Here, the optimal choice is to set those entries to zero. In practice, this can be avoided by setting $\beta$ to be no larger than two times the degree of the graph. Note that since $W$ only appears in the cost function in the form $W^{\top} W$, the solution above is only unique up to an orthogonal transformation. Furthermore, the rank of $X W^{\top} W X^{\top} M$ is at most $N - 1$, so a better optimum cannot be achieved by increasing the output dimension of the linear transformation beyond $N$. To see that this produces the optimal induced kernel, we simply plug in the optimal $W^*$:

$$\begin{aligned}
k^*(x, x') := (W^* x)^{\top} (W^* x') &= x^{\top} X^{(s)+}\, C_{:, \leq K}\, (I - D)^{1/2}_{\leq K, \leq K} (I - D)^{1/2}_{\leq K, \leq K}\, C_{:, \leq K}^{\top} \left( X^{(s)+} \right)^{\top} x' \\
&= k_{x,s}\, K_{s,s}^{-1}\, C_{:, \leq K}\, (I - D)_{\leq K, \leq K}\, C_{:, \leq K}^{\top}\, K_{s,s}^{-1}\, k_{s,x'}.
\end{aligned} \tag{25}$$

Now, it needs to be shown that $W^*$ is the unique minimum-norm optimizer of the optimization problem. To show this, we analyze the following semi-definite program, which is equivalent since the cost function is over positive semi-definite matrices of the form $W^{\top} W$:

$$\min_{B \in \mathbb{R}^{D \times D}} \mathrm{Tr}(B) \quad \text{s.t.} \quad X B X^{\top} M = C_{:, \leq K}\, (I - D)^{1/2}_{\leq K, \leq K}\, C_{:, \leq K}^{\top} := P_K, \quad B \succeq 0. \tag{26}$$

To lighten notation, we denote $P_K = C_{:, \leq K}\, (I - D)^{1/2}_{\leq K, \leq K}\, C_{:, \leq K}^{\top}$ and simply write $X^{+}$ for $X^{(s)+}$. The above can easily be derived by setting $B = W^{\top} W$ in Equation (23). This has the corresponding dual

$$\max_{Y \in \mathbb{R}^{N \times N}} \mathrm{Tr}(Y^{\top} P_K) \quad \text{s.t.} \quad I - X^{\top} Y M X \succeq 0. \tag{27}$$

The optimal primal can be obtained from $W^*$ as $B^* = (W^*)^{\top} W^* = X^{+} P_K (X^{+})^{\top}$. The optimal dual can be similarly calculated and is equal to $Y^* = (X^{+})^{\top} X^{+}$. A straightforward calculation shows that the optimal values of the primal and dual formulations are equal for the given solutions.
We now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):

$$\begin{aligned}
&\text{Primal feasibility:} && X B^* X^{\top} M = P_K, \quad B^* \succeq 0 \\
&\text{Dual feasibility:} && I - X^{\top} Y^* M X \succeq 0 \\
&\text{Complementary slackness:} && \left( I - X^{\top} Y^* M X \right) B^* = 0.
\end{aligned} \tag{28}$$

The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that

$$\begin{aligned}
\left( I - X^{\top} Y^* M X \right) B^* &= X^{+} P_K (X^{+})^{\top} - X^{\top} (X^{+})^{\top} X^{+} M X X^{+} P_K (X^{+})^{\top} \\
&= X^{+} P_K (X^{+})^{\top} - X^{+} M P_K (X^{+})^{\top} \\
&= X^{+} P_K (X^{+})^{\top} - X^{+} P_K (X^{+})^{\top} \\
&= 0.
\end{aligned} \tag{29}$$

In the above, we used the fact that $M P_K = P_K$, since $M$ is a projector and $P_K$ is unchanged by that projection. This completes the proof of optimality.
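As an illustration of the closed form above, the following NumPy sketch builds the matrix $C_{:, \leq K} (I - D)_{\leq K, \leq K} C_{:, \leq K}^{\top}$ from Equation (25) and evaluates the induced kernel. It rests on two assumptions that are ours rather than the text's: "first K" is read as the K smallest eigenvalues of $D$ (equivalently, the top-K eigenspace of $M - \beta L / 2$), and the graph Laplacian is the unnormalized $L = \mathrm{Deg} - A$. Function names are hypothetical.

```python
import numpy as np


def noncontrastive_induced_gram(A, beta, K):
    """Return C[:, :K] (I - D)[:K, :K] C[:, :K]^T for adjacency matrix A."""
    n = A.shape[0]
    laplacian = np.diag(A.sum(axis=1)) - A        # assumption: unnormalized Laplacian
    S = np.ones((n, n)) / n + 0.5 * beta * laplacian
    D, C = np.linalg.eigh(S)                      # eigenvalues in ascending order
    # Negative entries of I - D are set to zero, as described in the text.
    weights = np.clip(1.0 - D[:K], 0.0, None)
    C_K = C[:, :K]
    return (C_K * weights) @ C_K.T


def noncontrastive_induced_kernel(k_sx, K_ss, G, k_sxp):
    """Evaluate k_{x,s} K_ss^{-1} G K_ss^{-1} k_{s,x'} for kernel vectors of length N."""
    left = np.linalg.solve(K_ss, k_sx)            # K_ss^{-1} k_{s,x}
    right = np.linalg.solve(K_ss, k_sxp)          # K_ss^{-1} k_{s,x'}
    return float(left @ G @ right)
```

For four points forming two augmentation pairs, with `beta = 0.4` and `K = 1`, the resulting matrix assigns a positive value to each augmented pair and a negative value to points from different pairs.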

B.3 CONTRASTIVE LOSS

For the contrastive loss, we follow a similar approach as above to find the minimum-norm solution that obtains the optimal representation. Note that the loss function contains the term $\left\| X W^{\top} W X^{\top} - (I + A) \right\|_F^2$. Since $X W^{\top} W X^{\top}$ is positive semi-definite, this is optimized when $X W^{\top} W X^{\top}$ matches the positive eigenspace of $(I + A)$, defined as $(I + A)_{+}$. Enumerating the eigenvectors of $A$ as $v_i$ with corresponding eigenvalues $e_i$, we have

$$(I + A)_{+} = \sum_{i :\, e_i \geq -1} (e_i + 1)\, v_i v_i^{\dagger}.$$

More generally, if the dimension of the representation is restricted such that $K < N$, then we abuse notation and define

$$(I + A)_{+} = \sum_{i=1}^{K} \max(e_i + 1, 0)\, v_i v_i^{\dagger},$$

where the $e_i$ are sorted in descending order. To find the minimum RKHS norm solution, we have to solve an SDP similar to Equation (26):

$$\min_{B \in \mathbb{R}^{D \times D}} \mathrm{Tr}(B) \quad \text{s.t.} \quad X B X^{\top} = (I + A)_{+}, \quad B \succeq 0.$$

This has the corresponding dual

$$\max_{Y \in \mathbb{R}^{N \times N}} \mathrm{Tr}\left( Y^{\top} (I + A)_{+} \right) \quad \text{s.t.} \quad I - X^{\top} Y X \succeq 0. \tag{32}$$

The optimal primal is $B^* = X^{+} (I + A)_{+} (X^{+})^{\top}$. The optimal dual is equal to $Y^* = (X^{+})^{\top} X^{+}$. Directly plugging these in shows that the optimal values of the primal and dual formulations are equal for the given solutions. As before, we now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):

$$\begin{aligned}
&\text{Primal feasibility:} && X B^* X^{\top} = (I + A)_{+}, \quad B^* \succeq 0 \\
&\text{Dual feasibility:} && I - X^{\top} Y^* X \succeq 0 \\
&\text{Complementary slackness:} && \left( I - X^{\top} Y^* X \right) B^* = 0.
\end{aligned} \tag{33}$$

The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that

$$\begin{aligned}
\left( I - X^{\top} Y^* X \right) B^* &= X^{+} (I + A)_{+} (X^{+})^{\top} - X^{\top} (X^{+})^{\top} X^{+} X X^{+} (I + A)_{+} (X^{+})^{\top} \\
&= X^{+} (I + A)_{+} (X^{+})^{\top} - X^{+} (I + A)_{+} (X^{+})^{\top} \\
&= 0.
\end{aligned}$$

In the above, we used the fact that $X^{+} X$ is a projection onto the row space of $X$. This completes the proof of optimality.
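The contrastive construction can be sketched in a few lines of NumPy. The example below is a minimal illustration, assuming an RBF base kernel and the induced-kernel form $k_{x,s} K_{s,s}^{-1} (I + A)_{+} K_{s,s}^{-1} k_{s,x'}$; the function names and the bandwidth `sigma` are hypothetical choices. Note that the code diagonalizes $I + A$ directly, so its eigenvalues correspond to $e_i + 1$ in the notation above.

```python
import numpy as np


def rbf_kernel(X, Y, sigma=1.0):
    """RBF kernel matrix between the rows of X and the rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def positive_part(A, K):
    """Rank-K positive part of I + A: keep the K largest eigenvalues, clipped at zero."""
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eigh(np.eye(n) + A)
    order = np.argsort(eigvals)[::-1][:K]
    weights = np.clip(eigvals[order], 0.0, None)
    V = eigvecs[:, order]
    return (V * weights) @ V.T


def contrastive_induced_kernel(X_a, X_b, X_train, A, K, sigma=1.0):
    """Induced kernel between the rows of X_a and the rows of X_b."""
    K_ss = rbf_kernel(X_train, X_train, sigma)
    T = positive_part(A, K)
    left = np.linalg.solve(K_ss, rbf_kernel(X_train, X_a, sigma))   # K_ss^{-1} k_{s,x}
    right = np.linalg.solve(K_ss, rbf_kernel(X_train, X_b, sigma))  # K_ss^{-1} k_{s,x'}
    return left.T @ T @ right
```

Evaluated on the training points themselves with `K = N`, the induced kernel matrix reproduces $(I + A)_{+}$, which is the primal feasibility condition above.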

B.4 OPTIMIZATION VIA SEMI-DEFINITE PROGRAM

In general scenarios, Proposition 3.1 gives a prescription for calculating the optimal induced kernel for more complicated optimization tasks, since the optimal induced kernel $k^*(x, x')$ must be of the form

$$k^*(x, x') = k_{x,s}\, B\, k_{x',s}^{\top},$$

where $k_{x,s}$ is a row vector whose entry $i$ equals the kernel $k(x, x_i)$ between $x$ and the $i$-th datapoint, and $B \in \mathbb{R}^{N \times N}$ is a positive semi-definite matrix. For example, such a scenario arises when one wants to apply the loss function across $n_{\text{batches}}$ batches of data. To frame this as an optimization statement, assume we have $N$ datapoints split into batches of size $b$ each. We denote the $i$-th datapoint within batch $k$ as $x_i^{(k)}$. As before, $x_i$ denotes the $i$-th datapoint across the whole dataset. We define the following variables:

• $K_{s,s} \in \mathbb{R}^{N \times N}$ (kernel matrix over the complete dataset including all batches), where $[K_{s,s}]_{i,j} = k(x_i, x_j)$

• $K_{s,s_k} \in \mathbb{R}^{N \times b}$ (kernel matrix between the complete dataset and the dataset of batch $k$), where $[K_{s,s_k}]_{i,j} = k(x_i, x_j^{(k)})$; similarly, $K_{s_k,s} \in \mathbb{R}^{b \times N}$ is simply the transpose of $K_{s,s_k}$

• $A^{(k)}$ is the adjacency matrix for inputs in batch $k$, with corresponding graph Laplacian $L^{(k)}$

In what follows, we denote the representation dimension as $K$.

Non-contrastive loss function. In the non-contrastive setting, we consider a regularized version of the batched loss of Equation (3). Applying the reduction of the loss function in Equation (23), we consider the following loss function, where we want to find the minimizer $B$:

$$\mathcal{L} = \sum_{j=1}^{n_{\text{batches}}} \left\| K_{s_j,s}\, B\, K_{s,s_j} \left( I - \tfrac{1}{b}\mathbf{1}\mathbf{1}^{\top} \right) - \left( I - \tfrac{1}{b}\mathbf{1}\mathbf{1}^{\top} - \beta L^{(j)} / 2 \right) \right\|_F^2 + \alpha\, \mathrm{Tr}(B K_{s,s}),$$

where $\alpha \in \mathbb{R}_{+}$ is a hyperparameter. The term $\mathrm{Tr}(B K_{s,s})$ regularizes the RKHS norm of the resulting solution given by $B$. For simplicity, we denote as before $M = I - \frac{1}{b}\mathbf{1}\mathbf{1}^{\top}$. Taking the limit $\alpha \to 0$ and enforcing a representation of dimension $K$, the loss function above is minimized when we obtain the optimal representation, i.e., we must have that

$$K_{s_j,s}\, B\, K_{s,s_j}\, M = \left( M - \beta L^{(j)} / 2 \right)_{K},$$

where $\left( M - \beta L^{(j)} / 2 \right)_{K}$ denotes the projection of $M - \beta L^{(j)} / 2$ onto the eigenspace of the top $K$ positive singular values (this is the optimal representation as shown earlier). Therefore, we can find the optimal induced kernel by solving the optimization problem below:

$$\min_{B \in \mathbb{R}^{N \times N}} \mathrm{Tr}(B K_{s,s}) \quad \text{s.t.} \quad K_{s_i,s}\, B\, K_{s,s_i}\, M = \left( M - \beta L^{(i)} / 2 \right)_{K} \ \ \forall i \in \{1, 2, \dots, n_{\text{batches}}\}, \quad B \succeq 0, \quad \mathrm{rank}(B) = K.$$

Relaxing the constraint $\mathrm{rank}(B) = K$ forms an SDP which can be solved efficiently using existing SDP solvers (ApS, 2019; Sturm, 1999).

Contrastive loss. As shown in Section 3.1, the contrastive loss function takes the form

$$\mathcal{L} = \sum_{j=1}^{n_{\text{batches}}} \left\| K_{s_j,s}\, B\, K_{s,s_j} - \left( I + A^{(j)} \right) \right\|_F^2 + \alpha\, \mathrm{Tr}(B K_{s,s}),$$

where $\alpha \in \mathbb{R}_{+}$ is a weighting term for the regularizer. In this setting, the optimal representation of dimension $K$ satisfies $K_{s_j,s}\, B\, K_{s,s_j} = \left( I + A^{(j)} \right)_{K}$, where $\left( I + A^{(j)} \right)_{K}$ denotes the projection of $I + A^{(j)}$ onto the eigenspace of the top $K$ positive singular values (this is the optimal representation as shown earlier). Taking the limit $\alpha \to 0$, we have a similar optimization problem:

$$\min_{B \in \mathbb{R}^{N \times N}} \mathrm{Tr}(B K_{s,s}) \quad \text{s.t.} \quad K_{s_i,s}\, B\, K_{s,s_i} = \left( I + A^{(i)} \right)_{K} \ \ \forall i \in \{1, 2, \dots, n_{\text{batches}}\}, \quad B \succeq 0, \quad \mathrm{rank}(B) = K.$$

As before, relaxing the rank constraint results in an SDP.
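The batched contrastive objective above can be evaluated directly for a candidate matrix $B$, which is useful for checking a solution returned by an SDP solver. The sketch below only evaluates the objective and does not solve the SDP; the function name and the representation of batches as index arrays into the full dataset are assumptions of this example.

```python
import numpy as np


def batched_contrastive_loss(B, K_ss, batches, adjacency_list, alpha):
    """Sum over batches of ||K_{s_j,s} B K_{s,s_j} - (I + A^(j))||_F^2 plus alpha * Tr(B K_ss).

    batches: list of integer index arrays, one per batch (indices into the full dataset).
    adjacency_list: list of (b, b) adjacency matrices, one per batch.
    """
    loss = alpha * np.trace(B @ K_ss)
    for idx, A_j in zip(batches, adjacency_list):
        K_sj_s = K_ss[np.asarray(idx), :]         # shape (b, N)
        residual = K_sj_s @ B @ K_sj_s.T - (np.eye(len(idx)) + A_j)
        loss += np.sum(residual ** 2)
    return float(loss)
```

When every batch target is a diagonal block of a single matrix $T$, the choice $B = K_{s,s}^{-1} T K_{s,s}^{-1}$ makes each data-fitting term vanish, leaving only the regularizer.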

B.5 PROOF OF KERNEL CLOSENESS

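Our goal is to prove Proposition 3.2, copied below.

Proposition 3.2. Given kernel function $k(\cdot, \cdot)$ with corresponding map $\Phi(\cdot)$ into the RKHS $\mathcal{H}$, let $\{x_1, x_2, \dots, x_N\}$ be an SSL dataset normalized such that $k(x_i, x_i) = 1$ and formed by pairwise augmentations (i.e., every element has exactly one neighbor in $A$) with kernel matrix $K_{s,s}$. Given two points $x$ and $x'$, if there exist two points in the SSL dataset indexed by $i$ and $j$ which are related by an augmentation ($A_{ij} = 1$) and

$$\|\Phi(x) - \Phi(x_i)\|_{\mathcal{H}} \leq \frac{\Delta}{5 \|K_{s,s}^{-1}\| \sqrt{N}} \quad \text{and} \quad \|\Phi(x') - \Phi(x_j)\|_{\mathcal{H}} \leq \frac{\Delta}{5 \|K_{s,s}^{-1}\| \sqrt{N}},$$

then the induced kernel for the contrastive loss is at least $k^*(x, x') \geq 1 - \Delta$.

Note that the adjacency matrix $A$ for pairwise augmentations takes the form below, assuming augmented samples are placed next to each other in order:

$$A = \begin{pmatrix}
0 & 1 & & & & & \\
1 & 0 & & & & & \\
& & 0 & 1 & & & \\
& & 1 & 0 & & & \\
& & & & \ddots & & \\
& & & & & 0 & 1 \\
& & & & & 1 & 0
\end{pmatrix}. \tag{42}$$

Before proceeding, we prove a helper lemma showing that $\|k_{s,x_a} - k_{s,x_b}\|$ is small if $\|\Phi(x_a) - \Phi(x_b)\|$ is also small, up to a factor depending on the dataset size.

Lemma B.3. Given kernel function $k(\cdot, \cdot)$ with corresponding map $\Phi(\cdot)$ into the RKHS $\mathcal{H}$ and an SSL dataset $\{x_1, x_2, \dots, x_N\}$ normalized such that $k(x_i, x_i) = 1$, let $K_{s,s}$ be the kernel matrix of the SSL dataset. If $\|\Phi(x_a) - \Phi(x_b)\|_{\mathcal{H}} \leq \epsilon$, then $\left\| K_{s,s}^{-1} k_{s,x_a} - K_{s,s}^{-1} k_{s,x_b} \right\| \leq \|K_{s,s}^{-1}\| \sqrt{N} \epsilon$.

Proof. We have that

$$\begin{aligned}
\left\| K_{s,s}^{-1} k_{s,x_a} - K_{s,s}^{-1} k_{s,x_b} \right\| &\leq \|K_{s,s}^{-1}\|\, \|k_{s,x_a} - k_{s,x_b}\| \\
&= \|K_{s,s}^{-1}\| \left[ \sum_{i=1}^{N} \left( \langle \Phi(x_i), \Phi(x_a) \rangle_{\mathcal{H}} - \langle \Phi(x_i), \Phi(x_b) \rangle_{\mathcal{H}} \right)^2 \right]^{1/2} \\
&= \|K_{s,s}^{-1}\| \left[ \sum_{i=1}^{N} \left( \langle \Phi(x_i), \Phi(x_a) - \Phi(x_b) \rangle_{\mathcal{H}} \right)^2 \right]^{1/2} \\
&\leq \|K_{s,s}^{-1}\| \left[ \sum_{i=1}^{N} \|\Phi(x_i)\|_{\mathcal{H}}^2\, \|\Phi(x_a) - \Phi(x_b)\|_{\mathcal{H}}^2 \right]^{1/2} \\
&\leq \|K_{s,s}^{-1}\| \sqrt{N} \epsilon.
\end{aligned}$$

We are now ready to prove Proposition 3.2.

Proof. Note that $K_{s,s}^{-1} k_{s,x_i} = e_i$, where $e_i$ is the vector with a 1 placed on entry $i$ and zeros elsewhere.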
From Equation (7), we have that

$$\begin{aligned}
k^*(x, x') &= k_{x,s} K_{s,s}^{-1} (I + A)_{+} K_{s,s}^{-1} k_{s,x'} \\
&= (K_{s,s}^{-1} k_{s,x} - e_i + e_i)^{\top} (I + A)_{+} (K_{s,s}^{-1} k_{s,x'} - e_j + e_j) \\
&= e_i^{\top} (I + A)_{+} e_j + (K_{s,s}^{-1} k_{s,x} - e_i)^{\top} (I + A)_{+} e_j + e_i^{\top} (I + A)_{+} (K_{s,s}^{-1} k_{s,x'} - e_j) \\
&\quad + (K_{s,s}^{-1} k_{s,x} - e_i)^{\top} (I + A)_{+} (K_{s,s}^{-1} k_{s,x'} - e_j).
\end{aligned}$$

Note that $e_i^{\top} (I + A)_{+} e_j = 1$ since $A_{ij} = 1$. Therefore,

$$\begin{aligned}
k^*(x, x') &= 1 + (K_{s,s}^{-1} k_{s,x} - e_i)^{\top} (I + A)_{+} e_j + e_i^{\top} (I + A)_{+} (K_{s,s}^{-1} k_{s,x'} - e_j) \\
&\quad + (K_{s,s}^{-1} k_{s,x} - e_i)^{\top} (I + A)_{+} (K_{s,s}^{-1} k_{s,x'} - e_j) \\
&\geq 1 - \|K_{s,s}^{-1} k_{s,x} - e_i\|\, \|(I + A)_{+} e_j\| - \|K_{s,s}^{-1} k_{s,x'} - e_j\|\, \|(I + A)_{+} e_i\| \\
&\quad - \|K_{s,s}^{-1} k_{s,x} - e_i\|\, \|(I + A)_{+}\|\, \|K_{s,s}^{-1} k_{s,x'} - e_j\|.
\end{aligned}$$

Let $\|\Phi(x) - \Phi(x_i)\| \leq \epsilon = \frac{\Delta}{5 \|K_{s,s}^{-1}\| \sqrt{N}}$ and $\|\Phi(x') - \Phi(x_j)\| \leq \epsilon = \frac{\Delta}{5 \|K_{s,s}^{-1}\| \sqrt{N}}$. By applying Lemma B.3, we have

$$\begin{aligned}
k^*(x, x') &\geq 1 - \|K_{s,s}^{-1} k_{s,x} - e_i\|\, \|(I + A)_{+} e_j\| - \|K_{s,s}^{-1} k_{s,x'} - e_j\|\, \|(I + A)_{+} e_i\| \\
&\quad - \|K_{s,s}^{-1} k_{s,x} - e_i\|\, \|(I + A)_{+}\|\, \|K_{s,s}^{-1} k_{s,x'} - e_j\| \\
&\geq 1 - 2\sqrt{2}\, \|K_{s,s}^{-1}\| \sqrt{N} \epsilon - 2 \|K_{s,s}^{-1}\|^2 N \epsilon^2 \\
&\geq 1 - (2\sqrt{2} + 2)\, \|K_{s,s}^{-1}\| \sqrt{N} \epsilon \\
&\geq 1 - 5 \|K_{s,s}^{-1}\| \sqrt{N} \epsilon = 1 - \Delta.
\end{aligned}$$

In the above, we used the fact that $A$ is block diagonal with pairwise constraints (see Equation (42)), so $\|(I + A)_{+} e_i\| = \sqrt{2}$ and $\|(I + A)_{+}\| = 2$.
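The bound in Lemma B.3 can be checked numerically for a normalized kernel. The sketch below is an illustration under stated assumptions: it uses an RBF kernel, for which $k(x, x) = 1$ and $\|\Phi(x_a) - \Phi(x_b)\|_{\mathcal{H}}^2 = 2 - 2k(x_a, x_b)$, takes $\|\cdot\|$ on matrices to be the spectral norm, and sets $\epsilon$ to the exact feature-space distance. Function names are hypothetical.

```python
import numpy as np


def rbf_kernel(X, Y, sigma=1.0):
    """RBF kernel matrix between the rows of X and the rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def lemma_b3_sides(X_train, x_a, x_b, sigma=1.0):
    """Return (lhs, rhs) of the inequality in Lemma B.3 for an RBF kernel."""
    n = X_train.shape[0]
    K_inv = np.linalg.inv(rbf_kernel(X_train, X_train, sigma))
    k_a = rbf_kernel(X_train, x_a[None, :], sigma)[:, 0]
    k_b = rbf_kernel(X_train, x_b[None, :], sigma)[:, 0]
    lhs = np.linalg.norm(K_inv @ (k_a - k_b))
    # Feature-space distance for a kernel with k(x, x) = 1.
    k_ab = rbf_kernel(x_a[None, :], x_b[None, :], sigma)[0, 0]
    eps = np.sqrt(max(0.0, 2.0 - 2.0 * k_ab))
    rhs = np.linalg.norm(K_inv, 2) * np.sqrt(n) * eps
    return float(lhs), float(rhs)
```

For well-separated training points the left side is typically far below the right side, reflecting that the lemma is a worst-case bound.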

C IDEALIZATION OF DOWNSTREAM TASKS

In this section, we prove Proposition 3.3, copied below.

Proposition 3.3 (Ideal SSL outcome). Given a supervised dataset of $N$ points for binary classification drawn from a distribution with $m_{-1}$ and $m_{+1}$ connected manifolds for classes with labels $-1$ and $+1$ respectively, if the induced kernel matrix of the dataset $K^*$ successfully separates the manifolds such that $k^*(x, x') = 1$ if $x, x'$ are in the same manifold and $k^*(x, x') = 0$ otherwise, then $s_N(K^*) = m_{-1} + m_{+1} = O(1)$.

Proof. There are $m_{+1} + m_{-1}$ total manifolds in the dataset. Assume some ordering of these manifolds from $\{1, 2, \dots, m_{+1} + m_{-1}\}$ and let $\#(i)$ be the number of points in the $i$-th manifold. The rows and columns of the kernel matrix $K^*$ can be permuted such that it becomes block diagonal with $m_{+1} + m_{-1}$ blocks, with block $i$ equal to $\frac{1}{\#(i)} \mathbf{1}_{\#(i)} \mathbf{1}_{\#(i)}^{\top}$, where $\mathbf{1}_k$ is the all-ones vector of length $k$. That is, $K^*$ permuted accordingly takes the form below:

$$\begin{pmatrix}
\frac{1}{\#(1)} \mathbf{1}_{\#(1)} \mathbf{1}_{\#(1)}^{\top} & & & \\
& \frac{1}{\#(2)} \mathbf{1}_{\#(2)} \mathbf{1}_{\#(2)}^{\top} & & \\
& & \ddots & \\
& & & \frac{1}{\#(m_{+1} + m_{-1})} \mathbf{1}_{\#(m_{+1} + m_{-1})} \mathbf{1}_{\#(m_{+1} + m_{-1})}^{\top}
\end{pmatrix}.$$

Each block of the above is clearly a rank-one matrix with eigenvalue 1. Let $y_{m_i} \in \mathbb{R}^{\#(i)}$ be the vector containing all labels for entries in manifold $i$. Then we have

$$y^{\top} (K^*)^{-1} y = \sum_{i=1}^{m_{+1} + m_{-1}} \#(i)^{-1} \left\langle \mathbf{1}_{\#(i)}, y_{m_i} \right\rangle^2 = \sum_{i=1}^{m_{+1} + m_{-1}} \#(i)^{-1} \left( \sqrt{\#(i)} \right)^2 = m_{+1} + m_{-1}.$$

C.1 PROOF OF GENERALIZATION BOUND

For the sake of completeness, we include an example proof of the generalization bound referred to in the main text. As is common to classic generalization bounds for kernel methods from several prior works (Huang et al., 2021; Meir & Zhang, 2003; Mohri et al., 2018; Vapnik, 1999; Bartlett & Mendelson, 2002), the norm of the linear solution in kernel space correlates with the resulting bound on the generalization error. We closely follow the methodology of Huang et al. (2021), though other works follow a similar line of reasoning.
Given a linear solution in the reproducing kernel Hilbert space denoted by $w$, we aim to bound the generalization error in the loss function $\ell(y, y') = |\min(1, \max(-1, y)) - y'|$, where $y'$ denotes binary classification targets in $\{-1, 1\}$ and $y$ is the output of the kernel function, equal to $y = \langle w, x \rangle$ for a corresponding input $x$ in the Hilbert space or feature space. For our purposes, the solution $w$ is given by, e.g., Equation (12), equal to

$$w = \sum_{i=1}^{N_t} \left[ \left( K^*_{t,t} \right)^{-1} y \right]_i \phi\left( x_i^{(t)} \right),$$

where $\phi(x_i^{(t)})$ denotes the mapping of $x_i^{(t)}$ to the Hilbert space of the kernel. Given this solution, we have that

$$\|w\| = \sqrt{ y^{\top} \left( K^*_{t,t} \right)^{-1} y }. \tag{50}$$

The norm above controls, in a sense, the complexity of the resulting solution, as it appears in the resulting generalization bound. Given input and output spaces $\mathcal{X}$ and $\mathcal{Y}$ respectively, let $\mathcal{D}$ be a distribution of input/output pairs over the support $\mathcal{X} \times \mathcal{Y}$. Slightly modifying existing generalization bounds via Rademacher complexity arguments (Huang et al., 2021; Bartlett & Mendelson, 2002; Mohri et al., 2018), we prove the following generalization bound.

Theorem C.1 (Adapted from Section 4.C of Huang et al. (2021)). Let $x_1, \dots, x_N$ and $y_1, \dots, y_{N_t}$ (with $y$ the corresponding vector storing the scalars $y_i$ as entries) be our training set of $N$ independent samples drawn i.i.d. from $\mathcal{D}$. Let $w = \sqrt{\frac{\mathrm{Tr}(K)}{N}} \sum_{i=1}^{N_t} \left[ K^{-1} y \right]_i \phi\left( x_i^{(t)} \right)$ denote the solution to the kernel regression problem normalized by the trace of the data kernel. For an $L$-Lipschitz loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, b]$, with probability $1 - \delta$ for any $\delta > 0$, we have

$$\mathbb{E}_{x, y \sim \mathcal{D}} \left[ \ell(\langle w, x \rangle, y) \right] - \frac{1}{N_t} \sum_{i=1}^{N_t} \ell(\langle w, x_i \rangle, y_i) \leq \frac{2\sqrt{2} L + 3b}{\sqrt{2}} \cdot \frac{\sqrt{\mathrm{Tr}(K)\, y^{\top} K^{-1} y}}{N} + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1)))}{2N} }.$$

To prove the above, we apply a helper theorem and lemma copied below.

Theorem C.2 (Concentration of sum; see Theorem 3.1 in Mohri et al. (2018)). Let $G$ be a family of functions mapping from $Z$ to $[0, 1]$. Given $N$ independent samples $z_1, \dots, z_N$ from $Z$, then for any $\delta > 0$, with probability at least $1 - \delta$, the following holds:

$$\mathbb{E}_z[g(z)] - \frac{1}{N} \sum_{i=1}^{N} g(z_i) \leq 2 R_S(G) + 3 \sqrt{ \frac{\log(2 / \delta)}{2N} },$$

where $R_S(G)$ denotes the empirical Rademacher complexity of $G$, equal to

$$R_S(G) = \mathbb{E}_{\sigma} \left[ \sup_{g \in G} \frac{1}{N} \sum_{i=1}^{N} \sigma_i g(z_i) \right],$$

where the $\sigma_i$ are independent uniform random variables over $\{-1, +1\}$.

Lemma C.3 (Talagrand's lemma). Let $\Phi$ be an $L$-Lipschitz function. Then $R_S(\Phi \circ G) \leq L\, R_S(G)$.

Now, we are ready to prove Theorem C.1.

Proof. We consider a function class $G_{\gamma}$ defined as the set of linear functions on the reproducing kernel Hilbert space such that $G_{\gamma} = \{ \langle w, \cdot \rangle : \|w\|^2 \leq \gamma \}$. Given a dataset of inputs $x_1, \dots, x_N$ in the reproducing kernel Hilbert space with corresponding targets $y_1, \dots, y_N$, let $\epsilon_w(x_i) = \ell(\langle w, x_i \rangle, y_i) \in [0, b]$. The inequality in Theorem C.2 applies for any given $G_{\gamma}$, but we would like it to hold for all $\gamma \in \{1, 2, 3, \dots\}$ since $\|w\|$ can be unbounded. Since our loss function $\ell$ is bounded in $[0, b]$, we multiply it by $1/b$ so that it takes values in $[0, 1]$ as needed for Theorem C.2. Then, we apply Theorem C.2 to $\epsilon_w(x_i)$ over the class $G_{\gamma}$ for each $\gamma \in \{1, 2, 3, \dots\}$ with failure probability $\delta_{\gamma} = \delta (e - 1) e^{-\gamma}$. This implies that

$$\mathbb{E}_x[\epsilon_w(x)] - \frac{1}{N} \sum_{i=1}^{N} \epsilon_w(x_i) \leq 2\, \mathbb{E}_{\sigma} \left[ \sup_{\|v\|^2 \leq \gamma} \frac{1}{N} \sum_{i=1}^{N} \sigma_i \epsilon_v(x_i) \right] + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \gamma}{2N} }. \tag{55}$$

This shows that for any $\gamma$, the above inequality holds with probability $1 - \delta (e - 1) e^{-\gamma}$. However, we need to show that it holds for all $\gamma$ simultaneously. To achieve this, we apply a union bound, which holds with probability

$$1 - \sum_{\gamma=1}^{\infty} \delta_{\gamma} = 1 - \sum_{\gamma=1}^{\infty} \delta (e - 1) e^{-\gamma} = 1 - \delta.$$

To proceed, we consider the inequality where $\gamma = \lceil \|w\|^2 \rceil$, copied below:

$$\mathbb{E}_x[\epsilon_w(x)] - \frac{1}{N} \sum_{i=1}^{N} \epsilon_w(x_i) \leq 2\, \mathbb{E}_{\sigma} \left[ \sup_{\|v\|^2 \leq \lceil \|w\|^2 \rceil} \frac{1}{N} \sum_{i=1}^{N} \sigma_i \epsilon_v(x_i) \right] + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} }.$$
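Applying Talagrand's lemma (Lemma C.3) followed by the Cauchy-Schwarz inequality,

$$\begin{aligned}
\mathbb{E}_x[\epsilon_w(x)] - \frac{1}{N} \sum_{i=1}^{N} \epsilon_w(x_i) &\leq 2\, \mathbb{E}_{\sigma} \left[ \sup_{\|v\|^2 \leq \lceil \|w\|^2 \rceil} \frac{1}{N} \sum_{i=1}^{N} \sigma_i \epsilon_v(x_i) \right] + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} } \\
&\leq 2L\, \mathbb{E}_{\sigma} \left[ \sup_{\|v\|^2 \leq \lceil \|w\|^2 \rceil} \frac{1}{N} \sum_{i=1}^{N} \sigma_i \langle v, x_i \rangle \right] + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} } \\
&\leq 2L\, \mathbb{E}_{\sigma} \left[ \sup_{\|v\|^2 \leq \lceil \|w\|^2 \rceil} \|v\| \left\| \frac{1}{N} \sum_{i=1}^{N} \sigma_i x_i \right\| \right] + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} } \\
&\leq 2L\, \mathbb{E}_{\sigma} \left[ \lceil \|w\| \rceil \left\| \frac{1}{N} \sum_{i=1}^{N} \sigma_i x_i \right\| \right] + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} }.
\end{aligned}$$

Expanding out the quantity $\left\| \frac{1}{N} \sum_{i=1}^{N} \sigma_i x_i \right\|$ and noting that the random variables are independent, we have

$$\begin{aligned}
\mathbb{E}_x[\epsilon_w(x)] - \frac{1}{N} \sum_{i=1}^{N} \epsilon_w(x_i) &\leq 2L\, \mathbb{E}_{\sigma} \left[ \lceil \|w\| \rceil \left\| \frac{1}{N} \sum_{i=1}^{N} \sigma_i x_i \right\| \right] + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} } \\
&= \frac{2L}{N} \lceil \|w\| \rceil\, \mathbb{E}_{\sigma} \left[ \sqrt{ \sum_{i=1}^{N} \sum_{j=1}^{N} \sigma_i \sigma_j k(x_i, x_j) } \right] + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} } \\
&\leq \frac{2L}{N} \lceil \|w\| \rceil \sqrt{ \mathbb{E}_{\sigma} \left[ \sum_{i=1}^{N} \sum_{j=1}^{N} \sigma_i \sigma_j k(x_i, x_j) \right] } + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} } \\
&= \frac{2L}{N} \lceil \|w\| \rceil \sqrt{ \sum_{i=1}^{N} k(x_i, x_i) } + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} } \\
&= \frac{2L}{N} \lceil \|w\| \rceil \sqrt{ \mathrm{Tr}(K) } + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} }.
\end{aligned} \tag{58}$$

Since we normalize such that $\mathrm{Tr}(K) = N$, we have

$$\begin{aligned}
\mathbb{E}_x[\epsilon_w(x)] - \frac{1}{N} \sum_{i=1}^{N} \epsilon_w(x_i) &\leq \frac{2L}{\sqrt{N}} \lceil \|w\| \rceil + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1))) + \lceil \|w\|^2 \rceil}{2N} } \\
&\leq \frac{2L}{\sqrt{N}} \lceil \|w\| \rceil + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1)))}{2N} } + 3b \frac{\lceil \|w\| \rceil}{\sqrt{2N}} \\
&= \frac{2\sqrt{2} L + 3b}{\sqrt{2N}} \lceil \|w\| \rceil + 3b \sqrt{ \frac{\log(2 / (\delta (e - 1)))}{2N} }.
\end{aligned}$$

Plugging in $\|w\|^2 = y^{\top} K^{-1} y$ and noting that we normalized the kernel to have $\mathrm{Tr}(K) = N$ thus completes the proof.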
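The final expression of the proof is easy to evaluate numerically. The sketch below computes the complexity term $s_N(K) = \frac{\mathrm{Tr}(K)}{N} y^{\top} K^{-1} y$ and the right-hand side of the last inequality, taking $\|w\|^2 = s_N(K)$, which coincides with $y^{\top} K^{-1} y$ under the normalization $\mathrm{Tr}(K) = N$. Using a pseudo-inverse in place of $K^{-1}$ is an assumption of this sketch (it allows singular kernels such as the idealized block kernel of Proposition 3.3), and the function names are hypothetical.

```python
import math

import numpy as np


def s_N(K, y, rcond=1e-10):
    """Complexity term Tr(K)/N * y^T K^{-1} y (pseudo-inverse used for singular K)."""
    n = K.shape[0]
    K_pinv = np.linalg.pinv(K, rcond=rcond)
    return float(np.trace(K) / n * (y @ K_pinv @ y))


def generalization_gap_bound(K, y, lipschitz, b, delta):
    """Right-hand side of the final inequality in the proof, with ||w||^2 = s_N(K)."""
    n = K.shape[0]
    w_norm_ceil = math.ceil(math.sqrt(s_N(K, y)))
    log_term = math.log(2.0 / (delta * (math.e - 1.0)))
    complexity = (2.0 * math.sqrt(2.0) * lipschitz + 3.0 * b) / math.sqrt(2.0 * n) * w_norm_ceil
    confidence = 3.0 * b * math.sqrt(log_term / (2.0 * n))
    return complexity + confidence
```

For an idealized kernel that equals 1 within each manifold and 0 across manifolds, with labels constant on each manifold, `s_N` returns the number of manifolds, in line with Proposition 3.3.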

D NUMERICAL EXPERIMENTS

All of our implementations are based on the common Python scientific libraries NumPy/SciPy (Harris et al., 2020) and run on CPU (no GPU used). Kernel algorithms were performed using the scikit-learn package (Pedregosa et al., 2011). Hyperparameters were chosen via a grid search over the kernel algorithm parameters (e.g., regularization terms) and the loss function hyperparameters where appropriate. For calculation of the neural tangent kernel, we used the neural-tangents package (Novak et al., 2020).

Figure 5: Comparison of the RBF kernel space (first row) and induced kernel space (second row). The induced kernel is computed based on Equation 7, and the graph adjacency matrix A is derived from the inner product neighborhood in the RBF kernel space, i.e., using the neighborhoods as data augmentation. We plot three randomly chosen points' kernel entries with respect to the other points on the manifolds. a) When the neighborhood augmentation range used to construct the adjacency matrix is small enough, the SSL-induced kernel faithfully learns the topology of the entangled spiral manifolds. b) When the neighborhood augmentation range used to construct the adjacency matrix is too large, it creates the "short-circuit" effect in the induced kernel space. Each subplot on the second row is normalized by dividing by its largest absolute value for better contrast.

Here, we provide the contrastive SSL-induced kernel visualization in Figure 5 to show how the SSL-induced kernel helps to manipulate distances and disentangle manifolds in the representation space. In Figure 5, we study two entangled 1-D spiral manifolds in a 2D space with 200 training points uniformly distributed on the spiral manifolds. We use the contrastive SSL-induced kernel, following Equation 7, to demonstrate this result, whereas the non-contrastive SSL-induced kernel is provided earlier in the main text. We use the radial basis function (RBF) kernel to calculate the inner products between different points and plot a few points' RBF neighborhoods in the first row of Figure 2, i.e., $k(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|^2}{2\sigma^2} \right)$.
As we can see, the RBF kernel captures the locality of the 2D space. Next, we use the RBF kernel space neighborhoods to construct the adjacency matrix A and any training points with k(x 1 , x 2 ) > d are treated as connected vertices, i.e., A ij = 1 and A ij = 0 otherwise. The diagonal entries of A are 0. This construction can be considered as using the Euclidean neighborhoods of x as the data augmentation. In the second row of Figure 5 , we show the selected points' inner products with the other training points in the SSL-induced kernel space. Given σ, when d is chosen small enough, we can see in the second row of Figure 5 (a) that the SSL-induced kernel faithfully captures the topology of manifolds and leads to a more disentangled representation. Figure 5 (b) shows that when d is too large, i.e., an improper data augmentation, the SSL-induced kernel leads to the "short-circuit" effect, and the two manifolds are mixed in the representation space.
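The adjacency construction described above can be sketched as follows. This is a minimal illustration, not the exact experimental code: the spiral parametrization, the bandwidth `sigma`, and the function names are assumptions. The threshold follows the rule as literally stated, connecting two points when their RBF kernel value exceeds `d`; under this literal reading, raising `d` shrinks the neighborhood.

```python
import numpy as np


def two_spirals(n_per_spiral=100, turns=1.5):
    """Two entangled 1-D spirals in 2D (hypothetical parametrization)."""
    t = np.linspace(0.5, turns * 2.0 * np.pi, n_per_spiral)
    spiral_a = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
    spiral_b = -spiral_a                       # second arm, rotated by 180 degrees
    return np.concatenate([spiral_a, spiral_b], axis=0)


def rbf_gram(X, sigma=1.0):
    """RBF kernel matrix of the rows of X with themselves."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def threshold_adjacency(X, d, sigma=1.0):
    """A_ij = 1 if k(x_i, x_j) > d and i != j, else 0."""
    A = (rbf_gram(X, sigma) > d).astype(float)
    np.fill_diagonal(A, 0.0)                   # diagonal entries of A are zero
    return A
```

The resulting matrix can be passed as `A` to any routine that builds the induced kernel from an adjacency matrix.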

D.2 TIME SERIES DATA AND CIFAR10

In Table 1, we also provide additional empirical validation on a few time-series datasets extracted from UCR (Chen et al., 2015), as well as a few controlled experiments on CIFAR10. For the latter, we use the same set-up as in our MNIST experiments, i.e., translation and rotation data augmentation. For the UCR experiments, we implement translations and add white noise as data augmentation. Surprisingly, despite such simple augmentations, the SSL-induced kernel achieves performance similar to the supervised baselines.

D.3 MNIST WITH NEURAL TANGENT KERNELS

To further explore the performance of SSL kernel methods on small datasets, we perform additional numerical experiments on the MNIST dataset using kernel functions derived from the neural tangent kernel (NTK) of commonly used neural networks (Jacot et al., 2018). We use the neural-tangents package to explicitly calculate the NTK for a given architecture (Novak et al., 2020). The basic setup is repeated from Section 4.2, where two different types of augmentations are performed on images. As before, we consider augmentations by Gaussian blurring of the pixels or by small translations, rotations, and zooms. Test set accuracy results are shown in Table 2 for the NTK associated with a 3-layer fully connected network with infinite width at each hidden layer. Similarly, in Table 3, we show the corresponding analysis for the NTK of a CNN with seven convolutional layers followed by a global pooling layer. We use the contrastive induced kernel for the SSL kernel algorithm.

Table 1: Test set accuracy on various datasets comparing the different methods. For CIFAR10, the same augmentation was used as for MNIST (rotation + translation), while for the time-series datasets, translation and white noise were used. Every implementation uses the same methodology and hyperparameter tuning as that of Figure 3.

Table 2: NTK for 3-layer fully connected network: Test set accuracy of SVM using the neural tangent kernel of a 3-layer fully connected network in only a supervised (baseline) setting or via the induced kernel in a self-supervised setting. The induced kernel for the SSL algorithm is calculated using the contrastive induced kernel. Numbers shown above are the test set accuracy for classifying MNIST digits for small dataset sizes with the given number of samples. Due to the quadratic scaling of memory and runtime for kernel methods, we restricted analysis to more feasible settings where there were fewer than 25,000 total samples (number of augmentations times number of samples).
The findings are similar to those reported in the main text. The SSL method performs similarly to the baseline method without any self-supervised learning but including labeled augmented datapoints in the training set. In some cases, the SSL method even outperforms the baseline augmented setting.

D.4 ANALYSIS OF DOWNSTREAM SOLUTION

To empirically analyze generalization, we track the complexity quantity $s_N(K)$ defined in Equation (13) and copied below:

$$s_N(K) = \frac{\operatorname{Tr}(K)}{N}\, y^\top K^{-1} y.$$
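One way to read this quantity is through its spectral form, sketched here under the assumption that $K$ is symmetric positive definite with eigenpairs $(\lambda_i, u_i)$ and that $y$ is the vector of targets. The eigenpair notation and the constant $c$ are introduced only for this illustration.

```latex
K = \sum_{i=1}^{N} \lambda_i u_i u_i^\top
\quad\Longrightarrow\quad
s_N(K) = \Big(\frac{1}{N}\sum_{i=1}^{N} \lambda_i\Big)
         \Big(\sum_{i=1}^{N} \frac{(u_i^\top y)^2}{\lambda_i}\Big),
\qquad
s_N(cK) = s_N(K) \ \text{ for any } c > 0 .
```

In this form, $s_N(K)$ is small when the targets concentrate on eigen-directions of $K$ with large eigenvalues, and its value does not change when the kernel is rescaled. As a simple check, $K = I$ with $y \in \{\pm 1\}^N$ gives $s_N(K) = N$.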



Figure 1: To translate the SSL setting into the kernel regime, we aim to find the optimal linear function which maps inputs from the RKHS into the K-dimensional feature space of the representations. This new feature space induces a new optimal kernel denoted the "induced" kernel. Relationships between datapoints are encoded in an adjacency matrix (the example matrix shown here contains pairwise relationships between datapoints).

Figure 2: Comparison of the RBF kernel space (first row) and induced kernel space (second row). The induced kernel is computed based on Equation 4, and the graph Laplacian matrix is derived from the inner product neighborhood in the RBF kernel space, i.e., using the neighborhoods as data augmentation. a) We plot three randomly chosen points' kernel entries with respect to the other points on the manifolds. When the neighborhood augmentation range used to construct the Laplacian matrix is small enough, the SSL-induced kernel faithfully learns the topology of the entangled spiral manifolds. b) When the neighborhood augmentation range used to construct the Laplacian matrix is too large, it creates the "short-circuit" effect in the induced kernel space. Each subplot on the second row is normalized by its largest absolute value for better contrast.

Figure 4: MNIST classification task with 100 original training samples and 100 augmentations per sample on the train and test set (full, 10,000 samples), with baselines in the titles given by a supervised SVM using labels for all samples or for the original (100) samples only. On the left, we show the contrastive kernel performance when ablating over the inverse of the SVM's $\ell_2$ regularizer (y-axis) and the representation dimension $K$ (x-axis). On the right, we show the non-contrastive kernel performance when ablating over $\beta$ (y-axis) and the representation dimension $K$ (x-axis). We observe, as expected, that reducing $K$ prevents overfitting and should be preferred over the $\ell_2$ regularizer, and that $\beta$ acts jointly with the representation dimension, i.e., one only needs to tune one of the two.

1 (Representer theorem for self-supervised tasks; adapted from Schölkopf et al. (2001)).

Given kernel function $k(\cdot, \cdot)$ with map $\Phi(\cdot)$ into RKHS $\mathcal{H}$ and an SSL dataset $\{x_i\}_{i=1}^{N}$

Lemma C.3 (Talagrand's lemma; see Lemma 4.2 in Mohri et al. (2018)). Let Φ : R → R be L-Lipschitz. Then for any hypothesis set G of real-valued functions,

Figure 5: Comparison of inner products in the RBF kernel space (first row) and induced kernel space (second row). The induced kernel is computed based on Equation 7, and the graph adjacency matrix A is derived from the inner product neighborhood in the RBF kernel space, i.e., using the neighborhoods as data augmentation. We plot three randomly chosen points' kernel entries with respect to the other points on the manifolds. a) When the neighborhood augmentation range used to construct the adjacency matrix is small enough, the SSL-induced kernel faithfully learns the topology of the entangled spiral manifolds. b) When the neighborhood augmentation range used to construct the adjacency matrix is too large, it creates the "short-circuit" effect in the induced kernel space. Each subplot on the second row is normalized by dividing by its largest absolute value for better contrast.



Table 3: NTK for 7-layer convolutional neural network: Test set accuracy of SVM using the neural tangent kernel of a CNN with 7 layers of 3 × 3 convolution followed by a global pooling layer. The induced kernel for the SSL algorithm is calculated using the contrastive induced kernel and compared to the standard SVM using the kernel itself with or without augmentation. Numbers shown are the test set accuracy for classifying MNIST digits for small dataset sizes with the given number of samples. Due to the quadratic scaling of memory and runtime for kernel methods, we restricted analysis to more feasible settings where there were fewer than 25,000 total samples (number of augmentations times number of samples).

In the definition of $s_N(K)$, $y$ is a vector of targets and $K$ is the kernel matrix of the supervised dataset. As a reminder, the generalization gap can be bounded with high probability by $O(\sqrt{s_N(K)/N})$, and in the main text we provided a heuristic indication that this quantity may be reduced when using the induced kernel. Here, we empirically analyze whether this holds in the MNIST setting considered in the main text. As shown in Figure 6, the SSL induced kernel produces smaller values of $s_N(K)$ than its corresponding supervised counterpart. We compute the values in this figure by splitting the classes into two groups (even and odd digits) in order to construct a vector $y$ that mimics a binary classification task. Comparing this to the test accuracy results reported in Figure 4, $s_N(K)$ also correlates inversely with test accuracy, as expected. These results lend further evidence to the hypothesis that the induced kernel better correlates points along a data manifold, as outlined in Proposition 3.3.
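The even/odd construction and the effect of correlating same-class points can be sketched on a toy problem. The snippet below is only an illustration under stated assumptions: the four digit labels, the helper names `complexity_s` and `block_kernel`, and the block-structured kernel with within-class correlation `rho` are hypothetical stand-ins, not the kernels or data used for Figure 6.

```python
import numpy as np

def complexity_s(K, y):
    """s_N(K) = Tr(K)/N * y^T K^{-1} y for a positive definite N x N matrix K."""
    n = K.shape[0]
    # Solve K a = y instead of forming the inverse explicitly.
    return np.trace(K) / n * float(y @ np.linalg.solve(K, y))

# Hypothetical digit labels; even digits map to +1 and odd digits to -1.
digits = np.array([0, 2, 1, 3])
y = np.where(digits % 2 == 0, 1.0, -1.0)

def block_kernel(rho):
    """Toy kernel in which points of the same parity have correlation rho."""
    block = np.array([[1.0, rho], [rho, 1.0]])
    # Block-diagonal matrix: one 2 x 2 block per parity class.
    return np.kron(np.eye(2), block)

# No correlation between any points (identity kernel).
s_uncorrelated = complexity_s(block_kernel(0.0), y)
# Same-parity points correlated, different-parity points uncorrelated.
s_aligned = complexity_s(block_kernel(0.5), y)
```

For this toy kernel the value works out to $4/(1+\rho)$, so increasing the within-class correlation lowers $s_N(K)$. This mirrors, in a very simplified setting, the qualitative behavior described above for the induced kernel.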

