JOINT EMBEDDING SELF-SUPERVISED LEARNING IN THE KERNEL REGIME

Anonymous

Abstract

The fundamental goal of self-supervised learning (SSL) is to produce useful representations of data without access to any labels for classifying the data. Modern methods in SSL, which form representations based on known or constructed relationships between samples, have been particularly effective at this task. Motivated by a rich line of work on kernel methods for graphs and manifolds, we show that SSL methods likewise admit a kernel regime, where embeddings are constructed by linear maps acting on the feature space of a kernel to find the optimal form of the output representations for contrastive and non-contrastive loss functions. This procedure produces a new representation space whose inner product, denoted the induced kernel, generally correlates points which are related by an augmentation in kernel space and de-correlates points otherwise. We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and gain theoretical insights into their performance on downstream tasks.

1. INTRODUCTION

Self-supervised learning (SSL) algorithms are broadly tasked with learning from unlabeled data. In the joint embedding framework of SSL, mainstream contrastive methods build representations by reducing the distance between inputs related by an augmentation (positive pairs) and increasing the distance between inputs not known to be related (negative pairs) (Chen et al., 2020; He et al., 2020; Oord et al., 2018; Ye et al., 2019). Non-contrastive methods only enforce similarity between positive pairs but are carefully designed to avoid collapse of the representations (Grill et al., 2020; Zbontar et al., 2021). Recent algorithms for SSL have performed remarkably well, reaching performance similar to baseline supervised learning algorithms on many downstream tasks (Caron et al., 2020; Bardes et al., 2021; Chen & He, 2021).

In this work, we study SSL from a kernel perspective, motivated by the rich history of kernel algorithms on graphs and manifolds (Smola & Kondor, 2003; Ando & Zhang, 2006; Bellet et al., 2013; Belkin & Niyogi, 2004). Our primary aim is to extend this line of work to cover commonly used loss functions in modern SSL settings, potentially providing useful insights into their properties and performance.

In standard SSL tasks, inputs are fed into a neural network and mapped into a feature space which encodes the final representations used in downstream tasks (e.g., classification tasks). In the kernel setting, inputs are embedded in a feature space corresponding to a kernel, and representations are constructed via an optimal mapping from this feature space to the vector space of the representations. Here, the task can be framed as one of finding an optimal "induced" kernel: a mapping from the original kernel on the input feature space to an updated kernel function acting on the vector space of the representations.
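The contrastive mechanism described above can be made concrete with a toy objective: reward similarity between representations of positive pairs (recorded as A_ij = 1) and penalize similarity between all other pairs. The sketch below is illustrative only; it mirrors the spirit of the cited losses rather than reproducing any one of them, and the function name is ours.

```python
import numpy as np

def simple_contrastive_loss(Z, A):
    """Toy contrastive objective on a representation matrix Z (rows z_i).

    Pulls together representations of related inputs (A_ij = 1) and pushes
    apart unrelated ones. Schematic, not any specific published loss.
    """
    N = Z.shape[0]
    G = Z @ Z.T                                  # pairwise similarities z_i . z_j
    pos = -G[A == 1].sum()                       # reward similarity of positive pairs
    off_diag = ~np.eye(N, dtype=bool)
    neg = (G[(A == 0) & off_diag] ** 2).sum()    # penalize similarity of negatives
    return pos + neg

# Three points, where points 0 and 1 form a positive pair.
Z_toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
A_toy = np.zeros((3, 3))
A_toy[0, 1] = A_toy[1, 0] = 1
loss = simple_contrastive_loss(Z_toy, A_toy)
```

Here the aligned positive pair contributes a negative (good) term, while the orthogonal third point contributes no negative-pair penalty.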
Our results show that such an induced kernel can be constructed using only manipulations of kernel functions and data encoding the relationships between inputs in an SSL algorithm (e.g., adjacency matrices between the input datapoints). More broadly, we make the following contributions:

• For a contrastive and a non-contrastive loss, we provide closed-form solutions when the algorithm is trained over a single batch of data. These solutions form a new "induced" kernel which can be used to perform downstream supervised learning tasks.

• We show that a version of the representer theorem in kernel methods can be used to formulate kernelized SSL tasks as optimization problems. As an example, we show how to optimally find induced kernels when the loss is enforced over separate batches of data.

• We empirically study the properties of our SSL kernel algorithms to gain insights about the training of SSL algorithms in practice. We study the generalization properties of SSL algorithms and show that the choice of augmentation and of the adjacency matrices encoding relationships between datapoints is crucial to performance.

We proceed as follows. First, we provide a brief background on the goals of our work and the theoretical tools used in our study. Second, we show that kernelized SSL algorithms trained on a single batch admit closed-form solutions for commonly used contrastive and non-contrastive loss functions. Third, we generalize our findings to provide a semi-definite programming formulation that solves for the optimal induced kernel in more general settings, and provide heuristics to better understand the form and properties of the induced kernels. Finally, we empirically investigate our kernelized SSL algorithms on various datasets (code included in the supplemental material).
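Once an induced kernel k* is available (its closed form is derived later in the paper), it plugs into any standard kernel method for the downstream task. As a minimal sketch of that downstream step, the kernel ridge regression below uses an RBF kernel as a stand-in for k*; all function names and the regularization value are ours, not the paper's.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Base kernel k(x, y) = exp(-gamma * ||x - y||^2); stand-in for an induced kernel k*."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(K_train, y, reg=1e-3):
    """Solve (K + reg * I) alpha = y; alpha parameterizes the predictor."""
    n = K_train.shape[0]
    return np.linalg.solve(K_train + reg * np.eye(n), y)

def kernel_ridge_predict(K_test_train, alpha):
    """Predict on new points from their kernel values against the training set."""
    return K_test_train @ alpha

# Toy downstream task: any PSD induced kernel could replace rbf_kernel here.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(20, 3)), rng.normal(size=20)
X_test = rng.normal(size=(5, 3))

alpha = kernel_ridge_fit(rbf_kernel(X_train, X_train), y_train)
y_pred = kernel_ridge_predict(rbf_kernel(X_test, X_train), alpha)
```

The point of the sketch is that the SSL stage only changes which kernel matrix is passed in; the downstream learner itself is unchanged.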

1.1. NOTATION AND SETUP

We denote vectors and matrices with lowercase (x) and uppercase (X) letters respectively. The vector 2-norm and the matrix operator norm are both denoted by ∥ • ∥. The Frobenius norm of a matrix M is denoted as ∥M∥_F. We denote the transpose and conjugate transpose of M by M^⊺ and M^† respectively. We denote the identity matrix as I and the vector with each entry equal to one as 1. For a diagonalizable matrix M, its projection onto the eigenspace of its positive eigenvalues is M_+. For a dataset of size N, let x_i ∈ X for i ∈ [N] denote the elements of the dataset. Given a kernel function k : X × X → R, let Φ(x) = k(x, •) be the map from inputs to the reproducing kernel Hilbert space (RKHS), denoted by H, with corresponding inner product ⟨•, •⟩_H and RKHS norm ∥ • ∥_H. Throughout, we denote by K_{s,s} ∈ R^{N×N} the kernel matrix of the SSL dataset, where (K_{s,s})_{ij} = k(x_i, x_j). We consider linear models W : H → R^K which map features to representations z_i = W Φ(x_i). Let Z be the representation matrix which contains the z_i as rows. This linear function space induces a corresponding RKHS norm, which can be calculated as ∥W∥_H^2 = Σ_{i=1}^{K} ⟨W_i, W_i⟩_H, where W_i ∈ H denotes the i-th component of the output of the linear mapping W. This linear mapping constructs an "induced" kernel, denoted k*(•, •), as discussed later.

The driving motive behind modern self-supervised algorithms is to maximize the information about the inputs retained by the representations while enforcing similarity between inputs that are known to be related. The adjacency matrix A ∈ {0,1}^{N×N} (which can also be generalized to A ∈ R^{N×N}) connects related inputs x_i and x_j (i.e., A_{ij} = 1 if inputs i and j are related by a transformation), and D_A is the diagonal matrix whose i-th diagonal entry equals the number of nonzero elements of row i of A.
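The objects above can be sketched numerically. By the representer-theorem argument invoked in the contributions, each row of W lies in the span of the training features, W = α Φ(X) with α ∈ R^{K×N}, so z_i = α K_{s,s}[:, i] and the induced kernel on the training set is k*(x_i, x_j) = z_i · z_j = (K_{s,s} α^⊺ α K_{s,s})_{ij}. In the sketch below α is random rather than optimized (the paper solves for the optimal α), and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 6, 2                         # dataset size, representation dimension

def k(x, y, gamma=0.5):
    """An example base kernel (RBF)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

X = rng.normal(size=(N, 3))

# Kernel matrix of the SSL dataset: (K_ss)_ij = k(x_i, x_j)
K_ss = np.array([[k(xi, xj) for xj in X] for xi in X])

# Adjacency matrix A: pair consecutive points as (input, augmentation)
A = np.zeros((N, N))
for i in range(0, N, 2):
    A[i, i + 1] = A[i + 1, i] = 1
D_A = np.diag((A != 0).sum(axis=1))  # degree matrix

# Linear map W : H -> R^K with rows in the span of the features
# (representer theorem): W = alpha @ Phi(X), alpha in R^{K x N}.
alpha = rng.normal(size=(K, N))

Z = K_ss @ alpha.T                   # representation matrix, rows z_i = W Phi(x_i)
K_induced = K_ss @ alpha.T @ alpha @ K_ss   # induced kernel k*(x_i, x_j) = z_i . z_j
```

By construction K_induced is symmetric positive semi-definite, as any valid kernel matrix must be, regardless of the choice of α.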

2. RELATED WORKS

In this section, we briefly summarize some of the works related to this study; a more detailed related-works section appears in Appendix A.

Modern self-supervised learning approaches: Joint embedding approaches to SSL produce representations by comparing representations of inputs via known relationships. Methods are denoted non-contrastive if the loss function depends only on pairs that are related (Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021; Bardes et al., 2021). Contrastive methods additionally penalize similarity between representations of inputs that are not related. Popular algorithms include SimCLR (Chen et al., 2020), SwAV (Caron et al., 2020), NNCLR (Dwibedi et al., 2021), contrastive predictive coding (Oord et al., 2018), the spectral contrastive loss (HaoChen et al., 2021), and many others. Separate from the joint embedding framework, many methods in SSL form representations by predicting held-out portions of the data, typically in a generative modeling setting. Commonly used algorithms incorporate autoencoder approaches (He et al., 2022; Radford et al., 2019; Vincent et al., 2010; Dosovitskiy et al., 2020), which are used in both natural language processing and image processing tasks.

