JOINT EMBEDDING SELF-SUPERVISED LEARNING IN THE KERNEL REGIME

Anonymous

Abstract

The fundamental goal of self-supervised learning (SSL) is to produce useful representations of data without access to any labels. Modern SSL methods, which form representations based on known or constructed relationships between samples, have been particularly effective at this task. Motivated by a rich line of work on kernel methods for graphs and manifolds, we show that SSL methods likewise admit a kernel regime in which embeddings are constructed by linear maps acting on the feature space of a kernel, yielding the optimal form of the output representations for contrastive and non-contrastive loss functions. This procedure produces a new representation space whose inner product, which we call the induced kernel, generally correlates points related by an augmentation in kernel space and de-correlates points otherwise. We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and to gain theoretical insight into their performance on downstream tasks.

1. INTRODUCTION

Self-supervised learning (SSL) algorithms are broadly tasked with learning from unlabeled data. In the joint-embedding framework of SSL, mainstream contrastive methods build representations by reducing the distance between inputs related by an augmentation (positive pairs) and increasing the distance between inputs not known to be related (negative pairs) (Chen et al., 2020; He et al., 2020; Oord et al., 2018; Ye et al., 2019). Non-contrastive methods enforce similarity only between positive pairs but are carefully designed to avoid collapse of the representations (Grill et al., 2020; Zbontar et al., 2021). Recent SSL algorithms have performed remarkably well, reaching performance similar to supervised baselines on many downstream tasks (Caron et al., 2020; Bardes et al., 2021; Chen & He, 2021). In this work, we study SSL from a kernel perspective, motivated by the rich history of kernel algorithms on graphs and manifolds (Smola & Kondor, 2003; Ando & Zhang, 2006; Bellet et al., 2013; Belkin & Niyogi, 2004). Our primary aim is to extend this line of work to the loss functions commonly used in modern SSL, potentially providing useful insights into their properties and performance.

In standard SSL tasks, inputs are fed into a neural network and mapped into a feature space that encodes the final representations used in downstream tasks (e.g., classification). In the kernel setting, inputs are instead embedded in the feature space corresponding to a kernel, and representations are constructed via an optimal mapping from this feature space to the vector space of the representations. Here, the task can be framed as finding an optimal "induced" kernel: a mapping from the original kernel on the input feature space to an updated kernel function acting on the vector space of the representations.
Our results show that such an induced kernel can be constructed using only manipulations of kernel functions and data encoding the relationships between inputs in an SSL algorithm (e.g., adjacency matrices over the input datapoints). More broadly, we make the following contributions:

• For a contrastive and a non-contrastive loss, we provide closed-form solutions when the algorithm is trained over a single batch of data. These solutions form a new "induced" kernel which can be used to perform downstream supervised learning tasks.
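To make the construction above concrete, the following is a minimal sketch of building an induced kernel from a batch kernel matrix and an adjacency matrix of positive pairs. The specific closed form used here (a spectral-style solution via eigendecomposition of an augmentation-correlated operator) is an illustrative assumption, not the paper's exact solution; the function name and shapes are likewise hypothetical.

```python
import numpy as np

def induced_kernel_map(K, A, dim):
    """Illustrative sketch: from a batch kernel matrix K (n x n) and an
    adjacency matrix A marking positive (augmentation-related) pairs,
    build a rank-`dim` linear map on kernel features and the resulting
    induced kernel over the batch.

    NOTE: the spectral closed form below is an assumption for
    illustration; the paper's exact contrastive/non-contrastive
    solutions may differ.
    """
    # Symmetrize the positive-pair adjacency.
    A = (A + A.T) / 2.0
    # Operator correlating points related by augmentations in kernel space.
    M = K @ A @ K
    # Top-`dim` eigenvectors give coefficients of the linear map acting
    # on kernel features (np.linalg.eigh returns ascending eigenvalues).
    _, eigvecs = np.linalg.eigh(M)
    W = eigvecs[:, -dim:]  # (n, dim) coefficient matrix
    # Induced kernel between batch points: inner products of the mapped
    # representations, so it is symmetric positive semi-definite by
    # construction.
    Z = K @ W              # (n, dim) batch representations
    K_induced = Z @ Z.T    # (n, n) induced kernel matrix
    return W, K_induced
```

Under this sketch, a new point x would be embedded via its kernel evaluations against the batch, z(x) = W^T [k(x, x_1), ..., k(x, x_n)]^T, so the induced kernel is computable from kernel manipulations alone, as the text describes.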

