ON THE DUALITY BETWEEN CONTRASTIVE AND NON-CONTRASTIVE SELF-SUPERVISED LEARNING

Abstract

Recent approaches in self-supervised learning of image representations can be categorized into different families of methods and, in particular, can be divided into contrastive and non-contrastive approaches. While the differences between the two families have been thoroughly discussed to motivate new approaches, we focus instead on their theoretical similarities. By designing contrastive and covariance-based non-contrastive criteria that can be related algebraically and shown to be equivalent under limited assumptions, we show how close these families can be. We further study popular methods and introduce variations of them, allowing us to relate this theoretical result to current practices and to show the influence (or lack thereof) of design choices on downstream performance. Motivated by our equivalence result, we investigate the low performance of SimCLR and show how it can match VICReg's with careful hyperparameter tuning, improving significantly over known baselines. We also challenge the popular assumption that non-contrastive methods need large output dimensions. Our theoretical and quantitative results suggest that the numerical gaps between contrastive and non-contrastive methods in certain regimes can be closed given better network design choices and hyperparameter tuning. The evidence shows that unifying different SOTA methods is an important direction toward a better understanding of self-supervised learning.

1. INTRODUCTION

Self-supervised learning (SSL) of image representations has shown significant progress in the last few years (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020; Lee et al., 2021b; Caron et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Tomasev et al., 2022; Caron et al., 2021; Chen et al., 2021b; Li et al., 2022a; Zhou et al., 2022a;b; HaoChen et al., 2021), approaching, and sometimes even surpassing, the performance of supervised baselines on many downstream tasks. Most recent approaches are based on the joint-embedding framework with a siamese network architecture (Bromley et al., 1994) and are divided into two main categories: contrastive and non-contrastive methods. Contrastive methods bring together embeddings of different views of the same image while pushing away the embeddings of different images. Non-contrastive methods also attract embeddings of views from the same image but remove the need for explicit negative pairs, either by architectural design (Grill et al., 2020; Chen & He, 2020) or by regularization of the covariance of the embeddings (Zbontar et al., 2021; Bardes et al., 2021; Li et al., 2022b). While contrastive and non-contrastive approaches seem very different and have been described as such (Zbontar et al., 2021; Bardes et al., 2021; Ermolov et al., 2021; Grill et al., 2020), we pro-

* Correspondence to garridoq@meta.com
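To make the two families concrete, the following is a minimal NumPy sketch (not the exact losses studied in this paper) of the two criteria described above: an InfoNCE-style contrastive loss, where matching views are positives and all other samples in the batch act as negatives, and a VICReg-style non-contrastive loss, which attracts matching views while regularizing the variance and covariance of the embeddings instead of using negatives. Function names and the default weights are illustrative assumptions.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE-style) loss sketch: row i of z1 and row i
    of z2 are views of the same image; other rows act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                 # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # diagonal = positives

def vicreg_style(z1, z2, inv_w=25.0, var_w=25.0, cov_w=1.0):
    """Covariance-based non-contrastive (VICReg-style) loss sketch:
    attract views, keep per-dimension variance up, decorrelate dims."""
    n, d = z1.shape
    inv = np.mean((z1 - z2) ** 2)                    # invariance term

    def var_cov(z):
        z = z - z.mean(axis=0)                       # center per dimension
        std = np.sqrt(z.var(axis=0) + 1e-4)
        var = np.mean(np.maximum(0.0, 1.0 - std))    # hinge on the std
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))       # off-diagonal entries
        return var, (off_diag ** 2).sum() / d

    v1, c1 = var_cov(z1)
    v2, c2 = var_cov(z2)
    return inv_w * inv + var_w * (v1 + v2) + cov_w * (c1 + c2)
```

Note how the repulsion enters differently: `info_nce` pushes each embedding away from the other samples in the batch through the softmax denominator, while `vicreg_style` never compares samples directly and instead prevents collapse by penalizing low variance and correlated embedding dimensions.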

