ON THE DUALITY BETWEEN CONTRASTIVE AND NON-CONTRASTIVE SELF-SUPERVISED LEARNING

Abstract

Recent approaches in self-supervised learning of image representations can be categorized into different families of methods and, in particular, can be divided into contrastive and non-contrastive approaches. While differences between the two families have been thoroughly discussed to motivate new approaches, we focus more on the theoretical similarities between them. By designing contrastive and covariance-based non-contrastive criteria that can be related algebraically and shown to be equivalent under limited assumptions, we show how close those families can be. We further study popular methods and introduce variations of them, allowing us to relate this theoretical result to current practices and show the influence (or lack thereof) of design choices on downstream performance. Motivated by our equivalence result, we investigate the low performance of SimCLR and show how it can match VICReg's with careful hyperparameter tuning, improving significantly over known baselines. We also challenge the popular assumption that non-contrastive methods need large output dimensions. Our theoretical and quantitative results suggest that the numerical gaps between contrastive and non-contrastive methods in certain regimes can be closed given better network design choices and hyperparameter tuning. The evidence shows that unifying different SOTA methods is an important direction to build a better understanding of self-supervised learning.

1. INTRODUCTION

Self-supervised learning (SSL) of image representations has shown significant progress in the last few years (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020; Lee et al., 2021b; Caron et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Tomasev et al., 2022; Caron et al., 2021; Chen et al., 2021b; Li et al., 2022a; Zhou et al., 2022a;b; HaoChen et al., 2021), approaching, and sometimes even surpassing, the performance of supervised baselines on many downstream tasks. Most recent approaches are based on the joint-embedding framework with a siamese network architecture (Bromley et al., 1994) and are divided into two main categories: contrastive and non-contrastive methods. Contrastive methods bring together embeddings of different views of the same image while pushing away the embeddings from different images. Non-contrastive methods also attract embeddings of views from the same image but remove the need for explicit negative pairs, either by architectural design (Grill et al., 2020; Chen & He, 2020) or by regularization of the covariance of the embeddings (Zbontar et al., 2021; Bardes et al., 2021; Li et al., 2022b). While contrastive and non-contrastive approaches seem very different and have been described as such (Zbontar et al., 2021; Bardes et al., 2021; Ermolov et al., 2021; Grill et al., 2020), we propose to take a closer look at the similarities between the two, both from a theoretical and an empirical point of view, and show that there exists a close relationship between them. We focus on covariance regularization-based non-contrastive methods (Zbontar et al., 2021; Ermolov et al., 2021; Bardes et al., 2021) and demonstrate that these methods can be seen as contrastive between the dimensions of the embeddings instead of contrastive between the samples.
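The two families described above can be sketched in a few lines of code. This is a minimal illustration, not the exact loss of any one method: the InfoNCE-style term stands in for the sample-contrastive family, and the off-diagonal covariance penalty stands in for the covariance-regularization family; all names, shapes, and the temperature value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4                       # batch size, embedding dimension (illustrative)
za = rng.normal(size=(n, d))      # embeddings of view A of a batch of images
zb = rng.normal(size=(n, d))      # embeddings of view B (distorted versions)

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def infonce(za, zb, temperature=0.1):
    """Sample-contrastive: attract matching pairs, repel other samples."""
    za, zb = l2_normalize(za), l2_normalize(zb)
    logits = za @ zb.T / temperature                  # (n, n) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives on the diagonal

def covariance_penalty(z):
    """Covariance regularization: decorrelate embedding dimensions."""
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (len(z) - 1)                    # (d, d) empirical covariance
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2)                      # push off-diagonal terms to 0

print(infonce(za, zb), covariance_penalty(za))
```

Note the difference in the matrices each loss acts on: InfoNCE works with an (n, n) sample-by-sample similarity matrix, while the covariance penalty works with a (d, d) dimension-by-dimension matrix; this contrast is what the sample- vs dimension-contrastive terminology captures.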
We, therefore, introduce the term dimension-contrastive methods, which we believe is better suited for them, and refer to the original contrastive methods as sample-contrastive methods. To show the similarities between the two, we define contrastive and non-contrastive criteria based on the Frobenius norm of the Gram and covariance matrices of the embeddings, respectively, and show the equivalence between the two under assumptions on the normalization of the embeddings. We then relate popular methods to these criteria, highlighting the links between them and further motivating the use of the sample-contrastive and dimension-contrastive nomenclature. Finally, we introduce variations of an existing dimension-contrastive method (VICReg) and a sample-contrastive one (SimCLR). This allows us to verify this equivalence empirically and improve both VICReg and SimCLR through this lens. Our contributions can be summarized as follows:

• We make a significant effort to unify several SOTA sample-contrastive and dimension-contrastive methods and show that empirical performance gaps can be closed completely. By pinpointing their source, we consolidate our understanding of SSL methods.

• We introduce two criteria that serve as representatives for sample- and dimension-contrastive methods. We show that they are equivalent for doubly normalized embeddings, and then relate popular methods to them, highlighting their theoretical similarities.

• We introduce methods that interpolate between VICReg and SimCLR to study the practical impact of precise components of their loss functions. This allows us to validate our theoretical result empirically by isolating the sample- and dimension-contrastive nature of methods.

• Motivated by the equivalence, we show that advantages attributed to one family can be transferred to the other. We improve SimCLR's performance to match VICReg's, and improve VICReg to make it as robust to embedding dimension as SimCLR.
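The algebraic core of the equivalence between the two criteria can be checked numerically. The sketch below only demonstrates the underlying matrix identity that the Frobenius norms of the Gram matrix ZZᵀ (sample-by-sample) and of Zᵀ Z (dimension-by-dimension) always coincide, since the two matrices share the same nonzero eigenvalues; the paper's precise criteria additionally involve normalization assumptions on the embeddings that are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=(16, 8))   # n=16 embeddings of dimension d=8 (illustrative)

gram = z @ z.T                 # (n, n): sample-by-sample similarities
cov = z.T @ z                  # (d, d): dimension-by-dimension second moments

# ||Z Z^T||_F == ||Z^T Z||_F: both equal the sum of squared singular values of Z, squared.
print(np.linalg.norm(gram), np.linalg.norm(cov))
```

`np.linalg.norm` on a 2-D array defaults to the Frobenius norm, so the two printed values agree up to floating-point error regardless of the shape or content of `z`.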

2. RELATED WORK

Sample-contrastive methods. In self-supervised learning of image representations, contrastive methods pull together embeddings of distorted views of a single image while pushing away embeddings coming from different images. Many works in this direction have recently flourished (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; 2021b; Yeh et al., 2021), most of them using the InfoNCE criterion (Oord et al., 2018), except HaoChen et al. (2021), which uses squared similarities between the samples. Clustering-based methods (Caron et al., 2018; 2020; 2021) can be seen as contrastive between prototypes, or clusters, instead of samples.

Non-contrastive methods. Recently, methods that deviate from contrastive learning have emerged and eliminate the use of negative samples in different ways. Distillation-based methods such as BYOL (Grill et al., 2020), SimSiam (Chen & He, 2020) or DINO (Caron et al., 2021) use architectural tricks inspired by distillation to avoid the collapse problem. Information maximization methods (Bardes et al., 2021; Zbontar et al., 2021; Ermolov et al., 2021; Li et al., 2022b) maximize the informational content of the representations and have also had significant success. They rely on regularizing the empirical covariance matrix of the embeddings so that their informational content is maximized. Our study of dimension-contrastive learning focuses on these covariance-based methods.

Understanding contrastive and non-contrastive learning. Recent works tackle the task of understanding and characterizing methods. The fact that a method like SimSiam does not collapse is studied in Tian et al. (2021). The loss landscape of SimSiam is also compared to SimCLR's in Pokle et al. (2022), which shows that it learns bad minima. In Wang & Isola (2020), the optimal solutions of the InfoNCE criterion are characterized, giving a better understanding of the embedding distributions. A spectral graph point of view is taken in HaoChen et al. (2022; 2021); Shen et al. (2022) to analyze self-supervised learning methods. Practical properties of contrastive methods have been studied in Chen et al. (2021a). In Huang et al. (2021), the Barlow Twins criterion is shown to be related to an upper bound of a sample-contrastive criterion. We go further and exactly quantify the gap between the criteria, which allows us to use the link between methods in practical scenarios.
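The squared-similarity alternative to InfoNCE used by HaoChen et al. (2021), the spectral contrastive loss, can be sketched as follows. This is an illustrative numpy version under assumed batch and dimension sizes, not the authors' implementation: it rewards similarity on positive pairs linearly and penalizes squared similarity on negative pairs.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 4                     # batch size, embedding dimension (illustrative)
za = rng.normal(size=(n, d))    # embeddings of one view of a batch of images
zb = rng.normal(size=(n, d))    # embeddings of the positive (distorted) view

sim = za @ zb.T                            # (n, n) pairwise similarities
pos = np.diag(sim)                         # positive pairs: same image, two views
neg_mask = ~np.eye(n, dtype=bool)          # negative pairs: different images
# Spectral contrastive loss: -2 E[pos similarity] + E[(neg similarity)^2]
loss = -2.0 * pos.mean() + (sim[neg_mask] ** 2).mean()
print(loss)
```

Because the negative term is a squared similarity rather than a log-sum-exp, this loss admits the spectral-graph analysis discussed above, which is harder to carry out for InfoNCE.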

