TOWARDS A UNIFIED THEORETICAL UNDERSTANDING OF NON-CONTRASTIVE LEARNING VIA RANK DIFFERENTIAL MECHANISM

Abstract

Recently, a variety of methods under the name of non-contrastive learning (like BYOL, SimSiam, SwAV, and DINO) have shown that, when equipped with some asymmetric architectural designs, aligning positive pairs alone is sufficient to attain good performance in self-supervised visual learning. Despite some understanding of specific modules (like the predictor in BYOL), there is yet no unified theoretical understanding of how these seemingly different asymmetric designs can all avoid feature collapse, particularly for methods that also work without the predictor (like DINO). In this work, we propose a unified theoretical understanding of existing variants of non-contrastive learning. Our theory, named Rank Differential Mechanism (RDM), shows that all these asymmetric designs create a consistent rank difference in their dual-branch output features. This rank difference provably leads to an improvement of effective dimensionality and alleviates either complete or dimensional feature collapse. Different from previous theories, our RDM theory is applicable to different asymmetric designs (with and without the predictor), and thus can serve as a unified understanding of existing non-contrastive learning methods. Besides, our RDM theory also provides practical guidelines for designing new non-contrastive variants. We show that these variants indeed achieve comparable performance to existing methods on benchmark datasets, and some of them even outperform the baselines. Our code is available at https://github.com/PKU-ML/Rank-Differential-Mechanism.

1. INTRODUCTION

Self-supervised learning of visual representations has undergone rapid progress in recent years, particularly due to the rise of contrastive learning (CL) (Oord et al., 2018; Wang et al., 2021). Canonical contrastive learning methods like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) utilize both positive samples (for feature alignment) and negative samples (for feature uniformity). Surprisingly, researchers have noticed that CL can also work well by aligning positive samples alone, which is referred to as non-contrastive learning. Without the help of negative samples, various techniques have been proposed to prevent feature collapse, for example, stop-gradient, momentum encoders, and predictors (BYOL (Grill et al., 2020), SimSiam (Chen & He, 2021)), Sinkhorn iterations (SwAV (Caron et al., 2020)), and feature centering and sharpening (DINO (Caron et al., 2021)). All of these designs create a certain asymmetry between the online branch (with gradient) and the target branch (without gradient) (Wang et al., 2022a). Empirically, these tricks can successfully alleviate feature collapse and obtain comparable or even superior performance to canonical contrastive learning. Despite this progress, it is still not clear why these different heuristics can reach the same goal.

Some existing works aim to understand specific non-contrastive techniques, mostly focusing on the predictor head proposed by BYOL (Grill et al., 2020). From an empirical side, Chen & He (2021) argue that the predictor helps approximate the expectation over augmentations under simple linear networks, and Wen & Li (2022) obtain optimization guarantees for two-layer nonlinear networks. These theoretical discussions often require strong assumptions on the data distribution (e.g., standard normal (Tian et al., 2021)) and on the augmentations (e.g., random masking (Wen & Li, 2022)).
Besides, their analyses are often problem-specific and hardly extendable to other non-contrastive variants without a predictor, e.g., DINO. Therefore, a natural question arises: are there any basic principles behind these seemingly different techniques?

In this paper, we make the first attempt in this direction by discovering a common mechanism behind these non-contrastive variants. To get a glimpse of it, in Figure 1 we measure the effective rank (Roy & Vetterli, 2007) of four different non-contrastive methods (BYOL, SimSiam, SwAV, and DINO). We find the following phenomena: 1) among different methods, the target branch (orange line) has a consistently higher rank than the online branch (blue line); 2) after the initial warmup stage, the rank of the online branch (blue line) consistently improves along the training process. Inspired by this observation, we propose a new theoretical understanding of non-contrastive methods, dubbed the Rank Differential Mechanism (RDM), which shows that these different techniques essentially behave as a low-pass spectral filter, which is guaranteed to induce the rank difference above and avoid feature collapse along the training. We summarize the contributions of this work as follows:

• Asymmetry matters for feature diversity. In contrast to common beliefs, we show that even a symmetric architecture can provably alleviate complete feature collapse. However, it still suffers from low feature diversity, collapsing to a very low-dimensional subspace. This indicates that the key role of asymmetry is to avoid dimensional feature collapse.

• Asymmetry induces low-pass filters that provably avoid dimensional collapse. Based on theoretical and empirical evidence on real-world data, we point out that the common underlying mechanism of the asymmetric designs in BYOL, SimSiam, SwAV, and DINO is that they behave as low-pass online-branch filters, or equivalently, high-pass target-branch filters.
We further show that the asymmetry-induced low-pass filter provably yields the rank difference (Figure 1) and prevents feature collapse along the training process.

• Principled designs of asymmetry. Following the principle of RDM, we design a series of non-contrastive variants to empirically verify the effectiveness of our theory. For the online encoder, we show that different variants of low-pass filters can also attain fairly good performance. We also design a new kind of target predictor with high-pass filters. Experiments show that SimSiam with our target predictors can outperform DirectPred (Tian et al., 2021) and achieve comparable or even superior performance to the original SimSiam.
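To make the two key quantities above concrete, the following is a minimal NumPy sketch (not the paper's implementation) of the effective rank of Roy & Vetterli (2007), i.e., the exponential of the Shannon entropy of the normalized singular-value distribution, together with an illustrative low-pass spectral filter. The choice g(σ) = σ², which amplifies the dominant singular directions relative to the tail, is our own illustrative example of a low-pass filter, not a design taken from any of the discussed methods:

```python
import numpy as np

def effective_rank(features):
    """Effective rank (Roy & Vetterli, 2007): exp of the entropy
    of the normalized singular-value distribution."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]  # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
Z_target = rng.normal(size=(256, 64))  # stand-in for target-branch features

# Illustrative low-pass spectral filter: g(sigma) = sigma^2 makes the
# spectrum more peaked, attenuating small singular values relatively.
U, s, Vt = np.linalg.svd(Z_target, full_matrices=False)
Z_online = U @ np.diag(s ** 2) @ Vt

# The low-pass-filtered (online-style) features have a lower effective
# rank than the unfiltered (target-style) features.
print(effective_rank(Z_target), effective_rank(Z_online))
```

On this synthetic example, the filtered branch always shows a strictly smaller effective rank, mirroring the online/target rank gap observed in Figure 1; for a perfectly isotropic spectrum (e.g., the identity matrix) the effective rank equals the full rank and the filter would change nothing.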

2. RELATED WORK

Non-contrastive Learning. Among existing methods, BYOL (Grill et al., 2020) is the first to show that the feature collapse caused by aligning positive samples alone can be alleviated with an online predictor and a momentum encoder. Later, SimSiam (Chen & He, 2021) further simplifies this requirement and shows that the online predictor alone is enough. In another thread, SwAV (Caron et al., 2020) applies Sinkhorn-Knopp iterations (Cuturi, 2013) to the target output from an optimal-transport view. DINO (Caron et al., 2021) further simplifies this approach by simply combining feature centering and feature sharpening. Remarkably, all these methods adopt an online-target dual-branch architecture in which gradients from the target branch are detached. Our theory provides a unified understanding of these designs and reveals their common underlying mechanism. Additional comparison with related work is included in Appendix F.



* Equal Contribution. † Corresponding Author: Yisen Wang (yisen.wang@pku.edu.cn).



Figure 1: The effective rank of the normalized outputs of the online and target branches for four different non-contrastive methods (BYOL, SimSiam, SwAV, and DINO) on CIFAR-10.

