TOWARDS A UNIFIED THEORETICAL UNDERSTANDING OF NON-CONTRASTIVE LEARNING VIA RANK DIFFERENTIAL MECHANISM

Abstract

Recently, a variety of methods under the name of non-contrastive learning (such as BYOL, SimSiam, SwAV, and DINO) have shown that, when equipped with certain asymmetric architectural designs, aligning positive pairs alone is sufficient to attain good performance in self-supervised visual learning. Despite some understanding of specific modules (like the predictor in BYOL), there is still no unified theoretical account of how these seemingly different asymmetric designs all avoid feature collapse, particularly considering methods that also work without the predictor (like DINO). In this work, we propose a unified theoretical understanding of existing variants of non-contrastive learning. Our theory, named the Rank Differential Mechanism (RDM), shows that all these asymmetric designs create a consistent rank difference between their dual-branch output features. This rank difference provably improves the effective dimensionality and alleviates either complete or dimensional feature collapse. Different from previous theories, our RDM theory is applicable to different asymmetric designs (with and without the predictor), and thus serves as a unified understanding of existing non-contrastive learning methods. Besides, our RDM theory also provides practical guidelines for designing many new non-contrastive variants. We show that these variants indeed achieve performance comparable to existing methods on benchmark datasets, and some of them even outperform the baselines.

1. INTRODUCTION

Self-supervised learning of visual representations has undergone rapid progress in recent years, particularly with the rise of contrastive learning (CL) (Oord et al., 2018; Wang et al., 2021). Canonical contrastive learning methods like SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) utilize both positive samples (for feature alignment) and negative samples (for feature uniformity). Surprisingly, researchers have noticed that CL can also work well by only aligning positive samples, which is referred to as non-contrastive learning. Without the help of negative samples, various techniques have been proposed to prevent feature collapse, for example, stop-gradient, momentum encoders, and predictors (BYOL (Grill et al., 2020), SimSiam (Chen & He, 2021)), Sinkhorn iterations (SwAV (Caron et al., 2020)), and feature centering and sharpening (DINO (Caron et al., 2021)). All of these designs create a certain degree of asymmetry between the online branch (with gradient) and the target branch (without gradient) (Wang et al., 2022a). Empirically, these tricks successfully alleviate feature collapse and obtain performance comparable or even superior to canonical contrastive learning. Despite this progress, it is still unclear why these different heuristics all reach the same goal. Several existing works aim to understand specific non-contrastive techniques, mostly focusing on the predictor head proposed by BYOL (Grill et al., 2020). From an empirical side, Chen & He (2021) argue that the predictor helps approximate the expectation over augmentations, and Zhang et al. (2022a) take a center-residual decomposition of representations to analyze the collapse. From a theoretical perspective, Tian et al. (2021) analyze the dynamics of predictor weights under simple linear networks, and Wen & Li (2022) obtain optimization guarantees for two-layer nonlinear networks.
These theoretical analyses often require strong assumptions on the data distribution (e.g., standard normal (Tian et al., 2021)) and the augmentations (e.g., random masking (Wen & Li, 2022)). Besides, they are often problem-specific and hardly extend to non-contrastive variants without a predictor, e.g., DINO. A natural question is therefore raised: Are there any basic principles behind these seemingly different techniques? In this paper, we make the first attempt in this direction by uncovering a common mechanism behind these non-contrastive variants. To get a glimpse of it, in Figure 1, we measure the effective rank (Roy & Vetterli, 2007) of four different non-contrastive methods (BYOL, SimSiam, SwAV, and DINO). We observe the following phenomena: 1) across methods, the target branch (orange line) consistently has a higher rank than the online branch (blue line); 2) after the initial warmup stage, the rank of the online branch (blue line) consistently improves along the training process. Inspired by this observation, we propose a new theoretical understanding of non-contrastive methods, dubbed the Rank Differential Mechanism (RDM), where we show that these different techniques essentially behave as a low-pass spectral filter, which is guaranteed to induce the rank difference above and avoid feature collapse along the training. We summarize the contributions of this work as follows:
• Asymmetry matters for feature diversity. In contrast to common beliefs, we show that even a symmetric architecture can provably alleviate complete feature collapse. However, it still suffers from low feature diversity, collapsing to a very low-dimensional subspace. This indicates that the key role of asymmetry is to avoid dimensional feature collapse.
• Asymmetry induces low-pass filters that provably avoid dimensional collapse. Based on theoretical and empirical evidence on real-world data, we point out that the common underlying mechanism of the asymmetric designs in BYOL, SimSiam, SwAV, and DINO is that they behave as low-pass online-branch filters, or equivalently, high-pass target-branch filters. We further show that the asymmetry-induced low-pass filter provably yields the rank difference (Figure 1) and prevents feature collapse along the training process.
• Principled designs of asymmetry. Following the principle of RDM, we design a series of non-contrastive variants to empirically verify the effectiveness of our theory. For the online encoder, we show that different variants of low-pass filters can also attain fairly good performance. We also design a new kind of target predictor with high-pass filters. Experiments show that SimSiam with our target predictors can outperform DirectPred (Tian et al., 2021) and achieve comparable or even superior performance to the original SimSiam.

2. RELATED WORK

Non-contrastive Learning. Among existing methods, BYOL (Grill et al., 2020) is the first to show that aligning positive samples alone can avoid feature collapse when equipped with an online predictor and a momentum encoder. Later, SimSiam (Chen & He, 2021) further simplifies this requirement and shows that the online predictor alone is enough. In another line of work, SwAV (Caron et al., 2020) applies Sinkhorn-Knopp iterations (Cuturi, 2013) to the target output from an optimal transport view. DINO (Caron et al., 2021) further simplifies this approach by combining feature centering and feature sharpening. Remarkably, all these methods adopt an online-target dual-branch architecture where gradients from the target branch are detached. Our theory provides a unified understanding of these designs and reveals their common underlying mechanism. Additional comparison with related work is included in Appendix F. Dimensional Collapse of Self-supervised Representations. Prior to ours, several works also explore the dimensional collapse issue in contrastive learning. Ermolov et al. (2021), Hua et al. (2021), Weng et al. (2022), and Zhang et al. (2022b) propose whitening techniques to alleviate dimensional collapse, similar in spirit to Barlow Twins (Zbontar et al., 2021) with its feature decorrelation regularization. Jing et al. (2022) point out the dimensional collapse of contrastive learning without the projector, and propose DirectCLR as a direct replacement. Instead, our work mainly focuses on understanding the role of asymmetric designs in overcoming dimensional collapse. Theoretical Analysis of Contrastive Learning. Saunshi et al. (2019) first establish downstream guarantees for contrastive learning, which are later gradually refined (Nozawa & Sato, 2021; Ash et al., 2022; Bao et al., 2022). Tosh et al. (2021) and Lee et al. (2021) also propose similar guarantees on downstream tasks.
However, these analyses mostly rely on a conditional independence assumption that is far from practice. Recently, HaoChen et al. (2021) and Wang et al. (2022b) propose an augmentation graph perspective with more practical assumptions, and attribute the generalization ability to the existence of augmentation overlap (which also exists for non-contrastive learning). Wen & Li (2021) analyze the feature dynamics of contrastive learning with shallow ReLU networks.

3. ASYMMETRY IS THE KEY TO ALLEVIATE DIMENSIONAL COLLAPSE

Prior works tend to believe that asymmetric designs are necessary for avoiding complete feature collapse (Zhang et al., 2022a), while we show that a fully symmetric architecture, dubbed SymSimSiam (Symmetric Simple Siamese network), can also avoid complete collapse. Specifically, we simply align the positive pair $(x, x^+)$ with a symmetric alignment loss, $\mathcal{L}_{\rm sym}(f'_\theta) = -\mathbb{E}_{x,x^+} f'_\theta(x)^\top f'_\theta(x^+)$, where we apply feature centering on the output of an encoder $f_\theta$, i.e., $f'_\theta(\cdot) = f_\theta(\cdot) - \mu$. The feature average $\mu = \mathbb{E}_x f_\theta(x)$ can be computed via a mini-batch estimate or an exponential moving average as in Batch Normalization (Ioffe & Szegedy, 2015) (see Algorithm 1). Theorem 1 states that SymSimSiam avoids complete collapse by simultaneously maximizing the feature variance $\mathrm{Var}(f_\theta(x))$.
Theorem 1. When $f_\theta(x)$ is $\ell_2$-normalized, the SymSimSiam objective is equivalent to
$$\mathcal{L}_{\rm sym}(f'_\theta) = \mathcal{L}_{\rm sym}(f_\theta) - \mathrm{Var}(f_\theta(x)) + 1 = -\mathbb{E}_{x,x^+} f_\theta(x)^\top f_\theta(x^+) - \mathbb{E}_x \|f_\theta(x) - \mu\|^2 + 1. \quad (2)$$
As shown empirically in Figures 2(a) & 2(b), compared to a vanilla Siamese network, SymSimSiam indeed alleviates complete collapse: the feature variance is maximized along training and a good linear probing accuracy is achieved. However, the accuracy of SymSimSiam is also clearly lower than that of the asymmetric SimSiam (Figure 2(a)). This indicates that a symmetric design can alleviate complete collapse, but it may not be enough to prevent dimensional feature collapse. Intuitively, features uniformly distributed on a great circle of a unit sphere have maximal variance while being dimensionally collapsed. As further shown in Figure 2(c), SymSimSiam indeed suffers from more severe dimensional collapse compared to SimSiam. With few effective dimensions, the encoder network has limited ability to encode rich semantics (HaoChen et al., 2021).
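Theorem 1 is easy to check numerically. The NumPy sketch below uses random toy features as hypothetical stand-ins for the encoder outputs on positive pairs (not the paper's actual networks), computes the centered symmetric alignment loss with a mini-batch estimate of $\mu$, and verifies the identity $\mathcal{L}_{\rm sym}(f'_\theta) = \mathcal{L}_{\rm sym}(f_\theta) - \mathrm{Var}(f_\theta(x)) + 1$ on $\ell_2$-normalized features.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 512, 8

# Toy stand-ins for encoder outputs on a positive pair (x, x+):
# correlated random features, l2-normalized as Theorem 1 assumes.
f = rng.normal(size=(n, k))
f_pos = f + 0.1 * rng.normal(size=(n, k))
f = f / np.linalg.norm(f, axis=1, keepdims=True)
f_pos = f_pos / np.linalg.norm(f_pos, axis=1, keepdims=True)

# Mini-batch estimate of the feature average mu over both views.
both = np.concatenate([f, f_pos])
mu = both.mean(axis=0)
var = np.mean(np.sum((both - mu) ** 2, axis=1))  # Var(f(x)) = 1 - ||mu||^2 here

loss_sym_centered = -np.mean(np.sum((f - mu) * (f_pos - mu), axis=1))
loss_sym_plain = -np.mean(np.sum(f * f_pos, axis=1))

# Identity of Theorem 1: L_sym(f') = L_sym(f) - Var(f(x)) + 1.
print(loss_sym_centered, loss_sym_plain - var + 1)
```

With $\mu$ estimated over both views, the identity holds exactly at the batch level, which is why a simple running-mean implementation (as in Algorithm 1) suffices in practice.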
[Figure 3 caption: Given $u_i$ as an eigenvector of $C_z$, if it is also an eigenvector of $C_p$, then $C_p u_i = \lambda'_i u_i$ is in the same direction as $u_i$. Thus, we measure the alignment of $u_i$ by computing the cosine similarity between $u_i$ and $C_p u_i$, and take the average over all $u_i$ as the overall eigenspace alignment (details in Appendix B.3).]
The above SymSimSiam experiments show that with a symmetric architecture, we can easily prevent complete collapse but can hardly improve the effective feature dimensionality to overcome dimensional collapse. In contrast, we notice that SimSiam with asymmetric designs can alleviate dimensional collapse and achieve better performance. This reflects the fact that the asymmetry in existing non-contrastive methods is the key to alleviating dimensional collapse, which leads us to the main focus of our paper, i.e., demystifying asymmetric designs.

4. THE RANK DIFFERENCE MECHANISM OF ASYMMETRIC DESIGNS

In Figure 1, we have observed a common mechanism behind non-contrastive learning methods: these asymmetric designs create positive rank differences between the target and online outputs consistently throughout training. Here, we provide a formal analysis of this phenomenon from both theoretical and empirical sides and show how it helps alleviate dimensional feature collapse.
Problem Setup. Given a set of natural training data $\mathcal{X} = \{\bar x \mid \bar x \in \mathbb{R}^d\}$, we draw a pair of positive samples $(x, x^+)$ from independent random augmentations of a natural example $\bar x$ with distribution $\mathcal{A}(\cdot|\bar x)$. Their joint distribution satisfies $\mathcal{P}(x, x^+) = \mathcal{P}(x^+, x) = \mathbb{E}_{\bar x}\, \mathcal{A}(x|\bar x)\mathcal{A}(x^+|\bar x)$. Without loss of generality, we consider a finite sample space $|\mathcal{X}| = n$ (which can be exponentially large) following HaoChen et al. (2021), and denote the collections of online and target outputs as $p, z \in \mathbb{R}^{n\times k}$, whose $x$-th rows are $p_x, z_x$, respectively. We consider a general alignment loss $\mathcal{L}(p) = \mathbb{E}_{x,x^+}\, \ell(p_x, \mathrm{sg}(z_{x^+}))$ to cover different variants of non-contrastive methods. The online and target outputs $p_x, z_x$ are either $\ell_2$-normalized (BYOL and SimSiam) or softmax-normalized (SwAV and DINO). For the loss function $\ell$, BYOL and SimSiam adopt the mean square error (MSE) loss, while SwAV and DINO adopt the cross entropy (CE) loss. $\mathrm{sg}(\cdot)$ denotes stopping the gradients from the operand. For an encoder $f$, we define its feature correlation matrix $C = \mathbb{E}_x f(x)f(x)^\top$, whose spectral decomposition is $C = V\Lambda V^\top$, where $V$ contains unit eigenvectors in its columns and $\Lambda$ is the diagonal matrix with descending eigenvalues $\lambda_1 \ge \cdots \ge \lambda_k \ge 0$.
Measure of Dimensional Collapse. A well-known measure of the effective feature dimensionality is the effective rank (erank) of the feature correlation matrix $C = \mathbb{E}_x f(x)f(x)^\top$ (Roy & Vetterli, 2007). Specifically, $\mathrm{erank}(C) = \exp(H(q))$, where $q = (q_1, \dots, q_k)$ with $q_i = \lambda_i / \sum_j \lambda_j$ are the normalized eigenvalues viewed as a probability distribution, and $H(q) = -\sum_i q_i \log q_i$ is its Shannon entropy. Compared to the canonical rank, the effective rank is real-valued and invariant to feature scaling. A more uniform distribution of eigenvalues yields a larger effective rank, and vice versa. Thus, the effective rank of $C$ is a proper metric for the degree of dimensional feature collapse.
Spectral Filters. In the signal processing literature, a spectral filtering process $\mathcal{G}$ of a signal $f$ applies a scalar function (i.e., a spectral filter) $g: \mathbb{R} \to \mathbb{R}$ element-wise to its eigenvalues in the spectral domain, i.e., $u_x = \mathcal{G}f(x) = V g(\Lambda)V^\top f(x)$, where $G = Vg(\Lambda)V^\top$ is also known as a spectral convolution operator. The filtered signal admits $C_u = \mathbb{E}_x u_x u_x^\top = Vg(\Lambda)^2\Lambda V^\top$. Depending on the property of $g$, a filter can be categorized as low-pass, high-pass, band-pass, or band-stop. Generally speaking, a low-pass filter amplifies large eigenvalues and diminishes small ones (e.g., a monotonically increasing function), and a high-pass filter does the opposite. Many canonical algorithms can be seen as special cases of spectral filtering, e.g., PCA-based image denoising is low-pass filtering.
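Both quantities are simple to compute. Below is a minimal NumPy sketch (with a toy correlation matrix, not the paper's learned features) of the effective rank and of spectral filtering; note how a low-pass filter concentrates the spectrum and lowers the effective rank, the direction that Theorem 3 later formalizes.

```python
import numpy as np

def effective_rank(C):
    """erank(C) = exp(Shannon entropy of the normalized eigenvalues)."""
    lam = np.clip(np.linalg.eigvalsh(C), 0.0, None)
    q = lam / lam.sum()
    q = q[q > 0]
    return float(np.exp(-np.sum(q * np.log(q))))

def spectral_filter(C, g):
    """Correlation of the filtered signal u = V g(L) V^T f:  C_u = V g(L)^2 L V^T."""
    lam, V = np.linalg.eigh(C)
    return (V * (g(lam) ** 2 * lam)) @ V.T

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 16))
C = A @ A.T / 16                      # a toy feature correlation matrix

low_pass = lambda lam: lam            # monotonically increasing => low-pass
C_low = spectral_filter(C, low_pass)

print(effective_rank(C), effective_rank(C_low))
```

The filtered correlation has a strictly smaller effective rank whenever the filter is non-constant on the spectrum, matching the low-pass direction of the theory.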

4.1. ASYMMETRIC DESIGNS BEHAVE AS SPECTRAL FILTERS

First of all, we notice that regardless of the existence of asymmetry, the alignment loss in non-contrastive learning enforces the two-branch output features of positive pairs to be close to each other. Therefore, from a spectral viewpoint, a natural hypothesis is that the online and target features will be aligned into the same eigenspace and only differ slightly in their eigenvalues. We state this hypothesis formally below.
Definition 1 (Eigenspace alignment). Two matrices $A$ and $B$ have aligned eigenspaces if $\exists\, V$ s.t. $A = V\Lambda_a V^\top$ and $B = V\Lambda_b V^\top$, where $V$ is an orthogonal matrix of eigenvectors and $\Lambda_a, \Lambda_b$ are diagonal matrices consisting of non-increasing eigenvalues.
Hypothesis 1. During training, non-contrastive learning aligns the eigenspaces of three correlation matrices of output features: the online correlation $C_p = \mathbb{E}_x p_x p_x^\top$, the target correlation $C_z = \mathbb{E}_x z_x z_x^\top$, and the feature correlation of positive samples $C_+ = \mathbb{E}_{x,x^+} z_x z_{x^+}^\top$.
Next, we validate this hypothesis from both theoretical and empirical aspects. To begin with, we consider a simplified setting adopted in prior work (Tian et al., 2021) for ease of analysis: 1) data isotropy, where the natural data distribution $p(\bar x)$ has zero mean and identity covariance, and the augmentation $\mathcal{A}(x|\bar x)$ has mean $\bar x$ and covariance $\sigma^2 I$; 2) a linear encoder $z_x = f(x) = W_f x$, $W_f \in \mathbb{R}^{k\times d}$; 3) a linear online predictor $p_x = W z_x$, $W \in \mathbb{R}^{k\times k}$. Under this setting, the following lemma shows that for an arbitrary encoder $f$, the eigenspaces of the three correlation matrices indeed align:
Lemma 1. Under the assumptions above (as in Tian et al. (2021)), when the predictor $W^*$ minimizes the alignment loss (Eq. 3), we have $\exists\, V$ s.t. $C_p = V\Lambda_p V^\top$, $C_z = V\Lambda_z V^\top$, $C_+ = V\Lambda_+ V^\top$, where $V$ is an orthogonal matrix and $\Lambda_p, \Lambda_z, \Lambda_+$ are diagonal matrices consisting of descending eigenvalues $\lambda^p_i, \lambda^z_i, \lambda^+_i$, $i = 1, \dots, k$, respectively.
Next, we provide an empirical examination of Hypothesis 1 on real-world data. From Figure 3, we can see that there is a consistently high degree of eigenspace alignment between $C_p$ and $C_z$ among all non-contrastive methods.foot_0 In Appendix E.1, we further verify the alignment w.r.t. $C_+$. Therefore, these methods indeed attain a fairly high degree of eigenspace alignment.
A Spectral Filter View. As a result of eigenspace alignment, the alignment loss essentially works on mitigating the difference in eigenvalues. Therefore, we can take a spectral perspective on non-contrastive methods, where the asymmetric designs are equivalent to an online spectral filter applied to the target output $z_x$, or, equivalently, a target spectral filter applied to the online output $p_x$. As the two cases are equivalent, we mainly take the online filter as an example in the discussion below.
Lemma 2. Denote an online filter function $g: \lambda^z \mapsto \lambda^g$ that satisfies $\lambda^g_i = (\lambda^p_i/\lambda^z_i)^{1/2}$, $i = 1, \dots, k$. We can apply a spectral filtering on $z_x$ with $g$ and get $\tilde p_x = W_g z_x$, $W_g = Vg(\Lambda_z)V^\top$. Then we have $C_{\tilde p} = \mathbb{E}_x \tilde p_x \tilde p_x^\top = C_p$. In other words, $p_x$ and $\tilde p_x$ have the same feature correlation.
This spectral filter view reveals the key difference between symmetric and asymmetric designs in non-contrastive learning. In the symmetric case, the two branches yield almost equal eigenvalues (Figure 2(d)). Thus, the alignment loss quickly diminishes and the representations collapse dimensionally (Figure 2(c)). Instead, in asymmetric designs, the asymmetric components create a difference in eigenvalues such that the target output generally has a higher rank than the online output (Figure 4(a)). Therefore, the alignment loss will not easily diminish (though not necessarily decrease; see Figure 9). Instead, the alignment improves feature diversity in an implicit way, as we will show later.
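The eigenspace-alignment measure described in the Figure 3 caption (average cosine similarity between each eigenvector $u_i$ of $C_z$ and $C_p u_i$) can be sketched directly. The toy matrices below share an eigenbasis by construction, so the measure returns its maximal value; real features only approximate this.

```python
import numpy as np

def eigenspace_alignment(C_p, C_z):
    """Average cosine similarity between each eigenvector u_i of C_z and C_p u_i.
    Equals 1 exactly when every eigenvector of C_z is also an eigenvector of C_p."""
    _, U = np.linalg.eigh(C_z)                   # columns are unit eigenvectors u_i
    CU = C_p @ U
    cos = np.abs(np.sum(U * CU, axis=0)) / np.linalg.norm(CU, axis=0)
    return float(cos.mean())

rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(8, 8)))     # a shared orthogonal eigenbasis
lam_z = np.sort(rng.uniform(0.1, 1.0, size=8))[::-1]
C_z = (V * lam_z) @ V.T
C_p = (V * lam_z ** 2) @ V.T                     # same eigenvectors, filtered eigenvalues

print(eigenspace_alignment(C_p, C_z))            # -> 1.0 up to float error
```

This mirrors the description in the Figure 3 caption; the measure drops below 1 whenever the two eigenspaces rotate apart.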

4.2. LOW-PASS PROPERTY OF ASYMMETRY-INDUCED SPECTRAL FILTERS

The discussion above reveals that a specific asymmetric design behaves as a spectral filter when applied to non-contrastive learning. To gain insight towards a unified understanding, we further investigate whether there is a common pattern behind the filters of different asymmetric designs. To this end, we calculate and plot the corresponding online filter $g(\lambda^z_i) = (\lambda^p_i/\lambda^z_i)^{1/2}$ of each non-contrastive method. From Figure 4(b), we find that the spectral filters indeed look very similar, particularly in the sense that all filter functions are roughly monotonically increasing w.r.t. $\lambda^z$. This kind of filter is usually called a low-pass filter because it (relatively) enlarges low-frequency components (large eigenvalues) and shrinks high-frequency components (small eigenvalues). Based on this empirical finding, we propose the following hypothesis on the low-pass nature of asymmetry.
[Figure 4 caption: Eigenvalues and spectral filters of each method on CIFAR-10. Top eigenvalues (whose sum is larger than 99.99% of the total sum) are shown: 128 for BYOL and 512 for the others. Also shown are the target filters using centering and/or sharpening with different target temperatures (t) in DINO (the online temperature is set to 0.1).]
Hypothesis 2. Asymmetric modules in non-contrastive learning behave as low-pass online filters. Formally, the corresponding spectral filter $g(\lambda^z_i) = (\lambda^p_i/\lambda^z_i)^{1/2}$ is monotonically increasing.foot_1
We note that we are not suggesting that every asymmetric design behaves as a low-pass filter, as one could easily apply a high-pass filter to the online output deliberately (which, as we have observed, will likely fail). Therefore, our hypothesis above only applies to asymmetric designs that work well in practice. In the discussion below, we further provide theoretical and empirical investigations of why existing asymmetric modules have a low-pass filtering effect.
Case I. Online Predictor.
One popular kind of non-contrastive method, including BYOL and SimSiam, utilizes a learnable online predictor $g_\theta: \mathbb{R}^k \to \mathbb{R}^k$ for architectural asymmetry. One may wonder why such a learnable predictor would behave as a low-pass filter (Lemma 2). Here, we provide some theoretical insights in the following theorem.
Theorem 2. Under Hypothesis 1, assuming the invertibility of $C_z$, the optimal predictor is given by $W^* = C_+ C_z^{-1} = V\Omega V^\top$, where $\omega_i = \Omega_{ii} = \lambda^+_i/\lambda^z_i \in [0, 1]$, $i \in [k]$. Therefore, the spectral property of the learned filter is determined by the filter function $\lambda^+_i/\lambda^z_i$.
Theorem 2 shows that the predictor essentially learns to predict the correlation $C_+$ between positive samples from that of a single augmented view, $C_z$, i.e., eliminating the augmentation noise and predicting the common features. The following lemma further reveals that 1) the correlation between positive samples is equivalent to the correlation of their underlying natural data $\bar x$, and 2) the difference between $C_+$ and $C_z$ equals the conditional covariance induced by the data augmentation $\mathcal{A}(x|\bar x)$.
Lemma 3. The following equalities hold: 1. $C_+ = \bar C := \mathbb{E}_{\bar x}\, \bar z_{\bar x} \bar z_{\bar x}^\top$, where $\bar z_{\bar x} = \mathbb{E}_{x|\bar x}\, z_x$; 2. $C_z = \bar C + V_{x|\bar x}$, where $V_{x|\bar x} = \mathbb{E}_{\bar x} \mathbb{E}_{x|\bar x} (z_x - \bar z_{\bar x})(z_x - \bar z_{\bar x})^\top$ is the conditional covariance.
In practice, data augmentations mainly cause high-frequency noise in the feature space; therefore, the denoising predictor behaves as a low-pass filter. Indeed, Figures 5(a) and 5(b) empirically show that the filter derived from our theory, $\lambda^+_i/\lambda^z_i$, aligns well with the actual learned predictor in Figure 4(b) as a low-pass filter.
Case II. Target Transformation. Another kind of non-contrastive method applies hand-crafted (instead of learned) transformations in the target branch, such as the Sinkhorn-Knopp (SK) iteration in SwAV (Caron et al., 2020) and the centering-sharpening operators in DINO (Caron et al., 2021).
In this case, we further study how these transformations behave as high-pass filters applied to the target output (see the footnote of Hypothesis 2). Since it is generally hard to analyze these spatial transformations in the spectral domain, we empirically study the role of each transformation on the resulting filter. Figure 5(c) shows that SK iterations indeed act like high-pass filters, and one iteration is enough. This explains why a few SK iterations already work well in SwAV. As for DINO, we notice that the centering operation alone is not enough to produce a high-pass filter, which agrees with the empirical results in DINO. Meanwhile, we notice that in order to obtain a high-pass (monotonically decreasing) filter, it is necessary for DINO to apply a target temperature smaller than the online one (< 0.1), which is exactly the feature sharpening technique adopted in DINO (Caron et al., 2021). These facts show that our theory aligns well with empirical results in non-contrastive learning.
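For concreteness, the Sinkhorn-Knopp target transformation of SwAV alternates row and column normalizations of the exponentiated score matrix (Cuturi, 2013). The sketch below is a simplified single-process version with an arbitrarily chosen temperature and toy scores, not SwAV's exact implementation.

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp iterations: turn exp(scores/eps) into a balanced soft
    assignment (prototype marginals ~1/K, sample marginals ~1/B)."""
    Q = np.exp(scores / eps).T                    # K x B
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True); Q /= K   # normalize prototypes (rows)
        Q /= Q.sum(axis=0, keepdims=True); Q /= B   # normalize samples (columns)
    return (Q * B).T                              # B x K; each row a distribution

rng = np.random.default_rng(0)
scores = rng.normal(size=(32, 8))                 # B=32 samples, K=8 prototypes
Q = sinkhorn(scores)
print(Q.sum(axis=1)[:3])                          # rows sum to 1
```

Relative to a raw softmax, the balancing pushes assignment mass away from dominant prototypes, which is consistent with the equalizing, high-pass effect discussed above.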

4.3. ASYMMETRY-INDUCED LOW-PASS FILTERS SAVE NON-CONTRASTIVE LEARNING

In the discussion above, we observed a common pattern in existing asymmetric designs: their corresponding spectral filters are all low-pass. Here, we further show that this property is essential: it provably saves non-contrastive learning from the risk of feature collapse by producing the rank difference between the two outputs and alleviating dimensional collapse during training. First, let us look at its effect on the effective rank of the two-branch output features. In the theorem below, we show that low-pass online filters are guaranteed to produce a consistently higher effective rank for the target features than for the online features, as shown in Figure 1.
Theorem 3. If $g(\lambda^z_i) = (\lambda^p_i/\lambda^z_i)^{1/2}$ is monotonically increasing, or equivalently, $h(\lambda^p_i) = (\lambda^z_i/\lambda^p_i)^{1/2}$ is monotonically decreasing, we have $\mathrm{erank}(C_p) \le \mathrm{erank}(C_z)$. Further, as long as $g$ or $h$ is non-constant, the inequality holds strictly: $\mathrm{erank}(C_p) < \mathrm{erank}(C_z)$.
Meanwhile, we also observe that for each output, its own effective rank is successfully elevated along this process. This is not a coincidence. Below, we theoretically show how the rank difference alleviates dimensional collapse. As the training dynamics of deep neural networks are generally hard to analyze, prior works (Tian et al., 2021; Wen & Li, 2021) mainly deal with linear or shallow networks under strong assumptions on the data distribution, which could be far from practice. As overparameterized deep neural networks are very expressive, we instead adopt the unconstrained feature setting (Mixon et al., 2022; HaoChen et al., 2021) and consider gradient descent directly in the feature space $\mathbb{R}^k$. Taking the MSE loss $\ell(u, v) = \frac{1}{2}\|u - v\|^2$ as an example, the following theorem shows that the rank difference indeed helps improve the effective rank of the online output $p$.
Theorem 4.
Under Hypothesis 1, when we apply an online spectral filter $p_x = W z_x$ (Lemma 2), gradient descent with step size $0 < \alpha < 1$ gives the following update at the $t$-th step:
$$\lambda^p_{i,t+1} = \lambda^p_{i,t}\Big[(1-\alpha)^2 + \alpha^2 h^2(\lambda^p_{i,t}) + 2\alpha(1-\alpha)\, h(\lambda^p_{i,t})\, \frac{\lambda^+_{i,t}}{\lambda^z_{i,t}}\Big], \quad i = 1, \dots, k,$$
where $h(\lambda) = 1/g(\lambda)$ is a high-pass filter because $g(\lambda)$ is a low-pass filter (Hypothesis 2). Then, the update ratio $\lambda^p_{i,t+1}/\lambda^p_{i,t}$ corresponds to a high-pass filter under either of the following two conditions:
1. the learned encoder is nearly optimal, i.e., $W \approx W^*$ in Theorem 2;
2. $\lambda^+_{i,t} \approx \lambda^z_{i,t}$, which naturally holds under good positive alignment, i.e., $z_x \approx z_{x^+}$.
Then, according to Theorem 3, we have $\mathrm{erank}(C_p^{(t+1)}) > \mathrm{erank}(C_p^{(t)})$. In other words, the effective rank of the online output keeps improving after each gradient descent step. Intuitively, the improvement of the effective rank is a natural consequence of the rank difference. As the online output has a lower effective rank than the target output, optimizing its alignment loss w.r.t. the gradient-detached target $\mathrm{sg}(z_x)$ forces the online output $p_x$ to improve its effective rank in order to match the target output $z_x$. In this way, the rank difference becomes a ladder (created by the asymmetric designs) for non-contrastive methods to gradually improve their effective feature dimensionality and eventually get rid of dimensional feature collapse.
Note on Stop Gradient. Our analysis above also reveals the importance of the stop-gradient operation. In particular, when gradients from the target branch are not detached, minimizing the alignment loss can also be fulfilled by simply pulling down the rank of the target output in order to match the rank of the two outputs. In this case, the feature rank would never improve without stop gradient.
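Theorem 4's update can be simulated directly in the spectrum. The sketch below picks an arbitrary non-uniform toy spectrum and an arbitrary monotonically decreasing $h$ (both hypothetical choices, not from the paper), applies the update under condition 2 ($\lambda^+ \approx \lambda^z$, so the cross term uses $\lambda^+/\lambda^z \approx 1$), and tracks the effective rank.

```python
import numpy as np

def effective_rank(lam):
    q = lam / lam.sum()
    return float(np.exp(-np.sum(q * np.log(q))))

lam = np.array([8.0, 4.0, 1.0, 0.25, 0.05])   # toy online spectrum lambda^p
h = lambda x: x ** -0.25                       # an arbitrary high-pass (decreasing) filter
alpha = 0.3                                    # step size

ranks = [effective_rank(lam)]
for _ in range(10):
    # Theorem 4 update with lambda^+ / lambda^z ~= 1 (condition 2):
    # lam <- lam * [(1-a)^2 + a^2 h(lam)^2 + 2a(1-a) h(lam)] = lam * ((1-a) + a*h(lam))^2
    lam = lam * ((1 - alpha) + alpha * h(lam)) ** 2
    ranks.append(effective_rank(lam))

print(ranks[0], ranks[-1])   # the effective rank increases at every step
```

The multiplier is larger for small eigenvalues, so the normalized spectrum flattens and the effective rank rises monotonically, as Theorem 4 predicts.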

5. PRINCIPLED ASYMMETRIC DESIGNS BASED ON RDM

The discussions above show that the rank differential mechanism provides a unified theoretical understanding of different non-contrastive methods. Besides, it also provides a general recipe for designing new non-contrastive variants. As Theorem 3 points out, the key requirement is that the asymmetry can produce a low-pass filter on the online branch, or equivalently, a high-pass filter on the target branch. Below, we propose some new variants directly following this principle.

5.1. VARIANTS OF ONLINE LOW-PASS FILTERS

Besides the learnable predictor (Section 4.2), we can also directly design online predictors with fixed low-pass filters. For numerical stability, we adopt the singular value decomposition (SVD) of the output to compute the eigenvalues and the eigenspace, e.g., $z = U\Sigma_z V^\top$ with singular values $\sigma^z_i$. As $\lambda^z_i = (\sigma^z_i)^2$, a filter that is monotone in $\lambda$ is also monotone in $\sigma$, and vice versa. Specifically, for an online encoder $f: \mathcal{X} \to \mathbb{R}^k$, we assign $W = Vg(\Sigma_z)V^\top$, where $g(\Sigma_z)_{ii} = g(\sigma^z_i)$ is a low-pass filter that is monotonically increasing w.r.t. $\sigma$. We note that DirectPred proposed by Tian et al. (2021) is a special case with $g_0(\sigma) = \sigma$. Additionally, we consider three variants: 1) $g_1(\sigma) = \log(\sigma)$; 2) $g_2(\sigma) = \log(\sigma + 1)$; 3) $g_3(\sigma) = \log(\sigma^2 + 1)$. These new variants are low-pass filters because they are monotonically increasing in $\sigma \ge 0$. We evaluate their performance on four datasets: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), ImageNet-100, and ImageNet-1k (Deng et al., 2009). We use ResNet-18 (He et al., 2016) as the backbone encoder for CIFAR-10, CIFAR-100, and ImageNet-100, and adopt ResNet-50 for ImageNet-1k following standard practice. The projector is a three-layer MLP with BN (Ioffe & Szegedy, 2015) applied to all layers. We adopt the linear probing task for evaluating the learned representations. More details are included in Appendix B.2.
[Table 1: top-1/top-5 linear probing accuracy of SimSiam (learnable online filter), our online low-pass filter $g(\sigma) = \log(1+\sigma)$, and our target high-pass filter $g(\sigma) = \sigma^{-0.3}$; one recovered entry pair: 67.97 / 88.17.]
Table 1 shows that our designs of online predictors with different low-pass filters work well in practice, and achieve comparable performance to SimSiam with a learnable predictor. In particular, they also significantly outperform SymSimSiam (Figure 2(a)), showing that rank differences indeed help alleviate dimensional collapse (Figure 6).
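A fixed low-pass online predictor is a few lines of NumPy: an SVD of a batch of target outputs gives the singular values $\sigma_i$ and the eigenbasis $V$, and $W = V g(\Sigma_z) V^\top$. The batch below is a toy stand-in for real target outputs; $g_0$ corresponds to DirectPred as noted in the text.

```python
import numpy as np

def spectral_predictor(z, g):
    """Fixed predictor W = V g(Sigma_z) V^T from the SVD of target outputs z (n x k)."""
    _, sigma, Vt = np.linalg.svd(z, full_matrices=False)
    return (Vt.T * g(sigma)) @ Vt

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 16)) @ np.diag(np.linspace(2.0, 0.1, 16))  # anisotropic toy outputs

g0 = lambda s: s                    # DirectPred (Tian et al., 2021)
g2 = lambda s: np.log1p(s)          # g2: log(sigma + 1)
g3 = lambda s: np.log1p(s ** 2)     # g3: log(sigma^2 + 1)

W = spectral_predictor(z, g2)       # pick one low-pass variant
p = z @ W.T                         # batched online predictions p_x = W z_x
```

Because $g$ is monotonically increasing, the predicted features concentrate more energy in the leading directions than $z$, i.e., $\mathrm{erank}(C_p) < \mathrm{erank}(C_z)$, creating the rank difference that drives training.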

5.2. VARIANTS OF TARGET HIGH-PASS FILTERS

Conversely, we can also apply a high-pass filter on the target branch using a target predictor. Compared to the online predictors above, target predictors have an additional advantage: because of the stop gradient, we do not need to backpropagate through the SVD, which reduces time overhead (Table 5). Specifically, we consider polynomial high-pass target filters $h(\sigma) = \sigma^p$ with $-1 \le p < 0$, which are all monotonically decreasing (see Algorithm 2). We further evaluate the target filters following the same protocols as above. As shown in Table 2, our high-pass target filters often outperform SimSiam by a large margin with a relatively short training time. In particular, on CIFAR-10, the filter $h(\sigma) = \sigma^{-0.5}$ improves the baseline by 5.75% with 100 epochs, and the filter $h(\sigma) = \sigma^{-1}$ improves the baseline by 2.77% with 200 epochs. Notably, on CIFAR-100, our methods outperform the baseline by a large margin (8.07%, 3.18%, and 0.81% improvements with 100, 200, and 400 epochs, respectively). Additional results based on the BYOL framework can be found in Appendix C.
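Symmetrically, the polynomial target filter rescales each spectral component of the gradient-detached target output by $\sigma^p$; since no gradient flows through it, the SVD needs no backward pass. A NumPy sketch on a toy batch (the actual method operates on the target network's outputs):

```python
import numpy as np

def effective_rank(C):
    lam = np.clip(np.linalg.eigvalsh(C), 1e-12, None)
    q = lam / lam.sum()
    return float(np.exp(-np.sum(q * np.log(q))))

def highpass_target(z, p=-0.5):
    """Apply the high-pass filter h(sigma) = sigma**p (-1 <= p < 0) to targets z."""
    U, sigma, Vt = np.linalg.svd(z, full_matrices=False)
    return (U * sigma ** (1 + p)) @ Vt    # each singular value sigma -> sigma * sigma**p

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 16)) @ np.diag(np.linspace(2.0, 0.1, 16))
z_t = highpass_target(z, p=-0.5)

C_z = z.T @ z / len(z)
C_t = z_t.T @ z_t / len(z_t)
print(effective_rank(C_z), effective_rank(C_t))   # the filtered target has higher erank
```

Flattening the target spectrum raises its effective rank above the online branch's, which is exactly the rank difference that RDM requires.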

6. CONCLUSION

In this paper, we presented a unified theoretical understanding of non-contrastive learning via the rank differential mechanism. In particular, we showed that existing non-contrastive methods all produce a consistent rank difference between the online and target outputs. Digging deeper into this phenomenon, we theoretically proved that low-pass online filters yield such a rank difference and improve the effective feature dimensionality along training. Meanwhile, we provided theoretical and empirical insights into how existing asymmetric designs produce low-pass filters. Finally, following the principle of our theory, we designed a series of new online and target filters and showed that they achieve performance comparable or even superior to existing asymmetric designs.

A OMITTED PROOFS

In this section, we present proofs for all lemmas and theorems in the main paper.

A.1 PROOF OF THEOREM 1

Proof. Since the output of $f_\theta$ is $\ell_2$-normalized and $\mathbb{E}_x f(x) = \mathbb{E}_{x^+} f(x^+) = \mu$, we have
$$\mathcal{L}_{\mathrm{sym}}(\theta) = -\mathbb{E}_{x,x^+} f'_\theta(x)^\top f'_\theta(x^+) = -\mathbb{E}_{x,x^+} (f_\theta(x)-\mu)^\top (f_\theta(x^+)-\mu) = -\mathbb{E}_{x,x^+} f_\theta(x)^\top f_\theta(x^+) + \|\mu\|^2,$$
and
$$\mathrm{Var}(f(x)) = \mathbb{E}_x \|f(x)\|^2 - \|\mu\|^2 = 1 - \|\mu\|^2.$$
Thus, we conclude that
$$\mathcal{L}_{\mathrm{sym}}(\theta) = -\mathbb{E}_{x,x^+} f_\theta(x)^\top f_\theta(x^+) - \mathrm{Var}(f(x)) + 1.$$

A.2 PROOF OF LEMMA 1

Proof. Let $z_x = W_f x$ and $z_{x^+} = W_f x^+$. The loss function is
$$\begin{aligned}
\mathcal{L}(W_f, W) &= \mathbb{E}_{x,x^+} \tfrac{1}{2}\left\|W z_x - \mathrm{sg}(z_{x^+})\right\|^2 \\
&= \mathbb{E}_{x,x^+} \tfrac{1}{2}\left( z_x^\top W^\top W z_x - 2\,\mathrm{sg}(z_{x^+})^\top W z_x + \mathrm{sg}(z_{x^+})^\top \mathrm{sg}(z_{x^+}) \right) \\
&= \tfrac{1}{2}\left[ \mathrm{Tr}\!\left(W^\top W\, \mathbb{E}_x z_x z_x^\top\right) - 2\,\mathrm{Tr}\!\left(W\, \mathbb{E}_{x,x^+} z_x z_{x^+}^\top\right) + \mathrm{Tr}\!\left(\mathbb{E}_{x^+} z_{x^+} z_{x^+}^\top\right) \right].
\end{aligned}$$
Notice that
$$C_z = \mathbb{E}_x z_x z_x^\top = \mathbb{E}_{x^+} z_{x^+} z_{x^+}^\top = \mathbb{E}_{\bar{x}}\, \mathbb{E}_{x|\bar{x}}\, W_f x (W_f x)^\top = W_f \left(\mathbb{E}_{\bar{x}}\, \bar{x}\bar{x}^\top + \sigma^2 I\right) W_f^\top = (1+\sigma^2)\, W_f W_f^\top,$$
and
$$C_+ = \mathbb{E}_{x,x^+} z_x z_{x^+}^\top = \mathbb{E}_{\bar{x}} \left(\mathbb{E}_{x|\bar{x}}\, W_f x\right)\left(\mathbb{E}_{x^+|\bar{x}}\, W_f x^+\right)^\top = W_f\, \mathbb{E}_{\bar{x}}\, \bar{x}\bar{x}^\top\, W_f^\top = W_f W_f^\top.$$
Hence,
$$\mathcal{L}(W_f, W) = \tfrac{1}{2}\left[(1+\sigma^2)\,\mathrm{Tr}\!\left(W^\top W W_f W_f^\top\right) - 2\,\mathrm{Tr}\!\left(W W_f W_f^\top\right) + (1+\sigma^2)\,\mathrm{Tr}\!\left(W_f W_f^\top\right)\right].$$
Taking the partial derivative with respect to $W$, we get
$$\frac{\partial \mathcal{L}(W_f, W)}{\partial W} = (1+\sigma^2)\, W W_f W_f^\top - W_f W_f^\top.$$
With additional weight decay, we have
$$\frac{\partial \mathcal{L}(W_f, W)}{\partial W} = (1+\sigma^2)\, W W_f W_f^\top - W_f W_f^\top + \eta W,$$
where $\eta > 0$ is the coefficient of weight decay. Suppose the spectral decomposition of $W_f W_f^\top$ is $W_f W_f^\top = V \Lambda V^\top$, where $V$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix of descending eigenvalues $\lambda_1, \dots, \lambda_k$. Therefore,
$$C_z = (1+\sigma^2)\, W_f W_f^\top = V (1+\sigma^2)\Lambda V^\top = V \Lambda_z V^\top, \qquad C_+ = W_f W_f^\top = V \Lambda V^\top.$$
Setting $\partial \mathcal{L}(W)/\partial W = 0$, we get
$$W^* = W_f W_f^\top \left((1+\sigma^2)\, W_f W_f^\top + \eta I\right)^{-1} = V \Lambda_w V^\top,$$
where $\Lambda_w = \mathrm{diag}\{\lambda_1/((1+\sigma^2)\lambda_1 + \eta), \dots, \lambda_k/((1+\sigma^2)\lambda_k + \eta)\}$. It follows that
$$C_p = \mathbb{E}_x\, p_x p_x^\top = W^*\, \mathbb{E}_x z_x z_x^\top\, W^{*\top} = W^* C_z W^{*\top} = V \Lambda_w^2 \Lambda_z V^\top = V \Lambda_p V^\top.$$
Hence, $C_p$, $C_z$, $C_+$ share the same eigenspace.
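As a sanity check, the closed-form solution above can be verified numerically: $C_p$, $C_z$, $C_+$ are all functions of $W_f W_f^\top$, so they commute, which for symmetric matrices is equivalent to sharing an eigenspace. The sketch below uses illustrative dimensions and illustrative values of $\sigma^2$ and $\eta$, not values from the paper.

```python
# Numerical sanity check for Lemma 1: the closed-form optimal predictor
# W* shares its eigenspace with C_z and C_+ (sizes, sigma^2, eta are
# illustrative choices).
import numpy as np

rng = np.random.default_rng(0)
k, d = 8, 32
W_f = rng.normal(size=(k, d))
sigma2, eta = 0.25, 0.1

M = W_f @ W_f.T                       # W_f W_f^T
C_z = (1 + sigma2) * M                # target correlation matrix
C_plus = M                            # cross-correlation matrix
W_star = M @ np.linalg.inv((1 + sigma2) * M + eta * np.eye(k))
C_p = W_star @ C_z @ W_star.T         # online (predicted) correlation matrix

# Symmetric matrices share an eigenspace iff they commute.
assert np.allclose(C_p @ C_z, C_z @ C_p)
assert np.allclose(C_p @ C_plus, C_plus @ C_p)
```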

A.3 PROOF OF LEMMA 2

Proof. By the definition $\hat{p}_x = W_g z_x$, we know that
$$C_{\hat p} = \mathbb{E}_x\, \hat p_x \hat p_x^\top = \mathbb{E}_x\, W_g z_x z_x^\top W_g^\top = W_g C_z W_g^\top.$$
Since $W_g = V g(\Lambda_z) V^\top$, $C_z = V\Lambda_z V^\top$, $C_p = V\Lambda_p V^\top$ and $g(\lambda^z_i) = \sqrt{\lambda^p_i/\lambda^z_i}$, we have
$$C_{\hat p} = V g(\Lambda_z)V^\top\, V \Lambda_z V^\top\, V g(\Lambda_z)V^\top = V g(\Lambda_z)\Lambda_z g(\Lambda_z)V^\top = V\Lambda_p V^\top = C_p.$$

A.4 PROOF OF LEMMA 3

Proof. For the first equality, we have
$$C_+ = \mathbb{E}_{x,x^+} z_x z_{x^+}^\top = \mathbb{E}_{\bar x}\,\mathbb{E}_{x,x^+|\bar x}\, z_x z_{x^+}^\top = \mathbb{E}_{\bar x}\left(\mathbb{E}_{x|\bar x}\, z_x\right)\left(\mathbb{E}_{x^+|\bar x}\, z_{x^+}\right)^\top = \mathbb{E}_{\bar x}\, \bar z_{\bar x}\bar z_{\bar x}^\top = \bar C.$$
For the second equality,
$$\begin{aligned}
\bar C + V_{x|\bar x} &= \mathbb{E}_{\bar x}\, \bar z_{\bar x}\bar z_{\bar x}^\top + \mathbb{E}_{\bar x}\,\mathbb{E}_{x|\bar x}\left(z_x - \bar z_{\bar x}\right)\left(z_x - \bar z_{\bar x}\right)^\top \\
&= \mathbb{E}_{\bar x}\, \bar z_{\bar x}\bar z_{\bar x}^\top + \mathbb{E}_{\bar x}\,\mathbb{E}_{x|\bar x}\left( z_x z_x^\top - z_x \bar z_{\bar x}^\top - \bar z_{\bar x} z_x^\top + \bar z_{\bar x}\bar z_{\bar x}^\top \right) \\
&= \mathbb{E}_{\bar x}\, \bar z_{\bar x}\bar z_{\bar x}^\top + \mathbb{E}_{\bar x}\,\mathbb{E}_{x|\bar x}\, z_x z_x^\top - \mathbb{E}_{\bar x}\, \bar z_{\bar x}\bar z_{\bar x}^\top \\
&= \mathbb{E}_x\, z_x z_x^\top = C_z.
\end{aligned}$$

A.5 PROOF OF THEOREM 2

Proof. Similar to the derivation in the proof of Lemma 1, we have
$$\mathcal{L}(W) = \mathbb{E}_{x,x^+}\tfrac{1}{2}\left\|Wz_x - \mathrm{sg}(z_{x^+})\right\|^2 = \tfrac{1}{2}\left[\mathrm{Tr}\!\left(W^\top W\,\mathbb{E}_x z_x z_x^\top\right) - 2\,\mathrm{Tr}\!\left(W\,\mathbb{E}_{x,x^+} z_x z_{x^+}^\top\right) + \mathrm{Tr}\!\left(\mathbb{E}_{x^+} z_{x^+}z_{x^+}^\top\right)\right].$$
Notice that $C_z = \mathbb{E}_x z_x z_x^\top = \mathbb{E}_{x^+} z_{x^+}z_{x^+}^\top$ and
$$\mathbb{E}_{x,x^+} z_x z_{x^+}^\top = \mathbb{E}_{\bar x}\left(\mathbb{E}_{x|\bar x}\, z_x\right)\left(\mathbb{E}_{x^+|\bar x}\, z_{x^+}\right)^\top = \mathbb{E}_{\bar x}\, \bar z_{\bar x}\bar z_{\bar x}^\top = \bar C.$$
Hence,
$$\mathcal{L}(W) = \tfrac{1}{2}\left[\mathrm{Tr}\!\left(W^\top W C_z\right) - 2\,\mathrm{Tr}\!\left(W \bar C\right) + \mathrm{Tr}(C_z)\right].$$
Taking the partial derivative with respect to $W$, we get
$$\frac{\partial \mathcal{L}(W)}{\partial W} = W C_z - \bar C.$$
Setting $\partial\mathcal{L}(W)/\partial W = 0$, we have $W^* = \bar C C_z^{-1}$. Since $C_z$ and $\bar C$ have the aligned eigenspace $V$, we have
$$V\bar\Lambda V^\top + V_{x|\bar x} = V\Lambda_z V^\top \;\Longrightarrow\; V_{x|\bar x} = V\left(\Lambda_z - \bar\Lambda\right)V^\top,$$
which implies that the corresponding eigenvalues satisfy $\varepsilon_i = \lambda^z_i - \bar\lambda_i \ge 0$, where $\bar\lambda_i, \varepsilon_i$ denote the $i$-th eigenvalues of $\bar C$ and $V_{x|\bar x}$, respectively. And
$$\bar C C_z^{-1} = V\bar\Lambda V^\top\left(V\Lambda_z V^\top\right)^{-1} = V\bar\Lambda\Lambda_z^{-1}V^\top = V\Omega V^\top,$$
where $\Omega$ is a diagonal matrix and $\Omega_{ii} = \bar\lambda_i/\lambda^z_i \in [0,1]$, $i=1,2,\dots,k$.

A.6 PROOF OF THEOREM 3

Before the proof of Theorem 3, we first introduce the following useful lemma.

Lemma 4. Assume that $\sum_{i=1}^k q_i = 1$ and $q_1 \ge q_2 \ge \cdots \ge q_k > 0$. Then for any $1 \le i < j \le k$ and any $\Delta \in (0, q_j)$, it holds that
$$H(q_1, q_2, \dots, q_k) > H(q_1, \dots, q_i + \Delta, \dots, q_j - \Delta, \dots, q_k).$$

Proof. We first note that
$$H(q_1, \dots, q_k) > H(q_1, \dots, q_i+\Delta, \dots, q_j-\Delta, \dots, q_k) \iff -q_i\log q_i - q_j\log q_j > -(q_i+\Delta)\log(q_i+\Delta) - (q_j-\Delta)\log(q_j-\Delta). \quad (50)$$
Define $f(x) = -(q_i+x)\log(q_i+x) - (q_j-x)\log(q_j-x)$. Its first-order derivative satisfies
$$f'(x) = -\log(q_i+x) + \log(q_j-x) < 0, \quad \forall x \in (0, q_j),$$
since $q_i + x > q_j - x$. Hence, Equation (50) holds. This completes the proof of the lemma.

Proof of Theorem 3. Let $q^z_i = \lambda^z_i/\sum_{l=1}^k \lambda^z_l$ and $q^p_i = \lambda^p_i/\sum_{l=1}^k\lambda^p_l$, $i=1,2,\dots,k$. Then $\sum_{i=1}^k q^z_i = \sum_{i=1}^k q^p_i = 1$, $q^z_1 \ge q^z_2 \ge \cdots \ge q^z_k$ and $q^p_1 \ge q^p_2 \ge \cdots \ge q^p_k$. Without loss of generality, we assume that $q^p_k, q^z_k > 0$. Because $g$ is monotonically increasing and $\lambda^z_i \ge \lambda^z_j$, for any $1 \le i < j \le k$ we have
$$\frac{q^p_i}{q^p_j} = \frac{\lambda^p_i}{\lambda^p_j} = \frac{g^2(\lambda^z_i)\,\lambda^z_i}{g^2(\lambda^z_j)\,\lambda^z_j} \ge \frac{\lambda^z_i}{\lambda^z_j} = \frac{q^z_i}{q^z_j}. \quad (52)$$
If $g(\lambda^z_i)$ is constant, it follows that $q^p_i/q^p_j = q^z_i/q^z_j$ for all $1\le i<j\le k$. Combined with $\sum_i q^z_i = \sum_i q^p_i = 1$, we get $q^p_i = q^z_i$, $i=1,2,\dots,k$. Hence $H(q^p_1,\dots,q^p_k) = H(q^z_1,\dots,q^z_k)$, which implies that $\mathrm{erank}(C_p) = \mathrm{erank}(C_z)$.
If $g(\lambda^z_i)$ is non-constant, then $g(\lambda^z_1) > g(\lambda^z_k)$, and it follows that
$$\frac{q^p_1}{q^p_k} > \frac{q^z_1}{q^z_k}. \quad (55)$$
Armed with Equations (52) and (55), we get
$$1 = \sum_{i=1}^k q^p_i = q^p_k\sum_{i=1}^k\frac{q^p_i}{q^p_k} > q^p_k\sum_{i=1}^k\frac{q^z_i}{q^z_k} = \frac{q^p_k}{q^z_k}\sum_{i=1}^k q^z_i = \frac{q^p_k}{q^z_k},$$
which indicates $q^p_k < q^z_k$. Similarly, we have $q^p_1 > q^z_1$. Hence, $m = \max\{i \mid q^p_i \ge q^z_i,\ i=1,2,\dots,k\}$ exists and $1 \le m < k$. Using Equation (52), we have $q^p_i \ge q^z_i$ if $1\le i\le m$, and $q^p_i < q^z_i$ if $m+1\le i\le k$. Directly applying $\sum_i q^z_i = \sum_i q^p_i = 1$ gives
$$\underbrace{\sum_{i=1}^m\left(q^p_i - q^z_i\right)}_{\ge 0} = \underbrace{\sum_{i=m+1}^k\left(q^z_i - q^p_i\right)}_{>0}. \quad (58)$$
According to Lemma 4 and Equation (58), if we transport this surplus mass from the tail to the head of the distribution, its entropy decreases. The transportation process is described as follows:

Step 1. Let $i \leftarrow 1$ and $j \leftarrow k$.
Step 2. If $\Delta = \min\{q^p_i - q^z_i,\; q^z_j - q^p_j\} > 0$, then $q^z_i \leftarrow q^z_i + \Delta$ and $q^z_j \leftarrow q^z_j - \Delta$.
Step 3. If $q^p_i = q^z_i$, let $i \leftarrow i+1$; otherwise, let $j \leftarrow j-1$.
Step 4. If $i \ge j$, the process terminates; otherwise, return to Step 2.

After at most $k-1$ loops, $(q^z_1, q^z_2, \dots, q^z_k)$ becomes $(q^p_1, q^p_2, \dots, q^p_k)$; Equation (58) ensures the correctness of this process. By Lemma 4, each transport step strictly decreases the entropy, so
$$H(q^z_1, q^z_2, \dots, q^z_k) > H(q^p_1, q^p_2, \dots, q^p_k),$$
which implies that $\mathrm{erank}(C_p) < \mathrm{erank}(C_z)$.
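The conclusion of Theorem 3 can be illustrated numerically via the effective rank $\mathrm{erank}(C) = \exp(H(q))$ of the normalized spectrum. This is only a sketch; the particular spectrum and filters below are arbitrary illustrative choices.

```python
# Illustration of Theorem 3: a non-constant increasing filter g lowers the
# effective rank of the spectrum, while a constant filter leaves it unchanged.
import numpy as np

def erank(lams):
    """Effective rank exp(H(q)) of a positive spectrum."""
    q = lams / lams.sum()
    return np.exp(-(q * np.log(q)).sum())

lam_z = np.array([4.0, 2.0, 1.0, 0.5, 0.25])   # descending spectrum of C_z

g = np.sqrt(lam_z)                 # a non-constant increasing filter g(lam)
lam_p = g**2 * lam_z               # online spectrum: lam_p = g^2(lam_z) lam_z
assert erank(lam_p) < erank(lam_z)             # erank(C_p) < erank(C_z)

lam_p_const = (2.0**2) * lam_z                 # constant filter g = 2
assert np.isclose(erank(lam_p_const), erank(lam_z))
```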

A.7 PROOF OF THEOREM 4

Proof. Since $p^{(t+1)}_x = p^{(t)}_x - \alpha\left(p^{(t)}_x - z^{(t)}_{x^+}\right) = (1-\alpha)\,p^{(t)}_x + \alpha\, z^{(t)}_{x^+}$ and $p_x = W z_x$, we have
$$\begin{aligned}
C^{(t+1)}_p &= \mathbb{E}_x\, p^{(t+1)}_x \left(p^{(t+1)}_x\right)^\top \\
&= \mathbb{E}_{x,x^+}\left((1-\alpha)\,p^{(t)}_x + \alpha\, z^{(t)}_{x^+}\right)\left((1-\alpha)\,p^{(t)}_x + \alpha\, z^{(t)}_{x^+}\right)^\top \\
&= (1-\alpha)^2\,\mathbb{E}_x\, p^{(t)}_x p^{(t)\top}_x + \alpha^2\,\mathbb{E}_{x^+}\, z^{(t)}_{x^+} z^{(t)\top}_{x^+} + \alpha(1-\alpha)\left(W\,\mathbb{E}_{x,x^+}\, z^{(t)}_x z^{(t)\top}_{x^+} + \mathbb{E}_{x,x^+}\, z^{(t)}_{x^+} z^{(t)\top}_x\, W^\top\right) \\
&= (1-\alpha)^2\, C^{(t)}_p + \alpha^2\, C^{(t)}_z + \alpha(1-\alpha)\left(W C_+ + C_+ W^\top\right).
\end{aligned}$$
It follows that
$$V\Lambda^{(t+1)}_p V^\top = (1-\alpha)^2\, V\Lambda^{(t)}_p V^\top + \alpha^2\, V\Lambda^{(t)}_z V^\top + 2\alpha(1-\alpha)\, V\Lambda^{(t)}_w\Lambda^{(t)}_+ V^\top.$$
As a result, we have $\Lambda^{(t+1)}_p = (1-\alpha)^2\Lambda^{(t)}_p + \alpha^2\Lambda^{(t)}_z + 2\alpha(1-\alpha)\Lambda^{(t)}_w\Lambda^{(t)}_+$, i.e.,
$$\lambda^p_{i,t+1} = (1-\alpha)^2\lambda^p_{i,t} + \alpha^2\lambda^z_{i,t} + 2\alpha(1-\alpha)\,\lambda^w_{i,t}\lambda^+_{i,t}.$$
Dividing both sides by $\lambda^p_{i,t}$, we have
$$\frac{\lambda^p_{i,t+1}}{\lambda^p_{i,t}} = (1-\alpha)^2 + \alpha^2\frac{\lambda^z_{i,t}}{\lambda^p_{i,t}} + 2\alpha(1-\alpha)\frac{\lambda^w_{i,t}\lambda^+_{i,t}}{\lambda^p_{i,t}} = (1-\alpha)^2 + \alpha^2 h_t^2(\lambda^p_{i,t}) + 2\alpha(1-\alpha)\, h_t(\lambda^p_{i,t})\,\frac{\lambda^+_{i,t}}{\lambda^z_{i,t}},$$
where $h_t(\lambda^p_{i,t}) = 1/\lambda^w_{i,t} = \sqrt{\lambda^z_{i,t}/\lambda^p_{i,t}}$.

• If $W = W^* = C^{(t)}_+\left(C^{(t)}_z\right)^{-1}$, then $h_t(\lambda^p_{i,t}) = \lambda^z_{i,t}/\lambda^+_{i,t}$. Hence,
$$\frac{\lambda^p_{i,t+1}}{\lambda^p_{i,t}} = 1 - \alpha^2 + \alpha^2 h_t^2(\lambda^p_{i,t})$$
is monotonically decreasing and non-constant.
• If $\lambda^z_{i,t} = \lambda^+_{i,t}$, then
$$\frac{\lambda^p_{i,t+1}}{\lambda^p_{i,t}} = (1-\alpha)^2 + \alpha^2 h_t^2(\lambda^p_{i,t}) + 2\alpha(1-\alpha)\, h_t(\lambda^p_{i,t})$$
is monotonically decreasing and non-constant ($0<\alpha<1$).

In either case, according to Theorem 3, we have $\mathrm{erank}(C^{(t+1)}_p) > \mathrm{erank}(C^{(t)}_p)$.
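The dynamics in the second case ($\lambda^z_{i,t} = \lambda^+_{i,t}$) can be simulated directly on the eigenvalues: the online spectrum contracts toward the target one, so its effective rank grows monotonically. This is a sketch; $\alpha$ and the initial spectra are illustrative choices, not values from the paper.

```python
# Simulation of the Theorem 4 eigenvalue update with lam_plus = lam_z:
# effective rank of the online spectrum increases at every step.
import numpy as np

def erank(lams):
    q = lams / lams.sum()
    return np.exp(-(q * np.log(q)).sum())

alpha = 0.3
lam_z = np.array([4.0, 2.0, 1.0, 0.5])
lam_p = lam_z**2 / lam_z.sum()        # a more collapsed initial online spectrum

ranks = [erank(lam_p)]
for _ in range(5):
    lam_w = np.sqrt(lam_p / lam_z)    # predictor eigenvalues: lam_p = lam_w^2 lam_z
    lam_p = (1 - alpha)**2 * lam_p + alpha**2 * lam_z \
            + 2 * alpha * (1 - alpha) * lam_w * lam_z   # update with lam_plus = lam_z
    ranks.append(erank(lam_p))

assert all(b > a for a, b in zip(ranks, ranks[1:]))   # erank strictly increases
assert ranks[-1] <= erank(lam_z) + 1e-9               # approaches the target erank
```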

B EXPERIMENTAL DETAILS

In this section, we provide the details and hyperparameters for SymSimSiam and variants of spectral filters.

B.1 DATASETS

We evaluate the performance of our methods on four benchmark datasets: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) , ImageNet-100 and ImageNet-1k (Deng et al., 2009) . CIFAR-10 and CIFAR-100 are small-scale datasets, composed of 32 × 32 images with 10 and 100 classes, respectively. ImageNet-100 is a subset of ImageNet-1k containing 100 classes.

B.2 IMPLEMENTATION DETAILS

Unless specified otherwise, we follow the default settings in solo-learn (da Costa et al., 2022) on CIFAR-10, CIFAR-100 and ImageNet-100. For ImageNet-1k, our implementation follows the official code of SimSiam (Chen & He, 2021) with the same settings. For a fair comparison, SimSiam with a learnable linear predictor is adopted as our baseline. With the original projector, SimSiam with a learnable linear predictor does not work, so we remove the last BN in the projector in this case. We also report the results of SimSiam with the default predictor (which we refer to as a learnable nonlinear predictor).

Data augmentations. The augmentation pipeline is RandomResizedCrop with scale in (0.2, 1.0), RandomHorizontalFlip with probability 0.5, ColorJitter (brightness 0.4, contrast 0.4, saturation 0.4, hue 0.1) with probability 0.8, and RandomGray with probability 0.2. For ImageNet-100 and ImageNet-1k, Gaussian blurring (Chen et al., 2020) is also applied with probability 0.5.

Optimization. SGD is used as the optimizer with momentum 0.9 and weight decay 1.0 × 10^-5 (1.0 × 10^-4 for ImageNet-1k). The learning rate follows the linear scaling rule (lr × BatchSize/256) with a base learning rate of 0.5 (0.05 for ImageNet-1k). After 10 epochs of warmup training, we use cosine learning rate decay (Loshchilov & Hutter, 2017). We use a batch size of 256 on CIFAR-10 and CIFAR-100, 128 on ImageNet-100, and 512 on ImageNet-1k.

Linear evaluation. For linear evaluation, we evaluate the pre-trained backbone network by training a linear classifier on the frozen representation. For CIFAR-10, CIFAR-100 and ImageNet-100, the linear classifier is trained with the SGD optimizer (momentum = 0.9, batch size = 256, initial learning rate = 30.0) for 100 epochs; the learning rate is divided by 10 at epochs 60 and 80.
For ImageNet-1k, following the official code of SimSiam (Chen & He, 2021), we train the linear classifier for 90 epochs using the LARS optimizer (You et al., 2017) with momentum = 0.9, batch size = 4096, weight decay = 0, an initial learning rate of 0.1, and cosine decay of the learning rate.
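The pre-training schedule described above (linear scaling rule, linear warmup, then cosine decay) can be sketched as follows; the total epoch count is an illustrative parameter, not a value fixed by the paper.

```python
# Sketch of the B.2 learning-rate schedule: linear scaling of the base lr,
# 10 warmup epochs, then cosine decay (total epochs is illustrative).
import math

def lr_at(epoch, base_lr=0.5, batch_size=256, warmup=10, total=1000):
    lr = base_lr * batch_size / 256            # linear scaling rule
    if epoch < warmup:
        return lr * (epoch + 1) / warmup       # linear warmup
    t = (epoch - warmup) / (total - warmup)    # progress in [0, 1]
    return 0.5 * lr * (1 + math.cos(math.pi * t))   # cosine decay
```

For example, `lr_at(0)` returns one tenth of the scaled rate and `lr_at(9)` returns the full scaled rate at the end of warmup.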

B.3 EXPERIMENTAL SETTING FOR FIGURES

For the implementations of BYOL, SimSiam, SwAV and DINO on CIFAR-10 and CIFAR-100, we use the code and all default settings of the open-source library solo-learn (da Costa et al., 2022). For SwAV, we do not store features from previous batches to augment batch features (queue size = 0), for consistency of the training loss. We adopt ResNet-18 as the backbone encoder. The output dimensions of BYOL, SimSiam, SwAV and DINO are 256, 2048, 3000 and 4092, respectively.

In Figure 3, we use the alignment metric to measure the eigenspace alignment of the online and target outputs, i.e., $C_p$ and $C_z$. The intuition is that, given $u_i$ as an eigenvector of $C_z$, if it is also an eigenvector of $C_p$, then $C_p u_i = \lambda'_i u_i$ is in the same direction as $u_i$. Thus, the higher the cosine similarity between $u_i$ and $C_p u_i$, the more aligned the eigenspaces of $C_p$ and $C_z$. Formally, we define the alignment between the eigenspaces of $C_p$ and $C_z$ as
$$\mathrm{Alignment}(C_p, C_z) = \frac{1}{m}\sum_{i=1}^m \frac{u_i^\top}{\|u_i\|_2}\,\frac{C_p u_i}{\|C_p u_i\|_2},$$
where $u_i$ is the eigenvector corresponding to the $i$-th largest eigenvalue $\lambda^z_i$ of $C_z$, $i=1,2,\dots,m$. We set $m$ to 512 for SimSiam, SwAV and DINO and to 128 for BYOL. The sum of the first $m$ eigenvalues is greater than 99.99% of the sum of all eigenvalues, so the span of the first $m$ eigenvectors is a good approximation of the original eigenspace.

Table 4 shows that our target high-pass filters can often outperform BYOL by a large margin at earlier epochs. In particular, on CIFAR-10, the target filter h(σ) = σ^-0.3 improves BYOL with a learnable linear predictor by 4.92%, 2.68% and 1.00% at 100, 200 and 400 epochs, respectively. We also conduct experiments on CIFAR-100: Figure 7 demonstrates that the target branch always has a higher effective rank than the online branch, and that the rank of the online branch continues to increase after the warmup stage in all methods.
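The eigenspace alignment metric of Appendix B.3 can be sketched in NumPy as follows. The matrix sizes below are illustrative; in practice $C_p$ and $C_z$ are estimated from the network outputs.

```python
# Sketch of the Alignment(C_p, C_z) metric: average cosine similarity
# between the top-m eigenvectors u_i of C_z and C_p u_i.
import numpy as np

def alignment(C_p, C_z, m):
    w, U = np.linalg.eigh(C_z)       # eigh returns ascending eigenvalues
    U = U[:, ::-1][:, :m]            # top-m eigenvectors of C_z
    sims = []
    for i in range(m):
        u = U[:, i]
        v = C_p @ u
        sims.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return float(np.mean(sims))

# If C_p and C_z share an eigenspace, the alignment is exactly 1.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(6, 6)))
C_z = V @ np.diag([6.0, 5.0, 4.0, 3.0, 2.0, 1.0]) @ V.T
C_p = V @ np.diag([3.0, 2.5, 2.0, 1.5, 1.0, 0.5]) @ V.T
```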
In Figure 8(a), we compare the eigenvalues computed from the two branch outputs. There is a clear difference, and the eigenvalue distribution of the target branch is more biased towards larger values. Figure 8(b) shows the spectral filter $g(\lambda^z_i) = \lambda^p_i/\lambda^z_i$, where $\lambda^p_i, \lambda^z_i$ are eigenvalues of the online and target correlation matrices $C_p, C_z$, respectively. The spectral filters of all methods are nearly monotonically increasing w.r.t. $\lambda^z_i$.

E.3 TRAINING DYNAMICS

In Figure 9, we compare the training loss of SimSiam and SymSimSiam on CIFAR-10 and CIFAR-100. We can see that the loss of SimSiam is always larger than that of SymSimSiam and does not decrease consistently, whereas the alignment loss of SymSimSiam continues to decrease.

Compared with Tian et al. (2021), their analysis relies on linear networks and specific data assumptions, while ours focuses on the common spectral filtering property that also holds for nonlinear modules and general data distributions. As for the conclusion, we formally show that asymmetric designs improve effective dimensionality, while Tian et al. only discuss how the predictor avoids complete collapse (which is an extreme case of dimensional collapse; an encoder that avoids complete collapse may still suffer from dimensional collapse).

• A unified framework for various asymmetric designs. Tian et al.'s analysis only covers the predictor in BYOL and SimSiam, and it cannot explain why SwAV and DINO also work without predictors. Our RDM applies to all existing asymmetric designs through the unified spectral-filter perspective.

• General principles for predictor design. Tian et al. propose DirectPred, which is only one specific filter. Instead, we point out the underlying principle: as long as the online filter is a low-pass filter, it can theoretically avoid feature collapse. We also verify this empirically by showing that various online low-pass filters avoid feature collapse.

• A more effective asymmetric design through target predictors. Based on our RDM theory, we also propose a new kind of asymmetric design for non-contrastive learning: applying a predictor to the target branch. We show that target predictors achieve better results than online predictors while being more computationally efficient.

Therefore, our analysis improves on Tian et al. (2021) in many aspects and applies to a wider context, and we achieve this with perspectives and techniques that are quite distinct from theirs.



As DINO adopts dynamic feature-sharpening coefficients, its eigenspace alignment changes more over training than that of the other methods. Nevertheless, its alignment degree is always above 0.7, which is relatively high. Equivalently, when viewing spectral filters from the target branch, the asymmetric modules behave as a high-pass target filter, because the corresponding filter function $h(\lambda^p_i) = \lambda^z_i/\lambda^p_i$ is monotonically decreasing.



Figure 1: The effective rank of the normalized outputs of the online and target branches for four different non-contrastive methods (BYOL, SimSiam, SwAV, and DINO) on CIFAR-10.

Figure 2: Comparison of the asymmetric SimSiam with two symmetric baselines, SymSimSiam (ours) and Vanilla Siamese on CIFAR-10, where SymSimSiam adopts L sym (f ′ θ ) (Eq. 2), while Vanilla Siamese adopts L sym (f θ ). (a): Linear probing test accuracy. (b): Feature variance of normalized outputs. (c): Eigenvalues of the correlation matrix of the normalized outputs. (d): Eigenvalues of the correlation matrix of the normalized outputs from two branches of SymSimSiam.

Figure 3: Eigenspace alignment of the online and target outputs along the training. Given u i as an eigenvector of C z , if it is also an eigenvector of C p , then C p u i = λ ′ i u i is in the same direction as u i . Thus, we measure the alignment of u i by computing the cosine similarity between u i and C p u i , and take the average over each u i as the overall eigenspace alignment (details in Appendix B.3).

Figure 4: Eigenvalues of the feature correlation matrices of the online and target outputs, and the corresponding online spectral filter function $g(\lambda^z_i)$.

Figure 5: The role of asymmetric designs. (a) & (b): spectral filters of the ideal optimal predictors, g * (λ z i ) = λ + i /λ z i (Eq. 6). Both filters calculated from BYOL and SimSiam are almost monotonically decreasing. (c): The target filters using different Sinkhorn-Knopp (SK) iterations in SwAV. (d): The target filters using centering and/or sharpening with different target temperatures (t) in DINO (the online temperature is set to 0.1).

Figure 6: Eigenvalues of the correlation matrix of the normalized outputs for different variants of filters on CIFAR-10 and CIFAR-100.

In Figures 4, 5 & 8, the first 512 point pairs are displayed for SimSiam, SwAV and DINO (128 for BYOL).

B.4 PSEUDOCODE

Here, we provide the pseudocode for SymSimSiam (Algorithm 1) and the variants of spectral filters (Algorithm 2).

Algorithm 1 SymSimSiam: Pseudocode in a PyTorch-like style.

```python
# f: backbone + projection mlp
# c: center (1-by-k)
# m: momentum 0.9
for x in loader:                 # load a minibatch x with n samples
    x1, x2 = aug(x), aug(x)      # two random augmentations
    z1, z2 = f(x1), f(x2)        # projections, n-by-k
    z1 = normalize(z1, dim=1)    # l2-normalize
    z2 = normalize(z2, dim=1)    # l2-normalize
    update_center(cat(z1, z2))
    z1, z2 = z1 - c, z2 - c
    loss = -(z1 * z2).sum(dim=1).mean()
    loss.backward()
    update(f.params)             # SGD update

@torch.no_grad()
def update_center(z):
    c = m * c + (1 - m) * z.mean(dim=0)
```

C ADDITIONAL RESULTS BASED ON THE BYOL FRAMEWORK

In the main paper, we conduct experiments based on the SimSiam framework. Here, we also report results for variants of target high-pass filters based on the BYOL framework. We adopt ResNet-18 as the backbone encoder and use the default settings in da Costa et al. (2022).
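As a companion sketch, a target spectral filter h could in principle be applied to a batch of target features as below. This is only a hedged reconstruction, not the paper's Algorithm 2: details such as the per-batch correlation estimate and the eigenvalue clamping `eps` are our assumptions.

```python
# A hedged sketch of applying a target spectral filter h(sigma), e.g.
# the high-pass filter h(sigma) = sigma**-0.3, via eigendecomposition
# of the batch correlation matrix (details are assumptions, not Algorithm 2).
import numpy as np

def apply_target_filter(z, h, eps=1e-6):
    """z: n-by-k target features; returns z W_h with W_h = V h(Lambda) V^T."""
    C = z.T @ z / z.shape[0]              # k-by-k batch correlation matrix
    lam, V = np.linalg.eigh(C)
    lam = np.clip(lam, eps, None)         # guard against zero eigenvalues
    W_h = V @ np.diag(h(lam)) @ V.T       # symmetric spectral filter matrix
    return z @ W_h.T

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 8))
p = apply_target_filter(z, lambda s: s**-0.3)   # high-pass target filter
```

A high-pass h flattens the spectrum of the filtered features, consistent with the rank-differential picture: the filtered branch has a higher effective rank than the raw one.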

Figure 7: The effective rank of the normalized outputs of the online and target branches along the training dynamics on CIFAR-100.

The corresponding online spectral filter function $g(\lambda^z_i) = \lambda^p_i/\lambda^z_i$.

Figure 8: Rank difference experiments and spectral filters on CIFAR-100.

Linear probing accuracy (%) of SimSiam with different online predictors (including the learnable nonlinear and linear predictors of SimSiam) on CIFAR-10, CIFAR-100 and ImageNet-100. For the online filter g(σ) = log(1 + σ²): 75.51, 82.28, 87.51 (CIFAR-10); 39.19, 48.97, 59.06 (CIFAR-100); 77.22, 94.00 (ImageNet-100).

Linear probing accuracy (%) of SimSiam with different target predictors (including learnable nonlinear and linear predictors) on CIFAR-10, CIFAR-100 and ImageNet-100.

Linear probing accuracy (%) of SimSiam with different predictors on ImageNet-1k.

Training time and memory comparison of different methods on CIFAR-10.

Table 6: Eigenspace alignment between C z and C + .

E.1 ADDITIONAL EMPIRICAL EVIDENCE FOR EIGENSPACE ALIGNMENT

In the main paper, we have shown that C p and C z share the same eigenspace. Here, we add empirical evidence for the eigenspace alignment between C z and C + on CIFAR-10 and CIFAR-100. As shown in Table 6, all methods have very high alignment between the eigenspaces of C z and C + .

ACKNOWLEDGMENTS

Jinwen Ma is supported by the National Key Research and Development Program of China under grant 2018AAA0100205. Yisen Wang is partially supported by the National Natural Science Foundation of China (62006153), Open Research Projects of Zhejiang Lab (No. 2022RC0AB05), and Huawei Technologies Inc.

CODE AVAILABILITY

//github.com/PKU


On CIFAR-100, the filter h(σ) = σ^-1 outperforms the baseline by 8.75% at 100 epochs, the filter h(σ) = σ^-0.7 improves the baseline by 5.05% at 200 epochs, and h(σ) = σ^-0.3 improves the baseline by 5.22% at 400 epochs.

D COST COMPARISON

We compare the training speed and GPU memory usage of different methods (SimSiam, the online filter g(σ) = log(1 + σ), and the target filter h(σ) = σ^-0.3) on CIFAR-10. We perform these experiments on a single RTX 2080 Ti GPU. As shown in Table 5, compared to the original SimSiam, the GPU memory usage of our methods increases only slightly (26 MiB). SimSiam and the target filter h(σ) = σ^-0.3 have the same training time for 20 epochs.

