RUN AWAY FROM YOUR TEACHER: A NEW SELF-SUPERVISED APPROACH SOLVING THE PUZZLE OF BYOL

Abstract

Recently, a newly proposed self-supervised framework, Bootstrap Your Own Latent (BYOL), has seriously challenged the necessity of negative samples in contrastive learning. BYOL works remarkably well despite discarding negative samples completely and having no measure in its training objective to prevent collapse. In this paper, we propose to understand BYOL through the lens of our newly proposed interpretable self-supervised learning framework, Run Away From your Teacher (RAFT). RAFT optimizes two objectives simultaneously: (i) aligning two views of the same data to similar representations and (ii) running away from the model's Mean Teacher (MT, the exponential moving average of the history models) instead of running towards it as BYOL does. The second term of RAFT explicitly prevents representation collapse and thus makes RAFT a conceptually more reliable framework. We provide basic benchmarks of RAFT on CIFAR10 to validate the effectiveness of our method. Furthermore, we prove that BYOL is equivalent to RAFT under certain conditions, providing solid reasoning for BYOL's counter-intuitive success.

1. INTRODUCTION

Recently the performance gap between self-supervised learning and supervised learning has narrowed thanks to the development of contrastive learning (Chen et al., 2020a;b; Tian et al., 2019; Sohn, 2016; Zhuang et al., 2019; He et al., 2020; Oord et al., 2018; Hadsell et al., 2006). Contrastive learning distinguishes positive pairs of data from negative ones. It has been shown that when the representation space is l2-normalized, i.e. a hypersphere, optimizing the contrastive loss is approximately equivalent to simultaneously optimizing the alignment of positive pairs and the uniformity of the representation distribution (Wang & Isola, 2020). This equivalence conforms to our intuitive understanding: one can easily imagine a failed method when only one of the two properties is optimized. Aligning the positive pairs without a uniformity constraint causes representation collapse, mapping all data to the same point; scattering the data uniformly in the representation space without aligning similar ones yields representations no more meaningful than random ones.

The proposal of Bootstrap Your Own Latent (BYOL) fiercely challenges the consensus that negative samples are necessary for contrastive methods (Grill et al., 2020). BYOL trains the model (the online network) to predict its Mean Teacher (the moving average of the online network; refer to Appendix B.2) on two augmented views of the same data (Tarvainen & Valpola, 2017). There is no explicit constraint on uniformity in BYOL, yet the expected collapse never happens; moreover, BYOL reaches SOTA performance on downstream tasks. Although BYOL has been empirically proven to be an effective self-supervised learning approach, the mechanism that keeps it from collapsing remains unrevealed. Without solving this mystery, it is risky to adapt BYOL to other problems, let alone improve it further. Solving the puzzle of BYOL is therefore an urgent task.
In this paper, we explain how BYOL works through another interpretable learning framework that leverages the MT in the exact opposite way. Based on a series of theoretical derivations and empirical approximations, we build a new self-supervised learning framework, Run Away From your Teacher (RAFT), which optimizes two objectives at the same time: (i) minimize the representation distance between the two samples of a positive pair and (ii) maximize the representation distance between the online network and its MT. The second objective of RAFT incorporates the MT in a way exactly opposite to BYOL, and it explicitly prevents representation collapse by encouraging the online network to differ from its history (Figure 2a). Moreover, we empirically show that the second objective of RAFT is a more effective and consistent regularizer for the first, which makes RAFT more favorable than BYOL. Finally, we solve the puzzle of BYOL by theoretically proving that BYOL is a special form of RAFT when certain conditions and approximations hold. This proof explains why collapse does not happen in BYOL, and it also makes the performance of BYOL an approximate guarantee of the effectiveness of RAFT. The main body of the paper is organized in the same order in which we explore the properties of BYOL and establish RAFT upon them (refer to Appendix A for more details). In Section 3, we investigate the phenomenon that BYOL fails to work when the predictor is removed. In Section 4, we extract two meaningful objectives out of BYOL by upper bounding; based on them, we propose RAFT due to its stronger regularization effect and its accordance with our prior knowledge. In Section 5, we prove that, as a representation learning framework, BYOL is a special form of RAFT under certain achievable conditions.
In summary, our contributions are as follows:

• We present a new self-supervised learning framework, RAFT, which minimizes the alignment loss and maximizes the distance between the online network and its MT. The motivation of RAFT conforms to our understanding of balancing alignment and uniformity in the representation space, and thus RAFT can be easily extended and adapted to future problems.

• We equate two seemingly opposite ways of incorporating the MT in contrastive methods under certain conditions. By doing so, we unravel the puzzle of how BYOL avoids representation collapse.

2.1. TWO METRICS OPTIMIZED IN CONTRASTIVE LEARNING

Optimizing the contrastive learning objective has been empirically shown to correlate positively with downstream task performance (Chen et al., 2020a;b; Tian et al., 2019; Sohn, 2016; Zhuang et al., 2019; He et al., 2020; Oord et al., 2018). Wang & Isola (2020) put contrastive learning in the context of the hypersphere and formally showed that optimizing the contrastive loss (for preliminaries of contrastive learning, refer to Appendix B.1) is equivalent to optimizing two metrics of the encoder network when the number of negative samples $K$ is sufficiently large: the alignment of the two augmented views of the same data and the uniformity of the representation population. We introduce the alignment and uniformity objectives as follows.

Definition 2.1 (Alignment loss) The alignment loss $\mathcal{L}_{\mathrm{align}}(f; P_{\mathrm{pos}})$ of the function $f$ over the positive-pair distribution $P_{\mathrm{pos}}$ is defined as
$$\mathcal{L}_{\mathrm{align}}(f; P_{\mathrm{pos}}) \triangleq \mathbb{E}_{(x_1, x_2) \sim P_{\mathrm{pos}}} \left[ \| f(x_1) - f(x_2) \|_2^2 \right], \quad (1)$$
where the positive pair $(x_1, x_2)$ consists of two augmented views of the same input data $x \sim \mathcal{X}$, i.e. $(x_1, x_2) = (t_1(x), t_2(x))$ with augmentations $t_1 \sim \mathcal{T}_1$ and $t_2 \sim \mathcal{T}_2$. For the sake of simplicity, we omit $P_{\mathrm{pos}}$ and write $\mathcal{L}_{\mathrm{align}}(f)$ in the following content.

Definition 2.2 (Uniformity loss) The uniformity loss $\mathcal{L}_{\mathrm{uniform}}(f; \mathcal{X})$ of the encoder function $f$ over the data distribution $\mathcal{X}$ is defined as
$$\mathcal{L}_{\mathrm{uniform}}(f; \mathcal{X}) \triangleq \log \mathbb{E}_{(x, y) \sim \mathcal{X}^2} \left[ e^{-t \| f(x) - f(y) \|_2^2} \right],$$
where $t > 0$ is a fixed parameter, empirically set to $t = 2$.

Note that vectors in the representation space are automatically l2-normalized, i.e. $f(x) \triangleq f(x) / \|f(x)\|_2$, since we limit the representation space to a hypersphere following Wang & Isola (2020) and Grill et al. (2020); the representation vectors in the following context are also automatically l2-normalized unless specified otherwise.
Wang & Isola (2020) have empirically demonstrated that balancing the alignment loss and the uniformity loss is necessary when learning representations through contrastive methods. The rationale is straightforward: $\mathcal{L}_{\mathrm{align}}$ provides the motive power that concentrates similar data, and $\mathcal{L}_{\mathrm{uniform}}$ prevents the encoder from mapping all data to the same meaningless point.
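These two metrics can be computed directly on mini-batches of representations. Below is a minimal numpy sketch of Definitions 2.1 and 2.2; the helper names are illustrative and not the paper's code:

```python
import numpy as np

def l2_normalize(z):
    # Map representations onto the unit hypersphere, as assumed throughout.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment_loss(z1, z2):
    # L_align: mean squared distance between the two views of each datum.
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    return float(np.mean(np.sum((z1 - z2) ** 2, axis=1)))

def uniformity_loss(z, t=2.0):
    # L_uniform: log of the average Gaussian potential over all pairs (t = 2).
    z = l2_normalize(z)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return float(np.log(np.mean(np.exp(-t * sq_dists))))
```

A collapsed encoder drives `alignment_loss` to 0 but also drives `uniformity_loss` up to its maximum of 0, which is exactly why the two terms must be balanced.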

2.2. BYOL: BIZARRE ALTERNATIVE OF CONTRASTIVE

A recently proposed self-supervised representation learning algorithm, BYOL, hugely challenges the common understanding that alignment must be balanced by negative samples during contrastive learning. It establishes two networks, online and target, which approach each other during training. The online network is trained to predict the target's representations, and the target's parameters are the Exponential Moving Average (EMA) of the online's. The loss of BYOL at every iteration can be written as
$$\mathcal{L}_{\mathrm{BYOL}} \triangleq \mathbb{E}_{(x, t_1, t_2) \sim (\mathcal{X}, \mathcal{T}_1, \mathcal{T}_2)} \left[ \| q_w(f_\theta(t_1(x))) - f_\xi(t_2(x)) \|_2^2 \right],$$
where the two vectors in the representation space are automatically l2-normalized. Here $f_\theta$ is the online encoder network parameterized by $\theta$ and $q_w$ is the predictor network parameterized by $w$; $x \sim \mathcal{X}$ is the input sampled from the data distribution $\mathcal{X}$, and $t_1(x), t_2(x)$ are two augmented views of $x$, where $t_1 \sim \mathcal{T}_1, t_2 \sim \mathcal{T}_2$ are two data augmentations. The target network $f_\xi$ has the same architecture as $f_\theta$ and is updated by EMA, with $\tau$ controlling to what degree the target network preserves its history: $\xi \leftarrow \tau \xi + (1 - \tau)\theta$.

From the training scheme of BYOL, there seems to be no constraint on uniformity, and thus the most frequently asked question about BYOL is how it prevents representation collapse. Theoretically, we would expect that once the online and target networks converge to each other, $\mathcal{L}_{\mathrm{BYOL}}$ degenerates to $\mathcal{L}_{\mathrm{align}}$ and therefore causes representation collapse, yet this speculation never materializes in practice. Despite BYOL's SOTA performance, one inconsistency cannot be neglected: BYOL fails with representation collapse when the predictor is removed, i.e. when $q_w(x) = x$ for any given $x$. This inconsistent behavior weakens BYOL's reliability and poses questions for future adaptations of the algorithm. The motivation to understand and resolve this inconsistency is the starting point of this paper.
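In code, the per-batch BYOL objective is a mean squared error between the l2-normalized online prediction of one view and the target's representation of the other. The sketch below uses toy linear maps in place of the deep networks; the shapes and names are hypothetical, and only the form of the loss follows the text above:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Toy linear stand-ins for the online encoder f_theta, the predictor q_w,
# and the target encoder f_xi (initialized as a copy of f_theta).
theta = rng.normal(size=(8, 4))
w = rng.normal(size=(4, 4))
xi = theta.copy()

def byol_loss(x1, x2):
    # || q_w(f_theta(t1(x))) - f_xi(t2(x)) ||_2^2, both outputs l2-normalized;
    # in training, no gradient flows into the target branch.
    pred = l2_normalize(x1 @ theta @ w)
    targ = l2_normalize(x2 @ xi)
    return float(np.mean(np.sum((pred - targ) ** 2, axis=1)))
```

Since both vectors lie on the unit sphere, the per-sample loss equals $2 - 2\langle \mathrm{pred}, \mathrm{targ} \rangle$ and is bounded in $[0, 4]$.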

3. ON-AND-OFF BYOL: FAILURE WITHOUT PREDICTOR

We start by presenting a dissatisfactory property of BYOL: its success heavily relies on the existence of the predictor $q_w$. The experimental setup of this paper is listed in Appendix C. The original BYOL model, whose predictor $q_w$ is a two-layer MLP with batch normalization, reaches 68.08 ± 0.84% under the linear evaluation protocol (Kolesnikov et al., 2019; Kornblith et al., 2019; Chen et al., 2020a; He et al., 2020; Grill et al., 2020). When the predictor is removed, the performance degenerates to 20.92 ± 1.29%, which is even lower than the random baseline's 42.74 ± 0.41%. We examine the speculation that this performance drop is caused by representation collapse, both visually (refer to Appendix F.1) and numerically. Inspired by Wang & Isola (2020), we use $\mathcal{L}_{\mathrm{uniform}}(f_\theta; \mathcal{X})$ to evaluate to what degree the representations are spread on the hypersphere and $\mathcal{L}_{\mathrm{align}}(q_w \circ f_\theta)$ to evaluate how well similar samples are aligned in the representation space. The results in Table 1 show that with the predictor, BYOL optimizes the uniformity of the representation distribution. On the contrary, when the predictor is taken away, the alignment of the two augmented views is overly optimized and the uniformity of the representations deteriorates (Figure 4); we therefore conclude that the predictor is essential to collapse prevention in BYOL. One plausible follow-up explanation for the efficacy of the predictor might appeal to its specially designed architecture or to good properties brought by the weight initialization, which would make the mechanism behind it hard to understand. Fortunately, after replacing the current predictor, a two-layer MLP with batch normalization (Ioffe & Szegedy, 2015), with different network architectures and weight initializations, we find no significant change either on the linear evaluation protocol or in the model's behavior during training (Table 1; for detailed training trajectories, refer to Figure 4).
We first replace the complex structure with a linear mapping $q_w(\cdot) = W(\cdot)$. This replacement admits a trivial solution to representation collapse, $W = I$, yet training never converges to this apparent collapse. Surprisingly, even when we go harsher on this linear predictor by initializing $W$ with the apparent collapse solution $I$, the model exhibits a self-recovering mechanism despite starting from a poor position: the loss quickly approaches 0 and the uniformity deteriorates for 10-20 epochs, then the model suddenly deflects from the collapse and stays on the right track. We provide a theoretical proof that a randomly initialized linear predictor prevents the (stricter form of) representation collapse by creating infinitely many non-trivial solutions at convergence (refer to Appendix I), but we fail to correlate the consistently optimized uniformity with the presence of the predictor, which indicates that a deeper rationale remains to be found.

4.1. DISENTANGLE THE BYOL LOSS BY UPPER BOUNDING

Analyzing $\mathcal{L}_{\mathrm{BYOL}}$ directly is hard, since it consists of a single mean squared error term with many factors entangled within it, e.g., the two augmented views of the same data, the predictor, and the EMA updating rule. Inspired by the bias-variance decomposition of the squared loss (Geman et al., 1992), we extract the alignment loss by subtracting and adding the same term $q_w(f_\theta(t_2(x)))$, which yields an upper bound of $\mathcal{L}_{\mathrm{BYOL}}$. For details, please refer to Appendix G.

Definition 4.1 (Cross-model loss) The cross-model loss $\mathcal{L}_{\mathrm{cross\text{-}model}}(f, g; \mathcal{X})$ of the functions $f$ and $g$ over the data distribution $\mathcal{X}$ is defined as
$$\mathcal{L}_{\mathrm{cross\text{-}model}}(f, g; \mathcal{X}) \triangleq \mathbb{E}_{x \sim \mathcal{X}} \left[ \| f(x) - g(x) \|_2^2 \right].$$

Definition 4.2 (BYOL′ loss) The BYOL′ loss $\mathcal{L}_{\mathrm{BYOL'}}$ is defined as
$$\mathcal{L}_{\mathrm{BYOL'}} \triangleq \alpha \mathcal{L}_{\mathrm{align}}(q_w \circ f_\theta; P_{\mathrm{pos}}) + \beta \mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta, f_\xi; \mathcal{X}_2), \quad (6)$$
where $\alpha, \beta > 0$ are constants, $P_{\mathrm{pos}}$ is defined in Eq. 1, and $\mathcal{X}_2 = \mathcal{T}_2(\mathcal{X})$ is the distribution of the augmented data. For the sake of simplicity, we use $\mathcal{L}_{\mathrm{align}}(q_w \circ f_\theta)$ to denote $\mathcal{L}_{\mathrm{align}}(q_w \circ f_\theta; P_{\mathrm{pos}})$ and $\mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta)$ to denote $(1/2)[\mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta, f_\xi; \mathcal{X}_1) + \mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta, f_\xi; \mathcal{X}_2)]$ in the following content.

[Figure 1: Comparison of BYOL and RAFT. For the sake of symmetry, both use the encoder $f_\theta$ and an extra predictor $q_w$; the Mean Teacher $f_\xi$ is the EMA of the encoder $f_\theta$. In BYOL, the loss minimizes the distance between the prediction of one view $x_1$ and the representation of the other view $x_2$ generated by the MT. In RAFT, two objectives are optimized together: (i) minimize the representation distance between the two samples of a positive pair and (ii) maximize the representation distance between the online network and its MT.]

Theorem 4.1 ($\mathcal{L}_{\mathrm{BYOL'}}$ is an upper bound of $\mathcal{L}_{\mathrm{BYOL}}$) $\mathcal{L}_{\mathrm{BYOL'}}$ is an upper bound of $\mathcal{L}_{\mathrm{BYOL}}$ up to scalar multiplication. Concretely, for any given constants $\alpha, \beta > 0$, we have
$$\mathcal{L}_{\mathrm{BYOL}} \le \left( \tfrac{1}{\alpha} + \tfrac{1}{\beta} \right) \mathcal{L}_{\mathrm{BYOL'}}.$$
Proof Please refer to Appendix G.

Ideally, minimizing $\mathcal{L}_{\mathrm{BYOL'}}$ should yield performance similar to minimizing $\mathcal{L}_{\mathrm{BYOL}}$. We exemplify the legitimacy of $\mathcal{L}_{\mathrm{BYOL'}}$ by setting $(\alpha, \beta) = (1, 1)$. In Table 1, the performance of BYOL and BYOL′ is close with respect to three metrics, alignment, uniformity, and the downstream linear evaluation protocol, regardless of the form of the predictor. When the predictor is a linear mapping, the performance differences between them are subtle. Moreover, when the predictor is removed, representation collapse also happens to BYOL′. We therefore conclude that optimizing $\mathcal{L}_{\mathrm{BYOL'}}$ is almost equivalent to optimizing $\mathcal{L}_{\mathrm{BYOL}}$. Despite the performance similarity, $\mathcal{L}_{\mathrm{BYOL'}}$ has a more disentangled form than $\mathcal{L}_{\mathrm{BYOL}}$, and we therefore focus on studying the former. The new objective consists of two terms: the first term, $\mathcal{L}_{\mathrm{align}}$, minimizes the representation distance between samples of a positive pair and has already been shown crucial to successful contrastive methods (Wang & Isola, 2020). Intuitively, it provides the motive power that concentrates similar data in the representation space. Based on the form of BYOL′, we conclude that the MT is used to regularize the alignment loss. This perspective of two terms regularizing each other is crucial to our analysis and improvement of the original BYOL framework. Understanding why BYOL works without collapse is approximately equivalent to understanding how minimizing $\mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta, f_\xi)$ effectively regularizes the alignment loss, or even actively optimizes the uniformity.
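For intuition, the shape of the bound in Theorem 4.1 can be recovered from Young's inequality; this is a sketch consistent with the definitions above, while the paper's own proof is in Appendix G. Write $a = q_w(f_\theta(t_1(x))) - q_w(f_\theta(t_2(x)))$ and $b = q_w(f_\theta(t_2(x))) - f_\xi(t_2(x))$, so that the BYOL residual is $a + b$. For any $c > 0$,

```latex
\|a+b\|_2^2 \;\le\; (1+c)\,\|a\|_2^2 \;+\; \Bigl(1+\tfrac{1}{c}\Bigr)\,\|b\|_2^2 .
```

Choosing $c = \alpha/\beta$ and taking expectations gives
$$\mathcal{L}_{\mathrm{BYOL}} \;\le\; \Bigl(1+\tfrac{\alpha}{\beta}\Bigr)\mathcal{L}_{\mathrm{align}} + \Bigl(1+\tfrac{\beta}{\alpha}\Bigr)\mathcal{L}_{\mathrm{cross\text{-}model}} \;=\; \Bigl(\tfrac{1}{\alpha}+\tfrac{1}{\beta}\Bigr)\bigl(\alpha \mathcal{L}_{\mathrm{align}} + \beta \mathcal{L}_{\mathrm{cross\text{-}model}}\bigr),$$
which is exactly the claimed $(\tfrac{1}{\alpha}+\tfrac{1}{\beta})\mathcal{L}_{\mathrm{BYOL'}}$.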

4.2. RAFT: RUN AWAY FROM YOUR TEACHER

The major difficulty in correlating $\mathcal{L}_{\mathrm{cross\text{-}model}}$ with $\mathcal{L}_{\mathrm{uniform}}$ is that their optimization intentions are not merely irrelevant but somewhat opposite. Minimizing the cross-model loss asks the network to produce close representations for certain inputs, while optimizing the uniformity loss requires it to produce varying representations. This disparity in form pushes us to question the original motivation of BYOL: do we really want the online network to approach the Mean Teacher? To test our suspicion, we minimize $[\mathcal{L}_{\mathrm{align}}(q_w \circ f_\theta) - \mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta, f_\xi)]$ instead of $[\mathcal{L}_{\mathrm{align}}(q_w \circ f_\theta) + \mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta, f_\xi)]$, and we find that it works as well. This bizarre phenomenon will be explained in Section 5. Removing the predictor, we observe that although minimizing $[\mathcal{L}_{\mathrm{align}}(f_\theta) - \mathcal{L}_{\mathrm{cross\text{-}model}}(f_\theta, f_\xi)]$ fails to yield better representations than the random baseline, it keeps the alignment loss from being overly optimized, i.e. it works as an effective regularizer for the alignment loss, while minimizing $\mathcal{L}_{\mathrm{cross\text{-}model}}(f_\theta, f_\xi)$ does not. Based on the conclusion above and the law of Occam's razor, we propose a new self-supervised learning framework, Run Away From your Teacher (RAFT), which optimizes two learning objectives simultaneously: (i) minimize the alignment loss of two samples from a positive pair and (ii) maximize the distance between the online network and its MT (refer to Figure 1 and Algorithm 1).

[Figure 2: (a) Representations in BYOL and RAFT, where $z_i = q_w(f_\theta(x_i))$ and $z'_i = f_\xi(x_i)$, $i = 1, 2$. (b) Diagram of objective categories with respect to their effect of constraining the alignment loss. In contrastive methods, the most favorable objectives actively optimize uniformity, which includes both BYOL and RAFT when the predictor exists. The second most favorable objectives are effective regularizers of alignment. RAFT continues to restrain the alignment loss without the predictor while BYOL fails to do so, which implies that RAFT is a more unified objective.]
Definition 4.3 (RAFT loss) The RAFT loss $\mathcal{L}_{\mathrm{RAFT}}$ is defined as
$$\mathcal{L}_{\mathrm{RAFT}} \triangleq \alpha \mathcal{L}_{\mathrm{align}}(q_w \circ f_\theta; P_{\mathrm{pos}}) - \beta \mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta, f_\xi; \mathcal{X}_2),$$
where $\alpha, \beta > 0$ are constants and the other components follow Definition 4.2.

Compared to BYOL and BYOL′, RAFT better conforms to our prior knowledge and is a conceptually non-collapsing algorithm. A large body of work has demonstrated that weight averaging is roughly equal to sample averaging (Tarvainen & Valpola, 2017); thus, if two samples' representations are close to each other at the beginning and their initial updating directions are opposite, then RAFT consistently separates them in the representation space. All forms of loss terms can be classified into three categories: uniformity optimizers, effective regularizers for the alignment loss, and others (refer to Figure 2b). According to our experiments, when the predictor is removed, running away from the MT remains an effective regularizer for the alignment loss while BYOL's running towards the MT fails to do so; RAFT is thus of a more unified and consistent form. In summary, our proposed learning framework RAFT is motivated entirely by the intention of solving the predictor inconsistency of BYOL, and it is better than BYOL in three respects:

• Consistency. Compared to BYOL, our newly proposed method has an effective regularizer for the alignment loss regardless of the presence of the predictor.

• Interpretability. The Mean Teacher uses weight averaging and thus can be considered an approximate ensemble of the previous versions of the model. Running away from the Mean Teacher intuitively encourages diversity of the representations, which is positively correlated with uniformity.

• Disentanglement. The learning objective is decoupled into aligning two augmented views and running away from the Mean Teacher, and hence each term can be studied independently.
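Definition 4.3 translates into a few lines of numpy once the batch representations are computed. This is an illustrative sketch with hypothetical names; the symmetrized cross-model term follows the shorthand convention of Section 4.1:

```python
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def raft_loss(p1, p2, z1, z2, alpha=1.0, beta=1.0):
    # p_i = q_w(f_theta(t_i(x))): online predictions for the two views.
    # z_i = f_xi(t_i(x)): Mean-Teacher representations (no gradient in training).
    p1, p2 = l2_normalize(p1), l2_normalize(p2)
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    align = np.mean(np.sum((p1 - p2) ** 2, axis=1))
    # Symmetrized cross-model term, averaged over both views.
    cross = 0.5 * (np.mean(np.sum((p1 - z1) ** 2, axis=1))
                   + np.mean(np.sum((p2 - z2) ** 2, axis=1)))
    # RAFT: align the two views, run away from the Mean Teacher.
    return float(alpha * align - beta * cross)
```

The minus sign on the cross-model term is the entire difference from the BYOL′ objective: the loss decreases as the online network's outputs move away from the teacher's.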
We discuss the relationship between RAFT and BYOL in the next section, where we find that BYOL is a special form of RAFT under certain conditions, which makes the performance of BYOL a guarantee of the effectiveness of RAFT. We provide benchmarks of alignment, uniformity, and downstream linear evaluation performance on CIFAR10 (Table 3). We discover that balancing the alignment loss and the cross-model loss is not easy once the predictor is taken away: an imbalance between the two leads either to representation collapse or to over-regularized alignment, where every data point is randomly projected. One interesting research direction is to study the efficacy of the predictor; the reason why it helps the two terms reach an equilibrium remains to be answered.

5. UNDERSTANDING BYOL VIA RAFT

In Section 4.1 we derived an upper bound $\mathcal{L}_{\mathrm{BYOL'}}$ of $\mathcal{L}_{\mathrm{BYOL}}$ and explicitly extracted the two terms $\mathcal{L}_{\mathrm{align}}$ and $\mathcal{L}_{\mathrm{cross\text{-}model}}$. In BYOL′, the two terms are simultaneously minimized, while in RAFT, we minimize $\mathcal{L}_{\mathrm{align}}$ but maximize $\mathcal{L}_{\mathrm{cross\text{-}model}}$. To clearly distinguish the two objectives, we rewrite them as follows:
$$\mathcal{L}_{\mathrm{BYOL'}} = \alpha \mathcal{L}_{\mathrm{align}}(q_w \circ f_\theta) + \beta \mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta, f_\xi),$$
$$\mathcal{L}_{\mathrm{RAFT}} = \alpha \mathcal{L}_{\mathrm{align}}(q_w \circ f_\theta) - \beta \mathcal{L}_{\mathrm{cross\text{-}model}}(q_w \circ f_\theta, f_\xi),$$
where $\alpha, \beta > 0$ are constants. In form, $\mathcal{L}_{\mathrm{RAFT}}$ and $\mathcal{L}_{\mathrm{BYOL'}}$ optimize the second term in opposite directions, yet the empirical study shows that both of them work. How can two opposite optimization goals produce similar effects? Since RAFT is a conceptually working method, we analyze the mechanism of BYOL by establishing an equivalence between the parameters of BYOL′ and RAFT under mild conditions.

Theorem 5.1 (One-to-one correspondence between BYOL′ and RAFT) There is a one-to-one correspondence between the parameter trajectories of BYOL′ and RAFT when the following three conditions hold: (i) the representation space is a hypersphere; (ii) the predictor is a linear transformation, i.e. $q_w(\cdot) = W(\cdot)$; (iii) only the tangential component of the gradient on the hypersphere is preserved.

Proof We prove the theorem by construction. For the details, please refer to Appendix H.

Remark

The third condition conforms to the property of the hypersphere representation space and is easy to achieve: one can preserve only the tangential gradient by slightly modifying the loss. For example, suppose the representation of the input is $z$ and the representation of the MT is $z'$, both normalized; then the cross-model loss $\|z - z'\|_2^2$ can be revised as $\|z - \lambda z'\|_2^2 / \lambda$, where $\lambda = \mathrm{sg}(\langle z, z' \rangle)$ and $\mathrm{sg}(\cdot)$ stands for stopping the gradient of the inner product $\langle z, z' \rangle$. Our experiments in Table 1 demonstrate that preserving only the tangential component of the gradient does not turn any of the algorithms, BYOL, BYOL′, or RAFT, into a collapsed one. In Theorem 5.1, we show that optimizing $\mathcal{L}_{\mathrm{BYOL'}}$ with initial parameters $(\theta^{(0)}, W^{(0)})$ is equivalent to optimizing $\mathcal{L}_{\mathrm{RAFT}}$ with initial parameters $(\theta^{(0)}, -W^{(0)})$ when the aforementioned three conditions are satisfied. This equivalence means that the final encoder networks of the two trajectories are equal to each other. We therefore conclude that, as a representation learning framework, BYOL is equivalent to our newly proposed RAFT.

From a geometric point of view, the optimization process moves the data points in the representation space under the guidance of the training loss: the loss function measures the potential energy of the parameters, and the gradient with respect to the data points is the motive force. If the representation space is a hypersphere, as in BYOL, then the tangential force, i.e. the tangential component of the gradient, is the only thing that scatters or concentrates the data points in the representation space. By the central symmetry of the hypersphere, clockwise and counterclockwise moving directions are equivalent to some extent; for example, pushing a point by π/2 and pulling it by π/2 on the 2-dimensional sphere have the same effect. The equivalence between BYOL′ and RAFT offers us a direct way to understand some strange phenomena we observe, which are also reported in the original BYOL paper.
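Condition (iii) is also the geometric heart of the sign flip $W \leftrightarrow -W$: on the unit sphere, the tangential component of the gradient that pulls $u$ towards $v$ is exactly the negative of the one that pulls $u$ towards $-v$, so attraction to a sign-flipped teacher acts as repulsion from the teacher. A small numerical check, illustrative and not the paper's code:

```python
import numpy as np

def tangential(grad, u):
    # Keep only the component of grad orthogonal to u: the radial part
    # cannot move a point constrained to the unit hypersphere.
    return grad - np.dot(grad, u) * u

rng = np.random.default_rng(0)
u = rng.normal(size=5); u /= np.linalg.norm(u)   # online representation
v = rng.normal(size=5); v /= np.linalg.norm(v)   # teacher representation

# Gradient w.r.t. u of ||u - v||^2 (move towards v) and of ||u + v||^2
# (move towards -v, i.e. the sign-flipped teacher).
g_towards = tangential(2.0 * (u - v), u)
g_away = tangential(2.0 * (u + v), u)
```

`g_towards` and `g_away` are exact negatives, so descending one loss equals ascending the other once the radial component is discarded.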
First, the non-collapse of BYOL is explained, since RAFT is an intuitively and practically working algorithm. The equivalence between BYOL′ and RAFT when the predictor is linear helps us understand why BYOL is an effective self-supervised learning algorithm. It also answers our initial question of why BYOL fails to avoid representation collapse without the predictor: removing the predictor means fixing $W = I$, which breaks RAFT's design principle of running away from the MT. Second, although BYOL's optimization procedure takes the form of two models approaching each other, no convergence was reported in the original paper. The established equivalence explains this perfectly: RAFT incorporates the MT in an extremely dynamic way, since the online network continuously moves away from its history models, so the data points never converge, and neither do the parameters.

6. CONCLUSION AND FUTURE WORK

In this paper, we address the question of why the newly proposed self-supervised learning framework Bootstrap Your Own Latent (BYOL) works without negative samples. By decomposing, upper bounding, and approximating the original loss of BYOL, we establish another interpretable self-supervised learning method, Run Away From your Teacher (RAFT). We show that RAFT contains an explicit term that prevents representation collapse, and we empirically validate the effectiveness of RAFT. By constructing a one-to-one correspondence from RAFT to BYOL′ (a variant of BYOL), we explain the mechanism that makes BYOL work, which in turn implies the huge potential of our proposed RAFT. Based on these observations and conclusions, we offer several suggestions for future work.

Theoretical guarantees of RAFT. Though we have intuitively explained why running away from the MT is an effective regularizer, we do not provide theoretical guarantees that optimizing RAFT is favorable for representation learning. In the future, one could try to relate RAFT to the theory of Mutual Information (MI) maximization (Belghazi et al., 2018; Hjelm et al., 2018; Tschannen et al., 2019), as the InfoNCE training objective of contrastive learning has been proven to be a lower bound of the MI (Poole et al., 2019). One detail should be noted when attempting to correlate RAFT with MI maximization: even though RAFT is an effective regularizer, it fails to yield good-quality representations when the predictor is removed, so any theoretical proof of the effectiveness of RAFT should also explain the mechanism behind this extra predictor.

On the efficacy of the predictor. It has become a popular and almost standardized practice to add an extra MLP on top of the network in contrastive learning methods (Chen et al., 2020a;b; Grill et al., 2020), while most work adopts it as a special trick without considering the effect this MLP brings to the algorithm.
In this paper, however, we find that this extra MLP may bring unexpected properties to the original training objective: although the representations are optimized under disparate motivations (in our paper, BYOL running towards the MT and RAFT running away from it), the encoder network is trained to be exactly the same. This observation indicates that the mechanism by which the extra MLP acts on the network needs further study.

B BACKGROUND AND RELATED WORK B.1 CONTRASTIVE LEARNING

Contrastive methods rely on the assumption that two views of the same data point share information and thus form a positive pair. By separating the positives from the negatives, the neural network trained by the algorithm learns to extract the most useful information from the data and performs better on downstream tasks. Typically, the algorithm uses the InfoNCE objective:
$$\mathcal{L}_{\mathrm{contrast}}(h, K) = \mathbb{E}_{\substack{(x, x^+) \sim P_{\mathrm{pos}} \\ \{x^-_i\}_{i=1}^K \sim \mathcal{X}^K}} \left[ -\log \frac{e^{h(x, x^+)}}{e^{h(x, x^+)} + \sum_{i=1}^K e^{h(x, x^-_i)}} \right],$$
where $(x, x^+)$ is sampled from the positive-pair distribution $P_{\mathrm{pos}}$, which is built from a series of data augmentation functions [ref]. The negative samples $\{x^-_i\}_{i=1}^K$ are sampled i.i.d. $K$ times from the data distribution $\mathcal{X}$, and the function $h(x, y)$ measures the similarity between the two inputs $(x, y)$. Empirically, for the sake of symmetry, the measurement function $h(x, y) = d(f(x), f(y))$ consists of an encoder $f(\cdot)$ and a similarity metric $d(\cdot, \cdot)$ evaluating how close the two representations are.
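Given precomputed similarities, the InfoNCE objective reduces to a cross-entropy over one positive and K negative logits. The helper below is a hypothetical sketch, assuming $h$ has already been evaluated for each anchor:

```python
import numpy as np

def info_nce(sim_pos, sim_neg):
    # sim_pos: shape (N,), similarities h(x, x+) for each anchor.
    # sim_neg: shape (N, K), similarities h(x, x-_i) to the K negatives.
    logits = np.concatenate([sim_pos[:, None], sim_neg], axis=1)
    # Numerically stable -log softmax of the positive logit.
    m = logits.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.sum(np.exp(logits - m), axis=1))
    return float(np.mean(log_denom - sim_pos))
```

The loss is always non-negative and shrinks towards 0 as the positive similarity dominates the negatives.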

B.2 MEAN TEACHER

There is one type of semi-supervised learning method that BYOL constantly reminds people of: Mean Teacher (MT) (Tarvainen & Valpola, 2017; Laine & Aila, 2016). Like BYOL, MT adopts a Teacher-Student (T-S) framework in which the teacher network is the EMA of the student network. An additional consistency loss between the teacher and the student is applied alongside the supervised signal. A large body of work demonstrates the efficacy of MT (Athiwaratkun et al., 2019; Novak et al., 2018; Chaudhari et al., 2019), the major conclusion being that the consistency loss between the student and its MT acts as a regularizer for better generalization. These proven properties of MT might lead us to focus on how the online network's learning from the MT effectively regularizes $\mathcal{L}_{\mathrm{align}}$ in BYOL. In this paper, however, we propose the opposite way of leveraging the MT in contrastive methods.
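The MT update itself is one line per parameter tensor. A minimal sketch, in which the parameter dicts and the τ value are illustrative:

```python
import numpy as np

def mean_teacher_update(teacher, student, tau=0.996):
    # xi <- tau * xi + (1 - tau) * theta, applied tensor by tensor:
    # tau controls how much of its own history the teacher preserves.
    return {name: tau * teacher[name] + (1.0 - tau) * student[name]
            for name in teacher}
```

Because τ is close to 1, the teacher is a slowly moving average, an approximate ensemble, of the student's past weights.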

C EXPERIMENTAL SETUP

Dataset Our main goal is to unravel the mystery of why BYOL does not collapse during training and to solve the predictor inconsistency. The most important metric is whether the algorithm collapses or not, and we do not aim to develop a more powerful self-supervised learning algorithm that surpasses SOTA on large datasets. In this respect, we limit our experiments to the CIFAR10 dataset. Each image is resized from 32 × 32 to 96 × 96. This change is the consequence of a tradeoff between the effect of the data augmentation and the batch size: a larger image allows a more subtle and informative data augmentation scheme, but it also reduces the training batch size, which has been empirically shown to be harmful to model performance.

Model architecture

In our experiments, the model is composed of three stages: an encoder f θ that adopts the ResNet18 architecture (without the classifier on top); a projector g θ comprising a linear layer with output size 512, batch normalization, rectified linear units (ReLU), and a final linear layer with output size 128; and a predictor q w with the same architecture as the projector but without batch normalization.

Training We adopt the same data augmentation scheme as Chen et al. (2020a) and Grill et al. (2020) and train BYOL on the training set for 300 epochs with batch size 128 over 3 random seeds. The training objective is specified accordingly, and the model is trained with the Adam optimizer with learning rate 3 × 10^-4 (Kingma & Ba, 2014). Unless stated otherwise, we update the target network with the EMA rate 4 × 10^-3 without the cosine smoothing trick.

Evaluation After training, we evaluate the encoder's performance with the widely adopted linear evaluation protocol: we fix the parameters of the encoder and train a linear classifier on top of it using all the training labels for 100 epochs with learning rate 5 × 10^-4. The final classification accuracy indicates to what degree the representations of the same class concentrate and the representations of different classes separate, and thus reflects the quality of the representation.
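The linear evaluation protocol above can be sketched as follows. This is an illustrative numpy implementation with a hypothetical `encode` callable standing in for the frozen encoder; the hyperparameters are arguments, not the paper's exact training pipeline:

```python
import numpy as np

def linear_eval(encode, X_train, y_train, n_classes, epochs=100, lr=5e-4):
    # Linear evaluation protocol: freeze the encoder, then fit a linear
    # softmax classifier on the frozen representations with full labels.
    Z = encode(X_train)                        # frozen representations
    W = np.zeros((Z.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y_train]
    for _ in range(epochs):
        logits = Z @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(Z)       # softmax cross-entropy gradient
        W -= lr * (Z.T @ grad)
        b -= lr * grad.sum(axis=0)
    return lambda X: (encode(X) @ W + b).argmax(axis=1)
```

Only `W` and `b` are trained; the resulting accuracy therefore measures how linearly separable the frozen representations already are.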

D TABLES

α and β are the weights of L_align(q_w ∘ f_θ) and L_cross-model(q_w ∘ f_θ, f_ξ), i.e. L = α L_align(q_w ∘ f_θ) + β L_cross-model(q_w ∘ f_θ, f_ξ). All quantifiable metrics are evaluated after 300 epochs of training on the training set of CIFAR10. Compared to BYOL′, our proposed RAFT regularizes L_align more effectively.

[Table columns: Model, q_w, α, β, L_align(q_w ∘ f_θ), L_uniform(f_θ), Linear Evaluation Protocol (%); the rows, beginning with the Rand-Baseline, were lost in extraction.]

E ALGORITHMS

Algorithm 1: RAFT: Run Away From your Teacher

Inputs:
X, T_1, T_2: set of images and distributions of transformations
θ and f_θ: model parameters and encoder
w and q_w: predictor parameters and predictor
ξ and f_ξ: MT parameters and MT
optimizer: updates online parameters using the loss gradient
K and N: total number of optimization steps and batch size
{τ_k}_{k=1}^{K} and {η_k}_{k=1}^{K}: target network update schedule and learning rate schedule

1   for k = 1 to K do
2       B ← {x_i}_{i=1}^{N} ∼ X^N                                  // sample a batch of N images
3       for x_i ∈ B do
4           t_1 ∼ T_1 and t_2 ∼ T_2                                // sample image transformations
5           z_1 ← q_w(f_θ(t_1(x_i))) and z_2 ← q_w(f_θ(t_2(x_i)))  // reps for model
6           z′_1 ← f_ξ(t_1(x_i)) and z′_2 ← f_ξ(t_2(x_i))          // reps for MT
7           l_i = ‖ z_1/‖z_1‖_2 − z_2/‖z_2‖_2 ‖_2^2                // loss for alignment
8           l′_i = −(1/2) ( ‖ z_1/‖z_1‖_2 − z′_1/‖z′_1‖_2 ‖_2^2 + ‖ z_2/‖z_2‖_2 − z′_2/‖z′_2‖_2 ‖_2^2 )   // loss for cross-model
9       end
10      δθ ← (1/N) Σ_{i=1}^{N} ( ∂_θ l_i + ∂_θ l′_i )              // compute the loss gradient w.r.t. θ
11      θ ← optimizer(θ, δθ, η_k)                                  // update trainable parameters
12      ξ ← τ_k ξ + (1 − τ_k) θ                                    // update MT parameters
    end

[Figure caption fragment, displaced here by extraction:] ... will not cause the collapse even though I is an apparent solution for collapse. Furthermore, initializing the linear predictor with I forces the loss to approach 0 quickly at the beginning, while the model recovers from the seeming collapse after 10-20 epochs of training (orange curve, BYOL-LPI). (b) Evolution of the representation uniformity. BYOL with a predictor consistently optimizes the uniformity of the representation distribution, even though uniformity is not explicitly included in the loss. Interestingly, with a linear predictor the uniformity loss is optimized at a constant rate (green curve, BYOL-LP; orange curve, BYOL-LPI) after a certain phase of training. (c) Linear evaluation protocol on CIFAR10. Different predictor structures give close performance on the downstream classification task.
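The per-sample losses of lines 7-8 can be sketched in NumPy as follows (a minimal sketch with our own variable names; z1_mt and z2_mt denote the Mean Teacher representations):

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    return v / (np.linalg.norm(v) + eps)

def raft_losses(z1, z2, z1_mt, z2_mt):
    """Per-sample RAFT losses on l2-normalized representations.

    z1, z2:       online predictions q_w(f_theta(t1(x))), q_w(f_theta(t2(x)))
    z1_mt, z2_mt: Mean Teacher representations f_xi(t1(x)), f_xi(t2(x))
    """
    n1, n2 = l2_normalize(z1), l2_normalize(z2)
    m1, m2 = l2_normalize(z1_mt), l2_normalize(z2_mt)
    l_align = np.sum((n1 - n2) ** 2)            # pull the two views together
    l_cross = -0.5 * (np.sum((n1 - m1) ** 2)
                      + np.sum((n2 - m2) ** 2)) # run away from the MT
    return l_align, l_cross
```

Note the minus sign in the cross-model term: RAFT maximizes the distance to the Mean Teacher, which is the explicit anti-collapse mechanism.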

G PROOF OF BYOL′ UPPER BOUNDING

In this section, we derive the upper bound of $\mathcal{L}_{\mathrm{BYOL}}$. For the sake of simplicity, and without loss of rigor, we write $x_1 = t_1(x)$ for the transformed input $x$.

$$
\mathcal{L}_{\mathrm{BYOL}}
= \mathbb{E}_{x\sim\mathcal{X},\, t_1\sim\mathcal{T}_1,\, t_2\sim\mathcal{T}_2}
  \big\| q_w(f_\theta(t_1(x))) - f_\xi(t_2(x)) \big\|_2^2
= \mathbb{E}\, \big\| q_w(f_\theta(x_1)) - q_w(f_\theta(x_2)) + q_w(f_\theta(x_2)) - f_\xi(x_2) \big\|_2^2. \tag{12}
$$

Applying the Cauchy-Schwarz inequality to Eq. 12 yields

$$
\mathcal{L}_{\mathrm{BYOL}}
\le \Big(1+\tfrac{1}{\lambda}\Big)\, \mathbb{E}\,\big\|q_w(f_\theta(x_1)) - q_w(f_\theta(x_2))\big\|_2^2
  + (1+\lambda)\, \mathbb{E}\,\big\|q_w(f_\theta(x_2)) - f_\xi(x_2)\big\|_2^2
= \Big(1+\tfrac{1}{\lambda}\Big)\big(\mathcal{L}_{\mathrm{align}}(q_w\circ f_\theta; P_{\mathrm{pos}}) + \lambda\, \mathcal{L}_{\mathrm{cross\text{-}model}}(q_w\circ f_\theta, f_\xi, \mathcal{X})\big), \tag{13}
$$

which holds for any $\lambda > 0$, where the positive-pair distribution $P_{\mathrm{pos}}$ is given by the chain rule of conditional probability: $P_{\mathrm{pos}}(x_1, x_2) = \mathcal{X}(x)\cdot\mathcal{T}_1(t_1|x)\cdot\mathcal{T}_2(t_2|x)$. For any given pair $\alpha, \beta > 0$, we let $\lambda = \beta/\alpha$ and substitute it back into Eq. 13, yielding

$$
\mathcal{L}_{\mathrm{BYOL}}
\le \Big(1+\tfrac{\alpha}{\beta}\Big)\Big[\mathcal{L}_{\mathrm{align}}(q_w\circ f_\theta; P_{\mathrm{pos}}) + \tfrac{\beta}{\alpha}\, \mathcal{L}_{\mathrm{cross\text{-}model}}(q_w\circ f_\theta, f_\xi, \mathcal{X})\Big]
= \Big(\tfrac{1}{\alpha}+\tfrac{1}{\beta}\Big)\big[\alpha\, \mathcal{L}_{\mathrm{align}} + \beta\, \mathcal{L}_{\mathrm{cross\text{-}model}}\big]
= \Big(\tfrac{1}{\alpha}+\tfrac{1}{\beta}\Big)\mathcal{L}_{\mathrm{BYOL'}},
$$

and as an optimization objective,

$$
\min \Big(\tfrac{1}{\alpha}+\tfrac{1}{\beta}\Big)\mathcal{L}_{\mathrm{BYOL'}} \;\Leftrightarrow\; \min \mathcal{L}_{\mathrm{BYOL'}}.
$$

Therefore we have proven that $\mathcal{L}_{\mathrm{BYOL'}}$, as an optimization objective, upper-bounds $\mathcal{L}_{\mathrm{BYOL}}$. Note that one can instead subtract and add the term $f_\xi(x_1)$ to form the alignment loss on the side of the MT $f_\xi$, $\mathcal{L}_{\mathrm{BYOL}} = \mathbb{E}\,\|q_w(f_\theta(x_1)) - f_\xi(x_1) + f_\xi(x_1) - f_\xi(x_2)\|_2^2$; however, this does not help, since the alignment constraint on the MT side generates no gradients.
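The inequality applied above can be checked numerically; the helper below (our own) verifies the elementary bound behind Eq. 13 on random vectors:

```python
import numpy as np

def byol_upper_bound_holds(a, b, lam):
    """Check ||a + b||^2 <= (1 + 1/lam) * (||a||^2 + lam * ||b||^2).

    This is the inequality behind Eq. 13: it follows from
    2 <a, b> <= ||a||^2 / lam + lam * ||b||^2 (AM-GM / Cauchy-Schwarz),
    since (1 + lam) = lam * (1 + 1/lam).
    """
    lhs = np.sum((a + b) ** 2)
    rhs = (1.0 + 1.0 / lam) * (np.sum(a ** 2) + lam * np.sum(b ** 2))
    return bool(lhs <= rhs + 1e-9)
```

Here a plays the role of the alignment residual and b the cross-model residual; the bound holds for every λ > 0, which is what licenses the later choice λ = β/α.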

H PROOF OF ONE-TO-ONE CORRESPONDENCE BETWEEN BYOL AND RAFT

Theorem (One-to-one correspondence between BYOL′ and RAFT). There is a one-to-one correspondence between the parameter trajectories of BYOL′ and RAFT when the following three conditions hold: (i) the representation space is a hypersphere; (ii) the predictor is a linear transformation, i.e. $q_w(\cdot) = W(\cdot)$; (iii) only the tangential component of the gradient on the hypersphere is preserved.

Without loss of generality, suppose $x_1 = t_1(x)$ and $x_2 = t_2(x)$ where $x$ is an arbitrary input, the batch size is 1, and $(\alpha, \beta) = (1, 1)$. We initialize BYOL′ and RAFT with parameters $(\tilde\theta, \tilde W) = (\theta^{(0)}, W^{(0)})$ and $(\theta, W) = (\theta^{(0)}, -W^{(0)})$ respectively. For convenience, we assume the dot product "$\cdot$" ignores the row or column layout in the chain rule of derivatives, and we define the following symbols:

$$
z'_2 = f_\xi(x_2), \quad
z_1 = W f_\theta(x_1), \quad
z_2 = W f_\theta(x_2), \quad
\tilde z_1 = \tilde W f_{\tilde\theta}(x_1), \quad
\tilde z_2 = \tilde W f_{\tilde\theta}(x_2). \tag{17-21}
$$

Based on the notation defined, we rewrite the loss terms of BYOL′ and RAFT as follows:

$$
\mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{align}} = \|\tilde z_1 - \tilde z_2\|_2^2, \quad
\mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{cross\text{-}model}} = \|\tilde z_2 - z'_2\|_2^2, \quad
\mathcal{L}^{\mathrm{RAFT}}_{\mathrm{align}} = \|z_1 - z_2\|_2^2, \quad
\mathcal{L}^{\mathrm{RAFT}}_{\mathrm{cross\text{-}model}} = -\|z_2 - z'_2\|_2^2. \tag{22}
$$

The two objectives are

$$
\mathcal{L}^{\mathrm{BYOL'}} = \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{align}} + \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{cross\text{-}model}}, \qquad
\mathcal{L}^{\mathrm{RAFT}} = \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{align}} + \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{cross\text{-}model}}.
$$

We claim that under the third condition, the following equations hold:

$$
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}}{\partial \tilde\theta}\Big)_{\parallel} = \Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}}{\partial \theta}\Big)_{\parallel}, \qquad
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}}{\partial \tilde W}\Big)_{\parallel} = -\Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}}{\partial W}\Big)_{\parallel},
$$

where the subscript $\parallel$ denotes the tangential component of the gradient. First, we show the equivalence with respect to $\theta$.
Differentiating $\mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{align}}$ and $\mathcal{L}^{\mathrm{RAFT}}_{\mathrm{align}}$ with respect to $\tilde\theta_{ij}$ and $\theta_{ij}$ respectively, we obtain

$$
\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{align}}}{\partial \tilde\theta_{ij}}
= 2\big[(\tilde z_1 - \tilde z_2)_{\parallel} + (\tilde z_1 - \tilde z_2)_{\perp}\big] \cdot \Big(\frac{\partial \tilde z_1}{\partial \tilde\theta_{ij}} - \frac{\partial \tilde z_2}{\partial \tilde\theta_{ij}}\Big), \qquad
\frac{\partial \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{align}}}{\partial \theta_{ij}}
= 2\big[(z_1 - z_2)_{\parallel} + (z_1 - z_2)_{\perp}\big] \cdot \Big(\frac{\partial z_1}{\partial \theta_{ij}} - \frac{\partial z_2}{\partial \theta_{ij}}\Big),
$$

and keeping only the tangential components,

$$
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{align}}}{\partial \tilde\theta_{ij}}\Big)_{\parallel}
= 2 (\tilde z_1 - \tilde z_2)_{\parallel} \cdot \Big(\frac{\partial \tilde z_1}{\partial \tilde\theta_{ij}} - \frac{\partial \tilde z_2}{\partial \tilde\theta_{ij}}\Big), \qquad
\Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{align}}}{\partial \theta_{ij}}\Big)_{\parallel}
= 2 (z_1 - z_2)_{\parallel} \cdot \Big(\frac{\partial z_1}{\partial \theta_{ij}} - \frac{\partial z_2}{\partial \theta_{ij}}\Big),
$$

where $(\tilde z_1 - \tilde z_2)$ and $(z_1 - z_2)$ are vectors at the points $\tilde z_2$ and $z_2$ on the hypersphere, decomposed into the tangential (denoted by $\parallel$) and normal (denoted by $\perp$) components:

$$
(\tilde z_1 - \tilde z_2) = (\tilde z_1 - \tilde z_2)_{\parallel} + (\tilde z_1 - \tilde z_2)_{\perp}, \qquad
(z_1 - z_2) = (z_1 - z_2)_{\parallel} + (z_1 - z_2)_{\perp}.
$$

Generally, suppose $z$ is a unit vector starting at the origin, so that it is perpendicular to the unit hypersphere at the point $z$; for any vector $v$ starting at the point $z$, we have

$$
v_{\perp} = \langle v, z \rangle\, z, \qquad
v_{\parallel} = v - v_{\perp} = v - \langle v, z \rangle\, z.
$$

Then we can compute the tangential components:

$$
(\tilde z_1 - \tilde z_2)_{\parallel} = (\tilde z_1 - \tilde z_2) - \langle \tilde z_1 - \tilde z_2, \tilde z_2 \rangle\, \tilde z_2 = \tilde z_1 - \langle \tilde z_1, \tilde z_2 \rangle\, \tilde z_2, \qquad
(z_1 - z_2)_{\parallel} = (z_1 - z_2) - \langle z_1 - z_2, z_2 \rangle\, z_2 = z_1 - \langle z_1, z_2 \rangle\, z_2. \tag{35}
$$

Because of the initialization, $z_1 = -\tilde z_1$ and $z_2 = -\tilde z_2$, therefore

$$
(z_1 - z_2)_{\parallel} = -(\tilde z_1 - \tilde z_2)_{\parallel}, \qquad
\frac{\partial z_1}{\partial \theta_{ij}} - \frac{\partial z_2}{\partial \theta_{ij}} = -\Big(\frac{\partial \tilde z_1}{\partial \tilde\theta_{ij}} - \frac{\partial \tilde z_2}{\partial \tilde\theta_{ij}}\Big).
$$

So we have shown that

$$
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{align}}}{\partial \tilde\theta_{ij}}\Big)_{\parallel} = \Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{align}}}{\partial \theta_{ij}}\Big)_{\parallel}. \tag{38}
$$

Differentiating $\mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{cross\text{-}model}}$ and $\mathcal{L}^{\mathrm{RAFT}}_{\mathrm{cross\text{-}model}}$ with respect to $\tilde\theta_{ij}$ and $\theta_{ij}$ respectively, we obtain

$$
\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{cross\text{-}model}}}{\partial \tilde\theta_{ij}}
= -2 (z'_2 - \tilde z_2) \cdot \tilde W \cdot \frac{\partial f_{\tilde\theta}(x_2)}{\partial \tilde\theta_{ij}}, \qquad
\frac{\partial \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{cross\text{-}model}}}{\partial \theta_{ij}}
= 2 (z'_2 - z_2) \cdot W \cdot \frac{\partial f_\theta(x_2)}{\partial \theta_{ij}}.
$$

Similar to Eq. 35, we derive that $(z'_2 - z_2)_{\parallel} = (z'_2 - \tilde z_2)_{\parallel}$. Since $\theta = \tilde\theta$, $\partial f_\theta(x_2)/\partial \theta_{ij} = \partial f_{\tilde\theta}(x_2)/\partial \tilde\theta_{ij}$, and $W = -\tilde W$, we have

$$
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{cross\text{-}model}}}{\partial \tilde\theta_{ij}}\Big)_{\parallel} = \Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{cross\text{-}model}}}{\partial \theta_{ij}}\Big)_{\parallel}. \tag{42}
$$

Therefore, by Eq. 38 and Eq. 42, RAFT's update of the parameter $\theta$ equals BYOL′'s:

$$
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}}{\partial \tilde\theta}\Big)_{\parallel} = \Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}}{\partial \theta}\Big)_{\parallel}. \tag{43}
$$
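The decomposition in Eq. 35 and the sign-flip argument under the initialization $(θ, W) = (θ^{(0)}, -W^{(0)})$ can be checked numerically (a NumPy sketch with our own names):

```python
import numpy as np

def tangential(v, z):
    """Tangential component of v at the point z on the unit hypersphere:
    v_par = v - <v, z> z  (z must be a unit vector)."""
    return v - np.dot(v, z) * z
```

Flipping the sign of both representations flips the tangential component of their difference, which is the key step behind Eq. 36.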
Also, differentiating $\mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{align}}$ and $\mathcal{L}^{\mathrm{RAFT}}_{\mathrm{align}}$ with respect to $\tilde W_{ij}$ and $W_{ij}$ respectively, we obtain

$$
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{align}}}{\partial \tilde W_{ij}}\Big)_{\parallel}
= 2 (\tilde z_1 - \tilde z_2)_{\parallel} \cdot \Big(\frac{\partial \tilde z_1}{\partial \tilde W_{ij}} - \frac{\partial \tilde z_2}{\partial \tilde W_{ij}}\Big), \qquad
\Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{align}}}{\partial W_{ij}}\Big)_{\parallel}
= 2 (z_1 - z_2)_{\parallel} \cdot \Big(\frac{\partial z_1}{\partial W_{ij}} - \frac{\partial z_2}{\partial W_{ij}}\Big).
$$

I NON-TRIVIAL SOLUTIONS CREATED BY PREDICTOR

Suppose the inputs $x_1 = t_1(x)$ and $x_2 = t_2(x)$ are $n$-dimensional, and in the linear model, $f_\theta$, $f_\xi$ and $q_w$ are parameterized by the matrices $(\theta_{ij})_{n\times n}$, $(\xi_{ij})_{n\times n}$ and $(W_{ij})_{m\times n}$ respectively. The objective is

$$
\mathcal{L}_{\mathrm{BYOL}} = \mathbb{E}_{(x, t_1, t_2)\sim(\mathcal{X}, \mathcal{T}_1, \mathcal{T}_2)} \big\| q_w(f_\theta(t_1(x))) - f_\xi(t_2(x)) \big\|^2.
$$

When the weights of the target $\xi$ converge, we have $\xi^{(k)} = \xi^{(k+1)}$ in the updating rule

$$
\xi^{(k+1)} = \tau_k\, \xi^{(k)} + (1 - \tau_k)\, \theta^{(k)},
$$

which gives $\theta^{(k)} = \xi^{(k+1)} = \xi^{(k)}$.



Figure1: Framework diagram of RAFT and BYOL. The online network is composed of an encoder f θ and an extra predictor q w . The Mean Teacher f ξ is the EMA of the encoder f θ . In BYOL, the loss is computed by minimizing the distance between the prediction of one view x 1 and another view x 2 's representation generated by the MT. In RAFT, we optimize two objectives together: (i) minimize the representation distance between two samples from a positive pair and (ii) maximize the representation distance between the online network and its MT.

Figure 2: Analysis of the legitimacy of RAFT and why it is more favorable. (a) Diagram demonstrating how RAFT conceptually works: if two samples' updating directions are opposite, the MT helps push them away over the next several iterations. Here z̃_i = q_w(f_θ(x_i)), z_i = f_ξ(x_i), i = 1, 2. (b) Diagram of objective categories with respect to their effect in constraining the alignment loss. Among contrastive methods, the most favorable objectives actively optimize uniformity, which includes both BYOL and RAFT when a predictor exists. The second most favorable objectives are effective regularizers of alignment. RAFT still restrains the alignment loss without a predictor, while BYOL fails to do so, which implies that RAFT is a more unified objective.

Figure 3: Visualization of the evolution of the representation distribution on the CIFAR10 training set. We project the representation f_θ(x) to 2-D using PCA (Wold et al., 1987) and then normalize to a unit sphere. The width of the circle shows the density of the data points projected to that particular position. Two dots residing on either side of the blue line across the circle represent two augmented views of the same data. (a) Supervised learning places no restriction on the uniformity of the representation space. (b) BYOL with a predictor evenly projects the data to different positions. (c) BYOL w/o predictor tends to project a huge portion of the data to the same position. (d) BYOL's upper bound BYOL′ also effectively disperses representations on the sphere. (e) Our RAFT shows that minimizing/maximizing L_cross-model has a similar effect on the final representation distribution.

Figure 5: The evolution traces of L_align(q_w ∘ f_θ, f_ξ) and L_uniform(f_θ) in BYOL′ and our proposed RAFT. (a) Evolution trace of BYOL′-NP. Increasing β (the weight of L_cross-model, the regularizer for L_align) does not prevent the failed regularization: L_align quickly converges to 0. (b) Evolution trace of RAFT-NP. A small value of β does not effectively regularize L_align, but increasing the weight helps. In this respect, RAFT is a more effective regularizer, while its uniformity optimization does not differ much from BYOL′-NP. (c) Evolution trace of RAFT-LP. With the linear predictor, in contrast to RAFT-NP, the uniformity is optimized consistently during training, which hints at a deeper rationale for the existence of the predictor.
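Since several figures track L_uniform, here is a small NumPy sketch of the uniformity metric of Wang & Isola (2020), which we assume L_uniform denotes (the temperature t = 2 is the common choice in that paper; our assumption):

```python
import numpy as np

def l_uniform(reps, t=2.0):
    """Uniformity metric of Wang & Isola (2020):
    log E exp(-t * ||z_i - z_j||^2) over distinct pairs of
    l2-normalized representations. Lower (more negative) is better;
    a fully collapsed representation gives exactly 0.
    """
    z = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    # squared pairwise distances via the Gram matrix
    sq = np.sum(z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)
    iu = np.triu_indices(len(z), k=1)  # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * d2[iu]))))
```

This makes the collapse diagnostic concrete: if all representations land on the same point, every pairwise distance is 0 and the metric saturates at its worst value, 0.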

Differentiating with respect to $\theta_{ij}$ and $W_{ij}$, we have

$$
\frac{\partial \mathcal{L}}{\partial \theta_{ij}}
= \sum_{k=1}^{m} \frac{\partial \big[W_{k,:}(\theta x_1) - \xi_{k,:} x_2\big]^2}{\partial \theta_{ij}}
= \sum_{k=1}^{m} 2\, T_k\, W_{k,i}\, (x_1)_j
= 2\, \big(W^{\top} T\, x_1^{\top}\big)_{ij}, \tag{58}
$$

$$
\frac{\partial \mathcal{L}}{\partial W_{ij}}
= \sum_{k=1}^{m} 2\, T_k\, (\theta x_1)_j\, \mathbb{1}_{\{k=i\}}
= 2\, T_i\, \theta_{j,:}\, x_1
= 2\, \big(T (\theta x_1)^{\top}\big)_{ij}, \tag{59}
$$

where $T_k = W_{k,:}(\theta x_1) - \xi_{k,:}\, x_2$, $T = (T_1, T_2, \ldots, T_m)^{\top} = W(\theta x_1) - \xi x_2$, and $W_{k,:}$, $\xi_{k,:}$ are the $k$-th rows of $W$ and $\xi$ respectively.
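These closed-form gradients can be verified against finite differences (a NumPy sketch; for simplicity we take the square case m = n and random stand-in matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
theta = rng.standard_normal((n, n))
xi = rng.standard_normal((n, n))
W = rng.standard_normal((n, n))
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)

def loss(theta, W):
    # L = || W (theta x1) - xi x2 ||^2
    return float(np.sum((W @ (theta @ x1) - xi @ x2) ** 2))

T = W @ (theta @ x1) - xi @ x2            # T = W(theta x1) - xi x2
grad_theta = 2.0 * np.outer(W.T @ T, x1)  # Eq. 58: dL/dtheta = 2 (W^T T) x1^T
grad_W = 2.0 * np.outer(T, theta @ x1)    # Eq. 59: dL/dW = 2 T (theta x1)^T
```

Perturbing a single entry of theta or W and comparing the central difference of the loss with the corresponding gradient entry confirms the formulas.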

Evaluation results of BYOL′ variants on CIFAR10 after 300 epochs of training, used as evidence supporting the proposal of RAFT. (α, β) = (1, 1) is set in BYOL′ and RAFT. L_align(q_w ∘ f_θ) and L_uniform(f_θ) are evaluated by averaging over the last 10 epochs of training. We highlight the overly optimized L_align and the failed L_uniform in this table. For the accuracy under the linear evaluation protocol, we highlight only the entries that underperform the random baseline.

Look-up table for the models that appear in the paper.
BYOL′-MLPP (BYOL′): trained with L_BYOL′ = L_align + L_cross-model, MLP Predictor
BYOL′-LP: trained with L_BYOL′ = L_align + L_cross-model, Linear Predictor
BYOL′-NP: trained with L_BYOL′ = L_align + L_cross-model, No Predictor
TanBYOL′-LP: BYOL′ with Linear Predictor, preserving only the Tangential gradient

Evaluation results of RAFT on CIFAR10. α and β represent the weights of L_align and L_cross-model respectively.

A MAIN THREAD OF PAPER

The proposal of RAFT is based on a series of theoretical derivations and empirical approximations. The logic chain of our paper is therefore fundamental to the legitimacy of our explanation of BYOL and to the merits of our newly proposed RAFT. Here we organize our main thread in the same order as the sections, to provide readers with a clear view.

In Section 3,
• As a learning framework, BYOL does not work consistently: it heavily relies on the existence of the predictor. We want to understand why this inconsistency exists.
• The architecture of the predictor does not affect whether BYOL collapses; the fact that the linear predictor q_w(·) = W(·) prevents collapse will be used as a crucial condition in Section 5.

In Section 4.1 and Section 4.2,
• Maximizing the cross-model loss works as well, which incorporates the cross-model loss in the exactly opposite way to BYOL′.
• Based on the observation above, we propose a new self-supervised learning approach, Run Away From your Teacher, which regularizes the alignment loss by maximizing the distance between the online network and its Mean Teacher. Compared with BYOL, RAFT accords more with our common understanding.
• Additional experiments show that without the predictor, BYOL′ fails to regularize L_align(f_θ), let alone optimize uniformity. On the contrary, although not able to actively optimize uniformity either, RAFT's maximizing of L_cross-model(f_θ, f_ξ) remains an effective regularizer for L_align(f_θ), which makes it more favorable (Figure 2b).

In Section 5,
• We prove that when the predictor is linear (q_w = W) and the representation space is a hypersphere where only the tangential component of the gradient is preserved during training, minimizing L_cross-model(W ∘ f_θ, f_ξ) and maximizing it obtain the same encoder f_θ.
• Based on the equivalence above, we conclude that BYOL′ is a special case of RAFT under the conditions above. The established equivalence helps explain several counterintuitive behaviors of BYOL.

[Continuation of the proof in Appendix H, displaced here by extraction:] Note that $z_1 = -\tilde z_1$, $z_2 = -\tilde z_2$, and, similar to Eq. 35, it is easy to show that

$$
(z_1 - z_2)_{\parallel} = -(\tilde z_1 - \tilde z_2)_{\parallel}.
$$

Also, since $\theta = \tilde\theta$,

$$
\frac{\partial z_1}{\partial W_{ij}} - \frac{\partial z_2}{\partial W_{ij}} = \frac{\partial \tilde z_1}{\partial \tilde W_{ij}} - \frac{\partial \tilde z_2}{\partial \tilde W_{ij}}.
$$

So we have

$$
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{align}}}{\partial \tilde W_{ij}}\Big)_{\parallel} = -\Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{align}}}{\partial W_{ij}}\Big)_{\parallel}. \tag{49}
$$

Differentiating $\mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{cross\text{-}model}}$ and $\mathcal{L}^{\mathrm{RAFT}}_{\mathrm{cross\text{-}model}}$ with respect to $\tilde W_{ij}$ and $W_{ij}$ respectively, we similarly obtain

$$
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}_{\mathrm{cross\text{-}model}}}{\partial \tilde W_{ij}}\Big)_{\parallel} = -\Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}_{\mathrm{cross\text{-}model}}}{\partial W_{ij}}\Big)_{\parallel}. \tag{52}
$$

By Eq. 49 and Eq. 52, we prove that, with respect to the predictor weights, BYOL′ generates the opposite gradient to RAFT, namely,

$$
\Big(\frac{\partial \mathcal{L}^{\mathrm{BYOL'}}}{\partial \tilde W}\Big)_{\parallel} = -\Big(\frac{\partial \mathcal{L}^{\mathrm{RAFT}}}{\partial W}\Big)_{\parallel}. \tag{53}
$$

Therefore, by the two main conclusions Eq. 43 and Eq. 53, for BYOL′ with parameters $(\tilde\theta, \tilde W) = (\theta^{(0)}, W^{(0)})$ and RAFT with parameters $(\theta, W) = (\theta^{(0)}, -W^{(0)})$, one update step gives $\tilde\theta^{(1)} = \theta^{(1)}$ and $\tilde W^{(1)} = -W^{(1)}$, and furthermore $\tilde\theta^{(k)} = \theta^{(k)}$, $\tilde W^{(k)} = -W^{(k)}$ at any iteration $k$. In this way, we establish a one-to-one correspondence between the parameter trajectories of BYOL′ and RAFT in training, referred to as H.

[Continuation of Appendix I:] Substituting $\xi$ by $\theta$, we obtain a Sylvester equation in $\theta$. Using the Kronecker product notation and the vectorization operator vec, we can rewrite the equation in the form

$$
\big(I_m \otimes W - (BA^{-1})^{\top} \otimes I_n\big)\, \mathrm{vec}(\theta) = 0.
$$

So it has a non-trivial solution $\theta$ if and only if $I_m \otimes W - (BA^{-1})^{\top} \otimes I_n$ has a non-trivial null space, and an equivalent condition to having a non-trivial null space is having zero as an eigenvalue. If $W$ has an eigenvalue in common with $BA^{-1}$, then we have a non-trivial solution $\theta$, which is exactly what prevents collapse.
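The eigenvalue condition can be illustrated numerically; below is a NumPy sketch with hypothetical diagonal matrices standing in for W and BA^-1 (A and B themselves are not fully specified in the text above):

```python
import numpy as np

# Hypothetical 3x3 stand-ins: W and M (playing the role of B A^{-1})
# share the eigenvalue 0.5, so I (x) W - M^T (x) I must be singular.
n = 3
W = np.diag([0.5, 1.0, 2.0])
M = np.diag([0.5, -1.0, 3.0])

K = np.kron(np.eye(n), W) - np.kron(M.T, np.eye(n))
eigvals = np.linalg.eigvals(K)
# The eigenvalues of K are all differences lambda_i(W) - mu_j(M);
# the shared eigenvalue contributes 0.5 - 0.5 = 0, i.e. a null space.
min_abs = float(np.min(np.abs(eigvals)))
```

Perturbing M so that it no longer shares an eigenvalue with W removes the zero eigenvalue, and the system admits only the trivial (collapsed) solution vec(θ) = 0.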

