RUN AWAY FROM YOUR TEACHER: A NEW SELF-SUPERVISED APPROACH SOLVING THE PUZZLE OF BYOL

Abstract

Recently, a newly proposed self-supervised framework, Bootstrap Your Own Latent (BYOL), has seriously challenged the necessity of negative samples in contrastive learning. BYOL works remarkably well despite discarding negative samples entirely and having no explicit mechanism in its training objective to prevent collapse. In this paper, we propose to understand BYOL through the lens of our newly proposed interpretable self-supervised learning framework, Run Away From your Teacher (RAFT). RAFT optimizes two objectives at the same time: (i) aligning two views of the same data to similar representations and (ii) running away from the model's Mean Teacher (MT, the exponential moving average of the history models) instead of running towards it as BYOL does. The second term of RAFT explicitly prevents representation collapse and thus makes RAFT a conceptually more reliable framework. We provide basic benchmarks of RAFT on CIFAR10 to validate the effectiveness of our method. Furthermore, we prove that BYOL is equivalent to RAFT under certain conditions, providing solid reasoning for BYOL's counter-intuitive success.

1. INTRODUCTION

Recently, the performance gap between self-supervised and supervised learning has narrowed thanks to the development of contrastive learning (Chen et al., 2020a;b; Tian et al., 2019; Sohn, 2016; Zhuang et al., 2019; He et al., 2020; Oord et al., 2018; Hadsell et al., 2006). Contrastive learning distinguishes positive pairs of data from negative pairs. It has been shown that when the representation space is l2-normalized, i.e. a hypersphere, optimizing the contrastive loss is approximately equivalent to simultaneously optimizing the alignment of positive pairs and the uniformity of the representation distribution (Wang & Isola, 2020). This equivalence conforms to our intuitive understanding: one can easily imagine a failed method that optimizes only one of the two properties. Aligning positive pairs without a uniformity constraint causes representation collapse, mapping all data to the same point; scattering data uniformly in the representation space without aligning similar ones yields representations no more meaningful than random ones.

The proposal of Bootstrap Your Own Latent (BYOL) fiercely challenges the consensus that negative samples are necessary for contrastive methods (Grill et al., 2020). BYOL trains the model (the online network) to predict its Mean Teacher (the moving average of the online network; refer to Appendix B.2) on two augmented views of the same data (Tarvainen & Valpola, 2017). There is no explicit constraint on uniformity in BYOL, yet the expected collapse never happens; moreover, BYOL reaches SOTA performance on downstream tasks. Although BYOL has been empirically proven to be an effective self-supervised learning approach, the mechanism that keeps it from collapsing remains unrevealed. Without disclosing this mystery, adapting BYOL to other problems, let alone further improving it, would be precarious. Solving the puzzle of BYOL is therefore an urgent task.
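The Mean Teacher referenced above is simply an exponential moving average of the online network's weights. As a minimal illustrative sketch (the decay rate `tau=0.99` is a placeholder, not the paper's setting), the update can be written as:

```python
import numpy as np

def ema_update(teacher_params, online_params, tau=0.99):
    """Mean Teacher (exponential moving average) update.

    Each teacher parameter is moved toward the corresponding online
    parameter: theta_teacher <- tau * theta_teacher + (1 - tau) * theta_online.
    The teacher receives no gradients; it only tracks the online history.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(teacher_params, online_params)]
```

With `tau` close to 1, the teacher changes slowly and thus summarizes a long history of online networks.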
In this paper, we explain how BYOL works through another interpretable learning framework that leverages the MT in the exact opposite way. Based on a series of theoretical derivations and empirical approximations, we build a new self-supervised learning framework, Run Away From your Teacher (RAFT), which optimizes two objectives at the same time: (i) minimize the representation distance between the two samples of a positive pair and (ii) maximize the representation distance between the online network and its MT. The second objective of RAFT incorporates the MT in a way exactly opposite to BYOL, and it explicitly prevents representation collapse by encouraging the online network to differ from its history (Figure 2a). Moreover, we empirically show that the second objective of RAFT is a more effective and consistent regularizer for the first, which makes RAFT more favorable than BYOL. Finally, we solve the puzzle of BYOL by theoretically proving that BYOL is a special form of RAFT when certain conditions and approximations hold. This proof explains why collapse does not happen in BYOL, and it also makes the performance of BYOL an approximate guarantee of the effectiveness of RAFT. The main body of the paper is organized in the same order in which we explore the properties of BYOL and establish RAFT based on them (refer to Appendix A for more details). In section 3, we investigate the phenomenon that BYOL fails to work when the predictor is removed. In section 4, we establish two meaningful objectives out of BYOL by upper bounding; based on that, we propose RAFT due to its stronger regularization effect and its accordance with our prior knowledge. In section 5, we prove that, as a representation learning framework, BYOL is a special form of RAFT under certain achievable conditions.
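The two RAFT objectives above can be sketched as a single combined loss on l2-normalized representations. The function below is an illustrative NumPy sketch, not the paper's implementation; the weighting `lam` between the two terms is a hypothetical hyperparameter introduced here for clarity:

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Project representations onto the unit hypersphere."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def raft_loss(z1, z2, z_teacher, lam=1.0):
    """Sketch of RAFT's combined objective (lower is better).

    (i) alignment: pull the two views z1, z2 of the same data together;
    (ii) run-away: push the online representation z1 away from the
         Mean Teacher's representation z_teacher of the same view,
         which explicitly discourages collapse to a single point.
    """
    z1, z2, zt = map(l2_normalize, (z1, z2, z_teacher))
    align = np.mean(np.sum((z1 - z2) ** 2, axis=-1))      # minimized
    run_away = np.mean(np.sum((z1 - zt) ** 2, axis=-1))   # maximized
    return align - lam * run_away
```

If the online network collapsed to a constant output, the run-away term would vanish while staying at its worst possible value, so minimizing the combined loss works against collapse.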
In summary, our contributions are listed as follows:
• We present a new self-supervised learning framework, RAFT, which minimizes the alignment loss between positive pairs and maximizes the distance between the online network and its MT. The motivation of RAFT conforms to our understanding of balancing alignment and uniformity in the representation space, and it can thus be easily extended and adapted to future problems.
• We equate two seemingly opposite ways of incorporating the MT in contrastive methods under certain conditions. By doing so, we unravel the puzzle of how BYOL avoids representation collapse.

2.1. TWO METRICS OPTIMIZED IN CONTRASTIVE LEARNING

Optimizing a contrastive learning objective has been empirically shown to correlate positively with downstream task performance (Chen et al., 2020a;b; Tian et al., 2019; Sohn, 2016; Zhuang et al., 2019; He et al., 2020; Oord et al., 2018). Wang & Isola (2020) places contrastive learning in the context of the hypersphere and formally shows that optimizing the contrastive loss (for preliminaries of contrastive learning, refer to Appendix B.1) is equivalent to optimizing two metrics of the encoder network when the number of negative samples $K$ is sufficiently large: the alignment of the two augmented views of the same data and the uniformity of the representation population. We introduce the alignment and uniformity objectives as follows.

Definition 2.1 (Alignment loss) The alignment loss $\mathcal{L}_{\mathrm{align}}(f; P_{\mathrm{pos}})$ of the function $f$ over the positive-pair distribution $P_{\mathrm{pos}}$ is defined as
$$\mathcal{L}_{\mathrm{align}}(f; P_{\mathrm{pos}}) \triangleq \mathbb{E}_{(x_1, x_2) \sim P_{\mathrm{pos}}} \left[ \| f(x_1) - f(x_2) \|_2^2 \right],$$
where the positive pair $(x_1, x_2)$ consists of two augmented views of the same input data $x \sim \mathcal{X}$, i.e. $(x_1, x_2) = (t_1(x), t_2(x))$ with $t_1 \sim \mathcal{T}_1$, $t_2 \sim \mathcal{T}_2$ two augmentations. For the sake of simplicity, we omit $P_{\mathrm{pos}}$ and write $\mathcal{L}_{\mathrm{align}}(f)$ in the following content.

Definition 2.2 (Uniformity loss) The uniformity loss $\mathcal{L}_{\mathrm{uniform}}(f; \mathcal{X})$ of the encoder $f$ over the data distribution $\mathcal{X}$ is defined as
$$\mathcal{L}_{\mathrm{uniform}}(f; \mathcal{X}) \triangleq \log \mathbb{E}_{(x, y) \sim \mathcal{X}^2} \left[ e^{-t \| f(x) - f(y) \|_2^2} \right],$$
where $t > 0$ is a fixed parameter, empirically set to $t = 2$. Note that vectors in the representation space are automatically $\ell_2$-normalized, i.e. $f(x) \leftarrow f(x)/\|f(x)\|_2$, since we restrict the representation space to a hypersphere following Wang & Isola (2020) and Grill et al. (2020); the representation vectors in the following context are also automatically $\ell_2$-normalized, unless specified otherwise.

Wang & Isola (2020) has empirically demonstrated that the balance of the alignment loss
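The two definitions above can be computed directly on a batch of embeddings. The following is a minimal NumPy sketch assuming the inputs are already $\ell_2$-normalized; the uniformity term is evaluated over distinct pairs within the batch (a pdist-style Monte Carlo estimate of the expectation over $\mathcal{X}^2$):

```python
import numpy as np

def alignment_loss(z1, z2):
    """L_align: mean squared Euclidean distance between the embeddings
    of each positive pair (row i of z1 with row i of z2)."""
    return np.mean(np.sum((z1 - z2) ** 2, axis=-1))

def uniformity_loss(z, t=2.0):
    """L_uniform: log of the mean Gaussian potential e^{-t * dist^2}
    over all distinct pairs in the batch; lower means more uniform."""
    n = z.shape[0]
    i, j = np.triu_indices(n, k=1)                 # all distinct pairs
    sq_dists = np.sum((z[i] - z[j]) ** 2, axis=-1)
    return np.log(np.mean(np.exp(-t * sq_dists)))
```

Perfectly aligned pairs give an alignment loss of 0, while embeddings crowding together drive the uniformity loss toward its maximum of 0, so the two terms pull in opposite directions.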

