RUN AWAY FROM YOUR TEACHER: A NEW SELF-SUPERVISED APPROACH SOLVING THE PUZZLE OF BYOL

Abstract

Recently, a newly proposed self-supervised framework, Bootstrap Your Own Latent (BYOL), has seriously challenged the necessity of negative samples in contrastive learning. BYOL works remarkably well even though it discards negative samples entirely and its training objective contains no explicit measure to prevent representation collapse. In this paper, we propose to understand BYOL through the lens of our newly proposed, interpretable self-supervised learning framework, Run Away From your Teacher (RAFT). RAFT optimizes two objectives at the same time: (i) aligning two views of the same data to similar representations and (ii) running away from the model's Mean Teacher (MT, the exponential moving average of the model's history), instead of running towards it as BYOL does. The second term of RAFT explicitly prevents representation collapse, making RAFT a conceptually more reliable framework. We provide basic benchmarks of RAFT on CIFAR10 to validate the effectiveness of our method. Furthermore, we prove that BYOL is equivalent to RAFT under certain conditions, providing a solid explanation for BYOL's counter-intuitive success.

1. INTRODUCTION

Recently, the performance gap between self-supervised and supervised learning has narrowed thanks to the development of contrastive learning (Chen et al., 2020a;b; Tian et al., 2019; Sohn, 2016; Zhuang et al., 2019; He et al., 2020; Oord et al., 2018; Hadsell et al., 2006). Contrastive learning distinguishes positive pairs of data from negative ones. It has been shown that when the representation space is l2-normalized, i.e. a hypersphere, optimizing the contrastive loss is approximately equivalent to simultaneously optimizing the alignment of positive pairs and the uniformity of the representation distribution (Wang & Isola, 2020). This equivalence conforms to our intuitive understanding: one can easily imagine a failed method that optimizes only one of the two properties. Aligning positive pairs without a uniformity constraint causes representation collapse, mapping all data to the same point; scattering data uniformly in the representation space without aligning similar ones yields representations no more meaningful than random ones.

The proposal of Bootstrap Your Own Latent (BYOL) fiercely challenges the consensus that negative samples are necessary for contrastive methods (Grill et al., 2020). BYOL trains the model (the online network) to predict its Mean Teacher (a moving average of the online network; refer to Appendix B.2) on two augmented views of the same data (Tarvainen & Valpola, 2017). BYOL places no explicit constraint on uniformity, yet the expected collapse never happens; moreover, it reaches SOTA performance on downstream tasks. Although BYOL has been empirically shown to be an effective self-supervised learning approach, the mechanism that keeps it from collapsing remains unrevealed. Without solving this mystery, it is risky to adapt BYOL to other problems, let alone to improve it further. Solving the puzzle of BYOL is therefore an urgent task.
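To make the alignment/uniformity decomposition concrete, the following is a minimal NumPy sketch of the two losses of Wang & Isola (2020) on the l2-normalized hypersphere; the function names and the Gaussian-potential temperature t=2 are our illustrative choices, not part of this paper's method.

```python
import numpy as np

def l2_normalize(z, axis=-1):
    # Project representations onto the unit hypersphere.
    return z / np.linalg.norm(z, axis=axis, keepdims=True)

def alignment_loss(z1, z2, alpha=2):
    # Mean distance (raised to alpha) between positive-pair representations:
    # zero when the two views of each datum map to the same point.
    return np.mean(np.linalg.norm(z1 - z2, axis=1) ** alpha)

def uniformity_loss(z, t=2):
    # Log of the mean pairwise Gaussian potential: minimized when the
    # representations spread uniformly over the hypersphere, and maximal
    # (equal to 0) under complete collapse.
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    off_diag = ~np.eye(z.shape[0], dtype=bool)
    return np.log(np.mean(np.exp(-t * sq_dists[off_diag])))
```

A collapsed encoder scores perfectly on alignment (0) but worst on uniformity (0, the maximum), which is exactly the failure mode described above.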
In this paper, we explain how BYOL works through another interpretable learning framework that leverages the MT in the exact opposite way. Based on a series of theoretical derivations and empirical approximations, we build a new self-supervised learning framework, Run Away From your Teacher (RAFT), which optimizes two objectives at the same time: (i) minimize the representation
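The two ingredients named so far, a Mean Teacher tracked by exponential moving average and a loss that aligns the two views while running away from that teacher, can be sketched as follows. This is only an illustration under our own assumptions (squared-distance terms, a hypothetical weight lam); the paper's precise objective is defined in later sections.

```python
import numpy as np

def ema_update(teacher_params, online_params, tau=0.99):
    # Mean Teacher: exponential moving average of the online network's history.
    return {k: tau * teacher_params[k] + (1 - tau) * online_params[k]
            for k in teacher_params}

def raft_loss(q1, q2, t1, t2, lam=1.0):
    # q1, q2: online representations of two augmented views;
    # t1, t2: the Mean Teacher's representations of the same views.
    # (i) align the two online views of the same datum ...
    align = np.mean(np.sum((q1 - q2) ** 2, axis=1))
    # ... while (ii) running AWAY from the Mean Teacher (note the minus sign;
    # BYOL instead minimizes the distance to its teacher).
    cross = 0.5 * (np.mean(np.sum((q1 - t1) ** 2, axis=1)) +
                   np.mean(np.sum((q2 - t2) ** 2, axis=1)))
    return align - lam * cross
```

The cross-model term rewards the online network for keeping its distance from the slowly moving teacher, which is what explicitly rules out the collapsed solution in this framework.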

