UNDERSTANDING SELF-SUPERVISED LEARNING WITH DUAL DEEP NETWORKS

Abstract

We propose a novel theoretical framework to understand self-supervised learning methods that employ dual pairs of deep ReLU networks (e.g., SimCLR, BYOL). First, we prove that in each SGD update of SimCLR, the weights at each layer are updated by a covariance operator that specifically amplifies initial random selectivities that vary across data samples but survive averages over data augmentations. We show that this leads to the emergence of hierarchical features if the input data are generated from a hierarchical latent tree model. Within the same framework, we also show analytically that in BYOL, the combination of BatchNorm and a predictor network creates an implicit contrastive term, acting as an approximate covariance operator. Additionally, for linear architectures we derive exact solutions for BYOL that provide conceptual insights into how BYOL can learn useful non-collapsed representations without any contrastive terms that separate negative pairs. Extensive ablation studies justify our theoretical findings.

1. INTRODUCTION

While self-supervised learning (SSL) has achieved great empirical success across multiple domains, including computer vision (He et al., 2020; Goyal et al., 2019; Chen et al., 2020a; Grill et al., 2020; Misra and Maaten, 2020; Caron et al., 2020), natural language processing (Devlin et al., 2018), and speech recognition (Wu et al., 2020; Baevski and Mohamed, 2020; Baevski et al., 2019), its theoretical understanding remains elusive, especially when multi-layer nonlinear deep networks are involved (Bahri et al., 2020). Unlike supervised learning (SL), which deals with labeled data, SSL learns meaningful structures from randomly initialized networks without human-provided labels. In this paper, we propose a systematic theoretical analysis of SSL with deep ReLU networks. Our analysis imposes no parametric assumptions on the input data distribution and is applicable to state-of-the-art SSL methods that typically involve two parallel (or dual) deep ReLU networks during training (e.g., SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), etc.). We do so by developing an analogy between SSL and a theoretical framework for analyzing supervised learning, namely the student-teacher setting (Tian, 2020; Allen-Zhu and Li, 2020; Lampinen and Ganguli, 2018; Saad and Solla, 1996), which also employs a pair of dual networks. Our results indicate that SimCLR weight updates at every layer are amplified by a fundamental positive semi-definite (PSD) covariance operator that only captures feature variability across data points that survives averages over data augmentation procedures designed in practice to scramble semantically unimportant features (e.g., random image crops, blurring, or color distortions (Falcon and Cho, 2020; Kolesnikov et al., 2019; Misra and Maaten, 2020; Purushwalkam and Gupta, 2020)).
This covariance operator provides a principled framework to study how SimCLR amplifies initial random selectivity to obtain distinctive features that vary across samples and survive averages over data augmentations. Based on the covariance operator, we further show that (1) in a two-layer setting, a top-level covariance operator helps accelerate the learning of low-level features, and (2) when the data are generated by a hierarchical latent tree model, training deep ReLU networks leads to an emergence of the latent variables in its intermediate layers. We also analyze how BYOL might work without negative pairs. First, we show analytically that an interplay between the zero-mean operation in BatchNorm and the extra predictor in the online network creates an implicit contrastive term, consistent with empirical observations in a recent blog post (Fetterman and Albrecht, 2020). Note that this analysis does not rule out the possibility that BYOL could work with other normalization techniques that do not introduce contrastive terms, as shown recently (Richemond et al., 2020a). To address this, we also derive exact solutions to BYOL in linear networks without any normalization, providing insight into how BYOL can learn without contrastive terms induced either by negative pairs or by BatchNorm. Finally, we also discover that reinitializing the predictor every few epochs does not hurt BYOL performance, thereby questioning the hypothesis of an optimal predictor in (Grill et al., 2020).
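The intuition behind the covariance operator, i.e., keeping only feature variability across data points that survives averaging over augmentations, can be illustrated with a minimal numpy sketch. This is an illustrative toy, not the paper's exact per-layer operator: the additive-noise `augment` function is a hypothetical stand-in for crops, blurs, or color distortions, and we compute the covariance of augmentation-averaged inputs rather than intermediate-layer features.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, n_views, noise=0.1):
    """Toy augmentation p_aug(.|x): additive isotropic noise (a stand-in
    for random crops, blurring, or color distortions)."""
    return x[None, :] + noise * rng.standard_normal((n_views, x.shape[0]))

def covariance_operator(X, n_views=20):
    """Empirical covariance of augmentation-averaged representations.

    Directions that vary across data points but survive averaging over
    augmentations receive large eigenvalues; directions scrambled by the
    augmentation are suppressed. The result is PSD by construction.
    """
    means = np.stack([augment(x, n_views).mean(axis=0) for x in X])  # E_aug[x]
    centered = means - means.mean(axis=0)                            # subtract E_x[.]
    return centered.T @ centered / len(X)

X = rng.standard_normal((100, 5))   # 100 toy data points in R^5
C = covariance_operator(X)          # 5 x 5 PSD matrix
```

Because the operator is PSD, weight updates driven by it amplify (rather than flip) whichever initial random selectivities align with augmentation-surviving directions of variability.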

2. OVERALL FRAMEWORK

Notation. Consider an L-layer ReLU network obeying f_l = ψ(f̃_l) and f̃_l = W_l f_{l-1} for l = 1, ..., L. Here f̃_l and f_l are the n_l-dimensional pre-activation and activation vectors at layer l, with f_0 = x being the input and f_L = f̃_L the output (no ReLU at the top layer). W_l ∈ R^{n_l × n_{l-1}} are the weight matrices, and ψ(u) := max(u, 0) is the element-wise ReLU nonlinearity. We let W := {W_l}_{l=1}^{L} denote all network weights. We also denote the gradient of any loss function with respect to f_l by g_l ∈ R^{n_l}, and the derivative of the output f_L with respect to an earlier pre-activation f̃_l by the Jacobian matrix J_l(x; W) ∈ R^{n_L × n_l}, as both play key roles in backpropagation (Fig. 1(b)).

An analogy between self-supervised and supervised learning: the dual network scenario. Many recent successful approaches to self-supervised learning (SSL), including SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), and MoCo (He et al., 2020), employ a dual "Siamese-like" pair (Koch et al., 2015) of such networks (Fig. 1(a)). Each network has its own set of weights W_1 and W_2, receives respective inputs x_1 and x_2, and generates outputs f_{1,L}(x_1; W_1) and f_{2,L}(x_2; W_2). The pair of inputs {x_1, x_2} can be either positive or negative, depending on how they are sampled. For a positive pair, a single data point x is drawn from the data distribution p(·), and then two augmented views x_1 and x_2 are drawn from a conditional augmentation distribution p_aug(·|x). Possible image augmentations include random crops, blurs, or color distortions that ideally preserve semantic content useful for downstream tasks. In contrast, for a negative pair, two different data points x, x′ ∼ p(·) are sampled, and then each is augmented independently to generate x_1 ∼ p_aug(·|x) and x_2 ∼ p_aug(·|x′).
For SimCLR, the dual networks have tied weights with W_1 = W_2, and a loss function is chosen to encourage the representations of positive (negative) pairs to become similar (dissimilar). In BYOL, only positive pairs are used, and the first network W_1, called the online network, is trained to match the output of the second network W_2 (the target), using an additional network called the predictor. The target network ideally provides training targets that can improve the online network's representation and does not contribute a gradient. The improved online network is gradually incorporated into the target network, yielding a bootstrapping procedure. Our fundamental goal is to analyze the mechanisms governing how SSL methods like SimCLR and BYOL lead to the emergence of meaningful intermediate features, starting from random initializations, and how these features depend on the data distribution p(x) and augmentation procedure p_aug(·|x). Interestingly, the analysis of supervised learning (SL) often employs a similar dual network scenario, called the student-teacher setting (Tian, 2020; Allen-Zhu and Li, 2020; Lampinen and Ganguli, 2018; Saad and Solla, 1996), where W_2 are the ground-truth weights of a fixed teacher network, which generates outputs in response to random inputs. These input-output pairs constitute training data for the first network, which is a student network. Only the student network's weights are trained.
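The BYOL training loop described above (online network with predictor, gradient-free target, moving-average update) can be sketched as follows. This is an illustrative numpy toy under stated assumptions, not BYOL's actual implementation: the loss is a squared distance on normalized outputs, the networks are small random ReLU nets, and no backpropagation is performed (only the forward quantities and the target update are shown).

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights):
    """ReLU net as in Sec. 2 (no ReLU at the top layer)."""
    f = x
    for l, W in enumerate(weights):
        f = W @ f
        if l < len(weights) - 1:
            f = np.maximum(f, 0.0)
    return f

def byol_loss(x1, x2, W_online, W_target, W_pred):
    """Distance between the predicted online output and the target output on
    normalized representations; in training, gradients would flow only through
    the online branch (the target acts as a fixed training signal)."""
    q = W_pred @ forward(x1, W_online)      # online branch with extra predictor
    z = forward(x2, W_target)               # target branch: no gradient here
    q = q / (np.linalg.norm(q) + 1e-12)
    z = z / (np.linalg.norm(z) + 1e-12)
    return float(np.sum((q - z) ** 2))

def ema_update(W_target, W_online, tau=0.99):
    """Gradually incorporate the improved online network into the target."""
    return [tau * Wt + (1 - tau) * Wo for Wt, Wo in zip(W_target, W_online)]

# Toy instantiation with arbitrary sizes.
sizes = [6, 8, 4]
init = lambda: [rng.standard_normal((m, n)) / np.sqrt(n)
                for n, m in zip(sizes[:-1], sizes[1:])]
W_online, W_target = init(), init()
W_pred = rng.standard_normal((4, 4)) / 2.0  # linear predictor on top of online net

x = rng.standard_normal(6)
x1 = x + 0.1 * rng.standard_normal(6)       # two augmented views (positive pair)
x2 = x + 0.1 * rng.standard_normal(6)
loss = byol_loss(x1, x2, W_online, W_target, W_pred)
W_target = ema_update(W_target, W_online)   # bootstrapping step
```

Note the asymmetry: only the online branch carries the predictor, and only the target is updated by the moving average; removing either ingredient changes the dynamics analyzed later in the paper.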



Figure 1: (a) Overview of the two SSL algorithms we study in this paper: SimCLR (W_1 = W_2 = W, no predictor, NCE loss) and BYOL (W_1 has an extra predictor, W_2 is a moving average). (b) Detailed notation.

