UNDERSTANDING SELF-SUPERVISED LEARNING WITH DUAL DEEP NETWORKS

Abstract

We propose a novel theoretical framework to understand self-supervised learning methods that employ dual pairs of deep ReLU networks (e.g., SimCLR, BYOL). First, we prove that in each SGD update of SimCLR, the weights at each layer are updated by a covariance operator that specifically amplifies initial random selectivities that vary across data samples but survive averages over data augmentations. We show that this leads to the emergence of hierarchical features if the input data are generated from a hierarchical latent tree model. With the same framework, we also show analytically that in BYOL, the combination of BatchNorm and a predictor network creates an implicit contrastive term, acting as an approximate covariance operator. Additionally, for linear architectures we derive exact solutions for BYOL that provide conceptual insights into how BYOL can learn useful non-collapsed representations without any contrastive terms that separate negative pairs. Extensive ablation studies justify our theoretical findings.

1. INTRODUCTION

While self-supervised learning (SSL) has achieved great empirical success across multiple domains, including computer vision (He et al., 2020; Goyal et al., 2019; Chen et al., 2020a; Grill et al., 2020; Misra and Maaten, 2020; Caron et al., 2020), natural language processing (Devlin et al., 2018), and speech recognition (Wu et al., 2020; Baevski and Mohamed, 2020; Baevski et al., 2019), its theoretical understanding remains elusive, especially when multi-layer nonlinear deep networks are involved (Bahri et al., 2020). Unlike supervised learning (SL), which deals with labeled data, SSL learns meaningful structures from randomly initialized networks without human-provided labels. In this paper, we propose a systematic theoretical analysis of SSL with deep ReLU networks. Our analysis imposes no parametric assumptions on the input data distribution and is applicable to state-of-the-art SSL methods that typically involve two parallel (or dual) deep ReLU networks during training (e.g., SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), etc.). We do so by developing an analogy between SSL and a theoretical framework for analyzing supervised learning, namely the student-teacher setting (Tian, 2020; Allen-Zhu and Li, 2020; Lampinen and Ganguli, 2018; Saad and Solla, 1996), which also employs a pair of dual networks. Our results indicate that SimCLR weight updates at every layer are amplified by a fundamental positive semi-definite (PSD) covariance operator that only captures feature variability across data points that survives averages over data augmentation procedures designed in practice to scramble semantically unimportant features (e.g., random image crops, blurring or color distortions (Falcon and Cho, 2020; Kolesnikov et al., 2019; Misra and Maaten, 2020; Purushwalkam and Gupta, 2020)).
This covariance operator provides a principled framework to study how SimCLR amplifies initial random selectivity to obtain distinctive features that vary across samples after surviving averages over data augmentations. Based on the covariance operator, we further show that (1) in a two-layer setting, a top-level covariance operator helps accelerate the learning of low-level features, and (2) when the data are generated by a hierarchical latent tree model, training deep ReLU networks leads to an emergence of the latent variables in their intermediate layers. We also analyze how BYOL might work without negative pairs. First, we show analytically that an interplay between the zero-mean operation in BatchNorm and the extra predictor in the online network creates an implicit contrastive term, consistent with empirical observations in a recent blog post (Fetterman and Albrecht, 2020). Note that this analysis does not rule out the possibility that BYOL could work with other normalization techniques that don't introduce contrastive terms, as shown recently (Richemond et al., 2020a). To address this, we also derive exact solutions to BYOL in linear networks without any normalization, providing insight into how BYOL can learn without contrastive terms induced either by negative pairs or by BatchNorm. Finally, we also discover that reinitializing the predictor every few epochs doesn't hurt BYOL performance, thereby questioning the hypothesis of an optimal predictor in (Grill et al., 2020). To the best of our knowledge, we are the first to provide a systematic theoretical analysis of modern SSL methods with deep ReLU networks that elucidates how both data and data augmentation drive the learning of internal representations across multiple layers.

Related Work. Besides SimCLR and BYOL, we briefly mention other concurrent SSL frameworks for vision.
MoCo (He et al., 2020; Chen et al., 2020b ) keeps a large bank of past representations in a queue as the slow-progressing target to train from. DeepCluster (Caron et al., 2018) and SwAV (Caron et al., 2020) learn the representations by iteratively or implicitly clustering on the current representations and improving representations using the cluster label. (Alwassel et al., 2019) applies similar ideas to multi-modality tasks. In contrast, the literature on the analysis of SSL with dual deep ReLU networks is sparse. (Arora et al., 2019) proposes an interesting analysis of how contrastive learning aids downstream classification tasks, given assumptions about data generation. However, it does not explicitly analyze the learning of representations in deep networks.

2. OVERALL FRAMEWORK

Notation. Consider an $L$-layer ReLU network obeying $f_l = \psi(\tilde f_l)$ and $\tilde f_l = W_l f_{l-1}$ for $l = 1, \ldots, L$. Here $\tilde f_l$ and $f_l$ are the $n_l$-dimensional pre-activation and activation vectors at layer $l$, with $f_0 = x$ being the input and $f_L = \tilde f_L$ the output (no ReLU at the top layer). $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ are the weight matrices, and $\psi(u) := \max(u, 0)$ is the element-wise ReLU nonlinearity. We let $\mathcal{W} := \{W_l\}_{l=1}^{L}$ denote all the network weights. We also denote the gradient of any loss function with respect to $f_l$ by $g_l \in \mathbb{R}^{n_l}$, and the derivative of the output $f_L$ with respect to an earlier pre-activation $\tilde f_l$ by the Jacobian matrix $J_l(x; \mathcal{W}) \in \mathbb{R}^{n_L \times n_l}$, as both play key roles in backpropagation (Fig. 1(b)).

An analogy between self-supervised and supervised learning: the dual network scenario. Many recent successful approaches to self-supervised learning (SSL), including SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020) and MoCo (He et al., 2020), employ a dual "Siamese-like" pair (Koch et al., 2015) of such networks (Fig. 1(b)). Each network has its own set of weights $\mathcal{W}_1$ and $\mathcal{W}_2$, receives respective inputs $x_1$ and $x_2$, and generates outputs $f_{1,L}(x_1; \mathcal{W}_1)$ and $f_{2,L}(x_2; \mathcal{W}_2)$. The pair of inputs $\{x_1, x_2\}$ can be either positive or negative, depending on how they are sampled. For a positive pair, a single data point $x$ is drawn from the data distribution $p(\cdot)$, and then two augmented views $x_1$ and $x_2$ are drawn from a conditional augmentation distribution $p_{\rm aug}(\cdot|x)$. Possible image augmentations include random crops, blurs or color distortions, which ideally preserve semantic content useful for downstream tasks. In contrast, for a negative pair, two different data points $x, x' \sim p(\cdot)$ are sampled, and then each is augmented independently to generate $x_1 \sim p_{\rm aug}(\cdot|x)$ and $x_2 \sim p_{\rm aug}(\cdot|x')$.
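As a concrete illustration, the positive/negative pair sampling above can be sketched in a few lines. The Gaussian data distribution and additive-noise augmentation below are illustrative stand-ins, not the crop/blur/color augmentations used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n=16):
    # toy stand-in for the data distribution p(.)
    return rng.normal(size=n)

def augment(x, noise=0.1):
    # toy stand-in for p_aug(.|x); real augmentations are crops/blur/color jitter
    return x + noise * rng.normal(size=x.shape)

x = sample_data()
x1, x2 = augment(x), augment(x)        # positive pair: two views of the SAME x

xp = sample_data()                     # a different data point x'
x1n, x2n = augment(x), augment(xp)     # negative pair: views of x and of x'

pos_dist = np.linalg.norm(x1 - x2)     # small: views share the underlying x
neg_dist = np.linalg.norm(x1n - x2n)   # large: independent underlying samples
```

Under this toy model, the positive-pair distance is driven only by the augmentation noise, while the negative-pair distance is dominated by the gap between the two independent samples.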
For SimCLR, the dual networks have tied weights with $\mathcal{W}_1 = \mathcal{W}_2$, and a loss function is chosen to encourage the representations of positive (negative) pairs to become similar (dissimilar). In BYOL, only positive pairs are used, and the first network $\mathcal{W}_1$, called the online network, is trained to match the output of the second network $\mathcal{W}_2$ (the target), using an additional layer called the predictor. The target network ideally provides training targets that can improve the online network's representation and does not contribute a gradient. The improved online network is gradually incorporated into the target network, yielding a bootstrapping procedure. Our fundamental goal is to analyze the mechanisms governing how SSL methods like SimCLR and BYOL lead to the emergence of meaningful intermediate features, starting from random initializations, and how these features depend on the data distribution $p(x)$ and the augmentation procedure $p_{\rm aug}(\cdot|x)$. Interestingly, the analysis of supervised learning (SL) often employs a similar dual network scenario, called the teacher-student setting (Tian, 2020; Allen-Zhu and Li, 2020; Lampinen and Ganguli, 2018; Saad and Solla, 1996), where $\mathcal{W}_2$ are the ground-truth weights of a fixed teacher network, which generates outputs in response to random inputs. These input-output pairs constitute training data for the first network, the student network. Only the student network's weights $\mathcal{W}_1$ are trained to match the target outputs provided by the teacher. This yields an interesting mathematical parallel between SL, in which the teacher is fixed and only the student evolves, and SSL, in which both the teacher and student evolve with potentially different dynamics. This mathematical parallel opens the door to using techniques from SL (e.g., (Tian, 2020)) to analyze SSL.

Gradient of $\ell_2$ loss for dual deep ReLU networks. As seen above, the (dis)similarity of representations between a pair of dual networks plays a key role in both SSL and SL.
We thus consider minimizing a simple measure of dissimilarity, the squared $\ell_2$ distance $r := \frac{1}{2}\|f_{1,L} - f_{2,L}\|^2$ between the final outputs $f_{1,L}$ and $f_{2,L}$ of two multi-layer ReLU networks with weights $\mathcal{W}_1$ and $\mathcal{W}_2$ and inputs $x_1$ and $x_2$. Without loss of generality, we only analyze the gradient w.r.t. $\mathcal{W}_1$. For each layer $l$, we first define the connection $K_l(x)$, a quantity that connects the bottom-up feature vector $f_{l-1}$ with the top-down Jacobian $J_l$, both of which contribute to the gradient at weight layer $l$.

Definition 1 (The connection $K_l(x)$). $K_l(x; \mathcal{W}) := f_{l-1}(x; \mathcal{W}) \otimes J_l^\top(x; \mathcal{W}) \in \mathbb{R}^{n_l n_{l-1} \times n_L}$. Here $\otimes$ is the Kronecker product.

Theorem 1 (Squared $\ell_2$ gradient for dual deep ReLU networks). The gradient $g_{W_l}$ of $r$ w.r.t. $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ for a single input pair $\{x_1, x_2\}$ is (here $K_{1,l} := K_l(x_1; \mathcal{W}_1)$ and $K_{2,l} := K_l(x_2; \mathcal{W}_2)$):

$$g_{W_l} = \mathrm{vec}\left(\partial r/\partial W_{1,l}\right) = K_{1,l}\left[K_{1,l}^\top \mathrm{vec}(W_{1,l}) - K_{2,l}^\top \mathrm{vec}(W_{2,l})\right]. \quad (1)$$

We used vectorized notation for the gradient $g_{W_l}$ and weights $W_l$ to emphasize certain theoretical properties of SSL learning below. The equivalent matrix form is $\partial r/\partial W_{1,l} = J_{1,l}^\top\left[J_{1,l} W_{1,l} f_{1,l-1} - J_{2,l} W_{2,l} f_{2,l-1}\right] f_{1,l-1}^\top$. See Appendix for proofs of all theorems in the main text.
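The matrix form of Theorem 1 can be sanity-checked numerically on a toy dual pair. The sketch below uses a hypothetical two-layer linear instantiation with random weights (for linear layers the Jacobian $J_1$ is just the top weight matrix), and compares the closed-form gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n1, n2 = 5, 4, 3

# Two linear networks ("dual pair") with separate weights and inputs.
W1_1, W1_2 = rng.normal(size=(n1, n0)), rng.normal(size=(n2, n1))  # net 1
W2_1, W2_2 = rng.normal(size=(n1, n0)), rng.normal(size=(n2, n1))  # net 2
x1, x2 = rng.normal(size=n0), rng.normal(size=n0)

def r(w_flat):
    # r = 1/2 ||f_{1,L} - f_{2,L}||^2 as a function of net 1's layer-1 weights
    Wa = w_flat.reshape(n1, n0)
    diff = W1_2 @ (Wa @ x1) - W2_2 @ (W2_1 @ x2)
    return 0.5 * diff @ diff

# Theorem 1, matrix form at l = 1 (linear case: J_1 is the layer-2 weight matrix).
J1, J2 = W1_2, W2_2
grad_thm = np.outer(J1.T @ (J1 @ W1_1 @ x1 - J2 @ W2_1 @ x2), x1)

# Central finite differences (exact here up to roundoff, since r is quadratic).
eps = 1e-6
w = W1_1.ravel().copy()
grad_fd = np.array([(r(w + eps * e) - r(w - eps * e)) / (2 * eps)
                    for e in np.eye(w.size)]).reshape(n1, n0)
```

The two gradients agree to numerical precision, matching $\partial r/\partial W_{1,1} = J_{1,1}^\top[J_{1,1} W_{1,1} x_1 - J_{2,1} W_{2,1} x_2]\, x_1^\top$.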

3. ANALYSIS OF SIMCLR

As discussed above, SimCLR (Chen et al., 2020a) employs both positive and negative input pairs, and a symmetric network structure with $\mathcal{W}_1 = \mathcal{W}_2 = \mathcal{W}$. Let $\{x_1, x_+\}$ be a positive input pair from $x$, and let $\{x_1, x_{k-}\}$ for $k = 1, \ldots, H$ be $H$ negative pairs. These input pairs induce corresponding squared $\ell_2$ distances in output space, $r_+ := \frac{1}{2}\|f_{1,L} - f_{+,L}\|_2^2$ and $r_{k-} := \frac{1}{2}\|f_{1,L} - f_{k-,L}\|^2$. We consider three different contrastive losses: (1) the simple contrastive loss $\mathcal{L}_{\rm simp} := r_+ - r_-$; (2) the (soft) triplet loss $\mathcal{L}_{\rm tri}^\tau := \tau\log(1 + e^{(r_+ - r_- + r_0)/\tau})$, where $r_0 \ge 0$ is the margin (note that $\lim_{\tau\to 0}\mathcal{L}_{\rm tri}^\tau = \max(r_+ - r_- + r_0, 0)$ (Schroff et al., 2015)); (3) the InfoNCE loss (Oord et al., 2018):

$$\mathcal{L}_{\rm nce}^\tau(r_+, r_{1-}, r_{2-}, \ldots, r_{H-}) := -\log\frac{e^{-r_+/\tau}}{e^{-r_+/\tau} + \sum_{k=1}^H e^{-r_{k-}/\tau}} \quad (2)$$

Note that when $\|u\|_2 = \|v\|_2 = 1$, we have $-\frac{1}{2}\|u - v\|_2^2 = \mathrm{sim}(u, v) - 1$ where $\mathrm{sim}(u, v) = \frac{u^\top v}{\|u\|_2\|v\|_2}$, and Eqn. 2 reduces to what the original SimCLR uses (the term $e^{-1/\tau}$ cancels out). For simplicity, we move the analysis of the final-layer $\ell_2$ normalization to Appendix A.2. Appendix F.6 of BYOL (Grill et al. (2020), v3) shows that even without $\ell_2$ normalization, the algorithm still works despite numerical instabilities. In this case, the goal of our analysis is to show that useful weight components grow exponentially in the gradient updates. One property of these loss functions is important for our analysis:

Theorem 2 (Common property of contrastive losses). For loss functions $\mathcal{L} \in \{\mathcal{L}_{\rm simp}, \mathcal{L}_{\rm tri}^\tau, \mathcal{L}_{\rm nce}^\tau\}$, we have $\frac{\partial\mathcal{L}}{\partial r_+} > 0$, $\frac{\partial\mathcal{L}}{\partial r_{k-}} < 0$ for $1 \le k \le H$, and $\frac{\partial\mathcal{L}}{\partial r_+} + \sum_{k=1}^H \frac{\partial\mathcal{L}}{\partial r_{k-}} = 0$.

With Theorem 1 and Theorem 2, we now present our first main contribution. The gradient in SimCLR is governed by a positive semi-definite (PSD) covariance operator at any layer $l$:

Theorem 3 (Covariance operator for $\mathcal{L}_{\rm simp}$). In the large-batch limit, $W_l$'s update under $\mathcal{L}_{\rm simp}$ is $W_l(t+1) = W_l(t) + \alpha\Delta W_l(t)$, where

$$\mathrm{vec}(\Delta W_l(t)) = \mathrm{OP}_l^{\rm simp}(\mathcal{W})\,\mathrm{vec}(W_l(t)),$$

where $\mathrm{OP}_l^{\rm simp}(\mathcal{W}) := \mathbb{V}_{x\sim p(\cdot)}[\bar K_l(x; \mathcal{W})] \in \mathbb{R}^{n_l n_{l-1}\times n_l n_{l-1}}$ is the covariance operator for $\mathcal{L}_{\rm simp}$, $\bar K_l(x; \mathcal{W}) := \mathbb{E}_{x'\sim p_{\rm aug}(\cdot|x)}[K_l(x'; \mathcal{W})]$ is the expected connection under the augmentation distribution, conditional on the datapoint $x$, and $\alpha$ is the learning rate.

Theorem 4 (Covariance operator for $\mathcal{L}_{\rm tri}^\tau$ and $\mathcal{L}_{\rm nce}^\tau$ ($H = 1$, single negative pair)). Let $\bar r(x, x') := \frac{1}{2}\|\bar f_L(x) - \bar f_L(x')\|_2^2$. The covariance operator $\mathrm{OP}_l(\mathcal{W}) = \frac{1}{2}\mathbb{V}^\xi_{x,x'\sim p(\cdot)}\left[\bar K_l(x) - \bar K_l(x')\right] + \mathrm{corr}$, where $\mathrm{corr} := O\!\left(\mathbb{E}_{x,x'\sim p(\cdot)}\!\left[\bar r(x, x')\,\mathrm{tr}\,\mathbb{V}_{x''\sim p_{\rm aug}(\cdot|x)}[f_L(x'')]\right]\right)$. For $\mathcal{L}_{\rm tri}^\tau$, $\xi(r) = \frac{e^{-(r - r_0)/\tau}}{1 + e^{-(r - r_0)/\tau}}$ (and $\lim_{\tau\to 0}\xi(r) = \mathbb{I}(r \le r_0)$). For $\mathcal{L}_{\rm nce}^\tau$, $\xi(r) = \frac{1}{\tau}\frac{e^{-r/\tau}}{1 + e^{-r/\tau}}$. For $\mathcal{L}_{\rm simp}$, $\xi(r) \equiv 1$ and $\mathrm{corr} = 0$. Above, we use $\mathrm{Cov}^\xi[X, Y] := \mathbb{E}\left[\xi(X, Y)(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])^\top\right]$ and $\mathbb{V}^\xi[X] := \mathrm{Cov}^\xi[X, X]$ ($\mathrm{Cov}[X, Y]$ means $\xi(\cdot) \equiv 1$).

The covariance operator $\mathrm{OP}_l(\mathcal{W})$ is a time-varying PSD matrix over the entire training procedure. Therefore, all its eigenvalues are non-negative and, at any time $t$, $W_l$ is most amplified along its largest eigenmodes. Intuitively, $\mathrm{OP}_l(\mathcal{W})$ ignores different views of the same sample $x$ by averaging over the augmentation distribution to compute $\bar K_l(x)$, and then computes the covariance of this augmentation-averaged connection with respect to the data distribution $p(x)$. Thus, at all layers, any variability in the connection across different data points that survives augmentation averages leads to weight amplification. This amplification of weights by the PSD data covariance of an augmentation-averaged connection constitutes a fundamental description of SimCLR learning dynamics for arbitrary data and augmentation distributions.
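For intuition, the covariance operator of Theorem 3 can be estimated empirically in the simplest case of a single linear neuron ($L = 1$), where the connection reduces to the input itself, $K_1(x) = x$. The data and augmentation distributions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_data, n_aug = 6, 500, 200

# Data samples x ~ p(.). For a single linear neuron, K(x) = x, so
# OP = V_x[ E_{x' ~ p_aug(.|x)}[x'] ]  (Theorem 3).
X = rng.normal(size=(n_data, d))

def augment(x):
    # additive-noise augmentation (illustrative choice)
    return x + 0.5 * rng.normal(size=(n_aug, d))

# Augmentation-averaged connection K_bar(x) for each data sample.
K_bar = np.stack([augment(x).mean(axis=0) for x in X])

# Empirical covariance operator OP = V_x[K_bar(x)] and its spectrum.
OP = np.cov(K_bar.T)
eigs = np.linalg.eigvalsh(OP)   # all (numerically) non-negative: OP is PSD
```

As the theorem states, every eigenvalue of the empirical `OP` is non-negative, so a weight vector updated by `OP` is amplified along the top eigenmodes rather than rotated toward a collapsed solution.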

4. HOW THE COVARIANCE OPERATOR DRIVES THE EMERGENCE OF FEATURES

To concretely illustrate how the fundamental covariance operator derived in Theorems 3-4 drives feature emergence in SimCLR, we set up the following paradigm for analysis. The input $x = x(z_0, z')$ is assumed to be generated by two groups of latent variables: class/sample-specific latents $z_0$ and nuisance latents $z'$. We assume data augmentation only changes $z'$ while preserving $z_0$ (Fig. 2(a)). For brevity we use Theorem 3 ($\mathcal{L}_{\rm simp}$); then $\mathrm{OP}_l = \mathbb{V}_{z_0}[\bar K_l(z_0)]$, since $z'$ is averaged out in $\bar K_l(z_0)$. In this setting, we first show that a linear neuron performs PCA within an augmentation-preserved subspace. We then consider how nonlinear neurons with local receptive fields (RFs) can learn to detect simple objects. Finally, we extend our analysis to deep ReLU networks exposed to data generated by a hierarchical latent tree model (HLTM), proving that, with sufficient over-parameterization, there exist lucky nodes at initialization whose activation is correlated with latent variables underlying the data, and that SimCLR amplifies these initial lucky representations during learning.

4.1. SELF-SUPERVISED LEARNING AND THE SINGLE NEURON: ILLUSTRATIVE EXAMPLES

A single linear neuron performs PCA in a preserved subspace. For a single linear neuron ($L = 1$, $n_L = 1$), the connection in Definition 1 is simply $K_1(x) = x$. Now imagine the input space can be decomposed into the direct sum of a semantically relevant subspace and its orthogonal complement, which corresponds to a subspace of nuisance features. Furthermore, suppose the augmentation distribution $p_{\rm aug}(\cdot|x)$ is obtained by multiplying $x$ by a random Gaussian matrix that acts only in the nuisance subspace, thereby identically preserving the semantic subspace. Then the augmentation-averaged connection is $\bar K_1(x) = Q_s x$, where $Q_s$ is a projection operator onto the semantic subspace. In essence, only the projection of the data onto the semantic subspace survives augmentation averaging, as the nuisance subspace is scrambled. Then $\mathrm{OP} = \mathbb{V}_x[\bar K_1(x)] = Q_s \mathbb{V}_x[x] Q_s$. Thus the covariance of the data distribution, projected onto the semantic subspace, governs the growth of the weight vector $W_1$, demonstrating that SimCLR on a single linear neuron performs PCA within a semantic subspace preserved by data augmentation.

A single linear neuron cannot detect localized objects. We now consider a generative model in which data vectors can be thought of as images of objects of the form $x(z_0, z')$, where $z_0$ is an important latent semantic variable denoting object identity, while $z'$ is an unimportant latent variable denoting nuisance features, like object pose or location. The augmentation procedure scrambles pose/position while preserving object identity. Consider a simple concrete example (Fig. 3(a)):

$$x(z_0, z') = \begin{cases} e_{z'} + e_{(z'+1) \bmod d} & z_0 = 1 \\ e_{z'} + e_{(z'+2) \bmod d} & z_0 = 2 \end{cases}$$

Here $0 \le z' \le d-1$ denotes $d$ discrete translational object positions on a periodic ring, and $z_0 \in \{1, 2\}$ denotes two possible objects, 11 and 101. The distribution is uniform over both objects and positions: $p(z_0, z') = \frac{1}{2d}$.
Augmentation shifts the object to a uniformly random position via $p_{\rm aug}(z'|z_0) = 1/d$. For a single linear neuron, $K_1(x) = x$, and the augmentation-averaged connection is $\bar K_1(z_0) = \frac{2}{d}\mathbf{1}$, which is actually independent of object identity $z_0$ (both objects activate two pixels at any location). Thus $\mathrm{OP}_1 = \mathbb{V}_{z_0}[\bar K_1(z_0)] = 0$ and no learning happens.

A local receptive field (RF) does not help. In the same generative model, now consider a linear neuron with a local RF of width 2. Within the RF only four patterns can arise: 00, 01, 10, 11. Taking the expectation over $z'$ given $z_0$ (Fig. 3(a2)) yields $\bar K_1(z_0{=}1) = \frac{1}{d}\left[x_{11} + x_{01} + x_{10} + (d-3)x_{00}\right]$ and $\bar K_1(z_0{=}2) = \frac{1}{d}\left[2x_{01} + 2x_{10} + (d-4)x_{00}\right]$. Here $x_{11} \in \mathbb{R}^2$ denotes the pattern 11, and so on. This yields $\mathrm{OP}_1 = \mathbb{V}_{z_0}[\bar K_1(z_0)] = \frac{1}{4d^2} u u^\top$, where $u := x_{11} + x_{00} - x_{01} - x_{10}$, and $\mathrm{OP}_1 \in \mathbb{R}^{2\times 2}$ since the RF has width 2. Note that the signed sum of the four pattern vectors in $u$ actually cancels, so that $u = 0$, implying $\mathrm{OP}_1 = 0$ and no learning happens. Interestingly, although the conditional distribution of the 4 input patterns depends on the object identity $z_0$ (Fig. 3(a2)), a linear neuron cannot learn to discriminate the objects.

A nonlinear neuron with a local RF can learn to detect object-selective features. For a ReLU neuron with weight vector $w$, from Def. 1, the connection is now $K_1(x, w) = \psi'(w^\top x)x$. Suppose $w(t)$ happens to be selective for a single pattern $x_p$ (where $p \in \{00, 01, 10, 11\}$), i.e., $w^\top(t) x_p > 0$ and $w^\top(t) x_{p'} < 0$ for $p' \neq p$. The augmentation-averaged connection is then $\bar K_1(z_0) \propto x_p$, where the proportionality constant depends on object identity $z_0$ and can be read off Fig. 3(a2). Since this averaged connection varies with object identity $z_0$ for all $p$, the covariance operator $\mathrm{OP}_1$ is nonzero and is given by $\mathbb{V}_{z_0}[\bar K_1(z_0)] = c_p x_p x_p^\top$, where the constant $c_p > 0$ depends on the selective pattern $p$ and can be computed from Fig. 3(a2).
By Theorem 3, the dot product $x_p^\top w(t)$ grows over time:

$$x_p^\top w(t+1) = x_p^\top\left(I_{2\times 2} + \alpha c_p x_p x_p^\top\right) w(t) = \left(1 + \alpha c_p \|x_p\|^2\right) x_p^\top w(t) > x_p^\top w(t) > 0. \quad (6)$$

Thus the learning dynamics amplifies the initial selectivity to the object-selective feature vector $x_p$ in a way that cannot be achieved with a linear neuron. Note that this argument also holds with bias terms and with initial selectivity for more than one pattern. Moreover, with a local RF, the probability of weak initial selectivity to local object-sensitive features is high, and we may expect amplification of such weak selectivity in real neural network training, as observed in other settings (Williams et al., 2018).
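The three cases above (global linear neuron, linear RF neuron, selective ReLU neuron) can be checked directly on the toy ring model; the pattern counts follow the ones stated for Fig. 3(a2):

```python
import numpy as np

d = 8  # ring size

def x_sample(z0, z):
    # object "11" (z0 = 1) or "101" (z0 = 2) placed at position z on a ring of d pixels
    v = np.zeros(d)
    v[z % d] = 1.0
    v[(z + (1 if z0 == 1 else 2)) % d] = 1.0
    return v

# Global linear neuron: K1(x) = x, averaged over the uniform position z.
# Both objects light 2 of d pixels on average, so K_bar(z0) = (2/d)*ones for
# either object and V_z0[K_bar] = 0: no learning.
K_bar = {z0: np.mean([x_sample(z0, z) for z in range(d)], axis=0) for z0 in (1, 2)}

# Width-2 linear RF: the signed sum u = x11 + x00 - x01 - x10 cancels exactly,
# so OP_1 = (1/4d^2) u u^T = 0 as well.
x00, x01 = np.array([0., 0.]), np.array([0., 1.])
x10, x11 = np.array([1., 0.]), np.array([1., 1.])
u = x11 + x00 - x01 - x10

# ReLU neuron selective only to pattern 11: pattern 11 appears with probability
# 1/d when z0 = 1 and never when z0 = 2, so the averaged connection is
# (1/d) x11 vs. 0, and OP_1 = Var({1/d, 0}) * x11 x11^T != 0.
coef = np.array([1.0 / d, 0.0])
OP_relu = np.var(coef) * np.outer(x11, x11)   # equals x11 x11^T / (4 d^2)
```

The nonlinear neuron's gating is what breaks the cancellation: only the selected pattern survives the augmentation average, so the covariance operator becomes a nonzero rank-one amplifier along $x_{11}$.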

4.2. A TWO-LAYER CASE WITH MULTIPLE HIDDEN NEURONS

Now consider a two-layer network ($L = 2$). The hidden layer has $n_1$ ReLU neurons while the output has $n_2$ (Fig. 2(c)). In this case, the augmentation-averaged connection $\bar K_1(z_0)$ at the lower layer $l = 1$ stacks, over hidden neurons $j = 1, \ldots, n_1$, the per-neuron blocks $\bar u_j(z_0) w_{2,j}^\top \in \mathbb{R}^{d\times n_2}$, so that $\bar K_1(z_0) \in \mathbb{R}^{n_1 d \times n_2}$ (here $d = n_0$ is the input dimension). Here $w_{1,j} \in \mathbb{R}^d$ and $w_{2,j} \in \mathbb{R}^{n_2}$ are the weight vectors into and out of hidden neuron $j$ (Fig. 2(c)), $\bar u_j(z_0) := \mathbb{E}_{z'|z_0}\left[x(z_0, z')\,\mathbb{I}(w_{1,j}^\top x(z_0, z') \ge 0)\right] \in \mathbb{R}^d$ is the gated, augmentation-averaged input, and $A_j := \mathbb{V}_{z_0}[\bar u_j(z_0)]$. When the top layer $W_2$ is diagonal, the resulting dynamics is

$$\dot w_{2,j} = \left(w_{1,j}^\top A_j w_{1,j}\right) w_{2,j}, \qquad \dot w_{1,j} = \|w_{2,j}\|_2^2 A_j w_{1,j}.$$

Note that for ReLU neurons, $A_j$ changes with $w_{1,j}$, while for linear neurons $A_j$ would be constant, since the gating $\psi'(w_{1,j}^\top x) \equiv 1$. Previous work (Allen-Zhu and Li, 2020) also mentions a similar concept in supervised learning, called "backward feature correction." Here we demonstrate rigorously that a similar behavior can occur in SSL under gradient descent in the two-layer case: once $\|w_{2,j}\|_2$ grows, the learning of the lower-layer weights $w_{1,j}$ accelerates. As an example, consider a mixture of Gaussians: $x \sim \frac{1}{2}\mathbb{I}(z_0 = 1)\mathcal{N}(w_1^*, \sigma^2) + \frac{1}{2}\mathbb{I}(z_0 = 2)\mathcal{N}(w_2^*, \sigma^2)$, and let $\Delta w^* := w_1^* - w_2^*$. Then, in the linear case, $A_j \propto \Delta w^*\Delta w^{*\top}$ and $w_{1,j}$ converges to $\pm\Delta w^*$ (Fig. 3(b)). In the nonlinear case with multiple Gaussians, if one of the Gaussians sits at the origin (e.g., background noise), then depending on initialization, $A_j$ evolves into $w_k^* w_k^{*\top}$ for some center $k$, and $w_{1,j} \to w_k^*$. Note that this dynamics is insensitive to the specific parametric form of the input data.
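In the linear case of this Gaussian-mixture example, $A_j$ can be estimated by Monte Carlo and compared against $\frac{1}{4}\Delta w^*\Delta w^{*\top}$ (the $\frac{1}{4}$ arises because $z_0$ is uniform over two values, so each conditional mean deviates from the overall mean by $\pm\Delta w^*/2$). Dimensions, $\sigma$, and sample counts below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, sigma = 4, 200_000, 0.1
w_star1, w_star2 = rng.normal(size=d), rng.normal(size=d)
dw = w_star1 - w_star2

# x ~ 1/2 N(w*_1, sigma^2 I) + 1/2 N(w*_2, sigma^2 I)
z0 = rng.integers(1, 3, size=n)
X = np.where((z0 == 1)[:, None], w_star1, w_star2) + sigma * rng.normal(size=(n, d))

# Linear neuron: gating == 1, so u_bar(z0) = E[x | z0] and A = V_z0[u_bar(z0)].
u1, u2 = X[z0 == 1].mean(axis=0), X[z0 == 2].mean(axis=0)
m = (u1 + u2) / 2
A = 0.5 * (np.outer(u1 - m, u1 - m) + np.outer(u2 - m, u2 - m))
# A ~= (1/4) dw dw^T: a rank-one matrix whose top eigenvector is +-dw, so under
# the dynamics dot{w}_{1,j} ∝ A w_{1,j} the lower-layer weights align with +-dw.
```

Because $A$ is rank one, any initial $w_{1,j}$ with a nonzero component along $\Delta w^*$ is exponentially amplified along that direction, which is the convergence to $\pm\Delta w^*$ stated above.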

4.3. DEEP RELU SSL TRAINING WITH HIERARCHICAL LATENT TREE MODELS (HLTM)

We next study how multi-layer ReLU networks learn from data generated by an HLTM, in which visible leaf variables are sampled via a hierarchical branching diffusion process through a sequence of latent variables starting from a root variable $z_0$ (Fig. 2(d, left)). The HLTM represents a mathematical abstraction of the hierarchical structure of real-world objects, which consist of spatially localized parts and subparts, all of which can lie in different configurations or occluded states. See Appendix D.2 for a detailed description and motivation for the HLTM. Simpler versions of the HLTM have been used to mathematically model how both infants and linear neural networks learn hierarchical structure (Saxe et al., 2019). We examine when a multi-layer ReLU network with spatially local RFs can learn the latent generative variables when exposed only to the visible leaf variables (Fig. 2(d, right)). We define symbols in Tbl. 1. At layer $l$, we have categorical latent variables $\{z_\mu\}$, where $\mu \in Z_l$ indexes different latent variables. Each $z_\mu$ can take discrete values. The topmost latent variable is $z_0$. Following the tree structure, for $\mu \in Z_l$ and $\nu_1, \nu_2 \in Z_{l-1}$, conditional independence holds: $P(z_{\nu_1}, z_{\nu_2}|z_\mu) = P(z_{\nu_1}|z_\mu)P(z_{\nu_2}|z_\mu)$. The final sample $x$ is just the collection of all visible leaf variables (Fig. 2(d)), and thus depends on all latent variables. Corresponding to the hierarchical tree model, each neural network node $j \in N_l$ maps to a unique $\mu = \mu(j) \in Z_l$. Let $N_\mu$ be the set of all nodes that map to $\mu$. For $j \in N_\mu$, its activation $f_j$ only depends on the value of $z_\mu$ and its descendant latent variables, through the input $x$. Define $v_j(z_\mu) := \mathbb{E}_{z'}[f_j|z_\mu]$ as the expected activation given $z_\mu$. Given a sample $x$, data augmentation involves resampling all the $z_\mu$ (which play the role of $z'$ in Fig. 2), while fixing the root $z_0$.

Symmetric binary HLTM.
Here we consider a symmetric binary case: each $z_\mu \in \{0, 1\}$ and, for $\mu \in Z_l$, $\nu \in Z_{l-1}$, $P(z_\nu = 1|z_\mu = 1) = P(z_\nu = 0|z_\mu = 0) = (1 + \rho_{\mu\nu})/2$, where the polarity $\rho_{\mu\nu} \in [-1, 1]$ measures how informative $z_\mu$ is about $z_\nu$. If $\rho_{\mu\nu} = \pm 1$ then there is no stochasticity in the top-down generation process; $\rho_{\mu\nu} = 0$ means there is no information in the downstream latents, and the posterior of $z_0$ given the observation $x$ can only be uniform. See Appendix for more general cases.
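A minimal sampler for this symmetric binary HLTM (a hypothetical binary-tree instantiation; depth and $\rho$ are arbitrary choices) makes the setup concrete: fixing the root and resampling all lower latents generates a positive pair, and the root-to-leaf polarity composes multiplicatively along a root-to-leaf path, giving $\rho_{0,\rm leaf} = \rho^{\rm depth}$ when every edge shares the same $\rho$:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_child(z_parent, rho):
    # P(z_child == z_parent) = (1 + rho) / 2  (symmetric binary transition)
    return z_parent if rng.random() < (1 + rho) / 2 else 1 - z_parent

def sample_leaves(z0, depth, rho):
    # top-down generation: each latent spawns two children down to the leaves,
    # which together form the visible sample x
    layer = [z0]
    for _ in range(depth):
        layer = [sample_child(z, rho) for z in layer for _ in range(2)]
    return np.array(layer)

depth, rho = 4, 0.9
x1 = sample_leaves(1, depth, rho)   # positive pair: same root z0 = 1,
x2 = sample_leaves(1, depth, rho)   # all intermediate latents resampled

# each leaf matches the root with probability (1 + rho^depth) / 2
est = np.mean([sample_leaves(1, depth, rho).mean() for _ in range(2000)])
```

With high per-edge polarity the leaves remain informative about $z_0$, which is exactly the information that survives the augmentation average in the covariance-operator analysis below.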

Now we compute the covariance operator $\mathrm{OP}_\mu = \mathbb{V}_{z_0}[\bar K_\mu(z_0)]$ at different layers, where $\bar K_\mu(z_0) = \mathbb{E}_{z'}\left[f_{N^{\rm ch}_\mu} \otimes J_\mu^\top \,\middle|\, z_0\right]$. Here we mainly analyze the term $\mathbb{E}_{z'}[f_{N^{\rm ch}_\mu}|z_0]$ and assume $J_\mu$ is constant.

Theorem 6 (Activation covariance in the binary HLTM). $\mathbb{V}_{z_0}\left[\mathbb{E}_{z'}[f_{N^{\rm ch}_\mu}|z_0]\right] = o_\mu a_\mu a_\mu^\top$. Here $a_\mu := [\rho_{\mu\nu(k)} s_k]_{k\in N^{\rm ch}_\mu}$ and $o_\mu := \rho_{0\mu}^2(1 - \rho_0^2)$. If $\max_{\alpha\beta}|\rho_{\alpha\beta}| < 1$, then $\lim_{L\to+\infty}\rho_{0\mu} \to 0$ for $\mu \in Z_l$.

Theorem 6 suggests that when $\rho_{0\mu}$ and $\|a_\mu\|$ are large, the covariance operator $\mathrm{OP}_\mu = o_\mu a_\mu a_\mu^\top \otimes J_\mu^\top J_\mu$ is large. Now suppose the initialization is such that $\mathbb{V}[f_k|z_\nu] \le \sigma_l^2$ for any $k \in N_l$. For any $\mu \in Z_{l+1}$, if $|N_\mu| = O(\exp(c))$, then with high probability there exists at least one node $j \in N_\mu$ whose pre-activation gap satisfies $\tilde v_j(1) - \tilde v_j(0) = 2 w_j^\top a_\mu > 0$ and whose activations satisfy:

$$v_j^2(1) - v_j^2(0) \ge 3\sigma_w^2\left[\frac{1}{4|N^{\rm ch}_\mu|}\sum_{k\in N^{\rm ch}_\mu}|v_k(1) - v_k(0)|^2\left(\frac{c+6}{6}\rho_{\mu\nu}^2 - 1\right) - \sigma_l^2\right].$$

Intuitively, this means that with large polarity $\rho_{\mu\nu}$ (strong top-down signals), randomly initialized over-parameterized ReLU networks yield selective neurons, provided the lower layer also contains selective ones. For example, when $\sigma_l = 0$ and $c = 9$, if $\rho_{\mu\nu} \ge 63.3\%$ then there is a gap between the expected activations $v_j(1)$ and $v_j(0)$, and the gap is larger when the selectivity in the lower layer $l$ is higher. Note that at the lowest layer, the $\{v_k\}$ are themselves observable leaf variables and are selective by definition, so a bottom-up mathematical induction of latent selectivity unfolds. If we further assume $J_\mu^\top J_\mu = I$, then after the gradient update, for the "lucky" node $j$ we have:

$$a_\mu^\top w_j(t+1) = a_\mu^\top\left(I + \alpha o_\mu a_\mu a_\mu^\top\right) w_j(t) = \left(1 + \alpha o_\mu \|a_\mu\|^2\right) a_\mu^\top w_j(t) > a_\mu^\top w_j(t) > 0,$$

which means that the pre-activation gap $\tilde v_j(1) - \tilde v_j(0) = 2 w_j^\top a_\mu$ grows over time, and the latent variable $z_\mu$ is learned (instantiated as $f_j$) during training, even though the network is never supervised with its true value. While here we analyze the simplest case ($J_\mu^\top J_\mu = I$), in practice $J_\mu$ changes over time. Similar to Sec. 4.2, once the top layer starts to have large weights, the magnitude of $J_\mu$ for lower layers becomes larger and training is accelerated. We implement the HLTM and confirm, as predicted by our theory, that the intermediate layers of deep ReLU networks do indeed learn the latent variables of the HLTM (see Tbl. 2 below).

5. ANALYSIS OF INGREDIENTS UNDERLYING BYOL LEARNING

In BYOL, the two networks are no longer identical and, interestingly, only positive pairs are used for training. The first network, with weights $\mathcal{W}_1 = \mathcal{W} := \{\mathcal{W}_{\rm base}, \mathcal{W}_{\rm pred}\}$, is an online network that is trained to predict the output of the second, target network with weights $\mathcal{W}_2 = \mathcal{W}'$, using a learnable predictor $\mathcal{W}_{\rm pred}$ to map the online outputs to the target outputs (Fig. 1(a) and Fig. 1 in Grill et al. (2020)). In contrast, the target network has $\mathcal{W}' := \{\mathcal{W}'_{\rm base}\}$, where $\mathcal{W}'_{\rm base}$ is an exponential moving average (EMA) of $\mathcal{W}_{\rm base}$: $\mathcal{W}'_{\rm base}(t+1) = \gamma_{\rm ema}\mathcal{W}'_{\rm base}(t) + (1 - \gamma_{\rm ema})\mathcal{W}_{\rm base}(t)$. Since BYOL only uses positive pairs, we consider the following loss function:

$$r := \frac{1}{2}\left\|f_L(x_1; \mathcal{W}) - f_{L'}(x_+; \mathcal{W}')\right\|_2^2 \quad (10)$$

where the input data form a positive pair: $x_1, x_+ \sim p_{\rm aug}(\cdot|x)$ and $x \sim p(\cdot)$. The two outputs, $f_L(x_1; \mathcal{W})$ and $f_{L'}(x_+; \mathcal{W}')$, are from the online and the target networks, respectively. Note that $L' < L$ due to the presence of the extra predictor on the side of the online network ($\mathcal{W}$). With neither EMA ($\gamma_{\rm ema} = 0$) nor the predictor, $\mathcal{W}' = \mathcal{W}$, and the BYOL update without BN is $\mathrm{vec}(\Delta W_l)_{\rm sym} = -\mathbb{E}_x\left[\mathbb{V}_{x'\sim p_{\rm aug}(\cdot|x)}[K_l(x')]\right]\mathrm{vec}(W_l)$ (see App. E.1 for a proof). This update only promotes variance minimization in the representations of different augmented views of the same data samples and would therefore yield model collapse. We now consider the roles played by the extra predictor and BatchNorm (BN) (Ioffe and Szegedy, 2015) in BYOL. Our interest in BatchNorm is motivated by a recent blog post (Fetterman and Albrecht, 2020). We will see that combining both could yield a sufficient condition to create an implicit contrastive term that could help BYOL learn. As pointed out recently by Richemond et al. (2020a), BYOL can still work using other normalization techniques that do not rely on cross-batch statistics (e.g., GroupNorm (Wu and He, 2018), Weight Standardization (Qiao et al., 2019), and careful initialization of the affine transform of activations).
In Appendix F we derive exact solutions to BYOL for linear architectures without any normalization, to provide conceptual insights into how BYOL can still learn without contrastive terms, at least in the linear setting. Here we focus on BatchNorm, leaving an analysis of other normalization techniques in nonlinear BYOL settings for future work. When the predictor is added, Theorem 1 can still be applied by adding identity layers on top of the target network $\mathcal{W}'$, so that the online and target networks have the same depth. Theorem 5 in (Tian, 2018) demonstrates that this version of BN shifts the downward gradients so that their mini-batch mean is 0:

$$\hat g_l^i := g_l^i - \frac{1}{|B|}\sum_{i\in B} g_l^i = g_l^i - \bar g_l \quad (12)$$

Here $g_l^i$ is the gradient for the $i$-th sample in a batch and $\bar g_l$ is the batch average (and similarly for $\tilde f_l$). Backpropagating through this BN (vs. just subtracting the mean in the forward pass only) leads to a correction term:

Theorem 8. If (1) the network is linear from layer $l$ to the topmost layer and (2) the downward gradient $g_l$ undergoes Eqn. 12, then in the large-batch limit, the correction to the update is (for brevity, dependency on $\mathcal{W}$ is omitted, while dependency on $\mathcal{W}'$ is made explicit):

$$\mathrm{vec}(\delta W_l^{\rm BN}) = \mathbb{E}_x\left[\bar K_l(x)\right]\left(\mathbb{E}_x\left[\bar K_l^\top(x)\right]\mathrm{vec}(W_l) - \mathbb{E}_x\left[\bar K_l^\top(x; \mathcal{W}')\right]\mathrm{vec}(W_l')\right)$$

and the corrected weight update is $\tilde\Delta W_l := \Delta W_l + \delta W_l^{\rm BN}$. Using Eqn. 11, we have:

$$\mathrm{vec}(\tilde\Delta W_l) = \mathrm{vec}(\Delta W_l)_{\rm sym} - \mathbb{V}_x\left[\bar K_l(x)\right]\mathrm{vec}(W_l) + \mathrm{Cov}_x\left[\bar K_l(x), \bar K_l(x; \mathcal{W}')\right]\mathrm{vec}(W_l')$$

Corollary 1 (SimCLR). For SimCLR with contrastive losses $\mathcal{L}_{\rm simp}$, $\mathcal{L}_{\rm tri}$ and $\mathcal{L}_{\rm nce}$, $\delta W_l^{\rm BN} = 0$.

5.1. THE CASE WITHOUT EMA ($\gamma_{\rm ema} = 0$ AND THUS $\mathcal{W}'_{\rm base} = \mathcal{W}_{\rm base}$)

In BYOL, when the predictor is present, $\mathcal{W}' \neq \mathcal{W}$; if BN is also present, from the analysis above we know that $\delta W_l^{\rm BN} \neq 0$, which provides an implicit contrastive term. Note that $\mathcal{W}' \neq \mathcal{W}$ means there is a predictor, the online network uses EMA, or both. We first discuss the case without EMA. From Theorem 8, if we further consider a single linear predictor, then the following holds.
Here $\bar K_{l,\rm base}(x) := \bar K_l(x; \mathcal{W}_{\rm base})$ and the zero-mean expected connection is $\tilde K_l(x) := \bar K_l(x) - \mathbb{E}_x\left[\bar K_l(x)\right]$.

Corollary 2 (The role of the predictor in BYOL). If $\mathcal{W}_{\rm pred} = \{W_{\rm pred}\}$ is linear and there is no EMA, then $\mathrm{vec}(\tilde\Delta W_l) = \mathrm{vec}(\Delta W_l)_{\rm sym} + \mathbb{E}_x\left[\tilde K_{l,\rm base}(x) W_{\rm pred}^\top(I - W_{\rm pred})\tilde K_{l,\rm base}^\top(x)\right]\mathrm{vec}(W_l)$. If there is no stop gradient, then $\mathrm{vec}(\tilde\Delta W_l) = 2\,\mathrm{vec}(\Delta W_l)_{\rm sym} - \mathbb{V}_x\left[\tilde K_{l,\rm base}(x)(I - W_{\rm pred})\right]\mathrm{vec}(W_l)$.

The predictor. To see why $W_{\rm pred}$ plays a critical role, we examine a special case. If $W_{\rm pred} = \beta I_{n_L\times n_L}$ ($W_{\rm pred}$ has to be a square matrix), then $\mathrm{vec}(\tilde\Delta W_l) = \mathrm{vec}(\Delta W_l)_{\rm sym} + \beta(1-\beta)\mathbb{V}_x\left[\bar K_{l,\rm base}(x)\right]\mathrm{vec}(W_l)$. If $0 < \beta < 1$, then $\beta(1-\beta) > 0$ and the covariance operator appears. In this regime, BYOL works like SimCLR, except that it also minimizes the variance across different augmented views of the same data sample through $\mathrm{vec}(\Delta W_l)_{\rm sym}$ (Eqn. 11), the first term in Eqn. 14. Indeed, the recent blog post (Fetterman and Albrecht, 2020), as well as our own experiments (Tbl. 3), suggests that standard BYOL without BN fails. In addition, we also initialize the predictor with small positive weights (see Appendix G.4) and reinitialize the predictor weights every few epochs (Tbl. 5), and BYOL still works, consistent with our theoretical prediction.

Stop gradient. In BYOL, the target network $\mathcal{W}'$ serves as a target to be learned from, but does not contribute gradients to the current weights $\mathcal{W}$. Without EMA, we might wonder whether the target network should also contribute gradients. Corollary 2 shows that this won't work: no matter what $W_{\rm pred}$ is, the update always contains a (weighted) negative covariance operator.
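Corollary 2's special case $W_{\rm pred} = \beta I$ admits a one-dimensional caricature: with a single linear weight and connection $K(x) = x$, the symmetric term contributes $-\sigma_{\rm aug}^2 w$ (variance across views, driving collapse) and the implicit contrastive term contributes $+\beta(1-\beta)\sigma_x^2 w$. The numerical values below are arbitrary illustrative choices:

```python
# Scalar caricature of the corrected BYOL update with W_pred = beta * I:
#   dw = alpha * ( -sigma_aug2 + beta*(1 - beta)*sigma_x2 ) * w
sigma_x2, sigma_aug2 = 1.0, 0.1   # data variance vs. variance across views
alpha, T = 0.05, 200

def run(beta):
    w = 1.0
    for _ in range(T):
        w += alpha * (-sigma_aug2 + beta * (1 - beta) * sigma_x2) * w
    return w

w_pred = run(beta=0.5)     # beta(1-beta) = 0.25 > sigma_aug2: the weight grows
w_nopred = run(beta=0.0)   # no predictor: only the collapse term remains
w_identity = run(beta=1.0) # identity predictor: contrastive term vanishes too
```

Only $0 < \beta < 1$ makes the contrastive coefficient positive; both $\beta = 0$ (no predictor) and $\beta = 1$ (identity predictor) leave pure collapse, mirroring why the predictor is essential in this regime.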

5.2 DYNAMICS OF EXPONENTIAL MOVING AVERAGE (EMA)

On the other hand, the EMA part might play a different role. Consider the following linear dynamical system, a simplified version of Eqn. 14 (we omit $\mathrm{vec}(\Delta W_l)^{\mathrm{sym}}$ and set $W_l' = W_{l,\mathrm{ema}}$):

$$w(t+1) - w(t) = \Delta w(t) = \alpha\left[-w(t) + (1+\lambda)\,w_{\mathrm{ema}}(t)\right] \quad (15)$$

where $w_{\mathrm{ema}}$ follows the EMA update $w_{\mathrm{ema}}(t+1) = \gamma_{\mathrm{ema}}\, w_{\mathrm{ema}}(t) + (1-\gamma_{\mathrm{ema}})\, w(t)$.

Theorem 9 (EMA dynamics of Eqn. 15). $w(t) \propto (1+\kappa)^t$, where we define $\kappa := \frac{1}{2}(\eta+\alpha)\left[\sqrt{1 + 4\alpha\eta\lambda/(\eta+\alpha)^2} - 1\right]$ and $\eta := 1 - \gamma_{\mathrm{ema}}$. Moreover, if $\lambda \ge 0$, then $\kappa \le \lambda/(1/\alpha + 1/\eta)$.

From the analysis above, when $\beta$ is small, the coefficient in front of $W_l$ ($\sim\beta^2$) is typically smaller than the one in front of $W_{l,\mathrm{ema}}$ ($\sim\beta$). This means $\lambda > 0$, in which case $\kappa > 0$, $w(t)$ grows exponentially, and learning happens. Compared to the no-EMA case ($\gamma_{\mathrm{ema}} = 0$, i.e., $\eta = 1$), with EMA we have $\eta < 1$ and $\kappa$ becomes smaller. The growth is then less aggressive and training stabilizes.

[Table 4: Top-1 STL-10 performance using different BatchNorm components in the predictor and the projector of BYOL ($\gamma_{\mathrm{ema}} = 0.996$, 100 epochs). There is no affine part. "$\mu$" = zero-mean normalization only; "$\mu,\sigma$" = BN without affine; "$\mu,\sigma_{\nparallel}$" = normalization with mean and std but only backpropagating through the mean. All variants with detached zero-mean normalization yield similarly poor performance as no normalization.]

A predictor with BN starts to show good performance (63.6%), and further adding EMA leads to the best performance (78.1%). This is consistent with our theoretical findings in Sec. 5, where we show that using a predictor with BN yields $\delta W_l^{\mathrm{BN}} \neq 0$ and leads to an implicit contrastive term. To further test our understanding of the role played by BN, we fractionate BN into several sub-components: subtracting the batch mean (mean-norm), dividing by the batch standard deviation (std-norm), and the affine transform, and run ablation studies (Tbl. 4). Surprisingly, removing the affine part yields slightly better performance on STL-10 (from 78.1% to 78.7%).
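Returning to the EMA dynamics, Theorem 9 can be checked numerically (our own sketch; the coupled EMA update is written out explicitly under the stated simplification): simulate the pair $(w, w_{\mathrm{ema}})$ and compare the empirical growth rate with $1+\kappa$.

```python
import math

def growth_rate(alpha, eta, lam, steps=2000):
    """Simulate w(t+1) = w(t) + alpha*(-w(t) + (1+lam)*w_ema(t)) together with
    the EMA update w_ema(t+1) = (1-eta)*w_ema(t) + eta*w(t);
    return the asymptotic per-step ratio w(t+1)/w(t)."""
    w, w_ema, rate = 1.0, 1.0, 1.0
    for _ in range(steps):
        w_next = w + alpha * (-w + (1 + lam) * w_ema)
        w_ema = (1 - eta) * w_ema + eta * w
        rate, w = w_next / w, w_next
    return rate

alpha, eta, lam = 0.2, 0.05, 0.5
kappa = 0.5 * (eta + alpha) * (math.sqrt(1 + 4 * alpha * eta * lam / (eta + alpha) ** 2) - 1)
print(growth_rate(alpha, eta, lam), 1 + kappa)   # the two numbers agree
print(kappa <= lam / (1 / alpha + 1 / eta))      # True: the bound in Theorem 9
```

The empirical growth rate converges to the dominant eigenvalue $1+\kappa$ of the two-by-two transition matrix, and shrinking $\eta$ (stronger EMA) indeed shrinks $\kappa$.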
We also find that the mean-norm variants perform reasonably, while the detached mean-norm variants perform as poorly as no normalization, supporting the claim that centralizing the backpropagated gradient leads to implicit contrastive terms (Sec. 5). Note that std-norm also helps, which we leave for future analysis. We also check whether the online network requires an "optimal predictor", as suggested by the recent version (v3) of the BYOL paper. For this, we reinitialize the predictor (ReInit) every $T$ epochs and compare the final performance under the linear evaluation protocol. Interestingly, as shown in Tbl. 5, ReInit actually improves performance slightly compared to the original BYOL that keeps training the same predictor, which should therefore be closer to optimal. Moreover, if we shrink the initial weight range of the predictor to make $\mathrm{Cov}_x\left[\bar K_l(x), \bar K_l(x;\mathcal{W}')\right]$ (the third term in Eqn. 14) more dominant, and reduce the learning rate, the performance improves further (see Tbl. 10 in Appendix G.4), corroborating our analysis.

A BACKGROUND AND BASIC SETTING (SECTION 2)

A.1 LEMMAS

Definition 2 (Reversibility). A layer $l$ is reversible if there is a $G_l(x;\mathcal{W}) \in \mathbb{R}^{n_l\times n_{l-1}}$ so that $f_l(x;\mathcal{W}) = G_l(x;\mathcal{W}) f_{l-1}(x;\mathcal{W})$ and $g_{l-1} = G_l^\top(x;\mathcal{W})\, Q_l\, g_l$ for some constant PSD matrix $Q_l \in \mathbb{R}^{n_l\times n_l}$, $Q_l \succeq 0$. A network is reversible if all of its layers are.

Many kinds of layers have this reversibility property, including linear layers (MLP and Conv) and (leaky) ReLU nonlinearities. For a multi-layer ReLU network, at each layer $l$ we have:

$$G_l(x;\mathcal{W}) = D_l(x;\mathcal{W})\, W_l, \qquad Q_l \equiv I_{n_l\times n_l} \quad (16)$$

where $D_l \in \mathbb{R}^{n_l\times n_l}$ is a binary diagonal matrix that encodes the gating of each neuron at layer $l$. The gating $D_l(x;\mathcal{W})$ depends on the current input $x$ and the current weights $\mathcal{W}$. Besides ReLU, other activation functions also satisfy this condition, including linear, LeakyReLU and monomial activations.
For example, for the power activation $\psi(x) = x^p$ with $p > 1$, we have (where $\tilde f_l$ is the pre-activation at layer $l$):

$$G_l(x;\mathcal{W}) = \mathrm{diag}^{p-1}(\tilde f_l)\, W_l, \qquad Q_l \equiv p\, I_{n_l\times n_l} \quad (17)$$

Remark. Note that reversibility is not the same as invertibility. Reversibility only requires that the backward pass transforms the gradient by the transpose of the forward transfer function (up to a constant PSD matrix).

Lemma 1 (Recursive Gradient Update; extension of Lemma 1 in (Tian, 2020)). Let the (pseudo-)Jacobian $\tilde J_L(x) = I_{n_L\times n_L}$, and recursively define $\tilde J_{l-1}(x) := \tilde J_l(x)\sqrt{Q_l}\, G_l(x) \in \mathbb{R}^{n_L\times n_{l-1}}$. Here $\sqrt{Q_l}$ is the constant PSD matrix so that $\sqrt{Q_l}\sqrt{Q_l} = Q_l \succeq 0$. If (1) the network is reversible (Def. 2) and (2) $\sqrt{Q_l}$ commutes with both $\tilde J_l^\top(x_1)\tilde J_l(x_1)$ and $\tilde J_l^\top(x_1)\tilde J_l(x_2)$, then minimizing the $\ell_2$ objective

$$r(\mathcal{W}_1) := \frac{1}{2}\left\| f_L(x_1;\mathcal{W}_1) - f_L(x_2;\mathcal{W}_2)\right\|_2^2 \quad (18)$$

with respect to the weight matrix $W_l$ at layer $l$ yields the following gradient at layer $l$:

$$g_l = \tilde J_l^\top(x_1;\mathcal{W}_1)\left[\tilde J_l(x_1;\mathcal{W}_1)\, f_l(x_1;\mathcal{W}_1) - \tilde J_l(x_2;\mathcal{W}_2)\, f_l(x_2;\mathcal{W}_2)\right] \quad (19)$$

Proof. We prove this by induction. Note that our definition of $W_l$ is the transpose of the $W_l$ defined in (Tian, 2020). Also, our $g_l(x)$ is the gradient before the nonlinearity, while (Tian, 2020) uses the same symbol for the gradient after the nonlinearity. For notational brevity, we let $f_l(x_1) := f_l(x_1;\mathcal{W}_1)$ and $G_l(x_1) := G_l(x_1;\mathcal{W}_1)$, and similarly for $x_2$ and $\mathcal{W}_2$. When $l = L$, by the property of the $\ell_2$ loss we have $g_L = f_L(x_1;\mathcal{W}_1) - f_L(x_2;\mathcal{W}_2)$; setting $\tilde J_L(x_1) = \tilde J_L(x_2) = I$, the claim holds.
Now suppose the claim holds at layer $l$:

$$g_l = \tilde J_l^\top(x_1)\left[\tilde J_l(x_1) f_l(x_1) - \tilde J_l(x_2) f_l(x_2)\right] \quad (20)$$

Then:

$$g_{l-1} = G_l^\top(x_1)\, Q_l\, g_l \quad (21)$$
$$= G_l^\top(x_1)\, Q_l\, \tilde J_l^\top(x_1)\left[\tilde J_l(x_1) f_l(x_1) - \tilde J_l(x_2) f_l(x_2)\right] \quad (22)$$
$$= \underbrace{G_l^\top(x_1)\sqrt{Q_l}\,\tilde J_l^\top(x_1)}_{\tilde J_{l-1}^\top(x_1)}\left[\tilde J_l(x_1)\sqrt{Q_l}\, f_l(x_1) - \tilde J_l(x_2)\sqrt{Q_l}\, f_l(x_2)\right] \quad (23)$$
$$= \tilde J_{l-1}^\top(x_1)\Big[\underbrace{\tilde J_l(x_1)\sqrt{Q_l}\, G_l(x_1)}_{\tilde J_{l-1}(x_1)} f_{l-1}(x_1) - \underbrace{\tilde J_l(x_2)\sqrt{Q_l}\, G_l(x_2)}_{\tilde J_{l-1}(x_2)} f_{l-1}(x_2)\Big] \quad (24)$$
$$= \tilde J_{l-1}^\top(x_1)\left[\tilde J_{l-1}(x_1) f_{l-1}(x_1) - \tilde J_{l-1}(x_2) f_{l-1}(x_2)\right] \quad (25)$$

where step (23) uses the commutation assumption to move one factor of $\sqrt{Q_l}$ past $\tilde J_l^\top(x_1)\tilde J_l(x_1)$ and $\tilde J_l^\top(x_1)\tilde J_l(x_2)$, and step (24) uses $f_l = G_l f_{l-1}$.

Note that for a multi-layer ReLU network, $G_l(x) = D_l(x) W_l$ and $Q_l = I$ for each ReLU+Linear layer; if we set $x_1 = x_2 = x$, $\mathcal{W}_1 = \mathcal{W}$, $\mathcal{W}_2 = \mathcal{W}^*$ (teacher weights), we recover the original Lemma 1 in (Tian, 2020).

Remark on ResNet. The same structure holds for blocks of ResNet with ReLU activations.

An alternative form of Lemma 1. We can alternatively group a linear weight with the nonlinearity immediately below it, and Lemma 1 still holds. In this case we have:

$$\tilde g_l = J_l^\top(x_1)\left[J_l(x_1)\,\tilde f_l(x_1) - J_l(x_2)\,\tilde f_l(x_2)\right] \quad (26)$$

where $J_l(x)$ is the (pseudo-)Jacobian $J_l(x) := \partial f_L/\partial \tilde f_l$ (i.e., with respect to the pre-activation $\tilde f_l$), as defined in the notation paragraph of Sec. 2, and $\tilde g_l$ is the backpropagated gradient after the nonlinearity. This will be used in Lemma 2 below. For other reversible layers (e.g., Eqn. 17), the relationship between the pseudo-Jacobian and the real one can differ by a constant (e.g., some power of $\sqrt p$).

A.2 ℓ2-NORMALIZATION IN THE TOPMOST LAYER

For an $\ell_2$-normalization layer $f_l := f_{l-1}/\|f_{l-1}\|_2$, we have $G_l := \left(1/\|f_{l-1}\|_2\right) I_{n_l\times n_l}$ and, by the following identity (here $\tilde y := y/\|y\|_2$):

$$\frac{\partial \tilde y}{\partial y} = \frac{1}{\|y\|_2}\left(I - \tilde y\tilde y^\top\right)$$

we have $\partial f_l/\partial f_{l-1} = (I_{n_l\times n_l} - f_l f_l^\top)\, G_l$, so we can set $Q_l := I - f_l f_l^\top$, which is a projection matrix and thus PSD. Furthermore, since the normalization layer is at the top, $\tilde J_l = I$ and $Q_l$ trivially commutes with $\tilde J_l^\top \tilde J_l$. The only issue is that $Q_l$ is not a constant matrix and can change over training. Therefore Lemma 1 does not apply exactly to such a layer, but can be regarded as an approximate model of it.
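The normalization Jacobian above is easy to verify numerically (a standalone sanity check we wrote, not code from the paper), comparing the closed form against central finite differences:

```python
import numpy as np

def normalize(y):
    return y / np.linalg.norm(y)

def jacobian_closed_form(y):
    """d(y/||y||)/dy = (I - t t^T) / ||y||, with t = y/||y||."""
    t = normalize(y)
    return (np.eye(len(y)) - np.outer(t, t)) / np.linalg.norm(y)

def jacobian_fd(y, eps=1e-6):
    """Central finite differences, one input coordinate at a time."""
    n = len(y)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (normalize(y + e) - normalize(y - e)) / (2 * eps)
    return J

y = np.array([0.5, -1.2, 2.0, 0.3])
print(np.max(np.abs(jacobian_closed_form(y) - jacobian_fd(y))))  # tiny
```

The closed form also makes the projection structure visible: $I - \tilde y\tilde y^\top$ kills the radial direction, which is why $Q_l$ is a projection matrix.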

A.3 THEOREM 1

Now we prove Theorem 1 in the more general setting where the network is reversible (deep ReLU networks are included, with $Q_l$ simply the identity matrix):

Lemma 2 (Squared $\ell_2$ gradient for dual deep reversible networks). The gradient $g_{W_l}$ of the squared loss $r$ with respect to $W_l \in \mathbb{R}^{n_l\times n_{l-1}}$ for a single input pair $\{x_1, x_2\}$ is:

$$g_{W_l} = \mathrm{vec}\left(\partial r/\partial W_{1,l}\right) = K_{1,l}\left[K_{1,l}^\top \mathrm{vec}(W_{1,l}) - K_{2,l}^\top \mathrm{vec}(W_{2,l})\right] \quad (28)$$

Here $K_l(x;\mathcal{W}) := f_{l-1}(x;\mathcal{W}) \otimes J_l^\top(x;\mathcal{W})$, $K_{1,l} := K_l(x_1;\mathcal{W}_1)$ and $K_{2,l} := K_l(x_2;\mathcal{W}_2)$.

Proof. We consider the more general case where the two towers have different parameters, namely $\mathcal{W}_1$ and $\mathcal{W}_2$. Applying Lemma 1 to the branch with input $x_1$ at linear layer $l$, we have (see Eqn. 26):

$$\tilde g_{1,l} = J_{1,l}^\top\left[J_{1,l}\, W_{1,l}\, f_{1,l-1} - J_{2,l}\, W_{2,l}\, f_{2,l-1}\right] \quad (29)$$

where $f_{1,l-1} := f_{l-1}(x_1;\mathcal{W}_1)$ is the activation of layer $l-1$ just below the linear layer in tower 1 (and similarly for the other symbols), and $\tilde g_{1,l}$ is the backpropagated gradient after the nonlinearity. The gradient (and hence, under gradient descent, the weight update) of the weight $W_l$ between layer $l$ and layer $l-1$ is then:

$$\frac{\partial r}{\partial W_{1,l}} = \tilde g_{1,l}\, f_{1,l-1}^\top = J_{1,l}^\top J_{1,l}\, W_{1,l}\, f_{1,l-1} f_{1,l-1}^\top - J_{1,l}^\top J_{2,l}\, W_{2,l}\, f_{2,l-1} f_{1,l-1}^\top \quad (30\text{–}31)$$

Using $\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X)$ (where $\otimes$ is the Kronecker product), we have:

$$\mathrm{vec}\left(\frac{\partial r}{\partial W_{1,l}}\right) = \left(f_{1,l-1} f_{1,l-1}^\top \otimes J_{1,l}^\top J_{1,l}\right)\mathrm{vec}(W_{1,l}) - \left(f_{1,l-1} f_{2,l-1}^\top \otimes J_{1,l}^\top J_{2,l}\right)\mathrm{vec}(W_{2,l}) \quad (32)$$

Let

$$K_l(x;\mathcal{W}) := f_{l-1}(x;\mathcal{W}) \otimes J_l^\top(x;\mathcal{W}) \in \mathbb{R}^{n_l n_{l-1}\times n_L} \quad (33)$$

Note that $K_l(x;\mathcal{W})$ is a function of the current weights $\mathcal{W}$, which include the weights at all layers. By the mixed-product property of the Kronecker product, $(A\otimes B)(C\otimes D) = AC \otimes BD$, we have:

$$\mathrm{vec}\left(\frac{\partial r}{\partial W_{1,l}}\right) = K_l(x_1)K_l^\top(x_1)\,\mathrm{vec}(W_{1,l}) - K_l(x_1)K_l^\top(x_2)\,\mathrm{vec}(W_{2,l}) \quad (34)$$
$$= K_l(x_1)\left[K_l^\top(x_1)\,\mathrm{vec}(W_{1,l}) - K_l^\top(x_2)\,\mathrm{vec}(W_{2,l})\right] \quad (35)$$

where $K_l(x_1) = K_l(x_1;\mathcal{W}_1)$ and $K_l(x_2) = K_l(x_2;\mathcal{W}_2)$.
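Lemma 2 is easy to verify numerically for one-hidden-layer linear towers, where $J_1$ is just the top weight matrix (a self-contained sketch written for this illustration, not code from the paper; we use column-major `vec` so that $\mathrm{vec}(AXB) = (B^\top\otimes A)\mathrm{vec}(X)$):

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n1, n2 = 5, 4, 3
# Two linear towers with different weights (W_1 and W_2 in Lemma 2).
W1a, W2a = rng.standard_normal((n1, n0)), rng.standard_normal((n2, n1))
W1b, W2b = rng.standard_normal((n1, n0)), rng.standard_normal((n2, n1))
x1, x2 = rng.standard_normal(n0), rng.standard_normal(n0)

# Connections at layer l=1: K = f_0 (x) J_1^T with f_0 = x and, for a linear
# tower, J_1 = (topmost weight matrix).
K1 = np.kron(x1.reshape(-1, 1), W2a.T)       # shape (n0*n1, n2)
K2 = np.kron(x2.reshape(-1, 1), W2b.T)

# Direct gradient of r = 0.5*||W2a W1a x1 - W2b W1b x2||^2 w.r.t. W1a.
diff = W2a @ W1a @ x1 - W2b @ W1b @ x2
direct = np.outer(W2a.T @ diff, x1)

# Lemma 2: vec(dr/dW_{1,l}) = K1 [K1^T vec(W1a) - K2^T vec(W1b)].
via_K = K1 @ (K1.T @ W1a.flatten(order="F") - K2.T @ W1b.flatten(order="F"))
print(np.allclose(direct.flatten(order="F"), via_K))  # True
```

Setting the two towers equal recovers the SimCLR specialization discussed next.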
In the SimCLR case we have $\mathcal{W}_1 = \mathcal{W}_2 = \mathcal{W}$, so

$$\mathrm{vec}\left(\frac{\partial r}{\partial W_l}\right) = K_l(x_1)\left[K_l(x_1) - K_l(x_2)\right]^\top \mathrm{vec}(W_l) \quad (36)$$

B ANALYSIS OF SIMCLR USING THE TEACHER-STUDENT SETTING (SECTION 3)

B.1 THEOREM 2

Proof. For $L_{\mathrm{simp}}$ and $L_{\mathrm{tri}}$ the derivation is obvious. For $L_{\mathrm{nce}}$, we have:

$$\frac{\partial L}{\partial r_+} = \frac{1}{\tau}\left(1 - \frac{e^{-r_+/\tau}}{e^{-r_+/\tau} + \sum_{k'=1}^{H} e^{-r_{k'-}/\tau}}\right) > 0 \quad (37)$$

$$\frac{\partial L}{\partial r_{k-}} = -\frac{1}{\tau}\,\frac{e^{-r_{k-}/\tau}}{e^{-r_+/\tau} + \sum_{k'=1}^{H} e^{-r_{k'-}/\tau}} < 0, \qquad k = 1,\dots,H \quad (38)$$

and obviously we have:

$$\frac{\partial L}{\partial r_+} + \sum_{k=1}^{H} \frac{\partial L}{\partial r_{k-}} = 0 \quad (39)$$

B.2 THE COVARIANCE OPERATOR UNDER DIFFERENT LOSS FUNCTIONS

Lemma 3. Consider a loss function $L$ that satisfies Theorem 2 and a batch of size one with samples $\mathcal{X} := \{x_1, x_+, x_{1-}, x_{2-}, \dots, x_{H-}\}$, where $x_1, x_+ \sim p_{\mathrm{aug}}(\cdot|x)$ are augmentations of the same sample $x$, and $x_{k-} \sim p_{\mathrm{aug}}(\cdot|x_k)$ are augmentations of independent samples $x_k \sim p(\cdot)$. Then:

$$\mathrm{vec}(g_{W_l}) = K_l(x_1)\sum_{k=1}^{H}\left.\frac{\partial L}{\partial r_{k-}}\right|_{\mathcal{X}} \left(K_l(x_+) - K_l(x_{k-})\right)^\top \mathrm{vec}(W_l) \quad (40)$$

Proof. First we have:

$$\mathrm{vec}(g_{W_l}) = \frac{\partial L}{\partial W_l} = \frac{\partial L}{\partial r_+}\frac{\partial r_+}{\partial W_l} + \sum_{k=1}^{H}\frac{\partial L}{\partial r_{k-}}\frac{\partial r_{k-}}{\partial W_l} \quad (41)$$

We then compute each term. Using Theorem 1, we know that:

$$\frac{\partial r_+}{\partial W_l} = K_l(x_1)\left(K_l(x_1) - K_l(x_+)\right)^\top \mathrm{vec}(W_l) \quad (42)$$
$$\frac{\partial r_{k-}}{\partial W_l} = K_l(x_1)\left(K_l(x_1) - K_l(x_{k-})\right)^\top \mathrm{vec}(W_l), \qquad k = 1,\dots,H \quad (43)$$

Since Eqn. 39 holds, the $K_l(x_1)K_l^\top(x_1)$ terms cancel out and we have:

$$\mathrm{vec}(g_{W_l}) = K_l(x_1)\sum_{k=1}^{H}\frac{\partial L}{\partial r_{k-}}\left(K_l(x_+) - K_l(x_{k-})\right)^\top \mathrm{vec}(W_l) \quad (44)$$

B.3 THEOREM 3

Proof. For $L_{\mathrm{simp}} := r_+ - r_-$, we have $H = 1$ and $\partial L/\partial r_- \equiv -1$. Therefore, using Lemma 3:

$$\mathrm{vec}(g_{W_l}) = -K_l(x_1)\left[K_l(x_+) - K_l(x_-)\right]^\top \mathrm{vec}(W_l)$$

In the large-batch limit, $\mathbb{E}\left[K_l(x_1)K_l^\top(x_+)\right] = \mathbb{E}_x\left[\bar K_l(x)\bar K_l^\top(x)\right]$, since $x_1, x_+ \sim p_{\mathrm{aug}}(\cdot|x)$ are both augmented data points from a common sample $x$. On the other hand, $\mathbb{E}\left[K_l(x_1)K_l^\top(x_{k-})\right] = \mathbb{E}_x\left[\bar K_l(x)\right]\mathbb{E}_x\left[\bar K_l(x)\right]^\top$, since $x_1 \sim p_{\mathrm{aug}}(\cdot|x)$ and $x_{k-} \sim p_{\mathrm{aug}}(\cdot|x_k)$ are generated from independent samples $x$ and $x_k$ with independent data augmentation. Therefore,

$$\mathrm{vec}(g_{W_l}) = -\left(\mathbb{E}_x\left[\bar K_l(x)\bar K_l^\top(x)\right] - \mathbb{E}_x\left[\bar K_l(x)\right]\mathbb{E}_x\left[\bar K_l(x)\right]^\top\right)\mathrm{vec}(W_l) \quad (46)$$
$$= -\mathbb{V}_x\left[\bar K_l(x)\right]\mathrm{vec}(W_l) \quad (47)$$

The conclusion follows since gradient descent is used and $\Delta W_l = -g_{W_l}$.

B.4 THEOREM 4

Proof. When $\partial L/\partial r_{k-}$ is no longer constant, we consider its expansion around the unaugmented data points $\mathcal{X}_0 = \{x, x_1, \dots, x_H\}$. Here $\mathcal{X} = \{x_1, x_+, x_{1-}, \dots, x_{H-}\}$ is one batch sample that includes both positive and negative pairs, with $x_1, x_+ \sim p_{\mathrm{aug}}(\cdot|x)$ and $x_{k-} \sim p_{\mathrm{aug}}(\cdot|x_k)$ for $1 \le k \le H$:

$$\left.\frac{\partial L}{\partial r_{k-}}\right|_{\mathcal{X}} = \left.\frac{\partial L}{\partial r_{k-}}\right|_{\mathcal{X}_0} + \epsilon \quad (48)$$

where $\epsilon$ is a bounded quantity for $L_{\mathrm{tri}}$ ($|\epsilon| \le 1$) and $L_{\mathrm{nce}}$ ($|\epsilon| \le 2/\tau$). We consider $H = 1$, where there is only a single negative pair (and a single $r_-$). In this case $\mathcal{X}_0 = \{x, x'\}$. Let $r := \frac{1}{2}\|f_L(x) - f_L(x')\|_2^2$ and $\xi(x, x') := -\left.\frac{\partial L}{\partial r_-}\right|_{\mathcal{X}_0}$. Since $L_{\mathrm{tri}}$ is not differentiable, we use its soft version $L^\tau_{\mathrm{tri}}(r_+, r_-) = \tau\log\left(1 + e^{(r_+ - r_- + r_0)/\tau}\right)$; it is easy to see that $\lim_{\tau\to 0} L^\tau_{\mathrm{tri}}(r_+, r_-) = \max(r_+ - r_- + r_0,\, 0)$. For the two losses:

- For $L^\tau_{\mathrm{tri}}$, we have $\xi(x, x') = \xi(r) = \dfrac{e^{-r/\tau}}{e^{-r_0/\tau} + e^{-r/\tau}}$. (49)
- For $L_{\mathrm{nce}}$, we have $\xi(x, x') = \xi(r) = \dfrac{1}{\tau}\,\dfrac{e^{-r/\tau}}{1 + e^{-r/\tau}}$. (50)

Note that for $L^\tau_{\mathrm{tri}}$ we have $\lim_{\tau\to 0}\xi(r) = \mathbb{I}(r \le r_0)$. For $L_{\mathrm{nce}}$, since it is differentiable, by Taylor expansion we have $\epsilon = O(\|x_1 - x\|_2,\, \|x_+ - x\|_2,\, \|x_- - x'\|_2)$, which will be used later.

The constant term $\xi$ with respect to data augmentation.
In the following, we first consider the term $\xi$, which only depends on the unaugmented data points $\mathcal{X}_0$. From Lemma 3, we now have the following term in the gradient:

$$g_l(\mathcal{X}) := -K_l(x_1)\left[K_l(x_+) - K_l(x_-)\right]^\top \xi(x, x')\,\mathrm{vec}(W_l)$$

In the large-batch limit, taking the expectation with respect to the data augmentation $p_{\mathrm{aug}}$, and noting that all augmentations are performed independently given the unaugmented data $x$ and $x'$, we have:

$$\bar g_l(x, x') := \mathbb{E}_{p_{\mathrm{aug}}}\left[g_l(\mathcal{X})\right] = -\bar K_l(x)\left[\bar K_l(x) - \bar K_l(x')\right]^\top \xi(x, x')\,\mathrm{vec}(W_l)$$

Symmetrically, if we swap $x$ and $x'$ (both are sampled from the same distribution $p(\cdot)$), we have:

$$\bar g_l(x', x) = -\bar K_l(x')\left[\bar K_l(x') - \bar K_l(x)\right]^\top \xi(x', x)\,\mathrm{vec}(W_l)$$

Since $\xi(x', x)$ only depends on the squared $\ell_2$ distance $r$ (Eqn. 49 and Eqn. 50), we have $\xi(x', x) = \xi(x, x') = \xi(r)$ and thus:

$$\bar g_l(x, x') + \bar g_l(x', x) = -\left[\bar K_l(x)\bar K_l^\top(x) - \bar K_l(x)\bar K_l^\top(x') + \bar K_l(x')\bar K_l^\top(x') - \bar K_l(x')\bar K_l^\top(x)\right]\xi(r)\,\mathrm{vec}(W_l)$$
$$= -\xi(r)\left(\bar K_l(x) - \bar K_l(x')\right)\left(\bar K_l(x) - \bar K_l(x')\right)^\top \mathrm{vec}(W_l) \quad (54)$$

Therefore, we have:

$$\mathbb{E}_{x,x'\sim p}\left[\bar g_l(x, x')\right] = -\frac{1}{2}\,\mathbb{E}_{x,x'\sim p}\left[\xi(r)\left(\bar K_l(x) - \bar K_l(x')\right)\left(\bar K_l(x) - \bar K_l(x')\right)^\top\right]\mathrm{vec}(W_l) \quad (55)$$
$$= -\frac{1}{2}\,\mathbb{V}^\xi_{x,x'\sim p}\left[\bar K_l(x) - \bar K_l(x')\right]\mathrm{vec}(W_l) \quad (56)$$

Bounding the error. For $L_{\mathrm{nce}}$, let $F := -\frac{\partial L}{\partial r_-}$; then we can compute its partial derivatives:

$$\frac{\partial F}{\partial r_+} = -F(1/\tau - F), \qquad \frac{\partial F}{\partial r_-} = F(1/\tau - F)$$

Note that $|F(1/\tau - F)| \le 1/\tau^2$ is always bounded. From the Taylor expansion, we have:

$$\epsilon = -\left.\frac{\partial F}{\partial r_+}\right|_{\{r_+^*, r_-^*\}}(r_+ - r_+^0) - \left.\frac{\partial F}{\partial r_-}\right|_{\{r_+^*, r_-^*\}}(r_- - r_-^0)$$

for derivatives evaluated at some point $\{r_+^*, r_-^*\}$ on the line connecting $(x, x, x')$ and $(x_1, x_+, x_-)$. Here $r_+^0$ and $r_-^0$ are the squared $\ell_2$ distances evaluated at $(x, x, x')$; therefore $r_+^0 \equiv 0$ and $r_-^0 = \frac{1}{2}\|f(x) - f(x')\|_2^2$ (here we simply write $f := f_L$ for brevity).
Therefore we have $r_+ - r_+^0 = \frac{1}{2}\|f(x_1) - f(x_+)\|_2^2$ and

$$r_- - r_-^0 = \left[f(x) - f(x')\right]^\top\left[(f(x_1) - f(x)) - (f(x_-) - f(x'))\right] + \frac{1}{2}\left\|(f(x_1) - f(x)) - (f(x_-) - f(x'))\right\|_2^2$$

Therefore:

$$\mathbb{E}_{p_{\mathrm{aug}}}\left[\left|\frac{\partial F}{\partial r_+}(r_+ - r_+^0)\right|\right] \le \frac{1}{\tau^2}\cdot\frac{1}{2}\int \|f(x_1) - f(x_+)\|_2^2\, p_{\mathrm{aug}}(x_1|x)\, p_{\mathrm{aug}}(x_+|x)\,\mathrm{d}x_1\,\mathrm{d}x_+ = \frac{1}{\tau^2}\,\mathrm{tr}\,\mathbb{V}_{\mathrm{aug}}[f|x]$$

where $\mathrm{tr}\,\mathbb{V}_{\mathrm{aug}}[f|x] := \mathrm{tr}\,\mathbb{V}_{x'\sim p_{\mathrm{aug}}(\cdot|x)}[f(x')]$ is a scalar. Similarly, using Lemma 4 and $\|a\|_2^2 + \|b\|_2^2 \ge \frac{1}{2}\|a - b\|_2^2$, and with $c_0 := \max_x \left\|f(x) - \mathbb{E}_{x'\sim p_{\mathrm{aug}}(\cdot|x)}[f(x')]\right\|_2^2$, we have:

$$\mathbb{E}_{p_{\mathrm{aug}}}\left[\left|\frac{\partial F}{\partial r_-}(r_- - r_-^0)\right|\right] \le \frac{1}{\tau^2}\left[\|f(x) - f(x')\|\left(\sqrt{\mathrm{tr}\,\mathbb{V}_{\mathrm{aug}}[f|x]} + \sqrt{\mathrm{tr}\,\mathbb{V}_{\mathrm{aug}}[f|x']} + 2c_0\right) + \mathrm{tr}\,\mathbb{V}_{\mathrm{aug}}[f|x] + \mathrm{tr}\,\mathbb{V}_{\mathrm{aug}}[f|x']\right] \quad (61)$$

Let $M_K := \max_x \|K_l(x)\|$, so finally we have:

$$\left|\mathbb{E}_{x,x',\mathrm{aug}}\left[\epsilon\,K_l(x_1)\left(K_l(x_+) - K_l(x_-)\right)^\top\right]\right| \le \frac{2M_K^2}{\tau^2}\left[2\,\mathbb{E}_{x,x'\sim p(\cdot)}\left[\|f(x) - f(x')\|\left(\sqrt{\mathrm{tr}\,\mathbb{V}_{\mathrm{aug}}[f|x]} + c_0\right)\right] + 3\,\mathrm{tr}\,\mathbb{E}_x\left[\mathbb{V}_{\mathrm{aug}}[f|x]\right]\right] \quad (62)$$

Note that if there is no augmentation (i.e., $p_{\mathrm{aug}}(x_1|x) = \delta(x_1 - x)$), then $c_0 = 0$, $\mathbb{V}_{\mathrm{aug}}[f|x] \equiv 0$, and the error (Eqn. 62) is also zero. A small range of augmentation yields a tight bound. For $L^\tau_{\mathrm{tri}}$ the derivation is similar; the only difference is a factor $1/\tau$ rather than $1/\tau^2$ in Eqn. 62. Note that this does not change the order of the bound, since $\xi(r)$ (and thus the covariance operator) carries one less factor of $1/\tau$ as well. We can also see that for the hard loss $L_{\mathrm{tri}}$, since $\tau \to 0$, this bound becomes very loose; we leave a tighter bound as future work.

Remarks for $H > 1$. Note that for $H > 1$, $L_{\mathrm{nce}}$ has multiple negative pairs and $\partial L/\partial r_{k-} = -e^{-r_{k-}/\tau}/(\tau Z(\mathcal{X}))$, where $Z(\mathcal{X}) := e^{-r_+/\tau} + \sum_{k'=1}^{H} e^{-r_{k'-}/\tau}$. While the numerator $e^{-r_{k-}/\tau}$ still depends only on the distance between $x_1$ and $x_{k-}$ (which is good), the normalization constant $Z(\mathcal{X})$ depends on all $H+1$ distance pairs simultaneously. This leads to

$$\xi_k = -\left.\frac{\partial L}{\partial r_{k-}}\right|_{\mathcal{X}_0} = \frac{e^{-\|x - x_k\|_2^2/\tau}}{1 + \sum_{k'=1}^{H} e^{-\|x - x_{k'}\|_2^2/\tau}}$$

which causes issues for the symmetrization trick (Eqn. 54), because the denominator involves many negative pairs at the same time.
However, if we assume that, given a pair of distinct data points $(x, x')$, the normalization constant $Z$ averaged over data augmentation is approximately constant due to the homogeneity of the dataset and the augmentations, then Eqn. 54 can still be applied and a similar conclusion follows.
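The covariance-operator result of Theorem 3 can be spot-checked by Monte Carlo with a toy "connection" (our own sketch; we take $K(x') = x'$, a linear feature of the input, purely so the expectations are analytically known — this is an assumption for illustration, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 3, 200_000

# Base distribution p(.): x = mu + A z; augmentation p_aug(.|x): small noise.
mu, A = rng.standard_normal(d), rng.standard_normal((d, d))
base = lambda n: mu + rng.standard_normal((n, d)) @ A.T
aug = lambda x: x + 0.3 * rng.standard_normal(x.shape)

# Toy "connection": K(x') = x' itself, so the augmentation mean K_bar(x) = x.
x, xk = base(N), base(N)                 # positive-pair base samples, negatives
x1, xp, xm = aug(x), aug(x), aug(xk)     # x1, x+ ~ p_aug(.|x);  x- ~ p_aug(.|xk)

# Large-batch limit of the Theorem 3 factor E[K(x1) (K(x+) - K(x-))^T] ...
lhs = np.einsum("ni,nj->ij", x1, xp - xm) / N
# ... should equal the covariance operator V_x[K_bar(x)] = Cov(x) = A A^T.
rhs = A @ A.T

print(np.abs(lhs - rhs).max())   # small: Monte Carlo error only
```

The positive pair contributes the second moment of $\bar K$, the negative pair subtracts the squared mean, and their difference is exactly the covariance.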

C THE DYNAMICS OF TWO-LAYER RELU NETWORKS AND THE INTERPLAY OF COVARIANCE OPERATORS BETWEEN NEARBY LAYERS (SECTION 4.2)

C.1 THEOREM 5

Proof. For convenience, we define the centralized version of $u_j(z_0)$: $\hat u_j(z_0) := u_j(z_0) - \mathbb{E}_{z_0}[u_j(z_0)] \in \mathbb{R}^d$, and the matrices $A_{jk} := \mathrm{Cov}_{z_0}[u_j(z_0), u_k(z_0)] = \mathbb{E}_{z_0}[\hat u_j(z_0)\hat u_k^\top(z_0)] \in \mathbb{R}^{d\times d}$, where both $j$ and $k$ run from 1 to $n_1$. At layer $l = 1$, the covariance operator is $\mathbb{V}_{z_0}[\bar K_1(z_0)] = [w_{2,j}w_{2,k}A_{jk}] \in \mathbb{R}^{n_1 d\times n_1 d}$. At the second layer $l = 2$, we can compute $\bar K_2(z_0) \in \mathbb{R}^{n_1 n_2\times n_2}$. For input $j$ of the second layer, its expectation with respect to $z|z_0$ is $w_{1,j}^\top u_j(z_0)$; since the last layer has no ReLU nonlinearity, the Jacobian $J_2 = I_{n_2\times n_2}$ is independent of the input. So we have:

$$\bar K_2(z_0) = \begin{bmatrix} w_{1,1}^\top u_1(z_0) \\ w_{1,2}^\top u_2(z_0) \\ \vdots \\ w_{1,n_1}^\top u_{n_1}(z_0) \end{bmatrix} \otimes I_{n_2\times n_2} \in \mathbb{R}^{n_1 n_2\times n_2}$$

So at layer $l = 2$ the covariance operator is $\mathbb{V}_{z_0}[\bar K_2(z_0)] = [w_{1,j}^\top A_{jk} w_{1,k}] \otimes I_{n_2\times n_2} \in \mathbb{R}^{n_1 n_2\times n_1 n_2}$, where $[w_{1,j}^\top A_{jk} w_{1,k}] \in \mathbb{R}^{n_1\times n_1}$. Using these two covariance operators, we can write down the weight update in the SimCLR setting with the simple contrastive loss (here $Q_j := \sum_k A_{jk} w_{1,k} w_{2,k}^\top \in \mathbb{R}^{d\times n_2}$):

$$\dot w_{1,j} = Q_j w_{2,j}, \qquad \dot w_{2,j} = Q_j^\top w_{1,j} \quad (65)$$

The dynamics of Eqn. 65 can be quite general and hard to solve. In the following, we discuss some special cases.

Diagonal $W_2$. We consider the case where $W_2$ is a diagonal square matrix, so $n_1 = n_2$ and $W_2 = \mathrm{diag}(w_{2,1}, w_{2,2}, \dots, w_{2,n_1})$, and it retains this structure throughout training. Note that this also means there is no bias term for any output node. In this case we can simplify Eqn. 65, using the fact that $w_{2,k}^\top(t) w_{2,j}(t) = 0$ for $j \neq k$ at any time step $t$ (again, all top-layer biases are zero, otherwise this orthogonality condition does not hold):

$$\dot w_{1,j} = w_{2,j}^2\, A_{jj}\, w_{1,j} \quad (66)$$
$$\dot w_{2,j} = \left(w_{1,j}^\top A_{jj} w_{1,j}\right) w_{2,j} \quad (67)$$

Multiplying Eqn. 66 by $w_{1,j}^\top$ and Eqn. 67 by $w_{2,j}$, we arrive at:

$$\frac{1}{2}\frac{\mathrm{d}\|w_{1,j}\|_2^2}{\mathrm{d}t} = w_{2,j}^2\left(w_{1,j}^\top A_{jj} w_{1,j}\right) \quad (68)$$
$$\frac{1}{2}\frac{\mathrm{d}w_{2,j}^2}{\mathrm{d}t} = \left(w_{1,j}^\top A_{jj} w_{1,j}\right) w_{2,j}^2 \quad (69)$$

Therefore $\mathrm{d}\|w_{1,j}\|_2^2/\mathrm{d}t = \mathrm{d}w_{2,j}^2/\mathrm{d}t$, and thus $\|w_{1,j}\|_2^2 = w_{2,j}^2 + c$ for some time-independent constant $c$.

D HIERARCHICAL LATENT TREE MODELS (SECTION 4.3)

D.1 LEMMAS

Lemma 4 (Variance squashing). Suppose a function $\phi: \mathbb{R}\to\mathbb{R}$ is $L$-Lipschitz continuous, i.e., $|\phi(x) - \phi(y)| \le L|x - y|$. Then for $x\sim p(\cdot)$ we have:

$$\mathbb{V}_p[\phi(x)] \le L^2\,\mathbb{V}_p[x]$$

Proof. Suppose $x, y \sim p(\cdot)$ are independent samples and $\mu_\phi := \mathbb{E}[\phi(x)]$. Note that $\mathbb{V}[\phi(x)]$ can be written as follows:

$$\mathbb{E}\left[|\phi(x) - \phi(y)|^2\right] = \mathbb{E}\left[|(\phi(x) - \mu_\phi) - (\phi(y) - \mu_\phi)|^2\right] = \mathbb{E}\left[|\phi(x) - \mu_\phi|^2\right] + \mathbb{E}\left[|\phi(y) - \mu_\phi|^2\right] - 2\,\mathbb{E}\left[(\phi(x) - \mu_\phi)(\phi(y) - \mu_\phi)\right] = 2\,\mathbb{V}_p[\phi(x)]$$

Therefore we have:

$$\mathbb{V}_p[\phi(x)] = \frac{1}{2}\,\mathbb{E}\left[|\phi(x) - \phi(y)|^2\right] \le \frac{L^2}{2}\,\mathbb{E}\left[|x - y|^2\right] = L^2\,\mathbb{V}_p[x]$$

Lemma 5 (Sharpened Jensen's inequality (Liao and Berg, 2018)). If the function $\phi$ is twice differentiable and $x\sim p(\cdot)$, then we have:

$$\frac{1}{2}\mathbb{V}[x]\inf\phi'' \le \mathbb{E}[\phi(x)] - \phi(\mathbb{E}[x]) \le \frac{1}{2}\mathbb{V}[x]\sup\phi'' \quad (73)$$

Lemma 6 (Sharpened Jensen's inequality for the ReLU activation). For the ReLU activation $\psi(x) := \max(x, 0)$ and $x\sim p(\cdot)$, we have:

$$0 \le \mathbb{E}[\psi(x)] - \psi(\mathbb{E}[x]) \le \sqrt{\mathbb{V}_p[x]}$$

Proof. Since $\psi$ is a convex function, by Jensen's inequality we have $\mathbb{E}[\psi(x)] - \psi(\mathbb{E}[x]) \ge 0$. For the other side, let $\mu := \mathbb{E}_p[x]$; then (note that for ReLU, $\psi(x) - \psi(\mu) \le |x - \mu|$):

$$\mathbb{E}[\psi(x)] - \psi(\mathbb{E}[x]) = \int (\psi(x) - \psi(\mu))\,p(x)\,\mathrm{d}x \quad (75)$$
$$\le \int |x - \mu|\,p(x)\,\mathrm{d}x \quad (76)$$
$$\le \left(\int |x - \mu|^2\,p(x)\,\mathrm{d}x\right)^{1/2}\left(\int p(x)\,\mathrm{d}x\right)^{1/2} \quad (77)$$
$$= \sqrt{\mathbb{V}_p[x]}$$

where the last inequality is due to Cauchy–Schwarz.
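Both lemmas are straightforward to spot-check numerically (a quick sketch of ours, using ReLU, which is 1-Lipschitz):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(-0.5, 2.0, size=1_000_000)
relu = lambda t: np.maximum(t, 0.0)

# Lemma 4 (variance squashing), with L = 1 for ReLU: V[psi(x)] <= V[x].
print(relu(x).var() <= x.var())                       # True

# Lemma 6 (sharpened Jensen for ReLU): 0 <= E[psi(x)] - psi(E[x]) <= sqrt(V[x]).
gap = relu(x).mean() - relu(x.mean())
print(0.0 <= gap <= np.sqrt(x.var()))                 # True
```

Note the Jensen gap is largest when the distribution straddles zero, which is exactly the regime used in the HLTM analysis below.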

D.2 MOTIVATION AND DESCRIPTION OF A GENERAL HLTM

Here we describe a general Hierarchical Latent Tree Model (HLTM) of data, and the structure of a multilayer neural network that learns from this data. The structure of the HLTM is motivated by the hierarchical structure of our world, in which objects may consist of parts, which in turn may consist of subparts. Moreover, the parts and subparts may appear in different configurations relative to each other in any given instantiation of an object, and any given subpart may be occluded in any given view of an object. The HLTM is a very simple toy model that represents a highly abstract mathematical version of this much more realistic scenario. It consists of a tree-structured generative model of data (see Fig. 2(d) and Fig. 4). Simpler versions of this generative model have been used to mathematically model how both infants and linear neural networks learn hierarchically structured data (Saxe et al., 2019). At the top of the tree (i.e., level $L$), a single categorical latent variable $z_0$ takes one of $m_0$ possible integer values in $\{0, \dots, m_0 - 1\}$, with a prior distribution $\mathbb{P}(z_0)$. One can roughly think of the value of $z_0$ as denoting the identity of one of $m_0$ possible objects. At level $L-1$ there is a set of latent variables $\mathcal{Z}_{L-1}$. This set is indexed by $\mu$, and each latent variable $z_\mu$ is itself a categorical variable that takes one of $m_\mu$ values in $\{0, \dots, m_\mu - 1\}$. Roughly, we can think of each latent variable $z_\mu$ as corresponding to a part, with the different values of $z_\mu$ reflecting different configurations or occlusion states of that part. The conditional probability distributions $\mathbb{P}(z_\mu|z_0)$, or $m_0$-by-$m_\mu$ transition probability matrices, can roughly be thought of as collectively reflecting the distribution over the presence or absence, as well as the configurational and occlusional states, of each part $\mu$, conditioned on the object identity $z_0$. This process can continue downwards to describe subparts of parts, and so on.

D.3 A TWO-LAYER EXAMPLE TO DEMONSTRATE NOTATION

For simplicity, here we demonstrate the notation in a two-layer generative model and a two-layer network with $L = 2$. Thus the top, middle, and leaf levels of the generative model are labelled by $l = 2, 1, 0$, respectively, with corresponding latents $z_0$, $z_\mu$ and $z_\nu$, while the corresponding input, hidden, and final layers of the neural network are labelled by $l = 0, 1, 2$, respectively. As shown in Fig. 4, at the leaf level $l = 0$ there is a set of visible variables $\mathcal{Z}_0$. This set is indexed by $\nu$, and each visible variable $z_\nu$ is itself a categorical variable that takes one of $m_\nu$ values in the set $\{0, \dots, m_\nu - 1\}$. Roughly, we can think of each $z_\nu$ as a pixel or, more generally, some visible feature. For simplicity, we assume that each part $z_\mu$ at level 1 affects a distinct subset of pixels or visible features $z_\nu$. In essence, we assume each visible variable $z_\nu$ is a child of a unique level-1 latent variable $z_\mu$ in the generative tree process (Fig. 2(d)). In the rough analogy to objects and parts, each part $\mu$ controls the appearance of a subset of spatially localized nearby pixels or visual features $\nu$ that are all children of part $\mu$; conversely, each such local cluster of pixels or feature values is influenced by the state of a single part. The conditional probability distribution $\mathbb{P}(z_\nu|z_\mu)$, or $m_\mu$-by-$m_\nu$ transition probability matrix, then describes the distribution over pixel or visual feature values $z_\nu$ of each child pixel $\nu$, conditioned on the state $z_\mu$ of the parent part $\mu$. We next consider the two-layer ReLU network that learns from data generated by this two-layer HLTM (right-hand side of Fig. 2(d)). The neural network has a set of input neurons that are in one-to-one correspondence with the pixels or visible variables $z_\nu$ that arise at the leaves of the HLTM, where $l = 0$.
For any given object $z_0$ at layer $l = 2$, its associated part states $z_\mu$ at layer $l = 1$, and visible feature values $z_\nu$ at layer $l = 0$, the input neurons of the neural network receive only the visible feature values $z_\nu$ as real analog inputs. Thus the neural network does not have direct access to the latent variables $z_0$ and $z_\mu$ that generate these visible variables.

Over-parameterization. While at the pixel level there is a one-to-one correspondence between the children $\nu$ of a subpart $\mu$ and the pixels, in the hidden layer more than one neuron may correspond to $z_\mu$ (i.e., pool from pixels whose values are influenced by part $\mu$), which is a form of over-parameterization. We thus let $N_\mu$ denote the set of such hidden neurons, and we let $N_l$ denote the set of all neurons in layer $l$ of the network; thus $N_\mu$ is a subset of $N_1$. We further let $N^{\mathrm{ch}}_\mu$ denote the subset of neurons in layer 0 that provide input to the hidden-layer neurons in $N_\mu$; thus $N^{\mathrm{ch}}_\mu$ is a subset of $N_0$. Each neuron in the subset $N^{\mathrm{ch}}_\mu$ is in one-to-one correspondence with a child (i.e., some $z_\nu$) of the latent variable $z_\mu$ in the generative tree (see Fig. 2(d)). In applying SSL in this setup, each object is specified by a set of values for $z_0$ (object identity), $z_\mu$ (configurational and occlusional states of parts), and $z_\nu$ (pixel values). Given any such object, with its realization of part states and resulting pixel values, we assume that the process of data augmentation corresponds to resampling $z_\mu$ and $z_\nu$ from the conditional distributions $\mathbb{P}(z_\mu|z_0)$ and $\mathbb{P}(z_\nu|z_\mu)$, while fixing the object identity $z_0$. This augmentation process then roughly corresponds to being able to sample the same object under different part configurations and views. The key question of interest that we wish to address is: under this generative model of data and model of data augmentation, what do the hidden units of the neural network learn?
In particular, can they invert the generative model, converting pixel values at neural layer $l = 0$ into hidden representations at neural layer $l = 1$ that reflect the existence of parts with their associated states $z_\mu$? More precisely, can the network learn hidden units whose activations across all data points correlate well with the values a latent variable $z_\mu$ takes across those data points? We address this question next, first introducing further simplifying technical assumptions on the generative model and additional notation.

D.4 A GENERAL STRUCTURE OF CONDITIONAL DISTRIBUTIONS IN THE HLTM

For convenience, we define the following symbols for $k \in N^{\mathrm{ch}}_\mu$ (note that $|N^{\mathrm{ch}}_\mu|$ is the number of children of the node set $N_\mu$):

$$v_{\mu k} := \mathbb{E}_z[f_k|z_\mu] = P_{\mu\nu(k)}\, v_k \in \mathbb{R}^{m_\mu} \quad (79)$$
$$V_{\mu, N^{\mathrm{ch}}_\mu} := [v_{\mu k}]_{k\in N^{\mathrm{ch}}_\mu} \quad (80)$$
$$\tilde v_j := \mathbb{E}_z[\tilde f_j|z_\mu] = V_{\mu, N^{\mathrm{ch}}_\mu}\, w_j \in \mathbb{R}^{m_\mu} \quad (81)$$

As an extension of the binary symmetric HLTM, we make the following assumption on the transition probabilities:

Assumption 1. For $\mu \in \mathcal{Z}_l$ and $\nu \in \mathcal{Z}_{l-1}$, the transition probability matrix $P_{\mu\nu} := [\mathbb{P}(z_\nu|z_\mu)]$ has the decomposition $P_{\mu\nu} = \frac{1}{m_\nu}\mathbf{1}_\mu\mathbf{1}_\nu^\top + C_{\mu\nu}$, where $C_{\mu\nu}\mathbf{1}_\nu = \mathbf{0}_\mu$ and $\mathbf{1}_\mu^\top C_{\mu\nu} = \mathbf{0}_\nu^\top$.

Note that $C_{\mu\nu}\mathbf{1}_\nu = \mathbf{0}_\mu$ is immediate from the row normalization of conditional probabilities; the real condition is $\mathbf{1}_\mu^\top C_{\mu\nu} = \mathbf{0}_\nu^\top$. If $m_\mu = m_\nu$, then $P_{\mu\nu}$ is a square matrix and Assumption 1 is equivalent to $P_{\mu\nu}$ being doubly stochastic. Assumption 1 makes the computation of $P_{\mu\alpha}$ easy for any $z_\mu$ and $z_\alpha$:

Lemma 7 (Transition probability). If Assumption 1 holds, then for $\mu \in \mathcal{Z}_l$, $\nu \in \mathcal{Z}_{l-1}$ and $\alpha \in \mathcal{Z}_{l-2}$, we have:

$$P_{\mu\alpha} = P_{\mu\nu}P_{\nu\alpha} = \frac{1}{m_\alpha}\mathbf{1}_\mu\mathbf{1}_\alpha^\top + C_{\mu\nu}C_{\nu\alpha} \quad (82)$$

In general, for any $\mu \in \mathcal{N}_{l_1}$ and $\alpha \in \mathcal{N}_{l_2}$ with $l_1 > l_2$, we have:

$$P_{\mu\alpha} = \frac{1}{m_\alpha}\mathbf{1}_\mu\mathbf{1}_\alpha^\top + \prod_{(\xi,\zeta)\in\{\mu,\dots,\xi,\zeta,\dots,\alpha\}} C_{\xi\zeta} \quad (83)$$

Proof. Using Assumption 1, we have:

$$P_{\mu\alpha} = P_{\mu\nu}P_{\nu\alpha} = \left(\frac{1}{m_\nu}\mathbf{1}_\mu\mathbf{1}_\nu^\top + C_{\mu\nu}\right)\left(\frac{1}{m_\alpha}\mathbf{1}_\nu\mathbf{1}_\alpha^\top + C_{\nu\alpha}\right) \quad (84\text{–}85)$$

Since $\mathbf{1}_\nu^\top\mathbf{1}_\nu = m_\nu$, $C_{\mu\nu}\mathbf{1}_\nu = \mathbf{0}_\mu$ and $\mathbf{1}_\nu^\top C_{\nu\alpha} = \mathbf{0}_\alpha^\top$, the conclusion follows.

Remark. In the symmetric binary HLTM mentioned in the main text, every $C_{\mu\nu}$ can be parameterized as (here $q := [-1, 1]^\top$):

$$C_{\mu\nu} = C_{\mu\nu}(\rho_{\mu\nu}) = \frac{1}{2}\begin{bmatrix}\rho_{\mu\nu} & -\rho_{\mu\nu}\\ -\rho_{\mu\nu} & \rho_{\mu\nu}\end{bmatrix} = \frac{1}{2}\rho_{\mu\nu}\, q q^\top \quad (86)$$

This is because $\mathbf{1}_2^\top C_{\mu\nu} = \mathbf{0}_2^\top$ and $C_{\mu\nu}\mathbf{1}_2 = \mathbf{0}_2$ provide 4 linear constraints (1 redundant), leaving 1 free parameter, which is the polarity $\rho_{\mu\nu} \in [-1, 1]$ of the latent variable $z_\nu$ given its parent $z_\mu$.
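Assumption 1 and the composition rule of Lemma 7 can be verified with a small numerical example (our own sketch): we build transition matrices from random perturbations with zero row sums and zero column sums, and check that the uniform part and the $C$ parts compose independently.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_C(m, scale=0.05):
    """Random matrix with zero row sums and zero column sums."""
    C = rng.standard_normal((m, m)) * scale
    C -= C.mean(axis=1, keepdims=True)   # zero row sums
    C -= C.mean(axis=0, keepdims=True)   # zero column sums (row sums stay zero)
    return C

m = 4
C_mn, C_na = random_C(m), random_C(m)
P_mn = np.ones((m, m)) / m + C_mn        # Assumption 1 decomposition
P_na = np.ones((m, m)) / m + C_na

# Lemma 7: P_mn @ P_na = (1/m) 1 1^T + C_mn @ C_na
lhs = P_mn @ P_na
rhs = np.ones((m, m)) / m + C_mn @ C_na
print(np.allclose(lhs, rhs))                                       # True
print(np.allclose(lhs.sum(axis=1), 1.0),
      np.allclose(lhs.sum(axis=0), 1.0))                           # doubly stochastic
```

The cross terms vanish precisely because $\mathbf{1}^\top C = \mathbf{0}^\top$ and $C\mathbf{1} = \mathbf{0}$, so correlations can only propagate through products of $C$'s along the tree.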
Moreover, since $q^\top q = 2$, the parameterization is closed under multiplication:

$$C(\rho_{\mu\nu})C(\rho_{\nu\alpha}) = \frac{1}{4}\, q q^\top q q^\top\, \rho_{\mu\nu}\rho_{\nu\alpha} = \frac{1}{2}\, q q^\top\, \rho_{\mu\nu}\rho_{\nu\alpha} = C(\rho_{\mu\nu}\rho_{\nu\alpha})$$

Therefore, when $|N_\mu| = O(\exp(c))$, with high probability there exists $w_j$ so that

$$\tilde v_j^+ = w_j^\top u^+_{N^{\mathrm{ch}}_\mu} \ge \frac{\sqrt c\,\sigma_w}{\sqrt{|N^{\mathrm{ch}}_\mu|}}\left\| u^+_{N^{\mathrm{ch}}_\mu}\right\|, \qquad \tilde v_j^- = w_j^\top u^-_{N^{\mathrm{ch}}_\mu} < 0 \quad (115)$$

Since $v_{N^{\mathrm{ch}}_\mu} \ge 0$ (all $f_k$ are post-ReLU and hence non-negative), this leads to:

$$\tilde v_j^+ \ge \frac{\sqrt c\,\sigma_w}{\sqrt{|N^{\mathrm{ch}}_\mu|}}\left\|u^+_{N^{\mathrm{ch}}_\mu}\right\| \ge \frac{\sqrt c\,\sigma_w}{\sqrt{|N^{\mathrm{ch}}_\mu|}}\sqrt{\sum_{k\in A_+} a_k^2} \ge \sigma_w\sqrt{\frac{c}{2|N^{\mathrm{ch}}_\mu|}\sum_{k\in N^{\mathrm{ch}}_\mu}\rho^2_{\mu\nu}\, s_k^2} \quad (116)$$

By Jensen's inequality (note that $\psi(x) := \max(x, 0)$ is the ReLU activation):

$$v_j^+ = \mathbb{E}_z[f_j|z_\mu = 1] = \mathbb{E}_z\left[\psi(\tilde f_j)\,\middle|\,z_\mu = 1\right] \ge \psi\left(\mathbb{E}_z\left[\tilde f_j\,\middle|\,z_\mu = 1\right]\right) = \psi(\tilde v_j^+) \ge \sigma_w\sqrt{\frac{c}{2|N^{\mathrm{ch}}_\mu|}\sum_{k\in N^{\mathrm{ch}}_\mu}\rho^2_{\mu\nu}\, s_k^2} \quad (117\text{–}118)$$

On the other hand, we also want to bound $v_j^- := \mathbb{E}_z[f_j|z_\mu = 0]$ using the sharpened Jensen's inequality (Lemma 6). For this we need to compute the conditional variance $\mathbb{V}_z[\tilde f_j|z_\mu]$:

$$\mathbb{V}_z[\tilde f_j|z_\mu] \stackrel{(a)}{=} \sum_k w_{jk}^2\,\mathbb{V}_z[f_k|z_\mu] \stackrel{(b)}{\le} \frac{3\sigma_w^2}{|N^{\mathrm{ch}}_\mu|}\sum_k \mathbb{V}_z[f_k|z_\mu] \quad (119)$$
$$= \frac{3\sigma_w^2}{|N^{\mathrm{ch}}_\mu|}\sum_k \left(\mathbb{E}_{z_\nu|z_\mu}\left[\mathbb{V}[f_k|z_\nu]\right] + \mathbb{V}_{z_\nu|z_\mu}\left[\mathbb{E}_z[f_k|z_\nu]\right]\right) \quad (120)$$
$$\le 3\sigma_w^2\left(\sigma_l^2 + \frac{1}{|N^{\mathrm{ch}}_\mu|}\sum_k \mathbb{V}_{z_\nu|z_\mu}\left[\mathbb{E}_z[f_k|z_\nu]\right]\right) \quad (121)$$

Here (a) is due to conditional independence: $f_k$, as a computed activation, depends only on the latent variable $z_\nu$ and its descendants; given $z_\mu$, all $z_\nu$ and their respective descendants are independent of each other, and so are the $f_k$. (b) is due to the fact that each $w_{jk}$ is sampled from a uniform distribution with $|w_{jk}| \le \sigma_w\sqrt{3/|N^{\mathrm{ch}}_\mu|}$. The term $\mathbb{V}_{z_\nu|z_\mu}[\mathbb{E}_z[f_k|z_\nu]] = s_k^2(1 - \rho_{\mu\nu}^2)$ can be computed analytically: it is the variance of a binary distribution that takes $v_k^+$ with probability $\frac{1}{2}(1 + \rho_{\mu\nu})$ and otherwise takes $v_k^-$.
Therefore, we finally have:

$$\mathbb{V}_z[\tilde f_j|z_\mu] \le 3\sigma_w^2\left(\sigma_l^2 + \frac{1}{|N^{\mathrm{ch}}_\mu|}\sum_k s_k^2\left(1 - \rho_{\mu\nu}^2\right)\right)$$

As a side note, using Lemma 4, since the ReLU function $\psi$ has Lipschitz constant $\le 1$ (empirically it is smaller), we also know that:

$$\mathbb{V}_z[f_j|z_\mu] \le 3\sigma_w^2\left(\sigma_l^2 + \frac{1}{|N^{\mathrm{ch}}_\mu|}\sum_k s_k^2\left(1 - \rho_{\mu\nu}^2\right)\right)$$

Finally, using Lemma 6 and $\tilde v_j^- < 0$, we have:

$$v_j^- = \mathbb{E}_z[f_j|z_\mu = 0] = \mathbb{E}_z\left[\psi(\tilde f_j)\,\middle|\,z_\mu = 0\right] \quad (124)$$
$$\le \psi\left(\mathbb{E}_z\left[\tilde f_j\,\middle|\,z_\mu = 0\right]\right) + \sqrt{\mathbb{V}_z[\tilde f_j|z_\mu = 0]} \quad (125)$$
$$= \sqrt{\mathbb{V}_z[\tilde f_j|z_\mu = 0]} \quad (126)$$
$$\le \sigma_w\sqrt{3\sigma_l^2 + \frac{3}{|N^{\mathrm{ch}}_\mu|}\sum_k s_k^2\left(1 - \rho_{\mu\nu}^2\right)} \quad (127)$$

Combining Eqn. 118 and Eqn. 127, we have a bound for $\lambda_j$:

$$\lambda_j = (v_j^+)^2 - (v_j^-)^2 \ge 3\sigma_w^2\left[\frac{1}{|N^{\mathrm{ch}}_\mu|}\sum_k s_k^2\left(\frac{c+6}{6}\rho_{\mu\nu}^2 - 1\right) - \sigma_l^2\right] \quad (128)$$

E THE ANALYSIS OF BYOL IN SEC. 5

E.1 DERIVATION OF THE BYOL GRADIENT

Note that for BYOL, we have:

$$\mathrm{vec}\left(\frac{\partial r}{\partial W_l}\right) = K_l(x_1;\mathcal{W})\left[K_l^\top(x_1;\mathcal{W})\,\mathrm{vec}(W_l) - K_l^\top(x_2;\mathcal{W}')\,\mathrm{vec}(W_l')\right] \quad (129)$$

In the large-batch limit we have (note that we omit $\mathcal{W}$ for any term that depends on $\mathcal{W}$, but make the dependence on $\mathcal{W}'$ explicit; for brevity we write $\mathbb{E}_x[\cdot] := \mathbb{E}_{x\sim p(\cdot)}[\cdot]$ and $\mathbb{E}_{x'}[\cdot] := \mathbb{E}_{x'\sim p_{\mathrm{aug}}(\cdot|x)}[\cdot]$, and similarly for $\mathbb{V}$):

$$\mathrm{vec}\left(\frac{\partial r}{\partial W_l}\right) = \mathbb{E}_x\,\mathbb{E}_{x'}\left[K_l(x')K_l^\top(x')\right]\mathrm{vec}(W_l) - \mathbb{E}_x\left[\bar K_l(x)\,\bar K_l^\top(x;\mathcal{W}')\right]\mathrm{vec}(W_l')$$

and the equation above can be written as:

$$\mathrm{vec}\left(\frac{\partial r}{\partial W_l}\right) = \mathbb{E}_x\left\{\mathbb{V}_{x'}\left[K_l(x')\right]\right\}\mathrm{vec}(W_l) + \mathbb{E}_x\left[\bar K_l(x)\left(\bar K_l^\top(x)\,\mathrm{vec}(W_l) - \bar K_l^\top(x;\mathcal{W}')\,\mathrm{vec}(W_l')\right)\right] \quad (130\text{–}131)$$

In terms of the weight update by gradient descent, since $\Delta W_l = -\frac{\partial r}{\partial W_l}$, we have:

$$\mathrm{vec}(\Delta W_l) = -\mathbb{E}_x\left\{\mathbb{V}_{x'}\left[K_l(x')\right]\right\}\mathrm{vec}(W_l) - \mathbb{E}_x\left[\bar K_l(x)\left(\bar K_l^\top(x)\,\mathrm{vec}(W_l) - \bar K_l^\top(x;\mathcal{W}')\,\mathrm{vec}(W_l')\right)\right] \quad (132\text{–}133)$$

If we consider the special case $\mathcal{W}' = \mathcal{W}$, the last two terms cancel out, yielding:

$$\mathrm{vec}(\Delta W_l)^{\mathrm{sym}} = -\mathbb{E}_x\left\{\mathbb{V}_{x'}\left[K_l(x')\right]\right\}\mathrm{vec}(W_l) \quad (134)$$

and the general update can be written as:

$$\mathrm{vec}(\Delta W_l) = \mathrm{vec}(\Delta W_l)^{\mathrm{sym}} - \mathbb{E}_x\left[\bar K_l(x)\left(\bar K_l^\top(x)\,\mathrm{vec}(W_l) - \bar K_l^\top(x;\mathcal{W}')\,\mathrm{vec}(W_l')\right)\right] \quad (135\text{–}136)$$

E.2 THEOREM 8

Proof. When BN is present, Eqn.
129 needs to be corrected with an additional term, $\frac{\partial \tilde r}{\partial W_l} := \frac{\partial r}{\partial W_l} - \delta W_l^{\mathrm{BN}}$, where $\delta W_l^{\mathrm{BN}}$ is defined as follows:

$$\delta W_l^{\mathrm{BN}} := \frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}} D_l^i\,\bar g_l\, f_{l-1}^{i\top} \quad (137)$$

From the proof of Theorem 1 (see Eqn. 26), we know that for each sample $i\in\mathcal{B}$ (note that, by definition, the backpropagated gradient after the nonlinearity $\tilde g_l^i$ equals $D_l^i g_l^i$, where $g_l^i$ is the backpropagated gradient before the nonlinearity):

$$D_l^i g_l^i = J_l^{i\top}\left[J_l^i\, W_l\, f_{l-1}^i - J_l^i(\mathcal{W}')\, W_l'\, f_{l-1}^i(\mathcal{W}')\right] \quad (138)$$

Since the network is linear from layer $l$ to the topmost layer $L$, we have $D_l^i = \bar D_l$. Moreover, the only input-dependent part of $J_l^i$ is the gating between layer $l$ and the topmost layer $L$; for a linear network the gating is always 1, so $J_l^i = \bar J_l$ is independent of the input data. We now have (note that we omit $\mathcal{W}$ for any term that depends on $\mathcal{W}$, but write $\mathcal{W}'$ explicitly for terms that depend on $\mathcal{W}'$):

$$\delta W_l^{\mathrm{BN}} = \frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}} D_l^i\,\bar g_l\, f_{l-1}^{i\top} = \bar D_l\,\bar g_l\,\bar f_{l-1}^\top = \bar J_l^\top\left[\bar J_l\, W_l\,\bar f_{l-1} - \bar J_l(\mathcal{W}')\, W_l'\,\bar f_{l-1}(\mathcal{W}')\right]\bar f_{l-1}^\top \quad (139\text{–}140)$$

Therefore we have:

$$\mathrm{vec}(\delta W_l^{\mathrm{BN}}) = \left(\bar f_{l-1}\otimes \bar J_l^\top\right)\left[\left(\bar f_{l-1}\otimes \bar J_l^\top\right)^\top\mathrm{vec}(W_l) - \left(\bar f_{l-1}(\mathcal{W}')\otimes \bar J_l^\top(\mathcal{W}')\right)^\top\mathrm{vec}(W_l')\right] \quad (141)$$

Note that, by assumption, since $\bar J_l$ does not depend on the input data, we have:

$$\bar f_{l-1}\otimes \bar J_l^\top = \mathbb{E}_{\mathcal{B}}\left[f_{l-1}\right]\otimes \bar J_l^\top = \mathbb{E}_{\mathcal{B}}\left[f_{l-1}\otimes \bar J_l^\top\right] \quad (142)$$

Taking the large-batch limit, and noting that the batch $\mathcal{B}$ can contain any augmented data generated from independent samples of $p(\cdot)$, we have:

$$\mathrm{vec}(\delta W_l^{\mathrm{BN}}) = \mathbb{E}_{x,x'}\left[K_l(x')\right]\left(\mathbb{E}_{x,x'}\left[K_l(x')\right]^\top\mathrm{vec}(W_l) - \mathbb{E}_{x,x'}\left[K_l(x';\mathcal{W}')\right]^\top\mathrm{vec}(W_l')\right) \quad (143\text{–}144)$$

An important point is that the expectation is taken over both $x \sim p(\cdot)$ and $x' \sim p_{\mathrm{aug}}(\cdot|x)$. Intuitively, this is because $\bar f_{l-1}$ and $\bar g_l$ are averages over the entire batch, which has both intra-sample and inter-sample variation. With the augmentation-mean connection $\bar K_l(x)$ we can write:

$$\mathrm{vec}(\delta W_l^{\mathrm{BN}}) = \mathbb{E}_x\left[\bar K_l(x)\right]\left(\mathbb{E}_x\left[\bar K_l(x)\right]^\top\mathrm{vec}(W_l) - \mathbb{E}_x\left[\bar K_l(x;\mathcal{W}')\right]^\top\mathrm{vec}(W_l')\right) \quad (145)$$

Plugging $\delta W_l^{\mathrm{BN}}$ into Eqn.
136 and we have corrected gradient for BYOL: vec ∂r ∂W l = vec ∂r ∂W l -vec (δW BN l ) (146) = E x V x ∼paug(•|x) [K l (x )] vec(W l ) + V x Kl (x) vec(W l ) (147) -Cov x Kl (x), Kl (x; W ) vec(W l ) And the weight update ∆W l = ∆W l + δW BN l is: vec ∆W l = -E x V x ∼paug(•|x) [K l (x )] vec(W l ) -V x Kl (x) vec(W l ) (149) + Cov x Kl (x), Kl (x; W ) vec(W l ) Using Eqn. 134, we have: vec ∆W l = vec(∆W l ) + vec(δW BN l ) (151) = vec(∆W l ) sym (152) -V x Kl (x) vec(W l ) + Cov x Kl (x), Kl (x; W ) vec(W l )
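Throughout Appendix E, the connections $K_l = f_{l-1}\otimes J_l$ turn matrix products into Kronecker products acting on $\mathrm{vec}(W_l)$, via the standard identity $\mathrm{vec}(AXB) = (B^\top\otimes A)\,\mathrm{vec}(X)$ with column-stacking $\mathrm{vec}(\cdot)$. As a quick NumPy sanity check of this identity (not part of the original derivation; the shapes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 2))

def vec(M):
    # column-stacking vec(.), the convention used in the derivation above
    return M.flatten(order="F")

lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
print("vec(AXB) equals (B^T kron A) vec(X):", np.allclose(lhs, rhs))
```

Note that column-major flattening (`order="F"`) is required; NumPy's default row-major flatten corresponds to the transposed convention.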

E.3 COROLLARY 1

Proof. In this case, both the target and online networks use the same weights and there is no predictor, i.e., $\mathcal{W}' = \mathcal{W}$. Therefore, in Eqn. 145, all $W'_l = W_l$ and $\delta W^{\mathrm{BN}}_l = 0$. Note that for SimCLR, the loss function contains both the positive-pair squared distance $r^+$ and the negative-pair squared distance $r^-$. The argument above shows that $\delta W^{\mathrm{BN}}_l = 0$ for the positive-pair distance $r^+$. For the negative-pair distance $r^-$, by the same logic as in Theorem 8, $\delta W^{\mathrm{BN}}_l$ takes the same form as Eqn. 145 and is thus zero as well. Remarks. Note that BatchNorm therefore does not matter in terms of the gradient update, modulo its benefits during optimization. This is consistent with a recent blogpost (Fetterman and Albrecht, 2020).

E.4 COROLLARY 2

Proof. By our condition, we consider the case in which the extra predictor is a single linear layer: $\mathcal{W}_{\mathrm{pred}} = \{W_{\mathrm{pred}}\}$. Note that $W_{\mathrm{pred}}\in\mathbb{R}^{n_L\times n_L}$ is a square matrix; otherwise we could not compute the loss between the output $f_L$ of the online network and the output $f_L$ of the target network. In this case, for the connection $K_l(x)$ in the common part of the network (in $\mathcal{W}_{\mathrm{base}}$), we have:
$$K_l(x) = f_{l-1}(x)\otimes J_l(x) = f_{l-1}(x)\otimes\left(J_{l,\mathrm{base}}(x)W_{\mathrm{pred}}\right) \quad(154)$$
$$= \left(f_{l-1}(x)\otimes J_{l,\mathrm{base}}(x)\right)W_{\mathrm{pred}} \quad(155)$$
Here $J_{l,\mathrm{base}}(x)$ is the Jacobian from the current layer $l$ to the layer right before the extra predictor. The last equality is due to the fact that $f_{l-1}$ is a vector. Therefore, for the augment-mean $\bar K_l(x)$, since $W_{\mathrm{pred}}$ does not depend on the input data distribution, we have:
$$\bar K_l(x) = \bar K_{l,\mathrm{base}}(x)W_{\mathrm{pred}} \quad(156)$$
where $\bar K_{l,\mathrm{base}}(x) := \bar K_l(x;\mathcal{W}_{\mathrm{base}})$. To make things concise, let $\tilde K_l(x) := \bar K_l(x) - \mathbb{E}_x[\bar K_l(x)]$. Obviously we have $\tilde K_l(x) = \tilde K_{l,\mathrm{base}}(x)W_{\mathrm{pred}}$, and the covariance operator becomes:
$$\mathbb{V}_x[\bar K_l(x;\mathcal{W})] = \mathbb{E}_x\left[\tilde K_{l,\mathrm{base}}(x)W_{\mathrm{pred}}W_{\mathrm{pred}}^\top\tilde K^\top_{l,\mathrm{base}}(x)\right] \quad(157)$$
Now let $\widehat{\Delta W_l}$ be the last two terms of the corrected update $\mathrm{vec}(\widetilde{\Delta W_l})$:
$$\mathrm{vec}(\widehat{\Delta W_l}) = -\mathbb{V}_x[\bar K_l(x;\mathcal{W})]\mathrm{vec}(W_l) + \mathrm{Cov}_x[\bar K_l(x;\mathcal{W}), \bar K_l(x;\mathcal{W}')]\mathrm{vec}(W'_l) \quad(158)$$
Since there is no EMA, $\mathcal{W}'_{\mathrm{base}} = \mathcal{W}_{\mathrm{base}}$ and we have:
$$\mathrm{vec}(\widehat{\Delta W_l}) = \left(-\mathbb{V}_x[\bar K_l(x;\mathcal{W})] + \mathrm{Cov}_x[\bar K_l(x;\mathcal{W}), \bar K_l(x;\mathcal{W}')]\right)\mathrm{vec}(W_l) \quad(159)$$
$$= \mathbb{E}_x\left[\tilde K_{l,\mathrm{base}}(x)W_{\mathrm{pred}}\left(I - W_{\mathrm{pred}}\right)^\top\tilde K^\top_{l,\mathrm{base}}(x)\right]\mathrm{vec}(W_l) \quad(160)$$
Therefore, the final expression for $\mathrm{vec}(\widetilde{\Delta W_l})$ is the following:
$$\mathrm{vec}(\widetilde{\Delta W_l}) = \mathrm{vec}(\Delta W_l) + \mathrm{vec}(\delta W^{\mathrm{BN}}_l) = \left(-\mathbb{E}_x\left\{\mathbb{V}_{x'}[K_l(x')]\right\} + \mathbb{E}_x\left[\tilde K_{l,\mathrm{base}}(x)W_{\mathrm{pred}}(I-W_{\mathrm{pred}})^\top\tilde K^\top_{l,\mathrm{base}}(x)\right]\right)\mathrm{vec}(W_l)$$
If there is no stop-gradient on the target network side, so that we receive gradients from both the online and the target networks, then for any common layer $l$ the weight update becomes symmetric (this can be derived by swapping $\mathcal{W}$ with $\mathcal{W}'$ and adding the two terms together):
$$\mathrm{vec}(\widetilde{\Delta W_l}) = 2\,\mathrm{vec}(\Delta W_l)^{\mathrm{sym}} \quad(161)$$
$$-\ \left(\mathbb{V}_x[\bar K_l(x;\mathcal{W})] + \mathbb{V}_x[\bar K_l(x;\mathcal{W}')]\right)\mathrm{vec}(W_l) \quad(162)$$
$$+\ \mathrm{Cov}_x[\bar K_l(x;\mathcal{W}), \bar K_l(x;\mathcal{W}')]\mathrm{vec}(W_l) \quad(163)$$
$$+\ \mathrm{Cov}_x[\bar K_l(x;\mathcal{W}'), \bar K_l(x;\mathcal{W})]\mathrm{vec}(W_l) \quad(164)$$
which gives:
$$\mathrm{vec}(\widetilde{\Delta W_l}) = 2\,\mathrm{vec}(\Delta W_l)^{\mathrm{sym}} - \mathbb{E}_x\left[\tilde K_{l,\mathrm{base}}(x)(I-W_{\mathrm{pred}})(I-W_{\mathrm{pred}})^\top\tilde K^\top_{l,\mathrm{base}}(x)\right]\mathrm{vec}(W_l) = 2\,\mathrm{vec}(\Delta W_l)^{\mathrm{sym}} - \mathbb{V}_x\left[\bar K_{l,\mathrm{base}}(x)(I-W_{\mathrm{pred}})\right]\mathrm{vec}(W_l) \quad(165)$$

E.5 THEOREM 9

Proof. Consider the following discrete dynamics of a weight vector $w(t)$:
$$w(t+1) - w(t) = \alpha\left[-w(t) + (1-\lambda)w_{\mathrm{ema}}(t)\right] \quad(166)$$
where $\alpha$ is the learning rate and $w_{\mathrm{ema}}(t+1) = \gamma_{\mathrm{ema}}w_{\mathrm{ema}}(t) + (1-\gamma_{\mathrm{ema}})w(t)$ is the exponential moving average of $w(t)$. For convenience, we use $\eta := 1-\gamma_{\mathrm{ema}}$. Since this is a recurrence equation, we apply the z-transform in the temporal domain, where $w(z) := \mathcal{Z}[w(t)] = \sum_{t=0}^{+\infty}w(t)z^{-t}$. This leads to:
$$z(w(z) - w(0)) = w(z) - \alpha\left(w(z) - (1-\lambda)\mathcal{Z}[w_{\mathrm{ema}}(t)]\right) \quad(167)$$
Note that for $w_{\mathrm{ema}}(t)$ we have:
$$z(w_{\mathrm{ema}}(z) - w_{\mathrm{ema}}(0)) = (1-\eta)w_{\mathrm{ema}}(z) + \eta w(z) \quad(168)$$
If we set $w_{\mathrm{ema}}(0) = 0$, i.e., the target network is all zero at the beginning, then this gives $w_{\mathrm{ema}}(z) = \frac{\eta}{z-1+\eta}w(z)$. Plugging it back into Eqn. 167, we have:
$$z(w(z) - w(0)) = w(z) - \alpha w(z)\left(1 - \frac{\eta(1-\lambda)}{z-1+\eta}\right) \quad(169)$$
We can then solve for $w(z)$:
$$w(z) = \frac{z(z-1+\eta)}{(z-1)^2 + (\eta+\alpha)(z-1) + \alpha\eta\lambda}\,w(0) \quad(170)$$
Note that the denominator has two roots $z_1$ and $z_2$:
$$z_{1,2} = 1 - \frac{1}{2}\left(\eta+\alpha \pm \sqrt{(\eta+\alpha)^2 - 4\alpha\eta\lambda}\right) \quad(171)$$
and $w(z)$ can be written as
$$w(z) = \frac{z(z-1+\eta)}{(z-z_1)(z-z_2)}\,w(0) \quad(172)$$
Without loss of generality, let $z_1 < z_2$. The larger root satisfies $z_2 > 1$ when $\lambda < 0$, so the zero ($z = 1-\eta = \gamma_{\mathrm{ema}}$) in the numerator will not cancel the pole at $z_2$. And we have:
$$\frac{z}{(z-z_1)(z-z_2)} = \frac{z}{z_2-z_1}\cdot\frac{(z-z_1)-(z-z_2)}{(z-z_1)(z-z_2)} \quad(173)$$
$$= \frac{z}{z_2-z_1}\left(\frac{1}{z-z_2} - \frac{1}{z-z_1}\right) \quad(174)$$
$$= \frac{1}{z_2-z_1}\left(\frac{1}{1-z_2z^{-1}} - \frac{1}{1-z_1z^{-1}}\right) \quad(175)$$
where $1/(1-z_2z^{-1})$ corresponds to the power series $z_2^t$ in the temporal domain. Therefore, $w(t)$ has exponential growth due to $z_2 > 1$. Now let us check how $z_2$ changes with $\eta$, i.e., how the parameter $\gamma_{\mathrm{ema}} := 1-\eta$ of the EMA affects the learning process. We have:
$$z_2 = 1 + \frac{\eta+\alpha}{2}\left(\sqrt{1 - \frac{4\alpha\eta\lambda}{(\eta+\alpha)^2}} - 1\right) \quad(176)$$
Using the fact that $(1+x)^{1/2} \le 1 + \frac{1}{2}x$ for $x \ge 0$ (here $x = -4\alpha\eta\lambda/(\eta+\alpha)^2 \ge 0$ since $\lambda < 0$), we have:
$$z_2 - 1 \le \frac{\eta+\alpha}{4}\cdot\frac{-4\alpha\eta\lambda}{(\eta+\alpha)^2} = \frac{-\lambda}{\frac{1}{\alpha}+\frac{1}{\eta}} \quad(177)$$
Compared to the no-EMA case (i.e., $\gamma_{\mathrm{ema}} = 0$, or $\eta = 1$), with $\gamma_{\mathrm{ema}} < 1$ but close to 1 (or equivalently, $\eta$ close to 0), the upper bound on $z_2$ becomes smaller but remains greater than 1, so the exponential growth is less aggressive, which stabilizes training. Note that if $\gamma_{\mathrm{ema}} = 1$ (or $\eta = 0$), then $w_{\mathrm{ema}}(t) \equiv w_{\mathrm{ema}}(0) = 0$ and learning also does not happen.
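The growth rate predicted by the dominant pole $z_2$ can be checked by iterating the recurrence in Eqn. 166 directly. A minimal sketch (the constants $\alpha$, $\eta$, $\lambda$ are illustrative, with $\lambda < 0$ as assumed above):

```python
import numpy as np

alpha, eta, lam = 0.1, 0.5, -0.5      # learning rate, 1 - gamma_ema, and lambda < 0

w, w_ema = 1.0, 0.0                   # w_ema(0) = 0: target network starts at zero
ratios = []
for t in range(400):
    w_next = w + alpha * (-w + (1 - lam) * w_ema)   # Eqn. 166
    w_ema = (1 - eta) * w_ema + eta * w             # EMA update using the old w(t)
    ratios.append(w_next / w)
    w = w_next

# dominant root of (z-1)^2 + (eta+alpha)(z-1) + alpha*eta*lam = 0
z2 = 1 + 0.5 * (np.sqrt((eta + alpha) ** 2 - 4 * alpha * eta * lam) - (eta + alpha))
print("z2 =", z2, "empirical growth ratio =", ratios[-1])
assert z2 > 1                         # lambda < 0 implies exponential growth
assert abs(ratios[-1] - z2) < 1e-6    # asymptotic ratio w(t+1)/w(t) matches the pole
```

The empirical ratio $w(t+1)/w(t)$ converges to $z_2$ because the subdominant mode $z_1^t$ decays away, exactly as the partial-fraction expansion predicts.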

F EXACT SOLUTIONS TO BYOL WITH LINEAR ARCHITECTURES WITHOUT BATCHNORM

An interesting property of BYOL is that it finds useful non-collapsed solutions for the online and target networks, despite the fact that it does not employ contrastive terms to separate the representations of negative pairs. While BatchNorm can implicitly introduce contrastive terms in BYOL, as discussed in the main paper, recent work (Richemond et al., 2020b) has shown that other normalization methods, which do not introduce contrastive terms, nevertheless enable BYOL to work well. We therefore analyze BYOL in a simple linear setting to obtain insight into why it does not lead to collapsed solutions, even without BatchNorm. We first derive exact fixed point solutions to the BYOL learning dynamics in this setting, and discuss their stability. We then discuss specific models for data distributions and augmentation procedures, and show how the fixed point solutions of the BYOL learning dynamics depend on both. We then discuss how our theory reveals a fundamental role for the predictor in avoiding collapse in BYOL solutions. Finally, we derive a highly reduced three dimensional description of the BYOL learning dynamics that provides considerable insight into the dynamical mechanisms enabling BYOL to avoid collapsed solutions without negative pairs to force apart the representations of different objects.

F.1 THE FIXED POINT STRUCTURE OF BYOL LEARNING DYNAMICS

We consider a single linear layer online network with weights $W_1\in\mathbb{R}^{n_1\times n_0}$ and a single layer target network with weights $\Theta\in\mathbb{R}^{n_1\times n_0}$. Additionally, the online network has a predictor layer with weights $W_2\in\mathbb{R}^{n_1\times n_1}$ that maps the output of the online network to the output space of the target network. BYOL only uses positive pairs, in which a single data point $x$ is drawn from the data distribution $p(\cdot)$, and then two augmented views $x_1$ and $x_2$ are drawn from a conditional augmentation distribution $p_{\mathrm{aug}}(\cdot|x)$.
The loss function driving the learning dynamics of the online weights $W_1$ and predictor weights $W_2$, given a single positive pair $\{x_1, x_2\}$ and a given target network $\Theta$, is $L = \|W_2W_1x_1 - \Theta x_2\|_2^2$. In contrast, the dynamics of the target network weights $\Theta$ follows that of the online weights $W_1$ through an exponential moving average. In the limit of large batch sizes and slow learning rates, the combined learning dynamics is then well approximated by the continuous time ordinary differential equations (see e.g. Saxe et al. (2014) for analogous equations in the setting of supervised learning):
$$\tau_o \frac{dW_2}{dt} = \left(\Theta\Sigma_d - W_2W_1\Sigma_s\right)W_1^\top \quad(179)$$
$$\tau_p \frac{dW_1}{dt} = W_2^\top\left(\Theta\Sigma_d - W_2W_1\Sigma_s\right) \quad(180)$$
$$\tau_t \frac{d\Theta}{dt} = -\Theta + W_1, \quad(181)$$
where
$$\Sigma_s \equiv \mathbb{E}_{x_1}\left[x_1x_1^\top\right] \quad(182)$$
$$\Sigma_d \equiv \mathbb{E}_{x_1,x_2}\left[x_1x_2^\top\right] = \mathbb{E}_{x\sim p(\cdot)}\left[K(x)K(x)^\top\right], \quad\text{and}\quad K(x) \equiv \mathbb{E}_{x_1\sim p_{\mathrm{aug}}(\cdot|x)}\left[x_1\right]. \quad(183, 184)$$
Here $\Sigma_s$ is the correlation matrix of a single augmented view $x_1$ of the data $x$, while $\Sigma_d$ is the correlation matrix between two augmented views of the same data point, or equivalently, the correlation matrix of the augmentation averaged vector $K(x)$. Additionally, we have retained the possibility of three different learning rates for the online, predictor, and target networks, represented by the time constants $\tau_o$, $\tau_p$, and $\tau_t$ respectively. Because of the linearity of the networks, the final outcome of learning depends on the data and augmentation procedures only through the two correlation matrices $\Sigma_s$ and $\Sigma_d$. Examining equation 179 to equation 181, we find sufficient conditions for a fixed point, given by $W_2W_1\Sigma_s = \Theta\Sigma_d$ and $W_1 = \Theta$. Inserting the second equation into the first and right multiplying both sides by $[\Sigma_s]^{-1}$ (assuming $\Sigma_s$ is invertible) yields a manifold of fixed point solutions in $W_1$ and $W_2$ satisfying the nonlinear equation
$$W_2W_1 = W_1\Sigma_d[\Sigma_s]^{-1}. \quad(185)$$
This constitutes a set of $n_1\times n_0$ nonlinear equations in $(n_1\times n_0) + (n_1\times n_1)$ unknowns, yielding generically a nonlinear manifold of solutions in $W_1$ and $W_2$ of dimensionality $n_1\times n_1$, corresponding to the number of predictor parameters. For concreteness, we will assume that $n_1 \le n_0$, so that the online and target networks perform dimensionality reduction. Then a special class of solutions to equation 185 can be obtained by assuming the $n_1$ rows of $W_1$ correspond to $n_1$ left-eigenvectors of $\Sigma_d[\Sigma_s]^{-1}$ and $W_2$ is a diagonal matrix with the corresponding eigenvalues. This special class of solutions can then be generalized by the transformation $W_2 \to SW_2S^{-1}$ and $W_1 \to SW_1$, where $S$ is any invertible $n_1$ by $n_1$ matrix. Indeed this transformation is a symmetry of equation 185, which defines the solution manifold. In addition to these families of solutions, the collapsed solution $W_1 = W_2 = \Theta = 0$ also exists, and a natural question is, why doesn't BYOL generically converge to this collapsed solution? This question can be addressed by analyzing the stability of both the collapsed solution and the families of solutions presented above. The basic calculation involves computing the Jacobian of the vector field defining the dynamics of equation 179 through equation 181. A fixed point solution is stable if and only if all eigenvalues of the Jacobian evaluated at that fixed point are negative. Using methods similar to those of Baldi and Hornik (1989), which carried out a similar stability analysis for the learning dynamics of two weight layer linear networks in the supervised setting, it is possible to show that all of the above fixed point solutions are unstable except for those derived from the special solutions where the $n_1$ rows of $W_1$ correspond to the top $n_1$ principal eigenmodes of $\Sigma_d[\Sigma_s]^{-1}$.
Thus this analysis sketch provides conceptual insight into why BYOL, at least in this simple setting, learns nontrivial and potentially useful representations from only positive examples, and does not converge to the naive collapsed solution. Basically, the collapsed solution, as well as other subdominant solutions, are unstable, while solutions corresponding to the principal eigenmodes of $\Sigma_d[\Sigma_s]^{-1}$ are stable. Thus, from generic initial conditions, one would expect the row space of the online network to converge to the span of the top $n_1$ principal eigenmodes of $\Sigma_d[\Sigma_s]^{-1}$.
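The fixed point structure above can be probed numerically by integrating equations 179-181 with forward Euler. The sketch below uses synthetic, commuting correlation matrices ($\Sigma_s = I$ and a diagonal $\Sigma_d$) and illustrative time constants; it checks only the fixed-point relations $\Theta = W_1$ and $W_2W_1 = W_1\Sigma_d[\Sigma_s]^{-1}$ together with non-collapse, not which eigenmodes are selected:

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1 = 4, 2
Sig_s = np.eye(n0)                          # Sigma_s: whitened single-view correlations
Sig_d = np.diag([0.9, 0.6, 0.3, 0.1])       # Sigma_d: commutes with Sigma_s

W1 = 0.1 * rng.normal(size=(n1, n0))        # online network
W2 = 0.1 * rng.normal(size=(n1, n1))        # predictor
Th = 0.1 * rng.normal(size=(n1, n0))        # target network Theta

dt, tau_t = 0.01, 10.0                      # tau_o = tau_p = 1, slow target network
for _ in range(100000):
    E = Th @ Sig_d - W2 @ W1 @ Sig_s        # shared error term of Eqns. 179-180
    W1, W2, Th = (W1 + dt * (W2.T @ E),
                  W2 + dt * (E @ W1.T),
                  Th + dt * (-Th + W1) / tau_t)

# fixed-point relations: Theta = W1 and W2 W1 = W1 Sigma_d [Sigma_s]^{-1}
print(np.abs(Th - W1).max(), np.abs(W2 @ W1 - W1 @ Sig_d).max())
assert np.abs(Th - W1).max() < 1e-2
assert np.abs(W2 @ W1 - W1 @ Sig_d).max() < 1e-2
assert np.linalg.norm(W1) > 0.1             # non-collapsed
```

From a generic small random initialization, the dynamics escapes the (unstable) collapsed solution and settles on the fixed point manifold of equation 185.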

F.2 ILLUSTRATIVE MODELS FOR DATA AND DATA AUGMENTATION

While the above section suggests that BYOL converges to the top eigenmodes of $\Sigma_d[\Sigma_s]^{-1}$, here we make this result more concrete by giving illustrative examples of data distributions and data augmentation procedures, and the resulting properties of $\Sigma_d[\Sigma_s]^{-1}$. Multiplicative scrambling. Consider for example the multiplicative subspace scrambling model used in the illustration of SimCLR in Sec. 4.1. In this model, data augmentation scrambles a subspace by multiplying by a random Gaussian matrix, while identically preserving the orthogonal complement of the subspace. In applications, the scrambled subspace could correspond to a space of nuisance features, while the preserved subspace could correspond to semantically important features. More precisely, we consider a random scrambling operator $A$ which only scrambles data vectors $x$ within a fixed $k$ dimensional subspace spanned by the orthonormal columns of the $n_0\times k$ matrix $U$. Within this subspace, data vectors are scrambled by a random Gaussian $k\times k$ matrix $B$. Thus $A$ takes the form $A = P_c + UBU^\top$, where $P_c = I - UU^\top$ is a projection operator onto the $n_0 - k$ dimensional conserved, semantically important subspace orthogonal to the span of the columns of $U$, and the elements of $B$ are i.i.d. zero mean unit variance Gaussian random variables, so that $\mathbb{E}[B_{ij}B_{kl}] = \delta_{ik}\delta_{jl}$. Under this simple model, the augmentation average $K(x)$ in equation 184 becomes $K(x) = P_c x$. Thus, intuitively, under multiplicative subspace scrambling, the only aspect of a data vector that survives averaging over augmentations is its projection onto the preserved subspace. Then the correlation matrix of two different augmented views is $\Sigma_d = P_c\Sigma_x P_c$, while the correlation matrix of two identical views is $\Sigma_s = \Sigma_x$, where $\Sigma_x \equiv \mathbb{E}_{x\sim p(\cdot)}[xx^\top]$ is the correlation matrix of the data distribution. Thus BYOL learns the principal eigenmodes of $\Sigma_d[\Sigma_s]^{-1} = P_c\Sigma_x P_c[\Sigma_x]^{-1}$.
In the special case in which $P_c$ commutes with $\Sigma_x$, we have the simple result that $\Sigma_d[\Sigma_s]^{-1} = P_c$, which is completely independent of the data correlation matrix $\Sigma_x$. Thus in this simple setting BYOL learns the subspace of features that are identically conserved under data augmentation, independent of how much data variance there is in the different dimensions of this conserved subspace. It is interesting to compare to SimCLR in the same setting, which learns the principal eigenmodes of $P_c\Sigma_x P_c$, as described in Sec. 4.1. Thus SimCLR also projects to the conserved subspace, but is further influenced by the correlation matrix of the data within this subspace. In actual applications, which performs better will depend on whether or not features of high variance within the conserved subspace are important for downstream tasks; SimCLR (BYOL) should perform better if conserved features of high variance are (are not) important. Additive scrambling. We also consider, as an illustrative example, data augmentation procedures which simply add Gaussian noise with a prescribed noise covariance matrix $\Sigma_n$. Under this model, we have $\Sigma_s = \Sigma_x + \Sigma_n$ while $\Sigma_d = \Sigma_x$. Thus in this setting, BYOL learns the principal eigenmodes of $\Sigma_d[\Sigma_s]^{-1} = \Sigma_x[\Sigma_x + \Sigma_n]^{-1}$. Intuitively, dimensions with larger noise variance are attenuated in the learned BYOL representations. On the other hand, correlations in the data that are not attenuated by noise are preferentially learned, but the degree to which they are learned is not strongly influenced by the magnitude of the data correlation (i.e. consider dimensions that lie along small eigenvalues of $\Sigma_n$).
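For the additive scrambling model, the claim that all eigenvalues of $\Sigma_d[\Sigma_s]^{-1} = \Sigma_x[\Sigma_x+\Sigma_n]^{-1}$ lie strictly between 0 and 1 is easy to verify numerically on random positive definite covariances (sizes and construction here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.normal(size=(n, n))
Sx = A @ A.T + 0.1 * np.eye(n)       # data covariance Sigma_x (random PD)
B = rng.normal(size=(n, n))
Sn = B @ B.T + 0.1 * np.eye(n)       # augmentation-noise covariance Sigma_n (random PD)

M = Sx @ np.linalg.inv(Sx + Sn)      # Sigma_d [Sigma_s]^{-1} under additive scrambling
eig = np.linalg.eigvals(M)
print(np.sort(eig.real))
assert np.all(np.abs(np.imag(eig)) < 1e-8)   # spectrum is real (M is similar to a PD matrix)
assert np.all(eig.real > 0) and np.all(eig.real < 1)
```

Since $M$ is similar to the symmetric matrix $\Sigma_x^{1/2}[\Sigma_x+\Sigma_n]^{-1}\Sigma_x^{1/2}$, its spectrum is real and strictly inside $(0,1)$ whenever $\Sigma_n$ is positive definite, which is exactly the regime where the no-predictor analysis of Sec. F.3 predicts collapse.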

F.3 THE IMPORTANCE OF THE PREDICTOR IN BYOL.

Here we note that our theory explains why the predictor plays a crucial role in BYOL learning in this simple setting, as is observed empirically in more complex settings. To see this, we can model the removal of the predictor by simply setting $W_2 = I$ in all the above equations. The fixed point solutions then obey $W_1 = W_1\Sigma_d[\Sigma_s]^{-1}$. This will only have nontrivial, non-collapsed solutions if $\Sigma_d[\Sigma_s]^{-1}$ has eigenvectors with eigenvalue 1. Rows of $W_1$ consisting of linear combinations of these eigenvectors will then constitute solutions. This constraint of eigenvalue 1 yields a much more restrictive condition on data distributions and augmentation procedures for BYOL to have non-collapsed solutions. It can however be satisfied under multiplicative scrambling if an eigenvector of the data matrix $\Sigma_x$ lies in the column space of the projection operator $P_c$ (in which case it is an eigenvector with eigenvalue 1 of $\Sigma_d[\Sigma_s]^{-1} = P_c\Sigma_x P_c[\Sigma_x]^{-1}$). This condition cannot however be generically satisfied in the additive scrambling case, in which generically all the eigenvalues of $\Sigma_d[\Sigma_s]^{-1} = \Sigma_x[\Sigma_x + \Sigma_n]^{-1}$ are less than 1. In this case, without a predictor, it can be checked that the collapsed solution $W_1 = \Theta = 0$ is stable. In contrast, with a predictor, the collapsed solution can be checked to be unstable, and therefore it will not be found from generic initial conditions. Thus overall, in this simple setting, our theory provides conceptual insight into how the introduction of a predictor is crucial for creating new non-collapsed solutions for BYOL, whose existence destabilizes the collapsed solutions.
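A minimal numerical check of this stability claim, using the one-dimensional case of equations 179-181 (all matrices 1 by 1, constants illustrative, with $\lambda_d < \lambda_s$ as in the additive scrambling case): with a trainable predictor the dynamics settles at a non-collapsed solution with $w_1 = \theta = \lambda_d\lambda_s^{-1}$, while pinning $w_2 = 1$ (no predictor) yields collapse.

```python
import numpy as np

lam_s, lam_d = 1.0, 0.5               # lam_d < lam_s, as in additive scrambling
dt, T = 0.01, 8000                    # forward Euler to t = 80

def run(with_predictor):
    w1 = w2 = theta = 0.3             # generic nonzero initial condition (w1 = w2 tied)
    if not with_predictor:
        w2 = 1.0                      # removing the predictor pins W2 = I (here w2 = 1)
    for _ in range(T):
        e = theta * lam_d - w2 * w1 * lam_s
        dw1, dw2, dth = w2 * e, (w1 * e if with_predictor else 0.0), -theta + w1
        w1, w2, theta = w1 + dt * dw1, w2 + dt * dw2, theta + dt * dth
    return w1, w2, theta

w1_p, w2_p, th_p = run(True)
w1_n, w2_n, th_n = run(False)
print("with predictor:", (w1_p, w2_p, th_p), "without:", (w1_n, th_n))
assert abs(w1_p - lam_d / lam_s) < 1e-2 and abs(th_p - lam_d / lam_s) < 1e-2
assert abs(w1_n) < 1e-3 and abs(th_n) < 1e-3   # collapse without the predictor
```

Without the predictor, the eigenvalue $\lambda_d\lambda_s^{-1} < 1$ makes the linear $(w_1,\theta)$ system contract to the origin; with the predictor, the origin is a saddle and the weights escape it.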

F.4 REDUCTION OF BYOL LEARNING DYNAMICS TO LOW DIMENSIONS

The full learning dynamics in equation 179 to equation 181 constitutes a set of high dimensional nonlinear ODEs which are difficult to solve from arbitrary initial conditions. However, there is a special class of decoupled initial conditions which permits additional insight. Consider the special case in which $\Sigma_s$ and $\Sigma_d$ commute, and so are simultaneously diagonalizable and share a common set of eigenvectors, which we denote by $u_\alpha\in\mathbb{R}^{n_0}$. Consider also a special set of initial conditions where each row of $W_1$ and the corresponding row of $\Theta$ are both proportional to one of the eigenmodes $u_\alpha$, with scalar proportionality constants $w_1^\alpha$ and $\theta^\alpha$ respectively, and $W_2$ is diagonal, with the corresponding diagonal element given by $w_2^\alpha$. Then it is straightforward to see that, under the dynamics in equation 179 to equation 181, the structure of this initial condition will remain the same, with only the scalars $w_1^\alpha$, $\theta^\alpha$ and $w_2^\alpha$ changing over time. Moreover, the scalars decouple across the different indices $\alpha$, and the dynamics are driven by the eigenvalues $\lambda_s^\alpha$ and $\lambda_d^\alpha$ of $\Sigma_s$ and $\Sigma_d$ respectively. Inserting this special class of initial conditions into the dynamics in equation 179 to equation 181, and dropping the $\alpha$ index, we find the dynamics of the triplet of scalars is given by
$$\tau_o\frac{dw_2}{dt} = \left[\theta\lambda_d - w_2w_1\lambda_s\right]w_1 \quad(186)$$
$$\tau_p\frac{dw_1}{dt} = w_2\left[\theta\lambda_d - w_2w_1\lambda_s\right] \quad(187)$$
$$\tau_t\frac{d\theta}{dt} = -\theta + w_1. \quad(188)$$
Alternatively, this low dimensional dynamics can be obtained from equation 179 to equation 181 not only by considering a special class of decoupled initial conditions, but also by considering the special case where every matrix is simply a 1 by 1 matrix, making the scalar replacements $W_1\to w_1$, $W_2\to w_2$, $\Theta\to\theta$, $\Sigma_s\to\lambda_s$, and $\Sigma_d\to\lambda_d$. The fixed point conditions of this dynamics are given by $\theta = w_1$ and $w_2w_1 = \theta\lambda_d\lambda_s^{-1}$. Thus the collapsed point $w_1 = w_2 = \theta = 0$ is a solution. Additionally, $w_2 = \lambda_d\lambda_s^{-1}$ with $w_1 = \theta$ taking any value is also a family of non-collapsed solutions. We can understand the three dimensional dynamics intuitively as follows when $\tau_t \gg \tau_o$ and $\tau_o = \tau_p$. In this case, the target network evolves very slowly compared to the online network, as is done in practice, and for simplicity we use the same learning rate for the predictor as for the online network. In this situation, we can treat $\theta$ as approximately constant on the fast time scale $\tau_o$ on which the online and predictor weights $w_1$ and $w_2$ evolve. Then the joint dynamics in equation 186 and equation 187 obeys gradient descent on the error function
$$E = \frac{\lambda_s}{2}\left(\theta\lambda_d\lambda_s^{-1} - w_2w_1\right)^2. \quad(189)$$
Iso-contours of constant error are hyperbolas in the $w_1$-$w_2$ plane, and for fixed $\theta$, the origin $w_1 = w_2 = 0$ is a saddle point, yielding an unstable fixed point (see Fig. 5 (left)). From generic initial conditions, $w_1$ and $w_2$ will then cooperatively amplify each other to rapidly escape the collapsed solution at the origin, and approach the zero error hyperbolic contour $w_2w_1 = \theta\lambda_d\lambda_s^{-1}$, where $\theta$ is close to its initial value.
Then the slower target network θ will adjust, slowly moving this contour until θ = w 1 . The more rapid dynamics of w 1 and w 2 will hug the moving contour w 2 w 1 = θλ d λ -1 s as θ slowly adjusts. In this fashion, the joint fast dynamics of w 1 and w 2 , combined with the slow dynamics of θ, lead to a nonzero fixed point for all 3 values, despite the existence of a collapsed fixed point at the origin. Moreover, the larger the ratio λ d λ -1 s , which is determined by the data, the larger the final values of both w 1 and w 2 will tend to be. We can obtain further insight by noting that the submanifold w 1 = w 2 , in which the online and predictor weights are tied, constitutes an invariant submanifold of the dynamics in Eqns. 186 to 188; if w 1 = w 2 at any instant of time, then this condition holds for all future time. Therefore we can both analyze and visualize the dynamics on this two dimensional invariant submanifold, with coordinates w = w 1 = w 2 and θ (Fig. 5 (middle)). This analysis clearly shows an unstable collapsed solution at the origin, with w = θ = 0, and a stable non-collapsed solution at w = θ = λ d λ -1 s . We note again, that the generic existence of these non-collapsed solutions in Fig. 5 depends critically on the presence of a predictor with adjustable weights w 2 . Removing the predictor corresponds to forcing w 2 = 1, and non-collapsed solutions cannot exist unless λ d = λ s , as demonstrated in Fig. 5 (right). Thus, remarkably, in BYOL in this simple setting, the introduction of a predictor network plays a crucial role, even though it neither adds to the expressive capacity of the online network, nor improves its ability to match the target network. Instead, it plays a crucial role by dramatically modifying the learning dynamics (compare e.g. 
Fig. 5 middle and right panels), thereby enabling convergence to non-collapsed solutions through a dynamical mechanism whereby the online and predictor networks cooperatively amplify each other's weights to escape collapsed solutions (Fig. 5 (left)). Overall, this analysis of BYOL learning dynamics provides considerable insight into the dynamical mechanisms enabling BYOL to avoid collapsed solutions, without negative pairs to force apart representations, in what is likely to be the simplest nontrivial setting.
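The two-timescale picture above ($\tau_t \gg \tau_o = \tau_p$, tied weights $w_1 = w_2$) can be traced numerically: the fast phase reaches the contour $w_2w_1 = \theta\lambda_d\lambda_s^{-1}$ while $\theta$ is still far from $w_1$, and the slow phase then drags everything to $w = \theta = \lambda_d\lambda_s^{-1}$. A sketch of Eqns. 186-188 with illustrative constants:

```python
import numpy as np

lam_s, lam_d = 1.0, 0.5
dt, tau_t = 0.01, 20.0                  # tau_t >> tau_o = tau_p = 1
w1 = w2 = 0.2                           # tied weights: the invariant submanifold w1 = w2
theta = 1.0
snap = {}
for step in range(1, 30001):            # forward Euler to t = 300
    e = theta * lam_d - w2 * w1 * lam_s
    w1, w2, theta = (w1 + dt * w2 * e,
                     w2 + dt * w1 * e,
                     theta + dt * (-theta + w1) / tau_t)
    if step in (1000, 30000):           # t = 10 (fast phase done) and t = 300 (final)
        snap[step] = (w1, w2, theta)

w1_m, w2_m, th_m = snap[1000]
w1_f, w2_f, th_f = snap[30000]
print(snap)
# fast phase: the contour w2*w1 = theta*lam_d/lam_s is hugged while theta is still far off
assert abs(w2_m * w1_m - th_m * lam_d / lam_s) < 0.05
assert abs(th_m - w1_m) > 0.1
# slow phase: theta drifts until w1 = w2 = theta = lam_d/lam_s
assert abs(th_f - lam_d / lam_s) < 1e-2 and abs(w1_f - lam_d / lam_s) < 1e-2
```

Because $w_1 = w_2$ is preserved exactly by the dynamics (and by this Euler discretization), the trajectory lives on the two dimensional submanifold visualized in Fig. 5 (middle).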

G ADDITIONAL EXPERIMENTS G.1 EXPERIMENTS ON TWO-LAYER NETWORK

We conduct experiments to verify our theoretical reasoning in Sec. 4.2. We follow the setting in Theorem 5, i.e., a two-layer ReLU network ($L = 2$) that contains the same number of hidden nodes as output nodes ($n_1 = n_2$). The top-layer weight $W_2$ is a diagonal matrix with no bias (note that "no bias" is important here; otherwise we would not have $w_{2,j}w_{2,k} = 0$ for $j \neq k$). We use $L^\tau_{\mathrm{nce}}$ ($\tau = 0.05$ and $H = 1$) and $L_{\mathrm{simp}}$, and use an SGD optimizer with a learning rate of 0.01. All experiments run for 5000 minibatches with a batch size of 128. We test cases with and without $\ell_2$ normalization at the output layer. For each setting, i.e., (loss, normalization), we run 30 random seeds. The data are generated by a mixture of 10 Gaussian distributions (with a uniform prior over the mixtures), with means $\mu_k \sim \mathcal{N}(0, I)$ and covariance matrices $\Sigma_k = 0.1I$. We set the first cluster to be zero-mean. Without $\ell_2$ normalization. Fig. 7 shows the weight growth of the top ($W_2$) and bottom ($W_1$) layers. As predicted by Theorem 5, the weights quickly grow to infinity. Note that the y-axis is in log scale, so exponential growth appears linear. From the figure, it is clear that with $L_{\mathrm{simp}}$, the growth is super-exponential due to the interplay between the top and bottom layers. On the other hand, with $L_{\mathrm{nce}}$, the growth slows down because its associated covariance operator has a weight that decays exponentially when the representations of two distinct samples become far apart. With $\ell_2$ normalization. With the normalization, the weights do not grow substantially, and we focus instead on the meaning of each intermediate node after training. From the theoretical reasoning in Sec. 4.2, in the ReLU case we should see each node gradually move towards (or specialize to) one cluster after training. Fig. 8 shows that a node $j$ that is only active for 1 or 2 cluster centers (out of 10) has a much higher $|w_{2,j}|$ than nodes that are active on many cluster centers.
This shows that those "specialized" nodes have undergone learning and their fan-out weight magnitudes $|w_{2,j}|$ become (or remain) high, according to the dynamics in Theorem 5.

Figure 8: When $|w_{2,j}|$ is high, the corresponding node $j$ is highly selective to one specific cluster of the data generative model. On the other hand, nodes $j$ with low selectivity have very small $|w_{2,j}|$ and do not contribute substantially to the output of the network. Training with $L^\tau_{\mathrm{nce}}$ (left plot) seems to yield stronger selectivity than with $L_{\mathrm{simp}}$ (right plot).
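For reference, a minimal sketch of the data-generation procedure described above, together with one way to quantify the selectivity of a hidden node (on how many of the $K$ cluster centers it is active). The input dimension, hidden width, and selectivity measure here are our own illustrative choices, not the exact evaluation code behind Fig. 8:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, N = 20, 10, 1024                       # input dim, clusters, samples (illustrative)
mu = rng.normal(size=(K, d))                 # cluster means mu_k ~ N(0, I)
mu[0] = 0.0                                  # first cluster is zero-mean
labels = rng.integers(0, K, size=N)          # uniform prior over the mixtures
X = mu[labels] + np.sqrt(0.1) * rng.normal(size=(N, d))   # Sigma_k = 0.1 I

# selectivity of hidden node j: on how many of the K cluster centers is it active?
n1 = 16                                      # hidden width (illustrative)
W1 = rng.normal(size=(n1, d)) / np.sqrt(d)   # random bottom-layer weights at init
active = (mu @ W1.T) > 0                     # ReLU gating evaluated at the centers, (K, n1)
selectivity = active.sum(axis=0)             # 0..K per node; low counts = specialized
print(selectivity)
```

A node active on only 1-2 centers counts as "specialized" in the sense of Fig. 8; tracking this count alongside $|w_{2,j}|$ during training reproduces the comparison described in the text.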

G.2 HLTM

We also check how the norm of the covariance operator (OP) changes during training in different situations. For this set of experiments, we construct a complete binary tree of depth $L$. The class/sample-specific latent $z_0$ is at the root, while the other nuisance latent variables are labeled with a binary encoding (e.g., $\mu = $ "s010" for a $z_\mu$ that is the left-right-left child of the root $z_0$). Please check Fig. 6 for details. Again we use SimCLR with the $L^\tau_{\mathrm{nce}}$ loss ($\tau = 0.1$ and $H = 1$) to train the model. $\ell_2$ normalization is used in the output layer. The results are shown in Fig. 9 and Fig. 10. We can see that the norm of the covariance operator indeed goes up, showing that it gets amplified during training. We perform a grid search over depth = [3, 4, 5], delta lower = [0.7, 0.8, 0.9] (with the polarity $\rho_{\mu\nu} \sim$ Uniform[delta lower, 1] at each layer) and the over-parameterization parameter hid = $|N_\mu|$ = [2, 5, 10]. For each experiment configuration, we run 30 random seeds.

G.3 BYOL EXPERIMENTS SETUP

For all STL-10 tasks, we use ResNet18 as $\mathcal{W}_{\mathrm{base}}$. The extra predictor has two layers: it takes a 128-dimensional input, has a hidden layer of size 512, and outputs 128 dimensions, with a ReLU between the two layers. When we add BN to the predictor, we add it before the ReLU activation. When we say there is no BN in Tbl. 3, we remove BN in both the predictor and projector layers (but not in the encoder). As in the BYOL paper (Grill et al., 2020), a symmetric loss function is used with $\ell_2$ normalization at the topmost layer. We use a plain SGD optimizer. The learning rate is 0.03 (unless otherwise stated), momentum is 0.9, and weight decay is 0.0004. The training batch size is 128. The ImageNet experiment runs on 32 GPUs with a batch size of 4096.

G.4 ADDITIONAL BYOL EXPERIMENTS

Based on our theoretical analysis, we try training the predictor in different ways and check whether it still works. From the analysis in Sec. 5, we know that the reason why BYOL works is the dominance of $\mathrm{Cov}_x[\bar K_l(x), \bar K_l(x;\mathcal{W}')]$ and its resemblance to the covariance operator $\mathbb{V}_x[\bar K_l(x)]$, which is a PSD matrix. The dominance should be stronger if the predictor has smaller weights than those given by standard Xavier/Kaiming initialization. Also, $\mathrm{Cov}_x[\bar K_l(x), \bar K_l(x;\mathcal{W}')]$ should behave more like a PSD matrix if the predictor's weights are all small positive numbers and no BN is used. The following table justifies our theoretical findings. In particular, Tbl. 10 shows better performance on STL-10 with a smaller learning rate and a smaller sample range of the predictor weights.



$\mathrm{d}w^2_{2,j}/\mathrm{d}t = \mathrm{d}\|w_{1,j}\|^2/\mathrm{d}t$, and thus $w^2_{2,j} = \|w_{1,j}\|^2 + c$, where $c$ is some time-independent constant. Since $A_j$ is always PSD, $w_{1,j}^\top A_j w_{1,j} \ge 0$ and $|w_{2,j}|$ generically increases, which in turn accelerates the dynamics of $w_{1,j}$, which is most amplified, at any given time, along the largest eigenvector of $A_j$. This dynamics exhibits top-down modulation, whereby the top-layer weights accelerate the training of the lower layer.

In PyTorch, the former is x - x.mean() and the latter is x - x.mean().detach().

A formal treatment requires the Jacobian $J$ to incorporate BatchNorm's contribution and is left for future work.
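The distinction in the footnote above can be made concrete: backpropagating through $y = x - \mathrm{mean}(x)$ multiplies the upstream gradient by the centering matrix $J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ (i.e., subtracts its mean), whereas detaching the mean leaves the upstream gradient untouched. A NumPy sketch, with finite differences standing in for autograd:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
x = rng.normal(size=n)
g = rng.normal(size=n)            # upstream gradient dL/dy for y = x - mean(x)

# y = x - mean(x) has Jacobian J = I - (1/n) 1 1^T, so dL/dx = J^T g = g - mean(g)
J = np.eye(n) - np.ones((n, n)) / n
grad_through_mean = J.T @ g
assert np.allclose(grad_through_mean, g - g.mean())

# finite-difference check on L(x) = g . (x - mean(x))
def L(v):
    return g @ (v - v.mean())

eps = 1e-6
fd = np.array([(L(x + eps * np.eye(n)[i]) - L(x)) / eps for i in range(n)])
assert np.allclose(fd, grad_through_mean, atol=1e-5)

# detaching the mean treats it as a constant, so the gradient is just g itself
print("through mean:", grad_through_mean)
print("mean detached:", g)
```

The two gradients differ exactly by the mean of $g$, which is the extra $\delta W^{\mathrm{BN}}_l$-style term that mean subtraction contributes.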



Figure 1: (a) Overview of the two SSL algorithms we study in this paper: SimCLR (W1 = W2 = W, no predictor, NCE Loss) and BYOL (W1 has an extra predictor, W2 is a moving average), (b) Detailed Notation.

Figure 2: Overview of Sec. 4. (a) To analyze the functionality of the covariance operator $\mathbb{V}_{z_0}[\bar K_l(z_0)]$ (Eqn. 3), we assume that Nature generates the data from a certain generative model with latent variables $z_0$ and $z'$, while data augmentation takes $x(z_0, z')$, changes $z'$ but keeps $z_0$ intact. (b) Sec. 4.1: one-layer, one-neuron example. (c) Sec. 4.2: two-layer case where $\mathbb{V}[\bar K_1]$ and $\mathbb{V}[\bar K_2]$ interplay. (d) Sec. 4.3: Hierarchical Latent Tree Models and deep ReLU networks trained with SimCLR. A latent variable $z_\mu$ and its corresponding nodes $N_\mu$ on the multi-layer ReLU side cover a subset of the input $x$, resembling local receptive fields in a ConvNet.

Figure 3: (a) Two 1D objects under translation: (a1) Two different objects 11 (z0 = 1) and 101 (z0 = 2) located at different locations specified by z . (a2) The frequency table for a neuron with local receptive field of size 2. (b) In two-layer case (Fig. 2(c)), V[ K1] and V[ K2] interplay in two-cluster data distribution.

Figure 5: A visualization of BYOL dynamics in low dimensions. Left: Black arrows denote the vector field of the flow in the $w_1$-$w_2$ plane of online and predictor weights in Eqns. 186 and 187 when the target network weight $\theta$ is fixed to 1. For all 3 panels, $\lambda_s = 1$, $\lambda_d = 1/2$, and $\tau_o = \tau_p = \tau_t = 1$, and all vectors are normalized to unit length to indicate the direction of flow alone. The red curve shows the hyperbolic manifold of stable fixed points $w_2w_1 = \theta\lambda_d\lambda_s^{-1}$, while the red point at the origin is an unstable fixed point. For a fixed target network, the online and predictor weights will cooperatively amplify each other to escape the collapsed solution at the origin. Middle: A visualization of the full low dimensional BYOL dynamics in Eqns. 186-188 when the online and predictor weights are tied so that $w_1 = w_2 = w$. The green curve shows the nullcline $\theta = w$ corresponding to $d\theta/dt = 0$, and the blue curve shows part of the nullcline $dw/dt = 0$ corresponding to $w^2 = \theta\lambda_d\lambda_s^{-1}$. The intersection of these two nullclines yields two fixed points (red dots): an unstable collapsed solution at the origin $w = \theta = 0$, and a stable nontrivial solution with $\theta = w$ and $w = \lambda_d\lambda_s^{-1}$. Right: A visualization of the dynamics in Eqns. 186-188 when the predictor is removed, so that $w_2$ is fixed to 1. The resulting two dimensional flow field on $w = w_1$ and $\theta$ is shown (black arrows). The green curve shows the nullcline $w = \theta$ corresponding to $d\theta/dt = 0$, while the blue curve shows the nullcline $w = \theta\lambda_d\lambda_s^{-1}$. The slope of this nullcline is $\lambda_s\lambda_d^{-1} > 1$. The resulting nullcline structure yields a single fixed point at the origin, which is stable. Thus there only exists a collapsed solution. In the special case where $\lambda_s\lambda_d^{-1} = 1$, the two nullclines coincide, yielding a one dimensional manifold of solutions.

Figure 6: The Hierarchical Latent Tree Model (HLTM) used in our experiments (Sec. G.2 and Sec. 6).

Figure 7: Top row: Without ℓ_2 normalization, training with SimCLR and L_simp leads to fast growth of the weight magnitude over time (each curve is one training run out of 30 trials with different random seeds). Furthermore, this growth is super-exponential due to the interplay between the top and bottom layers, as suggested by the dynamics in Eqn. 8. Note that the y-axis is in log scale. Bottom row: Without ℓ_2 normalization, L_nce has a more stable weight magnitude over training since its covariance operator is weighted (see Theorem 4).

Figure 8: Relationship between the specialization of node j and its top-layer weight |w_{2,j}| (L_nce).

Figure 9: Ablation of how the Frobenius norm of the covariance operator OP changes over training, under different factors: depth L, sample range of ρ_µν (ρ_µν ∼ Uniform[δ_lower, 1]), and over-parameterization |N_µ|. Top row: covariance operator of the immediate left child latent variable of the root node z_0; Bottom row: covariance operator of the immediate right child of the root node z_0.

Figure 10: Ablation of how the Frobenius norm of the covariance operator OP changes over training. Same setting as Fig. 9 but focusing on lower levels. Note that since we have used ℓ_2 normalization at the topmost layer, the growth of the covariance operator is likely not due to growth of the weight magnitudes, but rather to more discriminative representations of the input features f_µ with respect to different z_0. Top row: covariance operator of the left-right-left latent variable from the root node z_0; Bottom row: covariance operator of the right-left-right-left latent variable from the root node z_0.

large and training is faster. For a deep HLTM and deep networks, at lower layers ρ_0µ → 0 and P_0µ becomes uniform due to mixing of the Markov chain, making OP_µ small. Thus training in SSL is faster at the top layers, where the covariance operators have large magnitude. On the other hand, a large a_µ implies that s_k := (v_k(1) − v_k(0))/2 is large, i.e., the expected activation v_k(z_ν) is selective for different values of z_ν for ν ∈ ch(µ). Interestingly, this can be achieved by over-parameterization (|N_µ| > 1):

Theorem 7 (Lucky nodes in deep ReLU networks with respect to the binary HLTM at initialization). Suppose each element of the weights W_l between layer l + 1 and layer l is initialized with Uniform[−σ_w √(3/|N_µ^ch|), σ_w √(3/|N_µ^ch|)]. There exists σ_l^2
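The initialization range in Theorem 7 is chosen so that each weight entry has variance σ_w^2/|N_µ^ch|: a Uniform[−a, a] variable has variance a^2/3, so setting a = σ_w √(3/|N_µ^ch|) yields exactly σ_w^2/|N_µ^ch|. A quick numerical check (the constants below are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w, n_ch = 1.0, 64                      # illustrative sigma_w and |N_mu^ch|
a = sigma_w * np.sqrt(3.0 / n_ch)            # Uniform[-a, a] => variance a^2/3

W = rng.uniform(-a, a, size=100000)
# Sample variance should be close to sigma_w^2 / n_ch = 1/64 ~ 0.0156.
print(W.var())
```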

Normalized correlation between the topmost latent variables in the binary HLTM and the topmost nodes in deep ReLU networks (L = 5) goes up when training with SimCLR with the NCE loss. We see higher correlations at both initialization and the end of training with more over-parameterization (Left: |N_µ| = 2; Right: |N_µ| = 5).

Table: Top-1 STL performance with different combinations of predictor (P), EMA, and BatchNorm using BYOL. EMA means γ_ema = 0.996. Batch size is 128 and all experiments run on 5 seeds for 100 epochs.
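EMA here refers to BYOL's exponential-moving-average target update, in which the target parameters track a slow average of the online parameters: θ_target ← γ_ema θ_target + (1 − γ_ema) θ_online. A minimal per-parameter sketch:

```python
def ema_update(target, online, gamma=0.996):
    """BYOL-style EMA target update, applied per parameter (scalar here)."""
    return gamma * target + (1.0 - gamma) * online

# With gamma = 0.996 the target moves only ~0.4% toward the online weights
# per step, so it changes much more slowly than the online network.
t = ema_update(0.0, 1.0)  # ~ 0.004
```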

Table: Top-1 performance of BYOL using reinitialization of the predictor every T epochs.
43.9 ± 4.2, 64.8 ± 0.6, 72.2 ± 0.9, 78.1 ± 0.3, 44.2 ± 7.0, 54.2 ± 0.6, 48.3 ± 2.7, 76.3 ± 0.4, 47.0 ± 8.1

Table: Training a one-layer predictor with positive initial weights and no EMA (γ_ema = 0). All experiments run for 3 seeds.
With BN in predictor: 62.78 ± 1.40, 62.94 ± 1.03, 62.31 ± 1.80
Without BN in predictor: 71.95 ± 0.27, 72.06 ± 0.44, 71.91 ± 0.59

Table: Training a one-layer predictor with positive initial weights, with EMA (γ_ema = 0.996) and predictor resetting every T = 10 epochs. All experiments run for 3 seeds. Note that the Xavier range is Uniform[−0.15, 0.15] and our initialization range is much smaller than that.
With BN in predictor: 65.61 ± 1.34, 70.56 ± 0.57, 70.87 ± 1.51
Without BN in predictor: 74.39 ± 0.67, 74.52 ± 0.63, 74.80 ± 0.57

Notation:
Z_l — The set of all latent variables at layer l of the generative model.
N_l — The set of all neurons at layer l of the neural network.
N_µ — The set of neurons that correspond to z_µ.
N_µ^ch — The set of neurons that correspond to the children of the latent variable z_µ.
Additional symbols denote the activations for all nodes j ∈ N_µ and for the children of N_µ, and the child selectivity vector in the binary case.

D.5 THEOREM 6

Proof. First note that for each node k ∈ N_µ^ch, the term 1_0 1_ν v_k is a constant with respect to changes of z_0, so we can remove it when computing the covariance operator. On the other hand, for a categorical distribution {(P(z_0 = 0), u(0)), (P(z_0 = 1), u(1)), …, (P(z_0 = m_0 − 1), u(m_0 − 1))} with P_0 := diag[P(z_0)], the mean is E_{z_0}[u] = 1^⊤ P_0 u and its covariance can be written in closed form. This expression applies for any cardinality of the latent variables.

In the binary symmetric case, we define ρ_0 := P(z_0 = 1) − P(z_0 = 0) and q := [−1, 1]^⊤. According to the remarks in Lemma 7, all C_µν = (1/2) ρ_µν q q^⊤, and we can compute C_0ν v_k using Eqn. 87. The covariance between nodes k and k′ then follows; the last equality holds because, due to the tree structure, the path from z_0 to all child nodes in N_µ^ch must pass through z_µ. Therefore we can compute the covariance operator. When L → +∞, ρ_0µ → 0, and thus the covariance becomes zero as well.
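The categorical mean/variance computation above reduces to a matrix identity: with P_0 = diag[P(z_0)], the mean is 1^⊤ P_0 u and the variance can be written as u^⊤(P_0 − P_0 1 1^⊤ P_0) u; the quadratic-form version of the covariance is my rephrasing, consistent with the scalar formula Var[u] = Σ_i p_i u(i)^2 − (Σ_i p_i u(i))^2. A numerical check of this identity on an illustrative distribution:

```python
import numpy as np

# Illustrative categorical distribution P(z0 = i) and values u(i).
p = np.array([0.2, 0.5, 0.3])
u = np.array([-1.0, 0.5, 2.0])
P0 = np.diag(p)
one = np.ones(3)

# Matrix forms: mean = 1^T P0 u, variance = u^T (P0 - P0 1 1^T P0) u.
mean = one @ P0 @ u
var = u @ (P0 - P0 @ np.outer(one, one) @ P0) @ u

# Direct elementwise computation for comparison.
mean_direct = (p * u).sum()
var_direct = (p * u ** 2).sum() - mean_direct ** 2
```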

D.6 THEOREM 7

Proof. According to our setting, for each node k ∈ N_µ, there exists a unique latent variable z_ν with ν = ν(k) that corresponds to it. In the following we omit its dependency on k for brevity. Since we are dealing with the binary case, we define the following quantities for convenience, and we define the sensitivity of node k accordingly. Intuitively, a large λ_k means that node k is sensitive to changes of the latent variable z_ν; if λ_k = 0, then node k is invariant to z_ν.

We first consider the pre-activation f̃_j := Σ_k w_jk f_k and its expectation with respect to the latent variable z, noting its form for each node k ∈ N_µ^ch. Then we have ṽ_j^+ = w_j^⊤ u^+_{N_µ^ch} and ṽ_j^− = w_j^⊤ u^−_{N_µ^ch}. Note that w_j is a random vector with i.i.d. entries. Therefore, for the two-dimensional vector ṽ_j = [ṽ_j^+, ṽ_j^−]^⊤, we can compute its first and second order moments: E_w[ṽ_j] = 0, and the second moments follow from the initialization variance.

Define the positive and negative sets A_+ and A_− (note that a_k := ρ_µν s_k). Without loss of generality, assume Σ_{k∈A_+} a_k^2 ≥ Σ_{k∈A_−} a_k^2. In the following, we show there exists j such that λ_j is greater than some positive threshold; otherwise the proof is symmetric and we can show λ_j is lower than some negative threshold.

When |N_µ^ch| is large, by the Central Limit Theorem, ṽ_j can be regarded as a zero-mean 2D Gaussian, and we obtain a lower bound on the corresponding probability for some c > 0. Moreover, if a_l = 0, then the following probability is also not small:
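The over-parameterization argument can be illustrated empirically: drawing more neurons per latent variable increases the probability that at least one is "lucky", i.e., strongly selective at random initialization. Below is a hedged Monte-Carlo sketch; the selectivity proxy w_j^⊤(u^+ − u^−), the stand-in vector for u^+ − u^−, and all constants are illustrative choices, not the paper's exact quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ch = 32                               # |N_mu^ch|, illustrative
a = np.sqrt(3.0 / n_ch)                 # Uniform[-a, a] init with sigma_w = 1
diff = rng.standard_normal(n_ch)        # illustrative stand-in for u^+ - u^-
diff /= np.linalg.norm(diff)            # unit norm for a clean variance calc

def p_lucky(n_mu, thresh=0.3, trials=2000):
    """Monte-Carlo estimate of P(at least one of n_mu random nodes has
    |w_j^T (u^+ - u^-)| > thresh) under the uniform initialization."""
    W = rng.uniform(-a, a, size=(trials, n_mu, n_ch))
    sel = np.abs(W @ diff)              # (trials, n_mu) selectivity magnitudes
    return (sel.max(axis=1) > thresh).mean()

# More over-parameterization -> higher chance of a lucky (selective) node.
print(p_lucky(1), p_lucky(5))
```

Since each w_j^⊤ diff is approximately N(0, 1/n_ch) here, the per-node lucky probability is a fixed constant, so the chance that some node in N_µ is lucky grows as 1 − (1 − p)^{|N_µ|}, matching the intuition behind Theorem 7.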

