UNDERSTANDING SELF-SUPERVISED LEARNING WITH DUAL DEEP NETWORKS

Abstract

We propose a novel theoretical framework to understand self-supervised learning methods that employ dual pairs of deep ReLU networks (e.g., SimCLR, BYOL). First, we prove that in each SGD update of SimCLR, the weights at each layer are updated by a covariance operator that specifically amplifies initial random selectivities that vary across data samples but survive averaging over data augmentations. We show that this leads to the emergence of hierarchical features if the input data are generated from a hierarchical latent tree model. Within the same framework, we also show analytically that in BYOL, the combination of BatchNorm and a predictor network creates an implicit contrastive term acting as an approximate covariance operator. Additionally, for linear architectures we derive exact solutions for BYOL that provide conceptual insight into how BYOL can learn useful, non-collapsed representations without any contrastive terms that separate negative pairs. Extensive ablation studies support our theoretical findings.

1. INTRODUCTION

While self-supervised learning (SSL) has achieved great empirical success across multiple domains, including computer vision (He et al., 2020; Goyal et al., 2019; Chen et al., 2020a; Grill et al., 2020; Misra and Maaten, 2020; Caron et al., 2020), natural language processing (Devlin et al., 2018), and speech recognition (Wu et al., 2020; Baevski and Mohamed, 2020; Baevski et al., 2019), its theoretical understanding remains elusive, especially when multi-layer nonlinear deep networks are involved (Bahri et al., 2020). Unlike supervised learning (SL), which deals with labeled data, SSL learns meaningful structures from randomly initialized networks without human-provided labels.

In this paper, we propose a systematic theoretical analysis of SSL with deep ReLU networks. Our analysis imposes no parametric assumptions on the input data distribution and is applicable to state-of-the-art SSL methods that typically involve two parallel (or dual) deep ReLU networks during training (e.g., SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), etc.). We do so by developing an analogy between SSL and a theoretical framework for analyzing supervised learning, namely the student-teacher setting (Tian, 2020; Allen-Zhu and Li, 2020; Lampinen and Ganguli, 2018; Saad and Solla, 1996), which also employs a pair of dual networks.

Our results indicate that in SimCLR, the weight update at every layer is governed by a positive semi-definite (PSD) covariance operator that captures only the feature variability across data points that survives averaging over data augmentations, which are designed in practice to scramble semantically unimportant features (e.g., random image crops, blurring, or color distortions (Falcon and Cho, 2020; Kolesnikov et al., 2019; Misra and Maaten, 2020; Purushwalkam and Gupta, 2020)). This covariance operator provides a principled framework for studying how SimCLR amplifies initial random selectivities into distinctive features that vary across samples yet are stable under data augmentation. Based on the covariance operator, we further show that (1) in a two-layer setting, a top-level covariance operator helps accelerate the learning of low-level features, and (2) when the data are generated by a hierarchical latent tree model, training deep ReLU networks leads to the emergence of the corresponding latent variables in intermediate layers.

We also analyze how BYOL might work without negative pairs. First, we show analytically that an interplay between the zero-mean operation in BatchNorm and the extra predictor in the online network creates an implicit contrastive term, consistent with empirical observations in a recent blog post (Fetterman and Albrecht, 2020). Note that this analysis does not rule out the possibility that BYOL could work with other normalization techniques that do not introduce contrastive terms, as shown recently (Richemond et al., 2020a). To address this, we also derive exact solutions to BYOL in linear networks without any normalization, providing insight into how BYOL can learn without contrastive terms induced either by negative pairs or by BatchNorm. Finally, we discover that re-initializing the predictor every few epochs does not hurt BYOL performance, thereby questioning the hypothesis of an optimal predictor in (Grill et al., 2020).
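To illustrate the covariance-operator view concretely, the following minimal NumPy sketch (our own illustration; the toy feature model, dimensions, and variable names are assumptions, not the paper's construction) shows how a PSD covariance of augmentation-averaged features differentially amplifies directions that vary across samples but survive averaging over augmentations:

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_augs, dim = 32, 8, 16

    # Toy features: a per-sample "semantic" component that is stable under
    # augmentation, plus augmentation noise that averages out.
    semantic = rng.normal(size=(n_samples, 1, dim))
    aug_noise = rng.normal(size=(n_samples, n_augs, dim))
    features = semantic + aug_noise

    # Average over augmentations: only the semantic component survives.
    f_bar = features.mean(axis=1)                      # (n_samples, dim)

    # PSD covariance across samples of the augmentation-averaged features.
    f_centered = f_bar - f_bar.mean(axis=0, keepdims=True)
    cov_op = f_centered.T @ f_centered / n_samples     # (dim, dim)

    # Directions with large cross-sample variance are stretched the most,
    # so random initial selectivities aligned with them get amplified.
    evals, evecs = np.linalg.eigh(cov_op)              # ascending eigenvalues
    top, bottom = evecs[:, -1], evecs[:, 0]
    print(np.linalg.norm(cov_op @ top), np.linalg.norm(cov_op @ bottom))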
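The implicit contrastive term created by BatchNorm's zero-mean operation can be seen in a simplified setting (a sketch under simplifying assumptions: a squared-error loss and only the mean-subtraction part of BatchNorm). Let $z_i$ denote the online (predictor) output and $z'_i$ the target output for sample $i$ in a batch of size $B$, with batch means $\bar z$ and $\bar z'$. Then
\[
\sum_{i=1}^{B} \left\| (z_i - \bar z) - (z'_i - \bar z') \right\|^2
= \sum_{i=1}^{B} \left\| z_i - z'_i \right\|^2 - B \left\| \bar z - \bar z' \right\|^2 ,
\]
and expanding the second term produces, among other contributions, $+\frac{2}{B}\sum_{i \neq j} z_i^\top z'_j$. Since this enters the loss with a positive sign, gradient descent decreases the similarity between each online output and the targets of the other samples in the batch, which is exactly the role played by negative pairs in contrastive losses.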
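Finally, a small self-contained simulation of linear BYOL (our own toy sketch; the dimensions, learning rate, EMA rate, and augmentation noise level are assumptions, not the settings analyzed in the paper) makes the no-normalization setting concrete: a linear online network, a linear predictor, and an exponential-moving-average target trained with stop-gradient, with no contrastive term and no BatchNorm. One can run it and inspect whether the representation collapses to zero.

    import numpy as np

    rng = np.random.default_rng(0)
    d, batch, lr, ema = 8, 64, 0.05, 0.996
    W = 0.1 * rng.normal(size=(d, d))    # online linear network
    Wp = 0.1 * rng.normal(size=(d, d))   # linear predictor
    Wt = W.copy()                        # target network (EMA of online)

    for step in range(2000):
        x = rng.normal(size=(d, batch))
        x1 = x + 0.1 * rng.normal(size=(d, batch))   # two augmented views
        x2 = x + 0.1 * rng.normal(size=(d, batch))
        err = Wp @ (W @ x1) - Wt @ x2                # target branch is stop-gradient
        # Gradients of 0.5 * ||Wp W x1 - Wt x2||^2 w.r.t. Wp and W.
        gWp = (err @ (W @ x1).T) / batch
        gW = (Wp.T @ err @ x1.T) / batch
        Wp -= lr * gWp
        W -= lr * gW
        Wt = ema * Wt + (1 - ema) * W                # EMA target update

    # Inspect collapse: a collapsed solution would drive these toward zero.
    print(np.linalg.norm(W), np.std(W @ rng.normal(size=(d, 256))))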

