WHAT SHAPES THE LOSS LANDSCAPE OF SELF-SUPERVISED LEARNING?

Abstract

Preventing complete and dimensional collapse of representations has recently become a design principle for self-supervised learning (SSL). However, questions remain in our theoretical understanding: When do these collapses occur? What are their mechanisms and causes? We answer these questions by deriving and thoroughly analyzing an analytically tractable theory of SSL loss landscapes. Within this theory, we identify the causes of dimensional collapse and study the effects of normalization and bias. Finally, we leverage the interpretability afforded by the analytical theory to understand when dimensional collapse can be beneficial and what affects the robustness of SSL against data imbalance.

1. INTRODUCTION

Self-supervised learning (SSL) methods have achieved remarkable success in learning good representations without labeled data (Chen et al., 2020b). Loss functions used in such SSL techniques promote representational similarity between pairs of related samples while using explicit penalties (Chen et al., 2020a; He et al., 2020; Zbontar et al., 2021; Caron et al., 2020) or asymmetric dynamics (Caron et al., 2021; Grill et al., 2020; Chen and He, 2021) to ensure that the distance between unrelated samples remains large. In practice, however, SSL training often suffers from dimensional collapse (Jing et al., 2021; Tian et al., 2021; Pokle et al., 2022), where the learned representation spans only a low-dimensional subspace of the overall available space. In the extreme case, this failure mode manifests as complete collapse, where the learned representation has rank zero and no informative features can be extracted. Prior work has primarily positioned such collapses in SSL as enemies of learning, arguing that they can negatively impact downstream task performance (Zbontar et al., 2021; Jing et al., 2021; Bardes et al., 2021). However, recent work by Cosentino et al. (2022) empirically demonstrates otherwise: the quality of representations can improve when there is a degree of collapse. These conflicting results indicate that, despite extensive empirical exploration, a gap remains in our understanding of the collapse phenomenon in SSL training. We argue that this gap is due to the lack of a theoretical framework for analyzing the mechanisms that promote collapsed representations. We aim to close this gap by carefully studying the loss landscapes of SSL.
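Dimensional collapse can be made concrete by inspecting the singular-value spectrum of a batch of embeddings: a collapsed representation concentrates its spectrum on a few directions. The sketch below (our own illustration, not from this work; the function name `effective_rank` and the entropy-based definition are assumptions) measures this with the effective rank, i.e. the exponential of the entropy of the normalized spectrum:

```python
import numpy as np

def effective_rank(Z, eps=1e-12):
    """Effective rank of an (n_samples, dim) embedding matrix Z,
    computed as exp(entropy of the normalized singular-value spectrum).
    Values near 1 indicate complete collapse; values near `dim`
    indicate a fully spread representation."""
    s = np.linalg.svd(Z - Z.mean(axis=0), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically-zero directions
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# Full-rank embeddings: effective rank close to 16.
spread = rng.standard_normal((1000, 16))
# Dimensionally collapsed embeddings: 16-dim vectors confined
# to a 2-dim subspace, so the effective rank is at most 2.
collapsed = rng.standard_normal((1000, 2)) @ rng.standard_normal((2, 16))
```

On this construction, `effective_rank(spread)` is close to 16 while `effective_rank(collapsed)` stays at or below 2, matching the two regimes described above.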
In this work, we analytically solve the effective landscapes of linear models trained on several popular self-supervised losses, including InfoNCE (Oord et al., 2018), Normalized Temperature Cross-Entropy (NT-Xent) (Chen et al., 2020a), the Spectral Contrastive Loss (HaoChen et al., 2021), and Barlow Twins / VICReg (Zbontar et al., 2021; Bardes et al., 2021). The main thesis of this work is: the local geometry of the SSL landscape around the origin crucially determines the learning behavior of SSL models. Technically, we show that (1) the interplay between data variation and data augmentation determines the geometry of the loss, and (2) this geometry explains when dimensional collapse can be helpful and why certain SSL losses are robust against data imbalance while others are not. To the best of our knowledge, our work is the first to thoroughly study the landscape causes of collapse in SSL.

†Work done during an internship at Physics & Informatics Laboratories, NTT Research.
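To make the setting concrete, one of the losses named above admits a very short batch estimator for a linear encoder. The sketch below (our own minimal illustration under assumed conventions; the batch estimator and all variable names are ours, not the paper's) implements the Spectral Contrastive Loss of HaoChen et al. (2021), −2·E[f(x)ᵀf(x⁺)] + E[(f(x)ᵀf(x′))²], for a linear encoder f(x) = Wx. Note that W = 0 (complete collapse) is always a critical point of this loss, which is why the local geometry around the origin is the natural object to study:

```python
import numpy as np

def spectral_contrastive_loss(W, X, Xp):
    """Batch estimate of the spectral contrastive loss for a linear
    encoder f(x) = W x.  X and Xp are (n, d) arrays holding two
    augmented views of the same n samples (positive pairs); all
    cross-sample pairs in the batch serve as negatives."""
    Z, Zp = X @ W.T, Xp @ W.T                 # (n, k) embeddings of both views
    pos = np.mean(np.sum(Z * Zp, axis=1))     # E[f(x)^T f(x+)] over positives
    G = Z @ Zp.T                              # all pairwise similarities
    n = X.shape[0]
    neg = np.mean(G[~np.eye(n, dtype=bool)] ** 2)  # E[(f(x)^T f(x'))^2]
    return -2.0 * pos + neg

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 5))
Xp = X + 0.1 * rng.standard_normal((8, 5))   # positives = noisy augmentations
W0 = np.zeros((3, 5))                        # completely collapsed encoder
W = 0.1 * rng.standard_normal((3, 5))        # small random encoder near the origin
```

Here `spectral_contrastive_loss(W0, X, Xp)` equals 0 exactly, and whether nearby encoders like `W` achieve a lower loss than the origin is precisely the kind of question the landscape analysis addresses.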
