WHAT SHAPES THE LOSS LANDSCAPE OF SELF-SUPERVISED LEARNING?

Abstract

Prevention of complete and dimensional collapse of representations has recently become a design principle for self-supervised learning (SSL). However, questions remain in our theoretical understanding: When do these collapses occur? What are their mechanisms and causes? We answer these questions by deriving and thoroughly analyzing an analytically tractable theory of SSL loss landscapes. Within this theory, we identify the causes of dimensional collapse and study the effects of normalization and bias. Finally, we leverage the interpretability afforded by the analytical theory to understand how dimensional collapse can be beneficial and what affects the robustness of SSL to data imbalance.

1. INTRODUCTION

Self-supervised learning (SSL) methods have achieved remarkable success in learning good representations without labeled data (Chen et al., 2020b). Loss functions used in such SSL techniques promote representational similarity between pairs of related samples while using explicit penalties (Chen et al., 2020a; He et al., 2020; Zbontar et al., 2021; Caron et al., 2020) or asymmetric dynamics (Caron et al., 2021; Grill et al., 2020; Chen and He, 2021) to ensure that the distance between unrelated samples remains large. In practice, however, SSL training often exhibits dimensional collapse (Jing et al., 2021; Tian et al., 2021; Pokle et al., 2022), where the learned representation spans only a low-dimensional subspace of the available space. In the extreme case, this failure mode manifests as complete collapse, where the learned representation has rank zero and no informative features can be extracted. Prior work has primarily positioned such collapses as enemies of learning, arguing that they negatively impact downstream task performance (Zbontar et al., 2021; Jing et al., 2021; Bardes et al., 2021). However, recent work by Cosentino et al. (2022) empirically demonstrates otherwise: the quality of representations can improve when there is a degree of collapse. These conflicting results indicate that, despite extensive empirical exploration, a gap remains in our understanding of the collapse phenomenon in SSL training. We argue that this gap is due to the lack of a theoretical framework for analyzing the mechanisms that promote collapsed representations. We aim to close this gap by carefully studying the loss landscapes of SSL.
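To make the notion of dimensional collapse concrete, one can measure the effective dimensionality of a learned representation via the participation ratio of its covariance spectrum. The following is a minimal NumPy sketch (the function name and thresholds are ours, for illustration only): the participation ratio is close to the ambient dimension for isotropic features and close to 1 for a dimensionally collapsed representation.

```python
import numpy as np

def effective_rank(feats, eps=1e-12):
    """Participation ratio of the covariance spectrum: close to d for
    isotropic d-dimensional features, close to 1 under near-complete collapse."""
    cov = np.cov(feats, rowvar=False)
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return lam.sum() ** 2 / (np.square(lam).sum() + eps)

rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 8))                          # spans all 8 dims
collapsed = rng.normal(size=(1000, 1)) * np.ones((1, 8))   # rank-1 features

assert effective_rank(full) > 6        # near the full dimension 8
assert effective_rank(collapsed) < 1.5 # near 1: dimensional collapse
```

Other soft-rank measures (e.g., entropy of the normalized spectrum) behave similarly; the participation ratio is used here only because it is simple and differentiable.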
In this work, we analytically solve the effective landscapes of linear models trained on several popular self-supervised learning losses, including InfoNCE (Oord et al., 2018), Normalized Temperature Cross-Entropy (NT-Xent) (Chen et al., 2020a), the Spectral Contrastive Loss (HaoChen et al., 2021), and Barlow Twins / VICReg (Zbontar et al., 2021; Bardes et al., 2021). The main thesis of this work is that the local geometry of the SSL landscape around the origin crucially decides the learning behavior of SSL models. Technically, we show that (1) the interplay between data variation and data augmentation determines the geometry of the loss, and (2) the geometry of the loss explains when dimensional collapse can be helpful and why certain SSL losses are robust against data imbalance while others are not. To the best of our knowledge, our work is the first to thoroughly study the loss-landscape causes of collapse in SSL.
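The interplay between data variation and augmentation strength can be previewed numerically in a toy instance. The sketch below (our own construction, not the paper's code) estimates the Spectral Contrastive Loss by Monte Carlo for a one-dimensional linear model f(x) = w·x with binary clean data and Gaussian augmentation noise. In this instance the origin is a local maximum, and increasing the augmentation strength pulls the loss minimizer toward the origin.

```python
import numpy as np

# Toy 1-d linear model f(x) = w * x with the spectral contrastive loss,
# estimated by Monte Carlo. Clean data x_bar = +/-1; augmentation noise
# is N(0, sigma^2). Illustrative sketch only, not the paper's general setup.
def spectral_loss(w, sigma, n=20000, seed=0):
    rng = np.random.default_rng(seed)
    xb = rng.choice([-1.0, 1.0], size=n)     # clean data points
    x1 = xb + sigma * rng.normal(size=n)     # first augmented view
    x2 = xb + sigma * rng.normal(size=n)     # second augmented view
    neg = rng.permutation(x1)                # negatives: shuffled views
    pos_term = -2.0 * np.mean((w * x1) * (w * x2))
    neg_term = np.mean(((w * x1) * (w * neg)) ** 2)
    return pos_term + neg_term

# The origin is a local maximum here: stepping away from w = 0 lowers the loss.
assert spectral_loss(0.3, sigma=0.5) < spectral_loss(0.0, sigma=0.5)

# Stronger augmentation pulls the minimizer toward the origin.
ws = np.linspace(0.0, 2.0, 201)
argmin_w = lambda s: ws[np.argmin([spectral_loss(w, s) for w in ws])]
assert argmin_w(2.0) < argmin_w(0.1)
```

In this particular toy the minimizer shrinks toward, but never reaches, the origin; the full theory in the paper characterizes when the origin actually becomes stable and collapse occurs.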

2. RELATED WORKS

SSL and Collapses. On the one hand, prior literature has often cast collapse as a harmful phenomenon that deteriorates downstream task performance (Jing et al., 2021; Zbontar et al., 2021). Preventing such collapsed representations is a frequently discussed topic in the literature (Hua et al., 2021; Jing et al., 2021; Pokle et al., 2022; Tian et al., 2021) and has motivated the design of several SSL techniques (Zbontar et al., 2021; Bardes et al., 2021; Ermolov et al., 2021). On the other hand, Cosentino et al. (2022) empirically showed that dimensional collapse under strong augmentations can significantly improve generalization performance. Our work demystifies these conflicting results by finding analytic solutions to the loss landscapes of several standard SSL techniques.

Theoretical Advances in SSL. Recently, several advances have been made toward understanding the success of SSL techniques from different perspectives: e.g., learning theory (Arora et al., 2019; Saunshi et al., 2022; Nozawa and Sato, 2021; Wei et al., 2021), information theory (Tsai et al., 2021a;b; Tosh et al., 2021), causality and data-generating processes (Zimmermann et al., 2021; Kugelgen et al., 2021; Trivedi et al., 2022; Tian et al., 2020; Mitrovic et al., 2020; Wang et al., 2022), dynamics (Wang and Isola, 2020; Tian et al., 2021; Tian, 2022; Wang and Liu, 2021; Simon et al., 2023), and loss landscapes (Pokle et al., 2022). These advances have unveiled practically useful properties of SSL, such as robustness to dataset imbalance (Liu et al., 2021) and principled solutions to avoid spurious correlations (Robinson et al., 2021). Among these, Jing et al. (2021) is closest to ours in problem setting: the authors focused on the linearized learning dynamics and suggested that a competition between feature signal strength and augmentation strength can lead to dimensional collapse.
In contrast, our focus is on the landscape, and our results imply that this feature-augmentation competition is, on its own, insufficient to cause dimensional collapse. In fact, we show that no collapse occurs in the setting studied by Jing et al. (2021).

3. A LANDSCAPE THEORY OF SELF-SUPERVISED LEARNING

†Work done during an internship at Physics & Informatics Laboratories, NTT Research.

Figure 1: Landscape in self-supervised learning (SSL). SSL losses generally depend only on the relative angle between pairs of network outputs (e.g., $f(x)^T f(x')$). Thus, the landscapes of a linear network ($f(x) = Wx$) have a global rotational symmetry and are symmetric about the origin. Our theory finds that the local stability of the origin decides the collapse: larger data variation (green) prevents collapse, while strong data augmentation (red) can promote collapse. We plot the loss for a toy linear model with a diagonal weight matrix $\mathrm{diag}(r_1, r_2)$. (a) The 1d landscape when one of the parameters is fixed. (b-d) The 2d landscape. (b) No collapse: the origin is an unstable local maximum, and the surrounding local minima avoid collapse; the dimensionally collapsed solutions are saddle points. (c) Dimensional collapse: the value of $w_1$ at all stable fixed points collapses to zero. (d) Complete collapse: the origin becomes an isolated local minimum.

This section presents the main theoretical results. Let $\{x_i\}_{i=1}^{N}$ be a dataset with $N$ data points. Every data point is augmented with i.i.d. additive noise $\epsilon$. To be concrete, we start by considering the standard contrastive loss, InfoNCE (Oord et al., 2018):
$$L = \mathbb{E}_\epsilon\!\left[-\sum_{i=1}^{N}\log\frac{\exp\!\left(-|f(x_i)-f(x_i')|^2/2\right)}{\sum_{j\neq i}\exp\!\left(-|f(x_i)-f(\chi_j)|^2/2\right)+\exp\!\left(-|f(x_i)-f(x_i')|^2/2\right)}\right],$$
where $f(x) \in \mathbb{R}^{d_1}$ is the model output; all $x$, $x'$, and $\chi$ are augmented data points obtained with independent additive noise $\epsilon$ such that $\mathbb{E}_\epsilon[x] = \bar{x} = \mathbb{E}_\epsilon[x'] \neq \mathbb{E}_\epsilon[\chi] = \bar{\chi}$, with bars denoting the clean (unaugmented) data points. We decompose the model output into a general function $\phi(x) \in \mathbb{R}^{d_0}$ and the last-layer weight matrix $W \in \mathbb{R}^{d_1 \times d_0}$: $f(x) = W\phi(x)$. The covariance of $\phi(\bar{x})$ is $A_0 := \mathbb{E}_{\bar{x}}[\phi(\bar{x})\phi(\bar{x})^T]$, and the covariance of the data-augmented penultimate-layer representation is $\Sigma := \mathbb{E}_{x}[\phi(x)\phi(x)^T]$. The effect of data augmentation on the learned
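As a sanity check on the InfoNCE objective above, the following NumPy sketch transcribes it directly for a linear model $f(x) = Wx$ (i.e., $\phi$ is the identity). The names and the synthetic data are ours, for illustration only.

```python
import numpy as np

# Direct transcription of the InfoNCE objective for a linear model f(x) = W x.
def infonce_loss(W, x1, x2, chi):
    """x1, x2: two augmented views of each data point, shape (N, d0);
    chi: augmented negative samples, shape (N, d0); W: shape (d1, d0)."""
    f1, f2, fc = x1 @ W.T, x2 @ W.T, chi @ W.T
    # positive similarity exp(-|f(x_i) - f(x_i')|^2 / 2), shape (N,)
    pos = np.exp(-0.5 * np.sum((f1 - f2) ** 2, axis=1))
    # pairwise negative similarities exp(-|f(x_i) - f(chi_j)|^2 / 2)
    d2 = np.sum((f1[:, None, :] - fc[None, :, :]) ** 2, axis=2)
    negs = np.exp(-0.5 * d2)
    np.fill_diagonal(negs, 0.0)  # sum runs over j != i
    return -np.sum(np.log(pos / (negs.sum(axis=1) + pos)))

rng = np.random.default_rng(1)
N, d0, d1 = 32, 4, 4
xbar = rng.normal(size=(N, d0))                      # clean data
aug = lambda: xbar + 0.1 * rng.normal(size=(N, d0))  # additive noise epsilon
W = rng.normal(size=(d1, d0))

loss = infonce_loss(W, aug(), aug(), aug())
assert loss >= 0.0  # each -log term is nonnegative: pos/(pos + negs) <= 1
```

Note that at $W = 0$ every similarity equals 1, so each term of the sum reduces to $\log N$: the loss is finite but entirely uninformative, which is exactly the complete-collapse solution whose local stability the theory analyzes.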

