RETHINKING UNIFORMITY IN SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

Self-supervised representation learning has achieved great success in many machine learning tasks. Many research efforts aim to learn better representations by preventing the model from collapsing. Wang & Isola (2020) opened a new perspective by introducing a uniformity metric to measure the degree of collapse of representations. However, we theoretically and empirically demonstrate that this metric is insensitive to dimensional collapse. Inspired by the finding that a representation following a zero-mean isotropic Gaussian distribution attains ideal uniformity, we propose to use the Wasserstein distance between the distribution of learned representations and this ideal distribution of maximum uniformity as a quantifiable uniformity metric. To analyze sensitivity to dimensional collapse, we design five desirable constraints for ideal uniformity metrics, based on which we find that the proposed metric satisfies all constraints while the existing one does not. Synthetic experiments also demonstrate that the proposed metric can distinguish different degrees of dimensional collapse, whereas the existing one (Wang & Isola, 2020) is insensitive to them. Finally, we impose the proposed uniformity metric as an auxiliary loss term on various existing self-supervised methods, which consistently improves downstream performance.

1. INTRODUCTION

Self-supervised representation learning has become increasingly popular in the machine learning community (Chen et al., 2020; He et al., 2020; Caron et al., 2020; Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021), and has achieved impressive results in various tasks such as object detection, segmentation, and text classification (Xie et al., 2021; Wang et al., 2021b; Yang et al., 2021; Zhao et al., 2021; Wang et al., 2021a; Gunel et al., 2021). Aiming to learn representations that are invariant under different augmentations, a common practice in self-supervised learning is to maximize the similarity of representations obtained from different augmented versions of a sample by using a Siamese network (Bromley et al., 1994; Hadsell et al., 2006). However, a common issue with this approach is the existence of trivial constant solutions, in which all representations collapse to a constant point (Chen & He, 2021), as visualized in Fig. 1, known as the collapse problem (Jing et al., 2022).
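The two collapse modes can be illustrated with a small synthetic sketch (the construction and the rank-counting heuristic are ours, for illustration only): constant collapse leaves no variance at all, while dimensional collapse leaves variance only within a subspace of the embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 8  # number of samples, embedding dimension

# Healthy representations: variance spread over all d dimensions.
healthy = rng.normal(size=(n, d))

# Constant collapse: every sample maps to the same point.
constant = np.tile(rng.normal(size=(1, d)), (n, 1))

# Dimensional collapse: representations occupy a lower-dimensional
# subspace (here, the last d // 2 coordinates are identically zero).
dim_collapsed = healthy.copy()
dim_collapsed[:, d // 2 :] = 0.0

def occupied_dimensions(z, tol=1e-6):
    """Count non-negligible singular values of the centered embeddings."""
    s = np.linalg.svd(z - z.mean(axis=0), compute_uv=False)
    return int((s > tol).sum())

print(occupied_dimensions(healthy))        # 8
print(occupied_dimensions(constant))       # 0
print(occupied_dimensions(dim_collapsed))  # 4
```

A useful uniformity metric should separate all three regimes, not merely detect the constant case.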


Many efforts have been made to prevent the vanilla Siamese network from the collapse problem. The well-known solutions can be summarized into three types: contrastive learning (Chen et al., 2020; He et al., 2020; Caron et al., 2020), asymmetric model architectures (Grill et al., 2020; Chen & He, 2021), and redundancy reduction (Zbontar et al., 2021; Zhang et al., 2022b). While these solutions avoid complete constant collapse, they may still suffer from dimensional collapse (Hua et al., 2021), in which representations occupy a lower-dimensional subspace instead of the entire available embedding space (Jing et al., 2022), as depicted in Fig. 1. Therefore, to show the effectiveness of the aforementioned approaches, we need a quantifiable metric of the degree of collapse of learned representations. To obtain a quantifiable analysis of the degree of collapse, recent works (Arora et al., 2019; Wang & Isola, 2020) propose to divide the loss function into alignment and uniformity terms. For instance, recent objective functions such as InfoNCE (van den Oord et al., 2018) and the cross-correlation loss employed in Barlow Twins (Zbontar et al., 2021) can be divided into these two terms. Such uniformity terms can explain the degree of collapse to some extent, since they measure the variability of learned representations (Zbontar et al., 2021). However, computing them relies on the choice of anchor-positive pairs, making them hard to use as general metrics. Wang & Isola (2020) further propose a formal definition of a uniformity metric via the Radial Basis Function (RBF) kernel (Cohn & Kumar, 2007). Despite its usefulness (Gao et al., 2021; Zhou et al., 2022), we theoretically and empirically demonstrate that this metric is insensitive to dimensional collapse. In this paper, we focus on designing a new uniformity metric that is sensitive to dimensional collapse.
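The uniformity metric of Wang & Isola (2020) is the logarithm of the average pairwise Gaussian (RBF) potential of the l2-normalized representations, $\mathcal{L}_{\text{uniform}} = \log \mathbb{E}_{x,y}\, e^{-t \|x - y\|_2^2}$ with $t = 2$ by default; lower values indicate higher uniformity. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def wang_isola_uniformity(z, t=2.0):
    """Log of the mean pairwise Gaussian potential on the unit hypersphere.
    z: (n, d) array of representations; rows are l2-normalized first.
    Lower values indicate higher uniformity."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    # For unit vectors, ||u - v||^2 = 2 - 2 <u, v>.
    sq_dists = 2.0 - 2.0 * (z @ z.T)
    iu = np.triu_indices(len(z), k=1)  # all distinct pairs, each counted once
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))

rng = np.random.default_rng(1)
z_full = rng.normal(size=(1000, 64))
z_collapsed = z_full.copy()
z_collapsed[:, 32:] = 0.0  # half of the dimensions collapse

u_full = wang_isola_uniformity(z_full)
u_collapsed = wang_isola_uniformity(z_collapsed)
```

Because pairwise squared distances of points spread uniformly on a hypersphere concentrate around 2 regardless of the dimension of the occupied subspace, zeroing out half of the embedding dimensions changes this value only slightly, which illustrates the insensitivity to dimensional collapse discussed above.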
Towards this end, we first introduce the finding that a representation following a zero-mean isotropic Gaussian distribution attains ideal uniformity. Based on this finding, we use the Wasserstein distance between the distribution of learned representations and the ideal distribution as the uniformity metric. By checking five well-designed desirable properties (called 'desiderata') of uniformity, we theoretically demonstrate that the proposed metric satisfies all desiderata while the existing one (Wang & Isola, 2020) does not. Synthetic experiments also demonstrate that the proposed metric can quantitatively distinguish various degrees of dimensional collapse, whereas the existing one is insensitive. Lastly, we apply the proposed uniformity metric in practical scenarios, imposing it as an auxiliary loss term on various existing self-supervised methods, which consistently improves downstream performance. The contributions of this work are summarized as follows: (i) we theoretically and empirically demonstrate that the existing uniformity metric (Wang & Isola, 2020) is insensitive to dimensional collapse, and we propose a new uniformity metric that is sensitive to it; (ii) by designing five desirable properties, we open a new perspective for rethinking ideal uniformity metrics; (iii) the proposed uniformity metric can be applied as an auxiliary loss term in various self-supervised methods, consistently improving performance in downstream tasks.
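Since the 2-Wasserstein distance between two Gaussians has a closed form, $W_2^2 = \|\mu_1 - \mu_2\|_2^2 + \operatorname{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\big)$, the proposed metric can be sketched by fitting a Gaussian to the learned representations and measuring its distance to a zero-mean isotropic target. The target variance $1/d$ (matching unit-norm vectors in expectation) and the function name are our assumptions for this sketch; the paper's exact normalization may differ:

```python
import numpy as np

def _psd_sqrt(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # guard against small negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def wasserstein_uniformity(z, target_var=None):
    """2-Wasserstein distance between a Gaussian fit to the representations
    and a zero-mean isotropic Gaussian N(0, target_var * I).
    z: (n, d) array of l2-normalized representations.
    Lower values indicate representations closer to maximal uniformity."""
    n, d = z.shape
    if target_var is None:
        target_var = 1.0 / d  # assumed normalization for unit-norm vectors
    mu = z.mean(axis=0)
    cov = np.cov(z, rowvar=False)
    # With an isotropic target, the cross term reduces to
    # 2 * sqrt(target_var) * Tr(cov^{1/2}).
    w2_sq = (mu @ mu
             + np.trace(cov) + d * target_var
             - 2.0 * np.sqrt(target_var) * np.trace(_psd_sqrt(cov)))
    return float(np.sqrt(max(w2_sq, 0.0)))

rng = np.random.default_rng(2)
d = 16
raw = rng.normal(size=(5000, d))
z_full = raw / np.linalg.norm(raw, axis=1, keepdims=True)
coll = raw.copy()
coll[:, d // 2 :] = 0.0  # half of the dimensions collapse
z_coll = coll / np.linalg.norm(coll, axis=1, keepdims=True)

w_full = wasserstein_uniformity(z_full)
w_coll = wasserstein_uniformity(z_coll)
```

Collapsed dimensions contribute zero eigenvalues to the fitted covariance, so the distance to the isotropic target grows with the degree of dimensional collapse; here `w_coll` is clearly larger than `w_full`.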

2. BACKGROUND

2.1 SELF-SUPERVISED REPRESENTATION LEARNING

Self-supervised representation learning aims to learn representations that are invariant to a series of different augmentations. Towards this end, a common practice is to maximize the similarity of representations obtained from different augmented versions of a sample. Specifically, given a set of data samples $\{x_1, x_2, \ldots, x_n\}$, a symmetric network architecture, also called a Siamese network (Hadsell et al., 2006), takes as input two randomly augmented views $x_i^a$ and $x_i^b$ of an input sample $x_i$. The two views are then processed by an encoder network consisting of a backbone $f$ (e.g., ResNet (He et al., 2016)) and a projection MLP head $g$ (Chen et al., 2020). To enforce invariance between the representations of the two views, $z_i^a \triangleq g(f(x_i^a))$ and $z_i^b \triangleq g(f(x_i^b))$, a natural solution is to maximize the cosine similarity between them, and the Mean Square Error (MSE) is a widely used loss function to align their l2-normalized representations on the surface of the unit hypersphere:

$$\mathcal{L}_{\text{align}}^{\theta} = \left\| \frac{z_i^a}{\|z_i^a\|} - \frac{z_i^b}{\|z_i^b\|} \right\|_2^2 = 2 - 2 \cdot \frac{\langle z_i^a, z_i^b \rangle}{\|z_i^a\| \cdot \|z_i^b\|}.$$

However, this approach easily learns an undesired trivial solution in which all representations collapse to a constant, as depicted in Fig. 1.
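The identity above (the MSE of l2-normalized vectors equals 2 minus twice their cosine similarity) can be checked with a short sketch (the function name is ours):

```python
import numpy as np

def alignment_loss(za, zb):
    """MSE between l2-normalized views, averaged over a batch.
    za, zb: (n, d) arrays of representations of the two augmented views."""
    ua = za / np.linalg.norm(za, axis=-1, keepdims=True)
    ub = zb / np.linalg.norm(zb, axis=-1, keepdims=True)
    return ((ua - ub) ** 2).sum(-1).mean()

rng = np.random.default_rng(3)
za = rng.normal(size=(32, 8))
zb = rng.normal(size=(32, 8))

loss = alignment_loss(za, zb)
# Equivalent form: 2 - 2 * mean cosine similarity of the view pairs.
ua = za / np.linalg.norm(za, axis=-1, keepdims=True)
ub = zb / np.linalg.norm(zb, axis=-1, keepdims=True)
cos = (ua * ub).sum(-1).mean()
```

Note that the loss depends only on the angle between the two representations, which is exactly why minimizing it alone admits the constant solution: any shared constant vector achieves a loss of zero.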

2.2. EXISTING SOLUTIONS TO CONSTANT COLLAPSE

To prevent the Siamese network from constant collapse, existing well-known solutions can be summarized into three types: contrastive learning, asymmetric model architectures, and redundancy reduction. More details are explained in this section.

Contrastive Learning. Contrastive learning is one effective way to avoid constant collapse; the core idea is to repulse negative pairs while attracting positive pairs. SimCLR (Chen et al., 2020) is



Figure 1: The left figure presents constant collapse, and the right figure visualizes dimensional collapse.

