RETHINKING UNIFORMITY IN SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

Self-supervised representation learning has achieved great success in many machine learning tasks. Much research effort has gone into learning better representations by preventing the model from collapsing. Wang & Isola (2020) opened a new perspective by introducing a uniformity metric that measures the degree of collapse of learned representations. However, we demonstrate, both theoretically and empirically, that this metric is insensitive to dimensional collapse. Motivated by the observation that representations following a zero-mean isotropic Gaussian distribution attain ideal uniformity, we propose the Wasserstein distance between the distribution of learned representations and this ideal distribution of maximum uniformity as a quantifiable uniformity metric. To analyze a metric's capacity to capture dimensional collapse, we design five desirable constraints for ideal uniformity metrics, and we show that the proposed metric satisfies all five while the existing one does not. Synthetic experiments further demonstrate that the proposed metric distinguishes different degrees of dimensional collapse, whereas the metric of Wang & Isola (2020) remains insensitive. Finally, we impose the proposed uniformity metric as an auxiliary loss term in various existing self-supervised methods, which consistently improves downstream performance.
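As a rough illustration of the idea in the abstract: when both the fitted distribution and the ideal target are Gaussian, the 2-Wasserstein distance has a closed form. The sketch below (assuming a target of N(0, I) for illustration; the paper's exact target distribution and normalization may differ) fits a Gaussian to an embedding matrix and evaluates that closed form. The function name and the choice of target are ours, not the paper's.

```python
import numpy as np

def gaussian_w2_to_isotropic(z):
    """2-Wasserstein distance between a Gaussian fitted to the
    embeddings z (shape (n, d)) and the illustrative target N(0, I).

    For Gaussians the distance is available in closed form; with an
    isotropic target N(0, I) it reduces to
        W2^2 = ||mu||^2 + Tr(Sigma) + d - 2 Tr(Sigma^{1/2}).
    """
    n, d = z.shape
    mu = z.mean(axis=0)
    sigma = np.cov(z, rowvar=False)
    # Sigma is symmetric PSD, so Tr(Sigma^{1/2}) is the sum of the
    # square roots of its eigenvalues (clipped for numerical safety).
    eigvals = np.clip(np.linalg.eigvalsh(sigma), 0.0, None)
    w2_sq = mu @ mu + eigvals.sum() + d - 2.0 * np.sqrt(eigvals).sum()
    return np.sqrt(max(w2_sq, 0.0))

rng = np.random.default_rng(0)
# Embeddings close to the ideal zero-mean isotropic Gaussian.
well_spread = rng.normal(size=(5000, 8))
# Dimensionally collapsed embeddings: 6 of 8 dimensions nearly vanish.
collapsed = well_spread * np.array([1.0] * 2 + [1e-3] * 6)

print(gaussian_w2_to_isotropic(well_spread))  # close to 0
print(gaussian_w2_to_isotropic(collapsed))    # clearly larger
```

A distance near zero indicates representations close to the ideal distribution, while collapsed dimensions push the distance up, which is the sensitivity the abstract argues the existing metric lacks.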

1. INTRODUCTION

Self-supervised representation learning has become increasingly popular in the machine learning community (Chen et al., 2020; He et al., 2020; Caron et al., 2020; Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021), and has achieved impressive results in various tasks such as object detection, segmentation, and text classification (Xie et al., 2021; Wang et al., 2021b; Yang et al., 2021; Zhao et al., 2021; Wang et al., 2021a; Gunel et al., 2021). Aiming to learn representations that are invariant under different augmentations, a common practice in self-supervised learning is to maximize the similarity of representations obtained from different augmented versions of a sample using a Siamese network (Bromley et al., 1994; Hadsell et al., 2006). However, a common issue with this approach is the existence of trivial solutions in which all representations collapse to a constant point (Chen & He, 2021), as visualized in Fig. 1, known as the collapse problem (Jing et al., 2022).


Many efforts have been made to prevent the vanilla Siamese network from collapsing. The well-known solutions can be summarized into three types: contrastive learning (Chen et al., 2020; He et al., 2020; Caron et al., 2020), asymmetric model architectures (Grill et al., 2020; Chen & He, 2021), and redundancy reduction (Zbontar et al., 2021; Zhang et al., 2022b). While these solutions avoid complete constant collapse, they may still suffer from dimensional collapse (Hua et al., 2021), in which representations occupy a lower-dimensional subspace instead of the entire available embedding space (Jing et al., 2022), as depicted in Fig. 1. Therefore, to show the effectiveness of the aforementioned approaches, we need a quantifiable metric of the degree of collapse of learned representations.
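One simple way to visualize dimensional collapse, as described above, is to inspect the singular value spectrum of an embedding matrix: collapsed dimensions show up as near-zero singular values. The following is a minimal diagnostic sketch under that assumption (not the paper's proposed metric):

```python
import numpy as np

def singular_value_spectrum(z):
    """Singular values of a centered (n, d) embedding matrix,
    normalized by the largest singular value (descending order)."""
    z = z - z.mean(axis=0)  # center the embeddings
    s = np.linalg.svd(z, compute_uv=False)
    return s / s.max()

rng = np.random.default_rng(0)

# Full-rank embeddings: all 8 dimensions carry comparable variance.
full = rng.normal(size=(1000, 8))

# Dimensionally collapsed embeddings: variance concentrated in 2 dims.
collapsed = full.copy()
collapsed[:, 2:] *= 1e-3

print(singular_value_spectrum(full))       # all values of comparable size
print(singular_value_spectrum(collapsed))  # trailing values near zero
```

A flat spectrum indicates the embedding space is fully used, while a spectrum whose tail drops toward zero is the signature of the dimensional collapse shown on the right of Fig. 1.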



Figure 1: The left figure presents constant collapse, and the right figure visualizes dimensional collapse.

