THE HIDDEN UNIFORM CLUSTER PRIOR IN SELF-SUPERVISED LEARNING

Abstract

A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that the formulation of all of these methods contains an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.

1. INTRODUCTION

Self-supervised pretraining has emerged as a highly effective strategy for unsupervised representation learning, with remarkable advances demonstrated by joint-embedding methods (Chen et al., 2020b; Caron et al., 2021; Bardes et al., 2021; Assran et al., 2022). In the context of visual data, these approaches typically learn representations by training a neural network encoder to produce similar embeddings for two or more views of the same image. However, since outputting a constant vector regardless of the input would satisfy this objective, one of the main challenges with joint-embedding methods is to prevent such pathological solutions. A common remedy is to employ a regularizer that maximizes the volume of space occupied by the representations. This is sometimes referred to as the volume maximization principle. In practice, the volume maximization principle is implemented in a variety of ways, for example, by contrasting negative samples (Bromley et al., 1993; He et al., 2019; Chen et al., 2020b), by removing correlations in the feature space (Bardes et al., 2021; Zbontar et al., 2021), or by finding high-entropy clusterings of the data (Asano et al., 2019; Caron et al., 2020; Assran et al., 2021; 2022). When pretrained on the ImageNet dataset (Russakovsky et al., 2015), these methods have been shown to produce representations that encode highly semantic features (Caron et al., 2020; 2021; Assran et al., 2022). However, the commonly used ImageNet-1K dataset is relatively class-balanced, in contrast to most real-world settings, where data is often class-imbalanced and semantic concepts follow a long-tailed power-law distribution (Newman, 2005; Mahajan et al., 2018; Van Horn et al., 2018). Indeed, it has been shown that pretraining the same joint-embedding methods on long-tailed datasets can lead to significant drops in performance (Tian et al., 2021a).
Such an observation is problematic, as it significantly hinders the applicability of modern research advances with joint-embedding methods to real-world settings. In this work, we explore the use of joint-embedding methods for class-imbalanced datasets. First, we theoretically show that current methods with volume maximization regularizers, such as VICReg (Bardes et al., 2021), SwAV (Caron et al., 2020), MSN (Assran et al., 2022), and SimCLR (Chen et al., 2020b) (with limited assumptions), have a uniform feature prior; i.e., a bias to learn features that enable grouping the data into clusters of roughly equal size. Consequently, these joint-embedding methods will penalize features that do not uniformly cluster the data, even if such features correlate well with class information; see Figure 1. Second, we empirically validate that joint-embedding methods employing volume maximization regularizers are sensitive to the mini-batch class distributions. These approaches fail to learn class-discriminative features when the samples within a mini-batch do not follow a uniform class distribution. This observation partially explains why performance degrades when pretraining with real-world data, where sampled mini-batches often contain highly imbalanced class distributions. Finally, based on this observation, we propose to move away from conventional uniformity priors and instead reformulate self-supervised criteria to prefer long-tailed feature priors that are more aligned with the distribution of semantic concepts in real-world datasets. In particular, we extend the Masked Siamese Networks (MSN) method of Assran et al. (2022) to support the use of arbitrary feature priors, and refer to this extension as Prior Matching for Siamese Networks (PMSN).
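To make the idea of prior matching concrete, the following minimal NumPy sketch illustrates one way such a regularizer can be formulated: rather than pushing the mean cluster-assignment distribution of a mini-batch toward the uniform distribution, it penalizes the KL divergence between that mean distribution and an arbitrary target prior, e.g., a power-law over cluster ranks. The function names and the parameterization of the power-law (`tau` as the exponent) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def power_law_prior(num_clusters, tau):
    """Target distribution over cluster ranks: p_k proportional to k^(-tau).
    tau = 0 recovers the uniform prior."""
    ranks = np.arange(1, num_clusters + 1, dtype=np.float64)
    p = ranks ** (-tau)
    return p / p.sum()

def prior_matching_loss(probs, prior, eps=1e-12):
    """KL(p_bar || prior), where p_bar is the mean cluster-assignment
    distribution over the mini-batch (probs has shape (batch, K)).
    With a uniform prior this reduces, up to a constant, to maximizing
    the entropy of the mean assignment distribution."""
    p_bar = probs.mean(axis=0)
    return float(np.sum(p_bar * (np.log(p_bar + eps) - np.log(prior + eps))))
```

With `tau = 0` the penalty rewards uniform clusterings; with `tau > 0` it instead rewards long-tailed cluster usage, the behavior PMSN is designed to allow.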
When pretraining on the iNaturalist 2018 dataset (Van Horn et al., 2018) , which is naturally long-tailed, we demonstrate that moving away from uniform priors leads to more semantic representations and improved transfer on downstream tasks.

2. BACKGROUND

Given the recent success of joint-embedding methods, there is a growing literature that aims to build a better understanding of their behaviour. Several works have sought to develop generalization bounds for joint-embedding methods with volume maximization penalties (Arora et al., 2019; Balestriero & LeCun, 2022). Other works have sought to better understand the differences between various volume maximization penalties and to connect them under limited assumptions (Garrido et al., 2022). In general, it has been shown that ℓ2-normalized contrastive losses can be decomposed into an "alignment" component plus a volume maximization component that scatters the representations uniformly on the unit hypersphere (Wang & Isola, 2020). Following this observation, other works (Chen et al., 2021) have sought to reformulate contrastive losses to scatter representations either (a) uniformly on the unit hypercube, or (b) onto Gaussian distributions (which have the highest entropy amongst all distributions with a given variance). There is also theoretical work (Tian et al., 2021b) that aims to understand why certain joint-embedding methods, such as BYOL (Grill et al., 2020), can avoid representation collapse without explicit use of a volume maximization penalty. While these works have helped build our understanding of the training dynamics of joint-embedding methods, they do not directly explain why empirical use of these methods with real-world class-imbalanced data has often led to a degradation in downstream task performance (Tian et al., 2021a; Goyal et al., 2022) (see Appendix A for a broader discussion of related work). In this work, we explore the use of joint-embedding methods with class-imbalanced data. In particular, we theoretically show that a broad range of methods (beyond contrastive ones) prescribe a uniform feature prior, and that this prior is detrimental when pretraining with class-imbalanced data.
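As a concrete illustration of the alignment/uniformity decomposition of Wang & Isola (2020), the sketch below computes both terms on a batch of ℓ2-normalized embeddings. Estimating the uniformity term from pairs within a single batch is a simplification of the paper's expectation over the data distribution; the variable names are our own.

```python
import numpy as np

def alignment(x, y):
    """Mean squared distance between embeddings of two views of the same
    images; x, y have shape (N, D) and are assumed L2-normalized."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))

def uniformity(x, t=2.0):
    """Log of the mean pairwise Gaussian potential over the batch.
    Minimized when the embeddings spread uniformly over the hypersphere;
    equals 0 when all embeddings collapse to a single point."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)  # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))
```

Spreading a batch out over the sphere drives `uniformity` negative, while a collapsed batch sits at its maximum of 0, which is why the term acts as a volume maximization penalty.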

3. UNIFORM PRIORS IN MODERN SELF-SUPERVISED LEARNING

In this section, we theoretically show that common SSL methods such as VICReg (Bardes et al., 2021), SwAV (Caron et al., 2020), MSN (Assran et al., 2022), and (with limited assumptions) SimCLR (Chen et al., 2020b) correspond to variants of K-means, and thereby impose a uniform cluster prior; i.e., a bias to learn features that enable uniform clustering of the data. The governing assumption in K-means is the presence of isotropic data clusters, with roughly an equal number of



Figure 1: Impact of the uniform cluster prior in K-means when the class distribution of the data is imbalanced. K-means clustering depicted in color (green vs. red). Ground-truth cluster separation depicted with a dotted black line. When the uniform cluster prior is not satisfied, K-means can identify undesirable features for discriminating between data points.
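The failure mode depicted in Figure 1 is easy to reproduce. The following self-contained NumPy sketch (a hypothetical 1-D example, not data from the paper) runs Lloyd's algorithm on a 900/100 class-imbalanced dataset: because the spread-out majority class dominates the within-cluster variance, K-means cuts the majority class in two and merges the minority class into one of the halves, even though the two classes are linearly separable.

```python
import numpy as np

# Hypothetical 1-D dataset: a spread-out majority class (900 points over
# [0, 10]) and a tight minority class (100 points at x = 11).
majority = np.linspace(0.0, 10.0, 900)
minority = np.full(100, 11.0)
data = np.concatenate([majority, minority])

# Lloyd's algorithm with k = 2, initialized with one centroid per class,
# i.e., starting from the ground-truth separation.
centroids = np.array([0.0, 11.0])
for _ in range(100):
    assign = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)
    new_centroids = np.array([data[assign == k].mean() for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

sizes = np.bincount(assign, minlength=2)
# Despite the 900/100 class split, K-means converges to two clusters of
# roughly 500 points each: it splits the majority class and lumps the
# minority class in with the nearer half.
```

Even though the initialization matches the ground-truth classes, the objective prefers the balanced-but-wrong partition; this is the uniform cluster prior at work.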

