THE HIDDEN UNIFORM CLUSTER PRIOR IN SELF-SUPERVISED LEARNING

Abstract

A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that implicit in the formulation of all these methods is an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.

1. INTRODUCTION

Self-supervised pretraining has emerged as a highly effective strategy for unsupervised representation learning, with remarkable advances demonstrated by joint-embedding methods (Chen et al., 2020b; Caron et al., 2021; Bardes et al., 2021; Assran et al., 2022). In the context of visual data, these approaches typically learn representations by training a neural network encoder to produce similar embeddings for two or more views of the same image. However, since outputting a constant vector regardless of the input would satisfy this objective, one of the main challenges with joint-embedding methods is to prevent such pathological solutions. A common remedy is to employ a regularizer that maximizes the volume of space occupied by the representations. This is sometimes referred to as the volume maximization principle. In practice, the volume maximization principle is implemented in a variety of ways, for example, by contrasting negative samples (Bromley et al., 1993; He et al., 2019; Chen et al., 2020b), by removing correlations in the feature space (Bardes et al., 2021; Zbontar et al., 2021), or by finding high entropy clusterings of the data (Asano et al., 2019; Caron et al., 2020; Assran et al., 2021; 2022). When pretrained on the ImageNet dataset (Russakovsky et al., 2015), these methods have been shown to produce representations that encode highly semantic features (Caron et al., 2020; 2021; Assran et al., 2022). However, the commonly used ImageNet-1K dataset is relatively class-balanced, which is in contrast to most real-world settings, where data is often class-imbalanced and semantic concepts follow a long-tailed power-law distribution (Newman, 2005; Mahajan et al., 2018; Van Horn et al., 2018). Indeed, it has been shown that pretraining the same joint-embedding methods on long-tailed datasets can lead to significant drops in performance (Tian et al., 2021a).
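The high-entropy clustering regularizer mentioned above is commonly realized with Sinkhorn-Knopp iterations (as in SwAV and MSN), which push the mini-batch cluster-assignment matrix toward uniform cluster marginals, i.e., clusters of roughly equal size. Below is a minimal NumPy sketch of this normalization; the function name `sinkhorn_uniform` and the iteration count are illustrative choices, not the exact implementation of any of the cited methods.

```python
import numpy as np

def sinkhorn_uniform(scores, n_iters=50):
    """Sinkhorn-Knopp normalization of a (batch, clusters) score matrix.

    Alternately rescales columns and rows so that, at convergence,
    each of the K clusters receives an equal share of the batch mass
    (the 'uniform cluster prior') while each sample's assignment
    remains a valid probability distribution over clusters.
    """
    Q = np.exp(scores)          # non-negative assignment weights
    Q /= Q.sum()                # normalize total mass to 1
    B, K = Q.shape
    for _ in range(n_iters):
        # enforce uniform cluster marginals: each column sums to 1/K
        Q /= Q.sum(axis=0, keepdims=True)
        Q /= K
        # enforce uniform sample marginals: each row sums to 1/B
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= B
    return Q * B                # rows become per-sample assignment distributions

rng = np.random.default_rng(0)
Q = sinkhorn_uniform(rng.normal(size=(128, 16)))
# Each row sums to 1; each cluster's total assignment is ~B/K = 8,
# regardless of how imbalanced the underlying scores were.
```

Note how the column-normalization step hard-codes the uniform prior: every cluster is forced to absorb the same fraction of the batch, which is precisely the bias that becomes problematic on class-imbalanced data.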
Such an observation is problematic in that it significantly hinders the applicability of modern research advances with joint-embedding methods to real-world settings. In this work, we explore the use of joint-embedding methods for class-imbalanced datasets. First, we theoretically show that current methods with volume maximization regularizers such as VICReg (Bardes et al., 2021), SwAV (Caron et al., 2020), MSN (Assran et al., 2022) and SimCLR (Chen et al., 2020b) (with limited assumptions), have a uniform feature prior; i.e., a bias to learn features that enable grouping the data into clusters of roughly equal size. Consequently, these joint-embedding

