SELF-SUPERVISED LOGIT ADJUSTMENT

Abstract

Self-supervised learning (SSL) has achieved tremendous success on various well-curated datasets in computer vision and natural language processing. Nevertheless, existing methods struggle to capture transferable and robust features when facing the long-tailed distributions of real-world scenarios. The reason is that plain SSL methods pursuing sample-level uniformity easily lead to a distorted embedding space, where head classes with huge sample numbers dominate the feature regime and tail classes passively collapse. To tackle this problem, we propose a novel Self-Supervised Logit Adjustment (S²LA) method to achieve category-level uniformity from a geometric perspective. Specifically, we measure the geometric statistics of the embedding space to construct the calibration, and jointly learn a surrogate label allocation to constrain the space expansion of head classes and avoid the passive collapse of tail classes. Our proposal does not alter the setting of SSL and can be easily integrated into existing works in a low-cost manner. Extensive results on a range of benchmark datasets show the effectiveness of S²LA, with high tolerance to distribution skewness.

1. INTRODUCTION

Recent years have witnessed the great success of self-supervised learning (Doersch et al., 2015; Wang & Gupta, 2015; Chen et al., 2020; Caron et al., 2020). The rapid advances behind this paradigm benefit from elegant training on data without annotations, which can be acquired in a large-volume and low-cost way. However, real-world natural sources usually exhibit a long-tailed distribution (Reed, 2001), and directly applying existing self-supervised learning methods leads to a distorted embedding space, where the majority dominates the feature regime (Zhang et al., 2021) and the minority collapses (Mixon et al., 2022). With the increasing attention on machine learning fairness in recent years, it has become a trend to explore self-supervised long-tailed learning (Yang & Xu, 2020; Liu et al., 2021; Jiang et al., 2021; Zhou et al., 2022). Compared with the flourishing supervised long-tailed learning (Kang et al., 2019; Yang & Xu, 2020; Menon et al., 2021), the self-supervised counterpart remains underexplored as an emerging direction. Existing explorations of self-supervised learning in the long-tailed context fall into three perspectives: data, model, and loss. From the data perspective, BCL (Zhou et al., 2022) leverages the memorization effect of deep neural networks (DNNs) to drive an instance-wise augmentation, which learns a better trade-off between head and tail classes in representation learning. From the model perspective, SDCLR (Jiang et al., 2021) contrasts the feature encoder with its pruned counterpart to discover hard examples, which mostly cover samples from tail classes, and thereby efficiently enhances the learning preference towards tail classes. From the loss perspective, reweighting mechanisms such as rwSAM (Liu et al., 2021), which adopts a data-dependent sharpness-aware minimization scheme, can be applied to explicitly regularize the loss surface.
However, judging from current performance on self-supervised long-tailed learning, the potential of the loss perspective has not been sufficiently realized, whereas in supervised long-tailed learning, logit adjustment (Menon et al., 2021), a method of the same perspective, has outperformed a range of competitors. We dive into the loss perspective and seek to understand: why does conventional contrastive learning underperform in self-supervised long-tailed learning? To answer this question, let us consider two types of representation uniformity. (1) Sample-level uniformity. As proven in (Wang & Isola, 2020), contrastive learning distributes the representations of data points uniformly in the embedding space. The feature span of each category is then proportional to its sample number. (2) Category-level uniformity. This uniformity pursues splitting the region equally among different categories without considering their sample numbers (Papyan et al., 2020; Graf et al., 2021). In class-balanced scenarios, the former uniformity naturally implies the latter and induces equivalent separability for classification. However, in class-imbalanced cases, especially the long-tailed setting, sample-level uniformity leads to an undesirably large feature regime for head classes due to their dominant sample proportion. In comparison, category-level uniformity constrains the greedy expansion of head classes and prevents the passive collapse of tail classes, which is more benign for downstream classification (Graf et al., 2021; Fang et al., 2021; Li et al., 2022). Unfortunately, there is no support for category-level uniformity in contrastive learning losses, which answers the question raised at the beginning.
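To make sample-level uniformity concrete, the uniformity objective proven in (Wang & Isola, 2020) can be sketched as follows. This is a minimal NumPy sketch for illustration; the function name and the toy long-tailed batch are our own assumptions, not part of any method in this paper.

```python
import numpy as np

def uniformity_loss(z: np.ndarray, t: float = 2.0) -> float:
    """Uniformity objective of Wang & Isola (2020): log of the mean
    pairwise Gaussian potential between L2-normalized embeddings.

    Minimizing it spreads the samples uniformly on the hypersphere --
    sample-level uniformity, with no notion of category, so a class's
    feature span grows with its sample count."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # pairwise ||zi - zj||^2
    iu = np.triu_indices(len(z), k=1)                    # distinct pairs only
    return float(np.log(np.exp(-t * sq[iu]).mean()))

# Toy long-tailed batch: 90 "head" samples vs. 10 "tail" samples.
rng = np.random.default_rng(0)
batch = np.concatenate([rng.normal(2.0, 1.0, (90, 16)),
                        rng.normal(-2.0, 1.0, (10, 16))])
loss = uniformity_loss(batch)  # scalar to be minimized during training
```

A fully collapsed batch (all embeddings identical) attains the maximum value 0, while spreading the points drives the loss negative, which is why minimizing it enforces uniformity over samples.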
Inspired by logit adjustment (LA) for supervised long-tailed learning, we propose a novel method, termed Self-Supervised Logit Adjustment (S²LA), to calibrate self-supervised long-tailed learning from the geometric perspective. Specifically, unlike LA, which requires the class distribution to be available, S²LA uses a constant simplex ETF to measure the geometric characteristics of the embedding space for adjustment. Together with a surrogate label allocation to compute the target, we can then explicitly compress the greedy space expansion of head classes and avoid the passive collapse of tail classes. Alternating between the two amounts to an ordinary logit balancing and an efficient optimal-transport problem, which dynamically approaches category-level uniformity. In Figure 1, we give a toy experiment comparing the representations learnt without and with S²LA in the embedding space. Our contributions can be summarized as follows:

1. We are among the first attempts to study the drawback of the contrastive learning loss in the self-supervised long-tailed context, and point out that the resulting sample-level uniformity is an intrinsic limitation, driving our exploration of category-level uniformity (Section 4).

2. We develop a novel Self-Supervised Logit Adjustment (Figure 2), which dynamically adjusts the embedding distribution to calibrate the geometric statistics and conducts a surrogate label allocation for category-level uniformity in an efficient end-to-end manner.

3. Our method can be easily plugged into previous methods of self-supervised long-tailed learning (Eq. 7). Extensive experiments on a range of benchmark datasets demonstrate the consistent improvement of S²LA, with high tolerance to distribution skewness.
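For concreteness, a simplex equiangular tight frame (ETF), the constant geometric target used for adjustment, can be constructed as below. This is a NumPy sketch under the standard neural-collapse formulation (Papyan et al., 2020); the helper name is ours, and the adjustment performed on top of the ETF is not reproduced here.

```python
import numpy as np

def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return a (dim x K) matrix whose K columns form a simplex ETF:
    unit-norm prototypes with pairwise inner product -1/(K-1), i.e.
    equal and maximal angular separation. Assumes dim >= K."""
    K = num_classes
    rng = np.random.default_rng(seed)
    # Random orthonormal columns U in R^{dim x K} (reduced QR).
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))
    # Center the identity, then rescale so every column has unit norm.
    return np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)

P = simplex_etf(num_classes=5, dim=16)
gram = P.T @ P  # ~1 on the diagonal, ~-1/4 off the diagonal
```

Because the prototypes are fixed and maximally separated regardless of sample counts, they provide a class-agnostic geometric reference, which is what lets the adjustment proceed without knowing the true class distribution.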

2. RELATED WORKS

Self-Supervised Long-tailed Learning. Several recent explorations have been devoted to this direction from the data, model, and loss perspectives. BCL (Zhou et al., 2022) leverages the memorization effect of DNNs to drive an instance-wise augmentation, which enhances the learning of tail samples. SDCLR (Jiang et al., 2021) constructs a self-contrast between a model and its pruned counterpart to learn more balanced representations. The classic focal loss (Lin et al., 2017) leverages loss statistics to put more emphasis on hard examples, and has been applied to self-supervised long-tailed learning (Zhou et al., 2022). SeLa (Asano et al., 2020) is the first attempt to cast unsupervised clustering as an optimal transport problem and leverages a uniform prior on class-imbalanced data. rwSAM (Liu et al., 2021) proposes to penalize loss sharpness in a reweighting manner to similarly calibrate class-imbalanced learning. However, the potential of the loss perspective has not been set off due to the intrinsic limitation of the conventional contrastive learning loss.
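SeLa's optimal-transport view of clustering can be sketched with Sinkhorn-Knopp iterations, which enforce a uniform prior over classes rather than over samples. This is a minimal NumPy sketch; the function name, temperature, and iteration count are illustrative assumptions, not SeLa's exact configuration.

```python
import numpy as np

def sinkhorn_allocation(scores: np.ndarray, n_iters: int = 200,
                        eps: float = 0.5) -> np.ndarray:
    """Balanced soft label allocation via Sinkhorn-Knopp iterations.

    Given an N x K score matrix, returns soft assignments where each
    row sums to 1 and each class column receives (approximately) N/K
    total mass -- a uniform prior over classes, which prevents all
    samples from being assigned to a few dominant clusters."""
    Q = np.exp(scores / eps)  # positive transport kernel
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # equalize class totals
        Q /= Q.sum(axis=1, keepdims=True)  # renormalize each sample
    return Q

rng = np.random.default_rng(0)
labels = sinkhorn_allocation(rng.normal(size=(20, 4)))  # 20 samples, 4 classes
```

In practice such allocations are recomputed periodically from the current embeddings, so the balancing constraint is maintained as the representation evolves during training.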



Figure 1: Comparison of S²LA and the plain SSL method on a 2-D imbalanced synthetic dataset. (Left) Visualization of the 2-D synthetic dataset. (Middle) The embedding distribution of each category learnt by the plain SSL method is approximately proportional to the sample number. (Right) S²LA reduces the adverse effect of class imbalance and approaches category-level uniformity.

