SUBCLASS-BALANCING CONTRASTIVE LEARNING FOR LONG-TAILED RECOGNITION

Abstract

Long-tailed recognition with imbalanced class distributions arises naturally in practical machine learning applications. Existing methods such as data reweighing, resampling, and supervised contrastive learning enforce class balance at the price of introducing imbalance between instances of head classes and tail classes, which may ignore the rich semantic substructures of the former and exaggerate the biases in the latter. We overcome these drawbacks with a novel subclass-balancing contrastive learning (SBCL) approach that clusters each head class into multiple subclasses of sizes similar to the tail classes and enforces representations to capture the two-layer class hierarchy between the original classes and their subclasses. Since the clustering is conducted in the representation space and updated during the course of training, the subclass labels preserve the semantic substructures of head classes. Meanwhile, SBCL does not overemphasize tail-class samples, so each individual instance contributes to representation learning equally. Hence, our method achieves both instance- and subclass-balance, while the original class labels are also learned through contrastive learning among subclasses from different classes. We evaluate SBCL on a range of long-tailed benchmark datasets, where it achieves state-of-the-art performance. In addition, we present extensive analyses and ablation studies of SBCL to verify its advantages.

1. INTRODUCTION

Real-world datasets often follow a Zipfian distribution over classes with a long tail (Zipf, 2013; Spain & Perona, 2007), i.e., a few classes (head classes) contain significantly more instances than the remaining tail classes. Such tail classes can be of great importance in high-stakes applications, e.g., the patient class in medical diagnosis or the accident class in autonomous driving (Cao et al., 2019; Shen et al., 2015). However, training on such class-imbalanced datasets can result in a severely biased model with a noticeable performance drop in classification tasks (Wang et al., 2017; Mahajan et al., 2018; Zhong et al., 2019; Ando & Huang, 2017; Buda et al., 2018; Collobert et al., 2008; Yang et al., 2019). To overcome the challenges posed by long-tailed data, data resampling (Ando & Huang, 2017; Buda et al., 2018; Chawla et al., 2002; Shen et al., 2016) and loss reweighing (Byrd & Lipton, 2019; Cao et al., 2019; Cui et al., 2019; Dong et al., 2018) have been widely applied, but they cannot fully leverage all the head-class samples.

Recent work found that supervised contrastive learning (SCL) (Khosla et al., 2020) can achieve state-of-the-art (SOTA) performance on benchmark datasets for long-tailed recognition (Kang et al., 2020; Li et al., 2022). Specifically, k-positive contrastive learning (KCL) (Kang et al., 2020) and its follow-up, targeted supervised contrastive learning (TSC) (Li et al., 2022), revamp SCL by encouraging the learned feature space to be class-balanced and uniformly distributed. However, methods that enforce class-balance often come at the price of instance-imbalance, i.e., each individual instance of a tail class has a much greater impact on model training than an instance of a head class. Such instance-imbalance can significantly degrade performance on long-tailed recognition for several reasons.
On the one hand, the limited samples in each tail class might not be sufficiently representative of the whole class, so even a small bias in them can be enormously exaggerated by class-balancing methods and result in sub-optimal learning of classifiers or representations. On the other hand, head classes usually have more complicated semantic substructures, e.g., multiple high-density regions of the data distribution, so simply downweighing samples of head classes and treating them equally can easily lose critical structural information. For example, images of a head class "cat" might be highly diverse in breed and color and need to be captured by different features, so downweighing or subsampling them may easily lose such information, while a tail class "platypus" might contain only a few similar images that are unlikely to cover all the representative features. Therefore, it is non-trivial to enforce both class-balance and instance-balance in the same method.

Can we remove the negative impact of class-imbalance while still retaining the advantages of instance-balance? In this paper, we achieve both through subclass-balancing contrastive learning (SBCL), a novel supervised contrastive learning method defined on subclasses, which are clusters within each head class, have sizes comparable to tail classes, and are adaptively updated during training. Instead of sacrificing instance-balance for class-balance, our method achieves both instance- and subclass-balance by exploring the head-class structure in the learned representation space of the model-in-training. In particular, we propose a bi-granularity contrastive loss that enforces a sample (1) to be closer to samples from the same subclass than to all other samples; and (2) to be closer to samples from a different subclass of the same class than to samples from subclasses of other classes. While the former learns representations with balanced and compact subclasses, the latter preserves the class structure at the subclass level by encouraging the subclasses of the same class to be closer to each other than to the subclasses of any different class. Hence, SBCL can learn an accurate classifier that distinguishes the original classes while enjoying both instance- and subclass-balance. We apply SBCL to several visual recognition tasks to demonstrate its superiority over previous work (e.g., KCL (Kang et al., 2020) and TSC (Li et al., 2022)).

To summarize, this paper makes the following contributions:

(a) We provide a new design principle for leveraging supervised contrastive learning for long-tailed recognition, i.e., aiming at both instance- and subclass-balance instead of class-balance at the expense of instance-balance.

(b) We propose a novel instantiation of the aforementioned design principle, subclass-balancing contrastive learning (SBCL), which consists of two major components, namely, subclass-balancing adaptive clustering and a bi-granularity contrastive loss.

(c) Empirically, we compare SBCL against state-of-the-art methods on three visual tasks (image classification, object detection, and instance segmentation) to demonstrate its effectiveness in handling class imbalance. We also conduct a series of experiments to analyze the efficacy of SBCL.

2. BACKGROUND AND NOTATIONS

Long-tailed recognition. Let $C$ be the number of classes and $n_k$ the number of training instances of class $k$. Without loss of generality, we follow prior work (Kang et al., 2019; Hong et al., 2021) and assume that the classes are sorted by cardinality in decreasing order (i.e., if $i < j$, then $n_i \ge n_j$), and $n_1 \gg n_C$. In addition, we define the imbalance ratio as $\max_{k\in[C]} n_k / \min_{k\in[C]} n_k = n_1/n_C$. Finally, let $f_\theta(\cdot)$ be a deep feature extractor, e.g., a neural network, parameterized by $\theta$, and let $w_c$ be the linear classifier of class $c$; then the classifier we aim to learn is $h(x_i) = \arg\max_{c\in[C]} w_c^\top f_\theta(x_i)$.

Supervised contrastive learning. Recent studies have shown that supervised contrastive learning (SCL) (Khosla et al., 2020) provides a strong performance gain for long-tailed recognition, and its variants have achieved state-of-the-art (SOTA) performance (Kang et al., 2020; Li et al., 2022). Specifically, SCL learns the feature extractor $f_\theta(\cdot)$ by maximizing the discriminativeness of positive instances, i.e., instances from the same class, and the learning objective over a batch $B = \{(x_i, y_i)\}_{i\in[N]}$ is

$$\mathcal{L}_{SCL} = \sum_{i=1}^{N} -\frac{1}{|\tilde{P}_i|} \sum_{z_p \in \tilde{P}_i} \log \frac{\exp(z_i \cdot z_p^\top / \tau)}{\sum_{z_a \in \tilde{V}_i} \exp(z_i \cdot z_a^\top / \tau)}, \tag{1}$$

where $\tau$ is the temperature hyperparameter, $z_i = f_\theta(x_i)$ is the feature generated from $x_i$, $V_i = \{z_j\}_{j\in[N]} \setminus \{z_i\}$ is the current batch of features except for $z_i$, and $P_i = \{z_j \in V_i : y_j = y_i\}$ is the set of instances with the same label as $x_i$. Finally, let $\tilde{z}_i$ be the feature of $\tilde{x}_i$, the augmented version of $x_i$; for any set $S_i$ indexed by $i$, we write $\tilde{S}_i = S_i \cup \{\tilde{z}_i\}$, e.g., $\tilde{V}_i = V_i \cup \{\tilde{z}_i\}$. However, for long-tailed datasets, the feature space is dominated by head classes and thus has limited capability of semantic discrimination (Kang et al., 2020).

To address this, k-positive contrastive learning (KCL) (Kang et al., 2020) attempts to balance the feature space by keeping the number of positive instances in $\tilde{P}_i$ equal for each class, leading to the following loss:

$$\mathcal{L}_{KCL} = \sum_{i=1}^{N} -\frac{1}{k+1} \sum_{z_p \in \tilde{P}^k_i} \log \frac{\exp(z_i \cdot z_p^\top / \tau)}{\sum_{z_a \in \tilde{V}_i} \exp(z_i \cdot z_a^\top / \tau)}, \tag{2}$$

where $P^k_i$ is a subset of $P_i$ with $k$ randomly drawn instances. Finally, the learned feature extractor $f_\theta(\cdot)$ is used in a subsequent stage that trains the classifier for long-tailed recognition (Kang et al., 2020; Li et al., 2022).
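The two objectives in Eq. 1 and Eq. 2 can be sketched in NumPy as follows. This is a minimal, unoptimized illustration rather than the authors' implementation: the function names, the loop over anchors, and the handling of classes with fewer than $k$ positives (all available positives are used) are assumptions made for the example, and features are assumed to be unit-normalized.

```python
import numpy as np


def _candidate_log_probs(z, z_aug, i, tau):
    """Log-softmax over the candidates of anchor i (the set V~_i):
    every batch feature except z[i], followed by the augmented view z_aug[i]."""
    others = np.delete(np.arange(len(z)), i)
    candidates = np.vstack([z[others], z_aug[i:i + 1]])
    logits = candidates @ z[i] / tau
    logits = logits - logits.max()  # subtract the max for numerical stability
    return others, logits - np.log(np.exp(logits).sum())


def scl_loss(z, z_aug, y, tau=0.1):
    """Eq. 1: all same-class features plus the augmented view are positives."""
    total = 0.0
    for i in range(len(z)):
        others, log_prob = _candidate_log_probs(z, z_aug, i, tau)
        positives = np.append(y[others] == y[i], True)  # last entry: augmented view
        total += -log_prob[positives].mean()
    return total


def kcl_loss(z, z_aug, y, k, tau=0.1, rng=None):
    """Eq. 2: at most k randomly drawn same-class positives plus the augmented view."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for i in range(len(z)):
        others, log_prob = _candidate_log_probs(z, z_aug, i, tau)
        same = np.flatnonzero(y[others] == y[i])
        if len(same) > k:  # assumption: keep all positives when fewer than k exist
            same = rng.choice(same, size=k, replace=False)
        chosen = np.append(same, len(others))  # index of the augmented view
        total += -log_prob[chosen].mean()  # equals 1/(k+1) when k positives exist
    return total
```

When $k$ is at least as large as the biggest class in the batch, the sketch of KCL reduces to SCL, which is a convenient sanity check.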

3. METHODOLOGY

As mentioned above, KCL and its successors (Kang et al., 2020; Li et al., 2022) balance the learning objective of SCL by picking the same number of positive instances for each class, i.e., $|P^k_i| = k$ in Eq. 2 no matter which class $x_i$ belongs to. However, we argue that such a class-balancing approach inevitably introduces instance-imbalance: instances of tail classes have a much higher chance of being engaged in training than those of head classes. Specifically, assume each class has no fewer than $k$ instances; then the probability of an instance of class $c$ being selected as a positive instance is $p(c) = k/n_c$. If the tail class $C$ has only $k$ instances ($n_C = k$) and the imbalance ratio is $n_1/n_C = 100$, then $p(1) = k/(100k) = 0.01$ while $p(C) = 1$. In other words, by the time an instance of the head class has been selected once, an instance of the tail class may already have been trained on 100 times. Thus, the training is immensely biased towards the few samples in each tail class. Besides, as tail classes have only very few instances that are not necessarily representative, the learned feature space might be unsatisfactory and sensitive to the training data of tail classes.

Here, we provide a new perspective on handling the class-imbalance issue with contrastive learning: instead of aiming at class-balance at the expense of instance-balance, we propose to achieve both instance- and subclass-balance. We argue that head classes typically contain more diverse instances and thus have richer semantics in the training dataset. Therefore, it might be wise to break down the head classes into multiple semantically coherent subclasses, each of which consists of a similar number of instances as the tail classes. Building on this idea, we develop subclass-balancing contrastive learning (SBCL), a new contrastive learning framework for long-tailed recognition (visualized in Figure 1) that achieves both instance- and subclass-balance.
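The selection-probability argument above can be checked with a few lines of arithmetic. The snippet below is only a worked example; the function name and the concrete value $k = 5$ are illustrative assumptions, while the imbalance ratio of 100 is the one used in the text.

```python
def positive_selection_probability(n_c, k):
    """p(c) = k / n_c: chance that one instance of a class with n_c samples
    is drawn as a positive when k positives are sampled per class."""
    if n_c < k:
        raise ValueError("the argument assumes every class has at least k instances")
    return k / n_c


k = 5                    # illustrative number of positives per class
n_tail = k               # tail class with exactly k instances
n_head = 100 * n_tail    # imbalance ratio n_1 / n_C = 100

p_head = positive_selection_probability(n_head, k)
p_tail = positive_selection_probability(n_tail, k)
exposure_ratio = p_tail / p_head  # how much more often a tail instance is used
```

Under these assumptions a tail instance is drawn about 100 times as often as a head instance, which is the instance-imbalance that class-balanced sampling introduces.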

3.1. SUBCLASS-BALANCING ADAPTIVE CLUSTERING

We break down each head class into several "subclasses" to tackle the imbalance. In particular, given a class $c$ and the associated set of data $D_c$, we employ a clustering algorithm of choice, based on the features extracted by the current feature extractor $f_\theta(\cdot)$, to divide $D_c$ into $m_c$ subclasses/clusters. We use $\Gamma_c(x_i)$ to denote the cluster label of an instance $x_i$ of class $c$. To ensure that the number of samples in each subclass is roughly the same, we propose a new clustering algorithm that divides the unit-length feature vectors, i.e., the features output by $f_\theta(\cdot)$ with additional unit-length normalization. The proposed clustering algorithm is described in Algorithm 1. When assigning vectors to their centers, we set a threshold $M$ as an upper limit on the number of samples in a cluster, which guarantees clusters of balanced sizes. We show the distribution of cluster sizes on the benchmark dataset in Appendix A.1. Specifically, the threshold is

$$M = \max(n_C, \delta), \tag{3}$$

where $n_C$ is the size of the tail class and the hyperparameter $\delta$ lower-bounds the threshold to prevent overly small clusters. Note that we only apply the clustering algorithm to classes that contain multiple instances, while the tail classes remain unchanged. As a consequence, the size of each resulting cluster is similar to that of the tail class, $n_C$. In addition, instead of clustering only once at the beginning, we update the cluster assignment adaptively during training based on the current feature extractor $f_\theta(\cdot)$, and we empirically show in Section 4.5 that such adaptive clustering outperforms one-off clustering. Then, by replacing the class labels of head classes used in SCL/KCL with the finer-grained cluster labels, we ensure instance-balance, i.e., each instance has a similar chance of being selected regardless of its class.

By breaking down the head classes, which typically contain more diverse instances, into multiple semantically coherent subclasses, we achieve subclass-balance (instead of class-balance) while maintaining the rich semantics of the head classes in the training dataset.
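To make the size-capped assignment concrete, the sketch below clusters the unit-normalized features of one class while never letting a cluster exceed the threshold $M$ of Eq. 3. Algorithm 1 is not reproduced in this section, so everything beyond the cap itself is an assumption made for illustration: the number of clusters $\lceil n_c / M \rceil$, the greedy "most confident sample first" assignment order, the cosine-similarity (spherical k-means) updates, and all function names.

```python
import numpy as np


def cluster_size_cap(n_tail, delta):
    """Threshold M = max(n_C, delta) from Eq. 3."""
    return max(n_tail, delta)


def balanced_spherical_kmeans(feats, cap, n_iter=10, seed=0):
    """Cluster one class's features so that no cluster holds more than `cap` samples.

    Hypothetical stand-in for Algorithm 1: greedy capacity-constrained assignment
    followed by re-normalized mean updates of the cluster centers.
    """
    rng = np.random.default_rng(seed)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit length
    n = len(feats)
    m = int(np.ceil(n / cap))  # assumed number of clusters for this class
    centers = feats[rng.choice(n, size=m, replace=False)]
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        sim = feats @ centers.T  # cosine similarity, shape (n, m)
        counts = np.zeros(m, dtype=int)
        # Visit samples from the most to the least confidently matched.
        for i in np.argsort(-sim.max(axis=1)):
            # Take the most similar center that still has room.
            for c in np.argsort(-sim[i]):
                if counts[c] < cap:
                    labels[i] = c
                    counts[c] += 1
                    break
        # Move each center to the normalized mean of its members.
        for c in range(m):
            members = feats[labels == c]
            if len(members) > 0:
                mean = members.mean(axis=0)
                norm = np.linalg.norm(mean)
                if norm > 0:
                    centers[c] = mean / norm
    return labels
```

Because $m \cdot M \ge n_c$, every sample always finds a cluster with free capacity, so the cap can be enforced without dropping any instance.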

3.2. BI-GRANULARITY CONTRASTIVE LOSS

We now have two types of labels for instances in head classes at different granularities: the coarse-grained class label and the fine-grained cluster label. A direct consequence of replacing the class label in SCL/KCL with the cluster label is that we no longer distinguish instances from different head classes, and therefore the boundaries between classes might be blurry, leading to a sub-optimal feature space. As a remedy, we combine the contrastive losses of both the class label and the cluster label into the following one, reusing the notation of Eq. 1:

$$\mathcal{L}_{SBCL} = -\sum_{i=1}^{N} \left( \frac{1}{|\tilde{M}_i|} \sum_{z_p \in \tilde{M}_i} \log \frac{\exp(z_i \cdot z_p^\top / \tau_1)}{\sum_{z_a \in \tilde{V}_i} \exp(z_i \cdot z_a^\top / \tau_1)} + \beta \frac{1}{|\tilde{P}_i| - |M_i|} \sum_{z_p \in \tilde{P}_i \setminus M_i} \log \frac{\exp(z_i \cdot z_p^\top / \tau_2)}{\sum_{z_a \in \tilde{V}_i \setminus M_i} \exp(z_i \cdot z_a^\top / \tau_2)} \right), \tag{4}$$

where $M_i = \{z_j \in P_i : \Gamma_{y_i}(x_i) = \Gamma_{y_i}(x_j)\}$ is the set of instances with the same cluster label as $x_i$, and $\beta$ is a hyperparameter that balances the two loss terms. The first term corresponds to the SCL loss with cluster labels, while the second term leverages the class label but does not consider the instances of the same cluster, i.e., the instances in $M_i$ are removed in the second term. This design choice reflects the two types of positive instances for $z_i$: (1) instances in the same cluster and (2) instances of the same class but in different clusters.

According to previous studies (Wang & Liu, 2021; Hoffmann et al., 2022; Li et al., 2021a), the temperature $\tau$ in the contrastive loss is critical in controlling the local separation and global uniformity of the feature distribution. Specifically, a low temperature forces the features to concentrate, while as the temperature increases, the features distribute more uniformly. Although the above objective explicitly considers the two types of labels at different granularities, it still treats class and cluster labels similarly. Intuitively, we expect instances of the same subclass to form a more concentrated cluster in feature space than those of the same class, since a subclass naturally indicates finer-grained semantic coherence. To achieve this, we ensure that $\tau_2 > \tau_1$ and dynamically adjust $\tau_2$ for each class according to the current level of concentration of its instances' features. Following Li et al. (2021a), for class $c$ we define $\phi(c)$ as

$$\phi(c) = \frac{\sum_{i=1}^{n_c} \| z_i - t_c \|_2}{n_c \log(n_c + \alpha)}, \tag{5}$$

where $t_c$ is the centroid of class $c$, $\alpha$ is a hyperparameter that ensures $\phi(c)$ is not overly large, and $z_i$ ranges over the instances of class $c$. From this formulation, we can see that if the current average distance to the class centroid is large, or the class contains fewer data, the temperature is set larger to adapt to the feature distribution of class $c$ during training. We then define the temperature of class $c$ as

$$\tau_2(c) = \tau_1 \cdot \exp\left( \frac{\phi(c)}{\frac{1}{C} \sum_{i=1}^{C} \phi(i)} \right), \tag{6}$$

such that $\tau_2(c)$ for the class label is always larger than $\tau_1$ for the cluster label (since $\phi(c) > 0$) and reflects the current level of concentration of the instances in a class. In particular, the proposed $\tau_2(c)$ encourages the features of instances in class $c$ to form a less tight cluster than that of a subclass (by $\tau_2(c) > \tau_1$), while adaptively adjusting the temperature to prevent an overly loose or overly dense cluster.
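The bi-granularity loss (Eq. 4) and the dynamic temperature (Eq. 5 and Eq. 6) can be sketched together in NumPy. This is an illustrative reading of the equations, not the authors' code: the function names, the default $\alpha = 10$, the use of the plain (unnormalized) mean as the centroid $t_c$, and averaging $\phi$ over the classes present in the given features are all assumptions.

```python
import numpy as np


def class_concentration(z, y, alpha=10.0):
    """phi(c) of Eq. 5 for every class present in y."""
    phi = {}
    for c in np.unique(y):
        zc = z[y == c]
        centroid = zc.mean(axis=0)  # assumption: t_c is the plain mean feature
        dist_sum = np.linalg.norm(zc - centroid, axis=1).sum()
        phi[int(c)] = dist_sum / (len(zc) * np.log(len(zc) + alpha))
    return phi


def class_temperatures(phi, tau1):
    """tau_2(c) of Eq. 6; always larger than tau1 because phi(c) > 0."""
    mean_phi = np.mean(list(phi.values()))
    return {c: tau1 * np.exp(p / mean_phi) for c, p in phi.items()}


def _log_softmax(logits):
    logits = logits - logits.max()  # numerical stability
    return logits - np.log(np.exp(logits).sum())


def sbcl_loss(z, z_aug, y, cluster, tau1, tau2, beta=1.0):
    """Eq. 4 with a per-class temperature dictionary tau2."""
    n = len(z)
    total = 0.0
    for i in range(n):
        others = np.delete(np.arange(n), i)
        cand = np.vstack([z[others], z_aug[i:i + 1]])  # V~_i, augmented view last
        sims = cand @ z[i]
        same_class = np.append(y[others] == y[i], True)  # P~_i
        in_m = np.append((y[others] == y[i]) & (cluster[others] == cluster[i]),
                         False)                          # M_i (no augmented view)
        fine_pos = in_m.copy()
        fine_pos[-1] = True                              # M~_i
        fine = -_log_softmax(sims / tau1)[fine_pos].mean()
        keep = ~in_m                                     # V~_i \ M_i
        coarse_pos = same_class[keep]                    # P~_i \ M_i
        coarse_lp = _log_softmax(sims[keep] / tau2[int(y[i])])
        coarse = -coarse_lp[coarse_pos].mean()
        total += fine + beta * coarse
    return total
```

A typical use would first compute `class_temperatures(class_concentration(z, y), tau1)` on features of the whole training set and then pass the resulting dictionary to `sbcl_loss` for each batch; the augmented view guarantees that both positive sets are non-empty.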

3.3. TRAINING ALGORITHM

Here, we describe the overall training process of subclass-balancing contrastive learning; the procedure is summarized in Algorithm 2. First, the adaptive clustering (Section 3.1) could be noisy at the early stage of training (Li et al., 2021a; Wang et al., 2021b). Thus, we warm up the feature extractor $f_\theta(\cdot)$ with a few epochs of training on the ordinary SCL or KCL loss. In addition, our algorithm involves two adaptively adjusted parts, namely, the cluster assignment and the temperature $\tau_2(c)$ for each head class $c$. Instead of updating these every epoch, we use a hyperparameter $K$ as the update interval, i.e., we update the cluster assignment and the temperature based on the currently learned $f_\theta(\cdot)$ every $K$ epochs.

Algorithm 2 Training Algorithm

Warm up $f_\theta(\cdot)$ for a few epochs with the SCL or KCL loss
for each subsequent epoch $t$ do
    if $t \bmod K = 0$ then ▷ Update cluster assignment and temperature
        Update the cluster assignment based on the current feature extractor $f_\theta(\cdot)$
        Update the temperature $\tau_2$ for each head class using Eq. 5 and Eq. 6
    end if
    Train $f_\theta(\cdot)$ using Eq. 4 ▷ Subclass-balancing contrastive learning
end for
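The control flow of the training procedure (warm-up first, then a refresh of clusters and temperatures every $K$ epochs) can be sketched as a plain schedule. The function below only plans which action happens at which epoch and trains nothing; its name, its arguments, and the choice to count $t$ from the end of the warm-up (so that the first SBCL epoch always starts with a fresh clustering) are assumptions for illustration.

```python
def training_schedule(num_epochs, warmup_epochs, update_interval):
    """Return a list of (epoch, phase, refresh) tuples.

    phase   : "warmup" (SCL/KCL loss) or "sbcl" (loss of Eq. 4)
    refresh : True when cluster assignments and tau_2 are recomputed
              before training in that epoch
    """
    plan = []
    for epoch in range(num_epochs):
        if epoch < warmup_epochs:
            plan.append((epoch, "warmup", False))
            continue
        t = epoch - warmup_epochs  # assumption: t counts epochs after warm-up
        refresh = (t % update_interval == 0)
        plan.append((epoch, "sbcl", refresh))
    return plan


# Example: 10 epochs, 2 warm-up epochs, refresh every K = 3 epochs.
plan = training_schedule(num_epochs=10, warmup_epochs=2, update_interval=3)
refresh_epochs = [epoch for epoch, _, refresh in plan if refresh]
```

In a full implementation, each `refresh` step would recluster the head classes and recompute $\tau_2(c)$, and each "sbcl" epoch would then minimize Eq. 4 with the current assignments.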

4. EXPERIMENT

4.1. EXPERIMENTAL SETUP

Datasets. We consider three commonly used long-tailed recognition benchmark datasets: CIFAR-100-LT (Cao et al., 2019), ImageNet-LT (Liu et al., 2019), and iNaturalist 2018 (Van Horn et al., 2018). The CIFAR-100-LT and ImageNet-LT datasets are long-tailed datasets artificially generated from class-balanced datasets (Krizhevsky et al., 2009; Russakovsky et al., 2015), and the iNaturalist 2018 dataset is a large-scale real-world dataset that exhibits long-tailed imbalance.

Baselines. We consider baseline methods of the following three categories: (1) class-balancing classifiers, including τ-norm, LWS, and cRT (Kang et al., 2019), which fix the representation trained with the cross-entropy loss and train the classifier with class-balanced sampling; (2) one-stage balancing losses, including CB loss (Cui et al., 2019), Focal loss (Lin et al., 2017), and LDAM loss (Cao et al., 2019), which are supervised distribution-aware losses that make the model pay more attention to the minority classes during training; (3) contrastive learning methods, including SCL (Khosla et al., 2020), KCL (Kang et al., 2020), SwAV (Caron et al., 2020), PCL (Li et al., 2021a), and TSC (Li et al., 2022), which train a feature extractor with a contrastive loss and then learn a classifier on top of the trained feature extractor.

Evaluation protocol. Following Kang et al. (2020; 2019) and Li et al. (2022), we implement SBCL, as well as the other contrastive learning methods, in a two-stage framework. In the first stage, we train the feature extractor with a contrastive learning method, while in the second stage, we train a linear classifier on top of the learned representation. Specifically, for the CIFAR-100-LT dataset, the linear classifier is trained with LDAM loss and class re-weighting (Cao et al., 2019). For the ImageNet-LT and iNaturalist 2018 datasets, the linear classifier is trained with CE loss and class-balanced sampling (Kang et al., 2019). All results are averaged over 5 trials with different random seeds. We mainly report the overall top-1 accuracy. For the two large datasets, ImageNet-LT and iNaturalist 2018, following previous work (Liu et al., 2019), we also report the accuracy on three disjoint subsets: Many-shot classes (classes with more than 100 samples), Medium-shot classes (classes with 20 to 100 samples), and Few-shot classes (classes with fewer than 20 samples). We leave the implementation details and additional experiments to the appendix.
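The Many/Medium/Few-shot reporting can be illustrated with a short helper. The thresholds follow the text; the function names and the choice to compute each group's accuracy as the fraction of correctly classified test samples in that group (rather than, say, a mean of per-class accuracies) are assumptions of this sketch.

```python
import numpy as np


def shot_group(n_train):
    """Map a class's training-set size to its evaluation group."""
    if n_train > 100:
        return "many"
    if n_train >= 20:
        return "medium"
    return "few"


def accuracy_by_group(y_true, y_pred, train_counts):
    """Top-1 accuracy overall and per shot group.

    train_counts maps each class id to its number of training samples.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.array([shot_group(train_counts[int(c)]) for c in y_true])
    correct = y_true == y_pred
    result = {"all": float(correct.mean())}
    for g in ("many", "medium", "few"):
        mask = groups == g
        if mask.any():  # skip groups without test samples
            result[g] = float(correct[mask].mean())
    return result
```

For instance, with `train_counts = {0: 500, 1: 50, 2: 5}`, class 0 is many-shot, class 1 medium-shot, and class 2 few-shot.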

4.2. MAIN RESULTS

Table 1: Performance comparison on the ImageNet-LT and iNaturalist 2018 datasets. Top-1 accuracy of ResNet-50 (He et al., 2016) is reported. "Many", "Medium", "Few", and "All" denote the different class groups.

The results on ImageNet-LT and iNaturalist 2018 are in Table 1. We can see that SBCL outperforms the baselines by a large margin on both datasets. In addition, on the iNaturalist 2018 dataset, SBCL outperforms the previous SOTA method by 0.7% on Many, 1.3% on Medium, 0.8% on Few, and 1.1% on All, which shows the effectiveness of the proposed method in solving real-world long-tailed recognition problems such as natural species classification. Besides, SBCL is also better than existing contrastive learning methods such as KCL and TSC on all class splits, which demonstrates the effectiveness of the design principle of pursuing both instance- and subclass-balance in contrastive learning.

Table 2 summarizes the results on the CIFAR-100-LT dataset. On CIFAR-100-LT, SBCL outperforms previous SOTA methods except at imbalance ratio 10. We hypothesize that this is because the tail classes of CIFAR-100-LT with imbalance ratio 10 still contain many samples, which makes it hard to distinguish the performance of methods on long-tailed recognition.

4.3. PERFORMANCE ON OTHER VISUAL TASKS

There is a recent trend of using contrastive learning to pretrain a feature extractor for downstream visual tasks other than image classification (He et al., 2020). We are curious about two questions: (1) when the pretraining dataset is class-imbalanced, how is the downstream performance affected? (2) In such a case, can SBCL improve the learned feature extractor over existing contrastive learning baselines? To answer these questions, we use the object detection task of the PASCAL VOC dataset as the evaluation suite and use the ImageNet/ImageNet-LT datasets as class-balanced/-imbalanced pretraining datasets. Following Kang et al. (2020) and He et al. (2020), we first pretrain a feature extractor on ImageNet/ImageNet-LT and then fine-tune it for the downstream object detection task using Faster R-CNN (Ren et al., 2015) with an R50-C4 backbone.

The experiment results are shown in Table 3. From the results, we can see that pretraining on class-balanced data (ImageNet) leads to consistently better results than pretraining on the class-imbalanced dataset (ImageNet-LT), and that pretraining the model with SBCL on either dataset performs slightly better than the other baselines. In addition, the proposed SBCL significantly outperforms the baselines on the class-imbalanced pretraining dataset, while achieving comparable performance on the class-balanced one: for the representation trained on the full ImageNet dataset, the performance advantage is not obvious. In Appendix A.1, we show additional experimental results for object detection and instance segmentation on the COCO (Lin et al., 2014) dataset, where SBCL also outperforms the other baseline methods. Thus, we conclude that the proposed SBCL is helpful not only for image classification but also for other visual tasks.

4.4. ANALYSIS OF FEATURE DISTRIBUTION

To analyze the representation learned by SBCL, we first define the Euclidean distance between a given sample and other samples from the same/different classes as the intra/inter-class distance. Concretely, the Euclidean distance between a sample $z_i$ and a set $S$ is defined as

$$D(z_i, S) = \frac{1}{|S|} \sum_{z_j \in S} \| z_i - z_j \|_2 .$$

Then, the intra- and inter-class distances of sample $z_i$ are defined as $D(z_i, P_i)$ and $D(z_i, \mathcal{D} \setminus P_i)$, respectively, and the intra- and inter-subclass distances of sample $z_i$ are defined as $D(z_i, M_i)$ and $D(z_i, P_i \setminus M_i)$, respectively.

Intra-class/Inter-class distance. SBCL aims at learning a compact representation space in which representations from different classes are far from each other and the feature space spanned by the representations of each class is invariant to the long-tailed distribution. Using the intra/inter-class distance in the feature space, we compare SBCL with previous methods (KCL, TSC) on CIFAR-100-LT with imbalance ratio 100. The average distances are summarized in Table 5, with the distances of the different groups reported separately. The results show several good properties of SBCL over previous methods: (i) the inter-class distance of SBCL is larger than that of previous methods, which implies that SBCL can push different classes far away from each other and thus help the downstream tasks; (ii) the intra-class distance of SBCL is more uniform across the disjoint subsets than that of previous methods, which indicates that head and tail classes occupy similar volumes of the learned space and thus helps balance different classes.

Figure 2: Feature distance of subclasses and classes on CIFAR-100-LT with imbalance ratio 100. We randomly sample instances from many-shot and medium-shot classes so that the size of each equals that of the few-shot classes.

Feature distribution of SBCL. As shown in Figure 2a, the distance between samples from the same subclass is smaller than that between samples from the same class but different subclasses. Meanwhile, in Figure 2b, the inter-class distance is higher than the intra-class distance and remains stable, which indicates that features from different classes are uniformly distributed on a hypersphere. Overall, the results indicate that the two-layer class hierarchy is successfully captured and that the feature distribution reflects the core idea of SBCL.
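The distance statistics used in this analysis are simple to compute. The sketch below implements $D(z_i, S)$ and averages the intra- and inter-class distances over all samples; the function names and the decision to skip samples whose comparison set is empty are assumptions made for the example.

```python
import numpy as np


def mean_distance(z_i, others):
    """D(z_i, S): average Euclidean distance from z_i to the rows of `others`."""
    return float(np.linalg.norm(others - z_i, axis=1).mean())


def intra_inter_class_distance(z, y):
    """Average intra-class and inter-class distance over all samples."""
    intra, inter = [], []
    idx = np.arange(len(z))
    for i in idx:
        same = (y == y[i]) & (idx != i)  # P_i: same class, excluding z_i itself
        diff = y != y[i]                 # features of all other classes
        if same.any():
            intra.append(mean_distance(z[i], z[same]))
        if diff.any():
            inter.append(mean_distance(z[i], z[diff]))
    return float(np.mean(intra)), float(np.mean(inter))
```

The intra- and inter-subclass distances follow the same pattern, with the cluster label replacing the class label inside each class.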

4.5. ABLATION STUDIES

Warm-up. As mentioned in Section 3.3, we train the feature extractor for several epochs using ordinary SCL or KCL as a warm-up stage. As shown in Table 4, such a warm-up stage is beneficial, since the performance drops when we remove it. This is likely because, at the early stage of training, the feature extractor is not yet well trained and the cluster assignment could be noisy and ineffective, hindering the efficacy of SBCL.

Adaptive clustering. We are also curious about the efficacy of adaptive clustering and thus present the performance of SBCL when clustering only once and keeping the cluster assignments fixed during training. As shown in Table 4, without adaptive clustering, the performance decreases in all cases. A likely reason is that a fixed cluster assignment is prone to noise when the model is not well trained, while adaptive clustering dynamically adjusts the cluster assignments based on the currently learned model, which is expected to improve as training proceeds.

Dynamic temperature. In Table 4, we also study the effectiveness of the dynamic temperature (Section 3.2). We remove the dynamic temperature and simply set $\tau_2 > \tau_1$ following Hoffmann et al. (2022). With a fixed temperature $\tau_2$, the performance of SBCL is significantly worse than with the dynamic temperature. We speculate that this is because the dynamic temperature helps prevent the instances of a class from forming an overly large or small cluster in the feature space and therefore leads to better learned representations. Additionally, to evaluate the impact of the dynamic temperature on other baselines, we apply it to TSC, as reported in Appendix A.1.

5. RELATED WORK

Traditional methods for handling the long-tailed recognition problem include re-sampling and re-weighting. There are roughly two types of re-sampling techniques: over-sampling the minority classes (Shen et al., 2016; Zhong et al., 2016; Buda et al., 2018; Byrd & Lipton, 2019) and under-sampling the frequent classes (He & Garcia, 2009; Japkowicz & Stephen, 2002; Buda et al., 2018). Re-weighting techniques assign adaptive weights to different classes or even different samples. The vanilla scheme re-weights classes proportionally to the inverse of their frequency (Huang et al., 2016; 2019; Wang et al., 2017). Among class-level re-weighting methods, many loss functions, including CB loss (Cui et al., 2019), LDAM loss (Cao et al., 2019), and Balanced softmax loss (Ren et al., 2020), were recently proposed, while instance-level re-weighting methods include Focal loss (Lin et al., 2017) and Influence-balanced loss (Park et al., 2021).

Recently, two-stage algorithms have achieved remarkable performance for long-tailed recognition, such as classifier re-training (cRT) (Kang et al., 2019), learnable weight scaling (LWS) (Kang et al., 2019), and the Mixup Shifted Label-Aware Smoothing model (MiSLAS) (Zhong et al., 2021). Meanwhile, the bilateral branch network (BBN) (Zhou et al., 2020) uses an additional network branch for re-balancing. RIDE (Wang et al., 2021a) uses multiple branches named experts, each learning to specialize over the entire set of classes. LADE (Hong et al., 2021) assumes that the prior of the test class distribution is available and post-adjusts model predictions accordingly. PaCo (Cui et al., 2021) applies parametric class-wise learnable centers to rebalance contrastive learning. BCL (Zhu et al., 2022) proposes a multi-branch framework to achieve class-averaging and class-complement in the training process.

To boost the performance of two-stage algorithms, researchers have introduced supervised contrastive learning (Khosla et al., 2020) into the first, feature-learning stage and proposed the k-positive contrastive loss (KCL) (Kang et al., 2020) and targeted supervised contrastive learning (TSC) (Li et al., 2022). While achieving state-of-the-art performance, these methods inject class-balance into the contrastive learning objective, inevitably leading to instance-imbalance during training. In this work, we instead propose to achieve both subclass- and instance-balance in the contrastive learning objective. Our method is also related to recent studies of clustering-based deep unsupervised learning (Dosovitskiy et al., 2014; Xie et al., 2016; Liao et al., 2016; Yang et al., 2016; Caron et al., 2018; 2020), especially those that leverage contrastive learning (Li et al., 2021b; a; Wang et al., 2021b; Guo et al., 2022). However, they target the general unsupervised representation learning scenario, while our method is tailored for long-tailed recognition, where the training data is immensely class-imbalanced.

6. CONCLUSION

In this paper, we introduced Subclass-balancing Contrastive Learning (SBCL) for long-tailed recognition. It breaks down the head classes into multiple semantically coherent subclasses via subclass-balancing adaptive clustering and incorporates a bi-granularity contrastive loss that encourages both subclass- and instance-balance. Extensive experiments on multiple datasets demonstrate that SBCL achieves state-of-the-art single-model performance on benchmark datasets for long-tailed recognition.

A. APPENDIX

A.1. ADDITIONAL EXPERIMENT RESULTS

Accuracy of each class on CIFAR-100-LT. We visualize the accuracy of each class of both SCL and SBCL on CIFAR-100-LT with imbalance ratio 100 (Figure 3 ). From the results, we can see that SBCL improves performance on tail classes over SCL without the expense of the perforamnce of the head classes. Selection of negative instances in SBCL. Our proposed loss in Eq. 4 consists of two supervised contrastive losses with subclass and class labels respectively. The first term regards instances in different subclasses as negative instead of these in different classes; In Table 6 we show that such design choice leads to better performance than using instances in different classes as negative, which illustrates the effectiveness of exploiting the rich semantic in head classses. Intra-subclass/Inter-subclass distance. To leverage instance semantic coherence to balance the feature space, we expect instances of high semantic coherence to form a more concentrated cluster than other instances in the same class. So, we embed the subclass-balancing adaptive clustering strategy on previous methods to illustrate this on CIFAR-100-LT with imbalance ratio 100. In Table 7 , we report the intra-subclass/inter-subclass distance on different splits. Compare with KCL and TSC, the results show that SBCL achieves to concentrate instances from the same subclass and pulls instances from different subclasses away on all splits. We also note that the inter-subclass distance of SBCL is invariant with the decreasing of group split. This means the head class could be split into many subclasses separately, which constructs a balance feature space for the long-tailed recognition. Hyperparameter analysis on CIFAR-100-LT. Figure 5a and Table 8 show the distribute situation of sample number in subclasses obtained by different cluster algorithms on CIFAR-100-LT with imbalance ratio 100. For Kmean cluster algorithm, the imbalance phenomenon of subclasses is obvious. 
With our proposed clustering algorithm, the imbalance ratio of the number of samples per subclass decreases from 40 to 9.5. The standard deviation of the number of samples per subclass on CIFAR-100-LT is also relatively small, indicating that the size of most subclasses stays within a narrow range. Figure 5b shows the impact of batch size on SBCL/TSC. Larger batch sizes have a significant advantage over smaller ones because they provide more negative examples, which facilitates convergence. However, an overly large batch size hurts model performance. SBCL and TSC are equally sensitive to batch size on CIFAR-100-LT. Figure 5c shows the accuracy of SBCL/TSC versus the number of training epochs. The performance of both SBCL and TSC converges after 800 epochs. Once SBCL has been trained for over 600 epochs, its performance already exceeds that of TSC. Figure 5d shows the performance of SBCL with different learning rates on CIFAR-100-LT with imbalance ratio 100. The learning rate has a significant impact on performance, and we set it to 0.5 for CIFAR-100-LT.

Combining TSC with dynamic temperature. According to Table 4, dynamic temperature contributes effectively to the improvement in accuracy. We also add the dynamic temperature to the second term of TSC (Li et al., 2022); the results are shown in Table 9. However, the improvement from dynamic temperature on TSC is less significant than on our method. This is reasonable: we introduce dynamic temperature so that the loss can distinguish between class and subclass, whereas TSC has no subclasses, so dynamic temperature is less effective there.

43.9(+0.4) 48.0(+0.4) 59.2(+0.5) 44.9(+1.2) 48.7(+0.9) 57.9(+0.9)

Warm-up on ImageNet-LT.
Instead of using SCL at the warm-up stage as for the CIFAR datasets, we adopt KCL to warm up the feature extractor for the ImageNet-LT and iNaturalist 2018 datasets. As Table 10 shows, the warm-up phase improves the accuracy of the feature extractor on all splits of ImageNet-LT. This is because warm-up prevents the cluster assignment from being computed on randomly distributed features at the beginning, and using KCL rather than SCL avoids a feature space dominated by the head classes at the warm-up stage.

Advantages of cluster validity. Previous studies (Kang et al., 2020; Li et al., 2022) have shown that randomly sampling a balanced set of instances as positive pairs (as in KCL and TSC) is better than using all instances of the same class as positive pairs (as in SCL). However, this strategy may destroy instance semantic coherence. To demonstrate this on ImageNet-LT, in Table 11 we replace the first term (which treats instances of the same subclass as positive pairs) with the balanced positive sampling strategy (KCL). The results show that the subclass-balancing adaptive clustering strategy brings more improvement to SBCL than the balanced positive sampling strategy.

COCO object detection and instance segmentation. Following the experimental setting in (He et al., 2020), we use Mask R-CNN (He et al., 2017) to conduct object detection and instance segmentation experiments on the COCO dataset. The schedule is the default 2× schedule in (He et al., 2020). Table 12 shows that the model trained with SBCL outperforms models trained with other contrastive learning methods on the downstream tasks.

Combining SBCL with ensemble-based methods. Another line of research on the long-tailed problem is ensemble-based methods, such as RIDE (Wang et al., 2021a), which incorporates multiple models in a multi-expert framework. Here we show that SBCL can also be leveraged to boost the performance of RIDE (Wang et al., 2021a), a state-of-the-art ensemble-based method.
To implement SBCL with RIDE, we follow (Li et al., 2022) and simply replace the feature extractor in the stage-1 training of RIDE with one trained with SBCL, keeping the stage-2 routing training unchanged. As shown in Table 13, applying SBCL to RIDE improves its performance by a significant margin and outperforms the combination of TSC and RIDE for every number of experts.

Combining SBCL with PaCo (Cui et al., 2021) and BCL (Zhu et al., 2022). PaCo (Cui et al., 2021) and BCL (Zhu et al., 2022) propose new variants of the supervised contrastive loss and jointly train with both the proposed loss and a classification loss to improve long-tailed recognition, whereas we focus on the two-stage pipeline, especially the first stage of representation learning. In this experiment,

0.2, 0.5, 0.8, 1.0, 2.0} with δ = 10 and the value of δ from {5, 10, 20, 30, 50, 100} with β = 0.2. The results are summarized in Table 18. We observe that smaller β values (between 0.1 and 0.5) achieve relatively good performance, with the best being 0.2. This observation aligns with our intuition of emphasizing the subclass-level contrastive loss, because a smaller β puts more weight on the first term of Eq. 4, which corresponds to the subclass-level contrastive loss. For δ, values between 5 and 30 yield high accuracy, with the best being 10. We can see that large δ values (δ = 50, 100) lead to a significant drop in performance. We argue that this is because a large δ results in subclasses that contain more instances than the tail classes, which harms the subclass balance and leads to suboptimal performance. A smaller δ (δ = 5) also causes a performance drop; a possible reason is that a small cluster size may cause similar instances to be assigned to different clusters, which affects the learned representations. Therefore, we fix β = 0.2 and δ = 10 for all experiments.

Table 18: Hyperparameter study of β and δ on CIFAR-100-LT with imbalance ratio 100.
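
The role of δ discussed above can be illustrated with a small sketch. Eq. 3 is not reproduced in this section, so the rule below is an assumption made for illustration only: the target subclass size is taken to be the larger of the smallest class size and δ, and each class is divided into as many subclasses as fit that size. The function name `subclasses_per_class` and the toy class counts are hypothetical.

```python
def subclasses_per_class(class_counts, delta=10):
    """Assumed rule (not the paper's Eq. 3): subclasses per class for a given delta."""
    # Target subclass size: at least delta, and at least the smallest class size.
    target_size = max(min(class_counts.values()), delta)
    # Every class keeps at least one subclass, so tail classes are never split.
    return {k: max(n_k // target_size, 1) for k, n_k in class_counts.items()}


# Hypothetical long-tailed class counts.
counts = {"head": 500, "medium": 60, "tail": 5}
small_delta = subclasses_per_class(counts, delta=10)
large_delta = subclasses_per_class(counts, delta=100)
```

Under this assumed rule, raising δ from 10 to 100 leaves the head class with a few subclasses of about 100 samples each, far larger than the 5-sample tail class, which is the loss of subclass balance that the text above attributes to large δ.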
Baseline information. We report the accuracy of KCL and TSC on different benchmark datasets from (Li et al., 2022). For SwAV (Caron et al., 2020), PCL (Li et al., 2021a), and BYOL (Grill et al., 2020), we use their official open-source implementations.



SwAV official implementation: https://github.com/facebookresearch/swav
PCL official implementation: https://github.com/salesforce/PCL
BYOL official implementation: https://github.com/deepmind/deepmind-research/tree/master/byol



Long-tailed recognition aims to learn a classifier from a training dataset with a long-tailed class distribution, i.e., a few classes contain many samples (head classes) while most classes contain only a few (tail classes); the major challenge is to make the model recognize all classes equally well. Let $D = \{(x_i, y_i)\}_{i \in [n]}$ be a long-tailed training dataset, where $x_i$ denotes a sample and $y_i \in [C]$ denotes its label. Denote by $D_k \subseteq D$ the set of instances belonging to class $k$ and by $n_k = |D_k|$ the number of samples in class $k$. The total number of training samples over the $C$ classes is $n = \sum_{k=1}^{C} n_k$.
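
As a concrete reading of this notation, the following minimal sketch computes the per-class counts n_k and the total n from a list of labels. It also reports an imbalance ratio, taken here to be the largest class size divided by the smallest; that definition is the usual convention and is an assumption, since the section does not state it. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def class_statistics(labels):
    """Return per-class counts n_k, the total n, and the imbalance ratio."""
    counts = Counter(labels)        # n_k = |D_k| for each class k
    n = sum(counts.values())        # n = sum over k of n_k
    # Assumed definition: largest class size over smallest class size.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return counts, n, imbalance_ratio


# Toy labels: class 0 is a head class, class 2 is a tail class.
labels = [0] * 100 + [1] * 10 + [2] * 1
counts, n, ratio = class_statistics(labels)
```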

Figure 1: Illustration of subclass-balancing contrastive learning (SBCL).

Require: Dataset $D = \{(x_i, y_i)\}_{i \in [n]}$; the update interval of cluster assignment $K$; the number of warm-up epochs $T_0$; the total number of epochs $T$; the hyperparameters β and δ.
Ensure: A trained feature extractor $f_\theta(\cdot)$
1: Initialize the model parameters θ
2: Train $f_\theta(\cdot)$ with SCL/KCL for $T_0$ epochs ▷ Warm-up stage
3: for $t = T_0$ to $T$ do
4:
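
The loop body of the algorithm is cut off above, so the following is only a sketch of the epoch schedule implied by its inputs, not the authors' procedure. It assumes that the cluster assignment is refreshed every K epochs, starting at the first epoch after warm-up; the function name and the stage labels are hypothetical.

```python
def training_schedule(warmup_epochs, total_epochs, update_interval):
    """Yield (epoch, stage, recluster) under the assumed update rule."""
    for epoch in range(total_epochs):
        if epoch < warmup_epochs:
            # Warm-up stage: train with SCL/KCL, no subclass labels yet.
            yield epoch, "warm-up", False
        else:
            # Assumption: refresh cluster assignments every `update_interval` epochs.
            recluster = (epoch - warmup_epochs) % update_interval == 0
            yield epoch, "sbcl", recluster


plan = list(training_schedule(warmup_epochs=2, total_epochs=8, update_interval=3))
recluster_epochs = [epoch for epoch, _, recluster in plan if recluster]
```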

(a) Feature distance in subclasses. (b) Feature distance in classes.

Figure 3: Accuracy of each class on CIFAR-100-LT. The black line shows the class distribution; classes on the left are head classes and those on the right are tail classes.

Figure 4: Change in weight norm of the linear classifier based on the representations trained by SBCL on CIFAR-100-LT with imbalance ratio 100 during training.

Figure 5: Analysis of SBCL as a function of different hyperparameters on CIFAR-100-LT with imbalance ratio 100. (a): Number of samples per cluster with different clustering algorithms. (b): Top-1 accuracy of SBCL/TSC as a function of batch size. (c): Top-1 accuracy of SBCL/TSC as a function of the number of pretraining epochs. (d): Top-1 accuracy of SBCL/TSC as a function of learning rate.

44.3 44.9 44.6 44.3 42.9 42.3

Visualization of generated clusters. In Figure 6, we show the clustering results on ImageNet-LT training images generated by the subclass-balancing adaptive clustering algorithm. The results show that the algorithm is able to find subclasses with similar patterns, which helps the model learn semantically coherent representations. For example, the two subclasses in the bottom-left are telephones with and without a human.

Figure 6: Visualization of subclasses generated by SBCL. Images with green and orange borders are randomly drawn from different subclasses within the same class. We can see that SBCL produces semantically coherent subclasses.

Performance comparison on CIFAR-100-LT. Top-1 accuracy of ResNet-32 (He et al., 2016) under different imbalance ratios is reported. We also report the accuracy of our re-implementations of important baselines (†) in the same setting on CIFAR-100-LT. The "Statistic (IR 100)" columns report results on different disjoint subsets of CIFAR-100-LT with imbalance ratio 100.

Object detection results on PASCAL VOC.

Ablation study on different components of SBCL.

Average intra-class/inter-class distance of features learned with different contrastive learning methods.

Different selections of negative instances in SBCL on CIFAR-100-LT with different imbalance ratios. In the 'Class label' row, the first term of the loss function is constructed from class labels; in the 'Subclass label' row, it is constructed from subclass labels.
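
To make the two label choices concrete, the sketch below implements a plain supervised contrastive loss and combines a subclass-level term with a class-level term weighted by β. This is a simplification under stated assumptions: Eq. 4 is not reproduced in this section, the dynamic temperature is replaced by one fixed temperature, and the function names and the value 0.1 are hypothetical. Pure Python is used for portability; a tensor library would be the practical choice.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors that have positives."""
    n = len(embeddings)
    losses = []
    for i in range(n):
        # Scaled dot-product similarity between anchor i and every other sample.
        logits = {
            a: sum(x * y for x, y in zip(embeddings[i], embeddings[a])) / temperature
            for a in range(n) if a != i
        }
        positives = [a for a in logits if labels[a] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        losses.append(-sum(logits[p] - log_denominator for p in positives) / len(positives))
    return sum(losses) / len(losses)


def two_level_loss(embeddings, subclass_labels, class_labels, beta=0.2, temperature=0.1):
    """Assumed simplification: subclass-level term plus beta times class-level term."""
    return (supcon_loss(embeddings, subclass_labels, temperature)
            + beta * supcon_loss(embeddings, class_labels, temperature))


# Four unit-norm toy embeddings forming two tight groups.
z = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
aligned = supcon_loss(z, [0, 0, 1, 1])     # labels agree with the geometry
misaligned = supcon_loss(z, [0, 1, 0, 1])  # labels cut across the groups
combined = two_level_loss(z, [0, 0, 1, 1], [0, 0, 1, 1], beta=0.2)
```

Labels that agree with the geometry of the embeddings give a much smaller loss than labels that cut across it, which is the property both terms rely on.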

Intra-subclass/inter-subclass distance of features learned with different contrastive learning methods.
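
The section does not define how these distances are measured, so the following sketch fixes one plausible choice as an assumption: intra-subclass distance is the mean pairwise Euclidean distance within a subclass, averaged over subclasses, and inter-subclass distance is the mean Euclidean distance between subclass centroids. The function name and the toy features are hypothetical.

```python
import math
from itertools import combinations


def subclass_distances(features, subclass_ids):
    """Return (intra, inter) distances under the assumed definitions."""
    groups = {}
    for feature, sid in zip(features, subclass_ids):
        groups.setdefault(sid, []).append(feature)

    # Intra: mean pairwise distance inside each subclass with at least two members.
    intra_values = []
    for members in groups.values():
        pairs = list(combinations(members, 2))
        if pairs:
            intra_values.append(sum(math.dist(a, b) for a, b in pairs) / len(pairs))
    intra = sum(intra_values) / len(intra_values)

    # Inter: mean distance between the centroids of different subclasses.
    centroids = [
        tuple(sum(coord) / len(members) for coord in zip(*members))
        for members in groups.values()
    ]
    centroid_pairs = list(combinations(centroids, 2))
    inter = sum(math.dist(a, b) for a, b in centroid_pairs) / len(centroid_pairs)
    return intra, inter


# Two toy subclasses, each with two 2-D features.
features = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
intra, inter = subclass_distances(features, [0, 0, 1, 1])
```

The sketch needs at least two subclasses and at least one subclass with two or more members; a small intra value together with a large inter value corresponds to the concentrated, well-separated subclasses described in the text.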

Distribution of sample number in subclass on CIFAR-100-LT with imbalance ratio 100.
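
The two statistics discussed for this table, the imbalance ratio of subclass sizes and their standard deviation, can be computed as in the sketch below. It assumes that the imbalance ratio is the largest subclass size divided by the smallest and that the population standard deviation is meant; neither detail is specified in this section. The function name and the toy assignments are hypothetical.

```python
import statistics
from collections import Counter


def subclass_size_stats(subclass_ids):
    """Return (imbalance ratio, standard deviation) of subclass sizes."""
    sizes = list(Counter(subclass_ids).values())
    ratio = max(sizes) / min(sizes)   # assumed: largest size over smallest size
    std = statistics.pstdev(sizes)    # assumed: population standard deviation
    return ratio, std


# Toy assignment: one subclass with 8 samples, another with 2.
ratio, std = subclass_size_stats(["a"] * 8 + ["b"] * 2)
```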

Combination of TSC and SBCL with dynamic temperature.

SBCL with and without warm-up stage on ImageNet-LT.

Subclass-balancing adaptive clustering strategy improves more than balanced positive sampling strategy on ImageNet-LT.

Object detection and instance segmentation results on the COCO dataset. The representation model is trained on ImageNet and ImageNet-LT. We report results in bounding-box AP ($\mathrm{AP}^{bb}$) and mask AP ($\mathrm{AP}^{mk}$).

Performance of the combination of SBCL and the state-of-the-art ensemble-based method RIDE (Wang et al., 2021a) with ResNet-50 (He et al., 2016) on ImageNet-LT.

Object detection on PASCAL VOC when the pretrained model is frozen.

annex

we show that using models pretrained with TSC or SBCL as initialization improves the performance of both PaCo and BCL. The results can be found in Table 14, and we can see that SBCL yields a larger performance gain than TSC. The improvement over PaCo and BCL demonstrates the effectiveness of SBCL in long-tailed recognition and sheds light on potential future work that combines multiple long-tailed recognition techniques to achieve new SOTA results.

Both PaCo and BCL adopt a single-stage pipeline where a classifier is trained with both a classification loss and a supervised contrastive loss. To further test the effectiveness of SBCL, we replace the supervised contrastive loss they use with SBCL and retrain the models with all of their other techniques in place for a fair comparison. The results can be found in Table 15. We can see that using SBCL improves over both PaCo and BCL.

Table 14: Performance of the combination of SBCL and extended contrastive methods (PaCo (Cui et al., 2021) and BCL (Zhu et al., 2022)) with ResNeXt-50 (Saining et al., 2017)

Combining SBCL with two-stage methods. MisLAS (Zhong et al., 2021) provides two major technical contributions that improve the two-stage pipeline of long-tailed recognition: label-aware smoothing and shifted batch normalization. Both techniques are designed to improve the second stage (classifier learning), while our method improves the first stage (representation learning). Thus, MisLAS can be combined with SBCL and TSC. In Table 16, we report the results of combining MisLAS with both SBCL and TSC. The results show that both SBCL and TSC improve the performance of MisLAS and that SBCL provides a larger boost than TSC.

Freezing the pretrained model for object detection. We assess the representations trained on ImageNet/ImageNet-LT on the downstream detection task. We freeze the pretrained backbone, train Faster R-CNN (Ren et al., 2015) on top of it, and use the same training schedule as in Section 4.3.
Table 17 reports the average mAP on the PASCAL VOC dataset. The representation trained by SBCL obtains better performance for object detection.

Hyperparameter studies. Here, we study the effect of the hyperparameters β and δ. Note that β controls the balance between the two loss terms in Eq. 4 and δ determines the lower bound of the cluster size in Eq. 3. Specifically, on CIFAR-100-LT with imbalance ratio 100, we vary the values of β from {0.1,

A.2 ADDITIONAL INFORMATION

Benchmark dataset statistics and implementation details. We summarize the statistics of the three benchmark datasets in Table 19. Following (Kang et al., 2019; 2020; Li et al., 2022), we apply SBCL to long-tailed recognition with a two-stage training strategy: (i) train the representation with SBCL; (ii) learn a linear classifier on top of the fixed representation. The training process is the same as in TSC (Li et al., 2022), so we use the default TSC hyperparameters and implementation details for representation learning. For the CIFAR-100-LT dataset, all experiments are performed on 2 NVIDIA RTX 3090 GPUs. For the ImageNet-LT and iNaturalist 2018 datasets, we perform the experiments on 8 NVIDIA RTX 3090 GPUs. The detailed hyperparameters of TSC and SBCL are given in Table 20.

For classifier learning, the strategy for training the linear classifier is the same as in TSC (Li et al., 2022), so we use the default TSC hyperparameters and implementation details. For detection model learning, we follow MoCo (He et al., 2020) and adopt the same settings, hyperparameters, and evaluation metrics with the R50-C4 backbone. For the PASCAL VOC dataset, we train Faster R-CNN (Ren et al., 2015) on VOC07+12 and evaluate on the VOC07 test set. For the COCO dataset, we train Mask R-CNN (He et al., 2017) on the train2017 set and evaluate on the val2017 set.

Limitations. SBCL has some limitations. First, clustering the head classes in SBCL takes a long time during training, especially for ImageNet-LT and iNaturalist 2018. Second, SBCL requires knowing the number of samples in each class to decide the number of clusters, so it is not applicable to problems where the number of samples is unknown.

Social impacts. This work proposes a novel representation learning method to help mitigate bias in recognition on real-world data, which might have a positive social impact.
We do not foresee any form of negative social impact induced by our work.

Privacy information in data. All datasets used in the experiments are public. The datasets only include pictures, most of which are of animals and plants. No private information is included.

