ON THE EFFECTIVENESS OF OUT-OF-DISTRIBUTION DATA IN SELF-SUPERVISED LONG-TAIL LEARNING

Abstract

Though self-supervised learning (SSL) has been widely studied as a promising technique for representation learning, it does not generalize well on long-tailed datasets because the majority classes dominate the feature space. Recent work shows that long-tailed learning performance can be boosted by sampling extra in-domain (ID) data for self-supervised training; however, large-scale ID data that can rebalance the minority classes are expensive to collect. In this paper, we propose an alternative but easy-to-use and effective solution, Contrastive with Out-of-distribution (OOD) data for Long-Tail learning (COLT), which can effectively exploit OOD data to dynamically re-balance the feature space. We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning and accordingly design a novel, principled SSL method. Concretely, we first localize the 'head' and 'tail' samples by assigning a tailness score to each sample based on its neighborhood in the feature space. Then, we propose an online OOD sampling strategy to dynamically re-balance the feature space. Finally, we enforce the model to be capable of distinguishing ID and OOD samples with a distribution-level supervised contrastive loss. Extensive experiments on various datasets and several state-of-the-art SSL frameworks verify the effectiveness of the proposed method. The results show that our method significantly improves the performance of SSL on long-tailed datasets by a large margin, and even outperforms previous work that uses external ID data.

1. INTRODUCTION

Self-supervised learning (SSL) methods (Chen et al., 2020; He et al., 2020; Grill et al., 2020) provide distinctive and transferable representations in an unsupervised manner. However, most SSL methods are developed on well-curated and balanced datasets (e.g., ImageNet), while many real-world datasets in practical applications, such as medical imaging and self-driving cars, follow a long-tailed distribution (Spain & Perona, 2007). Recent research (Liu et al., 2021) indicates that existing SSL methods exhibit severe performance degradation when exposed to imbalanced datasets. To enhance the robustness of SSL methods under long-tailed data, several pioneering methods (Jiang et al., 2021b; Zhou et al., 2022) have been proposed as feasible migrations of cost-sensitive learning, which is widely studied in supervised long-tail learning (Elkan, 2001; Sun et al., 2007; Cui et al., 2019b; Wang et al., 2022). The high-level intuition of these methods is to re-balance classes by adjusting loss values for different classes, i.e., forcing the model to pay more attention to tail samples. Another promising line of work explores the possibility of improving SSL methods with external data. MAK (Jiang et al., 2021a) suggests re-balancing the class distributions by sampling external in-distribution (ID) tail instances in the wild. Nevertheless, it still requires ID samples to be available in the sampling pool, which is hard to guarantee in many real-world scenarios, e.g., medical image diagnosis (Ju et al., 2021) or species classification (Miao et al., 2021). The aforementioned findings and challenges motivate us to investigate another more practical and challenging setting: when ID data is not available, can we leverage out-of-distribution (OOD) data to improve the performance of SSL in long-tailed learning? Unlike MAK (Jiang et al., 2021a), which assumes external ID samples are available, we consider a more practical scenario where we only have access to OOD data that can be easily collected (e.g., downloaded from the internet). A very recent work (Wei et al., 2022) proposes to re-balance the class priors by assigning labels to OOD images following a pre-defined distribution. However, it operates in a supervised manner and is not directly applicable to SSL frameworks.

In this paper, we propose a novel and principled method that exploits unlabeled OOD data to improve SSL performance in long-tailed learning. As suggested in previous research, standard contrastive learning naturally puts more weight on the loss of majority classes and less on that of minority classes, resulting in an imbalanced feature space and poor linear separability on tail samples (Kang et al., 2020; Li et al., 2022). However, re-balancing minorities with ID samples, whether labeled or unlabeled, is quite expensive. To alleviate these issues, we devise a framework, Contrastive Learning with OOD data for Long-Tailed learning (COLT), to dynamically augment the minorities with unlabeled OOD samples that are close to the tail classes in the feature space. As illustrated in Fig. 1, COLT significantly improves SSL baselines in terms of Alignment and Uniformity (Wang & Isola, 2020), two widely-used metrics for evaluating contrastive learning methods, demonstrating the effectiveness of our method. The pipeline of our method is illustrated in Fig. 2. To augment the long-tail ID dataset, we define a tailness score to localize the head and tail samples in an unsupervised manner. Afterward, we design an online sampling strategy that dynamically re-balances the long-tail distribution by selecting OOD samples close (with a large cosine similarity in the feature space) to the head or tail classes based on a predefined budget allocation function.
We follow the intuition of allocating more OOD samples to the tail classes for rebalancing. The selected OOD samples augment the ID dataset for contrastive training, where an additional distribution-level supervised contrastive loss makes the model aware of samples from different distributions. Experimental results on four long-tail datasets demonstrate that COLT greatly improves the performance of various SSL methods and even surpasses state-of-the-art baselines that use auxiliary ID data. We also conduct comprehensive analyses to understand the effectiveness of COLT. Our contributions can be summarized as:
• We raise the question of whether and how SSL on long-tailed datasets can be effectively improved with external unlabeled OOD data, which is better aligned with practical scenarios but counter-intuitive to most existing work and rarely investigated before.
• We design a novel yet easy-to-use SSL method, composed of tailness score estimation, a dynamic sampling strategy, and an additional contrastive loss for long-tail learning with external OOD samples, to alleviate the imbalance issues during contrastive learning.
• We conduct extensive experiments on various datasets and SSL frameworks to verify and understand the effectiveness of the proposed method. Our method consistently outperforms baselines by a large margin, with consistent agreement between the superior performance and various feature quality metrics of contrastive learning.

2. RELATED WORKS

Supervised learning with imbalanced datasets. Early attempts highlight minority samples via re-balancing strategies, which fall into two categories: re-sampling at the data level (Shen et al., 2016; Zou et al., 2018; Geifman & El-Yaniv, 2017) or re-weighting at the loss (gradient) level (Cao et al., 2019; Jamal et al., 2020). Due to their use of label-related information, these methods cannot be generalized to the unsupervised setting. (Kang et al., 2019) suggests that decoupling representation learning and classifier learning benefits long-tail learning. The feasibility of such two-stage training promotes exploration in unsupervised scenarios.

Self-supervised long-tail learning. (Yang & Xu, 2020) is, to the best of our knowledge, the first to analyze the performance of SSL methods in long-tail learning and to verify the effectiveness of self-supervised pre-training both theoretically and experimentally. However, (Liu et al., 2021) shows that SSL methods, although more robust than supervised methods, are not immune to imbalanced datasets. Follow-up studies improve the ability of SSL methods on long-tailed datasets. Motivated by the observation that deep neural networks easily forget hard samples after pruning (Hooker et al., 2019), (Jiang et al., 2021b) proposed a self-competitor to pay more attention to hard (tail) samples. BCL (Zhou et al., 2022) incorporates the memorization effect of deep neural networks (Zhang et al., 2021b) into contrastive learning, i.e., it emphasizes tail samples by assigning stronger augmentations based on the memorization clue. We show that our method does not conflict with existing methods and can further improve balancedness and accuracy (Section 4.2).
Learning with auxiliary data. Auxiliary data is widely used in deep learning for different purposes, e.g., improving model robustness (Lee et al., 2020), combating label noise (Wei et al., 2021), OOD detection (Liang et al., 2018; Hendrycks et al., 2018a), domain generalization (Li et al., 2021; Liu et al., 2020; Long et al., 2015), neural network compression (Fang et al., 2021), and training large models (Alayrac et al., 2022; Brown et al., 2020). In long-tail learning, MAK (Jiang et al., 2021a) suggests tackling the dataset imbalance problem by sampling in-distribution tail-class data from an open-world sampling pool. On the contrary, we explore the possibility of helping long-tail learning with OOD samples, i.e., none of the ID samples are included in the sampling pool. Open-Sampling (Wei et al., 2022) utilizes OOD samples by assigning a label to each sample following a pre-defined label distribution. However, it operates in supervised scenarios, and the OOD data is not filtered, which results in a massive computation overhead.

3. METHOD

3.1. PRELIMINARIES

Unsupervised visual representation learning methods aim to find an optimal embedding function $f$, which projects an input image $x \in \mathbb{R}^{C \times H \times W}$ to the feature space with $z = f(x) \in \mathbb{R}^d$, such that $z$ retains the discriminative semantic information of the input image. SimCLR (Chen et al., 2020) is one of the state-of-the-art unsupervised learning frameworks; its training objective is defined as:

$$\mathcal{L}_{CL} = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(z_i \cdot z_i^{+}/\tau)}{\exp(z_i \cdot z_i^{+}/\tau) + \sum_{z_i^{-} \in Z^{-}} \exp(z_i \cdot z_i^{-}/\tau)}, \qquad (1)$$

where $(z_i, z_i^{+})$ is the positive pair of instance $i$, $z_i^{-}$ denotes a negative sample from the negative set $Z^{-}$, and $\tau$ is the temperature hyper-parameter. In practice, a batch of images is augmented twice with different augmentations; the positive pair is formed by the two views of the same image, and the negative samples are the views of the other images.
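As a minimal sketch of the objective in Eq. 1 (an illustrative simplification on plain Python lists, not the authors' tensor-based implementation; `info_nce` and `cosine` are our names, and the default temperature is a placeholder):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(z_i, z_pos, negatives, tau=0.5):
    """Contrastive loss for one anchor (Eq. 1): the positive pair is pulled
    together, all negatives are pushed away."""
    pos = math.exp(cosine(z_i, z_pos) / tau)
    neg = sum(math.exp(cosine(z_i, z_n) / tau) for z_n in negatives)
    return -math.log(pos / (pos + neg))
```

A well-aligned positive pair yields a smaller loss than a misaligned one, which is exactly the pressure that, on long-tailed data, lets the majority classes dominate the feature space.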

3.2. LOCALIZE TAIL SAMPLES IN SELF-SUPERVISED TRAINING

Due to the label-agnostic assumption in the pre-training stage, the first step of the proposed method is to localize tail samples. As mentioned earlier, the majority classes dominate the feature space, and tail instances turn out to be outliers. Moreover, the minority classes have lower intra-class consistency (Li et al., 2022). Hence, a sparse neighborhood can serve as a reliable proxy for identifying tail samples (more analysis can be found in Section 4.4). Specifically, we use the top-k% (k = 2 in practice) largest negative logits of each sample to depict its feature-space neighborhood during training. Given a training sample $x_i$, its negative logit $p_i^{-}$ is:

$$p_i^{-} = \frac{\exp(z_i \cdot z_i^{-}/\tau)}{\exp(z_i \cdot z_i^{+}/\tau) + \sum_{z_i^{-} \in Z^{-}} \exp(z_i \cdot z_i^{-}/\tau)}. \qquad (2)$$

When implementing SimCLR (Chen et al., 2020) with batch size $B$, each image has $2(B-1)$ negative samples. We then define $s_t^i = \text{top-}k\%\, p_i^{-}$ as the tailness score of each ID instance $x_i$. During training, we apply a momentum update to the tailness score, i.e., $s_t^{i,0} = s_t^i$ and $s_t^{i,n} = m\, s_t^{i,n-1} + (1-m)\, \hat{s}_t^{i,n}$, where $\hat{s}_t^{i,n}$ is the score freshly computed at epoch $n$ and $m \in [0, 1)$ is the momentum coefficient. The momentum update makes the tailness score more robust and discriminative for tail samples. A higher value of $s_t^i$ indicates that sample $x_i$ has a sparser neighborhood in the feature space and thus belongs to the tail classes with a larger probability. Experiments in Fig. 3e empirically demonstrate that tail samples can be effectively discovered by the proposed tailness score.
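The score and its momentum update can be sketched as follows. Note this is a hedged illustration: the paper leaves the top-k% aggregation implicit, so averaging is our assumption, as is the default momentum value.

```python
def tailness_score(neg_logits, k_percent=2.0):
    """Tailness score of one ID sample: aggregate (here, average) the
    top-k% largest negative logits p_i^- from Eq. 2. Per the paper, a
    higher score indicates a sparser neighborhood, i.e., a likely tail sample."""
    n = max(1, int(len(neg_logits) * k_percent / 100))
    top = sorted(neg_logits, reverse=True)[:n]
    return sum(top) / n

def momentum_update(prev_score, new_score, m=0.9):
    """Exponential moving average of the tailness score across epochs
    (m is the momentum coefficient; 0.9 is an assumed placeholder)."""
    return m * prev_score + (1 - m) * new_score
```

With k = 2 and 100 negative logits, only the two largest logits contribute to the score, so the statistic focuses on the nearest neighbors rather than the whole batch.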

3.3. DYNAMICALLY RE-BALANCE THE FEATURE SPACE WITH ONLINE SAMPLING

The core of our approach is to sample OOD images from the sampling pool $S_{ood}$ to re-balance the original long-tail ID dataset and the feature space. First, we obtain $C$ feature prototypes $z_{c_i}$ from the ID training set $S_{id}$ via K-means clustering. Note that we use the features at the last projection layer, since the contrastive process is performed on this layer. The cluster-wise tailness score $s_t^{c_i}$ is defined as the mean tailness score within cluster $c_i$, i.e., $s_t^{c_i} = \sum_{z_j \in c_i} s_t^j / |c_i|$, where $|c_i|$ is the number of instances in cluster $c_i$. Then, we obtain each cluster's sampling budget $K'$ as follows:

$$K' = K \cdot \mathrm{softmax}(\tilde{s}_t^c / \tau_c), \qquad \tilde{s}_t^c = \frac{s_t^c - \mathrm{mean}(s_t^c)}{\mathrm{std}(s_t^c)}, \qquad (3)$$

where $K$ is the total sampling budget, $K' \in \mathbb{R}^C$ is the sampling budget assigned to each cluster, and $\tilde{s}_t^c$ is the normalized cluster tailness score. Empirically, we assign a larger sampling budget to tail-like clusters, consistent with the idea of re-balancing the feature space. We sample the OOD images whose features are closest to (i.e., have the highest cosine similarity with) the ID prototypes $z_{c_i}$. To fully exploit the OOD data, we re-sample from $S_{ood}$ every $r$ epochs. The motivation is twofold: i) the sampled OOD data can be well separated from $S_{id}$ after a few epochs and therefore becomes less effective at re-balancing the feature space; ii) over-fitting to the OOD data can be toxic to ID performance (Wei et al., 2022). From another perspective, this online sampling strategy continuously exposes the ID training set (especially the tail samples) to freshly sampled, effective negative samples, forcing the model to give more distinctive embeddings and better fit the ID distribution. The online sampling process is summarized in Algorithm 2.

Algorithm 1 The overall pipeline of COLT.
Input: ID train set $S_{id}$, OOD dataset $S_{ood}$, sample budget $K$, train epochs $T$, momentum coefficient $m$, warm-up epochs $w$, sample interval $r$, cluster number $C$, hyper-parameters $k$, $\tau_c$.
Output: pre-trained model parameters $\theta_T$.
Initialize: model parameters $\theta_0$, the original train set $S_{train} = S_{id}$.
if epoch = 0 then
    Train model $\theta_0$ with Eq. 1 and compute $s_t^0$;
end if
for epoch = 1, ..., T − 1 do
    if epoch ≥ w and (epoch − w) mod r = 0 then
        for i = 0, ..., C − 1 do
            Initialize subset $S_{sample}^i = \emptyset$;
            while $|S_{sample}^i| < K'_{c_i}$ do
                $u = \arg\max_{x_j \in S_{ood}} \mathrm{sim}(z_j, z_{c_i})$;
                $S_{sample}^i = S_{sample}^i \cup \{u\}$;
            end while
            $S_{sample} = S_{sample} \cup S_{sample}^i$;
        end for
        $S_{train} = S_{train} \cup S_{sample}$.
    end if
end for

3.4. AWARENESS OF THE OUT-OF-DISTRIBUTION DATA

Section 3.2 and Section 3.3 introduced our sampling strategy for OOD images. To involve the sampled OOD subset $S_{sample}$ in training, a feasible way is to directly train the model with Eq. 1 on the augmented training set (containing both ID and OOD samples). However, we argue that giving equal treatment to all samples may not be the optimal choice (details in Section 4). One natural idea is to let the model be aware that the samples come from two different domains. Hence, we define an indicator $\phi$ that provides weakly supervised (distribution-only) information:

$$\phi(x_i) = \begin{cases} +1, & x_i \in S_{id}; \\ -1, & x_i \in S_{ood}. \end{cases}$$

Afterward, we add a supervised contrastive loss (Khosla et al., 2020) over both ID and OOD samples:

$$\mathcal{L}_{SCL} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{p \in P(i)} -\log \frac{\exp(z_i \cdot z_p/\tau)}{\exp(z_i \cdot z_p/\tau) + \sum_{n \in N(i)} \exp(z_i \cdot z_n/\tau)}, \qquad (4)$$

where $P(i) \equiv \{p : \phi(x_p) = \phi(x_i)\}$ is the set of indices of the same domain within the mini-batch, $|P(i)|$ is its cardinality, and the negative index set $N(i) \equiv \{n : \phi(x_n) \neq \phi(x_i)\}$ contains indices from the other distribution. Fig. 3c illustrates that the proposed distribution-awareness loss not only improves the overall performance but also facilitates a more balanced feature space. It is worth noting that the proposed loss only utilizes the distribution information as the supervision signal, while the labels of both ID and OOD samples remain unavailable during the self-supervised training stage.
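As an illustrative sketch (not the authors' implementation), the budget allocation of Eq. 3 and the per-cluster greedy selection step of Algorithm 1 might look as follows; plain Python lists, function names, and the default τ_c are our assumptions:

```python
import math

def budget_per_cluster(cluster_scores, K, tau_c=1.0):
    """Allocate the total budget K across clusters (Eq. 3): standardize the
    cluster-wise tailness scores, then softmax-weight the budget so that
    tail-like clusters receive more OOD samples."""
    n = len(cluster_scores)
    mean = sum(cluster_scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in cluster_scores) / n) or 1.0  # guard zero std
    exps = [math.exp(((s - mean) / std) / tau_c) for s in cluster_scores]
    total = sum(exps)
    return [K * e / total for e in exps]

def greedy_ood_sampling(ood_feats, prototypes, budgets):
    """Per-cluster selection step of Algorithm 1: for each ID prototype, take
    the budgeted number of most-similar OOD samples, without replacement.
    Features are assumed L2-normalized, so the dot product acts as cosine similarity."""
    selected = set()
    for proto, budget in zip(prototypes, budgets):
        sims = sorted(
            ((sum(a * b for a, b in zip(f, proto)), j)
             for j, f in enumerate(ood_feats) if j not in selected),
            reverse=True)
        for _, j in sims[:int(budget)]:
            selected.add(j)
    return selected
```

The softmax guarantees the per-cluster budgets sum to K, and the `selected` set keeps the clusters from picking the same OOD image twice.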
Finally, we scale the supervised loss by $\alpha$ and add it to the contrastive loss in Eq. 1:

$$\mathcal{L}_{COLT} = \mathcal{L}_{CL} + \alpha \mathcal{L}_{SCL}. \qquad (5)$$
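A self-contained sketch of the distribution-level supervised contrastive term and the combined objective (an illustrative simplification on plain Python lists; the default α and τ are placeholders, and features are assumed L2-normalized):

```python
import math

def dist_supcon_loss(feats, domains, tau=0.5):
    """Distribution-level supervised contrastive loss: samples sharing a
    domain flag (+1 ID / -1 OOD) are positives, the rest are negatives."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    n, total = len(feats), 0.0
    for i in range(n):
        pos = [p for p in range(n) if p != i and domains[p] == domains[i]]
        neg = sum(math.exp(dot(feats[i], feats[q]) / tau)
                  for q in range(n) if domains[q] != domains[i])
        inner = sum(
            -math.log(math.exp(dot(feats[i], feats[p]) / tau)
                      / (math.exp(dot(feats[i], feats[p]) / tau) + neg))
            for p in pos)
        total += inner / max(1, len(pos))
    return total / n

def colt_loss(l_cl, l_scl, alpha=1.0):
    """Final objective: L_COLT = L_CL + alpha * L_SCL."""
    return l_cl + alpha * l_scl
```

A mini-batch whose ID and OOD features are already well separated yields a smaller loss than one where the two domains are entangled, which is precisely the pressure the term adds.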

4. EXPERIMENTS

In this section, we first introduce the datasets and experimental settings (Section 4.1) and then evaluate the proposed COLT in three aspects: accuracy and balancedness (Section 4.2), and versatility and complexity (Section 4.3). We then verify whether our method can 1) localize tail samples and 2) re-balance the feature space. Finally, we provide a comprehensive analysis of COLT (Section 4.4).

4.1. DATASETS AND SETTINGS

We conduct experiments on four popular datasets. CIFAR-10-LT/CIFAR-100-LT are long-tail subsets sampled from the original CIFAR-10/CIFAR-100 (Cui et al., 2019a). We set the imbalance ratio to 100 by default. Following (Wei et al., 2022), we use 300K Random Images (Hendrycks et al., 2018b) as the OOD dataset. ImageNet-100-LT is proposed by (Jiang et al., 2021b), with 12K images sampled from ImageNet-100 (Tian et al., 2020) under a Pareto distribution. We use ImageNet-R (Hendrycks et al., 2021) as the OOD dataset. Places-LT (Liu et al., 2019) contains about 62.5K images sampled from the large-scale scene-centric Places dataset (Zhou et al., 2017) under a Pareto distribution. Places-Extra69 (Zhou et al., 2017) is utilized as the OOD dataset.

Evaluation protocols. To verify the balancedness and separability of the feature space, we report performance under two widely-used evaluation protocols in SSL: linear probing and few-shot learning. For both protocols, we first perform self-supervised training of the encoder to obtain the optimized visual representation. Then, we fine-tune a linear classifier on top of the frozen encoder. The only difference between the two protocols is that linear probing uses the full dataset during fine-tuning, while few-shot learning uses 1% of it.

Measurement metrics. As a common practice in long-tail learning, we divide each dataset into three disjoint groups according to the number of instances per class: {Many, Median, Few}. By calculating the standard deviation of the accuracy over the three groups, we can quantitatively analyze the balancedness of a feature space (Jiang et al., 2021b). The linear separability of the feature space is evaluated by the overall accuracy.

Training settings. We evaluate our method with the SimCLR (Chen et al., 2020) framework by default. We also conduct experiments on several state-of-the-art methods in self-supervised long-tail learning (Jiang et al., 2021b; Zhou et al., 2022).
We adopt ResNet-18 (He et al., 2016) for the small datasets (CIFAR-10-LT/CIFAR-100-LT) and ResNet-50 for the large datasets (ImageNet-100-LT/Places-LT). More details can be found in the Appendix.

4.2. ACCURACY AND BALANCEDNESS

The main results of the proposed approach on various datasets and settings are presented in Table 1 and Table 2. We sample K = 10,000 OOD images every r = 25 epochs for CIFAR-10-LT/CIFAR-100-LT and Places-LT, and every r = 50 epochs for ImageNet-100-LT. COLT outperforms the baseline (vanilla SimCLR) by a large margin (about 10% for long-tail CIFAR, 5% for ImageNet-100-LT, and 1.6% for Places-LT). Moreover, the performance gain on the minority classes (Median & Few) is more notable (e.g., about 12% for long-tailed CIFAR-100). Meanwhile, COLT yields a more balanced feature space. Following previous works (Jiang et al., 2021b; Zhou et al., 2022), we measure the balancedness of a feature space through the standard deviation (Std) of the accuracy over Many, Median, and Few. COLT significantly narrows the performance gap between the three groups (much lower Std), which indicates that we learn a more balanced feature space. To evaluate the versatility of COLT, we carry out experiments on top of several improved SSL frameworks for long-tail learning, i.e., SDCLR (Jiang et al., 2021b) and BCL (Zhou et al., 2022). Table 1 and Table 2 also summarize COLT's performance on these two methods. Incorporating our method into existing state-of-the-art methods consistently improves their performance, which indicates that our method is robust to the underlying SSL framework.

4.3. COLT VS BASELINES WITH AUXILIARY DATA

We also compare COLT with methods that make use of external data. MAK (Jiang et al., 2021a) is the state-of-the-art method that proposes a sampling strategy to re-balance the training set by sampling in-distribution tail-class instances from an open-world sampling pool. We compare COLT with MAK in Table 3, where "random" refers to random sampling from the external dataset according to the budget. We observe both higher accuracy and better balancedness under different sampling budgets on ImageNet-100 and Places, indicating that COLT leverages OOD data more efficiently. Furthermore, we ask whether OOD samples can replace ID samples to help long-tail learning. We obtain a positive answer from the empirical results in Table 4, where we compare COLT and MAK on auxiliary data that involves ID samples. COLT achieves better performance on most of the metrics, even compared with sampling from an entirely ID dataset. On the other hand, COLT incurs less computational overhead. MAK applies a three-stage pipeline: pre-train the model with ID samples, sample from the sampling pool, and re-train a model from scratch. In contrast, COLT samples during the training process, resulting in a single-stage pipeline. The online sampling strategy not only fully utilizes the external dataset but also reduces the computation overhead significantly.¹ Open-Sampling (Wei et al., 2022) also uses OOD data to help long-tail learning. Different from ours, it uses a large data budget (300K for CIFAR), while COLT improves the baselines with a much smaller budget (10K for CIFAR).

4.4. ANALYSIS AND ABLATION STUDY

The choice of OOD dataset. We conduct experiments on CIFAR-100-LT and replace the OOD dataset while keeping the other settings unchanged. As shown in Fig. 3a, our method improves the ID accuracy by 9.81%, 5.19%, 9.43%, and 5.74% when using 300K Random Images (Hendrycks et al., 2018b), STL (Coates et al., 2011), ImageNet-100, and Places, respectively. In contrast, sampling from Gaussian noise provides limited help (less than 1%) or even degrades ID accuracy.

The effectiveness of the distribution-awareness loss. COLT introduces a supervised contrastive loss to explicitly separate samples from different distributions. The ablation study in Fig. 3c shows that the proposed loss not only improves the overall accuracy but also significantly alleviates the imbalance (i.e., much lower Std).

The effect of the sampling budget. In Fig. 3b, we compare the performance gains of COLT under different budgets. COLT consistently outperforms the random sampling strategy, i.e., it leverages OOD samples more effectively. Moreover, though a larger budget gives better performance, the gain almost plateaus at a budget of 10-15K, indicating better data efficiency.

Comparison with semi-supervised methods. Performing semi-supervised learning is another natural choice for utilizing external unlabeled data. We implement FixMatch (Sohn et al., 2020), FlexMatch (Zhang et al., 2021a), ABC (Lee et al., 2021), and DARP (Kim et al., 2020) on long-tailed CIFAR-100; the first two are general semi-supervised methods, and the last two are elaborately designed for long-tail learning. In Table 5, we compare the results in this semi-supervised scenario (labeled: CIFAR-100-LT, unlabeled: 300K Random Images) to supervised learning and COLT.
We observe that 1) external unlabeled OOD data can also be helpful in semi-supervised learning, and 2) the performance gains of COLT (about 10%) are more significant than those from incorporating OOD data via semi-supervised training. This could be attributed to most semi-supervised methods assuming that the unlabeled data is also ID. Special designs may be needed for unlabeled OOD data, e.g., resisting "toxic" samples or redesigning pseudo-labels for OOD data.

The effect of hyper-parameters

We also study the effect of the hyper-parameters involved in COLT. Fig. 3d shows the empirical results of changing the resampling interval r; reasonably small intervals lead to higher accuracy. Fig. 3f shows the classification accuracy when changing k. The limited fluctuations in performance indicate that COLT is robust to its hyper-parameters.

The ability of tail sample mining. Recall that in Section 3.2, we localize tail samples by assigning a "tailness score" to each sample. To verify the effectiveness of the tailness score, we select the top γ% (γ = 10) samples with the highest tailness score as a subset and calculate the ratio of the proportion of {Major/Minor} samples in this subset to their proportion in the whole dataset:

$$\phi = \frac{|T \cap S_{id}^{sub}|}{|T \cap S_{id}|},$$

where $T$ denotes the target group, $S_{id}$ is the whole in-distribution dataset, and $S_{id}^{sub}$ is the subset of samples with the top-γ% highest tailness scores. $\phi$ reflects the ability to identify tail samples: when the target group is Minor/Major, a higher/lower $\phi$ indicates that a method localizes tail samples well. As illustrated in Fig. 3e, COLT discovers more samples from the tail than BCL.
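An illustrative sketch of this metric (not the authors' code; the group encoding and function name are our assumptions):

```python
def tail_mining_ratio(tailness_scores, groups, target, gamma=10.0):
    """Sketch of the phi metric: among the top-gamma% highest-tailness
    samples, count target-group instances relative to all target-group
    instances in the dataset (matching |T ∩ S_sub| / |T ∩ S_id|)."""
    n = len(tailness_scores)
    k = max(1, int(n * gamma / 100))
    top = sorted(range(n), key=lambda i: tailness_scores[i], reverse=True)[:k]
    hit = sum(1 for i in top if groups[i] == target)
    all_t = sum(1 for i in range(n) if groups[i] == target)
    return hit / all_t
```

If the tailness score is informative, tail samples concentrate in the top-γ% subset, driving the tail-group ratio up and the head-group ratio down.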

5. CONCLUSION AND LIMITATIONS

In this paper, we propose a novel SSL pipeline, COLT, which is, to the best of our knowledge, the first attempt to draw additional training samples from OOD datasets for improved SSL long-tailed learning. COLT consists of three steps: unsupervised localization of head/tail samples, re-balancing the feature space via online sampling, and SSL with an additional distribution-level supervised contrastive loss. Extensive experiments show that our method significantly and consistently improves the performance of SSL on various long-tailed datasets. There are nevertheless some limitations. First, more theoretical analyses are needed to better understand the effectiveness of OOD samples. Besides, for a given long-tail ID dataset, how to specify the OOD dataset that gives the largest improvements is also worth exploring. We hope our work can promote the exploration of OOD data in long-tail scenarios.

To further demonstrate that the proposed COLT is also effective on non-curated open-world datasets, we conduct experiments on ImageNet-100-LT with a 50K subset of Open Images (Krasin et al., 2017) (a dataset of about 9 million images belonging to over 6,000 categories) as the OOD dataset. We again observe a significant improvement in both accuracy (especially for {Median, Few}) and the balancedness of the feature space.
As suggested in previous research, standard contrastive learning naturally puts more weight on the loss of majority classes and less on that of minority classes, resulting in an imbalanced feature space and poor linear separability on tail samples (Kang et al., 2020; Li et al., 2022); i.e., the majority classes dominate the feature space and tail instances turn out to be outliers. To quantitatively analyze the imbalance of the feature space, we define a metric called the Normalized Misclassification Matrix (NMM):

$$\mathrm{NMM}_{ij} = \frac{m_{ij} / \sum_{k=1}^{n} m_{ik}}{|T_j| / \sum_{k=1}^{n} |T_k|},$$

where $T_k$ is the $k$-th split of the long-tailed train set, satisfying $S_{id} = \cup_{k=1}^{n} T_k$ and $T_i \cap T_j = \emptyset$ for all $i \neq j$. Following the common practice in long-tail learning, we split the dataset into $S_{id} = \{T_{Few}, T_{Median}, T_{Many}\}$ according to the number of instances in each class, and $|T_j|$ denotes the number of classes in split $T_j$. $m_{ij}$ represents the number of (misclassified) instances belonging to split $T_i$ but classified into split $T_j$; note that $m_{ii}$ means both the ground-truth label and the (wrong) prediction fall into split $T_i$. Intuitively, if the feature space is perfectly balanced (i.e., equal margins between different classes), each element of the NMM is approximately 1.0, i.e., the misclassified samples fall into each split nearly at random. On the contrary, higher (lower) mean margins between classes in $T_i$ and $T_j$ result in lower (higher) $m_{ij}$.

Figure 5: The augmentation graph of CIFAR-10. Similar to (Wang et al., 2021), we choose a random subset of test images and randomly augment them 20 times. Then, we calculate the instance distances in the representation space and draw edges for image pairs whose smallest view distance is below a small threshold. We visualize the samples with t-SNE, denoting edges between ID instances in black and edges between ID and OOD samples (forming new connections) in red.
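As an illustrative sketch (not the authors' code), the NMM defined above can be computed from a matrix of misclassification counts and the per-split class counts:

```python
def nmm(misclass_counts, split_class_counts):
    """Sketch of the Normalized Misclassification Matrix: row-normalize the
    misclassification counts m_ij, then divide by split j's share of classes.
    Entries near 1.0 mean errors are spread as if uniformly at random."""
    total = sum(split_class_counts)
    out = []
    for row in misclass_counts:
        row_sum = sum(row)
        out.append([(v / row_sum) / (c / total)
                    for v, c in zip(row, split_class_counts)])
    return out
```

When each row's errors are proportional to the splits' class shares, every entry equals 1.0, matching the "perfectly balanced" case described above.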
We observe a smaller $s^c$ for minority classes than for majority classes, which is also consistent with the theoretical analysis in (Wang et al., 2021).

APPENDIX D MORE ANALYSIS ABOUT COLT

Different tail estimation strategies. In our paper, we define a tailness score based on the top-k% (k = 2 in practice) largest negative logits to localize the head and tail samples in an unsupervised manner. We also compare it to alternative strategies. In Tab. 8, we provide results for locating tail samples with a radius-based definition of the tailness score (taking the sum of negative logits as the radius and selecting different radius thresholds) or simply using the method in BCL (Zhou et al., 2022). The sampling budget is set to 5K, and the ID and OOD datasets are CIFAR-100-LT and 300K Random Images, respectively. There is no significant difference between COLT with the top-k% score and the radius-based score, while both of them surpass BCL.

Is the unsupervised clustering reliable? In the paper, we propose to perform K-means clustering on the samples from $S_{id}$ and then calculate the cluster-wise sampling budget via the tailness score. Since our ultimate goal is to sample more (fewer) OOD data similar to the minority (majority) samples, it is natural to ask how well this clustering works, i.e., whether we assign more budget to the tail samples. To this end, we measure the clustering and sampling quality by the ratio of minority samples in a cluster and its corresponding cluster-wise tailness score. The results are visualized in Fig. 6. We observe that the minority proportion of some clusters is close to 1, while other clusters are composed of samples from majority classes. Besides, the cluster-wise tailness score is linearly correlated with the minority proportion, implying that we do allocate more sampling budget to tail classes according to Eq. 3.

Impact of the cluster number. Recall that in Sec. 3.3 we first perform K-means clustering on the ID samples and then select OOD samples close (with a large cosine similarity in the feature space) to the head or tail classes based on the budget allocation function.
We verify 1) whether the cluster number dramatically affects the performance and 2) how the method performs when clustering is based on ground-truth label information rather than done in a self-supervised manner. We conduct experiments with various cluster numbers C in Tab. 9; the OOD dataset is 300K Random Images (Hendrycks et al., 2018b), and the sampling budget is set to 5K. The supervised clustering method is referred to as Oracle. The results show that the performance of COLT is not significantly affected by the number of clusters. Furthermore, clustering according to the label information (Oracle) achieves similar performance to the unsupervised clustering, implying that K-means clustering is a feasible alternative that roughly separates the minority from the majority classes in self-supervised scenarios.

The scale / balancedness of the OOD dataset. In our work, we evaluate the proposed COLT both on balanced datasets used as OOD datasets in previous works (Kumar et al., 2021; Wei et al., 2022) and on non-curated open-world datasets such as 300K Random Images (Hendrycks et al., 2018b). Although COLT performs well on the aforementioned datasets, we further ask: will COLT perform well when the OOD dataset is itself long-tailed? Tab. 10 shows the performance of COLT on CIFAR-100 when the OOD dataset (ImageNet-100) has various imbalance ratios. The similar accuracy suggests that COLT is robust to the imbalancedness of the OOD dataset to some extent. This could be attributed to the dynamic sampling procedure filtering out most of the unhelpful OOD samples, so that the performance is not dramatically affected by changes in the external OOD dataset. Another observation can be found in Tab. 11, where we measure how the OOD dataset's scale influences COLT.
We form external OOD datasets of different scales by gradually increasing the number of samples drawn from 300K Random Images (Hendrycks et al., 2018b) and applying COLT to those subsets. Conforming to intuition, the scale of the OOD dataset is positively correlated with the performance of COLT, since a larger candidate set lets us select more of the desired samples. However, the comparison becomes tricky when the dataset itself also changes. For example, we observe a similar performance gain on ImageNet-100-LT when utilizing Places-69 (about 98K images) and ImageNet-R (30K images) as the auxiliary OOD dataset, which implies the scale of the OOD dataset is not the only factor affecting the performance.

Published as a conference paper at ICLR 2023
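The top-k% tailness score used throughout these ablations can be sketched as follows. This is a minimal illustration under assumptions of ours: the "negative logits" are taken to be cosine similarities between an OOD feature and all ID features, and the score is the average of the k% largest of them; the function name and exact normalization are illustrative, not the paper's released code.

```python
import numpy as np

def tailness_scores(z_ood, z_id, k_percent=2.0):
    """Illustrative top-k% tailness score: treat the cosine similarities
    between an OOD feature and all ID features as its 'negative logits',
    and average the k% largest of them. A small average means the sample
    falls in a sparsely populated (tail-like) region of the ID feature
    space; the direction and scaling here follow our reading of the paper."""
    z_ood = z_ood / np.linalg.norm(z_ood, axis=1, keepdims=True)
    z_id = z_id / np.linalg.norm(z_id, axis=1, keepdims=True)
    sims = z_ood @ z_id.T                      # (N_ood, N_id) cosine logits
    k = max(1, int(z_id.shape[0] * k_percent / 100))
    topk = np.sort(sims, axis=1)[:, -k:]       # k% largest logits per sample
    return topk.mean(axis=1)                   # one score per OOD sample
```

With k = 2 (as in practice), each OOD sample is scored by roughly its 2% nearest ID neighborhood, which is what the radius-based alternative in Tab. 8 also approximates.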



Although we sample multiple times (every r epochs) in COLT, performing sampling once takes less time than training for one epoch; it can therefore be ignored compared to the 1.7x training epochs brought by MAK.

300K Random Images is a debiased subset of 80 Million Tiny Images (Torralba et al., 2008) with 300K images. 80 Million Tiny Images was constructed from images returned by internet search engines for 53,464 nouns.



Figure 1: (1a): Feature space uniformity of different SSL frameworks. (1b): Visualization of the alignment property of samples in minority classes and majority classes w/ or w/o COLT. The experiment is conducted with ResNet-18 on CIFAR-100-LT.

Figure 2: Overview of Contrastive with Out-of-distribution data for Long-Tail learning (COLT). COLT can be easily plugged into most SSL frameworks. Proposed components are denoted in red.


Figure 3: Analytical experiments of COLT on CIFAR-100-LT. (3a): accuracy when changing the external OOD dataset. (3b): accuracy when sampling different numbers of OOD samples from 300K Random Images. (3c): Top-1 accuracy and standard deviation (Std) of COLT with or without the proposed distribution loss. (3d): accuracy with various sampling intervals r. (3e): a higher ϕ_tail and a lower ϕ_head imply that tail samples are mined more precisely. (3f): accuracy with various k.

Fig. 4 shows the results. Fig. 4a indicates that when the train set is balanced, the mean margin of different classes is approximately equal (NMM_ij ≈ 1); note that the split {T_Few, T_Median, T_Many} in Fig. 4a is consistent with Fig. 4b and Fig. 4c. Fig. 4b reflects that majority classes have a higher margin to other classes than minority classes. In other words, the model is more likely to confuse samples from a minority class with other minority classes, implying an imbalanced feature space. Fig. 4c exhibits the result of COLT on top of SimCLR with 10K OOD data. We observe that COLT alleviates the imbalance issue, since COLT augments minority classes with more instances, which can be interpreted as an implicit loss re-weighting strategy.

OOD DATA BRIDGES INSTANCES FROM MINORITY CLASSES

In this section, we demonstrate the effect of OOD data from the perspective of contrastive learning. A recent work (Wang et al., 2021) gives a theoretical understanding of contrastive learning based on augmentation overlap. Concretely, they suggest that samples of the same class can look very alike after aggressive data augmentations. Thus, the pretext task of aligning positive samples can facilitate the model to learn class-separable representations. They define the augmentation graph G = (V, E) as follows: the N samples are the vertices of the graph, and there exists an edge e_ij when sample i and sample j have overlapping views. According to their theory, intra-class augmentation overlap is a sufficient condition for gathering features from the same class. In this case, we compute the ratio of connected nodes (degree not zero) to measure the extent of intra-class augmentation overlap in class k, denoted as the score s_c^k. Ideally, each class should have s_c^k = 1, which indicates that all samples from the same class will be clustered together during the contrastive training process. A smaller s_c^k indicates lower intra-class consistency, and vice versa.
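The per-class connectivity score s_c^k described above can be sketched as follows, under a simplifying assumption of ours: since augmentation overlap on raw images is hard to evaluate directly, "overlapping views" is approximated here by a cosine-similarity threshold between features of two augmented views. The function name and threshold are illustrative, not from the original theory.

```python
import numpy as np

def intra_class_overlap_score(views_a, views_b, labels, cls, thresh=0.9):
    """Ratio of connected nodes (degree > 0) in the augmentation graph
    restricted to one class. An edge i-j is added when a view of sample i
    is close (cosine similarity above `thresh`) to a view of sample j;
    this feature-space proxy for 'overlapping views' is an assumption."""
    idx = np.where(labels == cls)[0]
    a = views_a[idx] / np.linalg.norm(views_a[idx], axis=1, keepdims=True)
    b = views_b[idx] / np.linalg.norm(views_b[idx], axis=1, keepdims=True)
    sims = a @ b.T                              # cross-view similarities
    np.fill_diagonal(sims, -np.inf)             # ignore self-pairs
    degree = (sims > thresh).sum(axis=1) + (sims.T > thresh).sum(axis=1)
    return float((degree > 0).mean())           # s_c^k: ratio of connected nodes
```

A score of 1 means every sample of the class shares an overlapping view with at least one classmate; lower values indicate weaker intra-class augmentation overlap.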

Figure 4: Normalized Misclassification Matrix (NMM) on the test set of CIFAR-100 with different frameworks and train sets. (4a): SimCLR trained on balanced CIFAR-100. (4b): SimCLR trained on long-tailed CIFAR-100. (4c): COLT implemented on top of SimCLR, trained on long-tailed CIFAR-100.

Figure 6: Linear regression between the minority proportion in a cluster and the cluster's tailness score on long-tailed CIFAR, ImageNet-100, and Places. We set the cluster number C = 10.



Input: ID train set S_id, OOD dataset S_ood, model θ, sample budget K, cluster number C, similarity metric sim(·), hyper-parameter τ_c.
Output: new train set S_train.
1. Calculate the ID features z_id and OOD features z_ood with model θ;
2. Obtain C ID prototypes z_{c_i} via K-means clustering in the projected feature space;
3. Calculate the cluster-wise tailness score s_t^{c_i} = Σ_{z_j ∈ c_i} s_t^j / |c_i|;
4. Assign each cluster a sample budget K'_{c_i} with Eq. 3;
5. Initialize the sample set S_sample = ∅;
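One round of the sampling procedure above might be sketched as follows. Several pieces are assumptions of ours: a minimal K-means stands in for a library implementation, the softmax-with-temperature split of the budget K over cluster-wise tailness is an assumed form of Eq. 3 (with hyper-parameter τ_c), and the greedy nearest-to-prototype selection without replacement is our illustrative reading, not the paper's exact code.

```python
import numpy as np

def simple_kmeans(z, C, iters=20, seed=0):
    """Minimal K-means (illustrative stand-in for a library implementation)."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), C, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(C):
            if np.any(labels == c):
                centers[c] = z[labels == c].mean(0)
    return labels, centers

def colt_sampling_step(z_id, z_ood, tailness, K=5000, C=10, tau_c=0.5):
    """Sketch of one sampling round: cluster ID features, split the budget K
    across clusters by the mean tailness of their members (assumed form of
    Eq. 3), then greedily pick the OOD samples most similar to each cluster
    prototype, without replacement across clusters."""
    labels, protos = simple_kmeans(z_id, C)
    scores = np.array([tailness[labels == c].mean() if np.any(labels == c) else 0.0
                       for c in range(C)])
    w = np.exp(scores / tau_c)
    budgets = np.floor(w / w.sum() * K).astype(int)   # tail-heavy clusters get more

    zn = z_ood / np.linalg.norm(z_ood, axis=1, keepdims=True)
    pn = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sims = zn @ pn.T                                  # cosine similarity to prototypes
    taken = np.zeros(len(z_ood), dtype=bool)
    selected = []
    for c in np.argsort(-budgets):                    # serve larger budgets first
        order = np.argsort(-sims[:, c])               # most similar first
        picked = [i for i in order if not taken[i]][:budgets[c]]
        taken[picked] = True
        selected.extend(picked)
    return np.array(selected, dtype=int)
```

The returned indices form S_sample, which is merged with S_id into the new train set; in COLT this step is repeated every r epochs as the features evolve.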

Test accuracy (%) and balancedness (Std↓) on CIFAR-10-LT and CIFAR-100-LT. (Columns per dataset: All ↑ / Many ↑ / Median ↑ / Few ↑ / Std ↓.)

Test accuracy (%) and balancedness (Std↓) on ImageNet-100-LT and Places-LT. (Columns per dataset: All ↑ / Many ↑ / Median ↑ / Few ↑ / Std ↓.)

Comparison of the proposed COLT with random sampling and MAK under the same sampling pool and sampling budget. The best performance under each setting is marked in bold.

Comparison of the test accuracy (%) on ImageNet-100-LT of the proposed COLT with MAK, which uses ID data. The best performance is marked in bold.

Comparison of semi-supervised and self-supervised methods when leveraging OOD data.

Accuracy (%) and balancedness (Std) on ImageNet-100 with different OOD datasets.

Table 8: Comparison of different tail estimation strategies. Accuracy: 54.20 ± 0.35 / 53.19 ± 0.47 / 53.33 ± 0.28 / 54.00 ± 0.21 / 53.87 ± 0.22 / 53.0 ± 0.34 / 52.97 ± 0.30

Table 9: Ablation study on the cluster number C on CIFAR-100-LT. Accuracy: 54.20 ± 0.35 / 53.61 ± 0.18 / 54.06 ± 0.22 / 53.81 ± 0.29 / 54.16 ± 0.10

Table 10: COLT's accuracy when the OOD dataset is also imbalanced. Accuracy: ± 0.33 / 52.63 ± 0.19 / 52.74 ± 0.25 / 52.66 ± 0.41

Table 11: COLT's accuracy with OOD datasets of different scales. Accuracy: 53.43 ± 0.26 / 53.99 ± 0.18 / 54.20 ± 0.35

ACKNOWLEDGEMENT

This work is supported by the National Natural Science Foundation of China (Grant No. U21B2004, 62106222), the Zhejiang Provincial Key R&D Program of China (Grant No. 2021C01119), the Natural Science Foundation of Zhejiang Province, China (Grant No. LZ23F020008), the Core Technology Research Project of Foshan, Guangdong Province, China (Grant No. 1920001000498), and the Zhejiang University-Angelalign Inc. R&D Center for Intelligent Healthcare. Jianhong Bai would also like to thank Huan Wang from Northeastern University (Boston, USA), Denny Wu from the University of Toronto, and Hangxiang Fang from Zhejiang University for their guidance and help.


REPRODUCIBILITY

We uploaded the code to ensure the reproducibility of the proposed method, which can be found at: https://github.com/JianhongBai/COLT.

APPENDIX A DATASETS AND TRAINING DETAILS

CIFAR-10-LT/CIFAR-100-LT were first introduced by (Cui et al., 2019a); they are long-tailed subsets sampled from the original CIFAR-10/CIFAR-100. The imbalance ratio is defined as the instance number of the largest class divided by that of the smallest class. To better reflect the performance difference, we set the imbalance ratio to 100 by default. Following (Wei et al., 2022), we use 300K Random Images (Hendrycks et al., 2018b)² as the OOD dataset. In addition, we also conduct experiments with STL-10 (Coates et al., 2011) as the OOD dataset, which contains 5,000 labeled images and 100,000 unlabeled images in 10 classes with a resolution of 96x96. We use all the unlabeled images as external OOD data.

ImageNet-100-LT was proposed by (Jiang et al., 2021b). It contains about 12K images sampled from ImageNet-100 (Tian et al., 2020) with a Pareto distribution. The instance number of each class ranges from 1,280 to 5. We use ImageNet-R (Hendrycks et al., 2021) as the OOD dataset; it contains 30K images with several renditions (e.g., art, cartoons, deviantart) of ImageNet classes.

Places-LT: The original Places (Zhou et al., 2017) is a large-scale scene-centric dataset. Places-LT (Liu et al., 2019) contains about 62.5K images sampled from Places with a Pareto distribution. The instance number of each class ranges from 4,980 to 5. Places-Extra69 (Zhou et al., 2017) is utilized as the OOD dataset; it includes 98,721 images from 69 scene categories beyond the 365 scene categories in Places.
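As a concrete illustration of the imbalance-ratio definition above, the per-class sizes of such long-tailed subsets are commonly generated with an exponential profile in the style of (Cui et al., 2019a). The sketch below is our illustration of that construction; the original papers' exact sampling code may differ.

```python
def longtail_class_sizes(n_max, num_classes, imbalance_ratio):
    """Class sizes for a long-tailed subset: the largest class keeps n_max
    samples, sizes decay exponentially over class index, and the smallest
    class ends up with n_max / imbalance_ratio samples, matching the
    imbalance-ratio definition (largest class size / smallest class size)."""
    return [int(n_max * imbalance_ratio ** (-i / (num_classes - 1)))
            for i in range(num_classes)]
```

For example, with n_max = 500, 10 classes, and imbalance ratio 100, the largest class keeps 500 samples and the smallest keeps 5.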

Training details

We implement all our techniques using PyTorch (Paszke et al., 2017) and conduct the experiments on RTX3090 GPUs. We evaluate our method with the SimCLR (Chen et al., 2020) framework, with batch size 512 for small datasets (CIFAR-10-LT/CIFAR-100-LT) and 256 for large datasets (ImageNet-100-LT/Places-LT) by default. We adopt ResNet-18 (He et al., 2016) for small datasets and ResNet-50 for large datasets, respectively.

In our paper, we evaluate COLT's performance under two evaluation protocols in self-supervised learning: linear probing and few-shot. For both protocols, we first perform self-supervised training on the encoder model to obtain the optimized visual representation. Then, we fine-tune a linear classifier on top of the encoder (which is kept fixed while training the classifier). The only difference between linear probing and few-shot learning is that we use the full dataset for linear probing and 1% of the samples for few-shot learning during fine-tuning. We keep all settings in the fine-tuning stage (e.g., optimizer, learning rate, batch size) the same as (Jiang et al., 2021b).

Mainly following (Jiang et al., 2021b; Zhou et al., 2022; Jiang et al., 2021a), we pre-train all the baselines and COLT for 2000 epochs on CIFAR-10/100, 1000 epochs on ImageNet-100, and 500 epochs on Places. As for the fine-tuning stage, the "linear-probing" and "few-shot" results are produced by fine-tuning the classifier for 30 epochs and 100 epochs, respectively. To make a fair comparison, we implement COLT and all baselines with the same data augmentation strategies. We sample K = 10,000 OOD images every r = 25 epochs for CIFAR-10-LT/CIFAR-100-LT and Places-LT, and every r = 50 epochs for ImageNet-100-LT.
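A minimal sketch of the linear-probing protocol described above: the encoder is frozen, so only a linear softmax classifier is fit on its fixed features. Here plain full-batch gradient descent replaces the paper's actual fine-tuning optimizer, and all names are illustrative.

```python
import numpy as np

def linear_probe(features, labels, num_classes, epochs=200, lr=0.5):
    """Fit a linear softmax classifier on frozen encoder features with
    full-batch gradient descent (an illustrative stand-in for the paper's
    fine-tuning setup)."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = probs - onehot                          # softmax cross-entropy gradient
        W -= lr * features.T @ grad / n
        b -= lr * grad.mean(axis=0)
    return W, b

def probe_accuracy(features, labels, W, b):
    return float((np.argmax(features @ W + b, axis=1) == labels).mean())
```

The few-shot protocol differs only in that `features`/`labels` would cover 1% of the dataset instead of all of it.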

APPENDIX B MORE EMPIRICAL RESULTS

We present the experiment results on ImageNet-100 with ImageNet-R or Places-69 as the external OOD dataset in Table 6. We compare COLT with methods that make use of external data. MAK (Jiang et al., 2021a) is the state-of-the-art method that proposes a sampling strategy to re-balance the training set by sampling in-distribution tail-class instances. Note that "random" refers to randomly sampling from the external dataset according to the sampling budget. We observe both higher accuracy and better balancedness under different sample budgets on different OOD datasets. It indicates COLT

