ON THE EFFECTIVENESS OF OUT-OF-DISTRIBUTION DATA IN SELF-SUPERVISED LONG-TAIL LEARNING

Abstract

Though self-supervised learning (SSL) has been widely studied as a promising technique for representation learning, it does not generalize well to long-tailed datasets, where the majority classes dominate the feature space. Recent work shows that long-tailed learning performance can be boosted by sampling extra in-domain (ID) data for self-supervised training; however, large-scale ID data that can rebalance the minority classes are expensive to collect. In this paper, we propose an alternative yet easy-to-use and effective solution, Contrastive with Out-of-distribution (OOD) data for Long-Tail learning (COLT), which effectively exploits OOD data to dynamically re-balance the feature space. We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning and design a novel SSL method in a principled way. Concretely, we first localize the 'head' and 'tail' samples by assigning a tailness score to each OOD sample based on its neighborhoods in the feature space. Then, we propose an online OOD sampling strategy to dynamically re-balance the feature space. Finally, we enforce the model to distinguish ID and OOD samples via a distribution-level supervised contrastive loss. Extensive experiments on various datasets and several state-of-the-art SSL frameworks verify the effectiveness of the proposed method. The results show that our method improves the performance of SSL on long-tailed datasets by a large margin, and even outperforms previous work that uses external ID data.

1. INTRODUCTION

Self-supervised learning (SSL) methods (Chen et al., 2020; He et al., 2020; Grill et al., 2020) provide distinctive and transferable representations in an unsupervised manner. However, most SSL methods are developed on well-curated and balanced datasets (e.g., ImageNet), while many real-world datasets in practical applications, such as medical imaging and self-driving cars, follow a long-tailed distribution (Spain & Perona, 2007). Recent research (Liu et al., 2021) indicates that existing SSL methods exhibit severe performance degradation when exposed to imbalanced datasets. To enhance the robustness of SSL methods under long-tailed data, several pioneering methods (Jiang et al., 2021b; Zhou et al., 2022) migrate cost-sensitive learning, which is widely studied in supervised long-tail learning (Elkan, 2001; Sun et al., 2007; Cui et al., 2019b; Wang et al., 2022), to the self-supervised setting. The high-level intuition of these methods is to re-balance classes by adjusting the loss values for different classes, i.e., forcing the model to pay more attention to tail samples. Another promising line of work explores improving SSL methods with external data. Jiang et al. (2021a) suggest re-balancing the class distributions by sampling external in-distribution (ID) tail instances in the wild. Nevertheless, this still requires ID samples in the sampling pool, which are hard to collect in many real-world scenarios, e.g., medical image diagnosis (Ju et al., 2021) or species classification (Miao et al., 2021).

The aforementioned findings and challenges motivate us to investigate a more practical and challenging setting: when ID data is not available, can we leverage out-of-distribution (OOD) data to improve the performance of SSL in long-tailed learning? In contrast to MAK (Jiang et al., 2021a), which assumes external ID samples are available, we consider a more practical scenario where we only have access to OOD data that can be easily collected (e.g., downloaded from the internet). A very recent work (Wei et al., 2022) proposes to re-balance the class priors by assigning labels to OOD images following a pre-defined distribution. However, it operates in a supervised manner and is not directly applicable to SSL frameworks.

In this paper, we propose a novel and principled method that exploits unlabeled OOD data to improve SSL performance on long-tailed learning. As suggested in previous research, standard contrastive learning naturally puts more weight on the loss of majority classes and less on that of minority classes, resulting in imbalanced feature spaces and poor linear separability on tail samples (Kang et al., 2020; Li et al., 2022). However, rebalancing minorities with ID samples, whether labeled or unlabeled, is quite expensive. To alleviate these issues, we devise a framework, Contrastive Learning with OOD data for Long-Tailed learning (COLT), to dynamically augment the minorities with unlabeled OOD samples that are close to tail classes in the feature space. As illustrated in Fig. 1, COLT significantly improves SSL baselines in terms of alignment and uniformity (Wang & Isola, 2020), two widely-used metrics for evaluating contrastive learning methods, demonstrating the effectiveness of our method.

The pipeline of our method is illustrated in Fig. 2. To augment the long-tail ID dataset, we define a tailness score to localize the head and tail samples in an unsupervised manner. Afterward, we design an online sampling strategy that dynamically re-balances the long-tail distribution by selecting OOD samples close (i.e., with a large cosine similarity in the feature space) to the head or tail classes based on a predefined budget allocation function.
We follow the intuition of allocating more OOD samples to the tail classes for rebalancing. The selected OOD samples augment the ID dataset for contrastive training, where an additional distribution-level supervised contrastive loss makes the model aware of samples from different distributions. Experimental results on four long-tail datasets demonstrate that COLT greatly improves the performance of various SSL methods and even surpasses state-of-the-art baselines that use auxiliary ID data. We also conduct comprehensive analyses to understand the effectiveness of COLT. Our contributions can be summarized as follows:

• We raise the question of whether, and how, SSL on long-tailed datasets can be effectively improved with external unlabeled OOD data. This setting is better aligned with practical scenarios, yet counter-intuitive to most existing work and rarely investigated before.

• We design a novel yet easy-to-use SSL method for long-tail learning with external OOD samples, composed of tailness score estimation, a dynamic sampling strategy, and an additional distribution-level contrastive loss, to alleviate imbalance during contrastive learning.

• We conduct extensive experiments on various datasets and SSL frameworks to verify and understand the effectiveness of the proposed method. Our method consistently outperforms baselines by a large margin, and its superior performance agrees with various feature-quality evaluation metrics of contrastive learning.
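As a concrete illustration of the pipeline described above, the sketch below shows one plausible way to realize its three components: a kNN-sparsity tailness score, a budget-based OOD sampling step, and a distribution-level supervised contrastive loss. All function names, the choice of k, the budget split, and the temperature are our own illustrative assumptions; the paper's actual estimator and loss may differ in detail.

```python
import numpy as np

def tailness_scores(id_feats, k=10):
    """Hypothetical kNN-sparsity proxy: features lying in sparse regions
    (low mean cosine similarity to their k nearest ID neighbours) are
    treated as 'tail'. Rows are assumed L2-normalised."""
    sims = id_feats @ id_feats.T
    np.fill_diagonal(sims, -np.inf)               # exclude self-similarity
    topk = np.sort(sims, axis=1)[:, -k:]          # k most similar neighbours
    return 1.0 - topk.mean(axis=1)                # sparse region -> high score

def sample_ood(id_feats, ood_feats, budget=100, k=10, tail_frac=0.8):
    """Budget allocation sketch: most of the budget goes to OOD samples
    nearest the sparse ('tail') ID features, the rest to those nearest
    the dense ('head') ones."""
    order = tailness_scores(id_feats, k).argsort()
    tail_anchor = id_feats[order[-k:]].mean(axis=0)   # mean of most tail-like
    head_anchor = id_feats[order[:k]].mean(axis=0)    # mean of most head-like
    n_tail = int(budget * tail_frac)
    picked = list(np.argsort(ood_feats @ tail_anchor)[-n_tail:])
    for i in np.argsort(ood_feats @ head_anchor)[::-1]:
        if len(picked) == budget:                 # fill the rest near the head
            break
        if i not in picked:
            picked.append(i)
    return np.asarray(picked)

def distribution_level_loss(feats, is_id, tau=0.5):
    """Distribution-level supervised contrastive loss sketch: samples from
    the same distribution (ID vs. OOD) act as positives for each other."""
    sims = feats @ feats.T / tau
    np.fill_diagonal(sims, -np.inf)               # remove self-pairs
    same = is_id[:, None] == is_id[None, :]
    np.fill_diagonal(same, False)
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    per_sample = -np.where(same, log_prob, 0.0).sum(axis=1) / same.sum(axis=1)
    return per_sample.mean()
```

In practice the features would come from the current SSL encoder and the scores would be refreshed every few epochs, which is what makes the sampling "online"; the brute-force kNN here would be replaced by an approximate index for large pools.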



Figure 1: (1a): Feature space uniformity of different SSL frameworks. (1b): Visualization of the alignment property of samples in minority classes and majority classes w/ or w/o COLT. The experiment is conducted with ResNet-18 on CIFAR-100-LT.
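For reference, the alignment and uniformity metrics of Wang & Isola (2020) used in the figure can be computed as follows. This is a minimal NumPy sketch with the standard hyper-parameters (alpha = 2, t = 2); lower is better for both.

```python
import numpy as np

def alignment(x, y, alpha=2):
    """Mean distance between positive pairs; x[i] and y[i] are the
    L2-normalised embeddings of two augmented views of the same image."""
    return (np.linalg.norm(x - y, axis=1) ** alpha).mean()

def uniformity(x, t=2):
    """Log of the mean Gaussian potential over all distinct pairs;
    measures how uniformly embeddings spread over the hypersphere."""
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    iu = np.triu_indices(len(x), k=1)             # distinct pairs only
    return np.log(np.exp(-t * sq_dists[iu]).mean())
```

The pairwise-distance tensor here is O(N^2) in memory, so for large evaluation sets one would compute it in chunks or subsample.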


Our code is available at https://github.com/

