ON THE EFFECTIVENESS OF OUT-OF-DISTRIBUTION DATA IN SELF-SUPERVISED LONG-TAIL LEARNING

Abstract

Though self-supervised learning (SSL) has been widely studied as a promising technique for representation learning, it does not generalize well on long-tailed datasets because the majority classes dominate the feature space. Recent work shows that long-tailed learning performance can be boosted by sampling extra in-domain (ID) data for self-supervised training; however, large-scale ID data that can rebalance the minority classes are expensive to collect. In this paper, we propose an alternative, easy-to-use, and effective solution, Contrastive with Out-of-distribution (OOD) data for Long-Tail learning (COLT), which exploits OOD data to dynamically re-balance the feature space. We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning and accordingly design a novel SSL method. Concretely, we first localize the 'head' and 'tail' samples by assigning a tailness score to each OOD sample based on its neighborhood in the feature space. Then, we propose an online OOD sampling strategy to dynamically re-balance the feature space. Finally, we enforce the model to distinguish ID and OOD samples via a distribution-level supervised contrastive loss. Extensive experiments on various datasets and several state-of-the-art SSL frameworks verify the effectiveness of the proposed method. The results show that our method improves the performance of SSL on long-tailed datasets by a large margin, and even outperforms previous work that uses external ID data.
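The first two steps above (tailness scoring and score-proportional OOD sampling) can be sketched as follows. This is an illustrative proxy rather than the paper's exact formulation: here the tailness score of an OOD sample is taken as the mean distance to its k nearest ID neighbors in feature space (a sparser ID neighborhood suggests a tail region), and OOD samples are then drawn with probability proportional to that score. The function names and the k-NN criterion are our assumptions.

```python
import numpy as np

def tailness_scores(id_feats, ood_feats, k=5):
    """Assign each OOD sample a tailness score.

    OOD samples whose k nearest ID neighbors are far away lie in sparse
    (tail-dominated) regions of the feature space and receive higher scores.
    Illustrative proxy, not the exact score used in COLT.
    """
    # Pairwise Euclidean distances between OOD and ID features: (n_ood, n_id)
    d = np.linalg.norm(ood_feats[:, None, :] - id_feats[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, :k]       # k nearest ID neighbors per OOD sample
    score = knn.mean(axis=1)              # sparser neighborhood -> higher score
    return score / score.sum()            # normalize to a sampling distribution

def sample_ood(ood_feats, scores, n, rng=None):
    """Draw n OOD sample indices with probability proportional to tailness."""
    rng = np.random.default_rng(rng)
    return rng.choice(len(ood_feats), size=n, replace=False, p=scores)
```

In an online setting, the scores would be recomputed periodically as the encoder's feature space evolves, so the sampled OOD pool tracks the current tail regions.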

1. INTRODUCTION

Self-supervised learning (SSL) methods (Chen et al., 2020; He et al., 2020; Grill et al., 2020) provide distinctive and transferable representations in an unsupervised manner. However, most SSL methods are performed on well-curated and balanced datasets (e.g., ImageNet), while many real-world datasets in practical applications, such as medical imaging and self-driving cars, follow a long-tailed distribution (Spain & Perona, 2007). Recent research (Liu et al., 2021) indicates that existing SSL methods exhibit severe performance degradation when exposed to imbalanced datasets. To enhance the robustness of SSL methods under long-tailed data, several pioneering methods (Jiang et al., 2021b; Zhou et al., 2022) have been proposed as a feasible migration of cost-sensitive learning, which is widely studied in supervised long-tail learning (Elkan, 2001; Sun et al., 2007; Cui et al., 2019b; Wang et al., 2022). The high-level intuition of these methods is to re-balance classes by adjusting loss values for different classes, i.e., forcing the model to pay more attention to tail samples. Another promising line of work explores the possibility of improving SSL methods with external data. Jiang et al. (2021a) suggest re-balancing the class distributions by sampling external in-distribution (ID) tail instances in the wild. Nevertheless, such methods still require ID samples in the sampling pool, which are hard to collect in many real-world scenarios, e.g., medical image diagnosis (Ju et al., 2021) or species classification (Miao et al., 2021).

Availability: Our code is available at https://github.com/

