EXPLORING BALANCED FEATURE SPACES FOR REPRESENTATION LEARNING

Abstract

Existing self-supervised learning (SSL) methods are mostly applied to training representation models on artificially balanced datasets (e.g., ImageNet). It is unclear how well they perform in practical scenarios where datasets are often imbalanced w.r.t. the classes. Motivated by this question, we conduct a series of studies on the performance of self-supervised contrastive learning and supervised learning methods over multiple datasets where the training instance distribution varies from balanced to long-tailed. Our findings are quite intriguing. Unlike supervised methods, which suffer a large performance drop, the self-supervised contrastive learning methods perform stably well even when the datasets are heavily imbalanced. This motivates us to explore the balanced feature spaces learned by contrastive learning, in which the feature representations present similar linear separability w.r.t. all the classes. Our further experiments reveal that a representation model producing a balanced feature space generalizes better than one yielding an imbalanced space across multiple settings. Inspired by these insights, we develop a novel representation learning method, called k-positive contrastive learning. It effectively combines the strengths of the supervised and contrastive learning methods to learn representations that are both discriminative and balanced. Extensive experiments demonstrate its superiority on multiple recognition tasks, including both long-tailed and conventional balanced ones. Code is available at https://github.com/bingykang/BalFeat.

1. INTRODUCTION

Self-supervised learning (SSL) has been widely explored as it can learn data representations without requiring manual annotations and offers the attractive potential of leveraging the vast amount of unlabeled data in the wild to obtain strong representation models (Gidaris et al., 2018; Noroozi & Favaro, 2016; He et al., 2020; Chen et al., 2020a; Wu et al., 2018). For instance, some recent SSL methods (Hénaff et al., 2019; Oord et al., 2018; Hjelm et al., 2018; He et al., 2020) use the unsupervised contrastive loss (Hadsell et al., 2006) to train representation models by maximizing instance discriminativeness; these models are shown to generalize well across various downstream tasks, and even surpass their supervised learning counterparts in some cases (He et al., 2020; Chen et al., 2020a). Despite this great success, existing SSL methods focus on learning data representations from artificially balanced datasets (e.g., ImageNet (Deng et al., 2009)) where all the classes have similar numbers of training instances. In reality, however, since the classes in natural images follow a Zipfian distribution, datasets are usually imbalanced and exhibit a long-tailed distribution (Zipf, 1999; Spain & Perona, 2007), i.e., some classes have significantly fewer training instances than others. Such imbalanced datasets are very challenging for supervised learning methods to model, leading to noticeable performance drops (Wang et al., 2017; Mahajan et al., 2018; Zhong et al., 2019). Thus several interesting questions arise: How well will SSL methods perform on imbalanced datasets? Will the quality of their learned representations deteriorate as it does for supervised learning methods, or can they perform stably well? Answering these questions is important for understanding the behavior of SSL in practice, yet they remain open, as no research investigation has been conducted along this direction so far.
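The unsupervised contrastive loss mentioned above can be illustrated with a minimal InfoNCE-style sketch. This is our own simplified NumPy illustration for a single anchor, not the exact formulation of any cited method; the function name and the temperature value are illustrative choices.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style contrastive loss for a single anchor embedding.

    anchor, positive: (d,) L2-normalized embeddings of two augmented
                      views of the same instance.
    negatives:        (n, d) L2-normalized embeddings of other instances.
    """
    # Similarity of the anchor to its positive and to every negative.
    logits = np.concatenate(([anchor @ positive], negatives @ anchor)) / tau
    logits = logits - logits.max()  # subtract max for numerical stability
    # Cross-entropy with the positive treated as the "correct class".
    return float(np.log(np.exp(logits).sum()) - logits[0])
```

Minimizing this loss pulls the two views of the same instance together while pushing other instances apart, which is the instance-discrimination objective the cited methods build on.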

Our work is motivated by the above questions to study the properties of data representations learned with supervised and self-supervised methods in a practical scenario. We start with two representative losses used by these methods, i.e., the supervised cross-entropy loss and the unsupervised contrastive loss (Hadsell et al., 2006; Oord et al., 2018), and investigate the classification performance of representation models trained on multiple datasets whose instance distributions gradually vary from balanced to long-tailed. We surprisingly observe that, unlike models learned with the supervised cross-entropy loss, whose performance drops quickly, representation models learned with the unsupervised contrastive loss perform stably well, no matter how heavily the training instance distribution is skewed. Such a stark difference between the two representation learning methods drives us to explore why SSL performs so stably. We find that using the contrastive loss yields representation models generating a balanced feature space that has similar separability (and classification performance) for all the classes, as illustrated in Figure 1. Such a balanced property of the feature spaces from SSL is intriguing and provides a new perspective to understand the behavior of SSL methods. We dig deeper into its benefits via a systematic study. In particular, since a pre-trained representation model is often used as initialization for downstream tasks (He et al., 2020; Newell & Deng, 2020; Hénaff et al., 2019), we evaluate and compare the generalization ability of models that produce feature spaces of different balanced levels (or 'balancedness'). We find that a more balanced model tends to generalize better across a variety of settings, including out-of-distribution recognition as well as cross-domain and cross-task applications.
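One simple way to quantify the 'balancedness' discussed above is the spread of per-class accuracies of a classifier trained on the frozen features: similar per-class accuracies indicate similar separability for all classes. The helper below is our own illustrative proxy, not the paper's exact evaluation protocol.

```python
import numpy as np

def per_class_accuracy(preds, labels, num_classes):
    """Accuracy computed separately for each class."""
    return np.array([(preds[labels == c] == c).mean()
                     for c in range(num_classes)])

def balancedness(preds, labels, num_classes):
    """Mean and spread of per-class accuracies on frozen features.

    A small standard deviation indicates the feature space is similarly
    separable for every class, i.e. more 'balanced'; a large one
    indicates the space is biased toward some (typically head) classes.
    """
    acc = per_class_accuracy(preds, labels, num_classes)
    return float(acc.mean()), float(acc.std())
```

Under this proxy, two models with the same overall accuracy can differ sharply in balancedness when one concentrates its errors on tail classes.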
These studies imply that feature space balancedness is an important but often neglected factor for learning high-quality representations. Inspired by the above insights, we propose a new representation learning method, k-positive contrastive learning, which inherits the strength of contrastive learning in producing balanced feature spaces while improving their discriminative capability. Specifically, unlike contrastive learning methods that lack semantic discriminativeness, the proposed k-positive contrastive method leverages the available instance semantic labels by taking k instances sharing the anchor instance's label as positives, thereby embedding semantics into the contrastive loss. As such, it can learn representations with the desired balancedness and discriminativeness (Figure 1). Extensive experiments and analyses clearly demonstrate its superiority over supervised learning and the latest contrastive learning methods (He et al., 2020) on various recognition tasks, including visual recognition in both long-tailed settings (e.g., ImageNet-LT, iNaturalist) and balanced settings.

This work makes the following important observations and contributions. (1) We present the first systematic studies on the performance of self-supervised contrastive learning on imbalanced datasets, which are helpful for understanding the merits and limitations of SSL in practice. (2) Our studies reveal an intriguing property of models trained by contrastive learning, never discussed before: such models robustly learn balanced feature spaces. (3) Our empirical analysis demonstrates that learning balanced feature spaces benefits the generalization of representation models and offers a new perspective for understanding deep model generalizability. (4) We develop a new method that explicitly pursues balanced feature spaces for representation learning and outperforms popular methods based on the cross-entropy and contrastive losses.
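The k-positive idea described above can be sketched as follows. This is a simplified in-memory illustration under our own assumptions (plain feature arrays rather than a momentum queue, and hypothetical variable names); it is not the paper's exact implementation.

```python
import numpy as np

def k_positive_contrastive_loss(anchor, anchor_label, feats, labels,
                                k=2, tau=0.07, rng=None):
    """k-positive contrastive loss for one anchor (illustrative sketch).

    Instead of a single augmented view, k instances sharing the anchor's
    label are sampled as positives; every instance in `feats` appears in
    the softmax denominator, as in standard contrastive learning.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Sample (at most) k same-class instances as positives.
    pos_idx = np.flatnonzero(labels == anchor_label)
    chosen = rng.choice(pos_idx, size=min(k, pos_idx.size), replace=False)
    # Similarities of the anchor to all instances in the contrast set.
    logits = feats @ anchor / tau
    logits = logits - logits.max()          # numerical stability
    log_denom = np.log(np.exp(logits).sum())
    # Average the InfoNCE-style term over the k sampled positives.
    return float(-(logits[chosen] - log_denom).mean())
```

Fixing the number of positives to k for every anchor, regardless of how many same-class instances exist, is what keeps the per-class pull uniform and the learned space balanced even on long-tailed data.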
We believe our findings and the novel k-positive contrastive method are inspiring for future research on representation learning.



Figure 1: Feature spaces learned with different losses given an imbalanced dataset. The supervised cross-entropy (CE) loss learns a space biased toward the dominant class. The space learned by the unsupervised contrastive loss is balanced but less semantically discriminative. Our proposed k-positive contrastive loss learns a balanced and discriminative feature space. The shaded area indicates the decision boundary of each class.

