BEYOND RE-BALANCING: DISTRIBUTIONALLY ROBUST AUGMENTATION AGAINST CLASS-CONDITIONAL DISTRIBUTION SHIFT IN LONG-TAILED RECOGNITION

Abstract

As a fundamental and practical problem, long-tailed recognition has drawn increasing attention. In this paper, we investigate an essential but rarely noticed issue in long-tailed recognition, Class-Conditional Distribution (CCD) shift due to scarce instances, which manifests as a significant discrepancy between the empirical CCDs of training and test data, especially for tail classes. We present empirical evidence that this shift is a key factor limiting the performance of existing long-tailed learning methods, and provide a novel understanding of these methods in the course of our analysis. Motivated by this, we propose an adaptive data augmentation method, Distributionally Robust Augmentation (DRA), to learn models more robust to CCD shift. A new generalization bound under mild conditions shows that the objective of DRA partially bounds the balanced risk on the test distribution. Experimental results verify that DRA outperforms related data augmentation methods without extra training cost and significantly improves the performance of several existing long-tailed recognition methods.

1. INTRODUCTION

Recently, visual recognition has achieved significant progress, driven by the development of deep neural networks (He et al., 2016) as well as large-scale datasets (Russakovsky et al., 2015). However, in contrast with manually balanced datasets, real-world data often has a long-tailed distribution over classes, i.e., a few classes contain many instances (head classes), whereas most classes contain only a few instances (tail classes) (Liu et al., 2019; Van Horn & Perona, 2017). Training models on long-tailed datasets usually leads to degenerate results, including over-preference for head classes, undesired estimation bias, and poor generalization (Zhou et al., 2020; Cao et al., 2019; Kang et al., 2019). To solve the above issues, various solutions have been proposed. Many of them focus on addressing the imbalanced label distribution to simulate class-balanced model training. Direct re-balancing, such as re-sampling and re-weighting, is the most intuitive (Huang et al., 2016; Zhang et al., 2021b). Recently, two-stage methods, which apply a re-balancing strategy when tuning the classifier (Kang et al., 2019) or defer re-weighting until after initialization (Cao et al., 2019), have been verified to be effective. Logit adjustment uses margin-based losses or post-hoc adjustment to rectify the biased predictions caused by the long-tailed distribution (Menon et al., 2020; Ren et al., 2020; Hong et al., 2021). Formally, denoting an input-label pair as (x, y), classification or recognition models are trained to estimate the posterior probability P (y|x) ∝ P (y)P (x|y). In long-tailed recognition scenarios, most solutions actually obey the following assumption: the class distribution P (y) shifts from training to test (usually class-imbalanced in training but class-balanced in test), while the class-conditional distribution (CCD) P (x|y) remains consistent, i.e., P train (y) ̸= P test (y) and P train (x|y) = P test (x|y) (Menon et al., 2020; Ren et al., 2020).
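To make the label-shift assumption concrete: when P train (y) ̸= P test (y) but P (x|y) is shared, post-hoc logit adjustment (Menon et al., 2020) corrects a trained model by subtracting τ · log P train (y) from its logits. A minimal sketch (the function name and toy numbers are ours, purely illustrative):

```python
import numpy as np

def posthoc_logit_adjustment(logits, class_priors, tau=1.0):
    """Subtract tau * log P_train(y) from the logits so that argmax
    approximates the class-balanced posterior (Menon et al., 2020)."""
    return logits - tau * np.log(class_priors)

# Toy 3-class long-tailed problem: raw scores slightly favour the head class.
priors = np.array([0.90, 0.08, 0.02])   # training label distribution P_train(y)
logits = np.array([2.0, 1.9, 1.8])      # raw model scores
adjusted = posthoc_logit_adjustment(logits, priors)
print(np.argmax(logits), np.argmax(adjusted))  # prints: 0 2
```

After adjustment the prediction flips from the head class to the tail class, since the tail class receives the largest correction −log 0.02 ≈ 3.9.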
Under this assumption, a series of methods, including direct re-balancing and logit adjustment, have been proved Fisher-consistent (Menon et al., 2020). We argue that although the consistent-CCD assumption (Menon et al., 2020) is reasonable if there is no sampling bias within each class, estimating P (x|y) by the empirical CCD is unreliable, especially for tail classes where samples are extremely scarce. Therefore, to obtain a generalizable model, the shift between the empirical CCD and the ideal CCD cannot be ignored. Our focus does not overlap with existing methods that attend to scarce tail instances or inconsistent P (x|y). Transfer learning and data augmentation have proven effective from the motivation of increasing the diversity of tail classes (Kim et al., 2020; Zhong et al., 2021; Zhou et al., 2022), but they are still possibly biased by the unreliable empirical distribution and usually lack theoretical guarantees. Some recent works focus on inconsistent class-conditional distributions caused by domain bias or attribute-wise imbalance (Gu et al., 2022; Tang et al., 2022), which also suffer from shift due to unreliable estimation. Nevertheless, the influence of CCD shift has not been thoroughly investigated, and no effective solution has been proposed yet. In this work, we perform an empirical study to quantify the effect that the shift of P (x|y) has on long-tailed recognition, by using CCDs from balanced datasets as an oracle to alleviate CCD shift. With this oracle, the performance of existing methods is significantly improved, as shown in Figure 1, which indicates that CCD shift is a key factor limiting the performance of long-tailed learning. From the CCD-shift perspective, we also give new insights into counter-intuitive findings about previous methods, e.g., why decoupling methods (Kang et al., 2019) work and why the Fisher-consistent parameter in logit adjustment (Menon et al., 2020) yields sub-optimal performance.
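Why tail-class empirical CCDs are unreliable can be seen in a toy simulation (ours, not from the paper): even when the true CCD is a simple Gaussian, the error of the empirical class mean grows as the per-class sample size shrinks, roughly as 1/√n per dimension.

```python
import numpy as np

# Illustrative simulation: the empirical class-conditional mean drifts
# further from the true mean as the per-class sample size shrinks,
# mirroring why tail-class empirical CCDs suffer larger shift.
rng = np.random.default_rng(0)
dim = 64                                  # a hypothetical feature dimension
true_mean = np.zeros(dim)
errors = {}
for n in (5000, 500, 5):                  # head -> tail per-class sample sizes
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n, dim))
    errors[n] = np.linalg.norm(samples.mean(axis=0) - true_mean)
print(errors)                             # error grows as n shrinks
```

With only 5 samples the estimated mean is far from the true one, whereas with 5000 samples it is nearly exact; this is the gap the oracle experiment in Figure 1 removes.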
Motivated by our empirical study, to enhance robustness against CCD shift, we propose Distributionally Robust Augmentation (DRA), which assigns class-aware robustness, generalizes Sinha et al. (2017), and admits a novel generalization bound. This bound verifies that models robust to CCD shift benefit long-tailed recognition. Our experiments show that DRA improves various existing methods significantly, validating our theoretical insights. Our main contributions are highlighted as follows:

• We identify a rarely noticed but essential issue in long-tailed recognition, class-conditional distribution (CCD) shift, and provide new insights from the CCD-shift view into some existing methods.

2. RELATED WORK

2.1. LONG-TAILED RECOGNITION

Re-balancing methods. In this section, we review a broader scope of re-balancing methods: we regard methods that address the prior shift of P (y) as re-balancing methods. Long-tailed learning can be considered a special label-shift problem whose test label distribution is known, while the



Figure 1: Accuracy on CIFAR10-LT with and without removing CCD shift. All methods show significant improvement after removing the shift, and the improvement mainly appears in classes with fewer instances, which verifies that the empirical CCDs of tail classes are less reliable. Shaded regions show 95% CIs over 5 runs.

• To train models robust to CCD shift, we propose DRA with theoretically sound modifications over prior DRO methods (Sinha et al., 2017); it admits a novel generalization bound verifying that models robust to CCD shift benefit long-tailed recognition.

• Extensive experiments on long-tailed recognition show the effectiveness of DRA: it significantly improves existing re-balancing methods and achieves performance comparable to the state of the art. Moreover, DRA outperforms related data augmentation methods without additional training costs.
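For intuition on the distributionally robust training that DRA builds on: the Lagrangian surrogate of Sinha et al. (2017) replaces each training point x with an adversarially perturbed one found by ascending loss(x+δ) − γ‖δ‖². A class-dependent γ (smaller for tail classes, allowing larger perturbations) is our hypothetical illustration of "class-aware robustness," not DRA's actual algorithm. A numpy sketch for a linear softmax classifier:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(W, x, y):
    """Cross-entropy loss of a linear softmax classifier on one example."""
    return -np.log(softmax(W @ x)[y])

def robust_augment(W, x, y, gamma, steps=10, lr=0.1):
    """Approximate argmax_delta [ ce_loss(W, x+delta, y) - gamma*||delta||^2 ]
    by gradient ascent -- the inner problem of Sinha et al. (2017).
    A smaller gamma permits larger perturbations, i.e. more robustness."""
    delta = np.zeros_like(x)
    onehot = np.eye(W.shape[0])[y]
    for _ in range(steps):
        p = softmax(W @ (x + delta))
        grad = W.T @ (p - onehot)          # d ce_loss / d input
        delta += lr * (grad - 2.0 * gamma * delta)
    return x + delta

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                # toy 3-class linear model
x = rng.normal(size=4)
counts = np.array([500, 50, 5])            # hypothetical class frequencies
gammas = counts / counts.max()             # tail classes get smaller gamma
x_aug = robust_augment(W, x, y=2, gamma=gammas[2])
```

Training would then proceed on the augmented points x_aug; the smaller γ for tail classes widens the neighborhood of the empirical CCD against which the model is made robust.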

