LONG-TAILED RECOGNITION BY ROUTING DIVERSE DISTRIBUTION-AWARE EXPERTS

Abstract

Natural data are often long-tail distributed over semantic classes. Existing recognition methods tackle this imbalanced classification by placing more emphasis on the tail data, through class re-balancing/re-weighting or ensembling over different data groups, resulting in increased tail accuracies but reduced head accuracies. We take a dynamic view of the training data and provide a principled model bias and variance analysis as the training data fluctuate: existing long-tail classifiers invariably increase the model variance, and the head-tail model bias gap remains large, due to more and larger confusion with hard negatives for the tail. We aim to reduce both the bias and the variance of a long-tailed classifier by RoutIng Diverse Experts (RIDE): it reduces the model variance with multiple experts, reduces the model bias with a distribution-aware diversity loss, and reduces the computational cost with a dynamic expert routing module. RIDE outperforms the state-of-the-art by 5% to 7% on the CIFAR100-LT, ImageNet-LT, and iNaturalist 2018 benchmarks. It is also a universal framework that is applicable to various backbone networks, long-tailed algorithms, and training mechanisms for consistent performance gains.

1. INTRODUCTION

Real-world data are often long-tail distributed over semantic classes: A few classes contain many instances, whereas most classes contain only a few. Long-tailed recognition is challenging, as it must handle not only a multitude of small-data learning problems on the tail classes, but also extremely imbalanced classification over all the classes. There are two ways to prevent the many head instances from overwhelming the few tail instances in the classifier training objective: 1) class re-balancing/re-weighting, which gives more importance to tail instances (Cao et al., 2019; Kang et al., 2020; Liu et al., 2019); 2) ensembling over different data distributions, which re-organizes long-tailed data into groups, trains a model per group, and then combines the individual models in a multi-expert framework (Zhou et al., 2020; Xiang et al., 2020).

We compare three state-of-the-art (SOTA) long-tail classifiers against the standard cross-entropy (CE) classifier: cRT and τ-norm (Kang et al., 2020), which adopt a two-stage optimization of representation learning followed by classifier learning, and LDAM (Cao et al., 2019), which is trained end-to-end with a margin loss. In terms of classification accuracy, a common metric for model selection on a fixed training set, Fig. 1a shows that all these existing long-tail methods increase the overall, medium- and few-shot accuracies over CE, but decrease the many-shot accuracy. These intuitive solutions and their experimental results seem to suggest that there is a head-tail performance trade-off in long-tailed recognition. We need a principled performance analysis that could shed light on such a limitation, if it exists, and provide guidance on how to overcome it. Our insight comes from a dynamic view of the training set: It is merely a sample set of some underlying data distribution.
Instead of evaluating how a long-tailed classifier performs on the fixed training set, we evaluate how it performs as the training set fluctuates according to the data distribution.

Anonymous authors

Paper under double-blind review

The natural data we encounter in practice often have a long-tailed distribution: A few classes contain many instances, while most classes contain only a few. Learning discrimination among them is challenging, as the few tail instances can easily be overwhelmed by the many head instances. Long-tailed recognition is usually handled either by class re-balancing/re-weighting strategies that give more importance to tail instances (Cao et al., 2019; Kang et al., 2020; Liu et al., 2019), or by multi-expert methods, where long-tailed data are separated into parts by their frequencies and models focusing on individual parts are combined (Zhou et al., 2020; Xiang et al., 2020). However, all these methods generally gain on tail classes at the cost of performance loss on head classes. The SOTA methods on iNaturalist (Van Horn et al., 2018) are cRT and τ-norm (Kang et al., 2020) and BBN (Zhou et al., 2020). The former two belong to the re-balancing type with a two-stage optimization for learning a good representation and classifier, whereas the latter belongs to the multi-expert type with two experts focusing on head and tail classes.
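To make the re-balancing/re-weighting family concrete, here is a minimal sketch of inverse-frequency re-weighted cross-entropy. The function name, weighting scheme, and normalization are our own illustrative choices, not the exact formulation of any method cited above:

```python
import numpy as np

def reweighted_ce(logits, labels, class_counts):
    """Cross-entropy with per-class weights inversely proportional to
    class frequency, so tail classes contribute larger gradients.
    Illustrative re-weighting baseline only."""
    counts = np.asarray(class_counts, dtype=float)
    # inverse-frequency weights, normalized so uniform counts give weight 1
    weights = counts.sum() / (len(counts) * counts)
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels] * weights[labels]
    return per_sample.mean()
```

With a long-tailed count vector, a misclassified tail instance dominates the loss, which is exactly the mechanism that boosts tail accuracy while risking the head accuracy drop discussed above.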
For the above L2 loss on regression h(x) → Y, the model bias measures the accuracy of the prediction with respect to the true value, the model variance measures the stability of the prediction, and the irreducible error measures the inherent noise in the output and is irrelevant to the model h. Empirically, for n random sample sets of data, D^{(1)}, ..., D^{(n)}, the k-th model trained on D^{(k)} predicts y^{(k)} on instance x, and collectively they have a mean prediction y_m. For the L2 regression loss, the model bias is simply the L2 loss between y_m and the ground truth t = E[Y], whereas the model variance is the variance of the y^{(k)} with respect to their mean y_m:

L2 regression loss: L(y; z) = (y - z)^2   (2)

mean prediction: y_m = \frac{1}{n} \sum_{k=1}^{n} y^{(k)} = \arg\min_z E_D[L(h(x); z)]   (3)

model bias: Bias(x; h) = (y_m - t)^2 = L(y_m; t)   (4)

model variance: Variance(x; h) = \frac{1}{n} \sum_{k=1}^{n} (y^{(k)} - y_m)^2 = E_D[L(h(x); y_m)]   (5)

As shown above, these concepts can be expressed entirely in terms of the loss L. We can thus extend them to classification (Domingos, 2000) by replacing L with the 0-1 classification loss:

0-1 classification loss: L_{0-1}(y; z) = 0 if y = z, and 1 otherwise.   (6)

The mean prediction y_m then minimizes \sum_{k=1}^{n} L_{0-1}(y^{(k)}; y_m) and becomes the most frequent, or main, prediction. The bias and variance terms become L_{0-1}(y_m; t) and \frac{1}{n} \sum_{k=1}^{n} L_{0-1}(y^{(k)}; y_m), respectively. We apply this bias and variance analysis to the CE and long-tail classifiers: we sample CIFAR100 (Krizhevsky, 2009) according to a long-tail distribution multiple times and, for each method, train a model on each sample set.
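The 0-1 loss definitions above translate directly into a few lines of code. The following sketch (function name and interface are ours) computes the main prediction, bias, and variance for one instance from the predictions of n independently trained models:

```python
import numpy as np
from collections import Counter

def bias_variance_01(preds, t):
    """Empirical bias/variance under the 0-1 loss (Domingos, 2000).
    preds: predictions y^(k) for one instance x from n models, each
    trained on an independent sample set D^(k); t: ground-truth class."""
    y_m = Counter(preds).most_common(1)[0][0]  # main (most frequent) prediction
    bias = 0 if y_m == t else 1                # L01(y_m; t)
    variance = float(np.mean([p != y_m for p in preds]))  # mean_k L01(y^(k); y_m)
    return y_m, bias, variance
```

For example, predictions [1, 1, 1, 2, 3] with ground truth 1 give zero bias (the main prediction is correct) but nonzero variance (two of five models disagree with the main prediction).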



We analyze the performance of a long-tail classifier in terms of bias and variance with respect to fluctuations in the training set: we randomly sample CIFAR100 (Krizhevsky, 2009) according to a long-tailed distribution several times, train a model each time, and then estimate the per-class bias and variance of the classifier. Fig. 1a compares the mean accuracy and the per-class bias and variance of the baselines and our RIDE method; better (worse) metrics than the distribution-unaware cross-entropy (CE) reference are marked in green (red). Fig. 1b shows histograms of the largest softmax score over the other classes (the hardest negative) per instance.

Figure 1: Our method RIDE outperforms SOTA by reducing both model bias and variance. a) These metrics are evaluated over 20 independently trained models, each on a randomly sampled set of CIFAR100 with an imbalance ratio of 100 and 300 samples for class 0. Compared to the standard CE classifier, existing SOTA methods almost always increase the variance, and some reduce the tail bias at the cost of increasing the head bias. b) The metrics are evaluated over CIFAR100-LT (Liu et al., 2019). LDAM is more likely to confuse the tail (rather than head) classes with the hardest negative class, with an average score of 0.59. RIDE with LDAM greatly reduces the confusion with the nearest negative class, especially for samples from the few-shot categories.

Consider the training data D as a random variable. The prediction error of model h on instance x with output Y varies with the realization of D. The expected error with respect to D has a well-known bias-variance decomposition:

Error(x; h) = E[(h(x; D) - Y)^2] = Bias(x; h) + Variance(x; h) + irreducible error(x).   (1)
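The decomposition in Eq. (1) can be checked numerically. Below is a small Monte-Carlo sketch on a toy setup of our own (not the paper's experiment): a deliberately biased estimator predicts 0.8 times the training-sample mean of a noisy scalar Y, and the measured error matches the sum of bias, variance, and irreducible noise:

```python
import numpy as np

rng = np.random.default_rng(0)
t, sigma, m, n_trials = 5.0, 1.0, 20, 200_000  # true mean, noise std, train size, #sample sets

# Each row is one training set D^(k): m draws of Y = t + noise.
train = t + sigma * rng.standard_normal((n_trials, m))
preds = 0.8 * train.mean(axis=1)  # deliberately biased model: shrinks toward 0

y_m = preds.mean()                # mean prediction over sample sets
bias = (y_m - t) ** 2             # Bias(x; h)
variance = preds.var()            # Variance(x; h)
irreducible = sigma ** 2          # Var(Y): noise no model can remove

y_test = t + sigma * rng.standard_normal(n_trials)  # fresh draws of Y
error = ((preds - y_test) ** 2).mean()              # E[(h(x; D) - Y)^2]

# Eq. (1): Error = Bias + Variance + irreducible error
assert abs(error - (bias + variance + irreducible)) < 0.05
```

Here the bias term is (0.8t - t)^2 = 1.0 and the variance term is 0.64·sigma^2/m, so neither vanishes, yet the identity still holds, which is what makes the decomposition a useful diagnostic for comparing classifiers.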

Availability: https://github.com/frank-xwang/

