LONG-TAILED RECOGNITION BY ROUTING DIVERSE DISTRIBUTION-AWARE EXPERTS

Abstract

Natural data are often long-tail distributed over semantic classes. Existing recognition methods tackle this imbalanced classification by placing more emphasis on the tail data, through class re-balancing/re-weighting or ensembling over different data groups, resulting in increased tail accuracies but reduced head accuracies. We take a dynamic view of the training data and provide a principled model bias and variance analysis as the training data fluctuates: Existing long-tail classifiers invariably increase the model variance, and the head-tail model bias gap remains large, due to more and larger confusion with hard negatives for the tail. Our proposed method, RIDE, reduces the model variance with multiple experts, reduces the model bias with a distribution-aware diversity loss, and reduces the computational cost with a dynamic expert routing module. RIDE outperforms the state-of-the-art by 5% to 7% on CIFAR100-LT, ImageNet-LT and iNaturalist 2018 benchmarks. It is also a universal framework that is applicable to various backbone networks, long-tailed algorithms, and training mechanisms for consistent performance gains.

1. INTRODUCTION

Real-world data are often long-tail distributed over semantic classes: a few classes contain many instances, whereas most classes contain only a few. Long-tailed recognition is challenging, as it must handle not only a multitude of small-data learning problems on the tail classes, but also extremely imbalanced classification over all the classes. There are two ways to prevent the many head instances from overwhelming the few tail instances in the classifier training objective: 1) class re-balancing/re-weighting, which gives more importance to tail instances (Cao et al., 2019; Kang et al., 2020; Liu et al., 2019); 2) ensembling over different data distributions, which re-organizes long-tailed data into groups, trains a model per group, and then combines the individual models in a multi-expert framework (Zhou et al., 2020; Xiang et al., 2020).

We compare three state-of-the-art (SOTA) long-tail classifiers against the standard cross-entropy (CE) classifier: cRT and τ-norm (Kang et al., 2020), which adopt a two-stage optimization, first representation learning and then classification learning, and LDAM (Cao et al., 2019), which is trained end-to-end with a margin loss. In terms of classification accuracy, a common metric for model selection on a fixed training set, Fig. 1a shows that all these existing long-tail methods increase the overall, medium- and few-shot accuracies over CE, but decrease the many-shot accuracy. These intuitive solutions and their experimental results seem to suggest that there is a head-tail performance trade-off in long-tailed recognition. We need a principled performance analysis approach that could shed light on such a limitation, if it exists, and provide guidance on how to overcome it.

Our insight comes from a dynamic view of the training set: it is merely a sample set of some underlying data distribution.
Instead of evaluating how a long-tailed classifier performs on the fixed training set, we evaluate how it performs as the training set fluctuates according to the data distribution.
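To make the first family of methods concrete, a minimal sketch of generic class re-weighted cross-entropy is given below. This is inverse-frequency weighting only, not the exact recipe of any cited method (e.g. LDAM uses a margin loss and cRT uses two-stage training); the function names are illustrative.

```python
import numpy as np

def inverse_frequency_weights(counts):
    """Per-class weights for re-weighted cross-entropy: rarer classes
    get larger weights. Normalized so the weights have mean 1.
    Generic inverse-frequency scheme, not a specific paper's recipe."""
    counts = np.asarray(counts, dtype=float)
    w = 1.0 / counts
    return w / w.mean()

def reweighted_ce(logits, label, weights):
    """Cross-entropy loss for one example, scaled by its class weight,
    so a tail-class mistake costs more than a head-class mistake."""
    z = logits - logits.max()               # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -weights[label] * log_probs[label]
```

For a long-tailed count vector such as `[100, 10, 1]`, the tail class receives roughly 100x the head-class weight, which is exactly the head/tail emphasis trade-off the text describes.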
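This dynamic view can be operationalized: retrain a classifier on bootstrap resamples of the training set and measure how its test predictions shift. The sketch below uses a toy nearest-centroid classifier as a stand-in for a real model; all names are illustrative, and the bias/variance proxies (majority-vote error and disagreement with the majority vote) are one common decomposition, not necessarily the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(X, y, n_classes):
    """Toy nearest-centroid 'classifier' standing in for a real model.
    A class absent from a resample gets an unreachable (inf) centroid."""
    return np.stack([X[y == c].mean(axis=0) if np.any(y == c)
                     else np.full(X.shape[1], np.inf)
                     for c in range(n_classes)])

def predict(centroids, X):
    # assign each test point to its nearest class centroid
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def bias_variance(X, y, X_test, y_test, n_classes, n_boot=20):
    """Retrain on bootstrap resamples of the training set and report
    (i) the error of the majority-vote prediction (a bias proxy) and
    (ii) how often individual models disagree with it (a variance proxy)."""
    preds = np.stack([
        predict(fit_centroids(X[i], y[i], n_classes), X_test)
        for i in (rng.integers(0, len(X), len(X)) for _ in range(n_boot))
    ])                                          # shape (n_boot, n_test)
    majority = np.apply_along_axis(
        lambda p: np.bincount(p, minlength=n_classes).argmax(), 0, preds)
    bias = (majority != y_test).mean()          # systematic error
    variance = (preds != majority).mean()       # sensitivity to resampling
    return bias, variance
```

On imbalanced data, tail-class centroids move far more across resamples than head-class ones, which is the fluctuation the paper's analysis targets.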

Code availability: https://github.com/frank-xwang/

