ASYNCHRONOUS MODELING: A DUAL-PHASE PERSPECTIVE FOR LONG-TAILED RECOGNITION

Abstract

This work explores deep learning based classification models on real-world datasets with a long-tailed distribution. Most previous works address the long-tailed classification problem by re-balancing the overall distribution within the whole dataset or by directly transferring knowledge from data-rich classes to data-poor ones. In this work, we consider the gradient distortion that arises in long-tailed classification when the gradients on data-rich and data-poor classes are incorporated simultaneously, i.e., the overall gradient direction is shifted towards the data-rich classes and its variance is enlarged by the gradient fluctuation on the data-poor classes. Motivated by this phenomenon, we propose to disentangle the distinctive effects of the data-rich and data-poor gradients and asynchronously train a model via a dual-phase learning process. The first phase only concerns the data-rich classes. In the second phase, besides the standard classification on the data-poor classes, we propose an exemplar memory bank to reserve representative examples and a memory-retentive loss based on graph matching to retain the relation between the two phases. Extensive experimental results on four commonly used long-tailed benchmarks, including CIFAR100-LT, Places-LT, ImageNet-LT and iNaturalist 2018, highlight the excellent performance of our proposed method.

1. INTRODUCTION

Past years have witnessed huge progress in visual recognition with the successful application of deep convolutional neural networks (CNNs) on large-scale datasets, e.g., ImageNet ILSVRC 2012 (Russakovsky et al., 2015) and Places (Zhou et al., 2017). Such datasets are usually artificially collected and exhibit an approximately uniform distribution over the number of samples in each class. Real-world datasets, however, are often long-tailed: a few classes occupy the majority of instances in the dataset (data-rich), while most classes have very few samples (data-poor) (Reed, 2001; Van Horn & Perona, 2017). When modeling such datasets, many standard methods suffer severe degradation of overall performance. More specifically, the recognition ability on classes with very few instances is significantly impaired (Liu et al., 2019). One prominent direction is to apply class re-sampling or loss re-weighting to balance the influence of different classes (Byrd & Lipton, 2019; Shu et al., 2019); another alternative is knowledge transfer (Wang et al., 2017; Liu et al., 2019), under the assumption that knowledge obtained on the data-rich classes should benefit the recognition of data-poor classes. Recently, more sophisticated models have been designed, either based on new findings (Zhou et al., 2020; Kang et al., 2020) or by combining all available techniques (Zhu & Yang, 2020). However, the nature of the long-tailed setting still makes it difficult to achieve large gains compared to balanced datasets.
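The gradient shift mentioned above can be illustrated with a toy calculation: when per-sample gradients from a head class vastly outnumber those from a tail class, the mini-batch average aligns almost entirely with the head direction. The 2-D setup below is a hypothetical sketch for intuition only, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: per-sample gradients are noisy copies of a class-specific
# mean direction. The head class has 900 samples, the tail class only 10.
def class_grads(mean_dir, n, noise):
    return mean_dir + noise * rng.standard_normal((n, 2))

head = class_grads(np.array([1.0, 0.0]), 900, 0.1)  # head gradient direction
tail = class_grads(np.array([0.0, 1.0]), 10, 0.1)   # tail gradient direction

# The mini-batch gradient is the average over all samples; with a 90:1
# imbalance it is dominated by the head direction [1, 0].
batch = np.vstack([head, tail]).mean(axis=0)
print(batch)
```

The tail class contributes only 10/910 of the average, so its direction is almost invisible in the combined gradient, while its large relative fluctuation still inflates the variance between batches.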
In contrast to the aforementioned strategies, we approach the long-tailed recognition problem by analyzing gradient distortion in long-tailed data, which we attribute to the interaction between the gradients generated by data-rich and data-poor classes, i.e., the direction of the overall gradient is shifted closer to the gradient on data-rich classes, and its norm variance is increased by the dramatic variation in the gradient generated by data-poor classes. The degenerated performance compared with balanced datasets indicates that this gradient distortion is harmful during model training. Motivated by this, we hypothesize that jointly accumulating the gradients generated by data-rich and data-poor classes could be improper for long-tailed data, and we attempt to disentangle these two gradients. We thus propose the concept of asynchronous modeling and split the original network to promote a dual-phase learning process, along with a partition of the given dataset. In phase I, the data-rich classes keep the bulk of the original dataset. This facilitates better local representation learning and more precise classifier boundary determination by eliminating the negative gradient interaction produced by data-poor classes. Based on the model learned in phase I, we involve the remaining data to explore new boundaries in the second phase. While transitioning from the first phase to the second, we aim to preserve the knowledge learned in the first phase. Specifically, we design an exemplar memory bank and introduce a memory-retentive loss. The memory bank reserves a few of the most prominent examples from the classes in the first phase and collaborates with the data in the second phase for classification. This collaborated data, together with the new memory-retentive loss, helps preserve old knowledge as the model adapts to new classes in the second phase.
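A minimal sketch of the dual-phase schedule might look as follows. It heavily simplifies the components described above: exemplar selection is just the first k samples per head class rather than a principled selection, and the memory-retentive loss is replaced by a plain logit-distillation penalty instead of graph matching. All names, data, and hyperparameters are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_class(center, n):
    # Synthetic 2-D Gaussian cluster for one class.
    return torch.randn(n, 2) * 0.3 + torch.tensor(center)

# Data-rich classes 0 and 1 (200 samples each); data-poor class 2 (10 samples).
head_x = torch.cat([make_class([2., 0.], 200), make_class([-2., 0.], 200)])
head_y = torch.cat([torch.zeros(200), torch.ones(200)]).long()
tail_x = make_class([0., 2.], 10)
tail_y = torch.full((10,), 2).long()

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 3))

# Phase I: train only on the data-rich classes.
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(200):
    opt.zero_grad()
    F.cross_entropy(model(head_x), head_y).backward()
    opt.step()

# Exemplar memory bank: k representatives per phase-I class
# (a real implementation would pick them by a selection criterion).
k = 10
mem_x = torch.cat([head_x[head_y == c][:k] for c in (0, 1)])
mem_y = torch.cat([head_y[head_y == c][:k] for c in (0, 1)])
with torch.no_grad():
    old_logits = model(mem_x)  # snapshot used by the retention term

# Phase II: tail data plus exemplars, with a retention penalty that
# stands in for the paper's graph-matching memory-retentive loss.
x2, y2 = torch.cat([tail_x, mem_x]), torch.cat([tail_y, mem_y])
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(model(x2), y2)
    loss = loss + 1.0 * F.mse_loss(model(mem_x), old_logits)
    loss.backward()
    opt.step()

acc_head = (model(head_x).argmax(1) == head_y).float().mean()
acc_tail = (model(tail_x).argmax(1) == tail_y).float().mean()
print(acc_head.item(), acc_tail.item())
```

The retention term anchors the model's predictions on the exemplars, so the new class can be learned in phase II without erasing the boundaries established in phase I.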
In the experiments, we evaluate the proposed asynchronous modeling strategy by comparing it to typical strategies, including re-balancing based methods (Cao et al., 2019) and transferring based methods (Liu et al., 2019). Furthermore, we also consider the latest, more sophisticated works, such as BBN (Zhou et al., 2020) and IEM (Zhu & Yang, 2020). The comprehensive study and comparison across four commonly used long-tailed benchmarks, including CIFAR100-LT, Places-LT, ImageNet-LT and iNaturalist 2018, validate the efficacy of our method.

2. RELATED WORK

Class re-sampling. Most works along this line can be categorized as over-sampling of tail classes (Chawla et al., 2002; Han et al., 2005; Byrd & Lipton, 2019) or under-sampling of head classes (Drummond et al., 2003). While the idea of re-sampling makes the overall distribution more balanced, it may encounter over-fitting on rare data and the loss of critical information on dominant classes (Chawla et al., 2002; Cui et al., 2019), thus hurting overall generalization. Beyond that, Ouyang et al. (2016); Liu et al. (2019) also involve a more refined idea of fine-tuning after representation extraction to adjust the final decision boundary.

Loss re-weighting. Methods based on loss re-weighting generally allocate larger weights to tail classes to increase their importance (Lin et al., 2017; Ren et al., 2018; Shu et al., 2019; Cui et al., 2019; Khan et al., 2017; 2019; Huang et al., 2019). However, direct re-weighting is difficult to optimize on large-scale datasets (Mikolov et al., 2013). Recently, Cao et al. (2019) considers the margins of the training set and introduces a label-distribution-aware loss to enlarge the margins of tail classes. Hayat et al. (2019) proposes a hybrid loss function that jointly clusters and classifies feature vectors in Euclidean space while ensuring uniformly spaced and equidistant class prototypes.

Knowledge transfer. Methods along this line handle the challenge of imbalanced datasets by transferring the information learned on head classes to assist tail classes. While Wang et al. (2017) proposes to transfer meta-knowledge from the head in a progressive manner, recent strategies take into consideration intra-class variance (Yin et al., 2019), semantic features (Liu et al., 2019; Chu et al., 2020) or domain adaptation (Jamal et al., 2020).
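As a concrete illustration of the re-sampling idea surveyed above, inverse-frequency weighting gives every class the same expected contribution under weighted sampling. The toy label distribution below is illustrative, not drawn from any benchmark:

```python
import numpy as np

# A long-tailed toy label distribution: 900 / 90 / 10 samples per class.
labels = np.array([0] * 900 + [1] * 90 + [2] * 10)
counts = np.bincount(labels)

# Inverse-frequency per-sample weights: under weighted sampling, each
# class then has the same total sampling mass, i.e., is drawn equally often.
w = 1.0 / counts[labels]
per_class_mass = np.array([w[labels == c].sum() for c in range(3)])
print(per_class_mass)  # each class sums to 1.0
```

Feeding such weights to a weighted sampler rebalances the draw probabilities, but, as noted above, the tail samples are then simply repeated, which is where the over-fitting risk comes from.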
Recently, BBN (Zhou et al., 2020) and LWS (Kang et al., 2020) advanced the long-tailed recognition landscape with some insightful findings. The former asserts that prominent class re-balancing methods can impair representation learning, while the latter claims that data imbalance might not be an issue for learning high-quality representations. IEM (Zhu & Yang, 2020) designs a more complex model that combines available techniques, such as feature transferring and attention. In this paper, we are motivated by gradient distortion in long-tailed data, which is caused by the gradient interaction between data-rich and data-poor classes. We thus propose to split the learning stage into two phases. We demonstrate that this separation allows straightforward approaches to achieve high recognition performance, without introducing extra parameters.

3. OUR METHOD

Let X = {(x_i, y_i)}, i ∈ {1, ..., n}, be the training set, where x_i is a training sample and y_i is its corresponding label. The number of instances in class j is denoted as n_j and the total number of

