HIGH-LIKELIHOOD AREA MATTERS: REWARDING CORRECT, RARE CLASS PREDICTIONS UNDER IMBALANCED DISTRIBUTIONS

Abstract

Learning from natural datasets poses significant challenges for traditional classification methods based on the cross-entropy objective due to imbalanced class distributions. It is intuitive to assume that examples from rare classes are harder to learn, so that the classifier is uncertain in its predictions, which form the low-likelihood area. Based on this assumption, existing approaches actively drive the classifier to correctly predict those incorrectly classified rare examples. However, this assumption is one-sided and can be misleading. We find in practice that the high-likelihood area contains correct predictions for rare class examples, and that it plays a vital role in learning imbalanced class distributions. In light of this finding, we propose the Eureka Loss, which rewards the classifier when examples belonging to rare classes in the high-likelihood area are correctly predicted. Experiments on the large-scale long-tailed iNaturalist 2018 classification dataset and the ImageNet-LT benchmark both validate the proposed approach. We further analyze the influence of the Eureka Loss on diverse data distributions in detail.

1. INTRODUCTION

Existing classification methods usually struggle in real-world applications, where the class distributions are inherently imbalanced and long-tailed (Van Horn & Perona, 2017; Buda et al., 2018; Liu et al., 2019; Gupta et al., 2019), in which a few head classes occupy a large probability mass while most tail (or rare) classes only possess a few examples. The language generation task is a vivid example of long-tailed classification. In this case, word types are considered as the classes and the model predicts probabilities over the vocabulary. Common words such as the, of, and and are the head classes, while the tail classes are rare words like Gobbledygook, Scrumptious, and Agastopia. Conventional classifiers based on deep neural networks require a large number of training examples to generalize, and have been found to under-perform on rare classes with few training examples in downstream applications (Van Horn & Perona, 2017; Buda et al., 2018; Cao et al., 2019). It has been proposed that the traditional cross-entropy objective is unsuitable for learning imbalanced distributions since it treats each instance and each class equivalently (Lin et al., 2017; Tan et al., 2020). In contrast, the instances from tail classes should be paid more attention, as indicated by the two main approaches that have been recently investigated for class-imbalanced classification: the frequency-based methods and the likelihood-based methods. The former (Cui et al., 2019; Cao et al., 2019) directly adjust the weights of the instances in terms of their class frequencies, so that instances from tail classes are learned with a higher priority no matter whether they are correctly predicted or not. The latter (Lin et al., 2017; Zhu et al., 2018) instead penalize inaccurate predictions more heavily, assuming that the well-classified instances, i.e., the instances in the high-likelihood area, factor inconsequentially in learning imbalanced distributions.
However, neither of these two approaches realistically depicts the likelihood landscape. In particular, the high-likelihood area, where the classifier makes correct predictions for both common class examples and rare class ones, contributes significantly to generalization. However, this area is not well-shaped, as illustrated in Figure 1. Specifically, the frequency-based methods imply an impaired learning of common class examples that are the principal part of natural data, while the likelihood-based methods ignore the correctly-predicted rare class examples that can provide crucial insights into the underlying mechanism for predicting such examples. In this paper, we first demonstrate that the existing practice of neglecting predictions in the high-likelihood area is harmful to learning imbalanced class distributions. Furthermore, we find that simply mixing the cross-entropy loss and the Focal Loss (Lin et al., 2017) can induce substantially superior performance, which validates our motivation. In turn, we propose to elevate the importance of high-likelihood predictions even further and design a novel objective called the Eureka Loss. It progressively rewards the classifier according to both the likelihood and the class frequency of an example, such that the system is encouraged to be more confident in the correct prediction of examples from rare classes. Experimental results on image classification and language generation tasks demonstrate that the Eureka Loss outperforms strong baselines in learning imbalanced class distributions.
Our contributions are twofold:
• We challenge the common belief that learning examples in the low-likelihood area is more important for learning tail classes, and reveal that correctly-predicted rare class examples make an important contribution to learning long-tailed class distributions.
• We explore a new direction for imbalanced classification that focuses on rewarding correct predictions for tail class examples rather than penalizing incorrect ones. The proposed Eureka Loss rewards the classifier for its high-likelihood predictions progressively with the rarity of their class, and achieves substantial improvements on various problems with long-tailed distributions.

2. RELATED WORK

Frequency-based Data and Loss Re-balancing Previous literature on learning with long-tailed distributions has mainly focused on re-balancing the data distribution and re-weighting the loss function. The former is based on the straightforward idea of manually creating a pseudo-balanced data distribution to ease the learning problem, including up-sampling rare class examples (Chawla et al., 2002), down-sampling head class examples (Drummond & Holte, 2003), and a more concrete sampling strategy based on class frequency (Shen et al., 2016). As for the latter, recent studies propose to assign different weights to different classes, where the weights can be calculated according to the class distribution. For example, Khan et al. (2018) design a cost-sensitive loss for major and minor class examples. An intuitive method is to down-weight the loss of frequent classes while up-weighting the contribution of rare class examples. However, raw frequency is not suitable to be used directly as the weight, since there exists overlap among samples. A more advanced alternative, the Class-Balanced Loss (CB) (Cui et al., 2019), proposes to calculate an effective number to substitute for the frequency in loss re-weighting. However, since it assigns lower weight to head classes in maximum likelihood training (the Cross-Entropy objective), it seriously impairs the learning of head classes. Moreover, CB requires delicate hyper-parameter tuning for every imbalanced distribution, incurring considerable manual effort. From the perspective of max-margin learning, a recent study, LDAM (Cao et al., 2019), proposes to up-weight the loss of tail classes by a class-distribution-based margin. Compared to the above methods, we choose to decrease the loss of tail classes by rewarding correct predictions, rather than increasing it through aggravated penalization.
Deferring the Frequency-based Class-balanced Training Recent studies find that deferring class-balanced training helps learn high-quality representations (Liu et al., 2019). Deferred class-balanced training (deferred CB) (Cao et al., 2019) adopts the Cross-Entropy objective at the beginning of training. Similarly, the Decoupling method (Kang et al., 2020) shows that re-balancing strategies impair the quality of learned feature representations, and demonstrates improved performance when features are learned on the original data distribution, by training the model with Cross-Entropy in the first phase and adopting class-balanced training in the second phase. This decoupling strategy can also be found in BBN (Zhou et al., 2019), which includes both class-imbalanced and class-balanced training, with the transition from the former to the latter achieved through a curriculum learning schedule. These methods achieve state-of-the-art performance in long-tailed classification. To be comparable with these methods, and to analyze whether the Eureka Loss is complementary to this technique, we propose the deferred Eureka Loss, in which rewarding rare class predictions is introduced to encourage the model to learn rare patterns when learning has stalled.

Likelihood-based Loss Another dominant method for imbalanced classification is the likelihood-based Focal Loss (FL) (Lin et al., 2017), which down-weights the contribution of examples in the high-likelihood area. However, we argue that this is harmful for learning tail classes, and take the opposite direction by highlighting the high-likelihood area with a steeper loss.

Transferring Representations Techniques for transferring information from sufficient head class examples to under-represented rare class examples form a parallel successful direction in this field.
They include MBJ (Liu et al., 2020), which utilizes an external semantic feature memory, and FSA (Chu et al., 2020), which decomposes features into class-specific and class-generic components. These transfer-learning-based studies are less related to our paper, but they also obtain good improvements in long-tailed classification, so we include them in the experimental comparison.

3. ROLE OF THE HIGH-LIKELIHOOD AREA

The existing approaches to long-tailed classification independently consider the class frequency and the example likelihood. However, we show that this one-sided reflection is problematic when dealing with tail class examples that can be confidently classified. Tail class examples can be easy for the classifier, and head class examples can also be hard to recognize. The difficulty of classification depends on the inherent characteristics of the classes, rather than on the sample size of the class. For example, in species classification, the Portuguese man o' war may be a rare class but is easily classified due to its distinct features, compared to the various kinds of moths that are common classes yet hard to distinguish. However, the frequency-based methods continuously drive the classifier to fit the rare class examples, especially when they are difficult to predict, which may lead to overfitting. On the other hand, the likelihood-based methods relax the concentration on the high-likelihood area, which contains the tail class examples that are not hard to predict and provide insights into generalization. To verify our point of view, we analyze the problem by dissecting the influence of the high-likelihood area with respect to the class frequency, and demonstrate that properly encouraging the learning of well-classified tail class examples induces substantial improvements. We first give a brief introduction to classification with long-tailed class distributions. Given an input x with one-hot label vector y, a classifier model f(x; θ) with parameters θ predicts the probability vector p = σ(f(x; θ)), where σ is a normalizing function, e.g., softmax for multi-class classification. Typically, the parameters are estimated using maximum likelihood estimation (MLE), which is equivalent to minimizing the Cross-Entropy Loss (CE), where the scalar y · log p can be regarded as the log-likelihood:

L = -E_{(x,y)∈D} log p_model(y|x) = -(1/|D|) Σ_{(x,y)∈D} y · log p.
For deep neural network-based classifiers, due to the non-linearity of the loss function, the problem is typically solved by stochastic gradient descent, which requires calculating the gradient with respect to the parameters using the chain rule, a process called back-propagation:

∂L/∂θ = (∂L/∂p) (∂p/∂θ) = (∂L/∂p) ∂σ(f(x; θ))/∂θ.

We introduce the term likelihood gradient to denote ∂L/∂p, which modulates how the probability mass should be shifted and is a characteristic of the loss function rather than of the classifier. For learning imbalanced class distributions, common methods aim to shape the likelihood gradient so that the rare classes are learned with priority, i.e., they receive a sharper loss and a larger likelihood gradient.

Frequency-Based Methods Frequency-based methods alter the likelihood gradient according to the class frequencies, irrespective of how well individual examples are classified. A simple form uses an n-dim weight vector w, composed of class weights based on their frequencies in the dataset, to determine the importance of examples from each class: L = -w_y · (y · log p). Note that when w = 1, this is identical to the cross-entropy objective. The weight vector is typically calculated as w_i = m̄/m_i, where m̄ is the average of the m_i. Since the standard weight uses the average class size, classes with more examples are down-weighted and classes with fewer examples are up-weighted. For a natural long-tailed distribution, the average is larger than the median, which means more classes are up-weighted. Advanced frequency-based methods try to obtain a more meaningful measurement of the class size; e.g., the Class-Balanced Loss (CB) proposed by Cui et al. (2019) utilizes an effective number (1 - β^{m_i})/(1 - β) for each class, where β ∈ [0.9, 1) is a tunable class-balance term.
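The effective-number re-weighting described above can be sketched in a few lines of plain Python. The function name and the normalization (weights summing to the number of classes) are our own illustrative choices, not prescribed by the paper:

```python
import math  # not needed below, but commonly paired with such utilities

def class_balanced_weights(class_counts, beta=0.9999):
    """Per-class weights from the effective number of samples
    (Cui et al., 2019): E_i = (1 - beta**m_i) / (1 - beta).
    A class's weight is the inverse of its effective number,
    normalized here so the weights sum to the number of classes."""
    effective = [(1.0 - beta ** m) / (1.0 - beta) for m in class_counts]
    inv = [1.0 / e for e in effective]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]

# Toy long-tailed distribution: 1000 / 100 / 10 samples per class.
w = class_balanced_weights([1000, 100, 10])
# Rarer classes receive strictly larger weights.
```

As β → 1, the effective number approaches the raw count m_i, recovering inverse-frequency weighting.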
Likelihood-Based Methods Different from the frequency-based methods, likelihood-based methods adjust the likelihood gradient based on the instance-level difficulty as predicted by the classifier, such that examples in the low-likelihood area receive more focus in training. For example, the well-known Focal Loss (FL) with a balancing factor α proposed by Lin et al. (2017) takes the following form: L = -α(1 - p_y)^{γ_f} · (y · log p), where γ_f > 0 controls the convexity of the loss; a higher γ_f indicates a more significant adjustment. Note that when α = 1 and γ_f = 0, it is identical to the cross-entropy objective. Following previous work (Cao et al., 2019; Liu et al., 2019; Cui et al., 2019), α is set to 1 in multi-class classification, and the Class-Balanced Focal Loss (FL+CB) can be viewed as Focal Loss with an uneven α for each class in the multi-class setting. The key idea is to pay less attention to well-classified examples and more attention to badly-classified ones, because it is natural to assume that tail class examples are harder to learn and thus cannot be well-classified. However, such methods neglect the correctly-predicted tail class examples, a practice which we show is not constructive to learning long-tailed class distributions.
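For a scalar ground-truth probability p_y, the Focal Loss above reduces to a one-liner; the following minimal sketch (function name ours) makes the down-weighting of well-classified examples explicit:

```python
import math

def focal_loss(p_y, alpha=1.0, gamma=2.0):
    """Focal Loss for the probability p_y assigned to the ground-truth
    class: L = -alpha * (1 - p_y)**gamma * log(p_y).
    With alpha = 1 and gamma = 0 it reduces to cross-entropy."""
    return -alpha * (1.0 - p_y) ** gamma * math.log(p_y)

# A well-classified example (p_y = 0.9) is down-weighted by the factor
# (1 - 0.9)**2 = 0.01 relative to plain cross-entropy:
ce = focal_loss(0.9, gamma=0.0)
fl = focal_loss(0.9, gamma=2.0)
```

The ratio fl/ce equals (1 - p_y)^γ exactly, which is the modulating factor the paper refers to.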

3.2. UNDERSTANDING THE INFLUENCE OF THE HIGH-LIKELIHOOD AREA

To understand the influence of the high-likelihood area, we first prepare a variant of the Focal Loss, the Halted Focal Loss (HFL), such that the high-likelihood area is not deprioritized. The Halted Focal Loss reverts the Focal Loss to the Cross-Entropy Loss when the likelihood is high enough:

L = -α(1 - p_y)^{γ_f} · (y · log p),  if p_y ≤ ϕ
L = -α · y · [log p + b],  otherwise

where p_y is the predicted probability of the correct label, the constant b = ((1 - ϕ)^{γ_f} - 1) log ϕ ensures monotonicity and continuity at the boundary, and ϕ is the boundary between the low- and the high-likelihood area, which we set to 0.5, i.e., a likelihood above which the prediction is definitely correct. This mixed loss is plotted in the left of Figure 2; it has the same likelihood gradient as cross-entropy in the high-likelihood area and remains the same as the Focal Loss in the low-likelihood area. We conduct experiments on long-tailed CIFAR-10 using the aforementioned protocol to examine the effect of the high-likelihood area. The construction of the dataset is provided in Appendix C. We run each configuration 5 times with different random initializations and report the average test performance. The results are shown in the right of Figure 2. As we can see, compared to the original Focal Loss, the proposed adaptation achieves better performance, indicating that regaining focus in the high-likelihood area is beneficial. Nonetheless, the improvement could also be attributed to better learning of the common classes instead of the rare classes. Our analysis based on class frequency resolves this concern, because the Halted Focal Loss brings larger improvements when only the tail classes are learned this way; e.g., applying it to the top-4 rare classes achieves the best overall performance, which shows that there are rare class examples that reside in the high-likelihood area and have a non-negligible effect on generalization.
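The two-branch definition above can be sketched directly. This is our reading of the construction: b is chosen so the branches meet continuously at p_y = ϕ, and above ϕ the loss differs from cross-entropy only by the constant b, so its likelihood gradient matches CE exactly:

```python
import math

def halted_focal_loss(p_y, alpha=1.0, gamma=2.0, phi=0.5):
    """Sketch of the Halted Focal Loss: Focal Loss below the likelihood
    boundary phi, and cross-entropy shifted by a constant b above it,
    so the high-likelihood area keeps the full cross-entropy gradient.
    b is chosen so the two branches meet continuously at p_y = phi."""
    b = ((1.0 - phi) ** gamma - 1.0) * math.log(phi)
    if p_y <= phi:
        return -alpha * (1.0 - p_y) ** gamma * math.log(p_y)
    return -alpha * (math.log(p_y) + b)
```

Because b does not depend on p_y, finite differences of the loss above ϕ coincide with those of -log p_y, which is the "halting" behavior the experiment relies on.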

4. EUREKA LOSS

We have shown that the high-likelihood area matters for long-tailed classification and that, in particular, the rare class examples in this area make pivotal contributions. Inspired by this finding, we propose to further enhance the importance of the high-likelihood area, so that the likelihood gradient there can match or even surpass that in the low-likelihood area. Moreover, the adjustment is in line with the frequency of the class, so the rarer the class, the larger the likelihood gradient. Extending the adjustment term b in Eq. (6), we propose the Eureka Loss (EL):

L = -y · log p - bonus · encouragement,

where the term -bonus · encouragement is intended to reward well-classified rare examples. The bonus term depends on the example likelihood and the encouragement term depends on the class frequency. Different from existing approaches that scale the Cross-Entropy Loss, selectively punishing incorrect predictions, the proposed Eureka Loss deals with long-tailed classification from another perspective, rewarding correct predictions progressively with the rarity of their class.

Bonus indicates how well the system executes the task and is designed to be a function of the probability of the ground-truth class, rewarding the model when it makes a correct prediction. In particular, in light of the findings discussed in Section 3.2, we propose to increase the likelihood gradient in the high-likelihood area and adopt the form bonus = -y · log(1 - p), which ensures that the monotonicity of the likelihood gradient is consistent with that of the Cross-Entropy Loss, meaning that the classifier obtains a larger bonus when making highly-confident correct predictions. This design runs against most existing studies in that the high-likelihood area is given more focus than the low-likelihood area.

Encouragement implies that the system has achieved something unusual that should be encouraged.
Since the unusual achievements in long-tailed classification are correct predictions of rare class examples, we propose to reward the system based on the frequency of the example's class: encouragement = w_y = m̄/m_y, where m_y denotes the measurement of the frequency of class y and m̄ is its average over classes. The form is flexible and similar to the frequency-based methods, and can thus be further extended based on related studies. In our experiments, we use the effective number from Cui et al. (2019) as the measurement. Compared to existing frequency-based and likelihood-based objectives, our Eureka Loss uses the bonus term to calibrate attention to different likelihood areas, and the encouragement term to inform the model of class difficulty, composing a more targeted yet comprehensive loss for learning imbalanced distributions.
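Putting the two terms together, a minimal sketch of the Eureka Loss follows. This reflects our reading of the sign conventions (the bonus -log(1 - p_y) grows with confidence and is subtracted from cross-entropy); the mean-1 normalization of the encouragement is our own illustrative choice:

```python
import math

def eureka_loss(p_y, enc_y):
    """Sketch of the Eureka Loss for ground-truth probability p_y:
    cross-entropy minus an encouragement-scaled bonus. The bonus
    -log(1 - p_y) grows as the correct prediction becomes more
    confident, steepening the loss in the high-likelihood area."""
    ce = -math.log(p_y)
    bonus = -math.log(1.0 - p_y)
    return ce - enc_y * bonus

def encouragement(class_counts, beta=0.9999):
    """Class-level encouragement from inverse effective numbers,
    normalized to mean 1: rarer classes get a larger term and hence
    a larger reward for confident correct predictions."""
    eff = [(1.0 - beta ** m) / (1.0 - beta) for m in class_counts]
    inv = [1.0 / e for e in eff]
    mean = sum(inv) / len(inv)
    return [w / mean for w in inv]
```

Under this convention, a confident correct prediction on a rare class drives the loss below plain cross-entropy, which is exactly the "reward" the paper describes.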

5. EXPERIMENTS

We validate the proposed Eureka Loss on diverse long-tailed classification problems and analyze its characteristics with insights into the learned models. Evaluation Metric For the image classification tasks, we report the accuracy on 'All' data and on the subsets of classes with 'Many', 'Medium', and 'Few' examples, i.e., the precision of the top-1 prediction.

5.1. TASKS, DATASETS, AND TRAINING SETTINGS

Since the test sets of those tasks are class-balanced, we further propose to estimate the accuracy on the imbalanced class distribution, which reflects natural performance in real-world scenarios. The natural accuracy is the linear interpolation of the per-class accuracies on the balanced test set using the class frequencies from the training set. For the natural language generation task, we adopt the micro and macro F-scores from Zhang et al. (2018) between the generated and reference sentences to check how well the systems participate in the conversation. We further adopt 4-gram diversity to examine rare phrases, since a well-known problem in dialogue tasks is that models tend to generate common, dull, and repetitive responses, and thus fail to capture the diversity of natural language distributions. Since the test set of the natural language generation task is naturally imbalanced, we do not need to estimate natural performance there. For a detailed introduction to the tasks, datasets, and training settings, please refer to the appendix.
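The natural accuracy described above is just a frequency-weighted average of per-class accuracies; a minimal sketch (function name ours):

```python
def natural_accuracy(per_class_acc, train_class_counts):
    """Natural accuracy estimate: per-class accuracies measured on the
    balanced test set, weighted by the training-set class frequencies."""
    total = sum(train_class_counts)
    return sum(a * m / total
               for a, m in zip(per_class_acc, train_class_counts))

# Two classes, head class 9x more frequent in training:
nat = natural_accuracy([0.9, 0.5], [900, 100])
# 0.9 * 0.9 + 0.5 * 0.1 = 0.86: the head class dominates the estimate.
```

This makes explicit why methods that trade head-class accuracy for tail-class accuracy can look worse under the natural metric.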

5.2. RESULTS

iNaturalist 2018 The results are reported in Table 2. We tune the hyper-parameters for our implemented baselines and report the performance averaged over 3 runs at the best setting. We compare the Eureka Loss with the frequency-based Class-Balanced Loss (CB), the likelihood-based Focal Loss (FL), and their combination FL+CB. In all, adopting the Eureka Loss achieves a balanced performance on both common and rare classes. Besides, we also outperform the latest representation-transfer-based methods, including MBJ and FSA.

ImageNet-LT Table 3 presents the results of various methods on ImageNet-LT. For this artificial dataset, we first compare with the representative frequency-based method CB and the likelihood-based Focal Loss (FL). As we can see, the proposed method obtains a significant improvement on the balanced test set and also maintains the leading position on the virtual natural test set. Compared with methods that defer class-balanced training, including deferred CB and Decoupling-LWS, the correspondingly modified Eureka Loss also enjoys a comfortable margin, and arguably excels at balancing the performance on both common and rare classes.

ConvAI2 Table 4 shows that the proposal helps the prediction of rare words (+10% macro F-score) and thus improves the diversity of language generation (+10% 4-gram diversity). Since this dataset is extremely imbalanced, e.g., the imbalance ratio is over 200,000, the frequency-based methods require extensive tuning to work; we thus omit them from the comparison as we were unable to reproduce favorable results. Compared with the likelihood-based Focal Loss, which is marginally better than the original cross-entropy loss, the Eureka Loss still obtains substantial improvements.

5.3. ANALYSIS

Varying Strength of Bonus Besides the log-form bonus, we examine a power-form bonus PB(p) = y · p^{γ_b}, where γ_b is a positive value that ensures monotonicity and can be tuned for different tasks. For reference, CE achieves 71.4% accuracy and the log-likelihood bonus with deferred encouragement reaches 76.1%. Figure 3 demonstrates that a larger likelihood gradient in the high-likelihood area brings larger improvements; e.g., the power bonus with power 4 is better than bonuses with smaller powers.
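The effect of γ_b can be checked numerically: a larger exponent concentrates the bonus, and therefore the extra likelihood gradient, near p_y = 1. This is our reading of the power-form variant, not an exact reproduction of the paper's implementation:

```python
def power_bonus(p_y, gamma_b=4.0):
    """Power-form bonus PB(p_y) = p_y**gamma_b. A larger gamma_b makes
    the bonus flatter at low likelihoods and steeper near p_y = 1."""
    return p_y ** gamma_b

# Finite-difference slopes near the high-likelihood end: the power-4
# bonus rises faster there than the linear (power-1) bonus.
slope4 = power_bonus(0.96, 4.0) - power_bonus(0.95, 4.0)
slope1 = power_bonus(0.96, 1.0) - power_bonus(0.95, 1.0)
```

Since d/dp [p^γ] = γ p^{γ-1}, the slope at p close to 1 grows with γ, matching the observation that higher powers help more.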

Varying Strength of Encouragement

The strength of encouragement is determined by both the class frequency and the hyper-parameter β, as we use the effective number of the class. Since β controls the variance of the effective numbers, e.g., when β = 0 the variance is 0 and all classes receive equal encouragement, we control the strength of the encouragement towards tail classes by altering β. The results on the validation set of ImageNet-LT are shown in Table 6. As we can see, a higher β (more encouragement for tail classes) is connected to higher overall accuracy and better tail class performance, which again validates our motivation for encouraging correct rare class predictions.

Effect on Example Likelihood The Eureka Loss rewards high-likelihood predictions, especially for tail classes. It is interesting to see how the training dynamics change due to this preference. To understand the effect, we visualize the example likelihoods on the iNaturalist 2018 validation set, grouped by target class frequency, after training in Figure 4 and Figure 5 (due to space limits, the complete comparison is provided in Appendix A). As we can see, with the Eureka Loss, the examples in the high-likelihood area are driven to the extreme. For example, considering the medium- and low-frequency groups, the "hard" examples that may be inherently difficult to classify stay unchanged, while the examples that can be classified correctly are now treated with more confidence. These dynamics translate into better accuracy on unseen examples in the test set, hinting at the importance of rare class examples in the high-likelihood area for the generalization of learning imbalanced class distributions.
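The claim that β = 0 yields equal encouragement, while larger β skews it towards rare classes, can be checked directly (the mean-1 normalization is our own choice for illustration):

```python
def effective_number_weights(class_counts, beta):
    """Normalized inverse effective numbers. beta = 0 gives every class
    the same weight; beta -> 1 approaches inverse-frequency weighting."""
    eff = [(1.0 - beta ** m) / (1.0 - beta) for m in class_counts]
    inv = [1.0 / e for e in eff]
    mean = sum(inv) / len(inv)
    return [w / mean for w in inv]

flat = effective_number_weights([1000, 100, 10], beta=0.0)
# all 1.0: zero variance, equal encouragement for every class
skew = effective_number_weights([1000, 100, 10], beta=0.9999)
# strictly increasing towards the rare class
```

With β = 0 the effective number is 1 for every class, so the weights collapse to a constant, matching the statement in the text.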

6. CONCLUSIONS

In this paper, we examine the effect of the high-likelihood area on learning imbalanced class distributions. We find that the existing practice of relatively diminishing the contribution of examples in the high-likelihood area is actually harmful to learning. We further show that the rare class examples in the high-likelihood area make a pivotal contribution to model performance and should be focused on instead of neglected. Motivated by this, we propose the Eureka Loss, which additionally rewards well-classified rare class examples. The results of the Eureka Loss on image classification and natural language generation problems demonstrate the potential of reconsidering the role of the high-likelihood area. In-depth analysis also verifies the effectiveness of the investigated loss form and reveals the learning dynamics of different approaches to long-tailed classification.

A VISUALIZATION OF EXAMPLE LIKELIHOOD

The batch size is 64 and the learning rate is set to 3. The embedding size is 256 and word vectors are initialized with GloVe (Pennington et al., 2014). We select the final model when the performance on the validation set has not improved for 5 epochs.

COCO Detection

For experiments on COCO detection, we adopt the configuration "RetinaNet-R-50-FPN-1x" from the Detectron2 GitHub repository as our default setting. In this setting, the one-stage RetinaNet detector with a ResNet-50 backbone is trained for 90k updates with a batch size of 8 images. For the image classification tasks, the default β is set to 0.9999 for all datasets. For the deferred version, we defer the adoption of the Eureka Loss until after training for 160 epochs and 180 epochs on CIFAR-100 and iNaturalist 2018, respectively. As for the dialogue generation task ConvAI2, β is set to 0.999 and we start the encouragement after regularly training the model for 5 epochs. We tune β ∈ {0.9, 0.99, 0.999, 0.9999} and γ ∈ {0.5, 1, 2} for the Class-Balanced Loss (CB) and the Focal Loss (FL), respectively, in multi-class classification, and report the best results of these baselines. Following previous work (Cui et al., 2019), α is set to 1.0 for the Focal Loss (FL), and the Class-Balanced Focal Loss (FL+CB) in multi-class classification tasks can be viewed as the original Focal Loss with a class-level weight α in binary classification tasks. The training costs are summarized in Table 12.



Following (Cui et al., 2019), we omit the hyper-parameter α, since the Focal Loss with an uneven α for each class in the multi-class setting can be viewed as the Class-Balanced Focal Loss (FL+CB), and FL+CB is compared individually.



Figure 1: Conceptual illustration of approaches to learning imbalanced class distributions. For an instance in the training data, the frequency-based methods either sharpen or soften the loss for all likelihoods according to its class frequency, while the likelihood-based methods adjust the loss in the low- or high-likelihood area, respectively. The high-likelihood area is relatively deprioritized in both cases. The proposed Eureka Loss progressively rewards the system with a higher bonus for higher likelihoods.

PREPARATION: CLASSIFICATION WITH LONG-TAILED CLASS DISTRIBUTIONS

Let us consider the multi-class classification problem with a long-tailed class distribution. Given a class set C, n denotes the number of different classes in C, and m_i is the number of examples of class C_i. For simplicity, we sort the class set C by the cardinality m_i of each C_i, such that C_0 is the class with the most examples and C_{n-1} is the rarest class. Let p be an n-dim probability vector predicted by a classifier model f(x; θ) based on the input x, where each element p_i denotes the probability of class C_i, and let y be an n-dim one-hot label vector with y being the ground-truth class.

To decouple the effect of class frequency, we further explore gradually transitioning from the Focal Loss to the Halted Focal Loss according to the class frequency of an example, e.g., from adopting the Halted Focal Loss only for the rarest class and the Focal Loss for all other classes, to adopting the Focal Loss only for the most common class and the Halted Focal Loss for the rest. Concretely, we set a proportion t ∈ [0, 1] of classes to receive this loss, while the remaining 1 - t proportion of classes adopt the original Focal Loss. The classes are ranked by inverse frequency, such that the first class is the rarest.

Figure 3: Varying strength of bonus on long-tailed CIFAR-10. Higher power γ b indicates higher strength.

Figure 4: The visualization of test likelihood distribution for models trained with the Eureka Loss. The model classifies the rare class examples more decisively, compared to the existing methods.

Figure 7: Word frequency distribution of the ConvAI2 dataset. For natural language generation tasks, each word type can be regarded as a class, and most words appear scarcely in the data. Measured in the same way as in long-tailed image classification tasks, the imbalance ratio is 277K.

The importance of the high-likelihood area for rare examples is further validated on the COCO detection dataset, where the classifier should determine whether an object appears in the image or not. Positive detections form the rare class, since there are many false proposals. The experimental setting is described in Appendix C. AP50 and AP75 measure the precision under different levels of overlap between predictions and ground truth. As shown in Table 1, strengthening the high-likelihood area of the Focal Loss, especially for rare class examples, obtains more accurate and confident predictions.

Tasks and Datasets We conduct experiments on two image classification tasks and a dialogue generation task. iNaturalist 2018 is a real-world dataset which embodies a highly imbalanced class distribution over 8,142 classes. Apart from the test performance, we also report the validation performance grouped by class frequency, categorizing the examples into three groups: many (classes with more than 100 examples), medium (classes with 20 to 100 examples), and few (classes with fewer than 20 examples). ImageNet-LT (Liu et al., 2019) is an artificially constructed long-tailed classification dataset based on ILSVRC 2012, with 1,000 classes. ConvAI2 is a natural conversation dataset for evaluating dialogue systems, where each word type can be treated as a class, i.e., 18,848 words (classes) in total, with extremely imbalanced training and test sets.

Table 2: Results on iNaturalist 2018. * denotes results from the corresponding paper, and † denotes deferred learning, where the base loss is applied at the beginning of training and the improved method is adopted later. § denotes a method that focuses on transferring representations and is less related to our work. Best results are shown in bold. The proposed Eureka Loss achieves the best results in learning both common and rare classes.

table, neither FL nor CB achieves improvements over Cross Entropy (CE), but the Eureka Loss outperforms CE by a large margin in terms of both overall accuracy and accuracy on classes with few examples. In contrast to the first group, when considering only accuracy on the balanced test set, the two-stage variants of frequency-based class-balanced training, which adopt class-balanced training only in the later training phase (deferred CB, denoted CB† in the table; LDAM + deferred CB; BBN; and Decoupling-LWS), enjoy a clear advantage over CE. To check whether the Eureka Loss is additive with the deferred method and with class-balanced training, we also include the deferred Eureka Loss and Eureka Loss + CB† in the comparison. The deferred Eureka Loss is motivated by the intuition that when training enters a bottleneck stage, introducing the Eureka Loss to reward rare classes encourages the model to learn less common patterns, which may be helpful for learning. Compared with the original method, the deferred encouragement brings improvements on both the balanced and the imbalanced test distributions (+1.4 and +2.4 on All and All (Natural), respectively). Moreover, class-balanced training still impairs the learning of common classes even under the deferred setting, which may translate into unfavorable natural performance in real applications: for example, the accuracy on the 'Many' subset for CB† and Decoupling-LWS under-performs CE by 3.1, and applying CB† reduces the Natural accuracy by 1.3. In contrast, the deferred Eureka Loss largely outperforms CE and these methods on both the balanced and the imbalanced test distributions. The reason may be that we do not impair the CE learning, and the additional reward for rare classes is less harmful. Since the Eureka Loss only introduces an additive term, it is flexible and can be combined with CB; this combination achieves the best All accuracy of 70.3.

Results on ImageNet-LT. * and † are defined similarly to Table 2.
Eureka Loss demonstrates consistent improvements against existing methods.

F-scores and 4-gram diversity on ConvAI2. The proposed Eureka Loss achieves better performance than baseline methods and generates responses that are more diverse.

Results on long-tailed CIFAR-100 of different imbalance degrees (ID).

Varying strength of encouragement on the dev set of ImageNet-LT. Higher β means higher strength.

Varying Strength of Bonus To illustrate the importance of the high-likelihood area in imbalanced classification within the Eureka Loss, we compare the original bonus to an exponential-form bonus called the Power Bonus (PB), which takes the power form of the probability vector by a factor γ_b:
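The exact expression of the Power Bonus is elided in this excerpt; one plausible reading, used purely as a sketch here, is cross-entropy minus a bonus that is the gold-class probability raised to the power γ_b, so the bonus is concentrated in the high-likelihood area. The functional form below is our assumption, not the paper's definition.

```python
import math

def power_bonus_loss(p_gold, gamma_b=2.0):
    # Hypothetical Power Bonus sketch: standard cross-entropy minus a
    # bonus p_gold ** gamma_b that rewards confident correct
    # predictions. Larger gamma_b concentrates the reward closer to
    # p_gold = 1 (the high-likelihood area).
    ce = -math.log(p_gold)
    bonus = p_gold ** gamma_b
    return ce - bonus
```

Under this reading, the loss can become negative for very confident correct predictions, which is exactly the "reward" behavior the bonus is meant to provide.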

The complete comparison of the likelihood distributions for the Eureka Loss, the Cross-Entropy Loss, the Focal Loss, and the Class-Balanced Loss is shown in Figure 5. We see from the figure that the model trained with the Eureka Loss gives high-confidence predictions for gold labels. Compared with the Cross-Entropy Loss, the Focal Loss diminishes the contribution of high-likelihood examples, and the resulting model is unsure in its predictions on unseen examples; in particular, for the examples in the Few group, it produces almost no confident correct predictions. The Class-Balanced Loss, on the other hand, improves the confidence for tail class examples but degrades the performance on head class examples, which may imply potential issues regarding natural performance in real-world applications. It is also worth noting that Decoupling-LWS obtains a likelihood distribution similar to that of the Class-Balanced Loss.

It has been shown (Kang et al., 2020) that training much longer on the iNaturalist 2018 dataset can produce better scores and reflect the performance of the models more authentically. However, most previous studies conduct training for a shorter time. To keep consistent with previous research in this field, we also train the models with the Eureka Loss for 90 epochs; the results are shown in Table 7. In this setting, the Eureka Loss achieves better accuracy than the two-stage decoupling methods (Decoupling-LWS and BBN), and the advantage is more profound under the Natural accuracy: for example, compared to Decoupling-LWS, the deferred Eureka Loss gains 3.8 Natural accuracy. Compared to the one-stage methods, including the Class-Balanced Loss (CB), the Focal Loss (FL), the Class-Balanced Focal Loss (FL+CB), and LDAM, the model trained with the Eureka Loss is much more accurate on the test distribution.

B.2 RESULTS ON LONG-TAILED CIFAR-10

In the main text, we have reported results for the Eureka Loss under varying class imbalance on the CIFAR-100 dataset.
Here we also perform comprehensive experiments on long-tailed CIFAR-10 and report top-1 accuracy on the balanced test set. The results are shown in Table 8. When combined with the Class-Balanced Loss, the Eureka Loss brings a larger improvement in accuracy than the Cross-Entropy Loss and LDAM.

Figure 6: Illustration of the over-fitting phenomenon on tail classes; the number on top of each bar is the difference between the training accuracy and the test accuracy.

Data statistics of long-tailed image classification tasks. Imbalance Ratio denotes the ratio of the size of the most common class to that of the rarest class.

B.3 HYPER-PARAMETERS OF THE FOCAL LOSS

In the paper, we report results for the Focal Loss with its best hyper-parameters. For COCO detection, α = 0.25, γ = 2 is the best setting reported in Table 1(b) of the original paper (Lin et al., 2017). For the other multi-class classification tasks, we tune the hyper-parameters of the Focal Loss ourselves. The accuracies of the Focal Loss with different values of γ are listed in Table 1. We set γ = 1 since it is consistently optimal in long-tailed image classification. For ConvAI2, γ = 0.5 under-performs Cross Entropy, and neither γ = 1 nor γ = 2 clearly outperforms the other, so we report the Focal Loss with γ = 1 and with γ = 2 in Table 4.
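For reference, the multi-class Focal Loss being tuned here reduces to the standard published form: the cross-entropy term scaled by (1 - p_gold)^γ, with γ = 0 recovering plain cross-entropy. The snippet below is a per-example sketch (function and argument names are ours).

```python
import math

def multiclass_focal_loss(probs, gold, gamma=1.0):
    # Multi-class Focal Loss on one example: -(1 - p_gold)^gamma
    # * log(p_gold). gamma = 0 recovers plain cross-entropy; gamma = 1
    # is the setting used for the long-tailed image experiments.
    p_gold = probs[gold]
    return -((1.0 - p_gold) ** gamma) * math.log(p_gold)
```

Increasing γ shrinks the loss on already-confident predictions, which is precisely the behavior the hyper-parameter sweep above is probing.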

B.4 COMPLEMENTARY EXPERIMENT TO THE MOTIVATION EXPERIMENT

In Section 3, we propose the Halted Focal Loss (HFL) and compare it to the Focal Loss (FL) to illustrate the potential of the high-likelihood area. However, its loss curve is no steeper than that of Cross Entropy (CE). Moreover, the Focal Loss does not beat CE in the multi-class classification setting. In order to bridge the gap between the possibly weak motivation experiment with the Halted Focal Loss and the proposed Eureka Loss, we propose the simplified Eureka Loss: where ϕ is set to 0.5 and b is log(1 - ϕ). In the simplified Eureka Loss, the encouragement is removed, and to keep the low-likelihood area unchanged, the new bonus term starts rewarding the model from p = ϕ. As is shown in
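The equation of the simplified Eureka Loss is elided in this excerpt. One plausible reconstruction, consistent with the stated constraints that the low-likelihood area matches cross-entropy and that the offset b = log(1 - ϕ) makes the bonus vanish exactly at p = ϕ, is the piecewise form below; this is our assumption, not the paper's stated formula.

```latex
\mathcal{L}_{\text{simp}}(p) =
\begin{cases}
-\log p, & p \le \phi \\
-\log p + \log(1 - p) - b, & p > \phi
\end{cases}
\qquad b = \log(1 - \phi), \quad \phi = 0.5
```

At p = ϕ the second branch equals the first, since log(1 - ϕ) - b = 0, so the loss is continuous, leaves the low-likelihood area unchanged, and rewards the model only once the gold-class likelihood exceeds ϕ.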

C DETAILS OF EXPERIMENTAL SETTINGS C.1 DATASETS

There are six datasets used in this paper in total; an overview of the dataset statistics is shown in Table 11 and Figure 7. For the image classification tasks, the iNaturalist 2018 dataset is the most imbalanced and has the most classes, making it the most suitable for evaluating long-tailed classification. For the language generation task, ConvAI2 has an imbalance ratio of 277K, which, however, should be interpreted cautiously, since most of the tail classes are not covered in evaluation. The common practice for evaluating learning on imbalanced language distributions is to investigate the diversity of the generated text. The 4-grams can be regarded as higher-order classes, and a 4-gram of four common words can also be a "rare class".
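Both statistics referenced above are straightforward to compute; the sketch below shows the imbalance ratio as defined in the caption of Table 11 and a distinct 4-gram ratio, a standard diversity measure for generated text (function names are ours).

```python
from collections import Counter

def imbalance_ratio(labels):
    # Imbalance Ratio: size of the most common class divided by the
    # size of the rarest class (as defined for Table 11).
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def distinct_4(tokens):
    # Distinct 4-gram ratio: unique 4-grams over total 4-grams in the
    # generated text; higher values indicate more diverse output.
    ngrams = [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```

For word-level classes, `labels` would be the training-corpus tokens; for the 4-gram view, each tuple produced inside `distinct_4` plays the role of a higher-order class.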

C.2 TRAINING SETTINGS

CIFAR-10 and CIFAR-100 For experiments on long-tailed CIFAR-10 and CIFAR-100, the backbone network is ResNet-32 (He et al., 2016). The model is optimized with SGD with a momentum of 0.9. The learning rate is set to 0.1 and the model is trained for 200 epochs with 128 examples per mini-batch. To stabilize the training, we adopt the warm-up strategy used by Goyal et al. (2017) in the first 5 epochs. Following Cao et al. (2019), we decay the learning rate by 0.01 at the 160th epoch and again at the 180th epoch. For the results in Figure 2 and Figure 3, we conduct experiments on long-tailed CIFAR-10 with an imbalance ratio of 10.

ImageNet-LT For experiments on ImageNet-LT ILSVRC 2012, the base network is ResNeXt-50 (He et al., 2016). The batch size is set to 512 to accelerate training. The initial learning rate is 0.2 and we utilize a cosine learning rate scheduler.

iNaturalist 2018 As in the experiments on ImageNet-LT, we follow the default setting of Kang et al. (2020) for the experiments on iNaturalist. To be specific, we adopt a ResNet-50 model and use SGD to train it for 200 epochs with batch size 512 and a cosine learning rate schedule which gradually decays from 0.2 to 0.0. Results on the validation set are also reported on the subsets of many (> 100 samples), medium (20-100 samples), and few (< 20 samples) classes, respectively.

ConvAI2 For the conversation generation task, we utilize a two-layer LSTM (Hochreiter & Schmidhuber, 1997) encoder-decoder architecture as our base network. The hidden size of both the encoder and the decoder is set to 1024. We optimize the model with the SGD optimizer with momentum 0.9, the

