LONG-TAIL LEARNING VIA LOGIT ADJUSTMENT

Abstract

Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels have only a few associated samples. This poses a challenge for generalisation on such labels, and also makes naïve learning biased towards dominant labels. In this paper, we present a statistical framework that unifies and generalises several recent proposals to cope with these challenges. Our framework revisits the classic idea of logit adjustment based on the label frequencies, which encourages a large relative margin between logits of rare positive versus dominant negative labels. This yields two techniques for long-tail learning, where such adjustment is either applied post-hoc to a trained model, or enforced in the loss during training. These techniques are statistically grounded, and practically effective on four real-world datasets with long-tailed label distributions.

1. INTRODUCTION

Real-world classification problems typically exhibit a long-tailed label distribution, wherein most labels are associated with only a few samples (Van Horn & Perona, 2017; Buda et al., 2017; Liu et al., 2019). Owing to this paucity of samples, generalisation on such labels is challenging; moreover, naïve learning on such data is susceptible to an undesirable bias towards dominant labels. This problem has been widely studied in the literature on learning under class imbalance (Kubat et al., 1997; Chawla et al., 2002; He & Garcia, 2009), and the related problem of cost-sensitive learning (Elkan, 2001). Recently, long-tail learning has received renewed interest in the context of neural networks. Two active strands of work involve post-hoc normalisation of the classification weights (Zhang et al., 2019; Kim & Kim, 2019; Kang et al., 2020; Ye et al., 2020), and modification of the underlying loss to account for varying class penalties (Zhang et al., 2017; Cui et al., 2019; Cao et al., 2019; Tan et al., 2020). Each of these strands is intuitive, and has proven empirically successful. However, neither is without limitation: e.g., weight normalisation crucially relies on the weight norms being smaller for rare classes, an assumption that is sensitive to the choice of optimiser (see §2.1). On the other hand, loss modification sacrifices the consistency that underpins the canonical softmax cross-entropy (see §5.1). Consequently, such techniques may prove suboptimal even in simple settings (see §6.1).

In this paper, we establish a statistical framework for long-tail learning that offers a unified view of post-hoc normalisation and loss modification techniques, while overcoming their limitations. Our framework revisits the classic idea of logit adjustment based on label frequencies (Provost, 2000; Zhou & Liu, 2006; Collell et al., 2016), which encourages a large relative margin between a pair of rare positive and dominant negative labels.
Such adjustment can be achieved by shifting the learned logits post-hoc, or by augmenting the softmax cross-entropy with a pairwise label margin (cf. (11)). While similar in nature to recent techniques, our logit adjustment approaches additionally have a firm statistical grounding: they are Fisher consistent for minimising the balanced error (cf. (2)), a common metric in long-tail settings which averages the per-class errors. This statistical grounding translates into strong empirical performance on four real-world datasets with long-tailed label distributions. In summary, our contributions are:

(i) we establish a statistical framework for long-tail learning (§3) based on logit adjustment that provides a unified view of post-hoc correction and loss modification;

(ii) we present two realisations of logit adjustment, applied either post-hoc (§4.1) or during training (§5.1); unlike recent proposals (Table 1), these are consistent for minimising the balanced error;

(iii) we confirm the efficacy of the proposed logit adjustment techniques compared to several baselines on four real-world datasets with long-tailed label distributions (§6).

Table 1: Comparison of approaches to long-tail learning. Weight normalisation re-scales the classification weights; by contrast, we add per-label offsets to the logits. Margin approaches uniformly increase the margin between a rare positive and all negatives (Cao et al., 2019), or decrease the margin between all positives and a rare negative (Tan et al., 2020) to prevent gradient suppression of rare labels. By contrast, we increase the margin between a rare positive and a dominant negative.

2. PROBLEM SETUP AND RELATED WORK

Consider a multiclass classification problem with instances X and labels Y = [L] := {1, 2, ..., L}. Given a sample S = {(x_n, y_n)}_{n=1}^N ~ P^N for an unknown distribution P over X × Y, our goal is to learn a scorer f: X → R^L that minimises the misclassification error P_{x,y}(y ∉ argmax_{y'∈Y} f_{y'}(x)). Typically, one minimises a surrogate loss ℓ: Y × R^L → R, such as the softmax cross-entropy

ℓ(y, f(x)) = log[Σ_{y'∈[L]} e^{f_{y'}(x)}] − f_y(x) = log[1 + Σ_{y'≠y} e^{f_{y'}(x) − f_y(x)}].   (1)

We may view the resulting softmax probabilities p_y(x) ∝ e^{f_y(x)} as estimates of P(y | x).

The setting of learning under class imbalance or long-tail learning is where the distribution P(y) is highly skewed, so that many rare (or "tail") labels have a low probability of occurrence. Here, the misclassification error is not a suitable measure of performance: a trivial predictor which classifies every instance to the majority label will attain a low misclassification error. To cope with this, a natural alternative is the balanced error (Chan & Stolfo, 1998; Brodersen et al., 2010; Menon et al., 2013), which averages each of the per-class error rates (equivalently, the misclassification error under a uniform label distribution):

BER(f) := (1/L) · Σ_{y∈[L]} P_{x|y}(y ∉ argmax_{y'∈Y} f_{y'}(x)).   (2)

This can be seen as implicitly using a balanced class-probability function P^bal(y | x) ∝ (1/L) · P(x | y), as opposed to the native P(y | x) ∝ P(y) · P(x | y) that is employed in the misclassification error.
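To make the metric concrete, the following is a minimal NumPy sketch (an illustration of ours, not code from the paper) contrasting the misclassification error with the balanced error for a trivial majority-class predictor:

```python
import numpy as np

def misclassification_error(y_true, y_pred):
    # Fraction of examples whose predicted label differs from the truth.
    return float(np.mean(y_true != y_pred))

def balanced_error(y_true, y_pred, num_classes):
    # Average of the per-class error rates, as in the BER definition above.
    per_class = [np.mean(y_pred[y_true == c] != c) for c in range(num_classes)]
    return float(np.mean(per_class))

# 95/5 imbalanced labels; a trivial predictor always outputs the majority class.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print(misclassification_error(y_true, y_pred))  # 0.05: deceptively good
print(balanced_error(y_true, y_pred, 2))        # 0.5: exposes the bias towards class 0
```

The trivial predictor looks nearly perfect under the misclassification error, yet attains the worst possible per-class error on the rare label, which the balanced error makes visible.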
Broadly, extant approaches to coping with class imbalance modify: (i) the inputs to a model, for example by over- or under-sampling (Kubat & Matwin, 1997; Chawla et al., 2002; Wallace et al., 2011; Mikolov et al., 2013; Mahajan et al., 2018; Yin et al., 2018); (ii) the outputs of a model, for example by post-hoc correction of the decision threshold (Fawcett & Provost, 1996; Collell et al., 2016) or weights (Kim & Kim, 2019; Kang et al., 2020); or (iii) the training procedure of a model, for example by modifying the loss function (Zhang et al., 2017; Cui et al., 2019; Cao et al., 2019; Tan et al., 2020; Jamal et al., 2020). One may easily combine approaches from the first stream with those from the latter two. Consequently, we focus on the latter two in this work, and describe some representative recent examples of each.

Post-hoc weight normalisation. Suppose f_y(x) = w_y^T Φ(x) for classification weights w_y ∈ R^D and representations Φ: X → R^D, as learned by a neural network. (We may add per-label bias terms to f_y by adding a constant feature to Φ.) A fruitful avenue of exploration involves decoupling representation and classifier learning (Zhang et al., 2019).

Figure 1: Mean and standard deviation of per-class weight norms ||w_y||_2 over 5 runs for a ResNet-32 under momentum and Adam optimisers. We use long-tailed ("LT") versions of CIFAR-10 and CIFAR-100, and sort classes in descending order of frequency; the first class is 100× more likely to appear than the last class (see §6.2). Both optimisers yield comparable balanced error. However, the weight norms have incompatible trends: under momentum, the norms are strongly correlated with class frequency, while under Adam, the norms are anti-correlated with or independent of class frequency. Consequently, weight normalisation under Adam is ineffective for combatting class imbalance.
Concretely, one first learns {w_y, Φ} via standard training on the long-tailed training sample S, and then for x ∈ X predicts the label

argmax_{y∈[L]} w_y^T Φ(x) / ν_y^τ = argmax_{y∈[L]} f_y(x) / ν_y^τ,   (3)

for τ > 0, where ν_y = P(y) in Kim & Kim (2019); Ye et al. (2020), and ν_y = ||w_y||_2 in Kang et al. (2020). Intuitively, either choice of ν_y upweights the contribution of rare labels through weight normalisation. The choice ν_y = ||w_y||_2 is motivated by the observation that ||w_y||_2 tends to correlate with P(y). Further to the above, one may enforce ||w_y||_2 = 1 during training (Kim & Kim, 2019).
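The normalised prediction rule in (3) can be sketched as follows (a minimal illustration of ours; the weights and feature vector are hypothetical toy values, chosen so that normalisation flips the prediction):

```python
import numpy as np

def weight_normalised_predict(W, phi, nu, tau=1.0):
    # Predict argmax_y (w_y^T phi) / nu_y^tau, as in the post-hoc
    # weight normalisation rule of eq. (3).
    logits = W @ phi  # f_y(x) = w_y^T Phi(x)
    return int(np.argmax(logits / nu ** tau))

W = np.array([[2.0, 0.0],    # dominant class: large weight norm
              [0.9, 0.1]])   # rare class: small weight norm
phi = np.array([1.0, 1.0])
norms = np.linalg.norm(W, axis=1)  # nu_y = ||w_y||_2

print(weight_normalised_predict(W, phi, norms, tau=0.0))  # 0: unnormalised prediction
print(weight_normalised_predict(W, phi, norms, tau=1.0))  # 1: normalisation boosts the rare class
```

Here the rare class has the smaller weight norm, so dividing by ν_y^τ lifts its score above the dominant class; §2.1 shows this premise can fail under adaptive optimisers.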

Loss modification.

A classic means of coping with class imbalance is to balance the loss, wherein ℓ(y, f(x)) is weighted by P(y)^{-1} (Xie & Manski, 1989; Morik et al., 1999): e.g., applied to (1),

ℓ(y, f(x)) = (1 / P(y)) · log[1 + Σ_{y'≠y} e^{f_{y'}(x) − f_y(x)}].   (4)

While intuitive, balancing has minimal effect in separable settings: solutions that achieve zero training loss will necessarily remain optimal even under weighting (Byrd & Lipton, 2019). Intuitively, one would instead like to shift the separator closer to a dominant class. Li et al. (2002); Wu et al. (2008); Masnadi-Shirazi & Vasconcelos (2010) thus proposed to add per-class margins into the hinge loss. Cao et al. (2019) similarly proposed to add a per-class margin into the softmax cross-entropy:

ℓ(y, f(x)) = log[1 + Σ_{y'≠y} e^{δ_y} · e^{f_{y'}(x) − f_y(x)}],  where δ_y ∝ P(y)^{-1/4}.   (5)

This upweights a rare "positive" y to encourage a larger gap f_y(x) − f_{y'}(x), i.e., the margin between y and any "negative" y' ≠ y. Separately, Tan et al. (2020) proposed

ℓ(y, f(x)) = log[1 + Σ_{y'≠y} e^{δ_{y'}} · e^{f_{y'}(x) − f_y(x)}],   (6)

where δ_{y'} ≤ 0 is a non-decreasing transform of P(y'). Note that in the original softmax cross-entropy, with δ_{y'} = 0, a rare label often receives a strong inhibitory gradient signal, as it disproportionately appears as a negative for dominant labels. This can be modulated by letting δ_{y'} ≪ 0.
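A minimal NumPy sketch (ours, not from the paper) of the balanced loss (4) and a per-class margin loss in the style of (5); the constant C scaling the margin is a hyperparameter we introduce for illustration:

```python
import numpy as np

def softmax_ce(logits, y):
    # Standard softmax cross-entropy, eq. (1).
    return np.log(np.sum(np.exp(logits))) - logits[y]

def balanced_ce(logits, y, priors):
    # Balanced loss, eq. (4): weight the example by 1 / P(y).
    return softmax_ce(logits, y) / priors[y]

def margin_ce(logits, y, priors, C=1.0):
    # Margin loss in the style of eq. (5): enforce a gap delta_y ~ P(y)^{-1/4}
    # by shifting the positive logit down before the cross-entropy.
    delta = C * priors ** (-0.25)
    shifted = logits.copy()
    shifted[y] -= delta[y]
    return np.log(np.sum(np.exp(shifted))) - shifted[y]

logits = np.array([1.0, 1.0])
priors = np.array([0.9, 0.1])
# The rare class (y = 1) incurs a larger loss under the margin variant,
# demanding a larger score gap before its loss becomes small.
print(softmax_ce(logits, 1) < margin_ce(logits, 1, priors))  # True
```

Note that shifting the positive logit down by δ_y before the cross-entropy is algebraically identical to the margin form in (5).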

2.1. LIMITATIONS OF EXISTING APPROACHES

Each of the above methods is intuitive, and has shown strong empirical performance. However, a closer analysis identifies some subtle limitations.

Limitations of weight normalisation. Post-hoc weight normalisation with ν_y = ||w_y||_2 per Kang et al. (2020) is motivated by the observation that the weight norm ||w_y||_2 tends to correlate with P(y). However, this assumption is highly dependent on the choice of optimiser, as Figure 1 illustrates: for ResNet-32 models trained on long-tailed versions of CIFAR-10 and CIFAR-100, when using the Adam optimiser, the norms are either anti-correlated with or independent of P(y). Weight normalisation thus cannot achieve the desired effect of boosting rare labels' scores. One may hope to side-step this by simply using ν_y = P(y); unfortunately, this choice has more subtle limitations (see §4.2).

Limitations of loss modification. Enforcing a per-label margin per (5) and (6) is intuitive, as it allows for shifting the decision boundary away from rare classes. However, when doing so, it is important to ensure Fisher consistency (Lin, 2004) (or classification calibration (Bartlett et al., 2006)) of the resulting loss for the balanced error. That is, the minimiser of the expected loss (equally, the empirical risk in the infinite-sample limit) should result in a minimal balanced error. Unfortunately, neither (5) nor (6) is consistent in this sense, even for binary problems; see §5.1 and §6.1 for details.

3. LOGIT ADJUSTMENT FOR LONG-TAIL LEARNING: A STATISTICAL VIEW

The above suggests that there is scope for improving performance on long-tail problems, both in terms of post-hoc correction and loss modification. We now show how a statistical perspective suggests simple procedures of each type, both of which overcome the limitations discussed above.

Recall that our goal is to minimise the balanced error (2). A classical result is that the best possible or Bayes-optimal scorer for this problem, i.e., f* ∈ argmin_{f: X→R^L} BER(f), satisfies (Menon et al., 2013; Koyejo et al., 2014, Corollary 4; Collell et al., 2016, Theorem 1)

argmax_{y∈[L]} f*_y(x) = argmax_{y∈[L]} P^bal(y | x) = argmax_{y∈[L]} P(x | y),   (7)

where P^bal is the balanced class-probability as per §2. In words, the Bayes-optimal prediction is the label under which the given instance x ∈ X is most likely. Consequently, for fixed class-conditionals P(x | y), varying the class priors P(y) arbitrarily will not affect the optimal scorers. This is intuitively desirable: the balanced error is agnostic to the level of imbalance in the label distribution.

To further probe (7), suppose the underlying class-probabilities P(y | x) ∝ exp(s*_y(x)) for some scorer s*: X → R^L. Since by definition P^bal(y | x) ∝ P(y | x)/P(y), (7) becomes

argmax_{y∈[L]} P^bal(y | x) = argmax_{y∈[L]} exp(s*_y(x))/P(y) = argmax_{y∈[L]} s*_y(x) − ln P(y),   (8)

i.e., we translate the optimal logits based on the class priors. Equation (8) provides the ideal predictions for optimising the balanced error, which necessarily rely on the unknown quantities s*_y(x), P(y) that depend on the underlying distribution. Nonetheless, we may seek to approximate these quantities based on our training sample.
Concretely, (8) suggests two means of optimising for the balanced error: (i) train a model to estimate the standard P(y | x) (e.g., by minimising the standard softmax cross-entropy on the long-tailed data), and then explicitly modify its logits post-hoc as per (8); or (ii) train a model to estimate the balanced P^bal(y | x), whose logits are implicitly modified as per (8). Such logit adjustment techniques, which have been a classic approach to class imbalance (Provost, 2000), neatly align with the post-hoc and loss modification streams discussed in §2. However, unlike most previous techniques from these streams, logit adjustment is endowed with a clear statistical grounding: by construction, the optimal solution under such adjustment coincides with the Bayes-optimal solution (7) for the balanced error, i.e., it is Fisher consistent for minimising the balanced error. We now study each of the techniques (i) and (ii) in turn.

4. POST-HOC LOGIT ADJUSTMENT

We now propose a post-hoc logit adjustment scheme for a classifier trained on long-tailed data. We further show this has subtle advantages over recent weight normalisation schemes.

4.1. THE POST-HOC LOGIT ADJUSTMENT PROCEDURE

When employing the softmax cross-entropy to train a neural network, we aim to approximate the underlying P(y | x) with p_y(x) ∝ exp(f_y(x)) for logits f_y(x) = w_y^T Φ(x). Given learned {w, Φ}, one typically predicts the label argmax_{y∈[L]} f_y(x), i.e., the most likely label under the model's P(y | x). In post-hoc logit adjustment, we propose to instead predict, for suitable τ > 0:

argmax_{y∈[L]} exp(w_y^T Φ(x))/π_y^τ = argmax_{y∈[L]} f_y(x) − τ · log π_y,   (9)

where π ∈ Δ_L (for the probability simplex Δ_L) comprises estimates of the class priors P(y), e.g., the empirical frequencies on the training sample S. Effectively, (9) adds a label-dependent offset to each of the logits. When τ = 1, this can be seen as applying (8) with a plugin estimate of P(y | x), i.e., p_y(x) ∝ exp(w_y^T Φ(x)). When τ ≠ 1, this can be seen as applying (8) to temperature-scaled estimates p̄_y(x) ∝ exp(τ^{-1} · w_y^T Φ(x)). To unpack this, recall that (8) justifies post-hoc logit thresholding given access to the true probabilities P(y | x). In practice, high-capacity neural networks often produce uncalibrated estimates of these probabilities (Guo et al., 2017). Temperature scaling is a means to calibrate the estimates, and is routinely employed for distillation (Hinton et al., 2015). One may treat τ as a tuning parameter to be chosen based on holdout calibration, e.g., the expected calibration error (Murphy & Winkler, 1987; Guo et al., 2017), probabilistic sharpness (Gneiting et al., 2007; Kuleshov et al., 2018), or a proper scoring rule such as the log-loss or squared error (Gneiting & Raftery, 2007). One may alternately fix τ = 1 and aim to learn inherently calibrated probabilities, e.g., via label smoothing (Szegedy et al., 2016; Müller et al., 2019).
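The prediction rule (9) is straightforward to implement; below is a minimal NumPy sketch (the toy logits and priors are hypothetical values of ours):

```python
import numpy as np

def posthoc_logit_adjusted_predict(logits, priors, tau=1.0):
    # Post-hoc logit adjustment, eq. (9): argmax_y f_y(x) - tau * log(pi_y).
    return np.argmax(logits - tau * np.log(priors), axis=-1)

logits = np.array([[2.0, 1.5]])   # a trained model mildly favours class 0
priors = np.array([0.99, 0.01])   # but class 1 is 99x rarer

print(posthoc_logit_adjusted_predict(logits, priors, tau=0.0))  # [0]: unadjusted
print(posthoc_logit_adjusted_predict(logits, priors, tau=1.0))  # [1]: the offset favours the rare class
```

In practice the same rule would be applied to the logits of a trained network, with π the empirical class frequencies of the training sample.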
Post-hoc logit adjustment with τ = 1 is not a new idea in the classical label imbalance literature (Fawcett & Provost, 1996; Provost, 2000; Maloof, 2003; Zhou & Liu, 2006; Collell et al., 2016); however, it has had limited exploration in the recent long-tail learning literature. Further, the case τ ≠ 1 is important in practical usage of neural networks, owing to their typical lack of probabilistic calibration (Guo et al., 2017). Interestingly, post-hoc logit adjustment also has an important advantage over recently proposed weight normalisation techniques, as we now discuss.

4.2. COMPARISON TO POST-HOC WEIGHT NORMALISATION

Recall that weight normalisation involves learning logits f_y(x) = w_y^T Φ(x), and then post-hoc normalising the weights via w_y/ν_y^τ for τ > 0. We demonstrated in §2.1 that using ν_y = ||w_y||_2 may be ineffective when using adaptive optimisers. However, even with ν_y = π_y, there is a subtle contrast to post-hoc logit adjustment: while the former performs a multiplicative update to the logits, the latter performs an additive update. The two techniques may thus yield different orderings over labels, since in general

w_1^T Φ(x)/π_1 < w_2^T Φ(x)/π_2  neither implies nor is implied by  exp(w_1^T Φ(x))/π_1 < exp(w_2^T Φ(x))/π_2.

Observe that if a rare label y has a negative score w_y^T Φ(x) < 0, and there is another label with a positive score, then it is impossible for weight normalisation to give y the highest score. By contrast, under logit adjustment, the offset in w_y^T Φ(x) − ln π_y boosts rare classes more than dominant ones, regardless of the sign of the original score. Weight normalisation is thus not consistent for the balanced error, unlike logit adjustment.
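A small numeric illustration of this contrast (toy values of ours): with a negative score on the rare label, the multiplicative and additive updates order the labels differently:

```python
import numpy as np

scores = np.array([0.5, -0.2])   # label 1 is rare and has a negative score
priors = np.array([0.99, 0.01])
tau = 1.0

weight_norm = scores / priors ** tau         # multiplicative update, eq. (3)
logit_adj = scores - tau * np.log(priors)    # additive update, eq. (9)

print(np.argmax(weight_norm))  # 0: dividing the negative score by a small prior only lowers it
print(np.argmax(logit_adj))    # 1: the additive offset can still promote the rare label
```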

5. THE LOGIT ADJUSTED SOFTMAX CROSS-ENTROPY

We now show how to directly encode logit adjustment into the softmax cross-entropy. The resulting approach has an intuitive relation to existing loss modification techniques.

5.1. THE LOGIT ADJUSTED LOSS

From §3, the second approach to optimising for the balanced error is to directly model P^bal(y | x) ∝ P(y | x)/P(y). To do so, consider the following logit adjusted softmax cross-entropy loss, for τ > 0:

ℓ(y, f(x)) = −log[e^{f_y(x) + τ·log π_y} / Σ_{y'∈[L]} e^{f_{y'}(x) + τ·log π_{y'}}] = log[1 + Σ_{y'≠y} (π_{y'}/π_y)^τ · e^{f_{y'}(x) − f_y(x)}].   (10)

Given a scorer that minimises the above, we now predict argmax_{y∈[L]} f_y(x) as usual. Compared to the standard softmax cross-entropy (1), the above applies a label-dependent offset to each logit. Compared to (9), we directly enforce the class prior offset while learning the logits, rather than doing so post-hoc. The two approaches have a deeper connection: observe that (10) is equivalent to using a scorer of the form g_y(x) = f_y(x) + τ · log π_y, with argmax_{y∈[L]} f_y(x) = argmax_{y∈[L]} g_y(x) − τ · log π_y. Consequently, one can equivalently view learning with this loss as learning a standard scorer g(x) and post-hoc adjusting its logits. For non-convex objectives, as encountered in neural networks, the bias endowed by adding τ · log π_y to the logits is likely to result in a different local minimum, typically with improved performance.
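A minimal NumPy sketch (ours) of the loss in (10) for a single example:

```python
import numpy as np

def logit_adjusted_ce(logits, y, priors, tau=1.0):
    # Logit adjusted softmax cross-entropy, eq. (10): add tau * log(pi_y)
    # to each logit, then apply the usual softmax cross-entropy.
    adjusted = logits + tau * np.log(priors)
    return np.log(np.sum(np.exp(adjusted))) - adjusted[y]

logits = np.array([1.0, 1.0])
priors = np.array([0.9, 0.1])

# With tau = 0 the loss reduces to the standard softmax cross-entropy (1);
# with tau = 1 the rare positive (y = 1) incurs a much larger loss than the
# dominant positive, pushing the model towards a larger margin for it.
print(logit_adjusted_ce(logits, 1, priors, tau=0.0))  # log(2)
print(logit_adjusted_ce(logits, 1, priors) > logit_adjusted_ce(logits, 0, priors))  # True
```

A batched version of the same adjustment would simply be added to the logits before a framework's standard cross-entropy call.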

5.2. COMPARISON TO LOSS MODIFICATION TECHNIQUES

For more insight into the logit adjusted loss, consider the following pairwise margin loss:

ℓ(y, f(x)) = α_y · log[1 + Σ_{y'≠y} e^{Δ_{yy'}} · e^{f_{y'}(x) − f_y(x)}],   (11)

for label weights α_y > 0 and pairwise label margins Δ_{yy'}, representing the desired gap between the scores for y and y'. For τ = 1, our logit adjusted loss (10) corresponds to (11) with α_y = 1 and Δ_{yy'} = log(π_{y'}/π_y). This demands a larger margin between rare positive (π_y ~ 0) and dominant negative (π_{y'} ~ 1) labels, so that scores for dominant classes do not overwhelm those for rare ones.

Existing loss modification techniques can be viewed as special cases of (11). For example, α_y = 1/π_y and Δ_{yy'} = 0 yields the balanced loss (4). When α_y = 1, the choice Δ_{yy'} = π_y^{-1/4} yields (5). Finally, Δ_{yy'} = log F(π_{y'}) yields (6), where F: [0, 1] → (0, 1] is some non-decreasing function. These losses thus consider either the frequency of the positive y, or of the negative y', but not both. Remarkably, the specific choice that leads to our loss in (10) has a firm statistical grounding: it ensures Fisher consistency (in the sense of, e.g., Bartlett et al. (2006)) for the balanced error. Proofs for all results are in the supplementary material.

Theorem 1. For any δ ∈ R^L_+, the pairwise loss in (11) is Fisher consistent for the balanced error with weights and margins

α_y = δ_y / P(y),  Δ_{yy'} = log(δ_{y'} / δ_y).

Letting δ_y = π_y, we immediately deduce that the logit-adjusted loss of (10) is consistent, provided our π_y is a consistent estimate of P(y). Similarly, δ_y = 1 recovers the classic result that the balanced loss is consistent. While Theorem 1 only provides a sufficient condition in the multiclass setting, one can provide a necessary and sufficient condition that rules out other choices of Δ in the binary case.

Theorem 2. Suppose Y = {±1}. Let δ_y := Δ_{y,−y}, and σ(z) = (1 + exp(−z))^{-1}. Then, the pairwise margin loss in (11) is Fisher consistent for the balanced error iff

(α_{+1} / α_{−1}) · (σ(δ_{+1}) / σ(δ_{−1})) = (1 − P(y = +1)) / P(y = +1).
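The unification can be checked numerically; the sketch below (ours) implements the general loss (11) and verifies that the logit-adjustment choice of α and Δ reproduces (10) with τ = 1:

```python
import numpy as np

def pairwise_margin_loss(logits, y, alpha, Delta):
    # Pairwise margin loss, eq. (11):
    # alpha_y * log(1 + sum_{y' != y} exp(Delta[y, y'] + f_{y'} - f_y)).
    others = [yp for yp in range(len(logits)) if yp != y]
    s = sum(np.exp(Delta[y, yp] + logits[yp] - logits[y]) for yp in others)
    return alpha[y] * np.log1p(s)

priors = np.array([0.9, 0.1])
logits = np.array([1.0, 1.0])

# Our logit adjusted loss: alpha_y = 1, Delta[y, y'] = log(pi_{y'} / pi_y).
Delta_la = np.log(priors)[None, :] - np.log(priors)[:, None]
loss_pairwise = pairwise_margin_loss(logits, 1, np.ones(2), Delta_la)

# Sanity check: this matches eq. (10) with tau = 1, evaluated directly.
direct = np.log(np.sum((priors / priors[1]) * np.exp(logits - logits[1])))
print(np.isclose(loss_pairwise, direct))  # True
```

Swapping in α_y = 1/π_y with Δ = 0, or the margins of (5) and (6), recovers the other special cases discussed above.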

5.3. DISCUSSION AND EXTENSIONS

Our pairwise margin loss in (11) subsumes several existing loss-correction approaches in the literature. Further, it suggests the exploration of new choices of Δ. For example, we shall see the efficacy of combining the Δ implicit in the adaptive loss of Cao et al. (2019) with our proposed Δ. One may also generalise the formulation in Theorem 1, and employ Δ_{yy'} = τ_1 · log π_{y'} − τ_2 · log π_y for τ_1, τ_2 > 0. This interpolates between our loss (τ_1 = τ_2) and a version of the equalised loss (τ_2 = 0). For τ = −1, a loss similar to (10) has been considered in the context of negative sampling for scalability (Yi et al., 2019): here, one samples a subset of negatives based on π, and corrects the logits to obtain an unbiased estimate of the loss based on all negatives (Bengio & Senecal, 2008). Losses of the general form (11) have also been explored for structured prediction (Pletscher et al., 2010).

Cao et al. (2019, Theorem 2) provide a rigorous generalisation bound for the adaptive margin loss under the assumption of separable data with binary labels. The inconsistency of this loss with respect to the balanced error concerns the more general scenario of non-separable multiclass data, which may occur, e.g., owing to label noise or limited model capacity. We shall subsequently demonstrate that encouraging consistency is not merely of theoretical interest, and can lead to gains in practice.

6. EXPERIMENTAL RESULTS

We now present experiments confirming our main claims: (i) on simple binary problems, existing weight normalisation and loss modification techniques may not converge to the optimal solution ( §6.1); (ii) on real-world datasets, our post-hoc logit adjustment generally outperforms weight normalisation, and one can obtain further gains via our logit adjusted softmax cross-entropy ( §6.2).

6.1. RESULTS ON SYNTHETIC DATASET

We begin with a binary classification task, wherein samples from class y ∈ {±1} are drawn from a 2D Gaussian with isotropic covariance and mean µ_y = y · (+1, +1). We introduce class imbalance by setting P(y = +1) = 5%. The Bayes-optimal classifier for the balanced error is (see Appendix F)

f*(x) = +1 ⟺ P(x | y = +1) > P(x | y = −1) ⟺ (µ_{+1} − µ_{−1})^T x > 0,   (12)

i.e., it is a linear separator passing through the origin. We compare this separator against those found by several margin losses based on (11): standard ERM (Δ_{yy'} = 0), the adaptive loss (Cao et al., 2019) (Δ_{yy'} = π_y^{-1/4}), an instantiation of the equalised loss (Tan et al., 2020) (Δ_{yy'} = log π_{y'}), and our logit adjusted loss (Δ_{yy'} = log(π_{y'}/π_y)). For each loss, we train an affine classifier on a sample of 10,000 instances, and evaluate the balanced error on a test set of 10,000 samples over 100 independent trials.

Figure 2 confirms that the logit adjusted margin loss attains a balanced error close to that of the Bayes-optimal classifier, which is visually reflected by its learned separator closely matching that in (12). This is in line with our claim that the logit adjusted margin loss is consistent for the balanced error, unlike other approaches. Figure 2 also compares post-hoc weight normalisation and logit adjustment for varying scaling parameter τ (cf. (3), (9)). Logit adjustment is seen to approach the performance of the Bayes predictor; any amount of weight normalisation is, however, seen to hamper performance. This verifies the consistency of logit adjustment, and the inconsistency of weight normalisation (§4.2).
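A rough re-creation of this synthetic setup (ours; details such as the unit covariance are assumptions), checking that the Bayes separator (12) attains a low balanced error:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # P(y = +1) = 5%; class-conditionals N(y * (1, 1), I) (assumed covariance).
    y = np.where(rng.random(n) < 0.05, 1, -1)
    x = y[:, None] * np.ones(2) + rng.normal(size=(n, 2))
    return x, y

def balanced_error(x, y, w):
    pred = np.where(x @ w > 0, 1, -1)
    return float(np.mean([np.mean(pred[y == c] != c) for c in (-1, 1)]))

x, y = sample(10_000)
w_bayes = np.array([1.0, 1.0])  # direction mu_{+1} - mu_{-1}, as in eq. (12)
print(balanced_error(x, y, w_bayes))  # close to the Bayes balanced error of roughly 0.079
```

Training the various margin losses on such samples and comparing their separators against w_bayes reproduces the flavour of the Figure 2 comparison.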

6.2. RESULTS ON REAL-WORLD DATASETS

We present results on the CIFAR-10, CIFAR-100, ImageNet and iNaturalist 2018 datasets. Following prior work, we create "long-tailed" versions of the CIFAR datasets by suitably downsampling examples per label following the EXP profile of Cui et al. (2019); Cao et al. (2019), with imbalance ratio ρ = max_y P(y) / min_y P(y) = 100. Similarly, we use the long-tailed version of ImageNet produced by Liu et al. (2019). We employ a ResNet-32 for CIFAR, and a ResNet-50 for ImageNet and iNaturalist. All models are trained using SGD with momentum; see Appendix C for more details. See also Appendix D.3 for results on CIFAR under the STEP profile also considered in the literature.

Baselines. We consider several representative baselines: (i) empirical risk minimisation (ERM) on the long-tailed data; (ii) post-hoc weight normalisation (Kang et al., 2020) per (3) (using ν_y = ||w_y||_2), applied to ERM; (iii) the class-balanced loss of Cui et al. (2019); (iv) the adaptive margin loss (Cao et al., 2019) per (5), including with the "deferred reweighting" (DRW) training scheme; and (v) the equalised loss (Tan et al., 2020) per (6), with δ_y = F(π_y) for the threshold-based F of Tan et al. (2020). Where possible, we report numbers for the baselines from the respective papers.

Table 2: Test set balanced error (averaged over 5 trials) on real-world datasets. Here, †, , ‡ denote numbers for "LDAM + SGD" and "LDAM + DRW" from Cao et al. (2019, Tables 2, 3); "τ-normalised" from Kang et al. (2020, Tables 3, 7); and "Class-Balanced" from Cui et al. (2019, Tables 2, 3). Here, τ = τ* refers to using the best possible tuning parameter τ; see Figure 3 for results on various τ. Highlighted cells denote the best performing method for a given dataset.

Figure 3: Test set balanced error of post-hoc weight normalisation versus post-hoc logit adjustment for varying τ (cf. (3), (9)). Post-hoc logit adjustment consistently outperforms weight normalisation.

Our methods.

We compare the above methods against our proposed post-hoc logit adjustment (9) and logit adjusted loss (10). For post-hoc logit adjustment, we fix the scalar τ = 1 for our basic results; we analyse the effect of tuning this in Figure 3. We additionally evaluate a combination of our logit adjusted softmax cross-entropy with the adaptive margin of Cao et al. (2019); this uses (11) with Δ_{yy'} = log(π_{y'}/π_y) + π_y^{-1/4}. We do not perform any further tuning of our techniques. For all methods, we report the balanced error on the test set. (Note that since the test sets are all balanced for these benchmarks, the balanced error is equivalent to the misclassification error.) For all methods, we pre-compute π as the empirical label frequency on the entire training set.

Results and analysis. Table 2 demonstrates that our proposed logit adjustment techniques consistently outperform existing methods. Indeed, weight normalisation with τ = 1 is generally improved significantly by post-hoc logit adjustment (e.g., an 8% relative reduction on CIFAR-10). Similarly, loss correction techniques are generally outperformed by our logit adjusted softmax cross-entropy (e.g., a 6% relative reduction on iNaturalist). Cao et al. (2019) observed that their loss benefits from a deferred reweighting scheme (DRW), wherein class weighting is applied after a fixed number of epochs. Table 2 indicates this is consistently outperformed by suitable variants of logit adjustment.

Table 2 only reports results for logit adjustment with scalar τ = 1. In practice, tuning τ can significantly improve performance further. Figure 3 studies the effect of tuning τ for post-hoc weight normalisation (using ν_y = ||w_y||_2) and post-hoc logit adjustment. Even without any scaling, post-hoc logit adjustment generally offers superior performance to the best result from weight normalisation (cf. Table 2); with scaling, this is further improved.
In practice, one may choose τ via cross-validation against the balanced error on the training set. For example, on CIFAR-10-LT, we estimate τ* = 2.6 for post-hoc logit adjustment, for which the resulting balanced test error of 18.73% is superior to that of weight normalisation for any τ.

To better understand the gains, Figure 4 reports errors on a per-group basis, where following Kang et al. (2020) we construct three groups of classes, "Many", "Medium", and "Few", comprising those with ≥ 100, between 20 and 100, and ≤ 20 training examples respectively. Logit adjustment shows consistent gains on the "Medium" and "Few" groups, albeit at some expense in "Many" group performance. See Appendix D.2 for a finer-grained breakdown.

While both logit adjustment techniques perform similarly, there is a slight advantage to the loss function version. Nonetheless, the strong performance of post-hoc logit adjustment corroborates the ability to decouple representation and classifier learning in long-tail settings (Zhang et al., 2019). A reference implementation of our methods is planned for release at: https://github.com/google-research/google-research/tree/master/logit_adjustment.
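The τ sweep can be sketched as below (a hypothetical helper of ours; the synthetic "validation" logits are constructed to absorb the skewed prior, so a τ near 1 corrects them):

```python
import numpy as np

def balanced_error(logits, labels, num_classes):
    preds = np.argmax(logits, axis=1)
    return float(np.mean([np.mean(preds[labels == c] != c) for c in range(num_classes)]))

def tune_tau(val_logits, val_labels, priors, taus=np.linspace(0.0, 3.0, 31)):
    # Hypothetical helper: sweep tau and keep the value minimising the
    # balanced error under post-hoc adjustment, eq. (9).
    n = len(priors)
    return float(min(taus, key=lambda t: balanced_error(
        val_logits - t * np.log(priors), val_labels, n)))

rng = np.random.default_rng(1)
priors = np.array([0.9, 0.1])
labels = rng.choice(2, size=2000, p=priors)
# Synthetic logits: class scores contaminated by the skewed log-prior.
logits = 1.5 * np.eye(2)[labels] + np.log(priors) + 0.5 * rng.normal(size=(2000, 2))

tau_star = tune_tau(logits, labels, priors)
print(tau_star > 0.0)  # True: some adjustment beats the unadjusted predictions
print(balanced_error(logits - tau_star * np.log(priors), labels, 2)
      <= balanced_error(logits, labels, 2))  # True by construction of the sweep
```

Since τ = 0 is included in the sweep, the selected τ can never do worse than no adjustment on the data used for selection.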

7. DISCUSSION AND FUTURE WORK

Table 2 shows the advantage of logit adjustment over recent proposals, under standard setups from the literature. Further improvements are possible by fusing complementary ideas, and we remark on a few such options. First, one may use a more complex base architecture; e.g., Kang et al. (2020) found gains by employing a ResNet-152 and training for 200 epochs. Table 3 (Appendix) confirms that logit adjustment similarly benefits from this choice, achieving a balanced error of 30.12% on iNaturalist, and 28.02% when combined with the adaptive margin. Second, the DRW training scheme (which applies to any loss) may result in further gains for our techniques. Third, one may incorporate developments in meta-learning (Wang et al., 2017; Jamal et al., 2020).

A PROOFS OF RESULTS IN BODY

Proof of Theorem 1. Denote η_y(x) = P(y | x). Suppose we employ a margin Δ_{yy'} = log(δ_{y'}/δ_y). Then, the loss is

ℓ(y, f(x)) = −log[δ_y · e^{f_y(x)} / Σ_{y'∈[L]} δ_{y'} · e^{f_{y'}(x)}] = −log[e^{f_y(x) + log δ_y} / Σ_{y'∈[L]} e^{f_{y'}(x) + log δ_{y'}}].

Consequently, under constant weights α_y = 1, the Bayes-optimal scores will satisfy f*_y(x) + log δ_y = log η_y(x), or f*_y(x) = log(η_y(x)/δ_y).

Now suppose we use generic weights α ∈ R^L_+. The risk under this loss is

E_{x,y}[ℓ_α(y, f(x))] = Σ_{y∈[L]} π_y · E_{x|y}[ℓ_α(y, f(x))] = Σ_{y∈[L]} π_y · α_y · E_{x|y}[ℓ(y, f(x))] ∝ Σ_{y∈[L]} π̄_y · E_{x|y}[ℓ(y, f(x))],

where π̄_y ∝ π_y · α_y. Consequently, learning with the weighted loss is equivalent to learning with the original loss on a distribution with modified base-rates π̄. Under such a distribution, the class-probability function is

η̄_y(x) = P̄(y | x) = P(x | y) · π̄_y / P̄(x) = η_y(x) · (π̄_y/π_y) · (P(x)/P̄(x)) ∝ η_y(x) · α_y.

Consequently, suppose α_y = δ_y/π_y. Then,

f*_y(x) = log(η̄_y(x)/δ_y) = log(η_y(x)/π_y) + C(x),

where C(x) does not depend on y. Consequently, argmax_{y∈[L]} f*_y(x) = argmax_{y∈[L]} η_y(x)/π_y, which is the Bayes-optimal prediction for the balanced error. In sum, a consistent family can be obtained by choosing any set of constants δ_y > 0 and setting

α_y = δ_y / π_y,  Δ_{yy'} = log(δ_{y'} / δ_y).

Proof of Theorem 2. We establish a more general result in Lemma 3 of the next section, which allows for a temperature parameter in the loss. This allows for interpolating between the standard softmax cross-entropy and margin-based losses.

B ON THE CONSISTENCY OF BINARY MARGIN-BASED LOSSES

It is instructive to study the pairwise margin loss (11) in the binary case. Endowing the loss with a temperature parameter $\gamma > 0$, we get
$$\ell(+1, f) = \frac{\omega_{+1}}{\gamma} \cdot \log\left(1 + e^{\gamma \cdot \delta_{+1}} \cdot e^{-\gamma \cdot f}\right) \qquad \ell(-1, f) = \frac{\omega_{-1}}{\gamma} \cdot \log\left(1 + e^{\gamma \cdot \delta_{-1}} \cdot e^{\gamma \cdot f}\right) \tag{13}$$
for constants $\omega_{\pm 1}, \gamma > 0$ and $\delta_{\pm 1} \in \mathbb{R}$. Here, we have used $\delta_{+1} = \Delta_{+1,-1}$ and $\delta_{-1} = \Delta_{-1,+1}$ for simplicity. The choice $\omega_{\pm 1} = 1, \delta_{\pm 1} = 0$ recovers the temperature-scaled binary logistic loss. Evidently, as $\gamma \to +\infty$, these converge to weighted hinge losses with variable margins, i.e.,
$$\ell(+1, f) = \omega_{+1} \cdot [\delta_{+1} - f]_+ \qquad \ell(-1, f) = \omega_{-1} \cdot [\delta_{-1} + f]_+.$$
We study two properties of this family of losses. First, under what conditions are the losses Fisher consistent for the balanced error? We shall show that there is in fact a simple condition characterising this. Second, do the losses preserve the properness of the original binary logistic loss? We shall show that this is always the case, but that the losses involve fundamentally different approximations.
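The convergence to the hinge limit can be confirmed numerically. The following sketch (function names and constants are illustrative, not from the paper) evaluates the temperature-scaled loss (13) at a large $\gamma$ and compares it against the weighted hinge:

```python
import numpy as np

def pairwise_margin_loss(y, f, omega, delta, gamma):
    """Binary pairwise margin loss (13) with temperature gamma:
    ell(+1, f) = (omega_{+1}/gamma) * log(1 + exp(gamma*(delta_{+1} - f)))
    ell(-1, f) = (omega_{-1}/gamma) * log(1 + exp(gamma*(delta_{-1} + f)))
    """
    z = delta[y] - y * f  # equals delta_{+1} - f for y=+1, delta_{-1} + f for y=-1
    return (omega[y] / gamma) * np.log1p(np.exp(gamma * z))

def hinge_limit(y, f, omega, delta):
    """The gamma -> infinity limit: a weighted hinge with margin delta_y."""
    return omega[y] * max(delta[y] - y * f, 0.0)

omega = {+1: 1.0, -1: 1.0}
delta = {+1: 0.5, -1: 0.5}
f = 0.2
for y in (+1, -1):
    soft = pairwise_margin_loss(y, f, omega, delta, gamma=200.0)
    hard = hinge_limit(y, f, omega, delta)
    assert abs(soft - hard) < 1e-3  # temperature-scaled loss approaches the hinge
```

At $\gamma = 1$, by contrast, the loss behaves like a (shifted) logistic loss; the temperature thus interpolates between the two regimes.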

B.1 CONSISTENCY OF THE BINARY PAIRWISE MARGIN LOSS

Given a loss $\ell$, its Bayes-optimal solution is $f^* \in \operatorname{argmin}_{f \colon \mathcal{X} \to \mathbb{R}} \mathbb{E}\left[ \ell(y, f(x)) \right]$. For consistency with respect to the balanced error in the binary case, we require this optimal solution $f^*$ to satisfy $f^*(x) > 0 \iff \eta(x) > \pi$, where $\eta(x) \doteq \mathbb{P}(y = 1 \mid x)$ and $\pi \doteq \mathbb{P}(y = 1)$ (Menon et al., 2013). This is equivalent to a simple condition on the weights $\omega$ and margins $\delta$ of the pairwise margin loss.

Lemma 3. The losses in (13) are consistent for the balanced error iff
$$\frac{\omega_{+1}}{\omega_{-1}} \cdot \frac{\sigma(\gamma \cdot \delta_{+1})}{\sigma(\gamma \cdot \delta_{-1})} = \frac{1 - \pi}{\pi},$$
where $\sigma(z) = (1 + \exp(-z))^{-1}$.

Proof of Lemma 3. Denote $\eta(x) \doteq \mathbb{P}(y = +1 \mid x)$ and $\pi \doteq \mathbb{P}(y = +1)$. From Lemma 4 below, the pairwise margin loss is proper composite with invertible link function $\Psi \colon [0, 1] \to \mathbb{R} \cup \{\pm\infty\}$. Consequently, since by definition the Bayes-optimal score for a proper composite loss is $f^*(x) = \Psi(\eta(x))$ (Reid & Williamson, 2010), to have consistency for the balanced error, from (14), (15), we require
$$\Psi^{-1}(0) = \pi \iff \frac{1}{1 - \frac{\ell'(+1, 0)}{\ell'(-1, 0)}} = \pi \iff 1 - \frac{\ell'(+1, 0)}{\ell'(-1, 0)} = \frac{1}{\pi} \iff -\frac{\ell'(+1, 0)}{\ell'(-1, 0)} = \frac{1 - \pi}{\pi} \iff \frac{\omega_{+1}}{\omega_{-1}} \cdot \frac{\sigma(\gamma \cdot \delta_{+1})}{\sigma(\gamma \cdot \delta_{-1})} = \frac{1 - \pi}{\pi}.$$

From the above, some admissible parameter choices include:
• $\omega_{+1} = \frac{1}{\pi}, \omega_{-1} = \frac{1}{1 - \pi}, \delta_{\pm 1} = 1$; i.e., the standard weighted loss with a constant margin
• $\omega_{\pm 1} = 1, \delta_{+1} = \frac{1}{\gamma} \cdot \log \frac{1 - \pi}{\pi}, \delta_{-1} = \frac{1}{\gamma} \cdot \log \frac{\pi}{1 - \pi}$; i.e., the unweighted loss with a margin biased towards the rare class, as per our logit adjustment procedure

The second example above is unusual in that it requires scaling the margin with the temperature; consequently, the margin disappears as $\gamma \to +\infty$. Other combinations are of course possible, but note that one cannot arbitrarily choose parameters and hope for consistency in general. Indeed, some inadmissible choices are naïve applications of margin modification or weighting, e.g.,
• $\omega_{+1} = \frac{1}{\pi}, \omega_{-1} = \frac{1}{1 - \pi}, \delta_{+1} = \frac{1}{\gamma} \cdot \log \frac{1 - \pi}{\pi}, \delta_{-1} = \frac{1}{\gamma} \cdot \log \frac{\pi}{1 - \pi}$; i.e., combining both weighting and margin modification
• $\omega_{\pm 1} = 1, \delta_{+1} = \frac{1}{\gamma} \cdot (1 - \pi), \delta_{-1} = \frac{1}{\gamma} \cdot \pi$; i.e., margins set directly by the class priors

We make two final remarks. First, the above only considers consistency of the result of loss minimisation. For any choice of weights and margins, we may apply a suitable post-hoc correction to the predictions to account for any bias in the optimal scores. Second, as $\gamma \to +\infty$, any constant margins $\delta_{\pm 1} > 0$ will have no effect on the consistency condition, since $\sigma(\gamma \cdot \delta_{\pm 1}) \to 1$. The condition will then be wholly determined by the weights $\omega_{\pm 1}$. For example, we may choose $\omega_{+1} = \frac{1}{\pi}, \omega_{-1} = \frac{1}{1 - \pi}, \delta_{+1} = 1$, and $\delta_{-1} = \frac{\pi}{1 - \pi}$; the resulting loss will not be consistent for finite $\gamma$, but will become so in the limit $\gamma \to +\infty$.
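The admissible and inadmissible choices above can be checked mechanically. The following sketch (our own helper, assuming $\sigma$ is the standard logistic sigmoid) evaluates the consistency condition of Lemma 3 for the parameter choices discussed:

```python
import numpy as np

def sigmoid(z):
    """Standard logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def is_consistent(omega, delta, gamma, pi, tol=1e-9):
    """Check the Lemma 3 condition:
    (omega_{+1}/omega_{-1}) * sigma(gamma*delta_{+1}) / sigma(gamma*delta_{-1})
        == (1 - pi) / pi.
    """
    lhs = (omega[+1] / omega[-1]) * sigmoid(gamma * delta[+1]) / sigmoid(gamma * delta[-1])
    return abs(lhs - (1 - pi) / pi) < tol

pi, gamma = 0.2, 2.0
# Admissible: standard weighting with a constant margin.
assert is_consistent({+1: 1/pi, -1: 1/(1-pi)}, {+1: 1.0, -1: 1.0}, gamma, pi)
# Admissible: unweighted loss with logit-adjusted, temperature-scaled margins.
adj = {+1: np.log((1-pi)/pi)/gamma, -1: np.log(pi/(1-pi))/gamma}
assert is_consistent({+1: 1.0, -1: 1.0}, adj, gamma, pi)
# Inadmissible: naively combining weighting and margin adjustment.
assert not is_consistent({+1: 1/pi, -1: 1/(1-pi)}, adj, gamma, pi)
```

The last check makes concrete why combining the two heuristics fails: each choice alone contributes the full factor $(1-\pi)/\pi$, so their combination overshoots it.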

B.2 PROPERNESS OF THE PAIRWISE MARGIN LOSS

In the above, we appealed to the pairwise margin loss being proper composite, in the sense of Reid & Williamson (2010). Intuitively, this specifies that the loss has a Bayes-optimal score of the form $f^*(x) = \Psi(\eta(x))$, where $\Psi$ is some invertible function and $\eta(x) = \mathbb{P}(y = 1 \mid x)$. We have the following general result about the properness of any member of the pairwise margin family.

Lemma 4. The losses in (13) are proper composite, with link function
$$\Psi(p) = \frac{1}{\gamma} \cdot \log\left[ \frac{1}{2} \left( \left( \frac{a \cdot b}{q} - c \right) \pm \sqrt{\left( \frac{a \cdot b}{q} - c \right)^2 + \frac{4 \cdot a}{q}} \right) \right],$$
where $a = \frac{\omega_{+1}}{\omega_{-1}} \cdot \frac{e^{\gamma \cdot \delta_{+1}}}{e^{\gamma \cdot \delta_{-1}}}$, $b = e^{\gamma \cdot \delta_{-1}}$, $c = e^{\gamma \cdot \delta_{+1}}$, and $q = \frac{1 - p}{p}$.

Proof of Lemma 4. The above family of losses is proper composite iff the function
$$f \mapsto \frac{1}{1 - \frac{\ell'(+1, f)}{\ell'(-1, f)}}$$
is invertible (Reid & Williamson, 2010, Corollary 12); the resulting inverse is the link function $\Psi$. We have
$$\ell'(+1, f) = -\omega_{+1} \cdot \frac{e^{\gamma \cdot \delta_{+1}} \cdot e^{-\gamma \cdot f}}{1 + e^{\gamma \cdot \delta_{+1}} \cdot e^{-\gamma \cdot f}} \qquad \ell'(-1, f) = +\omega_{-1} \cdot \frac{e^{\gamma \cdot \delta_{-1}} \cdot e^{\gamma \cdot f}}{1 + e^{\gamma \cdot \delta_{-1}} \cdot e^{\gamma \cdot f}}. \tag{15}$$
The invertibility of (14) is immediate. To compute the link function $\Psi$, note that
$$p = \frac{1}{1 - \frac{\ell'(+1, f)}{\ell'(-1, f)}} \iff \frac{1}{p} = 1 - \frac{\ell'(+1, f)}{\ell'(-1, f)} \iff -\frac{\ell'(+1, f)}{\ell'(-1, f)} = \frac{1 - p}{p} \iff \frac{\omega_{+1}}{\omega_{-1}} \cdot \frac{e^{\gamma \cdot \delta_{+1}} \cdot e^{-\gamma \cdot f}}{1 + e^{\gamma \cdot \delta_{+1}} \cdot e^{-\gamma \cdot f}} \cdot \frac{1 + e^{\gamma \cdot \delta_{-1}} \cdot e^{\gamma \cdot f}}{e^{\gamma \cdot \delta_{-1}} \cdot e^{\gamma \cdot f}} = \frac{1 - p}{p} \iff a \cdot \frac{1 + b \cdot g}{g^2 + c \cdot g} = q,$$
where $a = \frac{\omega_{+1}}{\omega_{-1}} \cdot \frac{e^{\gamma \cdot \delta_{+1}}}{e^{\gamma \cdot \delta_{-1}}}$, $b = e^{\gamma \cdot \delta_{-1}}$, $c = e^{\gamma \cdot \delta_{+1}}$, $g = e^{\gamma \cdot f}$, and $q = \frac{1 - p}{p}$. Thus,
$$a \cdot \frac{1 + b \cdot g}{g^2 + c \cdot g} = q \iff \frac{g^2 + c \cdot g}{1 + b \cdot g} = \frac{a}{q} \iff g^2 + \left( c - \frac{a \cdot b}{q} \right) \cdot g - \frac{a}{q} = 0 \iff g = \frac{1}{2} \left( \left( \frac{a \cdot b}{q} - c \right) \pm \sqrt{\left( \frac{a \cdot b}{q} - c \right)^2 + \frac{4 \cdot a}{q}} \right).$$
As a sanity check, suppose $a = b = c = \gamma = 1$. This corresponds to the standard logistic loss. Then,
$$\Psi(p) = \log\left[ \frac{1}{2} \left( \left( \frac{1}{q} - 1 \right) \pm \sqrt{\left( \frac{1}{q} - 1 \right)^2 + \frac{4}{q}} \right) \right] = \log \frac{p}{1 - p},$$
which is the standard logit function.
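The link function of Lemma 4 is straightforward to implement, and the sanity check at the end of the proof can be verified numerically. The sketch below (our own helper names) evaluates $\Psi$ via the quadratic in $g = e^{\gamma f}$ and checks that the unit parameters reduce it to the logit:

```python
import numpy as np

def link(p, omega, delta, gamma):
    """Link function Psi from Lemma 4, via the quadratic in g = exp(gamma * f)."""
    a = (omega[+1] / omega[-1]) * np.exp(gamma * (delta[+1] - delta[-1]))
    b = np.exp(gamma * delta[-1])
    c = np.exp(gamma * delta[+1])
    q = (1 - p) / p
    disc = (a * b / q - c) ** 2 + 4 * a / q
    g = ((a * b / q - c) + np.sqrt(disc)) / 2  # positive root, since g > 0
    return np.log(g) / gamma

# Sanity check: omega = 1, delta = 0, gamma = 1 gives a = b = c = 1,
# so Psi should reduce to the standard logit function.
for p in (0.1, 0.3, 0.5, 0.9):
    psi = link(p, {+1: 1.0, -1: 1.0}, {+1: 0.0, -1: 0.0}, 1.0)
    assert abs(psi - np.log(p / (1 - p))) < 1e-9
```

Only the positive root of the quadratic is taken, since $g = e^{\gamma f}$ is necessarily positive.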
Figures 5 and 6 compare the link functions for a few different settings:
• the balanced loss, where $\omega_{+1} = \frac{1}{\pi}$, $\omega_{-1} = \frac{1}{1 - \pi}$, and $\delta_{\pm 1} = 1$
• an unequal margin loss, where $\omega_{\pm 1} = 1$, $\delta_{+1} = \frac{1}{\gamma} \cdot \log \frac{1 - \pi}{\pi}$, and $\delta_{-1} = \frac{1}{\gamma} \cdot \log \frac{\pi}{1 - \pi}$
• a balanced + margin loss, where $\omega_{+1} = \frac{1}{\pi}$, $\omega_{-1} = \frac{1}{1 - \pi}$, $\delta_{+1} = 1$, and $\delta_{-1} = \frac{\pi}{1 - \pi}$.
The property $\Psi^{-1}(0) = \pi$ for $\pi = \mathbb{P}(y = 1)$ holds for the first two choices with any $\gamma > 0$, and for the third choice as $\gamma \to +\infty$. This indicates the Fisher consistency of these losses for the balanced error. However, the precise way this is achieved is strikingly different in each case: each loss implicitly involves a fundamentally different link function.

To better understand the effect of parameter choices, Figure 7 illustrates the conditional Bayes risk curves, i.e., $L(p) = p \cdot \ell(+1, \Psi(p)) + (1 - p) \cdot \ell(-1, \Psi(p))$. We remark here that for the balanced error, this function takes the form $L(p) = p \cdot \mathbf{1}[p < \pi] + (1 - p) \cdot \mathbf{1}[p > \pi]$, i.e., it is a "tent-shaped" concave function with a maximum at $p = \pi$. For ease of comparison, we normalise these curves to have a maximum of 1. Figure 7 shows that simply applying unequal margins does not affect the underlying conditional Bayes risk compared to the standard log-loss; thus, the change here is purely in terms of the link function. By contrast, either balancing the loss or applying a combination of weighting and margin modification results in a closer approximation to the conditional Bayes risk curve for the cost-sensitive loss with cost $\pi$.

D ADDITIONAL EXPERIMENTS

We present here additional experiments: (i) a detailed table of results with a more complex base architecture and a larger number of training epochs on ImageNet-LT and iNaturalist; (ii) results for CIFAR-10 and CIFAR-100 on the STEP profile (Cao et al., 2019) with ρ = 100; and (iii) results on synthetic data with varying imbalance ratios.

D.1 RESULTS WITH MORE COMPLEX BASE ARCHITECTURE

Table 3 presents results when using a ResNet-152, trained for either 90 or 200 epochs, on the larger ImageNet-LT and iNaturalist datasets. Consistent with the findings of Kang et al. (2020), training a more complex architecture for longer generally yields significant gains. Logit adjustment, possibly combined with the adaptive margin, is generally superior to the baselines, with the sole exception of ResNet-152 trained for 200 epochs on iNaturalist.

D.2 BREAKDOWN OF PER-CLASS PERFORMANCE

Figure 8 breaks down the per-class accuracies on CIFAR-10, CIFAR-100, and iNaturalist. On the latter two datasets, for ease of visualisation, we aggregate the classes into ten groups based on their frequency-sorted order (so that, e.g., group 0 comprises the top L/10 most frequent classes). As expected, dominant classes generally see a lower error rate with all methods. However, the logit adjusted loss is seen to systematically improve performance over ERM, particularly on rare classes.

D.3 RESULTS ON CIFAR-LT WITH STEP-100 PROFILE

Table 4 summarises results on the STEP-100 profile. Here, with τ = 1, weight normalisation slightly outperforms logit adjustment. However, with τ > 1, logit adjustment is again found to be superior (54.80); see Figure 9 .
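For reference, the post-hoc correction whose scaling $\tau$ is being tuned here can be sketched as follows. This is a minimal illustration with hypothetical helper names and toy numbers: the prediction rule is $\operatorname{argmax}_y f_y(x) - \tau \cdot \log \pi_y$, with $\tau = 1$ targeting the balanced error exactly for a well-calibrated scorer:

```python
import numpy as np

def post_hoc_adjust(logits, priors, tau=1.0):
    """Post-hoc logit adjustment: predict argmax_y f_y(x) - tau * log pi_y.

    tau = 1 corresponds to the balanced-error Bayes rule for calibrated
    scores; tau > 1 can help empirically, as in the STEP-100 results above.
    """
    return np.argmax(logits - tau * np.log(priors), axis=-1)

priors = np.array([0.90, 0.09, 0.01])       # heavily skewed class priors
logits = np.log(np.array([0.6, 0.3, 0.1]))  # a scorer estimating P(y | x)
assert np.argmax(logits) == 0               # unadjusted prediction: head class
assert post_hoc_adjust(logits, priors) == 2 # adjustment favours the tail class
```

Sweeping `tau` over a grid, as in Figure 9, then amounts to re-running `post_hoc_adjust` on held-out logits with different values of the scaling parameter.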

D.4 RESULTS ON SYNTHETIC DATA WITH VARYING IMBALANCE RATIO

Figure 10 shows results on the synthetic data of §6.1 for varying choices of P(y = +1). As expected, we see that as P(y = +1) increases, all methods become more equitable in terms of performance.

E DOES WEIGHT NORMALISATION INCREASE MARGINS?

Suppose that one uses SGD with momentum, and finds solutions where $\|w_y\|_2$ tracks the class priors. One intuition behind normalisation of the weights is that, drawing inspiration from the binary case, this ought to increase the classification margins for tail classes. Unfortunately, as discussed below, this intuition is not necessarily borne out.

Consider a scorer $f_y(x) = w_y^\top \Phi(x)$, where $w_y \in \mathbb{R}^d$ and $\Phi \colon \mathcal{X} \to \mathbb{R}^d$. The functional margin for an example $(x, y)$ is (Koltchinskii et al., 2001)
$$\gamma_f(x, y) \doteq w_y^\top \Phi(x) - \max_{y' \neq y} w_{y'}^\top \Phi(x). \tag{16}$$
This generalises the classical binary margin, wherein by convention $\mathcal{Y} = \{\pm 1\}$, $w_{-1} = -w_1$, and
$$\gamma_f(x, y) \doteq y \cdot w_1^\top \Phi(x) = \frac{1}{2} \cdot \left( w_y^\top \Phi(x) - w_{-y}^\top \Phi(x) \right),$$
which agrees with (16) up to scaling. One may also define the geometric margin in the binary case to be the distance of $(x, y)$ from the classifier:
$$\gamma_{g,b}(x) \doteq \frac{|w_1^\top \Phi(x)|}{\|w_1\|_2}. \tag{17}$$
Clearly, $\gamma_{g,b}(x) = \frac{|\gamma_f(x, y)|}{\|w_1\|_2}$, and so for a fixed functional margin, one may increase the geometric margin by minimising $\|w_1\|_2$. However, the same is not necessarily true in the multiclass setting, since here the functional and geometric margins do not generally align (Tatsumi et al., 2011; Tatsumi & Tanino, 2014). In particular, controlling each $\|w_y\|_2$ does not necessarily control the geometric margin.

F BAYES-OPTIMAL CLASSIFIER UNDER GAUSSIAN CLASS-CONDITIONALS

Derivation of (12). Suppose
$$\mathbb{P}(x \mid y) = \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp\left( -\frac{\|x - \mu_y\|^2}{2\sigma^2} \right)$$
for suitable $\mu_y$ and $\sigma$. Then,
$$\mathbb{P}(x \mid y = +1) > \mathbb{P}(x \mid y = -1) \iff \exp\left( -\frac{\|x - \mu_{+1}\|^2}{2\sigma^2} \right) > \exp\left( -\frac{\|x - \mu_{-1}\|^2}{2\sigma^2} \right) \iff \|x - \mu_{+1}\|^2 < \|x - \mu_{-1}\|^2 \iff 2 \cdot (\mu_{+1} - \mu_{-1})^\top x > \|\mu_{+1}\|^2 - \|\mu_{-1}\|^2.$$
Now use the fact that in our setting, $\|\mu_{+1}\| = \|\mu_{-1}\|$.

We now explicate that the class-probability function for the synthetic dataset in §6.1 is exactly in the family assumed by logistic regression. This implies that logistic regression is well-specified for this problem, and thus can perfectly model $\mathbb{P}(y = +1 \mid x)$ in the infinite-sample limit.
Note that
$$\mathbb{P}(y = +1 \mid x) = \frac{\mathbb{P}(x \mid y = +1) \cdot \mathbb{P}(y = +1)}{\sum_{y'} \mathbb{P}(x \mid y') \cdot \mathbb{P}(y')} = \frac{1}{1 + \exp(-w_*^\top x + b_*)},$$
where $w_* = \frac{1}{\sigma^2} \cdot (\mu_{+1} - \mu_{-1})$ and $b_* = \log \frac{\mathbb{P}(y = -1)}{\mathbb{P}(y = +1)}$. This implies that a sigmoid model for $\mathbb{P}(y = +1 \mid x)$, as employed by logistic regression, is well-specified for the problem. Further, the bias term $b_*$ is seen to take the form of the log-odds of the class priors per (8), as expected.
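This identity is easy to check numerically. The sketch below (our own toy setup, taking $\mu_{-1} = -\mu_{+1}$ so that the means have equal norm, as in the synthetic data of §6.1) compares the posterior computed directly from the Gaussian class-conditionals against the sigmoid form:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 1.5
mu = rng.normal(size=d)
mu_pos, mu_neg = mu, -mu               # equal norms, as in Section 6.1
pi_pos = 0.05                          # rare positive class

w_star = (mu_pos - mu_neg) / sigma**2
b_star = np.log((1 - pi_pos) / pi_pos) # log-odds of the class priors

def posterior_direct(x):
    """P(y = +1 | x) computed directly from the Gaussian class-conditionals.

    The Gaussian normaliser cancels in the ratio, so it is omitted.
    """
    like = lambda m: np.exp(-np.sum((x - m) ** 2) / (2 * sigma**2))
    num = like(mu_pos) * pi_pos
    return num / (num + like(mu_neg) * (1 - pi_pos))

def posterior_sigmoid(x):
    """The same posterior via the well-specified sigmoid model."""
    return 1.0 / (1.0 + np.exp(-w_star @ x + b_star))

x = rng.normal(size=d)
assert abs(posterior_direct(x) - posterior_sigmoid(x)) < 1e-9
```

The agreement holds for any test point, since the sigmoid model is exact (not merely approximate) for this data distribution.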



Footnotes:
1. Both the misclassification and balanced error compare the top-1 predicted versus true label. One may analogously define a balanced top-k error (Lapin et al., 2018), which may be useful in retrieval settings.
2. Compared to the multiclass case, we assume here a scalar score $f \in \mathbb{R}$. This is equivalent to constraining that $\sum_{y \in [L]} f_y = 0$ in the multiclass case.
3. Qualitatively similar results may be achieved with other schedules, which involve training for fewer steps. For example, one may train for 256 epochs with a base learning rate of 0.4, a linear warmup for the first 15 epochs, and a decay of 0.1 at the 96th, 192nd, and 224th epochs.



Figure 2: Results on synthetic binary classification problem. Our logit adjusted loss tracks the Bayes-optimal solution and separator (left & middle panel). Post-hoc logit adjustment matches the Bayes performance with suitable scaling (right panel); however, any weight normalisation fails.

Figure 3: Comparison of balanced error for post-hoc correction techniques when varying scaling parameter τ (c.f. (3), (9)). Post-hoc logit adjustment consistently outperforms weight normalisation.

Figure 4: Comparison of per-group errors for loss modification techniques. We construct three groups of classes: "Many", comprising those with at least 100 training examples; "Medium", comprising those with at least 20 and at most 100 training examples; and "Few", comprising those with at most 20 training examples.

specific margin modification. Note further that the choices of Cao et al. (2019) and Tan et al. (2020) do not meet the requirements of Lemma 3.


Figure 5: Comparison of link functions for various losses assuming π = 0.2, with γ = 1 (left) and γ = 8 (right). The balanced loss uses $\omega_y = \frac{1}{\pi_y}$. The unequal margin loss uses $\delta_y = \frac{1}{\gamma} \cdot \log \frac{1 - \pi_y}{\pi_y}$. The balanced + margin loss uses $\delta_{-1} = \frac{\pi}{1 - \pi}$, $\delta_{+1} = 1$, $\omega_{+1} = \frac{1}{\pi}$.

Figure 7: Comparison of conditional Bayes risk functions for various losses assuming π = 0.2, with γ = 1 (left) and γ = 8 (right). The balanced loss uses $\omega_y = \frac{1}{\pi_y}$. The unequal margin loss uses $\delta_y = \frac{1}{\gamma} \cdot \log \frac{1 - \pi_y}{\pi_y}$. The first balanced + margin loss uses $\delta_{-1} = \pi$, $\delta_{+1} = 1$, $\omega_{+1} = \frac{1}{\pi}$. The second balanced + margin loss uses $\delta_{-1} = \frac{\pi}{1 - \pi}$, $\delta_{+1} = 1$, $\omega_{+1} = \frac{1}{\pi}$.

Figure 9: Post-hoc adjustment on STEP-100 profile, CIFAR-10-LT and CIFAR-100-LT. Logit adjustment outperforms weight normalisation with suitable tuning.

Figure 10: Results on synthetic data with varying imbalance ratio.


is also a promising avenue. While further exploring such variants is of empirical interest, we hope to have illustrated the conceptual and empirical value of logit adjustment, and leave this for future work.

X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tailed training data. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5419–5428, 2017.

Zhi-Hua Zhou and Xu-Ying Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(1), 2006.

Table 3: Test set balanced error (averaged over 5 trials) on real-world datasets with more complex base architectures. Employing a ResNet-152 systematically improves all methods' performance, with logit adjustment remaining superior to existing approaches. The final row reports the results of combining logit adjustment with the adaptive margin loss of Cao et al. (2019), which yields further gains on iNaturalist.



