LONG-TAIL LEARNING VIA LOGIT ADJUSTMENT

Abstract

Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels have only a few associated samples. This poses a challenge for generalisation on such labels, and also makes naïve learning biased towards dominant labels. In this paper, we present a statistical framework that unifies and generalises several recent proposals to cope with these challenges. Our framework revisits the classic idea of logit adjustment based on the label frequencies, which encourages a large relative margin between logits of rare positive versus dominant negative labels. This yields two techniques for long-tail learning, where such adjustment is either applied post-hoc to a trained model, or enforced in the loss during training. These techniques are statistically grounded, and practically effective on four real-world datasets with long-tailed label distributions.

1. INTRODUCTION

Real-world classification problems typically exhibit a long-tailed label distribution, wherein most labels are associated with only a few samples (Van Horn & Perona, 2017; Buda et al., 2017; Liu et al., 2019). Owing to this paucity of samples, generalisation on such labels is challenging; moreover, naïve learning on such data is susceptible to an undesirable bias towards dominant labels. This problem has been widely studied in the literature on learning under class imbalance (Kubat et al., 1997; Chawla et al., 2002; He & Garcia, 2009), and the related problem of cost-sensitive learning (Elkan, 2001).

Recently, long-tail learning has received renewed interest in the context of neural networks. Two active strands of work involve post-hoc normalisation of the classification weights (Zhang et al., 2019; Kim & Kim, 2019; Kang et al., 2020; Ye et al., 2020), and modification of the underlying loss to account for varying class penalties (Zhang et al., 2017; Cui et al., 2019; Cao et al., 2019; Tan et al., 2020). Each of these strands is intuitive, and has proven empirically successful. However, they are not without limitation: e.g., weight normalisation crucially relies on the weight norms being smaller for rare classes, an assumption that is sensitive to the choice of optimiser (see §2.1). On the other hand, loss modification sacrifices the consistency that underpins the canonical softmax cross-entropy (see §5.1). Consequently, such techniques may prove suboptimal even in simple settings (see §6.1).

In this paper, we establish a statistical framework for long-tail learning that offers a unified view of post-hoc normalisation and loss modification techniques, while overcoming their limitations. Our framework revisits the classic idea of logit adjustment based on label frequencies (Provost, 2000; Zhou & Liu, 2006; Collell et al., 2016), which encourages a large relative margin between a pair of rare positive and dominant negative labels.
Such adjustment can be achieved by shifting the learned logits post-hoc, or by augmenting the softmax cross-entropy with a pairwise label margin (cf. (11)). While similar in nature to recent techniques, our logit adjustment approaches additionally have a firm statistical grounding: they are Fisher consistent for minimising the balanced error (cf. (2)), a common metric in long-tail settings which averages the per-class errors. This statistical grounding translates into strong empirical performance on four real-world datasets with long-tailed label distributions. In summary, our contributions are:
(i) we establish a statistical framework for long-tail learning (§3) based on logit adjustment that provides a unified view of post-hoc correction and loss modification;
(ii) we present two realisations of logit adjustment, applied either post-hoc (§4.1) or during training (§5.1); unlike recent proposals (Table 1), these are consistent for minimising the balanced error;
(iii) we confirm the efficacy of the proposed logit adjustment techniques compared to several baselines on four real-world datasets with long-tailed label distributions (§6).
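To make the post-hoc variant concrete, the following is a minimal sketch (not the paper's exact recipe): subtract a scaled log-prior from each logit before taking the argmax, so rare labels receive a boost relative to dominant ones. The function name and the scaling parameter `tau` are illustrative choices, not identifiers from the paper.

```python
import numpy as np

def posthoc_logit_adjust(logits, class_priors, tau=1.0):
    """Shift each logit down by tau * log(prior) before the argmax.

    logits: (N, L) array of model scores f_y(x).
    class_priors: (L,) array of empirical label frequencies P(y).
    """
    return logits - tau * np.log(class_priors)

# Toy example: class 0 is dominant, class 1 is rare, and their raw logits
# are nearly tied; adjustment flips the prediction to the rare class.
logits = np.array([[2.0, 1.9]])
priors = np.array([0.99, 0.01])
print(np.argmax(logits, axis=1))                                 # plain argmax: class 0
print(np.argmax(posthoc_logit_adjust(logits, priors), axis=1))   # adjusted: class 1
```

The adjustment is applied only at prediction time, so it requires no retraining of the underlying model.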

2. PROBLEM SETUP AND RELATED WORK

Consider a multiclass classification problem with instances X and labels Y = [L] ≐ {1, 2, ..., L}. Given a sample S = {(x_n, y_n)}_{n=1}^N ∼ P^N for unknown distribution P over X × Y, our goal is to learn a scorer f : X → R^L that minimises the misclassification error P_{x,y}(y ∉ argmax_{y' ∈ Y} f_{y'}(x)). Typically, one minimises a surrogate loss ℓ : Y × R^L → R, such as the softmax cross-entropy

    \ell(y, f(x)) = \log\Big[ \sum_{y' \in [L]} e^{f_{y'}(x)} \Big] - f_y(x) = \log\Big[ 1 + \sum_{y' \neq y} e^{f_{y'}(x) - f_y(x)} \Big].    (1)

We may view the resulting softmax probabilities p_y(x) ∝ e^{f_y(x)} as estimates of P(y | x).

The setting of learning under class imbalance or long-tail learning is where the distribution P(y) is highly skewed, so that many rare (or "tail") labels have a low probability of occurrence. Here, the misclassification error is not a suitable measure of performance: a trivial predictor which classifies every instance to the majority label will attain a low misclassification error. To cope with this, a natural alternative is the balanced error (Chan & Stolfo, 1998; Brodersen et al., 2010; Menon et al., 2013), which averages each of the per-class error rates (equivalently, the misclassification error under a uniform label distribution):

    \mathrm{BER}(f) \doteq \frac{1}{L} \sum_{y \in [L]} P_{x|y}\big( y \notin \mathrm{argmax}_{y' \in Y} f_{y'}(x) \big).    (2)

This can be seen as implicitly using a balanced class-probability function P^{bal}(y | x) ∝ (1/L) · P(x | y),¹ as opposed to the native P(y | x) ∝ P(y) · P(x | y) that is employed in the misclassification error.
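The two quantities above can be sketched in a few lines of numpy; this is an illustrative implementation of Eq. (1) and the balanced error, not code from the paper, and the function names are my own.

```python
import numpy as np

def softmax_ce(logits, y):
    """Eq. (1): log-sum-exp of all logits minus the true-label logit."""
    return np.log(np.exp(logits).sum(axis=1)) - logits[np.arange(len(y)), y]

def balanced_error(y_true, y_pred, num_classes):
    """Balanced error: the average of the per-class error rates."""
    per_class = [np.mean(y_pred[y_true == c] != c) for c in range(num_classes)]
    return float(np.mean(per_class))

# A majority-class predictor on skewed data: low misclassification error,
# but high balanced error, since the rare class is always wrong.
y_true = np.array([0, 0, 0, 1])
y_pred = np.zeros(4, dtype=int)
print(np.mean(y_pred != y_true))          # misclassification error: 0.25
print(balanced_error(y_true, y_pred, 2))  # balanced error: 0.5
```

The gap between the two printed numbers is exactly the failure mode described above: the trivial predictor looks good under the misclassification error but not under the balanced error.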
Broadly, extant approaches to coping with class imbalance modify: (i) the inputs to a model, for example by over- or under-sampling (Kubat & Matwin, 1997; Chawla et al., 2002; Wallace et al., 2011; Mikolov et al., 2013; Mahajan et al., 2018; Yin et al., 2018); (ii) the outputs of a model, for example by post-hoc correction of the decision threshold (Fawcett & Provost, 1996; Collell et al., 2016) or weights (Kim & Kim, 2019; Kang et al., 2020); or (iii) the training procedure of a model, for example by modifying the loss function (Zhang et al., 2017; Cui et al., 2019; Cao et al., 2019; Tan et al., 2020; Jamal et al., 2020). One may easily combine approaches from the first stream with those from the latter two. Consequently, we focus on the latter two in this work, and describe some representative recent examples from each.

Post-hoc weight normalisation. Suppose f_y(x) = w_y^⊤ Φ(x) for classification weights w_y ∈ R^D and representations Φ : X → R^D, as learned by a neural network. (We may add per-label bias terms



¹ Both the misclassification and balanced error compare the top-1 predicted versus true label. One may analogously define a balanced top-k error (Lapin et al., 2018), which may be useful in retrieval settings.
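One plausible instantiation of the balanced top-k error mentioned in the footnote is the following sketch (the function is my own illustration, not code from the paper or from Lapin et al.): count a sample as correct if its true label is among the k highest-scoring labels, then average the per-class error rates.

```python
import numpy as np

def balanced_topk_error(logits, y_true, k, num_classes):
    """Per-class top-k error rates, averaged uniformly over classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]        # k highest-scoring labels
    hit = (topk == y_true[:, None]).any(axis=1)      # true label within top-k?
    per_class = [np.mean(~hit[y_true == c]) for c in range(num_classes)]
    return float(np.mean(per_class))

# Three samples, three classes; with k=2 only the second sample misses.
logits = np.array([[3., 2., 1.], [1., 3., 2.], [1., 2., 3.]])
y = np.array([1, 0, 2])
print(balanced_topk_error(logits, y, k=2, num_classes=3))  # one class always missed
```

Setting k = 1 recovers the balanced (top-1) error of the main text.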



Table 1: Comparison of approaches to long-tail learning. Weight normalisation re-scales the classification weights; by contrast, we add per-label offsets to the logits. Margin approaches uniformly increase the margin between a rare positive and all negatives (Cao et al., 2019), or decrease the margin between all positives and a rare negative (Tan et al., 2020) to prevent rare labels' gradient suppression. By contrast, we increase the margin between a rare positive and a dominant negative.
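The "margin between a rare positive and a dominant negative" described in the caption can be realised by adding label-prior offsets inside the softmax cross-entropy. The following is a hedged sketch of that idea, assuming offsets of the form tau·log(pi_y) with pi the empirical class frequencies; the function name and parameterisation are illustrative, not taken verbatim from the paper.

```python
import numpy as np

def logit_adjusted_ce(logits, y, class_priors, tau=1.0):
    """Softmax cross-entropy with per-label offsets tau*log(pi_y') added to the
    logits. This is equivalent to a pairwise margin of tau*log(pi_y'/pi_y)
    between a positive label y and each negative y'."""
    adjusted = logits + tau * np.log(class_priors)             # broadcast over batch
    adjusted = adjusted - adjusted.max(axis=1, keepdims=True)  # numerical stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y]

# With tied logits, a rare true label (small prior) incurs a much larger loss
# than under the plain cross-entropy, forcing a wider learned margin.
logits = np.array([[1.0, 1.0]])
priors = np.array([0.99, 0.01])
print(logit_adjusted_ce(logits, np.array([1]), priors))          # large loss
print(logit_adjusted_ce(logits, np.array([1]), priors, tau=0.0)) # plain CE, log 2
```

With tau = 0 the offsets vanish and the loss reduces to the standard softmax cross-entropy, so the adjustment can be tuned continuously between the two.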

