A CURRICULUM PERSPECTIVE TO ROBUST LOSS FUNCTIONS

Anonymous authors
Paper under double-blind review

Abstract

Learning with noisy labels is a fundamental problem in machine learning. Much work has been devoted to designing loss functions that are theoretically robust against label noise. However, it remains unclear why robust loss functions can underfit and why loss functions deviating from theoretical robustness conditions can appear robust. To elucidate these questions, we show that most robust loss functions differ only in the sample-weighting curriculums they implicitly define. The curriculum perspective enables a straightforward analysis of the training dynamics under each loss function, which existing theoretical approaches do not consider. We show that underfitting can be attributed to marginal sample weights during training, and noise robustness can be attributed to larger weights for clean samples than for noisy ones. With a simple fix to the curriculums, robust loss functions that severely underfit become competitive with the state-of-the-art.¹

1. INTRODUCTION

Labeling errors arise non-negligibly in automatic annotation (Liu et al., 2021; Khayrallah & Koehn, 2018), crowd-sourcing (Russakovsky et al., 2015), and expert annotation (Kato & Matsubara, 2010; Bridge et al., 2016). The resulting noisy labels may hamper generalization, since over-parameterized neural networks can memorize the training set (Zhang et al., 2017). To combat the adverse impact of noisy labels in classification tasks, a large body of research (Song et al., 2020) aims to design loss functions robust against label noise. Most existing approaches derive sufficient conditions (Ghosh et al., 2017; Zhou et al., 2021b) for noise robustness. Despite the theoretical appeal of being agnostic to models and training dynamics², such conditions may fail to comprehensively characterize the performance of robust loss functions. Specifically, it has been shown that (1) robust loss functions can underfit difficult tasks (Zhang & Sabuncu, 2018; Wang et al., 2019c; Ma et al., 2020), while (2) loss functions violating existing robustness conditions (Zhang & Sabuncu, 2018; Wang et al., 2019c;b) can exhibit robustness. For (1), existing explanations (Ma et al., 2020; Wang et al., 2019a) can be limited, as discussed in §2.2. For (2), to our knowledge, no work has directly addressed it.

We analyze the training dynamics under various loss functions to elucidate the above observations, complementing existing theoretical approaches. Specifically, we rewrite a broad array of loss functions into a standard form with the same implicit loss function and varied sample-weighting functions (§3), each implicitly defining a sample-weighting curriculum. The interaction between the sample-weighting function and the distribution of implicit losses of samples thus reveals aspects of the training dynamics under each loss function.
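As a concrete illustration of this standard form (a sketch using two textbook losses of our own choosing, not the decomposition from the paper's later sections): for a loss written as a function of the target probability p_y, the gradient with respect to the class scores factors into the cross-entropy gradient times a scalar sample weight. For MAE, L = 1 − p_y, that weight is exactly p_y, so MAE implicitly down-weights low-confidence samples relative to CE. A finite-difference check confirms the factorization:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def grad_ce(s, y):
    # Gradient of cross entropy -log p_y w.r.t. the scores s: p - onehot(y).
    g = softmax(s)
    g[y] -= 1.0
    return g

def mae(s, y):
    # MAE as a function of the target probability: 1 - p_y.
    return 1.0 - softmax(s)[y]

def grad_mae(s, y):
    # Chain rule: d(1 - p_y)/ds = p_y * (p - onehot(y)),
    # i.e., the CE gradient re-weighted by the sample weight p_y.
    return softmax(s)[y] * grad_ce(s, y)

s, y, eps = np.array([2.0, 0.5, -1.0]), 0, 1e-6
# Verify the closed-form MAE gradient against central finite differences.
num_grad = np.array([
    (mae(s + eps * np.eye(3)[j], y) - mae(s - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
assert np.allclose(num_grad, grad_mae(s, y), atol=1e-6)
```

The componentwise ratio grad_mae / grad_ce equals p_y everywhere, which is the "sample weight" that the curriculum perspective tracks over training.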
Here a curriculum, by definition (Wang et al., 2020), specifies a sequence of re-weightings of the distribution of training samples, e.g., sample weighting (Chang et al., 2017) or sample selection (Zhou et al., 2021a), based on a metric of sample difficulty. Notably, our curriculum perspective is the first to connect robust loss functions to the seemingly distinct curriculum learning approaches (Song et al., 2020) for noise-robust training.

With our curriculum perspective, we first attribute the underfitting issue of robust loss functions to marginal sample weights during training (§4.1). In particular, for classification tasks with numerous classes, the initial sample weights under the curriculums of some robust loss functions can become marginal. When the curriculums are modified accordingly, robust loss functions that severely underfit become competitive with the state-of-the-art.

We then attribute the noise robustness of loss functions to larger sample weights for clean samples than for noisy ones during training (§4.2). By examining the changes of implicit losses during training, we find that the dynamics of SGD suppress the learning of noisy samples. The curriculums of robust loss functions further suppress the learning of noisy samples by magnifying the difference in learning pace between clean and noisy samples while neglecting unlearned noisy samples. Based on our analysis, we present two phenomena that are unexpected in light of existing theoretical results: by simply changing the learning rate schedule, (1) robust loss functions can become vulnerable to label noise, while (2) cross entropy can appear robust.

2. BACKGROUND

Classification. k-ary classification with input x ∈ R^d can be solved by the classifier arg max_i s_i, where s_i is the score for the i-th class under the class scoring function s : R^d → R^k parameterized by θ. Class scores s(x; θ) can be turned into class probabilities with the softmax function p_i = e^{s_i} / Σ_{j=1}^k e^{s_j}, where p_i is the probability of class i. Given a loss function L(s(x; θ), y) and data (x, y) with ground-truth label y ∈ {1..k}, θ can be estimated by risk minimization arg min_θ E_{x,y} L(s(x; θ), y), whose solutions are called risk minimizers. We write s in place of s(x; θ) for notational simplicity.

Noise robustness. Mistakes in the labeling process can corrupt the clean label y into a noisy label

ỹ = y with probability P(ỹ = y | x, y), and ỹ = i (i ≠ y) with probability P(ỹ = i | x, y).

Label noise is symmetric (or uniform) if P(ỹ = i | x, y) = η/(k − 1) for all i ≠ y, with the noise rate η = P(ỹ ≠ y) a constant. Label noise is asymmetric (or class-conditional) if P(ỹ = i | x, y) = P(ỹ = i | y). Given data (x, ỹ) with noisy labels, a loss function L is robust against label noise if

arg min_θ E_{x,ỹ} L(s(x; θ), ỹ) = arg min_θ E_{x,y} L(s(x; θ), y).   (1)

Conditions for noise robustness. Most existing approaches to robust loss functions (Ghosh et al., 2017; Ma et al., 2020; Liu & Guo, 2020; Feng et al., 2020; Zhou et al., 2021b) focus on bounding the difference between risk minimizers obtained with noisy and clean data, i.e., ensuring that Eq. (1) approximately holds. These bounds depend only on the loss functions and mild assumptions about the dataset. To contrast our curriculum perspective with these approaches, we review two typical sufficient conditions for noise robustness. A loss function L is symmetric (Ghosh et al., 2017) if Σ_{i=1}^k L(s, i) = C for all s ∈ R^k, where C is a constant. A symmetric loss is robust against symmetric label noise with η < (k − 1)/k. This stringent condition was later relaxed to the asymmetric condition.
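The symmetric condition is easy to check numerically. A minimal sketch, assuming MAE is written as L(s, i) = 1 − p_i (a standard form in this literature, not quoted from the paper's Table 1): summing MAE over all k labels always gives the constant C = k − 1, whereas the analogous sum for cross entropy varies with the scores s.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def mae(s, i):
    # MAE on the softmax probability of label i: 1 - p_i (symmetric).
    return 1.0 - softmax(s)[i]

def ce(s, i):
    # Cross entropy, which violates the symmetric condition.
    return -np.log(softmax(s)[i])

k = 5
rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=k), rng.normal(size=k)
# MAE: the sum over all k labels is the constant C = k - 1 for any scores,
# since sum_i (1 - p_i) = k - sum_i p_i = k - 1.
assert np.isclose(sum(mae(s1, i) for i in range(k)), k - 1)
assert np.isclose(sum(mae(s2, i) for i in range(k)), k - 1)
# CE: the analogous sum depends on s, so CE is not symmetric.
assert not np.isclose(sum(ce(s1, i) for i in range(k)),
                      sum(ce(s2, i) for i in range(k)))
```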
To rephrase, a loss function of the softmax probability p_i, i.e., L(s, i) = l(p_i), is asymmetric (Zhou et al., 2021b) if

max_{i ≠ y} P(ỹ = i | x, y) / P(ỹ = y | x, y) = r ≤ r_* = inf_{0 ≤ p_i, Δp ≤ 1, p_i + Δp ≤ 1} [l(p_i) − l(p_i + Δp)] / [l(0) − l(Δp)],

where Δp is a valid increment of p_i. When clean labels dominate the data, i.e., r < 1, an asymmetric loss is robust against generic label noise.

The active-passive dichotomy. Ma et al. (2020) draw a distinction between active and passive loss functions. Rewriting a loss function L as a sum of basic functions, L(s, y) = Σ_{i=1}^k l(s, i), active loss functions are defined by l(s, i) = 0 for all i ≠ y, emphasizing learning of the target label. In contrast, passive loss functions, defined by l(s, i) ≠ 0 for some i ≠ y, can also be minimized by unlearning the non-target labels. However, since there is no canonical guideline for specifying l(s, i), different specifications can lead to ambiguities in the active-passive dichotomy, as discussed in §2.2. In summary, the research above leaves the open questions in §2.2 unaddressed. Since many loss functions degrade less than cross entropy under label noise, exhibiting various degrees of noise robustness, with a slight abuse of terminology we refer to them as robust loss functions hereafter.
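The threshold r_* can be estimated on a grid. A sketch under the assumption l(p) = 1 − p (MAE): the ratio [l(p) − l(p + Δp)] / [l(0) − l(Δp)] reduces to Δp/Δp = 1 everywhere, so r_* = 1 and MAE satisfies the asymmetric condition whenever clean labels dominate (r < 1).

```python
import numpy as np

# Numerically estimate r_* = inf over (p, dp) of
#   [l(p) - l(p + dp)] / [l(0) - l(dp)]
# for MAE, where l(p) = 1 - p. dp is kept strictly positive so the
# denominator l(0) - l(dp) = dp never vanishes.
l = lambda p: 1.0 - p

ratios = []
for p in np.linspace(0.0, 0.99, 50):
    for dp in np.linspace(0.01, 1.0 - p, 50):
        ratios.append((l(p) - l(p + dp)) / (l(0.0) - l(dp)))

r_star = min(ratios)
# For MAE the ratio is identically 1, so the infimum is 1.
assert np.isclose(r_star, 1.0)
```

The same grid estimate applied to cross entropy, l(p) = −log p, drives the ratio toward 0 (the denominator l(0) − l(Δp) diverges), which is one way to see that CE fails the asymmetric condition.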

2.1. TYPICAL ROBUST LOSS FUNCTIONS

We review the typical robust loss functions used in our analysis, in addition to cross entropy (CE), which is vulnerable to label noise (Ghosh et al., 2017). Differences in constant scaling factors and additive biases are ignored, as they are either equivalent to learning-rate scaling in SGD or irrelevant to the gradient computation. See Table 1 for the formulas and Appendix A for an extended review of loss functions.
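Since Table 1 is not reproduced in this excerpt, the following sketch uses the standard textbook forms of three losses commonly compared in this literature, written as functions of the target probability p_y; treat the exact formulas and hyperparameters as assumptions, with Table 1 as the authoritative source. Generalized cross entropy (GCE; Zhang & Sabuncu, 2018) interpolates between CE and MAE via its exponent q:

```python
import numpy as np

def ce(p):
    # Cross entropy: -log p_y.
    return -np.log(p)

def mae(p):
    # MAE: 1 - p_y (up to constant scaling).
    return 1.0 - p

def gce(p, q=0.7):
    # Generalized cross entropy: (1 - p_y^q) / q,
    # recovering MAE at q = 1 and CE in the limit q -> 0.
    return (1.0 - p ** q) / q

p = 0.3
assert np.isclose(gce(p, q=1.0), mae(p))          # q = 1 gives MAE
assert np.isclose(gce(p, q=1e-8), ce(p), atol=1e-6)  # q -> 0 approaches CE
```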



¹ Our code will be available at GitHub.
² Changes of model states during training, beyond trivial metrics such as evaluation metrics and loss values.

