A CURRICULUM PERSPECTIVE TO ROBUST LOSS FUNCTIONS Anonymous authors Paper under double-blind review

Abstract

Learning with noisy labels is a fundamental problem in machine learning. Much work has been done in designing loss functions that are theoretically robust against label noise. However, it remains unclear why robust loss functions can underfit and why loss functions deviating from theoretical robustness conditions can appear robust. To elucidate these questions, we show that most robust loss functions differ only in the sample-weighting curriculums they implicitly define. The curriculum perspective enables straightforward analysis of the training dynamics with each loss function, which has not been considered in existing theoretical approaches. We show that underfitting can be attributed to marginal sample weights during training, and noise robustness can be attributed to larger weights for clean samples than noisy samples. With a simple fix to the curriculums, robust loss functions that severely underfit can become competitive with the state-of-the-art.

1. INTRODUCTION

Labeling errors are non-negligible from automatic annotation (Liu et al., 2021; Khayrallah & Koehn, 2018), crowd-sourcing (Russakovsky et al., 2015) and expert annotation (Kato & Matsubara, 2010; Bridge et al., 2016). The resulting noisy labels may hamper generalization since over-parameterized neural networks can memorize the training set (Zhang et al., 2017). To combat the adverse impact of noisy labels in classification tasks, a large body of research (Song et al., 2020) aims to design loss functions robust against label noise. Most existing approaches derive sufficient conditions (Ghosh et al., 2017; Zhou et al., 2021b) for noise robustness. Despite the theoretical appeal of being agnostic to models and training dynamics, they may fail to comprehensively characterize the performance of robust loss functions. Specifically, it has been shown that (1) robust loss functions can underfit difficult tasks (Zhang & Sabuncu, 2018; Wang et al., 2019c; Ma et al., 2020), while (2) loss functions violating existing robustness conditions (Zhang & Sabuncu, 2018; Wang et al., 2019c; b) can exhibit robustness. For (1), existing explanations (Ma et al., 2020; Wang et al., 2019a) can be limited as discussed in §2.2. For (2), to our knowledge, there has been no work directly addressing it.

We analyze training dynamics with various loss functions to elucidate the above observations, which complements existing theoretical approaches. Specifically, we rewrite a broad array of loss functions into a standard form with the same implicit loss function and varied sample-weighting functions (§3), each implicitly defining a sample-weighting curriculum. The interaction between the sample-weighting function and the distribution of implicit losses of samples thus reveals aspects of the training dynamics with each loss function. Here a curriculum by definition (Wang et al., 2020) specifies a sequence of re-weighting for the distribution of training samples, e.g., sample weighting (Chang et al., 2017) or sample selection (Zhou et al., 2021a), based on a metric for sample difficulty. Notably, our novel curriculum perspective first connects robust loss functions to the seemingly distinct curriculum learning approaches (Song et al., 2020) for noise-robust training.

With our curriculum perspective, we first attribute the underfitting issue of robust loss functions to marginal sample weights during training (§4.1). In particular, for classification tasks with numerous classes, the initial sample weights under the curriculum of some robust loss functions can become marginal. When modifying the curriculums accordingly, robust loss functions that severely underfit can become competitive with the state-of-the-art. We then attribute noise robustness of loss functions to larger sample weights for clean samples than for noisy ones during training (§4.2). By examining the changes of implicit losses during training, we find that dynamics of SGD suppress the learning of noisy samples. Curriculums of robust loss functions further suppress the learning of noisy samples by magnifying the difference in learning pace between clean and noisy samples while neglecting unlearned noisy samples. Based on our analysis, we present two phenomena that are unexpected in light of existing theoretical results. By simply changing the learning rate schedule, (1) robust loss functions can become vulnerable to label noise, while (2) cross entropy can appear robust.

2. BACKGROUND

Classification. $k$-ary classification with input $x \in \mathbb{R}^d$ can be solved by the classifier $\arg\max_i s_i$, where $s_i$ is the score for the $i$-th class in the class scoring function $s: \mathbb{R}^d \to \mathbb{R}^k$ parameterized by $\theta$. Class scores $s(x; \theta)$ can be turned into class probabilities with the softmax function $p_i = e^{s_i} / (\sum_{j=1}^{k} e^{s_j})$, where $p_i$ is the probability of class $i$. Given a loss function $L(s(x; \theta), y)$ and data $(x, y)$ with $y \in \{1..k\}$ the ground truth label, $\theta$ can be estimated by risk minimization $\arg\min_\theta E_{x,y} L(s(x; \theta), y)$, whose solutions are called risk minimizers. We use $s$ in place of $s(x; \theta)$ for notation simplicity.

Noise robustness. Mistakes in the labeling process can corrupt the clean label $y$ into a noisy label

$$\tilde{y} = \begin{cases} y, & \text{with probability } P(\tilde{y} = y \mid x, y) \\ i,\ i \neq y, & \text{with probability } P(\tilde{y} = i \mid x, y) \end{cases}$$

Label noise is symmetric (or uniform) if $P(\tilde{y} = i \mid x, y) = \eta/(k-1), \forall i \neq y$, with $\eta = P(\tilde{y} \neq y)$ the noise rate constant. Label noise is asymmetric (or class-conditional) if $P(\tilde{y} = i \mid x, y) = P(\tilde{y} = i \mid y)$. Given data $(x, \tilde{y})$ with noisy label $\tilde{y}$, a loss function $L$ is robust against label noise if

$$\arg\min_\theta E_{x,\tilde{y}} L(s(x; \theta), \tilde{y}) = \arg\min_\theta E_{x,y} L(s(x; \theta), y) \tag{1}$$

Conditions for noise robustness. Most existing approaches on robust loss functions (Ghosh et al., 2017; Ma et al., 2020; Liu & Guo, 2020; Feng et al., 2020; Zhou et al., 2021b) focus on bounding the difference between risk minimizers obtained with noisy and clean data, i.e., ensuring that Eq. (1) approximately holds. These bounds only depend on the loss functions and mild assumptions about the dataset. To contrast our curriculum perspective with these approaches, we review two typical sufficient conditions for noise robustness. Loss function $L$ is symmetric (Ghosh et al., 2017) if

$$\sum_{i=1}^{k} L(s, i) = C, \quad \forall s \in \mathbb{R}^k, \tag{2}$$

where $C$ is a constant. It is robust against symmetric label noise with $\eta < (k-1)/k$. This stringent condition was later relaxed to the asymmetric condition. To rephrase, a loss function as a function of the softmax probability $p_i$, i.e., $L(s, i) = l(p_i)$, is asymmetric (Zhou et al., 2021b) if

$$\max_{i \neq y} \frac{P(\tilde{y} = i \mid x, y)}{P(\tilde{y} = y \mid x, y)} = \tilde{r} \leq r = \inf_{\substack{0 \leq p_i, \Delta p \leq 1 \\ p_i + \Delta p \leq 1}} \frac{l(p_i) - l(p_i + \Delta p)}{l(0) - l(\Delta p)}, \tag{3}$$

where $\Delta p$ is a valid increment of $p_i$. When clean labels dominate the data, i.e., $\tilde{r} < 1$, an asymmetric loss is robust against generic label noise.

The active-passive dichotomy. Ma et al. (2020) draw a distinction between active and passive loss functions. By rewriting loss function $L$ into a sum of basic functions, $L(s, y) = \sum_{i=1}^{k} l(s, i)$, active loss functions can be defined with $\forall i \neq y, l(s, i) = 0$, which emphasizes learning the target label. In contrast, passive loss functions defined with $\exists i \neq y, l(s, i) \neq 0$ can be improved by unlearning the non-target labels. However, since there is no canonical guideline to specify $l(s, i)$, different specifications can lead to ambiguities in the active-passive dichotomy as discussed in §2.2.

In summary, the above research fails to address the open questions in §2.2. Since many loss functions degrade less in performance than cross entropy under label noise, exhibiting various degrees of noise robustness, as a slight abuse of terminology, we refer to them as robust loss functions hereafter.
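The symmetric condition in Eq. (2) can be checked numerically. The following is a minimal sketch, assuming MAE is taken up to a constant factor as $1 - p_i$ and using hypothetical random score vectors with $k = 10$; the helper names are illustrative and not part of the original work. Summing MAE over all labels should give the constant $k - 1$, whereas the same sum for CE varies with the scores.

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.exp(s - np.max(s))
    return z / z.sum()

def mae_loss(s, i):
    """MAE up to a constant factor: 1 - p_i."""
    return 1.0 - softmax(s)[i]

def ce_loss(s, i):
    """Cross entropy: -log p_i."""
    return -np.log(softmax(s)[i])

def sum_over_labels(loss, s):
    """Left-hand side of the symmetric condition: sum_i L(s, i)."""
    return sum(loss(s, i) for i in range(len(s)))

# Hypothetical random score vectors for k = 10 classes.
rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 10))

mae_sums = np.array([sum_over_labels(mae_loss, s) for s in scores])
ce_sums = np.array([sum_over_labels(ce_loss, s) for s in scores])

print("MAE sums:", mae_sums)  # the same constant for every score vector
print("CE sums: ", ce_sums)   # changes with the scores
```

The check only illustrates the definition; it says nothing about robustness under a finite training budget, which is the subject of the later sections.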

2.1. TYPICAL ROBUST LOSS FUNCTIONS

We review typical robust loss functions for our analysis besides cross entropy (CE) that is vulnerable to label noise (Ghosh et al., 2017) . Differences in constant scaling factors and additive biases are ignored, as they are either equivalent to learning rate scaling in SGD or irrelevant in the gradient computation. See Table 1 for the formulas and Appendix A for an extended review of loss functions.

Symmetric

The mean absolute error (MAE; Ghosh et al., 2017) and the equivalent reverse cross entropy (RCE; Wang et al., 2019c) are both symmetric as they satisfy Eq. (2). Ma et al. (2020) make loss functions satisfying $L(s, i) > 0, \forall i \in \{1..k\}$ symmetric by normalizing them with $L_N(s, y) = L(s, y) / (\sum_{i=1}^{k} L(s, i))$. We include normalized cross entropy (NCE; Ma et al., 2020) as an example.

Asymmetric

We include asymmetric generalized cross entropy (AGCE) and asymmetric unhinged loss (AUL) proposed by Zhou et al. (2021b) as typical asymmetric loss functions. Notably, AGCE with $q \geq 1$ and AUL with $q \leq 1$ are completely asymmetric, i.e., Eq. (3) always holds when clean labels dominate the data, that is, when the noise ratio on the left-hand side of Eq. (3) is below 1.

Combined

Loss functions can be combined for both robustness and sufficient learning. For example, generalized cross entropy (GCE; Zhang & Sabuncu, 2018) is a smooth interpolation between CE and MAE. Alternatively, symmetric cross entropy (SCE; Wang et al., 2019c) is a weighted average of CE and RCE (MAE). Ma et al. (2020) argue that robust and sufficient training requires a balanced combination of active and passive loss functions. According to their active-passive dichotomy, CE and NCE are active while MAE (RCE) is passive. We include NCE+MAE as an example.

2.2. OPEN QUESTIONS

Why do robust loss functions underfit? Ma et al. (2020) attribute underfitting to failure in balancing active-passive components. However, the active-passive dichotomy can be ambiguous. Given

$$L_{\mathrm{MAE}}(s, y) \propto \sum_{i=1}^{k} |I(i = y) - p_i| \propto \sum_{i=1}^{k} I(i = y)(1 - p_i),$$

where $I(\cdot)$ is the indicator function, MAE is passive with $l(s, i) = |I(i = y) - p_i|$ but active with $l(s, i) = I(i = y)(1 - p_i)$. Wang et al. (2019a) view $\|\nabla_s L(s, y)\|_1$ as weights for sample gradients and attribute underfitting to their low variance, making clean and noisy samples less distinguishable. However, as we show in §4.1, MAE also underfits data with clean labels. In summary, neither Ma et al. (2020) nor Wang et al. (2019a) fully explains the underfitting issue.

What affects the robustness of a loss function? Although combined loss functions such as GCE and SCE fail to satisfy Eq. (2) and (3), it is unclear why they exhibit robustness against label noise (Zhang & Sabuncu, 2018; Wang et al., 2019c). Furthermore, it is unclear how training dynamics with loss functions, which are irrelevant in theoretical robustness guarantees (Ghosh et al., 2017; Zhou et al., 2021b; Ma et al., 2020; Liu & Guo, 2020; Feng et al., 2020), affect their noise robustness.

3. IMPLICIT CURRICULUMS OF LOSS FUNCTIONS

Loss functions in Table 1 except NCE and NCE+MAE can be written as a function of the target softmax probability $p_y$, i.e., $L(s, y) = l(p_y)$. A close examination of $p_y$ gives

$$p_y = \frac{e^{s_y}}{\sum_{i=1}^{k} e^{s_i}} = \frac{1}{e^{\log \sum_{i \neq y} e^{s_i} - s_y} + 1} = \frac{1}{e^{-\Delta_y} + 1} = \mathrm{sigmoid}(\Delta_y),$$

where

$$\Delta_y = s_y - \log \sum_{i \neq y} e^{s_i} \leq s_y - \max_{i \neq y} s_i \tag{5}$$

indicates how well a sample is learned, as $\Delta_y \geq 0$ ensures successful classification $y = \arg\max_i s_i$. Loss functions with the form $L(s, y) = l(p_y)$ can thus be rewritten into a standard form with equivalent gradients, i.e.,

$$L(s, y) = l(p_y) = \int_s \nabla_s l(p_y)\, ds = \int_s l'(p_y)\, p'_y(\Delta_y) \cdot \nabla_s \Delta_y\, ds = -w(\Delta_y) \cdot \Delta_y, \tag{6}$$

where $w(\Delta_y) = |l'(p_y)\, p'_y(\Delta_y)|$ is a scalar weight wrapped with the stop-gradient operator, and $\Delta_y$ is an implicit loss function embedded in $L(s, y)$. Loss functions following Eq. (6) thus implicitly define different sample-weighting curriculums, with $w(\Delta_y)$ the sample-weighting function and $\Delta_y$ the metric for sample difficulty. Notably, $\Delta_y$ factors out the preference from $w(\Delta_y)$, making it a more direct metric for sample difficulty than those based on losses (Kumar et al., 2010) or gradient magnitudes (Gopal, 2016). In addition, the interaction between $w(\Delta_y)$ and $\Delta_y$ distributions reveals aspects of the training dynamics with each loss function. See Table 1 for a summary of $w(\Delta_y)$ for the reviewed loss functions, and Appendix A for how hyperparameters affect $w(\Delta_y)$.

| Type | Name | Expression | Sample-weighting function $w$ | Constraints |
|---|---|---|---|---|
| Asym. | AUL | $[(a - p_y)^q - (a - 1)^q]/q$ | $p_y(1 - p_y)(a - p_y)^{q-1}$ | $a > 1, q > 0$ |
| Asym. | AGCE | $[(a + 1) - (a + p_y)^q]/q$ | $p_y(a + p_y)^{q-1}(1 - p_y)$ | $a > 0, q > 0$ |
| Comb. | GCE | $(1 - p_y^q)/q$ | $p_y^q(1 - p_y)$ | $0 < q \leq 1$ |
| Comb. | SCE | $(1 - q) \cdot L_{\mathrm{CE}} + q \cdot L_{\mathrm{MAE}}$ | $(1 - q + q \cdot p_y)(1 - p_y)$ | $0 < q < 1$ |
| Comb. | NCE+MAE | $(1 - q) \cdot L_{\mathrm{NCE}} + q \cdot L_{\mathrm{MAE}}$ | / | $0 < q < 1$ |

Table 1: Expressions, constraints and sample-weighting functions (§3) for loss functions in §2.1.
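The quantities in this section can be sketched in a few lines. The example below is a minimal illustration, assuming the weights follow $w(\Delta_y) = |l'(p_y)\, p'_y(\Delta_y)|$ with $p'_y(\Delta_y) = p_y(1 - p_y)$; the CE and MAE weights are derived here from that formula with $l(p_y) = -\log p_y$ and $l(p_y) = 1 - p_y$, and the function names, the score vector and the value $q = 0.7$ are hypothetical. It checks that $p_y = \mathrm{sigmoid}(\Delta_y)$ and that the CE gradient equals $-w_{\mathrm{CE}}(\Delta_y) \nabla_s \Delta_y$.

```python
import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_y(s, y):
    """Implicit loss: Delta_y = s_y - log sum_{i != y} exp(s_i)."""
    others = np.delete(s, y)
    m = others.max()
    return s[y] - (m + np.log(np.exp(others - m).sum()))

def grad_delta_y(s, y):
    """Gradient of Delta_y w.r.t. s: +1 at y, minus a softmax over the other classes."""
    e = np.exp(s - np.max(s))
    e[y] = 0.0
    g = -e / e.sum()
    g[y] = 1.0
    return g

def w_ce(d):
    """l(p) = -log p  gives  w = 1 - p."""
    return 1.0 - sigmoid(d)

def w_mae(d):
    """l(p) = 1 - p  gives  w = p (1 - p)."""
    p = sigmoid(d)
    return p * (1.0 - p)

def w_gce(d, q=0.7):
    """l(p) = (1 - p^q) / q  gives  w = p^q (1 - p); q = 0.7 is an arbitrary choice."""
    p = sigmoid(d)
    return p**q * (1.0 - p)

# Hypothetical scores for a 4-class problem with target label y = 2.
s = np.array([0.3, -1.2, 2.0, 0.5])
y = 2
d = delta_y(s, y)
p = softmax(s)
grad_ce = p - np.eye(len(s))[y]  # analytic gradient of -log p_y w.r.t. s

print(np.isclose(sigmoid(d), p[y]))
print(np.allclose(grad_ce, -w_ce(d) * grad_delta_y(s, y)))
```

Plotting `w_ce`, `w_mae` and `w_gce` over a range of `d` reproduces the qualitative shapes discussed later: a monotonically decreasing weight for CE and bell-shaped weights for MAE and GCE.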

3.1. THE ADDITIONAL REGULARIZER OF NCE

NCE does not follow Eq. (6) as it additionally depends on $p_i$, $i \neq y$. Yet it can be rewritten into

$$L_{\mathrm{NCE}}(s, y) = \gamma_{\mathrm{NCE}} \cdot L_{\mathrm{CE}}(s, y) + \gamma_{\mathrm{NCE}} \cdot \epsilon_{\mathrm{NCE}} \cdot R_{\mathrm{NCE}}(s), \tag{7}$$

where $\gamma_{\mathrm{NCE}} = 1/(\sum_{i=1}^{k} -\log p_i)$ and $\epsilon_{\mathrm{NCE}} = k(-\log p_y)/(\sum_{i=1}^{k} -\log p_i)$ are scalar weights wrapped with the stop-gradient operator. In Eq. (7), the first term is a primary loss function defining a sample-weighting curriculum similar to CE. The second is a regularizer $R_{\mathrm{NCE}}(s) = \sum_{i=1}^{k} \frac{1}{k} \log p_i$ reducing the entropy of the softmax output. Although the training dynamics of NCE are complicated by the additional regularizer, we can use the upper bound for the L1 norm of the gradient $\nabla_s L_{\mathrm{NCE}}(s, y)$,

$$\hat{w}_{\mathrm{NCE}} = 2\gamma_{\mathrm{NCE}} \cdot w_{\mathrm{CE}}(1 + \epsilon_{\mathrm{NCE}}) \geq \|\nabla_s L_{\mathrm{NCE}}(s, y)\|_1, \tag{8}$$

as the weight of each sample in parameter updates. Examining how $\hat{w}_{\mathrm{NCE}}$ changes during training helps understand why NCE underfits in §4.1. We leave derivations of Eq. (7) and (8) and discussions on similar loss functions with an additional regularizer to Appendix A.5.
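The gradient equivalence behind Eq. (7) can be checked numerically. The sketch below is an illustration under stated assumptions, not the derivation from Appendix A.5: it treats $\gamma_{\mathrm{NCE}}$ and $\epsilon_{\mathrm{NCE}}$ as constants (mimicking the stop-gradient), uses the analytic gradients $\nabla_s L_{\mathrm{CE}} = p - e_y$ and $\nabla_s R_{\mathrm{NCE}} = 1/k - p$, and compares the result with a central-difference gradient of the original NCE. All function names and the random scores are hypothetical.

```python
import numpy as np

def log_softmax(s):
    z = s - np.max(s)
    return z - np.log(np.exp(z).sum())

def nce_loss(s, y):
    """Original NCE: -log p_y / sum_i (-log p_i)."""
    lp = log_softmax(s)
    return (-lp[y]) / (-lp.sum())

def nce_grad_from_eq7(s, y):
    """Gradient of gamma * L_CE + gamma * eps * R_NCE with gamma, eps held constant."""
    lp = log_softmax(s)
    p = np.exp(lp)
    k = len(s)
    gamma = 1.0 / (-lp.sum())
    eps = k * (-lp[y]) * gamma
    grad_ce = p.copy()
    grad_ce[y] -= 1.0          # gradient of -log p_y
    grad_reg = 1.0 / k - p     # gradient of (1/k) * sum_i log p_i
    return gamma * grad_ce + gamma * eps * grad_reg

def numerical_grad(f, s, h=1e-5):
    """Central-difference gradient of a scalar function f at s."""
    g = np.zeros_like(s)
    for j in range(len(s)):
        step = np.zeros_like(s)
        step[j] = h
        g[j] = (f(s + step) - f(s - step)) / (2 * h)
    return g

# Hypothetical scores for k = 6 classes with target label y = 1.
rng = np.random.default_rng(0)
s = rng.normal(size=6)
y = 1
g_eq7 = nce_grad_from_eq7(s, y)
g_num = numerical_grad(lambda v: nce_loss(v, y), s)
print("max abs difference:", np.max(np.abs(g_eq7 - g_num)))
```

The two gradients agree up to finite-difference error, which is what "equivalent gradients" means for the rewritten form.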

4. ROBUST LOSS FUNCTIONS FROM THE CURRICULUM PERSPECTIVE

We examine the interaction between w(∆ y ) and ∆ y distributions to address questions in §2.2. Results are reported on MNIST (Lecun et al., 1998) and CIFAR10/100 (Krizhevsky, 2009) with synthetic symmetric and asymmetric label noise following Ma et al. (2020) ; Zhou et al. (2021b) . For real-world scenarios, we include CIFAR10/100 with human label noise (Wei et al., 2022) and the large-scale noisy dataset WebVision (Li et al., 2017) , which exhibit more complex noise patterns than symmetric and asymmetric label noise. Unlike standard settings, we scale w(∆ y ) to unit maximum to avoid complications, since hyperparameters of loss functions can change the scale of w(∆ y ), essentially adjusting the learning rate of SGD. See Appendix B for more experimental details.

4.1. UNDERSTANDING UNDERFITTING OF ROBUST LOSS FUNCTIONS

Robust loss functions can underfit. We confirm that on difficult tasks like CIFAR100 (Song et al., 2020), underfitting can result from robust loss functions themselves rather than inferior hyperparameters. As shown in Table 2 [...] batch and $\alpha_t$ the learning rate. The overall $\alpha^*_t$ up to step $t$ can be measured by $\bar{\alpha}^*_t = \sum_{i=1}^{t} \alpha^*_i / t$. In Table 2, for loss functions that heavily underfit on CIFAR100, $\bar{\alpha}^*_t$ at the final step is marginal compared to CE, suggesting a marginal overall sample weight during training given the same learning rate schedule.

Table 2: Without label noise, robust loss functions can underfit CIFAR100 but not CIFAR10. Hyperparameters of loss functions are tuned on CIFAR100 and listed in Table 8. We report test accuracy and $\bar{\alpha}^*_t$ (scaled by $10^3$) at the final training step from 3 different runs with learning rate $\alpha = 0.1$. Loss functions with inferior hyperparameters (denoted with †) are included as references. See Table 9 for similar results with learning rate $\alpha = 0.01$.

[Figure 1: [...] Table 2 with hyperparameters in Table 8. We include the initial $\Delta_y$ distributions of CIFAR10 and CIFAR100 for reference, which are obtained by computing $\Delta_y$ with a randomly initialized model for all training samples.]

Underfitting from fast diminishing sample weights. Similar to CE, in Fig. 2a, $\bar{\alpha}^*_t$ of NCE based on $\hat{w}_{\mathrm{NCE}}$ peaks at initialization. However, it decreases much faster than CE since both $\gamma_{\mathrm{NCE}}$ and $w_{\mathrm{CE}}$ decrease with improved $\Delta_y$. In addition, the regularizer $R_{\mathrm{NCE}}(s)$ further reduces the entropy of the softmax output and thus $\gamma_{\mathrm{NCE}}$. The resulting fast decreasing $\hat{w}_{\mathrm{NCE}}$ hampers the learning of training samples, which can lead to underfitting.

Underfitting from marginal initial sample weights. In Fig. 1a, unlike NCE, loss functions that severely underfit in Table 2 assign marginal weights to samples in CIFAR100 at initialization, which leads to marginal initial $\bar{\alpha}^*_t$. $\Delta_y$ of these samples can barely improve before the learning rate vanishes, thus leading to underfitting. In contrast, loss functions with non-trivial initial sample weights (Fig. 1b and 1c) result in moderate or no underfitting. As further corroboration, we plot $\bar{\alpha}^*_t$ of AUL with superior and inferior hyperparameters (AUL and AUL† in Table 2) in Fig. 2b. $\bar{\alpha}^*_t$ stays marginal with AUL†, but quickly increases to a non-negligible value before gradually decreasing with AUL.

Loss combination can mitigate underfitting. As $\hat{w}_{\mathrm{NCE}}$ peaks at initialization but quickly diminishes while $w_{\mathrm{MAE}}$ is marginal at initialization but peaks later during training, combining NCE with MAE can mitigate the underfitting issue of each. In Table 2, the combination of NCE and MAE suffers less from underfitting than either component alone.

Increased number of classes leads to marginal initial sample weights. Unlike on CIFAR100, all loss functions in Table 2 perform equally well on CIFAR10. Such a difference has been vaguely attributed to the increased task difficulty of CIFAR100 (Zhang & Sabuncu, 2018; Song et al., 2020). Intuitively, the more classes, the more subtle differences to be distinguished. In addition, the number of classes $k$ determines the initial distribution of $\Delta_y$. Assume that class scores $s_i$ at initialization are i.i.d. normal variables $s_i \sim \mathcal{N}(\mu, \sigma)$. In particular, $\mu = 0$ and $\sigma = 1$ for most neural networks with standard initializations (Glorot & Bengio, 2010; He et al., 2015) and normalization layers (Ioffe & Szegedy, 2015; Ba et al., 2016). The expected $\Delta_y$ can be approximated with

$$E[\Delta_y] \approx -\log(k - 1) - \sigma^2/2 + \frac{e^{\sigma^2} - 1}{2(k - 1)}.$$

We leave derivations and comparisons between our assumptions and real settings to Appendix C.1. A large $k$ results in small initial $\Delta_y$; with sample-weighting functions in Fig. 1a it further leads to marginal initial sample weights, which results in underfitting on CIFAR100 as discussed previously.
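The approximation of the expected initial $\Delta_y$ can be compared against a Monte Carlo estimate. The sketch below assumes the setting stated in the text ($s_i$ i.i.d. normal with $\mu = 0$, $\sigma = 1$); the function names, sample size and seed are illustrative choices. Since the approximation comes from a truncated expansion, the agreement is expected to be closer for larger $k$.

```python
import numpy as np

def expected_delta_approx(k, sigma=1.0):
    """Closed-form approximation of E[Delta_y] at initialization."""
    return (-np.log(k - 1) - sigma**2 / 2
            + (np.exp(sigma**2) - 1) / (2 * (k - 1)))

def expected_delta_mc(k, sigma=1.0, n=50_000, seed=0):
    """Monte Carlo estimate of E[Delta_y] with i.i.d. N(0, sigma) class scores."""
    rng = np.random.default_rng(seed)
    s = rng.normal(0.0, sigma, size=(n, k))
    # By symmetry the target label can be fixed to class 0.
    others = s[:, 1:]
    m = others.max(axis=1)
    log_sum_exp = m + np.log(np.exp(others - m[:, None]).sum(axis=1))
    return float((s[:, 0] - log_sum_exp).mean())

for k in (10, 100):
    print(k, expected_delta_approx(k), expected_delta_mc(k))
```

Both estimates decrease roughly as $-\log(k - 1)$, which is the dependence on the number of classes that drives the marginal initial weights.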

4.1.1. ADDRESSING UNDERFITTING FROM MARGINAL INITIAL SAMPLE WEIGHTS

Our analysis suggests that the fixed sample-weighting function $w(\Delta_y)$ is to blame for underfitting. To make the initial sample weights agnostic to the number of classes, we can simply scale

$$w^*(\Delta_y) = w(\Delta^*_y) = w\!\left(\frac{\Delta_y}{|E[\Delta_y]|} \cdot \tau\right)$$

or shift

$$w^+(\Delta_y) = w(\Delta^+_y) = w(\Delta_y + |E[\Delta_y]| - \tau)$$

the sample-weighting functions, where $\tau$ is a hyperparameter. Intuitively, $|E[\Delta_y]|$ in $w^*(\Delta_y)$ and $w^+(\Delta_y)$ cancels the effect of $k$ on the weight of the expected initial $\Delta_y$. A small $\tau$ thus leads to high initial sample weights regardless of $k$. In Appendix C.1.1 we visualize $w^*_{\mathrm{MAE}}(\Delta_y)$ and $w^+_{\mathrm{MAE}}(\Delta_y)$ in Fig. 7 and discuss the robustness of the loss functions they induce.

Results on CIFAR100 with different label noise are reported in Table 3. See Tables 11 and 12 in Appendix C.1.1 for results with additional noise rates. We also report results on the large-scale WebVision dataset with different numbers of classes in Table 4. In summary, shifting and scaling alleviate underfitting, making MAE and AGCE comparable to the previous state-of-the-art (NCE+AUL; Zhou et al. 2021b). Although $w^*(\Delta_y)$ and $w^+(\Delta_y)$ are agnostic to the number of classes at initialization, their performances differ significantly. Intuitively, $w^+(\Delta_y)$ diminishes much faster than $w^*(\Delta_y)$ with increased $\Delta_y$, which can lead to insufficient training of clean samples and thus inferior performance.

Table 4: Shifting or scaling $w(\Delta_y)$ mitigates underfitting on WebVision subsampled with different numbers of classes. $k = 50$ is the standard "mini" setting in previous work (Ma et al., 2020; Zhou et al., 2021b). We report test accuracy with a single run due to a limited computation budget.

Table 5: Robust loss functions assign larger weights to clean samples. We report snr and diff from the best of 5 runs on CIFAR10 under each noise setting, as inferior initialization can heavily degrade the performance. Hyperparameters listed in Table 13 are selected to cover more variants of sample-weighting functions (plotted in Fig. 8), which are not necessarily optimal.
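The scaling and shifting fixes can be sketched as wrappers around any sample-weighting function. The example below is a minimal illustration, assuming the MAE weight $w(\Delta_y) = p_y(1 - p_y)$, the closed-form approximation of $E[\Delta_y]$ with $\sigma = 1$, and a hypothetical value $\tau = 2$; none of the names are from the original implementation. It shows that both variants assign the weight $w(-\tau)$ to the expected initial $\Delta_y$ regardless of $k$, while the unmodified weight shrinks as $k$ grows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def w_mae(d):
    """MAE sample weight as a function of Delta_y: p_y (1 - p_y)."""
    p = sigmoid(d)
    return p * (1.0 - p)

def expected_delta(k, sigma=1.0):
    """Approximate E[Delta_y] at initialization for k classes."""
    return (-np.log(k - 1) - sigma**2 / 2
            + (np.exp(sigma**2) - 1) / (2 * (k - 1)))

def w_scaled(w, d, k, tau):
    """w*(Delta_y) = w(Delta_y / |E[Delta_y]| * tau)."""
    return w(d / abs(expected_delta(k)) * tau)

def w_shifted(w, d, k, tau):
    """w+(Delta_y) = w(Delta_y + |E[Delta_y]| - tau)."""
    return w(d + abs(expected_delta(k)) - tau)

tau = 2.0  # hypothetical hyperparameter value
for k in (10, 100, 1000):
    d0 = expected_delta(k)  # expected initial Delta_y
    print(k, w_mae(d0), w_scaled(w_mae, d0, k, tau), w_shifted(w_mae, d0, k, tau))

# For a partially learned sample (Delta_y = 2, k = 100), the shifted weight
# has already decayed much further than the scaled one.
print(w_scaled(w_mae, 2.0, 100, tau), w_shifted(w_mae, 2.0, 100, tau))
```

The last line illustrates the difference noted in the text: shifting moves the whole bell to the left, so samples leave the high-weight region sooner than under scaling.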

4.2. UNDERSTANDING NOISE ROBUSTNESS OF LOSS FUNCTIONS

We show that robust loss functions following Eq. (6) implicitly assign larger weights to clean samples. The underlying reasons are explored by examining how $\Delta_y$ distributions change during training. Notably, similar sample-weighting rules are explicitly adopted by curriculums for noise-robust training (Ren et al., 2018). We leave NCE to future work as it involves an additional regularizer.

Robust loss functions assign larger weights to clean samples. We use the ratio between the average weights of clean ($\bar{w}_{\mathrm{clean}}$) and noisy ($\bar{w}_{\mathrm{noise}}$) samples, $\mathrm{snr} = \bar{w}_{\mathrm{clean}} / \bar{w}_{\mathrm{noise}}$, to characterize their relative contribution during training. See Appendix C.2 for the exact formulas. Noise robustness is characterized by differences in test accuracy compared to results with clean labels (diff). We report diff and snr under different label noise on CIFAR10 in Table 5. Loss functions with higher snr generally suffer a smaller performance drop under label noise and are thus more robust.

To explain what leads to a large snr, we plot changes of $\Delta_y$ distributions during training on CIFAR10 with symmetric label noise in Fig. 3. See Fig. 9 and 10 for similar results with additional types of label noise and loss functions. When trained with loss functions that are more robust against label noise (Fig. 3b and 3c), $\Delta_y$ distributions of noisy and clean samples spread wider and get better separated. In addition, the consistent decrease of $\Delta_y$ for noisy samples suggests that they can be unlearned. In contrast, training with CE (Fig. 3a) results in more compact and less separated $\Delta_y$ distributions. Furthermore, $\Delta_y$ of noisy samples consistently increases.

Dynamics of SGD suppress learning of noisy samples. As shown in Fig. 3a, noisy samples are learned slower than clean samples as measured by improvements of $\Delta_y$, which can be explained by more coherent gradients among clean samples (Chatterjee & Zielinski, 2022). Similar results have been reported (Zhang et al., 2017; Arpit et al., 2017) and utilized in curriculum-based robust training (Yao et al., 2019; Han et al., 2018). In addition, noisy samples can be unlearned as shown in Fig. 3b and 3c, which can stem from generalization with clean samples. Both dynamics suppress the learning of noisy samples but not of clean ones, thus leading to robustness against label noise.

Robust $w(\Delta_y)$ synergizes with SGD dynamics for noise robustness. In Fig. 1, the bell-shaped $w(\Delta_y)$ of robust loss functions only assigns large weights to samples with moderate $\Delta_y$. Since $\Delta_y$ distributions initially concentrate at the monotonically increasing interval of $w(\Delta_y)$, (1) samples with faster improving $\Delta_y$, due to either larger initial weights or faster learning as clean samples, are weighted more during early training and learned faster. The magnified learning pace difference explains the widely spread distributions in Fig. 3b and 3c. In addition, (2) the unlearned samples with small $\Delta_y$ receive diminishing weights from $w(\Delta_y)$, which hampers their pace of learning. Noisy samples in Fig. 3b and 3c are consistently unlearned and ignored with marginal sample weights, leading to a consistent decrease in $\Delta_y$. In addition to the SGD dynamics, (1) and (2) further suppress the learning of noisy samples and enhance that of clean samples, thus leading to increased robustness against label noise. In contrast, the monotonically decreasing $w_{\mathrm{CE}}(\Delta_y)$ emphasizes samples with smaller $\Delta_y$, essentially acting against the SGD dynamics for noise robustness. Thus training with CE results in increased vulnerability to label noise as shown in Table 5.
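The snr statistic can be illustrated on synthetic data. The sketch below is only an assumption-laden toy: it takes snr to be the ratio of mean weights over one snapshot of samples (the exact formulas are in Appendix C.2 and may aggregate differently), and it draws hypothetical $\Delta_y$ values in which clean samples are partially learned and noisy samples are unlearned. The numbers it prints are not experimental results.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def w_ce(d):
    """CE sample weight: 1 - p_y (monotonically decreasing in Delta_y)."""
    return 1.0 - sigmoid(d)

def w_mae(d):
    """MAE sample weight: p_y (1 - p_y) (bell-shaped in Delta_y)."""
    p = sigmoid(d)
    return p * (1.0 - p)

def snr(weights, is_clean):
    """Assumed form: mean weight of clean samples over mean weight of noisy samples."""
    return weights[is_clean].mean() / weights[~is_clean].mean()

# Hypothetical snapshot: 800 clean samples around Delta_y = 1,
# 200 noisy samples around Delta_y = -4.
rng = np.random.default_rng(0)
delta = np.concatenate([rng.normal(1.0, 1.0, size=800),
                        rng.normal(-4.0, 1.0, size=200)])
is_clean = np.concatenate([np.ones(800, dtype=bool), np.zeros(200, dtype=bool)])

snr_mae = snr(w_mae(delta), is_clean)
snr_ce = snr(w_ce(delta), is_clean)
print("snr with MAE weights:", snr_mae)
print("snr with CE weights: ", snr_ce)
```

On such a snapshot the bell-shaped weight favors the clean samples, while the monotonically decreasing CE weight favors the unlearned noisy ones, matching the qualitative argument above.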

4.2.1. TRAINING SCHEDULES AFFECT NOISE ROBUSTNESS

Although the learning pace of noisy samples gets initially suppressed, the expected gradient will eventually be dominated by noisy samples, since well-learned clean samples receive marginal sample weights thanks to the monotonically decreasing interval of $w(\Delta_y)$. [...] (Song et al., 2019), or a constrained learning pace that prevents sufficient learning of clean samples, which avoids diminishing weights for them. We show the learning curve of CE using fixed learning rates under symmetric label noise on MNIST in Fig. 4b. By simply increasing or decreasing the learning rate, which strengthens the implicit regularization of SGD (Smith et al., 2021) or directly slows down the learning pace, CE can become robust against label noise.

5. RELATED WORK

Our work closely relates to robust loss functions against label noise (Song et al., 2020) . Most existing studies (Ghosh et al., 2017; Zhang & Sabuncu, 2018; Wang et al., 2019c; Ma et al., 2020; Liu & Guo, 2020; Cheng et al., 2021; Feng et al., 2020; Zhou et al., 2021b) focus on bounding the difference between risk minimizers obtained with noisy and clean data, which are agnostic to training dynamics. In contrast, with our novel curriculum perspective, we analyze the training dynamics with robust loss functions for reasons behind their underfitting issue and noise robustness. The underfitting problem has been heuristically mitigated with loss combination (Zhang & Sabuncu, 2018; Wang et al., 2019c; Ma et al., 2020) . We identify the cause and provide effective solutions. Curriculum-based approaches combat label noise with either sample selection (Chen et al., 2019; Zhou et al., 2021a) or sample weighting (Chang et al., 2017; Jiang et al., 2018; Ren et al., 2018) . In particular, sample weights are explicitly designed (Wang et al., 2019a; b; Chang et al., 2017) or predicted by a model trained on a different dataset (Jiang et al., 2018; Ren et al., 2018) . In contrast, the sample weights in this work are implicitly defined by robust loss functions. Notably, the implicit loss function we identified is a more direct metric for sample difficulty compared to common metrics based on loss functions (Kumar et al., 2010; Loshchilov & Hutter, 2015) and gradient magnitudes (Gopal, 2016) , which are implicitly affected by the preference from the sample-weighting functions of loss functions. Our work is also related to the ongoing debate (Hacohen & Weinshall, 2019; Wang et al., 2020) on strategies for selecting or weighting samples in curriculum learning: either easier first (Bengio et al., 2009; Kumar et al., 2010) or harder first (Loshchilov & Hutter, 2015; Zhang et al., 2018) . 
The implicit curriculums of robust loss functions can be viewed as a combination of both strategies, emphasizing samples with moderate difficulty. Most related to our work, Wang et al. (2019b) identify gradient norms as weights for sample gradients and propose heuristic designs of weighting functions for noise-robust training. In contrast, we explicitly identify the implicit loss function, which connects robust loss functions to curriculum learning, facilitates our analysis of the training dynamics and helps elucidate the robustness of loss functions from a curriculum perspective. Altering noise robustness by adjusting the learning rate is reminiscent of Huang et al. (2019), who use a cyclic learning rate to make models change back and forth between overfitting and underfitting to collect statistics for noisy label detection. To achieve noise robustness, they discard samples with detected noisy labels and retrain the model from scratch. In contrast, our results show that simply changing the learning rate can achieve noise robustness.

6. CONCLUSION AND DISCUSSION

We identified the implicit sample-weighting curriculums of a broad array of loss functions. Our novel curriculum perspective enables examining the training dynamics with loss functions through the interaction between the sample-weighting function and distributions of implicit losses. It connects robust loss functions to the seemingly distinct curriculum learning. Notably, the implicit loss function we identified is a direct metric for sample difficulty in curriculum learning as it factors out the preference of sample-weighting functions. We elucidate the reasons behind underfitting and robustness against label noise and propose a simple approach to address the underfitting issue. As with previous work on robust loss functions, our empirical results are based on image classification using convolutional neural networks with and without residual connections. We have extended our experiments to cover larger-scale classification tasks, human label noise and a broader array of robust loss functions. Although our derivation does not depend on the models and task specifications, additional experiments should be performed in future work to extend our conclusions to more models and tasks.

A EXTENDED REVIEW OF LOSS FUNCTIONS

Due to limited space, we only briefly describe typical robust loss functions in §2.1. As a general reference, here we provide a comprehensive review of loss functions related to the standard form Eq. ( 6). Similar to §2.1, we ignore the differences in constant scaling factors and additive bias. Loss functions and their sample-weighting functions are summarized in Table 6 . We examine how hyperparameters affect their sample-weighting functions in Fig. 5 .

A.1 LOSS FUNCTIONS WITHOUT ROBUSTNESS GUARANTEES

Cross Entropy (CE) $L_{\mathrm{CE}}(s, y) = -\log p_y$ is the standard loss function for classification.

Focal Loss (FL; Lin et al. 2017) $L_{\mathrm{FL}}(s, y) = -(1 - p_y)^q \log p_y$ aims to address label imbalance when training object detection models.

Neither CE nor FL is symmetric (Ma et al., 2020) or asymmetric (Zhou et al., 2021b).

A.2 SYMMETRIC LOSS FUNCTIONS

Mean Absolute Error (MAE; Ghosh et al. 2017)

$$L_{\mathrm{MAE}}(s, y) = \sum_{i=1}^{k} |I(i = y) - p_i| = 2 - 2p_y \propto 1 - p_y$$

is symmetric as it satisfies Eq. (2).

Reverse Cross Entropy (RCE; Wang et al. 2019c)

$$L_{\mathrm{RCE}}(s, y) = -\sum_{i=1}^{k} p_i \log I(i = y) = -\sum_{i \neq y} p_i A = -(1 - p_y) A \propto 1 - p_y = L_{\mathrm{MAE}}(s, y)$$

is equivalent to MAE in implementation, where $\log 0$ is truncated to a negative constant $A$ to avoid numerical overflow.

Ma et al. (2020) argued that any generic loss function with $L(s, i) > 0, \forall i \in \{1..k\}$ can become symmetric by simply normalizing it. As an example, Normalized Cross Entropy (NCE; Ma et al. 2020)

$$L_{\mathrm{NCE}}(s, y) = \frac{L_{\mathrm{CE}}(s, y)}{\sum_{i=1}^{k} L_{\mathrm{CE}}(s, i)} = \frac{-\log p_y}{\sum_{i=1}^{k} -\log p_i}$$

is a symmetric loss function. However, NCE does not follow the standard form of Eq. (6) as it additionally depends on $p_i$, $i \neq y$. It involves an additional regularizer, thus being more relevant to discussions in Appendix A.5.

| Name | Expression | Sample-weighting function $w$ | Constraints |
|---|---|---|---|
| AUL | $[(a - p_y)^q - (a - 1)^q]/q$ | $p_y(1 - p_y)(a - p_y)^{q-1}$ | $a > 1, q > 0$ |
| AGCE | $[(a + 1) - (a + p_y)^q]/q$ | $p_y(a + p_y)^{q-1}(1 - p_y)$ | $a > 0, q > 0$ |
| AEL | $e^{-p_y/q}$ | $\frac{1}{q} p_y(1 - p_y) e^{-p_y/q}$ | $q > 0$ |
| GCE | $(1 - p_y^q)/q$ | $p_y^q(1 - p_y)$ | $0 < q \leq 1$ |
| SCE | $-(1 - q) \log p_y + q(1 - p_y)$ | $(1 - q + q \cdot p_y)(1 - p_y)$ | $0 < q < 1$ |
| TCE | $\sum_{i=1}^{q} (1 - p_y)^i / i$ | $p_y \sum_{i=1}^{q} (1 - p_y)^i$ | $q \geq 1$ |

Table 6: Expressions, constraints of hyperparameters and sample-weighting functions of loss functions reviewed in Appendix A that follow the standard form Eq. (6).

A.3 ASYMMETRIC LOSS FUNCTIONS

Zhou et al. (2021b) derived the asymmetric condition for noise robustness and proposed numerous asymmetric loss functions. Below, $\tilde{r}$ denotes the noise ratio on the left-hand side of Eq. (3).

Asymmetric Generalized Cross Entropy (AGCE)

$$L_{\mathrm{AGCE}}(s, y) = \frac{(a + 1) - (a + p_y)^q}{q}$$

where $a > 0$ and $q > 0$. It is asymmetric when $I(q \leq 1)\left(\frac{a+1}{a}\right)^{1-q} + I(q > 1) \leq 1/\tilde{r}$.

Asymmetric Unhinged Loss (AUL)

$$L_{\mathrm{AUL}}(s, y) = \frac{(a - p_y)^q - (a - 1)^q}{q}$$

where $a > 1$ and $q > 0$. It is asymmetric when $I(q > 1)\left(\frac{a}{a-1}\right)^{q-1} + I(q \leq 1) \leq 1/\tilde{r}$.

Asymmetric Exponential Loss (AEL)

$$L_{\mathrm{AEL}}(s, y) = e^{-p_y/q}$$

where $q > 0$. It is asymmetric when $e^{1/q} \leq 1/\tilde{r}$.

A.4 COMBINED LOSS FUNCTIONS

Generalized Cross Entropy (GCE; Zhang & Sabuncu 2018)

$$L_{\mathrm{GCE}}(s, y) = \frac{1 - p_y^q}{q}$$

can be viewed as a smooth interpolation between CE and MAE, where $0 < q \le 1$. CE and MAE are recovered by setting $q \to 0$ and $q = 1$, respectively.

Symmetric Cross Entropy (SCE; Wang et al. 2019c)

$$L_{\mathrm{SCE}}(s, y) = a \cdot L_{\mathrm{CE}}(s, y) + b \cdot L_{\mathrm{RCE}}(s, y) \propto (1 - q) \cdot (-\log p_y) + q \cdot (1 - p_y)$$

is a weighted average of CE and RCE (MAE), where $a > 0$, $b > 0$, and $0 < q < 1$.

Taylor Cross Entropy (TCE; Feng et al. 2020)

$$L_{\mathrm{TCE}}(s, y) = \sum_{i=1}^{q} \frac{(1 - p_y)^i}{i}$$

is derived from the Taylor series of the log function. It reduces to MAE when $q = 1$. Interestingly, the summand $(1 - p_y)^i / i$ of TCE with $i > 2$ is proportional to AUL with $a = 1$ and $q = i$. Thus TCE can be viewed as a combination of symmetric and asymmetric loss functions.

Active-Passive Loss (APL; Ma et al. 2020) Ma et al. (2020) propose weighted combinations of active and passive loss functions. We include NCE+MAE as an example:

$$L_{\mathrm{NCE+MAE}}(s, y) = a \cdot L_{\mathrm{NCE}}(s, y) + b \cdot L_{\mathrm{MAE}}(s, y) \propto (1 - q) \cdot \frac{-\log p_y}{\sum_{i=1}^{k} -\log p_i} + q \cdot (1 - p_y)$$

where $a > 0$, $b > 0$, and $0 < q < 1$.
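The limiting cases stated above can be checked numerically. The following is a minimal sketch, with helper names of our own choosing, that evaluates each loss as a function of $p_y$ on a grid and measures how far GCE is from CE and MAE at the two ends of its range, how far TCE with $q = 1$ is from MAE, and how a TCE summand compares with AUL at the boundary value $a = 1$ (a limit case, since AUL itself requires $a > 1$).

```python
import numpy as np

def ce(p):
    return -np.log(p)

def mae(p):
    return 1.0 - p

def gce(p, q):
    return (1.0 - p ** q) / q

def tce(p, q):
    return sum((1.0 - p) ** i / i for i in range(1, q + 1))

def aul(p, a, q):
    return ((a - p) ** q - (a - 1.0) ** q) / q

p = np.linspace(0.05, 0.95, 19)

gap_gce_ce = float(np.max(np.abs(gce(p, 1e-6) - ce(p))))    # q -> 0 approaches CE
gap_gce_mae = float(np.max(np.abs(gce(p, 1.0) - mae(p))))   # q = 1 gives MAE
gap_tce_mae = float(np.max(np.abs(tce(p, 1) - mae(p))))     # q = 1 gives MAE
gap_tce_ce = float(np.max(np.abs(tce(p, 200) - ce(p))))     # many Taylor terms approach CE
# Summand of TCE with i = 3 versus AUL evaluated at the boundary a = 1, q = 3.
gap_summand_aul = float(np.max(np.abs((1.0 - p) ** 3 / 3 - aul(p, 1.0, 3))))

print(gap_gce_ce, gap_gce_mae, gap_tce_mae, gap_tce_ce, gap_summand_aul)
```

The grid deliberately stays away from $p_y = 0$, where CE diverges and the truncated Taylor series converges slowly.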

A.5 LOSS FUNCTIONS WITH ADDITIONAL REGULARIZERS

We additionally review loss functions that implicitly involve a regularizer and a primary loss function following the standard form Eq. ( 6). See Table 7 for a summary. In addition to the sample-weighting curriculums implicitly defined by the primary loss function, the additional regularizer complicates the analysis of the training dynamics. We leave investigations on how these regularizers affect noise robustness for future work.

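| Name | Original | Primary Loss | Regularizer |
|---|---|---|---|
| MSE | $1 - 2p_y + \sum_{i=1}^{k} p_i^2$ | $1 - p_y$ | $\sum_{i=1}^{k} p_i^2$ |
| PL (CR) | $-\log p_y + \log p_{y_n \mid x_m}$ | $-\log p_y$ | $\sum_{i=1}^{k} P(\tilde{y} = i) \log p_i$ |
| CE+GLS | $-\sum_{i=1}^{k} [\mathbb{I}(i = y)(1 - \alpha) + \frac{\alpha}{k}] \log p_i$ | $-\log p_y$ | $\pm \sum_{i=1}^{k} \frac{1}{k} \log p_i$ |
| NCE | $-\log p_y / (\sum_{i=1}^{k} -\log p_i)$ | $-\gamma_{\mathrm{NCE}} \cdot \log p_y$ | $\sum_{i=1}^{k} \frac{1}{k} \log p_i$ |

Table 7: Original expressions, primary loss functions in the standard form Eq. (6) and regularizers for loss functions reviewed in Appendix A.5. We view PL in its expectation to derive its regularizer. $p_{y_n \mid x_m}$ is the softmax probability of a random label $y_n$ with a random input $x_m$ sampled from the noisy data. $\gamma_{\mathrm{NCE}} = 1/(\sum_{i=1}^{k} -\log p_i)$ is a scalar wrapped with the stop-gradient operator.

Mean Square Error (MSE; Ghosh et al. 2017)

$$L_{\mathrm{MSE}}(s, y) = \sum_{i=1}^{k} (\mathbb{I}(i = y) - p_i)^2 = 1 - 2p_y + \sum_{i=1}^{k} p_i^2 \propto 1 - p_y + \frac{1}{2} \cdot \sum_{i=1}^{k} p_i^2 = L_{\mathrm{MAE}}(s, y) + \alpha \cdot R_{\mathrm{MSE}}(s)$$

is more robust than CE (Ghosh et al., 2017), where $\alpha = 0.5$ and the regularizer

$$R_{\mathrm{MSE}}(s) = \sum_{i=1}^{k} p_i^2$$

increases the entropy of the softmax output. We can generalize $\alpha$ to a hyperparameter, making MSE a combination of MAE and an entropy regularizer $R_{\mathrm{MSE}}$.

Peer Loss (PL; Liu & Guo 2020)

$$L_{\mathrm{PL}}(s, y) = L(s, y) - L(s_n, y_m)$$

makes a generic loss function $L(s, y)$ robust against label noise, where $s_n$ denotes the score of an input $x_n$ and $y_m$ a label, both randomly sampled from the noisy data. Its noise robustness is theoretically established for binary classification and extended to the multi-class setting (Liu & Guo, 2020).

Confidence Regularizer (CR; Cheng et al. 2021)

$$R_{\mathrm{CR}}(s) = -E_{\tilde{y}}[L(s, \tilde{y})]$$

is shown (Cheng et al., 2021) to be the regularizer induced by PL in expectation. Substituting $L$ with cross entropy leads to

$$R_{\mathrm{CR}}(s) = -E_{\tilde{y}}[-\log p_{\tilde{y}}] = \sum_{i=1}^{k} P(\tilde{y} = i) \log p_i$$

Minimizing $R_{\mathrm{CR}}(s)$ thus makes the softmax output distribution $p$ deviate from the prior label distribution of the noisy dataset $P(\tilde{y} = i)$, reducing the entropy of the softmax output.

Generalized Label Smoothing (GLS; Wei et al. 2021) Lukasik et al. (2020) show that label smoothing (LS; Szegedy et al. 2016) can mitigate overfitting with label noise, which is later extended to GLS. Cross entropy with GLS is

$$\begin{aligned}
L_{\mathrm{CE+GLS}}(s, y) &= \sum_{i=1}^{k} -\left[\mathbb{I}(i = y)(1 - \alpha) + \frac{\alpha}{k}\right] \log p_i \\
&= -(1 - \alpha) \log p_y - \alpha \cdot \frac{1}{k} \sum_{i=1}^{k} \log p_i \\
&\propto -\log p_y - \frac{\alpha}{1 - \alpha} \cdot \frac{1}{k} \sum_{i=1}^{k} \log p_i = L_{\mathrm{CE}}(s, y) + \alpha' \cdot R_{\mathrm{GLS}}(s)
\end{aligned}$$

where $\alpha' = \alpha/(1 - \alpha)$ and the regularizer $R_{\mathrm{GLS}}$ is

$$R_{\mathrm{GLS}}(s) = -\sum_{i=1}^{k} \frac{1}{k} \log p_i$$

With $\alpha' > 0$, $R_{\mathrm{GLS}}$ corresponds to the original label smoothing, which increases the entropy of softmax outputs. In contrast, $\alpha' < 0$ corresponds to negative label smoothing (Wei et al., 2021), which decreases the output entropy similar to $R_{\mathrm{CR}}$.

A.5.1 DERIVATIONS FOR NCE

Deriving Eq. (7). We rewrite NCE into a form with equivalent derivatives. Since

$$\begin{aligned}
\nabla_s L_{\mathrm{NCE}}(s, y) &= \frac{\nabla_s L_{\mathrm{CE}}(s, y) \cdot \sum_{i=1}^{k} L_{\mathrm{CE}}(s, i) - \nabla_s \sum_{i=1}^{k} L_{\mathrm{CE}}(s, i) \cdot L_{\mathrm{CE}}(s, y)}{\left[\sum_{i=1}^{k} L_{\mathrm{CE}}(s, i)\right]^2} \\
&= \frac{1}{\sum_{i=1}^{k} L_{\mathrm{CE}}(s, i)} \left\{ \nabla_s L_{\mathrm{CE}}(s, y) + \frac{k L_{\mathrm{CE}}(s, y)}{\sum_{i=1}^{k} L_{\mathrm{CE}}(s, i)} \cdot \nabla_s \left[\sum_{i=1}^{k} -\frac{1}{k} L_{\mathrm{CE}}(s, i)\right] \right\} \\
&= \gamma_{\mathrm{NCE}} \cdot \left[\nabla_s L_{\mathrm{CE}}(s, y) + \epsilon_{\mathrm{NCE}} \cdot \nabla_s R_{\mathrm{NCE}}(s)\right],
\end{aligned}$$

NCE can be rewritten as

$$L_{\mathrm{NCE}}(s, y) = \gamma_{\mathrm{NCE}} \cdot L_{\mathrm{CE}}(s, y) + \gamma_{\mathrm{NCE}} \cdot \epsilon_{\mathrm{NCE}} \cdot R_{\mathrm{NCE}}(s)$$

where $\gamma_{\mathrm{NCE}} = 1/(\sum_{i=1}^{k} -\log p_i)$ and $\epsilon_{\mathrm{NCE}} = k(-\log p_y)/(\sum_{i=1}^{k} -\log p_i)$ are scalar weights wrapped with the stop-gradient operator as discussed in §3.1. The regularizer $R_{\mathrm{NCE}}(s) = \sum_{i=1}^{k} \frac{1}{k} \log p_i$ is similar to $R_{\mathrm{GLS}}$.

Deriving $\hat{w}_{\mathrm{NCE}}$ of Eq. (8). Here we derive the upper bound of $\|\nabla_s L_{\mathrm{NCE}}(s, y)\|_1$ discussed in §3.1:

$$\begin{aligned}
\|\nabla_s L_{\mathrm{NCE}}(s, y)\|_1 &\le \gamma_{\mathrm{NCE}} \cdot \left(\|\nabla_s L_{\mathrm{CE}}(s, y)\|_1 + \epsilon_{\mathrm{NCE}} \cdot \|\nabla_s R_{\mathrm{NCE}}(s)\|_1\right) \\
&\le \gamma_{\mathrm{NCE}} \cdot \left(\|\nabla_s L_{\mathrm{CE}}(s, y)\|_1 + \epsilon_{\mathrm{NCE}} \cdot \frac{1}{k} \sum_{i=1}^{k} \|\nabla_s L_{\mathrm{CE}}(s, i)\|_1\right) \\
&= \gamma_{\mathrm{NCE}} \cdot \left(w_{\mathrm{CE}} \cdot \|\nabla_s \Delta_y\|_1 + \epsilon_{\mathrm{NCE}} \cdot \frac{1}{k} \sum_{i=1}^{k} w_{\mathrm{CE}} \cdot \|\nabla_s \Delta_i\|_1\right) \\
&= 2\gamma_{\mathrm{NCE}} \cdot w_{\mathrm{CE}} (1 + \epsilon_{\mathrm{NCE}}) = \hat{w}_{\mathrm{NCE}}
\end{aligned}$$

The derivation is based on the inequality $|x \pm y| \le |x| + |y|$ and the fact that $\|\nabla_s \Delta_i\|_1 = 2$. The latter can be proved by straightforward calculation. Given

$$\frac{\partial \Delta_i}{\partial s_j} = \begin{cases} 1, & j = i \\ -\dfrac{e^{s_j}}{\sum_{k \neq i} e^{s_k}} = -\dfrac{p_j}{1 - p_i}, & j \neq i \end{cases}$$

we then have

$$\|\nabla_s \Delta_i\|_1 = \sum_{j} \left|\frac{\partial \Delta_i}{\partial s_j}\right| = 1 + \sum_{j \neq i} \frac{p_j}{1 - p_i} = 1 + 1 = 2$$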
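The identity $\|\nabla_s \Delta_i\|_1 = 2$ used in the bound above is easy to confirm numerically. The sketch below, with function names of our own choosing, implements $\Delta_i = s_i - \log \sum_{k \neq i} e^{s_k}$, its closed-form gradient, and a central finite-difference check on random scores.

```python
import numpy as np

def delta(s, i):
    """Delta_i = s_i - log(sum_{k != i} exp(s_k)), computed with a stable log-sum-exp."""
    others = np.delete(s, i)
    m = others.max()
    return s[i] - (m + np.log(np.exp(others - m).sum()))

def grad_delta(s, i):
    """Closed-form gradient: 1 at j = i and -p_j / (1 - p_i) for j != i."""
    p = np.exp(s - s.max())
    p /= p.sum()
    g = -p / (1.0 - p[i])
    g[i] = 1.0
    return g

def numeric_grad(s, i, h=1e-6):
    """Central finite-difference gradient of Delta_i with respect to the scores s."""
    g = np.zeros_like(s)
    for j in range(s.size):
        e = np.zeros_like(s)
        e[j] = h
        g[j] = (delta(s + e, i) - delta(s - e, i)) / (2 * h)
    return g

rng = np.random.default_rng(0)
s = rng.normal(size=10)  # random class scores for k = 10 classes

l1_norms = [float(np.abs(grad_delta(s, i)).sum()) for i in range(s.size)]
fd_gap = max(float(np.max(np.abs(grad_delta(s, i) - numeric_grad(s, i))))
             for i in range(s.size))

print("L1 norms:", np.round(l1_norms, 12))
print("max gap to finite differences:", fd_gap)
```

The L1 norm equals 2 for every class index regardless of the scores, which is why the bound $\hat{w}_{\mathrm{NCE}}$ does not depend on the gradient direction.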

B DETAILED EXPERIMENTAL SETTINGS

Label noise. The synthetic noisy labels are generated following Ma et al. (2020), Zhou et al. (2021b) and Patrini et al. (2017). For symmetric label noise, the training labels are randomly flipped to a different class with probabilities η ∈ {0.2, 0.4, 0.6, 0.8}. Asymmetric label noise is generated from a class-dependent flipping pattern. On CIFAR100, the 100 classes are grouped into 20 super-classes, each having 5 sub-classes. Each class is flipped within the same super-class into the next in a circular fashion. The flip probabilities are η ∈ {0.1, 0.2, 0.3, 0.4}. Human label noise for CIFAR10/100 is adopted from Wei et al. (2022). We use the "worst" labels of CIFAR10-N and the "fine" labels of CIFAR100-N, both leading to η = 0.4.
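The two synthetic noise models described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact generation script: flipped symmetric labels are assumed to be drawn uniformly from the other classes, and the CIFAR100 super-classes are assumed to be contiguous blocks of 5 class indices (the real dataset uses its own class-to-super-class mapping). Function names are hypothetical.

```python
import numpy as np

def symmetric_noise(labels, eta, num_classes, rng):
    """Flip each label with probability eta to a uniformly chosen different class."""
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < eta
    # An offset in [1, num_classes - 1] taken modulo num_classes never maps a class to itself.
    offsets = rng.integers(1, num_classes, size=labels.shape[0])
    return np.where(flip, (labels + offsets) % num_classes, labels)

def asymmetric_noise_grouped(labels, eta, rng, group_size=5):
    """Flip each label with probability eta to the next class in its super-class, circularly.

    Assumes super-class g contains class indices [g * group_size, (g + 1) * group_size).
    """
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < eta
    group_start = (labels // group_size) * group_size
    next_in_group = group_start + (labels % group_size + 1) % group_size
    return np.where(flip, next_in_group, labels)

rng = np.random.default_rng(0)
clean = rng.integers(0, 100, size=50_000)

sym = symmetric_noise(clean, eta=0.4, num_classes=100, rng=rng)
asym = asymmetric_noise_grouped(clean, eta=0.4, rng=rng)

sym_rate = float(np.mean(sym != clean))
asym_rate = float(np.mean(asym != clean))
print(f"symmetric noise rate:  {sym_rate:.3f}")
print(f"asymmetric noise rate: {asym_rate:.3f}")
```

Because flipped labels never coincide with the original class in either function, the empirical noise rate matches η up to sampling error.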

Models and hyperparameters

We use a 4-layer CNN for MNIST, an 8-layer CNN for CIFAR10, a ResNet-34 (He et al., 2016) for CIFAR100, and a ResNet-50 (He et al., 2016) for WebVision, all with batch normalization (Ioffe & Szegedy, 2015). Data augmentation on CIFAR10/100 includes random width/height shift and horizontal flip. On WebVision, we additionally include random cropping and color jittering. Unless otherwise specified, all models are trained using SGD with momentum 0.9 and batch size 128 for 50, 120, 200 and 250 epochs on MNIST, CIFAR10, CIFAR100 and WebVision, respectively. Learning rates with cosine annealing are 0.01 on MNIST and CIFAR10, 0.1 on CIFAR100, and 0.2 on WebVision. Weight decays are $10^{-3}$ on MNIST, $10^{-4}$ on CIFAR10, $10^{-5}$ on CIFAR100 and $3 \times 10^{-5}$ on WebVision. All loss functions are normalized to have unit maximum in sample weights, which differs from Ma et al. (2020). Hyperparameters of loss functions are listed in Tables 8 and 13 for different experiments.

C ADDITIONAL RESULTS TO UNDERSTAND ROBUST LOSS FUNCTIONS

Derivation of $E(\Delta_y)$ at initialization in Eq. (9), for normally distributed class scores with mean $\mu$ and variance $\sigma^2$:

$$\begin{aligned}
E(\Delta_y) &= E\Big[s_y - \log \sum_{i \neq y} e^{s_i}\Big] = \mu - E\Big[\log \sum_{i \neq y} e^{s_i}\Big] \\
&\overset{1}{\approx} \mu - \log E\Big[\sum_{i \neq y} e^{s_i}\Big] + \frac{V[\sum_{i \neq y} e^{s_i}]}{2 E[\sum_{i \neq y} e^{s_i}]^2} \\
&\overset{2}{=} \mu - \log\{(k - 1) E[e^{s_y}]\} + \frac{(k - 1) V[e^{s_y}]}{2 \{(k - 1) E[e^{s_y}]\}^2} \\
&\overset{3}{=} \mu - \log[(k - 1) e^{\mu + \sigma^2/2}] + \frac{(k - 1)(e^{\sigma^2} - 1) e^{2\mu + \sigma^2}}{2 [(k - 1) e^{\mu + \sigma^2/2}]^2} \\
&= -\log(k - 1) - \sigma^2/2 + \frac{e^{\sigma^2} - 1}{2(k - 1)}
\end{aligned}$$

where $\overset{1}{\approx}$ follows the approximation with Taylor expansion $E[\log X] \approx \log E[X] - V[X]/(2E[X]^2)$ (Teh et al., 2006), $\overset{2}{=}$ utilizes properties of the sum of log-normal variables (Cobb et al., 2012), and $\overset{3}{=}$ substitutes $E[e^{s_y}]$ and $V[e^{s_y}]$ with expressions for log-normal distributions.
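The closed-form approximation of $E(\Delta_y)$ can be compared against a Monte Carlo estimate. The sketch below assumes i.i.d. class scores $s_i \sim N(\mu, \sigma^2)$ and takes class 0 as the labeled class without loss of generality; the function names and the sample size are our own choices.

```python
import numpy as np

def expected_delta_closed_form(k, sigma):
    """Last line of the derivation: -log(k-1) - sigma^2/2 + (e^{sigma^2} - 1) / (2(k-1))."""
    return -np.log(k - 1) - sigma ** 2 / 2 + (np.exp(sigma ** 2) - 1) / (2 * (k - 1))

def expected_delta_monte_carlo(k, mu, sigma, n, rng):
    """Average Delta_y = s_y - log(sum_{i != y} exp(s_i)) over n simulated score vectors."""
    s = rng.normal(mu, sigma, size=(n, k))
    delta_y = s[:, 0] - np.log(np.exp(s[:, 1:]).sum(axis=1))
    return float(delta_y.mean())

rng = np.random.default_rng(0)
results = {}
for k in (10, 100):
    closed = float(expected_delta_closed_form(k, sigma=1.0))
    simulated = expected_delta_monte_carlo(k, mu=0.0, sigma=1.0, n=20_000, rng=rng)
    results[k] = (closed, simulated)
    print(f"k={k:3d}  closed form={closed:.4f}  Monte Carlo={simulated:.4f}")
```

Since the approximation truncates a Taylor expansion, the agreement is expected to be tighter for larger $k$, where the sum of log-normal terms is more concentrated around its mean.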

C.1.1 ADDRESSING UNDERFITTING FROM MARGINAL INITIAL SAMPLE WEIGHTS

Hyperparameter τ for different settings. The hyperparameter τ controlling the shape of the modified sample-weighting functions $w^+(\Delta_y)$ and $w^*(\Delta_y)$ can affect noise robustness. We therefore tune τ for the best performance under each noise type and noise rate; the selected values are listed in Table 10.

Additional results with $w^*(\Delta_y)$ and $w^+(\Delta_y)$. We report additional results under symmetric and asymmetric label noise with diverse noise rates η in Table 11 and Table 12, respectively. The performance of MAE and AGCE improves substantially with $w^*(\Delta_y)$ and $w^+(\Delta_y)$.

Visualization of $w^*(\Delta_y)$ and $w^+(\Delta_y)$. In Fig. 7 we visualize the shifted and scaled sample-weighting functions of MAE on CIFAR100. Although both achieve the same initial sample weights at $|E[\Delta_y]|$ of CIFAR100, $w^+(\Delta_y)$ diminishes much faster as $\Delta_y$ increases, leading to insufficient learning of training samples, which can explain its inferior performance in Tables 3, 4, 11 and 12.

As shown in Table 10, a larger noise rate η requires a larger τ for better performance, which in general assigns smaller weights to samples with small $\Delta_y$. However, our preliminary exploration finds no straightforward derivation from $L(s, y)$ being symmetric/asymmetric to $L^*(s, y)$ and $L^+(s, y)$ being symmetric/asymmetric. We leave the theoretical discussion to future work.

Table 12: Additional results to Table 3 with more asymmetric label noise rates on CIFAR100.

Table 13: Hyperparameters of different loss functions for results in §4.2 and Appendix C.2. They are selected for broad coverage of shapes, scales and horizontal locations of sample-weighting functions instead of optimal performance on CIFAR10.
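The difference between the shifted and scaled weighting functions of MAE can be illustrated with a short sketch. It rests on assumptions that are not spelled out in this appendix: the MAE weight is taken as $w(\Delta_y) = p_y(1 - p_y)$ with $p_y = \mathrm{sigmoid}(\Delta_y)$, the scaled variant is assumed to evaluate $w$ at $\alpha \Delta_y$ with $\alpha = \tau / |E[\Delta_y]|$, and the shifted variant at $\Delta_y + \beta$ with $\beta = |E[\Delta_y]| - \tau$. The initial $E[\Delta_y]$ values come from the closed form derived for Eq. (9) with $\sigma = 1$, and τ is set to $|E[\Delta_y]|$ for $k = 10$, mirroring the caption of Fig. 7.

```python
import numpy as np

def sigmoid(delta):
    return 1.0 / (1.0 + np.exp(-delta))

def w_mae(delta):
    """Sample weight of MAE as a function of Delta_y (assumed form: p_y * (1 - p_y))."""
    p = sigmoid(delta)
    return p * (1.0 - p)

def expected_delta(k, sigma=1.0):
    """Closed-form approximation of E[Delta_y] at initialization, as in Eq. (9)."""
    return -np.log(k - 1) - sigma ** 2 / 2 + (np.exp(sigma ** 2) - 1) / (2 * (k - 1))

e_cifar100 = float(expected_delta(100))   # about -5.09 under the sigma = 1 assumption
tau = abs(float(expected_delta(10)))      # tau = |E[Delta_y]| for k = 10 (about 2.60)

alpha = tau / abs(e_cifar100)             # scaling factor (assumed usage)
beta = abs(e_cifar100) - tau              # shift amount (assumed usage)

def w_scaled(delta):
    return w_mae(alpha * delta)

def w_shifted(delta):
    return w_mae(delta + beta)

for d in (e_cifar100, -2.0, 0.0, 2.0):
    print(f"Delta_y={d:6.2f}  vanilla={w_mae(d):.4f}  "
          f"scaled={w_scaled(d):.4f}  shifted={w_shifted(d):.4f}")
```

Under these assumptions both variants give the same weight at the initial $E[\Delta_y]$ of CIFAR100, but the shifted weight has already decayed well below the scaled one by $\Delta_y = 0$, consistent with the observation that $w^+(\Delta_y)$ diminishes faster as $\Delta_y$ increases.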

C.2 NOISE ROBUSTNESS OF LOSS FUNCTIONS

Computation of $\bar{w}_{\mathrm{clean}}$ and $\bar{w}_{\mathrm{noise}}$ for snr in Table 5. The average weight for clean samples, adjusted by the learning rate $\alpha_t$ at each step, is computed as

$$\bar{w}_{\mathrm{clean}} = \frac{\sum_{i,t} \alpha_t \cdot \mathbb{I}(\tilde{y}_{i,t} = y_{i,t})\, w_{i,t}}{\sum_{i,t} \alpha_t \cdot \mathbb{I}(\tilde{y}_{i,t} = y_{i,t})}$$

where $w_{i,t}$ denotes the weight of the $i$-th sample of the batch at step $t$, $\tilde{y}_{i,t}$ is the potentially corrupted noisy label and $y_{i,t}$ the uncorrupted label. Similarly, for noisy samples,

$$\bar{w}_{\mathrm{noise}} = \frac{\sum_{i,t} \alpha_t \cdot \mathbb{I}(\tilde{y}_{i,t} \neq y_{i,t})\, w_{i,t}}{\sum_{i,t} \alpha_t \cdot \mathbb{I}(\tilde{y}_{i,t} \neq y_{i,t})}$$

Hyperparameters. We list the hyperparameters for different loss functions in Table 13 for results in §4.2 and Appendix C.2. In Fig. 8, we plot the sample-weighting functions of different loss functions.

Changes of $\Delta_y$ distributions with different label noise and loss functions. Complementing Fig. 3, in Fig. 9 we plot how distributions of $\Delta_y$ change during training on CIFAR10 with additional types of label noise using hyperparameters in Table 13. They follow similar trends as in Fig. 3, thus supporting the analysis in §4.2. As MAE is not robust against asymmetric label noise with high η (Ghosh et al., 2017), it results in inferior performance. We also include results with additional loss functions in Fig. 10. Since optimal hyperparameters will result in similar sample-weighting functions, we choose hyperparameters for broad coverage of $w(\Delta_y)$ to better understand how they affect robustness.
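The two learning-rate-adjusted averages above can be computed as in the following sketch. The array layout (one row per training step, one column per sample in the batch) and the function name are assumptions made for illustration; the final ratio is shown only as one plausible way to summarize the two averages, since the exact definition of snr is given with Table 5 rather than here.

```python
import numpy as np

def average_weights(weights, noisy_labels, clean_labels, learning_rates):
    """Learning-rate-adjusted mean sample weight for clean and for noisy samples.

    weights, noisy_labels, clean_labels: arrays of shape (steps, batch_size).
    learning_rates: array of shape (steps,), the learning rate alpha_t at each step.
    """
    w = np.asarray(weights, dtype=float)
    is_clean = np.asarray(noisy_labels) == np.asarray(clean_labels)
    # Broadcast alpha_t across the batch dimension so each sample carries its step's rate.
    alpha = np.asarray(learning_rates, dtype=float)[:, None] * np.ones_like(w)
    w_clean = (alpha * is_clean * w).sum() / (alpha * is_clean).sum()
    w_noise = (alpha * ~is_clean * w).sum() / (alpha * ~is_clean).sum()
    return float(w_clean), float(w_noise)

# Toy example: 2 training steps with a batch of 3 samples each (made-up numbers).
weights = [[0.2, 0.4, 0.1],
           [0.6, 0.8, 0.3]]
noisy_labels = [[0, 1, 2],
                [1, 1, 0]]
clean_labels = [[0, 1, 1],
                [1, 2, 0]]
learning_rates = [0.1, 0.05]

w_clean, w_noise = average_weights(weights, noisy_labels, clean_labels, learning_rates)
ratio = w_clean / w_noise  # assumed summary statistic, not necessarily the paper's snr
print(f"w_clean={w_clean:.4f}  w_noise={w_noise:.4f}  ratio={ratio:.4f}")
```

In the toy example the clean samples contribute $(0.02 + 0.04 + 0.03 + 0.015)/0.3 = 0.35$ and the noisy samples $(0.01 + 0.04)/0.15 \approx 0.333$.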



1. Our code will be available at GitHub.
2. Changes of model states during training, except for trivial metrics like evaluation metrics and loss functions.
3. Very small or too small to be important.
4. Enough training steps without early stopping or diminishing learning rates for a small training loss.



Figure 1: Sample-weighting functions $w(\Delta_y)$ of loss functions in Table 2 with hyperparameters in Table 8. We include the initial $\Delta_y$ distributions of CIFAR10 and CIFAR100 for reference, which are obtained by computing $\Delta_y$ with a randomly initialized model for all training samples.

(a) NCE with estimated weight upperbound. (b) AUL with inferior/superior hyperparameters.

Figure 2: Different explanations for underfitting: (a) fast diminishing sample weights; (b) marginal initial sample weights. We plot the variation of ᾱ * t with training step t on CIFAR100 without label noise for each loss function. ᾱ * t of NCE is estimated with ŵNCE . Since ŵNCE is not comparable to w CE , we normalize ᾱ * t with its maximum in (a) to emphasize its variation during training.

Figure 3: How ∆ y distributions of noisy (green, left) and clean (orange, right) samples change on CIFAR10 during training with symmetric label noise and η = 0.4. Vertical axes denoting probability density are scaled to the peak of histograms for readability, with epoch number (axis scaling factor) denoted on the right of each subplot. We plot w(∆ y ) and report the test accuracy of each setting for reference. See Appendix C.2 for results with additional types of label noise and loss functions.

(a) α = 0.01 with different (η, loss). (b) CE with different (η, α).

Figure 4: Learning curves with fixed learning rate and extended training epochs on MNIST under symmetric label noise, where α is the learning rate and η the noise rate.

is a classic symmetric loss function, where $\mathbb{I}(i = y)$ is the indicator function. Reverse Cross Entropy (RCE; Wang et al. 2019c)

Figure 5: How hyperparameters affect the sample-weighting functions in Table 6. The initial $\Delta_y$ distributions of CIFAR100 extracted with a randomly initialized model are included as reference.

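$\gamma_{\mathrm{NCE}} = 1/(\sum_{i=1}^{k} -\log p_i)$ and $\epsilon_{\mathrm{NCE}} = k(-\log p_y)/(\sum_{i=1}^{k} -\log p_i)$ are scalar weights wrapped with the stop-gradient operator as discussed in §3.1. The regularizer $R_{\mathrm{NCE}}(s) = \sum_{i=1}^{k} \frac{1}{k} \log p_i$ is similar to $R_{\mathrm{GLS}}$. Deriving $\hat{w}_{\mathrm{NCE}}$ of Eq. (8): here we derive the upper bound of $\|\nabla_s L_{\mathrm{NCE}}(s, y)\|_1$ discussed in §3.1.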

Figure 6: Comparisons between simulated and real ∆ y distributions at initialization. The simulations are based on the assumption that class scores follow normal distribution s i ∼ N (0, 1) at initialization and plotted as curves. Real distributions are extracted with randomly initialized models and plotted as histograms. The vertical axis denotes probability density f (∆ y ).

Robustness of loss functions from $w^*(\Delta_y)$ and $w^+(\Delta_y)$. Our proposed $w^*(\Delta_y)$ and $w^+(\Delta_y)$ aim to address the underfitting of robust loss functions with marginal initial sample weights. They modify $p_y$ into $p^*_y$ and $p^+_y$ using $\alpha = \tau / |E[\Delta_y]|$ and $\beta = |E[\Delta_y]| - \tau$, which induces new loss functions $L^*(s, y) = l(p^*_y)$ and $L^+(s, y) = l(p^+_y)$, respectively. Commonly $\alpha < 1$ and $\beta > 0$, since a small τ leads to large initial sample weights and underfitting results from small $E[\Delta_y]$. Notably, τ can determine the robustness of the induced loss functions.

Figure 7: Shifted, scaled and the vanilla sample-weighting functions of MAE on CIFAR100. τ equals $|E[\Delta_y]|$ on CIFAR10. We include the initial $\Delta_y$ distributions of CIFAR10/100 extracted with a randomly initialized model as reference.

| | AUL | AGCE | GCE | SCE |
|---|---|---|---|---|
| a | 2.0 | 3.0 | / | / |
| q | 2.0 | 4.0 | 0.4 | 0.95 |

Figure 8: Plots of sample-weighting functions of loss functions used in Table 5 with hyperparameters in Table 8.

(a) CE, Sym., 0.2: 74.49 (b) SCE, Sym., 0.2: 85.31 (c) MAE, Sym., 0.2: 86.71 (d) CE, Human, 0.4: 61.12 (e) SCE, Human, 0.4: 72.56 (f) MAE, Human, 0.4: 78.99 (g) CE, Asym., 0.4: 73.75 (h) SCE, Asym., 0.4: 72.44 (i) MAE, Asym., 0.4: 61.11

Figure 9: Additional results to Fig. 3 with different label noise: (a-c) symmetric label noise with η = 0.2; (d-f) human label noise with η = 0.4; (g-i) asymmetric label noise with η = 0.4. Noisy samples are colored green (on the left) and clean samples are orange (on the right). Test accuracies are included in the caption for reference.

(a) AGCE, Sym., 0.4: 44.52 (b) GCE, Sym., 0.4: 66.60 (c) AUL, Sym., 0.4: 84.12 (d) AGCE, Human, 0.4: 35.73 (e) GCE, Human, 0.4: 68.22 (f) AUL, Human, 0.4: 77.22 (g) AGCE, Asym., 0.4: 46.75 (h) GCE, Asym., 0.4: 73.23 (i) AUL, Asym., 0.4: 67.72

Figure 10: Additional results to Fig. 3 with more robust loss functions under different label noise: (a-c) symmetric label noise with η = 0.4; (d-f) human label noise with η = 0.4; (g-i) asymmetric label noise with η = 0.4. Test accuracies are included in the caption for reference. Noisy samples are colored green (on the left) and clean samples are orange (on the right). Hyperparameters of these loss functions are selected for broad coverage rather than optimal performance.



.45 AGCE scale 70.57 ± 0.62 56.69 ± 0.33 14.64 ± 0.79 39.71 ± 0.17 50.85 ± 0.11

Shifting or scaling w(∆

). Notably, w * (∆ y ) leads to dramatic improvements for MAE under all settings.

Models with extended training 4 thus risk overfitting noisy samples during the late training stage. Adjusting the training schedules to enable or avoid such overfitting can therefore affect the noise robustness of models. BasedGhosh et al., 2017), similar to CE, with extended training, MAE eventually overfits noisy samples, resulting in vulnerability to label noise. CE can become robust by adjusting the learning rate schedule. To avoid overfitting noisy samples, we can avoid learning when noisy samples dominate the expected gradient. It can be achieved with either early stopping

Hyperparameters of different loss functions for results in §4.1 and Appendix C.1. They are tuned on CIFAR100 without label noise. Settings with inferior hyperparameters are denoted with †.

Similar results as Table 2 with learning rate α = 0.01. Hyperparameters for loss functions are listed in Table

Robust loss functions can underfit. In Table 9 we report results similar to Table 2 with learning rate α = 0.01. Although settings that severely underfit improve slightly, they still perform much worse than CE, which further confirms that underfitting results from the robust loss functions themselves.

Hyperparameter τ of w

Additional results to Table 3 with more symmetric label noise rates on CIFAR100.
± 0.73 27.59 ± 0.54 25.75 ± 0.50 24.28 ± 0.80 20.64 ± 0.40
NCE+AUL ‡ 68.96 ± 0.16 66.62 ± 0.09 63.86 ± 0.18 50.38 ± 0.32 38.59 ± 0.48
AGCE 49.27 ± 1.03 47.53 ± 0.73 46.77 ± 2.37 39.82 ± 2.70 33.40 ± 1.57
AGCE shift 69.39 ± 0.84 63.03 ± 0.42 55.84 ± 0.78 49.05 ± 0.81 40.76 ± 0.74
AGCE scale 70.57 ± 0.62 67.13 ± 0.60 59.71 ± 0.10 48.23 ± 0.29 39.71 ± 0.17

