NOISE AGAINST NOISE: STOCHASTIC LABEL NOISE HELPS COMBAT INHERENT LABEL NOISE

Abstract

The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect, previously studied in optimization by analyzing the dynamics of parameter updates. In this paper, we are interested in learning with noisy labels, where we have a collection of samples with potential mislabeling. We show that a previously rarely discussed form of SGD noise, induced by stochastic label noise (SLN), mitigates the effects of inherent label noise, whereas the common SGD noise directly applied to model parameters does not. We formalize the differences and connections of SGD noise variants, showing that SLN induces SGD noise dependent on the sharpness of the output landscape and the confidence of the output probability, which may help escape sharp minima and prevent overconfidence. SLN not only improves generalization in its simplest form but also boosts popular robust training methods, including sample selection and label correction. Specifically, we present an enhanced algorithm by applying SLN to label correction. Our code is released.

1. INTRODUCTION

The existence of label noise is a common issue in classification, since real-world samples unavoidably contain some noisy labels resulting from annotation platforms such as crowdsourcing systems (Yan et al., 2014). In the canonical setting of learning with noisy labels, we collect samples with potential mislabeling, but we do not know which samples are mislabeled since the true labels are unobservable. It is troubling that overparameterized Deep Neural Networks (DNNs) can memorize noise during training, leading to poor generalization performance (Zhang et al., 2017; Chen et al., 2020b). Thus, robust training methods that can mitigate the effects of label noise are urgently needed. The noise in stochastic gradient descent (SGD) (Wu et al., 2020) provides a crucial implicit regularization effect for training overparameterized models. SGD noise was previously studied in optimization by analyzing the dynamics of parameter updates, whereas, to the best of our knowledge, its utility in learning with noisy labels has not been explored. In this paper, we find that the common SGD noise directly applied to model parameters does not endow much robustness, whereas a variant induced by controllable label noise does. Interestingly, inherent label noise is harmful to generalization, yet we can mitigate its effects using additional controllable label noise. To prevent confusion, we use stochastic label noise (SLN) to denote the label noise we introduce. Inherent label noise is biased and unknown, fixed once the data is given. SLN is mean-zero and independently sampled for each instance in each training step. Our main contributions are as follows.
• We formalize the differences and connections of three SGD noise variants (Propositions 1-3) and show that SLN induces SGD noise that depends on the sharpness of the output landscape and the confidence of the output probability.
• Based on the noise covariance, we analyze and illustrate two effects of SLN (Claim 1 and Claim 2): escaping from sharp minima¹ and preventing overconfidence².
• We empirically show that SLN not only improves generalization in its simplest form but also boosts popular robust training methods, including sample selection and label correction. We present an enhanced algorithm by applying SLN to label correction.
In Fig. 1, we present a quick comparison between models trained with/without SLN on CIFAR-10 with symmetric/asymmetric/instance-dependent/open-set label noise. Throughout this paper, we use CE to indicate a model trained with the standard cross-entropy (CE) loss without any robust learning techniques; the standard CE loss is also used by default for methods like SLN. In Section 4, we provide more experimental details and further results that comprehensively verify the robustness of SLN on different synthetic noise and real-world noise. Here, the test curves in Fig. 1 show that SLN avoids the drop in test accuracy, with converged test accuracy even higher than the peak accuracy of the model trained with CE. The right two subplots in Fig. 1 show the average loss on clean and noisy samples. When trained with CE, the model eventually memorizes noise, indicated by the drop of the average loss on noisy samples. In contrast, SLN largely avoids fitting noisy labels.

2.1. SGD NOISE AND THE REGULARIZATION EFFECT

The noise in SGD (Wu et al., 2020; Wen et al., 2019; Keskar et al., 2016) has long been studied in optimization. It is believed to provide a crucial implicit regularization effect (HaoChen et al., 2020; Arora et al., 2019; Soudry et al., 2018) for training overparameterized models. The most common SGD noise is spherical Gaussian noise on model parameters (Ge et al., 2015; Neelakantan et al., 2015; Mou et al., 2018), while empirical studies (Wen et al., 2019; Shallue et al., 2019) demonstrate that parameter-dependent SGD noise is more effective. It has been shown that noise covariance containing curvature information performs better for escaping from sharp minima (Zhu et al., 2019; Daneshmand et al., 2018). On a quadratically-parameterized model (Vaskevicius et al., 2019; Woodworth et al., 2020), HaoChen et al. (2020) prove that in an over-parameterized regression setting, SGD with label perturbations recovers the sparse ground truth, whereas SGD with Gaussian noise added directly to gradient descent overfits to dense solutions. In the deep learning scenario, HaoChen et al. (2020) present preliminary empirical results showing that SGD noise induced by Gaussian noise on the gradient of the loss w.r.t. the model's output avoids the performance degeneration of large-batch training. Xie et al. (2016) discuss the implicit ensemble effect of random label perturbations and demonstrate better generalization performance. In this paper, we provide new insights by analyzing SGD noise variants and their effects, and by showing their utility in learning with noisy labels.

2.2. ROBUST TRAINING METHODS

Mitigating the effects of label noise is a vital topic in classification with a long history (Ekholm & Palmgren, 1982; Natarajan et al., 2013), and it attracts much recent interest with several directions explored. 1) Malach & Shalev-Shwartz (2017); Han et al. (2018b); Yu et al. (2019); Chen et al. (2019a); Wei et al. (2020) propose sample selection methods that train on trusted samples, identified according to training loss, cross-validation or (dis)agreement between two models. 2) Liu & Tao (2015); Jiang et al. (2018); Ren et al. (2018); Shu et al. (2019); Li et al. (2019a) develop sample-weighting schemes that aim to place higher weights on clean samples. 3) Sukhbaatar et al. (2015); Patrini et al. (2017); Hendrycks et al. (2018); Han et al. (2018a) apply loss correction based on an estimated noise transition matrix. 4) Reed et al. (2015); Tanaka et al. (2018); Arazo et al. (2019); Zheng et al. (2020); Chen et al. (2020a) propose label correction based on the model's predictions. 5) Ghosh et al. (2017); Zhang & Sabuncu (2018); Xu et al. (2019); Wang et al. (2019); Lyu & Tsang (2020); Ma et al. (2020) study robust loss functions with theoretical guarantees for noisy risk minimization, typically under the assumption that the noise is class-conditional (Scott et al., 2013; Natarajan et al., 2013). 6) Chen et al. (2019b); Menon et al. (2020); Hu et al. (2020); Harutyunyan et al. (2020); Lukasik et al. (2020) apply regularization techniques to improve generalization under label noise, including explicit regularizations such as manifold regularization (Belkin et al., 2006) and virtual adversarial training (Miyato et al., 2018), and implicit regularizations such as dropout (Srivastava et al., 2014), temporal ensembling (Laine & Aila, 2017), gradient clipping (Pascanu et al., 2012; Zhang et al., 2019; Menon et al., 2020) and label smoothing (Szegedy et al., 2016). 7) One can combat label noise with refined training strategies (Li et al., 2019b; 2020; Nguyen et al., 2020) that potentially incorporate several techniques, including sample selection/weighting, label correction, meta-learning (Li et al., 2019b) and semi-supervised learning (Tarvainen & Valpola, 2017; Berthelot et al., 2019). Among these methods, regularization techniques are closely related to the essence of training networks, and studying robustness under label noise provides a new lens for understanding regularization, apart from the optimization lens.

3.1. THE DIFFERENCES AND CONNECTIONS OF SGD NOISE VARIANTS

Notations. Let D = {(x^(i), y^(i))}_{i=1}^n be a dataset with noisy labels. For each sample (x, y), its label y may be incorrect and the true label is unobservable. Let f(x; θ) be the neural network with trainable parameters θ ∈ R^p. For a c-class classification problem, the output is f(x; θ) ∈ R^c, and we apply a softmax function S(f(x; θ)) ∈ [0, 1]^c to obtain the probability of each class. The loss on a sample is denoted ℓ(f, y); for classification, we use the cross-entropy (CE) loss by default. In parameter updates, a sample contributes ∇_θ ℓ(f, y) to the gradient descent. With SGD noise, the model is trained with a noisy gradient ∇̃_θ ℓ(f, y). Following the standard notation of the Jacobian matrix, we have ∇_θ ℓ ∈ R^{1×p}, ∇_f ℓ ∈ R^{1×c}, ∇_θ f ∈ R^{c×p}, ∇_{θ_i} f ∈ R^c and ∇_θ f_i ∈ R^{1×p}.

Gaussian noise on the gradient of the loss w.r.t. parameters. The most common SGD noise is spherical Gaussian noise added directly to the gradient w.r.t. the parameters (Neelakantan et al., 2015):

∇̃_θ ℓ(f, y) = ∇_θ ℓ(f, y) + σ_θ z_θ,    (1)

where σ_θ > 0, z_θ ∈ R^{1×p} and z_θ ~ N(0, I_{p×p}).

Gaussian noise on the gradient of the loss w.r.t. the model output. Taking a step further, HaoChen et al. (2020) study SGD noise induced by label noise on a quadratically-parameterized regression model, whereas for classification, they add mean-zero noise to ∇_f ℓ(f, y):

∇̃_θ ℓ(f, y) = (∇_f ℓ(f, y) + σ_f z_f) · ∇_θ f,    (2)

where σ_f > 0, z_f ∈ R^{1×c} and z_f ~ N(0, I_{c×c}).

Noise induced by SLN. Label perturbation is a common technique (Xie et al., 2016), while we provide new insights by analyzing its effects through the lens of SGD noise. Our SLN adds mean-zero Gaussian noise to the one-hot labels, where the noise is independently sampled for each instance in each training step:

∇̃_θ ℓ(f, y) = ∇_θ ℓ(f, y + σ_y z_y),    (3)

where σ_y > 0, z_y ∈ R^c and z_y ~ N(0, I_{c×c}). Here, σ_y z_y is the SLN on the label y. In Eq. (1)-(3) above, the standard deviation σ is a hyperparameter.
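To make the three variants concrete, the sketch below implements Eq. (1)-(3) for a toy softmax-linear model. This is our own illustrative stand-in for f(x; θ), not the paper's released code; all function and variable names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def ce_grad(W, x, y):
    """CE gradient w.r.t. W for f = W x. For a (possibly soft) label y,
    dL/df = S * sum(y) - y, which reduces to S - y for a one-hot y."""
    S = softmax(W @ x)
    return np.outer(S * y.sum() - y, x)

c, d = 10, 32                       # classes, input dimension
W = 0.1 * rng.normal(size=(c, d))   # theta: the trainable parameters
x = rng.normal(size=d)
y = np.eye(c)[3]                    # one-hot (possibly incorrect) label

def grad_eq1(W, x, y, sigma):
    """Eq. (1): spherical Gaussian noise on the parameter gradient."""
    return ce_grad(W, x, y) + sigma * rng.normal(size=W.shape)

def grad_eq2(W, x, y, sigma):
    """Eq. (2): Gaussian noise on dL/df, then backpropagated through f."""
    S = softmax(W @ x)
    return np.outer((S - y) + sigma * rng.normal(size=c), x)

def grad_eq3(W, x, y, sigma):
    """Eq. (3), SLN: mean-zero Gaussian noise on the label itself,
    freshly sampled for this instance at this training step."""
    return ce_grad(W, x, y + sigma * rng.normal(size=c))
```

With σ = 0, all three reduce to the clean CE gradient; the variants differ only in where the Gaussian perturbation enters the chain rule.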
Since the SGD noise we introduce is i.i.d. for each sample, we consider each single sample (x, y) independently in the following propositions. For convenience, we use f, S to denote the model output f(x; θ) and the softmax output S(f(x; θ)) on a sample. The proofs are presented in Appendix A.

Proposition 1. Compared with Eq. (1), Eq. (2) induces noise z ~ N(0, σ_f² M) on ∇_θ ℓ(f, y), s.t. M ∈ R^{p×p} and M_{i,j} = (∇_{θ_i} f)ᵀ ∇_{θ_j} f, ∀i, j ∈ {1, …, p}. Note that the standard deviation of the noise on the i-th parameter θ_i is σ_f ‖∇_{θ_i} f‖₂, where ‖·‖₂ denotes the L₂ norm.

Proposition 2. For the cross-entropy loss, compared with Eq. (2), Eq. (3) induces noise z ~ N(0, σ_y² M) on ∇_f ℓ(f, y), s.t. M ∈ R^{c×c}, M_{i,i} = c(S_i − 1/c)² + (c − 1)/c and M_{i,j} = c·S_i S_j − S_i − S_j for i ≠ j. Note that the standard deviation of the noise on the i-th entry is σ_y √(c(S_i − 1/c)² + (c − 1)/c).

Proposition 3. For the cross-entropy loss, compared with Eq. (1), Eq. (3) induces noise z ~ N(0, σ_y² M) on ∇_θ ℓ(f, y), s.t. M ∈ R^{p×p} and M_{i,j} = (∇_{θ_i} S / S)ᵀ (∇_{θ_j} S / S), ∀i, j ∈ {1, …, p}, where the division by S is element-wise. Note that the standard deviation of the noise on the i-th parameter θ_i is σ_y ‖∇_{θ_i} S / S‖₂.

3.2. THE EFFECTS OF SGD NOISE

Xie et al. (2016) discuss the effect of label perturbations as an implicit ensemble. In this paper, based on Propositions 1-3, we show that SLN induces SGD noise of high variance when the output landscape is sharp or the prediction confidence is high. In this way, SLN helps escape from sharp minima and prevents overconfidence. It has been discussed that flat minima generalize well (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017; Neyshabur et al., 2017). Specifically, Achille & Soatto (2018) show that flat minima have lower mutual information between model parameters and training data, which leads to better generalization.
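The covariance in Proposition 2 can be sanity-checked numerically. The sketch below is our own Monte Carlo check, not from the paper: it draws the label noise z_y, forms the induced noise z = −σ_y (z_y − (Σ_k z_{y_k}) S) derived in Appendix A, and compares its empirical covariance against the closed-form M (the softmax vector S here is a hypothetical example).

```python
import numpy as np

rng = np.random.default_rng(1)
c, sigma_y = 5, 0.7
S = rng.dirichlet(np.ones(c))             # a hypothetical softmax output

# Closed-form covariance matrix M from Proposition 2.
M = c * np.outer(S, S) - S[:, None] - S[None, :]
np.fill_diagonal(M, c * (S - 1.0 / c) ** 2 + (c - 1.0) / c)

# Monte Carlo estimate: the equivalent noise on grad_f is
# z = -sigma_y * (z_y - sum_k(z_y_k) * S), with z_y ~ N(0, I).
n = 200_000
Zy = rng.normal(size=(n, c))
Z = -sigma_y * (Zy - Zy.sum(axis=1, keepdims=True) * S)
cov_mc = (Z.T @ Z) / n                    # E[z z^T]; z is mean-zero
```

The empirical covariance should match σ_y² M up to Monte Carlo error, confirming that the variance grows as any S_i moves away from 1/c.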
The finding motivates several robust learning methods (Harutyunyan et al., 2020; Xie et al., 2020) and also supports our method. Moreover, preventing overconfidence can mitigate overfitting on noisy labels (Menon et al., 2020; Lukasik et al., 2020). The SGD noise perturbs θ so that the training cannot converge when the noise has high variance. Therefore, we derive Claim 1 and Claim 2 with the following justifications.
• For the most common spherical Gaussian noise in Eq. (1), the standard deviation is a constant throughout training, independent of the landscape.
• For Eq. (2), Proposition 1 shows that the standard deviation of the noise is σ_f ‖∇_{θ_i} f‖₂. ‖∇_{θ_i} f‖₂ can be very large around a sharp landscape, which means the SGD noise has high variance. The high variance makes the training difficult to converge, which helps escape from sharp minima. Note that training without SGD noise can converge to sharp minima because θ always follows the direction of gradient descent, whereas the direction of the noise is random.
• For our SLN in Eq. (3), Proposition 3 shows that the standard deviation of the SGD noise is σ_y ‖∇_{θ_i} S / S‖₂, with ∇_{θ_i} S in the numerator. SLN similarly induces SGD noise with high variance around a sharp landscape. Hence we have Claim 1.
• Moreover, Proposition 2 shows that SLN induces SGD noise dependent on the confidence of S. The standard deviation of the noise on an entry of ∇_f ℓ(f, y) is minimized if S_i = 1/c and maximized if S_i = 1. Proposition 3 directly characterizes the equivalent noise on θ, where S appears in the denominator (element-wise division). If S is confident, s.t. the entropy H(S) → 0, which means one entry of S approaches 1 and the others approach 0, then the variance will be high since there are small numbers in the denominator. Hence we have Claim 2.

Claim 1. With SGD noise induced by Eq. (2) or Eq. (3), the training is difficult to converge when the output (f or S) landscape is sharp.

Claim 2. With SGD noise induced by Eq. (3), the training is difficult to converge when the output (S) is overconfident, s.t. H(S) → 0, where H(·) is the entropy.

In terms of escaping from sharp minima, previous works (Zhu et al., 2019; Daneshmand et al., 2018) study the inherent noise in SGD, showing that noise covariance containing curvature information performs better. They use a second-order approximation near the minima and integrate over training steps to characterize the ability to escape from sharp minima. In this paper, we study the noise induced by SLN and draw a more direct intuition connecting the noise covariance and the ability to escape from sharp minima. Moreover, the dependency between the noise induced by SLN and the confidence of the output probability further provides an intuition on avoiding overfitting under noisy labels. In terms of preventing overconfidence, we shall discuss label smoothing (LS) (Lukasik et al., 2020). It smooths the given one-hot label y into a soft one ỹ, s.t. ỹ = (1 − α)y + αe/c, where e is an all-one vector and α > 0. In this way, LS introduces a fixed and biased perturbation on the label, whereas SLN introduces dynamic and mean-zero perturbations. LS does not introduce noise in each training step, while SLN adaptively perturbs the parameters once the landscape is sharp or the prediction is overconfident. Hence, the robustness of SLN may result not from preventing overconfidence alone, but also from escaping from sharp minima. In Fig. 3, we plot the sample density w.r.t. predictions on the labeled class, using CIFAR-10 with 40% symmetric noise as an example. It shows that SLN does reduce overconfidence, and it mostly affects noisy samples.
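The contrast between LS and SLN can be made concrete with a minimal sketch (our own illustration; variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
c = 10
y = np.eye(c)[3]                     # the given one-hot label

# Label smoothing: a fixed, biased soft target, identical at every step.
alpha = 0.1
y_ls = (1 - alpha) * y + alpha * np.ones(c) / c

# SLN: a mean-zero perturbation, freshly drawn per instance per step,
# so the perturbed label equals y in expectation over training.
sigma_y = 0.5
y_sln = y + sigma_y * rng.normal(size=c)
```

The design difference is visible directly: y_ls deviates from y by the same fixed bias in every epoch, while y_sln is a random draw whose average over many steps recovers y.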

3.3. A DISSECTION ON TRAINING SAMPLES

In this section, we show that SLN boosts popular robust training methods, including sample selection and label correction. Many methods demonstrated to work well select or upweight small-loss samples (Han et al., 2018b; Jiang et al., 2018; Li et al., 2020), or use the model's predictions to correct noisy labels (Tanaka et al., 2018; Arazo et al., 2019). A warm-up phase is usually required to initialize the model before sample selection or label correction, yet the model will memorize noise if the warm-up phase is too long. With SLN, we can simply train the model until convergence. In Fig. 4, we compare converged models on CIFAR-10, where the model is trained with/without SLN. The detailed noise setting can be found in Section 4. We first sort training samples in ascending order of loss, then uniformly divide them into intervals of 1000 samples, and finally count the four types of samples in each interval based on the correctness of the given label and the prediction. When trained with CE, there are many small-loss samples with incorrect labels (the blue region), so selecting small-loss samples is not reliable. The model trained with SLN largely addresses the issue. Moreover, SLN is suitable for label correction since it yields correct predictions for many originally mislabeled samples (the orange region). As a concrete example, we present an enhanced algorithm by applying SLN to label correction. With SLN, we train the model for sufficient epochs until convergence, without the need to carefully tune a warm-up phase. Then we start label correction using y_correction = ω · y + (1 − ω) · S, where S is the softmax prediction and ω ∈ [0, 1] is a weight obtained by normalizing the training loss, s.t. for the i-th training sample, ω_i = (ℓ_i − ℓ_min)/(ℓ_max − ℓ_min). More discussions on the label correction are presented in Appendix D.
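The correction rule above admits a direct vectorized sketch. This is our own illustration of the stated formula, not the authors' code; the small epsilon guarding a constant-loss batch is our addition.

```python
import numpy as np

def correct_labels(Y, S, losses):
    """Per-sample correction y_corr = w*y + (1-w)*S, where w is the
    training loss min-max normalized to [0, 1] across samples."""
    l = np.asarray(losses, dtype=float)
    w = (l - l.min()) / (l.max() - l.min() + 1e-12)  # eps: our safeguard
    w = w[:, None]
    return w * Y + (1 - w) * S
```

Per the formula, the minimum-loss sample (w = 0) adopts the softmax prediction S, the maximum-loss sample (w ≈ 1) keeps its given label y, and intermediate samples interpolate between the two; since each output row is a convex combination of two distributions, it remains a valid distribution.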

4.1. EXPERIMENT SETUP

We comprehensively verify the utility of SLN on different types of label noise, including symmetric noise, asymmetric noise (Zhang & Sabuncu, 2018), instance-dependent noise (Chen et al., 2020a) and open-set noise (Wang et al., 2018) synthesized on CIFAR-10 and CIFAR-100, as well as real-world noise on Clothing1M (Xiao et al., 2015).
• Symmetric noise assumes each label has the same probability of flipping to any other class. We uniformly flip the label to other classes with an overall probability of 40%.
• Asymmetric noise contains noisy labels flipped between similar classes. Following Zhang & Sabuncu (2018), on CIFAR-10, we flip labels between TRUCK→AUTOMOBILE, BIRD→AIRPLANE, DEER→HORSE, and CAT↔DOG, with a probability of 40%; on CIFAR-100, we flip each class into the next class circularly with a probability of 40%.
• Instance-dependent noise is challenging since the mislabeling probability should depend on each instance's input features (Xia et al., 2020; Chen et al., 2020a). We use the instance-dependent noise from Chen et al. (2020a) with a noise ratio of 40%, where the noise is synthesized based on the DNN prediction error.
• Open-set noise contains samples that do not belong to any class considered in the classification task. Following Wang et al. (2018), we generate CIFAR-10 with open-set noise by randomly replacing 40% of its training images with images from CIFAR-100.
For SLN-MO, we maintain a momentum model updated as θ̄^(t) = α θ̄^(t−1) + (1 − α) θ^(t). We use α = 0.999 in all experiments. For SLN-MO-LC, we have discussed in Section 3.3 that SLN is reliable when applied to label correction.
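The symmetric and circular-asymmetric corruptions, and the momentum-model update, can be sketched as follows. These helpers are our own illustration under one common reading of "flip to other classes" (a flipped label always lands on a different class); the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def symmetric_noise(labels, n_classes, ratio=0.4):
    """Flip each label, with probability `ratio`, uniformly to one of
    the other n_classes - 1 classes."""
    labels = labels.copy()
    flip = rng.random(labels.size) < ratio
    shift = rng.integers(1, n_classes, size=labels.size)  # 1..n_classes-1
    labels[flip] = (labels[flip] + shift[flip]) % n_classes
    return labels

def asymmetric_noise_circular(labels, n_classes, ratio=0.4):
    """CIFAR-100-style asymmetric noise: flip each class into the next
    class circularly with probability `ratio`."""
    labels = labels.copy()
    flip = rng.random(labels.size) < ratio
    labels[flip] = (labels[flip] + 1) % n_classes
    return labels

def ema_update(theta_bar, theta, alpha=0.999):
    """SLN-MO momentum model: theta_bar <- alpha*theta_bar + (1-alpha)*theta."""
    return alpha * theta_bar + (1 - alpha) * theta
```

On a large label array, roughly `ratio` of the labels end up changed by `symmetric_noise`, and every label changed by `asymmetric_noise_circular` moves to the next class modulo n_classes.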

4.2. COMPARING SGD NOISE VARIANTS

We first show that, compared with other variants, SLN stands out in mitigating the effects of label noise. The SGD noise variants are formalized in Eq. (1)-(3), including z_θ added directly to ∇_θ ℓ, z_f on ∇_f ℓ, and our SLN (z_y). For SLN, the standard deviation is σ = 1 under symmetric label noise and σ = 0.5 in all other cases. For the other SGD noise variants, the standard deviation is equally tuned, and the performance under three different σ is shown separately in Fig. 5. SLN significantly improves generalization, while the other SGD noise variants cannot achieve such impressive performance even when σ is heavily tuned. It is worth noting that for all these variants, when σ is too small, the model overfits noise since training is similar to merely using the CE loss; when σ is too large, the model fails to fit the training data since the SGD noise is too strong.

4.3. CIFAR-10 AND CIFAR-100

On CIFAR-10 and CIFAR-100, we compare with the following baselines: 1) standard cross-entropy (CE) loss, 2) Generalized Cross-Entropy (GCE) (Zhang & Sabuncu, 2018) loss, 3) Co-Teaching (Han et al., 2018b), which uses co-training and sample selection, 4) PHuber-CE (Menon et al., 2020), which uses gradient clipping, and 5) label smoothing (LS) (Lukasik et al., 2020), which smooths the label to be less confident before training. We use 5k noisy samples as the validation set to tune hyperparameters, then train the model on the full training set and report the test accuracy at the last epoch. SLN simply requires tuning the standard deviation σ, which is tuned in {0.1, 0.2, 0.5, 1}. On CIFAR-10, the best σ is 1 under symmetric noise and 0.5 otherwise; on CIFAR-100, it is 0.1 under instance-dependent noise and 0.2 otherwise. The label correction in SLN-MO-LC is applied in the last 50 epochs, with the softmax prediction converted into a one-hot label in correction. We repeat each experiment 5 times. The average test accuracy at the last epoch is reported in Table 1 and Table 2. To illustrate the influence of σ, an ablation study is presented in Fig. 6. In the tables, we mark the top-3 results in bold and present the average training time of each method, evaluated on a single V100 GPU. Without the momentum model and label correction, vanilla SLN achieves impressive test performance, which is consistent with the results in Fig. 1.

4.4. CLOTHING1M

Clothing1M (Xiao et al., 2015) is a large-scale benchmark of clothing images from online shops with 14 classes, containing real-world label noise. It has 1 million noisy samples for training, and 14k and 10k clean samples for validation and test. The number of images labeled as each class is unbalanced, ranging from 18976 to 88588 in the noisy training set. In previous works, some experiments are conducted by sampling a class-balanced training subset in each epoch (Li et al., 2020), while others directly train on the full training set (Patrini et al., 2017). Since the balanced training sampling itself affects the test performance, it is difficult to compare results across papers. Therefore, we conduct experiments in both settings: the standard sampling and the noisy-class-balanced sampling. For the latter, in each epoch, 18976 instances per class are randomly sampled from the noisy training set. Other training details strictly follow the standard benchmark setting (Patrini et al., 2017), presented in Appendix B. We set the standard deviation of SLN as σ = 0.2. For SLN-MO-LC, the label correction is applied from the first epoch. Results are listed in Table 3, with the best result in bold and previously published results marked by a star. DivideMix trains two models in each run, and we average their test accuracy rather than using an additional ensemble of the two models. Our SLN outperforms many baselines. The variants SLN-MO and SLN-MO-LC further achieve higher test accuracy; specifically, SLN-MO-LC achieves the best test accuracy in both settings. Our methods also stand out in training efficiency.
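The noisy-class-balanced sampling can be sketched as below. This is our own illustrative helper, not the authors' code; on Clothing1M, per_class would be 18976 (the smallest class size), so sampling without replacement suffices there, and the oversampling branch is only a safeguard for smaller classes.

```python
import numpy as np

rng = np.random.default_rng(4)

def balanced_epoch_indices(noisy_labels, per_class):
    """Draw `per_class` indices per noisy class for one epoch's training set."""
    chosen = []
    for cls in np.unique(noisy_labels):
        pool = np.flatnonzero(noisy_labels == cls)
        # Oversample (with replacement) only if a class is below the quota.
        chosen.append(rng.choice(pool, size=per_class,
                                 replace=pool.size < per_class))
    return rng.permutation(np.concatenate(chosen))
```

Each epoch then trains on an equal number of (noisily labeled) examples per class, removing the class-imbalance confound when comparing against results obtained with balanced sampling.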

5. CONCLUSION

In this paper, we establish that SLN induces SGD noise dependent on the sharpness of the output landscape and the confidence of the output probability, and we analyze the effects of escaping from sharp local minima and preventing overconfidence. This partially explains the robustness of SLN under noisy labels, since various works show that flat minima typically generalize well and that preventing overconfidence helps mitigate overfitting on noisy labels. We empirically verify the robustness of SLN under various synthetic label noise and real-world noise. Moreover, we show that SLN boosts popular robust training methods, including sample selection and label correction. In particular, we justify that SLN can enhance existing methods based on a detailed dissection of training samples, then present a practical algorithm by applying SLN to label correction.

A PROOFS

Proposition 1. Compared with Eq. (1), Eq. (2) induces noise z ~ N(0, σ_f² M) on ∇_θ ℓ(f, y), s.t. M ∈ R^{p×p} and M_{i,j} = (∇_{θ_i} f)ᵀ ∇_{θ_j} f, ∀i, j ∈ {1, …, p}. Note that the standard deviation of the noise on the i-th parameter θ_i is σ_f ‖∇_{θ_i} f‖₂, where ‖·‖₂ denotes the L₂ norm.

Proof. Dimensions: z_f ∈ R^{1×c}, ∇_θ ℓ ∈ R^{1×p}, ∇_f ℓ ∈ R^{1×c}, ∇_θ f ∈ R^{c×p}, ∇_{θ_i} f ∈ R^c. For Eq. (2), the noisy gradient is

∇̃_θ ℓ(f, y) = (∇_f ℓ(f, y) + σ_f z_f) · ∇_θ f = ∇_θ ℓ(f, y) + σ_f z_f · ∇_θ f.    (4)

The noise on ∇_θ ℓ(f, y) is z = σ_f z_f · ∇_θ f ∈ R^{1×p}. Note that z_f ~ N(0, I_{c×c}). Letting z_i be the i-th entry of z, we have z_i = σ_f Σ_{k=1}^c (∂f_k/∂θ_i) z_{f_k}. Hence,

E[z_i²] = σ_f² ‖∇_{θ_i} f‖₂²,  E[z_i z_j] = σ_f² (∇_{θ_i} f)ᵀ ∇_{θ_j} f.    □

Proposition 2. For the cross-entropy loss, compared with Eq. (2), Eq. (3) induces noise z ~ N(0, σ_y² M) on ∇_f ℓ(f, y), s.t. M ∈ R^{c×c}, M_{i,i} = c(S_i − 1/c)² + (c − 1)/c, and M_{i,j} = c·S_i S_j − S_i − S_j for i ≠ j. Note that the standard deviation of the noise on the i-th entry is σ_y √(c(S_i − 1/c)² + (c − 1)/c).

Proof.
Dimensions: y ∈ R^c, z_y ∈ R^c, S ∈ R^c, ∇_f ℓ ∈ R^{1×c}, ∇_S ℓ ∈ R^{1×c}, ∇_f S ∈ R^{c×c}. First, for the softmax function S = S(f(x)), the derivative matrix is ∇_f S = Λ(S) − S Sᵀ, where Λ(S) is a diagonal matrix with S_i as its i-th diagonal element and 0 elsewhere. For the cross-entropy loss, ℓ(f, y) = −Σ_{k=1}^c y_k log S_k. With element-wise division by S, we have

∇_f ℓ(f, y + σ_y z_y) = ∇_S ℓ(f, y + σ_y z_y) · ∇_f S
= −((y + σ_y z_y)/S)ᵀ · ∇_f S
= −(y/S)ᵀ · ∇_f S − σ_y (z_y/S)ᵀ · ∇_f S
= ∇_f ℓ(f, y) − σ_y (z_y/S)ᵀ · (Λ(S) − S Sᵀ)
= ∇_f ℓ(f, y) − σ_y (z_y − (Σ_{k=1}^c z_{y_k}) S)ᵀ.

Hence it is equivalent to inducing noise z = −σ_y (z_y − (Σ_{k=1}^c z_{y_k}) S)ᵀ on ∇_f ℓ(f, y), whose i-th entry is z_i = −σ_y (z_{y_i} − (Σ_{k=1}^c z_{y_k}) S_i). Note that z_y ~ N(0, I_{c×c}); then for i = j,

E[z_i²] = σ_y² (1 − 2S_i + c·S_i²) = σ_y² (c(S_i − 1/c)² + (c − 1)/c),

and for i ≠ j,

E[z_i z_j] = σ_y² (c·S_i S_j − S_i − S_j).    □

Proposition 3. For the cross-entropy loss, compared with Eq. (1), Eq. (3) induces noise z ~ N(0, σ_y² M) on ∇_θ ℓ(f, y), s.t. M ∈ R^{p×p} and M_{i,j} = (∇_{θ_i} S / S)ᵀ (∇_{θ_j} S / S), ∀i, j ∈ {1, …, p}, where the division by S is element-wise. Note that the standard deviation of the noise on the i-th parameter θ_i is σ_y ‖∇_{θ_i} S / S‖₂.

• LS (Lukasik et al., 2020). There is a hyperparameter α that controls how much the label is smoothed. We tune α ∈ {0.2, 0.5, 0.8}; finally, on CIFAR-10, we use α = 0.5 for asymmetric noise and α = 0.8 otherwise, and on CIFAR-100, we use α = 0.8.
• SIGUA (Han et al., 2020). The hyperparameter γ is a factor multiplied on the loss of 'bad' samples. We tune it in {0.01, 0.001, 0.0001} and finally use γ = 0.001 in all experiments.
• DivideMix (Li et al., 2020). We tune the warm-up epoch and λ_u, the weight for the unsupervised loss. In the official implementation, the warm-up epoch is 10 on CIFAR-10 and 30 on CIFAR-100.
The default hyperparameters do not work well in our experiments (we observe a decrease in test accuracy after warm-up). Hence, we tune the warm-up epoch in {10, 30, 50, 100}, but the performance does not improve compared with the default settings. We therefore use the default warm-up and tune λ_u ∈ {0, 1, 5, 10, 25}. Finally, we obtain impressive results with λ_u = 0 in all experiments.

B.2 CLOTHING1M

The backbone and general training hyperparameters. On Clothing1M, following the common setting (Patrini et al., 2017), we train an ImageNet-pretrained ResNet-50 using the SGD optimizer with momentum 0.9, weight decay 10⁻³ and batch size 32. The initial learning rate is 10⁻³ and is decreased to 10⁻⁴ after 5 epochs. We use standard data augmentation with per-pixel normalization, horizontal random flip and 224 × 224 random crop. Note that DivideMix uses the same ResNet-50 backbone but a different training schedule, and we follow its official implementation released on GitHub.
Method-specific hyperparameters. Since the ResNet-50 backbone is used by default for most published results (Patrini et al., 2017; Li et al., 2020), we can easily follow the default hyperparameters.
• SLN/SLN-MO/SLN-MO-LC (ours). Following previous methods (Patrini et al., 2017; Li et al., 2020), the validation set containing 14k clean samples is adopted to tune our hyperparameters. We tune σ ∈ {0.1, 0.2, 0.5} and choose σ = 0.2. The momentum model is implemented with hyperparameter 0.999 without tuning. We fix the overall number of training epochs at 10 and tune the epoch for applying label correction in {1, 5, 9}. Finally, we apply label correction after the first epoch in all experiments.
• Forward/Backward (Patrini et al., 2017). The results are reproduced by reimplementing the method, exactly following the hyperparameters in Patrini et al. (2017).
• Co-Teaching (Han et al., 2018b). On Clothing1M, the estimated noise rate is around 0.4 (Xiao et al., 2015). Hence, we linearly reduce the rate of selecting small-loss samples from 1 to 0.6 over 10 epochs.
• DivideMix (Li et al., 2020). The result is reproduced from its official implementation.

C MORE EMPIRICAL RESULTS AND DISCUSSIONS C.1 SLN ENHANCES EXISTING METHODS

With a detailed dissection of the predictions of DNNs trained with SLN (Section 3.3), we have shown that SLN can boost popular robust training methods such as label correction and sample selection. In this section, we verify this by integrating SLN with the following methods.
• Co-teaching (Han et al., 2018b). It uses co-training and sample selection: two models select small-loss samples to train each other.
• Stochastic integrated gradient underweighted ascent (SIGUA) (Han et al., 2020). It adopts gradient descent on good data as usual, and learning-rate-reduced gradient ascent on bad data.
• DivideMix (Li et al., 2020). It combines sample selection with semi-supervised learning.
To understand when label correction fails, we provide an intuitive hypothesis by analyzing its effect on the four types of samples shown in Fig. 9.
• 1) Label ✗, prediction ✗. Label correction cannot fix these samples and may even make the case worse, because prediction errors should be easier for the model to overfit than the given noisy labels.
• 2) Label ✗, prediction ✓. These samples benefit from label correction.
• 3) Label ✓, prediction ✗. Label correction is harmful for these samples.
• 4) Label ✓, prediction ✓. Label correction does not significantly impact these samples because both the prediction and the label are correct.
In short, samples of cases 1) and 3) are undesired in label correction, but they are unavoidable since we cannot expect a model with 100% prediction accuracy. In label correction, when trained with modified labels obtained from wrong predictions, the model may accumulate its own errors through a positive feedback loop: yielding worse predictions after training on prediction errors. Therefore, we hypothesize that samples of cases 1) and 3) cause the convergence issue.
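For synthetic noise, where true labels are available for diagnostics, the four-way dissection above can be reproduced with a few lines. This is a sketch with our own names, not the paper's code; with real-world noise the true labels (and hence this breakdown) are unavailable.

```python
import numpy as np

def dissect(given, true, pred):
    """Count samples by (given label correct?, prediction correct?)."""
    given, true, pred = map(np.asarray, (given, true, pred))
    label_ok = given == true
    pred_ok = pred == true
    return {
        "label wrong, pred wrong": int((~label_ok & ~pred_ok).sum()),
        "label wrong, pred right": int((~label_ok & pred_ok).sum()),
        "label right, pred wrong": int((label_ok & ~pred_ok).sum()),
        "label right, pred right": int((label_ok & pred_ok).sum()),
    }
```

In the paper's terms, "label wrong, pred wrong" and "label right, pred wrong" are the problematic cases 1) and 3), while "label wrong, pred right" counts the samples that label correction can actually repair.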



¹ Around sharp minima, the output changes rapidly (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). ² The prediction probability on some class approaches 1.



Figure 1: Test accuracy and training loss, averaged in 5 runs.


Fig. 2 shows visualizations of loss landscapes. The model trained with SLN converges to a flat minimum that has small SGD noise. More discussions on the convergence are presented in Appendix E.

Figure 2: Loss landscapes around the local minima of converged models trained on CIFAR-10 with symmetric noise, visualized using the technique in Li et al. (2018). We show the z-axis on the same scale to compare the sharpness; and draw color bars separately to show the loss distribution around each minimum. (a): The model trained with CE converges to a sharp minimum. (b): Training with Eq. (1) yields a minimum with a higher loss, yet it is still sharp. (c)&(d): Consistent with our analysis, the model trained with Eq. (2) or Eq. (3) converges to a flat minimum.

Figure 4: Training samples are sorted in ascending order of loss, uniformly divided into 1000 samples per interval, and dissected according to the correctness of the given label and the prediction.

Figure 5: Performance of SGD noise variants on CIFAR-10. The accuracy is averaged over 5 runs.

Figure 8: Test accuracy w.r.t. training efficiency (1/time) on CIFAR-10.

Figure 9: Training samples are sorted in ascending order of loss, uniformly divided into 1000 samples per interval, and dissected according to the correctness of the given label and the prediction. This figure is identical to Fig. 4 in the main paper; we repeat it here for convenience since we refer to it in Section D.

• For real-world noise, we use the large-scale benchmark Clothing1M, which contains 1M training images with noisy labels from online shops.

Test accuracy (mean±std in 5 runs) on CIFAR-10. The Open-Set noise is generated by randomly replacing 40% of the CIFAR-10 images with images from CIFAR-100. In Appendix C.1, Table 4 shows that SLN can improve many robust learning methods.
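For concreteness, a minimal sketch of injecting symmetric label noise at a given rate. Here each selected label is flipped uniformly to a different class; the exact generation procedure is our assumption, as it is not specified in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_symmetric_noise(labels, eps, num_classes, rng):
    # Flip a fraction eps of labels uniformly to one of the other classes.
    noisy = labels.copy()
    flip = np.flatnonzero(rng.random(labels.shape[0]) < eps)
    for i in flip:
        others = np.delete(np.arange(num_classes), noisy[i])
        noisy[i] = rng.choice(others)
    return noisy

clean = rng.integers(0, 10, size=5000)       # toy clean labels
noisy = add_symmetric_noise(clean, eps=0.4, num_classes=10, rng=rng)
rate = (noisy != clean).mean()               # empirical noise rate, close to 0.4
```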

Test accuracy (mean±std in 5 runs) on CIFAR-100. Test accuracy (mean±std in 5 runs) of SLN on CIFAR-10 w.r.t. σ. A small σ results in overfitting, while a large σ yields underfitting. In Appendix C.2, we visualize the embeddings under overfitting/underfitting. For the results reported in Table 1 and Table 2, following Zhang & Sabuncu (2018) and Chen et al. (2020b), we use 5k noisy samples as the validation set to tune σ ∈ {0.1, 0.2, 0.5, 1}.
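A minimal sketch of SLN in its simplest form, assuming the perturbed target ỹ = y + σz with z ~ N(0, I) resampled at every training step and plugged into the cross-entropy. The function and variable names are ours, not from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sln_cross_entropy(logits, onehot, sigma, rng):
    # y_tilde = y + sigma * z, with fresh z ~ N(0, I) on every call
    # (i.e., every training step), then standard soft-label CE.
    noisy_targets = onehot + sigma * rng.normal(size=onehot.shape)
    return -(noisy_targets * np.log(softmax(logits) + 1e-12)).sum(axis=-1).mean()

logits = rng.normal(size=(4, 10))    # toy batch of 4 samples, 10 classes
onehot = np.eye(10)[[0, 3, 5, 9]]
loss = sln_cross_entropy(logits, onehot, sigma=0.5, rng=rng)
```

With σ = 0 the loss reduces exactly to the standard cross-entropy, so SLN only adds a mean-zero perturbation to the gradient.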

Test accuracy (mean±std in 3 runs) on Clothing1M. The star * marks results copied from Patrini et al. (2017). The result of DivideMix (Li et al., 2020) is reproduced from its official implementation, which uses class-balanced training sampling. We conduct experiments in both settings: the standard sampling and the noisy-class-balanced sampling. For the latter, in each epoch, 18976 instances per class are randomly sampled from the noisy training set. Other training details strictly follow the standard benchmark setting.
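The noisy-class-balanced sampling described above can be sketched as follows. `per_class` generalizes the 18976 instances per class used for Clothing1M, and sampling with replacement is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_epoch_indices(noisy_labels, per_class, rng):
    # Each epoch, draw the same number of instances per (noisy) class.
    idx = []
    for c in np.unique(noisy_labels):
        pool = np.flatnonzero(noisy_labels == c)
        idx.append(rng.choice(pool, size=per_class, replace=True))
    out = np.concatenate(idx)
    rng.shuffle(out)
    return out

labels = rng.integers(0, 14, size=1000)   # toy noisy labels, 14 classes
epoch = balanced_epoch_indices(labels, per_class=50, rng=rng)
# every class present in `labels` contributes exactly 50 samples
```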

It combines co-training of two models, sample selection based on the loss, label correction/guessing based on the semi-supervised learning method MixMatch (Berthelot et al., 2019), and other techniques, including regularization and augmenting each image twice.

ACKNOWLEDGMENTS

This work was supported by a grant from Research Grants Council of the Hong Kong Special Administrative Region (Project No. CUHK 14201620) and the National Natural Science Foundation of China (Project No. 62006219).

APPENDIX

Proof. Dimension: for Eq. (3), the noisy gradient is ∇_θ ℓ(f, y + σz_y) = ∇_θ ℓ(f, y) - σ (z_y/S)^T ∇_θ S, so the noise on ∇_θ ℓ(f, y) is z̃ = -σ (z_y/S)^T ∇_θ S, where the division z_y/S is elementwise. Note that z_y ∼ N(0, I_{c×c}). Let z̃_i be the i-th entry of z̃; we have [...]. Hence, [...].

B MORE DETAILS ON EXPERIMENT SETUP

B.1 CIFAR-10 AND CIFAR-100

The backbone and general training hyperparameters. In all experiments on CIFAR-10 and CIFAR-100, we train a wide ResNet-28-2 (Zagoruyko & Komodakis, 2016) for 300 epochs using the SGD optimizer with learning rate 0.001, momentum 0.9, weight decay 5 × 10^-4, and a batch size of 128. Standard data augmentation is applied, including per-pixel normalization, horizontal random flip, and 32 × 32 random crop after padding with 4 pixels on each side. The criteria for setting the training hyperparameters are that 1) all methods converge (i.e., the training accuracy converges), and 2) all methods share the same general training hyperparameters for a fair comparison.

Method-specific hyperparameters. The backbone is not unified in previous papers, so we re-implement all methods with the same backbone for a fair comparison. Consequently, we may not directly follow the default hyperparameters. Following Zhang & Sabuncu (2018), we use 5k noisy samples (10% of the training data) as the validation set to tune method-specific hyperparameters. We then train the model on the full training set and report the test accuracy at the last epoch.

• SLN/SLN-MO/SLN-MO-LC (ours). We tune σ ∈ {0.1, 0.2, 0.5, 1}. On CIFAR-10, we use σ = 1 for symmetric noise and σ = 0.5 otherwise; on CIFAR-100, we use σ = 0.1 for instance-dependent noise and σ = 0.2 otherwise. The momentum model is introduced with hyperparameter 0.999 without tuning. Label correction (LC) is applied after training with SLN converges. All models are trained for 300 epochs, and we introduce LC at the 250th epoch without tuning because the training accuracy does not increase much after the 250th epoch. In this way, we do not increase the computation cost.

• GCE (Zhang & Sabuncu, 2018). The GCE loss is applied as training starts, and there is a warm-up epoch after which the truncated GCE loss is applied every 10 epochs. We tune the warm-up epoch in {0, 50, 100, 150, 200} and use 50 for CIFAR-10 and 150 for CIFAR-100. There is a hyperparameter q for the GCE loss. We set q = 0.7 since this value is used in all experiments on CIFAR-10 and CIFAR-100 in the original paper (Zhang & Sabuncu, 2018).

• Co-Teaching (Han et al., 2018b). The rate of selecting small-loss samples is linearly decreased from 1 to 1 - ε over the first 10 epochs, where ε is the noise rate. This setting is used in all experiments in the original paper (Han et al., 2018b) and works well in our setting.

• PHuber-CE (Menon et al., 2020). There is a hyperparameter τ that controls the gradient clipping. The original paper (Menon et al., 2020) uses τ = 2 on CIFAR-10 and τ = 10 on CIFAR-100, but this default does not work well in our experiments. Hence, we tune τ ∈ {2, 5, 10, 30, 50}; on CIFAR-10, we use τ = 10 for asymmetric noise and τ = 2 otherwise, and on CIFAR-100, we use τ = 30.

Specifically, we use models trained with SLN as the initialization of these methods. As shown in Table 4, all methods obtain consistent improvements when integrated with SLN. Moreover, with SLN as initialization, we do not need to tune the warm-up phase for methods like DivideMix, because we can simply train with SLN until convergence. In contrast, without SLN, we need to carefully warm up the model so that it learns enough correct patterns without memorizing too much noise. Note that better results for DivideMix are reported in the original paper with a different backbone and a carefully scheduled learning rate. We focus on fairly comparing the robustness of all methods: in all experiments, we train the same wide ResNet-28-2 backbone for 300 epochs without learning-rate changes. Detailed training settings are presented in Appendix B.
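For reference, the GCE loss itself is L_q(p_y) = (1 - p_y^q)/q, which recovers the cross-entropy as q → 0 and the MAE-like loss 1 - p_y at q = 1. A minimal sketch with the q = 0.7 setting above; `probs` denotes the softmax output and the names are ours:

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    # L_q(p_y) = (1 - p_y^q) / q, averaged over the batch.
    p_y = probs[np.arange(len(labels)), labels]
    return ((1.0 - p_y ** q) / q).mean()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss = gce_loss(probs, np.array([0, 1]), q=0.7)
```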

C.2 T-SNE VISUALIZATION OF FEATURES ON THE MODEL'S PENULTIMATE LAYER

In Fig. 7, we show the t-SNE visualization of features from the model's penultimate layer, taking all training samples as input. We visualize the embedding on CIFAR-10 with symmetric noise, since this noise causes the most severe damage to generalization and SLN provides a significant improvement. Fig. 7 shows that the model trained with SLN yields a better embedding. It also demonstrates overfitting and underfitting when σ is too small or too large.

D LABEL CORRECTION

The convergence issue in label correction. When using the model's prediction to correct noisy labels, we find a convergence issue in which the test accuracy decreases, as shown in Fig. 10. This issue is also reported in Arazo et al. (2019), but it has not been widely discussed.

How do we assign weights in label correction? We use ŷ = ωy + (1 - ω)S, where ω is an instance-dependent weight positively correlated with the loss. We observe that our scheme mitigates the convergence issue, as shown in Fig. 10. Our intuition, that samples needing correction have large losses, is consistent with the method of Arazo et al. (2019), but this does not mean that the weight on S should be positively correlated with the loss. Consider the following cases.

• For small-loss samples, we have S ≈ y, as illustrated by the red region in Fig. 9. Label correction does not affect these samples much regardless of the weight. Hence, we only need to consider the effects of label correction on large-loss samples.

• Large-loss samples of case 2) can benefit from label correction, as illustrated by the orange region in Fig. 9. However, even in this case, a higher loss does not mean that a higher weight on S is required for label correction (see footnote 4).

• There exist large-loss samples of case 1) and case 3) for which label correction can be harmful, as discussed above and illustrated by the blue and green regions in Fig. 9.

Therefore, we assign a small weight of 1 - ω on S for large-loss samples. In this way, samples of case 2) still benefit from label correction, while we mitigate the undesired effects on the other large-loss samples for which label correction can be harmful. The effectiveness of our scheme is verified in Fig. 10.
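A minimal sketch of this weighting scheme, assuming the corrected target ŷ = ωy + (1 - ω)S and, purely for illustration, a min-max-normalized per-sample loss as ω; the exact mapping from loss to ω is an assumption here, not the paper's definition.

```python
import numpy as np

def correct_labels(onehot, S, losses):
    # omega in [0, 1], positively correlated with the loss (illustrative
    # min-max normalization); large loss -> small weight 1 - omega on S.
    lo, hi = losses.min(), losses.max()
    omega = ((losses - lo) / (hi - lo + 1e-12))[:, None]
    return omega * onehot + (1.0 - omega) * S

onehot = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])
S = np.array([[0.9, 0.05, 0.05],   # small-loss: prediction agrees with y
              [0.1, 0.8, 0.1]])    # large-loss: prediction disagrees
losses = -np.log(S[np.arange(2), 0])
corrected = correct_labels(onehot, S, losses)
# the large-loss sample keeps (most of) its given label y,
# while the small-loss sample follows the prediction S
```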

E THE CONVERGENCE

In Fig. 2, the visualizations of loss landscapes show that the model trained with SLN converges to a solution with small SGD noise. The center point of the visualized landscape (i.e., the loss of the given model) is a local minimum. From Fig. 2 (d), we observe that the minimum has the following properties.

• The gradient around the minimum is small since it is flat.

• The predictions do not approach one-hot labels, because the loss at the local minimum is high. As shown in Fig. 3, the prediction probabilities are much lower than 1.

With these two properties, Proposition 3 implies that around the flat minimum illustrated in Fig. 2 (d), the noise on the gradients is small. Therefore, the model can converge to the flat local minimum.

4 For example, consider two samples with the wrong label y1 = y2 = [1, 0, 0] and the latent true label [0, 1, 0]. Imagine the predictions are S1 = [0.4, 0.6, 0] and S2 = [0.3, 0.4, 0.3]. Then the cross-entropy losses satisfy ℓ(y1, S1) < ℓ(y2, S2), while for the weight on S, we want w1 > w2 because S1 is more correct than S2 (its second entry 0.6 > 0.4). This example implies that for samples that can benefit from label correction, a higher loss does not mean that a higher weight on S is required.
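The arithmetic in footnote 4 can be checked directly:

```python
import numpy as np

def ce(y, s, eps=1e-12):
    # Cross-entropy between a (one-hot) label y and a prediction s.
    return -(np.asarray(y) * np.log(np.asarray(s) + eps)).sum()

y = [1, 0, 0]                 # the (wrong) given label of both samples
l1 = ce(y, [0.4, 0.6, 0.0])   # -log 0.4, about 0.916
l2 = ce(y, [0.3, 0.4, 0.3])   # -log 0.3, about 1.204
# l1 < l2, even though S1 puts more mass (0.6 > 0.4) on the true class
```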

