NOISE AGAINST NOISE: STOCHASTIC LABEL NOISE HELPS COMBAT INHERENT LABEL NOISE

Abstract

The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect, previously studied in optimization by analyzing the dynamics of parameter updates. In this paper, we are interested in learning with noisy labels, where we are given a collection of samples with potential mislabeling. We show that a previously rarely discussed variant of SGD noise, induced by stochastic label noise (SLN), mitigates the effects of inherent label noise, whereas the common SGD noise applied directly to model parameters does not. We formalize the differences and connections among SGD noise variants, showing that SLN induces SGD noise that depends on the sharpness of the output landscape and the confidence of the output probability, which may help escape sharp minima and prevent overconfidence. SLN not only improves generalization in its simplest form but also boosts popular robust training methods, including sample selection and label correction. Specifically, we present an enhanced algorithm that applies SLN to label correction. Our code is publicly released.

1. INTRODUCTION

The existence of label noise is a common issue in classification, since real-world samples unavoidably contain some noisy labels resulting from annotation platforms such as crowdsourcing systems (Yan et al., 2014). In the canonical setting of learning with noisy labels, we collect samples with potential mislabeling, but we do not know which samples are mislabeled since the true labels are unobservable. It is troubling that overparameterized Deep Neural Networks (DNNs) can memorize noise during training, leading to poor generalization performance (Zhang et al., 2017; Chen et al., 2020b). Robust training methods that can mitigate the effects of label noise are therefore urgently needed.

The noise in stochastic gradient descent (SGD) (Wu et al., 2020) provides a crucial implicit regularization effect for training overparameterized models. SGD noise has previously been studied in optimization by analyzing the dynamics of parameter updates, whereas, to the best of our knowledge, its utility in learning with noisy labels has not been explored. In this paper, we find that the common SGD noise applied directly to model parameters does not endow much robustness, whereas a variant induced by controllable label noise does. Interestingly, inherent label noise is harmful to generalization, yet we can mitigate its effects using additional controllable label noise. To prevent confusion, we use stochastic label noise (SLN) to denote the label noise we introduce: inherent label noise is biased, unknown, and fixed once the data is given, whereas SLN is mean-zero and independently sampled for each instance in each training step.

Our main contributions are as follows.

• We formalize the differences and connections of three SGD noise variants (Propositions 1-3) and show that SLN induces SGD noise that depends on the sharpness of the output landscape and the confidence of the output probability.
• Based on the noise covariance, we analyze and illustrate two effects of SLN (Claims 1 and 2): escaping from sharp minima (footnote 1) and preventing overconfidence (footnote 2).

• We empirically show that SLN not only improves generalization in its simplest form but also boosts popular robust training methods, including sample selection and label correction. We present an enhanced algorithm that applies SLN to label correction.

In Fig. 1, we present a quick comparison between models trained with and without SLN on CIFAR-10 with symmetric/asymmetric/instance-dependent/open-set label noise. Throughout this paper, we use CE to denote a model trained with the standard cross-entropy (CE) loss without any robust learning techniques; the standard CE loss is also used by default for methods like SLN. In Section 4, we provide more experimental details and additional results that comprehensively verify the robustness of SLN on different synthetic and real-world noise. Here, the test curves in Fig. 1 show that SLN avoids the drop in test accuracy, with converged test accuracy even higher than the peak accuracy of the model trained with CE. The right two subplots in Fig. 1 show the average loss on clean and noisy samples. When trained with CE, the model eventually memorizes noise, as indicated by the drop in average loss on noisy samples. In contrast, SLN largely avoids fitting noisy labels.
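To make the definition of SLN above concrete, the following is a minimal NumPy sketch of a single training step with SLN on a softmax-regression model. The linear model, the learning rate, and the noise scale sigma are illustrative assumptions, not values from the paper; the defining property shown is that the mean-zero Gaussian perturbation of the one-hot targets is resampled independently for every instance at every step.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sln_targets(labels, num_classes, sigma, rng):
    """One-hot targets perturbed by mean-zero Gaussian noise, resampled
    independently per instance per step (SLN), unlike inherent label
    noise, which is fixed once the dataset is given."""
    onehot = np.eye(num_classes)[labels]
    return onehot + sigma * rng.standard_normal(onehot.shape)

def sln_sgd_step(w, x, labels, lr, sigma, rng):
    """One SGD step of softmax regression trained with SLN targets.
    For soft targets y, the cross-entropy gradient w.r.t. logits is p - y."""
    y = sln_targets(labels, w.shape[1], sigma, rng)
    p = softmax(x @ w)
    grad_logits = (p - y) / len(x)
    return w - lr * (x.T @ grad_logits)
```

With sigma = 0 this reduces exactly to a standard cross-entropy SGD step, so SLN can be layered on an existing training loop by perturbing targets alone.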

2.1. SGD NOISE AND THE REGULARIZATION EFFECT

The noise in SGD (Wu et al., 2020; Wen et al., 2019; Keskar et al., 2016) has long been studied in optimization. It is believed to provide a crucial implicit regularization effect (HaoChen et al., 2020; Arora et al., 2019; Soudry et al., 2018) for training overparameterized models. The most common SGD noise is spherical Gaussian noise on model parameters (Ge et al., 2015; Neelakantan et al., 2015; Mou et al., 2018), while empirical studies (Wen et al., 2019; Shallue et al., 2019) demonstrate that parameter-dependent SGD noise is more effective. It has been shown that noise whose covariance contains curvature information performs better for escaping from sharp minima (Zhu et al., 2019; Daneshmand et al., 2018). On a quadratically-parameterized model (Vaskevicius et al., 2019; Woodworth et al., 2020), HaoChen et al. (2020) prove that in an over-parameterized regression setting, SGD with label perturbations recovers the sparse ground truth, whereas SGD with Gaussian noise added directly to gradient descent overfits to dense solutions.
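The contrast drawn above between isotropic parameter noise and data-dependent label-perturbation noise can be sketched on a least-squares model. This is an illustrative toy (model, learning rate, and noise scale are assumptions, not the constructions used in the cited proofs): the point is that noise injected through the labels reaches the parameters as lr * x.T @ eps / n, so its covariance is proportional to x.T @ x and carries data structure, whereas parameter noise is the same in every direction.

```python
import numpy as np

def step_param_noise(w, x, y, lr, sigma, rng):
    """Variant (a): spherical Gaussian noise added directly to the
    parameters after a least-squares GD step -- isotropic, independent
    of the data and of the loss landscape."""
    grad = x.T @ (x @ w - y) / len(y)
    return w - lr * grad + sigma * rng.standard_normal(w.shape)

def step_label_noise(w, x, y, lr, sigma, rng):
    """Variant (b): mean-zero noise injected through the labels. The
    induced parameter perturbation is lr * x.T @ eps / n, whose
    covariance is proportional to x.T @ x, i.e. it is shaped by the
    data rather than spherical."""
    eps = sigma * rng.standard_normal(y.shape)
    grad = x.T @ (x @ w - (y + eps)) / len(y)
    return w - lr * grad
```

At sigma = 0 both variants coincide with plain gradient descent, so the two noise structures can be compared along otherwise identical trajectories.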



Footnote 1: around sharp minima, the output changes rapidly (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). Footnote 2: by overconfidence we mean that the prediction probability on some class approaches 1.



Figure 1: Test accuracy and training loss, averaged over 5 runs.

In the deep learning scenario, HaoChen et al. (2020) present preliminary empirical results showing that SGD noise induced by Gaussian noise on the gradient of the loss w.r.t. the model's output avoids the performance degeneration of large-batch training. Xie et al. (2016) discuss the implicit ensemble effect of random label perturbations and demonstrate better generalization performance. In this paper, we provide new insights by analyzing SGD noise variants and their effects, and by showing their utility in learning with noisy labels.
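The output-gradient noise variant discussed above can be sketched as follows for a linear model (the model and the noise scale sigma are illustrative assumptions, not the setup of the cited work). The sketch also makes one connection visible: for cross-entropy, the gradient w.r.t. the logits is (p - y)/n, so adding Gaussian noise to that gradient has the same form as perturbing the one-hot targets y, i.e. it matches the label-perturbation style of noise.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def output_grad_noise_step(w, x, y_onehot, lr, sigma, rng):
    """Gaussian noise added to dL/dlogits, then backpropagated through
    a linear model with logits = x @ w. For cross-entropy,
    dL/dlogits = (p - y)/n, so this perturbation is equivalent to a
    Gaussian perturbation of the targets y."""
    p = softmax(x @ w)
    g_out = (p - y_onehot) / len(x)
    g_out = g_out + sigma * rng.standard_normal(g_out.shape)
    return w - lr * (x.T @ g_out)
```

With sigma = 0 this is an ordinary softmax-regression gradient step; the noise enters only through the output gradient, never directly through the parameters.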

