ROBUST EARLY-LEARNING: HINDERING THE MEMORIZATION OF NOISY LABELS

Abstract

The memorization effect of deep networks shows that they first memorize training data with clean labels and then those with noisy labels. The early stopping method can therefore be exploited for learning with noisy labels. However, the side effect brought by noisy labels influences the memorization of clean labels even before early stopping. In this paper, motivated by the lottery ticket hypothesis, which shows that only some of the parameters are important for generalization, we find that only some of the parameters are important for fitting clean labels and generalize well, which we term critical parameters, while the other parameters tend to fit noisy labels and cannot generalize well, which we term non-critical parameters. Based on this, we propose robust early-learning to reduce the side effect of noisy labels before early stopping and thus enhance the memorization of clean labels. Specifically, in each iteration, we divide all parameters into critical and non-critical ones and then apply different update rules to the two types of parameters. Extensive experiments on benchmark-simulated and real-world label-noise datasets demonstrate the superiority of the proposed method over state-of-the-art label-noise learning methods.

1. INTRODUCTION

Deep neural networks have achieved remarkable success in various tasks, such as image classification (He et al., 2015), object detection (Ren et al., 2015), speech recognition (Graves et al., 2013), and machine translation (Wu et al., 2016). However, this success is largely attributed to large amounts of data with high-quality annotations, which are expensive or even infeasible to obtain in practice (Han et al., 2018a; Li et al., 2020a; Wu et al., 2020). On the other hand, many large-scale datasets are collected from image search engines or web crawlers, which inevitably involves noisy labels (Xiao et al., 2015; Li et al., 2017a; Zhu et al., 2021). As deep networks have large learning capacities and strong memorization power, they will ultimately overfit noisy labels, leading to poor generalization performance (Jiang et al., 2018; Nguyen et al., 2020). General regularization techniques such as dropout and weight decay cannot address this issue well (Zhang et al., 2017). Fortunately, even though deep networks will eventually fit all the labels, they first fit data with clean labels, which helps generalization (Arpit et al., 2017; Han et al., 2018b; Yu et al., 2019; Liu et al., 2020). Thus, the early stopping method can be used to reduce overfitting to noisy labels (Rolnick et al., 2017; Li et al., 2020b; Hu et al., 2020). However, the existence of noisy labels will still adversely affect the memorization of clean labels even in the early training stage, which hurts generalization (Han et al., 2020). Intuitively, if we can reduce the side effect of noisy labels before early stopping, the generalization and robustness of the networks can be improved. Note that over-parameterization of deep networks is one of the main reasons for overfitting to noisy labels (Zhang et al., 2017; Yao et al., 2020a). The lottery ticket hypothesis (Frankle & Carbin, 2018) shows that only some of the parameters are important for generalization.
Deep networks with these important parameters can generalize well, or even better by avoiding overfitting. Motivated by this, for learning with noisy labels, it remains an open question whether we can divide the parameters into two parts to reduce the side effect brought by noisy labels, thereby enhancing the memorization of clean labels and further improving the generalization performance of deep networks. In this paper, we present a novel and effective method to find which parameters are important for fitting data with clean labels, and which parameters tend to fit data with noisy labels. We term the former critical parameters and the latter non-critical parameters. On this basis, we propose robust early-learning to reduce the side effect of noisy labels before early stopping. Specifically, in each training iteration, we first categorize all parameters into two parts, i.e., the critical parameters and the non-critical parameters. We then design different update rules for the different types of parameters. For the critical ones, we perform a robust positive update: these parameters are updated using the gradients derived from the objective function together with weight decay. For the non-critical ones, we perform a negative update: their values are penalized with the weight decay alone, without the gradients derived from the objective function. Note that the gradients for updating are based on the loss between the predictions of deep networks and the given labels. The critical parameters tend to fit data with clean (correct) labels, which helps generalization, so their gradients can be exploited to update them. The non-critical parameters, however, tend to fit data with noisy (incorrect) labels, which hurts generalization; their gradients would misguide the deep networks into overfitting data with noisy labels. Thus, we only use a regularization term, i.e., the weight decay, to update them.
The weight decay penalizes their values toward zero, which means that they are pushed to be deactivated and not to contribute to the generalization of deep networks. In this way, we can reduce the side effect of noisy labels and enhance the memorization of clean labels. In summary, the main contributions of this work are as follows: • We propose a novel and effective method which categorizes the parameters into two parts according to whether they are important for fitting data with clean labels. • Different update rules are designed for the different types of parameters to reduce the side effect of noisy labels before early stopping. • We experimentally validate the proposed method on both synthetic and real-world noisy datasets, on which it achieves superior robustness compared with state-of-the-art methods for learning with noisy labels. Related Work. Early stopping is quite simple but effective in practice. It has long been used in supervised learning (Prechelt, 1998; Caruana et al., 2001; Zhang et al., 2005; Yao et al., 2007). With the help of a validation set, training is stopped before convergence to avoid overfitting. When learning with noisy labels, networks fit the data with clean labels before starting to overfit the data with noisy labels (Arpit et al., 2017). Early stopping has been formally proved to be effective for relieving overfitting to noisy labels (Rolnick et al., 2017; Li et al., 2020b). It has also been widely used in existing methods to improve robustness and generalization (Yu et al., 2018b; Xu et al., 2019; Yao et al., 2020b; Cheng et al., 2021). The lottery ticket hypothesis (Frankle & Carbin, 2018) shows that deep networks are likely to be over-parameterized, and only some of the parameters are important for generalization. With these parameters, small sparsified networks can be trained to generalize well.
While this work is motivated by the lottery ticket hypothesis, it is fundamentally different from it. The lottery ticket hypothesis focuses on network compression: it aims to find a sparsified sub-network whose generalization performance is competitive with the original network. This paper focuses on learning with noisy labels: we want to find the critical/non-critical parameters to reduce the side effect of noisy labels, which greatly improves generalization performance. Many methods have been proposed for training with noisy labels, such as exploiting a noise transition matrix (Liu & Tao, 2016; Hendrycks et al., 2018; Xia et al., 2020a; Li et al., 2021), using graph models (Xiao et al., 2015; Li et al., 2017b), using surrogate loss functions (Zhang & Sabuncu, 2018; Wang et al., 2019; Ma et al., 2020), meta-learning (Ren et al., 2018; Shu et al., 2020), and employing the small-loss trick (Jiang et al., 2018; Han et al., 2018b; Yu et al., 2019). Some of these methods employ early stopping explicitly or implicitly (Patrini et al., 2017; Xia et al., 2019). We also use early stopping in this paper, but we are the first to hinder the memorization of noisy labels by analyzing the criticality of parameters. Organization. The rest of the paper is organized as follows. In Section 2, we set up the problem and introduce the neural network optimization method. In Section 3, we discuss how to find the critical parameters and perform different update rules. In Section 4, we provide empirical evaluations of the proposed learning algorithm. Finally, Section 5 concludes the paper.

2. PRELIMINARIES

Notation. Vectors and matrices are denoted by bold-faced letters. The standard inner product between two vectors is denoted by ⟨·, ·⟩. We use ‖·‖_p to denote the ℓ_p norm of vectors or matrices. For a function f, we use ∇f to denote its gradient. Let [n] = {1, 2, . . . , n}. Problem Setup. Consider a classification task with c classes. Let X and Y be the feature and label spaces respectively, where X ⊆ R^d with d being the dimensionality, and Y = [c]. The joint probability distribution over X × Y is denoted by D. Let S = {(x_i, y_i)}_{i=1}^n be an i.i.d. sample drawn from D, where n denotes the sample size. In traditional supervised learning, the aim is to employ S to learn a classifier that can assign labels precisely to given instances. When learning with noisy labels, we are instead given a sample with noisy labels S̄ = {(x_i, ȳ_i)}_{i=1}^n, which is drawn from a corrupted joint probability distribution D̄ rather than D. Here, ȳ is the possibly corrupted version of the underlying clean label y. The aim becomes to learn a robust classifier that can assign clean labels to test data by exploiting only a training sample with noisy labels.

2.1. NEURAL NETWORK OPTIMIZATION METHOD

The optimization method is essential for training neural networks. Stochastic gradient descent (SGD) is nowadays the most popular optimization method (Allen-Zhu et al., 2019; Cao & Gu, 2019; Zou et al., 2020). Our proposed method is directly related to SGD. As background, we analyze the optimization problem of typical supervised learning with clean labels. Consider a classifier to be trained, and let W ∈ R^m be all its parameters, where m is the total number of parameters. Let L : R^c × Y → R_+ be the surrogate loss function, e.g., the cross entropy loss. With a regularization term, e.g., an ℓ_1 regularizer, the optimization method involves minimizing an objective function as

min L(W; S) = min (1/n) Σ_{i=1}^n L(W; (x_i, y_i)) + λ‖W‖_1,    (1)

where λ ∈ R_+ is a regularization parameter. The update rule of the parameters W can be represented by

W(k + 1) ← W(k) − η (∂L(W(k); S')/∂W(k) + λ sgn(W(k))),    (2)

where η > 0 is the learning rate, W(k) is the set of the parameters at the k-th iteration, sgn(·) is the standard sign function, and S' is a subset randomly sampled from S. With SGD, the regularization parameter λ is equivalent to the weight decay coefficient in the training process (Loshchilov & Hutter, 2019).
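As an illustration of the update rule in Eq. (2), the following minimal NumPy sketch applies one SGD step with an ℓ_1 regularizer to a toy quadratic loss. The function name `sgd_l1_step` and the toy loss are our assumptions for exposition, not the paper's PyTorch implementation:

```python
import numpy as np

def sgd_l1_step(W, grad, lr=1e-2, lam=1e-3):
    """One SGD step with an l1 regularizer, following Eq. (2):
    W(k+1) = W(k) - lr * (dL/dW + lam * sgn(W(k)))."""
    return W - lr * (grad + lam * np.sign(W))

# Toy example: gradient of the quadratic loss 0.5 * ||W - target||^2.
W = np.array([0.5, -0.3, 0.0])
target = np.array([1.0, 1.0, 1.0])
grad = W - target
W_new = sgd_l1_step(W, grad, lr=0.1, lam=0.01)
```

Note that with SGD the `lam` term acts exactly as the weight decay coefficient, as stated above.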

3. METHODOLOGY

In this section, we first introduce an alternative interpretation for the optimality criterion (Section 3.1). Then, we present how to determine the critical/non-critical parameters by exploiting this interpretation during training and the memorization effect of deep neural networks (Section 3.2). Finally, different update rules are proposed for different types of parameters to cope with noisy labels (Section 3.3).

3.1. OPTIMALITY CRITERION

For the optimization of the objective function L(W; S̄), optimality is achieved at W when ∇L(W; S̄) = 0 (Boyd et al., 2004; Bubeck, 2014). However, modern neural networks are complex and over-parameterized, which makes ∇L(W; S̄) extremely high-dimensional. Such high-dimensional vectors are unintuitive to analyze, so optimality is hard to judge effectively. To address this issue, we use a more intuitive interpretation of the optimality criterion in this paper, which associates the optimization with a scalar. Specifically, let G(t) = L(tW; S̄); then G'(t) = ⟨∇L(tW; S̄), W⟩. Setting t = 1 gives G'(1) = ⟨∇L(W; S̄), W⟩. Since optimality is reached when ∇L(W; S̄) = 0, we then have G'(1) = 0. In this way, optimality can be checked by exploiting the scalar G'(1). Note that the new optimality criterion is sufficient but not necessary. In this paper, we focus on learning with noisy labels, and the lack of necessity of the new optimality criterion does not affect the effectiveness of the proposed method. We discuss this carefully in the next subsection.
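A toy sanity check of this scalar criterion can be written in a few lines. The quadratic loss and the function name `g_prime_1` below are illustrative assumptions of ours, not part of the paper:

```python
import numpy as np

# For the toy loss L(W) = 0.5 * ||W - W_star||^2, the gradient is W - W_star,
# so G'(1) = <grad L(W), W> vanishes at the optimum W = W_star.
def g_prime_1(W, W_star):
    grad = W - W_star            # gradient of the toy quadratic loss
    return float(np.dot(grad, W))

W_star = np.array([1.0, -2.0, 0.5])
at_optimum = g_prime_1(W_star, W_star)               # zero gradient => 0.0
away = g_prime_1(np.array([2.0, 0.0, 0.5]), W_star)  # nonzero in general
```

The converse does not hold: a nonzero gradient orthogonal to W also gives G'(1) = 0, which is exactly the "sufficient but not necessary" point made above.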

3.2. JUDGING THE IMPORTANCE OF NETWORK PARAMETERS

We have shown that the optimization of the objective function can be related to a scalar G'(1), whose value equals the inner product between the values of the parameters and the gradient w.r.t. the parameters. To achieve optimality, we need to push the value of G'(1) to zero. Thus, we can judge the importance of each parameter through its influence on the value of G'(1). Note that the memorization effect of deep networks shows that they first memorize the data with clean labels. The parameters that contribute to optimality at the early stage are therefore important for clean labels. Consider a parameter w_i ∈ W with gradient ∇L(w_i; S̄). The judgment criterion is denoted by g_i, i.e.,

g_i = |∇L(w_i; S̄) × w_i|, i ∈ [m].    (3)

If the value of g_i is large, w_i is viewed as a critical parameter, as it has a great influence on the value of G'(1). On the contrary, if the value of g_i is small, e.g., zero or very close to zero, w_i is regarded as a non-critical parameter: it is not important for fitting clean labels, and updating it will tend to fit noisy labels. The issue with directly using the gradient of L(w_i; S̄) as a criticality criterion can now be identified: exploiting only gradient information ignores the value of the parameter w_i. If that value is zero or close to zero, the parameter is deactivated and is likewise non-critical for optimality (Han et al., 2015; Frankle & Carbin, 2018; Lee et al., 2019). Note that we use early stopping in this paper. Deep networks mainly fit clean labels in early training, so even with noisy labels present, we can use the criterion to analyze the criticality of the parameters. It should be noted that when g_i = 0, there are three possible scenarios: (1) only ∇L(w_i; S̄) = 0; (2) only w_i = 0; (3) both ∇L(w_i; S̄) = 0 and w_i = 0. In all three cases, we can judge the importance of parameter w_i using g_i as analyzed. In other words, when g_i = 0, we allow the value of ∇L(w_i; S̄) to be nonzero, i.e., the new optimality criterion is not necessary, which does not influence the effectiveness of the proposed method.
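The criterion of this subsection can be sketched as follows. The function `split_parameters` and the use of a flat parameter vector rather than a full network are our illustrative assumptions:

```python
import numpy as np

def split_parameters(W, grad, tau):
    """Rank parameters by g_i = |grad_i * w_i| and keep the top
    (1 - tau) fraction as critical; the rest are non-critical.
    Returns boolean masks over the flat parameter vector."""
    g = np.abs(grad * W)
    m = W.size
    m_c = int((1 - tau) * m)
    # Indices of the m_c largest criterion values.
    critical_idx = np.argsort(-g)[:m_c]
    critical = np.zeros(m, dtype=bool)
    critical[critical_idx] = True
    return critical, ~critical

W = np.array([1.0, 0.0, -2.0, 0.5])
grad = np.array([0.5, 3.0, 0.1, -1.0])
crit, noncrit = split_parameters(W, grad, tau=0.5)
# The second parameter has a large gradient but zero value, so it is
# non-critical, which is exactly the point made above.
```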

3.3. COMBATING NOISY LABELS WITH DIFFERENT UPDATE RULES

We have presented how to judge the importance of the parameters and divide them into critical and non-critical ones. We exploit the label noise rate to help with this division. Intuitively, if the noise rate is high, the number of clean labels is small, so the number of critical parameters required for memorizing clean labels is also small: the number of critical parameters has a negative correlation with the noise rate. We therefore use the noise rate to help identify the critical parameters. We use τ to denote the noise rate. If τ is not known in advance, it can be easily inferred (Liu & Tao, 2016; Yu et al., 2018a); we show that the proposed method is insensitive to the estimation result of the noise rate in Section 4.3. The number of critical parameters can then be defined as

m_c = (1 − τ)m.    (4)

In each iteration, for each parameter w_i, i ∈ [m], the critical and non-critical parameters are determined by numerically sorting the g_i values, as explained before. The critical and non-critical parameters are denoted by W_c and W_n respectively. Two different update strategies are performed for the two types of parameters. Robust positive update. For the critical parameters W_c, we use the gradients derived from the objective function and weight decay. We clip the gradients to perform gradient decay in this paper. The update rule is

W_c(k + 1) ← W_c(k) − η ((1 − τ) ∂L(W_c(k); S̄')/∂W_c(k) + λ sgn(W_c(k))),    (5)

where S̄' is a subset randomly sampled from S̄. Note that we directly use S̄' here, rather than S' in Eq. (2), because the proposed method exploits the memorization effects of deep networks. As can be seen in Eq. (5), the gradient decay coefficient is set to 1 − τ, which can prevent over-confident descent steps in the training process. Negative update. For the non-critical parameters W_n, we only use the weight decay to update them.
The update rule is

W_n(k + 1) ← W_n(k) − ηλ sgn(W_n(k)).    (6)

The gradients of the objective function exploit the loss between the predictions of deep networks and the given labels. The robust positive update uses these gradients to update the critical parameters, which helps deep networks memorize clean labels. The non-critical parameters tend to overfit noisy labels, so their gradients are misleading for generalization. Thus, we only use the weight decay to update them. The weight decay penalizes their values toward zero and helps generalization (Arora et al., 2018). As they are deactivated, they do not contribute to the memorization or generalization. The two update rules together achieve our goal, i.e., reducing the side effect of noisy labels and thus enhancing the memorization of clean labels. The overall procedure of combating noisy labels with different update rules (CDR) is summarized in Algorithm 1.
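The two update rules can be sketched together as one CDR iteration. The function `cdr_step` and the flat-vector setting are illustrative assumptions of ours, not the authors' released code:

```python
import numpy as np

def cdr_step(W, grad, critical, lr=1e-2, lam=1e-3, tau=0.2):
    """One CDR iteration on a flat parameter vector: a robust positive
    update for critical parameters, a weight-decay-only negative update
    for the rest. `critical` is a boolean mask."""
    W_new = W.copy()
    c, n = critical, ~critical
    # Robust positive update: gradient scaled by the decay
    # coefficient (1 - tau), plus weight decay.
    W_new[c] = W[c] - lr * ((1 - tau) * grad[c] + lam * np.sign(W[c]))
    # Negative update: weight decay only; gradients, which here come
    # from noisy labels, are discarded.
    W_new[n] = W[n] - lr * lam * np.sign(W[n])
    return W_new

W = np.array([1.0, -1.0])
grad = np.array([2.0, 2.0])
critical = np.array([True, False])
W_new = cdr_step(W, grad, critical, lr=0.1, lam=0.01, tau=0.5)
```

The non-critical parameter shrinks toward zero regardless of its gradient, which is how the sketch realizes the deactivation described above.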

4. EXPERIMENTS

In this section, we first introduce the datasets and implementation details used in the experiments (Section 4.1). We next introduce the comparison methods (Section 4.2). An ablation study is conducted to show that the proposed method is not sensitive to the estimation result of the noise rate (Section 4.3). Finally, we present the experimental results on synthetic and real-world noisy datasets to show the effectiveness of the proposed method (Section 4.4).

4.1. DATASETS AND IMPLEMENTATION DETAILS

To verify the effectiveness of the proposed method, we run experiments on manually corrupted versions of four datasets, i.e., MNIST (LeCun et al., 1998), F-MNIST (Xiao et al., 2017), CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), and on two real-world noisy datasets, i.e., Food-101 (Bossard et al., 2014) and WebVision (Li et al., 2017a). MNIST and F-MNIST both contain 28 × 28 grayscale images of 10 classes, with 60,000 training images and 10,000 test images. CIFAR-10 and CIFAR-100 both contain 32 × 32 × 3 color images, with 50,000 training images and 10,000 test images; CIFAR-10 has 10 classes while CIFAR-100 has 100 classes. Food-101 consists of 101 food categories with 101,000 images. For each class, 250 manually reviewed clean test images are provided, as well as 750 training images with real-world label noise. WebVision contains 2.4 million images crawled from websites using the 1,000 concepts in ImageNet ILSVRC12 (Deng et al., 2009). Following the "Mini" setting in (Jiang et al., 2018; Chen et al., 2019; Ma et al., 2020), we take the first 50 classes of the Google resized image subset and evaluate the trained networks on the same 50 classes of the ILSVRC12 validation set, which is exploited as a test set. For all datasets, following prior works (Patrini et al., 2017; Wang et al., 2021b), we leave out 10% of the training data as a validation set for early stopping. We consider four types of synthetic label noise in this paper, i.e., symmetric noise, asymmetric noise, pairflip noise, and instance-dependent noise (abbreviated as instance noise). These settings are widely used in existing works (Ma et al., 2018; Thulasidasan et al., 2019; Pleiss et al., 2020; Wang et al., 2021a). The noise rates τ are set to 20% and 40%. The details of the noise settings are described as follows: • Symmetric noise: this kind of label noise is generated by flipping labels in each class uniformly to incorrect labels of other classes.
• Asymmetric noise: this kind of label noise is generated by flipping labels within a set of similar classes. In this paper, for MNIST, we flip 2 → 7, 3 → 8, 5 ↔ 6. For F-MNIST, we flip T-SHIRT → SHIRT, PULLOVER → COAT, SANDALS → SNEAKER. For CIFAR-10, TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, CAT ↔ DOG. For CIFAR-100, the 100 classes are grouped into 20 super-classes, each with 5 sub-classes; each class is then flipped into the next within the same super-class. • Pairflip noise: this noise flips each class to its adjacent class. More explanation of this noise setting can be found in (Yu et al., 2019; Zheng et al., 2020; Lyu & Tsang, 2020). • Instance noise: this noise is quite realistic, as the probability that an instance is mislabeled depends on its features. Following (Xia et al., 2020b), we generate this type of label noise to validate the effectiveness of the proposed method. For MNIST, we train a LeNet (LeCun et al., 1998) with batch size 32. For F-MNIST, we train a ResNet-50 (He et al., 2015) with batch size 32. For CIFAR-10 and CIFAR-100, we train a ResNet-50 with batch size 64, and typical data augmentations including random crop and horizontal flip are applied. For all training, we use the SGD optimizer with momentum 0.9, and the weight decay is set to 10^-3. The initial learning rate is set to 10^-2. For Food-101, we use a ResNet-50 pre-trained on ImageNet with batch size 32; the initial learning rate is changed to 10^-3. For WebVision, we use an Inception-ResNet v2 (Szegedy et al., 2016) with batch size 128; the initial learning rate is set to 10^-1. We set 100 epochs in total for all experiments. For fair comparison, all code is implemented in PyTorch 1.2.0 with CUDA 10.0 and run on NVIDIA Tesla V100 GPUs. Our implementation is available at https://github.com/xiaoboxia/CDR.
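As an example of the first noise setting above, symmetric label noise can be simulated along the following lines. This is a hedged sketch; the exact generation procedure used in the paper may differ in details such as how flip indices are drawn:

```python
import numpy as np

def add_symmetric_noise(labels, tau, num_classes, seed=0):
    """Flip a tau fraction of labels uniformly to one of the *other*
    classes (symmetric noise), so every flipped label is incorrect."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    n = labels.size
    flip_idx = rng.choice(n, size=int(round(tau * n)), replace=False)
    for i in flip_idx:
        others = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(others)
    return labels

y = np.arange(10) % 5
y_noisy = add_symmetric_noise(y, tau=0.4, num_classes=5)
```

Asymmetric and pairflip noise replace the uniform draw over `others` with a fixed class-to-class mapping.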

4.2. COMPARISON METHODS

We compare the proposed method with the following methods: (1) CE, which trains deep neural networks with the cross entropy loss on noisy datasets. (2) GCE (Zhang & Sabuncu, 2018), which unites the mean absolute error loss and the cross entropy loss to handle noisy labels; the hyperparameter q is set to 0.7. (3) DMI (Xu et al., 2019), which copes with noisy labels from the perspective of information theory. (4) APL (Ma et al., 2020), which combines two mutually reinforcing robust loss functions; for this baseline, we employ its combination of NCE and RCE. (5) MentorNet (Jiang et al., 2018), which learns a curriculum to filter out noisy data. (6) Co-teaching (Han et al., 2018b), which maintains two networks and cross-trains on the instances with small loss. (7) Co-teaching+ (Yu et al., 2019), which maintains two networks and finds small-loss instances among the prediction disagreement data for training. (8) S2E (Yao et al., 2020a), which exploits automated machine learning to handle noisy labels. (9) Forward (Patrini et al., 2017), which estimates the noise transition matrix to correct the training loss. (10) T-Revision (Xia et al., 2019), which employs the importance reweighting technique and introduces a slack variable to revise the noise transition matrix. (11) Joint (Tanaka et al., 2018), which jointly optimizes the network parameters and the sample labels; the hyperparameters α and β are set to 1.2 and 0.8 respectively. Note that we do not compare with some state-of-the-art methods like SELF (Nguyen et al., 2020) and DivideMix (Li et al., 2020a) as baselines for the following reasons. (1) Those methods are aggregations of multiple techniques while this paper focuses on only one, so the comparison would not be fair. (2) We focus on proving the concept, i.e., how to reduce the side effect of noisy labels before early stopping, not on boosting classification performance.

4.3. ABLATION STUDY

We need the noise rate to determine the number of each type of parameter and the gradient decay coefficient, as mentioned above. Compared with symmetric, asymmetric, and pairflip noise, the noise rate of instance noise is hard to estimate (Cheng et al., 2020; Xia et al., 2020b). We show that our proposed method is insensitive to the estimation result of the noise rate. The experiments are conducted on the CIFAR-10 and CIFAR-100 datasets with instance noise, with noise rates set to 20% and 40% respectively. In Figure 1, we show how the classification performance of the proposed method varies with the estimated noise rate. We can clearly see that the proposed method is robust to the estimation result of the noise rate. In this paper, we set the proportion of non-critical parameters to the noise rate. It is also interesting to investigate how the proposed method works if we set the proportion of non-critical parameters to a constant. Note that if the constant is set arbitrarily, the performance of the proposed method may be hurt. Therefore, we use a noisy validation set to locate it and compare the located constant with the noise rate. The search for the constant is within the range {0.10, 0.20, . . . , 0.90}. The experiments are conducted on MNIST, F-MNIST, and CIFAR-10, and the results are provided in Table 2. As we can see, in many cases the located constant and the label noise rate are numerically equal. However, locating a suitable constant with a noisy validation set is complicated, as the search range is huge. By contrast, the noise rate can always be estimated effectively (Liu & Tao, 2016; Yu et al., 2018a). Therefore, it is reasonable and feasible to set the proportion of non-critical parameters equal to the noise rate.

4.4. CLASSIFICATION PERFORMANCE ON NOISY DATASETS

Results on synthetic noisy datasets. Table 1 shows the experimental results on four synthetic noisy datasets with various types of noisy labels. For MNIST, as can be seen, our proposed method produces the best results in the vast majority of cases. When the noise is instance-dependent, the proposed method achieves competitive results; note that T-Revision achieves the best classification performance in this case. Compared with the other synthetic noisy datasets, MNIST is less challenging, so T-Revision can exploit the noise transition matrix and the slack variable to model label noise well, which leads to the best performance. However, for instance-dependent label noise on the other datasets, i.e., F-MNIST, CIFAR-10, and CIFAR-100, estimating the transition matrices does not work well, and the proposed robust early-learning method achieves the best performance. For F-MNIST and CIFAR-10, our proposed method is consistently superior to the other state-of-the-art methods across all settings. Note that S2E achieves impressive performance in the Symmetric-20% case on CIFAR-10. However, it fails to generalize well compared with the proposed method, especially in the cases with a 40% label noise rate. In contrast, CDR achieves a clear lead over S2E in these cases, which verifies the effectiveness of the proposed method. Lastly, for the more challenging dataset, i.e., CIFAR-100, the proposed method once again outperforms all the baseline methods. In particular, in the very challenging Instance-40% case, the proposed method takes a nearly 6% lead over the second best method, GCE. Results on real-world noisy datasets. The experimental results on the Food-101 and WebVision datasets are reported in Tables 3 and 4. Again, in classification accuracy, our method surpasses all other baselines, which verifies its effectiveness against real-world label noise. Note that the CE method exploits early stopping in all experiments.
Comparing the classification performance of CE with that of CDR, i.e., early stopping vs. robust early-learning, we can clearly see that the proposed method achieves better performance. We also illustrate the experimental results on CIFAR-100 with 40% noise. As shown in Figure 2, CDR can effectively reduce the side effect of noisy labels at the early training stage. Illustrations of the experimental results under the other settings can be found in Appendix A.

5. CONCLUSION

In this paper, motivated by the lottery ticket hypothesis, we provide a novel method to distinguish the critical and non-critical parameters for fitting clean labels. We then propose different update rules for the different types of parameters to reduce the side effect of noisy labels before early stopping. The proposed method is very effective for learning with noisy labels, which is supported by experiments on synthetic datasets with various types of label noise as well as on real-world datasets. Our method is simple and orthogonal to other methods. We believe that this opens up new possibilities in learning with noisy labels. It would be interesting to explore the characteristics of the updated parameters, such as their distance to the initial parameters (Li et al., 2020b; Hu et al., 2020) and their mutual information with the vector of all training labels given inputs (Harutyunyan et al., 2020).

A DETAILED EXPERIMENTAL RESULTS

In Section 4.4, we provide illustrations of the experimental results. However, because of limited space, we only show the illustrations for noisy CIFAR-100 with the noise rate set to 40%. In this supplementary material, we provide illustrations of the experimental results on the other employed datasets and settings.



† Correspondence to Tongliang Liu (tongliang.liu@sydney.edu.au).



Figure 2: Illustration of the experimental results on noisy CIFAR-100. We can clearly see that the proposed method (CDR) can reduce the side effect of noisy labels at the early training stage, which improves generalization (red line vs. green line).

Figure 3: Illustration of the experimental results on noisy MNIST. The noise rate is set to 20%.

Algorithm 1 CDR algorithm.
Input: initialization parameters W, noisy training set D_t, noisy validation set D_v, learning rate η, weight decay coefficient λ, fixed τ, maximum epoch T_max, maximum iteration N_max.
1: for T = 1, 2, . . . , T_max do
2:   Shuffle training set D_t;
3:   for N = 1, . . . , N_max do
4:     Fetch mini-batch D̄_t from D_t;
5:     Divide W into W_c and W_n with Eq. (3) and Eq. (4); // define the types of the parameters
6:     Update W_c with Eq. (5); // robust positive update
7:     Update W_n with Eq. (6); // negative update
8:   end for
9: end for
// Early stopping criterion: the minimum classification error is achieved with W on D_v.
Output: parameters W after update.
Though we have only noisy training data, deep networks will first memorize training data with clean labels.

Table 1: Mean and standard deviations of classification accuracy (percentage) on synthetic noisy datasets with different noise levels. The experimental results are reported over five trials. The best mean results are bolded.

Figure 1: Illustration of robustness to the estimation result of the noise rate. For each estimated noise rate, we report experimental results over five trials. The blue dots represent the result of each experiment. The orange dots represent the mean of the five experimental results in each case.

Table 2: The located constant on synthetic noisy datasets with different noise levels. A result with an underline means that the located constant and the noise rate are numerically equal.

Table 3: Classification accuracy (percentage) on the Food-101 dataset. The best result is in bold.

Table 4: Top-1 validation accuracy (percentage) on the clean ILSVRC12 validation set of Inception-ResNet v2 models trained on the WebVision dataset, under the "Mini" setting in (Jiang et al., 2018; Chen et al., 2019; Ma et al., 2020). The best result is in bold.


ACKNOWLEDGMENTS TLL was supported by Australian Research Council Project DE-190101473 and DP-180103424. BH was supported by the RGC Early Career Scheme No. 22200720, NSFC Young Scientists Fund No. 62006202, HKBU Tier-1 Start-up Grant, and HKBU CSD Departmental Incentive Scheme. CG was supported by NSF of China (No. 61973162) and CCF-Tencent Open Fund (No: RAGR20200101). NNW was supported by National Key Research and Development Program of China under Grant 2018AAA0103202. ZYG was supported by the Airdoc-Monash research centre fellowship. YC was supported by the NSF of China (No.61976102, No.U19A2065). We thank anonymous reviewers for giving constructive comments.

