LEARNING WITH INSTANCE-DEPENDENT LABEL NOISE: A SAMPLE SIEVE APPROACH

Abstract

Human-annotated labels are often prone to noise, and the presence of such noise degrades the performance of the resulting deep neural network (DNN) models. Much of the literature on learning with noisy labels (with several recent exceptions) focuses on the case where the label noise is independent of features. In practice, annotation errors tend to be instance-dependent and often depend on the difficulty of the recognition task. Applying existing results from instance-independent settings would require a significant amount of noise-rate estimation. Therefore, providing theoretically rigorous solutions for learning with instance-dependent label noise remains a challenge. In this paper, we propose CORES² (COnfidence REgularized Sample Sieve), which progressively sieves out corrupted examples. The implementation of CORES² does not require specifying noise rates, and yet we are able to provide theoretical guarantees for CORES² in filtering out the corrupted examples. This high-quality sample sieve allows us to treat clean examples and corrupted ones separately when training a DNN, and such a separation is shown to be advantageous in the instance-dependent noise setting. We demonstrate the performance of CORES² on the CIFAR-10 and CIFAR-100 datasets with synthetic instance-dependent label noise and on Clothing1M with real-world human noise. Of independent interest, our sample sieve provides a generic machinery for anatomizing noisy datasets and a flexible interface for various robust training techniques to further improve performance. Code is available at https://github.com/UCSC-REAL/cores.

1. INTRODUCTION

Deep neural networks (DNNs) have gained popularity in a wide range of applications. The remarkable success of DNNs often relies on the availability of large-scale datasets. However, data annotation inevitably introduces label noise, and it is extremely expensive and time-consuming to clean up the corrupted labels. The existence of label noise can weaken the true correlation between features and labels and introduce artificial correlation patterns. Thus, mitigating the effects of noisy labels becomes a critical issue that needs careful treatment.

It is challenging to avoid overfitting to noisy labels, especially when the noise depends on both true labels $Y$ and features $X$. Unfortunately, this tends to be the case in practice, since human annotations are prone to different levels of error for tasks with varying difficulty. Recent work has also shown that the presence of instance-dependent noisy labels imposes additional challenges and cautions for training in this scenario (Liu, 2021). For such instance-dependent (or feature-dependent, instance-based) label noise settings, theory-supported works usually focus on loss correction, which requires estimating noise rates (Xia et al., 2020; Berthon et al., 2020). Recent work by Cheng et al. (2020) addresses bounded instance-based noise by first learning the noisy distribution and then distilling examples according to some thresholds. However, with a dataset of limited size, learning an accurate noisy distribution for each example is a non-trivial task. Additionally, the size and the quality of the distilled examples are sensitive to the thresholds used for distillation. Departing from the above line of work, we design a sample sieve with theoretical guarantees to provide a high-quality split of clean and corrupted examples without the need to estimate noise rates.
Instead of learning the noisy distributions or noise rates, we focus on learning the underlying clean distribution and design a regularization term to help improve the confidence of the learned classifier, which is proven to help safely sieve out corrupted examples. With the division between "clean" and "corrupted" examples, our training enjoys performance improvements by treating the clean examples (using a standard loss) and the corrupted ones (using an unsupervised consistency loss) separately.

We summarize our main contributions: 1) We propose to train a classifier using a novel confidence regularization (CR) term and theoretically guarantee that, under mild assumptions, minimizing the confidence regularized cross-entropy (CE) loss on the instance-based noisy distribution is equivalent to minimizing the pure CE loss on the corresponding "unobservable" clean distribution. This classifier is also shown to be helpful for evaluating each example to build our sample sieve. 2) We provide a theoretically sound sample sieve that simply compares the example's regularized loss with a closed-form threshold explicitly determined by predictions from the above model trained with our confidence regularized loss, without any extra estimates. 3) To the best of our knowledge, the proposed CORES² (COnfidence REgularized Sample Sieve) is the first method that is thoroughly studied for a multi-class classification problem, has theoretical guarantees to avoid overfitting to instance-dependent label noise, and provides a high-quality division without knowing or estimating noise rates. 4) By decoupling the regularized loss into separate additive terms, we also provide a novel and promising mechanism for understanding and controlling the effects of general instance-dependent label noise. 5) CORES² achieves competitive performance on multiple datasets, including CIFAR-10, CIFAR-100, and Clothing1M, under different label noise settings.
Other related works: In addition to recent works by Xia et al. (2020), Berthon et al. (2020), and Cheng et al. (2020), we briefly review the other most relevant references. Detailed related work is left to Appendix A. Making the loss function robust to label noise is important for building a robust machine learning model (Zhang et al., 2016). One popular direction is to perform loss correction, which first estimates the transition matrix (Patrini et al., 2017; Vahdat, 2017; Xiao et al., 2015; Zhu et al., 2021b; Yao et al., 2020b), and then performs correction/reweighting via forward or backward propagation, or further revises the estimated transition matrix with controllable variations (Xia et al., 2019). The other line of work focuses on designing specific losses without estimating transition matrices (Natarajan et al., 2013; Xu et al., 2019; Liu & Guo, 2020; Wei & Liu, 2021). However, these works assume the label noise is instance-independent, which limits their extension. Another approach is sample selection (Jiang et al., 2017; Han et al., 2018; Yu et al., 2019; Northcutt et al., 2019; Yao et al., 2020a; Wei et al., 2020; Zhang et al., 2020a), which selects the "small loss" examples as clean ones. However, we find this approach only works well on instance-independent label noise. Approaches such as label correction (Veit et al., 2017; Li et al., 2017; Han et al., 2019) or semi-supervised learning (Li et al., 2020; Nguyen et al., 2019) […]

[…] $[N] := \{1, 2, \cdots, N\}$ is the set of example indices. Examples $(x_n, y_n)$ are drawn according to random variables $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ from a joint distribution $\mathcal{D}$. Let $\mathcal{D}_X$ and $\mathcal{D}_Y$ be the marginal distributions of $X$ and $Y$. The classification task aims to identify a classifier $f : \mathcal{X} \to \mathcal{Y}$ that maps $X$ to $Y$ accurately.
One common approach is minimizing the empirical risk using DNNs with respect to the cross-entropy loss defined as $\ell(f(x), y) = -\ln(f_x[y])$, $y \in [K]$, where $f_x[y]$ denotes the $y$-th component of $f(x)$ and $K$ is the number of classes. In real-world applications, such as human-annotated images (Krizhevsky et al., 2012; Zhang et al., 2017) and medical diagnosis (Agarwal et al., 2016), the learner can only observe a set of noisy labels. For instance, human annotators may wrongly label some images containing cats as ones that contain dogs, accidentally or irresponsibly. The label noise of each instance is characterized by a noise transition matrix $T(X)$, where each element $T_{ij}(X) := \mathbb{P}(\tilde{Y} = j \mid Y = i, X)$. The corresponding noisy dataset and distribution are denoted by $\widetilde{D} := \{(x_n, \tilde{y}_n)\}_{n \in [N]}$ and $\widetilde{\mathcal{D}}$. Let $\mathbb{1}(\cdot)$ be the indicator function taking value 1 when the specified condition is satisfied and 0 otherwise. Similar to the goals in surrogate loss (Natarajan et al., 2013), $L_{DMI}$ (Xu et al., 2019) and peer loss (Liu & Guo, 2020), we aim to learn a classifier $f$ from the noisy distribution $\widetilde{\mathcal{D}}$ which also minimizes $\mathbb{P}(f(X) \neq Y)$, $(X, Y) \sim \mathcal{D}$. Beyond their results, we attempt to propose a theoretically sound approach addressing a general instance-based noise regime without knowing or estimating noise rates.
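To make the setup concrete, the following is a minimal sketch of the CE loss and of drawing a noisy label from an instance-dependent transition matrix $T(x)$. The function names, the two toy matrices, and the choice to pass the uniform draw `u` explicitly are illustrative assumptions, not part of the original work.

```python
import math

def cross_entropy(probs, label):
    """CE loss l(f(x), y) = -ln f_x[y] for a single example."""
    return -math.log(probs[label])

def sample_noisy_label(T_x, clean_label, u):
    """Draw a noisy label from row T_{y,.}(x) of the transition matrix at x.

    T_x is a K x K row-stochastic matrix for this particular instance;
    u is a uniform draw in [0, 1), passed in so the example is deterministic.
    """
    cumulative = 0.0
    for noisy_label, p in enumerate(T_x[clean_label]):
        cumulative += p
        if u < cumulative:
            return noisy_label
    return len(T_x) - 1  # guard against floating-point rounding

# Hypothetical instance-dependent noise: an "easy" and a "hard" instance
# of the same clean class 0 have different flip probabilities.
T_easy = [[0.95, 0.05], [0.05, 0.95]]
T_hard = [[0.60, 0.40], [0.30, 0.70]]

label_easy = sample_noisy_label(T_easy, clean_label=0, u=0.7)  # stays 0
label_hard = sample_noisy_label(T_hard, clean_label=0, u=0.7)  # flips to 1
loss = cross_entropy([0.9, 0.1], 0)
```

The same uniform draw keeps the label of the easy instance but flips the hard one, which is exactly what distinguishes instance-dependent noise from a single class-level transition matrix.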

2.1. CONFIDENCE REGULARIZATION

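In this section, we present a new confidence regularizer (CR). Our design of the CR is mainly motivated by a recently proposed robust loss function called peer loss (Liu & Guo, 2020). For each example $(x_n, \tilde{y}_n)$, peer loss has the following form:

$$\ell_{PL}(f(x_n), \tilde{y}_n) := \ell(f(x_n), \tilde{y}_n) - \ell(f(x_{n_1}), \tilde{y}_{n_2}),$$

where $(x_{n_1}, \tilde{y}_{n_1})$ and $(x_{n_2}, \tilde{y}_{n_2})$ are two randomly sampled and paired peer examples (with replacement) for $n$. Let $X_{n_1}$ and $\tilde{Y}_{n_2}$ be the corresponding random variables. Note $X_{n_1}$ and $\tilde{Y}_{n_2}$ are two independent and uniform random variables taking each value $x_{n'}$, $n' \in [N]$, and $\tilde{y}_{n'}$, $n' \in [N]$, with probability $\frac{1}{N}$ respectively: $\mathbb{P}(X_{n_1} = x_{n'} \mid \widetilde{D}) = \mathbb{P}(\tilde{Y}_{n_2} = \tilde{y}_{n'} \mid \widetilde{D}) = \frac{1}{N}$, $\forall n' \in [N]$. Let $\mathcal{D}_{\tilde{Y}|\widetilde{D}}$ be the distribution of $\tilde{Y}_{n_2}$ given dataset $\widetilde{D}$. Peer loss then has the following equivalent form in expectation:

$$\begin{aligned}
&\frac{1}{N} \sum_{n \in [N]} \mathbb{E}_{X_{n_1}, \tilde{Y}_{n_2} \mid \widetilde{D}} \left[ \ell(f(x_n), \tilde{y}_n) - \ell(f(X_{n_1}), \tilde{Y}_{n_2}) \right] \\
&= \frac{1}{N} \sum_{n \in [N]} \Big[ \ell(f(x_n), \tilde{y}_n) - \sum_{n' \in [N]} \mathbb{P}(X_{n_1} = x_{n'} \mid \widetilde{D})\, \mathbb{E}_{\mathcal{D}_{\tilde{Y}|\widetilde{D}}} [\ell(f(x_{n'}), \tilde{Y})] \Big] \\
&= \frac{1}{N} \sum_{n \in [N]} \Big[ \ell(f(x_n), \tilde{y}_n) - \mathbb{E}_{\mathcal{D}_{\tilde{Y}|\widetilde{D}}} [\ell(f(x_n), \tilde{Y})] \Big].
\end{aligned}$$

This result characterizes a new loss denoted by $\ell_{CA}$:

$$\ell_{CA}(f(x_n), \tilde{y}_n) := \ell(f(x_n), \tilde{y}_n) - \mathbb{E}_{\mathcal{D}_{\tilde{Y}|\widetilde{D}}} [\ell(f(x_n), \tilde{Y})]. \tag{1}$$

Though not studied rigorously by Liu & Guo (2020), we show that, under certain conditions, $\ell_{CA}$ defined in Eqn. (1) encourages confident predictions from $f$ by analyzing the gradients:

Theorem 1. For $\ell_{CA}(\cdot)$, solutions satisfying $f_{x_n}[i] > 0$, $\forall i \in [K]$ are not locally optimal at $(x_n, \tilde{y}_n)$.

See Appendix B.2 for the proof. In particular, in binary cases we have the constraint $f(x_n)[0] + f(x_n)[1] = 1$. Following Theorem 1, we know minimizing $\ell_{CA}(f(x_n), \tilde{y}_n)$ w.r.t. $f$ under this constraint leads to either $f(x_n)[0] \to 1$ or $f(x_n)[1] \to 1$, indicating confident predictions. Therefore, the addition of the term $-\mathbb{E}_{\mathcal{D}_{\tilde{Y}|\widetilde{D}}} [\ell(f(x_n), \tilde{Y})]$ helps improve the confidence of the learned classifier.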
Inspired by the above observation, we define the following confidence regularizer:

$$\text{Confidence Regularizer:} \quad \ell_{CR}(f(x_n)) := -\beta \cdot \mathbb{E}_{\mathcal{D}_{\tilde{Y}|\widetilde{D}}} [\ell(f(x_n), \tilde{Y})],$$

where $\beta$ is positive and $\ell(\cdot)$ refers to the CE loss. The prior probability $\mathbb{P}(\tilde{Y} \mid \widetilde{D})$ is counted directly from the noisy dataset. In the remainder of this paper, $\ell(\cdot)$ indicates the CE loss by default.

Why are confident predictions important? Intuitively, when a model fits the label noise, its predictions often become less confident, since the noise usually corrupts the signal encoded in the clean data. From this perspective, encouraging confident predictions works against fitting the label noise. Compared to instance-independent noise, the difficulties in estimating the instance-dependent noise rates largely prevent us from applying existing techniques. In addition, as shown by Manwani & Sastry (2013), the 0-1 loss function is more robust to instance-based noise but hard to optimize. To a certain degree, pushing for confident predictions results in a differentiable loss function that approximates the 0-1 loss, and therefore restores the robustness property. Besides, as observed by Chatterjee (2020) and Zielinski et al. (2020), gradients from similar examples reinforce each other. When the overall label information is dominantly informative, i.e., $T_{ii}(X) > T_{ij}(X)$, DNNs will statistically receive more correct information. Encouraging confident predictions discourages the memorization of the noisy examples (it makes it hard for noisy labels to reduce the confidence of predictions), and therefore further helps DNNs learn the (clean) dominant information.

CR is NOT the entropy regularization: Entropy regularization (ER) is a popular choice for improving the confidence of trained classifiers in the literature (Tanaka et al., 2018; Yi & Wu, 2019). Given a particular prediction probability $p$ for a class, the ER term is based on the function $-p \ln p$, while our $\ell_{CR}$ is built on $\ln p$. Later we show $\ell_{CR}$ offers favorable theoretical guarantees for training with instance-dependent label noise, while ER does not. In Appendix C.1, we present both theoretical and experimental evidence that $\ell_{CR}$ serves as a better regularizer than ER.
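As a rough illustration of the regularizer defined in this section, the sketch below counts the noisy-label prior from a dataset and evaluates $\ell_{CR}$ for a single prediction. The function names and toy numbers are assumptions made for this example; the authors' implementation at the repository linked in the abstract may be organized differently. Predicted probabilities are assumed to be strictly positive (e.g., softmax outputs).

```python
import math
from collections import Counter

def noisy_prior(noisy_labels, num_classes):
    """Empirical prior P(Y~ = k | D~), counted directly from the noisy labels."""
    counts = Counter(noisy_labels)
    n = len(noisy_labels)
    return [counts.get(k, 0) / n for k in range(num_classes)]

def confidence_regularizer(probs, prior, beta):
    """l_CR(f(x)) = -beta * sum_j P(Y~ = j | D~) * l(f(x), j), with l the CE loss."""
    return -beta * sum(p_j * -math.log(f_j) for p_j, f_j in zip(prior, probs))

def regularized_loss(probs, noisy_label, prior, beta):
    """l(f(x), y~) + l_CR(f(x)), the quantity minimized during training."""
    return -math.log(probs[noisy_label]) + confidence_regularizer(probs, prior, beta)

prior = noisy_prior([0, 0, 1, 2, 1, 0], num_classes=3)  # [1/2, 1/3, 1/6]
cr_uniform = confidence_regularizer([1 / 3, 1 / 3, 1 / 3], prior, beta=2.0)
cr_confident = confidence_regularizer([0.98, 0.01, 0.01], prior, beta=2.0)
```

In this toy case the confident prediction receives a lower (more negative) regularizer value than the uniform one, matching the intuition that $\ell_{CR}$ rewards confident predictions.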

2.2. CONFIDENCE REGULARIZED SAMPLE SIEVE

Intuitively, label noise misleads the training, so sieving corrupted examples out of datasets is beneficial. Furthermore, label noise introduces high variance during training even in the presence of $\ell_{CR}$ (discussed in Section 3.3). Therefore, rather than accomplishing training solely with $\ell_{CR}$, we first leverage its regularization power to design an efficient sample sieve. Similar to a general sieving process in the physical world that compares the size of particles with the aperture of a sieve, we evaluate the "size" (quality, or a regularized loss) of examples and compare them with some to-be-specified thresholds, hence the name sample sieve. In our formulation, the regularized loss $\ell(f(x_n), \tilde{y}_n) + \ell_{CR}(f(x_n))$ is employed to evaluate examples and $\alpha_n$ is used to specify thresholds. Specifically, we aim to solve the sample sieve problem in (2).

Confidence Regularized Sample Sieve

$$\begin{aligned}
\min_{f \in \mathcal{F},\, v \in \{0,1\}^N} \quad & \sum_{n \in [N]} v_n \left[ \ell(f(x_n), \tilde{y}_n) + \ell_{CR}(f(x_n)) - \alpha_n \right] \\
\text{s.t.} \quad & \ell_{CR}(f(x_n)) := -\beta \cdot \mathbb{E}_{\mathcal{D}_{\tilde{Y}|\widetilde{D}}}\, \ell(f(x_n), \tilde{Y}), \\
& \alpha_n := \frac{1}{K} \sum_{\tilde{y} \in [K]} \ell(\bar{f}(x_n), \tilde{y}) + \ell_{CR}(\bar{f}(x_n)).
\end{aligned} \tag{2}$$

• $v_n \in \{0, 1\}$ indicates whether example $n$ is clean ($v_n = 1$) or not ($v_n = 0$);
• $\alpha_n$ (mimicking the aperture of a sieve) controls which example should be sieved out;
• $\bar{f}$ is a copy of $f$ and does not contribute to the back-propagation. $\mathcal{F}$ is the search space of $f$.

Dynamic sample sieve: The problem in (2) is a combinatorial optimization which is hard to solve directly. A standard solution to (2) is to apply alternate search iteratively as follows:

• Starting at $t = 1$, $v_n^{(0)} = 1$, $\forall n \in [N]$.
• Confidence-regularized model update (at iteration-$t$):
$$f^{(t)} = \arg\min_{f \in \mathcal{F}} \sum_{n \in [N]} v_n^{(t-1)} \left[ \ell(f(x_n), \tilde{y}_n) + \ell_{CR}(f(x_n)) \right]; \tag{3}$$
• Sample sieve (at iteration-$t$):
$$v_n^{(t)} = \mathbb{1}\left( \ell(f^{(t)}(x_n), \tilde{y}_n) + \ell_{CR}(f^{(t)}(x_n)) < \alpha_{n,t} \right), \tag{4}$$

where $\alpha_{n,t} = \frac{1}{K} \sum_{\tilde{y} \in [K]} \ell(\bar{f}^{(t)}(x_n), \tilde{y}) + \ell_{CR}(\bar{f}^{(t)}(x_n))$, and $f^{(t)}$ and $v^{(t)}$ refer to the specific classifier and weight at iteration-$t$. Note the values of $\ell_{CR}(\bar{f}^{(t)}(x_n))$ and $\ell_{CR}(f^{(t)}(x_n))$ are the same. We keep both terms to be consistent with the objective in Eq. (2). In DNNs, we usually update model $f$ with one or several epochs of data instead of completely solving (3).

Figure 1 illustrates the dynamic sample sieve, where the size of each example corresponds to the regularized loss and the aperture of a sieve is determined by $\alpha_{n,t}$. In each iteration-$t$, sample sieve-$t$ "blocks" some corrupted examples by comparing a regularized example loss with a closed-form threshold $\alpha_{n,t}$, which can be immediately obtained given the current model $f^{(t)}$ and example $(x_n, \tilde{y}_n)$ (no extra estimation needed). In contrast, most sample selection works (Han et al., 2018; Yu et al., 2019; Wei et al., 2020) focus on controlling the number of selected examples using an intuitive function where the overall noise rate may be required, or directly select examples by an empirically set threshold (Zhang & Sabuncu, 2018). Intuitively, the specially designed thresholds $\alpha_{n,t}$ for each example should be more accurate than a single threshold for the whole dataset. Besides, the goal of existing works is often to select clean examples, while our sample sieve focuses on removing the corrupted ones. At a high level, we follow a different philosophy from these sample selection works. We coin our solution COnfidence REgularized Sample Sieve (CORES²).

More visualizations of the sample sieve: In addition to Figure 1, we visualize the superiority of our sample sieve with numerical results as […]

[Figure caption fragment: The loss is given by $\ell(f^{(t)}(x_n), \tilde{y}_n) + \ell_{CR}(f^{(t)}(x_n)) - \alpha_{n,t}$ as in (4). "CE Sieve" represents the dynamic sample sieve with the standard cross-entropy loss (without CR).]
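The sieve step (4) can be sketched in a few lines. This is a minimal illustration under stated assumptions: predictions are given as lists of strictly positive probabilities, the model update (3) is omitted, and the names `sieve_threshold` and `sieve` as well as the toy numbers are hypothetical.

```python
import math

def sieve_threshold(probs, cr_value):
    """alpha_n = (1/K) * sum_y l(f(x_n), y) + l_CR(f(x_n)), computed from the current model."""
    k = len(probs)
    return sum(-math.log(p) for p in probs) / k + cr_value

def sieve(pred_probs, noisy_labels, prior, beta):
    """Return v with v[n] = 1 if example n passes the sieve (kept as clean), else 0."""
    v = []
    for probs, y_tilde in zip(pred_probs, noisy_labels):
        # Confidence regularizer under the noisy-label prior.
        cr = -beta * sum(p_j * -math.log(f_j) for p_j, f_j in zip(prior, probs))
        loss = -math.log(probs[y_tilde]) + cr
        v.append(1 if loss < sieve_threshold(probs, cr) else 0)
    return v

# Toy predictions of a current model f^(t) on three examples (made-up numbers).
pred_probs = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
noisy_labels = [0, 2, 2]
v = sieve(pred_probs, noisy_labels, prior=[1 / 3, 1 / 3, 1 / 3], beta=2.0)
```

Because the regularizer appears on both sides of the comparison, each example is effectively kept when its CE loss on the observed label is below the average CE loss over all $K$ labels; the second example, whose label disagrees strongly with the prediction, is sieved out.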

3. THEORETICAL GUARANTEES OF CORES²

In this section, we theoretically show the advantages of CORES². The analyses focus on showing that CORES² guarantees a quality division, i.e., $v_n = \mathbb{1}(y_n = \tilde{y}_n)$, $\forall n$, with a properly set $\beta$. To show the effectiveness of this solution, we say a model prediction on $x_n$ is better than random guess if $f_{x_n}[y_n] > 1/K$, and call it confident if $f_{x_n}[y] \in \{0, 1\}$, $\forall y \in [K]$, where $y_n$ is the clean label and $y$ is an arbitrary label. The quality of sieving out corrupted examples is guaranteed in Theorem 2.

Theorem 2. The sample sieve defined in (4) ensures that clean examples $(x_n, \tilde{y}_n = y_n)$ will not be identified as being corrupted if the model $f^{(t)}$'s prediction on $x_n$ is better than random guess.

Theorem 2 informs us that our sample sieve can progressively and safely filter out corrupted examples, and therefore improve division quality, when the model prediction on each $x_n$ is better than random guess. The full proof is left to Appendix B.3. In the next section, we provide evidence that our trained model is guaranteed to meet this requirement given sufficient examples.
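The intuition behind Theorem 2 can be seen with a short calculation. The following is only a sketch of one possible argument, assuming the predictions $f_{x_n}[\cdot]$ are strictly positive and sum to one; it is not the proof given in Appendix B.3.

```latex
% Since \ell_{CR}(\bar f^{(t)}(x_n)) = \ell_{CR}(f^{(t)}(x_n)), the regularizer
% cancels in (4), and v_n^{(t)} = 1 is equivalent to
\begin{align*}
  -\ln f_{x_n}[\tilde{y}_n] \;<\; \frac{1}{K} \sum_{y \in [K]} -\ln f_{x_n}[y].
\end{align*}
% By Jensen's inequality (concavity of \ln) and \sum_y f_{x_n}[y] = 1,
\begin{align*}
  \frac{1}{K} \sum_{y \in [K]} -\ln f_{x_n}[y]
  \;\ge\; -\ln\Big( \frac{1}{K} \sum_{y \in [K]} f_{x_n}[y] \Big)
  \;=\; \ln K.
\end{align*}
% For a clean example (\tilde{y}_n = y_n) with f_{x_n}[y_n] > 1/K,
\begin{align*}
  -\ln f_{x_n}[y_n] \;<\; \ln K \;\le\; \frac{1}{K} \sum_{y \in [K]} -\ln f_{x_n}[y],
\end{align*}
% so the example passes the sieve, i.e. v_n^{(t)} = 1.
```

The argument also shows why no noise-rate estimate is needed: the threshold only depends on the current prediction vector.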

3.1. DECOUPLING THE CONFIDENCE REGULARIZED LOSS

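The discussion of performance guarantees of the sample sieve focuses on a general instance-based noise transition matrix $T(X)$, which can induce any specific noise regime such as symmetric noise and asymmetric noise (Kim et al., 2019; Li et al., 2020). Note that feature-independence was one critical assumption in state-of-the-art theoretically guaranteed noise-resistant literature (Natarajan et al., 2013; Liu & Guo, 2020; Xu et al., 2019), which we do not require. Let $T_{ij} := \mathbb{E}_{\mathcal{D}|Y=i}[T_{ij}(X)]$, $\forall i, j \in [K]$.

Theorem 3. (Main Theorem: Decoupling the Expected Regularized CE Loss) In expectation, the loss with $\ell_{CR}$ can be decoupled as three separate additive terms:

$$\begin{aligned}
\mathbb{E}_{\widetilde{\mathcal{D}}}\left[ \ell(f(X), \tilde{Y}) + \ell_{CR}(f(X)) \right]
= {} & \underbrace{\underline{T} \cdot \mathbb{E}_{\mathcal{D}}[\ell(f(X), Y)]}_{\text{Term-1}}
+ \underbrace{\bar{\Delta} \cdot \mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X), Y)]}_{\text{Term-2}} \\
& + \underbrace{\sum_{j \in [K]} \sum_{i \in [K]} \mathbb{P}(Y = i)\, \mathbb{E}_{\mathcal{D}|Y=i}\left[ \left( U_{ij}(X) - \beta\, \mathbb{P}(\tilde{Y} = j) \right) \ell(f(X), j) \right]}_{\text{Term-3}},
\end{aligned} \tag{5}$$

where $\underline{T} := \min_{j \in [K]} T_{jj}$, $\bar{\Delta} := \sum_{j \in [K]} \Delta_j \mathbb{P}(Y = j)$, $\Delta_j := T_{jj} - \underline{T}$, $U_{ij}(X) = T_{ij}(X)$, $\forall i \neq j$, $U_{jj}(X) = T_{jj}(X) - T_{jj}$, and

$$\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X), Y)] := \mathbb{1}(\bar{\Delta} > 0) \sum_{j \in [K]} \frac{\Delta_j \mathbb{P}(Y = j)}{\bar{\Delta}}\, \mathbb{E}_{\mathcal{D}|Y=j}[\ell(f(X), j)].$$

Equation (5) provides a generic machinery for anatomizing noisy datasets, where we show the effects of instance-based label noise on the $\ell_{CR}$ regularized loss can be decoupled into three additive terms: Term-1 reflects the expectation of CE on the clean distribution $\mathcal{D}$, Term-2 shifts the clean distribution by changing the prior probability of $Y$, and Term-3 characterizes how the corrupted examples (represented by $U_{ij}(X)$) might mislead/mis-weight the loss, as well as the regularization ability of $\ell_{CR}$ (represented by $\beta\, \mathbb{P}(\tilde{Y} = j)$). In addition to the design of the sample sieve, this additive decoupling structure also provides a novel and promising perspective for understanding and controlling the effects of generic instance-dependent label noise.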
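As a sanity check of the decoupling in (5), the sketch below evaluates both sides on a tiny finite population. Everything here is a made-up illustration: each instance carries a weight, a clean label, its own noisy-label row $T_{y,\cdot}(x)$, and a fixed prediction $f(x)$; the variable names are not taken from the paper.

```python
import math

def ce(probs, label):
    """CE loss l(f(x), y) = -ln f_x[y]."""
    return -math.log(probs[label])

# (weight p_n, clean label y_n, noisy-label row T_{y_n,.}(x_n), prediction f(x_n))
population = [
    (0.3, 0, [0.9, 0.1], [0.8, 0.2]),
    (0.2, 0, [0.6, 0.4], [0.6, 0.4]),
    (0.5, 1, [0.2, 0.8], [0.3, 0.7]),
]
K, beta = 2, 1.0

prior_clean = [sum(p for p, y, _, _ in population if y == i) for i in range(K)]
prior_noisy = [sum(p * t[j] for p, _, t, _ in population) for j in range(K)]

# Class-averaged diagonal entries T_jj = E_{D|Y=j}[T_jj(X)].
T_diag = [sum(p * t[j] for p, y, t, _ in population if y == j) / prior_clean[j]
          for j in range(K)]
T_low = min(T_diag)                      # underline{T}
delta = [T_diag[j] - T_low for j in range(K)]

def u(t, y, j):
    """U_ij(x): T_ij(x) off the diagonal, T_jj(x) - T_jj on the diagonal."""
    return t[j] - (T_diag[j] if j == y else 0.0)

# Left-hand side: expected regularized CE loss on the noisy distribution.
lhs = sum(p * sum((t[j] - beta * prior_noisy[j]) * ce(f, j) for j in range(K))
          for p, _, t, f in population)

term1 = T_low * sum(p * ce(f, y) for p, y, _, f in population)
# Term-2 equals sum_j Delta_j P(Y=j) E_{D|Y=j}[l(f(X), j)].
term2 = sum(p * delta[y] * ce(f, y) for p, y, _, f in population)
term3 = sum(p * sum((u(t, y, j) - beta * prior_noisy[j]) * ce(f, j) for j in range(K))
            for p, y, t, f in population)
rhs = term1 + term2 + term3
```

In this toy population the two sides agree up to floating-point error, and changing `beta` only moves Term-3, which is the handle the regularizer provides.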

3.2. GUARANTEES OF THE SAMPLE SIEVE

By decoupling the effects of instance-dependent noise into separate additive terms as shown in Theorem 3, we can further study under what conditions minimizing the confidence regularized CE loss on the (instance-dependent) noisy distribution will be equivalent to minimizing the true loss incurred on the clean distribution, which is exactly encoded by Term-1. In other words, we would like to understand when Term-2 and Term-3 in (5) can be controlled so as not to disrupt the minimization of Term-1. Our next main result establishes this guarantee but first needs the following two assumptions.

Assumption 1. ($Y^* = Y$) Clean labels are Bayes optimal ($Y^* := \arg\max_{i \in [K]} \mathbb{P}(Y = i \mid X)$).

Assumption 2. (Informative datasets) The noise rate is bounded as $T_{ii}(X) - T_{ij}(X) > 0$, $\forall i \in [K]$, $j \in [K]$, $j \neq i$, $X \sim \mathcal{D}_X$.

Feasibility of assumptions: 1) Note that for many popular image datasets, e.g. CIFAR, the label of each feature is well-defined and the corresponding distribution is well-separated by human annotation. In this case, each feature $X$ only belongs to one particular class $Y$. Thus Assumption 1 generally holds in classification problems (Liu & Tao, 2015). Technically, this assumption could be relaxed. We use this assumption for a clean presentation. 2) Assumption 2 states the requirement on noise rates, i.e., for any feature $X$, a sufficient number of clean examples is necessary for the clean information to dominate. For example, we require $T_{ii}(X) - T_{ij}(X) > 0$ to ensure examples from class $i$ are informative (Liu & Chen, 2017).

Before formally presenting the noise-resistant property of training with $\ell_{CR}$, we discuss intuitions here. As discussed earlier in Section 2.1, our $\ell_{CR}$ regularizes the CE loss to generate/incentivize confident predictions, and thus is able to approximate the 0-1 loss to obtain its robustness property. More explicitly, from (5), $\ell_{CR}$ affects Term-3 with a scale parameter $\beta$. Recall that $U_{ij}(X) = T_{ij}(X)$, $\forall i \neq j$, which is exactly the noise transition matrix. Although we have no information about this transition matrix, the confusion brought by $U_{ij}(X)$ can be canceled or reversed by a sufficiently large $\beta$ such that $U_{ij}(X) - \beta\, \mathbb{P}(\tilde{Y} = j) \le 0$. Intuitively, with an appropriate $\beta$, all the effects of $U_{ij}(X)$, $i \neq j$ can be reversed, and we will get a negative loss punishing the classifier for predicting class-$j$ when the clean label is $i$. Formally, Theorem 4 shows the noise-resistant property of training with $\ell_{CR}$ and is proved in Appendix B.4.

Theorem 4. (Robustness of the Confidence Regularized CE Loss) With Assumptions 1 and 2, when

$$\max_{i,j \in [K],\, X \sim \mathcal{D}_X} \frac{U_{ij}(X)}{\mathbb{P}(\tilde{Y} = j)} \;\le\; \beta \;\le\; \min_{\mathbb{P}(\tilde{Y} = i) > \mathbb{P}(\tilde{Y} = j),\, X \sim \mathcal{D}_X} \frac{T_{ii}(X) - T_{ij}(X)}{\mathbb{P}(\tilde{Y} = i) - \mathbb{P}(\tilde{Y} = j)}, \tag{6}$$

minimizing $\mathbb{E}_{\widetilde{\mathcal{D}}}[\ell(f(X), \tilde{Y}) + \ell_{CR}(f(X))]$ is equivalent to minimizing $\mathbb{E}_{\mathcal{D}}[\ell(f(X), Y)]$.

Theorem 4 shows a sufficient condition on $\beta$ for our confidence regularized CE loss to be robust to instance-dependent label noise. The bound on the LHS ensures the confusion from label noise can be canceled or reversed by the $\beta$-weighted confidence regularizer, and the RHS bound guarantees that the model with the minimized regularized loss predicts the most frequent label for each feature w.p. 1. Theorem 4 also provides guidelines for tuning $\beta$. Although we have no knowledge about $T_{ij}(X)$, we can roughly estimate the range of possible $\beta$. One possibly good setting of $\beta$ increases linearly with the number of classes, e.g. $\beta = 2$ for 10 classes and $\beta = 20$ for 100 classes.

With infinite model capacity, minimizing $\mathbb{E}_{\mathcal{D}}[\ell(f(X), Y)]$ returns the Bayes optimal classifier (since CE is a calibrated loss), which predicts on each $x_n$ better than random guess. Therefore, with a sufficient number of examples, minimizing $\mathbb{E}_{\widetilde{\mathcal{D}}}[\ell(f(X), \tilde{Y}) + \ell_{CR}(f(X))]$ will also return a model that predicts better than random guess, thereby satisfying the condition required in Theorem 2 to guarantee the quality of sieved examples.
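To get a feel for condition (6), the following sketch evaluates both bounds for a hypothetical instance-independent transition matrix, in which case $T_{ij}(X) = T_{ij}$ and $U_{jj}(X) = 0$. The function name `beta_range`, the matrix, and the clean prior are assumptions chosen for illustration; in practice $T$ is unknown and these bounds cannot be computed exactly.

```python
def beta_range(T, prior_clean):
    """Lower and upper bounds on beta from condition (6), for instance-independent T.

    With instance-independent noise U_jj = 0, so the maximum on the left-hand
    side is attained on off-diagonal entries.
    """
    K = len(T)
    # Noisy-label prior P(Y~ = j) = sum_i P(Y = i) * T_ij.
    prior_noisy = [sum(prior_clean[i] * T[i][j] for i in range(K)) for j in range(K)]
    lower = max(T[i][j] / prior_noisy[j]
                for i in range(K) for j in range(K) if i != j)
    ratios = [(T[i][i] - T[i][j]) / (prior_noisy[i] - prior_noisy[j])
              for i in range(K) for j in range(K)
              if prior_noisy[i] > prior_noisy[j]]
    # With a perfectly balanced noisy prior there is no constraint from above.
    upper = min(ratios) if ratios else float("inf")
    return lower, upper

# Symmetric noise with 30% total flip probability over three classes.
T = [[0.70, 0.15, 0.15],
     [0.15, 0.70, 0.15],
     [0.15, 0.15, 0.70]]
lower, upper = beta_range(T, prior_clean=[0.5, 0.3, 0.2])
```

For this toy setting the admissible interval is non-empty and fairly wide, which is consistent with the remark that a rough choice of $\beta$ can suffice; heavier noise or strongly imbalanced priors shrink the interval.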
Further, since the Bayes optimal classifier always predicts clean labels confidently when Assumption 1 holds, Theorem 4 also guarantees confident predictions. With such predictions, the sample sieve in (4) will achieve 100% precision on both clean and corrupted examples. This guaranteed division is summarized in Corollary 1: 

3.3. TRAINING WITH SIEVED SAMPLES

We discuss the necessity of a dynamic sample sieve in this subsection. Despite the strong guarantee in expectation shown in Theorem 4, performing direct Empirical Risk Minimization (ERM) of the regularized loss is likely to return a sub-optimal solution. Although Theorem 4 guarantees the equivalence of minimizing two first-order statistics, their second-order statistics are also important for estimating the expectation when examples are finite. Intuitively, Term-1 $\underline{T} \cdot \mathbb{E}_{\mathcal{D}}[\ell(f(X), Y)]$ primarily helps distinguish a good classifier from a bad one on the clean distribution. The leading constant $\underline{T}$ reduces the power of this discrimination, as the gap between the expected losses effectively becomes smaller as noise increases ($\underline{T}$ decreases). Therefore we would require more examples to recognize the better model. Equivalently, the variance of the selection becomes larger. In Appendix C.2, we also offer an explanation from the perspective of variance. For some instances with extreme label noise, a $\beta$ satisfying Eqn. (6) in Theorem 4 may not exist. In such cases, these instances cannot be properly used and other auxiliary techniques are necessary (e.g., sample pruning). Sieving […]

4. EXPERIMENTS

Now we present experimental evidence of how CORES² works.

Datasets: CORES² is evaluated on three benchmark datasets: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and Clothing1M (Xiao et al., 2015). Following the convention from […]

Consistency training after the sample sieve: Let $\tau$ be the last iteration of CORES². Define $L(\tau) := \{n \mid n \in [N], v_n^{(\tau)} = 1\}$, $H(\tau) := \{n \mid n \in [N], v_n^{(\tau)} = 0\}$, $\widetilde{D}_{L(\tau)} := \{(x_n, \tilde{y}_n) : n \in L(\tau)\}$, and $\widetilde{D}_{H(\tau)} := \{(x_n, \tilde{y}_n) : n \in H(\tau)\}$. Thus $\widetilde{D}_{L(\tau)}$ is […] $\ell_{KL}(f(x_n), \bar{f}^{(t)}(x_{n,t}))$, where $\bar{f}^{(t)}$ is a copy of the DNN at the beginning of epoch-$t$ but without gradients. Summing the classification and consistency losses yields the total loss. See Appendix D.1 for an illustration.

Other alternatives: Checking the consistency of noisy predictions is only one possible way to leverage the additional information after sample sieves. Our basic idea of first sieving the dataset and then treating corrupted examples differently from clean ones admits other alternatives. There are many other possible designs after sample sieves, e.g., estimating the transition matrix using sieved examples and then applying loss correction (Patrini et al., 2017; Vahdat, 2017; Xiao et al., 2015), using the consistency loss as another regularization term and retraining the model (Zhang et al., 2020b), correcting the sample selection bias in clean examples and retraining (Cheng et al., 2020; Fang et al., 2020), or relabeling the corrupted examples and retraining. Additionally, clustering methods on the feature space (Han et al., 2019; Luo et al., 2020) or high-order information (Zhu et al., 2021a) can also be exploited along with the dynamic sample sieve. Besides, the current structure is ready to include other techniques such as mixup (Zhang et al., 2018).

Quality of our sample sieve: […] $/\,(\text{Pre} + \text{Re})$, where

$$\text{Pre} := \frac{\sum_{n \in [N]} \mathbb{1}(v_n = 1, y_n = \tilde{y}_n)}{\sum_{n \in [N]} \mathbb{1}(v_n = 1)}, \quad \text{and} \quad \text{Re} := \frac{\sum_{n \in [N]} \mathbb{1}(v_n = 1, y_n = \tilde{y}_n)}{\sum_{n \in [N]} \mathbb{1}(y_n = \tilde{y}_n)}.$$

Experiments on CIFAR-10, CIFAR-100 and Clothing1M: In this section, we compare CORES² with several state-of-the-art methods on CIFAR-10 and CIFAR-100 under instance-based, symmetric and asymmetric label noise settings, which is shown in Table 1 and Table 2 […] approaches, CORES² also works fairly well on the Clothing1M dataset. See more experiments in Appendix D. We also provide source code with detailed instructions in the supplementary materials.
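The precision and recall of a sieve can be computed directly from the sieve indicators when clean labels are available, as in synthetic-noise experiments. The sketch below is a minimal illustration with made-up labels; combining the two into the usual harmonic-mean F-score $2 \cdot \text{Pre} \cdot \text{Re} / (\text{Pre} + \text{Re})$ is an assumption here, since the numerator of that expression is not recoverable from the text.

```python
def sieve_quality(v, clean_labels, noisy_labels):
    """Precision, recall and (assumed) harmonic-mean F-score of a sample sieve.

    v[n] = 1 means example n was kept as clean by the sieve.
    """
    kept_and_clean = sum(1 for vn, y, yt in zip(v, clean_labels, noisy_labels)
                         if vn == 1 and y == yt)
    kept = sum(1 for vn in v if vn == 1)
    actually_clean = sum(1 for y, yt in zip(clean_labels, noisy_labels) if y == yt)
    pre = kept_and_clean / kept if kept else 0.0
    re = kept_and_clean / actually_clean if actually_clean else 0.0
    f_score = 2 * pre * re / (pre + re) if pre + re > 0 else 0.0
    return pre, re, f_score

# Five examples: three have uncorrupted labels, and the sieve keeps three examples.
v = [1, 1, 0, 1, 0]
clean_labels = [0, 1, 2, 1, 0]
noisy_labels = [0, 1, 1, 2, 0]
pre, re, f_score = sieve_quality(v, clean_labels, noisy_labels)
```

Here the sieve keeps one corrupted example and drops one clean example, so precision and recall both fall below one.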

5. CONCLUSIONS

This paper introduces CORES², a sample sieve that is guaranteed to be robust to general instance-dependent label noise and to sieve out corrupted examples without using explicit knowledge of the noise rates of labels. The analysis of CORES² assumes that the Bayes optimal labels are the same as the clean labels. Future directions of this work include extensions to more general cases where the Bayes optimal labels may differ from clean labels. We are also interested in exploring different […]

A FULL VERSION OF RELATED WORKS

Learning with noisy labels has attracted rapidly growing interest. Since the traditional cross-entropy (CE) loss has been shown to easily overfit noisy labels (Zhang et al., 2016), researchers have tried to design different loss functions to handle this problem. There are two main perspectives on designing loss functions. Considering the fact that outputs of logarithm functions in the CE loss grow explosively when the prediction $f(x)$ approaches zero, some researchers tried to design bounded loss functions (Amid et al., 2019; Wang et al., 2019; Gong et al., 2018; Ghosh et al., 2017). To avoid relying on fine-tuning of hyper-parameters in loss functions, a meta-learning method was proposed by Shu et al. (2020) to combine the above four loss functions together. However, simply considering loss function values without discussing the noise type and the corresponding statistics cannot be noise-tolerant as defined by Manwani & Sastry (2013). As a complement, others started from noise types and tried to design noise-tolerant loss functions. Based on the assumption that label noise only depends on the true class (a.k.a. feature-independent or label-dependent), an unbiased loss function called surrogate loss (Natarajan et al., 2013), an information-based loss function called $L_{DMI}$ (Xu et al., 2019), and a new family of loss functions that punish agreements between classifiers and noisy datasets called peer loss (Liu & Guo, 2020) were proposed. They proved theoretically that training DNNs using their loss functions on feature-independent noisy datasets is equivalent to training with CE on the corresponding unobservable clean datasets. However, surrogate loss focuses on binary classification and requires knowing noise rates. $L_{DMI}$ and peer loss do not require knowing noise rates, but $L_{DMI}$ may not be easy to extend, and the multi-class version of peer loss requires particular transition matrices.

The correction approach is also popular in handling label noise. Previous works (Patrini et al., 2017; Vahdat, 2017; Xiao et al., 2015) assumed the feature-independent noise transition matrix was given or could be estimated and attempted to use it to correct loss functions. For example, Patrini et al. (2017) first estimated the noise transition matrix and then relied on it to correct forward or backward propagation during training. However, without a set of clean examples, the noise transition matrix can be hard to estimate correctly. Instead of correcting loss functions, some methods directly corrected labels (Veit et al., 2017; Li et al., 2017; Han et al., 2019), although this might introduce extra noise and damage useful information. Recent works (Xia et al., 2020; Berthon et al., 2020) extended loss correction from the limited feature-independent label noise to part-dependent or more general instance-dependent noise regimes, while relying heavily on noise rate estimation.

Sample selection (Jiang et al., 2017; Han et al., 2018; Yu et al., 2019; Yao et al., 2020a; Wei et al., 2020) mainly focused on exploiting the memorization of DNNs and treating the "small loss" examples as clean ones, while only considering feature-independent label noise. Cheng et al. (2020) tried to distill some examples relying on the predictions obtained with the surrogate loss function (Natarajan et al., 2013). Note that estimating noise rates is necessary both for applying surrogate loss and for determining the threshold for distillation. The sample selection methods can be combined with semi-supervised learning techniques to improve performance, where the corrupted examples are treated as unlabeled data (Li et al., 2020; Nguyen et al., 2019). However, the training mechanisms of these methods are still based on the CE loss, which cannot be guaranteed to avoid overfitting to label noise.

B PROOF FOR THEOREMS

In this section, we firstly present the proof for Theorem 3 (our main theorem) in Section B.1, which provides a generic machinery for anatomizing noisy datasets. Then we will respectively prove Theorem 1 in Section B.2, Theorem 2 in Section B.3, and Theorem 4 in Section B.4 according to the order they appear.

B.1 PROOF FOR THEOREM 3

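Theorem 3. (Main Theorem: Decoupling the Expected Regularized CE Loss) In expectation, the loss with $\ell_{\mathrm{CR}}$ can be decoupled as three separate additive terms:

\[
\mathbb{E}_{\widetilde{\mathcal{D}}}\big[\ell(f(X),\widetilde{Y}) + \ell_{\mathrm{CR}}(f(X))\big]
= \underbrace{\underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)]}_{\text{Term-1}}
+ \underbrace{\bar{\Delta}\,\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)]}_{\text{Term-2}}
+ \underbrace{\sum_{j\in[K]}\sum_{i\in[K]} \mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}\big[(U_{ij}(X) - \beta\,\mathbb{P}(\widetilde{Y}=j))\,\ell(f(X),j)\big]}_{\text{Term-3}},
\]

where $\underline{T} := \min_{j\in[K]} T_{jj}$, $\bar{\Delta} := \sum_{j\in[K]} \Delta_j\,\mathbb{P}(Y=j)$, $\Delta_j := T_{jj} - \underline{T}$, $U_{ij}(X) = T_{ij}(X), \forall i \neq j$, $U_{jj}(X) = T_{jj}(X) - T_{jj}$, and

\[
\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)] := \mathbb{1}(\bar{\Delta} > 0) \sum_{j\in[K]} \frac{\Delta_j\,\mathbb{P}(Y=j)}{\bar{\Delta}}\,\mathbb{E}_{\mathcal{D}|Y=j}[\ell(f(X),j)].
\]

Proof. The expected form of the traditional CE loss on the noisy distribution $\widetilde{\mathcal{D}}$ can be written as

\[
\begin{aligned}
\mathbb{E}_{\widetilde{\mathcal{D}}}[\ell(f(X),\widetilde{Y})]
&= \sum_{j\in[K]}\sum_{i\in[K]} \mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}[T_{ij}(X)\,\ell(f(X),j)] \\
&= \sum_{j\in[K]}\sum_{i\in[K]} \mathbb{P}(Y=i)\,T_{ij}\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(f(X),j)]
 + \sum_{j\in[K]}\sum_{i\in[K]} \mathbb{P}(Y=i)\,\mathrm{Cov}_{\mathcal{D}|Y=i}\big(T_{ij}(X),\,\ell(f(X),j)\big).
\end{aligned}
\]

The first term can be transformed as:

\[
\begin{aligned}
&\sum_{j\in[K]}\sum_{i\in[K]} \mathbb{P}(Y=i)\,T_{ij}\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(f(X),j)] \\
&= \sum_{j\in[K]} \Big[ T_{jj}\,\mathbb{P}(Y=j)\,\mathbb{E}_{\mathcal{D}|Y=j}[\ell(f(X),j)]
 + \sum_{i\in[K],\, i\neq j} T_{ij}\,\mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(f(X),j)] \Big] \\
&= \underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)] + \bar{\Delta}\,\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)]
 + \sum_{j\in[K]}\sum_{i\in[K],\, i\neq j} T_{ij}\,\mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(f(X),j)],
\end{aligned}
\]

where $\underline{T} := \min_{j\in[K]} T_{jj}$, $\bar{\Delta} := \sum_{j\in[K]} \Delta_j\,\mathbb{P}(Y=j)$, $\Delta_j := T_{jj} - \underline{T}$, and

\[
\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)] :=
\begin{cases}
\sum_{j\in[K]} \dfrac{\Delta_j\,\mathbb{P}(Y=j)}{\bar{\Delta}}\,\mathbb{E}_{\mathcal{D}|Y=j}[\ell(f(X),j)], & \text{if } \bar{\Delta} > 0, \\
0, & \text{if } \bar{\Delta} = 0.
\end{cases}
\]

Then

\[
\begin{aligned}
\mathbb{E}_{\widetilde{\mathcal{D}}}[\ell(f(X),\widetilde{Y})]
&= \underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)] + \bar{\Delta}\,\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)]
 + \sum_{j\in[K]}\sum_{i\in[K],\, i\neq j} T_{ij}\,\mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(f(X),j)] \\
&\quad + \sum_{j\in[K]}\sum_{i\in[K]} \mathbb{P}(Y=i)\,\mathrm{Cov}_{\mathcal{D}|Y=i}\big(T_{ij}(X),\,\ell(f(X),j)\big) \\
&= \underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)] + \bar{\Delta}\,\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)]
 + \sum_{j\in[K]}\sum_{i\in[K],\, i\neq j} T_{ij}\,\mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(f(X),j)] \\
&\quad + \sum_{j\in[K]}\sum_{i\in[K],\, i\neq j} \mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}\big[(T_{ij}(X) - T_{ij})\big(\ell(f(X),j) - \mathbb{E}_{\mathcal{D}|Y=i}[\ell(f(X),j)]\big)\big] \\
&\quad + \sum_{j\in[K]} \mathbb{P}(Y=j)\,\mathbb{E}_{\mathcal{D}|Y=j}\big[(T_{jj}(X) - T_{jj})\big(\ell(f(X),j) - \mathbb{E}_{\mathcal{D}|Y=j}[\ell(f(X),j)]\big)\big] \\
&= \underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)] + \bar{\Delta}\,\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)] \\
&\quad + \sum_{j\in[K]}\sum_{i\in[K],\, i\neq j} \mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}\big[(T_{ij}(X) - T_{ij})\big(\ell(f(X),j) - \mathbb{E}_{\mathcal{D}|Y=i}[\ell(f(X),j)]\big) + T_{ij}\,\ell(f(X),j)\big] \\
&\quad + \sum_{j\in[K]} \mathbb{P}(Y=j)\,\mathbb{E}_{\mathcal{D}|Y=j}\big[(T_{jj}(X) - T_{jj})\big(\ell(f(X),j) - \mathbb{E}_{\mathcal{D}|Y=j}[\ell(f(X),j)]\big)\big] \\
&= \underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)] + \bar{\Delta}\,\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)]
 + \sum_{j\in[K]}\sum_{i\in[K],\, i\neq j} \mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}[T_{ij}(X)\,\ell(f(X),j)] \\
&\quad + \sum_{j\in[K]} \mathbb{P}(Y=j)\,\mathbb{E}_{\mathcal{D}|Y=j}[(T_{jj}(X) - T_{jj})\,\ell(f(X),j)] \\
&= \underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)] + \bar{\Delta}\,\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)]
 + \sum_{j\in[K]}\sum_{i\in[K]} \mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}[U_{ij}(X)\,\ell(f(X),j)],
\end{aligned}
\]

where $U_{ij}(X) = T_{ij}(X), \forall i \neq j$, and $U_{jj}(X) = T_{jj}(X) - T_{jj}$.

The expected form of $\ell_{\mathrm{CR}}$ on the noisy distribution $\widetilde{\mathcal{D}}$ can be written as

\[
\begin{aligned}
\mathbb{E}_{\widetilde{\mathcal{D}}}[\ell_{\mathrm{CR}}(f(X))]
&= -\beta\,\mathbb{E}_{\widetilde{\mathcal{D}}}\big[\mathbb{E}_{\mathcal{D}_{\widetilde{Y}|\widetilde{D}}}[\ell(f(X),\widetilde{Y})]\big]
 = -\beta \sum_{\widetilde{D}} \mathbb{P}(\widetilde{D})\,\mathbb{E}_{\mathcal{D}_{\widetilde{Y}|\widetilde{D}}}[\ell(f(X),\widetilde{Y})] \\
&= -\beta \sum_{j\in[K]} \mathbb{P}(\widetilde{Y}=j)\,\mathbb{E}_{\mathcal{D}_X}[\ell(f(X),j)]
 = -\sum_{j\in[K]}\sum_{i\in[K]} \mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\beta\,\mathbb{P}(\widetilde{Y}=j)\,\ell(f(X),j)].
\end{aligned}
\]

Thus the expected form of the new regularized loss is

\[
\mathbb{E}_{\widetilde{\mathcal{D}}}\big[\ell(f(X),\widetilde{Y}) + \ell_{\mathrm{CR}}(f(X))\big]
= \underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)] + \bar{\Delta}\,\mathbb{E}_{\mathcal{D}_\Delta}[\ell(f(X),Y)]
+ \sum_{j\in[K]}\sum_{i\in[K]} \mathbb{P}(Y=i)\,\mathbb{E}_{\mathcal{D}|Y=i}\big[(U_{ij}(X) - \beta\,\mathbb{P}(\widetilde{Y}=j))\,\ell(f(X),j)\big].
\]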

B.2 PROOF FOR THEOREM 1

Theorem 1. For CA (•), solutions satisfying f xn [i] > 0, ∀i ∈ [K] are not locally optimal at (x n , ỹn ). Proof. Let (•) be the CE loss. Note this proof does not rely on whether the data distribution is clean or not. We use D to denote any data distribution and D to denote the corresponding dataset. This notation applies only to this proof. For any data distribution D, we have E D (f (X), Y ) -E D Y |D [ (f (x n ), Y )] =E D [ (f (X), Y )] -E D Y [E D X [ (f (X), Y )]] = - D X dx y∈[K] P(x, y) ln f x [y] + D X dx y∈[K] P(x)P(y) ln f x [y] = - D X dx y∈[K] ln f x [y][P(x, y) -P(x)P(y)]. The dynamical analyses are based on the following three assumptions: A1. The model capacity is infinite (i.e., it can realize arbitrary variation). A2. The model is updated using the gradient descent algorithm (i.e. updates follow the direction of decreasing E D [ (f (X), Y )] -E D Y [E D X [ (f (X), Y )]]). A3. The derivative of network function ∂f (x;w) ∂wi is smooth (i.e. the network function has no singular point), where w i 's are model parameters. Denote the variations of f x [y] during one gradient descent update by ∆ y (x). From Lemma 1, it can be explicitly written as ∆ y (x) = f x [y] • η D X dx y ∈[K] [P(x , y ) -P(x )P(y )] i∈[K] G i (x, y)G i (x , y ), where η is the learning rate, G i (x, y) = - ∂g y (x) ∂w i + y ∈[K] f x [y ] ∂g y (x) ∂w i , and g y (x) is the network output before the softmax activation. i.e. f x [y] = exp(g y (x)) y ∈[K] exp(g y (x)) . With ∆ y (x), the variation of the regularized loss is ∆E D [ (f (X), Y ) + CR ] = - D X dx P(x) y∈[K] ∆ y (x) P(y|x) -P(y) f x [y] . If the training reaches a steady state (a.k.a. local optimum), we have ∆E D [ (f (X), Y ) + CR ] = 0. To check the property of this variation, consider the following example. For a particular x 0 , define F (x 0 ) := y∈[K] ∆ y (x 0 ) P(y|x 0 ) -P(y) f x0 [y] . 
Split the labels y into the following two sets (without loss of generality, we ignore the P(y|x  a y + N ab y∈Y+ b y = 0, b y = N ab b y . Let B (x 0 ) be a -neighbourhood of x 0 . Since f x [y] is continuous, we can set ∆ y (x) = 1 2 (1 + cos π x-x0 )∆ y (x 0 ), ∀x ∈ B (x 0 ) and 0 otherwise. The coefficient 1 2 (1 + cos π x-x0 ) is added so that the continuity of f x [y] preserves. This choice will lead to ∆E D [ (f (X), Y ) + CR ] < 0. Therefore, for any CA (f (x n ), y n ) with solution f xn [i] > 0, ∀i ∈ [K], we can always find a decreasing direction, indicating the solution is not (steady) locally optimal. Note D can be any distribution in this proof. Thus the result holds for the noisy distribution D. Lemma 1. ∆ y (x) = f x [y] • η D X dx y ∈[K] [P(x , y ) -P(x )P(y )] i∈[K] G i (x, y)G i (x , y ). Proof. We need to take into account the actual form of activation function, i.e., the softmax function, as well as the SGD algorithm to demonstrate the correctness of this lemma. The variation ∆ y0 (x 0 ) is caused by the change in network parameters {w i }, i.e., ∆ y0 (x 0 ) = i∈[K] ∂f x0 [y 0 ] ∂w i δw i , where δw i are determined by the SGD algorithm δw i = -η ∂E D [ (f (X), Y ) + CR ] ∂w i =η x,y P(x, y) -P(x)P(y) f x [y] ∂f x [y] ∂w i . Plugging back to (11) yields ∆ y0 (x 0 ) = η x,y P(x, y) -P(x)P(y) f x [y] i∈[K] ∂f x0 [y 0 ] ∂w i ∂f x [y] ∂w i . To proceed, we need to expand ∂fx[y] ∂wi . Taking into account the activation function, one has f x [y] = exp(g y (x)) y ∈[K] exp(g y (x)) , where g y (x) refers to the network output before passed to the activation function. Recall that, by our assumption, derivatives ∂f (x;w) ∂wi are not singular. Now we have ∂f x [y] ∂w i = ∂e -gy(x) ∂w i 1 y ∈[K] e -g y (x) + e -gy(x) ∂ ∂w i 1 y ∈[K] e -g y (x) = -e -gy(x) y ∈[K] e -g y (x) ∂g y (x) ∂w i + e -gy(x) y ∈[K] e -g y (x) 2 y ∈[K] e -g y (x) ∂g y (x) ∂w i =f x [y]   - ∂g y (x) ∂w i + y ∈[K] f x [y ] ∂g y (x) ∂w i   . 
For simplicity, we can rewrite the above result as

\[
\frac{\partial f_x[y]}{\partial w_i} = f_x[y]\,G_i(x,y),
\qquad
G_i(x,y) := -\frac{\partial g_y(x)}{\partial w_i} + \sum_{y'} f_x[y']\,\frac{\partial g_{y'}(x)}{\partial w_i},
\]

where $G_i(x,y)$ is a smooth function. Combining all the above gives $\Delta_{y_0}(x_0)$ as follows:

\[
\Delta_{y_0}(x_0) = f_{x_0}[y_0] \cdot \eta \sum_{x,y} [\mathbb{P}(x,y) - \mathbb{P}(x)\mathbb{P}(y)] \sum_{i} G_i(x_0,y_0)\,G_i(x,y).
\]

B.3 PROOF FOR THEOREM 2

Theorem 2. The sample sieve defined in (4) ensures that clean examples $(x_n, \tilde{y}_n = y_n)$ will not be identified as being corrupted if the model $f^{(t)}$'s prediction on $x_n$ is better than random guess.

Proof. Let $y_n$ be the true label corresponding to feature $x_n$. For a clean sample, we have $\tilde{y}_n = y_n$. Consider an arbitrary DNN model $f$. With the CE loss, we have $\ell(f(x_n), y_n) = -\ln(f_{x_n}[y_n])$. According to Equation (4) in the paper, the necessary and sufficient condition for $v_n > 0$ is

\[
\begin{aligned}
\ell(f(x_n),\tilde{y}_n) + \ell_{\mathrm{CR}}(f(x_n)) < \alpha_n
&\;\Leftrightarrow\; -\ln(f_{x_n}[y_n]) < -\frac{1}{K} \sum_{y\in[K]} \ln(f_{x_n}[y]) \\
&\;\Leftrightarrow\; -\ln(f_{x_n}[y_n]) < -\frac{1}{K-1} \sum_{y\in[K],\, y\neq y_n} \ln(f_{x_n}[y]).
\end{aligned}
\]

By Jensen's inequality we have

\[
-\ln\left(\frac{1 - f_{x_n}[y_n]}{K-1}\right)
= -\ln\left(\frac{\sum_{y\in[K],\, y\neq y_n} f_{x_n}[y]}{K-1}\right)
\le -\frac{1}{K-1} \sum_{y\in[K],\, y\neq y_n} \ln(f_{x_n}[y]).
\]

Therefore, when (sufficient condition)

\[
-\ln(f_{x_n}[y_n]) < -\ln\left(\frac{1 - f_{x_n}[y_n]}{K-1}\right)
\;\Leftrightarrow\; f_{x_n}[y_n] > \frac{1}{K},
\]

we have $v_n > 0$. The inequality $f_{x_n}[y_n] > \frac{1}{K}$ indicates that the model prediction is better than random guess.
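The equivalent form of the sieve condition used in the proof of Theorem 2 can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `sieve_keep_mask` and the toy probabilities are hypothetical, and the `eps` offset follows the stabilized logarithm mentioned in the footnotes. It keeps an example when the CE loss on its noisy label is below the average CE loss over all K labels.

```python
import numpy as np

def sieve_keep_mask(probs, noisy_labels, eps=1e-8):
    """Boolean mask: True where an example passes the sieve (v_n > 0).

    Uses the equivalent condition from the proof of Theorem 2:
    -ln f[y_noisy] < -(1/K) * sum_y ln f[y].
    """
    probs = np.asarray(probs, dtype=float)
    noisy_labels = np.asarray(noisy_labels)
    log_p = np.log(probs + eps)                                # stabilized log, shape (N, K)
    ce = -log_p[np.arange(len(noisy_labels)), noisy_labels]    # CE on the noisy label
    mean_ce = -log_p.mean(axis=1)                              # average CE over all K labels
    return ce < mean_ce

# Hypothetical toy predictions for K = 3 classes.
probs = np.array([[0.5, 0.3, 0.2],
                  [0.5, 0.3, 0.2]])
noisy_labels = np.array([0, 2])
mask = sieve_keep_mask(probs, noisy_labels)
```

In this toy case the first example (predicted probability 0.5 > 1/3 on its label) is kept, while the second (0.2 < 1/3) is sieved out, in line with the sufficient condition of the theorem.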

B.4 PROOF FOR THEOREM 4

Before proving Theorem 4, we need to show the effect of adding Term-2 to Term-1 in (5). Let X < 0.5 be the measure of separation among classes w.r.t feature X in distribution D, i.e., P(Y = Y * |X) = 1 -X , (X, Y ) ∼ D, where Y * := arg max i∈[K] P(Y = i|X) is the Bayes optimal label. Let D be the shifted distribution by adding Term-2 to Term-1 and Y be the shifted label. Then P(X|Y ) = P(X|Y ), ∀(X, Y ) ∼ D, (X, Y ) ∼ D but P(Y ) may be different from P(Y ). Lemma 2 shows the invariant property of this label shift. Lemma 2. Label shift does not change the Bayes optimal label of feature X when X < min ∀i,j∈[K] Tjj Tii+Tjj . Proof. Consider the shifted distribution D . Let T E D [ (f (X), Y )] + ∆E D∆ [ (f (X), Y )] = CE D [ (f (X), Y )], where E D [ (f (X), Y )] := j∈[K] P(Y = j)E D |Y =j [ (f (X), j)], and P(Y = j) := T jj P(Y = j) C , where C := j∈[K] T jj P(Y = j) is a constant for normalization. For each possible Y = i, we have P(Y = i|X) ∈ [0, X ] ∪ {1 -X }, X < 0.5. Thus P(X|Y = i) = P(Y = i|X)P(X) P(Y = i) ∈ [0, X P(X) P(Y = i) ] ∪ { P(X)(1 -X ) P(Y = i) }. Compare D and D, we know there is a label shift (Alexandari et al., 2020; Storkey, 2009) , where P(X|Y = i) = P(X|Y = i) but P(Y ) and P(Y ) may be different. To ensure the label shift does not change the Bayes optimal label, we need Y * = arg max i∈[K] P(Y = i|X) = arg max i∈[K] P(X|Y = i)P(Y = i) P(X) , (X, Y ) ∼ D. One sufficient condition is X P(Y = i) P(Y = i) < (1 -X )P(Y = j) P(Y = j) ⇒ X < min ∀i,j∈[K] T jj T ii + T jj With Lemma 2, Assumption 1, and Assumption 2, we present the proof for Theorem 4 as follows. Theorem 4. (Robustness of the Confidence Regularized CE Loss) With Assumption 1 and 2, when max i,j∈[K],X∼D X U ij (X) P( Y = j) ≤ β ≤ min P( Y =i)>P( Y =j),X∼D X T ii (X) -T ij (X) P( Y = i) -P( Y = j) , minimizing E D [ (f (X), Y ) + CR (f (X))] is equivalent to minimizing E D [ (f (X), Y )]. Proof. It is easy to check X = 0, ∀X ∼ D X when Assumption 1 holds. 
Thus adding Term-2 to Term-1 in (5) does not change the Bayes optimal label. With Assumption 1, the Bayes optimal classifier on the clean distribution should satisfy $f^*(X)[Y] = 1, \forall (X,Y) \sim \mathcal{D}$. On one hand, when $\beta \ge \max_{i,j\in[K],\, X\sim\mathcal{D}_X} U_{ij}(X)/\mathbb{P}(\widetilde{Y}=j)$, we have $\beta_{ij}(X) := U_{ij}(X) - \beta\,\mathbb{P}(\widetilde{Y}=j) \le 0, \forall i,j\in[K], X\sim\mathcal{D}_X$. In this case, minimizing the regularization term results in confident predictions. On the other hand, to keep the solution unbiased with respect to the clean results, $\beta$ cannot be arbitrarily large. We need to find the upper bound on $\beta$ such that $f^*$ also minimizes the loss defined in the latter regularization term. Assume there is no loss on confident true predictions and there is one misprediction on example $(x_n, y_n = j_1)$, i.e., the prediction changes from the Bayes optimal prediction $f_{x_n}[j_1] = 1$ to $f_{x_n}[j_2] = 1$, $j_2 \neq j_1$. Compared to the optimal one, the first two terms on the right side of (5) are increased by $T_{j_1,j_1}\,\ell_0$, where $\ell_0 > 0$ is the regret of one confident wrong prediction. Accordingly, the last term on the right side of (5) is increased by $(\beta_{j_1,j_1}(X) - \beta_{j_1,j_2}(X))\,\ell_0$. It is supposed that

\[
T_{j_1,j_1}\,\ell_0 + (\beta_{j_1,j_1}(x_n) - \beta_{j_1,j_2}(x_n))\,\ell_0 \ge 0, \quad \forall j_1, j_2 \in [K],
\]

which is equivalent to

\[
\beta\,(\mathbb{P}(\widetilde{Y}=j_1) - \mathbb{P}(\widetilde{Y}=j_2)) \le T_{j_1,j_1}(x_n) - T_{j_1,j_2}(x_n), \quad \forall j_1, j_2 \in [K].
\]

Thus

\[
\beta \le \min_{\mathbb{P}(\widetilde{Y}=j_1) > \mathbb{P}(\widetilde{Y}=j_2),\, X\sim\mathcal{D}_X} \frac{T_{j_1,j_1}(X) - T_{j_1,j_2}(X)}{\mathbb{P}(\widetilde{Y}=j_1) - \mathbb{P}(\widetilde{Y}=j_2)}.
\]

By mathematical induction, this can be generalized to the case with multiple mispredictions in the CE term.

C OTHER JUSTIFICATIONS

In this section, we first compare CR and entropy regularization in Section C.1 and highlight our superiority with both theoretical and experimental evidence, then show an example for explaining the variances incurred by label noise in Section C.  (f * D (x n ), ỹ = y n ) = max . Note CR = (K -1) max + min K for each example. The expectation is E D [ (f * D (X), Y ) + CR (f * D (X))] = ε max + (1 -ε) min + CR . Thus the variance is var D [ (f * D (X), Y ) + CR (f * D (X))] =ε( max + CR -(ε max + (1 -ε) min + CR )) 2 + (1 -ε)( min + CR -(ε max + (1 -ε) min + CR )) 2 =ε(1 -ε)( max -min ) 2 . We know in this example,  var D [ (f * D (X), Y ) + CR (f * D (X))] = ε(1 -ε)( max -min ) 2 var D ( (f * D (X), Y )) = 0. E D [ (f (X), Y )], f * D := arg min f R D (f ), R D L * ,γ (f ) := 1 |L * | n∈L * [γ(x n ) (f (x n ), ỹn )], f D L * ,γ := arg min f ∈F R D L * ,γ (f ) , where γ(X) := P D (X)/P D L * (X) stands for the importance of each example to correct sample bias such that R D (f ) = E D L * [γ(X) (f (X), Y )]. The weight γ(X) can be estimated by kernel mean matching (Huang et al., 2007) and its DNN adaption (Fang et al., 2020) . Let D L * ,X be the marginal distribution of D L * on X. For example, with a particular kernel Φ(X), the optimization problem is: min γ(X) E D X [Φ(X)] -E D L * ,X [γ(X)Φ(X)] s.t. γ(X) > 0 and E D L * ,X [γ(X)] = 1. Note the selection of kernel Φ(•) is non-trivial, especially for complicated features. See (Fang et al., 2020) for a detailed DNN solutions. Corollary 2 provides a risk bound for minimizing CE after sample sieve. Corollary 2. If γ • is [0, b]-valued, then for any δ > 0, with probability at least 1 -δ, we have R D ( f D L * ,γ ) -R D (f * D ) ≤ 2R(γ • • F) + 2b log(1/δ) 2|L * | , where the Rademacher complexity R(γ • • F) := E D L * ,σ [sup f ∈F 2 |L * | n∈L * σ n γ(x n ) (f (x n ), ỹn )] and {σ n∈L * } are independent Rademacher variables. Proof. 
The sieved clean examples may be biased due to the covariate shift caused by instancebased label noise. One solution to such shift is re-weighting D L * to match D using importance re-weighting. Particularly, we need to estimate parameters γ(X) such that R D (f ) = R D L * ,γ (f ) := E D L * [γ(X) (f (X), Y )]. With the optimal γ(X), the ERM should be changed as f D L * ,γ := arg min f ∈F R D L * ,γ (f ), where R D L * ,γ (f ) := 1 |L * | n∈L * [γ(x n ) (f (x n ), ỹn )]. Via Hoeffding's inequality, ∀f , w.p. at least 1 -δ, we have | R D L * ,γ (f ) -R D L * ,γ (f )| ≤ R( • F) + 2b ln(1/δ) 2|L * | . Following the basic Rademacher bound (Bartlett & Mendelson, 2002) on the maximal deviation between the expected empirical risks: R D ( f D L * ,γ ) -R D (f * D ) =R D L * ,γ ( f D L * ,γ ) -R D L * ,γ (f * D L * ,γ ) = R D L * ,γ ( f D L * ,γ ) -R D L * ,γ (f * D L * ,γ ) + R D L * ,γ ( f D L * ,γ ) -R D L * ,γ ( f D L * ,γ ) + R D L * ,γ (f * D L * ,γ ) -R D L * ,γ (f * D L * ,γ ) ≤0 + 2 max f ∈F | R D L * ,γ (f ) -R D L * ,γ (f )| ≤2R(γ • • F) + 2b ln(1/δ) 2|L * | , where the Rademacher complexity R(γ • • F) := E D L * ,σ [sup f ∈F 2 |L * | n∈L * σ n γ(x n ) (f (x n ), ỹn )] and {σ n∈L * } are independent Rademacher variables. Therefore, we get Corollary 2. Corollary 2 informs us that, theoretically, the sample sieve is biased and γ(X) is necessary to correct the selection bias. However, the error induced by estimating γ(X) may degrade the performance. In addition, it is easy to check the optimal solution of performing direct ERM on the sieved clean examples is the same as f * D in expectation when Assumption 1 holds.
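The importance-weighted empirical risk over the sieved examples can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the weights `gamma` are assumed to be given (for example by kernel mean matching), and the deviation term reads the garbled expression in Corollary 2 as 2b·sqrt(log(1/δ) / (2|L*|)), which is an interpretation rather than a confirmed formula.

```python
import numpy as np

def weighted_empirical_risk(probs, noisy_labels, gamma, eps=1e-8):
    """(1/|L*|) * sum_n gamma(x_n) * CE(f(x_n), noisy label) over sieved examples."""
    probs = np.asarray(probs, dtype=float)
    noisy_labels = np.asarray(noisy_labels)
    ce = -np.log(probs[np.arange(len(noisy_labels)), noisy_labels] + eps)
    return float(np.mean(np.asarray(gamma, dtype=float) * ce))

def deviation_term(b, delta, num_examples):
    """Assumed reading of the concentration term: 2b * sqrt(log(1/delta) / (2n))."""
    return float(2.0 * b * np.sqrt(np.log(1.0 / delta) / (2.0 * num_examples)))

# Hypothetical toy inputs: two sieved examples, K = 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])
risk_uniform = weighted_empirical_risk(probs, labels, gamma=np.ones(2))
```

With all weights equal to one, the weighted risk reduces to the plain average CE loss on the sieved examples, which corresponds to direct ERM without bias correction.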

D MORE DETAILS AND RESULTS FOR EXPERIMENTS

We first show our training framework in Section D.1, then give implementation details and discussions in Section D.2. The algorithm for generating the instance-dependent label noise is provided in Section D.3. We show more experiments in Section D.4 and the ablation study in Section D.5.

D.1 ILLUSTRATION OF THE TRAINING FRAMEWORK

Our experiments follow the framework shown in Figure 5.

[Figure 5: training framework diagram. Recoverable labels: Iteration-t; Model Update; Data Selection (Low Loss, High Loss); Remove Label; Random Data Augmentation; Consistency training.]

D.2 IMPLEMENTATION DETAILS AND MORE ANALYSIS

Implementation details on CIFAR-10 and CIFAR-100 with instance-based label noise: The basic hyper-parameter settings for CIFAR-10 and CIFAR-100 are as follows: mini-batch size (64), optimizer (SGD), initial learning rate (0.1), momentum (0.9), weight decay (0.0005), number of epochs (100), and learning rate decay (0.1 at 50 epochs). Standard data augmentation is applied to each dataset. CORES 2 and the baselines share the same hyper-parameter settings except for α and β in equation 2. When performing CORES 2, we first train the network on the dataset for 10 warm-up epochs with only the CE (cross-entropy) loss. Then β is linearly increased from 0 to 2 over the next 30 epochs and kept at 2 for the rest of the epochs. The data selection is performed at epoch 30, and $\alpha_{n,t}$ is set to $\frac{1}{K}\sum_{\tilde{y}\in[K]} \ell(f^{(t)}(x_n), \tilde{y}) + \ell_{\mathrm{CR}}(f^{(t)}(x_n))$ in epoch-t, as the paper suggests. When performing CORES 2, we use the sieved result at epoch 40. It is worth noting that at that time, the sample sieve may not have reached the highest test accuracy. However, the division property brought by the confidence regularizer works well at that time. We use the default setting from UDA (Xie et al., 2019) to apply efficient data augmentation. Implementation details on Clothing-1M: We train the network for 120 epochs on 1 million noisy training images. The batch size is set to 32. The initial learning rate is set to 0.01 and reduced by a factor of 10 at epochs 30, 60, and 90. For each epoch, we sample 1000 mini-batches from the training data while ensuring the (noisy) labels are balanced. The Mixup strategy is employed to further avoid overfitting (Zhang et al., 2018; Li et al., 2020). β is set to 0 for the first 80 epochs, linearly increased to 0.4 over the next 20 epochs, and kept at 0.4 for the rest of the epochs.
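The β schedules described above (zero during warm-up, a linear ramp, then a constant) can be sketched with one helper. This is an illustrative sketch only: the function names are hypothetical, and zero-based epoch indexing is an assumption that the text does not specify.

```python
def linear_ramp(epoch, start_epoch, ramp_epochs, final_value):
    """0 before start_epoch, linear increase over ramp_epochs, then constant."""
    if epoch < start_epoch:
        return 0.0
    if epoch >= start_epoch + ramp_epochs:
        return float(final_value)
    return final_value * (epoch - start_epoch) / ramp_epochs

def cifar_beta(epoch):
    # 10 warm-up epochs with CE only, then 0 -> 2 over the next 30 epochs.
    return linear_ramp(epoch, start_epoch=10, ramp_epochs=30, final_value=2.0)

def clothing1m_beta(epoch):
    # beta = 0 for the first 80 epochs, then 0 -> 0.4 over the next 20 epochs.
    return linear_ramp(epoch, start_epoch=80, ramp_epochs=20, final_value=0.4)
```

Under these assumptions β reaches its final value at epoch 40 on CIFAR, which matches the epoch at which the sieved result is used.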
It is worth noting that Clothing-1M actually does not satisfy our Assumption 2 since the class "Knitwear" (denoted by class-i) and the class "Sweater" (denoted by class-j) can not satisfy $T_{ii}(X) - T_{ij}(X) > T_{ii} - T_{jj}$. Note consistency training is not implemented on Clothing-1M.

More analysis on β: The value of β mainly affects the sample sieve in CORES 2 . From Theorem 3 and Theorem 4 in the paper, when β is set to be small, we do not have the good division property. When β is set to be large, the training is biased to the CE term.

2: Sample instance flip rates $q_n$ from the truncated normal distribution $N(\varepsilon, 0.1^2, [0, 1])$;
3: Sample $W \in \mathbb{R}^{S \times K}$ from the standard normal distribution $N(0, 1^2)$;
for $n = 1$ to $N$ do
    4: $p = x_n \cdot W$ // Generate instance-dependent flip rates. The size of $p$ is $1 \times K$.
    5: $p_{y_n} = -\infty$ // Only consider entries different from the true label
    6: $p = q_n \cdot \mathrm{softmax}(p)$ // Let $q_n$ be the probability of getting a wrong label
    7: $p_{y_n} = 1 - q_n$ // Keep clean w.p. $1 - q_n$
    8: Randomly choose a label from the label space as noisy label $\tilde{y}_n$ according to $p$;
end for
Output:
9: Noisy examples $\{(x_n, \tilde{y}_n)\}_{n=1}^{N}$.
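Steps 2 to 9 of the algorithm can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: the function name, the `seed` argument, and the use of rejection sampling for the truncated normal distribution are assumptions, and features are assumed to be flattened into rows of length S.

```python
import numpy as np

def generate_instance_dependent_noise(features, labels, num_classes, noise_rate, seed=0):
    """Return noisy labels for features of shape (N, S) and integer labels of shape (N,)."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    n, s = features.shape

    # Step 2: flip rates q_n ~ N(noise_rate, 0.1^2) truncated to [0, 1]
    # (rejection sampling: redraw any value outside the range).
    q = rng.normal(noise_rate, 0.1, size=n)
    bad = (q < 0) | (q > 1)
    while bad.any():
        q[bad] = rng.normal(noise_rate, 0.1, size=int(bad.sum()))
        bad = (q < 0) | (q > 1)

    # Step 3: projection matrix W of size S x K from the standard normal.
    w = rng.standard_normal((s, num_classes))

    noisy = np.empty(n, dtype=np.int64)
    for i in range(n):
        p = features[i] @ w                 # Step 4: instance-dependent scores, size K
        p[labels[i]] = -np.inf              # Step 5: exclude the true label
        p = np.exp(p - p.max())             # softmax numerator (true label gets 0)
        p = q[i] * p / p.sum()              # Step 6: wrong labels share mass q_n
        p[labels[i]] = 1.0 - q[i]           # Step 7: keep clean w.p. 1 - q_n
        noisy[i] = rng.choice(num_classes, p=p)   # Step 8: sample the noisy label
    return noisy                            # Step 9
```

Because the flip rates are centered at the target noise rate, the empirical fraction of corrupted labels should be close to `noise_rate` for a reasonably large dataset.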

D.3 GENERATING THE INSTANCE-DEPENDENT LABEL NOISE

In this section, we describe how to generate instance-based label noise, as illustrated in Algorithm 1. Note that this algorithm follows the state-of-the-art method (Xia et al., 2020). Define the noise rate (the global flipping rate) as ε. First, in order to control ε without constraining all instances to have the same flip rate, we sample their flip rates from a truncated normal distribution $N(\varepsilon, 0.1^2, [0, 1])$, where [0, 1] indicates the range of the truncated normal distribution. Second, we sample parameters W from the standard normal distribution for generating instance-dependent label noise. The size of W is S × K, where S denotes the length of each feature. For each instance $(x_n, y_n)$, we use Step 5 and Step 6 to ensure that the probability of getting a wrong label is $q_n$. Step 7 ensures that the entries of p sum to 1. Suppose there are two features $x_i$ and $x_j$ where $x_i = x_j$. Then the probability vectors p of these two features, calculated by $x \cdot W$ in Algorithm 1, would be exactly the same. Thus the label noise is strongly instance-dependent.

(Reed et al., 2014) 82.9 58.4
Forward T (Patrini et al., 2017) 83.1 59.4
Co-teaching+ (Yu et al., 2019) 88.2 84.1
Mixup (Zhang et al., 2018) 92.3 77.6
P-correction (Yi & Wu, 2019) 92.0 88.7
Meta-Learning (Li et al., 2019) 92.0 88.8
M-correction (Arazo et al., 2019) 93.8 91.9
DivideMix (Li et al., 2020) 95.7 94.4
CORES 2 95.9 94.5

D.4 MORE EXPERIMENTS ON CIFAR-10 AND TINY-IMAGENET

In this section, we compare CORES 2 with more methods on CIFAR-10 and Tiny-ImageNet. Table 5 records the comparison results with recent benchmark methods. Table 6 compares CORES 2 with other methods on Tiny-ImageNet. Both tables show that CORES 2 achieves competitive results.

D.5 ABLATION STUDY

CORES 2 (without consistency training): By optimizing the loss in (2), the model is forced to concentrate only on clean examples. Thus, even without consistency training, the network trained by CORES 2 is noise-robust. Table 7 compares CORES 2 with other noise-robust methods that do not apply a semi-supervised setting in the framework. We can see that CORES 2 still achieves the best performance among all the methods. CORES 2 without confidence regularization or dynamic data selection: The loss in equation 2 consists of a data selection strategy and a confidence regularization term. To see how they influence the final accuracy, we perform an ablation study and report the results in Table 8. The first row of Table 8 corresponds to the traditional CE loss. The second row corresponds to the sample sieve with CE loss. The third row is the typical CORES 2 . The last row is CORES 2 with consistency training. We can see that both the dynamic sample sieve in (4) and the confidence-regularized model update in (3) have positive effects on the final accuracy, which supports the design of CORES 2 .



In this paper, the noisy dataset refers to a dataset with noisy examples. A noisy example is either a clean example (whose label is true) or a corrupted example (whose label is wrong).

Detailed conditions for Theorem 1 are specified at the end of our main contents.

Our observation can also help partially explain the robustness property of peer loss (Liu & Guo, 2020).

The logarithmic function in $\ell_{\mathrm{CR}}$ is adapted to $\ln(f_x[y] + 10^{-8})$ for numerical stability.



also lack guarantees for the instancebased label noise. 2 CORES 2 : CONFIDENCE REGULARIZED SAMPLE SIEVE Consider a classification problem on a set of N training examples denoted by D := {(x n , y n )} n∈[N ] , where [N

Figure 1: Dynamic sample sieves. Green circles are clean examples. Red hexagons are corrupted examples. The crucial components in (2) are:

Figure 2: Loss distributions of training on CIFAR-10 with 40% symmetric noise (symm.) or 40% instance-based noise (inst.). The loss is given by (f (t) (x n ), ỹn ) + CR (f (t) (x n )) -α n,t as (4). CE Sieve represents the dynamic sample sieve with standard cross-entropy loss (without CR).

The sieved dataset takes the form of two clusters of examples. In particular, from Figure 2(b) and Figure 2(f), we observe that CE fails to provide a good division of clean and corrupted examples due to overfitting in the final stage of training. In contrast, with $\ell_{\mathrm{CR}}$, there are two distinct clusters that can be separated by the threshold 0, as shown in Figure 2(d) and Figure 2(h). Comparing Figure 2(a)-2(d) with Figure 2(e)-2(h), we find that the effect of instance-dependent noise on training is indeed different from that of symmetric noise: instance-dependent noise is more likely to cause overfitting.

Theorem 3 explicitly shows the contributions of clean examples, corrupted examples, and CR during training. See Appendix B.1 for the proof. Theorem 3. (Main Theorem: Decoupling the Expected Regularized CE Loss) In expectation, the loss with CR can be decoupled as three separate additive terms:

When conditions in Theorem 4 hold, with infinite model capacity and sufficiently many examples, CORES 2 achieves v n = 1(y n = ỹn ), ∀n ∈ [N ], i.e., all the sieved clean examples are effectively clean.

Figure 3: F-score comparisons on CIFAR10 under symmetric (Symm.) and instance-based (Inst.) label noise. F-score := 2•Pre•Re Pre+Re , where Pre := n∈[N ] 1(vn=1,yn=ỹn)

Figure 3 shows the F-scores of sieved clean examples over training epochs under symmetric and instance-based label noise. The F-score quantifies the quality of the sample sieve by the harmonic mean of precision (the ratio of actually clean examples among sieved clean ones) and recall (the ratio of sieved clean examples among actually clean ones). We compare CORES 2 with Co-teaching and Co-teaching+. Note that the F-scores of CORES 2 and Co-teaching are consistently high under symmetric noise, while CORES 2 achieves higher performance under the challenging instance-based label noise, especially at the 60% noise rate, where the other two methods have low F-scores.
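The F-score of a sieve, as described in words above, can be computed as in the following sketch. The function name and the toy arrays are hypothetical; `kept` marks examples sieved as clean, and an example counts as actually clean when its noisy label equals its true label.

```python
import numpy as np

def sieve_f_score(kept, clean_labels, noisy_labels):
    """Harmonic mean of precision and recall of the sieved clean set."""
    kept = np.asarray(kept, dtype=bool)
    is_clean = np.asarray(clean_labels) == np.asarray(noisy_labels)
    true_pos = np.sum(kept & is_clean)            # sieved clean and actually clean
    precision = true_pos / max(kept.sum(), 1)     # actually clean among sieved clean
    recall = true_pos / max(is_clean.sum(), 1)    # sieved clean among actually clean
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

# Hypothetical toy example with four training examples.
kept = [True, True, False, False]
clean_labels = [0, 1, 2, 3]
noisy_labels = [0, 2, 2, 1]
score = sieve_f_score(kept, clean_labels, noisy_labels)
```

Here one of the two kept examples is clean and one of the two clean examples is kept, so precision and recall are both 0.5.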

0 ) -P(y) = 0 cases): Y x0;-= {y : P(y|x 0 ) -P(y) < 0} and Y x0;+ = {y : P(y|x 0 ) -P(y) > 0}. By assigning ∆ y (x 0 ) = a y < 0, ∀y ∈ Y x0;-and ∆ y (x 0 ) = b y > 0, ∀y ∈ Y x0;+ , one finds F (x 0 ) > 0 since f x0 [y] > 0. Note we have an extra constraint y ∆ y (x 0 ) = 0 to ensure y∈[K] f x0 [y] = 1 after update. It is easy to check our assigned a y and b y could maintain this constraint by introducing a weight N ab to scale b y as follows.y∈Y-

Figure 4: Comparing our regularization with entropy regularization .

ANALYSIS FOR THE RISK BOUND Let D L * and D L * be the set and the distribution of the sieved clean examples according to Corollary 1. We know they are supposed to contain only clean examples. Define R D (f ) :=

Figure 5: One example of CORES 2 . L(t): Indices of sieved clean examples. H(t): Indices of sieved corrupted examples. D L(t) := {(x n , ỹn ) : n ∈ L(t)}, D H(t) := {(x n , ỹn ) : n ∈ H(t)}, D X,H(τ ) := {x n : n ∈ H(τ )}.

Figure 6 visualizes this phenomenon. In the left and right panels, many clean examples and corrupted examples overlap in the left and right clusters, respectively.

Figure 6: Analyzing how the value of β influences the division. We set β = 0.5, 2, 10 for lower, proper, and higher β settings, respectively.

Xu et al. (2019), we use ResNet34 for CIFAR-10 and CIFAR-100 and ResNet50 for Clothing1M. We experiment with three types of label noise: symmetric, asymmetric, and instance-dependent label noise. Symmetric noise is generated by randomly flipping a true label to the other possible labels w.p. ε (Kim et al., 2019), where ε is called the noise rate. Asymmetric noise is generated by flipping the true label to the next class (i.e., label i → i+1, mod K) w.p. ε. Instance-dependent label noise is a more challenging setting, and we generate it following the method from Xia et al. (2020) (see Appendix D.3 for details). In expectation, the noise rate ε for all noise regimes is the overall ratio of corrupted examples in the whole dataset.
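The symmetric and asymmetric noise models described above can be sketched as follows. This is an illustrative sketch with hypothetical function names; it assumes that a symmetric flip picks uniformly among the K - 1 other labels, which the text does not state explicitly.

```python
import numpy as np

def symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label w.p. noise_rate to one of the other labels (uniform, assumed)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < noise_rate
    offsets = rng.integers(1, num_classes, size=labels.shape[0])  # 1 .. K-1, never 0
    return np.where(flip, (labels + offsets) % num_classes, labels)

def asymmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label i w.p. noise_rate to the next class (i + 1) mod K."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < noise_rate
    return np.where(flip, (labels + 1) % num_classes, labels)
```

In both cases a flipped label always differs from the true one, so the expected fraction of corrupted examples equals the noise rate ε.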

sieved as clean examples and $D_{H(\tau)}$ is filtered out as corrupted ones. Examples $(x_n, \tilde{y}_n) \in D_{L(\tau)}$ lead the training direction using the CE loss $\sum_{n\in L(\tau)} \ell(f(x_n), \tilde{y}_n)$. Since the labels in $D_{H(\tau)}$ are supposed to be corrupted and can distract the training, we simply drop them. On the other hand, the features of these examples encode useful information that we can further leverage to improve the generalization ability of models. There are different ways to use this unsupervised information. In this paper, we chose to minimize the KL-divergence between predictions on the original feature and the augmented feature to make predictions consistent. This is a common option, as chosen by Li et al. (2019), Xie et al. (2019), and Zhang et al. (2020b). The consistency loss function in epoch-t is n∈H

. CORES 2 denotes that we apply consistency training on the corrupted examples after the sample sieve. For a fair comparison, all the methods use ResNet-34 as the backbone. By comparing the performance of CE on

Comparison of test accuracies on clean datasets under instance-based label noise.

Comparison of test accuracies on clean datasets under symmetric/asymmetric label noise.

The best epoch (clean) test accuracy for each method on Clothing1M.

possible designs of robust training with sieved examples.

Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pp. 8778-8788, 2018.

Zizhao Zhang, Han Zhang, Sercan O Arik, Honglak Lee, and Tomas Pfister. Distilling effective supervision from severe label noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9294-9303, 2020b.

Comparison with the results reported by DivideMix(Li et al., 2020) on CIFAR-10. All methods use Pre-ResNet18 as the backbone. The last epoch test accuracy for each method is reported. The noise rate is defined as the probability of replacing the label with other labels including the true label.

The best epoch accuracy for each method on Tiny-ImageNet.

Comparing CORES 2 (without consistency training) with other noise-robust methods on CIFAR-10.

Patrini et al., 2017) 88.11 83.27 75.34 90.11 89.42 88.25
Truncated Lq (Zhang & Sabuncu, 2018) 89.70 87.62 82.70 90.43 89.45 87.10
L DMI (Xu et al., 2019) 88.74 83.04 76.51 90.28 89.04 87.88
CORES 2 (without consistency training) 90.70 88.29 82.10 92.41 91.02 90.53

Analysis of each component of CORES 2 on CIFAR-10. All the methods use ResNet-34.


Acknowledgement This work is partially supported by the National Science Foundation (NSF) under grant IIS-2007951 and the Office of Naval Research under grant N00014-20-1-22.

APPENDIX

The appendices are organized as follows. Section A presents the full version of related works. Section B details the proofs for our theorems. Section C supplements other necessary evidence to justify CORES 2 . Section D shows more experimental details and results.

For simplicity, we consider a two-class classification problem. Suppose that, for a given feature x, the probability of x belonging to class 1 is p. We compare the entropy regularization (ER), denoted R_ER, with our regularization term, denoted R_CR, both viewed as functions of p. We have the following proposition:

Proposition 1. CR regularizes models more strongly than the entropy regularization in terms of gradients.

Proof. First notice that both R_ER and R_CR are symmetric functions around p = 0.5. Thus we only need to consider the situation where 0 < p < 0.5. Comparing the absolute values of the gradients of R_ER and R_CR w.r.t. p, it is easy to check that, when 0 < p < 0.5, the gradient of R_CR is larger than that of R_ER, and both gradients are larger than 0. Therefore, CR has larger gradients than the entropy regularization, i.e., CR has stronger regularization ability than ER.

We can also draw a figure to show this phenomenon. Figure 4 shows the values of R_CR and R_ER with respect to p. We can see that the gradient of our regularization is larger than that of entropy regularization, resulting in more confident predictions. We also perform an experiment to provide further evidence. Table 4 records comparison results, which show that our regularization achieves higher accuracy than the entropy term.
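The gradient comparison in Proposition 1 can be checked numerically. The sketch below rests on assumed functional forms that are not given in the text: R_ER(p) = -p ln p - (1 - p) ln(1 - p) for the entropy term and R_CR(p) = ln p + ln(1 - p) for the confidence regularizer (β = 1, constant factors dropped). Both choices are consistent with the statement that the gradients are positive on 0 < p < 0.5.

```python
import numpy as np

def grad_er(p):
    """d/dp of the assumed R_ER(p) = -p ln p - (1 - p) ln(1 - p)."""
    return np.log((1.0 - p) / p)

def grad_cr(p):
    """d/dp of the assumed R_CR(p) = ln p + ln(1 - p)."""
    return 1.0 / p - 1.0 / (1.0 - p)

# Grid strictly inside (0, 0.5); both gradients vanish at p = 0.5.
p = np.linspace(0.01, 0.49, 49)
cr_dominates = bool(np.all(grad_cr(p) > grad_er(p)))
```

Under these assumed forms, the CR gradient exceeds the ER gradient at every grid point, and the gap grows as p approaches 0, where predictions are most confident.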

