LEARNING WITH FEATURE-DEPENDENT LABEL NOISE: A PROGRESSIVE APPROACH

Abstract

Label noise is frequently observed in real-world large-scale datasets. It is introduced for a variety of reasons and is heterogeneous and feature-dependent. Most existing approaches to handling noisy labels fall into two categories: they either assume an ideal feature-independent noise, or remain heuristic without theoretical guarantees. In this paper, we propose to target a new family of feature-dependent label noise, which is much more general than the commonly used i.i.d. label noise and encompasses a broad spectrum of noise patterns. Focusing on this general noise family, we propose a progressive label correction algorithm that iteratively corrects labels and refines the model. We provide theoretical guarantees showing that, for a wide variety of (unknown) noise patterns, a classifier trained with this strategy converges to be consistent with the Bayes classifier. In experiments, our method outperforms SOTA baselines and is robust to various noise types and levels.

1. INTRODUCTION

Addressing noise in training set labels is an important problem in supervised learning. Incorrect annotation of data is inevitable in large-scale data collection, due to the intrinsic ambiguity of data/classes and mistakes of human/automatic annotators (Yan et al., 2014; Andreas et al., 2017). Developing methods that are resilient to label noise is therefore crucial in real-life applications. Classical approaches take a rather simplistic i.i.d. assumption on the label noise, i.e., the label corruption is independent and identically distributed and thus feature-independent. Methods based on this assumption either explicitly estimate the noise pattern (Reed et al., 2014; Patrini et al., 2017; Dan et al., 2019; Xu et al., 2019) or introduce extra regularizer/loss terms (Natarajan et al., 2013; Van Rooyen et al., 2015; Xiao et al., 2015; Zhang & Sabuncu, 2018; Ma et al., 2018; Arazo et al., 2019; Shen & Sanghavi, 2019). Some results prove that the commonly used losses are naturally robust against such i.i.d. label noise (Manwani & Sastry, 2013; Ghosh et al., 2015; Gao et al., 2016; Ghosh et al., 2017; Charoenphakdee et al., 2019; Hu et al., 2020). Although these methods come with theoretical guarantees, they usually do not perform as well as expected in practice, likely because the unrealistic i.i.d. assumption fails to hold: real label noise is heterogeneous and feature-dependent. A cat with an intrinsically ambiguous appearance is more likely to be mislabeled as a dog. An image with poor lighting or severe occlusion can be mislabeled, as important visual clues are imperceptible. Methods that can combat label noise of a much more general form are therefore needed to address real-world challenges.

To adapt to heterogeneous label noise, state-of-the-art (SOTA) methods often resort to a data-recalibrating strategy.
They progressively identify trustworthy data or correct data labels, and then train on these data (Tanaka et al., 2018; Wang et al., 2018; Lu et al., 2018; Li et al., 2019). The models gradually improve as more clean data are collected or more labels are corrected, eventually converging to models of high accuracy. These data-recalibrating methods best leverage the learning power of deep neural nets and achieve superior performance in practice. However, their underlying mechanism remains a mystery: no method in this category provides theoretical insight into why the model converges to an ideal one. Thus, these methods require careful hyperparameter tuning and are hard to generalize.

In this paper, we propose a novel and principled method that specifically targets heterogeneous, feature-dependent label noise. Unlike previous methods, we target a much more general family of noise, called Polynomial Margin Diminishing (PMD) label noise. In this noise family, we allow an arbitrary noise level except for data far away from the true decision boundary. This is consistent with the real-world scenario: data near the decision boundary are harder to distinguish and more likely to be mislabeled, whereas a datum far away from the decision boundary is a typical example of its true class and should have a reasonably bounded noise level.

Assuming this new PMD noise family, we propose a theoretically-guaranteed data-recalibrating algorithm that gradually corrects labels based on the noisy classifier's confidence. We start from data points with high confidence and correct their labels using the predictions of the noisy classifier. Next, the model is improved using the cleaned labels. We continue alternating label correction and model improvement until convergence; see Figure 1 for an illustration. Our main theorem shows that with a theory-informed criterion for label correction at each iteration, the improvement of the label purity is guaranteed.
Thus the model is guaranteed to improve at a sufficient rate through the iterations and eventually becomes consistent with the Bayes optimal classifier. Besides its theoretical strength, we also demonstrate the power of our method in practice. Our method outperforms others on CIFAR-10/100 with various synthetic noise patterns. We also evaluate our method against SOTAs on three real-world datasets with unknown noise patterns. To the best of our knowledge, our method is the first data-recalibrating method that is theoretically guaranteed to converge to an ideal model. The PMD noise family encompasses a broad spectrum of heterogeneous and feature-dependent noise, better approximates the real-world scenario, and provides a novel theoretical setting for the study of label noise.

Related works. We review works that do not assume i.i.d. label noise. Menon et al. (2018) generalized the work of Ghosh et al. (2015) and provided an elegant theoretical framework, showing that loss functions fulfilling certain conditions naturally resist instance-dependent noise. The method can achieve even better theoretical properties (i.e., Bayes-consistency) with stronger assumptions on the clean posterior probability η. In practice, this method has not been extended to deep neural networks. Cheng et al. (2020) proposed an active learning method for instance-dependent label noise; the algorithm iteratively queries clean labels from an oracle on carefully selected data. However, this approach is not applicable to settings where kosher annotations are unavailable. Another contemporary work (Chen et al., 2021) showed that the noise in real-world datasets is unlikely to be i.i.d., and proposed to fix the noisy labels by averaging the network predictions on each instance over the whole training process. While effective, their method lacks theoretical guarantees. Chen et al. (2019) showed that by regulating the topology of a classifier's decision boundary, one can improve the model's robustness against label noise.

Data-recalibrating methods use noisy networks' predictions to iteratively select/correct data and improve the models. Tanaka et al. (2018) introduced a joint training framework which simultaneously enforces the network to be consistent with its own predictions and corrects the noisy labels during training. Wang et al. (2018) identified noisy labels as outliers based on their label consistency with surrounding data. Lu et al. (2018) used a curriculum learning strategy in which a teacher net, trained on a small kosher dataset, determines whether a datum is clean; the learnt curriculum, which assigns a weight to each datum, is then fed into the student net for training and inference. (Yu et al., 2019; Bo et al., 2018) trained two synchronized networks and used their confidence and consistency to identify clean data. Wu et al. (2020) selected clean data by investigating the topological structure of the training data in the learned feature space. For completeness, we also refer to other methods of similar design (Li et al., 2017; Vahdat, 2017; Andreas et al., 2017; Ma et al., 2018; Thulasidasan et al., 2019; Arazo et al., 2019; Shu et al., 2019; Yi & Wu, 2019). As for theoretical guarantees, Ren et al. (2018) proposed an algorithm that iteratively re-weights each data point by solving an optimization problem. They proved the convergence of the training, but provided no guarantee that the model converges to an ideal one. Amid et al. (2019b) generalized the work of Amid et al. (2019a) and proposed a tempered matching loss, showing that when the final softmax layer is replaced by the bi-tempered loss, the resulting classifier is Bayes consistent. Zheng et al. (2020) proved a one-shot guarantee for their data-recalibrating method, but the convergence of the model is not guaranteed.
Our method is the first data-recalibrating method which is guaranteed to converge to a well-behaved classifier.

2. METHOD

We start by introducing the family of Poly-Margin Diminishing (PMD) label noise. In Section 2.2, we present our main algorithm. Finally, we prove the correctness of our algorithm in Section 3.

Notations and preliminaries. Although the noise setting and algorithm naturally generalize to multi-class, for simplicity we focus on binary classification. Let the feature space be X. We assume the data (x, y) are sampled from an underlying distribution D on X × {0, 1}. Define the posterior probability η(x) = P[y = 1 | x]. Let τ_{0,1}(x) = P[ỹ = 1 | y = 0, x] and τ_{1,0}(x) = P[ỹ = 0 | y = 1, x] be the noise functions, where ỹ denotes the corrupted label. For example, if a datum x has true label y = 0, it has probability τ_{0,1}(x) of being corrupted to 1; similarly, it has probability τ_{1,0}(x) of being corrupted from 1 to 0. Let η̃(x) = P[ỹ = 1 | x] be the noisy posterior probability of ỹ = 1 given feature x. Let η*(x) = 1{η(x) ≥ 1/2} be the (clean) Bayes optimal classifier, where the indicator 1{A} equals 1 if A is true and 0 otherwise. Finally, let f(x) : X → [0, 1] be the classifier scoring function (in this paper, the softmax output of a neural network).
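To ground the notation, here is a minimal sketch of the binary noise model in Python. The posterior η and the noise functions τ_{0,1}, τ_{1,0} below are illustrative choices of ours, not taken from the paper:

```python
import math
import random

def eta(x):
    """Clean posterior eta(x) = P[y = 1 | x] for a toy 1-D problem (illustrative)."""
    return 1.0 / (1.0 + math.exp(-4.0 * x))

def bayes_star(x):
    """Clean Bayes optimal classifier eta*(x) = 1{eta(x) >= 1/2}."""
    return 1 if eta(x) >= 0.5 else 0

def tau_01(x):
    """Noise function P[y~ = 1 | y = 0, x] (illustrative choice)."""
    return 0.3 * eta(x)

def tau_10(x):
    """Noise function P[y~ = 0 | y = 1, x] (illustrative choice)."""
    return 0.3 * (1.0 - eta(x))

def sample_noisy(x, rng):
    """Draw a clean label from eta(x), then corrupt it with the tau functions."""
    y = 1 if rng.random() < eta(x) else 0
    flip_prob = tau_10(x) if y == 1 else tau_01(x)
    y_tilde = 1 - y if rng.random() < flip_prob else y
    return y, y_tilde
```

Sampling `(y, ỹ)` pairs from this model reproduces the setting above: the corrupted label ỹ disagrees with y at a feature-dependent rate.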

2.1. POLY-MARGIN DIMINISHING NOISE

We first introduce the family of noise functions τ that this paper addresses: polynomial margin diminishing (PMD) noise, which only upper bounds the noise τ on certain level sets of η(x), allowing τ to be arbitrarily high outside this restricted domain. This formulation not only covers the feature-independent scenario but also generalizes the scenarios proposed by (Du & Cai, 2015; Menon et al., 2018; Cheng et al., 2020).

Definition 1 (PMD noise). A pair of noise functions τ_{0,1}(x) and τ_{1,0}(x) is polynomial-margin diminishing (PMD) if there exist constants t_0 ∈ (0, 1/2) and c_1, c_2 > 0 such that:

    τ_{1,0}(x) ≤ c_1 [1 − η(x)]^{1+c_2},  ∀ η(x) ≥ 1/2 + t_0;
    τ_{0,1}(x) ≤ c_1 [η(x)]^{1+c_2},      ∀ η(x) ≤ 1/2 − t_0.    (1)

We abuse notation by referring to t_0 as the "margin" of τ. Note that the PMD condition only requires the upper bound on τ to be polynomial and monotonically decreasing in the region where the Bayes classifier is fairly confident. For the region {x : |η(x) − 1/2| < t_0}, we allow both τ_{0,1}(x) and τ_{1,0}(x) to be arbitrary. Figure 2(d) illustrates the upper bound (orange curve) and a sample noise function (blue curve). We also show the data corrupted according to this noise function (black points are the clean data, whereas red points are the data with corrupted labels).

The PMD noise family is much more general than existing noise assumptions. For example, boundary consistent noise (BCN) (Du & Cai, 2015; Menon et al., 2018) assumes a noise function that monotonically decreases as the data move away from the decision boundary; see Figure 2(c) for an illustration. This noise is much more restrictive than our PMD noise, which (1) only requires a monotonic upper bound, and (2) allows arbitrary noise strength in a wide buffer near the decision boundary. Figure 2(b) shows a traditional feature-independent noise pattern (Reed et al., 2014; Patrini et al., 2017), which assumes τ_{0,1}(x) (resp. τ_{1,0}(x)) to be a constant independent of x.

2.2. THE PROGRESSIVE CORRECTION ALGORITHM

Our algorithm iteratively trains a neural network and corrects labels. We start with a warm-up period, in which we train the neural network (NN) on the original noisy data. This allows us to obtain a reasonable network before it starts fitting noise (Zhang et al., 2017). After the warm-up period, the classifier can be used for label correction. We only correct a label on which the classifier f has very high confidence. The intuition is that under our noise assumption, there exists a "pure region" in which the prediction of the noisy classifier f is highly confident and consistent with the clean Bayes optimal classifier η*; label correction therefore gives clean labels within this pure region. In particular, we select a high threshold θ: if f predicts a different label than ỹ and its confidence is above the threshold, |f(x) − 1/2| > θ, we flip the label ỹ to the prediction of f. We repeatedly correct labels and improve the network until no label is corrected. Next, we slightly decrease the threshold θ, use the decreased threshold for label correction, and improve the model accordingly. We continue this process until convergence. For convenience in the theoretical analysis, the algorithm maintains a continuously increasing quantity T and sets θ = 1/2 − T. Our algorithm is summarized in Algorithm 1; we term it PLC (Progressive Label Correction). In Section 3, we show that this iterative algorithm converges to be consistent with the clean Bayes optimal classifier η*(x) for most input instances.
Algorithm 1 Progressive Label Correction

Input: dataset S = {(x_1, y_1^0), ..., (x_n, y_n^0)}, initial NN f(x), step size β, initial and end thresholds (T_0, T_end), warm-up length m, total rounds N
Output: f_final(·)

 1: T ← T_0
 2: θ ← 1/2 − T_0
 3: for t ← 1, ..., N do
 4:     Train f(x) on S
 5:     for all (x_i, y_i^{t−1}) ∈ S with |f(x_i) − 1/2| ≥ θ do
 6:         y_i^t ← 1{f(x_i) ≥ 1/2}
 7:     end for
 8:     if t ≥ m then
 9:         θ ← 1/2 − T
10:         if y_i^t = y_i^{t−1} for all i ∈ {1, ..., n} then
11:             T ← min(T(1 + β), T_end)
12:         end if
13:     end if
14:     S ← {(x_1, y_1^t), ..., (x_n, y_n^t)}
15: end for

Generalizing to the multi-class scenario. In the multi-class scenario, denote by f_i(x) the classifier's predicted probability of label i, and let h_x = argmax_i f_i(x) be the classifier's class prediction. We replace the f(x) − 1/2 term with the gap between the highest confidence f_{h_x}(x) and the confidence on the current label y, f_y(x). If the absolute difference between these two confidences is larger than the threshold θ, we correct y to h_x. In practice, we find that using the difference of the logarithms is more robust.
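The correction step and the threshold update of Algorithm 1, in the multi-class log-confidence-gap variant described above, might be sketched as follows; `probs` is a stand-in for the softmax outputs of the current network, and the inputs are illustrative:

```python
import math

def correct_labels(probs, labels, theta):
    """One pass of label correction (multi-class variant).

    probs: per-sample lists of class probabilities (softmax outputs).
    labels: current, possibly noisy, labels.
    theta: confidence-gap threshold; flip y to h_x when
           log f_{h_x}(x) - log f_y(x) exceeds theta.
    """
    new_labels, changed = [], 0
    for p, y in zip(probs, labels):
        h = max(range(len(p)), key=p.__getitem__)  # class prediction h_x
        gap = math.log(p[h]) - math.log(p[y])      # log-confidence gap
        if h != y and gap > theta:
            new_labels.append(h)
            changed += 1
        else:
            new_labels.append(y)
    return new_labels, changed

def next_threshold(T, beta, T_end):
    """Anneal T only after a round that corrects no labels (line 11)."""
    return min(T * (1.0 + beta), T_end)
```

A round that returns `changed == 0` triggers `next_threshold`, which loosens the correction criterion for the next rounds, mirroring lines 10-11 of Algorithm 1.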

3. ANALYSIS

Our analysis focuses on the asymptotic case and answers the following question: Given infinitely many data with corrupted labels, is it possible to learn a reasonably good classifier? We show that if the noise satisfies the arguably-general PMD condition, the answer is yes. Assuming mild conditions on the hypothesis class of the machine learning model and the distribution D, we prove that Algorithm 1 obtains a nearly clean classifier. This reduces the challenge of noisy label learning from a realizable problem into a sample complexity problem. In this work we only focus on the asymptotic case, and leave the sample complexity for future work.

3.1. ASSUMPTIONS

Our first assumption requires the model class to be able to at least approximate the Bayes classifier. It says that, given a hypothesis class H of sufficient complexity, the approximation gap between a classifier f(x) in this class and the noisy posterior η̃(x) is controlled by the inconsistency between the noisy labels and the Bayes optimal classifier.

Definition 2 (Level set (α, ε)-consistency). Suppose data are sampled as (x, ỹ) ∼ D(x, η̃(x)) and f(x) = argmin_{h∈H} E_{(x,ỹ)∼D(x,η̃(x))} Loss(h(x), ỹ). Given ε < 1/2, we call H (α, ε)-consistent if:

    |f(x) − η̃(x)| ≤ α E_{(z,ỹ_z)∼D(z,η̃(z))} [ 1{ỹ_z ≠ η*(z)} · 1{|η(z) − 1/2| ≥ |η(x) − 1/2| + ε} ].    (2)

For two input instances z and x such that |η(z) − 1/2| ≥ |η(x) − 1/2| + ε (and hence the clean Bayes optimal classifier is more confident at z than at x), the indicator 1{ỹ_z ≠ η*(z)} equals 1 if the noisy label at the more confident point z is inconsistent with η*(z). This condition says that the approximation error of the classifier at x is controlled by the rate at which noisy labels disagree with η*(·) at points z where η*(·) is more confident than it is at x.

We next define a regularity condition on the data distribution, which describes the continuity of the level set density function.

Definition 3 (Level set bounded distribution). Define the margin t(x) = |η(x) − 1/2| and let G(t) = P_{x∼D}(|η(x) − 1/2| ≤ t) be the cdf of t. Let g(t) = G′(t) be the density function of t. We say the distribution D is (c_*, c^*)-bounded if for all 0 ≤ t ≤ 1/2, 0 < c_* ≤ g(t) ≤ c^*.

This condition enforces the continuity of the level set density function. It is crucial in the analysis, since such continuity allows a point to borrow information from its neighborhood, so that a clean neighbor can help correct a corrupted label. To simplify notation, we will omit D in the subscript from now on. We will assume:

Assumption 1. There exist constants α, ε, c_*, c^* such that the hypothesis class H is (α, ε)-consistent and the unknown distribution D is (c_*, c^*)-bounded.

3.2. MAIN RESULT AND PROOF SKETCH

In this section we first state our main result, and then present the supporting claims; complete proofs can be found in the appendix. Our main result below states that if our starting function is trained correctly, i.e., f(x) = argmin_{h∈H} E_{(x,ỹ)∼D(x,η̃(x))} Loss(h(x), ỹ), then Algorithm 1 terminates with most of the final labels matching the labels of the Bayes optimal classifier. In practice, minimizing the true risk is not achievable; instead, the empirical risk is used as an estimate, which approaches the true risk asymptotically. For a scoring function f, we denote by y_f(x) := 1{f(x) ≥ 1/2} the label predicted by f.

Theorem 1. Under Assumption 1, for any noise τ which is PMD with margin t_0, define e_0 = max(t_0, (α + ε)/(1 + 2α)). Then for the output of Algorithm 1 with f as above and with the following initializations: (1) T_0 < 1/2 − e_0, (2) m ≥ (α/ε) log(2T_0 / (1 − 2e_0)), (3) N ≥ m + (1/β) log((1 − 6ε)/(2T_0)), (4) T_end ≤ 1/2 − 3ε, and (5) ε/α ≤ β ≤ 2ε/α, we have:

    P_{x∼D}[y_{f_final}(x) = η*(x)] ≥ 1 − 3c^* ε.

In the remainder of this section we shall assume that the noise τ is PMD with margin t_0. To prove our result we first define a "pure" level set.

Definition 4 (Pure (e, f, η)-level set). A set L(e, η) := {x : |η(x) − 1/2| ≥ e} is pure for f if y_f(x) = η*(x) for all x ∈ L(e, η).

We now state a lemma that forms the foundation of our progressive correction algorithm. We show that given a tiny region where the model is reliable, we can move one step forward by trusting the model. Although the improvement in a single round is slight, it empowers a conservatively recursive step in Algorithm 1.

Lemma 1 (One round purity improvement). Suppose Assumption 1 is satisfied, and assume an f for which there exists a pure (e, f, η)-level set with 3ε ≤ e < 1/2. Let η̃_new(x) = y_f(x) if |f(x) − 1/2| ≥ e, and η̃_new(x) = η̃(x) if |f(x) − 1/2| < e, and let f_new = argmin_{h∈H} E_{(x,ỹ)∼D(x,η̃_new(x))} Loss(h(x), ỹ). Let e_new = min{e′ > 0 : L(e′, η) is pure for f_new}. Then 1/2 − e_new ≥ (1 + ε/α)(1/2 − e).
The above lemma states that the cleansed region is enlarged by at least a constant factor each round. The next lemma justifies the functionality of the first m warm-up rounds: since the initial neural network can behave badly, the region where the classifier can be trusted may be very limited, so before starting the flipping procedure on a relatively larger level set, one first needs to expand the initial tiny region of size 1/2 − e_0 to a constant T_0.

Lemma 2 (Warm-up rounds). Suppose for a given function f_0 there exists a level set L(e_0, η) which is pure for f_0. Given T_0 < 1/2, after running Algorithm 1 for m ≥ (α/ε) log(2T_0 / (1 − 2e_0)) rounds, there exists a level set L(1/2 − T_0, η) that is pure for f_0.

Next we present our final lemma, which combines the previous two lemmata.

Lemma 3. Suppose Assumption 1 is satisfied, and for a given function f_0 there exists a level set L(e_0, η) which is pure for f_0. If one runs Algorithm 1 starting with f_0 and the initializations: (1) T_0 < 1/2 − e_0, (2) m ≥ (α/ε) log(2T_0 / (1 − 2e_0)), (3) N ≥ m + (1/β) log((1 − 6ε)/(2T_0)), (4) T_end ≤ 1/2 − 3ε, and (5) ε/α ≤ β ≤ 2ε/α, then we have P_{x∼D}[y_{f_final}(x) = η*(x)] ≥ 1 − 3c^* ε.

This lemma states that given an initial model with a reasonably pure super level set, one can progressively correct a large fraction of corrupted labels by running Algorithm 1 sufficiently long with carefully chosen parameters. The limit of Algorithm 1 depends on the approximation ability of the neural network, which is characterized by the parameter ε in Definition 2. To prove Theorem 1 using Lemma 3, it suffices to obtain a model with a reliable region; this is provably achievable by training with a family of good scoring functions on PMD-noisy data.
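To get a feel for the geometric growth guaranteed by Lemma 1, the toy iteration below counts how many rounds are needed for the pure-region size 1/2 − e to grow from its initial value to the terminal margin 3ε; the constants e_0, ε, α are illustrative assumptions, not values from the paper:

```python
def rounds_to_purity(e0, eps, alpha):
    """Iterate the Lemma 1 guarantee: each round the pure-region size
    1/2 - e grows by a factor of at least (1 + eps/alpha), until the
    margin e reaches 3*eps (the limit behind the 1 - 3c*eps bound)."""
    size = 0.5 - e0            # current guaranteed pure-region size
    target = 0.5 - 3.0 * eps   # size corresponding to margin e = 3*eps
    rounds = 0
    while size < target:
        size *= 1.0 + eps / alpha
        rounds += 1
    return rounds
```

A larger ε both speeds up the per-round growth and relaxes the terminal margin, so fewer rounds are needed, at the price of a weaker final purity bound.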

4. EXPERIMENTS

We evaluate our method on both synthetic and real-world datasets. We first conduct synthetic experiments on two public datasets, CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). To synthesize the label noise, we first approximate the true posterior probability η using the confidence predictions of a clean neural network (trained with the original clean labels). We call these original labels raw labels. Then we sample y_x ∼ η(x) for each instance x. Instead of using the raw labels, we use these sampled labels y_x as the clean labels; their posterior probabilities are exactly η(x), and therefore the neural network is the Bayes optimal classifier η* : X → {1, ..., C}, where C is the number of classes. Note that in the multi-class setting, η(x) is vector-valued and η_i(x) is its i-th element.

Noise generation. We consider a generic family of noise: not only feature-dependent noise, but also hybrid noise consisting of both feature-dependent and i.i.d. noise. For feature-dependent noise, we use three types of noise functions within the PMD noise family. To make the noise challenging enough, for input x we always corrupt the label from the most confident category u_x to the second most confident category s_x, according to η(x). Because s_x is the class that confuses η*(x) the most, this noise hurts the network's performance the most. Note that y_x is sampled from η(x), which has quite extreme confidence; thus we generally assume y_x is u_x. For each datum x, we either flip its label to s_x or keep it as u_x. The three noise functions are as follows:

    Type-I:   τ_{u_x,s_x} = −(1/2) [η_{u_x}(x) − η_{s_x}(x)]^2 + 1/2,
    Type-II:  τ_{u_x,s_x} = 1 − [η_{u_x}(x) − η_{s_x}(x)]^3,
    Type-III: τ_{u_x,s_x} = 1 − (1/3) { [η_{u_x}(x) − η_{s_x}(x)]^3 + [η_{u_x}(x) − η_{s_x}(x)]^2 + [η_{u_x}(x) − η_{s_x}(x)] }.

Notice that the noise level is determined naturally by η(x) and cannot be controlled directly.
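A sketch of the three noise functions, written in terms of the confidence gap η_{u_x}(x) − η_{s_x}(x), together with the constant rescaling used to hit a target noise level (the exact rescaling rule below is our assumption; the text only states that a constant factor is used):

```python
def noise_rate(eta_u, eta_s, noise_type):
    """Flip probability from the top class u_x to the runner-up s_x,
    as a function of the confidence gap, for the three noise types."""
    gap = eta_u - eta_s
    if noise_type == 1:                                   # Type-I
        return -0.5 * gap ** 2 + 0.5
    if noise_type == 2:                                   # Type-II
        return 1.0 - gap ** 3
    if noise_type == 3:                                   # Type-III
        return 1.0 - (gap ** 3 + gap ** 2 + gap) / 3.0
    raise ValueError(noise_type)

def scale_to_level(rates, level):
    """Multiply all rates by one constant so the expected corrupted
    fraction matches the target level (clipped to valid probabilities)."""
    c = level / (sum(rates) / len(rates))
    return [min(1.0, c * r) for r in rates]
```

All three types vanish when the gap is 1 (a maximally confident, typical example) and are large near a gap of 0, matching the PMD picture of unconstrained noise near the decision boundary.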
To change the noise level, we multiply τ_{u_x,s_x} by a constant factor such that the final proportion of noise matches our requirement. For PMD noise alone, we test noise levels 35% and 70%, meaning that 35% and 70% of the data are corrupted, respectively. For i.i.d. noise we follow convention and adopt the commonly used uniform noise and asymmetric noise (Patrini et al., 2017). We artificially corrupt the labels by constructing a noise transition matrix T, where T_ij = P(ỹ = j | y = i) = τ_ij is the probability that a true label y = i is flipped to j. Then for each sample with label i, we replace its label with one sampled from the probability distribution given by the i-th row of T. We consider two kinds of i.i.d. noise in this work. (1) Uniform noise: the true label i is corrupted uniformly to the other classes, i.e., T_ij = τ/(C − 1) for i ≠ j, and T_ii = 1 − τ, where τ is the constant noise level. (2) Asymmetric noise: the true label i is flipped to a specific class j or stays unchanged, with probabilities T_ij = τ and T_ii = 1 − τ, respectively.

Baselines. We compare our method with several recently proposed approaches: (1) GCE (Zhang & Sabuncu, 2018); (2) Co-teaching+ (Yu et al., 2019); (3) SL (Wang et al., 2019); (4) LRT (Zheng et al., 2020). All these methods are generic and handle label noise without assuming a noise structure. Finally, we also report results of the standard method, which simply trains the deep network on the noisy dataset in the usual manner. During training, we use a batch size of 128 and train the network for 180 epochs to ensure the convergence of all methods. We train the network with the SGD optimizer, with initial learning rate 0.01. We randomly repeat the experiments 3 times, and report the mean and standard deviation values. Our code is available at https://github.com/pxiangwu/PLC.
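The i.i.d. corruption described above can be sketched directly from the transition matrices; the choice of the designated flip target in the asymmetric case (here, class i+1 mod C) is an illustrative assumption:

```python
import random

def uniform_T(C, tau):
    """Uniform noise: T_ij = tau/(C-1) off-diagonal, T_ii = 1 - tau."""
    return [[1.0 - tau if i == j else tau / (C - 1) for j in range(C)]
            for i in range(C)]

def asymmetric_T(C, tau):
    """Asymmetric noise: class i flips to one designated class with
    probability tau (mapping i -> i+1 mod C is an illustrative choice)."""
    T = [[0.0] * C for _ in range(C)]
    for i in range(C):
        T[i][i] = 1.0 - tau
        T[i][(i + 1) % C] += tau
    return T

def corrupt(labels, T, rng):
    """Replace each label y with a sample drawn from row T[y]."""
    C = len(T)
    return [rng.choices(range(C), weights=T[y])[0] for y in labels]
```

With `tau = 0` the labels pass through unchanged; with `tau = 1` in the asymmetric case every label is deterministically shifted to its designated target.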

Results. Table 1 lists the performance of different methods under three types of feature-dependent noise at noise levels 35% and 70%. We observe that our method achieves the best performance across the different noise settings. Moreover, notice that some of the baselines perform worse than the standard approach. A possible reason is that these methods behave too conservatively in dealing with noise: they use only a small subset of the original training set, which is not representative enough to grant the model good discriminative ability. In Table 2 we show the results on datasets corrupted with a combination of feature-dependent noise and i.i.d. noise, which yields overall noise levels ranging from 50% to 70% (in terms of the proportion of corrupted labels). The i.i.d. noise is overlaid on the feature-dependent noise. Our method outperforms the baselines under these more complicated noise patterns. In contrast, when the noise level is high, as in the cases where we further apply an additional 30% or 60% uniform noise, the performance of a few baselines deteriorates and becomes worse than the standard approach. We carry out ablation studies on the hyperparameters θ_0 (determining the initial confidence threshold for label correction, see Algorithm 1) and β (the step size). In Tables 3 and 4, we show that our method is robust to the choice of θ_0 and β over a wide range. Notice that when comparing against the threshold θ_0, we compute the absolute difference of log f_y(x) and log f_{h_x}(x); as mentioned in Section 2.2, this operation gives good performance in practice.

Results on real-world noisy datasets. To test the effectiveness of the proposed method under real-world label noise, we conduct experiments on the Clothing1M dataset (Xiao et al., 2015). This dataset contains 1 million clothing images obtained from online shopping websites, with 14 categories. The labels in this dataset are quite noisy, with an unknown underlying structure.
This dataset provides 50k, 14k and 10k manually verified clean data for training, validation and testing, respectively. Following (Tanaka et al., 2018; Yi & Wu, 2019), in our experiment we discard the 50k clean training data and evaluate the classification accuracy on the 10k clean data. Also following (Yi & Wu, 2019), we use a randomly sampled, pseudo-balanced subset as the training set, which includes about 260k images. We set the batch size to 32 and the learning rate to 0.001, adopt the SGD optimizer, and use ResNet-50 with weights pre-trained on ImageNet, as in (Tanaka et al., 2018; Yi & Wu, 2019). We compare our method with the following baselines: (1) Standard; (2) Forward Correction (Patrini et al., 2017); (3) D2L (Ma et al., 2018); (4) JO (Tanaka et al., 2018); (5) PENCIL (Yi & Wu, 2019); (6) DY (Arazo et al., 2019); (7) GCE (Zhang & Sabuncu, 2018); (8) SL (Wang et al., 2019); (9) MLNT (Li et al., 2019); (10) LRT (Zheng et al., 2020). In Table 5 we observe that our method achieves the best performance, suggesting the applicability of our label correction strategy in real-world scenarios.

Apart from Clothing1M, we also test our method on another, smaller dataset, Food-101N (Lee et al., 2018). Food-101N is a dataset for food classification and consists of 310k training images collected from the web; the estimated label purity is 80%. Following (Lee et al., 2018), the classification accuracy is evaluated on the Food-101 (Bossard et al., 2014) testing set, which contains 25k images with curated annotations. We use ResNet-50 pre-trained on ImageNet. We train the network for 30 epochs with the SGD optimizer. The batch size is 32 and the initial learning rate is 0.005, which is divided by 10 every 10 epochs. We also adopt simple data augmentation, including random horizontal flips and resizing each image so that its short edge is 256 before cropping a patch from the resized image.
We repeat the experiments with 3 random trials and report the mean value and standard deviation. The results are shown in Table 6; our method improves substantially upon the previous approaches. Finally, we test our method on a recently proposed real-world dataset, ANIMAL-10N (Song et al., 2019). This dataset contains human-labeled online images of 10 animals with confusing appearances; the estimated label noise rate is 8%. There are 50,000 training and 5,000 testing images. Following (Song et al., 2019), we use VGG-19 with batch normalization and the SGD optimizer. Also following (Song et al., 2019), we train the network for 100 epochs with an initial learning rate of 0.1, which is divided by 5 at 50% and 75% of the total number of epochs. We repeat the experiments with 3 random trials and report the mean value and standard deviation. As shown in Table 7, our method outperforms the existing baselines.
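The step learning-rate schedules quoted above (e.g., dividing by 5 at 50% and 75% of the epochs for ANIMAL-10N) amount to a small helper; the sketch below is framework-agnostic and uses only the numbers stated in the text:

```python
def lr_at_epoch(epoch, total_epochs=100, base_lr=0.1, factor=5.0,
                milestones=(0.5, 0.75)):
    """Step schedule: divide base_lr by `factor` at each milestone,
    where milestones are fractions of the total number of epochs."""
    drops = sum(1 for m in milestones if epoch >= m * total_epochs)
    return base_lr / factor ** drops
```

The Food-101N schedule (divide by 10 every 10 epochs over 30 epochs) is the same helper with `base_lr=0.005`, `factor=10.0`, and milestones at 1/3 and 2/3.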

5. CONCLUSION

We propose a novel family of feature-dependent label noise that is much more general than the traditional i.i.d. noise pattern. Building upon this noise assumption, we propose the first data-recalibrating method that is theoretically guaranteed to converge to a well-behaved classifier. On synthetic datasets, we show that our method outperforms various baselines under different feature-dependent noise patterns subject to our assumption. We also test our method on several real-world noisy datasets and observe superior performance over existing approaches. The proposed noise family offers a new theoretical setting for the study of label noise.



Figure 1: Illustration of the algorithm using synthetic data. (a) Gaussian blob with clean labels (η*(x)). (b) Data with corrupted labels. (c) Final corrected data. Black dots are data with clean labels; red dots are noisy data. Points that remain uncorrected are closer to the decision boundary; our algorithm corrects most of the noise using only the noisy classifier's confidence. (d) Data after label correction. (e)-(h) Intermediate results at different iterations. The gray region is the area where the classifier has high confidence; labels within this region are corrected.

Figure 2: Illustration of different noise functions. (a) The original data: Gaussian blob with clean labels (by clean label, we refer to the prediction of the Bayes optimal classifier η*(x), not y). The confident region of η (and thus f) in this case is where η(x) is close to 0 or 1. Blue and green dots correspond to different classes. (b) Uniform label noise: each point has an equal probability of being flipped. Red dots are data with corrupted labels; black dots are data that are not corrupted. (c) BCN noise: the noise level decreases as η*(x) becomes confident. (d) PMD noise: the noise level (blue) is only upper bounded by a diminishing polynomial function when η(x) is above or below a certain threshold. The upper bound is shown as a solid orange curve; the dashed orange portion indicates that the noise level near the decision boundary is unbounded.

we define the worst-case density-imbalance ratio of D as c^*/c_*.
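As a toy numerical check of this quantity, the snippet below evaluates the ratio for a standard normal density restricted to [-1, 1]; this example distribution is an assumption of the sketch, not one used in the paper.

```python
import numpy as np

# A standard normal density restricted to [-1, 1] is (c_*, c^*)-bounded
# with c_* = exp(-1/2)/Z and c^* = 1/Z for the normalizer Z, so the
# worst-case density-imbalance ratio c^*/c_* is exp(1/2), independent of Z.
xs = np.linspace(-1.0, 1.0, 10001)
pdf = np.exp(-xs ** 2 / 2.0)      # unnormalized: the ratio is scale-free
ratio = pdf.max() / pdf.min()     # numerical estimate of c^*/c_*
```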

We compare against the following baselines: (1) GCE (Zhang & Sabuncu, 2018); (2) Co-teaching+ (Yu et al., 2019); (3) SL (Wang et al., 2019); (4) LRT (Zheng et al., 2020).

Test accuracy (%) on CIFAR-10 and CIFAR-100 under different feature-dependent noise types and levels. The average accuracy and standard deviation over 3 trials are reported. (One column per compared method; the leading rows of the CIFAR-10 block did not survive extraction.)

CIFAR-10:
  Type-II (70%): ± 1.12 | 45.44 ± 0.64 | 40.30 ± 1.46 | 41.11 ± 1.92 | 44.67 ± 3.89 | 46.04 ± 2.20
  Type-III (35%): 76.89 ± 0.79 | 78.38 ± 0.67 | 79.18 ± 0.61 | 78.81 ± 0.29 | 81.08 ± 0.35 | 81.50 ± 0.50
  Type-III (70%): 43.32 ± 1.00 | 41.90 ± 0.86 | 37.10 ± 0.59 | 38.49 ± 1.46 | 44.47 ± 1.23 | 45.05 ± 1.13

CIFAR-100:
  Type-I (35%): ± 0.29 | 56.70 ± 0.71 | 58.37 ± 0.18 | 55.20 ± 0.33 | 56.74 ± 0.34 | 60.01 ± 0.43
  Type-I (70%): 39.32 ± 0.43 | 39.53 ± 0.28 | 40.01 ± 0.71 | 40.02 ± 0.85 | 45.29 ± 0.43 | 45.92 ± 0.61
  Type-II (35%): 57.83 ± 0.25 | 56.57 ± 0.52 | 58.11 ± 1.05 | 56.10 ± 0.73 | 57.25 ± 0.68 | 63.68 ± 0.29
  Type-II (70%): 39.30 ± 0.32 | 36.84 ± 0.39 | 37.75 ± 0.46 | 38.45 ± 0.45 | 43.71 ± 0.51 | 45.03 ± 0.50
  Type-III (35%): 56.07 ± 0.79 | 55.77 ± 0.98 | 57.51 ± 1.16 | 56.04 ± 0.74 | 56.57 ± 0.30 | 63.68 ± 0.29
  Type-III (70%): 40.01 ± 0.18 | 35.37 ± 2.65 | 40.53 ± 0.60 | 39.94 ± 0.84 | 44.41 ± 0.19 | 44.45 ± 0.62

Test accuracy (%) on CIFAR-10 and CIFAR-100 under different hybrid noise types and levels. The average accuracy and standard deviation over 3 trials are reported. (One column per compared method; the label of the first surviving row did not survive extraction.)

  (row label lost): ± 0.16 | 22.89 ± 0.75 | 36.82 ± 0.49 | 37.65 ± 1.42 | 22.81 ± 0.72 | 50.73 ± 2.16
  Type-III + 30% Asymmetric: 45.70 ± 0.12 | 49.38 ± 0.86 | 50.87 ± 1.12 | 48.15 ± 0.90 | 50.31 ± 0.39 | 54.56 ± 1.11

The effect of θ0 on the performance.

The effect of β on the performance.

Test accuracy (%) on Clothing1M.

Test accuracy (%) on Food-101N.

Test accuracy (%) on ANIMAL-10N.


Acknowledgement. The authors acknowledge support from US National Science Foundation (NSF) awards CRII-1755791, CCF-1910873, CCF-1855760. This effort was partially supported by the Intelligence Advanced Research Projects Agency (IARPA) under the contract W911NF20C0038. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Appendix

Proof: We analyze the case where η(x) > 1/2; the other case can be derived similarly. Since there exists a level set L(e, η) that is pure for f, the following holds for every x with e ≤ |η(x) - 1/2|:

Now consider x where e - γ ≤ |η(x) - 1/2|. Since the distribution D is (c_*, c^*)-bounded, f gives the same label as η*(x), and thus the level set L(e - γ, η) becomes pure for f. Meanwhile, the choice of γ ensures that 1/2 - e_new ≥ (1 + ε/α)(1/2 - e).

Lemma 2 (Warm-up rounds). Suppose that for a given function f0 there exists a level set L(e0, η) which is pure for f0. Given T0 < 1/2, after running Algorithm 1 for m ≥ (α/ε) log(2T0 / (1 - 2e0)) rounds, there exists a level set L(1/2 - T0, η) that is pure for f0.

Proof: The claim follows from the fact that each round of label flipping improves the purity by a factor of (1 + ε/α). To obtain a region that is pure at the desired level, it suffices to repeat the flipping step for m ≥ (α/ε) log(2T0 / (1 - 2e0)) rounds.

Lemma 3. Suppose Assumption 1 is satisfied, and for a given function f0 there exists a level set L(e0, η) which is pure for f0. Suppose one runs Algorithm 1 starting with f0 and with the initializations (1)-(5), where in particular (4) T_end ≤ 1/2 - 3ε.

Proof: The proof combines Lemma 1 and Lemma 2. In the first m iterations, Lemma 2 guarantees a level set L(1/2 - T0, η) that is pure for f. In the remaining iterations we ensure that the level set {x : |η(x) - 1/2| ≥ 1/2 - T} is pure. We increase T by a factor of β, which avoids incurring too many corrupted labels while still ensuring enough progress in label purification; this condition guarantees the correctness of flipping as long as T ≤ 1/2 - 3ε. The purity cannot be improved once T ≥ 1/2 - 3ε = T_end, since there is no guarantee that f(x) is consistent with η(x) when |η(x) - 1/2| < 3ε and |η(x) - f(x)| ≤ 3ε. By the (c_*, c^*)-boundedness of D, the mass of the impure 3ε-level-set region is at most 3εc^*.

Theorem 1. Under Assumption 1, for any noise τ that is PMD with margin t0, define e0 = max(t0, (α + ε)/(1 + 2α)). Then the guarantee holds for the output of Algorithm 1 with f as above and with the initializations (1)-(5), where in particular (4) T_end ≤ 1/2 - 3ε.

Proof: The proof is based on Lemma 3, together with a verification that there exists an f0 admitting a pure level set L(e0, η). On the level set |η(x) - 1/2| ≥ e0 we have P_z[ỹ_z ≠ η*(z) | |η(z) - 1/2| ≥ e0] ≤ 1/2 - e0 + τ(z). By level set (α, ε)-consistency, it suffices to satisfy α(1/2 - e0 + τ) + ε ≤ e0 to ensure that f(x) makes the same prediction as η(x) whenever |η(x) - 1/2| ≥ e0. By the polynomial level set diminishing noise condition, τ(x) ≤ 1/2 - e0 whenever e0 > t0, and thus choosing e0 = max(t0, (α + ε)/(1 + 2α)) ensures that the initial f0 has a pure level set L(e0, η). The rest of the proof follows from Lemma 3.
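The flipping-and-refitting loop analyzed in the lemmas above can be sketched end to end on toy data. Here a plain logistic model stands in for the (α, ε)-consistent learner, and theta0, beta, the round count, and the geometric threshold schedule are illustrative choices rather than the paper's exact Algorithm 1 hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_logistic(X, y, iters=500, lr=0.3):
    # Plain batch gradient descent on the logistic loss; a stand-in for
    # the consistent learner assumed by the theory.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        z = np.clip(Xb @ w, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(X, w):
    z = np.clip(np.hstack([X, np.ones((len(X), 1))]) @ w, -30, 30)
    return 1.0 / (1.0 + np.exp(-z))

def progressive_correction(X, y_noisy, theta0=0.4, beta=1.5, rounds=8):
    # Progressive loop: refit on the current labels, then overwrite any
    # label where the model's confidence margin |f(x) - 1/2| exceeds the
    # current threshold theta; dividing theta by beta each round enlarges
    # the trusted region step by step.
    y, theta = y_noisy.copy(), theta0
    for _ in range(rounds):
        w = fit_logistic(X, y.astype(float))
        p = predict_proba(X, w)
        sure = np.abs(p - 0.5) >= theta
        y[sure] = (p[sure] > 0.5).astype(int)
        theta /= beta
    return y

# Toy run: two separable blobs with 30% uniform label noise.
n = 300
X = np.vstack([rng.normal(-2.0, 1.0, (n, 2)), rng.normal(+2.0, 1.0, (n, 2))])
y_clean = np.repeat([0, 1], n)
flip = rng.random(2 * n) < 0.3
y_noisy = np.where(flip, 1 - y_clean, y_clean)

y_corr = progressive_correction(X, y_noisy)
```

Starting with a conservative threshold and relaxing it mirrors the warm-up rounds of Lemma 2: labels far from the decision boundary are corrected first, and the purified region then grows toward the boundary.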

