A TIME-CONSISTENCY CURRICULUM FOR LEARNING FROM INSTANCE-DEPENDENT NOISY LABELS

Abstract

Many machine learning algorithms are known to be fragile on simple instance-independent noisy labels. However, noisy labels in real-world data are more devastating since they are produced by more complicated mechanisms in an instance-dependent manner. In this paper, we target this practical challenge of Instance-Dependent Noisy Labels by jointly training (1) a model that reverse-engineers the noise generating mechanism, which produces an instance-dependent mapping between the clean label posterior and the observed noisy label; and (2) a robust classifier that produces clean label posteriors. Compared to previous methods, the former model is novel and enables end-to-end learning of the latter directly from noisy labels. An extensive empirical study indicates that the time-consistency of data is critical to the success of training both models and motivates us to develop a curriculum selecting training data based on their dynamics on the two models' outputs over the course of training. We show that the curriculum-selected data provide both clean labels and high-quality input-output pairs for training the two models. Therefore, it leads to promising and robust classification performance even in notably challenging settings of instance-dependent noisy labels where many SoTA methods could easily fail. Extensive experimental comparisons and ablation studies further demonstrate the advantages and significance of the time-consistency curriculum in learning from instance-dependent noisy labels on multiple benchmark datasets.

1. INTRODUCTION

The training of neural networks can easily fail in the presence of even simple instance-independent noisy labels since they quickly lead the model to overfit the noise (Zhang et al., 2017). In practice, however, it is usually challenging to control the labeling quality of large-scale datasets because the labels were generated by complicated mechanisms such as non-expert workers (Han et al., 2020b). An average of 3.3% noisy labels is identified in the test/validation sets of 10 of the most commonly-used datasets in computer vision, natural language, and audio analysis (Northcutt et al., 2021). Moreover, real-world noisy labels are generated in an instance-dependent manner, which is significantly more challenging to address than the most widely studied but oversimplified instance-independent noise, which assumes that the noise only depends on the class (Wei et al., 2022). Two principal methodologies have been developed to address label noise: (1) detecting samples (X, Ỹ) with correct labels Ỹ = Y (empirically, they are the ones with the smallest loss values) and using them to train a clean classifier (Han et al., 2018b; Yu et al., 2019); (2) learning the noise generating mechanism, i.e., a transition matrix T defining the mapping between clean label Y and noisy label Ỹ such that $P(\tilde{Y} \mid X) = T^\top P(Y \mid X)$, where $P(\cdot \mid X)$ denotes the posterior vector, and then using it to build statistically consistent classifiers (Liu & Tao, 2016; Patrini et al., 2017; Yang et al., 2021).
Although both methodologies have achieved promising results in the simplified instance-independent (class-dependent) setting, they have non-trivial drawbacks when applied to the more practical but complicated instance-dependent noise: (1) the "small loss" trick is no longer effective in detecting correct labels (Cheng et al., 2021) because the loss threshold varies drastically across instances and is determined by each transition matrix T(X); (2) the instance-dependent transition matrix T(X) is not identifiable given only the noisy sample and it heavily relies on the estimation of the clean label Y in the triple (X, Y, Ỹ) (Yang et al., 2021), which is the unsolved challenge in (1). Therefore, the two learning problems are entangled, i.e., the training of a clean label predictor and the training of the transition matrix estimator depend on each other's accuracy, which substantially relies on the quality of the training data (X, Y, Ỹ). Specifically, the "small loss" trick cannot provide a high-quality estimation of Y due to the instance-specific loss threshold. Moreover, the estimation of Y can change rapidly because the loss is non-stationary: it can fluctuate during training and, if used for selection, provide inconsistent training signals over time for both models. Furthermore, data subset selection inevitably introduces biases toward easy-to-fit samples and degrades the data diversity (Yang et al., 2021; Cheng et al., 2021; Berthon et al., 2021; Cheng et al., 2020), which is in fact critical to the training and the accuracy of both models, especially the transition matrix estimator, because easy-to-fit samples usually have extremely sparse transition matrices. To tackle the above issues, we propose a novel metric "Time-Consistency of Prediction (TCP)" to select high-quality data to train both models.
TCP measures the consistency of model prediction for an instance over the course of training, which reflects whether its given label results in gradients consistent with the majority of other instances, and this criterion turns out to be a more reliable identifier of clean labels. When applied to the training of the clean label predictor, TCP is more accurate in clean label detection than the "small loss" (or high confidence) criterion, because it avoids comparing the confidence of samples whose loss/confidence thresholds are instance-dependent. Moreover, when applied to the training of the transition matrix estimator, TCP measures the time-consistency of predicted noisy labels Ỹ. Surprisingly, it also faithfully reflects the correctness of the predicted clean label Y. Since the objective to estimate the transition matrix is defined by both Y and Ỹ, selecting samples with high TCP considerably improves the training of the transition matrix estimator. In addition, to exploit the data diversity in training the two models, we apply a curriculum that starts from selecting only a few high-TCP data for early-stage training but progressively includes more training data once the two models become more mature and more consistent. In this paper, we develop a three-stage training strategy with the TCP curriculum embedded. In every training step, we first update the clean label predictor using selected data with high TCP on this model, then train the transition matrix estimator given the predicted clean label posterior and the noisy labels on selected data with high TCP on the estimator, and finally fine-tune the clean label predictor directly using the noisy label and the estimated transition matrix. It is worth noting that the TCP metrics for the two models are updated using the model outputs collected from this dynamic training process without causing additional cost.
As demonstrated by extensive empirical studies and experimental comparisons, our method leads to efficient joint training in which the two models benefit from each other, and it produces an accurate estimation of both the clean label and the instance-dependent transition matrix. On multiple benchmark datasets with either synthetic or real-world noise, our method achieves state-of-the-art performance with significant improvements.

2. BACKGROUNDS AND RELATED WORKS

Let $(X, Y) \in \mathcal{X} \times \{1, \ldots, c\}$ be the random variables for instances and clean labels, where $\mathcal{X}$ represents the instance space and c is the number of classes. In many real-world applications, the observed labels are not always correct but contain some noise. Let Ỹ be the random variable for the noisy label. What we have is a sample $\{(x_1, \tilde{y}_1), \ldots, (x_n, \tilde{y}_n)\}$ drawn from the noisy distribution $D_\rho$ of the random variables (X, Ỹ). We aim to learn a robust classifier that could assign clean labels to test data by exploiting the sample with noisy labels.
Label noise models. Currently, there are three typical label noise models, which are the random classification noise (RCN) model (Biggio et al., 2011; Natarajan et al., 2013; Manwani & Sastry, 2013), the class-dependent label noise (CDN) model (Patrini et al., 2017; Xia et al., 2019; Zhang & Sabuncu, 2018), and the instance-dependent label noise (IDN) model (Berthon et al., 2021; Cheng et al., 2021). Specifically, RCN assumes that clean labels flip randomly with a constant rate; CDN assumes that the flip rate only depends on the true class; IDN considers the most general case of label noise, where the flip rate depends on the instance. Since IDN is non-identifiable without any additional assumption, some simplified variants were proposed. Xia et al. (2020b) proposed the part-dependent label noise (PDN) model, which assumes that the label noise depends on parts of instances. Cheng et al. (2020) and Yang et al. (2021) assume that the flip rates are dependent on instances but can be upper bounded by a value smaller than 1. This paper focuses on the original IDN model without introducing any additional assumptions.
Estimating the transition matrix. The structure of label noise is usually formulated by a c × c transition matrix T, where c is the number of classes and its ij-th element is $T_{ij}(x) = P(\tilde{Y} = j \mid Y = i, X = x)$, which represents the probability that the instance x with the clean label Y = i actually has a noisy label Ỹ = j. The transition matrix naturally establishes the connection between the noisy posterior and the clean posterior, i.e., $P(\tilde{Y} \mid X) = T^\top(X) P(Y \mid X)$, and thus plays an important role in building statistically consistent classifiers in label-noise learning (Liu & Tao, 2016; Scott, 2015). To estimate it, a cross-validation method can be applied for the binary classification task (Natarajan et al., 2013). For CDN, the transition matrix could be learned by exploiting anchor points (Patrini et al., 2017; Yu et al., 2018). For IDN, the transition matrix for an instance could be approximated by a combination of the transition matrices for the parts of the instance (Xia et al., 2020b) or a Bayes label transition matrix (Yang et al., 2021). Yao et al. (2021) exploited the causal graph to estimate the transition relations between clean and noisy labels.
Curriculum learning. Curriculum learning was first proposed by Bengio et al. (2009), who describe a learning paradigm in which a model is learned by gradually introducing samples of increasing hardness to training. Its effectiveness has been empirically verified in a wide range of applications, e.g., computer vision (Chen & Gupta, 2015), natural language processing (Turian et al., 2010), and multitask learning (Graves et al., 2017). Curriculum for label-noise learning has also been investigated. MentorNet (Jiang et al., 2018) pre-trains an extra network producing a data-driven curriculum that selects data instances to guide the training. When clean validation data is not available, MentorNet has to use a predefined curriculum. RoCL (Zhou et al., 2021) develops a curriculum learning strategy that smoothly transitions between (1) detection and supervised training on clean data; and (2) relabeling and self-supervision on noisy data. Nevertheless, RoCL has no convergence guarantee and needs extra data augmentations to collect spatial-consistent pseudo labels. Moreover, existing methods for learning with noisy labels employ heuristics to reduce the side-effect of noisy labels, e.g., selecting reliable samples (Han et al., 2018b; Yu et al., 2019; Wei et al., 2020a; Wu et al., 2020; Xia et al., 2020a), reweighting samples (Ren et al., 2018; Jiang et al., 2018; Ma et al., 2018; Kremer et al., 2018; Reed et al., 2015), correcting labels (Tanaka et al., 2018; Zheng et al., 2020), designing robust loss functions (Zhang & Sabuncu, 2018; Xu et al., 2019; Liu & Guo, 2020; Ma et al., 2020), employing side information (Vahdat, 2017; Li et al., 2017), and (implicitly) adding regularization (Li et al., 2021; 2017; Veit et al., 2017; Vahdat, 2017; Han et al., 2018a; Zhang et al., 2018; Guo et al., 2018; Hu et al., 2020; Zhang et al., 2021; Han et al., 2020a).
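The relation $P(\tilde{Y} \mid X) = T^\top(X) P(Y \mid X)$ used throughout this section can be checked numerically. The following is a minimal sketch in plain Python; the 3-class transition matrix and clean posterior are made-up values for illustration only, and `noisy_posterior` is a hypothetical helper name, not part of any released code.

```python
def noisy_posterior(T, clean_posterior):
    """Return T^T p, where T[i][j] = P(noisy = j | clean = i, x)."""
    c = len(clean_posterior)
    return [
        sum(T[i][j] * clean_posterior[i] for i in range(c))
        for j in range(c)
    ]


# Hypothetical instance-specific transition matrix (each row sums to 1).
T = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]
# Hypothetical clean class posterior P(Y | x) for the same instance.
p_clean = [0.6, 0.3, 0.1]

p_noisy = noisy_posterior(T, p_clean)
print(p_noisy)  # a valid distribution over the noisy label
```

Because every row of T is a probability distribution, the resulting noisy posterior also sums to one, which is what allows a classifier trained on noisy labels through T to remain statistically consistent.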

3. EXAMPLES SELECTION CRITERION: TIME-CONSISTENCY OF PREDICTION

According to the observation that the loss on instances with clean labels is usually smaller than that on instances with noisy labels, the loss computed at an instantaneous step has been widely adopted as a selection criterion for confident examples (Han et al., 2018b; Yu et al., 2019; Wang et al., 2019). This is because instances with clean labels are mutually consistent with each other in producing gradient updates, allowing the model to fit them better and thereby making their loss smaller than that of instances with noisy labels. Unfortunately, the instantaneous loss was found to work well only on instance-independent label noise (Cheng et al., 2021). For a deep neural network, because of the non-smooth nature of the loss and the randomness of stochastic gradient descent, the instantaneous loss of each instance can change dramatically between consecutive epochs, leading to a huge gap between training sets selected over consecutive epochs. Therefore, it is necessary to take the training history of each instance into consideration. Zhou et al. (2020) proposed a robust version of the instantaneous loss, namely its exponential moving average over the course of training. Nevertheless, in the IDN case, each instance with its noisy label is a unique pattern, which is more complex and thereby requires a more robust selection criterion. At the instance level, the one-hot prediction of an instance is a more robust metric than the loss because the former tolerates changes in the predicted class posterior while the latter does not, i.e., the one-hot prediction remains unchanged as long as the position of the max element in the predicted class posterior vector is maintained, but the cross-entropy loss changes once the predicted class posterior changes. Inspired by the above insights, we propose a time-consistency of prediction (TCP) metric as follows:
$$\mathrm{TCP}_{t+1}(x) = \frac{t}{t+1}\,\mathrm{TCP}_{t}(x) + \frac{1}{t+1}\,\mathrm{InP}_{t+1}(x), \qquad (1)$$
where $\mathrm{InP}_{t+1}(x) = \mathbb{1}[\hat{y}_{t+1} = \hat{y}_{t}]$ and $\hat{y}_{t}$ is the predicted label at epoch t.
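The TCP recursion is a running average of the indicator that the predicted label did not change between consecutive epochs. A minimal sketch in plain Python follows; the function names, the toy prediction history, and the initial value TCP_0 = 0 are assumptions made for illustration, since the text does not state how the recursion is initialized.

```python
def update_tcp(tcp, prev_pred, new_pred, t):
    """One step: TCP_{t+1} = t/(t+1) * TCP_t + 1/(t+1) * 1[new == prev]."""
    return [
        (t * score + (1.0 if new == old else 0.0)) / (t + 1)
        for score, old, new in zip(tcp, prev_pred, new_pred)
    ]


def tcp_from_history(pred_history):
    """pred_history[e][i] is the predicted label of instance i at epoch e.

    Assumes TCP_0 = 0, so the result is the fraction of consecutive
    epoch pairs on which the prediction stayed the same.
    """
    tcp = [0.0] * len(pred_history[0])
    for k in range(1, len(pred_history)):
        tcp = update_tcp(tcp, pred_history[k - 1], pred_history[k], k - 1)
    return tcp


# Toy history: 4 epochs, 3 instances (values made up for illustration).
history = [
    [0, 1, 2],
    [0, 2, 2],
    [0, 2, 1],
    [0, 2, 1],
]
tcp = tcp_from_history(history)
print(tcp)
```

Instance 0 never changes its prediction and reaches the maximum score, while the other two instances each flip once in three transitions. Only the previous prediction and the current score need to be stored per instance, which matches the remark that TCP is obtained from model outputs already produced during training.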
This metric considers the prediction consistency over the course of training, which describes the IDN data and selects confident examples better than the previous criteria. To see this, we first manually add IDN at a 0.4 noise rate (see Section 5) to the benchmark dataset CIFAR10 and train a ResNet34 (He et al., 2016) for 100 epochs with a constant learning rate. Since no curriculum strategy is applied here, we select a fixed number of 5,000 confident examples at every epoch t according to four types of selection criterion, i.e., instantaneous prediction InP_t(x), instantaneous loss ℓ(x), time-consistency of prediction TCP_t(x), and time-consistency of loss (defined in the same way as TCP). Then we count the number of instances with clean labels among the selected confident examples and calculate the clean ratios (ratio of clean instances to selected instances). As shown in Figure 2, the two instantaneous metrics have clean ratios lower than 0.6, which is worse than random selection. As for time-consistency of loss, the clean ratio is slightly higher than that of random selection. Those three metrics are basically not discriminative on the noisy data. By contrast, the proposed TCP metric is clearly discriminative, improving the clean ratio of the selected confident examples by more than 20 percent. More empirical study regarding Figures 1 and 2 is provided in Appendix A.
By training on the labeled set L and an instance x′ for one step, we have
$$\theta'_{t+1} = \theta_t + \eta \Big( \sum_{x \in L} \nabla_\theta \ell(x; \theta_t) + \nabla_\theta \ell(x'; \theta_t) \Big),$$
where $\theta_t$ denotes the network parameters at step t and η denotes the learning rate. Then we have
$$\frac{1}{\eta} \sum_{x \in L} \big[ \ell(x; \theta_{t+1}) - \ell(x; \theta'_{t+1}) \big] = \frac{p^{\hat{y}'_t}_{t+1}(x')}{p^{\hat{y}'_t}_{t}(x')} - 1.$$
In Figure 3, we show the clean ratios of the original noisy labels and pseudo labels of instances selected with our curriculum during the whole training process. The clean ratio for pseudo labels stays remarkably high, much higher than the original clean ratio. Therefore, the clean classifier can be learned by minimizing
$$\sum_{n=1}^{N} L\big(f(x_n), y^*_n\big),$$
where $y^*$ can be the original noisy label or the pseudo label in different learning phases. Implementation details can be found in Section 4.3.
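The clean-ratio measurement reported for the selection experiments (Figures 2 and 3) amounts to ranking instances by a score, keeping the top k, and counting how many kept labels agree with the ground truth. A minimal sketch in plain Python follows; the helper names and toy values are hypothetical, and ground-truth clean labels are assumed to be available, which only holds for synthetically corrupted data.

```python
def select_top_k(scores, k):
    """Indices of the k highest-scoring instances (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def clean_ratio(selected, given_labels, clean_labels):
    """Fraction of selected instances whose given label equals the clean label."""
    hits = sum(1 for i in selected if given_labels[i] == clean_labels[i])
    return hits / len(selected)


# Toy data: 6 instances, 2 of which (indices 1 and 5) carry a wrong label.
tcp_scores = [0.90, 0.20, 0.80, 0.40, 0.95, 0.10]
noisy_labels = [0, 1, 1, 2, 2, 0]
clean_labels = [0, 0, 1, 2, 2, 1]

selected = select_top_k(tcp_scores, k=3)
ratio_selected = clean_ratio(selected, noisy_labels, clean_labels)
ratio_all = clean_ratio(range_all := list(range(6)), noisy_labels, clean_labels)
print(selected, ratio_selected, ratio_all)
```

Passing pseudo labels instead of the given noisy labels as the second argument of `clean_ratio` yields the pseudo-label clean ratio; random selection is expected to match `ratio_all`, which serves as the baseline.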

4.2. LEARNING A TRANSITION MATRIX WITH HIGH NOISY-TCP INSTANCES

The transition matrix is not identifiable by only exploiting noisy data without introducing additional assumptions; therefore, we formulate the objective function for learning the transition matrix based on the equation $P(\tilde{Y} \mid X) = T^\top(X) P(Y \mid X)$ as follows:
$$\min_{T} \frac{1}{N} \sum_{n=1}^{N} L\big(T^\top(x_n) f(x_n), \tilde{y}_n\big),$$
where $f(x_n) = P(y \mid x_n)$. To select high-quality triplets (X, Y, Ỹ) for the above objective, two conditions should be considered. First, f(•) must output a precise clean class posterior; otherwise, since ỹ is given and f(•) has been learned in advance and is fixed, T cannot be optimized in the correct direction. As we discussed above, instances with high clean-TCP tend to have the correct pseudo label, and thereby a precise clean class posterior, which satisfies this necessary condition. Second, the noisy-TCP should be high. By treating $T^\top(x) f(x)$ as a whole predictor for ỹ, the corresponding new objective is to predict ỹ. Therefore, a high noisy-TCP indicates that the instance is learned better and faster for predicting ỹ, leading to stable and fast learning. We discover that the noisy-TCP inherently has a strong correlation with clean-TCP, so we can use it to select triplets fulfilling both conditions above. To see this, at each epoch, we calculate the Spearman rank-order correlation coefficient¹ between the noisy-TCP and clean-TCP of the whole dataset. Besides, we calculate the clean ratio of the selected high noisy-TCP instances w.r.t. their pseudo labels. In Figure 5, we show the data distribution in terms of clean- and noisy-TCP at epoch 50 (more data distributions at different epochs are provided in Appendix C). The green regression line partially implies the linear correlation between the clean- and noisy-TCP. Also, instances with correct pseudo labels are mainly distributed in the high TCP area, and vice versa.
As shown in Figure 4, the Spearman rank-order correlation coefficient is above 0.6 after 10 epochs with a p-value that is consistently 0, indicating a strong and statistically significant rank correlation between noisy-TCP and clean-TCP. Meanwhile, the clean ratio is consistently above 0.8, which means those high noisy-TCP instances also have correct clean predictions and thereby probably a precise clean class posterior. Note that the clean ratio decreases at the late stage because the curriculum selects almost all the data at the end. Overall, high noisy-TCP instances are not only naturally stable for the new objective of predicting ỹ but also satisfy the necessary condition of having a precise clean class posterior, which makes them well suited for learning the transition matrix.
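The objective for learning the transition matrix composes the clean posterior with $T^\top(x)$ and scores the result against the observed noisy label. The sketch below evaluates that objective in plain Python, assuming L is the cross-entropy loss (the "noisy CE loss" mentioned in Section 4.3); the function names and toy numbers are hypothetical, and no gradient step is shown.

```python
import math


def forward_corrected_ce(T, clean_posterior, noisy_label, eps=1e-12):
    """Cross-entropy of T^T f(x) against the observed noisy label."""
    c = len(clean_posterior)
    noisy_post = [
        sum(T[i][j] * clean_posterior[i] for i in range(c))
        for j in range(c)
    ]
    # eps guards against log(0) when a class receives zero probability.
    return -math.log(max(noisy_post[noisy_label], eps))


def transition_objective(Ts, posteriors, noisy_labels):
    """Average loss over the selected triplets (x_n, f(x_n), noisy label)."""
    losses = [
        forward_corrected_ce(T, f, y)
        for T, f, y in zip(Ts, posteriors, noisy_labels)
    ]
    return sum(losses) / len(losses)


# Toy binary example with one instance (values made up for illustration).
T_x = [[0.8, 0.2],
       [0.3, 0.7]]
f_x = [0.9, 0.1]  # clean posterior, held fixed while T is trained
loss = forward_corrected_ce(T_x, f_x, noisy_label=1)
print(loss)
```

Here the predicted probability of noisy class 1 is 0.2 · 0.9 + 0.7 · 0.1 = 0.25, so the loss is log 4. If f(x) were inaccurate, the same loss would push T(x) toward a wrong matrix, which is why the text requires a precise clean posterior for the selected instances.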

4.3. TCP GUIDED CURRICULUM LEARNING ALGORITHM

The main steps of our algorithm are summarized in Algorithm 1 with the complete procedure detailed in Appendix D. First, we warm up the feature extractor ϕ and the classification layer c by minimizing a standard cross-entropy (CE) loss on noisy data, and meanwhile compute the clean-TCP for every instance. Then, we warm up the transition matrix layer t with high clean-TCP instances and obtain the noisy-TCP for every instance. From then on, iteratively, high clean-TCP instances are fed to the clean classifier (green part in Figure 6) to train a primary clean classifier with the clean CE loss, and based on the primary clean classifier, instances with high noisy-TCP are fed to the transition matrix (blue part in Figure 6) to train a transition matrix with the noisy CE loss while the parameters of the primary clean classifier are frozen. Then the clean classifier gets improved by being fine-tuned on the whole data with the fixed transition matrix. The clean- and noisy-TCP of every instance are updated at the end of each epoch. Finally, a transition matrix with a small estimation error and a clean classifier with a performance improvement can be obtained.

Algorithm 1
...
4: Select $N_t[e]$ high noisy-TCP instances as $D_t$.
5: Train t while fixing ϕ and c on $D_t$ by minimizing $\sum_{n=1}^{N} L_n\big(T^\top(x_n) f(x_n), \tilde{y}_n\big)$.
6: Select $N_c[e]$ high clean-TCP instances as $D_c$.
7: Train ϕ and c on $D_c$ by minimizing $\sum_{n=1}^{N} L_c\big(f(x_n), y^*_n\big)$, where $y^*$ is the pseudo label.
8: Fix t and fine-tune ϕ and c on D by minimizing $\sum_{n=1}^{N} L_n\big(T^\top(x_n) f(x_n), \tilde{y}_n\big)$.
9: Record the clean and noisy predictions and calculate the clean- and noisy-TCP by Eq. (1).
10: end for
Output: Optimized feature extractor ϕ, classification layer c, transition matrix layer t.
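The data-selection part of each epoch (steps 4 and 6 of Algorithm 1) can be sketched as follows in plain Python. The linear growth of the selection size is an assumption made purely for illustration: this section does not specify how $N_t[e]$ and $N_c[e]$ are scheduled, and the sketch uses the same size for both. All names and TCP values are hypothetical, and the actual model updates (steps 5, 7, and 8) are omitted.

```python
def linear_schedule(n_start, n_total, num_epochs):
    """Hypothetical curriculum: selection size grows linearly to n_total."""
    if num_epochs == 1:
        return [n_total]
    step = (n_total - n_start) / (num_epochs - 1)
    return [round(n_start + step * e) for e in range(num_epochs)]


def select_high_tcp(tcp, k):
    """Indices of the k instances with the highest TCP (ties broken by index)."""
    return sorted(range(len(tcp)), key=lambda i: (-tcp[i], i))[:k]


# Toy TCP values for 6 instances (made up for illustration).
clean_tcp = [0.9, 0.3, 0.8, 0.5, 0.7, 0.1]
noisy_tcp = [0.8, 0.2, 0.9, 0.4, 0.6, 0.3]

sizes = linear_schedule(n_start=2, n_total=6, num_epochs=3)
for epoch, k in enumerate(sizes):
    d_t = select_high_tcp(noisy_tcp, k)  # step 4: data for the transition layer
    d_c = select_high_tcp(clean_tcp, k)  # step 6: data for the clean classifier
    print(epoch, d_t, d_c)
```

In the last epoch of this toy schedule both subsets cover the whole dataset, mirroring the remark in Section 4.2 that the curriculum selects almost all the data at the end.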

5. EXPERIMENTS

In this section, we examine how the proposed method learns a robust classifier against instance-dependent noisy labels.
Dataset. We employ three widely used datasets, i.e., F-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR10/100 (Krizhevsky et al., 2009), as well as four versions of the real-world noisy dataset CIFAR10N (Wei et al., 2022) and Clothing1M (Xiao et al., 2015). F-MNIST contains 60,000 training images and 10,000 test images with 10 classes. SVHN and CIFAR10 both have 10 classes of images, but the former contains 73,257 training images and 26,032 test images, and the latter contains 50,000 training images and 10,000 test images, while CIFAR100 has 100 classes. CIFAR10N provides CIFAR10 images with human-annotated noisy labels obtained from Amazon Mechanical Turk. Four versions of CIFAR10N label sets are employed here, three of which are labeled by three independent workers (named CIFAR10N-1/2/3) and one of which is negatively aggregated from the above three sets (named CIFAR10N-W). Clothing1M has 1M images with real-world noisy labels and an additional 50k, 14k, and 10k images with clean labels for training, validation, and test, and we only use the noisy training set in the training phase. For all the datasets, we leave out 10% of the training data as a validation set, which is used for model selection. The model with the highest validation accuracy is selected for testing.
Noisy labels generation. For clean datasets, we artificially corrupt the class labels of the training and validation sets following the instance-dependent noisy label generation method in Xia et al. (2020b). We generate noisy datasets with five noise rates: {0.1, 0.2, 0.3, 0.4, 0.5}.
Baselines and measurements. On synthetic noisy datasets, without introducing data augmentation techniques and semi-supervised learning, we compare the proposed method TCP with the following baselines: (i) CE, which optimizes the standard cross-entropy loss on noisy datasets. (ii) Decoupling (Malach & Shalev-Shwartz, 2017), which trains two networks on samples whose predictions are different. (iii) MentorNet (Jiang et al., 2018), Co-teaching (Han et al., 2018b), and Co-teaching+ (Yu et al., 2019), which mainly handle noisy labels by training on instances with small loss. (iv) Joint (Tanaka et al., 2018), which jointly optimizes labels and network parameters. (v) DMI (Xu et al., 2019), which uses a novel information-theoretic loss function to learn a robust classifier. (vi) Forward (Patrini et al., 2017), Reweight (Liu & Tao, 2016), and T-Revision (Xia et al., 2019), which utilize a class-dependent transition matrix T to correct the loss function. (vii) PTD (Xia et al., 2020b) and Bayes (Yang et al., 2021), which estimate the instance-dependent transition matrix under some additional assumptions; CRUST (Mirzasoleiman et al., 2020), which iteratively selects subsets of clean data points that provide an approximately low-rank Jacobian matrix; and CausalINL (Yao et al., 2021), which exploits the causal graph to estimate the transition relations between clean and noisy labels. On real-world noisy datasets, we apply the transition matrix learning and fine-tuning parts to the SoTA method Dividemix (Li et al., 2020), i.e., at each epoch, in addition to the Dividemix training, we select high noisy-TCP data to learn the transition matrix and use it to fine-tune on the whole data. Then we compare this combined method TCP-D with the following SoTA methods: (i) PES (Bai et al., 2021); (ii) Dividemix (Li et al., 2020); (iii) CORES (Cheng et al., 2021); (iv) ELR+ (Liu et al., 2020); (v) JoCoR (Wei et al., 2020b); (vi) CAL (Zhu et al., 2021). We use a ResNet18 network for F-MNIST, a ResNet34 network for SVHN and CIFAR10, a ResNet50 network for CIFAR100, a PreAct-ResNet18 for CIFAR10N, and a pre-trained ResNet50 network for Clothing1M. More training details are provided in Appendix E.
Classification accuracy is employed to evaluate the performance of each model on the clean test set. We report results over 5 trials on all datasets except Clothing1M, for which the result is from a single trial.
Comparison with the State-of-the-Arts. We compare TCP with multiple baselines using the same network architecture. Table 1 shows the results on CIFAR10 with different rates of IDN from 0.1 to 0.5. TCP outperforms baselines across all datasets and noise rates. The improvement is significant when the noise rate is large. Tables 5, 6, and 7 show the results on F-MNIST, SVHN, and CIFAR100, which are provided in Appendix F. Table 2 shows the results on the real-world noisy datasets CIFAR10N-1/2/3/W and Clothing1M. TCP-D consistently achieves the best test accuracy on real-world noisy datasets. Note that the results of baselines on CIFAR10N are taken from the official leaderboard http://www.noisylabels.com/.
Comparison on the transition matrix estimation error. We compare the transition matrix estimation error of our method with the instance-independent method Forward (Patrini et al., 2017), and two instance-dependent methods, PTD (Xia et al., 2020b) and Bayes (Yang et al., 2021). As shown in Figure 7, our method consistently achieves the lowest estimation error on CIFAR10 with different noise rates.
Ablation study. In Table 3, we study the effect of removing different components of our method to provide insights into what makes TCP successful. TCP w/o $D_c$ indicates that we do not select high clean-TCP data $D_c$ to learn the clean classifier, while TCP w/o $D_t$ indicates that we do not select high noisy-TCP data $D_t$ to learn the transition matrix and use it to fine-tune the clean classifier. Results show that the performance of both reduced methods decreases. Without $D_c$, the primary clean classifier cannot be learned, and thus the transition matrix cannot be learned well. Without $D_t$, the transition matrix is not learned, and thus the whole noisy data cannot be fully exploited to build the consistent classifier. To sum up, the learning of the clean classifier and the learning of the transition matrix benefit and boost each other.
Table 3: Ablation study results on CIFAR10.

6. CONCLUSIONS

In this paper, we study the instance-dependent label noise (IDN) problem, which is a more general and practical setting than the previously addressed instance-independent label noise problem. Targeting the main challenges, we propose a novel time-consistency metric, i.e., TCP, for the IDN problem. Based on TCP, we can detect examples with clean labels or correct pseudo labels better than the existing measures, and allocate reliable triplets for learning the transition matrix. Then we design an assumption-free curriculum that learns the clean classifier and the transition matrix simultaneously. Through extensive experiments, we empirically demonstrate that the proposed method remarkably surpasses the baselines on many datasets with both synthetic noise and real-world noise, and achieves a smaller transition matrix estimation error than existing methods.

A MORE EMPIRICAL STUDY REGARDING FIGURES 1 AND 2

A.1 MORE EMPIRICAL STUDY REGARDING FIGURE 1

Figures 8 and 9 show the results on the more practical dataset CIFAR100, with IDN noise 0.4 and the higher IDN noise 0.6, and with a different partition (2:6:2), which demonstrate that our claims hold in general and do not change sensitively with the partition ratios.
Following the same setting as Figure 2, we select 5,000 confident examples at every epoch t according to six types of selection criterion, i.e., instantaneous prediction InP_t(x), instantaneous loss ℓ(x), time-consistency of prediction TCP_t(x), time-consistency of loss, and two SOTA confident sample selection methods: FINE (Kim et al., 2021) and Topological Filter (Wu et al., 2020). Then we count the number of instances with clean labels and calculate the clean ratios. As shown in Figure 12, at the starting stage, when the model has just learned the clean data but has not yet fit the noisy data, FINE and Topological Filter perform well. As the training goes on and the model fits the noisy data, our method achieves the best clean ratio among the selected examples.
By training on the labeled set L for one step, we have
$$\theta_{t+1} = \theta_t + \eta \sum_{x \in L} \nabla_\theta \ell(x; \theta_t),$$
and by training on L and x′ for one step, we have
$$\theta'_{t+1} = \theta_t + \eta \Big( \sum_{x \in L} \nabla_\theta \ell(x; \theta_t) + \nabla_\theta \ell(x'; \theta_t) \Big),$$
where $\theta_t$ denotes the network parameters at step t and η denotes the learning rate. The Taylor expansion of the loss ℓ(x; θ) at the point θ = θ_0 is:
$$g_{\theta_0}(\theta) = \sum_{x \in L} \Big[ \ell(x; \theta_0) + \nabla_\theta \ell(x; \theta_0)\,(\theta - \theta_0) + o\big((\theta - \theta_0)^2\big) \Big]. \qquad (4)$$
Then we evaluate the forgetting effect of introducing an instance x′ with its pseudo label to the training set by checking the change of loss over the labeled set L with and without x′ added. If adding x′ does not cause a substantial change, we can conclude that it does not lead to catastrophic forgetting of the learned examples with correct labels.
Therefore, we calculate the change of loss over the labeled set by
$$\begin{aligned}
\frac{1}{\eta} \sum_{x \in L} \big[ \ell(x; \theta_{t+1}) - \ell(x; \theta'_{t+1}) \big] &= \frac{1}{\eta} \big[ g_{\theta_t}(\theta_{t+1}) - g_{\theta_t}(\theta'_{t+1}) \big] \\
&\approx \nabla_\theta \ell(x'; \theta_t) \sum_{x \in L} \nabla_\theta \ell(x; \theta_t) = \frac{\partial \ell(x'; \theta_t)}{\partial \theta_t} \frac{\partial \theta_t}{\partial t} = \frac{\partial \ell(x'; \theta_t)}{\partial t} = \frac{\partial \ell(x'; \theta_t)}{\partial p^{\hat{y}'_t}_{t}(x')} \frac{\partial p^{\hat{y}'_t}_{t}(x')}{\partial t},
\end{aligned}$$
where $p^{\hat{y}'_t}_{t}(x')$ is the probability of x′ belonging to $\hat{y}'_t$ at step t, and $\hat{y}'_t$ is the prediction (pseudo label) of x′ at step t. The second line holds because we omit the second and higher order terms of the Taylor expansion in Eq. (4). Then, with the cross-entropy loss employed, we have
$$\frac{\partial \ell(x'; \theta_t)}{\partial p^{\hat{y}'_t}_{t}(x')} = \frac{\partial \big( -\log p^{\hat{y}'_t}_{t}(x') \big)}{\partial p^{\hat{y}'_t}_{t}(x')} = -\frac{1}{p^{\hat{y}'_t}_{t}(x')}.$$
Next, [...]

[...] $\sum_{n=1}^{N} L_n\big(f(x_n), \tilde{y}_n\big) + \sum_{n=1}^{N} L\big(T^\top(x_n) f(x_n), \tilde{y}_n\big)$ [...] $\sum_{n=1}^{N} L_c\big(f(x_n), y^*_n\big)$, where $y^*$ is the pseudo label.
18: Fix t and fine-tune ϕ and c on D by minimizing $\sum_{n=1}^{N} L_n\big(T^\top(x_n) f(x_n), \tilde{y}_n\big)$.
19: Record the clean and noisy predictions and calculate the clean- and noisy-TCP by Eq. (1).
20: end for
Output: Optimized feature extractor ϕ, classification layer c, transition matrix layer t.

In the warmup clean-TCP stage, the learning rate is initialized to 0.001 and decayed every 5 epochs with 50 epochs in total by a factor of 0.1, 1/3, and 1 for F-MNIST, SVHN, and CIFAR10/100, respectively. In the remaining 100 epochs of training, the learning rate of the feature extractor ϕ and the classification layer c is 1e-4 and is divided by 10 at epochs 30 and 80; the learning rate of the transition matrix layer t is 3e-4 before epoch 30 and 1e-5 otherwise. The learning rate for fine-tuning is 1e-6. For the real-world noisy datasets CIFAR10N and Clothing1M, we follow the optimization method of Dividemix. For CIFAR10N, in the warmup clean-TCP stage, the learning rate is initialized to 0.001 and decayed every 5 epochs with 50 epochs in total by a factor of 1/3. In the remaining 300 epochs of training, the learning rate of the transition matrix layer t is 6e-3 before epoch 80, 2e-4 between epoch 80 and 150, and 2e-4 otherwise.
The learning rate for fine-tuning is 2e-3 before epoch 80, 6e-4 between epoch 80 and 150, 2e-4 between epoch 150 and 250, and 2e-5 otherwise. For Clothing1M, in the warmup clean-TCP stage, the learning rate is initialized to 0.002 and decayed every 2 epochs with 5 epochs in total by a factor of 1/3. In the remaining 20 epochs of training, the learning rate of the transition matrix layer t is 6e-4 before epoch 8, 2e-5 between epoch 8 and 12, and 5e-6 otherwise. The learning rate for fine-tuning is 2e-4 before epoch 10, 6e-5 between epoch 10 and 14, 2e-5 between epoch 14 and 17, and 2e-6 otherwise.
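The step-wise learning rates listed above are piecewise-constant functions of the epoch. A minimal sketch in plain Python, using the Clothing1M values quoted in the text, is given below; the helper `piecewise_constant` is hypothetical, and the treatment of boundary epochs (an epoch equal to a boundary falls into the later interval) is an assumption, since the text does not say whether the boundaries are inclusive.

```python
def piecewise_constant(boundaries, values):
    """Return lr(epoch) that equals values[i] while epoch < boundaries[i],
    and values[-1] once the epoch reaches the last boundary."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("need exactly one more value than boundaries")

    def lr(epoch):
        for boundary, value in zip(boundaries, values):
            if epoch < boundary:
                return value
        return values[-1]

    return lr


# Clothing1M schedules as quoted in the text (boundary handling assumed).
transition_lr = piecewise_constant([8, 12], [6e-4, 2e-5, 5e-6])
finetune_lr = piecewise_constant([10, 14, 17], [2e-4, 6e-5, 2e-5, 2e-6])

for epoch in (0, 8, 12, 19):
    print(epoch, transition_lr(epoch), finetune_lr(epoch))
```

The same helper covers the CIFAR10N schedules by swapping in the corresponding boundaries and values.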

E.1 TRAINING TIME

Owing to its lightweight and simple network architecture, our method is more time-efficient and scalable than methods that employ dual networks or require data augmentation for semi-supervised learning. We report the time costs below to demonstrate this advantage. The additional cost caused by estimating T and fine-tuning is small. For each epoch, the additional time cost of the estimating-T part is negligible compared with the cost of one standard training epoch minimizing the cross-entropy loss, because estimating T only updates the parameters of a c × c linear layer, where c is the number of classes. The time cost of the fine-tuning part is slightly larger than that of one standard training epoch. Fortunately, neither part needs to be applied in every epoch; in our experiments, we apply them only in the last 50 epochs. Moreover, the clean-classifier training part (line 7 in Algorithm 1) and the estimating-T part involve only the high clean-TCP and high noisy-TCP data rather than the whole dataset, which saves plenty of time for the fine-tuning part. Therefore, in practice, our method can easily adapt and scale to realistic settings.



The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two sets (Zar, 2005).
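For readers unfamiliar with the coefficient, it can be computed as the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below is a plain-Python illustration of that standard definition, not code from the paper; the function names `average_ranks` and `spearman` are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of entries tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share one averaged rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("coefficient is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # A monotone but nonlinear relationship still yields a coefficient of 1.
    print(spearman([0.1, 0.4, 0.9], [1, 16, 81]))
```

Because only ranks enter the computation, the coefficient measures whether two scores order the instances in the same way, regardless of their scales.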



(a) Partitioned at a start stage. (b) Partitioned at an early stage. (c) Partitioned at an end stage.

Figure 1: TCP (mean and std.) of three groups (high TCP (10%), middle TCP (80%), and low TCP (10%)) partitioned by the TCP calculated at the start stage (epoch 5), early stage (epoch 30), and end stage (epoch 95) during training a ResNet34 on CIFAR10 with IDN-0.4 for 100 epochs.

Figure 2: Clean ratios of the selected top 5000 instances ranked by four kinds of instance hardness measures, respectively, during a standard training for 100 epochs. The clean ratio of randomly selected instances is 0.6 since the noise rate is 0.4.

Moreover, we partition the whole dataset into three groups (high TCP (10%), middle TCP (80%), and low TCP (10%)) by the TCP calculated at the start stage (epoch 5), early stage (epoch 30), and end stage (epoch 95), and visualize the mean and variance of each group throughout training. As shown in Figure 1, the start-stage partition fails at the end stage because the three groups become entangled, while the early-stage partition shares almost the same pattern as the end-stage partition. We therefore conclude that the early-stage TCP reflects the future behavior of each instance: the time-consistent examples selected in the early stage will not mislead the classifier, because their TCP remains high and they will continue to be selected as time-consistent examples in the remaining training epochs. These results also indicate that a warmup for the TCP is necessary.
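The three-way partition used in this analysis can be sketched as a simple ranking of instances by their TCP score. This is an illustrative sketch only: the function name `partition_by_tcp` and its parameters are hypothetical, the TCP scores are taken as given (their computation follows Eq. (1) and is not reproduced here), and ties are assumed to be broken by instance index.

```python
def partition_by_tcp(tcp_scores, high_frac=0.1, low_frac=0.1):
    """Split instance indices into high-, middle-, and low-TCP groups.

    tcp_scores: one TCP value per instance (higher = more time-consistent).
    high_frac / low_frac: fractions assigned to the high and low groups;
    the remaining instances form the middle group.
    """
    n = len(tcp_scores)
    n_high = int(round(n * high_frac))
    n_low = int(round(n * low_frac))
    if n_high + n_low > n:
        raise ValueError("high_frac + low_frac must not exceed 1")
    # Indices sorted from highest to lowest TCP (stable for ties).
    order = sorted(range(n), key=lambda i: tcp_scores[i], reverse=True)
    high = order[:n_high]
    middle = order[n_high:n - n_low]
    low = order[n - n_low:]
    return high, middle, low


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 1.0]
    high, middle, low = partition_by_tcp(scores)
    print("high:", high, "middle:", middle, "low:", low)
```

Calling the function with `high_frac=0.2, low_frac=0.2` gives a 20%/60%/20% split instead of the default 10%/80%/10%.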

Figure 3: Clean ratios of selected high clean-TCP examples w.r.t. their original noisy labels and pseudo labels with linear growth of the selected number during our curriculum learning on CIFAR10 with IDN-0.4 for 100 epochs.

Figure 4: Clean ratios of selected high noisy-TCP examples w.r.t. their pseudo labels with exponential growth of the selected number and Spearman rank-order correlation coefficient between the noisy- and clean-TCP during our curriculum learning on CIFAR10 with IDN-0.4 for 100 epochs.

Figure 5: Data distribution in terms of noisy- and clean-TCP at epoch 50 during our curriculum learning on CIFAR10 with IDN-0.4 for 100 epochs.

Figure 6: An overview of the proposed method. The second image, a cat, has the noisy label "dog". The transition matrix T(•) = t(ϕ(•)) and the classifier f(•) = c(ϕ(•)) share a common feature extractor.
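To make the role of the two heads concrete, the relation P(Ỹ | X) = T⊤ P(Y | X) from the introduction can be evaluated for a single instance as follows. This is a minimal sketch with made-up numbers, not the paper's implementation: the function name `noisy_posterior` is hypothetical, and it assumes the convention that entry T[i][j] is the probability of observing noisy label j when the clean label is i.

```python
def noisy_posterior(transition, clean_posterior):
    """Return T^T p for one instance, i.e. out[j] = sum_i T[i][j] * p[i].

    transition: c x c matrix for this instance; assumed convention is
        T[i][j] = P(noisy label = j | clean label = i, x).
    clean_posterior: length-c vector of clean class probabilities f(x).
    """
    c = len(clean_posterior)
    if len(transition) != c or any(len(row) != c for row in transition):
        raise ValueError("transition must be a c x c matrix")
    return [
        sum(transition[i][j] * clean_posterior[i] for i in range(c))
        for j in range(c)
    ]


if __name__ == "__main__":
    # Illustrative values for classes [cat, dog]: the classifier is fairly
    # sure the image is a cat, but cats are mislabeled as dogs 20% of the time.
    T = [[0.8, 0.2],
         [0.3, 0.7]]
    p_clean = [0.9, 0.1]
    print(noisy_posterior(T, p_clean))
```

In the proposed method both inputs would come from the shared feature extractor, with T(x) = t(ϕ(x)) and f(x) = c(ϕ(x)); here they are fixed by hand for illustration.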

Figure 7: Transition matrix estimation errors on CIFAR10 from IDN-0.1 to IDN-0.4.

Figures 10 and 11 show the results on CIFAR100 and SVHN with different model architectures. The conclusions in Section 3 are based on the memorization effect of overparameterized DNNs; therefore, they hold better for deeper DNNs (ResNet50) than for shallower ones (AlexNet). Moreover, for AlexNet on SVHN, the high-TCP group partitioned at an early stage has no overlap with the middle-TCP group. Overall, the results demonstrate that the conclusions in the paper hold true and generalize to other architectures and datasets.

Figure 8: TCP (mean and std.) of three groups partitioned by the TCP calculated at the start stage (epoch 5), early stage (epoch 30), and end stage (epoch 95) during training a ResNet50 on CIFAR100 with IDN-0.4 for 100 epochs. The first row is partitioned by high TCP (10%), middle TCP (80%), and low TCP (10%). The second row is partitioned by high TCP (20%), middle TCP (60%), and low TCP (20%).

(a) Partitioned at a start stage. (b) Partitioned at an early stage. (c) Partitioned at an end stage. (d) Partitioned at a start stage. (e) Partitioned at an early stage. (f) Partitioned at an end stage.

Figure 9: TCP (mean and std.) of three groups partitioned by the TCP calculated at the start stage (epoch 5), early stage (epoch 30), and end stage (epoch 95) during training a ResNet50 on CIFAR100 with IDN-0.6 for 100 epochs. The first row is partitioned by high TCP (10%), middle TCP (80%), and low TCP (10%). The second row is partitioned by high TCP (20%), middle TCP (60%), and low TCP (20%).

(a) Partitioned at a start stage. (b) Partitioned at an early stage. (c) Partitioned at an end stage.

Figure 10: TCP (mean and std.) of three groups (high TCP (10%), middle TCP (80%), and low TCP (10%)) partitioned by the TCP calculated at the start stage (epoch 5), early stage (epoch 30), and end stage (epoch 95) during training a ResNet50 on CIFAR100 with IDN-0.4 for 100 epochs.

(a) Partitioned at a start stage. (b) Partitioned at an early stage. (c) Partitioned at an end stage.

Figure 11: TCP (mean and std.) of three groups (high TCP (10%), middle TCP (80%), and low TCP (10%)) partitioned by the TCP calculated at the start stage (epoch 5), early stage (epoch 30), and end stage (epoch 95) during training an AlexNet on SVHN with IDN-0.4 for 100 epochs.

ŷ′ t t+1 (x ′ ) because it has been verified in Figure 1 that instances with high clean-TCP in the early stage maintain their high clean-TCP in the future, which means the loss change can be bounded by a very small value. Therefore, exploiting high clean-TCP instances with pseudo labels helps to correct corrupted labels and learn a clean classifier without causing catastrophic forgetting of the learned examples with correct labels.

C DATA DISTRIBUTION IN TERMS OF CLEAN- AND NOISY-TCP AT DIFFERENT TRAINING STAGES

(a) Epoch 10. (b) Epoch 30. (c) Epoch 80. (d) Epoch 100.

Figure 13: Clean ratios of selected high noisy-TCP examples w.r.t. their pseudo labels with exponential growth of the selected number and Spearman rank-order correlation coefficient between the noisy- and clean-TCP during our curriculum learning on CIFAR10 with IDN-0.4 for 100 epochs.

prediction and calculate the clean-TCP.
5: end for
6: Warmup noisy-TCP:
7: for $e$ in $\{1, \dots, e_2\}$ do
8: Select $N_c[e]$ high clean-TCP instances as $D_c$.
9: Train ϕ, c and t on $D_c$ by minimizing


and noisy prediction and calculate the clean- and noisy-TCP.
11: end for
12: Curriculum training:
13: for $e$ in $\{1, \dots, e_3\}$ do
14: Select $N_t[e_2 + e]$ high noisy-TCP instances as $D_t$.
15: Train t while fixing ϕ and c on $D_t$ by minimizing $\sum_{n=1}^{N} \mathcal{L}_n(T^\top(x_n) f(x_n), \tilde{y}_n)$.
16: Select $N_c[e_2 + e]$ high clean-TCP instances as $D_c$.
17: Train ϕ and c on $D_c$ by minimizing

Means and stds of classification accuracy on CIFAR10 with different label noise rates.

Means and stds of classification accuracy on real-world noisy datasets.

by using p

The average time of training each component on CIFAR10 and CIFAR100 with ResNet34 on NVIDIA 3090.

Means and stds of classification accuracy on F-MNIST with different label noise rates.

                   54±0.31   88.38±0.42  84.22±0.35  68.86±0.78  51.42±0.66
Decoupling      89.27±0.31   86.50±0.35  85.33±0.47  78.54±0.53  57.32±2.11
MentorNet       90.00±0.34   87.02±0.41  86.02±0.82  80.12±0.76  58.62±1.36
Co-teaching     90.82±0.33   87.89±0.41  86.88±0.32  82.78±0.95  63.22±1.58
Co-teaching+    90.92±0.51   89.77±0.45  88.52±0.45  83.57±1.77  59.32±2.77
Joint           70.24±0.99   56.83±0.45  51.27±0.67  44.24±0.78  30.45±0.45
DMI             91.98±0.62   90.33±0.21  84.81±0.44  69.01±1.87  51.64±1.78
Forward         89.05±0.43   88.61±0.43  84.27±0.46  70.25±1.28  57.33±3.75
Reweight        90.33±0.27   89.70±0.35  87.04±0.35  80.29±0.89  65.27±1.33
T-Revision      91.56±0.31   90.68±0.66  89.46±0.45  84.01±1.24  68.99±1.04

Means and stds of classification accuracy on SVHN with different label noise rates.

                   49±0.15   90.47±0.66  85.27±0.34  82.57±1.45  42.56±2.79
MentorNet       90.28±0.12   90.37±0.37  86.49±0.49  83.75±0.75  40.27±3.14
Co-teaching     91.33±0.31   90.56±0.67  88.93±0.78  85.47±0.64  45.90±2.31
Co-teaching+    93.05±1.20   91.05±0.82  85.33±2.71  57.24±3.77  42.56±3.65
Joint           86.01±0.34   78.58±0.72  76.34±0.56  65.14±1.72  46.78±3.77
DMI             93.51±1.09   93.22±0.62  91.78±1.54  69.34±2.45  48.93±2.34
Forward         90.89±0.63   90.65±0.27  87.32±0.59  78.46±2.58  46.27±3.90
                   71±0.44   94.02±1.32  91.38±1.94  85.55±3.17  75.46±3.79
TCP             94.90±0.11   94.60±0.20  93.92±1.37  94.09±0.34  84.92±8.40

