A TIME-CONSISTENCY CURRICULUM FOR LEARNING FROM INSTANCE-DEPENDENT NOISY LABELS

Abstract

Many machine learning algorithms are known to be fragile even under simple instance-independent noisy labels. Noisy labels in real-world data are more devastating, however, since they are produced by more complicated mechanisms in an instance-dependent manner. In this paper, we target this practical challenge of instance-dependent noisy labels by jointly training (1) a model that reverse-engineers the noise-generating mechanism, producing an instance-dependent mapping between the clean label posterior and the observed noisy label; and (2) a robust classifier that produces clean label posteriors. Compared to previous methods, the former model is novel and enables end-to-end learning of the latter directly from noisy labels. An extensive empirical study indicates that the time-consistency of data is critical to the success of training both models, and it motivates us to develop a curriculum that selects training data based on the dynamics of the two models' outputs over the course of training. We show that the curriculum-selected data provide both clean labels and high-quality input-output pairs for training the two models. The curriculum therefore leads to promising and robust classification performance even in the notably challenging setting of instance-dependent noisy labels, where many SoTA methods easily fail. Extensive experimental comparisons and ablation studies further demonstrate the advantages and significance of the time-consistency curriculum for learning from instance-dependent noisy labels on multiple benchmark datasets.

1. INTRODUCTION

The training of neural networks can easily fail in the presence of even simple instance-independent noisy labels, since the networks quickly overfit the noise (Zhang et al., 2017). In practice, however, it is usually difficult to control the labeling quality of large-scale datasets because the labels are generated by complicated mechanisms such as non-expert workers (Han et al., 2020b). An average of 3.3% noisy labels has been identified in the test/validation sets of 10 of the most commonly used datasets in computer vision, natural language, and audio analysis (Northcutt et al., 2021). Moreover, real-world noisy labels are generated in an instance-dependent manner, which is significantly more challenging to address than the most widely studied but oversimplified instance-independent noise, which assumes that the noise depends only on the class (Wei et al., 2022).

Two principal methodologies have been developed to address label noise: (1) detecting samples (X, Ỹ) with correct labels Ỹ = Y (empirically, the ones with the smallest loss values) and using them to train a clean classifier (Han et al., 2018b; Yu et al., 2019); and (2) learning the noise-generating mechanism, i.e., a transition matrix T defining the mapping between the clean label Y and the noisy label Ỹ such that P(Ỹ | X) = T⊤P(Y | X), where P(· | X) denotes the posterior vector, and then using it to build statistically consistent classifiers (Liu & Tao, 2016; Patrini et al., 2017; Yang et al., 2021).
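The transition-matrix relation above can be illustrated with a toy numerical sketch of the class-dependent special case; the 3-class matrix values here are made up for illustration only.

```python
import numpy as np

# Toy illustration of the relation P(Y~ | X) = T^T P(Y | X) for c = 3 classes.
# Row i of T gives the probability of observing each noisy label when the
# clean label is class i (each row sums to 1). Values are hypothetical.
T = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.8,  0.1 ],
    [0.0, 0.2,  0.8 ],
])

# Clean label posterior P(Y | X) for one instance.
clean_posterior = np.array([0.7, 0.2, 0.1])

# Noisy label posterior implied by the noise model.
noisy_posterior = T.T @ clean_posterior

print(noisy_posterior)        # a valid distribution over noisy labels
print(noisy_posterior.sum())  # sums to 1
```

In the instance-dependent setting, a single fixed T is replaced by a matrix-valued function T(X), which is what makes the estimation problem so much harder.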
Although both methodologies have achieved promising results in the simplified instance-independent (class-dependent) setting, they have non-trivial drawbacks when applied to the more practical but complicated instance-dependent noise: (1) the "small loss" trick is no longer effective in detecting correct labels (Cheng et al., 2021), because the loss threshold varies drastically across instances and is determined by each transition matrix T(X); (2) the instance-dependent transition matrix T(X) is not identifiable given only the noisy sample, and its estimation heavily relies on the estimation of the clean label Y in the triple (X, Y, Ỹ) (Yang et al., 2021), which is precisely the unsolved challenge in (1). The two learning problems are therefore entangled: the training of the clean label predictor and that of the transition matrix estimator depend on each other's accuracy, which in turn relies substantially on the quality of the training data (X, Y, Ỹ). Specifically, the "small loss" trick cannot provide a high-quality estimation of Y due to the instance-specific loss threshold. Moreover, the estimation of Y can change rapidly due to the non-stationary loss, which can fluctuate during training and, if such samples are selected for training, provide inconsistent training signals over time for both models. Furthermore, data subset selection inevitably introduces biases toward easy-to-fit samples and degrades data diversity (Yang et al., 2021; Cheng et al., 2021; Berthon et al., 2021; Cheng et al., 2020), which is in fact critical to the training and accuracy of both models, especially the transition matrix estimator, because easy-to-fit samples usually have extremely sparse transition matrices. To tackle these issues, we propose a novel metric, "Time-Consistency of Prediction (TCP)", to select high-quality data for training both models.
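As a point of reference, the classic "small loss" selection discussed above can be sketched as follows. The sketch uses a single global cut-off (a fixed quantile of the current losses), which is exactly what breaks down under instance-dependent noise, where the appropriate threshold varies per instance with T(X).

```python
import numpy as np

def small_loss_select(losses, keep_frac=0.5):
    """Sketch of the "small loss" trick: keep the keep_frac fraction of
    samples with the smallest current loss, treating them as clean.
    A single global threshold is applied to all instances."""
    k = max(1, int(keep_frac * len(losses)))
    return np.argsort(losses)[:k]  # indices of the k smallest losses

# Hypothetical per-sample losses at some training step.
losses = np.array([0.1, 2.0, 0.3, 5.0])
kept = small_loss_select(losses)  # treated as the "clean" subset
```

Under instance-dependent noise, a hard sample with a correct label can easily have a larger loss than an easy sample with a wrong label, so this global ranking misidentifies clean labels.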
TCP measures the consistency of the model's prediction for an instance over the course of training, which reflects whether its given label yields gradients consistent with the majority of other instances; this criterion turns out to be a more reliable identifier of clean labels. When applied to the training of the clean label predictor, TCP is more accurate in clean-label detection than the "small loss" (or high-confidence) criterion, because it avoids comparing confidences across samples whose loss/confidence thresholds are instance-dependent. Moreover, when applied to the training of the transition matrix estimator, TCP measures the time-consistency of the predicted noisy labels Ỹ. Surprisingly, it also faithfully reflects the correctness of the predicted clean label Y. Since the objective for estimating the transition matrix is defined by both Y and Ỹ, selecting samples with high TCP considerably improves the training of the transition matrix estimator. In addition, to exploit data diversity in training the two models, we apply a curriculum that starts by selecting only a few high-TCP data for early-stage training but progressively includes more training data once the two models become more mature and consistent.

In this paper, we develop a three-stage training strategy with the TCP curriculum embedded. In every training step, we first update the clean label predictor using selected data with high TCP on this model; we then train the transition matrix estimator, given the predicted clean label posterior and the noisy labels, on selected data with high TCP on the estimator; and we end by fine-tuning the clean label predictor directly using the noisy labels and the estimated transition matrix. It is worth noting that the TCP metrics for the two models are updated using the model outputs collected from this dynamic training process, without incurring additional cost.
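The idea of scoring samples by prediction consistency over time, and of growing the selected subset as training matures, can be sketched as follows. This is a minimal sketch under our own assumptions: the exact TCP formula and curriculum schedule are not reproduced here, and `tcp_scores` simply measures the fraction of consecutive epochs in which a sample's predicted label is unchanged.

```python
import numpy as np

def tcp_scores(pred_history):
    """Sketch of a Time-Consistency of Prediction (TCP) score.

    pred_history: (epochs, n_samples) array of predicted labels recorded
    over the course of training. This variant scores each sample by the
    fraction of consecutive epochs in which its prediction is unchanged.
    """
    same = pred_history[1:] == pred_history[:-1]  # (epochs - 1, n)
    return same.mean(axis=0)                      # high = time-consistent

def curriculum_select(scores, epoch, total_epochs, min_frac=0.2):
    """Select a growing fraction of the highest-TCP samples as training
    progresses: few samples early, more once the models mature."""
    frac = min_frac + (1.0 - min_frac) * epoch / total_epochs
    k = max(1, int(frac * len(scores)))
    return np.argsort(scores)[::-1][:k]  # indices of the top-k TCP samples

# Toy usage: sample 0 predicts class 0 every epoch; sample 1 keeps flipping.
history = np.array([[0, 1], [0, 2], [0, 1], [0, 2]])
scores = tcp_scores(history)
selected = curriculum_select(scores, epoch=0, total_epochs=10)
```

The same scoring scheme can be applied to either model: to the clean label predictor's outputs when selecting data for stage one, or to the estimator's predicted noisy labels when selecting data for stage two.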
As demonstrated by extensive empirical studies and experimental comparisons, our method leads to efficient joint training of the two models, in which each benefits from the other, and produces accurate estimates of both the clean labels and the instance-dependent transition matrix. On multiple benchmark datasets with either synthetic or real-world noise, our method achieves state-of-the-art performance with significant improvements.

2. BACKGROUND AND RELATED WORK

Let (X, Y) ∈ X × {1, ..., c} be the random variables for instances and clean labels, where X represents the instance space and c is the number of classes. In many real-world applications, the observed labels are not always correct but contain some noise. Let Ỹ be the random variable for the noisy label. What we observe is a sample {(x_1, ỹ_1), ..., (x_n, ỹ_n)} drawn from the noisy distribution D_ρ of the random variables (X, Ỹ). We aim to learn a robust classifier that assigns clean labels to test data by exploiting the sample with noisy labels.

Label noise models. Currently, there are three typical label noise models: the random classification noise (RCN) model (Biggio et al., 2011; Natarajan et al., 2013; Manwani & Sastry, 2013), the class-dependent label noise (CDN) model (Patrini et al., 2017; Xia et al., 2019; Zhang & Sabuncu, 2018), and the instance-dependent label noise (IDN) model (Berthon et al., 2021; Cheng et al., 2021). Specifically, RCN assumes that clean labels flip randomly at a constant rate; CDN assumes that the flip rate depends only on the true class; IDN considers the most general case of label noise, where the flip rate depends on the instance. Since IDN is non-identifiable without additional assumptions, some simplified variants have been proposed. Xia et al. (2020b) proposed the part-dependent label noise (PDN) model, which assumes that the label noise depends on parts of instances. Cheng et al. (2020) and Yang et al. (2021) assume that the flip rates depend on instances but can be upper bounded by a value smaller than 1. This paper focuses on the original IDN model without introducing any additional assumptions.
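The three noise models differ only in what the flip rate is allowed to depend on, which can be made concrete with a small label-corruption sketch; the samplers and the `transition_fn` argument here are hypothetical illustrations, not the generators used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 3  # number of classes

def rcn(y, rate):
    """Random classification noise: every label flips to a uniformly
    random other class at a constant rate, independent of x and y."""
    flip = rng.random(len(y)) < rate
    offset = rng.integers(1, c, size=len(y))  # shift to a different class
    return np.where(flip, (y + offset) % c, y)

def cdn(y, T):
    """Class-dependent noise: the noisy label is drawn from row T[y] of a
    fixed c x c transition matrix, so the flip rate depends only on y."""
    return np.array([rng.choice(c, p=T[yi]) for yi in y])

def idn(x, y, transition_fn):
    """Instance-dependent noise: each instance x_i has its own transition
    matrix T(x_i), supplied here by a hypothetical transition_fn, so the
    flip rate depends on the instance itself."""
    return np.array([rng.choice(c, p=transition_fn(xi)[yi])
                     for xi, yi in zip(x, y)])
```

RCN and CDN correspond to `transition_fn` being constant in x (and, for RCN, symmetric across classes); IDN removes both restrictions, which is why T(X) is not identifiable from noisy data alone.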

