LEARNING WITH INSTANCE-DEPENDENT LABEL NOISE: A SAMPLE SIEVE APPROACH

Abstract

Human-annotated labels are often prone to noise, and the presence of such noise degrades the performance of the resulting deep neural network (DNN) models. Much of the literature on learning with noisy labels (with several recent exceptions) focuses on the case where the label noise is independent of features. In practice, annotation errors tend to be instance-dependent and often depend on the difficulty of recognizing a particular instance. Applying existing results from instance-independent settings would require a significant amount of noise-rate estimation. Therefore, providing theoretically rigorous solutions for learning with instance-dependent label noise remains a challenge. In this paper, we propose CORES² (COnfidence REgularized Sample Sieve), which progressively sieves out corrupted examples. The implementation of CORES² does not require specifying noise rates, and yet we are able to provide theoretical guarantees that CORES² filters out the corrupted examples. This high-quality sample sieve allows us to treat clean examples and corrupted ones separately in training a DNN solution, and such a separation is shown to be advantageous in the instance-dependent noise setting. We demonstrate the performance of CORES² on the CIFAR-10 and CIFAR-100 datasets with synthetic instance-dependent label noise and on Clothing1M with real-world human noise. Of independent interest, our sample sieve provides generic machinery for anatomizing noisy datasets and a flexible interface for various robust training techniques to further improve performance. Code is available at https://github.com/UCSC-REAL/cores.

1. INTRODUCTION

Deep neural networks (DNNs) have gained popularity in a wide range of applications. The remarkable success of DNNs often relies on the availability of large-scale datasets. However, data annotation inevitably introduces label noise, and it is extremely expensive and time-consuming to clean up the corrupted labels. The existence of label noise can weaken the true correlation between features and labels as well as introduce artificial correlation patterns. Thus, mitigating the effects of noisy labels becomes a critical issue that needs careful treatment. It is challenging to avoid overfitting to noisy labels, especially when the noise depends on both the true labels Y and the features X. Unfortunately, this tends to be the case in practice, since human annotations are prone to different levels of errors for tasks with varying difficulty. Recent work has also shown that the presence of instance-dependent noisy labels imposes additional challenges and calls for caution when training in this scenario (Liu, 2021). For such instance-dependent (or feature-dependent, instance-based) label noise settings, theory-supported works usually focus on loss correction, which requires estimating noise rates (Xia et al., 2020; Berthon et al., 2020). Recent work by Cheng et al. (2020) addresses bounded instance-based noise by first learning the noisy distribution and then distilling examples according to some thresholds. However, with a dataset of limited size, learning an accurate noisy distribution for each example is a non-trivial task. Additionally, the size and the quality of the distilled examples are sensitive to the distillation thresholds. Departing from the above line of work, we design a sample sieve with theoretical guarantees that provides a high-quality split of clean and corrupted examples without the need to estimate noise rates.
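To make the instance-dependent setting concrete, the toy sketch below corrupts labels with a flip probability that varies per example. The "difficulty" score (here derived from the feature norm), the maximum noise rate, and the uniform choice of wrong class are all illustrative assumptions, not the construction used in this paper:

```python
import numpy as np

def flip_labels_instance_dependent(X, y, num_classes, max_rate=0.4, seed=0):
    """Corrupt labels with a per-example flip probability.

    The flip probability grows with a toy 'difficulty' score computed from
    the feature norm -- an illustrative choice standing in for how hard an
    instance is to annotate.
    """
    rng = np.random.default_rng(seed)
    # Toy difficulty in [0, 1]: "harder" examples receive noisier labels.
    norms = np.linalg.norm(X, axis=1)
    difficulty = (norms - norms.min()) / (np.ptp(norms) + 1e-12)
    flip_prob = max_rate * difficulty  # P(noisy label != Y | X), instance-dependent
    noisy = y.copy()
    for n in range(len(y)):
        if rng.random() < flip_prob[n]:
            # Flip to a uniformly chosen wrong class.
            wrong = [k for k in range(num_classes) if k != y[n]]
            noisy[n] = rng.choice(wrong)
    return noisy, flip_prob
```

Because the flip probability is a function of each x_n, the resulting noise cannot be summarized by a single pair of class-level noise rates, which is exactly what breaks instance-independent methods.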
Instead of learning the noisy distributions or noise rates, we focus on learning the underlying clean distribution and design a regularization term that improves the confidence of the learned classifier, which is proven to help safely sieve out corrupted examples. With the division between "clean" and "corrupted" examples, our training enjoys performance improvements by treating the clean examples (using a standard loss) and the corrupted ones (using an unsupervised consistency loss) separately. We summarize our main contributions:

1) We propose to train a classifier using a novel confidence regularization (CR) term and theoretically guarantee that, under mild assumptions, minimizing the confidence-regularized cross-entropy (CE) loss on the instance-based noisy distribution is equivalent to minimizing the pure CE loss on the corresponding "unobservable" clean distribution. This classifier is also shown to be helpful for evaluating each example to build our sample sieve.
2) We provide a theoretically sound sample sieve that simply compares each example's regularized loss with a closed-form threshold explicitly determined by the predictions of the model trained with our confidence-regularized loss, without any extra estimates.
3) To the best of our knowledge, the proposed CORES² (COnfidence REgularized Sample Sieve) is the first method that is thoroughly studied for a multi-class classification problem, has theoretical guarantees against overfitting to instance-dependent label noise, and provides a high-quality division without knowing or estimating noise rates.
4) By decoupling the regularized loss into separate additive terms, we also provide a novel and promising mechanism for understanding and controlling the effects of general instance-dependent label noise.
5) CORES² achieves competitive performance on multiple datasets, including CIFAR-10, CIFAR-100, and Clothing1M, under different label noise settings.
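One plausible reading of the confidence-regularized loss and the sieve criterion can be sketched as follows. The regularized loss subtracts, from each example's CE loss, β times the expected CE loss under a label prior; an example is kept as "clean" when its regularized loss falls below a per-example threshold. Here the β value, the use of a prior estimated from the noisy labels, and the mean-over-classes threshold are illustrative stand-ins; the paper derives its own closed-form threshold from the trained model's predictions:

```python
import numpy as np

def confidence_regularized_loss(probs, noisy_labels, prior, beta=2.0):
    """Per-example CE loss minus beta times the expected CE under a label
    prior -- the confidence-regularization idea described above.

    probs: (N, K) softmax outputs; prior: (K,) label prior (assumed here
    to be estimated from the noisy labels); beta: regularization strength.
    """
    n = np.arange(len(noisy_labels))
    ce = -np.log(probs[n, noisy_labels] + 1e-12)   # CE on the observed noisy label
    exp_ce = -(np.log(probs + 1e-12) @ prior)      # expected CE over the prior
    return ce - beta * exp_ce

def sample_sieve(probs, noisy_labels, prior, beta=2.0):
    """Flag an example as 'clean' when its regularized loss is below a
    per-example threshold; the mean-over-classes threshold used here is an
    illustrative stand-in for the paper's closed-form threshold."""
    reg_loss = confidence_regularized_loss(probs, noisy_labels, prior, beta)
    all_ce = -np.log(probs + 1e-12)                # (N, K): CE for every candidate label
    threshold = all_ce.mean(axis=1) - beta * (all_ce @ prior)
    return reg_loss <= threshold
```

With this threshold choice the regularizer cancels in the comparison, so an example survives the sieve exactly when its CE loss on the observed label is below its average CE over all classes, i.e., when the model is relatively confident the observed label fits.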
Other related works In addition to recent works by Xia et al. (2020), Berthon et al. (2020), and Cheng et al. (2020), we briefly overview the other most relevant references; detailed related work is left to Appendix A. Making the loss function robust to label noise is important for building a robust machine learning model (Zhang et al., 2016). One popular direction is loss correction, which first estimates the noise transition matrix (Patrini et al., 2017; Vahdat, 2017; Xiao et al., 2015; Zhu et al., 2021b; Yao et al., 2020b) and then performs correction/reweighting via forward or backward propagation, or further revises the estimated transition matrix with controllable variations (Xia et al., 2019). Another line of work designs specific losses without estimating transition matrices (Natarajan et al., 2013; Xu et al., 2019; Liu & Guo, 2020; Wei & Liu, 2021); however, these works assume the label noise is instance-independent, which limits their applicability. A further approach is sample selection (Jiang et al., 2017; Han et al., 2018; Yu et al., 2019; Northcutt et al., 2019; Yao et al., 2020a; Wei et al., 2020; Zhang et al., 2020a), which selects the "small-loss" examples as clean ones; however, we find this approach only works well under instance-independent label noise. Approaches such as label correction (Veit et al., 2017; Li et al., 2017; Han et al., 2019) or semi-supervised learning (Li et al., 2020; Nguyen et al., 2019) also lack guarantees for instance-based label noise.

2. CORES²: CONFIDENCE REGULARIZED SAMPLE SIEVE

Consider a classification problem on a set of N training examples denoted by D := {(x_n, y_n)}_{n∈[N]}, where [N] := {1, 2, ..., N} is the set of example indices. Examples (x_n, y_n) are drawn according to random variables (X, Y) ∈ X × Y from a joint distribution D. Let D_X and D_Y be the marginal distributions of X and Y. The classification task aims to identify a classifier f : X → Y that maps X to Y accurately. One common approach is minimizing the empirical risk using DNNs with respect to the cross-entropy (CE) loss defined as ℓ(f(x), y) = -ln(f_x[y]), y ∈ [K], where f_x[y] denotes the y-th component of f(x) and K is the number of classes. In real-world applications, such as human-annotated images (Krizhevsky et al., 2012; Zhang et al., 2017) and medical diagnosis (Agarwal et al., 2016), the learner can only observe a set of noisy labels. For instance, human annotators may wrongly label some images containing cats as ones containing dogs, accidentally or irresponsibly. The label noise of each instance is characterized by a noise transition matrix T(X), where each element T_ij(X) := P(Ỹ = j | Y = i, X). The corresponding noisy dataset and distribution are denoted by D̃ := {(x_n, ỹ_n)}_{n∈[N]} and D̃, respectively.² Let 1(·) be the indicator function taking value 1 when the specified condition is satisfied and 0 otherwise.

² In this paper, the noisy dataset refers to a dataset with noisy examples. A noisy example is either a clean example (whose label is true) or a corrupted example (whose label is wrong).
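The transition-matrix notation can be exercised directly: a noisy label ỹ for an example (x, y) is a draw from row y of T(x). The helper names and the toy T(x) below, whose off-diagonal mass grows with the magnitude of x, are hypothetical illustrations, not the paper's noise model:

```python
import numpy as np

def sample_noisy_label(T_of_x, x, y, rng):
    """Draw a noisy label from P(noisy = j | Y = y, X = x),
    i.e. from row y of the instance-dependent transition matrix T(x)."""
    T = T_of_x(x)                              # (K, K); each row sums to 1
    assert np.allclose(T.sum(axis=1), 1.0)
    return rng.choice(len(T), p=T[y])

def toy_T(x, K=3):
    """Toy instance-dependent T(x): more off-diagonal (noise) mass as the
    magnitude of x grows -- an illustrative stand-in for real annotators."""
    eps = min(0.4, 0.1 * abs(float(np.sum(x))))  # illustrative noise level
    T = np.full((K, K), eps / (K - 1))
    np.fill_diagonal(T, 1.0 - eps)
    return T
```

Note that instance-independent noise is the special case where T(x) is the same matrix for every x; everything that makes the instance-dependent setting hard comes from T varying with x.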

