WHICH IS BETTER FOR LEARNING WITH NOISY LABELS: THE SEMI-SUPERVISED METHOD OR MODELING LABEL NOISE?

Abstract

In real life, accurately annotating large-scale datasets is sometimes difficult, so datasets used for training deep learning models are likely to contain label noise. To make use of a dataset containing label noise, two typical families of methods have been proposed. One employs semi-supervised learning by exploiting labeled confident examples and unlabeled unconfident examples. The other models the label noise and designs statistically consistent classifiers. A natural question remains open: which one should be used for a specific real-world application? In this paper, we answer the question from the perspective of the causal data generation process. Specifically, the performance of the semi-supervised method depends heavily on the data generation process, while the modeling label-noise method is independent of the generation process. For example, if a given dataset has the causal generative structure that the features cause the label, the semi-supervised method would not be helpful. When the causal structure is unknown, we provide an intuitive method to discover it for a given dataset containing label noise.

1. INTRODUCTION

Deep neural networks can achieve remarkable performance when accurately annotated large-scale training datasets are available. However, annotating a large number of examples accurately is often expensive and sometimes infeasible in real life. Cheap datasets which contain label errors are easy to obtain (Li et al., 2019) and have been widely used to train deep neural networks. Recent results (Han et al., 2018; Nguyen et al., 2019) show that deep neural networks can easily memorize label noise during training, which leads to poor test performance. To reduce the side effect of label noise, there are two major streams of methods. One stream focuses on getting rid of label errors. Specifically, these methods first select confident examples (i.e., those whose labels are likely to be correct), e.g., by exploiting the memorization effect of deep networks (Jiang et al., 2018). Then, by discarding the labels of unconfident examples (i.e., those whose labels are likely to be incorrect) and keeping their unlabeled instances, they (Li et al., 2019; 2020; Wei et al., 2020; Yao et al., 2021; Tan et al., 2021; Ciortan et al., 2021) employ semi-supervised learning, e.g., MixMatch, to achieve state-of-the-art performance. These methods are usually based on heuristics and lack theoretical guarantees. The other major stream models the label noise and then removes its side effects. These methods mainly focus on estimating the label noise transition matrix T(x), where T_ij(x) = P(Ỹ = i | Y = j, X = x) is the probability that an instance x with clean label Y = j flips to noisy label Ỹ = i. The idea is that the clean class posterior distribution P(Y|X) can be inferred by learning the transition matrix T(x) and the noisy class posterior distribution P(Ỹ|X).
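As a minimal numerical sketch of this idea (the matrix and posteriors below are illustrative, not taken from the paper): if T and the noisy class posterior are known for an instance, the clean class posterior can be recovered by solving a linear system.

```python
import numpy as np

# Illustrative 3-class transition matrix: T[i, j] = P(noisy = i | clean = j, x),
# so each column sums to 1.
T = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# Assumed clean class posterior P(Y | x) for one instance (illustrative).
clean_posterior = np.array([0.7, 0.2, 0.1])

# Noisy class posterior implied by the noise model: P(noisy Y | x) = T P(Y | x).
noisy_posterior = T @ clean_posterior

# Inverting the relation recovers the clean posterior from the noisy one.
recovered = np.linalg.solve(T, noisy_posterior)

print(noisy_posterior)  # ~ [0.59, 0.24, 0.17]
print(recovered)        # matches clean_posterior up to numerical error
```

In practice, of course, neither T(x) nor the exact noisy posterior is known; both must be estimated from the noisy training data, which is where estimation error enters.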
In general, when T(x) is well estimated (or given), these methods are statistically consistent, i.e., they guarantee that the classifiers learned from the noisy data converge to the optimal classifiers defined on the clean data as the size of the noisy training data increases (Patrini et al., 2017; Xia et al., 2019). This naturally raises the question: which stream of methods should be used for a specific real-world application? Answering this question is crucial for the community of learning with noisy labels. If one stream of methods is dominating, future efforts should mainly focus on that stream. If not, we should know their differences and which method should be used for a specific real-world application. In this paper, from a causal perspective, we answer that neither of the two streams is dominating. Each has advantages and disadvantages, which are closely related to the underlying data generation process. The semi-supervised methods can easily incorporate heuristics (e.g., prior knowledge) to make use of a finite training sample, but they do not work if the feature is the cause of the label in the data generation process. The modeling label-noise methods are not influenced by the data generation process; they can make use of all the instances and noisy labels and can be statistically consistent, but they need a large training sample to perform well. Specifically, when the instance X is a cause of the clean label Y, the distributions P(X) and P(Y|X) are disentangled (Schölkopf et al., 2012; Zhang et al., 2015), which means that P(X) contains no labeling information. In other words, exploiting the unlabeled data with semi-supervised methods cannot help learn the classifier. When the clean label Y is a cause of the instance X, the distributions P(X) and P(Y|X) are entangled (Schölkopf et al., 2012; Zhang et al., 2015), and P(X) generally contains some information about P(Y|X).
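A toy simulation (all numbers and thresholds are illustrative) of the anticausal case Y causes X: when the label generates the features, the marginal P(X) is a mixture whose components line up with the classes, so unlabeled instances alone expose labeling structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Anticausal generation Y -> X: draw a label first, then draw the feature
# from a class-conditional Gaussian. Class means are illustrative.
n = 5000
y = rng.integers(0, 2, size=n)
means = np.array([-2.0, 2.0])
x = rng.normal(means[y], 1.0)

# Without using any labels, the marginal P(X) already separates into two
# modes; a simple threshold at 0 (one step of clustering) recovers them.
clusters = (x > 0).astype(int)
agreement = max((clusters == y).mean(), 1 - (clusters == y).mean())
print(agreement)  # close to the Bayes accuracy of this mixture (~0.98)
```

In the causal case X causes Y, by contrast, P(X) can be arbitrary (e.g., a single featureless blob) while P(Y|X) is chosen independently, so clustering the unlabeled marginal carries no information about the labeling function.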
Then the semi-supervised methods are helpful. In many real-world applications, the causal structure of the data generation process is unknown. To detect it for a specific noisy dataset, we propose an intuitive method that exploits an asymmetric property of the two different causal structures (X causes Y vs. Y causes X) with respect to estimating the transition matrix.

2. RELATED WORK

In this section, we first introduce the two major streams, i.e., the methods employing semi-supervised learning and the methods based on modeling label noise. Then we introduce the causal generation process of the noisy data.

Methods based on semi-supervised learning. Semi-supervised learning is widely employed in learning with noisy labels. To get rid of label errors, existing methods usually divide the dataset into confident examples and unconfident examples. Deep neural networks are then trained on the confident examples in a supervised manner (Jiang et al., 2018; Han et al., 2018). To also make use of the unconfident examples, which contain a large number of incorrect labels, different semi-supervised learning techniques can be employed on their unlabeled instances. For example, consistency regularization (Laine and Aila, 2016) is employed by (Englesson and Azizpour, 2021); FixMatch (Sohn et al., 2020) is employed by (Li et al., 2019); co-regularization is employed by (Wei et al., 2020); contrastive learning is employed by (Tan et al., 2021; Ciortan et al., 2021; Li et al., 2020; Ghosh and Lan, 2021; Yao et al., 2021; Zheltonozhskii et al., 2022). Empirically, these methods have demonstrated state-of-the-art performance.

Methods based on modeling label noise. This family of methods mainly focuses on designing statistically consistent methods by employing the noise transition matrix T(x). Specifically, given an instance x, its transition matrix T(x) describes the transition from the clean label to the noisy label of the instance, i.e.,

T(x) [P(Y = 1|x), . . . , P(Y = C|x)]^⊤ = [P(Ỹ = 1|x), . . . , P(Ỹ = C|x)]^⊤. (1)

Let h : X → ∆^{C-1} model a class posterior distribution and ℓ_ce be the cross-entropy loss; then

arg min_h E_{x,y}[ℓ_ce(y, h(x))] = arg min_h E_{x,ỹ}[ℓ_ce(ỹ, T(x)h(x))].
The above equation shows that if T(x) is given, the minimizer of the corrected loss under the noisy distribution is the same as the minimizer of the original loss under the clean distribution (Liu and Tao, 2016; Patrini et al., 2017). In practice, T(x) is usually not given and needs to be estimated from noisy data (Xia et al., 2020; Li et al., 2021). It is also worth mentioning that methods focusing on designing robust loss functions are closely related to the modeling label-noise methods. These methods usually require the noise rate for hyperparameter selection (Zhang and Sabuncu, 2018; Liu and Guo, 2020). To calculate the noise rate, T(x) usually has to be estimated (Yao et al., 2020).
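A minimal sketch of this loss correction (pure NumPy; the matrix, logits, and function names are illustrative, not the paper's implementation): the cross-entropy is evaluated against T(x)h(x) instead of h(x), so minimizing it under noisy labels drives h toward the clean posterior.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward_corrected_ce(logits, noisy_label, T):
    """Cross-entropy of the noisy label against T h(x).

    T[i, j] = P(noisy = i | clean = j); h(x) = softmax(logits) models the
    clean posterior, so T @ h(x) is the predicted noisy posterior.
    """
    h = softmax(logits)
    noisy_pred = T @ h
    return -np.log(noisy_pred[noisy_label] + 1e-12)

# Illustrative 2-class transition matrix (columns sum to 1).
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
loss = forward_corrected_ce(np.array([2.0, 0.0]), noisy_label=0, T=T)
print(loss)  # small loss: the model's clean prediction agrees with the label
```

Training would minimize this corrected loss over the noisy sample; with a consistently estimated T(x), the learned h converges to the clean class posterior as the sample grows.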

