WHICH IS BETTER FOR LEARNING WITH NOISY LABELS: THE SEMI-SUPERVISED METHOD OR MODELING LABEL NOISE?

Abstract

In real life, accurately annotating large-scale datasets is often difficult, so datasets used for training deep learning models are likely to contain label noise. To make use of datasets containing label noise, two typical families of methods have been proposed. One employs semi-supervised learning by exploiting labeled confident examples and unlabeled unconfident examples. The other models the label noise and designs statistically consistent classifiers. A natural question remains unsolved: which one should be used for a specific real-world application? In this paper, we answer the question from the perspective of the causal data generation process. Specifically, the semi-supervised method depends heavily on the data generation process, while the modeling label-noise method is independent of the generation process. For example, if a given dataset has a causal generative structure in which the features cause the label, the semi-supervised method would not be helpful. When the causal structure is unknown, we provide an intuitive method to discover it for a given dataset containing label noise.

1. INTRODUCTION

Deep neural networks can achieve remarkable performance when accurately annotated large-scale training datasets are available. However, annotating a large number of examples accurately is often expensive and sometimes infeasible in real life. Cheap datasets which contain label errors are easy to obtain (Li et al., 2019) and have been widely used to train deep neural networks. Recent results (Han et al., 2018; Nguyen et al., 2019) show that deep neural networks can easily memorize label noise during training, which leads to poor test performance. To reduce the side effect of label noise, there are two major streams of methods. One stream focuses on getting rid of label errors. Specifically, these methods first select confident examples (i.e., examples whose labels are likely to be correct), e.g., by exploiting the memorization effect of deep networks (Jiang et al., 2018). Then, by discarding the labels of unconfident examples (i.e., examples whose labels are likely to be incorrect) and keeping their unlabeled instances, they (Li et al., 2019; 2020; Wei et al., 2020; Yao et al., 2021; Tan et al., 2021; Ciortan et al., 2021) employ semi-supervised learning, e.g., MixMatch, to achieve state-of-the-art performance. These methods are usually based on heuristics and lack theoretical guarantees. The other major stream models the label noise and then removes its side effects. These methods mainly focus on estimating the label noise transition matrix T(x), i.e., T_ij(x) = P(Ỹ = i|Y = j, X = x), representing the probability that an instance x with clean label Y = j flips to the noisy label Ỹ = i. The idea is that the clean class posterior distribution P(Y|X) can be inferred by learning the transition matrix T(x) and the noisy class posterior distribution P(Ỹ|X).
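To make the role of the transition matrix concrete, here is a minimal numpy sketch (with made-up numbers) of how T maps the clean class posterior to the noisy one, and how the clean posterior can be recovered when T is invertible:

```python
import numpy as np

# Illustrative 2-class transition matrix with T[i, j] = P(noisy = i | clean = j);
# each column sums to 1. The values are made up for this sketch.
T = np.array([[0.8, 0.3],
              [0.2, 0.7]])

clean_posterior = np.array([0.9, 0.1])  # P(Y | x) for one instance x
noisy_posterior = T @ clean_posterior   # P(noisy Y | x)

# When T is invertible, the clean posterior is identifiable from the noisy one.
recovered = np.linalg.solve(T, noisy_posterior)
print(np.allclose(recovered, clean_posterior))  # True
```

This identifiability is what makes the second stream of methods statistically consistent once T is known or well estimated.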
In general, when T(x) is well estimated (or given), these methods are statistically consistent, i.e., they guarantee that the classifiers learned from the noisy data converge to the optimal classifiers defined on the clean data as the size of the noisy training data increases (Patrini et al., 2017; Xia et al., 2019). This naturally raises the question: which stream of methods should be used for a specific real-world application? Answering this question is crucial for the community of learning with noisy labels. If one stream of methods dominates, future efforts should mainly focus on that stream. If not, we should know their differences and which method should be used for a specific real-world application. In this paper, from a causal perspective, we show that neither stream dominates. Each has advantages and disadvantages that are closely related to the underlying data generation process. The semi-supervised methods can easily incorporate heuristics (e.g., prior knowledge) to make use of a finite training sample, but they do not work if the feature is the cause of the label in the data generation process. The modeling label-noise methods are not influenced by the data generation process; they can make use of all the instances and noisy labels and can be statistically consistent, but they need a large training sample to perform well. Specifically, when the instance X is a cause of the clean label Y, the distributions P(X) and P(Y|X) are disentangled (Schölkopf et al., 2012; Zhang et al., 2015), which means that P(X) contains no labeling information. In other words, exploiting the unlabeled data with semi-supervised methods cannot help learn the classifier. When the clean label Y is a cause of the instance X, the distributions P(X) and P(Y|X) are entangled (Schölkopf et al., 2012; Zhang et al., 2015), and P(X) generally contains some information about P(Y|X).
In that case, the semi-supervised methods are helpful. In many real-world applications, we do not know the causal structure of the data generation process. To detect it on a specific noisy dataset, we propose an intuitive method that exploits an asymmetric property of the two different causal structures (X causes Y vs. Y causes X) regarding estimating the transition matrix.

2. RELATED WORK

In this section, we first introduce the two major streams, i.e., the methods employing semi-supervised learning and the methods based on modeling label noise. Then we introduce the causal generation process of the noisy data.

Methods based on semi-supervised learning. Semi-supervised learning is widely employed in learning with noisy labels. To get rid of label errors, existing methods usually divide the dataset into confident examples and unconfident examples. The deep neural networks are then trained on the confident examples in a supervised manner (Jiang et al., 2018; Han et al., 2018). To also make use of the unconfident examples, which contain a large number of incorrect labels, different semi-supervised learning techniques can be employed on just their unlabeled instances. For example, consistency regularization (Laine and Aila, 2016) is employed by Englesson and Azizpour (2021); FixMatch (Sohn et al., 2020) is employed by Li et al. (2019); co-regularization is employed by Wei et al. (2020); contrastive learning is employed by (Tan et al., 2021; Ciortan et al., 2021; Li et al., 2020; Ghosh and Lan, 2021; Yao et al., 2021; Zheltonozhskii et al., 2022). Empirically, these methods have demonstrated state-of-the-art performance.

Methods based on modeling label noise. This family of methods mainly focuses on designing statistically consistent methods by employing the noise transition matrix T(x). Specifically, given an instance x, its transition matrix T(x) reveals the transition relationship from clean labels to noisy labels for that instance, i.e.,

T(x)[P(Y = 1|x), . . . , P(Y = L|x)]^⊤ = [P(Ỹ = 1|x), . . . , P(Ỹ = L|x)]^⊤. (1)

Let h : X → ∆^{L−1} model a class posterior distribution and ℓ_ce be the cross-entropy loss. Then

arg min_h E_{x,y}[ℓ_ce(y, h(x))] = arg min_h E_{x,ỹ}[ℓ_ce(ỹ, T(x)h(x))]. (2)
The above equation shows that if T(x) is given, the minimizer of the corrected loss under the noisy distribution is the same as the minimizer of the original loss under the clean distribution (Liu and Tao, 2016; Patrini et al., 2017). In practice, T(x) is usually not given and needs to be estimated from noisy data (Xia et al., 2020; Li et al., 2021). It is also worth mentioning that methods focusing on designing robust loss functions are closely related to modeling label-noise methods. These methods usually require the noise rate for hyperparameter selection (Zhang and Sabuncu, 2018; Liu and Guo, 2020), and to calculate the noise rate, T(x) usually has to be estimated (Yao et al., 2020).

Causal generation process of noisy data. We introduce some background knowledge about causality and describe the data generation process by the causal graph and the structural causal model (SCM) (Spirtes and Zhang, 2016). Specifically, in Fig. 1(a), we illustrate a possible data generation process when data contains instance-dependent label noise by using the causal graph, which represents a flow of information and reveals causal relationships among all the variables (Glymour et al., 2019). For example, Fig. 1(a) shows that the latent clean label Y is a cause of the instance X, and both X and Y are causes of Ỹ. The generation process can also be described by a structural causal model (SCM):

Y ∼ P_Y, U_X ∼ P_{U_X}, X = f(Y, U_X), U_Ỹ ∼ P_{U_Ỹ}, Ỹ = g(X, Y, U_Ỹ),

where U_X and U_Ỹ are mutually independent exogenous random variables that are also independent of Y. The exogenous variables model the random sampling processes of X and Ỹ. The functions f and g can be linear or non-linear. Each equation specifies the distribution of a variable conditioned on its parents (which could be an empty set). Similarly, the SCM corresponding to the causal graph in Fig.
1(b) can be written as:

X ∼ P_X, U_Y ∼ P_{U_Y}, U_Ỹ ∼ P_{U_Ỹ}, Y = f′(X, U_Y), Ỹ = g(X, Y, U_Ỹ).

Causal factorization and modularity. By the conditional independence relations implied by the Markov property (Pearl, 2000), the joint distribution P(X, Y, Ỹ) when Y causes X can be factorized along the causal direction as follows:

P(X, Y, Ỹ) = P(Y)P(X|Y)P(Ỹ|X, Y).

The above decomposition is called a causal decomposition. According to the modularity property of causal mechanisms (Schölkopf et al., 2012; Peters et al., 2017), the conditional distribution of each variable given its causes (which could be an empty set) does not inform or influence the other conditional distributions, which implies that the distributions P(Y), P(X|Y), and P(Ỹ|X, Y) are disentangled. Similarly, when X causes Y, the causal decomposition of P(X, Y, Ỹ) is:

P(X, Y, Ỹ) = P(X)P(Y|X)P(Ỹ|X, Y).
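As a concrete toy instance of the first SCM above, the following sketch samples triples (x, y, ỹ) in which Y causes X and the flip probability depends on the instance; every functional form and constant here is an illustrative assumption, not the paper's specification.

```python
import math
import random

def sample_scm(n, flip_scale=0.2, seed=0):
    """Sample (x, y, y_noisy) from a toy SCM where Y causes X and the
    noisy label depends on both X and Y (instance-dependent noise)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)          # Y ~ P_Y (uniform prior)
        u_x = rng.gauss(0.0, 1.0)      # exogenous variable U_X
        x = 2.0 * y + u_x              # X = f(Y, U_X)
        # flip probability peaks near the class boundary at x = 1
        p_flip = flip_scale * math.exp(-abs(x - 1.0))
        y_noisy = (1 - y) if rng.random() < p_flip else y   # Y~ = g(X, Y, U)
        data.append((x, y, y_noisy))
    return data
```

Because instances near the class boundary are most ambiguous, giving them the highest flip probability is a common way to mimic instance-dependent label noise.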

3. LEARNING WITH NOISY LABELS FROM A CAUSAL PERSPECTIVE

In this section, we show that the modeling label-noise method is independent of different generation processes, while the semi-supervised methods depend on the generation process. We also propose an intuitive method to detect the causal structure by exploiting an asymmetric property regarding estimating the transition matrix.

3.1. THE INFLUENCE OF NOISY DATA GENERATION PROCESSES ON DIFFERENT METHODS

The modeling label-noise method is independent of different generation processes. The reason is that these methods mainly rely on estimating the transition matrix T(x), which can be estimated by exploiting the noisy class posterior P(Ỹ|X) learned on the noisy data (Xia et al., 2020; Li et al., 2021). The data generation process clearly does not influence learning P(Ỹ|X) and T(x). By contrast, the semi-supervised methods are influenced by the data generation process because they rely on exploiting the unlabeled data to help learn the classifier, and the helpfulness of unlabeled data depends on whether P(X) contains labeling information. According to the causal modularity property, when X causes Y, P(X) does not contain labeling information, because P(Y|X) and P(X) are disentangled from each other. However, when Y causes X, P(X) should contain labeling information, because P(X) and P(Y|X) are entangled with each other. To illustrate the entanglement clearly, we derive that, when Y causes X, P(Y|X) and P(X) change simultaneously to P′(Y|X) and P′(X) if we intervene on Y, i.e., change P(Y) to a different distribution P′(Y). Specifically, when P(Y) is changed to P′(Y), P(X|Y) is not influenced because of the modularity property (Pearl, 2000). Since P(Y) is changed to P′(Y) and P(X|Y) remains fixed, after the intervention the joint distribution P(X, Y) = P(Y)P(X|Y) is changed to a new joint distribution P′(X, Y) = P′(Y)P(X|Y). Then P(X) is changed to P′(X) = Σ_y P′(Y = y)P(X|Y = y). By applying Bayes' rule, P(Y|X) = P(Y)P(X|Y)/P(X) changes to a different distribution P′(Y|X) = P′(Y)P(X|Y)/P′(X), unless P′(Y)/P′(X) = P(Y)/P(X), which is a special case. Therefore, P(Y|X) and P(X) are generally entangled when Y causes X. To provide more intuition, we illustrate a toy example in Fig. 2.
The change of the selected label sets only changes the classification rules (tasks). It is clear that relabeling the sampled data points with different labels according to the new rules does not influence the distribution of the sampled data points P(X), and P(X) is disentangled from the different label sets. Then P(X) generally does not contain information for learning the clean label Y. Therefore, the semi-supervised methods may not work well in this case.
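The entanglement argument can also be checked numerically. The sketch below uses a made-up discrete X with three values: under the anticausal direction, intervening on P(Y) changes the marginal P(X).

```python
# Toy anticausal example: P(X) = sum_y P(Y=y) P(X|Y=y) over three x values.
# All probabilities below are illustrative.
p_x_given_y = {0: [0.7, 0.2, 0.1],   # P(X | Y=0)
               1: [0.1, 0.2, 0.7]}   # P(X | Y=1)

def marginal_x(p_y):
    """Marginal P(X) induced by a class prior p_y (dict: class -> probability)."""
    return [sum(p_y[y] * p_x_given_y[y][i] for y in (0, 1)) for i in range(3)]

p_x_before = marginal_x({0: 0.5, 1: 0.5})   # original prior
p_x_after = marginal_x({0: 0.9, 1: 0.1})    # after intervening on P(Y)
print(p_x_before != p_x_after)  # True: P(X) carries labeling information
```

In the causal (X causes Y) direction, the same intervention would instead rewrite the labeling function f′ and leave P(X) untouched, matching the toy example in Fig. 2.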

3.2. AN INTUITIVE METHOD FOR CAUSAL STRUCTURE DETECTION

In many real-world applications, the causal structure of the noisy data generation process is unknown. To discover the causal structure, we provide an intuitive causal structure detection method for learning with noisy labels (the CDNL estimator). Our method relies on an asymmetric property of estimating flip rates under different generation processes. Specifically, when X causes Y, the flip rate P(Ỹ|Y′) estimated by an unsupervised classification method usually has a large estimation error, where Y′ denotes the pseudo labels estimated by the unsupervised method. However, when Y causes X, the estimation error is small. It is worth mentioning that the performance of the proposed CDNL estimator relies on the backbone unsupervised classification method: when Y causes X, the backbone method is expected to have reasonable classification accuracy on the training instances. Thanks to the great success of unsupervised learning methods (Likas et al., 2003; Niu et al., 2021; Ghosh and Lan, 2021; Zhou et al., 2021), some of these methods can even have comparable performance with supervised learning on benchmark datasets such as STL10 (Coates et al., 2011) and CIFAR10 (Krizhevsky et al., 2009). Let Y* = arg max_i P(Y = i|x) be the Bayes label under the clean class-posterior distribution. To obtain the estimation error, we calculate the average difference between the noise rate estimated by the method based on modeling label noise and the noise rate estimated by a clustering algorithm, i.e.,

d(P(Ỹ|Y*), P(Ỹ|Y′)) = Σ_{i=1}^L Σ_{j=1}^L |P(Ỹ = j|Y* = i) − P(Ỹ = j|Y′ = i)| / L². (3)

The intuition is that, given a noisy dataset, suppose the Bayes labels and pseudo labels of all instances are known and fixed; then P(Ỹ|Y*) and P(Ỹ|Y′) are different in general, unless Y* and Y′ are identical to each other, as shown in the following theorem.

Theorem 1.
Let P(Y*|Y′) be the transition relationship from the pseudo label Y′ to the Bayes label Y*. Then d(P(Ỹ|Y*), P(Ỹ|Y′)) = 0 if and only if either 1) for all i ∈ {1, . . . , L}, P(Y* = i|Y′ = i) = 1, or 2) for all i, j ∈ {1, . . . , L},

(1 − P(Y* = i|Y′ = i))P(Ỹ = j|Y* = i) = Σ_{k≠i} P(Y* = k|Y′ = i)P(Ỹ = j|Y* = k), (4)

where Σ_k P(Y* = k|Y′ = i) = 1.

The above theorem shows that for d(P(Ỹ|Y*), P(Ỹ|Y′)) = 0 to hold, either condition 1) or condition 2) has to be satisfied. To satisfy condition 1), for all examples with Y′ = i, their Bayes labels also have to be i, which implies that the two variables Y* and Y′ are identical to each other. In this case, Y must cause X, because P(X) contains labeling information, i.e., by exploiting P(X), Y′ = Y* can be learned. Condition 2) is a special case that requires all entries of P(Ỹ|Y*) and P(Y*|Y′) to be carefully designed so that Eq. (4) holds, which is hard to satisfy in general.

Estimation of P(Ỹ|Y′). To estimate the flip rate P(Ỹ|Y′), a clustering method is employed first to learn the clusters C. The clusters C are then converted into pseudo labels Y′ by exploiting the estimated Bayes labels Ŷ*, and the average noise rate P(Ỹ|Y′) can be directly calculated. To be more specific, let C = i denote cluster label i, and let S_{C_i} = {x_j}_{j=1}^{N_{C_i}} denote the set of instances with cluster label i. Similarly, let S_{Ŷ*_j} = {x_k}_{k=1}^{N_{Ŷ*_j}} denote the set of instances whose estimated Bayes label is j, obtained by employing label-noise learning methods (Patrini et al., 2017). We assign the pseudo labels Ŷ′ of all instances in S_{C_i} to be the dominant estimated Bayes label, i.e.,

Ŷ′ = arg max_{j∈{1,...,L}} Σ_{x_k ∈ S_{Ŷ*_j}} 1{x_k ∈ S_{C_i}} / N_{C_i}. (5)

Empirically, the assignment is implemented by applying the Hungarian algorithm (Jonker and Volgenant, 1986). After the assignment, the pseudo labels of all training examples are obtained.
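The paper implements this assignment with the Hungarian algorithm; as a dependency-free sketch, the same matching can be done by brute force over permutations when the number of classes is small (function and variable names here are my own):

```python
from itertools import permutations

def match_clusters(cluster_ids, bayes_labels, num_classes):
    """Map each cluster ID to a class so that agreement with the estimated
    Bayes labels is maximized (a brute-force stand-in for the Hungarian
    algorithm; only feasible for small num_classes)."""
    best_agree, best_map = -1, None
    for perm in permutations(range(num_classes)):
        agree = sum(1 for c, b in zip(cluster_ids, bayes_labels) if perm[c] == b)
        if agree > best_agree:
            best_agree, best_map = agree, perm
    # relabel every instance's cluster ID with its matched class
    return [best_map[c] for c in cluster_ids]
```

For example, `match_clusters([0, 0, 1, 1], [1, 1, 0, 0], 2)` maps cluster 0 to class 1 and cluster 1 to class 0, returning `[1, 1, 0, 0]`.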
Then P(Ỹ|Y′) can be estimated by counting on training examples, i.e.,

P(Ỹ = j|Y′ = i) = Σ_{(x,ỹ,ŷ′)} 1{ŷ′ = i ∧ ỹ = j} / Σ_{(x,ỹ,ŷ′)} 1{ŷ′ = i}, (6)

where 1{·} is the indicator function, (x, ỹ, ŷ′) is a training example with its estimated pseudo label, and ∧ represents the AND operation.

Estimation of P(Ỹ|Y*). We estimate the average flip rate P(Ỹ|Y*) in an end-to-end manner. Specifically, let h be a deep classification model that outputs the estimated Bayes label in a one-hot fashion; empirically, this can be achieved by employing Gumbel-Softmax (Jang et al., 2016). The distribution P(Ỹ|Y*) is modeled by a trainable diagonally dominant column-stochastic matrix A. Then, similar to the state-of-the-art method (Li et al., 2021), A can be estimated by minimizing the empirical loss on noisy data, i.e.,

{Â*, ĥ} = arg min_{A,h} (1/N) Σ_{x,ỹ} ℓ_ce(ỹ, Ah(x)), s.t. max_i h_i(x) = 1. (7)

In Section 4.1.1, we show that the estimation error of P(Ỹ|Y*) obtained by our method is much smaller than that of the state-of-the-art method VolMinNet (Li et al., 2021) for both instance-dependent and instance-independent label noise. The advantage of our method for estimating P(Ỹ|Y*) mainly comes from two perspectives. 1) To estimate P(Ỹ|Y*) with existing methods, the noise transition matrix P(Ỹ|Y, X) has to be learned in advance, but P(Ỹ|Y, X) is hard to estimate in practice (Li et al., 2019; Yao et al., 2020). Our method avoids learning P(Ỹ|Y, X) and directly estimates the average flip rate P(Ỹ|Y*). Specifically, to estimate P(Ỹ|Y*) with existing methods, P(Ỹ|X) and P(Ỹ|Y, X) have to be learned first. Then both the estimated clean label Y and the Bayes label Y* can be revealed by Eq. (2). After that, P(Ỹ|Y*) can be estimated with the same counting technique as in Eq. (6). However, P(Ỹ|Y, X) is usually hard to estimate.
As a result, the learned classifier (in Eq. (2)) and the Bayes labels are poorly estimated, which leads to a large estimation error of P(Ỹ|Y*). 2) We let h directly estimate Bayes labels rather than P(Y|X). The output complexity of h is reduced from a continuous distribution P(Y|X) to a discrete distribution, so the difficulty of learning P(Ỹ|Y*) is reduced.
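The counting estimator in Eq. (6) and the distance in Eq. (3) amount to simple counting and averaging. A minimal sketch (function names are my own):

```python
def estimate_flip_rates(pseudo_labels, noisy_labels, num_classes):
    """Counting estimator of P(noisy = j | pseudo = i), as in Eq. (6)."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for yp, yn in zip(pseudo_labels, noisy_labels):
        counts[yp][yn] += 1
    rates = []
    for row in counts:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates  # rates[i][j] approximates P(noisy = j | pseudo = i)

def flip_rate_distance(p1, p2, num_classes):
    """Average absolute difference between two flip-rate tables, as in Eq. (3)."""
    return sum(abs(p1[i][j] - p2[i][j])
               for i in range(num_classes)
               for j in range(num_classes)) / num_classes ** 2
```

The same counting routine can be reused for P(Ỹ|Y*) once estimated Bayes labels are substituted for the pseudo labels.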

4. EXPERIMENTS

In this section, we illustrate the performance of the proposed estimator and of the different methods under different data generation processes in the presence of label noise.

Baselines.

We illustrate the performance of state-of-the-art modeling label-noise methods and semi-supervised methods. The modeling label-noise methods employed are: (i) Forward (Patrini et al., 2017), which estimates the transition matrix and embeds it into the neural network; (ii) Reweighting (Liu and Tao, 2016), which gives training examples different weights according to the transition matrix by importance reweighting; (iii) T-Revision (Xia et al., 2019), which refines the learned transition matrix to improve classification accuracy. The semi-supervised methods employed are: (iv) JoCoR (Wei et al., 2020), which aims to reduce the diversity of two networks during training; (v) MoPro (Li et al., 2020), which is a contrastive learning method that achieves online label noise correction; (vi) DivideMix (Li et al., 2019), which leverages the techniques of FixMatch (Sohn et al., 2020) and Mixup (Zhang et al., 2018); (vii) Mixup (Zhang et al., 2018), which trains a neural network on convex combinations of pairs of examples and their labels. For all baseline methods, we follow the hyperparameter settings from their original papers. It is worth noting that MoPro focuses on image datasets; to let it work on non-image datasets, we replace the strong data augmentation for images with small Gaussian noise, which may influence its performance.

Datasets and noise types. We employ 2 synthetic datasets, XYgaussian and YXgaussian. We also demonstrate the performance of our methods on 6 real-world datasets: KrKp, Balancescale, Splice, Waveform, MNIST, and CIFAR10. The causal datasets (generated from X to Y) are KrKp, Balancescale, and Splice; the rest are anticausal datasets (generated from Y to X). We manually inject label noise into all datasets, and 20% of the data is left as the validation set. Three types of noise are employed in our experiments.
(1) Symmetry flipping (Sym) (Patrini et al., 2017), which randomly replaces a percentage of labels in the training data with all possible labels. (2) Pair flipping (Pair) (Han et al., 2018), where labels are only replaced by those of similar classes. (3) Instance-dependent label noise (IDN) (Xia et al., 2020), where different instances have different transition matrices depending on parts of the instances.

Network structure and optimization. For a fair comparison, we implement all methods in PyTorch. All methods are trained on Nvidia GeForce RTX 2080 GPUs. For non-image datasets, a 2-hidden-layer network with batch normalization (Ioffe and Szegedy, 2015) and dropout (0.25) (Srivastava et al., 2014) is employed as the backbone for all baselines. We employ LeNet-5 (LeCun, 1998) for MNIST and ResNet-18 (He et al., 2016) for CIFAR10 (Krizhevsky et al., 2009). To estimate P(Ỹ|Y*), we use SGD to train the classification network with batch size 128, momentum 0.9, and weight decay 10^{-4}. The initial learning rate is 10^{-2}, and it decays by a factor of 0.1 at the 30th and 60th epochs. To obtain P(Ỹ|Y′), the K-means clustering method (Likas et al., 2003) is employed for XYgaussian, YXgaussian, KrKp, Balancescale, Splice, Waveform, and MNIST; for CIFAR10, the SPICE* (Niu et al., 2021) clustering method is employed.

4.1. EXPERIMENTS ON SYNTHETIC DATASETS

To validate the correctness of our method, we generate a causal dataset (from X to Y) and an anticausal dataset (from Y to X). For both datasets, P(X) is a 5-dimensional multivariate Gaussian mixture of N(0, I) and N(1, I). For the causal dataset XYgaussian, the causal association between X and Y is set to be linear, with the parameters of the linear function randomly drawn from N(0, I). For YXgaussian, we let the label be the mean value of the multivariate Gaussian distribution. For both datasets, we balance the positive and negative class priors to 0.5, and the training sample size is 20,000.

4.1.1. ESTIMATION ERROR OF P(Ỹ|Y*)

In Fig. 3, we compare the estimation error of the average flip rate P(Ỹ|Y*) for our CDNL estimator and the state-of-the-art method VolMinNet (Li et al., 2021). To let VolMinNet estimate P(Ỹ|Y*), we first train VolMinNet with a noisy training set and select the best model using the validation set, obtaining the estimated clean class-posterior distribution P(Y|X). The Bayes label Y* can be directly obtained from P(Y|X), and P(Ỹ|Y*) can be estimated with the same counting technique as in Eq. (6). As illustrated in Fig. 3, the estimation error of our method is close to 0 not only on instance-independent label noise but also on instance-dependent label noise, and is much smaller than the estimation error of VolMinNet. This illustrates the advantage of the CDNL estimator, which does not require learning the transition matrix and clean label for each instance, but directly estimates the average level of noise rates.
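For reference, the objective in Eq. (7), evaluated for a fixed one-hot classifier output, can be sketched as follows (training A and h, e.g., via Gumbel-Softmax, is omitted; function and variable names are my own):

```python
import math
import numpy as np

def corrected_loss(noisy_labels, onehot_preds, A):
    """Cross-entropy between noisy labels and A h(x), where h outputs one-hot
    Bayes-label estimates and A[j, i] models P(noisy = j | Bayes = i)."""
    noisy_posteriors = onehot_preds @ A.T  # row n equals A @ h(x_n)
    picked = noisy_posteriors[np.arange(len(noisy_labels)), noisy_labels]
    return -float(np.mean(np.log(picked + 1e-12)))
```

Because h outputs hard one-hot vectors, each row of `noisy_posteriors` is simply one column of A, which is why the objective directly fits the average flip rates rather than a per-instance transition matrix.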

4.1.2. ESTIMATIONS OF CDNL ESTIMATOR AND CLASSIFICATION ACCURACY

In Tab. 1, we illustrate the estimations of the CDNL estimator and the test accuracies of modeling label-noise methods and semi-supervised methods. The estimations of the CDNL estimator are shown in parentheses, and each estimation is averaged over 5 repeated experiments. The matrix Â* = P̂(Ỹ|Y*) estimated by our method is embedded into the modeling label-noise methods. The estimations on the anticausal dataset YXgaussian are 10 times smaller than those on the causal dataset XYgaussian, which illustrates the effectiveness of our estimator. Specifically, when X is a cause of Y (causal), we expect the difference d(·) obtained by our estimator to be large; when Y is a cause of X (anticausal), the difference d(·) should be small. It is worth mentioning that, on both datasets, the estimations of our estimator decrease as the noise rate increases. This is because the labels of these two datasets are binary, and P(Ỹ|Y* = 1) only has one degree of freedom, i.e., P(Ỹ = 1|Y* = 1) = 1 − P(Ỹ = 0|Y* = 1). If the difference between P(Ỹ = 1|Y* = 1) and P(Ỹ = 1|Y′ = 1) is small, the difference between P(Ỹ = 0|Y* = 1) and P(Ỹ = 0|Y′ = 1) is also small, so the estimation is small. The results show that on the causal dataset XYgaussian, modeling label-noise methods perform better than semi-supervised methods. This is because P(X) does not contain information about P(Y|X), so semi-supervised methods may not be helpful. On the anticausal dataset YXgaussian, when the sample size is large (20,000), modeling label-noise methods perform slightly better than semi-supervised methods. When the complexity of an anticausal dataset is high and the sample size is limited, the semi-supervised methods should have better performance than the modeling label-noise methods (see Tab. 3).
We have also conducted an additional experiment on YXgaussian with the sample size reduced to 5,000; SSL-based methods then clearly outperform modeling label-noise methods under the same setting. The results are illustrated in Appendix C.

4.2. EXPERIMENTS ON REAL-WORLD DATASETS

We illustrate the estimations of the CDNL estimator and the test accuracies of different methods on real-world datasets in Tab. 3 and Tab. 2. On these datasets, when the estimation of the CDNL estimator is lower than 0.005, semi-supervised methods demonstrate their effectiveness; otherwise, modeling label-noise methods are more helpful for improving the robustness of learning models. It is also worth mentioning that for Waveform, although it is an anticausal dataset, modeling label-noise methods have better performance than semi-supervised methods, and the estimation of the CDNL estimator is also large. The reason could be that 1) P(X) may not always contain information about P(Y|X) even if the data generation process is from Y to X, or 2) P(X) may contain information about P(Y|X), but the information can be hard to exploit with existing methods.

5. CONCLUSION

In this paper, we have investigated the influence of the noisy data generation process on semi-supervised methods and modeling label-noise methods. We show that semi-supervised methods can easily incorporate heuristics to make use of a finite training sample, but their performance depends on the data generation process, while modeling label-noise methods are independent of the generation process. In many real-world applications, the causal structure of the data generation process is not given; we therefore proposed an intuitive method that exploits an asymmetric property of estimating flip rates under different generation processes.

A PROOF OF THEOREM 1

In this section, we prove Theorem 1 of the main paper.

Proof.

P(Ỹ = j|Y′ = i) = Σ_k P(Ỹ = j|Y′ = i, Y* = k)P(Y* = k|Y′ = i) = Σ_k P(Ỹ = j|Y* = k)P(Y* = k|Y′ = i).

The second equality holds because Y′ is learned in an unsupervised manner that does not employ Ỹ, so Y′ is conditionally independent of Ỹ given Y*. By the above equation,

P(Ỹ = j|Y* = i) − P(Ỹ = j|Y′ = i)
= P(Ỹ = j|Y* = i) − Σ_k P(Ỹ = j|Y* = k)P(Y* = k|Y′ = i)
= (P(Ỹ = j|Y* = i) − P(Ỹ = j|Y* = i)P(Y* = i|Y′ = i)) − Σ_{k≠i} P(Ỹ = j|Y* = k)P(Y* = k|Y′ = i)
= (1 − P(Y* = i|Y′ = i))P(Ỹ = j|Y* = i) − Σ_{k≠i} P(Y* = k|Y′ = i)P(Ỹ = j|Y* = k).

For condition 1), if P(Y* = i|Y′ = i) = 1 for all i, then (1 − P(Y* = i|Y′ = i)) = 0 and P(Y* = k|Y′ = i) = 0 for all k ≠ i, so both terms vanish and the difference is 0. For condition 2), setting the difference P(Ỹ = j|Y* = i) − P(Ỹ = j|Y′ = i) to 0 in the above equation directly yields the stated condition, which completes the proof.

B ESTIMATION ERROR OF P(Ỹ|Y′)

Here, we theoretically show that when X causes Y, the flip rate P(Ỹ|Y′) estimated by an unsupervised classification method usually has a large estimation error, where Y′ denotes the pseudo labels estimated by the unsupervised method. However, when Y causes X, the estimation error is usually small.
The last equality is obtained by using the importance reweighting technique of Liu and Tao (2016), which requires that P(X|Y = j) and P(X|Y′ = j) have the same support. Then we calculate the difference P(Ỹ = i|Y′ = j) − P(Ỹ = i|Y = j) as follows.



Figure 1: Causal graphs of the noisy data generation processes: (a) Y causes X; (b) X causes Y.


Figure 2: (a)-(d) illustrate the influence on P(X) when P(Y) changes under different data generative processes. When Y causes X, as illustrated in (a) and (b), changing P(Y) to P′(Y) influences P(X), so P(X) contains labeling information; when X causes Y, as illustrated in (c) and (d), changing P(Y) to P′(Y) does not influence P(X), so P(X) does not contain labeling information.

As illustrated in Fig. 2(a), when P(Y = 0) = P(Y = 1) = 0.5 and P(Y = 2) = P(Y = 3) = 0, the data is drawn from either P(X|Y = 0) or P(X|Y = 1), so P(X) = 0.5P(X|Y = 0) + 0.5P(X|Y = 1). However, if the class prior is changed to P′(Y = 0) = P′(Y = 1) = 0 and P′(Y = 2) = P′(Y = 3) = 0.5, as illustrated in Fig. 2(b), instead of drawing data belonging to Y = 0 and Y = 1, the data belonging to Y = 2 and Y = 3 is drawn, and the data distribution becomes P′(X) = 0.5P(X|Y = 2) + 0.5P(X|Y = 3). Meanwhile, the change in P(Y) also leads to a change in P(Y|X). The changes in P(X) and P(Y|X) both come from the change in P(Y), indicating that P(X) contains information about P(Y|X). Therefore, the semi-supervised methods can be useful in this case. When the feature X is a cause of Y, an intervention on P(Y) changes the function f′ or the distribution of U_Y but leaves P(X) unchanged. For example, from Fig. 2(c) to Fig. 2(d), the function f′ is changed to output Y = 0 or Y = 1 instead of Y = 2 or Y = 3 to account for the change in the label distribution.



Lemma. Let P(Ỹ|Y) be the transition relationship from the clean label Y to the noisy label Ỹ, and let P(Ỹ|Y′) be the transition relationship from the pseudo label Y′ to the noisy label Ỹ. Then the estimation error is

d(P(Ỹ|Y′), P(Ỹ|Y)) = Σ_{i=1}^L Σ_{j=1}^L |P(Ỹ = i|Y′ = j) − P(Ỹ = i|Y = j)| / L²
= Σ_{i,j} (1/P(Y = j)) |E_{P(X)}[(P(Y′ = j|X) P(Y = j)/P(Y′ = j) − P(Y = j|X)) 1{f̂(X) = i}]| / L².

Proof. Let f̂(x) = arg max_i P(Ỹ = i|X = x) output the noisy label of every instance x. Then

P(Ỹ = i|Y = j) = E_{P(X|Y=j)}[1{f̂(X) = i}] = ∫ 1{f̂(x) = i} P(X = x|Y = j) dx,

P(Ỹ = i|Y′ = j) = E_{P(X|Y′=j)}[1{f̂(X) = i}] = ∫ 1{f̂(x) = i} P(X = x|Y′ = j) dx
= ∫ 1{f̂(x) = i} [P(X = x|Y′ = j)/P(X = x|Y = j)] P(X = x|Y = j) dx
= E_{P(X|Y=j)}[1{f̂(X) = i} P(X|Y′ = j)/P(X|Y = j)].

$$\begin{aligned}
P(\tilde{Y}=i\mid Y'=j)-P(\tilde{Y}=i\mid Y=j)
&=\mathbb{E}_{P(X\mid Y=j)}\left[\mathbf{1}_{\{\hat{f}(X)=i\}}\frac{P(X\mid Y'=j)}{P(X\mid Y=j)}\right]-\mathbb{E}_{P(X\mid Y=j)}\big[\mathbf{1}_{\{\hat{f}(X)=i\}}\big]\\
&=\int \mathbf{1}_{\{\hat{f}(x)=i\}}\frac{P(X=x\mid Y'=j)}{P(X=x\mid Y=j)}P(X=x\mid Y=j)\,dx-\int \mathbf{1}_{\{\hat{f}(x)=i\}}P(X=x\mid Y=j)\,dx\\
&=\int \mathbf{1}_{\{\hat{f}(x)=i\}}\big(P(X=x\mid Y'=j)-P(X=x\mid Y=j)\big)\,dx\\
&=\int \left(\frac{P(Y'=j\mid X=x)P(X=x)}{P(Y'=j)}-\frac{P(Y=j\mid X=x)P(X=x)}{P(Y=j)}\right)\mathbf{1}_{\{\hat{f}(x)=i\}}\,dx\\
&=\int \frac{P(Y'=j\mid X=x)P(Y=j)-P(Y=j\mid X=x)P(Y'=j)}{P(Y'=j)P(Y=j)}P(X=x)\,\mathbf{1}_{\{\hat{f}(x)=i\}}\,dx\\
&=\mathbb{E}_{P(X)}\left[\frac{P(Y'=j\mid X)P(Y=j)-P(Y=j\mid X)P(Y'=j)}{P(Y'=j)P(Y=j)}\mathbf{1}_{\{\hat{f}(X)=i\}}\right]\\
&=\frac{1}{P(Y=j)}\,\mathbb{E}_{P(X)}\left[\frac{P(Y'=j\mid X)P(Y=j)-P(Y=j\mid X)P(Y'=j)}{P(Y'=j)}\mathbf{1}_{\{\hat{f}(X)=i\}}\right]\\
&=\frac{1}{P(Y=j)}\,\mathbb{E}_{P(X)}\left[\left(P(Y'=j\mid X)\frac{P(Y=j)}{P(Y'=j)}-P(Y=j\mid X)\right)\mathbf{1}_{\{\hat{f}(X)=i\}}\right].\qquad(10)
\end{aligned}$$

By using the above equation, the estimation error $d(P(\tilde{Y}\mid Y'),P(\tilde{Y}\mid Y))$ is as follows:

$$d\big(P(\tilde{Y}\mid Y'),P(\tilde{Y}\mid Y)\big)=\sqrt{\sum_{i=1}^{L}\sum_{j=1}^{L}\big|P(\tilde{Y}=i\mid Y'=j)-P(\tilde{Y}=i\mid Y=j)\big|^{2}}=\sqrt{\sum_{i,j}\left(\frac{1}{P(Y=j)}\,\mathbb{E}_{P(X)}\!\left[\left(P(Y'=j\mid X)\frac{P(Y=j)}{P(Y'=j)}-P(Y=j\mid X)\right)\mathbf{1}_{\{\hat{f}(X)=i\}}\right]\right)^{2}},$$

which completes the proof.
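As a sanity check of the identity derived above (Eq. (10)), the following sketch uses an invented discrete toy distribution over four values of X and verifies the closed form against the direct computation of the two transition relationships:

```python
# Invented discrete toy example: X takes 4 values, Y and Y' are binary.
# Verifies Eq. (10) against the direct computation of the transition relationships.
P_X = [0.25, 0.25, 0.25, 0.25]
P_Y_given_X  = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]  # clean posterior P(Y|X)
P_Yp_given_X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # pseudo posterior P(Y'|X)
f = [0, 0, 1, 1]  # f(x) = argmax_i P(Ytilde=i | X=x), the "noisy labeler"

def marginal(post):
    """P(Y=j) = E_{P(X)}[P(Y=j|X)]."""
    return [sum(P_X[x] * post[x][j] for x in range(4)) for j in range(2)]

def transition(post):
    """P(Ytilde=i | Y=j) = E_{P(X|Y=j)}[1{f(X)=i}], via Bayes' rule on the posterior."""
    pj = marginal(post)
    return [[sum(P_X[x] * post[x][j] * (f[x] == i) for x in range(4)) / pj[j]
             for j in range(2)] for i in range(2)]

T_clean, T_pseudo = transition(P_Y_given_X), transition(P_Yp_given_X)
P_Y, P_Yp = marginal(P_Y_given_X), marginal(P_Yp_given_X)

for i in range(2):
    for j in range(2):
        direct = T_pseudo[i][j] - T_clean[i][j]
        via_eq10 = sum(
            P_X[x] * (P_Yp_given_X[x][j] * P_Y[j] / P_Yp[j] - P_Y_given_X[x][j]) * (f[x] == i)
            for x in range(4)
        ) / P_Y[j]
        assert abs(direct - via_eq10) < 1e-12
print("Eq. (10) verified on the toy distribution")
```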

Algorithm: CDNL Estimator
Input: a noisy training sample S_tr; a noisy validation sample S_val; a clustering algorithm z; a classification model h; a trainable stochastic matrix A.
1: Optimize h and A via Eq. (7) on the training sample S_tr and the validation sample S_val to obtain Â* = P̂(Ỹ|Y*);
2: Employ the clustering algorithm z to estimate the cluster IDs of all instances in the training sample S_tr;
3: Obtain Ŷ′ for all instances from their cluster IDs;
4: Calculate P̂(Ỹ|Y′) by Eq. (6).
Output: the estimation d(P̂(Ỹ|Y*), P̂(Ỹ|Y′)) via Eq. (3).

With Ŷ′ in hand, the transition relationship P(Ỹ|Y′) can also be estimated. On a causal dataset (X causes Y), P(X) does not contain labeling information, so Y′ can be very different from the clean label Y; therefore, the estimation error of P(Ỹ|Y′) is large. On an anticausal dataset (Y causes X), P(X) contains labeling information, so Y′ should be "close" to the clean label Y; therefore, the estimation error of P(Ỹ|Y′) is small.
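Steps 3-4 of the estimator can be sketched in a simplified form. Since Eqs. (3), (6) and (7) are defined elsewhere in the paper, this toy version substitutes a majority-vote cluster-to-pseudo-label assignment and a simple frequency count for P̂(Ỹ|Y′); the cluster IDs and noisy labels below are invented inputs:

```python
from collections import Counter, defaultdict

# Invented inputs: cluster IDs from any clustering algorithm z, and observed noisy labels.
cluster_ids  = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
noisy_labels = [0, 0, 1, 1, 1, 1, 0, 2, 2, 1]

# Step 3: map each cluster to a pseudo label Y' by majority vote over its noisy labels
majority = {
    c: Counter(y for c2, y in zip(cluster_ids, noisy_labels) if c2 == c).most_common(1)[0][0]
    for c in set(cluster_ids)
}
pseudo = [majority[c] for c in cluster_ids]

# Step 4 (a counting stand-in for Eq. (6)): estimate P(Ytilde=i | Y'=j) by frequency
counts = defaultdict(Counter)
for yp, yt in zip(pseudo, noisy_labels):
    counts[yp][yt] += 1
T_pseudo = {j: {i: counts[j][i] / sum(counts[j].values()) for i in range(3)}
            for j in counts}
print(T_pseudo)
```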

Estimation error of P(Ỹ|Y*) on synthetic datasets with instance-independent and instance-dependent label noise. Our estimator outperforms the state-of-the-art method by a large margin.

Test accuracies (%) of different methods on causal datasets with different types of label noise. Estimates of the CDNL estimator are shown in parentheses.

Comparison of test accuracies (%) of different methods on anticausal datasets with different levels and types of label noise. Estimates of the CDNL estimator are shown in parentheses.

Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pages 8778-8788, 2018.

Evgenii Zheltonozhskii, Chaim Baskin, Avi Mendelson, Alex M. Bronstein, and Or Litany. Contrast to divide: Self-supervised pre-training for learning with noisy labels. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1657-1667, 2022.

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.

APPENDIX

From Theorem 2, we can see that for the estimation error to be small, the class posterior P(Y′|X) of the pseudo label and the clean class posterior P(Y|X) have to be similar. For example, suppose that P(Y′|X) and P(Y|X) are identical and that P(Y) is also identical to P(Y′); then the estimation error is zero.
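Concretely, under these two assumptions (identical posteriors and identical priors), each summand in the error expression of Theorem 2 vanishes pointwise:

$$P(Y'=j\mid X)\frac{P(Y=j)}{P(Y'=j)}-P(Y=j\mid X)=P(Y=j\mid X)\cdot 1-P(Y=j\mid X)=0,$$

so $d\big(P(\tilde{Y}\mid Y'),P(\tilde{Y}\mid Y)\big)=0$.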

