IMPROVING THE ESTIMATION OF INSTANCE-DEPENDENT TRANSITION MATRIX BY USING SELF-SUPERVISED LEARNING

Anonymous authors
Paper under double-blind review

Abstract

The transition matrix reveals the transition relationship between clean and noisy labels and plays an important role in building statistically consistent classifiers. In real-world applications, the transition matrix is usually unknown and has to be estimated. Estimating it accurately is challenging, especially when it depends on the instance. Given that both instances and noisy labels are available, the major difficulty of learning the transition matrix comes from the absence of clean-label information. Many methods have been proposed to infer clean information, and self-supervised learning has demonstrated great success: on some datasets, it achieves performance comparable to supervised learning without requiring any labels during training, which implies that it can efficiently infer clean labels. Motivated by this, in this paper we propose a practical method that leverages self-supervised learning to obtain nearly clean labels and thereby aid the learning of the instance-dependent transition matrix. Empirically, the proposed method achieves state-of-the-art performance on different datasets.

1. INTRODUCTION

Recently, learning with noisy labels has received increasing attention in the deep learning community (Jiang et al., 2018; Liu, 2021; Yao et al., 2021b; Bai et al., 2021; Ciortan et al., 2021). Manually annotating large-scale datasets is labor-intensive and time-consuming, so cheap but imperfect methods, e.g., crowdsourcing and web crawling, are often used to collect large-scale datasets, which usually contain label errors. Existing work shows that training deep models on such datasets leads to performance degradation, because deep models easily memorize noisy labels (Han et al., 2018; Bai et al., 2021). Improving the robustness of deep models when the training data contain label errors has therefore become an important research topic. To learn a classifier robust to label noise, there are two streams of methods: statistically inconsistent methods and statistically consistent methods. Statistically inconsistent methods mainly focus on designing heuristics to reduce the negative effect of label noise (Nguyen et al., 2019; Li et al., 2019; 2020; Wei et al., 2020; Bai et al., 2021; Yao et al., 2021a). These methods show strong empirical performance but usually require expensive hyper-parameter tuning and do not provide statistical guarantees. To address this limitation, another stream of methods designs classifier-consistent algorithms (Liu & Tao, 2015; Patrini et al., 2017; Xia et al., 2020; Li et al., 2021) by exploiting the noise transition matrix T(x) ∈ R^{C×C}, where T_ij(x) = P(Ỹ = j | Y = i, X = x), X denotes the random variable for instances or features, Ỹ the noisy label, Y the clean label, and C the number of classes. When the transition matrix is given, the optimal classifier defined on the clean domain can be learned from noisy data alone (Liu & Tao, 2015; Xia et al., 2019).
In real-world applications, the instance-dependent transition matrix T(x) is usually unknown and has to be learned, and learning it accurately remains challenging (Li et al., 2019; Yao et al., 2020). The reason is that to accurately learn T(x), the instance X, the noisy label Ỹ, and the clean label Y generally all have to be given; however, for datasets containing label errors, clean labels are usually unavailable. In general, without other assumptions, learning the transition matrix for an instance requires its clean-label information, so existing methods aim to infer some clean-label information with which to learn T(x) (Xia et al., 2019; Yang et al., 2022; Li et al., 2021). We discuss the details in Section 2. Recently, classification models based on self-supervised learning have demonstrated performance comparable to supervised learning on some benchmark datasets (He et al., 2020; Niu et al., 2021), which implies that self-supervised learning has a strong ability to infer clean labels. Motivated by this, in this paper we propose CoNL (Contrastive label-Noise Learning), which leverages self-supervised techniques to learn the instance-dependent transition matrix. CoNL contains two main stages: contrastive co-selecting and constraint T(x) revision, which are as follows: • We propose contrastive co-selecting, which utilizes the visual representations learned by contrastive learning to select confident examples without employing noisy labels; in this way, the learned visual representations are less influenced by label errors. The empirical results for both transition-matrix learning and classification demonstrate strong performance under different types and levels of label noise on three synthetic instance-dependent-noise datasets (Fashion-MNIST, SVHN, CIFAR-10) and one real-world noisy dataset (CIFAR-10N). The rest of this paper is organized as follows. In Sec.
2, we review related work on label-noise learning, especially modeling noisy labels, and contrastive learning. In Sec. 3, we discuss how to leverage contrastive learning to better learn the instance-dependent transition matrix. In Sec. 4, we provide empirical evaluations of the proposed method. In Sec. 5, we conclude the paper.

2. LABEL-NOISE LEARNING AND CONTRASTIVE LEARNING

Problem setting. Let D be the distribution of a noisy example (X, Ỹ) ∈ X × {1, . . . , C}, where X denotes the variable of instances, Ỹ the variable of noisy labels, X the feature space, {1, . . . , C} the label space, and C the number of classes. In learning with noisy labels, clean labels are not available; given a noisy training sample S = {(x_i, ỹ_i)}_{i=1}^N drawn independently from D, the aim is to learn a robust classifier from S.
The noise transition matrix T(x). The transition matrix T(x) has been widely used to model label-noise generation. Its ij-th entry T_ij(x) = P(Ỹ = j | Y = i, X = x) is the probability that the clean label Y = i of instance x flips to the noisy label Ỹ = j. Existing methods can learn statistically consistent classifiers when the transition matrix is given (Liu & Tao, 2015; Goldberger & Ben-Reuven, 2017; Yu et al., 2018; Xia et al., 2019; 2020; Li et al., 2021). The reason is that the clean-class posterior P(Y | X) can be inferred from the transition matrix and the noisy-class posterior P(Ỹ | X) (Patrini et al., 2017), i.e., T(x)[P(Y = 1 | x), . . . , P(Y = C | x)]^⊤ = [P(Ỹ = 1 | x), . . . , P(Ỹ = C | x)]^⊤. In general, transition matrices are not given and need to be estimated. Without other assumptions, learning the transition matrix for an instance requires its clean-label information (Xia et al., 2019; Yang et al., 2022).
Learning the transition matrix T(x). Clean-label information is crucial for learning the transition matrix. To learn transition matrices for all instances, existing methods 1) first learn some of the transition matrices in a training sample by inferring clean-label information, and 2) then, under additional assumptions, use the learned transition matrices to help learn the transition matrices of the other instances.
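As a concrete illustration of the identity linking the clean and noisy class posteriors, the snippet below checks it numerically. The 3-class matrix and posterior are hypothetical; note that with the row-stochastic convention T_ij(x) = P(Ỹ = j | Y = i, x), the noisy posterior is obtained by applying the transpose of T(x) to the clean posterior.

```python
import numpy as np

# Hypothetical 3-class instance-dependent transition matrix T(x):
# T[i, j] = P(noisy = j | clean = i, x), so each row sums to 1.
T = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.0, 0.1, 0.9],
])
p_clean = np.array([0.6, 0.3, 0.1])  # hypothetical clean posterior P(Y | x)

# Noisy posterior: P(noisy = j | x) = sum_i T[i, j] * P(clean = i | x).
p_noisy = T.T @ p_clean

# If T is invertible, the clean posterior is recoverable from the noisy one,
# which is why a known transition matrix yields a consistent classifier.
p_recovered = np.linalg.solve(T.T, p_noisy)
```

This recoverability is exactly what breaks down when T(x) is unknown, motivating the estimation methods discussed next.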
Specifically, to learn some of the transition matrices in a training sample, existing methods try to infer the clean-label information of some instances, whose transition matrices can then be learned. For example, if some instances can be identified that belong to a specific class almost surely (i.e., anchor points), their transition matrices can be learned (Liu & Tao, 2015). If the Bayes optimal labels of some instances can be identified, their Bayes-label transition matrices can be learned (Yang et al., 2022). If clean-class posteriors are far from uniform (i.e., sufficiently scattered), the transition matrix enclosing P(Ỹ | X) with minimum volume is unique and can be learned (Li et al., 2021). Once some of the transition matrices are learned, different assumptions have been proposed to utilize them to learn the transition matrices of other instances: the manifold assumption, where instances close in manifold distance have similar transition matrices (Cheng et al., 2022); the class-dependent assumption, where instances with the same clean label have the same transition matrix (Liu & Tao, 2015; Patrini et al., 2017; Li et al., 2021); and the part-dependent assumption (Xia et al., 2020), where instances with similar parts have similar transition matrices.
Contrastive learning. Contrastive learning (Sermanet et al., 2018; Dwibedi et al., 2019; Chen et al., 2020a; He et al., 2020), which can learn semantically meaningful features without human annotations (Hadsell et al., 2006; Wu et al., 2018), is an important branch of unsupervised representation learning built on the contrastive loss (Hadsell et al., 2006). Existing work shows that semantically meaningful features are an important characteristic of the human visual system.
Humans usually use their existing knowledge of visual categories to learn about new categories of objects, where visual categories are often encoded as high-level semantic attributes (Rosch et al., 1976; Su & Jurie, 2012). Contrastive learning, which helps in learning semantically meaningful features, is therefore very useful for inferring clean labels. Empirically, contrastive learning outperforms other unsupervised-learning techniques on the classification task across different datasets (Chen et al., 2020b; Niu et al., 2021). In this paper, we adopt an unsupervised instance-discrimination-based representation-learning approach, MoCo (He et al., 2020). The basic idea of contrastive learning is that a query representation should be similar to its matching key representation and dissimilar to other key representations, i.e., contrastive learning can be formulated as a dictionary look-up problem. Given an image x, let x_q and x_k be the images obtained with different augmentations. The query representation generated by the backbone f_θ is f_θ(x_q); the corresponding key representation generated by another backbone f_{θ_mo} is f_{θ_mo}(x_k). The key representations are stored in a queue. To learn the representation, at each iteration MoCo optimizes θ according to the following loss (He et al., 2020):

L_mo = -(1/N) Σ_{x_q} log [ exp(f_θ(x_q) · f_{θ_mo}(x_k)/τ) / ( exp(f_θ(x_q) · f_{θ_mo}(x_k)/τ) + Σ_{x'_k} exp(f_θ(x_q) · f_{θ_mo}(x'_k)/τ) ) ],   (1)

where x'_k is an image different from x, f_{θ_mo}(x'_k) is its key representation, and τ is the temperature. Then θ_mo is updated according to θ, i.e., θ_mo ← µθ_mo + (1 − µ)θ, where µ ∈ [0, 1) is the momentum hyper-parameter.
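A minimal PyTorch sketch of the MoCo loss and momentum update described above; the batch size, feature dimension, and queue size are illustrative, and the features are assumed to be L2-normalized.

```python
import torch
import torch.nn.functional as F

def moco_loss(q, k, queue, tau=0.2):
    """InfoNCE loss as in MoCo: q are query features, k the matching key
    features (both L2-normalized, shape [N, D]); queue holds K negative
    key features of shape [K, D]."""
    l_pos = (q * k).sum(dim=1, keepdim=True) / tau  # [N, 1] positive logits
    l_neg = q @ queue.t() / tau                     # [N, K] negative logits
    logits = torch.cat([l_pos, l_neg], dim=1)       # positive is at index 0
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(theta_mo, theta, mu=0.999):
    """theta_mo <- mu * theta_mo + (1 - mu) * theta, parameter-wise."""
    for p_mo, p in zip(theta_mo, theta):
        p_mo.mul_(mu).add_(p, alpha=1 - mu)
```

Formulating the loss as a (K+1)-way classification with the positive at index 0 is the standard trick for implementing the log-ratio in Eq. (1) with a single cross-entropy call.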

3. CONTRASTIVE LABEL-NOISE LEARNING

We are motivated by the success of contrastive learning on the classification task: previous work shows that some contrastive-learning-based methods can achieve performance comparable to supervised methods on some datasets (He et al., 2020; Chen et al., 2020b). In this section, we introduce Contrastive label-Noise Learning (CoNL), which aims to effectively leverage this advantage of contrastive learning. An overview of the method is shown in Fig. 1 and described in Algo. 1.

3.1. LEVERAGING CONTRASTIVE LEARNING FOR CO-SELECTING

We aim to accurately select confident examples by leveraging contrastive learning. To achieve this, we use the contrastive method MoCo to learn visual representations of the training instances. The learned representations obtained by applying strong and weak data augmentations are then used to learn two classifiers and two transition matrices, respectively. To select confident examples, we exploit both the estimated clean-class posterior and the consistency of an instance's predicted clean label with those of its neighbors. The details of contrastive co-selecting are as follows.
First, to produce the visual representations, a backbone neural network f_θ : X → Z is trained using only the training instances via Eq. (1). With f_θ, we obtain the representations Z^s = {z_i^s}_{i=1}^N and Z^w = {z_i^w}_{i=1}^N based on the strong and weak data augmentations Φ^s and Φ^w, respectively, where z_i^s = f_θ(Φ^s(x_i)) and z_i^w = f_θ(Φ^w(x_i)). Let g_φ1 and g_φ2 be two classifier heads modeling P_φ1(Y | X) and P_φ2(Y | X) with learnable parameters φ1 and φ2, and let T_ζ1 and T_ζ2 be two transition matrices modeled by neural networks with learnable parameters ζ1 and ζ2. To learn the transition matrices from the visual representations, we train the two classifier heads and the two transition matrices simultaneously on Z^s and Z^w by minimizing the cross-entropy loss:

{φ̂1, φ̂2, ζ̂1, ζ̂2} = argmin_{φ1,φ2,ζ1,ζ2} −(1/N) Σ_{i=1}^N [ ỹ_i log(g_φ1(z_i^s) T_ζ1(z_i^s)) + ỹ_i log(g_φ2(z_i^w) T_ζ2(z_i^w)) ].   (2)

During this training, the parameters of the backbone f_θ are fixed, which has several advantages. 1) By employing representations that are independent of label errors, the classifiers can be better learned.
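A sketch of one classifier-head/transition-matrix pair trained on frozen backbone features, following the objective above; the MLP shapes, class count C, and feature dimension D are illustrative assumptions.

```python
import torch
import torch.nn as nn

C, D = 10, 128  # number of classes, feature dimension (illustrative)

# Classifier head g_phi and transition-matrix network T_zeta: both small
# MLPs on top of the frozen backbone features, as described in the text.
g_phi = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, C), nn.Softmax(dim=1))
T_zeta = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, C * C))

def noisy_posterior(z):
    """Push the estimated clean posterior g_phi(z) through the
    instance-dependent transition matrix T_zeta(z); z are backbone features."""
    p_clean = g_phi(z)                            # [N, C]
    T = T_zeta(z).view(-1, C, C).softmax(dim=2)   # each row sums to 1
    return torch.bmm(p_clean.unsqueeze(1), T).squeeze(1)  # [N, C]

def loss_on_noisy_labels(z, y_tilde):
    """Cross-entropy between noisy labels and the modeled noisy posterior."""
    return nn.functional.nll_loss(noisy_posterior(z).clamp_min(1e-8).log(), y_tilde)
```

Minimizing this loss jointly updates the head and the transition matrix while the backbone stays fixed, mirroring Eq. (2) for one augmentation branch.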
Intuitively, previous work shows that representations learned by self-supervised learning contain semantic information that usually correlates with clean labels (Wu et al., 2018; Niu et al., 2021), implying that the representations carry some clean-label information. In the training process, the visual representations are used as inputs to the classifier head, so the learned classifier head also captures information about these representations and the clean labels. 2) By keeping f_θ fixed, the learning difficulty of the transition matrix is reduced: only two simple models, the classifier head and the transition matrix, need to be learned. Since the visual representations contain some clean-label information, combining them with the noisy labels allows the transition matrix to be estimated effectively. Moreover, we train the two classifier heads g_φ1 and g_φ2 with different data augmentations, which encourages them to be diverse and to have different learning abilities.

Algorithm 1 (sketch): train the classifier heads and transition matrices via Eq. (2) to learn the parameters φ̂1, φ̂2, ζ̂1, ζ̂2; obtain the confident sample S_l = S_l^w ∪ S_l^s, where S_l^s and S_l^w are generated according to Eq. (4) and Eq. (5); select the best transition matrix T_ζ̂ and the corresponding classifier head by employing the validation sample S_v; obtain the revised parameters θ', φ', and ζ' on the training sample S and the confident sample S_l by employing Eq. (7). Output: f_θ', g_φ', T_ζ'.

We illustrate our confident-sample-selection method on g_φ̂1, which is trained on the representations with strong augmentations; the same selection method is employed for g_φ̂2. Specifically, the trained classifier head g_φ̂1 is employed to relabel all instances, producing a set of examples S^s = {(x_i, ỹ_i, ŷ_i^s)}_{i=1}^N. To select confident examples, we combine two criteria. The basic idea is that an instance is reliable if 1)
the confidence of the predicted clean label is high, and 2) its predicted clean label is consistent with the predicted clean labels of its neighbors. To determine whether an example (x_i, ỹ_i, ŷ_i^s) is a confident example, the confidence of the predicted clean label and the predicted-clean-label consistency of x_i have to be calculated. The confidence can be obtained directly as g_φ̂1(x_i)_{ŷ_i^s}, the ŷ_i^s-th coordinate of the output g_φ̂1(x_i). The predicted-clean-label consistency r_i^s of x_i is calculated as

r_i^s = (1/K^s) Σ_{y ∈ N^s(x_i)} 1(y = ŷ_i^s),

where N^s(x_i) contains the predicted clean labels of the neighbors of x_i under strong augmentation. Neighbors are obtained according to the cosine similarity of the features extracted by the backbone: the K^s nearest neighbors with the highest cosine similarity are selected as the neighbors of x_i. Combining the two criteria, the example (x_i, ỹ_i, ŷ_i^s) is considered a confident example if both g_φ̂1(x_i)_{ŷ_i^s} > λ and r_i^s > τ hold, where λ and τ are hyper-parameters. Finally, the confident sample selected by the trained classifier head g_φ̂1 is

S_l^s = {(x_i, ỹ_i, ŷ_i^s) | r_i^s > τ, g_φ̂1(x_i)_{ŷ_i^s} > λ, i = 1, 2, . . . , N}.   (4)

By applying the weak data augmentation to the training instances and using the same selection method for g_φ̂2, the confident sample S_l^w can also be obtained, i.e.,

S_l^w = {(x_i, ỹ_i, ŷ_i^w) | r_i^w > τ, g_φ̂2(x_i)_{ŷ_i^w} > λ, i = 1, 2, . . . , N}.   (5)

Then, to utilize the different confident examples obtained with different data augmentations, we take the union of the two confident samples, i.e., S_l = S_l^w ∪ S_l^s, which will be used for constraint T(x) revision.
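The two selection criteria can be sketched as follows; the function name, the choice of K, and the thresholds `lam` and `tau_r` are illustrative, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def select_confident(features, pred_probs, K=10, lam=0.9, tau_r=0.8):
    """Select confident examples by combining the two criteria in the text:
    (1) confidence of the predicted clean label above lam, and
    (2) agreement ratio of the K nearest neighbors (cosine similarity on
    backbone features) above tau_r. Returns a boolean mask over instances."""
    conf, pred = pred_probs.max(dim=1)            # confidence and predicted label
    f = F.normalize(features, dim=1)
    sim = f @ f.t()                               # pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                      # exclude the instance itself
    nbr = sim.topk(K, dim=1).indices              # [N, K] nearest neighbors
    r = (pred[nbr] == pred.unsqueeze(1)).float().mean(dim=1)
    return (conf > lam) & (r > tau_r)
```

The neighbor-agreement term filters out instances whose confident-looking predictions disagree with their local neighborhood, which is where overconfident mistakes tend to hide.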

3.2. CONSTRAINT T(x) REVISION

We improve T-revision (Xia et al., 2019) by utilizing the confident sample S_l to refine the instance-dependent transition matrix. The philosophy of constraint T(x) revision is that a favorable transition matrix should make the classification losses on both clean labels and noisy labels small. Given the confident sample S_l, which contains both noisy labels and predicted clean labels, we can regularize and refine the transition matrix by minimizing the classification losses on both the noisy labels and the predicted clean labels of the selected confident examples. Fine-tuning the transition matrix can in turn help learn a better representation and classifier head, so in this stage we also fine-tune the representation and the classifier head. Specifically, by comparing the validation accuracy of g_φ̂1(·)T_ζ̂1(·) and g_φ̂2(·)T_ζ̂2(·), we select the best transition matrix and classifier, denoted T_ζ̂ and g_φ̂, respectively. Let M be the number of confident examples, and recall that Φ^s and Φ^w are the strong and weak data augmentations. To minimize the classification loss on the predicted clean labels, the confident sample S_l is employed:

L_Ŷ(S_l, g_φ̂, f_θ) = −(1/M) Σ_{(x,ỹ,ŷ)∈S_l} [ ŷ log(g_φ̂(f_θ(Φ^s(x)))) + ŷ log(g_φ̂(f_θ(Φ^w(x)))) ].

To minimize the classification loss on the noisy labels, the noisy training sample S is employed:

L_Ỹ(S, g_φ̂, f_θ, T_ζ̂) = −(1/N) Σ_{(x,ỹ)∈S} [ ỹ log(g_φ̂(f_θ(Φ^s(x))) T_ζ̂(f_θ(Φ^s(x)))) + ỹ log(g_φ̂(f_θ(Φ^w(x))) T_ζ̂(f_θ(Φ^w(x)))) ].

Combining the two losses, we fine-tune the transition matrix T_ζ̂, the classifier head g_φ̂, and the backbone f_θ with the objective

{θ', φ', ζ'} = argmin_{θ,φ,ζ} L_Ŷ(S_l, g_φ̂, f_θ) + L_Ỹ(S, g_φ̂, f_θ, T_ζ̂).   (7)
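The combined revision objective can be sketched as below. The `model` interface (`clean_posterior`, `transition`) is a hypothetical stand-in for g_φ̂∘f_θ and T_ζ̂∘f_θ, and augmentation branches are collapsed into one for brevity.

```python
import torch
import torch.nn.functional as F

def revision_loss(model, z_conf, y_hat, z_all, y_tilde):
    """Constraint T(x) revision objective (sketch): cross-entropy on the
    predicted clean labels of the confident sample plus cross-entropy on
    the noisy labels of the whole sample, pushed through T(x).
    `model` is assumed to expose clean_posterior(z) -> [N, C] and
    transition(z) -> [N, C, C] with row-stochastic matrices."""
    # L_hatY: fit the predicted clean labels on the confident sample
    loss_clean = F.nll_loss(
        model.clean_posterior(z_conf).clamp_min(1e-8).log(), y_hat)
    # L_tildeY: fit the noisy labels on the full sample through T(x)
    p = model.clean_posterior(z_all)              # [N, C]
    T = model.transition(z_all)                   # [N, C, C]
    p_noisy = torch.bmm(p.unsqueeze(1), T).squeeze(1)
    loss_noisy = F.nll_loss(p_noisy.clamp_min(1e-8).log(), y_tilde)
    return loss_clean + loss_noisy
```

Both terms are differentiable in the backbone, head, and transition-matrix parameters, so one optimizer step on this sum fine-tunes all three jointly.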
In Section 4, we show that by employing constraint T (X) revision, both the estimation of the transition matrix and the classification accuracy can be dramatically improved.

4. EXPERIMENTS

In this section, we present the empirical results of the proposed method and state-of-the-art methods on synthetic and real-world noisy datasets. We also conduct ablation studies to demonstrate the effectiveness of contrastive co-selecting and constraint T (x) revision.

4.1. EXPERIMENT SETUP

Datasets. We empirically verify the performance of our method on three synthetic datasets, i.e., Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky et al., 2009), and one real-world dataset, i.e., CIFAR-10N (Wei et al., 2022). Fashion-MNIST contains 70,000 28x28 grayscale images in 10 classes: 60,000 for training and 10,000 for testing. SVHN contains 73,257 training images and 26,032 testing images. CIFAR-10 contains 50,000 training images and 10,000 testing images. Both SVHN and CIFAR-10 have 10 classes of 32x32 images. These three datasets contain clean labels; we corrupted the training data manually according to the instance-dependent label-noise generation method proposed in (Xia et al., 2020). All experiments are repeated five times. CIFAR-10N is a real-world label-noise version of CIFAR-10 containing human-annotated noisy labels with five different types of noise (Worst, Aggregate, Random 1, Random 2, and Random 3). For all datasets, we leave out 10% of the training examples as a noisy validation set. Baselines. The baselines used in our experiments for comparison are: 1)
CE, training the model with the standard cross-entropy loss on noisy data directly; 2) GCE (Zhang & Sabuncu, 2018), which combines the cross-entropy loss and the mean absolute error to train the model on noisy data; 3) MentorNet (Jiang et al., 2018), which pretrains a model to select reliable samples for the main model; 4) Co-teaching (Han et al., 2018), which trains two models simultaneously to select reliable samples for each other; 5) Reweight (Liu & Tao, 2015), which exploits importance reweighting to estimate an unbiased risk defined on clean data using noisy data; 6) Forward (Patrini et al., 2017), which uses a class-dependent transition matrix to correct the loss function; 7) PTD (Xia et al., 2020), which estimates the instance-dependent transition matrix through part-dependent transition matrices; 8) CausalNL (Yao et al., 2021b), which exploits a causal mechanism to excavate clean-label information from noisy data; 9) MEIDTM (Cheng et al., 2022), which uses Lipschitz continuity to constrain the transition matrix; 10) BLTM (Yang et al., 2022), which uses Bayes optimal labels to learn the instance-dependent transition matrix; 11) NPC (Bae et al., 2022), which proposes a post-processing scheme to calibrate the predictions of a noise-robust classifier. Implementation. We implement our algorithm in PyTorch and conduct all experiments on an RTX 3090. We use ResNet-18 as the backbone for Fashion-MNIST and ResNet-34 for SVHN and CIFAR-10. For the classifier head and the transition-matrix generator, we use a two-layer MLP with ReLU activations. The final layer of the transition-matrix generator is initialized so that the diagonal entries of the generated transition matrix are largest. To learn the backbone, we follow the settings of MoCo (He et al., 2020): temperature τ = 0.2, momentum µ = 0.999, queue size 12,800, and 1,000 epochs in total.
When training the classifier heads and transition matrices, we use SGD with momentum 0.9, weight decay 10^-4, batch size 128, and an initial learning rate of 10^-2. The learning rate is divided by 10 at the 5th and 7th epochs, with 10 epochs in total. When revising the transition matrix, we use Adam with an initial learning rate of 10^-4; the learning rate is divided by 10 at the 5th and 7th epochs, with 10 epochs in total. After that, we optimize the backbone and classifier head using SGD with momentum 0.9, weight decay 10^-4, batch size 128, and an initial learning rate of 10^-2, while optimizing the transition-matrix generator using Adam with an initial learning rate of 10^-4; the learning rate is divided by 10 at the 10th and 20th epochs, with 30 epochs in total.
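The head-training schedule above corresponds to a standard milestone scheduler; a minimal sketch, with `params` standing in for the classifier-head and T(x) parameters.

```python
import torch

# Sketch of the optimizer and schedule for the head/transition-matrix stage:
# SGD (momentum 0.9, weight decay 1e-4, lr 1e-2), lr divided by 10 at the
# 5th and 7th epochs, 10 epochs in total. `params` is a placeholder.
params = [torch.nn.Parameter(torch.zeros(3))]
opt = torch.optim.SGD(params, lr=1e-2, momentum=0.9, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[5, 7], gamma=0.1)

lrs = []
for epoch in range(10):
    # ... one training epoch over the noisy sample would go here ...
    opt.step()
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
# the learning rate decays 1e-2 -> 1e-3 -> 1e-4 at the two milestones
```

The Adam-based revision stage follows the same pattern with `torch.optim.Adam` and an initial learning rate of 1e-4.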

4.2. CLASSIFICATION ACCURACY ON DIFFERENT DATASETS

We conduct experiments on Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-10N. The noise rate for Fashion-MNIST, SVHN, and CIFAR-10 ranges from 0.1 to 0.5. CoNL-NR denotes the results of our algorithm without constraint T(x) revision; CoNL denotes the results after constraint T(x) revision. As shown in Tab. 1, Tab. 2, and Tab. 3, the proposed method outperforms the other methods by a large margin when the noise rate is large. We also conduct experiments on the real-world label-noise dataset CIFAR-10N; the results for all five types of noise (Worst, Aggregate, Random 1, Random 2, and Random 3) are shown in Tab. 4. The results show that our method also works well on real-world label noise.

4.3. ABLATION STUDIES

We perform ablation studies on Fashion-MNIST, SVHN, and CIFAR-10, covering the performance of contrastive co-selecting, constraint T(x) revision, and the influence of contrastive learning on accuracy. Due to limited space, the results on Fashion-MNIST are deferred to the appendix.

4.3.1. CONFIDENT EXAMPLE SELECTION

We illustrate the performance of contrastive co-selecting in Tab. 5, Tab. 6, and Tab. 11. The results demonstrate that contrastive co-selecting can accurately select confident examples. Specifically, under 50% instance-dependent noise, it selects at least 35.56% of the examples with a clean ratio of at least 97.88%. Moreover, the noise rate of the selected example set S_l is close to the true noise rate. This greatly benefits the revision of the transition matrix T(x).

4.3.2. THE ESTIMATION ERROR OF THE TRANSITION MATRIX

We report the estimation error of the transition matrix before and after constraint T(x) revision. To calculate the estimation error, we measure the difference between the ground-truth and estimated transition matrices using the l1 norm. For each instance, we only analyze the estimation error of a specific row, since the noise is generated by one row of T(x). The results are shown in Tab. 7, Tab. 8, and Tab. 13 and demonstrate that constraint T(x) revision effectively reduces the estimation error of the transition matrix. We also conduct experiments with and without contrastive learning on different datasets. For CoNL (w/o MoCo), the backbone f_θ is not trained with MoCo, and its parameters are not fixed during the co-selecting stage; the other settings are the same as CoNL. The results, shown in Tab. 9, Tab. 10, and Tab. 12, clearly show that the contrastive-learning technique dramatically improves the robustness of the learned model and is powerful for inferring clean-label information.
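The per-instance error metric described above amounts to an l1 distance between two matrix rows; the helper name and the toy 2x2 matrices below are illustrative.

```python
import numpy as np

def row_estimation_error(T_true, T_est, clean_label):
    """l1 distance between the ground-truth and estimated transition-matrix
    rows indexed by the instance's clean label, i.e. the row that actually
    generated the noisy label for this instance."""
    return np.abs(T_true[clean_label] - T_est[clean_label]).sum()
```

Averaging this quantity over all instances gives the table entries reported for each noise rate.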

5. CONCLUSION

Since both instances and noisy labels are available, the main difficulty in learning instance-dependent transition matrices is the lack of clean-label information. Motivated by the great success of self-supervised learning in inferring clean labels, in this paper we propose CoNL (Contrastive label-Noise Learning), which effectively utilizes self-supervised learning to learn instance-dependent transition matrices. Empirically, the proposed method achieves state-of-the-art performance on different datasets.

A ABLATION STUDIES ON FASHION-MNIST

We present the results of the ablation studies on Fashion-MNIST, including the performance of contrastive co-selecting, constraint T(x) revision, and the influence of contrastive learning on accuracy, in Tab. 11, Tab. 13, and Tab. 12, respectively.

B EXPERIMENTS ON CIFAR-100 AND WEBVISION

To verify whether the proposed method still works well when the number of classes increases, we conduct experiments on the CIFAR-100 and WebVision datasets. Specifically, for CIFAR-100 we keep the same experimental setting as for CIFAR-10. For WebVision, we train a standard ResNet-18 and an Inception-ResNet-v2 (Szegedy et al., 2017) on WebVision for 1,000 epochs using MoCo v2 to obtain the pretrained models. We follow the previous work (Chen et al., 2019)



Figure 1: The workflow of our method CoNL.

to select the first 50 classes of the Google image subset as the training set and leave out 10% of the training examples as a noisy validation set. We then train the model with the proposed CoNL; the other experimental settings are the same as in the CIFAR-10 experiment. We test the model on the human-annotated WebVision validation set. The test accuracy of CoNL with the ResNet-18 backbone is 64.88%, and with the Inception-ResNet-v2 backbone 70.80%. The results for CIFAR-100 are shown in Tab. 14.

To select confident examples defined on the clean domain, we learn two classifiers estimating P(Y | X) and two transition matrices simultaneously by employing the noisy labels and the learned representations. We also encourage the two classifiers to have different learning abilities by training them on the representations obtained from strong and weak data augmentations, respectively, so that they learn different types of confident examples and are robust to different noise rates; combining the two classifiers yields more confident examples. • We propose constraint T(x) revision, which refines the learned transition matrix by employing the selected confident examples, based on the philosophy that a favorable transition matrix should make the classification risks on both clean data and noisy data small.

Means and standard deviations (percentage) of classification accuracy on Fashion-MNIST.

Means and standard deviations (percentage) of classification accuracy on SVHN.

Method      IDN-10%        IDN-20%        IDN-30%        IDN-40%        IDN-50%
MentorNet   95.54 ± 0.12   94.76 ± 0.16   92.39 ± 0.18   90.41 ± 0.49   61.23 ± 2.82
CoTeaching  94.66 ± 0.36   93.93 ± 0.31   92.06 ± 0.31   91.93 ± 0.81   67.62 ± 1.99
Reweight    95.91 ± 0.44   94.23 ± 2.53   91.06 ± 4.09   87.92 ± 6.46   85.30 ± 0.10
Forward     96.12 ± 0.11   95.84 ± 0.07   94.07 ± 2.14   87.38 ± 3.85   82.02 ± 4.81
PTD         72.90 ± 1.31   75.68 ± 9.43   75.01 ± 1.98   31.59 ± 5.58   30.58 ± 2.32
CausalNL    94.20 ± 0.09   94.06 ± 0.23   93.86 ± 0.37   93.82 ± 0.45   85.41 ± 2.95
BLTM        93.88 ± 0.55   92.66 ± 1.53   92.18 ± 0.61   84.33 ± 5.44   76.19 ± 5.17
MEIDTM      94.76 ± 0.10   93.56 ± 0.22   91.11 ± 0.42   86.11 ± 0.34   72.66 ± 2.50
NPC         94.36 ± 0.06   93.41 ± 0.07   90.31 ± 0.59   84.81 ± 1.80   70.15 ± 0.76

Means and standard deviations (percentage) of classification accuracy on CIFAR-10.

Means and standard deviations (percentage) of classification accuracy on CIFAR-10N.

Method      Worst          Aggregate      Random 1       Random 2       Random 3
MentorNet   77.91 ± 0.38   75.56 ± 0.25   77.12 ± 0.05   76.03 ± 0.81   76.57 ± 0.18
CoTeaching  81.86 ± 0.40   82.45 ± 0.08   82.90 ± 0.46   82.95 ± 0.26   82.66 ± 0.12

Performance of contrastive co-selecting on SVHN.

                 IDN-10%        IDN-20%        IDN-30%        IDN-40%        IDN-50%
Selected (%)     .18 ± 0.49     37.71 ± 0.16   37.25 ± 0.09   36.85 ± 0.19   35.56 ± 0.57
Noise rate       13.31 ± 0.12   21.01 ± 0.07   30.63 ± 0.08   40.29 ± 0.14   49.86 ± 0.30
Clean ratio      99.40 ± 0.02   99.38 ± 0.02   99.40 ± 0.03   99.33 ± 0.13   97.88 ± 2.82

Performance of contrastive co-selecting on CIFAR-10.

Transition matrix estimation error on SVHN.

Transition matrix estimation error on CIFAR-10.

Test accuracy with and without MoCo on SVHN.

Test accuracy with and without MoCo on CIFAR-10.

Performance of contrastive co-selecting on Fashion-MNIST.

              IDN-10%        IDN-20%        IDN-30%        IDN-40%        IDN-50%
Noise rate    ± 0.20         20.18 ± 0.13   30.12 ± 0.21   39.47 ± 0.26   49.21 ± 0.42
Clean ratio   99.76 ± 0.11   99.81 ± 0.06   99.88 ± 0.03   99.87 ± 0.06   99.74 ± 0.32

Test accuracy with and without MoCo on Fashion-MNIST.

Method           IDN-10%        IDN-20%        IDN-30%        IDN-40%        IDN-50%
CoNL (w/o MoCo)  91.79 ± 0.15   90.76 ± 0.14   87.46 ± 2.32   84.43 ± 1.43   75.16 ± 5.13
CoNL-NR          92.14 ± 0.07   91.86 ± 0.26   91.40 ± 0.15   90.09 ± 0.41   85.84 ± 0.71
CoNL             94.98 ± 0.14   94.20 ± 0.23   92.92 ± 1.05   91.01 ± 1.83   86.01 ± 1.83

Transition matrix estimation error on Fashion-MNIST.

Method           IDN-10%         IDN-20%         IDN-30%         IDN-40%         IDN-50%
CoNL (w/o MoCo)  0.247 ± 0.005   0.360 ± 0.012   0.450 ± 0.011   0.582 ± 0.023   0.674 ± 0.020
CoNL-NR          0.236 ± 0.003   0.327 ± 0.021   0.409 ± 0.010   0.528 ± 0.013   0.674 ± 0.018
CoNL             0.238 ± 0.006   0.324 ± 0.005   0.391 ± 0.008   0.467 ± 0.013   0.540 ± 0.003

Test accuracy of CoNL on CIFAR-100.

Method    IDN-10%        IDN-20%        IDN-30%        IDN-40%        IDN-50%
CoNL-NR   40.78 ± 1.07   39.94 ± 1.51   38.30 ± 1.77   36.25 ± 1.69   32.25 ± 0.85
CoNL      74.13 ± 0.34   72.15 ± 0.52   69.96 ± 0.71   65.40 ± 2.76   59.09 ± 1.78

C DIFFERENCES BETWEEN CONTRASTIVE CO-SELECTING AND PREVIOUS WORK

The previous work Co-teaching (Han et al., 2018) learns two classifiers that select confident examples for each other and filters errors from the biased selection in the first mini-batch. In our work, the two classifiers are used only in the co-selecting stage and do not provide supervised signals for each other; we aim to design a method that selects as many examples as possible. The previous work AugDesc (Nishi et al., 2021) explores how to combine data-augmentation techniques to improve the generalization and robustness of models without negatively impacting the memorization effect, i.e., weak augmentations for pseudo-label generation and strong augmentations for the back-propagation step that updates the model's parameters. In this paper, different data-augmentation strategies are used to enable contrastive co-selecting to select more reliable examples under different noise rates.

