CLASS2SIMI: A NEW PERSPECTIVE ON LEARNING WITH LABEL NOISE

Abstract

Label noise is ubiquitous in the era of big data. Deep learning algorithms can easily fit the noise and thus cannot generalize well without properly modeling it. In this paper, we propose a new perspective on dealing with label noise called "Class2Simi". Specifically, we transform the training examples with noisy class labels into pairs of examples with noisy similarity labels, and propose a deep learning framework to learn robust classifiers with the noisy similarity labels. Note that a class label shows the class that an instance belongs to, while a similarity label indicates whether or not two instances belong to the same class. It is worthwhile to perform the transformation: we prove that the noise rate for the noisy similarity labels is lower than that of the noisy class labels, because similarity labels themselves are robust to noise. For example, given two instances, even if both of their class labels are incorrect, their similarity label could be correct. Due to the lower noise rate, Class2Simi achieves remarkably better classification accuracy than baselines that directly deal with the noisy class labels.

1. INTRODUCTION

It is expensive to label large-scale data accurately. Therefore, cheap datasets with label noise are ubiquitous in the era of big data. However, label noise will degenerate the performance of trained deep models, because deep networks easily overfit label noise (Zhang et al., 2017; Zhong et al., 2019; Li et al., 2019; Yi & Wu, 2019; Zhang et al., 2019; 2018; Xia et al., 2019; 2020). In this paper, we propose a new perspective on handling label noise called "Class2Simi", i.e., transforming training examples with noisy class labels into pairs of examples with noisy similarity labels. A class label shows the class that an instance belongs to, while a similarity label indicates whether or not two instances belong to the same class. This transformation is motivated by the observation that the noise rate becomes lower, e.g., even if two instances have incorrect class labels, their similarity label could be correct. In the label-noise learning community, a lower noise rate usually results in higher classification performance (Han et al., 2018b; Patrini et al., 2017). Specifically, we illustrate the transformation and the robustness of similarity labels in Figure 1. Assume we have eight noisy examples {(x_1, ȳ_1), . . . , (x_8, ȳ_8)}, as shown in the upper part of the middle column. Their labels are of four classes, i.e., {1, 2, 3, 4}. The labels marked in red are incorrect. We transform the 8 examples into 8 × 8 example-pairs with noisy similarity labels, as shown in the bottom part of the middle column, where similarity label 1 means the two instances have the same class label and 0 means they have different class labels. We present the latent clean class labels and similarity labels in the left column. In the middle column, we can see that although the instances x_2 and x_4 both have incorrect class labels, the similarity label of the example-pair (x_2, x_4) is correct.
Similarity labels are robust because they further consider the information on the pairwise relationship. We prove that the noise rate of the noisy similarity labels is lower than that of the noisy class labels. For example, if we assume that the noisy class labels in Figure 1 are generated according to the latent clean labels and the transition matrix shown in the upper part of the right column (the ij-th entry of the matrix denotes the probability that the clean class label i flips into the noisy class label j), then the noise rate for the noisy class labels is 0.5, while the rate for the corresponding noisy similarity labels is 0.25. Note that the noise rate is the ratio of the number of incorrect labels to the number of total examples, which can be calculated from the noise transition matrix combined with the proportion of each class, i.e., 1/6 × 3/4 + 1/2 × 1/4 = 0.25. The noise transition matrix for similarity labels can itself be calculated from the class noise transition matrix, as shown in Theorem 1. It is obvious that Class2Simi suffers information loss because we cannot recover the class labels from similarity labels. However, since the similarity labels are more robust to noise than the class labels, the advantage of the reduced noise rate outweighs the disadvantage of the lost information. Intuitively, in the learning process, it is the signal in the information that enhances the performance of the model, while the noise in the information is harmful to the model. Through Class2Simi, although the total amount of information is reduced, the signal-to-noise ratio is increased, and so is the amount of useful signal the model can exploit.
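The noise-rate arithmetic above can be reproduced by enumerating pairs under the class noise transition matrix. The exact matrix in Figure 1 is not reproduced in the text, so the sketch below assumes a 4-class pair-flip matrix with flip rate 0.5, which is consistent with the quoted rates (similar-pair noise 1/2, dissimilar-pair noise 1/6, overall 0.25):

```python
import numpy as np

# Hypothetical 4-class pair-flip transition matrix with flip rate 0.5
# (an assumption consistent with the rates quoted in the text, not
# necessarily the exact matrix shown in Figure 1).
c = 4
Tc = np.zeros((c, c))
for i in range(c):
    Tc[i, i] = 0.5
    Tc[i, (i + 1) % c] = 0.5

# Class-label noise rate for a balanced dataset: average off-diagonal mass.
class_noise = 1.0 - np.trace(Tc) / c

# Similarity-label noise rate, obtained by enumerating the joint outcomes
# of two independent instances: clean classes (i, k), noisy classes (j, l).
n_flip = n_total = 0.0
for i in range(c):
    for j in range(c):
        for k in range(c):
            for l in range(c):
                p = Tc[i, j] * Tc[k, l]   # probability of this joint outcome
                n_total += p
                if (i == k) != (j == l):  # clean and noisy similarity disagree
                    n_flip += p
simi_noise = n_flip / n_total
print(class_noise, simi_noise)  # prints: 0.5 0.25
```

Under this assumed matrix the class noise rate is 0.5 while the similarity noise rate is 0.25, matching the numbers in the text.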
Thus, we can benefit from the transformation and achieve better performance. Theorem 2 and the experimental results will verify the effectiveness of this transformation. It remains unsolved how to learn a robust classifier from the data with transformed noisy similarity labels. To solve this problem, we first estimate the similarity noise transition matrix, a 2 × 2 matrix whose entries denote the flip rates of similarity labels. Note that the transition matrix bridges the noisy similarity posterior and the clean similarity posterior. The noisy similarity posterior can be learned from the data with noisy similarity labels. Then, given the similarity noise transition matrix, we can infer the clean similarity posterior from the noisy similarity posterior. Since the clean similarity posterior is approximated by the inner product of the clean class posteriors (Hsu et al., 2019), the clean class posterior (and thus the robust classifier) can thereby be learned. We will empirically show that Class2Simi with the estimated similarity noise transition matrix remarkably outperforms the baselines even when the baselines are given the ground-truth class noise transition matrix. The contributions of this paper are summarized as follows:

• We propose a new perspective on learning with label noise, which transforms class labels into similarity labels. Such a transformation reduces the noise level.

• We provide a way to estimate the similarity noise transition matrix by theoretically establishing its relation to the class noise transition matrix. We show that even if the class noise transition matrix is inaccurately estimated, the induced similarity noise transition matrix still works well.

• We design a deep learning method to learn robust classifiers from data with noisy similarity labels and theoretically analyze its generalization ability.

• We empirically demonstrate that the proposed method remarkably surpasses the baselines on many datasets with both synthetic noise and real-world noise.
The rest of this paper is organized as follows. In Section 2, we formalize the noisy multi-class classification problem, and in Section 3, we propose the Class2Simi strategy and practical implementation. Experimental results are discussed in Section 4. We conclude our paper in Section 5.

2. PROBLEM SETUP AND RELATED WORK

Let (X, Y) ∈ 𝒳 × {1, . . . , C} be the random variables for instances and clean labels, where 𝒳 represents the instance space and C is the number of classes. However, in many real-world applications (Zhang et al., 2017; Zhong et al., 2019; Li et al., 2019; Yi & Wu, 2019; Zhang et al., 2019; Tanno et al., 2019; Zhang et al., 2018), the clean labels cannot be observed; the observed labels are noisy. Let Ȳ be the random variable for the noisy labels. What we have is a sample {(x_1, ȳ_1), . . . , (x_n, ȳ_n)} drawn from the noisy distribution D_ρ of the random variables (X, Ȳ). Our aim is to learn a robust classifier that can assign clean labels to test data by exploiting the sample with noisy labels. Existing methods for learning with noisy labels can be divided into two categories: algorithms that result in statistically inconsistent classifiers and algorithms that result in statistically consistent ones. Methods in the first category usually employ heuristics to reduce the side-effect of noisy labels, e.g., selecting reliable examples (Yu et al., 2019; Han et al., 2018b; Malach & Shalev-Shwartz, 2017), reweighting examples (Ren et al., 2018; Jiang et al., 2018; Ma et al., 2018; Kremer et al., 2018; Tanaka et al., 2018; Reed et al., 2015), employing side information (Vahdat, 2017; Li et al., 2017; Berthon et al., 2020), and adding regularization (Han et al., 2018a; Guo et al., 2018; Veit et al., 2017; Vahdat, 2017; Li et al., 2017). Those methods empirically work well in many settings. Methods in the second category aim to learn robust classifiers that converge to the optimal ones defined on clean data.
They utilize the noise transition matrix, which denotes the probabilities that the clean labels flip into noisy labels, to build consistent algorithms (Goldberger & Ben-Reuven, 2017; Patrini et al., 2017; Thekumparampil et al., 2018; Yu et al., 2018; Liu & Guo, 2020; Zhang & Sabuncu, 2018; Kremer et al., 2018; Liu & Tao, 2016; Northcutt et al., 2017; Scott, 2015; Natarajan et al., 2013; Yao et al., 2020b) . The idea is that given the noisy class posterior probability and the noise transition matrix, the clean class posterior probability can be inferred. Note that the noisy class posterior and the noise transition matrix can be estimated by exploiting the noisy data, where the noise transition matrix additionally needs anchor points (Liu & Tao, 2016; Patrini et al., 2017) . Some methods assume anchor points have already been given (Yu et al., 2018) . There are also methods showing how to identify anchor points from the noisy training data (Liu & Tao, 2016; Patrini et al., 2017) .

3. CLASS2SIMI MEETS NOISY SUPERVISION

In this section, we propose a new strategy for learning from noisy data. Our core idea is to transform class labels to similarity labels first, and then handle the noise manifested on similarity labels.

3.1. TRANSFORMATION ON LABELS AND THE TRANSITION MATRIX

As in Figure 1 , we combine every 2 instances in pairs, and if the two instances have the same class label, we assign this pair a similarity label 1, otherwise 0. If the class labels are corrupted, the generated similarity labels also contain noise. We denote the clean and noisy similarity labels of the example-pair (x i , x j ) by H ij and Hij respectively. The definition of the similarity noise transition matrix is similar to the class one, denoting the probabilities that clean similarity labels flip into noisy similarity labels, i.e., T s,mn = P ( Hij = n|H ij = m). The dimension of the similarity noise transition matrix is always 2 × 2. Since the similarity labels are generated from class labels, the similarity noise is also determined and, thus can be calculated, by the class noise transition matrix. Theorem 1. Assume that the dataset is balanced (each class has the same amount of samples, and c classes in total), and the noise is class-dependent. Given a class noise transition matrix T c , such that T c,ij = P ( Ȳ = j|Y = i). The elements of the corresponding similarity noise transition matrix T s can be calculated as T s,00 = c 2 -c - j ( i T c,ij ) 2 -||T c || 2 Figure 2: An overview of the proposed method. We add a pairwise enumeration layer and similarity transition matrix to calculate and correct the predicted similarity posterior. By minimizing the proposed loss L c2s , a classifier f can be learned for assigning clean labels. The detailed structures of the Neural Network are provided in Section 4. Note that for the noisy similarity labels, some of them are correct and some are not. The similarity label for dogs is correct and the similarity label for cats is incorrect. In practice, the input data is original class-labeled data, and the transformation is conducted during the training procedure rather than before training. A detailed proof is provided in Appendix A. Remark 1. 
Theorem 1 easily extends to the setting where the dataset is unbalanced in classes by multiplying each T_{c,ij} by a coefficient n_i, where n_i is the number of examples from the i-th class. Note that the similarity labels are only dependent on class labels. If the class noise is class-dependent, the similarity noise is also "class-dependent" (here class means similar and dissimilar). Under class-dependent label noise, a binary classification is learnable as long as T_00 + T_11 > 1 (Menon et al., 2015), where T is the corresponding binary transition matrix; a multi-class classification is learnable if the corresponding transition matrix T_c is invertible. For Class2Simi, in the most general sense, i.e., when T_c is invertible, T_{s,00} + T_{s,11} > 1 holds. Namely, the learnability of the pointwise classification implies the learnability of the reduced pairwise classification. However, the latter cannot imply the former. A proof and a counterexample are provided in Appendix F.

Theorem 2. Assume that the dataset is balanced (each class has the same number of samples), and the noise is class-dependent. When the number of classes c ≥ 8, the noise rate for the noisy similarity labels is lower than that of the noisy class labels.

A detailed proof is provided in Appendix B. When dealing with label noise, a low noise rate has many benefits. The most important one is that noise-robust algorithms consistently achieve higher performance when the noise rate is lower (Bao et al., 2018; Han et al., 2018b; Xia et al., 2019; Patrini et al., 2017). Another benefit is that, when the noise rate is low, complex instance-dependent label noise can be well approximated by class-dependent label noise (Cheng et al., 2020), which is easier to handle. After the Class2Simi transformation, the number of dissimilar pairs is (c − 1) times as large as that of similar pairs.
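The closed-form expressions of Theorem 1 are straightforward to implement. Below is a minimal NumPy sketch (the helper name `simi_transition` is ours, not from the paper) that computes T_s from a given T_c:

```python
import numpy as np

def simi_transition(Tc: np.ndarray) -> np.ndarray:
    """Similarity noise transition matrix from the closed forms of
    Theorem 1 (balanced classes, class-dependent noise)."""
    c = Tc.shape[0]
    fro2 = np.sum(Tc ** 2)               # ||Tc||_Fro^2
    col2 = np.sum(Tc.sum(axis=0) ** 2)   # sum_j (sum_i Tc[i, j])^2
    Ts = np.empty((2, 2))
    Ts[0, 0] = (c**2 - c - (col2 - fro2)) / (c**2 - c)
    Ts[0, 1] = (col2 - fro2) / (c**2 - c)
    Ts[1, 0] = (c - fro2) / c
    Ts[1, 1] = fro2 / c
    return Ts
```

For the identity T_c (no class noise), the result is the 2 × 2 identity, i.e., the similarity labels are also noise-free; the rows of the returned matrix always sum to 1, as a transition matrix must.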
Meanwhile, compared with the original noise rate of class labels, the noise rate of similar pairs (the ratio of the number of mislabeled similar pairs to the number of total real similar pairs) is higher and the noise rate of dissimilar pairs is lower, while the overall noise rate of pairwise examples is lower, which partially reflects that the impact of the label noise is less harmful. Moreover, the flip from dissimilar to similar is more adversarial and thus more important. In practice, it is common that one class has more than one cluster, while it is rare that two or more classes are in the same cluster. If there is a flip from similar to dissimilar and based on it we split a (latent) cluster into two (latent) clusters, we still have a high chance to label these two clusters correctly later. If there is a flip from dissimilar to similar and based on it we join two clusters belonging to two classes into a single cluster, we have nearly zero chance to label this cluster correctly later. As a consequence, the flip from dissimilar to similar is more adversarial, and thus more important. To sum up, the reduction of the overall noise rate is meaningful.
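Theorem 2's comparison can also be checked numerically. The sketch below (our own verification code, not from the paper) computes both noise rates from the closed forms of Theorem 1 for randomly generated diagonally dominant class transition matrices with c = 10, and confirms that the similarity noise rate never exceeds the class noise rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_rates(Tc):
    """Class and similarity noise rates for a balanced dataset,
    using the closed forms of Theorem 1."""
    c = Tc.shape[0]
    fro2 = np.sum(Tc ** 2)
    col2 = np.sum(Tc.sum(axis=0) ** 2)
    ts01 = (col2 - fro2) / (c**2 - c)   # P(noisy similar | clean dissimilar)
    ts10 = (c - fro2) / c               # P(noisy dissimilar | clean similar)
    # Fractions of dissimilar / similar pairs are (c-1)/c and 1/c.
    s_noise = ts01 * (c - 1) / c + ts10 / c
    c_noise = 1.0 - np.trace(Tc) / c
    return c_noise, s_noise

c = 10
for _ in range(200):
    # Random diagonally dominant, row-stochastic class transition matrix.
    off = rng.random((c, c))
    np.fill_diagonal(off, 0.0)
    off /= off.sum(axis=1, keepdims=True)
    rho = rng.uniform(0.0, 0.5)         # per-row off-diagonal noise level
    Tc = (1.0 - rho) * np.eye(c) + rho * off
    c_noise, s_noise = noise_rates(Tc)
    assert s_noise <= c_noise + 1e-12   # similarity noise is never higher
```

For instance, under symmetric noise with rate 0.5 and c = 10, the class noise rate is 0.5 while the similarity noise rate drops to roughly 0.14.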

3.2. LEARNING WITH NOISY SIMILARITY LABELS

In order to learn a multi-class classifier from similarity-labeled data, we should establish the relationship between the class posterior probability and the similarity posterior probability. Here we employ the relationship established in (Hsu et al., 2019), which is derived from a likelihood model. As in Figure 2, the predicted clean similarity posterior is the inner product of two categorical distributions: Ŝ_ij = f(X_i)^⊤ f(X_j). Intuitively, f(X) outputs the predicted categorical distribution of the input X, and f(X_i)^⊤ f(X_j) measures how similar the two distributions are. For clarity, we visualize the predicted similarity posterior in Figure 3. If X_i and X_j are predicted to belong to the same class, i.e., argmax_{m∈C} f_m(X_i) = argmax_{n∈C} f_n(X_j), the predicted similarity posterior should be relatively high (Ŝ_ij = 0.30 in Figure 3(a)). By contrast, if X_i and X_j are predicted to belong to different classes, the predicted similarity posterior should be relatively low (Ŝ_ij = 0.0654 in Figure 3(b)). Note that the noisy similarity posterior distribution P(H̄_ij | X_i, X_j) and the clean similarity posterior distribution P(H_ij | X_i, X_j) satisfy P(H̄_ij | X_i, X_j) = T_s^⊤ P(H_ij | X_i, X_j). Therefore, we can infer the noisy similarity posterior, whose prediction we denote by $\hat{\bar{S}}_{ij}$, from the clean similarity posterior Ŝ_ij with the similarity noise transition matrix. To measure the error between the predicted noisy similarity posterior $\hat{\bar{S}}_{ij}$ and the noisy similarity label H̄_ij, we employ a binary cross-entropy loss function (Shannon, 1948). The final optimization objective is
$$L_{c2s}\big(\bar{H}_{ij}, \hat{\bar{S}}_{ij}\big) = -\sum_{i,j} \Big[\bar{H}_{ij} \log \hat{\bar{S}}_{ij} + \big(1 - \bar{H}_{ij}\big) \log\big(1 - \hat{\bar{S}}_{ij}\big)\Big].$$
The pipeline of the proposed Class2Simi is summarized in Figure 2. The softmax function outputs an estimate of the clean class posterior, i.e., f(X) = P̂(Y | X), where P̂(Y | X) denotes the estimated class posterior. Then a pairwise enumeration layer (Hsu et al., 2018) is added to calculate the predicted clean similarity posterior Ŝ_ij for every two instances. According to Equation 1, by pre-multiplying by the transpose of the similarity noise transition matrix, we obtain the predicted noisy similarity posterior $\hat{\bar{S}}_{ij}$. Therefore, by minimizing L_c2s, we learn a classifier for predicting noisy similarity labels. Meanwhile, before the transition matrix layer, the pairwise enumeration layer outputs a prediction for the clean similarity posterior, which guides f(X) to predict clean class labels.
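The forward pass described above can be sketched in a few lines of NumPy (function and variable names are ours; the paper's implementation feeds the softmax outputs of a deep network into `probs`, and T_s is indexed so that Ts[m, n] = P(H̄ = n | H = m)):

```python
import numpy as np

def class2simi_loss(probs, noisy_simi, Ts):
    """Forward-corrected pairwise loss, a sketch of L_c2s.

    probs:      (n, C) softmax outputs f(X), estimates of the clean
                class posterior for a batch of n instances.
    noisy_simi: (n, n) matrix of noisy similarity labels in {0, 1}.
    Ts:         2 x 2 similarity noise transition matrix.
    """
    # Pairwise enumeration layer: clean similarity posterior as the
    # inner product of two categorical distributions.
    S_clean = probs @ probs.T                       # (n, n) in (0, 1]
    # Transition correction: P(noisy similar) =
    #   Ts[0,1] * P(clean dissimilar) + Ts[1,1] * P(clean similar).
    S_noisy = Ts[0, 1] * (1.0 - S_clean) + Ts[1, 1] * S_clean
    # Binary cross-entropy against the noisy similarity labels.
    eps = 1e-12
    bce = -(noisy_simi * np.log(S_noisy + eps)
            + (1 - noisy_simi) * np.log(1 - S_noisy + eps))
    return bce.mean()
```

With a noise-free Ts (the identity) and near one-hot predictions that agree with the labels, the loss is close to zero; disagreeing labels drive it up, which is what pushes f(X) toward the clean class posterior.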

3.3. IMPLEMENTATION

The proposed algorithm is summarized in Algorithm 1. Since learning only from similarity labels loses the mapping between the output nodes and the semantic classes, we load the model trained on the data with noisy class labels to learn the class information in Stage 2. It is worthwhile to mention that Class2Simi increases the computation cost only slightly. Note that the transformation of labels happens during the training phase rather than before training. Specifically, as in Figure 2, we first read a batch of n examples and generate their corresponding n² similarity labels. Since n is the batch size, it is usually small. In addition, we only save the labels, not example-pairs, so the memory overhead is small. After that, the pairwise enumeration layer calculates the inner products between every two instances, outputting n² predicted similarity posterior probabilities. Then the similarity transition matrix corrects the n² predicted similarity posterior probabilities. Finally, the loss is accumulated over the n² items. Namely, Class2Simi only performs additional computation when generating similarity labels and calculating the inner products between every two instances in the pairwise enumeration layer, which is time-efficient.
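The per-batch label transformation can be sketched as follows (a minimal NumPy version; the function name is ours):

```python
import numpy as np

def to_simi_labels(noisy_class_labels):
    """Turn a batch of n noisy class labels into an n x n matrix of
    noisy similarity labels: 1 if two labels agree, else 0. Done per
    mini-batch during training, not ahead of time."""
    y = np.asarray(noisy_class_labels)
    return (y[:, None] == y[None, :]).astype(np.int64)
```

For example, `to_simi_labels([1, 2, 1])` returns `[[1, 0, 1], [0, 1, 0], [1, 0, 1]]`; only this n × n label matrix is materialized, never the example-pairs themselves.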

3.4. GENERALIZATION ERROR

We formulate the above problem in the traditional risk minimization framework (Mohri et al., 2018). The expected and empirical risks of employing an estimator f are defined as
$$R(f) = \mathbb{E}_{(X_i, X_j, \bar{Y}_i, \bar{Y}_j, \bar{H}_{ij}, T_s) \sim D_\rho}\big[\ell(f(X_i), f(X_j), T_s, \bar{H}_{ij})\big],$$
and
$$R_n(f) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij}),$$
where n is the training sample size of the noisy data. Assume that the neural network has d layers with parameter matrices W_1, . . . , W_d, and the activation functions σ_1, . . . , σ_{d−1} are Lipschitz continuous, satisfying σ_j(0) = 0. We denote by $h : X \mapsto W_d \, \sigma_{d-1}(W_{d-1} \, \sigma_{d-2}(\ldots \sigma_1(W_1 X))) \in \mathbb{R}^C$ the standard form of the neural network; the predicted label is given by argmax_{i∈{1,...,C}} h_i(X). The output of the softmax function is then defined as $f_i(X) = \exp(h_i(X)) / \sum_{j=1}^{C} \exp(h_j(X))$, i = 1, . . . , C. We obtain the following generalization error bound.

Theorem 3. Assume the parameter matrices W_1, . . . , W_d have Frobenius norm at most M_1, . . . , M_d, and the activation functions are 1-Lipschitz, positive-homogeneous, and applied element-wise (such as the ReLU). Assume the transition matrix is given, the instances X are upper bounded by B, i.e., ‖X‖ ≤ B for all X, and the loss function is upper bounded by M. Then, for any δ > 0, with probability at least 1 − δ,
$$R(\hat{f}) - R_n(\hat{f}) \le \frac{(T_{s,11} - T_{s,01}) \, 2BC\big(\sqrt{2d \log 2} + 1\big) \prod_{i=1}^{d} M_i}{T_{s,11} \sqrt{n}} + M \sqrt{\frac{\log 1/\delta}{2n}}.$$
Notation and a detailed proof are provided in Appendix C. Theorem 3 implies that if the training error is small and the training sample size is large, the expected risk R(f̂) of the representations for the noisy similarity posterior will be small. If the transition matrix is well estimated, the clean similarity posterior, and hence the classifier for the clean classes, will also have a small risk according to Equation 1 and the Class2Simi relations. This theoretically justifies why the proposed method works well.
In the experiment section, we will show that the transition matrices are well estimated and that the proposed method significantly outperforms the baselines. In Class2Simi, a multi-class classification is reduced to a pairwise binary classification. For pairwise examples, if a surrogate loss is classification-calibrated, minimizing it leads to minimizing the zero-one loss on the pointwise random variables in the limit case. Otherwise, we cannot guarantee the worst-case learnability of learning pointwise labels from pairwise examples, but this does not imply average-case non-learnability either. Theoretically, Bao et al. (2020) proved that when the pairwise labels are all correct, for the special case c = 2, a good model for predicting similar/dissimilar pairs must also be a good model for predicting the original classes, under mild assumptions. In practice, it seems fine to use non-classification-calibrated losses. According to Tewari & Bartlett (2007), the multi-class margin loss (i.e., one-vs-rest loss) and the pairwise comparison loss (i.e., one-vs-one loss) are proved to be non-calibrated, but they are still the main multi-class losses in Mohri et al. (2018); Shalev-Shwartz & Ben-David (2014).

4. EXPERIMENTS

Datasets. We employ three widely used image datasets, i.e., MNIST (LeCun, 1998), CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), one text dataset, News20, and one real-world noisy dataset, Clothing1M (Xiao et al., 2015).

Noisy class labels generation. For the three clean datasets, we artificially corrupt the class labels of the training and validation sets according to the class noise transition matrix. Specifically, for each instance with clean label i, we replace its label by j with probability T_{c,ij}. In this paper, we consider both symmetric and asymmetric noise settings, which are defined in Appendix D.

Baselines. As mentioned before, Class2Simi is a strategy rather than a specific algorithm. In this paper, we employ three T-based methods, i.e., Forward correction (Patrini et al., 2017), Reweight (Liu & Tao, 2016), and T-revision (Xia et al., 2019), which all utilize a class-dependent transition matrix to model the noise, to implement our approach and show the effectiveness of Class2Simi. Besides, we additionally conduct experiments on Co-teaching (Han et al., 2018b), a representative algorithm that selects reliable examples for training; APL (Ma et al., 2020), which applies simple normalization to loss functions and makes them robust to noisy labels; and S2E (Yao et al., 2020a), which properly controls the sample selection process so that deep networks can benefit from the memorization effect.

Network structure and optimizer. For MNIST, we use LeNet (LeCun et al., 1998). For CIFAR-10, we use ResNet-32 with pre-activation (He et al., 2016b). For CIFAR-100, we use ResNet-56 with pre-activation (He et al., 2016b). For News20, we use GloVe (Pennington et al., 2014) to obtain vector representations of the text, and employ a 3-layer MLP with the Softsign activation function. For Clothing1M, we use a pre-trained ResNet-50 (He et al., 2016a). We use the same optimization method as Forward correction to learn the noise transition matrix T̂_c.
In Stage 2, we use the Adam optimizer with initial learning rate 0.001. On MNIST, the batch size is 128 and the learning rate decays every 20 epochs by a factor of 0.1, with 60 epochs in total. On CIFAR-10, the batch size is also 128 and the learning rate decays every 40 epochs by a factor of 0.1, with 120 epochs in total. On CIFAR-100, the batch size is 1000 and the learning rate drops at epochs 80 and 160 by a factor of 0.1, with 200 epochs in total. On News20, the batch size is 128 and the learning rate decays every 10 epochs by a factor of 0.1, with 30 epochs in total. On Clothing1M, the batch size is 32 and the learning rate drops every 5 epochs by a factor of 0.1, with 10 epochs in total.

Results on noisy image datasets. The results in Table 1 and Figure 4 demonstrate that Class2Simi achieves strong classification accuracy and is robust against estimation errors in the transition matrix. From Table 1, overall, we can see that after the transformation, better performance is achieved due to the lower noise rate and the similarity transition matrix being robust to noise. Specifically, on MNIST, as the noise rate increases from Sym-0.1 to Sym-0.5, Forward & Class2Simi maintains remarkable accuracy above 98.20% while the accuracy of Forward decreases steadily. On CIFAR-100, there are obvious decreases in the accuracy of all methods, and our method achieves the best results across all noise rates; e.g., at Sym-0.5, Class2Simi gives accuracy uplifts of about 9.0% compared with the T-based methods. Results under asymmetric noise are provided in Appendix E.3. In Figure 4, we show that the similarity noise transition matrix is robust against estimation errors. To verify this, we add random noise to the ground-truth T_c by multiplying every element of T_c by a random variable α_ij. We control the noise rate on T_c by sampling α_ij from different intervals, i.e., 0.1 noise means that α_ij is uniformly sampled from ±[1.1, 1.2].
Then we normalize T_c so that its row sums equal 1. From Figure 4, we can see that the accuracy of Forward drops dramatically with the increase of the noise on T_c on all three datasets, while there is only a slight fluctuation of Forward & Class2Simi on MNIST.

Results on noisy text dataset. Results in Table 2 show that the proposed strategy works well on the text dataset under both symmetric and asymmetric noise settings.

Results on real-world noisy dataset. Results in Table 3 show that the proposed strategy significantly improves the classification accuracy of the T-based methods. T-based methods with Class2Simi also outperform the classic methods.

Ablation study. To investigate how the similarity loss function influences the classification accuracy, we conduct experiments with the cross-entropy loss function and the similarity loss function respectively on the clean datasets over 3 trials, where T_c is set to the identity matrix. All other settings are kept the same. As shown in Table 4, the similarity loss function alone does not improve the classification accuracy, which means the accuracy increase in our paper comes from the lower noise rate and the more robust transition matrix.
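The T_c perturbation used in the robustness experiment above can be sketched as follows. The notation "±[1.1, 1.2]" is ambiguous in the text; the sketch below adopts one plausible reading (each entry is either multiplied or divided by a factor drawn from [1.1, 1.2]), and the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_Tc(Tc, lo=1.1, hi=1.2):
    """Perturb a class transition matrix: each entry is scaled by a
    random factor alpha_ij in [lo, hi], applied either as *alpha or
    /alpha at random (one plausible reading of the paper's
    '±[1.1, 1.2]'); rows are then renormalized to sum to 1."""
    alpha = rng.uniform(lo, hi, size=Tc.shape)
    scale_up = rng.random(Tc.shape) < 0.5
    factor = np.where(scale_up, alpha, 1.0 / alpha)
    noisy = Tc * factor
    return noisy / noisy.sum(axis=1, keepdims=True)
```

The renormalization step keeps the perturbed matrix row-stochastic, so it remains a valid (but inaccurate) transition matrix that can be fed to Forward or to Class2Simi via Theorem 1.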

5. CONCLUSION

This paper proposes a new perspective on dealing with class label noise (called Class2Simi) by transforming the training sample with noisy class labels into a training sample with noisy similarity labels. We also propose a deep learning framework to learn classifiers directly with the noisy similarity labels. The core idea is to transform class information into similarity information, which makes the noise rate lower. We also prove that not only the similarity labels but also the similarity noise transition matrix is robust to noise. Experiments are conducted on benchmark datasets, demonstrating the effectiveness of our method. In future work, investigating different types of noise for diverse real-life scenarios might prove important.

APPENDICES

A PROOF OF THEOREM 1

Theorem 1. Assume that the dataset is balanced (each class has the same number of samples, with c classes in total) and that the noise is class-dependent. Given a class noise transition matrix T_c such that T_{c,ij} = P(Ȳ = j | Y = i), the elements of the corresponding similarity noise transition matrix T_s can be calculated as
$$T_{s,00} = \frac{c^2 - c - \big(\sum_j (\sum_i T_{c,ij})^2 - \|T_c\|_{\mathrm{Fro}}^2\big)}{c^2 - c}, \quad T_{s,01} = \frac{\sum_j (\sum_i T_{c,ij})^2 - \|T_c\|_{\mathrm{Fro}}^2}{c^2 - c},$$
$$T_{s,10} = \frac{c - \|T_c\|_{\mathrm{Fro}}^2}{c}, \quad T_{s,11} = \frac{\|T_c\|_{\mathrm{Fro}}^2}{c}.$$

Proof. Assume each class has n samples. The quantity $n^2 T_{c,ij} T_{c,i'j'}$ is the expected number of sample-pairs generated by $(\bar{Y} = j \mid Y = i)$ and $(\bar{Y} = j' \mid Y = i')$. For the first element $T_{s,00}$, $n^2 \sum_{i \neq i'} \sum_{j,j'} T_{c,ij} T_{c,i'j'}$ is the number of sample-pairs with clean similarity label $H = 0$, while $n^2 \sum_{i \neq i', j \neq j'} T_{c,ij} T_{c,i'j'}$ is the number of example-pairs with clean similarity label $H = 0$ and noisy similarity label $\bar{H} = 0$. The ratio of these two terms is exactly $T_{s,00} = P(\bar{H} = 0 \mid H = 0)$. The remaining three elements can be represented in the same way. The primal representations are
$$T_{s,00} = \frac{\sum_{i \neq i', j \neq j'} T_{c,ij} T_{c,i'j'}}{\sum_{i \neq i'} \sum_{j,j'} T_{c,ij} T_{c,i'j'}}, \quad T_{s,01} = \frac{\sum_{i \neq i', j = j'} T_{c,ij} T_{c,i'j'}}{\sum_{i \neq i'} \sum_{j,j'} T_{c,ij} T_{c,i'j'}},$$
$$T_{s,10} = \frac{\sum_{i = i', j \neq j'} T_{c,ij} T_{c,i'j'}}{\sum_{i = i'} \sum_{j,j'} T_{c,ij} T_{c,i'j'}}, \quad T_{s,11} = \frac{\sum_{i = i', j = j'} T_{c,ij} T_{c,i'j'}}{\sum_{i = i'} \sum_{j,j'} T_{c,ij} T_{c,i'j'}}.$$
Further, note that
$$\sum_{i = i'} \sum_{j,j'} T_{c,ij} T_{c,i'j'} = \sum_i \Big(\sum_j T_{c,ij}\Big)\Big(\sum_{j'} T_{c,ij'}\Big) = c,$$
$$\sum_{i \neq i'} \sum_{j,j'} T_{c,ij} T_{c,i'j'} = \sum_{i \neq i'} \Big(\sum_j T_{c,ij}\Big)\Big(\sum_{j'} T_{c,i'j'}\Big) = (c - 1)c,$$
$$\sum_{i = i', j = j'} T_{c,ij} T_{c,i'j'} = \|T_c\|_{\mathrm{Fro}}^2, \quad \sum_{i \neq i', j = j'} T_{c,ij} T_{c,i'j'} = \sum_j \Big(\sum_i T_{c,ij}\Big)^2 - \|T_c\|_{\mathrm{Fro}}^2.$$
Substituting the above four equations into the primal representations proves Theorem 1.

B PROOF OF THEOREM 2

Theorem 2. Assume that the dataset is balanced (each class has the same amount of samples), and the noise is class-dependent. When the number of classes c ≥ 8, the noise rate for the noisy similarity labels is lower than that of the noisy class labels. Proof. Assume each class has n samples. As we state in the proof of Theorem 1, the number of example-pairs with clean similarity labels H = 0 and noisy similarity labels H = 0 is n 2 i =i ,j =j T c,ij T c,i j . We denote it by N 00 . Similarly, we have, N 00 = n 2 i =i ,j =j T c,ij T c,i j , N 01 = n 2 i =i ,j=j T c,ij T c,i j , N 10 = n 2 i=i ,j =j T c,ij T c,i j , N 11 = n 2 i=i ,j=j T c,ij T c,i j . The noise rate is the ratio of the number of noisy examples to the number of total examples. Assume that the number of classes is c. We have S noise = N 01 + N 10 N 00 + N 01 + N 10 + N 11 = N 01 + N 10 c 2 n 2 , C noise = n i =j T c,ij cn . Let S noise minus C noise , we have S noise -C noise = n 2 i =i ,j=j T c,ij T c,i j + n 2 i=i ,j =j T c,ij T c,i j c 2 n 2 - n i =j T c,ij cn = i =i ,j=j T c,ij T c,i j + i=i ,j =j T c,ij T c,i j -c i =j T c,ij c 2 . Let A = i =i ,j=j T c,ij T c,i j + i=i ,j =j T c,ij T c,i j -c i =j T c,ij , we have A = i =i ,j=j T c,ij T c,i j + i=i ,j =j T c,ij T c,i j -c i =j T c,ij = i =i ,j=j T c,ij T c,i j + i=i ,j =j T c,ij T c,i j -c( i,j T c,ij - i=j T c,ij ) = i =i ,j=j T c,ij T c,i j + i=i ,j =j T c,ij T c,i j -c + c i=j T c,ij . The second equation holds because the row sum of T c is 1. For the first term i =i ,j=j T c,ij T c,i j , notice that: i =i ,j=j T c,ij T c,i j = j i T c,ij ( i =i T c,i j ) = j i T c,ij ( i =i T c,i j + T c,ij -T c,ij ) = j i T c,ij ( i T c,i j -T c,ij ) = j i T c,ij (S j -T c,ij ) (S j is the column sum of the j -th column) = j i T c,ij S j -T 2 c,ij = j S j i T c,ij - j i T 2 c,ij = j S 2 j - j i T 2 c,ij . 
Due to the symmetry of i and j, for the second term i=i ,j =j T c,ij T c,i j , we have i=i ,j =j T c,ij T c,i j = j i T c,ij (R i -T c,ij ) (R i is the row sum of the i -th row, and R i = 1) = j i T c,ij -T 2 c,ij = c - j i T 2 c,ij . Therefore, substituting Equation ( 6) and ( 7) into A, we have A = j S 2 j - j i T 2 c,ij + c - j i T 2 c,ij -c 2 + c i=j T c,ij . To prove S noise -C noise ≤ 0 is equivalent to prove A ≤ 0. Let M = c 2 -c, N = j S 2 j -2 j i T 2 ij + c i=j T ij (we drop the subscript c in T c,ij ), and A = N -M . Now we utilize the Adjustment method (Su & Xiong, 2015) to scale N . For every iteration, we denote the original N by N o , and the adjusted N by N a . Since c ≥ 8, there can not exist three columns with column sum bigger than c/2 -1. Otherwise, the sum of the three columns will be bigger than c, which is impossible because the sum of the whole matrix is c. Therefore, first, we assume that the j, k -th columns have column sum bigger than c/2 -1. Then, for the row i, we add the elements l, which are not in j, k -th columns, to the diagonal element. We have N a -N o = (S i + T il ) 2 + (S l + T il ) 2 + cT il -2(T ii + T il ) 2 -S 2 i -S 2 l + 2(T 2 ii + T 2 il ) = T il (2T il + 2S i -2S l + c -4T ii ) ≥ T il (2T il -2S l + c -2T ii ) (∵ S i ≥ T ii ) > T il (2T il -c + 2 + c -2T ii ) (∵ S l < c/2 -1) ≥ 0. (∵ T ii ≤ 1) We do such adjustment to every rows, then N a is getting bigger and the adjusted matrix will only have values on diagonal elements and the j, k -th columns. Since the diagonal elements are dominant in the row, S j + S k < 2c/3 + 2/3 (because for i = j, k, T ij + T ik < 2/3). Assume that the column sum of k -th column is no bigger than that of the j -th column, and thus S k < c/3 + 1/3. Then, for a row i, we add the T ik to T ii . 
We have
$$N_a - N_o = (S_i + T_{ik})^2 + (S_k - T_{ik})^2 + cT_{ik} - 2(T_{ii} + T_{ik})^2 - S_i^2 - S_k^2 + 2(T_{ii}^2 + T_{ik}^2) = T_{ik}(2T_{ik} + 2S_i - 2S_k + c - 4T_{ii})$$
$$\ge T_{ik}(2T_{ik} - 2S_k + c - 2T_{ii}) \quad (\because S_i \ge T_{ii}) \quad > T_{ik}(2T_{ik} + c/3 - 2/3 - 2T_{ii}) \quad (\because S_k < c/3 + 1/3) \quad \ge 0 \quad (\because c \ge 8 \text{ and } T_{ii} \le 1).$$
We apply this adjustment to every row; $N_a$ again can only grow, and the adjusted matrix has non-zero entries only on the diagonal and in the $j$-th column. We call it the final matrix. Note that if only one column has a column sum larger than $c/2 - 1$, we can adjust the other $c - 1$ columns as above and obtain a final matrix as well. If no column has a column sum larger than $c/2 - 1$, we can adjust all off-diagonal elements as above and obtain the identity matrix; for the identity matrix $N_a = c - 2c + c^2 = M$, so $A = N - M < N_a - M = 0$ and Theorem 2 is proved in this case.

Now we process the final matrix. For simplicity, assume $j = 0$. Denote $T_{i0}$ by $b_i$ and $T_{ii}$ by $a_i$ for $i \in \{1, \ldots, c-1\}$; since each row sums to $1$, $a_i = 1 - b_i$, and $T_{00} = 1$. We have
$$N_a = \sum_i a_i^2 + \Big(1 + \sum_i b_i\Big)^2 + c\Big(\sum_i a_i + 1\Big) - 2\Big(\sum_i a_i^2 + \sum_i b_i^2 + 1\Big)$$
$$= \Big(1 + \sum_i b_i\Big)^2 + c\sum_i a_i + c - \sum_i a_i^2 - 2\sum_i b_i^2 - 2$$
$$= \Big(\sum_i b_i\Big)^2 + 2\sum_i b_i - 2\sum_i b_i^2 + c\sum_i a_i - \sum_i a_i^2 + c - 1$$
$$= \Big(\sum_i b_i\Big)^2 + 2\sum_i b_i - 2\sum_i b_i^2 + c\sum_i (1 - b_i) - \sum_i (1 - b_i)^2 + c - 1$$
$$= \Big(\sum_i b_i\Big)^2 + 2\sum_i b_i - 2\sum_i b_i^2 + c^2 - c - c\sum_i b_i - \sum_i (1 - 2b_i + b_i^2) + c - 1$$
$$= \Big(\sum_i b_i\Big)^2 + 4\sum_i b_i - 3\sum_i b_i^2 - c\sum_i b_i + c^2 - c.$$
Now we prove $A = N - M \le N_a - M \le 0$. Note that
$$N_a - M = \Big(\sum_i b_i\Big)^2 + 4\sum_i b_i - 3\sum_i b_i^2 - c\sum_i b_i = \Big(\sum_i b_i\Big)^2 + 3\sum_i b_i - 3\sum_i b_i^2 - (c - 1)\sum_i b_i$$
$$= \Big(\sum_i b_i\Big)^2 + 3\sum_i b_i - 3\sum_i b_i^2 - \Big(\sum_i (1 - b_i) + \sum_i b_i\Big)\sum_i b_i = 3\sum_i b_i - 3\sum_i b_i^2 - \sum_i (1 - b_i)\sum_i b_i = 3\sum_i b_i(1 - b_i) - \sum_i (1 - b_i)\sum_i b_i.$$
According to the rearrangement inequality (Hardy et al., 1952), $\sum_i (1 - b_i)\sum_i b_i \ge (c - 1)\sum_i b_i(1 - b_i)$. Since $c \ge 8$, it follows that $3\sum_i b_i(1 - b_i) - \sum_i (1 - b_i)\sum_i b_i \le 0$,

C PROOF OF THEOREM 3

Theorem 3.
Assume the parameter matrices $W_1, \ldots, W_d$ have Frobenius norm at most $M_1, \ldots, M_d$, and the activation functions are 1-Lipschitz, positive-homogeneous, and applied element-wise (such as the ReLU). Assume the transition matrix is given, the instances $X$ are upper bounded by $B$, i.e., $\|X\| \le B$ for all $X$, and the loss function is upper bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\bar{R}(\hat{f}) - \bar{R}_n(\hat{f}) \le \frac{(T_{s,11} - T_{s,01}) \cdot 2BC(\sqrt{2d \log 2} + 1)\prod_{i=1}^d M_i}{T_{s,11}\sqrt{n}} + M\sqrt{\frac{\log 1/\delta}{2n}}.$$

Proof. We have defined
$$\bar{R}(f) = \mathbb{E}_{(X_i, X_j, \bar{Y}_i, \bar{Y}_j, \bar{H}_{ij}) \sim \bar{D}_\rho}\big[\ell(f(X_i), f(X_j), T_s, \bar{H}_{ij})\big], \qquad \bar{R}_n(f) = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij}),$$
where $n$ is the training sample size of the noisy data. First, we bound the generalization error with the Rademacher complexity (Bartlett & Mendelson, 2002).

Theorem 4 (Bartlett & Mendelson (2002)). Let the loss function be upper bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, we have
$$\sup_{f \in F} |\bar{R}(f) - \bar{R}_n(f)| \le 2\mathfrak{R}_n(\ell \circ F) + M\sqrt{\frac{\log 1/\delta}{2n}},$$
where $\mathfrak{R}_n(\ell \circ F)$ is the Rademacher complexity defined by
$$\mathfrak{R}_n(\ell \circ F) = \mathbb{E}\Big[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij})\Big],$$
and $\{\sigma_1, \cdots, \sigma_n\}$ are Rademacher variables uniformly distributed over $\{-1, 1\}$.

Before further upper bounding the Rademacher complexity $\mathfrak{R}_n(\ell \circ F)$, we discuss the specific loss function and its Lipschitz continuity w.r.t. $h_k(X_i)$, $k \in \{1, \ldots, C\}$.

Lemma 1. Given the similarity transition matrix $T_s$, the loss function $\ell(f(X_i), f(X_j), T_s, \bar{H}_{ij})$ is $\mu$-Lipschitz with respect to $h_k(X_i)$, $k \in \{1, \ldots, C\}$, with $\mu = (T_{s,11} - T_{s,01})/T_{s,11}$:
$$\Big|\frac{\partial \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij})}{\partial h_k(X_i)}\Big| < \frac{T_{s,11} - T_{s,01}}{T_{s,11}}.$$
The detailed proof of Lemma 1 can be found in Section C.1. Lemma 1 shows that the loss function is $\mu$-Lipschitz with respect to $h_k(X_i)$, $k \in \{1, \ldots, C\}$. Based on Lemma 1, we can further upper bound the Rademacher complexity $\mathfrak{R}_n(\ell \circ F)$ by the following lemma.

Lemma 2.
Given the similarity transition matrix $T_s$, and assuming the loss function $\ell(f(X_i), f(X_j), T_s, \bar{H}_{ij})$ is $\mu$-Lipschitz with respect to $h_k(X_i)$, $k \in \{1, \ldots, C\}$, we have
$$\mathfrak{R}_n(\ell \circ F) = \mathbb{E}\Big[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij})\Big] \le \mu C \, \mathbb{E}\Big[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i)\Big],$$
where $H$ is the function class induced by the deep neural network.

The detailed proof of Lemma 2 can be found in Section C.2. The right-hand side of the above inequality, which reflects the hypothesis complexity of the deep neural network and bounds the Rademacher complexity, can be bounded by the following theorem.

Theorem 5 (Golowich et al., 2018). Assume the Frobenius norms of the weight matrices $W_1, \ldots, W_d$ are at most $M_1, \ldots, M_d$. Let the activation functions be 1-Lipschitz, positive-homogeneous, and applied element-wise (such as the ReLU). Let $X$ be upper bounded by $B$, i.e., $\|X\| \le B$ for any $X$. Then,
$$\mathbb{E}\Big[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i)\Big] \le \frac{B(\sqrt{2d \log 2} + 1)\prod_{i=1}^d M_i}{\sqrt{n}}.$$
Combining Lemmas 1 and 2 with Theorems 4 and 5, Theorem 3 is proved.

C.1 PROOF OF LEMMA 1

Recall that
$$\ell(f(X_i), f(X_j), T_s, \bar{H}_{ij} = 1) = -\log(\hat{\bar{S}}_{ij}) = -\log\big(\hat{S}_{ij} T_{s,11} + (1 - \hat{S}_{ij}) T_{s,01}\big) = -\log\big(f(X_i)^\top f(X_j) T_{s,11} + (1 - f(X_i)^\top f(X_j)) T_{s,01}\big),$$
where
$$f(X_i) = [f_1(X_i), \ldots, f_c(X_i)]^\top = \Big[\frac{\exp(h_1(X_i))}{\sum_{k=1}^c \exp(h_k(X_i))}, \ldots, \frac{\exp(h_c(X_i))}{\sum_{k=1}^c \exp(h_k(X_i))}\Big]^\top.$$
Taking the derivative of $\ell(f(X_i), f(X_j), T_s, \bar{H}_{ij} = 1)$ w.r.t. $h_k(X_i)$, we have
$$\frac{\partial \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij} = 1)}{\partial h_k(X_i)} = \frac{\partial \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij} = 1)}{\partial \hat{\bar{S}}_{ij}} \cdot \frac{\partial f(X_i)}{\partial h_k(X_i)}^\top \frac{\partial \hat{\bar{S}}_{ij}}{\partial f(X_i)},$$
where
$$\frac{\partial \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij} = 1)}{\partial \hat{\bar{S}}_{ij}} = -\frac{1}{f(X_i)^\top f(X_j) T_{s,11} + (1 - f(X_i)^\top f(X_j)) T_{s,01}},$$
$$\frac{\partial \hat{\bar{S}}_{ij}}{\partial f(X_i)} = f(X_j) T_{s,11} - f(X_j) T_{s,01}, \qquad \frac{\partial f(X_i)}{\partial h_k(X_i)} = f'(X_i) = [f_1'(X_i), \ldots, f_c'(X_i)]^\top.$$
Note that the derivative of the softmax function has the following properties: if $m \neq k$, $f_m'(X_i) = -f_m(X_i) f_k(X_i)$, and if $m = k$, $f_k'(X_i) = (1 - f_k(X_i)) f_k(X_i)$. For a complicated vector $Vector$, we denote by $Vector_m$ its $m$-th element. Because $0 < f_m(X_i) < 1$ for all $m \in \{1, \ldots, c\}$, we have
$$|f_m'(X_i)| < f_m(X_i), \quad \forall m \in \{1, \ldots, c\}; \qquad (19)$$
$$|f'(X_i)^\top f(X_j)| < f(X_i)^\top f(X_j). \qquad (20)$$
Therefore,
$$\Big|\frac{\partial \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij} = 1)}{\partial h_k(X_i)}\Big| = \Big|\frac{f'(X_i)^\top f(X_j) T_{s,11} - f'(X_i)^\top f(X_j) T_{s,01}}{f(X_i)^\top f(X_j) T_{s,11} + (1 - f(X_i)^\top f(X_j)) T_{s,01}}\Big| < \frac{f(X_i)^\top f(X_j) T_{s,11} - f(X_i)^\top f(X_j) T_{s,01}}{f(X_i)^\top f(X_j) T_{s,11} + (1 - f(X_i)^\top f(X_j)) T_{s,01}} < \frac{T_{s,11} - T_{s,01}}{T_{s,11}}.$$
The first inequality holds because $T_{s,11} > T_{s,01}$ (a detailed proof can be found in Section C.1.1) and because of Equation (20). The second inequality holds because $f(X_i)^\top f(X_j) < 1$. Similarly, we can prove the corresponding bound for $\partial \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij} = 0)/\partial h_k(X_i)$. (As we mention in Section B and show in Section C.1.1, $T_{s,11} - T_{s,01} > 0$ follows from $N_{11} N_0 - N_{01} N_1 = c(c-1)n^2 N_{11} - c n^2 N_{01} > 0$, which holds because $(c-1) N_{11} - N_{01} > 0$ according to the rearrangement inequality (Hardy et al., 1952).)

C.2 PROOF OF LEMMA 2

$$\mathbb{E}\Big[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(f(X_i), f(X_j), T_s, \bar{H}_{ij})\Big] = \mathbb{E}\Big[\sup_{g} \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(\cdot)\Big] = \mathbb{E}\Big[\sup_{\operatorname{argmax}\{h_1, \ldots, h_C\}} \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(\cdot)\Big] = \mathbb{E}\Big[\sup_{\max\{h_1, \ldots, h_C\}} \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(\cdot)\Big]$$
$$\le \mathbb{E}\Big[\sum_{k=1}^C \sup_{h_k \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(\cdot)\Big] = \sum_{k=1}^C \mathbb{E}\Big[\sup_{h_k \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(\cdot)\Big] \le \mu C \, \mathbb{E}\Big[\sup_{h_k \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h_k(X_i)\Big] = \mu C \, \mathbb{E}\Big[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_i)\Big],$$
where $\ell(\cdot)$ abbreviates $\ell(f(X_i), f(X_j), T_s, \bar{H}_{ij})$, and the first three equations hold because, given $T_s$, $f$ and $\max\{h_1, \ldots, h_C\}$ impose the same constraint on h_j(X_i), j ∈ {1, …
, C}; the sixth relation (the last inequality) holds because of the Talagrand contraction lemma (Ledoux & Talagrand, 2013).

In Figure 6, we show that Class2Simi performs well even with less training data. The noise in the training data is set to Sym-0.5. We randomly sample from the original training data with sampling rates from 0.5 to 1.0 and train the model on the sampled data; the test sets remain the same. At every sampling rate, Class2Simi performs better than the baseline. With only 50%, 80% and 80% of the data on MNIST, CIFAR10 and CIFAR100, respectively, our method achieves the same accuracy as Forward.

E.3 RESULTS ON ASYMMETRIC NOISE SETTING

In Table 5, we demonstrate the effectiveness of our method under the asymmetric noise setting.

For an invertible $T_c$, denote by $v_j$ the $j$-th column of $T_c$ and by $\mathbf{1}$ the all-one vector. Then,
$$\sum_j \Big(\sum_i T_{c,ij}\Big)^2 = \sum_j \langle v_j, \mathbf{1} \rangle^2 \le \sum_j \|v_j\|^2 \|\mathbf{1}\|^2 = c \|T_c\|_{Fro}^2,$$
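As a quick numerical sanity check of this Cauchy-Schwarz step (a sketch, not part of the proof; the matrix below is an arbitrary row-stochastic stand-in for $T_c$):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 10
# Arbitrary row-stochastic matrix standing in for the class transition matrix T_c.
T = rng.random((c, c))
T /= T.sum(axis=1, keepdims=True)

lhs = np.sum(T.sum(axis=0) ** 2)   # sum over j of (j-th column sum)^2
rhs = c * np.sum(T ** 2)           # c times the squared Frobenius norm of T_c
```

The inequality holds with equality only when every column is a multiple of the all-one vector.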



In multi-class classification problems, the number of classes is usually larger than 8, e.g., MNIST (LeCun, 1998), CIFAR-10, and CIFAR-100 (Krizhevsky et al., 2009). The assumption holds because deep neural networks will always regulate the objective to be of finite value, and thus the corresponding loss functions are of finite value.



Figure 1: Illustration of the transformation from class labels to similarity labels. Note that ȳ stands for the noisy class label and y for the latent clean class label. The labels marked in red are incorrect. If we assume the class-label noise is generated according to the noise transition matrix presented in the upper part of the right column, it can be calculated that the noise rate for the noisy class labels is 0.5 while that for the noisy similarity labels is 0.25. Note that the noise transition matrix for similarity labels can be calculated by exploiting the class noise transition matrix as in Theorem 1.
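The transformation in the caption can be written in a few lines. The sketch below (the function name `class_to_simi` is ours, not from the paper, and the labels are merely in the spirit of Figure 1) maps a batch of noisy class labels to the n × n matrix of noisy similarity labels:

```python
import numpy as np

def class_to_simi(labels):
    """Turn a batch of (noisy) class labels into an n x n matrix of
    (noisy) similarity labels: entry (i, j) is 1 iff labels i and j agree."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

# Eight noisy labels over four classes, as an illustration.
y_noisy = np.array([1, 2, 3, 4, 1, 1, 3, 2])
S_noisy = class_to_simi(y_noisy)   # shape (8, 8), symmetric, 1 on the diagonal
```

Even if both class labels of a pair are wrong, the corresponding entry of `S_noisy` can still be correct, which is the intuition behind the lower similarity noise rate.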

Figure 3: Examples of predicted noisy similarity. Assume the number of classes is 10; f(X_i) and f(X_j) are the categorical distributions of X_i and X_j, respectively, shown above as area charts. Ŝ_ij is the predicted similarity posterior between the two instances, calculated as the inner product of the two categorical distributions.
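The predicted similarity can be computed directly from the network logits; the helper names below are ours:

```python
import numpy as np

def softmax(h):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(h - np.max(h))
    return e / e.sum()

def predicted_similarity(h_i, h_j):
    """S_ij: inner product of the two categorical (softmax) distributions."""
    return float(softmax(h_i) @ softmax(h_j))

# Two identical uniform distributions over 10 classes give S_ij = 10 * 0.1^2 = 0.1.
s_uniform = predicted_similarity(np.zeros(10), np.zeros(10))
# Two confident, agreeing predictions give S_ij close to 1.
s_peaked = predicted_similarity(np.eye(10)[3] * 50, np.eye(10)[3] * 50)
```

Note that Ŝ_ij is always in (0, 1], and is close to 1 only when both distributions concentrate on the same class.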

Class2Simi
Input: training data with noisy class labels; validation data with noisy class labels.
Stage 1: Learn T_s
  1: Learn g(X) = P(Ȳ|X) on the training data with noisy class labels, and save the model for Stage 2;
  2: Estimate T_c following the optimization method in (Patrini et al., 2017);
  3: Transform T_c to T_s.
Stage 2: Learn the classifier f(X) = P(Y|X)
  4: Load the model saved in Stage 1, and train the whole pipeline shown in Figure 2.
Output: classifier f.

… with negligible memory overhead. Then the neural network outputs the class posterior probabilities of the n single examples in the batch of data.
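Step 3 ("Transform T_c to T_s") can be sketched as follows, assuming balanced classes and using the pair counts N_ab from the appendix (the function name is ours; the n² factors cancel and are dropped):

```python
import numpy as np

def class_to_simi_transition(Tc):
    """Derive the similarity-noise probabilities from the class transition
    matrix T_c, assuming balanced classes. N[a, b] is proportional to the
    number of pairs with clean similarity a and noisy similarity b."""
    c = Tc.shape[0]
    N = np.zeros((2, 2))
    for i in range(c):
        for ip in range(c):
            # Joint distribution over the pair's noisy labels (j, j'),
            # given clean classes (i, i').
            joint = np.outer(Tc[i], Tc[ip])
            same_noisy = np.trace(joint)          # P(j == j')
            a = int(i == ip)                      # clean similarity label
            N[a, 1] += same_noisy
            N[a, 0] += joint.sum() - same_noisy
    Ts01 = N[0, 1] / (N[0, 0] + N[0, 1])          # P(noisy S = 1 | clean S = 0)
    Ts11 = N[1, 1] / (N[1, 0] + N[1, 1])          # P(noisy S = 1 | clean S = 1)
    return Ts01, Ts11
```

With no class noise (T_c the identity) this yields Ts01 = 0 and Ts11 = 1, as expected.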

Figure 4: Means and Standard Deviations of Classification Accuracy over 5 trials on MNIST, CIFAR10 and CIFAR100 with perturbed ground-truth T_c.

As we mentioned in Section B, we have
$$N_{00} = n^2 \sum_{i \neq i',\, j \neq j'} T_{c,ij} T_{c,i'j'}, \quad N_{01} = n^2 \sum_{i \neq i',\, j = j'} T_{c,ij} T_{c,i'j'}, \quad N_{10} = n^2 \sum_{i = i',\, j \neq j'} T_{c,ij} T_{c,i'j'}, \quad N_{11} = n^2 \sum_{i = i',\, j = j'} T_{c,ij} T_{c,i'j'},$$
$$T_{s,01} = \frac{N_{01}}{N_{00} + N_{01}}, \qquad T_{s,11} = \frac{N_{11}}{N_{10} + N_{11}}, \qquad T_{s,11} - T_{s,01} = \frac{N_{11} N_{00} + N_{11} N_{01} - N_{01} N_{10} - N_{01} N_{11}}{(N_{00} + N_{01})(N_{10} + N_{11})}.$$
Let us review the definition of similarity labels: if two instances belong to the same class, they have similarity label $S = 1$; otherwise $S = 0$. That is, for a balanced $c$-class dataset, only $1/c$ of the similarity data has similarity label $S = 1$, and the remaining $1 - 1/c$ has similarity label $S = 0$. We denote the number of pairs with similarity label $S = 1$ by $N_1$ and the rest by $N_0$. Therefore, for a balanced dataset with $n$ samples per class, $N_1 = c n^2$ and $N_0 = c(c-1) n^2$. Let $A$ be the numerator of $T_{s,11} - T_{s,01}$ (the terms $N_{11} N_{01}$ and $-N_{01} N_{11}$ cancel). We have
$$A = N_{11} N_{00} - N_{01} N_{10} = N_{11} N_{00} - (N_0 - N_{00})(N_1 - N_{11}) = N_{11} N_{00} - N_0 N_1 + N_0 N_{11} + N_1 N_{00} - N_{00} N_{11} = N_{11} N_0 - N_{01} N_1,$$
where the last step uses $N_1 N_{00} - N_0 N_1 = -N_1 N_{01}$. Substituting $N_0 = c(c-1)n^2$ and $N_1 = c n^2$ gives $A = c(c-1)n^2 N_{11} - c n^2 N_{01} > 0$, since $(c-1) N_{11} - N_{01} > 0$ by the rearrangement inequality (Hardy et al., 1952); hence $T_{s,11} > T_{s,01}$.
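The inequality $T_{s,11} > T_{s,01}$ can also be checked numerically using the closed-form pair counts above (with the $n^2$ factors dropped); this is only a sanity check on a random diagonally-dominant matrix, not a proof:

```python
import numpy as np

rng = np.random.default_rng(1)
c = 10
# Random diagonally-dominant row-stochastic matrix (a hypothetical T_c).
T = rng.random((c, c)) + c * np.eye(c)
T /= T.sum(axis=1, keepdims=True)

col = T.sum(axis=0)
N11 = np.sum(T ** 2)            # i = i', j = j'
N10 = c - N11                   # i = i', j != j'  (row sums are 1)
N01 = np.sum(col ** 2) - N11    # i != i', j = j'
N00 = c * (c - 1) - N01         # i != i', j != j' (N00 + N01 = c(c-1))

Ts01 = N01 / (N00 + N01)
Ts11 = N11 / (N10 + N11)
```

The identities used for `N10` and `N00` follow from the row sums of $T_c$ being 1.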

Figure 6: Means and Standard Deviations (Percentage) of Classification Accuracy over 5 trials on MNIST, CIFAR10 and CIFAR100, trained with different sampling rates of the training data. The noise rate on the training data is set to Sym-0.5.

Pointwise IMPLIES pairwise

Means and Standard Deviations of Classification Accuracy over 5 trials on image datasets.

MNIST has 28 × 28 grayscale images of 10 classes, including 60,000 training images and 10,000 test images. CIFAR-10 and CIFAR-100 both have 32 × 32 × 3 color images, including 50,000 training images and 10,000 test images; CIFAR-10 has 10 classes while CIFAR-100 has 100 classes. News20 is a collection of approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups. Clothing1M has 1M images with real-world noisy labels and additional 50k, 14k, and 10k images with clean labels for training, validation, and test; we only use the noisy training set in the training phase. Note that the similarity learning method of Class2Simi is based on Cluster because there is no class information. Intuitively, for a noisy class, if most instances in it actually belong to another specific class, we can hardly identify it. For example, assume that a class with noisy label ī contains n_i instances with ground-truth label i and n_j instances with ground-truth label j. If n_j is bigger than n_i, the model will cluster class i into j. Unfortunately, in Clothing1M, most instances with label '5' actually belong to class '3'. Therefore, we merge the two classes and denote the fixed dataset by Clothing1M*, which contains 13 classes. For all the datasets, we leave out 10% of the training examples as a validation set for model selection.

Classification Accuracy on News20.

Classification Accuracy on Clothing1M*.

Classification Accuracy on clean datasets.

and $A \le 0$. Therefore $S_{noise} - C_{noise} \le 0$, and equality holds if and only if the noise rate is 0 or every instance has the same noisy class label (i.e., there is one column of $T_c$ whose elements are all 1 and the remaining elements of $T_c$ are 0). These two extreme situations are not considered in this paper. In other words, the noise rate of the noisy similarity labels is lower than that of the noisy class labels. Theorem 2 is proved.
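Theorem 2 can be sanity-checked numerically: for a random diagonally-dominant class transition matrix with c ≥ 8, the similarity noise rate computed from the pair counts falls below the class noise rate. This checks random instances only and is of course no substitute for the proof:

```python
import numpy as np

rng = np.random.default_rng(2)
c = 10                                    # c >= 8, as Theorem 2 requires
# Random diagonally-dominant row-stochastic matrix (a hypothetical T_c).
T = rng.random((c, c)) + c * np.eye(c)
T /= T.sum(axis=1, keepdims=True)

# Class-label noise rate: average off-diagonal mass per row.
C_noise = (T.sum() - np.trace(T)) / c

# Similarity-label noise rate from the pair counts (the n^2 factors cancel).
col = T.sum(axis=0)
N11 = np.sum(T ** 2)            # clean same, noisy same
N10 = c - N11                   # clean same, noisy different
N01 = np.sum(col ** 2) - N11    # clean different, noisy same
S_noise = (N01 + N10) / c ** 2  # total number of pairs is c^2 (per n^2)
```

On such matrices `S_noise` is strictly smaller than `C_noise`, matching the theorem.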

C.1.1 PROOF OF $T_{s,11} > T_{s,01}$

Means and Standard Deviations (Percentage) of Classification Accuracy over 5 trials on MNIST, CIFAR10 and CIFAR100 with asymmetric noise, whose noise rate is about 0.3.

D DEFINITION OF NOISE SETTINGS

The symmetric noise setting, Sym-ρ, is defined as follows, where C is the number of classes.

From Figure 5, overall, we can see that Class2Simi (Class2Simi TrueT) achieves the best performance whether the class transition matrix T_c is given or estimated. In most cases, Class2Simi with an estimated T_c even outperforms the baselines given the ground-truth class noise transition matrix, due to the lower noise rate and the similarity transition matrix being robust to noise. Specifically, on MNIST, as the noise rate increases from Sym-0.1 to Sym-0.5, Class2Simi TrueT maintains a remarkable accuracy above 99.20%, while the accuracies of Class2Simi and Forward TrueT decrease steadily; the accuracy of Forward, however, drops significantly. On CIFAR10, the varying tendencies of the four curves are similar to those on MNIST, except that the decreases are more dramatic and even Class2Simi TrueT drops slightly at Sym-0.5. On CIFAR100, there is an obvious decrease in the accuracy of all methods, and our method achieves the best results across all noise rates; e.g., at Sym-0.5, Class2Simi gives an accuracy uplift of about 8.0% compared with Forward.

where we use the Cauchy-Schwarz inequality (Steele, 2004) in the second step. Further, we have

Thus the learnability of the pointwise classification implies the learnability of the reduced pairwise classification. In this case, the reduced pairwise classification is learnable while the original pointwise classification is not. Thus the learnability of the reduced pairwise classification does not imply the learnability of the pointwise classification.
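As a concrete illustration, a Sym-ρ transition matrix under the common convention in the label-noise literature (keep the true label with probability 1 − ρ, otherwise flip uniformly to one of the other C − 1 classes; this convention is an assumption here, not quoted from the paper) can be built as:

```python
import numpy as np

def sym_noise_matrix(C, rho):
    """Symmetric (Sym-rho) class transition matrix under the common
    convention: keep the true label w.p. 1 - rho, otherwise flip
    uniformly to one of the other C - 1 classes."""
    T = np.full((C, C), rho / (C - 1))
    np.fill_diagonal(T, 1 - rho)
    return T

T = sym_noise_matrix(10, 0.5)   # the Sym-0.5 setting used in the experiments
```

Each row of the resulting matrix sums to 1, and the diagonal entries equal 1 − ρ.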

