CLASS2SIMI: A NEW PERSPECTIVE ON LEARNING WITH LABEL NOISE

Abstract

Label noise is ubiquitous in the era of big data. Deep learning algorithms can easily fit the noise and thus cannot generalize well without properly modeling the noise. In this paper, we propose a new perspective on dealing with label noise called "Class2Simi". Specifically, we transform training examples with noisy class labels into pairs of examples with noisy similarity labels, and propose a deep learning framework to learn robust classifiers from the noisy similarity labels. Note that a class label shows the class an instance belongs to, while a similarity label indicates whether or not two instances belong to the same class. The transformation is worthwhile: we prove that the noise rate for the noisy similarity labels is lower than that for the noisy class labels, because similarity labels are themselves robust to noise. For example, given two instances, even if both of their class labels are incorrect, their similarity label could still be correct. Due to the lower noise rate, Class2Simi achieves remarkably better classification accuracy than baselines that directly deal with the noisy class labels.

1. INTRODUCTION

It is expensive to label large-scale data accurately. Therefore, cheap datasets with label noise are ubiquitous in the era of big data. However, label noise degenerates the performance of trained deep models, because deep networks easily overfit label noise (Zhang et al., 2017; Zhong et al., 2019; Li et al., 2019; Yi & Wu, 2019; Zhang et al., 2019; 2018; Xia et al., 2019; 2020). In this paper, we propose a new perspective on handling label noise called "Class2Simi", i.e., transforming training examples with noisy class labels into pairs of examples with noisy similarity labels. A class label shows the class that an instance belongs to, while a similarity label indicates whether or not two instances belong to the same class. This transformation is motivated by the observation that the noise rate becomes lower; e.g., even if two instances have incorrect class labels, their similarity label could still be correct. In the label-noise learning community, a lower noise rate usually results in higher classification performance (Han et al., 2018b; Patrini et al., 2017).

We illustrate the transformation and the robustness of similarity labels in Figure 1. Assume we have eight noisy examples {(x1, ȳ1), . . . , (x8, ȳ8)}, as shown in the upper part of the middle column. Their labels come from four classes, i.e., {1, 2, 3, 4}, and the labels marked in red are incorrect. We transform the eight examples into 8 × 8 example-pairs with noisy similarity labels, as shown in the bottom part of the middle column, where similarity label 1 means the two instances have the same class label and 0 means they have different class labels. The latent clean class labels and similarity labels are presented in the left column. In the middle column, we can see that although the instances x2 and x4 both have incorrect class labels, the similarity label of the example-pair (x2, x4) is correct.
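The transformation itself is mechanical. As a minimal sketch (the function name and the example labels are ours, not from the paper; the figure's actual labels are not reproduced in the text):

```python
def class_to_simi(noisy_labels):
    """Class2Simi transformation: noisy class labels -> pairwise noisy similarity labels.

    A pair (i, j) gets similarity label 1 if the two (possibly incorrect)
    class labels agree, and 0 otherwise.
    """
    n = len(noisy_labels)
    return {(i, j): int(noisy_labels[i] == noisy_labels[j])
            for i in range(n) for j in range(n)}

# Hypothetical noisy labels for 8 examples over 4 classes, as in Figure 1's setting.
simi = class_to_simi([1, 2, 2, 3, 4, 1, 3, 2])
print(len(simi))  # 64 pairwise labels (8 x 8)
```

Even when individual class labels are wrong, a pair of identically mislabeled instances still receives the correct similarity label, which is the source of the robustness discussed above.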
Similarity labels are robust because they additionally encode the pairwise relationship between instances. We prove that the noise rate of the noisy similarity labels is lower than that of the noisy class labels. For example, if we assume that the noisy class labels in Figure 1 are generated from the latent clean labels according to the transition matrix shown in the upper part of the right column (the ij-th entry of the matrix denotes the probability that the clean class label i flips into the noisy class label j), the noise rate for the noisy class labels is 0.5, while that for the corresponding noisy similarity labels is only 0.25. Here, the noise rate is the ratio of the number of incorrect labels to the total number of examples, which can be calculated from the noise transition matrix combined with the proportion of each class, i.e., 1/6 × 3/4 + 1/2 × 1/4 = 0.25. Note that the noise transition matrix for similarity labels can itself be computed from the class noise transition matrix, as shown in Theorem 1. Admittedly, Class2Simi incurs an information loss, because class labels cannot be recovered from similarity labels. However, since similarity labels are more robust to noise than class labels, the advantage of the reduced noise rate outweighs the disadvantage of the lost information. Intuitively, in the learning process it is the signal in the labels that improves the model, while the noise is harmful. Although Class2Simi reduces the total amount of information, it increases the signal-to-noise ratio, and hence the amount of useful signal.
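The arithmetic above can be checked mechanically. As an illustration under a stated assumption: the pairwise-flip matrix below (classes 1↔2 and 3↔4, each flipped with probability 0.5) is one transition matrix consistent with the rates quoted in the text; the figure's exact matrix is not reproduced here.

```python
# Hypothetical 4-class noise transition matrix T, with T[i][j] = P(noisy j | clean i).
T = [[0.5, 0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.5, 0.5]]
priors = [0.25] * 4  # uniform class proportions
K = len(T)

# Class-label noise rate: probability mass flipped off the diagonal.
class_noise_rate = sum(priors[y] * (1.0 - T[y][y]) for y in range(K))

# A clean-similar pair (both from class y) gets a wrong similarity label when the
# two noisy class labels disagree: 1 - sum_k T[y][k]^2.
# A clean-dissimilar pair (classes a != b) gets a wrong similarity label when the
# noisy class labels coincide: sum_k T[a][k] * T[b][k].
simi_noise_rate = (
    sum(priors[y] ** 2 * (1.0 - sum(t * t for t in T[y])) for y in range(K))
    + sum(priors[a] * priors[b] * sum(T[a][k] * T[b][k] for k in range(K))
          for a in range(K) for b in range(K) if a != b)
)

print(class_noise_rate)  # 0.5
print(simi_noise_rate)   # 0.25 = 1/6 * 3/4 + 1/2 * 1/4
```

Under this matrix, dissimilar pairs (3/4 of all pairs under uniform priors) are flipped at an average rate of 1/6 and similar pairs (1/4 of all pairs) at 1/2, recovering the 0.25 stated in the text.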
Thus, we can benefit from the transformation and achieve better performance. Theorem 2 and the experimental results verify the effectiveness of this transformation.

It remains to be shown how to learn a robust classifier from the data with transformed noisy similarity labels. To solve this problem, we first estimate the similarity noise transition matrix, a 2 × 2 matrix whose entries denote the flip rates of similarity labels. This transition matrix bridges the noisy similarity posterior and the clean similarity posterior. The noisy similarity posterior can be learned from the data with noisy similarity labels. Given the similarity noise transition matrix, we can then infer the clean similarity posterior from the noisy similarity posterior. Since the clean similarity posterior is approximated by the inner product of the clean class posteriors (Hsu et al., 2019), the clean class posterior (and thus the robust classifier) can thereby be learned. We will empirically show that Class2Simi with the estimated similarity noise transition matrix remarkably outperforms the baselines even when they are given the ground-truth class noise transition matrix.

The contributions of this paper are summarized as follows:
• We propose a new perspective on learning with label noise, which transforms class labels into similarity labels. Such a transformation reduces the noise level.
• We provide a way to estimate the similarity noise transition matrix by theoretically establishing its relation to the class noise transition matrix. We show that even if the class noise transition matrix is inaccurately estimated, the induced similarity noise transition matrix still works well.
• We design a deep learning method to learn robust classifiers from data with noisy similarity labels and theoretically analyze its generalization ability.
• We empirically demonstrate that the proposed method remarkably surpasses the baselines on many datasets with both synthetic and real-world noise.
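The pipeline described above — approximate the clean similarity posterior by the inner product of class posteriors, then push it through the 2 × 2 similarity noise transition matrix to match the noisy labels — can be sketched as a per-pair forward-corrected loss. This is a minimal sketch with hypothetical transition-matrix values, not the paper's exact implementation:

```python
import math

# Hypothetical 2x2 similarity noise transition matrix:
# T_s[s][s_noisy] = P(noisy similarity s_noisy | clean similarity s).
T_s = [[0.9, 0.1],   # clean dissimilar (s = 0)
       [0.2, 0.8]]   # clean similar   (s = 1)

def forward_corrected_loss(p_i, p_j, noisy_s):
    """Forward-corrected cross-entropy for one example-pair.

    p_i, p_j : softmax class-posterior vectors for the two instances
    noisy_s  : observed noisy similarity label (0 or 1)
    """
    # Clean similarity posterior, approximated by the inner product of the
    # class posteriors (Hsu et al., 2019).
    p_sim = sum(a * b for a, b in zip(p_i, p_j))
    # Push the clean posterior through T_s to get the noisy similarity posterior.
    p_noisy_sim = (1.0 - p_sim) * T_s[0][1] + p_sim * T_s[1][1]
    # Cross-entropy against the observed noisy similarity label.
    p = p_noisy_sim if noisy_s == 1 else 1.0 - p_noisy_sim
    return -math.log(max(p, 1e-12))

# Two confident, identical class posteriors form a clean-similar pair:
loss = forward_corrected_loss([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], 1)
```

Minimizing this loss over pairs drives the class posteriors, not just the similarity predictions, toward the clean distribution, which is why a classifier can be recovered despite training only on pairwise labels.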
The rest of this paper is organized as follows. In Section 2, we formalize the noisy multi-class classification problem, and in Section 3, we propose the Class2Simi strategy and its practical implementation. Experimental results are discussed in Section 4. We conclude the paper in Section 5.



Figure 1: Illustration of the transformation from class labels to similarity labels. Note that ȳ stands for the noisy class label and y for the latent clean class label. The labels marked in red are incorrect labels. If we assume the class label noise is generated according to the noise transition matrix presented in the upper part of the right column, it can be calculated that the noise rate for the noisy class labels is 0.5, while the rate for the noisy similarity labels is 0.25. Note that the noise transition matrix for similarity labels can be calculated by exploiting the class noise transition matrix as in Theorem 1.

