AUTOCLEANSING: UNBIASED ESTIMATION OF DEEP LEARNING WITH MISLABELED DATA

Abstract

Mislabeled samples cause prediction errors. This study proposes a solution to the problem of incorrect labels, called AutoCleansing, which automatically captures the effect of incorrect labels and mitigates it without removing the mislabeled samples. AutoCleansing consists of a base network model and sample-category specific constants. The parameters of the base model and the sample-category constants are estimated simultaneously using the training data. Thereafter, predictions for test data are made using the base model without the constants that capture the mislabeled effects. A theoretical model for AutoCleansing is developed, showing that the gradient of the loss function of the proposed method can be zero at the true parameters with mislabeled data if the model is correctly constructed. Experimental results show that AutoCleansing achieves better test accuracy than previous studies on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.

1. INTRODUCTION

The prediction performance of supervised machine learning depends on the quality of the training data. For classification tasks, the dataset is assumed to have a correct label for each object. However, real-world datasets may contain mislabeled samples. For instance, Pleiss et al. (2020) analyzed incorrect labels in the CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009). They reported that mislabeled samples account for 3% of the CIFAR-10 and 13% of the CIFAR-100 datasets. Figure 1 shows typical examples of incorrect labels in the CIFAR-10 dataset, which consists of 60,000 images, each assigned one of 10 category classes. In this figure, the original label of #1 is DOG; however, it appears to be an image of CAT. As the category set of CIFAR-10 includes both DOG and CAT, #1 is an example of an incorrect label within the category set. Image #2 has TRUCK as the original label; however, it shows an image of PERSON, which does not belong to the category set. Thus, it is an example of an incorrect label outside the category set. Image #3 contains two objects but has only the single label DEER; it is an example of an incorrect label with multiple objects. Incorrect labels in the training dataset may cause prediction errors. The most intuitive way to address the problem is to remove mislabeled samples from the training dataset. However, identifying mislabeled samples requires measuring the correctness of labels and defining a threshold that determines whether a label is correct. Deleting excess data may reduce the efficiency of estimation by decreasing the sample size, and finding an optimal threshold requires several runs of learning with mislabeled samples removed at different thresholds.
This study proposes an alternative solution to the problem of incorrect labels, called AutoCleansing, which automatically captures the effect of incorrect labels and mitigates it without removing mislabeled samples. AutoCleansing consists of a base network model and sample-category specific constants. The parameters of the base model and the sample-category constants are estimated simultaneously using the training data. Thereafter, predictions for test data are made using the base model without the constants that capture the mislabeled effects. As shown in the theoretical analysis section, the proposed AutoCleansing can address the prediction errors due to incorrect labels within the category set, outside the category set, and with multiple objects. AutoCleansing can use any network model as the base model with any augmentation method. For example, the experimental section presents the estimation results of AutoCleansing with the base models ResNet (He et al., 2016), WideResNet (Zagoruyko & Komodakis, 2016), Shake-Shake (Gastaldi, 2017), and PyramidNet+ShakeDrop (Yamada et al., 2018) using AutoAugment (Cubuk et al., 2018). Furthermore, AutoCleansing does not require iterative runs to identify the incorrect labels because their effects are captured automatically in a single run by the sample-category specific constants. The contributions of this study can be summarized as follows:

• It provides a theoretical model for AutoCleansing. Incorrect labels in the training data cause prediction errors; the proposed method captures the biased effects of incorrect labels automatically and addresses the resulting prediction error.

• The proposed method can be implemented with any network model or augmentation method. This study presents experiments of AutoCleansing with ResNet, WideResNet, Shake-Shake, and PyramidNet+ShakeDrop using AutoAugment.
• Experimental results show that the proposed AutoCleansing method improves the validation accuracy on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.

• The additional training cost of AutoCleansing relative to the base network model is negligible. AutoCleansing removes the biased effect of incorrect labels automatically in a single run of learning. For example, the additional learning time of AutoCleansing on the CIFAR-10/100 datasets is only 0.5% of that of the base network models.

2. RELATED WORKS

There are several studies on noisy datasets in the machine learning literature. Frénay & Verleysen (2014) provide a comprehensive review of label noise in classification, and Algan & Ulusoy (2019) provide a complete overview of deep learning with noisy datasets. There are three approaches to dealing with mislabeled datasets: (1) robust learning with label noise, (2) identification of mislabeled data, and (3) utilization of a small dataset without incorrect labels. For robust learning with label noise, after the early works of Reed et al. (2015) and Azadi et al. (2015), several algorithms using deep neural networks have been proposed, including the S-model (Goldberger & Ben-Reuven, 2016), MentorNet (Jiang et al., 2018), decoupling (Malach & Shalev-Shwartz, 2017), F-correction (Patrini et al., 2017), Open-set (Wang et al., 2018), Bi-level-model (Jenni & Favaro, 2018), Lq (Zhang & Sabuncu, 2018), co-teaching (Han et al., 2018), random reweighting (Ren et al., 2018), joint optimization (Tanaka et al., 2018), DAC (Thulasidasan et al., 2019), SELF (Nguyen et al., 2020), dynamic bootstrapping (Arazo et al., 2019), and DivideMix (Li et al., 2020). Goldberger & Ben-Reuven (2016) and Patrini et al. (2017) estimated a noise-transition matrix to correct the loss function; however, it is difficult to estimate the transition matrix correctly. Jiang et al. (2018) and Ren et al. (2018) proposed sample weighting to adapt to noisy samples; however, estimating the correct sample weights is also challenging. Arazo et al. (2019) fitted a beta mixture model to the per-sample cross-entropy loss to model the label noise; their approach showed outstanding performance for high-level noise. For linear regression, a consistent robust regression was proposed for corrupted data (Bhatia et al., 2017); however, the consistency of robust learning for nonlinear estimation with classification models is not clear.
These studies of robust learning with label noise used datasets to which synthetic label noise was added. Label noise is generated by replacing one label with another at a given probability within a category set. These studies showed good performance for artificially generated label noise; however, most did not find mislabeled samples in real-world datasets. In contrast, the approach proposed in this study considers incorrect labels within and outside the category set as well as multiple objects in real-world datasets. For the identification of mislabeled data, some studies found incorrect labels in well-known datasets for deep learning. For instance, Ekambaram et al. (2017) studied the identification of mislabeled samples, and Pleiss et al. (2020) identified incorrect labels using the area under the margin (AUM) statistic. They showed that incorrect labels account for 3% of CIFAR-10, 13% of CIFAR-100, and 24% of Tiny ImageNet. On CIFAR-100, they reported that removing 13% of the data leads to a 1.2% drop in error. Identifying mislabeled data requires criteria to determine whether a label is correct or incorrect, and several runs of learning may be required to search for optimal criteria. The method proposed in this study, in contrast, captures and removes the effects of incorrect labels in a single run of learning. For the utilization of a small dataset without incorrect labels, it is assumed that a small set of clean data, namely data free of mislabeled samples, is available. Sukhbaatar et al. (2015) and Hendrycks et al. (2018) used clean data to estimate the noise-transition matrix for incorrect labels. Other studies on the utilization of clean data include Ren et al. (2018), Li et al. (2017), and Zhang et al. (2019b). However, it is difficult to find a small set of clean data in real-world datasets, whereas the method proposed in this study does not require clean data.

3. AUTOCLEANSING

Consider a classification task with K categories for the N training data points X = {x_1, x_2, ..., x_N} and labels Y = {y_1, y_2, ..., y_N}. Let the learning network model be M(x, θ) = {m_1(x, θ), ..., m_K(x, θ)}, which maps an input x to an output m with a given parameter θ. The predicted probability of output y given the input x using this model is assumed to be calculated by the following softmax function:

P(y = i \mid x) = \frac{e^{m_i}}{\sum_{j \in K} e^{m_j}}    (1)

where m_i = m_i(x, θ) denotes the i-th element of the output and K = {1, ..., K} the category set. If the training data has mislabeled samples, the estimated parameter might be biased, which can cause prediction errors. To address the problem of incorrect labels, consider the cleansing network model C(x, θ, α) = M(x, θ) + α, where α ∈ R^{N×K} is a parameter of sample-category specific constants; let α_k be the constant for the sample with input x and category k. The prediction probability P_C with the constants can then be expressed as follows:

P_C(y = i \mid x) = \frac{e^{m_i + \alpha_i}}{\sum_{j \in K} e^{m_j + \alpha_j}}    (2)

If the effect of the mislabeled samples is captured by the sample-category specific constants α of the cleansing network model C(x, θ, α) = M(x, θ) + α, the base network model M(x, θ) can avoid the bias due to incorrect labels. The learning process with AutoCleansing is as follows: (1) learn on the training data using the cleansing network model C(x, θ, α) = M(x, θ) + α; (2) delete the sample-category specific constants α̂ estimated in the learning process; and (3) test on the validation data using the cleansed network model C(x, θ̂, α̂) − α̂ = M(x, θ̂), where θ̂ denotes the estimated parameter of the base model.
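As a concrete illustration, equations (1) and (2) and the train/test asymmetry can be sketched in a few lines of plain Python. The logit and constant values below are hypothetical; at training time the constants α are added to the base-model logits, and at test time they are dropped:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical base-model outputs m(x, theta) for one training sample (K = 3).
m = [0.1, 0.1, 0.8]
# Hypothetical sample-category specific constants alpha for this sample.
alpha = [0.7, 0.0, -0.7]

# Training-time probability P_C uses the cleansing model C = M + alpha, Eq. (2).
p_train = softmax([mi + ai for mi, ai in zip(m, alpha)])
# Test-time probability uses the base model only, Eq. (1): alpha is deleted.
p_test = softmax(m)

print(p_train)  # highest probability on the (possibly mislabeled) observed class 1
print(p_test)   # highest probability on the class favored by the base model, class 3
```

Here the constants shift the training-time prediction toward the observed (possibly incorrect) label, while the cleansed test-time prediction follows the base model.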

3.1. THEORETICAL ANALYSIS OF AUTOCLEANSING

To investigate the performance of the proposed AutoCleansing for mislabeled data, we consider the following definition of a general case:

Definition 1 (Incorrect labels within and outside the category set). Let K* be the full set of true categories and K ⊆ K* be the category set of the model. Let π(ŷ | y*, x) be the probability that the input x with the true category y* ∈ K* has label ŷ ∈ K. Let P(y* | x, K*, θ*) be the true probability of the category y* from the category set K* with input x given the true parameter θ*. Then the observed probability of category ŷ ∈ K for the input x with incorrect labels within and outside the category set is defined as follows:

Q(\hat{y} \mid x, K^*, \theta^*) = \sum_{y^* \in K^*} \pi(\hat{y} \mid y^*, x) \, P(y^* \mid x, K^*, \theta^*)

This definition includes incorrect labels within the category set, outside the category set, and with multiple objects. If K* = K, the definition is equivalent to incorrect labels within the category set. If incorrect labels occur outside the category set, namely π(ŷ | y*, x) = 0 for all ŷ ≠ y* ∈ K and π(ŷ | y*, x) ≥ 0 for all y* ∈ K* \ K, it defines the observed probability outside the category set. Let S be the set of combinations of categories in multiple objects (for example, DEER and PERSON) and K+ = K ∪ S be the true category set including the original categories and the combinations of categories in multiple objects. If K* = K+, Definition 1 describes multiple objects. If the sample data has incorrect labels within the category set, incorrect labels outside the category set, or multiple objects, the parameter estimated by minimizing the loss function does not converge to the true parameter; thus, the incorrect labels cause prediction errors. However, the following theorem shows that AutoCleansing can address the biased estimation due to mislabeled data:

Theorem 1 (AutoCleansing for incorrect labels within and outside the category set). Let K* be the full set of true categories and K be the category set of the model.
Let π(ŷ | y*, x) be the probability that the input x with the true category y* ∈ K* has label ŷ ∈ K. Assume that the sample has incorrect labels within and outside the category set as defined in Definition 1. Let L_C be the expected loss function of AutoCleansing and θ+ be the set of solutions to ∂L_C/∂θ = 0. Furthermore, assume that the model is correctly constructed and that the probability distribution of the output is the softmax function. Then the gradient of the expected loss function with AutoCleansing is zero at the true parameter value θ*; namely, θ* ∈ θ+.

The proof can be found in Appendix A.1. Note that, although this theorem states that stochastic gradient descent using the loss function of the correct model with AutoCleansing can stop at the true values, it does not guarantee that minimization of the loss function converges to the true values if the loss function has more than one local minimum. Theorem 1 shows that the sample-category specific constants α can capture the biased effect of incorrect labels. This suggests that the values of α may reflect the effects of incorrect labels. The following theorem confirms this:

Theorem 2 (Sample-category specific constants and incorrect labels). Assume that the sample has incorrect labels within and outside the category set as defined in Definition 1. Consider that the true label t is assigned a false label f. Let α̂_t and α̂_f be the sample-category specific constants of the true and false labels, respectively, estimated by minimizing the loss function with AutoCleansing. Assume that the probability of observing the false label is greater than or equal to that of the true label, namely Q(f | x, K*, θ*) ≥ Q(t | x, K*, θ*). Furthermore, assume that the model is correctly constructed, the probability distribution of the output is the softmax function, and the loss function is minimized at the true parameter values.
Then, as N → ∞, the sample-category specific constants α̂ estimated by minimizing the loss function with AutoCleansing have the following properties:

1. General case: the sample-category specific constant of the true label is equal to or less than that of the false label: α̂_t ≤ α̂_f.

2. Symmetric case: assume π(f | t, x) is symmetric and independent of x, namely π(f | t, x) = π(f | t) = π(t | f). In this case, the sample-category specific constant of the true label is equal to or less than zero: α̂_t ≤ 0.

3. Single symmetric case: assume π(f | t, x) is symmetric, independent of x, and incorrect labels occur between t and f only; consequently, π(f | t, x) = π(f | t) = π(t | f) and π(j | j, x) = 1 for all j ≠ t, f. In this case, the sample-category specific constants of labels other than the true and false labels are equal to zero: α̂_j = 0 for all j ≠ t, f.

Appendix A.2 provides the proof of this theorem. Table 1 provides numerical examples of the theorem. Assume that the category set has three categories K = {1, 2, 3}. Example (A) is an incorrect label within the category set. Assume the true model output is {m*_1, m*_2, m*_3} = {0.1, 0.1, 0.8}; that is, the correct category is Category 3, which has the highest output value. If the first label is observed, the incorrect label causes biased learning: {c_1, c_2, c_3} = {0.8, 0.1, 0.1}. The estimated sample-category specific constants are assumed to be optimal such that α = c − m*. The constant of the true label (α_3 = −0.7) is less than that of the false label (α_1 = 0.7), so that the output value of the biased model (c = m* + α) for the observed category (0.8) is larger than that for the true category (0.1). Example (B) is an incorrect label outside the category set. Consider that the true category set is K* = {1, 2, 3, 4} and assume the true category is Category 4, outside the observed category set. The optimal constant of the observed label (α_1 = 0.7) has the largest value.
Note that the constant of the true category (α_4 = −0.7) is not estimated because this category is outside the category set. Example (C) is the multiple-objects case. Assume the observed label is 1, whereas the true categories are 1 and 3. The constant of the unobserved true category (α_3 = −0.7) is less than that of the observed true label (α_1 = 0.0).
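Example (A) from Table 1 can be checked with a few lines of plain Python, using the values stated above: the optimal constants satisfy c = m* + α, and the resulting α obeys the general-case property of Theorem 2 (true-label constant no larger than false-label constant):

```python
# Example (A): incorrect label within the category set (values from the text).
m_star = [0.1, 0.1, 0.8]   # true model output; correct category is 3 (index 2)
c      = [0.8, 0.1, 0.1]   # biased output when the first label is observed

# Optimal sample-category specific constants satisfy c = m* + alpha.
alpha = [ci - mi for ci, mi in zip(c, m_star)]
print([round(a, 1) for a in alpha])  # [0.7, 0.0, -0.7]

# Theorem 2, general case: constant of the true label <= constant of the false label.
assert alpha[2] <= alpha[0]
```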

4. EXPERIMENTS

This section presents experiments investigating the performance of the proposed AutoCleansing on the CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), and ImageNet (Russakovsky et al., 2015) datasets. The proposed AutoCleansing method requires base network models. In these experiments, the base network models are Wide-ResNet 40-2 and Wide-ResNet 28-10 (Zagoruyko & Komodakis, 2016); Shake-Shake 26 2x32d, 26 2x96d, and 26 2x112d (Gastaldi, 2017); and PyramidNet+ShakeDrop (Yamada et al., 2018). All hyperparameters of the base network models are the same as those used in AutoAugment (Cubuk et al., 2018), FastAutoAugment (Lim et al., 2019), and PBA (Ho et al., 2019). A cosine learning-rate decay with one annealing cycle was applied to all models except ResNet. For AutoCleansing, the sample-category specific constants could not converge without regularization; the weight decay of the sample-category specific constant is 5×10^{-5}, except for PyramidNet, which uses 1×10^{-5}. AutoCleansing has learning parameters of the sample-category specific constant α consisting of K variables for each sample. Note that not all K variables can be identified; therefore, the constants for the first category are set to zero for all samples (α_1 = 0). Thus, the estimated sample-category specific constant α ∈ R^{N×K} has N(K−1) estimable parameters. In this study, AutoCleansing with the sample-category specific constant of N(K−1) parameters is called AC1. However, it might be difficult to estimate all parameters of α for large datasets. For example, ImageNet has more than 1.2 million images with 1,000 categories in the training data; therefore, AC1 needs to estimate more than 1.2 billion parameters. Instead, consider the sample-specific constant α ∈ R^N such that all categories except the observed label are set to zero for all samples (α_j = 0 ∀j ≠ ŷ); that is, α has N estimable parameters.
AutoCleansing with the sample-specific constant of N parameters is called AC2. Note that AC2 corresponds to the single symmetric case of Theorem 2, for which α_{nj} = 0 ∀n, ∀j ≠ t, f. If the true category is outside the category set, the α_{nt} of the true category cannot be estimated; therefore, all categories except the observed label have zero values of α_{nj}. The experiments compare the results with baseline preprocessing, Cutout (DeVries & Taylor, 2017), AutoAugment (AA), FastAutoAugment (FAA), and Population Based Augmentation (PBA). The baseline preprocessing is conventional augmentation: standardizing the data, horizontal flipping with 50% probability, zero-padding, and random cropping. The proposed AutoCleansing follows the procedure of AutoAugment, which first applies the baseline preprocessing, then the AutoAugment policy, and finally Cutout.
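The AC2 parameterization described above can be sketched as follows. This is a minimal illustration with hypothetical values: a single per-sample scalar is added to the logit of the observed label only, and all other categories keep a zero constant:

```python
import math

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ac2_logits(base_logits, observed_label, alpha_n):
    # AC2: one sample-specific constant alpha_n is added to the logit of the
    # observed label; all other categories implicitly have alpha = 0.
    out = list(base_logits)
    out[observed_label] += alpha_n
    return out

# Hypothetical sample: the base model favors class 2, but the observed label is 0.
base = [0.1, 0.1, 0.8]
p_c = softmax(ac2_logits(base, observed_label=0, alpha_n=1.5))
print(max(range(3), key=lambda i: p_c[i]))  # 0: the training-time fit follows the label
```

With N samples this adds only N estimable parameters, in contrast to the N(K−1) parameters of AC1.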

4.1. EXPERIMENTAL RESULTS

The CIFAR-10 dataset has a total of 60,000 images, including 50,000 for the training set and 10,000 for the test set, with 10 categories. Thus, the sample-category specific constant for AC1 has 0.45 million estimable parameters, whereas the sample-specific constant for AC2 has 0.05 million estimable parameters. Table 2 shows the test accuracy for different network models on the CIFAR-10 dataset. For all models, the proposed AutoCleansing with AutoAugment achieves better performance than previous models. The CIFAR-100 dataset also has a total of 60,000 images, including 50,000 for the training set and 10,000 for the test set, with 100 categories. Thus, the sample-category specific constant for AC1 has 4.95 million estimable parameters, whereas the sample-specific constant for AC2 has 0.05 million estimable parameters. Table 3 provides the results for the CIFAR-100 dataset; as with CIFAR-10, the proposed model has better accuracy than previous models. The SVHN dataset has 73,257 digit images in the core training set, 531,131 in the additional training set, and 26,032 in the test set. In this experiment, both the core and additional training sets were used, with 10 categories. Thus, the sample-category specific constant for AC1 has 5.44 million estimable parameters, whereas the sample-specific constant for AC2 has 0.53 million estimable parameters. Table 4 reports the results for the SVHN dataset; the proposed model again has better accuracy than previous models. The ImageNet dataset has more than 1.2 million images in the training set and 0.15 million images in the validation and test sets, with 1,000 categories. Thus, the sample-category specific constant for AC1 has more than 1.28 billion parameters, which may not be feasible to estimate. Therefore, this experiment uses the sample-specific constant of AC2, which has 1.28 million estimable parameters.
Table 5 shows the results for the ImageNet dataset. The proposed model AC2 has better Top-1 accuracy than previous models, whereas the Top-5 accuracy of AC2 is less than that of AA and FAA. Pleiss et al. (2020) proposed the area under the margin (AUM) statistic for robust learning with label noise. They provided experiments on label noise using real-world datasets and artificially generated noise, and their results showed that the AUM performed better than previous studies on datasets with synthetic label noise added. Table 6 shows the test accuracy of AutoCleansing and AUM using the ResNet-32 model; all hyperparameters of the base network models were the same as those used by Pleiss et al. (2020). AutoCleansing outperforms the AUM on both CIFAR-10 and CIFAR-100.

4.2. DETECTION OF INCORRECT LABELS USING AUTOCLEANSING

As shown in the theoretical analysis section, AutoCleansing can capture the effect of incorrect labels using the sample-category specific constants α. If there is no mislabeled sample, α̂ → 0 as N → ∞; a large value of |α̂| indicates the existence of mislabeled samples in the data. Therefore, it might be possible to identify incorrect labels using the estimated sample-category specific constants α̂. Image #1 in Figure 4 is an example of an incorrect label within the category set: the original label of #1 is DOG, and the alternative label for this image is CAT. The estimated α̂ of the original label of #1 is 0.233, whereas α̂ of the alternative label is −0.227. Similarly, for images #2–#5 in this figure, the original labels have positive α̂, whereas the alternative labels have negative α̂. Image #6 in Figure 5 shows an example of an incorrect label outside the category set: the original label of #6 is TRUCK, whereas the correct label might be PERSON, which does not belong to the category set of the CIFAR-10 dataset. For the examples #6–#10, the estimated α̂ of the original labels are positive. Image #11 in Figure 6 presents an example of multiple objects: the original label of #11 is DEER; however, this image includes an additional object, PERSON, which does not belong to the category set of the CIFAR-10 dataset. Images #12–#15 are examples of multiple objects in the CIFAR-100 dataset. For example, the original label of #12 is PLANE; however, this image also contains the object SEA, and both PLANE and SEA belong to the category set. For these images, the α̂ of the original labels are positive and those of the additional labels are negative. Note that the values of MaxRank or MinRank are less than 0.1% for all images in these figures. This suggests that mislabeled samples can be identified using the high or low values of the sample-category specific constants α̂ estimated by AutoCleansing.
To identify the mislabeled samples, we must specify threshold criteria separating correct and incorrect labels. Let τ be the percentage of mislabeled samples in the dataset. Algorithm 2 in Appendix A.4 provides the procedure for searching for mislabeled samples given τ using AutoCleansing. After searching for the mislabeled samples, we can remove them from the sample and run the learning models using the trimmed data. Figure 3 shows the test accuracies of the base model using trimmed data with different threshold criteria for incorrect labels on the CIFAR-10 and CIFAR-100 datasets. The base model is Wide-ResNet 40-2 with AutoAugment. As can be observed, the test accuracies are highest when 0.2% of incorrect labels are dropped in both datasets. This figure shows that dropping mislabeled samples at an appropriate drop rate can improve the classification accuracy. However, it is necessary to repeat learning several times with different criteria to determine the optimum drop rate: if too few samples are dropped, the effects of bias due to incorrect labels might remain, whereas if too many samples are removed, the estimation efficiency could be reduced because of the decreasing sample size. Notably, the maximum test accuracies using trimmed data are very close to those of AutoCleansing in this figure. This suggests that AutoCleansing can remove the biased effects of incorrect labels without dropping the mislabeled samples from the datasets. Furthermore, AutoCleansing does not need repeated learning because it does not require threshold criteria for the drop rate of mislabeled samples. Instead of dropping the mislabeled samples, which requires threshold criteria for incorrect labels, AutoCleansing drops the sample-category specific constants that capture the mislabeled bias.
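A minimal sketch of such a search (an illustrative simplification, not the exact Algorithm 2 of Appendix A.4): rank samples by the estimated constant on their observed label, which tends to be positive for mislabeled samples as shown above, and flag the top τ fraction:

```python
def flag_mislabeled(alpha_observed, tau):
    """Flag the fraction tau of samples whose estimated constant on the
    observed label is largest. A large positive alpha suggests the observed
    label disagrees with the base model's prediction.

    alpha_observed: per-sample constants on the observed label (hypothetical).
    tau: fraction of samples to flag, e.g. 0.002 for a 0.2% drop rate.
    """
    n_flag = max(1, int(round(tau * len(alpha_observed))))
    order = sorted(range(len(alpha_observed)),
                   key=lambda i: alpha_observed[i], reverse=True)
    return set(order[:n_flag])

# Hypothetical constants for six samples; sample 3 stands out as mislabeled.
alphas = [0.01, -0.02, 0.00, 0.92, 0.03, -0.01]
print(flag_mislabeled(alphas, tau=1/6))  # {3}
```

The flagged indices could then be removed before retraining the base model on the trimmed data.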

5. CONCLUSION

This study introduces AutoCleansing to address the bias problem due to incorrect labels. The proposed method is appealing in that it automatically captures the effect of incorrect labels and mitigates it without removing mislabeled samples. As shown in the theoretical analysis, if the model is correctly constructed, the gradient of the expected loss function of AutoCleansing is equal to zero at the true parameter values with mislabeled samples, for incorrect labels within or outside the category set as well as multiple objects. Furthermore, AutoCleansing can be implemented with any network model and any augmentation method. Experimental results show that the proposed AutoCleansing performs better than previous studies on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Additional topics for future investigation into AutoCleansing include applications to artificial label noise (Algan & Ulusoy, 2019), the use of other network models such as EfficientNet (Tan & Le, 2019), and the use of recent augmentation methods such as adversarial AutoAugment (Zhang et al., 2019a).

A APPENDIX

A.1 PROOF OF THEOREM 1

Consider the following loss function:

L(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i \in K} 1[y_n = i] \ln P(y_n = i \mid x_n, \theta)

where 1[y_n = i] is an indicator function equal to one when y_n = i and zero otherwise. Let \bar{L}(θ) be the probability limit of the loss function L(θ). Consider that the sample has incorrect labels within and outside the category set as defined in Definition 1. If the model is correctly constructed, by the strong law of large numbers, as N → ∞,

L(\theta) \xrightarrow{a.s.} \bar{L}(\theta) = -E_{x \sim p(X)} \sum_{i \in K} Q(i \mid x, K^*, \theta^*) \ln P(i \mid x, K, \theta)
= -E_{x \sim p(X)} \sum_{i \in K} \sum_{k \in K^*} \pi(i \mid k, x) \, P(k \mid x, K^*, \theta^*) \ln P(i \mid x, K, \theta).

The derivative of \bar{L}(θ) is

\frac{\partial \bar{L}(\theta)}{\partial \theta} = -E_{x \sim p(X)} \sum_{i \in K} \sum_{k \in K^*} \frac{\pi(i \mid k, x) \, P(k \mid x, K^*, \theta^*)}{P(i \mid x, K, \theta)} \frac{\partial P(i \mid x, K, \theta)}{\partial \theta}.

If the distribution function is the softmax function (1), the derivative of \bar{L}(θ) can be expressed as

\frac{\partial \bar{L}(\theta)}{\partial \theta} = -E_{x \sim p(X)} \sum_{i \in K} \left[ \frac{\sum_{j \in K} e^{m_j}}{e^{m_i} \sum_{j \in K^*} e^{m^*_j}} \sum_{k \in K^*} \pi(i \mid k, x) \, e^{m^*_k} \right] \frac{\partial P(i \mid x, K, \theta)}{\partial \theta}

where m*_k = m_k(x, θ*) denotes the k-th element of the output of the base model given the true parameter and the true category set. First, consider the case of no incorrect labels, namely π(i | i, x) = 1 for all i ∈ K and K = K*. In this case, the derivative of \bar{L}(θ) is equal to zero at the true parameter θ = θ*:

\frac{\partial \bar{L}(\theta^*)}{\partial \theta} = -E_{x \sim p(X)} \sum_{i \in K} \left[ \frac{\sum_{j \in K} e^{m^*_j}}{\sum_{j \in K^*} e^{m^*_j}} \right] \frac{\partial P(i \mid x, K, \theta^*)}{\partial \theta} = -E_{x \sim p(X)} \left[ \frac{\sum_{j \in K} e^{m^*_j}}{\sum_{j \in K^*} e^{m^*_j}} \right] \frac{\partial \sum_{i \in K} P(i \mid x, K, \theta^*)}{\partial \theta} = 0

since \sum_{i \in K} P(i \mid x, K, \theta^*) = 1. From the identification assumption for the parameter, \bar{L}(θ) ≠ \bar{L}(θ*) for all θ ≠ θ*. Thus, if there are no incorrect labels and the model is correctly constructed, the estimate θ̂ obtained by minimizing the loss function is consistent: θ̂ → θ* as N → ∞. This is the well-known consistency property of the maximum likelihood estimator of the logit model (Amemiya, 1985).
However, in general, the derivative of \bar{L}(θ) is not equal to zero at the true parameter value if the sample has incorrect labels; in that case, the estimate θ̂ obtained by minimizing the loss function may not converge to the true value. Consider estimation using the AutoCleansing model c_j(x, Θ) = m_j(x, θ) + α_j, where Θ = {θ, α}. The probability limit of the loss function for the AutoCleansing model is

\bar{L}_C(\Theta) = -E_{x \sim p(X)} \sum_{i \in K} \sum_{k \in K^*} \pi(i \mid k, x) \, P(k \mid x, K^*, \theta^*) \ln P_C(i \mid x, K, \Theta).

Assume that the probability distribution is the softmax function (2), and let θ+ be the set of solutions to ∂\bar{L}_C/∂θ = 0. The derivative of \bar{L}_C(Θ) can be expressed as

\frac{\partial \bar{L}_C(\Theta)}{\partial \theta} = -E_{x \sim p(X)} \sum_{i \in K} \left[ \frac{\sum_{j \in K} e^{m_j + \alpha_j}}{e^{m_i + \alpha_i} \sum_{j \in K^*} e^{m^*_j}} \sum_{k \in K^*} \pi(i \mid k, x) \, e^{m^*_k} \right] \frac{\partial P_C(i \mid x, K, \Theta)}{\partial \theta}
= -E_{x \sim p(X)} \sum_{i \in K} \left[ \frac{\sum_{j \in K} e^{m_j + \alpha_j}}{e^{m_i + \alpha_i} \sum_{j \in K^*} e^{m^*_j}} \, e^{m^*_i + \alpha^*_i} \right] \frac{\partial P_C(i \mid x, K, \Theta)}{\partial \theta}

where \alpha^*_i = \ln\left(\sum_{k \in K^*} \pi(i \mid k, x) \, e^{m^*_k}\right) - m^*_i. At Θ* = {θ*, α*}, the bracketed factor reduces to \sum_{j \in K} e^{m^*_j + \alpha^*_j} / \sum_{j \in K^*} e^{m^*_j}, which does not depend on i, so the derivative is a constant times \partial \sum_{i \in K} P_C(i \mid x, K, \Theta^*)/\partial \theta = 0. Thus, the derivative of \bar{L}_C(Θ) is equal to zero at Θ* = {θ*, α*}; namely, θ* ∈ θ+. □

A.2 PROOF OF THEOREM 2

From the assumption, Q(f | x, K*, θ*) ≥ Q(t | x, K*, θ*). If the model is correctly constructed, the output of the model for the true category is higher than that of any other category; therefore, m*_t ≥ m*_j for all j. From Theorem 1, the optimal constants satisfy \alpha^*_i = \ln\left(\sum_{k \in K^*} \pi(i \mid k, x) \, e^{m^*_k}\right) - m^*_i. For the general case, the difference between the sample-category specific constants of the false and true labels is

\alpha^*_f - \alpha^*_t = \ln \frac{\sum_{k \in K^*} \pi(f \mid k, x) \, e^{m^*_k}}{\sum_{k \in K^*} \pi(t \mid k, x) \, e^{m^*_k}} + m^*_t - m^*_f = \ln \frac{Q(f \mid x, K^*, \theta^*)}{Q(t \mid x, K^*, \theta^*)} + m^*_t - m^*_f \geq 0.

For the symmetric case, π(t | k, x) = π(k | t, x) for all k; therefore, the sample-category specific constant of the true label satisfies

\alpha^*_t = \ln\left(\sum_{k \in K^*} \pi(k \mid t, x) \, e^{m^*_k - m^*_t}\right) \leq \ln\left(\sum_{k \in K^*} \pi(k \mid t, x)\right) = 0

since m*_k ≤ m*_t for all k. For the single symmetric case, π(j | j) = 1 for all j ≠ f, t; therefore, the sample-category specific constant of category j ≠ f, t is

\alpha^*_j = \ln\left(\pi(j \mid j, x) \, e^{m^*_j}\right) - m^*_j = 0. \quad \square
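The central step of the proof of Theorem 1, that at Θ* = {θ*, α*} the cleansed softmax reproduces the observed distribution Q so the cross-entropy gradient with respect to the logits vanishes, can be checked numerically. This is a small sketch with hypothetical values, taking K = K* and a single fixed input x:

```python
import math

def softmax(v):
    # Numerically stable softmax.
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

# Hypothetical true logits m* and noise-transition probabilities pi[i][k] = pi(i|k, x).
m_star = [0.2, -0.1, 0.9]
pi = [[0.8, 0.1, 0.1],
      [0.1, 0.8, 0.1],
      [0.1, 0.1, 0.8]]   # each column sums to one: every true class k emits some label

p_star = softmax(m_star)
# Observed label distribution Q(i) = sum_k pi(i|k, x) P(k | x, K*, theta*) (Definition 1).
q = [sum(pi[i][k] * p_star[k] for k in range(3)) for i in range(3)]

# Optimal constants: alpha*_i = ln(sum_k pi(i|k, x) e^{m*_k}) - m*_i.
alpha_star = [math.log(sum(pi[i][k] * math.exp(m_star[k]) for k in range(3))) - m_star[i]
              for i in range(3)]

# The cleansed probabilities at the true parameters reproduce Q exactly ...
p_c = softmax([m + a for m, a in zip(m_star, alpha_star)])
# ... so the cross-entropy gradient with respect to the logits, P_C - Q, is zero.
grad = [p - qq for p, qq in zip(p_c, q)]
print(max(abs(g) for g in grad))  # ~0, up to floating-point error
```

This confirms numerically that the gradient vanishes at {θ*, α*}, consistent with θ* ∈ θ+.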



See Table 7 in the Appendix. Table 8 in the Appendix provides a comparison between AC1 and AC2; note that the test accuracy of AC1+AA is close to that of AC2+AA.



Figure 1: Examples of incorrect labels in CIFAR-10. The original label is the label assigned to each image in the dataset; the alternative label is the possibly correct label for each image.

Figure 2 shows the concept of AutoCleansing. Let x be the input, y be the output, and y = m(x, θ) be the base network model, where θ denotes the parameter of the base model. Consider five observations A, B, …, E. The red line is the true model, defined as y = m(x, θ*), where θ* denotes the true parameter. B is a mislabeled sample, as the observed label B differs significantly from the true label B*. The dotted line is the estimated model y = m(x, θ̂) fitted on the incorrect data, where θ̂ denotes the estimated parameter. As can be observed, overfitting occurs owing to the mislabeled sample. In this figure, ŷ denotes the prediction for x = 3 using the estimated model; however, the true label is y*. Thus, the incorrect label causes a prediction error. Consider the cleansing model, y = m(x, θ) + α, where α denotes a constant parameter for each observation. If the constant α_B captures the effect of the incorrect label, as shown in the figure, removing the constants from the cleansing model may mitigate the overfitting problem.

Figure 2: Concept of AutoCleansing. (Left) A to E are the observations; B has an incorrect label. The red line represents the true model. The black dotted line represents the model estimated from the incorrect data. ŷ denotes the label predicted by the overfitted model; y* denotes the true label. AutoCleansing consists of the base network model and the constants α. The constant α_B captures the effect of the incorrect label for B; thus, removing α mitigates the overfitting caused by the incorrect label. (Right) (1) The training data contain correct and incorrect labels. (2) The cleansing model, consisting of a base network model m(x) and sample-category specific constants α, is trained on these data. (3) The constants α are deleted. (4) The cleansed network model is evaluated on the validation data.
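Steps (1) to (4) can be sketched end to end. The following is a hypothetical minimal example, not the paper's implementation: the base model is a linear softmax classifier, the sample-category specific constants are a table `alpha` with one row per training sample (column 0 fixed to zero for identification, as with α_1 in the paper), both are trained jointly on noisy labels, and prediction then uses the base model alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Toy training data: two Gaussian blobs; 10% of the labels are flipped.
N, D, K = 200, 2, 2
X = np.vstack([rng.normal(-1.0, 1.0, (N // 2, D)),
               rng.normal(1.0, 1.0, (N // 2, D))])
y_true = np.repeat([0, 1], N // 2)
y_noisy = y_true.copy()
flip = rng.choice(N, N // 10, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# (2) Cleansing model c(x) = m(x) + alpha: linear base model (W, b) plus
#     sample-category specific constants alpha, trained jointly.
W, b = np.zeros((D, K)), np.zeros(K)
alpha = np.zeros((N, K))            # alpha[:, 0] fixed at zero (identification)
Y = np.eye(K)[y_noisy]
lr, losses = 0.5, []
for _ in range(300):
    P = softmax(X @ W + b + alpha)
    losses.append(-np.mean(np.log(P[np.arange(N), y_noisy] + 1e-12)))
    G = (P - Y) / N                 # d(cross-entropy)/d(logits)
    W -= lr * (X.T @ G)
    b -= lr * G.sum(axis=0)
    alpha -= lr * N * G             # per-sample step on the constants
    alpha[:, 0] = 0.0

# (3)+(4) Discard alpha and predict with the cleansed base model only.
pred = np.argmax(X @ W + b, axis=1)
```

The per-sample learning rate `lr * N` for `alpha` is an illustrative choice; in the paper the constants are parameters of a deep network trained by the usual optimizer.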

Figures 4-6 in the Appendix show the sample-category specific constants α for the examples of mislabeled images in CIFAR-10 and CIFAR-100. The estimation model is Wide-ResNet 40-2 with AutoCleansing. Because α_1 is fixed to zero, these figures show the standardized values α_k − Mean{α_1, …, α_K}. Let α^Max_n


Figure 3: Test accuracies of the base model with trimmed data and AutoCleansing. Averages of five runs are reported.


Figure 4: Example images of incorrect labels within the category set and their sample-category specific constants α. MaxRank is the percentile rank of α^Max_n sorted in descending order, and MinRank is the percentile rank of α^Min_n sorted in ascending order. See text for more details.

Figure 5: Example images of incorrect labels outside category set and sample-category specific constants α.

Numerical example of AutoCleansing. Obs. is the observed label and True is the true category. Outside denotes a label outside the category set. c is the output of the biased model, m* is the output of the true model, and α is the biased effect estimated by AutoCleansing.

Test accuracy (%) on CIFAR-10. AC1+AA denotes the proposed AutoCleansing with sample-category specific constants and AutoAugment. Throughout this study, the results of the Baseline, Cutout, and AutoAugment methods are replicated from Cubuk et al. (2018), FAA from Lim et al. (2019), and PBA from Ho et al. (2019). Averages of five runs are reported.

Test accuracy (%) on CIFAR-100. Averages of five runs are reported.



AutoCleansing and Area Under the Margin (AUM). The base network model is ResNet32. This table replicates the results of Baseline and AUM from Pleiss et al. (2020). Averages of five runs are reported.

APPENDIX

A.3 ADDITIONAL TABLES

Table 7: Hyperparameters for the experiment. LR represents the learning rate, whereas WD represents the weight decay. The multi-step schedule decays the learning rate by 10-fold at epochs (150, 225) for CIFAR and (90, 180, 240)

