AUTOCLEANSING: UNBIASED ESTIMATION OF DEEP LEARNING WITH MISLABELED DATA

Abstract

Mislabeled samples cause prediction errors. This study proposes a solution to the problem of incorrect labels, called AutoCleansing, which automatically captures the effect of incorrect labels and mitigates it without removing the mislabeled samples. AutoCleansing consists of a base network model and sample-category specific constants. The parameters of the base model and the sample-category constants are estimated simultaneously using the training data. Thereafter, predictions for test data are made using the base model without the constants that capture the mislabeled effects. A theoretical model for AutoCleansing is developed, showing that the gradient of the loss function of the proposed method can be zero at the true parameters even with mislabeled data, provided the model is correctly constructed. Experimental results show that AutoCleansing achieves better test accuracy than previous methods on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.

1. INTRODUCTION

The prediction performance of supervised machine learning depends on the quality of the training data. For classification tasks, the dataset is assumed to have a correct label for each object. However, real-world datasets may contain mislabeled samples. For instance, Pleiss et al. (2020) analyzed incorrect labels in the CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009). They reported that mislabeled samples account for approximately 3% of CIFAR-10 and 13% of CIFAR-100. Figure 1 shows typical examples of incorrect labels in the CIFAR-10 dataset, which consists of 60,000 images, each assigned one of 10 classes. In this figure, the original label of #1 is DOG; however, it appears to be an image of CAT. As the category set of CIFAR-10 includes both DOG and CAT, #1 is an example of an incorrect label within the category set. Image #2 has TRUCK as its original label; however, it shows a PERSON, which does not belong to the category set. Thus, it is an example of an incorrect label outside the category set. Image #3 contains two objects but carries only the single label DEER; it is an example of an incorrect label with multiple objects.

Incorrect labels in the training dataset may cause prediction errors. The most intuitive way to address the problem is to remove mislabeled samples from the training dataset. However, identifying mislabeled samples requires measuring the correctness of labels and defining a threshold that determines whether a label is correct. Deleting too much data reduces the efficiency of estimation by decreasing the sample size, and finding an optimal threshold requires several training runs with mislabeled samples removed at different thresholds.
This study proposes an alternative solution to the problem of incorrect labels, called AutoCleansing, which automatically captures the effect of incorrect labels and mitigates it without removing mislabeled samples. AutoCleansing consists of a base network model and sample-category specific constants. The parameters of the base model and the sample-category constants are estimated simultaneously using the training data. Thereafter, predictions for test data are made using the base model without the constants that capture the mislabeled effects. As shown in the theoretical analysis section, the proposed AutoCleansing can address prediction errors due to incorrect labels within the category set, outside the category set, and with multiple objects. AutoCleansing can use any network model as the base model with any augmentation method. For example, the experimental section presents estimation results of AutoCleansing with base models of ResNet (He et al., 2016), WideResNet (Zagoruyko & Komodakis, 2016), Shake-



Figure 1: Examples of incorrect labels in CIFAR-10. The original label is the label assigned to each image in the dataset. The alternative label is the possibly correct label of each image.

Figure 2 shows the concept of AutoCleansing. Let x be the input, y be the output, and y = m(x, θ) be the base network model, where θ denotes the parameter of the base model. Consider five observations A, B, …, E. The red line is the true model, defined as y = m(x, θ*), where θ* denotes the true parameter. B is a mislabeled sample, as the observed label of B differs significantly from the true label B*. The dotted line is the estimated model y = m(x, θ̂) obtained from the incorrect data, where θ̂ denotes the estimated parameter. As can be observed, overfitting occurs owing to the mislabeled sample. In this figure, ŷ denotes the prediction for x = 3 using the estimated model; however, the true label is y*. Thus, the incorrect label causes a prediction error. Consider the cleansing model y = m(x, θ) + α, where α denotes a constant parameter for each observation. If the constant α_B captures the effect of the incorrect label, as shown in the figure, removing the constant from the cleansing model can mitigate the overfitting problem.
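The mechanism in Figure 2 can be sketched with a toy least-squares problem. Everything below is illustrative rather than the paper's actual setup: the assumed true model y = 2x, the +6 label error on observation B, and the learning rate are all hypothetical choices, and gradient descent from zero initialization is used so that the per-observation constants absorb the outlier.

```python
import numpy as np

# Five observations A..E on an assumed true line y = 2x; B is mislabeled.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_true = 2.0 * x
y = y_true.copy()
y[1] += 6.0                         # B: large label error

n = len(x)
theta = np.zeros(2)                 # base model m(x, theta) = theta[0]*x + theta[1]
alpha = np.zeros(n)                 # one constant per observation

lr = 0.01
for _ in range(5000):
    # Residual of the cleansing model m(x, theta) + alpha.
    resid = theta[0] * x + theta[1] + alpha - y
    # Joint gradient-descent step on theta and alpha (mean-squared loss).
    theta[0] -= lr * (resid * x).mean()
    theta[1] -= lr * resid.mean()
    alpha    -= lr * resid / n      # constants absorb per-observation error

# Prediction with the base model alone, i.e., with alpha removed.
y_hat = theta[0] * x + theta[1]
```

After joint estimation, the constant for B carries the bulk of the label error, so the base-model prediction at x = 1 lies closer to the true value 2 than to the observed label 8.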

Figure 2: Concept of AutoCleansing. (Left) A to E are the observations; B has an incorrect label. The red line represents the true model. The black dotted line represents the model estimated from the incorrect data. ŷ denotes the label predicted by the overfitted model, and y* denotes the true label. The cleansing model consists of the base network model and the constants α. The constant α_B captures the effect of the incorrect label for B; thus, removing α mitigates the overfitting effect due to the incorrect label. (Right) (1) The training data contain correct and incorrect labels. (2) Construct the cleansing model, consisting of a base network model m(x) and sample-category specific constants α, and train it on the training data. (3) Delete the constants α. (4) Test on the validation data using the cleansed network model.
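The four steps in the right panel can be sketched for classification, again under assumptions: a minimal linear softmax classifier stands in for the base network, the synthetic data and 10% flip rate are invented for illustration, and adding the sample-category constants α directly to the logits is one plausible way to realize the cleansing model, not necessarily the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Training data with correct and incorrect labels (10% of labels flipped).
X = rng.normal(size=(100, 2))
y_clean = (X[:, 0] > 0).astype(int)        # underlying "true" labels
y = y_clean.copy()
noisy = rng.choice(100, size=10, replace=False)
y[noisy] = 1 - y[noisy]

n, k = len(X), 2
W = np.zeros((2, k))                       # base-model parameters theta
alpha = np.zeros((n, k))                   # sample-category specific constants

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# (2) Jointly estimate theta and alpha on the training data.
lr, losses = 0.5, []
for _ in range(300):
    p = softmax(X @ W + alpha)             # cleansing model: base logits + alpha
    losses.append(-np.log(p[np.arange(n), y]).mean())
    g = p.copy()
    g[np.arange(n), y] -= 1.0              # gradient of cross-entropy w.r.t. logits
    W -= lr * X.T @ g / n
    alpha -= lr * g / n                    # constants absorb mislabeled effects

# (3) Delete the constants alpha; (4) predict with the base model alone.
pred = (X @ W).argmax(axis=1)
```

In this toy run, clean samples are soon fitted by the base model while mislabeled ones are not, so the constants accumulate mostly on the flipped samples; dropping them at step (3) leaves a base model whose predictions align with the clean labels.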

