AUTOCLEANSING: UNBIASED ESTIMATION OF DEEP LEARNING WITH MISLABELED DATA

Abstract

Mislabeled samples cause prediction errors. This study proposes a solution to the problem of incorrect labels, called AutoCleansing, which automatically captures the effect of incorrect labels and mitigates it without removing the mislabeled samples. AutoCleansing consists of a base network model and sample-category specific constants. The parameters of the base model and the sample-category constants are estimated simultaneously from the training data. Predictions for test data are then made using the base model alone, without the constants that capture the mislabeling effects. A theoretical model for AutoCleansing is developed, showing that the gradient of the loss function of the proposed method can be zero at the true parameters even with mislabeled data, provided the model is correctly constructed. Experimental results show that AutoCleansing achieves better test accuracy than previous methods on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.
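The paper's exact parameterization is described later; as a rough illustration only (not the authors' implementation), the following numpy sketch shows the general idea on a toy linear softmax classifier: per-sample constants are added to the logits during training so they can absorb sample-specific label noise, and are dropped at prediction time. All variable names, the toy data, and the learning-rate choices here are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy data: N samples, D features, K classes; ~20% of labels flipped at random.
rng = np.random.default_rng(0)
N, D, K = 200, 5, 3
W_true = rng.normal(size=(D, K))
X = rng.normal(size=(N, D))
y_clean = (X @ W_true).argmax(axis=1)
y_noisy = y_clean.copy()
flip = rng.random(N) < 0.2
y_noisy[flip] = rng.integers(0, K, size=flip.sum())

onehot = np.eye(K)[y_noisy]
W = np.zeros((D, K))   # base-model parameters (kept for test time)
C = np.zeros((N, K))   # sample-category specific constants (training only)

lr_w, lr_c = 0.5, 0.01  # constants learn slowly so W still fits the signal
for _ in range(300):
    # Training logits = base model + per-sample constants.
    G = softmax(X @ W + C) - onehot   # cross-entropy gradient w.r.t. logits
    W -= lr_w * X.T @ G / N
    C -= lr_c * G                     # constants absorb sample-specific noise

# Test-time prediction uses the base model alone; the constants are dropped.
pred = (X @ W).argmax(axis=1)
acc = (pred == y_clean).mean()
```

In practice, how strongly the constants are allowed to move (here, the ratio of `lr_c` to `lr_w`) matters: unconstrained per-sample parameters could fit the noisy labels perfectly and leave the base model untrained, so some form of regularization or careful training schedule is needed.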

1. INTRODUCTION

The prediction performance of supervised machine learning depends on the quality of the training data. For classification tasks, the dataset is assumed to have a correct label for each object. However, real-world datasets may contain mislabeled samples. For instance, Pleiss et al. (2020) analyzed incorrect labels in the CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009). They reported that approximately 3% of CIFAR-10 and 13% of CIFAR-100 samples were mislabeled. Figure 1 shows typical examples of incorrect labels in the CIFAR-10 dataset, which consists of 60,000 images, each assigned one of 10 category classes. In this figure, the original label of #1 is DOG; however, it appears to be an image of a CAT. As the category set of CIFAR-10 includes both DOG and CAT, #1 is an example of an incorrect label within the category set. Image #2 has TRUCK as its original label but shows a PERSON, which does not belong to the category set; this is an example of an incorrect label outside the category set. Image #3 contains two objects but carries only the single label DEER, an example of an incorrect label with multiple objects.



Figure 1: Examples of incorrect labels in CIFAR-10. The original label is the label assigned to each image in the dataset; the alternative label is a possibly correct label for that image.

