WHICH IS BETTER FOR LEARNING WITH NOISY LABELS: THE SEMI-SUPERVISED METHOD OR MODELING LABEL NOISE?

Abstract

In real life, accurately annotating large-scale datasets is often difficult, so datasets used to train deep learning models are likely to contain label noise. To make use of such datasets, two typical families of methods have been proposed. One employs semi-supervised learning, exploiting confident examples as labeled data and unconfident examples as unlabeled data. The other models the label noise and designs statistically consistent classifiers. A natural question remains open: which one should be used for a specific real-world application? In this paper, we answer this question from the perspective of the causal data generation process. Specifically, the semi-supervised method depends heavily on the data generation process, while the label-noise-modeling method is independent of it. For example, if a given dataset has a causal generative structure in which the features cause the label, the semi-supervised method would not be helpful. When the causal structure is unknown, we provide an intuitive method to discover it for a given dataset containing label noise.

1. INTRODUCTION

Deep neural networks can achieve remarkable performance when accurately annotated large-scale training datasets are available. However, annotating a large number of examples accurately is often expensive and sometimes infeasible in real life. Cheap datasets containing label errors are easy to obtain (Li et al., 2019) and have been widely used to train deep neural networks. Recent results (Han et al., 2018; Nguyen et al., 2019) show that deep neural networks can easily memorize label noise during training, which leads to poor test performance. To reduce the side effect of label noise, there are two major streams of methods. One stream focuses on getting rid of label errors. Specifically, these methods first select confident examples (i.e., examples whose labels are likely to be correct), e.g., by exploiting the memorization effect of deep networks (Jiang et al., 2018). Then, by discarding the labels of unconfident examples (i.e., examples whose labels are likely to be incorrect) while keeping their instances as unlabeled data, they (Li et al., 2019; 2020; Wei et al., 2020; Yao et al., 2021; Tan et al., 2021; Ciortan et al., 2021) employ semi-supervised learning, e.g., MixMatch, to achieve state-of-the-art performance. These methods are usually based on heuristics and lack theoretical guarantees. The other major stream models the label noise and then removes its side effects. These methods mainly focus on estimating the label noise transition matrix T(x), i.e., T_ij(x) = P(Ỹ = j | Y = i, X = x), the probability that an instance x with clean label Y = i is flipped to noisy label Ỹ = j. The idea is that the clean class posterior P(Y | X) can be inferred by learning the transition matrix T(x) and the noisy class posterior P(Ỹ | X).
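The confident-example selection used by the first stream can be sketched as follows. This is a minimal illustration of small-loss selection based on the memorization effect, not any particular paper's pipeline; the `keep_ratio` hyperparameter is an assumption for illustration.

```python
import numpy as np

def select_confident(losses, keep_ratio=0.7):
    """Return indices of the examples with the smallest per-example loss.

    Deep networks tend to fit clean labels before noisy ones (the
    memorization effect), so small-loss examples are treated as
    confident; the remaining examples are stripped of their labels
    and handed to a semi-supervised learner as unlabeled data.
    """
    n_keep = int(len(losses) * keep_ratio)
    return np.argsort(losses)[:n_keep]

# Toy per-example cross-entropy losses: the last two look mislabeled.
losses = np.array([0.1, 0.3, 0.2, 2.5, 3.1])
confident_idx = select_confident(losses, keep_ratio=0.6)
print(sorted(confident_idx.tolist()))  # → [0, 1, 2]
```

In practice the per-example losses come from a partially trained network, and the selection is repeated every epoch.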
In general, when T(x) is well estimated (or given), these methods are statistically consistent, i.e., they guarantee that the classifiers learned from the noisy data converge to the optimal classifiers defined on the clean data as the size of the noisy training data increases (Patrini et al., 2017; Xia et al., 2019). This naturally raises the question: which stream of methods should be used for a specific real-world application? Answering this question is crucial for the community of learning with noisy labels. If the answer is that one stream of methods is dominating, future efforts should mainly focus on that
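To make the consistency idea concrete, a forward-style loss correction in the spirit of Patrini et al. (2017) can be sketched as below, assuming a class-dependent (instance-independent) transition matrix T with T[i, j] = P(Ỹ = j | Y = i). The function names and the toy numbers are illustrative assumptions.

```python
import numpy as np

def forward_corrected_nll(probs, noisy_labels, T):
    """Forward loss correction: push the model's clean posterior
    through T to obtain noisy posteriors, then compute the negative
    log-likelihood of the observed noisy labels. Minimizing this loss
    over enough noisy data recovers the clean posterior P(Y | X).

    probs: (n, c) model predictions for the clean classes.
    T:     (c, c) transition matrix, T[i, j] = P(Ytilde = j | Y = i).
    """
    noisy_probs = probs @ T  # row p becomes sum_i p_i * T[i, :] = P(Ytilde | X)
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.mean(np.log(picked))

# Symmetric 20% label noise on two classes.
T = np.array([[0.8, 0.2],
              [0.2, 0.8]])
probs = np.array([[0.9, 0.1],   # model's clean-class posteriors
                  [0.2, 0.8]])
noisy_labels = np.array([0, 1])  # observed (possibly corrupted) labels
print(forward_corrected_nll(probs, noisy_labels, T))
```

The key point is that the loss is evaluated against the noisy labels, yet its minimizer matches the classifier that is optimal on clean data, which is what makes the approach statistically consistent once T is known or well estimated.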

