AN INFORMATION-THEORETIC FRAMEWORK FOR LEARNING MODELS OF INSTANCE-INDEPENDENT LABEL NOISE

Abstract

Given a dataset D with label noise, how do we learn its underlying noise model? If we assume that the label noise is instance-independent, then the noise model can be represented by a noise transition matrix Q_D. Recent work has shown that even without any instances known to be correctly labeled, and without further assumptions on the distribution of the label noise, it is still possible to estimate Q_D while simultaneously learning a classifier from D. However, this approach ties a good estimate of Q_D to an accurate classifier. In this paper, we show that high classification accuracy is in fact not required for estimating Q_D well. We shall introduce an information-theoretic framework for estimating Q_D solely from D (without additional information or assumptions). At the heart of our framework is a discriminator that predicts whether an input dataset has maximum Shannon entropy; this discriminator is applied to multiple new datasets D′ synthesized from D via the insertion of additional label noise. We prove that our estimator for Q_D is statistically consistent, both in the dataset size and in the number of intermediate datasets D′ synthesized from D. As a concrete realization of our framework, we shall incorporate local intrinsic dimensionality (LID) into the discriminator, and we show experimentally that with our LID-based discriminator, the estimation error for Q_D can be significantly reduced: on CIFAR-10 with symmetric noise, the average Kullback-Leibler (KL) loss drops from 0.27 to 0.17 when 40% of anchor-like samples are removed. Although no clean subset of D is required for our framework to work, we show that our framework can also take advantage of clean data to improve upon existing estimation methods.

1. INTRODUCTION

Real-world datasets are inherently noisy. Although there are numerous existing methods for learning classifiers in the presence of label noise (e.g., Han et al. (2018); Hendrycks et al. (2018); Natarajan et al. (2013); Tanaka et al. (2018)), there is still a gap between empirical success and theoretical understanding of the conditions required for these methods to work. For instance-independent label noise, all methods with theoretical performance guarantees require a good estimate of the noise transition matrix as a key indispensable step (Cheng et al., 2017; Jindal et al., 2016; Patrini et al., 2017; Thekumparampil et al., 2018; Xia et al., 2019). Recall that to any dataset D with label noise we can associate a noise transition matrix Q_D, whose entries are the conditional probabilities p(y|z) that a randomly selected instance of D has the given label y, under the condition that its correct label is z. Many algorithms for estimating Q_D either require that a small clean subset D_clean of D is provided (Liu & Tao, 2015; Scott, 2015), or assume that the noise model is a mixture model (Ramaswamy et al., 2016; Yu et al., 2018) in which at least some anchor points are known for every component. Here, "anchor points" refers to datapoints belonging to exactly one component of the mixture model almost surely (cf. Vandermeulen et al. (2019)), while "clean" refers to instances with correct labels. Recently, it was shown that knowledge of anchor points or D_clean is not required for estimating Q_D. The proposed approach, known as T-Revision (Xia et al., 2019), learns a classifier from D and simultaneously identifies anchor-like instances in D; these are used iteratively to estimate Q_D, which in turn is used to improve the classifier. Hence, for T-Revision, a good estimate of Q_D is inextricably tied to learning a classifier with high classification accuracy.
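To make the role of Q_D concrete, the following sketch (with illustrative names, not code from the paper) shows how an instance-independent noise transition matrix acts on a vector of clean labels, using the row convention Q[z, y] = p(y|z):

```python
import numpy as np

def corrupt_labels(clean_labels, Q, rng=None):
    """Draw a noisy label y for each clean label z with probability
    Q[z, y] = p(y | z); the noise is instance-independent by construction."""
    rng = np.random.default_rng(rng)
    return np.array([rng.choice(len(Q), p=Q[z]) for z in clean_labels])

# Example: 3-class symmetric noise at level eps = 0.2
eps = 0.2
Q = np.full((3, 3), eps / 2)   # off-diagonal entries: eps / (num_classes - 1)
np.fill_diagonal(Q, 1 - eps)   # each label is kept with probability 0.8
```

Each row of Q is a probability distribution over noisy labels, so rows must sum to 1; the symmetric-noise Q above is only one convenient special case, and the paper's framework makes no structural assumption on Q_D.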
In this paper, we propose a framework for estimating Q_D solely from D, without requiring anchor points, a clean subset, or even anchor-like instances. In particular, we show that high classification accuracy is not required for a good estimate of Q_D. Our framework is able to robustly estimate Q_D at all noise levels, even in extreme scenarios where anchor points are removed from D, or where D is imbalanced. Our key starting point is that Shannon entropy and other related information-theoretic concepts can be defined analogously for datasets with label noise. Suppose we have a discriminator Φ that takes any dataset as its input and gives a binary output predicting whether that dataset has maximum entropy. Given D, a dataset with label noise, we shall synthesize multiple new datasets D′ by inserting additional label noise into D, using different noise levels for different label classes. Intuitively, the more label noise D initially has, the less additional label noise we need to insert into D to reach near-maximum entropy. We show that among those datasets D′ predicted by Φ to have maximum entropy, the associated levels of additional label noise can be used to compute a single estimate of Q_D. Our estimator is statistically consistent: we prove that by repeating this method, the average of the estimates converges to the true Q_D. As a concrete realization of this idea, we shall construct Φ using the notion of Local Intrinsic Dimensionality (LID) (Houle, 2013; 2017a;b). Intuitively, the LID computed at a feature vector v approximates the dimension of a smooth manifold containing v that would "best" fit the data distribution in the vicinity of v. LID played a fundamental role in an important 2018 breakthrough in noise detection (Ma et al., 2018c), wherein it was empirically shown that sequences of LID scores can be used to distinguish clean datasets from datasets with label noise.
Roughly speaking, the training data for Φ consists of LID sequences that correspond to multiple datasets synthesized from D. In particular, we show that Φ can be trained without needing any clean data. Since we optimize the predictive accuracy of Φ, rather than the classification accuracy for D, we also do not require state-of-the-art architectures. For example, in our experiments on the CIFAR-10 dataset (Krizhevsky et al., 2009), we found that LID sequences generated by training shallow "vanilla" convolutional neural networks (CNNs) were sufficient for training Φ. Our contributions are summarized as follows:

• We introduce an information-theoretic framework for estimating the noise transition matrix of any dataset D with instance-independent label noise. We make no assumptions on the structure of the noise transition matrix.

• We prove that our noise transition matrix estimator is consistent. This is the first estimator proven to be consistent without needing to optimize classification accuracy. Notably, our consistency proof requires no anchor points, no clean subset, and no anchor-like instances.

• We construct an LID-based discriminator Φ and show experimentally that training a shallow CNN to generate LID sequences is sufficient for obtaining high predictive accuracy for Φ. Using our LID-based discriminator Φ, our proposed estimator outperforms state-of-the-art methods, especially when anchor-like instances are removed from D.

• Given access to a clean subset D_clean, we show that our method can be used to further improve existing competitive estimation methods.

2. PROPOSED INFORMATION-THEORETIC FRAMEWORK

Our framework hinges on a simple yet crucial observation: datasets with different levels of label noise have different entropies. Although the entropy of any given dataset D is (initially) unknown to us, we do know, crucially, that a complete uniformly random relabeling of D would yield a new dataset with maximum entropy (we call such datasets "baseline datasets"), and we can easily generate multiple such datasets. We can also use partial relabelings to generate a spectrum of new datasets whose entropies range from the entropy of D up to the maximum possible entropy. We call these "α-increment datasets", where α is a parameter that we control. The minimum value α_min of α for which an α-increment dataset reaches maximum entropy depends on the original entropy of D. See Fig. 1 for a visualization of the spectrum of entropies for α-increment datasets and baseline datasets. Our main idea is to train a discriminator Φ that recognizes datasets with maximum entropy, and then use Φ to determine this minimum value α_min. Once α_min is estimated, we are then able to estimate Q_D. Specific realizations of our framework correspond to specific designs for Φ. An
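The relabeling construction above can be sketched as follows (illustrative names and a synthetic example; we assume "α-increment" means relabeling a random fraction α of instances uniformly at random, so that the empirical conditional entropy H(y|z) rises toward log(num_classes) as α → 1):

```python
import numpy as np

def alpha_increment(labels, alpha, num_classes, rng=None):
    """Relabel a random fraction alpha of the dataset uniformly at random;
    alpha = 1 gives a 'baseline' (maximum-entropy) relabeling."""
    rng = np.random.default_rng(rng)
    labels = labels.copy()
    mask = rng.random(len(labels)) < alpha
    labels[mask] = rng.integers(0, num_classes, mask.sum())
    return labels

def cond_entropy(y, z, num_classes):
    """Empirical conditional entropy H(y | z) in nats, averaged over the
    empirical distribution of the reference labels z."""
    h = 0.0
    for c in range(num_classes):
        n_c = (z == c).sum()
        p = np.bincount(y[z == c], minlength=num_classes) / max(n_c, 1)
        p = p[p > 0]
        h += (z == c).mean() * -(p * np.log(p)).sum()
    return h
```

Sweeping α from 0 to 1 traces out the entropy spectrum described above: the noisier D already is, the smaller the α at which the relabeled dataset becomes indistinguishable from a baseline dataset.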

