AN INFORMATION-THEORETIC FRAMEWORK FOR LEARNING MODELS OF INSTANCE-INDEPENDENT LABEL NOISE

Abstract

Given a dataset D with label noise, how do we learn its underlying noise model? If the label noise is instance-independent, then the noise model can be represented by a noise transition matrix Q_D. Recent work has shown that, even without access to any instances with correct labels, and without further assumptions on the distribution of the label noise, it is still possible to estimate Q_D while simultaneously learning a classifier from D. However, this approach presupposes that a good estimate of Q_D requires an accurate classifier. In this paper, we show that high classification accuracy is in fact not required for estimating Q_D well. We introduce an information-theoretic framework for estimating Q_D solely from D, with no additional information or assumptions. At the heart of our framework is a discriminator that predicts whether an input dataset has maximum Shannon entropy; this discriminator is applied to multiple new datasets D′ synthesized from D via the insertion of additional label noise. We prove that our estimator for Q_D is statistically consistent, both in the dataset size and in the number of intermediate datasets D′ synthesized from D. As a concrete realization of our framework, we incorporate local intrinsic dimensionality (LID) into the discriminator, and we show experimentally that our LID-based discriminator significantly reduces the estimation error for Q_D: on CIFAR-10 with symmetric noise, the average Kullback-Leibler (KL) loss drops from 0.27 to 0.17 when 40% of anchor-like samples are removed. Although our framework requires no clean subset of D, we show that it can also take advantage of clean data, when available, to improve upon existing estimation methods.

1. INTRODUCTION

Real-world datasets are inherently noisy. Although there are numerous existing methods for learning classifiers in the presence of label noise (e.g. Han et al. (2018); Hendrycks et al. (2018); Natarajan et al. (2013); Tanaka et al. (2018)), there is still a gap between their empirical success and the theoretical understanding of the conditions required for these methods to work. For instance-independent label noise, all methods with theoretical performance guarantees require a good estimate of the noise transition matrix as a key indispensable step (Cheng et al., 2017; Jindal et al., 2016; Patrini et al., 2017; Thekumparampil et al., 2018; Xia et al., 2019). Recall that to any dataset D with label noise we can associate a noise transition matrix Q_D, whose entries are the conditional probabilities p(y|z) that a randomly selected instance of D has observed label y given that its correct label is z. Many algorithms for estimating Q_D either require that a small clean subset D_clean of D be provided (Liu & Tao, 2015; Scott, 2015), or assume that the noise model is a mixture model (Ramaswamy et al., 2016; Yu et al., 2018) in which at least some anchor points are known for every component. Here, "anchor points" are datapoints that belong to exactly one component of the mixture model almost surely (cf. Vandermeulen et al. (2019)), while "clean" refers to instances with correct labels. Recently, it was shown that knowledge of anchor points or D_clean is not required for estimating Q_D. The proposed approach, known as T-Revision (Xia et al., 2019), learns a classifier from D while simultaneously identifying anchor-like instances in D; these instances are used iteratively to estimate Q_D, which in turn is used to improve the classifier. Hence, for T-Revision, a good estimate of Q_D is inextricably tied to learning a classifier with high classification accuracy. In this paper, we propose an information-theoretic framework for estimating Q_D solely from D, without requiring an accurate classifier.
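To make the setting concrete, the following sketch (with illustrative values not taken from the paper) shows what an instance-independent noise transition matrix Q looks like and how it governs label corruption; the helper name `corrupt_labels` and the specific entries of `Q` are hypothetical. Note that estimating Q_hat below uses the clean labels, which is precisely what the paper's setting does not assume; the snippet only illustrates the noise model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-class noise transition matrix Q: Q[z, y] = p(y | z),
# the probability that a clean label z is observed as label y.
# Each row is a probability distribution and must sum to 1.
Q = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

def corrupt_labels(clean_labels, Q, rng):
    """Apply instance-independent label noise: each clean label z is
    flipped to y with probability Q[z, y], independent of the instance."""
    return np.array([rng.choice(len(Q), p=Q[z]) for z in clean_labels])

clean = rng.integers(0, 3, size=10_000)
noisy = corrupt_labels(clean, Q, rng)

# Empirical estimate of Q from (clean, noisy) pairs. With clean labels
# in hand this is trivial; the challenge addressed in the paper is
# estimating Q when only the noisy labels are observed.
Q_hat = np.zeros((3, 3))
for z, y in zip(clean, noisy):
    Q_hat[z, y] += 1
Q_hat /= Q_hat.sum(axis=1, keepdims=True)
```

With 10,000 samples the empirical estimate Q_hat is already close to Q entry-wise, which is why methods that assume access to clean labels (or anchor points) can estimate the transition matrix directly.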




