AN INFORMATION-THEORETIC FRAMEWORK FOR LEARNING MODELS OF INSTANCE-INDEPENDENT LABEL NOISE

Abstract

Given a dataset D with label noise, how do we learn its underlying noise model? If we assume that the label noise is instance-independent, then the noise model can be represented by a noise transition matrix Q_D. Recent work has shown that even without further information about any instances with correct labels, or further assumptions on the distribution of the label noise, it is still possible to estimate Q_D while simultaneously learning a classifier from D. However, this approach presupposes that a good estimate of Q_D requires an accurate classifier. In this paper, we show that high classification accuracy is actually not required for estimating Q_D well. We introduce an information-theoretic framework for estimating Q_D solely from D (without additional information or assumptions). At the heart of our framework is a discriminator that predicts whether an input dataset has maximum Shannon entropy; this discriminator is applied to multiple new datasets D′ synthesized from D via the insertion of additional label noise. We prove that our estimator for Q_D is statistically consistent, in terms of the dataset size and the number of intermediate datasets D′ synthesized from D. As a concrete realization of our framework, we incorporate local intrinsic dimensionality (LID) into the discriminator, and we show experimentally that with our LID-based discriminator, the estimation error for Q_D can be significantly reduced: on CIFAR-10 with symmetric noise and 40% of anchor-like samples removed, the average Kullback-Leibler (KL) loss drops from 0.27 to 0.17. Although no clean subset of D is required for our framework to work, we show that our framework can also take advantage of clean data to improve upon existing estimation methods.

1. INTRODUCTION

Real-world datasets are inherently noisy. Although there are numerous existing methods for learning classifiers in the presence of label noise (e.g. Han et al. (2018); Hendrycks et al. (2018); Natarajan et al. (2013); Tanaka et al. (2018)), there is still a gap between empirical success and a theoretical understanding of the conditions required for these methods to work. For instance-independent label noise, all methods with theoretical performance guarantees require a good estimate of the noise transition matrix as a key indispensable step (Cheng et al., 2017; Jindal et al., 2016; Patrini et al., 2017; Thekumparampil et al., 2018; Xia et al., 2019). Recall that to any dataset D with label noise we can associate a noise transition matrix Q_D, whose entries are the conditional probabilities p(y|z) that a randomly selected instance of D has given label y, conditioned on its correct label being z. Many algorithms for estimating Q_D either require that a small clean subset D_clean of D is provided (Liu & Tao, 2015; Scott, 2015), or assume that the noise model is a mixture model (Ramaswamy et al., 2016; Yu et al., 2018) for which at least some anchor points are known for every component. Here, "anchor points" refer to datapoints belonging to exactly one component of the mixture model almost surely (cf. Vandermeulen et al. (2019)), while "clean" refers to instances with correct labels. Recently, it was shown that knowledge of anchor points or D_clean is not required for estimating Q_D. The proposed approach, known as T-Revision (Xia et al., 2019), learns a classifier from D and simultaneously identifies anchor-like instances in D, which are used iteratively to estimate Q_D, which in turn is used to improve the classifier. Hence, for T-Revision, a good estimate of Q_D is inextricably tied to learning a classifier with high classification accuracy.
In this paper, we propose a framework for estimating Q_D solely from D, without requiring anchor points, a clean subset, or even anchor-like instances. In particular, we show that high classification accuracy is not required for a good estimation of Q_D. Our framework is able to robustly estimate Q_D at all noise levels, even in extreme scenarios where anchor points are removed from D, or where D is imbalanced. Our key starting point is that Shannon entropy and other related information-theoretic concepts can be defined analogously for datasets with label noise. Suppose we have a discriminator Φ that takes any dataset D′ as its input, and gives a binary output that predicts whether D′ has maximum entropy. Given D, a dataset with label noise, we shall synthesize multiple new datasets D′ by inserting additional label noise into D, using different noise levels for different label classes. Intuitively, the more label noise that D initially has, the lower the minimum amount of additional label noise we need to insert into D to reach near-maximum entropy. We show that among those datasets D′ that are predicted by Φ to have maximum entropy, their associated levels of additional label noise can be used to compute a single estimate for Q_D. Our estimator is statistically consistent: We prove that by repeating this method, the average of the estimates converges to the true Q_D. As a concrete realization of this idea, we construct Φ using the notion of Local Intrinsic Dimensionality (LID) (Houle, 2013; 2017a; b). Intuitively, the LID computed at a feature vector v is an approximation of the dimension of a smooth manifold containing v that would "best" fit the data distribution in the vicinity of v. LID plays a fundamental role in an important 2018 breakthrough in noise detection (Ma et al., 2018c), wherein it was empirically shown that sequences of LID scores can be used to distinguish clean datasets from datasets with label noise.
Roughly speaking, the training data for Φ consists of LID sequences that correspond to multiple datasets synthesized from D. In particular, we show that Φ can be trained without needing any clean data. Since we are optimizing the predictive accuracy of Φ, rather than optimizing the classification accuracy for D, we also do not require state-of-the-art architectures. For example, in our experiments on the CIFAR-10 dataset (Krizhevsky et al., 2009), we found that LID sequences generated by training on shallow "vanilla" convolutional neural networks (CNNs) were sufficient for training Φ. Our contributions are summarized as follows:
• We introduce an information-theoretic framework for estimating the noise transition matrix of any dataset D with instance-independent label noise. We do not make any assumptions on the structure of the noise transition matrix.
• We prove that our noise transition matrix estimator is consistent. This is the first-ever estimator that is proven to be consistent without needing to optimize classification accuracy. Notably, our consistency proof does not require anchor points, a clean subset, or any anchor-like instances.
• We construct an LID-based discriminator Φ and show experimentally that training a shallow CNN to generate LID sequences is sufficient for obtaining high predictive accuracy for Φ. Using our LID-based discriminator Φ, our proposed estimator outperforms state-of-the-art methods, especially when anchor-like instances are removed from D.
• Given access to a clean subset D_clean, we show that our method can be used to further improve existing competitive estimation methods.

2. PROPOSED INFORMATION-THEORETIC FRAMEWORK

Our framework hinges on a simple yet crucial observation: Datasets with different label noise levels have different entropies. Although the entropy of any given dataset D is (initially) unknown to us, we do know, crucially, that a complete uniformly random relabeling of D would yield a new dataset with maximum entropy (we call such datasets "baseline datasets"), and we can easily generate multiple such datasets. We could also use partial relabelings to generate a spectrum of new datasets whose entropies range from the entropy of D to the maximum possible entropy. We call them "α-increment datasets", where α is a parameter that we control. The minimum value α_min of α such that an α-increment dataset reaches maximum entropy depends on the original entropy of D. See Fig. 1 for a visualization of the spectrum of entropies for α-increment datasets and baseline datasets. Our main idea is to train a discriminator Φ that recognizes datasets with maximum entropy, and then use Φ to determine this minimum value α_min. Once this value is estimated, we are then able to estimate Q_D. Specific realizations of our framework correspond to specific designs for Φ. An illustration of our framework using LID-based discriminators is given in Fig. 2; details on LID-based discriminators can be found in Section 3, and will be further elaborated in the appendix.

Figure 1: A visualization of the entropy maximization process. We illustrate the noise transition matrices of DILNs as heat maps. The intact CIFAR-10 dataset is used as the underlying clean dataset, and four noise models are considered (symmetric 20%/50%/80% noise and pairwise 45% noise), shown here as four pairs of columns. The first four rows depict heat maps for α-increment DILNs, where α = (a, ..., a) for the four values a = 0.0, 0.6, 0.854, 0.877. The last row depicts heat maps for baseline DILNs, which have expected maximum entropy. Note that the minimum value a_min of a, such that a discriminator would find α-increment DILNs "indistinguishable" from baseline DILNs, depends on the base noise model that the α-increment DILN is derived from.

From this figure, we are able to infer, for example, that for symmetric noise models, a_min ≈ 0.877 for base noise level 50%, while in contrast, a_min ≈ 0.854 for base noise level 80%. In general, different noise levels correspond to different minimum values of α.

Throughout this paper, given any discrete random variables X, Y, we shall write p_X(x) and p_{X|Y}(x|y) to mean Pr(X = x) and Pr(X = x | Y = y) respectively. We assume that the reader is familiar with the basics of information theory; see Cover & Thomas (2012) for an excellent introduction.

2.1. ENTROPY OF DATASETS WITH LABEL NOISE

Given D, a dataset with instance-independent label noise (DILN), let A be its set of all label classes, and let Y (resp. Z) be the given (resp. correct) label of a randomly selected instance X of D.¹ For convenience, we say that D is a DILN with noise model (Y|Z; A). The noise transition matrix of D is the matrix Q_D whose (i, j)-th entry is q^D_{i,j} := p_{Y|Z}(j|i). We define the entropy of D by

H(D) := − Σ_{i∈A} p_Z(i) Σ_{j∈A} q^D_{i,j} log q^D_{i,j}.

Notice that H(D) is precisely the conditional entropy of Y given Z. (We use the convention that 0 log 0 = 0.) Hence, it is easy to prove that 0 ≤ H(D) ≤ log |A|. In particular, H(D) = 0 if and only if every pair of instances of D in the same class have the same given labels. Note also that D has maximum entropy log |A| if and only if every entry of Q_D equals 1/|A| (i.e. the given labels of D are completely noisy). Thus, H(D) can be interpreted as a measure of the label noise level of D.

A derived DILN of D shall mean a DILN D′ with noise model (Y′|Z; A) for some Y′ independent of Z, such that both D, D′ have the same underlying set of instances, given in the same sequential order. For example, D′ could be "derived" from D by inserting additional instance-independent label noise, in which case D′ can be interpreted as a partial relabeling of D. For convenience, we say that D′ is a Y′-derived DILN of D.

¹Every datapoint of D is a pair (x, y), where x is an instance, and y is its given label, which may differ from the correct label z associated to x. Note that Z is a function of X, and Y is a random function of Z. By instance-independent label noise, we mean that Pr(Y = y | Z = z, X = x) = Pr(Y = y | Z = z). A more detailed treatment of DILNs can be found in Appendix A. In particular, a DILN includes its noise model.
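For concreteness, H(D) can be computed directly from Q_D and the class prior p_Z. The following is a minimal NumPy sketch (function and variable names are ours, for illustration); it checks the two extreme cases discussed above.

```python
import numpy as np

def diln_entropy(Q, p_z):
    """H(D) = -sum_i p_Z(i) * sum_j q_ij * log(q_ij), with 0*log(0) = 0.

    Q[i, j] = p(Y = j | Z = i); each row of Q sums to 1.
    """
    Q = np.asarray(Q, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(Q > 0, Q * np.log(Q), 0.0)
    return float(-(np.asarray(p_z) * terms.sum(axis=1)).sum())

k = 10
p_z = np.full(k, 1 / k)
H_clean = diln_entropy(np.eye(k), p_z)             # no label noise
H_max = diln_entropy(np.full((k, k), 1 / k), p_z)  # completely noisy labels
```

Here H_clean is 0 and H_max equals log k, matching the bounds 0 ≤ H(D) ≤ log |A|.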

2.2. SYNTHESIS OF NEW DATASETS FROM D

Let D be a DILN with noise model (Y|Z; A). Without loss of generality, assume A = {1, ..., k}, assume that D has N instances, and write Q_D = [q_{i,j}]_{1≤i,j≤k}. The correct labels of the instances are fixed and unknown to us; hence all entries of Q_D are fixed constants with unknown values. Our goal is to estimate Q_D. As alluded to earlier, we shall synthesize two types of datasets from D.

The first type is what we call a baseline dataset, described as follows: Let D̄ be a random dataset obtained from D by replacing the given label of each instance of D by a label chosen uniformly at random from A. Hence D̄ is a random DILN with expected entropy log k (i.e. maximum entropy), and we write D_max := E[D̄]. The noise transition matrix Q_{D̄} = [Q̄_{i,j}]_{1≤i,j≤k} of D̄ is a random matrix whose entries

Q̄_{i,j} = (1/k)(1 + E_{i,j})

are random variables, where each "error" E_{i,j} is a random variable with mean 0. Any observed D̄ is called a baseline DILN of D.

The second type is what we call an α-increment dataset, where α = (α_1, ..., α_k) is a vector whose entries satisfy 0 ≤ α_i ≤ 1 for all i. Let D_α be obtained from D as follows: For each 1 ≤ i ≤ k, select uniformly at random α_i × 100% of the instances with given label i, and reassign each selected given label to one of the remaining k − 1 classes, chosen uniformly at random. Hence D_α is a random DILN, and its noise transition matrix Q_{D_α} = [Q′_{i,j}]_{1≤i,j≤k} is a random matrix whose entries

Q′_{i,j} = ( q_{i,j}(1 − α_j) + Σ_{1≤t≤k, t≠j} q_{i,t} α_t · 1/(k−1) ) (1 + E′_{i,j})   (for all 1 ≤ i, j ≤ k)

are random variables, where each "error" E′_{i,j} is a random variable with mean 0. Any observed D_α is called an α-increment DILN of D.
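The effect of α-incrementing on the noise transition matrix can be verified empirically. The sketch below (all names and the particular Q are ours, for illustration) simulates an α-increment DILN and compares its empirical transition matrix against the expected-value formula above (the zero-mean error terms vanish on average).

```python
import numpy as np

rng = np.random.default_rng(0)
k, N = 4, 60_000

# Hypothetical base noise model: q_ij = p(Y = j | Z = i); diag 0.7, off-diag 0.1.
Q = np.full((k, k), 0.1) + 0.6 * np.eye(k)
z = rng.integers(0, k, size=N)                      # (unknown) correct labels
u = rng.random((N, 1))                              # sample given labels y ~ Q[z]
y = np.minimum((Q[z].cumsum(axis=1) < u).sum(axis=1), k - 1)

# Build the alpha-increment DILN: relabel a fraction alpha_j of the instances
# with given label j, uniformly into the remaining k - 1 classes.
alpha = np.array([0.2, 0.4, 0.0, 0.6])
y2 = y.copy()
for j in range(k):
    idx = np.flatnonzero(y == j)
    sel = rng.choice(idx, size=int(alpha[j] * len(idx)), replace=False)
    y2[sel] = rng.choice([c for c in range(k) if c != j], size=len(sel))

# Empirical transition matrix of D_alpha versus the expected-value formula
# Q'_ij = q_ij (1 - alpha_j) + sum_{t != j} q_it alpha_t / (k - 1).
emp = np.array([np.bincount(y2[z == i], minlength=k) / (z == i).sum()
                for i in range(k)])
off = (np.ones((k, k)) - np.eye(k)) / (k - 1)
expected = Q * (1 - alpha) + (Q * alpha) @ off
```

With N = 60,000 instances, the empirical matrix matches the formula to within sampling error.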

2.3. UNDERLYING INTUITION FOR DISTINGUISHING BASELINE DILNS

Suppose we have a discriminator Φ that is able to predict whether an input DILN is a baseline DILN of D. We could try, as input to Φ, an α-increment DILN D′ of D for different values of α. For any given α, we could try, as input to Φ, multiple observed values of the random DILN D_α. Intuitively, the values of α for which "most" of the observed values of D_α are predicted by Φ to be baseline DILNs give non-trivial information about Q_D. In this subsection, we explain the underlying intuition for how Φ can be trained without requiring any knowledge of the correct labels Z.

Let D_0 be a baseline DILN, and let D_1, ..., D_ℓ be α-increment DILNs for α = α^{(1)}, ..., α^{(ℓ)}, respectively. Intuitively, if α^{(1)}, ..., α^{(ℓ)} cover a sufficiently wide range of vectors, then most of the DILNs among D_1, ..., D_ℓ would have entropies that are not near maximum entropy, and hence would (in principle) be distinguishable from baseline DILNs. Ideally, we would want to train a discriminator Φ using baseline DILNs as "positive" data and α-increment DILNs as "unlabeled" data, via some positive-unlabeled learning algorithm. However, having entire datasets as training data (i.e. where each DILN is a single datapoint for training Φ) may not necessarily be a good representation for the training data. Hence, we introduce the notion of a separable random function g, where effectively, we use g to generate the training data for Φ (by applying g to the DILN). Our definition of the "separability" of g relies on (a suitable analog of) the asymptotic equipartition property (AEP) from information theory, which is a key ingredient in proving Shannon's channel coding theorem. The AEP can be interpreted as a rigorous formulation of the idea that a "typical" sequence of observed values of r i.i.d. random variables would belong to a tiny fraction of all possible sequences of observed values, when r is sufficiently large.
This notion of "typicality" can be explained with the example of flipping biased coins. Suppose we have two coins, with probabilities 0.1 and 0.6 respectively of landing heads. Toss each coin a total of r times, and record the corresponding sequence of outcomes. Now repeat the process multiple times to get multiple sequences, each of length r. If r is sufficiently large, then with high probability, a randomly selected sequence for the first (resp. second) coin would have heads for ≈ 0.1 (resp. ≈ 0.6) of the outcomes; such sequences form only a vanishingly small fraction of all possible sequences. Hence, if we repeatedly generate these sequences, then with high probability, we would be able to distinguish our two coins.

We define g to be an R^d-valued random function. If g(D_0) is invoked r times, then we get a randomly generated sequence of length r, where each entry is a vector in R^d; this resulting output sequence shall be treated as a single datapoint in the "positive" class for training Φ. Analogously, for each 1 ≤ i ≤ ℓ, we shall invoke g(D_i) a total of r times, and treat the resulting output sequence (of length r) as a single "unlabeled" datapoint for training Φ. We repeat this process to generate our training data for Φ. If r is sufficiently large, then with high probability, each output sequence (datapoint for Φ) would be a typical sequence whose statistics are determined by the input DILN. Informally, we define g to be "separable" if distinct DILNs have distinguishable typical sequences for sufficiently large r.

A discriminator for D is a prediction model Φ that takes any D′ in U[D] (the set of derived DILNs of D) as its input and gives a score Φ(D′) in [0, 1], which can be interpreted as the likelihood that D′ is a baseline DILN of D. We say that D′ is predicted "positive" if Φ(D′) ≥ 0.5, and predicted "negative" otherwise. Let Φ^+ denote the subset of U[D] on which Φ predicts positive.
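The coin-flipping intuition above can be made concrete with a short simulation (the parameter values are ours): for large r, almost every sequence from the first coin has head-fraction within ε of 0.1, while sequences from the second coin essentially never fall in that band.

```python
import numpy as np

rng = np.random.default_rng(1)
r, trials = 2000, 500        # sequence length; number of sequences per coin
p0, p1 = 0.1, 0.6            # head probabilities of the two coins
eps = 0.05                   # half-width of the "typical" band around p0

frac0 = rng.binomial(r, p0, size=trials) / r   # head-fractions, coin 0
frac1 = rng.binomial(r, p1, size=trials) / r   # head-fractions, coin 1

typical0 = np.mean(np.abs(frac0 - p0) <= eps)  # coin-0 sequences typical for coin 0
overlap = np.mean(np.abs(frac1 - p0) <= eps)   # coin-1 sequences typical for coin 0
```

With r = 2000, typical0 is essentially 1 while overlap is essentially 0, so the two coins are distinguishable from their typical sequences.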

2.4. DISCRIMINATORS

What is a "typical" value of f_matrix(D̄)? Notice that all k^N possible relabelings of D could occur as the labeling of a randomly generated baseline DILN of D, so the set of possible outcomes for f_matrix(D̄) is the entire set of all possible noise transition matrices. Intuitively, we know, for example, that the identity matrix is a "non-typical" value for f_matrix(D̄), even though its occurrence is possible. We shall adapt the notion of typical sets from information theory to capture this intuition of "typicality".

Definition 2.1. Given V a random derived DILN of D, let V_1, V_2, ... be an infinite sequence of i.i.d. random derived DILNs of D, with the same distribution as V.
(i) For any ε > 0 and integer n ≥ 1, the n-fold ε-typical set of V is defined to be the set Λ_ε^{(n)}(V) consisting of all sequences (D_1, ..., D_n) ∈ U[D]^n of observed values of V_1, ..., V_n, with the property

E[H(V)] − ε ≤ (1/n) Σ_{1≤t≤n} H(D_t) ≤ E[H(V)] + ε.

(ii) Consider an arbitrary function g : U[D] → R^d. Note that each g(V_i) is an R^d-valued random variable. For any ε > 0 and integer n ≥ 1, the n-fold ε-typical set of g(V) is defined to be the set Λ_ε^{(n)}(g(V)) consisting of all sequences (u_1, ..., u_n) ∈ R^d × ··· × R^d ≅ R^{dn} of observed vectors of g(V_1), ..., g(V_n), with the property

E[g(V)] − ε·1_d ≤ (1/n) Σ_{1≤t≤n} u_t ≤ E[g(V)] + ε·1_d (coordinate-wise).

Remark. We say that g is D_0-separable if for every ε > 0, δ > 0, and every D_1 ∈ U[D] satisfying |H(D_0) − H(D_1)| > δ, there exists some sufficiently large n such that Λ_ε^{(n)}(g(D_0)) and Λ_ε^{(n)}(g(D_1)) are disjoint typical sets. We say that g is separable if g is D_0-separable for all D_0 ∈ U[D].

Definition 2.4. Let β > 0, let D_0 ∈ U[D], and let g be a D_0-separable R^d-valued random function on U[D]. We say that a discriminator Φ for D is trained n-fold on (D_0, g) with threshold β if

Φ^+ = {D′ ∈ U[D] : Λ_β^{(n)}(g(D_0)) ∩ Λ_β^{(n)}(g(D′)) ≠ ∅}.

Definition 2.5.
An α-sequence for D is a (finite or infinite) sequence (α^{(1)}, α^{(2)}, ...) of distinct vectors in [0, (k−1)/k)^k that satisfies α^{(i)} ≤ α^{(j)} (coordinate-wise inequality) for all i < j. An α-sequence is called valid if it is a (possibly finite) subsequence of an infinite α-sequence (α^{(1)}, α^{(2)}, ...) whose set of elements {α^{(i)}}_{i=1}^∞ is a dense subset of [0, (k−1)/k]^k.

Our estimator for Q_D relies on the existence of a separable random function g on U[D]. Once we find such a g, we can train multiple discriminators Φ using multiple randomly generated baseline DILNs of D, to get multiple intermediate estimates for Q_D. We use each discriminator Φ to find a suitable α ∈ [0, 1]^k such that Φ gives a high score to a "typical" α-increment DILN of D. We then use this value of α to compute an intermediate estimate for Q_D. The average of these intermediate estimates is our final estimate Q̂_D for Q_D; see Algorithm 1. More details are found in Appendix B.

Theorem 2.6. Let Q̂_D be the final averaged output matrix from Algorithm 1, which takes as its inputs integers r, m, n, ℓ ≥ 1, a threshold β > 0, a separable R^d-valued random function g on U[D], and a valid α-sequence Ω = (α^{(1)}, ..., α^{(ℓ)}) for D. Then Q̂_D converges in probability to Q_D as r → ∞, m → ∞, n → ∞, ℓ → ∞, and N → ∞.

Informally, the input integers r, m, n, ℓ can be interpreted as follows: ℓ is the length of the input α-sequence; m is the number of baseline DILNs generated; r is the length of the sequences that each discriminator Φ is trained on; and n is the number of observed values of D_α (for some optimal α contained in the input α-sequence) that are predicted positive by Φ. Theorem 2.6 tells us that our estimator (i.e. Algorithm 1) is consistent; see Corollary B.16 in the appendix for a more refined statement.
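The membership condition defining the typical set in Definition 2.1(ii) amounts to a coordinate-wise check on the sample mean. A minimal sketch (the Gaussian g-values and parameters here are stand-in assumptions, not the paper's g):

```python
import numpy as np

def in_typical_set(us, mean_gV, eps):
    """Check (u_1, ..., u_n) against Definition 2.1(ii): the coordinate-wise
    sample mean must lie within eps of E[g(V)] in every coordinate."""
    avg = np.asarray(us, dtype=float).mean(axis=0)
    return bool(np.all(np.abs(avg - np.asarray(mean_gV)) <= eps))

rng = np.random.default_rng(2)
mean_gV = np.array([0.3, 0.7])                         # assumed E[g(V)]
us = rng.normal(mean_gV, 0.1, size=(400, 2))           # n = 400 observed g-values
inside = in_typical_set(us, mean_gV, eps=0.05)         # typical with high prob.
outside = in_typical_set(us + 1.0, mean_gV, eps=0.05)  # shifted far away
```

By the law of large numbers, a genuine i.i.d. sequence lands inside the band with probability tending to 1 as n grows, which is what the consistency proof exploits.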
Roughly speaking, our proof of Theorem 2.6 involves a careful iterated use of typical sequences, and requires an analog of the joint AEP for DILNs, as well as a notion of "transverse entropy", which has no corresponding analog in the usual notion of entropy for random variables (see Appendix B.1). Although our consistency result requires the limit N → ∞ (recall that N is the number of instances in D), a careful analysis of our proof reveals that for fixed N (with r → ∞, m → ∞, n → ∞, ℓ → ∞), we have an explicit upper bound on the estimation error of our estimator; see Corollary B.15. In the next section, we introduce a suitable candidate for g.

Algorithm 1: An overview of the general framework for estimating Q_D
Require: integers r, m, n, ℓ ≥ 1.
Require: threshold β > 0; g: a separable R^d-valued random function on U[D].
Require: Ω = (α^{(1)}, ..., α^{(ℓ)}) ⊆ [0, 1]^k, a valid α-sequence for D.
1: Initialize empty list L.
2: for ς = 1 ... m do
3:   Generate an observed value D′ = D^{(ς)}_0 of D̄.
4:   for s = 1 ... ℓ do
5:     Generate n independent observed values D^{(ς)}_{s,1}, D^{(ς)}_{s,2}, ..., D^{(ς)}_{s,n} of D_{α^{(s)}}.
6:   Let Φ_ς be a discriminator trained r-fold on (D^{(ς)}_0, g) with threshold β.
7:   Compute s* := min{s : 1 ≤ s ≤ ℓ, there exists 1 ≤ t ≤ n such that D^{(ς)}_{s,t} ∈ Φ^+_ς}.
8:   if s* exists (i.e. s* is well-defined) then
9:     for t = 1 ... n do
10:      Generate observed values for the random variables E_{i,j}, E′_{i,j} (for all 1 ≤ i, j ≤ k). #[Note: E_{i,j}, E′_{i,j} are defined in Q_{D̄}, Q_{D_{α^{(s*)}}} respectively.]
11:      Solve the system of linear equations Q_{D̄} = Q_{D_{α^{(s*)}}} (in the k² variables q_{i,j} for 1 ≤ i, j ≤ k). #[Note: We substitute the generated observed values for E_{i,j}, E′_{i,j} into Q_{D̄} = Q_{D_{α^{(s*)}}}.]
12:      if a unique solution to the linear system exists then
13:        Q̂^{(ς)}_t ← [q̂_{i,j}]_{1≤i,j≤k}, where {q_{i,j} = q̂_{i,j}}_{i,j} is the unique solution to the linear system.
14:        Insert the matrix Q̂^{(ς)}_t into the list L.
15: return mean of the matrices in L (this is our estimate Q̂_D for Q_D).
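Steps 10-13 of Algorithm 1 reduce to solving a linear system. In the idealized zero-error case (all E_{i,j} = E′_{i,j} = 0), the system decouples into k row systems q_i A = b_i, where A is the relabeling matrix induced by α. The sketch below (our notation; the particular Q and α are hypothetical) confirms that such a system is uniquely solvable and recovers q exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 4

# Hypothetical true noise transition matrix (rows sum to 1), unknown in practice.
Q_true = rng.dirichlet(3 * np.ones(k), size=k)

# An alpha with distinct coordinates; A[t, j] = 1 - alpha_j if t == j,
# else alpha_t / (k - 1), so that the alpha-increment matrix is Q_true @ A
# in expectation.
alpha = np.array([0.15, 0.35, 0.55, 0.70])
A = np.where(np.eye(k, dtype=bool), 1 - alpha, alpha[:, None] / (k - 1))

B = Q_true @ A                 # the matrix equated at step 11 (zero-error case)

# Row-by-row inversion: each row q_i of Q solves q_i @ A = b_i.
Q_est = np.linalg.solve(A.T, B.T).T
```

Since A is invertible for this α, the k² equations determine all q_{i,j} uniquely.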

3. REALIZATION OF FRAMEWORK USING LID-BASED DISCRIMINATORS

A key challenge in realizing our framework is the construction of good discriminators, which requires a suitable separable random function g. The underlying intuition is that we want g to "separate" datasets with different noise levels; Appendix B.6 elaborates on this intuition. With this intuition in mind, we propose to use Local Intrinsic Dimensionality (LID) scores (Houle, 2013; 2017a; b). The LID score is used in several applications (Amsaleg et al., 2017; Von Brünken et al., 2015; Schubert & Gertz, 2017), and it plays a fundamental role in a 2018 breakthrough in noise detection: It is possible to determine whether a dataset is clean or has label noise by considering LID sequences (Ma et al., 2018b), which are sequences of LID scores; cf. Amsaleg et al. (2015). In particular, LID sequences can even detect adversarial noise (Ma et al., 2018a). An LID score is computed at every training epoch. As observed in Ma et al. (2018b), the LID score of a model has an initial phase: It starts "high" and then generally decreases to a "low" value. Subsequently, its behavior depends on the amount of label noise in the dataset. In the absence of label noise, the LID score remains low. If instead there is "significant" label noise, then the LID score rises (after its initial decrease). Thus, the presence of label noise in a dataset can in principle be detected by a sharp increase in the LID score during the training phase. In this paper, we use LID sequences as a proxy for measuring the entropy of the underlying dataset trained on. Consider a neural network N. Suppose D′ ∈ U[D], and let x_1, ..., x_N be an enumeration of the instances of D′. As we train our neural network on D′, we shall keep track of how the feature vectors of randomly selected instances evolve over the training epochs.
Given an input instance x, the feature vector of x shall mean the output vector of the last hidden layer of N, given the input x; we denote the feature vector of x in epoch j by ω_j(x), and we define Ω_j := {ω_j(x_i)}_{1≤i≤N}. Let s, s′ ≥ 1 be fixed integers. The LID score of a single instance x in epoch j is defined by

LID_j(x; D′) := −( (1/s) Σ_{1≤i≤s} (log r_i(x) − log r_s(x)) )^{−1},

where r_i(x) is the Euclidean distance between ω_j(x) and its i-th nearest neighbor in Ω_j.

An LID-based discriminator is a discriminator Φ trained on LID sequences as its training data. For every synthesized baseline DILN D′ of D, we invoke g_LID(D′) a total of r times, which yields r LID sequences that shall be considered "positive". For each α ∈ [0, 1]^k and each α-increment DILN D′ of D, we similarly invoke g_LID(D′) a total of r times, which yields r LID sequences that shall be considered "unlabeled". Hence, we can generate training data for Φ consisting of "positive" samples and "unlabeled" samples. We can then use any positive-unlabeled learning algorithm to train Φ.

Table 1: Forward KL loss comparisons for symmetric noise matrix estimations with CIFAR-10 as the underlying clean dataset. "AP" means anchor point. We removed anchor-like data up to 70% in the same manner described in (Xia et al., 2019). An average loss reduction from 0.27 to 0.17 is achieved for 40% anchor-like instance removal when comparing with baselines not using clean samples. We also improved MPEIA and GLC by using their estimates as priors. Smaller loss values are bold-faced.
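The LID score above is the maximum-likelihood estimator of Amsaleg et al. (2015). A minimal sketch (the sample data and parameters are ours, not the paper's setup): points lying on a 2-dimensional subspace of R^10 should score near 2, while points filling R^10 should score much higher.

```python
import numpy as np

def lid_score(x, batch, s=50):
    """MLE estimate LID(x) = -( (1/s) * sum_{i<=s} log(r_i / r_s) )^(-1),
    where r_1 <= ... <= r_s are the distances from x to its s nearest
    neighbors among the vectors in `batch` (x itself excluded)."""
    d = np.linalg.norm(batch - x, axis=1)
    d = np.sort(d[d > 0])[:s]
    return float(-1.0 / np.mean(np.log(d / d[-1])))

rng = np.random.default_rng(4)
flat = rng.normal(size=(5000, 10))
flat[:, 2:] = 0.0                       # points on a 2-D subspace of R^10
lid_flat = lid_score(flat[0], flat)     # estimate near the intrinsic dimension 2

full = rng.normal(size=(5000, 10))      # points filling R^10
lid_full = lid_score(full[0], full)     # considerably higher estimate
```

The gap between the two scores is what makes LID a useful proxy: noisier labelings push feature representations toward higher local dimensionality during training.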

4. EXPERIMENTS

Framework implementation details. Let α^{(1)}, ..., α^{(ℓ)} be a sequence of vectors in [0, 1]^k. Let D_0 be a baseline DILN of D, and for each 1 ≤ s ≤ ℓ, let D_s be an α^{(s)}-increment DILN of D. If D_0, D_1, ..., D_ℓ are synthesized using a common random seed ς, then we say that {D_0, D_1, ..., D_ℓ} is a seed collection with seed ς and α-sequence (α^{(1)}, ..., α^{(ℓ)}). In our experiments, we used a fixed α-sequence Ω, where each α = (α_1, ..., α_k) in Ω satisfies α_i ≤ 0.886 for all i for CIFAR-10, and α_i ≤ 0.916 for all i for Clothing1M. We trained each discriminator Φ on the LID sequences obtained from three different seed collections (called a "triple"), where two of them are used for training and the third is used for validation. Our LID-based discriminator Φ is trained using positive-unlabeled bagging (Elkan & Noto, 2008; Mordelet & Vert, 2014), with decision trees as our sub-routine; we used 1000 decision trees. For each derived DILN, we generated 50 LID sequences to be used as training data. Once trained, our discriminator Φ assigns a score to each input DILN D′ based on voting: Again, 50 LID sequences are generated for D′. Each LID sequence is predicted either positive or negative by Φ, and the total number of positive votes, divided by 50, is the final score assigned to D′. After training, if the validation recall τ satisfies τ ≥ 0.9, then the discriminator Φ is further fine-tuned. Details on fine-tuning can be found in Appendix C.2.2.

Datasets. We performed experiments on CIFAR-10 (Krizhevsky et al., 2009) and Clothing1M (Xiao et al., 2015). CIFAR-10 has 50,000 training and 10,000 test images over 10 classes. We manually added two types of instance-independent label noise, symmetric and pairwise, following the label flip settings in (Han et al., 2018). For symmetric noise, we used noise levels 20%, 50% and 80%, while for pairwise asymmetric noise, we used noise levels 20%, 45% and 80%.
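The positive-unlabeled bagging step can be sketched as follows, assuming scikit-learn is available (the toy features, tree depth, and tree count below are illustrative stand-ins, not the settings used in our experiments):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pu_bagging_scores(X_pos, X_unl, n_trees=100, seed=0):
    """PU bagging: each tree is fit on all positives (label 1) plus a random
    subsample of unlabeled points treated as negatives (label 0); unlabeled
    points are scored by averaging the out-of-bag votes they receive."""
    rng = np.random.default_rng(seed)
    n_u = len(X_unl)
    votes, counts = np.zeros(n_u), np.zeros(n_u)
    for t in range(n_trees):
        idx = rng.choice(n_u, size=len(X_pos), replace=True)
        X = np.vstack([X_pos, X_unl[idx]])
        y = np.r_[np.ones(len(X_pos)), np.zeros(len(idx))]
        clf = DecisionTreeClassifier(max_depth=5, random_state=t).fit(X, y)
        oob = np.setdiff1d(np.arange(n_u), idx)   # out-of-bag unlabeled points
        votes[oob] += clf.predict(X_unl[oob])
        counts[oob] += 1
    return votes / np.maximum(counts, 1)

rng = np.random.default_rng(1)
X_pos = rng.normal(2.0, 0.5, size=(200, 2))                  # "positive" data
X_unl = np.vstack([rng.normal(2.0, 0.5, size=(100, 2)),      # hidden positives
                   rng.normal(-2.0, 0.5, size=(100, 2))])    # hidden negatives
scores = pu_bagging_scores(X_pos, X_unl)
```

Hidden positives in the unlabeled pool receive markedly higher average vote scores than hidden negatives, which is the property our voting-based scoring of input DILNs relies on.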
Clothing1M (Xiao et al., 2015) has around 1 million clothing images in 14 classes. The paper (Xiao et al., 2015) also provides a noisy subset whose corresponding Q_D has been manually verified exactly. We estimate Q_D from this subset as a real-life label noise scenario, and refer to this subset as the "Clothing1M subset".

Methods. We compared our method with baselines including (i) S-model (Goldberger & Ben-Reuven, 2016), as well as T-Revision, MPEIA and GLC, which appear in the comparisons below.

Experimental results. We row-normalized all estimates from the baselines, then evaluated them using the (forward) Kullback-Leibler (KL) loss. For both the symmetric and pairwise cases on CIFAR-10, even with 70% anchor-point removal or imbalanced class ratios, our method has the lowest losses averaged over noise levels, compared to all the baselines, and makes further improvements when MPEIA and GLC are used as our priors. For Clothing1M, ours-1 has the lowest KL loss, 0.4903, slightly better than T-Revision with a KL loss of 0.5262; S-model ranked last. Among all methods that used 0.5% clean samples, ours-3 achieved the lowest loss, 0.5311, while its prior GLC has a loss of 0.5957.
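The forward KL loss used above can be sketched as follows, under the assumption (ours, for illustration) that it averages row-wise KL divergences between the true and estimated transition matrices with uniform row weighting:

```python
import numpy as np

def forward_kl_loss(Q_true, Q_est, eps=1e-12):
    """mean_i KL(Q_true[i] || Q_est[i]) over rows of the transition matrices."""
    P = np.clip(np.asarray(Q_true, dtype=float), eps, None)
    Q = np.clip(np.asarray(Q_est, dtype=float), eps, None)
    return float(np.mean(np.sum(P * np.log(P / Q), axis=1)))

k = 10
# Symmetric 50% noise on k = 10 classes: diagonal 0.5, off-diagonal 0.5/9.
Q_true = np.full((k, k), 0.5 / (k - 1)) + (0.5 - 0.5 / (k - 1)) * np.eye(k)
perfect = forward_kl_loss(Q_true, Q_true)                # 0 for a perfect estimate
uniform = forward_kl_loss(Q_true, np.full((k, k), 1 / k))
```

A perfect estimate scores 0, and a trivial uniform estimate of the symmetric-50% matrix scores roughly 0.51, which puts the reported loss values in context.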

5. CONCLUDING REMARKS

This paper focuses on datasets with instance-independent label noise (DILNs), and tackles the problem of estimating the noise transition matrix Q_D of a DILN D. Our main algorithm is the first-ever estimator for Q_D that is proven to be consistent without needing to optimize the classification accuracy of a classifier trained on D. Notably, we do not require clean data or anchor-like instances, and we do not make any assumptions on the structure of Q_D. Thus, a key "takeaway insight" is that Q_D can be accurately estimated in a wide range of scenarios, including classification tasks for which it is inherently difficult to attain high classification accuracy even without label noise. Our consistent estimator is based on a new information-theoretic framework, in which we introduce the notion of entropy for DILNs. A key step in our approach is the training of discriminators to predict whether an input DILN has maximum entropy. Our proof of consistency relies crucially on the notion of "typicality" and the asymptotic equipartition property from information theory.

APPENDIX

This appendix is organized as follows: • Section A gives a detailed treatment of datasets with instance-independent label noise (DILNs). • Section B proves the consistency of our proposed estimator. • Section C provides all the implementation details of our experiments. • Section D describes how our work relates to the information bottleneck theory for deep learning.

A A RIGOROUS FORMALISM FOR DILNS

A dataset D is a set consisting of N ("instance", "given-label") pairs. If we enumerate these pairs as (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), then D is the set of pairs {(x_i, y_i)}_{1≤i≤N}. Across all disciplines that deal with datasets, there is an implicit assumption that the instances of a dataset are sampled from some "true" data distribution. Formally, each instance x is an observed value of some random variable X_true. For classification tasks, given an instance X_true = x, it is assumed that there is a uniquely determined correct label y_x associated to this instance x. Hence, there is a function f_true that assigns each instance x to its correct label y_x. We assume that f_true is completely deterministic and does not involve any randomness. If Z_true is a random variable representing the correct label of a random instance X_true, then Z_true = f_true(X_true). Because we are only given the datapoints of D, we typically consider the restriction of the domain of f_true to the instances of D; we denote this restricted function by f^D_true. In the absence of label noise, the dataset becomes precisely D = {(x_i, y_{x_i})}_{1≤i≤N}. For such "noise-free" datasets D, the goal of learning a classifier from D is to obtain a good approximation f̂_true of the function f^D_true.

When instance-independent label noise is added to the dataset, there is a random function f_noise that is applied to the correct labels y_x of all instances x. This random function f_noise depends only on the input label z, and does not depend on which instance x this label is for. As in the case of f_true, because we are only given the datapoints of D, we again typically consider the restriction of the domain of f_noise to the instances of D; we denote this restricted function by f^D_noise. Assuming that the set of all possible labels is A = {1, ..., k}, this implies that f^D_noise can be decomposed into k random variables Y_1, ...
, Y k , where each Y i is a discrete random variable taking on values in A. The set of probabilities {p Yi (1), . . . , p Yi (k)} would completely determine the distribution of Y i . Thus, f D noise is a function on A given by the map i → Y i . This is a random function that is completely determined by the set of all k 2 probabilities {p Yi (j))} 1≤i,j≤k . The k-by-k matrix Q = [q i,j ] 1≤i,j≤k whose (i, j)-th entry equals p Yi (j) is precisely the noise transition matrix that we consider in our paper, which we have denoted by Q D . Consequently, to "learn" a model of this instance-independent label noise is to obtain a good approximation fnoise for f D noise , which is exactly the same as finding a good approximation QD to the noise transition matrix Q D . Our goal for this paper is to estimate the noise transition matrix Q D from a given dataset D. Thus, one of our main assertions for this paper, that "high classification accuracy is not required for estimating Q D well", can be interpreted as the assertion that we can find a good approximation fnoise for f D noise , even if we are unable to find a good approximation for f D true , or a good approximation for f D noise • f D true . Notice that for existing methods that "learn in the presence of label noise", their underlying goal is to find a good approximation for either f D true or the composite map f D noise • f D true . In contrast, our goal is to find a good approximation for f D noise . Given D a dataset with instance-independent label noise (DILN), let X be a random variable representing an instance of D selected uniformly at random (notice that X = X true ), let Z := f D true (X), and let Y := f D noise (Z). By definition, f D noise is completely determined by the conditional distribution of Y given Z, and the set of all possible labels A; in particular, f D noise does not depend on X or f D true . This explains why the noise model associated to D is denoted by (Y |Z; A). 
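To make the role of Q_D concrete, the following minimal sketch (plain Python, with an illustrative 3-class matrix of our own choosing, not taken from the paper's experiments) shows how an instance-independent noise model acts on correct labels: each noisy label is drawn from the row of Q indexed by the correct label.

```python
import random

# Illustrative 3-class noise transition matrix Q, where Q[z][y] = p(y | z).
# Rows are row-stochastic; this specific matrix is made up for the example.
Q = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]

def apply_noise(correct_labels, Q, rng):
    """Sample a noisy label for each correct label z from row Q[z]."""
    labels = list(range(len(Q)))
    return [rng.choices(labels, weights=Q[z])[0] for z in correct_labels]

rng = random.Random(0)
clean = [0, 1, 2] * 10000
noisy = apply_noise(clean, Q, rng)
# With 0.8 on the diagonal, roughly 80% of labels survive unchanged.
agree = sum(y == z for y, z in zip(noisy, clean)) / len(clean)
```

Note that the sampling depends only on the correct label z, never on the instance x, which is exactly the instance-independence assumption.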
In our paper, we have defined a DILN to be a set. It is the set D = {(x i , y i )} 1≤i≤N , which also has an associated noise model (Y |Z; A). Strictly speaking, a DILN should be defined as a triple (D, Y |Z, A), since we also include the information about the noise model (Y |Z; A) as part of the definition of the DILN. However, for the purpose of this paper, we abuse notation (slightly) and assume that a DILN D includes its associated noise model. To avoid ambiguity, the set consisting of all ("instance", "given-label") pairs (x i , y i ) shall be called the underlying dataset of D, and each such (x i , y i ) pair shall be called a datapoint of D. Of course, we are implicitly assuming that the instances x 1 , . . . , x N are sampled from some "true" distribution (i.e. the distribution of X true ), but we do not need any information involving X true beyond this implicit assumption.

B PROOF OF CONSISTENCY OF PROPOSED ESTIMATOR FOR Q D

The goal of this section is to prove Theorem 2.6, i.e. that our proposed estimator for Q_D is consistent; see Theorem B.14 for a precise (equivalent) formulation of Theorem 2.6. Our consistency proof is essentially an iterated application of suitable analogs of the asymptotic equipartition property (AEP) theorem from information theory. In particular, we will prove a joint AEP theorem for DILNs; see Theorem B.3. The proof of Theorem B.14 requires some preparation, so we first prove several related results in Sections B.1-B.3, before presenting Theorem B.14 in Section B.4. Throughout, let D be a DILN with noise model (Y|Z; A). Without loss of generality, assume that A = {1, . . . , k} satisfies k ≥ 2, assume that D has N instances, and write Q_D = [q_{i,j}]_{1≤i,j≤k}. If {X_n}_{n=1}^∞ is any sequence of random variables, then we write "X_n →p μ" to mean that X_n converges in probability to μ; here, μ may be a scalar or a random variable.

B.1 JOINT ASYMPTOTIC EQUIPARTITION PROPERTY FOR DILNS

Recall that in Definition 2.1, we introduced the notion of "n-fold ε-typical sets" for random derived DILNs. We will also need to define jointly typical sets for random derived DILNs. This involves what we shall call "transverse entropy", which has no corresponding analog in the usual notion of entropy for random variables.

Definition B.1. For derived DILNs D', D'' of D, the transverse entropy of D' and D'' is

H(D' ∧ D'') := −Σ_{i∈A} p_Z(i) Σ_{j∈A} |q^{D'}_{i,j} − q^{D''}_{i,j}| log |q^{D'}_{i,j} − q^{D''}_{i,j}|.

(Recall: we use the convention that 0 log 0 = 0.)

Definition B.2. Let V and W be random derived DILNs of D. Let V_1, V_2, . . . (resp. W_1, W_2, . . . ) be an infinite sequence of i.i.d. random derived DILNs of D with the same distribution as V (resp. W). For any ε > 0 and integer n ≥ 1, the n-fold jointly ε-typical set of V and W is defined to be the set Λ^(n)_ε(V, W) consisting of all pairs of sequences ((D_1, . . . , D_n), (D'_1, . . . , D'_n)) ∈ U[D]^n × U[D]^n of observed values of V_1, . . . , V_n, W_1, . . . , W_n, with the following properties:

E[H(V)] − ε ≤ (1/n) Σ_{t=1}^n H(D_t) ≤ E[H(V)] + ε;   (3)
E[H(W)] − ε ≤ (1/n) Σ_{t=1}^n H(D'_t) ≤ E[H(W)] + ε;   (4)
E[H(V ∧ W)] − ε ≤ (1/n) Σ_{t=1}^n H(D_t ∧ D'_t) ≤ E[H(V ∧ W)] + ε.   (5)

In particular, notice that the restriction of Λ^(n)_ε(V, W) ⊆ U[D]^n × U[D]^n to the first (resp. second) U[D]^n component is a subset of the n-fold ε-typical set of V (resp. W).

Theorem B.3 (Joint AEP for DILNs). With V, W, V_1, V_2, . . . , W_1, W_2, . . . as in Definition B.2, and for any ε > 0,

lim_{n→∞} Pr[ ((D_1, . . . , D_n), (D'_1, . . . , D'_n)) ∈ Λ^(n)_ε(V, W) ] = 1.

Proof. By the weak law of large numbers, (1/n) Σ_{t=1}^n H(D_t) converges in probability to E[H(V)]. Hence, given any ε > 0, there exists an integer n_1 ≥ 1 such that for all integers n > n_1,

Pr[ |(1/n) Σ_{t=1}^n H(D_t) − E[H(V)]| ≥ ε ] ≤ ε/3.   (6)

By a similar argument, given any ε > 0, there exists an integer n_2 ≥ 1 such that for all integers n > n_2,

Pr[ |(1/n) Σ_{t=1}^n H(D'_t) − E[H(W)]| ≥ ε ] ≤ ε/3,   (7)

and there exists an integer n_3 ≥ 1 such that for all integers n > n_3,

Pr[ |(1/n) Σ_{t=1}^n H(D_t ∧ D'_t) − E[H(V ∧ W)]| ≥ ε ] ≤ ε/3.   (8)
Therefore, for all integers n ≥ max{n_1, n_2, n_3}, the probability of the union of the events in (6), (7) and (8) is at most ε, which proves our assertion.

For the rest of this subsection, let D_0 ∈ U[D], and let g be a D_0-separable R^d-valued random function on U[D]. Recall that for β > 0, a discriminator Φ for D is said to be trained r-fold on (D_0, g) with threshold β if the set of positive predictions of Φ is Φ⁺ = {D' ∈ U[D] : Λ^(r)_β(g(D_0)) ∩ Λ^(r)_β(g(D')) ≠ ∅}.

Definition B.4. Let δ' > 0, β > 0, let r ≥ 1 be an integer, and suppose Φ is a discriminator for D that is trained r-fold on (D_0, g) with threshold β. We say that Φ is δ'-sufficient if every D' ∈ U[D] satisfying H(D_0 ∧ D') < δ' is predicted positive, i.e. D' ∈ Φ⁺.

Lemma B.5. For every β, δ > 0, there is a sufficiently large integer r_{β,δ} ≥ 1 such that for all integers n ≥ r_{β,δ}, if Φ is a discriminator for D that is trained n-fold on (D_0, g) with threshold β, then the following implication holds: D' ∈ Φ⁺ ⟹ |H(D_0) − H(D')| ≤ δ.

Proof. Consider any D_1 ∈ U[D] that satisfies |H(D_0) − H(D_1)| > δ. Since g is D_0-separable, it follows from Definition 2.3 that there is an integer r^{D_1}_{β,δ} such that Λ^(n)_β(g(D_0)) ∩ Λ^(n)_β(g(D_1)) = ∅ for all integers n ≥ r^{D_1}_{β,δ}. For each β and δ, define r_{β,δ} := max{r^{D_1}_{β,δ} ∈ Z : D_1 ∈ U[D], |H(D_0) − H(D_1)| > δ}. In particular, r_{β,δ} is well-defined, since U[D] is finite (of size k^N). Now, for any n ≥ r_{β,δ}, suppose that Φ is trained n-fold on (D_0, g) with threshold β. By Definition 2.4, this means that Φ⁺ = {D' ∈ U[D] : Λ^(n)_β(g(D_0)) ∩ Λ^(n)_β(g(D')) ≠ ∅}. By the definition of r_{β,δ}, we thus have Λ^(n)_β(g(D_0)) ∩ Λ^(n)_β(g(D')) = ∅ for all D' satisfying |H(D_0) − H(D')| > δ, which proves the assertion.

Lemma B.6. Let 0 < ε ≤ 1, and let h : [0, k−1] → R be the function given by

h(x) = ((k−1)/k) (1 − x/(k−1)) log(1 − x/(k−1)) + ((1+x)/k) log(1+x).   (9)

Then h(x) is strictly increasing, and h(ε) > 0.
Moreover, if ζ := min{p_Z(i) : 1 ≤ i ≤ k} > 0, and if H(D) ≥ log k − ζh(ε), then |q^D_{i,j} − 1/k| ≤ ε for all 1 ≤ i, j ≤ k.

Proof. First of all, we check that the derivative of h(x) is

h'(x) = −(1/k)[ log(1 − x/(k−1)) − log(1+x) ] = −(1/k) log[ (1 − x/(k−1)) / (1+x) ],

which satisfies h'(x) > 0 for all 0 < x < k−1. (Recall that k ≥ 2 by assumption.) Thus, h(x) is strictly increasing on the closed interval [0, k−1]. In particular, ε > 0 implies that h(ε) > h(0) = 0.

Henceforth, assume ζ > 0, and suppose on the contrary that there exist some 1 ≤ i_0, j_0 ≤ k such that |q^D_{i_0,j_0} − 1/k| > ε. Write q^D_{i_0,j} = (1/k)(1 + ε_j) for all 1 ≤ j ≤ k, and assume without loss of generality that q^D_{i_0,j_0} = (1/k)(1 + ε_{j_0}) with (1/k)ε_{j_0} > ε. This implies that

−Σ_{j≠j_0} q^D_{i_0,j} log q^D_{i_0,j} = −Σ_{j≠j_0} (1/k)(1+ε_j) log[(1/k)(1+ε_j)] = Σ_{j≠j_0} (1/k)(1+ε_j) log k − Σ_{j≠j_0} (1/k)(1+ε_j) log(1+ε_j).

Since Σ_{j=1}^k ε_j = 0 (the rows of Q_D sum to 1), and since the map x ↦ −x log x is concave, it follows from Jensen's inequality (see Cover & Thomas (2012, Thm. 2.6.2)) that

−Σ_{j≠j_0} q^D_{i_0,j} log q^D_{i_0,j} ≤ Σ_{j≠j_0} (1/k)(1+ε_j) log k − ((k−1)/k)(1 − ε_{j_0}/(k−1)) log(1 − ε_{j_0}/(k−1)).   (10)

Note also that

−q^D_{i_0,j_0} log q^D_{i_0,j_0} = −(1/k)(1+ε_{j_0}) log[(1/k)(1+ε_{j_0})] = (1/k)(1+ε_{j_0}) log k − (1/k)(1+ε_{j_0}) log(1+ε_{j_0}).   (11)

Summing (10) and (11) gives us

−Σ_{1≤j≤k} q^D_{i_0,j} log q^D_{i_0,j} ≤ (1/k) Σ_{1≤j≤k} (1+ε_j) log k − ((k−1)/k)(1 − ε_{j_0}/(k−1)) log(1 − ε_{j_0}/(k−1)) − (1/k)(1+ε_{j_0}) log(1+ε_{j_0}) = log k − h(ε_{j_0}).   (12)

Note that q^D_{i_0,j_0} = 1/k + ε_{j_0}/k ≤ 1 implies ε_{j_0} ≤ k−1, and recall that (1/k)ε_{j_0} > ε by assumption, hence h(ε_{j_0}) > h(ε) > 0. It then follows from (12) that

−Σ_{1≤j≤k} q^D_{i_0,j} log q^D_{i_0,j} ≤ log k − h(ε_{j_0}) < log k − h(ε) < log k.

Note also that for all 1 ≤ i ≤ k satisfying i ≠ i_0, Jensen's inequality yields −Σ_{1≤j≤k} q^D_{i,j} log q^D_{i,j} ≤ log k. Thus,

H(D) = −Σ_{i=1}^k p_Z(i) Σ_{j=1}^k q^D_{i,j} log q^D_{i,j} < log k − ζh(ε).   (13)

Since (13) contradicts the condition that H(D) ≥ log k − ζh(ε), we conclude that no such i_0, j_0 exist; therefore H(D) ≥ log k − ζh(ε) implies |q^D_{i,j} − 1/k| ≤ ε for all 1 ≤ i, j ≤ k.
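As a quick numerical sanity check of the properties of h claimed in Lemma B.6 (a sketch, not part of the proof), one can implement h directly and verify that it is strictly increasing on [0, k−1], with h(0) = 0 and h(k−1) = log k:

```python
import math

def h(x, k):
    """h(x) = ((k-1)/k)(1 - x/(k-1)) log(1 - x/(k-1)) + ((1+x)/k) log(1+x),
    with the convention 0 log 0 = 0."""
    a = 1.0 - x / (k - 1)
    term1 = ((k - 1) / k) * (a * math.log(a) if a > 0 else 0.0)
    term2 = ((1.0 + x) / k) * math.log(1.0 + x)
    return term1 + term2

k = 4  # any k >= 2 works; 4 is an arbitrary choice
xs = [i * (k - 1) / 1000 for i in range(1001)]
vals = [h(x, k) for x in xs]
# vals[0] == h(0) == 0, vals[-1] == h(k-1) == log k, and vals is increasing
```

The endpoint value h(k−1) = log k is the fact used in the proof of Theorem B.12 to guarantee that h⁻¹(2δ/ζ) is well-defined.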

B.2 JOINT ASYMPTOTIC EQUIPARTITION PROPERTY FOR DILN MATRICES

Let Mat_{k×k}([0,1]) be the set of all row-stochastic k-by-k matrices. For convenience, a random matrix shall henceforth mean a Mat_{k×k}([0,1])-valued random variable, i.e. we omit the qualifier "row-stochastic" from "random row-stochastic matrix". Let (1/k)_{k×k} denote the matrix in Mat_{k×k}([0,1]) whose k² entries are all equal to 1/k. We shall also use ‖·‖ to denote a norm on Mat_{k×k}([0,1]). All subsequent results still hold for any norm on Mat_{k×k}([0,1]); only certain constants in continuity arguments would change with a different norm. For concreteness, we work with the matrix 1-norm, i.e. ‖[c_{i,j}]_{1≤i,j≤k}‖ := max_{1≤j≤k} Σ_{i=1}^k |c_{i,j}|.

Proposition B.7. Let 0 < ε ≤ 1, and suppose that D^(1), . . . , D^(m) are derived DILNs of D. If ζ := min{p_Z(i) : 1 ≤ i ≤ k} > 0, and if (1/m) Σ_{t=1}^m H(D^(t)) ≥ log k − ζh(ε/k) (where h(x) is defined as in (9)), then

‖ (1/m) Σ_{t=1}^m Q_{D^(t)} − (1/k)_{k×k} ‖ ≤ ε.   (14)

Proof. Define the function σ_Z : Mat_{k×k}([0,1]) → R by [c_{i,j}]_{1≤i,j≤k} ↦ −Σ_{i=1}^k p_Z(i) Σ_{j=1}^k c_{i,j} log c_{i,j}. Note that H(D') = σ_Z(Q_{D'}) for any derived DILN D' of D. Note also that σ_Z is a concave function, so by the generalized Jensen's inequality (see Perlman (1974)),

log k − ζh(ε/k) ≤ (1/m) Σ_{t=1}^m H(D^(t)) = (1/m) Σ_{t=1}^m σ_Z(Q_{D^(t)}) ≤ σ_Z( (1/m) Σ_{t=1}^m Q_{D^(t)} ).

Consequently, writing Q_{D^(t)} = [q^(t)_{i,j}]_{1≤i,j≤k} for each 1 ≤ t ≤ m, it follows from Lemma B.6 that | (1/m) Σ_{t=1}^m q^(t)_{i,j} − 1/k | ≤ ε/k for all 1 ≤ i, j ≤ k; therefore (14) follows from the definition of the matrix 1-norm.

Next, we shall define several Mat_{k×k}([0,1])-valued functions. For every α = (α_1, . . . , α_k) ∈ [0,1]^k, define the function f^α_increment : Mat_{k×k}([0,1]) × Mat_{k×k}([0,1]) → Mat_{k×k}([0,1]) by

( [q_{i,j}]_{1≤i,j≤k}, [ε_{i,j}]_{1≤i,j≤k} ) ↦ [ q_{i,j}(1 − α_j) + Σ_{t≠j} q_{i,t} α_t (1/(k−1)) (1 + ε_{i,j}) ]_{1≤i,j≤k}.

Define the random matrix E_α := [E_{i,j}]_{1≤i,j≤k}, where E_{i,j} is the random variable defined in (1). Notice that by definition, f^α_increment(Q_D, E_α) = Q_{D_α}.
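The map f^α_increment can be written out directly. The following sketch (plain Python, with a made-up 3-class matrix Q and the zero perturbation matrix E for simplicity) computes f^α_increment(Q, E) and illustrates that, when E = 0, the output remains row-stochastic:

```python
def f_increment(Q, E, alpha):
    """f^alpha_increment: (Q, E) -> [ q_ij (1 - alpha_j)
       + (sum_{t != j} q_it alpha_t) * (1/(k-1)) * (1 + e_ij) ]_{i,j}."""
    k = len(Q)
    C = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            spill = sum(Q[i][t] * alpha[t] for t in range(k) if t != j)
            C[i][j] = Q[i][j] * (1 - alpha[j]) + spill * (1 + E[i][j]) / (k - 1)
    return C

# Made-up example: k = 3, uniform alpha, no perturbation (E = 0).
Q = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
E0 = [[0.0] * 3 for _ in range(3)]
alpha = (0.3, 0.3, 0.3)
C = f_increment(Q, E0, alpha)
# each row of C still sums to 1, and mass has moved off the diagonal
```

Row-stochasticity with E = 0 follows because the mass α_j q_{i,j} removed from each entry is redistributed evenly over the other k−1 columns.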
Next, let f^α_solve : Mat_{k×k}([0,1]) × Mat_{k×k}([0,1]) → Mat_{k×k}([0,1]) be the function that is uniquely determined by the map (f^α_increment(Q, E), E) ↦ Q. Note that to compute f^α_solve, we need to solve a system of linear equations: specifically, if Q = [q_{i,j}]_{1≤i,j≤k} and E = [ε_{i,j}]_{1≤i,j≤k}, and if f^α_increment(Q, E) is the given matrix [c_{i,j}]_{1≤i,j≤k}, then Q can be computed by solving the system of k² linear equations in the k² variables {q_{i,j}}_{1≤i,j≤k} given by

c_{i,j} = q_{i,j}(1 − α_j) + Σ_{t≠j} q_{i,t} α_t (1/(k−1)) (1 + ε_{i,j})   (for 1 ≤ i, j ≤ k).   (15)

In general, for any given matrix [c_{i,j}]_{1≤i,j≤k}, if we sample E from the distribution of E_α, then this system of linear equations has a unique solution almost surely. Consequently, f^α_solve([c_{i,j}]_{1≤i,j≤k}, E_α) is well-defined almost surely.

Lemma B.8. Let ε > 0, let α ∈ [0,1]^k, and let n ≥ 1 be an integer.
• Let E_1, E_2, . . . be an infinite sequence of i.i.d. random matrices with the same distribution as E_α, and suppose that E_1 = E'_1, E_2 = E'_2, . . . is a corresponding sequence of observed values.
• Let D^(1)_α, D^(2)_α, . . . be an infinite sequence of i.i.d. random derived DILNs of D with the same distribution as D_α, and suppose that D^(1)_α = D_1, D^(2)_α = D_2, . . . is a corresponding sequence of observed values.

For every integer i ≥ 1, define Q_{E'_i} := f^α_increment(Q_D, E'_i). Then

lim_{n→∞} Pr[ ‖ (1/n) Σ_{i=1}^n Q_{D_i} − (1/n) Σ_{i=1}^n Q_{E'_i} ‖ < ε ] = 1.

Proof. By the weak law of large numbers, and using the definitions of Q_{D_i} and Q_{E'_i}, we have (1/n) Σ_{i=1}^n Q_{D_i} →p E[Q_{D_α}] and (1/n) Σ_{i=1}^n Q_{E'_i} →p E[Q_{D_α}], hence the assertion follows.

Theorem B.9. Let n, m ≥ 1 be integers, and let (α^(1), . . . , α^(m)) be a sequence of m vectors in [0,1]^k. Assume that ζ := min_{1≤i≤k} p_Z(i) > 0. Let ε > 0, and define δ := (1/2) ζ h(ε/(2k)) > 0 (where h(x) is defined in (9)).
• Let D^(1)_0, . . . , D^(m)_0 ∈ U[D] be such that

log k − (1/m) Σ_{j=1}^m H(D^(j)_0) ≤ δ.   (16)
• For every 1 ≤ j ≤ m, let E^(j)_1, E^(j)_2, . . . be an infinite sequence of i.i.d. random matrices with the same distribution as E_{α^(j)}, and suppose that E^(j)_1 = E'^(j)_1, E^(j)_2 = E'^(j)_2, . . . is a corresponding sequence of observed values.
• Let (D^(1)_1, . . . , D^(1)_n) ∈ Λ^(n)_ε(D_{α^(1)}), . . . , (D^(m)_1, . . . , D^(m)_n) ∈ Λ^(n)_ε(D_{α^(m)}) be m sequences such that for all 1 ≤ i ≤ n,

(1/m) Σ_{j=1}^m |H(D^(j)_0) − H(D^(j)_i)| ≤ δ.   (17)

For every 1 ≤ i ≤ n and 1 ≤ j ≤ m, define Q̂_{D^(j)_i} := f^{α^(j)}_solve(Q_{D^(j)_0}, E'^(j)_i). Then

lim_{n→∞} Pr[ ‖ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n Q̂_{D^(j)_i} − Q_D ‖ < ε ] = 1.   (18)

Proof. First of all, (16) yields log k − (1/m) Σ_{j=1}^m H(D^(j)_0) ≤ δ ≤ 2δ = ζh(ε/(2k)), thus it follows from Proposition B.7 that ‖ (1/m) Σ_{j=1}^m Q_{D^(j)_0} − (1/k)_{k×k} ‖ ≤ ε/2. By (16) and (17), we infer that log k − (1/m) Σ_{j=1}^m H(D^(j)_i) ≤ 2δ for all 1 ≤ i ≤ n. So by similarly applying Proposition B.7, we get ‖ (1/m) Σ_{j=1}^m Q_{D^(j)_i} − (1/k)_{k×k} ‖ ≤ ε/2 for all 1 ≤ i ≤ n. Thus, by the triangle inequality, ‖ (1/m) Σ_{j=1}^m (Q_{D^(j)_0} − Q_{D^(j)_i}) ‖ ≤ ε for all 1 ≤ i ≤ n, which implies

‖ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n (Q_{D^(j)_0} − Q_{D^(j)_i}) ‖ ≤ ε.   (19)

Also, by Lemma B.8, we infer that

lim_{n→∞} Pr[ ‖ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n (Q_{D^(j)_i} − Q_{E'^(j)_i}) ‖ < ε ] = 1,   (20)

where Q_{E'^(j)_i} := f^{α^(j)}_increment(Q_D, E'^(j)_i). Consequently, it follows from (19) and (20) that

lim_{n→∞} Pr[ ‖ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n (Q_{D^(j)_0} − Q_{E'^(j)_i}) ‖ < 2ε ] = 1.   (21)

Note that by definition, f^{α^(j)}_solve(Q_{E'^(j)_i}, E'^(j)_i) = f^{α^(j)}_solve(f^{α^(j)}_increment(Q_D, E'^(j)_i), E'^(j)_i) = Q_D. Therefore, by applying the multilinear function f^{α^(j)}_solve(−, E'^(j)_i) to each term in (21), we get (18) as desired.

B.3 ESTIMATION OF Q_D VIA TYPICAL SETS

Define the random matrix E := [E_{i,j}]_{1≤i,j≤k}, where E_{i,j} is the random variable defined in Section 2.2. Given any observed value E' = [ε_{i,j}]_{1≤i,j≤k} of E, we shall write E' + 1/k to denote the matrix [ε_{i,j} + 1/k]_{1≤i,j≤k}.

Lemma B.10. Let ε > 0, and let n ≥ 1 be an integer.
• Let E_1, E_2, . . .
be an infinite sequence of i.i.d. random matrices with the same distribution as E, and suppose that E_1 = E'_1, E_2 = E'_2, . . . is a corresponding sequence of observed values.
• Let D_1, D_2, . . . be observed values of an infinite sequence of i.i.d. random derived DILNs of D with the same distribution as D.

Then

lim_{n→∞} Pr[ ‖ (1/n) Σ_{i=1}^n Q_{D_i} − (1/n) Σ_{i=1}^n (E'_i + 1/k) ‖ < ε ] = 1.

Proof. By the weak law of large numbers, and using the definitions of Q_{D_i} and E'_i + 1/k, we have (1/n) Σ_{i=1}^n Q_{D_i} →p E[Q_D] and (1/n) Σ_{i=1}^n (E'_i + 1/k) →p E[Q_D], hence the assertion follows.

The following theorem is an extension of Theorem B.9 that takes into account predictions from a discriminator for D. This extension involves the gap of D.

Theorem B.12. Let n, m, r ≥ 1 be integers, let (α^(1), . . . , α^(m)) be a sequence of m vectors in [0,1]^k, and assume that ζ := min{p_Z(i) : 1 ≤ i ≤ k} > 0. Let β, δ, ε, ε' > 0 be scalars satisfying gap(D) < δ ≤ (1/2) ζ log k, 0 < ε ≤ δ − gap(D), and ε' := 2k · h⁻¹(2δ/ζ) > 0, where h(x) is defined in (9). Also, let g be a separable R^d-valued random function on U[D].
• Let E_1, . . . , E_n be a sequence of i.i.d. random matrices with the same distribution as E, and suppose that E_1 = E'_1, . . . , E_n = E'_n is a corresponding sequence of observed values.
• For every 1 ≤ j ≤ m, let E^(j)_1, . . . , E^(j)_n be a sequence of i.i.d. random matrices with the same distribution as E_{α^(j)}, and suppose that E^(j)_1 = E'^(j)_1, . . . , E^(j)_n = E'^(j)_n is a corresponding sequence of observed values.
• Let (D^(1)_0, . . . , D^(m)_0) ∈ Λ^(m)_ε(D), and for every 1 ≤ j ≤ m, suppose that Φ_j is a discriminator for D that is trained r-fold on (D^(j)_0, g) with threshold β.
• For every 1 ≤ j ≤ m, suppose that (D^(j)_1, . . . , D^(j)_n) ∈ Λ^(n)_ε(D_{α^(j)}) ∩ (Φ⁺_j)^n.

For every 1 ≤ i ≤ n and 1 ≤ j ≤ m, define Q̂^(j)_i := f^{α^(j)}_solve(E'_i + 1/k, E'^(j)_i). Then there exists a sufficiently large integer r_{β,δ} such that for all r ≥ r_{β,δ},

lim_{n→∞} Pr[ ‖ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n Q̂^(j)_i − Q_D ‖ < 2ε' ] = 1.   (22)

Proof. The proof, although seemingly complicated, follows essentially from unraveling the relevant definitions.
First of all, notice that E[H(D)] − ε ≥ E[H(D)] − δ + gap(D) = log k − δ, which implies that for any (D^(1)_0, . . . , D^(m)_0) ∈ Λ^(m)_ε(D), we have (by the definition of Λ^(m)_ε(D))

log k − (1/m) Σ_{j=1}^m H(D^(j)_0) ≤ δ.   (23)

For each 1 ≤ j ≤ m, it follows from Lemma B.5 that there exists some sufficiently large integer r^(j)_{β,δ} ≥ 1 such that for all r ≥ r^(j)_{β,δ},

D' ∈ Φ⁺_j ⟹ |H(D^(j)_0) − H(D')| ≤ δ.   (24)

Let r_{β,δ} := max{r^(j)_{β,δ} : 1 ≤ j ≤ m}, and henceforth assume that every discriminator Φ_j is trained r-fold on (D^(j)_0, g) with threshold β, for some r ≥ r_{β,δ}. For all 1 ≤ j ≤ m and 1 ≤ i ≤ n, note that D^(j)_i ∈ Φ⁺_j by definition, hence (24) implies that |H(D^(j)_0) − H(D^(j)_i)| ≤ δ, so in particular,

(1/m) Σ_{j=1}^m |H(D^(j)_0) − H(D^(j)_i)| ≤ δ.   (25)

By definition, ε' = 2k · h⁻¹(2δ/ζ). This means that δ = (1/2) ζ h(ε'/(2k)). In particular, recall from Lemma B.6 that h(x) is a strictly increasing (and hence bijective) function with domain [0, k−1], note that h(k−1) = log k, and note that 2δ/ζ ≤ log k by assumption, so the inverse h⁻¹(2δ/ζ) is well-defined. Then for every 1 ≤ j ≤ m, it follows from (23), (25), and Theorem B.9 that

lim_{n→∞} Pr[ ‖ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n f^{α^(j)}_solve(Q_{D^(j)_0}, E'^(j)_i) − Q_D ‖ < ε' ] = 1.   (26)

Note that by Lemma B.10, we have

lim_{n→∞} Pr[ ‖ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n (Q_{D^(j)_0} − (E'_i + 1/k)) ‖ < ε' ] = 1.   (27)

Next, apply the multilinear function f^{α^(j)}_solve(−, E'^(j)_i) to each term in (27); this yields

lim_{n→∞} Pr[ ‖ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n ( f^{α^(j)}_solve(Q_{D^(j)_0}, E'^(j)_i) − f^{α^(j)}_solve(E'_i + 1/k, E'^(j)_i) ) ‖ < ε' ] = 1.   (28)

Since Q̂^(j)_i := f^{α^(j)}_solve(E'_i + 1/k, E'^(j)_i) by definition, it thus follows from (26) and (28) that

lim_{n→∞} Pr[ ‖ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n Q̂^(j)_i − Q_D ‖ < 2ε' ] = 1.

B.4 PROOF OF THEOREM 2.6

Consider a valid α-sequence Ω = (α^(1), . . . , α^(ℓ)) for D. By definition, this means Ω is a subsequence of an infinite sequence (α^(1), α^(2), . . .
) of distinct vectors in [0, (k−1)/k)^k (recall that k is the number of label classes), such that {α^(i)}_{i=1}^∞ is a dense subset of [0, (k−1)/k]^k. Let ε > 0. For any fixed integer n ≥ 1, observe that

lim_{ℓ→∞} Λ^(n)_ε(D_{α^(ℓ)}) ∩ Λ^(n)_ε(D) = Λ^(n)_ε(D), which implies that

lim_{ℓ→∞} |Λ^(n)_ε(D_{α^(ℓ)}) ∩ Λ^(n)_ε(D)| / |Λ^(n)_ε(D)| = 1.   (30)

In contrast, for any fixed integer ℓ ≥ 1, if E[H(D_{α^(ℓ)})] ≠ E[H(D)], then

lim_{ε→0} lim_{n→∞} Λ^(n)_ε(D_{α^(ℓ)}) ∩ Λ^(n)_ε(D) = ∅, which implies that

lim_{ε→0} lim_{n→∞} |Λ^(n)_ε(D_{α^(ℓ)}) ∩ Λ^(n)_ε(D)| / |Λ^(n)_ε(D)| = 0.   (31)

Lemma B.13. Let β, ε > 0, let D_0 ∈ Λ_ε(D) (i.e. D_0 is an ε-typical baseline DILN of D), and let g be a D_0-separable R^d-valued random function on U[D]. Let r ≥ 1 be an integer, let Φ be a discriminator for D that is trained r-fold on (D_0, g) with threshold β, and suppose that Φ is δ'-sufficient for some δ' > 0. Let Ω = (α^(1), α^(2), . . . ) be a valid α-sequence for D. Then there exists some sufficiently large integer ℓ_Φ ≥ 1 such that for all integers ℓ ≥ ℓ_Φ, a randomly generated ε-typical α^(ℓ)-increment DILN of D has non-zero probability of being predicted positive by Φ, i.e.

Pr[ Λ^(r)_ε(g(D_0)) ∩ Λ^(r)_ε(g(D')) ≠ ∅ | D' ∈ Λ^(1)_ε(D_{α^(ℓ)}) ] > 0.   (32)

Proof. Since Φ is δ'-sufficient, we infer that (32) is true if there exists some D' ∈ Λ^(1)_ε(D_{α^(ℓ)}) such that H(D_0 ∧ D') < δ'. Clearly, H(D_0 ∧ D_0) = 0 < δ', so it suffices to show that D_0 ∈ Λ^(1)_ε(D_{α^(ℓ)}). From (30), we conclude that D_0 ∈ Λ^(1)_ε(D_{α^(ℓ)}) is true whenever ℓ is sufficiently large.

Finally, we prove an equivalent (and rather long) reformulation of Theorem 2.6 from the main paper.

Theorem B.14. Let n, m, r, ℓ ≥ 1 be integers, let Ω = (α^(1), α^(2), . . . , α^(ℓ)) be a valid α-sequence for D, and assume that ζ := min{p_Z(i) : 1 ≤ i ≤ k} > 0.
Let β, δ, ε, ε' > 0 be real scalars satisfying gap(D) < δ ≤ (1/2) ζ log k, 0 < ε ≤ δ − gap(D), and ε' := 2k · h⁻¹(2δ/ζ) > 0, where h(x) is defined in (9). Also, let g be a separable R^d-valued random function on U[D].
• Let D^(1), . . . , D^(m) be a sequence of i.i.d. random derived DILNs of D with the same distribution as D, and suppose we have the observed values D^(1) = D^(1)_0, . . . , D^(m) = D^(m)_0.
• For every 1 ≤ s ≤ ℓ and 1 ≤ j ≤ m, let D^(j)_{α^(s),1}, D^(j)_{α^(s),2}, . . . , D^(j)_{α^(s),n} be a sequence of i.i.d. random derived DILNs of D with the same distribution as D_{α^(s)}, and suppose we have the observed values D^(j)_{α^(s),1} = D^(j)_{s,1}, D^(j)_{α^(s),2} = D^(j)_{s,2}, . . . , D^(j)_{α^(s),n} = D^(j)_{s,n}.

For every 1 ≤ j ≤ m, suppose that Φ_j is a discriminator for D that is trained r-fold on (D^(j)_0, g) with threshold β, and suppose that every discriminator Φ_j is δ'-sufficient for some δ' > 0. Then there exists some sufficiently large integer ℓ_{r,δ'} ≥ 1 (which depends on r and δ'), such that for all integers ℓ ≥ ℓ_{r,δ'}, a randomly generated ε-typical α^(ℓ)-increment DILN of D has non-zero probability of being predicted positive by Φ_j for all 1 ≤ j ≤ m. Assume that the given integer ℓ is sufficiently large, i.e. ℓ ≥ ℓ_{r,δ'}. For every 1 ≤ j ≤ m, define

s^(j)_best := min{ s : 1 ≤ s ≤ ℓ, there exists some 1 ≤ i ≤ n such that D^(j)_{s,i} ∈ Φ⁺_j }.   (33)

Let 1 ≤ j_1 < j_2 < · · · < j_{m'} ≤ m be all indices j_t such that s^(j_t)_best ≠ −∞. (By default, we define min ∅ = −∞. Note that m' ≤ m.)
• For every 1 ≤ j ≤ m, let E^(j)_1, . . . , E^(j)_n be a sequence of i.i.d. random matrices with the same distribution as E, and suppose that we have the sequence of observed values E^(j)_1 = E'^(j)_1, . . . , E^(j)_n = E'^(j)_n.
• For every 1 ≤ j ≤ m, let F^(j)_1, F^(j)_2, . . . , F^(j)_n be a sequence of i.i.d. random matrices with the same distribution as E_{α^(s')}, where s' := s^(j)_best, and suppose we have the sequence of observed values F^(j)_1 = F'^(j)_1, F^(j)_2 = F'^(j)_2, . . . , F^(j)_n = F'^(j)_n.

For every 1 ≤ i ≤ n and 1 ≤ j ≤ m, define Q̂^(j)_i := f^{α^(s')}_solve(E'^(j)_i + 1/k, F'^(j)_i), where s' := s^(j)_best.
Then there exist a sufficiently large integer r_{β,δ} ≥ 1 (depending on β and δ) and a corresponding sufficiently large integer ℓ_{r_{β,δ},δ'} ≥ 1 (depending on r_{β,δ} and δ'), such that for fixed r = r_{β,δ}, and for all ℓ ≥ ℓ_{r_{β,δ},δ'},

lim_{m→∞} lim_{n→∞} Pr[ ‖ (1/m') Σ_{t=1}^{m'} (1/n) Σ_{i=1}^n Q̂^(j_t)_i − Q_D ‖ < 2ε' ] = 1.   (34)

Proof. First of all, for each 1 ≤ j ≤ m, Lemma B.13 says that there is some sufficiently large integer ℓ_{Φ_j} such that for all ℓ ≥ ℓ_{Φ_j}, a randomly generated ε-typical α^(ℓ)-increment DILN of D has non-zero probability of being in Φ⁺_j. Thus, we can set ℓ_{r,δ'} := max{ℓ_{Φ_j} : 1 ≤ j ≤ m}. For each 1 ≤ t ≤ m', define α^(j_t)_best := α^(s'), where s' = s^(j_t)_best. Observe that if

(D^(j_t)_{s',1}, . . . , D^(j_t)_{s',n}) ∈ Λ^(n)_ε(D_{α^(j_t)_best}) ∩ (Φ⁺_{j_t})^n,

and if (D^(j_1)_0, . . . , D^(j_{m'})_0) ∈ Λ^(m')_ε(D), then Theorem B.12 yields

lim_{n→∞} Pr[ ‖ (1/m') Σ_{t=1}^{m'} (1/n) Σ_{i=1}^n Q̂^(j_t)_i − Q_D ‖ < 2ε' ] = 1.   (35)

By definition, each D^(j_t)_{s',i} (for 1 ≤ i ≤ n) is already contained in Φ⁺_{j_t}, hence for (35) to hold, we only need to check that (D^(j_t)_{s',1}, . . . , D^(j_t)_{s',n}) ∈ Λ^(n)_ε(D_{α^(j_t)_best}). Now, for any 1 ≤ t ≤ m', it follows from Theorem B.3 that there exists some sufficiently large integer n_0 ≥ 1 such that for all integers n ≥ n_0,

Pr[ ((D^(1)_0, . . . , D^(n)_0), (D^(j_t)_{s',1}, . . . , D^(j_t)_{s',n})) ∈ Λ^(n)_ε(D, D_{α^(j_t)_best}) ] > 1 − ε.

Consequently, if m ≥ n_0, then

lim_{n→∞} Pr[ ‖ (1/m') Σ_{t=1}^{m'} (1/n) Σ_{i=1}^n Q̂^(j_t)_i − Q_D ‖ < 2ε' ] > 1 − ε.

Finally, the choice of ℓ implies that Pr(s^(j)_best ≠ −∞) > 0, hence m' → ∞ as m → ∞, and therefore (34) holds.

Note that the matrix Q̂_D := (1/m') Σ_{t=1}^{m'} (1/n) Σ_{i=1}^n Q̂^(j_t)_i, which appears in (34), is precisely the output matrix of Algorithm 1. This gives the following two corollaries.

Corollary B.15. Let δ' > 0. Assume that the initial conditions (on r, m, n, ℓ, Ω, β, g) given in Algorithm 1 are satisfied.
Assume k ≥ 2, assume that ζ := min{p_Z(i) : 1 ≤ i ≤ k} > 0, and assume that the discriminator Φ_j is δ'-sufficient for all 1 ≤ j ≤ m. If r (depending on β) is sufficiently large, and if ℓ (depending on r and δ') is sufficiently large, then the final output matrix Q̂_D of Algorithm 1 satisfies

lim_{m→∞} lim_{n→∞} Pr[ ‖ Q̂_D − Q_D ‖ < 4k · h⁻¹(2 · gap(D)/ζ) ] = 1.   (36)

Proof. Following the notation of Theorem B.14, note that 2ε' = 4k · h⁻¹(2δ/ζ) > 0 (where h(x) is the strictly increasing function defined in (9)), and note also that δ > gap(D); hence the assertion follows from Theorem B.14 by choosing δ arbitrarily close to gap(D).

Corollary B.16. Let δ' > 0. Assume that the initial conditions (on r, m, n, ℓ, Ω, β, g) given in Algorithm 1 are satisfied. Assume k ≥ 2, and assume that the discriminator Φ_j is δ'-sufficient for all 1 ≤ j ≤ m. Also, assume that for every possible label i ∈ A = {1, . . . , k}, there is at least one instance of D with i as its correct label. If r (depending on β) is sufficiently large, and if ℓ (depending on r and δ') is sufficiently large, then the final output matrix Q̂_D of Algorithm 1 converges in probability to Q_D as m → ∞, n → ∞, and N → ∞. (Here, N is the number of instances in D.)

Proof. This is immediate from Corollary B.15 and Lemma B.11.
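The role of averaging over m and n in the consistency statements can be illustrated with a toy Monte Carlo sketch. Here a made-up 2×2 matrix and a zero-mean perturbation stand in for the intermediate estimates Q̂^(j_t)_i; this is NOT Algorithm 1 itself, only a demonstration that averaging unbiased per-trial estimates drives the matrix 1-norm error toward zero.

```python
import random

Q_true = [[0.9, 0.1], [0.2, 0.8]]  # made-up 2-class transition matrix

def noisy_estimate(rng, scale=0.2):
    # one per-trial estimate: Q_true plus zero-mean uniform perturbation
    return [[q + rng.uniform(-scale, scale) for q in row] for row in Q_true]

def one_norm_diff(A, B):
    # matrix 1-norm: maximum absolute column sum
    return max(sum(abs(A[i][j] - B[i][j]) for i in range(2)) for j in range(2))

rng = random.Random(0)
errs = {}
for trials in (1, 100, 10000):
    ests = [noisy_estimate(rng) for _ in range(trials)]
    avg = [[sum(e[i][j] for e in ests) / trials for j in range(2)]
           for i in range(2)]
    errs[trials] = one_norm_diff(avg, Q_true)
# errs[10000] is close to zero; averaging suppresses the per-trial noise
```

By the weak law of large numbers, the error of the average shrinks on the order of 1/√(number of trials), which is the same mechanism driving the limits in (34) and (36).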

B.5 PRECISE REFORMULATION OF PROPOSED ALGORITHM

In the previous few subsections, we introduced new notation and terminology so that we could give a precise statement of our consistency result (Theorem B.14). Correspondingly, using this new notation and terminology, we also give, in this subsection, an equivalent reformulation of Algorithm 1 from the main paper.

Roughly speaking, Fig. 3 captures the intuition of how to check experimentally whether a candidate R^d-valued random function g on U[D] is approximately separable. Suppose D_{α_1}, . . . , D_{α_ℓ} is a sequence of α-increment datasets for distinct vectors α = α_1, . . . , α_ℓ, i.e. corresponding to different noise levels. Since g is a random function, we can repeatedly call g(D_{α_i}) to get a sequence of vectors in R^d, say of length r (i.e. each of the r entries is a vector). As elaborated in Section 2.3, these length-r sequences, which correspond to different values α_i, are the datapoints for training discriminators; each sequence is a single datapoint for training a discriminator Φ. For our random function g to be separable, the datapoints (sequences) generated using a particular value of α should, with high probability, be distinguishable from datapoints (sequences) generated using a different value of α. Using Fig. 3 as an example, notice that for each considered noise rate a (corresponding to α = (a, . . . , a)), we have plotted a coordinate-wise minimum-to-maximum range of the entries for sequences generated using α = (a, . . . , a), where these sequences associated to D_α are repeatedly generated using multiple random seeds. We call this coordinate-wise minimum-to-maximum range a "band". Notice that different bands, corresponding to different values of α, are already "visually separable" in the given plot.
When we train a discriminator Φ on such sequences (datapoints), the sequences in the red band are treated as "positive" datapoints, while the sequences in the blue bands (of various blue hues) are treated as "unlabeled" datapoints. The goal of Φ is to identify which blue band is "most indistinguishable" from the red band, and then use the α value corresponding to that blue band to infer a single intermediate estimate of the noise transition matrix. For this idea to work, the blue bands (for different values of α) should themselves be distinguishable (i.e. "separable") from each other. By "distinguishable", we mean that a machine learning model (we used decision trees in our experiments) is able to distinguish (i.e. "separate") different blue bands. So informally, a candidate random function g is "separable" if sequences of g(D_α) are (with high probability) distinguishable for different values of α. We have thus experimentally verified (see Fig. 3) that g = g_LID is a suitable function that is empirically (approximately) separable.
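The band-separability check described above can be prototyped in a few lines. In our experiments the classifiers were decision trees; the dependency-free sketch below instead uses a simple learned threshold, and replaces g(D_α) with synthetic one-dimensional "LID-like" scores whose mean shifts with the noise rate a (an assumption made purely for illustration). Two bands are separable if the classifier distinguishes their sequences with high accuracy.

```python
import random

rng = random.Random(0)

def band(a, n=200):
    """Synthetic stand-in for a band of g(D_alpha) values at noise rate a:
    the mean grows with a, the spread is fixed."""
    return [rng.gauss(10.0 * a, 0.5) for _ in range(n)]

low, high = band(0.2), band(0.6)
# "train" a threshold classifier: midpoint of the two band means
thr = (sum(low) / len(low) + sum(high) / len(high)) / 2
acc = (sum(x < thr for x in low) + sum(x > thr for x in high)) / (len(low) + len(high))
# the a = 0.2 and a = 0.6 bands are cleanly distinguishable (acc near 1.0)
```

If instead the two bands overlapped heavily (accuracy near 0.5), the candidate g would fail the separability criterion for that pair of noise rates.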

B.7 STRIKING CONNECTIONS TO SHANNON'S CODING THEOREM

The significance of our proposed information-theoretic framework is not our newly introduced information-theoretic notion H(D) of "the entropy of a DILN" per se, but rather the idea of how this notion H(D) is used to define typical sets of random derived DILNs of D (see Definition 2.1), and correspondingly, how the idea of typical sets is used to define our notion of separable random functions (see Definition 2.3). Generally speaking, H(D) is a compact notation that is very useful for defining typical sets of random derived DILNs of D in Definition 2.1. This is very similar to the scenario of defining typical sets of random variables, in the context of the asymptotic equipartition property (AEP) theorem; see, e.g., Section 3.1 in Cover & Thomas (2012). Notice that the AEP theorem (Thm. 3.1.1 in Cover & Thomas (2012)) is essentially a direct consequence of the weak law of large numbers. Its precise statement, usually formulated in terms of the entropy of a random variable, could equivalently be formulated without any mention of, or any interpretation involving, the notion of information entropy. Similarly, the notion of "typical sets" of random variables makes sense in the general context of probability theory, without necessarily needing any information-theoretic interpretation. Intuitively, we should think of a member of a "typical set" of a random derived DILN of D as a "typical" sequence of DILNs, where "typical" has a precise meaning in terms of our notion of the entropy of DILNs. Just as typical sets (of random variables) are crucial for proving Shannon's coding theorem, our notion of typical sets of random derived DILNs of D is crucial for proving our consistency theorem. And just as we think of channel coding in terms of the entropy of random variables, we should analogously think of label noise estimation in terms of the entropy of DILNs.
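The classical AEP that we are mirroring can be observed numerically in a few lines. The sketch below (a standalone illustration, not part of our framework) draws long i.i.d. Bernoulli(0.3) sequences and checks that the empirical rate −(1/n) log p(x^n) concentrates around the entropy H(X), which is exactly the weak-law-of-large-numbers phenomenon underlying Theorem B.3:

```python
import math
import random

p = 0.3
# entropy of Bernoulli(p), in nats
H = -(p * math.log(p) + (1 - p) * math.log(1 - p))

def neg_log_prob_rate(n, rng):
    """Draw x^n i.i.d. Bernoulli(p) and return -(1/n) log p(x^n)."""
    x = [1 if rng.random() < p else 0 for _ in range(n)]
    logp = sum(math.log(p) if xi else math.log(1 - p) for xi in x)
    return -logp / n

rng = random.Random(0)
rates = [neg_log_prob_rate(100_000, rng) for _ in range(5)]
# each rate is close to H: the sampled sequences are "typical"
```

Our joint AEP for DILNs (Theorem B.3) replaces −(1/n) log p(x^n) by empirical averages of DILN entropies and transverse entropies, but the concentration mechanism is the same.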
As we initially described in Section 2.3, and subsequently detailed in Appendices B.1-B.4, the notions of "typical sets" and the (joint) AEP are crucial for proving our main consistency result, Theorem 2.6. In fact, Theorem 2.6 can be interpreted as an "inverse" analog of (one direction of) Shannon's channel coding theorem. The separable random function g, required as input to our estimator (Algorithm 1), is perhaps analogous to a (random) error-correcting code in channel coding. The idea of "separability" for random functions could serve as a guide for the design of future estimators for Q_D. It is plausible that the convergence rate of our consistent estimator depends on the choice of this separable random function g. Hence, building on the parallelism between separable functions and error-correcting codes, we pose the following question: Could we obtain more efficient estimators for Q_D by designing "better" separable functions for DILNs, analogous to how more efficient error-correcting codes were designed to further improve communication over a noisy channel? We are excited by the interesting questions that naturally arise from this parallelism, especially concerning the design of new and "better" estimators for noise transition matrices and more general noise models, independent of classification accuracy.

C IMPLEMENTATION DETAILS

Algorithm 3, provided below, describes in detail an explicit implementation of our proposed framework to estimate Q_D, based on the use of LID-based discriminators. Broadly, there are three stages in our implementation, and correspondingly, we organize the rest of this section into three subsections. 7: Train a discriminator Φ on the triple to produce a recall τ and a vote sequence. 8: Initialize a fine-tuning list F containing all the triples with recall τ ≥ 0.9. 9: Refine the list F. 10: Initialize an empty list Q. 11: for each triple in F do 12: Fine-tune the discriminator trained on the triple.

13:

If the discriminator is well-trained (i.e. able to produce the required recalls), then add it to Q.

In our experiments, we synthesized α-increment DILNs of D using only "uniform" vectors α, i.e. every vector α^(s) in our α-sequence can be expressed as α^(s) = a_s · 1_k for some 0 ≤ a_s ≤ 1. (Here, 1_k denotes the all-ones vector in R^k.) Our choice to use only "uniform" vectors for α^(s) is an implementation simplification, stemming from our limited computational resources. Ideally, with no computational constraints, an α-sequence containing "non-uniform" vectors α^(s) (i.e. whose entries are distinct) could potentially improve the final estimate, albeit with many more trials. For CIFAR-10, a fixed α-sequence Ω = (α^(1), . . . , α^(ℓ)) is used throughout our experiments (i.e. Ω is fixed, as we consider various random seeds). We restricted every α ∈ Ω to be of the uniform form α = (a, . . . , a).

To further differentiate LID sequences (continuing the list of adjustments from Section C.1.1): (iii) reduce the training batch size, (iv) fix the weights of the last fc layer, (v) do not use softmax and batchnorm; and (vi) do not use random crop during training. See Table 4 for adjustments made when a smaller subset of the original dataset is used. Note that we did not continue using 8192 hidden units in the last fc layer for CIFAR-10 with 70% anchor point removal, because if the batch size is small or the number of hidden units in the last fc layer is increased, then the training time increases significantly. Therefore, 4096 hidden units were used instead. For Clothing1M, we used an ImageNet-pretrained ResNet-18 (PyTorch version). An fc layer with 512 hidden units is added to it (2 fc layers in total), and the last fc layer is fixed, i.e. its weights are not updated during training. We refer to this model as "ResNet19". The initial learning rate is 0.001, which is reduced to 0.0001 at the 5th epoch. We used stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.05.
The model is trained for 6 epochs to gather LID sequences, with batch size 32.


In our experiments, for all datasets, our computation of LID sequences follows the same process as given in (Ma et al., 2018a), with the following exceptions: 1. To remove unwanted randomness in the computed LID sequences, we re-implemented their code in PyTorch. This ensures that any random function we invoke is completely determined by the selected random seed. The unwanted randomness in the original Keras implementation includes (but is not limited to) randomized weight initializations and non-deterministic cuDNN sub-routines. 2. The weights in the last fully-connected (fc) layer of the neural network are fixed throughout training, so that all weights updated during backpropagation are used in the computation of LID sequences. (Recall that LID sequences are computed using the output vectors from the last hidden layer.) 3. In (Ma et al., 2018a), 1280 samples are randomly selected from the whole dataset throughout the training epochs to compute one LID sequence. In contrast, we generated 50 random sets of 1280 sample indices to compute 50 LID sequences for Φ. At each epoch, the LID scores are computed from these fixed samples.
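As a sketch of exception 3 above (in NumPy, with the sizes from the text: 50 index sets of 1280 samples each, drawn from a 50,000-sample dataset; function name hypothetical), the seeded index-set generation might look like:

```python
import numpy as np

def make_index_sets(n_samples: int, n_sets: int = 50, set_size: int = 1280,
                    seed: int = 0) -> np.ndarray:
    """Draw `n_sets` fixed index sets of `set_size` samples each.

    Every random choice is driven by a single seeded Generator, so the
    same seed always reproduces the same index sets (removing the
    "unwanted randomness" discussed above). Each set is drawn without
    replacement, independently of the other sets.
    """
    rng = np.random.default_rng(seed)
    return np.stack([rng.choice(n_samples, size=set_size, replace=False)
                     for _ in range(n_sets)])

# Example: 50 fixed index sets for a CIFAR-10-sized dataset.
idx = make_index_sets(50_000)
```

The same index sets are then reused at every epoch, so each of the 50 LID sequences tracks a fixed subsample across training.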

C.1.2 COMPARISON OF NEURAL NETWORKS USED FOR CIFAR-10 AND CLOTHING1M

To ensure a fair comparison, we chose the same backbone neural network architecture across all baseline methods, and we used a smaller architecture for our method. As far as possible, we tried to use the same number of training epochs across all baseline methods. However, the different methods are rather distinct, and employ various techniques for their estimation of Q_D, such as augmenting the neural network architecture (e.g., one extra "distinguished" softmax layer for S-model) or scheduling multiple loss functions during training (e.g., T-Revision uses 3 loss functions in total). For a more comprehensive overview of the subtle differences across the various methods, please see Tables 5 and 6, which explicitly specify the neural network architecture used, the usage of the neural network, the type of loss functions, and the number of training epochs for different stages (including the computation of priors, if any). Let the recall from the well-trained discriminator be τ, where τ = Pr(−θ ≤ z ≤ θ), or equivalently (1−τ)/2 = Pr(z ≤ −θ). From the standard normal distribution table, θ can be found once τ is given. Therefore E_D = −θ·√((k−1)/(Nk²)). We make the simplification that (1/k)·E_{i,j} ≈ E_D for all (i, j). Both random variables E_{i,j} and E′_{i,j} are assumed to follow multivariate normal distributions, with parameters unknown to us. To further simplify the estimation, we model both random variables with a common random variable E. Experiments have shown this simplification to be reasonable:

THE INFORMATION BOTTLENECK THEORY FOR DEEP LEARNING

The momentous work on the Information Bottleneck (IB) theory for deep learning (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017) has spurred much interest and discussion; see (Gabrié et al., 2018; Saxe et al., 2018). A key aspect proposed by the IB theory is that the training of a deep neural network (DNN) consists of two distinct phases: an "expansion" phase, where the mutual information between layers increases, and a "compression" phase, where the mutual information between layers decreases. The notion of local intrinsic dimensionality (LID), which we used as an essential ingredient for a concrete realization of our proposed framework, was in part motivated by this IB theory. It was reported in (Ma et al., 2018c) that, in their experiments on training DNNs, the LID scores across training epochs also exhibit a similar two-phase phenomenon: an initial decrease in LID score, followed either by an increase in LID score or by a stagnant LID score, depending on whether there is label noise or not, respectively. As part of our experiments, we synthesized multiple datasets with varying entropy, and generated multiple LID sequences for each synthesized dataset. Although our focus is on the estimation of the noise transition matrices of datasets with instance-independent label noise, a by-product of our work is that we observed a very wide spectrum of behavior for LID sequences, for some of which the two-phase phenomenon is not obvious. Notice that in Fig. 3, we have included multiple plots of LID sequences at various entropies, which may be of independent interest to other researchers (especially those working directly on the IB theory). A caveat is that LID scores are distinct from the mutual information between layers, but LID scores could still be interpreted as a measure of model complexity, which presumably has close connections to the generalizability of deep learning.



Both MPEIA and GLC inherently require D_clean. Since our method does not leverage clean data, it would not be fair to directly evaluate our method against them. Instead, ours-2 and ours-3 are intended to show that MPEIA and GLC can be enhanced with our method, without any further clean-data annotation or augmentation. If Q̂_D = [q̂_{i,j}]_{1≤i,j≤k} (resp. Q_D = [q_{i,j}]_{1≤i,j≤k}) is the estimated (resp. true) noise transition matrix for D, then the corresponding (forward) KL loss is defined to be Σ_{i=1}^k p_Z(i) Σ_{j=1}^k q_{i,j} log(q_{i,j}/q̂_{i,j}). Of course, the purpose of learning a classifier from D is to learn a good representation of f_true, and we can only do so given the dataset D. If the underlying distribution of D is not a "good representation" of the distribution of X_true, then any approximation to f^D_true, no matter how accurate, will not approximate f_true well. Henceforth, we assume that D has a "good underlying representation" of the distribution of X_true. Although this CNN is trained for the CIFAR-10 classification task, achieving high classification accuracy is not our goal. Instead, our goal is to gather LID sequences from the training phase. For example, given a dataset with symmetric noise rate 50%, when the 80% uniform α-vector is used, the best test accuracy is around 25%. While training baseline DILNs for LID sequences, the test accuracy is around 10%, which is equivalent to "random guessing" (since the labels in baseline DILNs are by definition chosen uniformly at random). Despite the seemingly low test accuracies, these LID sequences are sufficient for use as training data for the discriminator Φ. In one of our trials, as part of our comparison with MPEIA, we used the CIFAR-10 dataset with 90% of the samples removed from one class. To gather LID sequences for the resulting smaller subset of CIFAR-10, we used the same neural network model architecture as in the case when the dataset is intact.
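The forward KL loss defined above can be computed directly from the two matrices and the prior p_Z; a minimal sketch (function name ours, and adopting the convention that terms with q_{i,j} = 0 contribute 0):

```python
import numpy as np

def forward_kl_loss(Q_true: np.ndarray, Q_est: np.ndarray,
                    p_Z: np.ndarray) -> float:
    """Forward KL loss: sum_i p_Z(i) * sum_j q_ij * log(q_ij / qhat_ij).

    Q_true, Q_est: k x k noise transition matrices (row i is the
    conditional distribution p(y|z=i)); p_Z: length-k prior over the
    correct labels. Assumes all entries of Q_est are strictly positive
    wherever Q_true is positive.
    """
    q, qh = np.asarray(Q_true, float), np.asarray(Q_est, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(q > 0, q * np.log(q / qh), 0.0)
    return float(p_Z @ terms.sum(axis=1))
```

For example, comparing a symmetric-noise matrix against itself gives a loss of 0, and the loss grows as the estimated row distributions drift from the true ones.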



Figure 2: An overview of the proposed information-theoretic framework, using LID-based discriminators as a concrete example. Given an input dataset D with instance-independent label noise, several collections of new datasets are synthesized from D. These synthesized datasets are "derived" via different partial relabelings (and hence have different entropies). The synthesized datasets in each collection are further processed to create training data for a single discriminator. In this illustration, LID sequences from the synthesized datasets become the training data for each discriminator. Every discriminator generates a single intermediate estimate for the noise transition matrix Q D . The final estimate is computed as the mean of all intermediate estimates.

FOR D AND ESTIMATORS FOR Q_D

We now formalize our intuition presented in Section 2.3. Let D[D] be the set of all derived DILNs of D, and let U[D] be the set of all possible underlying datasets for D[D] (i.e. we throw away information about the associated noise models). For notational ease, a DILN D′ could be an element of either D[D] or U[D]. Every D′ in U[D] is uniquely determined by its sequence of given labels (y_1, . . . , y_N), which we call the labeling of D′. We think of (y_1, . . . , y_N) as a relabeling of D, generated from some (possibly unknown) noise model (Y|Z; A). Hence, U[D] is a finite set of size k^N. Formally, a random derived DILN of D is a discrete random variable V : D[D] → U[D], defined on some distribution on D[D]. We shall also define a map f_matrix with domain D[D], given by D′ ↦ Q_{D′}.
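A labeling of a derived DILN can be synthesized by resampling each given label through a noise model. As a sketch, under one plausible reading of the uniform ("symmetric") α-increment model used in our experiments (function name hypothetical):

```python
import random

def relabel(labels, a, k, seed=0):
    """Return a relabeling of `labels` under the uniform vector
    alpha = a * 1_k: each instance independently has its label replaced,
    with probability a, by a label drawn uniformly at random from the k
    classes (so the chance of an actual flip is a * (k-1) / k).

    This is one simple instance-independent noise model; the framework
    allows arbitrary noise models (Y|Z; A). With a = 1, every label is
    uniform at random, which matches the baseline-DILN labeling.
    """
    rng = random.Random(seed)
    out = []
    for y in labels:
        if rng.random() < a:
            out.append(rng.randrange(k))
        else:
            out.append(y)
    return out
```

Since the relabeling is driven by a seeded generator, the same seed reproduces the same derived labeling, matching the per-seed "seed collections" used later.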


The LID score of D′ in epoch j, denoted by LID_j(D′), is the mean of the LID scores of s randomly selected instances in epoch j. If training is done over L epochs, then LID(D′) := (LID_1(D′), . . . , LID_L(D′)) is the LID sequence of D′. We then define the random function g_LID : U[D] → R^L by D′ ↦ LID(D′).
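The per-instance LID score underlying these sequences can be computed with the maximum-likelihood estimator used in (Ma et al., 2018a), from the distances of a point to its nearest neighbors; a sketch (function name ours):

```python
import math

def lid_mle(dists):
    """MLE estimate of local intrinsic dimensionality at a point, from a
    list of positive distances to its nearest neighbors:

        LID = -( (1/s) * sum_i log(r_i / r_max) )^{-1}

    where s is the number of neighbors and r_max is the largest of the s
    distances. Assumes at least one r_i is strictly smaller than r_max
    (otherwise the average log-ratio is 0 and the estimator diverges).
    """
    r_max = max(dists)
    s = len(dists)
    avg_log = sum(math.log(r / r_max) for r in dists) / s
    return -1.0 / avg_log
```

Averaging this score over the s sampled instances at each epoch then gives one entry LID_j(D′) of the LID sequence.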

Definition B.1. Let D and D be derived DILNs of D. The transverse entropy of D and D is

Theorem B.3 (cf. Cover & Thomas (2012, Thm. 7.6.1)). Let V and W be random derived DILNs of D. Let V_1, V_2, . . . (resp. W_1, W_2, . . .) be an infinite sequence of i.i.d. random derived DILNs of D with the same distribution as V (resp. W), and suppose that D_1, D_2, . . . (resp. D′_1, D′_2, . . .) is a corresponding sequence of observed values. Then for any ε > 0, lim n→∞ Pr (D 1 , . .

is a corresponding sequence of observed values. • Let D 1 , D 2 , . . . be an infinite sequence of i.i.d. random derived DILNs of D with the same distribution as D, and suppose that D 1 = D 1 , D 2 = D 2 , . . . is a corresponding sequence of observed values.

hence the assertion follows. By assumption, our dataset D has N instances and k label classes. We shall define the gap of D to be gap(D) := log k − E[H(D)]. Notice that by the definition of D, gap(D) depends only on the values of N and k. Lemma B.11. 0 ≤ gap(D) ≤ log k, and lim N→∞ gap(D) = 0. Proof. By Jensen's inequality (see Cover & Thomas (2012, Thm. 2.6.2)), we have E[H(D)] ≤ H(E[D]) = log k, hence gap(D) ≥ 0. Note that gap(D) ≤ log k, since H(D′) ≥ 0 for all DILNs D′. Finally, the limit lim N→∞ gap(D) = 0 is a direct consequence of the weak law of large numbers.
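The convergence gap(D) → 0 can also be checked numerically. The sketch below assumes one reading consistent with the proof via the weak law of large numbers: H(D) is taken as the Shannon entropy of the empirical label distribution of a uniformly random labeling of N instances over k classes (function names ours):

```python
import math
import random

def empirical_entropy(labels, k):
    """Shannon entropy (in nats) of the empirical label distribution."""
    n = len(labels)
    h = 0.0
    for c in range(k):
        p = labels.count(c) / n
        if p > 0:
            h -= p * math.log(p)
    return h

def estimated_gap(N, k, trials=200, seed=0):
    """Monte Carlo estimate of gap = log k - E[H] for uniformly random
    labelings of N instances over k classes. Since each empirical
    entropy is at most log k, the estimate is always nonnegative."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += empirical_entropy([rng.randrange(k) for _ in range(N)], k)
    return math.log(k) - total / trials
```

For small N the empirical distribution of a uniform labeling is far from uniform, so the gap is large; as N grows, the gap shrinks toward 0, as the lemma asserts.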

Algorithm 2 A precise formulation of the algorithm to estimate Q_D
Require: integers r, m, n, ℓ ≥ 1.
Require: threshold β > 0, and g a separable R^d-valued random function on U[D].
Require: Ω = (α^(1), . . . , α^(ℓ)) ⊆ [0, 1]^k a valid α-sequence with ℓ vectors.
1: Initialize empty list L.
2: for j = 1 . . . m do
3: Let Φ_j be a discriminator trained r-fold on (D^(j)_0, g) with threshold β.

s′ = min{s : 1 ≤ s ≤ ℓ, there exists 1 ≤ i ≤ n such that D^(j)_{s,i} ∈ Φ⁺_j}. #[Note: s′ equals s^(j)_best in Theorem B.14. By default, min ∅ = −∞.] 8: if s′ = −∞ then 9: for i = 1 . . . n do 10:

• Section C.1: Gathering of LID sequences. • Section C.2: Training of LID-based discriminators. • Section C.3: Final computation to estimate Q_D.

Algorithm 3 Implementation details to estimate Q_D.
Require: D, a dataset with instance-independent label noise.
Require: Σ, a collection of random seeds, with |Σ| ≥ 10.
Require: A valid α-sequence Ω = (α^(1), . . . , α^(ℓ)).
#[Stage 1: Gathering of LID sequences]
1: for each random seed ς in Σ do
2: for each α_s in Ω do
3: Generate the α_s-increment DILN D_s.
4: Generate LID sequences for D_s from a neural network.
#[Stage 2: Training of LID-based discriminators]
5: Consolidate all generated LID sequences (from all random seeds) for initial training of the LID-based discriminators.
6: for each triple (i.e. three seed collections) do
7:

Fig. 3 is provided to show that LID sequences are effective in distinguishing datasets with different entropies. We used 4 clean datasets as the underlying dataset D, including CIFAR-10 intact (top plot in Fig. 3).

E_{i,j} = −(k−1)E and E′_{i,j} = (k−2)E. Suppose R = {r_1, . . . , r_m}. For each 1 ≤ s ≤ m, let I_s denote the set of all estimates with associated recall r_s. Therefore, our final estimate for Q_D is the mean, over 1 ≤ s ≤ m, of the means of the estimates in each I_s.

Figure 3: Visualization of LID sequences for clean CIFAR-10 intact as D (top plot) and clean CIFAR-10 with 70% of anchor-like instances removed as D (bottom plot). The x-axis represents epochs, and the y-axis represents individual LID scores (at each epoch). Notice that for each considered noise level a (corresponding to α = (a, . . . , a)), we have a coordinate-wise minimum-to-maximum range of the entries for sequences generated using α = (a, . . . , a), where the sequences associated to D_α are repeatedly generated using multiple random seeds. We call this coordinate-wise minimum-to-maximum range a "band". Blue bands represent LID sequences associated to α-increment DILNs with noise levels 0%, 60%, 80%, 85% and 87.8% (from lightest blue to darkest blue). Red bands represent LID sequences associated to baseline DILNs.

2.2. Given a (non-random) derived DILN D′ of D, and an R^d-valued random function g, we could treat g(D′) equivalently as the composition of a random derived DILN of D with an R^d-valued (non-random) function. Hence for any ε > 0 and integer n ≥ 1, in view of Definition 2.1(ii), the notion of an n-fold ε-typical set of g(D′) is well-defined. Definition 2.3. Let D_0 ∈ U[D], and let g be an R^d-valued random function on U[D]



(i) S-model, which concatenates a neural network (NN) with an extra softmax layer; (ii) Forward (Patrini et al., 2017), which trains an NN and uses anchor-like instances to estimate Q_D; (iii) T-Revision (Xia et al., 2019), which fine-tunes Q_D concurrently with the training of its classifier; (iv) MPEIA (Yu et al., 2018), which estimates mixture proportions, for Q_D, from a fraction of D_clean; and (v) Gold Loss Correction (GLC) (Hendrycks et al., 2018), which trains an NN on the noisy data; the trained NN then computes softmax outputs on D_clean to estimate Q_D. S-model, Forward and T-Revision do not require D_clean, while MPEIA and GLC randomly select 0.5% of the whole dataset as D_clean (in our paper). The NN structure, training losses and training epochs can be found in Table 5 and Table 6 (in the appendix) for CIFAR-10 and Clothing1M, respectively. We used the same training hyper-parameters and data augmentation settings as given in Patrini et al. (2017), except for T-Revision, which follows the settings in Xia et al. (2019) for the respective datasets. Ours-1 took a similar approach as GLC, without D_clean.



solving a system of k² linear equations in k² variables.] #[Note: Unique solution to the linear system exists almost surely.]
return the mean of the matrices in L. (This is our estimate Q̂_D for Q_D.)

B.6 HOW TO CHECK FOR SEPARABLE RANDOM FUNCTIONS?

Compute the mean QD of all means of intermediate estimates. 23: return QD (This is our final estimate for Q D .)

A summary of the main differences while training the CNN to generate LID sequences for CIFAR-10, when different amounts of anchor-like instances are removed. For training, after normalization, a random horizontal flip with probability 0.5 is applied for CIFAR-10 as part of data augmentation. The initial learning rate is 0.01, which is reduced to 0.001 at the 40th epoch for CIFAR-10. We used stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 10⁻⁴. Training is stopped at the 45th epoch for CIFAR-10. For the Clothing1M subset, images are first resized to 256×256. During training, images are randomly cropped to size 224×224, and then a random horizontal flip is applied with probability 0.5. At test time, images are first resized to 256×256, then centrally cropped to 224×224. For normalization, we used mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225] for the 3-channel RGB images.
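The normalization step with the constants above can be sketched in NumPy (the crop/flip augmentations would be applied before this step; constant names are ours):

```python
import numpy as np

NORM_MEAN = np.array([0.485, 0.456, 0.406])  # per-channel RGB mean
NORM_STD = np.array([0.229, 0.224, 0.225])   # per-channel RGB std

def normalize(img: np.ndarray) -> np.ndarray:
    """Channel-wise normalization of an H x W x 3 image with values
    scaled to [0, 1]: subtract the per-channel mean, divide by the
    per-channel std (broadcast over the spatial dimensions)."""
    return (img - NORM_MEAN) / NORM_STD
```

An image whose pixels equal the per-channel means maps to all zeros, which is a quick sanity check on the channel ordering.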

Neural network (and its usage), training losses and training epochs of all the baselines for CIFAR-10. Some methods (S-model, T-Revision, ours-1, ours-2 and ours-3) require a "prior model" to initialize noise transition matrices for the formal estimation, while one method (T-Revision) trains its NN with special losses in its main method.

Neural network (and its usage), training losses and training epochs of all the baselines for Clothing1M subset. Respective prior models and losses are summarized.


Every α ∈ Ω is of the form α = (a, . . . , a), where the values of a used are given as follows: 0%, 30%, 83%, 84%, 85%, 85.4%, 85.8%, 86%, 86.2%, 86.4%, 87.6%, 87.7%, 87.8%, 87.9%, 88.0%, 88.1%, 88.2%, 88.3%, 88.4%, 88.5%, 88.6%. For Clothing1M, we added extra values of a: 90.9%, 91.1%, 91.3%, 91.4%, 91.5%, 91.6%. Notice that these values for a are not uniformly spaced; rather, they are concentrated in the range [83%, 88.6%] (for Clothing1M, [83%, 91.6%]). This is because a high value of a is typically required for the corresponding α-increment DILN of D to have near-maximum entropy, unless D already has a significantly high noise level, e.g. > 80%. (We define the noise level of D to be the probability that a randomly selected instance of D has a given label that differs from the correct label.) To accurately estimate the noise level of D, we recommend choosing a finer α-sequence, if computational resources allow for it. As a compromise, however, we opted to use the above set of values for a in our experiments.

C.1.1 TRAINING OF NEURAL NETWORKS

To generate LID sequences for the CIFAR-10 dataset (Krizhevsky et al., 2009), the following 12-layer convolutional neural network (CNN) is used: • Two convolutional layers with 64 output channels, filter size 3×3 and padding of size 1, each followed by a ReLU activation function; max pooling of size 2×2 is applied after the last ReLU activation. • Two convolutional layers with 128 output channels, filter size 3×3 and padding of size 1, each followed by a ReLU activation function; max pooling of size 2×2 is applied after the last ReLU activation. • Two convolutional layers with 196 output channels, filter size 3×3 and padding of size 1, each followed by a ReLU activation function; max pooling of size 2×2 is applied after the last ReLU activation. • Output vectors are flattened. • A fully connected layer with 1024 (hidden) units, followed by a ReLU activation function. • A fully connected layer with 10 output units, whose weights are set to be non-trainable (i.e. frozen). The above model architecture is used when the dataset size is "relatively" big. When the size is small, for example, fewer than 40,000 samples, the LID sequences generated could be identical even if the levels of label noise present in the DILNs are not similar. For instance, in Fig. 3, the given datasets in both plots are clean, but the top plot corresponds to the intact CIFAR-10 dataset with 50,000 samples, while the bottom plot corresponds to a subset of the CIFAR-10 dataset with 70% of anchor-like instances removed, i.e., with a total of 15,000 samples. For noise levels of 80% and 85%, the LID sequences in the top plot are still "visually separable", except from epoch 9 to epoch 17. In contrast, the LID sequences in the bottom plot for the same noise levels (80% and 85%) overlap with each other across almost all training epochs, until they become stagnant at the 23rd epoch.
We posit that such significant overlaps could vitiate the effectiveness of a discriminator Φ in distinguishing baseline DILNs from non-baseline DILNs, when Φ is trained on such overlapping LID sequences. Therefore, to further differentiate the LID sequences, we used the trials described as follows: (i) increase the number of hidden units in the last fully connected (fc) layer, (ii) increase the number of convolutional layers. The second underlying dataset in Fig. 3 is CIFAR-10 with 70% anchor-like data removal (bottom plot in Fig. 3). For every α-increment DILN (shown in blue) or baseline DILN (shown in red) of D, we generated 5 such DILNs. The noise transition matrices are of symmetric form, and the noise levels inserted are 0%, 60%, 80%, 85% and 87.8% (colored from the lightest blue to the darkest). We computed 50 LID sequences for each DILN, which correspond to 50 datapoints in each epoch. To visualize the 50 sequences of a DILN, only the maximum and the minimum scores are displayed for each epoch. For CIFAR-10, this forms a band with 2 lines over the 45 epochs in total. The same random seed is used for all the plots.

C.2 TRAINING OF LID-BASED DISCRIMINATORS

Each LID-based discriminator Φ is trained using positive-unlabeled bagging (Elkan & Noto, 2008; Mordelet & Vert, 2014), with decision trees used as our sub-routine. We used 1,000 trees, and 50 unlabeled samples are drawn to train each base estimator.
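Positive-unlabeled bagging (Mordelet & Vert, 2014) can be sketched as follows. For self-containment, this sketch swaps the decision-tree base learner used in our experiments for a trivial nearest-centroid classifier, and uses toy sizes instead of 1,000 trees with 50 unlabeled samples per estimator; all function names are ours:

```python
import random

def centroid(rows):
    """Coordinate-wise mean of a list of equal-length tuples/lists."""
    d = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(d)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pu_bagging_scores(positive, unlabeled, n_estimators=100,
                      n_draw=5, seed=0):
    """PU bagging: in each round, draw `n_draw` unlabeled points and
    treat them as negatives against all positives; score every held-out
    (out-of-bag) unlabeled point by the fraction of such rounds in which
    the base learner classifies it as positive. Base learner here:
    nearest centroid, a stand-in for the paper's decision trees."""
    rng = random.Random(seed)
    votes = [0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    pos_c = centroid(positive)
    for _ in range(n_estimators):
        drawn = rng.sample(range(len(unlabeled)), n_draw)
        neg_c = centroid([unlabeled[i] for i in drawn])
        for i, x in enumerate(unlabeled):
            if i in drawn:
                continue  # only out-of-bag points are scored this round
            counts[i] += 1
            if sq_dist(x, pos_c) < sq_dist(x, neg_c):
                votes[i] += 1
    return [v / c if c else 0.0 for v, c in zip(votes, counts)]
```

Unlabeled points resembling the positives accumulate high scores, which is the property the discriminator exploits to find the blue band "most indistinguishable" from the red band.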

C.2.1 INITIAL TRAINING

The LID sequences from α-increment DILNs are treated as unlabeled samples, while those from baseline DILNs are treated as positive. For CIFAR-10, although the maximum noise level a_s injected into D to synthesize a derived DILN of D is 88.6%, we used the LID sequences with a_s up to 88.3% for the initial training. It is recommended to reserve some LID sequences associated to high values of a_s during the initial training. The reason is that it could be hard for a discriminator to attain a good recall, as the positive samples (LID sequences from the baseline DILN) and the large number of negative samples (LID sequences from DILNs with a_s > 0.883) are very similar. However, for Clothing1M, we did not reserve LID sequences with high α values, for ease of training. Recall that the α-increment DILNs and the baseline DILN from the same common random seed are collectively called a "seed collection". Any three different seed collections are called a "triple". The discriminators are trained on the LID sequences from all possible combinations of triples. In our experiments, we used at least 10 random seeds for each matrix in CIFAR-10 and 20 random seeds for Clothing1M. We do not use any augmentation or normalization of the LID sequences when training Φ. A trained discriminator Φ predicts whether an input LID sequence is similar to the LID sequences generated from baseline DILNs. If Φ assigns a score ≥ 0.5 to some LID sequence, then this LID sequence is predicted positive, and one vote is assigned to its respective α^(s); if the score is instead < 0.5, then 0 votes are assigned. The vote sum of α^(s) represents the number of LID sequences a discriminator "considers" to be baseline-LID-sequence-alike. The sequence of vote sums over all the α^(s) in the α-sequence is called a "vote sequence". A trained discriminator thus produces a recall τ and a vote sequence.
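The voting scheme above can be written down directly; a sketch (function names ours) that, given the discriminator's scores for the LID sequences of each α^(s), counts one vote per sequence scored ≥ 0.5:

```python
def vote_sequence(scores_per_alpha):
    """scores_per_alpha: one entry per alpha^(s) in the alpha-sequence,
    each a list of discriminator scores in [0, 1] (one score per LID
    sequence). A sequence scored >= 0.5 is predicted positive
    (baseline-alike) and contributes one vote to its alpha^(s)."""
    return [sum(1 for sc in scores if sc >= 0.5)
            for scores in scores_per_alpha]

def top_voted_index(votes):
    """Index of the alpha^(s) with the highest vote sum (ties -> first)."""
    return max(range(len(votes)), key=lambda i: votes[i])
```

The list returned by `vote_sequence` is exactly the "vote sequence" of the discriminator, and `top_voted_index` picks out the top-voted noise level used later in fine-tuning.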

C.2.2 FINETUNING

After the initial training, all the triples come with their recalls and vote sequences. We select only those with recall above 0.9, which indicates a possibly well-trained discriminator and similar label distributions in both the baseline DILN D_0 and D_s (synthesized from D by injecting noise α_s), and put them in a list F for further fine-tuning. We then further refine the list for a better estimate of Q_D. For CIFAR-10, if the total number of non-zero votes, over all the triples in F, before noise level 0.85 is above 4 (for Clothing1M, the noise-level threshold is 0.86, since Clothing1M has a wider range of α values), we use triples with low recall (≤ 0.92) and discard the remaining ones; otherwise, we use triples with high recall (≥ 0.98). For CIFAR-10 only, if the number of non-zero votes from noise levels 0.85 to 0.86 is more than 4, only recall ≥ 0.98 is used. The intuition is that if the number of non-zero votes before noise level 0.85 (i.e. a_s ≤ 0.85) is not negligible (we say ≥ 4 in this paper), then Q_D could potentially have a high noise level, since a "significant" number of LID sequences from a_s ≤ 0.85 are recognized by Φ as similar to the baseline's. Experiments show that when D has a high noise level, triples with low recall (≤ 0.92) give a better estimate, while if D has a low noise level, then high recall (≥ 0.98) is preferred. Therefore, starting from recall above 0.9, we further narrow down the range of recalls of the triples used for fine-tuning. Let the new range of recalls be R. We then re-consolidate the LID sequences, with the range of the α-sequence adjusted according to R, to fine-tune Φ. Re-consolidation of LID sequences is necessary because different sets of recalls require different α-sequences to estimate Q_D: the higher the recall, the smaller the range of possible vectors we are allowed to use in the α-sequence.
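The recall-selection rule above can be expressed compactly; a sketch with the thresholds from the text (function and parameter names are ours, and the "above 4" cutoff is read as a strict inequality):

```python
def select_recall_band(votes_by_level, recalls, dataset="cifar10"):
    """Decide whether to fine-tune with low-recall or high-recall triples.

    votes_by_level: list of (noise level a_s, total non-zero vote count
    over all triples in F at that level). recalls: recalls of the
    triples in F (all > 0.9 by construction). Returns the subset of
    recalls to keep for fine-tuning.
    """
    cutoff = 0.85 if dataset == "cifar10" else 0.86
    early_votes = sum(v for a, v in votes_by_level if a < cutoff)
    if early_votes > 4:
        # Many baseline-alike votes at low noise levels suggest Q_D has
        # a high noise level: keep the low-recall triples.
        return [r for r in recalls if r <= 0.92]
    # Otherwise Q_D likely has a low noise level: keep high recalls.
    return [r for r in recalls if r >= 0.98]
```

The returned recall band then determines which α-sequence range is re-consolidated for fine-tuning, per the rule that higher recalls admit a narrower range of α vectors.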
If an inappropriate α-sequence is used, then the matrix Q̂_D could have illegal entries (i.e. with values < 0 or > 1). During the initial training for CIFAR-10, LID sequences are consolidated with a^(s) ≤ 0.883, and we then select triples with recall above 0.9. For Clothing1M, the maximum α used for initial training has noise rate 91.6%. Now, with refinement, the range of recalls is narrower, so the respective range of LID sequences has to be adjusted for the later fine-tuning. With the re-consolidated LID sequences, we fine-tune Φ for the same triple, T, five times. Let the new recall during fine-tuning be τ′ and the old recall from the initial training be τ. If τ′ = τ for at least 3 of the 5 runs, we stop fine-tuning; otherwise, we abandon this triple. The discriminator for T is then considered well-trained, and comes together with τ and the new vote sequence. In the new vote sequence, the α_s with the highest vote sum is denoted the top-voted a*. The recall τ and the top-voted a* are used for the estimation of Q_D. For each D, there should be at least 2 well-trained discriminators; otherwise, more random seeds are required to synthesize DILNs of D, generate LID sequences, and train Φ. To learn Q_D, a prior P is used to help define the structure of Q_D. For CIFAR-10, we used ResNet-18 (He et al., 2016), trained on 90% of the given noisy dataset for 20 epochs and validated on the remaining 10% of instances. For Clothing1M, we used an ImageNet-pretrained ResNet-18 with an extra fc layer (512 hidden units), trained and validated for 10 epochs. For the respective datasets, the optimization procedure used is the same as given in (Patrini et al., 2017). The probabilities of the training instances from the softmax layer are recorded, and the probabilities from the epoch with the highest validation accuracy are used to compute the prior.
Let Υ(y_n = i)_j be the j-th output from the neural network's softmax layer for some sample with label i, and let the set of samples with label i be S_i, i.e. S_i = {n : y_n = i}. Consider an arbitrary α ∈ [0, 1]^k. Recall that when we equate the random matrices Q_D = Q_{D_α}, we get one equation for each entry. To simplify the computations in our estimation, for each row we estimate only one entry q_{i,j}, and the remaining entries are set proportional to 1 − q_{i,j} according to the prior. This reduces the estimation "burden" from k² entries to k. The coordinate (i, j) is determined by the location of the maximum entry of the i-th row of the prior: j = argmax_m p_{i,m}. The entry q_{i,t} (for t ≠ j) is substituted with (p_{i,t} / Σ_{k′≠j} p_{i,k′}) · (1 − q_{i,j}). Next, we estimate E_{i,j} and E′_{i,j}. For D, let the binary random variable X_n be 1 if the n-th sample has been assigned a wrong label. The total number of instances in D is N. Hence, X_n is a Bernoulli random variable with parameter (k−1)/k, and the noise level in D is (1/N) Σ_{n=1}^N X_n, where E_D is a random variable that approximates the relabeling process. By the central limit theorem, we infer that

(1/N) Σ_{n=1}^N X_n is approximately normal with mean (k−1)/k and variance (k−1)/(Nk²). Let z be a random variable following the standard normal distribution. Normalizing, we get z = ((1/N) Σ_{n=1}^N X_n − (k−1)/k) / √((k−1)/(Nk²)).
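This normal-approximation computation can be carried out with the standard library's NormalDist; a sketch (function names ours) that recovers θ from the recall τ via τ = Pr(−θ ≤ z ≤ θ), and then evaluates E_D:

```python
import math
from statistics import NormalDist

def theta_from_recall(tau: float) -> float:
    """Solve tau = Pr(-theta <= z <= theta) for theta, using the
    equivalent form (1 - tau)/2 = Pr(z <= -theta) and the inverse CDF
    of the standard normal distribution."""
    return -NormalDist().inv_cdf((1.0 - tau) / 2.0)

def e_d(tau: float, N: int, k: int) -> float:
    """E_D = -theta * sqrt((k - 1) / (N * k^2)), from the normal
    approximation of (1/N) * sum X_n with mean (k-1)/k and variance
    (k-1)/(N k^2)."""
    return -theta_from_recall(tau) * math.sqrt((k - 1) / (N * k * k))
```

For example, a recall of τ = 0.95 gives θ ≈ 1.96, the familiar two-sided 95% quantile of the standard normal.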

