IDENTIFIABILITY OF LABEL NOISE TRANSITION MATRIX

Abstract

The noise transition matrix plays a central role in the problem of learning with noisy labels. Among other reasons, a large number of existing solutions rely on access to it. Identifying and estimating the transition matrix without ground truth labels is a critical and challenging task. When the label noise transition depends on each instance, the problem of identifying the instance-dependent noise transition matrix becomes substantially more challenging. Despite recent works proposing solutions for learning from instance-dependent noisy labels, the field lacks a unified understanding of when such a problem remains identifiable. The goal of this paper is to characterize the identifiability of the label noise transition matrix. Building on Kruskal's identifiability results, we show the necessity of multiple noisy labels in identifying the noise transition matrix for the generic case at the instance level. We further instantiate the results to relate to the successes of state-of-the-art solutions and to show how additional assumptions alleviate the requirement of multiple noisy labels. Our results also reveal that disentangled features are helpful in the above identification task, and we provide empirical evidence.

1. INTRODUCTION

The literature of learning with noisy labels concerns the scenario where the observed label Ỹ can differ from the true one Y. The noise transition matrix T(X), defined as the transition probability from Y to Ỹ given X, plays a central role in this problem. Among many other benefits, knowledge of T(X) has demonstrated its use in performing risk (Natarajan et al., 2013; Patrini et al., 2017a), label (Patrini et al., 2017a), or constraint corrections (Wang et al., 2021a). Beyond these, it also finds applications in ranking small-loss samples (Han et al., 2020) and detecting corrupted samples (Zhu et al., 2021a). On the other hand, applying the wrong transition matrix T(X) can lead to a number of issues. The literature has well-documented evidence that a wrongly inferred transition matrix can lead to performance drops (Natarajan et al., 2013; Liu & Wang; Xia et al., 2019; Zhu et al., 2021c) and a false sense of fairness (Wang et al., 2021a; Liu & Wang). Knowing whether a T(X) is identifiable or not helps understand whether the underlying noisy learning problem is indeed learnable. Prior works have documented challenges in estimating noise transition matrices when the quality of the available training information remains unclear. For instance, in (Zhu et al., 2022) the authors show that when the quality of representations drops, the estimation error in T(X) increases significantly (Figure 1 therein). Other previous references have documented these challenges too (Xia et al., 2019). We have also provided experiments to validate this argument in Appendix C.4. The earlier results have focused on the class- but not instance-dependent transition matrix T(X) ≡ T := [P(Ỹ = j|Y = i)]_{i,j}, ∀X. The literature has provided discussions of the identifiability of T under the mixture proportion estimation setup (Scott, 2015), and has identified an irreducibility condition for inferring the inverse noise rate.
Later works have developed a sequence of solutions to estimate T under a variety of assumptions, including irreducibility (Scott, 2015), anchor points (Liu & Tao, 2016; Xia et al., 2019; Yao et al., 2020a), separability (Cheng et al., 2020), rankability (Northcutt et al., 2017; 2021), redundant labels/tensors (Liu et al., 2020; Traganitis et al., 2018; Zhang et al., 2014), clusterability (Zhu et al., 2021c), among others (Zhang et al., 2021; Li et al., 2021). A recent study (Wei et al., 2021) has empirically shown that the above class-dependent model is not precise in capturing real-world noise patterns; rather, real human-level noise follows an instance-dependent model. Intuitively, the instance X encodes the difficulty of generating its label. This more realistic, flexible, and powerful noise model helps characterize the challenges. We observe a recent surge of different solutions towards solving the instance-dependent label noise problem (Cheng et al., 2020; Xia et al., 2020b; Cheng et al., 2021a; Yao et al., 2021). Some of these results took on the problem of estimating T(X), while others proposed solutions to learn directly from instance-dependent noisy labels. We survey these results in Section 1.1. The question of identifying and estimating T becomes much trickier when the noise transition matrix is instance-dependent: the potentially complicated dependency between X and T(X) renders it even less clear whether solving this problem is viable. Despite the above successes, there lacks a unified understanding of when this learning from instance-dependent noisy labels problem is indeed identifiable and therefore learnable. The mixture of different observations calls for demystifying: (1) Under what conditions are the noise transition matrices T(X) identifiable? (2) When and why do the existing solutions work when handling instance-dependent label noise?
(3) When T(X) is not identifiable, what can we do to improve its identifiability? Providing answers to these questions is the primary focus of this paper. The main contributions of this paper are to characterize the identifiability of instance-dependent label noise, use the characterization to provide evidence for the success of existing solutions, and point out possible directions for improvement. Among other findings, some highlights of the paper are:

1. We find many existing solutions have a deep connection to the celebrated Kruskal's identifiability results that date back to the 1970s (Kruskal, 1976; 1977).

2. Three separate, independent and identically distributed (i.i.d.) noisy labels (random variables) are both necessary and sufficient for instance-level identifiability. This observation echoes the previous successes of tensor-based approaches for identifying hidden models.

3. Disentangled features help with identifiability.

Our paper proceeds as follows. Sections 2 and 3 present our formulation and the highly relevant preliminaries. Section 4 provides characterizations of identifiability at the instance level and lays the foundation for our discussions. Section 5 extends the discussion to different instantiations that help us explain the success of existing solutions. Section 6 provides some empirical observations.

1.1. RELATED WORKS

In the literature of learning with label noise, a major set of works focuses on designing risk-consistent methods, i.e., methods for which performing empirical risk minimization (ERM) with specially designed loss functions on the noisy distribution leads to the same minimizer as performing ERM over the corresponding unobservable clean distribution. The noise transition matrix is a crucial component for implementing risk-consistent methods, e.g., loss correction (Patrini et al., 2017b), loss reweighting (Liu & Tao, 2015), label correction (Xiao et al., 2015), and unbiased losses (Natarajan et al., 2013). A number of solutions were proposed to estimate this transition matrix for class-dependent label noise, which we discussed in the introduction. To handle instance-dependent noise, recent solutions include estimating local transition matrices for different groups of data (Xia et al., 2020b), using confidence scores to revise transition matrices (Berthon et al., 2020), and using the clusterability of the data (Zhu et al., 2021c). More recent works have used causal knowledge to improve the estimation (Yao et al., 2021), and deep neural networks to estimate the transition matrix defined between the noisy label and the Bayes optimal label (Yang et al., 2021). Other works chose to focus on learning from instance-dependent label noise directly, without explicitly estimating the transition matrix (Zhu et al., 2021b; Cheng et al., 2021a; Berthon et al., 2021; Xia et al., 2020a; Li et al., 2020). The identifiability issue with label noise has been discussed in the literature, though not formally treated. Relevant to us are the identifiability results studied in the Mixture Proportion Estimation setting (Scott, 2015; Yao et al., 2020b; Menon et al., 2015). We would like to note that identifiability there was defined for the inverse noise rate, which differs from our focus on the noise transition matrix T.
To the best of our knowledge, no other works specifically address the identifiability of T(X), particularly in an instance-dependent label noise setting. Highly relevant to us are Kruskal's identifiability results (Kruskal, 1976; 1977; Sidiropoulos & Bro, 2000; Allman et al., 2009), which reveal a sufficient condition for identifying a parametric model that links a hidden variable to a set of observed ones. Kruskal's early results were developed in the context of tensors, which later proved to be a powerful tool for learning latent variable models (Sidiropoulos et al., 2017; Zhang et al., 2014; Anandkumar et al., 2014).

2. FORMULATION

We use (X, Y) to denote supervised data in the form of (feature, label) drawn from an unknown distribution over X × Y. We consider a K-class classification problem where the label Y ∈ {1, 2, ..., K} with K ≥ 2. In our setup, we do not observe the clean true label Y, but rather a noisy one, denoted by Ỹ. The generation of Ỹ follows the transition matrix T(X) := [P(Ỹ = j|Y = i, X)]_{i,j=1}^K, a K × K matrix whose (i, j) entry is P(Ỹ = j|Y = i, X). To define identifiability, we denote by Ω an observation space. We first define identifiability for a general parametric space Θ. Denote the distribution induced by the parameter θ ∈ Θ of a statistical model on the observation space Ω as P_θ (Kruskal, 1976; Allman et al., 2009). To give an example, for a fixed X (when considering instance-level identifiability), Ω is simply the outcome space of its associated noisy label Ỹ, i.e., {1, 2, ..., K}. In this case, each θ is the combination of a possible transition matrix T(X) and the hidden prior P(Y|X), which denotes the conditional probability distribution of Y given X; P_θ is then the distribution P(Ỹ|X). Later in Section 4, when we introduce three noisy labels Ỹ1, Ỹ2, Ỹ3 for each X, P_θ is the joint distribution P(Ỹ1, Ỹ2, Ỹ3|X). Identifiability is defined as follows:

Definition 1 (Identifiability). The parameter θ (statistical model) is identifiable if P_θ ≠ P_θ′, ∀θ ≠ θ′.

We define identifiability for the task of learning with noisy labels for an X. Denote by θ(X) := {T(X), P(Y|X)}. P_θ(X) is the distribution over Ω defined by the noise transition matrix T(X) and the prior P(Y|X). To emphasize, Ω is not necessarily the observation space of the noisy label Ỹ only; the exploration of an effective Ω will be one of our focuses.

Definition 2 (Identifiability of T(X)).
For a given X, T(X) is identifiable if P_θ(X) ≠ P_θ′(X) for θ(X) ≠ θ′(X), up to label permutation. Label permutation relabels the label space, e.g., 1 → 2, 2 → 1, and the rows in T(X) swap accordingly. Allowing for label permutation means that our results cover the high noise rate regime. For instance, for a binary classification problem, an 80% noise rate would correspond to a counterfactual 20% one; finding either model would be regarded as identification. In practice, a further restriction, such as requiring the noise rate to not exceed 50%, can help us remove one of the two cases.
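To make the permutation ambiguity concrete, here is a small numeric sketch (all numbers hypothetical): relabeling the hidden classes permutes P(Y|X) and the rows of T(X) together, and the induced noisy-label distribution is unchanged, so the two parameterizations cannot be told apart from observing Ỹ alone.

```python
def noisy_label_dist(prior, T):
    # P(Y~ = j | X) = sum_i P(Y = i | X) * T(X)[i][j]
    K = len(prior)
    return [sum(prior[i] * T[i][j] for i in range(K)) for j in range(K)]

prior = [0.6, 0.4]                    # hypothetical P(Y|X)
T = [[0.8, 0.2],
     [0.3, 0.7]]                      # hypothetical T(X)

perm = [1, 0]                         # relabel class 1 -> 2, 2 -> 1
prior_perm = [prior[i] for i in perm]
T_perm = [T[i] for i in perm]         # rows of T(X) swap with the relabeling

d1 = noisy_label_dist(prior, T)
d2 = noisy_label_dist(prior_perm, T_perm)
# Identical distributions over Y~: the two models differ only by permutation.
assert all(abs(a - b) < 1e-12 for a, b in zip(d1, d2))
```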

3. PRELIMINARY

In this section, we will introduce two highly relevant results on Mixture Proportion Estimation (MPE) (Scott, 2015) and Kruskal's identifiability result (Kruskal, 1976; 1977) .

3.1. PRELIMINARY RESULTS USING IRREDUCIBILITY AND ANCHOR POINTS

The problem of learning from noisy labels ties closely to another problem called Mixture Proportion Estimation (MPE) (Scott, 2015), which concerns the following setting: let F, J, H be distributions defined over a Hilbert space Z that relate to each other as F = (1 − κ*)J + κ*H. The identifiability problem concerns the ability to identify the mixture proportion κ* from observing only F and H. The following identifiability result has been established:

Proposition 1. (Blanchard et al., 2010) κ* is identifiable if J is irreducible with respect to H, i.e., J cannot be written as J = γH + (1 − γ)F′, where 0 < γ ≤ 1 and F′ is another distribution.

Later, the anchor point condition (Yao et al., 2020b), a stronger requirement, was established:

Proposition 2. (Yao et al., 2020b) κ* is identifiable if there exists a subset S ⊆ Z such that H(S) > 0 but J(S)/H(S) = 0, where J(S), H(S) denote the probabilities of S measured by J, H. The above set S is called an anchor set.

A sequence of follow-up works has emphasized the necessity of anchor points in identifying a class-dependent transition matrix T (Xia et al., 2019; Li et al., 2021). Prior work has established the connection between the MPE problem and the learning from noisy labels problem (Yao et al., 2020b), but for the identifiability of an inverse noise rate P(Y|Ỹ), not the noise transition matrix T(X). We reproduce the discussion and fill in the gap. The discussion and results are for class-dependent rather than instance-dependent label noise, i.e., T(X) ≡ T (P(Ỹ|Y, X) ≡ P(Ỹ|Y)), and for a binary classification problem. To follow convention, we assume Y ∈ {−1, +1}. There are two things we need to do: (1) state the noisy label problem as an MPE one; and (2) show that the identifiability of κ* is equivalent to the identifiability of T. We start with the first. We want to acknowledge that this equivalence appeared before in (Yao et al., 2020b; Menon et al., 2015).
We reproduce it here to make our paper self-contained. Denote the inverse noise rates by π̃_+ := P(Y = −1|Ỹ = +1), π̃_− := P(Y = +1|Ỹ = −1), and define π_− = π̃_−/(1 − π̃_+), π_+ = π̃_+/(1 − π̃_−).

Lemma 1. P(X|Ỹ = −1), P(X|Ỹ = +1) relate to P(X|Y = −1), P(X|Y = +1) as follows:

P(X|Ỹ = −1) = π_− · P(X|Ỹ = +1) + (1 − π_−) · P(X|Y = −1),   (1)
P(X|Ỹ = +1) = π_+ · P(X|Ỹ = −1) + (1 − π_+) · P(X|Y = +1).   (2)

Now P(X|Ỹ = +1), P(X|Ỹ = −1) correspond to the observed mixture distributions F, H, while P(X|Y = +1) and P(X|Y = −1) are the two unobserved Js, and π_−, π_+ correspond to the mixture proportion κ*. This establishes the learning with noisy labels problem as two MPE problems, one for each of the two distributions P(X|Ỹ = −1), P(X|Ỹ = +1). Therefore, to formally establish the equivalence between identifying κ* and T, we only need to establish the equivalence between identifying π_−, π_+ and identifying T. Denote by e_+ := P(Ỹ = −1|Y = +1), e_− := P(Ỹ = +1|Y = −1), which determine T for the binary case. We then have:

Theorem 3. Identifying {π_−, π_+} is equivalent to identifying {e_−, e_+}.

The above theorem concludes that the same irreducibility and anchor point conditions proposed under MPE also apply to identifying the noise transition matrix T. This conclusion aligns with previous successes in estimating the class-dependent noise transition matrix T when the anchor point condition is satisfied (Liu & Tao, 2016; Xia et al., 2019; Li et al., 2021). The above result has limitations. Notably, it focuses on two mixed distributions, restricting it to the binary classification setup in the noisy learning setting; we did not find an easy extension to the multi-class classification problem. Secondly, the translation to the noisy learning problem requires the noise transition matrix to stay the same for a distribution of X (e.g., P(X|Ỹ = +1)), instead of providing instance-level understanding for each X.
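The mixture identities in Lemma 1 can be checked numerically. The sketch below uses hypothetical clean class-conditionals over a three-valued X and hypothetical noise rates (none of these numbers come from the paper); it computes P(X|Ỹ = ±1) by Bayes' rule and verifies Eqs. (1) and (2):

```python
# Hypothetical binary model: P(Y), P(X|Y), and class-dependent noise rates.
pY = {+1: 0.6, -1: 0.4}
PX = {+1: [0.7, 0.2, 0.1], -1: [0.1, 0.3, 0.6]}    # P(X|Y) over X in {0,1,2}
e_plus, e_minus = 0.2, 0.3                         # e_+ = P(Y~=-1|Y=+1), e_- = P(Y~=+1|Y=-1)
flip = {(+1, +1): 1 - e_plus, (+1, -1): e_plus,    # P(Y~ | Y)
        (-1, +1): e_minus, (-1, -1): 1 - e_minus}

def p_x_given_noisy(yt):
    # P(X | Y~ = yt) via Bayes' rule.
    w = [sum(pY[y] * flip[(y, yt)] * PX[y][x] for y in (+1, -1)) for x in range(3)]
    z = sum(w)
    return [v / z for v in w]

F, H = p_x_given_noisy(+1), p_x_given_noisy(-1)    # P(X|Y~=+1), P(X|Y~=-1)

# Inverse noise rates and the mixture proportions used in Lemma 1.
tilde_plus = pY[-1] * e_minus / sum(pY[y] * flip[(y, +1)] for y in (+1, -1))
tilde_minus = pY[+1] * e_plus / sum(pY[y] * flip[(y, -1)] for y in (+1, -1))
pi_minus = tilde_minus / (1 - tilde_plus)
pi_plus = tilde_plus / (1 - tilde_minus)

for x in range(3):
    # Eq. (1): P(X|Y~=-1) = pi_- P(X|Y~=+1) + (1 - pi_-) P(X|Y=-1)
    assert abs(H[x] - (pi_minus * F[x] + (1 - pi_minus) * PX[-1][x])) < 1e-9
    # Eq. (2): P(X|Y~=+1) = pi_+ P(X|Y~=-1) + (1 - pi_+) P(X|Y=+1)
    assert abs(F[x] - (pi_plus * H[x] + (1 - pi_plus) * PX[+1][x])) < 1e-9
```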

3.2. KRUSKAL'S IDENTIFIABILITY RESULT

Our results build on Kruskal's identifiability result (Kruskal, 1976; 1977). The setup is as follows: suppose that there is an unobserved variable Z that takes values in an r-sized discrete domain {1, 2, ..., r}. Z has a non-degenerate prior P(Z = i) > 0. Instead of observing Z, we observe p variables {O_i}_{i=1}^p. Each O_i has a finite state space {1, 2, ..., κ_i} with cardinality κ_i. Let M_i be a matrix of size r × κ_i, whose j-th row is simply [P(O_i = 1|Z = j), ..., P(O_i = κ_i|Z = j)]. In this case, [M_1, M_2, ..., M_p] and P(Z = i) are the hidden parameters that control the generation of observations; together, these form our θ. We now introduce the Kruskal rank of a matrix, which plays a central role in Kruskal's identifiability results.

Definition 3 (Kruskal rank). (Kruskal, 1976; 1977) For a matrix M, the Kruskal rank of M is the largest number I such that every set of I rows of M is linearly independent. In this paper, we will use Kr(M) to denote the Kruskal rank of matrix M. To give an example,

M = [ 1 0 0 ]
    [ 0 1 0 ]   ⇒ Kr(M) = 1,
    [ 2 0 0 ]

because the rows [1, 0, 0] and [2, 0, 0] are linearly dependent. We first reproduce the following theorem:

Theorem 4. (Kruskal, 1976; 1977; Sidiropoulos & Bro, 2000) The parameters M_i, i = 1, ..., p are identifiable, up to label permutation, if

Σ_{i=1}^p Kr(M_i) ≥ 2r + p − 1.

The result for p = 3 was first established in (Kruskal, 1977), demonstrating the power of a three-way tensor, and it was later shown in (Sidiropoulos & Bro, 2000) that the proof extends to general p. The proof builds on showing that different parameters θ lead to different stackings of the M_i's: [M_1, ..., M_p]. For example, when p = 3,

[M_1, M_2, M_3] := Σ_{k=1}^r m_k^1 ⊗ m_k^2 ⊗ m_k^3

forms the tensor of the observations, where m_k^i, i = 1, 2, 3 is the k-th row of M_i (the factor associated with hidden state k).
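The Kruskal rank of Definition 3 is straightforward to compute for small matrices by brute force over row subsets. A minimal sketch (our own illustration, not code from the paper), reproducing the example above:

```python
from itertools import combinations

def matrix_rank(rows):
    # Rank via Gaussian elimination (small matrices, tolerance 1e-9).
    A = [list(map(float, r)) for r in rows]
    rank = 0
    for col in range(len(A[0])):
        piv = next((r for r in range(rank, len(A)) if abs(A[r][col]) > 1e-9), None)
        if piv is None:
            continue
        A[rank], A[piv] = A[piv], A[rank]
        for r in range(len(A)):
            if r != rank:
                f = A[r][col] / A[rank][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[rank])]
        rank += 1
    return rank

def kruskal_rank(M):
    # Largest I such that EVERY set of I rows of M is linearly independent.
    for size in range(len(M), 0, -1):
        if all(matrix_rank([M[i] for i in idx]) == size
               for idx in combinations(range(len(M)), size)):
            return size
    return 0

M = [[1, 0, 0],
     [0, 1, 0],
     [2, 0, 0]]
assert kruskal_rank(M) == 1   # rows [1,0,0] and [2,0,0] are dependent
assert matrix_rank(M) == 2    # the ordinary rank is larger
```

Note that Kr(M) ≤ rank(M) always: the ordinary rank only requires some maximal set of rows to be independent, while the Kruskal rank requires every subset of that size to be independent.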

4. INSTANCE-LEVEL IDENTIFIABILITY

This section will characterize the identifiability of T(X) at the instance level.

4.1. THE INSUFFICIENCY OF A SINGLE NOISY LABEL

Consider a binary classification problem with Y ∈ {−1, +1} and a single noisy label Ỹ. Write the instance-dependent transition matrix as

T(X) = [ 1 − e_+(X)   e_+(X)   ]
       [ e_−(X)       1 − e_−(X) ],

where e_+(X) := P(Ỹ = −1|Y = +1, X) and e_−(X) := P(Ỹ = +1|Y = −1, X). Note that using the chain rule we have

P(Ỹ = +1|X) = P(Ỹ = +1|Y = +1, X) · P(Y = +1|X) + P(Ỹ = +1|Y = −1, X) · P(Y = −1|X)
            = (1 − e_+(X)) · P(Y = +1|X) + e_−(X) · P(Y = −1|X).

Consider two cases: (1) P(Y = +1|X) = 1, e_+(X) = e_−(X) = 0.3; and (2) P(Y = +1|X) = 0.7, e_+(X) = 0.1, e_−(X) = 7/30 ≈ 0.233. Both cases return the same P(Ỹ = +1|X) = 0.7, so the distribution of a single noisy label cannot distinguish the two transition matrices. Is the anchor point requirement then necessary for identifying T(X) at the instance level? The discussion in the rest of this section departs from the classical single noisy label setting.
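The two cases above can be checked directly; the marginal of a single noisy label cannot separate them:

```python
def p_noisy_plus(p_clean_plus, e_plus, e_minus):
    # P(Y~=+1|X) = (1 - e_+(X)) * P(Y=+1|X) + e_-(X) * P(Y=-1|X)
    return (1 - e_plus) * p_clean_plus + e_minus * (1 - p_clean_plus)

# Case (1): anchor-like point, P(Y=+1|X) = 1 with symmetric 0.3 noise.
case1 = p_noisy_plus(1.0, 0.3, 0.3)
# Case (2): a genuinely different model, P(Y=+1|X) = 0.7, e_+ = 0.1, e_- = 7/30.
case2 = p_noisy_plus(0.7, 0.1, 7 / 30)

# Both induce the same observable marginal P(Y~=+1|X) = 0.7.
assert abs(case1 - 0.7) < 1e-9 and abs(case2 - 0.7) < 1e-9
```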

4.2. THE NECESSITY OF MULTIPLE NOISY LABELS

Setups. We assume that for each instance X, we have p conditionally independent (given X, Y) and identically distributed noisy labels Ỹ1, ..., Ỹp generated according to T(X). Let us assume for now that we potentially have these labels; later in this section we discuss when having multiple redundant labels is possible, and we connect to existing solutions in the literature in the next section. For each instance X, denote by K_X ≤ K the number of non-degenerate label classes k such that P(Y = k|X) > 0. W.l.o.g., let us assume the non-degenerate classes are simply {1, 2, ..., K_X}. Before we formally present the results for having multiple conditionally independent noisy labels, we offer intuitions. The reason behind this identifiability result ties closely to the latent class model (Clogg, 1995) and tensor decomposition (Anandkumar et al., 2014). When the p noisy labels are conditionally independent given X and Y, the joint distribution factorizes as:

P(Ỹ1, Ỹ2, ..., Ỹp | Y, X) = ∏_{i=1}^p P(Ỹi | Y, X).

That is, the joint distribution of noisy labels can be encoded in a much smaller parameter space! In our setup, when we assume the i.i.d. Ỹi, i = 1, 2, ..., p are generated according to the same transition matrix T(X), the parameter space is fixed and determined by the size of T(X). Yet, as we increase p, the observation space of P(Ỹ1, Ỹ2, ..., Ỹp | Y, X) becomes richer and helps us identify T(X). We now define an informative noisy label.

Definition 4. For a given (X, Y), we call the noisy label Ỹ informative if rank(T(X)) = K_X.

Definition 4 requires that the K_X rows of T(X) are linearly independent.
When the observation space of Ỹ is the same as that of Y (so that T(X) is a square matrix), i.e., the true label Y has full support on the entire label space, the requirement states that T(X) is of full rank, which is already assumed in the literature: e.g., loss correction (Natarajan et al., 2013; Patrini et al., 2017a) requires the matrix to have an inverse T^{-1}(X), which is equivalent to T(X) being full rank. In particular, (Natarajan et al., 2013) requires e_+(X) + e_−(X) < 1, which can easily be shown to imply that T(X) is full rank. But we do not rule out the possibility that T(X) is not square and K_X is much smaller than the entire label space. Our first identifiability result is as follows:

Theorem 5. With i.i.d. noisy labels, three informative noisy labels Ỹ1, Ỹ2, Ỹ3 (p = 3) are both sufficient and necessary to identify T(X) when K_X ≥ 2.

Note that K_X ≥ 2 is easily satisfied as long as there exists uncertainty in P(Y|X).

Proof sketch. We provide the key steps of the proof; the full proof can be found in the supplemental material. We first prove sufficiency by relating our problem setting to Kruskal's identifiability scenario: Y ∈ {1, 2, ..., K_X} corresponds to the unobserved hidden variable Z; P(Y = i) corresponds to the prior of this hidden variable; each Ỹi, i = 1, ..., p corresponds to the observation O_i; κ_i is then simply the cardinality of the noisy label space, K. In the context of this theorem, p = 3 corresponds to the three noisy labels we have. Each Ỹi corresponds to an observation matrix M_i: M_i[j, k] = P(O_i = k|Z = j) = P(Ỹi = k|Y = j, X). Therefore, by the definitions of M_1, M_2, M_3 and T(X), they are all equal to T(X): M_i ≡ T(X), i = 1, 2, 3. When T(X) has rank K_X, we know immediately that the rows of each M_i are linearly independent, so the Kruskal ranks satisfy Kr(M_1) = Kr(M_2) = Kr(M_3) = K_X.
Checking the condition in Theorem 4, we easily verify that Kr(M_1) + Kr(M_2) + Kr(M_3) = 3K_X ≥ 2K_X + 2 whenever K_X ≥ 2. Invoking Theorem 4 proves the sufficiency. To prove necessity, we need to show that fewer than three informative labels do not suffice to guarantee identifiability. The idea is to show that two different sets of parameters T(X) can lead to the same joint distribution P(Ỹ1, Ỹ2|X). We leave the detailed constructions to the supplemental material.

The above result points out that to ensure identifiability of T(X) at the instance level, we need three conditionally independent and informative noisy labels. This result coincides with several recent works that promote the use of three redundant labels (Liu et al., 2020; Zhu et al., 2021c; Zhang et al., 2014). Per our theorem, these proposed solutions have a more profound connection to the identifiability of hidden parametric models, and three labels are not only algorithmically sufficient, but also necessary. This result also echoes the power of tensors (stacking third-order information) in uncovering hidden models (Traganitis et al., 2018; Zhang et al., 2014). Particularly relevant to us is (Zhang et al., 2014), where it was shown that a spectral EM approach using three noisy labels suffices to identify the noise transition matrix of labels. We want to highlight that our proof and results establish both the necessity and sufficiency of having three informative noisy labels, independent of the specific algorithms developed. Another note we want to add is that our main inquiry is establishing the conditions for identifying T(X), rather than proposing algorithms to estimate T(X). The crowdsourcing community has largely focused on soliciting more than one label from crowdsourced workers, yet the learning from noisy labels literature has primarily focused on learning from a single one.
One of the primary motivations of crowdsourcing multiple noisy labels is indeed to aggregate them into a cleaner one (Liu et al., 2012; Karger et al., 2011; Liu & Liu, 2015), which serves as a pre-processing step towards solving the noisy learning problem. Nonetheless, our result demonstrates another significance of having multiple labels: they help the learner identify the underlying true noise transition parameters.
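The flavor of the necessity argument in Theorem 5 can be illustrated numerically. The paper defers its construction to the supplement; below is our own illustrative construction (hypothetical numbers) of two distinct binary models θ = (P(Y|X), T(X)) that induce identical joint distributions over two conditionally independent noisy labels, yet are separated by a third:

```python
import math

def pairwise_joint(p, a, b):
    # P(Y~1 = i, Y~2 = j | X) under conditional independence given (Y, X):
    # sum_y P(Y = y | X) * T[y][i] * T[y][j]
    prior, T = [p, 1 - p], [[a, 1 - a], [b, 1 - b]]
    return [[sum(prior[y] * T[y][i] * T[y][j] for y in range(2))
             for j in range(2)] for i in range(2)]

def triple_joint_111(p, a, b):
    # P(Y~1 = 1, Y~2 = 1, Y~3 = 1 | X)
    return p * a ** 3 + (1 - p) * b ** 3

theta_A = (0.5, 0.8, 0.4)              # (P(Y=1|X), T[1][1], T[2][1])
b = (1.8 - math.sqrt(0.24)) / 3.0      # root of 1.5 b^2 - 1.8 b + 0.5 = 0
theta_B = (0.4, 1.5 * (1.0 - b), b)    # a different valid stochastic model

JA, JB = pairwise_joint(*theta_A), pairwise_joint(*theta_B)
# With only p = 2 noisy labels, the two models induce the SAME joint:
assert all(abs(JA[i][j] - JB[i][j]) < 1e-9 for i in range(2) for j in range(2))
# A third conditionally independent noisy label separates them:
assert abs(triple_joint_111(*theta_A) - triple_joint_111(*theta_B)) > 1e-3
```

The pairwise joint of a binary model is determined by only two numbers (the marginal P(Ỹ = 1|X) and one diagonal entry), while the model has three free parameters, which is why a one-parameter family of indistinguishable models exists at p = 2.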

5. INSTANTIATIONS AND PRACTICAL IMPLICATIONS OF OUR RESULTS

Most learning with noisy labels solutions focus on the case of a single label and have observed empirical successes. In this section, we provide extensions of our results to cover state-of-the-art learning with noisy labels methods, together with their specific assumptions over X, T(X) = [P(Ỹ|Y, X)], etc. We show that our results easily extend to these specific instantiations, which successfully avoid the requirement of having multiple noisy labels for each X. The high-level intuition for Section 5.1 is to leverage the smoothness and clusterability of the nearest-neighbor Xs so that their noisy labels jointly serve as the multiple noisy labels for the local group. Sections 5.2 and 5.3 build on the notion that if T(X) is the same for a group of Xs, each group can be treated as one "instance", and a "disentangled" version of X becomes observation variables that serve a similar role to the additionally required noisy labels.

5.1. LEVERAGING SMOOTHNESS AND CLUSTERABILITY OF X

We start with a discussion using the smoothness and clusterability of X. Recent results have explored the clusterability of Xs (Zhu et al., 2021c; Bahri et al., 2020) to infer the noise transition matrix:

Definition 5. The 2-NN clusterability requires that each X and its two nearest neighbors X_1, X_2 share the same true label Y, that is, Y = Y_1 = Y_2, and T(X) = T(X_1) = T(X_2).

This definition helps us remove the requirement of multiple noisy labels for each X: one can view it as, for each X, borrowing the noisy labels from its 2-NN, so that we have three independent noisy labels Ỹ, Ỹ1, Ỹ2, all from the same Y. This smoothness or clusterability condition allows us to apply our identifiability results when one believes T(X) stays the same in the 2-NN neighborhood X, X_1, X_2. But when does an instance X and its 2-NN X_1, X_2 share the same true label? This requirement seems strange at first sight: as long as P(Y|X), P(Y_1|X_1) are not degenerate (being either 0 or 1 for different label classes), there always seems to be a positive probability that the realized Y ≠ Y_1, no matter how close X and X_1 are. Nonetheless, the 2-NN requirement seems to hold empirically: according to (Zhu et al., 2021c) (Table 3 therein), when using a feature extractor built using the clean labels, more than 99% of the instances satisfy the 2-NN condition. Even when using a weaker feature extractor, the ratio is almost always in or close to the 80% range. The following data generation process for an unstructured discrete domain of classification problems (Feldman, 2020; Liu, 2021) helps us justify the 2-NN requirement. The intuition is that when Xs are informative and sufficiently discriminative, similar Xs will enjoy the same true label.

• Let λ = {λ_1, ..., λ_n} denote the priors for each X ∈ X.
• For each X ∈ X, sample a quantity q_X independently and uniformly from the set λ.
• The resulting probability mass function of X is given by D(X) = q_X / Σ_{X′∈X} q_{X′}.
• A total of N Xs are observed. Denote by X_1, X_2 X's two nearest neighbors.
• Each (X, X_1, X_2) forms a triplet if ||X_1 − X||, ||X_2 − X|| fall below a threshold ϵ (closeness).
• A single Y for the tuple (X, X_1, X_2) is drawn from P(Y|X, X_1, X_2).
• Based on Y, we further observe three noisy labels Ỹ, Ỹ1, Ỹ2 according to P(Ỹ, Ỹ1, Ỹ2|Y).

The above data-generation process captures the correlation among Xs that are very close. We prove that it satisfies the 2-NN clusterability requirement with high probability.

Theorem 6. When N is large enough such that N > 4 Σ_{X∈X} q_X / min_X q_X, with probability at least 1 − N exp(−2N), each X and its two nearest neighbors X_1, X_2 satisfy the 2-NN clusterability.

Smoothness conditions in semi-supervised learning. The above discussion also ties closely to the smoothness requirements in semi-supervised learning (Zhu et al., 2003; Zhu, 2005), where neighboring Xs can provide and propagate label information in each local neighborhood. Indeed, this idea echoes the co-teaching solutions (Jiang et al., 2018; Han et al., 2018) in the literature of learning with noisy labels, where a teacher/mentor network is trained to provide artificially generated noisy labels to supervise the training of the student network. Our identifiability result, to a certain degree, implies that the addition of such extra noisy supervision improves the chance of identifying T(X). In (Jiang et al., 2018; Han et al., 2018), counting the noisy label itself and the "teacher" supervision, there are two such noisy supervision labels. This observation raises an interesting question: since our result emphasizes three labels, does adding an additional teacher network for additional supervision help? This question merits empirical verification.
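A toy version of the intuition behind Definition 5 and Theorem 6 (our own sketch, with hypothetical well-separated class centers): when classes are sufficiently discriminative in feature space, an instance and its two nearest neighbors share the same true label, so borrowing the neighbors' noisy labels yields three conditionally independent noisy labels for the local group.

```python
import random

random.seed(0)
centers = {0: 0.0, 1: 10.0}          # hypothetical, well-separated class centers
data = []                            # (feature, true label) pairs
for y, c in centers.items():
    for _ in range(50):
        data.append((c + random.uniform(-0.5, 0.5), y))

def two_nn(i):
    # Brute-force two nearest neighbors of instance i (excluding i itself).
    order = sorted((j for j in range(len(data)) if j != i),
                   key=lambda j: abs(data[j][0] - data[i][0]))
    return order[:2]

# Every instance and its 2-NN share the same true label here, since the
# within-class spread (1.0) is far smaller than the class separation (9.0).
agree = sum(all(data[j][1] == data[i][1] for j in two_nn(i))
            for i in range(len(data)))
assert agree == len(data)
```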

5.2. LEVERAGING SMOOTHNESS AND CLUSTERABILITY OF T (X)

Figure 1: Graph for (G, X, Y, Ỹ). Grey color indicates observable variables.

We show that another "smoothness" assumption on T(X) introduces new observation variables that help identify T(X). In Figure 1, we define a variable G taking values in {1, 2, ..., |G|} to denote the group membership of each X. Consider a scenario where the Xs can be grouped into |G| groups such that each group shares the same T(X): T(X_1) = T(X_2) if X_1, X_2 share the same group membership. We observe G, X, Ỹ. This type of grouping has been observed in the literature:

Class-dependent T. P(Ỹ|Y, X) ≡ P(Ỹ|Y), i.e., a single group for all Xs.

Noise clusterability

The noise transition estimator proposed in (Zhu et al., 2021c) was primarily developed for class-dependent rather than instance-dependent T(X). Nonetheless, a noise clusterability definition is introduced therein that allows the approach to be applied to instance-dependent noise: under noise clusterability, clustering algorithms can help separate the dataset into local ones.

Group-dependent T(X). Recent results have also studied the case where the data X can be grouped using additional information (Wang et al., 2021a; Liu & Wang; Wang et al., 2021b). For instance, (Wang et al., 2021a; Liu & Wang) consider the setting where the data can be grouped by the associated "sensitive information", e.g., by age, gender, or race. The noise transition matrix then remains the same for Xs that come from each group. By this grouping, X becomes an informative observation for each hidden Y and will fulfill the requirement of observing additional noisy labels. We now define a disentangled feature and an informative feature. Denote by R(X) ∈ R^{d*} a learned representation for X, and by R_i the random variable for R_i(X), i = 1, 2, ..., d*. For simplicity of the analysis, we assume each R_i has a finite observation space R_i with cardinality |R_i| = κ_i. Define M_i for each R_i as M_i[j, k] = P(R_i = R_i[k]|Y = j), where R_i[k] denotes the k-th element of R_i.

Definition 6 (Disentangled R). R is disentangled if {R_i}_{i=1}^{d*} are conditionally independent given Y.

Definition 7 (Informative features). R_i is informative if its Kruskal rank is at least 2: Kr(M_i) ≥ 2.

Assuming each X can be transformed into a set of disentangled features R, we prove:

Theorem 7. For Xs in a given group g ∈ G, with a single informative noisy label, T(X) is identifiable if the number of disentangled and informative features d* satisfies d* ≥ K.
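The counting behind Theorem 7 follows Kruskal's condition (Theorem 4) with p = d* + 1 observed variables: the informative noisy label contributes Kr = K and each informative feature contributes Kr ≥ 2, so K + 2d* ≥ 2K + (d* + 1) − 1 holds exactly when d* ≥ K. A minimal numeric sketch with hypothetical matrices (K = 3, d* = 3 binary features; the brute-force Kruskal rank helper is our own illustration):

```python
from itertools import combinations

def matrix_rank(rows):
    # Rank via Gaussian elimination (small matrices, tolerance 1e-9).
    A = [list(map(float, r)) for r in rows]
    rank = 0
    for col in range(len(A[0])):
        piv = next((r for r in range(rank, len(A)) if abs(A[r][col]) > 1e-9), None)
        if piv is None:
            continue
        A[rank], A[piv] = A[piv], A[rank]
        for r in range(len(A)):
            if r != rank:
                f = A[r][col] / A[rank][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[rank])]
        rank += 1
    return rank

def kruskal_rank(M):
    # Largest I such that every set of I rows of M is linearly independent.
    for size in range(len(M), 0, -1):
        if all(matrix_rank([M[i] for i in idx]) == size
               for idx in combinations(range(len(M)), size)):
            return size
    return 0

K = 3
T = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]   # single informative noisy label
feats = [[[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],            # hypothetical P(R_i | Y):
         [[0.7, 0.3], [0.3, 0.7], [0.5, 0.5]],            # d* = K binary features,
         [[0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]]            # each with Kr(M_i) = 2
Ms = [T] + feats
p = len(Ms)                                               # p = d* + 1 observed variables
assert kruskal_rank(T) == K
assert all(kruskal_rank(M) == 2 for M in feats)
# Kruskal's condition: sum of Kruskal ranks >= 2r + p - 1, with r = K here.
assert sum(kruskal_rank(M) for M in Ms) >= 2 * K + p - 1
```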
This result points out a new observation: even with a single noisy label, given a sufficient number of disentangled and informative features, the noise transition matrix T is indeed identifiable, without requiring either multiple noisy labels or the anchor point condition. The above result aligns with recent discussions showing that a neural network's ability to disentangle features (Higgins et al., 2018; Steenbrugge et al., 2018) proves to be a helpful property. We establish that having disentangled features helps identify T(X). The required number of disentangled features grows linearly in K. When relaxing unique identifiability to generic identifiability, i.e., allowing the non-identifiable scenarios to have measure zero (Allman et al., 2009), the above theorem can be further extended to requiring d* ≥ ⌈log_2 ((2K*_G + 1)/2)⌉, where K*_G = max_{X∈G} K_X. Details are deferred to the Appendix (Theorem 10). Note that the existence of disentangled X does not imply that we can directly infer P(Y|X), which would complete the learning task directly. Rather, it is indeed possible to further identify the structure P(X|Y) (from unobserved to observed), but this is an identifiability problem defined on a much larger space. When disentangled features are not given, how do we disentangle X using only noisy labels to benefit from our results? In Section 6 we will test the effectiveness of a self-supervised representation learning approach that takes side information relative to the true label Y but operates independently from the noisy labels. This result implies that when the noise rate is high enough that Ỹ starts to become uninformative, dropping the noisy labels and focusing on obtaining disentangled features helps with the identifiability of T(X).
This observation also helps explain the successes of applying semi-supervised (Cheng et al., 2021a; Li et al., 2020; Nguyen et al., 2019) and self-supervised learning (Cheng et al., 2021b; Zheltonozhskii et al., 2022; Ghosh & Lan, 2021) to handle noisy labels.

5.3. SMOOTHNESS AND CLUSTERABILITY OF T(X) WITH UNKNOWN GROUPINGS

In practice, we often do not know the groupings of X that share the same T(X), nor do we have a clear way (e.g., the noise clusterability condition) to separate the data into different groups. In reality, different from Figure 1, the group membership can often remain hidden if no additional knowledge of the data is solicited, leading to the situation in Figure 2. It is a non-trivial task to jointly infer the group membership and T(X). We first show that mixing the group memberships can lead to non-negligible estimation errors. Suppose there are two groups of X, with noise transition matrices T₁(X) and T₂(X) respectively, and suppose we mistakenly estimate a single T*(X) for both groups. We then have:

Theorem 8. Any estimator T*(X) will incur at least the following estimation error:

‖T₁(X) − T*(X)‖_F + ‖T₂(X) − T*(X)‖_F ≥ (1/√2)·‖T₁(X) − T₂(X)‖_F.

The above result shows the necessity of identifying G as well. We now present our positive result on identifiability when G is hidden too. Re-number the combined space of G × Y as {1, 2, ..., |G|K}. We reuse the definition of M_i for each disentangled feature R_i: define the "Kruskal matrix" for each R_i as M_i[j, k] = P(R_i = R_i[k] | G × Y = j).

Theorem 9. For Xs in a given group g ∈ G, with a single informative noisy label, T(X) is identifiable if the number of disentangled and informative features d* satisfies d* ≥ 2|G|K − 1.

When the groups of noise are unknown, the required number of informative and disentangled features grows linearly in |G|. We now relate to the literature that implicitly groups Xs; we use X to denote the space of all possible Xs.

Part-dependent label noise. (Xia et al., 2020b) discusses a part-dependent label noise model where each T(X) decomposes into a linear combination of p parts: T(X) = Σ_{i=1}^p ω_i(X)·T_i. The motivation of this model is that each X can be viewed as a combination of multiple sub-parts, each with a certain difficulty of being labeled. The hope is that the parameter space ω(X) reduces the dependency between X and T(X). Denote W := {ω(X) : X ∈ X}. In terms of our result, |G| = |W|. If W is a much smaller space than X, the condition specified in Theorem 9 is more likely to be satisfied.

DNN approach. (Yang et al., 2021) proposes using a deep neural network to encode the dependency between X and T*(X), with the only difference being that T*(X) is defined as the transition between Ỹ and the Bayes optimal label Y*. Define DNN := {DNN(X) : X ∈ X}. Analogously to our result in Theorem 9, with the hidden variable Y replaced by Y*, |G| is determined by |DNN|.
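The lower bound in Theorem 8 can be sanity-checked numerically. The sketch below (our own illustration, not from the paper) samples random row-stochastic matrices for T₁, T₂ and candidate estimates T*, and verifies the inequality; the entrywise midpoint is the minimizing choice:

```python
import numpy as np

rng = np.random.default_rng(0)


def random_transition(k, rng):
    # A random row-stochastic matrix, playing the role of a transition matrix.
    m = rng.random((k, k))
    return m / m.sum(axis=1, keepdims=True)


K = 4
T1, T2 = random_transition(K, rng), random_transition(K, rng)
lower = np.linalg.norm(T1 - T2, "fro") / np.sqrt(2)

# Any candidate T* obeys the bound of Theorem 8:
for _ in range(1000):
    T_star = random_transition(K, rng)
    total = (np.linalg.norm(T1 - T_star, "fro")
             + np.linalg.norm(T2 - T_star, "fro"))
    assert total >= lower - 1e-12

# The minimizing choice is the entrywise midpoint, where the sum of the two
# distances equals ||T1 - T2||_F, still above the (1/sqrt(2)) lower bound:
T_mid = (T1 + T2) / 2
mid_total = 2 * np.linalg.norm(T1 - T_mid, "fro")
assert abs(mid_total - np.linalg.norm(T1 - T2, "fro")) < 1e-12
```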
So long as the DNN can identify the patterns in T(X) and compress the space of DNN(X) compared to X, identifiability becomes easier to achieve.

The causal approach. (Yao et al., 2021) proposed improving identifiability by exploring the causal structure. With causal inference, one can identify a more representative and compressed X̄ for each X such that P(Ỹ|Y, X̄, X) = P(Ỹ|Y, X̄). Denote X̄ := {X̄(X) : X ∈ X}; then |G| = |X̄|.

6. SOME EMPIRICAL EVIDENCE: DISENTANGLED FEATURES

Most of our results above verified the empirical success of existing approaches from the identifiability perspective, and we refer the interested reader to the detailed experiments in the corresponding references. We now empirically show the possibility of learning disentangled features to help identify the noise transition matrix. We consider three types of encoders used to generate features. The first encoder is pre-trained with the cross-entropy (CE) loss in a weakly supervised manner, as generally adopted in FW (Patrini et al., 2017b) and HOC (Zhu et al., 2021c). However, since the training data is noisy, it is hard to guarantee that its features are disentangled; this is our baseline. The second encoder is pre-trained by SimCLR (Chen et al., 2020) in a self-supervised manner. It has been shown that the features trained by SimCLR are partly disentangled with respect to some simple augmentation features such as rotation and colorization (Wang et al., 2021c). The third encoder is trained by IPIRM (Wang et al., 2021c) in a self-supervised manner, which can generate fully disentangled features. After training these three encoders, we fix each encoder and generate features from raw samples to estimate the noise transition matrix using the HOC estimator (Zhu et al., 2021c). We evaluate the performance via the absolute estimation error defined below:

err = (Σ_{i=1}^K Σ_{j=1}^K |T̂_{i,j} − T_{i,j}| / K²) × 100,

where T̂ is the estimated noise transition matrix, T is the real noise transition matrix, and K is the number of classes in the dataset, which is also the size of the transition matrix. The overall experiments are shown in Table 1. We observe that the estimation error decreases as features become more disentangled, which supports our analyses. We defer the details, more experiments, as well as experiments comparing training performance using disentangled features, to the supplementary material.
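The estimation-error metric above is straightforward to compute; a minimal sketch (the function name is ours):

```python
import numpy as np


def estimation_error(T_hat, T):
    """Absolute estimation error between an estimated and the true K x K
    transition matrix, averaged over entries and scaled by 100."""
    T_hat, T = np.asarray(T_hat), np.asarray(T)
    K = T.shape[0]
    return np.abs(T_hat - T).sum() / (K ** 2) * 100


# A perfect estimate has zero error:
T = np.array([[0.9, 0.1], [0.2, 0.8]])
assert estimation_error(T, T) == 0.0

# A uniform estimate of the same 2x2 matrix:
T_hat = np.full((2, 2), 0.5)
assert abs(estimation_error(T_hat, T) - 35.0) < 1e-9
```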
Concluding remarks. This paper characterizes the identifiability of the instance-level label noise transition matrix. We connect the problem to the celebrated identifiability results of Kruskal and present necessary and sufficient conditions for instance-level identifiability. We extend and instantiate our results in practical settings to explain the successes of existing solutions. We show the importance of disentangled and informative features for identifying the noise transition matrix.

7. ETHICAL STATEMENT

We are not aware of any negative societal consequences of applying our results. At multiple places in the work, we state the limitations of our setup so as not to mislead readers into misunderstanding our claims. For instance, we discussed the situations in which we will have multiple noisy labels and our focus on discretized features. In Section 5.2, we clearly stated our requirement of disentangled and informative features.

8. REPRODUCIBILITY STATEMENT

We include the following checklist for the purpose of reproducibility:

1. Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]

2. If you are including theoretical results...

APPENDIX: IDENTIFIABILITY OF LABEL NOISE TRANSITION MATRIX

The Appendix is organized in the following way: Section A proves the theorems in the main paper; Section B provides more discussion on generic identifiability; Section C provides more experiments on learning with noisy labels w.r.t. disentangled features and elaborates on the detailed experimental settings in the paper.

A OMITTED PROOFS

PROOF FOR LEMMA 1

Proof. Using the Bayes rule we easily obtain

P(X|Ỹ = +1) = P(X|Y = +1)·P(Y = +1|Ỹ = +1) + P(X|Y = -1)·P(Y = -1|Ỹ = +1)   (5)

The equality is due to the fact that Ỹ and X are assumed to be independent given Y. Similarly:

P(X|Ỹ = -1) = P(X|Y = +1)·P(Y = +1|Ỹ = -1) + P(X|Y = -1)·P(Y = -1|Ỹ = -1)   (6)

Since both P(X|Y = +1) and P(X|Y = -1) are unknown, solving Eqn. (5) and (6) we further have

P(X|Ỹ = -1) = π̃₋·P(X|Ỹ = +1) + (1 − π̃₋)·P(X|Y = -1),
P(X|Ỹ = +1) = π̃₊·P(X|Ỹ = -1) + (1 − π̃₊)·P(X|Y = +1).

PROOF FOR THEOREM 3

Proof. From π̃₋, π̃₊ we can solve and derive

π₋ = π̃₋(1 − π̃₊) / (1 − π̃₋π̃₊),  π₊ = π̃₊(1 − π̃₋) / (1 − π̃₋π̃₊),

establishing the equivalence between identifying π̃₋, π̃₊ and identifying π₋, π₊. Next we show that identifying π₋, π₊ is equivalent to identifying {e₊, e₋}. We first show that identifying {π₊, π₋} suffices to identify {e₊, e₋}. To see this,

P(Ỹ = +1|Y = -1) = P(Y = -1|Ỹ = +1)·P(Ỹ = +1) / P(Y = -1),

and

P(Y = -1) = P(Y = -1|Ỹ = +1)·P(Ỹ = +1) + P(Y = -1|Ỹ = -1)·P(Ỹ = -1).

The derivation for P(Ỹ = -1|Y = +1) is entirely symmetric. Since we directly observe P(Ỹ = -1) and P(Ỹ = +1), once P(Y = +1|Ỹ = -1) and P(Y = -1|Ỹ = +1) are identified, we can identify P(Ỹ = +1|Y = -1) and P(Ỹ = -1|Y = +1). Next we show that to identify {e₊, e₋}, it is necessary to identify {π₊, π₋}. Suppose not: we are unable to identify π₊, π₋ but are able to identify {e₊, e₋}.
This implies that there exists another pair {π′₊, π′₋} ≠ {π₊, π₋} such that (denote p := P(Ỹ = +1))

P(Ỹ = +1|Y = -1) = π₊p / (π₊p + (1 − π₋)(1 − p))   (9)
                 = π′₊p / (π′₊p + (1 − π′₋)(1 − p))   (10)

P(Ỹ = -1|Y = +1) = π₋(1 − p) / ((1 − π₊)p + π₋(1 − p))   (11)
                 = π′₋(1 − p) / ((1 − π′₊)p + π′₋(1 − p))   (12)

Dividing by π₊ and π′₊ in both the numerator and denominator of Eqn. (9) and (10), we conclude that

(1 − π₋)/π₊ = (1 − π′₋)/π′₊,   (13)

while from Eqn. (11) and (12) we conclude

(1 − π₊)/π₋ = (1 − π′₊)/π′₋.   (14)

From Eqn. (13) and (14) we have

(1 − π₋)π′₊ = (1 − π′₋)π₊,   (15)
(1 − π′₊)π₋ = (1 − π₊)π′₋.   (16)

Taking the difference and rearranging terms, we obtain π₊ + π₋ = π′₊ + π′₋. Using Eqn. (13) again and subtracting 1 from both sides, we have

(1 − π₋ − π₊)/π₊ = (1 − π′₋ − π′₊)/π′₊.

Since the two numerators are equal, this proves π₊ = π′₊. Similarly we have π₋ = π′₋, but this contradicts the assumption that {π′₋, π′₊} is a different pair.

PROOF FOR THEOREM 5

Proof. We first prove sufficiency, relating our problem setting to the setup of Kruskal's identifiability result: Y ∈ {1, 2, ..., K_X} corresponds to the unobserved hidden variable Z, and P(Y = i) corresponds to the prior of this hidden variable. Each Ỹ_i, i = 1, ..., p corresponds to the observation O_i; κ_i is then simply the cardinality of the noisy label space, K. In the context of this theorem, p = 3, corresponding to the three noisy labels we have. Each Ỹ_i corresponds to an observation matrix M_i:

M_i[j, k] = P(O_i = k|Z = j) = P(Ỹ_i = k|Y = j, X).

Therefore, by the definitions of M₁, M₂, M₃ and T(X), they all equal T(X): M_i ≡ T(X), i = 1, 2, 3. When T(X) has rank K_X, all rows of M₁, M₂, M₃ are linearly independent, so the Kruskal ranks satisfy

Kr(M₁) = Kr(M₂) = Kr(M₃) = K_X.

Checking the condition in Theorem 4, we easily verify

Kr(M₁) + Kr(M₂) + Kr(M₃) = 3K_X ≥ 2K_X + 2,

which holds since K_X ≥ 2. Calling Theorem 4 proves sufficiency.

Now we prove necessity. To prove it, we are allowed to focus on the binary case, where

T(X) = [ 1 − e₋(X)    e₋(X)
         e₊(X)      1 − e₊(X) ].

For simplicity we drop the dependency of e₋, e₊ on X. We need to prove that fewer than 3 informative labels do not suffice to guarantee identifiability. The idea is to show that two different sets of parameters e₋, e₊ can lead to the same joint distribution P(Ỹ₁, Ỹ₂|X). The case with a single label is already proved by Example 1. Now consider two noisy labels Ỹ₁, Ỹ₂.
We first claim the following three quantities fully capture the information provided by Ỹ₁, Ỹ₂:

• Posterior: P(Ỹ₁ = +1|X)
• Positive consensus: P(Ỹ₁ = Ỹ₂ = +1|X)
• Negative consensus: P(Ỹ₁ = Ỹ₂ = -1|X)

This is because the other statistics of Ỹ₁, Ỹ₂|X can be reproduced from combinations of the three quantities above:

P(Ỹ₁ = -1|X) = 1 − P(Ỹ₁ = +1|X),
P(Ỹ₁ = +1, Ỹ₂ = -1|X) = P(Ỹ₁ = +1|X) − P(Ỹ₁ = Ỹ₂ = +1|X),
P(Ỹ₁ = -1, Ỹ₂ = +1|X) = P(Ỹ₂ = +1|X) − P(Ỹ₁ = Ỹ₂ = +1|X),

and P(Ỹ₂ = +1|X) = P(Ỹ₁ = +1|X), since the two noisy labels are identically distributed. The three quantities lead to the following system of equations that depend on e₊, e₋ (denote γ := P(Y = +1)):

P(Ỹ = +1|X) = γ·(1 − e₊) + (1 − γ)·e₋
P(Ỹ₁ = Ỹ₂ = +1|X) = γ·(1 − e₊)² + (1 − γ)·e₋²
P(Ỹ₁ = Ỹ₂ = -1|X) = γ·e₊² + (1 − γ)·(1 − e₋)²

To see this:

P(Ỹ₁ = Ỹ₂ = +1|X)
= P(Ỹ₁ = Ỹ₂ = +1, Y = +1|X) + P(Ỹ₁ = Ỹ₂ = +1, Y = -1|X)
= P(Ỹ₁ = Ỹ₂ = +1|Y = +1, X)·P(Y = +1|X) + P(Ỹ₁ = Ỹ₂ = +1|Y = -1, X)·P(Y = -1|X)
= γ·(1 − e₊)² + (1 − γ)·e₋².

The last equality uses the fact that Ỹ₁, Ỹ₂ are conditionally independent given Y, so

P(Ỹ₁ = Ỹ₂ = +1|Y = +1, X) = P(Ỹ₁ = +1|Y = +1, X)·P(Ỹ₂ = +1|Y = +1, X),
P(Ỹ₁ = Ỹ₂ = +1|Y = -1, X) = P(Ỹ₁ = +1|Y = -1, X)·P(Ỹ₂ = +1|Y = -1, X).

We can similarly derive P(Ỹ₁ = Ỹ₂ = -1|X). Now we show the above equations do not identify e₊, e₋. For instance, it is straightforward to verify that both of the solutions below satisfy the equations (up to numerical errors; exact solutions exist but in complicated forms):

• γ = 0.7, e₊ = 0.2, e₋ = 0.2
• γ = 0.8, e₊ = 0.242, e₋ = 0.07

This example proves that two informative noisy labels are insufficient to guarantee identifiability. For completeness we provide the rationale for the multi-class case too. The idea is again to show that the complete information returned by a single noisy label and by two noisy labels does not always admit a unique solution.
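The two parameter settings above can be checked numerically; the sketch below (ours, not from the paper) confirms that they produce nearly identical observable statistics:

```python
def observables(gamma, e_plus, e_minus):
    """The three observable statistics from two conditionally
    independent noisy labels in the binary case."""
    posterior = gamma * (1 - e_plus) + (1 - gamma) * e_minus
    pos_consensus = gamma * (1 - e_plus) ** 2 + (1 - gamma) * e_minus ** 2
    neg_consensus = gamma * e_plus ** 2 + (1 - gamma) * (1 - e_minus) ** 2
    return posterior, pos_consensus, neg_consensus


a = observables(0.7, 0.2, 0.2)
b = observables(0.8, 0.242, 0.07)
# The two settings are observationally indistinguishable up to ~1e-3:
assert all(abs(x - y) < 2e-3 for x, y in zip(a, b))
```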
For the first-order information:

P(Ỹ = i|X) = Σ_{k∈[K]} P(Y = k|X)·P(Ỹ = i|Y = k, X) = Σ_{k∈[K]} P(Y = k|X)·T_{ki}(X).

Enumerating all i, there are K equations, written in matrix form as

P̃ = (T(X))ᵀ·P,

where P̃ is the vector [P(Ỹ = 1|X); P(Ỹ = 2|X); ...; P(Ỹ = K|X)] and P is the corresponding vector of P(Y = k|X). For the second-order information,

P(Ỹ₁ = i, Ỹ₂ = j|X) = Σ_{k∈[K]} P(Y = k|X)·P(Ỹ₁ = i|Y = k, X)·P(Ỹ₂ = j|Y = k, X) = Σ_{k∈[K]} P(Y = k|X)·T_{ki}(X)·T_{kj}(X).

Enumerating pairs (i, j) we have K² equations, written in matrix form as

C = (T(X))ᵀ·Λ·T(X),

where C is a K × K matrix with (i, j)-th entry P(Ỹ₁ = i, Ỹ₂ = j|X), and Λ is a diagonal matrix with Λ_{kk} = P(Y = k|X).

Notice that

Σ_j P(Ỹ₁ = i, Ỹ₂ = j|X) = P(Ỹ₁ = i|X) and Σ_j Σ_{k∈[K]} P(Y = k|X)·T_{ki}(X)·T_{kj}(X) = Σ_{k∈[K]} P(Y = k|X)·T_{ki}(X),

so for every K equations from the second-order information, there is at least one redundant equation. That is, we have at most K + K² − K = K² independent equations. Nonetheless, we have K unknowns from P(Y = k|X) and K² unknowns from T(X), i.e., K² + K unknown variables, so the equations are under-determined. Therefore we conclude that for general K there exist cases where two labels do not define a unique solution. For instance, for K = 3, we can easily find two sets of parameter settings that return the same observed distribution for two labels:
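The redundancy argument can be illustrated numerically: the row sums of the second-order matrix C = TᵀΛT reproduce the first-order vector P̃ = TᵀP. A small sketch under assumed values (K = 3, our own numbers):

```python
import numpy as np

K = 3
# Assumed example values: a clean-label posterior and a row-stochastic T(X).
P = np.array([0.5, 0.3, 0.2])                 # P(Y = k | X)
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])               # T_ki = P(Ytilde = i | Y = k)

P_tilde = T.T @ P                             # first-order: K equations
C = T.T @ np.diag(P) @ T                      # second-order: K^2 equations

# Each row of C sums to the corresponding first-order probability,
# so at least one equation per row of C is redundant:
assert np.allclose(C.sum(axis=1), P_tilde)

# Unknowns: K (posterior) + K^2 (transition) = 12 > 9 = K^2 equations,
# which is why the system is under-determined:
assert K + K ** 2 > K ** 2
```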

PROOF FOR THEOREM 6

Proof. In the unstructured model, we first show that, for large N, with high probability each X appears at least 3 times. Denote by N_X the number of times X appears in the dataset; then

N_X := Σ_{i=1}^N 1[X_i = X],  E[N_X] = (q_X / Σ_{X′∈X} q_{X′})·N.

When N is large enough that N > 4·Σ_{X∈X} q_X / min_X q_X, we have E[N_X] > 4. Then, using the Hoeffding inequality, we have P(N_X ≤ 3) ≤ exp(−2N). Using a union bound (across the N samples), with probability at least 1 − N·exp(−2N) we have N_X ≥ 3 for all X:

P(N_X > 3, ∀X) = 1 − P(∃X : N_X ≤ 3) ≥ 1 − N·exp(−2N).

This further implies that with probability at least 1 − N·exp(−2N), for each X we find X₁ = X₂ = X: their distance is 0, clearly falling below the closeness threshold ϵ, so they share the same true label. Note that we are not imagining the exact same datum appearing three times, but rather three different data points that happen to share the same pattern X. For instance, these three Xs can correspond to three independent users applying for a credit card who end up with the same application profile (e.g., age, salary range, education level); they can also be three similar cat images that end up with the same encoding of the features.

PROOF FOR THEOREM 7

Proof. The d* features and the noisy label Ỹ jointly give us d* + 1 independent observations. Denote K*_G = max_{X∈G} K_X. In Kruskal's setup, Y ∈ {1, 2, ..., K*_G} then corresponds to the unobserved hidden variable Z. If the noisy label is informative, we know that Kr(T(X)) = K*_G ≤ K. Checking Kruskal's condition, and using d* ≥ K ≥ K*_G along with Kr(M_i) ≥ 2 for each informative feature:

Kr(T(X)) + Σ_{i=1}^{d*} Kr(M_i) ≥ K*_G + 2d* ≥ K*_G + K*_G + d* = 2K*_G + (d* + 1) − 1.

Calling Theorem 4, we establish the identifiability.
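The concentration step in the proof of Theorem 6 can be illustrated with a quick simulation (ours, with assumed parameters): drawing N samples from a finite pattern space, each pattern soon appears at least 3 times.

```python
import numpy as np

rng = np.random.default_rng(0)

n_patterns = 10                            # |X|: size of the finite pattern space
q = np.full(n_patterns, 1 / n_patterns)    # assumed uniform pattern frequencies
N = 10_000                                 # dataset size

samples = rng.choice(n_patterns, size=N, p=q)
counts = np.bincount(samples, minlength=n_patterns)

# With N this large, every pattern appears at least 3 times, matching the
# event the proof shows holds with high probability:
assert counts.min() >= 3
```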

PROOF FOR THEOREM 8

Proof. By definition,

‖T₁(X) − T*(X)‖_F = √( Σ_i Σ_j (T₁[i,j] − T*[i,j])² ).

It is easy to show that

‖T₁(X) − T*(X)‖_F + ‖T₂(X) − T*(X)‖_F
= √( Σ_{i,j} (T₁[i,j] − T*[i,j])² ) + √( Σ_{i,j} (T₂[i,j] − T*[i,j])² )
= √[ ( √( Σ_{i,j} (T₁[i,j] − T*[i,j])² ) + √( Σ_{i,j} (T₂[i,j] − T*[i,j])² ) )² ]
≥ √( Σ_{i,j} (T₁[i,j] − T*[i,j])² + (T₂[i,j] − T*[i,j])² )

(dropping the cross-product term, which is positive). Then we prove that

‖T₁(X) − T*(X)‖_F + ‖T₂(X) − T*(X)‖_F
≥ √( Σ_{i,j} (T₁[i,j] − (T₁[i,j] + T₂[i,j])/2)² + (T₂[i,j] − (T₁[i,j] + T₂[i,j])/2)² )   (the minimum is attained at the midpoint)
= √( Σ_{i,j} 2·((T₁[i,j] − T₂[i,j])/2)² )
= (1/√2)·√( Σ_{i,j} (T₁[i,j] − T₂[i,j])² ) = (1/√2)·‖T₁(X) − T₂(X)‖_F.

PROOF FOR THEOREM 9

Proof. The proof is straightforward by checking Kruskal's identifiability condition, using Kr(T(X)) ≥ 1 and Kr(M_i) ≥ 2 for each informative feature:

Kr(T(X)) + Σ_{i=1}^{d*} Kr(M_i) ≥ 1 + 2d* ≥ 1 + 2|G|K − 1 + d* = 2|G|K + (d* + 1) − 1.

Note that |G|·K is the size of the space of the unobserved variable (G × Y renumbered as {1, 2, ..., |G|K}).

B GENERIC IDENTIFIABILITY

We provide a bit more detail for the discussion on generic identifiability left in Section 5.2. The generic version of Kruskal's condition (Allman et al., 2009) requires

min(K*_G, κ₁) + min(K*_G, κ₂) + min(K*_G, κ₃) ≥ 2K*_G + 2.

Based on the above we have the following identifiability result. Group the d* features evenly into two groups, each corresponding to a meta variable/feature:

R*₁ = (R₁, ..., R_{d*₁}),  R*₂ = (R_{d*₁+1}, ..., R_{d*}).

Denote the feature dimensions of the two groups by d*₁, d*₂ and the cardinalities of the meta features by τ*₁, τ*₂. Then

τ*₁ = Π_{i=1}^{d*₁} κ_i ≥ 2^{d*₁} ≥ 2^{⌈log₂((2K*_G + 1)/2)⌉} ≥ (2K*_G + 1)/2,

and similarly τ*₂ ≥ (2K*_G + 1)/2. Denote by M*₁, M*₂ the two observation matrices for the grouped variables: M*_i[j, k] = P(R*_i = R*_i[k]|Y = j), i = 1, 2. Then:

Kr(T(X)) + Kr(M*₁) + Kr(M*₂) ≥ 1 + 2·(2K*_G + 1)/2 = 2K*_G + 2.

Steps 2-4 of Algorithm 1 (Key Steps of HOC):
2: Find the 2-NN of each instance using a similarity function Sim(x, x′), with 1 − Sim(x, x′) as the distance metric: {(ỹ_n, ỹ_{n₁}, ỹ_{n₂}), ∀n} ← Get2NN(D̃).
3: Count first-, second-, and third-order consensus patterns: (ĉ⁽¹⁾, ĉ⁽²⁾, ĉ⁽³⁾) ← CountFreq({(ỹ_n, ỹ_{n₁}, ỹ_{n₂}), ∀n}).
4: Find T̂ that matches the counts (ĉ⁽¹⁾, ĉ⁽²⁾, ĉ⁽³⁾).

C MORE EXPERIMENTS

In this section, we elaborate on the detailed experimental settings and perform more experiments with respect to disentangled features.

C.1 EXPERIMENT SETTING FOR TABLE 1

Label Noise Generation. The label noise of each instance is characterized by T_{ij}(X) = P(Ỹ = j|X, Y = i). In this paper, we consider two types of label noise: asymmetric label noise (Han et al., 2018; Wei et al., 2020) and instance-dependent label noise (Cheng et al., 2021a; Zhu et al., 2021b). For asymmetric label noise, T(X) ≡ T, and each clean label is randomly flipped to its adjacent label with probability ϵ, where ϵ is the noise rate, i.e., T_{ii} = 1 − ϵ and T_{ii} + T_{i,(i+1)_K} = 1, where (i+1)_K := (i mod K) + 1. For instance-dependent label noise, the generation of noisy labels also depends on the features; we follow CORES (Cheng et al., 2021a) to generate instance-dependent label noise. The generation process is detailed in Algorithm 2. With these definitions, asymm./inst. ϵ in Table 1 denotes asymmetric/instance-dependent label noise with noise rate ϵ.

Model pre-training. The network structures of all three encoders in Table 1 are ResNet50 (He et al., 2016). Note that the encoders that generate features for estimating the transition matrix can be pre-trained on a different dataset; for example, HOC (Zhu et al., 2021c) utilizes ImageNet pre-trained encoders to generate features for CIFAR. Thus, following the pipeline of disentangled feature generation (Wang et al., 2021c), we pre-train all three encoders on the CIFAR100 dataset and generate features for CIFAR10 to estimate the transition matrix. The first encoder is trained under 0.1 symmetric label noise to simulate weakly-supervised features, while the second and third encoders are trained via self-supervised learning (SSL). Recall that the goal of SSL is to learn a good representation without accessing labels. In this paper, we adopt SimCLR (Chen et al., 2020) and IPIRM (Wang et al., 2021c) for SSL pre-training. SimCLR, a representative work in the SSL literature, learns a good representation based on the InfoNCE loss (Van den Oord et al., 2018).
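The asymmetric noise model above can be instantiated in a few lines; this is our own minimal sketch, not the paper's code (indices are 0-based here):

```python
import numpy as np


def asymmetric_transition(K, eps):
    """K x K asymmetric-noise transition matrix: each class keeps its
    label w.p. 1 - eps and flips to the adjacent class w.p. eps."""
    T = np.eye(K) * (1 - eps)
    for i in range(K):
        T[i, (i + 1) % K] = eps    # 0-indexed version of (i+1)_K
    return T


def flip_labels(y, T, rng):
    """Sample noisy labels ytilde ~ T[y, :] for each clean label y."""
    return np.array([rng.choice(len(T), p=T[c]) for c in y])


rng = np.random.default_rng(0)
T = asymmetric_transition(10, 0.3)
assert np.allclose(T.sum(axis=1), 1.0)        # rows are distributions

y = rng.integers(0, 10, size=2000)
y_noisy = flip_labels(y, T, rng)
# The empirical flip rate concentrates around eps = 0.3:
assert 0.25 < np.mean(y_noisy != y) < 0.35
```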
However, it has been shown that the features learned by SimCLR are only partly disentangled with respect to some simple augmentation features such as rotation and colorization (Wang et al., 2021c). Thus, IPIRM proposes a learning algorithm that embeds the InfoNCE loss into the IRM (Invariant Risk Minimization) framework (Arjovsky et al., 2019) to learn fully disentangled features. We train the SimCLR and IPIRM models by referring to the official codebase of IPIRM. The pre-trained models, as well as the evaluation code, are all released in the supplementary material.

Estimation error of the transition matrix. After training the three encoders, we fix each encoder and generate features from raw samples to estimate the noise transition matrix using the global HOC estimator (Zhu et al., 2021c). The hyper-parameters for estimating the transition matrix are consistent with the official implementation of HOC: optimizer Adam, learning rate 0.1, number of iterations 1500. After training, we evaluate the performance via the absolute estimation error defined below:

err = (Σ_{i=1}^K Σ_{j=1}^K |T̂_{i,j} − T_{i,j}| / K²) × 100,

where T̂ is the estimated noise transition matrix, T is the real noise transition matrix, and K is the number of classes in the dataset.

Algorithm 2 (instance-dependent label noise generation, following CORES):
2: Sample instance flip rates q_n from the truncated normal distribution N(ε, 0.1², [0, 1]);
3: Sample W ∈ R^{S×K} from the standard normal distribution N(0, 1²);
   for n = 1 to N do
4:   p = x_n · W  // generate instance-dependent flip rates; the size of p is 1 × K
5:   p_{y_n} = −∞  // only consider entries different from the true label
6:   p = q_n · softmax(p)  // let q_n be the probability of getting a wrong label
7:   p_{y_n} = 1 − q_n  // keep the true label with probability 1 − q_n
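A minimal NumPy sketch of the noise-generation steps above (our approximation; we clip rather than properly truncate the normal draw, and the feature values are random placeholders):

```python
import numpy as np


def softmax(v):
    e = np.exp(v - np.nanmax(v))
    return e / e.sum()


def instance_dependent_noise(X, y, K, eps, rng):
    """Generate instance-dependent noisy labels following the algorithm's
    steps: per-instance flip rate q_n, projection W, softmax scores."""
    N, S = X.shape
    q = np.clip(rng.normal(eps, 0.1, size=N), 0.0, 1.0)  # approx. N(eps, 0.1^2, [0,1])
    W = rng.normal(0.0, 1.0, size=(S, K))
    y_noisy = np.empty(N, dtype=int)
    for n in range(N):
        p = X[n] @ W                      # instance-dependent scores, 1 x K
        p[y[n]] = -np.inf                 # exclude the true label
        p = q[n] * softmax(p)             # wrong-label probability mass q_n
        p[y[n]] = 1.0 - q[n]              # keep the true label w.p. 1 - q_n
        y_noisy[n] = rng.choice(K, p=p)
    return y_noisy


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
y = rng.integers(0, 10, size=2000)
y_noisy = instance_dependent_noise(X, y, K=10, eps=0.3, rng=rng)
# The average flip rate stays close to eps = 0.3:
assert 0.2 < np.mean(y_noisy != y) < 0.4
```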

C.2 TRAINING PERFORMANCE USING ESTIMATED TRANSITION MATRIX

We can further use the estimated transition matrix to perform forward loss correction (FW) (Patrini et al., 2017b). Table 2 records the performance of FW using the transition matrices estimated from SimCLR and IPIRM features. The hyper-parameters for all experiments in Table 2 are the same: optimizer SGD, 100 training epochs, learning rate 0.1 for the first 50 epochs and 0.01 for the last 50 epochs, batch size 256. From the results, we observe that the test accuracy increases as features become more disentangled.
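Forward correction composes the model's predicted clean-label posterior with the transition matrix before taking the loss. A minimal NumPy sketch of this idea (ours; FW itself is the method of Patrini et al., 2017b):

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def forward_corrected_loss(logits, noisy_label, T):
    """Cross-entropy against the *noisy*-label distribution T^T p(x):
    P(Ytilde = j | x) = sum_i P(Y = i | x) T_ij."""
    p_clean = softmax(logits)
    p_noisy = p_clean @ T             # predicted noisy-label distribution
    return -np.log(p_noisy[noisy_label])


logits = np.array([2.0, 0.5, -1.0])
T_identity = np.eye(3)
# With T = I (no noise), FW reduces to the standard cross-entropy loss:
standard_ce = -np.log(softmax(logits)[0])
assert np.isclose(forward_corrected_loss(logits, 0, T_identity), standard_ce)
```

This also makes plain why a badly estimated T hurts: the loss is computed through T, so a wrong matrix corrects toward the wrong noise model.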

C.3 INITIALIZING DNN USING DISENTANGLED FEATURES

Beyond estimating the transition matrix, we can also directly use disentangled features when training on a noisy dataset. Table 3 shows the effect of using disentangled features as the DNN initialization on CIFAR100. The hyper-parameters for all experiments in Table 3 are consistent with Table 2. From the results, we observe that even with the vanilla cross-entropy loss, disentangled features are still beneficial to performance. Our first experiment is to show that when the estimated transition matrix is far from the ground-truth matrix, it may make the model perform worse even than the baseline (vanilla training with cross-entropy).

Experiment setting:

The training framework with the transition matrix follows FW (Patrini et al., 2017b). The dataset is CIFAR10 and the network structure is ResNet34. The hyper-parameters are as follows: batch size 64, learning rate 0.1 for the first 50 epochs and 0.01 for the last 50 epochs, optimizer SGD. For a randomly selected set of instances (0% of the population), we generate noisy labels using the transition matrix T = P(Ỹ|Y, X), which is uniform off-diagonal with diagonals evenly spaced over [0.9, 0.2]:

diag(T) = (0.9, 0.82, 0.74, 0.66, 0.58, 0.51, 0.43, 0.35, 0.27, 0.2),

with the remaining mass of each row spread uniformly over the off-diagonal entries, i.e., T_{ij} = (1 − T_{ii})/9 for j ≠ i (e.g., the first row is (0.9, 0.011, ..., 0.011) and the last row is (0.088, ..., 0.088, 0.2)). This T is the ground-truth transition matrix in our setting; the remaining unselected instances have noise rate 0, i.e., their labels stay clean. We perform experiments using the following three uniform off-diagonal transition matrices with forward loss correction (Patrini et al., 2017b):

• T₁ with diagonals evenly spaced over [0.9, 0.2];
• T₂ with all diagonals 0.4;
• T₃ with diagonals evenly spaced over [0.2, 0.9],

where T₁ is the ground-truth transition matrix while T₃ is far from the ground truth. The results are listed in Table 4. It can be observed that when using T₃, the performance is even worse than vanilla training with cross-entropy, suggesting the importance of identifying and estimating the noise transition matrix.
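The uniform off-diagonal family above can be constructed programmatically; a sketch (ours):

```python
import numpy as np


def uniform_offdiag_transition(diagonals):
    """Row-stochastic matrix with the given diagonal entries and the
    remaining mass of each row spread uniformly off the diagonal."""
    d = np.asarray(diagonals, dtype=float)
    K = len(d)
    T = np.empty((K, K))
    for i in range(K):
        T[i] = (1.0 - d[i]) / (K - 1)
        T[i, i] = d[i]
    return T


# Diagonals evenly spaced over [0.9, 0.2], as for T_1 in the text:
T1 = uniform_offdiag_transition(np.linspace(0.9, 0.2, 10))
assert np.allclose(T1.sum(axis=1), 1.0)
assert np.isclose(T1[0, 0], 0.9) and np.isclose(T1[9, 9], 0.2)
assert np.isclose(T1[0, 1], 0.1 / 9)      # ~0.011, matching the matrix shown
```

T₂ and T₃ follow by passing `np.full(10, 0.4)` and `np.linspace(0.2, 0.9, 10)` respectively.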

C.4.2 GAUSSIAN EXPERIMENT

Our second experiment shows that in some settings the transition matrix is hard to estimate correctly, which again suggests the importance of identifiability. Consider a simple binary classification setting with instances generated according to the following setup:

• X ∼ N(0, 3), where N denotes the Gaussian distribution with mean 0 and variance 3;
• P(Y = 1|X) = sigmoid(X) = 1/(1 + e^{−X}).

We generate X and Y following the above procedure and define the ground-truth transition matrix

T = P(Ỹ|Y, X) = [0.9  0.1
                 0.2  0.8]

for generating Ỹ from Y. Our goal is to examine whether we can estimate the correct transition matrix using (X, Ỹ).
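The generating procedure can be sketched as follows (ours, with an assumed sample size):

```python
import numpy as np

rng = np.random.default_rng(0)


def generate(n, rng):
    """Draw (X, Y, Y_tilde) from the Gaussian setting described above."""
    X = rng.normal(0.0, np.sqrt(3.0), size=n)                    # variance 3
    Y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X))).astype(int)   # sigmoid posterior
    T = np.array([[0.9, 0.1],                                    # rows indexed by Y
                  [0.2, 0.8]])
    Y_tilde = np.array([rng.choice(2, p=T[y]) for y in Y])
    return X, Y, Y_tilde


X, Y, Y_tilde = generate(5000, rng)
# By symmetry P(Y = 1) is about 1/2, so the overall flip rate is roughly
# 0.5 * 0.1 + 0.5 * 0.2 = 0.15:
assert 0.10 < np.mean(Y_tilde != Y) < 0.20
```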

Experiment setting:

The training framework for estimating the transition matrix follows FW (Patrini et al., 2017b). We randomly sample 5000 (x, y) pairs from the data-generating procedure and use

T = P(Ỹ|Y) = [0.9  0.1
              0.2  0.8]

to generate Ỹ from Y. The network structure is a simple FCN (fully connected network) with one hidden layer (10 nodes) and ReLU activation. The hyper-parameters are as follows: learning rate 0.01 for 100 epochs, optimizer SGD. We perform the experiments with 30 runs and record the average performance in Table 5. From Table 5, we can see that FW yields very little gain compared to vanilla cross-entropy training. We then calculate the average estimated transition matrix:

T_estimated = [0.983  0.017
               0.008  0.992]

We find that T_estimated is nearly the same as the identity matrix, suggesting that in this setting FW struggles to estimate the noise transition matrix correctly and thus contributes little to the performance.



Footnotes:
1. There exist other definitions that check columns; the results would be symmetric.
2. We clarify that we require knowing P(Ỹ|X). This requirement may appear odd when only one noisy label is sampled, but in practice there are tools available to regress the posterior function P(Ỹ|X) for each X.
3. By mapping (G = 1, Y = 1) → 1, (G = 1, Y = 2) → 2, ..., (G = |G|, Y = K) → |G|K.
4. (Dropping the cross-product term, which is positive.)
5. https://github.com/Wangt-CN/IP-IRM
6. https://github.com/UCSC-REAL/HOC



SINGLE NOISY LABEL MIGHT NOT BE SUFFICIENT

At first sight, it is impossible to identify P(Ỹ|Y, X) from only observing P(Ỹ|X), unless X satisfies the anchor point condition that P(Y = k|X) = 1 for a certain k: since P(Ỹ|X) = P(Ỹ|Y, X)·P(Y|X), different combinations of P(Ỹ|Y, X) and P(Y|X) can lead to the same P(Ỹ|X). More specifically, consider the following example:

Example 1. Suppose we have a binary classification problem with T

Figure 2: Graph with unobserved G. Grey color indicates observable variables.

(a) Did you state the full set of assumptions of all theoretical results? [Yes]
(b) Did you include complete proofs of all theoretical results? [Yes] We present the complete proofs in the appendix and have added detailed explanations.

3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We submitted experiment details in the appendix and the implementations in the supplementary materials.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes]
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

[P(Y = 1|X), P(Y = 2|X), P(Y = 3|X)] = [0.35, 0.35, 0.3] with T(X) = ..., and [P(Y = 1|X), P(Y = 2|X), P(Y = 3|X)] = [0.31, 0.34, 0.35] with T(X) = ...; these can be obtained by searching through the solution space of the equations.

This again satisfies the identifiability condition specified in Theorem 4.

Algorithm 1: Key Steps of HOC
1: Input: noisy dataset D = {(x_n, ỹ_n)}_{n∈[N]}, with disentangled features.

Table 1: Comparison of estimation error for different types of features on CIFAR-10. Each experiment is run 3 times and mean ± std is reported. asymm.: asymmetric label noise; inst.: instance-dependent label noise. Numbers are noise rates. All encoders use a ResNet50 backbone.

Theorem 10. With a single informative noisy label, T(X) is generically identifiable for each group g ∈ G if the number of disentangled features d* satisfies d* ≥ ⌈log₂((2K*_G + 1)/2)⌉, where K*_G = max_{X∈G} K_X.

Table 2: Comparison of test accuracy on CIFAR10 using the estimated transition matrix.

Table 3: Comparison of test accuracy on CIFAR100 using different DNN initializations.

Table 4: Comparison of test accuracy on CIFAR10 using different transition matrices. Columns: CE, FW with T₁, FW with T₂, FW with T₃.

Table 5: Comparison of test accuracy for CE and FW.

