IDENTIFIABILITY OF LABEL NOISE TRANSITION MATRIX

Abstract

The noise transition matrix plays a central role in the problem of learning with noisy labels: among other reasons, a large number of existing solutions rely on access to it. Identifying and estimating the transition matrix without ground-truth labels is therefore a critical yet challenging task. When the label noise depends on each instance, identifying the instance-dependent noise transition matrix becomes substantially harder. Despite recent works proposing solutions for learning from instance-dependent noisy labels, the field lacks a unified understanding of when this problem remains identifiable. The goal of this paper is to characterize the identifiability of the label noise transition matrix. Building on Kruskal's identifiability results, we show the necessity of multiple noisy labels for identifying the noise transition matrix in the generic, instance-level case. We then instantiate these results to explain the successes of state-of-the-art solutions and to show how additional assumptions alleviate the requirement of multiple noisy labels. Our results also reveal that disentangled features help with the above identification task, for which we provide empirical evidence.

1. INTRODUCTION

The literature of learning with noisy labels concerns the scenario where the observed labels Ỹ can differ from the true labels Y. The noise transition matrix T(X), defined as the transition probability from Y to Ỹ given X, plays a central role in this problem. Among many other benefits, knowledge of T(X) has demonstrated its use in performing risk corrections (Natarajan et al., 2013; Patrini et al., 2017a), label corrections (Patrini et al., 2017a), or constraint corrections (Wang et al., 2021a). Beyond these, it also finds applications in ranking small-loss samples (Han et al., 2020) and detecting corrupted samples (Zhu et al., 2021a). On the other hand, applying the wrong transition matrix T(X) can lead to a number of issues: the literature has well-documented evidence that a wrongly inferred transition matrix can lead to performance drops (Natarajan et al., 2013; Liu & Wang; Xia et al., 2019; Zhu et al., 2021c) and a false sense of fairness (Wang et al., 2021a; Liu & Wang). Knowing whether a T(X) is identifiable thus helps determine whether the underlying noisy-label learning problem is learnable at all. Prior works have documented challenges in estimating noise transition matrices when the quality of the available training information is unclear. For instance, in (Zhu et al., 2022) the authors show that when the quality of representations drops, the estimation error in T(X) increases significantly (Figure 1 therein). Other references have documented these challenges too (Xia et al., 2019), and we provide experiments validating this argument in Appendix C.4. Earlier results have focused on the class- but not instance-dependent transition matrix T(X) ≡ T := [P(Ỹ = j | Y = i)]_{i,j}, ∀X. The literature has discussed the identifiability of T under the mixture proportion estimation setup (Scott, 2015), and has identified a reducibility condition for inferring the inverse noise rate.
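To make the class-dependent case concrete, the following minimal sketch simulates label corruption according to a hypothetical 3-class transition matrix T (all names and values here are illustrative, not taken from the paper), and then recovers T empirically. Note this empirical recovery is only possible in simulation, where the clean labels Y are observed; the identifiability question studied in this paper is precisely about what can be inferred when they are not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class-dependent transition matrix for 3 classes:
# T[i, j] = P(noisy label = j | clean label = i).  Each row sums to 1.
T = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])
assert np.allclose(T.sum(axis=1), 1.0)

# Corrupt clean labels Y by sampling each noisy label from row T[y].
Y = rng.integers(0, 3, size=10_000)
Y_noisy = np.array([rng.choice(3, p=T[y]) for y in Y])

# Empirical estimate of T from (Y, Y_noisy) pairs -- feasible here only
# because the simulation observes Y; in practice Y is hidden.
T_hat = np.zeros((3, 3))
for i in range(3):
    mask = (Y == i)
    T_hat[i] = np.bincount(Y_noisy[mask], minlength=3) / mask.sum()
```

With 10,000 samples, `T_hat` matches `T` to within a few percent per entry; the instance-dependent setting replaces the single matrix T with a function T(X), which is what makes identifiability delicate.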
Later works developed a sequence of solutions to estimate T under a variety of assumptions, including irreducibility (Scott, 2015), anchor points (Liu & Tao, 2016; Xia et al., 2019; Yao et al., 2020a), separability (Cheng et al., 2020), rankability (Northcutt et al., 2017; 2021), redundant labels/tensors (Liu et al., 2020; Traganitis et al., 2018; Zhang et al., 2014), and clusterability (Zhu et al., 2021c), among others (Zhang et al., 2021; Li et al., 2021). A recent study (Wei et al., 2021) has empirically shown that the above class-dependent model is not precise enough to capture real-world noise patterns; real human-level noise instead follows an instance-dependent model. Intuitively, the instance X encodes the difficulty of generating a label for it. This more realistic, flexible, and powerful noise model helps characterize the challenges. We observe a recent surge of solutions to the instance-dependent label noise problem (Cheng et al., 2020; Xia et al., 2020b; Cheng et al., 2021a; Yao et al., 2021). Some of these results take on the problem of estimating T(X), while others propose to learn directly from instance-dependent noisy labels. We survey these results in Section 1.1. The question of identifying and estimating T becomes much trickier when the noise transition matrix is instance-dependent: the potentially complicated dependency between X and T(X) makes it even less clear whether solving this problem is viable. Despite the above successes, a unified understanding of when learning from instance-dependent noisy labels is identifiable, and therefore learnable, is still missing. This mixture of observations calls for demystifying: (1) Under what conditions are the noise transition matrices T(X) identifiable? (2) When and why do existing solutions work when handling instance-dependent label noise?
(3) When T(X) is not identifiable, what can we do to improve its identifiability? Answering these questions is the primary focus of this paper. Our main contributions are to characterize the identifiability of instance-dependent label noise, to use this characterization to explain the success of existing solutions, and to point out possible directions for improvement. Among other findings, some highlights of the paper are:

1. Many existing solutions have a deep connection to the celebrated Kruskal's identifiability results that date back to the 1970s (Kruskal, 1976; 1977).

2. Three separate independent and identically distributed (i.i.d.) noisy labels (random variables) are both necessary and sufficient for instance-level identifiability. This observation echoes previous successes in developing tensor-based approaches for identifying hidden models.

3. Disentangled features help with identifiability.

Our paper proceeds as follows. Sections 2 and 3 present our formulation and the highly relevant preliminaries. Section 4 characterizes identifiability at the instance level and lays the foundation for our discussions. Section 5 extends the discussion to different instantiations that explain the success of existing solutions. Section 6 provides empirical observations.
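The tensor view behind the second highlight can be sketched numerically. In this hypothetical setup (the prior p and matrix T below are illustrative, not from the paper), three i.i.d. noisy labels at a fixed instance x give rise to an observable 3-way joint tensor whose CP decomposition has the rows of T as components; Kruskal's condition then guarantees the decomposition, and hence T, is unique up to label permutation.

```python
import numpy as np

# At a fixed instance x: hypothetical clean-label prior p[y] = P(Y = y | x)
# and transition matrix T[y, j] = P(Ytilde = j | Y = y, x).
p = np.array([0.5, 0.3, 0.2])
T = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])

# With three i.i.d. noisy labels per instance, the observable statistic is
# M[a, b, c] = P(Ytilde1 = a, Ytilde2 = b, Ytilde3 = c | x)
#            = sum_y p[y] * T[y, a] * T[y, b] * T[y, c],
# i.e., a rank-3 CP decomposition with factor rows drawn from T.
M = np.einsum('y,ya,yb,yc->abc', p, T, T, T)

# Sanity checks: M is a valid joint distribution, and each single-label
# marginal equals the noisy-label distribution p @ T.
assert np.isclose(M.sum(), 1.0)
assert np.allclose(M.sum(axis=(1, 2)), p @ T)
```

When T is full rank (as above), each factor matrix has Kruskal rank 3, so the sums of Kruskal ranks satisfy 3 + 3 + 3 ≥ 2·3 + 2, and Kruskal's sufficient condition for uniqueness holds; with only one or two noisy labels the corresponding lower-order marginals do not pin T down.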

1.1. RELATED WORKS

In the literature of learning with label noise, a major line of work focuses on designing risk-consistent methods, i.e., performing empirical risk minimization (ERM) with specially designed loss functions on the noisy distribution so that it yields the same minimizer as performing ERM over the corresponding unobservable clean distribution. The noise transition matrix is a crucial component for implementing risk-consistent methods, e.g., loss correction (Patrini et al., 2017b), loss reweighting (Liu & Tao, 2015), label correction (Xiao et al., 2015), and unbiased losses (Natarajan et al., 2013). A number of solutions have been proposed to estimate this transition matrix for class-dependent label noise, as discussed in the introduction. To handle instance-dependent noise, recent solutions include estimating local transition matrices for different groups of data (Xia et al., 2020b), using confidence scores to revise transition matrices (Berthon et al., 2020), and using clusterability of the data (Zhu et al., 2021c). More recent works have used causal knowledge to improve the estimation (Yao et al., 2021) and deep neural networks to estimate the transition matrix defined between the noisy label and the Bayes optimal label (Yang et al., 2021). Other works learn from instance-dependent label noise directly, without explicitly estimating the transition matrix (Zhu et al., 2021b; Cheng et al., 2021a; Berthon et al., 2021; Xia et al., 2020a; Li et al., 2020). The identifiability issue with label noise has been discussed in the literature, though not formally treated. Most relevant to us are the identifiability results in the Mixture Proportion Estimation setting (Scott, 2015; Yao et al., 2020b; Menon et al., 2015). We note that there, identifiability was defined for the inverse noise rate, which differs from our focus on the noise transition matrix T.
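As a concrete instance of the risk-consistent methods mentioned above, here is a minimal sketch in the spirit of forward loss correction (Patrini et al., 2017b); the function name and toy numbers are illustrative, not from any specific implementation. The model's clean-class posteriors are composed with T, so that cross-entropy is computed against the observed noisy labels.

```python
import numpy as np

def forward_corrected_ce(clean_posteriors, noisy_labels, T):
    """Forward correction: push the model's clean-class posteriors through T,
    then take cross-entropy against the observed noisy labels.  With the true
    T, minimizing this noisy-data loss targets the clean-data minimizer."""
    # P(Ytilde = j | x) = sum_i P(Y = i | x) * T[i, j]
    noisy_posteriors = clean_posteriors @ T
    picked = noisy_posteriors[np.arange(len(noisy_labels)), noisy_labels]
    return -np.mean(np.log(picked + 1e-12))

# Toy check: with an identity T (no noise), the corrected loss reduces to
# ordinary cross-entropy on the observed labels.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss_identity = forward_corrected_ce(probs, labels, np.eye(3))
```

This also illustrates why identifiability matters operationally: the correction is only as good as the T plugged in, and a misidentified T silently changes the minimizer being targeted.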
To the best of our knowledge, no other work specifically addresses the identifiability of T(X), particularly in an instance-dependent label noise setting. Highly relevant to us are Kruskal's identifiability results (Kruskal, 1976; 1977; Sidiropoulos & Bro, 2000; Allman et al., 2009), which give a sufficient condition for identifying a parametric model that links a hidden variable to a set of observed ones. Kruskal's early results were developed in the context of tensors, which later proved to be a powerful tool for learning latent variable models (Sidiropoulos et al., 2017; Zhang et al., 2014; Anandkumar et al., 2014).

