A HOLISTIC VIEW OF LABEL NOISE TRANSITION MATRIX IN DEEP LEARNING AND BEYOND

Abstract

In this paper, we explore learning statistically consistent classifiers under label noise by estimating the noise transition matrix T. We first provide a holistic view of existing T-estimation methods, including those with and without the anchor point assumption, and unify them into the Minimum Geometric Envelope Operator (MGEO) framework, which seeks the smallest T (in terms of a certain metric) whose convex hull encloses the posteriors of all the training data. Although MGEO methods show appealing theoretical properties and empirical results, we find them prone to failure when the noisy posterior estimation is imperfect, which is inevitable in practice. Specifically, we show that MGEO methods are inconsistent even with infinite samples if the noisy posterior is not estimated accurately. In view of this, we make the first effort to address this issue by proposing a novel T-estimation framework through the lens of bilevel optimization, termed RObust Bilevel OpTimization (ROBOT). ROBOT paves a new road beyond the MGEO framework and enjoys strong theoretical properties: identifiability, consistency, and finite-sample generalization guarantees. Notably, ROBOT neither requires perfect posterior estimation nor assumes the existence of anchor points. We further theoretically demonstrate that ROBOT remains robust in the case where MGEO methods fail. Experimentally, our framework also shows superior performance across multiple benchmarks. Our code is released at https://github.com/pipilurj/ROBOT.

1. INTRODUCTION

Deep learning has achieved remarkable success in recent years, owing to the availability of abundant computational resources and large-scale datasets for training models with millions of parameters. Unfortunately, the quality of datasets cannot be guaranteed in practice. There are often large amounts of mislabelled data in real-world datasets, especially those obtained from the internet through crowdsourcing (Li et al., 2021; Xia et al., 2019; 2020; Xu et al., 2019; Wang et al., 2019; 2021; Liu et al., 2020; Collier et al., 2021; Bahri et al., 2020; Li et al., 2022a; Yong et al., 2022; Lin et al., 2022; Zhou et al., 2022b). This gives rise to interest in learning under label noise. Our task is to learn a function f_θ with parameter θ to predict the clean label Y ∈ 𝒴 = {1, ..., K} based on the input X ∈ 𝒳 = R^d. However, we only observe a noisy label Ỹ, generated from Y by an (oracle) noise transition matrix T*(x) whose elements are T*_ij(x) = P(Ỹ = j | Y = i, X = x). We consider class-dependent label noise, which assumes that T* is independent of x, i.e., T*(x) = T*. Given T*, we can obtain the clean posterior from the noisy posterior by P(Y | X = x) = (T*)^{-1} P(Ỹ | X = x). With the oracle T*, we can learn on the noisy dataset a statistically consistent model that coincides with the optimal model for the clean dataset. Specifically, denote by θ* the minimizer of the clean risk E_{X,Y}[ℓ(f_θ(X), Y)], where ℓ is the cross entropy loss; then minimizing E_{X,Ỹ}[ℓ(T* f_θ(X), Ỹ)] also leads to θ*. Therefore, the effectiveness of such methods heavily depends on the quality of the estimated T. To estimate T*, earlier methods assume that there exist anchor points, which belong to a certain class with probability one. The noisy posterior probabilities of the anchor points are then used to construct T* (Patrini et al., 2017; Liu & Tao, 2015).
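The forward-corrected loss ℓ(T f_θ(X), Ỹ) mentioned above can be computed directly from the model's softmax output. Below is a minimal NumPy sketch (the helper names are ours, not from the paper's code; we adopt the column convention used in the body of the paper, where column j of T is P(Ỹ | Y = j)):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward_corrected_ce(logits, noisy_labels, T):
    """Cross entropy of the corrected prediction T f_theta(x) against the
    noisy labels. With column convention T[:, j] = P(noisy | clean = j),
    the predicted noisy posterior of a sample is T @ p_clean."""
    p_clean = softmax(logits)          # (n, K) estimate of P(Y | X)
    p_noisy = p_clean @ T.T            # (n, K) estimate of P(Y_tilde | X)
    n = len(noisy_labels)
    return -np.mean(np.log(p_noisy[np.arange(n), noisy_labels] + 1e-12))
```

With T equal to the identity, this reduces to the standard cross entropy; with the oracle T*, minimizing it drives f_θ toward the clean posterior.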
Specifically, they fit a model on the noisy dataset and select the most confident samples as anchor points. However, as argued in Li et al. (2021), violation of the anchor point assumption can lead to inaccurate estimation of T. To overcome this limitation, Li et al. (2021) and Zhang et al. (2021) develop anchor-free methods that estimate T without the anchor point assumption (see Appendix C for more discussion of related work). Since P(Ỹ | X = x) = T* P(Y | X = x) and Σ_{i=1}^K P(Y = i | X = x) = 1, we know that for any x, P(Ỹ | X = x) is enclosed in the convex hull conv{T*} formed by the columns of T*. Therefore, both anchor-based and anchor-free methods try to find a T whose conv{T} encloses the noisy posteriors of all data points, under the assumption that all the noisy posteriors are perfectly estimated. To identify T* among all the T satisfying this condition, they choose the smallest one in terms of certain positive metrics; e.g., anchor-based methods adopt equation 2 and the minimum volume (anchor-free) method adopts equation 3. We therefore unify them into the framework of the Minimum Geometric Envelope Operator (MGEO), formally defined in Section 2. Though MGEO-based methods achieve remarkable success, we show that they are sensitive to noisy posterior estimation errors. Notably, neural networks can easily produce inaccurate posterior estimates due to their over-confidence (Guo et al., 2017). If some posterior estimation errors skew the smallest convex hull enclosing all the data points, MGEO can return an unreliable T-estimate (as illustrated in Fig. 1(a)). We theoretically show that even if the noisy posterior is accurate except at a single data point, MGEO-based methods can suffer a constant-level error in T-estimation. We further provide supporting experimental results for our theoretical findings in Section 2.
In view of this, we aim to go beyond MGEO by proposing a novel framework for stable end-to-end T-estimation. Let θ(T) be the minimizer over θ of E_{X,Ỹ}[ℓ(T f_θ(X), Ỹ)] when T is fixed; here θ(T) makes explicit that the returned θ depends on T. If the clean dataset were available, we could find T* by checking whether T induces a θ(T) that is optimal for E_{X,Y}[ℓ(f_θ(X), Y)] (this is ensured by the consistency of the forward correction method under suitable conditions (Patrini et al., 2017), discussed in Appendix A.3). The challenge is that we do not have the clean dataset in practice. Fortunately, there are well-established robust losses (denoted ℓ_rob), e.g., Mean Absolute Error (MAE) (Ghosh et al., 2017) and Reversed Cross Entropy (RCE) (Wang et al., 2019), whose minimizer of E_{X,Ỹ}[ℓ_rob(f_θ(X), Ỹ)] coincides with that of E_{X,Y}[ℓ_rob(f_θ(X), Y)]. Therefore, we search for T* by checking whether T minimizes E_{X,Ỹ}[ℓ_rob(f_{θ(T)}(X), Ỹ)], which only depends on the noisy data. This procedure can be naturally formulated as a bilevel problem: in the inner loop, T is fixed and we obtain θ(T) by training θ to minimize E_{X,Ỹ}[ℓ(T f_θ(X), Ỹ)]; in the outer loop, we train T to minimize E_{X,Ỹ}[ℓ_rob(f_{θ(T)}(X), Ỹ)]. We name our framework RObust Bilevel OpTimization (ROBOT). Notably, unlike MGEO, ROBOT is based on sample mean estimators, which are intrinsically consistent by the law of large numbers. In Section 3.2, we establish the theoretical properties of ROBOT: identifiability, finite sample generalization, and consistency. Further, ROBOT achieves O(1/n) robustness to the noisy posterior estimation error in the case where MGEO methods suffer a constant-level T-estimation error. In Section 4, we conduct extensive experiments on both synthetic and real-world datasets. Our method beats the MGEO methods significantly in terms of both prediction accuracy and T-estimation accuracy.

Contribution.

• We provide the first framework, MGEO, to unify the existing T-estimation methods, including both anchor-based and anchor-free methods. Through both theoretical analysis and empirical evidence, we formally identify the instability of MGEO-based methods when the noisy posteriors are not estimated perfectly, which is inevitable in practice due to the over-confidence of large neural networks.

2. MINIMUM GEOMETRIC ENVELOPE OPERATOR (MGEO)

Preliminaries. Throughout this paper, we use upper-case letters, e.g., X, to denote random vectors, and lower-case letters, e.g., x, to denote deterministic scalars and vectors. For any vector v, we use v[i] or v_i to denote the ith element of v. The L1 norm of v is |v|_1 = Σ_{i=1}^d |v_i|. Let T* denote the oracle noise transition matrix introduced in Section 1. Let T_{i•} and T_{•j} be the ith row and jth column of T, respectively. Denote the feasible region of T as 𝒯 := {T | T_ij > 0, |T_{•j}|_1 = 1, ∀i, j ∈ [K]}. Let e_i ∈ R^K denote the ith standard basis vector. The empirical risk of θ on a dataset D(n) is L(θ, D(n)) := (1/n) Σ_{(x,y)∈D(n)} ℓ(f_θ(x), y), where ℓ is the loss function; by default, we use the cross entropy as ℓ. Denote the noisy posterior P(Ỹ | X = x) as g(x) for short, and let ĝ(x) denote the fitted posterior that we obtain. Let Ĝ(n) denote the set of the fitted posteriors, i.e., Ĝ(n) := {ĝ(x_i)}_{i=1}^n.

2.1. MGEO AND ITS LIMITATION

In this section, we first unify the existing works on T-estimation under MGEO. Denote the convex hull induced by the columns of T as conv(T) = {t | t = Σ_{i=1}^K α_i t_i, Σ_{i=1}^K α_i = 1, α_i > 0}, where T = [t_1, ..., t_K]. Notably, conv(T) is the feasible region of g(x) generated by T, because g(x) = P(Ỹ | X = x) = T P(Y | X = x) and |P(Y | X = x)|_1 = 1. Existing methods assume that we can fit the posteriors perfectly, i.e., ĝ(x) = g(x) for all x. They then try to find a T whose conv(T) encloses Ĝ(n) (Patrini et al., 2017; Zhang et al., 2021; Xia et al., 2020; Li et al., 2021; Liu & Tao, 2015). Since there are infinitely many T satisfying this condition, they choose the smallest one in terms of certain positive metrics. We name such operators Minimum Geometric Envelope Operators (MGEO) and present the unified definition as follows:

Definition 1 (Minimum Geometric Envelope Operator). An operator Q: R^{K×n} → 𝒯 on a set Ĝ(n) = {ĝ(x_i)}_{i=1}^n is said to be a Minimum Geometric Envelope Operator (MGEO) if it solves

T^MGEO = arg min_{T∈𝒯} M(T), s.t. Ĝ(n) ⊂ conv(T),   (1)

where T = [t_1, ..., t_K] and conv(T) = {t | t = Σ_{i=1}^K α_i t_i, Σ_{i=1}^K α_i = 1, α_i > 0}. We denote it as T^MGEO = Q(Ĝ(n)) for short. Now we proceed to show how MGEO takes anchor-based and anchor-free methods as special cases (a brief description of these methods is included in Appendix A.1).

Example 1 (anchor-based). For each class j ∈ [K], anchor-based methods assume that there exists an anchor point x_j with P(Y | X = x_j) = e_j, so that T_{•j} = g(x_j) and g(x_j)[j] ≥ g(x)[j] for all x. Anchor-based methods first fit ĝ(•) on the noisy data, then find the most confident sample x̂_j in Ĝ(n) for class j and set T_{•j} = ĝ(x̂_j). This is equivalent to solving equation 1 with the following metric:

M(T) = Σ_{j∈[K]} ( T_jj + min_{i∈[n]} ∥T_{•j} − ĝ(x_i)∥_2 ).   (2)
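Under the perfect-fitting assumption, Example 1 reduces to a simple procedure. Below is a minimal NumPy sketch (with made-up numbers; `posteriors` stands for the fitted noisy posteriors ĝ(x_i), and the function name is ours). It also hints at the fragility discussed next: a single over-confident sample replaces a whole column of T.

```python
import numpy as np

def estimate_T_anchor(posteriors):
    """Anchor-based estimation: for each class j, the sample most
    confident in class j is taken as the anchor, and its fitted noisy
    posterior becomes column j of T (T_.j = g_hat(x_j))."""
    K = posteriors.shape[1]
    T = np.empty((K, K))
    for j in range(K):
        anchor = np.argmax(posteriors[:, j])  # most confident sample for class j
        T[:, j] = posteriors[anchor]
    return T

# fitted noisy posteriors of samples that happen to include two true anchors
posteriors = np.array([
    [0.8, 0.2],   # anchor for class 0: P(noisy | clean = 0)
    [0.3, 0.7],   # anchor for class 1: P(noisy | clean = 1)
    [0.55, 0.45], # an interior (non-anchor) sample
])
T_hat = estimate_T_anchor(posteriors)        # recovers [[0.8, 0.3], [0.2, 0.7]]

# a single over-confident outlier replaces an entire column of T
outlier = np.vstack([posteriors, [[1.0, 0.0]]])
T_skewed = estimate_T_anchor(outlier)        # column 0 becomes [1, 0]
```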
The intuition is as follows: because equation 1 requires Ĝ(n) to be contained in conv(T), we must have T_jj ≥ ĝ(x̂_j)[j], or otherwise ĝ(x̂_j) would lie outside conv(T). At the same time, min_{i∈[n]} ∥T_{•j} − ĝ(x_i)∥_2 ≥ 0 for all T and j. So choosing T_{•j} = ĝ(x̂_j) minimizes equation 2, because then T_jj = ĝ(x̂_j)[j] and ∥T_{•j} − ĝ(x̂_j)∥_2 = 0.

MGEO-based methods can work well if the posterior is perfectly estimated, i.e., ĝ(x) = g(x) for all x; in that case, MGEO methods can identify T* under suitable conditions. In practice, however, it is common that ĝ(x) ≠ g(x) for some x because of DNNs' over-confidence. An error in the posterior estimation can easily skew the smallest convex hull enclosing all data points, as illustrated in Figure 1, and MGEO methods then produce inaccurate T-estimates. To understand the sensitivity of MGEO methods to the posterior estimation error, we consider a simple case where ĝ(x) agrees with g(x) everywhere except at a single point x′ ∈ D(n):

ĝ_{ϵ,x′}(x) = g(x) + ϵ, if x = x′;  ĝ_{ϵ,x′}(x) = g(x), otherwise,   (4)

where ϵ is the error of the estimated posterior at x′; it must be such that g(x′) + ϵ has non-negative elements summing to 1, i.e., ϵ ∈ Ξ := {ϵ | |ϵ + g(x′)|_1 = 1, (ϵ + g(x′))[i] ≥ 0, ∀i ∈ [K]}. Let Ĝ_{ϵ,x′}(n) := {ĝ_{ϵ,x′}(x_i)}_{i=1}^n be the set of posteriors fitted by ĝ_{ϵ,x′}(•); similarly, Ĝ(n) := {ĝ(x_i)}_{i=1}^n. Further, let T̂^MGEO and T̂^MGEO_{ϵ,x′} be the solutions of MGEO on Ĝ(n) and Ĝ_{ϵ,x′}(n), i.e., T̂^MGEO := Q(Ĝ(n)) and T̂^MGEO_{ϵ,x′} := Q(Ĝ_{ϵ,x′}(n)). We have T̂^MGEO = T* under suitable conditions (Patrini et al., 2017; Li et al., 2021). We are then interested in the T-estimation error caused by ϵ in terms of the Frobenius norm, i.e., ∥T̂^MGEO_{ϵ,x′} − T*∥_F. The result is the following:

Proposition 1. Under assumptions specified in Appendix A.4, suppose that we obtain an imperfect estimation of the noisy posterior ĝ_{ϵ,x′}(•) as described in equation 4.
Then MGEO-based methods incur a T-estimation error whose minimax lower bound is sup_{ϵ∈Ξ} ∥T̂^MGEO_{ϵ,x′} − T*∥_F ≥ Ω(1).

See Appendix A.4 for the proof. Proposition 1 shows that MGEO methods can incur a constant level of T-estimation error under the fitted posterior ĝ_{ϵ,x′}(•). This is in analogy to Figure 1 (left), where a single outlier caused by ϵ skews the smallest convex hull enclosing all samples. Note that Proposition 1 shows that the error caused by inaccurate posterior estimation does not shrink to zero as the sample size increases; thus MGEO methods can be inconsistent in this case. In the following corollary of Proposition 1, we formally state the inconsistency of MGEO:

Corollary 1 (Inconsistency of MGEO). Under assumptions specified in Appendix A.4, suppose that we obtain an imperfect estimation of the noisy posterior ĝ_{ϵ,x′}(•) as described in equation 4. Then MGEO-based methods can be inconsistent, i.e., there exists an ϵ such that T̂^MGEO_{ϵ,x′} ↛ T* as n → ∞.

Proof. In the proof of Proposition 1, we have shown that there exists an ϵ such that ∥T̂^MGEO_{ϵ,x′} − T*∥_F ≥ Ω(1), and this result holds for any sample size n. If lim_{n→∞} T̂^MGEO_{ϵ,x′} does not exist, the claim holds immediately. Otherwise, if lim_{n→∞} T̂^MGEO_{ϵ,x′} exists, we have ∥lim_{n→∞} T̂^MGEO_{ϵ,x′} − T*∥ ≥ Ω(1), which implies lim_{n→∞} T̂^MGEO_{ϵ,x′} ≠ T*. So we conclude that T̂^MGEO_{ϵ,x′} ↛ T* as n → ∞.

2.2. EMPIRICAL FINDINGS

Unfortunately, due to the tendency of deep neural networks to make over-confident predictions, erroneous posterior estimation inevitably produces outliers. Because MGEO methods attempt to find the smallest T that covers all the samples, these outliers can severely skew the estimated T and degrade the estimation accuracy. We observe that this phenomenon widely exists in existing methods. In Figure 1 (b), we illustrate the T-estimation result of the Minimum Volume method. The red triangle is the oracle T*. We can see that many samples have estimated posteriors outside the red triangle. The purple triangle is the T fitted by the Minimum Volume method, which is highly skewed by the outliers. Refer to Appendix B.1 for experimental details.

3. ROBUST BILEVEL OPTIMIZATION (ROBOT)

In the last section, we discussed the instability of MGEO methods when the noisy posterior estimation is imperfect. Proposition 1 shows that even if the estimated posterior differs from g(•) at a single point, MGEO can incur a constant level of T-estimation error. This is because the T-estimation of MGEO relies on finding a convex hull that encloses all samples, and such a convex hull is determined by the outermost samples. Therefore, if there is an outlier due to inaccurate posterior estimation, the resulting convex hull can be easily skewed. A natural question to ask is: can we go beyond the Geometric Envelope Operator to obtain a robust and consistent T-estimation? As a thought experiment, consider a simple case. Suppose g(x_i)[j] = 0.5 for all i, and we aim to estimate T by the anchor point method. If ĝ(•) = g(•), we have T̂_jj = max_{i∈[n]} ĝ(x_i)[j] = 0.5. Now suppose that we obtain an inaccurate posterior with a small Gaussian error, i.e., ĝ(•)[j] = g(•)[j] + ξ where ξ ∼ N(0, σ²) for some small σ. We then have E[T̂_jj] = E[max_{i∈[n]} g(x_i)[j] + ξ_i] → 0.5 + σ√(2 log n) for large n (Ho & Hsing, 1996). So this method is inconsistent even with infinite samples. On the other hand, the sample mean estimator is consistent: (1/n) Σ_{i=1}^n ĝ(x_i)[j] → 0.5 as n → ∞ by the law of large numbers. This toy example shows that sample mean estimators are consistent, while MGEO methods are not, because they depend on the maximum of the samples. In light of this, we provide the ROBOT framework as the first attempt to estimate T based on sample mean estimators.
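The thought experiment above is easy to verify numerically. A small simulation (with assumed values g(x_i)[j] = 0.5 and σ = 0.05, not taken from the paper's experiments) shows the max-based estimate drifting upward roughly like σ√(2 log n), while the sample mean stays at 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, true_val = 0.05, 0.5

for n in (100, 10_000, 1_000_000):
    # fitted posteriors g_hat(x_i)[j] = 0.5 + xi_i, with xi_i ~ N(0, sigma^2)
    fitted = true_val + sigma * rng.normal(size=n)
    max_est = fitted.max()     # anchor-style (max-based) estimator: bias grows with n
    mean_est = fitted.mean()   # sample mean estimator: consistent
    print(f"n={n:>9}  max={max_est:.3f}  mean={mean_est:.3f}")
```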

3.1. METHODS

Denote the optimal parameter that minimizes the cross entropy loss on the clean dataset as θ* := arg min_θ L(f_θ, D). A popular method is to minimize the following forward correction loss on the noisy dataset D̃ (Patrini et al., 2017):

(T̂, θ̂) = arg min_{T,θ} L(T f_θ, D̃).   (5)

Patrini et al. (2017) show that forward correction is consistent, i.e., θ̂ = θ* if T = T*, under suitable conditions. These conditions mainly concern a sufficiently large function space for f_θ and proper composite loss functions; we discuss them in Appendix A.3 for completeness. Though equation 5 is consistent, solving it cannot uniquely identify T* and θ*: if T* = T_1 T_2 and f_{θ_1}(•) = T_2 f_{θ*}(•), then T_1 and θ_1 also achieve the optimal loss. To make T identifiable, MGEO finds the minimum T whose conv(T) contains all data points; we instead go beyond MGEO and aim to identify T* through sample mean estimators. Since the solution of θ in equation 5 depends on T when T is fixed, we use θ(T) to denote this dependency explicitly, i.e., θ(T) = arg min_θ L(T f_θ, D̃). By the consistency of forward correction, we know θ(T*) = θ* under the suitable conditions described in Appendix A.3. Therefore, we can search for a good T by evaluating θ(T): can we find a loss, computable on the noisy dataset, which is minimized by θ* (which is also θ(T*))? Readers familiar with noise-robust losses may already guess our next proposal. Existing works have proposed several losses robust to label noise, e.g., Mean Absolute Error (MAE) and Reversed Cross Entropy (RCE), whose minimizer on the noisy dataset coincides with that on the clean dataset under suitable conditions (Wang et al., 2019; Ghosh et al., 2017), described in Appendix A.2. Let ℓ_rob be such a noise-robust loss and L_rob the corresponding risk. Then we have:

arg min_θ L_rob(f_θ, D̃) = arg min_θ L_rob(f_θ, D).   (6)
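For concreteness, both robust losses used in this paper admit simple closed forms on a probability vector. A NumPy sketch (the helper names are ours; the truncation constant A = −4 for RCE follows Wang et al. (2019)):

```python
import numpy as np

def mae_loss(probs, labels):
    """MAE between prediction and one-hot label:
    ||e_y - f(x)||_1 = 2 * (1 - f(x)[y])  (Ghosh et al., 2017)."""
    n = len(labels)
    return np.mean(2.0 * (1.0 - probs[np.arange(n), labels]))

def rce_loss(probs, labels, A=-4.0):
    """Reversed cross entropy: cross entropy of the one-hot label
    distribution against the prediction, with log 0 truncated to the
    constant A, which reduces to -A * (1 - f(x)[y]) (Wang et al., 2019)."""
    n = len(labels)
    return np.mean(-A * (1.0 - probs[np.arange(n), labels]))
```

Note that both losses are bounded, which is the source of their noise robustness and also of the optimization difficulty mentioned in Remark 1.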
Therefore, we can use L_rob(f_θ, D̃) to measure the optimality of θ(T). This property enables us to seek T = T* by checking whether T minimizes L_rob(f_{θ(T)}, D̃). The above procedure can be naturally formulated as a bilevel problem. We split the noisy dataset into a training set D̃_tr and a validation set D̃_v. In the inner loop, T is fixed and we obtain θ(T) by minimizing the forward correction loss on the training set over θ, i.e., θ(T) := arg min_θ L(T f_θ, D̃_tr). In the outer loop, we optimize the robust loss of f_{θ(T)} on the validation set by minimizing over T, i.e., min_T L_rob(f_{θ(T)}, D̃_v). We summarize the bilevel procedure as follows:

min_T L_rob(f_{θ(T)}, D̃_v)   (7)
s.t. θ(T) = arg min_θ L(T f_θ, D̃_tr).

We name it RObust Bilevel OpTimization (ROBOT). Remarkably, both L and L_rob are consistent estimators by the law of large numbers, without requiring perfect noisy posteriors (Jeffreys, 1998). We also include the convergence analysis of ROBOT in Appendix A.9 for completeness, which is an application of standard bilevel optimization.

Remark 1. A curious reader may wonder whether a two-step procedure could replace the bilevel framework: first learn θ̂ by minimizing the robust loss in equation 6 and then plug θ̂ into equation 5 to obtain T̂. From the statistical point of view (ignoring optimization difficulty), we may have θ̂ = θ* with infinite samples, which further leads to T̂ = T*. However, existing works show that robust losses are very hard to optimize (Zhang & Sabuncu, 2018; Wang et al., 2019), indicating that directly optimizing equation 6 over θ can hardly reach θ* in practice. ROBOT in equation 7 instead transfers the optimization from the space of neural network parameters to the space of T when minimizing L_rob(f_{θ(T)}, D̃_v) over T.
The experimental results in Appendix B.4 show that training with ROBOT significantly decreases the training robust loss, while directly optimizing the robust loss fails to do so. This indicates that the reparametrization in ROBOT may have better convergence properties when minimizing the robust loss. We will investigate the in-depth mechanism of this interesting phenomenon in future work.
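To make the bilevel procedure in equation 7 concrete, here is a deliberately simplified, self-contained sketch. None of the simplifications come from the paper: the model is softmax regression trained by plain gradient descent, the outer problem is reduced to a one-dimensional grid search over a symmetric noise rate ρ parameterizing T(ρ) (whereas the full method optimizes a free T with gradient-based bilevel optimization), and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 2, 2000

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sym_T(rho):
    # symmetric noise: flip to each other class with probability rho / (K - 1)
    return (1 - rho) * np.eye(K) + rho / (K - 1) * (np.ones((K, K)) - np.eye(K))

def inner_train(T, X, y, steps=400, lr=0.5):
    # inner loop: theta(T) = argmin_theta L(T f_theta, D_tr) by gradient descent
    W = np.zeros((X.shape[1], K))
    m = len(y)
    for _ in range(steps):
        s = softmax(X @ W)                                    # f_theta(x)
        q = s @ T.T                                           # T f_theta(x)
        a = -T[y] / (q[np.arange(m), y][:, None] + 1e-12)     # d(CE)/ds
        gz = s * (a - (s * a).sum(axis=1, keepdims=True))     # back through softmax
        W -= lr * X.T @ gz / m
    return W

def mae_loss(probs, labels):
    m = len(labels)
    return np.mean(2.0 * (1.0 - probs[np.arange(m), labels]))

# synthetic separable clean data; flip 30% of labels symmetrically
X = np.hstack([rng.normal(size=(n, 1)) + np.where(rng.random(n) < 0.5, -2.0, 2.0)[:, None],
               np.ones((n, 1))])                              # 1-d feature + bias
y_clean = (X[:, 0] > 0).astype(int)
flip = rng.random(n) < 0.3
y_noisy = np.where(flip, 1 - y_clean, y_clean)
tr, va = slice(0, n // 2), slice(n // 2, n)

# outer loop: pick the rho whose inner solution minimizes the robust (MAE)
# loss on the noisy validation split -- no clean labels are used anywhere
grid = [0.0, 0.1, 0.2, 0.3, 0.4]
losses = []
for rho in grid:
    W = inner_train(sym_T(rho), X[tr], y_noisy[tr])
    losses.append(mae_loss(softmax(X[va] @ W), y_noisy[va]))
best_rho = grid[int(np.argmin(losses))]
print("selected noise rate:", best_rho)
```

In this toy setting, uncorrected training (ρ = 0) yields a clearly worse robust validation loss than training with a substantial correction, which is the signal the outer loop exploits.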

3.2. THEORETICAL ANALYSIS OF ROBOT

In this section, we analyze the theoretical properties of ROBOT in equation 7. First we show the identifiability of T* with infinite noisy samples; then we provide the finite sample generalization bound together with consistency. We start with some mild assumptions:

Assumption 1. The optimal θ* for the cross entropy on the clean dataset also minimizes the robust losses on the clean dataset D, i.e., L_rob(θ*, D) < L_rob(θ, D) for all θ ≠ θ*.

This assumption is natural because f_{θ*}(x) = P(Y | X = x) is supposed to be optimal for both the cross entropy and the robust losses on the clean dataset. Further, Ghosh et al. (2017) show that these losses are classification-calibrated, and minimizing them decreases the 0-1 error. We assume θ* is unique; otherwise we can consider the minimizer with the minimum norm, and the identifiability results are the same. Notably, robust losses are more difficult to optimize and often lead to under-fitting when applied to deep neural networks (Wang et al., 2019; Zhang & Sabuncu, 2018), so it is quite difficult to obtain θ* in practice by minimizing L_rob(θ, D̃) over θ. ROBOT avoids this issue by using bilevel optimization; refer to Remark 1 for more discussion.

Assumption 2. The mapping θ(T) is injective, i.e., θ(T_1) ≠ θ(T_2) if T_1 ≠ T_2.

We then present the identifiability result given infinite samples:

Theorem 1 (Identifiability). Suppose Assumptions 1 and 2 and the conditions in Appendix A.3 hold. Then T* is the unique minimizer of the outer objective, i.e., L_rob(f_{θ(T*)}, D̃) < L_rob(f_{θ(T)}, D̃) for all T ≠ T*. (The proof is in Appendix A.5.)

Theorem 2 (Finite sample generalization). Let F := {ℓ_rob(f_{θ(T)}(•); •): 𝒳 × 𝒴 → R_+, ∀T ∈ 𝒯} with ℓ_rob bounded by M. Fix any ϵ > 0 and assume we obtain an ϵ-approximate solution T̂ that satisfies L_rob(θ̂(T̂), D̃_v(n)) ≤ L_rob(θ̂(T), D̃_v(n)) + ϵ for all T ∈ 𝒯. Let N(ϵ, F, ∥•∥_∞) be the ϵ-covering number of F. Then with probability at least 1 − δ,

L_rob(θ̂(T̂), D̃) ≤ inf_{T∈𝒯} L_rob(θ̂(T), D̃) + 3ϵ + M √( 2 ln(2 N(ϵ, F, ∥•∥_∞)/δ) / n ).

Further, if equation 7 is well solved (ϵ → 0), then T̂ →p T* as n → ∞.

The full proof of Theorem 2 is included in Appendix A.6; additional results on the convergence of the T-estimation error are in Appendix A.11. Due to the instability of MGEO methods, finite sample guarantees have been missing for them, while the consistency of our method is a direct consequence of the finite sample property. Notably, the consistency results of existing works (e.g., Theorem 2 of Zhang et al. (2021)) require that the noisy posterior is perfectly estimated, which does not hold in the more realistic setting we consider. We then analyze the stability of ROBOT under the inaccurate noisy posterior ĝ_{ϵ,x′} defined in equation 4.

Proposition 2 (Stability). Suppose we obtain an inaccurate posterior ĝ_{ϵ,x′}(•) as defined in equation 4. Then the T-estimation error of ROBOT is upper bounded as sup_{ϵ∈Ξ} ∥T̂^ROBOT_{ϵ,x′} − T*∥ ≤ O(1/n). (See Appendix A.7 for the proof.)

Comparing Proposition 2 with Proposition 1, we can see that ROBOT achieves O(1/n) robustness to the posterior estimation error in the case where MGEO incurs a constant error. Finally, we summarize the comparison between ROBOT and MGEO in Table 1.

Table 1: A comparison between MGEO methods and ROBOT. MGEO methods contain anchor-based (Liu & Tao, 2015; Patrini et al., 2017) and anchor-free (Li et al., 2021; Zhang et al., 2021) methods. *Consistency here refers to the setting where the noisy posterior is not assumed to be perfectly estimated, since posterior estimation errors are inevitable in practice.

| Method              | No anchor point assumption | Identifiability | Consistency* | Finite sample generalization | Error under ĝ_{ϵ,x′} |
| MGEO (anchor-based) | ✗                          | ✓               | ✗ (Cor 1)    | ✗                            | Ω(1) (Prop 1)        |
| MGEO (anchor-free)  | ✓                          | ✓               | ✗ (Cor 1)    | ✗                            | Ω(1) (Prop 1)        |
| ROBOT               | ✓                          | ✓ (Thm 1)       | ✓ (Thm 2)    | ✓ (Thm 2)                    | O(1/n) (Prop 2)      |

4. EXPERIMENTS

In this section, we conduct extensive experiments to demonstrate the effectiveness of ROBOT. Our method demonstrates superior performance compared with other state-of-the-art approaches based on loss correction. In particular, we try two robust losses for the outer-loop objective of ROBOT, namely the MAE (Ghosh et al., 2017) and RCE (Wang et al., 2019) losses. Benchmark Datasets. We evaluate our proposed method on three synthetic-noise datasets: MNIST, CIFAR10, and CIFAR100. In addition, we conduct experiments on three real-world datasets, namely CIFAR10-N, CIFAR100-N, and Clothing1M. For the synthetic-noise experiments, we use two commonly used noise generation processes: symmetric and pair-flip noise. Each experiment is repeated 5 times, and we report both the mean and standard deviation of the results. The results of the baseline approaches are taken from Li et al. (2021). For more details about the datasets and noise generation, please refer to the appendix. Experiments on Synthetic Label Noise. We compare our method with other approaches on commonly used datasets with synthetic label noise as described above. The results in Table 2 show that ROBOT consistently outperforms the baselines by a large margin across all datasets and types of label noise. Remarkably, the advantage of our method becomes more evident as the task gets more challenging. For instance, our method outperforms the previous SOTA T-estimation approach VolMinNet (Li et al., 2021) by 11.22% and 11.27% in test accuracy on CIFAR100 with 50% uniform noise and 45% pair-flip noise, respectively.


Estimation Error of the Transition Matrix. We compare the T-estimation error with other approaches on a variety of datasets and settings. Note that with synthetic label noise we have the ground truth T* and are therefore able to calculate the estimation error. The results in Table 3 show that ROBOT achieves a lower T-estimation error than MGEO-based methods. Note that the comparison between ROBOT and the two-stage methods supports the argument in Remark 1. Experiments on Real-World Label Noise. To further verify the ability of our method to handle label noise, we conduct experiments on datasets containing real-world label noise. Specifically, we showcase the performance of our method on CIFAR10-N, CIFAR100-N (in Table 4) and Clothing1M (in Table 5). Note that in this paper we mostly focus on estimating a transition matrix T that is robust to outliers; therefore, we mainly compare with approaches based on loss correction for fairness. We observe that our method outperforms the other baselines by a noticeable margin, which verifies its ability to handle real-world label noise.

5. CONCLUSION

In this paper, we investigate the problem of learning statistically consistent models under label noise by estimating T . We first propose the framework MGEO to unify the existing T -estimation methods. Then we provide both theoretical and experimental results to show that MGEO methods are sensitive to the error in noisy posterior estimation. To overcome the limitation of MGEO, we further propose ROBOT, which enjoys superior theoretical properties and shows strong empirical performance.

REPRODUCIBILITY STATEMENT

The experiments in the paper are all conducted on public datasets. The hyperparameters and network choices for the experiments are elaborated in Section 4 and Appendix B. We submit the source code with the ICLR submission; the code will be made public upon acceptance of the paper. For the theory part, the assumptions and full proofs are included in Section 3.2 and Appendices A.4, A.5, A.6, and A.7.

A PROOFS

A.1 INTRODUCTION OF EXISTING METHODS

In Section 2, we show that the MGEO framework takes anchor-based and anchor-free (minimum volume) methods as special cases. In this section, we briefly introduce the anchor-based method (Patrini et al., 2017; Liu & Tao, 2015) and the minimum volume method (Li et al., 2021) for completeness. Here we assume that the fitted posterior matches the noisy posterior, i.e., ĝ(x) = P(Ỹ | X = x), for all x.

The anchor-based method. For each class j ∈ [K], anchor-based methods assume that there exists an anchor point x_j with P(Y | X = x_j) = e_j. Anchor-based methods first find the most confident sample for each class:

x̂_j = arg max_x ĝ(x)[j],   (9)

where ĝ(x)[j] is the probability of class j for the sample x given by ĝ(x). By the anchor point assumption, we have

T_{•j} = T e_j = T P(Y | X = x̂_j) = ĝ(x̂_j),   (10)

where T_{•j} is the jth column of T. By repeating equation 9 and equation 10 for each class, anchor-based methods obtain an estimate of the whole T.

The anchor-free method (the minimum volume method). Since the posterior g(x) = T P(Y | X = x), g(x) is enclosed in the convex hull formed by the columns of T. However, there are still infinitely many T whose conv(T) encloses all the samples. When the samples are sufficiently scattered, Li et al. (2021) show that T* is the one with the minimum volume, so they solve the following problem:

min_{T∈𝒯} vol(T)   (11)
s.t. T f_θ(x) = ĝ(x), ∀x,

where vol(T) is the volume of T.

A.2 THE CONDITIONS FOR THE ROBUST LOSSES

The robust losses work well (i.e., equation 6 holds) when the noise ratio is not too large. As shown in Theorem 1 of Wang et al. (2019), we need the noise rate to satisfy the following condition:

Condition 1 (restatement of Theorem 1 in Wang et al. (2019)). Equation 6 holds 1) under symmetric or uniform label noise if the noise rate η < 1/K, where K is the class number; and 2) under asymmetric or class-dependent label noise if the noise rates satisfy η_{yk} ≤ 1 − η_y for all k ≠ y, where η_y = Σ_{k≠y} η_{yk}.

A.3 THE CONDITIONS FOR THE CONSISTENCY OF FORWARD CORRECTION METHOD

The conditions for forward correction are presented in Section 4.2 of Patrini et al. (2017); we discuss them briefly in this section for completeness. Recall that the risk on a dataset D(n) is L(θ, D(n)) := (1/n) Σ_{(x,y)∈D(n)} ℓ(f_θ(x), y), where ℓ is the loss function. We take ℓ to be the cross entropy in this section since we focus on classification tasks. Here the loss function ℓ is endowed with a link function ϕ: Δ^{K−1} → R^K, where K is the class number. In the case of cross entropy, the softmax is the inverse link function, i.e., f_θ(x) = ϕ^{-1}(h_θ(x)) and ℓ(f_θ(x), y) = ℓ_ϕ(h_θ(x), y). The first condition is that the function space parameterized by θ is large enough:

Condition 2. The function class is sufficiently large such that there exists a θ* with f_{θ*}(x) = P(y | x).

Notably, by the universal approximation property of neural networks (Scarselli & Tsoi, 1998), Condition 2 can be satisfied by using a deep and wide neural network. In this work, we use LeNet for the MNIST dataset, ResNet18 for CIFAR10 and CIFAR10-N, ResNet34 for CIFAR100 and CIFAR100-N, and ResNet50 for Clothing1M (the same as existing works). The second condition is that the composite loss is proper:

Condition 3. Suppose Condition 2 holds; the composite loss is proper, i.e., arg min_{h_θ} ℓ_ϕ(h_θ(x), y) = ϕ(p(y | x)).

Notably, cross entropy and square loss are examples of proper composite losses. Theorem 2 of Patrini et al. (2017) shows that if Conditions 2 and 3 hold, minimizing the forward correction loss on the noisy data leads to the optimal function that minimizes the loss on the clean data.

A.4 ASSUMPTIONS AND PROOF OF PROPOSITION 1

Assumption 3. Assume the following:
(a) There is label noise of constant level, i.e., max_{i∈[K]} (1 − T_ii) ≥ C ≥ Ω(1).
(b) Let µ denote the Lebesgue measure. For any T_1, T_2 ∈ 𝒯, if Conv(T_1) ⊊ Conv(T_2), then M(T_2) − M(T_1) ≥ Ω(µ(Conv(T_2) \ Conv(T_1))).
(c) If M(T_1) > M(T_2), then ∥T_1 − T_2∥_F ≥ Ω(M(T_1) − M(T_2)).
(d) g(x′) does not lie on the boundary of Conv(T*), i.e., g(x′) = Σ_{i=1}^K α_i t_i with 0 < α_i < 1 for all i.

Assumption 3(a) is natural for noisy-data problems, as the noise ratio is larger than 0. The intuition for Assumption 3(b) is that the difference in the metric should be of the same order as the difference in the Lebesgue measure of the convex hulls: for example, if Conv(T_1) is larger than Conv(T_2) by a constant level in Lebesgue measure, then the metric of T_1 is also larger than that of T_2 by a constant level. Assumption 3(c) requires that if two matrices T_1 and T_2 differ in terms of the metric M, there should be a difference between their elements, which is captured by the Frobenius norm. One can easily check that these assumptions hold for the anchor point and minimum volume methods illustrated in Examples 1 and 2. Assumption 3(d) is also mild because it holds almost surely.

Proof. Denote by v_k the vector with 1 at the kth index and 0 elsewhere, i.e., v_{k,i} = 1 if i = k and 0 otherwise. Let

T̂^MGEO := arg min_{V∈𝒯} M(V), s.t. Ĝ(n) ⊂ Conv(V),
T̂^MGEO_{ϵ,x′} := arg min_{V∈𝒯} M(V), s.t. Ĝ_{ϵ,x′}(n) ⊂ Conv(V).

Under the anchor point assumption (Patrini et al., 2017) or the sufficiently scattered assumption (Li et al., 2021), T̂^MGEO = T*. Take j = arg min_i T*_ii and ϵ_j = −g(x′) + e_j. In this case, we claim T̂^MGEO_{ϵ_j,x′} = T^{e_j}, where T^{e_j} := [t_1, t_2, ..., e_j, ..., t_K] replaces the jth column of T* by e_j.
This is because: 1) G(n, ϵ) ⊂ Conv(T_{e_j}), since G(n) ⊂ Conv(T_{e_j}) and g_{ϵ,x′}(x′) ∈ Conv(T_{e_j}); 2) for any α_i ∈ (T_ii, 1], we have v_{e_i} = α_i e_i + Σ_{j∈[K]\{i}} α_j t_j ∉ Conv(T) for any choice with Σ_{j∈[K]\{i}} α_j = 1 − α_i and α_j ≥ 0. This is because for all v ∈ Conv(T) we have v_i ≤ T_ii, while the ith element of v_{e_i} is larger than T_ii. On the other hand, one can easily see that Conv(T) ⊂ Conv(T_{e_i}). So we have

µ(Conv(T_{e_i}) \ Conv(T)) ≥ µ({v | v = α_i e_i + Σ_{j∈[K]\{i}} α_j t_j, α_i ∈ (T_ii, 1], Σ_{j∈[K]\{i}} α_j = 1 − α_i, α_j ≥ 0}) = Ω(1 − T_ii).

Then we have

sup_ϵ ∥T̂^MGEO_{ϵ,x′} − T̂^MGEO∥_F ≥ ∥T̂^MGEO_{ϵ_j,x′} − T̂^MGEO∥_F = ∥T_{e_j} − T∥_F ≥ Ω(M(T_{e_j}) − M(T)) ≥ Ω(µ(Conv(T_{e_j}) \ Conv(T))) ≥ Ω(1 − T_jj) ≥ Ω(1).

The first inequality is due to taking ϵ as ϵ_j; the equality is due to equations 13 and 14; the second inequality is due to Assumption 3(c); the third inequality is due to Assumption 3(b); the fourth inequality is due to equation 15; and the last inequality follows from Assumption 3(a) together with j = arg min_i T_ii.
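The membership question that drives this proof — does an estimated posterior lie in the convex hull of the rows of T — can be tested numerically. A minimal sketch with a hypothetical 3-class transition matrix (function name and values are ours):

```python
import numpy as np

def in_convex_hull(g, T, tol=1e-9):
    """Check whether posterior g lies in Conv(T), the hull of the rows
    of T, by solving T^T alpha = g with sum(alpha) = 1 for barycentric
    coordinates. Valid when the rows of T are affinely independent, as
    for a nondegenerate transition matrix."""
    K = len(g)
    A = np.vstack([T.T, np.ones(K)])   # stack the sum-to-one constraint
    b = np.append(g, 1.0)
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    resid = np.linalg.norm(A @ alpha - b)
    return resid < tol and np.all(alpha >= -tol)

T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
g_inside = 0.5 * T[0] + 0.3 * T[1] + 0.2 * T[2]   # convex combination
e1 = np.array([1.0, 0.0, 0.0])   # simplex vertex, outside Conv(T)
```

The vertex e_1 is exactly the kind of point the proof perturbs toward: it lies outside Conv(T) whenever T_11 < 1, forcing an MGEO method to enlarge its estimate.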

A.5 PROOF OF THEOREM 1

Proof. First, there exists a T that induces θ* in the inner loop. This is due to the consistency of forward/backward correction: when T = T*, θ(T) → θ* as n → ∞. Then by Assumption 1, we know L_rob(f_{θ(T*)}, D) < L_rob(f_θ, D) for all θ ≠ θ*. As verified in Ghosh et al. (2017); Wang et al. (2019); Xu et al. (2019), we have L_rob(f_θ, D̃) = L_rob(f_θ, D) + c for any θ, where c is a fixed constant. We then have L_rob(f_{θ(T*)}, D̃) < L_rob(f_θ, D̃) for all θ ≠ θ*. Finally, we further have L_rob(θ(T*)) < L_rob(θ(T)) for all T ≠ T* by Assumption 2.
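The constant-offset property of symmetric robust losses used in this step can be checked numerically. The sketch below (illustrative values; uniform noise with rate η) verifies that the expected MAE risk under noisy labels is an affine function of the clean-label loss, with the same coefficients for every prediction, so both risks share the same minimizer:

```python
import numpy as np

def mae(probs, label):
    """MAE between a probability vector and the one-hot label."""
    onehot = np.zeros(len(probs)); onehot[label] = 1.0
    return np.abs(probs - onehot).sum()

K, eta = 3, 0.3                       # number of classes, uniform noise rate

def noisy_risk(probs, clean_label):
    """Expected MAE over the noisy-label distribution for one sample."""
    return sum((1 - eta if j == clean_label else eta / (K - 1)) * mae(probs, j)
               for j in range(K))

# MAE is symmetric: sum_y mae(f, y) = 2(K - 1) for every prob vector f,
# hence noisy risk = (1 - eta*K/(K-1)) * clean loss + eta/(K-1) * 2(K-1).
a = 1 - eta * K / (K - 1)
c = eta / (K - 1) * 2 * (K - 1)
f1 = np.array([0.7, 0.2, 0.1])
f2 = np.array([0.1, 0.3, 0.6])
```

Since a > 0 whenever η < (K−1)/K, the noisy and clean risks are minimized by the same θ, which is exactly what the proof exploits.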

A.6 PROOF OF THEOREM 2

Proof. Recall the definition of an ϵ-covering: we can find a T_i in the covering set for T̂ such that |ℓ(θ̂(T_i); X, Y) − ℓ(θ̂(T̂); X, Y)| ≤ ϵ for all (X, Y) ∈ X × Y. Then we have:

L_rob(θ̂(T̂), D) ≤ L_rob(θ̂(T_i), D) + ϵ
≤ L_rob(θ̂(T_i), D_{n_v}) + M sqrt(ln(2N(ϵ, F, ∥·∥_∞)/δ)/(2n)) + ϵ
≤ L_rob(θ̂(T̂), D_{n_v}) + M sqrt(ln(2N(ϵ, F, ∥·∥_∞)/δ)/(2n)) + 2ϵ
≤ L_rob(θ̂(T), D_{n_v}) + M sqrt(ln(2N(ϵ, F, ∥·∥_∞)/δ)/(2n)) + 3ϵ
≤ L_rob(θ̂(T), D) + M sqrt(ln(2/δ)/(2n)) + M sqrt(ln(2N(ϵ, F, ∥·∥_∞)/δ)/(2n)) + 3ϵ
≤ L_rob(θ̂(T), D) + M sqrt(2 ln(2N(ϵ, F, ∥·∥_∞)/δ)/n) + 3ϵ.

The first and third inequalities are by the definition of the ϵ-covering; the second inequality applies Hoeffding's inequality to all elements of the covering set; the fourth inequality is because T̂ is an ϵ-approximate solution on the dataset D_{n_v}; the fifth inequality applies Hoeffding's inequality at T; and the last inequality is because N(ϵ, F, ∥·∥_∞) > 1.

A.7 PROOF OF PROPOSITION 2

Assumption 4. Assume that the first and second derivatives involved have bounded norms and that the relevant matrices have positive minimum singular values: (a) σ_min([∂L_rob(f_θ, D_n)/∂θ][∂²θ_ϵ(T)/∂T²]) ≥ Q_1 > 0; (b) σ_min(∂²L(T f_θ, D_n)/∂θ∂θᵀ) ≥ Q_2 > 0; (c) ∥∂θ(T)/∂T∥_2 ≤ Q_3; (d) ∥∂L_rob(f_θ, D_n)/∂θ∥_2 ≤ Q_5; (e) ∥∂²L(T f_θ, D_n)/∂θ∂Tᵀ∥_2 ≤ Q_6; (f) ∥∂k(θ_ϵ, T, x′, y′, ϵ)/∂θ∥_2 ≤ Q_7; (g) ∥∂²k(θ, T, x′, y′, ϵ)/∂θ∂θᵀ∥_2 ≤ Q_8. We immediately have

∥([∂L_rob(f_θ, D_n)/∂θ][∂²θ_ϵ(T)/∂T²])^{−1}∥_F ≤ √d ∥([∂L_rob(f_θ, D_n)/∂θ][∂²θ_ϵ(T)/∂T²])^{−1}∥_2 ≤ √d / Q_1.

Lemma 1 (Cauchy, Implicit Function Theorem; Theorem 1 of Lorraine et al. (2020)). If ∂L(T f_θ, D)/∂θ |_{θ′,T′} = 0 for some (θ′, T′) and regularity conditions are satisfied, then surrounding (θ′, T′) there exists a function θ*(T) such that ∂L(T f_θ, D)/∂θ |_{θ*(T),T} = 0, and we have

∂θ*(T)/∂T |_{T′} = (∂²L(T f_θ, D)/∂θ∂θᵀ)^{−1} × (∂²L(T f_θ, D)/∂θ∂Tᵀ) |_{θ*(T′),T′}.

We consider the case where the noisy posterior perfectly fits the dataset D_n(ϵ) := D_{n−1} ∪ {(x′, y′ + ϵ)}, which is equivalent to the inaccurate posterior g_{ϵ,x′}.
θ_ϵ(T) = arg min_θ { (n−1)/n · L(T f_θ, D_{n−1}) + (1/n) ℓ(T f_θ(x′), y′ + ϵ) }
= arg min_θ { L(T f_θ, D_n) + (1/n) [ℓ(T f_θ(x′), y′ + ϵ) − ℓ(T f_θ(x′), y′)] }
= arg min_θ { L(T f_θ, D_n) + (1/n) k(θ, T, x′, y′, ϵ) }.

Because ∂L(T f_θ, D_n)/∂θ = 0, we have

−(1/n) ∂k(θ_ϵ, T, x′, y′, ϵ)/∂θ = ∂L(T f_{θ_ϵ}, D_n)/∂θ = ∂L(T f_{θ_ϵ}, D_n)/∂θ − ∂L(T f_θ, D_n)/∂θ = [∂²L(T f_θ, D_n)/∂θ∂θᵀ](θ_ϵ − θ) + o(θ_ϵ − θ).

Then

∂θ_ϵ(T)/∂T − ∂θ(T)/∂T
= [∂²L(T f_θ, D_n(ϵ))/∂θ∂θᵀ]^{−1} × [∂²L(T f_θ, D_n(ϵ))/∂θ∂Tᵀ] − ∂θ(T)/∂T
= [∂²L(T f_θ, D_n)/∂θ∂θᵀ + (1/n) ∂²k(θ, T, x′, y′, ϵ)/∂θ∂θᵀ]^{−1} × [∂²L(T f_θ, D_n)/∂θ∂Tᵀ + (1/n) ∂²k(θ, T, x′, y′, ϵ)/∂θ∂Tᵀ] − ∂θ(T)/∂T
= (1/n) [∂²L(T f_θ, D_n)/∂θ∂θᵀ]^{−2} [∂²k(θ, T, x′, y′, ϵ)/∂θ∂θᵀ] [∂²L(T f_θ, D_n)/∂θ∂Tᵀ] + (1/n) [∂²L(T f_θ, D_n)/∂θ∂θᵀ]^{−1} [∂²L(T f_θ, D_n)/∂θ∂Tᵀ] + o(1/n)
= (1/n) J + o(1/n),

where J := [∂²L(T f_θ, D_n)/∂θ∂θᵀ]^{−2} [∂²k(θ, T, x′, y′, ϵ)/∂θ∂θᵀ] [∂²L(T f_θ, D_n)/∂θ∂Tᵀ] + [∂²L(T f_θ, D_n)/∂θ∂θᵀ]^{−1} [∂²L(T f_θ, D_n)/∂θ∂Tᵀ].

By Assumption 4, we have ∥J∥_2 ≤ Q_6 Q_8 / Q_2² + Q_6 / Q_2. Then we have

(θ − θ_ϵ) = (1/n) [∂²L(T f_θ, D_n)/∂θ∂θᵀ]^{−1} ∂k(θ_ϵ, T, x′, y′, ϵ)/∂θ.

Then we have the following by omitting the higher-order terms:

∂L_rob(f_{θ_ϵ}, D_n(ϵ))/∂θ = ∂L_rob(f_{θ_ϵ}, D_n)/∂θ + (1/n) ∂k(θ, T, x′, y′, ϵ)/∂θ = ∂L_rob(f_θ, D_n)/∂θ + [∂²L_rob(f_θ, D_n)/∂θ∂θᵀ](θ_ϵ − θ) + (1/n) ∂k(θ, T, x′, y′, ϵ)/∂θ + o(1/n).

And further:

∂θ_ϵ(T_ϵ)/∂T = [∂θ_ϵ(T_ϵ)/∂T − ∂θ_ϵ(T)/∂T] + [∂θ_ϵ(T)/∂T − ∂θ(T)/∂T] + ∂θ(T)/∂T = [∂²θ_ϵ(T)/∂T²](T_ϵ − T) + (1/n) J + ∂θ(T)/∂T + o(1/n).

On the other side, we have

0 = ∂L_rob(f_{θ(T_ϵ)}, D_n(ϵ))/∂T = [∂L_rob(f_{θ_ϵ}, D_n(ϵ))/∂θ][∂θ_ϵ(T_ϵ)/∂T]
= { ∂L_rob(f_θ, D_n)/∂θ + [∂²L_rob(f_θ, D_n)/∂θ∂θᵀ](θ_ϵ − θ) + (1/n) ∂k(θ, T, x′, y′, ϵ)/∂θ } × { ∂θ(T)/∂T + [∂²θ_ϵ(T)/∂T²](T_ϵ − T) + (1/n) J } + o(1/n).

Denote

K := [∂²L_rob(f_θ, D_n)/∂θ∂θᵀ] [∂²L(T f_θ, D_n)/∂θ∂θᵀ]^{−1} ∂k(θ_ϵ, T, x′, y′, ϵ)/∂θ + I.

With Assumption 4, we have ∥K∥_2 ≤ Q_5 Q_7 / Q_2 + 1. Recall that [∂L_rob(f_θ, D_n)/∂θ][∂θ(T)/∂T] = 0; then we have the following by omitting higher-order terms:

(1/n) [∂L_rob(f_θ, D_n)/∂θ] J + [∂L_rob(f_θ, D_n)/∂θ][∂²θ_ϵ(T)/∂T²](T_ϵ − T) = −(1/n) K [∂θ(T)/∂T].

Then finally we have

∥T_ϵ − T∥_F ≤ K ∥T_ϵ − T∥_2 = (K/n) ∥([∂L_rob(f_θ, D_n)/∂θ][∂²θ_ϵ(T)/∂T²])^{−1} (K [∂θ(T)/∂T] + [∂L_rob(f_θ, D_n)/∂θ] J)∥_2
≤ (K/n) ∥([∂L_rob(f_θ, D_n)/∂θ][∂²θ_ϵ(T)/∂T²])^{−1}∥_2 ∥K [∂θ(T)/∂T] + [∂L_rob(f_θ, D_n)/∂θ] J∥_2
≤ (K/n)(1/Q_1) (∥K [∂θ(T)/∂T]∥_2 + ∥[∂L_rob(f_θ, D_n)/∂θ] J∥_2)
≤ (K/n) [(1/Q_1)(Q_5 Q_7 / Q_2 + 1) Q_3 + Q_5 (Q_6 Q_8 / Q_2² + Q_6 / Q_2)] = Q/n,

where Q = (1/Q_1)(Q_5 Q_7 / Q_2 + 1) Q_3 + Q_5 (Q_6 Q_8 / Q_2² + Q_6 / Q_2) and Q_1–Q_8 are specified in Assumption 4.
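Lemma 1 can be sanity-checked on a toy quadratic inner problem where the minimizer, and hence the Jacobian ∂θ*(T)/∂T, is available in closed form. This is a hypothetical instance, not the paper's model, and we include the minus sign from the standard statement of the implicit function theorem (Theorem 1 of Lorraine et al., 2020):

```python
import numpy as np

# Toy instance: inner loss L(theta, t) = 0.5 * ||theta - A t||^2, whose
# exact minimizer is theta*(t) = A t, so d theta* / d t should equal A.
A = np.array([[2.0, 0.5],
              [0.0, 1.0]])

def hypergrad():
    """Implicit-function-theorem Jacobian at a stationary theta:
    -(d2L/dtheta2)^-1 @ (d2L/dtheta dt). Both Hessians are constant
    for this quadratic, so no evaluation point is needed."""
    d2_theta = np.eye(2)   # Hessian of 0.5*||theta - A t||^2 in theta
    d2_cross = -A          # mixed second derivative d/dt of (theta - A t)
    return -np.linalg.inv(d2_theta) @ d2_cross

t = np.array([1.0, -1.0])
theta_star = A @ t          # closed-form inner minimizer
J = hypergrad()             # should recover A exactly
```

In ROBOT the same formula is what lets the outer loop differentiate through the inner minimizer without unrolling the full optimization trajectory.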

A.8 ALGORITHM OF ROBOT

Usually the neural network parameters θ are high-dimensional and the number of samples in the training dataset D_tr is huge, which makes exactly solving the inner problem infeasible. Therefore, in practice, we solve the inner problem and the outer problem alternately at each update step to alleviate the computational burden (Shu et al., 2019; Ren et al., 2018). We provide a detailed explanation of how ROBOT works in practice.

Forward Correction with Noise Transition Matrix T. With the noise transition matrix T, one update step of the forward correction method in the inner loop can be formulated as:

θ_{t+1} = θ_t − η (1/n) Σ_{i=1}^{n} ∂ℓ_ce(T f_θ(x_i), y_i)/∂θ,

where n is the number of training samples in a mini-batch and η is the learning rate.

Approximately Solving the Inner Problem. Due to the high time complexity of exactly solving θ(T) = arg min_θ L(T f_θ, D_tr), we use a one-step update to approximate it, which is widely used in previous literature and shown to be effective (Shu et al., 2019; Ren et al., 2018). The approximation at θ_t can be formulated as follows:

θ̂(T) = θ_t − η (1/n) Σ_{i=1}^{n} ∂ℓ_ce(T f_θ(x_i), y_i)/∂θ.

Update T in the Outer Loop. With the approximate solution from the inner loop, we establish a mapping from T to θ̂(T), with which we can calculate the gradient of the outer loss w.r.t. T. Therefore, the update in the outer loop can be formulated as:

T_{t+1} = T_t − α (1/m) Σ_{i=1}^{m} ∂ℓ_rob(f_{θ̂(T)}(x_i), y_i)/∂T.

Discussion about the Algorithm. In practice, the noisy training dataset D_tr and the noisy validation dataset D_v can be the same, which achieves similar performance. Our method is able to scale because of the one-step approximation used in our implementation. Besides, the frequency of the outer-loop updates can be set lower for better efficiency.
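The alternating scheme above can be sketched end-to-end on a toy binary problem. Finite-difference gradients stand in for autodiff, and all data, model, and hyperparameter choices below are illustrative rather than the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noisy dataset: clean label = 1[x > 0], flipped with probability 0.3.
n = 2000
x = rng.normal(size=n)
clean = (x > 0).astype(int)
y_noisy = np.where(rng.random(n) < 0.3, 1 - clean, clean)

def probs(theta, x):
    """Logistic model; columns are [P(y=0|x), P(y=1|x)]."""
    p1 = 1.0 / (1.0 + np.exp(-(theta[0] * x + theta[1])))
    return np.stack([1 - p1, p1], axis=1)

def corrected_ce(theta, rho):
    """Forward-corrected CE with a symmetric flip rate rho."""
    T = np.array([[1 - rho, rho], [rho, 1 - rho]])
    noisy_p = probs(theta, x) @ T     # estimated noisy posterior
    return -np.mean(np.log(noisy_p[np.arange(n), y_noisy] + 1e-12))

def robust_outer(theta):
    """Robust (MAE) outer loss evaluated on the noisy labels."""
    return np.mean(np.abs(probs(theta, x) - np.eye(2)[y_noisy]).sum(axis=1))

def grad_theta(f, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

def one_step(theta, rho, lr=0.5):
    """Inner update: one gradient step on the corrected loss."""
    return theta - lr * grad_theta(lambda th: corrected_ce(th, rho), theta)

theta, rho = np.zeros(2), 0.1         # initial model and noise-rate estimate
for _ in range(200):
    theta = one_step(theta, rho)      # inner loop (one-step approximation)
    # outer loop: differentiate the robust loss through the one-step update
    outer = lambda r: robust_outer(one_step(theta, r))
    g_rho = (outer(rho + 1e-4) - outer(rho - 1e-4)) / 2e-4
    rho = float(np.clip(rho - 0.05 * g_rho, 0.0, 0.49))

clean_acc = np.mean((probs(theta, x)[:, 1] > 0.5) == (clean == 1))
```

The structure mirrors the three steps above: a one-step inner update with the current T, a mapping T ↦ θ̂(T), and an outer gradient step on the robust loss through that mapping; in the real implementation autodiff replaces the finite differences.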
With these approximation techniques, our framework can be efficiently trained on large-scale datasets, e.g., Clothing1M with 1 million samples. The overall computational cost is about 1.6 times the cost of regular training on Clothing1M.

A.9 CONVERGENCE OF ROBOT

The convergence of bilevel optimization using an approximate solution of the inner loop was first established in Pedregosa (2016). We restate it here for completeness. Theorem 3 (Convergence, Theorem 3.3 of Pedregosa (2016)). Suppose L_rob(f_{θ(T)}; D) is smooth w.r.t. T, and L(T f_θ; D) is β-smooth and α-strongly convex w.r.t. θ. If we solve the inner loop by unrolling J steps and choose the learning rate in the outer loop as η_τ = 1/√τ, then after R steps we arrive at an approximately stationary point:

E[ Σ_{τ=1}^{R} η_τ ∥∇_T L_rob(f_{θ(T)}; D)∥²_2 / Σ_{τ=1}^{R} η_τ ] ≤ Õ(ϵ + ϵ² + 1/√R),

where Õ absorbs constants and logarithmic terms and ϵ = (1 − α/β)^J.

A.10 IMPLEMENTATION OF ROBUST LOSSES

We provide the details of the robust loss functions used in ROBOT.

Reverse Cross Entropy Loss (RCE): ℓ_rce(f(x, θ), y) = −Σ_{k=1}^{K} f_k(x, θ) log q(k|x).

Mean Absolute Error (MAE): ℓ_mae(f(x, θ), y) = Σ_{k=1}^{K} |f_k(x, θ) − q(k|x)|.

We denote the ground-truth distribution over labels by q(k|x), with Σ_{k=1}^{K} q(k|x) = 1. Given that the ground-truth label is y, we have q(y|x) = 1 and q(k|x) = 0 for all k ≠ y. For the RCE loss, we approximate log(0) by a finite constant (−10).
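A minimal sketch of these two losses, using −10 as the finite stand-in for log 0 as described above (function names are ours, not from the paper's code):

```python
import numpy as np

LOG_ZERO = -10.0   # finite stand-in for log(0)

def rce_loss(probs, label):
    """Reverse cross entropy: -sum_k f_k(x) log q(k|x),
    with log 0 replaced by a finite constant."""
    K = len(probs)
    log_q = np.full(K, LOG_ZERO)
    log_q[label] = 0.0                 # log q(y|x) = log 1 = 0
    return -np.dot(probs, log_q)

def mae_loss(probs, label):
    """Mean absolute error between prediction and one-hot target."""
    onehot = np.zeros(len(probs)); onehot[label] = 1.0
    return np.abs(probs - onehot).sum()
```

With the clipping, ℓ_rce(f, y) = −LOG_ZERO · (1 − f_y), so summing over all K labels gives the constant −LOG_ZERO · (K − 1) for every prediction f — the symmetry property that makes these losses suitable as the noise-robust outer objective.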

A.11 CONVERGENCE OF T

In Theorem 2, we provide the uniform convergence of ROBOT in terms of the outer loss. We are also able to directly bound the deviation of T̂ from T* as follows: Corollary 2. Define ϵ, N(ϵ, F, ∥·∥_∞), and T̂ as in Theorem 2. Assume L_rob(θ̂(T), D) is γ-strongly convex w.r.t. T. Then, with probability at least 1 − δ, we have ∥T̂ − T*∥²_2 ≤ (2/γ)(2ϵ + M sqrt(2 ln(2N(ϵ, F, ∥·∥_∞)/δ)/n)).

Proof. By Theorem 2 and the fact that T* = arg min_T L_rob(θ̂(T), D), with probability at least 1 − δ the following holds:

L_rob(θ̂(T̂), D) ≤ L_rob(θ̂(T*), D) + 2ϵ + M sqrt(2 ln(2N(ϵ, F, ∥·∥_∞)/δ)/n).    (23)

By strong convexity, we have

L_rob(θ̂(T̂), D) ≥ L_rob(θ̂(T*), D) + ∇_T L_rob(θ̂(T*), D)(T̂ − T*) + (γ/2)∥T̂ − T*∥²_2.    (24)

By the optimality of T*, we have

∇_T L_rob(θ̂(T*), D) = 0.    (25)

Combining equations 23–25, we know that with probability at least 1 − δ, the following holds:

(γ/2)∥T̂ − T*∥²_2 ≤ 2ϵ + M sqrt(2 ln(2N(ϵ, F, ∥·∥_∞)/δ)/n).

We finish the proof by rearrangement.

B EXPERIMENTAL DETAILS

B.1 PLOT DETAILS

We choose 3 classes from MNIST (namely, classes 2, 4 and 6) and apply uniform noise with noise rate 0.4. For ease of illustration, we randomly sample 100 points for each class. We first fit a 3-layer MLP h(x) on the noisy dataset. Then we use the Minimum Volume method (Li et al., 2021) to find the minimum T whose conv(T) encloses all samples. We then train ROBOT in the same setting. To make a fair comparison with the Minimum Volume method, we use h(x) as the soft noisy label for ROBOT, which ensures that the noisy posterior for ROBOT cannot be more accurate than that of the Minimum Volume method.

B.2 DATASET DETAILS

CIFAR10-N and CIFAR100-N (Wei et al., 2021) are recently proposed datasets created with Amazon Mechanical Turk (M-Turk) by posting the CIFAR-10 and CIFAR-100 datasets as annotation Human Intelligence Tasks (HITs). The human annotations are then used as labels for the training data. Clothing1M is a dataset proposed in Xiao et al. (2015). The dataset contains 1 million images with noisy labels obtained from the web. We follow the same setting as in Li et al. (2021) and only use the noisy dataset to jointly train both the neural network and the noise transition matrix T.

B.3 EXPERIMENTAL DETAILS

Specifically, we train a 5-layer LeNet for MNIST; SGD with batch size 128, weight decay 10^-3, momentum 0.9 and learning rate 10^-2 is used to optimize the neural network parameters, while Adam with learning rate 10^-2 and batch size 128 is used to optimize T in the outer loop. For CIFAR10, we experiment on ResNet18, trained using SGD with batch size 128, weight decay 5 × 10^-4, momentum 0.9 and learning rate 5 × 10^-2. For CIFAR100, we experiment on ResNet34, trained using SGD with batch size 128, weight decay 1 × 10^-3, momentum 0.9 and learning rate 5 × 10^-2. For both CIFAR10 and CIFAR100, Adam with learning rate 5 × 10^-3 and batch size 256 is used in the outer loop to train T. For CIFAR10-N and CIFAR100-N, ResNet34 is trained using SGD with learning rate 0.1, momentum 0.9 and weight decay 5 × 10^-4, following the official hyper-parameters, while T is optimized using Adam with learning rate 5 × 10^-3. For Clothing1M, we follow Li et al. (2021) to finetune an ImageNet pre-trained ResNet50 using SGD with learning rate 2 × 10^-3, momentum 0.9 and weight decay 1 × 10^-3; the batch size is set to 32.

Anchor point assumptions are unfavorable (Xia et al., 2019). Recently, Li et al. (2021) and Zhang et al. (2021) have attempted to estimate T without relying on the anchor point assumption. Even though promising results are achieved, we point out that these methods are prone to inaccurate posterior estimation (Section 2) and suffer from unreliable performance, especially when training data is scarce (Section 4). In comparison, our method overcomes the above-mentioned issues and demonstrates superior performance. Robust Loss Functions.
Several noise-robust loss functions have been proposed to train the network (Ghosh et al., 2017; Liu & Guo, 2020; Xu et al., 2019; Wang et al., 2019; Ma et al., 2020; Zhou et al., 2021; Kim et al., 2021), such that as the number of training samples approaches infinity, the optimal weights derived from noisy training data coincide with those derived from clean data. Although they yield a robust classifier in theory, these losses typically make DNNs difficult to train and require more hyper-parameter tuning (Wang et al., 2019). Our method utilizes the robust loss functions to optimize the noise transition matrix T instead of the model parameters; the learnt T is then used to correct the loss during training to learn a statistically consistent classifier.



Let e_i ∈ R^K denote the unit vector where e_i[i] = 1. D(n) and D̃(n) denote the clean and noisy datasets with n samples, respectively; let D and D̃ denote D(∞) and D̃(∞) for short. The risk on dataset D

Figure 1: (a) Illustration of how the posterior estimation error can lead to T-estimation error in a 3-class classification task. The posteriors of the blue points are accurately estimated and they are all in conv(T*). There is an error ϵ on the noisy posterior of x′, which is denoted in red. Because P(Ỹ|X = x′) + ϵ lies outside of conv(T*), MGEO needs to find a larger T to enclose it. (b) Visualization of an MGEO (Minimum Volume) method in a 3-class MNIST classification task. conv(T*) is denoted by the red triangle. There are many outliers whose estimated posteriors are out of conv(T*) due to overfitting. MGEO methods result in an inaccurate T because they try to enclose all the samples, including the outliers. (c) ROBOT obtains a more accurate T than MGEO.

If we had the clean dataset D, this would be straightforward: we could check whether θ(T) minimizes the clean loss L(f_{θ(T)}, D). Specifically, combining θ* = θ(T*) and θ* = arg min_θ L(f_θ, D), we can uniquely identify T* by T* = arg min_T L(f_{θ(T)}, D). Notably, L(f_{θ(T)}, D) depends on the sample mean of the losses, which is consistent according to the law of large numbers. Now a new challenge arises: we do not have the clean dataset D in practice. Can we find a sample-mean estimator based only on the noisy dataset D̃?

Methods, Network Architectures and Training. We compare our ROBOT with the following baselines: Decoupling (Malach & Shalev-Shwartz, 2017), Co-teaching (Han et al., 2018), T-Revision (Xia et al., 2019), MentorNet (Jiang et al., 2018), Forward (Patrini et al., 2017), MAE

Figure 2: Comparison of RCE loss values on the training dataset achieved by one-level training (directly optimizing the robust loss) and our ROBOT. We can see that directly training the network with RCE leads to difficulty in optimization. On the other hand, ROBOT decreases the training loss quickly, as ROBOT transforms the optimization to a much smaller space (from the space of neural network parameters to the space of T).

Bilevel optimization (Sinha et al., 2017) has achieved great success in recent years, as it is able to model hierarchical decision-making processes. Bilevel optimization is adopted in broad areas of research, such as hyper-parameter optimization (Lorraine et al., 2020; Maclaurin et al., 2015; Pedregosa, 2016; MacKay et al., 2019; Franceschi et al., 2017; Vicol et al., 2021), neural architecture search (Pham et al., 2018; Liu et al., 2018; Shi et al., 2020; Yao et al., 2021a; Gao et al., 2022; 2021; Shi et al., 2021), meta learning (Finn et al., 2017; Nichol & Schulman, 2018), dataset condensation (Wang et al., 2018; Zhao et al., 2020; Cazenavette et al., 2022; Pi et al., 2022) and sample re-weighting (Ren et al., 2018; Shu et al., 2019; Zhou et al., 2022a;c).

ors or the existence of anchor points. Further, when the posterior estimation is imperfect, the error of ROBOT is bounded by O(1/n), while that of MGEO remains at a constant level. • Extensive experiments over various popular benchmarks show that ROBOT improves over MGEO-based methods by a large margin in terms of both test accuracy and T-estimation error. For instance, ROBOT increases accuracy by ∼10% and decreases the T-estimation error by ∼40% over MGEO-based methods on CIFAR100 with uniform noise.

See Appendix A.5 for the proof. Theorem 1 shows that ROBOT can uniquely learn T* based on infinite noisy samples. Notably, we neither require the existence of anchor points nor assume that the noisy posteriors are perfectly estimated, which distinguishes ROBOT from MGEO-based works (Li et al., 2021; Zhang et al., 2021). We further present the finite-sample generalization results as follows:

Test accuracy of experiments on MNIST, CIFAR10 and CIFAR100 with different noise types and noise ratios. Our method significantly outperforms the counterparts by a large margin across all experiment settings. Notably, the superiority of our method becomes more evident under challenging scenarios, such as CIFAR100 dataset with flip noise.

We compare the noise transition matrix estimation errors between various methods across multiple datasets. Note that Two-stage methods correspond to the alternative approach mentioned in Remark 1. We can see that ROBOT consistently achieves the lowest T estimation error.

Test accuracy of experiments on CIFAR10-N and CIFAR100-N with different noise types. For fair comparison, we only compare against the approaches that are designed based on transition matrices. Our method consistently outperforms the counterparts across all experiment settings.

Test accuracy of experiments on Clothing1M. We only adopt noisy data during training.

One can easily check that Conv([t_1, t_2, ..., t_j, ..., t_K, e_j]) = Conv([t_1, t_2, ..., e_j, ..., t_K]) = Conv(T_{e_j}), and that Conv([t_1, t_2, ..., t_j, ..., t_K, e_j]) is the smallest convex hull, in terms of the measure M, that contains both Conv([t_1, t_2, ..., t_j, ..., t_K]) and {e_j}. Further, Conv([t_1, t_2, ..., t_j, ..., t_K, e_j]) is the smallest convex hull containing G(n). With Assumption 3(d), we know that Conv(T_{e_j}) is the smallest convex hull that contains G(n, ϵ).


ACKNOWLEDGEMENTS BH was supported by NSFC Young Scientists Fund No. 62006202, Guangdong Basic and Applied Basic Research Foundation No. 2022A1515011652, RGC Early Career Scheme No. 22200720, CAAI-Huawei MindSpore Open Fund and HKBU CSD Departmental Incentive Grant. XBX was supported by Australian Research Council Project DE-190101473 and Google PhD Fellowship. TLL was partially supported by Australian Research Council Projects IC-190100031, LP-220100527, DP-220102121, and FT-220100318.


We conduct the following two experiments on CIFAR100 to verify that the bilevel formulation in ROBOT leads to easier optimization of the robust loss function: 1) directly using the reverse cross entropy loss to train the network (referred to as the one-level method in the following discussion); 2) executing the bilevel procedure in our ROBOT (for a fair comparison, we use the training dataset for both the inner loop and the outer loop in equation 7). In Figure 2, we can see that directly optimizing the robust loss (one-level) can hardly decrease the training loss, which is consistent with the findings in Zhang & Sabuncu (2018). On the other hand, the robust training loss optimized as the outer loss of ROBOT decreases rapidly.

Implementation Details. In ROBOT, we set K = 1 for the algorithm described in Appendix A.8, which iterates between the inner and outer loops by performing one gradient descent step each time. In the inner loop, SGD with momentum 0.9 and learning rate 0.1 is used; in the outer loop, Adam with learning rate 0.001 is used. For directly training the model using the reverse cross entropy loss, we tried different learning rates ({0.1, 0.01, 0.001, 0.0001}) with the Adam optimizer. The learning rate is decayed at epochs 40 and 70 in all runs. For each experiment, we record the value of the reverse cross entropy loss on the training dataset, which is the training loss in the one-level case and the outer loss in ROBOT, respectively. The results are shown in Figure 2. Each iteration of ROBOT needs two gradient evaluations, one for the inner loop and one for the outer loop. Each iteration of directly optimizing the robust loss (one-level) needs a single gradient evaluation.

B.5 EXPERIMENTS W/WO SEPARATE VALIDATION SET

In the implementation of equation 7 described in Appendix A.8, we split the noisy dataset into a training and a validation dataset. The training dataset is used in the inner loop and the validation set in the outer loop. Alternatively, we can also use the same noisy dataset for both the inner and outer loops (without splitting it into a training and validation set). We conduct experiments on the CIFAR10 dataset with 20% and 50% uniform noise to compare these two schemes. Table 6 reports the test performance and estimation error rate. The outer objective adopts the RCE loss. Other configurations follow the same settings as the main experiments. Table 6 shows that the two schemes lead to similar performance.

C RELATED WORK

In this section, we categorize previous noisy-label learning approaches into two types, heuristic methods and statistically consistent methods, and provide a brief introduction to each.

Heuristic Methods. Due to the empirical observation that neural networks tend to learn easy (correct) samples first and only start to fit the hard (corrupt) samples in the later phase of training, many algorithms are designed based on the training samples' loss values (Han et al., 2018; Wei et al., 2020; Huang et al., 2019; Pleiss et al., 2020; Yao et al., 2021b). Specifically, samples with small loss values are presumed to be correctly labelled, while those with large loss values are considered corrupted. Even though these methods demonstrate strong empirical results, they typically lack theoretical guarantees, which makes their reliability questionable.

Loss Correction Methods. Algorithms in this category attempt to train a statistically consistent classifier under label noise, with theoretical guarantees, by utilizing the noise transition matrix (T) to correct the loss during training. The majority of previous methods rely on the anchor point assumption, which means there is at least one sample belonging to each specific class with probability one (Patrini et al., 2017; Xia et al., 2020; Liu & Tao, 2015; Scott, 2015; Scott et al., 2013; Yao et al., 2020; Zhu et al., 2021; Wu et al., 2021; Xia et al., 2022; Li et al., 2022b). However, the

