A HOLISTIC VIEW OF LABEL NOISE TRANSITION MATRIX IN DEEP LEARNING AND BEYOND

Abstract

In this paper, we explore learning statistically consistent classifiers under label noise by estimating the noise transition matrix (T). We first provide a holistic view of existing T-estimation methods, including those with and without the anchor point assumption. We unify them into the Minimum Geometric Envelope Operator (MGEO) framework, which seeks the smallest T (in terms of a certain metric) whose induced convex hull encloses the noisy posteriors of all the training data. Although MGEO methods show appealing theoretical properties and empirical results, we find them prone to failure when the noisy posterior estimation is imperfect, which is inevitable in practice. Specifically, we show that MGEO methods are inconsistent even with infinite samples if the noisy posterior is not estimated accurately. In view of this, we make the first effort to address this issue by proposing a novel T-estimation framework through the lens of bilevel optimization, which we term RObust Bilevel OpTimization (ROBOT). ROBOT paves a new road beyond the MGEO framework and enjoys strong theoretical properties: identifiability, consistency, and finite-sample generalization guarantees. Notably, ROBOT neither requires perfect posterior estimation nor assumes the existence of anchor points. We further theoretically demonstrate that ROBOT remains robust in cases where MGEO methods fail. Experimentally, our framework also shows superior performance across multiple benchmarks. Our code is released at https://github.com/pipilurj/ROBOT.

1. INTRODUCTION

Deep learning has achieved remarkable success in recent years, owing to the availability of abundant computational resources and large-scale datasets for training models with millions of parameters. Unfortunately, the quality of datasets cannot be guaranteed in practice. There are often large amounts of mislabelled data in real-world datasets, especially those obtained from the internet through crowdsourcing (Li et al., 2021; Xia et al., 2019; 2020; Xu et al., 2019; Wang et al., 2019; 2021; Liu et al., 2020; Collier et al., 2021; Bahri et al., 2020; Li et al., 2022a; Yong et al., 2022; Lin et al., 2022; Zhou et al., 2022b). This gives rise to the interest in learning under label noise. Our task is to learn a function f_θ with parameter θ to predict the clean label Y ∈ Y = {1, ..., K} based on the input X ∈ X = R^d. However, we only observe a noisy label Ỹ, which is generated from Y by an (oracle) noise transition matrix T*(x) whose elements are T*_ij(x) = P(Ỹ = j | Y = i, X = x). We consider the class-dependent label noise setting, which assumes that T* is independent of x, i.e., T*(x) = T*. Given T*, we can obtain the clean posterior from the noisy posterior by P(Y | X = x) = (T*)^{-1} P(Ỹ | X = x). With the oracle T*, we can learn on the noisy dataset a statistically consistent model that coincides with the optimal model for the clean dataset. Specifically, denote by θ* the minimizer of the clean loss E_{X,Y}[ℓ(f_θ(X), Y)], where ℓ is the cross entropy loss. Then minimizing E_{X,Ỹ}[ℓ(T* f_θ(X), Ỹ)] also leads to θ*. Therefore, the effectiveness of such methods heavily depends on the quality of the estimated T. To estimate T*, earlier methods assume that there exist anchor points, which belong to a certain class with probability one. The noisy posterior probabilities of the anchor points are then used to construct T* (Patrini et al., 2017; Liu & Tao, 2015).
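The correction machinery above can be sketched numerically. The code below assumes the elementwise convention T[i, j] = P(Ỹ = j | Y = i) (so with posteriors as column vectors the noisy posterior is T^T times the clean one; the text's "T P(Y|X)" corresponds to the transposed convention). All names and the toy numbers are illustrative, not from the paper:

```python
import numpy as np

# Convention: T[i, j] = P(noisy label = j | clean label = i); rows sum to 1.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def noisy_posterior(clean_post, T):
    """P(noisy | x) from P(clean | x): mix the clean classes through T."""
    return T.T @ clean_post

def clean_posterior(noisy_post, T):
    """Invert the relation to recover P(clean | x) from P(noisy | x)."""
    return np.linalg.solve(T.T, noisy_post)

def forward_corrected_ce(clean_probs, noisy_labels, T, eps=1e-12):
    """Cross entropy of T-corrected predictions against the observed noisy
    labels, i.e. a sample version of E[l(T f(X), noisy Y)]."""
    noisy_probs = clean_probs @ T  # row-vector form of T^T p, one row per sample
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.mean(np.log(picked + eps))

p_clean = np.array([0.7, 0.3])
p_noisy = noisy_posterior(p_clean, T)      # [0.69, 0.31]
recovered = clean_posterior(p_noisy, T)    # recovers p_clean up to numerics
```

With T equal to the identity (no noise), `forward_corrected_ce` reduces to the standard cross entropy, which is a quick sanity check on the convention.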
Specifically, they fit a model on the noisy dataset and select the most confident samples as anchor points. However, as argued in Li et al. (2021), violation of the anchor point assumption can lead to inaccurate estimation of T. To overcome this limitation, Li et al. (2021); Zhang et al. (2021) develop anchor-free methods that estimate T without the anchor point assumption (see Appendix C for more discussion of related work). Since P(Ỹ | X = x) = T* P(Y | X = x) and Σ_{i=1}^K P(Y = i | X = x) = 1, we know that for any x, P(Ỹ | X = x) is enclosed in the convex hull conv{T*} formed by the columns of T*. Therefore, both anchor-based and anchor-free methods try to find a T whose conv{T} encloses the noisy posteriors of all data points, under the assumption that all the noisy posteriors are perfectly estimated. To identify T* among all the Ts satisfying this condition, they choose the smallest one in terms of certain positive metrics; e.g., anchor-based methods adopt Equation 2 and the minimum-volume (anchor-free) method adopts Equation 3. We therefore unify them into the framework of the Minimum Geometric Envelope Operator (MGEO), whose formal definition is given in Section 2. Though MGEO-based methods achieve remarkable success, we show that they are sensitive to noisy posterior estimation errors. Notably, neural networks can easily produce inaccurate posterior estimates due to their over-confidence (Guo et al., 2017). If some posterior estimation errors skew the smallest convex hull that encloses all the data points, MGEO can return an unreliable T estimate (as illustrated in Fig. 1(a)). We theoretically show that even if the noisy posterior is accurate except at a single data point, MGEO-based methods can incur a constant-level error in T-estimation. We further provide supporting experimental results for these theoretical findings in Section 2. In view of this, we aim to go beyond MGEO by proposing a novel framework for stable end-to-end T-estimation. Let θ(T) be the minimizer over θ of E_{X,Ỹ}[ℓ(T f_θ(X), Ỹ)] when T is fixed; the notation θ(T) makes explicit that the returned θ depends on T.

If a clean dataset were available, we could find T* by checking whether T induces a θ(T) that is optimal for E_{X,Y}[ℓ(f_θ(X), Y)] (this is ensured by the consistency of the forward correction method under suitable conditions (Patrini et al., 2017), discussed in Appendix A.3). The challenge is that we do not have a clean dataset in practice. Fortunately, there are well-established robust losses (denoted ℓ_rob), e.g., Mean Absolute Error (MAE) (Ghosh et al., 2017) and Reversed Cross Entropy (RCE) (Wang et al., 2019), whose minimizer on E_{X,Ỹ}[ℓ_rob(f_θ(X), Ỹ)] coincides with that on E_{X,Y}[ℓ_rob(f_θ(X), Y)]. Therefore, we search for T* by checking whether T minimizes E_{X,Ỹ}[ℓ_rob(f_{θ(T)}(X), Ỹ)], which depends only on the noisy data. This procedure can be naturally formulated as a bilevel problem: in the inner loop, T is fixed and we obtain θ(T) by training θ to minimize E_{X,Ỹ}[ℓ(T f_θ(X), Ỹ)]; in the outer loop, we train T to minimize E_{X,Ỹ}[ℓ_rob(f_{θ(T)}(X), Ỹ)]. We name our framework RObust Bilevel OpTimization (ROBOT). Notably, unlike MGEO, ROBOT is based on sample-mean estimators, which are intrinsically consistent by the law of large numbers. In Section 3.2, we establish the theoretical properties of ROBOT: identifiability, finite-sample generalization, and consistency. Further, ROBOT achieves O(1/n) robustness to noisy posterior estimation errors in cases where MGEO methods incur a constant-level T-estimation error. In Section 4, we conduct extensive experiments on both synthetic and real-world datasets. Our method significantly outperforms MGEO-based methods in terms of both prediction accuracy and T-estimation accuracy.

Contribution.

• We provide the first framework, MGEO, that unifies existing T-estimation methods, including both anchor-based and anchor-free methods. Through both theoretical analysis and empirical evidence, we formally identify the instability of MGEO-based methods when the noisy posteriors are not estimated perfectly, which is inevitable in practice due to the over-confidence of large neural networks.
• To break through the limitation of MGEO-based methods, we propose a novel framework, ROBOT, that estimates T relying only on sample-mean estimators, which are consistent by the law of large numbers. ROBOT enjoys strong theoretical guarantees including identifiability, consistency, and finite-sample generalization without assuming perfect noisy posterior estimation.
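The bilevel control flow described above (inner loop: fit θ with T fixed under forward correction; outer loop: score T by a robust loss on noisy data only) can be sketched on a toy problem. In this sketch a grid search over a one-parameter symmetric T stands in for the paper's gradient-based outer optimization, purely to make the structure concrete; the Gaussian toy data, the flip rate, and all variable names are our own illustrative choices, and the sketch carries none of the paper's formal guarantees:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noisy dataset: two 1-D Gaussian classes with symmetric label flips.
n = 1000
y_clean = rng.integers(0, 2, size=n)                 # unobserved clean labels
x = rng.normal(loc=2.0 * y_clean - 1.0, scale=0.8)   # class means at -1 and +1
rho_true = 0.3
y_noisy = np.where(rng.random(n) < rho_true, 1 - y_clean, y_clean)

def inner_train(rho, steps=200, lr=0.3):
    """Inner loop: with T = [[1-rho, rho], [rho, 1-rho]] fixed, fit a logistic
    model by gradient descent on the forward-corrected cross entropy."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))       # model's clean posterior
        q = (1.0 - rho) * p + rho * (1.0 - p)        # implied noisy posterior
        q = np.clip(q, 1e-6, 1.0 - 1e-6)
        # chain rule: dCE/dq * dq/dp * dp/dz, with z = w*x + b
        g = (q - y_noisy) / (q * (1.0 - q)) * (1.0 - 2.0 * rho) * p * (1.0 - p)
        w -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return w, b

def outer_objective(rho):
    """Outer loop score: robust (MAE) loss of the induced model, which depends
    on the noisy data only."""
    w, b = inner_train(rho)
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return np.mean(np.abs(y_noisy - p))

# Grid search over candidate flip rates stands in for the gradient-based outer loop.
grid = [0.0, 0.1, 0.2, 0.3, 0.4]
scores = {rho: outer_objective(rho) for rho in grid}
rho_hat = min(scores, key=scores.get)
```

In the full method, T is a free K×K (simplex-constrained) matrix updated by hypergradients through θ(T) rather than by grid search; the sketch only shows where each loss enters the two loops.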

