A HOLISTIC VIEW OF LABEL NOISE TRANSITION MATRIX IN DEEP LEARNING AND BEYOND

Abstract

In this paper, we explore learning statistically consistent classifiers under label noise by estimating the noise transition matrix (T). We first provide a holistic view of existing T-estimation methods, including those with and without anchor-point assumptions. We unify them into the Minimum Geometric Envelope Operator (MGEO) framework, which seeks the smallest T (under a certain metric) whose induced convex hull encloses the posteriors of all the training data. Although MGEO methods show appealing theoretical properties and empirical results, we find them prone to failure when the noisy posterior estimate is imperfect, which is inevitable in practice. Specifically, we show that MGEO methods are inconsistent even with infinite samples if the noisy posterior is not estimated accurately. In view of this, we make the first effort to address this issue by proposing a novel T-estimation framework through the lens of bilevel optimization, which we term RObust Bilevel OpTimization (ROBOT). ROBOT paves a new road beyond the MGEO framework and enjoys strong theoretical properties: identifiability, consistency, and finite-sample generalization guarantees. Notably, ROBOT neither requires perfect posterior estimation nor assumes the existence of anchor points. We further theoretically demonstrate that ROBOT remains robust in cases where MGEO methods fail. Experimentally, our framework also shows superior performance across multiple benchmarks. Our code is released at https://github.com/pipilurj/ROBOT.

1. INTRODUCTION

Deep learning has achieved remarkable success in recent years, owing to the availability of abundant computational resources and large-scale datasets for training models with millions of parameters. Unfortunately, the quality of datasets cannot be guaranteed in practice. Real-world datasets often contain large amounts of mislabeled data, especially those obtained from the internet through crowdsourcing (Li et al., 2021; Xia et al., 2019; 2020; Xu et al., 2019; Wang et al., 2019; 2021; Liu et al., 2020; Collier et al., 2021; Bahri et al., 2020; Li et al., 2022a; Yong et al., 2022; Lin et al., 2022; Zhou et al., 2022b). This gives rise to the interest in learning under label noise. Our task is to learn a function f_θ with parameters θ that predicts the clean label Y ∈ Y = {1, ..., K} from the input X ∈ X = R^d. However, we only observe a noisy label Ỹ, which is generated from Y by an (oracle) noise transition matrix T*(x) whose elements are T*_ij(x) = P(Ỹ = j | Y = i, X = x). We consider class-dependent label noise, which assumes that T* is independent of x, i.e., T*(x) = T*. Given T*, we can obtain the clean posterior from the noisy posterior via P(Y | X = x) = (T*)^{-1} P(Ỹ | X = x). With the oracle T*, we can learn on the noisy dataset a statistically consistent model that coincides with the optimal model for the clean dataset. Specifically, let θ* denote the minimizer of the clean risk E_{X,Y}[ℓ(f_θ(X), Y)], where ℓ is the cross-entropy loss. Then minimizing E_{X,Ỹ}[ℓ(T* f_θ(X), Ỹ)] also yields θ*. Therefore, the effectiveness of such methods heavily depends on the quality of the estimated T.
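The two corrections above — recovering the clean posterior via (T*)^{-1} and training with the forward-corrected loss ℓ(T* f_θ(X), Ỹ) — can be sketched numerically. This is a minimal NumPy illustration, not the paper's implementation: the 3-class matrix T, the posterior values, and the helper forward_corrected_loss are all hypothetical, and we apply Tᵀ because the element convention T_ij = P(Ỹ = j | Y = i) makes the noisy posterior the transpose of T times the clean one.

```python
import numpy as np

# Hypothetical 3-class transition matrix: row i holds P(noisy = j | clean = i).
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# A clean posterior P(Y | X = x) for one example x.
p_clean = np.array([0.7, 0.2, 0.1])

# Noisy posterior induced by the transition matrix:
# P(noisy = j | x) = sum_i T_ij * P(clean = i | x), i.e. T transposed times p_clean.
p_noisy = T.T @ p_clean

# Backward correction: invert the (transposed) transition matrix to recover
# the clean posterior, mirroring P(Y | X) = (T*)^{-1} P(noisy Y | X).
p_recovered = np.linalg.inv(T.T) @ p_noisy

# Forward correction: cross-entropy of the noise-corrupted prediction
# against the observed noisy label, mirroring l(T* f_theta(X), noisy Y).
def forward_corrected_loss(f_x, noisy_label, T):
    corrupted = T.T @ f_x          # push the model's clean posterior through the noise
    return -np.log(corrupted[noisy_label])
```

With an exactly known T*, the backward correction recovers the clean posterior up to floating-point error, which is why the estimation quality of T dominates the final performance.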

