MITIGATING MEMORIZATION OF NOISY LABELS VIA REGULARIZATION BETWEEN REPRESENTATIONS

Abstract

Designing robust loss functions is popular in learning with noisy labels, but existing designs do not explicitly account for the overfitting behavior of deep neural networks (DNNs). As a result, models trained with these losses may still overfit/memorize noisy labels as training proceeds. In this paper, we first theoretically analyze the memorization effect and show that a lower-capacity model may perform better on noisy datasets. However, designing a neural network with the best capacity for an arbitrary task is non-trivial. To circumvent this dilemma, instead of changing the model architecture, we decouple a DNN into an encoder followed by a linear classifier and propose to restrict the function space of the DNN with a representation regularizer. In particular, we require the distance between two self-supervised features to be positively related to the distance between the corresponding two supervised model outputs. Our proposed framework is easily extendable and can incorporate many other robust loss functions to further improve performance. Extensive experiments and theoretical analyses support our claims.

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved remarkable performance in many areas, including speech recognition (Graves et al., 2013), computer vision (Krizhevsky et al., 2012; Lotter et al., 2016), and natural language processing (Zhang & LeCun, 2015). This high performance often builds on the availability of quality-annotated datasets. In real-world scenarios, data annotation inevitably introduces label noise (Wei et al., 2022d;e), which degrades network performance, primarily because of DNNs' capability to "memorize" noisy labels (Zhang et al., 2016). In the past few years, a number of methods have been proposed to tackle the problem of learning with noisy labels. Notable achievements include robust loss design (Ghosh et al., 2017; Zhang & Sabuncu, 2018; Liu & Guo, 2020; Wang et al., 2021), sample selection (Han et al., 2018; Yu et al., 2019; Cheng et al., 2021; Xia et al., 2021b), transition matrix estimation (Patrini et al., 2017; Zhu et al., 2021b; Xia et al., 2019; 2020b), and loss correction/reweighting based on the noise transition matrix (Natarajan et al., 2013; Liu & Tao, 2015; Patrini et al., 2017; Jiang et al., 2021; Zhu et al., 2021b; Wei et al., 2022a; Zhu et al., 2022c). However, these methods still suffer from limitations because they are agnostic to model complexity and do not explicitly take the overfitting behavior of DNNs into consideration (Wei et al., 2021; Liu et al., 2022). In the context of representation learning, DNNs are prone to fitting/memorizing noisy labels as training proceeds (Wei et al., 2022d; Zhang et al., 2016), i.e., the memorization effect. Thus, when the noise rate is high, even though robust losses have theoretical guarantees in expectation, they remain unstable during training (Cheng et al., 2021). It has been shown that early stopping helps mitigate the memorization of noisy labels (Rolnick et al., 2017; Xia et al., 2020a).
Intuitively, however, early stopping mitigates overfitting to wrong labels at the cost of underfitting clean samples if not tuned properly. An alternative approach is to use a regularizer to penalize/avoid overfitting (Liu & Guo, 2020; Cheng et al., 2021; Liu et al., 2020); these methods mainly build regularizers by editing labels. In this paper, we study the effectiveness of a representation regularizer. To fully understand the memorization effect in learning with noisy labels, we decompose the generalization error into estimation error and approximation error. By analyzing these two errors, we find that DNNs behave differently under various types of label noise and that the key to preventing overfitting is to control model complexity. However, specifically designing a model structure for learning with noisy labels is hard. One tractable solution is to use representation regularizers to cut off redundant function space without hurting the optima. We therefore propose a unified framework that utilizes representations to mitigate the memorization effect. Our main contributions are listed below:

• We theoretically analyze the memorization effect by decomposing the generalization error into estimation error and approximation error in the context of learning with noisy labels, and show that a lower-capacity model may perform better on noisy datasets.

• Because designing a neural network with the best capacity for an arbitrary task requires formidable effort, instead of changing the model architecture we decouple a DNN into an encoder followed by a linear classifier and propose to restrict the function space of the DNN via the structural information between representations. In particular, we require the distance between two self-supervised features to be positively related to the distance between the corresponding two supervised model outputs.

• The effectiveness of the proposed regularizer is demonstrated by both theoretical analyses and numerical experiments. Our framework can incorporate many current robust losses and help them further improve performance.
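As an illustration of the idea above, the following sketch penalizes disagreement between the pairwise-distance structure of self-supervised features and that of supervised model outputs. The function names and the choice of a Pearson-correlation penalty are illustrative assumptions, not the exact objective used in this paper:

```python
import numpy as np

def pairwise_dists(z):
    """Euclidean distance matrix for a batch of vectors of shape (n, d)."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def representation_regularizer(feats_ssl, outputs):
    """Illustrative regularizer: encourage the pairwise distances between
    self-supervised features and the pairwise distances between supervised
    model outputs to be positively related, here via a negative Pearson
    correlation over the upper-triangular distance entries."""
    d_f = pairwise_dists(feats_ssl)
    d_o = pairwise_dists(outputs)
    iu = np.triu_indices(len(feats_ssl), k=1)
    a, b = d_f[iu], d_o[iu]
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return -(a * b).mean()  # minimized when the distances are positively related
```

Minimizing this term alongside a (robust) classification loss restricts the classifier's function space without modifying the architecture, which is the role the representation regularizer plays in our framework.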

1.1. RELATED WORKS

Learning with Noisy Labels Many works design robust losses to improve the robustness of neural networks when learning with noisy labels (Ghosh et al., 2017; Zhang & Sabuncu, 2018; Liu & Guo, 2020; Xu et al., 2019; Feng et al., 2021; Yong et al.; Xia et al., 2022; Wei et al., 2022c;b). Ghosh et al. (2017) prove that MAE is inherently robust to label noise; however, MAE suffers from severe under-fitting. Zhang & Sabuncu (2018) propose the GCE loss, which combines the advantages of MAE and CE and exhibits good performance on noisy datasets. Liu & Guo (2020) introduce peer loss, which is statistically robust to label noise without knowledge of noise rates; its extensions also perform well on instance-dependent label noise (Cheng et al., 2021; Zhu et al., 2021a). Another effective approach to combating label noise is sample selection (Jiang et al., 2018; Han et al., 2018; Yu et al., 2019; Northcutt et al., 2021; Yao et al., 2020; Wei et al., 2020; Zhang et al., 2020; Xia et al., 2021a). These methods regard "small-loss" examples as clean ones and train multiple networks to select clean samples. Semi-supervised learning has also become popular and effective for learning with noisy labels in recent years. Some works (Li et al., 2020; Nguyen et al., 2020) perform clustering on the sample losses to divide the samples into clean and noisy subsets, then drop the labels of the "noisy" samples and perform semi-supervised learning on all samples. However, semi-supervised pseudo-labels can cause disparate impact on different groups of data (Zhu et al., 2022b). Recently, some works have applied self-supervised learning to handle noisy labels (Ghosh & Lan, 2021; Li et al., 2022a; Wei et al., 2023). Our work can also explain some findings from Ghosh & Lan (2021).

Knowledge Distillation Our proposed learning framework is related to knowledge distillation (KD). Hinton et al. (2015) show that a small, shallow network can be improved through a teacher-student framework. Owing to its broad applicability, KD has attracted growing attention in recent years, and numerous methods have been proposed to perform efficient distillation (Mirzadeh et al., 2020; Zhang et al., 2018b; 2019). However, the datasets used in KD are assumed to be clean, so it is non-trivial to connect KD with learning with noisy labels. In this paper, we show both theoretically and experimentally that a regularizer commonly used in KD (Park et al., 2019) can alleviate the over-fitting problem on noisy data by using DNN features, which offers a new alternative for dealing with label noise.
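For concreteness, a distance-based relational penalty in the spirit of Park et al. (2019) can be sketched as follows: match the mean-normalized pairwise distance structures of two embedding sets under a Huber penalty. The function names and the exact normalization here are illustrative assumptions, not the paper's reference implementation:

```python
import numpy as np

def rkd_distance_loss(emb_a, emb_b, delta=1.0):
    """Sketch of a relational (distance-based) KD penalty: compare the
    mean-normalized pairwise Euclidean distances of two embedding sets
    with a Huber (smooth-L1) penalty, making the loss invariant to a
    global rescaling of either embedding space."""
    def norm_dists(z):
        diff = z[:, None, :] - z[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1))
        iu = np.triu_indices(len(z), k=1)
        d = d[iu]
        return d / (d.mean() + 1e-8)

    da, db = norm_dists(emb_a), norm_dists(emb_b)
    err = np.abs(da - db)
    huber = np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))
    return huber.mean()
```

Because only relative distances are compared, the penalty constrains the geometric structure of the learned representation rather than its absolute values, which is what makes it a candidate regularizer against memorizing noisy labels.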

2. PRELIMINARY

We introduce preliminaries and notations including definitions and problem formulation. 



Problem formulation Consider a K-class classification problem on a set of N training examples denoted by D := {(x_n, y_n)}_{n∈[N]}, where [N] := {1, 2, ..., N} is the set of example indices.


