MITIGATING MEMORIZATION OF NOISY LABELS VIA REGULARIZATION BETWEEN REPRESENTATIONS

Abstract

Designing robust loss functions is a popular approach to learning with noisy labels, but existing designs do not explicitly consider the overfitting property of deep neural networks (DNNs). As a result, applying these losses may still lead to overfitting/memorizing noisy labels as training proceeds. In this paper, we first theoretically analyze the memorization effect and show that a lower-capacity model may perform better on noisy datasets. However, it is non-trivial to design a neural network with the best capacity for an arbitrary task. To circumvent this dilemma, instead of changing the model architecture, we decouple a DNN into an encoder followed by a linear classifier and propose to restrict the function space of the DNN with a representation regularizer. Specifically, we require the distance between two self-supervised features to be positively related to the distance between the corresponding two supervised model outputs. Our proposed framework is easily extendable and can incorporate many other robust loss functions to further improve performance. Extensive experiments and theoretical analyses support our claims.
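To make the regularizer described above concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): given frozen self-supervised features and the corresponding supervised model outputs for a batch, it penalizes negative Pearson correlation between the two sets of pairwise distances, so that minimizing it pushes the distances to be positively related. The function name and the choice of Euclidean distance and Pearson correlation are illustrative assumptions.

```python
import numpy as np

def representation_regularizer(ssl_features, outputs):
    """Hypothetical sketch of a representation regularizer: encourage
    pairwise distances among self-supervised features to be positively
    related to pairwise distances among supervised model outputs."""
    def pdist(X):
        # Pairwise Euclidean distances within the batch, shape (B, B).
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(-1))

    d_feat = pdist(ssl_features)
    d_out = pdist(outputs)

    # Keep only the upper triangle (each unordered pair once).
    i, j = np.triu_indices(len(ssl_features), k=1)
    x, y = d_feat[i, j], d_out[i, j]

    # Pearson correlation between the two distance vectors.
    x = x - x.mean()
    y = y - y.mean()
    corr = (x * y).sum() / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8)

    # In [0, 2]; minimized (near 0) when distances are positively related.
    return 1.0 - corr
```

Such a term would be added to the supervised loss with a weight; because the self-supervised features carry no label information, the penalty constrains the classifier's geometry without reinforcing noisy labels.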

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved remarkable performance in many areas, including speech recognition (Graves et al., 2013), computer vision (Krizhevsky et al., 2012; Lotter et al., 2016), natural language processing (Zhang & LeCun, 2015), etc. This high performance often builds on the availability of quality-annotated datasets. In real-world scenarios, data annotation inevitably introduces label noise (Wei et al., 2022d; e), which degrades the performance of the network, primarily due to DNNs' capability to "memorize" noisy labels (Zhang et al., 2016). In the past few years, a number of methods have been proposed to tackle the problem of learning with noisy labels. Notable achievements include robust loss design (Ghosh et al., 2017; Zhang & Sabuncu, 2018; Liu & Guo, 2020; Wang et al., 2021), sample selection (Han et al., 2018; Yu et al., 2019; Cheng et al., 2021; Xia et al., 2021b), transition matrix estimation (Patrini et al., 2017; Zhu et al., 2021b; Xia et al., 2019; 2020b), and loss correction/reweighting based on the noise transition matrix (Natarajan et al., 2013; Liu & Tao, 2015; Patrini et al., 2017; Jiang et al., 2021; Zhu et al., 2021b; Wei et al., 2022a; Zhu et al., 2022c). However, these methods still suffer from limitations because they are agnostic to model complexity and do not explicitly take the overfitting property of DNNs into consideration (Wei et al., 2021; Liu et al., 2022). In the context of representation learning, DNNs are prone to fit/memorize noisy labels as training proceeds (Wei et al., 2022d; Zhang et al., 2016), i.e., the memorization effect. Thus, when the noise rate is high, even though robust losses have theoretical guarantees in expectation, they remain unstable during training (Cheng et al., 2021). It has been shown that early stopping helps mitigate the memorization of noisy labels (Rolnick et al., 2017; Xia et al., 2020a).
Intuitively, however, early stopping mitigates overfitting to wrong labels at the cost of underfitting clean samples if not tuned properly. An alternative approach is to use a regularizer to penalize/avoid overfitting (Liu & Guo, 2020; Cheng et al.,


