ROBUST EARLY-LEARNING: HINDERING THE MEMORIZATION OF NOISY LABELS

Abstract

The memorization effects of deep networks show that they will first memorize training data with clean labels and then those with noisy labels. The early stopping method can therefore be exploited for learning with noisy labels. However, the side effect brought by noisy labels will influence the memorization of clean labels before early stopping. In this paper, motivated by the lottery ticket hypothesis, which shows that only a subset of the parameters is important for generalization, we find that only some parameters are important for fitting clean labels and generalize well, which we term critical parameters, while the other parameters tend to fit noisy labels and cannot generalize well, which we term non-critical parameters. Based on this, we propose robust early-learning to reduce the side effect of noisy labels before early stopping and thus enhance the memorization of clean labels. Specifically, in each iteration, we divide all parameters into critical and non-critical ones, and then apply different update rules to the different types of parameters. Extensive experiments on benchmark-simulated and real-world label-noise datasets demonstrate the superiority of the proposed method over state-of-the-art label-noise learning methods.

1. INTRODUCTION

Deep neural networks have achieved remarkable success in various tasks, such as image classification (He et al., 2015), object detection (Ren et al., 2015), speech recognition (Graves et al., 2013), and machine translation (Wu et al., 2016). However, this success is largely attributed to large amounts of data with high-quality annotations, which are expensive or even infeasible to obtain in practice (Han et al., 2018a; Li et al., 2020a; Wu et al., 2020). On the other hand, many large-scale datasets are collected from image search engines or web crawlers, which inevitably introduces noisy labels (Xiao et al., 2015; Li et al., 2017a; Zhu et al., 2021). As deep networks have large learning capacities and strong memorization power, they will ultimately overfit noisy labels, leading to poor generalization performance (Jiang et al., 2018; Nguyen et al., 2020). General regularization techniques such as dropout and weight decay cannot address this issue well (Zhang et al., 2017).

Fortunately, even though deep networks will eventually fit all the labels, they first fit data with clean labels, which helps generalization (Arpit et al., 2017; Han et al., 2018b; Yu et al., 2019; Liu et al., 2020). Thus, early stopping can be used to reduce overfitting to noisy labels (Rolnick et al., 2017; Li et al., 2020b; Hu et al., 2020). However, the existence of noisy labels will still adversely affect the memorization of clean labels even in the early training stage, which hurts generalization (Han et al., 2020). Intuitively, if we can reduce the side effect of noisy labels before early stopping, the generalization and robustness of the networks can be improved. Note that over-parameterization of deep networks is one of the main reasons for overfitting to noisy labels (Zhang et al., 2017; Yao et al., 2020a). The lottery ticket hypothesis (Frankle & Carbin, 2018) shows that only a subset of the parameters is important for generalization.
Deep networks with these important parameters can generalize well, or even better, by avoiding overfitting. Motivated by this, for learning with noisy labels, it remains a question whether we can divide the parameters into two parts to reduce the side effect brought by noisy labels, which would enhance the memorization of clean labels and further improve the generalization performance of deep networks.

In this paper, we present a novel and effective method to find which parameters are important for fitting data with clean labels, and which parameters tend to fit data with noisy labels. We term the former critical parameters and the latter non-critical parameters. On this basis, we propose robust early-learning to reduce the side effect of noisy labels before early stopping. Specifically, in each iteration during training, we first categorize all parameters into two parts, i.e., the critical parameters and the non-critical parameters. We then design different update rules for the different types of parameters. For the critical ones, we perform a robust positive update: these parameters are updated using the gradients derived from the objective function together with weight decay. For the non-critical ones, we perform a negative update: their values are penalized with weight decay alone, without the gradients derived from the objective function. Note that the gradients for updating are based on the loss between the predictions of the deep networks and the given labels. The critical parameters tend to fit data with clean (correct) labels, which helps generalization, so their gradients can be exploited to update them. The non-critical parameters, however, tend to fit data with noisy (incorrect) labels, which hurts generalization; their gradients would misguide the deep networks to overfit data with noisy labels. Thus, we use only a regularization term, i.e., the weight decay, to update them.
The weight decay penalizes their values toward zero, which means they are pushed to be deactivated and not to contribute to the generalization of the deep networks. In this way, we can reduce the side effect of noisy labels and enhance the memorization of clean labels.

In summary, the main contributions of this work are as follows:

• We propose a novel and effective method which categorizes the parameters into two parts according to whether they are important for fitting data with clean labels.

• Different update rules are designed for the different types of parameters to reduce the side effect of noisy labels before early stopping.

• We experimentally validate the proposed method on both synthetic and real-world noisy datasets, on which it achieves superior robustness compared with state-of-the-art methods for learning with noisy labels.

Related Work. Early stopping is quite simple but effective in practice. It has long been used in supervised learning (Prechelt, 1998; Caruana et al., 2001; Zhang et al., 2005; Yao et al., 2007). With the help of a validation set, training is stopped before convergence to avoid overfitting. When learning with noisy labels, networks fit the data with clean labels before starting to overfit the data with noisy labels (Arpit et al., 2017). Early stopping has been formally proved to be valid for relieving overfitting to noisy labels (Rolnick et al., 2017; Li et al., 2020b). It has also been widely used in existing methods to improve robustness and generalization (Yu et al., 2018b; Xu et al., 2019; Yao et al., 2020b; Cheng et al., 2021). The lottery ticket hypothesis (Frankle & Carbin, 2018) shows that deep networks are likely to be over-parameterized, and only a subset of the parameters is important for generalization. With this subset of parameters, small, sparsified networks can be trained to generalize well.
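To make the two update rules concrete, the following is a minimal NumPy sketch of a single parameter update. The criterion for selecting critical parameters is not specified in this section, so the |gradient × parameter| importance score below, along with the function name and hyperparameter values, is a hypothetical placeholder rather than the paper's actual selection rule.

```python
import numpy as np

def robust_early_learning_step(params, grads, lr=0.1, weight_decay=1e-3,
                               critical_frac=0.5):
    """One sketch update step with two rules for two parameter groups.

    Hypothetical criterion: parameters whose |grad * param| score lies in the
    top `critical_frac` fraction are treated as critical.
    """
    score = np.abs(grads * params)
    k = max(1, int(critical_frac * params.size))
    threshold = np.partition(score.ravel(), -k)[-k]
    critical = score >= threshold  # boolean mask of critical parameters

    new_params = params.copy()
    # Critical parameters: gradient step plus weight decay (positive update).
    new_params[critical] -= lr * (grads[critical]
                                  + weight_decay * params[critical])
    # Non-critical parameters: weight decay only (negative update),
    # shrinking them toward zero without any objective-function gradient.
    new_params[~critical] -= lr * weight_decay * params[~critical]
    return new_params, critical
```

Over many iterations, the decay-only rule drives non-critical parameters toward zero, which matches the intuition above that they should be deactivated rather than driven by gradients computed against possibly incorrect labels.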
While motivated by the lottery ticket hypothesis, this work is fundamentally different from it. The lottery ticket hypothesis focuses on network compression: it aims to find a sparsified sub-network whose generalization performance is competitive with that of the original network. This paper focuses on learning with noisy labels: we want to find the critical/non-critical parameters to reduce the side effect of noisy labels, which greatly improves generalization performance. A large body of work has proposed various methods for training with noisy labels, such as exploiting a noise transition matrix (Liu & Tao, 2016; Hendrycks et al., 2018; Xia et al., 2020a; Li et al., 2021), using graph models (Xiao et al., 2015; Li et al., 2017b), using surrogate loss functions (Zhang & Sabuncu, 2018; Wang et al., 2019; Ma et al., 2020), meta-learning (Ren et al., 2018; Shu et al., 2020), and employing the small-loss trick (Jiang et al., 2018; Han et al., 2018b; Yu et al., 2019). Some of these methods employ early stopping explicitly or implicitly (Patrini et al., 2017; Xia et al., 2019). We



† Correspondence to Tongliang Liu (tongliang.liu@sydney.edu.au).

