ROBUST EARLY-LEARNING: HINDERING THE MEMORIZATION OF NOISY LABELS

Abstract

The memorization effect of deep networks shows that they first memorize training data with clean labels and only then data with noisy labels. Early stopping can therefore be exploited for learning with noisy labels. However, the side effect of noisy labels influences the memorization of clean labels even before early stopping. In this paper, motivated by the lottery ticket hypothesis, which shows that only some parameters matter for generalization, we find that only a subset of parameters is important for fitting clean labels and generalizing well, which we term critical parameters, while the remaining parameters tend to fit noisy labels and do not generalize well, which we term non-critical parameters. Based on this, we propose robust early-learning to reduce the side effect of noisy labels before early stopping and thus enhance the memorization of clean labels. Specifically, in each iteration, we divide all parameters into critical and non-critical ones, and then apply different update rules to the two types. Extensive experiments on benchmark-simulated and real-world label-noise datasets demonstrate the superiority of the proposed method over state-of-the-art label-noise learning methods.

1. INTRODUCTION

Deep neural networks have achieved remarkable success in various tasks, such as image classification (He et al., 2015), object detection (Ren et al., 2015), speech recognition (Graves et al., 2013), and machine translation (Wu et al., 2016). However, this success largely relies on large amounts of data with high-quality annotations, which are expensive or even infeasible to obtain in practice (Han et al., 2018a; Li et al., 2020a; Wu et al., 2020). On the other hand, many large-scale datasets are collected from image search engines or web crawlers, which inevitably introduces noisy labels (Xiao et al., 2015; Li et al., 2017a; Zhu et al., 2021). As deep networks have large learning capacity and strong memorization power, they will ultimately overfit noisy labels, leading to poor generalization performance (Jiang et al., 2018; Nguyen et al., 2020). General regularization techniques such as dropout and weight decay cannot address this issue well (Zhang et al., 2017). Fortunately, even though deep networks will fit all the labels eventually, they first fit data with clean labels, which helps generalization (Arpit et al., 2017; Han et al., 2018b; Yu et al., 2019; Liu et al., 2020). Thus, early stopping can be used to reduce overfitting to noisy labels (Rolnick et al., 2017; Li et al., 2020b; Hu et al., 2020). However, noisy labels still adversely affect the memorization of clean labels even in the early training stage, which hurts generalization (Han et al., 2020). Intuitively, if we can reduce the side effect of noisy labels before early stopping, the generalization and robustness of the networks can be improved.
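To make the idea of type-dependent updates concrete, the following is a minimal numpy sketch of one such split update step. It is illustrative only, not the paper's exact algorithm: the importance score |w · g| and the fixed keep ratio are assumptions here, and `robust_update` is a hypothetical name. Parameters judged critical take an ordinary gradient step, while the rest are only decayed toward zero, limiting their capacity to fit noisy labels.

```python
import numpy as np

def robust_update(w, g, lr=0.1, keep_ratio=0.5, decay=0.1):
    """Illustrative split update (assumed criterion, not the paper's exact rule).

    Parameters with the largest |w * g| are treated as 'critical' and take a
    normal SGD step; the remaining 'non-critical' parameters are only decayed
    toward zero instead of following the (possibly noise-driven) gradient.
    """
    importance = np.abs(w * g)
    k = max(1, int(keep_ratio * w.size))
    # Indices of the k most important (critical) parameters.
    critical = np.argsort(importance)[-k:]
    mask = np.zeros(w.shape, dtype=bool)
    mask[critical] = True

    w_new = w.copy()
    w_new[mask] -= lr * g[mask]           # gradient step for critical params
    w_new[~mask] *= (1.0 - lr * decay)    # weight decay only for the rest
    return w_new, mask
```

In practice such a mask would be recomputed in every iteration, so which parameters count as critical can change as training progresses.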



† Correspondence to Tongliang Liu (tongliang.liu@sydney.edu.au).

