ME-MOMENTUM: EXTRACTING HARD CONFIDENT EXAMPLES FROM NOISILY LABELED DATA

Abstract

Examples that are close to the decision boundary, which we term hard examples, are essential to shaping accurate classifiers. Extracting confident examples has been widely studied in the community of learning with noisy labels. However, it remains elusive how to extract hard confident examples from the noisy training data. In this paper, we propose a deep learning paradigm to solve this problem, built on the memorization effect of deep neural networks: they first learn simple patterns, i.e., patterns shared by multiple training examples. To extract hard confident examples, which contain non-simple patterns and are entangled with the inaccurately labeled examples, we borrow the idea of momentum from physics. Specifically, we alternately update the confident examples and refine the classifier. Note that the confident examples extracted in the previous round can be exploited to learn a better classifier, and that the better classifier will help identify better (and harder) confident examples. We call this approach the "Momentum of Memorization" (Me-Momentum). Empirical results on benchmark-simulated and real-world label-noise data illustrate the effectiveness of Me-Momentum for extracting hard confident examples, leading to better classification performance.

1. INTRODUCTION

As training datasets grow large while accurately labeling them is often expensive or sometimes even infeasible, cheap datasets with label noise are ubiquitous in many real-world applications. Without any care, label noise will degrade the performance of learning algorithms, especially those based on deep neural networks (Zhang et al., 2017). Learning with noisy labels (Angluin & Laird, 1988) aims to reduce the side effects of label noise and has therefore become an important topic in machine learning. Existing methods for learning with noisy labels can be divided into two categories: algorithms that result in statistically consistent classifiers and those that result in inconsistent classifiers. Methods in the first category aim to design classifier-consistent algorithms (Zhang & Sabuncu, 2018; Kremer et al., 2018; Liu & Tao, 2016; Scott, 2015; Natarajan et al., 2013; Goldberger & Ben-Reuven, 2017; Patrini et al., 2017; Thekumparampil et al., 2018; Yu et al., 2018; Liu & Guo, 2020; Xu et al., 2019), where classifiers learned from the noisy data statistically converge to the optimal classifiers defined on clean data. However, these methods rely heavily on the noise transition matrix (Liu & Tao, 2016; Patrini et al., 2017; Xia et al., 2019). In real-world applications, even the instance-independent noise transition matrix is hard to learn (Cheng et al., 2020). To be free of estimating the noise transition matrix, methods in the second category employ heuristics to reduce the side effects of label noise (Yu et al., 2019; Han et al., 2018b; Malach & Shalev-Shwartz, 2017; Ren et al., 2018; Jiang et al., 2018; Ma et al., 2018; Kremer et al., 2018; Tanaka et al., 2018; Reed et al., 2015; Han et al., 2018a; Guo et al., 2018; Veit et al., 2017; Vahdat, 2017; Li et al., 2017; 2020). These methods have been reported to work well empirically, especially in the setting of instance-dependent label noise.
One promising direction in the second category is to extract examples with clean labels, i.e., confident examples (Northcutt et al., 2017; Cheng et al., 2020; Chen et al., 2019; Malach & Shalev-Shwartz, 2017; Han et al., 2018b; Yu et al., 2019; Thulasidasan et al., 2019; Nguyen et al., 2020; Dredze et al., 2008; Liu & Tao, 2016; Wang et al., 2017; Ren et al., 2018; Jiang et al., 2018). The idea is that, compared with the original noisy training data, the extracted examples are less noisy and thus will lead to a classifier with better performance. Hard examples, however, are known to be important for learning accurate classifiers (Vapnik, 2013; Bengio et al., 2009; Huang et al., 2010; He et al., 2018). Notwithstanding the importance of hard confident examples, none of the existing methods studies how to extract them from noisy data. Note that extracting hard confident examples is non-trivial. Since hard examples often constitute only a small proportion of the data and contain less discriminative information than easy ones (those far away from the decision boundary), they are often entangled with inaccurately labeled examples during extraction. In this paper, by alternately updating the confident examples and refining the classifier, we propose a deep learning paradigm that is able to extract hard confident examples from the noisy training data, leading to better classification performance. The idea is similar to the use of momentum in physics. As stated in statistical learning theory, better training data yield a better classifier (Mohri et al., 2018). We can then think of the classifier as a particle traveling through the hypothesis space, getting acceleration from the confident data. Classifiers with better performance can be achieved by properly exploiting the previously extracted confident examples.
This is similar to the momentum trick used in optimization, where previous gradient information can be used to escape local minima and achieve fast convergence rates (Sutskever et al., 2013)¹. At a high level, the proposed method is built on the memorization effect of deep neural networks and on the intuition that better confident examples will result in a better classifier, and that a better classifier will identify better confident examples (and hard confident examples). The proposed method is therefore called the Momentum of Memorization (Me-Momentum). We conduct experiments to show the effectiveness of the proposed Me-Momentum on noisy versions of MNIST, CIFAR10, and CIFAR100, and on a real-world label-noise dataset, Clothing1M. Specifically, on MNIST and CIFAR, we generate class- and instance-dependent label noise and visualize the extracted hard confident examples, which justifies why Me-Momentum consistently outperforms the baseline methods. Code will be available online.
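The alternating loop described above, extract confident examples with the current classifier, then refine the classifier on the extracted examples, can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's actual algorithm: a nearest-prototype classifier stands in for the deep network, distance to the prototype of the assigned label stands in for the loss, and the function names (`me_momentum_sketch`, `extract_confident`) and the `keep_ratio` parameter are hypothetical choices for this sketch.

```python
import numpy as np

def fit_prototypes(X, y, n_classes):
    # "Train" the stand-in classifier: one mean prototype per class.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def extract_confident(X, y, prototypes, keep_ratio):
    # Distance of each example to the prototype of its given (possibly noisy)
    # label plays the role of the per-example loss.
    d = np.linalg.norm(X - prototypes[y], axis=1)
    k = max(1, int(keep_ratio * len(X)))
    return np.argsort(d)[:k]  # keep the "small-loss" examples

def me_momentum_sketch(X, y_noisy, n_classes=2, rounds=3, keep_ratio=0.7):
    # Round 0: fit on all noisy data, then alternate extraction and refinement.
    proto = fit_prototypes(X, y_noisy, n_classes)
    idx = np.arange(len(X))
    for _ in range(rounds):
        idx = extract_confident(X, y_noisy, proto, keep_ratio)   # update confident set
        proto = fit_prototypes(X[idx], y_noisy[idx], n_classes)  # refine classifier
    return idx, proto
```

On well-separated synthetic data with a fraction of flipped labels, the extracted subset is markedly cleaner than the full noisy set, which is the "momentum" intuition: each refined classifier produces a better confident set for the next round. (The sketch assumes every class survives each extraction round; a real implementation would guard against empty classes.)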

2. ME-MOMENTUM

In this section, by specifying the proposed method of momentum of memorization (Me-Momentum; summarized in Algorithm 1), we detail how to extract hard confident examples.



¹ In optimization, the parameter vector can be thought of as a particle traveling through the parameter space, getting acceleration from the gradient of the loss. The momentum trick demonstrated that gradients from previous updates can help escape local minima and achieve fast convergence rates.



Figure 1: Illustration of the influence of hard (confident) examples in classification. Circles represent positive examples and triangles represent negative examples. Green and blue denote examples with accurate labels, while red denotes examples with incorrect labels. Blank circles and triangles represent unextracted data. (a) shows an example of classification with clean data. (b) shows that noisy examples, especially those close to the decision boundary, significantly degrade the accuracy of the classifier. (c) shows that confident examples help learn a fairly good classifier. (d) shows that hard confident examples are essential to training an accurate classifier.

Given only noisy data, state-of-the-art methods exploit the memorization effect (Zhang et al., 2017; Arpit et al., 2017) to extract confident examples. The memorization effect enables deep neural networks to first learn patterns that are shared by the majority of training examples. As clean labels form the majority within each noisy class (Natarajan et al., 2013; Xiao et al., 2015), deep neural networks therefore first fit the training data with clean labels and only gradually fit the examples with incorrect labels (Chen et al., 2019). Accordingly, early stopping (Li et al., 2020; Song et al., 2019) and the small-loss trick (Jiang et al., 2018; Han et al., 2018b; Yu et al., 2019) can be used to extract confident examples.
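The early-stopping-plus-small-loss recipe in the paragraph above can be illustrated concretely. This is a hedged sketch rather than any cited method's implementation: an early-stopped logistic regression stands in for the deep network, and the function name `small_loss_selection` and the choice of keeping the `1 - noise_rate` fraction of smallest-loss examples are assumptions made for this example.

```python
import numpy as np

def small_loss_selection(X, y, noise_rate, epochs=20, lr=0.1):
    """Early-stopped binary logistic regression + small-loss trick (sketch)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    # Train for only a few epochs: stopping early limits memorization of
    # incorrectly labeled examples, so their losses remain large.
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y                      # gradient of cross-entropy w.r.t. logits
        w -= lr * X.T @ g / n
        b -= lr * g.mean()
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    eps = 1e-12
    loss = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))  # per-example loss
    k = int((1.0 - noise_rate) * n)    # keep the smallest-loss fraction
    return np.argsort(loss)[:k]
```

Because the early-stopped model fits the majority (clean) pattern first, the mislabeled examples sit on the wrong side of the decision boundary and incur large losses, so the smallest-loss subset is substantially cleaner than the full training set. Note, however, that the examples this rule discards first are precisely the hard ones near the boundary, which motivates the alternating refinement proposed in this paper.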

