ROBUST TEMPORAL ENSEMBLING

Abstract

Successful training of deep neural networks with noisy labels is an essential capability, as most real-world datasets contain some amount of mislabeled data. Left unmitigated, label noise can sharply degrade typical supervised learning approaches. In this paper, we present robust temporal ensembling (RTE), a simple supervised learning approach which combines robust task loss, temporal pseudo-labeling, and an ensemble consistency regularization term to achieve noise-robust learning. We demonstrate that RTE achieves state-of-the-art performance across the CIFAR-10, CIFAR-100, and ImageNet datasets, while forgoing the recent trend of label filtering/fixing. In particular, RTE achieves 93.64% accuracy on CIFAR-10 and 66.43% accuracy on CIFAR-100 under 80% label corruption, and achieves 74.79% accuracy on ImageNet under 40% corruption. These are substantial gains over previous state-of-the-art accuracies of 86.6%, 60.2%, and 71.31%, respectively, achieved using three distinct methods. Finally, we show that RTE retains competitive corruption robustness to unforeseen input noise using CIFAR-10-C, obtaining a mean corruption error (mCE) of 13.50% even in the presence of an 80% noise ratio, versus 26.9% mCE with standard methods on clean data.

1. INTRODUCTION

Deep neural networks have enjoyed considerable success across a variety of domains, and in particular computer vision, where the common theme is that more labeled training data yields improved model performance (Hestness et al., 2017; Mahajan et al., 2018; Xie et al., 2019b; Kolesnikov et al., 2019). However, performance depends on the quality of the training data, which is expensive to collect and inevitably imperfect. For example, ImageNet (Deng et al., 2009) is one of the most widely-used datasets in the field of deep learning, and despite over 2 years of labor from more than 49,000 human annotators across 167 countries, it still contains erroneous and ambiguous labels (Fei-Fei & Deng, 2017; Karpathy, 2014). It is therefore essential that learning algorithms in production workflows leverage noise-robust methods. Noise-robust learning has a long history and takes many forms (Natarajan et al., 2013; Frenay & Verleysen, 2014; Song et al., 2020). Common strategies include loss correction and reweighting (Patrini et al., 2016; Zhang & Sabuncu, 2018; Menon et al., 2020), label refurbishment (Reed et al., 2014; Song et al., 2019), abstention (Thulasidasan et al., 2019), and relying on carefully constructed trusted subsets of human-verified labeled data (Li et al., 2017; Hendrycks et al., 2018; Zhang et al., 2020). Additionally, recent methods such as SELF (Nguyen et al., 2020) and DivideMix (Li et al., 2020) convert the problem of learning with noise into a semi-supervised learning problem by splitting the corrupted training set into clean labeled data and noisy unlabeled data, at which point semi-supervised learning methods such as Mean Teacher (Tarvainen & Valpola, 2017) and MixMatch (Berthelot et al., 2019) can be applied directly. In essence, these methods effectively discard a majority of the label information so as to side-step having to learn with noise at all.
The problem here is that noisy-label filtering tactics are imperfect, resulting in corrupted data in the small labeled partition and valuable clean samples lost to the large pool of unlabeled data. Moreover, caution is needed when applying semi-supervised methods where the labeled data is not sampled i.i.d. from the pool of unlabeled data (Oliver et al.). Indeed, filtering tactics can be biased and irregular, driven by specification error and the underlying noise process of the label corruption. Recognizing the success of semi-supervised approaches, we ask: can we leverage the underlying mechanisms of semi-supervised learning, such as entropy regularization, for learning with noise without discarding our most valuable asset, the labels?

2.1. PRELIMINARIES

Adopting the notation of Zhang & Sabuncu (2018), we consider the problem of classification where $\mathcal{X} \subset \mathbb{R}^d$ is the feature space and $\mathcal{Y} = \{1, \ldots, c\}$ is the label space, and where the classifier function is a deep neural network with a softmax output layer that maps input features to distributions over labels, $f : \mathcal{X} \to \mathbb{R}^c$. The dataset of training examples containing in-sample noise is defined as $D = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$, where $(x_i, \tilde{y}_i) \in (\mathcal{X} \times \mathcal{Y})$ and $\tilde{y}_i$ is the noisy version of the true label $y_i$ such that $p(\tilde{y}_i = k \mid y_i = j, x_i) \equiv \eta_{ijk}$. We do not consider open-set noise (Wang et al., 2018), a particular type of noise that occurs on the inputs, $x$, rather than the labels. Following most prior work, we make the simplifying assumption that the noise is conditionally independent of the input, $x_i$, given the true label. In this setting, we can write $\eta_{ijk} = p(\tilde{y}_i = k \mid y_i = j) \equiv \eta_{jk}$, which is, in general, considered to be class-dependent noise.
To aid in a simple and precise corruption procedure, we now depart from traditional notation and further decompose $\eta_{jk}$ as $p_j \cdot c_{jk}$, where $p_j \in [0, 1]$ is the probability of corruption of the $j$-th class and $c_{jk} \in [0, 1]$ is the relative probability that corrupted samples of class $j$ are labeled as class $k$, with $c_{jk} \geq 0$ for $k \neq j$, $c_{jj} = 0$, and $\sum_k c_{jk} = 1$. A noisy dataset with $m$ classes can then be described as transition probabilities specified by

$$F = \mathrm{diag}(P) \cdot C + \mathrm{diag}(\mathbf{1} - P) \cdot I \quad (1)$$

where $C \in \mathbb{R}^{m \times m}$ defines the system confusion or noise structure, $P \in \mathbb{R}^m$ defines the noise intensity or ratio for each class, and $I$ is the identity matrix. When $c_{jk} = c_{kj}$ the noise is said to be symmetric, and is considered asymmetric otherwise. If the noise ratio is the same for all classes, then $p_j = p$ and the dataset is said to exhibit uniform noise. For the case of uniform noise, equation (1) interestingly takes the familiar form of the Google matrix equation

$$F_p = p \cdot C + (1 - p) \cdot I \quad (2)$$

Note that, by this definition, $\eta_{jj} = p \cdot c_{jj} = 0$, which prohibits $\tilde{y}_i = y_i$. This ensures a true effective noise ratio of $p$. For example, suppose there are $m = 10$ classes and we wish to corrupt labels with 80% probability. Then if corrupted labels are sampled from $\mathcal{Y}$ rather than $\mathcal{Y} \setminus \{y\}$, $\frac{1}{10} \cdot 0.8 = 8\%$ of the corrupted samples will not actually be corrupted, leading to a true corruption rate of 72%. Therefore, despite prescribing $p = 0.8$, the true effective noise ratio would be 0.72, which in turn yields a $\frac{0.08}{1 - 0.8} = 40\%$ increase in clean labels, and this is indeed the case in many studies (Zhang & Sabuncu, 2018; Nguyen et al., 2020; Li et al., 2020; Zhang et al., 2020).
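The corruption procedure above can be sketched in a few lines of numpy. This is a minimal illustration for the uniform symmetric case (function names are ours, not from the paper): $C$ spreads corrupted mass evenly over the $m - 1$ other classes with $c_{jj} = 0$, so the prescribed ratio $p$ is the true effective noise ratio.

```python
import numpy as np

def uniform_noise_matrix(m: int, p: float) -> np.ndarray:
    """Build F_p = p*C + (1-p)*I per equation (2), where C spreads
    corrupted mass evenly over the m-1 *other* classes (c_jj = 0)."""
    C = (np.ones((m, m)) - np.eye(m)) / (m - 1)
    return p * C + (1 - p) * np.eye(m)

def corrupt_labels(y: np.ndarray, F: np.ndarray, seed=None) -> np.ndarray:
    """Sample noisy labels: row j of F gives p(y_tilde = k | y = j)."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(F), p=F[j]) for j in y])

F = uniform_noise_matrix(10, 0.8)
assert np.allclose(F.sum(axis=1), 1.0)   # each row is a distribution
assert np.allclose(np.diag(F), 0.2)      # exactly 1-p mass stays clean
```

Sampling the corrupted label from the row of $F$ (rather than uniformly over all of $\mathcal{Y}$) is precisely what avoids the 72%-vs-80% discrepancy described above.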

2.2. METHODS

At a very high level, RTE is the combination of noise-robust task loss, augmentation, and pseudo-labeling for consistency regularization. We unify generalized cross entropy (Zhang & Sabuncu, 2018), the AugMix stochastic augmentation strategy (Hendrycks et al., 2020), an exponential moving average of model weights for generating pseudo-labels (Tarvainen & Valpola, 2017), and an augmentation anchoring-like approach (Berthelot et al., 2020) to form a robust approach for learning with noisy labels.
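The exponential moving average of model weights referenced above (following Tarvainen & Valpola, 2017) can be sketched as follows. This is an illustrative fragment, not RTE's implementation; the weights are represented as a plain dict of arrays:

```python
import numpy as np

def ema_update(teacher: dict, student: dict, alpha: float = 0.999) -> None:
    """theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student.
    The slowly moving teacher copy of the network generates the
    pseudo-labels used for consistency regularization."""
    for name, w in student.items():
        teacher[name] = alpha * teacher[name] + (1 - alpha) * w

student = {"w": np.array([1.0])}
teacher = {"w": np.array([0.0])}
ema_update(teacher, student, alpha=0.9)
# teacher["w"] -> 0.9 * 0.0 + 0.1 * 1.0 = 0.1
```

Because the teacher averages over many recent student states, its pseudo-labels are smoother and less prone to chasing individual noisy gradients than the student's own predictions.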

2.2.1. NOISE-ROBUST TASK LOSS

Generalized cross entropy (GCE) (Zhang & Sabuncu, 2018) is a theoretically grounded noise-robust loss function that can be seen as a generalization of mean absolute error (MAE) and categorical cross entropy (CCE). The main idea is that CCE learns quickly but places more emphasis on difficult samples, which makes it prone to overfitting noisy labels, while MAE treats all samples equally, providing noise-robustness but learning slowly. To exploit the benefits of both MAE and CCE, a negative Box-Cox transformation (Box & Cox, 1964) is used as the loss function

$$\mathcal{L}_q(f(x_i), y_i = j) = \frac{1 - f_j(x_i)^q}{q} \quad (3)$$

where $q \in (0, 1]$: the loss recovers CCE in the limit $q \to 0$ and reduces to MAE at $q = 1$.
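Equation (3) is straightforward to implement. The sketch below (our own illustration, in numpy) operates on softmax probabilities and checks both limiting behaviors:

```python
import numpy as np

def gce_loss(probs: np.ndarray, labels: np.ndarray, q: float = 0.7) -> np.ndarray:
    """Generalized cross entropy, L_q = (1 - f_y(x)^q) / q, per sample.
    probs: (n, c) softmax outputs; labels: (n,) integer class indices."""
    f_y = probs[np.arange(len(labels)), labels]
    return (1.0 - f_y ** q) / q

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
losses = gce_loss(probs, labels, q=1.0)  # MAE limit: 1 - f_y
# -> [0.3, 0.2]
```

As $q \to 0$, $(1 - f_y^q)/q \to -\log f_y$ (first-order expansion of $f_y^q = e^{q \log f_y}$), recovering CCE; intermediate $q$ (e.g. the paper's default of 0.7 in Zhang & Sabuncu, 2018) interpolates between the two regimes.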



See Lee et al. (2019) for treatment of conditionally dependent semantic noise, for which $\eta_{ijk} \neq \eta_{jk}$. Note that Patrini et al. (2016) define the noise transition matrix $T$ such that $T_{jk} \equiv \eta_{jk}$.

