ROBUST TEMPORAL ENSEMBLING

Abstract

Successful training of deep neural networks with noisy labels is an essential capability, as most real-world datasets contain some amount of mislabeled data. Left unmitigated, label noise can sharply degrade typical supervised learning approaches. In this paper, we present robust temporal ensembling (RTE), a simple supervised learning approach that combines robust task loss, temporal pseudo-labeling, and an ensemble consistency regularization term to achieve noise-robust learning. We demonstrate that RTE achieves state-of-the-art performance across the CIFAR-10, CIFAR-100, and ImageNet datasets, while forgoing the recent trend of label filtering/fixing. In particular, RTE achieves 93.64% accuracy on CIFAR-10 and 66.43% accuracy on CIFAR-100 under 80% label corruption, and achieves 74.79% accuracy on ImageNet under 40% corruption. These are substantial gains over previous state-of-the-art accuracies of 86.6%, 60.2%, and 71.31%, respectively, achieved using three distinct methods. Finally, we show that RTE retains competitive corruption robustness to unforeseen input noise using CIFAR-10-C, obtaining a mean corruption error (mCE) of 13.50% even in the presence of an 80% noise ratio, versus 26.9% mCE with standard methods on clean data.

1. INTRODUCTION

Deep neural networks have enjoyed considerable success across a variety of domains, and in particular computer vision, where the common theme is that more labeled training data yields improved model performance (Hestness et al., 2017; Mahajan et al., 2018; Xie et al., 2019b; Kolesnikov et al., 2019). However, performance depends on the quality of the training data, which is expensive to collect and inevitably imperfect. For example, ImageNet (Deng et al., 2009) is one of the most widely-used datasets in the field of deep learning, and despite over 2 years of labor from more than 49,000 human annotators across 167 countries, it still contains erroneous and ambiguous labels (Fei-Fei & Deng, 2017; Karpathy, 2014). It is therefore essential that learning algorithms in production workflows leverage noise-robust methods. Noise-robust learning has a long history and takes many forms (Natarajan et al., 2013; Frenay & Verleysen, 2014; Song et al., 2020). Common strategies include loss correction and reweighting (Patrini et al., 2016; Zhang & Sabuncu, 2018; Menon et al., 2020), label refurbishment (Reed et al., 2014; Song et al., 2019), abstention (Thulasidasan et al., 2019), and relying on carefully constructed trusted subsets of human-verified labeled data (Li et al., 2017; Hendrycks et al., 2018; Zhang et al., 2020). Additionally, recent methods such as SELF (Nguyen et al., 2020) and DivideMix (Li et al., 2020) convert the problem of learning with noise into a semi-supervised learning problem by splitting the corrupted training set into clean labeled data and noisy unlabeled data, at which point semi-supervised learning methods such as Mean Teacher (Tarvainen & Valpola, 2017) and MixMatch (Berthelot et al., 2019) can be applied directly. In essence, these methods effectively discard a majority of the label information so as to side-step having to learn with noise at all.
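The labeled/unlabeled split these methods perform can be sketched generically as a "small-loss" partition: samples the network fits easily are assumed clean and keep their labels, while the rest are treated as unlabeled. This is an illustrative simplification only; the function `split_by_loss` and its `clean_fraction` parameter are our own names, and SELF and DivideMix use more sophisticated criteria (e.g., ensemble agreement, or a Gaussian mixture fit to the loss distribution).

```python
import numpy as np

def split_by_loss(per_sample_loss, clean_fraction=0.5):
    """Generic 'small-loss' split (illustrative sketch, not the exact
    procedure of SELF or DivideMix): low-loss samples are treated as
    clean and keep their labels; high-loss samples have their labels
    discarded and are used as unlabeled data."""
    per_sample_loss = np.asarray(per_sample_loss)
    k = int(len(per_sample_loss) * clean_fraction)
    order = np.argsort(per_sample_loss)   # ascending loss
    clean_idx = order[:k]                 # low-loss  -> keep labels
    noisy_idx = order[k:]                 # high-loss -> unlabeled pool
    return clean_idx, noisy_idx

# Toy example: per-sample cross-entropy losses for 6 training samples.
clean_idx, noisy_idx = split_by_loss([0.1, 2.3, 0.4, 1.9, 0.2, 3.0], 0.5)
```

Note that any such threshold is imperfect: truly clean samples with high loss are banished to the unlabeled pool, and mislabeled samples the network has memorized slip into the "clean" partition, which is precisely the failure mode discussed next.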
The problem here is that noisy-label filtering tactics are imperfect, resulting in corrupted data in the small labeled partition and valuable clean samples lost to the large pool of unlabeled data. Moreover, caution is needed when applying semi-supervised methods where the labeled data is not sampled i.i.d. from the pool of unlabeled data (Oliver et al.). Indeed, filtering tactics can be biased and irregular, driven by specification error and the underlying noise process of the label corruption. Recognizing the success of semi-supervised approaches, we ask: can we leverage the underlying mechanisms of semi-supervised learning, such as entropy regularization, for learning with noise without discarding our most valuable asset, the labels?
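For concreteness, entropy regularization in its standard form augments the supervised loss with a term that penalizes uncertain predictions, encouraging the decision boundary to pass through low-density regions of the input space (the exact consistency term used by RTE is not shown in this excerpt; the formulation below is the common one, with $\lambda$ a weighting hyperparameter):

$$
\mathcal{L} \;=\; \mathcal{L}_{\text{sup}} \;+\; \lambda \, H\!\big(p_\theta(y \mid x)\big),
\qquad
H(p) \;=\; -\sum_{c=1}^{C} p_c \log p_c .
$$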

