ROBUST LEARNING VIA GOLDEN SYMMETRIC LOSS OF (UN)TRUSTED LABELS

Abstract

Learning deep models that are robust to noisy labels becomes ever more critical as today's data is commonly collected from open platforms and subject to adversarial corruption. Information on the label corruption process, i.e., the corruption matrix, can greatly enhance the robustness of deep models but still falls short in combating hard classes. In this paper, we propose to construct a golden symmetric loss (GSL) based on the estimated corruption matrix, so as to avoid overfitting to noisy labels and to learn effectively from hard classes. GSL is the weighted sum of the corrected regular cross entropy and the reverse cross entropy. By leveraging a small fraction of trusted clean data, we estimate the corruption matrix and use it both to correct the loss and to determine the weights of GSL. We theoretically prove the robustness of the proposed loss function in the presence of dirty labels. We provide a heuristic to adaptively tune the loss weights of GSL according to the noise rate and noise diversity measured from the dataset. We evaluate the proposed golden symmetric loss on both vision and natural language deep models subject to different types of label noise patterns. Empirical results show that GSL significantly outperforms existing robust training methods across noise patterns, with accuracy improvements of up to 18% on CIFAR-100 and 1% on the real-world noisy dataset Clothing1M.

1. INTRODUCTION

Diverse datasets collected from the public domain power up deep learning models but present a new challenge: highly noisy labels. It is not only time consuming to collect labels but also difficult to ensure consistent label quality, due to various annotation errors (Patrini et al., 2017) and adversarial attacks (Goodfellow et al., 2015). The large capacity of deep learning models enables effective learning from complex datasets but also makes them prone to overfitting the noise structure in the dataset. The curse of the memorization effect (Jiang et al., 2018) can degrade the accuracy of deep learning models in the presence of highly noisy labels. For example, in (Zhang et al., 2017) the accuracy of AlexNet on CIFAR-10 drops from 77% to 10% when labels are randomly flipped. Designing learning models that can train robustly on noisy labels is thus imperative. To distill the impact of noisy labels, related work either filters out suspicious noisy data, derives robust loss functions, or proactively corrects labels. The Symmetric Cross entropy Loss (SCL) is shown to be effective in combating label noise, especially for hard classes, by combining the regular with the reverse cross entropy: the former achieves good convergence and the latter is resilient to label noise. Despite its promising results, there is not yet a clear principle on how to weight the regular and reverse cross entropy terms, e.g., under different noise rates and patterns. In contrast, Distilling (Li et al., 2017) and Golden Loss Correction (GLC) (Hendrycks et al., 2018) advocate using a small clean dataset to improve the estimated corruption matrix. Specifically, GLC trains the deep model on both a clean and a noisy set, where the loss on the noisy set is corrected through the corruption matrix. While the clean set is evenly chosen from all classes, corrupted labels may appear unevenly across classes depending on the noise pattern (Xiao et al., 2015).
As the corrected loss of GLC does not differentiate the difficulty of classes, it may not learn hard classes effectively. We propose GSL, which constructs a golden symmetric loss that dynamically weights the regular and reverse cross entropy and corrects the label prediction based on the estimated corruption matrix. Similar to GLC, GSL leverages clean data to estimate the corruption matrix, which is used both to correct labels and to decide the weights of the golden symmetric loss. As such, GSL can effectively differentiate the difficulty level of classes by adjusting the weights, and mitigate the impact of noise overfitting via the golden symmetric cross entropy. Specifically, we use the noise rate and noise diversity to adaptively tune the weights of the corrected cross entropy and the reverse cross entropy. We prove that the cross entropy corrected via the corruption matrix is noise tolerant, as is the reverse cross entropy.
Motivation example. We demonstrate the advantages and disadvantages of GLC and SCL, and their combination (the proposed GSL), through the example of learning convolutional networks on CIFAR-10 injected with 60% symmetric noise. The experimental setup is detailed in §5. Figure 1 shows the corruption matrix of the injected noise and the confusion matrices from the predictions of SCL, GLC, and GSL. Even though the injected noise is symmetric across all classes (see Figure 1a), prediction errors are distributed asymmetrically across the classes (see Figure 1b). Though GLC achieves a lower average error rate than SCL (reflected in darker diagonal elements on average), it performs worse on hard classes, e.g., class 4 (cat) and class 6 (dog) (difference in blue shades across the diagonal elements). By setting proper weights for the two types of cross entropy, GSL achieves both superior average and per-class accuracy.

2. RELATED WORK

Enhancing the robustness of deep models against noisy labels is an active research area. The massive datasets needed to train deep models are commonly found corrupted (Wang et al., 2018), severely degrading the achievable accuracy (Zhang et al., 2017). The impact of label noise on deep neural networks is first characterized by the theoretical testing accuracy over a limited set of noise patterns (Chen et al., 2019). (Vahdat, 2017) suggests an undirected graphical model for label noise in deep neural networks and indicates symmetric noise to be more challenging than asymmetric noise. Solutions of the prior art can be categorized into three directions: (i) filtering out noisy labels (Malach & Shalev-Shwartz, 2017; Han et al., 2018b; Yu et al., 2019; Wang et al., 2018); (ii) correcting noisy labels (Patrini et al., 2017; Hendrycks et al., 2018; Li et al., 2017); and (iii) deriving noise-resilient loss functions (Ma et al., 2018; Konstantinov & Lampert, 2019).
Noise Resilient Loss Function. The loss function is modified to enhance robustness to label noise either by introducing new loss functions (Ghosh et al., 2017; Wang et al., 2019) or by adjusting the weights of noisy data instances (Ren et al., 2018b; Konstantinov & Lampert, 2019; Ma et al., 2018). Mean Absolute Error (MAE) (Ghosh et al., 2017; Zhang & Sabuncu, 2018) and the Generalized Cross Entropy loss (Zhang & Sabuncu, 2018) are proposed as noise-resilient alternatives, but at the cost of slow convergence. To avoid overfitting to noise, D2L (Ma et al., 2018) uses the subspace dimensionality to assign a weight to each data point, whereas (Konstantinov & Lampert, 2019) determine the loss weights based on the trustworthiness level of data sources. (Wang et al., 2019) propose the symmetric cross entropy loss, which combines a new reverse cross entropy term with the traditional cross entropy via constant weights on both terms.
Meta-Weight-Net (Shu et al., 2019) re-weights samples during loss optimization by using a multi-layer perceptron to predict the weight of each sample. From the same perspective, (Ren et al., 2018a) use the similarity of samples to clean instances in the validation set to re-weight them in the loss function.
Label correction. To avoid the data reduction caused by filtering, label correction methods adjust the predicted/given labels using only noisy labels (Patrini et al., 2017; Tanaka et al., 2018) or jointly with a small fraction of trusted data (Veit et al., 2017; Han et al., 2018a; Li et al., 2017; Hendrycks et al., 2018). (Reed et al.) train the classifier on "new" labels combining the raw and predicted labels, without access to label ground truth. (Patrini et al., 2017) estimate the noise corruption matrix by first training a classifier on the noisy labels and then using its softmax probabilities. (Veit et al., 2017) acquire human-verified labels to train a cleaning network that corrects noisy labels in multi-label classification problems. (Han et al., 2018a) estimate the noise transition probability by incorporating human assistance. (Li et al., 2017) and (Hendrycks et al., 2018) leverage a small set of clean data to estimate the noise corruption matrix from the clean and noisy sets, respectively. DivideMix (Li et al., 2020) is a semi-supervised method that uses two networks and a Gaussian mixture model for sample selection.
The proposed GSL combines a resilient loss function with label correction by curating a small fraction of trusted data. We solicit a subset of informative data instances to estimate the corruption matrix and provide minimal supervision on the noisy labels. We also provide a heuristic to adaptively tune the weights of the golden symmetric loss according to the noise characteristics of the dataset.

3. GOLDEN SYMMETRIC LOSS

Consider a classification problem with a noisy dataset D̃ = {(x_n, ỹ_n)}_{n=1}^N, where x_n ∈ X ⊂ R^d denotes the n-th observed sample and ỹ_n ∈ Y := {1, ..., K} the corresponding given label over K classes. Hereafter the subscript n is omitted for simplicity. ỹ is affected by label noise. The label corruption process is characterized by a corruption matrix C_ij = P(ỹ = j | y = i) for i, j = 1, ..., K, where y is the true label. Synthetic noise patterns are expressed as a label corruption probability ε plus a noisy-label distribution. Let f(·, θ) denote a neural network-based classifier parameterized by θ. For each data point x, f(·, θ) predicts the probability of each class label k: p(k|x) = e^{z_k} / Σ_{j=1}^K e^{z_j}, where z_j are the logits.
Symmetric Cross Entropy. Let q(k|x) denote the ground-truth probability distribution over the K class labels, where q(k|x) = 1 for k equal to the true class y and q(k|x) = 0 for all k ≠ y. The cross entropy loss (ℓ_ce) and reverse cross entropy loss (ℓ_rce) for a sample x are:
ℓ_ce = -Σ_{k=1}^K q(k|x) log p(k|x),   ℓ_rce = -Σ_{k=1}^K p(k|x) log q(k|x).   (1)
(Wang et al., 2019) combine the cross entropy and reverse cross entropy into the symmetric cross entropy ℓ_sl = α ℓ_ce + β ℓ_rce, where α and β are hyperparameters. On the one hand, the cross entropy loss is not robust to noise (Ghosh et al., 2017) but achieves good convergence (Zhang & Sabuncu, 2018); on the other hand, the reverse cross entropy is tolerant to noise (Wang et al., 2019).
Estimating the Noise Corruption Matrix. We estimate the noise corruption matrix as in (Hendrycks et al., 2018). The method first trains a classifier g(·, Θ) on the noisy data and then approximates the elements C_ij of the noise corruption matrix via a small fraction of trusted data D with known true labels y.
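The symmetric cross entropy above can be sketched in a few lines of NumPy. Note that ℓ_rce involves log q(k|x) with q(k|x) = 0 off the true class; following the convention of the SCL paper, log 0 is truncated to a finite constant (the value A = -4 below is an assumption, not specified in this text):

```python
import numpy as np

A = -4.0  # assumed truncation value for log(0) in the reverse term

def ce(p, q):
    # regular cross entropy: -sum_k q(k|x) log p(k|x)
    return -np.sum(q * np.log(np.clip(p, 1e-7, 1.0)))

def rce(p, q):
    # reverse cross entropy: -sum_k p(k|x) log q(k|x), with log 0 := A
    log_q = np.where(q > 0, np.log(np.clip(q, 1e-7, 1.0)), A)
    return -np.sum(p * log_q)

def sl(p, q, alpha=1.0, beta=1.0):
    # symmetric cross entropy of Wang et al. (2019)
    return alpha * ce(p, q) + beta * rce(p, q)

p = np.array([0.7, 0.2, 0.1])   # predicted class distribution
q = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
loss = sl(p, q, alpha=0.1, beta=1.0)
```

For a one-hot q, ℓ_rce reduces to -A(1 - p(y|x)), which is why it tolerates label noise: its magnitude is bounded regardless of how confidently wrong the target is.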
Practically, given A_i the subset of trusted data with true label of class i, {A_i ⊂ D : y = i}, the elements of C can be approximated by:
Ĉ_ij = P(ỹ = j | y = i) ≈ (1 / |A_i|) Σ_{x ∈ A_i} g(ỹ = j | x, Θ),
where g(ỹ = j | x, Θ) denotes the predicted probability of x having class label j. That is, Ĉ_ij is computed as the mean predicted probability of class j over all trusted data points with true label of class i.
Training with Corrected Labels. Let Ĉ be the estimated noise corruption matrix. Using the method in (Patrini et al., 2017), we increase the noise resilience by correcting the predictions of the classifier through Ĉ. Let p̂ be the corrected predicted probabilities, p̂ = Ĉ^T p, i.e., for data point x: p̂(k|x) = Σ_{i=1}^K Ĉ_ik p(i|x) for k = 1, ..., K. We correct only the regular cross entropy term; applying the prediction correction to both terms yields lower benefits. We evaluate this empirically with extensive experiments on datasets of text, i.e., Twitter in Figure 2a, and images, i.e., CIFAR-10 and CIFAR-100 in Appendix A. Experiment details can be found in §5. We consider different datasets, noise rates, noise types, and fractions of trusted data. In all cases except one (with a difference < 0.3%), correcting only the cross entropy (ce-only) yields better results than correcting only the reverse cross entropy (rce-only) or correcting both. Focusing on Figure 2a, ce-only improves accuracy by up to 5% and 8% for bimodal and symmetric noise, respectively. For the CIFAR-10 and CIFAR-100 datasets the improvements are more pronounced, up to 11% and 50%, respectively.
Golden Symmetric Loss. Towards more effective and robust learning, we propose to leverage the estimated noise corruption matrix Ĉ to tune the two loss terms based on the observed noise pattern. α and β can significantly impact the final model accuracy. Tuning these parameters is nontrivial, as different datasets affected by different noise patterns benefit from different optimal values (Wang et al., 2019).
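The estimation and correction steps above can be sketched as follows; `predict_proba` stands in for the first classifier g(· | x, Θ) trained on the noisy set (the name is an assumption for illustration):

```python
import numpy as np

def estimate_corruption_matrix(predict_proba, trusted_x, trusted_y, num_classes):
    # C_hat[i, j] = mean predicted probability of class j over the trusted
    # subset A_i of points with true label i
    C_hat = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        A_i = trusted_x[trusted_y == i]
        C_hat[i] = predict_proba(A_i).mean(axis=0)
    return C_hat

def correct_prediction(C_hat, p):
    # corrected probabilities p_hat = C_hat^T p for a single data point
    return C_hat.T @ p
```

Only the regular cross entropy term is then computed on `correct_prediction(C_hat, p)`; the reverse term keeps the raw p, per the ce-only finding above.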
Again we show this behavior by training a 2-layer fully connected neural network on the Twitter dataset under eleven different (α, β) combinations and two noise patterns with 80% noise. Figure 2b reports, for each noise pattern, the evolution of the test accuracy over the training epochs for the best and worst (α, β) pair. For bimodal noise, even with a small number of trials, the impact of (α, β) ranges from an accuracy close to 60% all the way down to almost 0%. Moreover, only a few (two out of eleven) (α, β) pairs reach accuracy close to 60%. For symmetric noise the tuning impact is lower (limited to between 70% and 80%), but the best and worst (α, β) pairs differ from the bimodal noise case. This underlines both the importance and the difficulty of tuning (α, β). Motivated by the high impact of α and β, we propose to dynamically weight the regular and reverse cross entropy terms. Let A(·) and B(·) be weighting functions mapping Ĉ → R; we define a new loss function:
ℓ_GSL = A(Ĉ) ℓ_ce + B(Ĉ) ℓ_rce.
We call this new loss function the golden symmetric loss. A(·) and B(·) should capture not only the intensity of the noise pattern but also its diversity (see Figure 2b).
Determining the Weights of the Golden Symmetric Loss (A(·) and B(·)). In general, the more intense and asymmetric the noise pattern, the lower the weight values should be. Since the final loss function learns from both dirty and clean data (see next paragraph), lower values of A and B reduce the influence of dirty data relative to clean data. Hence, we design A(·) and B(·) to capture both noise intensity and diversity. The intensity is given by the noise rate ε ∈ [0, 1], i.e., one minus the average of the diagonal elements of Ĉ. The diversity is measured via Jain's fairness index J(x_1, x_2, ..., x_n) = (Σ_{i=1}^n x_i)^2 / (n Σ_{i=1}^n x_i^2). We choose J because it bounds the diversity on a scale similar to ε, between 1 (all equal, full symmetry) down to 1/n (highest asymmetry). We apply J on all the K(K-1) noise, i.e.
off-diagonal, elements of Ĉ:
J(Ĉ) = (Σ_{i=1}^K Σ_{j=1, j≠i}^K Ĉ_ij)^2 / (K(K-1) Σ_{i=1}^K Σ_{j=1, j≠i}^K Ĉ_ij^2).   (5)
For symmetric noise J = 1; the more asymmetric the noise, the smaller J. The final weights are set proportional to J and ε.
Putting It All Together. As a final step, to maximize the utility of the trusted data, we use D as additional trusted training data for f(·). Since D contains the true labels y, no prediction correction is applied. Hence, the overall loss function for data points from both the noisy set D̃ and the trusted set D is:
ℓ = -A(Ĉ) Σ_{k=1}^K q(k|x) log(Σ_{i=1}^K Ĉ_ik p(i|x)) - B(Ĉ) Σ_{k=1}^K p(k|x) log q(k|x),   x ∈ D̃,
ℓ = -Σ_{k=1}^K q(k|x) log p(k|x),   x ∈ D.
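The noise rate ε and Jain's index J of Eq. (5) are simple functions of Ĉ. The text fixes only that the weights are proportional to J and account for ε; the exact functional form `J * (1 - eps)` in `gsl_weights` below is our assumption, shown purely as one plausible instantiation:

```python
import numpy as np

def noise_rate(C_hat):
    # epsilon: one minus the average diagonal element of C_hat
    return 1.0 - np.mean(np.diag(C_hat))

def jain_index_offdiag(C_hat):
    # Jain's fairness index over the K(K-1) off-diagonal entries (Eq. 5)
    K = C_hat.shape[0]
    off = C_hat[~np.eye(K, dtype=bool)]
    denom = K * (K - 1) * np.sum(off ** 2)
    return np.sum(off) ** 2 / denom if denom > 0 else 1.0

def gsl_weights(C_hat, scale=1.0):
    # assumed heuristic: weights proportional to J and decreasing in epsilon
    eps = noise_rate(C_hat)
    J = jain_index_offdiag(C_hat)
    w = scale * J * (1.0 - eps)
    return w, w  # A(C_hat), B(C_hat)
```

For a symmetric corruption matrix all off-diagonal entries are equal, so J = 1; concentrating the noise mass on few entries (e.g., flip or bimodal noise) drives J toward 1/(K(K-1)).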

4. THEORETICAL ANALYSIS

We prove that the cross entropy loss with label correction is noise tolerant under the definition put forth by (Ghosh et al., 2017; Manwani & Sastry, 2013), extending prior art results. Let the risk of classifier f with loss ℓ_ce under clean labels be R(f) = E_{x,y} ℓ_ce(f(x), y) and the risk under noise rate ε be R_ε(f) = E_{x,ỹ} ℓ_ce(f(x), ỹ), where E denotes the expectation over the random variables indicated in its subscript. With prediction correction via C, the risk becomes R_ε(f, C) = E_{x,ỹ} ℓ_ce(C^T f(x), ỹ). Let f* and f*_ε be the global minimizers of R(f) and R_ε(f), respectively, and let C* = P(ỹ|y) and Ĉ be the true and estimated noise corruption matrices, respectively.
Theorem 4.1. In a multi-class classification problem, ℓ_ce with prediction correction is noise tolerant under symmetric label noise if the noise rate ε < (K-1) / (K - ΔA/ΔR), where ΔA = Σ_{k=1}^K ℓ_ce(C*^T f(x), k) - Σ_{k=1}^K ℓ_ce(Ĉ^T f(x), k) and ΔR is the difference in risk minimization between the optimal classifier and f. Moreover, ℓ_ce with prediction correction is noise tolerant under flip noise when the noise rate ε_yk ≤ (1 + ΔW_y/ΔW_k) - ε_y (1 + ΔW_y/ΔW_k), where ε_y and ε_yk are the correct and flipped class probabilities, respectively.
The proof is based on the risk minimization framework, aiming to show under which condition R_ε(f*, C*) - R_ε(f, Ĉ) ≤ 0, i.e., the loss function is robust to noise. The detailed steps of the proof can be found in Appendix C. The condition ε < (K-1) / (K - ΔA/ΔR) generalizes the previous bound ε < (K-1)/K of (Ghosh et al., 2017): without label correction ΔA = 0, which recovers the previous result. Label correction improves robustness by allowing higher noise rates, i.e., with label correction ΔA/ΔR ≥ 0 and hence (K-1)/K ≤ (K-1)/(K - ΔA/ΔR). A similar observation holds for the flip noise bound.

5. EXPERIMENTAL SETUP

Dataset, Architecture and Parameters. We consider two types of datasets: vision and text analysis. For vision, we use convolutional neural networks (CNNs) to classify CIFAR-10 and CIFAR-100 with injected label noise, and Clothing1M as a real-world noisy dataset. For text, we use fully connected neural networks to classify noisy Twitter and Stanford Sentiment Treebank (SST). In principle, we use the same network architecture for all comparative approaches across different noise resilience techniques. In addition, we also test the original network from the respective papers and report the best result of the two.
• CIFAR-10 (Krizhevsky et al., 2009). It contains 60K images classified into 10 classes: 50K as the training set and 10K as the validation set. We use the Wide-ResNet architecture of (Zagoruyko & Komodakis, 2016) with depth 28 and widening factor 10, trained with SGD with Nesterov momentum and a cosine learning rate schedule (Loshchilov & Hutter). For GSL, we first train g for 75 epochs to obtain the noise corruption matrix; we then train f for 120 epochs.
• CIFAR-100 (Krizhevsky et al., 2009). It contains 60K images classified into 100 classes: 50K as the training set and 10K as the validation set. We use the same Wide-ResNet architecture as for CIFAR-10. For GSL, we train the g and f networks for 75 and 200 epochs, respectively.
• Clothing1M (Xiao et al., 2015). This is a real-world dataset with label noise. It includes images scraped from the Internet, classified into 14 categories. We resize and crop each image to 224 × 224 pixels. The dataset contains 47K and 10K images for training and testing, respectively; both sets have given (scraped) and true (human-checked) labels. We use a ResNet-50 pretrained on ImageNet and further train it for 10 epochs with batch size 32, SGD optimizer, momentum 0.9, weight decay 10^-3, and learning rate 10^-3, divided by 10 after 5 epochs.
• Twitter (Gimpel et al., 2011).
The Twitter dataset includes 1,827 tweets annotated with 25 POS tags, split into 1,000 tweets as the training set, 327 tweets as the development set, and 500 tweets as the test set. We merge the development set into the training set. We use a 2-layer fully connected network with 256 hidden neurons per layer and the GELU nonlinearity as activation function. We train g with Adam for 15 epochs with batch size 64 and learning rate 0.001, and then train f for 25 epochs. To regularize all linear output layers, we use ℓ2 weight decay with λ = 5 × 10^-5.
• Stanford Sentiment Treebank (Socher et al., 2013). The SST dataset includes single-sentence movie reviews. We use the 2-class version, with 6,911 reviews in the training set, 872 reviews in the development set, and 1,821 reviews in the test set. We augment the training set with the development set. We learn 100-dimensional word vectors from scratch for a vocabulary size of 10,000. We train a word-averaging model with an affine output layer using the Adam optimizer, for 5 epochs for network g and 10 epochs for network f. The batch size and learning rate are 50 and 0.001, respectively. To regularize all linear output layers, we use ℓ2 weight decay with λ = 1 × 10^-4.
Noise Corruption. We consider symmetric noise and two different asymmetric noises, namely flip and bimodal. Symmetric noise corrupts the true label into a random other label with equal probability, according to the noise rate. Flip noise is generated by flipping the original label to a paired other class with a specific probability. Bimodal noise imitates targeted adversarial attacks (Goodfellow et al., 2015): the true labels are corrupted into two neighborhoods centered on two targeted classes, each following a truncated normal distribution N_T(µ, σ, a, b), where µ specifies the target, σ controls the spread, and a and b define the class label boundaries.
For CIFAR-10 we target classes 3 and 7, for CIFAR-100 classes 30 and 70, for Twitter classes 6 and 18, and for SST classes 0 and 1. Clothing1M is instead already affected by real-world label noise and is left untouched.
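The symmetric and flip corruptions described above can be sketched as label transforms; the `pairs` mapping for flip noise is an assumption for illustration, as the text does not specify the class pairings:

```python
import numpy as np

rng = np.random.default_rng(0)

def symmetric_noise(y, num_classes, rate):
    # with probability `rate`, replace each label by a uniformly random OTHER class
    y = y.copy()
    corrupt = rng.random(len(y)) < rate
    offsets = rng.integers(1, num_classes, size=int(corrupt.sum()))
    y[corrupt] = (y[corrupt] + offsets) % num_classes
    return y

def flip_noise(y, pairs, rate):
    # with probability `rate`, flip each label to its fixed paired class
    # `pairs` is an array mapping class -> paired class (assumed pairing)
    y = y.copy()
    corrupt = rng.random(len(y)) < rate
    y[corrupt] = pairs[y[corrupt]]
    return y
```

Bimodal noise would follow the same pattern but draw corrupted labels from two truncated normals N_T(µ, σ, a, b) centered on the targeted classes.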

6. EVALUATION

In this section, we empirically compare GSL against state-of-the-art noise-resilient networks on noisy vision and text data. We aim to show the effectiveness of GSL via the test accuracy on diverse and challenging noise patterns. Our target evaluation metric is the accuracy achieved on the clean test set, i.e., the set not affected by noise.

6.1. VISION ANALYSIS

We compare GSL against five noise-resilient networks from the state of the art: GLC (Hendrycks et al., 2018), SCL (Wang et al., 2019), FORWARD (Patrini et al., 2017), BOOTSTRAP (Reed et al.), and CO-TEACHING+ (Yu et al., 2019). In addition, we extend FORWARD by adding the reverse cross entropy from SCL and loss correction through the corruption matrix as in GLC, called SGFORWARD. For training GSL, CO-TEACHING+, SGFORWARD and GLC, we use PyTorch v1.4.0; for all other methods, we use Keras v2.2.4 and TensorFlow v1.13.0. We assume 10% of trusted data is available for GSL, GLC and SGFORWARD. Table 1 summarizes the testing accuracy for all combinations of noise patterns and comparative approaches. For CIFAR-10, GSL achieves the highest accuracy among all resilient networks except for flip noise with a 30% noise rate. SGFORWARD is the closest rival to GSL because both use the same mechanism in the loss function. Besides, GSL has 2 to 8% higher accuracy than GLC, demonstrating the benefit of introducing symmetric cross entropy, especially at high noise rates. The accuracy difference between GSL and SCL is even more visible, showing the benefit of using the corruption matrix to assign the weights of the two terms in the symmetric cross entropy. We note that SCL uses an 8-layer CNN with 6 convolutional layers followed by 2 fully connected layers instead of a Wide ResNet, because of its superior results. SCL performs particularly poorly under 60% bimodal noise because this pattern is more challenging and SCL has no access to the corruption matrix. Moreover, our method still obtains 11 to 30% higher test accuracy than CO-TEACHING+, which uses two deep networks concurrently. CIFAR-100 is more challenging than CIFAR-10 due to the larger number of classes. GSL achieves the highest accuracy except for flip noise with a 30% rate, and SGFORWARD is the second best among the competitors.
Although SGFORWARD performs better than GSL for flip noise with a 30% rate, the improvement of GSL over SGFORWARD is larger here than on the CIFAR-10 dataset. The largest accuracy difference (more than 2%) between GSL and SGFORWARD occurs with bimodal noise. In the case of 60% symmetric noise, GSL achieves an accuracy of 68%, whereas GLC and SCL trail far behind. Moreover, given the difficulty of training a robust classifier for CIFAR-100 with 60% label noise, it is worth mentioning that for 30% symmetric noise SCL achieves performance similar to GLC, which is given 10% trusted data. This also indicates the effectiveness of symmetric cross entropy in learning hard classes even without trusted data. However, when facing extremely noisy labels and patterns, the small amount of trusted data can greatly improve the robustness of the classifier, whereas the symmetric cross entropy alone cannot. As seen from its high accuracy compared to GLC and SCL, GSL effectively uses the trusted data to correct the symmetric cross entropy loss and improve learning on the hard classes. On CIFAR-10, GSL performs slightly better with symmetric noise than with the more challenging bimodal and flip noise; on CIFAR-100, GSL works better on asymmetric noise than on symmetric noise. For the Clothing1M dataset, as shown in Table 1, GSL obtains the highest test accuracy among all methods. As on CIFAR-10 and CIFAR-100, SGFORWARD achieves relatively good performance. The difference between GSL and SCL comes from the effectiveness of the corruption matrix in making the regular cross entropy robust.

6.2. TEXT ANALYSIS

We evaluate GSL on the text datasets Twitter and SST, against resilient networks that leverage the corruption matrix, namely GLC and FORWARD. Both GSL and GLC use the trusted data to estimate the corruption matrix, whereas the original FORWARD (Patrini et al., 2017) relies solely on the noisy data.
As the proposed golden symmetric cross entropy loss is general and can be combined with different resilient networks, we use the following four variations of loss correction and symmetric cross entropy on top of existing work:
• Forward gold (GFORWARD): we replace the estimation of the corruption matrix by the identity matrix on trusted samples and apply loss correction through the corruption matrix.
• True corruption matrix (TMATRIX): we directly use the true corruption matrix and apply loss correction through it.
• Forward gold with symmetric cross entropy (SGFORWARD): we extend the corrected loss of GFORWARD to the corrected symmetric cross entropy as in GSL.
• True corruption matrix with symmetric cross entropy (STMATRIX): we apply the golden symmetric cross entropy with the true corruption matrix instead of the estimated one.
We extensively evaluate GSL, GLC, GFORWARD, TMATRIX, SGFORWARD, and STMATRIX on Twitter and SST, with label corruption ranging from 0% to 100%. We also vary the percentage of trusted data between 1% and 5%. We summarize the average accuracy across the 11 noise rates in Table 2.
Twitter. GSL consistently achieves the highest average accuracy in most cases. Compared to GLC, GSL has significantly higher accuracy for Twitter corrupted with symmetric and bimodal noise, but the difference diminishes with increasing amounts of trusted data. When the percentage of trusted data is low, say 1%, GLC is unable to estimate the corruption matrix accurately or to correct the loss, as seen from the difference between GLC and TMATRIX.
SST. Here the classification involves only two classes and turns out to be less challenging than the Twitter case. The difference among the comparative approaches is smaller than for Twitter. For instance, though GSL consistently achieves the best average accuracy in almost all cases, the difference between GSL and GLC is around 1-3%.
Again, we see that GSL visibly outperforms GLC for low amounts of trusted data thanks to the symmetric cross entropy, and the difference between them becomes limited as the trusted data grows. We note that TMATRIX and GFORWARD collapse under flip noise.
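As a reference for how the corrected symmetric loss behind GSL and SGFORWARD combines the two terms, a minimal numpy sketch follows. The function name, the weights `alpha` and `beta`, and the clipping constant `eps` are illustrative assumptions, not the exact implementation; the structure follows the description of GSL as the weighted sum of the corruption-corrected regular and reverse cross entropy.

```python
import numpy as np

def golden_symmetric_loss(probs, labels, C_hat, alpha=1.0, beta=1.0, eps=1e-4):
    """Sketch of a golden symmetric loss: weighted sum of corruption-corrected
    cross entropy and reverse cross entropy (alpha/beta/eps are illustrative).

    probs:  (N, K) softmax outputs f(x)
    labels: (N,)   noisy integer labels
    C_hat:  (K, K) estimated corruption matrix, C_hat[i, j] ~ p(noisy j | true i)
    """
    n, K = probs.shape
    # Forward-corrected predictions: row i is C_hat^T f(x_i)
    corrected = probs @ C_hat
    corrected = np.clip(corrected, eps, 1.0)
    # Regular cross entropy on the corrected predictions
    ce = -np.log(corrected[np.arange(n), labels])
    # Reverse cross entropy: roles of prediction and target swapped; zeros in
    # the one-hot target are replaced by eps before taking the logarithm
    one_hot = np.eye(K)[labels]
    rce = -np.sum(corrected * np.log(np.clip(one_hot, eps, 1.0)), axis=1)
    return np.mean(alpha * ce + beta * rce)
```

With `C_hat` set to the identity and `beta=0`, the sketch reduces to the plain cross entropy, which makes the role of the correction and of the reverse term easy to isolate.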

7. CONCLUSION

To enhance the robustness of deep models against label noise, we propose GSL, which corrects the symmetric cross entropy loss through the noise corruption matrix. GSL uses a small fraction of trusted data to accurately estimate the corruption matrix and, further, to determine the weights applied to the regular and reverse cross entropy. GSL learns deep networks from trusted samples through the regular cross entropy and from untrusted noisy samples through the golden symmetric cross entropy. We prove that the cross entropy corrected by the corruption matrix is noise robust. To adapt to the noise patterns of a dataset, we heuristically set the weights of the golden symmetric loss based on the corruption matrix. We extensively evaluate GSL on vision and text analysis under diversified noise rates and patterns. Evaluation results show that GSL achieves a remarkable accuracy improvement, i.e., from 2% to 18% on CIFAR benchmarks and real-world noisy data, compared to methods that either correct the loss or leverage the symmetric cross entropy.

A TRAINING WITH CORRECTED LABELS ON VISION DATASETS

Here we present the extensive results of our empirical evaluation on training with corrected labels for the vision datasets. This complements the results presented in §3. We compare the impact of correcting labels only on the cross entropy term (ce only), only on the reverse cross entropy term (rce only), or on both (both). Table 3 and Table 4 show the achieved accuracy for CIFAR-10 and CIFAR-100, respectively, under two noise rates (30% and 60%), three noise types (symmetric, bimodal and flip), and three fractions of trusted data (5%, 10% and 15%). For each noise scenario the best case is highlighted in bold. ce only achieves the highest accuracy in all cases except one: under 60% bimodal noise on CIFAR-100 with 10% trusted data, rce only is slightly better by 0.28 percent points. More generally, rce only typically performs second best and both achieves the worst accuracy.
Focusing on the gain of ce only over the other two, it tends to increase with the difficulty of the noise scenario, i.e., with a higher number of classes, higher noise rates, and less trusted data. ce only outperforms the other two by up to 11.74 percent points on CIFAR-10 and up to 51.03 percent points on CIFAR-100.

B TEXT ANALYSIS ON TWITTER AND SST DATASETS WITH VARYING NOISE RATES

Figure 4 shows how the accuracy changes with respect to different noise rates on the Twitter and SST datasets. GSL and GLC are provided with one percent of trusted data. In contrast to GLC, GSL can effectively use the symmetric cross entropy to overcome the limitation of low trusted data. This also explains why STMATRIX, which uses the true confusion matrix and the symmetric entropy loss, always trails closely behind GSL. One may further improve STMATRIX by using the optimal weights α and β according to the true corruption matrix, instead of the corruption matrix estimated by GSL. The Twitter dataset highlights the differences well (see Figure 4a). The SST dataset is an easier problem with only two classes, and all methods perform equally well (see Figure 4b).

Proof. For symmetric noise:

$$R^\varepsilon(f, \hat{C}) = \mathbb{E}_{x,\tilde{y}}\,\ell_{ce}(\hat{C}^T f(x), \tilde{y}) = \mathbb{E}_x \mathbb{E}_{y|x} \mathbb{E}_{\tilde{y}|x,y}\,\ell_{ce}(\hat{C}^T f(x), \tilde{y})$$
$$= \mathbb{E}_{x,y}\Big[(1-\varepsilon)\,\ell_{ce}(\hat{C}^T f(x), y) + \frac{\varepsilon}{K-1}\sum_{k \neq y} \ell_{ce}(\hat{C}^T f(x), k)\Big]$$
$$= \mathbb{E}_{x,y}\Big[(1-\varepsilon)\,\ell_{ce}(\hat{C}^T f(x), y) + \frac{\varepsilon}{K-1}\Big(\sum_{k=1}^{K} \ell_{ce}(\hat{C}^T f(x), k) - \ell_{ce}(\hat{C}^T f(x), y)\Big)\Big]$$
$$= (1-\varepsilon)\,R(f, \hat{C}) + \frac{\varepsilon}{K-1}\Big(\mathbb{E}_{x,y}\sum_{k=1}^{K} \ell_{ce}(\hat{C}^T f(x), k) - R(f, \hat{C})\Big)$$
$$= R(f, \hat{C})\Big(1 - \frac{\varepsilon K}{K-1}\Big) + \frac{\varepsilon}{K-1}\,\mathbb{E}_{x,y}\Big[\sum_{k=1}^{K} \ell_{ce}(\hat{C}^T f(x), k)\Big]$$

$\ell_{ce}$ with label correction is robust to noise when $R^\varepsilon(f^*, C^*) - R^\varepsilon(f, \hat{C}) \le 0$, where $\Delta R \le 0$ because $f^*$ is the global minimizer of $R$ and $C^*$ is the optimal noise confusion matrix; similarly, $\Delta A \le 0$ because in the optimal case $A(C^{*T} f^*(x), y) \approx 0$. This is true when:

$$R^\varepsilon(f^*, C^*) - R^\varepsilon(f, \hat{C}) = \Big(1 - \frac{\varepsilon K}{K-1}\Big)\Delta R + \frac{\varepsilon}{K-1}\Delta A = \Delta R - \frac{\varepsilon K}{K-1}\Delta R + \frac{\varepsilon}{K-1}\Delta A \le 0$$
$$\overset{\Delta R \le 0}{\Longrightarrow}\; 1 - \frac{\varepsilon K}{K-1} + \frac{\varepsilon}{K-1}\frac{\Delta A}{\Delta R} \ge 0 \;\Longrightarrow\; 1 \ge \frac{\varepsilon}{K-1}\Big(K - \frac{\Delta A}{\Delta R}\Big) \;\Longrightarrow\; \varepsilon \le \frac{K-1}{K - \frac{\Delta A}{\Delta R}} \quad (9)$$

With no label correction, $C$ is missing in $\Delta A$ and its two terms become equal, i.e., $\Delta A = 0$. In this condition the bound becomes $\varepsilon < \frac{K-1}{K}$, as found by (Ghosh et al., 2017) for cross entropy without label correction. Since $\frac{\Delta A}{\Delta R} \ge 0$, we have $\varepsilon < \frac{K-1}{K} \le \frac{K-1}{K - \Delta A / \Delta R}$, so the new bound can be seen as a generalization of the previous one.
$\frac{\Delta A}{\Delta R}$ should also be at most one to ensure a meaningful bound on $\varepsilon$, avoiding scenarios with a noise rate greater than 1. For asymmetric flip noise, $1 - \varepsilon_y$ is the probability of a label being correct (i.e., $k = y$), and the noise condition $\varepsilon_{yk} < 1 - \varepsilon_y$ states that a sample $x$ has a higher probability $(1 - \varepsilon_y)$ of being labeled correctly as class $y$ than the probability $(\varepsilon_{yk})$ of being labeled incorrectly as class $k \neq y$. In (11), $\Delta W_y \le 0$ and $\Delta W_k \le 0$ because $C^*$ is the optimal noise confusion matrix. Rewriting (11) term by term:

$$(1 - \varepsilon_y)\,\Delta W_y + (1 - \varepsilon_y - \varepsilon_{yk})\,\Delta W_k \le 0$$
$$\Longrightarrow\; \Delta W_y - \varepsilon_y \Delta W_y \le -\Delta W_k + \varepsilon_y \Delta W_k + \varepsilon_{yk} \Delta W_k$$
$$\overset{\Delta W_k \le 0}{\Longrightarrow}\; \frac{\Delta W_y}{\Delta W_k} - \varepsilon_y \frac{\Delta W_y}{\Delta W_k} \ge -1 + \varepsilon_y + \varepsilon_{yk}$$
$$\Longrightarrow\; \frac{\Delta W_y}{\Delta W_k} - \varepsilon_y \Big(\frac{\Delta W_y}{\Delta W_k} + 1\Big) \ge \varepsilon_{yk} - 1 \quad (12)$$

According to (12), the bound is $\varepsilon_{yk} \le \big(1 + \frac{\Delta W_y}{\Delta W_k}\big) - \varepsilon_y \big(1 + \frac{\Delta W_y}{\Delta W_k}\big)$. With no label correction, $\Delta W_y = 0$ and the bound becomes $\varepsilon_{yk} < 1 - \varepsilon_y$, as found by prior art.
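The asymmetric bound derived from (12), $\varepsilon_{yk} \le (1 + \Delta W_y/\Delta W_k)(1 - \varepsilon_y)$, admits the same kind of numeric sanity check; the concrete values of $\varepsilon_y$ and of the ratio below are arbitrary illustrative choices.

```python
def flip_bound(eps_y, ratio):
    """Upper bound on eps_yk from (12): (1 + dWy/dWk) * (1 - eps_y).

    ratio = dWy/dWk >= 0; ratio = 0 recovers the uncorrected bound."""
    return (1.0 + ratio) * (1.0 - eps_y)

# Without correction (ratio = 0) the bound is the classical 1 - eps_y;
# with correction (ratio > 0) the tolerated flip rate can only grow.
assert abs(flip_bound(0.3, 0.0) - 0.7) < 1e-12
assert flip_bound(0.3, 0.5) > flip_bound(0.3, 0.0)
```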



To avoid problems with the logarithm, zero values of $q$ are replaced by a small positive value, i.e., $10^{-4}$.
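This replacement is a one-line clipping operation; the constant $10^{-4}$ follows the text, while the array name `q` is just an illustrative one-hot target.

```python
import numpy as np

# Replace zeros in the (one-hot) target distribution q before taking the
# logarithm; the floor value 1e-4 follows the text.
q = np.array([0.0, 1.0, 0.0])
q_safe = np.clip(q, 1e-4, 1.0)
log_q = np.log(q_safe)  # finite everywhere, since no entry is zero
assert np.all(np.isfinite(log_q))
```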



Figure 1: Noise corruption matrix and confusion matrices of predictions for CIFAR-10 with 60% symmetric label noise.

Figure 2: Impact of loss correction and α,β-tuning on a 2-layer FC network trained on Twitter data.

Figure 3: Training process of GSL divided into two steps.

Figure 3 visually summarises the training process, divided into two main steps: (i) estimating the noise corruption matrix through the first network $g$ trained on the untrusted dataset $\tilde{D}$, and (ii) training the classifier $f$ on both the untrusted $\tilde{D}$ and the trusted $D$ through the golden symmetric loss.
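Step (i) can be sketched as averaging the noisy network's predicted label distributions over the trusted samples of each true class; the function name, the uniform fallback row, and the tiny arrays standing in for the outputs of $g$ are illustrative assumptions, not the exact estimation procedure.

```python
import numpy as np

def estimate_corruption_matrix(g_probs, true_labels, K):
    """Step (i) sketch: estimate C_hat row i as the mean predicted
    distribution of the noisy network g over trusted samples of class i.

    g_probs: (N, K) softmax outputs of g on the trusted set
    true_labels: (N,) true class indices of the trusted samples"""
    C_hat = np.full((K, K), 1.0 / K)  # uniform fallback for empty classes
    for i in range(K):
        mask = true_labels == i
        if mask.any():
            C_hat[i] = g_probs[mask].mean(axis=0)
    return C_hat

# Illustrative stand-in outputs of g on six trusted samples of two classes
g_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                    [0.2, 0.8], [0.3, 0.7], [0.1, 0.9]])
true_labels = np.array([0, 0, 0, 1, 1, 1])
C_hat = estimate_corruption_matrix(g_probs, true_labels, K=2)
assert np.allclose(C_hat.sum(axis=1), 1.0)  # each row is a distribution
```

The resulting `C_hat` is then used in step (ii) to correct the loss when training the classifier $f$ on the untrusted data.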

Figure 4: Testing accuracy on text datasets with varying noise rates (1% trusted data).

Let $A(\hat{C}^T f(x), y) = \sum_{k=1}^{K} \ell_{ce}(\hat{C}^T f(x), k)$. Then we can rewrite (7) as $R^\varepsilon(f, C) = \big(1 - \frac{\varepsilon K}{K-1}\big) R(f, C) + \frac{\varepsilon}{K-1} A(C^T f(x), y)$, thus:

$$R^\varepsilon(f^*, C^*) - R^\varepsilon(f, \hat{C}) = \Big(1 - \frac{\varepsilon K}{K-1}\Big) \underbrace{\big(R(f^*, C^*) - R(f, \hat{C})\big)}_{\Delta R} + \frac{\varepsilon}{K-1} \underbrace{\big(A(C^{*T} f^*(x), y) - A(\hat{C}^T f(x), y)\big)}_{\Delta A} \quad (8)$$

$$R^\varepsilon(f, C) = \mathbb{E}_{x,\tilde{y}}\,\ell_{ce}(C^T f(x), \tilde{y}) = \mathbb{E}_x \mathbb{E}_{y|x} \mathbb{E}_{\tilde{y}|x,y}\,\ell_{ce}(C^T f(x), \tilde{y})$$
$$= \mathbb{E}_{x,y}\Big[(1-\varepsilon_y)\,\ell_{ce}(C^T f(x), y) + \sum_{k \neq y} \varepsilon_{yk}\,\ell_{ce}(C^T f(x), k)\Big]$$
$$= \mathbb{E}_{x,y}\Big[(1-\varepsilon_y)\Big(\sum_{k=1}^{K} \ell_{ce}(C^T f(x), k) - \sum_{k \neq y} \ell_{ce}(C^T f(x), k)\Big) + \sum_{k \neq y} \varepsilon_{yk}\,\ell_{ce}(C^T f(x), k)\Big]$$
$$= \mathbb{E}_{x,y}\Big[(1-\varepsilon_y) \sum_{k=1}^{K} \ell_{ce}(C^T f(x), k) + \sum_{k \neq y} (1 - \varepsilon_y - \varepsilon_{yk})\,\ell_{ce}(C^T f(x), k)\Big] \quad (10)$$

Similar to the symmetric case, we require $R^\varepsilon(f^*, C^*) - R^\varepsilon(f, \hat{C}) \le 0$ for the loss to be robust to noise:

$$R^\varepsilon(f^*, C^*) - R^\varepsilon(f, \hat{C}) = \mathbb{E}_{x,y}\Big[(1-\varepsilon_y) \underbrace{\sum_{k=1}^{K} \big(\ell_{ce}(C^{*T} f^*(x), k) - \ell_{ce}(\hat{C}^T f(x), k)\big)}_{\Delta W_y} + \sum_{k \neq y} (1 - \varepsilon_y - \varepsilon_{yk}) \underbrace{\big(\ell_{ce}(C^{*T} f^*(x), k) - \ell_{ce}(\hat{C}^T f(x), k)\big)}_{\Delta W_k}\Big] \quad (11)$$

Table 1: Vision analysis: test accuracy (%) on real-world noisy Clothing1M, and on CIFAR-10/CIFAR-100 corrupted with 30% and 60% noise, for different noise-resilient networks. Best results in bold.

Table 2: Text analysis: average accuracy (%) of variants combining loss correction and symmetric cross entropy. Results averaged across the entire range of noise rates [0%, 100%]. Best accuracy in bold.

Accuracy (%) for different gold fractions on CIFAR-10.

