LEARNING FROM NOISY DATA WITH ROBUST REPRESENTATION LEARNING

Abstract

Learning from noisy data has attracted much attention, where most methods focus on label noise. In this work, we propose a new framework which simultaneously addresses three types of noise commonly seen in real-world data: label noise, out-of-distribution input, and input corruption. In contrast to most existing methods, we combat noise by learning robust representations. Specifically, we embed images into a low-dimensional subspace by training an autoencoder on the deep features. We regularize the geometric structure of the subspace with robust contrastive learning, which includes an unsupervised consistency loss and a supervised mixup prototypical loss. Furthermore, we leverage the structure of the learned subspace for noise cleaning, by aggregating information from neighboring samples. Experiments on multiple benchmarks demonstrate the state-of-the-art performance of our method and the robustness of the learned representation. Our code will be released.

1. INTRODUCTION

Data in real life is noisy. However, deep models with remarkable performance are mostly trained on clean datasets with high-quality human annotations. Manual data cleaning and labeling is an expensive process that is difficult to scale. On the other hand, there exists an almost infinite amount of noisy data online. It is crucial that deep neural networks (DNNs) can harvest noisy training data. However, it has been shown that DNNs are susceptible to overfitting to noise (Zhang et al., 2017). As shown in Figure 1, a real-world noisy image dataset often contains multiple types of noise. Label noise refers to samples that are wrongly labeled as another class (e.g. a flower labeled as an orange). Out-of-distribution input refers to samples that do not belong to any known class. Input corruption refers to image-level distortion (e.g. low brightness) that causes a data shift between training and test. Most methods in the literature focus on addressing the more detrimental label noise.
Two dominant approaches include: (1) finding clean samples as those with smaller loss and assigning larger weights to them (Han et al.



• We propose noise-robust contrastive learning, which introduces two contrastive losses. The first is an unsupervised consistency contrastive loss. It enforces inputs with perturbations to have similar normalized embeddings, which helps learn robust and discriminative representations.

• Our second contrastive loss is a weakly-supervised mixup prototypical loss. We compute class prototypes as normalized mean embeddings, and enforce each sample's embedding to be closer to its class prototype. Inspired by Mixup (Zhang et al., 2018), we construct virtual training samples as linear interpolations of inputs, and encourage the same linear relationship w.r.t. the class prototypes.

• We train a linear autoencoder to reconstruct the high-dimensional features using the low-dimensional embeddings. The autoencoder enables the high-dimensional features to maximally preserve the robustness of the low-dimensional embeddings, thus regularizing the classifier.

• We propose a new noise cleaning method which exploits the structure of the learned representations. For each sample, we aggregate information from its top-k neighbors to create a pseudo-label. A subset of training samples with confident pseudo-labels is selected to compute the weakly-supervised losses. This process can effectively clean both label noise and out-of-distribution (OOD) noise.

Our experimental contributions include:

• We experimentally show that our method is robust to label noise, OOD input, and input corruption. Experiments are performed on multiple datasets with controlled noise and real-world noise, where our method achieves state-of-the-art performance.

• We demonstrate that the proposed noise cleaning method can effectively clean the majority of label noise. It also learns a curriculum that gradually leverages more samples to compute the weakly-supervised losses as the pseudo-labels become more accurate.

• We validate the robustness of the learned low-dimensional representation by showing that (1) k-nearest neighbor classification outperforms the softmax classifier, and (2) OOD samples can be separated from in-distribution samples. The efficacy of the proposed autoencoder is also verified.

Figure 1: Google search images from the WebVision (Li et al., 2017) dataset with keyword "orange", showing input corruption, label noise, and out-of-distribution input.

2. RELATED WORK

Label noise learning. Learning from noisy labels has been extensively studied in the literature. While some methods require access to a small set of clean samples (Xiao et al., 2015; Vahdat, 2017; Veit et al., 2017; Lee et al., 2018; Hendrycks et al., 2018), most methods focus on the more challenging scenario where no clean labels are available. These methods can be categorized into two major types. The first type performs label correction using predictions from the network (Reed et al., 2015; Ma et al., 2018; Tanaka et al., 2018; Yi & Wu, 2019). The second type tries to separate clean samples from corrupted samples, and trains the model on the clean samples (Han et al., 2018; Arazo et al., 2019; Jiang et al., 2018; 2020; Wang et al., 2018; Chen et al., 2019; Lyu & Tsang, 2020). The recently proposed DivideMix (Li et al., 2020a) effectively combines label correction and sample selection with the Mixup (Zhang et al., 2018) data augmentation under a co-training framework. However, it costs 2× the computational resources of our method. Different from existing methods, our method combats noise by learning noise-robust low-dimensional representations. We propose a more effective noise cleaning method by leveraging the structure of the learned representations. Furthermore, our model is robust not only to label noise, but also to out-of-distribution and corrupted input. A previous work has studied open-set noisy labels (Wang et al., 2018), but their method does not enjoy the same level of robustness as ours.

Contrastive learning. Contrastive learning is at the core of recent self-supervised representation learning methods (Chen et al., 2020; He et al., 2019; Oord et al., 2018; Wu et al., 2018). In self-supervised contrastive learning, two randomly augmented images are generated for each input image. Then a contrastive loss is applied to pull embeddings from the same source image closer, while pushing embeddings from different source images apart.
Recently, prototypical contrastive learning (PCL) (Li et al., 2020b) has been proposed, which uses cluster centroids as prototypes, and trains the network by pulling an image embedding closer to its assigned prototypes. Different from previous methods, our method performs contrastive learning in the principal subspace of the high-dimensional feature space, by training a linear autoencoder. Furthermore, our supervised contrastive loss improves PCL (Li et al., 2020b) with Mixup (Zhang et al., 2018). Different from the original Mixup, where learning happens at the classification layer, our learning takes place in the low-dimensional subspace.

3. METHOD

Given a noisy training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is an image and $y_i \in \{1, ..., C\}$ is its class label, we aim to train a network that is robust to the noise in the training data (i.e. label noise, OOD input, input corruption) and achieves high accuracy on a clean test set. The proposed network consists of three components: (1) a deep encoder (a convolutional neural network) that encodes an image $x_i$ to a high-dimensional feature $v_i$; (2) a classifier (a fully-connected layer followed by softmax) that receives $v_i$ as input and outputs class predictions; (3) a linear autoencoder that projects $v_i$ into a low-dimensional embedding $z_i \in \mathbb{R}^d$. We show an illustration of our method in Figure 2, and pseudo-code in Appendix B. Next, we delineate its details.
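As a concrete illustration, the three components can be sketched as follows. This is a minimal numpy sketch with hypothetical dimensions and random weights standing in for trained ones; in particular, the deep encoder here is just a fixed linear map rather than a real CNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: feature dim D, bottleneck dim d, number of classes C.
D, d, C = 512, 50, 10

# (1) Deep encoder: the paper uses a CNN; a fixed linear map stands in here.
W_enc = rng.normal(size=(D, 3 * 32 * 32)) / np.sqrt(3 * 32 * 32)
def encode(x_flat):
    return W_enc @ x_flat               # high-dimensional feature v_i

# (2) Classifier: fully-connected layer followed by softmax.
W_cls = rng.normal(size=(C, D)) * 0.01
def classify(v):
    logits = W_cls @ v
    e = np.exp(logits - logits.max())   # stabilized softmax
    return e / e.sum()

# (3) Linear autoencoder: z_i = W_e v_i, reconstruction W_d z_i.
W_e = rng.normal(size=(d, D)) * 0.01
W_d = rng.normal(size=(D, d)) * 0.01
def embed(v):
    z = W_e @ v
    return z / np.linalg.norm(z)        # normalized embedding z_hat

x = rng.normal(size=3 * 32 * 32)        # a flattened 32x32 RGB "image"
v = encode(x)                           # feature for the classifier
p = classify(v)                         # class probabilities
z_hat = embed(v)                        # low-dimensional unit embedding
```

The contrastive losses below operate on `z_hat`, while the classification loss operates on `p`.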

3.1. CONTRASTIVE LEARNING IN ROBUST LOW-DIMENSIONAL SUBSPACE

Let $z_i = W_e v_i$ be the linear projection from high-dimensional features to low-dimensional embeddings, and $\hat{z}_i = z_i / \|z_i\|_2$ be the normalized embedding. We aim to learn robust embeddings with two contrastive losses: an unsupervised consistency loss and a weakly-supervised mixup prototypical loss.

Unsupervised consistency contrastive loss. Following the NT-Xent (Chen et al., 2020) loss for self-supervised representation learning, our consistency contrastive loss enforces images with semantic-preserving perturbations to have similar embeddings. Specifically, given a minibatch of $b$ images, we apply weak augmentation and strong augmentation to each image, and obtain $2b$ inputs $\{x_i\}_{i=1}^{2b}$. Weak augmentation is a standard flip-and-shift augmentation strategy, while strong augmentation consists of color and brightness changes, with details given in Section 4.1. We project the inputs into the low-dimensional space to obtain their normalized embeddings $\{\hat{z}_i\}_{i=1}^{2b}$. Let $i \in \{1, ..., b\}$ be the index of a weakly-augmented input, and $j(i)$ be the index of the strongly-augmented input from the same source image. The consistency contrastive loss is defined as:

$$\mathcal{L}_{cc} = -\sum_{i=1}^{b} \log \frac{\exp(\hat{z}_i \cdot \hat{z}_{j(i)} / \tau)}{\sum_{k=1, k \neq i}^{2b} \exp(\hat{z}_i \cdot \hat{z}_k / \tau)},$$

where $\tau$ is a scalar temperature parameter. The consistency contrastive loss maximizes the inner product between the pair of positive embeddings $\hat{z}_i$ and $\hat{z}_{j(i)}$, while minimizing the inner product between $2(b-1)$ pairs of negative embeddings. By mapping different views (augmentations) of the same image to neighboring embeddings, the consistency contrastive loss encourages the network to learn discriminative representations that are robust to low-level image corruption.

Weakly-supervised mixup prototypical contrastive loss. Our second contrastive loss injects structural knowledge of classes into the embedding space.
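The consistency loss above can be sketched in a few lines of numpy. This is a simplified illustration, not the training code; it assumes the $b$ weak and $b$ strong embeddings arrive as two already-normalized arrays.

```python
import numpy as np

def consistency_contrastive_loss(z_weak, z_strong, tau=0.3):
    """Consistency contrastive loss over b positive pairs.

    z_weak, z_strong: (b, d) arrays of L2-normalized embeddings, where
    row i of z_strong is the strong view of the same image as row i of z_weak.
    """
    b = z_weak.shape[0]
    z = np.concatenate([z_weak, z_strong], axis=0)   # (2b, d) all embeddings
    sim = z @ z.T / tau                              # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                   # exclude the k == i term
    # positive for weak index i is its strong view at index i + b
    pos = sim[np.arange(b), np.arange(b) + b]
    log_denom = np.log(np.exp(sim[:b]).sum(axis=1))  # sum over all k != i
    return float(-(pos - log_denom).sum())
```

Pairs from the same source image lower the loss as their similarity grows relative to the negatives in the same minibatch.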
Let $I_c$ denote the indices for the subset of images in $D$ labeled with class $c$. We calculate the class prototype as the normalized mean embedding:

$$\bar{z}_c = \frac{1}{|I_c|} \sum_{i \in I_c} \hat{z}_i, \quad \hat{z}_c = \frac{\bar{z}_c}{\|\bar{z}_c\|_2},$$

where $\hat{z}_i$ is the embedding of a center-cropped image, and the class prototypes are calculated at the beginning of each epoch. The prototypical contrastive loss enforces an image embedding $\hat{z}_i$ to be more similar to its corresponding class prototype $\hat{z}_{y_i}$, in contrast to other class prototypes:

$$\mathcal{L}_{pc}(\hat{z}_i, y_i) = -\log \frac{\exp(\hat{z}_i \cdot \hat{z}_{y_i} / \tau)}{\sum_{c=1}^{C} \exp(\hat{z}_i \cdot \hat{z}_c / \tau)}.$$

Since the label $y_i$ is noisy, we would like to regularize the encoder from memorizing training labels. Mixup (Zhang et al., 2018) has been shown to be an effective method against label noise (Arazo et al., 2019; Li et al., 2020a). Inspired by it, we create virtual training samples by linearly interpolating a sample (indexed by $i$) with another sample (indexed by $m(i)$) randomly chosen from the same minibatch:

$$x_i^m = \lambda x_i + (1 - \lambda) x_{m(i)}, \quad \text{where } \lambda \sim \mathrm{Beta}(\alpha, \alpha).$$

Let $\hat{z}_i^m$ be the normalized embedding for $x_i^m$. The mixup version of the prototypical contrastive loss is defined as a weighted combination of the two $\mathcal{L}_{pc}$ w.r.t. classes $y_i$ and $y_{m(i)}$. It enforces the embedding for the interpolated input to have the same linear relationship w.r.t. the class prototypes:

$$\mathcal{L}_{pc}^{mix} = \sum_{i=1}^{2b} \lambda \mathcal{L}_{pc}(\hat{z}_i^m, y_i) + (1 - \lambda) \mathcal{L}_{pc}(\hat{z}_i^m, y_{m(i)}).$$

Reconstruction loss. We also train a linear decoder $W_d$ to reconstruct the high-dimensional feature $v_i$ based on $z_i$. The reconstruction loss is defined as:

$$\mathcal{L}_{recon} = \sum_{i=1}^{2b} \|v_i - W_d z_i\|_2^2.$$

There are several benefits to training the autoencoder. First, with an optimal linear autoencoder, $W_e$ projects $v_i$ into its low-dimensional principal subspace, and can be understood as applying PCA (Baldi & Hornik, 1989). Thus the low-dimensional representation $z_i$ is intrinsically robust to input noise. Second, minimizing the reconstruction error maximizes a lower bound of the mutual information between $v_i$ and $z_i$ (Vincent et al., 2010). Therefore, knowledge learned from the proposed contrastive losses can be maximally preserved in the high-dimensional representation, which helps regularize the classifier.

Classification loss. Given the softmax output from the classifier, $p(y; x_i)$, we define the classification loss as the cross-entropy loss. Note that it is only applied to the weakly-augmented inputs:

$$\mathcal{L}_{ce} = -\sum_{i=1}^{b} \log p(y_i; x_i).$$

The overall training objective is to minimize a weighted sum of all losses:

$$\mathcal{L} = \mathcal{L}_{ce} + \omega_{cc} \mathcal{L}_{cc} + \omega_{pc} \mathcal{L}_{pc}^{mix} + \omega_{recon} \mathcal{L}_{recon}.$$

For all experiments, we fix $\omega_{cc} = 1$, $\omega_{recon} = 1$, and change $\omega_{pc}$ only across datasets.
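To make the mixup prototypical loss concrete, here is a small numpy sketch. It is illustrative only: `prototypes` stands for the $(C, d)$ array of normalized class prototypes, and the function names are our own.

```python
import numpy as np

def proto_loss(z_hat, prototypes, y, tau=0.3):
    """L_pc: pull embedding z_hat toward the prototype of class y,
    in contrast to the other prototypes.

    z_hat: (d,) normalized embedding; prototypes: (C, d) normalized rows.
    """
    logits = prototypes @ z_hat / tau
    logits = logits - logits.max()                   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return float(-log_prob[y])

def mixup_proto_loss(z_mix_hat, prototypes, y_i, y_m, lam, tau=0.3):
    """L_pc^mix for one interpolated input: lambda-weighted combination of
    L_pc w.r.t. the two source labels y_i and y_m(i)."""
    return (lam * proto_loss(z_mix_hat, prototypes, y_i, tau)
            + (1.0 - lam) * proto_loss(z_mix_hat, prototypes, y_m, tau))

# Mixing coefficient as in the paper: lambda ~ Beta(alpha, alpha).
lam = np.random.default_rng(0).beta(8, 8)
```

With `lam = 1` the loss reduces to the plain prototypical loss for class `y_i`, matching the formula above.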

3.2. NOISE CLEANING WITH SMOOTH NEIGHBORS

After warming up the model by training with the noisy labels $\{y_i\}_{i=1}^{n}$ for $t_0$ epochs, we aim to clean the noise by generating a soft pseudo-label $q_i$ for each training sample. Different from previous methods that perform label correction purely using the model's softmax prediction, our method exploits the structure of the low-dimensional subspace by aggregating information from the top-$k$ neighboring samples, which helps alleviate the confirmation bias problem. At the $t$-th epoch, for each sample $x_i$, let $p_i^t$ be the classifier's softmax prediction, and let $q_i^{t-1}$ be its soft label from the previous epoch. We calculate the soft label for the current epoch as:

$$q_i^t = \frac{1}{2} p_i^t + \frac{1}{2} \sum_{j=1}^{k} w_{ij}^t q_j^{t-1},$$

where $w_{ij}^t$ represents the normalized affinity between a sample and its neighbor, defined as $w_{ij}^t = \frac{\exp(\hat{z}_i^t \cdot \hat{z}_j^t / \tau)}{\sum_{j'=1}^{k} \exp(\hat{z}_i^t \cdot \hat{z}_{j'}^t / \tau)}$. We set $k = 200$ in all experiments. The soft label defined above is the minimizer of the following quadratic loss function:

$$J(q_i^t) = \sum_{j=1}^{k} w_{ij}^t \|q_i^t - q_j^{t-1}\|_2^2 + \|q_i^t - p_i^t\|_2^2.$$

The first term is a smoothness constraint which encourages the soft label to take a similar value as its neighbors' labels, whereas the second term attempts to maintain the model's class prediction. We construct a weakly-supervised subset which contains (1) clean samples whose soft label score for the original class $y_i$ is higher than a threshold $\eta_0$, and (2) pseudo-labeled samples whose maximum soft label score exceeds a threshold $\eta_1$. For pseudo-labeled samples, we convert their soft labels into hard labels by taking the class with the maximum score.
$$D_{sup}^t = \{(x_i, y_i) \mid q_i^t(y_i) > \eta_0\} \cup \{(x_i, \hat{y}_i^t) \mid \hat{y}_i^t = \arg\max_c q_i^t(c), \ \max_c q_i^t(c) > \eta_1\}.$$

Given the weakly-supervised subset, we modify the classification loss $\mathcal{L}_{ce}$, the mixup prototypical contrastive loss $\mathcal{L}_{pc}^{mix}$, and the calculation of the prototypes $\hat{z}_c$, such that they only use samples from $D_{sup}^t$. To demonstrate the learned curriculum, we analyse the noise cleaning statistics for training our model on the CIFAR-10 and CIFAR-100 datasets with 50% label noise (experimental details are explained in the next section). In Figure 3(a), we show the accuracy of the soft pseudo-labels w.r.t. the clean training labels (used for analysis purposes only). Our method significantly reduces the ratio of label noise, from 50% to 5% (for CIFAR-10) and 17% (for CIFAR-100). Figure 3(b) shows the size of $D_{sup}^t$ as a percentage of the total number of training samples, and Figure 3(c) shows the effective label noise ratio within the weakly-supervised subset $D_{sup}^t$. Our method maintains a low noise ratio in the weakly-supervised subset, while gradually increasing its size to utilize more samples for the weakly-supervised losses.
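The noise cleaning step can be sketched as follows. This is a simplified numpy sketch with a tiny $k$ for readability; in the paper $k = 200$ and the neighbor search uses faiss.

```python
import numpy as np

def update_soft_labels(p, q_prev, z_hat, k=2, tau=0.3):
    """q_i^t = 1/2 p_i^t + 1/2 sum_j w_ij^t q_j^{t-1} over top-k neighbors.

    p: (n, C) softmax predictions; q_prev: (n, C) previous soft labels;
    z_hat: (n, d) normalized embeddings used to find neighbors.
    """
    n = z_hat.shape[0]
    sim = z_hat @ z_hat.T
    np.fill_diagonal(sim, -np.inf)        # a sample is not its own neighbor
    q_new = np.empty_like(q_prev)
    for i in range(n):
        nbrs = np.argsort(sim[i])[-k:]    # indices of the top-k neighbors
        w = np.exp(sim[i, nbrs] / tau)
        w = w / w.sum()                   # normalized affinities w_ij^t
        q_new[i] = 0.5 * p[i] + 0.5 * (w[:, None] * q_prev[nbrs]).sum(axis=0)
    return q_new

def select_weakly_supervised(q, y, eta0=0.1, eta1=0.9):
    """Split into clean samples (original label kept) and confidently
    pseudo-labeled samples, mirroring the D_sup^t construction."""
    clean = q[np.arange(len(y)), y] > eta0
    pseudo = (q.max(axis=1) > eta1) & ~clean
    return clean, pseudo
```

Because the affinities sum to one, each updated soft label remains a valid probability distribution.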

4. EXPERIMENT

In this section, we validate the proposed method on multiple benchmarks with controlled noise and real-world noise. Our method achieves state-of-the-art performance across all benchmarks. For a fair comparison, we compare with DivideMix (Li et al., 2020a) without ensembling. In Appendix A, we report the results of our method with co-training and ensembling, which further improve performance.

4.1. EXPERIMENTS ON CONTROLLED NOISY LABELS

Dataset. Following Tanaka et al. (2018); Li et al. (2020a), we corrupt the training data of CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009) with two types of label noise: symmetric and asymmetric. Symmetric noise is injected by randomly selecting a percentage of samples and changing their labels to random labels. Asymmetric noise is class-dependent, where labels are only changed to similar classes (e.g. dog↔cat, deer→horse). We experiment with multiple noise ratios: sym. 20%, sym. 50%, and asym. 40% (see results for sym. 80% and 90% in Appendix A). Note that the asymmetric noise ratio cannot exceed 50%, because certain classes would become theoretically indistinguishable.

Implementation details. Same as previous works (Arazo et al., 2019; Li et al., 2020a), we use PreAct ResNet-18 (He et al., 2016) as our encoder model. We set the dimensionality of the bottleneck layer as $d = 50$. Our model is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 200 epochs. We set the initial learning rate as 0.02 and use a cosine decay schedule. We apply standard crop and horizontal flip as the weak augmentation. For strong augmentation, we use AugMix (Hendrycks et al., 2020), though other methods (e.g. SimAug (Chen et al., 2020)) work equally well. For all CIFAR experiments, we fix the hyper-parameters as $\omega_{cc} = 1$, $\omega_{pc} = 5$, $\omega_{recon} = 1$, $\tau = 0.3$, $\alpha = 8$, $\eta_1 = 0.9$. For CIFAR-10, we activate noise cleaning at epoch $t_0 = 5$, and set $\eta_0 = 0.1$ (sym.) or 0.4 (asym.). For CIFAR-100, we activate noise cleaning at epoch $t_0 = 15$, and set $\eta_0 = 0.02$. We use faiss-gpu (Johnson et al., 2017) for efficient k-nearest-neighbor search in the low-dimensional subspace, which finishes within 1 second.

Results. Table 1 shows the comparison with existing methods. Our method outperforms previous methods across all label noise settings. On the more challenging CIFAR-100, we achieve a 3-4% accuracy improvement compared to the second-best method, DivideMix. Moreover, our method is more computationally efficient than DivideMix, which needs co-training for noise filtering. In order to demonstrate the advantage of the proposed low-dimensional embeddings, we perform k-nearest neighbor (knn) classification ($k = 200$) by projecting test images into normalized embeddings. Compared to the trained classifier, knn achieves higher accuracy, which verifies the robustness of the learned low-dimensional representations.

Results with OOD and input noise. Table 2 shows the results, where our method consistently outperforms existing methods by a substantial margin. We observe that OOD images from a similar domain (CIFAR-100) are more harmful than OOD images from a more different domain (SVHN). This is because noisy images that are closer to the test data distribution are more likely to distort the decision boundary in a way that negatively affects test performance. Nevertheless, performing knn classification using the learned embeddings demonstrates high robustness to input noise.
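The knn evaluation described above can be sketched as follows. This is a minimal illustration; the actual setup uses $k = 200$ over all training embeddings, with faiss for the neighbor search.

```python
import numpy as np

def knn_classify(z_test, z_train, y_train, num_classes, k=2):
    """Predict by majority vote among the k training embeddings most similar
    (inner product on L2-normalized vectors) to each test embedding."""
    sim = z_test @ z_train.T                  # cosine similarity matrix
    preds = []
    for row in sim:
        nbrs = np.argsort(row)[-k:]           # k nearest training samples
        votes = np.bincount(y_train[nbrs], minlength=num_classes)
        preds.append(int(votes.argmax()))
    return np.array(preds)
```

Because the vote happens directly in the learned subspace, this classifier sidesteps the softmax layer that may have partially memorized noisy labels.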

Test dataset                                  WebVision         ILSVRC12
Accuracy (%)                                  top-1   top-5     top-1   top-5
Forward (Patrini et al., 2017)                61.1    82.7      57.4    82.4
Decoupling (Malach & Shalev-Shwartz, 2017)    62.5    84.7      58.3    82.3
D2L (Ma et al., 2018)                         62.7    84.0      57.8    81.4
MentorNet (Jiang et al., 2018)                63.0    81.4      57.8    79.9
Co-teaching (Han et al., 2018)                63.6    85.2      61.5    84.7
INCV (Chen et al., 2019)                      65.2    85.3      61.0    85.0
DivideMix (Li et al., 2020a)                  75.9    90.

4.3. EXPERIMENTS ON REAL-WORLD NOISY DATA

Dataset and implementation details. We verify our method on two real-world noisy datasets: WebVision (Li et al., 2017) and Clothing1M (Xiao et al., 2015). WebVision contains images crawled from the web using the same concepts as ImageNet ILSVRC12 (Deng et al., 2009). Following previous works (Chen et al., 2019; Li et al., 2020a), we perform experiments on the first 50 classes of the Google image subset. Clothing1M consists of images collected from online shopping websites, where labels were generated from surrounding texts. Note that we do not use the additional clean set for training. For both experiments, we use the same model architecture as previous methods. More implementation details are given in the appendix.

Results. We report the results for WebVision in Table 3 and Clothing1M in Table 4, where we achieve state-of-the-art performance on both datasets. Our method achieves competitive performance on WebVision even without performing noise cleaning, which demonstrates the robustness of the learned representation. Appendix D shows examples of noisy images that are cleaned by our method.

4.4. ABLATION STUDY

Effect of the proposed components. In order to study the effect of the proposed components, we remove each of them and report the accuracy of the classifier (knn) across four benchmarks. As shown in Table 5, the mixup prototypical contrastive loss ($\mathcal{L}_{pc}^{mix}$) is most crucial to the model's performance. The consistency contrastive loss ($\mathcal{L}_{cc}$) has a stronger effect with corrupted input or a larger number of classes. We also experiment with removing mixup and using the standard prototypical contrastive loss, and with using standard data augmentation (crop and horizontal flip) instead of AugMix. The proposed method still achieves state-of-the-art results with standard data augmentation.

5. CONCLUSION

This paper proposes noise-robust contrastive learning, a new method to combat noise in training data by learning robust representations. We demonstrate our model's state-of-the-art performance with extensive experiments on multiple noisy datasets. For future work, we are interested in adapting our method to other domains such as NLP or speech. We would also like to explore the potential of our method for learning transferable representations that could be useful for downstream tasks.



Figure 2: Our proposed framework for noise-robust contrastive learning. We project images into a low-dimensional subspace, and regularize the geometric structure of the subspace with (1) $\mathcal{L}_{cc}$: a consistency contrastive loss which enforces images with perturbations to have similar embeddings; (2) $\mathcal{L}_{pc}^{mix}$: a prototypical contrastive loss augmented with mixup, which encourages the embedding for a linearly-interpolated input to have the same linear relationship w.r.t. the class prototypes. The low-dimensional embeddings are also trained to reconstruct the high-dimensional features, which preserves the learned information and regularizes the classifier.

Figure 3: Curriculum learned by the proposed label correction method for training on CIFAR datasets with 50% sym. noise. (a) Accuracy of pseudo-labels w.r.t. clean training labels. (b) Number of samples in the weakly-supervised subset $D_{sup}^t$. (c) Label noise ratio in the weakly-supervised subset.

Figure 4: Examples of input noise injected into CIFAR-10.

In Figure 5, we show the t-SNE (Maaten & Hinton, 2008) visualization of the low-dimensional embeddings for all training samples. As training progresses, our model learns to separate OOD samples (represented as gray points) from in-distribution samples, and to cluster samples of the same class together despite their noisy labels.

Figure 5: t-SNE visualization of low-dimensional embeddings for CIFAR-10 images (color represents the true class) + OOD images (gray points) from CIFAR-100 or SVHN. The model is trained on noisy CIFAR-10 (50k images with 50% label noise) and 20k OOD images with random labels. Our method can effectively learn to (1) cluster CIFAR-10 images according to their true class, despite their noisy labels; (2) separate OOD samples from in-distribution samples, such that their harm is reduced.

Learning curriculum. Our iterative noise cleaning method learns an effective training curriculum, which gradually increases the size of $D_{sup}^t$ as the pseudo-labels become more accurate.

Table 1: Comparison with state-of-the-art methods on CIFAR datasets with label noise. Numbers indicate average test accuracy (%) over the last 10 epochs. We report results over 3 independent runs with randomly-generated label noise. Results for previous methods are copied from Arazo et al. (2019); Li et al. (2020a). We re-run DivideMix (without ensembling) using the publicly available code on the same noisy data as ours.

Table 2: Comparison with state-of-the-art methods on datasets with label noise and input noise. Numbers indicate average test accuracy (%) over the last 10 epochs. We report results over 3 independent runs with randomly-generated noise. We re-run previous methods using publicly available code with the same noisy data and model architecture as ours.

Table 3: Comparison with state-of-the-art methods trained on WebVision (mini).

Table 4: Comparison with state-of-the-art methods on the Clothing1M dataset.

Table 5: Effect of the proposed components. We show the accuracy of the classifier (knn) on four benchmarks with different noise. Note that DivideMix (Li et al., 2020a) also performs mixup.

Effect of bottleneck dimension. We vary the dimensionality of the bottleneck layer, $d$, and examine the performance change in Table 6. Our model is in general not very sensitive to the change of $d$.

Table 6: Classifier's test accuracy (%) with different bottleneck dimensions $d \in \{25, 50, 100, 200\}$.

