MOPRO: WEBLY SUPERVISED LEARNING WITH MOMENTUM PROTOTYPES

Abstract

We propose a webly-supervised representation learning method that suffers neither from the annotation unscalability of supervised learning nor from the computation unscalability of self-supervised learning. Most existing works on webly-supervised representation learning adopt vanilla supervised learning without accounting for the prevalent noise in the training data, whereas most prior methods for learning with label noise are less effective on real-world large-scale noisy data. We propose momentum prototypes (MoPro), a simple contrastive learning method that achieves online label noise correction, out-of-distribution sample removal, and representation learning. MoPro achieves state-of-the-art performance on WebVision, a weakly-labeled noisy dataset. MoPro also shows superior performance when the pretrained model is transferred to downstream image classification and detection tasks. It outperforms the ImageNet supervised pretrained model by +10.5 on 1-shot classification on VOC, and outperforms the best self-supervised pretrained model by +17.3 when finetuned on 1% of ImageNet labeled samples. Furthermore, MoPro is more robust to distribution shifts. Code and pretrained models are available at https://github.com/salesforce/MoPro.

1. INTRODUCTION

Large-scale datasets with human-annotated labels have revolutionized computer vision. Supervised pretraining on ImageNet (Deng et al., 2009) has been the de facto formula of success for almost all state-of-the-art visual perception models. However, it is extremely labor-intensive to manually annotate millions of images, which makes supervised pretraining a non-scalable solution. One alternative that reduces annotation cost is self-supervised representation learning, which leverages unlabeled data. However, self-supervised learning methods (Goyal et al., 2019; He et al., 2019; Chen et al., 2020a; Li et al., 2020b) have yet to consistently show superior performance compared to supervised learning, especially when transferred to downstream tasks with limited labels. With the help of commercial search engines, photo-sharing websites, and social media platforms, there is a near-infinite amount of weakly-labeled images available on the web. Several works have exploited this scalable source of web images and demonstrated promising results with webly-supervised representation learning (Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017; Kolesnikov et al., 2020). However, there exist two competing claims on whether weakly-labeled noisy datasets lead to worse generalization performance. One claim argues that the effect of noise can be overpowered by the scale of data, and simply applies standard supervised learning to web datasets (Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017; Kolesnikov et al., 2020). The other claim argues that deep models can easily memorize noisy labels, resulting in worse generalization (Zhang et al., 2017; Ma et al., 2018). In this paper, we show that both claims are partially true. While increasing the size of data does improve the model's robustness to noise, our method can substantially boost representation learning performance by directly addressing the noise.
There exists a large body of literature on learning with label noise (Jiang et al., 2018; Han et al., 2018; Guo et al., 2018; Tanaka et al., 2018; Arazo et al., 2019; Li et al., 2020a). However, existing methods have several limitations that make them less effective for webly-supervised representation learning. First, most methods do not consider out-of-distribution (OOD) samples, which are a major source of noise in real-world web datasets. Second, many methods perform computation-heavy procedures for noise cleaning (Jiang et al., 2018; Li et al., 2019; 2020a), or require access to a set of samples with clean labels (Vahdat, 2017; Veit et al., 2017; Lee et al., 2018), which limits their scalability in practice. We propose a new method for efficient representation learning from weakly-labeled web images. Our method is inspired by recent developments in contrastive learning for self-supervised learning (He et al., 2019; Chen et al., 2020a; Li et al., 2020b). We introduce Momentum Prototypes (MoPro), a simple component that is effective for label noise correction, OOD sample removal, and representation learning. A visual explanation of our method is shown in Figure 1. We use a deep network to project images into normalized low-dimensional embeddings, and calculate the prototype for a class as the moving-average embedding of clean samples in that class. We train the network such that embeddings are pulled closer to their corresponding prototypes and pushed away from other prototypes. A sample with a corrupted label is either re-assigned to another class or flagged as OOD based on its distance to the momentum prototypes. We experimentally show that:

• MoPro achieves state-of-the-art performance on the upstream weakly-supervised learning task.

• MoPro substantially improves representation learning performance when the pretrained model is transferred to downstream image classification and object detection tasks.
• For the first time, we show that weakly-supervised representation learning achieves performance similar to supervised representation learning under the same data and computation budget. With a larger web dataset, MoPro outperforms ImageNet supervised learning by a large margin.

• MoPro learns a more robust and calibrated model that generalizes better under distribution shifts.
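To make the mechanism concrete, the prototype maintenance and label-correction rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the momentum coefficient, similarity threshold, and class names are illustrative assumptions, and the sketch omits the contrastive training loss, using only the distance-to-prototype rule for correction and OOD removal.

```python
import numpy as np


def normalize(x):
    """Project vectors onto the unit sphere (L2 normalization)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class MomentumPrototypes:
    """Sketch of momentum prototypes: one moving-average embedding per class.

    `momentum` and `threshold` are illustrative values, not the paper's
    hyperparameters.
    """

    def __init__(self, num_classes, dim, momentum=0.999, threshold=0.6):
        rng = np.random.default_rng(0)
        # Initialize prototypes as random unit vectors.
        self.prototypes = normalize(rng.normal(size=(num_classes, dim)))
        self.m = momentum
        self.T = threshold

    def update(self, embedding, label):
        # Moving-average update with a clean sample's normalized embedding,
        # followed by re-normalization back onto the unit sphere.
        p = self.m * self.prototypes[label] + (1 - self.m) * embedding
        self.prototypes[label] = normalize(p)

    def correct(self, embedding, noisy_label):
        # Cosine similarity of the sample to every class prototype
        # (dot product, since everything is unit-normalized).
        sims = self.prototypes @ embedding
        best = int(np.argmax(sims))
        if sims[best] < self.T:
            return None  # far from all prototypes: treat as an OOD sample
        return best      # keep the label if best == noisy_label, else correct it


# Toy usage: 3 classes, 8-dimensional embeddings.
proto = MomentumPrototypes(num_classes=3, dim=8)
z = normalize(np.ones(8))
proto.update(z, label=1)
result = proto.correct(z, noisy_label=1)
```

In MoPro itself, the correction decision also incorporates the classifier's predictions alongside prototype similarity; the sketch keeps only the prototype-distance part to show how a single momentum-averaged vector per class supports both label correction and OOD removal.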

2.1. WEBLY-SUPERVISED REPRESENTATION LEARNING

A number of prior works exploit large web datasets for visual representation learning (Divvala et al., 2014; Chen & Gupta, 2015; Joulin et al., 2016; Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017; Kolesnikov et al., 2020). These datasets contain a considerable amount of noise: approximately 20% of the labels in the JFT-300M dataset (Sun et al., 2017) are noisy, and 34% of the images in the WebVision dataset (Li et al., 2017) are considered outliers. Surprisingly, most prior works have chosen to ignore the noise and apply vanilla supervised methods, on the claim that the scale of data can overpower the noise (Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017). However, we show that supervised methods cannot fully harvest the power of large-scale weakly-labeled datasets.



Figure 1: Illustration of the normalized embedding space learned with MoPro. Samples from the same class gather around their class prototype, whereas OOD samples are separated from in-distribution samples. Label correction and OOD removal are achieved based on a sample's distance to the prototypes.

