MOPRO: WEBLY SUPERVISED LEARNING WITH MOMENTUM PROTOTYPES

Abstract

We propose a webly-supervised representation learning method that does not suffer from the annotation unscalability of supervised learning, nor the computation unscalability of self-supervised learning. Most existing works on webly-supervised representation learning adopt a vanilla supervised learning method without accounting for the prevalent noise in the training data, whereas most prior methods in learning with label noise are less effective for real-world large-scale noisy data. We propose momentum prototypes (MoPro), a simple contrastive learning method that achieves online label noise correction, out-of-distribution sample removal, and representation learning. MoPro achieves state-of-the-art performance on WebVision, a weakly-labeled noisy dataset. MoPro also shows superior performance when the pretrained model is transferred to downstream image classification and detection tasks. It outperforms the ImageNet supervised pretrained model by +10.5 on 1-shot classification on VOC, and outperforms the best self-supervised pretrained model by +17.3 when finetuned on 1% of ImageNet labeled samples. Furthermore, MoPro is more robust to distribution shifts. Code and pretrained models are available at https://github.com/salesforce/MoPro.

1. INTRODUCTION

Large-scale datasets with human-annotated labels have revolutionized computer vision. Supervised pretraining on ImageNet (Deng et al., 2009) has been the de facto formula of success for almost all state-of-the-art visual perception models. However, it is extremely labor intensive to manually annotate millions of images, which makes it a non-scalable solution.

One alternative to reduce annotation cost is self-supervised representation learning, which leverages unlabeled data. However, self-supervised learning methods (Goyal et al., 2019; He et al., 2019; Chen et al., 2020a; Li et al., 2020b) have yet to consistently show superior performance compared to supervised learning, especially when transferred to downstream tasks with limited labels.

With the help of commercial search engines, photo-sharing websites, and social media platforms, there is a near-infinite amount of weakly-labeled images available on the web. Several works have exploited this scalable source of web images and demonstrated promising results with webly-supervised representation learning (Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017; Kolesnikov et al., 2020). However, there exist two competing claims on whether weakly-labeled noisy datasets lead to worse generalization performance. One claim argues that the effect of noise can be overpowered by the scale of data, and simply applies standard supervised learning methods to web datasets (Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017; Kolesnikov et al., 2020). The other claim argues that deep models can easily memorize noisy labels, resulting in worse generalization (Zhang et al., 2017; Ma et al., 2018). In this paper, we show that both claims are partially true. While increasing the size of data does improve the model's robustness to noise, our method can substantially boost representation learning performance by addressing noise.

There exists a large body of literature on learning with label noise (Jiang et al., 2018; Han et al., 2018; Guo et al., 2018; Tanaka et al., 2018; Arazo et al., 2019; Li et al., 2020a). However, existing methods have several limitations that make them less effective for webly-supervised representation learning. First, most methods do not consider out-of-distribution (OOD) samples, which is a major source of noise in real-world web data.
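To make the high-level idea behind momentum prototypes concrete, the following is a minimal sketch of how per-class prototypes maintained by an exponential moving average could be used to correct noisy labels and filter possible out-of-distribution samples. The variable names, momentum coefficient, temperature, and threshold below are illustrative assumptions and not the exact procedure of MoPro, which is described in later sections.

```python
# Illustrative sketch only: one EMA prototype per class; prototype similarity
# drives label correction and OOD filtering. Hyperparameters are assumed.
import torch
import torch.nn.functional as F

num_classes, feat_dim = 10, 128
momentum = 0.999        # assumed EMA coefficient
temperature = 0.1       # assumed softmax temperature
ood_threshold = 0.3     # assumed confidence threshold for OOD removal

prototypes = F.normalize(torch.randn(num_classes, feat_dim), dim=1)

def update_and_correct(features, labels):
    """features: (B, feat_dim) embeddings; labels: (B,) noisy web labels."""
    global prototypes
    features = F.normalize(features, dim=1)
    # EMA update of each class prototype with the embeddings assigned to it
    for f, y in zip(features, labels):
        prototypes[y] = momentum * prototypes[y] + (1 - momentum) * f
    prototypes = F.normalize(prototypes, dim=1)
    # soft assignment of each sample to prototypes via cosine similarity
    scores = torch.softmax(features @ prototypes.t() / temperature, dim=1)
    conf, pseudo = scores.max(dim=1)
    # correct labels for confident samples; flag low-confidence ones as possible OOD
    corrected = torch.where(conf > ood_threshold, pseudo, labels)
    keep = conf > ood_threshold
    return corrected, keep

# toy usage
feats = torch.randn(32, feat_dim)
noisy_labels = torch.randint(0, num_classes, (32,))
corrected_labels, keep_mask = update_and_correct(feats, noisy_labels)
```

The sketch only illustrates the mechanism of noise correction and sample filtering; it omits the contrastive losses and the classifier that MoPro trains jointly with the prototypes.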

