CONTRASTIVE LEARNING WITH STRONGER AUGMENTATIONS

Abstract

Representation learning has been greatly improved by recent advances in contrastive learning, whose performance is approaching that of its supervised counterparts. These methods have benefited greatly from data augmentations that are carefully designed to preserve instance identity, so that images transformed from the same instance can still be retrieved. Although stronger augmentations could expose novel patterns and improve the generalizability of the learned representations, directly using stronger augmentations in instance-discrimination-based contrastive learning may even deteriorate performance, because the induced distortions can drastically change the image structures, so the transformed images can no longer be regarded as the same as the originals. Additional effort is therefore needed to explore the role of stronger augmentations in pushing the performance of unsupervised learning further toward the fully supervised upper bound. Instead of applying stronger augmentations directly to minimize the contrastive loss, we propose to minimize the distribution divergence between the weakly and strongly augmented images over a representation bank, using it to supervise the retrieval of strongly augmented queries from a pool of candidates. This avoids the overoptimistic assumption that could overfit strongly augmented queries containing distorted visual structures to positive targets in the representation bank, while still distinguishing them from negative samples by leveraging the distributions of their weakly augmented counterparts. The proposed method achieves 76.2% top-1 accuracy on ImageNet with a standard ResNet-50 architecture and a fine-tuned single-layer classifier, almost matching the 76.5% top-1 accuracy of a fully supervised ResNet-50.
Moreover, it outperforms previous self-supervised and supervised methods on both transfer learning and object detection tasks.

1. INTRODUCTION

Deep neural networks have achieved sweeping successes in learning from large-scale labeled datasets such as ImageNet (Deng et al. (2009)). However, these successes hinge on the availability of a large amount of labeled examples that are expensive to collect. To address this challenge, unsupervised and self-supervised visual representation learning have been studied to learn feature representations without labels. Among them is contrastive learning (Hadsell et al. (2006)).

It is worth noting that these methods usually rely on image augmentations that are carefully designed to maintain instance identity, so that the augmentation of an instance can be accurately retrieved from a dictionary of instances. On the other hand, we believe stronger augmentations could expose novel patterns that further improve the generalizability of learned representations and eventually close the gap with fully supervised models. However, directly using stronger augmentations in contrastive learning can deteriorate performance, because the induced distortions can drastically change the image structures, so the transformed images cannot keep the identity of the original instances. Additional effort is thus needed to explore the role of stronger augmentations in further boosting the performance of self-supervised learning.

We therefore propose the CLSA (Contrastive Learning with Stronger Augmentations) framework to address this challenge. Instead of applying strongly augmented views to the contrastive loss, we propose to minimize the distribution divergence between the weakly and strongly augmented images over a representation bank to supervise the retrieval of strongly augmented queries.
This avoids the overoptimistic assumption that could overfit strongly augmented queries containing distorted visual structures to positive targets, while still distinguishing them from negative samples by leveraging the distributions of their weakly augmented counterparts. The learned representations not only explore the novel patterns exposed by the stronger augmentations, but also inherit knowledge about the relative similarities to the negative samples. Experiments on various datasets demonstrate that the proposed framework can greatly boost performance by learning from stronger augmentations. On the ImageNet linear evaluation protocol, we reach a record 76.2% top-1 accuracy with the standard ResNet-50 backbone, almost as high as the 76.5% top-1 accuracy of the fully supervised model. It also achieves competitive performance on several downstream tasks: a top-1 accuracy of 93.6% on VOC07 with a linear classifier on the pretrained ResNet-50, compared to the previous record of 88.9%, and on COCO object detection, AP_S for small objects improves to 24.4% from the previous best of 20.8%. These results show that CLSA leverages stronger augmentations more effectively than previous self-supervised methods on downstream tasks. We also conduct an ablation study showing that a naive application of stronger augmentations in contrastive learning degrades performance.
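The distribution-divergence idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the temperature value, and the choice of KL divergence as the concrete divergence measure are illustrative assumptions.

```python
import numpy as np

def similarity_distribution(query, bank, temperature=0.2):
    """Softmax over cosine similarities between a query embedding
    and every embedding in the representation bank."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    logits = b @ q / temperature
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def distribution_divergence(weak_query, strong_query, bank, temperature=0.2):
    """KL(p_weak || p_strong): the weak view's distribution over the bank
    supervises the strong view's distribution, instead of forcing the
    distorted strong view onto a single positive target."""
    p = similarity_distribution(weak_query, bank, temperature)
    q = similarity_distribution(strong_query, bank, temperature)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

Minimizing this divergence lets the strongly augmented query inherit the weak view's relative similarities to all bank entries, rather than being matched against one positive.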

2. RELATED WORK

Unsupervised and self-supervised learning methods have been widely studied to close the gap with supervised learning. These methods can be grouped into four major categories.

Contrastive Learning Contrastive learning methods (Misra & Maaten (2020); Chen et al. (2020b); He et al. (2020); Caron et al. (2020)) have shown great potential to close the performance gap with supervised methods. In contrastive learning (Hadsell et al. (2006)), each image is considered as an instance, and the network is trained so that the representations of different augmentations of the same instance are as close as possible to each other (He et al. (2020); Chen et al. (2020a); Wu et al. (2018); Hjelm et al. (2018); Oord et al. (2018); Bachman et al. (2019); Zhuang et al. (2019); Tian et al. (2019); Hénaff et al. (2019)), while the representations of different instances remain distinguishable from each other.
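This instance-wise objective can be made concrete with a minimal NumPy sketch of an InfoNCE-style contrastive loss for a single query; the function names and the temperature value are illustrative assumptions, not taken from any particular paper's code.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """Contrastive (InfoNCE-style) loss for one query: the positive is an
    augmentation of the same instance, the negatives come from other
    instances. All inputs are embedding vectors."""
    def norm(v):
        return v / np.linalg.norm(v)
    q = norm(query)
    # Cosine similarity to the positive (index 0) and to each negative.
    logits = np.array([q @ norm(positive)] +
                      [q @ norm(n) for n in negatives]) / temperature
    logits -= logits.max()                     # numerical stability
    # Negative log-probability of retrieving the positive.
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```

The loss is small when the query's representation is closest to its positive, and grows when a negative is more similar than the positive.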

Each image is considered as an individual class in the instance discrimination setting (Bojanowski & Joulin (2017); Dosovitskiy et al. (2015); Wu et al. (2018); Chen et al. (2020a); He et al. (2020)), which can be further formulated as contrastive learning (Hadsell et al. (2006)). In particular, Wu et al. (2018) built a memory bank that stores pre-computed representations from which positive examples are retrieved for given queries. Following this work, He et al. (2020) used a momentum update mechanism to maintain a long queue of negative examples for contrastive learning. Chen et al. (2020a) proposed a rich family of data augmentations on cropped images that significantly boosts classification accuracy. However, these methods fail to further improve performance when stronger augmentations are naively applied to minimize the contrastive loss, which motivates the proposed work.

Generative Methods Generative methods typically adopt auto-encoders (Vincent et al. (2008); Kingma & Welling (2013)) or adversarial learning (Donahue et al. (2016); Donahue & Simonyan (2019)) to train unsupervised representations. They usually focus on the pixel-wise information of images to distinguish images from different classes.

Self-supervised Clustering Data clustering (Asano et al. (2019); Caron et al. (2018; 2019; 2020); Yan et al. (2020)) can also be used to learn visual representations by assigning pseudo cluster labels to individual samples. DeepCluster (Caron et al. (2018)) generalized k-means by alternating between assigning pseudo-labels and updating networks. Recently, SwAV (Caron et al. (2020)) was proposed to learn a set of cluster prototypes as negative examples for contrastive learning. Combined with multi-crop augmentation of training examples, SwAV has achieved state-of-the-art performance on ImageNet.

Pretext Tasks In addition to contrastive learning, many alternative methods use different pretext tasks (Agrawal et al. (2015); Qi et al. (2019); Doersch et al. (2015); Kim et al. (2018); Larsson et al. (2016); Zhang et al. (2019)) to train unsupervised deep networks. For example, Doersch et al. (2015) used the relative positions of two randomly sampled patches as the supervision signal. Agrawal et al. (2015); Zhang et al. (2019); Gidaris et al. (2018) adopted various geometric
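The momentum update and negative-example queue mentioned earlier in this section (He et al. (2020)) can be sketched in a few lines. This is an illustrative toy, assuming plain arrays of parameters rather than an actual network; the function, class, and argument names are hypothetical.

```python
import numpy as np
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """EMA update: the key encoder slowly tracks the query encoder,
    keeping the dictionary of negatives consistent over time."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class NegativeQueue:
    """Fixed-size FIFO of past key embeddings used as negatives:
    new keys are enqueued and the oldest are dropped automatically."""
    def __init__(self, maxlen):
        self.buf = deque(maxlen=maxlen)

    def enqueue(self, keys):
        self.buf.extend(keys)

    def negatives(self):
        return list(self.buf)
```

A large momentum (e.g. m = 0.999) keeps the keys in the queue comparable across iterations, which is what allows the queue of negatives to grow far beyond the mini-batch size.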

