CONTRASTIVE LEARNING WITH STRONGER AUGMENTATIONS

Abstract

Representation learning has been greatly improved by advances in contrastive learning methods, with performance approaching that of their supervised counterparts. These methods have benefited greatly from data augmentations that are carefully designed to maintain instance identity, so that images transformed from the same instance can still be retrieved. Although stronger augmentations could expose novel patterns and improve the generalizability of the learned representations, directly using stronger augmentations in instance discrimination-based contrastive learning may even deteriorate performance, because the distortions they induce can drastically change the image structures, so the transformed images can no longer be treated as the same as the originals. Additional effort is therefore needed to explore the role of stronger augmentations in further pushing the performance of unsupervised learning toward the fully supervised upper bound. Instead of applying the stronger augmentations directly to minimize the contrastive loss, we propose to minimize the distribution divergence between the weakly and strongly augmented images over the representation bank, supervising the retrieval of strongly augmented queries from a pool of candidates. This avoids the over-optimistic assumption of fitting strongly augmented queries, whose visual structures are distorted, exactly to the positive targets in the representation bank, while still distinguishing them from negative samples by leveraging the distributions of their weakly augmented counterparts. The proposed method achieves 76.2% top-1 accuracy on ImageNet with a standard ResNet-50 architecture and a fine-tuned single-layer classifier, almost matching the 76.5% top-1 accuracy of a fully supervised ResNet-50.
Moreover, it outperforms previous self-supervised and supervised methods on both transfer learning and object detection tasks.
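The core objective described above can be sketched as follows. This is a hypothetical NumPy illustration, not the authors' implementation: each query's similarity distribution over a fixed representation bank is computed with a softmax, the weakly augmented view supplies the target distribution, and the strongly augmented view is trained to match it via KL divergence. The function and parameter names, and the use of a single shared temperature, are assumptions for illustration.

```python
# Hypothetical sketch of the distribution-divergence objective (assumed names).
import numpy as np

def softmax(logits):
    # Numerically stable softmax along the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distribution_divergence_loss(weak_q, strong_q, bank, temperature=0.1):
    """Mean KL( p_weak || p_strong ) over the batch.

    weak_q, strong_q: (batch, dim) L2-normalized query embeddings of the
        weakly / strongly augmented views of the same instances.
    bank: (num_keys, dim) L2-normalized representation bank.
    """
    # Similarity distributions of each query over the bank.
    p_weak = softmax(weak_q @ bank.T / temperature)     # target (no gradient in practice)
    p_strong = softmax(strong_q @ bank.T / temperature)  # distribution being trained
    # KL divergence per query, averaged over the batch.
    kl = (p_weak * (np.log(p_weak + 1e-12) - np.log(p_strong + 1e-12))).sum(axis=-1)
    return kl.mean()
```

When the two views produce identical embeddings the loss is zero, and it grows as the strong view's similarity distribution drifts from the weak view's, so the strong query need not hit a single positive target exactly, only reproduce the weak view's neighborhood structure over the bank.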

1. INTRODUCTION

Deep neural networks have shown sweeping successes in learning from large-scale labeled datasets such as ImageNet (Deng et al. (2009)). However, these successes hinge on the availability of a large number of labeled examples that are expensive to collect. To address this challenge, unsupervised visual representation learning and self-supervised learning have been studied to learn feature representations without labels. Among them is contrastive learning (Hadsell et al. (2006); Misra & Maaten (2020); Chen et al. (2020b); He et al. (2020); Caron et al. (2020)), which shows great potential to close the performance gap with supervised methods. In contrastive learning (Hadsell et al. (2006)), each image is considered as an instance, and the network is trained so that the representations of different augmentations of the same instance are as close as possible to each other (He et al. (2020); Chen et al. (2020a); Wu et al. (2018); Hjelm et al. (2018); Oord et al. (2018); Bachman et al. (2019); Zhuang et al. (2019); Tian et al. (2019); Hénaff et al. (2019)), while the representations of different instances remain distinguishable from each other.

It is worth noting that these methods usually rely on image augmentations that are carefully designed to maintain instance identities, so that the augmentation of an instance can be accurately retrieved from a dictionary of instances. On the other hand, we believe stronger augmentations could expose novel patterns that further improve the generalizability of the learned representations and eventually close the gap with fully supervised models. However, directly using stronger augmen-

