CENTER-WISE LOCAL IMAGE MIXTURE FOR CONTRASTIVE REPRESENTATION LEARNING

Abstract

Recent advances in unsupervised representation learning have been remarkable, especially with the success of contrastive learning, which regards each image, together with its augmentations, as a separate class but does not consider the semantic similarity among images. This paper proposes a new kind of data augmentation, named Center-wise Local Image Mixture (CLIM), to expand the neighborhood space of an image. CLIM encourages both local similarity and global aggregation when pulling similar images together. This is achieved by searching for locally similar samples of an image and only selecting those that are closer to the corresponding cluster center, which we denote as center-wise local selection. As a result, similar representations progressively approach the cluster centers without breaking the local similarity. Furthermore, image mixture is used as a smoothing regularization to avoid overconfidence on the selected samples. In addition, we introduce multi-resolution augmentation, which enables the representation to be scale invariant. Integrating the two augmentations produces better feature representations on several unsupervised benchmarks. Notably, we reach 75.5% top-1 accuracy with linear evaluation over ResNet-50, and 59.3% top-1 accuracy when fine-tuned with only 1% of the labels, while consistently outperforming supervised pretraining on several downstream transfer tasks.

1. INTRODUCTION

Learning general representations that transfer to different downstream tasks is a key challenge in computer vision. Over the past several years, this has usually been achieved with a fully supervised learning paradigm, e.g., pretraining with ImageNet labels. Recently, self-supervised learning has attracted increasing attention because it is free of human labels. In self-supervised learning, the network explores the intrinsic distribution of images via a series of predefined pretext tasks (Doersch et al., 2015; Gidaris et al., 2018; Noroozi & Favaro, 2016; Pathak et al., 2016). Among them, methods based on instance discrimination (Wu et al., 2018) have achieved remarkable progress (Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Caron et al., 2020). The core idea of instance discrimination is to push different images apart while encouraging the representations of different transformations (augmentations) of the same image to be similar. Following this paradigm, self-supervised models generate features that are comparable to, or even better than, those produced by supervised pretraining when evaluated on some downstream tasks, e.g., COCO detection and segmentation (Chen et al., 2020c;b). In contrastive learning, the positive pairs are constrained to be different transformations of the same image, e.g., cropping, color distortion, Gaussian blur, rotation, etc. Recent advances have demonstrated that better data augmentations (Chen et al., 2020a) indeed improve representation robustness. However, contrasting two images that are in fact semantically similar is harmful to general representations; it is intuitive that pulling semantically similar images together yields better transferability. DeepCluster (Caron et al., 2018) and Local Aggregation (Zhuang et al., 2019) relax the extreme instance discrimination task by discriminating groups of images instead of individual images.
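The instance-discrimination objective described above is commonly instantiated as an InfoNCE-style loss, where two augmented views of the same image form the positive pair and all other images in the batch serve as negatives. The following numpy sketch illustrates this general formulation; the function name and temperature value are illustrative and not this paper's exact settings.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.2):
    """Minimal sketch of an instance-discrimination (InfoNCE) loss.

    z1, z2: (B, D) L2-normalized embeddings of two augmented views,
            where z1[i] and z2[i] come from the same image.
    The diagonal of the similarity matrix holds the positive pairs;
    off-diagonal entries act as negatives.
    """
    B = z1.shape[0]
    logits = (z1 @ z2.T) / temperature               # (B, B) similarities
    # log-softmax over each row, then take the positive (diagonal) entry
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(B), np.arange(B)].mean()
```

When the two views of each image agree (diagonal similarities high), the loss is low; mismatched pairings raise it, which is the behavior the objective is designed to enforce.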
However, due to the lack of labels, the positive pairs inevitably contain noisy samples, which limits performance. In this paper, we aim to expand instance discrimination by exploring local similarities among images. Towards this goal, one needs to solve two issues: i) how to select similar images as positive
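The center-wise local selection rule sketched in the abstract (take an image's nearest neighbors, then keep only those that are closer to the image's cluster center than the image itself) can be illustrated with a small numpy sketch. All names, the distance measure (cosine similarity on normalized features), and the interface below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def center_wise_local_selection(features, centers, assignments, anchor, k=8):
    """Sketch of center-wise local selection on L2-normalized embeddings.

    features:    (N, D) embedding of each image
    centers:     (C, D) cluster centers (e.g., from k-means), normalized
    assignments: (N,) cluster index of each image
    anchor:      index of the query image
    Returns indices of up to k neighbors that are both locally similar to
    the anchor and closer to the anchor's cluster center than the anchor.
    """
    q = features[anchor]
    center = centers[assignments[anchor]]
    sims = features @ q                 # cosine similarity to the anchor
    sims[anchor] = -np.inf              # exclude the anchor itself
    knn = np.argsort(-sims)[:k]         # locally similar samples
    # keep only neighbors closer to the cluster center than the anchor is
    anchor_to_center = q @ center
    return knn[features[knn] @ center > anchor_to_center]
```

The filtering step is what distinguishes this from plain k-nearest-neighbor mining: neighbors on the far side of the anchor, away from the cluster center, are discarded, so selected positives pull the representation toward the cluster without breaking local similarity.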

