CENTER-WISE LOCAL IMAGE MIXTURE FOR CONTRASTIVE REPRESENTATION LEARNING

Abstract

Unsupervised representation learning has recently made remarkable progress, especially through contrastive learning, which treats each image, together with its augmentations, as a separate class but does not consider the semantic similarity among images. This paper proposes a new kind of data augmentation, named Center-wise Local Image Mixture (CLIM), to expand the neighborhood space of an image. CLIM encourages both local similarity and global aggregation when pulling similar images together. This is achieved by searching for locally similar samples of an image and selecting only those that are closer to the corresponding cluster center, which we denote as center-wise local selection. As a result, similar representations progressively approach the cluster centers without breaking local similarity. Furthermore, image mixture is used as a smoothing regularization to avoid overconfidence in the selected samples. In addition, we introduce multi-resolution augmentation, which makes the representation scale invariant. Integrating the two augmentations produces better feature representations on several unsupervised benchmarks. Notably, we reach 75.5% top-1 accuracy under linear evaluation with ResNet-50, and 59.3% top-1 accuracy when fine-tuning with only 1% of the labels, while consistently outperforming supervised pretraining on several downstream transfer tasks.

1. INTRODUCTION

Learning general representations that transfer to different downstream tasks is a key challenge in computer vision. Over the past several years, this has usually been achieved by fully supervised pretraining, e.g., using ImageNet labels. Recently, self-supervised learning has attracted increasing attention because it requires no human labels. In self-supervised learning, the network explores the intrinsic distribution of images via a series of predefined pretext tasks (Doersch et al., 2015; Gidaris et al., 2018; Noroozi & Favaro, 2016; Pathak et al., 2016). Among them, methods based on instance discrimination (Wu et al., 2018) have achieved remarkable progress (Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Caron et al., 2020). The core idea of instance discrimination is to push different images apart while encouraging the representations of different transformations (augmentations) of the same image to be similar. Following this paradigm, self-supervised models generate features that are comparable to, or even better than, those produced by supervised pretraining when evaluated on some downstream tasks, e.g., COCO detection and segmentation (Chen et al., 2020c;b). In contrastive learning, positive pairs are constrained to different transformations of the same image, e.g., cropping, color distortion, Gaussian blur, rotation, etc. Recent advances have demonstrated that better data augmentations (Chen et al., 2020a) substantially improve representation robustness. However, contrasting two images that are in fact semantically similar is not appropriate for learning general representations; it is more intuitive to pull semantically similar images together for better transferability. DeepCluster (Caron et al., 2018) and Local Aggregation (Zhuang et al., 2019) relax the extreme instance discrimination task by discriminating groups of images instead of individual images.
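The instance discrimination objective described above is commonly instantiated as an InfoNCE-style contrastive loss. The following is a minimal NumPy sketch of that generic loss for a single query, not the specific implementation used in this paper; the temperature value and function names are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: pull the positive view close, push the
    negatives away. Inputs are L2-normalized feature vectors; `negatives`
    is a (K, D) array of other images' features."""
    # Similarity logits: positive pair first, then the K negatives.
    logits = np.concatenate([[query @ positive], negatives @ query]) / temperature
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    # Cross-entropy with the positive at index 0.
    return -np.log(probs[0])
```

A query whose positive view is well aligned (high cosine similarity) yields a much lower loss than one whose positive is orthogonal, which is exactly the gradient signal that pulls augmentations of the same image together.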
However, due to the lack of labels, it is inevitable that the positive pairs contain noisy samples, which limits performance. In this paper, we aim to expand instance discrimination by exploring local similarities among images. Toward this goal, two issues need to be solved: i) how to select similar images as positive pairs of an image, and ii) how to incorporate these positive pairs, which inevitably contain noisy assignments, into contrastive learning. We propose a new kind of data augmentation, named Center-wise Local Image Mixture (CLIM), to tackle the above two issues in a robust and efficient way. CLIM consists of two core elements: a center-wise positive sample selection and a data mixing operation. For positive sample selection, the motivation is that a good representation should exhibit high intra-class similarity, and we find that although MoCo (He et al., 2020) does not explicitly model invariance to similar images, intra-class similarity becomes higher as training proceeds. Based on this observation, we explicitly push semantically similar images toward the centers of clusters, generating representations with higher intra-class similarity, which we find beneficial for few-shot learning. This is achieved by searching the nearest neighbors of an image and retaining only those similar samples that are closer to the corresponding cluster center, which we denote as center-wise local sample selection. As a result, an image is pulled toward the center without breaking local similarity. Once similar samples are selected, a direct way is to treat them as multiple positives for contrastive learning. However, since feature representations in high-dimensional space are complex, the returned positive samples inevitably contain noisy assignments, on which the model should not be overconfident. Instead, we rely on data mixing to generate augmented samples, which can be treated as a smoothing regularization in unsupervised learning.
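The center-wise selection step described above can be sketched as follows. This is one plausible reading of the rule, "keep a neighbor only if it lies in the anchor's cluster and is closer to that cluster's center than the anchor is"; the similarity metric, the neighborhood size `k`, and the function signature are assumptions, not the paper's exact implementation.

```python
import numpy as np

def center_wise_select(feats, centers, assign, anchor, k=10):
    """Center-wise local sample selection (sketch).

    feats:   (N, D) L2-normalized image features
    centers: (C, D) L2-normalized cluster centers (e.g., from k-means)
    assign:  (N,) cluster id of each image
    anchor:  index of the anchor image
    Returns indices of the anchor's k nearest neighbours that belong to the
    anchor's cluster AND are closer to its center than the anchor itself.
    """
    c = assign[anchor]
    sims = feats @ feats[anchor]        # cosine similarity to the anchor
    sims[anchor] = -np.inf              # exclude the anchor itself
    nn = np.argsort(-sims)[:k]          # k nearest neighbours
    anchor_to_center = feats[anchor] @ centers[c]
    return [j for j in nn
            if assign[j] == c and feats[j] @ centers[c] > anchor_to_center]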
In particular, we apply CutMix (Yun et al., 2019), a widely used data augmentation in supervised learning, where patches are cut and pasted among the positive pairs to generate new samples. Benefiting from the center-wise sample selection, the CutMix augmentation is constrained to the local neighborhood of an image and can be treated as an expansion of the current neighborhood space. In this way, similar samples are pulled together in a smoother and more robust way, which we find beneficial for general representations. Furthermore, we propose multi-resolution augmentation, which explicitly contrasts the same image (patch) at different resolutions to make the representation scale invariant. We argue that although previous operations such as crop-and-resize introduce multiple resolutions implicitly, they do not directly compare the same patch at different resolutions. In contrast, multi-resolution augmentation incorporates scale invariance into contrastive learning and significantly boosts performance even over a strong baseline. The multi-resolution strategy is simple but effective, and it can be combined with current data augmentations to further improve performance. We evaluate the learned representations on several self-supervised learning benchmarks. In particular, under the ImageNet linear evaluation protocol, we achieve 75.5% top-1 accuracy with a standard ResNet-50. In the few-shot setting, when fine-tuning with only 1% of the labels, we achieve 59.3% top-1 accuracy, surpassing previous works by a large margin. We also validate the transferability on several downstream tasks, consistently outperforming the fully supervised counterparts.
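The two augmentations above can be sketched generically: a standard CutMix-style patch paste between the anchor and a selected neighbor, and multiple resolutions of the same image to be contrasted against each other. This is a generic sketch under stated assumptions (in particular, stride subsampling stands in for proper resizing, and the mixing ratio handling is simplified), not the paper's exact implementation.

```python
import numpy as np

def cutmix(img_a, img_b, lam=0.75, rng=None):
    """CutMix-style mixing (Yun et al., 2019): paste a random rectangular
    patch of img_b onto img_a. The patch covers a (1 - lam) fraction of the
    area. Images are (H, W, C) arrays of the same shape."""
    rng = np.random.default_rng(rng)
    h, w = img_a.shape[:2]
    cut_h = int(h * np.sqrt(1 - lam))
    cut_w = int(w * np.sqrt(1 - lam))
    cy = rng.integers(0, h - cut_h + 1)   # random top-left corner
    cx = rng.integers(0, w - cut_w + 1)
    mixed = img_a.copy()
    mixed[cy:cy + cut_h, cx:cx + cut_w] = img_b[cy:cy + cut_h, cx:cx + cut_w]
    return mixed

def multi_res_views(img, factors=(1, 2)):
    """Multi-resolution views of the same image, to be contrasted against
    each other. Stride subsampling is a stand-in for resizing here."""
    return [img[::f, ::f] for f in factors]
```

With `lam=0.75` on an 8x8 image, the pasted patch is 4x4; lowering `lam` pastes a larger region of the neighbor, i.e., a stronger mix.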

2. RELATED WORK

Unsupervised Representation Learning. Unsupervised learning aims at exploring the intrinsic distribution of data samples by constructing a series of pretext tasks without human labels. These pretext tasks take many forms and exploit different properties of images. One family of methods takes advantage of the spatial properties of images; typical pretext tasks include predicting the relative spatial positions of patches (Doersch et al., 2015; Noroozi & Favaro, 2016), inferring the missing parts of images by inpainting (Pathak et al., 2016) or colorization (Zhang et al., 2016), and rotation prediction (Gidaris et al., 2018). Recent progress in self-supervised learning mainly benefits from instance discrimination, which regards each image (and its augmentations) as one class for contrastive learning. The motivation behind these works is the InfoMax principle, which aims at maximizing mutual information across different augmentations of the same image (Wu et al., 2018; Tian et al., 2019; He et al., 2020; Chen et al., 2020a).

Data Augmentation. Instance discrimination makes use of several data augmentations, e.g., random cropping, color jittering, and horizontal flipping, to define a large set of views in the vicinity of each image, with the hope that the network holds invariance within the local vicinity of each sample. As has been demonstrated (Chen et al., 2020a; Tian et al., 2020), the effectiveness of instance discrimination methods strongly depends on the type of augmentations. However, current data augmentations are mostly constrained within a single image. An exception is (Shen et al., 2020), where image mixture is used

