SOFT NEIGHBORS ARE POSITIVE SUPPORTERS IN CONTRASTIVE VISUAL REPRESENTATION LEARNING

Abstract

Contrastive learning methods train visual encoders by comparing views (e.g., commonly created via a group of data augmentations on the same instance) from one instance against those from others. Typically, the views created from one instance are set as positive, while views from other instances are set as negative. This binary instance discrimination has been studied extensively to improve feature representations in self-supervised learning. In this paper, we rethink the instance discrimination framework and find binary instance labeling insufficient to measure correlations between different samples. For an intuitive example, given a random image instance, there may exist other images in a mini-batch whose contents are the same (i.e., belonging to the same category) or partially related (i.e., belonging to a similar category). How to treat the images that correlate to the current image instance remains an unexplored problem. We thus propose to support the current image by exploring other correlated instances (i.e., soft neighbors). We first carefully cultivate a candidate neighbor set, which is further utilized to identify highly-correlated instances. A cross-attention module is then introduced to predict the correlation score (denoted as positiveness) of each correlated instance with respect to the current one. The positiveness score quantitatively measures the positive support from each correlated instance and is encoded into the objective for pretext training. In this way, our proposed method benefits from discriminating uncorrelated instances while absorbing correlated instances for SSL. We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation. The state-of-the-art recognition performance shows that SNCLR is effective in improving feature representations from both ViT and CNN encoders.

1. INTRODUCTION

Visual representations are fundamental to recognition performance. Compared to supervised learning, self-supervised learning (SSL) is capable of leveraging large-scale image collections for pretext learning without annotations. Meanwhile, the feature representations learned via SSL generalize better to downstream recognition scenarios (Grill et al., 2020; Chen et al., 2021; Xie et al., 2021; He et al., 2022; Wang et al., 2021). Among SSL methods, contrastive learning (CLR) has received extensive study. In a CLR framework, training samples are utilized to create multiple views based on different data augmentations. These views are passed through a two-branch pipeline for similarity measurement (e.g., with the InfoNCE loss (Oord et al., 2018) or a redundancy-reduction loss (Zbontar et al., 2021)). Based on this learning framework, investigations into memory queues (He et al., 2020), large batch sizes (Chen et al., 2020b), encoder synchronization (Chen & He, 2021), and self-distillation (Caron et al., 2021) show how to effectively learn self-supervised representations.

In a contrastive learning framework, the created views are automatically assigned binary labels according to their image sources. The views created from the same image instance are labeled as positive, while the views from other image instances are labeled as negative. These positive

and negative pairs constitute the contrastive loss computation process. In practice, we observe that this rudimentary labeling is insufficient to represent instance correlations (i.e., analogous to the semantic relations in supervised learning) between different images and may hurt the performance of the learned representations. For example, when we use ImageNet data (Russakovsky et al., 2015) for self-supervised pre-training, views from the current image instance x_0 belonging to the 'bloodhound' category are labeled as positive, while views from other image instances belonging to the same category are labeled as negative. Since the original training labels are unavailable during SSL, images belonging to the same semantic category are labeled differently, which limits the learned feature representations.
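For reference, the vanilla binary-labeled objective these methods build on, the InfoNCE loss of Oord et al. (2018), can be sketched in a few lines. This is an illustrative sketch, not the paper's exact implementation: the function names, the use of cosine similarity, and the temperature value 0.1 are all assumed choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query view: the negative log of the softmax
    probability assigned to the positive view among all candidate views.
    Every view from another instance is treated as a hard negative."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max logit for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m) + math.log(denom)
```

Note that the loss only distinguishes positive from negative: a negative from the same semantic category and one from an unrelated category contribute to the denominator in exactly the same way, which is the limitation discussed above.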
On the other hand, for the current image instance x_0, although views from one image instance x_1 belonging to the 'walker hound' category and views from another image instance x_2 belonging to the 'peacock' category are both labeled as negative, the views from x_1 are more correlated to the views from x_0 than those from x_2. This correlation is not properly encoded during contrastive learning, as views from both x_1 and x_2 are labeled the same. Without the original training labels, contrastive learning methods are not effective in capturing the correlations between image instances, and the learned feature representations are limited in describing correlated visual content across different images.

Exploring instance correlations has arisen recently in SwAV (Caron et al., 2020) and NNCLR (Dwibedi et al., 2021). Fig. 1 shows how they improve training sample comparisons upon the vanilla CLR. In (a), the vanilla CLR method labels the views from the current image instance as positive and views from other image instances as negative. This framework is improved in (b), where SwAV assigns view embeddings of the current image instance to neighboring clusters; the contrastive loss computation is based on these clusters rather than on the view embeddings themselves. Another improvement upon (a) is shown in (c), where nearest-neighbor (NN) view embeddings are utilized in NNCLR to support the current views in the contrastive loss computation.

The clustered CLR and the NN selection inspire us to take advantage of both. Inspired by the NN selection, we aim to accurately identify the neighbors that are highly correlated to the current image instance. Meanwhile, motivated by the clustered CLR, we expect to produce a soft measurement of the correlation extent. In the end, the identified neighbors, with adaptive weights, support the current sample in the contrastive loss computation. In this paper, we propose to explore soft neighbors during contrastive learning (SNCLR).
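One plausible way to encode such soft support is to let each candidate neighbor contribute to the numerator of the contrastive loss in proportion to a positiveness weight in [0, 1]. The sketch below is an assumption about how such a weighted objective could look, written for readability rather than as SNCLR's exact loss; the function name and argument layout are illustrative.

```python
import math

def soft_nce(sim_pos, neighbor_sims, positiveness, other_sims, t=0.1):
    """Contrastive loss where each candidate neighbor contributes to the
    numerator in proportion to its positiveness weight w_i in [0, 1].
    With all w_i = 0 this reduces to the vanilla InfoNCE loss (neighbors
    act as plain negatives); with w_i = 1 a neighbor is treated as a
    full positive, as in NNCLR-style nearest-neighbor support."""
    exp = lambda s: math.exp(s / t)
    numer = exp(sim_pos) + sum(w * exp(s)
                               for w, s in zip(positiveness, neighbor_sims))
    denom = (exp(sim_pos)
             + sum(exp(s) for s in neighbor_sims)
             + sum(exp(s) for s in other_sims))
    return -math.log(numer / denom)
```

Raising a neighbor's positiveness lowers the loss for a given similarity, so a highly-correlated instance is no longer pushed away as a hard negative; an uncorrelated instance (weight near 0) is still repelled as usual.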
Our framework consists of two encoders, two projectors, and one predictor, which are commonly adopted in CLR framework designs (Grill et al., 2020; Chen et al., 2021). Moreover, we introduce a candidate neighbor set to store nearest neighbors and an attention module to compute cross-attention scores. For the current view instance, the candidate neighbor set contains instance features from other images whose feature representations are similar to that of the current view instance. We then



Figure 1: Training sample comparisons in the self-supervised contrastive learning framework. The vanilla contrastive learning method is shown in (a), where the current instance is regarded as positive while others are negative. In SwAV (Caron et al., 2020), the current instance is assigned to online maintained clusters for contrastive measurement. In NNCLR (Dwibedi et al., 2021), the nearest neighbor is selected as a positive instance to support the current one for the contrastive loss computation (Chen et al., 2020b). Different from these methods, our SNCLR shown in (d) measures correlations between the current instance and other instances identified from the candidate neighbor set. We define the instances that are highly-correlated to the current instance as soft neighbors (e.g., different pandas to the input image belonging to the 'panda' category in Fig. 3). The similarity/correlation, which is based on the cross-attention computation from a shallow attention module, denotes the instance positiveness to compute the contrastive loss. (Best viewed in colors)
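The positiveness described above can be pictured as a cross-attention score between the current instance embedding and each candidate-neighbor embedding. The sketch below uses plain scaled dot-product attention as a stand-in; SNCLR's actual module is a learned shallow attention block, so the function name and the absence of learned projections are simplifying assumptions.

```python
import math

def positiveness_scores(query, neighbors, scale=None):
    """Softmax-normalized cross-attention scores between the current
    instance embedding (query) and each candidate-neighbor embedding.
    This is unparameterized scaled dot-product attention, used here only
    to illustrate how a correlation score in [0, 1] can be produced."""
    d = len(query)
    scale = scale or math.sqrt(d)
    logits = [sum(a * b for a, b in zip(query, n)) / scale
              for n in neighbors]
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The scores sum to one across the candidate set, and a neighbor whose embedding aligns more closely with the current instance receives a larger share, i.e., stronger positive support in the contrastive objective.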

