SOFT NEIGHBORS ARE POSITIVE SUPPORTERS IN CONTRASTIVE VISUAL REPRESENTATION LEARNING

Abstract

Contrastive learning methods train visual encoders by comparing views (typically created by applying a set of data augmentations to the same instance) from one instance against those from others. The views created from one instance are labeled positive, while views from other instances are labeled negative. This binary instance discrimination has been studied extensively to improve feature representations in self-supervised learning (SSL). In this paper, we rethink the instance discrimination framework and find that binary instance labeling is insufficient to measure the correlations between different samples. As an intuitive example, given a random image instance, there may exist other images in a mini-batch whose content is the same (i.e., belonging to the same category) or partially related (i.e., belonging to a similar category). How to treat images that are correlated with the current image instance remains an unexplored problem. We thus propose to support the current image by exploring other correlated instances (i.e., soft neighbors). We first carefully cultivate a candidate neighbor set, which is further utilized to identify highly correlated instances. A cross-attention module is then introduced to predict the correlation score (denoted as positiveness) of each correlated instance with respect to the current one. The positiveness score quantitatively measures the positive support from each correlated instance and is encoded into the objective for pretext training. As a result, our proposed method benefits from discriminating uncorrelated instances while absorbing correlated instances during SSL. We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation. The state-of-the-art recognition performance shows that SNCLR is effective in improving feature representations from both ViT and CNN encoders.

1. INTRODUCTION

Visual representations are fundamental to recognition performance. Compared to supervised learning, self-supervised learning (SSL) is capable of leveraging large-scale images for pretext learning without annotations. Meanwhile, the feature representations learned via SSL are more generalizable and benefit downstream recognition scenarios (Grill et al., 2020; Chen et al., 2021; Xie et al., 2021; He et al., 2022; Wang et al., 2021). Among SSL methods, contrastive learning (CLR) has received extensive study. In a CLR framework, training samples are utilized to create multiple views based on different data augmentations. These views are passed through a two-branch pipeline for similarity measurement (e.g., the InfoNCE loss (Oord et al., 2018) or the redundancy-reduction loss (Zbontar et al., 2021)). Based on this learning framework, investigations on memory queues (He et al., 2020), large batch sizes (Chen et al., 2020b), encoder synchronization (Chen & He, 2021), and self-distillation (Caron et al., 2021) show how to effectively learn self-supervised representations. In a contrastive learning framework, the created views are automatically assigned binary labels according to their image sources. The views created from the same image instance are labeled as positive, while the views from other image instances are labeled as negative. These positive
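To make the binary labeling concrete, the following is a minimal NumPy sketch of the InfoNCE objective mentioned above, where each instance's two augmented views form a positive pair and all other instances in the batch serve as negatives. The function name and the temperature value are illustrative, not taken from any specific implementation.

```python
import numpy as np

def info_nce(query, key, temperature=0.1):
    """InfoNCE loss for a batch: query[i] and key[i] are two augmented
    views of the same instance (positive pair); key[j] for j != i are
    negatives. Both inputs have shape (N, D)."""
    # L2-normalize embeddings so dot products become cosine similarities.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = key / np.linalg.norm(key, axis=1, keepdims=True)
    logits = q @ k.T / temperature          # (N, N); positives on the diagonal
    # Row-wise cross-entropy: -log softmax(logits)[i, i].
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()
```

When the two views are well aligned, the diagonal dominates each row and the loss approaches zero; unrelated keys yield a loss near log N. SNCLR's motivation is precisely that this formulation treats every off-diagonal instance as equally negative, regardless of semantic similarity.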

