SET DISCRIMINATION CONTRASTIVE LEARNING

Abstract

In this work, we propose a self-supervised contrastive learning method that integrates the concept of set-based feature learning. The main idea of our method is to randomly construct sets of instances in a mini-batch and then learn to contrast the set representations. Inspired by set-based feature learning, we aggregate set features from individual sample features with a symmetric function. To improve the effectiveness of our set-based contrastive learning, we propose a set construction scheme built upon sample permutation in a mini-batch that allows a sample to appear in multiple sets; this ensures common features among sets by construction and thus generates hard negative samples. Our set construction scheme also increases the number of both positive and negative sets in a mini-batch, leading to better representation learning. We demonstrate the flexibility of our method by seamlessly integrating it into existing contrastive learning methods such as SimCLR and MoCo. Extensive experiments demonstrate that our method consistently improves the performance of these contrastive learning methods on various datasets and downstream tasks.

1. INTRODUCTION

Learning effective representations from data has been a long-standing challenge in machine learning over the past decades. A prominent direction to address this problem is self-supervised learning (SSL), which aims to learn representations without the need for human supervision. Contrastive learning (Jing & Tian, 2021; Le-Khac et al., 2020) is a modern, powerful approach in self-supervised learning that learns a representation based on the idea of attracting and repelling features, i.e., data samples with similar semantics are expected to be close to each other in the feature space while dissimilar samples are expected to stay apart. A dominant pretext task for contrastive learning is instance discrimination, where each instance is an original data sample represented by a feature vector. Given an instance, positive samples can be defined as different views of the same instance generated by applying data augmentation to the instance, such as cropping and flipping (Ye et al., 2019; Chen et al., 2020a) or luminance and chrominance decomposition (Tian et al., 2020a). Negative samples, on the other hand, are defined as the remaining samples, such as other samples in the same mini-batch (Ye et al., 2019; Chen et al., 2020a), or they can be drawn from memory banks (He et al., 2020; Wu et al., 2018). By distinguishing positive samples from negative samples, effective representations can be learned through such self-supervision. Remarkably, recent progress demonstrates that self-supervised representations can even surpass the performance of their supervised counterparts in computer vision downstream tasks (Henaff, 2020; He et al., 2020). A limitation of instance discrimination is that this pretext task can be optimized by simply learning low-level features of the data, which might not be effective representations for downstream tasks. This could be due to overfitting when maximizing mutual information between positive views (Tschannen et al., 2020).
Task-irrelevant features can also arise when excessive noise is present due to unnecessarily high mutual information between views during learning (Tian et al., 2020b). Unfortunately, it is challenging to separate useful information from noise without any additional cue such as knowledge of downstream tasks. We conjecture that effective features should be shared among instances, i.e., their embeddings should have some degree of mutual information. Our conjecture is based on the previous hypothesis that the good bits are those shared between different views of the world (Tian et al., 2020a; Smith & Gasser, 2005). Specifically, previous works such as Tian et al. (2020a) only take views from the same underlying instance into account, while we extend this conjecture to multiple instances. Our intuition is that even if instances belong to different categories, they should share some high-level properties such as abstract shapes, part compositions, etc., and learning these common concepts might be more beneficial than learning low-level features. At the same time, instance embeddings should remain sufficiently discriminative to be distinguished from each other. In this paper, we facilitate learning such shared features by considering unordered sets of instances because: 1) the aggregation function in a set-structured representation (Zhang et al., 2020; Naderializadeh et al., 2021) acts as a bottleneck that encourages the model to learn common features across instances to maximize set mutual information (Section 3.4); 2) circumventing instance discrimination helps avoid unintentionally maximizing distances between samples with similar semantics, which hinders the learning of common features. To realize the idea of set-based learning in self-supervised learning, we propose Multiple Instance RAndomly Grouped for Contrastive LEarning, or Miracle, a simple algorithm for set-based contrastive learning in which we arbitrarily sample data points in a mini-batch and group them to form sets.
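The symmetric aggregation underlying set-structured representations can be sketched as follows. Mean pooling and the array shapes here are illustrative assumptions for exposition, not necessarily the paper's exact design; the key property is that the set feature is invariant to the ordering of its members:

```python
import numpy as np

def aggregate_set(features: np.ndarray) -> np.ndarray:
    """Aggregate per-instance features of shape (k, d) into a single set
    feature of shape (d,) with a symmetric function -- here, mean pooling --
    so the result does not depend on the order of instances in the set."""
    return features.mean(axis=0)

# Permutation invariance: shuffling the set members leaves the set feature unchanged.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))   # a set of k=4 instances with d=8 features each
perm = rng.permutation(4)
assert np.allclose(aggregate_set(feats), aggregate_set(feats[perm]))
```

Because the pooled feature has the same dimensionality as a single instance feature, it forms a bottleneck: the network is pushed to encode what the set members have in common rather than instance-specific detail.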
Similar to instance discrimination (Wu et al., 2018; Ye et al., 2019), we apply data augmentation to create two views of a set. We construct the features of a set by aggregating the features of its samples with a symmetric function, which can then be passed to a contrastive loss. The network is trained to maximize agreement between views of the same set while distinguishing different sets. We refer to this task as set discrimination. To support the training, we devise an efficient set construction scheme based on permuting the samples in a mini-batch multiple times. The benefits of our set construction are two-fold. First, it allows an instance to appear in multiple sets, so the sets share common features by construction. This encourages the network to learn common features for the instances, and it also generates harder negative sets that improve the robustness of representation learning. Second, our set construction increases the number of positive and negative sets, strengthening the self-supervision. In contrast, contemporary methods focus solely on positive samples (Dwibedi et al., 2021) or negative samples (He et al., 2020; Chen et al., 2020b; Wu et al., 2018) at a time. By virtue of the simplicity of the proposed approach, we can plug this set-based contrastive learning into existing contrastive learning methods. Through extensive experiments, we demonstrate the efficacy of Miracle in various scenarios. First, we show that the proposed method consistently improves baselines such as SimCLR (Chen et al., 2020a) and MoCo (He et al., 2020) on CIFAR-10, CIFAR-100, STL-10, ImageNet-100, and ImageNet-1K. We verify the robustness of Miracle when scaling up learning with different hyperparameters, including pretraining epochs, batch sizes, learning rates, and temperatures. We also study Miracle in various conditions, including weaker data augmentation and transfer learning.
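The permutation-based set construction described above can be sketched as follows. The function name and parameters (`set_size`, `num_perms`) are illustrative assumptions; the essential mechanism is that each permutation of the mini-batch is chunked into disjoint sets, so across permutations every sample belongs to several different sets:

```python
import numpy as np

def construct_sets(batch_size: int, set_size: int, num_perms: int, seed: int = 0) -> np.ndarray:
    """Build index sets by permuting the mini-batch `num_perms` times and
    chunking each permutation into consecutive groups of `set_size`.
    Each sample appears in exactly `num_perms` sets, so distinct sets can
    share members -- yielding hard negatives by construction -- and the
    total number of sets grows with the number of permutations."""
    assert batch_size % set_size == 0, "batch must split evenly into sets"
    rng = np.random.default_rng(seed)
    sets = []
    for _ in range(num_perms):
        perm = rng.permutation(batch_size)
        sets.extend(perm.reshape(-1, set_size))   # batch_size // set_size sets per permutation
    return np.stack(sets)  # shape: (num_perms * batch_size // set_size, set_size)

sets = construct_sets(batch_size=8, set_size=2, num_perms=3)
print(sets.shape)  # (12, 2): 3 permutations x 4 sets each
```

Features of the samples indexed by each row would then be pooled by the symmetric aggregation function before being fed to the contrastive loss.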
In summary, our contributions are: (1) a new pretext task of set discrimination for self-supervised visual representation learning; (2) a simple but effective method to integrate set-based feature learning into existing contrastive learning methods, yielding significant performance improvement; (3) extensive experiments and ablation studies that empirically demonstrate the usefulness and robustness of set-based contrastive learning.

2. RELATED WORK

Instance-wise contrastive learning Recent advances in contrastive learning are largely driven by the instance discrimination task (Wu et al., 2018). Prior works (Chen et al., 2020a; He et al., 2020; Ye et al., 2019; Wu et al., 2018; Tian et al., 2020a) in this direction treat each instance as a category and learn an embedding space in which views of the same instance, also known as positive samples and obtained by different transformations of an image, have small distances, while views from different instances, or negative samples, have large distances. There are various ways to generate the samples for contrastive learning: Chen et al. and Ye et al. simply use all samples from the same mini-batch; Wu et al. use a memory bank that stores features from previous steps; He et al. use a momentum encoder to compute positive samples and a memory bank for negative samples; Hu et al. train a generative model together with a representation network to generate negative samples. Most of these methods adopt the InfoNCE loss function (Van den Oord et al., 2018), which usually requires a large batch size to reduce the bias of the estimation. Yeh et al. and Chen et al. propose variants of InfoNCE to cope with this challenge.
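For reference, the InfoNCE loss that most of these methods adopt can be sketched for a single anchor as below. The cosine-similarity scoring and the temperature value are common conventions rather than details fixed by this paper:

```python
import numpy as np

def info_nce(anchor: np.ndarray, positive: np.ndarray,
             negatives: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss for one anchor (Van den Oord et al., 2018): cross-entropy
    of the positive against the positive plus negatives, scored by cosine
    similarity. Shapes: anchor (d,), positive (d,), negatives (n, d)."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()                       # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```

The bias issue mentioned above stems from the denominator: it is a sampled estimate over the negatives, so with few negatives (small batches) the estimate of the underlying partition function is poor, which motivates large batches, memory banks, or the InfoNCE variants cited here.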

