SET DISCRIMINATION CONTRASTIVE LEARNING

Abstract

In this work, we propose a self-supervised contrastive learning method that integrates the concept of set-based feature learning. The main idea of our method is to randomly construct sets of instances within a mini-batch and then learn to contrast the set representations. Inspired by set-based feature learning, we aggregate each set feature from individual sample features with a symmetric function. To improve the effectiveness of our set-based contrastive learning, we propose a set construction scheme built upon sample permutations in a mini-batch that allows a sample to appear in multiple sets. This guarantees common features among sets by construction and thus generates hard negative samples. Our set construction scheme also increases the number of both positive and negative sets in a mini-batch, leading to better representation learning. We demonstrate the generality of our method by seamlessly integrating it into existing contrastive learning methods such as SimCLR and MoCo. Extensive experiments demonstrate that our method consistently improves the performance of these contrastive learning methods across various datasets and downstream tasks.
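The permutation-based set construction and symmetric aggregation described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `build_sets`, the hyperparameters, and the choice of mean pooling as the symmetric function are all illustrative assumptions.

```python
import numpy as np

def build_sets(features, set_size, num_perms, seed=0):
    """Randomly construct sets of instances from a mini-batch.

    Each random permutation of the batch is split into consecutive
    chunks of `set_size` samples, so every sample appears in
    `num_perms` different sets. Sets from different permutations then
    share samples by construction, yielding hard negative sets.
    Mean pooling is used here as one example of a symmetric
    (permutation-invariant) aggregation function.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    set_feats = []
    for _ in range(num_perms):
        perm = rng.permutation(n)
        for start in range(0, n, set_size):
            idx = perm[start:start + set_size]
            # symmetric aggregation: order of samples in the set is irrelevant
            set_feats.append(features[idx].mean(axis=0))
    return np.stack(set_feats)

batch = np.random.randn(8, 4)  # 8 samples, 4-dim features
set_feats = build_sets(batch, set_size=4, num_perms=2)
print(set_feats.shape)  # (4, 4): 2 permutations x 2 sets each
```

Because every sample contributes to `num_perms` sets, the number of set-level positives and negatives per mini-batch grows with the number of permutations, as the abstract notes.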

1. INTRODUCTION

Learning effective representations from data has been a long-standing challenge in machine learning over the past decades. A prominent direction for addressing this problem is self-supervised learning (SSL), which aims to learn representations without the need for human supervision. Contrastive learning (Jing & Tian, 2021; Le-Khac et al., 2020) is a powerful modern approach to self-supervised learning that learns a representation based on the idea of attracting and repelling features, i.e., data samples with similar semantics are expected to lie close to each other in the feature space, while dissimilar samples are expected to stay apart. A dominant pretext task for contrastive learning is instance discrimination, where each instance is an original data sample represented by a feature vector. Given an instance, positive samples can be defined as different views of the same instance, generated by applying data augmentations such as cropping and flipping (Ye et al., 2019; Chen et al., 2020a) or luminance and chrominance decomposition (Tian et al., 2020a). Negative samples, on the other hand, are defined as the remaining samples, e.g., other samples in the same mini-batch (Ye et al., 2019; Chen et al., 2020a), or they can be drawn from memory banks (He et al., 2020; Wu et al., 2018). By distinguishing positive samples from negative samples, effective representations can be learned through such self-supervision. Remarkably, recent progress demonstrates that self-supervised representations can even surpass the performance of their supervised counterparts on computer vision downstream tasks (Henaff, 2020; He et al., 2020). A limitation of instance discrimination is that this pretext task can be optimized by simply learning low-level features of the data, which might not be effective representations for downstream tasks. This could be due to overfitting when maximizing mutual information between positive views (Tschannen et al., 2020).
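For concreteness, the instance-discrimination objective underlying methods like SimCLR is typically the NT-Xent (normalized temperature-scaled cross-entropy) loss: each sample's two augmented views form the positive pair, and all other samples in the batch serve as negatives. Below is a minimal NumPy sketch of this loss; the temperature value and function name are illustrative, and practical implementations use a deep-learning framework with automatic differentiation.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over two augmented views z1, z2 of the same batch.

    z1[i] and z2[i] are embeddings of two views of sample i; the 2n-2
    remaining embeddings in the batch act as negatives for each view.
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    n = z1.shape[0]
    # the positive for row i (view 1) is row i + n (view 2), and vice versa
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity from the softmax
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return (logsumexp - sim[np.arange(2 * n), pos]).mean()
```

Minimizing this loss pulls the two views of each instance together while pushing apart all other instances, which is precisely the attract/repel behavior described above; it is also the objective that the instance-level overfitting critique (Tschannen et al., 2020) applies to.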
Task-irrelevant features can also arise when excessive noise is present due to unnecessarily high mutual information between views during learning (Tian et al., 2020b). Unfortunately, it is challenging to separate useful information from noise without additional cues such as knowledge of the downstream tasks. We conjecture that effective features should be shared among instances, i.e., their embeddings should have some degree of mutual information. Our conjecture is based on the prior hypothesis that the good bits are those shared between different views of the world (Tian et al., 2020a; Smith &

