PROSAMPLER: IMPROVING CONTRASTIVE LEARNING BY BETTER MINI-BATCH SAMPLING

Abstract

In-batch contrastive learning has emerged as a state-of-the-art self-supervised learning solution, with the philosophy of bringing semantically similar instances closer while pushing dissimilar instances apart within a mini-batch. However, the in-batch negative sharing strategy is limited by the batch size and falls short of prioritizing the informative negatives (i.e., hard negatives) globally. In this paper, we propose to sample mini-batches with hard negatives on a proximity graph, in which the instances (nodes) are connected according to a similarity measure. Sampling on the proximity graph better exploits hard negatives globally by bringing in similar instances from the entire dataset. The proposed method can flexibly explore the negatives by modulating two parameters, and we show that this flexibility is the key to better exploiting hard negatives globally. We evaluate the proposed method on three representative contrastive learning algorithms, each corresponding to one modality: image, text, and graph. Besides, we also apply it to variants of the InfoNCE objective to verify its generality. Results show that our method consistently boosts the performance of contrastive methods, with relative improvements of 2.5% for SimCLR on ImageNet-100, 1.4% for SimCSE on the standard STS task, and 1.2% for GraphCL on the COLLAB dataset.

1. INTRODUCTION

Contrastive learning has become the dominant approach in current self-supervised representation learning and is applied in many areas, e.g., MoCo (He et al., 2020) and SimCLR (Chen et al., 2020) in computer vision, GCC (Qiu et al., 2020) and GraphCL (You et al., 2020) in graph representation learning, and SimCSE (Gao et al., 2021) in natural language processing. The basic idea is to decrease the distance between the embeddings of the same instance (positive pair) while increasing that between different instances (negative pairs). These important works in contrastive learning generally follow, or slightly modify, the framework of in-batch contrastive learning:

\[
\min \; \mathbb{E}_{\{x_1, \dots, x_B\} \subset \mathcal{D}} \left[ -\sum_{i=1}^{B} \log \frac{e^{f(x_i)^\top f(x_i^+)}}{e^{f(x_i)^\top f(x_i^+)} + \sum_{j \neq i} e^{f(x_i)^\top f(x_j)}} \right],
\]

where \(\{x_1, \dots, x_B\}\) is a mini-batch of samples (usually) loaded sequentially from the dataset \(\mathcal{D}\), and \(x_i^+\) is an augmented version of \(x_i\). The encoder \(f(\cdot)\) learns to discriminate instances by mapping different data-augmentation versions of the same instance (positive pair) to similar embeddings, and different instances in the mini-batch (negative pairs) to dissimilar embeddings. The key to its efficiency is the in-batch negative sharing strategy: every instance within a mini-batch serves as a negative for all the other instances, so we learn to discriminate all \(B(B-1)\) pairs of instances in a mini-batch while encoding each instance only once. Its simplicity and efficiency make it more popular than pairwise (Mikolov et al., 2013) or triplet-based methods (Schroff et al., 2015; Harwood et al., 2017), and it has gradually become the dominant framework for contrastive learning. However, the performance of in-batch contrastive learning is closely tied to the batch size, and many important contrastive learning methods aim to obtain a large (equivalent) batch size under a limited computation and memory budget.
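For concreteness, the objective above can be sketched in NumPy as follows. This is an illustrative implementation (the function name is ours, not from the paper); it assumes the embeddings are the raw encoder outputs \(f(x_i)\) and \(f(x_i^+)\), and applies no temperature scaling, matching the formula above.

```python
import numpy as np

def in_batch_infonce(z, z_pos):
    """In-batch InfoNCE loss for one mini-batch.

    z, z_pos: (B, d) arrays holding f(x_i) and f(x_i^+).
    Every other anchor in the batch serves as a negative for x_i.
    """
    pos = np.sum(z * z_pos, axis=1)      # (B,) positive logits f(x_i)^T f(x_i^+)
    sim = z @ z.T                        # (B, B) anchor-anchor similarities
    neg_exp = np.exp(sim)
    np.fill_diagonal(neg_exp, 0.0)       # drop the j == i self-similarity term
    # Denominator: positive term plus all j != i in-batch negatives.
    denom = np.exp(pos) + neg_exp.sum(axis=1)
    return float(np.mean(-(pos - np.log(denom))))
```

Note that, per the formula, the negatives are the other anchors \(x_j\) rather than their augmentations; variants of InfoNCE (e.g., the symmetric NT-Xent loss of SimCLR) also contrast against the augmented views.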
For example, Memory Bank (Wu et al., 2018) shows that simply increasing the batch size of plain in-batch contrastive learning to 8,192 outperforms previous carefully designed methods. Although many works highlight the importance of batch size, a further question arises: which instances in the mini-batch contribute the most? Hard negative pairs contribute the most; this answer is well supported by many related studies on negative sampling (Ying et al., 2018; Yang et al., 2020; Huang et al., 2021; Kalantidis et al., 2020; Robinson et al., 2021). An intuitive explanation is that the term \(e^{f(x_i)^\top f(x_j)}\) for easy-to-discriminate negative pairs becomes very small after the early period of training, so the hard negative pairs contribute the majority of the loss and gradients. Hard negative sampling has already achieved great success in many real-world applications, e.g., an 8% improvement in Facebook search recall (Huang et al., 2020) and 15% relative gains for the Microsoft retrieval engine (Xiong et al., 2020). The key to these methods is to globally select negatives that are similar to the query instance across the whole dataset. However, previous negative sampling methods for in-batch contrastive learning (Robinson et al., 2021; Chuang et al., 2020) focus on identifying negative samples within the current mini-batch, which is insufficient for mining meaningful negatives from the entire dataset. Meanwhile, previous global negative samplers apply a triplet loss and explore negatives in pairs (Karpukhin et al., 2020; Xiong et al., 2020), which is inapplicable to the in-batch negative sharing strategy, since it cannot guarantee the similarity between every pair of instances within a mini-batch. In this paper, we focus on designing a global hard negative sampler for in-batch contrastive learning.
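The gradient-domination argument can be checked numerically: the weight each negative receives in the gradient of the InfoNCE loss with respect to its logit is its softmax probability, so low-similarity negatives contribute vanishingly little. A small illustrative sketch (the function name and the example similarities are ours):

```python
import numpy as np

def negative_gradient_weights(pos_logit, neg_logits):
    # Softmax weight each negative receives in the InfoNCE gradient:
    # d(loss)/d(s_ij) = exp(s_ij) / (exp(s_i+) + sum_j exp(s_ij)).
    logits = np.concatenate([[pos_logit], neg_logits])
    w = np.exp(logits - logits.max())    # subtract max for numerical stability
    w /= w.sum()
    return w[1:]                         # weights on the negatives only

# One hard negative (similarity 0.8) vs. three easy ones (similarity -0.5):
w = negative_gradient_weights(0.9, np.array([0.8, -0.5, -0.5, -0.5]))
```

Here the single hard negative receives several times the gradient weight of each easy one, which is why drawing hard negatives into the batch matters.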
Since every instance serves as a negative to the other instances in the same batch, the desired sampling strategy is one that yields more hard-to-distinguish pairs in each sampled batch. This objective can be viewed as sampling a batch of mutually similar instances from the dataset. But how can we identify such a batch globally over the dataset?

Present Work. Here we propose the Proximity Graph-based Sampler (ProSampler), a global hard negative sampling strategy that can be plugged into any in-batch contrastive learning method. The proximity graph breaks the independence between instances and captures their relevance to one another, enabling better global negative sampling. As shown in Figure 1, similar instances form a local neighborhood in the proximity graph, where ProSampler performs negative sampling as short random walks to effectively draw hard negative pairs. Besides, ProSampler can flexibly control the hardness of the sampled mini-batch by modulating two parameters. In practice, we rebuild the proximity graph every fixed number of iterations, and then apply Random Walk with Restart (RWR) at each iteration to sample a mini-batch for training. Our experiments show that ProSampler consistently improves top-performing contrastive learning algorithms in different domains, including SimCLR (Chen et al., 2020) and MoCo v3 (Chen et al., 2021) in CV, SimCSE (Gao et al., 2021) in NLP, and GraphCL (You et al., 2020) in graph learning, by merely changing the mini-batch sampling step. To the best of our knowledge, ProSampler is the first algorithm to optimize the mini-batch sampling step for better negative sampling in the current in-batch contrastive learning framework.
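The pipeline described above (periodic proximity-graph construction, then RWR-based mini-batch sampling) might be sketched as follows. This is our illustrative reconstruction, not the authors' implementation: here the neighborhood size `k` and the restart probability `restart_p` stand in for the two hardness-modulating parameters, and the random top-up at the end is our own guard against small graph components.

```python
import numpy as np

def build_knn_graph(emb, k=5):
    # Proximity graph: connect each instance to its k nearest neighbors
    # under inner-product similarity (embeddings assumed L2-normalized).
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)           # an instance is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]   # (N, k) neighbor indices

def rwr_sample_batch(nbrs, batch_size, restart_p=0.2, rng=None):
    # Random walk with restart over the proximity graph: the walk stays inside
    # a local neighborhood, so the sampled batch is mutually similar, i.e.,
    # rich in hard negative pairs. k and restart_p modulate the hardness.
    if rng is None:
        rng = np.random.default_rng()
    N = nbrs.shape[0]
    seed = int(rng.integers(N))
    batch, cur = {seed}, seed
    for _ in range(200 * batch_size):        # step cap for safety
        if len(batch) == batch_size:
            break
        if rng.random() < restart_p:
            cur = seed                       # restart at the seed instance
        else:
            cur = int(rng.choice(nbrs[cur])) # move to a random neighbor
        batch.add(cur)
    while len(batch) < batch_size:           # top up if the walk got stuck
        batch.add(int(rng.integers(N)))
    return np.array(sorted(batch))
```

A smaller `restart_p` lets the walk wander farther (easier batches), while a larger one keeps it pinned near the seed (harder batches); in a full training loop the graph would be rebuilt from fresh embeddings every fixed number of iterations.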



Figure 1: A motivating example of ProSampler. The generated image representations form an embedding space in which the Uniform Sampler randomly draws a mini-batch with easy negatives, while ProSampler samples a mini-batch with hard negatives based on the proximity graph.



