PROSAMPLER: IMPROVING CONTRASTIVE LEARNING BY BETTER MINI-BATCH SAMPLING

Abstract

In-batch contrastive learning has emerged as a state-of-the-art self-supervised learning solution, built on the philosophy of bringing semantically similar instances closer while pushing dissimilar instances apart within a mini-batch. However, the in-batch negative sharing strategy is limited by the batch size and falls short of prioritizing informative negatives (i.e., hard negatives) globally. In this paper, we propose to sample mini-batches with hard negatives on a proximity graph, in which the instances (nodes) are connected according to a similarity measure. Sampling on the proximity graph better exploits hard negatives globally by drawing similar instances from the entire dataset. The proposed method can flexibly explore the negatives by modulating two parameters, and we show that this flexibility is the key to better exploiting hard negatives globally. We evaluate the proposed method on three representative contrastive learning algorithms, each corresponding to one modality: image, text, and graph. We also apply it to variants of the InfoNCE objective to verify its generality. Results show that our method consistently boosts the performance of contrastive methods, with a relative improvement of 2.5% for SimCLR on ImageNet-100, 1.4% for SimCSE on the standard STS task, and 1.2% for GraphCL on the COLLAB dataset.

1. INTRODUCTION

Contrastive learning has become the dominant approach in self-supervised representation learning and is applied in many areas, such as MoCo (He et al., 2020) and SimCLR (Chen et al., 2020) in computer vision, GCC (Qiu et al., 2020) and GraphCL (You et al., 2020) in graph representation learning, and SimCSE (Gao et al., 2021) in natural language processing. The basic idea is to decrease the distance between the embeddings of the same instance (positive pair) while increasing that between different instances (negative pairs). These important contrastive learning works generally follow, or slightly modify, the in-batch contrastive learning framework:

$$\min \; \mathbb{E}_{\{x_1, \dots, x_B\} \subset \mathcal{D}} \left[ -\sum_{i=1}^{B} \log \frac{e^{f(x_i)^{\top} f(x_i^{+})}}{e^{f(x_i)^{\top} f(x_i^{+})} + \sum_{j \neq i} e^{f(x_i)^{\top} f(x_j)}} \right],$$

where $\{x_1, \dots, x_B\}$ is a mini-batch of samples (usually) sequentially loaded from the dataset $\mathcal{D}$, and $x_i^{+}$ is an augmented version of $x_i$. The encoder $f(\cdot)$ learns to discriminate instances by mapping different data-augmented versions of the same instance (positive pairs) to similar embeddings, and different instances in the mini-batch (negative pairs) to dissimilar embeddings. The key to its efficiency is the in-batch negative sharing strategy: every instance within the mini-batch serves as a negative for the others, so we learn to discriminate all $B(B-1)$ pairs of instances in a mini-batch while encoding each instance only once. Its simplicity and efficiency make it more popular than pairwise (Mikolov et al., 2013) or triplet-based methods (Schroff et al., 2015; Harwood et al., 2017), and it has gradually become the dominant framework for contrastive learning. However, the performance of in-batch contrastive learning is closely tied to the batch size, and obtaining a large (equivalent) batch size under limited computation and memory budgets is the target of many important contrastive learning methods.
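The in-batch objective above can be sketched in a few lines of NumPy. This is a minimal illustration of the formula as written (no temperature term; `z` and `z_pos` stand for the encoded batch $f(x_i)$ and its augmented views $f(x_i^{+})$, names of our choosing):

```python
import numpy as np

def info_nce_loss(z: np.ndarray, z_pos: np.ndarray) -> float:
    """In-batch InfoNCE: (z[i], z_pos[i]) is the positive pair for anchor i,
    and every other row z[j], j != i, serves as an in-batch negative."""
    pos = np.sum(z * z_pos, axis=1)            # (B,)  f(x_i)^T f(x_i^+)
    neg = (z @ z.T).astype(float)              # (B,B) f(x_i)^T f(x_j)
    np.fill_diagonal(neg, -np.inf)             # exclude self-similarity
    logits = np.concatenate([pos[:, None], neg], axis=1)   # (B, B+1)
    # numerically stable log-sum-exp over {positive} U {B-1 negatives}
    m = logits.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(log_denom - pos))
```

The loss is small when positives are far more similar than any in-batch negative, and grows as negative pairs become hard to distinguish from the positive.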
For example, Memory Bank (Wu et al., 2018) stores the encoded embeddings from previous mini-batches as extra negative samples, and MoCo (He et al., 2020) improves the consistency of the stored negatives via a momentum encoder. SimCLR shows that simply increasing the batch size of plain in-batch contrastive learning to 8,192 outperforms previous carefully designed methods. Although many works highlight the importance of batch size, a further question arises: which instances in the mini-batch contribute the most? Many related studies on negative sampling (Ying et al., 2018; Yang et al., 2020; Huang et al., 2021; Kalantidis et al., 2020; Robinson et al., 2021) support a clear answer: hard negative pairs contribute the most. An intuitive explanation is that $e^{f(x_i)^{\top} f(x_j)}$ for easy-to-discriminate negative pairs becomes very small after the early period of training, so the hard negative pairs contribute the majority of the loss and gradients. Hard negative sampling has already achieved great success in many real-world applications, e.g., an 8% improvement in Facebook search recall (Huang et al., 2020) and 15% relative gains for the Microsoft retrieval engine (Xiong et al., 2020). The key to these methods is to globally select negatives that are similar to the query across the whole dataset. However, previous methods for negative sampling in in-batch contrastive learning (Robinson et al., 2021; Chuang et al., 2020) focus on identifying negative samples within the current mini-batch, which is insufficient for mining meaningful negatives from the entire dataset. Meanwhile, previous global negative samplers apply a triplet loss and explore negatives in pairs (Karpukhin et al., 2020; Xiong et al., 2020), which is inapplicable to the in-batch negative sharing strategy, since they cannot guarantee the similarity between every instance pair within a mini-batch.
In this paper, we focus on designing a global hard negative sampler for in-batch contrastive learning. Since every instance serves as a negative to the other instances in the same batch, the desired sampling strategy is one that packs more hard-to-distinguish pairs into each sampled batch. This objective can be viewed as sampling a batch of similar instances from the dataset. But how can we identify such batches globally over the dataset?

Present Work. Here we propose Proximity Graph-based Sampler (ProSampler), a global hard negative sampling strategy that can be plugged into any in-batch contrastive learning method. The proximity graph breaks the independence between different instances and captures the relevance among instances to better perform global negative sampling. As shown in Figure 1, similar instances form a local neighborhood in the proximity graph, where ProSampler performs negative sampling as short random walks to effectively draw hard negative pairs. Besides, ProSampler can flexibly control the hardness of the sampled mini-batch by modulating two parameters. In practice, we build the proximity graph every fixed number of iterations, and then apply Random Walk with Restart (RWR) at each iteration to sample a mini-batch for training. Our experiments show that ProSampler consistently improves top-performing contrastive learning algorithms in different domains, including SimCLR (Chen et al., 2020) and MoCo v3 (Chen et al., 2021) in CV, SimCSE (Gao et al., 2021) in NLP, and GraphCL (You et al., 2020) in graph learning, by merely changing the mini-batch sampling step. To the best of our knowledge, ProSampler is the first algorithm to optimize the mini-batch sampling step for better negative sampling in the current in-batch contrastive learning framework.

2. RELATED WORK

Contrastive learning in different modalities. Contrastive learning follows a similar paradigm that contrasts similar and dissimilar observations based on noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010; Oord et al., 2018). The primary distinction between contrastive methods of different modalities is how they augment the data. For computer vision, MoCo (He et al., 2020), SimCLR (Chen et al., 2020), SwAV (Caron et al., 2020), and BYOL (Grill et al., 2020) augment data with geometric and appearance transformations. Beyond plain data augmentation, WCL (Zheng et al., 2021) additionally utilizes an affinity graph to construct positive pairs for each example within the mini-batch. For language, CLEAR (Wu et al., 2020b) and COCO-LM (Meng et al., 2021) augment text through word deletion, reordering, and substitution, while SimCSE (Gao et al., 2021) obtains augmented instances by applying the standard dropout twice. For graphs, DGI (Petar et al., 2018) and InfoGraph (Sun et al., 2019) treat node representations and the corresponding graph representations as positive pairs. Besides, GCC (Qiu et al., 2020) and GraphCL (You et al., 2020) augment graph data by graph sampling or proximity-oriented methods. Zhu et al. (2021) compares different kinds of graph augmentation strategies. Our proposed ProSampler is a general mini-batch sampler that can be directly applied to any in-batch contrastive learning framework in any of these modalities.

Negative sampling in contrastive learning. Previous studies on negative sampling in contrastive learning roughly fall into two categories: (1) Memory-based negative sampling strategies, such as MoCo (He et al., 2020), maintain a fixed-size memory bank to store negatives that are updated regularly during training. MoCHI (Kalantidis et al., 2020) proposes to mix the hard negative candidates at the feature level to generate more challenging negative pairs.
MoCoRing (Wu et al., 2020a) samples hard negatives from a defined conditional distribution that preserves a lower bound on the mutual information. (2) In-batch negative sharing strategies, such as SimCLR (Chen et al., 2020) and MoCo v3 (Chen et al., 2021), adopt the other instances in the current mini-batch as negatives. To mitigate the false negative issue, DCL (Chuang et al., 2020) modifies the original InfoNCE objective to reweight the contrastive loss. Huynh et al. (2022) identifies false negatives within a mini-batch by comparing the similarity between negatives and the anchor image's multiple support views. Additionally, HCL (Robinson et al., 2021) revises the original InfoNCE objective by assigning higher weights to hard negatives within the mini-batch. However, such locally sampled hard negatives cannot sufficiently exploit the hard negatives in the dataset. Global hard negative sampling methods on the triplet loss have been widely investigated, aiming to globally sample hard negatives for a given positive pair. For example, Wang et al. (2021) proposes to take the rank-k hard negatives from a set of randomly sampled negatives. Xiong et al. (2020) globally samples hard negatives via an asynchronously updated approximate nearest neighbor (ANN) index for dense text retrieval. Different from these methods, which are applied to a triplet loss for a given pair, our ProSampler samples mini-batches with hard negatives for the InfoNCE loss.

Self-supervised learning without negative sampling. Recently, some attempts at learning without negative sampling have achieved promising results, such as BYOL (Grill et al., 2020), SwAV (Caron et al., 2020), SimSiam (Chen and He, 2021), and DINO (Caron et al., 2021). These methods apply a Siamese network structure and contrast the outputs of an online network and a target network on different augmented views. The main difference among them is how they prevent the model from collapsing.

3.1. GLOBAL HARD NEGATIVE SAMPLING FOR IN-BATCH CONTRASTIVE LEARNING

Contrastive learning aims to learn a proper transformation that maps two semantically similar instances $x_i, x_j$ to two close points in the embedding space. It applies the NCE objective (Gutmann and Hyvärinen, 2010; Oord et al., 2018) and the in-batch negative sharing strategy to boost training efficiency, which means that every instance serves as a negative to the other instances within the mini-batch. How to sample a mini-batch with hard negatives for contrastive learning remains an open problem; previous methods achieve this by sampling within the mini-batch (Chuang et al., 2020; Robinson et al., 2021; Karpukhin et al., 2020). However, the batch size is far smaller than the dataset size, and sampling within the mini-batch cannot effectively explore the hard negatives in the whole dataset (Xiong et al., 2020; Zhang et al., 2013). In this work, we delve deeper into learning with a global hard negative sampler, which picks a batch of instances containing considerable hard-to-distinguish pairs. Besides, a desired mini-batch sampling strategy should be general and adaptable to datasets of various modalities and scales. We formulate the problem as:

Problem 1. Given a set of data instances $\mathcal{D} = \{x_1, \cdots, x_N\}$, our goal is to design a modality-independent sampler $g(\mathcal{D}) = \{x_i, \cdots, x_{(i+B)}\}$ that samples a mini-batch of instances in which any instance pair is hard to distinguish across the dataset.

3.2. TWO EXTREME STRATEGIES: UNIFORM SAMPLER AND KNN SAMPLER

Here we discuss two extreme mini-batch sampling strategies for in-batch contrastive learning, Uniform Sampler and kNN Sampler, which represent the two extremes in terms of the hardness of the mini-batches they construct. Uniform Sampler is the most common strategy in contrastive learning (Chen et al., 2020; Gao et al., 2021; You et al., 2020); it is general, easy to implement, and model-independent. The overall pipeline is to randomly sample a batch of instances at each training step and feed them into the objective function. kNN Sampler can globally sample a mini-batch with many hard negative examples: as its name indicates, it picks an instance at random and retrieves a set of its nearest neighbors to construct a batch. Figure 4 shows that a mini-batch sampled by kNN Sampler has a high percentage of similar instance pairs. However, these two methods suffer from the following limitations:

• Uniform Sampler neglects the effect of hard negatives (Kalantidis et al., 2020; Robinson et al., 2021) and tends to select negatives with low gradients that contribute little to optimization. As shown in Figure 4, Uniform Sampler results in a low percentage of similar instance pairs in a mini-batch. Yang et al. (2020) and Xiong et al. (2020) also prove theoretically that sampled negatives should be similar to the query instance, since such negatives provide meaningful gradients to the model.

• During self-supervised training, instances of the same class cluster together in the embedding space (Chen et al., 2020; Caron et al., 2020). Hence kNN Sampler initially retrieves hard negatives, but these are increasingly replaced by false negatives (FN) as the training epochs increase. Figure 4 also demonstrates that kNN Sampler exhibits a very high percentage of FN in a mini-batch.
In conclusion, Uniform Sampler cannot leverage hard negatives to guide the optimization of the model, whereas kNN Sampler explicitly samples hard negatives but suffers from the false negative issue. Both result in sub-optimal performance. A better global hard negative sampler for in-batch contrastive learning should trade off these two sampling styles, balancing the exploitation of hard negatives against the FN issue. Building on the above observations, we propose ProSampler, a flexible global mini-batch sampler that allows us to smoothly interpolate between the kNN Sampler and the Uniform Sampler.
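The two extreme samplers can be sketched as follows. This is a hypothetical NumPy illustration (function names and the `emb` embedding matrix are ours, not the paper's):

```python
import numpy as np

def uniform_batch(n: int, batch_size: int, rng: np.random.Generator) -> np.ndarray:
    """Uniform Sampler: draw a batch uniformly at random from the dataset."""
    return rng.choice(n, size=batch_size, replace=False)

def knn_batch(emb: np.ndarray, batch_size: int, rng: np.random.Generator) -> np.ndarray:
    """kNN Sampler: pick a random anchor and take its (batch_size - 1) nearest
    neighbors by inner product, so the batch is packed with hard pairs."""
    anchor = rng.integers(emb.shape[0])
    sims = emb @ emb[anchor]
    sims[anchor] = -np.inf                      # exclude the anchor itself
    neighbors = np.argsort(-sims)[: batch_size - 1]
    return np.concatenate(([anchor], neighbors))
```

On clustered embeddings, `knn_batch` fills the batch almost entirely with instances from the anchor's own cluster, which illustrates both its hardness and its false negative problem.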

3.3. PROSAMPLER

As discussed in Section 3.1, a desired mini-batch should be one where every example is a hard negative of the other examples. This objective can also be seen as sampling a group of instances that are close to one another in the embedding space. But how can we identify such groups globally from the dataset? As shown in Figure 2, we propose to capture the similarity relationships among instances with a proximity graph. The proximity graph connects instances according to a similarity measure, so that instances close to each other form a local community in the graph. We perform mini-batch sampling as a walk on the proximity graph, collecting the visited instances as the sampled batch. To modulate the hardness of a sampled batch, we introduce two parameters, M and α, which control the behavior of proximity graph construction and sampling, respectively.

[Figure 2: an input $x$ is augmented into views $x_i$, $x_j$, which are encoded by $f(\cdot)$ into representations $e_i$, $e_j$ and connected in the proximity graph.]

Proximity graph construction. Formally, we define the proximity graph as $G = (V, E)$, where the node set $V = \{v_1, \cdots, v_N\}$ denotes the data examples and $E \subseteq \{(v_i, v_j) \mid v_i, v_j \in V\}$ is a collection of node pairs. Let $N_i$ be the neighbor set of instance $v_i$ in the proximity graph. To construct $N_i$, we first form a candidate set $C_i = \{v_m\}$ for each instance $v_i$ by uniformly picking $M$ ($M \ll N$) neighbor candidates. Then we select the $K$ nearest ones from the candidate set:

$$N_i = \operatorname{TopK}_{v_m \in C_i}(e_i \cdot e_m),$$

where $\cdot$ is the inner product operation. $M$ controls the similarity between the center node and its immediate neighbors, which can be demonstrated by the following proposition:

Proposition 1. Given an observation $v_i$ with corresponding representation $e_i$, assume that there are at least $S$ observations whose inner product similarity with $v_i$ is larger than $s$, i.e., $|\{v_j \in V \mid e_i \cdot e_j > s\}| \ge S$. Then in the proximity graph $G$, the similarity between $v_i$ and its neighbors is larger than $s$ with approximate probability at least

$$P\{e_i \cdot e_k > s, \; \forall v_k \in N_i\} \gtrapprox 1 - p^{M/K},$$

where $p = \frac{N-S}{N}$ and $K$ is the number of neighbors.

The proof is deferred to Appendix B. The insight of Proposition 1 is to relate the candidate set size $M$ to the similarity between a node pair. A higher $M$ implies a greater probability that two adjacent nodes are similar, making the proximity graph more like the kNN graph. Conversely, if $M$ is low, randomly chosen instances are more likely to become neighbors, improving the diversity of the negatives around the center node.

Proximity graph sampling.
Breadth-first Sampling (BFS) and Depth-first Sampling (DFS) are two straightforward graph sampling methods (Grover and Leskovec, 2016), representing extreme scenarios in terms of the search space:

• Breadth-first Sampling (BFS) collects all of the current node's immediate neighbors, then moves to those neighbors and repeats the procedure until the number of collected instances reaches the batch size.

• Depth-first Sampling (DFS) randomly explores a branch of nodes as far as possible until the number of visited nodes reaches the batch size.

To interpolate between these two extremes, we adopt Random Walk with Restart (RWR): the sampler iteratively teleports back to the start node with probability α, or travels to a neighbor of the current position with probability proportional to the edge weight. The process continues until a fixed number of vertices have been collected, which are taken as the sampled batch. The key insight of using RWR is that it can modulate the probability of sampling within a neighborhood by adjusting α, which can be demonstrated by the following proposition:

Proposition 2. For all $0 < \alpha \le 1$ and $S \subset V$, the probability that a lazy Random Walk with Restart starting from a node $u \in S$ escapes $S$ satisfies

$$\sum_{v \in V \setminus S} p_u(v) \le \frac{1-\alpha}{2\alpha} \Phi(S),$$

where $p_u$ is the stationary distribution and $\Phi(S)$ is the graph conductance of $S$.

The proof is deferred to Appendix B. Proposition 2 indicates that the probability of RWR escaping from a local cluster (Andersen et al., 2006; Spielman and Teng, 2013) can be bounded by the graph conductance (Šíma and Schaeffer, 2006) and the restart probability α. In other words, a higher α makes the walker approximate BFS behavior and sample within a small locality, while a lower α encourages the walker to visit nodes further away from the center node.

ProSampler pipeline.
As shown in Algorithm 1, ProSampler serves as a mini-batch sampler and can be easily plugged into any in-batch contrastive learning method, such as SimCLR (Chen et al., 2020), MoCo v3 (Chen et al., 2021), SimCSE (Gao et al., 2021), and GraphCL (You et al., 2020). Specifically, during training, ProSampler first constructs the proximity graph, which is rebuilt every $t$ training steps, then selects a start node at random and samples a mini-batch on the proximity graph by RWR. ProSampler is orthogonal to the contrastive methods themselves. As shown in Figure 3, the number of candidates $M$ and the restart probability $\alpha$ are the keys to flexibly controlling the hardness of a sampled batch. When we set $M$ to the size of the dataset and $\alpha$ to 1, the proximity graph is equivalent to the kNN graph and the graph sampler only collects the immediate neighbors around a center node, which behaves similarly to the kNN Sampler. On the other hand, if $M$ is set to 1 and $\alpha$ to 0, RWR degenerates into DFS and chooses neighbors that are linked at random, so ProSampler performs as a Uniform Sampler. We provide an empirical criterion for choosing $M$ and $\alpha$ in Section 4.3.

Complexity. The time complexity of building the proximity graph is $O(NMd)$, where $N$ is the dataset size, $M$ is the candidate set size, and $d$ denotes the embedding size. This is practically efficient since $M$ is usually much smaller than $N$, and the process can be accelerated by embedding retrieval libraries such as Faiss (Johnson et al., 2019). More analysis on efficiency can be found in Appendix F.5. Besides, the space cost of ProSampler mainly comes from graph construction and graph storage; the total space complexity is $O(Nd + NK)$, where $K$ is the number of neighbors in the proximity graph.
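The two stages above can be sketched as follows. This is a simplified, unweighted illustration under our own naming (the paper's proximity graph may carry edge weights and be built with an ANN library such as Faiss; here each node scores $M$ uniform candidates by inner product and keeps the top $K$):

```python
import random
import numpy as np

def build_proximity_graph(emb: np.ndarray, M: int, K: int, seed: int = 0) -> dict:
    """For each node, score M uniformly drawn candidates and keep the K most
    similar (by inner product) as neighbors: O(N*M*d) per rebuild."""
    rng = np.random.default_rng(seed)
    n = emb.shape[0]
    graph = {}
    for i in range(n):
        cand = rng.choice(n - 1, size=min(M, n - 1), replace=False)
        cand[cand >= i] += 1                     # re-index so node i is excluded
        sims = emb[cand] @ emb[i]
        graph[i] = cand[np.argsort(-sims)[:K]].tolist()
    return graph

def rwr_sample(graph: dict, start: int, batch_size: int, alpha: float,
               seed: int = 0) -> list:
    """Random Walk with Restart: teleport to `start` with probability alpha,
    otherwise step to a uniformly chosen neighbor (unit edge weights assumed).
    Collects the first `batch_size` distinct visited nodes; assumes the
    component containing `start` has at least `batch_size` nodes."""
    rng = random.Random(seed)
    batch, seen, cur = [start], {start}, start
    while len(batch) < batch_size:
        cur = start if rng.random() < alpha else rng.choice(graph[cur])
        if cur not in seen:
            seen.add(cur)
            batch.append(cur)
    return batch
```

A training loop would rebuild the graph from the current encoder's embeddings every $t$ steps and call `rwr_sample` once per iteration; large $M$ with $\alpha \to 1$ recovers kNN-like sampling, while $M = 1$ with $\alpha = 0$ degenerates toward uniform-like sampling.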

4. EXPERIMENTS

To show the effectiveness of ProSampler in a variety of scenarios, we apply it to representative contrastive learning algorithms on three data modalities: image, text, and graph. Furthermore, to investigate ProSampler's generality, we equip two variants of the InfoNCE objective with our model: DCL (Chuang et al., 2020) and HCL (Robinson et al., 2021). The InfoNCE objective and its variants are described in Appendix C. The statistics of the datasets are summarized in Appendix D, and the detailed experimental settings can be found in Appendix E.

Results on Image Modality. We first adopt SimCLR (Chen et al., 2020) and MoCo v3 (Chen et al., 2021) as backbones based on ResNet-50 (He et al., 2016). We train the model for 800 epochs with a batch size of 2048 for SimCLR and 4096 for MoCo v3, then use linear probing to evaluate the representations on ImageNet. We also compare ProSampler with two state-of-the-art self-supervised learning methods without negative sampling, SwAV (Caron et al., 2020) and BYOL (Grill et al., 2020). As shown in Table 1, our proposed model consistently boosts the performance of the original SimCLR and MoCo v3, and outperforms all baselines without negatives, demonstrating the superiority of ProSampler. Besides, we evaluate ProSampler on further benchmark datasets in Appendix F.1.

4.1. BENCHMARKING RESULTS

Results on Text Modality. We evaluate ProSampler on learning the sentence representations by SimCSE (Gao et al., 2021) framework with pretrained BERT (Devlin et al., 2018) as backbone. The results of Table 2 suggest that ProSampler consistently improves the baseline models with an absolute gain of 1.09%∼2.91% on 7 semantic textual similarity (STS) tasks (Agirre et al., 2012; 2013; 2014; 2015; 2016; Tian et al., 2017; Marelli et al., 2014) . Specifically, we observe that when applying DCL and HCL, the performance of the self-supervised language model averagely drops by 2.45% and 3.08% respectively. As shown in Zhou et al. (2022) and Appendix F.2, the pretrained language model offers a prior distribution over the sentences, leading to a high cosine similarity of both positive pairs and negative pairs. So DCL and HCL, which leverage the similarity of positive and negative scores to tune the weight of negatives, are inapplicable because the high similarity scores of positives and negatives will result in homogeneous weighting. However, the hard negatives explicitly sampled by our proposed ProSampler can alleviate it, with an absolute improvement of 1.64% on DCL and 2.64% on HCL. The results of RoBERTa (Liu et al., 2019) are reported in Appendix F.3. In this section, we apply SimCLR on CIFAR10 and CIFAR100, and compare the Uniform Sampler, kNN Sampler and ProSampler in terms of performance and the false negatives to deepen the understanding of ProSampler. We show the performance of ProSampler with different M and α settings in Figure 3 , and use the name convention ProSampler (M, α). Besides, we illustrate the histogram of cosine similarity for all pairs from a sampled batch, and the percentage of false negatives within the mini-batch during training in Figure 4 . We can observe that although kNN Sampler can explicitly draw a data batch with similar pairs, it introduces a substantially higher number of false negatives, degrading performance significantly. 
Uniform Sampler is independent of the model so the percentage of FN within the sampled batch remains consistent during training. However, ProSampler can modulate M, α to find the best balance between these two sampling methods. We can observe that ProSampler can sample hard mini-batch but only exhibits a slightly higher percentage of false negatives than Uniform Sampler with optimal parameter setting, which enables ProSampler to achieve the best performance. Similar phenomenon on CIFAR100 can be found in Appendix F.4. 

4.3. EMPIRICAL CRITERION FOR PROSAMPLER

To analyze the impact of the neighbor candidate size M and the random walk restart probability α, we vary M and α in the ranges {500, 1000, 2000, 4000, 6000} and {0.1, 0.3, 0.5, 0.7} respectively, using SimCLR, SimCSE and GraphCL as backbones. We summarize the results in Table 4 and Table 5. Table 4 shows that in most cases the performance peaks at M = 1000 but drops quickly as M increases further. This is consistent with the intuition that a higher M raises the probability of selecting similar instances as neighbors, but also makes the sampler more likely to draw mini-batches with false negatives, degrading performance. Table 5 shows the performance of ProSampler with different α. To better understand the effect of α, we illustrate the histograms of cosine similarity for all pairs from a sampled batch after training in Figure 5, and plot the percentage of false negatives in the mini-batch during training in Figure 6. As α moves from 0.1 through 0.2 to 0.7, the cosine similarities gradually skew left, but more false negatives enter the batch, creating a trade-off. This indicates that a sampler with a higher α samples more frequently within a local neighborhood, which is more likely to yield similar pairs; however, as training progresses, instances of the same class group together, increasing the probability of collecting false negatives. To find the best balance, we linearly decay α from 0.2 to 0.05 as training proceeds, denoted 0.2 ∼ 0.05 in Table 5. This dynamic strategy achieves the best performance in all cases except SimCSE, which trains for only one epoch. Interestingly, SimCSE achieves the best performance by a large margin when α = 0.7, since hard negatives can alleviate the distribution issue brought by the pretrained language model. More analysis can be found in Section 4.1 and Appendix F.2.
To sum up, the suggested M is 500 for small-scale datasets and 1000 for larger ones. The suggested α should be relatively high (e.g., 0.7) for pretrained language model-based methods; for the other methods, dynamically decaying α (e.g., from 0.2 to 0.05) is the best strategy.

$$\ell_i = -\log \frac{e^{f(x_i)^{\top} f(x_i^{+})/\tau}}{\sum_{j=1}^{B} e^{f(x_i)^{\top} f(x_j^{+})/\tau}}.$$

Compared with SimCLR, GraphCL and SimCSE take only the other B − 1 augmented instances as negatives.

C.3 DCL AND HCL

DCL (Chuang et al., 2020) and HCL (Robinson et al., 2021) are two variants of the InfoNCE objective, which aim to alleviate the false negative issue or to mine hard negatives by reweighting the negatives in the objective. The main idea behind both is to use the positive distribution to correct the negative distribution. For simplicity, we write the positive score $e^{f(x_i)^{\top} f(x_i^{+})/\tau}$ as $\mathrm{pos}$, and the negative score $e^{f(x_i)^{\top} f(x_j^{+})/\tau}$ as $\mathrm{neg}_{ij}$. Given a mini-batch and a positive pair $(x_i, x_i^{+})$, the reweighted negative term proposed in DCL and HCL is

$$\max\left(\frac{\sum_{j=1}^{B} \lambda_{ij}\,\mathrm{neg}_{ij} - N_{\mathrm{neg}}\,\tau^{+}\,\mathrm{pos}}{1-\tau^{+}},\; e^{-1/\tau}\right),$$

where $N_{\mathrm{neg}}$ is the number of negatives in the mini-batch, $\tau^{+}$ is the class probability, $\tau$ is the temperature, and $\lambda_{ij}$ is a concentration parameter, simply set to 1 in DCL or computed as $\lambda_{ij} = \beta\,\mathrm{neg}_{ij} \,/\, (\sum_{j}\mathrm{neg}_{ij}/N_{\mathrm{neg}})$ in HCL. All of $\tau^{+}, \tau, \beta$ are tunable hyperparameters. The insight of this reweighting is that a negative pair whose score is closer to the positive score is assigned a lower weight in the loss function; in other words, the similarity difference between positive and negative pairs dominates the weighting function.
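As a concrete reading of the DCL/HCL reweighting, the sketch below computes the corrected negative term for one anchor. The function and its signature are our own illustration of the formula, not code from the DCL or HCL releases:

```python
import numpy as np

def reweighted_negative_term(pos, negs, tau, tau_plus, beta=None):
    """Debiased/hard negative term in the style of DCL (beta=None) and HCL.
    `pos` is the exponentiated positive score exp(f(x_i)^T f(x_i^+)/tau);
    `negs` holds the exponentiated negative scores neg_ij for anchor i."""
    n = len(negs)                         # N_neg
    if beta is None:
        lam = np.ones(n)                  # DCL: uniform concentration
    else:
        lam = beta * negs / negs.mean()   # HCL: up-weight harder negatives
    estimate = (lam * negs).sum() - n * tau_plus * pos
    # Clamp from below, mirroring the max(..., e^{-1/tau}) floor above.
    return max(estimate / (1.0 - tau_plus), np.exp(-1.0 / tau))
```

With `tau_plus = 0` and `beta = None` the term reduces to the plain sum of negative scores, recovering the vanilla InfoNCE denominator contribution.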

D DATASET DETAILS

For image representation learning, we adopt five benchmark datasets: CIFAR10, CIFAR100, STL10, ImageNet-100 and ImageNet ILSVRC-2012 (Russakovsky et al., 2015). Statistics of these datasets are summarized in Table 6. For graph-level representation learning, we conduct experiments on IMDB-B, IMDB-M, COLLAB and REDDIT-B (Yanardag and Vishwanathan, 2015), the details of which are presented in Table 7. For text representation learning, we evaluate the method on the one-million-sentence English Wikipedia dataset used in SimCSE, which can be downloaded from the HuggingFace repository.

E.1 IMAGE REPRESENTATIONS

In the image domain, we apply SimCLR (Chen et al., 2020) and MoCo v3 (Chen et al., 2021) as baseline methods, with ResNet-50 (He et al., 2016) as the encoder to learn image representations, following the setup of Wu et al. (2018). We employ two sampled data augmentation strategies to generate positive pairs, and implicitly use the other examples in the same mini-batch as negative samples. For CIFAR10, CIFAR100 and STL10, all models are trained for 1000 epochs with a default batch size B of 256. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. The temperature is set to 0.5 and the image embedding dimension to 128. For ImageNet-100 and ImageNet, we train the models for 100 and 400 epochs respectively, and use the LARS optimizer (You et al., 2019) with a learning rate of 0.3 × B/256 and a weight decay of 10^-6. Here, the batch size is 2048 for ImageNet and 512 for ImageNet-100. We fix the temperature at 0.1 and the image embedding dimension at 128. After the unsupervised learning, we train a supervised linear classifier for 100 epochs on top of the frozen learned representations. As for ProSampler, we update the proximity graph every 100 training iterations, and fix the number of neighbors K at 100 for CIFAR10, CIFAR100 and STL10.
The size of the neighbor candidate set M is set to 1000 for CIFAR100 and STL10, and 500 for CIFAR10. The initial restart probability α of the RWR is set to 0.2 and decays linearly to 0.05 over the course of training. For ImageNet-100 and ImageNet, we set M to 1000 and K to 500, and fix the restart probability α at 0.1.

E.2 GRAPH REPRESENTATIONS

In the graph domain, we use the GraphCL (You et al., 2020) framework as the baseline and GIN (Xu et al., 2018) as the backbone. We run ProSampler 5 times with different random seeds and report the mean 10-fold cross-validation accuracy with variance. We apply the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.01, and a 3-layer GIN with a fixed hidden size of 32. We set the temperature to 0.2 and gradually decay the restart probability of the RWR (0.2 ∼ 0.05). The proximity graph is updated every t iterations. The overall hyperparameter settings on the different datasets are summarized in Table 8.

E.3 TEXT REPRESENTATIONS

In the text domain, we use SimCSE (Gao et al., 2021) as the baseline method and adopt the pretrained BERT and RoBERTa provided by HuggingFace for sentence embedding learning. Following the training setting of SimCSE, we train the model for one epoch in an unsupervised manner and evaluate it on 7 STS tasks. The proximity graph is built only once, based on the pretrained language models, before training. For BERT, we set the batch size to 64 and the learning rate to 3 × 10^-5. For RoBERTa, the batch size is 512 and the learning rate is fixed at 10^-5. We keep the temperature at 0.05, the number of neighbor candidates M at 1000, the number of neighbors K at 500, and the restart probability α at 0.7 for both BERT and RoBERTa.

F ADDITIONAL EXPERIMENTS

F.1 EXTENSIVE STUDIES ON COMPUTER VISION

Here we evaluate ProSampler on two small-scale (CIFAR10, CIFAR100) and two medium-scale (STL10, ImageNet-100) benchmark datasets, and equip DCL (Chuang et al., 2020) and HCL (Robinson et al., 2021) with ProSampler to investigate its generality. Experimental results in Table 9 show that ProSampler consistently improves SimCLR and its variants on all the datasets, with an absolute gain of 0.3%∼2.5%. We also observe that the improvement is greater on the medium-scale datasets than on the small-scale ones.
Specifically, the model equipped with HCL and ProSampler achieves a significant improvement (6.23%) on STL10 over the original SimCLR.

F.2 ANALYSIS OF DCL AND HCL ON TEXT

To explain the performance degradation of the DCL and HCL objectives, we select 12 representative mini-batches and plot the cosine similarity histograms of positive and negative pairs for BERT (top) and RoBERTa (bottom) in Figure 7. We observe the following: (1) at the start of and throughout training, positive pairs are assigned a high cosine similarity (around 0.9) by the pretrained language model; (2) the negative similarities begin at a relatively high score and gradually skew left as self-supervised learning proceeds. This phenomenon is consistent with Zhou et al. (2022). DCL and HCL, which leverage the difference between positive and negative similarities to reweight the negative scores, are therefore inapplicable: the small gap between the positive and negative similarity distributions leads to homogeneous weighting in the objective.

F.3 TEXT REPRESENTATIONS WITH ROBERTA

We also apply ProSampler to SimCSE with pretrained RoBERTa, and present the results in Table 10. Similar to the results on BERT, ProSampler consistently improves the performance of the baseline model. Besides, as discussed in Section 4.1 and Appendix F.2, the hard negatives explicitly sampled by ProSampler can widen the small gap between the positive and negative score distributions caused by the pretrained language model, mitigating the performance degradation of DCL and HCL.

Table 12 presents an overall performance comparison with different graph sampling methods. In addition, we illustrate the histograms of cosine similarity for all pairs from a sampled batch after training, and plot the percentage of false negatives in the mini-batch during training in Figure 9. Although BFS brings the most similar pairs into the mini-batch, it performs worse than the original SimCLR since it introduces substantial false negatives. While having a slightly lower percentage of false negatives than RWR, DFS and RW do not achieve higher performance since they are unable to collect hard negatives in the mini-batch. The restart property lets RWR behave as a mixture of DFS and BFS, which can flexibly modulate the hardness of the sampled batch and find the best balance between hard and false negatives. Benefiting from this, RWR achieves the best performance among the sampling methods.

F.8.1 IMPACT OF BATCH SIZE B

To analyze the impact of the batch size B, we vary B in the range of {16, 32, 64, 128, 256} and summarize the results in Table 13. A larger batch size leads to better results, which is consistent with previous studies (Chen et al., 2020; He et al., 2020; Kalantidis et al., 2020).

F.8.2 IMPACT OF NEIGHBOR NUMBER K

In Figure 11, we investigate the impact of the neighbor number K on the ImageNet-100 dataset with the default ProSampler setting.
We observe an absolute improvement of 1.1% as the number of neighbors increases. Specifically, the model achieves an absolute performance gain of 0.9% from K = 100 to K = 300, but only 0.2% from K = 300 to K = 500. These results are consistent with our intuition: sampling more neighbors enlarges the proximity graph and encourages ProSampler to explore smaller local clusters (i.e., to sample harder negatives within a batch), leading to a significant improvement at first. However, performance degrades after reaching the optimum, because a larger K introduces more easy negatives.

F.8.3 PROXIMITY GRAPH UPDATE INTERVAL t

The proximity graph is updated every t training iterations. To analyze the impact of t, we vary t in the range of {50, 100, 200, 400} and summarize the results in Table 15. Update intervals that are too short (t = 50) or too long (t = 400) degrade performance. A possible reason is that sampling on a frequently updated proximity graph leads to unstable learning. On the other hand, the distribution of instances in the embedding space changes during training, shifting which negatives are hard; after enough iterations, a lazily updated graph can no longer adequately capture the similarity relationships.

F.10 CASE STUDY

To give an intuitive impression of the mini-batches sampled by ProSampler, we show some real cases of the negatives drawn by ProSampler and the Uniform Sampler in Figure 13. For a given anchor (a cat or a dog), we apply ProSampler and the Uniform Sampler to draw a mini-batch of images, and pick the images with the highest inner product with the anchor. Compared with the Uniform Sampler, the images sampled by ProSampler are clearly more semantically relevant to the anchor in terms of texture, background, or appearance.



https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt

https://huggingface.co/models

DCL and HCL are more like variants of the InfoNCE loss, adjusting the weights of negative samples in the original objective.



Figure 1: A motivating example of ProSampler. The generated image representations form an embedding space where Uniform Sampler randomly samples a mini-batch with easy negatives and ProSampler samples a mini-batch with hard negatives based on proximity graph.

Figure 2: The framework of ProSampler. The proximity graph is first constructed from the generated image representations and is updated every t training steps. Next, a proximity graph-based negative sampler generates a batch with hard negatives for in-batch contrastive learning.

Proximity graph construction. Recall that for a training dataset, we have N observations {v_i | i = 1, ..., N} and their corresponding representations {e_i | i = 1, ..., N} generated by the current encoder f(·). We formulate the proximity graph as:
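One plausible implementation of the construction step is sketched below, assuming candidates are drawn uniformly at random for each node (the paper's exact candidate-selection rule is given in its Algorithm 2, and the search can be accelerated with Faiss):

```python
import numpy as np

def build_proximity_graph(emb, M, K, rng=None):
    """Build a proximity graph: for every instance, score M randomly drawn
    candidates by inner product with the current embeddings `emb` (N x d)
    and connect the instance to its K most similar candidates."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = len(emb)
    neighbors = np.empty((N, K), dtype=int)
    for i in range(N):
        cand = rng.choice(N, size=M, replace=False)
        cand = cand[cand != i]                    # drop self-loop if drawn
        scores = emb[cand] @ emb[i]               # inner-product similarity
        neighbors[i] = cand[np.argsort(-scores)[:K]]  # keep top-K
    return neighbors
```

Because each node only scores M ≪ N candidates, the construction costs O(NMd), matching the complexity analysis above.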

Figure 3: Performance comparison of different in-batch samplers on image classification task.

Figure 5: Cosine similarities between the pairs.

Figure 7: Histograms of cosine similarity on BERT (top) and RoBERTa (bottom).

Figure 10: Percentage of false negative using different graph building methods over the training step.

Figure 11: Impact of neighbor number K.

F.9 TRAINING CURVE

We plot the training curves on STL10 and ImageNet-100. As shown in Figure 12, on the STL10 dataset ProSampler takes only about 600 epochs to reach the performance that the original SimCLR attains after 1000 epochs. A similar phenomenon can be seen on ImageNet-100. These results show that ProSampler enables better and faster learning.

Figure 12: Training curves for image classification task on STL10 and ImageNet-100.

Figure 13: Case study of the negatives sampled by ProSampler and Uniform Sampler based on the encoder trained for 100 epochs on ImageNet. Given an anchor image (Cat or Dog), (a,c) select 10 images with the highest similarity from a mini-batch sampled by ProSampler, and (b,d) randomly select 10 images from a mini-batch sampled by Uniform Sampler.

Algorithm 1: In-batch Contrastive Framework with ProSampler
Input: Dataset D = {x_i | i = 1, ..., N}, encoder f(·), batch size B, graph update step t, modality-specific augmentation functions T.
for iter ← 0, 1, ... do
    // ProSampler
    if iter % t == 0 then
        // Proximity graph construction
        Build the proximity graph G by Algorithm 2.
    end
    // Proximity graph sampling
    Randomly select a start node and draw the mini-batch {x_i}_B by Algorithm 3.
    Obtain positive pairs {(x_i, x_i^+)}_B via augmentation functions f_aug(·) ∼ T.
    Generate representations {(e_i, e_i^+)}_B with the encoder f(·).

Table 1: Top-1 accuracy under linear evaluation with the ResNet-50 backbone on ImageNet.

Table 2: Overall performance comparison with different negative sampling methods on STS tasks.

89% across all the datasets. Besides, equipped with ProSampler, DCL and HCL achieve better performance in 6 out of 8 cases. It can also be observed that ProSampler reduces variance in most cases, demonstrating that the hard negatives it exploits push the model to learn more robust representations.

Table 3: Accuracy on the graph classification task under the LIBSVM (Chang and Lin, 2011) classifier.

Table 4: Impact of the number of neighbor candidates M.

Table 5: Impact of the restart probability α.

Table 6: Statistics of datasets for the image classification task.

Table 7: Statistics of datasets for the graph-level classification task.

Table 8: Hyperparameter settings for graph-level representation learning.

Table 9: Overall performance comparison on the image classification task in terms of Top-1 accuracy.

Table 10: Performance comparison for sentence embedding learning based on RoBERTa.

Model               STS12  STS13  STS14  STS15  STS16  STS-B  SICK-R  Avg.
RoBERTa-base        67.90  80.91  73.14  80.58  80.74  80.26  69.87   76.20
  w/ ProSampler     68.29  81.96  73.86  82.16  80.94  80.77  69.30   76.75
DCL-RoBERTa-base    66.60  79.16  71.05  80.40  77.76  77.94  67.57   74.35
  w/ ProSampler     65.53  80.09  71.00  80.64  78.35  77.75  67.52   74.41
HCL-RoBERTa-base    67.20  80.47  72.44  80.88  80.57  78.79  67.98   75.49
  w/ ProSampler     66.01  80.79  73.58  81.25  80.66  79.22  68.52   75.72

F.4 COSINE SIMILARITY AND FALSE NEGATIVE RATIO ON CIFAR100

Here we compare the Uniform Sampler, the kNN Sampler, and ProSampler in terms of cosine similarity and false negatives on CIFAR100. Specifically, we show the histogram of cosine similarity for all pairs in a sampled batch, and the false negative ratio of the mini-batch, in Figure 8. ProSampler strikes a balance between the Uniform Sampler and the kNN Sampler: it samples hard negative pairs while introducing only a slightly greater number of false negatives than the Uniform Sampler. More analysis can be found in Section 4.2.

F.5 EFFICIENCY ANALYSIS

The amortized proximity graph construction cost (Cost_G/t) is the ratio of Cost_G to the graph update interval t. The time cost of ProSampler is shown in Table 11, from which we make the following observations: (1) sampling a mini-batch (Cost_S) takes an order of magnitude less time than training on a batch (Cost_T) in most cases; (2) although it takes 100s for ProSampler to construct a proximity graph on ImageNet, the cost is shared across t training steps, amounting to only Cost_G/t = 0.2 per batch. A similar phenomenon can be found on the other datasets. In particular, SimCSE trains for only one epoch, so its proximity graph is built only once.

Table 11: Time cost of mini-batch sampling by ProSampler on an NVIDIA V100 GPU.

F.6 COMPREHENSIVE ANALYSIS ABOUT STRATEGIES OF PROXIMITY GRAPH SAMPLING

We conduct an experiment to explore different choices of graph sampling methods, including (1) Depth First Search (DFS); (2) Breadth First Search (BFS); (3) Random Walk (RW); (4) Random Walk with Restart (RWR).

Table 12: Overall performance comparison with different graph sampling methods.

F.7 COMPARISON BETWEEN PROXIMITY GRAPH AND KNN GRAPH

To demonstrate the effectiveness of the proximity graph, we conduct an ablation study replacing the proximity graph with a kNN graph, which directly selects the k neighbors with the highest scores for each instance from the whole dataset. The neighbor number k is 100 by default. The comparison results are shown in Table 14: the proximity graph outperforms the kNN graph by a margin, and ProSampler with the kNN graph even performs worse than the original contrastive learning method because of the false negatives.

Table 14: Performance comparison of different graph construction methods.

To develop an intuitive understanding of how the proximity graph alleviates the false negative issue, Figure 10 plots the false negative ratio in a batch over the training steps. The results show that the proximity graph discards false negatives effectively: by the end of training on the CIFAR10 dataset, the kNN graph introduces more than 22% false negatives in a batch, while the proximity graph brings about 13%. A similar phenomenon can also be found on the CIFAR100 dataset.

Table 15: Performance comparison with different update intervals t on CIFAR10 and CIFAR100.

4.4. DISCUSSIONS

Due to the page limit, some additional experiments are reported in Appendix F. Appendix F.5 studies the efficiency of ProSampler. Appendix F.6 compares different graph sampling methods in terms of performance, cosine similarity, and false negatives. Appendix F.7 compares the performance of ProSampler with proximity graph and kNN graph. Appendix F.8 discusses the influence of some parameters, including batchsize B, neighbor number K, and proximity graph update interval t. Appendix F.9 presents the training curves. Appendix F.10 includes case studies where we show some real cases of the mini-batch sampled by ProSampler and Uniform Sampler.

5. CONCLUSION

In this paper, we study the problem of global hard negative sampling for in-batch contrastive learning. We reformulate the original mini-batch sampling problem as a proximity graph sampling problem. Based on this, we propose a proximity graph-based sampling framework, ProSampler, which samples a mini-batch with hard negative pairs for in-batch contrastive learning at each training step. We conduct experiments on three state-of-the-art contrastive methods with different modalities and on two variants of the InfoNCE objective, which show that ProSampler can consistently improve these models.

A ALGORITHM DETAIL

B THEORETICAL PROOF

Proposition 1. Given an observation v_i with the corresponding representation e_i, assume that there are at least S observations whose inner product similarity with v_i is larger than s, i.e., (1). Then, in the proximity graph G, the similarity between v_i and its neighbors is larger than s with approximate probability at least (2), where p = (N − S)/N and K is the number of neighbors.

Proof. Since M ≪ N, we can approximately assume that the sampling is with replacement. In this case, we have:

Then let us prove (2) by induction. When K = 1, the conclusion clearly holds. Assuming that the conclusion holds when K = L − 1, let us consider the case K = L. We have:

To prove the conclusion, we only need to show:

or, equivalently, (6). On the other hand, according to Knuth (1997), we have (7), where e denotes Euler's number. Substituting (7) into (6), we only need to show:

The above relation holds depending on the choices of M, S and L, which can be approximately satisfied in our scenario.

Proposition 2. For all 0 < α ≤ 1 and S ⊂ V, the probability that a Lazy Random Walk with Restart starting from a node u ∈ S escapes S satisfies Σ_{v∈V∖S} p_u(v) ≤ ((1 − α)/(2α)) Φ(S), where p_u is the stationary distribution and Φ(S) is the graph conductance of S.

Proof. We first introduce the definitions of graph conductance (Šíma and Schaeffer, 2006) and Lazy Random Walk (Spielman and Teng, 2013).

Graph Conductance. For an undirected graph G = (V, E), the graph volume of a node set S ⊂ V is defined as vol(S) = Σ_{v∈S} d(v), where d(v) is the degree of node v. The edge boundary of a node set is defined as ∂(S) = {(x, y) ∈ E | x ∈ S, y ∉ S}. The conductance of S is then Φ(S) = |∂(S)| / min(vol(S), vol(V ∖ S)).

Lazy Random Walk. A Lazy Random Walk (LRW) is a variant of the random walk which starts at a node and, at each step, stays at the current position with probability 1/2 or travels to a neighbor.
The transition matrix of a lazy random walk is M ≜ (I + AD^{-1})/2, where I denotes the identity matrix, A is the adjacency matrix, and D is the degree matrix. The K-step Lazy Random Walk distribution starting from a node u is defined as q^{(K)} ← M^K 1_u.

We then present a theorem relating the Lazy Random Walk to graph conductance, which was proved in Spielman and Teng (2013):

Theorem 1. For all K ≥ 0 and S ⊂ V, the probability that a K-step Lazy Random Walk starting at u ∈ S escapes S satisfies q^{(K)}(V ∖ S) ≤ K Φ(S)/2.

Theorem 1 guarantees that, given a non-empty node set S ⊂ V and a start node u ∈ S, the Lazy Random Walker is likely to remain stuck in S. Here we extend the LRW to the Lazy Random Walk with Restart (LRWR), which returns to the start node with probability α or otherwise performs a Lazy Random Walk step. According to previous studies (Page et al., 1999; Avrachenkov et al., 2014; Chung and Zhao, 2010; Tong et al., 2006), we can obtain a stationary distribution p_u by recursively performing the Lazy Random Walk with Restart, which can be formulated as the linear system p_u = α 1_u + (1 − α) M p_u, where α denotes the restart probability. The element p_u(v) represents the probability that the walker starting at u ends at v. p_u can be expressed as a geometric sum of Lazy Random Walks (Chung and Tsiatas, 2010): p_u = α Σ_{k≥0} (1 − α)^k M^k 1_u. Applying Theorem 1, we have (12). The desired result is obtained by comparing the two sides of (12).

In particular, the only difference between the Lazy Random Walk with Restart and the Random Walk with Restart is that the former may remain in the current position without taking any action; they are equivalent when sampling a predetermined number of nodes.
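The final bound can be made explicit. The expansion below is our own spelling-out of the step above; it combines Theorem 1 with the geometric-sum expression for p_u:

```latex
p_u(V \setminus S)
  = \alpha \sum_{k \ge 0} (1-\alpha)^k \, q^{(k)}(V \setminus S)
  \le \alpha \sum_{k \ge 0} (1-\alpha)^k \cdot \frac{k\,\Phi(S)}{2}
  = \frac{\alpha\,\Phi(S)}{2} \cdot \frac{1-\alpha}{\alpha^2}
  = \frac{1-\alpha}{2\alpha}\,\Phi(S),
```

using the identity $\sum_{k \ge 0} k x^k = x/(1-x)^2$ with $x = 1 - \alpha$, which matches the bound stated in Proposition 2.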

C INFONCE OBJECTIVE AND ITS VARIANTS

Here we describe in detail the objective functions of three in-batch contrastive learning methods: SimCLR (Chen et al., 2020), GraphCL (You et al., 2020) and SimCSE (Gao et al., 2021). We also cover two variants, DCL (Chuang et al., 2020) and HCL (Robinson et al., 2021), which are applied in the experiments.

C.1 SIMCLR

SimCLR (Chen et al., 2020) first uniformly draws a mini-batch of instances {x_1 ... x_B} ⊂ D, then augments the instances with two randomly sampled augmentation strategies f_aug(·), f'_aug(·) ∼ T, resulting in 2B data points. Two augmented views (x_i, x_{i+B}) of the same image are treated as a positive pair, while the other 2(B − 1) examples are negatives. The objective applied in SimCLR for a positive pair (x_i, x_{i+B}) is formulated as:

$$\ell_i = -\log \frac{e^{f(x_i)^{\top} f(x_{i+B})/\tau}}{\sum_{k=1, k \neq i}^{2B} e^{f(x_i)^{\top} f(x_k)/\tau}},$$

where τ is the temperature and f(·) is the encoder. The loss is computed for all positive pairs in a mini-batch, including both (x_i, x_{i+B}) and (x_{i+B}, x_i). Note that SimCLR takes all 2(B − 1) augmented instances within a mini-batch as negatives.
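The SimCLR objective can be sketched numerically. The following is a minimal NumPy version, assuming L2-normalized embeddings arranged so that rows i and i+B are the two views of instance i; the helper is our own illustration, not SimCLR's reference code:

```python
import numpy as np

def ntxent_loss(z, tau=0.5):
    """NT-Xent (SimCLR) loss for 2B L2-normalized embeddings `z`, where
    rows i and i+B are the two augmented views of the same instance.
    Every other row in the batch serves as a negative."""
    n = len(z)                       # n = 2B
    sim = z @ z.T / tau              # pairwise similarities (z normalized)
    np.fill_diagonal(sim, -np.inf)   # exclude self-pairs from the denominator
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = np.concatenate([np.arange(n // 2, n), np.arange(0, n // 2)])
    return -log_prob[np.arange(n), pos].mean()  # average over all 2B anchors
```

Averaging over all rows covers both orderings (x_i, x_{i+B}) and (x_{i+B}, x_i), as in the loss above.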

C.2 GRAPHCL AND SIMCSE

Similar to SimCLR, the objective functions of GraphCL (You et al., 2020) and SimCSE (Gao et al., 2021) are defined over augmented instance pairs within a mini-batch. Given a sampled mini-batch {x_1 ... x_B} ⊂ D, both GraphCL and SimCSE apply data augmentation to obtain positive pairs, and the loss function for a positive pair (x_i, x_i^+) can be formulated as:

