CONDITIONAL NEGATIVE SAMPLING FOR CONTRASTIVE LEARNING OF VISUAL REPRESENTATIONS

Abstract

Recent methods for learning unsupervised visual representations, dubbed contrastive learning, optimize the noise-contrastive estimation (NCE) bound on mutual information between two transformations of an image. NCE typically uses randomly sampled negative examples to normalize the objective, but this sample may include many uninformative examples, either because they are too easy or too hard to discriminate. Taking inspiration from metric learning, we show that choosing semi-hard negatives can yield stronger contrastive representations. To do this, we introduce a family of mutual information estimators that sample negatives conditionally, in a "ring" around each positive. We prove that these estimators remain lower bounds of mutual information, with higher bias but lower variance than NCE. Experimentally, we find that our approach, applied on top of existing models (IR, CMC, and MoCo), improves accuracy by 2-5 absolute percentage points in each case, measured by linear evaluation on four standard image benchmarks. Moreover, we find continued benefits when transferring features to a variety of new image distributions from the Meta-Dataset collection and to a variety of downstream tasks such as object detection, instance segmentation, and key-point detection.

1. INTRODUCTION

Supervised learning has given rise to human-level performance in several visual tasks (Russakovsky et al., 2015; He et al., 2017), relying heavily on large image datasets paired with semantic annotations. These annotations vary in difficulty and cost, spanning from simple class labels to more granular descriptions like bounding boxes and key-points. As it is impractical to scale high-quality annotation, this reliance on supervision poses a barrier to widespread adoption. While supervised pretraining is still the dominant approach in computer vision, recent studies using unsupervised "contrastive" objectives have achieved remarkable results in the last two years, closing the gap to supervised baselines.



be important for downstream performance (Tian et al., 2020). In this paper, we present a new estimator of mutual information, based on the popular noise-contrastive estimator (NCE), that supports sampling negatives from conditional distributions. We summarize our contributions below:

1. We prove that our Conditional-NCE (CNCE) objective lower bounds mutual information. Further, we show that although CNCE is a looser bound than NCE, it has lower variance. This motivates its value for representation learning.

2. We use CNCE to generalize contrastive algorithms that utilize a memory structure, like IR, CMC, and MoCo, to sample semi-hard negatives in just a few lines of code and with minimal compute overhead.

3. We find that the naive strategy of sampling hard negatives throughout training can be detrimental. We then show that slowly introducing harder negatives yields good performance.

4. On four image classification benchmarks, we find improvements of 2-5 absolute percentage points. We also find consistent improvements (1) when transferring features to new image datasets and (2) in object detection, instance segmentation, and key-point detection.

2. BACKGROUND

We focus on exemplar-based contrastive objectives, where examples are compared to one another to learn a representation. Many of these objectives (Hjelm et al., 2018; Wu et al., 2018; Bachman et al., 2019; Tian et al., 2019; Chen et al., 2020a) are equivalent to NCE (Oord et al., 2018; Poole et al., 2019), a popular lower bound on the mutual information, denoted by I, between two random variables. This connection is well known and stated in several works (Chen et al., 2020a; Tschannen et al., 2019; Tian et al., 2020; Wu et al., 2020). To review, recall:

I(X; Y) ≥ I_NCE(X; Y) = E_{x_i, y_i ∼ p(x,y)} E_{y_{1:k} ∼ p(y)} [ log ( e^{f_θ(x_i, y_i)} / ( (1/(k+1)) Σ_{j ∈ {i, 1:k}} e^{f_θ(x_i, y_j)} ) ) ]   (1)

where x, y are realizations of two random variables, X and Y, and f_θ : X × Y → R is a similarity function. We call y_{1:k} = {y_1, ..., y_k} negative examples, being other realizations of Y.

Suppose the two random variables in Eq. 1 are both transformations of a common random variable X. Let T be a family of transformations where each member t is a composition of cropping, color jittering, and Gaussian blurring, among others (Wu et al., 2018; Bachman et al., 2019; Chen et al., 2020a). We call a transformed input t(x) a "view" of x. Let p(t) denote a distribution over T, a common choice being uniform. Next, introduce an encoder g_θ : X → S^{n−1} that maps an example to an L2-normalized representation. Suppose we have a dataset D = {x_i}_{i=1}^n of n values for X sampled from a distribution p(x). Then, the contrastive objective for the i-th example is:

L(x_i) = E_{t, t', t_{1:k} ∼ p(t)} E_{x_{1:k} ∼ p(x)} [ log ( e^{g_θ(t(x_i))^T g_θ(t'(x_i)) / τ} / ( (1/(k+1)) Σ_{j ∈ {i, 1:k}} e^{g_θ(t(x_i))^T g_θ(t_j(x_j)) / τ} ) ) ]   (2)

where τ is a temperature. The equivalence of Eq. 2 to NCE is immediate given f_θ(x, y) = g_θ(x)^T g_θ(y) / τ. Maximizing Eq. 2 chooses an embedding that pulls two views of the same example together while pushing views of distinct examples apart.
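For concreteness, Eq. 2 for a single example can be sketched in NumPy as follows. This is a minimal illustration with our own function and variable names, not code from any cited implementation; the constant 1/(k+1) is dropped since it only shifts the objective by log(k+1) and does not affect gradients.

```python
import numpy as np

def logsumexp(v):
    # numerically stable log-sum-exp
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def info_nce(anchor, positive, negatives, tau=0.07):
    """Contrastive loss for one example: `anchor` and `positive` are
    L2-normalized embeddings of two views of x_i, and `negatives` holds k
    embeddings of views of other examples (names are ours)."""
    # similarity logits: the positive pair first, then the k negatives
    logits = np.concatenate(([anchor @ positive], negatives @ anchor)) / tau
    # negative log-softmax of the positive against all k+1 candidates
    return logsumexp(logits) - logits[0]
```

Minimizing this loss is equivalent to maximizing the term inside the expectation of Eq. 2: the loss shrinks as the positive pair becomes more similar than the negatives.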
A drawback to this framework is that the number of negatives k must be large to faithfully approximate the true partition function. In practice, k is limited by memory. Recent innovations have focused on tackling this challenge. Instance Discrimination (Wu et al., 2018), or IR, introduces a memory bank M of n entries that caches an embedding of each example throughout training: since each example is observed once per epoch, the i-th entry of M stores the embedding of the view of the i-th example observed in the previous epoch. Representations stored in the memory bank are removed from the automatic differentiation tape, but in return, we can choose a large k by querying M. A follow-up work, Contrastive Multiview Coding (Tian et al., 2019), or CMC, decomposes an image into two color modalities and sums two IR losses in which the memory banks for the two modalities are swapped. Momentum Contrast (He et al., 2019), or MoCo, observed that the representations stored in the memory bank grow stale, since possibly thousands of optimization steps pass before an entry is updated again. MoCo therefore makes two important changes. First, it replaces the memory bank with a first-in first-out (FIFO) queue of size k: during each minibatch, representations are cached into the queue while the most stale ones are removed. Second, MoCo introduces a second (momentum) encoder g'_θ' as a slowly updated copy of g_θ. The primary encoder g_θ embeds one view of x_i whereas the momentum encoder embeds the other; again, gradients are not propagated to g'_θ'. In this work, we focus on contrastive algorithms that utilize a memory structure, which we repurpose in Sec. 4 to efficiently sample hard negatives. In Sec. 7, we briefly discuss generalizations to contrastive algorithms that do not use a memory structure.
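The two MoCo ingredients described above can be sketched as follows. This is a minimal illustration with our own naming, not MoCo's reference implementation; neither the queue nor the momentum update receives gradients.

```python
import collections
import numpy as np

class MoCoMemory:
    """Sketch of MoCo's memory: a FIFO queue of up to k cached key
    embeddings and a momentum update for the key encoder's parameters."""

    def __init__(self, k, m=0.999):
        self.queue = collections.deque(maxlen=k)  # oldest entries fall out first
        self.m = m

    def enqueue(self, keys):
        # cache this minibatch's key embeddings; staleness is bounded by k
        for key in keys:
            self.queue.append(key)

    def negatives(self):
        # matrix of up to k negatives for the contrastive denominator
        return np.stack(list(self.queue))

    def momentum_update(self, theta_q, theta_k):
        # theta_k <- m * theta_k + (1 - m) * theta_q, applied per parameter
        return [self.m * pk + (1.0 - self.m) * pq
                for pq, pk in zip(theta_q, theta_k)]
```

Because entries are evicted in FIFO order, no cached embedding is ever more than k insertions old, which is the staleness guarantee motivating MoCo's design.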

3. CONDITIONAL NOISE CONTRASTIVE ESTIMATION

In NCE, the negative examples are sampled i.i.d. from the marginal distribution p(y). Indeed, the existing proof that NCE lower bounds mutual information (Poole et al., 2019) assumes this to be true. However, choosing negatives in this manner may not be the best choice for learning a good representation. For instance, prior work in metric learning has shown the effectiveness of semi-hard negative mining in optimizing triplet losses (Wu et al., 2017; Yuan et al., 2017; Schroff et al., 2015). We similarly wish to exploit choosing semi-hard negatives in NCE conditional on the current example, but to do so in a manner that preserves the lower bound on mutual information. In presenting the theory, we assume two random variables X and Y, deriving a general bound; we will return to the contrastive learning setting in Sec. 4.

To begin, in Eq. 1, suppose we sample negatives from a distribution q(y|x) conditional on a value x ∼ p(x), rather than from the marginal p(y), which is independent of X. Ideally, we would like to freely choose q(y|x) to be any distribution, but not all choices preserve a bound on mutual information (we provide a counterexample in Sec. A.1). This does not, however, imply that we can only sample negatives from p(y) (Poole et al., 2019; Oord et al., 2018). One of our contributions is to formally define a family of conditional distributions Q such that for any q(y|x) ∈ Q, drawing negative examples from q defines an estimator that lower bounds I(X; Y). We call this new bound the Conditional Noise Contrastive Estimator, or CNCE. We first prove CNCE to be a bound:

Theorem 3.1. (The Conditional NCE bound) Define d-dimensional random variables X and Y by a joint distribution p(x, y) and let Y_1, ..., Y_k be i.i.d. copies of Y with the marginal distribution p(y). Fix any function f : X × Y → R, any realization x of X, and let c = E_{y ∼ p(y)}[e^{f(x,y)}], the expected exponentiated similarity. Pick a set B ⊂ R strictly lower-bounded by c. Assume the pulled-back set S_B = {y | e^{f(x,y)} ∈ B} has non-zero probability (i.e. p(S_B) > 0). For A_1, ..., A_k in the Borel σ-algebra over R^d, define A = A_1 × ... × A_k and let

q((Y_1, ..., Y_k) ∈ A | X = x) = Π_{j=1}^k p(A_j | S_B).

Let

I_CNCE(X; Y) = E_{x,y ∼ p(x,y)} E_{y_1,...,y_k ∼ q(y_1,...,y_k | x)} [ log ( e^{f(x,y)} / ( (1/k) Σ_{j=1}^k e^{f(x,y_j)} ) ) ].

Then I_CNCE ≤ I_NCE.

Proof.
To show I_CNCE ≤ I_NCE, we show E_p[log Σ_{j=1}^k e^{f(x,y_j)}] ≤ E_q[log Σ_{j=1}^k e^{f(x,y_j)}]. To see this, apply Jensen's inequality to the left-hand side of

log E_p[ Σ_{j=1}^k e^{f(x,y_j)} ] < log Σ_{j=1}^k e^{f(x,y_j)},

which holds if y_j ∈ S_B for j = 1, ..., k, and then take the expectation E_q of both sides. The last inequality holds by monotonicity of log, linearity of expectation, and the fact that E_p[e^{f(x,y_j)}] ≤ e^{f(x,y_j)} for y_j ∈ S_B (since B is strictly lower-bounded by c).

Theorem Intuition. Although using arbitrary negative distributions in NCE does not bound mutual information, we have found a restricted class of distributions Q where every member q(y|x) "subsets the support" of the distribution p(y). That is, given some fixed value x, we have defined q(y|x) to constrain the support of p(y) to a set S_B whose members are "close" to x as measured by the similarity function f. For every element y ∈ S_B, the distribution q(y|x) assigns the same relative probability as p(y). However, as q(y|x) is not defined outside of S_B, we must renormalize it to sum to one (hence p(A_j | S_B) = p(A_j ∩ S_B) / p(S_B)). Intuitively, q(y|x) cannot change p(y) too much: it must redistribute mass proportionally. The primary distinction, then, is the smaller support of q(y|x), which forces samples from it to be harder for f to distinguish from x. Thm. 3.1 shows that substituting q(y|x) for p(y) in NCE still bounds mutual information.

Figure 1: Visual illustration of Ring Discrimination. (a) IR, CMC, MoCo; (b) Ring; (c) Annealed Ring. Black: view t(x_i) of example x_i; gray: second view t'(x_i); red: negative samples; gray area: the distribution q(x|t(x_i)). In subfigure (c), the negative samples are annealed to be closer to t(x_i) through training; in other words, the support of q shrinks.

Example 3.1.
We give a concrete example of the choice of B that will be used in Sec. 4. For any realization x, suppose we define two similarity thresholds ω_ℓ, ω_u ∈ R where c < ω_ℓ < ω_u. Then, choose B = [ω_ℓ, ω_u]. In this case, the set S_B, which defines the support of the distribution q(y|x), contains values of y that are not "too close" to x but not "too far." In contrastive learning, we might pick these similarity thresholds to vary the difficulty of negative samples.

Interestingly, Thm. 3.1 states that CNCE is looser than NCE, which raises the question: when is a looser bound useful? In reply, we show that while CNCE is a more biased estimator than NCE, in return it has lower variance. Intuitively, because q(y|x) is the result of restricting p(y) to a smaller support, samples from q(y|x) have less opportunity to deviate, hence lower variance. Formally:

Theorem 3.2. (Bias and Variance Tradeoff) Pick any x, y ∼ p(x, y). Fix the distribution q(y_{1:k}|x) as stated in Theorem 3.1. Define a new random variable

Z(y_{1:k}) = log ( e^{f(x,y)} / ( (1/k) Σ_{j=1}^k e^{f(x,y_j)} ) ),

representing the normalized similarity. By Theorem 3.1, the expressions E_{p(y_{1:k})}[Z] and E_{q(y_{1:k}|x)}[Z] are estimators of I(X; Y). Suppose that the set S_B is chosen to ensure Var_{q(y_{1:k}|x)}[Z] ≤ Var_{q̄(y_{1:k}|x)}[Z], where q̄(A) = p(A | S_B^c), the distribution conditioned on the complement of S_B. That is, we assume the variance of the normalized similarity when using y_{1:k} ∈ S_B is smaller than when using y_{1:k} ∉ S_B. Then

Bias_{p(y_{1:k})}(Z) ≤ Bias_{q(y_{1:k}|x)}(Z)  and  Var_{p(y_{1:k})}(Z) ≥ Var_{q(y_{1:k}|x)}(Z).

The proof can be found in Sec. A.2. Thm. 3.2 provides one answer to our question of looseness. In stochastic optimization, a lower-variance objective may lead to better local optima. For representation learning, using CNCE to sample more difficult negatives may (1) encourage the representation to distinguish fine-grained features useful in transfer tasks, and (2) provide less noisy gradients.
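To make the tradeoff concrete, the following toy simulation (our own construction, not an experiment from this paper) compares the partition term of Z when negatives come from the marginal p versus from a ring-shaped q obtained by rejection sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: f(x, y) = y with x fixed and p(y) = N(0, 1), so
# c = E_p[e^y] = e^{0.5} ~ 1.65. Choose B = [e^{0.6}, e^{1.5}], strictly
# above c, giving the ring support S_B = {y : 0.6 <= y <= 1.5}.

def sample_p(k):
    return rng.standard_normal(k)

def sample_q(k):
    # rejection-sample p restricted to S_B and renormalized, i.e. q(y|x)
    ys = []
    while len(ys) < k:
        y = rng.standard_normal()
        if 0.6 <= y <= 1.5:
            ys.append(y)
    return np.array(ys)

def partition_term(sampler, k=10, trials=2000):
    """Samples of log((1/k) * sum_j e^{f(x, y_j)}), the partition term in Z."""
    return np.array([np.log(np.mean(np.exp(sampler(k)))) for _ in range(trials)])

zp = partition_term(sample_p)  # negatives from the marginal (NCE)
zq = partition_term(sample_q)  # negatives from the ring (CNCE)
# The ring's partition term is larger on average (a looser, more biased
# bound) but fluctuates less, in line with Theorem 3.2.
print(zq.mean() - zp.mean(), zp.var() - zq.var())
```

Since the positive term of Z does not depend on the negatives, the bias and variance of the two estimators differ only through this partition term; restricting it to the bounded interval B removes the heavy tail of e^y and shrinks the variance.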

4. RING DISCRIMINATION

We have shown CNCE to be a new bound on mutual information that uses hard negative samples. Now we wish to apply CNCE to contrastive learning, where the two random variables are again transformations of a single variable X. In this setting, for a fixed x_i ∼ p(x), the CNCE distribution is written as q(x|t(x_i)) for some transform t ∈ T. Samples x ∼ q(x|t(x_i)) will be such that the exponentiated similarity, exp{g_θ(t(x_i))^T g_θ(t'(x))}, is at least a minimum value c. As in Example 3.1, we choose B = [ω_ℓ, ω_u], a closed interval in R defined by two thresholds.

Picking thresholds. We pick the thresholds conditioned on the i-th example in the dataset; hence each example has a different set B. We first describe how to pick the upper threshold ω_u. Given the i-th example x_i, we pick a number u ∈ [0, 100] representing an upper "percentile." We consider each example x in the dataset to be in the support S_B if and only if the (exponentiated) similarity between the embeddings of x_i and x, namely exp{g_θ(t(x_i))^T g_θ(t'(x))}, is below the u-th percentile over all x ∈ D. Call this maximum similarity ω_u. In other words, we construct q(x|t(x_i)) so as to ignore examples from the dataset whose embedding, dotted with the embedding of x_i, yields a value above ω_u. (Note that u = 100 recovers NCE.) For a small enough choice of u, the upper similarity threshold ω_u will be greater than c (defined in Thm. 3.1 as the expected exponentiated similarity with respect to p(x)), and the samples from q(x|t(x_i)) will be harder negatives to discriminate from x_i. In picking the lower threshold ω_ℓ, one could choose it to be 0, so B = [0, ω_u). However, picking the closest examples to t(x_i) as its negative examples may be inappropriate, as these examples might be better suited as positive views rather than negatives (Zhuang et al., 2019; Xie et al., 2020).
As an extreme case, if the same image is included in the dataset twice, we would not like to select it as a negative example for itself. Furthermore, choosing negatives "too close" to the current instance may result in representations that pick up on fine-grained details only, ignoring larger semantic concepts. This suggests removing from q(x|t(x_i)) examples we consider "too close" to x_i. To do this, we pick a lower percentile 0 ≤ ℓ < u. For each example x ∈ D, we say it is in S_B if exp{g_θ(t(x_i))^T g_θ(t'(x))} is below ω_u and also above the ℓ-th percentile of all distances with respect to D. Call this minimum distance ω_ℓ. Fig. 2 visualizes this whole procedure.

Figure 2: Defining the CNCE distribution q(x|t(x_i)). By choosing a lower and upper percentile ℓ and u, we implicitly define similarity thresholds ω_ℓ and ω_u to construct a support of valid negative examples, S_B, which in turn defines the distribution q(x|t(x_i)).

Algorithm 1: MoCoRing (PyTorch-like pseudocode)

    # g_q, g_k: encoder networks
    # m: momentum; t: temperature
    # u: ring upper percentile
    # l: ring lower percentile
    tx1 = aug(x)  # random augmentation
    tx2 = aug(x)
    emb1 = norm(g_q(tx1))
    emb2 = norm(g_k(tx2)).detach()
    dps = sum(emb1 * emb2) / t  # dot product with the positive
    # sort negatives from closest to farthest
    all_dps = sort(emb1 @ queue.T / t)
    # find indices of thresholds
    ix_l = l * len(queue)
    ix_u = u * len(queue)
    ring_dps = all_dps[:, ix_l:ix_u]
    # non-parametric softmax loss
    loss = -dps + logsumexp(ring_dps)
    loss.backward()
    step(g_q.params)
    # moco updates
    g_k.params = m * g_k.params + (1 - m) * g_q.params
    enqueue(queue, emb2); dequeue(queue)
    # threshold updates
    anneal(w_l); anneal(w_u)

Ring Discrimination. Having defined ω_ℓ and ω_u, we have a practical method of choosing B, and thus S_B, to define q(x|t(x_i)) for the i-th example. Intuitively, we construct a conditional distribution of negative examples that are (1) not too easy, since their representations are fairly similar to that of x_i, and (2) not too hard, since we remove the "closest" instances to x_i from S_B. We call this algorithm Ring Discrimination, or Ring, inspired by the shape of the negative set (see Fig. 1). Ring can be easily added to popular contrastive algorithms. For IR and CMC, this amounts to simply sampling entries in the memory bank that fall within the ℓ-th to u-th percentile of all distances to the current example view (in representation space). Similarly, for MoCo, we sample from a subset of the queue (chosen to be in the ℓ-th to u-th percentile), preserving the FIFO ordering. In our experiments, we refer to these as IRing, CMCRing, and MoCoRing, respectively. Alg. 1 shows PyTorch-like pseudocode for MoCoRing. One of the strengths of this approach is its simplicity: the algorithm requires only a few lines of code on top of existing implementations.

Annealing Policy. Naively using hard negatives can collapse to a poor representation, especially if we choose the upper threshold ω_u to be very small early in training. At the start of training, the encoder g_θ is randomly initialized, so there is no guarantee that elements in the ℓ-th to u-th percentile are properly calibrated: if the representations are near random, choosing negatives that are close in embedding distance may detrimentally exclude those examples that are "actually" close. This could lock in poor local minima. To avoid this, we propose an annealing policy that reduces ω_u (and thus the size of the support S_B) throughout training. Early in training we choose ω_u to be large. Over many epochs, we slowly decrease ω_u toward ω_ℓ, thereby selecting more difficult negatives. We explored several annealing policies and found a linear schedule to be simple and well-performing (see Sec. G). In our experiments, annealing is shown to be crucial: being too aggressive with negatives early in training produced representations that performed poorly on downstream tasks.
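The ring selection and annealing steps above can be sketched as runnable NumPy code. Names and defaults are ours; the percentile indexing follows Alg. 1 and the schedule endpoints (u from 100 down to 10) follow Sec. 5.

```python
import numpy as np

def anneal_u(epoch, total_epochs, u_start=100.0, u_end=10.0):
    """Linearly anneal the upper percentile u over training; u = 100
    recovers plain NCE, so training starts easy and gets harder."""
    frac = min(max(epoch / total_epochs, 0.0), 1.0)
    return u_start + frac * (u_end - u_start)

def ring_logits(anchor, queue, l_pct, u_pct, tau=0.07):
    """Keep only negatives whose similarity to `anchor` falls between the
    l-th and u-th percentiles of closeness, mirroring Alg. 1's indexing."""
    sims = queue @ anchor / tau      # (K,) dot-product logits
    order = np.argsort(-sims)        # sort from closest to farthest
    k = len(queue)
    ix_l = int(l_pct / 100 * k)      # skip the l% closest ("too close")
    ix_u = int(u_pct / 100 * k)      # drop everything past the u% closest
    return sims[order[ix_l:ix_u]]    # the "ring" of semi-hard logits

def ring_loss(pos_logit, ring):
    """Non-parametric softmax loss over the ring: -dps + logsumexp."""
    m = np.max(ring)
    return -pos_logit + m + np.log(np.sum(np.exp(ring - m)))
```

With l = 1 and u = 10 on a queue of 200 entries, the ring keeps entries ranked 3rd through 20th by closeness, excluding both the 2 nearest neighbors and the long tail of easy negatives.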

5. EXPERIMENTS

We explore our method applied to IR, CMC, and MoCo on four commonly used visual datasets. As in prior work (Wu et al., 2018; Zhuang et al., 2019; He et al., 2019; Misra & Maaten, 2020; Hénaff et al., 2019; Kolesnikov et al., 2019; Donahue & Simonyan, 2019; Bachman et al., 2019; Tian et al., 2019; Chen et al., 2020a), we evaluate each method by linear classification on frozen embeddings. That is, we optimize a contrastive objective on a pretraining dataset to learn a representation; then, using a transfer dataset, we fit logistic regression on the representations only. A better representation should contain more "object-centric" information, thereby achieving a higher classification score.

Training Details. We pick the upper percentile u = 10 and the lower percentile ℓ = 1, although we anneal u starting from 100. We resize input images to 256 by 256 pixels and normalize them using the dataset mean and standard deviation. The temperature τ is set to 0.07. We use a composition of a 224 by 224-pixel random crop, random color jittering, random horizontal flip, and random grayscale conversion as our augmentation family T. We use a ResNet-18 encoder with an output dimension of 128. For CMC, we use two ResNet-18 encoders, doubling the number of parameters. For linear classification, we treat the pre-pool output (size 512 × 7 × 7) after the last convolutional layer as the input to the logistic regression. Note that this setup is equivalent to using a linear projection head (Chen et al., 2020a;b). In pretraining, we use SGD with learning rate 0.03, momentum 0.9, and weight decay 1e-4 for 300 epochs with batch size 256 (128 for CMC). We drop the learning rate twice by a factor of 10, at epochs 200 and 250. In transfer, we use SGD with learning rate 0.01, momentum 0.9, and no weight decay for 100 epochs without dropping the learning rate. These hyperparameters were taken from Wu et al. (2018) and used throughout Table 1 for a consistent comparison.
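For concreteness, the pretraining learning-rate schedule just described can be written as a small step function (the function name is ours):

```python
def pretrain_lr(epoch, base_lr=0.03):
    """Step schedule from the training details above: SGD starts at 0.03,
    and the learning rate is dropped by a factor of 10 at epochs 200 and
    250 over a 300-epoch run."""
    if epoch >= 250:
        return base_lr / 100.0
    if epoch >= 200:
        return base_lr / 10.0
    return base_lr
```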
We found normalizing hyperparameters to be important for a fair comparison, as many competing algorithms use different hyperparameters. For a state-of-the-art comparison, see Table 5.

Ablations: Annealing and Upper Boundary. Having found good performance with Ring Discrimination, we want to assess the importance of the individual components that comprise Ring. We focus on the annealing policy and the exclusion of very close negatives from S_B. Concretely, we measure the transfer accuracy of (1) IRing without annealing and (2) IRing with a lower percentile ℓ = 0, thereby excluding no close negatives; that is, S_B contains all examples in the dataset with representation similarity less than ω_u (a "ball" instead of a "ring"). Table 2 compares these ablations to IR and full IRing on CIFAR10 and ImageNet classification transfer. We observe that both ablations result in worse transfer accuracy, with proper annealing being especially important to prevent convergence to bad minima. We also find that even with ℓ = 0, IRing outperforms IR, suggesting that removing negatives that are "too close" and removing those that are "too far" both contribute to the improved representation quality.

Transferring Features. Thus far we have only evaluated the learned representations on unseen examples from the training distribution. As the goal of unsupervised learning is to capture general representations, we are also interested in their performance on new, unseen distributions. To gauge this, we use the same linear classification paradigm on a suite of image datasets from the "Meta-Dataset" collection (Triantafillou et al., 2019) that have been used before in the contrastive literature (Chen et al., 2020a). All representations were trained on CIFAR10. For each transfer dataset, we compute the mean and variance from a training split to normalize input images, which we found important for generalization to new visual domains.
We find in Table 3 that the Ring models are competitive with their non-Ring analogues, with increases in transfer accuracy of 0.5 to 2 absolute points. Most notable are the TrafficSign and VGGFlower datasets, on which Ring models surpass the others by a larger margin. We also observe that IRing largely outperforms LA. This suggests that the features learned with more difficult negatives are not only useful on the training distribution but may also be transferable to many visual datasets.

More Downstream Tasks. Object classification is a popular transfer task, but we want our learned representations to capture holistic knowledge about the contents of an image. We must thus evaluate performance on transfer tasks, such as detection and segmentation, that require different kinds of visual information. We study four additional downstream tasks: object detection on COCO (Lin et al., 2014) and Pascal VOC'07 (Everingham et al., 2010), instance segmentation on COCO, and keypoint detection on COCO. In all cases, we employ embeddings trained on ImageNet with a ResNet-18 encoder. We base these experiments on those found in He et al. (2019), with the same hyperparameters; however, we use a smaller backbone (ResNet-18 versus ResNet-50) and freeze its parameters instead of finetuning them. We adapt code from Detectron2 (Wu et al., 2019). We find that IRing outperforms IR by around 2.3 points in COCO object detection, 2.5 points in COCO instance segmentation, 2.6 points in COCO keypoint detection, and 2.1 points in VOC object detection. Similarly, MoCoRing finds consistent improvements of 1-3 points over MoCo on the four tasks. Future work can investigate the orthogonal directions of using larger encoders (e.g. ResNet-50) and finetuning ResNet parameters for these individual tasks.

6. RELATED WORK

Several of the ideas in Ring Discrimination relate to existing work. Below, we explore these connections and, at the same time, place our work in a fast-paced and growing field.

Hard negative mining. While it has not been deeply explored in modern contrastive learning, negative mining has a rich line of research in the metric learning community. Deep metric learning utilizes triplet objectives of the form

L_triplet = max(0, d(g_θ(x_i), g_θ(x⁺)) − d(g_θ(x_i), g_θ(x⁻)) + α)

where d is a distance function (e.g. L2 distance), x⁺ and x⁻ are a positive and a negative example, respectively, relative to the current instance x_i, and α ∈ R⁺ is a margin. In this context, several approaches pick semi-hard negatives: Schroff et al. (2015) treat the furthest (in L2 distance) example in the same minibatch as x_i as its negative, whereas Oh Song et al. (2016) weight each example in the minibatch by its distance to g_θ(x_i), a continuous version of Schroff et al. (2015). More sophisticated negative sampling strategies developed over time. Wu et al. (2017) pick negatives from a fixed normal distribution that is shown to approximate L2-normalized embeddings in high dimensions, and show that weighting by this distribution samples more diverse negatives. Similarly, HDC (Yuan et al., 2017) simultaneously optimizes a triplet loss using many levels of "hardness" in negatives, again improving diversity. Although triplet objectives paved the way for modern NCE-based objectives, the focus on negative mining has largely been overlooked. Ring Discrimination, inspired by the deep metric learning literature, is a reminder that negative sampling remains an effective way of learning stronger representations in the newer NCE framework. As such, an important contribution of ours was to do so while retaining the theoretical properties of NCE, namely its relation to mutual information.
This characterization is, to the best of our knowledge, novel, as negative mining in the metric learning literature was not framed in terms of information theory. That being said, there are some cases of negative mining in the contrastive literature. In CPC (Oord et al., 2018), the authors explore using negatives from the same speaker versus from mixed speakers in audio applications, the former of which can be interpreted as more difficult. A recent paper, InterCLR (Xie et al., 2020), also finds that using "semi-hard" negatives is beneficial to contrastive learning, whereas negatives that are too difficult or too easy produce worse representations. Where InterCLR uses a margin-based approach to sample negatives, we explore a wider family of negative distributions and show through analysis that annealing offers a simple solution to choosing between easy and hard negatives. Further, as InterCLR's negative sampling procedure is a special case of CNCE, we provide theory grounding these approaches in information theory. Finally, a separate line of work in contrastive learning explores using neighboring examples (in embedding space) as "positive" views of the instance (Zhuang et al., 2019; Xie et al., 2020; Asano et al., 2019; Caron et al., 2020; Li et al., 2020); that is, finding a set {x_j} such that we consider x_j = t(x_i) for the current instance x_i. While this line of work does not deal with negatives explicitly, it shares similarities with our approach by employing other examples in the contrastive objective to learn better representations. In the Appendix, we discuss how one of these algorithms, LA (Zhuang et al., 2019), implicitly uses hard negatives, and we expand the Ring family with ideas inspired by it.

Contrastive learning. We focused primarily on comparing Ring Discrimination to three recent and highly performing contrastive algorithms, but the field contains much more.
The basic idea of learning representations to be invariant under a family of transformations is an old one, having been explored with self-organizing maps (Becker & Hinton, 1992) and dimensionality reduction (Hadsell et al., 2006). Before IR, instance discrimination was studied (Dosovitskiy et al., 2014; Wang & Gupta, 2015) among many pretext objectives such as position prediction (Doersch et al., 2015), color prediction (Zhang et al., 2016), multi-task objectives (Doersch & Zisserman, 2017), rotation prediction (Gidaris et al., 2018; Chen et al., 2019), and others (Pathak et al., 2017). As we have mentioned, one of the primary challenges of instance discrimination is making such a large softmax objective tractable. Moving from a parametric (Dosovitskiy et al., 2014) to a non-parametric softmax reduced issues with vanishing gradients, shifting the challenge to efficient negative sampling. The memory bank approach (Wu et al., 2018) is a simple and memory-efficient solution, quickly adopted by the research community (Zhuang et al., 2019; Tian et al., 2019; He et al., 2019; Chen et al., 2020b; Misra & Maaten, 2020). With enough computational resources, it is now also possible to use examples in a large minibatch as negatives of one another (Ye et al., 2019; Ji et al., 2019; Chen et al., 2020a). In our work, we focus on hard negative mining in the context of a memory bank or queue due to its computational efficiency. However, the same principles should be applicable to batch-based methods (e.g. SimCLR): assuming a large enough batch size, for each example we would use only a subset of the minibatch as negatives, as in Ring. Finally, more recent work (Grill et al., 2020) removes negatives altogether, though it is speculated to implicitly use negative samples via batch normalization (Ioffe & Szegedy, 2015); we leave a more thorough understanding of negatives in this setting to future work.

7. DISCUSSION

Computational cost of Ring. To measure the cost of CNCE, we compare the per-epoch cost of training MoCo/IR versus MoCoRing/IRing on four image datasets; Table 5a reports the average cost over 200 epochs. We observe that Ring models cost no more than 1.5 times as much as the standard contrastive algorithms, amounting to a difference of 3 to 7 minutes per epoch on ImageNet and 10 to 60 seconds per epoch on the three other datasets. In the context of deep learning, we do not find these cost increases substantial. In particular, since (1) the memory structure in IR and MoCo allows us to store and reuse embeddings and (2) gradients are not propagated through the memory structure, the additional compute of Ring amounts to one matrix multiplication, which is cheap on modern hardware. We used a single Titan X GPU with 8 CPU workers, and PyTorch Lightning (Falcon et al., 2019).

Comparison with the state-of-the-art. Unlike the experiments in Sec. 5, we now choose the optimal hyperparameters for MoCo-v2 (Chen et al., 2020b) separately for CIFAR10, CIFAR100, and STL10. Table 5b compares MoCo-v2 and its CNCE equivalent, MoCoRing-v2, using linear evaluation. We observe improvements comparable to those in Table 1, even with optimal hyperparameters. Notably, the gains generalize to ResNet-50 encoders. Refer to Sec. F for hyperparameter choices.

Generalization to other modalities. Thus far, we have focused on visual representation learning, although the same ideas apply to other domains. To exemplify the generality of CNCE, we apply MoCoRing to learning speech representations. Table 5c reports linear evaluation on six transfer datasets, ranging from predicting speaker identity to speech recognition to intent prediction:

Table 5c: MoCo versus MoCoRing on speech transfer tasks (accuracy, %).
    LibriSpeech Spk. ID (Panayotov et al., 2015): 95.5 vs. 96.6 (+1.1)
    AudioMNIST (Becker et al., 2018): 87.4 vs. 91.3 (+3.9)
    Google Commands (Warden, 2018): 38.5 vs. 41.4 (+2.9)
    Fluent Actions (Lugosch et al., 2019): 36.5 vs. 36.8 (+0.3)
    Fluent Objects (Lugosch et al., 2019): 41.9 vs. 44.1 (+2.2)
    Fluent Locations (Lugosch et al., 2019): 60.9 vs. 63.9 (+3.0)
We find significant gains of 1 to 4 percentage points across 4 datasets and 6 transfer tasks, with an average of 2.2 absolute percentage points. See Sec. E for experimental details.

Batch-based negative sampling. In Ring, we assumed a memory structure that stores embeddings, which led to an efficient procedure for mining semi-hard negatives. However, another flavor of contrastive algorithms removes the memory structure entirely, using the examples in the minibatch as negatives of one another. Here, we motivate a possible extension of Ring to SimCLR and leave a more careful study to future work. In SimCLR, we are given a minibatch M of examples. To sample hard negatives, as before, pick ℓ and u as lower and upper percentiles. For every example x_i in the minibatch, only consider the subset {x ∈ M : exp{g_θ(t(x_i))^T g_θ(t'(x))} lies in the ℓ-th to u-th percentiles within M} as negative examples for x_i. This can be efficiently implemented as a matrix operation using an element-wise mask. Thus, we ignore gradient signal from examples too far from, or too close to, x_i in representation space. As before, we anneal u from 100 to 10 and set ℓ = 1.
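The element-wise mask described above can be sketched as follows; this is a minimal NumPy illustration with our own naming, not SimCLR's implementation.

```python
import numpy as np

def simclr_ring_mask(sims, l_pct, u_pct):
    """For each anchor (row of the batch similarity matrix `sims`), keep
    only minibatch negatives between the l-th and u-th percentiles of
    closeness; everything outside the ring is masked out."""
    n = sims.shape[0]
    ix_l = int(l_pct / 100 * n)
    ix_u = int(u_pct / 100 * n)
    mask = np.zeros_like(sims, dtype=bool)
    for i in range(n):
        order = np.argsort(-sims[i])      # row i: closest to farthest
        mask[i, order[ix_l:ix_u]] = True  # gradient flows only inside the ring
    return mask
```

In practice the mask would multiply the exponentiated similarity matrix before the softmax denominator, so masked entries contribute neither to the loss nor to the gradient.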

8. CONCLUDING REMARKS

To conclude, we presented a family of mutual information estimators that approximate the partition function using samples from a class of conditional distributions. We proved several theoretical statements about this family, showing a bound on mutual information and a tradeoff between bias and variance. Then, we applied these estimators as objectives in contrastive representation learning. In doing so, we found that our representations consistently outperform existing approaches across a spectrum of contrastive objectives, data distributions, and transfer tasks. Overall, we hope our work encourages further exploration of negative sampling amid the recent growth of contrastive learning.








Table 1: Comparison of contrastive algorithms on four image domains.

Table 3: Transferring CIFAR10 embeddings to various image distributions.

Table 4: Evaluation of ImageNet representations using four visual transfer tasks.

Table 5a: Average per-epoch cost of training MoCo/IR versus MoCoRing/IRing on four image datasets.

Table 5: Generalizations of Ring to a new modality (a) and a batch-based algorithm (b).

Table 5d reports consistent but moderate gains over SimCLR, showing promise but room for improvement in future research.

ACKNOWLEDGMENTS

This research was supported by the Office of Naval Research grant ONR MURI N00014-16-1-2007. MW is supported by the Stanford Interdisciplinary Graduate Fellowship as the Karr Family Fellow.

