CONTRASTIVE LEARNING WITH HARD NEGATIVE SAMPLES

Abstract

How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use true similarity information. In response, we develop a new family of unsupervised sampling methods for selecting hard negative samples where the user can control the hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead.

1. INTRODUCTION

Owing to their empirical success, contrastive learning methods (Chopra et al., 2005; Hadsell et al., 2006) have become one of the most popular self-supervised approaches for learning representations (Oord et al., 2018; Tian et al., 2019; Chen et al., 2020a). In computer vision, unsupervised contrastive learning methods have even outperformed supervised pre-training for object detection and segmentation tasks (Misra & Maaten, 2020; He et al., 2020). Contrastive learning relies on two key ingredients: notions of similar (positive) (x, x⁺) and dissimilar (negative) (x, x⁻) pairs of data points. The training objective, typically noise-contrastive estimation (Gutmann & Hyvärinen, 2010), guides the learned representation f to map positive pairs to nearby locations, and negative pairs farther apart; other objectives have also been considered (Chen et al., 2020a). The success of the associated methods depends on the design of informative positive and negative pairs, which cannot exploit true similarity information since there is no supervision.

Much research effort has addressed sampling strategies for positive pairs, and this has been a key driver of recent progress in multi-view and contrastive learning (Blum & Mitchell, 1998; Xu et al., 2013; Bachman et al., 2019; Chen et al., 2020a; Tian et al., 2020). For image data, positive sampling strategies often apply transformations that preserve semantic content, e.g., jittering, random cropping, separating color channels, etc. (Chen et al., 2020a;c; Tian et al., 2019). Such transformations have also been effective in learning control policies from raw pixel data (Srinivas et al., 2020). Positive sampling techniques have also been proposed for sentence, audio, and video data (Logeswaran & Lee, 2018; Oord et al., 2018; Purushwalkam & Gupta, 2020; Sermanet et al., 2018).

Surprisingly, the choice of negative pairs has drawn much less attention in contrastive learning. Often, given an "anchor" point x, a "negative" x⁻ is simply sampled uniformly from the training data, independent of how informative it may be for the learned representation. In supervised and metric learning settings, "hard" (true negative) examples can help guide a learning method to correct its mistakes more quickly (Schroff et al., 2015; Song et al., 2016). For representation learning, informative negative examples are intuitively those pairs that are mapped nearby but should be far apart. This idea is successfully applied in metric learning, where true pairs of dissimilar points are available, as opposed to unsupervised contrastive learning.

2. CONTRASTIVE LEARNING SETUP

Denoting the distribution over latent classes by ρ(c) for c ∈ C, we define the joint distribution p_{x,c}(x, c) = p(x|c)ρ(c), whose marginal p(x) we refer to simply as p, and assume supp(p) = X. For simplicity, we assume ρ(c) = τ⁺ is uniform, and let τ⁻ = 1 − τ⁺ be the probability of another class. Since the class prior τ⁺ is unknown in practice, it must either be treated as a hyperparameter, or estimated (Christoffel et al., 2016; Jain et al., 2016). Let h : X → C be the true underlying hypothesis that assigns class labels to inputs. We write x ∼ x′ to denote the label equivalence relation h(x) = h(x′). We denote by p⁺_x(x′) = p(x′ | h(x′) = h(x)) the distribution over points with the same label as x, and by p⁻_x(x′) = p(x′ | h(x′) ≠ h(x)) the distribution over points with labels different from x. We drop the subscript x when the context is clear.
Following the usual convention, we overload '∼' and also write x ∼ p to denote a point sampled from p. For each data point x ∼ p, the noise-contrastive estimation (NCE) objective (Gutmann & Hyvärinen, 2010) for learning the representation f uses a positive example x⁺ with the same label as x, and negative examples {x⁻_i}_{i=1}^N with (supposedly) different labels, h(x⁻_i) ≠ h(x), sampled from q:

E_{x∼p, x⁺∼p⁺_x, {x⁻_i}_{i=1}^N ∼ q} [ −log ( e^{f(x)ᵀf(x⁺)} / ( e^{f(x)ᵀf(x⁺)} + (Q/N) Σ_{i=1}^N e^{f(x)ᵀf(x⁻_i)} ) ) ].   (1)

The weighting parameter Q is introduced for the purpose of analysis; when N is finite we take Q = N, yielding the usual form of the contrastive objective. The negative sample distribution q is frequently chosen to be the marginal distribution p, or, in practice, an empirical approximation of it (Tian et al., 2019; Chen et al., 2020a;c; He et al., 2020; Oord et al., 2018; Hénaff et al., 2020). In this paper we ask: is there a better way to choose q?

3. HARD NEGATIVE SAMPLING

In this section we describe our approach for hard negative sampling. We begin by asking: what makes a good negative sample? To answer this question we adopt the following two guiding principles:

Principle 1. q should only sample "true negatives" x⁻_i whose labels differ from that of the anchor x.
Principle 2. The most useful negative samples are ones that the embedding currently believes to be similar to the anchor.

In short, negative samples that have a different label from the anchor, but that are embedded nearby, are likely to be most useful and provide significant gradient information during training. In metric learning there is access to true negative pairs, automatically fulfilling the first principle. In unsupervised contrastive learning there is no supervision, so upholding Principle 1 exactly is impossible. In this paper we propose a method that upholds Principle 1 approximately, and simultaneously combines this idea with the key additional conceptual ingredient of "hardness" (encapsulated in Principle 2). The level of "hardness" in our method can be smoothly adjusted, allowing the user to select the hardness that best trades off between an improved learning signal from hard negatives and the harm due to the correction of false negatives being only approximate. This is important since the hardest points are those closest to the anchor, and are expected to have a high propensity to have the same label. Therefore the damage from the approximation not removing all false negatives becomes larger for harder samples, creating the trade-off. As a special case of our method, when the hardness level is tuned fully down, we obtain the method proposed by Chuang et al. (2020), which upholds Principle 1 (approximately) but not Principle 2. Finally, beyond Principles 1 and 2, we wish to design an efficient sampling method that does not add additional computational overhead during training.
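To make the objective concrete, here is a minimal PyTorch sketch of the NCE objective in equation (1) for one batch. The tensor shapes and names are illustrative assumptions, and the embeddings are assumed to already lie on the hypersphere.

```python
import torch

def nce_loss(anchor, positive, negatives, Q=None):
    # anchor, positive: [B, d]; negatives: [B, N, d].
    # All inputs are embeddings f(.) assumed to lie on the hypersphere.
    N = negatives.size(1)
    Q = N if Q is None else Q          # Q = N recovers the usual contrastive loss
    pos = torch.exp((anchor * positive).sum(dim=-1))                # e^{f(x)^T f(x+)}
    neg = torch.exp(torch.einsum('bd,bnd->bn', anchor, negatives))  # e^{f(x)^T f(x-_i)}
    return (-torch.log(pos / (pos + (Q / N) * neg.sum(dim=-1)))).mean()
```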

3.1. PROPOSED HARD SAMPLING METHOD

Our first goal is to design a distribution q on X that is allowed to depend on the embedding f and the anchor x. From q we sample a batch of negatives {x⁻_i}_{i=1}^N according to the principles noted above. We propose sampling negatives from the distribution q⁻_β defined as

q⁻_β(x⁻) := q_β(x⁻ | h(x) ≠ h(x⁻)),   where   q_β(x⁻) ∝ e^{βf(x)ᵀf(x⁻)} · p(x⁻),

for β ≥ 0. Note that q⁻_β and q_β both depend on x, but we suppress the dependence from the notation. The exponential term in q_β is an unnormalized von Mises-Fisher distribution with mean direction f(x) and "concentration parameter" β (Mardia & Jupp, 2000). There are two key components to q⁻_β, corresponding to each principle: 1) conditioning on the event {h(x) ≠ h(x⁻)}, which guarantees that (x, x⁻) correspond to different latent classes (Principle 1); 2) the concentration parameter β, which controls the degree by which q_β up-weights points x⁻ that have large inner product (similarity) with the anchor x (Principle 2). Since f lies on the surface of a hypersphere of radius 1/t, we have ‖f(x) − f(x′)‖² = 2/t² − 2f(x)ᵀf(x′), so preferring points with large inner product is equivalent to preferring points with small squared Euclidean distance.

Although we have designed q⁻_β to have all of the desired components, it is not clear how to sample efficiently from it. To work towards a practical method, note that we can rewrite this distribution by adopting a PU-learning viewpoint (Elkan & Noto, 2008; Du Plessis et al., 2014; Chuang et al., 2020). That is, by conditioning on the event {h(x) = h(x⁻)} we can split q_β(x⁻) as

q_β(x⁻) = τ⁻ q⁻_β(x⁻) + τ⁺ q⁺_β(x⁻),   (2)

where q⁺_β(x⁻) = q_β(x⁻ | h(x) = h(x⁻)) ∝ e^{βf(x)ᵀf(x⁻)} · p⁺(x⁻). Rearranging equation (2) yields a formula q⁻_β(x⁻) = ( q_β(x⁻) − τ⁺ q⁺_β(x⁻) )/τ⁻ for the negative sampling distribution q⁻_β in terms of two distributions that are tractable, since we have samples from p and can approximate samples from p⁺ using a set of semantics-preserving transformations, as is typical in contrastive learning methods (see Appendix E for extra discussion of the practical implications of this approximation). It is possible to generate samples from q_β, and approximately from q⁺_β, using rejection sampling and data augmentations to generate positives. However, rejection sampling involves an algorithmic complication since the procedure for sampling batches must be modified. To avoid this, we instead take an importance sampling approach. First note that fixing the number Q and taking the limit N → ∞ in the objective (1) yields

L(f, q) = E_{x∼p, x⁺∼p⁺_x} [ −log ( e^{f(x)ᵀf(x⁺)} / ( e^{f(x)ᵀf(x⁺)} + Q E_{x⁻∼q}[e^{f(x)ᵀf(x⁻)}] ) ) ].

The original objective (1) can be viewed as a finite-negative-sample approximation to L(f, q) (note that L(f, q) implicitly depends on Q). Inserting q = q⁻_β and using the rearrangement of equation (2), we obtain the following hardness-biased objective:

E_{x∼p, x⁺∼p⁺_x} [ −log ( e^{f(x)ᵀf(x⁺)} / ( e^{f(x)ᵀf(x⁺)} + (Q/τ⁻)( E_{x⁻∼q_β}[e^{f(x)ᵀf(x⁻)}] − τ⁺ E_{v∼q⁺_β}[e^{f(x)ᵀf(v)}] ) ) ) ].

This objective suggests that we need only approximate the expectations E_{x⁻∼q_β}[e^{f(x)ᵀf(x⁻)}] and E_{v∼q⁺_β}[e^{f(x)ᵀf(v)}] over q_β and q⁺_β (rather than explicitly sampling).
This can be achieved using classical Monte Carlo importance sampling with samples from p and p⁺ as follows:

E_{x⁻∼q_β}[e^{f(x)ᵀf(x⁻)}] = E_{x⁻∼p}[e^{f(x)ᵀf(x⁻)} q_β/p] = E_{x⁻∼p}[e^{(β+1)f(x)ᵀf(x⁻)}/Z_β],
E_{v∼q⁺_β}[e^{f(x)ᵀf(v)}] = E_{v∼p⁺}[e^{f(x)ᵀf(v)} q⁺_β/p⁺] = E_{v∼p⁺}[e^{(β+1)f(x)ᵀf(v)}/Z⁺_β],

where Z_β and Z⁺_β are the partition functions of q_β and q⁺_β respectively. The right-hand terms readily admit empirical approximations by replacing p and p⁺ with p̂(x) = (1/N) Σ_{i=1}^N δ_{x⁻_i}(x) and p̂⁺(x) = (1/M) Σ_{i=1}^M δ_{x⁺_i}(x) respectively (δ_w denotes the Dirac delta function centered at w). The only unknowns left are the partition functions, Z_β = E_{x⁻∼p}[e^{βf(x)ᵀf(x⁻)}] and Z⁺_β = E_{x⁺∼p⁺}[e^{βf(x)ᵀf(x⁺)}], which are themselves expectations over p and p⁺ and therefore admit the empirical estimates

Ẑ_β = (1/N) Σ_{i=1}^N e^{βf(x)ᵀf(x⁻_i)},   Ẑ⁺_β = (1/M) Σ_{i=1}^M e^{βf(x)ᵀf(x⁺_i)}.

It is important to emphasize the simplicity of the implementation of our proposed approach. Since we propose to reweight the objective instead of modifying the sampling procedure, only two extra lines of code are needed to implement our approach, with no additional computational overhead. PyTorch-style pseudocode for the objective is given in Fig. 13 in Appendix D.
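The following PyTorch sketch shows one way the reweighted objective can be realized for a SimCLR-style batch, in the spirit of the pseudocode of Fig. 13; variable names are illustrative, and the official implementation is at https://github.com/joshr17/HCL. Note that since neg holds e^{sim/t}, the importance weights imp = neg^β reproduce the e^{(β+1)f(x)ᵀf(x⁻)} terms above, and dividing by their mean estimates the partition function Ẑ_β.

```python
import math
import torch

def hard_contrastive_loss(out_1, out_2, beta=1.0, tau_plus=0.1, temperature=0.5):
    # out_1, out_2: [B, d] L2-normalized embeddings of two views of the same batch.
    B = out_1.size(0)
    out = torch.cat([out_1, out_2], dim=0)                       # [2B, d]
    sim = torch.exp(out @ out.t() / temperature)                 # e^{f^T f / t}
    # Mask out self-similarities and the positive counterpart of each view.
    mask = torch.ones(2 * B, 2 * B, dtype=torch.bool, device=out.device)
    idx = torch.arange(B, device=out.device)
    mask[torch.arange(2 * B), torch.arange(2 * B)] = False
    mask[idx, idx + B] = False
    mask[idx + B, idx] = False
    neg = sim.masked_select(mask).view(2 * B, -1)                # [2B, 2B - 2]
    pos = torch.exp((out_1 * out_2).sum(dim=-1) / temperature)
    pos = torch.cat([pos, pos], dim=0)                           # [2B]
    N = 2 * B - 2
    imp = neg ** beta                                            # importance weights e^{beta * sim / t}
    reweight_neg = (imp * neg).sum(dim=-1) / imp.mean(dim=-1)    # reweighted negative sum
    Ng = (-tau_plus * N * pos + reweight_neg) / (1 - tau_plus)   # debiased estimate
    Ng = torch.clamp(Ng, min=N * math.e ** (-1 / temperature))   # theoretical lower bound
    return (-torch.log(pos / (pos + Ng))).mean()
```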

4. ANALYSIS OF HARD SAMPLING

4.1. HARD SAMPLING INTERPOLATES BETWEEN MARGINAL AND WORST-CASE NEGATIVES

Intuitively, the concentration parameter β in our proposed negative sample distribution q⁻_β controls the level of "hardness" of the negative samples. As discussed earlier, the debiasing method of Chuang et al. (2020) can be recovered as a special case: taking β = 0 yields the distribution q⁻_0. This case amounts to correcting for the fact that some samples in a negative batch sampled from p will have the same label as the anchor. But what interpretation does large β admit? Specifically, what does the distribution q⁻_β converge to in the limit β → ∞, if anything? We show that in the limit q⁻_β approximates an inner solution to the following zero-sum two-player game:

inf_f sup_{q∈Π} L(f, q) = E_{x∼p, x⁺∼p⁺_x} [ −log ( e^{f(x)ᵀf(x⁺)} / ( e^{f(x)ᵀf(x⁺)} + Q E_{x⁻∼q}[e^{f(x)ᵀf(x⁻)}] ) ) ],

where Π = { q = q(·; x, f) : supp( q(·; x, f) ) ⊆ {x′ ∈ X : x′ ≁ x}, ∀x ∈ X } is the set of distributions with support disjoint from points with the same class as x (without loss of generality we assume {x′ ∈ X : x′ ≁ x} is non-empty). Since q = q(·; x, f) depends on x and f, it can be thought of as a family of distributions. The formal statement is as follows.

Proposition 3. Let L*(f) = sup_{q∈Π} L(f, q). Then for any t > 0 and f : X → S^{d−1}/t we observe the convergence L(f, q⁻_β) → L*(f) as β → ∞.

Proof. See Appendix A.1.

To develop a better intuitive understanding of the worst-case negative distribution objective L*(f) = sup_{q∈Π} L(f, q), we note that the supremum can be characterized analytically. Indeed,

sup_{q∈Π} L(f, q) = −E_{x∼p, x⁺∼p⁺_x}[ f(x)ᵀf(x⁺) ] + sup_{q∈Π} E_{x∼p, x⁺∼p⁺_x}[ log( e^{f(x)ᵀf(x⁺)} + Q E_{x⁻∼q}[e^{f(x)ᵀf(x⁻)}] ) ]
= −E_{x∼p, x⁺∼p⁺_x}[ f(x)ᵀf(x⁺) ] + E_{x∼p, x⁺∼p⁺_x}[ log( e^{f(x)ᵀf(x⁺)} + Q · sup_{q∈Π} E_{x⁻∼q}[e^{f(x)ᵀf(x⁻)}] ) ].

The supremum over q can be pushed inside the expectation since q is a family of distributions indexed by x, reducing the problem to maximizing E_{x⁻∼q}[e^{f(x)ᵀf(x⁻)}], which is solved by any q* whose support is a subset of arg sup_{x⁻: x⁻ ≁ x} e^{f(x)ᵀf(x⁻)}, if the supremum is attained. However, computing such points involves maximizing a neural network. Instead of taking this challenging route, using q⁻_β defines a lower bound by placing higher probability on x⁻ for which f(x)ᵀf(x⁻) is large. This lower bound becomes tight as β → ∞ (Proposition 3).
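A tiny numerical illustration of this interpolation (the similarity values are hypothetical, not from the paper): the normalized weights that q_β places on a batch of negatives are uniform at β = 0 and concentrate on the hardest negative as β grows.

```python
import torch

sims = torch.tensor([0.9, 0.5, 0.1, -0.3])   # hypothetical anchor-negative inner products
for beta in [0.0, 1.0, 5.0, 50.0]:
    w = torch.softmax(beta * sims, dim=0)     # normalized q_beta weights over a uniform empirical p
    print(beta, [round(x, 3) for x in w.tolist()])
# beta = 0.0 -> [0.25, 0.25, 0.25, 0.25]      (the marginal distribution p)
# beta = 50  -> nearly all mass on sim = 0.9  (the worst-case distribution of Proposition 3)
```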

4.2. OPTIMAL EMBEDDINGS ON THE HYPERSPHERE FOR WORST-CASE NEGATIVE SAMPLES

What desirable properties does an optimal contrastive embedding (global minimizer of L) possess that make the representation generalizable? To study this question, we first analyze the distribution of an optimal embedding f* on the hypersphere when negatives are sampled from the adversarial worst-case distribution. We consider a different limiting viewpoint of objective (1) as the number of negative samples N → ∞. Following the formulation of Wang & Isola (2020), we take Q = N in (1) and subtract log N. This changes neither the set of minimizers nor the geometry of the loss surface. Taking the number of negative samples N → ∞ yields the limiting objective

L_∞(f, q) = E_{x∼p, x⁺∼p⁺_x} [ −log ( e^{f(x)ᵀf(x⁺)} / E_{x⁻∼q}[e^{f(x)ᵀf(x⁻)}] ) ].

Theorem 4. Suppose the downstream task is classification (i.e. C is finite), and let L*_∞(f) = sup_{q∈Π} L_∞(f, q). The infimum inf_{f measurable} L*_∞(f) is attained, and any f* achieving the global minimum is such that f*(x) = f*(x⁺) almost surely. Furthermore, letting v_c = f*(x) for any x such that h(x) = c (so v_c is well defined up to a set of x of measure zero), f* is characterized as being any solution to the following ball-packing problem:

max_{{v_c ∈ S^{d−1}/t}_{c∈C}} Σ_{c∈C} ρ(c) · min_{c′≠c} ‖v_c − v_{c′}‖².   (7)

Proof. See Appendix A.2.

Interpretation: The first component of the result is that f*(x) = f*(x⁺) almost surely for an optimal f*. That is, an optimal embedding f* must be invariant across pairs of similar inputs x, x⁺. The second component characterizes solutions via the classical geometric ball-packing problem of Tammes (1930) (Eq. 7), which has only been solved exactly for uniform ρ, for specific values of |C|, and typically for S² (Schütte & Van der Waerden, 1951; Musin & Tarasov, 2015; Tammes, 1930). When the distribution ρ over classes is uniform, this problem is solved by a set of |C| points on the hypersphere such that the average squared ℓ₂ distance from a point to the nearest other point is as large as possible. In other words, suppose we wish to place |C| balls on S^{d−1} so that they do not intersect (for a manifold M ⊆ R^d, we say C ⊂ M is a ball if it is connected and there exists a Euclidean ball B = {x ∈ R^d : ‖x‖ ≤ R} for which C = M ∩ B). Then solutions to Tammes' problem (7) express (twice) the largest possible average squared radius that the balls can have. So, we have a ball-packing problem where, instead of trying to pack as many balls as possible of a fixed size, we aim to pack a fixed number of balls (one for each class) with radii as large as possible. Non-uniform ρ adds importance weights to each fixed ball. In summary, solutions of the problem min_f L*_∞(f) are a maximum margin clustering.

This understanding of the global minimizers of L*_∞(f) = sup_{q∈Π} L_∞(f, q) can be further developed into a better understanding of generalization on downstream tasks. The next result shows that representations achieving small excess risk on the objective L*_∞ still separate clusters well, in the sense that a simple 1-nearest-neighbor classifier achieves low classification error.

Theorem 5. Suppose ρ is uniform on C and f is such that L*_∞(f) − inf_{f̃ measurable} L*_∞(f̃) ≤ ε with ε ≤ 1. Let {v*_c ∈ S^{d−1}/t}_{c∈C} be a solution to Problem (7), and define the constant ξ = min_{c,c⁻: c≠c⁻} ‖v*_c − v*_{c⁻}‖ > 0. Then there exists a set of vectors {v_c ∈ S^{d−1}/t}_{c∈C} such that the 1-nearest-neighbor classifier ĥ(x) = arg min_{c∈C} ‖f(x) − v_c‖ (ties broken arbitrarily) achieves misclassification risk

P_{x,c}( ĥ(x) ≠ c ) ≤ 8ε / ( ξ² − 2|C|(1 + 1/t)ε^{1/2} )².

Proof. See Appendix A.3.
In particular, P(ĥ(x) ≠ c) = O(ε) as ε → 0, and in the limit ε → 0 we recover the invariance claim of Theorem 4 as a special case. The result can be generalized to arbitrary ρ by replacing |C| in the bound by 1/min_c ρ(c). The result also implies that it is possible to build simple classifiers for tasks that involve only a subset of classes from C, or classes that are a union of classes from C. The constant ξ = min_{c,c⁻: c≠c⁻} ‖v*_c − v*_{c⁻}‖ > 0 is a purely geometric property of spheres, and describes the minimum separation distance between a set of points that solves the Tammes ball-packing problem.
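The ball-packing characterization can also be probed numerically. Below is a small, self-contained sketch (sizes are hypothetical) that solves problem (7) for uniform ρ and t = 1 by projected gradient ascent; for |C| = 4 in d = 3 it recovers a regular tetrahedron, a classical Tammes solution.

```python
import torch

C, d = 4, 3                                     # hypothetical number of classes and dimension
v = torch.randn(C, d, requires_grad=True)
opt = torch.optim.Adam([v], lr=0.05)
for _ in range(2000):
    u = v / v.norm(dim=1, keepdim=True)         # project onto the unit sphere (t = 1)
    dist2 = torch.cdist(u, u).pow(2)            # pairwise squared distances
    dist2 = dist2 + 1e6 * torch.eye(C)          # exclude self-pairs from the min
    loss = -dist2.min(dim=1).values.mean()      # negative of objective (7), uniform rho
    opt.zero_grad(); loss.backward(); opt.step()
print(torch.cdist(u, u).detach())               # all pairwise distances approach 2 * sqrt(2/3)
```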

5. EMPIRICAL RESULTS

Next, we evaluate our hard negative sampling method empirically, applying it as a modification to state-of-the-art contrastive methods on image, graph, and text data. For all experiments β is treated as a hyperparameter (see the ablations in Fig. 2 for more understanding of how to pick β). Values for M and τ⁺ must also be determined. We fix M = 1 for all experiments, since taking M > 1 would increase the number of inputs for the forward-backward pass. Lemma 11 in the appendix gives a theoretical justification for the choice M = 1. The class prior τ⁺ can be chosen in two ways: estimating it from data (Christoffel et al., 2016; Jain et al., 2016), or treating it as a hyperparameter. The first option requires labeled data before contrastive training.

5.1. IMAGE REPRESENTATIONS

We begin by testing the hard sampling method on vision tasks using the STL10, CIFAR100, and CIFAR10 datasets. We use SimCLR (Chen et al., 2020a) as the baseline method, and all models are trained for 400 epochs. The results in Fig. 2 show consistent improvement over SimCLR (q = p) and over the special case of our method with β = 0 proposed by Chuang et al. (2020) (called debiasing) on STL10 and CIFAR100. For N = 510 negative examples per data point we observe absolute improvements of 3% and 7.3% over SimCLR on CIFAR100 and STL10 respectively, and absolute improvements over the best debiased baseline of 1.9% and 3.2%. On tinyImageNet (Tab. 1) we observe an absolute improvement of 3.6% over SimCLR, while on CIFAR10 there is a slight improvement for smaller N, which disappears at larger N. See Appendix C.1 for results using MoCo-v2 with a large negative batch size, and Appendix D.1 for full setup details.

Figure 2: Classification accuracy on downstream tasks. Embeddings trained using hard, debiased, and standard (β = 0, τ⁺ = 0) versions of SimCLR, and evaluated using linear readout accuracy.
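For reference, the linear readout protocol used throughout Fig. 2 can be sketched as follows: freeze the trained encoder and fit a single linear layer on its 2048-dimensional representations. Names such as encoder and train_loader are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_readout(encoder, train_loader, num_classes, dim=2048, epochs=100):
    encoder.eval()                               # the backbone stays frozen
    clf = nn.Linear(dim, num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                h = encoder(x)                   # frozen representation
            loss = F.cross_entropy(clf(h), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return clf
```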

5.2. GRAPH REPRESENTATIONS

Table 1: Top-1 linear readout accuracy on tinyImageNet. The class prior is set to τ⁺ = 0.01.

    SimCLR: 53.4%    Debiased: 53.7%    Hard (β = 1): 57.0%

Second, we consider hard negative sampling in the context of learning graph representations. We use the state-of-the-art InfoGraph method introduced by Sun et al. (2020) as the baseline, which is suitable for downstream graph-level classification. The objective is of a slightly different form from the NCE loss; because of this we use a generalization of the formulation presented in Section 3 (see Appendix B for details). In doing so, we illustrate that it is easy to adapt our hard sampling method to other contrastive frameworks. Fig. 3 shows the results of fine-tuning an SVM (Boser et al., 1992; Cortes & Vapnik, 1995) on the fixed, learned embedding for a range of different values of β. Hard sampling does at least as well as InfoGraph in all cases, and better in 6 out of 8 cases. For ENZYMES and REDDIT, hard negative samples improve the accuracy by 3.2% and 2.4% respectively; for DD and PTC by 1-2%; and for IMDB-B and MUTAG by at least 0.5%. Usually, multiple different choices of β > 0 were competitive with the InfoGraph baseline: 17 out of the 24 values of β > 0 tried (across all 8 datasets) achieved accuracy as high as or better than InfoGraph (β = 0).

5.3. SENTENCE REPRESENTATIONS

Third, we test hard negative sampling on learning representations of sentences using the quick-thoughts (QT) vectors framework introduced by Logeswaran & Lee (2018), which uses adjacent sentences (before/after) as positive samples. Embeddings are trained using the unlabeled BookCorpus dataset (Kiros et al., 2015), and evaluated following the protocol of Logeswaran & Lee (2018) on six downstream tasks. The results show that hard sampling outperforms or equals the QT baseline in 5 out of 6 cases, the debiased baseline (Chuang et al., 2020) in 4 out of 6, and both in 3 out of 6 cases. Setting τ⁺ > 0 led to numerical issues in optimization for hard sampling.


Fig. 4 (left, middle) shows that for vision problems, taking larger β does not necessarily lead to better representations. In contrast, when one uses true positive pairs during training (green curve; uses label information for positive but not negative pairs), the downstream performance monotonically increases with β until convergence (Fig. 4, middle). Interestingly, this is achieved without using label information for the negative pairs. This observation suggests an explanation for why bigger β hurts performance in practice. Debiasing (conditioning on the event {h(x) ≠ h(x⁻)}) using the true p⁺ corrects for sampling x⁻ with the same label as x. However, since in practice we approximate p⁺ using a set of data transformations, we can only partially correct. This is harmful for large β, since this regime strongly prefers x⁻ for which f(x⁻) is close to f(x), many of which will have the same label as x if not corrected for. We note also that by annealing β (gradually decreasing β to 0 throughout training; see Appendix D.1 for details) it is possible to be more robust to the choice of initial β, with marginal impact on downstream accuracy compared to the best fixed value of β.

6.2. DOES AVOIDING FALSE NEGATIVES IMPROVE HARD SAMPLING?

Our proposed hard negative sampling method conditions on the event {h(x) ≠ h(x⁻)} in order to avoid false negatives (termed "debiasing" by Chuang et al. (2020)). But does this help? To test this, we train four embeddings: hard sampling with and without debiasing, and uniform sampling (β = 0) with and without debiasing. The results in Fig. 4 (right) show that hard sampling with debiasing obtains the highest linear readout accuracy on STL10, while using only hard sampling or only debiasing yields (in this case) similar accuracy. All improve over the SimCLR baseline. Fig. 5 compares the histograms of cosine similarities of positive and negative pairs for the four learned representations. The representation trained with hard negatives and debiasing assigns much lower similarity scores to pairs of negative samples than the other methods. On the other hand, the SimCLR baseline assigns higher cosine similarity scores to pairs of positive samples. However, to discriminate positive and negative pairs, a key property is the amount of overlap of the positive and negative histograms. Our hard sampling method achieves less overlap than SimCLR, by better trading off higher dissimilarity of negative pairs against slightly lower similarity of positive pairs. Similar trade-offs are observed for the debiased objective and for hard sampling without debiasing.
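The histogram comparison can be reproduced with a short snippet of the following form; emb and labels are placeholders for embeddings of a labeled evaluation set.

```python
import torch
import torch.nn.functional as F

def similarity_histograms(emb, labels, bins=50):
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t()                                   # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    pos = sim[same & off_diag]                            # same-label ("positive") pairs
    neg = sim[~same]                                      # different-label ("negative") pairs
    return torch.histc(pos, bins, -1, 1), torch.histc(neg, bins, -1, 1)
```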

6.3. HOW DO HARD NEGATIVES AFFECT OPTIMIZATION?

Fig. 11 (in Appendix C due to space constraints) shows the performance on STL10 and CIFAR100 of SimCLR versus using hard negatives throughout training. We use weighted k-nearest neighbors with k = 200 as the classifier and evaluate each model once every five epochs. Hard sampling with β = 1 leads to much faster training: on STL10 hard sampling takes only 60 epochs to reach the same performance as SimCLR does in 400 epochs. On CIFAR100 hard sampling takes only 125 epochs to reach the same performance as SimCLR does in 400 epochs. We speculate that the speedup is, in part, due to hard negatives providing non-negligible gradient information during training.
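The weighted k-nearest-neighbor monitor used above can be sketched as follows. The exact weighting scheme is an assumption (here each of the k = 200 neighbors votes with its cosine similarity); bank and bank_labels are placeholders for a memory bank of normalized training embeddings and their labels.

```python
import torch

def weighted_knn_predict(query, bank, bank_labels, num_classes, k=200):
    # query: [Q, d], bank: [size, d]; both assumed L2-normalized.
    sim = query @ bank.t()                                  # cosine similarities to the bank
    topk_sim, topk_idx = sim.topk(k, dim=1)                 # k nearest neighbors
    votes = torch.zeros(query.size(0), num_classes)
    votes.scatter_add_(1, bank_labels[topk_idx], topk_sim)  # similarity-weighted votes
    return votes.argmax(dim=1)
```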

7. CONCLUSION

We argue for the value of hard negatives in unsupervised contrastive representation learning, and introduce a simple hard negative sampling method. Our work connects two major lines of work: contrastive learning, and negative mining in metric learning. Doing so requires overcoming an apparent roadblock: negative mining in metric learning uses pairwise similarity information as a core component, while contrastive learning is unsupervised. Our method has several appealing aspects: desirable theoretical properties, a very simple implementation that requires modifying only a couple of lines of code, no changes to the data sampling pipeline, zero extra computational overhead, and principled handling of false negatives.

A ANALYSIS OF HARD SAMPLING

A.1 HARD SAMPLING INTERPOLATES BETWEEN MARGINAL AND WORST-CASE NEGATIVES

We begin by proving Proposition 3. Recall that the proposition stated the following.

Proposition 6. Let L*(f) = sup_{q∈Π} L(f, q). Then for any t > 0 and measurable f : X → S^{d−1}/t we observe the convergence L(f, q⁻_β) → L*(f) as β → ∞.

Proof. Consider the following essential supremum,

M(x) = ess sup_{x⁻∈X: x⁻ ≁ x} f(x)ᵀf(x⁻) = sup{ m > 0 : m ≥ f(x)ᵀf(x⁻) a.s. for x⁻ ∼ p⁻ },

where the second equality holds since supp(p) = X. We may rewrite

L*(f) = E_{x∼p, x⁺∼p⁺_x} [ −log ( e^{f(x)ᵀf(x⁺)} / ( e^{f(x)ᵀf(x⁺)} + Q e^{M(x)} ) ) ],
L(f, q⁻_β) = E_{x∼p, x⁺∼p⁺_x} [ −log ( e^{f(x)ᵀf(x⁺)} / ( e^{f(x)ᵀf(x⁺)} + Q E_{x⁻∼q⁻_β}[e^{f(x)ᵀf(x⁻)}] ) ) ].

The difference between these two terms can be bounded as follows:

L*(f) − L(f, q⁻_β) = E_{x∼p, x⁺∼p⁺_x} [ log( e^{f(x)ᵀf(x⁺)} + Q e^{M(x)} ) − log( e^{f(x)ᵀf(x⁺)} + Q E_{x⁻∼q⁻_β}[e^{f(x)ᵀf(x⁻)}] ) ]
≤ ( e^{1/t}/(Q + 1) ) · E_{x∼p, x⁺∼p⁺_x} | e^{f(x)ᵀf(x⁺)} + Q e^{M(x)} − e^{f(x)ᵀf(x⁺)} − Q E_{x⁻∼q⁻_β}[e^{f(x)ᵀf(x⁻)}] |
= ( e^{1/t} Q/(Q + 1) ) · E_{x∼p} | e^{M(x)} − E_{x⁻∼q⁻_β}[e^{f(x)ᵀf(x⁻)}] |
≤ e^{1/t} · E_{x∼p} E_{x⁻∼q⁻_β} [ e^{M(x)} − e^{f(x)ᵀf(x⁻)} ],

where for the first inequality we have used the fact that f lies on the hypersphere of radius 1/t to restrict the domain of the logarithm to values greater than (Q + 1)e^{−1/t}, on which the logarithm is Lipschitz with parameter e^{1/t}/(Q + 1). Using again the fact that f lies on the hypersphere, we know that f(x)ᵀf(x⁻) ≤ 1/t², and hence have the following inequality:

E_{x∼p} E_{q⁻_β} [ e^{M(x)} − e^{f(x)ᵀf(x⁻)} ] ≤ e^{1/t²} E_{x∼p} E_{q⁻_β} [ M(x) − f(x)ᵀf(x⁻) ].

Let us consider the inner expectation E_β(x) = E_{q⁻_β}[ M(x) − f(x)ᵀf(x⁻) ]. Since f is bounded, E_β(x) is uniformly bounded in x. Therefore, in order to show the convergence L(f, q⁻_β) → L*(f) as β → ∞, it suffices by the dominated convergence theorem to show that E_β(x) → 0 pointwise as β → ∞ for arbitrary fixed x ∈ X.

From now on we write M = M(x) for brevity, and consider a fixed x ∈ X. From the definition of q⁻_β it is clear that q⁻_β ≪ p⁻. That is, since q⁻_β = c · p⁻ for some (non-constant) c, it is absolutely continuous with respect to p⁻. So M(x) ≥ f(x)ᵀf(x⁻) almost surely for x⁻ ∼ q⁻_β, and we may therefore drop the absolute value signs from the expectation. Define the event G_ε = { x⁻ : f(x)ᵀf(x⁻) ≥ M − ε }, where G refers to a "good" event, and define its complement B_ε = G_ε^c, where B is for "bad". For a fixed x ∈ X and ε > 0, consider

E_β(x) = E_{x⁻∼q⁻_β}[ M(x) − f(x)ᵀf(x⁻) ]
= P_{x⁻∼q⁻_β}(G_ε) · E_{x⁻∼q⁻_β}[ M(x) − f(x)ᵀf(x⁻) | G_ε ] + P_{x⁻∼q⁻_β}(B_ε) · E_{x⁻∼q⁻_β}[ M(x) − f(x)ᵀf(x⁻) | B_ε ]
≤ P_{x⁻∼q⁻_β}(G_ε) · ε + 2 P_{x⁻∼q⁻_β}(B_ε)
≤ ε + 2 P_{x⁻∼q⁻_β}(B_ε).

We need to control P_{x⁻∼q⁻_β}(B_ε). Expanding,

P_{x⁻∼q⁻_β}(B_ε) = ∫_X 1{ f(x)ᵀf(x⁻) < M − ε } ( e^{βf(x)ᵀf(x⁻)} p⁻(x⁻) / Z_β ) dx⁻,

where Z_β = ∫_X e^{βf(x)ᵀf(x⁻)} p⁻(x⁻) dx⁻ is the partition function of q⁻_β. We may bound this expression by

∫_X 1{ f(x)ᵀf(x⁻) < M − ε } ( e^{β(M−ε)} p⁻(x⁻) / Z_β ) dx⁻ ≤ ( e^{β(M−ε)} / Z_β ) ∫_X 1{ f(x)ᵀf(x⁻) < M − ε } p⁻(x⁻) dx⁻
= ( e^{β(M−ε)} / Z_β ) P_{x⁻∼p⁻}(B_ε) ≤ e^{β(M−ε)} / Z_β.

Note that

Z_β = ∫_X e^{βf(x)ᵀf(x⁻)} p⁻(x⁻) dx⁻ ≥ e^{β(M−ε/2)} P_{x⁻∼p⁻}( f(x)ᵀf(x⁻) ≥ M − ε/2 ).

By the definition of M = M(x), the probability ρ_ε = P_{x⁻∼p⁻}( f(x)ᵀf(x⁻) ≥ M − ε/2 ) > 0, and we may therefore bound

P_{x⁻∼q⁻_β}(B_ε) ≤ e^{β(M−ε)} / ( e^{β(M−ε/2)} ρ_ε ) = e^{−βε/2}/ρ_ε → 0   as β → ∞.

We may therefore take β sufficiently big that P_{x⁻∼q⁻_β}(B_ε) ≤ ε, and therefore E_β(x) ≤ 3ε. In other words, E_β(x) → 0 as β → ∞.

A.2 OPTIMAL EMBEDDINGS ON THE HYPERSPHERE FOR WORST-CASE NEGATIVE SAMPLES

In order to study properties of global optima of the contrastive objective with the adversarial worst-case hard sampling distribution, recall that we have the following limiting objective:

L_∞(f, q) = E_{x∼p, x⁺∼p⁺_x} [ −log ( e^{f(x)ᵀf(x⁺)} / E_{x⁻∼q}[e^{f(x)ᵀf(x⁻)}] ) ].   (8)

We may separate the logarithm of a quotient into the sum of two terms plus a constant:

L_∞(f, q) = L_align(f) + L_unif(f, q) − 1/t²,

where L_align(f) = E_{x,x⁺} ‖f(x) − f(x⁺)‖²/2 and L_unif(f, q) = E_{x∼p} log E_{x⁻∼q}[e^{f(x)ᵀf(x⁻)}]. Here we have used the fact that f lies on the boundary of the hypersphere of radius 1/t, which gives the following equivalence between inner products and squared Euclidean norms:

2/t² − 2f(x)ᵀf(x⁺) = ‖f(x)‖² + ‖f(x⁺)‖² − 2f(x)ᵀf(x⁺) = ‖f(x) − f(x⁺)‖².   (9)

Taking the supremum to obtain L*_∞(f) = sup_{q∈Π} L_∞(f, q), we find that the second term simplifies to

L*_unif(f) = sup_{q∈Π} L_unif(f, q) = E_{x∼p} log sup_{x⁻ ≁ x} e^{f(x)ᵀf(x⁻)} = E_{x∼p} sup_{x⁻ ≁ x} f(x)ᵀf(x⁻).

Using Eqn. (9), this can be re-expressed as

E_{x∼p} sup_{x⁻ ≁ x} f(x)ᵀf(x⁻) = −E_{x∼p} inf_{x⁻ ≁ x} ‖f(x) − f(x⁻)‖²/2 + 1/t².   (10)

The forthcoming theorem exactly characterizes the global optima of min_f L*_∞(f).

Theorem 7. Suppose the downstream task is classification (i.e. C is finite), and let L*_∞(f) = sup_{q∈Π} L_∞(f, q). The infimum inf_{f measurable} L*_∞(f) is attained, and any f* achieving the global minimum is such that f*(x) = f*(x⁺) almost surely. Furthermore, letting v_c = f*(x) for any x such that h(x) = c (so v_c is well defined up to a set of x of measure zero), f* is characterized as being any solution to the following ball-packing problem:

max_{{v_c ∈ S^{d−1}/t}_{c∈C}} Σ_{c∈C} ρ(c) · min_{c′≠c} ‖v_c − v_{c′}‖².   (7)

Proof. Any minimizer of L_align(f) has the property that f(x) = f(x⁺) almost surely. So, in order to prove the first claim, it suffices to show that there exist functions f ∈ arg inf_f L*_unif(f) for which f(x) = f(x⁺) almost surely. This is because, at that point, we have shown that arg min_f L_align(f) and arg min_f L*_unif(f) intersect, and therefore any solution of L*_∞(f) = L_align(f) + L*_unif(f) must lie in this intersection.

To this end, suppose that f ∈ arg min_f L*_unif(f) but that f(x) ≠ f(x⁺) with non-zero probability. We shall show that we can construct a new embedding f̃ such that f̃(x) = f̃(x⁺) almost surely and L*_unif(f̃) ≤ L*_unif(f). Due to Eqn. (10), this last condition is equivalent to showing

E_{x∼p} inf_{x⁻ ≁ x} ‖f̃(x) − f̃(x⁻)‖² ≥ E_{x∼p} inf_{x⁻ ≁ x} ‖f(x) − f(x⁻)‖².   (12)

Fix a c ∈ C, and let x_c ∈ arg max_{x: h(x)=c} inf_{x⁻ ≁ x} ‖f(x) − f(x⁻)‖². The maximum is guaranteed to be attained, as we explain now. Indeed, we know the maximum is attained at some point in the closure ∂{x : h(x) = c} ∪ {x : h(x) = c}. Since X is compact and connected, any point x ∈ ∂{x : h(x) = c} \ {x : h(x) = c} is such that inf_{x⁻ ≁ x} ‖f(x) − f(x⁻)‖² = 0, since x must belong to {x : h(x) = c′} for some other c′. Such an x cannot be a solution unless all points in {x : h(x) = c} also achieve 0, in which case we can simply take x_c to be a point in the interior of {x : h(x) = c}. Now, define f̃(x) = f(x_c) for any x such that h(x) = c, and f̃(x) = f(x) otherwise. Let us first aim to show that Eqn. (12) holds for this f̃. We begin to expand the left-hand side of Eqn. (12):

E_{x∼p} inf_{x⁻ ≁ x} ‖f̃(x) − f̃(x⁻)‖²
= E_{ĉ∼ρ} E_{x∼p(·|ĉ)} inf_{x⁻ ≁ x} ‖f̃(x) − f̃(x⁻)‖²
= ρ(c) E_{x∼p(·|c)} inf_{x⁻ ≁ x} ‖f̃(x) − f̃(x⁻)‖² + (1 − ρ(c)) E_{ĉ∼ρ(·|ĉ≠c)} E_{x∼p(·|ĉ)} inf_{x⁻ ≁ x} ‖f̃(x) − f̃(x⁻)‖²
= ρ(c) E_{x∼p(·|c)} inf_{x⁻ ≁ x} ‖f(x_c) − f̃(x⁻)‖² + (1 − ρ(c)) E_{ĉ∼ρ(·|ĉ≠c)} E_{x∼p(·|ĉ)} inf_{x⁻ ≁ x} ‖f̃(x) − f̃(x⁻)‖²
= ρ(c) inf_{x⁻ ≁ x_c} ‖f(x_c) − f(x⁻)‖² + (1 − ρ(c)) E_{ĉ∼ρ(·|ĉ≠c)} E_{x∼p(·|ĉ)} inf_{h(x⁻)≠ĉ} ‖f̃(x) − f̃(x⁻)‖².   (13)

By construction, the first term can be lower bounded: inf_{x⁻ ≁ x_c} ‖f(x_c) − f(x⁻)‖² ≥ E_{x∼p(·|c)} inf_{h(x⁻)≠c} ‖f(x) − f(x⁻)‖², since x_c maximizes inf_{x⁻ ≁ x} ‖f(x) − f(x⁻)‖² over x with h(x) = c. To lower bound the second term, consider any fixed ĉ ≠ c and x ∼ p(·|ĉ) (so h(x) = ĉ). Define the following two subsets of the embedding space:

Ã = { f̃(x⁻) : h(x⁻) ≠ ĉ, x⁻ ∈ X },   A = { f(x⁻) : h(x⁻) ≠ ĉ, x⁻ ∈ X }.

Since by construction the range of f̃ is a subset of the range of f, we know that Ã ⊆ A. Combining this with the fact that f̃(x) = f(x) whenever h(x) = ĉ ≠ c, we see

inf_{h(x⁻)≠ĉ} ‖f̃(x) − f̃(x⁻)‖² = inf_{h(x⁻)≠ĉ} ‖f(x) − f̃(x⁻)‖² = inf_{u∈Ã} ‖f(x) − u‖² ≥ inf_{u∈A} ‖f(x) − u‖² = inf_{h(x⁻)≠ĉ} ‖f(x) − f(x⁻)‖².

Using these two lower bounds, we may conclude that Eqn. (13) is lower bounded by

ρ(c) E_{x∼p(·|c)} inf_{h(x⁻)≠c} ‖f(x) − f(x⁻)‖² + (1 − ρ(c)) E_{ĉ∼ρ(·|ĉ≠c)} E_{x∼p(·|ĉ)} inf_{h(x⁻)≠ĉ} ‖f(x) − f(x⁻)‖²,

which equals E_{x∼p} inf_{x⁻ ≁ x} ‖f(x) − f(x⁻)‖². We have therefore proved Eqn. (12).

To summarize the current progress: given an embedding f, we have constructed a new embedding f̃ that attains lower L_unif loss and which is constant on {x : h(x) = c}. Enumerating C = {c_1, c_2, ..., c_{|C|}}, we may repeatedly apply the same argument to construct a sequence of embeddings f_1, f_2, ..., f_{|C|} such that f_i is constant on each of the sets {x : h(x) = c_j} for j ≤ i. The final embedding in the sequence, f* = f_{|C|}, is such that L*_unif(f*) ≤ L*_unif(f), and therefore f* is a minimizer. This embedding is constant on each of {x : h(x) = c_j} for j = 1, 2, ..., |C|. In other words, f*(x) = f*(x⁺) almost surely. We have proved the first claim.

Obtaining the second claim is a matter of manipulating L*_∞(f*). Indeed, we know that L*_∞(f*) = L*_unif(f*) − 1/t², and defining v_c = f*(x) = f(x_c) for each c ∈ C, this expression is minimized if and only if f* attains

max_f E_{x∼p} inf_{x⁻ ≁ x} ‖f(x) − f(x⁻)‖² = max_f E_{c∼ρ} E_{x∼p(·|c)} inf_{h(x⁻)≠c} ‖f(x) − f(x⁻)‖²
= max_f Σ_{c∈C} ρ(c) · inf_{h(x⁻)≠c} ‖f(x) − f(x⁻)‖²
= max_{{v_c ∈ S^{d−1}/t}_{c∈C}} Σ_{c∈C} ρ(c) · min_{c′≠c} ‖v_c − v_{c′}‖²,

where the final equality inserts f* as an optimal f and reparameterizes the maximum to be over the set of vectors {v_c ∈ S^{d−1}/t}_{c∈C}.

A.3 DOWNSTREAM GENERALIZATION

Theorem 8. Suppose ρ is uniform on C and f is such that L*_∞(f) − inf_{f̃ measurable} L*_∞(f̃) ≤ ε with ε ≤ 1. Let {v*_c ∈ S^{d−1}/t}_{c∈C} be a solution to Problem (7), and define ξ = min_{c,c⁻: c≠c⁻} ‖v*_c − v*_{c⁻}‖ > 0. Then there exists a set of vectors {v_c ∈ S^{d−1}/t}_{c∈C} such that the following 1-nearest-neighbor classifier, ĥ(x) = ĉ where ĉ = arg min_{c∈C} ‖f(x) − v_c‖ (ties broken arbitrarily), achieves misclassification risk

P( ĥ(x) ≠ c ) ≤ 8ε / ( ξ² − 2|C|(1 + 1/t)ε^{1/2} )².

Proof. To begin, using the definition of ĥ we know that for any 0 < δ < ξ̄ (with ξ̄ defined below),

P_{x,c}( ĥ(x) = c ) = P_{x,c}( ‖f(x) − v_c‖ ≤ min_{c⁻≠c} ‖f(x) − v_{c⁻}‖ )
≥ P_{x,c}( ‖f(x) − v_c‖ ≤ δ and δ ≤ min_{c⁻≠c} ‖f(x) − v_{c⁻}‖ )
≥ 1 − P_{x,c}( ‖f(x) − v_c‖ > δ ) − P_{x,c}( min_{c⁻≠c} ‖f(x) − v_{c⁻}‖ < δ ).

So to prove the result, our goal is now to bound these two probabilities.
To do so, we use the bound on the excess risk. Combining the fact that L*_∞(f) − inf_{f̃ measurable} L*_∞(f̃) ≤ ε with the notational rearrangements before Theorem 7, we observe that E_{x,x⁺} ‖f(x) − f(x⁺)‖² ≤ 2ε. We have

2ε ≥ E_{x,x⁺} ‖f(x) − f(x⁺)‖² = E_{c∼ρ} E_{x⁺∼p(·|c)} E_{x∼p(·|c)} ‖f(x) − f(x⁺)‖².

For fixed c, let x_c ∈ arg min_{x⁺: h(x⁺)=c} E_{x∼p(·|c)} ‖f(x) − f(x⁺)‖², where we extend the minimum to be over the closure, a compact set, to guarantee it is attained. Then we have

2ε ≥ E_{c∼ρ} E_{x⁺∼p(·|c)} E_{x∼p(·|c)} ‖f(x) − f(x⁺)‖² ≥ E_{c∼ρ} E_{x∼p(·|c)} ‖f(x) − v_c‖²,

where we have now defined v_c = f(x_c) for each c ∈ C. Note in particular that v_c lies on the surface of the hypersphere S^{d−1}/t. This enables us to obtain the following bound using Markov's inequality:

P_{x,c}( ‖f(x) − v_c‖ > δ ) = P_{x,c}( ‖f(x) − v_c‖² > δ² ) ≤ E_{x,c} ‖f(x) − v_c‖² / δ² ≤ 2ε/δ²,

so it remains to bound P_{x,c}( min_{c⁻≠c} ‖f(x) − v_{c⁻}‖ < δ ). Defining ξ̄ = min_{c,c⁻: c≠c⁻} ‖v_c − v_{c⁻}‖, we have the following fact (proven later).

Fact (see Lemma 9): ξ̄ ≥ √( ξ² − 2|C|(1 + 1/t)√ε ).

Using this fact we are able to control the tail probability as follows:

P_{x,c}( min_{c⁻≠c} ‖f(x) − v_{c⁻}‖ < δ ) ≤ P_{x,c}( ‖f(x) − v_c‖ > ξ̄ − δ )
≤ P_{x,c}( ‖f(x) − v_c‖ > √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ) − δ )
= P_{x,c}( ‖f(x) − v_c‖² > ( √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ) − δ )² )
≤ 2ε / ( √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ) − δ )²,

where this inequality holds for any 0 ≤ δ ≤ √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ). Gathering together our tail probability bounds, we find that

P_{x,c}( ĥ(x) = c ) ≥ 1 − 2ε/δ² − 2ε / ( √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ) − δ )²

for any such δ. That is,

P_{x,c}( ĥ(x) ≠ c ) ≤ 2ε/δ² + 2ε / ( √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ) − δ )².

Since this holds for any 0 ≤ δ ≤ √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ),

P_{x,c}( ĥ(x) ≠ c ) ≤ min_{0 ≤ δ ≤ √(ξ² − 2|C|(1+1/t)ε^{1/2})} [ 2ε/δ² + 2ε / ( √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ) − δ )² ].

Elementary calculus shows that the minimum is attained at δ = √( ξ² − 2|C|(1 + 1/t)ε^{1/2} )/2. Plugging this in yields the final bound,

P( ĥ(x) ≠ c ) ≤ 8ε / ( ξ² − 2|C|(1 + 1/t)ε^{1/2} )².

Lemma 9. Consider the same setting as introduced in Theorem 5. In particular, define

ξ̄ = min_{c,c⁻: c≠c⁻} ‖v_c − v_{c⁻}‖,   ξ = min_{c,c⁻: c≠c⁻} ‖v*_c − v*_{c⁻}‖,

where {v*_c ∈ S^{d−1}/t}_{c∈C} is a solution to Problem (7), and {v_c ∈ S^{d−1}/t}_{c∈C} is defined via v_c = f(x_c) with x_c ∈ arg min_{x⁺: h(x⁺)=c} E_{x∼p(·|c)} ‖f(x) − f(x⁺)‖² for each c ∈ C. Then we have

ξ̄ ≥ √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ).

Proof. Define

X = min_{c⁻≠c} ‖v_c − v_{c⁻}‖²,   X* = min_{c⁻≠c} ‖v*_c − v*_{c⁻}‖².

X and X* are random due to the randomness of c ∼ ρ. We can split the following expectation by conditioning on the event {X ≤ X*} and its complement:

E|X − X*| = P(X ≥ X*) E[ X − X* | X ≥ X* ] + P(X ≤ X*) E[ X* − X | X ≤ X* ].   (14)

Using L*_∞(f) − inf_{f̃ measurable} L*_∞(f̃) ≤ ε and the notational rewriting of the objective L*_∞ introduced before Theorem 7, we observe the following fact, whose proof we give in a separate lemma after the conclusion of this proof.

Fact (see Lemma 10): EX* − 2(1 + 1/t)√ε ≤ EX ≤ EX*.

This fact implies in particular that E[X − X*] ≤ 0 and E[X* − X] ≤ 2(1 + 1/t)√ε. Inserting both inequalities into Eqn. (14), we find that E|X − X*| ≤ 2(1 + 1/t)√ε. In other words, since ρ is uniform,

(1/|C|) Σ_{c∈C} | min_{c⁻≠c} ‖v_c − v_{c⁻}‖² − min_{c⁻≠c} ‖v*_c − v*_{c⁻}‖² | ≤ 2(1 + 1/t)√ε,

from which we can say that for any c ∈ C,

| min_{c⁻≠c} ‖v_c − v_{c⁻}‖² − min_{c⁻≠c} ‖v*_c − v*_{c⁻}‖² | ≤ 2|C|(1 + 1/t)√ε.

So

min_{c⁻≠c} ‖v_c − v_{c⁻}‖ ≥ √( min_{c⁻≠c} ‖v*_c − v*_{c⁻}‖² − 2|C|(1 + 1/t)ε^{1/2} ) ≥ √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ).

Since this holds for any c ∈ C, we conclude that ξ̄ ≥ √( ξ² − 2|C|(1 + 1/t)ε^{1/2} ).

Lemma 10. Consider the same setting as introduced in Theorem 5. Define also

X = min_{c⁻≠c} ‖v_c − v_{c⁻}‖²,   X* = min_{c⁻≠c} ‖v*_c − v*_{c⁻}‖²,

where v_c = f(x_c) with x_c ∈ arg min_{x⁺: h(x⁺)=c} E_{x∼p(·|c)} ‖f(x) − f(x⁺)‖² for each c ∈ C. We have

EX* − 2(1 + 1/t)√ε ≤ EX ≤ EX*.

Proof. By Theorem 7 we know there is an f* attaining the minimum inf_{f̃ measurable} L*_∞(f̃), that this f* attains L_align(f*) = 0, and that it also minimizes the uniformity term, taking the value L*_unif(f*) = E_{c∼ρ} max_{c⁻≠c} v*_cᵀv*_{c⁻}. Because of this we find

L*_unif(f) ≤ L*_∞(f) − L*_∞(f*) + L_align(f*) − L_align(f) + L*_unif(f*)
≤ L*_∞(f) − L*_∞(f*) + L*_unif(f*)
≤ ε + L*_unif(f*) = ε + E_{c∼ρ} max_{c⁻≠c} v*_cᵀv*_{c⁻}.

Since we would like to bound E_{c∼ρ} max_{c⁻≠c} v_cᵀv_{c⁻} in terms of E_{c∼ρ} max_{c⁻≠c} v*_cᵀv*_{c⁻}, this observation means it suffices to bound E_{c∼ρ} max_{c⁻≠c} v_cᵀv_{c⁻} in terms of L*_unif(f). To this end, note that for fixed c and x such that h(x) = c we have

sup_{x⁻ ≁ x} f(x)ᵀf(x⁻) = sup_{x⁻ ≁ x} [ v_cᵀf(x⁻) + (f(x) − v_c)ᵀf(x⁻) ]
≥ sup_{x⁻ ≁ x} v_cᵀf(x⁻) − ‖f(x) − v_c‖/t
≥ max_{x⁻ ∈ {x_{c′}}_{c′≠c}} v_cᵀf(x⁻) − ‖f(x) − v_c‖/t
= max_{c⁻≠c} v_cᵀv_{c⁻} − ‖f(x) − v_c‖/t,

where the second inequality follows since {x_{c′}}_{c′≠c} is a subset of the closure of {x⁻ : x⁻ ≁ x}. Taking expectations over c, x,

L*_unif(f) = E_{x,c} sup_{x⁻ ≁ x} f(x)ᵀf(x⁻) ≥ E_{c∼ρ} max_{c⁻≠c} v_cᵀv_{c⁻} − E_{x,c} ‖f(x) − v_c‖/t ≥ E_{c∼ρ} max_{c⁻≠c} v_cᵀv_{c⁻} − √ε/t.

So, since ε ≤ √ε, we have found that

E_{c∼ρ} max_{c⁻≠c} v_cᵀv_{c⁻} ≤ √ε/t + ε + E_{c∼ρ} max_{c⁻≠c} v*_cᵀv*_{c⁻} ≤ (1 + 1/t)√ε + E_{c∼ρ} max_{c⁻≠c} v*_cᵀv*_{c⁻}.

Of course, we also have

E_{c∼ρ} max_{c⁻≠c} v*_cᵀv*_{c⁻} = L*_unif(f*) ≤ E_{c∼ρ} max_{c⁻≠c} v_cᵀv_{c⁻},

since the embedding f̄(x) = v_c whenever h(x) = c is also a feasible solution. Combining these two inequalities with the simple identity xᵀy = 1/t² − ‖x − y‖²/2 for all vectors x, y of length 1/t, we find

1/t² − E_{c∼ρ} min_{c⁻≠c} ‖v*_c − v*_{c⁻}‖²/2 ≤ 1/t² − E_{c∼ρ} min_{c⁻≠c} ‖v_c − v_{c⁻}‖²/2 ≤ 1/t² − E_{c∼ρ} min_{c⁻≠c} ‖v*_c − v*_{c⁻}‖²/2 + (1 + 1/t)√ε.

Subtracting 1/t² and multiplying by −2 yields the result.

B GRAPH REPRESENTATION LEARNING

We describe in detail the hard sampling method for graphs whose results are reported in Section 5.2. Before getting to that point, in the interests of completeness we cover some required background on the InfoGraph method of Sun et al. (2020). For further information see the original paper (Sun et al., 2020).

B.1 BACKGROUND ON GRAPH REPRESENTATIONS

We observe a set of graphs G = {G_j ∈ 𝒢}_{j=1}^n sampled according to a distribution p over an ambient graph space 𝒢. Each node u in a graph G is assumed to have features h_u^(0) living in some Euclidean space. We consider a K-layer graph neural network, whose k-th layer iteratively computes updated embeddings for each node v ∈ G in the following way:

h_v^(k) = COMBINE^(k)( h_v^(k−1), AGGREGATE^(k)( { (h_v^(k−1), h_u^(k−1), e_uv) : u ∈ N(v) } ) ),

where COMBINE^(k) and AGGREGATE^(k) are parameterized learnable functions and N(v) denotes the set of neighboring nodes of v. The K embeddings for a node u are collected together to obtain a single final summary embedding for u. As recommended by Xu et al. (2019) we use concatenation,

h_u = h_u(G) = CONCAT( {h_u^(k)}_{k=1}^K ),

to obtain an embedding in R^d. Finally, the node representations are combined into a length-d graph-level embedding using a readout function,

H(G) = READOUT( {h_u}_{u∈G} ),

which is typically taken to be a simple permutation-invariant function such as the sum or mean. The InfoGraph method aims to maximize the mutual information between the graph-level embedding H(G) and the patch-level embeddings h_u(G) using the following objective:

max_h E_{G∼p} (1/|G|) Σ_{u∈G} I( h_u(G); H(G) ).

In practice the population distribution p is replaced by its empirical counterpart, and the mutual information I is replaced by a variational approximation I_T. In line with Sun et al. (2020) we use the Jensen-Shannon mutual information estimator as formulated by Nowozin et al. (2016). It is defined using a neural network discriminator T : R^{2d} → R as

I_T( h_u(G); H(G) ) = E_{G∼p}[ −sp( −T(h_u(G), H(G)) ) ] − E_{(G,G′)∼p×p}[ sp( T(h_u(G), H(G′)) ) ],

where sp(z) = log(1 + e^z) denotes the softplus function. The final objective is the joint maximization over h and T:

max_{θ,ψ} E_{G∼p} (1/|G|) Σ_{u∈G} I_T( h_u(G); H(G) ).
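For concreteness, the Jensen-Shannon estimator above can be written in PyTorch as follows; T is a placeholder for the discriminator network, and the h/H arguments for batches of matched and mismatched patch- and graph-level embeddings.

```python
import torch.nn.functional as F

def jsd_mi_estimate(T, h_pos, H_pos, h_neg, H_neg):
    # I_T = E_p[-sp(-T(h, H))] - E_{p x p}[sp(T(h, H'))], with sp(z) = log(1 + e^z).
    pos_term = -F.softplus(-T(h_pos, H_pos)).mean()   # matched (node, graph) pairs
    neg_term = F.softplus(T(h_neg, H_neg)).mean()     # mismatched pairs
    return pos_term - neg_term
```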

B.2 HARD NEGATIVE SAMPLING FOR LEARNING GRAPH REPRESENTATIONS

In order to derive a simple modification of the NCE hard sampling technique that is appropriate for use with InfoGraph, we first provide a mildly generalized view of hard sampling. Recall that the NCE contrastive objective can be decomposed into two constituent pieces,

L(f, q) = L_align(f) + L_unif(f, q),

where q is in fact a family of distributions q(x⁻; x) over x⁻ indexed by the possible values of the anchor x. L_align performs the role of "aligning" positive pairs (embedding them near one another), while L_unif repels negative pairs. The hard sampling framework aims to solve inf_f sup_q L(f, q). In the case of the NCE loss we take

L_align(f) = −E_{x∼p, x⁺∼p⁺_x}[ f(x)ᵀf(x⁺) ],
L_unif(f, q) = E_{x∼p, x⁺∼p⁺_x}[ log( e^{f(x)ᵀf(x⁺)} + Q E_{x⁻∼q}[e^{f(x)ᵀf(x⁻)}] ) ].

With this view, we can easily adapt to the InfoGraph framework, taking

L_align(h, T) = −E_{G∼p} (1/|G|) Σ_{u∈G} sp( −T(h_u(G), H(G)) ),
L_unif(h, T, q) = −E_{G∼p} (1/|G|) Σ_{u∈G} E_{G′∼q}[ sp( T(h_u(G), H(G′)) ) ].

Denote by p̄ the distribution over nodes defined by first sampling G ∼ p, then sampling u ∈ G uniformly over the nodes of G. Then these two terms can be simplified to

L_align(h, T) = −E_{u∼p̄}[ sp( −T(h_u(G), H(G)) ) ],
L_unif(h, T, q) = −E_{(u,G′)∼p̄×q}[ sp( T(h_u(G), H(G′)) ) ].

At this point it becomes clear that, just as with NCE, a distribution q* ∈ arg max_q L(f, q) in the InfoGraph framework must be supported on arg max_{G′∈𝒢} sp( T(h_u(G), H(G′)) ). Although this is still hard to compute exactly, it can be approximated by

q_β^u(G′) ∝ e^{βT(h_u(G), H(G′))} · p(G′).

C ADDITIONAL EXPERIMENTAL RESULTS

C.1 HARD SAMPLING WITH MOCO-V2

The vision experiments in the main body of the paper are all based on the SimCLR framework (Chen et al., 2020a). They use a relatively small batch size (up to 512). In order to test whether our hard negative sampling method can help when the negative batch size is very large, we also run experiments using MoCo-v2 with the standard negative memory bank size N = 65536 (He et al., 2020; Chen et al., 2020c). We adopt the official MoCo-v2 code (https://github.com/facebookresearch/moco). Embeddings are trained for 200 epochs, with batch size 128.

To study the effect of varying the concentration parameter β on learned embeddings, Figure 9 plots cosine similarity histograms of pairs of similar and dissimilar points. The results show that moving β from 0 through 0.5 to 2 causes both the positive and negative similarities to gradually skew left. In terms of downstream classification, an important property is the relative difference in similarity between positive and negative pairs. In this case β = 0.5 finds the best balance (since it achieves the highest downstream accuracy). When β is taken very large (β = 6), we see a change in behavior: both positive and negative pairs are assigned higher similarities in general. Visually, the positive and negative histograms for β = 6 overlap much more than for smaller values, which helps explain why the linear readout accuracy is lower for β = 6.

Figure 12 gives real examples of hard vs. uniformly sampled negatives. Given an anchor x (a monkey) and a trained embedding f (trained on STL10 using standard SimCLR for 400 epochs), we sample a batch of 128 images. The top row shows the ten negatives x⁻ that have the largest inner product f(x)ᵀf(x⁻), while the bottom row is a random sample from the same batch. Negatives with the largest inner product with the anchor are the most important terms in the objective, since they are given the highest weighting by q⁻_β.
Figure 12 shows that "real" hard negatives match the idea proposed in Figure 1: hard negatives are semantically similar to the anchor, sharing various characteristics, including color (browns and greens), texture (fur), and object type (animals vs. machinery).

D EXPERIMENTAL DETAILS

Figure 13 shows PyTorch-style pseudocode for the standard objective, the debiased objective, and the hard sampling objective. The proposed hard-sample loss is very simple to implement, requiring only two extra lines of code compared to the standard objective.

D.1 VISUAL REPRESENTATIONS

We implement SimCLR in PyTorch. We use a ResNet-50 (He et al., 2016) as the backbone with embedding dimension 2048 (the representation used for linear readout), and a projection head into a lower 128-dimensional space (the embedding used in the contrastive objective). We use the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001 and weight decay 10⁻⁶. Code is available at https://github.com/joshr17/HCL. Since we adopt the SimCLR framework, the number of negative samples N = 2(batch size − 1). Since we always take the batch size to be a power of 2 (16, 32, 64, 128, 256), the negative batch sizes are 30, 62, 126, 254, 510 respectively. Unless otherwise stated, all models are trained for 400 epochs.

Annealing β method: We detail the annealing method whose results are given in Figure 4. The idea is to reduce the concentration parameter down to zero as training progresses. Specifically, suppose we have e total training epochs, and specify a number ℓ of "changes" to the concentration parameter. We initialize the concentration parameter to β₁ = β (where this β is the number reported in Figure 4); then once every e/ℓ epochs we reduce β_i by β/ℓ. In other words, if we are currently on β_i, then β_{i+1} = β_i − β/ℓ, and we switch from β_i to β_{i+1} in epoch number i · e/ℓ (a sketch of this schedule is given at the end of this subsection). The idea of this method is to select particularly difficult negative samples early on, in order to obtain useful gradient information early in training, but later (once the embedding is already quite good) to reduce the "hardness" level so as to reduce the harmful effect of only approximately correcting for false negatives (negatives with the same label as the anchor). We also found that annealing in the opposite direction achieved similar performance.

Figure 13: Pseudocode for our proposed new hard sample objective, as well as the original NCE contrastive objective and the debiased contrastive objective. In each case we take the number of positive samples to be M = 1. The implementation of our hard sampling method requires only two additional lines of code compared to the standard objective.

Bias-variance of empirical estimates in the hard-negative objective: Recall the final hard negative samples objective we derive is

E_{x∼p, x⁺∼p⁺_x} [ −log ( e^{f(x)ᵀf(x⁺)} / ( e^{f(x)ᵀf(x⁺)} + (Q/τ⁻)( E_{x⁻∼q_β}[e^{f(x)ᵀf(x⁻)}] − τ⁺ E_{v∼q⁺_β}[e^{f(x)ᵀf(v)}] ) ) ) ].

This objective admits a practical counterpart by using empirical approximations to E_{x⁻∼q_β}[e^{f(x)ᵀf(x⁻)}] and E_{v∼q⁺_β}[e^{f(x)ᵀf(v)}]. In practice we use a fairly large number of samples (e.g. N = 510) to approximate the first expectation, and only M = 1 samples to approximate the second. Clearly in both cases the resulting estimator is unbiased. Further, since the first expectation is approximated using many samples, and the integrand is bounded, the resulting estimator is well concentrated (e.g. apply Hoeffding's inequality out of the box). But what about the second expectation? This might seem uncontrolled since we use only one sample; however, it turns out that the random variable X = e^{f(x)ᵀf(v)}, where x ∼ p and v ∼ q⁺_β, has variance bounded by L_align(f).

Lemma 11. Consider the random variable X = e^{f(x)ᵀf(v)} where x ∼ p and v ∼ q⁺_β. Then Var(X) ≤ O( L_align(f) ).

Recall that L_align(f) = E_{x,x⁺} ‖f(x) − f(x⁺)‖²/2 is termed alignment, and Wang & Isola (2020) show that the contrastive objective jointly optimizes alignment and uniformity.
Lemma 11 therefore shows that as training evolves, the variance of X = e^{f(x)ᵀf(v)}, where x ∼ p and v ∼ q⁺_β, is bounded by a term that we expect to become small, suggesting that using a single sample (M = 1) to approximate this expectation is not unreasonable. We cannot, however, say more than this, since we have no guarantee that L_align(f) goes to zero.

Proof. Fix an x and recall that we are considering q⁺_β(·) = q⁺_β(·; x). First let X′ be an i.i.d. copy of X, and note that, conditioning on x, we have

2 Var(X | x) = Var(X | x) + Var(X′ | x) = Var(X − X′ | x) ≤ E[ (X − X′)² | x ].

Bounding this difference,

E[ (X − X′)² | x ] = E_{v,v′∼q⁺_β} ( e^{f(x)ᵀf(v)} − e^{f(x)ᵀf(v′)} )²
≤ e^{2/t²} E_{v,v′∼q⁺_β} ( f(x)ᵀf(v) − f(x)ᵀf(v′) )²
≤ ( e^{2/t²}/t² ) E_{v,v′∼q⁺_β} ‖f(v) − f(v′)‖²
≤ O( E_{v,v′∼p⁺} ‖f(v) − f(v′)‖² ),

where the first inequality follows since f lies on the sphere of radius 1/t (so the exponent is at most 1/t², on which range the exponential is Lipschitz), the second by Cauchy-Schwarz combined again with the fact that f lies on the sphere of radius 1/t, and the third since q⁺_β is absolutely continuous with respect to p⁺ with bounded ratio. Since p⁺(x⁺) = p(x⁺ | h(x)) depends only on c = h(x) rather than on x itself, taking expectations over x ∼ p is equivalent to taking expectations over c ∼ ρ. Further, ρ(c)p(v|c)p(v′|c) = p(v)p(v′|c) = p(v)p⁺_v(v′). So

E_{c∼ρ} E_{v,v′∼p⁺} ‖f(v) − f(v′)‖² = E_{x,x⁺} ‖f(x) − f(x⁺)‖² = 2 L_align(f),

which completes the proof.

Figure 14: In cases where the learned embedding is not normalized to lie on a hypersphere, we found that clipping the negatives to lie in a fixed range (in this case [−2, 2]) stabilizes optimization.
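Returning to the annealing schedule described earlier in this subsection, a minimal sketch (with e, ℓ, and β as defined there):

```python
def annealed_beta(epoch, total_epochs, beta, num_changes):
    # Piecewise-constant schedule: start at beta and subtract beta / num_changes
    # every total_epochs / num_changes epochs, reaching 0 by the end of training.
    i = int(epoch / (total_epochs / num_changes))
    return max(beta - i * beta / num_changes, 0.0)

# Example: 400 epochs, beta = 1.0, 10 changes -> beta drops by 0.1 every 40 epochs.
```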

D.2 GRAPH REPRESENTATIONS

All datasets we benchmark on can be downloaded at www.graphlearning.io from the TUDataset repository of graph classification problems (Morris et al., 2020). Basic statistics of the datasets are included in Tables 3 and 4. For fair comparison to the original InfoGraph method, we adopt the official code, which can be found at https://github.com/fanyun-sun/InfoGraph. We modify only the gan_losses.py script, adding in our proposed hard sampling via reweighting. For simplicity we trained all models using the same set of hyperparameters: we used the GIN architecture (Xu et al., 2019) with K = 3 layers and embedding dimension d = 32. Each model is trained for 200 epochs with batch size 128 using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001 and weight decay 10⁻⁶. Each embedding is evaluated using the average accuracy of 10-fold cross-validation using an SVM as the classifier (in line with the approach taken by Morris et al. (2020)). Each experiment is repeated from scratch 10 times, and the distribution of results from these 10 runs is plotted in Figure 3. Since the graph embeddings are not constrained to lie on a hypersphere, for a batch we clip all the inner products to lie in the interval [−2, 2] while computing the reweighting, as illustrated in Figure 14. We found this to be important for stabilizing optimization.
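The SVM evaluation described above amounts to the following scikit-learn sketch; embeddings and labels are placeholders for the learned graph-level representations and their class labels.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

scores = cross_val_score(SVC(), embeddings, labels, cv=10)   # 10-fold cross-validation
print(scores.mean(), scores.std())
```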

D.3 SENTENCE REPRESENTATIONS

We adopt the official quick-thoughts vectors experimental settings, which can be found at https://github.com/lajanugen/S2V. We keep all hyperparameters at the default values and change only the s2v-model.py script. Since the official BookCorpus dataset (Kiros et al., 2015) is not available, we use an unofficial version obtained using the following repository: https://github.com/soskek/bookcorpus. Since the sentence embeddings are also not constrained to lie on a hypersphere, we use the same clipping trick as for the graph embeddings, illustrated in Figure 14. After training on the BookCorpus dataset, we evaluate the embeddings on six different classification tasks: paraphrase identification (MSRP) (Dolan et al., 2004), question-type classification (TREC) (Voorhees & Harman, 2002), opinion polarity (MPQA) (Wiebe et al., 2005), subjectivity classification (SUBJ) (Pang & Lee, 2004), product reviews (CR) (Hu & Liu, 2004), and sentiment of movie reviews (MR) (Pang & Lee, 2005).

E FURTHER DISCUSSION

Comparison with Kalantidis et al. (2020): Kalantidis et al. (2020) also consider ways to sample negatives, and propose a mixing strategy for hard negatives, called MoCHi. The main points of difference are: 1) MoCHi considers the benefit of hard negatives, but does not consider the possibility of false negatives (Principle 1), which we found to be valuable. 2) MoCHi introduces three extra hyperparameters, while our method introduces only two (β, τ⁺); if we discard Principle 1 (i.e. set τ⁺ = 0), then only β requires tuning. 3) Our method introduces zero computational overhead by utilizing within-batch reweighting, whereas MoCHi involves a small amount of extra computation.

Limitations of the proposed method: There are two main sources of approximation in the hard negative sampling algorithm we propose to use in practice. First, our hard negatives objective requires computing an expectation over p⁺_x, the distribution on points with the same class label as x. Due to the lack of supervision, we are not able to sample exactly from p⁺_x. To circumvent this, we propose using samples generated using data augmentation, which we find works well in practice but deviates from the original formulation and framework. This approximation has important implications for practitioners: we observe that when using oracle access to p⁺_x the downstream performance improves monotonically as the concentration parameter β increases (see Fig. 4, middle), while when using the practical approximation, large values of β start to hurt performance, thereby requiring that β be tuned (values in the range (0.5, 2) tend to work well in general). Second, we estimate the expectation E_{v∼q⁺_β}[e^{f(x)ᵀf(v)}] using just a single (importance-weighted) sample. This is done in the interests of efficiency, since using M > 1 samples significantly increases the effective batch size (this is the more costly multi-view setting), increasing memory and runtime costs. We anticipate that using M > 1 would improve performance further, in exchange for this extra price, but leave experimentation to future work. That said, our analysis finds that the variance of the variable e^{f(x)ᵀf(v)} with v ∼ q⁺_β is controlled by the alignment objective, as shown in Lemma 11, which suggests that in later-stage training the variance of the single-sample estimator may be reasonably small.






Figure 1: Schematic illustration of negative sampling methods for the example of classifying species of tree. Top row: uniform sampling of negative examples (red rings) mostly focuses on data points very different from the anchor (yellow triangle), and may even sample examples from the same class (triangles vs. circles). Bottom row: hard negative sampling prefers examples that are (incorrectly) close to the anchor.


Figure 3: Classification accuracy on downstream tasks. We compare graph representations on four classification tasks. Accuracies are obtained by fine-tuning an SVM readout function, and are the average of 10 runs, each using 10-fold cross validation. Results in bold indicate best performer.

Figure 4: Left: the effect of varying the concentration parameter β on linear readout accuracy. Middle: linear readout accuracy as the concentration parameter β varies, in the case of contrastive learning (fully unsupervised), using true positive samples (uses label information), and an annealing method that improves robustness to the choice of β (see Appendix D.1 for details). Right: STL10 linear readout accuracy for hard sampling with and without debiasing, and non-hard sampling (β = 0) with and without debiasing. Best results come from using both simultaneously.

Figure 6: Hard negative sampling using MoCo-v2 framework. Results show that hard negative samples can still be useful when the negative memory bank is very large (in this case N = 65536).

Figure 7: The effect of varying the concentration parameter β on linear readout accuracy for CIFAR10. (Complements the left and middle plots of Figure 4.)

Figure 8: Histograms of cosine similarity of pairs of points with different label (bottom) and same label (top) for embeddings trained on CIFAR100 with different values of β. Histograms overlaid pairwise to allow for easy comparison.

Figure 9: Histograms of cosine similarity of pairs of points with the same label (top) and different labels (bottom) for embeddings trained on CIFAR100 with four different objectives. H=Hard Sampling, D=Debiasing. Histograms overlaid pairwise to allow for convenient comparison.

Figure 10: Histograms of cosine similarity of pairs of points with the same label (top) and different labels (bottom) for embeddings trained on CIFAR10 with four different objectives. H=Hard Sampling, D=Debiasing. Histograms overlaid pairwise to allow for convenient comparison.

Figure 12: Qualitative comparison of hard negatives and uniformly sampled negatives for an embedding trained on STL10 for 400 epochs using SimCLR. Top row: the 10 images with highest inner product with the anchor in latent space, from a batch of 128 inputs. Bottom row: a set of random samples from the same batch. Hard negatives are semantically much more similar to the anchor than uniformly sampled negatives, possessing many similar characteristics, including texture, color, and object type (animals vs. machinery).

Pseudocode fragment (clipping-negatives trick applied before computing the reweighting):

reweight = 2 * neg / max(neg.max().abs(), neg.min().abs())
reweight = (beta * reweight) / reweight.mean()
Neg = max((-N * tau_plus * pos + reweight * neg).sum() / (1 - tau_plus), e ** (-1 / t))
hard_loss = -log(pos.sum() / (pos.sum() + Neg))



Table 3: Basic statistics for graph datasets.

Table 4: Basic statistics for graph datasets.

