CONTRASTIVE LEARNING WITH HARD NEGATIVE SAMPLES

Abstract

How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use true similarity information. In response, we develop a new family of unsupervised sampling methods for selecting hard negative samples where the user can control the hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead.

1. INTRODUCTION

Owing to their empirical success, contrastive learning methods (Chopra et al., 2005; Hadsell et al., 2006) have become one of the most popular self-supervised approaches for learning representations (Oord et al., 2018; Tian et al., 2019; Chen et al., 2020a). In computer vision, unsupervised contrastive learning methods have even outperformed supervised pre-training for object detection and segmentation tasks (Misra & Maaten, 2020; He et al., 2020).

Contrastive learning relies on two key ingredients: notions of similar (positive) (x, x+) and dissimilar (negative) (x, x-) pairs of data points. The training objective, typically noise-contrastive estimation (Gutmann & Hyvärinen, 2010), guides the learned representation f to map positive pairs to nearby locations and negative pairs farther apart; other objectives have also been considered (Chen et al., 2020a). The success of the associated methods depends on the design of informative positive and negative pairs, which cannot exploit true similarity information since there is no supervision.

Much research effort has addressed sampling strategies for positive pairs, and has been a key driver of recent progress in multi-view and contrastive learning (Blum & Mitchell, 1998; Xu et al., 2013; Bachman et al., 2019; Chen et al., 2020a; Tian et al., 2020). For image data, positive sampling strategies often apply transformations that preserve semantic content, e.g., jittering, random cropping, separating color channels, etc. (Chen et al., 2020a;c; Tian et al., 2019). Such transformations have also been effective in learning control policies from raw pixel data (Srinivas et al., 2020). Positive sampling techniques have also been proposed for sentence, audio, and video data (Logeswaran & Lee, 2018; Oord et al., 2018; Purushwalkam & Gupta, 2020; Sermanet et al., 2018). Surprisingly, the choice of negative pairs has drawn much less attention in contrastive learning.
Often, given an "anchor" point x, a "negative" x- is simply sampled uniformly from the training data, independent of how informative it may be for the learned representation. In supervised and metric learning settings, "hard" (true negative) examples can help guide a learning method to correct its mistakes more quickly (Schroff et al., 2015; Song et al., 2016). For representation learning, informative negative examples are intuitively those pairs that are mapped nearby but should be far apart. This idea has been applied successfully in metric learning, where true pairs of dissimilar points are available, unlike in unsupervised contrastive learning.

With this motivation, we address the challenge of selecting informative negatives for contrastive representation learning. We propose a solution that builds a tunable sampling distribution preferring negative pairs whose representations are currently very similar. This solution faces two challenges: (1) we do not have access to any true similarity or dissimilarity information; (2) we need an efficient sampling strategy for this tunable distribution. We overcome (1) by building on ideas from positive-unlabeled learning (Elkan & Noto, 2008; Du Plessis et al., 2014), and (2) by designing an efficient, easy-to-implement importance sampling technique that incurs no computational overhead. Our theoretical analysis shows that, as a function of the tuning parameter, the optimal representations for our new method place similar inputs in tight clusters, while spacing the clusters as far apart as possible. Empirically, our hard negative sampling strategy improves downstream task performance for image, graph, and text data, supporting that our negative examples are indeed more informative.

Contributions. In summary, we make the following contributions:

1. We propose a simple distribution over hard negative pairs for contrastive representation learning, and derive a practical importance sampling strategy with zero computational overhead that takes into account the lack of true dissimilarity information;
2. We theoretically analyze the hard negatives objective and optimal representations, showing that they capture desirable generalization properties;
3. We empirically observe that the proposed sampling method improves downstream task performance on image, graph, and text data.

Before moving on to the problem formulation and our results, we summarize related work below.
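To make the importance sampling idea concrete, the following is a minimal NumPy sketch of a contrastive loss whose negative term is reweighted toward hard negatives, with a positive-unlabeled correction for possible false negatives. This is an illustrative sketch, not the paper's reference implementation: the function and argument names are our own, `beta` controls hardness, `tau_plus` is an assumed class-prior estimate, and `t` is the temperature.

```python
import numpy as np

def hard_negative_contrastive_loss(z_anchor, z_pos, z_neg,
                                   beta=1.0, tau_plus=0.1, t=0.5):
    """Contrastive loss with negatives importance-weighted by how
    similar they currently are to the anchor (harder = heavier).

    z_anchor, z_pos: (B, d) embeddings; z_neg: (B, N, d) negatives.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    z_anchor, z_pos, z_neg = normalize(z_anchor), normalize(z_pos), normalize(z_neg)

    # exponentiated cosine similarities, scaled by temperature
    pos = np.exp(np.sum(z_anchor * z_pos, axis=-1) / t)            # (B,)
    neg = np.exp(np.einsum('bd,bnd->bn', z_anchor, z_neg) / t)     # (B, N)

    # importance weights exp(beta * log neg) = neg**beta:
    # beta = 0 recovers uniform weighting over negatives
    imp = neg ** beta
    reweighted_neg = (imp * neg).sum(axis=-1) / imp.mean(axis=-1)  # (B,)

    # positive-unlabeled correction: subtract the expected contribution
    # of false negatives, assuming class prior tau_plus
    N = z_neg.shape[1]
    Ng = (-tau_plus * N * pos + reweighted_neg) / (1.0 - tau_plus)
    Ng = np.maximum(Ng, N * np.exp(-1.0 / t))  # keep the estimate positive

    return float(np.mean(-np.log(pos / (pos + Ng))))
```

Because the reweighting only rescales similarities that a standard contrastive loss already computes, it adds no extra forward passes, which is the sense in which the overhead is zero.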

1.1. RELATED WORK

Contrastive Representation Learning. Various frameworks for contrastive learning of visual representations have been proposed, including SimCLR (Chen et al., 2020a;b), which uses augmented views of other items in a minibatch as negative samples, and MoCo (He et al., 2020; Chen et al., 2020c), which uses a momentum-updated memory bank of old negative representations to enable the use of very large batches of negative samples. Most contrastive methods are unsupervised, though some use label information (Sylvain et al., 2020; Khosla et al., 2020). Many works study the role of positive pairs and, e.g., propose to apply large perturbations to images (Chen et al., 2020a;c), or argue for minimizing the mutual information within positive pairs, apart from information relevant for the ultimate prediction task (Tian et al., 2020). Beyond visual data, contrastive methods have been developed for sentence embeddings (Logeswaran & Lee, 2018), sequential data (Oord et al., 2018; Hénaff et al., 2020), graph (Sun et al., 2020; Hassani & Khasahmadi, 2020; Li et al., 2019) and node representation learning (Velickovic et al., 2019), and learning representations from raw images for off-policy control (Srinivas et al., 2020). The role of negative pairs has been much less studied. Chuang et al. (2020) propose a method for "debiasing", i.e., correcting for the fact that not all negative pairs may be true negatives. Their method takes the viewpoint of positive-unlabeled learning and exploits a decomposition of the true negative distribution. Kalantidis et al. (2020) consider applying Mixup (Zhang et al., 2018) to generate hard negatives in latent space, and Jin et al. (2018) exploit the specific temporal structure of video to generate negatives for object detection.

Negative Mining in Deep Metric Learning. In contrast to the contrastive representation learning literature, selection strategies for negative samples have been thoroughly studied in (deep) metric learning (Schroff et al., 2015; Song et al., 2016; Harwood et al., 2017; Wu et al., 2017; Ge, 2018; Suh et al., 2019). Most of these works observe that it is helpful to use negative samples that are difficult for the current embedding to discriminate. Schroff et al. (2015) qualify this, observing that some examples are simply too hard, and propose selecting "semi-hard" negative samples. The well-known importance of negative samples in metric learning, where (partial) true dissimilarity information is available, raises the question of negative samples in contrastive learning, the subject of this paper.
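To illustrate the semi-hard criterion in the style of Schroff et al. (2015): a semi-hard negative is farther from the anchor than the positive, yet still close enough to produce a useful gradient. The sketch below, with names and margin value of our own choosing, selects such a negative from a candidate pool and falls back to the hardest candidate if none qualifies.

```python
import numpy as np

def semi_hard_negative(anchor, positive, candidates, margin=0.2):
    """Pick a semi-hard negative from `candidates` (shape (N, d)):
    one satisfying d(a, p) < d(a, n) < d(a, p) + margin.
    Falls back to the overall hardest (closest) candidate if none exists.
    """
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(candidates - anchor, axis=1)

    # semi-hard: farther than the positive, but within the margin
    semi_hard = np.where((d_an > d_ap) & (d_an < d_ap + margin))[0]
    if semi_hard.size:
        # among qualifying candidates, take the hardest one
        return candidates[semi_hard[np.argmin(d_an[semi_hard])]]
    return candidates[np.argmin(d_an)]
```

Note that this criterion requires knowing which candidates are true negatives, which is exactly the supervision that is unavailable in the unsupervised contrastive setting addressed in this paper.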

2. CONTRASTIVE LEARNING SETUP

We begin with the setup and the idea of contrastive representation learning. We wish to learn an embedding f : X → S^{d-1}/t that maps an observation x to a point on a hypersphere S^{d-1}/t in R^d of radius 1/t, where t is the "temperature" scaling hyperparameter. Following the setup of Arora et al. (2019), we assume an underlying set of discrete latent classes C that represent semantic content, so that similar pairs (x, x+) have the same latent class. Denoting



Code available at: https://github.com/joshr17/HCL

