CONTRASTIVE LEARNING WITH HARD NEGATIVE SAMPLES

Abstract

How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use true similarity information. In response, we develop a new family of unsupervised sampling methods for selecting hard negative samples where the user can control the hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead.

1. INTRODUCTION

Owing to their empirical success, contrastive learning methods (Chopra et al., 2005; Hadsell et al., 2006) have become one of the most popular self-supervised approaches for learning representations (Oord et al., 2018; Tian et al., 2019; Chen et al., 2020a). In computer vision, unsupervised contrastive learning methods have even outperformed supervised pre-training for object detection and segmentation tasks (Misra & Maaten, 2020; He et al., 2020). Contrastive learning relies on two key ingredients: notions of similar (positive) (x, x+) and dissimilar (negative) (x, x-) pairs of data points. The training objective, typically noise-contrastive estimation (Gutmann & Hyvärinen, 2010), guides the learned representation f to map positive pairs to nearby locations and negative pairs farther apart; other objectives have also been considered (Chen et al., 2020a). The success of these methods depends on the design of informative positive and negative pairs, which cannot exploit true similarity information since there is no supervision.

Much research effort has addressed sampling strategies for positive pairs, and this has been a key driver of recent progress in multi-view and contrastive learning (Blum & Mitchell, 1998; Xu et al., 2013; Bachman et al., 2019; Chen et al., 2020a; Tian et al., 2020). For image data, positive sampling strategies often apply transformations that preserve semantic content, e.g., jittering, random cropping, separating color channels, etc. (Chen et al., 2020a;c; Tian et al., 2019). Such transformations have also been effective in learning control policies from raw pixel data (Srinivas et al., 2020). Positive sampling techniques have also been proposed for sentence, audio, and video data (Logeswaran & Lee, 2018; Oord et al., 2018; Purushwalkam & Gupta, 2020; Sermanet et al., 2018). Surprisingly, the choice of negative pairs has drawn much less attention in contrastive learning.
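The noise-contrastive estimation objective referenced above can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions (the function name, cosine-similarity scoring, temperature value, and tensor shapes are ours, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.5):
    """Minimal InfoNCE-style contrastive objective (illustrative sketch).

    anchor, positive: (batch, dim) embeddings of a positive pair.
    negatives:        (batch, n_neg, dim) embeddings of negative samples.
    """
    # Score pairs by cosine similarity (normalize, then dot product).
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = torch.sum(anchor * positive, dim=-1) / temperature           # (batch,)
    neg_sim = torch.einsum('bd,bnd->bn', anchor, negatives) / temperature  # (batch, n_neg)

    # Cross-entropy with the positive pair as class 0: pulls positives
    # together and pushes negatives apart.
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

Minimizing this loss increases the anchor-positive similarity relative to the anchor-negative similarities, which is the "nearby / farther apart" behavior described above.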
Often, given an "anchor" point x, a "negative" x- is simply sampled uniformly from the training data, independent of how informative it may be for the learned representation. In supervised and metric learning settings, "hard" (true negative) examples can help guide a learning method to correct its mistakes more quickly (Schroff et al., 2015; Song et al., 2016). For representation learning, informative negative examples are intuitively those pairs that are mapped nearby but should be far apart. This idea is successfully applied in metric learning, where true pairs of dissimilar points are available, unlike in unsupervised contrastive learning.

Code available at: https://github.com/joshr17/HCL
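To make the notion of "hard" negatives concrete without supervision, one simple heuristic is to upweight negatives in proportion to their current similarity to the anchor, since nearby negatives are the ones that "should be far apart." The sketch below is an illustrative assumption in PyTorch, not the paper's sampling method; the function name and the softmax weighting with concentration parameter beta are ours:

```python
import torch
import torch.nn.functional as F

def hard_negative_weights(anchor, negatives, beta=1.0):
    """Weight negatives by their similarity to the anchor (illustrative).

    Harder negatives (closer to the anchor in embedding space) receive
    larger weights. beta controls how concentrated the weights are on the
    hardest negatives; beta=0 recovers uniform weighting.

    anchor:    (batch, dim) embeddings.
    negatives: (batch, n_neg, dim) candidate negative embeddings.
    Returns:   (batch, n_neg) weights summing to 1 per anchor.
    """
    anchor = F.normalize(anchor, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    sim = torch.einsum('bd,bnd->bn', anchor, negatives)  # cosine similarities
    return torch.softmax(beta * sim, dim=-1)             # emphasize hard negatives
```

Such weights could be used either to resample negatives or to reweight their terms in a contrastive loss; with beta as a user-controlled knob, this mirrors the "controllable hardness" idea from the abstract.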

