CONDITIONAL NEGATIVE SAMPLING FOR CONTRASTIVE LEARNING OF VISUAL REPRESENTATIONS

Abstract

Recent methods for learning unsupervised visual representations, dubbed contrastive learning, optimize the noise-contrastive estimation (NCE) bound on mutual information between two transformations of an image. NCE typically uses randomly sampled negative examples to normalize the objective, but this may often include many uninformative examples that are either too easy or too hard to discriminate. Taking inspiration from metric learning, we show that choosing semi-hard negatives can yield stronger contrastive representations. To do this, we introduce a family of mutual information estimators that sample negatives conditionally -- in a "ring" around each positive. We prove that these estimators remain lower bounds on mutual information, with higher bias but lower variance than NCE. Experimentally, we find that our approach, applied on top of existing models (IR, CMC, and MoCo), improves accuracy by 2-5% absolute points in each case, measured by linear evaluation on four standard image benchmarks. Moreover, we find continued benefits when transferring features to a variety of new image distributions from the Meta-Dataset collection and to a variety of downstream tasks such as object detection, instance segmentation, and key-point detection.

1. INTRODUCTION

Supervised learning has given rise to human-level performance in several visual tasks (Russakovsky et al., 2015; He et al., 2017), relying heavily on large image datasets paired with semantic annotations. These annotations vary in difficulty and cost, spanning from simple class labels to more granular descriptions like bounding boxes and key-points. As it is impractical to scale high-quality annotations, this reliance on supervision poses a barrier to widespread adoption. While supervised pretraining is still the dominant approach in computer vision, recent studies using unsupervised "contrastive" objectives have achieved remarkable results in the last two years, closing the gap to supervised baselines (Wu et al., 2018; Oord et al., 2018; Hjelm et al., 2018; Zhuang et al., 2019; Hénaff et al., 2019; Misra & Maaten, 2020; He et al., 2019; Chen et al., 2020a;b; Grill et al., 2020). Many contrastive algorithms are estimators of mutual information (Oord et al., 2018; Hjelm et al., 2018; Bachman et al., 2019), capturing the intuition that a good low-dimensional "representation" is one that linearizes the useful information embedded within a high-dimensional data point. In vision, these estimators maximize the similarity of encodings for two augmentations of the same image. This is trivial (e.g. assign all image pairs maximum similarity) unless the similarity function is normalized, which is typically done by comparing an image to "negative examples" to which a model must assign low similarity.

We hypothesize that how we choose these negatives greatly impacts representation quality. With harder negatives, the encoder is encouraged to capture more granular information that may improve performance on downstream tasks. While research in contrastive learning has explored architectures, augmentations, and pretext tasks, little attention has been given to the negative sampling procedure.
Meanwhile, there is a rich body of work in deep metric learning showing that semi-hard negative mining improves the efficacy of triplet losses. Inspired by this, we hope to bring harder negative sampling to modern contrastive learning. Naively choosing difficult negatives may yield an objective that no longer bounds mutual information, removing a theoretical connection that is core to contrastive learning and has been shown to be important for downstream performance (Tian et al., 2020). In this paper, we present a new estimator of mutual information based on the popular noise-contrastive estimator (NCE) that supports sampling negatives from conditional distributions. We summarize our contributions below:

1. We prove that our Conditional-NCE (CNCE) objective lower bounds mutual information. Further, we show that although CNCE is a looser bound than NCE, it has lower variance. This motivates its value for representation learning.
2. We use CNCE to generalize contrastive algorithms that utilize a memory structure, like IR, CMC, and MoCo, to sample semi-hard negatives in just a few lines of code with minimal compute overhead.
3. We find that the naive strategy of sampling hard negatives throughout training can be detrimental. We then show that slowly introducing harder negatives yields good performance.
4. On four image classification benchmarks, we find improvements of 2-5% absolute points. We also find consistent improvements (1) when transferring features to new image datasets and (2) in object detection, instance segmentation, and key-point detection.
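To make the "ring" idea in contribution 2 concrete, a percentile-based semi-hard sampler can indeed be written in a few lines. The following is a minimal NumPy sketch, not the paper's exact implementation; the function name, percentile thresholds, and interface are illustrative assumptions.

```python
import numpy as np

def ring_negatives(anchor, bank, k, low_pct=50.0, high_pct=95.0, rng=None):
    """Sample k negatives whose similarity to the anchor lies in a "ring":
    harder than the low percentile but easier than the high percentile,
    which skips both trivially easy and excessively hard negatives.

    anchor: (D,) L2-normalized embedding.
    bank:   (N, D) cached L2-normalized embeddings (e.g. a memory bank).
    """
    if rng is None:
        rng = np.random.default_rng()
    sims = bank @ anchor                              # cosine similarities, shape (N,)
    lo, hi = np.percentile(sims, [low_pct, high_pct])
    ring = np.nonzero((sims >= lo) & (sims <= hi))[0]  # candidate indices in the ring
    # Fall back to sampling with replacement if the ring has fewer than k entries.
    idx = rng.choice(ring, size=k, replace=len(ring) < k)
    return bank[idx]
```

Because the anchor's own cached embedding has similarity near 1 to itself, the upper percentile cutoff also excludes it from the candidate set in the common case.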

2. BACKGROUND

We build on noise-contrastive estimation (NCE), the lower bound on mutual information underlying most contrastive objectives (Tian et al., 2020; Wu et al., 2020). To review, recall:

$$I(X;Y) \geq I_{\text{NCE}}(X;Y) = \mathbb{E}_{x_i, y_i \sim p(x,y)}\, \mathbb{E}_{y_{1:k} \sim p(y)} \left[ \log \frac{e^{f_\theta(x_i, y_i)}}{\frac{1}{k+1} \sum_{j \in \{i, 1:k\}} e^{f_\theta(x_i, y_j)}} \right] \quad (1)$$

where $x, y$ are realizations of two random variables, $X$ and $Y$, and $f_\theta : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a similarity function. We call $y_{1:k} = \{y_1, \ldots, y_k\}$ negative examples, being other realizations of $Y$.

Suppose the two random variables in Eq. 1 are both transformations of a common random variable $X$. Let $\mathcal{T}$ be a family of transformations where each member $t$ is a composition of cropping, color jittering, and Gaussian blurring, among others (Wu et al., 2018; Bachman et al., 2019; Chen et al., 2020a). We call a transformed input $t(x)$ a "view" of $x$. Let $p(t)$ denote a distribution over $\mathcal{T}$, a common choice being uniform. Next, introduce an encoder $g_\theta : \mathcal{X} \to \mathbb{S}^{n-1}$ that maps an example to an $L_2$-normalized representation. Suppose we have a dataset $\mathcal{D} = \{x_i\}_{i=1}^{n}$ of $n$ values of $X$ sampled from a distribution $p(x)$. Then, the contrastive objective for the $i$-th example is:

$$\mathcal{L}(x_i) = \mathbb{E}_{t, t', t_{1:k} \sim p(t)}\, \mathbb{E}_{x_{1:k} \sim p(x)} \left[ \log \frac{e^{g_\theta(t(x_i))^\top g_\theta(t'(x_i))/\tau}}{\frac{1}{k+1} \sum_{j \in \{i, 1:k\}} e^{g_\theta(t(x_i))^\top g_\theta(t_j(x_j))/\tau}} \right] \quad (2)$$

where $\tau$ is a temperature. The equivalence of Eq. 2 to NCE is immediate given $f_\theta(x, y) = g_\theta(x)^\top g_\theta(y)/\tau$.
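A batched version of the objective in Eq. 2 can be sketched in PyTorch. This is a minimal illustration, not the paper's code; note that `cross_entropy` drops the constant $\log(k+1)$ from the $\frac{1}{k+1}$ normalizer, which does not affect gradients.

```python
import torch
import torch.nn.functional as F

def nce_loss(anchors, positives, negatives, tau=0.07):
    """InfoNCE-style loss for L2-normalized embeddings.

    anchors:   (B, D) embeddings of one view, g(t(x_i))
    positives: (B, D) embeddings of a second view, g(t'(x_i))
    negatives: (B, k, D) embeddings of k negative views per anchor
    """
    pos = (anchors * positives).sum(dim=-1, keepdim=True) / tau   # (B, 1)
    neg = torch.einsum('bd,bkd->bk', anchors, negatives) / tau    # (B, k)
    logits = torch.cat([pos, neg], dim=1)                         # (B, 1+k)
    # Cross-entropy with target index 0 computes -log of the softmax
    # weight on the positive pair, i.e. the negation of Eq. 2 up to
    # the additive constant log(k+1).
    targets = torch.zeros(len(anchors), dtype=torch.long)
    return F.cross_entropy(logits, targets)
```

Minimizing this loss maximizes the bound in Eq. 2: a perfectly aligned positive pair with dissimilar negatives drives the loss toward zero.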



Maximizing Eq. 2 chooses an embedding that pulls two views of the same example together while pushing views of distinct examples apart. A drawback of this framework is that the number of negatives k must be large to faithfully approximate the true partition function; in practice, k is limited by memory. Recent innovations have focused on tackling this challenge. Instance Discrimination (Wu et al., 2018), or IR, introduces a memory bank M of n entries that caches an embedding of each example throughout training. Since each example is observed once per epoch, the i-th entry of the memory bank stores the embedding of the view of the i-th example observed in the previous epoch. Representations stored in the memory bank are removed from the automatic differentiation tape, but in return we can choose a large k by querying M. A follow-up work, Contrastive Multiview Coding (Tian et al., 2019), or CMC, decomposes an image into two color modalities and sums two IR losses in which the memory banks for the two modalities are swapped. Momentum Contrast (He et al., 2019), or MoCo, observed that the representations stored in the memory bank grow stale, since possibly thousands of optimization steps pass before an entry is updated again. So MoCo makes two important changes. First, it replaces the memory bank with a
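The IR-style memory bank described above can be sketched as follows. The class name, random initialization, and direct-overwrite update rule are illustrative assumptions for exposition; actual implementations may, e.g., mix old and new entries.

```python
import numpy as np

class MemoryBank:
    """Caches one embedding per dataset example (IR-style).

    Entries are stored as plain arrays, detached from autodiff, so
    querying k negatives never grows the computation graph.
    """
    def __init__(self, num_examples, dim, seed=0):
        self.rng = np.random.default_rng(seed)
        bank = self.rng.standard_normal((num_examples, dim))
        # Initialize with random L2-normalized vectors.
        self.bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)

    def update(self, indices, embeddings):
        # Overwrite the cached entry for each example seen this step;
        # entry i always holds the most recently observed view of x_i.
        self.bank[indices] = embeddings

    def sample_negatives(self, anchor_idx, k):
        # Draw k random entries uniformly, excluding the anchor's own slot.
        choices = self.rng.choice(len(self.bank) - 1, size=k, replace=True)
        choices[choices >= anchor_idx] += 1
        return self.bank[choices]
```

Sampling from the bank is what the conditional (ring) strategy replaces: instead of uniform `choice`, negatives would be drawn from entries whose similarity to the anchor falls in a chosen range.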

