ARE ALL NEGATIVES CREATED EQUAL IN CONTRASTIVE INSTANCE DISCRIMINATION?

Abstract

Self-supervised learning has recently begun to rival supervised learning on computer vision tasks. Many of the recent approaches have been based on contrastive instance discrimination (CID), in which the network is trained to recognize two augmented versions of the same instance (a query and positive) while discriminating against a pool of other instances (negatives). The learned representation is then used on downstream tasks such as image classification. Using methodology from MoCo v2 (Chen et al., 2020c), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations. We found that a minority of negatives (the hardest 5%) were both necessary and sufficient for the downstream task to reach nearly full accuracy. Conversely, the easiest 95% of negatives were unnecessary and insufficient. Moreover, the very hardest 0.1% of negatives were unnecessary and sometimes detrimental. Finally, we studied the properties of negatives that affect their hardness, and found that hard negatives were more semantically similar to the query, and that some negatives were more consistently easy or hard than we would expect by chance. Together, our results indicate that negatives vary in importance and that CID may benefit from more intelligent treatment of negatives.

1. INTRODUCTION

In recent years, there has been tremendous progress on self-supervised learning (SSL), a paradigm in which representations are learned using a pre-training task that uses only unlabeled data. These representations are then used on downstream tasks, such as classification or object detection. Since SSL pre-training does not require labels, it can leverage unlabeled data, which can be abundant and cheaper to obtain than labeled data. In computer vision, representations learned from unlabeled data have historically underperformed representations learned directly from labeled data. Recently, however, newly proposed SSL methods such as MoCo (He et al., 2019; Chen et al., 2020c), SimCLR (Chen et al., 2020a;b), SwAV (Caron et al., 2020), and BYOL (Grill et al., 2020) have dramatically reduced this performance gap. The MoCo and SimCLR pre-training tasks use contrastive instance discrimination (CID), in which a network is trained to recognize different augmented views of the same image (sometimes called the query and the positive) and to discriminate between the query and augmented views of other random images from the dataset (called negatives).

In this work, we empirically investigate how the difficulty of negatives affects the downstream performance of the learned representation. We measure difficulty as the dot product between the normalized contrastive-space embeddings of the query and the negative: a dot product closer to 1 indicates a negative that is harder to distinguish from the query. We ask how negatives of different difficulty affect training. Are some negatives more important than others for downstream accuracy? If so, which ones? To what extent? And what makes them different? We focus on MoCo v2 (Chen et al., 2020c) and the downstream task of linear classification on ImageNet (Deng et al., 2009), and report similar results for SimCLR in Appendix A.2.
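The difficulty measure above is straightforward to compute. The sketch below illustrates it with randomly generated embeddings (the embedding dimension of 128 matches MoCo v2, but the pool size and all values here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative embeddings: one query and a pool of negatives,
# each a contrastive-space feature vector (dim 128, as in MoCo v2).
query = rng.standard_normal(128)
negatives = rng.standard_normal((1000, 128))

# L2-normalize so that the dot product equals cosine similarity.
query = query / np.linalg.norm(query)
negatives = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)

# Difficulty of each negative for this query: dot product with the query.
# Values closer to 1 indicate negatives harder to distinguish from the query.
difficulty = negatives @ query

# Rank negatives from hardest to easiest.
order = np.argsort(difficulty)[::-1]
hardest, easiest = order[0], order[-1]
```

Because both vectors are unit-normalized, each difficulty score lies in [-1, 1], which makes thresholds such as "the hardest 5%" well defined per query.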
We make the following contributions (see Figure 1 for a summary):

• The easiest 95% of negatives are unnecessary and insufficient, while the hardest 5% are necessary and sufficient: We reached within 0.7 percentage points of full accuracy by training on only the 5% hardest negatives for each query, suggesting that the easiest 95% of negatives are unnecessary. In contrast, the easiest negatives are insufficient (and, therefore, the hardest negatives are necessary): accuracy drops substantially when training on only the easiest 95% of negatives. The hardest 5% of negatives are especially important: training on only the next hardest 5% lowers accuracy by 15 percentage points.

• The hardest 0.1% of negatives are unnecessary and sometimes detrimental: Downstream accuracy was unchanged or improved when we removed these hardest negatives. These negatives were more often in the same ImageNet class as the query, compared to easier negatives, suggesting that semantically identical (but superficially dissimilar) negatives were unhelpful or detrimental.

• Properties of negatives: Given our observation that the importance of a negative varies with its difficulty, we investigate the properties of negatives that affect their difficulty.
  - We found that hard negatives were more semantically similar to the query than easy negatives were: the hardest 5% of negatives were more likely to be of the same ImageNet class as the query, compared to easier negatives. These hard negatives were also closer to the query as measured by the depth of the least common ancestor of the negative and the query in the WordNet tree (on which ImageNet is built).
  - We also observed that this pattern is reversed for the ≈50% easiest negatives: there, the easier the negative, the more semantically similar it is to the query.
  - Some negatives are more consistently easy or hard across queries than would be expected by random chance.
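To make the "train on only the hardest 5%" condition concrete, the sketch below restricts the InfoNCE denominator to the retained negatives for a single query. This is an illustrative reimplementation under our own assumptions, not the authors' released code; the function name, `keep_frac` parameter, and the use of a plain NumPy pool are all ours (the temperature 0.2 and queue size 65536 are MoCo v2's defaults):

```python
import numpy as np

def info_nce_filtered(q, pos, neg_pool, keep_frac=0.05, tau=0.2):
    """InfoNCE loss for one query, keeping only the hardest `keep_frac`
    fraction of negatives. Inputs are assumed L2-normalized; `tau` is
    the softmax temperature (0.2 in MoCo v2). Illustrative sketch only."""
    sims = neg_pool @ q                       # difficulty of each negative
    k = max(1, int(keep_frac * len(sims)))
    hard = np.sort(sims)[::-1][:k]            # top-k hardest similarities
    logits = np.concatenate(([pos @ q], hard)) / tau
    logits -= logits.max()                    # numerical stability
    # Cross-entropy with the positive as the correct "class" (index 0).
    return -logits[0] + np.log(np.exp(logits).sum())

rng = np.random.default_rng(0)
norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
q = norm(rng.standard_normal(128))
pos = norm(rng.standard_normal(128))
negs = norm(rng.standard_normal((65536, 128)))  # MoCo v2 queue size
loss = info_nce_filtered(q, pos, negs)
```

The other conditions in the paper (easiest 95%, removing the hardest 0.1%) correspond to choosing a different slice of the sorted similarities before forming the logits.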



In MoCo, these two views are called the query and the positive and are treated slightly differently; in SimCLR, both views are treated identically and are both called positives. The other SSL methods listed (SwAV and BYOL) are not CID methods.



Figure 1: Schematic summary of main results. Easy negatives are unnecessary and insufficient (green) and are more often dissimilar to the query (i.e., in unrelated ImageNet classes; light blue) compared to harder negatives. Hard (but not the very hardest) negatives are necessary and sufficient (orange) and are more often semantically similar to the query compared to easier negatives. The very hardest negatives are unnecessary and sometimes detrimental (red) and are more often in the same class as the query, compared to easier negatives. This is an illustrative schematic; images and trees are not from ImageNet.

