ARE ALL NEGATIVES CREATED EQUAL IN CONTRASTIVE INSTANCE DISCRIMINATION?

Abstract

Self-supervised learning has recently begun to rival supervised learning on computer vision tasks. Many of the recent approaches have been based on contrastive instance discrimination (CID), in which the network is trained to recognize two augmented versions of the same instance (a query and positive) while discriminating against a pool of other instances (negatives). The learned representation is then used on downstream tasks such as image classification. Using methodology from MoCo v2 (Chen et al., 2020c), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations. We found that a minority of negatives (the hardest 5%) were both necessary and sufficient for the downstream task to reach nearly full accuracy. Conversely, the easiest 95% of negatives were unnecessary and insufficient. Moreover, the very hardest 0.1% of negatives were unnecessary and sometimes detrimental. Finally, we studied the properties of negatives that affect their hardness, and found that hard negatives were more semantically similar to the query, and that some negatives were more consistently easy or hard than we would expect by chance. Together, our results indicate that negatives vary in importance and that CID may benefit from more intelligent negative treatment.

1. INTRODUCTION

In recent years, there has been tremendous progress on self-supervised learning (SSL), a paradigm in which representations are learned using a pre-training task that requires only unlabeled data. These representations are then used on downstream tasks, such as classification or object detection. Since SSL pre-training does not require labels, it can leverage unlabeled data, which can be abundant and cheaper to obtain than labeled data. In computer vision, representations learned from unlabeled data have historically underperformed representations learned directly from labeled data. Recently, however, newly proposed SSL methods such as MoCo (He et al., 2019; Chen et al., 2020c), SimCLR (Chen et al., 2020a;b), SwAV (Caron et al., 2020), and BYOL (Grill et al., 2020) have dramatically reduced this performance gap. The MoCo and SimCLR pre-training tasks use contrastive instance discrimination (CID), in which a network is trained to recognize different augmented views of the same image (sometimes called the query and the positive)¹ and to discriminate between the query and the augmented views of other random images from the dataset (called negatives). Despite the empirical successes of CID, the mechanisms underlying its strong performance remain unclear. Recent theoretical and empirical works have investigated the role of mutual information between augmentations (Tian et al., 2020), analyzed properties of the learned representations such as alignment and uniformity (Wang & Isola, 2020), and proposed a theoretical framework (Arora et al., 2019), among others. However, existing works on CID have not investigated the relative importance or semantic properties of different negatives, even though negatives play a central role in CID. In other areas, work on hard negative mining in metric learning (Kaya & Bilge, 2019) and on the impact of different training examples in supervised learning (Birodkar et al., 2019) suggests that understanding the relative importance of different training data can be fruitful.

¹In MoCo, these are called the query and positive and are treated slightly differently; in SimCLR, both are treated the same and are called positives. The other SSL methods listed (SwAV and BYOL) are not CID.
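As context for what follows, the CID objective can be sketched as an InfoNCE-style classification problem: the positive must be recognized against the pool of negatives, and a negative's hardness for a given query corresponds to its similarity to that query. The sketch below is illustrative only; the function names, toy embeddings, and temperature value are our own choices, not taken from the paper or the MoCo codebase.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for a single query.

    query:     (d,) embedding of one augmented view
    positive:  (d,) embedding of the other augmented view of the same image
    negatives: (K, d) embeddings of other instances
    All embeddings are assumed L2-normalized, as in MoCo.
    """
    # Similarity of the query to the positive and to each negative.
    logits = np.concatenate([[query @ positive], negatives @ query]) / temperature
    # Cross-entropy with the positive as the correct class (index 0).
    return -logits[0] + np.log(np.exp(logits).sum())

def rank_negatives_by_hardness(query, negatives):
    """Indices of negatives from hardest (most similar to the query) to
    easiest; difficulty percentiles can then be read off this ordering."""
    return np.argsort(-(negatives @ query))
```

Under this view, a hard negative contributes a large logit and therefore dominates the loss, which is why restricting training to particular difficulty ranges (as in the experiments of this paper) can change what the network learns.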

