EQCO: EQUIVALENT RULES FOR SELF-SUPERVISED CONTRASTIVE LEARNING

Abstract

In this paper, we propose a method, named EqCo (Equivalent Rules for Contrastive Learning), to make self-supervised learning insensitive to the number of negative samples in the contrastive learning framework. Inspired by the InfoMax principle, we point out that the margin term in the contrastive loss needs to be adaptively scaled according to the number of negative pairs in order to keep the mutual information bound and the gradient magnitude steady. EqCo bridges the performance gap across a wide range of negative sample sizes, so that we can use only a few negative pairs (e.g. 16 per query) to perform self-supervised contrastive training on large-scale vision datasets such as ImageNet, with almost no accuracy drop. This stands in sharp contrast to the large-batch training or memory-bank mechanisms widely used in current practice. Equipped with EqCo, our simplified MoCo (SiMo) achieves accuracy comparable to MoCo v2 on ImageNet (linear evaluation protocol) while involving only 16 negative pairs per query instead of 65536, suggesting that a large quantity of negative samples may not be a critical factor in contrastive learning frameworks.

1. INTRODUCTION AND BACKGROUND

Self-supervised learning has recently received much attention in the field of visual representation learning (Hadsell et al., 2006; Dosovitskiy et al., 2014; Oord et al., 2018; Bachman et al., 2019; Hénaff et al., 2019; Wu et al., 2018; Tian et al., 2019; He et al., 2020; Misra & Maaten, 2020; Grill et al., 2020; Cao et al., 2020; Tian et al., 2020), owing to its potential to learn universal representations from unlabeled data. Among various self-supervised methods, one of the most promising research directions is contrastive learning (Oord et al., 2018), which has been demonstrated to achieve comparable or even better performance than supervised training on many downstream tasks such as image classification, object detection, and semantic segmentation (Chen et al., 2020c; He et al., 2020; Chen et al., 2020a;b). The core idea of contrastive learning can be briefly summarized as follows: first, extract a pair of embedding vectors (q(I), k(I)) (named query and key respectively) from two augmented views of each instance I; then, learn to maximize the similarity of each positive pair (q(I), k(I)) while pushing negative pairs (q(I), k(I')), i.e., a query and a key extracted from different instances, away from each other. To learn the representation, an InfoNCE loss (Oord et al., 2018; Wu et al., 2018) is conventionally employed in the following formulation (slightly modified with an additional margin term):

$$\mathcal{L}_{NCE} = \mathbb{E}_{q \sim \mathcal{D},\, k_0 \sim \mathcal{D}'(q),\, k_i \sim \mathcal{D}'}\left[ -\log \frac{e^{(q^\top k_0 - m)/\tau}}{e^{(q^\top k_0 - m)/\tau} + \sum_{i=1}^{K} e^{q^\top k_i/\tau}} \right],$$

where q and k_i (i = 0, ..., K) stand for the query and keys sampled from the two (augmented) data distributions D and D' respectively. Specifically, k_0 is associated with the same instance as q while the other k_i are not; hence we name k_0 the positive sample and k_i (i > 0) the negative samples in the remaining text, where K is the number of negative samples (or pairs) per query. The temperature τ and the margin m are hyper-parameters. In most previous works, m is trivially set to zero (e.g. Oord et al. (2018); He et al. (2020); Chen et al. (2020a); Tian et al. (2020)) or some
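The margin-augmented InfoNCE loss above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation for a single query; the function name and argument names are our own, not from the paper, and embeddings are assumed to be L2-normalized as is common in contrastive frameworks:

```python
import numpy as np

def info_nce_loss(q, k_pos, k_negs, tau=0.07, m=0.0):
    """InfoNCE loss with an additive margin m on the positive logit.

    q:      (d,) query embedding (assumed L2-normalized)
    k_pos:  (d,) positive key embedding k_0
    k_negs: (K, d) negative key embeddings k_1 ... k_K
    tau:    temperature hyper-parameter
    m:      margin subtracted from the positive similarity
    """
    pos_logit = (q @ k_pos - m) / tau      # (q^T k_0 - m) / tau
    neg_logits = (k_negs @ q) / tau        # q^T k_i / tau, i = 1..K
    logits = np.concatenate([[pos_logit], neg_logits])
    # -log softmax at the positive index, computed stably via log-sum-exp
    return float(np.logaddexp.reduce(logits) - pos_logit)
```

Note that with m = 0 this reduces to the standard InfoNCE loss; a positive margin m > 0 shrinks the positive logit, so the loss (and its gradient) stays non-trivial even when the positive pair is already well separated from the K negatives.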

