EQCO: EQUIVALENT RULES FOR SELF-SUPERVISED CONTRASTIVE LEARNING

Abstract

In this paper, we propose a method, named EqCo (Equivalent Rules for Contrastive Learning), to make self-supervised learning insensitive to the number of negative samples in the contrastive learning framework. Inspired by the InfoMax principle, we point out that the margin term in the contrastive loss needs to be adaptively scaled according to the number of negative pairs in order to keep a steady mutual information bound and gradient magnitude. EqCo bridges the performance gap across a wide range of negative sample sizes, so that we can use only a few negative pairs (e.g. 16 per query) to perform self-supervised contrastive training on large-scale vision datasets like ImageNet, with almost no accuracy drop. This stands in contrast to the widely used large-batch training or memory bank mechanisms in current practice. Equipped with EqCo, our simplified MoCo (SiMo) achieves accuracy comparable to MoCo v2 on ImageNet (linear evaluation protocol) while involving only 16 negative pairs per query instead of 65536, suggesting that large quantities of negative samples might not be a critical factor in contrastive learning frameworks.

1. INTRODUCTION AND BACKGROUND

Self-supervised learning has recently received much attention in the field of visual representation learning (Hadsell et al. (2006); Dosovitskiy et al. (2014); Oord et al. (2018); Bachman et al. (2019); Hénaff et al. (2019); Wu et al. (2018); Tian et al. (2019); He et al. (2020); Misra & Maaten (2020); Grill et al. (2020); Cao et al. (2020); Tian et al. (2020)), owing to its potential to learn universal representations from unlabeled data. Among various self-supervised methods, one of the most promising research paths is contrastive learning (Oord et al. (2018)), which has been demonstrated to achieve comparable or even better performance than supervised training on many downstream tasks such as image classification, object detection, and semantic segmentation (Chen et al., 2020c; He et al., 2020; Chen et al., 2020a;b). The core idea of contrastive learning is briefly summarized as follows: first, extract a pair of embedding vectors (q(I), k(I)) (named query and key respectively) from the two augmented views of each instance I; then, learn to maximize the similarity of each positive pair (q(I), k(I)) while pushing the negative pairs (q(I), k(I′)) (i.e., query and key extracted from different instances) away from each other. To learn the representation, an InfoNCE loss (Oord et al. (2018); Wu et al. (2018)) is conventionally employed in the following formulation (slightly modified with an additional margin term):

$$\mathcal{L}_{NCE} = \mathbb{E}_{q\sim\mathcal{D},\,k_0\sim\mathcal{D}'(q),\,\{k_i\}\sim\mathcal{D}'}\left[-\log\frac{e^{(q\cdot k_0-m)/\tau}}{e^{(q\cdot k_0-m)/\tau}+\sum_{i=1}^{K}e^{q\cdot k_i/\tau}}\right],\tag{1}$$

where q and k_i (i = 0, ..., K) stand for the query and keys sampled from the two (augmented) data distributions D and D′ respectively. Specifically, k_0 is associated with the same instance as q while the other k_i are not; hence we name k_0 the positive sample and k_i (i > 0) the negative samples in the remaining text, where K is the number of negative samples (or pairs) for each query. The temperature τ and the margin m are hyper-parameters. In most previous works, m is trivially set to zero (e.g. Oord et al. (2018); He et al. (2020); Chen et al. (2020a); Tian et al. (2020)) or to some handcrafted value (e.g. Xie et al. (2020)). In the following text, we mainly study contrastive learning frameworks with the InfoNCE loss as in Eq. 1 unless otherwise specified.¹

In contrastive learning research, it has been widely believed that enlarging the number of negative samples K boosts the performance (Hénaff et al. (2019); Tian et al. (2019); Bachman et al. (2019)). For example, in MoCo (He et al. (2020)) the ImageNet accuracy rises from 54.7% to 60.6% under the linear classification protocol when K grows from 256 to 65536. Such observations have further driven a line of studies on how to optimize effectively over a large number of negative pairs, such as memory bank methods (Wu et al. (2018); He et al. (2020)) and large batch training (Chen et al. (2020a)), both of which empirically report superior performance when K becomes large. Analogously, in the field of supervised metric learning (Deng et al. (2019); Wang et al. (2018); Sun et al. (2020); Wang et al. (2020)), a loss of similar form to Eq. 1 is often applied over many negative pairs for hard negative mining. Besides, a few theoretical studies also support this viewpoint. For instance, Oord et al. (2018) point out that the mutual information between the positive pair tends to increase with the number of negative pairs K; Wang & Isola (2020) find that the negative pairs encourage features' uniformity on the hypersphere; Chuang et al. (2020) suggest that a large K leads to a more precise estimate of the debiased contrastive loss; etc.

Despite the above empirical and theoretical evidence, however, we point out that the case for using many negative pairs is still less than convincing. First, unlike the metric learning setting mentioned above, in self-supervised learning the negative terms k_i in Eq. 1 include both "true negative" (whose underlying class label differs from the query's, similarly hereinafter) and "false negative" samples, since the actual ground-truth labels are not available. Intuitively, a large K should therefore not always be beneficial, because the risk of drawing false negative samples also increases (known as the class collision problem). Arora et al. (2019) thus theoretically conclude that a large number of negative samples does not necessarily help. Second, some recent works have shown that by introducing new architectures (e.g., a predictor network in BYOL (Grill et al., 2020)) or designing new loss functions (e.g., Caron et al. (2020a); Ermolov et al. (2020)), state-of-the-art performance can still be obtained even without any explicit negative pairs. In conclusion, it is still an open question whether large quantities of negative samples are essential to contrastive learning.

In light of the above two aspects, we raise a question: is a large K really essential in the contrastive learning framework? We propose to rethink the question from a different view: note that in Eq. 1 there are three hyper-parameters: the number of negative samples K, the temperature τ, and the margin m. In most previous empirical studies (He et al. (2020); Chen et al. (2020a)), only K is changed while τ and m are kept constant. Do the optimal choices of τ and m vary with K? If so, the performance gains observed under larger K may be misinterpreted: merely the product of suboptimal hyper-parameter choices for small K, rather than anything essential.

In this paper, we investigate the relationship among the three hyper-parameters and suggest an equivalent rule: m = τ log(α/K), where α is a constant. We find that if the margin m is adaptively adjusted according to this rule, the performance of contrastive learning becomes insensitive to the size of K over a very large range (e.g. K ≥ 16). For example, in the MoCo framework, introducing EqCo makes the performance gap between K = 256 and K = 65536 (the best configuration reported in He et al. (2020)) almost disappear (from a 6.1% drop to 0.2%). We call this method "Equivalent Rules for Contrastive learning" (EqCo). For completeness, as the other part of EqCo we point out that adjusting the learning rate according to the conventional linear scaling rule maintains the equivalence across different numbers of queries per batch. Theoretically, following the InfoMax principle (Linsker (1988)) and the derivation in CPC (Oord et al. (2018)), we prove that in EqCo the lower bound on the mutual information keeps steady under various numbers of negative samples K. Moreover, from the back-propagation perspective, we further prove that under such a configuration the upper bound of the gradient norm is also free of K.

¹ Recently, some self-supervised learning algorithms have achieved new state-of-the-art results using frameworks other than the conventional InfoNCE loss as in Eq. 1, e.g. the mean teacher in BYOL (Grill et al. (2020)) and online clustering in SwAV (Caron et al. (2020b)). We will investigate them in the future.
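As a concrete numeric illustration of the margin loss in Eq. 1 and of the rule m = τ log(α/K), the following is a minimal sketch in pure Python. The function names, the similarity values, τ = 0.1, and α = 256 are our illustrative assumptions, not the paper's training configuration:

```python
import math
import random

def info_nce_margin(pos_logit, neg_logits, tau=0.1, m=0.0):
    """InfoNCE loss with a margin m on the positive logit, as in Eq. 1,
    for a single query; logits are the dot products q.k."""
    pos = math.exp((pos_logit - m) / tau)
    neg = sum(math.exp(z / tau) for z in neg_logits)
    return -math.log(pos / (pos + neg))

def eqco_margin(K, tau=0.1, alpha=256):
    """EqCo rule m = tau * log(alpha / K): the loss then behaves as if
    there were always alpha negatives, whatever the actual K is."""
    return tau * math.log(alpha / K)

# With the EqCo margin, the loss stays nearly constant as K varies:
random.seed(0)
tau, pos_sim = 0.1, 0.9  # illustrative similarity of the positive pair
for K in (16, 256, 65536):
    negs = [random.uniform(-0.2, 0.2) for _ in range(K)]
    loss = info_nce_margin(pos_sim, negs, tau, m=eqco_margin(K, tau))
    print(f"K={K:6d}  m={eqco_margin(K, tau):+.3f}  loss={loss:.4f}")
```

Repeating the loop with m = 0 for all K instead makes the loss grow with K (approaching log K for large K), which is exactly the gap the adaptive margin removes; note also that m vanishes at K = α and turns negative beyond it.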


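To make the role of the margin in Eq. 1 concrete, note that m can be folded into the negative term of the softmax. A short algebraic sketch (the bound at the end follows the CPC derivation of Oord et al. (2018)):

```latex
% The margin on the positive logit is an overall rescaling of the negatives:
-\log\frac{e^{(q\cdot k_0-m)/\tau}}{e^{(q\cdot k_0-m)/\tau}+\sum_{i=1}^{K}e^{q\cdot k_i/\tau}}
  \;=\; -\log\frac{e^{q\cdot k_0/\tau}}{e^{q\cdot k_0/\tau}+e^{m/\tau}\sum_{i=1}^{K}e^{q\cdot k_i/\tau}}.
% Substituting the EqCo rule m = \tau\log(\alpha/K) gives e^{m/\tau} = \alpha/K,
% so the negative term becomes
%   \alpha\cdot\tfrac{1}{K}\sum_{i=1}^{K}e^{q\cdot k_i/\tau}
%   \;\approx\; \alpha\,\mathbb{E}_{k\sim\mathcal{D}'}\!\big[e^{q\cdot k/\tau}\big],
% i.e. the loss behaves as if there were exactly \alpha negatives for any K, and the
% InfoNCE bound I(q;k_0) \ge \log\alpha - \mathcal{L} replaces the usual
% I(q;k_0) \ge \log K - \mathcal{L}, independent of K.
```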