EQCO: EQUIVALENT RULES FOR SELF-SUPERVISED CONTRASTIVE LEARNING

Abstract

In this paper, we propose a method, named EqCo (Equivalent Rules for Contrastive Learning), to make self-supervised learning independent of the number of negative samples in the contrastive learning framework. Inspired by the InfoMax principle, we point out that the margin term in the contrastive loss needs to be adaptively scaled according to the number of negative pairs in order to keep the mutual information bound and the gradient magnitude steady. EqCo bridges the performance gap among a wide range of negative sample sizes, so that we can use only a few negative pairs (e.g. 16 per query) to perform self-supervised contrastive training on large-scale vision datasets like ImageNet with almost no accuracy drop. This is in sharp contrast to the widely used large batch training or memory bank mechanisms in current practice. Equipped with EqCo, our simplified MoCo (SiMo) achieves accuracy comparable to MoCo v2 on ImageNet (linear evaluation protocol) while involving only 16 negative pairs per query instead of 65536, suggesting that large quantities of negative samples might not be a critical factor in contrastive learning frameworks.

1. INTRODUCTION AND BACKGROUND

Self-supervised learning has recently received much attention in the field of visual representation learning (Hadsell et al. (2006); Dosovitskiy et al. (2014); Oord et al. (2018); Bachman et al. (2019); Hénaff et al. (2019); Wu et al. (2018); Tian et al. (2019); He et al. (2020); Misra & Maaten (2020); Grill et al. (2020); Cao et al. (2020); Tian et al. (2020)), owing to its potential to learn universal representations from unlabeled data. Among various self-supervised methods, one of the most promising research paths is contrastive learning (Oord et al. (2018)), which has been demonstrated to achieve comparable or even better performance than supervised training on many downstream tasks such as image classification, object detection, and semantic segmentation (Chen et al., 2020c; He et al., 2020; Chen et al., 2020a; b).

The core idea of contrastive learning can be summarized as follows: first, extract a pair of embedding vectors $(q(I), k(I))$ (named query and key respectively) from two augmented views of each instance $I$; then, learn to maximize the similarity of each positive pair $(q(I), k(I))$ while pushing the negative pairs $(q(I), k(I'))$ (i.e., query and key extracted from different instances) away from each other. To learn the representation, an InfoNCE loss (Oord et al. (2018); Wu et al. (2018)) is conventionally employed in the following formulation (slightly modified with an additional margin term):

$$\mathcal{L}_{NCE} = \mathbb{E}_{q \sim D,\, k_0 \sim D'(q),\, k_i \sim D'} \left[ -\log \frac{e^{(q^\top k_0 - m)/\tau}}{e^{(q^\top k_0 - m)/\tau} + \sum_{i=1}^{K} e^{q^\top k_i/\tau}} \right], \tag{1}$$

where q and $k_i$ ($i = 0, \dots, K$) stand for the query and keys sampled from the two (augmented) data distributions $D$ and $D'$ respectively. Specifically, $k_0$ is associated with the same instance as q while the other $k_i$ are not; hence we call $k_0$ the positive sample and $k_i$ ($i > 0$) the negative samples in the remaining text, where K is the number of negative samples (or pairs) for each query. 
The temperature τ and the margin m are hyper-parameters. In most previous works, m is trivially set to zero (e.g. Oord et al. (2018); He et al. (2020); Chen et al. (2020a); Tian et al. (2020)) or to some handcrafted value (e.g. Xie et al. (2020)). In the following text, we mainly study contrastive learning frameworks with the InfoNCE loss as in Eq. 1 unless otherwise specified.¹

In contrastive learning research, it has been widely believed that enlarging the number of negative samples K boosts the performance (Hénaff et al. (2019); Tian et al. (2019); Bachman et al. (2019)). For example, in MoCo (He et al. (2020)) the ImageNet accuracy rises from 54.7% to 60.6% under the linear classification protocol when K grows from 256 to 65536. Such observations further drive a line of studies on how to optimize effectively with a large number of negative pairs, such as memory bank methods (Wu et al. (2018); He et al. (2020)) and large batch training (Chen et al. (2020a)), both of which empirically report superior performance when K becomes large. Analogously, in the field of supervised metric learning (Deng et al. (2019); Wang et al. (2018); Sun et al. (2020); Wang et al. (2020)), a loss of a form similar to Eq. 1 is often applied to a large number of negative pairs for hard negative mining. Besides, there are also a few theoretical studies supporting this viewpoint. For instance, Oord et al. (2018) point out that the mutual information between the positive pair tends to increase with the number of negative pairs K; Wang & Isola (2020) find that the negative pairs encourage uniformity of the features on the hypersphere; Chuang et al. (2020) suggest that a large K leads to a more precise estimation of the debiased contrastive loss; etc.

Despite the above empirical and theoretical evidence, however, we point out that the case for using many negative pairs is still not convincing. First, unlike the metric learning mentioned above, in self-supervised learning, the negative terms $k_i$ in Eq. 
1 include both "true negative" samples (whose underlying class label is different from the query's, similarly hereinafter) and "false negative" samples, since the actual ground truth label is not available. So, intuitively, a large K should not always be beneficial, because the risk of false negative samples also increases (known as the class collision problem). Arora et al. (2019) thus conclude theoretically that a large number of negative samples does not necessarily help. Second, some recent works have shown that by introducing new architectures (e.g., a predictor network in BYOL (Grill et al., 2020)) or designing new loss functions (e.g., Caron et al. (2020a); Ermolov et al. (2020)), state-of-the-art performance can still be obtained even without any explicit negative pairs. In conclusion, it is still an open question whether large quantities of negative samples are essential to contrastive learning.

Considering the above two aspects, we raise a question: is a large K really essential in the contrastive learning framework? We propose to rethink the question from a different view: note that in Eq. 1 there are three hyper-parameters: the number of negative samples K, the temperature τ, and the margin m. In most previous empirical studies (He et al. (2020); Chen et al. (2020a)), only K is changed while τ and m are kept constant. Do the optimal values of τ and m vary with K? If so, the performance gains observed with larger K may have been misinterpreted: they may merely result from suboptimal hyper-parameter choices for small K, rather than from a large K being essential.

In this paper, we investigate the relationship among the three hyper-parameters and suggest an equivalent rule:

$$m = \tau \log \frac{\alpha}{K},$$

where α is a constant. We find that if the margin m is adaptively adjusted according to the above rule, the performance of contrastive learning is insensitive to K over a very large range (e.g. K ≥ 16). 
For example, in the MoCo framework, by introducing EqCo the performance gap between K = 256 and K = 65536 (the best configuration reported in He et al. (2020)) almost disappears (from a 6.1% decrease to 0.2%). We call this method "Equivalent Rules for Contrastive learning" (EqCo). For completeness, as the other part of EqCo we point out that adjusting the learning rate according to the conventional linear scaling rule satisfies the equivalence for different numbers of queries per batch.

Theoretically, following the InfoMax principle (Linsker (1988)) and the derivation in CPC (Oord et al. (2018)), we prove that in EqCo the lower bound of the mutual information stays steady under various numbers of negative samples K. Moreover, from the back-propagation perspective, we further prove that in such a configuration the upper bound of the gradient norm is also independent of the scale of K. The proposed equivalent rule implies that, by assigning $\alpha = K_0$, one can "mimic" the optimization behavior under $K_0$ negative samples even if the physical number of negatives $K \neq K_0$.

The "equivalent" methodology of EqCo follows the well-known linear scaling rule (Krizhevsky (2014); Goyal et al. (2017)), which suggests scaling the learning rate proportionally to the batch size if the loss has the linear averaged form $L = \frac{1}{N} \sum_{i=1}^{N} f(x_i; \theta)$. However, the linear scaling rule cannot be directly applied to the InfoNCE loss (Eq. 1), partly because the InfoNCE loss involves two batch sizes (the number of queries and the number of keys) while the linear scaling rule involves only one, and partly because of the nonlinearity of the keys in the InfoNCE loss. In the experiments of SimCLR (Chen et al. (2020a)), learning rates under different batch sizes are adjusted with the linear scaling rule, but the accuracy gap is still very large (57.5%@batch=256 vs. 64+%@batch=8192, 100 epochs of training). 
EqCo challenges the belief that self-supervised contrastive learning requires large quantities of negative pairs to obtain competitive performance, making it possible to design simpler algorithms. We thus present SiMo, a simplified contrastive learning framework based on MoCo v2 (Chen et al. (2020c)). SiMo is elegant, efficient, and free of large batch training and memory banks; moreover, it can achieve superior performance over the state of the art even if the number of negative pairs is extremely small (e.g. 16), without bells and whistles.

The contributions of our paper are summarized as follows:

• We challenge the widely accepted belief that on large-scale vision datasets like ImageNet, a large number of negative samples is critical for contrastive learning. We interpret it from a different view: it may be because the hyper-parameters are not set to the optimum.
• We propose EqCo, an equivalent rule to adaptively set hyper-parameters between small and large numbers of negative samples, which proves to bridge the performance gap.
• We present SiMo, a simpler but stronger baseline for contrastive learning.

2. EQCO: EQUIVALENT RULES FOR CONTRASTIVE LEARNING

In this section we introduce EqCo. We mainly consider the circumstance of optimizing the InfoNCE loss (Eq. 1) with SGD. For each batch of training, there are two meanings of the concept "batch size", i.e., the size of negative samples/pairs K per query, and the number of queries (or positive pairs) N per batch. Hence our equivalent rules accordingly consist of two parts, which will be introduced in the next subsections.

2.1. THE CASE OF NEGATIVE PAIRS

Our derivation is mainly inspired by the model of Contrastive Predictive Coding (CPC) (Oord et al. (2018)), in which the InfoNCE loss is interpreted as a mutual information estimator. We further extend the method so that it is applicable to the InfoNCE loss with a margin term (Eq. 1), which is not considered in Oord et al. (2018).

Following the concept in Oord et al. (2018), consider a query embedding q (namely the context in Oord et al. (2018)) and K + 1 random key embeddings $x = \{x_i\}_{i=0,\dots,K}$, where exactly one entry (e.g., $x_i$) is sampled from the conditional distribution $P(x_i|q)$ while the others (e.g., $x_j$) are sampled independently from the "proposal" distribution $P(x_j)$. According to which entry corresponds to the conditional distribution, we define K + 1 candidate distributions for x (denoted by $\{H_i\}_{i=0,\dots,K}$), where the probability density of x under $H_i$ is $P_{H_i}(x) = P(x_i|q) \prod_{j \neq i} P(x_j)$. So, given the observed data $X = \{k_0, \dots, k_K\}$ of x, the probability that x is sampled from $H_0$ rather than from the other candidates is derived with Bayes' theorem:

$$\Pr[x \sim H_0 \mid q, X] = \frac{P^+ P_{H_0}(X)}{P^+ P_{H_0}(X) + P^- \sum_{i=1}^{K} P_{H_i}(X)} = \frac{\frac{P^+}{P^-} \frac{P(k_0|q)}{P(k_0)}}{\frac{P^+}{P^-} \frac{P(k_0|q)}{P(k_0)} + \sum_{i=1}^{K} \frac{P(k_i|q)}{P(k_i)}}, \tag{2}$$

where $P^+$ and $P^-$ denote the prior probabilities of $H_0$ and $H_i$ ($i > 0$) respectively. We point out that Eq. 2 generalizes the form in Oord et al. (2018) by taking the priors into account.

Referring to the notation in Eq. 1, we suppose that $H_0$ is the ground truth distribution of x (since $k_0$ is the only positive sample). By modeling the density ratio $P(k_i|q)/P(k_i) \propto e^{q^\top k_i/\tau}$ ($i = 0, \dots, K$) and letting $P^+/P^- = e^{-m/\tau}$, the negative log-likelihood $\mathcal{L}_{opt} \triangleq -\mathbb{E}_{q,X} \log \Pr[x \sim H_0 \mid q, X]$ can be regarded as the optimal value of $\mathcal{L}_{NCE}$. Similar to the methodology of Oord et al. 
(2018), we explore the lower bound of $\mathcal{L}_{opt}$:

$$\begin{aligned}
\mathcal{L}_{opt} &= \mathbb{E}_{q \sim D,\, k_0 \sim D'(q),\, k_i \sim D'} \log\left[1 + e^{m/\tau} \frac{P(k_0)}{P(k_0|q)} \sum_{i=1}^{K} \frac{P(k_i|q)}{P(k_i)}\right] \\
&\approx \mathbb{E}_{q \sim D,\, k_0 \sim D'(q)} \log\left[1 + K e^{m/\tau} \frac{P(k_0)}{P(k_0|q)}\, \mathbb{E}_{k_i \sim D'} \frac{P(k_i|q)}{P(k_i)}\right] \\
&= \mathbb{E}_{q \sim D,\, k_0 \sim D'(q)} \log\left[1 + K e^{m/\tau} \frac{P(k_0)}{P(k_0|q)}\right] \\
&\geq \log(1 + K e^{m/\tau}) - I(k_0, q),
\end{aligned} \tag{3}$$

where $I(\cdot, \cdot)$ means mutual information. The approximation in the second row is guaranteed by the Law of Large Numbers as well as the fact that $P(k_i|q) \approx P(k_i)$, since $k_i$ ($i > 0$) and q are "almost" independent. The inequality in the last row results from $P(k_0|q) \geq P(k_0)$, as $k_0$ and q are extracted from the same instance. Therefore the lower bound of the mutual information (denoted by $f_{bound}(m, K)$) between the positive pair $(k_0, q)$ is:

$$I(k_0, q) \geq f_{bound}(m, K) \triangleq \log(1 + K e^{m/\tau}) - \mathcal{L}_{opt} \approx \log(1 + K e^{m/\tau}) - \mathbb{E}_{q \sim D,\, k_0 \sim D'(q)} \log\left[1 + K e^{m/\tau} \frac{P(k_0)}{P(k_0|q)}\right]. \tag{4}$$

So, minimizing $\mathcal{L}_{NCE}$ (Eq. 1) towards $\mathcal{L}_{opt}$ implies maximizing the lower bound of the mutual information, which also holds when $m \neq 0$. In the case of m = 0, the result is consistent with that in Oord et al. (2018). Oord et al. (2018) further point out that the bound increases with K, which indicates that a larger K encourages learning more mutual information and thus could help to improve the performance. Nevertheless, unlike Oord et al. (2018), our model does not require m to be zero, so the lower bound in Eq. 4 is also a function of $e^{m/\tau}$. Thus we have the following theorem:

Theorem 1. (Main, EqCo for negative pairs) The mutual information lower bound of the InfoNCE loss in Eq. 1 is independent of the number of negative pairs K if

$$m = \tau \log \frac{\alpha}{K}, \tag{5}$$

where α is a constant coefficient. In this case the bound is given by:

$$f_{bound}\left(\tau \log \frac{\alpha}{K}, K\right) \approx \log(1 + \alpha) - \mathbb{E}_{q \sim D,\, k_0 \sim D'(q)} \log\left[1 + \alpha \frac{P(k_0)}{P(k_0|q)}\right] \approx f_{bound}(0, \alpha), \tag{6}$$

which can be immediately obtained by substituting Eq. 5 into Eq. 4. We name Eq. 5 the "equivalent condition". Theorem 1 suggests a property of equivalency: under the condition of Eq. 
5, no matter what the number of physical negative pairs K is, the optimal solution of $\mathcal{L}_{NCE}$ (Eq. 1) is "equivalent" in the sense of having the same mutual information lower bound. The bound is controlled by the hyper-parameter α rather than by K. Eq. 6 further implies that the lower bound also corresponds to the configuration K = α without margin, which suggests that we can "mimic" the behavior of the InfoNCE loss at $K = K_0$ under a different physical negative sample size $K_1$, just by applying Eq. 5 with $\alpha = K_0$. This inspires us to simplify existing state-of-the-art frameworks (e.g. MoCo (He et al. (2020))) so that they use fewer negative samples but remain as accurate as the original configurations, which will be introduced next.

We empirically validate Theorem 1 as follows. Notice that $f_{bound}$ is difficult to calculate directly because $\mathcal{L}_{opt}$ is not known. Instead, we plot the empirical mutual information lower bound $\hat{f}_{bound}(m, K) \triangleq \log(1 + K e^{m/\tau}) - \mathcal{L}_{NCE}$. Since $\mathcal{L}_{NCE} \geq \mathcal{L}_{opt}$, we have $\hat{f}_{bound} \leq f_{bound}$; when $\mathcal{L}_{NCE}$ converges to the optimum $\mathcal{L}_{opt}$, $\hat{f}_{bound}$ is an approximation of $f_{bound}$. In Fig. 1, we plot the evolution of $\hat{f}_{bound}$ during the training of MoCo v2 under different configurations. At convergence, without EqCo $\hat{f}_{bound}$ keeps increasing with the number of negative pairs K; in contrast, after applying the equivalent condition (Eq. 5), $\hat{f}_{bound}$ converges to almost the same value under different Ks. The empirical results are thus consistent with Theorem 1.

Remarks 1. The equivalent condition in Eq. 5 suggests that the margin m is inversely correlated with K. This is intuitive: the larger K is, the higher the risk of class collision (Arora et al. (2019)), so we need to avoid over-penalizing negative samples near the query, and thus a smaller m is used; in contrast, if K is very small, we use a larger m to exploit more "hard" negative samples. Besides, recall that the margin term $e^{m/\tau}$ is defined as the ratio of the prior probabilities $P^-/P^+$ in Eq. 2. If the equivalent condition Eq. 
5 holds, i.e., $P^-/P^+ = \alpha/K$, we have $P^+ = 1/(1 + \alpha)$ (notice that $K P^- + P^+ \equiv 1$), suggesting that the prior probability of the ground truth distribution $H_0$ is a constant regardless of the number of negative samples K. In previous works (usually without the margin term, i.e. m = 0), in contrast, we have $P^+ = 1/(K + 1)$. It is hard to tell which prior is more reasonable. However, we intuitively suppose that keeping a constant prior for the ground truth distribution may help to keep the optimal choices of hyper-parameters steady under different Ks, which is also consistent with our empirical observations.

Remarks 2. In Theorem 1, it is worth noting that K refers to the number of negative samples per query. In the conventional batched training scheme, negative samples for different queries can be (fully or partially) shared or isolated, i.e., the total number of distinct negative samples per batch can differ, which is not covered by Theorem 1. However, we empirically find that these differences in implementation do not result in much performance variation.

The following theorem further supports the equivalent rule (Theorem 1) from the back-propagation view:

Theorem 2. Given the equivalent condition (Eq. 5) and a query embedding q as well as the corresponding positive sample $k_0$, for $\mathcal{L}_{NCE}$ in Eq. 1 the expectation of the gradient norm w.r.t. q is bounded by²:

$$\mathbb{E}_{k_i \sim D'} \left\| \frac{d\mathcal{L}_{NCE}}{dq} \right\| \leq \frac{2}{\tau} \left(1 - \frac{\exp(q^\top k_0/\tau)}{\exp(q^\top k_0/\tau) + \alpha\, \mathbb{E}_{k_i \sim D'}[\exp(q^\top k_i/\tau)]}\right). \tag{7}$$

Please refer to Appendix A.1 for the detailed proof. Note that we assume the embedding vectors are normalized, i.e., $\|k_i\| = 1$ ($i = 0, \dots, K$), which is also a convention in recent contrastive learning works. Theorem 2 indicates that, equipped with the equivalent rule (Eq. 5), the upper bound of the gradient norm is independent of the number of negative samples K. Fig. 
4 (see Appendix A.2) further validates our theory: the gradient norm becomes much steadier under different Ks after using EqCo. Since K has little effect on the gradient magnitude, gradient scaling techniques, e.g. the linear scaling rule, are not specifically required for different Ks. Eq. 7 also implies that the temperature τ significantly affects the gradient norm even when EqCo is applied. This is why we only recommend modifying m for equivalence (Eq. 5), even though the mutual information lower bound is determined by $e^{m/\tau}$ as a whole.
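The equivalent condition of Eq. 5 can be illustrated with a minimal per-query sketch in plain Python. The function names and the numeric values below are illustrative assumptions and are not taken from the paper's implementation; the sketch only shows that setting m = τ log(α/K) rescales the positive logit so that the loss takes the form used in Appendix A.1.

```python
import math

def eqco_margin(tau, alpha, k):
    """Equivalent condition (Eq. 5): m = tau * log(alpha / K)."""
    return tau * math.log(alpha / k)

def info_nce_with_margin(pos_sim, neg_sims, tau, margin):
    """Per-query InfoNCE loss with a margin on the positive logit (Eq. 1).

    pos_sim  -- similarity q^T k_0 of the positive pair
    neg_sims -- similarities q^T k_i of the K negative pairs
    """
    pos_logit = (pos_sim - margin) / tau
    logits = [pos_logit] + [s / tau for s in neg_sims]
    # log-sum-exp over the K + 1 logits, shifted for numerical stability
    top = max(logits)
    lse = top + math.log(sum(math.exp(l - top) for l in logits))
    return lse - pos_logit

def eqco_loss(pos_sim, neg_sims, tau, alpha):
    """InfoNCE loss whose margin follows the equivalent condition."""
    margin = eqco_margin(tau, alpha, len(neg_sims))
    return info_nce_with_margin(pos_sim, neg_sims, tau, margin)

def empirical_bound(loss, tau, margin, k):
    """Empirical bound log(1 + K * exp(m / tau)) - L_NCE from Sec. 2.1."""
    return math.log(1.0 + k * math.exp(margin / tau)) - loss

# Illustrative values (assumptions): identical negatives, tau = 0.2, alpha = 256.
loss_k16 = eqco_loss(0.8, [0.1] * 16, tau=0.2, alpha=256.0)
loss_k256 = eqco_loss(0.8, [0.1] * 256, tau=0.2, alpha=256.0)
```

In this toy setting, where all negatives share the same similarity, the two losses coincide, and for K = α the margin is zero so the loss reduces to the plain InfoNCE loss. With real data the negatives differ, so the match holds only in expectation, as in Theorem 1.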

2.2. THE CASE OF POSITIVE PAIRS

In practice the InfoNCE loss (Eq. 1) is usually optimized with batched SGD, which can be represented as empirical risk minimization:

$$\mathcal{L}^{batch}_{NCE} = \frac{1}{N} \sum_{j=1}^{N} \mathcal{L}^{(j)}_{NCE}(q_j, k_{j,0}), \tag{8}$$

where N is the number of queries (or positive pairs) per batch; $(q_j, k_{j,0}) \sim (D, D'(q_j))$ is the j-th positive pair, and $\mathcal{L}^{(j)}_{NCE}(q_j, k_{j,0})$ is the corresponding loss. For different j, the terms $\mathcal{L}^{(j)}_{NCE}$ are (almost) independent of each other, because the $q_j$ are sampled independently. Hence, Eq. 8 satisfies the form required by the linear scaling rule (Krizhevsky (2014); Goyal et al. (2017)), suggesting that the learning rate should be adjusted proportionally to the number of queries N per batch.

Remarks 3. Previous work like SimCLR (Chen et al. (2020a)) also proposes to apply the linear scaling rule.³ The difference is that SimCLR does not clarify whether "batch size" refers to the number of queries or the number of keys. In our paper, however, we explicitly point out that the linear scaling rule needs to be applied with respect to the number of queries per batch (N) rather than K.
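The two parts of the equivalent rule can be put side by side in a short sketch: the learning rate follows the number of queries N, and the margin follows the number of negatives K. The helper names, the base learning rate of 0.03 and the temperature of 0.2 are assumptions chosen for illustration, not settings stated in this section.

```python
import math

def linear_scaled_lr(base_lr, base_queries, num_queries):
    """Linear scaling rule: learning rate proportional to queries per batch N."""
    return base_lr * num_queries / base_queries

def eqco_settings(num_queries, num_negatives, tau, alpha, base_lr, base_queries):
    """Combine both rules: N only affects the learning rate, K only the margin."""
    return {
        "lr": linear_scaled_lr(base_lr, base_queries, num_queries),
        "margin": tau * math.log(alpha / num_negatives),
    }

# Hypothetical configuration: N = 1024 queries, K = 16 negatives, alpha = 65536.
settings = eqco_settings(num_queries=1024, num_negatives=16, tau=0.2,
                         alpha=65536.0, base_lr=0.03, base_queries=256)
```

Keeping the two adjustments in separate terms mirrors Remarks 3: changing K leaves the learning rate untouched, and changing N leaves the margin untouched.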

2.3. EMPIRICAL EVALUATION

In this subsection we conduct experiments on three state-of-the-art self-supervised contrastive learning frameworks, MoCo (He et al. (2020)), MoCo v2 (Chen et al. (2020c)) and SimCLR (Chen et al. (2020a)), to verify our theory in Sec. 2.1 and Sec. 2.2. We alter K and N separately to examine the correctness of our equivalent rules.

Implementation details. We follow most of the training and evaluation settings recommended in the respective original papers. The only difference is that, for SimCLR, we adopt SGD with momentum rather than LARS (You et al. (2017)) as the optimizer. We use ResNet-50 (He et al. (2016)) as the default network architecture. 128-d features are employed for query and key embeddings. Unless otherwise mentioned, all models are trained on ImageNet (Deng et al. (2009)) for 200 epochs without using the ground truth labels. We report the top-1 accuracy under the conventional linear evaluation protocol of the respective original paper. The number of queries per batch (N) is set to 256 by default. All models are trained with 8 GPUs.

It is worth noting how we alter the number of negative samples K independently of N during training. For MoCo and MoCo v2, we simply set the size of the memory bank to K. Specifically, if K < N, the memory bank for the current batch is composed of K random keys sampled from the previous batch. For SimCLR, if K < N we randomly sample K negative keys for each query independently. We do not study the case K > N for SimCLR. We mainly consider ease of implementation in designing these strategies; as mentioned in Remarks 2 (Sec. 2.1), this does not affect the empirical conclusion.

Quantitative results. Fig. 2 illustrates the effect of our equivalent rule under different Ks. Our experiments start with the best configurations (i.e. K = 65536 for MoCo and MoCo v2, and K = 256 for SimCLR⁴), then we gradually reduce K and benchmark the performance. Results in Fig. 
2 indicate that, without EqCo, the accuracy drops significantly if K becomes very small (e.g. K < 64). With EqCo, by setting α to "mimic" the optimal K, the performance surprisingly stays steady over a wide range of Ks. Fig. 2(b) further shows that in SimCLR, by setting α to a number larger than the physical batch size (e.g. 4096 vs. 256), the accuracy improves significantly from 62.0% to 65.3%,⁵ suggesting the benefit of EqCo especially when memory is limited. The comparison demonstrates that EqCo is essential especially when the number of negative pairs is small. Besides, Table 1 2020c)) equipped with EqCo. We follow most of the design in Chen et al. (2020c), where the key differences are as follows:

Memory bank. MoCo, MoCo v2 and SimCLR v2⁶ (Chen et al. (2020b)) employ a memory bank to maintain a large number of negative embeddings $k_i$, which has a side effect: every positive embedding $k_0$ is always extracted from a "newer" network than the negatives in the same batch, which could harm the performance. In SiMo, we thus remove the memory bank, as we only rely on a few negative samples per batch. Instead, we use the momentum encoder to extract both positive and negative key embeddings from the current batch.

Shuffling BN vs. Sync BN. In MoCo v1/v2, shuffling BN (He et al. (2020)) is proposed to remove the obvious dissimilarities of the BN (Ioffe & Szegedy (2015)) statistics between the positive (from the current mini-batch) and the negatives (from the memory bank), so that the model makes predictions based on the semantic information of images rather than on the BN statistics. In contrast, since the positive and the negatives come from the same batch in SiMo, we use sync BN (Peng et al. (2018)) for simplicity and more stable statistics. Sync BN is also used in SimCLR (Chen et al. (2020a)) and SimCLR v2 (Chen et al. (2020b)). 
There are a few other differences: 1) we attach a BN to each of the fully-connected layers; 2) we introduce a warm-up stage at the beginning of training, which follows the methodology in SimCLR (Chen et al. (2020a)). Apart from the differences mentioned above, the architecture and the training details (including data augmentations) of SiMo are exactly the same as those of MoCo v2. In the following text, the number of queries per batch (N) is set to 256, and the backbone network is ResNet-50 by default.

Quantitative results. First, we empirically demonstrate the necessity of EqCo in the SiMo framework. We choose K = 256 negative samples as the baseline, then reduce K to evaluate the performance. Fig. 3 shows the results on ImageNet using the linear evaluation protocol. Without EqCo, the accuracy drops significantly when K is very small. In contrast, when EqCo is used to "mimic" the case of large K (by setting α to 256), the accuracy stays almost steady even under very small Ks. 

4. LIMITATIONS AND FUTURE WORK

Theorem 1 suggests that, given the equivalent condition (Eq. 5), InfoNCE losses under various Ks are "equivalent" in the sense of having the same mutual information lower bound, which is also backed up by the experiments in Fig. 1. However, Fig. 2(a) shows that if K is smaller than a certain value (e.g. K ≤ 16), some frameworks like MoCo v2 start to degrade significantly even with EqCo, while for other frameworks like SiMo (Fig. 3) the accuracy stays almost steady for very small Ks. Tschannen et al. (2019) also point out that the InfoMax principle cannot explain all the phenomena in contrastive learning. We will investigate the problem in the future, e.g. from other viewpoints such as the gradient noise brought by small Ks (Fig. 4 in Appendix A.2 gives some insights).

Although the formulation of Eq. 1 is very common in the field of supervised metric learning, where it is usually named margin softmax cross-entropy loss (Deng et al., 2019; Wang et al., 2018; Sun et al., 2020), our equivalent rule unfortunately does not seem to generalize to those problems (e.g. face recognition). The major issue lies in the approximation in Eq. 3: we need the negative samples $k_i$ to be independent of the query q, which is not satisfied in supervised tasks.

According to Fig. 2 and Fig. 3, the benefits of EqCo become significant if K is sufficiently small (e.g. K < 64). But in practice, on modern computing devices (e.g. GPUs) it is not that difficult to use ∼256 negative pairs per query. Applying EqCo to "simulate" more negative pairs by adjusting α can further boost the performance, but the accuracy gains become relatively marginal. For example, in

A DETAILS ABOUT THEOREM 2

A.1 PROOF OF EQ. 7

Given the equivalent condition (Eq. 5) and a query embedding q as well as the corresponding positive sample $k_0$, for $\mathcal{L}_{NCE}$ in Eq. 1 the expectation of the gradient norm w.r.t. 
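q is bounded by:

$$\mathbb{E}_{k_i \sim D'} \left\| \frac{d\mathcal{L}_{NCE}}{dq} \right\| \leq \frac{2}{\tau} \left(1 - \frac{\exp(q^\top k_0/\tau)}{\exp(q^\top k_0/\tau) + \alpha\, \mathbb{E}_{k_i \sim D'}[\exp(q^\top k_i/\tau)]}\right). \tag{9}$$

Proof. For simplicity, we denote the term $\exp(q^\top k_i/\tau)$ by $s_i$ ($i = 0, \dots, K$). Then $\mathcal{L}_{NCE}$ can be rewritten as:

$$\mathcal{L}_{NCE} = -\log \frac{s_0}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i}. \tag{10}$$

The gradient of $\mathcal{L}_{NCE}$ with respect to q is easily derived:

$$\frac{d\mathcal{L}_{NCE}}{dq} = -\frac{1}{\tau} \left(1 - \frac{s_0}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i}\right) k_0 + \frac{\alpha}{\tau K} \sum_{i=1}^{K} \frac{s_i}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i}\, k_i. \tag{11}$$

Owing to the triangle inequality and the fact that $k_i$ ($i = 0, \dots, K$) is normalized, the norm of the gradient is bounded by:

$$\begin{aligned}
\left\| \frac{d\mathcal{L}_{NCE}}{dq} \right\| &\leq \frac{1}{\tau} \left(1 - \frac{s_0}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i}\right) \cdot \|k_0\| + \sum_{i=1}^{K} \frac{\alpha}{\tau K} \frac{s_i}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i} \cdot \|k_i\| \\
&= \frac{1}{\tau} \left(1 - \frac{s_0}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i}\right) + \frac{1}{\tau} \sum_{i=1}^{K} \frac{\alpha}{K} \frac{s_i}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i} \\
&= \frac{2}{\tau} \left(1 - \frac{s_0}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i}\right).
\end{aligned} \tag{12}$$

Since the cosine similarity between q and $k_i$ ($i = 1, \dots, K$) is bounded in [-1, 1], we know that the expectation $\mathbb{E}_{k_i \sim D'}[s_i]$ exists. According to Inequality (12) and Jensen's inequality, we have:

$$\mathbb{E}_{k_i \sim D'} \left[\frac{2}{\tau} \left(1 - \frac{s_0}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i}\right)\right] = \frac{2}{\tau} \left(1 - \mathbb{E}_{k_i \sim D'} \left[\frac{s_0}{s_0 + \frac{\alpha}{K} \sum_{i=1}^{K} s_i}\right]\right) \leq \frac{2}{\tau} \left(1 - \frac{s_0}{s_0 + \alpha\, \mathbb{E}_{k_i \sim D'}[s_i]}\right). \tag{13}$$

Replacing $s_i$ by $\exp(q^\top k_i/\tau)$ completes the proof of Theorem 2. 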
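The per-sample bound used in the proof (the gradient norm is at most 2/τ times one minus the positive's softmax weight) can be checked numerically. The sketch below is a plain-Python illustration under assumed values (8-dimensional unit vectors, K = 16, τ = 0.2, α = 256); the function names are hypothetical and are not part of the paper's code.

```python
import math
import random

def unit_vector(dim, rng):
    """Draw a random vector and normalize it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def eqco_loss_from_embeddings(q, keys, tau, alpha):
    """Loss -log(s_0 / (s_0 + alpha/K * sum_i s_i)); keys[0] is the positive."""
    k = len(keys) - 1
    s = [math.exp(dot(q, key) / tau) for key in keys]
    z = s[0] + alpha / k * sum(s[1:])
    return -math.log(s[0] / z)

def grad_norm_and_bound(q, keys, tau, alpha):
    """Analytic gradient w.r.t. q, its norm, and the per-sample upper bound."""
    k = len(keys) - 1
    s = [math.exp(dot(q, key) / tau) for key in keys]
    z = s[0] + alpha / k * sum(s[1:])
    miss = 1.0 - s[0] / z                      # one minus the positive's weight
    grad = [-miss / tau * x for x in keys[0]]  # contribution of the positive key
    for s_i, key in zip(s[1:], keys[1:]):
        w = alpha / (tau * k) * s_i / z        # weight of each negative key
        grad = [g + w * x for g, x in zip(grad, key)]
    norm = math.sqrt(sum(g * g for g in grad))
    return grad, norm, 2.0 / tau * miss

rng = random.Random(0)
q = unit_vector(8, rng)
keys = [unit_vector(8, rng) for _ in range(1 + 16)]  # 1 positive + K = 16 negatives
grad, norm, bound = grad_norm_and_bound(q, keys, tau=0.2, alpha=256.0)
```

The comparison `norm <= bound` holds for any choice of unit-norm keys, since it only relies on the triangle inequality; the expectation bound of Theorem 2 additionally needs Jensen's inequality and is not checked here.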

B MORE EXPERIMENTS ON SIMO

For the following experiments of this section, we report the top-1 accuracy of SiMo on ImageNet (Deng et al., 2009) under the linear evaluation protocol. The backbone of SiMo is ResNet-50 (He et al., 2016) and we train SiMo for 200 epochs unless noted otherwise.

B.1 ABLATION ON MOMENTUM UPDATE

In MoCo (He et al., 2020) and MoCo v2 (Chen et al., 2020c), the key encoder is updated by the following rule:

$$\theta_k = \beta \theta_k + (1 - \beta)\, \theta_q,$$

where $\theta_q$ and $\theta_k$ stand for the weights of the query encoder and the key encoder respectively, and β is the momentum coefficient. For SiMo, we also adopt the momentum update and use the key encoder to compute the features of the positive sample and the negative samples. In Table 3 

C A TOY EVALUATION OF EQCO

To evaluate the effectiveness of EqCo as a mutual information (MI) estimator, following the configuration of Poole et al. (2019), we estimate the MI lower bound between two simple random vectors. Specifically, given that (X, Y) are drawn from a known correlated Gaussian distribution, we calculate the lower bound of the MI between X and Y based on their embeddings. X is a 20-dimensional random variable drawn from a standard Gaussian distribution, and we sample Y with the following rule:

$$Y = \rho X + \sqrt{1 - \rho^2}\, \epsilon, \tag{14}$$

where ρ is the given correlation coefficient and ε is a random variable sampled from a standard Gaussian distribution, independent of X. With a known ρ, the ground truth MI between X and Y is easy to compute:

$$I(X, Y) = -\frac{d}{2} \log\left(1 - \rho^2\right). \tag{15}$$

Here, d is the dimension of X and Y, and as mentioned above we set d = 20. To embed X and Y, we adopt two MLPs respectively; each MLP has 1 hidden layer of 256 units, followed by a ReLU activation function. We use the Adam optimizer with a learning rate of 0.0005 to optimize InfoNCE or EqCo for 5000 steps. For each training iteration, K pairs of (X, Y) are independently sampled, which means there are K - 1 negative samples for each query. After training, the weights of the MLPs are frozen and we repeat the estimation of the MI lower bound 1000 times to reduce the estimation variance. For experiments with EqCo, we set α = 512. As shown in Table 9, $I_{NCE}$ varies with K, while $I_{EqCo}$ remains steady. In particular, when the ground truth MI is relatively large (e.g., 8, 10), significant differences between EqCo and InfoNCE can be observed. The experiment further validates the effectiveness of EqCo. 
Ground-truth MI   Estimator   K = 64   K = 128   K = 256   K = 512
2.0               I_NCE       1.7      1.8       1.9       1.9
2.0               I_EqCo      1.9      1.9       1.9       1.9
4.0               I_NCE       2.9      3.2       3.4       3.6
4.0               I_EqCo      3.8      3.7       3.6       3.6
6.0               I_NCE       3.6      4.1       4.5       4.9
6.0               I_EqCo      5.1      5.0       4.9       4.9
8.0               I_NCE       3.9      4.6       5.1       5.6
8.0               I_EqCo      5.8      5.7       5.7       5.6
10.0              I_NCE       4.1      4.7       5.4       6.0
10.0              I_EqCo      6.1      6.0       6.0       6.0
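The data-generating side of this toy setup (Eq. 14 and Eq. 15) can be sketched in a few lines of plain Python. The helper `rho_for_mi`, which inverts Eq. 15 to pick ρ for a target MI, is a hypothetical convenience added here for illustration; the MLP critics and the training loop are not reproduced.

```python
import math
import random

def ground_truth_mi(rho, dim):
    """Eq. 15: I(X, Y) = -(d / 2) * log(1 - rho^2)."""
    return -dim / 2.0 * math.log(1.0 - rho ** 2)

def rho_for_mi(mi, dim):
    """Invert Eq. 15 to get the correlation that yields a target MI (assumed helper)."""
    return math.sqrt(1.0 - math.exp(-2.0 * mi / dim))

def sample_pair(rho, dim, rng):
    """Eq. 14: Y = rho * X + sqrt(1 - rho^2) * eps, with X and eps standard Gaussian."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = math.sqrt(1.0 - rho ** 2)
    y = [rho * xi + scale * ei for xi, ei in zip(x, eps)]
    return x, y

rng = random.Random(0)
rho = rho_for_mi(4.0, 20)          # correlation for a ground-truth MI of 4.0, d = 20
x, y = sample_pair(rho, 20, rng)   # one positive pair (X, Y)
```

Sampling K such pairs per iteration gives each query one positive and K - 1 negatives, matching the description in the text above.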



1. Recently, some self-supervised learning algorithms have achieved new state-of-the-art results using frameworks other than the conventional InfoNCE loss in Eq. 1, e.g. mean teacher (in BYOL, Grill et al. (2020)) and online clustering (in SWAV, Caron et al. (2020b)). We will investigate them in the future.
2. Some works (e.g., He et al. (2020)) only use $d\mathcal{L}_{NCE}/dq$ for optimization. In contrast, other works (Chen et al. (2020a)) also involve $d\mathcal{L}_{NCE}/dk_i$ ($i = 0, \dots, K$), which we will investigate in the future.
3. In SimCLR, the authors find that square-root learning rate scaling is more desirable than the linear scaling rule with the LARS optimizer (You et al. (2017)). Also, their experiments suggest that the performance gap between large and small batch sizes becomes smaller under that configuration. We point out that this direction is orthogonal to our equivalent rule. Besides, SimCLR does not explore the case of very small Ks (e.g. K <= 128).
4. In the original paper of SimCLR (Chen et al. (2020a)), the best number of negative pairs is around 4096. However, the largest K we can use in our experiment is 256 due to the GPU memory limit.
5. Our "mimicking" result (65.3%, α = 4096, K = 256) is slightly lower than the counterpart score reported in the original SimCLR paper (66.6%, with a physical batch size of K = 4096), which we think may result from the extra benefits of SyncBN along with the LARS optimizer used in SimCLR, especially when the physical batch size is large.
6. SimCLR v2 compares the settings with and without a memory bank. However, they suggest employing the memory bank as the best configuration.
7. We mainly compare the methods with InfoNCE loss (Eq. 1) here, though recently BYOL (Grill et al. (2020)) and SWAV have achieved better results using different loss functions.



Figure 1: Evolution of the empirical mutual information lower bound fbound during training. We use α = 65536 for EqCo. Results are evaluated with MoCo v2 on ImageNet. Refer to Theorem 1 for details. Best viewed in color.

Figure 4: The means (solid line) and variances (ribbon, ±σ) of dL_NCE/dq under different values of K. We train a normal MoCo v2 for 200 epochs and show the statistics at different epochs.

Figure 2: Comparisons with and without EqCo under different numbers of negative samples (denoted by K). Results are evaluated by ImageNet top-1 accuracy using the linear evaluation protocol. In EqCo, we set α = 65536 for MoCo and MoCo v2, and α = 256 for SimCLR (except for one data point with α = 4096, as noted in the legend). Best viewed in color.

ImageNet accuracy (MoCo v2) vs. the number of queries per batch (N). The learning rates during training are adjusted with the linear scaling rule.

3. SIMO: A SIMPLER BUT STRONGER BASELINE

EqCo inspires us to rethink the design of contrastive learning frameworks. Previous state-of-the-art methods such as MoCo and SimCLR rely heavily on large quantities of negative pairs to obtain high performance; hence implementation tricks such as memory banks and large-batch training are introduced, which make the system complex and costly. Thanks to EqCo, we are able to design a simpler contrastive learning framework with fewer negative pairs.

State-of-the-art InfoNCE-based frameworks

We propose SiMo, a simplified variant of MoCo v2 (Chen et al. (

Table 2 further compares our SiMo with state-of-the-art self-supervised contrastive learning methods on ImageNet. Using only 16 negative samples per query, SiMo outperforms MoCo v2 (68.1% vs. 67.5%). If we increase α to 65536 to "simulate" the case of a huge number of negative pairs, the accuracy further increases to 68.5%. Moreover, when we extend training to 800 epochs, we obtain an accuracy of 72.1%, surpassing the baseline MoCo v2 by 1.0%. The only entry that surpasses our results is InfoMin Aug. (Tian et al. (2020)), which mainly focuses on data generation and is orthogonal to ours. The experiments indicate that SiMo is a simpler but more powerful baseline for self-supervised contrastive learning. Readers can refer to Appendix B for more experimental results of SiMo.
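To make the role of α concrete, the following is a minimal sketch of an InfoNCE-style loss for a single query with an adaptively scaled margin. It assumes the margin takes the form m = τ log(α / K), which is how this sketch reads the equivalent rule; the function name, the default temperature τ = 0.2, and the pure-Python formulation are illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import Sequence


def eqco_loss(pos_logit: float, neg_logits: Sequence[float],
              tau: float = 0.2, alpha: float = 65536.0) -> float:
    """InfoNCE-style loss for one query with a margin on the positive pair.

    pos_logit:  similarity q . k0 of the positive pair.
    neg_logits: similarities q . ki of the K negative pairs.
    The margin m = tau * log(alpha / K) is an assumption of this sketch.
    When alpha == K the margin is zero and the plain InfoNCE loss is recovered.
    """
    k = len(neg_logits)
    margin = tau * math.log(alpha / k)
    pos = (pos_logit - margin) / tau
    negs = [s / tau for s in neg_logits]
    # Numerically stable log-sum-exp over the positive and all negatives.
    top = max([pos] + negs)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in [pos] + negs))
    return log_denominator - pos
```

Under this form of the margin, subtracting m from the positive logit is equivalent to weighting each negative term by α / K, so 16 negatives with α = 64 give the same loss as 64 equally similar negatives with no margin.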

In Table 2, under 200 epochs of training, SiMo with α = 65536 outperforms SiMo with α = 256 by only 0.5%. This could be a fundamental limitation of the InfoNCE loss. We will investigate the problem in the future.

, we report the results of SiMo with different momentum coefficients. The number of training epochs is set to 100, so the top-1 accuracy of the baseline (β = 0.999) drops to 64.4%. Compared to the baseline, SiMo without momentum update (β = 0) is inferior, showing the advantage of the momentum update.

Ablation on momentum update.

B.2 ABLATION ON BN

Table 4 shows the performance of SiMo equipped with shuffling BN or Sync BN. Likewise, we train SiMo for 100 epochs. It is easy to see that SiMo with shuffling BN struggles to perform well. Besides, compared to MoCo v2, SiMo with shuffling BN degrades significantly; we conjecture that this is because the MLP structure of SiMo is better suited to Sync BN than to shuffling BN.
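The momentum coefficient β in the momentum ablation above can be illustrated with a small sketch. It assumes the MoCo-style exponential moving average, in which the key encoder parameters track the query encoder parameters; the function name and the flat list representation of parameters are hypothetical.

```python
from typing import List, Sequence


def momentum_update(key_params: Sequence[float], query_params: Sequence[float],
                    beta: float = 0.999) -> List[float]:
    """Return updated key-encoder parameters as an exponential moving average.

    Assumed update rule (MoCo convention): key <- beta * key + (1 - beta) * query.
    With beta = 0 the key encoder becomes a copy of the query encoder, which
    corresponds to the "without momentum update" setting.
    """
    return [beta * k + (1.0 - beta) * q for k, q in zip(key_params, query_params)]
```

With β = 0.999 each step moves the key parameters only 0.1% of the way toward the query parameters, so the keys change slowly between iterations.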

Sync BN vs. shuffling BN.

As shown in Sec. 2.1, α is related to the lower bound of mutual information. Table 5 reveals how the accuracy of SiMo varies with the choice of α. As we increase α to 65536, the accuracy tends to improve, in accordance with Eq. 6. However, when α is too large (e.g., 262144), the performance drops slightly, by 0.2%.

SiMo with different α.

MoCo v2 with different K.

SiMo with wider models. All models are trained for 200 epochs.

B.5 TRANSFER TO OBJECT DETECTION

Setup. We utilize FPN (Lin et al., 2017) with a stack of four 3 × 3 convolution layers in the R-CNN head to validate the effectiveness of SiMo. Following the MoCo training protocol, we fine-tune with synchronized batch normalization (Peng et al., 2018) across GPUs. The additional initialized layers are also equipped with BN for stable training. To effectively validate the transferability of the features, the training schedule is set to 12 epochs (known as 1×), in which the learning rate is initialized to 0.2 and decreased at epochs 7 and 11 by a factor of 0.1. The image scales are randomly sampled from [640, 800] pixels during training and fixed at 800 at inference.

Results. Table 8 summarizes the fine-tuning results on COCO val2017 for different pre-training methods. Random initialization indicates training on COCO from scratch, and supervised represents conventional pre-training with ImageNet labels. Compared with MoCo, SiMo achieves competitive performance without large quantities of negative pairs. It is also on a par with the supervised counterpart and significantly outperforms the randomly initialized one.

pre-train   AP   AP_50   AP_75   AP_s   AP_m   AP_l

Object detection fine-tuned on COCO.
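The fine-tuning schedule described in the setup (12 epochs, initial learning rate 0.2, decayed by a factor of 0.1 at epochs 7 and 11) can be written as a simple step function. This is a sketch only: the function name is hypothetical, and it assumes that each decay takes effect from the start of the listed epoch.

```python
from typing import Sequence


def detection_lr(epoch: int, base_lr: float = 0.2,
                 milestones: Sequence[int] = (7, 11), gamma: float = 0.1) -> float:
    """Step learning-rate schedule for the 12-epoch (1x) fine-tuning run.

    Assumption: the decay applies from the start of each milestone epoch,
    so the rate at `epoch` is base_lr * gamma ** (number of milestones reached).
    """
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** reached
```

Under this reading, the learning rate is 0.2 until epoch 7, roughly 0.02 from epoch 7 until epoch 11, and roughly 0.002 afterwards.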

Estimating mutual information by InfoNCE and EqCo with different batch sizes and various ground-truth mutual information values.
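One way to read the I_NCE rows of this table is against the standard ceiling of the InfoNCE estimator: with K negatives, the estimate cannot exceed log(K + 1), because the loss is non-negative. The sketch below compares that ceiling with the I_NCE values reported for a ground-truth mutual information of 10.0. It assumes the table is expressed in nats; the function and variable names are hypothetical.

```python
import math


def infonce_ceiling(num_negatives: int) -> float:
    """Upper limit log(K + 1) of the InfoNCE mutual-information estimate (in nats)."""
    return math.log(num_negatives + 1)


# I_NCE values reported in the table for ground-truth MI = 10.0.
reported_i_nce = {64: 4.1, 128: 4.7, 256: 5.4, 512: 6.0}

# Gap between the ceiling and the reported estimate for each K.
headroom = {k: infonce_ceiling(k) - v for k, v in reported_i_nce.items()}
```

Each reported value sits just below its ceiling (about 4.17 nats for K = 64 and about 6.24 nats for K = 512), which is consistent with the InfoNCE estimate saturating when the true mutual information is large relative to log(K + 1).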

