CONSISTENT INSTANCE CLASSIFICATION FOR UNSUPERVISED REPRESENTATION LEARNING

Abstract

In this paper, we address the problem of learning representations from images without human annotations. We study the instance classification solution, which regards each instance as a category, and improve the optimization and the feature quality. The proposed consistent instance classification (ConIC) approach simultaneously optimizes the classification loss and an additional consistency loss that explicitly penalizes the feature dissimilarity between augmented views of the same instance. The benefit of optimizing the consistency loss is that the learned features for augmented views of the same instance are more compact, so the classification loss optimization becomes easier, boosting the quality of the learned representations. This differs from InstDisc (Wu et al., 2018) and MoCo (He et al., 2019; Chen et al., 2020c), which use an estimated prototype as the classifier weight to ease the optimization. Different from SimCLR (Chen et al., 2020b), which directly compares different instances, our approach does not require a large batch size. Experimental results demonstrate competitive performance for linear evaluation and better performance than InstDisc, MoCo and SimCLR on downstream tasks, such as detection and segmentation, as well as competitive or superior performance compared to other methods with stronger training settings.

1. INTRODUCTION

Learning good representations from unlabeled images is a long-standing and challenging problem. The mainstream methods include: generative modeling (Hinton et al., 2006; Kingma & Welling, 2014), colorization (Zhang et al., 2016), transformation or spatial relation prediction (Doersch et al., 2015; Noroozi & Favaro, 2016; Gidaris et al., 2018), and discriminative methods, such as instance classification (Dosovitskiy et al., 2016; He et al., 2019) and contrastive learning (Chen et al., 2020b). The instance discrimination methods show promising performance for downstream tasks. There are two basic objectives that are optimized (He et al., 2019; Chen et al., 2020b; Yu et al., 2020; Wang & Isola, 2020): contraction and separation. Contraction means that the features of the augmented views from the same instance should be as close as possible. Separation means that the features of the augmented views from one instance should lie in a region different from other instances. The instance classification framework, such as InstDisc (Wu et al., 2018) and MoCo (He et al., 2019; Chen et al., 2020c), adopts a prototype-based classifier, where the prototype is estimated as the moving average of the corresponding features in previous epochs (Wu et al., 2018) or as the output of a moving-average network (He et al., 2019; Chen et al., 2020c). The prototype-based schemes ease the optimization of the classification loss in the challenging case where there are over one million categories. BYOL (Grill et al., 2020) computes the prototype in a way similar to MoCo, and only aligns the features of augmented views with their prototype, leaving the separation objective implicitly optimized. The prototype, computed from a single view rather than many views and from networks with different parameters, might not be reliable enough, so the quality of the contraction and separation optimization is not guaranteed.
The contrastive learning framework [foot_0], such as SimCLR (Chen et al., 2020b) and Ye et al. (2019), simultaneously maximizes the similarities between each view pair from the same instance and minimizes the similarities between view pairs from different instances. This framework directly compares the feature of one view to that of a different view rather than to a prototype, avoiding the unreliability of the prototype estimation. It, however, requires a large batch size in each SGD iteration so that enough negative instances are compared to impose the separation constraint [foot_2], increasing the difficulty of large-batch training. We propose a simple unsupervised representation learning approach, consistent instance classification (ConIC), to improve the optimization and the feature quality. Our approach jointly minimizes two losses: an instance classification loss and a consistency loss. The instance classification loss is formulated by regarding each instance as a category. Its optimization encourages different instances to lie in different regions. The consistency loss directly compares the features of the augmented views from the same instance and encourages high similarity between them. One benefit of the consistency optimization is to directly and explicitly make the features of the same instance compact and thus to accelerate the optimization of the classification loss. This differs from Wu et al. (2018) and He et al. (2019), which heuristically estimate the classifier weights using prototypes, and it does not suffer from the prototype estimation reliability issue. On the other hand, our approach does not rely on large-batch training, which is essential for SimCLR (Chen et al., 2020b), because the whole loss in our formulation can be decomposed as a sum of components, each of which depends on only one instance.
Furthermore, we observe that jointly optimizing the consistency and classification losses leads to representations that focus more on the textured regions, as shown in Figure 1. This implies that the learned representation is more capable of characterizing the objects, and thus potentially more helpful for downstream tasks like object detection and segmentation. We demonstrate the effectiveness of our approach in unsupervised representation learning on ImageNet. Our approach achieves competitive performance under the linear evaluation protocol. When fine-tuned on downstream tasks, such as object detection on VOC, object detection and instance segmentation on COCO, instance segmentation on Cityscapes and LVIS, as well as semantic segmentation on Cityscapes, COCO Stuff, ADE and VOC, our approach performs better than InstDisc, MoCo and SimCLR, and competitively with or better than other methods that use stronger training settings (e.g., InfoMin and SwAV).

2. RELATED WORK

Generative approaches. Generative models, such as auto-encoders (Hinton et al., 2006; Kingma & Welling, 2014; Vincent et al., 2008) , context encoders (Pathak et al., 2016) , GANs (Donahue & Simonyan, 2019) , and GPTs (Chen et al., 2020a) , learn an unsupervised representation by faithfully reconstructing the pixels. Later self-supervised models, such as colorization (Zhang et al., 2016) and split-brain encoders (Zhang et al., 2017) , improve generative models by withholding some part of the data and predicting it.

Spatial relation prediction.

The representation is learned by solving pretext tasks related to image patch spatial relation prediction, such as predicting the spatial relation between two patches sampled from an image, e.g., whether a patch is on the left of another patch (Doersch et al., 2015); solving Jigsaw Puzzles, i.e., determining the spatial configuration of the shuffled (typically 9) patches (Noroozi & Favaro, 2016); and predicting the rotation (Gidaris et al., 2018).

Instance classification. Exemplar-CNN (Dosovitskiy et al., 2016) regards the views formed by augmenting each instance as a class, and formulates an instance classification problem. InstDisc (Wu et al., 2018), MoCo (He et al., 2019), CMC (Tian et al., 2019) and PIRL (Misra & van der Maaten, 2019) generalize Exemplar-CNN by heuristically estimating the classifier weights using prototypes to ease the optimization. Our proposed approach follows the instance classification approach, and exploits an additional consistency loss to help the optimization.

Instance clustering. Rather than regarding each instance as a category, the instance clustering frameworks (Caron et al., 2018; 2019; 2020; Asano et al., 2020; Huang et al., 2019; Xie et al., 2016; Yan et al., 2020; Yang et al., 2016) learn representations in which the instances are well clustered. DeepCluster (Caron et al., 2018) adopts k-means clustering while simultaneously optimizing the network parameters, and uses the k-means assignments as pseudo-labels to learn representations. SwAV (Caron et al., 2020) simultaneously clusters the data while enforcing consistency between cluster assignments for different views of the same instance.

Contrastive learning. Contrastive predictive coding (van den Oord et al., 2018; Hénaff et al., 2019) predicts the representations of patches below a certain position from those above it by optimizing a contrastive loss.
DIM (Hjelm et al., 2019) and AMDIM (Bachman et al., 2019) achieve global-to-local/local-to-neighbor patch representation prediction (overlapping) across augmented views using the contrastive loss. The contrastive learning framework (Ye et al., 2019; Chen et al., 2020b) formulates a contrastive loss encouraging high similarity between the augmented views from the same instance, and low similarity between the instance and other instances. Wang & Isola (2020) present a formulation based on two measures, alignment and uniformity, and show that it is an alternative to the contrastive loss. Yu et al. (2020) connect contractive and contrastive learning, cross-entropy, and so on, and provide theoretical guarantees for learning diverse and discriminative features.

Consistency in semi-supervised learning. Consistency regularization, enforcing the similarity between the predictions or features of different views of the same unlabeled instance, has been widely applied in semi-supervised learning, such as the Π Model (Laine & Aila, 2017), Temporal Ensembling (Laine & Aila, 2017), and Mean Teacher (Tarvainen & Valpola, 2017). We exploit the consistency loss to help optimize the classification loss for unsupervised representation learning.

3. APPROACH

Given a set of image instances without any labels, $\mathcal{I} = \{I_1, I_2, \ldots, I_N\}$, the goal is to learn a feature extractor (a neural network) $\mathbf{x} = f(I)$. The discrimination approach expands each image $I_n$ to a set of augmented views $\{I_n^1, I_n^2, \ldots, I_n^K\}$, and formulates the problem in a way that the features of the augmented views of each instance are similar (contraction) and the features of different instances are distributed separately (separation). In the following, we first review three related instance classification methods; then we introduce our approach and present the analysis.

3.1. INSTANCE CLASSIFICATION

Exemplar-CNN. Exemplar-CNN (Dosovitskiy et al., 2016) formulates unsupervised representation learning as an instance classification problem. The augmented views from one instance are regarded as one category, and the augmented views from different instances are regarded as different categories. The softmax loss for the $k$th view of the $n$th instance is written as
$$\ell_s(\mathbf{x}_n^k) = -\log \frac{\exp(\mathbf{w}_n^\top \bar{\mathbf{x}}_n^k / \tau)}{\sum_{j=1}^{N} \exp(\mathbf{w}_j^\top \bar{\mathbf{x}}_n^k / \tau)}, \qquad (1)$$
where $\tau$ is the temperature. Exemplar-CNN uses the standard backpropagation algorithm to learn the network $f(\cdot)$ and the classification weights $\{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N\}$.

InstDisc. The InstDisc approach (Wu et al., 2018) optimizes the network parameters, and heuristically estimates the classifier weights $\{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N\}$ in each epoch using a feature moving-average scheme, i.e., it computes the exponential moving averages of the features of the corresponding instances (stored in a memory bank) over the previous epochs. The heuristic weight estimation scheme eases the network optimization.

MoCo. MoCo (He et al., 2019) instead adopts a network moving-average scheme. In each SGD iteration, MoCo updates a momentum network whose parameters are a moving average of the previous network parameters. It computes the features from the momentum network as the classifier weights, which are maintained in a queue. This leads to better classifier weight estimates.
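To make the instance-classification loss concrete, here is a minimal NumPy sketch of the per-view softmax loss above (the function name, toy shapes, and default temperature are our own choices, not the paper's code; features and weights are l2-normalized, as the approaches above all do):

```python
import numpy as np

def instance_cls_loss(x, W, n, tau=0.1):
    """Softmax instance-classification loss for one view.

    x : (d,) feature of a view of instance n
    W : (N, d) classifier weights, one row per instance
    n : index of the instance the view belongs to
    """
    x = x / np.linalg.norm(x)                          # l2-normalize the feature
    W = W / np.linalg.norm(W, axis=1, keepdims=True)   # l2-normalize the weights
    logits = W @ x / tau                               # similarity to every instance weight
    logits -= logits.max()                             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[n]
```

A view whose feature aligns with its own classifier weight incurs a lower loss than one assigned to a different instance's weight.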

3.2. CONSISTENT INSTANCE CLASSIFICATION

We introduce a consistency loss to explicitly penalize the dissimilarity between augmented views from the same instance. Let $\mathrm{sim}(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top \mathbf{v} / (\|\mathbf{u}\| \|\mathbf{v}\|)$ denote the cosine similarity, i.e., the inner product between the $\ell_2$-normalized $\mathbf{u}$ and $\mathbf{v}$. The consistency loss for two views $\mathbf{x}_n^i$ and $\mathbf{x}_n^j$ from the image $I_n$ is formed as
$$\ell_c(\mathbf{x}_n^i, \mathbf{x}_n^j) = (1 - \mathrm{sim}(\mathbf{x}_n^i, \mathbf{x}_n^j))^2 = (1 - \bar{\mathbf{x}}_n^{i\top} \bar{\mathbf{x}}_n^j)^2. \qquad (2)$$
Here, we normalize the feature vector, $\bar{\mathbf{x}} = \mathbf{x} / \|\mathbf{x}\|_2$, as done in InstDisc and MoCo. The consistency loss for the $N$ images, each with $K$ augmented views, is written as
$$L_c = \sum_{n=1}^{N} \sum_{\substack{i,j=1 \\ i \neq j}}^{K} \ell_c(\mathbf{x}_n^i, \mathbf{x}_n^j) = \sum_{n=1}^{N} \sum_{\substack{i,j=1 \\ i \neq j}}^{K} (1 - \bar{\mathbf{x}}_n^{i\top} \bar{\mathbf{x}}_n^j)^2. \qquad (3)$$
The classification loss for the $N$ images, each with $K$ augmented views, is written as
$$L_s = \sum_{n=1}^{N} \sum_{k=1}^{K} \ell_s(\mathbf{x}_n^k) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \log \frac{\exp(\mathbf{w}_n^\top \bar{\mathbf{x}}_n^k / \tau)}{\sum_{j=1}^{N} \exp(\mathbf{w}_j^\top \bar{\mathbf{x}}_n^k / \tau)}, \qquad (4)$$
where we let each classifier weight be an $\ell_2$-normalized vector, $\|\mathbf{w}\|_2 = 1$, which is similar to normalizing the prototype vector as done in InstDisc and MoCo. We combine the two losses together,
$$L = L_s + \alpha L_c, \qquad (5)$$
where $\alpha$ is a weight for the consistency loss to balance the two losses, avoiding over-optimizing the consistency loss or merely optimizing the classification loss.

Consistency helps optimize the classification loss. In general, when the features for each class are more compact, different classes are more easily separated and the softmax classification is more efficiently optimized. Our approach has the benefit that the feature distribution for the same instance is compact and the distributions for different instances are well separated, because of maximizing the consistency. Figure 2 (a) illustrates the benefit of simultaneously optimizing the consistency loss and the classification loss. Let us see how the consistency term makes the gradient of the classifier weight more effective. The gradient of the classification loss $\ell_s(\mathbf{x}_n^k)$ with respect to the classifier weight $\mathbf{w}_n$ is
$$\frac{\partial \ell_s(\mathbf{x}_n^k)}{\partial \mathbf{w}_n} = (P_{nn}^k - 1)\, \bar{\mathbf{x}}_n^k, \quad \text{where} \quad P_{nn}^k = \frac{\exp(\mathbf{w}_n^\top \bar{\mathbf{x}}_n^k / \tau)}{\sum_{r=1}^{N} \exp(\mathbf{w}_r^\top \bar{\mathbf{x}}_n^k / \tau)}.$$
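The two losses and their weighted combination above can be sketched in NumPy as follows (a hedged sketch with toy shapes of our choosing; the function name `conic_loss` is ours, not the authors' implementation):

```python
import numpy as np

def conic_loss(X, W, alpha=2.5, tau=0.1):
    """Joint ConIC-style objective on a batch.

    X : (N, K, d) K augmented views per instance
    W : (N, d)    one classifier weight per instance
    Returns the classification term plus alpha times the consistency term.
    """
    N, K, d = X.shape
    Xn = X / np.linalg.norm(X, axis=2, keepdims=True)   # l2-normalize features
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # l2-normalize weights
    # classification loss over every view
    logits = np.einsum('nkd,md->nkm', Xn, Wn) / tau     # (N, K, N)
    logits -= logits.max(axis=2, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=2, keepdims=True))
    L_s = -sum(log_probs[n, k, n] for n in range(N) for k in range(K))
    # consistency loss over ordered view pairs (i != j)
    sims = np.einsum('nid,njd->nij', Xn, Xn)            # (N, K, K)
    mask = ~np.eye(K, dtype=bool)
    L_c = ((1.0 - sims[:, mask]) ** 2).sum()
    return L_s + alpha * L_c
```

When all views of an instance coincide, the consistency term vanishes and the objective reduces to the pure classification loss, whatever the value of alpha.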
The gradient from the two views $\mathbf{x}_n^i$ and $\mathbf{x}_n^j$ is
$$\mathbf{g}_w = \frac{\partial \ell_s(\mathbf{x}_n^i)}{\partial \mathbf{w}_n} + \frac{\partial \ell_s(\mathbf{x}_n^j)}{\partial \mathbf{w}_n} = (P_{nn}^i - 1)\, \bar{\mathbf{x}}_n^i + (P_{nn}^j - 1)\, \bar{\mathbf{x}}_n^j.$$
According to the law of cosines and $\|\bar{\mathbf{x}}_n^i\|_2 = \|\bar{\mathbf{x}}_n^j\|_2 = 1$, we have
$$\|\mathbf{g}_w\|_2^2 = (P_{nn}^i - 1)^2 + (P_{nn}^j - 1)^2 + 2\,|P_{nn}^i - 1|\,|P_{nn}^j - 1|\; \bar{\mathbf{x}}_n^{i\top} \bar{\mathbf{x}}_n^j.$$
When the consistency term is included, $\bar{\mathbf{x}}_n^i$ and $\bar{\mathbf{x}}_n^j$ are very close, implying that $\bar{\mathbf{x}}_n^{i\top} \bar{\mathbf{x}}_n^j$ is larger. In the case that $P_{nn}^i$ and $P_{nn}^j$ are unchanged, the magnitude $\|\mathbf{g}_w\|_2$ is larger, and accordingly the classifier weight $\mathbf{w}_n$ is updated effectively and quickly. In contrast, when the consistency term is not included, $\bar{\mathbf{x}}_n^i$ and $\bar{\mathbf{x}}_n^j$ might be very diverse (see the discussion in "Optimizing the classification loss does not directly optimize the consistency loss"). This results in a smaller $\|\mathbf{g}_w\|_2$, and thus the classifier weight $\mathbf{w}_n$ is updated less effectively and less quickly. Figure 3 shows the final classification loss $L_s$ and consistency loss $L_c$ values with different consistency loss weights. We can see that increasing the consistency weight up to 2.5 helps optimize the classification loss, while increasing it beyond 2.5 harms the classification loss optimization. In Appendix B, we discuss the reason: over-weighting the consistency loss can lead to a trivial solution. In our experiments, we set $\alpha$ to 2.5, at which the training classification loss is minimized.

Optimizing the classification loss does not directly optimize the consistency loss. Optimizing the classification loss intuitively expects each instance to lie in a different region. We expect the augmented views of an instance $\mathbf{x}_n$ to be assigned to the $n$th region and to be compactly distributed. We find that merely optimizing the classification loss $L_s$ does not easily make the features of the augmented views of the same instance contractive; consequently, the features are not compactly distributed.
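The squared-magnitude identity for the two-view gradient is easy to check numerically. In the sketch below, the probabilities $P_{nn}^i = 0.3$ and $P_{nn}^j = 0.6$ are arbitrary values of our choosing (both below 1, as softmax probabilities are), and the features are random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.normal(size=4); xi /= np.linalg.norm(xi)
xj = rng.normal(size=4); xj /= np.linalg.norm(xj)
Pi, Pj = 0.3, 0.6                      # softmax probabilities, both < 1

g = (Pi - 1) * xi + (Pj - 1) * xj      # gradient w.r.t. w_n from the two views
lhs = np.dot(g, g)
# both coefficients are negative, so their product equals |Pi-1||Pj-1|
rhs = (Pi - 1) ** 2 + (Pj - 1) ** 2 + 2 * abs(Pi - 1) * abs(Pj - 1) * np.dot(xi, xj)
assert np.isclose(lhs, rhs)
```

Since the cross term scales with the cosine between the two views, pulling the views together (the consistency loss) directly enlarges the gradient magnitude.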
The reason is that a larger similarity between augmented views is not explicitly encouraged, but only implicitly imposed through the classifier weight. The angle between $\bar{\mathbf{x}}_n^i$ and $\bar{\mathbf{x}}_n^j$, $\theta(\bar{\mathbf{x}}_n^i, \bar{\mathbf{x}}_n^j)$ (reflecting the similarity between $\bar{\mathbf{x}}_n^i$ and $\bar{\mathbf{x}}_n^j$), is upper-bounded:
$$\theta(\bar{\mathbf{x}}_n^i, \bar{\mathbf{x}}_n^j) \leq \theta(\bar{\mathbf{x}}_n^i, \mathbf{w}_n) + \theta(\mathbf{w}_n, \bar{\mathbf{x}}_n^j).$$
When minimizing the classification loss $L_s$ given $\mathbf{w}_n$, it is possible that the numerators (e.g., $\mathbf{w}_n^\top \bar{\mathbf{x}}_n^i$ and $\mathbf{w}_n^\top \bar{\mathbf{x}}_n^j$) become larger and accordingly the upper bound $\theta(\bar{\mathbf{x}}_n^i, \mathbf{w}_n) + \theta(\mathbf{w}_n, \bar{\mathbf{x}}_n^j)$ becomes smaller. However, there exist many transformations $R$ that keep the upper bound the same: $\mathbf{w}_n^\top (R \bar{\mathbf{x}}_n^i) = \mathbf{w}_n^\top \bar{\mathbf{x}}_n^i$ and $\theta(R \bar{\mathbf{x}}_n^i, \mathbf{w}_n) = \theta(\bar{\mathbf{x}}_n^i, \mathbf{w}_n)$. In this case, $\theta(R \bar{\mathbf{x}}_n^i, \bar{\mathbf{x}}_n^j)$ is likely to be very different from $\theta(\bar{\mathbf{x}}_n^i, \bar{\mathbf{x}}_n^j)$. This implies that there is still a gap between optimizing the upper bound $\theta(\bar{\mathbf{x}}_n^i, \mathbf{w}_n) + \theta(\mathbf{w}_n, \bar{\mathbf{x}}_n^j)$ and directly optimizing $\theta(\bar{\mathbf{x}}_n^i, \bar{\mathbf{x}}_n^j)$. As a result, merely optimizing the classification loss does not easily make the features for one instance compactly distributed in the corresponding region.

Batch size. We present a rough analysis showing that instance classification, including our approach, MoCo, and InstDisc, does not require a large batch (see He et al. (2019) and the empirical validation in Figure 4 for our approach). We rewrite the loss function in Equation 5 as
$$L = \sum_{n=1}^{N} \Big[ \alpha \sum_{\substack{i,j=1 \\ i \neq j}}^{K} (1 - \mathrm{sim}(\mathbf{x}_n^i, \mathbf{x}_n^j))^2 - \sum_{k=1}^{K} \log \frac{\exp(\mathbf{w}_n^\top \bar{\mathbf{x}}_n^k / \tau)}{\sum_{j=1}^{N} \exp(\mathbf{w}_j^\top \bar{\mathbf{x}}_n^k / \tau)} \Big].$$
The reformulation indicates that the loss can be decomposed into a sum of components, each of which depends on a different instance. The separation between instances is achieved through the classifier weights, each of which encodes the information of the corresponding instance. This decomposability means that optimizing $L$ with SGD behaves similarly to the standard classification problem with SGD: a large batch size is not necessary. Figure 4 shows that the performances with batch sizes 256, 512 and 1024 are similar.
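To make the decomposability concrete, the following NumPy sketch (function name and toy sizes are ours) writes the objective as a literal sum of per-instance components, so splitting the instances across mini-batches leaves the objective unchanged:

```python
import numpy as np

def per_instance_component(Xn, W, n, alpha=2.5, tau=0.1):
    """The n-th term of the decomposed objective: the consistency loss over
    the K views of instance n plus its classification loss. Coupling with
    other instances enters only through the weight matrix W."""
    Xn = Xn / np.linalg.norm(Xn, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    K = len(Xn)
    sims = Xn @ Xn.T
    mask = ~np.eye(K, dtype=bool)
    l_c = ((1.0 - sims[mask]) ** 2).sum()
    logits = Xn @ Wn.T / tau
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return alpha * l_c - log_probs[:, n].sum()

rng = np.random.default_rng(0)
N, K, d = 6, 2, 8
X = rng.normal(size=(N, K, d))
W = rng.normal(size=(N, d))

# the total objective is literally a sum over instances
total = sum(per_instance_component(X[n], W, n) for n in range(N))
chunks = (sum(per_instance_component(X[n], W, n) for n in range(3))
          + sum(per_instance_component(X[n], W, n) for n in range(3, N)))
assert np.isclose(total, chunks)
```

Because each component touches only one instance's features, mini-batch gradients are unbiased estimates of the full gradient, exactly as in ordinary supervised classification.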
In contrast, the contrastive loss over all the $N$ instances in SimCLR is
$$L = -\sum_{n=1}^{N} \sum_{\substack{i,j=1 \\ i \neq j}}^{K} \log \frac{\exp(\mathrm{sim}(\mathbf{x}_n^i, \mathbf{x}_n^j) / \tau)}{\sum_{m=1}^{N} \sum_{k=1}^{K} \mathbb{1}[(m,k) \neq (n,i)] \exp(\mathrm{sim}(\mathbf{x}_n^i, \mathbf{x}_m^k) / \tau)}.$$
This loss cannot be decomposed into a sum of components each of which depends on a different instance, which is the form assumed by standard SGD: each instance depends on the other instances through the denominator. We believe that this is the reason why SimCLR needs a large batch size (Chen et al., 2020b).
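For contrast, here is a hedged sketch of a SimCLR-style NT-Xent loss in its standard two-views-per-instance form (not the exact SimCLR implementation; names and defaults are ours). Note how every anchor's denominator sums over all other views in the batch, which is exactly the coupling discussed above:

```python
import numpy as np

def nt_xent(X, tau=0.5):
    """NT-Xent / SimCLR-style loss for N instances with 2 views each.

    X : (2N, d) rows ordered as adjacent view pairs: rows 2n and 2n+1
        are the two views of instance n. The denominator for each anchor
        sums over all other 2N-1 rows, so the loss does not split into
        independent per-instance terms.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T / tau
    np.fill_diagonal(S, -np.inf)       # exclude self-similarity
    log_den = np.log(np.exp(S).sum(axis=1))
    pos = np.arange(len(X)) ^ 1        # index of the positive (pair partner)
    return float(np.mean(log_den - S[np.arange(len(X)), pos]))
```

A batch whose view pairs coincide scores a lower loss than a batch of unrelated rows, since each anchor's positive similarity is then maximal.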

4. IMPLEMENTATION DETAILS

Data augmentation. We adopt an augmentation scheme similar to SimCLR (Chen et al., 2020b). We randomly crop the input image with the crop scale (0.15, 1) and resize it to 224 × 224. Then we apply random horizontal flipping, color jittering, grayscale, and Gaussian blur.

Network architecture. We use ResNet-50 (He et al., 2016) to extract features. Following SimCLR, we adopt the same projection head consisting of a two-layer batch-normalized MLP (Linear → BN → ReLU → Linear → BN), reducing the feature dimension from 2048 to 128 in pretraining.

Training. We use the SGD algorithm with momentum. We set the momentum parameter to 0.9, the weight decay to 1e-4, the batch size to 512, and the number of epochs to 200. We adopt the cosine learning rate schedule, with an initial learning rate of 0.06. Each instance in the current mini-batch is augmented into two views during training. The temperature $\tau$ is set to 0.1. We use SyncBN. For the ablation study, we train all the models for 100 epochs. The training is performed on 8 NVIDIA V100 GPUs. We use the PyTorch 1.3 platform (Paszke et al., 2019).

Sampling classifier weight update. The analysis is based on the standard SGD algorithm. For clarity, we assume each iteration samples one instance with two augmented views; the analysis can be easily extended to sampling more instances with more augmented views. The loss function becomes
$$L = 2\alpha (1 - \mathrm{sim}(\mathbf{x}_n^1, \mathbf{x}_n^2))^2 - \sum_{k=1}^{2} \log \frac{\exp(\mathbf{w}_n^\top \bar{\mathbf{x}}_n^k / \tau)}{\sum_{j=1}^{N} \exp(\mathbf{w}_j^\top \bar{\mathbf{x}}_n^k / \tau)}.$$
The denominator in the second term on the right-hand side, $\sum_{j=1}^{N} \exp(\mathbf{w}_j^\top \bar{\mathbf{x}}_n^k / \tau)$, is a summation of $N$ elements, and thus its complexity is $\Theta(N)$. We propose to approximate it by summing fewer elements ($N' = 65536$, the same as the queue size in MoCo (He et al., 2019)):
$$\exp(\mathbf{w}_n^\top \bar{\mathbf{x}}_n^k / \tau) + \beta \sum_{j=1}^{N'} \exp(\mathbf{w}_{s_j}^\top \bar{\mathbf{x}}_n^k / \tau),$$
where we let the sampling compensation weight be $\beta = \frac{N-1}{N'}$ for a better approximation (see Appendix A.1). This approximation reduces the forward loss computation complexity to $\Theta(N')$.
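The compensated, sampled denominator can be sketched as follows (a hedged sketch; the helper name and toy shapes are ours). When all N−1 negatives happen to be sampled, β = 1 and the approximation is exact:

```python
import numpy as np

def sampled_denominator(x, W, n, sample_idx, tau=0.1):
    """Approximate sum_j exp(w_j^T x / tau) with N' sampled negatives.

    The positive weight w_n is always kept; the sampled negatives are
    re-weighted by beta = (N - 1) / N' to keep the expectation unchanged.
    Assumes x and the rows of W are already l2-normalized.
    """
    N = len(W)
    Np = len(sample_idx)
    beta = (N - 1) / Np
    pos = np.exp(W[n] @ x / tau)
    neg = beta * np.exp(W[sample_idx] @ x / tau).sum()
    return pos + neg
```

The compute cost per view drops from N exponentials to N' + 1, independent of the total number of instances.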
A normal iteration would still update all the $N$ classifier weights, implying that the complexity remains $\Theta(N)$. Fortunately, through the derivation in Appendix A.2, we find that we do not need to compute the gradients and update the classifier weights corresponding to the instances that are not sampled; we can delay the update to the iteration in which those instances are sampled again. In other words, at each iteration we only compute the gradients and update the classifier weights corresponding to the sampled instances. Consequently, the gradient computation and the classifier weight update, and accordingly each iteration, take $\Theta(N')$ time. In our implementation, rather than sampling all the $N$ classifier weights at each iteration, we only use the classifier weights corresponding to the (e.g., 512) instances in a mini-batch to replace the classifier weights that were sampled earliest. The potential benefit is reduced I/O cost if we store all the weights on disk or in CPU memory and only keep the sampled weights in GPU memory, which is practically valuable for very large-scale cases (e.g., 1B or more images).
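The replace-the-earliest-sampled policy, with unsampled weights parked in slower storage, can be sketched as a small cache. This is an entirely hypothetical illustration (class and method names are ours): `store` stands in for the disk/CPU tier and a NumPy array stands in for GPU memory:

```python
from collections import OrderedDict
import numpy as np

class SampledWeightCache:
    """Keep only recently sampled classifier weights resident; evict the
    earliest-sampled ones (writing them back to the slow tier) when the
    capacity is exceeded. A sketch of the policy, not real GPU code."""

    def __init__(self, store, capacity):
        self.store = store               # full (N, d) array: the slow tier
        self.capacity = capacity
        self.resident = OrderedDict()    # instance id -> weight vector

    def fetch(self, ids):
        for i in ids:
            if i in self.resident:
                self.resident.move_to_end(i)     # refresh recency on a hit
            else:
                self.resident[i] = self.store[i].copy()
                if len(self.resident) > self.capacity:
                    old, w = self.resident.popitem(last=False)
                    self.store[old] = w          # write evicted weight back
        return np.stack([self.resident[i] for i in ids])

    def update(self, ids, new_weights):
        # assumes ids were fetched this iteration and are still resident
        for i, w in zip(ids, new_weights):
            self.resident[i] = w
```

Only the weights touched by the current mini-batch ever occupy fast memory; everything else is updated lazily when it is next sampled.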

5. EXPERIMENTS

We conduct the evaluation by training the models on ImageNet (Deng et al., 2009) without using the labels. We follow He et al. (2019) and adopt two protocols: linear evaluation on ImageNet, and downstream task performance with fine-tuning.

Ablation study: consistency. Figure 3 shows how the consistency weight parameter α influences the classification loss. As expected, the classification loss decreases as α increases up to 2.5 (the value we choose in our experiments), and then increases. The results on the downstream tasks and linear evaluation shown in Table 1 indicate consistent observations: the overall performance with α = 2.5 is satisfactory, and better than the performance without consistency (α = 0).

Ablation study: sampling classifier weight update. We evaluate how the sampling classifier weight update and the sampling compensation affect the performance. Figure 5 indicates that with sampling compensation the results of the sampling scheme are overall similar to the results without sampling, and better than those without sampling compensation (β = 1).

Comparison with the state of the art. We compare our approach, consistent instance classification (ConIC), to recent state-of-the-art solutions: Exemplar-CNN, InstDisc, PIRL, MoCo v1, MoCo v2, AlignUniform, SwAV, and InfoMin [foot_3]. The pretrained models of MoCo v1/v2, AlignUniform, SwAV, and InfoMin are obtained from the GitHub repositories provided by the corresponding authors. The PIRL pretrained model is obtained from PyContrast [foot_4]. We implement Exemplar-CNN and InstDisc using the same setup as ours, including ℓ2 normalization and data augmentation. The comparison to these methods is fair, as the models are pretrained with almost the same setting: the number of epochs is 200 for all, the data augmentation is almost the same, the backbone is the same, and each instance is augmented into two views.
We fine-tune all the models using the same setting for the downstream tasks. The results for downstream tasks are given in Table 2. The overall performance of our approach (ConIC) is the best, and the overall performance of our approach with the sampling classifier weight update (ConIC w/ sampling) is the second best. In contrast, the best one among the previous methods, AlignUniform (Wang & Isola, 2020), performs satisfactorily for most tasks but unsatisfactorily for the segmentation tasks on Cityscapes, ADE20k, and Pascal-VOC. The superiority of our approach shows that minimizing the consistency loss improves the capability of characterizing the objects and the feature transferability. In addition, we report the results of three approaches with a stronger setup: InfoMin, SwAV, and SimCLR, using the pretrained models provided by the authors. (1) InfoMin performs similarly to our approach, but it adopts a stronger augmentation, RandAugment (Cubuk et al., 2020), that is learned from supervised learning. (2) SimCLR (1000 epochs) performs inconsistently and surprisingly poorly [foot_5]. (3) SwAV performs much better than our ConIC for COCO detection, and much worse for VOC detection, DensePose, and semantic segmentation on Cityscapes, COCO stuff, Pascal-VOC, and Pascal-Context. Linear evaluation results on ImageNet are given in Table 3. Our approach performs competitively in comparison to MoCo v2, PIC, and PCL v2, whose training setups are similar to ours. Other approaches, e.g., InfoMin, SwAV, and BYOL, which train the models using stronger augmentation, more views, and more epochs, respectively, obtain higher performance. See more analysis in Appendix F.

6. CONCLUSION

We exploit the consistency loss minimization to help the optimization of the instance classification loss. The benefits include: the representations of different views of the same instance are more compact; the representations of different instances are more separable; and the representations better characterize the textured regions in an image. These lead to high capability on downstream tasks like object detection and segmentation.

A SAMPLING CLASSIFIER WEIGHT UPDATE

A.1 THE CHOICE OF β

We approximate the negative part of the denominator of the classification loss by a compensated sampled sum. Assume that the classifier weights corresponding to the $(N-1)$ negative instances, $\{\mathbf{w}_1, \ldots, \mathbf{w}_{n-1}, \mathbf{w}_{n+1}, \ldots, \mathbf{w}_N\}$, are i.i.d. We want the expectations of the sampled term and the full term to be the same:
$$\mathbb{E}\Big[\beta \sum_{j=1}^{N'} \exp(\mathbf{w}_{s_j}^\top \bar{\mathbf{x}}_n^k / \tau)\Big] = \mathbb{E}\Big[\sum_{j=1, j \neq n}^{N} \exp(\mathbf{w}_j^\top \bar{\mathbf{x}}_n^k / \tau)\Big].$$
The term on the left-hand side is $\beta N' \, \mathbb{E}[\exp(\mathbf{w}^\top \bar{\mathbf{x}}_n^k / \tau)]$, where $\mathbf{w}$ has the same distribution as $\mathbf{w}_j$ ($j \neq n$). Similarly, the term on the right-hand side becomes $(N-1)\, \mathbb{E}[\exp(\mathbf{w}^\top \bar{\mathbf{x}}_n^k / \tau)]$. Thus, we have $\beta = \frac{N-1}{N'}$. During the SGD iterations, the i.i.d. assumption does not hold, but our experiments show that the choice $\beta = \frac{N-1}{N'}$ improves the performance over $\beta = 1$. We think that tuning β manually might lead to superior performance.
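A quick Monte-Carlo check of the expectation-matching argument (toy sizes and temperature of our choosing; weights drawn i.i.d. on the unit sphere, as the assumption requires):

```python
import numpy as np

# With i.i.d. unit-norm classifier weights, the compensated sampled sum
# beta * sum_{j=1}^{N'} exp(w_{s_j}^T x / tau), beta = (N-1)/N', should
# match the full sum over the N-1 negatives in expectation.
rng = np.random.default_rng(0)
N, Np, d, tau, trials = 200, 20, 8, 0.2, 1000
x = rng.normal(size=d)
x /= np.linalg.norm(x)

full_sums, sampled_sums = [], []
for _ in range(trials):
    W = rng.normal(size=(N - 1, d))                  # the N-1 negative weights
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    terms = np.exp(W @ x / tau)
    full_sums.append(terms.sum())
    idx = rng.choice(N - 1, size=Np, replace=False)  # N' sampled negatives
    sampled_sums.append((N - 1) / Np * terms[idx].sum())

rel_err = abs(np.mean(full_sums) - np.mean(sampled_sums)) / np.mean(full_sums)
assert rel_err < 0.08
```

The compensated estimate is unbiased even conditioned on a fixed draw of the weights, which is why it behaves well despite the i.i.d. assumption failing during actual training.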

A.2 DELAY UPDATE OF THE UNSAMPLED CLASSIFIER WEIGHTS

Consider a classifier weight $\mathbf{w}$ that is sampled at the $s$th iteration and at the $(s+k)$th iteration, and is not sampled from the $(s+1)$th iteration to the $(s+k-1)$th iteration. During the unsampled iterations, (1) the gradient of the loss $L$ with respect to $\mathbf{w}$ is zero, $\frac{\partial L}{\partial \mathbf{w}} = \mathbf{0}$; and (2) the gradient of the $\ell_2$ regularizer $R = \frac{\lambda}{2} \|\mathbf{w}\|_2^2$ is $\frac{\partial R}{\partial \mathbf{w}} = \lambda \mathbf{w}$; thus the gradient becomes $\mathbf{g}_w^{(s)} = \lambda \mathbf{w}$. The update equation of SGD with momentum becomes
$$\begin{bmatrix} \mathbf{v}_w^{(s+1)} \\ \mathbf{w}^{(s+1)} \end{bmatrix} = \begin{bmatrix} m & \lambda \\ -\eta^{(s+1)} m & 1 - \eta^{(s+1)} \lambda \end{bmatrix} \begin{bmatrix} \mathbf{v}_w^{(s)} \\ \mathbf{w}^{(s)} \end{bmatrix},$$
from which we get
$$\begin{bmatrix} \mathbf{v}_w^{(s+k)} \\ \mathbf{w}^{(s+k)} \end{bmatrix} = \begin{bmatrix} m & \lambda \\ -\eta^{(s+k)} m & 1 - \eta^{(s+k)} \lambda \end{bmatrix} \cdots \begin{bmatrix} m & \lambda \\ -\eta^{(s+1)} m & 1 - \eta^{(s+1)} \lambda \end{bmatrix} \begin{bmatrix} \mathbf{v}_w^{(s)} \\ \mathbf{w}^{(s)} \end{bmatrix}.$$
This means that we do not need to actually compute $\mathbf{w}$ at the iterations in which it is not sampled; we only need to update it at the iteration in which it is sampled again. In addition, we observe that an unsampled $\mathbf{w}$ is updated independently and does not influence the update of the other classifier weights. Consequently, we can safely delay the update of the classifier weights that are not sampled to the iteration in which the weight is sampled again.
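The collapse of k unsampled iterations into a single matrix product can be verified numerically. The sketch below assumes a constant learning rate between the two sampled iterations for simplicity (the derivation above allows it to vary), and scalar v and w stand in for the vector quantities:

```python
import numpy as np

def momentum_step(v, w, m, lam, eta):
    """One SGD-with-momentum step whose only gradient is the weight-decay
    term lam * w (i.e., the weight was not sampled this iteration)."""
    v = m * v + lam * w
    w = w - eta * v
    return v, w

m, lam, eta, k = 0.9, 1e-4, 0.05, 7
v0, w0 = 0.3, 1.2

# step-by-step updates over k unsampled iterations
v, w = v0, w0
for _ in range(k):
    v, w = momentum_step(v, w, m, lam, eta)

# collapsed update: the k-th power of the 2x2 transition matrix
A = np.array([[m, lam], [-eta * m, 1 - eta * lam]])
v_fast, w_fast = np.linalg.matrix_power(A, k) @ np.array([v0, w0])

assert np.allclose([v, w], [v_fast, w_fast])
```

With varying learning rates, the matrix power becomes an ordered product of per-iteration matrices, as in the equation above, but the delayed-update conclusion is unchanged.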

B MORE ANALYSIS

The gradient of the consistency term $\ell_c$ in Equation 2 with respect to $\bar{\mathbf{x}}_n^i$ is
$$\frac{\partial \ell_c}{\partial \bar{\mathbf{x}}_n^i} = -2\, (1 - \bar{\mathbf{x}}_n^{i\top} \bar{\mathbf{x}}_n^j)\, \bar{\mathbf{x}}_n^j.$$
In the hard-sample case, the similarity $\bar{\mathbf{x}}_n^{i\top} \bar{\mathbf{x}}_n^j$ is small and far from 1, so $(1 - \bar{\mathbf{x}}_n^{i\top} \bar{\mathbf{x}}_n^j)$ is larger, implying that the gradient magnitude is larger. This means a larger contribution to the gradient. In the easy-sample case, the contribution is smaller.
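This hard-versus-easy behavior is easy to demonstrate with two hand-picked unit-vector pairs (toy values of our choosing):

```python
import numpy as np

def consistency_grad(xi, xj):
    """Gradient of (1 - xi^T xj)^2 w.r.t. xi, for unit-norm features."""
    return -2 * (1 - xi @ xj) * xj

hard_pair = (np.array([1.0, 0.0]), np.array([0.0, 1.0]))    # orthogonal views
easy_pair = (np.array([1.0, 0.0]), np.array([0.96, 0.28]))  # nearly aligned views

g_hard = np.linalg.norm(consistency_grad(*hard_pair))
g_easy = np.linalg.norm(consistency_grad(*easy_pair))
assert g_hard > g_easy
```

The orthogonal (hard) pair yields a gradient magnitude of 2, while the nearly aligned (easy) pair yields only 0.08, so hard pairs dominate the update.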

C EVALUATION SETUP C.1 EVALUATION ON DOWNSTREAM TASKS

We perform the object detection, COCO keypoint detection, COCO DensePose estimation and instance segmentation experiments with the Detectron2 (Wu et al., 2019) framework.

Object detection. We perform object detection on the Pascal VOC (Everingham et al., 2010) and COCO (Lin et al., 2014) datasets. For Pascal VOC, we use Faster-RCNN (Ren et al., 2017) with the R50-C4 backbone as the detector. Following He et al. (2019), extra BNs are added in the newly initialized layers. We fine-tune all layers (including BN layers) in the object detection experiments. The initial learning rate is 0.02. Two training schemes are adopted: (i) the model is trained on the train2007 set for 9k iterations, with learning rate decay at the 6k and 8k iterations; (ii) the model is trained on the trainval07+12 set for 24k iterations, with learning rate decay at the 18k and 22k iterations. We report AP^bb_50 and the standard COCO-style AP^bb. For COCO object detection, we use Mask-RCNN (He et al., 2017) with the R50-FPN (Lin et al., 2017) backbone as the detector. SyncBN is adopted in the backbone, FPN and ROI heads. The model is fine-tuned on the train2017 set and evaluated on the val2017 set. We use the standard 1× and 2× fine-tuning schedules. The standard COCO-style bounding box AP^bb and mask AP^mk are reported.

COCO keypoint detection. We perform human pose estimation on the COCO keypoint (Lin et al., 2014) dataset. We use Mask-RCNN (He et al., 2017) (keypoint version) with the R50-FPN backbone as the detector. SyncBN is adopted in the backbone, FPN and ROI head. The model is fine-tuned on the train2017 set and evaluated on the val2017 set. The standard 2× fine-tuning schedule is applied. We report AP^kp and AP^kp_50.

COCO DensePose estimation. For DensePose (Güler et al., 2018) estimation, we use DensePose R-CNN with the R50-FPN backbone. SyncBN is adopted in the backbone, FPN and ROI box head. The model is trained on train2014 + valminusminival2014 and evaluated on minival2014. We use the "s1×" fine-tuning schedule (the improved baseline "R 50 FPN s1x" in Detectron2).
We report the AP^dp of the DensePose GPS metric.

Instance segmentation. We perform instance segmentation on the COCO (Lin et al., 2014), Cityscapes (Cordts et al., 2016), and LVIS (Gupta et al., 2019) datasets. COCO instance segmentation is jointly trained with COCO object detection with the Mask-RCNN model. We use Mask-RCNN with R50-FPN for fine-tuning. SyncBN is adopted in the backbone, FPN and ROI heads. For Cityscapes, the model is trained on cityscapes_fine_instance_seg_train for 24k iterations and evaluated on cityscapes_fine_instance_seg_val. For LVIS, the model is trained on lvis_v0.5_train and evaluated on lvis_v0.5_val with the 2× schedule. The standard AP^mk is reported.

Semantic segmentation. We perform semantic segmentation on the Cityscapes (Cordts et al., 2016), COCO-stuff (Caesar et al., 2018), ADE20k (Zhou et al., 2017), Pascal-VOC (Everingham et al., 2010), and Pascal-Context (Mottaghi et al., 2014) datasets. We use DeeplabV3 (Chen et al., 2017).



Footnotes:
[foot_0] InstDisc (Wu et al., 2018) and MoCo (He et al., 2019; Chen et al., 2020c) are also closely related to contrastive learning and are regarded as contrastive learning methods by some researchers.
[foot_2] We will show one possible reason why it requires a large batch.
[foot_3] The results of other methods for downstream tasks are given in Appendix F.
[foot_4] https://github.com/HobbitLong/PyContrast
[foot_5] We contacted the authors to check whether we used the models correctly for some downstream tasks; their feedback was that they had not checked the performance on those downstream tasks.



Figure 1: Visualizing the activation maps. (a) input image, (b) activation maps from our approach, (c) activation maps from only optimizing the classification loss. One can see that our approach (b) tends to focus more on the textured region.

Figure 2: Visualizing learned feature distributions for 2D toy examples. Each color corresponds to augmented views of the same instance. (a) jointly optimize the consistency and classification losses. (b) only optimize the classification loss.

Figure 2 (b) shows the insufficiency of only optimizing the classification loss. One can see that the distributions of different instances in Figure 2 (a) are better separated and the distribution for each instance is more compact.

Figure 4: Performance with different batch sizes: (a) Linear evaluation on ImageNet, (b) VOC detection, and (c) COCO-stuff segmentation. The performances with batch sizes 256, 512 and 1024 are similar.

Figure 5: Illustrating the effect of the sampling classifier weight update. Three results are reported: the baseline w/o sampling, the sampling update w/o sampling compensation, and the sampling update w/ sampling compensation. The sampling update w/ sampling compensation performs better than the one w/o compensation and similarly to the baseline w/o sampling.

A SAMPLING CLASSIFIER WEIGHT UPDATE

A.1 THE CHOICE OF β

We approximate the part of the denominator in the classification loss L_s = -Σ_{k=1}^{2} log (exp(w_n^T x_k^n) / Σ_{m=1}^{N} exp(w_m^T x_k^n)) that corresponds to the negative instances, i.e., Σ_{m≠n} exp(w_m^T x_k^n).
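The sampling idea with a compensation factor can be sketched as follows. This is a hypothetical illustration of unbiased negative sampling (the function name and the exact form of β are assumptions, not the paper's formulation):

```python
import math
import random

def sampled_denominator(pos_logit, neg_logits, sample_size, seed=0):
    """Approximate the softmax denominator by sampling negatives.

    The full denominator is exp(pos_logit) + sum(exp(l) for all negatives).
    We sample `sample_size` negatives and rescale their contribution by
    beta = len(neg_logits) / sample_size, so the estimate is unbiased.
    (A hedged sketch of the sampling-with-compensation idea; the paper's
    exact choice of beta may differ.)
    """
    rng = random.Random(seed)
    sampled = rng.sample(neg_logits, sample_size)
    beta = len(neg_logits) / sample_size  # sampling compensation factor
    return math.exp(pos_logit) + beta * sum(math.exp(l) for l in sampled)
```

Without the compensation factor, the sampled negative sum systematically underestimates the denominator, which matches the gap between "w/o" and "w/" sampling compensation reported in Figure 5.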

Illustrating how the consistency weight α influences the performance. The observations are consistent with those about the classification loss shown in Figure 3.

Table 3: Comparison of our approach ConIC with recent state-of-the-art solutions. We highlight the best and second-best scores among the approaches w/o the strong setup in red and blue, respectively.

Comparison for linear evaluation on ImageNet. Our approach achieves results comparable to MoCo v2, A-U, and other methods trained with a similar setup. See Appendix F for more discussion.

Trivial solution for merely optimizing the consistency loss. Let us look at the consistency loss in Equation 3. It is obvious that L_c ≥ 0. The minimum L_c = 0 holds if the features of all the augmented views of an image are the same: x_i^n = x_j^n. In theory, it also holds when different images have different representations: x_i^n ≠ x_j^m for n ≠ m. However, we empirically observe that merely optimizing the consistency loss always leads to the trivial solution: the representations of all the augmented views of all the images are the same, x_i^n = x_j^m.

Hard sample mining. It is known that the softmax loss has a benefit: hard samples contribute more to the gradient and thus to the parameter update. We show that the consistency term has a similar property: the gradient of the consistency term ℓ_c in Equation 2 with respect to x_i^n is larger for views that lie farther from the others.
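The hard-sample property can be illustrated with a minimal sketch, assuming a squared-distance consistency term (the paper's Equation 2 may take a different but similar form; `x_bar` here stands for the anchor the view is pulled toward):

```python
def consistency_grad(x_i, x_bar):
    """Gradient of the squared-distance consistency term ||x_i - x_bar||^2
    with respect to x_i, which is 2 * (x_i - x_bar). Its magnitude grows
    linearly with the distance, so 'hard' views (far from the anchor)
    receive larger parameter updates.
    This assumes a squared-distance form; the paper's exact term may differ."""
    return [2.0 * (a - b) for a, b in zip(x_i, x_bar)]

def grad_norm(g):
    """Euclidean norm of a gradient vector."""
    return sum(v * v for v in g) ** 0.5
```

For example, a view at distance 2 from the anchor gets a gradient twice as large as a view at distance 1, mirroring the hard-sample weighting of the softmax loss.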

Table 4: VOC object detection results for other methods that are not included in Table 2.


with the R50-dilated8 backbone. We use the SGD with momentum optimizer and the poly learning rate schedule for the semantic segmentation experiments. We employ the cross-entropy loss on both the final output of DeeplabV3 and the intermediate feature map output from stage 3, where the weight of the final loss is 1 and that of the auxiliary loss is 0.4. Single-scale testing is adopted for all experiments. For the Cityscapes experiments, we train the model for 40k iterations with batch size 8, initial learning rate 0.01 and input size 1024×512. For the COCO-stuff experiments, we train the model for 60k iterations with batch size 16, initial learning rate 0.01 and input size 520×520. For the ADE20k experiments, the model is trained for 150k iterations with batch size 16, initial learning rate 0.02 and input size 520×520. For the Pascal-VOC experiments, we use the 2012 train_aug set (augmented by Hariharan et al. (2011)) as the training set. The model is trained for 60k iterations with batch size 16, initial learning rate 0.001 and input size 513×513. For the Pascal-Context experiments, the model is trained for 30k iterations with batch size 16, initial learning rate 0.001 and input size 520×520. The standard mIoU metric is reported.
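The poly schedule mentioned above can be sketched as follows (power = 0.9 is the common default for DeeplabV3-style training; the exact exponent is an assumption, as it is not stated here):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning-rate schedule commonly used for semantic segmentation:
    lr = base_lr * (1 - cur_iter / max_iter) ** power.
    The rate decays from base_lr at iteration 0 to 0 at max_iter."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

For the Cityscapes setting above this would decay from 0.01 to 0 over 40k iterations.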

C.2 LINEAR EVALUATION

We freeze the pretrained backbone and train a linear classifier on top of the frozen features. The classifier is trained for 100 epochs with an initial learning rate of 75 and a cosine learning rate schedule. We set the weight decay to 0. The data augmentation is the same as in supervised ImageNet classification.
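The cosine schedule for the linear classifier can be sketched as follows (a minimal sketch assuming decay from the initial rate to 0 without warmup, which the text does not specify):

```python
import math

def cosine_lr(base_lr, cur_epoch, total_epochs):
    """Cosine learning-rate schedule: decays smoothly from base_lr at
    epoch 0 to 0 at total_epochs, following 0.5*(1 + cos(pi*t/T))."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * cur_epoch / total_epochs))
```

With the setting above (initial learning rate 75, 100 epochs), the rate starts at 75 and reaches 0 at epoch 100.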

D IMPLEMENTATION DETAILS OF THE TOY EXAMPLE

Figure 2 shows the learned feature distributions for 2D toy examples. We train the models (with a ResNet-50 encoder) on a toy dataset containing 8 ImageNet images. We apply RandomCrop with scale (0.7, 1) to the images to generate the augmented views. The models are trained for 200 epochs with a cosine schedule and initial learning rate 0.0001. We use batch size 8 and weight decay 1e-6. The dimension of the features output from the projection head is 2. For the classification-only experiment, we set α = 0. For the joint optimization of the consistency loss and the classification loss, we set α = 0.1. After training, the learned features of 20 random augmented views of each image are recorded. We apply kernel density estimation with a Gaussian kernel of std 0.04 on the recorded features for visualization. Each color represents the learned feature distribution of the augmented views from one image.
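The kernel density estimation step can be sketched in a minimal pure-Python form, assuming an isotropic 2D Gaussian kernel (the authors likely used an off-the-shelf KDE implementation; this function is illustrative only):

```python
import math

def gaussian_kde_2d(samples, point, std=0.04):
    """Kernel density estimate at `point` given 2D `samples`, using an
    isotropic Gaussian kernel with the given std (0.04 as in Figure 2).
    Each sample contributes a Gaussian bump; the result is their average."""
    norm = 1.0 / (len(samples) * 2.0 * math.pi * std * std)
    total = 0.0
    for sx, sy in samples:
        d2 = (point[0] - sx) ** 2 + (point[1] - sy) ** 2
        total += math.exp(-d2 / (2.0 * std * std))
    return norm * total
```

Evaluating this on a grid over the 2D feature space and coloring by instance reproduces the kind of density maps shown in Figure 2.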

E DATA AUGMENTATION

We provide the PyTorch pseudo code of the data augmentation we adopted, as follows: 
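A representative pipeline of this kind, written as torchvision-style pseudo code in the spirit of SimCLR-style augmentation; the operations and all parameter values below are assumptions for illustration, not the authors' released settings:

```python
# Pseudo code: assumed SimCLR-style augmentation (torchvision-style API).
# Every parameter value here is an assumption, not the paper's setting.
import torchvision.transforms as T

augmentation = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```

Two independent draws of `augmentation` applied to the same image produce the two augmented views of an instance.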

F EXPERIMENT RESULTS OF OTHER METHODS

The abbreviations in Table 3 are explained as follows: ConIC-S = ConIC w/ sampling; Local Agg. = Local Aggregation (Zhuang et al., 2019); E-CNN = Exemplar-CNN (Dosovitskiy et al., 2016); A-U = AlignUniform (Wang & Isola, 2020); BowNet = Gidaris et al. (2020). Table 4 shows the results on VOC object detection for some other methods that are not included in Table 2. The results are taken from the corresponding papers. BYOL, PCL, BowNet and SeLa adopted different evaluation setups and thus their results are not reported. Because of the time limitation, currently we are not able to re-implement these algorithms or use their provided pretrained models and evaluate them on other downstream tasks.

