CONSISTENT INSTANCE CLASSIFICATION FOR UNSUPERVISED REPRESENTATION LEARNING

Abstract

In this paper, we address the problem of learning representations from images without human annotations. We study the instance classification solution, which regards each instance as a category, and improve its optimization and feature quality. The proposed consistent instance classification (ConIC) approach simultaneously optimizes the classification loss and an additional consistency loss that explicitly penalizes the feature dissimilarity between augmented views of the same instance. The benefit of optimizing the consistency loss is that the learned features for augmented views of the same instance are more compact, so the classification loss becomes easier to optimize, thus boosting the quality of the learned representations. This differs from InstDisc (Wu et al., 2018) and MoCo (He et al., 2019; Chen et al., 2020c), which use an estimated prototype as the classifier weight to ease the optimization. Different from SimCLR (Chen et al., 2020b), which directly compares different instances, our approach does not require a large batch size. Experimental results demonstrate competitive performance for linear evaluation and better performance than InstDisc, MoCo and SimCLR at downstream tasks, such as detection and segmentation, as well as competitive or superior performance compared to other methods with stronger training settings.

1. INTRODUCTION

Learning good representations from unlabeled images is a long-standing and challenging problem. The mainstream methods include: generative modeling (Hinton et al., 2006; Kingma & Welling, 2014), colorization (Zhang et al., 2016), transformation or spatial relation prediction (Doersch et al., 2015; Noroozi & Favaro, 2016; Gidaris et al., 2018), and discriminative methods, such as instance classification (Dosovitskiy et al., 2016; He et al., 2019) and contrastive learning (Chen et al., 2020b). The instance discrimination methods show promising performance on downstream tasks. There are two basic objectives that are optimized (He et al., 2019; Chen et al., 2020b; Yu et al., 2020; Wang & Isola, 2020): contraction and separation. Contraction means that the features of the augmented views from the same instance should be as close as possible. Separation means that the features of the augmented views from one instance should lie in a region different from other instances. The instance classification framework, such as InstDisc (Wu et al., 2018) and MoCo (He et al., 2019; Chen et al., 2020c), adopts a prototype-based classifier, where the prototype is estimated as the moving average of the corresponding features of previous epochs (Wu et al., 2018) or as the output of a moving-average network (He et al., 2019; Chen et al., 2020c). The prototype-based schemes ease the optimization of the classification loss in the challenging case where there are over one million categories. BYOL (Grill et al., 2020) computes the prototype in a way similar to MoCo, and only aligns the feature of augmented views with its prototype, leaving the separation objective implicitly optimized. The prototype, computed from a single view rather than many views and from networks with different parameters, might not be reliable enough, so the quality of the contraction and separation optimization is not guaranteed.
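The moving-average prototype estimation described above can be sketched as an exponential moving average of an instance's features, re-normalized to the unit sphere. This is a minimal illustration of the general scheme, not the authors' implementation; the function name and the momentum value are our own choices.

```python
import numpy as np

def update_prototype(prototype, feature, momentum=0.5):
    """Moving-average prototype update in the spirit of InstDisc-style
    memory banks: the stored prototype for an instance is an exponential
    moving average of its features across epochs.

    All names and the momentum value are illustrative, not from the paper.
    """
    p = momentum * prototype + (1.0 - momentum) * feature
    # Re-normalize so the prototype stays on the unit sphere,
    # as is standard when features are compared by cosine similarity.
    return p / (np.linalg.norm(p) + 1e-12)

# Toy usage: the prototype drifts toward the most recent feature.
proto = np.array([1.0, 0.0])
feat = np.array([0.0, 1.0])
proto = update_prototype(proto, feat)
```

Because the prototype mixes stale features (or features from a differently parameterized network, as in MoCo/BYOL), it can lag behind the current encoder; this is the reliability concern raised in the text.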
The contrastive learning framework¹, such as SimCLR (Chen et al., 2020b) and Ye et al. (2019), simultaneously maximizes the similarities between each view pair from the same instance and minimizes the similarities between view pairs from different instances. This framework directly compares the feature of one view to that of another view, rather than to a prototype, avoiding the unreliability of the prototype estimation. It, however, requires a large batch size in each SGD iteration to compare a sufficient number of negative instances when imposing the separation constraint², increasing the difficulty of large batch training. We propose a simple unsupervised representation learning approach, consistent instance classification (ConIC), to improve the optimization and feature quality. Our approach jointly minimizes two losses: an instance classification loss and a consistency loss. The instance classification loss is formulated by regarding each instance as a category. Its optimization encourages different instances to lie in different regions. The consistency loss directly compares the features of the augmented views from the same instance and encourages high similarity between them. One benefit of the consistency optimization is that it directly and explicitly makes the features of the same instance compact and thus accelerates the optimization of the classification loss. This differs from Wu et al. (2018) and He et al. (2019), which heuristically estimate the classifier weights using prototypes, and it does not suffer from the prototype estimation reliability issue. On the other hand, our approach does not rely on large batch training, which is essential for SimCLR (Chen et al., 2020b), because the whole loss in our formulation can be decomposed into a sum of components, each of which depends on only one instance.
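The two-term objective described above can be sketched as a cross-entropy instance classification loss (each image is its own class, with learnable classifier weights) plus a consistency term penalizing cosine dissimilarity between the two augmented views. This is a sketch under our reading of the text, not the paper's exact formulation; the symbols `z1`, `z2`, `W`, and the weighting `lam` are our own notation.

```python
import numpy as np

def conic_loss(z1, z2, W, labels, lam=1.0):
    """Sketch of a ConIC-style objective: instance-classification
    cross-entropy plus a view-consistency penalty.

    z1, z2 : (N, d) features of two augmented views of N instances
    W      : (C, d) classifier weights, one row per instance class
    labels : (N,) instance (class) indices
    lam    : consistency weight (illustrative, not from the paper)
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

    z1, z2, Wn = normalize(z1), normalize(z2), normalize(W)

    def cross_entropy(z):
        logits = z @ Wn.T                              # cosine similarity to each class weight
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()

    cls = 0.5 * (cross_entropy(z1) + cross_entropy(z2))   # classification on both views
    cons = (1.0 - (z1 * z2).sum(axis=1)).mean()           # 1 - cosine(view1, view2)
    return cls + lam * cons
```

Note that each summand involves only one instance's two views and the classifier weights, which is consistent with the paper's claim that the loss decomposes per instance and therefore does not require large batches.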
Furthermore, we observed that jointly optimizing the consistency and classification losses makes the representation focus more on textured regions, as shown in Figure 1. This implies that the learned representation is more capable of characterizing objects, and thus potentially more helpful for downstream tasks like object detection and segmentation. We demonstrate the effectiveness of our approach in unsupervised representation learning on ImageNet. Our approach achieves competitive performance under the linear evaluation protocol. When fine-tuned on downstream tasks, such as object detection on VOC, object detection and instance segmentation on COCO, instance segmentation on Cityscapes and LVIS, as well as semantic segmentation on Cityscapes, COCO Stuff, ADE and VOC, our approach performs better than InstDisc, MoCo and SimCLR, and competitively with or superior to other methods with stronger training settings (e.g., InfoMin and SwAV).

2. RELATED WORK

Generative approaches. Generative models, such as auto-encoders (Hinton et al., 2006; Kingma & Welling, 2014; Vincent et al., 2008), context encoders (Pathak et al., 2016), GANs (Donahue & Simonyan, 2019), and GPTs (Chen et al., 2020a), learn an unsupervised representation by faithfully reconstructing the pixels. Later self-supervised models, such as colorization (Zhang et al., 2016) and split-brain encoders (Zhang et al., 2017), improve on generative models by withholding some part of the data and predicting it.



¹ InstDisc (Wu et al., 2018) and MoCo (He et al., 2019; Chen et al., 2020c) are also closely related to contrastive learning and are regarded as contrastive learning methods by some researchers.
² We will show one possible reason that it requires a large batch.



Figure 1: Visualizing the activation maps. (a) input image, (b) activation maps from our approach, (c) activation maps from optimizing only the classification loss. One can see that our approach (b) tends to focus more on the textured regions.

