CONSISTENT INSTANCE CLASSIFICATION FOR UNSUPERVISED REPRESENTATION LEARNING

Abstract

In this paper, we address the problem of learning representations from images without human annotations. We study the instance classification solution, which regards each instance as a category, and improve both the optimization and the feature quality. The proposed consistent instance classification (ConIC) approach simultaneously optimizes the classification loss and an additional consistency loss that explicitly penalizes the feature dissimilarity between augmented views of the same instance. The benefit of optimizing the consistency loss is that the learned features for augmented views of the same instance become more compact, which in turn makes the classification loss easier to optimize and thus boosts the quality of the learned representations. This differs from InstDisc (Wu et al., 2018) and MoCo (He et al., 2019; Chen et al., 2020c), which use an estimated prototype as the classifier weight to ease the optimization. Unlike SimCLR (Chen et al., 2020b), which directly compares different instances, our approach does not require a large batch size. Experimental results demonstrate competitive performance for linear evaluation and better performance than InstDisc, MoCo, and SimCLR on downstream tasks, such as detection and segmentation, as well as competitive or superior performance compared to other methods with stronger training settings.
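To make the joint objective concrete, the following is a minimal NumPy sketch of a ConIC-style loss: an instance-classification cross-entropy on both augmented views plus a consistency term penalizing dissimilarity between the two views' features. This is an illustrative reconstruction, not the authors' implementation; the cosine-dissimilarity form of the consistency term and the `lambda_c` weight are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project feature vectors onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def instance_ce(logits, labels):
    """Cross-entropy where each image index is its own class (instance classification)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def conic_loss(z1, z2, weights, labels, lambda_c=1.0):
    """Classification loss on both views plus a consistency loss between them.

    z1, z2  : features of two augmented views, shape (batch, dim)
    weights : instance classifier weights, shape (num_instances, dim)
    labels  : instance indices of the batch, shape (batch,)
    """
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    cls = 0.5 * (instance_ce(z1 @ weights.T, labels) +
                 instance_ce(z2 @ weights.T, labels))
    # cosine dissimilarity between the two views of each instance (assumed form)
    consistency = (1.0 - (z1 * z2).sum(axis=1)).mean()
    return cls + lambda_c * consistency
```

When the two views' features coincide, the consistency term vanishes and only the classification loss remains, which matches the intuition that compact per-instance features make the classification problem easier.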

1. INTRODUCTION

Learning good representations from unlabeled images is a long-standing and challenging problem. The mainstream methods include: generative modeling (Hinton et al., 2006; Kingma & Welling, 2014), colorization (Zhang et al., 2016), transformation or spatial-relation prediction (Doersch et al., 2015; Noroozi & Favaro, 2016; Gidaris et al., 2018), and discriminative methods, such as instance classification (Dosovitskiy et al., 2016; He et al., 2019) and contrastive learning (Chen et al., 2020b). The instance discrimination methods show promising performance on downstream tasks.

There are two basic objectives that are optimized (He et al., 2019; Chen et al., 2020b; Yu et al., 2020; Wang & Isola, 2020): contraction and separation. Contraction means that the features of the augmented views from the same instance should be as close as possible. Separation means that the features of the augmented views from one instance should lie in a region different from those of other instances. The instance classification framework, such as InstDisc (Wu et al., 2018) and MoCo (He et al., 2019; Chen et al., 2020c), adopts a prototype-based classifier, where the prototype is estimated as the moving average of the corresponding features from previous epochs (Wu et al., 2018) or as the output of a moving-average network (He et al., 2019; Chen et al., 2020c). The prototype-based schemes ease the optimization of the classification loss in the challenging case where there are over one million categories. BYOL (Grill et al., 2020) computes the prototype in a way similar to MoCo, and only aligns the feature of each augmented view with its prototype, leaving the separation objective implicitly optimized. The prototype, computed from a single view rather than many views and from networks with different parameters, might not be reliable enough, so the quality of the contraction and separation optimization is not guaranteed. The contrastive learning framework^1, such as SimCLR (Chen et al., 2020b) and Ye et al. (2019), simultaneously maximizes the similarities between each view pair from the same instance and minimizes the similarities between view pairs from different instances.

^1 InstDisc (Wu et al., 2018) and MoCo (He et al., 2019; Chen et al., 2020c) are also closely related to contrastive learning and are regarded as contrastive learning methods by some researchers.
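The moving-average prototype estimate described above can be sketched as follows: an InstDisc-style memory bank where each instance's prototype is an exponential moving average of its past features, re-normalized to the unit sphere. This is an illustrative NumPy version under assumed conventions, not the papers' code; the momentum value is an assumption.

```python
import numpy as np

def update_prototypes(memory, feats, indices, momentum=0.5):
    """Update the memory-bank prototypes for the instances in the batch.

    memory  : bank of per-instance prototypes, shape (num_instances, dim)
    feats   : current features of the batch, shape (batch, dim)
    indices : instance indices of the batch, shape (batch,)
    """
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    # exponential moving average over epochs for the touched instances only
    memory[indices] = momentum * memory[indices] + (1.0 - momentum) * feats
    # keep prototypes on the unit sphere after blending
    memory[indices] /= np.linalg.norm(memory[indices], axis=1, keepdims=True)
    return memory
```

A higher momentum makes the prototype change more slowly across epochs, trading responsiveness for stability; the momentum-network variant of MoCo and BYOL applies the same averaging idea to the encoder weights rather than to the stored features.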

