LEARNING REPRESENTATIONS BY CONTRASTING CLUSTERS WHILE BOOTSTRAPPING INSTANCES

Anonymous

Abstract

Learning visual representations from large-scale unlabelled images is a holy grail for many computer vision tasks. Recent contrastive learning methods encourage the learned visual representations to be linearly separable across individual instances regardless of their semantic similarity; however, this can lead to a sub-optimal solution when the downstream task is non-discriminative, such as cluster analysis or information retrieval. In this work, we propose an approach that accounts for instance semantics in an unsupervised setting by simultaneously i) Contrasting batch-wise Cluster assignment features and ii) Bootstrapping INstance representations without negatives, referred to as C2BIN. Specifically, instances in a mini-batch are assigned to distinct clusters, each of which aims to capture apparent similarity among instances. Moreover, we introduce a multi-scale clustering technique that improves the representations by capturing multi-scale semantics. Empirically, our method achieves comparable or better performance than both representation learning and clustering baselines on various benchmark datasets: CIFAR-10, CIFAR-100, and STL-10.

1. INTRODUCTION

Learning to extract generalized representations from high-dimensional images is essential for solving various downstream tasks in computer vision. Although supervised learning has been shown to be useful for learning discriminative representations when pre-training a model, the expensive labeling cost makes it practically infeasible on large-scale datasets. Moreover, relying on human-annotated labels tends to cause several issues such as class imbalance (Cui et al., 2019), noisy labels (Lee et al., 2019), and biased datasets (Bahng et al., 2019). To address these issues, self-supervised visual representation learning, which does not require any labels, has emerged as an alternative training framework and is being actively studied in search of a proper training objective.

Recently, self-supervised approaches based on contrastive learning (Wu et al., 2018; Chen et al., 2020a; He et al., 2020) have rapidly narrowed the performance gap with supervised pre-training on various vision tasks. The contrastive method aims to learn an invariant mapping (Hadsell et al., 2006) and instance discrimination: intuitively, two augmented views of the same instance are mapped to the same point in latent space, while different instances are pushed away from each other. However, instance discrimination does not consider the semantic similarity of the representations (e.g., belonging to the same class) and even pushes away relevant instances. This causes the learned representations to be uniformly distributed, as shown by previous works (Wang & Isola, 2020; Chen & Li, 2020). We point out that this uniformity over instances can be a fundamental limitation on the quality of the learned representations. For instance, consider the representations illustrated in Fig. 1, a simple case in which linearly separable representations do not guarantee that semantically relevant instances are grouped together.
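To make the instance-discrimination objective described above concrete, the following is a minimal NumPy sketch of a standard NT-Xent-style contrastive loss (as in SimCLR), not the code of this paper: two augmented views of each instance are pulled together, while every other instance in the batch acts as a negative. The function name `nt_xent_loss` and the `temperature` default are illustrative assumptions.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss over a batch of paired embeddings.

    z1, z2: (N, D) arrays of embeddings of two augmented views,
    where row i of z1 and row i of z2 come from the same instance.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize -> cosine similarity
    sim = z @ z.T / temperature                        # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # exclude self-comparisons
    n = z1.shape[0]
    # the positive for row i is its other augmented view
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Note that every off-diagonal pair contributes to the denominator regardless of semantics, which is exactly the behavior criticized above: two instances of the same class are still treated as negatives and pushed apart.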



Figure 1: Although the illustrated 2D representations are linearly separable, irrelevant instances are clustered together.

