PROTOTYPICAL CONTRASTIVE LEARNING OF UNSUPERVISED REPRESENTATIONS

Abstract

This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that bridges contrastive learning with clustering. PCL not only learns low-level features for the task of instance discrimination, but more importantly, it encodes semantic structures discovered by clustering into the learned embedding space. Specifically, we introduce prototypes as latent variables to help find the maximum-likelihood estimate of the network parameters in an Expectation-Maximization framework. We iteratively perform the E-step, which finds the distribution of prototypes via clustering, and the M-step, which optimizes the network via contrastive learning. We propose ProtoNCE loss, a generalized version of the InfoNCE loss for contrastive learning, which encourages representations to be closer to their assigned prototypes. PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks, with substantial improvement in low-resource transfer learning. Code and pretrained models are available at https://github.com/salesforce/PCL.

1. INTRODUCTION

Unsupervised visual representation learning aims to learn image representations from the pixels themselves, without relying on semantic annotations, and recent advances are largely driven by instance discrimination tasks (Wu et al., 2018; Ye et al., 2019; He et al., 2020; Misra & van der Maaten, 2020; Hjelm et al., 2019; Oord et al., 2018; Tian et al., 2019). These methods usually consist of two key components: image transformation and a contrastive loss. Image transformation aims to generate multiple embeddings that represent the same image, by data augmentation (Ye et al., 2019; Bachman et al., 2019; Chen et al., 2020a), patch perturbation (Misra & van der Maaten, 2020), or momentum features (He et al., 2020). The contrastive loss, in the form of a noise-contrastive estimator (Gutmann & Hyvärinen, 2010), aims to bring samples from the same instance closer and to separate samples from different instances. Essentially, instance-wise contrastive learning leads to an embedding space where all instances are well-separated and each instance is locally smooth (i.e., inputs with perturbations have similar representations).

Despite their improved performance, instance discrimination methods share a common weakness: the representation is not encouraged to encode the semantic structure of the data. This problem arises because instance-wise contrastive learning treats two samples as a negative pair as long as they come from different instances, regardless of their semantic similarity. The problem is magnified by the fact that thousands of negative samples are generated to form the contrastive loss, leading to many negative pairs that share similar semantics but are undesirably pushed apart in the embedding space.

In this paper, we propose prototypical contrastive learning (PCL), a new framework for unsupervised representation learning that implicitly encodes the semantic structure of the data into the embedding space. Figure 1 shows an illustration of PCL.
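To make the instance-wise objective discussed above concrete, the following NumPy sketch computes an InfoNCE-style loss for a single query embedding. The function name, temperature value, and toy data are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: pull the positive view closer,
    push the negatives away. All embeddings are L2-normalized."""
    pos_logit = query @ positive / temperature       # similarity to the positive
    neg_logits = negatives @ query / temperature     # similarities to K negatives
    logits = np.concatenate(([pos_logit], neg_logits))
    # cross-entropy with the positive as the target class
    return -pos_logit + np.log(np.exp(logits).sum())

# Toy example: random unit vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
q = unit(rng.normal(size=8))
negs = np.stack([unit(rng.normal(size=8)) for _ in range(16)])
loss_aligned = info_nce(q, q, negs)                        # matching positive
loss_random = info_nce(q, unit(rng.normal(size=8)), negs)  # unrelated "positive"
```

A well-aligned positive drives the loss toward zero while an unrelated one does not; note that the negatives are treated identically regardless of semantics, which is exactly the weakness described above.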
A prototype is defined as "a representative embedding for a group of semantically similar instances". We assign several prototypes of different granularity to each instance, and construct a contrastive loss which enforces the embedding of a sample to be more similar to its corresponding prototypes than to other prototypes. In practice, we can find prototypes by performing clustering on the embeddings.

We formulate prototypical contrastive learning as an Expectation-Maximization (EM) algorithm, whose goal is to find the network parameters that best describe the data distribution by iteratively approximating and maximizing the log-likelihood function. Specifically, we introduce prototypes as additional latent variables, and estimate their probability in the E-step by performing k-means clustering. In the M-step, we update the network parameters by minimizing our proposed contrastive loss, namely ProtoNCE. We show that minimizing ProtoNCE is equivalent to maximizing the estimated log-likelihood, under the assumption that the data distribution around each prototype is an isotropic Gaussian. Under the EM framework, the widely used instance discrimination task can be explained as a special case of prototypical contrastive learning, where the prototype for each instance is its augmented feature, and the Gaussian distribution around each prototype has the same fixed variance.

The contributions of this paper can be summarized as follows:

• We propose prototypical contrastive learning, a novel framework for unsupervised representation learning that bridges contrastive learning and clustering. The learned representation is encouraged to capture the hierarchical semantic structure of the dataset.

• We give a theoretical framework that formulates PCL as an Expectation-Maximization (EM) algorithm. The iterative steps of clustering and representation learning can be interpreted as approximating and maximizing the log-likelihood function. Previous methods based on instance discrimination form a special case in the proposed EM framework.
• We propose ProtoNCE, a new contrastive loss which improves the widely used InfoNCE by dynamically estimating the concentration of the feature distribution around each prototype. ProtoNCE also includes an InfoNCE term in which the instance embeddings can be interpreted as instance-based prototypes. We provide an explanation of PCL from an information-theoretic perspective, by showing that the learned prototypes contain more information about the image classes.

• PCL outperforms instance-wise contrastive learning on multiple benchmarks with substantial improvements in low-resource transfer learning. PCL also leads to better clustering results.
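The EM view above can be sketched in a few lines of NumPy: the E-step clusters the embeddings with k-means, and the M-step would minimize a prototype-contrastive term in which each prototype gets its own concentration (a per-cluster temperature). The helper names and the concentration estimate below are our illustrative assumptions, not the authors' implementation; we simply make tighter, larger clusters yield a sharper (smaller) temperature.

```python
import numpy as np

def kmeans(features, k, iters=10, seed=0):
    """E-step sketch: k-means on L2-normalized features (cosine assignment)."""
    rng = np.random.default_rng(seed)
    protos = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(features @ protos.T, axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members):
                protos[j] = members.mean(0)
                protos[j] /= np.linalg.norm(protos[j])
    return protos, assign

def concentration(features, protos, assign, alpha=10.0):
    """Per-prototype concentration: small for tight, well-populated clusters
    (an assumed estimator; alpha smooths small clusters)."""
    phi = np.full(len(protos), 1.0)
    for j in range(len(protos)):
        members = features[assign == j]
        if len(members) > 1:
            d = np.linalg.norm(members - protos[j], axis=1).sum()
            phi[j] = d / (len(members) * np.log(len(members) + alpha))
    return phi

def proto_nce_term(v, protos, assign_i, phi):
    """Prototype-contrastive term for one embedding v assigned to cluster assign_i."""
    logits = (protos @ v) / phi          # each prototype scaled by its own phi
    return -logits[assign_i] + np.log(np.exp(logits).sum())

# Toy run: two noisy clusters of unit vectors around orthogonal directions.
rng = np.random.default_rng(1)
unit = lambda x: x / np.linalg.norm(x)
feats = np.stack([unit(np.eye(8)[i % 2] + 0.05 * rng.normal(size=8)) for i in range(40)])
protos, assign = kmeans(feats, k=2)
phi = concentration(feats, protos, assign)
loss0 = proto_nce_term(feats[0], protos, assign[0], phi)
```

In the full method this term would be summed over several clusterings of different granularity and combined with an InfoNCE term, and the network would be updated by gradient descent on the result.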

2. RELATED WORK

Our work is closely related to two main branches of study in unsupervised/self-supervised learning: instance-wise contrastive learning and deep unsupervised clustering.

Instance-wise contrastive learning (Wu et al., 2018; Ye et al., 2019; He et al., 2020; Misra & van der Maaten, 2020; Zhuang et al., 2019; Hjelm et al., 2019; Oord et al., 2018; Tian et al., 2019; Chen et al., 2020a) aims to learn an embedding space where samples (e.g. crops) from the same instance (e.g. an image) are pulled closer and samples from different instances are pushed apart. To construct the contrastive loss, positive and negative instance features are generated for each sample, and methods vary in their strategy for generating them. The memory bank approach (Wu et al., 2018) stores the features of all samples calculated in the previous step. The end-to-end approach (Ye et al., 2019; Tian et al., 2019; Chen et al., 2020a) generates instance features using all samples within the current mini-batch. The momentum encoder approach (He et al., 2020) encodes samples on-the-fly with a momentum-updated encoder and maintains a queue of instance features.

Despite their improved performance, existing methods based on instance-wise contrastive learning have two major limitations, which can be addressed by the proposed PCL framework.
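To make the momentum-encoder strategy concrete, here is a minimal sketch: the key encoder's weights track the query encoder's via an exponential moving average, and a fixed-size queue holds recent key features as negatives. Representing an encoder as a dict of weight arrays is our simplification for illustration.

```python
import numpy as np
from collections import deque

def momentum_update(enc_q, enc_k, m=0.999):
    """Exponential-moving-average update of the key encoder:
    theta_k <- m * theta_k + (1 - m) * theta_q."""
    for name, w_q in enc_q.items():
        enc_k[name] = m * enc_k[name] + (1.0 - m) * w_q
    return enc_k

# Toy encoders as dicts of weight arrays.
enc_q = {"w": np.ones(4)}
enc_k = {"w": np.zeros(4)}
momentum_update(enc_q, enc_k, m=0.9)   # enc_k["w"] becomes 0.1 everywhere

# Fixed-size queue of key features: the oldest batch is evicted first.
queue = deque(maxlen=3)
for step in range(5):
    queue.append(np.full(8, float(step)))  # stand-in for a batch of key features
```

The slowly moving key encoder keeps the queued features consistent over time, which is what allows a much larger pool of negatives than a single mini-batch provides.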



Figure 1: Illustration of Prototypical Contrastive Learning. Each instance is assigned to multiple prototypes with different granularity. PCL learns an embedding space which encodes the semantic structure of data.


