MICE: MIXTURE OF CONTRASTIVE EXPERTS FOR UNSUPERVISED IMAGE CLUSTERING

Abstract

We present Mixture of Contrastive Experts (MiCE), a unified probabilistic clustering framework that simultaneously exploits the discriminative representations learned by contrastive learning and the semantic structures captured by a latent mixture model. Motivated by the mixture of experts, MiCE employs a gating function to partition an unlabeled dataset into subsets according to the latent semantics, and multiple experts to discriminate the distinct subsets of instances assigned to them in a contrastive learning manner. To solve the nontrivial inference and learning problems caused by the latent variables, we further develop a scalable variant of the Expectation-Maximization (EM) algorithm for MiCE and prove its convergence. Empirically, we evaluate the clustering performance of MiCE on four widely adopted natural image datasets. MiCE achieves significantly better results than various previous methods and a strong contrastive learning baseline.

1. INTRODUCTION

Unsupervised clustering is a fundamental task that aims to partition data into distinct groups of similar instances without explicit human labels. Deep clustering methods (Xie et al., 2016; Wu et al., 2019) exploit the representations learned by neural networks and have recently made considerable progress on high-dimensional data. Often, such methods learn the representations for clustering by reconstructing data in a deterministic (Ghasedi Dizaji et al., 2017) or probabilistic manner (Jiang et al., 2016), or by maximizing certain mutual information (Hu et al., 2017; Ji et al., 2019) (see Sec. 2 for related work). Despite the recent advances, the representations learned by existing methods may not be discriminative enough to capture the semantic similarity between images. The instance discrimination task (Wu et al., 2018; He et al., 2020) in contrastive learning has shown promise in pre-training representations that transfer well to downstream tasks through fine-tuning. Given that the literature (Shiran & Weinshall, 2019; Niu et al., 2020) shows that improved representations can lead to better clustering results, we hypothesize that instance discrimination can improve clustering performance as well. A straightforward approach is to learn a classical clustering model, e.g., spherical k-means (Dhillon & Modha, 2001), directly on the representations pre-trained by the task. Such a two-stage baseline can achieve excellent clustering results (please refer to Tab. 1). However, because the two stages are independent, the baseline may not fully explore the semantic structures of the data when learning the representations, leading to a sub-optimal solution for clustering. To this end, we propose Mixture of Contrastive Experts (MiCE), a unified probabilistic clustering method that utilizes the instance discrimination task as a stepping stone to improve clustering.
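The second stage of the two-stage baseline above can be sketched as follows. This is a minimal NumPy implementation of spherical k-means, not the exact procedure used in our experiments; it assumes the embeddings from a pre-trained contrastive encoder are already computed and clusters them by cosine similarity on the unit sphere.

```python
import numpy as np

def spherical_kmeans(z, k, n_iter=50, seed=0, init=None):
    """Spherical k-means: cluster L2-normalized vectors by cosine similarity.

    z: (n, d) array of embeddings, e.g., from a pre-trained contrastive encoder.
    init: optional (k, d) array of initial centroids; random data points otherwise.
    Returns (labels, centroids), with unit-norm centroids.
    """
    rng = np.random.default_rng(seed)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # project onto the unit sphere
    mu = z[rng.choice(len(z), size=k, replace=False)] if init is None \
        else np.asarray(init, dtype=float)
    mu = mu / np.linalg.norm(mu, axis=1, keepdims=True)
    for _ in range(n_iter):
        labels = (z @ mu.T).argmax(axis=1)                # assign by cosine similarity
        for j in range(k):
            members = z[labels == j]
            if len(members):                              # keep old centroid if cluster is empty
                c = members.sum(axis=0)
                mu[j] = c / np.linalg.norm(c)             # mean direction, re-normalized
    return labels, mu
```

Because the encoder is frozen during this stage, the cluster structure can only reflect whatever semantics the contrastive pre-training happened to encode, which is exactly the limitation MiCE addresses by coupling the two.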
In particular, to capture the semantic structure explicitly, we formulate a mixture of conditional models by introducing latent variables that represent the cluster labels of the images, inspired by the mixture of experts (MoE) formulation. In MiCE, each conditional model, also called an expert, learns to discriminate a subset of instances, while an input-dependent gating function partitions the dataset into subsets according to the latent semantics by allocating weights among the experts. Further, we develop a scalable variant of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) for the nontrivial inference and learning problems. In the E-step, we approximately infer the posterior distribution of the latent variables given the observed data. In the M-step, we maximize the evidence lower bound (ELBO) of the log conditional likelihood with respect to all parameters. Theoretically, we show that the ELBO is bounded and that the proposed EM algorithm converges. Moreover, we carefully discuss the algorithmic relation between MiCE and the two-stage baseline and show that the latter is a special instance of the former in a certain extreme case. Compared with existing clustering methods, MiCE has the following advantages. (i) Methodologically unified: MiCE conjoins the benefits of both the discriminative representations learned by contrastive learning and the semantic structures captured by a latent mixture model within a unified probabilistic framework. (ii) Free from regularization: MiCE trained by EM optimizes a single objective function and requires no auxiliary loss or regularization terms. (iii) Empirically effective: Evaluated on four widely adopted natural image datasets, MiCE achieves significantly better results than a strong contrastive baseline and a wide range of prior clustering methods on several benchmarks without any form of pre-training.
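The alternating E-step/M-step structure and the monotone objective can be illustrated on a toy model. MiCE's actual steps operate on contrastive likelihoods over experts with approximate posterior inference; the sketch below instead runs classical EM on a 1-D Gaussian mixture, purely to show the two-step pattern and the non-decreasing log-likelihood that mirrors the bounded, convergent ELBO.

```python
import numpy as np

def em_step(x, pi, mu, var):
    """One EM iteration for a 1-D Gaussian mixture. Returns updated parameters
    and the data log-likelihood under the parameters *before* the update."""
    # E-step: responsibilities q(k | x_i) proportional to pi_k * N(x_i | mu_k, var_k)
    log_p = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
             - 0.5 * (x[:, None] - mu[None, :]) ** 2 / var)
    log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
    q = np.exp(log_p - log_norm)
    # M-step: closed-form updates that maximize the ELBO w.r.t. the parameters
    nk = q.sum(axis=0)
    pi = nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / nk
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, float(log_norm.sum())
```

Iterating `em_step` yields a non-decreasing log-likelihood sequence, the same monotonicity guarantee EM provides for MiCE's ELBO.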

2. RELATED WORK

Deep clustering. Inspired by the success of deep learning, many researchers propose to learn the representations and cluster assignments simultaneously (Xie et al., 2016; Yang et al., 2016; 2017) based on data reconstruction (Xie et al., 2016; Yang et al., 2017), pairwise relationships among instances (Chang et al., 2017; Haeusser et al., 2018; Wu et al., 2019), multi-task learning (Shiran & Weinshall, 2019; Niu et al., 2020), etc. The joint training framework often ends up optimizing a weighted average of multiple loss functions. However, given that a validation set is rarely available, tuning the weights between the losses may be impractical (Ghasedi Dizaji et al., 2017). Recently, several methods also explore probabilistic modeling, introducing latent variables to represent the underlying classes. On one hand, deep generative approaches (Jiang et al., 2016; Dilokthanakul et al., 2016; Chongxuan et al., 2018; Mukherjee et al., 2019; Yang et al., 2019) attempt to capture the data generation process with a mixture-of-Gaussians prior on the latent representations. However, the imposed assumptions can be violated in many cases, and capturing the true data distribution is challenging yet may not help clustering (Krause et al., 2010). On the other hand, discriminative approaches (Hu et al., 2017; Ji et al., 2019; Darlow & Storkey, 2020) directly model the mapping from the inputs to the cluster labels and maximize a form of mutual information, which often yields superior cluster accuracy. Despite their simplicity, the discriminative approaches discard the instance-specific details that can benefit clustering by improving the representations. Besides, MIXAE (Zhang et al., 2017), DAMIC (Chazan et al., 2019), and MoE-Sim-VAE (Kopf et al., 2019) combine the mixture of experts (MoE) formulation (Jacobs et al., 1991) with the data reconstruction task. However, each of them requires pre-training, regularization, or an extra clustering loss.
Contrastive learning. To learn discriminative representations, contrastive learning (Wu et al., 2018; Oord et al., 2018; He et al., 2020; Tian et al., 2019; Chen et al., 2020) incorporates various contrastive loss functions with different pretext tasks such as colorization (Zhang et al., 2016), context autoencoding (Pathak et al., 2016), and instance discrimination (Dosovitskiy et al., 2015; Wu et al., 2018). The pre-trained representations often achieve promising results on downstream tasks, e.g., depth prediction, object detection (Ren et al., 2015; He et al., 2017), and image classification (Kolesnikov et al., 2019), after fine-tuning with human labels. In particular, InstDisc (Wu et al., 2018) learns from instance-level discrimination using NCE (Gutmann & Hyvärinen, 2010) and maintains a memory bank to compute the loss function efficiently. MoCo (He et al., 2020) replaces the memory bank with a queue and maintains an exponential moving average (EMA) of the student network as the teacher network to encourage consistent representations. A concurrent work called PCL (Li et al., 2020) also explores the semantic structures in contrastive learning. They add an auxiliary cluster-style objective function on top of MoCo's original objective, which differs from our method significantly. PCL requires an auxiliary k-means (Lloyd, 1982) algorithm to obtain the posterior estimates and the prototypes. Moreover, their aim of clustering is to induce transferable embeddings rather than to discover groups of data that correspond to underlying semantic classes.
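The instance discrimination losses used by InstDisc and MoCo are variants of InfoNCE: a query should match its positive key (another view of the same instance) against a bank of negative keys (InstDisc's memory bank, MoCo's queue). The following is a minimal single-query sketch, not the exact loss of either paper; the temperature value and the `negatives` bank are illustrative assumptions.

```python
import numpy as np

def info_nce(q, k_pos, negatives, tau=0.07):
    """InfoNCE loss for a single query q against one positive key and a bank
    of negative keys (e.g., MoCo's queue). All vectors are assumed
    L2-normalized, so the dot product is the cosine similarity."""
    logits = np.concatenate([[q @ k_pos], negatives @ q]) / tau
    logits = logits - logits.max()            # subtract max for numerical stability
    # Cross-entropy with the positive as the "correct class"
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

Minimizing this loss pulls the query toward its own positive key and pushes it away from all other instances, which is what makes the learned representations discriminative at the instance level but agnostic to cluster structure on their own.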

