MICE: MIXTURE OF CONTRASTIVE EXPERTS FOR UNSUPERVISED IMAGE CLUSTERING

Abstract

We present Mixture of Contrastive Experts (MiCE), a unified probabilistic clustering framework that simultaneously exploits the discriminative representations learned by contrastive learning and the semantic structures captured by a latent mixture model. Motivated by the mixture of experts, MiCE employs a gating function to partition an unlabeled dataset into subsets according to the latent semantics and multiple experts to discriminate distinct subsets of instances assigned to them in a contrastive learning manner. To solve the nontrivial inference and learning problems caused by the latent variables, we further develop a scalable variant of the Expectation-Maximization (EM) algorithm for MiCE and provide proof of the convergence. Empirically, we evaluate the clustering performance of MiCE on four widely adopted natural image datasets. MiCE achieves significantly better results than various previous methods and a strong contrastive learning baseline.

1. INTRODUCTION

Unsupervised clustering is a fundamental task that aims to partition data into distinct groups of similar instances without explicit human labels. Deep clustering methods (Xie et al., 2016; Wu et al., 2019) exploit the representations learned by neural networks and have recently made large progress on high-dimensional data. Often, such methods learn the representations for clustering by reconstructing data in a deterministic (Ghasedi Dizaji et al., 2017) or probabilistic manner (Jiang et al., 2016), or by maximizing certain mutual information (Hu et al., 2017; Ji et al., 2019) (see Sec. 2 for the related work). Despite the recent advances, the representations learned by existing methods may not be discriminative enough to capture the semantic similarity between images. The instance discrimination task (Wu et al., 2018; He et al., 2020) in contrastive learning has shown promise in pre-training representations transferable to downstream tasks through fine-tuning. Given that the literature (Shiran & Weinshall, 2019; Niu et al., 2020) shows improved representations can lead to better clustering results, we hypothesize that instance discrimination can improve clustering performance as well. A straightforward approach is to learn a classical clustering model, e.g., spherical k-means (Dhillon & Modha, 2001), directly on the representations pre-trained by the task. Such a two-stage baseline can achieve excellent clustering results (please refer to Tab. 1). However, because of the independence of the two stages, the baseline may not fully explore the semantic structures of the data when learning the representations, leading to a sub-optimal solution for clustering. To this end, we propose Mixture of Contrastive Experts (MiCE), a unified probabilistic clustering method that utilizes the instance discrimination task as a stepping stone to improve clustering.
In particular, to capture the semantic structure explicitly, we formulate a mixture of conditional models by introducing latent variables to represent cluster labels of the images, which is inspired by the mixture of experts (MoE) formulation. In MiCE, each of the conditional models, also called an expert, learns to discriminate a subset of instances, while an input-dependent gating function partitions the dataset into subsets according to the latent semantics by allocating weights among experts. Further, we develop a scalable variant of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) for the nontrivial inference and learning problems. In the E-step, we obtain the approximate inference of the posterior distribution of the latent variables given the observed data. In the M-step, we maximize the evidence lower bound (ELBO) of the log conditional likelihood with respect to all parameters. Theoretically, we show that the ELBO is bounded and the proposed EM algorithm leads to the convergence of ELBO. Moreover, we carefully discuss the algorithmic relation between MiCE and the two-stage baseline and show that the latter is a special instance of the former in a certain extreme case. Compared with existing clustering methods, MiCE has the following advantages. (i) Methodologically unified: MiCE conjoins the benefits of both the discriminative representations learned by contrastive learning and the semantic structures captured by a latent mixture model within a unified probabilistic framework. (ii) Free from regularization: MiCE trained by EM optimizes a single objective function, which does not require auxiliary loss or regularization terms. (iii) Empirically effective: Evaluated on four widely adopted natural image datasets, MiCE achieves significantly better results than a strong contrastive baseline and extensive prior clustering methods on several benchmarks without any form of pre-training.

2. RELATED WORK

Deep clustering. Inspired by the success of deep learning, many researchers propose to learn the representations and cluster assignments simultaneously (Xie et al., 2016; Yang et al., 2016; 2017) based on data reconstruction (Xie et al., 2016; Yang et al., 2017), pairwise relationships among instances (Chang et al., 2017; Haeusser et al., 2018; Wu et al., 2019), multi-task learning (Shiran & Weinshall, 2019; Niu et al., 2020), etc. The joint training framework often ends up optimizing a weighted average of multiple loss functions. However, given that a validation dataset is rarely available, tuning the weights between the losses may be impractical (Ghasedi Dizaji et al., 2017). Recently, several methods have also explored probabilistic modeling, introducing latent variables to represent the underlying classes. On one hand, deep generative approaches (Jiang et al., 2016; Dilokthanakul et al., 2016; Chongxuan et al., 2018; Mukherjee et al., 2019; Yang et al., 2019) attempt to capture the data generation process with a mixture-of-Gaussians prior on the latent representations. However, the imposed assumptions can be violated in many cases, and capturing the true data distribution is challenging but may not be helpful for clustering (Krause et al., 2010). On the other hand, discriminative approaches (Hu et al., 2017; Ji et al., 2019; Darlow & Storkey, 2020) directly model the mapping from the inputs to the cluster labels and maximize a form of mutual information, which often yields superior cluster accuracy. Despite their simplicity, the discriminative approaches discard the instance-specific details that can benefit clustering by improving the representations. Besides, MIXAE (Zhang et al., 2017), DAMIC (Chazan et al., 2019), and MoE-Sim-VAE (Kopf et al., 2019) combine the mixture of experts (MoE) formulation (Jacobs et al., 1991) with the data reconstruction task. However, they require either pre-training, regularization, or an extra clustering loss.
Contrastive learning. To learn discriminative representations, contrastive learning (Wu et al., 2018; Oord et al., 2018; He et al., 2020; Tian et al., 2019; Chen et al., 2020) incorporates various contrastive loss functions with different pretext tasks such as colorization (Zhang et al., 2016) , context autoencoding (Pathak et al., 2016) , and instance discrimination (Dosovitskiy et al., 2015; Wu et al., 2018) . The pre-trained representations often achieve promising results on downstream tasks, e.g., depth prediction, object detection (Ren et al., 2015; He et al., 2017) , and image classification (Kolesnikov et al., 2019) , after fine-tuning with human labels. In particular, InstDisc (Wu et al., 2018) learns from instance-level discrimination using NCE (Gutmann & Hyvärinen, 2010) , and maintains a memory bank to compute the loss function efficiently. MoCo replaces the memory bank with a queue and maintains an EMA of the student network as the teacher network to encourage consistent representations. A concurrent work called PCL (Li et al., 2020) also explores the semantic structures in contrastive learning. They add an auxiliary cluster-style objective function on top of the MoCo's original objective, which differs from our method significantly. PCL requires an auxiliary k-means (Lloyd, 1982) algorithm to obtain the posterior estimates and the prototypes. Moreover, their aim of clustering is to induce transferable embeddings instead of discovering groups of data that correspond to underlying semantic classes.

3. PRELIMINARY

We introduce contrastive learning methods based on the instance discrimination task (Wu et al., 2018; Ye et al., 2019; He et al., 2020; Chen et al., 2020), with a particular focus on the recent state-of-the-art method, MoCo (He et al., 2020). Let $X = \{x_n\}_{n=1}^N$ be a set of images without ground-truth labels, where each datapoint $x_n$ is assigned a unique surrogate label $y_n \in \{1, 2, \dots, N\}$ such that $y_n \neq y_j, \forall j \neq n$. To learn representations in an unsupervised manner, instance discrimination considers a discriminative classifier that maps a given image to its surrogate label. Suppose that we have two encoder networks $f_\theta$ and $f_{\theta'}$ that generate $\ell_2$-normalized embeddings $v_{y_n} \in \mathbb{R}^d$ and $f_n \in \mathbb{R}^d$, respectively, given the image $x_n$ with the surrogate label $y_n$. We show the parameters of the networks in the subscript, and images are transformed by a stochastic data augmentation module before being passed to the networks (please see Appendix D). We can model the probabilistic classifier as:

$$p(Y|X) = \prod_{n=1}^{N} p(y_n|x_n) = \prod_{n=1}^{N} \frac{\exp(v_{y_n}^\top f_n / \tau)}{\sum_{i=1}^{N} \exp(v_i^\top f_n / \tau)}, \quad (1)$$

where $\tau$ is the temperature hyper-parameter controlling the concentration level (Hinton et al., 2015). The recent contrastive learning methods mainly differ in: (1) the contrastive loss used to learn the network parameters, including NCE (Wu et al., 2018), InfoNCE (Oord et al., 2018), and the margin loss (Schroff et al., 2015); and (2) the choice of the two encoder networks based on deep neural networks (DNNs), in which $\theta'$ can be an identical (Ye et al., 2019; Chen et al., 2020), distinct (Tian et al., 2019), or exponential moving average (EMA) (He et al., 2020) version of $\theta$. In particular, MoCo (He et al., 2020) learns by minimizing the InfoNCE loss:

$$-\log \frac{\exp(v_{y_n}^\top f_n / \tau)}{\exp(v_{y_n}^\top f_n / \tau) + \sum_{i=1}^{\nu} \exp(q_i^\top f_n / \tau)}, \quad (2)$$

where $q \in \mathbb{R}^{\nu \times d}$ is a queue of size $\nu \leq N$ storing previous embeddings from $f_{\theta'}$.
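The InfoNCE loss of Eq. (2) can be sketched numerically as follows. This is a minimal NumPy illustration, not the authors' implementation: the embeddings and queue are random stand-ins for encoder outputs, and the temperature value is an arbitrary choice.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(v_y, f_n, queue, tau=0.07):
    """InfoNCE loss for one instance (Eq. 2): the positive logit is
    v_y . f_n / tau; negatives come from queued teacher embeddings."""
    pos = np.exp(v_y @ f_n / tau)
    neg = np.exp(queue @ f_n / tau).sum()
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
d, nu = 128, 256
f_n = normalize(rng.normal(size=d))            # student embedding
queue = normalize(rng.normal(size=(nu, d)))    # previous teacher embeddings

# A teacher embedding aligned with f_n yields a much lower loss
# than an unrelated random one.
loss_aligned = info_nce(f_n, f_n, queue)
loss_random = info_nce(normalize(rng.normal(size=d)), f_n, queue)
```

Minimizing this loss pulls the two views of the same image together while pushing the queued negatives away.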
Moreover, MoCo adopts the EMA approach to avoid rapidly changing embeddings in the queue, which would adversely impact the performance (He et al., 2020). For convenience, we refer to $f_\theta$ and $f_{\theta'}$ as the student and teacher network, respectively (Tarvainen & Valpola, 2017; Tsai et al., 2019). In the following, we propose a unified latent mixture model based on contrastive learning to tackle the clustering task.
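The EMA teacher update mentioned above can be sketched as follows; this is a generic illustration, and the momentum value 0.999 is an assumption following common MoCo practice, not a value stated here.

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.999):
    """Exponential moving average: theta' <- m * theta' + (1 - m) * theta,
    applied parameter tensor by parameter tensor."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

student = [np.ones(4)]
teacher = [np.zeros(4)]
teacher = ema_update(teacher, student)  # teacher drifts slowly toward student
```

Because the teacher changes slowly, the embeddings it contributes to the queue stay consistent across iterations.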

4. MIXTURE OF CONTRASTIVE EXPERTS

Unsupervised clustering aims to partition a dataset $X$ with $N$ observations into $K$ clusters. We introduce the latent variable $z_n \in \{1, 2, \dots, K\}$ as the cluster label of the image $x_n$ and naturally extend Eq. (1) to Mixture of Contrastive Experts (MiCE):

$$p(Y, Z|X) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(y_n, z_n = k|x_n)^{\mathbb{1}(z_n = k)} = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ p(z_n = k|x_n)\, p(y_n|x_n, z_n = k) \right]^{\mathbb{1}(z_n = k)}, \quad (3)$$

where $\mathbb{1}(\cdot)$ is an indicator function. The formulation explicitly introduces a mixture model to capture the latent semantic structures, which is inspired by the mixture of experts (MoE) framework (Jacobs et al., 1991). In Eq. (3), $p(y_n|x_n, z_n)$ is one of the experts that learn to discriminate a subset of instances, and $p(z_n|x_n)$ is a gating function that partitions the dataset into subsets according to the latent semantics by routing the given input to one or a few experts. With a divide-and-conquer principle, the experts are often highly specialized in particular images that share similar semantics, which improves the learning efficiency. Notably, MiCE is generic to the choice of the underlying contrastive methods (Wu et al., 2018; He et al., 2020; Chen et al., 2020), while in this paper, we focus on an instance based on MoCo. Also, please see Fig. 1 for an illustration of MiCE with three experts. In contrast to the original MoE used in supervised settings (Jacobs et al., 1991), our experts learn from instance-wise discrimination instead of human labels. In addition, both the gating and expert parts of MiCE are based on DNNs to fit the high-dimensional data. In the following, we elaborate on how we parameterize the gating function and the experts to fit the clustering task. For simplicity, we omit the parameters in all probability distributions in this section. Gating function. The gating function organizes the instance discrimination task into $K$ simpler subtasks by weighting the experts based on the semantics of the input image.
We define $g_\psi$ as an encoder network that outputs an embedding for each input image, and denote the output vector for image $x_n$ as $g_n \in \mathbb{R}^d$. The gating function is then parameterized as:

$$p(z_n|x_n) = \frac{\exp(\omega_{z_n}^\top g_n / \kappa)}{\sum_{k=1}^{K} \exp(\omega_k^\top g_n / \kappa)}, \quad (4)$$

where $\kappa$ is the temperature and $\omega = \{\omega_k\}_{k=1}^K$ are the gating prototypes. All prototypes and image embeddings are $\ell_2$-normalized in $\mathbb{R}^d$. Hence, the gating function performs a soft partitioning of the dataset based on the cosine similarity between the image embeddings and the gating prototypes. We can view it as a prototype-based discriminative clustering module, whereas we obtain the final cluster labels using posterior inference to consider the additional information in the experts.

Experts. In MiCE, every expert learns to solve the instance discrimination subtask arranged by the gating function. We define the experts in terms of an unnormalized model $\Phi(\cdot)$ following Wu et al. (2018); He et al. (2020). The probability of the image $x_n$ being recognized as the $y_n$-th instance by the $z_n$-th expert is formulated as:

$$p(y_n|x_n, z_n) = \frac{\Phi(x_n, y_n, z_n)}{Z(x_n, z_n)}, \quad \text{where } Z(x_n, z_n) = \sum_{i=1}^{N} \Phi(x_n, y_i, z_n) \quad (5)$$

is a normalization constant that is often computationally intractable. Similar to MoCo, we have the student network $f_\theta$ that maps the image $x_n$ into $K$ continuous embeddings $f_n = \{f_{n,k}\}_{k=1}^K \in \mathbb{R}^{K \times d}$. Likewise, the teacher network $f_{\theta'}$ outputs $v_{y_n} = \{v_{y_n,k}\}_{k=1}^K \in \mathbb{R}^{K \times d}$ given $x_n$. To be specific, $f_{n,z_n} \in \mathbb{R}^d$ and $v_{y_n,z_n} \in \mathbb{R}^d$ are the student and teacher embeddings of $x_n$ under the $z_n$-th expert, respectively. We then parameterize the unnormalized model as:

$$\Phi(x_n, y_n, z_n) = \exp\left( v_{y_n,z_n}^\top (f_{n,z_n} + \mu_{z_n}) / \tau \right), \quad (6)$$

where $\tau$ is the temperature and $\mu = \{\mu_k\}_{k=1}^K$ are the cluster prototypes of the experts. In Eq. (6), the first, instance-wise dot product exploits instance-level information to induce discriminative representations within each expert.
The second, instance-prototype dot product incorporates class-level information into representation learning, encouraging a clear cluster structure around each prototype. Overall, the learned embeddings are encoded with semantic structures while remaining discriminative enough to represent the instances. Eq. (6) is built upon MoCo with the EMA approach, while in principle many other parameterizations of the experts exist, which are left for future studies. Besides, the parameters $\theta$ and $\psi$ are partially shared; please refer to Appendix D for more details on the architecture.
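The gating function (Eq. 4) and the unnormalized expert model (Eq. 6) can be sketched numerically as follows. This is an illustrative NumPy stand-in, with random unit vectors in place of network outputs and arbitrary temperatures.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def gating(g_n, omega, kappa=1.0):
    """Soft partition over K experts via cosine similarity (Eq. 4)."""
    logits = omega @ g_n / kappa
    e = np.exp(logits - logits.max())   # stabilized softmax
    return e / e.sum()

def phi(v_yk, f_nk, mu_k, tau=1.0):
    """Unnormalized expert model Phi(x_n, y_n, z_n) (Eq. 6):
    instance-wise term v.f plus instance-prototype term v.mu."""
    return np.exp(v_yk @ (f_nk + mu_k) / tau)

rng = np.random.default_rng(0)
K, d = 3, 16
omega = normalize(rng.normal(size=(K, d)))   # gating prototypes
mu = normalize(rng.normal(size=(K, d)))      # expert (cluster) prototypes
g_n = normalize(rng.normal(size=d))          # gating embedding of x_n
f_n = normalize(rng.normal(size=(K, d)))     # student embeddings, one per expert
v_yn = normalize(rng.normal(size=(K, d)))    # teacher embeddings, one per expert

weights = gating(g_n, omega)                 # sums to 1 over the K experts
experts = np.array([phi(v_yn[k], f_n[k], mu[k]) for k in range(K)])
```

The gating weights route each image softly among the experts, while each expert scores the image against both its instance view and its cluster prototype.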

5. INFERENCE AND LEARNING

We first discuss the evidence lower bound (ELBO), the single objective used in MiCE, in Sec. 5.1. Then, we present a scalable variant of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) to deal with the non-trivial inference and learning of MiCE in Sec. 5.2. Lastly, in Sec. 5.3, we show that a naïve two-stage baseline, in which we run a spherical k-means algorithm on the embeddings learned by MoCo, is a special case of MiCE.

5.1. EVIDENCE LOWER BOUND (ELBO)

The parameters to update include the parameters $\theta, \psi$ of the student and gating networks, respectively, and the expert prototypes $\mu = \{\mu_k\}_{k=1}^K$. The learning objective of MiCE is to maximize the evidence lower bound (ELBO) of the log conditional likelihood of the entire dataset. The ELBO of datapoint $n$ is given by:

$$\log p(y_n|x_n) \geq \mathcal{L}(\theta, \psi, \mu; x_n, y_n) := \mathbb{E}_{q(z_n|x_n,y_n)}\left[\log p(y_n|x_n, z_n; \theta, \mu)\right] - D_{\mathrm{KL}}\left(q(z_n|x_n,y_n) \,\|\, p(z_n|x_n; \psi)\right), \quad (7)$$

where $q(z_n|x_n,y_n)$ is a variational distribution used to infer the latent cluster label given the observed data. The first term in Eq. (7) encourages $q(z_n|x_n,y_n)$ to be high for the experts that are good at discriminating the input images. Intuitively, it can relieve the potential degeneracy issue (Caron et al., 2018; Ji et al., 2019), where all images are assigned to the same cluster. This is because a degenerate posterior puts the pressure of discriminating all images on a single expert, which may result in a looser ELBO. The second term in Eq. (7) is the Kullback-Leibler divergence between the variational distribution and the distribution defined by the gating function. With this term, the gating function is refined during training and takes the capability of the experts into account when partitioning the data. Notably, MiCE does not rely on auxiliary losses or regularization terms as many prior methods (Haeusser et al., 2018; Shiran & Weinshall, 2019; Wu et al., 2019; Niu et al., 2020) do.
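As a sanity check on the bound in Eq. (7), the following sketch builds a tiny discrete model (the probabilities are made-up numbers, not from the paper) and verifies that the ELBO never exceeds the log marginal likelihood, with equality when the variational distribution equals the true posterior:

```python
import numpy as np

# Made-up gating probabilities p(z|x) and expert likelihoods p(y|x,z) for K=3.
p_z = np.array([0.5, 0.3, 0.2])
p_y_given_z = np.array([0.9, 0.1, 0.4])

log_marginal = np.log((p_z * p_y_given_z).sum())
posterior = p_z * p_y_given_z / (p_z * p_y_given_z).sum()

def elbo(q):
    """E_q[log p(y|x,z)] - KL(q || p(z|x)), as in Eq. (7)."""
    return (q * np.log(p_y_given_z)).sum() - (q * np.log(q / p_z)).sum()

uniform = np.full(3, 1.0 / 3)
```

Evaluating `elbo(posterior)` recovers the log marginal exactly, while any other choice of q (e.g. the uniform one) gives a strictly looser bound, which is the mechanism the degeneracy argument above appeals to.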

5.2 A SCALABLE VARIANT OF THE EM ALGORITHM

E-step.

Inferring the posterior distribution of the latent variables given the observations is an important step in applying MiCE to clustering. According to Bayes' theorem, the posterior distribution given the current estimate of the model parameters is:

$$p(z_n|x_n, y_n; \theta, \psi, \mu) = \frac{p(z_n|x_n; \psi)\, p(y_n|x_n, z_n; \theta, \mu)}{\sum_{k=1}^{K} p(k|x_n; \psi)\, p(y_n|x_n, k; \theta, \mu)}. \quad (8)$$

Compared with the gating function $p(z_n|x_n; \psi)$, the posterior provides better estimates of the latent variables by incorporating the supplementary information of the experts. However, we cannot tractably compute the posterior distribution because of the normalization constants $Z(x_n, z_n; \theta, \mu)$. In fact, given the image $x_n$ and the cluster label $z_n$, $Z(x_n, z_n; \theta, \mu)$ sums over the entire dataset, which is prohibitive for large-scale image datasets. We present a simple and analytically tractable estimator to approximate them. Specifically, we maintain a queue $q \in \mathbb{R}^{\nu \times K \times d}$ that stores $\nu$ previous outputs of the teacher network, following MoCo closely. Formally, the estimator $\hat{Z}(\cdot)$ is:

$$\hat{Z}(x_n, z_n; \theta, \mu) = \exp\left(v_{y_n,z_n}^\top (f_{n,z_n} + \mu_{z_n}) / \tau\right) + \sum_{i=1}^{\nu} \exp\left(q_{i,z_n}^\top (f_{n,z_n} + \mu_{z_n}) / \tau\right). \quad (9)$$

The estimator is biased, but its bias decreases as $\nu$ increases, and we can obtain a sufficient amount of embeddings from the queue $q$ efficiently. With Eq. (9), we approximate the posterior as:

$$q(z_n|x_n, y_n; \theta, \psi, \mu) = \frac{p(z_n|x_n; \psi)\, \Phi(x_n, y_n, z_n; \theta, \mu) / \hat{Z}(x_n, z_n; \theta, \mu)}{\sum_{k=1}^{K} p(k|x_n; \psi)\, \Phi(x_n, y_n, k; \theta, \mu) / \hat{Z}(x_n, k; \theta, \mu)}. \quad (10)$$

M-step. We leverage stochastic gradient ascent to optimize the ELBO with respect to the network parameters $\theta, \psi$ and the expert prototypes $\mu$. We approximate the normalization constants appearing in the ELBO in analogy to the E-step:

$$\hat{\mathcal{L}}(\theta, \psi, \mu; x_n, y_n) = \mathbb{E}_{q(z_n|x_n,y_n;\theta,\psi,\mu)}\left[\log \frac{\Phi(x_n, y_n, z_n; \theta, \mu)}{\hat{Z}(x_n, z_n; \theta, \mu)}\right] - D_{\mathrm{KL}}\left(q(z_n|x_n,y_n; \theta, \psi, \mu) \,\|\, p(z_n|x_n; \psi)\right). \quad (11)$$
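The approximate E-step of Eqs. (9)-(10) can be sketched as follows. This is a NumPy illustration with random placeholders for the gating probabilities, embeddings, and queue; the temperature and sizes are arbitrary.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def approx_posterior(gate, v_yn, f_n, mu, queue, tau=1.0):
    """q(z_n | x_n, y_n): gate_k * Phi_k / Z_hat_k, renormalized over K (Eq. 10).
    Z_hat_k is the queue-based estimator of the normalization constant (Eq. 9)."""
    K = gate.shape[0]
    scores = np.empty(K)
    for k in range(K):
        pos = np.exp(v_yn[k] @ (f_n[k] + mu[k]) / tau)                    # Phi, Eq. (6)
        z_hat = pos + np.exp(queue[:, k] @ (f_n[k] + mu[k]) / tau).sum()  # Eq. (9)
        scores[k] = gate[k] * pos / z_hat
    return scores / scores.sum()

rng = np.random.default_rng(0)
K, d, nu = 3, 16, 64
gate = np.full(K, 1.0 / K)                     # gating output p(z|x)
v_yn, f_n, mu = (normalize(rng.normal(size=(K, d))) for _ in range(3))
queue = normalize(rng.normal(size=(nu, K, d))) # stored teacher outputs

q_post = approx_posterior(gate, v_yn, f_n, mu, queue)
```

The resulting distribution refines the gating output with each expert's (approximately normalized) likelihood of the observed surrogate label.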
Sampling a mini-batch $B$ of datapoints, we can construct an efficient stochastic estimator of the ELBO over the full dataset to learn $\theta, \psi$, and $\mu$:

$$\hat{\mathcal{L}}(\theta, \psi, \mu; X, Y) \approx \frac{N}{|B|} \sum_{n \in B} \hat{\mathcal{L}}(\theta, \psi, \mu; x_n, y_n). \quad (12)$$

The update of the prototypes requires additional care, as discussed in many clustering methods (Sculley, 2010; Xie et al., 2016; Yang et al., 2017; Shiran & Weinshall, 2019). Some of them carefully adjust the learning rate of each prototype separately (Sculley, 2010; Yang et al., 2017), which can be very different from the one used for the network parameters. Since evaluating different learning rate schemes on a validation dataset is often infeasible in unsupervised clustering, we employ alternative strategies in MiCE that are free from per-prototype learning rates. As for the expert prototypes, we observe that using only the stochastic update can lead to bad local optima. Therefore, at the end of each training epoch, we apply an additional analytical update derived from the ELBO:

$$\bar{\mu}_k = \sum_{n: \hat{z}_n = k} v_{y_n,k}, \qquad \mu_k = \frac{\bar{\mu}_k}{\|\bar{\mu}_k\|_2}, \quad \forall k, \quad (13)$$

where $\hat{z}_n = \arg\max_k q(k|x_n, y_n; \theta, \psi, \mu)$ is the hard assignment of the cluster label for all $n$. Please refer to Appendix A.2 for the detailed derivation. Intuitively, the analytical update in Eq. (13) considers all the teacher embeddings assigned to the $k$-th cluster, instead of only those in a mini-batch, to avoid bad local optima. Besides, we fix the gating prototypes $\omega$ to a set of pre-defined embeddings to stabilize the training process, since using randomly initialized prototypes may cause unnecessary difficulties in partitioning the dataset if some of them are crowded together. We address this potential issue by using the means of a Max-Mahalanobis distribution (MMD) (Pang et al., 2018), which is a special case of the mixture of Gaussian distribution. The untrainable means of the MMD provide the optimal inter-cluster dispersion (Pang et al., 2020) that stabilizes the gating outputs.
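The analytical update in Eq. (13) amounts to a normalized mean of the teacher embeddings hard-assigned to each cluster. A minimal sketch with random stand-in embeddings (for simplicity each datapoint carries a single teacher embedding rather than one per expert):

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def update_prototypes(v, z_hat, K):
    """mu_k = normalized sum of teacher embeddings assigned to cluster k (Eq. 13)."""
    mu = np.zeros((K, v.shape[1]))
    for k in range(K):
        mu[k] = v[z_hat == k].sum(axis=0)
    return normalize(mu)

rng = np.random.default_rng(0)
N, K, d = 100, 4, 16
v = normalize(rng.normal(size=(N, d)))     # teacher embeddings v_{y_n, k}
z_hat = rng.integers(0, K, size=N)         # hard assignments argmax_k q(k|x_n, y_n)

mu = update_prototypes(v, z_hat, K)
```

Because the update uses all assigned embeddings rather than a mini-batch, it is applied once per epoch rather than per gradient step.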
We provide the algorithm for generating the MMD means in Appendix B and a systematic ablation study in Tab. 3 to investigate the effect of the update rules for $\omega$ and $\mu$. Lastly, we provide a formal proof of the convergence of the EM algorithm in Appendix A.4.

5.3. RELATIONS TO A TWO-STAGE BASELINE

The combination of a contrastive learning method and a clustering method is a natural baseline for MiCE. Our analysis reveals that MiCE is the general form of the two-stage baseline in which we learn the image embeddings with MoCo (He et al., 2020) and subsequently run a spherical k-means algorithm (Dhillon & Modha, 2001) to obtain the cluster labels. On one hand, in the extreme case where $\kappa \to \infty$ (Assumption A3), the student embeddings $f_{n,k}$ and teacher embeddings $v_{y_n,k}$ are identical for different $k$ (Assumption A4), and the class-level information in Eq. (6) is omitted (Assumption A5), we arrive at the same softmax classifier (Eq. (1)) and the InfoNCE loss (Eq. (2)) used by MoCo as a special case of our method. On the other hand, of particular relevance to the analytical update of the expert prototypes (Eq. (13)) is the spherical k-means algorithm (Dhillon & Modha, 2001), which leverages the cosine similarity to cluster $\ell_2$-normalized data (Hornik et al., 2012). In addition to Assumptions A3 and A4, if we assume the unnormalized model is perfectly self-normalized (Assumption A2), using the hard assignment to get the cluster labels together with the analytical update turns out to be a single-iteration spherical k-means algorithm on the teacher embeddings. Please refer to Appendix C for a detailed derivation. The performance of the baseline is limited by the independence of the representation learning stage and the clustering stage. In contrast, MiCE provides a unified framework to align the representation learning and clustering objectives in a principled manner. See a comprehensive comparison in Tab. 1. We evaluate MiCE on four widely adopted natural image datasets: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), STL-10, and ImageNet-Dog (Chang et al., 2017).
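The spherical k-means reduction discussed in Sec. 5.3 can be illustrated with a single-iteration step on ℓ2-normalized embeddings; this is a generic sketch, not the authors' code, and the data is random.

```python
import numpy as np

def spherical_kmeans_step(v, mu):
    """One spherical k-means iteration: assign by cosine similarity,
    then update each prototype as the normalized mean of its members,
    which matches the analytical update of Eq. (13)."""
    assign = np.argmax(v @ mu.T, axis=1)          # hard assignment
    new_mu = np.zeros_like(mu)
    for k in range(mu.shape[0]):
        members = v[assign == k]
        if len(members) > 0:
            s = members.sum(axis=0)
            new_mu[k] = s / np.linalg.norm(s)
        else:
            new_mu[k] = mu[k]                     # keep empty clusters unchanged
    return assign, new_mu

rng = np.random.default_rng(0)
v = rng.normal(size=(200, 16))
v /= np.linalg.norm(v, axis=1, keepdims=True)     # teacher embeddings
mu0 = v[:4].copy()                                # init prototypes from data points
assign, mu1 = spherical_kmeans_step(v, mu0)
```

Under Assumptions A2-A4, MiCE's hard assignment plus analytical update reduces to exactly one such step on the teacher embeddings.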
The experiment settings follow the literature closely (Chang et al., 2017; Wu et al., 2019; Ji et al., 2019; Shiran & Weinshall, 2019; Darlow & Storkey, 2020), and the numbers of clusters are known in advance. The statistics of the datasets are summarized in Tab. 2.

Comparison with the two-stage baseline. Compared to the straightforward combination of MoCo and spherical k-means, MiCE explores the semantic structures of the dataset, which improves the clustering performance. From Tab. 1, we can see that MiCE consistently outperforms the baseline in terms of mean performance, which agrees with the analysis in Sec. 5.3. Specifically, in terms of ACC, we improve upon the strong baseline by 8.7%, 2.7%, and 8.2% on CIFAR-10, CIFAR-100, and ImageNet-Dog, respectively. Taking the measurement variance into consideration, our performance overlaps with MoCo's only on STL-10. We conjecture that the small data size may limit the performance there, as each expert learns from a subset of the data. Nevertheless, the comparison demonstrates the significance of aligning the representation learning and clustering objectives in a unified framework, and we believe that MiCE points out a promising direction for future studies in clustering.

6. EXPERIMENTS

Visualization of the learned embeddings. We visualize the image embeddings produced by the gating network using t-SNE (Maaten & Hinton, 2008) in Fig. 2 . Different colors denote the different ground-truth class labels. At the beginning, the embeddings from distinct classes are indistinguishable. MiCE progressively refines its estimates and ends up with embeddings that show a clear cluster structure. The learned clusters align with the ground-truth semantics well, which verifies the effectiveness of our method. Additional visualizations and the comparisons with MoCo are in Appendix E.

6.2. ABLATION STUDIES

Simplified model (Tab. 3 (left)). We investigate the gating function and the unnormalized model to understand the contributions of different components. Using a simpler latent variable model often deteriorates the performance. (1) With a uniform prior, the experts need extra effort to become specialized in a set of images with shared semantics. (2 & 3) The teacher embedding $v_{y_n}$ is pushed to be close to all expert prototypes at the same time; it may be difficult for the simplified expert to encode the latent semantics while being discriminative. (4) The performance drop shows that the class-level information is essential for the image embeddings to capture the semantic structures of the dataset, even though the learned representations remain discriminative between instances. Without this term, the learned embeddings are mixed up and scattered over the embedding space without a clear cluster structure.

Prototype update rules (Tab. 3 (right)). We also conduct ablation studies to gain insights into the different ways of handling the gating and expert prototypes. We see that (a) without Eq. (13), we may be stuck in bad local optima. As mentioned in Sec. 5.2, a possible reason is that we use the same learning rate for all network parameters and prototypes (Sculley, 2010; Yang et al., 2017), but tuning a separate learning rate for each prototype is impractical in unsupervised clustering. Hence, we derive the analytical update to tackle the issue. As for (b), it shows that the current gradient update rule avoids a potential inconsistency between the expert prototypes and the teacher embeddings during mini-batch training. Lastly, as discussed in Sec. 5.2, compared to using (c) uniformly initialized gating prototypes projected onto the unit sphere, utilizing the means of the MMD slightly improves performance. This also bypasses the potential learning rate issue that may appear in (d).

7. CONCLUSION

We present a principled probabilistic clustering method that conjoins the benefits of the discriminative representations learned by contrastive learning and the semantic structures introduced by the latent mixture model in a unified framework. With a divide-and-conquer principle, MiCE comprises an input-dependent gating function that distributes subtasks to one or a few specialized experts, and $K$ experts that discriminate subsets of images based on instance-level and class-level information. To address the challenging inference and learning problems, we present a scalable variant of the Expectation-Maximization (EM) algorithm, which maximizes the ELBO and is free from any other loss or regularization terms. Moreover, we show that MoCo with spherical k-means, one of the two-stage baselines, is a special case of MiCE under several assumptions. Empirically, MiCE outperforms extensive prior methods and the strong two-stage baseline by a significant margin on several benchmark datasets. For future work, one may explore pretext tasks other than instance discrimination that potentially fit the clustering task. It would also be interesting and important future work to include datasets with a larger number of clusters, such as ImageNet. Besides, being able to obtain semantically meaningful clusters could be beneficial in weakly supervised settings (Zhou, 2018) where quality labels are scarce.

A.2 DERIVATION OF THE ANALYTICAL UPDATE

For each expert prototype $\mu_k$, we maximize the terms of the ELBO that involve $\mu_k$, subject to the constraint $\mu_k^\top \mu_k = 1$. Introducing a Lagrange multiplier $\lambda$, we solve the Lagrangian of the objective function:

$$\operatorname*{argmax}_{\mu_k} \; \lambda(1 - \mu_k^\top \mu_k) + \sum_{n=1}^{N} q(z_n = k|x_n, y_n)\, v_{y_n,k}^\top \mu_k / \tau. \quad (14)$$

We obtain the estimates of $\mu_k$ and $\lambda$ by differentiating the Lagrangian with respect to each and setting the derivatives to zero.
To be specific, for $\mu_k$, we have:

$$\nabla_{\mu_k} \left[ \lambda(1 - \mu_k^\top \mu_k) + \sum_{n=1}^{N} q(z_n = k|x_n, y_n)\, v_{y_n,k}^\top \mu_k / \tau \right] = -2\lambda \mu_k + \sum_{n=1}^{N} q(z_n = k|x_n, y_n)\, v_{y_n,k} / \tau = 0,$$

such that

$$\mu_k = \frac{\sum_{n=1}^{N} q(z_n = k|x_n, y_n)\, v_{y_n,k}}{2\lambda\tau} := \frac{\bar{\mu}_k}{2\lambda\tau}, \quad (15)$$

where we define $\bar{\mu}_k := \sum_{n=1}^{N} q(z_n = k|x_n, y_n)\, v_{y_n,k} := \sum_{n: \hat{z}_n = k} v_{y_n,k}$ for simplicity. Similarly, by differentiating with respect to $\lambda$, we get:

$$\mu_k^\top \mu_k = 1. \quad (16)$$

Substituting Eq. (15) into Eq. (16), we have:

$$1 = \frac{1}{4\tau^2\lambda^2} \bar{\mu}_k^\top \bar{\mu}_k = \frac{1}{4\tau^2\lambda^2} \|\bar{\mu}_k\|^2.$$

Therefore, we obtain the analytical solutions:

$$\lambda = \frac{\|\bar{\mu}_k\|}{2\tau}, \qquad \mu_k = \frac{\bar{\mu}_k}{2\tau\lambda} = \frac{\bar{\mu}_k}{\|\bar{\mu}_k\|}.$$

The last expression is essentially the update for the prototypes of the experts, as presented in Eq. (13) in the main text.
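A quick numerical check of this derivation: among unit vectors, the normalized weighted sum attains the largest value of the objective $\sum_n q_n\, v_n^\top \mu / \tau$. The data below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, tau = 50, 8, 0.5
v = rng.normal(size=(N, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)      # teacher embeddings v_{y_n,k}
q = rng.random(N)                                  # posterior weights q(z_n=k|x_n,y_n)

def objective(mu):
    """The mu_k-dependent part of the ELBO: sum_n q_n * v_n . mu / tau."""
    return (q * (v @ mu)).sum() / tau

mu_bar = (q[:, None] * v).sum(axis=0)
mu_star = mu_bar / np.linalg.norm(mu_bar)          # the analytical solution

# No random unit vector should beat the analytical optimum.
candidates = rng.normal(size=(1000, d))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best_random = max(objective(w) for w in candidates)
```

The objective is linear in $\mu$, so its maximum on the unit sphere is attained in the direction of $\bar{\mu}_k$, exactly as the Lagrangian solution states.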

A.4 CONVERGENCE OF THE PROPOSED EM ALGORITHM

Theorem 1 (Convergence of MiCE). Assume that the log conditional likelihood of the observed data $\log p(Y|X; \theta, \psi, \mu)$ has an upper bound. Then the approximated ELBO $\hat{\mathcal{L}}(\theta, \psi, \mu; X, Y)$ of MiCE is upper bounded, and it converges to a certain value $\hat{\mathcal{L}}^*$ under the proposed EM algorithm.

Proof. For any given variational distribution $q(z_n|x_n, y_n)$, the approximated ELBO we optimize in Eq. (11) can be written as the original ELBO in Eq. (7) plus an additional term:

$$\hat{\mathcal{L}}(\theta, \psi, \mu; x_n, y_n) = \mathbb{E}_{q(z_n|x_n,y_n)}\left[\log \frac{\Phi(x_n, y_n, z_n; \theta, \mu)}{\hat{Z}(x_n, z_n; \theta, \mu)}\right] - D_{\mathrm{KL}}\left(q(z_n|x_n,y_n) \,\|\, p(z_n|x_n; \psi)\right)$$
$$= \mathbb{E}_{q(z_n|x_n,y_n)}\left[\log \frac{\Phi(x_n, y_n, z_n; \theta, \mu)}{Z(x_n, z_n; \theta, \mu)} + \log \frac{Z(x_n, z_n; \theta, \mu)}{\hat{Z}(x_n, z_n; \theta, \mu)}\right] - D_{\mathrm{KL}}\left(q(z_n|x_n,y_n) \,\|\, p(z_n|x_n; \psi)\right)$$
$$= \mathcal{L}(\theta, \psi, \mu; x_n, y_n) + \mathbb{E}_{q(z_n|x_n,y_n)}\left[\log \frac{Z(x_n, z_n; \theta, \mu)}{\hat{Z}(x_n, z_n; \theta, \mu)}\right].$$

Recall that the normalization constant is $Z(x_n, z_n; \theta, \mu) = \sum_{i=1}^{N} \exp(v_{y_i,z_n}^\top (f_{n,z_n} + \mu_{z_n}) / \tau)$ and the approximated normalization constant is $\hat{Z}(x_n, z_n; \theta, \mu) = \sum_{j=1}^{\nu} \exp(q_{j,z_n}^\top (f_{n,z_n} + \mu_{z_n}) / \tau)$, as in Eq. (9), where $q$ is a queue that stores the outputs of the teacher network as described in the main text. If we use the teacher embeddings of the entire dataset as the queue, then the two normalization constants cancel each other and we recover the original ELBO. In such cases, the convergence can be proved following the standard EM algorithm (Dempster et al., 1977; Wu, 1983). Otherwise, we can still bound the approximated ELBO. Since all student and teacher embeddings and expert prototypes are $\ell_2$-normalized, we can bound the log-ratio term independently of the choice of the queue:

$$-\frac{2}{\tau} \leq v_{y_i,z_n}^\top (f_{n,z_n} + \mu_{z_n}) / \tau \leq \frac{2}{\tau},$$

which implies $Z(x_n, z_n; \theta, \mu) \leq N \exp(2/\tau)$ and $\hat{Z}(x_n, z_n; \theta, \mu) \geq \nu \exp(-2/\tau)$.
Given that both are positive, we get:

$$\log \frac{Z(x_n, z_n; \theta, \mu)}{\hat{Z}(x_n, z_n; \theta, \mu)} \leq \log N - \log \nu + \frac{4}{\tau}.$$

Therefore, the approximated ELBO is bounded by:

$$\hat{\mathcal{L}}(\theta, \psi, \mu; x_n, y_n) = \mathcal{L}(\theta, \psi, \mu; x_n, y_n) + \mathbb{E}_{q(z_n|x_n,y_n)}\left[\log \frac{Z(x_n, z_n; \theta, \mu)}{\hat{Z}(x_n, z_n; \theta, \mu)}\right] \leq \log p(y_n|x_n; \theta, \psi, \mu) + \log N - \log \nu + \frac{4}{\tau}.$$

Similar to the proof of the original EM algorithm (Wu, 1983), since the approximated ELBO does not decrease in expectation during training, the approximated ELBO of MiCE converges to a certain value $\hat{\mathcal{L}}^*$. There are two remarks on Theorem 1: (1) the convergence of the proposed EM relies on assumptions similar to those of the standard EM algorithm; (2) due to the extra log-ratio term in the approximated ELBO, further analysis is needed to determine the exact convergence point compared to MiCE learning with the standard EM algorithm.

B ALGORITHM FOR GENERATING THE CENTERS OF MMD

We provide the algorithm for generating the centers of the Max-Mahalanobis distribution (MMD) in Algorithm 2. The gating prototypes ω are fixed to these centers during training. The algorithm closely follows the one proposed by Pang et al. (2020). For a dataset with K ground-truth classes, we generate K centers, all $\ell_2$-normalized in $\mathbb{R}^d$. Please kindly note that the algorithm requires $K \le d + 1$ (Pang et al., 2020).

Algorithm 2: Crafting the gating prototypes ω as the centers of the MMD
Input: the dimension d of each gating prototype and the number of clusters K.
Initialization: initialize the first prototype $\omega_1$ with the first unit basis vector $e_1 \in \mathbb{R}^d$; initialize the remaining prototypes $\omega_i$, $i \neq 1$, with the zero vector $\mathbf{0}_d \in \mathbb{R}^d$.
1 for i = 2 to K do
2   for j = 1 to i − 1 do
3     $\omega_{ij} = -\big[1/(K-1) + \omega_i^\top \omega_j\big]/\omega_{jj}$
4   end
5   $\omega_{ii} = \sqrt{1 - \|\omega_i\|_2^2}$
6 end
Return: the gating prototypes $\omega = \{\omega_i\}_{i=1}^{K}$.
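Algorithm 2 can be sketched in plain Python as follows. This is a minimal reference implementation under the stated constraint $K \le d + 1$; the function name and the list-of-lists representation are ours. The resulting centers are unit-norm and have pairwise inner products equal to $-1/(K-1)$, i.e., they are maximally separated on the hypersphere.

```python
import math

def mmd_centers(K, d):
    """Craft K unit-norm gating prototypes in R^d whose pairwise inner
    products all equal -1/(K-1), following Algorithm 2 (Pang et al., 2020).
    Requires K <= d + 1."""
    assert K <= d + 1
    omega = [[0.0] * d for _ in range(K)]
    omega[0][0] = 1.0  # first prototype is the unit basis vector e_1
    for i in range(1, K):
        for j in range(i):
            # omega[i] is only partially filled here, matching the
            # component-wise construction of Algorithm 2
            dot = sum(a * b for a, b in zip(omega[i], omega[j]))
            omega[i][j] = -(1.0 / (K - 1) + dot) / omega[j][j]
        norm_sq = sum(x * x for x in omega[i])
        omega[i][i] = math.sqrt(max(1.0 - norm_sq, 0.0))  # guard fp round-off
    return omega
```

For example, `mmd_centers(2, 2)` yields the antipodal pair (1, 0) and (−1, 0), whose inner product is −1 = −1/(K−1).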

C RELATIONS TO THE TWO-STAGE BASELINE

C.1 CONTRASTIVE LEARNING

Assumption A3. The gating temperature $\kappa \to \infty$, such that the prior is uniform: $p(z_n = k|x_n) = 1/K$ for all $k$ and $n$.

Assumption A4. There is only a single output layer for the student network $f_\theta$ and the teacher network $f_{\theta'}$, respectively. The resulting embeddings are shared across the K experts; that is, for all possible cases, $\Phi(x_n, y_n, k) = \exp\big(v_{y_n}^\top (f_n + \mu_k)/\tau\big)$, where $f_n = f_\theta(x_n) \in \mathbb{R}^d$ and $v_{y_n} = f_{\theta'}(x_n) \in \mathbb{R}^d$.

Assumption A5. The unnormalized model $\Phi(\cdot)$ considers only the instance-level information, i.e., for all possible cases, $\Phi(x_n, y_n, z_n) = \exp\big(v_{y_n, z_n}^\top f_{n, z_n}/\tau\big)$.

Theorem 2. If Assumptions A3–A5 hold, the overall output of MiCE becomes

$$
p(y_n|x_n) = \frac{\exp\big(v_{y_n}^\top f_n/\tau\big)}{\sum_{i=1}^{N} \exp\big(v_i^\top f_n/\tau\big)}.
$$

Note that the numbers of clusters are known in advance, as in Chang et al. (2017); Ji et al. (2019); Wu et al. (2019); Shiran & Weinshall (2019). The statistics of the datasets are summarized in Tab. 2. We adopt three common metrics to evaluate the clustering performance, namely normalized mutual information (NMI), cluster accuracy (ACC), and adjusted rand index (ARI). All metrics are reported in percentage (%).

Regarding the network architecture, MiCE mainly uses a ResNet-34 (He et al., 2016) as the backbone, following the recent methods (Ji et al., 2019; Shiran & Weinshall, 2019) for fair comparisons. For the gating network, the output layer is replaced by a single fully connected layer that generates an $\ell_2$-normalized embedding in $\mathbb{R}^{128}$. The student network uses the same ResNet-34 backbone as the gating network and includes K additional fully connected layers that map the images to K embeddings. Therefore, the parameters of the student and gating networks are shared except for the output layers. The teacher network $f_{\theta'}$ is the exponential moving average (EMA) version of the student network $f_\theta$, which stabilizes the learning process (He et al., 2020; Tarvainen & Valpola, 2017).
The update rule follows $\theta' \leftarrow m\theta' + (1 - m)\theta$, with $m \in [0, 1)$ being the smoothing coefficient; in practice, we set m = 0.999 following MoCo. Since the images of CIFAR-10 and CIFAR-100 are smaller than ImageNet images, following Chen et al. (2020), we replace the first 7x7 Conv of stride 2 with a 3x3 Conv of stride 1 for all experiments on CIFAR-10 and CIFAR-100. The first max-pooling operation is removed as well (Wu et al., 2018; Chen et al., 2020; Ye et al., 2019). Please kindly note that if the first max-pooling operation is not removed, MiCE can still achieve 83.6% ACC on CIFAR-10. For fair comparisons, MoCo also uses a ResNet-34 with a 128-dimensional output and follows the same hyper-parameter settings as MiCE.

As it is often infeasible to tune hyper-parameters with a validation dataset in real-world clustering tasks (Ghasedi Dizaji et al., 2017), we set both temperatures τ and κ to 1.0. The queue size ν is set to 12800 for STL-10 because of the smaller data size, and to 16384 for the other three datasets. The data augmentation follows MoCo closely: before being passed into any of the embedding networks, images are randomly resized and cropped to the same size, followed by random gray-scale, random color jittering, and random horizontal flip. For a fair comparison, MoCo in the two-stage baseline also uses a ResNet-34 backbone. For all datasets, we use a batch size of 256. Note that the data augmentation strategy is critical to contrastive learning methods and MiCE, and we follow the one used by MoCo for fairness.

In terms of the optimization details, we use stochastic gradient descent (SGD) on the negative ELBO, with a weight decay of 0.0001 and a momentum of 0.9 (He et al., 2020). The learning rate is initialized to 1.0 and multiplied by 0.1 at three different epochs. The number of training epochs differs across datasets to accommodate the data size while keeping a similar and reasonable training time.
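The EMA update of the teacher parameters described above can be sketched as follows. This is an illustrative fragment under simplifying assumptions (the function name is ours, and parameters are represented as flat lists of floats rather than network tensors); in practice the same rule is applied tensor-by-tensor, as in Algorithm 1.

```python
def ema_update(teacher_params, student_params, m=0.999):
    """Momentum (EMA) update: theta' <- m * theta' + (1 - m) * theta.
    With m = 0.999 (the MoCo setting used by MiCE), the teacher changes
    slowly, which stabilizes the targets for the student."""
    return [m * t + (1.0 - m) * s
            for t, s in zip(teacher_params, student_params)]
```

A smoothing coefficient close to 1 means each student step moves the teacher only slightly, so the teacher embeddings stored in the queue stay consistent over many iterations.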
For CIFAR-10/100, we train for 1000 epochs in total and multiply the learning rate by 0.1 at epochs 480, 640, and 800. For STL-10, we train for 6000 epochs and multiply the learning rate by 0.1 at epochs 3000, 4000, and 5000. Lastly, for ImageNet-Dog, we train for 3000 epochs and multiply the learning rate by 0.1 at epochs 1500, 2000, and 2500. The learning rate for the expert prototypes is the same as the one for the network parameters. All experiments are trained on a single GPU. For all experiment settings in the main text, we closely follow the recent methods (Wu et al., 2019; Ji et al., 2019; Shiran & Weinshall, 2019), in which the models are trained from scratch. This setting differs from some methods, including VaDE (Jiang et al., 2016), DGG (Yang et al., 2019), and LTVAE (Li et al., 2019), in two aspects. Firstly, we do not use any form of pre-trained model. Secondly, we focus on a purely unsupervised setting. In contrast, VaDE (Jiang et al., 2016) and DGG (Yang et al., 2019) use a supervised ImageNet pre-trained model for STL-10. For fairness, in the original submission, we compare with many previous methods that use the same settings.
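The multi-step schedule above can be expressed as a small helper; this is a sketch (function name and signature are ours), with the CIFAR-10/100 milestones as defaults.

```python
def lr_at_epoch(epoch, base_lr=1.0, milestones=(480, 640, 800), gamma=0.1):
    """Learning rate under a multi-step schedule: start at base_lr and
    multiply by gamma at each milestone epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

Passing `milestones=(3000, 4000, 5000)` or `(1500, 2000, 2500)` reproduces the STL-10 and ImageNet-Dog schedules, respectively.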

E ADDITIONAL EXPERIMENTS AND VISUALIZATIONS

Posterior distribution and cluster predictions. From the left side of Fig. 3, we can see that in the initial stage (epoch 1), MiCE is not yet certain about the cluster label of any given image. At the end of training (epoch 1000), most values of the approximated posterior distribution fall in the [0, 0.1) interval, indicating that the model is confident that images do not belong to those clusters. After training, the learned model generates sparse posterior distributions. The predicted cluster labels are also balanced across the clusters, as shown on the right side of Fig. 3.

Visualization of embeddings of MiCE and MoCo. We present the t-SNE visualizations of the embeddings learned by MiCE and MoCo in Fig. 4 and Fig. 5 to investigate whether the cluster structure and the latent semantics are captured by the models. The two figures differ in how we color the datapoints. In Fig. 4, different colors represent different cluster labels predicted by the clustering methods. The predicted cluster labels of MiCE are the hard assignments derived from the posterior distributions, and the cluster labels for MoCo are the outputs of the spherical k-means algorithm. MiCE learns a distinct structure by the end of training, where each expert is mainly responsible for one cluster. By comparing Fig. 4 (c) and (f), we can see that the boundaries between the clusters learned by spherical k-means do not match the structure learned by MoCo well. The divergence is caused by the independence of representation learning and clustering; MiCE solves this issue with a unified framework. In Fig. 5, the datapoints are colored according to the ground-truth class labels, which are unknown to the models. In Fig. 5 (c), the cluster structure is highly aligned with the underlying semantics: most clusters are filled with images from the same classes, while some difficult images lie mostly on the boundaries. This verifies that the gating network learns to divide the dataset based on the latent semantics and allocates each image to one or a few experts. Please kindly note that we use the embeddings from the gating network instead of the student network for simplicity, since the embeddings then come from a single output head. In contrast, the cluster structure learned by MoCo does not align with the semantics well. For all the above t-SNE visualizations, we set the perplexity hyper-parameter to 200.

Training time. We present the training time of MiCE and MoCo on CIFAR-10. It takes around 17 and 30 hours to train MoCo and MiCE for 1000 epochs, respectively. For all four datasets, experiments are conducted on a single GPU (NVIDIA GeForce GTX 1080 Ti).

Extra ablation studies on updating the expert prototypes. A bad specification of the expert prototypes can lead to poor results if not properly handled. Specifically, in Tab. 3 (a), we see that the ACC on CIFAR-100 is only 21.3%. We empirically verify two principled methods that solve the issue: (1) extra end-of-epoch training on µ alone with stochastic gradient ascent, or (2) using Eq. (13). The first method works as follows: at the end of each epoch, we update µ while fixing the network parameters until the pre-defined convergence criteria are met. Convergence can be determined based on either the norm of the prototypes or the change in the objective function. To control the training time, we stop the update at the current epoch once we have iterated through the entire dataset 20 times. With this additional training, we achieve 42.3% ACC. The result shows that with a proper update of µ, the proposed model can identify semantic clusters with stochastic gradient ascent alone.
However, it requires more than 10 times the training time (around 394 hours). The above discussion manifests the benefit of using Eq. (13), which is derived from the same objective function. With the prototypical update, we achieve similar results with minimal computational overhead: on average, 42.2% ACC on CIFAR-100, as shown in Sec. 6.

Since SCAN involves pre-training using SimCLR (Chen et al., 2020), it takes additional advantage when compared directly to other methods without pre-training. Thus, we fine-tune the MiCE model with the same training protocol described in Appendix D (except that the learning rate can be smaller). We find that MiCE obtains higher results and outperforms SCAN, as shown in the second sector of Tab. 4. To be specific, in the fine-tuning stage, we load the network parameters from the pre-trained model and use a smaller initial learning rate of 0.1, with all other settings unchanged. We show the augmentation strategy in parentheses if it differs from the one used in the pre-training stage. For RandAugment (Cubuk et al., 2020), we follow the implementation (and hyper-parameters) of the public code provided by SCAN. As mentioned in Van Gansbeke et al. (2020), the self-labeling stage requires a shift in the augmentation; otherwise, it leads to a degenerate solution. In contrast, MiCE with pre-training is not prone to degeneracy and obtains comparable or better performance than SCAN regardless of the choice of augmentation strategy.

F ADDITIONAL DISCUSSIONS

The number of experts. In cases where the number of experts L differs from the number of ground-truth clusters K, the model partitions the dataset into L subsets instead of K. Even though the number of experts is currently tied to K, this is not a drawback and does not prevent application to the common clustering settings. Also, MiCE does not use additional knowledge compared to the baseline methods. If the ground-truth K is not known, we may treat K as a hyper-parameter and choose it following the methods described in Smyth (2000); McLachlan & Peel (2004), which is worth investigating in the future.

Overclustering. The overclustering technique (Ji et al., 2019) is orthogonal to our method and can be applied with minor adaptations. However, it may require additional hyper-parameter tuning to ensure that overclustering improves the results. From the supplementary file provided by IIC (Ji et al., 2019), we see that the numbers of clusters (for overclustering) are set differently for different datasets. Although overclustering is an interesting technique, we would like to highlight the simplicity of the current version of MiCE.

Similarity between the two sets of prototypes. We do not expect the two sets of prototypes to be similar, even though their dimensions are the same. In fact, they serve different aims: the gating prototypes divide the dataset into simpler subtasks for the experts, while the expert prototypes help the experts solve the instance discrimination subtasks by introducing class-level information. Therefore, the derived ELBO objective does not encourage the two sets of prototypes to be similar or to maintain a clear correspondence between them. Empirically, we calculate the cosine similarity between every pair of µ and ω; the absolute values are all below 0.25, which shows that they are not similar. If we forced them to be similar during training, the performance might be negatively affected due to the lack of flexibility.



The value of the surrogate label can be regarded as the index of the image. Due to the summation over the entire dataset in the denominator, it can be computationally prohibitive to obtain the maximum likelihood estimate (MLE) of the parameters (Ma & Collins, 2018). Even though the bias does not vanish with the use of the queue, we find that the approximation works well empirically.
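The trade-off described above, an exact softmax denominator over all N teacher embeddings versus a queue-based approximation over ν entries, can be sketched numerically. This is a simplified single-embedding illustration (the function name is ours, and the class-level prototype term is omitted): with `keys` set to the whole dataset's teacher embeddings it computes the exact log-partition $\log Z$, and with `keys` set to the queue it computes $\log \hat{Z}$.

```python
import math

def log_partition(query, keys, tau=1.0):
    """log sum_i exp(<key_i, query> / tau), computed stably with the
    max-subtraction trick. Passing all N teacher embeddings gives the
    exact (prohibitive) constant; passing the queue gives the approximation."""
    logits = [sum(a * b for a, b in zip(k, query)) / tau for k in keys]
    mx = max(logits)
    return mx + math.log(sum(math.exp(l - mx) for l in logits))
```

Because every summand is positive, the queue-based value computed over a subset of the keys is always a lower bound on the exact one, which is the source of the bias noted above.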



Figure 1: An illustration of MiCE with three experts. Best viewed in color.

Figure 2: Visualization of the image embeddings of MiCE on CIFAR-10 with t-SNE. Different colors denote the different ground-truth class labels (unknown to the model). The cluster ACC is shown in parentheses. MiCE learns a clear cluster structure that matches the latent semantics well. Best viewed in color.

PSEUDOCODE OF MICE

Algorithm 1: Pseudocode of MiCE in a PyTorch-like style

# encoder_s, encoder_t, encoder_g: student, teacher, and gating network respectively
# K: number of clusters; D: embedding size
# omega: gating prototypes fixed to the centers of the MMD (KxD)
# mu: expert prototypes (KxD)
# queue: dictionary as a queue of V embeddings (VxKxD)
# m: momentum
# tau, kappa: temperatures for the experts and the gating function respectively

encoder_t.params = encoder_s.params  # initialize
mu_hat = zeros((K, D))
for x in loader:  # load a mini-batch x with B samples
    x_f = aug(x)  # a randomly augmented version
    x_v = aug(x)  # another randomly augmented version
    x_g = aug(x)  # a third randomly augmented version

    f = encoder_s.forward(x_f)  # student embeddings: BxKxD
    v = encoder_t.forward(x_v)  # teacher embeddings: BxKxD
    v = v.detach()  # no gradient to teacher embeddings
    g = encoder_g.forward(x_g)  # gating embeddings: BxD

    v_f = einsum("BKD,BKD->BK", [v, f])  # instance-level information
    v_mu = einsum("BKD,KD->BK", [v, l2normalize(mu)])  # class-level information
    l_pos = v_f + v_mu  # positive logits: BxK
    queue_f = einsum("VKD,BKD->BKV", [queue.detach().clone(), f])
    queue_mu = einsum("VKD,KD->KV", [queue, l2normalize(mu)])
    l_neg = queue_f + queue_mu.view(1, K, V)  # negative logits: BxKxV

    expert_logits = cat([l_pos.view(B, K, 1), l_neg], dim=2) / tau  # BxKx(1+V)
    experts = Softmax(expert_logits, dim=2)[:, :, 0]  # BxK
    gating = Softmax(einsum("KD,BD->BK", [omega.detach().clone(), g]) / kappa, dim=1)  # BxK

    # E-step: compute the variational distribution
    variational_q = einsum("BK,BK->BK", [gating, experts])
    variational_q /= variational_q.sum(dim=1, keepdim=True)  # BxK

    # M-step: maximize the ELBO
    ELBO = sum(variational_q * (log(gating) + log(experts) - log(variational_q))) / B
    loss = -ELBO

    # SGD update: student and gating networks
    loss.backward()
    update(encoder_s.params)
    update(encoder_g.params)

    # momentum update: teacher network
    encoder_t.params = m * encoder_t.params + (1 - m) * encoder_s.params

    # update the queue
    enqueue(queue, v)  # enqueue the current mini-batch
    dequeue(queue)  # dequeue the earliest mini-batch

    # aggregate the teacher embeddings
    hard_assignments = argmax(variational_q, dim=1)  # B
    mu_hat += einsum("BKD,BK->KD", [v, oneHot(hard_assignments)])

# update the expert prototypes
mu = mu_hat

einsum: Einstein summation; cat: concatenation; oneHot: one-hot encoding; Softmax: softmax function; l2normalize: normalization with the ℓ2-norm
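The end-of-epoch prototype aggregation in Algorithm 1 can be sketched in plain Python. This is an illustrative fragment (the function name is ours, embeddings are lists of floats, and the single-expert-head simplification is for clarity): each expert prototype is the sum of the teacher embeddings of the instances hard-assigned to it, ℓ2-normalized before being used in the expert logits.

```python
import math

def update_expert_prototypes(teacher_embs, hard_assignments, K):
    """Aggregate the teacher embeddings of the instances hard-assigned to
    each cluster, then l2-normalize each prototype (empty clusters are left
    as zero vectors)."""
    d = len(teacher_embs[0])
    mu = [[0.0] * d for _ in range(K)]
    for v, k in zip(teacher_embs, hard_assignments):
        for j in range(d):
            mu[k][j] += v[j]
    for k in range(K):
        norm = math.sqrt(sum(x * x for x in mu[k]))
        if norm > 0:
            mu[k] = [x / norm for x in mu[k]]
    return mu
```

Since the normalized sum of unit vectors is their spherical mean direction, this update plays the role of the prototype-recomputation step discussed in Appendix C.2.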

Figure 3: The histogram of (left) the approximate posterior distributions of MiCE at the initial and final stages of training, and (right) the predicted cluster labels obtained during testing. Here, we train and evaluate the model using CIFAR-10, which has 10 classes and 60,000 images. We divide the probability distribution into 10 discrete bins. Best viewed in color.

Figure 4: Visualization of the image embeddings of MiCE (upper row) and MoCo (lower row) on CIFAR-10 with t-SNE. Different colors correspond to various cluster labels obtained based on the posterior distribution (MiCE) or spherical k-means (MoCo). The embeddings of MiCE depict a clear cluster structure, as shown in (c). In contrast, the structure in (f) is ambiguous for a large portion of data and it does not match the cluster labels well.

Figure 5: Visualization of the image embeddings of MiCE (upper row) and MoCo (lower row) on CIFAR-10 with t-SNE. Different colors denote the different ground-truth class labels (unknown to the model). Comparing to MoCo, the clusters learned by MiCE better correspond with the underlying class semantics.

Unsupervised clustering performance of different methods on three datasets. The first sector presents the results from the literature; the latter ones display the results of the baseline and the proposed MiCE. In the last two sectors, the bold results indicate the highest values. Methods with the legend † are the ones that require post-processing by k-means to obtain the clusters, since they do not learn the clustering function directly; the exception is MoCo, for which we use spherical k-means. We calculate the mean and standard deviation (Std.) of MiCE and MoCo based on five runs.

Statistics of the datasets.

Ablations of MiCE on the probabilistic model (left) and different ways of learning the gating and expert prototypes (right). Each row shows the ACC (%) on CIFAR-100 when applying a single change to MiCE. The assumptions are detailed in Appendix C.

Comparing the cluster accuracy ACC (%) of SCAN (Van Gansbeke et al., 2020) and MiCE on CIFAR-10. Following SCAN, we show the data augmentation strategy in parentheses if it is different from the one MiCE and MoCo use. "SimCLR" indicates the augmentation strategy used in Chen et al. (2020), and "RA" is RandAugment (Cubuk et al., 2020). For a fair comparison, the first sector compares MiCE to SCAN without the self-labeling step, and the second sector compares SCAN with self-labeling to MiCE with pre-training.

Comparing MiCE to SCAN (Van Gansbeke et al., 2020). We provide additional experimental results to compare with SCAN (Van Gansbeke et al., 2020), which uses unsupervised pre-training, under a comparable experimental setting. As SCAN adopts a three-step training procedure, we think it provides additional insight to compare MiCE to SCAN at the steps after the pre-training step. The detailed results on CIFAR-10 are presented in Tab. 4. Firstly, we focus on SCAN with two steps of training, which MiCE outperforms: SCAN obtains 78.7% and 81.8% on CIFAR-10 with the SimCLR augmentation and RandAugment, respectively, whereas MiCE achieves 83.4% even though we use the weaker augmentation strategy of MoCo (He et al., 2020) and InstDisc (Wu et al., 2018).

8. ACKNOWLEDGEMENTS

The authors would like to thank Tianyu Pang and Zihao Wang for the discussion and the reviewers for the valuable suggestions. This work was supported by NSFC Projects (Nos. 62061136001, 61620106010, 62076145), Beijing NSF Project (No. JQ19016), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, and the NVIDIA NVAIL Program with GPU/DGX Acceleration. C. Li was supported by the fellowship of China postdoctoral Science Foundation (2020M680572), and the fellowship of China national postdoctoral program for innovative talents (BX20190172) and Shuimu Tsinghua Scholar.

A INFERENCE AND LEARNING

A.1 DERIVATION OF THE ELBO

$$
\begin{aligned}
\log p(\mathbf{Y}|\mathbf{X}; \theta, \psi, \mu)
&= \mathbb{E}_{q(\mathbf{Z}|\mathbf{X},\mathbf{Y})}\left[\log \frac{p(\mathbf{Y}, \mathbf{Z}|\mathbf{X}; \theta, \psi, \mu)}{q(\mathbf{Z}|\mathbf{X}, \mathbf{Y})}\right] + D_{\mathrm{KL}}\big(q(\mathbf{Z}|\mathbf{X}, \mathbf{Y})\,\|\,p(\mathbf{Z}|\mathbf{X}, \mathbf{Y}; \theta, \psi, \mu)\big) \\
&\ge \mathcal{L}(\theta, \psi, \mu; \mathbf{X}, \mathbf{Y}) := \mathbb{E}_{q(\mathbf{Z}|\mathbf{X},\mathbf{Y})}\left[\log \frac{p(\mathbf{Y}, \mathbf{Z}|\mathbf{X}; \theta, \psi, \mu)}{q(\mathbf{Z}|\mathbf{X}, \mathbf{Y})}\right] \\
&= \mathbb{E}_{q(\mathbf{Z}|\mathbf{X},\mathbf{Y})}[\log p(\mathbf{Y}, \mathbf{Z}|\mathbf{X}; \theta, \psi, \mu)] - \mathbb{E}_{q(\mathbf{Z}|\mathbf{X},\mathbf{Y})}[\log q(\mathbf{Z}|\mathbf{X}, \mathbf{Y})].
\end{aligned}
$$

A.2 DERIVATION OF EQ. (13)

Assumption A1. Under the hard assignment, the variational distribution is one-hot, i.e., $q(z_n = k|x_n, y_n) = \mathbb{1}[k = \hat{z}_n]$, where $\hat{z}_n = \arg\max_k q(z_n = k|x_n, y_n)$ as in Algorithm 1.

Assumption A2. For all possible inputs, the unnormalized model $\Phi(x_n, y_n, z_n)$ is self-normalized, similar to Wu et al. (2018), i.e., $Z(x_n, z_n)$ equals a constant c, so there is no need to approximate the normalization constant.

Here, we derive the analytical update of the expert prototypes based on the ELBO. Under Assumptions A1 and A2, we can update $\mu_k$ by maximizing the ELBO with respect to $\mu_k$ subject to the $\ell_2$-norm constraint; solving the resulting Lagrangian (as in Appendix C.2) yields Eq. (13), which aggregates the teacher embeddings of the instances hard-assigned to expert k, as in Algorithm 1.

The simplified model shown in Theorem 2 is essentially the non-parametric softmax classifier used by MoCo and InstDisc, which is also related to some recent contrastive learning methods (Ye et al., 2019; Bachman et al., 2019). Note that InstDisc adopts a slightly different implementation, in which the teacher network is identical to the student network and the loss function is based on NCE (Gutmann & Hyvärinen, 2010). For detailed comparisons, please refer to He et al. (2020).

Lemma 1. If Assumptions A3–A5 hold, the posterior is uniformly distributed.

Assumption A6. The normalization constant is computationally tractable, such that the variational distribution can be made identical to the posterior distribution.

Theorem 3. Given that A3–A6 hold, the variational distribution equals the posterior, and the tractable version of the ELBO is identical in form to the InfoNCE (Oord et al., 2018) loss used by MoCo.

According to Theorems 2 and 3, MoCo can be viewed as a special case of the proposed MiCE; in other words, they learn the same representations under the same experimental setting if the above assumptions hold.
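The E-step implied by the derivation above combines the gating prior and the expert likelihood, exactly as in Algorithm 1: $q(z_n = k|x_n, y_n) \propto p(z_n = k|x_n) \cdot p(y_n|x_n, z_n = k)$. A minimal numeric sketch (function name ours, probabilities as plain lists):

```python
def variational_posterior(gating_probs, expert_probs):
    """E-step of Algorithm 1: the responsibility of expert k is the gating
    probability times the expert's probability of identifying the positive
    key, renormalized over the K experts."""
    unnorm = [g * e for g, e in zip(gating_probs, expert_probs)]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

With a uniform gating distribution (Assumption A3), the responsibilities reduce to the renormalized expert probabilities, which is the degenerate case analyzed in Appendix C.1.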

C.2 SPHERICAL k-MEANS

Theorem 4. If Assumptions A1–A4 hold, the analytical update of the expert prototypes is equivalent to a single iteration of the spherical k-means algorithm on the teacher embeddings.

Proof. Under the assumptions, the variational distribution reduces to a hard assignment determined by the cosine similarity between the teacher embedding and the expert prototypes. With the simplified expert model, we follow the previous discussion on Eq. (14) and form the Lagrangian of the objective function; its analytical solution shows that the cluster assignment step and the prototype update rule are the same as those of spherical k-means.
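The single spherical k-means iteration referred to in Theorem 4 can be sketched as follows. This is an illustrative implementation under the stated assumptions (function name ours, embeddings assumed unit-norm, lists of floats): assign each teacher embedding to its most similar prototype by cosine similarity, then recompute each prototype as the ℓ2-normalized sum of its assigned embeddings.

```python
import math

def spherical_kmeans_step(embs, prototypes):
    """One iteration of spherical k-means: cosine-similarity assignment
    followed by the normalized-mean prototype update."""
    K, d = len(prototypes), len(prototypes[0])
    assign = []
    for v in embs:
        sims = [sum(a * b for a, b in zip(v, mu)) for mu in prototypes]
        assign.append(max(range(K), key=lambda k: sims[k]))
    new_protos = [[0.0] * d for _ in range(K)]
    for v, k in zip(embs, assign):
        for j in range(d):
            new_protos[k][j] += v[j]
    for k in range(K):
        norm = math.sqrt(sum(x * x for x in new_protos[k]))
        if norm > 0:
            new_protos[k] = [x / norm for x in new_protos[k]]
        else:
            new_protos[k] = prototypes[k]  # keep empty clusters unchanged
    return assign, new_protos
```

Note how the assignment step matches the hard-assignment variational distribution (Assumption A1) and the update step matches the analytical prototype update of Eq. (13).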

D EXPERIMENT SETTINGS

We mainly compare with the methods that are trained from scratch without using pre-trained models, and the experiment settings follow the literature closely (Chang et al., 2017; Wu et al., 2019; Ji et al., 2019; Shiran & Weinshall, 2019; Darlow & Storkey, 2020). For CIFAR-10, CIFAR-100, and STL-10, all the training and test images are jointly utilized, and the 20 superclasses of CIFAR-100 are used instead of the fine labels. The 15 classes of dog images are selected from the ILSVRC2012 1K (Deng et al., 2009) dataset and resized to 96 × 96 × 3 to form the ImageNet-Dog dataset (Chang et al., 2017; Wu et al., 2019).

