A PROBABILISTIC APPROACH TO CONSTRAINED DEEP CLUSTERING

Abstract

Clustering with constraints has gained significant attention in the field of semi-supervised machine learning, as it can leverage partial prior information on a growing amount of unlabelled data. Following recent advances in deep generative models, we derive a novel probabilistic approach to constrained clustering that can be trained efficiently in the framework of stochastic gradient variational Bayes. In contrast to existing approaches, our model (CVaDE) uncovers the underlying distribution of the data conditioned on prior clustering preferences, expressed as pairwise constraints. The inclusion of such constraints allows the user to guide the clustering process towards a desirable partition of the data by indicating which samples should or should not belong to the same class. We provide extensive experiments to demonstrate that CVaDE shows superior clustering performance and robustness compared to state-of-the-art deep constrained clustering methods on a variety of data sets. We further demonstrate the usefulness of our approach on challenging real-world medical applications and face image generation.

1. INTRODUCTION

The ever-growing amount of data and the time cost associated with its labeling have made clustering a relevant task in the field of machine learning. Yet, in many cases, a fully unsupervised clustering algorithm might naturally find a solution that is not consistent with the domain knowledge (Basu et al., 2008). In medicine, for example, clustering could be driven by unwanted bias, such as the type of machine used to record the data, rather than by more informative features. Moreover, practitioners often have access to prior information about the types of clusters that are sought, and a principled method to guide the algorithm towards a desirable configuration is then needed. Constrained clustering, therefore, has a long history in machine learning, as it enforces desirable clustering properties by incorporating domain knowledge, in the form of constraints, into the clustering objective. Following recent advances in deep clustering, constrained clustering algorithms have recently been used in combination with deep neural networks (DNNs) to obtain better representations of high-dimensional data sets. The methods proposed so far mainly extend some of the most widely used deep clustering algorithms, such as DEC (Xie et al., 2016), to include a variety of loss functions that force the clustering process to be consistent with the given constraints (Ren et al., 2019; Shukla et al., 2018; Zhang et al., 2019b). Although they perform well, none of the above methods model the data generative process. As a result, they can neither uncover the underlying structure of the data, nor control the strength of the clustering preferences, nor generate new samples (Min et al., 2018). To address these issues, we propose a novel probabilistic approach to constrained clustering, the Constrained Variational Deep Embedding (CVaDE), that uncovers the underlying data distribution conditioned on domain knowledge, expressed in the form of pairwise constraints.
Our method extends previous work in unsupervised variational deep clustering (Jiang et al., 2017; Dilokthanakul et al., 2016) to incorporate clustering preferences as Bayesian prior probabilities with varying degrees of uncertainty. This allows systematic reasoning about parameter uncertainty (Zhang et al., 2019a), enabling Bayesian model validation, outlier detection and data generation. By integrating prior information into the generative process of the data, our model can guide the clustering process towards the configuration sought by the practitioners. Our main contributions are as follows: (i) We propose a constrained clustering method (CVaDE) that incorporates given clustering preferences, with varying degrees of certainty, within the Variational Auto-Encoder (VAE) framework. (ii) We provide a thorough empirical assessment of our model. In particular, we show that (a) a small fraction of prior information remarkably increases the performance of CVaDE compared to unsupervised variational clustering methods, (b) our model shows superior clustering performance compared to state-of-the-art deep constrained clustering models on a wide range of data sets, and (c) our model proves to be robust against noise, as it can easily incorporate the uncertainty of the given constraints. (iii) We show that our model can drive the clustering process towards different desirable configurations, depending on the constraints used, and that it successfully generates new samples on challenging real-world image data.

2. THEORETICAL BACKGROUND & RELATED WORK

Constrained Clustering. A constrained clustering problem differs from the classical clustering scenario in that the user has access to pre-existing knowledge about the desired partition of the data. The constraints are usually expressed as pairwise constraints (Wagstaff & Cardie, 2000), consisting of must-links and cannot-links, which indicate whether two samples are believed to belong to the same cluster or to different clusters. Such pairwise relations contain less information than the labels used in classification tasks but are usually easier to obtain. Traditional clustering methods have then been extended to enforce pairwise constraints (Lange et al., 2005). COP-KMEANS (Wagstaff et al., 2001) and MPCK-means (Bilenko et al., 2004) adapted the well-known K-means algorithm, while several methods proposed constrained versions of Gaussian Mixture Models (Shental et al., 2003; Law et al., 2004; 2005). Among them, penalized probabilistic clustering (PPC, Lu & Leen (2004)) is most closely related to our work, as it expresses the pairwise constraints as Bayesian priors over the assignment of data points to clusters. However, PPC, like the previous models, shows poor performance and high computational complexity on high-dimensional and large-scale data sets. Deep Constrained Clustering. To overcome the limitations of the above models, constrained clustering algorithms have lately been used in combination with DNNs. Hsu & Kira (2015) train a DNN to minimize the Kullback-Leibler (KL) divergence between similar pairs of samples, while Chen (2015) performs semi-supervised maximum margin clustering of the features learned by a DNN. More recently, many extensions of the widely used DEC model (Xie et al., 2016) have been proposed to include a variety of loss functions that enforce pairwise constraints. Among them, SDEC (Ren et al., 2019) includes a distance loss that forces data points with a must-link to be close in the latent space, and vice-versa.
C-IDEC (Zhang et al., 2019b) uses, instead, a KL divergence loss, extending the work of Shukla et al. (2018). Other works have focused on discriminative clustering methods by self-generating pairwise constraints from either Siamese networks or KNN graphs (Smieja et al., 2020; Fogel et al., 2019). As none of the approaches proposed so far is based on generative models, the above methods fail to uncover the underlying data distribution. Additionally, DEC-based architectures rely on heavy pretraining of the autoencoder, resulting in no theoretical guarantee that the learned latent space is indeed suitable for clustering (Min et al., 2018). VAE-based deep clustering. Many models have been proposed in the literature to perform unsupervised clustering through deep generative models (Li et al., 2019; Yang et al., 2019; Manduchi et al., 2019; Kopf et al., 2019). Among them, the Variational Deep Embedding (VaDE, Jiang et al. (2017)) and the Gaussian Mixture Variational Autoencoder (GMM-VAE, Dilokthanakul et al. (2016)) propose a variant of the VAE (Kingma & Welling, 2014; Rezende et al., 2014) in which the prior is a Gaussian mixture distribution. With this assumption, they construct an inference model that can be directly optimised in the framework of stochastic gradient variational Bayes. However, variational deep clustering methods, such as VaDE, cannot incorporate domain knowledge and clustering preferences. Even though a semi-supervised version of the VAE has been proposed by Kingma et al. (2014), the latter cannot be naturally applied to clustering. For this reason, we extend the above methods to incorporate clustering preferences in the form of constraints, modeled as Bayesian priors, to guide the clustering process towards a desirable configuration.

3. CONSTRAINED VARIATIONAL DEEP EMBEDDING

In the following section, we propose a novel constrained clustering model (CVaDE) to incorporate clustering preferences, with varying degrees of certainty, in a VAE-based deep clustering setting. In particular, we adopt the generative assumptions of VaDE (Jiang et al., 2017), conditioned on the domain knowledge. We then illustrate how our model can be trained efficiently in the framework of stochastic gradient variational Bayes by optimizing the Conditional Variational Lower Bound. Additionally, we define concrete prior formulations to incorporate our preferences, with a focus on pairwise constraints.

3.1. THE GENERATIVE ASSUMPTIONS

Let us consider a data set $X = \{x_i\}_{i=1}^N$ consisting of $N$ samples with $x_i \in \mathbb{R}^M$ that we wish to cluster into $K$ groups according to some prior information encoded as $G$. For example, we may know a priori that certain samples should be clustered together, with different degrees of certainty. Hence $G$ encodes both our prior knowledge on the data set and the degree of confidence. We assume the data is generated by a random process consisting of three steps. First, the cluster assignments $c = \{c_i\}_{i=1}^N$, with $c_i \in \{1, \dots, K\}$, are sampled from a distribution conditioned on the prior information, $c \sim p(c \mid G)$. Next, for each cluster assignment $c_i$, a continuous latent embedding $z_i \in \mathbb{R}^D$ is sampled from a Gaussian distribution whose mean and variance depend on the selected cluster $c_i$. Finally, the sample $x_i$ is generated from a distribution conditioned on $z_i$. Given $c_i$, the generative process can be summarized as:

$$z_i \sim p(z_i \mid c_i) = \mathcal{N}(z_i \mid \mu_{c_i}, \sigma^2_{c_i} I) \quad (1)$$

$$x_i \sim p_\theta(x_i \mid z_i) = \begin{cases} \mathcal{N}(x_i \mid \mu_{x_i}, \sigma^2_{x_i} I) \text{ with } [\mu_{x_i}, \sigma^2_{x_i}] = f(z_i; \theta) & \text{if } x_i \text{ is real-valued} \\ \mathrm{Ber}(\mu_{x_i}) \text{ with } \mu_{x_i} = f(z_i; \theta) & \text{if } x_i \text{ is binary} \end{cases} \quad (2)$$

where $\mu_{c_i}$ and $\sigma^2_{c_i}$ are the mean and variance of the Gaussian distribution corresponding to cluster $c_i$ in the latent space, and the function $f(z; \theta)$ is a neural network, called the decoder, parametrized by $\theta$. Without prior information, that is, when $p(c \mid G) = p(c) = \prod_i p(c_i) = \prod_i \mathrm{Cat}(c_i \mid \pi)$, the cluster assignments are independent and identically distributed, as they follow a categorical distribution with mixing parameters $\pi$. In that case, the generative assumptions described above are equal to those of Jiang et al. (2017) and the parameters of the model can be learned using the unsupervised VaDE method (see Appendix C). In the following, we explore the case where $p(c \mid G) \neq p(c)$.
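As an illustration, the three-step generative process above can be sketched in NumPy. The toy sizes, the fixed linear decoder, and the unit cluster variances below are illustrative stand-ins for the learned networks and parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, M = 3, 2, 5                        # clusters, latent dim, data dim (toy sizes)
mu_c = rng.normal(size=(K, D))           # cluster means in latent space
sig_c = np.ones((K, D))                  # cluster standard deviations

def decoder(z):
    # stand-in for the neural network f(z; theta): here a fixed linear map
    W = np.arange(D * M).reshape(D, M) / 10.0
    return z @ W, 0.1 * np.ones(M)       # mu_x, sigma_x^2

def generate(c):
    """Generate one real-valued sample x given cluster assignment c (Eqs. 1-2)."""
    z = rng.normal(mu_c[c], sig_c[c])            # z ~ N(mu_c, sigma_c^2 I)
    mu_x, var_x = decoder(z)
    x = rng.normal(mu_x, np.sqrt(var_x))         # x ~ N(mu_x, sigma_x^2 I)
    return z, x

z, x = generate(c=1)
```

In the full model the cluster assignment itself would first be drawn from p(c | G); here c is passed in directly to keep the sketch short.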

3.2. CONDITIONAL VARIATIONAL LOWER BOUND

Given the data generative assumptions illustrated in Sec. 3.1, the objective is to infer the parameters $\pi$, $\mu_c$, $\sigma^2_c$ and $\theta$ which best explain the data $X$ given prior information on the cluster assignments $G$. We achieve this by maximizing the marginal log-likelihood conditioned on $G$, that is:

$$\log p(X \mid G) = \log \sum_{c} \int_Z p(X, Z, c \mid G) \, dZ, \quad (3)$$

where $Z = \{z_i\}_{i=1}^N$ is the collection of latent embeddings corresponding to the data set $X$. The conditional joint probability is derived from Eqs. 1 and 2 and can be factorized as:

$$p(X, Z, c \mid G) = p_\theta(X \mid Z) \, p(Z \mid c) \, p(c \mid G) = p(c \mid G) \prod_{i=1}^N p_\theta(x_i \mid z_i) \, p(z_i \mid c_i). \quad (4)$$

Since the conditional log-likelihood is intractable, we derive a lower bound of the log marginal conditional probability of the data, which we call the Conditional ELBO (C-ELBO, $\mathcal{L}_C$):

$$\mathcal{L}_C(X \mid G) = \mathbb{E}_{q_\phi(Z, c \mid X)} \left[ \log \frac{p(X, Z, c \mid G)}{q_\phi(Z, c \mid X)} \right]. \quad (5)$$

Similarly to Jiang et al. (2017) and Dilokthanakul et al. (2016), we employ the following amortized mean-field variational distribution:

$$q_\phi(Z, c \mid X) = q_\phi(Z \mid X) \, p(c \mid Z) = \prod_{i=1}^N q_\phi(z_i \mid x_i) \, p(c_i \mid z_i) \quad \text{with} \quad p(c_i \mid z_i) = \frac{p(z_i \mid c_i) \, p(c_i)}{\sum_k p(z_i \mid k) \, p(k)}, \quad (6)$$

where $q_\phi(z_i \mid x_i)$ is a Gaussian distribution with mean $\mu(x_i)$ and variance $\sigma^2(x_i) I$, which are the outputs of a neural network, called the encoder, parametrized by $\phi$, and $p(c_i = k)$ is denoted as $p(k)$ for simplicity. It is important to note that, in this formulation, the variational distribution does not depend on $G$. This approximation is used to retain a mean-field variational distribution even if the cluster assignments, conditioned on the prior information, are not independent (Sec. 3.4.1), that is, when $p(c \mid G) \neq \prod_i p(c_i \mid G)$.
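For intuition, the cluster posterior p(c_i = k | z_i) in Eq. 6 is simply the responsibility of mixture component k for the embedding z_i. A minimal NumPy sketch, computed in log space with a max-shift for numerical stability (all parameter values are illustrative):

```python
import numpy as np

def responsibilities(Z, mu_c, var_c, log_pi):
    """p(c = k | z) proportional to pi_k * N(z | mu_k, var_k I), per Eq. 6.

    Z: (N, D) latent embeddings; mu_c, var_c: (K, D) Gaussian parameters;
    log_pi: (K,) log mixing weights. Returns an (N, K) matrix of posteriors.
    """
    N, D = Z.shape
    diff = Z[:, None, :] - mu_c[None, :, :]                  # (N, K, D)
    log_pdf = -0.5 * (np.sum(diff**2 / var_c[None], axis=-1)
                      + np.sum(np.log(var_c), axis=-1)
                      + D * np.log(2 * np.pi))               # log N(z | mu_k, var_k I)
    log_post = log_pi[None, :] + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)          # stability shift
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)            # normalize over k
```

Each row sums to one; an embedding close to a cluster mean receives responsibility near one for that cluster.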

3.3. ROLE OF THE PRIOR INFORMATION

To highlight how the prior information $G$ influences the clustering objective, we rewrite Eq. 5 as:

$$\mathcal{L}_C(X \mid G) = \mathbb{E}_{q_\phi(Z \mid X)}[\log p_\theta(X \mid Z)] - D_{KL}\big(q_\phi(Z, c \mid X) \,\|\, p(Z, c \mid G)\big). \quad (7)$$

The first term is called the reconstruction term, as in the VAE. The second term, on the other hand, is the Kullback-Leibler (KL) divergence between the variational posterior and the constrained Gaussian mixture prior. By maximizing the C-ELBO, the variational posterior mimics the true conditional probability of the latent embeddings and the cluster assignments. This results in enforcing the latent embeddings to follow a Gaussian mixture that agrees with the clustering preferences. Using Eq. 6, the Conditional ELBO can be further factorized as:

$$\mathcal{L}_C(X \mid G) = \mathbb{E}_{p(c \mid Z)}[\log p(c \mid G)] + \mathbb{E}_{q_\phi(Z \mid X)}[\log p_\theta(X \mid Z)] + \mathbb{E}_{q_\phi(Z \mid X) p(c \mid Z)}[\log p(Z \mid c)] - \mathbb{E}_{q_\phi(Z, c \mid X)}[\log q_\phi(Z, c \mid X)], \quad (8)$$

where the last three terms are not affected by $G$ and can be rewritten using the SGVB estimator and the reparameterization trick (Kingma & Welling, 2014) to be trained efficiently with stochastic gradient descent (see Appendix B). The first term, on the other hand, is investigated in Sec. 3.4.
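The reparameterization trick mentioned above can be sketched in a few lines; the log-variance parametrization of the encoder output is a common convention we assume here:

```python
import numpy as np

rng = np.random.default_rng(1)

def reparameterize(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Sampling noise is separated from the parameters, so gradients can flow
    through mu and log_var when this is implemented in an autodiff framework.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

With log_var driven to very negative values, the sample collapses to the mean, matching the deterministic limit of the encoder.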

3.4. CONDITIONAL PRIOR PROBABILITY

We incorporate our clustering preferences through the conditional probability $p(c \mid G)$. In particular, we construct the conditional prior probability as:

$$p(c \mid G) = \frac{\prod_i \pi_{c_i} g_i(c)}{\sum_{c'} \prod_j \pi_{c'_j} g_j(c')} = \frac{1}{\Omega(\pi)} \prod_i \pi_{c_i} g_i(c), \quad (9)$$

where $\pi = \{\pi_k\}_{k=1}^K$ are the weights associated with each cluster, $c_i$ is the cluster assignment of sample $x_i$, $\Omega(\pi)$ is the normalization factor, and $g_i(c)$ is a weighting function that assumes large values if $c_i$ agrees with our belief with respect to $c$ and low values otherwise.

3.4.1. PAIRWISE CONSTRAINTS

We hereby focus on expressing the conditional prior distribution in the context of pairwise constrained clustering, and we do so by adapting the work of Lu & Leen (2004) to our variational framework. The weighting function $g_i(c)$ is then defined as:

$$g_i(c) = \prod_{j \neq i} \exp\left( W_{i,j} \, \delta_{c_i c_j} \right), \quad (10)$$

where $\delta$ is the Kronecker delta function and $W \in \mathbb{R}^{N \times N}$ is a symmetric matrix containing the pairwise preferences and confidences. In particular, $W_{i,j} = 0$ if we have no prior information on samples $x_i$ and $x_j$, $W_{i,j} > 0$ if there is a must-link constraint (the two samples should be clustered together), and $W_{i,j} < 0$ if there is a cannot-link constraint (the two samples should not be clustered together). The value $|W_{i,j}| \in [0, \infty)$ reflects the degree of certainty in the constraint. For example, if $W_{i,j} \to -\infty$ then $x_i$ and $x_j$ must be assigned to different clusters, otherwise $p(c \mid G) \to 0$ (hard constraint). Smaller values, on the other hand, indicate a soft preference, as they admit some degree of freedom in the model. A heuristic to select $|W_{i,j}|$ is presented in Sec. 4. Interestingly, the probability $p(c \mid G)$ with pairwise constraints can be seen as the posterior of the superparamagnetic clustering method (Blatt et al., 1996), with loss function given by a fully connected Potts model (Wu, 1982). In contrast to our method, Blatt et al. (1996) cluster the data according to the pairwise correlation functions $\mathbb{E}_{p(c \mid G)}[\delta_{c_i c_j}]$, which are estimated with MCMC methods. Finally, we incorporate the conditional prior distribution into Eq. 8. The first term can be written as:

$$\mathbb{E}_{p(c \mid Z)}[\log p(c \mid G)] = \mathbb{E}_{p(c \mid Z)}\left[ \log \frac{1}{\Omega(\pi)} \prod_i \pi_{c_i} \prod_{j \neq i} \exp\left( W_{i,j} \, \delta_{c_i c_j} \right) \right] \quad (11)$$

$$= -\log \Omega(\pi) + \sum_i \mathbb{E}_{p(c_i \mid z_i)}[\log \pi_{c_i}] + \sum_i \sum_{j \neq i} \mathbb{E}_{p(c_i \mid z_i)} \mathbb{E}_{p(c_j \mid z_j)}[W_{i,j} \, \delta_{c_i c_j}] \quad (12)$$

$$= -\log \Omega(\pi) + \sum_i \sum_k p(c_i = k \mid z_i) \log \pi_k + \sum_i \sum_{j \neq i} \sum_k p(c_i = k \mid z_i) \, p(c_j = k \mid z_j) \, W_{i,j}. \quad (13)$$

Maximizing Eq. 13 w.r.t. $\pi$ poses computational problems due to the normalization factor $\Omega(\pi)$.
Crude approximations are investigated in Basu et al. (2008); however, we choose to fix the parameters $\pi_k = 1/K$ to make $z$ uniformly distributed in the latent space, as in previous works (Dilokthanakul et al., 2016). By doing so, the normalization factor can be treated as a constant. The analysis of different approaches to learn the weights $\pi$ is left for future work. The C-ELBO with a pairwise constrained prior can then be optimized using Monte Carlo sampling and stochastic gradient descent.
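Under the mean-field posterior, the last term of Eq. 13 is a sum of pairwise agreement probabilities weighted by W. A NumPy sketch of this quantity, where gamma denotes the soft assignments p(c_i = k | z_i):

```python
import numpy as np

def pairwise_prior_term(gamma, W):
    """Compute sum_{i, j != i} sum_k p(c_i=k|z_i) p(c_j=k|z_j) W_ij (Eq. 13).

    gamma: (N, K) soft cluster assignments; W: (N, N) symmetric constraint
    matrix (W_ij > 0 must-link, W_ij < 0 cannot-link, 0 = no information).
    """
    agree = gamma @ gamma.T          # (N, N): probability that c_i == c_j
    np.fill_diagonal(agree, 0.0)     # exclude the i == j terms
    return np.sum(W * agree)
```

Maximizing this term rewards configurations in which must-linked pairs agree and cannot-linked pairs disagree: a satisfied must-link contributes +|W_ij| (twice, by symmetry), while a violated cannot-link contributes -|W_ij|.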

3.4.2. FURTHER POSSIBLE CONSTRAINTS

Given the flexibility of our general framework, different types of constraints can be included in the formulation of the weighting functions $g_i(c)$. In particular, we can perform semi-supervised learning by setting the prior information as:

$$g_i(c) = g(c_i) = \exp(W_{i,k}) \quad \text{with } W \in \mathbb{R}^{N \times K}, \quad (14)$$

where $W_{i,k}$ indicates whether the sample $x_i$ should be assigned to cluster $k$. Additionally, one could also include triple constraints by modifying the weighting function to be:

$$g_i(c) = \prod_{j, k \neq i} \exp\left( W_{i,j,k} \, \delta_{c_i c_j c_k} \right) \quad \text{with } W \in \mathbb{R}^{N \times N \times N} \text{ symmetric}, \quad (15)$$

where $W_{i,j,k} = 0$ if we do not have any prior information, $W_{i,j,k} > 0$ indicates that the samples $x_i$, $x_j$ and $x_k$ should be clustered together, and $W_{i,j,k} < 0$ that they should belong to different clusters. The analysis of these different constraint formulations is outside the scope of our work, but they may represent interesting directions for future work.
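As a toy illustration of the semi-supervised case in Eq. 14, a partial labeling can be encoded as a sparse N × K weight matrix; the dictionary encoding and the default confidence value below are hypothetical choices, not part of the paper's specification:

```python
import numpy as np

def semi_supervised_W(labels, N, K, confidence=1e4):
    """Build W in R^{N x K} from a partial labeling.

    labels: dict mapping sample index -> preferred cluster index for the
    labeled subset. Row i stays all zeros when x_i is unlabeled; otherwise
    W[i, k] = confidence for the preferred cluster, so that g(c_i) = exp(W_ik)
    strongly favors assigning x_i to cluster k.
    """
    W = np.zeros((N, K))
    for i, k in labels.items():
        W[i, k] = confidence
    return W
```

A finite confidence keeps the labels soft; letting it grow makes the assignment effectively hard.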

4. EXPERIMENTS

In the following, we provide a thorough empirical assessment of our proposed method (CVaDE) with pairwise constraints using a wide range of data sets. First, we evaluate our model's performance compared to both unsupervised variational deep clustering methods and state-of-the-art constrained clustering methods. Then, we present extensive evidence of the ability of our model to handle noisy constraint information. We additionally perform experiments on a medical application to show that our model can reach different desirable partitions of the data, depending on the constraints used, even with real-world, noisy data. Finally, we show that CVaDE successfully generates new data, using the learned generative process of Sec. 3.1, on a challenging face image data set. Baselines & implementation details. As baselines, we include the traditional constrained K-means (MPCK-means, Bilenko et al. (2004)) and two recent deep constrained clustering methods based on DEC (SDEC, Ren et al. (2019), and C-IDEC, Zhang et al. (2019b)), as they achieve state-of-the-art performance in constrained clustering. We also compare our model to the unsupervised variational deep clustering method VaDE (Jiang et al., 2017). To implement our model, we were careful to maintain a fair comparison with the baselines. In particular, we adopted the same encoder and decoder feed-forward architecture for all methods, with four layers of 500, 500, 2000, and D units respectively, where D = 10 unless stated otherwise. The VAE is pretrained for 10 epochs, while the DEC-based baselines need a more complex layer-wise pretraining of the autoencoder, which involves 50 epochs of pretraining for each layer and 100 epochs of fine-tuning. The pairwise constraints are chosen randomly within the training set by sampling two data points and assigning a must-link if they have the same label and a cannot-link otherwise.
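The constraint-sampling procedure just described (random pairs, must-link if labels agree) can be sketched as follows; the symmetric-matrix encoding and the confidence value are assumptions matching the formulation of Sec. 3.4.1:

```python
import numpy as np

def sample_pairwise_constraints(y, n_constraints, confidence=1e4, seed=0):
    """Randomly sample pairs from labels y: must-link (+confidence) if the
    labels agree, cannot-link (-confidence) otherwise. Returns a symmetric
    (N, N) constraint matrix W as used in Eq. 10."""
    rng = np.random.default_rng(seed)
    N = len(y)
    W = np.zeros((N, N))
    for _ in range(n_constraints):
        i, j = rng.choice(N, size=2, replace=False)
        sign = 1.0 if y[i] == y[j] else -1.0
        W[i, j] = W[j, i] = sign * confidence   # keep W symmetric
    return W
```

Since each constraint only compares two labels, N constraints can be extracted from a small labeled subset, as noted in the text.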
Unless stated otherwise, the values of $|W_{i,j}|$ are set to $10^4$ for all data sets, and N pairwise constraints are used for both our model and the constrained clustering baselines, where N is the number of samples of the considered data set. Note that N pairwise constraints can be obtained by using only √N labeled data points. To allow for fast iteration, we simplify the last term of Eq. 13 by restricting the search for pairwise constraints to the considered batch. We observed empirically that, with a batch size of 1024, this approximation does not affect the clustering performance. Further details on the other hyper-parameter settings can be found in Appendix E.1. Constrained clustering. We test the clustering performance of our model against the baselines on four different data sets: MNIST (LeCun et al., 2010), Fashion MNIST (Xiao et al., 2017), Reuters (Xie et al., 2016) and HHAR (Stisen et al., 2015) (see Appendix A). Note that we preprocessed the Reuters data by computing the tf-idf features on the 2000 most frequent words on a random subset of 10000 documents and by selecting 4 root categories (Xie et al., 2016). Accuracy and Normalized Mutual Information (NMI) are used as evaluation metrics. The results are shown in Table 1. We observe that our model reaches state-of-the-art performance on the Fashion MNIST and Reuters data sets. On the MNIST and HHAR data sets, on the other hand, it has clustering performance comparable to C-IDEC; however, it is generally more stable across different data sets. Constrained clustering with noisy labels. In real-world applications it is often the case that the additional information comes from different sources with different noise levels. As an example, pairwise annotations could be obtained both by very experienced domain experts and by less-experienced users.
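The fixed confidence $|W_{i,j}| = 10^4$ used here, and the noise-adjusted heuristic $|W_{i,j}| = \alpha \log \frac{1-q}{q}$ used later for noisy constraints, can be computed as:

```python
import numpy as np

def constraint_confidence(q, alpha=3500.0):
    """Heuristic |W_ij| = alpha * log((1 - q) / q) for constraints with noise
    level q (the fraction of flipped must-/cannot-links), q in (0, 0.5].
    Cleaner constraints (smaller q) yield a larger confidence value."""
    return alpha * np.log((1.0 - q) / q)

# Noise-free experiments simply use a large fixed confidence:
CLEAN_CONFIDENCE = 1e4
```

The heuristic vanishes at q = 0.5, where a constraint carries no information, and grows without bound as q approaches 0.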
Hence, the ability to integrate constraints with different degrees of certainty into the clustering algorithm is of significant practical importance. In this experiment, we consider the case in which the given pairwise constraints have three different noise levels, $q \in \{0.1, 0.2, 0.3\}$, where $q$ determines the fraction of pairwise constraints with flipped signs (that is, when a must-link is turned into a cannot-link and vice-versa). In Fig. 1 we show the clustering performance of our model compared to the strongest baseline from the previous section, C-IDEC. For all data sets, we decrease the pairwise confidence of our method using the heuristic $|W_{i,j}| = \alpha \log \frac{1-q}{q}$ with $\alpha = 3500$. Also, we use grid search to choose the hyper-parameters of C-IDEC for the different noise levels (in particular, we set the penalty weight to 0.1, 0.005, and 0.001 respectively). CVaDE clearly achieves better performance in terms of accuracy and NMI at all three noise levels for all data sets. In particular, the higher the noise level, the greater the performance difference. We can conclude that our model is more robust than its main competitor. Additionally, C-IDEC cannot model different noise levels within the same data set, while our model can easily include different sources of information with different degrees of uncertainty. Heart Echo. We evaluate the effectiveness of our model in a real-world application by using infant heart echo cardiogram videos. The data set consists of 305 infant echo cardiogram videos from 65 patient visits obtained from a large University Children's Hospital. Each visit may consist of videos taken from several different angles (called views), denoted by [LA, KAKL, KAPAP, KAAP, 4CV]. We cropped the videos by isolating the cone of the echo cardiogram, resized them to 64 × 64 pixels, and split them into individual frames, obtaining a total of N = 20000 images. We focused on investigating two constrained clustering problems using this data set.
First, we cluster the echo video frames by view. This is a relevant task, as in many data sets heart echo videos are not explicitly labeled (Zhang et al., 2018). Then, we cluster the echo video frames by infant maturity at birth, following the WHO definition of premature birth categories ("Preterm"). We believe that these two clustering tasks demonstrate that our model admits a degree of control in choosing the underlying structure of the learned clusters. As the experiments demonstrate, by providing different pairwise constraints, it is possible to guide the clustering process towards a preferred configuration, depending on what the practitioners are seeking in the data. For both experiments, we compare the performance of our method, CVaDE, with the unsupervised VaDE method and with C-IDEC; we additionally include a variant of both our method and VaDE in Table 2. The CVaDE model outperforms both baselines by a significant margin in both accuracy and NMI in all clustering experiments. This demonstrates that the addition of domain knowledge is particularly effective for medical purposes. Additionally, we observe that C-IDEC performs poorly on real-world noisy data. We believe this is due to the heavy pretraining of the autoencoder required by DEC-based methods, as it does not always ensure that the learned latent space is suitable for clustering. Additionally, we investigate the relationship between the number of constraints and clustering performance for the view detection task. We observed that with only 5000 pairwise constraints, which could be obtained with fewer than 80 labels, our model achieves results comparable to those of Table 2 (see Appendix D.3 for details). Face Image Generation. We evaluate the generative capabilities of our model using the UTKFace dataset (Zhang et al., 2017). This dataset contains over 20000 images of male and female faces, aged from 1 to 118 years old, with multiple ethnicities represented.
We use a convolutional network for the VAE (the implementation details are described in Appendix E.3). As a first task, we cluster the data using the gender prior information, in the form of 2N pairwise constraints, which requires labels for 0.7% of the data set. Fig. 2a/2b shows the PCA decomposition of the embedding space of both CVaDE and the unsupervised VaDE method. As a second task, we select a sub-sample of individuals between 18 and 50 years of age (approx. 11000 samples) and cluster by ethnicity (White, Black, Indian, Asian) using 2N pairwise constraints, Fig. 2c/2d. We believe that one possible source of bias in current machine learning models is the under-representation of various ethnic groups in training data sets. The constrained CVaDE method could be used to identify such imbalances while requiring only a relatively small number of labeled training samples. This would allow easy identification of such instances of bias in training data sets at a relatively low cost. Furthermore, the ability to generate samples from these detected clusters potentially allows automatic data set augmentation. With the inclusion of domain knowledge, we observe a clean separation of the selected clusters in the embedding space in both tasks; conversely, the unsupervised approach is not able to distinguish any feature of interest. Finally, using the multivariate Gaussian distributions of each cluster in the learned embedded space, we test the generative capabilities of CVaDE by first recovering the mean face of each cluster, and then generating several more faces from each cluster. Figure 3 shows these generated samples. As can be observed, the ethnicities present in the data set are represented well by the mean face. Furthermore, the sampled faces all correspond to their respective cluster and show a good amount of variation. The quality of generated samples could be improved by using higher-resolution training samples or different CNN architectures.

APPENDIX A DATA SETS

The data sets used in the experiments are the following:

• MNIST: It consists of 70000 handwritten digits. The images are centered and of size 28 by 28 pixels. We reshaped each image to a 784-dimensional vector (LeCun et al., 2010).

• Fashion MNIST: A data set of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples (Xiao et al., 2017).

• HHAR: The Heterogeneity Human Activity Recognition (HHAR) dataset contains 10299 sensor records from smart phones and smart watches. All samples are partitioned into 6 categories of human activities and each sample is of 561 dimensions (Stisen et al., 2015).

• Reuters: It contains 810000 English news stories (Lewis et al., 2004). Following the work of Xie et al. (2016), we used 4 root categories: corporate/industrial, government/social, markets, and economics as labels and discarded all documents with multiple labels, which results in a 685071-article dataset. We computed tf-idf features on the 2000 most frequent words to represent all articles. A random subset of 10000 documents is then sampled.

• Newborn echo cardiograms: The dataset consists of 305 infant echo cardiogram videos from 65 patient visits obtained from a large children's hospital. Each visit may consist of videos taken from several different angles, denoted by [LA, KAKL, KAPAP, KAAP, 4CV]. We cropped the videos by isolating the cone of the echo cardiogram, resized them to 64x64 pixels, and split them into individual frames, obtaining a total of N = 20000 images.

• UTKFace: This dataset contains over 20000 images of male and female faces of individuals from 1 to 118 years old, with multiple ethnicities represented (Zhang et al., 2017).

B CONDITIONAL ELBO DERIVATIONS

In this section, we provide the detailed derivation of the Conditional ELBO with pairwise constraints and we describe how it can be trained efficiently with Monte Carlo sampling. Specifically, the C-ELBO, $\mathcal{L}_C(X \mid G)$, is a lower bound of the marginal log-likelihood conditioned on $G$:

$$\log p(X \mid G) = \log \sum_c \int_Z p(X, Z, c \mid G) \, dZ \geq \mathbb{E}_{q_\phi(Z, c \mid X)}\left[ \log \frac{p(X, Z, c \mid G)}{q_\phi(Z, c \mid X)} \right] = \mathcal{L}_C(X \mid G). \quad (16)$$

By plugging in the joint probability of Eq. 4 and the variational distribution of Eq. 6, the C-ELBO can be rewritten as:

$$\mathcal{L}_C(X \mid G) = \mathbb{E}_{q_\phi(Z, c \mid X)}[\log p_\theta(X \mid Z) \, p(Z \mid c) \, p(c \mid G)] - \mathbb{E}_{q_\phi(Z, c \mid X)}[\log q_\phi(Z, c \mid X)] \quad (17)$$

$$= \mathbb{E}_{q_\phi(Z \mid X)}[\log p_\theta(X \mid Z)] + \mathbb{E}_{q_\phi(Z \mid X) p(c \mid Z)}[\log p(Z \mid c)] + \mathbb{E}_{p(c \mid Z)}[\log p(c \mid G)] - \mathbb{E}_{q_\phi(Z \mid X)}[\log q_\phi(Z \mid X)] - \mathbb{E}_{p(c \mid Z)}[\log p(c \mid Z)], \quad (18)$$

where we used the fact that the variational distribution can be factorized as $q_\phi(Z, c \mid X) = q_\phi(Z \mid X) \, p(c \mid Z)$. Given that $q_\phi(Z \mid X) \, p(c \mid Z) = \prod_i q_\phi(z_i \mid x_i) \, p(c_i \mid z_i)$ and using Eq. 13, we can further factorize Eq. 18:

$$\mathcal{L}_C(X \mid G) = \sum_{i=1}^N \Big( \mathbb{E}_{q_\phi(z_i \mid x_i)}[\log p_\theta(x_i \mid z_i)] + \mathbb{E}_{q_\phi(z_i \mid x_i) p(c_i \mid z_i)}[\log p(z_i \mid c_i)] + \mathbb{E}_{p(c_i \mid z_i)}[\log \pi_{c_i}] - \mathbb{E}_{q_\phi(z_i \mid x_i)}[\log q_\phi(z_i \mid x_i)] - \mathbb{E}_{p(c_i \mid z_i)}[\log p(c_i \mid z_i)] \Big) + \sum_i \sum_{j \neq i} \mathbb{E}_{p(c_i \mid z_i)} \mathbb{E}_{p(c_j \mid z_j)}[W_{i,j} \, \delta_{c_i c_j}] - \log \Omega(\pi). \quad (19)$$

As $p(c_i \mid z_i)$ is discrete, $\mathbb{E}_{p(c_i \mid z_i)}[\,\cdot\,] = \sum_k p(c_i = k \mid z_i)[\,\cdot\,]$, and using the SGVB estimator we can approximate the above equation as:

$$\mathcal{L}_C(X \mid G) \approx \sum_{i=1}^N \frac{1}{L} \sum_{l=1}^L \Big( \log p_\theta(x_i \mid z_i^{(l)}) + \sum_k p(c_i = k \mid z_i^{(l)}) \log p(z_i^{(l)} \mid c_i = k) + \sum_k p(c_i = k \mid z_i^{(l)}) \log \pi_k - \log q_\phi(z_i^{(l)} \mid x_i) - \sum_k p(c_i = k \mid z_i^{(l)}) \log p(c_i = k \mid z_i^{(l)}) \Big) + \sum_i \sum_{j \neq i} \sum_k p(c_i = k \mid z_i) \, p(c_j = k \mid z_j) \, W_{i,j}, \quad (20)$$

where $L$ is the number of Monte Carlo samples in the SGVB estimator; it is set to $L = 1$ in all experiments.

C VARIATIONAL DEEP EMBEDDING

The Variational Deep Embedding method (VaDE, Jiang et al. (2017)) assumes an observed sample $x_i$ is generated by the following generative process:

$$c_i \sim \mathrm{Cat}(1/K)$$
$$z_i \sim p(z_i \mid c_i) = \mathcal{N}(z_i \mid \mu_{c_i}, \sigma^2_{c_i} I)$$
$$x_i \sim p(x_i \mid z_i) = \begin{cases} \mathcal{N}(x_i \mid \mu_{x_i}, \sigma^2_{x_i} I) \text{ with } [\mu_{x_i}, \sigma^2_{x_i}] = f(z_i; \theta) & \text{if } x_i \text{ is real-valued} \\ \mathrm{Ber}(\mu_{x_i}) \text{ with } \mu_{x_i} = f(z_i; \theta) & \text{if } x_i \text{ is binary,} \end{cases}$$

where $K$ is the predefined number of clusters, $\mu_c$ and $\sigma^2_c$ are the mean and variance of the Gaussian distribution corresponding to cluster $c$ in the latent space, and the function $f(z; \theta)$ is a neural network, called the decoder, parametrized by $\theta$, as in a VAE. To infer both the parameters of the Gaussian mixture model and the decoder, VaDE maximises a lower bound of the log-likelihood of the data $X$, that is:

$$\log p(X) = \log \sum_c \int_Z p(X, Z, c) \, dZ \geq \mathbb{E}_{q(Z, c \mid X)}\left[ \log \frac{p(X, Z, c)}{q(Z, c \mid X)} \right] = \mathcal{L}_{ELBO}.$$

The variational distribution is chosen to be:

$$q_\phi(Z, c \mid X) = \prod_i q_\phi(z_i, c_i \mid x_i) = \prod_i q_\phi(z_i \mid x_i) \, p(c_i \mid z_i) \quad \text{with} \quad p(c_i \mid z_i) = \frac{p(z_i \mid c_i) \, p(c_i)}{\sum_k p(z_i \mid k) \, p(k)},$$

where $q_\phi(z_i \mid x_i)$ is a Gaussian distribution with mean $\mu(x_i)$ and variance $\sigma^2(x_i) I$, which are the outputs of a neural network, called the encoder, parametrized by $\phi$. The ELBO can then be formulated as:

$$\mathcal{L}_{ELBO}(X) = \mathbb{E}_{q_\phi(Z \mid X)}[\log p_\theta(X \mid Z)] - D_{KL}\big(q_\phi(Z, c \mid X) \,\|\, p(Z, c)\big).$$

D FURTHER EXPERIMENTS

D.1 COMPARISON WITH C-IDEC

In Table 3, we present quantitative results in terms of Adjusted Rand Index (ARI) and percentage of satisfied constraints (SC) for both our model, CVaDE, and the strongest baseline, C-IDEC, with N constraints. In Tables 4, 5, 6 and 7, we present quantitative results in terms of Accuracy and Normalized Mutual Information for both CVaDE and C-IDEC with N noisy constraints (presented visually in Fig. 1). In particular, the results are computed for q ∈ {0.1, 0.2, 0.3}, where q determines the fraction of pairwise constraints with flipped signs (that is, a must-link turned into a cannot-link and vice versa).

Impact of Constraints. To understand how the number of provided constraints impacts the performance of CVaDE, we performed an experiment on the heart echo dataset in which progressively more constraints were provided to the model during training. Figure 5 demonstrates that the clustering performance of CVaDE improves as more pairwise constraints are provided. This trend continues until 5000 pairwise constraints are provided (requiring 0.35% of the dataset to be labeled), at which point the performance reaches a plateau.
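The sign-flipping noise model used in these experiments can be simulated in a few lines. Below is a NumPy sketch under stated assumptions: the weight matrix `W` and the helper `flip_constraints` are hypothetical illustrations, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

def flip_constraints(W, q, rng):
    """Flip the sign of a fraction q of the pairwise constraints, turning
    must-links (W_ij > 0) into cannot-links (W_ij < 0) and vice versa."""
    W_noisy = W.copy()
    i_idx, j_idx = np.nonzero(np.triu(W, k=1))           # count each pair once
    n_flip = int(round(q * len(i_idx)))
    for t in rng.choice(len(i_idx), size=n_flip, replace=False):
        i, j = i_idx[t], j_idx[t]
        W_noisy[i, j] *= -1
        W_noisy[j, i] *= -1                              # keep W symmetric
    return W_noisy

# Example: 10 must-link constraints among 20 samples, with q = 0.3 flipped.
N = 20
W = np.zeros((N, N))
pairs = [(2 * k, 2 * k + 1) for k in range(10)]
for i, j in pairs:
    W[i, j] = W[j, i] = 1.0
W_noisy = flip_constraints(W, 0.3, rng)
n_flipped = sum(W_noisy[i, j] < 0 for i, j in pairs)     # round(0.3 * 10) = 3 flips
```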

E IMPLEMENTATION DETAILS

E.1 HYPER-PARAMETERS SETTING

In Table 8, we specify the hyper-parameter settings of both our model, CVaDE, and the unsupervised baseline, VaDE. Given the semi-supervised setting, we did not focus on fine-tuning the hyper-parameters but rather chose standard configurations for all data sets. We observed that our model is robust to changes in the hyper-parameters, except for the batch size. The latter requires a large value, as we simplify the last term of Eq. 13 by restricting the search for pairwise constraints to pairs within the considered batch. For VaDE, we used the same hyper-parameter settings as in the original paper (Jiang et al., 2017).
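This batch-restricted simplification can be made concrete with a small sketch. The helper `batch_constraint_term`, the toy sizes, and the random responsibilities below are hypothetical; the point is only that the constraint sum runs over pairs drawn from a single mini-batch, which is why larger batches capture more constraint pairs.

```python
import numpy as np

rng = np.random.default_rng(3)

def batch_constraint_term(gamma_batch, batch_idx, W_full):
    """Constraint term restricted to pairs inside one mini-batch:
    sum over i != j in the batch of W_ij * sum_k p(c_i=k|z_i) p(c_j=k|z_j)."""
    W_b = W_full[np.ix_(batch_idx, batch_idx)].copy()    # pairs within the batch only
    np.fill_diagonal(W_b, 0.0)                           # exclude i == j
    agree = gamma_batch @ gamma_batch.T                  # sum_k gamma_ik gamma_jk
    return float(np.sum(W_b * agree))

# Tiny check against the explicit double sum (toy sizes).
N, K = 8, 3
gamma = rng.dirichlet(np.ones(K), size=N)                # rows of p(c_i = k | z_i)
W = rng.normal(size=(N, N))
W = (W + W.T) / 2                                        # symmetric constraint weights
idx = np.array([0, 2, 5, 7])                             # indices of the current batch
val = batch_constraint_term(gamma[idx], idx, W)
brute = sum(W[i, j] * float(gamma[i] @ gamma[j])
            for i in idx for j in idx if i != j)
```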

E.2 HEART ECHO

In addition to the model described in Section 4, we also used a VGG-like convolutional neural network. This model is implemented in Tensorflow, using two VGG blocks (using a 3 × 3 kernel size) of 32 and 64 filters for the encoder, followed by a single fully-connected layer reducing down to an embedding of dimension 10. The decoder has a symmetric architecture. The VAE is pretrained for 10 epochs, following which our model is trained for 300 epochs using a learning rate of 0.001 (with an exponential decay of 0.00001), and a batch size of 1024. Refer to the accompanying code for further details.

E.3 FACE IMAGE GENERATION

The face image generation experiments using the UTK Face dataset described in Section 4 were carried out using VGG-like convolutional neural networks implemented in Tensorflow. In particular, the input image size of 64×64×3 allowed two VGG blocks (using a 3×3 kernel size) of 64 and 128 filters for the encoder, followed by a single fully-connected layer reducing down to an embedding of dimension 50. The decoder has a symmetric architecture. The VAE is pretrained for 100 epochs, following which our model is trained for 1000 epochs using a learning rate of 0.001 (with a decay of 0.00001), and a batch size of 1024. Refer to the accompanying code for further details.
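As a sanity check on this architecture, the tensor shapes through the encoder can be traced by hand. The sketch below assumes 'same'-padded 3×3 convolutions and one 2×2 max-pool per VGG block (both assumptions; the exact block layout is in the accompanying code).

```python
# Shape walk-through of the assumed VGG-like encoder for 64x64x3 inputs.
h, w, c = 64, 64, 3
for filters in (64, 128):          # the two VGG blocks of 64 and 128 filters
    c = filters                    # 3x3 convolutions set the channel count
    h, w = h // 2, w // 2          # 2x2 max-pooling halves height and width
flat = h * w * c                   # units flattened into the fully-connected layer
embedding_dim = 50                 # latent embedding dimension from the paper
```

Under these assumptions, the fully-connected layer maps a 16×16×128 feature map down to the 50-dimensional embedding.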



Figure 1: Accuracy and NMI clustering performance on four different datasets with noisy labels.

Figure 2: PCA decomposition of test set examples in the embedded space and generative samples using CVaDE and VaDE for (a)-(b) Gender, (c)-(d) Ethnicity. In this configuration, CVaDE obtains an NMI of 0.52 (gender) and 0.4 (ethnicity), while VaDE obtains an NMI close to 0 for both tasks.

Figure 3: Mean face and sampled faces for each learned cluster; the top two rows correspond to gender, the bottom rows to ethnicity.

Figure 4: Left: PCA decomposition of the CVaDE learned embedded space with representative samples from the dataset; right: the baseline VaDE embedded space.

Figure 5: Plot demonstrating that CVaDE's performance improves as more constraints are given. At 5000 pairwise constraints the performance equals that of N = 20000 (the total number of samples)

Clustering performances of CVaDE compared with baselines. All methods use N pairwise constraints (√N labels) except the unsupervised VaDE. Means and standard deviations are computed across 10 runs with different random model initialization and pre-training weights.

Method    MNIST (ACC / NMI)        fMNIST (ACC / NMI)       Reuters (ACC / NMI)      HHAR (ACC / NMI)
…         … ±5.7 / 81.9 ±2.2       59.3 ±2.3 / 59.8 ±1.9    76.0 ±0.7 / 50.2 ±0.9    75.3 ±5.9 / 70.5 ±4.0
MPCK      58.8 ±0.0 / 53.3 ±0.1    57.5 ±0.0 / 59.8 ±1.9    60.7 ±0.5 / 50.2 ±0.9    58.1 ±0.2 / 70.5 ±4.0
SDEC      87.9 ±0.2 / 86.8 ±0.3    59.7 ±2.2 / 63.0 ±1.2    77.3 ±3.9 / 60.0 ±2.9    67.0 ±7.0 / 71.5 ±2.0
C-IDEC    97.9 ±0.1 / 94.3 ±0.3    84.5 ±1.7 / 76.0 ±1.6    92.6 ±4.1 / 79.1 ±4.1    93.4 ±0.5 / 86.3 ±1.1
CVaDE     98.0 ±0.1 / 94.4 ±0.2    86.1 ±0.2 / 77.6 ±0.3    95.3 ±0.5 / 82.6 ±1.4    93.5 ±0.8 / 86.5 ±1.2

Constrained clustering results using heart echo cardiogram data with fully connected layers on the left and convolutional layers on the right. Means and standard deviations are computed across 10 runs with different random model initialization and pre-training weights.

Clustering performances of CVaDE compared with C-IDEC. All methods use N pairwise constraints ( √ N labels) except the unsupervised VaDE. Means and standard deviations are computed across 5 runs with different random model initialization and pre-training weights.

Clustering performances with noisy labels averaged over 10 runs on MNIST.

Clustering performances with noisy labels averaged over 10 runs on fMNIST.

Clustering performances with noisy labels averaged over 10 runs on Reuters.

Clustering performances with noisy labels averaged over 10 runs on HHAR.

Hyper-parameter settings of both our model, CVaDE, and the unsupervised VaDE.

