DEEP GOAL-ORIENTED CLUSTERING

Abstract

Clustering and prediction are two primary tasks in the fields of unsupervised and supervised machine learning. Although many of the recent advances in machine learning have centered around those two tasks, the interdependent, mutually beneficial relationship between them is rarely explored. One could reasonably expect that appropriately clustering the data would aid the downstream prediction task and, conversely, that better prediction performance on the downstream task could inform a more appropriate clustering strategy. In this work, we focus on the latter part of this mutually beneficial relationship. To this end, we introduce Deep Goal-Oriented Clustering (DGC), a probabilistic framework that clusters the data by jointly using supervision via side-information and unsupervised modeling of the inherent data structure in an end-to-end fashion. We show the effectiveness of our model on a range of datasets by achieving prediction accuracies comparable to the state-of-the-art, while, more importantly in our setting, simultaneously learning congruent clustering strategies. We also apply DGC to a real-world breast cancer dataset, and show that the discovered clusters carry clinical significance.

1. INTRODUCTION

Many of the advances in supervised learning in the past decade are due to the development of deep neural networks (DNN), a class of hierarchical function approximators that are capable of learning complex input-output relationships. Prime examples of such advances include image recognition (Krizhevsky et al., 2012), speech recognition (Nassif et al., 2019), and neural translation (Bahdanau et al., 2015). However, with the explosion of the size of modern datasets, it becomes increasingly unrealistic to manually annotate all available data for training. Hence, understanding inherent data structure through unsupervised clustering is of increasing importance. Several approaches to apply DNNs to unsupervised clustering have been proposed in the past few years (Caron et al., 2018; Law et al., 2017; Xie et al., 2016; Shaham et al., 2018), centering around the concept that the input space in which traditional clustering algorithms operate is of importance. Hence, learning this space from data is desirable, in particular for complex data. Despite the improvements these approaches have made on benchmark clustering datasets, the ill-defined, ambiguous nature of clustering remains a challenge. Such ambiguity is particularly problematic in scientific discovery, sometimes requiring researchers to choose from different, but potentially equally meaningful, clustering results when little information is available a priori (Ronan et al., 2016). When facing such ambiguity, using direct side-information to reduce clustering ambivalence proves to be a fruitful direction (Xing et al., 2002; Khashabi et al., 2015; Jin et al., 2013). Direct side-information is usually available in terms of constraints, such as the must-link and the cannot-link constraints (Wang & Davidson, 2010; Wagstaff & Cardie, 2000), or via a pre-conceived notion of similarity (Xing et al., 2002). However, defining such direct side-information requires human expertise, which can be labor intensive and vulnerable to labeling errors. In contrast, indirect, but informative, side-information might exist in abundance, and may not require human expertise to obtain. Being able to learn from such indirect information to form a congruous clustering strategy is thus immensely valuable.

Main Contributions

We propose Deep Goal-Oriented Clustering (DGC), a probabilistic model that is capable of using indirect, but informative, side-information to form a pertinent clustering strategy. Specifically: 1) We combine supervision via side-information and unsupervised data structure modeling in a probabilistic manner; 2) We make minimal assumptions on what form the supervised side-information might take, and assume no explicit correspondence between the side-information and the clusters; 3) We train DGC end-to-end so that the model simultaneously learns from the available side-information while forming a desired clustering strategy.

2. RELATED WORK

Most related work in the literature can be classified into two categories: 1) Methods that utilize extra side-information to form better, less ambiguous clusters; however, such side-information needs to be provided beforehand and cannot be learned; 2) Methods that can learn from the provided labels to lessen the ambiguity in the formed clusters; however, these methods rely on the cluster assumption (detailed below), and usually assume that the provided labels are discrete, ground truth labels. This excludes the possibility of learning from indirectly related, but informative side-information. We propose a unified framework that allows using informative side-information directly or indirectly to arrive at better formed clusters. Latent space sharing among different tasks has been studied in a VAE setting (Le et al., 2018; Xie & Ma, 2019). In this work we utilize this latent space sharing framework, but instead focus on clustering with the aid of general, indirect side-information.

Side-information as constraints

Using side-information to form better clusters is well-studied. Wagstaff & Cardie (2000) consider both must-link and cannot-link constraints in the context of K-means clustering. Motivated by image segmentation, Orbanz & Buhmann (2007) proposed a probabilistic model that can incorporate must-link constraints. Khashabi et al. (2015) proposed a nonparametric Bayesian hierarchical model to incorporate noisy side-information as soft-constraints. Vu et al. (2019) utilize constraints and cluster labels as side information. Mazumdar & Saha (2017) give complexity bounds when provided with an oracle that can be queried for side information. Wasid & Ali (2019) incorporate side information through the use of fuzzy sets. In supervised clustering, the side-information is the a priori known complete clustering for the training set, which is used as a constraint to learn a mapping between the data and the given clustering (Finley & Joachims, 2005). In contrast, we do not assume that the constraints are given a priori. Instead, we let the side-information guide the clustering procedure during the training process.

Semi-supervised methods & the cluster assumption

Semi-supervised clustering approaches generally assume that they only have access to a fraction of the true cluster labels. Via constraints such as the ones discussed above, the available labels are propagated to unlabeled data, which can help mitigate the ambiguity in choosing among different clustering strategies (Bair, 2013). The generative approach to semi-supervised learning introduced in Kingma et al. (2014) is based on a hierarchical generative model with two variational layers. Although it was originally meant for semi-supervised classification tasks, it can also be used for clustering. However, if used for clustering, it has to strictly rely on the cluster assumption, which states that there exists a direct correspondence between labels/classes and clusters (Färber et al., 2010; Chapelle et al., 2006). We show that this approach is a special case of our framework without the probabilistic ensemble component (see Sec. 4.2) and when certain distributional assumptions are made. Sansone et al. (2016) proposed a method for joint classification and clustering to address the stringent cluster assumption most approaches make by modeling the cluster indices and the class labels separately, underscoring the possibility that each cluster may consist of multiple class labels. Deploying a mixture of factor analysers as the underlying probabilistic framework, they also used a variational approximation to maximize the joint log-likelihood. In this work, we generalize the notion of learning from discrete, ground truth labels to learning from indirect, but informative side-information. We make virtually no assumptions on the form of y or its relation to the clusters. This makes our approach more applicable to general settings.

3.1. BACKGROUND-VARIATIONAL DEEP EMBEDDING

The starting point for DGC is the variational auto-encoder (VAE) (Kingma & Welling, 2014) with the prior distribution of the latent code chosen as a Gaussian mixture distribution. This model was introduced by Jiang et al. (2017) as VaDE. We briefly review the generative VaDE approach here to provide the background for DGC. We adopt the notation that lower case letters denote samples from their corresponding distributions; bold, lower case letters denote random variables/vectors; and bold upper case letters denote random matrices. Assume the prior distribution of the latent code, z, belongs to the family of Gaussian mixture distributions, i.e.

$$p(z) = \sum_c p(z \mid c)\, p(c) = \sum_c \pi_c\, \mathcal{N}(\mu_c, \sigma_c^2 I),$$

where c is a random variable, with prior probability $\pi_c$, indexing the normal component with mean $\mu_c$ and variance $\sigma_c^2$. VaDE allows for the clustering of the input data in the latent space, with each component of the Gaussian mixture prior representing an underlying cluster. A VAE-based model can be efficiently described in terms of its generative process and inference procedure. Given an input $x \in \mathbb{R}^d$, the following decomposition of the joint probability p(x, z, c) details VaDE's generative process: p(x, z, c) = p(x|z)p(z|c)p(c). In words, we first sample the component index c from a prior categorical distribution p(c), then sample the latent code z from the component p(z|c), and lastly reconstruct the input x through the reconstruction network p(x|z). To perform inference and learn from the data, VaDE is constructed to maximize the log-likelihood of the input data x by maximizing its evidence lower bound (ELBO):

$$\log p(x) \ge \mathbb{E}_{q(z \mid x)} \log p(x \mid z) - \mathbb{E}_{q(c \mid x)} \log \frac{q(c \mid x)}{p(c)} - \mathbb{E}_{q(z, c \mid x)} \log \frac{q(z \mid x)}{p(z \mid c)}, \tag{1}$$

where, given the input x, q(z, c|x) denotes the variational posterior distribution over the latent variables, and $\mathbb{E}_d$ denotes the expectation with respect to distribution d. With proper assumptions on the prior and variational posterior distributions, the ELBO in Eq. 1 admits a closed-form expression in terms of the parameters of those distributions. We refer readers to Jiang et al. (2017) for additional details.
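To make the role of the mixture prior concrete, the following is a minimal sketch of ancestral sampling from p(z) and of the component posterior p(c|z) obtained from Bayes' rule. It is an illustration only, not the VaDE implementation: the function names, the two-component prior, and the 2-D latent space are assumptions made for this example.

```python
import math
import random

def log_normal_iso(z, mu, sigma2):
    """Log density of N(mu, sigma2 * I) evaluated at z (isotropic covariance)."""
    d = len(z)
    sq_dist = sum((zi - mi) ** 2 for zi, mi in zip(z, mu))
    return -0.5 * (d * math.log(2.0 * math.pi * sigma2) + sq_dist / sigma2)

def sample_prior(pi, mus, sigma2s, rng):
    """Ancestral sampling: c ~ Cat(pi), then z ~ N(mu_c, sigma2_c * I)."""
    c = rng.choices(range(len(pi)), weights=pi)[0]
    z = [rng.gauss(m, math.sqrt(sigma2s[c])) for m in mus[c]]
    return c, z

def component_posterior(z, pi, mus, sigma2s):
    """p(c | z) proportional to pi_c * N(z; mu_c, sigma2_c * I), in log space."""
    logs = [math.log(p) + log_normal_iso(z, m, s)
            for p, m, s in zip(pi, mus, sigma2s)]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-component prior in a 2-D latent space.
pi = [0.5, 0.5]
mus = [[-3.0, 0.0], [3.0, 0.0]]
sigma2s = [1.0, 1.0]

c, z = sample_prior(pi, mus, sigma2s, random.Random(0))
posterior = component_posterior(z, pi, mus, sigma2s)
```

Each mixture component plays the role of one cluster, so `component_posterior` gives the soft cluster assignment of a latent code under the prior alone.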

3.2. PROBLEM SETUP

Figure 1: The Bayesian network that underlies the generative process of DGC. θ and π together constitute the generative parameters.

Unlike the unsupervised settings, we do assume we have a response variable y, and our goal is to leverage y to inform a better clustering strategy. Abstractly, given the input-output random variable pair (x, y), we seek to divide the probability space of x into non-overlapping subspaces that are meaningful in explaining the output y. In other words, we want to use the prediction task of mapping data points, x, sampled from the probability space of x to their corresponding sampled outcomes y as a teaching agent, to guide the process of dividing the probability space of x into subspaces that optimally explain y. Since our goal is to discover the subspace-structure without knowing a priori whether such a structure indeed exists, a probabilistic framework is more appropriate due to its ability to incorporate and reason with uncertainty. To this end, we use and extend the VaDE framework, with the following assumption imposed on the latent code that specifically caters to our setting. Assume the input x carries predictive information with respect to the output y. Since the latent code z should inherit sufficient information from which the input x can be reconstructed, it is reasonable to assume that z also inherits that predictive information. This assumption implies that x and y are conditionally independent given z, i.e. p(x, y|z) = p(x|z)p(y|z).

4.1. GENERATIVE PROCESS

In order to incorporate y into a probabilistic model, recall from our previous discussion that y might manifest with respect to the input differently across different subspaces of the input space. Viewing p(y|z) as a conditional probability distribution over y resulting from a functional transformation from z to the space of probability distributions over y, we can assume that the ground truth transformation function, $g_c$, is different for each subspace indexed by c. If z ∼ p(z|c) for some index c, we assume $p(y \mid z, c) \propto g_c(z)$ for some subspace-specific $g_c$. As a result, we learn a different mapping function for each subspace. The overall generative process of our model is as follows:

1. Generate c ∼ Cat(π);
2. Generate z ∼ p(z|c);
3. Generate x ∼ p(x|z);
4. Generate y ∼ p(y|z, c).

The Bayesian network that underlies DGC is shown in Fig. 1, and the joint distribution of x, y, z, and c can be decomposed as: p(x, y, z, c) = p(y|z, c)p(x|z)p(z|c)p(c).
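The four generative steps can be sketched as ancestral sampling. This is a toy illustration under stated assumptions rather than the model used in the experiments: the decoder `decode_x` and the cluster-specific Bernoulli heads in `task_heads` (standing in for the transformations g_c) are hypothetical placeholders for neural networks.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generate(pi, mus, sigma2s, decode_x, task_heads, rng):
    """Draw one (x, y, z, c) tuple following p(y|z,c) p(x|z) p(z|c) p(c)."""
    c = rng.choices(range(len(pi)), weights=pi)[0]               # 1. c ~ Cat(pi)
    z = [rng.gauss(m, math.sqrt(sigma2s[c])) for m in mus[c]]    # 2. z ~ p(z|c)
    x = decode_x(z, rng)                                         # 3. x ~ p(x|z)
    p_y = task_heads[c](z)                  # cluster-specific Bernoulli parameter
    y = 1 if rng.random() < p_y else 0                           # 4. y ~ p(y|z,c)
    return x, y, z, c

# Hypothetical components: a noisy identity decoder and two task heads
# that respond to the first latent coordinate in opposite directions.
def decode_x(z, rng):
    return [zi + rng.gauss(0.0, 0.1) for zi in z]

task_heads = [lambda z: sigmoid(2.0 * z[0]), lambda z: sigmoid(-2.0 * z[0])]

pi = [0.5, 0.5]
mus = [[-3.0, 0.0], [3.0, 0.0]]
sigma2s = [1.0, 1.0]

rng = random.Random(0)
samples = [generate(pi, mus, sigma2s, decode_x, task_heads, rng) for _ in range(200)]
```

Note that x depends on c only through z, whereas y depends on both z and c, which is what allows a different input-output mapping per subspace.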

4.2. INFERENCE & VARIATIONAL LOWER BOUND

We first note that the joint variational posterior distribution q(z, c|x, y) can be factorized as q(z, c|x, y) = q(c|x, z, y) · q(z|x, y). Since we assume we do not have access to the side-information y at test time, we do not use y to compute q(z|x, y) (in practice, this means that the encoder only takes x as input to compute the latent code z). We omit the variable y in q(z|x, y) for the rest of the paper for notational convenience. See Sec. 4.4 for how to compute q(c|x, z, y) at test time when y is not available. With this setup, we have the following variational lower bound (see the Appendix for a detailed derivation):

$$\log p(x, y) \ge \underbrace{\mathbb{E}_{q(z, c \mid x, y)} \log p(y \mid z, c)}_{\text{Probabilistic Ensemble}} + \underbrace{\mathbb{E}_{q(z, c \mid x, y)} \log \frac{p(x, z, c)}{q(z, c \mid x, y)}}_{\text{ELBO for VAE with GMM prior}} = \mathcal{L}_{\text{ELBO}}. \tag{2}$$

The first term in $\mathcal{L}_{\text{ELBO}}$ allows for a probabilistic ensemble of classifiers based on the subspace index. This can be seen as follows:

$$\mathbb{E}_{q(z, c \mid x, y)} \log p(y \mid z, c) = \mathbb{E}_{q(z \mid x)} \sum_k \lambda_k \log p(y \mid z, c = k) \approx \frac{1}{M} \sum_{l=1}^{M} \sum_k \lambda_k \log p(y \mid z^{(l)}, c = k),$$

where $\lambda_k = q(c = k \mid x)$ and l indexes the Monte Carlo samples used to approximate the expectation with respect to q(z|x). The probabilistic ensemble allows the model to maintain necessary uncertainty with respect to the discovered subspace structure until an unambiguous structure is captured. It is also worth noting that the variational lower bound described in Eq. 2 holds regardless of the prior distribution we choose for the latent code z. Although we choose the mixture distribution as the prior in this work, choosing z ∼ N(0, I) and disregarding the probabilistic ensemble component would recover the exact model introduced in Kingma et al. (2014) (when all labels are missing), and hence is a special case of our proposed framework.
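As an illustration of the Monte Carlo approximation of the probabilistic ensemble term, the sketch below averages the weighted log-likelihoods over samples drawn from a diagonal Gaussian q(z|x). The names `ensemble_term`, `weights_fn`, and `task_heads` are hypothetical, the response is assumed to be binary, and the constant heads are chosen only so that the result can be checked by hand.

```python
import math
import random

def bernoulli_log_lik(y, p):
    """log p(y) for a Bernoulli variable with success probability p."""
    return math.log(p) if y == 1 else math.log(1.0 - p)

def ensemble_term(y, mu_z, sigma_z, weights_fn, task_heads, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(y|z,c)] with M = num_samples draws."""
    total = 0.0
    for _ in range(num_samples):
        # z^(l) ~ q(z|x), a diagonal Gaussian with means mu_z and std devs sigma_z
        z = [rng.gauss(m, s) for m, s in zip(mu_z, sigma_z)]
        lam = weights_fn(z)  # cluster weights lambda_k for this sample
        total += sum(l_k * bernoulli_log_lik(y, head(z))
                     for l_k, head in zip(lam, task_heads))
    return total / num_samples

# Hypothetical toy setup: two clusters and a 1-D latent code.
task_heads = [lambda z: 0.8, lambda z: 0.3]   # p(y = 1 | z, c = k), constant in z
weights_fn = lambda z: [0.5, 0.5]             # stand-in for the weights lambda_k
estimate = ensemble_term(1, [0.0], [1.0], weights_fn, task_heads, 100,
                         random.Random(0))
```

With constant heads and weights the estimate equals 0.5 log 0.8 + 0.5 log 0.3 for any number of samples; with learned heads it becomes a weighted vote of the cluster-specific predictors.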

4.3. MEAN-FIELD VARIATIONAL POSTERIOR DISTRIBUTIONS

Following VAE (Kingma et al., 2014), we choose q(z|x) to be $\mathcal{N}(z \mid \tilde{\mu}_z, \tilde{\sigma}_z^2 I)$, where $[\tilde{\mu}_z, \tilde{\sigma}_z^2] = h(x; \theta)$ and h is parameterized by a feed-forward neural network with weights θ. See the Appendix for a detailed discussion of why using a unimodal distribution (i.e. q(z|x)) to approximate a multimodal distribution (p(z)) is appropriate in our setting. Choosing q(c|x, z, y) appropriately requires us to analyze the proposed $\mathcal{L}_{\text{ELBO}}$ in greater detail based on the following decomposition (see the Appendix for a detailed derivation):

$$\mathcal{L}_{\text{ELBO}} = \underbrace{\mathbb{E}_{q(z, c \mid x, y)} \log p(y \mid z, c)}_{①} + \underbrace{\mathbb{E}_{q(z \mid x)} \log \frac{p(x, z)}{q(z \mid x)}}_{②} - \underbrace{\mathbb{E}_{q(z \mid x)} \mathrm{KL}\big(q(c \mid x, z, y) \,\|\, p(c \mid z)\big)}_{③}. \tag{3}$$

We observe that since ② does not depend on c, q(c|x, z, y) should be chosen to maximize ① − ③. Moreover, the expectation over q(z|x) does not depend on c, and thus has no influence over our choice of q(c|x, z, y). Casting the search for q(c|x, z, y) as an optimization problem, we have

$$\min_{q(c \mid x, z, y)} f_0(q) = \mathrm{KL}\big(q(c \mid x, z, y) \,\|\, p(c \mid z)\big) - \mathbb{E}_{q(c \mid x, z, y)} \log p(y \mid z, c), \quad \text{s.t.} \ \sum_k q(c = k \mid x, z, y) = 1, \ \ q(c = k \mid x, z, y) \ge 0 \ \ \forall k. \tag{4}$$

The objective functional $f_0$ is convex over the probability space of q, as the Kullback-Leibler divergence is convex in q and the expectation is linear in q. Analytically solving the convex program (4) (see the Appendix for a detailed derivation), we obtain

$$q(c = k \mid x, z, y) = \frac{p(y \mid z, c = k) \cdot p(c = k \mid z)}{\sum_{k'} p(y \mid z, c = k') \cdot p(c = k' \mid z)}. \tag{5}$$

First we note that since the solution q(c|x, z, y) does not depend on x, we omit x in q(c|x, z, y) for the remainder of the paper for notational convenience. To better facilitate understanding, we interpret Eq. 5 in two extremes. If y is evenly distributed across the different subspaces, i.e. the ground truth transformations $g_c$ are the same for all c, then q(c = k|z, y) = p(c = k|z), which is what one would choose for unsupervised clustering (Jiang et al., 2017). However, if the supervised task is informative while the unsupervised task is not, i.e. p(c|z) is a uniform distribution, the likelihoods $\{p(y \mid z, c = k)\}_k$ will dominate q. Therefore, one could interpret any in-between scenario as a balance that automatically weights the supervised and the unsupervised tasks based on how strong their signals are with respect to grouping the latent probability space into different subspaces.
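The closed-form solution in Eq. 5 and its two extremes can be checked numerically with a short sketch. The function name `cluster_posterior` is hypothetical, and the inputs are assumed to be the log-likelihoods log p(y|z, c = k) and strictly positive prior weights p(c = k|z) for a single latent code.

```python
import math

def cluster_posterior(log_lik, prior):
    """q(c=k|z,y) proportional to p(y|z,c=k) * p(c=k|z), computed in log space."""
    scores = [ll + math.log(p) for ll, p in zip(log_lik, prior)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Extreme 1: identical likelihoods across clusters, so q reduces to p(c|z).
uninformative_y = cluster_posterior([-1.0, -1.0, -1.0], [0.2, 0.3, 0.5])

# Extreme 2: uniform p(c|z), so q is the normalized likelihood.
uninformative_z = cluster_posterior([math.log(0.9), math.log(0.1)], [0.5, 0.5])
```

In the first case the result matches the prior weights (0.2, 0.3, 0.5); in the second it matches the normalized likelihoods (0.9, 0.1). Any intermediate case interpolates between the two signals.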

4.4. EVALUATING ON UNLABELED DATA

We first introduce some notation that we will adopt in this section. We write p(y|z, c) to denote the likelihood value, which requires a specific value of y to compute. We write $p_{y \mid z, c}$ to refer to the entire distribution (in the context of the entropy (H) and the expectation (E) operators we refer to distributions, otherwise we refer to specific likelihoods). When presented with the response variable y, Eq. 5 gives the optimal choice of q(c|z, y) that allows the network to incorporate both the supervised and the unsupervised signals when weighting the clusters. In practice, we do not have access to y on newly collected (test) data points, which prohibits us from evaluating q(c|z, y). One easy remedy would be to use p(c|z) when y is not available; however, having an ensemble of well-trained conditional likelihood mappings, $\{p(y \mid z, c = k)\}_k$, and not utilizing them when evaluating on new data points seems wasteful. We thus add a regularization term to $\mathcal{L}_{\text{ELBO}}$, so that DGC can naturally generalize to unlabeled testing samples. The regularized ELBO is:

$$\mathcal{L}^{\text{regu}}_{\text{ELBO}} = \mathcal{L}_{\text{ELBO}} - \mathbb{E}_{q(z, c \mid x, y)} H_{\max}(p_{y \mid z, c}), \tag{6}$$

where $H_{\max}(p_{y \mid z, c}) = \max\{H(p_{y \mid z, c}), 0\}$ and $H(p_{y \mid z, c}) = -\mathbb{E}_{p_{y \mid z, c}} \log p_{y \mid z, c}$ is the entropy of the task network distributions. If y is a discrete random variable, $H_{\max}(p_{y \mid z, c}) = H(p_{y \mid z, c})$, which is the entropy of $p_{y \mid z, c}$ and always non-negative; on the other hand, when y is continuous, although the differential entropy of $p_{y \mid z, c}$ can take any sign, the max operator ensures that $H_{\max}(p_{y \mid z, c})$ will remain non-negative. Therefore, adding (a convex combination of) negative entropies preserves the inequality, and thus $\mathcal{L}^{\text{regu}}_{\text{ELBO}}$ remains a proper lower bound. Additionally, solving a similar convex program provides the optimal choice of q(c|z, y) in the presence of the regularizer:

$$q(c = k \mid z, y) = \frac{e^{\log p(y \mid z, k) - H_{\max}(p_{y \mid z, k})} \cdot p(k \mid z)}{\sum_j e^{\log p(y \mid z, j) - H_{\max}(p_{y \mid z, j})} \cdot p(j \mid z)}. \tag{7}$$

This form of regularization penalizes the entropies of the conditional distributions $p_{y \mid z, c}$. More specifically, clusters with higher posterior weights are penalized more towards having low entropies. This aligns with our intuition: the most suitable cluster to explain a given sample y should be relatively more certain in how it is distributed. We thus use Eq. 7 to weight the clusters during training. It is worth noting that although Eq. 7 provides the optimal choice of q(c|z, y) for maximizing $\mathcal{L}^{\text{regu}}_{\text{ELBO}}$, any choice of q(c = k|z, y), as long as it is a proper probability distribution, leaves $\mathcal{L}^{\text{regu}}_{\text{ELBO}}$ a proper lower bound. Based on the previously stated intuition, when evaluating on an unlabeled data point, we use

$$q_{\text{test}}(c = k \mid z, y) = \frac{e^{-H_{\max}(p_{y \mid z, k})} \cdot p(k \mid z)}{\sum_j e^{-H_{\max}(p_{y \mid z, j})} \cdot p(j \mid z)}$$

to weight the clusters. This aligns with our previous reasoning: the cluster that corresponds to p(y|z, c) with the lowest entropy will be weighted most heavily. This allows the model to use the tuned conditional likelihood mappings when the side-information y is not available.

5. EXPERIMENTS

We investigate the efficacy of DGC on a range of datasets. We refer the reader to the Appendix for the experimental details, e.g. the train/validation/test split, the chosen network architecture, the choices of learning rate and optimizer.

5.1. NOISY MNIST

We introduce a synthetic data experiment using the MNIST dataset, which we name the noisy MNIST, to illustrate that the supervised part of DGC can enhance the performance of an otherwise well-performing unsupervised counterpart. Further, we explore the behavior of DGC without its unsupervised part to demonstrate the importance of capturing the inherent data structure. We extract images that correspond to the digits 2 and 7 from MNIST. For each digit, we randomly select half of the images for that digit and superpose noisy backgrounds onto those images (see the Appendix for image samples). The binary random variable y indicates which digit each image belongs to. Our goal is to cluster the images into 4 clusters: digits 2 and 7, with and without background. However, we only use the binary responses for supervision and have no direct knowledge of the background. We therefore parameterize the task networks, $\{p(y \mid z, c = k)\}_{k=1}^{4}$, as Bernoulli distributions where we learn the parameters (the probabilities). As a baseline, the unsupervised approach, VaDE, already performs well on this dataset, achieving a clustering accuracy of 95.6% when the desired number of clusters is set to 4. Fig. 2a shows that VaDE distinguishes well based on the presence or absence of the noisy background, and incorrectly clustered samples are mainly due to VaDE's inability to differentiate the underlying digits. This is reasonable behavior: if the background signal dominates, the network may focus on the background for clustering as it has no explicit knowledge about the digits. DGC performs nearly perfectly (with a clustering accuracy of 99.6%) in this setting with the help of the added supervision. We see that DGC mitigates the difficulty of distinguishing between digits in the presence of strong, noisy backgrounds (as shown in Fig. 2b, where DGC makes almost no mistakes in distinguishing between digits even in the presence of noisy backgrounds). This added supervision does not overshadow the original advantage of VaDE (i.e. distinguishing whether the images contain background or not). Instead, it enhances the overall model in cases where the unsupervised part, i.e. VaDE, struggles. Furthermore, as detailed in Sansone et al. (2016) and earlier sections, most existing approaches that take advantage of available labels rely on the cluster assumption, which assumes a one-to-one correspondence between the clusters and the labels used for supervision. This experiment is a concrete example demonstrating that DGC does not need to rely on such an assumption to form a sound clustering strategy. Instead, DGC is able to work with class labels that are only partially indicative of what the final clustering strategy should be, potentially making DGC applicable to more general settings.

Ablation study

To further test the importance of each part of our model, we ablate the probabilistic components (i.e. we remove the decoder and the loss terms associated with it, so that only the supervision informs how the clusters are formed in the latent space) and perform clustering using only the supervised part of our model. We find that clustering accuracy degrades from the nearly perfect accuracy obtained by the full model to 50%. Coupled with the improvements over VaDE, this indicates that each component of our model contributes to the final accuracy and that our original intuition that supervision and clustering may reinforce each other is correct.
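The clustering accuracies quoted in this section are not defined in the text. A common convention, assumed here purely for illustration, is to score a clustering by the best one-to-one mapping between cluster indices and ground truth groups. The sketch below uses an exhaustive search over permutations, which is feasible for the 4 clusters of this experiment; the function name and the toy labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(pred, truth, num_clusters):
    """Best-match accuracy over all one-to-one cluster-to-label mappings.

    Assumes labels lie in 0..num_clusters-1 and that num_clusters is small,
    since the search enumerates num_clusters! permutations.
    """
    counts = [[0] * num_clusters for _ in range(num_clusters)]
    for p, t in zip(pred, truth):
        counts[p][t] += 1  # counts[k][j]: points in cluster k with true label j
    best = max(
        sum(counts[k][perm[k]] for k in range(num_clusters))
        for perm in permutations(range(num_clusters))
    )
    return best / len(pred)

# Hypothetical toy labels: clusters 0 and 1 are swapped relative to the truth,
# and one point of cluster 3 is misplaced.
pred = [1, 1, 0, 0, 2, 2, 3, 3]
truth = [0, 0, 1, 1, 2, 2, 3, 0]
accuracy = clustering_accuracy(pred, truth, 4)
```

Because the score is invariant to how clusters are numbered, swapping cluster indices does not lower the accuracy; only the single misplaced point does.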

5.2. PACMAN

In this experiment we test DGC's ability to learn a clustering strategy when facing a continuous response as side-information. The Pacman-shaped data consists of two annuli, and each point in the two annuli is associated with a continuous response value (see the Appendix for a detailed breakdown). These response values decrease linearly (from 1 to 0) in one direction for the inner (yellow) annulus, and increase exponentially (from 0 to 1) in the opposite direction for the outer (purple) annulus. Figure 3a contains a 3D illustration of the dataset. We use linear/exponential rates for the responses to not only test our model's ability to detect different trends, but also to test its ability to fit different rates. Our goal is to separate the two annuli depicted in Fig. 3a. This is challenging as the annuli were deliberately chosen to be very close to each other. We applied various traditional unsupervised learning methods, including K-means and hierarchical clustering, to only the 2D Pacman-shaped data (i.e., not using the responses, but only the 2D Cartesian coordinates). Apart from hierarchical clustering with single linkage (and no other distance metric), none of the unsupervised methods managed to separate the two annuli. Moreover, these approaches also result in different clustering strategies as they are based on different distance metrics (see the Appendix for these results). This phenomenon echoes a deep-rooted obstacle for clustering methods in general: the concept of clustering is inherently subjective, and different distance metrics can potentially produce different, but sometimes equally meaningful, clustering results. Applying DGC with the input x as the 2D Cartesian point coordinates and the responses y as the response values described previously, we are able to distinguish the two annuli wholly based on the discriminative information carried by the responses. We parameterize the task networks, $\{p(y \mid z, c = k)\}_{k=1}^{2}$, as Gaussian distributions where we learn the means and the covariance matrices. As the generated samples in Fig. 3b show, both the Pacman shape and its corresponding response values are captured. The generated samples from DGC substantiate the model's ability to appropriately learn and use the side-information provided by the response values to obtain a sensible clustering strategy. Unlike most previously discussed methods, DGC can work with continuous response values. This is highly attractive, as it lends itself to any general regression setting in which one would believe the desired clustering strategy should be informed by the regression task. Finally, we compare DGC to VaDE, its ablated version, and a baseline method to substantiate the efficacy of our proposed framework. First, although the solution to the convex program in Eq. 4 provides an optimal choice of q(c|z, y) from a theoretical standpoint, our proposed framework, specifically the proposed $\mathcal{L}_{\text{ELBO}}$ (Eq. 2), holds for any choice of q(c|z, y). We thus ablate the convex programming component of our model and parameterize q(c|z, y) using a neural network (NN-DGC). Second, by choosing z ∼ N(0, I), the unsupervised part of DGC recovers exactly the semi-supervised (SS) approach introduced by Kingma et al. (2014) in the case when all labels are missing. Since SS is not expected to perform well in a purely unsupervised setting, we include the probabilistic ensemble component as an augmentation (AUG-SS). The results described in Tab. 1 are obtained from running each model 100 times, and demonstrate the following: 1) without the additional responses y, VaDE cannot distinguish between the two annuli at all, emphasizing the importance of exploiting the additional information; 2) the convex program (Eq. 4) is crucial to the success of DGC, and it is difficult for a neural network to find the same optimal distribution; 3) the choice of the prior on the latent code z is also of paramount importance, and the Gaussian mixture distribution is more suitable for modeling clusters than an isotropic Gaussian.

5.3. SVHN

We apply DGC to the Street View House Number (SVHN) dataset (Netzer et al., 2011), where the digit labels (10 digits in total) are used as the ground truth clustering labels. This dataset consists of 73,257 training images, 26,032 test images, and 531,131 additional training images. We train DGC using all the training and extra images. We parameterize the task networks, $\{p(y \mid z, c = k)\}_{k=1}^{10}$, as multinomial distributions over the 10 digits where we learn the parameters of those distributions. The goal of this experiment is to investigate the impact of one hyperparameter, the desired number of clusters, on the overall framework. Ideally, a model would effectively ignore additional clusters when the number of chosen clusters is larger than the number of ground truth clusters. That is, it should automatically determine during learning how many clusters are appropriate, since this number is usually unknown a priori. To test whether DGC has this property, we alter the desired number of clusters. We use the class labels as the responses to help the clustering. It is worth noting that while in general one will not use ground-truth labels as side-information, we use them for this experiment as they provide a good idea of what a reasonable number of clusters in latent space should be. Firstly, note that DGC outperforms VaDE and AUG-SS by significant margins, demonstrating the importance of the added side-information and the Gaussian mixture prior assumption, respectively. Additionally, although we use the ground truth labels as the side-information, DGC outperforms the K-means baseline, where we perform K-means clustering on the features obtained from the last hidden layer of a classification network that is trained on the SVHN dataset (with a 95.7% classification accuracy). This demonstrates the advantages of the latent space sharing and the end-to-end nature of DGC. Secondly, even when the desired number of clusters is larger than the number of digits, a network may still achieve a clustering accuracy of 100% if it learns to group samples into a consistent set of clusters, with the cardinality of that set matching the number of digits. This is echoed by the clustering accuracies shown in Tab. 2. When the desired number of clusters exceeds 10, DGC is still able to achieve high clustering accuracies. By comparison, the clustering accuracy drops dramatically for all baselines when the number of desired clusters increases, demonstrating DGC's ability to choose an appropriate number of clusters.

5.4. CAROLINA BREAST CANCER STUDY (CBCS)

In this experiment we apply DGC to a real-world breast cancer dataset collected as part of the Carolina Breast Cancer Study (CBCS). The dataset consists of 1,713 patients, each of which has 2-4 associated histopathological images and a list of biological markers. e.g., the Pam50 gene expressions (Troester et al., 2018) and ER status. As an exploratory investigation, we use the binary indicator for breast cancer recurrence as the response variable y. Applying deep learning techniques, supervised or unsupervised, to analyze histopathological images of breast cancer has gained traction in recent years (Xie et al., 2019) . Distinguished from those methods, our goal is to inspect whether the discovered clusters, whose formation is influenced both by the supervised recurrence information and the unsupervised reconstruction signal, carry meaningful information in terms of survival rate or gene expression. We parameterize the task networks, {p(y|z, c = k)} 3 k=1 , as Bernoulli distributions and learn the associated parameters. See the Appendix for experimental details. To investigate whether the three clusters that we discovered were identifying meaningful differences in tumor biology, we examine the differences in rates of cancer recurrence and features of tumor aggressiveness between the clusters. We also compare to the baseline clusters obtained from the purely unsupervised VaDE to corroborate the importance of the added side-information. Using a Kaplan-Meier estimator to estimate risk differences for time to cancer recurrence within three years, we obtained a p-value of 0.0024 and observed that Cluster 0 had the lowest risk of recurrence and Cluster 2 had the highest risk (see Tab. 3). Even with the small sample size, we observed substantial differences in recurrence risk at three years of follow-up between the clusters, particularly Clusters 0 and 2 (see Fig. 4a ). 
By comparison, the differences in recurrence risk between the clusters from VaDE are much less significant, both visually (see Fig. 4b, where two clusters almost overlap) and in terms of p-value (0.073). As for grade, Cluster 0 from VaDE is not the cluster with the most high-grade patients. For ER status and tumor subtype, it does have the most ER-negative and the most Basal-like tumors, but the differences are much less significant than those among the clusters obtained from DGC. This indicates that using recurrence side-information, via our DGC approach, indeed resulted in more meaningful clusters.
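For readers unfamiliar with the estimator, the following is a minimal sketch of a Kaplan-Meier (product-limit) estimate and of the resulting difference in recurrence risk at a fixed horizon. The helper names `kaplan_meier` and `risk_at` and the follow-up data are hypothetical and purely illustrative; they are not CBCS data, and the significance test behind the reported p-values is not reproduced here.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the recurrence-free probability.

    times:  follow-up time per patient (e.g., in years)
    events: 1 if recurrence was observed at that time, 0 if censored
    Returns [(event_time, survival_probability), ...] in time order.
    """
    curve, surv = [], 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)  # still under observation
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - d / at_risk
        curve.append((t, surv))
    return curve


def risk_at(curve, horizon):
    """Estimated risk of recurrence by `horizon`, i.e., 1 - S(horizon)."""
    surv = 1.0
    for t, s in curve:
        if t <= horizon:
            surv = s
    return 1.0 - surv


# Hypothetical follow-up data for two clusters (not from the study).
times_a, events_a = [1.0, 2.0, 3.0, 3.0, 3.0], [0, 1, 0, 0, 0]
times_b, events_b = [0.5, 1.0, 2.0, 3.0, 3.0], [1, 1, 0, 1, 0]

risk_a = risk_at(kaplan_meier(times_a, events_a), 3.0)
risk_b = risk_at(kaplan_meier(times_b, events_b), 3.0)
risk_difference = risk_b - risk_a
```

Censored patients contribute to the at-risk count until their last follow-up but never count as recurrences, which is why the estimate differs from a simple fraction of observed events.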

6. CONCLUSION

In this work, we introduced DGC, a probabilistic framework that allows for the integration of both supervised and unsupervised information when searching for a congruous clustering in the latent space. This is an extremely relevant, but daunting task, where previous attempts are either largely restricted to discrete, supervised, ground-truth labels or rely heavily on the side-information being provided as manually tuned constraints. To the best of our knowledge, this is the first attempt to simultaneously learn from generally indirect, but informative side-information and form a sensible clustering strategy, all the while making minimal assumptions on either the form of the supervision or the relationship between the supervision and the clusters. This method is applicable to a variety of fields where an instance's input and task are defined but its membership is important and unknown, e.g., survival analysis. Training the model in an end-to-end fashion, we demonstrate that DGC is capable of capturing a clustering that aligns with the provided information, while obtaining reasonable classification results on various datasets.



Figure 2: Confusion matrices. The abbreviations 2B/7B in the row/column labels denote digits 2/7 with background. Rows represent the predicted clusters, and columns represent the ground truth.

Figure 3: (a) The ground truth 3D Pacman; (b) The generated samples from DGC.

Figure 4: The Kaplan-Meier risk-difference curves among clusters from (a) DGC and (b) VaDE.

Test clustering accuracies on the Pacman dataset

Clustering (and classification) accuracy variation over maximum possible clusters.

Tab. 2 demonstrates the clustering and classification accuracies obtained from DGC and various baselines over a varying number of underlying clusters. The results indicate that DGC most successfully groups the data into the proper classes and clusters regardless of the number of clusters available.

The risk of recurrence difference (RRD) between clusters for DGC.

Tumor characteristics for each cluster. Features are color-coded as low, intermediate, or high risk.

ER) positivity, low grade, and Luminal A tumor subtype (see Tab. 4). In contrast, more aggressive tumor characteristics were featured in Clusters 1 and 2, such as negative ER status, high grade, and Basal-like tumor subtype, although Cluster 1 appeared to be intermediate between Clusters 0 and 2 in some characteristics. Coupled with the differences in cancer outcomes, these differences in tumor characteristics indicate that the method successfully distinguished between tumors with low-risk features (Cluster 0) and tumors with intermediate- and high-risk features (Clusters 1 and 2). We include the corresponding table of tumor characteristics for the clusters obtained from VaDE in the Appendix. As one can see, Cluster 0 from VaDE, which has the highest recurrence rate, should have the most ER-negative, the most high-grade, and the most Basal-like tumors.

