DEEP GOAL-ORIENTED CLUSTERING

Abstract

Clustering and prediction are two primary tasks in the fields of unsupervised and supervised machine learning. Although many of the recent advances in machine learning have centered around these two tasks, the interdependent, mutually beneficial relationship between them is rarely explored. One could reasonably expect that appropriately clustering the data would aid the downstream prediction task and, conversely, that better prediction performance on the downstream task could inform a more appropriate clustering strategy. In this work, we focus on the latter part of this mutually beneficial relationship. To this end, we introduce Deep Goal-Oriented Clustering (DGC), a probabilistic framework that clusters the data by jointly using supervision via side-information and unsupervised modeling of the inherent data structure in an end-to-end fashion. We show the effectiveness of our model on a range of datasets by achieving prediction accuracies comparable to the state of the art while, more importantly in our setting, simultaneously learning congruent clustering strategies. We also apply DGC to a real-world breast cancer dataset and show that the discovered clusters carry clinical significance.

1. INTRODUCTION

Many of the advances in supervised learning in the past decade are due to the development of deep neural networks (DNNs), a class of hierarchical function approximators capable of learning complex input-output relationships. Prime examples of such advances include image recognition (Krizhevsky et al., 2012), speech recognition (Nassif et al., 2019), and neural translation (Bahdanau et al., 2015). However, with the explosion in the size of modern datasets, it becomes increasingly unrealistic to manually annotate all available data for training. Hence, understanding inherent data structure through unsupervised clustering is of increasing importance. Several approaches to applying DNNs to unsupervised clustering have been proposed in the past few years (Caron et al., 2018; Law et al., 2017; Xie et al., 2016; Shaham et al., 2018), centering around the idea that the input space in which traditional clustering algorithms operate matters, and that learning this space from data is therefore desirable, particularly for complex data. Despite the improvements these approaches have achieved on benchmark clustering datasets, the ill-defined, ambiguous nature of clustering remains a challenge. Such ambiguity is particularly problematic in scientific discovery, sometimes requiring researchers to choose among different, but potentially equally meaningful, clustering results when little information is available a priori (Ronan et al., 2016). When facing such ambiguity, using direct side-information to reduce clustering ambivalence has proven to be a fruitful direction (Xing et al., 2002; Khashabi et al., 2015; Jin et al., 2013). Direct side-information is usually available in terms of constraints, such as must-link and cannot-link constraints (Wang & Davidson, 2010; Wagstaff & Cardie, 2000), or via a pre-conceived notion of similarity (Xing et al., 2002).
However, defining such direct side-information requires human expertise, which can be labor-intensive and vulnerable to labeling errors. In contrast, indirect but informative side-information may exist in abundance and may not require human expertise to obtain. Being able to learn from such indirect information to form a congruous clustering strategy is thus immensely valuable.
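Direct side-information in the form of must-link and cannot-link constraints can be made concrete with a small sketch. The following is a minimal, illustrative constrained K-means in the spirit of Wagstaff & Cardie (2000); the function names and the greedy assignment rule are our own simplifications, not the authors' exact algorithm.

```python
import numpy as np

def violates(i, c, labels, must_link, cannot_link):
    """Check whether assigning point i to cluster c breaks a constraint
    against any already-assigned point (-1 means not yet assigned)."""
    for a, b in must_link:
        if i == a and labels[b] not in (-1, c):
            return True
        if i == b and labels[a] not in (-1, c):
            return True
    for a, b in cannot_link:
        if (i == a and labels[b] == c) or (i == b and labels[a] == c):
            return True
    return False

def cop_kmeans(X, k, must_link=(), cannot_link=(), n_iter=50, seed=0):
    """Greedy constrained K-means: each point is assigned to the nearest
    centroid whose cluster violates no constraint; a point with no
    feasible cluster keeps the label -1."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        new_labels = np.full(len(X), -1)
        for i, x in enumerate(X):
            # Try clusters from nearest to farthest centroid.
            for c in np.argsort(((centroids - x) ** 2).sum(axis=1)):
                if not violates(i, c, new_labels, must_link, cannot_link):
                    new_labels[i] = c
                    break
        if np.array_equal(new_labels, labels):
            break  # assignments stabilized
        labels = new_labels
        for c in range(k):  # recompute centroids of non-empty clusters
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels
```

For example, with points {(0, 0), (0.1, 0), (5, 5), (5.1, 5)}, a must-link between the first two and a cannot-link between the first and third recovers the two intuitive clusters while honoring both constraints.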

Main Contributions

We propose Deep Goal-Oriented Clustering (DGC), a probabilistic model capable of using indirect, but informative, side-information to form a pertinent clustering strategy. Specifically: 1) we combine supervision via side-information and unsupervised modeling of the data structure in a probabilistic manner; 2) we make minimal assumptions on the form the supervised side-information might take, and assume no explicit correspondence between the side-information and the clusters; 3) we train DGC end-to-end, so that the model simultaneously learns from the available side-information while forming a desired clustering strategy.

2. RELATED WORK

Most related work in the literature falls into two categories: 1) methods that utilize extra side-information to form better, less ambiguous clusters, where the side-information must be provided beforehand and cannot be learned; and 2) methods that can learn from provided labels to lessen the ambiguity of the formed clusters, but rely on the cluster assumption (detailed below) and usually assume that the provided labels are discrete, ground-truth labels. The latter excludes the possibility of learning from indirectly related, but informative, side-information. We propose a unified framework that allows using informative side-information, directly or indirectly, to arrive at better-formed clusters. Latent space sharing among different tasks has been studied in a VAE setting (Le et al., 2018; Xie & Ma, 2019). In this work we utilize this latent-space-sharing framework, but instead focus on clustering with the aid of general, indirect side-information.

Side-information as constraints

Using side-information to form better clusters is well studied. Wagstaff & Cardie (2000) consider both must-link and cannot-link constraints in the context of K-means clustering. Motivated by image segmentation, Orbanz & Buhmann (2007) proposed a probabilistic model that can incorporate must-link constraints. Khashabi et al. (2015) proposed a nonparametric Bayesian hierarchical model to incorporate noisy side-information as soft constraints. Vu et al. (2019) utilize constraints and cluster labels as side-information. Mazumdar & Saha (2017) give complexity bounds when provided with an oracle that can be queried for side-information. Wasid & Ali (2019) incorporate side-information through the use of fuzzy sets. In supervised clustering, the side-information is the a priori known complete clustering of the training set, which is used as a constraint to learn a mapping between the data and the given clustering (Finley & Joachims, 2005). In contrast, we do not assume that constraints are given a priori; instead, we let the side-information guide the clustering procedure during training.

Semi-supervised methods & the cluster assumption

Semi-supervised clustering approaches generally assume access to only a fraction of the true cluster labels. Via constraints such as those discussed above, the available labels are propagated to unlabeled data, which can help mitigate the ambiguity in choosing among different clustering strategies (Bair, 2013). The generative approach to semi-supervised learning introduced in Kingma et al. (2014) is based on a hierarchical generative model with two variational layers. Although originally intended for semi-supervised classification, it can also be used for clustering. If used for clustering, however, it must strictly rely on the cluster assumption, which states that there exists a direct correspondence between labels/classes and clusters (Färber et al., 2010; Chapelle et al., 2006). We show that this approach is a special case of our framework without the probabilistic ensemble component (see Sec. 4.2) and under certain distributional assumptions. Sansone et al. (2016) proposed a method for joint classification and clustering that addresses the stringent cluster assumption most approaches make by modeling the cluster indices and the class labels separately, underscoring the possibility that each cluster may consist of multiple class labels. Deploying a mixture of factor analysers as the underlying probabilistic framework, they also used a variational approximation to maximize the joint log-likelihood. In this work, we generalize the notion of learning from discrete, ground-truth labels to learning from indirect, but informative, side-information. We make virtually no assumptions on the form of y or its relation to the clusters, which makes our approach applicable in more general settings.

3.1. BACKGROUND - VARIATIONAL DEEP EMBEDDING

The starting point for DGC is the variational auto-encoder (VAE) (Kingma & Welling, 2014) with the prior distribution over the latent code chosen to be a Gaussian mixture, introduced in Jiang et al. (2017) as VaDE. We briefly review the generative VaDE approach here to provide the background for DGC. We adopt the notation that lower-case letters denote samples from their corresponding random variables.
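To make the VaDE background concrete, the sketch below samples from a Gaussian-mixture-prior generative model and computes the cluster posterior p(c | z), which is proportional to pi_c N(z; mu_c, sigma_c^2 I) and underlies VaDE-style clustering. The dimensions, the linear stand-in decoder, and all parameter values are illustrative assumptions, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and GMM-prior parameters (assumed, not from the paper)
K, D_z, D_x = 3, 2, 4                 # clusters, latent dim, observed dim
pi = np.full(K, 1.0 / K)              # mixture weights pi_c
mu = rng.normal(size=(K, D_z))        # cluster means mu_c
sigma = np.ones((K, D_z))             # cluster standard deviations sigma_c
W = rng.normal(size=(D_z, D_x))       # stand-in linear "decoder"

def sample(n):
    """VaDE-style generative process: c ~ Cat(pi); z ~ N(mu_c, sigma_c^2 I);
    x ~ p(x | z), here a linear-Gaussian decoder for illustration."""
    c = rng.choice(K, size=n, p=pi)
    z = mu[c] + sigma[c] * rng.normal(size=(n, D_z))
    x = z @ W + 0.1 * rng.normal(size=(n, D_x))
    return c, z, x

def cluster_posterior(z):
    """Responsibilities p(c | z) proportional to pi_c * N(z; mu_c, sigma_c^2 I),
    computed in log space for numerical stability."""
    log_p = np.log(pi) - 0.5 * (
        ((z[:, None, :] - mu) / sigma) ** 2
        + 2.0 * np.log(sigma) + np.log(2.0 * np.pi)
    ).sum(axis=-1)
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)
```

Clustering a new observation then amounts to encoding x into a latent code z and taking the argmax of p(c | z) over the K mixture components.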