DEEP GOAL-ORIENTED CLUSTERING

Abstract

Clustering and prediction are two primary tasks in the fields of unsupervised and supervised machine learning, respectively. Although much of the recent progress in machine learning has centered on these two tasks, the interdependent, mutually beneficial relationship between them is rarely explored. One could reasonably expect that appropriately clustering the data would aid the downstream prediction task and, conversely, that better prediction performance on the downstream task could inform a more appropriate clustering strategy. In this work, we focus on the latter direction of this mutually beneficial relationship. To this end, we introduce Deep Goal-Oriented Clustering (DGC), a probabilistic framework that clusters the data by jointly using supervision via side-information and unsupervised modeling of the inherent data structure in an end-to-end fashion. We demonstrate the effectiveness of our model on a range of datasets, achieving prediction accuracies comparable to the state of the art while, more importantly in our setting, simultaneously learning congruent clustering strategies. We also apply DGC to a real-world breast cancer dataset and show that the discovered clusters carry clinical significance.

1. INTRODUCTION

Many of the advances in supervised learning over the past decade are due to the development of deep neural networks (DNNs), a class of hierarchical function approximators capable of learning complex input-output relationships. Prime examples of such advances include image recognition (Krizhevsky et al., 2012), speech recognition (Nassif et al., 2019), and neural translation (Bahdanau et al., 2015). However, with the explosion in the size of modern datasets, it becomes increasingly unrealistic to manually annotate all available data for training. Hence, understanding inherent data structure through unsupervised clustering is of increasing importance. Several approaches to applying DNNs to unsupervised clustering have been proposed in the past few years (Caron et al., 2018; Law et al., 2017; Xie et al., 2016; Shaham et al., 2018), centering around the idea that the input space in which traditional clustering algorithms operate matters; learning this space from data is therefore desirable, in particular for complex data. Despite the improvements these approaches have achieved on benchmark clustering datasets, the ill-defined, ambiguous nature of clustering remains a challenge. Such ambiguity is particularly problematic in scientific discovery, sometimes requiring researchers to choose among different, but potentially equally meaningful, clustering results when little information is available a priori (Ronan et al., 2016). When facing such ambiguity, using direct side-information to reduce clustering ambivalence has proven to be a fruitful direction (Xing et al., 2002; Khashabi et al., 2015; Jin et al., 2013). Direct side-information is usually available in the form of constraints, such as must-link and cannot-link constraints (Wang & Davidson, 2010; Wagstaff & Cardie, 2000), or via a pre-conceived notion of similarity (Xing et al., 2002).
However, defining such direct side-information requires human expertise, which can be labor-intensive and vulnerable to labeling errors. In contrast, indirect but informative side-information may exist in abundance and may not require human expertise to obtain. Being able to learn from such indirect information to form a congruous clustering strategy is thus immensely valuable.
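To make the notion of direct side-information concrete, the following is a minimal illustrative sketch (not part of DGC) of the constraint check used in constrained clustering algorithms in the style of COP-KMeans (Wagstaff & Cardie, 2000): an assignment of a point to a cluster is rejected if it would separate a must-link pair or join a cannot-link pair. All function and variable names here are our own.

```python
def violates_constraints(point, cluster, assignments, must_link, cannot_link):
    """Return True if assigning `point` to `cluster` breaks any constraint.

    assignments: dict mapping already-assigned point index -> cluster id
    must_link / cannot_link: lists of index pairs (a, b)
    """
    for a, b in must_link:
        # A must-link partner already placed in a different cluster.
        other = b if point == a else a if point == b else None
        if other is not None and other in assignments and assignments[other] != cluster:
            return True
    for a, b in cannot_link:
        # A cannot-link partner already placed in the same cluster.
        other = b if point == a else a if point == b else None
        if other is not None and assignments.get(other) == cluster:
            return True
    return False


# Example: point 0 is in cluster 0, point 1 in cluster 1.
assignments = {0: 0, 1: 1}
must_link = [(0, 2)]    # point 2 must join point 0
cannot_link = [(1, 3)]  # point 3 must avoid point 1's cluster
print(violates_constraints(2, 1, assignments, must_link, cannot_link))  # True
print(violates_constraints(2, 0, assignments, must_link, cannot_link))  # False
```

DGC sidesteps exactly this kind of hand-specified constraint set: rather than enumerating pairwise constraints, it learns from indirect side-information end-to-end.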

Main Contributions

We propose Deep Goal-Oriented Clustering (DGC), a probabilistic model that is capable of using indirect, but informative, side-information to form a pertinent clustering strategy. Specifically: 1) We combine supervision via side-information and unsupervised data structure modeling in a probabilistic manner; 2) We make minimal assumptions on what form the

