CONSENSUS CLUSTERING WITH UNSUPERVISED REPRESENTATION LEARNING

Abstract

Recent advances in deep clustering and unsupervised representation learning are based on the idea that different views of an input image (generated through data augmentation techniques) should either be close in the representation space or receive similar cluster assignments. In this work, we leverage this idea together with ensemble learning to perform clustering and representation learning. Ensemble learning is widely used in the supervised setting but has not yet been practical in deep clustering. Previous works on ensemble learning for clustering neither operate on the feature space nor learn features. We propose a novel ensemble learning algorithm dubbed Consensus Clustering with Unsupervised Representation Learning (ConCURL), which learns representations by creating a consensus on multiple clustering outputs. Specifically, we generate a cluster ensemble using random transformations of the embedding space, and define a consensus loss function that measures the disagreement among the constituents of the ensemble. Diverse ensembles thus minimize this loss function in a synergistic way, leading to better representations that work with all cluster ensemble constituents. ConCURL is easy to implement and integrate into any representation learning or deep clustering block. It outperforms state-of-the-art methods on various computer vision datasets; specifically, it beats the closest state-of-the-art method by 5.9 percent on the ImageNet-10 dataset and by 18 percent on the ImageNet-Dogs dataset in terms of clustering accuracy. We further shed light on the under-studied issue of overfitting in clustering and show that our method overfits less than existing methods, and thereby generalizes better to new data samples.

1. INTRODUCTION

Supervised learning algorithms have shown great progress recently, but generally require large amounts of labeled data. However, in many domains (e.g., advertising, social platforms), most of the available data are unlabeled, and manually labeling them is a labor-, time-, and cost-intensive task (Xiao et al., 2015; Deshmukh, 2019; Mintz et al., 2009; Blum & Mitchell, 1998). On the other hand, clustering algorithms do not need labeled data to group similar data points into clusters. Popular clustering algorithms include k-means, hierarchical clustering, DBSCAN (Ester et al., 1996), and spectral clustering, and the usefulness of each algorithm varies with the application. In this work, we deal with the clustering of images. Traditional clustering approaches rely on hand-crafted features, to which off-the-shelf clustering algorithms are applied. However, hand-crafted features may not be optimal and do not scale to large real-world datasets (Wu et al., 2019). Advances in deep learning have enabled end-to-end learning of rich representations for supervised learning. In contrast, simultaneously learning the feature space while clustering can lead to degenerate solutions, which until recently limited end-to-end implementations of clustering with representation learning (Caron et al., 2018). Recent deep clustering works take several approaches to address this issue, such as alternating pseudo cluster assignments with pseudo-supervised training, comparing predictions with their own high-confidence assignments (Caron et al., 2018; Asano et al., 2019; Xie et al., 2016; Wu et al., 2019), and maximizing mutual information between predictions of positive pairs (Ji et al., 2019). Although these methods show impressive performance on challenging datasets, we believe that taking advantage of rich ideas from ensemble learning will further enhance deep clustering methods.
Ensemble learning methods train a variety of learners and build a meta-learner by combining the predictions of the individual learners (Dietterich, 2000; Breiman, 1996; Freund et al., 1996). In practice, they have been used heavily in the supervised learning setting. Ensemble learning has also found its place in clustering, e.g., the knowledge reuse framework (Strehl & Ghosh, 2002), where a consensus algorithm is applied to constituent cluster partitions to generate an updated partition that clusters the data better than any component partition individually. However, the knowledge reuse framework and much of the consensus clustering literature that followed (Fern & Brodley, 2003; Fred & Jain, 2005; Topchy et al., 2005) do not make use of the underlying features used to generate the ensemble. We propose the use of consensus clustering as a way to extend ensemble methods to unsupervised representation learning. In particular, we define a 'disagreement' measure among the constituents of the ensemble. The key motivation is that the diversity of the ensemble drives the minimization of the disagreement measure in a synergistic way, thereby leading to better representations. We propose Consensus Clustering with Unsupervised Representation Learning (ConCURL), and our main contributions are:
1. A novel ensemble learning algorithm which learns representations by creating a consensus on multiple clustering outputs generated by applying random transformations to the embeddings.
2. Our method outperforms current state-of-the-art clustering algorithms on popular computer vision datasets based on clustering metrics (A.4).
3. Even though no labeled data are available while learning representations, clustering may still be prone to overfitting to the "training data."
As stated in Bubeck & Von Luxburg (2007), in clustering we generally assume that the finite data set has been sampled from some underlying space, and the goal is to approximate the true partition of that underlying space rather than to find the best partition of the given finite data set. Hence, to check the generalizability of the proposed method, we also evaluate our models on "test data": data that was not available during training/representation learning. Our method generalizes better than state-of-the-art methods (i.e., it outperforms the other algorithms when evaluated on the test set).
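To make the disagreement idea concrete, the following is a minimal NumPy sketch, not the exact ConCURL objective: each ensemble member clusters a randomly transformed copy of the embeddings, and disagreement is measured as the average pairwise KL divergence between the members' soft cluster assignments. The random linear projections, the softmax assignment rule, and the choice of KL as the disagreement measure are illustrative assumptions.

```python
import numpy as np

def soft_assign(z, centers, temp=0.5):
    # Soft cluster assignments: softmax over negative squared distances.
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def consensus_loss(z, n_views=4, n_clusters=3, dim_out=8, seed=0):
    """Disagreement among clusterings of randomly transformed embeddings.

    Each 'view' projects the embeddings with a random matrix, soft-clusters
    the result, and the loss is the average pairwise KL divergence between
    the views' soft assignments (a stand-in disagreement measure)."""
    rng = np.random.default_rng(seed)
    assigns = []
    for _ in range(n_views):
        W = rng.normal(size=(z.shape[1], dim_out))   # random transformation
        zt = z @ W
        centers = zt[rng.choice(len(zt), n_clusters, replace=False)]
        assigns.append(soft_assign(zt, centers))
    loss, pairs = 0.0, 0
    for i in range(n_views):
        for j in range(n_views):
            if i != j:
                p, q = assigns[i], assigns[j]
                loss += (p * np.log((p + 1e-9) / (q + 1e-9))).sum(axis=1).mean()
                pairs += 1
    return loss / pairs

# Tiny demo: two well-separated blobs of 16-d embeddings.
rng = np.random.default_rng(1)
z = np.vstack([rng.normal(0, 0.1, (20, 16)), rng.normal(3, 0.1, (20, 16))])
print(consensus_loss(z))  # non-negative scalar; lower means more agreement
```

In the actual method the loss is back-propagated through the embeddings, so representations are pushed toward configurations on which all ensemble members agree.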

2. RELATED WORK

Clustering is a ubiquitous task and has been actively used in many different scientific and practical pursuits, such as detecting genes from microarray data (Frey & Dueck, 2007), clustering faces (Rodriguez & Laio, 2014), and segmentation in medical imaging to support diagnosis (Masulli & Schenone, 1999). We refer interested readers to these excellent sources for a survey of these uses: Jain et al. (1999); Liao (2005); Xu & Wunsch (2005); Nugent & Meila (2010).

Clustering with Deep Learning: In their influential work, Caron et al. (2018) show that it is possible to train deep convolutional neural networks with pseudo labels that are generated by a clustering algorithm (DeepCluster). More precisely, in DeepCluster, previous versions of the representations are used to assign pseudo labels to the data using an off-the-shelf clustering algorithm such as k-means. These pseudo labels are then used to improve the learned representation of the data by minimizing a supervised loss. Several more methods have been proposed along the same lines. For example, the Gaussian ATtention network for image clustering (GATCluster) (Niu et al., 2020) comprises four self-learning tasks with the constraints of transformation invariance, separability maximization, entropy analysis, and attention mapping. Training is performed in two distinct steps, similar to Caron et al. (2018): the first step computes pseudo targets for a large batch of data, and the second step trains the model in a supervised way using the pseudo targets. Both DeepCluster and GATCluster use k-means to generate pseudo labels, which may not scale well. Wu et al. (2019) propose Deep Comprehensive Correlation Mining (DCCM), where discriminative features are learned by taking advantage of the correlations in the data using pseudo-label supervision and triplet mutual information among features. However, DCCM may be susceptible to trivial solutions (Niu et al., 2020).
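The pseudo-label alternation used by DeepCluster can be sketched schematically. The toy code below stands in a linear map for the deep network and updates only a softmax classifier head; the tiny Lloyd's-iteration k-means and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def kmeans_labels(x, k, iters=10, seed=0):
    # Minimal Lloyd's algorithm returning a pseudo label per point.
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels

def deepcluster_epoch(x, W, V, k=2, lr=0.1):
    """One DeepCluster-style alternation (toy linear encoder W, head V):
    (1) embed the data, (2) assign k-means pseudo labels,
    (3) take a supervised cross-entropy gradient step on the head."""
    feats = x @ W                        # (1) current representation
    labels = kmeans_labels(feats, k)     # (2) pseudo labels
    logits = feats @ V
    logits -= logits.max(1, keepdims=True)
    p = np.exp(logits); p /= p.sum(1, keepdims=True)
    grad = feats.T @ (p - np.eye(k)[labels]) / len(x)
    V -= lr * grad                       # (3) pseudo-supervised update
    return -np.log(p[np.arange(len(x)), labels] + 1e-9).mean()

# Two well-separated blobs; the pseudo-supervised loss should shrink.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2, 0.3, (30, 5)), rng.normal(2, 0.3, (30, 5))])
W = rng.normal(size=(5, 4)) * 0.3
V = rng.normal(size=(4, 2))
losses = [deepcluster_epoch(x, W, V) for _ in range(20)]
print(losses[0], losses[-1])
```

In the real method the encoder itself is a deep network updated by backpropagation, and the k-means step runs periodically over the whole dataset rather than every batch.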
Invariant Information Clustering (IIC) (Ji et al., 2019) maximizes the mutual information between the class assignments of two different views of the same image (paired samples) in order to learn representations that preserve what is common between the views while discarding instance-specific details. Ji et al. (2019) argue that the presence of an entropy term in the mutual information plays an important role in avoiding degenerate solutions. However, a large batch size is needed.
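The IIC objective itself is compact enough to sketch: from the soft assignments of two paired views one forms a joint distribution over cluster pairs and computes its mutual information. This is a minimal rendering of the published objective; the variable names are ours.

```python
import numpy as np

def iic_mutual_information(p1, p2):
    """Mutual information between cluster assignments of paired views:
    build a joint distribution from soft assignments and compute
    I(z; z') = sum_ij P_ij * log(P_ij / (P_i * P_j))."""
    joint = p1.T @ p2 / len(p1)       # (k, k) joint over cluster pairs
    joint = (joint + joint.T) / 2     # symmetrize, as in IIC
    pi = joint.sum(1, keepdims=True)  # row marginal
    pj = joint.sum(0, keepdims=True)  # column marginal
    eps = 1e-9
    return (joint * (np.log(joint + eps)
                     - np.log(pi + eps)
                     - np.log(pj + eps))).sum()

# Identical one-hot assignments over 3 balanced clusters:
# MI equals the assignment entropy, log 3.
p = np.eye(3)[np.array([0, 1, 2, 0, 1, 2])]
print(iic_mutual_information(p, p))  # ≈ log(3) ≈ 1.0986
```

Maximizing this quantity rewards confident (low conditional entropy) and balanced (high marginal entropy) assignments, which is why the entropy term helps avoid degenerate all-in-one-cluster solutions; the marginals, however, are only well estimated with large batches.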

