CONSENSUS CLUSTERING WITH UNSUPERVISED REPRESENTATION LEARNING

Abstract

Recent advances in deep clustering and unsupervised representation learning are based on the idea that different views of an input image (generated through data augmentation techniques) should be close in the representation space or share the same cluster assignment. In this work, we combine this idea with ensemble learning to perform clustering and representation learning jointly. Ensemble learning is widely used in the supervised setting but has not yet been practical in deep clustering: previous works on ensemble learning for clustering neither operate on the feature space nor learn representations. We propose a novel ensemble learning algorithm dubbed Consensus Clustering with Unsupervised Representation Learning (ConCURL), which learns representations by creating a consensus over multiple clustering outputs. Specifically, we generate a cluster ensemble using random transformations of the embedding space, and define a consensus loss function that measures the disagreement among the constituents of the ensemble. Minimizing this loss over a diverse ensemble acts synergistically, leading to better representations that are consistent with all ensemble constituents. ConCURL is easy to implement and to integrate into any representation learning or deep clustering block. ConCURL outperforms all state-of-the-art methods on various computer vision datasets. Specifically, we beat the closest state-of-the-art method by 5.9 percent on the ImageNet-10 dataset, and by 18 percent on the ImageNet-Dogs dataset, in terms of clustering accuracy. We further shed some light on the under-studied overfitting issue in clustering and show that our method does not overfit as much as existing methods, and thereby generalizes better to new data samples.
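The two ingredients named in the abstract, an ensemble of cluster assignments produced by random transformations of the embedding space and a consensus loss that penalizes their disagreement, can be sketched as below. This is a minimal illustrative sketch in NumPy, not the paper's exact formulation: the ensemble size, temperature, prototype matrix, and the choice of pairwise cross-entropy as the disagreement measure are all our assumptions here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consensus_loss(z, prototypes, num_views=4, tau=0.5, seed=0):
    """Sketch of a consensus loss over a cluster ensemble.

    z          : (n, d) image embeddings.
    prototypes : (k, d) cluster prototype vectors.
    Each ensemble member applies a random linear transformation to the
    embeddings before computing soft assignments to the prototypes; the
    loss is the average pairwise cross-entropy between members' soft
    assignments, i.e. their disagreement.
    """
    rng = np.random.default_rng(seed)
    d = z.shape[1]
    assignments = []
    for _ in range(num_views):
        # Random transformation of the embedding space (one ensemble member).
        T = rng.normal(size=(d, d)) / np.sqrt(d)
        logits = (z @ T) @ prototypes.T / tau
        assignments.append(softmax(logits))
    loss, n_pairs = 0.0, 0
    for i in range(num_views):
        for j in range(num_views):
            if i != j:
                # Cross-entropy of member j's assignment under member i's.
                loss += -np.mean(
                    np.sum(assignments[i] * np.log(assignments[j] + 1e-9), axis=1)
                )
                n_pairs += 1
    return loss / n_pairs
```

In the actual method this loss would be minimized jointly with a representation learning objective, so that the backbone produces embeddings on which all ensemble members agree.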

1. INTRODUCTION

Supervised learning algorithms have shown great progress recently, but generally require a lot of labeled data. However, in many domains (e.g., advertising, social platforms, etc.), most of the available data are unlabeled, and manually labeling them is a very labor-, time-, and cost-intensive task (Xiao et al., 2015; Deshmukh, 2019; Mintz et al., 2009; Blum & Mitchell, 1998). On the other hand, clustering algorithms do not need labeled data to group similar data points into clusters. Popular clustering algorithms include k-means, hierarchical clustering, DBSCAN (Ester et al., 1996), and spectral clustering, and the usefulness of each algorithm varies with the application. In this work, we deal with the clustering of images. Traditional clustering approaches apply out-of-the-box clustering algorithms to hand-crafted features. However, hand-crafted features may not be optimal, and they do not scale to large real-world datasets (Wu et al., 2019). Advances in deep learning have enabled end-to-end learning of rich representations for supervised learning. On the other hand, simultaneously learning the feature space while clustering leads to degenerate solutions, which until recently limited end-to-end implementations of clustering with representation learning (Caron et al., 2018). Recent deep clustering works take several approaches to address this issue, such as alternating pseudo cluster assignments and pseudo-supervised training, comparing the predictions with their own high-confidence assignments (Caron et al., 2018; Asano et al., 2019; Xie et al., 2016; Wu et al., 2019), and maximizing mutual information between the predictions of positive pairs (Ji et al., 2019). Although these methods show impressive performance on challenging datasets, we believe that taking advantage of rich ideas from ensemble learning for clustering with representation learning will enhance the performance of deep clustering methods.
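To make concrete the claim that clustering needs no labels, here is a minimal Lloyd's-algorithm k-means in NumPy, the simplest of the out-of-the-box algorithms listed above. The function name, initialization scheme, and iteration count are our illustrative choices; a production implementation (e.g., scikit-learn's) would add smarter initialization and convergence checks.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Group the rows of X into k clusters using only the data itself.

    Returns (labels, centers): a cluster index per point and the final
    cluster centroids. No ground-truth labels are involved at any step.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

Applied to raw pixels or hand-crafted descriptors this works only as well as those features allow, which is exactly the limitation that motivates learning the feature space and the clustering jointly.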

