CONTRASTIVE HIERARCHICAL CLUSTERING

Abstract

Deep clustering has been dominated by flat clustering models, which split a dataset into a predefined number of groups. Although recent methods achieve extremely high similarity with the ground truth on popular benchmarks, the information contained in a flat partition is limited. In this paper, we introduce CoHiClust, a Contrastive Hierarchical Clustering model based on deep neural networks, which can be applied to large-scale image data. By employing a self-supervised learning approach, CoHiClust distills the base network into a binary tree without access to any labeled data. The hierarchical clustering structure can be used to analyze the relationships between clusters as well as to measure the similarity between data points. Experiments performed on typical image benchmarks demonstrate that CoHiClust generates a reasonable structure of clusters, which is consistent with our intuition and image semantics. Moreover, by applying the proposed pruning strategy, we can restrict the hierarchy to the requested number of clusters (leaf nodes) and obtain clustering accuracy that outperforms existing hierarchical baselines.

1. INTRODUCTION

Clustering, a fundamental branch of unsupervised learning, is often one of the first steps in data analysis, and it finds applications in anomaly detection Barai & Dey (2017), personalized recommendations Zhang et al. (2014), and bioinformatics Lakhani et al. (2015). Since it does not use any information about class labels, representation learning becomes an integral part of deep clustering methods. Initial approaches use representations taken from pre-trained models Guérin et al. (2017); Naumov et al. (2021) or employ auto-encoders in joint training of the representation and the clustering model Guo et al. (2017a); Mautz et al. (2019). More recent works designed for image data frequently follow the self-supervised learning principle, where the representation is trained on pairs of similar images automatically generated by data augmentations Li et al. (2021b); Dang et al. (2021). Since the augmentations used for image data are class-invariant, the latter techniques often obtain very high similarity with the ground-truth classes. However, we should be careful when comparing clustering techniques only by inspecting their accuracy with respect to ground-truth classes, because the primary goal of clustering is to deliver information about the data, not to perform classification.

Most works in the area of deep clustering focus on producing flat partitions with a predefined number of groups. Although hierarchical clustering gained notable attention in classical machine learning and has been frequently applied to real-life problems Zou et al. (2020); Śmieja et al. (2014), its role has been drastically marginalized in the era of deep learning. In hierarchical clustering, the exact number of clusters does not have to be specified, because we can inspect the partition at various tree levels. Moreover, we can analyze the clusters' relationships, e.g. by finding superclusters or measuring the distance between groups in the hierarchy. These advantages make hierarchical clustering an excellent tool for analyzing complex data. However, in order to take full advantage of hierarchical clustering, it is necessary to create an appropriate image representation, which is possible thanks to the use of deep neural networks. To the best of our knowledge, DeepECT Mautz et al. (2019; 2020) is the only hierarchical clustering model trained jointly with the neural network.
Nevertheless, this method has not been examined on large image datasets, which appear in practical applications. To fill this gap, we introduce CoHiClust (Contrastive Hierarchical Clustering), which creates a hierarchy of clusters and can be applied to large image data. CoHiClust uses a neural network to generate a high-level representation of the data, which is then distilled into a tree hierarchy by applying a projection head, see Figure 2. The whole framework is trained jointly in an end-to-end manner, without labels, using our novel contrastive loss and automatically generated data augmentations, following the self-supervised learning principle. The constructed hierarchy uses the structure of a binary tree, where the sequence of decisions made by the internal nodes determines the final assignment to clusters (leaf nodes). Consequently, similar examples are processed longer by the same path than dissimilar ones. By counting the number of edges needed to connect two clusters (leaves), we obtain a natural similarity measure between data points. Although CoHiClust assumes a predefined tree structure with a fixed height, we introduce a pruning mechanism, which removes the least informative leaf nodes until the requested number of leaves is obtained. In contrast to typical pruning strategies or tree cuts, where neighboring leaves are only merged into their ancestor node, we allow data points to be reassigned between clusters by finetuning the whole model, which further improves the topology of the tree.

Figure 1: A tree hierarchy generated by CoHiClust for F-MNIST (images in the nodes denote mean images in each sub-tree). The right sub-tree contains clothes, while the other items (shoes and bags) are placed in the left branch. Looking at the lowest hierarchy level, we have clothes with long sleeves grouped in neighboring leaves. The same holds for clothes with designs.
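The routing and leaf-distance mechanisms described above can be illustrated with a minimal sketch. The code below is not the paper's implementation; it only assumes that each internal node of a perfect binary tree of height H outputs a probability of routing an example to its left child (heap-ordered node indexing is our own convention here), so that the posterior of each leaf (cluster) is the product of the decision probabilities along the root-to-leaf path, and that the dissimilarity between two clusters is the number of tree edges connecting their leaves.

```python
import numpy as np

def leaf_posteriors(decision_probs):
    """Leaf (cluster) posteriors for one example routed through a soft
    binary tree.  decision_probs[i] is P(go left) at internal node i,
    with nodes in heap order (children of node i are 2i+1 and 2i+2).
    Returns the 2**height leaf probabilities, ordered left to right."""
    height = int(np.log2(len(decision_probs) + 1))
    posteriors = np.ones(2 ** height)
    for leaf in range(2 ** height):
        node = 0
        # read the root-to-leaf path from the leaf index, one bit per level
        for level in reversed(range(height)):
            go_left = ((leaf >> level) & 1) == 0
            p_left = decision_probs[node]
            posteriors[leaf] *= p_left if go_left else 1.0 - p_left
            node = 2 * node + (1 if go_left else 2)
    return posteriors

def tree_distance(leaf_i, leaf_j):
    """Number of edges between two leaves of a perfect binary tree, with
    leaves indexed left to right at the same depth.  Equals twice the
    number of levels separating the leaves from their lowest common
    ancestor, so it can serve as a dissimilarity between clusters."""
    return 2 * (leaf_i ^ leaf_j).bit_length()
```

For instance, with uniform decisions `[0.5, 0.5, 0.5]` each of the four leaves receives posterior 0.25; sibling leaves are 2 edges apart, while the outermost leaves of a height-3 tree are 6 edges apart.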
Observe that CoHiClust assigned white-colored t-shirts and dresses to the same cluster, while trousers are in a separate one. Small shoes such as sneakers or sandals are considered similar (neighboring leaves) and distinct from ankle boots. In summary, CoHiClust is able to retrieve meaningful information about image semantics, which is complementary to the ground-truth classification.

The proposed model has been examined on various image datasets and compared with both hierarchical and flat clustering baselines. By analyzing the constructed hierarchies, we show that CoHiClust generates a structure of clusters that is consistent with our intuition and image semantics, see Figure 1 for an illustration and discussion. Our analysis is supported by a quantitative study, which shows that CoHiClust gives higher similarity with the ground-truth partition than available hierarchical baselines, see Tables 1, 2 and 3. Moreover, it is among the three best-performing methods when compared with flat clustering models, see Table 3.

Our main contributions are summarized as follows:
• We introduce a hierarchical clustering model, CoHiClust, which converts the base neural network into a binary tree. The model is trained effectively with no supervision using our novel hierarchical contrastive loss applied to self-generated data augmentations.
• We implement a pruning strategy, which not only produces a fixed number of leaves but also improves the constructed hierarchy.
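The pruning idea in the second contribution can likewise be sketched. The snippet below is only an illustration of the greedy step, not the paper's procedure: we assume (hypothetically) that each leaf's informativeness is summarized by the average probability mass it receives over the dataset, and the lightest leaf is dropped until the requested number remains; the subsequent finetuning that CoHiClust performs, which lets examples be reassigned between the surviving clusters, is not modeled here.

```python
def prune_leaves(avg_leaf_mass, target_leaves):
    """Greedily remove the leaf with the least average probability mass
    until target_leaves leaves remain; return surviving leaf indices.
    (Illustrative sketch only; CoHiClust additionally finetunes the whole
    model after pruning so points can move between remaining clusters.)"""
    alive = list(range(len(avg_leaf_mass)))
    while len(alive) > target_leaves:
        weakest = min(alive, key=lambda leaf: avg_leaf_mass[leaf])
        alive.remove(weakest)
    return alive
```

For example, pruning four leaves with average masses `[0.4, 0.1, 0.3, 0.2]` down to two keeps the leaves with masses 0.4 and 0.3.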


