META-K: TOWARDS SELF-SUPERVISED PREDICTION OF NUMBER OF CLUSTERS

Abstract

Data clustering is a well-known unsupervised learning approach. Despite recent advances in clustering with deep neural networks, determining the number of clusters without any information about the given dataset remains an open problem. Classical approaches based on data statistics require a data scientist to analyze the dataset manually in order to estimate the probable number of clusters. In this work, we propose a new method for the unsupervised prediction of the number of clusters in a dataset, given only the data without any labels. We evaluate our method extensively on randomly generated datasets from the scikit-learn package and on multiple computer vision datasets, and show that it determines the number of classes in a dataset effectively without any supervision.

1. INTRODUCTION

Clustering is an important task in machine learning with a wide range of applications (Lung et al. (2004); Aminzadeh & Chatterjee (1984); Gan et al. (2007)). Clustering often consists of two steps: feature extraction and the clustering itself. There have been numerous works on clustering (Xu & Tian (2015)), and among the proposed algorithms, K-Means (Bock (2007)) is renowned for its simplicity and performance. Despite its popularity, K-Means has several shortcomings, discussed in Ortega et al. (2009) and Shibao & Keyun (2007). In particular, K-Means' performance degrades as the dimensionality of the input data increases (Prabhu & Anbazhagan (2011)), a phenomenon known as the curse of dimensionality (Bellman (2015)). Dimensionality reduction and feature transformation methods have been used to mitigate this effect. These methods map the original data into a new feature space in which the transformed data points are easier to separate and cluster (Min et al. (2018)). Examples of such transformation methods are PCA (Wold et al. (1987)), kernel methods (Hofmann et al. (2008)), and spectral methods (Ng et al. (2002)). Although these methods are effective, a highly complex latent structure in the data can still challenge them (Saul et al., 2006; Min et al., 2018). Due to recent advances in deep neural networks (Liu et al. (2017)) and their inherent ability to learn non-linear transformations, these architectures have the potential to replace classical dimensionality reduction methods. In the research field of deep clustering, popularized by the seminal paper "Unsupervised Deep Embedding for Clustering Analysis" (Xie et al. (2016)), a deep neural network is adopted as the feature extractor and combined with a clustering algorithm; a dedicated loss function is defined to update the model.
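As a brief illustration of this classical pipeline (not the method proposed in this paper), the following sketch projects high-dimensional data with PCA before running K-Means; the dataset sizes and the choice of two components are arbitrary, chosen only for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# High-dimensional toy data: 500 points in 50 dimensions, 4 true clusters.
X, y = make_blobs(n_samples=500, n_features=50, centers=4, random_state=0)

# Classical pipeline: reduce dimensionality first, then cluster in the new space.
X_low = PCA(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_low)

print(X_low.shape)
print(len(set(labels)))
```

Note that the number of clusters (`n_clusters=4`) must be supplied by hand here, which is exactly the limitation the paper addresses.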
Deep clustering methods typically take k, the number of clusters, as a hyper-parameter. In real-world scenarios, where datasets are not labeled, assigning a wrong value to this parameter can reduce the overall accuracy of the model. Meta-learning, a framework that allows a model to use information from past tasks to learn a new task quickly or with little data, has been adopted by a handful of papers (Ferrari & de Castro (2012); Ferrari & De Castro (2015); Garg & Kalai (2018); Kim et al. (2019); Jiang & Verma (2019)) to improve the performance of clustering tasks. Closest to our work is the approach proposed by Garg & Kalai (2018), which tries to predict the number of clusters in K-Means clustering using meta-information. To solve the same issue, we propose Meta-k, a gradient-based method for finding the optimal number of clusters and an attempt at a self-supervised approach to clustering. Our work is based on the observation that a network can take input points and learn parameters to predict the best number of clusters, k. The predicted k, along with the points in the dataset, is the input to the K-Means clustering algorithm. For the resulting clusters, the silhouette score (Rousseeuw (1987)) is calculated. Using this metric as the reward signal, we compute the policy gradient to update the controller. As a result, in the next iteration, the controller assigns higher probabilities to values of k that lead to better (closer to 1) silhouette scores.
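The silhouette-as-reward loop can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the paper's controller is a neural network, whereas here a bare logit vector over candidate values of k serves as the smallest policy that still admits a REINFORCE-style update; the candidate range, learning rate, and baseline decay are assumed values for the sketch.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

k_choices = np.arange(2, 9)          # candidate numbers of clusters (assumed range)
logits = np.zeros(len(k_choices))    # uniform policy at the start
lr, baseline = 2.0, 0.0              # learning rate and running reward baseline

for step in range(60):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    i = rng.choice(len(k_choices), p=probs)  # sample k from the policy
    labels = KMeans(n_clusters=k_choices[i], n_init=5,
                    random_state=0).fit_predict(X)
    reward = silhouette_score(X, labels)     # in [-1, 1]; closer to 1 is better
    baseline = 0.9 * baseline + 0.1 * reward # baseline reduces gradient variance
    grad = -probs
    grad[i] += 1.0                           # d log pi(i) / d logits
    logits += lr * (reward - baseline) * grad  # REINFORCE update

print("predicted k:", k_choices[np.argmax(logits)])
```

Values of k that repeatedly produce high silhouette scores accumulate probability mass, which is the mechanism described above.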


To perform optimized clustering in both low- and high-dimensional spaces, we augment our model with a feature extraction module, a deep auto-encoder. In this regard, our work is related to the idea of learning to learn, or meta-learning, a general framework for using the information learned in one task to improve a future task. Figure 1 shows a diagram of our model for low- and high-dimensional data. We evaluate our method in multiple scenarios on different clustering tasks using synthetic and computer vision datasets, and we show that our approach can predict the number of clusters in most settings with an insignificant error.
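To make the role of the feature extraction module concrete, here is a deliberately simplified sketch: a linear auto-encoder trained with plain gradient descent in NumPy. The paper's module is a deep (non-linear) auto-encoder; the architecture, data, and hyper-parameters below are assumptions chosen only to keep the example small and self-contained.

```python
import numpy as np
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)

# Toy stand-in for high-dimensional inputs: 200 points in 20 dimensions.
X, _ = make_blobs(n_samples=200, n_features=20, centers=3, random_state=0)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize features

d_in, d_hid = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(d_in, d_hid))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(d_hid, d_in))  # decoder weights

lr, n = 0.05, X.shape[0]
for _ in range(500):
    H = X @ W_enc              # encoder: project into the latent space
    X_hat = H @ W_dec          # decoder: reconstruct the input
    E = X_hat - X              # reconstruction error
    grad_dec = H.T @ E * (2 / n)
    grad_enc = X.T @ (E @ W_dec.T) * (2 / n)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

loss = np.mean((X @ W_enc @ W_dec - X) ** 2)
print("reconstruction MSE:", loss)
```

The latent codes `X @ W_enc` would then be passed to K-Means in place of the raw inputs, as in the blue/green paths of Figure 1.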

Our contributions are:

• A novel self-supervised approach for predicting the number of clusters using policy gradient methods.
• Extensive evaluation on synthetic scikit-learn (Pedregosa et al. (2011)) datasets and the well-known vision datasets MNIST (LeCun et al. (2010)) and Fashion-MNIST (Xiao et al. (2017)).
• Results showing that, in most scenarios, our approach predicts a number of clusters identical or very close to the real number of data clusters.
• We plan to release the source code of this work upon its acceptance.

2. RELATED WORK

There is a vast amount of unlabeled data in many scientific fields that can be used for training neural networks. Unsupervised learning makes use of these data, and clustering is one of its most important tasks. For unsupervised clustering, the classical K-Means algorithm (Lloyd (1982)) has been used extensively due to its simplicity and effectiveness. If the data are distributed compactly around distinctive centroids, the K-Means algorithm works well, but such scenarios are rare in real life. Therefore, research has focused on transforming the data into a lower-dimensional space in which K-Means can perform successfully. If the data points are small images, PCA (Wold et al. (1987)) is commonly used for this transformation. Other methods include non-linear transformations such as kernel methods (Hofmann et al. (2008)) and spectral methods (Ng et al. (2002)).
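The limitation described above, that K-Means assumes compact clusters around distinctive centroids, can be demonstrated on non-convex data, where a spectral method succeeds. This is an illustrative sketch; the dataset and parameters are chosen only for the example.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: non-convex clusters with no compact centroids.
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

print("K-Means ARI: ", adjusted_rand_score(y, km))
print("Spectral ARI:", adjusted_rand_score(y, sc))
```

The adjusted Rand index against the true labels is markedly higher for the spectral method, since it clusters in a transformed space derived from the neighborhood graph rather than in the raw coordinates.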




Figure 1: An outline of our method. The green and blue dashed lines show the two possible inputs to the clustering step: the green path shows feature extraction for high-dimensional input (e.g., images) using an auto-encoder, and the blue path shows low-dimensional input used directly. The right part of the figure shows the prediction of k and how we update the controller network.

