INTERNAL PURITY: A DIFFERENTIAL ENTROPY BASED INTERNAL VALIDATION INDEX FOR CLUSTERING VALIDATION

Abstract

In an effective process of cluster analysis, it is indispensable to validate the goodness of different partitions after clustering. Existing internal validation indices are implemented based on distance and variance, which cannot capture the real "density" of a cluster. Moreover, the time complexity of distance-based indices is usually too high for them to be applied to large datasets. Therefore, we propose a novel internal validation index based on differential entropy, named internal purity (IP). The proposed IP index can effectively measure the purity of a cluster without using external cluster information, and successfully overcomes the drawbacks of existing internal indices. Based on six powerful deep pre-trained representation models, we use four basic clustering algorithms to compare our index with thirteen other well-known internal indices on five text and five image datasets. The results show that, out of 90 test cases in total, our IP index returns the optimal clustering result in 51 cases, while the second-best index reports the optimal partition in merely 30 cases, which demonstrates the significant superiority of our IP index in validating the goodness of clustering results. Moreover, theoretical analyses of the effectiveness and efficiency of the proposed index are also provided.

1. INTRODUCTION

The goal of clustering is to divide a set of samples into different clusters such that similar samples are grouped in the same cluster. As one of the most fundamental tasks in machine learning, clustering has been extensively studied in many fields, such as text mining (Guan et al., 2012a), image analysis (Zhou et al., 2011) and pattern recognition (Guan et al., 2012b). With the advance of deep learning, it has been shown that running any classical clustering algorithm (e.g., K-means) over the learned representation can yield better results (Xie et al., 2016; Huang et al., 2020; Dang et al., 2021). The main reason behind this is that deep neural networks can effectively extract highly non-linear features that are helpful for clustering. However, besides the data representation, the outcome of clustering is still affected by other factors (Xu & Wunsch, 2005; Yang et al., 2017). For example, different clustering algorithms usually lead to different clustering results on the same dataset. Even for the same algorithm, the selection of different parameters may affect the final clustering results (Halkidi et al., 2000). Thus, within an effective process of cluster analysis, it is inevitable to validate the goodness of different partitions after clustering and select the best one for application. Here the best one refers not only to the proper parameters but also to the partition that best fits the underlying data structure. In fact, many clustering validation measures have been proposed over the past years, and they can be categorized into external validation and internal validation (Wu et al., 2009; Liu et al., 2013). Specifically, external validation indices assume the "true" cluster information is known in advance, and they use this supervised information to quantify how good the obtained partition is with respect to the prior ground-truth clusters. However, such prior knowledge is rarely available in many real scenarios.
Then, internal validation indices become the only option for evaluating the clustering result. Internal clustering validation usually assesses a clustering result based on the following two criteria (Liu et al., 2013; Fraley & Raftery, 1998): (1) compactness, which measures how closely related the samples within the same cluster are, and (2) separation, which measures how well clusters are separated from each other. In general, distance and variance are the two main strategies for implementing compactness and separation. However, validation indices based on these implementations suffer from the following drawbacks that limit their performance. First, for a distance-based index, the same distance computation result for two clusters cannot guarantee the same compactness. We use the example shown in Fig. 1(a) for explanation. Specifically, the two clusters are enclosed in two cubes, respectively, and each cube represents the volume of the corresponding vector space. Suppose the volumes of the two clusters' vector spaces are the same and the average pairwise distance within each cluster is also the same; then measures like the Silhouette index (Rousseeuw, 1987) will consider them equally compact. However, from the density perspective, the left cluster should be tighter than the right one. Though indices like S Dbw (Halkidi & Vazirgiannis, 2001), DBCV (Moulavi et al., 2014) and DCVI (Xie et al., 2020) also introduce density-related concepts, their density is still calculated from distances rather than from the volume of the vector space. Moreover, many existing indices require computing pairwise distances for compactness or separation, which is prohibitively time-consuming for large high-dimensional datasets (Cheng et al., 2018). Second, variance-based compactness computation usually treats lower variance as an indicator of higher tightness. However, this is misleading in some cases. As shown in Fig. 1(b), we can easily observe that cluster B is more compact than cluster A.
But the variances of both clusters are the same, which makes measures like the standard deviation index (SD) (Halkidi et al., 2000) fail to provide a reliable validation. In fact, the covariance is more suitable for measuring compactness here, i.e., the covariance between variables in cluster A is smaller than that in cluster B. Therefore, we dedicate this paper to a novel internal validation index, named internal purity (IP). Here, purity refers to how "pure" the semantics of a set of samples are. For example, a cluster of similar texts describes a specific event. Hence, from the compactness perspective, we want each cluster to be as pure as possible, while from the separation perspective, a clustering partition of lower purity is favored. To evaluate purity, we apply the idea of information entropy (Shannon, 1948); more specifically, the differential entropy (Cover & Thomas, 1991) is used because the embedding variables are usually continuous in deep clustering. Furthermore, using differential entropy helps us overcome the aforementioned drawbacks of existing measures. First of all, unlike the average pairwise distance, the nature of entropy lends itself to measuring the "information density" (Zu et al., 2020) of a vector space (Cheng et al., 1999). Then, theoretical studies have shown that the computation of differential entropy actually considers both the variance of each variable and the covariance between variables (Johnson et al., 2014), which allows it to measure the compactness of a cluster more effectively. Last but not least, since the differential entropy computation requires merely one pass over the whole cluster, it needs less computation time than computing the average pairwise distance.
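The covariance argument above can be illustrated with a small sketch. Assuming each cluster is modeled as a multivariate Gaussian (an assumption made here purely for illustration; the paper's IP index is defined later), the differential entropy is h(X) = (1/2) ln((2πe)^d |Σ|), which depends on the full covariance matrix Σ, not just the per-variable variances:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_diff_entropy(X):
    """Differential entropy of a Gaussian fit to the cluster:
    h(X) = 0.5 * ln((2*pi*e)^d * det(Sigma)).
    Estimating Sigma needs only one pass over the cluster,
    unlike O(n^2) pairwise-distance computations."""
    d = X.shape[1]
    cov = np.cov(X, rowvar=False)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

# Two clusters with identical per-variable variances (the diagonal
# of Sigma) but different covariance: A is "round", B is strongly
# correlated and hence occupies a smaller effective volume.
cov_A = np.array([[1.0, 0.0], [0.0, 1.0]])
cov_B = np.array([[1.0, 0.9], [0.9, 1.0]])
A = rng.multivariate_normal([0, 0], cov_A, size=5000)
B = rng.multivariate_normal([0, 0], cov_B, size=5000)

# A variance-only score (e.g., the sum of per-variable variances)
# cannot tell the clusters apart, but the entropy can.
print(np.var(A, axis=0).sum(), np.var(B, axis=0).sum())  # both near 2.0
print(gaussian_diff_entropy(A) > gaussian_diff_entropy(B))  # True
```

Because |Σ_B| = 1 − 0.9² = 0.19 < |Σ_A| = 1, cluster B receives a lower entropy, matching the intuition in Fig. 1(b) that it is more compact even though its variances equal those of A.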
In fact, from the perspective of information theory, there is another information criterion (Bishop & Nasrabadi, 2006) that can be used for evaluating clustering partitions (Akogul & Erisoglu, 2016), i.e., estimating the quantity of information loss based on model performance and complexity, where model performance can be evaluated within a probabilistic framework, e.g., by the log-likelihood, and model complexity by the number of parameters in the model (Akogul & Erisoglu, 2016). The Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978) are two indices of this kind. As will be shown later in the paper, our IP index can actually be regarded as a form of such a criterion through mathematical derivation. But different from AIC and BIC, our IP index also takes into account the traditional internal clustering criteria, i.e., compactness and separation.
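As a concrete instance of this criterion (not of the IP index itself), AIC/BIC can be used to select among candidate partitions produced by a Gaussian mixture model: the log-likelihood measures model performance and a parameter-count penalty measures complexity. A minimal sketch, using synthetic data with three well-separated clusters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2-D Gaussian clusters.
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(200, 2))
    for c in ([0, 0], [4, 0], [2, 4])
])

# BIC = -2*log-likelihood + (#parameters)*ln(n); lower is better,
# so the number of components minimizing it is preferred.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
print(best_k)  # 3 for this well-separated data
```

Replacing `.bic(X)` with `.aic(X)` gives the AIC variant, which penalizes the parameter count by 2 instead of ln(n) and therefore tends to favor slightly larger models.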



Figure 1: Drawback illustration for computing compactness based on distance and variance

