INTERNAL PURITY: A DIFFERENTIAL ENTROPY BASED INTERNAL VALIDATION INDEX FOR CLUSTERING VALIDATION

Abstract

In an effective process of cluster analysis, it is indispensable to validate the goodness of different partitions after clustering. Existing internal validation indices are implemented based on distance and variance, which cannot capture the real "density" of a cluster. Moreover, the time complexity of distance-based indices is usually too high for them to be applied to large datasets. Therefore, we propose a novel internal validation index based on differential entropy, named internal purity (IP). The proposed IP index can effectively measure the purity of a cluster without using external cluster information, and successfully overcomes the drawbacks of existing internal indices. Based on six powerful deep pre-trained representation models, we use four basic clustering algorithms to compare our index with thirteen other well-known internal indices on five text and five image datasets. The results show that, out of 90 test cases in total, our IP index returns the optimal clustering result in 51 cases, while the second-best index reports the optimal partition in merely 30 cases, which demonstrates the significant superiority of our IP index when validating the goodness of clustering results. Moreover, theoretical analysis of the effectiveness and efficiency of the proposed index is also provided.

Table 40: Swin based AHC clustering results on five image datasets (each cell: ACC/ARI/NMI/opt_k; dataset-k = 10 for every dataset).

| Index | CIFAR-10 | MNIST | FashionMNIST | ImageNet-10 | CINIC-10 |
| SD | 85.92/80.08/86.43/9 | 19.36/4.09/7.85/3 | 31.8/21.96/37.75/4 | 99.66/99.25/98.94/10 | 64.77/47.09/58.26/12 |
| Dunn | 19.94/18.12/42.4/2 | 23.54/6.38/9.96/4 | 19.97/13.79/34.16/2 | 39.93/18.49/57.59/4 | 19.35/14.62/33.03/2 |
| I | 19.94/18.12/42.4/2 | 16.93/2.01/4.46/2 | 19.97/13.79/34.16/2 | 29.93/10.73/43.02/3 | 19.35/14.62/33.03/2 |
| XB | 85.92/80.08/86.43/9 | 16.93/2.01/4.46/2 | 34.43/20.54/36.51/5 | 99.66/99.25/98.94/10 | 27.79/18.64/43.35/3 |
| S | 89/83.9/85.83/11 | 16.93/2.01/4.46/2 | 19.97/13.79/34.16/2 | 91.03/91.92/94.97/14 | 43.62/38.8/52.97/5 |
| CH | 19.94/18.12/42.4/2 | 16.93/2.01/4.46/2 | 19.97/13.79/34.16/2 | 99.66/99.25/98.94/10 | 19.35/14.62/33.03/2 |
| DB | 85.92/80.08/86.43/9 | 16.93/2.01/4.46/2 | 34.43/20.54/36.51/5 | 99.66/99.25/98.94/10 | 27.79/18.64/43.35/3 |
| S_Dbw | 46.3/46.79/72.51/34 | 8.66/4.35/19.86/99 | 9.81/6.36/33.21/100 | 99.66/99.25/98.94/10 | 20.7/15.57/48.79/86 |
| CVNN | 76.7/72.92/83.99/8 | 16.93/2.01/4.46/2 | 19.97/13.79/34.16/2 | 99.66/99.25/98.94/10 | 27.79/18.64/43.35/3 |
| DCVI | 21.2/19.46/62.25/100 | 8.66/4.34/19.86/100 | 9.81/6.36/33.21/100 | 39.93/18.49/57.59/4 | 18.74/14.08/48.21/100 |
| DBCV | 19.94/18.12/42.4/2 | 8.66/4.34/19.86/100 | 9.81/6.36/33.21/100 | 99.66/99.25/98.94/10 | 18.74/14.08/48.21/100 |
| AIC | 19.94/18.12/42.4/2 | 16.93/2.01/4.46/2 | 19.97/13.79/34.16/2 | 19.97/9.85/35.28/2 | 19.35/14.62/33.03/2 |
| BIC | 19.94/18.12/42.4/2 | 16.93/2.01/4.46/2 | 19.97/13.79/34.16/2 | 19.97/9.85/35.28/2 | 19.35/14.62/33.03/2 |
| IP | 93.62/86.56/87.01/10 | 24.03/7.17/12.3/8 | 35.03/21.09/35.54/6 | 99.66/99.25/98.94/10 | 61.49/47.21/57.78/8 |

Table 41: BEiT based AHC clustering results on five image datasets (each cell: ACC/ARI/NMI/opt_k; dataset-k = 10 for every dataset).

| Index | CIFAR-10 | MNIST | FashionMNIST | ImageNet-10 | CINIC-10 |
| SD | 20.67/6.21/9.42/5 | 20.87/6.04/12.07/2 | 34.74/21.87/38.85/5 | 16.54/3.92/11.21/2 | 17.19/3.82/6.93/2 |
| Dunn | 10.19/4.14/16.66/79 | 11.17/6.93/32.16/100 | 15.48/11.7/41.01/84 | 20.45/4.67/16.25/3 | 7.55/2.34/13.6/100 |
| I | 16.96/3.96/7.56/2 | 22.61/5.9/18.2/4 | 19.9/12.47/25.97/2 | 20.45/4.67/16.25/3 | 17.88/3.5/8.46/3 |
| XB | 17.59/3.85/7.53/3 | 20.87/6.04/12.07/2 | 19.9/12.47/25.97/2 | 20.45/4.67/16.25/3 | 17.19/3.82/6.93/2 |
| S | 16.96/3.96/7.56/2 | 20.87/6.04/12.07/2 | 19.9/12.47/25.97/2 | 20.45/4.67/16.25/3 | 17.19/3.82/6.93/2 |
| CH | 16.96/3.96/7.56/2 | 20.87/6.04/12.07/2 | 19.9/12.47/25.97/2 | 16.54/3.92/11.21/2 | 17.19/3.82/6.93/2 |
| DB | 8.44/3.37/17.15/98 | 22.61/5.9/18.2/4 | 19.9/12.47/25.97/2 | 20.45/4.67/16.25/3 | 9.22/2.79/12.96/73 |
| S_Dbw | 8.16/3.31/17.2/100 | 11.17/6.93/32.16/100 | 13.48/10.15/40.73/97 | 12.78/6.87/28.54/113 | 7.55/2.34/13.58/99 |
| CVNN | 16.96/3.96/7.56/2 | 20.87/6.04/12.07/2 | 42.72/23.03/42.06/10 | 16.54/3.92/11.21/2 | 17.19/3.82/6.93/2 |
| DCVI | 8.16/3.31/17.2/100 | 11.17/6.93/32.16/100 | 12.6/9.71/40.64/100 | 12.78/6.87/28.54/114 | 7.55/2.34/13.6/100 |
| DBCV | 8.16/3.31/17.2/100 | 11.17/6.93/32 | | | |

1. INTRODUCTION

The goal of clustering is to divide a set of samples into different clusters such that similar samples are grouped in the same clusters. As one of the most fundamental tasks in machine learning, clustering has been extensively studied in many fields, such as text mining (Guan et al., 2012a), image analysis (Zhou et al., 2011) and pattern recognition (Guan et al., 2012b). With the advance of deep learning, it has been shown that running any classical clustering algorithm (e.g., K-means) over the learned representation can yield better results (Xie et al., 2016; Huang et al., 2020; Dang et al., 2021). The main reason behind this is that deep neural networks can effectively extract highly non-linear features that are helpful for clustering. However, besides the data representation, the outcome of clustering is still affected by other factors (Xu & Wunsch, 2005; Yang et al., 2017). For example, different clustering algorithms usually lead to different clustering results on a specific dataset. Even for the same algorithm, the selection of different parameters may affect the final clustering result (Halkidi et al., 2000). Thus, within an effective process of cluster analysis, it is inevitable to validate the goodness of different partitions after clustering and select the best one for application. Here the best one refers not only to the proper parameters but also to the partition that best fits the underlying data structure. In fact, many clustering validation measures have been proposed over the past years, and they can be categorized into external validation and internal validation (Wu et al., 2009; Liu et al., 2013). Specifically, external validation indices assume the "true" cluster information is known in advance, and they use this supervised information to quantify how good the obtained partition is with respect to the prior ground-truth clusters. However, such prior knowledge is rarely available in many real scenarios.
Then, internal validation indices become the only option for evaluating the clustering result. Internal clustering validation usually measures the clustering result based on the following two criteria (Liu et al., 2013; Fraley & Raftery, 1998): (1) compactness, which measures how closely related the samples within the same cluster are, and (2) separation, which measures how well clusters are separated from each other. In general, distance and variance are the two main strategies to implement compactness and separation. However, validation indices based on these implementations suffer from the following drawbacks that limit their performance.

[Figure 1: Drawback illustration for computing compactness based on distance and variance.]

First, for distance-based indices, given two clusters, the same distance computation result cannot guarantee the same compactness. We use the example shown in Fig. 1(a) for explanation. Particularly, the two clusters are grouped into two cubes respectively, and each cube represents the volume of the corresponding vector space. Supposing the volumes of the two cluster vector spaces are the same and the average pairwise distance of each cluster is also the same, measures like the Silhouette index (Rousseeuw, 1987) will consider them to have the same compactness. However, from the density perspective, the left cluster should be tighter than the right one. Though indices like S_Dbw (Halkidi & Vazirgiannis, 2001), DBCV (Moulavi et al., 2014) and DCVI (Xie et al., 2020) also propose density-related concepts, the density is still calculated based on distance instead of the volume of the vector space. Moreover, many existing indices require computing pairwise distances for compactness or separation, which would be prohibitively time-consuming for a large high-dimensional dataset (Cheng et al., 2018). Second, variance-based compactness computation usually views lower variance as an indicator of higher tightness. However, this is misleading in some cases. As shown in Fig. 1(b), we can easily observe that cluster B is more compact than cluster A, but the variances of both clusters are the same, which makes measures like the standard deviation index (SD) (Halkidi et al., 2000) fail to provide a reliable validation. In fact, covariance is more suitable here to measure the compactness, i.e., the covariances among pairs of variables in cluster A are smaller than those of cluster B. Therefore, we dedicate this paper to a novel internal validation index, named internal purity (IP). Here purity refers to how "pure" the semantics of a set of samples are; for example, a cluster of similar texts describes a specific event. Hence, from the compactness perspective, we want a cluster to be as pure as possible, while from the separation perspective, a clustering partition of lower purity is favored. To evaluate the purity, we apply the idea of information entropy (Shannon, 1948); more specifically, the differential entropy (Cover & Thomas, 1991) is used, because the embedding variables are usually continuous in deep clustering. Furthermore, using differential entropy helps us overcome the aforementioned drawbacks of existing measures. First of all, unlike the average pairwise distance, the nature of entropy lends itself to measuring the "information density" (Zu et al., 2020) of a vector space (Cheng et al., 1999). Then, theoretical studies have shown that the computation of differential entropy actually considers both the variance of each variable and the covariance between variables (Johnson et al., 2014), which makes it measure the compactness of a cluster more effectively. Last but not least, since the differential entropy computation requires merely one pass over the whole cluster, it needs less computation time than computing the average pairwise distance.
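To make the variance-versus-covariance argument of Fig. 1(b) concrete, the following sketch (our own illustration with hypothetical covariance matrices, not data from the paper) compares the differential entropy of two Gaussian clusters that have identical per-variable variances but different covariance:

```python
import numpy as np

# Hypothetical 2-D covariance matrices standing in for clusters A and B of
# Fig. 1(b): identical per-variable variances, different covariance.
cov_a = np.array([[1.0, 0.1], [0.1, 1.0]])   # nearly uncorrelated (cluster A)
cov_b = np.array([[1.0, 0.9], [0.9, 1.0]])   # strongly correlated (cluster B)

def gaussian_diff_entropy(cov):
    """Differential entropy of N(mu, cov) in nats (full-rank case)."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * d * (1 + np.log(2 * np.pi)) + 0.5 * logdet

# Variance-based measures see no difference (the diagonals are equal), but
# the correlated cluster occupies a thinner volume of the vector space, so
# its differential entropy is lower, i.e., it is "purer"/more compact.
print(gaussian_diff_entropy(cov_a))
print(gaussian_diff_entropy(cov_b))
```

The determinant term is what distinguishes the two clusters here, which is exactly the generalized-variance effect the paper exploits.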
In fact, from the perspective of information theory, there is another information criterion (Bishop & Nasrabadi, 2006) that can be used for evaluating clustering partitions (Akogul & Erisoglu, 2016), i.e., estimating the quantity of information loss based on model performance and complexity, where model performance can be evaluated using a probabilistic framework, such as the log-likelihood, and model complexity by the number of parameters in the model (Akogul & Erisoglu, 2016). The Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978) are two indices of this kind. As will be shown later in the paper, our IP index can actually be regarded as a form of such a criterion through mathematical derivation. But different from AIC and BIC, our IP index also takes into account the traditional internal clustering criteria, i.e., compactness and separation, whereas AIC and BIC cannot. Moreover, some existing works have shown that the AIC and BIC indices have poor performance and are prone to overestimating the number of clusters in a dataset (Windham & Cutler, 1992; Hu & Xu, 2004). Hence, our IP index is more competitive for evaluating clustering results based on its capability to capture different criteria. Note that, although the term "purity" has been used in external validation (Wu et al., 2009) and entropy has also been used there as the measurement, the definition as well as the computation of our internal purity is quite different, since no supervised information is known in advance in our validation setting. To summarize, our contributions are threefold: 1. To tackle the effectiveness and efficiency problems of existing measures, we propose to use differential entropy to measure the purity of a cluster as well as a partition. 2. Following the traditional perspective of compactness and separation, we design a new internal validation index based on the nature of the proposed purity measure.
Moreover, theoretical analysis of the effectiveness and efficiency of the proposed internal purity is also provided. 3. Based on six powerful deep pre-trained models, we use four basic clustering algorithms to compare our index with thirteen other well-known internal indices on five text and five image datasets. The results show that, out of 90 test cases, our IP index returns the optimal clustering result in 51 cases, while the second-best index reports the optimal partition in merely 30 cases. The remainder of this paper is organized as follows. Section 2 briefly reviews existing internal indices. Section 3 introduces preliminaries. Section 4 presents our internal purity index. Section 5 reports the results of the experimental evaluation. Finally, Section 6 concludes the paper.

2. RELATED WORK

Internal clustering validation indices measure the quality of clustering using only the internal information of the dataset; no external information is used. We group the internal indices into two categories based on how the criteria of compactness and separation are computed, i.e., distance-based and variance-based. We briefly summarize several well-known indices in Table 1.

Distance-based. Indices in this category use either pairwise distances or cluster-center-based distances to measure compactness and separation. DB (Davies & Bouldin, 1979), I (Maulik & Bandyopadhyay, 2002) and XB (Xie & Beni, 1991) measure compactness based on the average center-based distance. Dunn (Dunn, 1973) measures compactness based on the maximum pairwise distance. DBCV and DCVI select the most representative pairwise distance for measuring compactness. However, using a center-based distance or a single pairwise distance to represent the compactness of an entire cluster cannot provide stable performance (Liu et al., 2013). Our IP index uses all the samples in a cluster, which is more stable. Although S and CVNN can use all samples in a cluster by computing the average pairwise distance, they have high time complexity. Moreover, as mentioned in Section 1, these average-pairwise-distance-based indices cannot correctly reflect the density of a vector space. Variance-based. Indices in this category assume that lower variance indicates better compactness. RMSSTD (Halkidi et al., 2001), CH (Caliński & Harabasz, 1974), SD and S_Dbw measure compactness based on the variances of samples in a cluster. As a representative, CH further measures separation by computing the between-cluster variance based on cluster centroids. However, these variance-based indices are not good measures, since two clusters with the same variance may have distinct densities.

3. PRELIMINARIES

In this section, we first present the concept of entropy and differential entropy. Then we introduce the computation of differential entropy for a multivariate normal distribution. Note that, though the normal distribution assumption made here seems restrictive, it still works for datasets that deviate from the assumption as long as we use their deep representations. This is because the distribution of deep representation has been proved to be close to the normal distribution (Lee et al., 2018; Daneshmand et al., 2021) .

3.1. ENTROPY AND DIFFERENTIAL ENTROPY

In information theory, the entropy of a random variable is the average level of "surprise" or "uncertainty" inherent in the variable's possible outcomes (Shannon, 1948). In other words, it can be used to measure the uncertainty of data. Entropy comes in two classes: entropy for discrete random variables, and entropy for continuous random variables, i.e., differential entropy. Given a discrete random variable V with possible outcomes v_1, ..., v_n, we assume the probability of V being v_i is P_i. The entropy of V is formally defined as (Shannon, 1948):

    Entropy(V) = -\sum_{i=1}^{n} P_i \log P_i    (1)

Given a continuous random variable V with probability density function p(v), the differential entropy is defined as (Cover & Thomas, 1991):

    DiffEntropy(V) = -\int_{-\infty}^{+\infty} p(v) \log p(v) \, dv    (2)
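As a quick numerical illustration of Eqs. (1) and (2) (a hypothetical toy distribution of our own, not from the paper), the discrete entropy can be computed directly, and the differential entropy of a 1-D Gaussian can be checked against its known closed form 0.5·ln(2πeσ²):

```python
import numpy as np

# Discrete entropy (Eq. 1) of a hypothetical three-outcome distribution.
p = np.array([0.5, 0.25, 0.25])
discrete_entropy = -np.sum(p * np.log2(p))   # 1.5 bits

# Differential entropy (Eq. 2) of a 1-D Gaussian: the closed form is
# 0.5 * ln(2*pi*e*sigma^2); we check it against a Riemann sum of
# -p(v) * log p(v) over a fine grid.
sigma = 2.0
grid = np.linspace(-10 * sigma, 10 * sigma, 200001)
pdf = np.exp(-grid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
numeric = -np.sum(pdf * np.log(pdf)) * (grid[1] - grid[0])
closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(discrete_entropy)   # 1.5
print(numeric, closed)    # both approximately 2.112 nats
```

Note that, unlike discrete entropy, differential entropy can be negative (e.g., for a very narrow Gaussian), which is why lower values indicate a "purer" cluster rather than zero being a floor.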

3.2. DIFFERENTIAL ENTROPY OF THE MULTIVARIATE NORMAL DISTRIBUTION

The multivariate normal distribution of a d-dimensional random vector H = (H_1, ..., H_d)^T can be written in the following form (Goodfellow et al., 2016):

    H ~ N_d(\mu, \Sigma)    (3)

with d-dimensional mean vector

    \mu = E[H] = (E[H_1], E[H_2], ..., E[H_d])^T    (4)

and d × d covariance matrix

    \Sigma_{i,j} = E[(H_i - \mu_i)(H_j - \mu_j)] = Cov(H_i, H_j)    (5)

such that 1 ≤ i, j ≤ d, and Σ is a positive definite matrix. The differential entropy of the multivariate normal distribution is given by (Ahmed & Gokhale, 1989):

    DiffEntropy(H) = rank(\Sigma)/2 + (rank(\Sigma)/2) \ln(2\pi) + (1/2) \ln|\Sigma|    (6)

where rank(Σ) is the rank of Σ and |Σ| is the determinant of the covariance matrix. If the covariance matrix Σ is not full rank, the multivariate normal distribution is degenerate (Rao et al., 1973); then rank(Σ) < d and the determinant of the covariance matrix degenerates to the pseudo-determinant.
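A minimal NumPy sketch of Eq. (6) follows; the function name and the eigenvalue cutoff `tol` are our own choices (the paper's implementation uses scikit-learn and SciPy, but its exact code is not shown). The pseudo-determinant is obtained as the product of the non-zero eigenvalues of the sample covariance:

```python
import numpy as np

def diff_entropy_mvn(X, tol=1e-10):
    """Differential entropy (Eq. 6) of a multivariate normal fitted to the
    samples X (n x d). Rank-deficient covariances are handled with the
    pseudo-determinant, i.e. the product of the non-zero eigenvalues."""
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    eigvals = np.linalg.eigvalsh(cov)
    eigvals = eigvals[eigvals > tol]     # keep non-zero spectrum only
    rank = eigvals.size
    return rank / 2 + rank / 2 * np.log(2 * np.pi) + 0.5 * np.sum(np.log(eigvals))

rng = np.random.default_rng(0)

# Full-rank case: for N(0, I_2) the entropy is 1 + ln(2*pi), about 2.84 nats.
X = rng.standard_normal((20000, 2))
print(diff_entropy_mvn(X))

# Degenerate case: 2-D data embedded in 3-D; the covariance has rank 2, so
# Eq. (6) falls back to the pseudo-determinant and the value stays finite.
X3 = X @ rng.standard_normal((2, 3))
print(diff_entropy_mvn(X3))
```

Without the pseudo-determinant fallback, the degenerate case would yield ln 0 = -∞, which is why the rank appears explicitly in Eq. (6).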

4. INTERNAL PURITY

In this section, we first present the implementation details for our proposed internal purity (IP) index, and then provide theoretical analysis for its effectiveness and efficiency.

4.1. IP IMPLEMENTATION

Following the traditional criteria of internal clustering validation, the IP index consists of the following two main components, i.e., compactness purity (CP) and separation purity (SP). The former measures the average differential entropy of clusters to judge the compactness of the clustering result, while the latter evaluates the separation between clusters based on the differential entropy of the space formed by the centers of the clusters.

Compactness. Let H be the feature space, where H = {h_1, ..., h_N}^T. Supposing that N samples are clustered into k clusters, i.e., H_1, ..., H_k, the compactness CP for the k clusters is defined as the average differential entropy of the k clusters:

    CP = (1/k) \sum_{i=1}^{k} DiffEntropy(H_i)    (7)

Given a specific dataset, we can get a feature representation of each sample x_i through the pre-trained deep model f(·), i.e., h_i = f(x_i). Then, for the k-th cluster we can get {h^k_1, ..., h^k_m}^T ∈ H_k, where m is the number of data samples in the k-th cluster. So H_k is an m × d matrix composed of the feature representations of the samples in H_k, and it can be viewed as a set of points in d-dimensional space. Moreover, we usually assume the feature matrix H_k follows a multivariate normal distribution when we have no prior knowledge of the distribution of these points (Goodfellow et al., 2016), and existing works have demonstrated that the distribution of the hidden representation is close to the normal distribution (Lee et al., 2018; Daneshmand et al., 2021). Hence we can obtain the differential entropy of a cluster H_k by Eq. (6).

Separation. Let μ_k be the centroid of the k-th cluster H_k, i.e., μ_k = (1/m) \sum_{i=1}^{m} h_i, h_i ∈ H_k, where m is the number of data samples in the k-th cluster. The separation SP for k clusters is defined as the differential entropy of the feature matrix H_μ formed by the k cluster centers:

    SP = DiffEntropy(H_μ)    (8)

where H_μ = {μ_1, ..., μ_k}^T. Here we also assume that the centers of the clusters follow a multivariate normal distribution. Hence, we can obtain the differential entropy of this distribution by Eq. (6).

IP Index. Based on CP and SP, the IP index for a clustering result of k clusters is defined as

    IP = CP - SP    (9)

As shown above, our IP index takes the form of subtracting the inter-cluster separation from the intra-cluster compactness. A lower value of IP indicates a better clustering result.
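Putting Eqs. (6) through (9) together, a compact reference implementation might look as follows; this is our own sketch with hypothetical function names and synthetic blobs, not the authors' released code:

```python
import numpy as np

def diff_entropy(X, tol=1e-10):
    """Eq. (6); uses the pseudo-determinant when the covariance of X is
    rank-deficient (e.g. for a small set of cluster centers)."""
    eig = np.linalg.eigvalsh(np.atleast_2d(np.cov(X, rowvar=False)))
    eig = eig[eig > tol]
    return eig.size / 2 * (1 + np.log(2 * np.pi)) + 0.5 * np.sum(np.log(eig))

def internal_purity(H, labels):
    """IP = CP - SP (Eq. 9) for a representation matrix H (N x d) and a
    cluster assignment `labels`; lower IP indicates a better partition."""
    clusters = np.unique(labels)
    cp = np.mean([diff_entropy(H[labels == c]) for c in clusters])      # Eq. (7)
    centers = np.stack([H[labels == c].mean(axis=0) for c in clusters])
    sp = diff_entropy(centers)                                          # Eq. (8)
    return cp - sp

# Sanity check on two synthetic blobs: the true partition should obtain a
# lower (better) IP than a random assignment.
rng = np.random.default_rng(0)
H = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(10, 1, (200, 2))])
true_labels = np.repeat([0, 1], 200)
random_labels = rng.integers(0, 2, 400)
print(internal_purity(H, true_labels), internal_purity(H, random_labels))
```

Note that a single pass over each cluster suffices to form its covariance matrix, which is the efficiency advantage over pairwise-distance indices discussed above.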

4.2. THEORETICAL ANALYSIS

Effectiveness. According to Eq. (6), we can see that the differential entropy of the multivariate normal distribution is proportional to the determinant of the covariance matrix. The determinant of the covariance matrix is usually called the generalized variance (Johnson et al., 2014). For a fixed set of data, the generalized variance is proportional to the square of the volume generated by the d deviation vectors (Johnson et al., 2014), i.e.,

    Generalized variance = |\Sigma| = (N-1)^{-d} (volume)^2    (10)

Based on Eq. (10), we know the reason why our internal purity can avoid the drawbacks of distance- or variance-based indices. Specifically, the relation between the volume and the sample size N embedded in Eq. (10) indicates the real density of a given vector space. Then, according to Eq. (9), we can further obtain the following formula (the detailed derivation is listed in Appendix A):

    IP = (1/2) [ (1/k) \sum_{i=1}^{k} \ln|\Sigma_i| - \ln|\Sigma_\mu| ]                          (part 1)
         + ((1 + \ln(2\pi))/2) [ (1/k) \sum_{i=1}^{k} rank(\Sigma_i) - rank(\Sigma_\mu) ]        (part 2)    (11)

This formula is similar to the information criterion (Bishop & Nasrabadi, 2006)

5. EXPERIMENTS

In this section, we compare the performance of IP with 13 other well-known internal indices, namely SD, Dunn, I, XB, S, CH, DB, AIC, BIC, S_Dbw, CVNN, DBCV and DCVI. Note that we do not consider RMSSTD, since it needs a subjective determination of the shift point of its curve (Halkidi et al., 2001; Vendramin et al., 2010). Similar to existing works (Halkidi et al., 2000; Liu et al., 2013; Moulavi et al., 2014), we use the task of determining the optimal cluster number for evaluation purposes. The general evaluation procedure is as follows: (1) we first use an existing pre-trained deep model to transform the dataset into a representation matrix; (2) a set of clustering algorithms is then applied to the representation matrix, and different clustering partitions can be obtained with different parameters; (3) finally, we compute the internal index for each partition and select the best partition as well as its corresponding optimal cluster number.
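The three-step procedure above can be sketched as follows, assuming SciPy's `kmeans2` as a stand-in clustering algorithm and synthetic blobs as a stand-in representation matrix (the actual experiments use K-Means, GMM, AHC and DBSCAN over deep representations, and `ip` is our own compact IP implementation):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def diff_entropy(X, tol=1e-10):
    """Eq. (6) with the pseudo-determinant for rank-deficient covariances."""
    eig = np.linalg.eigvalsh(np.atleast_2d(np.cov(X, rowvar=False)))
    eig = eig[eig > tol]
    return eig.size / 2 * (1 + np.log(2 * np.pi)) + 0.5 * np.sum(np.log(eig))

def ip(H, labels):
    """IP = CP - SP (Eq. 9); degenerate partitions are scored as worst."""
    cs = np.unique(labels)
    if len(cs) < 2 or min(int((labels == c).sum()) for c in cs) < 2:
        return np.inf
    cp = np.mean([diff_entropy(H[labels == c]) for c in cs])
    sp = diff_entropy(np.stack([H[labels == c].mean(axis=0) for c in cs]))
    return cp - sp

# Step (1): H would come from a pre-trained model; here, three synthetic blobs.
rng = np.random.default_rng(0)
H_mat = np.vstack([rng.normal(loc=8.0 * i, scale=1.0, size=(150, 5))
                   for i in range(3)])

# Steps (2)-(3): cluster for each candidate k in [2, floor(sqrt(N))] and keep
# the partition that minimizes IP.
scores = {}
for k in range(2, int(np.sqrt(len(H_mat))) + 1):
    _, labels = kmeans2(H_mat, k, minit="++", seed=0)
    scores[k] = ip(H_mat, labels)
best_k = min(scores, key=scores.get)
print(best_k)
```

The same loop applies unchanged to any clustering algorithm that returns a label vector; only the parameter grid in step (2) differs.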

5.1. SETTINGS

Datasets. We use five text datasets and five image datasets. The statistics of these datasets are shown in Tables 2 and 3, respectively. For the text datasets, we cluster the train sets of SearchSnippets (Phan et al., 2008), Biomedical (Xu et al., 2017), StackOverflow (Xu et al., 2017) and WebofScience-11967 (Kowsari et al., 2017), and 10,000 randomly selected texts from each class of Yahoo!Answers (Zhang et al., 2015). For the image datasets, we cluster the test sets of CIFAR-10 (Krizhevsky et al., 2009), MNIST (LeCun et al., 1998) and FashionMNIST (Xiao et al., 2017), and 10,000 randomly selected images from the test set of CINIC-10 (Darlow et al., 2018). For ImageNet-10 (Chang et al., 2017), the train set is used directly. Moreover, we also use five real-world datasets from the UCI Machine Learning Repository (Frank, 2010) to evaluate how our IP index performs on datasets without deep representations; details can be found in Appendix E. Evaluation Metrics. Since the true cluster information is known for the above datasets, i.e., classes are given, we can use external indices to evaluate the best clustering result each internal index selects and then further judge which internal index is better. Three external indices are used in our experiments, i.e., Accuracy (ACC) (Wu & Schölkopf, 2006), Adjusted Rand Index (ARI) (Hubert & Arabie, 1985) and Normalized Mutual Information (NMI) (Chen et al., 2010). Larger ACC, ARI and NMI indicate a better clustering result. Details of the implementation of the three external indices can be found in Appendix B. Pre-trained Representation Models. To extract feature representations, we use the following six pre-trained models. For the text datasets, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), Sentence-BERT (SBERT) (Reimers & Gurevych, 2019) and the simple contrastive sentence embedding framework (SimCSE) (Gao et al., 2021) are used.
We use the average of all output embeddings as the feature representation of each text sentence for the three models. For the image datasets, we use the Vision Transformer (ViT) (Dosovitskiy et al., 2021), the Swin Transformer (Swin) (Liu et al., 2021) and Bidirectional Encoder Representations from Image Transformers (BEiT) (Bao et al., 2021). The generation of feature representations in the different image models follows the suggestions of their original works. Note that, due to the unsupervised nature of clustering, we do not further fine-tune the above pre-trained models on the experimental datasets. Experimental Setup. The above models are implemented based on Huggingface's Transformers (Wolf et al., 2020) and Sentence Transformers (Reimers & Gurevych, 2019). Detailed model configurations are listed in Appendix C. Our internal index is implemented based on scikit-learn (Pedregosa et al., 2011) and SciPy (Virtanen et al., 2020). Then, based on the above setup, we use four different clustering algorithms for generating partitions, including K-Means, GMM, agglomerative hierarchical clustering (AHC) (Ward Jr, 1963) and density-based spatial clustering of applications with noise (DBSCAN) (Ester et al., 1996). In the experiments, the numbers of clusters or components are in the search range from 2 to ⌊√N⌋ (Pal & Bezdek, 1995; Bezdek & Pal, 1998) for K-Means, GMM and AHC. Two parameters are needed for DBSCAN, min_samples and eps. For min_samples, we choose min_samples ∈ [5, 100] with a step size of 5. For eps, we obtain the minimum and maximum values of the pairwise distances for each dataset and employ 50 different values of eps equally distributed within this range. Moreover, since AIC and BIC are implemented based on the current clustering model, the within-cluster sum of squares is applied for K-Means, AHC and DBSCAN (Manning, 2008), and the maximum likelihood function for GMM. The experimental results are averaged over five random runs for each validation index.
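The DBSCAN parameter grid described above can be reproduced as follows (synthetic data as a stand-in for a representation matrix):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))          # stand-in for a representation matrix

# min_samples in [5, 100] with step size 5 -> 20 candidate values.
min_samples_grid = np.arange(5, 101, 5)

# eps: 50 values equally distributed between the minimum and maximum
# pairwise distances of the dataset.
dists = pdist(X)
eps_grid = np.linspace(dists.min(), dists.max(), 50)

# Each (min_samples, eps) pair defines one DBSCAN run to be validated.
print(len(min_samples_grid) * len(eps_grid))  # 1000 parameter combinations
```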

5.2. EFFECTIVENESS STUDY BASED ON EXTERNAL EVALUATION

The evaluation results based on the external indices ACC, ARI and NMI are shown in Tables 5, 6, 7, 8, 9 and 10, where the best results are highlighted in bold and the optimal k each index selects, i.e., opt_k, as well as the true cluster number of each dataset, i.e., dataset-k, are provided. Moreover, we also count the number of best results that each internal index achieves and report them in Table 4. Obviously, compared with the other 13 indices, our IP index shows a significant advantage in achieving the best results. Specifically, over all 90 cases, our IP index produces 51 best results, while the second-ranked index, AIC, only has 30 best results. Finally, the IP index greatly outperforms the other indices in both domains, i.e., text and image. Due to space limitations, the evaluation results for each clustering algorithm can be found in Appendix D. Text datasets. In Table 5, we can also observe that our IP index outperforms the other indices except on StackOverflow and Yahoo!Answers. In Table 6, the index S has the best results in three cases. However, it is a distance-based method which has higher time complexity than our IP index. In Table 7, although the index AIC has the best results in nine cases, the optimal k value it selects is far from the true number of classes on Biomedical and Yahoo!Answers. Image datasets. In Table 8, CH obtains the three best cases in terms of ACC, ARI and NMI on ImageNet-10. However, as shown in Table 9, where Swin is the representation model, the best partitions in terms of ACC, ARI and NMI are found by our IP index again, which indicates that the representation model can influence the clustering result.
Moreover, when comparing the results on ImageNet-10 in Tables 9 and 10, it is interesting to find that: (1) different representation models may have great impacts on the clustering result, e.g., the best ARI score of Swin-based clustering can be as high as 99.64, while it drops dramatically to 0.65 when the representation model is changed to BEiT; (2) even for the same representation model, the optimal clustering results found by different internal indices can vary a lot, e.g., the ARI score of the I index is 0, while our IP index and several other indices can provide an optimal partition with an ARI score of 99.64.

6. CONCLUSIONS

In this paper, we propose internal purity (IP), a novel internal validation index. The IP index uses differential entropy to measure the purity of the clusters and of the cluster centers of a partition. Based on the theoretical analysis, the nature of our IP index helps overcome the effectiveness and efficiency drawbacks of existing internal indices. Extensive experiments over different datasets of the text and image domains also show that our IP index can significantly outperform thirteen other well-known internal indices when selecting the optimal partition and cluster number with different deep representation models. Although the normal distribution assumption is indeed restrictive in our IP index, it does not affect the usage of our method over data after deep representation. Considering that deep representation is already widely used, we believe that our method is still very practical.

A DERIVATION OF EQ. 11

    IP = CP - SP
       = (1/k) \sum_{i=1}^{k} DiffEntropy(H_i) - DiffEntropy(H_\mu)                      (according to Eqs. 7, 8 and 9)
       = (1/k) \sum_{i=1}^{k} [ rank(\Sigma_i)/2 + (rank(\Sigma_i)/2) \ln(2\pi) + (1/2) \ln|\Sigma_i| ]
         - [ rank(\Sigma_\mu)/2 + (rank(\Sigma_\mu)/2) \ln(2\pi) + (1/2) \ln|\Sigma_\mu| ]    (according to Eq. 6)
       = (1/(2k)) \sum_{i=1}^{k} [ (1 + \ln(2\pi)) rank(\Sigma_i) + \ln|\Sigma_i| ]
         - (1/2) [ (1 + \ln(2\pi)) rank(\Sigma_\mu) + \ln|\Sigma_\mu| ]
       = (1/2) [ (1/k) \sum_{i=1}^{k} \ln|\Sigma_i| - \ln|\Sigma_\mu| ]
         + ((1 + \ln(2\pi))/2) [ (1/k) \sum_{i=1}^{k} rank(\Sigma_i) - rank(\Sigma_\mu) ]
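The derivation of Eq. (11) can also be verified numerically; the sketch below (our own, with synthetic clusters) computes IP once directly via Eq. (6) and once via the two parts of Eq. (11), and checks that the two routes coincide:

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy_parts(X, tol=1e-10):
    """Return (rank(Sigma), ln|Sigma|), using the pseudo-determinant."""
    eig = np.linalg.eigvalsh(np.atleast_2d(np.cov(X, rowvar=False)))
    eig = eig[eig > tol]
    return eig.size, float(np.sum(np.log(eig)))

def diff_entropy(rank, logdet):
    """Eq. (6) expressed in terms of rank(Sigma) and ln|Sigma|."""
    return rank / 2 * (1 + np.log(2 * np.pi)) + 0.5 * logdet

# Three hypothetical clusters in 4-D and their centers.
clusters = [rng.normal(loc=5.0 * i, size=(100, 4)) for i in range(3)]
centers = np.stack([c.mean(axis=0) for c in clusters])

# Direct route: IP = CP - SP via Eq. (6).
cp = np.mean([diff_entropy(*entropy_parts(c)) for c in clusters])
sp = diff_entropy(*entropy_parts(centers))
ip_direct = cp - sp

# Eq. (11): log-determinant term (part 1) plus rank term (part 2).
ranks, logdets = zip(*[entropy_parts(c) for c in clusters])
rank_mu, logdet_mu = entropy_parts(centers)
part1 = 0.5 * (np.mean(logdets) - logdet_mu)
part2 = (1 + np.log(2 * np.pi)) / 2 * (np.mean(ranks) - rank_mu)
print(ip_direct, part1 + part2)   # identical up to floating-point error
```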

B EVALUATION METRICS

In our evaluations we used three common external validation indices: Accuracy (ACC), Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). ACC and NMI range between 0 and 1, and ARI ranges between -1 and 1. Larger ACC, ARI and NMI indicate a better clustering result. ACC is computed as follows:

    ACC = (1/N) \sum_{i=1}^{N} \delta(y_i, map(c_i))    (12)

where y_i is the true cluster label, c_i is the cluster label obtained by clustering, and δ(x, y) is an indicator function returning 0 (x ≠ y) or 1 (x = y). map(·) transforms the cluster label c_i to its true cluster label by the Hungarian algorithm (Papadimitriou & Steiglitz, 1998). The larger the ACC value is, the better the clustering result. ARI is computed as follows:

    ARI = (RI - E[RI]) / (1 - E[RI]),  where RI = (a + b) / C(N, 2)    (13)

where C(N, 2) is the total number of possible pairs in the dataset, a is the number of pairs of samples that are in the same ground-truth class and in the same cluster, b is the number of pairs of samples that are in different ground-truth classes and in different clusters, and E[RI] is the expected RI. A larger ARI indicates a better clustering result. NMI is computed as follows:

    NMI(A, B) = I(A, B) / \sqrt{H(A) H(B)}    (14)

where A is the predicted labels and B the ground-truth labels, I is the mutual information and H is the entropy.
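Eq. (12) can be implemented with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`; the helper below is our own sketch (the paper does not give code), maximizing the number of matched samples over all one-to-one label mappings:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC (Eq. 12): best one-to-one mapping of predicted cluster labels
    to ground-truth classes via the Hungarian algorithm."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    # Contingency matrix: w[p, t] = number of samples with prediction p
    # and ground truth t.
    w = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    row, col = linear_sum_assignment(-w)   # negate to maximize matches
    return w[row, col].sum() / len(y_true)

# A pure relabeling scores perfectly; one misplaced sample lowers ACC.
y = np.array([0, 0, 1, 1, 2, 2])
print(clustering_accuracy(y, np.array([2, 2, 0, 0, 1, 1])))  # 1.0
print(clustering_accuracy(y, np.array([2, 2, 0, 1, 1, 1])))  # 5/6
```

ARI and NMI have standard implementations in scikit-learn (`adjusted_rand_score`, `normalized_mutual_info_score`), which is consistent with the paper's stated tooling.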

C MODEL CONFIGURATION

BERT is based on bert-base-uncased, and SBERT is based on all-distilroberta-v1.

D.2 THE CUSTERING RESULTS ON GMM

In this section, we only use GMM to evaluate the clustering results. Due to the random initialization of GMM, the experimental results are averaged over five random runs for each validation index. The evaluation results based on the external indices ACC and ARI are shown in Tables 24, 25, 26, 27, 28 and 29, and the evaluation results in terms of NMI are shown in Tables 30, 31, 32, 33, 34 and 35, where the best results are highlighted in bold; the optimal k value each index selects is provided in Fig. 8, 9, 10, 11, 12 and 13. Obviously, for almost all cases, our IP index outperforms the other indices and is close to the real k value represented by the red dashed line.

In this section, we evaluate the clustering results on five text datasets. Text representations are obtained by computing TF-IDF features on the 1,500 most frequently occurring word stems. As shown in Table 50, our method degrades when the data is represented without satisfying the normal distribution assumption.



a cluster is the square root of the variance of all the attributes.

Calinski-Harabasz index (CH): the ratio of the sum of between-cluster scatter to the sum of within-cluster scatter over all clusters.
I index (I): measures compactness by the distance from the samples in a cluster to their cluster center and separation by the maximum distance between cluster centers.
Dunn index (Dunn): uses the farthest distance between samples within a cluster as the compactness and the distance between the nearest samples in different clusters as the separation.
Silhouette index (S): measures compactness based on the pairwise distances within a cluster and separation based on the average distance between a sample and all samples in the next nearest cluster.
Davies-Bouldin index (DB): measures compactness by the average distance from the samples in a cluster to their cluster center and separation by the distance between cluster centers.
Xie-Beni index (XB): defines compactness as the average center-based distance and separation as the minimal squared distance between cluster centers.
Standard deviation index (SD): measures compactness based on the variances of the samples in a cluster and separation based on the distances between cluster centers.
S Dbw validation index (S Dbw): measures compactness based on the variances of the samples in a cluster and separation based on the average density among clusters.
Clustering validation index based on nearest neighbors (CVNN): measures compactness based on the pairwise distances within a cluster and separation based on the idea of k-nearest neighbors (kNN).
Density-Based Clustering Validation (DBCV): measures compactness based on the maximum edge weight of the minimum spanning tree and separation based on the minimum reachability distance between nodes in different clusters.
Density core based clustering validation index (DCVI): measures compactness based on the maximum weight of the minimum spanning tree and separation based on the minimum of the minimum distances between samples in different clusters.
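To make the compactness-versus-separation pattern shared by the indices above concrete, here is a minimal brute-force sketch of one of the simplest, the Dunn index (the function name and O(N^2) pairwise-distance approach are ours, not a prescribed implementation):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: the smallest distance between samples in different clusters
    (separation) divided by the largest distance between samples in the same
    cluster (compactness). Brute-force O(N^2) pairwise distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    diameter = D[same].max()      # compactness: farthest pair inside any cluster
    separation = D[~same].min()   # separation: nearest pair across clusters
    return separation / diameter
```

Larger values indicate tighter, better-separated clusters. The O(N^2) pairwise distance matrix is exactly the kind of cost that, as noted in the abstract, makes distance-based indices expensive on large datasets.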

Figure 2: The optimal k value found by each index on BERT representations for text datasets.

Figure 3: The optimal k value found by each index on SBERT representations for text datasets.

Figure 4: The optimal k value found by each index on SimCSE representations for text datasets.

Figure 5: The optimal k value found by each index on ViT representations for image datasets.

Figure 6: The optimal k value found by each index on Swin representations for image datasets.

Figure 7: The optimal k value found by each index on BEiT representations for image datasets.

Figure 8: The optimal k value found by each index on BERT representations for text datasets.

Figure 9: The optimal k value found by each index on SBERT representations for text datasets.

Figure 10: The optimal k value found by each index on SimCSE representations for text datasets.

Figure 11: The optimal k value found by each index on ViT representations for image datasets.

Figure 12: The optimal k value found by each index on Swin representations for image datasets.

Figure 13: The optimal k value found by each index on BEiT representations for image datasets.

The description of well-known internal clustering validation indices.

|C_k|), and the complexity of computing the determinant of the covariance matrix is O(d^2.376) (Aho & Hopcroft, 1974). Hence, the complexity of calculating the compactness of a cluster is O(d^2 |C_k| + d^2.376), and the complexity for k clusters is O(N d^2 + k d^2.376). For the separation, we have to calculate the k cluster centroids, the covariance matrix formed by the k cluster centroids, and the corresponding determinant, so the complexity of the separation is O(N d + k d^2 + d^2.376). Usually, d ≪ N and k ≪ N, so the overall complexity of the IP index is O(N d^2), which makes it affordable for large-scale and high-dimensional datasets.
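For concreteness, the per-cluster quantity whose cost this paragraph analyzes can be sketched under a Gaussian assumption. The exact IP formula is not reproduced in this excerpt, so `cluster_entropy` below is only the differential-entropy building block, h = (1/2) log((2πe)^d det(Σ)), with the covariance and log-determinant steps that the complexity analysis counts:

```python
import numpy as np

def cluster_entropy(X):
    """Differential entropy of cluster samples X (shape |C_k| x d) under a
    Gaussian model: h = 0.5 * log((2*pi*e)^d * det(Sigma)).
    Forming Sigma costs O(d^2 |C_k|); the (log-)determinant is the O(d^2.376)
    step from the text (slogdet is cubic in practice, but negligible for d << N)."""
    d = X.shape[1]
    Sigma = np.atleast_2d(np.cov(X, rowvar=False))
    sign, logdet = np.linalg.slogdet(Sigma)  # numerically stable log(det(Sigma))
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)
```

A lower entropy means a denser, "purer" cluster, which is why an entropy-based score can capture density where purely distance-based compactness terms cannot.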

Statistics of text datasets.

Statistics of image datasets.

Number of best clustering results based on counting over Tables 5, 6, 7, 8, 9 and 10.

BERT based clustering results on five text datasets. (15 cases: 5 datasets × 3 evaluation metrics; ACC, ARI, NMI and opt k are reported per dataset.)

SBERT based clustering results on five text datasets. (15 cases: 5 datasets × 3 evaluation metrics; ACC, ARI, NMI and opt k are reported per dataset.)

SimCSE based clustering results on five text datasets. (15 cases: 5 datasets × 3 evaluation metrics; ACC, ARI, NMI and opt k are reported per dataset.)

ViT based clustering results on five image datasets. (15 cases: 5 datasets × 3 evaluation metrics; ACC, ARI, NMI and opt k are reported per dataset.)

Swin based clustering results on five image datasets. (15 cases: 5 datasets × 3 evaluation metrics; ACC, ARI, NMI and opt k are reported per dataset.)

BEiT based clustering results on five image datasets. (15 cases: 5 datasets × 3 evaluation metrics; ACC, ARI, NMI and opt k are reported per dataset.)

Jiaming Xu, Bo Xu, Peng Wang, Suncong Zheng, Guanhua Tian, and Jun Zhao. Self-taught convolutional neural networks for short text clustering. Neural Networks, 2017.

Rui Xu and Donald Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005.

Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: simultaneous deep learning and clustering. In ICML, 2017.

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.

Tianyi Zhou, Dacheng Tao, and Xindong Wu. Manifold elastic net: a unified framework for sparse dimension reduction. Data Mining and Knowledge Discovery, 2011.

Mengjie Zu, Arunkumar Bupathy, Daan Frenkel, and Srikanth Sastry. Information density, structure and entropy in equilibrium and non-equilibrium systems. Journal of Statistical Mechanics: Theory and Experiment, 2020.

Number of best clustering results based on counting over K-Means, GMM, AHC and DBSCAN. The evaluation results are shown in Tables …, 20, 21, 22 and 23, where the best results are highlighted in bold. Moreover, the optimal k selected by each index can be seen in Fig. 2, 3, 4, 5, 6 and 7. For almost all cases, our IP index outperforms the other indices and is close to the real k value, represented by the red dashed line.
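The ACC values reported throughout these tables are conventionally computed by matching predicted cluster IDs to ground-truth classes one-to-one via the Hungarian algorithm before scoring. A sketch of that standard computation (assuming this paper follows the common definition; the function name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """External ACC: find the one-to-one mapping of predicted clusters to true
    classes that maximizes agreement (Hungarian algorithm on the contingency
    table), then return the fraction of correctly mapped samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    contingency = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    rows, cols = linear_sum_assignment(-contingency)  # negate to maximize overlap
    return contingency[rows, cols].sum() / y_true.size
```

This is why a clustering that merely permutes the label IDs still reaches ACC = 100, as in the near-perfect ImageNet-10 rows above.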

BERT based K-Means clustering results on five text datasets.

SBERT based K-Means clustering results on five text datasets.

SimCSE based K-Means clustering results on five text datasets.

ViT based K-Means clustering results on five image datasets.

Swin based K-Means clustering results on five image datasets.

BEiT based K-Means clustering results on five image datasets.

BERT based K-Means clustering results in terms of NMI on five text datasets.

BERT based GMM clustering results on five text datasets.

SBERT based GMM clustering results on five text datasets.

SimCSE based GMM clustering results on five text datasets.

ViT based GMM clustering results on five image datasets.

Swin based GMM clustering results on five image datasets.

BEiT based GMM clustering results on five image datasets.

BERT based DBSCAN clustering results on five text datasets.

SBERT based DBSCAN clustering results on five text datasets.

SimCSE based DBSCAN clustering results on five text datasets.

ViT based DBSCAN clustering results on five image datasets.

Clustering results in terms of ACC and ARI on five UCI datasets.

TF-IDF based clustering results in terms of ACC and ARI on five text datasets.

D.3 THE CLUSTERING RESULTS ON AHC

In this section, we only use AHC to evaluate the clustering results. The evaluation results in terms of the external indices ACC, ARI and NMI are shown in Tables 36, 37, 38, 39, 40 and 41, where the best results are highlighted in bold. Moreover, the optimal k selected by each index, i.e., opt k, and the true cluster number of each dataset, i.e., dataset-k, are provided in the tables. For almost all cases, our IP index outperforms the other indices and is close to the real k value.

D.4 THE CLUSTERING RESULTS ON DBSCAN

In this section, we only use DBSCAN to evaluate the clustering results. The evaluation results in terms of the external indices ACC, ARI and NMI are shown in Tables 42, 43, 44, 45, 46 and 47, where the best results are highlighted in bold and the top-3 best results are underlined. Moreover, the optimal k selected by each index, i.e., opt k, and the true cluster number of each dataset, i.e., dataset-k, are provided in the tables. Our index is either on par with or slightly better than the competing indices.

E EVALUATION RESULTS ON UCI DATASETS

In this section, we compare our index with the other 13 indices on five real-world datasets from the UCI Repository (Frank, 2010). Table 48 lists the basic statistics of these datasets. The evaluation results in terms of the external indices ACC, ARI and NMI are shown in Table 49, where the best results are highlighted in bold, and the optimal k selected by each index, i.e., opt k, together with the true cluster number of each dataset, i.e., dataset-k, is also provided. Our index is either on par with or slightly better than the competing indices.

