INTERNAL PURITY: A DIFFERENTIAL ENTROPY BASED INTERNAL VALIDATION INDEX FOR CLUSTERING VALIDATION

Abstract

In an effective process of cluster analysis, it is indispensable to validate the goodness of different partitions after clustering. Existing internal validation indices are implemented based on distance and variance, which cannot capture the real "density" of a cluster. Moreover, the time complexity of distance-based indices is usually too high for them to be applied to large datasets. Therefore, we propose a novel internal validation index based on differential entropy, named internal purity (IP). The proposed IP index can effectively measure the purity of a cluster without using external cluster information, and successfully overcomes the drawbacks of existing internal indices. Based on six powerful deep pre-trained representation models, we use four basic clustering algorithms to compare our index with thirteen other well-known internal indices on five text and five image datasets. The results show that, out of 90 test cases in total, our IP index returns the optimal clustering result in 51 cases, while the second-best index reports the optimal partition in merely 30 cases, which demonstrates the significant superiority of our IP index for validating the goodness of clustering results. Moreover, theoretical analyses of the effectiveness and efficiency of the proposed index are also provided.

1. INTRODUCTION

The goal of clustering is to divide a set of samples into different clusters such that similar samples are grouped in the same cluster. As one of the most fundamental tasks in machine learning, clustering has been extensively studied in many fields, such as text mining (Guan et al., 2012a), image analysis (Zhou et al., 2011) and pattern recognition (Guan et al., 2012b). With the advance of deep learning, it has been shown that running any classical clustering algorithm (e.g., K-means) over the learned representation can yield better results (Xie et al., 2016; Huang et al., 2020; Dang et al., 2021). The main reason is that deep neural networks can effectively extract highly non-linear features that are helpful for clustering. However, besides the data representation, the outcome of clustering is still affected by other factors (Xu & Wunsch, 2005; Yang et al., 2017). For example, different clustering algorithms usually produce different clustering results on the same dataset. Even for the same algorithm, the choice of parameters may affect the final clustering result (Halkidi et al., 2000). Thus, within an effective process of cluster analysis, it is inevitable to validate the goodness of different partitions after clustering and select the best one for the application at hand. Here, the best one refers not only to the proper parameters but also to the partition that best fits the underlying data structure. In fact, many clustering validation measures have been proposed over the past years, and they can be categorized into external validation and internal validation (Wu et al., 2009; Liu et al., 2013). Specifically, external validation indices assume the "true" cluster information is known in advance, and they use this supervised information to quantify how good the obtained partition is with respect to the prior ground-truth clusters. However, such prior knowledge is rarely available in many real scenarios.
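To make the contrast concrete, an external validation measure can be sketched in a few lines. The sketch below computes the classical external purity score, which requires ground-truth labels, precisely the knowledge that is unavailable in the internal setting this paper targets. The function name and toy data are illustrative, not taken from the paper.

```python
from collections import Counter

def external_purity(labels_true, labels_pred):
    """Classical external purity: the fraction of samples whose
    cluster's majority ground-truth class matches their own class.
    Note that it needs labels_true, which internal indices must do
    without."""
    # Group the ground-truth labels by predicted cluster.
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    # For each cluster, count the size of its majority class.
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(labels_true)

# Toy partition: cluster 0 is pure, cluster 1 mixes two classes.
truth = [0, 0, 0, 1, 1, 2]
pred  = [0, 0, 0, 1, 1, 1]
print(external_purity(truth, pred))  # -> (3 + 2) / 6 = 0.8333...
```

When `labels_true` is unknown, as in most real deployments, a score of this kind cannot be computed at all, which is why the remainder of the paper focuses on internal indices.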
In such cases, internal validation indices become the only option for evaluating the clustering result. Internal clustering validation usually assesses the clustering result based on the following two criteria (Liu et al., 2013; Fraley & Raftery, 1998): (1) compactness, which measures how closely related are

