DISSECTING GRAPH MEASURES PERFORMANCE FOR NODE CLUSTERING IN LFR PARAMETER SPACE

Abstract

Graph measures can be used for node clustering with metric clustering algorithms. There are multiple measures applicable to this task, and which one performs better is an open question. We study the performance of 25 graph measures on generated graphs with different parameters. While measure comparisons are usually limited to a general measure ranking on a particular dataset, we aim to explore the performance of various measures depending on graph features. Using an LFR graph generator, we create a dataset of ∼7500 graphs covering the whole LFR parameter space. For each graph, we assess the quality of clustering with the kernel k-means algorithm for every considered measure. We determine the best measure for every area of the parameter space. We find that the parameter space consists of distinct zones where one particular measure is the best. We analyze the geometry of the resulting zones and describe it with simple criteria. Given particular graph parameters, this allows us to choose the best measure to use for clustering.

1. INTRODUCTION

Graph node clustering is one of the central tasks in graph structure analysis. It provides a partition of nodes into disjoint clusters, i.e., groups of nodes characterized by strong mutual connections. It can be of practical use for graphs representing real-life systems, such as social networks or industrial processes. Clustering allows one to infer some information about the system: the nodes of the same cluster are highly similar, while the nodes of different clusters are dissimilar. The technique can be applied without any labeled data to extract important insights about a network.

There are different approaches to clustering, including ones based on modularity optimization (Newman & Girvan, 2004; Blondel et al., 2008), the label propagation algorithm (Raghavan et al., 2007; Barber & Clark, 2009), the Markov cluster process (Van Dongen, 2000; Enright et al., 2002), and spectral clustering (Von Luxburg, 2007). In this work, we use a different approach based on choosing a closeness measure on a graph, which allows one to use any metric clustering algorithm (e.g., Yen et al., 2009). The choice of the measure significantly affects the quality of clustering. Classical measures are the Shortest Path (Buckley & Harary, 1990) and the Commute Time (Göbel & Jagers, 1974) distances. The former is the minimum number of edges in a path between a given pair of nodes. The latter is the expected number of steps from one node to the other and back in a random walk on the graph. There are a number of other measures, including recent ones (e.g., Estrada & Silver, 2017; Jacobsen & Tien, 2018); many of them are parametric. Although graph measures are compatible with any metric algorithm, in this paper we restrict ourselves to the kernel k-means algorithm (e.g., Fouss et al., 2016). We base our research on a generated set of graphs. There are various algorithms to generate graphs with community structures.
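The two classical distances mentioned above can be illustrated on a small graph. The sketch below (not from the paper's codebase) computes the Shortest Path distance directly and the Commute Time distance through the standard identity CT(i, j) = vol(G)(L⁺ᵢᵢ + L⁺ⱼⱼ − 2L⁺ᵢⱼ), where L⁺ is the Moore–Penrose pseudoinverse of the graph Laplacian and vol(G) is twice the number of edges:

```python
import numpy as np
import networkx as nx

# Toy graph: two triangles joined by the bridge edge (2, 3).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)])

# Shortest Path distance: minimum number of edges between two nodes.
sp = nx.shortest_path_length(G, source=0, target=5)

# Commute Time distance via the pseudoinverse of the Laplacian:
# CT(i, j) = vol(G) * (L+_ii + L+_jj - 2 * L+_ij), vol(G) = 2|E|.
L = nx.laplacian_matrix(G).toarray().astype(float)
L_pinv = np.linalg.pinv(L)
vol = 2 * G.number_of_edges()
ct = vol * (L_pinv[0, 0] + L_pinv[5, 5] - 2 * L_pinv[0, 5])

print(sp)  # 3 edges on the path 0-2-3-5
print(ct)  # expected round-trip length of a random walk between 0 and 5
```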
The well-known ones are the Stochastic Block Model (Holland et al., 1983) and the Lancichinetti–Fortunato–Radicchi benchmark (Lancichinetti et al., 2008) (hereafter, LFR). The first one is an extension of the Erdős–Rényi model with different intra- and intercluster probabilities of edge creation. The second one involves power law distributions of node degrees and community sizes. There are other generation models, e.g., Naive Scale-free Clustering (Pasta & Zaidi, 2017). We choose the LFR model: although it misses some key properties of real graphs, like the diameter or the clustering coefficient, this model has been proven to be effective in meta-learning (Prokhorenkova, 2019).

There are a lot of measure benchmarking studies considering node classification and clustering for both generated graphs and real-world datasets (Fouss et al., 2012; Sommer et al., 2016; 2017; Avrachenkov et al., 2017; Ivashkin & Chebotarev, 2016; Guex et al., 2018; 2019; Aynulin, 2019a; b; Courtain et al., 2020; Leleux et al., 2020), etc. Despite a large number of experimental results, theoretical results largely remain a matter of the future. One of the most interesting theoretical results on graph measures is the work by Luxburg et al. (2010), where some unattractive features of the Commute Time distance on large graphs were explained theoretically, and a reasonable amendment was proposed to fix the problem. Beyond the complexity of such proofs, there is still very little empirical understanding of what effects need to be proven. Our empirical work has two main differences from the previous ones. First, we consider a large number of graph measures, which for the first time gives a fairly complete picture. Second, unlike those studies, which conclude with a global leaderboard, we look for the leading measures for each set of the LFR parameters. We aim to explore the performance of the 25 most popular measures in the graph clustering problem on a set of generated graphs with various parameters.
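As a point of reference for how such LFR graphs can be produced, the sketch below uses the `LFR_benchmark_graph` generator shipped with networkx (not the paper's own generation code); the parameter values follow the networkx documentation example, where `tau1` and `tau2` are the power-law exponents of the degree and community-size distributions and `mu` is the mixing parameter:

```python
import networkx as nx

# Generate one LFR benchmark graph with planted community structure.
# tau1/tau2: power-law exponents of degrees / community sizes;
# mu: fraction of each node's edges that go outside its community.
G = nx.LFR_benchmark_graph(
    250, tau1=3, tau2=1.5, mu=0.1,
    average_degree=5, min_community=20, seed=10,
)

# Ground-truth communities are stored as a node attribute "community".
communities = {frozenset(G.nodes[v]["community"]) for v in G}
print(G.number_of_nodes(), len(communities))
```

Note that the generator may fail to converge for some parameter combinations (it raises an exception after a maximum number of iterations), so building a dataset that covers the whole parameter space requires handling such failures.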
We assess the quality of clustering with every considered measure and determine the best measure for every region of the graph parameter space. Our contributions are as follows:
• We generate a dataset of ∼7500 graphs covering the whole parameter space of the LFR generator;
• We consider a broad set of measures and rank them by clustering performance on this dataset;
• We find the regions of certain measure leadership in the graph parameter space;
• We determine the graph features that are responsible for measure leadership;
• We check the applicability of the results on real-world graphs.
Our framework for clustering with graph measures, as well as the collected dataset, is available at link_is_not_available_during_blind_review.

2.1. KERNEL k-MEANS

The original k-means algorithm (Lloyd, 1982; MacQueen et al., 1967) clusters objects in Euclidean space. It requires the coordinates of the objects to determine the distances between them and the centroids. The algorithm can be generalized to use only the degree of closeness between the objects, without defining a particular space. This technique is called the kernel trick; it is usually used to bring non-linearity to linear algorithms. The algorithm that uses the kernel trick is called kernel k-means (see, e.g., Fouss et al., 2016). In the graph node clustering scenario, we can use graph measures as kernels for kernel k-means. The number of clusters is assumed to be known in advance, and we need to set the initial state of the centroids. The results of clustering with k-means are very sensitive to this initialization. Usually, the algorithm is run several times with different initial states (trials), and the best trial is chosen. There are different approaches to the initialization; we consider three of them: random data points, k-means++ (Arthur & Vassilvitskii, 2006), and random partition. We combine all these strategies to reduce the impact of the initialization strategy on the result.
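The kernel trick works here because the squared distance from a point to a cluster centroid in the (implicit) feature space can be written entirely in terms of kernel values. A minimal sketch of kernel k-means along these lines is given below; it is an illustration with random-partition initialization and a single trial, not the paper's implementation:

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    """Minimal kernel k-means: K is an n x n kernel matrix."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)  # random-partition init
    for _ in range(n_iter):
        # Squared distance to each centroid in feature space:
        # ||phi(i) - m_c||^2 = K_ii - 2/|c| sum_{j in c} K_ij
        #                      + 1/|c|^2 sum_{j,l in c} K_jl
        dist = np.zeros((n, n_clusters))
        for c in range(n_clusters):
            mask = labels == c
            size = max(mask.sum(), 1)
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, mask].sum(axis=1) / size
                          + K[np.ix_(mask, mask)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Sanity check with a linear kernel on two well-separated blobs.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (10, 2)),
               np.random.default_rng(2).normal(5, 0.1, (10, 2))])
labels = kernel_kmeans(X @ X.T, 2)
```

In the setting of the paper, `K` would be the (Gram) matrix of a graph measure rather than a linear kernel, and several trials with different initializations would be run.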

2.2. CLOSENESS MEASURES

For a given graph G, V (G) is the set of its vertices and A is its adjacency matrix. A measure on G is a function κ : V (G) × V (G) → R, which takes two nodes and returns either a closeness (bigger means closer) or a distance (bigger means farther). A kernel on a graph is a node closeness measure that has an inner product representation. Any symmetric positive semidefinite matrix is an inner product matrix (also called a Gram matrix). A kernel matrix K is a square matrix that contains similarities for all pairs of nodes in a graph.
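A concrete example of such a kernel matrix is the Commute Time kernel, the Moore–Penrose pseudoinverse of the graph Laplacian. The sketch below (our illustration, using Zachary's karate club graph as a stand-in dataset) builds this matrix and verifies the Gram-matrix property, i.e., symmetry and positive semidefiniteness:

```python
import numpy as np
import networkx as nx

# The Commute Time kernel is the pseudoinverse of the graph Laplacian.
G = nx.karate_club_graph()
L = nx.laplacian_matrix(G).toarray().astype(float)
K = np.linalg.pinv(L)

# A valid kernel matrix must be a Gram matrix: symmetric and PSD.
assert np.allclose(K, K.T)
eigenvalues = np.linalg.eigvalsh(K)
assert eigenvalues.min() > -1e-9  # PSD up to numerical tolerance
```

Such a matrix K can be passed directly to kernel k-means; distance measures, in contrast, generally need to be transformed into kernels before use.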

