DISSECTING GRAPH MEASURES PERFORMANCE FOR NODE CLUSTERING IN LFR PARAMETER SPACE

Abstract

Graph measures can be used for graph node clustering using metric clustering algorithms. There are multiple measures applicable to this task, and which one performs best is an open question. We study the performance of 25 graph measures on generated graphs with different parameters. While measure comparisons are usually limited to a general ranking of measures on a particular dataset, we aim to explore how the performance of various measures depends on graph features. Using an LFR graph generator, we create a dataset of ∼7500 graphs covering the whole LFR parameter space. For each graph, we assess the quality of clustering with the k-means algorithm for every considered measure. We determine the best measure for every area of the parameter space. We find that the parameter space consists of distinct zones, in each of which one particular measure is the best. We analyze the geometry of the resulting zones and describe it with simple criteria. Given particular graph parameters, this allows us to choose the best measure to use for clustering.

1. INTRODUCTION

Graph node clustering is one of the central tasks in graph structure analysis. It provides a partition of nodes into disjoint clusters, which are groups of nodes that are characterized by strong mutual connections. It can be of practical use for graphs representing real-life systems, such as social networks or industrial processes. Clustering allows one to infer some information about the system: the nodes of the same cluster are highly similar, while the nodes of different clusters are dissimilar. The technique can be applied without any labeled data to extract important insights about a network. There are different approaches to clustering, including ones based on modularity optimization (Newman & Girvan, 2004; Blondel et al., 2008), the label propagation algorithm (Raghavan et al., 2007; Barber & Clark, 2009), the Markov cluster process (Van Dongen, 2000; Enright et al., 2002), and spectral clustering (Von Luxburg, 2007). In this work, we use a different approach based on choosing a closeness measure on a graph, which allows one to use any metric clustering algorithm (e.g., Yen et al., 2009). The choice of the measure significantly affects the quality of clustering. Classical measures are the Shortest Path (Buckley & Harary, 1990) and the Commute Time (Göbel & Jagers, 1974) distances. The former is the minimum number of edges in a path between a given pair of nodes. The latter is the expected number of steps from one node to the other and back in a random walk on the graph. There are a number of other measures, including recent ones (e.g., Estrada & Silver, 2017; Jacobsen & Tien, 2018), many of which are parametric. Although graph measures are compatible with any metric algorithm, in this paper we restrict ourselves to the kernel k-means algorithm (e.g., Fouss et al., 2016). We base our research on a generated set of graphs. There are various algorithms to generate graphs with community structures.
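To make the two classical measures defined above concrete, the following is a minimal sketch that computes them directly from their definitions on a small unweighted, connected graph. The function names, the adjacency-list representation, and the toy graph are assumptions made for illustration only and are not taken from the paper. The Commute Time is obtained by solving the standard first-step equations for random-walk hitting times exactly over rationals. This is practical only for tiny graphs and is not the method used in the experiments.

```python
from collections import deque
from fractions import Fraction


def shortest_path_distance(adj, source, target):
    """Minimum number of edges on a path from source to target (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    raise ValueError("target is not reachable from source")


def hitting_time(adj, source, target):
    """Expected number of random-walk steps to first reach target from source.

    Solves h[target] = 0 and h[u] = 1 + mean(h[v] for v in adj[u]) exactly.
    Assumes a connected, undirected, simple graph.
    """
    if source == target:
        return Fraction(0)
    nodes = [u for u in adj if u != target]
    index = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    # Row for u encodes: deg(u) * h[u] - sum of h[v] over neighbours v != target = deg(u)
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for u in nodes:
        i = index[u]
        rows[i][i] = Fraction(len(adj[u]))
        rows[i][n] = Fraction(len(adj[u]))
        for v in adj[u]:
            if v != target:
                rows[i][index[v]] -= 1
    # Gauss-Jordan elimination with exact rational arithmetic
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return rows[index[source]][n]


def commute_time(adj, i, j):
    """Expected number of steps from i to j and back to i."""
    return hitting_time(adj, i, j) + hitting_time(adj, j, i)


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(shortest_path_distance(adj, 0, 3))  # 2
print(commute_time(adj, 0, 3))            # 40/3
```

In this toy graph, nodes 0 and 3 are two edges apart, while a random walk needs on average 40/3 steps to travel from 0 to 3 and back. Unlike the Shortest Path distance, the Commute Time takes into account all paths between the two nodes rather than only the shortest one.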
Well-known graph generators of this kind are the Stochastic Block Model (Holland et al., 1983) and the Lancichinetti-Fortunato-Radicchi benchmark (Lancichinetti et al., 2008) (hereafter, LFR). The first is an extension of the Erdős-Rényi model with different intra- and intercluster probabilities of edge creation. The second involves power law distributions of node degrees and community sizes. There are other generation models, e.g., Naive Scale-free Clustering (Pasta & Zaidi, 2017). We choose the LFR model: although it misses some key properties of real graphs, such as diameter or the clustering coefficient, this model has been shown to be effective in meta-learning (Prokhorenkova, 2019).
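As an illustration of the first generator mentioned above, the following is a minimal sketch of a Stochastic Block Model restricted to two edge probabilities: one for node pairs within a cluster and one for pairs across clusters. The function name and the parameters `sizes`, `p_in`, `p_out`, and `seed` are hypothetical names chosen for this example. The sketch does not reproduce the LFR generator used in this work, which additionally draws node degrees and community sizes from power law distributions.

```python
import random
from itertools import combinations


def stochastic_block_model(sizes, p_in, p_out, seed=None):
    """Sample an undirected graph with planted clusters.

    sizes  -- list of cluster sizes
    p_in   -- probability of an edge between two nodes of the same cluster
    p_out  -- probability of an edge between nodes of different clusters
    Returns (labels, edges), where labels[u] is the cluster index of node u.
    """
    rng = random.Random(seed)
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    edges = []
    # Each unordered node pair is considered once, as in the Erdős-Rényi model,
    # but the edge probability depends on whether the pair shares a cluster.
    for u, v in combinations(range(len(labels)), 2):
        p = p_in if labels[u] == labels[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return labels, edges


# Hypothetical parameters: three clusters with dense intracluster connections
labels, edges = stochastic_block_model([20, 30, 50], p_in=0.3, p_out=0.02, seed=0)
intra = sum(labels[u] == labels[v] for u, v in edges)
print(len(labels), "nodes,", len(edges), "edges,", intra, "of them intracluster")
```

Setting `p_in` equal to `p_out` recovers the Erdős-Rényi model, and the gap between the two probabilities controls how pronounced the community structure is.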

