CLUSTERING FOR DIRECTED GRAPHS USING PARAMETRIZED RANDOM WALK DIFFUSION KERNELS

Abstract

Clustering based on the random walk operator has proven effective for undirected graphs, but its generalization to directed graphs (digraphs) is much more challenging. Although the random walk operator is well-defined for digraphs, in most cases such digraphs are not strongly connected, and hence the associated random walks are not irreducible, a crucial property for clustering that holds naturally in the undirected setting. To remedy this, the usual workaround is to either naively symmetrize the adjacency matrix or to replace the natural random walk operator by the Pagerank random walk operator, but both can discard valuable information carried by the graph directionality and edge density. In this paper, we introduce a new clustering framework, the Parametrized Random Walk Diffusion Kernel Clustering (P-RWDKC), which is suitable for handling both directed and undirected graphs. P-RWDKC is based on diffusion geometry (Coifman & Lafon, 2006) and the generalized spectral clustering framework (Sevi et al., 2022). Accordingly, we propose an algorithm that automatically reveals the cluster structure at a given scale, by considering the random walk dynamics associated with a parametrized graph operator and by estimating its critical diffusion time. Experiments on K-NN graphs constructed from real-world datasets and on real-world graphs show that, in most of the tested cases, our clustering approach has superior performance compared to existing approaches.

1. INTRODUCTION

Clustering is a fundamental unsupervised learning task whose aim is to analyze and reveal the cluster structure of unlabeled datasets, with widespread applications in machine learning, network analysis, biology, and other fields (Kiselev et al., 2017; McFee & Ellis, 2014). Clustering for data represented as a graph has been formulated in various ways. A well-established one consists in minimizing a functional of the graph-cut (Von Luxburg, 2007; Shi & Malik, 2000), leading to the spectral clustering (SC) framework and various algorithms. SC is simple and effective, but has important limitations (Nadler & Galun, 2006) that make it unreliable in a number of common data regimes. Overcoming these limitations has been a focus of the machine learning community (Tremblay et al., 2016; Zhang & Rohe, 2018; Dall'Amico et al., 2021; Sevi et al., 2022). A well-studied clustering approach for high-dimensional data, which is of particular interest for this work, uses the operator associated with a random walk (in other terms, with a Markov chain), called the random walk operator. Meilă & Shi (2001) viewed the pairwise similarities between datapoints as edge flows of a Markov chain and proposed an ergodic random walk interpretation of spectral clustering. However, and beyond SC, the first to conceive the idea of turning the distance matrix between high-dimensional data into a Markov process were Tishby & Slonim (2000). More specifically, they proposed to examine the decay of mutual information during the relaxation of the Markov process. During the relaxation procedure, the clusters emerge as quasi-stable structures, and are then extracted using the information bottleneck method (Tishby et al., 2000). Azran & Ghahramani (2006) proposed to estimate the number of data clusters by estimating the diffusion time of the random walk operator that reveals the most significant cluster structure.
Lin & Cohen (2010) then proposed to find a low-dimensional embedding of the data using the truncated power iteration of a random walk operator derived from the pairwise similarity matrix. Clustering high-dimensional data using a random walk operator through the lens of diffusion geometry (Coifman & Lafon, 2006; Coifman et al., 2005) has also been investigated. Nadler et al. (2006) proposed a unifying probabilistic diffusion framework based on the probabilistic interpretation of SC and dimensionality reduction algorithms using the eigenvectors of the normalized graph Laplacian. Given the pairwise similarity matrix built from high-dimensional data points, they defined a distance function between any two points based on the random walk on the graph, called the diffusion distance (Coifman & Lafon, 2006; Pons & Latapy, 2005), and showed that the low-dimensional representation of the data by the first few eigenvectors of the corresponding random walk operator is optimal under a certain criterion. Recently, an unsupervised clustering methodology has been proposed based on the diffusion geometry framework (Maggioni & Murphy, 2019; Murphy & Polk, 2022), where it has been demonstrated how the diffusion time of the random walk operator can be exploited to successfully cluster datasets for which k-means, spectral clustering, or density-based clustering methods fail. Although there has been a constant development of clustering approaches based on the random walk operator, all such efforts (like those mentioned above) have been proposed for undirected graphs. Moreover, the motivation of most of them is to run the random walk forward so as to avoid the costly eigendecomposition of the graph Laplacian. The extension of clustering approaches based on the random walk operator to digraphs is much more subtle and challenging. As presented earlier, random walk-based clustering approaches rely either on the eigenvectors of the random walk operator or on its iterated powers.
In the directed setting, the random walk is well-defined, but, unlike in the undirected case, it is in general not reversible (Levin & Peres, 2017). Consequently, the associated eigenvectors are possibly complex, which makes their interpretation and use difficult in the context of clustering. A possible workaround would be to consider the approach based on the iterated powers of the random walk operator, which would allow one to obtain a real-valued embedding and thus avoid the use of complex eigenvectors. However, a second, much more subtle and problematic bottleneck arises: the random walk's irreducibility. While in the undirected case the random walk is irreducible, i.e. any graph vertex can be reached from any other vertex, this is not the case in general for random walks on digraphs. To overcome the irreducibility issue, either a symmetrization procedure is usually employed, in the case where the digraph is derived from a K-NN graph construction, or the original random walk operator is replaced by the operator of the Pagerank or teleporting random walk (Page et al., 1999). With both workarounds, valuable information can potentially be discarded. To address this problem, we present a new clustering algorithm based on the diffusion geometry framework, which is suitable for any digraph, whether obtained by K-NN constructions or given as a real digraph representing asymmetric relationships between vertices (e.g. citation graphs). Our approach stems from a new type of graph Laplacians (Sevi et al., 2022) that is parametrized by an arbitrary vertex measure, capable of encoding digraph information, and from which we derive a random walk operator. Moreover, we can exploit the diffusion time of this parametrized random walk to successfully reveal clusters at different scales.
The contribution of this work is manifold: i) We propose a new similarity kernel operator, which we term random walk diffusion kernel (RWDK), that derives directly from the original definition of the diffusion distance. The latter allows us to theoretically justify the use of the random walk operator as a fundamental graph operator for clustering. ii) We generalize this kernel by proposing the use of a parametrized random walk operator, which allows us to extend the diffusion distance to digraphs. iii) From there, we present our clustering algorithm on digraphs, the parametrized random walk diffusion kernel clustering, based on the parametrized RWDK operator considered as a data embedding. iv) We propose a general method for estimating the diffusion time that best reveals a given number of clusters. Finally, we show that our approach is efficient on both synthetic and real-world datasets, as it outperforms existing methods in most of the tested cases.

Graph theory and graph functions. Let G = (V, E, w) be a weighted directed graph (digraph), where V is the finite set of N = |V| vertices, and E ⊆ V × V is the finite set of edges. Each edge (i, j) is an ordered vertex pair indicating the direction of a link from vertex i to vertex j. With a slight abuse of notation, we sometimes write i ∈ V for a vertex index i ∈ {1, ..., N}. The edge weight function w : V × V → R_+ associates a nonnegative real value to every vertex pair: w(i, j) > 0 iff (i, j) ∈ E, otherwise w(i, j) = 0. A digraph G is represented by a weighted adjacency matrix W = (w(i, j))_{i,j=1}^{N} ∈ R_+^{N×N}, where w(i, j) is the weight of the edge (i, j). We define the out-degree and the in-degree of the i-th vertex by d_{out,i} = ∑_{j=1}^{N} w(i, j) and d_{in,i} = ∑_{j=1}^{N} w(j, i), respectively. Also, D_ν = diag(ν) denotes the square diagonal matrix with the elements of the input vector ν on its diagonal. Consider a graph function f mapping all of its vertices to an N-dimensional vector: f = [f(i)]_{i∈V}^{T} ∈ R^N.
We assume that graph functions are defined in ℓ²(V, ν), the Hilbert space of functions defined over the vertex set V of G, endowed with the inner product associated with an arbitrary positive measure ν. Let δ_i ∈ {0,1}^{N×1} be the vector output of the Kronecker delta function at i ∈ V. Any function ν : V → R_+, associating a nonnegative value to each graph vertex, can be regarded as a positive vertex measure.

Random walk fundamentals. What we call in short a random walk on a weighted graph G is defined more formally as a natural random walk on the graph, i.e. a homogeneous Markov chain X = (X_t)_{t≥0} with finite state space V and state transition probabilities proportional to the edge weights. The entries of the transition matrix P = [p(i, j)]_{i,j∈V} are defined by: p(i, j) = P(X_{t+1} = j | X_t = i) = w(i, j) / ∑_{z∈V} w(i, z). Algebraically, the transition matrix P ∈ R^{N×N} can be expressed as P = D_out^{-1} W, where D_out = D_{d_out} (respectively, D_in = D_{d_in}), and its spectrum satisfies sp(P) ⊂ [-1, 1]. For a strongly connected digraph G, the random walk X is irreducible. If, in addition, the irreducible random walk X satisfies the aperiodicity condition, then X is ergodic, and therefore, as t → ∞, the probability measures p_t(i, *) = δ_i^T P^t, for all i ∈ V, converge toward a unique stationary distribution denoted by the row vector π ∈ R_+^N (Brémaud, 2013). A random walk X with stationary distribution π and transition matrix P is called reversible if π(i) p(i, j) = π(j) p(j, i), for all i, j ∈ V. In the undirected setting, d_{out,i} = d_{in,i} = d_i, where d ∈ R_+^{N×1} is the vector of vertex degrees, and D_d = diag(d) is the square diagonal matrix with the elements of d on its diagonal. Moreover, the stationary distribution is proportional to the vertex degree distribution, i.e. π ∝ d.

Parametrized graph Laplacians. A notable example of parametrized operators that we use later is the generalized graph Laplacian from Sevi et al.
(2022), which is a new type of graph operator for (but not restricted to) digraphs. Relevant theoretical background is provided in Appendix A.4.1.

Definition 2.1. Generalized graph Laplacians (Sevi et al., 2022). Let P be the transition matrix of a random walk on a digraph G. Under an arbitrary positive vertex measure ν on G, consider the positive vertex measure ξ = ν^T P. Let also the diagonal matrices D_ν = diag(ν), D_ξ = diag(ξ) and D_{ν+ξ} = diag(ν + ξ). Then, we can define three generalized Laplacians of G as follows:

generalized random walk Laplacian: L_{RW,(ν)} = I − (D_{ν+ξ})^{-1} (D_ν P + P^T D_ν)   (1)
unnormalized generalized Laplacian: L_{(ν)} = D_{ν+ξ} − (D_ν P + P^T D_ν)
normalized generalized Laplacian: 𝓛_{(ν)} = D_{ν+ξ}^{-1/2} L_{(ν)} D_{ν+ξ}^{-1/2}.

L_{RW,(ν)}, L_{(ν)}, and 𝓛_{(ν)} are parametrized by an arbitrary vertex measure ν, which gives them the modeling capacity to encode the graph directionality and edge density through the random walk dynamics of the original digraph. When the transition matrix P is irreducible and the vertex measure ν is the ergodic measure π, they correspond to the directed graph Laplacians in Chung (2005).

3. PARAMETRIZED RANDOM WALK OPERATOR ON GRAPHS

In this section, we introduce the parametrized random walk operator that we propose, and we show how it is derived from the generalized graph Laplacian (presented at the end of Sec. 2).

Definition 3.1. Parametrized random walk operator (P-RW). Let P be the transition matrix of a random walk on a digraph G. Under an arbitrary positive vertex measure ν on G, consider the positive vertex measure ξ = ν^T P. Let also the diagonal matrices D_ν = diag(ν), D_ξ = diag(ξ) and D_{ν+ξ} = diag(ν + ξ). Finally, let X_ν be the random walk on G with the associated random walk operator (transition matrix) P_{(ν)} defined by: P_{(ν)} = (D_{ν+ξ})^{-1} (D_ν P + P^T D_ν).   (4)

It is easy to verify that P_{(ν)} is a transition matrix (see Appendix A.1.2) and that it relates to the generalized random walk Laplacian as follows.

Definition 3.2. Relation between L_{RW,(ν)} and P_{(ν)}. Let P be the transition matrix of a random walk on a digraph G, and let ν be an arbitrary positive vertex measure on G. The generalized random walk Laplacian and the parametrized random walk operator on G are related by: L_{RW,(ν)} = I − P_{(ν)}.

As stated in Sevi et al. (2022), the generalized random walk Laplacian L_{RW,(ν)} is self-adjoint in ℓ²(V, ν+ξ). Consequently, the parametrized random walk operator P_{(ν)} is also self-adjoint in ℓ²(V, ν+ξ), and the associated random walk X_ν is ergodic (under the aperiodicity condition) and admits the ergodic distribution π_ν. Therefore, P_{(ν)} is a random walk operator parametrized by a vertex measure ν that encodes the random walk dynamics of the original digraph.
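The construction in Definition 3.1 takes only a few lines of NumPy. The following is a minimal sketch (the helper name `parametrized_random_walk` and the 3-vertex toy digraph are illustrative assumptions; every vertex must have at least one out-edge for P to be defined):

```python
import numpy as np

def parametrized_random_walk(W, nu):
    """Parametrized random walk operator P_(nu) of Definition 3.1.

    W  : (N, N) nonnegative adjacency matrix of a digraph.
    nu : (N,) positive vertex measure.
    Returns P_(nu) = D_{nu+xi}^{-1} (D_nu P + P^T D_nu),
    where P = D_out^{-1} W and xi = nu^T P.
    """
    d_out = W.sum(axis=1)
    P = W / d_out[:, None]          # natural random walk on the digraph
    xi = nu @ P                     # induced measure xi = nu^T P
    D_nu = np.diag(nu)
    return np.diag(1.0 / (nu + xi)) @ (D_nu @ P + P.T @ D_nu)

# Toy 3-vertex digraph (every vertex has an out-edge).
W = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 1., 0.]])
nu = np.ones(3)
P_nu = parametrized_random_walk(W, nu)
print(P_nu.sum(axis=1))   # rows sum to 1: a valid transition matrix
```

One can also check numerically that D_{ν+ξ} P_{(ν)} is symmetric, i.e. that the chain is reversible with respect to ν + ξ, as stated above.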

4. DIFFUSION GEOMETRY FOR DIGRAPHS

In this section, we review the concept of diffusion geometry (Coifman & Lafon, 2006), and we show: i) how its core feature, the diffusion distance, can be expressed as a Mahalanobis distance involving a specific kernel matrix, which we call Random Walk Diffusion Kernel (RWDK); ii) how SC's diagonalization step can be thought of as a function applied to the spectrum of the Laplacian, and its conceptual connection with the RWDK; and iii) a clustering algorithm that follows accordingly.

4.1. THE DIFFUSION DISTANCE AS A MAHALANOBIS DISTANCE

The seminal work by Coifman & Lafon (2006) introduced the diffusion geometry framework, which uses diffusion processes as a basic tool to find meaningful geometric descriptions of datasets. The framework can provide different geometric representations of the dataset by iterating the Markov transition matrix, which is equivalent to running the random walk forward. The key element of diffusion geometry is the diffusion distance (Coifman & Lafon, 2006; Pons & Latapy, 2005), defined as follows. We provide relevant theoretical background in Appendix A.4.2.

Definition 4.1. Diffusion distance. Let P be the transition matrix of a reversible random walk on an undirected graph G, with ergodic distribution π. Let also ∥f∥²_{1/π} = ⟨f, D_π^{-1} f⟩, with D_π = diag(π), be the ℓ²-norm of a graph function f induced by the measure 1/π. The diffusion distance at time t ∈ N between the vertices i and j is defined by: d²_t(i, j) = ∥p_t(i, *) − p_t(j, *)∥²_{1/π}.

As noted in Coifman & Lafon (2006), the diffusion distance emphasizes the cluster structure, if present. Next, we show that the diffusion distance can be seen as a Mahalanobis distance.

Proposition 4.1. Diffusion distance as a Mahalanobis distance. Let P be the transition matrix of a reversible random walk on an undirected graph G, with ergodic distribution π. The diffusion distance d²_t(i, j) between vertices i and j, at a given time t ∈ N, can be expressed as the following Mahalanobis distance: d²_t(i, j) = (δ_i − δ_j)^T K_t (δ_i − δ_j), where the positive definite similarity kernel matrix K_t is defined by: K_t = P^{2t} D_d^{-1}.

The diffusion distance thus reveals a similarity kernel matrix K_t that we call Random Walk Diffusion Kernel (RWDK), which is simply a power of the transition matrix P normalized by the vertex degrees. Consequently, using the diffusion distance d²_t is equivalent to using the RWDK matrix K_t.
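Proposition 4.1 can be verified numerically. Below is a small sketch on an assumed toy graph (two triangles joined by one edge), comparing the direct definition of d²_t with the Mahalanobis form; since π = d / ∑_i d_i, the kernel form P^{2t} D_d^{-1} matches the 1/π-weighted definition up to the constant factor ∑_i d_i:

```python
import numpy as np

# Undirected toy graph: two triangles joined by one edge.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0
d = W.sum(axis=1)
P = W / d[:, None]                  # reversible random walk, pi ∝ d
pi = d / d.sum()

t, i, j = 3, 0, 4
Pt = np.linalg.matrix_power(P, t)
# Direct definition: 1/pi-weighted l2 distance between rows of P^t.
d2_direct = np.sum((Pt[i] - Pt[j]) ** 2 / pi)
# Mahalanobis form with the RWDK K_t = P^{2t} D_d^{-1}, rescaled by
# sum(d) because 1/pi = sum(d)/d.
K = np.linalg.matrix_power(P, 2 * t) @ np.diag(1.0 / d)
e = np.zeros(6); e[i] = 1.0; e[j] = -1.0
d2_kernel = d.sum() * (e @ K @ e)
print(np.isclose(d2_direct, d2_kernel))  # True
```

The equality P^t D_d^{-1} (P^t)^T = P^{2t} D_d^{-1} used here is exactly the reversibility identity D_d P = P^T D_d applied t times.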

4.2. RANDOM WALK DIFFUSION KERNEL, NORMALIZED GRAPH LAPLACIAN AND SPECTRAL CLUSTERING

Spectral clustering (SC) is one of the most widely used clustering methods due to its simplicity, efficiency, and strong theoretical foundation (Shi & Malik, 2000; Ng et al., 2002; Von Luxburg, 2007; Peng et al., 2015; Boedihardjo et al., 2021). Given a fixed number of clusters k, SC consists of three main steps: i) construct the graph Laplacian matrix; ii) compute the eigenvectors associated with the k smallest eigenvalues of the Laplacian matrix and store them as columns of a matrix; iii) apply k-means on the rows of the latter matrix, which are regarded as embedded representations of the datapoints. We aim to highlight how SC's second step and the RWDK have similar characteristics. Let us consider the normalized graph Laplacian matrix L (Chung & Graham, 1997) with eigendecomposition L = ∑_{j=1}^{N} ϑ_j ϕ_j ϕ_j^T, ordered eigenvalues 0 ≤ ϑ_1 ≤ ... ≤ ϑ_N ≤ 2, and eigenvectors {ϕ_j}_{j=1}^{N}. Computing the eigenvectors associated with the k smallest eigenvalues of L amounts to applying a function f_1 to the spectrum matrix of L such that f_1(x) = 1 if x ≤ ϑ_k, and f_1(x) = 0 otherwise, namely H = f_1(L) = ∑_{j=1}^{k} ϕ_j ϕ_j^T. Thanks to the relation L = D^{1/2} (I − P) D^{-1/2}, selecting the k smallest eigenvalues of L is thus equivalent to selecting the k largest eigenvalues of P, namely H = f_1(L) = D^{1/2} f_1(I − P) D^{-1/2}. On the other hand, the RWDK K_t corresponds to applying the function f_2(x, t) = x^{2t} to the spectrum matrix of P. As t → ∞, f_2(P, t) → 1_{N×1} π, and consequently K_t = f_2(P, t) D_d^{-1} → [tr(D_d)]^{-1} 1 1^T. As t increases, f_2(P, t) becomes smoother, because f_2 acts as a soft truncation of the spectrum matrix of P, which preserves those eigenvectors of P that are associated with the largest eigenvalues. In contrast, f_1 acts as a hard truncation of the spectrum matrix of L, which preserves exactly the k eigenvectors associated with the eigenvalues up to ϑ_k.
Consequently, the RWDK K t is an adaptive alternative to H, and the aim is to determine the iteration time t that best reveals the k clusters the user looks for.
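The soft-truncation behaviour of f_2 can be observed directly on the spectrum. A small sketch, on an assumed graph made of two cliques joined by a single edge (so P has two eigenvalues near 1 and the rest near 0):

```python
import numpy as np

# Two 10-cliques (self-loops kept for simplicity) joined by a single edge.
N = 20
W = np.zeros((N, N))
W[:10, :10] = 1.0
W[10:, 10:] = 1.0
W[9, 10] = W[10, 9] = 1.0
P = W / W.sum(axis=1)[:, None]

lam = np.sort(np.linalg.eigvals(P).real)[::-1]
for t in (1, 4, 16):
    # f2(x, t) = x^(2t): a soft truncation that progressively keeps only
    # the eigendirections attached to the largest eigenvalues of P.
    print(t, np.round(lam[:4] ** (2 * t), 3))
```

As t grows, the entries past the spectral gap vanish while the two cluster-related eigenvalues survive, which is exactly the adaptive truncation the text describes.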

4.3. PARAMETRIZED RANDOM WALK KERNEL CLUSTERING

We have shown that the diffusion distance d²_t is directly related to the RWDK, and that, for a given diffusion time t, the RWDK is an alternative to computing the eigenvectors of the graph Laplacian. However, the original diffusion distance defined in Eq. 5 was stated in the undirected setting, w.r.t. a transition matrix associated with a reversible random walk. For an arbitrary transition matrix P and an arbitrary measure µ, the diffusion distance between vertices i and j, at a given diffusion time t, is defined as the weighted Euclidean distance between the rows of P^t: d²_t(i, j, µ) = ∥p_t(i, *) − p_t(j, *)∥²_µ = (δ_i − δ_j)^T K_{(t,µ)} (δ_i − δ_j). We have established the last expression in Sec. 4.1; it shows that one can see the diffusion distance as a Mahalanobis distance involving the similarity kernel K_{(t,µ)} = P^t D_µ (P^t)^T, with D_µ = diag(µ). For this kernel to reduce, through the breakdown of d²_t, to K_t = P^{2t} D_d^{-1} (see Eq. 6), P needs to be the transition matrix of a reversible random walk with stationary measure π, and the vertex measure needs to be µ = 1/π. This is the reason why the original diffusion distance formulation was only used in the undirected setting. Now, the advantage of the parametrized random walk operator we use is that it is a reversible transition matrix also for digraphs, and hence it naturally enables the extension of the diffusion distance to digraphs.

Definition 4.2. Parametrized diffusion distance. Let X be a random walk on a digraph G, with transition matrix P. Let ν be an arbitrary positive vertex measure on G, and ξ be the vertex measure defined by ξ = ν^T P. Define the diagonal matrices D_ν = diag(ν) and D_ξ = diag(ξ). On G, let X_ν be the random walk associated with the random walk operator P_{(ν)} parametrized by the measure ν, defined in Eq. 4, with ergodic measure π_ν. Let p_{t,ν}(i, *) = δ_i^T P_{(ν)}^t be the conditional probability vector given the vertex i at the diffusion time t.
The parametrized diffusion distance between the vertices i and j, at a given diffusion time t, is defined by: d²_{t,ν}(i, j) = ∥p_{t,ν}(i, *) − p_{t,ν}(j, *)∥²_{1/π_ν} = (δ_i − δ_j)^T K_{(t,ν)} (δ_i − δ_j), where K_{(t,ν)} is the parametrized random walk diffusion kernel (P-RWDK) defined as: K_{(t,ν)} = P_{(ν)}^t D_{ν+ξ}^{-1}. Finally, the normalized generalized Laplacian 𝓛_{(ν)} and the parametrized random walk operator P_{(ν)} are related in the same manner the normalized Laplacian relates to the random walk operator: 𝓛_{(ν)} = D_{ν+ξ}^{1/2} (I − P_{(ν)}) D_{ν+ξ}^{-1/2}. Consequently, the approach of Sec. 4.2 is also valid in this setting. Having defined the parametrized diffusion distance d²_{(t,ν)}, we are in position to present the novel parametrized random walk diffusion kernel clustering (P-RWDKC) for both undirected and directed graphs (Alg. 1), which is based on the parametrized RWDK K_{(t,ν)}.
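As a minimal sketch of how the rows of K_{(t,ν)} act as vertex embeddings (the core of Alg. 1): the 6-vertex digraph below and the helper name are illustrative assumptions; note the graph is not strongly connected (one-way bridge), yet P_{(ν)} is still well-defined and reversible:

```python
import numpy as np

def p_rwdkc_embedding(W, nu, t):
    """Rows of the P-RWDK K_(t,nu) = P_(nu)^t D_{nu+xi}^{-1},
    used as vertex embeddings on which k-means would be run."""
    P = W / W.sum(axis=1)[:, None]
    xi = nu @ P
    D_nu = np.diag(nu)
    P_nu = np.diag(1.0 / (nu + xi)) @ (D_nu @ P + P.T @ D_nu)
    return np.linalg.matrix_power(P_nu, t) @ np.diag(1.0 / (nu + xi))

# Two 3-cliques joined by a single *directed* edge.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = 1.0                        # one-way bridge: not strongly connected
K = p_rwdkc_embedding(W, nu=np.ones(6), t=4)

# Vertices of the same clique get near-identical embedding rows,
# so they are much closer to each other than across the bridge.
print(np.linalg.norm(K[0] - K[1]) < np.linalg.norm(K[0] - K[4]))  # True
```

Running k-means on the rows of `K` then yields the P-RWDKC partition for the chosen t and ν.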

5. THE P-RWDKC METHOD IN PRACTICE

In the previous section, we described a novel and simple clustering algorithm for digraphs, based on an operator that is parametrized by an arbitrary vertex measure ν, and on its diffusion up to time t. To render our algorithm more flexible for practical use, two important aspects need to be addressed: i) the design of the vertex measure ν; ii) the estimation of the diffusion time.

5.1. DESIGNING THE VERTEX MEASURE

Designing the vertex measure is one of the major aspects of P-RWDKC, as we aim to capture with it the random walk dynamics of the original digraph in our parametrized random walk operator. To do so, we propose a vertex measure derived from the iterated powers of a random walk combining the forward and backward flow information of the digraph. Specifically, the proposed vertex measure is parametrized by three parameters (t ∈ N, γ ∈ [0, 1], α ∈ R) and is formally given by: ν^α_{(t,γ)}(i) = (1/N) (1_{N×1}^T P_γ^t δ_i)^α, where 1_{N×1} is the all-ones vector, δ_i ∈ {0,1}^{N×1} is (recall) the vector returned by the Kronecker delta function at i ∈ V, and P_γ = γ P_out + (1 − γ) P_in, with P_out = D_out^{-1} W and P_in = D_in^{-1} W^T (recall that D_out = D_{d_out} and D_in = D_{d_in}). The three parameters have an easy-to-see role, and their use is optional, as one can set them to values that neutralize their effect (i.e. t = 1, α = 1, γ = 0.5). The random walk iteration parameter t controls the diffusion time, γ controls the mixing between P_out (forward information) and P_in (backward information), and α controls the re-weighting of the vertex measure. Plugging ν^α_{(t,γ)} into Eq. 4 yields the expression for the parametrized random walk P_{(ν^α_{(t,γ)})}, and hence the P-RWDK K_{(t_d, ν^α_{(t,γ)})}. Note that in our implementation we use the lazy variant P̄_γ = (I + P_γ)/2 instead of P_γ. The choice of the vertex measure can have a significant influence on the clustering performance. Intuitively, one would be interested to see how "concentrated" the measure's configurations that lead to good clusterings are. When the problem is easy (well-separated clusters), the influence of the vertex measure is very limited, and multiple different measure parametrizations may lead to good clustering (and many different algorithms may do the same).
On the other hand, in difficult cases (clusters that are intricate and/or imbalanced and/or sparse), the vertex measure is key to reaching high clustering performance, and the good parametrizations may be more "concentrated" in a region of R^N. We can also see the impact of the measure by recalling that vanilla SC can perform arbitrarily badly in such cases. This has been partially studied theoretically in Sevi et al. (2022) and is also shown empirically in our experimental results in Sec. 6.
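The vertex measure of Eq. 9 can be sketched as follows (a minimal illustration; the helper name and the 3-vertex digraph are assumptions, and the measure requires every vertex to have at least one in- and one out-edge so that both P_out and P_in are defined):

```python
import numpy as np

def vertex_measure(W, t=1, gamma=0.5, alpha=1.0):
    """Vertex measure nu^alpha_(t,gamma): nu(i) = (1/N) (1^T P_gamma^t delta_i)^alpha,
    with P_gamma = gamma*P_out + (1-gamma)*P_in, using the lazy variant
    (I + P_gamma)/2 as mentioned in the text."""
    N = len(W)
    P_out = W / W.sum(axis=1)[:, None]
    P_in = W.T / W.sum(axis=0)[:, None]
    P_g = gamma * P_out + (1.0 - gamma) * P_in
    P_g = (np.eye(N) + P_g) / 2.0                       # lazy variant
    col = np.linalg.matrix_power(P_g, t).sum(axis=0)    # 1^T P_gamma^t delta_i
    return (col ** alpha) / N

W = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 1., 0.]])
nu = vertex_measure(W, t=2, gamma=0.7, alpha=1.0)
print(nu, nu.sum())   # positive measure; sums to 1 when alpha = 1
```

With α = 1, the measure is a proper distribution (1^T P_γ^t 1 = N, since P_γ is row-stochastic); α ≠ 1 re-weights it, as described above.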

5.2. DETERMINING THE APPROPRIATE DIFFUSION TIME IN THE UNSUPERVISED SETTING

For a known number of clusters k, we aim at determining the best diffusion time t_d for the P-RWDKC algorithm. As we work in an unsupervised setting, and hence lack ground truth, determining the right diffusion time is challenging (Shan & Daubechies, 2022). Our main insights for dealing with this matter derive from the concepts of diffusion geometry (Coifman & Lafon, 2006), the theory of nearly uncoupled Markov chains (Tifenbach, 2011; Sharpe & Wales, 2021), and the metastability of Markov chains (Landim & Xu, 2015). As stated in Coifman & Lafon (2006), assuming that the graph has clusters and/or a multi-scale structure, the random walk diffusion reveals clusters at key moments over the course of the diffusion. The emergence of clusters can be understood through the prism of metastability theory: for a given cluster structure, there is a critical time-scale at which an irreducible random walk becomes nearly reducible, diffusing only inside clusters. Typically, this means that, at that diffusion time, our operator exhibits high intra-cluster compactness and high inter-cluster separation. Cluster validity indexes can be employed to measure the ratio of the former quantity to the latter (José-García & Gómez-Flores, 2021). Here, we propose the use of the Calinski-Harabasz criterion (CH, also known as the variance ratio criterion) (Caliński & Harabasz, 1974), which computes the ratio between the distance of the cluster centroids to the global centroid and the distance of the datapoints of each cluster to their cluster centroid. To compute this variance criterion, we propose two types of distances between data points: either the Euclidean distance between the original data points, or a distance between their embedded representations, which in the context of this work means considering the rows of a reference graph operator.
Given the set of N datapoints X = {x_1, ..., x_N} partitioned into k clusters, denoted by V = {V_j}_{j=1}^{k}, we denote by µ_j = (1/|V_j|) ∑_{x_i ∈ V_j} x_i the centroid of cluster j, by µ = (1/N) ∑_{x_i ∈ X} x_i the centroid of X, and by d(x_i, x_j) the pairwise distance used for two datapoints x_i, x_j ∈ X. The CH criterion endowed with a given distance d is then defined by:

CH(X, V) = [(N − k)/(k − 1)] · [∑_{j=1}^{k} |V_j| d(µ_j, µ)] / [∑_{j=1}^{k} ∑_{x_i ∈ V_j} d(x_i, µ_j)].

In the standard case, where we consider multidimensional data vectors X = {x_1, ..., x_N} ⊂ R^d, the CH criterion is usually endowed with the Euclidean distance d(x_i, x_j) = ∥x_i − x_j∥². However, there are many reasons why the usual CH criterion may be either not directly applicable (e.g. when the input is only a graph and no point cloud) or inefficient (e.g. when the clusters of the input point cloud are non-convex or nested, and the Euclidean distance cannot help distinguish them). For this reason, as part of our framework, we propose to extend the CH criterion to this setting, using the Kullback-Leibler divergence as a distance (Van Erven & Harremos, 2014).

Definition 5.1. Probability Density-based Calinski-Harabasz (DCH) criterion. Let G = (V, E) be a digraph with cardinality |V| = N, and let P be the transition matrix of a random walk on G. The digraph G is partitioned into k clusters denoted by V = {V_j}_{j=1}^{k}. Let p(i, *) = δ_i^T P be the conditional probability vector given the vertex i, which we consider as the representation of vertex i. We define as reference data representation X = {p(i, *) ∈ R^N}_{i=1}^{N} the set of conditional probability vectors associated with the vertices of G. Let us denote by µ_j = (1/|V_j|) ∑_{p(i,*) ∈ V_j} p(i, *) the centroid of cluster j, and by µ = (1/N) ∑_{p(i,*) ∈ X} p(i, *) the centroid of X. The Kullback-Leibler divergence between two discrete probability distributions p and q is defined as D_KL(p, q) = ∑_y p(y) log(p(y)/q(y)), s.t. q(y) ≠ 0.
Given a set of datapoints X and a partition V, the probability density-based Calinski-Harabasz (DCH) criterion endowed with the Kullback-Leibler divergence D_KL(p, q) is defined by:

DCH(X, V) = [(N − k)/(k − 1)] · [∑_{j=1}^{k} |V_j| D_KL(µ_j, µ)] / [∑_{j=1}^{k} ∑_{p(i,*) ∈ V_j} D_KL(p(i, *), µ_j)].

In practice, we thus estimate the diffusion time by evaluating the CH or DCH criterion for the partitions associated with the dyadic powers P_{(ν)}^{2^j} of the parametrized random walk, with ν = 1 (the uniform measure) and j ∈ {0, ..., J}, with J = 15. Since the purpose is only to estimate the diffusion time, taking the vertex measure equal to the uniform measure is sufficient in this situation. We summarize this procedure in Alg. 2.

Under review as a conference paper at ICLR 2023

Algorithm 2 Estimating the random walk diffusion time t*
Input: X: reference representation; W: adjacency matrix; k: number of clusters; J: max number of iterations
Output: t*: the estimated diffusion time that best reveals k clusters
1: Set ν = 1 for the vertex measure
2: Compute the parametrized random walk operator P_{(ν)}, see Eq. 4
3: for j = 0 to J do
4:   Apply k-means on P_{(ν)}^{2^j}
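A compact sketch of this procedure, combining the DCH score of Definition 5.1 with the dyadic-time loop of Alg. 2. The helper names, the minimal Lloyd's k-means, the small two-clique graph, and J = 5 (instead of 15) are all illustrative assumptions:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # D_KL(p, q) with a small epsilon guarding against zero entries.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def dch(X, labels, eps=1e-12):
    """Density-based Calinski-Harabasz (Def. 5.1) on row-stochastic rows X."""
    ks = np.unique(labels)
    N, k = len(X), len(ks)
    if k < 2:                        # degenerate partition: no valid score
        return 0.0
    mu = X.mean(axis=0)
    between = sum((labels == c).sum() * kl(X[labels == c].mean(axis=0), mu)
                  for c in ks)
    within = sum(kl(X[i], X[labels == c].mean(axis=0))
                 for c in ks for i in np.where(labels == c)[0])
    return (N - k) / (k - 1) * between / max(within, eps)

def kmeans(X, k, iters=30, seed=0):
    # Minimal Lloyd's iteration, enough for this illustration.
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == c].mean(axis=0) if np.any(lab == c) else C[c]
                      for c in range(k)])
    return lab

# Two 4-cliques with one bridge; uniform measure nu = 1 as in Alg. 2.
N = 8
W = np.zeros((N, N)); W[:4, :4] = 1.0; W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0); W[3, 4] = W[4, 3] = 1.0
P = W / W.sum(axis=1)[:, None]
nu = np.ones(N); xi = nu @ P
P_nu = np.diag(1.0 / (nu + xi)) @ (np.diag(nu) @ P + P.T @ np.diag(nu))

scores = []
for j in range(6):                            # dyadic diffusion times 2^0..2^5
    Pt = np.linalg.matrix_power(P_nu, 2 ** j)
    scores.append(dch(Pt, kmeans(Pt, k=2)))
t_star = 2 ** int(np.argmax(scores))
print(t_star)
```

The rows of each dyadic power are probability vectors, so the KL-based score applies directly; t* is the dyadic time whose partition maximizes the score.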

6. EXPERIMENTS

Setup and competitors. In this section, we demonstrate the effectiveness of our approach both on digraphs obtained from high-dimensional data through graph construction procedures, and on real-world graphs. When dealing with high-dimensional data, we use the K-nearest neighbor (K-NN) graph construction with K = ⌊log(N)⌋, which produces (relatively) sparse and non-strongly connected digraphs. A resulting K-NN graph is unweighted, directed, and represented by its non-symmetric adjacency matrix W = {w_ij}_{i,j=1}^{N}, with entries w_ij = 1{∥x_i − x_j∥_2 ≤ dist_K(x_i)}. In the latter, x_i, x_j ∈ R^d stand for the original coordinates of the datapoints corresponding to the vertices i and j, dist_K(x) is the Euclidean distance between x and its K-th nearest neighbor, and 1{·} ∈ {0, 1} is the indicator function that evaluates the truth of the input condition. Our clustering method, denoted by P-RWDKC(α, γ, t, t_d), is endowed with the parameters α ≥ 0, γ ∈ [0, 1], and t, t_d ≥ 0. The search grid used for each parameter is: α ∈ {0, 0.1, ..., 1}, t ∈ {0, 1, ..., 100}, and γ ∈ {0, 0.1, ..., 1}. Finally, the diffusion time parameter t_d ∈ {2^0, ..., 2^J}, with J = 15, is estimated using Alg. 2. For each method, we apply k-means clustering over the obtained embeddings (we report the best score out of 100 restarts). We select for each method the optimal parameter values obtained through cross-validation over a grid search, yielding the closest partition to the ground truth. The obtained partitions are evaluated by the normalized mutual information (NMI) (Strehl & Ghosh, 2002), which is a popular supervised cluster evaluation index. NMI corresponds to a normalization of the mutual information between the predicted cluster assignments and the ground truth labels. This metric is symmetric and invariant to label permutations.
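The K-NN digraph construction described above can be sketched as follows (a brute-force version for illustration; the helper name `knn_digraph` and the two-Gaussian point cloud are assumptions, and self-loops are excluded):

```python
import numpy as np

def knn_digraph(X, K=None):
    """Directed K-NN adjacency: w_ij = 1 if x_j is among the K nearest
    neighbours of x_i (excluding i itself); generally non-symmetric."""
    N = len(X)
    if K is None:
        K = max(1, int(np.log(N)))            # K = floor(log N), as in the text
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    np.fill_diagonal(D, np.inf)               # no self-loops
    W = np.zeros((N, N))
    idx = np.argsort(D, axis=1)[:, :K]        # K nearest neighbours of each x_i
    np.put_along_axis(W, idx, 1.0, axis=1)
    return W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
W = knn_digraph(X)
print(W.shape, int((W != W.T).sum()), "asymmetric entries")
```

Each vertex has exactly K out-edges, while its in-degree varies with the local density, which is precisely the directional/density information the vertex measure is designed to exploit.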
In the experiments we compare with the following methods:
• DSC+(γ) (Zhou et al., 2005) is an SC method on digraphs based on the Pagerank random walk (Page et al., 1999), endowed with the parameter γ ∈ [0, 1).
• DI-SIM_L(τ) and DI-SIM_R(τ) (Rohe et al., 2016) are two variants based on the left and the right singular vectors, respectively, of a given regularized and normalized operator, whose regularization is controlled by the parameter τ ≥ 0. We use cross-validation to search for the optimal parameter, with a grid search over τ ∈ {1, 2, ..., 20}.
• SC-SYM_1 and SC-SYM_2 are SC variants (Von Luxburg, 2007) based on the unnormalized and the normalized graph Laplacians obtained from the symmetrization of the adjacency matrix W.
• PIC(t_d) (Lin & Cohen, 2010) is the clustering approach based on the power iteration of the random walk operator. We use the random walk operator derived from the original adjacency matrix W of the graph. The diffusion time t_d is estimated using Alg. 2.
• RSC(τ) (Qin & Rohe, 2013; Zhang & Rohe, 2018) is the regularized SC proposed to deal with sparse graphs, parametrized by τ ≥ 0. We use cross-validation with a grid search over τ ∈ {1, 2, ..., 20} to tune this parameter. The method applies only to undirected graphs; for digraphs, we symmetrize the adjacency matrix of the original digraph.

Multi-scale synthetic Gaussians. Here we test the ability of our approach to reveal clusters at different scales, thanks to the accurate estimation of the diffusion time. We generate one instance of a point cloud in R², of N = 300 data points drawn independently from the following mixture of six Gaussian distributions: ∑_{i=1}^{6} α_i N(µ_i, σ_i² I), with weights α_i such that ∑_i α_i = 1. Specifically, σ_i = 0.5 and α_i = 1/6 for all i, and µ_1 = (−3, −2), µ_2 = (0, −2), µ_3 = (−1, 1), µ_4 = (4, −2), µ_5 = (7, −2), µ_6 = (5, 1).
The resulting data exhibit a multi-scale structure, as they can be seen as either 6 clusters, corresponding to the original Gaussian components, or as 2 clusters, each of them made up of 3 smaller clusters. Figs. 1a and 1c show the ground truth data classes for the two scales. Figs. 1b and 1d show the respective clustering results obtained by P-RWDKC with a vertex measure ν = 1 and estimated diffusion times t_d1 = 64 and t_d2 = 128. The results are quite consistent with the ground truth of each case, hence provide evidence that our approach can reveal multi-scale clusters.
Real-world data. In this section, we conduct experiments on graphs created from high-dimensional real-world data. We show that P-RWDKC has superior performance compared to existing methods in nearly all tested cases. Note that, to further validate the efficiency of P-RWDKC, we run additional experiments on several real-world graphs, which are provided in Appendix A.2. Here we report results on 11 benchmark datasets from the UCI repository (Dheeru & Karra Taniskidou, 2017). We compare against DSC+, SC-SYM_1 and SC-SYM_2, DI-SIM, and PIC. Tab. 1 summarizes the comparative results based on NMI. In nearly all cases, the proposed P-RWDKC significantly outperforms the other methods on average. Our approach performs better than SC-SYM_1 and SC-SYM_2, and better on average than DSC+. This allows us to state that P-RWDKC, associated with the suitable vertex measure from Eq. 9, indeed brings real added value to the clustering problem. Furthermore, P-RWDKC outperforms PIC. Consequently, the RWDK operator produces better graph embeddings than the original random walk operator defined on the directed K-NN graphs, both because the parametrized random walk is irreducible, whereas the random walk used in PIC is not, and thanks to the digraph information encoded into the vertex measure. We have to mention that, for each dataset, the parameters reported for our approach are not necessarily unique, as several combinations of parameters may yield the same clustering performance (see Sec. 5.1).

7. CONCLUSION

We have proposed the parametrized random walk diffusion kernel clustering (P-RWDKC) that applies to both directed and undirected graphs. First, we introduced the parametrized random walk (P-RW) operator. We then showed that the diffusion distance is a Mahalanobis distance involving a special kernel matrix, called the random walk diffusion kernel (RWDK), which is simply a power of the transition matrix (normalized by the vertex degrees). From this, we showed that the RWDK is an alternative to the eigendecomposition step of the spectral clustering pipeline. We extended the diffusion geometry framework to digraphs by combining RWDK and P-RW. The P-RWDKC clustering algorithm stems from our analysis. Finally, we demonstrated empirically, with extensive experiments on several datasets, that P-RWDKC outperforms existing approaches for digraphs.

Proposition A.1. Let X be a random walk on an undirected graph G with transition matrix P and ergodic distribution π. The transition matrix P admits the eigendecomposition P = Φ D_λ Ψ^T. The diffusion distance d_t^2(i, j) between vertices i and j at a given time t ∈ N can be written as the following Mahalanobis distance:
d_t^2(i, j) = (δ_i − δ_j)^T K_t (δ_i − δ_j),
where the positive definite similarity kernel matrix K_t is defined by K_t = P^{2t} D_d^{−1}.
Proof. We have
d_t^2(i, j) = ∥p_t(i, ·) − p_t(j, ·)∥²_{1/π} = ∥(P^t)^T (δ_i − δ_j)∥²_{1/π} = (δ_i − δ_j)^T P^t D_π^{−1} (P^t)^T (δ_i − δ_j).
By setting K_t = P^t D_π^{−1} (P^t)^T, using the eigendecomposition of P and the fact that P is self-adjoint in ℓ²(V, π), we obtain
K_t = P^t D_π^{−1} (P^t)^T = Φ D_λ^t Ψ^T D_π^{−1} Ψ D_λ^t Φ^T
    = Φ D_λ^{2t} Φ^T                  (since Ψ^T D_π^{−1} Ψ = I)
    = Φ D_λ^{2t} Ψ^T D_d^{−1}         (since Φ = D_d^{−1} Ψ, hence Φ^T = Ψ^T D_d^{−1})
    = P^{2t} D_d^{−1}.

A.1.2 SUPPLEMENTARY PROOFS

Proposition A.2. P_(ν) is a transition matrix and reversible.
Proof. We have the equality
P_(ν) = (I + D_ν^{−1} D_ξ)^{−1} (P + D_ν^{−1} P^T D_ν) = (D_ν + D_ξ)^{−1} (D_ν P + P^T D_ν).
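The closed form K_t = P^{2t} D_d^{−1} of Proposition A.1 can be checked numerically on a small undirected graph (a minimal sketch; we take the unnormalized measure π = d, as the appearance of D_d^{−1} in the proposition implies):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
# Random symmetric adjacency matrix; a path is added to guarantee connectivity
W = rng.integers(0, 2, size=(n, n)).astype(float)
W = np.maximum(W, W.T)
np.fill_diagonal(W, 0.0)
W = np.maximum(W, np.eye(n, k=1) + np.eye(n, k=-1))

d = W.sum(axis=1)
P = W / d[:, None]                # random walk transition matrix P = D^{-1} W
t = 3

Pt = np.linalg.matrix_power(P, t)
Kt_def = Pt @ np.diag(1.0 / d) @ Pt.T                          # K_t = P^t D_d^{-1} (P^t)^T
Kt_closed = np.linalg.matrix_power(P, 2 * t) @ np.diag(1.0 / d)  # K_t = P^{2t} D_d^{-1}
```

The identity relies on P D_d^{−1} = D_d^{−1} W D_d^{−1} being symmetric, which is the matrix form of the self-adjointness of P in ℓ²(V, π).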
As a result, we need to show that (D_ν + D_ξ)^{−1} (D_ν P + P^T D_ν) is a transition matrix, i.e. that Σ_{j=1}^{|V|} P_(ν),ij = 1 for all i ∈ V. We have
Σ_j P_(ν),ij = Σ_j [(D_ν + D_ξ)^{−1} (D_ν P + P^T D_ν)]_ij
            = Σ_j Σ_k (D_ν + D_ξ)^{−1}_ik (D_ν P + P^T D_ν)_kj
            = (D_ν + D_ξ)^{−1}_ii Σ_j (D_ν P + P^T D_ν)_ij.   (12)
For the remaining sum,
Σ_j (D_ν P + P^T D_ν)_ij = Σ_j Σ_k (D_ν)_ik P_kj + Σ_j Σ_k (P^T)_ik (D_ν)_kj
                         = Σ_j (D_ν)_ii P_ij + Σ_j (P^T)_ij (D_ν)_jj
                         = Σ_j ν(i) p(i, j) + Σ_j ν(j) p(j, i)
                         = ν(i) + ξ(i), ∀ i ∈ V.
Using Eq. 12, we finally obtain Σ_j P_(ν),ij = 1.
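Proposition A.2 is easy to verify numerically: the sketch below builds P_(ν) for a random digraph and checks both the row sums and the detailed-balance condition (ν(i)+ξ(i)) P_(ν),ij = (ν(j)+ξ(j)) P_(ν),ji, which expresses reversibility with respect to the measure ν + ξ (a minimal sketch; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Random row-stochastic transition matrix P of a digraph
A = rng.random((n, n))
P = A / A.sum(axis=1, keepdims=True)
nu = rng.random(n) + 0.1                       # positive vertex measure ν
xi = P.T @ nu                                  # ξ(i) = Σ_j ν(j) p(j, i)

D_nu, D_xi = np.diag(nu), np.diag(xi)
# P_(ν) = (D_ν + D_ξ)^{-1} (D_ν P + Pᵀ D_ν)
P_nu = np.linalg.inv(D_nu + D_xi) @ (D_nu @ P + P.T @ D_nu)
```

Reversibility follows because D_ν P + P^T D_ν is symmetric, so scaling row i by 1/(ν(i)+ξ(i)) yields detailed balance with respect to ν + ξ.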

A.2 EXPERIMENTS ON REAL-WORLD GRAPH BENCHMARKS

Most real-world graphs are sparse and have heterogeneous degrees with high variance. As a result, spectral clustering mostly fails on such graphs (Zhang & Rohe, 2018; Rohe et al., 2011; Dall'Amico et al., 2020). To avoid numerical issues due to the high degree variance, we propose a slight modification of the parametrized random walk operator (transition matrix), P_(ν̃) = (D_ν̃ + D_ξ)^{−1} (D_ν̃ W + W^T D_ν̃) with ξ = ν̃^T W, and the vertex measure ν̃ = [ν̃_1, ..., ν̃_N]^T defined by ν̃_i = ν_i d_i^out, ∀i ∈ V. We report the results of experiments that evaluate the performance of the proposed P-RWDKC method on 4 real-world networks. Among these, 2 are directed (Political blogs, Cora) and 2 are undirected (Karate Club, College Football). For all networks, the number of clusters is considered known. Moreover, when necessary, clustering is performed on the largest connected component of the graph. We compare against SC-SYM_1, RSC, and PIC. Tab. 2 summarizes the comparative results according to the NMI index. In all cases, the proposed P-RWDKC outperforms the other methods, and when looking at the overall average performance the difference is significant. Our approach performs significantly better than SC-SYM_1. This is caused by a mixed effect of the symmetrization of the digraph and the hard truncation phenomenon discussed in Sec. 4.2. RSC is competitive against our method; nevertheless, the adjacency matrix involved in RSC is dense and clearly modified, which can have a deleterious impact compared to our method (e.g. see the results on Cora). In Sec. 5.1 we presented a design for the vertex measure to be used by P-RWDKC. Here we discuss an alternative design.
As before, the vertex measure can be parametrized by three parameters (t ∈ N, γ ∈ [0, 1], α ∈ R) and is formally given by:
ν^α_(t,γ)(i) = ((1/N) 1^T_{N×1} P^t_γ δ_i)^α,   (14)
where 1_{N×1} is the all-ones vector, δ_i ∈ {0, 1}^{N×1} is the vector output of the Kronecker delta function at i ∈ V, and where P_γ is defined by
P_γ = D_γ^{−1} W_γ,  D_γ = diag(W_γ 1),  W_γ = γ W + (1 − γ) W^T,  γ ∈ [0, 1].
The random walk iteration parameter t controls the random walk diffusion, γ controls the balance between the original adjacency W (forward information) and its transpose W^T (backward information), and α controls the re-weighting of the vertex measure.

A.3.1 VERTEX MEASURE WHEN γ = 1/2 AND t → ∞

In the setting where γ = 1/2 and t → ∞, we can characterize the vertex measure explicitly. Indeed, W_γ becomes symmetric, the associated random walk is thus ergodic, lim_{t→∞} δ_i^T P^t_{1/2} = π^T_{1/2} for all i, and the parametrized vertex measure defined in Eq. 14 becomes ν^α_(t,γ)(i) = π_{1/2}(i)^α for any vertex i.
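The parametrized vertex measure can be computed for all vertices at once, since 1^T P_γ^t δ_i is simply the i-th column sum of P_γ^t (a minimal sketch; `vertex_measure` is our own helper name, and W is assumed to have no all-zero rows after the γ-mixing):

```python
import numpy as np

def vertex_measure(W, t, gamma, alpha):
    """ν^α_(t,γ)(i) = ((1/N) · 1ᵀ P_γᵗ δ_i)^α, computed for every vertex i at once."""
    N = W.shape[0]
    W_gamma = gamma * W + (1 - gamma) * W.T          # forward/backward mixing
    P_gamma = W_gamma / W_gamma.sum(axis=1, keepdims=True)
    Pt = np.linalg.matrix_power(P_gamma, t)
    return (Pt.sum(axis=0) / N) ** alpha             # column sums of P_γᵗ, then re-weighted
```

For t = 0 this reduces to the constant measure (1/N)^α, and for γ = 1/2 and large t it converges to π_{1/2}(i)^α, consistent with Sec. A.3.1.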

A.3.2 ADDITIONAL EXPERIMENTS

In this section, we demonstrate the effectiveness of our approach based on the same setting described in Sec. 6, using the vertex measure in Sec. A.3.1. Tabs. 3 and 5 summarize the comparative results based on NMI. Similarly to the results from Sec. 6, we observe that the proposed P-RWDKC significantly outperforms the other methods in nearly all cases. P-RWDKC associated with the vertex measure defined in Sec. A.3.1 stays competitive against the P-RWDKC from Sec. 6, with fewer degrees of freedom. The main takeaway is that the construction of this kernel and the design of the vertex measures are the key elements for creating relevant embeddings for clustering graphs. To further evaluate the P-RWDKC framework, instead of using the ground truth and cross-validation, here we choose the parameter values that optimize the Calinski-Harabasz (CH) index. We first compute the set of candidate partitions, one for each (α, t_d) combination in the considered parameter grid. Then we select as best the model with the highest CH and compute its NMI. We operate in the same way for the methods that have parameters (i.e. all but SC-SYM_1, SC-SYM_2, and PIC, whose results are the same as in Tabs. 3 and 5). The comparative results are shown in Tabs. 4 and 6. Notice that P-RWDKC significantly outperforms the other methods in nearly all cases (see also the average performance in the last row of the table). Compared to Tab. 3, here the NMI of P-RWDKC stays just a little lower. This indicates that the unsupervised tuning of the model parameters offers graph partition quality comparable to the previous case where we applied cross-validation using the ground truth.

A.4.1 GENERALIZED DIRICHLET ENERGY

The generalized Dirichlet energy (GDE) of a graph function f is defined as
D²_{ν,P}(f) = Σ_{x,y∈V} ν(x) p(x, y) |f(x) − f(y)|².
This functional extends the well-known notion of Dirichlet energy (Montenegro et al., 2006), which is originally restricted to an ergodic random walk associated with its transition matrix and its stationary measure.
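The Calinski-Harabasz index used for this unsupervised model selection is the ratio of between-cluster to within-cluster dispersion; below is a minimal NumPy sketch (our own implementation, equivalent in spirit to `sklearn.metrics.calinski_harabasz_score`, with `select_by_ch` a hypothetical helper mimicking the selection step described above):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH index: [trace(B_k)/(k-1)] / [trace(W_k)/(N-k)], higher is better."""
    N, k = X.shape[0], len(np.unique(labels))
    mean = X.mean(axis=0)
    B = W = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        B += len(Xc) * np.sum((mc - mean) ** 2)    # between-cluster dispersion
        W += np.sum((Xc - mc) ** 2)                # within-cluster dispersion
    return (B / (k - 1)) / (W / (N - k))

def select_by_ch(X, candidate_partitions):
    """Return the index of the candidate partition with the highest CH score."""
    scores = [calinski_harabasz(X, lab) for lab in candidate_partitions]
    return int(np.argmax(scores))
```

In the experimental pipeline, `candidate_partitions` would hold one partition per (α, t_d) combination of the grid.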
In particular, the GDE is defined with respect to any positive regularizing measure and any Markov transition matrix, without requiring any specific property. As mentioned earlier, the generalized graph Laplacians stem from the GDE through the relation
D²_{ν,P}(f) = ⟨f, L_RW(ν) f⟩_{ν+ξ} = ⟨f, L(ν) f⟩,
which means that the quadratic form D²_{ν,P}(f) involves the unnormalized generalized graph Laplacian. It is important to mention that the GDE stems directly from the random walk point of view of graph partitioning proposed in Sevi et al. (2022).

A.4.2 BACKGROUND ON DIFFUSION GEOMETRY

The seminal work by Coifman et al. (2005) and Coifman & Lafon (2006) introduced the diffusion geometry framework, which uses diffusion processes as a basic tool to find meaningful geometric descriptions for data. This general framework leads to efficient multi-scale analysis of datasets, for which we have a Heisenberg principle relating localization in the data to localization in the spectrum. The framework can provide different geometric representations of the dataset by iterating the Markov transition matrix, which is equivalent to running the random walk forward. The key element of diffusion geometry is the diffusion distance (Coifman & Lafon, 2006; Pons & Latapy, 2005; Coifman et al., 2005), which captures the structure encoded in P as a data-dependent distance metric between points, and whose original definition is the following.
Definition A.2. Diffusion distance. Let P be the transition matrix of a reversible random walk on an undirected graph G, with ergodic distribution π. Given a graph function f, ∥f∥²_{1/π} is the ℓ²-norm induced by the measure 1/π, defined by ∥f∥²_{1/π} = ⟨f, D_π^{−1} f⟩. The diffusion distance at time t ∈ N between the vertices i and j is defined by: d²_t(i, j) = ∥p_t(i, ·) − p_t(j, ·)∥²_{1/π}.
Notably, diffusion distances are data-dependent, allowing the detection of nonlinear structure in data at different scales (Coifman et al., 2005). The diffusion distance at time t can be considered as the Euclidean distance between rows of P^t, potentially weighted by a measure (in the original definition, by 1/π) which takes into account the (empirical) local density of the points.
For a finite time t, if each cluster of G is highly connected and well-separated from the other clusters, then p_t(i, ·) will be nearly equal to p_t(j, ·) for any pair of vertices i and j in the same cluster, implying a low diffusion distance between points within the same cluster. Conversely, if i and j are in distinct clusters, p_t(i, ·) is expected to be very different from p_t(j, ·). Since then, several variants of the diffusion distance notion have been proposed (Goldberg & Kim, 2010; 2012). We now introduce the notion of diffusion map as a main concept of the diffusion geometry framework.
Definition A.3. Diffusion map. Let X be a random walk on an undirected graph G = (V, E), |V| = N, with transition matrix P and ergodic distribution π. The transition matrix P admits the eigendecomposition P = Φ D_λ Φ^{−1}, where Φ = [ϕ_1, ϕ_2, ..., ϕ_N] is the eigenbasis and D_λ = diag(λ) is the diagonal matrix of eigenvalues. The diffusion map at time t is defined by Ψ_t(i) = (λ_1^t ϕ_1(i), λ_2^t ϕ_2(i), ..., λ_N^t ϕ_N(i)).
Diffusion maps and diffusion distances are related by the relation (Coifman & Lafon, 2006) d²_t(i, j) = ∥Ψ_t(i) − Ψ_t(j)∥², which means that the diffusion map Ψ_t embeds the data into a Euclidean space in which the squared Euclidean distance equals the squared diffusion distance d²_t. We note that this relationship only holds if the diffusion distance takes the original form proposed in Coifman & Lafon (2006), i.e. if the random walk with transition matrix P is reversible and if the Euclidean distance is weighted by 1/π.
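The identity d²_t(i, j) = ∥Ψ_t(i) − Ψ_t(j)∥² can be verified numerically; a minimal sketch follows (one assumption not spelled out above: the eigenvectors ϕ_l are normalized in ℓ²(V, π), which we obtain via the standard symmetric conjugate D^{−1/2} W D^{−1/2} of P):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
# Connected undirected graph: random symmetric adjacency plus a path for connectivity
W = (rng.random((n, n)) < 0.4).astype(float)
W = np.maximum(W, W.T)
np.fill_diagonal(W, 0.0)
W = np.maximum(W, np.eye(n, k=1) + np.eye(n, k=-1))

d = W.sum(axis=1)
P = W / d[:, None]
pi = d / d.sum()                                   # ergodic distribution of the walk

t = 4
# Eigendecompose via the symmetric conjugate A = D^{-1/2} W D^{-1/2}
lam, U = np.linalg.eigh(np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5))
Phi = np.diag(pi ** -0.5) @ U                      # right eigenvectors of P, normalized in ℓ²(π)
Psi_t = Phi * lam ** t                             # diffusion map coordinates (one row per vertex)

# Diffusion distance between vertices 0 and 1, directly from P^t and via the map
Pt = np.linalg.matrix_power(P, t)
d2_direct = np.sum((Pt[0] - Pt[1]) ** 2 / pi)
d2_map = np.sum((Psi_t[0] - Psi_t[1]) ** 2)
```

The two quantities agree exactly, since P^t D_π^{−1} (P^t)^T = Φ D_λ^{2t} Φ^T under this normalization.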



This work belongs to the main line of work on edge-density-based clustering on digraphs, which seeks clusters characterized by high intra-cluster and low inter-cluster edge densities, and which has produced several methods over the last two decades (Zhou et al., 2005; Meilȃ & Pentney, 2007; Satuluri & Parthasarathy, 2011; Rohe et al., 2016; Palmer & Zheng, 2020; Sevi et al., 2022; Klus & Djurdjevac Conrad, 2023). Lately, so-called flow-based approaches have been proposed (Cucuringu et al., 2020; Laenen & Sun, 2020; Coste & Stephan, 2021; Hayashi et al., 2022), whose objective is opposite to the traditional density-based graph clustering viewpoint.

3: Consider each x_i ∈ R^N, i = 1, ..., N, to be the embedding of the i-th vertex, represented by the i-th row of K_(t_d,ν), and apply a clustering method (e.g. k-means) to all these vectors, asking for k clusters
4: Obtain the k-partition V_{t_d} = {V_{t_d,j}}_{j=1}^k of the graph vertices based on the clustering result of Step 3
5: return V_{t_d}

5: Obtain the k-partition V_j = {V_{q,j}}_{q=1}^k of the graph vertices based on the clustering result of Step 4.
6: end for
7: Select j* = argmax_{j∈{0,...,J}} CH(X, V_j) or DCH(X, V_j).
8: return t* = 2^{j*}

Figure 1: Comparison between the ground truth classes at different scales and the result of P-RWDKC on a synthetic toy-case. (a),(b) Small scale: ground truth with 6 clusters and the P-RWDKC result. (c),(d) Large scale: ground truth with 2 clusters and the P-RWDKC result.

Algorithm 1 Parametrized Random Walk Diffusion Kernel Clustering (P-RWDKC)
Input: W ∈ R^{N×N}: adjacency matrix, k: number of clusters, ν: vertex measure, t_d: diffusion time
Output: V_{t_d}: graph k-partition for diffusion time t_d
1: Compute the parametrized random walk operator P_(ν), see Eq. 4
2: Compute the parametrized random walk diffusion kernel K_(t_d,ν), see Eq.
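A minimal NumPy sketch of Algorithm 1 follows. Two assumptions are ours: the kernel is taken as K_(t_d,ν) = P_(ν)^{2 t_d} D_{ν+ξ}^{−1} by analogy with Proposition A.1, and a tiny Lloyd's k-means with deterministic farthest-point initialization stands in for a production k-means:

```python
import numpy as np

def p_rwdkc(W, k, nu, t_d, n_iter=50):
    """Sketch of Algorithm 1: embed vertices via the RWDK kernel, then k-means on its rows."""
    P = W / W.sum(axis=1, keepdims=True)            # natural random walk operator (W has no zero rows)
    xi = P.T @ nu                                   # ξ(i) = Σ_j ν(j) p(j, i)
    mu = nu + xi
    # Step 1: parametrized random walk operator P_(ν) (Eq. 4)
    P_nu = (np.diag(nu) @ P + P.T @ np.diag(nu)) / mu[:, None]
    # Step 2: assumed kernel form K = P_(ν)^{2 t_d} D_{ν+ξ}^{-1} (by analogy with Prop. A.1)
    K = np.linalg.matrix_power(P_nu, 2 * t_d) @ np.diag(1.0 / mu)
    # Steps 3-4: k-means on the rows of K, farthest-point init for determinism
    centers = [K[0]]
    for _ in range(k - 1):
        dist = np.min([np.sum((K - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(K[int(np.argmax(dist))])
    C = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((K[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([K[labels == c].mean(axis=0) for c in range(k)])
    return labels

# Toy usage: two dense blocks of 5 vertices joined by a single bridge edge
W = np.zeros((10, 10))
W[:5, :5] = 1.0
W[5:, 5:] = 1.0
np.fill_diagonal(W, 0.0)
W[4, 5] = W[5, 4] = 1.0
labels = p_rwdkc(W, k=2, nu=np.ones(10), t_d=2)
```

On this toy graph the two diffusion-kernel embeddings separate the blocks cleanly, so the returned labels recover the planted 2-partition.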

Table 1: Clustering performance (NMI) on UCI datasets with optimal parameters in parentheses.

Table 2: Clustering performance (NMI) on real-world datasets with optimal parameters in parentheses.

Table 3: Clustering performance (NMI) on UCI datasets with optimal parameters in parentheses.

Table 4: Clustering performance (NMI) on UCI datasets with estimated parameters (shown in parentheses) according to the CH (or DCH) index. Columns: DATASET, N, d, k, SC-SYM_1, SC-SYM_2, DI-SIM_L(τ), DI-SIM_R(τ), DSC+(γ), PIC(t_d), P-RWDK(α, t_d).

Table 5: Clustering performance (NMI) on real-world datasets with optimal parameters in parentheses.

Table 6: Clustering performance (NMI) on real-world datasets with estimated parameters (shown in parentheses) according to the CH (or DCH) index.

