ROBUST GRAPH DICTIONARY LEARNING

Abstract

Traditional Dictionary Learning (DL) aims to approximate data vectors as sparse linear combinations of basis elements (atoms) and is widely used in machine learning, computer vision, and signal processing. To extend DL to graphs, Vincent-Cuaz et al. (2021) proposed a method, called GDL, which describes the topology of each graph with a pairwise relation matrix (PRM) and compares PRMs via the Gromov-Wasserstein Discrepancy (GWD). However, GWD is sensitive to structural noise in graphs, and this lack of robustness often excludes GDL from a variety of real-world applications. This paper proposes an improved graph dictionary learning algorithm based on a robust Gromov-Wasserstein discrepancy (RGWD), which has theoretically sound properties and an efficient numerical scheme. Based on such a discrepancy, our dictionary learning algorithm can learn atoms from noisy graph data. Experimental results demonstrate that our algorithm achieves good performance on both simulated and real-world datasets.

1. INTRODUCTION

Dictionary learning (DL) seeks to learn a set of basis elements (atoms) from data and approximates data samples by sparse linear combinations of these basis elements (Mallat, 1999; Mairal et al., 2009; Tošić and Frossard, 2011). It has numerous machine learning applications, including dimensionality reduction (Feng et al., 2013; Wei et al., 2018), classification (Raina et al., 2007; Mairal et al., 2008), and clustering (Ramirez et al., 2010; Sprechmann and Sapiro, 2010), to name a few. Although DL has received significant attention, it mostly focuses on vectorized data of the same dimension and is not amenable to graph data (Xu, 2020; Vincent-Cuaz et al., 2021; 2022). Many exciting machine learning tasks use graphs to capture complex structures (Backstrom and Leskovec, 2011; Sadreazami et al., 2017; Naderializadeh et al., 2020; Jin et al., 2017; Agrawal et al., 2018). DL for graphs is more challenging due to the lack of effective means to compare graphs. Specifically, evaluating the similarity between one observed graph and its approximation is difficult, since the two often have different numbers of nodes and the node correspondence across graphs is often unknown (Xu, 2020; Vincent-Cuaz et al., 2021). The seminal work of Vincent-Cuaz et al. (2021) proposed a DL method for graphs based on the Gromov-Wasserstein Discrepancy (GWD), a variant of the Gromov-Wasserstein distance. The Gromov-Wasserstein distance compares probability distributions supported on different metric spaces using pairwise distances (Mémoli, 2011). By expressing each graph as a probability measure and capturing the graph topology with a pairwise relation matrix (PRM), comparing graphs can be naturally formulated as computing the GWD, since both the node correspondence and the discrepancy of the compared graphs are calculated (Peyré et al., 2016; Xu et al., 2019b).
However, observed graphs often contain structural noise, such as spurious or missing edges, which leads to differences between the obtained PRMs and the true ones (Donnat et al., 2018; Xu et al., 2019b). Since GWD lacks robustness (Séjourné et al., 2021; Vincent-Cuaz et al., 2022; Tran et al., 2022), the inaccuracies of PRMs may severely affect GWD and the effectiveness of DL in real-world applications.

Contributions. To handle the inaccuracies of PRMs, this paper first proposes a novel robust Gromov-Wasserstein discrepancy (RGWD) which adopts a minimax formulation. We prove that the inner maximization problem has a closed-form solution and derive an efficient numerical scheme to approximate RGWD. Under suitable assumptions, this numerical scheme is guaranteed to find a $\delta$-stationary solution within $O(1/\delta^2)$ iterations. We further prove that RGWD is lower bounded and that the lower bound is achieved if and only if the two graphs are isomorphic. Therefore, RGWD can be employed to compare graphs. RGWD also satisfies the triangle inequality, which is of independent interest and allows numerous potential applications. A robust graph dictionary learning (RGDL) algorithm is thereby developed to learn atoms from noisy graph data, which assesses the quality of approximated graphs via RGWD. Numerical experiments on both synthetic and real-world datasets demonstrate that RGDL achieves good performance.

The rest of the paper is organized as follows. In Sec. 2, a comprehensive review of the background is given. Sec. 3 presents RGWD and the numerical approximation scheme for RGWD. RGDL is delineated in Sec. 4. Empirical results are demonstrated in Sec. 5. We finally discuss related work in Sec. 6.

2. BACKGROUND

2.1. OPTIMAL TRANSPORT

We first present the notation used throughout this paper and then review the definition of the Gromov-Wasserstein distance, which originates from optimal transport theory (Villani, 2008; Peyre and Cuturi, 2018).

Notation. We use bold lowercase symbols (e.g., $\mathbf{x}$), bold uppercase letters (e.g., $\mathbf{A}$), uppercase calligraphic fonts (e.g., $\mathcal{X}$), and Greek letters (e.g., $\alpha$) to denote vectors, matrices, spaces (sets), and measures, respectively. $\mathbf{1}_d \in \mathbb{R}^d$ is the $d$-dimensional all-ones vector. $\Delta_d$ is the probability simplex with $d$ bins, namely the set of probability vectors $\Delta_d = \{\mathbf{a} \in \mathbb{R}_+^d \mid \sum_{i=1}^d a_i = 1\}$. $\mathbf{A}[i,:]$ and $\mathbf{A}[:,j]$ are the $i$-th row and the $j$-th column of matrix $\mathbf{A}$, respectively. Given a matrix $\mathbf{A}$, $\|\mathbf{A}\|_F$ and $\|\mathbf{A}\|_\infty$ denote its Frobenius norm and element-wise $\infty$-norm (i.e., $\|\mathbf{A}\|_\infty = \max_{i,j} |A_{ij}|$), respectively. The cardinality of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. The bracketed notation $[n]$ is shorthand for the integer set $\{1, 2, \ldots, n\}$. A discrete measure $\alpha$ is denoted by $\alpha = \sum_{i=1}^m a_i \delta_{x_i}$, where $\delta_x$ is the Dirac measure at position $x$, i.e., a unit of mass infinitely concentrated at $x$.

Gromov-Wasserstein distance. Optimal transport addresses the problem of transporting one probability measure towards another with minimum cost (Villani, 2008; Peyre and Cuturi, 2018). The induced cost defines a distance between the two probability measures. The Gromov-Wasserstein (GW) distance extends classic optimal transport to compare probability measures supported on different spaces (Mémoli, 2011). Let $(\mathcal{X}, d_\mathcal{X})$ and $(\mathcal{Y}, d_\mathcal{Y})$ be two metric spaces. Given two probability measures $\alpha = \sum_{i=1}^m p_i \delta_{x_i}$ and $\beta = \sum_{i'=1}^n q_{i'} \delta_{y_{i'}}$, where $x_1, x_2, \ldots, x_m \in \mathcal{X}$ and $y_1, y_2, \ldots, y_n \in \mathcal{Y}$, the $r$-GW distance between $\alpha$ and $\beta$ is defined as
$$\mathrm{GW}_r(\alpha, \beta) := \Big( \min_{\mathbf{T} \in \Pi(\mathbf{p}, \mathbf{q})} \sum_{i,j=1}^m \sum_{i',j'=1}^n D_{ii'jj'}^r\, T_{ii'} T_{jj'} \Big)^{\frac{1}{r}},$$
where the feasible domain of the transport plan $\mathbf{T} = [T_{ii'}]$ is the set $\Pi(\mathbf{p}, \mathbf{q}) = \{\mathbf{T} \in \mathbb{R}_+^{m \times n} \mid \mathbf{T}\mathbf{1}_n = \mathbf{p},\ \mathbf{T}^\top \mathbf{1}_m = \mathbf{q}\}$, and $D_{ii'jj'}$ is the difference between pairwise distances, i.e., $D_{ii'jj'} = |d_\mathcal{X}(x_i, x_j) - d_\mathcal{Y}(y_{i'}, y_{j'})|$.
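As a small illustration (our own toy example, not from the paper), the displayed objective can be evaluated in NumPy for a fixed, feasible plan; the distance itself would additionally minimize over the plan:

```python
import numpy as np

def gw_objective(DX, DY, T, r=2):
    """Evaluate the r-GW objective for a FIXED plan T:
    ( sum_{i,j,i',j'} |DX_ij - DY_i'j'|^r T_ii' T_jj' )^(1/r).
    The GW distance minimizes this quantity over T in Pi(p, q)."""
    # D[i, j, a, b] = |DX[i, j] - DY[a, b]|
    D = np.abs(DX[:, :, None, None] - DY[None, None, :, :])
    val = np.einsum("ijab,ia,jb->", D ** r, T, T)
    return val ** (1.0 / r)

# two copies of the same 2-point metric space, uniform weights
DX = np.array([[0.0, 1.0],
               [1.0, 0.0]])
p = np.array([0.5, 0.5])
T_diag = np.diag(p)  # matches each point to itself; rows and columns sum to p
print(gw_objective(DX, DX, T_diag))  # -> 0.0 (isometric spaces, perfect matching)
```

The diagonal plan attains zero cost because every pairwise distance is preserved, whereas a mismatched plan (e.g., the product coupling $\mathbf{p}\mathbf{p}^\top$) incurs a positive cost.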

2.2. GRAPH REPRESENTATION AND COMPARISON

In this subsection, we formalize the idea of comparing graphs with GWD, which addresses the challenges that graphs often have different numbers of nodes and that the node correspondence is unknown (Xu et al., 2019b; Xu, 2020; Vincent-Cuaz et al., 2021).

Pairwise relation and graph representation. Given a graph $G$ with $n$ nodes, assigning each node an index $i \in [n]$, $G$ can be expressed as a tuple $(\mathbf{C}, \mathbf{p})$, where $\mathbf{C} \in \mathbb{R}^{n \times n}$ is a matrix encoding the pairwise relations (e.g., adjacency, shortest-path, Laplacian, or heat kernel) and $\mathbf{p} \in \Delta_n$ is a probability vector modeling the relative importance of nodes within the graph (Peyré et al., 2016; Xu et al., 2019b; Titouan et al., 2019; Vincent-Cuaz et al., 2022).

Gromov-Wasserstein Discrepancy. GWD can be derived from the 2-GW distance by replacing the metrics with pairwise relations (Xu et al., 2019b; Vincent-Cuaz et al., 2022). More specifically, given an observed source graph $G^s$ and a target graph $G^t$ expressed as $(\mathbf{C}^s, \mathbf{p}^s)$ and $(\mathbf{C}^t, \mathbf{p}^t)$ respectively, GWD is defined as
$$\mathrm{GWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t)\big) = \Big( \min_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'})^2\, T_{ii'} T_{jj'} \Big)^{\frac{1}{2}},$$
where $n_s$ and $n_t$ are the numbers of nodes of $G^s$ and $G^t$, respectively. GWD computes both a soft assignment matrix between the nodes of the two graphs and a notion of discrepancy between them. For conciseness, we abbreviate $\mathrm{GWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t)\big)$ to $\mathrm{GWD}(\mathbf{C}^s, \mathbf{C}^t)$ in the sequel.
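A minimal sketch of the idea (our own toy instance): represent each graph by its adjacency PRM and uniform node weights, and evaluate the GWD objective for a fixed coupling. A relabeled copy of a graph is matched at zero cost by the coupling induced by the relabeling:

```python
import numpy as np

def gwd_objective(Cs, Ct, T):
    """Squared-loss GWD objective for a fixed coupling T:
    ( sum_{i,j,i',j'} (Cs_ij - Ct_i'j')^2 T_ii' T_jj' )^(1/2)."""
    M = (Cs[:, :, None, None] - Ct[None, None, :, :]) ** 2
    return np.sqrt(np.einsum("ijab,ia,jb->", M, T, T))

# a 3-node path graph represented by its adjacency PRM and uniform weights
Cs = np.array([[0., 1., 0.],
               [1., 0., 1.],
               [0., 1., 0.]])
ps = np.full(3, 1 / 3)

# the same graph with relabeled nodes: Ct = Q Cs Q^T
perm = [2, 0, 1]
Q = np.eye(3)[perm]
Ct = Q @ Cs @ Q.T

# the coupling induced by the relabeling attains zero discrepancy
T_perm = Q.T / 3
print(gwd_objective(Cs, Ct, T_perm))  # -> 0.0
```

An uninformed coupling such as the product plan $\mathbf{p}^s (\mathbf{p}^s)^\top$ would report a strictly positive cost on the same pair of graphs.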

2.3. DICTIONARY LEARNING

Traditional DL approximates data vectors as sparse linear combinations of basis elements (atoms) (Mallat, 1999; Mairal et al., 2009; Tošić and Frossard, 2011; Jiang et al., 2015), and is usually formulated as
$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{W}} \sum_{k=1}^K \Big\| \mathbf{X}[:,k] - \sum_{m=1}^M w^k_m \mathbf{D}[:,m] \Big\|_2^2 + \lambda \Omega(\mathbf{w}^k), \tag{1}$$
where $\mathbf{X} \in \mathbb{R}^{d \times K}$ is the data matrix whose columns represent samples, the matrix $\mathbf{D} \in \mathbb{R}^{d \times M}$ contains the $M$ atoms to learn and is constrained to the set $\mathcal{C} = \{\mathbf{D} \in \mathbb{R}^{d \times M} \mid \forall m \in [M],\ \|\mathbf{D}[:,m]\|_2 \le 1\}$, $\mathbf{W} \in \mathbb{R}^{M \times K}$ is the new representation of the data whose $k$-th column $\mathbf{w}^k = [w^k_m]_{m \in [M]}$ stores the embedding of the $k$-th sample, and $\lambda \Omega(\mathbf{w}^k)$ promotes the sparsity of $\mathbf{w}^k$. Such a formulation only applies to vectorized data. Recently, Xu (2020) proposed to approximate graphs via the highly non-linear GW barycenter. Specifically, given a dataset of $K$ graphs with PRMs $\{\mathbf{C}^k\}_{k \in [K]}$ such that $\mathbf{C}^k \in \mathbb{R}^{n_k \times n_k}$, the basis elements $\{\bar{\mathbf{C}}_m\}_{m \in [M]}$ are learned by solving
$$\min_{\{\bar{\mathbf{C}}_m\}_{m \in [M]},\, \{\mathbf{w}^k\}_{k \in [K]}} \sum_{k=1}^K \mathrm{GWD}^2\big(\mathbf{C}^k, \mathcal{B}(\mathbf{w}^k, \{\bar{\mathbf{C}}_m\}_{m \in [M]})\big),$$
where $\mathbf{w}^k \in \Delta_M$ is referred to as the embedding of the $k$-th graph $G_k$, and the GW barycenter $\mathcal{B}(\mathbf{w}^k, \{\bar{\mathbf{C}}_m\}_{m \in [M]})$ gives the approximation of $G_k$ and is defined as $\mathcal{B}(\mathbf{w}^k, \{\bar{\mathbf{C}}_m\}_{m \in [M]}) = \operatorname{argmin}_{\mathbf{B}} \sum_{m=1}^M w^k_m\, \mathrm{GWD}^2(\mathbf{B}, \bar{\mathbf{C}}_m)$. Therefore, a complex bi-level optimization problem is involved, which is computationally inefficient (Vincent-Cuaz et al., 2021).

DL for graphs. To overcome the above computational issues, Vincent-Cuaz et al. (2021) proposed GDL, which approximates each graph as a weighted sum of PRMs and is formulated as
$$\min_{\{\bar{\mathbf{C}}_m\}_{m \in [M]},\, \{\mathbf{w}^k\}_{k \in [K]}} \sum_{k=1}^K \mathrm{GWD}^2\Big(\mathbf{C}^k, \sum_{m=1}^M w^k_m \bar{\mathbf{C}}_m\Big) + \lambda \Omega(\mathbf{w}^k),$$
where each atom $\bar{\mathbf{C}}_m$ is an $n_a \times n_a$ matrix. In contrast to the $\ell_2$ loss in Eq. (1), GWD is used to assess the quality of the linear representation $\sum_{m=1}^M w^k_m \bar{\mathbf{C}}_m$ for $k \in [K]$.
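The classic objective of Eq. (1) can be sketched in a few lines of NumPy; the instance below is our own (with an $\ell_1$ penalty as one concrete choice of $\Omega$), not the paper's setup:

```python
import numpy as np

def dl_objective(X, D, W, lam):
    """Classic DL loss of Eq. (1): squared reconstruction error of each column
    of X by the code D @ W[:, k], with an l1 penalty as one choice of Omega."""
    recon = X - D @ W
    return np.sum(recon ** 2) + lam * np.abs(W).sum()

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 4))
D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # enforce ||D[:, m]||_2 <= 1
W_true = np.zeros((4, 5))
W_true[0] = 1.0                                  # sparse codes: one active atom
X = D @ W_true
# with the generating codes the residual vanishes, leaving only the penalty
print(dl_objective(X, D, W_true, lam=0.1))  # -> 0.5
```

This also makes the limitation discussed in the text concrete: the loss compares fixed-length columns of $\mathbf{X}$ entrywise, which is undefined for graphs of varying size.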
However, the observed graphs often contain noisy edges or miss some edges in real-world applications (Clauset et al., 2008; Xu et al., 2019b; Shi et al., 2019; Piccioli et al., 2022), which leads to inaccuracies of the PRMs $\mathbf{C}^k$, that is, deviations between $\mathbf{C}^k$ and the true PRM $\mathbf{C}^{k*}$. Since GWD lacks robustness (Séjourné et al., 2021; Vincent-Cuaz et al., 2022; Tran et al., 2022), the quality of the learned dictionary may be severely affected.

3. ROBUST GROMOV-WASSERSTEIN DISCREPANCY

To deal with the inaccuracies of PRMs, this section defines a robust variant of GWD, referred to as RGWD. The properties of RGWD are rigorously analyzed. We then derive a theoretically guaranteed numerical scheme for approximately computing RGWD. Due to the limit of space, all proofs can be found in the appendix.

Definition 1. Given an observed source graph $G^s$ and a target graph $G^t$ expressed as $(\mathbf{C}^s, \mathbf{p}^s)$ and $(\mathbf{C}^t, \mathbf{p}^t)$ respectively, RGWD is defined by the solution to the following problem:
$$\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) = \Big( \min_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \max_{\mathbf{E} \in \mathcal{U}_\epsilon} f(\mathbf{T}, \mathbf{E}; \mathbf{C}^s, \mathbf{C}^t) \Big)^{\frac{1}{2}},$$
where the objective $f(\cdot)$ is given by
$$f(\mathbf{T}, \mathbf{E}; \mathbf{C}^s, \mathbf{C}^t) = \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, T_{ii'} T_{jj'},$$
and the perturbation $\mathbf{E}$ lies in the bounded set $\mathcal{U}_\epsilon = \{\mathbf{E} \mid \mathbf{E} = \mathbf{E}^\top \text{ and } \|\mathbf{E}\|_\infty \le \epsilon\}$. RGWD requires the sought transport plan to have low transportation cost for every perturbation $\mathbf{E}$ in the set $\mathcal{U}_\epsilon$. For succinctness, we omit $\mathbf{C}^s$ and $\mathbf{C}^t$ in $f(\mathbf{T}, \mathbf{E}; \mathbf{C}^s, \mathbf{C}^t)$ in the following.
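The objective $f$ admits a simple quadratic expansion in $\mathbf{E}$ (it is the identity used later to solve the inner problem in closed form, relying on $\sum_i T_{ii'} = p^t_{i'}$). A numerical sanity check of that expansion, on a toy instance of our own:

```python
import numpy as np

def f_robust(Cs, Ct, E, T):
    """f(T, E; Cs, Ct) = sum_{i,j,i',j'} (Cs_ij - Ct_i'j' - E_i'j')^2 T_ii' T_jj'."""
    M = (Cs[:, :, None, None] - (Ct + E)[None, None, :, :]) ** 2
    return np.einsum("ijab,ia,jb->", M, T, T)

rng = np.random.default_rng(1)
ns, nt, eps = 3, 4, 0.5
Cs = rng.random((ns, ns)); Cs = (Cs + Cs.T) / 2
Ct = rng.random((nt, nt)); Ct = (Ct + Ct.T) / 2
E = rng.uniform(-eps, eps, (nt, nt)); E = (E + E.T) / 2  # symmetric, |E| <= eps
ps, pt = np.full(ns, 1 / ns), np.full(nt, 1 / nt)
T = np.outer(ps, pt)                                     # a feasible coupling

# quadratic expansion in E (uses sum_i T_ii' = pt_i'):
# f(T, E) = f(T, 0) + sum pt_i' pt_j' E_i'j'^2 - 2 sum s_i'j' E_i'j',
# with s_i'j' = sum_ij T_ii' T_jj' (Cs_ij - Ct_i'j') = (T^T Cs T - (pt pt^T) * Ct)_i'j'
s = T.T @ Cs @ T - np.outer(pt, pt) * Ct
f_direct = f_robust(Cs, Ct, E, T)
f_expanded = (f_robust(Cs, Ct, np.zeros_like(E), T)
              + np.sum(np.outer(pt, pt) * E ** 2) - 2 * np.sum(s * E))
```

Since the expansion is entrywise quadratic and convex in each $E_{i'j'}$, the worst case over the box $[-\epsilon, \epsilon]$ is attained at its corners, which is what the closed-form solution of the next subsection exploits.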

3.1. PROPERTIES OF RGWD

The properties of RGWD are presented as follows. Firstly, although RGWD involves a non-convex non-concave minimax optimization problem, the inner maximization problem has a closed-form solution, which allows an efficient numerical scheme for RGWD. Secondly, RGWD has a lower bound that is achieved if and only if the expressions of the compared graphs are identical up to a permutation, which implies RGWD can be employed to evaluate the similarity between one observed graph and its approximation in DL. Thirdly, RGWD satisfies the triangle inequality, which allows numerous potential applications including clustering (Elkan, 2003; HajKacem et al., 2019), metric learning (Pitis et al., 2019), and Bayesian learning (Moore, 2000; Xiao et al., 2019). Finally, arbitrarily changing the node orders does not affect the value of RGWD. More formally, we state the properties in the following theorem.

Theorem 1. Given an observed source graph $G^s$ and a target graph $G^t$ expressed as $(\mathbf{C}^s, \mathbf{p}^s)$ and $(\mathbf{C}^t, \mathbf{p}^t)$ respectively, RGWD satisfies:

1. For all $\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)$, the perturbation $\mathbf{E}(\mathbf{T}) = [E_{i'j'}(\mathbf{T})]$ with
$$E_{i'j'}(\mathbf{T}) = \begin{cases} \epsilon, & \text{if } \sum_{i,j=1}^{n_s} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'}) \le 0, \\ -\epsilon, & \text{otherwise}, \end{cases}$$
solves the inner maximization problem $\max_{\mathbf{E} \in \mathcal{U}_\epsilon} f(\mathbf{T}, \mathbf{E})$.

2. RGWD is lower bounded, that is, $\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) \ge \epsilon$, where the equality holds if and only if there exists a bijection $\pi^*: [n_s] \to [n_t]$ such that $p^s_i = p^t_{\pi^*(i)}$ for all $i \in [n_s]$ and $C^s_{ij} = C^t_{\pi^*(i)\pi^*(j)}$ for all $i, j \in [n_s]$.

3. The triangle inequality holds for RGWD, i.e., $\mathrm{RGWD}\big((\mathbf{C}^1, \mathbf{p}^1), (\mathbf{C}^3, \mathbf{p}^3), \epsilon\big) \le \mathrm{RGWD}\big((\mathbf{C}^1, \mathbf{p}^1), (\mathbf{C}^2, \mathbf{p}^2), \epsilon\big) + \mathrm{RGWD}\big((\mathbf{C}^2, \mathbf{p}^2), (\mathbf{C}^3, \mathbf{p}^3), \epsilon\big)$.

4. RGWD is invariant to the permutation of node orders, i.e., for all permutation matrices $\mathbf{Q}^s$ and $\mathbf{Q}^t$, $\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) = \mathrm{RGWD}\big((\mathbf{Q}^s \mathbf{C}^s \mathbf{Q}^{s\top}, \mathbf{Q}^s \mathbf{p}^s), (\mathbf{Q}^t \mathbf{C}^t \mathbf{Q}^{t\top}, \mathbf{Q}^t \mathbf{p}^t), \epsilon\big)$.

As is implied by Theorem 1, RGWD does not define a distance between metric-measure spaces. Firstly, the identity axiom is not satisfied.
Secondly, symmetry generally does not hold either, as the following example shows.

Example 1 (Asymmetry of RGWD). Consider the case $\mathbf{p}^s = \mathbf{p}^t = [0.5, 0.5]^\top$, $\mathbf{C}^s = \big[\begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix}\big]$, $\mathbf{C}^t = \big[\begin{smallmatrix} 0 & 4 \\ 4 & 0 \end{smallmatrix}\big]$. Then $\mathrm{RGWD}^2\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), 1\big) = 11.5$, with the solution given by $\mathbf{T}^* = \big[\begin{smallmatrix} 0.25 & 0.25 \\ 0.25 & 0.25 \end{smallmatrix}\big]$ and $\mathbf{E}^* = \big[\begin{smallmatrix} -1 & 1 \\ 1 & -1 \end{smallmatrix}\big]$. In contrast, $\mathrm{RGWD}^2\big((\mathbf{C}^t, \mathbf{p}^t), (\mathbf{C}^s, \mathbf{p}^s), 1\big) = 10.5$, with $\mathbf{T}^* = \big[\begin{smallmatrix} 0.25 & 0.25 \\ 0.25 & 0.25 \end{smallmatrix}\big]$ and $\mathbf{E}^* = \big[\begin{smallmatrix} -1 & -1 \\ -1 & -1 \end{smallmatrix}\big]$. One has $\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), 1\big) \neq \mathrm{RGWD}\big((\mathbf{C}^t, \mathbf{p}^t), (\mathbf{C}^s, \mathbf{p}^s), 1\big)$. Example 1 shows that RGWD is asymmetric even if $n_s = n_t$.
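The closed-form inner maximizer of Theorem 1 can be checked numerically against random feasible perturbations; the sketch below uses a random toy instance of our own:

```python
import numpy as np

def f_robust(Cs, Ct, E, T):
    """f(T, E; Cs, Ct) = sum (Cs_ij - Ct_i'j' - E_i'j')^2 T_ii' T_jj'."""
    M = (Cs[:, :, None, None] - (Ct + E)[None, None, :, :]) ** 2
    return np.einsum("ijab,ia,jb->", M, T, T)

def worst_case_E(Cs, Ct, T, eps):
    """Closed-form inner maximizer from Theorem 1: entry (i', j') is +eps when
    s_i'j' = sum_ij T_ii' T_jj' (Cs_ij - Ct_i'j') <= 0, and -eps otherwise."""
    pt = T.sum(axis=0)  # target marginal of the coupling
    s = T.T @ Cs @ T - np.outer(pt, pt) * Ct
    return np.where(s <= 0, eps, -eps)

rng = np.random.default_rng(2)
ns, nt, eps = 3, 4, 0.4
Cs = rng.random((ns, ns)); Cs = (Cs + Cs.T) / 2
Ct = rng.random((nt, nt)); Ct = (Ct + Ct.T) / 2
T = np.outer(np.full(ns, 1 / ns), np.full(nt, 1 / nt))
E_star = worst_case_E(Cs, Ct, T, eps)  # symmetric because Cs and Ct are symmetric
```

Sampling symmetric perturbations with $\|\mathbf{E}\|_\infty \le \epsilon$ and comparing their objective values against $f(\mathbf{T}, \mathbf{E}(\mathbf{T}))$ confirms that the corner solution dominates.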

3.2. NUMERICAL SCHEME OF RGWD

We derive a gradient-based numerical scheme for RGWD by exploiting the property that the inner maximization problem has a closed-form solution; the scheme is summarized in Algorithm 1. In each iteration, the perturbation $\mathbf{E}^\tau$ that solves the inner problem for the current $\mathbf{T}^\tau$ is calculated. Then, the transport plan is updated using projected gradient descent.

Algorithm 1 Projected Gradient Descent for RGWD
1: Input: Initialization $\mathbf{T}^0$, step-size $\eta$, number of iterations $N$.
2: Output: Estimated optimal transport plan $\hat{\mathbf{T}}$ and its corresponding perturbation $\hat{\mathbf{E}}$.
3: for $\tau = 0, 1, \ldots, N-1$ do
4: Find $\mathbf{E}^\tau$ that maximizes $f(\mathbf{T}^\tau, \mathbf{E})$.
5: Update the transport plan via $\mathbf{T}^{\tau+1} = \mathrm{Proj}_{\Pi(\mathbf{p}^s, \mathbf{p}^t)}\big(\mathbf{T}^\tau - \eta \nabla_\mathbf{T} f(\mathbf{T}^\tau, \mathbf{E}^\tau)\big)$, where the partial gradient takes the form
$$\nabla_\mathbf{T} f(\mathbf{T}^\tau, \mathbf{E}^\tau) = 2 (\mathbf{C}^s \odot \mathbf{C}^s)\, \mathbf{T}^\tau \mathbf{1}_{n_t} \mathbf{1}_{n_t}^\top + 2\, \mathbf{1}_{n_s} \mathbf{1}_{n_s}^\top \mathbf{T}^\tau \big( (\mathbf{C}^t + \mathbf{E}^\tau) \odot (\mathbf{C}^t + \mathbf{E}^\tau) \big) - 4\, \mathbf{C}^s \mathbf{T}^\tau (\mathbf{C}^t + \mathbf{E}^\tau),$$
with $\odot$ denoting element-wise multiplication.
6: end for
7: Pick $\tau$ uniformly at random from $\{1, 2, \ldots, N\}$.
8: Set $\hat{\mathbf{T}} \leftarrow \mathbf{T}^\tau$.
9: Find $\hat{\mathbf{E}}$ that maximizes $f(\hat{\mathbf{T}}, \mathbf{E})$.

To present the convergence guarantee of Algorithm 1, we introduce the notion of the Moreau envelope. The near-stationarity of a function $h(\mathbf{x})$ can be quantified by the norm of the gradient of its Moreau envelope $h_\lambda(\mathbf{x}) = \min_{\mathbf{x}'} h(\mathbf{x}') + \frac{1}{2\lambda} \|\mathbf{x} - \mathbf{x}'\|^2$. The following theorem gives the convergence rate of Algorithm 1; the proof is deferred to the appendix.

Theorem 2. Define $\varphi(\cdot) = \max_{\mathbf{E} \in \mathcal{U}_\epsilon} f(\cdot, \mathbf{E})$. The output $\hat{\mathbf{T}}$ of Algorithm 1 with step-size $\eta = \frac{\gamma}{\sqrt{N+1}}$ satisfies
$$\mathbb{E}\big[ \|\nabla \varphi_{1/2l}(\hat{\mathbf{T}})\|^2 \big] \le 2 \cdot \frac{\varphi_{1/2l}(\mathbf{T}^0) - \min_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \varphi(\mathbf{T}) + l L^2 \gamma^2}{\gamma \sqrt{N+1}},$$
where $l = \sqrt{2} \max\{10 n^3 U_1^2 + 6 n^3 U_1 \epsilon + 4 n U_1 U_2 + 4 n^3 \epsilon^2,\ 6 n^2 U_1 U_2 + 2 U_2^2 + 4 n^2 \epsilon U_2\}$ and $L = \sqrt{2} \max\{(4U_1 + 2\epsilon) U_2^2 n^3,\ 2(2U_1 + \epsilon)^2 U_2 n^3\}$, with $n = \max\{n_s, n_t\}$, $U_1 = \max\{\|\mathbf{C}^s\|_\infty, \|\mathbf{C}^t\|_\infty\}$, and $U_2 = \max\{\|\mathbf{p}^t\|_2, \max_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \|\mathbf{T}\|_F\}$. When $U_1$ and $\epsilon$ are of the order $O(\frac{1}{n^3})$, both $l$ and $L$ are of the order $O(1)$, and Theorem 2 states that a $\delta$-stationary solution can be obtained within $O(\frac{1}{\delta^2})$ iterations. Note that we can multiply $\mathbf{C}^s$, $\mathbf{C}^t$, and $\epsilon$ by the same number without affecting the resulting transport plan.
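The matrix form of the partial gradient in Algorithm 1 can be verified against finite differences; the instance below is a random toy problem of our own (symmetric $\mathbf{C}^s$, $\mathbf{C}^t$, $\mathbf{E}$, as the paper assumes):

```python
import numpy as np

def f_robust(Cs, Ct, E, T):
    M = (Cs[:, :, None, None] - (Ct + E)[None, None, :, :]) ** 2
    return np.einsum("ijab,ia,jb->", M, T, T)

def grad_T(Cs, Ct, E, T):
    """Partial gradient from Algorithm 1 (Cs, Ct, E symmetric):
    2 (Cs o Cs) T 1 1^T + 2 * 1 1^T T ((Ct+E) o (Ct+E)) - 4 Cs T (Ct+E)."""
    ns, nt = T.shape
    B = Ct + E
    one_s = np.ones((ns, 1))
    one_t = np.ones((nt, 1))
    return (2 * (Cs * Cs) @ T @ one_t @ one_t.T
            + 2 * one_s @ one_s.T @ T @ (B * B)
            - 4 * Cs @ T @ B)

rng = np.random.default_rng(3)
ns, nt, eps = 3, 4, 0.3
Cs = rng.random((ns, ns)); Cs = (Cs + Cs.T) / 2
Ct = rng.random((nt, nt)); Ct = (Ct + Ct.T) / 2
E = rng.uniform(-eps, eps, (nt, nt)); E = (E + E.T) / 2
T = np.outer(np.full(ns, 1 / ns), np.full(nt, 1 / nt))
G = grad_T(Cs, Ct, E, T)
```

The projection onto $\Pi(\mathbf{p}^s, \mathbf{p}^t)$ in step 5 is not shown here; it is a Euclidean projection onto a transportation polytope, which requires its own solver (e.g., alternating or Dykstra-type projections).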

4. ROBUST GRAPH DICTIONARY LEARNING

The problem of learning a robust dictionary for graph data is now formulated as follows. Given a dataset of $K$ graphs expressed by $\{(\mathbf{C}^k, \mathbf{p}^k)\}_{k \in [K]}$, estimating the optimal dictionary is formalized as
$$\min_{\{\bar{\mathbf{C}}_m\}_{m \in [M]},\, \{\mathbf{w}^k\}_{k \in [K]}} \sum_{k=1}^K \Big[ \mathrm{RGWD}^2\Big( \Big(\sum_{m=1}^M w^k_m \bar{\mathbf{C}}_m, \bar{\mathbf{p}}\Big), (\mathbf{C}^k, \mathbf{p}^k), \epsilon \Big) - \lambda \|\mathbf{w}^k\|^2 \Big], \tag{2}$$
where $\{\bar{\mathbf{C}}_m\}_{m \in [M]}$ and $\{\mathbf{w}^k\}_{k \in [K]}$ are the dictionary and the graph embeddings respectively, and $\bar{\mathbf{p}}$ is obtained by sorting and averaging $\{\mathbf{p}^k\}_{k \in [K]}$ following Xu et al. (2019a). To solve (2), we propose a nested iterative optimization algorithm, summarized in Algorithm 2. The main idea is that the dictionary and the embeddings are updated alternately. We discuss some crucial details below.

Algorithm 2 Robust Graph Dictionary Learning (RGDL)
1: Input: The dataset $\{\mathbf{C}^k, \mathbf{p}^k\}_{k \in [K]}$, the initial dictionary $\{\bar{\mathbf{C}}_m\}_{m \in [M]}$, the number of iterations $T$, mini-batch size $b$.
2: Output: The learned dictionary $\{\bar{\mathbf{C}}_m\}_{m \in [M]}$.
3: for $t = 0, 1, \ldots, T-1$ do
4-11: Sample a mini-batch $\mathcal{B}$ of size $b$ and solve the embedding problems for $k \in \mathcal{B}$ (see below), obtaining $\mathbf{w}^k$, $\mathbf{T}_k$, and $\mathbf{E}_k$.
12: Update the atom $\bar{\mathbf{C}}_m$ for $m \in [M]$ with the stochastic gradient $\nabla_{\bar{\mathbf{C}}_m}$, which has the form
$$\nabla_{\bar{\mathbf{C}}_m} = \frac{2}{b} \sum_{k \in \mathcal{B}} w^k_m \Big[ \Big( \sum_{m'=1}^M w^k_{m'} \bar{\mathbf{C}}_{m'} \Big) \odot \bar{\mathbf{p}} \bar{\mathbf{p}}^\top - \mathbf{T}_k (\mathbf{C}^k + \mathbf{E}_k) \mathbf{T}_k^\top \Big]. \tag{3}$$
13: end for

Solving $\mathbf{w}^k$. We now formulate the problem of obtaining the embedding of the $k$-th graph $G_k$ when the dictionary is fixed and the PRM is inaccurate. Given a dictionary $\{\bar{\mathbf{C}}_m\}_{m \in [M]}$ where each $\bar{\mathbf{C}}_m \in \mathbb{R}^{n_a \times n_a}$, the embedding of $G_k$ expressed by $(\mathbf{C}^k, \mathbf{p}^k)$ is calculated by solving
$$\min_{\mathbf{w}^k \in \Delta_M} \mathrm{RGWD}^2\Big( \Big(\sum_{m=1}^M w^k_m \bar{\mathbf{C}}_m, \bar{\mathbf{p}}\Big), (\mathbf{C}^k, \mathbf{p}^k), \epsilon \Big) - \lambda \|\mathbf{w}^k\|^2,$$
where $\lambda \ge 0$ induces a negative quadratic regularization promoting sparsity on the simplex (Li et al., 2020; Vincent-Cuaz et al., 2021). When $\mathbf{w}^k$ is fixed, $\mathbf{T}_k$ and $\mathbf{E}_k$ can be updated by Algorithm 1, whose convergence is guaranteed by Theorem 2. For fixed $\mathbf{T}_k$ and $\mathbf{E}_k$, updating $\mathbf{w}^k$ is a non-convex problem that can be tackled by a conditional gradient algorithm. Note that for non-convex problems, the conditional gradient algorithm provably converges to a local stationary point (Lacoste-Julien, 2016).
Such a procedure is described from Line 5 to Line 11 of Algorithm 2 and, empirically, we observe that it converges within tens of iterations.

Stochastic updates. To enhance computational efficiency, each atom is updated with stochastic estimates of the gradient. At each stochastic update, $b$ embedding learning problems are solved independently for the current dictionary using the procedure stated above, where $b$ is the size of the sampled mini-batch. Each atom is then updated using the stochastic gradient given in Eq. (3). Note that the symmetry of each atom is preserved as long as the initial atom is symmetric, since the stochastic gradients are symmetric.
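The atom gradient of Eq. (3), as we read the (garbled) formula, can be checked per sample against finite differences when $\mathbf{T}_k$ and $\mathbf{E}_k$ are held fixed; the instance below is a toy setup of our own:

```python
import numpy as np

def lin_model_loss(atoms, w, Ck, Ek, Tk):
    """Loss with T_k and E_k held fixed: sum (Ctilde_ij - (Ck+Ek)_i'j')^2 Tk_ii' Tk_jj',
    where Ctilde = sum_m w_m atoms[m] is the linear model of the graph."""
    Ctilde = np.tensordot(w, atoms, axes=1)
    M = (Ctilde[:, :, None, None] - (Ck + Ek)[None, None, :, :]) ** 2
    return np.einsum("ijab,ia,jb->", M, Tk, Tk)

def atom_grad(atoms, w, p_bar, Ck, Ek, Tk, m):
    """Per-sample version of the stochastic gradient in Eq. (3) as we read it:
    2 w_m [ Ctilde * (p_bar p_bar^T) - Tk (Ck + Ek) Tk^T ]
    (valid when the rows of Tk sum to p_bar)."""
    Ctilde = np.tensordot(w, atoms, axes=1)
    return 2 * w[m] * (Ctilde * np.outer(p_bar, p_bar) - Tk @ (Ck + Ek) @ Tk.T)

rng = np.random.default_rng(4)
na, nk, M_atoms = 4, 5, 3
atoms = rng.random((M_atoms, na, na)); atoms = (atoms + atoms.transpose(0, 2, 1)) / 2
w = np.array([0.2, 0.5, 0.3])                 # a point on the simplex
p_bar = np.full(na, 1 / na); pk = np.full(nk, 1 / nk)
Ck = rng.random((nk, nk)); Ck = (Ck + Ck.T) / 2
Ek = 0.1 * np.ones((nk, nk))
Tk = np.outer(p_bar, pk)                      # feasible plan; rows sum to p_bar
G1 = atom_grad(atoms, w, p_bar, Ck, Ek, Tk, m=1)
```

Because the gradient of the loss with respect to each atom is a symmetric matrix whenever the atoms, $\mathbf{C}^k$, and $\mathbf{E}_k$ are symmetric, symmetric initialization is preserved across updates, as the text notes.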

5. EXPERIMENTS

This section provides empirical evidence that RGDL performs well in the unsupervised graph clustering task on both synthetic and real-world datasets. The heat kernel matrix is employed as the PRM since it captures both global and local topology and achieves good performance in many tasks (Donnat et al., 2018; Tsitsulin et al., 2018; Chowdhury and Needham, 2021).
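For concreteness, a minimal sketch of building a heat kernel PRM from an adjacency matrix (the time parameter $t$ and the small example graph are our own choices, not the paper's configuration):

```python
import numpy as np

def heat_kernel_prm(A, t=1.0):
    """Heat kernel PRM C = exp(-t L), with L = D - A the combinatorial Laplacian,
    computed via the eigendecomposition of the symmetric L."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)
    return (vecs * np.exp(-t * vals)) @ vecs.T

# 4-cycle graph
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
C = heat_kernel_prm(A, t=0.5)
# since L @ 1 = 0, exp(-t L) @ 1 = 1: every row of C sums to one
```

Small $t$ emphasizes local structure while large $t$ blends in global topology, which is the intuition behind using the heat kernel as a PRM.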

5.1. SIMULATED DATASETS

We first test RGDL in the graph clustering task on datasets simulated according to the well-studied Stochastic Block Model (SBM) (Holland et al., 1983; Wang and Wong, 1987). RGDL is compared against the following state-of-the-art graph clustering methods: (i) GDL (Vincent-Cuaz et al., 2021), which learns graph dictionaries via GWD; (ii) Gromov-Wasserstein Factorization (GWF) (Xu, 2020), which approximates graphs via GW barycenters; (iii) Spectral Clustering (SC) (Shi and Malik, 2000; Stella and Shi, 2003) applied to the matrix whose entries store the pairwise GWDs between graphs.

[Table 1: ARI scores (standard deviations in parentheses) for the compared methods and for RGDL with $\epsilon \in \{0.01, 0.1, 0.2, 0.3, 10, 30\}$.]

Dataset generation. We consider two scenarios of inaccuracies. In the first scenario (S1), Gaussian noise is added to the heat kernel matrix of each graph. More specifically, denoting the heat kernel matrix of the $k$-th graph as $\mathbf{C}^{k*}$ for $k \in [K]$, the PRM available to the DL methods is $\mathbf{C}^k = \mathbf{C}^{k*} + \mathbf{Z} + \mathbf{Z}^\top$, where each entry $Z_{ij}$ of $\mathbf{Z}$ is sampled from the Gaussian distribution $\mathcal{N}(0, \sigma)$. In the second scenario (S2), we randomly add $\rho |\mathcal{E}|$ edges to the graph and then randomly remove $\rho |\mathcal{E}|$ edges while keeping the graph connected, where $\mathcal{E}$ is the edge set of the graph. The heat kernel matrix is then constructed for the modified graph. These two scenarios allow us to study the performance of RGDL under different scales of inaccuracies.
In both S1 and S2, we generate two datasets, both of which involve three generative structures (also used to label graphs): dense (only one community), two communities, and three communities. We fix $p = 0.1$ as the probability of inter-community connectivity and $1 - p$ as the probability of intra-community connectivity. The first dataset includes 20 graphs for each generative structure and is thus referred to as the balanced dataset. The second dataset consists of 12, 18, and 30 graphs for the three generative structures respectively, and is hence named the unbalanced dataset. The number of graph nodes is uniformly sampled from $[30, 50]$. The magnitude of the observed PRM $\mathbf{C}^k$ satisfies $\|\mathbf{C}^k\|_\infty \le 15$.

[Table 2: ARI scores (standard deviations in parentheses) for the compared methods and for RGDL with $\epsilon \in \{0.01, 0.1, 0.2, 0.3, 10, 30\}$.]

Evaluating the performance. The learned embeddings of the graphs are used as input to a k-means algorithm to cluster the graphs. We use the well-known Adjusted Rand Index (ARI) (Hubert and Arabie, 1985; Steinley, 2004) to evaluate the quality of the clustering by comparing it with the graph labels. RGDL with varied $\epsilon$ is compared against GDL, GWF, and SC. RGDL, GDL, and GWF use three atoms, which are $6 \times 6$ matrices. We run each method five times and report the averaged ARI scores and the standard deviations. The experimental results reported in Table 1 and Table 2 demonstrate that RGDL outperforms the baselines significantly.

Influence of $\epsilon$. RGDL with moderate values of $\epsilon$ outperforms the baseline methods by a large margin and is more robust to the noise.
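The SBM generation step described above can be sketched as follows; the sampler and its parameters are our own illustration (the paper's exact pipeline additionally builds heat kernel PRMs and injects S1/S2 noise):

```python
import numpy as np

def sample_sbm(sizes, p_inter, p_intra, rng):
    """Sample an SBM adjacency matrix: communities of the given sizes, edge
    probability p_intra inside a community and p_inter across communities."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    P = np.where(labels[:, None] == labels[None, :], p_intra, p_inter)
    upper = np.triu(rng.random((n, n)) < P, k=1)  # sample each node pair once
    A = (upper | upper.T).astype(float)           # symmetrize, zero diagonal
    return A, labels

A, labels = sample_sbm([5, 5, 5], p_inter=0.1, p_intra=0.9,
                       rng=np.random.default_rng(0))
```

Sampling only the strict upper triangle and mirroring it keeps the adjacency symmetric with an empty diagonal, matching an undirected simple graph.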
Even when $\epsilon$ is relatively small ($\epsilon = 0.01$), RGDL achieves better performance than the baselines. Increasing $\epsilon$ within a suitable range can boost the ARI, and RGDL is not sensitive to the choice of $\epsilon$. If $\epsilon$ becomes too large, the performance of RGDL slowly decreases. In practice, when a small quantity of data labels is available, $\epsilon$ can be chosen according to the performance on this small subset of data.

5.2. REAL-WORLD DATASETS

We further use RGDL to cluster real-world graphs. We consider widely utilized benchmark datasets including MUTAG (Debnath et al., 1991), BZR (Sutherland et al., 2003), and Peking-1 (Pan et al., 2016). The labels of the graphs are employed as the ground truth to evaluate the estimated clustering results. For each dataset, the size of the atoms is set as the median of the numbers of graph nodes, following Vincent-Cuaz et al. (2021). The number of atoms $M$ is set as $M = \beta \cdot (\#\text{classes})$, where $\beta$ is chosen from $\{2, 3, 4, 5\}$. RGDL is run with different values of $\epsilon$. Specifically, $\epsilon$ is chosen from $\{U, 10^{-1}U, 10^{-2}U, 10^{-3}U\}$, where $U = \max_{k \in [K]} \|\mathbf{C}^k\|_\infty$.

Results. The experimental results on real-world graphs are reported in Figure 1. RGDL with $\epsilon = 10^{-1}U$ or $\epsilon = 10^{-2}U$ outperforms the baselines on all datasets, which implies that the observed graphs contain structural noise and that $[10^{-2}U, 10^{-1}U]$ is often a suitable range for $\epsilon$. The time required for RGDL to converge is comparable to that of state-of-the-art methods.

6. RELATED WORK

Unbalanced OT. Enhancing the robustness of the optimal transport plan has received wide attention recently (Balaji et al., 2020; Mukherjee et al., 2021; Le et al., 2021; Nietert et al., 2022; Chapel et al., 2020; Séjourné et al., 2021; Vincent-Cuaz et al., 2022). Originally, robust variants of classical OT were proposed to compare distributions supported on the same metric space (Balaji et al., 2020; Mukherjee et al., 2021; Le et al., 2021; Nietert et al., 2022); they model the noise as outlier supports and reduce the influence of such outliers by allowing mass destruction and creation. Following the same spirit, variants of the GW distance that also relax the marginal constraints were proposed (Chapel et al., 2020; Séjourné et al., 2021; Vincent-Cuaz et al., 2022). However, these methods do not take the inaccuracies of the pairwise distances/similarities into account. The proposed RGWD aims to handle such cases.

Graph representation learning and graph comparison. Comparing graphs often requires learning meaningful graph representations. Some methods manually design representations that are invariant under graph isomorphism (Bagrow and Bollt, 2019; Tam and Dunson, 2022). Such representations are often sophisticated and require domain knowledge. Graph neural network-based methods learn the representations of graphs in an end-to-end manner (Scarselli et al., 2008; Zhang et al., 2018; Lee et al., 2018; Errica et al., 2019), which however requires a large amount of labeled data. Another family of methods that uses graph representations implicitly is referred to as graph kernels (Shervashidze et al., 2009; Vishwanathan et al., 2010). Methods based on GWD and its variants can estimate the node correspondence and provide an interpretable discrepancy between the compared graphs (Xu et al., 2019b; Titouan et al., 2019; Barbe et al., 2020; Chapel et al., 2020).
In this paper, we propose a novel graph dictionary learning method based on a robust variant of GWD to learn representations of graphs that are useful in downstream tasks.

Non-linear combination of atoms. Classic DL methods are linear in the sense that they attempt to approximate each vectorized datum by a linear combination of a few basis elements. Recently, non-linear operations have also been considered. To exploit the non-linear nature of data, autoencoder-based methods encode data to low-dimensional vectors using a neural network and decode them with another neural network (Hinton and Salakhutdinov, 2006; Hu and Tan, 2018). Another family of methods replaces linear combinations with geodesic interpolations (Boissard et al., 2011; Bigot et al., 2013; Seguy and Cuturi, 2015; Schmitz et al., 2018). More closely related to our work, Xu (2020) proposed to approximate graphs via the GW barycenter of graph atoms, which however involves a complicated and computationally demanding optimization problem.

Projection robust OT. To improve the convergence of empirical Wasserstein distances (Rüschendorf, 1985) to their population counterparts, a group of methods projects the distributions onto informative low-dimensional subspaces (Paty and Cuturi, 2019; Lin et al., 2020; 2021), which involves solving min-max or max-min problems. This paper considers distributions supported on different metric spaces and does not project distributions.

7. CONCLUSION

In this paper, we propose a novel graph dictionary learning algorithm that is robust to the structural noise of graphs. We first propose a robust variant of GWD, referred to as RGWD, which involves a minimax optimization problem. Exploiting the fact that the inner maximization problem has a closed-form solution, an efficient numerical scheme is derived. Based on RGWD, a robust dictionary learning algorithm for graphs called RGDL is developed to learn atoms from noisy graph data. Numerical results on both simulated and real-world datasets demonstrate that RGDL achieves good performance in the presence of structural noise.

APPENDIX: PROOF OF THEOREM 1

Theorem 1 (restated). Given an observed source graph $G^s$ and a target graph $G^t$ expressed as $(\mathbf{C}^s, \mathbf{p}^s)$ and $(\mathbf{C}^t, \mathbf{p}^t)$ respectively, RGWD satisfies:

1. For all $\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)$, the perturbation $\mathbf{E}(\mathbf{T}) = [E_{i'j'}(\mathbf{T})]$ with
$$E_{i'j'}(\mathbf{T}) = \begin{cases} \epsilon, & \text{if } \sum_{i,j=1}^{n_s} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'}) \le 0, \\ -\epsilon, & \text{otherwise}, \end{cases}$$
solves the inner maximization problem $\max_{\mathbf{E} \in \mathcal{U}_\epsilon} f(\mathbf{T}, \mathbf{E})$.

2. RGWD is lower bounded, that is, $\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) \ge \epsilon$, where the equality holds if and only if there exists a bijection $\pi^*: [n_s] \to [n_t]$ such that $p^s_i = p^t_{\pi^*(i)}$ for all $i \in [n_s]$ and $C^s_{ij} = C^t_{\pi^*(i)\pi^*(j)}$ for all $i, j \in [n_s]$.

3. The triangle inequality holds for RGWD, i.e., $\mathrm{RGWD}\big((\mathbf{C}^1, \mathbf{p}^1), (\mathbf{C}^3, \mathbf{p}^3), \epsilon\big) \le \mathrm{RGWD}\big((\mathbf{C}^1, \mathbf{p}^1), (\mathbf{C}^2, \mathbf{p}^2), \epsilon\big) + \mathrm{RGWD}\big((\mathbf{C}^2, \mathbf{p}^2), (\mathbf{C}^3, \mathbf{p}^3), \epsilon\big)$.

4. RGWD is invariant to the permutation of node orders, i.e., for all permutation matrices $\mathbf{Q}^s$ and $\mathbf{Q}^t$, $\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) = \mathrm{RGWD}\big((\mathbf{Q}^s \mathbf{C}^s \mathbf{Q}^{s\top}, \mathbf{Q}^s \mathbf{p}^s), (\mathbf{Q}^t \mathbf{C}^t \mathbf{Q}^{t\top}, \mathbf{Q}^t \mathbf{p}^t), \epsilon\big)$.

Proof: (i) The objective can be rewritten as follows,
$$\begin{aligned} \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2 T_{ii'} T_{jj'} &= \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'})^2 T_{ii'} T_{jj'} + \sum_{i',j'=1}^{n_t} \Big( \sum_{i,j=1}^{n_s} T_{ii'} T_{jj'} \Big) E_{i'j'}^2 - 2 \sum_{i',j'=1}^{n_t} \sum_{i,j=1}^{n_s} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'}) E_{i'j'} \\ &= \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'})^2 T_{ii'} T_{jj'} + \sum_{i',j'=1}^{n_t} p^t_{i'} p^t_{j'} E_{i'j'}^2 - 2 \sum_{i',j'=1}^{n_t} \sum_{i,j=1}^{n_s} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'}) E_{i'j'}, \end{aligned} \tag{5}$$
where $\sum_{i=1}^{n_s} T_{ii'} = p^t_{i'}$ is used. For each entry $E_{i'j'}$, the objective is a convex quadratic on $[-\epsilon, \epsilon]$, so by the property of quadratic functions the maximum is attained at the endpoint whose sign is opposite to that of $\sum_{i,j} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'})$, which yields the closed-form solution $\mathbf{E}(\mathbf{T})$ stated above.
It is easy to verify that such a choice preserves the symmetry of $\mathbf{E}(\mathbf{T})$.

(ii) We now prove the lower bound. By Eq. (5), choosing $\mathbf{E} = \mathbf{E}(\mathbf{T})$ makes the last term non-negative and the middle term equal to $\epsilon^2 \sum_{i',j'} p^t_{i'} p^t_{j'} = \epsilon^2$, so
$$\min_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \max_{\mathbf{E} \in \mathcal{U}_\epsilon} \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2 T_{ii'} T_{jj'} \ge \min_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} \epsilon^2\, T_{ii'} T_{jj'} = \epsilon^2.$$
Note that when there exists a bijection $\pi^*: [n_s] \to [n_t]$ such that $p^s_i = p^t_{\pi^*(i)}$ for all $i \in [n_s]$ and $C^s_{ij} = C^t_{\pi^*(i)\pi^*(j)}$ for all $i, j \in [n_s]$, choosing the transport plan $\tilde{\mathbf{T}} = [\tilde{T}_{ii'}]$ where
$$\tilde{T}_{ii'} = \begin{cases} p^s_i, & \text{if } i' = \pi^*(i), \\ 0, & \text{otherwise}, \end{cases}$$
we have, for all $i', j' \in [n_t]$,
$$\sum_{i,j=1}^{n_s} \tilde{T}_{ii'} \tilde{T}_{jj'} (C^s_{ij} - C^t_{i'j'}) = \tilde{T}_{\pi^{*-1}(i') i'}\, \tilde{T}_{\pi^{*-1}(j') j'} \big( C^s_{\pi^{*-1}(i')\pi^{*-1}(j')} - C^t_{i'j'} \big) = 0,$$
which implies $E_{i'j'}(\tilde{\mathbf{T}}) = \epsilon$ for all $i', j' \in [n_t]$. We then have
$$\sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2 \tilde{T}_{ii'} \tilde{T}_{jj'} = \sum_{i,j=1}^{n_s} (C^s_{ij} - C^t_{\pi^*(i)\pi^*(j)} - \epsilon)^2\, \tilde{T}_{i\pi^*(i)} \tilde{T}_{j\pi^*(j)} = \epsilon^2.$$
Therefore, in this case, $\mathrm{RGWD}\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) = \epsilon$. On the other hand, when no such bijection exists,
$$\mathrm{RGWD}^2\big((\mathbf{C}^s, \mathbf{p}^s), (\mathbf{C}^t, \mathbf{p}^t), \epsilon\big) \ge \epsilon^2 + \min_{\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)} \sum_{i,j=1}^{n_s} \sum_{i',j'=1}^{n_t} (C^s_{ij} - C^t_{i'j'})^2 T_{ii'} T_{jj'} > \epsilon^2,$$
where the strict inequality is due to the fact that $\min_{\mathbf{T}} \sum_{i,j,i',j'} (C^s_{ij} - C^t_{i'j'})^2 T_{ii'} T_{jj'} > 0$ in this case.

(iii) Thirdly, we prove the triangle inequality. Given tuples $(\mathbf{C}^1, \mathbf{p}^1)$, $(\mathbf{C}^2, \mathbf{p}^2)$, and $(\mathbf{C}^3, \mathbf{p}^3)$ with node numbers $n_1$, $n_2$, and $n_3$ respectively, let $(\mathbf{T}^{*12}, \mathbf{E}^{*12})$, $(\mathbf{T}^{*23}, \mathbf{E}^{*23})$, and $(\mathbf{T}^{*13}, \mathbf{E}^{*13})$ be the solutions of $\mathrm{RGWD}\big((\mathbf{C}^1, \mathbf{p}^1), (\mathbf{C}^2, \mathbf{p}^2), \epsilon\big)$, $\mathrm{RGWD}\big((\mathbf{C}^2, \mathbf{p}^2), (\mathbf{C}^3, \mathbf{p}^3), \epsilon\big)$, and $\mathrm{RGWD}\big((\mathbf{C}^1, \mathbf{p}^1), (\mathbf{C}^3, \mathbf{p}^3), \epsilon\big)$. Define $\mathbf{T}^{13} = [T^{13}_{i_1 i_3}]$ where
$$T^{13}_{i_1 i_3} = \sum_{i_2=1}^{n_2} \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}}.$$
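The glued plan defined above is feasible for the pair $(\mathbf{p}^1, \mathbf{p}^3)$, which can be checked numerically; a minimal sketch on a toy instance of our own:

```python
import numpy as np

def glue_couplings(T12, T23, p2):
    """Composite plan from the triangle-inequality proof:
    T13_{i1 i3} = sum_{i2} T12_{i1 i2} T23_{i2 i3} / p2_{i2}."""
    return T12 @ np.diag(1.0 / p2) @ T23

p1 = np.full(3, 1 / 3)
p2 = np.full(4, 1 / 4)
p3 = np.full(2, 1 / 2)
T12 = np.outer(p1, p2)  # any feasible plans work; product plans for brevity
T23 = np.outer(p2, p3)
T13 = glue_couplings(T12, T23, p2)
# gluing preserves the outer marginals: T13 @ 1 = p1 and T13^T @ 1 = p3
```

The marginal identities follow directly: $\mathbf{T}^{12} \operatorname{diag}(1/\mathbf{p}^2) \mathbf{T}^{23} \mathbf{1} = \mathbf{T}^{12} \operatorname{diag}(1/\mathbf{p}^2) \mathbf{p}^2 = \mathbf{T}^{12} \mathbf{1} = \mathbf{p}^1$, and symmetrically for the other side.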
Then, writing $\bar{\mathbf{E}}$ for the maximizer of $\max_{\mathbf{E} \in \mathcal{U}_\epsilon} \sum_{i_1,j_1,i_3,j_3} (C^1_{i_1j_1} - C^3_{i_3j_3} - E_{i_3j_3})^2\, T^{13}_{i_1i_3} T^{13}_{j_1j_3}$, the feasibility of $\mathbf{T}^{13}$ yields
$$\mathrm{RGWD}\big((\mathbf{C}^1, \mathbf{p}^1), (\mathbf{C}^3, \mathbf{p}^3), \epsilon\big) \le \Big( \sum_{i_1,j_1=1}^{n_1} \sum_{i_3,j_3=1}^{n_3} \big( C^1_{i_1j_1} - C^3_{i_3j_3} - \bar{E}_{i_3j_3} \big)^2\, T^{13}_{i_1i_3} T^{13}_{j_1j_3} \Big)^{\frac{1}{2}}.$$
Expanding $T^{13}_{i_1i_3} T^{13}_{j_1j_3} = \sum_{i_2,j_2} \frac{T^{*12}_{i_1i_2} T^{*23}_{i_2i_3}}{p^2_{i_2}} \cdot \frac{T^{*12}_{j_1j_2} T^{*23}_{j_2j_3}}{p^2_{j_2}}$, inserting $C^1_{i_1j_1} - C^3_{i_3j_3} - \bar{E}_{i_3j_3} = (C^1_{i_1j_1} - C^2_{i_2j_2}) + (C^2_{i_2j_2} - C^3_{i_3j_3} - \bar{E}_{i_3j_3})$, and applying the Minkowski inequality with respect to this measure on the index set gives
$$\mathrm{RGWD}\big((\mathbf{C}^1, \mathbf{p}^1), (\mathbf{C}^3, \mathbf{p}^3), \epsilon\big) \le \Big( \sum_{i_1,j_1,i_2,j_2} (C^1_{i_1j_1} - C^2_{i_2j_2})^2\, T^{*12}_{i_1i_2} T^{*12}_{j_1j_2} \Big)^{\frac{1}{2}} + \Big( \sum_{i_2,j_2,i_3,j_3} (C^2_{i_2j_2} - C^3_{i_3j_3} - \bar{E}_{i_3j_3})^2\, T^{*23}_{i_2i_3} T^{*23}_{j_2j_3} \Big)^{\frac{1}{2}},$$
where the marginal identities $\sum_{i_3} T^{*23}_{i_2i_3} = p^2_{i_2}$ and $\sum_{i_1} T^{*12}_{i_1i_2} = p^2_{i_2}$ eliminate the extra indices. Since $\mathbf{0} \in \mathcal{U}_\epsilon$ and $\bar{\mathbf{E}} \in \mathcal{U}_\epsilon$, the first term is at most $\big( \max_{\mathbf{E} \in \mathcal{U}_\epsilon} f(\mathbf{T}^{*12}, \mathbf{E}; \mathbf{C}^1, \mathbf{C}^2) \big)^{1/2} = \mathrm{RGWD}\big((\mathbf{C}^1, \mathbf{p}^1), (\mathbf{C}^2, \mathbf{p}^2), \epsilon\big)$, and the second term is at most $\mathrm{RGWD}\big((\mathbf{C}^2, \mathbf{p}^2), (\mathbf{C}^3, \mathbf{p}^3), \epsilon\big)$, which proves the claim.

(iv) Finally, we prove the invariance to node-order permutations. Denote the solution of the RGWD problem by $\mathbf{T}^* = [T^*_{ii'}]$ and $\mathbf{E}^* = [E^*_{i'j'}]$, so that
$$E^*_{i'j'} = \begin{cases} \epsilon, & \text{if } \sum_{i,j=1}^{n_s} T^*_{ii'} T^*_{jj'} (C^s_{ij} - C^t_{i'j'}) \le 0, \\ -\epsilon, & \text{otherwise}, \end{cases}$$
and, for all $\mathbf{T} \in \Pi(\mathbf{p}^s, \mathbf{p}^t)$,
$$\sum_{i,j,i',j'} (C^s_{ij} - C^t_{i'j'} - E^*_{i'j'})^2\, T^*_{ii'} T^*_{jj'} \le \max_{\mathbf{E} \in \mathcal{U}_\epsilon} \sum_{i,j,i',j'} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, T_{ii'} T_{jj'}.$$
The two permutation operations can be equivalently denoted by two bijections $\pi^s$ and $\pi^t$. Define $\tilde{\mathbf{C}}^s$ with $\tilde{C}^s_{ij} = C^s_{\pi^{s-1}(i)\pi^{s-1}(j)}$, $\tilde{\mathbf{C}}^t$ with $\tilde{C}^t_{i'j'} = C^t_{\pi^{t-1}(i')\pi^{t-1}(j')}$, $\tilde{\mathbf{E}}^*$ with $\tilde{E}^*_{i'j'} = E^*_{\pi^{t-1}(i')\pi^{t-1}(j')}$, and $\tilde{\mathbf{T}}^*$ with $\tilde{T}^*_{ii'} = T^*_{\pi^{s-1}(i)\pi^{t-1}(i')}$. We first show that $\tilde{\mathbf{E}}^*$ solves the inner maximization problem for $\tilde{\mathbf{T}}^*$: for all $i', j' \in [n_t]$, when $\sum_{ij} \tilde{T}^*_{ii'} \tilde{T}^*_{jj'} (\tilde{C}^s_{ij} - \tilde{C}^t_{i'j'}) \le 0$, a change of indices gives $\sum_{ij} T^*_{i\pi^{t-1}(i')} T^*_{j\pi^{t-1}(j')} (C^s_{ij} - C^t_{\pi^{t-1}(i')\pi^{t-1}(j')}) \le 0$, which is consistent with $\tilde{E}^*_{i'j'} = \epsilon$. The case $\sum_{ij} \tilde{T}^*_{ii'} \tilde{T}^*_{jj'} (\tilde{C}^s_{ij} - \tilde{C}^t_{i'j'}) > 0$ is similar.
Since
\[
\sum_{i,j=1}^{n_s}\sum_{i',j'=1}^{n_t}\big(C^s_{ij}-C^t_{i'j'}-E^*_{i'j'}\big)^2 T^*_{ii'}T^*_{jj'} = \sum_{i,j=1}^{n_s}\sum_{i',j'=1}^{n_t}\big(\tilde{C}^s_{ij}-\tilde{C}^t_{i'j'}-\tilde{E}^*_{i'j'}\big)^2 \tilde{T}^*_{ii'}\tilde{T}^*_{jj'}
\]
and
\[
\max_{\mathbf{E}\in\mathcal{U}} \sum_{i,j=1}^{n_s}\sum_{i',j'=1}^{n_t}\big(C^s_{ij}-C^t_{i'j'}-E_{i'j'}\big)^2 T_{ii'}T_{jj'} = \max_{\mathbf{E}\in\mathcal{U}} \sum_{i,j=1}^{n_s}\sum_{i',j'=1}^{n_t}\big(\tilde{C}^s_{ij}-\tilde{C}^t_{i'j'}-E_{i'j'}\big)^2 \tilde{T}_{ii'}\tilde{T}_{jj'},
\]
where $\tilde{T}_{ii'} = T_{\pi_s^{-1}(i)\pi_t^{-1}(i')}$, the pair $(\tilde{\mathbf{T}}^*, \tilde{\mathbf{E}}^*)$ solves the optimization problem of $\mathrm{RGWD}\big((\mathbf{Q}_s\mathbf{C}^s\mathbf{Q}_s^\top, \mathbf{Q}_s\mathbf{p}^s), (\mathbf{Q}_t\mathbf{C}^t\mathbf{Q}_t^\top, \mathbf{Q}_t\mathbf{p}^t), \varepsilon\big)$.

To prove Theorem 2, we require the following lemma.

Lemma 3 $f(\cdot)$ is $l$-smooth and $L$-Lipschitz, where $l = \sqrt{2}\max\{10n^3U_1^2 + 6n^3U_1\varepsilon + 4nU_1U_2 + 4n^3\varepsilon^2,\; 6n^2U_1U_2 + 2U_2^2 + 4n^2\varepsilon U_2\}$ and $L = \sqrt{2}\max\{(4U_1+2\varepsilon)U_2^2n^3,\; 2(2U_1+\varepsilon)^2U_2n^3\}$, with $n = \max\{n_s, n_t\}$, $U_1 = \max\{\|\mathbf{C}^s\|_\infty, \|\mathbf{C}^t\|_\infty\}$, and $U_2 = \max\{\|\mathbf{p}^t\|_2, \max_{\mathbf{T}\in\Pi(\mathbf{p}^s,\mathbf{p}^t)}\|\mathbf{T}\|_F\}$.

Proof: (i) We first prove that $f(\cdot)$ is $L$-Lipschitz. For all $\mathbf{T},\mathbf{T}'\in\Pi(\mathbf{p}^s,\mathbf{p}^t)$ and $\mathbf{E},\mathbf{E}'\in\mathcal{U}$,
\[
\begin{aligned}
\big|f(\mathbf{T},\mathbf{E}) - f(\mathbf{T}',\mathbf{E}')\big| \le\;& \Big|\sum_{iji'j'} \big(C^s_{ij}-C^t_{i'j'}-E_{i'j'}\big)^2 T_{ii'}T_{jj'} - \sum_{iji'j'} \big(C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big)^2 T_{ii'}T_{jj'}\Big|\\
&+ \Big|\sum_{iji'j'} \big(C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big)^2 T_{ii'}T_{jj'} - \sum_{iji'j'} \big(C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big)^2 T'_{ii'}T_{jj'}\Big|\\
&+ \Big|\sum_{iji'j'} \big(C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big)^2 T'_{ii'}T_{jj'} - \sum_{iji'j'} \big(C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big)^2 T'_{ii'}T'_{jj'}\Big|.
\end{aligned}
\]
For the first term,
\[
\begin{aligned}
&\Big|\sum_{iji'j'} \Big[\big(C^s_{ij}-C^t_{i'j'}-E_{i'j'}\big)^2 - \big(C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big)^2\Big] T_{ii'}T_{jj'}\Big|\\
&= \Big|\sum_{iji'j'} \big(C^s_{ij}-C^t_{i'j'}-E_{i'j'} + C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big)\big(E'_{i'j'}-E_{i'j'}\big) T_{ii'}T_{jj'}\Big|\\
&\le \sum_{iji'j'} \big|C^s_{ij}-C^t_{i'j'}-E_{i'j'} + C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big|\,\big|E_{i'j'}-E'_{i'j'}\big|\, T_{ii'}T_{jj'} \le (4U_1+2\varepsilon)U_2^2 n^2 \sum_{i'j'} \big|E_{i'j'}-E'_{i'j'}\big|.
\end{aligned}
\]
For the second term,
\[
\Big|\sum_{iji'j'} \big(C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big)^2 \big(T_{ii'}-T'_{ii'}\big)T_{jj'}\Big| \le (2U_1+\varepsilon)^2 U_2 \sum_{iji'j'} \big|T_{ii'}-T'_{ii'}\big| \le (2U_1+\varepsilon)^2 U_2 n^2 \sum_{ii'} \big|T_{ii'}-T'_{ii'}\big|.
\]
For the third term, analogously,
\[
\Big|\sum_{iji'j'} \big(C^s_{ij}-C^t_{i'j'}-E'_{i'j'}\big)^2 T'_{ii'}\big(T_{jj'}-T'_{jj'}\big)\Big| \le (2U_1+\varepsilon)^2 U_2 n^2 \sum_{jj'} \big|T_{jj'}-T'_{jj'}\big|.
\]
Combining the three relations above, we have
\[
\big|f(\mathbf{T},\mathbf{E}) - f(\mathbf{T}',\mathbf{E}')\big| \le \max\big\{(4U_1+2\varepsilon)U_2^2 n^2,\; (2U_1+\varepsilon)^2 U_2 n^2\big\}\Big(\sum_{i'j'}\big|E_{i'j'}-E'_{i'j'}\big| + 2\sum_{jj'}\big|T_{jj'}-T'_{jj'}\big|\Big) \le L\sqrt{\|\mathbf{E}-\mathbf{E}'\|_F^2 + \|\mathbf{T}-\mathbf{T}'\|_F^2}.
\]
(ii) Now we prove that $f(\cdot)$ is $l$-smooth, which requires finding a constant $l$ satisfying
\[
\left\| \begin{bmatrix} \mathrm{vec}\big(\nabla_\mathbf{T} f(\mathbf{T},\mathbf{E})\big)\\ \mathrm{vec}\big(\nabla_\mathbf{E} f(\mathbf{T},\mathbf{E})\big) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}\big(\nabla_\mathbf{T} f(\mathbf{T}',\mathbf{E}')\big)\\ \mathrm{vec}\big(\nabla_\mathbf{E} f(\mathbf{T}',\mathbf{E}')\big) \end{bmatrix} \right\|_2 \le l \left\| \begin{bmatrix} \mathrm{vec}(\mathbf{T})\\ \mathrm{vec}(\mathbf{E}) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}(\mathbf{T}')\\ \mathrm{vec}(\mathbf{E}') \end{bmatrix} \right\|_2,
\]
where $\mathrm{vec}(\mathbf{X})$ denotes the vectorization of matrix $\mathbf{X}$ and stacking denotes the concatenation of vectors. Since the left-hand side satisfies
\[
\sqrt{\big\|\nabla_\mathbf{T} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{T} f(\mathbf{T}',\mathbf{E}')\big\|_F^2 + \big\|\nabla_\mathbf{E} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{E} f(\mathbf{T}',\mathbf{E}')\big\|_F^2} \le \big\|\nabla_\mathbf{T} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{T} f(\mathbf{T}',\mathbf{E}')\big\|_F + \big\|\nabla_\mathbf{E} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{E} f(\mathbf{T}',\mathbf{E}')\big\|_F,
\]
and the right-hand side satisfies
\[
l\sqrt{\|\mathbf{E}-\mathbf{E}'\|_F^2 + \|\mathbf{T}-\mathbf{T}'\|_F^2} \ge \frac{l}{\sqrt{2}}\|\mathbf{E}-\mathbf{E}'\|_F + \frac{l}{\sqrt{2}}\|\mathbf{T}-\mathbf{T}'\|_F,
\]
it suffices to find a constant $l$ satisfying
\[
\big\|\nabla_\mathbf{T} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{T} f(\mathbf{T}',\mathbf{E}')\big\|_F + \big\|\nabla_\mathbf{E} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{E} f(\mathbf{T}',\mathbf{E}')\big\|_F \le \frac{l}{\sqrt{2}}\|\mathbf{E}-\mathbf{E}'\|_F + \frac{l}{\sqrt{2}}\|\mathbf{T}-\mathbf{T}'\|_F.
\]
We bound $\|\nabla_\mathbf{T} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{T} f(\mathbf{T}',\mathbf{E}')\|_F$ as follows:
\[
\begin{aligned}
\big\|\nabla_\mathbf{T} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{T} f(\mathbf{T}',\mathbf{E}')\big\|_F \le\;& 2n\|\mathbf{C}^s\circ\mathbf{C}^s\|_F\|\mathbf{T}-\mathbf{T}'\|_F + 4\|\mathbf{C}^s\|_F\big\|\mathbf{T}(\mathbf{C}^t+\mathbf{E}) - \mathbf{T}'(\mathbf{C}^t+\mathbf{E}')\big\|_F\\
&+ 2n\big\|\mathbf{T}\big((\mathbf{C}^t+\mathbf{E})\circ(\mathbf{C}^t+\mathbf{E})\big) - \mathbf{T}'\big((\mathbf{C}^t+\mathbf{E}')\circ(\mathbf{C}^t+\mathbf{E}')\big)\big\|_F.
\end{aligned}
\]
For the first term, $2n\|\mathbf{C}^s\circ\mathbf{C}^s\|_F\|\mathbf{T}-\mathbf{T}'\|_F \le 2n\|\mathbf{C}^s\|_F^2\|\mathbf{T}-\mathbf{T}'\|_F \le 2n^3U_1^2\|\mathbf{T}-\mathbf{T}'\|_F$. For the second term,
\[
4\|\mathbf{C}^s\|_F\big\|\mathbf{T}(\mathbf{C}^t+\mathbf{E}) - \mathbf{T}'(\mathbf{C}^t+\mathbf{E}')\big\|_F \le 4\|\mathbf{C}^s\|_F\|\mathbf{T}-\mathbf{T}'\|_F\big(\|\mathbf{C}^t\|_F + \|\mathbf{E}\|_F\big) + 4\|\mathbf{C}^s\|_F\|\mathbf{T}'\|_F\|\mathbf{E}-\mathbf{E}'\|_F \le \big(4n^2U_1^2 + 4n^2U_1\varepsilon\big)\|\mathbf{T}-\mathbf{T}'\|_F + 4nU_1U_2\|\mathbf{E}-\mathbf{E}'\|_F.
\]
For the third term,
\[
\begin{aligned}
&2n\big\|\mathbf{T}\big((\mathbf{C}^t+\mathbf{E})\circ(\mathbf{C}^t+\mathbf{E})\big) - \mathbf{T}'\big((\mathbf{C}^t+\mathbf{E}')\circ(\mathbf{C}^t+\mathbf{E}')\big)\big\|_F\\
\le\;& 2n\big\|(\mathbf{T}-\mathbf{T}')\big((\mathbf{C}^t+\mathbf{E})\circ(\mathbf{C}^t+\mathbf{E})\big)\big\|_F + 2n\big\|\mathbf{T}'\big((\mathbf{C}^t+\mathbf{E})\circ(\mathbf{C}^t+\mathbf{E}) - (\mathbf{C}^t+\mathbf{E}')\circ(\mathbf{C}^t+\mathbf{E})\big)\big\|_F\\
&+ 2n\big\|\mathbf{T}'\big((\mathbf{C}^t+\mathbf{E}')\circ(\mathbf{C}^t+\mathbf{E}) - (\mathbf{C}^t+\mathbf{E}')\circ(\mathbf{C}^t+\mathbf{E}')\big)\big\|_F\\
\le\;& 4n\big(\|\mathbf{C}^t\|_F^2 + \|\mathbf{E}\|_F^2\big)\|\mathbf{T}-\mathbf{T}'\|_F + 2n\|\mathbf{T}'\|_F\big(\|\mathbf{C}^t\|_F + \|\mathbf{E}\|_F\big)\|\mathbf{E}-\mathbf{E}'\|_F + 2n\|\mathbf{T}'\|_F\big(\|\mathbf{C}^t\|_F + \|\mathbf{E}'\|_F\big)\|\mathbf{E}-\mathbf{E}'\|_F\\
\le\;& \big(4n^3U_1^2 + 4n^3\varepsilon^2\big)\|\mathbf{T}-\mathbf{T}'\|_F + \big(4n^2U_1U_2 + 4n^2\varepsilon U_2\big)\|\mathbf{E}-\mathbf{E}'\|_F.
\end{aligned}
\]
Since $\nabla_\mathbf{E} f(\mathbf{T},\mathbf{E}) = 2(\mathbf{E}+\mathbf{C}^t)\circ\big(\mathbf{p}^t(\mathbf{p}^t)^\top\big) - 2\mathbf{T}^\top\mathbf{C}^s\mathbf{T}$, the term $\|\nabla_\mathbf{E} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{E} f(\mathbf{T}',\mathbf{E}')\|_F$ can be bounded as follows:
\[
\big\|\nabla_\mathbf{E} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{E} f(\mathbf{T}',\mathbf{E}')\big\|_F \le 2\|\mathbf{E}-\mathbf{E}'\|_F\big\|\mathbf{p}^t(\mathbf{p}^t)^\top\big\|_F + 4\|\mathbf{T}\|_F\|\mathbf{C}^s\|_F\|\mathbf{T}-\mathbf{T}'\|_F \le 2U_2^2\|\mathbf{E}-\mathbf{E}'\|_F + 4nU_1U_2\|\mathbf{T}-\mathbf{T}'\|_F.
\]
Combining the above four relations, we have
\[
\big\|\nabla_\mathbf{T} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{T} f(\mathbf{T}',\mathbf{E}')\big\|_F + \big\|\nabla_\mathbf{E} f(\mathbf{T},\mathbf{E})-\nabla_\mathbf{E} f(\mathbf{T}',\mathbf{E}')\big\|_F \le \max\big\{10n^3U_1^2 + 6n^3U_1\varepsilon + 4nU_1U_2 + 4n^3\varepsilon^2,\; 6n^2U_1U_2 + 2U_2^2 + 4n^2\varepsilon U_2\big\}\big(\|\mathbf{E}-\mathbf{E}'\|_F + \|\mathbf{T}-\mathbf{T}'\|_F\big),
\]
which yields the desired result. Here $l = \sqrt{2}\max\{10n^3U_1^2 + 6n^3U_1\varepsilon + 4nU_1U_2 + 4n^3\varepsilon^2,\; 6n^2U_1U_2 + 2U_2^2 + 4n^2\varepsilon U_2\}$ and $L = \sqrt{2}\max\{(4U_1+2\varepsilon)U_2^2n^3,\; 2(2U_1+\varepsilon)^2U_2n^3\}$, with $n = \max\{n_s, n_t\}$, $U_1 = \max\{\|\mathbf{C}^s\|_\infty, \|\mathbf{C}^t\|_\infty\}$, and $U_2 = \max\{\|\mathbf{p}^t\|_2, \max_{\mathbf{T}\in\Pi(\mathbf{p}^s,\mathbf{p}^t)}\|\mathbf{T}\|_F\}$, as in Lemma 3.

Proof: By the smoothness of $f(\cdot)$, for any $\hat{\mathbf{T}}\in\Pi(\mathbf{p}^s,\mathbf{p}^t)$ and $\mathbf{T}_\tau$ from Algorithm 1, we have
\[
\phi(\hat{\mathbf{T}}) \ge f(\hat{\mathbf{T}}, \mathbf{E}_\tau) \ge f(\mathbf{T}_\tau, \mathbf{E}_\tau) + \big\langle \nabla_\mathbf{T} f(\mathbf{T}_\tau,\mathbf{E}_\tau),\, \hat{\mathbf{T}}-\mathbf{T}_\tau\big\rangle - \frac{l}{2}\big\|\hat{\mathbf{T}}-\mathbf{T}_\tau\big\|_F^2 = \phi(\mathbf{T}_\tau) + \big\langle \nabla_\mathbf{T} f(\mathbf{T}_\tau,\mathbf{E}_\tau),\, \hat{\mathbf{T}}-\mathbf{T}_\tau\big\rangle - \frac{l}{2}\big\|\hat{\mathbf{T}}-\mathbf{T}_\tau\big\|_F^2.
\]
Plugging this into (7) and combining Lemma 3 proves the result.
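The closed-form inner maximizer used in the proofs above (set $E_{i'j'} = \varepsilon$ when the $\mathbf{T}$-weighted residual is non-positive and $-\varepsilon$ otherwise) can be checked numerically on a tiny instance. The sketch below is purely illustrative: it assumes the box uncertainty set $\mathcal{U} = \{\mathbf{E} : |E_{i'j'}| \le \varepsilon\}$, and all helper names are ours, not the paper's.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
ns, nt, eps = 3, 2, 0.5

# Random symmetric pairwise relation matrices and a feasible coupling.
Cs = rng.normal(size=(ns, ns)); Cs = (Cs + Cs.T) / 2
Ct = rng.normal(size=(nt, nt)); Ct = (Ct + Ct.T) / 2
ps = np.full(ns, 1.0 / ns); pt = np.full(nt, 1.0 / nt)
T = np.outer(ps, pt)                      # independent coupling in Pi(ps, pt)

def f(T, E):
    """f(T, E) = sum_{i,j,i',j'} (Cs_ij - Ct_i'j' - E_i'j')^2 T_ii' T_jj'."""
    val = 0.0
    for i in range(ns):
        for j in range(ns):
            for ip in range(nt):
                for jp in range(nt):
                    val += (Cs[i, j] - Ct[ip, jp] - E[ip, jp]) ** 2 * T[i, ip] * T[j, jp]
    return val

def E_star(T, eps):
    """Sign rule from the proof: E = eps where the weighted residual G <= 0."""
    q = T.sum(axis=0)                     # column sums (= pt for a feasible T)
    G = T.T @ Cs @ T - Ct * np.outer(q, q)
    return np.where(G <= 0, eps, -eps)

# Brute force over all vertices of the box: the objective is convex and
# separable in each entry of E, so the maximum is attained at a vertex.
best = max(
    f(T, eps * np.array(signs).reshape(nt, nt))
    for signs in itertools.product([-1.0, 1.0], repeat=nt * nt)
)
closed = f(T, E_star(T, eps))
```

The check `closed >= eps**2` also mirrors the lower-boundedness argument in part (ii): for any feasible coupling, the adversarial value is at least $\varepsilon^2$.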

B ALGORITHMIC DETAILS

The Projected Gradient Descent (PGD) consists of the following three steps in each iteration $\tau$.

Find $\mathbf{E}_\tau$ that maximizes $f(\mathbf{T}_\tau, \mathbf{E})$. By Theorem 1, we need to calculate an auxiliary matrix
\[
\mathbf{G} = \mathbf{T}_\tau^\top \mathbf{C}^s \mathbf{T}_\tau - \mathbf{C}^t \circ \big(\mathbf{T}_\tau^\top \mathbf{1}_{n_s} \mathbf{1}_{n_s}^\top \mathbf{T}_\tau\big),
\]
where $\circ$ denotes element-wise multiplication. We then have
\[
E_{i'j'}(\mathbf{T}_\tau) = \begin{cases} \varepsilon, & \text{if } G_{i'j'} \le 0,\\ -\varepsilon, & \text{otherwise.} \end{cases}
\]
This step involves a computational cost of $O(n^3)$, where $n = \max\{n_s, n_t\}$.

Gradient descent. Calculate $\mathbf{H}_\tau = \mathbf{T}_\tau - \eta\nabla_\mathbf{T} f(\mathbf{T}_\tau, \mathbf{E}_\tau)$, where
\[
\nabla_\mathbf{T} f(\mathbf{T}_\tau, \mathbf{E}_\tau) = 2\,(\mathbf{C}^s\circ\mathbf{C}^s)\,\mathbf{T}_\tau \mathbf{1}_{n_t}\mathbf{1}_{n_t}^\top + 2\,\mathbf{1}_{n_s}\mathbf{1}_{n_s}^\top \mathbf{T}_\tau \big((\mathbf{C}^t+\mathbf{E}_\tau)\circ(\mathbf{C}^t+\mathbf{E}_\tau)\big) - 4\,\mathbf{C}^s\mathbf{T}_\tau(\mathbf{C}^t+\mathbf{E}_\tau),
\]
which also involves a computational cost of $O(n^3)$.

The results are reported in Table 4: RGDL outperforms or matches the state-of-the-art methods. In particular, RGDL outperforms GDL and GWF significantly, which indicates the necessity of taking the structural noise of observed graphs into account. Although WGDL and GNTK achieve similar performance, they are more computation- and memory-demanding due to their use of graph neural networks.
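The closed-form gradient above can be sanity-checked against central finite differences on a small random instance. This is an illustrative sketch only: it assumes symmetric $\mathbf{C}^s$, $\mathbf{C}^t$, and $\mathbf{E}$, and the gradient identity itself does not require $\mathbf{T}$ to be a feasible coupling, so no constraints are imposed here.

```python
import numpy as np

rng = np.random.default_rng(1)
ns, nt = 3, 4
Cs = rng.normal(size=(ns, ns)); Cs = (Cs + Cs.T) / 2   # symmetric PRMs
Ct = rng.normal(size=(nt, nt)); Ct = (Ct + Ct.T) / 2
E = rng.uniform(-0.1, 0.1, size=(nt, nt)); E = (E + E.T) / 2
T = rng.uniform(0.1, 1.0, size=(ns, nt))               # arbitrary positive T

def f(T):
    """f(T, E) written out entrywise (slow but unambiguous)."""
    B = Ct + E
    val = 0.0
    for i in range(ns):
        for j in range(ns):
            for ip in range(nt):
                for jp in range(nt):
                    val += (Cs[i, j] - B[ip, jp] - 0.0) ** 2 * T[i, ip] * T[j, jp]
    return val

def grad(T):
    """The closed-form gradient from the PGD description."""
    B = Ct + E
    one_s = np.ones((ns, 1)); one_t = np.ones((nt, 1))
    return (2 * (Cs * Cs) @ T @ one_t @ one_t.T
            + 2 * one_s @ one_s.T @ T @ (B * B)
            - 4 * Cs @ T @ B)

# Central finite-difference comparison, entry by entry.
G, Gfd, h = grad(T), np.zeros_like(T), 1e-6
for k in range(ns):
    for kp in range(nt):
        Tp = T.copy(); Tp[k, kp] += h
        Tm = T.copy(); Tm[k, kp] -= h
        Gfd[k, kp] = (f(Tp) - f(Tm)) / (2 * h)
err = np.abs(G - Gfd).max()
```

A small `err` confirms that the matrix expression matches the entrywise definition of $f$ term by term.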



Code available at https://github.com/cxxszz/rgdl.



Sample a mini-batch of graphs whose indices are denoted by $B$ such that $|B| = b$.
5: for $k \in B$ do
6:   Initialize $\mathbf{w}_k = \frac{1}{M}\mathbf{1}_M$ and $\mathbf{T}_k = \mathbf{p}\mathbf{p}_k^\top$.
8:   Compute $(\mathbf{T}_k, \mathbf{E}_k)$ via Algorithm 1 with fixed $\mathbf{w}_k$.
9:   Compute $\mathbf{w}_k$ by solving (4) for the fixed $\mathbf{T}_k$ and $\mathbf{E}_k$ with conditional gradient.
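Step 9, the conditional-gradient (Frank-Wolfe) update of the embedding weights over the simplex, can be sketched as follows. This is an illustrative stand-in: the quadratic loss with synthetic `A` and `b` replaces the actual RGWD fit of the embedded graph $\sum_m w_m \mathbf{C}^{(m)}$ with $\mathbf{T}_k$ and $\mathbf{E}_k$ held fixed, and the exact line search is one simple step-size choice, not necessarily the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 4                                    # number of dictionary atoms

# Synthetic smooth convex stand-in for the weight-fitting objective in (4).
A = rng.normal(size=(M, M)); A = A.T @ A + np.eye(M)
b = rng.normal(size=M)
loss = lambda w: 0.5 * w @ A @ w + b @ w
grad = lambda w: A @ w + b

w0 = np.full(M, 1.0 / M)                 # line 6: w_k = (1/M) 1_M
w = w0.copy()
for _ in range(50):
    g = grad(w)
    s = np.zeros(M); s[np.argmin(g)] = 1.0       # linear minimization oracle on the simplex
    d = s - w
    denom = d @ A @ d
    gamma = 1.0 if denom <= 0 else min(1.0, -(g @ d) / denom)  # exact line search, clipped
    w = w + gamma * d                    # convex combination: stays in the simplex
```

Because every iterate is a convex combination of simplex points, no projection is needed, which is the main appeal of conditional gradient for this subproblem.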

Figure 1: ARI scores vs. time on the MUTAG (left), BZR (middle), and Peking_1 (right) datasets.

Define $\phi(\cdot) = \max_{\mathbf{E}\in\mathcal{U}} f(\cdot, \mathbf{E})$. The output $\bar{\mathbf{T}}$ of Algorithm 1 with step-size $\eta = \frac{\gamma}{\sqrt{N+1}}$ satisfies
\[
\mathbb{E}\,\big\|\nabla\phi_{1/(2l)}(\bar{\mathbf{T}})\big\|^2 \le \frac{2\big(\phi_{1/(2l)}(\mathbf{T}^0) - \min_{\mathbf{T}\in\Pi(\mathbf{p}^s,\mathbf{p}^t)}\phi(\mathbf{T}) + lL^2\gamma^2\big)}{\gamma\sqrt{N+1}},
\]

Let $\hat{\mathbf{T}}_\tau = \arg\min_{\mathbf{T}\in\Pi(\mathbf{p}^s,\mathbf{p}^t)} \phi(\mathbf{T}) + l\|\mathbf{T} - \mathbf{T}_\tau\|_F^2$. We have
\[
\begin{aligned}
\phi_{1/(2l)}(\mathbf{T}_{\tau+1}) &\le \phi(\hat{\mathbf{T}}_\tau) + l\big\|\mathbf{T}_{\tau+1} - \hat{\mathbf{T}}_\tau\big\|_F^2\\
&\le \phi(\hat{\mathbf{T}}_\tau) + l\big\|\mathbf{T}_\tau - \eta\nabla_\mathbf{T} f(\mathbf{T}_\tau,\mathbf{E}_\tau) - \hat{\mathbf{T}}_\tau\big\|_F^2\\
&\le \phi(\hat{\mathbf{T}}_\tau) + l\big\|\mathbf{T}_\tau - \hat{\mathbf{T}}_\tau\big\|_F^2 + 2l\eta\big\langle \nabla_\mathbf{T} f(\mathbf{T}_\tau,\mathbf{E}_\tau),\, \hat{\mathbf{T}}_\tau - \mathbf{T}_\tau\big\rangle + \eta^2 l\big\|\nabla_\mathbf{T} f(\mathbf{T}_\tau,\mathbf{E}_\tau)\big\|_F^2\\
&= \phi_{1/(2l)}(\mathbf{T}_\tau) + 2l\eta\big\langle \nabla_\mathbf{T} f(\mathbf{T}_\tau,\mathbf{E}_\tau),\, \hat{\mathbf{T}}_\tau - \mathbf{T}_\tau\big\rangle + \eta^2 l\big\|\nabla_\mathbf{T} f(\mathbf{T}_\tau,\mathbf{E}_\tau)\big\|_F^2\\
&\le \phi_{1/(2l)}(\mathbf{T}_\tau) + 2\eta l\Big(\phi(\hat{\mathbf{T}}_\tau) - \phi(\mathbf{T}_\tau) + \frac{l}{2}\big\|\hat{\mathbf{T}}_\tau - \mathbf{T}_\tau\big\|_F^2\Big) + \eta^2 lL^2,
\end{aligned}
\]
where the second line uses Lemma 3.1 of Bubeck et al. (2015) and the last line follows from (6). Taking a telescopic sum over $\tau$, we obtain
\[
\phi_{1/(2l)}(\mathbf{T}^N) \le \phi_{1/(2l)}(\mathbf{T}^0) + 2\eta l \sum_{\tau=0}^{N}\Big(\phi(\hat{\mathbf{T}}_\tau) - \phi(\mathbf{T}_\tau) + \frac{l}{2}\big\|\hat{\mathbf{T}}_\tau - \mathbf{T}_\tau\big\|_F^2\Big) + (N+1)\eta^2 lL^2.
\]
Since $\phi(\mathbf{T}) + l\|\mathbf{T} - \mathbf{T}_\tau\|_F^2$ is $l$-strongly convex with minimizer $\hat{\mathbf{T}}_\tau$, we have
\[
\phi(\hat{\mathbf{T}}_\tau) - \phi(\mathbf{T}_\tau) + \frac{l}{2}\big\|\hat{\mathbf{T}}_\tau - \mathbf{T}_\tau\big\|_F^2 \le -l\big\|\hat{\mathbf{T}}_\tau - \mathbf{T}_\tau\big\|_F^2 = -\frac{1}{4l}\big\|\nabla\phi_{1/(2l)}(\mathbf{T}_\tau)\big\|_F^2.
\]
Rearranging, bounding $\phi_{1/(2l)}(\mathbf{T}^N) \ge \min_{\mathbf{T}\in\Pi(\mathbf{p}^s,\mathbf{p}^t)}\phi(\mathbf{T})$, averaging over $\tau = 0,\dots,N$, and plugging in $\eta = \frac{\gamma}{\sqrt{N+1}}$ yields the claimed bound.
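The near-stationarity measure $\|\nabla\phi_{1/(2l)}\|$ used above rests on the standard identity $\nabla\phi_\lambda(x) = (x - \mathrm{prox}_{\lambda\phi}(x))/\lambda$ for the Moreau envelope. A minimal numerical illustration, with $\phi(x) = |x|$ as a stand-in (its proximal map is soft-thresholding, so the envelope is the Huber function); this is not the paper's objective, only a check of the identity itself:

```python
import numpy as np

lam = 0.5                                # lambda = 1/(2l) with l = 1

def prox_abs(x, lam):
    """Proximal map of phi(y) = |y|: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def envelope(x):
    """Moreau envelope phi_lam(x) = min_y |y| + (y - x)^2 / (2 lam)."""
    y = prox_abs(x, lam)
    return np.abs(y) + (y - x) ** 2 / (2 * lam)

xs = np.linspace(-2, 2, 41)
h = 1e-6
fd = (envelope(xs + h) - envelope(xs - h)) / (2 * h)     # numerical gradient
identity = (xs - prox_abs(xs, lam)) / lam                # claimed closed form
err = np.abs(fd - identity).max()
```

The envelope is differentiable even though $|x|$ is not, which is precisely why $\|\nabla\phi_{1/(2l)}\|$ is a usable stationarity surrogate for the nonsmooth $\phi$.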

Average (stdev) ARI scores for the first scenario of synthetic datasets.

Average (stdev) ARI scores for the second scenario of synthetic datasets.

Graph classification results. Average (stdev) accuracies:
         77.7(1.5)  59.0(2.6)  53.0(6.2)  79.7(3.6)  76.9(3.6)
IMDB-M   52.9(2.5)  44.2(2.3)  42.0(3.7)  53.5(5.0)  52.8(4.6)

ACKNOWLEDGMENTS

This work is supported by the National Key R&D Program of China (2020YFB1313501, 2020AAA0107400), National Natural Science Foundation of China (T2293723, 61972347, 61672376, 61751209, 61472347, 62206248), Zhejiang Provincial Natural Science Foundation (LR19F020005, LZ18F020002), the Key Research and Development Project of Zhejiang Province (2022C01022, 2022C01119, 2021C03003), the Fundamental Research Funds for the Central Universities (226-2022-00051), Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies, China Scholarship Council (202206320311), and MEXT KAKENHI (20H04243).

Appendix

The appendix is organized as follows. We first provide the proofs omitted from the main paper in Sec. A. Then, algorithmic details are presented in Sec. B. Finally, Sec. C gives additional experimental results.

A OMITTED PROOFS

Theorem 1 Given an observed source graph $\mathcal{G}^s$ and a target graph $\mathcal{G}^t$ that can be expressed as $(\mathbf{C}^s, \mathbf{p}^s)$ and $(\mathbf{C}^t, \mathbf{p}^t)$, respectively, RGWD satisfies:

Projection into the feasible domain. This requires solving the following problem. This optimization problem has a strongly convex objective and linear constraints, and hence can be solved efficiently via the Augmented Lagrangian Method ( , 2021), where $\rho$ measures the optimality, that is, the violation of the two linear constraints. When $\rho = O\big(\frac{1}{n^2}\big)$, this step also has cubic cost if we ignore the log term. Therefore, the overall complexity of PGD for obtaining a $\delta$-stationary solution is

Visualization of graph embeddings. Since GDL, GWF, and RGDL can all output graph embeddings, we further illustrate the embeddings generated by each of them using PCA. As shown in Figure 2, the embeddings of the two types of graphs are less likely to be mixed together, which explains why RGDL achieves higher ARI values.

Sensitivity analysis of λ. As discussed in Li et al. (2020) and Vincent-Cuaz et al. (2021), the negative quadratic term can promote the sparsity of graph embeddings. We further conduct a sensitivity analysis of λ by varying its value in {0, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1}. As shown in Table 3, λ ∈ [10^-4, 10^-2] often yields good performance. The experiments in the main paper are run with λ = 10^-3.

Graph classification. The learned embeddings of graphs can also be used in the graph classification task. RGDL is thus compared against GDL (Vincent-Cuaz et al., 2021), GWF (Xu, 2020), and other state-of-the-art graph classification methods, including WGDL (Zhang et al., 2021) and GNTK (Du et al., 2019), on the benchmark datasets MUTAG (Debnath et al., 1991), IMDB-B, and IMDB-M (Yanardag and Vishwanathan, 2015). RGDL, GDL, and GWF use 3-NN as the classifier due to its simplicity.
We perform 10-fold nested cross-validation (using 9 folds for training and 1 for testing, and reporting the average accuracy over 10 repetitions of the experiment), keeping the same folds across all methods.
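The projection step discussed above is solved in the paper with an Augmented Lagrangian Method. As a purely illustrative alternative (not the paper's scheme), the Euclidean projection onto $\Pi(\mathbf{p}, \mathbf{q}) = \{\mathbf{T} \ge 0,\, \mathbf{T}\mathbf{1} = \mathbf{p},\, \mathbf{T}^\top\mathbf{1} = \mathbf{q}\}$ can also be approximated with Dykstra's alternating-projection scheme; all function names below are ours.

```python
import numpy as np

def proj_rows(T, p):
    # Euclidean projection onto the affine set {T : T 1 = p} (uniform row shift)
    return T + (p - T.sum(axis=1))[:, None] / T.shape[1]

def proj_cols(T, q):
    # Euclidean projection onto the affine set {T : T^T 1 = q} (uniform column shift)
    return T + (q - T.sum(axis=0))[None, :] / T.shape[0]

def project_polytope(H, p, q, iters=2000):
    """Dykstra's algorithm: approximate Euclidean projection of H onto Pi(p, q)."""
    projs = [lambda X: proj_rows(X, p),
             lambda X: proj_cols(X, q),
             lambda X: np.maximum(X, 0.0)]        # nonnegativity
    T = H.copy()
    incs = [np.zeros_like(H) for _ in projs]      # Dykstra correction terms
    for _ in range(iters):
        for k, P in enumerate(projs):
            Y = P(T + incs[k])
            incs[k] = T + incs[k] - Y
            T = Y
    return T

rng = np.random.default_rng(3)
p = np.full(3, 1.0 / 3); q = np.full(4, 1.0 / 4)
H = 0.3 * rng.normal(size=(3, 4))                 # e.g. a pre-projection PGD iterate
Tp = project_polytope(H, p, q)
```

Unlike plain alternating projections, Dykstra's correction terms make the limit the actual nearest point in the polytope, which is what the PGD projection step requires; a comparison against the feasible independent coupling $\mathbf{p}\mathbf{q}^\top$ gives a quick optimality sanity check.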

