A DIFFERENTIAL GEOMETRIC VIEW AND EXPLAINABILITY OF GNN ON EVOLVING GRAPHS

Abstract

Graphs are ubiquitous in social networks and biochemistry, where Graph Neural Networks (GNN) are the state-of-the-art models for prediction. Graphs can be evolving and it is vital to formally model and understand how a trained GNN responds to graph evolution. We propose a smooth parameterization of the GNN predicted distributions using axiomatic attribution, where the distributions are on a low-dimensional manifold within a high-dimensional embedding space. We exploit the differential geometric viewpoint to model distributional evolution as smooth curves on the manifold. We reparameterize families of curves on the manifold and design a convex optimization problem to find a unique curve that concisely approximates the distributional evolution for human interpretation. Extensive experiments on node classification, link prediction, and graph classification tasks with evolving graphs demonstrate the better sparsity, faithfulness, and intuitiveness of the proposed method over the state-of-the-art methods.

1. INTRODUCTION

Graph neural networks (GNN) are now the state-of-the-art method for graph representation in many applications, such as social network modeling Kipf & Welling (2017), molecule property prediction Wu et al. (2018), pose estimation in computer vision Yang et al. (2021), smart cities Ye et al. (2020), fraud detection Wang et al. (2019), and recommendation systems Ying et al. (2018). A GNN outputs a probability distribution Pr(Y|G; θ) of Y, the class random variable of a node (node classification), a link (link prediction), or a graph (graph classification), using trained parameters θ. Graphs can be evolving, with edges/nodes added and removed. For example, social networks undergo constant updates Xu et al. (2020a), and graphs representing chemical compounds are constantly tweaked and tested during molecule design. In a sequence of graph snapshots, without loss of generality, let G0 → G1 be any two snapshots where the source graph G0 evolves to the destination graph G1. Pr(Y|G0; θ) will evolve to Pr(Y|G1; θ) accordingly, and we aim to model and explain the evolution of Pr(Y|G; θ) with respect to G0 → G1 to help humans understand the evolution Ying et al. (2019); Schnake et al. (2020); Pope et al. (2019); Ren et al. (2021); Liu et al. (2021). For example, a GNN's prediction of whether a chemical compound is promising for a target disease during compound design can change as the compound is fine-tuned, and it is useful for the designers to understand how the GNN's prediction evolves with respect to compound perturbations.

To model graph evolution, existing work Leskovec et al. (2007; 2008) analyzed the macroscopic change in graph properties, such as graph diameter, density, and power law, but did not analyze how a parametric model responds to graph evolution. Recent work Kumar et al. (2019); Rossi et al. (2020); Kazemi et al. (2020); Xu et al. (2020b; a) investigated learning a model for each graph snapshot, so that the model is evolving, while we focus on modeling a fixed GNN model over evolving graphs. A more fundamental drawback of the above work is the discrete viewpoint of graph evolution, as individual edges and nodes are added or deleted. Such discrete modeling fails to describe the corresponding change in Pr(Y|G; θ), which is generated by a computation graph that can be perturbed by an infinitesimal amount and can be understood as a sufficiently smooth function. The smoothness can help identify subtle infinitesimal changes that contribute significantly to the change in Pr(Y|G; θ), and thus explain the change more faithfully.

Regarding explaining GNN predictions, there is promising progress on static graphs, including local and global explanation methods Yuan et al. (2020b). Local methods explain individual GNN predictions by selecting salient subgraphs Ying et al. (2019), nodes, or edges Schnake et al. (2020). Global methods Yuan et al. (2020a); Vu & Thai (2020) optimize simpler surrogate models to approximate the target GNN and generate explaining models or graph instances. Existing counterfactual or perturbation-based methods Lucic et al. (2021) attribute a static prediction to individual edges or nodes by optimizing a perturbation to the input graph to maximally alter the target prediction, thus giving a sense of explaining graph evolution. However, the perturbed graph found by these algorithms can differ from G0, and thus does not explain the change from Pr(Y|G0; θ) to Pr(Y|G1; θ). Both prior methods DeepLIFT Shrikumar et al. (2017) and GNN-LRP Schnake et al. (2020) can find propagation paths that contribute to prediction changes. However, they assume a fixed G0 for any G1 and thus fail to model smooth evolution between arbitrary G0 and G1. They also handle multiple classes independently Schnake et al. (2020) or use the log-odds of two predicted classes Y = j and Y = j′ to measure the changes in Pr(Y|G; θ) Shrikumar et al. (2017), rather than the overall divergence between two distributions.

2. PRELIMINARIES

Graph neural networks. For node classification, assume that we have a trained GNN of T layers that predicts the class distribution of each node J ∈ V on a graph G = (V, E). Let N(J) be the neighbors of node J. On layer t = 1, . . . , T, the GNN computes the hidden vector h_J^{(t)} of node J from the messages sent by its neighbors:

z_J^{(t)} = f_UPDATE^{(t)}(f_AGG^{(t)}(h_J^{(t-1)}, {h_K^{(t-1)} : K ∈ N(J)})),   (1)
h_J^{(t)} = NonLinear(z_J^{(t)}),   (2)

where f_AGG^{(t)} aggregates the messages from all neighbors and can be the element-wise sum, average, or maximum of the incoming messages, and f_UPDATE^{(t)} maps the aggregate to z_J^{(t)}, either as ⟨f_AGG^{(t)}, θ^{(t)}⟩ or through a multi-layer perceptron with parameters θ^{(t)}. For layers t ∈ {1, . . . , T−1}, ReLU is used as the NonLinear mapping, and we refer to the linear terms in the argument of NonLinear as "logits". At the input layer, h_J^{(0)} is the node feature vector x_J. At layer T, the logits are z_J^{(T)} ≜ z_J(G), whose j-th element z_j(G) denotes the logit of class j = 1, . . . , c. z_J(G) is mapped to the class distribution Pr(Y_J|G; θ) through the softmax (c > 2) or sigmoid (c = 2) function, and arg max_j z_j = arg max_j Pr(Y = j|G; θ) is the predicted class for J. For link prediction, we concatenate z_I^{(T)} and z_J^{(T)} as the input to a linear layer to obtain the logits z_IJ = ⟨[z_I^{(T)}; z_J^{(T)}], θ⟩. Since link prediction is a binary classification problem, z_IJ can be mapped to the probability that (I, J) exists using the sigmoid function. For graph classification, the average pooling of z_J(G) over all nodes of G can be used to obtain a single vector representation z(G) of G for classification.

Notation:
c — the number of classes.
G0 → G1 — graph G0 evolves to G1.
z_J(G) — logit vector [z_1(G), . . . , z_c(G)] of node J.
∆z_J(G0, G1) — ∆z_J(G0, G1) = z_J(G1) − z_J(G0).
Pr(Y|G) — distribution [Pr_1(G), . . . , Pr_c(G)] of class Y.
W(G) — paths on the computation graph of the GNN.
W_J(G) — the subset of W(G) that computes z_J(G).
∆W_J(G0, G1) — altered paths in W_J(G0) as G0 → G1.
C_{p,j} — contribution of the p-th altered path to ∆z_j.

Since the GNN parameters θ are fixed, we omit θ in Pr(Y|G; θ) and use Pr(Y|G) to denote the predicted class distribution of Y, a general random variable of the class of a node, an edge, or a whole graph, depending on the task. Similarly, we use z^{(t)} and z to denote the logits on layer t and the last layer of the GNN, respectively. For a uniform treatment, we consider the GNN as learning node representations z_J, while the concatenation, pooling, sigmoid, and softmax at the last layer that generate Pr(Y|G) from the node representations are task-specific and separated from the GNN.

Evolving graphs. In a sequence of graph snapshots, let G0 = (V0, E0) denote an arbitrary source graph with edges E0 and nodes V0, and G1 = (V1, E1) an arbitrary destination graph, so that the edge set evolves from E0 to E1 and the node set from V0 to V1. We denote the evolution by G0 → G1. Both sets can undergo addition, deletion, or both, and all such operations happen deterministically, so the evolution is discrete. Let ∆E be the set of altered edges (defined below).

We propose a novel extrinsic coordinate system based on the contributions of paths to Pr(Y|G) on the computation graph of the GNN. z_J(G) is generated by the computation graph of the given GNN, which is a spanning tree of depth T rooted at J. Figure 1 shows two computation graphs for G0 and G1. On a computation graph, each node consists of the neurons of the corresponding node in G, and we use the same labels (I, J, K, L, etc.) to identify nodes in the input and computation graphs. The leaves of the tree contain neurons from the input layer (t = 0) and the root node contains neurons of the output layer (t = T). The trees completely represent the computations in Eqs. (1)-(2), where each message is passed along a path from a leaf to the root.
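Eqs. (1)-(2) with sum aggregation and a linear f_UPDATE can be sketched as follows. This is a minimal dense sketch, not the paper's implementation; the self-message term is an assumption of the sketch (it makes paths such as (K, J, J) exist), and all names are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gnn_logits(A, X, thetas):
    """Forward pass of a T-layer GNN with element-wise sum aggregation.

    A      : (N, N) adjacency matrix of the input graph.
    X      : (N, d0) node features, i.e., h^(0).
    thetas : list of T weight matrices; layer t maps d_{t-1} -> d_t.
    Returns the logits z_J(G) for every node J (no softmax applied).
    """
    H = X
    for t, theta in enumerate(thetas):
        # f_AGG: element-wise sum of neighbor messages; the self term is an
        # assumption of this sketch so that paths like (J, ..., J) exist.
        M = (A + np.eye(A.shape[0])) @ H
        Z = M @ theta                              # f_UPDATE with parameters theta^(t)
        H = relu(Z) if t < len(thetas) - 1 else Z  # ReLU on all layers but the last
    return H  # rows are z_J(G), one per node
```

Feeding the rows of the returned matrix through softmax or sigmoid then yields the task-specific Pr(Y|G) described above.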
Let a path be (. . . , U, V, . . . , J), where U and V represent any two adjacent nodes and J is the root where z_J(G) is generated. For a GNN with T layers, a path is a sequence of T + 1 nodes. Let W_J(G) be the set of paths ending at J. As G0 → G1, the set of altered edges is ∆E = {e : (e ∈ E1 ∧ e ∉ E0) ∨ (e ∈ E0 ∧ e ∉ E1)}.

Consider a reference graph G* containing all nodes appearing during the evolution. The symmetric set difference ∆W_J(G*, G) = W_J(G*) ∆ W_J(G) contains all m paths rooted at J with at least one altered edge as G* → G. For example, in Figure 1, ∆W_J(G*, G) = {(J, K, J), (K, J, J), (K, K, J), (L, K, J)}. ∆W_J(G*, G) causes the change in z_J and Pr(Y|G) through the computation graph. Let the difference between z_J computed on G* and on G be ∆z_J(G*, G) = z_J(G) − z_J(G*) = [∆z_1, . . . , ∆z_c] ∈ R^c. We adapt DeepLIFT to GNN (see Shrikumar et al. (2017) and Appendix A.4) to compute the contribution C_{p,j}(G) of each path p ∈ ∆W_J(G*, G) to ∆z_j(G) for any class j = 1, . . . , c, so that [z_1(G), . . . , z_c(G)] is reparameterized as

[z_1(G*), . . . , z_c(G*)] + [Σ_{p=1}^m C_{p,1}(G), . . . , Σ_{p=1}^m C_{p,c}(G)] = z_J(G*) + 1^⊤ C_J(G),

where C_J(G) is the contribution matrix with elements C_{p,j}(G), C_{:j}(G) is its j-th column, and 1 is an all-one m × 1 vector. By fixing G* and z_J(G*), we use C_{p,j}(G) as the extrinsic coordinates of Pr(Y|G). In this coordinate system, the difference vector between two logit vectors of node J is

∆z_J(G0, G1) = z_J(G1) − z_J(G0) = 1^⊤(C_J(G1) − C_J(G0)) = 1^⊤ ∆C_J(G0, G1).   (5)

If we set G* = G0, we have ∆z_J(G0, G1) = 1^⊤ C_J(G1). Even with a fixed G*, different graphs G and nodes J can lead to different sets ∆W_J(G*, G). We obtain a unified coordinate system by taking the union ∪_{G,J∈G} ∆W_J(G*, G), and we set the rows of C_J(G) to zero for those paths not in ∆W_J(G*, G).
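The altered-path set ∆W_J can be enumerated by brute force on small graphs. The sketch below assumes undirected edges plus self-messages (assumptions of the sketch); with one plausible pair of graphs where the edge (J, K) is added, it yields exactly the four altered paths listed above:

```python
from itertools import product

def paths_to_root(edges, nodes, root, T):
    """All computation-graph paths (v_0, ..., v_T) ending at `root`,
    following an edge or a self-message at every hop."""
    def connected(u, v):
        # Undirected edges plus self-messages (assumptions of this sketch).
        return u == v or (u, v) in edges or (v, u) in edges
    paths = set()
    for walk in product(nodes, repeat=T):
        seq = walk + (root,)
        if all(connected(seq[i], seq[i + 1]) for i in range(T)):
            paths.add(seq)
    return paths

def altered_paths(E0, E1, nodes, root, T):
    """Symmetric set difference Delta W_J: paths present in exactly one graph."""
    return paths_to_root(E0, nodes, root, T) ^ paths_to_root(E1, nodes, root, T)
```

For example, with E0 = {(K, L)} and E1 = {(K, L), (J, K)} and a two-layer GNN, the symmetric difference is {(J, K, J), (K, J, J), (K, K, J), (L, K, J)}.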
In implementing our algorithm, we rely only on the observed graphs to exhaust the relevant paths, without computing ∪_{G,J∈G} ∆W_J(G*, G). We now embed Pr(Y|G) in this coordinate system.

• Node classification: the class distribution of node J is Pr(Y|G) = softmax(z_J(G)) = softmax(z_J(G*) + 1^⊤ C_J(G)).   (6)
• Link prediction: for a link (I, J) between nodes I and J, the logits z_I(G) and z_J(G) are concatenated as input to a linear layer θ_LP ("LP" for "link prediction"). The class distribution of the link is Pr(Y|G) = sigmoid([z_I(G*) + 1^⊤ C_I(G); z_J(G*) + 1^⊤ C_J(G)] θ_LP).   (7)
• Graph classification: with a linear layer θ_GC ("GC" for "graph classification") and average pooling, the distribution of the graph class is Pr(Y|G) = softmax(mean{z_J(G*) + 1^⊤ C_J(G) : J ∈ V} θ_GC).   (8)

The arguments of the above softmax and sigmoid are linear in C_J(G) for all J ∈ V of G, and we recover exponential families reparameterized by C_J(G). For a specific prediction task, we let the contribution matrices C_J(G) in the corresponding equation of Eqs. (6)-(8) vary smoothly, and the resulting set {Pr(Y|G)} constitutes a manifold M(G, J). The dimension of the manifold equals the number of sufficient statistics of Pr(Y|G), though the embedding Euclidean space has mc coordinates for node classification (2mc for link prediction and |V|mc for graph classification), where m is the number of paths in ∆W_J(G0, G1) and c is the number of classes.
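For node classification, Eq. (6) in this coordinate system is straightforward to evaluate; a small numeric sketch with illustrative values (m = 4 altered paths, c = 3 classes):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative values: z_J(G*) on the reference graph and the contribution
# matrix C_J(G) with one row per altered path, one column per class.
z_star = np.array([0.5, -0.2, 0.1])
C = np.array([[ 0.3, -0.1,  0.0],
              [-0.2,  0.4,  0.1],
              [ 0.0,  0.1, -0.3],
              [ 0.5,  0.0,  0.2]])

logits = z_star + np.ones(C.shape[0]) @ C  # z_J(G*) + 1^T C_J(G), Eq. (6)
p = softmax(logits)                        # Pr(Y|G) in the path coordinates
```

Varying the rows of `C` smoothly traces out points of the manifold M(G, J) described above.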

3.2. A CURVED METRIC ON THE MANIFOLD

We will define a curved metric on the manifold M(G, J) of node classification distributions (link prediction and graph classification can be handled similarly). A well-defined metric is vital to tasks such as metric learning on manifolds, which we will use to explain evolving GNN predictions in Section 3.3. We could have approximated the distance between two distributions Pr(Y|G0) and Pr(Y|G1) by ∥∆C_J(G0, G1)∥ with some matrix norm (e.g., the Frobenius norm), as shown in Figure 1. As another example, DeepLIFT Shrikumar et al. (2017) uses the linear term 1^⊤(C_{:j}(G0) − C_{:j′}(G1)) for two predicted classes j and j′ on G0 and G1, respectively. These options imply the Euclidean distance metric of the flat space spanned by the elements of C_J(G). However, the evolution of Pr(Y|G) on M(G, J) depends on C_J(G0) and C_J(G1) nonlinearly through the sigmoid or softmax function as in Eqs. (6)-(8), and the difference between Pr(Y|G0) and Pr(Y|G1) should reflect the curvature of the manifold M(G, J) of class distributions.

We adopt information geometry Amari (2016) to define a curved metric on the manifold M(G, J). Taking node classification as an example, the KL-divergence between any two class distributions on M(G, J) is defined as D_KL(Pr(Y|G1) || Pr(Y|G0)) = Σ_Y Pr(Y|G1) log [Pr(Y|G1) / Pr(Y|G0)]. As the parameter C_J(G0) approaches C_J(G1), Pr(Y|G0) becomes close to Pr(Y|G1) (as measured by the following Riemannian metric on M(G, J), rather than the Euclidean metric of the extrinsic space), and the KL-divergence can be approximated locally at Pr(Y|G1) as

vec(∆C_J(G1, G0))^⊤ I(vec(C_J(G1))) vec(∆C_J(G1, G0)),   (9)

where vec(C_J(G1)) is the column vector of all elements of C_J(G1), and similarly for vec(∆C_J(G1, G0)) with the matrix ∆C_J(G1, G0).
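Why a flat norm is insufficient can be illustrated numerically: two perturbations of C_J with identical Frobenius norm generally move Pr(Y|G) by different KL amounts, because the softmax warps the flat coordinates. A minimal sketch with illustrative values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
z_star = np.zeros(3)
C0 = rng.normal(size=(4, 3))      # illustrative contribution matrix (m=4, c=3)

# Two perturbations with the SAME Frobenius norm ...
D1 = np.zeros((4, 3)); D1[0, 0] = 1.0
D2 = np.zeros((4, 3)); D2[0, 1] = 1.0

p0 = softmax(z_star + C0.sum(axis=0))
p1 = softmax(z_star + (C0 + D1).sum(axis=0))
p2 = softmax(z_star + (C0 + D2).sum(axis=0))
# ... generally yield DIFFERENT KL divergences from p0: the flat Euclidean
# metric on C_J ignores the curvature of the distribution manifold.
```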
I(vec(C_J(G1))) is the Fisher information matrix of the distribution Pr(Y|G1) with respect to the parameters vec(C_J(G1)), evaluated as

(∇_{vec(C_J(G1))} z_J(G1))^⊤ E_{Y∼Pr(Y|G1)} [s_{z_J(G1)} s_{z_J(G1)}^⊤] (∇_{vec(C_J(G1))} z_J(G1)),

where s_{z_J(G1)} = ∇_{z_J(G1)} log Pr(Y|G1) is the gradient vector of the log-likelihood with respect to z_J(G1), and ∇_{vec(C_J(G1))} z_J(G1) ∈ R^{c×mc} is the Jacobian of z_J(G1) with respect to vec(C_J(G1)) (see Martens (2020) and Appendix A.2 for the derivations). I is symmetric positive definite (SPD), and Eq. (9) defines a non-Euclidean metric that makes M a Riemannian manifold.
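The construction of I(vec(C_J)) can be checked numerically for the node-classification embedding: build the Jacobian of z_J with respect to vec(C_J), sandwich the softmax score covariance, and compare the quadratic form (with the usual 1/2 second-order Taylor factor, cf. Appendix A.2) against the exact KL for a small perturbation. A sketch with illustrative dimensions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

m, c = 4, 3
rng = np.random.default_rng(1)
z_star = rng.normal(size=c)
C1 = rng.normal(size=(m, c))                 # illustrative C_J(G1)
p1 = softmax(z_star + C1.sum(axis=0))

# Jacobian of z w.r.t. vec(C) (row-major): dz_j / dC_{p,k} = 1 iff k == j.
Jac = np.zeros((c, m * c))
for p in range(m):
    for k in range(c):
        Jac[k, p * c + k] = 1.0

F_z = np.diag(p1) - np.outer(p1, p1)         # E[s s^T] for softmax logits
I_C = Jac.T @ F_z @ Jac                      # Fisher information in C coordinates

delta = 1e-3 * rng.normal(size=(m, c))       # small vec(Delta C_J(G1, G0))
p0 = softmax(z_star + (C1 + delta).sum(axis=0))
kl_exact = float(np.sum(p1 * np.log(p1 / p0)))
kl_quad = 0.5 * delta.ravel() @ I_C @ delta.ravel()
# kl_exact and kl_quad agree to leading order in ||delta||.
```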

3.3. CONNECTING TWO DISTRIBUTIONS VIA A SIMPLE CURVE

We formulate the problem of explaining GNN prediction evolution as optimizing a curve on the manifold M. Take node classification as an example. Let s ∈ [0, 1] be the time variable. As s → 1, Pr(Y|G(s)) moves smoothly over M along a curve γ(s) ∈ Γ(G0, G1) = {γ(s) = {Pr(Y|G(s)) : s ∈ [0, 1]}, Pr(Y|G(0)) = Pr(Y|G0), Pr(Y|G(1)) = Pr(Y|G1)} ⊂ M. Two possible curves γ1(s) and γ2(s) are shown in Figure 1. With the parameterization in Eqs. (5)-(6), we can define the following families of curves by smoothly varying the path contributions to Pr(Y|G(s)) through ∆z_J(G(s)) (s can be reversed to move in the opposite direction along γ(s)):

• linear in the directional matrix ∆C_J(G0, G1): ∆C_J(G0, G(s)) = ∆C_J(G0, G1) s;
• linear in the elements of ∆C_J(G0, G1): ∆C_J(G0, G(s)) = ∆C_J(G0, G1) ⊙ X(s), where ⊙ is the element-wise product and each matrix element X(s)_{p,j} is a function mapping s ∈ [0, 1] → [0, 1];
• linear in the rows of ∆C_J(G0, G1): let x(s) = [x_1(s), . . . , x_m(s)]^⊤ with x_p(s) ∈ [0, 1] weighting the p-th path as a whole, and

∆C_J(G0, G(s)) = ∆C_J(G0, G1) ⊙ [1_{1×c} ⊗ x(s)],   (10)

where ⊗ is the Kronecker product creating the path-weighting matrix 1_{1×c} ⊗ x(s) ∈ [0, 1]^{m×c}.

According to the derivation in Appendix A.1, we can rewrite D_KL(Pr(Y|G1) || Pr(Y|G0)) as

E_{j∼Pr(Y|G1)} [1^⊤ (C_{:j}(G1) − C_{:j}(G0))] − log Z(G1) + log Σ_{j=1}^c exp{z_j(G*) + 1^⊤ C_{:j}(G0)},   (11)

where the expectation samples class j from Pr(Y|G1) and log Z(G1) is the cumulant function of Pr(Y|G1). In Eq. (11), by letting G0 vary along any γ(s) as parameterized above and replacing C_{:j}(G0) with C_{:j}(G(s)) = C_{:j}(G0) + ∆C_{:j}(G0, G(s)), the KL-divergence becomes a function of s that measures how far Pr(Y|G(s)) remains from Pr(Y|G1). Since γ(s) ∈ Γ(G0, G1) ⊂ M(G, J), selecting a curve γ(s) is different from selecting some edges from G1 to approximate the distribution Pr(Y|G1) as in Ying et al. (2019).
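The row-linear parameterization of Eq. (10) can be traced numerically: with x_p(s) = s for every path, the distribution slides from Pr(Y|G0) toward Pr(Y|G1), and the KL-divergence to the endpoint shrinks to zero at s = 1. A sketch with illustrative values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

m, c = 4, 3
rng = np.random.default_rng(2)
z_star = np.zeros(c)
C0 = rng.normal(size=(m, c))       # illustrative C_J(G0)
C1 = rng.normal(size=(m, c))       # illustrative C_J(G1)
dC = C1 - C0                       # Delta C_J(G0, G1)

def pr_on_curve(s):
    # Row-linear curve of Eq. (10) with x_p(s) = s for every path p.
    x = np.full(m, s)
    Cs = C0 + dC * x[:, None]      # Delta C ⊙ (1_{1×c} ⊗ x(s))
    return softmax(z_star + Cs.sum(axis=0))

p1 = softmax(z_star + C1.sum(axis=0))
divergences = [kl(p1, pr_on_curve(s)) for s in np.linspace(0.0, 1.0, 5)]
# divergences[-1] is ~0: the curve ends at Pr(Y|G1).
```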
Rather, the curves should move according to the geometry of the manifold M(G, J). We can use Eq. (11) to explain how the computation of Pr(Y|G0) evolves to that of Pr(Y|G1) along γ(s). There are mc coordinates in C_J(G(s)), which can be large when J is a high-degree node, while an explanation should be concise. We will identify a curve γ(s) that uses a small number of coordinates for conciseness. The parameterization in Eq. (10) assigns a weight x_p(s) to each path p at time s, allowing us to threshold the elements of x(s) to select a subset E_n of n ≪ m paths. The contribution from these few selected paths is ∆C_J(G0, G(s)), which should well approximate ∆C_J(G0, G(1)) as s → 1. For example, in Figure 1, we can take E_2 = {(K, J, J), (L, K, J)} ⊂ ∆W_J(G0, G1) with n = 2. The selected paths span a low-dimensional space that embeds the neighborhood of Pr(Y|G1) on the manifold M(J, G). Adding the paths in E_n to the computation graph of G0 leads to a new computation graph on the manifold.

We optimize E_n to minimize the KL-divergence in Eq. (11) under the parameterization of Eq. (10). Let x_p(s) ∈ [0, 1], p = 1, . . . , m, be the weight of selecting path p into E_n. We solve the following problem:

min_{x(s) ∈ [0,1]^m, ∥x(s)∥_1 = n}  E_{j∼Pr(Y|G1)} [1^⊤ (C_{:j}(G1) − C_{:j}(x(s)))] + log Σ_{j=1}^c exp{z_j(G*) + 1^⊤ C_{:j}(x(s))},   (12)

where C_{:j}(x(s)) = C_{:j}(G0) + ∆C_{:j}(G0, G(s)) is the vector of path contributions to the logit of class j, and ∆C_{:j}(G0, G(s)) is parameterized by Eq. (10) as a function of x(s). The constant log Z(G1) is dropped from Eq. (11) since G1 is fixed. The linear constraint ensures that the total weight of the selected paths is n. The optimization problem is convex and has a unique optimal solution. We select the n paths with the highest optimized weights x_p(s) into E_n.
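A sketch of the selection problem in Eq. (12). The paper solves it with cvxpy; here scipy's SLSQP solver is substituted to keep the sketch self-contained, and all dimensions and data are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

m, c, n = 6, 3, 2
rng = np.random.default_rng(3)
z_star = np.zeros(c)
C0 = rng.normal(size=(m, c))
dC = rng.normal(size=(m, c))            # Delta C_J(G0, G1), so C1 = C0 + dC
C1 = C0 + dC
p1 = softmax(z_star + C1.sum(axis=0))   # target distribution Pr(Y|G1)

def objective(x):
    # Eq. (12) up to a constant: cross-entropy between Pr(Y|G1) and the
    # distribution at path weights x (convex, since the logits are affine in x).
    logits = z_star + (C0 + dC * x[:, None]).sum(axis=0)
    M = logits.max()
    return -p1 @ logits + np.log(np.exp(logits - M).sum()) + M

res = minimize(objective, x0=np.full(m, n / m),
               bounds=[(0.0, 1.0)] * m,
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - n}],
               method="SLSQP")
E_n = np.argsort(-res.x)[:n]            # paths with the highest optimized weights
```

The box constraints and the equality constraint mirror x(s) ∈ [0, 1]^m and ∥x(s)∥_1 = n in Eq. (12); a DCP solver such as cvxpy would accept the same objective via its `log_sum_exp` atom.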

4. EXPERIMENTS

Datasets and tasks. We study the node classification task on evolving graphs on the YelpChi, YelpNYC, and YelpZip Rayana & Akoglu (2015), Pheme Zubiaga et al. (2017), and Weibo Ma et al. (2018) datasets, and the link prediction task on the BC-OTC, BC-Alpha, and UCI datasets. These datasets have time stamps from which the graph evolutions can be identified. The molecular dataset MUTAG Debnath et al. (1991) is used for graph classification. In molecule search, slight perturbations are applied to molecule graphs You et al. (2018); we simulate such perturbations by randomly adding or removing edges to create evolving graphs. Appendix A.5.1 gives more details.

Experimental setup. For each dataset, we optimize the GNN parameters θ on a training set of static graphs, using labeled nodes, edges, or graphs, depending on the task. For each graph snapshot except the first, target nodes/edges/graphs with a significantly large D_KL(Pr(Y|G0) || Pr(Y|G1)) are collected and the change in Pr(Y|G) is explained. We run Algorithm 1 to calculate the contribution matrix C_J(G) for each node J ∈ V*. We use the cvxpy library Diamond & Boyd to solve the constrained convex optimization problems in Eqs. (12), (14), and (15). This method is called "AxiomPath-Convex". We also adopt the following baselines.

• Gradient (Grad) computes the gradients of the logit of the predicted class j with the maximal Pr(Y = j|G) on G0 and G1, respectively. Each computation path is assigned the sum of the gradients of the edges on the path as its importance. The contribution of a path to the change in Pr(Y|G) is the difference between the two path importance scores computed on G0 and G1; if a path exists on only one graph, the importance of the path is taken as the contribution. The paths with top importance are selected into E_n.
• GNNExplainer (GNNExp): paths in ∆W_J(G0, G1), in ∆W_I(G0, G1) ∪ ∆W_J(G0, G1), or in ∪_{J∈V} ∆W_J(G0, G1) are ranked and selected.
• AxiomPath-Topk is a variant of AxiomPath-Convex. It selects the top paths p from ∆W_J(G0, G1), from ∆W_I(G0, G1) ∪ ∆W_J(G0, G1), or from ∪_{J∈V} ∆W_J(G0, G1), with the highest contributions ∆C_J(G0, G1) 1, where 1 is an all-one c × 1 vector. This baseline works in the Euclidean space spanned by the paths as coordinates and relies on linear differences in C(G) rather than the nonlinear movement from Pr(Y|G0) to Pr(Y|G1).
• AxiomPath-Linear optimizes the AxiomPath-Convex objective without the last log term, leading to a linear program.

Quantitative evaluation metrics. Let Pr(Y|¬G(s)) be computed on the computation graph of G1 with the paths from E_n disabled. This should bring G1 close to G0 along γ(s), so that KL+ = D_KL(Pr_J(G0) ∥ Pr_J(¬G(s))) should be small if E_n does contain the paths vital to the evolution. Similarly, we expect Pr_J(G(s)) to move close to Pr_J(G1) after the paths in E_n are enabled on the computation graph of G0, so that KL− = D_KL(Pr_J(G1) ∥ Pr_J(G(s))) should be small. Intuitively, if E_n indeed contains the more salient altered paths that turn G0 into G1, the less information the remaining paths can propagate, the more similar G_n is to G1 and ¬G_n is to G0, and thus the smaller the KL-divergences. Prior work Suermondt (1992); Yuan et al. (2020a); Ying et al. (2019) uses KL-divergence to measure the approximation quality of a static predicted distribution Pr(Y|G), while the above metrics evaluate how distributions on the curve γ(s) approach the target Pr(Y|G1). Similar metrics can be defined for the link prediction and graph classification tasks, where the KL-divergence is calculated using predicted distributions over the target edge or graph. The target nodes (links or graphs) are grouped by the number of altered paths in ∆W_J(G0, G1) so that the results are comparable, since altering different numbers of paths can lead to significantly different performance.
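The KL+ / KL- metrics can be sketched under the path reparameterization. All names and data are illustrative; a boolean mask stands in for E_n, and disabling (enabling) a path zeroes (applies) its row of ∆C:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def fidelity_metrics(z_star, C0, dC, selected):
    """KL+ and KL- for an explanation E_n under the path reparameterization.

    selected : boolean mask over the m altered paths (the explanation E_n).
    Disabling the selected paths on G1 should recover a distribution close to
    Pr(Y|G0); enabling them on G0 should come close to Pr(Y|G1)."""
    pr = lambda C: softmax(z_star + C.sum(axis=0))
    p0, p1 = pr(C0), pr(C0 + dC)
    mask = selected[:, None].astype(float)
    p_not = pr(C0 + dC * (1.0 - mask))   # G1 with E_n disabled -> compare to G0
    p_sel = pr(C0 + dC * mask)           # G0 with E_n enabled  -> compare to G1
    kl_plus = kl(p0, p_not)
    kl_minus = kl(p1, p_sel)
    return kl_plus, kl_minus
```

Selecting all altered paths drives both metrics to zero, while selecting none leaves them at the full divergence between the two snapshots.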
For each group, we let n = |E_n| range over five pre-defined levels of explanation simplicity, and all methods are compared under the same level of simplicity. Appendix A.5.2 and Appendix A.5.3 give more details of the experimental setup.

4.1. PERFORMANCE EVALUATION AND COMPARISON

We compare the performance of the methods on three tasks (node classification, link prediction, and graph classification) under different graph evolutions (adding and/or deleting edges). For node classification, Figure 2 demonstrates the effectiveness of the salient path selection of AxiomPath-Convex. For each dataset, we report the average KL+ over target nodes/edges on three datasets (results with the KL− metric and results on the remaining datasets are given in the Appendix (Figure 8)). From the figures, we can see that AxiomPath-Convex has the smallest KL+ over all levels of explanation complexity and over all datasets. In six settings (Weibo with added edges only and with a mixture of added and removed edges, and all cases on YelpChi), the gap between AxiomPath-Convex and the runner-up is significant. In the remaining settings, AxiomPath-Convex slightly outperforms or is comparable to the runner-ups. AxiomPath-Topk and AxiomPath-Linear underperform AxiomPath-Convex, indicating that modeling the geometry of the manifold of probability distributions has a clear advantage over working with the linear parameters of the distributions. On two link prediction tasks and one graph classification task, Figure 3 shows that AxiomPath-Convex uniformly and significantly outperforms the runner-ups (results on the remaining link prediction task and for the KL− metric are given in Figures 6, 8, and 9 in the Appendix). DeepLIFT and GNNExplainer always, and Grad sometimes, fail to find the salient paths that explain the change, as they are designed for static graphs. In Appendix A.7, we provide cases where AxiomPath-Convex identifies edges and subgraphs that help make sense of the evolving predictions. In Appendix A.6, we analyze the running time of each component of the AxiomPath-Convex algorithm on several datasets. In Appendix A.8, we analyze the limitations of our method.

Taking the geometric viewpoint of GNN evolution and its explanation is novel and has not been observed within the information geometry literature or in explainable/interpretable machine learning.

5. RELATED WORK

Prior work explains GNN predictions on static graphs. There are methods explaining the predicted class distribution of graphs or nodes using mutual information Ying et al. (2019), and other techniques have also been used to construct explanations. These works cannot axiomatically isolate the contributions of paths that causally lead to the prediction changes on the computation graphs. Most prior work evaluates the faithfulness of explanations of a static prediction. To explain distributional evolution, faithfulness should be evaluated based on how well the curve of evolution on the manifold is approximated, so that the geometry is respected. No prior work has taken a differential geometric viewpoint of the distributional evolution of GNN. Optimally selecting salient elements to compose a simple and faithful explanation has also received less attention. With the novel reparameterization of curves on the manifold, we formulate a convex program to select a curve that concisely explains the distributional evolution while respecting the manifold geometry.

6. CONCLUSIONS

We studied the problem of explaining changes in GNN predictions over evolving graphs, addressing the issue that prior works treat the evolution linearly. The proposed model views the evolution of the GNN output with respect to graph evolution as a smooth curve on a manifold of class distributions. This viewpoint helps formulate a convex optimization problem to select a small subset of paths that explains the distributional evolution on the manifold. Experiments showed the superiority of the proposed method over the state-of-the-art. In the future, we will explore more geometric properties of the constructed manifold to enable a deeper understanding of GNN on evolving graphs.

A APPENDIX

A.1 MISC. PROOFS

Here we give the detailed derivation of Eq. (11):

D_KL(Pr(Y|G1) || Pr(Y|G0))
= Σ_{j=1}^c Pr(Y = j|G1) log [Pr(Y = j|G1) / Pr(Y = j|G0)]   (13)
= Σ_{j=1}^c Pr(Y = j|G1) [z_j(G1) − z_j(G0)] − log [Z(G1) / Z(G0)]
= Σ_{j=1}^c Pr(Y = j|G1) 1^⊤ ∆C_{:j}(G0, G1) + log Z(G0) − log Σ_{j=1}^c exp{z_j(G*) + 1^⊤ [C_{:j}(G0) + ∆C_{:j}(G0, G1)]},
where the argument of the exponential equals z_j(G*) + 1^⊤ C_{:j}(G1), so the last log-sum equals log Z(G1);
= Σ_{j=1}^c Pr(Y = j|G1) 1^⊤ ∆C_{:j}(G0, G1) − log Z(G1) + log Σ_{j=1}^c exp{z_j(G*) + 1^⊤ C_{:j}(G0)},
where the last log-sum equals log Z(G0).

A.2 SECOND-ORDER APPROXIMATION OF THE KL DIVERGENCE WITH THE FISHER INFORMATION MATRIX

To help understand Eq. (9), which defines the Riemannian metric, we need the second-order approximation of the KL-divergence. We reproduce the derivations from the note "Information Geometry and Natural Gradients" posted at https://www.nathanratliff.com/pedagogy/mathematics-for-intelligent-systems by Nathan Ratliff (disclaimer: we make no contribution to these derivations and the author of the note owns all credits). In the following, θ should be understood as the vector vec(C_J(G1)), δ as the difference vector vec(∆C_J(G1, G0)) in Eq. (9), and x as the random variable Y, the class variable, in our case.

KL(p(x; θ) ∥ p(x; θ + δ))
≈ ∫ p(x; θ) log p(x; θ) dx − ∫ p(x; θ) [log p(x; θ) + (∇_θ p(x; θ) / p(x; θ))^⊤ δ + (1/2) δ^⊤ ∇²_θ log p(x; θ) δ] dx
= ∫ p(x; θ) log [p(x; θ) / p(x; θ)] dx  (= 0)
  − (∫ ∇_θ p(x; θ) dx)^⊤ δ  (= 0)
  − (1/2) δ^⊤ [∫ p(x; θ) ∇²_θ log p(x; θ) dx] δ.

By assuming that the differentiation and integration in the second term can be exchanged, we have

∫ ∇ p(x; θ) dx = ∇ ∫ p(x; θ) dx = ∇ 1 = 0,
∇² log p(x; θ) = (1 / p(x; θ)) ∇² p(x; θ) − ∇ log p(x; θ) ∇ log p(x; θ)^⊤,

so

KL(p(x; θ) ∥ p(x; θ + δ)) ≈ −(1/2) δ^⊤ [∫ ∇² p(x; θ) dx] δ  (= 0)
  + (1/2) δ^⊤ [∫ p(x; θ) ∇ log p(x; θ) ∇ log p(x; θ)^⊤ dx] δ.

The matrix G(θ) = ∫ p(x; θ) ∇ log p(x; θ) ∇ log p(x; θ)^⊤ dx is known as the Fisher information matrix.
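The approximation above can be verified numerically for a categorical softmax family, where the Fisher information has the closed form G(θ) = diag(p) − p p^⊤ (a standard fact; values below are illustrative):

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

rng = np.random.default_rng(4)
theta = rng.normal(size=5)               # natural parameters (logits)
p = softmax(theta)
G = np.diag(p) - np.outer(p, p)          # Fisher information of the softmax family

delta = 1e-3 * rng.normal(size=5)        # small parameter perturbation
q = softmax(theta + delta)
kl_exact = float(np.sum(p * np.log(p / q)))
kl_second_order = 0.5 * delta @ G @ delta
# kl_exact and kl_second_order agree up to O(||delta||^3).
```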

A.3 OPTIMIZE A CURVE ON THE LINK PREDICTION TASK AND GRAPH CLASSIFICATION TASK

Similar to node classification, following Eq. (12), for link prediction we solve the following problem:

min_{x(s) ∈ [0,1]^m, x′(s′) ∈ [0,1]^{m′}, ∥x(s)∥_1 + ∥x′(s′)∥_1 = n}  E_{Pr(Y|G1)} [(C_{:l}(G1) − C_{:l}(x(s), x′(s′))) θ_LP] + log Σ_{l=0}^1 exp{z_l(G*) + C_{:l}(x(s), x′(s′)) θ_LP},   (14)

where C_{:l}(G1) = [1^⊤ C_{:i}(G1); 1^⊤ C_{:j}(G1)] and C_{:l}(x(s), x′(s′)) = [x(s)^⊤ (C_{:i}(G0) + ∆C_{:i}(G0, G(s))); x′(s′)^⊤ (C_{:j}(G0) + ∆C_{:j}(G0, G(s′)))].

For graph classification, we solve the following problem:

min_{x(s) ∈ [0,1]^m, ∥x(s)∥_1 = n}  E_{Pr(Y|G1)} [(C_{:g}(G1) − C_{:g}(x(s))) θ_GC] + log Σ_{j=1}^c exp{(z_j(G*)/|V| + C_{:g}(x(s))) θ_GC},   (15)

where |V| denotes the number of nodes in the graph, C_{:g}(G1) = Σ_{J∈V} 1^⊤ C_{:j}(G1) / |V|, and C_{:g}(x(s)) = Σ_{J∈V} x(s)^⊤ (C_{:j}(G0) + ∆C_{:j}(G0, G(s))) / |V|.

A.4 ATTRIBUTING THE CHANGE TO PATHS

We describe the computation of C_{p,j} used in the previous sections.

A.4.1 DEEPLIFT FOR MLP

DeepLIFT Shrikumar et al. (2017) serves as a foundation. Let the activation of a neuron at layer t+1 be h^{(t+1)} ∈ R, computed as h^{(t+1)} = f([h_1^{(t)}, . . . , h_n^{(t)}]). Given the reference activation vector h^{(t)}(0) = [h_1^{(t)}(0), . . . , h_n^{(t)}(0)] at layer t at time 0, we can calculate the scalar reference activation h^{(t+1)}(0) = f(h^{(t)}(0)) at layer t+1. The differences-from-reference are ∆h^{(t+1)} = h^{(t+1)} − h^{(t+1)}(0) and ∆h_i^{(t)} = h_i^{(t)} − h_i^{(t)}(0), i = 1, . . . , n, where the presence (absence) of the 0 in parentheses indicates the reference (current) activations. The contribution of ∆h_i^{(t)} to ∆h^{(t+1)} is C_{∆h_i^{(t)} ∆h^{(t+1)}}, such that Σ_{i=1}^n C_{∆h_i^{(t)} ∆h^{(t+1)}} = ∆h^{(t+1)} (preservation of ∆h^{(t+1)}). DeepLIFT defines multipliers and a chain rule so that, given the multipliers from each neuron to each immediate successor neuron, the multiplier from any neuron to a given target neuron can be computed efficiently via backpropagation. DeepLIFT defines the multiplier as

m_{∆h_i^{(t)} ∆h^{(t+1)}} = C_{∆h_i^{(t)} ∆h^{(t+1)}} / ∆h_i^{(t)}.   (16)

If the neurons are connected by a linear layer, C_{∆h_i^{(t)} ∆h^{(t+1)}} = ∆h_i^{(t)} × θ_i^{(t)}, where θ_i^{(t)} is the element of the parameter matrix θ^{(t)} that multiplies the activation h_i^{(t)} to contribute to h^{(t+1)}, so the multiplier is θ_i^{(t)}. For element-wise nonlinear activation functions, we adopt the Rescale rule, which sets C_{∆h_i^{(t)} ∆h^{(t+1)}} = ∆h^{(t+1)}, so the multiplier is ∆h^{(t+1)} / ∆h_i^{(t)}. DeepLIFT defines the chain rule for the multipliers as

m_{∆h_i^{(0)} ∆h^{(T)}} = Σ_l · · · Σ_j m_{∆h_i^{(0)} ∆h_l^{(1)}} · · · m_{∆h_j^{(T−1)} ∆h^{(T)}}.   (17)

A.4.2 DEEPLIFT FOR GNN

We linearly attribute the change to paths by the linear rule and the Rescale rule, even with nonlinear activation functions. When G0 → G1, there may be multiple added or removed edges, or both. These seemingly complicated and different situations can be reduced to the case with a single added edge.
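The linear rule and the Rescale rule can be sketched for a single linear-plus-ReLU unit, checking the summation-to-delta property. All values are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def deeplift_contributions(x, x_ref, theta):
    """Contributions of input differences to the output difference of one
    linear + ReLU unit, via the linear rule and the Rescale rule."""
    z, z_ref = theta @ x, theta @ x_ref          # pre-activations (logits)
    h, h_ref = relu(z), relu(z_ref)
    dz, dh = z - z_ref, h - h_ref
    m_z_h = dh / dz if dz != 0 else 0.0          # Rescale rule: m = Δh / Δz
    # Linear rule: C_{Δx_i Δz} = Δx_i * theta_i; chain through the multiplier.
    return (x - x_ref) * theta * m_z_h

x, x_ref = np.array([1.0, 2.0, -1.0]), np.array([0.5, 0.0, 0.0])
theta = np.array([0.3, -0.2, 0.8])
C = deeplift_contributions(x, x_ref, theta)
# Summation-to-delta: the contributions add up to Δh exactly.
```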
First, any altered path can contain multiple added edges or multiple removed edges, but not both. If a removed edge were closer to the root than an added edge, the added edge would have appeared in a different path, and the removed edge must be from an existing path leading to the root. If an added edge were closer to the root than a removed edge, the nodes after the removed edge have no contribution in G0 and the situation is the same as with added edges only. Second, a path with only removed edges when G0 → G1 can be treated as a path with only added edges when G1 → G0. Lastly, as shown below, only the altered edge closest to the root is relevant, even with multiple added edges. Let U and V be any adjacent nodes in a path, where V is closer to the root J.

Difference-from-reference of neuron activations and logits. When handling a path with multiple added edges, we let the reference activations be computed by the GNN on the graph G0 (G_ref = G0), while the graph at the current moment is G1 (G_cur = G1). For a path p in ΔW_J(G0, G1), let p[t] denote the node, or the neurons of the node, at layer t. For example, if p = (I, ..., U, V, ..., J), then p[T] represents J or the neurons of J, and p[0] represents I or the neurons of I. Given a path p, let t* = max{τ | τ = 1, ..., T, p[τ] = V, p[τ-1] = U, and (U, V) is newly added}. When t ≥ t*, the reference activation of p[t] is h^(t)_{p[t]}(G0); when t < t*, the reference activation of p[t] is zero, because the message of p[t] cannot be passed along the path (p[t], ..., J) to J in G0: the edge (U, V) must be added to G0 to connect p[t] to J in the path. We thus calculate the difference-from-reference of the neurons at each layer for the specific path as follows:

$$\Delta h^{(t)}_{p[t]}=\begin{cases} h^{(t)}_{p[t]}(G_1)-h^{(t)}_{p[t]}(G_0) & t\ge t^{*},\\ h^{(t)}_{p[t]}(G_1) & \text{otherwise.}\end{cases}$$
Figure 4: Circles in rectangles are neurons, and a neuron has a specific color if it contributes to the prediction change in a class. Left: DeepLIFT finds the contribution of an input neuron to the change in an output neuron of an MLP for link prediction, where the input layer is the output of a GNN. Right: A two-layer GNN. The four colored quadrants in Δz_j at the top layer, which can be the input layer to the MLP, can be attributed to the changes in the input neurons at the input layer (e.g., the two blue quadrants at J at the top are attributed to the blue neurons in node K at the input layer through the paths (K, K, J) and (K, J, J)).

For example, in Figure 4, the added edge is (J, K). For the path p = (J, K, J) in G1, t* = 2 and Δh^(0)_j = h^(0)_j(G1), because in G0 the neuron j at layer 0 cannot pass its message to the neuron j at the output layer along the path (J, K, J). The change in the logits Δz^(t)_{p[t]} can be handled similarly.

Multiplier of a neuron to its immediate successor. We choose element-wise sum as the f_AGG function to ensure that the attribution by DeepLIFT preserves the total change in the logits, i.e., Δz_j = Σ_{p=1}^m C_{p,j}. Then z^(t)_v = Σ_{U∈N(V)} Σ_{u∈U} h^(t-1)_u θ^(t)_{u,v}, where θ^(t)_{u,v} denotes the element of the parameter matrix θ^(t) that links neuron u to neuron v. According to Eq. (16),

$$m_{\Delta h^{(t-1)}_u \Delta z^{(t)}_v}=\theta^{(t)}_{u,v},\qquad m_{\Delta z^{(t)}_v \Delta h^{(t)}_v}=\frac{\Delta h^{(t)}_v}{\Delta z^{(t)}_v}.$$

Then we can obtain the multiplier of the neuron u to its immediate successor neuron v according to Eq. (17):

$$m_{\Delta h^{(t-1)}_u \Delta h^{(t)}_v}=\frac{\Delta h^{(t)}_v}{\Delta z^{(t)}_v}\times\theta^{(t)}_{u,v}.\qquad(20)$$

Note that the output of the GNN model is z_J, thus m_{Δh^(T-1)_{p[T-1]} Δz_j} = θ^(T)_{p[T-1],j}. We can obtain the multiplier of each neuron to its immediate successor in the path according to Eq. (20) by letting t = T → 1. After obtaining m_{Δh^(0)_{p[0]} Δh^(1)_{p[1]}}, ..., m_{Δh^(T-1)_{p[T-1]} Δz_j}, according to Eq.
(17), we can obtain m_{Δh^(0)_{p[0]} Δz_j} as

$$m_{\Delta h^{(0)}_{p[0]} \Delta z_j}=\sum_{p[1]}\cdots\sum_{p[T-1]} m_{\Delta h^{(0)}_{p[0]} \Delta h^{(1)}_{p[1]}}\dots m_{\Delta h^{(T-1)}_{p[T-1]} \Delta z_j}.\qquad(21)$$

Calculate the contribution of each path. For a path p in ΔW_J(G0, G1), we obtain the contribution of the path by summing up the input neurons' contributions:

$$C_{p,j}=\sum_{p[0]} m_{\Delta h^{(0)}_{p[0]} \Delta z_j}\times\Delta h^{(0)}_{p[0]},\qquad(22)$$

where p[0] indexes the neurons of the input (a leaf node in the computation graph of the GNN) and Δh^(0)_{p[0]} = h^(0)_{p[0]}.

Algorithm 1 Compute C_{p,j} for a target node J.
1: Input: two graph snapshots G0 and G1; pre-trained GNN parameters θ_GNN for node classification.
2: Obtain the altered path set ΔW_J(G0, G1).
3: Initialize C ∈ R^{|ΔW_J(G0,G1)|×c} as an all-zero matrix.
4: for p ∈ ΔW_J(G0, G1) do
5:   if p contains removed edges then
6:     Reverse G0 → G1 to G1 → G0.
7:     Compute C_{p,j} according to Eq. (22) and let -C_{p,j} be the contribution of p as G0 → G1.

Algorithm 1 describes how to attribute the change in a root node J's logit Δz_j to each path p ∈ ΔW_J(G0, G1).

The computational complexity of C_{p,j}. Suppose that we have a T-layer GNN, where the dimension of the hidden vector h
Then, according to Eq. (21) and Eq. (22), we use the chain rule to obtain the final multiplier and then obtain the contribution. Because the multiplier m_{Δh^(t-1)_u Δh^(t)_v} is sparse, obtaining the final multiplier matrix is also relatively fast. For dense edge structures, the number of paths is large, but for paths that share the same nodes after layer t (for example, the paths (K, K, J) and (J, K, J)), the multipliers after layer t (for example, m_{Δh^(1)_k Δh^(2)_j}) are the same. The proportion of paths that can share multiplier matrices is large, which speeds up the calculation.

Theorem 1. GNN-LRP is a special case if the reference activation is set to that of the empty graph.

Proof. Consider the path p = (I, ..., U, V, ..., J) on the graph G1 and let G0 be the empty graph.
Then Δh^(t)_{p[t]} = h^(t)_{p[t]}(G1), Δz^(t)_{p[t]} = z^(t)_{p[t]}(G1), and m_{Δh^(t-1)_u Δh^(t)_v} = (Δh^(t)_v / Δz^(t)_v) × θ^(t)_{u,v}. Meanwhile, for GNN-LRP with γ = 0 and R_j = z_j, we note that

$$LRP^{(t)}_{u,v}=\frac{h^{(t-1)}_u \theta^{(t)}_{u,v}}{\sum_{U\in N(V)}\sum_{u} h^{(t-1)}_u \theta^{(t)}_{u,v}}=\frac{h^{(t-1)}_u \theta^{(t)}_{u,v}}{z^{(t)}_v}$$

represents the allocation rule from neuron v to its predecessor neuron u. The contribution of this path is

$$R_p=\sum_{i}\cdots\sum_{p[T-1]} LRP^{(1)}_{i,p[1]}\dots LRP^{(T)}_{p[T-1],j}\,R_j
=\sum_{i}\cdots\sum_{p[T-1]} \frac{h^{(0)}_i \theta^{(1)}_{i,p[1]}}{z^{(1)}_{p[1]}}\dots \frac{h^{(T-1)}_{p[T-1]} \theta^{(T)}_{p[T-1],j}}{z_j}\,z_j
=\sum_{i} h^{(0)}_i \sum_{p[1]}\cdots\sum_{p[T-1]} \frac{h^{(1)}_{p[1]} \theta^{(1)}_{i,p[1]}}{z^{(1)}_{p[1]}}\dots \theta^{(T)}_{p[T-1],j}
=\sum_{i} m_{\Delta h^{(0)}_i \Delta z_j}\,h^{(0)}_i,$$

and since Δh^(0)_i = h^(0)_i when G0 is the empty graph, R_p equals C_{p,j} in Eq. (22), which completes the proof.

Figure 5 illustrates the overall pipeline: G0 (e.g., a citation network) at time t = 0 is updated to G1 at time t = 1 after the edge (J, K) is added and the edge (I, J) is removed, and the logits z_J(G0) = [3, 3, 4] and the predicted class distribution Pr_J(G0) = [0.21, 0.21, 0.58] of node J change accordingly (to z_J(G1) = [3, 2, 5]); prior counterfactual methods attribute the change to the edges (J, K) and (I, J). The GNN computational graph propagates information from the leaves to the root J, and any path containing a dashed edge contributes to the prediction change; we axiomatically attribute the logit changes to these paths with contributions C_{p,j} (for the pth path to the component Δz_j). Not all paths are significant contributors, and we formulate a convex program to uniquely identify a few paths that maximally approximate the changes, e.g., removing the removed path J → I → J and adding the added paths K → J → J and K → K → J. After obtaining E_n, we compute KL+ = KL(Pr_J(G0) || Pr_J(¬G(s))) and KL- = KL(Pr_J(G1) || Pr_J(G(s))).
Other situations, including edge deletion, mixture of addition and deletion, and link prediction can be reduced to this simple case.
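The attribution in Eqs. (20)–(22) amounts to a chain of matrix products along the path, with one Rescale factor per layer. The following self-contained sketch (hypothetical shapes and names are ours, using the row-vector convention z_v = Σ_u h_u θ_{u,v}) shows the computation for a single path:

```python
import numpy as np

def path_contribution(dh0, thetas, dh_layers, dz_layers):
    """Sketch of Eqs. (20)-(22): contribution of one path to Δz_j.

    dh0       : Δh^(0) of the leaf node's neurons, shape (d0,)
    thetas    : per-layer weight matrices; thetas[t][u, v] = θ^(t+1)_{u,v};
                the last entry is the column θ^(T)_{:,j} mapping to logit j
    dh_layers : Δh^(t) of the path node at layers t = 1 .. T-1
    dz_layers : Δz^(t) of the path node at layers t = 1 .. T-1
    """
    m = np.eye(len(dh0))                  # running multiplier matrix
    for theta, dh, dz in zip(thetas[:-1], dh_layers, dz_layers):
        # Eq. (20): m = (Δh_v / Δz_v) * θ_{u,v}; the matrix product
        # realizes the sums over intermediate neurons p[1] .. p[T-1]
        # in Eq. (21). The where() guards against Δz = 0.
        rescale = np.where(dz != 0, dh / dz, 0.0)
        m = m @ (theta * rescale[None, :])
    m = m @ thetas[-1]                    # final linear layer to the logit z_j
    return float(dh0 @ m)                 # Eq. (22): C_{p,j} = Σ_{p[0]} m Δh^(0)
```

A full implementation would also exploit the multiplier sharing noted above: paths that coincide after layer t reuse the same suffix of the product.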

A.8 FURTHER EXPERIMENTAL RESULTS

We analyzed how the method performs across the spectrum of KL(Pr_J(G1) || Pr_J(G0)) for the YelpChi, YelpZip, UCI, BC-OTC, and MUTAG datasets when edges are added and removed (see Figure 16). For some nodes with lower KL(Pr_J(G1) || Pr_J(G0)), the KL+ or KL- is higher. Further analysis shows that this may be because Pr_J(G1) or Pr_J(G0) has all probability mass concentrated at one class (see Figure 17). For target nodes/edges/graphs whose classification probability in G1 or G0 is close to 1, the KL+ or KL- is high, meaning the selected paths may not explain the change of the probability distribution well. When the classification probability is close to 1 in G0 (G1), it is more difficult to select a few paths that bring the probability distribution close to that of G1 (G0), so the KL+ or KL- is high.



A computational model that outputs Pr(Y|G(s)) does not necessarily correspond to a concrete input graph G(s); we use the notation Pr(Y|G(s)) and G(s) for convenience only. In Appendix Section A.3, we discuss the cases of link prediction and graph classification.
http://snap.stanford.edu/data/soc-sign-bitcoin-otc.html
http://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html
http://konect.cc/networks/opsahl-ucsocial
AxiomPath-Convex has performance on Cora similar to that in Figure 2.



Figure 1: G0 at time s = 0 is updated to G1 at time s = 1 after the edge (J, K) is added, and the predicted class distribution Pr(Y|G0) of node J changes accordingly. The contributions of each path p on a computation graph to Pr(Y = j|G) for class j give the coordinates of Pr(Y|G) in a high-dimensional Euclidean space, with axes indexed by (p, j). Pr(Y|G) varies smoothly on a low-dimensional manifold, where multiple curves γ(s) can explain the evolution from Pr(Y|G0) to Pr(Y|G1) at a very fine granularity. We select a γ(s) that uses a sparse set of axes to explain the prediction evolution. Edge deletion, a mixture of addition and deletion, link prediction, and graph classification are handled similarly.

we obtain D_KL(Pr(Y|G1) || Pr(Y|G(s))). As s → 1, the curve Pr(Y|G(s)) enters a neighborhood of Pr(Y|G1) on the manifold M to approximate Pr(Y|G1), and D_KL(Pr(Y|G1) || Pr(Y|G(s))) → 0, so that the curve γ(s) parameterized by x(s) smoothly mimics the movement from Pr(Y|G0) to Pr(Y|G1), at least locally in the neighborhood of Pr(Y|G1).

x(s) values constitute a curve γ(s) that explains the change from Pr(Y|G0) to Pr(Y|G1) as γ(s) approaches Pr(Y|G1). Concerning the Riemannian metric in Eq. (9), the above optimization does not change the Riemannian metric I(vec(C_J(G1))) at Pr(Y|G1), since the objective function is based on the KL-divergence of distributions generated by the nonlinear softmax mapping, while C_{:j}(x(s)) varies in the extrinsic coordinate system with x(s).

Figure 2: Performance in KL + as G0 → G1 on the node classification tasks. Each column is a dataset and each row is one type of evolution.

The differential geometry of probability distributions is explored in the field called "information geometry" Amari (2016), which has been applied to optimization Chen et al. (2020); Osawa et al. (2019); Kunstner et al. (2019); Seroussi & Zeitouni (2022); Soen & Sun (2021), machine learning Lebanon (2002); Karakida et al. (2020); Nock et al. (2017); Bernstein et al. (2020), and computer vision Shao et al. (

Figure 3: Average KL + on the link prediction and graph classification tasks. Each row is a dataset and each column is one evolution setting.

8: else
9: Compute C_{p,j} according to Eq. (22) as the contribution of p as G0 → G1.
10: end if
11: end for
12: Output: the contribution matrix C.

of layer t is d_t (t = 1, 2, ..., T) and the dimension of the input feature vector is d. The time complexity of determining the contribution of each path is O(Σ_{t=1}^T d_t × d). In the calculation, according to Eq. (20), we can obtain the multiplier m_{Δh^(t-1)_u Δh^(t)_v}.

• YelpChi, YelpNYC, and YelpZip Rayana & Akoglu (2015): each node represents a review, product, or user. If a user posts a review of a product, there are edges between the user and the review, and between the review and the product. These datasets are used for node classification.
• Pheme Zubiaga et al. (2017) and Weibo Ma et al. (2018): collected from Twitter and Weibo. A social event is represented as a trace of information propagation, and each event has a label, rumor or non-rumor. We consider the propagation tree of each event as a graph. These datasets are used for node classification.
• BC-OTC and BC-Alpha: each is a who-trusts-whom network of Bitcoin users trading on the platform. These datasets are used for link prediction.
• UCI: an online community of students from the University of California, Irvine, where the links of this social network indicate messages sent between users. The dataset is used for link prediction.
• MUTAG Morris et al. (2020): a molecule is represented as a graph of atoms where an edge represents two bonded atoms.

Figure 5: Top left: G0 (e.g., a citation network) at time t = 0 is updated to G1 at time t = 1 after the edge (J, K) is added and the edge (I, J) is removed.

Figure 10: Decomposition of running time of AxiomPath-Convex.

We derive the Fisher information matrix I of Pr(Y|G) with respect to the path coordinates. Since the KL-divergence between two sufficiently close distributions Pr(Y|G0) and Pr(Y|G1) can be approximated by a quadratic function with the Fisher information matrix, Pr(Y|G) does not necessarily evolve linearly in the extrinsic coordinates but adapts to the curved intrinsic geometry of the manifold around Pr(Y|G). We explain the GNN responses to evolving graphs on 8 graph datasets with node classification, link prediction, and graph classification tasks, with edge additions and/or deletions.
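The quadratic approximation mentioned above is the standard second-order expansion of the KL-divergence; written in generic coordinates ξ of the manifold (our notation for this sketch):

```latex
D_{\mathrm{KL}}\!\left(\Pr(Y\mid\xi)\,\|\,\Pr(Y\mid\xi+\Delta\xi)\right)
\;\approx\; \tfrac{1}{2}\,\Delta\xi^{\top} I(\xi)\,\Delta\xi,
\qquad
I_{ab}(\xi)=\mathbb{E}_{Y\sim\Pr(Y\mid\xi)}\!\left[
\frac{\partial \log \Pr(Y\mid\xi)}{\partial \xi_a}\,
\frac{\partial \log \Pr(Y\mid\xi)}{\partial \xi_b}\right].
```

The first-order term vanishes because the expected score is zero, which is why the Fisher information matrix alone governs the local geometry.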

Symbols and their meanings

there is an evolution from Pr(Y|G0) to Pr(Y|G1). Differential geometry. An n-dimensional manifold M is a set of points, each of which can be associated with a local n-dimensional Euclidean tangent space. The manifold can be embedded in a global Euclidean space R^m, so that each point can be assigned m global coordinates. A smooth curve on M is a smooth function γ: [0, 1] → M. A two-dimensional manifold embedded in R^3 with two curves is shown in Figure 1.

• GNNExplainer Ying et al. (2019) is designed to explain GNN predictions for node and graph classification on static graphs. It weights edges on G1 to maximally preserve Pr(Y|G1), regardless of Pr(Y|G0). Paths are weighted and selected as for Grad, with edge weights calculated using GNNExplainer.
• GNN-LRP adopts the back-propagation attribution method LRP to GNN Schnake et al. (2020). It attributes the class probability Pr(Y = j|G1) to input neurons, regardless of Pr(Y|G0). It assigns an importance score to paths, and the top paths are put in E_n.
• DeepLIFT Shrikumar et al. (2017) can attribute the log-odds between two probabilities Pr(Y = j|G0) and Pr(Y = j'|G1), where j ≠ j'. For a target node, edge, or graph, if the predicted class changes, the difference between a path's contributions to the new and original predicted classes is used to rank and select paths. If the predicted class remains the same but the distribution changes, a path's contribution to the same predicted class is used. Only paths from ΔW_J(G0, G1) are considered.

ACKNOWLEDGEMENTS

Sihong was supported in part by the National Science Foundation under NSF Grants IIS-1909879, CNS-1931042, IIS-2008155, and IIS-2145922. Yazheng Liu and Xi Zhang are supported by the Natural Science Foundation of China (No. 61976026) and the 111 Project (B18008). Any opinions, findings, conclusions, or recommendations expressed in this document are those of the author(s) and should not be interpreted as the views of the U.S. Government.


[Figure residue: panels plot KL+ or KL- against explanation complexity levels for each dataset (Weibo, Pheme, YelpChi, YelpNYC, YelpZip, UCI, BC-OTC, BC-Alpha) under the add-edges, remove-edges, and add-and-remove-edges settings, for a pair of snapshots; legends compare DeepLIFT, AxiomPath-Convex, AxiomPath-Topk, AxiomPath-Linear, Grad, GNN-LRP, and GNNExp.]

A.5.2 EXPERIMENTAL SETUP

We train a two-layer GNN and choose element-wise sum as the f_AGG function. The logit for node J is denoted by z_J(G). For node classification, z_J(G) is mapped to the class distribution through the softmax (number of classes c > 2) or sigmoid (c = 2) function. For link prediction, we concatenate z_I(G) and z_J(G) as the input to a linear layer to obtain the logits, which are then mapped to the probability that the edge (I, J) exists using the sigmoid function. For graph classification, the average pooling of z_J(G) over all nodes of G is used to obtain a single vector representation z(G) of G, which is mapped to the class probability distribution through the sigmoid or softmax function. We set the learning rate to 0.01, the dropout to 0.2, and the hidden size to 16 when training the GNN model. The model is trained and then fixed during the prediction and explanation stages. A node, edge, or graph is selected as a target if KL(Pr_J(G1) || Pr_J(G0)) > threshold, where threshold = 0.001. For the MUTAG dataset, we randomly add or delete five edges to obtain G1. For the other datasets, we use t_initial and t_end to obtain a pair of graph snapshots: we take the graph containing all edges from t_initial to t_end, and two consecutive graph snapshots are considered as G0 and G1. For the Weibo and Pheme datasets, according to the timestamps of the edges, we divide the edges of each event into three equal parts. On YelpZip (both) and UCI, we convert time to weeks since 2004. On the BC-OTC and BC-Alpha datasets, we convert time to months since 2010. On the other Yelp datasets, we convert time to months since 2004. See Table 2 for details. To show that as n increases, Pr_J(G_n) gradually approaches Pr_J(G1), we let n gradually increase, choosing n according to the number of altered paths. See Table 3 for details.
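The architecture and the three task heads described above can be sketched as a minimal numpy mock-up (not the authors' implementation; training, dropout, and the hidden size of 16 are omitted, and all names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def two_layer_gnn(A, X, W1, W2):
    """Two-layer GNN with element-wise sum aggregation (f_AGG = sum).

    A  : (N, N) adjacency with self-loops (A[v, u] = 1 if u messages v)
    X  : (N, d) node features; W1 : (d, h), W2 : (h, c) parameters
    Returns per-node logits z_J(G), shape (N, c).
    """
    H1 = relu(A @ X @ W1)   # layer 1: sum neighbor messages, transform, ReLU
    return A @ H1 @ W2      # layer 2: per-node logits

def node_probs(Z):
    return softmax(Z)                        # node classification head

def link_prob(Z, i, j, w_lp, b_lp=0.0):
    x = np.concatenate([Z[i], Z[j]])         # concatenate z_I and z_J
    return 1.0 / (1.0 + np.exp(-(x @ w_lp + b_lp)))  # linear layer + sigmoid

def graph_probs(Z):
    return softmax(Z.mean(axis=0))           # average pooling over nodes
```

For c = 2 the paper uses a sigmoid head for node and graph classification; the softmax shown here is equivalent up to parameterization.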

A.5.3 QUANTITATIVE EVALUATION METRICS

We illustrate the calculation process of our method in Figure 5 .
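For reference, the two divergences used throughout the evaluation can be computed directly from the predicted distributions. A short sketch (the distributions below are hypothetical placeholders, except Pr_J(G0), which is taken from the Figure 5 example):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL(p || q) for discrete distributions, clipped for numerical stability
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# KL+ compares Pr_J(G0) against the prediction after removing the selected
# paths from G1 (denoted ¬G(s)); KL- compares Pr_J(G1) against the
# prediction using only the selected paths (denoted G(s)).
pr_G0   = [0.21, 0.21, 0.58]   # from the Figure 5 example
pr_notG = [0.25, 0.20, 0.55]   # hypothetical Pr_J(¬G(s))
kl_plus = kl(pr_G0, pr_notG)   # lower is better: selected paths explain more
```

Note that the KL-divergence is asymmetric, which is why KL+ and KL- fix the reference distributions as Pr_J(G0) and Pr_J(G1), respectively.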

A.5.4 EXPERIMENTAL RESULT

See Figure 6 for the results on KL+ on the YelpNYC, YelpZip, and BC-OTC datasets. See Figure 7, Figure 8, and Figure 9 for the results on KL- on all datasets. The method AxiomPath-Convex is significantly better than the runner-up method.

Published as a conference paper at ICLR 2023

A.6 SCALABILITY

Running time overhead optimization. We plot the base running time for searching paths in ΔW(G0, G1) (or ΔW_I(G0, G1) ∪ ΔW_J(G0, G1)) and attribution vs. the running time of the convex optimization. In Figure 10, we see that in the two top cases, the larger ΔW_J(G0, G1) (or ΔW_I(G0, G1) ∪ ΔW_J(G0, G1)) leads to a higher cost in the optimization step compared to path search and attribution. In the lower two cases, the graphs are less regular, and the search and attribution can account for the majority of the computation time. The overall absolute running time is acceptable. In practice, one can design incremental path search for different graph topologies, and more specialized convex optimization algorithms, to speed up the method. We also plot the running time of the baseline methods (see Figure 11, Figure 12, and Figure 13). The order of running time of the methods is: AxiomPath-Convex, AxiomPath-Linear > DeepLIFT, AxiomPath-Topk > Gradient, GNN-LRP. As for GNNExplainer, it costs more time than AxiomPath-Convex when the graph is small and less time than DeepLIFT when the graph is large. Although the running times of Gradient and GNN-LRP are lower, the Gradient method cannot obtain the contribution of a path; it only obtains the contributions of the edges in the input layer. GNN-LRP, like DeepLIFT, was originally designed to find path contributions to the probability distribution on a static graph, and cannot handle changing graphs.
If the running time of calculating path contributions is a concern, we can use GNN-LRP to obtain the path contributions in G0 and G1 and subtract them to obtain the final contribution values. After obtaining C_{p,j}, we can still use our theory to choose the critical paths to explain the change of the probability distribution. GNN-LRP can thus be a faster replacement for DeepLIFT.

A.7 CASE STUDY

It is necessary to show that AxiomPath-Convex selects salient paths that provide insight into the relationship between the altered paths and the changed predictions. On Cora, we add and/or remove edges randomly, and for the target nodes whose predicted class changed, we calculate the percentages of nodes on the paths selected by AxiomPath-Convex that have the same ground-truth labels as the predicted classes on G0 (class 0) and G1 (class 1), respectively. We expect more nodes of class 1 on the added paths and more nodes of class 0 on the removed paths. We conducted 10 random experiments and calculated the means and standard deviations of the ratios. Figure 14 shows that the percentages behave as we expected, which further confirms that the fidelity metric aligns well with the dynamics of the class distributions that contributed to the prediction changes. In Figure 15, on the MUTAG dataset, we demonstrate how the probability of the graph changes as some edges are added or removed. We add or remove edges, creating or destroying rings in the molecule. AxiomPath-Convex can identify the salient paths that justify the probability changes.

