CHARACTERIZING THE INFLUENCE OF GRAPH ELEMENTS

Abstract

Influence functions, a method from robust statistics, measure the change in model parameters, or in functions of the model parameters, caused by the removal or modification of training instances. They are an efficient and useful post-hoc tool for studying the interpretability of machine learning models without the need for expensive model retraining. Recently, graph convolutional networks (GCNs), which operate on graph data, have attracted a great deal of attention. However, there is no prior research on influence functions for GCNs that sheds light on the effects of removing training nodes/edges from an input graph. Since the nodes/edges of a graph are interdependent in GCNs, it is challenging to derive influence functions for GCNs. To fill this gap, we start with the simple graph convolution (SGC) model, which operates on an attributed graph, and formulate an influence function to approximate the change in model parameters when a node or an edge is removed from the graph. Moreover, we theoretically analyze the error bound of the estimated influence of removing an edge. We experimentally validate the accuracy and effectiveness of our influence estimation function. In addition, we show that the influence function of an SGC model can be used to estimate the impact of removing training nodes/edges on the test performance of the SGC without retraining the model. Finally, we demonstrate how to use influence functions to guide adversarial attacks on GCNs effectively.

1. INTRODUCTION

Graph data is pervasive in real-world applications such as online recommendation (Shalaby et al., 2017; Huang et al., 2021; Li et al., 2021), drug discovery (Takigawa & Mamitsuka, 2013; Li et al., 2017), and knowledge management (Rizun, 2019; Wang et al., 2018), to name a few. The growing need to analyze huge amounts of graph data has inspired work that brings deep learning to graphs through Graph Neural Networks (Gori et al., 2005; Scarselli et al., 2005; Li et al., 2016; Hamilton et al., 2017; Xu et al., 2019b; Jiang et al., 2019). Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017; Zhang & Chen, 2018; Fan et al., 2019), the most cited GNN architecture, adopt convolution and message-passing mechanisms. To better understand GCNs from a data-centric perspective, we consider the following question: without model retraining, how can we estimate the changes in the parameters of a GCN when the graph used for learning is perturbed by edge or node removals? This question asks for an estimate of the counterfactual effect on the parameters of a well-trained model when the basic elements of a graph are manipulated, where the ground truth of such an effect would be obtained by model retraining. With a computational tool as the answer, we can efficiently manipulate the edges or nodes of a graph to control the change in the parameters of a trained GCN. Such a solution would enable further extensions such as graph data rectification, improved model generalization, and graph data poisoning attacks through pure data modeling. Yet current methods for training GCNs offer limited interpretability of the interactions between the training graph and the GCN model. More specifically, we fall short of understanding the influence of the input graph elements on both the changes in model parameters and the generalizability of a trained model (Ying et al., 2019; Huang et al., 2022; Yuan et al., 2021; Xu et al., 2019a; Zheng et al., 2021).
In the regime of robust statistics, influence functions (Hampel, 1974; Koh & Liang, 2017) were proposed as an analysis tool to study the counterfactual effect between training data and model performance. For independent and identically distributed (i.i.d.) data, influence functions offer an approximate estimate of the model's change under an infinitesimal perturbation of the training distribution, e.g., a reweighting of some training instances. Unlike i.i.d. data, however, a manipulation of a graph incurs a knock-on effect through GCNs: an edge removal, for example, blocks all message passing that would have gone through that edge, consequently changing the node representations and affecting the final model optimization. Introducing influence functions to graph data and GCNs is therefore non-trivial and requires extra care. In this work, we aim to derive influence functions for GCNs. As a first attempt in this direction, we focus on Simple Graph Convolution (Wu et al., 2019). Our contributions are three-fold: • We derived influence functions for Simple Graph Convolution. Based on influence functions, we developed computational approaches to estimate the changes in model parameters caused by two basic perturbations: edge removal and node removal. • We derived theoretical error bounds characterizing the gap between the estimated and actual changes in model parameters for both edge and node removal. • We show that our influence analysis on the graph can be used to (1) rectify the training graph to improve model test performance, and (2) guide adversarial attacks on SGC, or conduct grey-box attacks on GCNs via a surrogate SGC. Code is publicly available at https://github.com/Cyrus9721/Characterizing_graph_influence.

2. PRELIMINARIES

In the following sections, we use a lowercase x for a scalar or an entity, an uppercase X for a constant or a set, a bold lowercase x for a vector, and a bold uppercase X for a matrix.

Influence Functions Influence functions (Hampel, 1974) estimate the change in model parameters when the empirical weight distribution of i.i.d. training samples is perturbed infinitesimally. Such estimation is computationally efficient compared with leave-one-out retraining iterated over every training sample. For N training instances x with labels y, consider the empirical risk minimization (ERM)

θ̂ = arg min_{θ∈Θ} (1/N) Σ_{x,y} ℓ(x, y) + (λ/2)∥θ∥²₂

for some loss function ℓ(·, ·) of a model parameterized by θ, with a regularization term. When down-weighting a training sample (x_i, y_i) by an infinitesimally small fraction ε, the substitute ERM can be expressed as

θ̂(x_i; -ε) = arg min_{θ∈Θ} (1/N) Σ_{x,y} ℓ(x, y) - ε ℓ(x_i, y_i) + (λ/2)∥θ∥²₂.

Influence functions estimate the actual change I*(x_i; -ε) = θ̂(x_i; -ε) - θ̂ for a strictly convex and twice-differentiable ℓ(·, ·):

I(x_i; -ε) = lim_{ε→0} θ̂(x_i; -ε) - θ̂ = -H_θ̂^{-1} ∇_θ ℓ(x_i, y_i),   (1)

where H_θ̂ := (1/N) Σ_{i=1}^N ∇²_θ ℓ(x_i, y_i) + λI is the Hessian matrix with regularization at parameter θ̂. For a differentiable model evaluation function f : Θ → ℝ, such as the total model loss over a test set, the change in the evaluation result caused by down-weighting (x_i, y_i) by ε can be approximated by ∇_θ f(θ̂) H_θ̂^{-1} ∇_θ ℓ(x_i, y_i). When the size N of the training data is large, by setting ε = 1/N we can approximate the change of θ̂ incurred by removing an entire training sample, I(x_i; -1/N) = I(-x_i), via linear extrapolation 1/N → 0. In terms of the estimated influence I, removing a training sample has the opposite value of adding the same training sample: I(-x_i) = -I(+x_i).
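For i.i.d. data, this kind of estimate is easy to check numerically. Below is a minimal, self-contained sketch (our illustration, not the paper's code) for ℓ₂-regularized binary logistic regression: the first-order estimate of the parameter change from removing one sample, H_θ̂^{-1}∇_θℓ(x_i, y_i)/N, is compared against actual retraining with that sample's weight zeroed out. (Sign note: in this sketch, down-weighting a sample moves the parameters along +H^{-1}∇ℓ; the commonly quoted up-weighting influence -H^{-1}∇ℓ is its negative.)

```python
import numpy as np

def fit(X, y, w, lam, iters=50):
    """Newton's method for (1/N) * sum_k w_k * logloss_k + (lam/2)||theta||^2."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (w * (p - y)) / N + lam * theta
        H = (X * (w * p * (1 - p))[:, None]).T @ X / N + lam * np.eye(d)
        theta = theta - np.linalg.solve(H, grad)
    return theta

def removal_influence(X, y, theta, lam, i):
    """First-order estimate of theta's change when sample i is down-weighted to 0."""
    N, d = X.shape
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    H = (X * (p * (1 - p))[:, None]).T @ X / N + lam * np.eye(d)
    grad_i = X[i] * (p[i] - y[i])          # gradient of sample i's loss at theta
    return np.linalg.solve(H, grad_i) / N  # epsilon = 1/N

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
w = np.ones(200)
theta = fit(X, y, w, lam=0.1)
est = removal_influence(X, y, theta, lam=0.1, i=7)
w[7] = 0.0                                 # actually remove sample 7 and retrain
actual = fit(X, y, w, lam=0.1) - theta
```

Because the loss is strictly convex, the first-order estimate and the retrained change typically agree to within a few percent for moderate N.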
In our work, we shall assume additivity of influence functions when several samples are removed, e.g., for two samples: I(-x_i, -x_j) = I(-x_i) + I(-x_j). Though efficient, influence functions on non-convex models suffer from estimation errors due to varying local minima, and they usually require a computational approximation to H_θ̂^{-1} when the Hessian matrix is non-invertible. To introduce influence functions from i.i.d. data to graphs and precisely characterize the influence of graph elements on the changes of model parameters, we consider a convex model from the GCN family called Simple Graph Convolution.

Simple Graph Convolution By removing the non-linear activations between the layers of a typical Graph Convolutional Network, Simple Graph Convolution (SGC) (Wu et al., 2019) formulates a linear simplification of GCNs with competitive performance on various tasks (He et al., 2020; Rakhimberdina & Murata, 2019). Let G = (V, E) denote an undirected attributed graph, where V = {v} contains the vertices with corresponding features X ∈ ℝ^{|V|×D} (D is the feature dimension), and E = {e_ij}_{1≤i<j≤|V|} is the set of edges. Let Γ_v denote the set of neighboring nodes of v, and d_v the degree of v. We use A to denote the adjacency matrix, where A_ij = A_ji = 1 if e_ij ∈ E and 0 elsewhere, and D = diag(d_v) denotes the degree matrix. When the context is clear, we simplify the notation Γ_{v_i} → Γ_i, and likewise for other symbols. For multi-layer GNNs, let z_v^{(k)} denote the hidden representation of node v in the k-th layer, with z_v^{(0)} = x_v the initial node features. Simple Graph Convolution computes node representations as

z_v^{(k)} = W^{(k)} Σ_{u∈Γ_v∪{v}} d_u^{-1/2} d_v^{-1/2} z_u^{(k-1)} + b^{(k)},

where W^{(k)} and b^{(k)} are the trainable parameters of the k-th layer. In transductive node classification, let V_train ⊂ V denote the set of N training nodes associated with labels y. The ERM of SGC for this task is

θ̂ = arg min_{θ∈Θ} (1/N) Σ_{v∈V_train} ℓ(z_v^{(k)}, y_v) + (λ/2)∥θ∥²₂.

Due to the linearity of SGC, the parameters W^{(k)} and b^{(k)} of each layer can be unified, and the prediction after k layers simplifies to ŷ = arg max (D̃^{-1/2} Ã D̃^{-1/2})^k X W + b, with Ã = A + I and D̃ the degree matrix of Ã. Therefore, for node representations Z^{(k)} = (D̃^{-1/2} Ã D̃^{-1/2})^k X with labels y and the cross-entropy loss, ℓ(·, ·) is convex. The parameters θ of ℓ consist of a matrix W ∈ ℝ^{D×|Class|} and a vector b ∈ ℝ^{|Class|}, with |Class| the number of classes, and can be solved for via logistic regression.
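The SGC feature map Z = (D̃^{-1/2} Ã D̃^{-1/2})^k X is a few lines of numpy. Here is an illustrative sketch (ours, not the authors' code) on a toy path graph; with k = 2, features propagate exactly two hops:

```python
import numpy as np

def sgc_features(A, X, k=2):
    """Z = (D~^{-1/2} A~ D~^{-1/2})^k X with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])           # A~ = A + I
    D_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # symmetric normalized adjacency
    return np.linalg.matrix_power(S, k) @ X

# Path graph 0-1-2-3 with a scalar feature on node 0 only.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
X = np.array([[1.0], [0.0], [0.0], [0.0]])
Z = sgc_features(A, X, k=2)
# Node 2 (two hops from node 0) receives signal; node 3 (three hops) does not.
```

After the propagation, fitting a (multinomial) logistic regression on Z recovers the unified parameters W and b described above.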

Additional Notations

In what follows, we shall build our influence analysis upon SGC. For notational simplification, we omit (k) in Z (k) and use Z to denote the last-layer node representations from SGC. We use I * (-e ij ) = θ(-e ij ) -θ to denote the actual model parameters' change where θ(-e ij ) is obtained through ERM when e ij is removed from E. Likewise, I * (-v i ) denotes the change from v i 's removal from graph G. I(-e ij ) and I(-v i ) are the corresponding estimated influence for I * (-e ij ) and I * (-v i ) based on influence functions, respectively.

3. MODELING THE INFLUENCE OF ELEMENTS IN GRAPHS

We mainly consider the use of influence functions for two fundamental operations on an attributed graph: removing an edge (Section 3.1) and removing an entire node (Section 3.2).

3.1. INFLUENCE OF EDGE REMOVAL

With message passing through edges in graph convolution, removing an edge incurs representational changes in Z. When e_ij is removed, the changes come from two sources: (1) the message passing of node features via the removed edge is blocked, so the representations of all k-hop neighboring nodes of the removed edge are affected; (2) due to the normalization over A, the degrees of the two endpoints change, so the adjacent edges e_ik, ∀k ∈ Γ_i, and e_jk, ∀k ∈ Γ_j, obtain larger values in D̃^{-1/2} Ã D̃^{-1/2}. The following expression describes the change ∆(-e_ij) in the node representations Z of SGC incurred by removing e_ij.

∆(-e_ij) = [(D̃(-e_ij)^{-1/2} Ã(-e_ij) D̃(-e_ij)^{-1/2})^k - (D̃^{-1/2} Ã D̃^{-1/2})^k] X.   (2)

Here A(-e_ij) is the modified adjacency matrix with A(-e_ij)_{ij} = A(-e_ij)_{ji} = 0 and A(-e_ij) = A elsewhere; Ã(-e_ij) = A(-e_ij) + I, and D̃(-e_ij) is the degree matrix of Ã(-e_ij). Given ∆(-e_ij), we can access the change at every node. Let δ_k(-e_ij) denote the k-th row of ∆(-e_ij): δ_k = 0 implies no change at the k-th node from removing e_ij, and δ_k ≠ 0 indicates a change in z_k. We proceed to use influence functions to characterize the counterfactual effect of removing e_ij. Our high-level idea is that, from the influence-function perspective, a representational change z → z + δ at a node is equivalent to removing a training instance with feature z and adding a new training instance with feature z + δ and the same label. The problem thus reduces to the instance-reweighting problem addressed by influence functions. In this case, we have the following lemma establishing the linearity of influence functions.

Lemma 3.1. Consider the empirical risk minimizations θ̂ = arg min_{θ∈Θ} Σ_i ℓ(x_i, y_i) and θ̂(x_j → x_j + δ) = arg min_{θ∈Θ} Σ_{i≠j} ℓ(x_i, y_i) + ℓ(x_j + δ, y_j) for some twice-differentiable and strictly convex ℓ, and let I*(x_j → x_j + δ) = θ̂(x_j → x_j + δ) - θ̂. The estimated influence satisfies linearity:

I(x_j → x_j + δ) = I(-x_j) + I(+(x_j + δ)).   (3)

With Lemma 3.1, we are ready to derive a proposition characterizing edge removal.

Proposition 3.2. Let δ_k(-e_ij) denote the k-th row of ∆(-e_ij). The influence of removing an edge e_ij ∈ E from graph G can be estimated by:

I(-e_ij) = I(z → z + δ(-e_ij)) = Σ_k [I(+(z_k + δ_k(-e_ij))) + I(-z_k)] = -H_θ̂^{-1} Σ_{v_k∈V_train} (∇_θ ℓ(z_k + δ_k(-e_ij), y_k) - ∇_θ ℓ(z_k, y_k)).   (4)

Proof. The second equality comes from Lemma 3.1, and the third equality comes from Equation (1). Noting that removing two representations satisfies I(-z_i, -z_j) = I(-z_i) + I(-z_j) completes the proof.
Proposition 3.2 offers an approach to computing the estimated influence of removing e_ij. In practice, once the inverse Hessian matrix is available, a removal only requires computing the updated gradients ∇_θ ℓ(z_k + δ_k(-e_ij), y_k) and the original gradients ∇_θ ℓ(z_k, y_k) for all affected nodes within the (k+1)-hop neighborhood.
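Proposition 3.2 can be sketched end-to-end on a toy graph (illustrative code, not the authors' implementation; we use a binary logistic loss in place of the multiclass cross-entropy and keep the 1/N loss normalization explicit in the estimate):

```python
import numpy as np

def sgc(A, X, k=2):
    """Z = (D~^{-1/2} A~ D~^{-1/2})^k X."""
    A_t = A + np.eye(len(A))
    Dm = np.diag(A_t.sum(1) ** -0.5)
    return np.linalg.matrix_power(Dm @ A_t @ Dm, k) @ X

def fit(Z, y, lam=0.1, iters=50):
    """Newton's method for L2-regularized logistic regression on embeddings Z."""
    N, d = Z.shape
    th = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Z @ th)))
        g = Z.T @ (p - y) / N + lam * th
        H = (Z * (p * (1 - p))[:, None]).T @ Z / N + lam * np.eye(d)
        th = th - np.linalg.solve(H, g)
    return th

def edge_influence(A, X, y, i, j, lam=0.1, k=2):
    """I(-e_ij) = -H^{-1} * (1/N) * sum over nodes of grad l(z', y) - grad l(z, y)."""
    Z = sgc(A, X, k)
    th = fit(Z, y, lam)
    A2 = A.copy()
    A2[i, j] = A2[j, i] = 0.0              # delete edge e_ij
    Z2 = sgc(A2, X, k)                     # Z + Delta(-e_ij)
    N, d = Z.shape
    p = 1.0 / (1.0 + np.exp(-(Z @ th)))
    p2 = 1.0 / (1.0 + np.exp(-(Z2 @ th)))
    dg = (Z2.T @ (p2 - y) - Z.T @ (p - y)) / N   # changed gradients, affected nodes only
    H = (Z * (p * (1 - p))[:, None]).T @ Z / N + lam * np.eye(d)
    return -np.linalg.solve(H, dg), th, Z2

rng = np.random.default_rng(1)
n = 30
A = np.triu((rng.random((n, n)) < 0.15).astype(float), 1)
A = A + A.T
i, j = np.argwhere(np.triu(A, 1))[0]       # pick an existing edge
X = rng.normal(size=(n, 2))
y = (X[:, 0] > 0).astype(float)
est, th, Z2 = edge_influence(A, X, y, i, j)
actual = fit(Z2, y) - th                   # retrain on the perturbed embeddings
```

Only the rows of Z that actually change contribute to `dg`, which is what makes the estimate cheap once H^{-1} is available.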

3.2. INFLUENCE OF NODE REMOVAL

We now address the case of node removal. The impact of removing a node v_i from graph G on the parameters' change is two-fold: (1) the loss term ℓ(x_i, y_i) is no longer involved in the ERM if v_i ∈ V_train; (2) all edges linked to this node, {e_ij}, ∀j ∈ Γ_i, are removed as well. The first aspect can be treated as a regular training-instance removal, similar to the i.i.d. case, and the second aspect can be treated as an incremental extension of the edge removal in Proposition 3.2. The representational change from removing node v_i can be expressed as:

∆(-v_i) = [(D̃(-v_i)^{-1/2} Ã(-v_i) D̃(-v_i)^{-1/2})^k - (D̃^{-1/2} Ã D̃^{-1/2})^k] X,

with A(-v_i)_{jk} = A(-v_i)_{kj} = A_{jk} for all j, k ≠ i, and A(-v_i) = 0 elsewhere. Similarly, Ã(-v_i) = A(-v_i) + I and D̃(-v_i) is the degree matrix of Ã(-v_i). Given ∆(-v_i), Lemma 3.1, and Proposition 3.2, we can state the estimated influence of removing v_i.

Proposition 3.3. Let δ_j(-v_i) denote the j-th row of ∆(-v_i). The influence of removing node v_i from graph G can be estimated by:

I(-v_i) = I(-z_i) + I(z → z + δ(-v_i)) = I(-z_i) + Σ_j [I(+(z_j + δ_j(-v_i))) + I(-z_j)] = -1_{v_i∈V_train} · H_θ̂^{-1} ∇_θ ℓ(z_i, y_i) - H_θ̂^{-1} Σ_{v_j∈V_train} (∇_θ ℓ(z_j + δ_j(-v_i), y_j) - ∇_θ ℓ(z_j, y_j)),

where 1 is the indicator function.
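The locality of ∆(-v_i) can be checked directly: with k = 2 on a path graph, deleting a node perturbs only the rows of Z near the deleted node and leaves distant nodes untouched. A small sketch (ours, with an arbitrary toy feature vector):

```python
import numpy as np

def sgc(A, X, k=2):
    """Z = (D~^{-1/2} A~ D~^{-1/2})^k X."""
    A_t = A + np.eye(len(A))
    Dm = np.diag(A_t.sum(1) ** -0.5)
    return np.linalg.matrix_power(Dm @ A_t @ Dm, k) @ X

# Path graph 0-1-2-3-4-5-6.
n = 7
A = np.zeros((n, n))
for a in range(n - 1):
    A[a, a + 1] = A[a + 1, a] = 1.0
X = np.arange(1, n + 1, dtype=float).reshape(-1, 1)

A_del = A.copy()
A_del[0, :] = A_del[:, 0] = 0.0    # remove v_0: zero out its row and column
Delta = sgc(A_del, X) - sgc(A, X)  # Delta(-v_0) with k = 2

affected = [m for m in range(n) if np.abs(Delta[m]).max() > 1e-12]
# Only node 0, its former neighbor 1, and the 2-hop reach of the changed
# normalization (nodes 2 and 3) are perturbed; nodes 4-6 are untouched.
```

The non-zero rows of ∆(-v_i) are exactly the terms that enter the sum in Proposition 3.3.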

4. THEORETICAL ERROR BOUNDS

In the above section, we showed how to estimate the changes in model parameters due to edge removal, θ̂ → θ̂(-e_ij), and node removal, θ̂ → θ̂(-v_i). In this section, we study the error between the estimated influence I given by influence functions and the actual influence I* obtained by model retraining. We give upper bounds on the error for edge removal, ∥I*(-e_ij) - I(-e_ij)∥₂ (see Theorem 4.1), and node removal, ∥I*(-v_i) - I(-v_i)∥₂ (see Corollary A.1). In what follows, we shall assume the second derivative of ℓ(·, ·) is Lipschitz continuous at θ̂ with constant C, following the convergence theory of Newton's method. To simplify notation, we use z'_i = z_i + δ_i to denote the new representation of v_i after removing an edge or a node, where δ_i is the corresponding row vector of ∆(-e_ij) or ∆(-v_i), depending on the context.

Theorem 4.1. Let σ_min ≥ 0 denote the smallest of all eigenvalues of the Hessian matrices ∇²_θ ℓ(z_i, y_i), ∀v_i ∈ V_train, of the original model θ̂, and let σ'_min ≥ 0 denote the smallest of all eigenvalues of the Hessian matrices ∇²_{θ(-e_ij)} ℓ(z_i, y_i), ∀v_i ∈ V_train, of the retrained model θ̂(-e_ij) with e_ij removed from graph G. Let L denote the set {v : z' ≠ z} of nodes affected by the edge removal, and let Err(-e_ij) = ∥I*(-e_ij) - I(-e_ij)∥₂. Recalling that λ is the ℓ₂ regularization strength, we have the following upper bound on the estimation error of the model parameters' change:

Err(-e_ij) ≤ [N³C / (Nλ + (N - |L|)σ_min + σ'_min|L|)³] · ∥Σ_{v_l∈L} (∇_θ ℓ(z'_l, y_l) - ∇_θ ℓ(z_l, y_l))∥²₂ + [N / (Nλ + (N - |L|)σ_min + min(σ_min, σ'_min)|L|)] · ∥Σ_{v_l∈L} (∇_θ ℓ(z'_l, y_l) - ∇_θ ℓ(z_l, y_l))∥₂.

Proof sketch. We use the one-step Newton approximation (Pregibon, 1981) as an intermediate step in deriving the bound. The first term bounds the difference between the actual change I*(-e_ij) and its Newton approximation, and the second term bounds the difference between the Newton approximation and the estimated influence I(-e_ij).
Combining these two parts yields the bound.

Remark 4.2. We make the following main observations from Theorem 4.1. (1) The estimation error of the influence function is controlled by the ℓ₂ regularization strength within a factor of O(1/λ); a stronger regularization will likely produce a better approximation. (2) The error is controlled by an inherent property of the model: a smoother model, in terms of its Hessian matrix, lowers the upper bound. (3) The upper bound is controlled by the norm of the gradient change from z → z'. Intuitively, if removing e_ij incurs smaller changes in the node representations, the approximation of the actual influence is more accurate; likewise, a smaller Err(-e_ij) is expected if the model is less sensitive to changes in the training samples. (4) There is no significant correlation between the bound and the number of training nodes N; as a special case, if σ_min = σ'_min = 0, the bound is independent of N. We attach an empirical verification of our bound in Appendix D. Similarly to Theorem 4.1, Corollary A.1, presented in Appendix A, derives an upper bound on ∥I*(-v_i) - I(-v_i)∥₂ for removing a node v_i from the graph.

5. EXPERIMENTS

We conducted three major experiments: (1) validating the estimation accuracy of our influence functions on graphs (Section 5.2); (2) using the estimated edge influence to carry out adversarial attacks and graph rectification for increasing model performance (Section 5.3); and (3) using the estimated node influence to carry out adversarial attacks on GCN (Kipf & Welling, 2017) (Section 5.4).

5.1. SETUP

We chose six real-world graph datasets for our experiments: Cora, PubMed, CiteSeer (Sen et al., 2008), Wiki-CS (Mernyei & Cangea, 2020), Amazon Computers, and Amazon Photos (Shchur et al., 2018). Statistics of these datasets are outlined in Table 4 in Appendix B. For the Cora, PubMed, and CiteSeer datasets, we used their public train/val/test splits. For the Wiki-CS dataset, we took a single random train/val/test split provided by Mernyei & Cangea (2020). For the Amazon datasets, we randomly selected 20 nodes per class for training and 30 nodes per class for validation, and used the remaining nodes as the test set. All experiments are conducted in the transductive node-classification setting. We use the last three datasets only for influence validation.

5.2. VALIDATING INFLUENCE FUNCTIONS ON GRAPHS

Validating Estimated Influence We compared the estimated influence of removing a node/edge with the corresponding ground-truth effect. The actual influence is obtained by retraining the model after removing the node/edge and computing the change in the total cross-entropy loss. We also validated the estimated influence of removing node embeddings, i.e., removing ℓ(z_i, y_i) of node v_i from the ERM objective while keeping the embeddings of the other nodes intact. Figure 2 shows that the estimated influence correlates highly with the actual influence (Spearman correlation coefficients range from 0.847 to 0.981). More results are included in Figure 4 in the appendix. Visualization Figure 1 visualizes the estimated influence of edge and node removals on the validation loss for the Cora dataset. The visualization hints at opportunities for improving the test performance of a model, or for attacking a model, by removing nodes/edges with noticeable influence (see the experiments in Sections 5.3 and 5.4).
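The validation metric above, the Spearman rank correlation between estimated and actual influence values, can be computed with numpy alone. A small sketch (ours) with made-up values whose orderings agree:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks (no tie handling)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Made-up estimated vs. actual influence values with identical orderings.
estimated = np.array([0.9, -0.2, 0.1, 0.5, -0.7])
actual = np.array([1.1, -0.1, 0.2, 0.4, -0.9])
rho = spearman(estimated, actual)
```

Because the metric depends only on ranks, it rewards an estimator that orders the removals correctly even when the estimated magnitudes are off by a constant factor.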

5.3. APPLICATIONS OF THE ESTIMATED EDGE INFLUENCE

The estimated influence of edge removals on the validation set can be used either to improve the test performance of SGC or to carry out adversarial attacks on SGC/GCN. Graph Rectification via Edge Removals We begin by investigating the impact of edges with negative influence. According to our influence analysis, removing negative-influence edges from the original graph decreases the validation loss, so the classification accuracy on the test set is expected to increase correspondingly. We sort the edges by their estimated influence in descending order, then cumulatively remove edges starting from the one with the most negative influence. We train the SGC model, tune it on the public-split validation set, and select the number of negative-influence edges to remove by validation accuracy. For a fair comparison, we keep the test set unchanged regardless of the edge removals. The results, derived from Figure 8, are displayed in Table 1, where we also report the performance of several classical and state-of-the-art GNN models on the original graph as references, including GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), FastGCN (Chen et al., 2018), GIN (Xu et al., 2019b), DGI (Velickovic et al., 2019) with a non-linear activation function, and SGC (Wu et al., 2019). We demonstrate that our proposed method can marginally improve the accuracy of SGC from the data perspective alone, without any change to the original model structure of SGC, which validates the impact of edges with negative influence. In addition, the SGC model with the negative-influence edges eliminated can outperform the other GNN-based methods in most cases. Attacking SGC via Edge Removals We investigated how edge removals can be used to deteriorate SGC performance. According to the influence analysis, removing an edge with a positive estimated influence can increase the model loss and decrease the model performance on the validation set.
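The rectification loop described above can be sketched as follows (illustrative code: `val_acc` stands in for the retrain-and-evaluate step of the pipeline, and the influence values in the toy example are made up):

```python
import numpy as np

def rectify(edges, influence, val_acc):
    """Greedily drop negative-influence edges; keep the prefix with the best
    validation accuracy.

    edges:     list of (i, j) pairs
    influence: estimated validation-loss change of removing each edge
               (negative = removal is expected to help)
    val_acc:   callback scoring the graph with the given edges removed
               (in the full pipeline this retrains SGC; stubbed here)
    """
    order = np.argsort(influence)                  # most negative first
    best_removed, best_acc = [], val_acc([])
    removed = []
    for k in order:
        if influence[k] >= 0:
            break                                  # only negative-influence edges
        removed.append(edges[k])
        acc = val_acc(list(removed))
        if acc > best_acc:
            best_removed, best_acc = list(removed), acc
    return best_removed, best_acc

# Toy illustration with made-up influence values and a stub evaluator.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
influence = np.array([-0.5, 0.2, -0.1, 0.3])
val_acc = lambda removed: 0.8 + 0.1 * ((0, 1) in removed) - 0.2 * ((2, 3) in removed)
best_removed, best_acc = rectify(edges, influence, val_acc)
```

Reversing the selection, i.e., removing the edges with the largest positive influence, yields the attack variant discussed next in this section.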
Thus, our attack proceeds as follows. We first compute the estimated influence of all edges, then cumulatively remove the edges with the highest positive influence, one at a time. Each time a new edge is removed, we retrain the model to obtain the current model performance. We remove 100 edges in total in each experiment. The results are presented in Figure 3: the accuracy of SGC on node classification drops significantly. We notice that the influence of edges is approximately power-law distributed, where only a small proportion of edges has a relatively significant influence. The performance worsens as edge removals accumulate, on both the validation and test sets. These empirical results confirm our expectations for edges with a positive estimated influence. Attacking GCN via Surrogate SGC We further explored the impact of removing positive-influence edges in a grey-box adversarial attack setting. Following Zügner & Günnemann (2019), we use SGC as a surrogate model for attacking GCN (Kipf & Welling, 2017) as the victim model, under the assumption that an increase in the loss of SGC can implicitly degrade the performance of GCN. We eliminated positive-influence edges at rates of 1%, 3%, and 5% of all edges. The drop in accuracy was compared against DICE (Zügner et al., 2018), Graph-Poison (Bojchevski & Günnemann, 2019), and MetaAttack (Zügner et al., 2018). For a fair comparison, we restrict the compared methods to perturbing the graph structure via edge removals only. Our results are presented in Table 2. Our attack strategy achieves the best performance in all edge-elimination scenarios; notably, on PubMed with a 1% edge-elimination rate, our attack outperforms the others by over 10% in accuracy drop. Since we directly estimate the impact of edges on the change in model parameters, our attack strategy is more effective at locating the edges to which the victim model is most vulnerable.
These results indicate that our proposed edge influence can guide the construction of grey-box adversarial attacks on graph structures.

5.4. INFLUENCE OF NODE REMOVAL

Attacking GCN via Node Removals In this section, we study the impact of training nodes with a positive influence on transductive node-classification tasks. Again, we assume that eliminating the positive-influence nodes identified via SGC may implicitly harm the GCN model. We sort the nodes by their estimated influence in descending order and cumulatively remove them from the training set. We built two baselines, Random and Degree, to compare the accuracy drop at node-removal ratios of 5%, 10%, and 15%. The Random baseline removes nodes from the training set at random; the Degree baseline removes nodes by degree in descending order. According to Table 3, the GCN performance drops by a large margin on all three citation-network datasets as the selected positive-influence nodes are removed, especially on the Cora dataset, where our method outperforms the baselines by over 20% at a 15% removal ratio. These results indicate that our estimate of node influence can guide adversarial attacks on GCN in the node-removal setting.

6. RELATED WORKS

Influence Functions Recently, increasing effort has been dedicated to investigating influence functions (Koh et al., 2019; Giordano et al., 2019; Ting & Brochu, 2018) in various applications, such as computer vision (Koh & Liang, 2017), natural language processing (Han et al., 2020), tabular data (Wang et al., 2020b), causal inference (Alaa & Van Der Schaar, 2019), data poisoning attacks (Fang et al., 2020; Wang et al., 2020a), and algorithmic fairness (Li & Liu, 2022). In this work, we propose a major extension of influence functions to graph-structured data and systematically study how to estimate the influence of nodes and edges under different editing operations on graphs. We believe our work complements the big picture of influence functions in machine-learning applications. Understanding Graph Data Besides influence functions, there are many other approaches to exploring the underlying patterns in graph data and its elements. Explanation models for graphs (Ying et al., 2019; Huang et al., 2022; Yuan et al., 2021; Bajaj et al., 2021; Abrate & Bonchi, 2021) provide an accessible relationship between a model's predictions and the corresponding elements in graphs or subgraphs; they show how the graph's local structure or node features affect the decisions of GNNs. A major difference is that these approaches tackle model inference with fixed parameters, while we focus on a counterfactual effect and investigate the contribution of the presence of nodes and edges in the training data to the decisions of GNN models in the inference stage.

Adversarial Attacks on Graph

Adversarial attacks on attributed graphs are usually conducted by adding perturbations to the graph structure or node features (Zügner & Günnemann, 2019; Zheng et al., 2021). In addition, Zhang et al. (2020) introduce an adversarial attack setting that flips a small fraction of node labels in the training set, causing a significant drop in model performance. A majority of the attacker models on graph structure (Zügner et al., 2018; Xu et al., 2019a) are constructed from gradient information on both edges and node features and achieve costly but effective attacks; these attacker models rely mainly on greedy methods to find the optimal perturbations of the graph structure. We focus only on the perturbations resulting from eliminating edges, and directly estimate the change in loss in response to the removal, guided by the proposed influence-based approach.

7. CONCLUSIONS

We have developed a novel influence analysis to understand the effects of graph elements on the parameter changes of GCNs without the need to retrain the GCNs. We chose Simple Graph Convolution due to its convexity and its competitive performance relative to non-linear GNNs on a variety of tasks. Our influence functions can be used to approximate the changes in model parameters caused by edge or node removals from an attributed graph. Moreover, we provided theoretical bounds on the estimation error of the edge and node influence on the model parameters. We experimentally validated the accuracy and effectiveness of our influence functions by comparing their estimates with the actual influence obtained by model retraining. We showed in our experiments that our influence functions can reliably identify edges and nodes with negative or positive influence on model performance. Finally, we demonstrated that our influence functions can be applied to graph rectification and model attacks.

A PROOFS

Lemma 3.1. Consider the empirical risk minimizations θ̂ = arg min_{θ∈Θ} Σ_i ℓ(x_i, y_i) and θ̂(x_j → x_j + δ) = arg min_{θ∈Θ} Σ_{i≠j} ℓ(x_i, y_i) + ℓ(x_j + δ, y_j) for some twice-differentiable and strictly convex ℓ, and let I*(x_j → x_j + δ) = θ̂(x_j → x_j + δ) - θ̂. The estimated influence satisfies linearity:

I(x_j → x_j + δ) = I(-x_j) + I(+(x_j + δ)).   (3)

Proof. Notice that the actual model parameters in response to the perturbation can be denoted as:

θ̂(x_j → x_j + δ) := arg min_{θ∈Θ} (1/N) Σ_{k=1}^N ℓ(x_k, y_k) - (1/N) ℓ(x_j, y_j) + (1/N) ℓ(x_j + δ, y_j).

In this case, the actual change in model parameters in response to the perturbation can be represented as I*(x_j → x_j + δ) = θ̂(x_j → x_j + δ) - θ̂. To estimate this change, we start by considering the parameter change from up-weighting ℓ(x_j + δ, y_j) by an infinitesimally small ε and down-weighting ℓ(x_j, y_j) by ε.
By definition, the model parameters under this perturbation, θ̂_ε, can be represented as:

θ̂_ε := arg min_{θ∈Θ} (1/N) Σ_{k=1}^N ℓ(x_k, y_k) - ε ℓ(x_j, y_j) + ε ℓ(x_j + δ, y_j).   (8)

Let ∆θ_ε = θ̂_ε - θ̂ be the change of model parameters due to this modification of the samples' weights in the loss. Since θ̂_ε minimizes the perturbed loss function, taking the derivative gives:

0 = (1/N) Σ_{k=1}^N ∇_{θ̂_ε} ℓ(x_k, y_k) - ε ∇_{θ̂_ε} ℓ(x_j, y_j) + ε ∇_{θ̂_ε} ℓ(x_j + δ, y_j).

Applying a first-order Taylor expansion of θ̂_ε around θ̂ on the right-hand side, we have:

0 = [(1/N) Σ_{k=1}^N ∇_θ̂ ℓ(x_k, y_k) + ε ∇_θ̂ ℓ(x_j + δ, y_j) - ε ∇_θ̂ ℓ(x_j, y_j)] + [(1/N) Σ_{k=1}^N ∇²_θ̂ ℓ(x_k, y_k) + ε ∇²_θ̂ ℓ(x_j + δ, y_j) - ε ∇²_θ̂ ℓ(x_j, y_j)] · ∆θ_ε + o(∥∆θ_ε∥).

Since θ̂ minimizes the unperturbed loss function, (1/N) Σ_{k=1}^N ∇_θ̂ ℓ(x_k, y_k) = 0. Dropping the o(∥∆θ_ε∥) term, we have:

∆θ_ε ≈ -[(1/N) Σ_{k=1}^N ∇²_θ̂ ℓ(x_k, y_k) + ε ∇²_θ̂ ℓ(x_j + δ, y_j) - ε ∇²_θ̂ ℓ(x_j, y_j)]^{-1} · [ε ∇_θ̂ ℓ(x_j + δ, y_j) - ε ∇_θ̂ ℓ(x_j, y_j)].

Taking the derivative of ∆θ_ε with respect to ε and dropping O(ε) terms, we have:

∂∆θ_ε/∂ε = -[(1/N) Σ_{k=1}^N ∇²_θ̂ ℓ(x_k, y_k)]^{-1} (∇_θ̂ ℓ(x_j + δ, y_j) - ∇_θ̂ ℓ(x_j, y_j)) = -H_θ̂^{-1} (∇ℓ(x_j + δ, y_j) - ∇ℓ(x_j, y_j)).   (13)

Published as a conference paper at ICLR 2023

For sufficiently large N, setting ε to 1/N, we can approximate the actual change in model parameters by I(x_j → x_j + δ) = θ̂(x_j → x_j + δ) - θ̂ ≈ θ̂_ε - θ̂. Plugging into Eq. (13), we finish the proof:

I(x_j → x_j + δ) ≈ -H_θ̂^{-1} (∇ℓ(x_j + δ, y_j) - ∇ℓ(x_j, y_j)) = -H_θ̂^{-1} ∇ℓ(x_j + δ, y_j) + H_θ̂^{-1} ∇ℓ(x_j, y_j) = I(+(x_j + δ)) + I(-x_j).   (14)

Proposition 3.3. Let δ_j(-v_i) denote the j-th row of ∆(-v_i). The influence of removing node v_i from graph G can be estimated by:

I(-v_i) = I(-z_i) + I(z → z + δ(-v_i)) = I(-z_i) + Σ_j [I(+(z_j + δ_j(-v_i))) + I(-z_j)] = -1_{v_i∈V_train} · H_θ̂^{-1} ∇_θ ℓ(z_i, y_i) - H_θ̂^{-1} Σ_{v_j∈V_train} (∇_θ ℓ(z_j + δ_j(-v_i), y_j) - ∇_θ ℓ(z_j, y_j)),

where 1 is the indicator function. Proof.
Similarly to the edge removal case, we first calculate the node-representation change incurred by the removal of node v_i for a 2-layer SGC as follows:

∆(-v_i) = [(D̃(-v_i)^{-1/2} Ã(-v_i) D̃(-v_i)^{-1/2})² - (D̃^{-1/2} Ã D̃^{-1/2})²] X.   (15)

This change affects a set of nodes, including node v_i itself and the 2-hop neighborhood of v_i. The set S = {s : s ∈ Γ_i ∪ ⋃_{j∈Γ_i} Γ_j} captures the changed node embeddings in the training set, i.e., δ_s ≠ 0, where ∆(-v_i) = {δ_i}_{i=1}^N in Eq. (15). The change of the model parameters from the removal of node v_i can be characterized by removing the representation of v_i, if v_i is a training sample, together with the node-representation changes over the set S. Thus, we have:

I(-v_i) = -1_{v_i∈V_train} · I(z_i, y_i) + Σ_{s∈S\{v_i}} (I(z'_s, y_s) - I(z_s, y_s)) = -1_{v_i∈V_train} · H_θ̂^{-1} ∇ℓ(z_i, y_i) - H_θ̂^{-1} Σ_{s∈S\{v_i}} (∇ℓ(z'_s, y_s) - ∇ℓ(z_s, y_s)).

This finishes the proof.

Theorem 4.1. Let σ_min ≥ 0 denote the smallest of all eigenvalues of the Hessian matrices ∇²_θ ℓ(z_i, y_i), ∀v_i ∈ V_train, of the original model θ̂, and let σ'_min ≥ 0 denote the smallest of all eigenvalues of the Hessian matrices ∇²_{θ(-e_ij)} ℓ(z_i, y_i), ∀v_i ∈ V_train, of the retrained model θ̂(-e_ij) with e_ij removed from graph G. Let L denote the set {v : z' ≠ z} of nodes affected by the edge removal, and let Err(-e_ij) = ∥I*(-e_ij) - I(-e_ij)∥₂. Recalling that λ is the ℓ₂ regularization strength, we have the following upper bound on the estimation error of the model parameters' change:

Err(-e_ij) ≤ [N³C / (Nλ + (N - |L|)σ_min + σ'_min|L|)³] · ∥Σ_{v_l∈L} (∇_θ ℓ(z'_l, y_l) - ∇_θ ℓ(z_l, y_l))∥²₂ + [N / (Nλ + (N - |L|)σ_min + min(σ_min, σ'_min)|L|)] · ∥Σ_{v_l∈L} (∇_θ ℓ(z'_l, y_l) - ∇_θ ℓ(z_l, y_l))∥₂.

Proof.
In this proof, we use a one-step Newton approximation as an intermediary to estimate the error bound of the change in model parameters, i.e.,
\[
\mathrm{Err}(-e_{ij}) = \big\|I^*(-e_{ij}) - I_{Nt}(-e_{ij}) + I_{Nt}(-e_{ij}) - I(-e_{ij})\big\|,
\]
where $I^*(-e_{ij}) = \Delta\hat\theta_\varepsilon = \hat\theta_\varepsilon - \hat\theta$, and $I_{Nt}(-e_{ij})$ is the one-step Newton approximation with model parameter $\hat\theta_{Nt} = \hat\theta + \Delta\hat\theta_{Nt}$. According to Boyd et al. (2004) (Section 9.5.1), $\Delta\hat\theta_{Nt}$ can be calculated as follows:
\[
\Delta\hat\theta_{Nt} = -\big(H_{\hat\theta} + \lambda I\big)^{-1}\cdot \frac{1}{N}\Big(\sum_{i=1}^{N}\nabla_\theta\ell(z_i,y_i) + \sum_{v_l\in L}\nabla_\theta\ell(z'_l,y_l) - \sum_{v_l\in L}\nabla_\theta\ell(z_l,y_l) + \lambda\hat\theta\Big). \tag{18}
\]
In the following, we calculate the bounds of $\|I^*(-e_{ij}) - I_{Nt}(-e_{ij})\|$ and $\|I_{Nt}(-e_{ij}) - I(-e_{ij})\|$ as two separate steps and then combine them. We define the objective functions before and after the removal of edge $e_{ij}$ as follows:
\[
L_b(\theta) = \sum_{i=1}^{n}\ell(z_i,y_i) + \frac{\lambda}{2}\|\theta\|_2^2, \qquad L_a(\theta) = \sum_{i=1}^{n}\ell(z_i,y_i) + \sum_{v_l\in L}\ell(z'_l,y_l) - \sum_{v_l\in L}\ell(z_l,y_l) + \frac{\lambda}{2}\|\theta\|_2^2.
\]
Step I: Bound of $\|I^*(-e_{ij}) - I_{Nt}(-e_{ij})\|$. Since the SGC model is convex in $\theta$, taking the second derivative of $L_a(\theta)$ gives
\[
\lambda I + \frac{1}{N}\Big(\sum_{i=1}^{N}\nabla^2\ell(z_i,y_i) + \sum_{v_l\in L}\nabla^2\ell(z'_l,y_l) - \sum_{v_l\in L}\nabla^2\ell(z_l,y_l)\Big) \succ 0.
\]
To simplify the above expression, let $\sigma'_{\min}$ and $\sigma'_{\max}$ denote the smallest and largest eigenvalues of $\nabla^2\ell(z'_l,y_l)$, and $\sigma_{\min}$ and $\sigma_{\max}$ the smallest and largest eigenvalues of $\nabla^2\ell(z_l,y_l)$. Then we have
\[
\nabla^2_\theta L_a(\theta) \succeq \Big(\lambda + \frac{(N-|L|)\sigma_{\min} + |L|\sigma'_{\min}}{N}\Big) I \succ 0.
\]
Therefore, the SGC loss function corresponding to the edge removal is strongly convex with parameter $\lambda + \frac{(N-|L|)\sigma_{\min} + |L|\sigma'_{\min}}{N}$.
By this convexity property and the implications of strong convexity (Boyd et al., 2004) (Section 9.1.2), we can bound $\|I^*(-e_{ij}) - I_{Nt}(-e_{ij})\|$ with the first derivative of the SGC loss function as follows:
\[
\|I^*(-e_{ij}) - I_{Nt}(-e_{ij})\| = \|\Delta\hat\theta_\varepsilon - \Delta\hat\theta_{Nt}\|_2 = \|(\Delta\hat\theta_\varepsilon + \hat\theta) - (\Delta\hat\theta_{Nt} + \hat\theta)\|_2 = \|\hat\theta_\varepsilon - \hat\theta_{Nt}\|_2 \le \frac{2N}{N\lambda + (N-|L|)\sigma_{\min} + |L|\sigma'_{\min}}\cdot \Big\|\frac{1}{N}\Big(\sum_{i=1}^{N}\nabla_{\hat\theta_{Nt}}\ell(z_i,y_i) + \sum_{v_l\in L}\nabla_{\hat\theta_{Nt}}\ell(z'_l,y_l) - \sum_{v_l\in L}\nabla_{\hat\theta_{Nt}}\ell(z_l,y_l) + \lambda\hat\theta_{Nt}\Big)\Big\|_2. \tag{22}
\]
Looking closely at the last factor above, we notice it equals the first derivative of $L_a(\theta)$ at $\hat\theta_{Nt}$, i.e.,
\[
\nabla_\theta L_a(\hat\theta_{Nt}) = \frac{1}{N}\Big(\sum_{k=1}^{N}\nabla_{\hat\theta_{Nt}}\ell(z_k,y_k) + \sum_{v_l\in L}\nabla_{\hat\theta_{Nt}}\ell(z'_l,y_l) - \sum_{v_l\in L}\nabla_{\hat\theta_{Nt}}\ell(z_l,y_l) + \lambda\hat\theta_{Nt}\Big). \tag{23}
\]
Therefore, we focus on bounding $\|\nabla_\theta L_a(\hat\theta_{Nt})\|_2$ in the following:
\[
\|\nabla_\theta L_a(\hat\theta_{Nt})\|_2 = \|\nabla_\theta L_a(\hat\theta + \Delta\hat\theta_{Nt})\|_2 = \|\nabla_\theta L_a(\hat\theta + \Delta\hat\theta_{Nt}) - \nabla_\theta L_a(\hat\theta) + \nabla_\theta L_a(\hat\theta)\|_2 = \|\nabla_\theta L_a(\hat\theta + \Delta\hat\theta_{Nt}) - \nabla_\theta L_a(\hat\theta) - \nabla^2_\theta L_a(\hat\theta)\Delta\hat\theta_{Nt}\|_2,
\]
where the last equality holds by the definition of $\Delta\hat\theta_{Nt}$ in Eq. (18). For any continuously differentiable function $f$ and any inputs $a$ and $b$, we have the identity $f(a+b) - f(a) - b f'(a) = \int_0^1 b\,\big(f'(a+bt) - f'(a)\big)\,dt$. Based on this, we can rewrite $\|\nabla_\theta L_a(\hat\theta_{Nt})\|_2$ as follows:
\[
\|\nabla_\theta L_a(\hat\theta_{Nt})\|_2 = \Big\|\int_0^1 \Delta\hat\theta_{Nt}\big(\nabla^2_\theta L_a(\hat\theta + \Delta\hat\theta_{Nt}\cdot t) - \nabla^2_\theta L_a(\hat\theta)\big)\,dt\Big\|_2. \tag{25}
\]
We assume the loss function $\ell$ is twice differentiable and that its second derivative is Lipschitz continuous at $\hat\theta$ with parameter $C$, where $C$ is controlled by the third derivative (curvature) of $\ell$. Thus, we have
\[
\|\nabla^2_\theta\ell(\theta_1) - \nabla^2_\theta\ell(\theta_2)\|_2 \le C\cdot\|\theta_1 - \theta_2\|_2. \tag{26}
\]
Substituting Eq. (26) into Eq. (25), we have
\[
\|\nabla_\theta L_a(\hat\theta_{Nt})\|_2 \le \Big\|N C \Delta\hat\theta_{Nt}\int_0^1 t\,dt\Big\|_2 = \frac{NC}{2}\|\Delta\hat\theta_{Nt}\|_2^2 = \frac{NC}{2}\big\|\nabla^2_\theta L_a(\hat\theta)^{-1}\cdot\nabla_\theta L_a(\hat\theta)\big\|_2^2 \le \frac{NC}{2}\cdot\frac{N^2}{(N\lambda + (N-|L|)\sigma_{\min} + |L|\sigma'_{\min})^2}\cdot\Big\|\sum_{v_l\in L}\big(\nabla_\theta\ell(z'_l,y_l) - \nabla_\theta\ell(z_l,y_l)\big)\Big\|_2^2. \tag{27}
\]
The last inequality above holds by the bound on $\nabla^2_\theta L_a(\hat\theta)^{-1}$ and Eq. (19). Combining Eqs. (22), (23), and (27), we obtain the bound of Step I:
\[
\|I^*(-e_{ij}) - I_{Nt}(-e_{ij})\|_2 \le \frac{N^3 C}{(N\lambda + (N-|L|)\sigma_{\min} + \sigma'_{\min}|L|)^3}\cdot\Big\|\sum_{v_l\in L}\big(\nabla_\theta\ell(z'_l,y_l) - \nabla_\theta\ell(z_l,y_l)\big)\Big\|_2^2. \tag{28}
\]
This finishes Step I.

Step II: Bound of $\|I_{Nt}(-e_{ij}) - I(-e_{ij})\|$. By the definitions of $I_{Nt}(-e_{ij})$ and $I(-e_{ij})$, we have:
\[
I_{Nt}(-e_{ij}) - I(-e_{ij}) = \Bigg[\Big(\lambda I + \frac{1}{N}\Big(\sum_{k=1}^{n}\nabla^2_\theta\ell(z_k,y_k) + \sum_{v_l\in L}\nabla^2_\theta\ell(z'_l,y_l) - \sum_{v_l\in L}\nabla^2_\theta\ell(z_l,y_l)\Big)\Big)^{-1} - \Big(\lambda I + \frac{1}{N}\sum_{k=1}^{n}\nabla^2_\theta\ell(z_k,y_k)\Big)^{-1}\Bigg]\cdot\sum_{v_l\in L}\big(\nabla_\theta\ell(z'_l,y_l) - \nabla_\theta\ell(z_l,y_l)\big).
\]
For simplicity, we use the matrices $A$, $B$, and $C$ for the following substitutions:
\[
A = \lambda I + \frac{1}{N}\Big(\sum_{k=1}^{n}\nabla^2_\theta\ell(z_k,y_k) - \sum_{v_l\in L}\nabla^2_\theta\ell(z_l,y_l)\Big), \quad B = \frac{1}{N}\sum_{v_l\in L}\nabla^2_\theta\ell(z'_l,y_l), \quad C = \frac{1}{N}\sum_{v_l\in L}\nabla^2_\theta\ell(z_l,y_l),
\]
where $A$, $B$, and $C$ are positive definite matrices with the following properties:
\[
\frac{N\lambda + (N-|L|)\sigma_{\max}}{N} I \succ A \succ \frac{N\lambda + (N-|L|)\sigma_{\min}}{N} I, \quad \frac{|L|\sigma'_{\max}}{N} I \succ B \succ \frac{|L|\sigma'_{\min}}{N} I, \quad \frac{|L|\sigma_{\max}}{N} I \succ C \succ \frac{|L|\sigma_{\min}}{N} I.
\]
Therefore, we have
\[
I_{Nt}(-e_{ij}) - I(-e_{ij}) = \big((A+B)^{-1} - (A+C)^{-1}\big)\cdot\sum_{v_l\in L}\big(\nabla_\theta\ell(z'_l,y_l) - \nabla_\theta\ell(z_l,y_l)\big),
\]
where $(A+B)^{-1} - (A+C)^{-1} \prec \frac{N}{N\lambda + (N-|L|)\sigma_{\min} + |L|\min(\sigma'_{\min},\sigma_{\min})} I$. The $\ell_2$ norm of the error between our predicted influence and the Newton approximation can thus be bounded as follows:
\[
\|I_{Nt}(-e_{ij}) - I(-e_{ij})\|_2 \le \frac{N}{N\lambda + (N-|L|)\sigma_{\min} + \min(\sigma'_{\min},\sigma_{\min})|L|}\cdot\Big\|\sum_{v_l\in L}\big(\nabla_\theta\ell(z'_l,y_l) - \nabla_\theta\ell(z_l,y_l)\big)\Big\|_2. \tag{33}
\]
This finishes Step II. Combining the conclusions of Steps I and II in Eqs.
(28) and (33), we have the error between the actual influence and our predicted influence:
\[
\mathrm{Err}(-e_{ij}) \le \|I^*(-e_{ij}) - I_{Nt}(-e_{ij})\|_2 + \|I_{Nt}(-e_{ij}) - I(-e_{ij})\|_2 = \frac{N^3 C}{(N\lambda + (N-|L|)\sigma_{\min} + |L|\sigma'_{\min})^3}\cdot\Big\|\sum_{v_l\in L}\big(\nabla_\theta\ell(z'_l,y_l) - \nabla_\theta\ell(z_l,y_l)\big)\Big\|_2^2 + \frac{N}{N\lambda + (N-|L|)\sigma_{\min} + \min(\sigma'_{\min},\sigma_{\min})|L|}\cdot\Big\|\sum_{v_l\in L}\big(\nabla_\theta\ell(z'_l,y_l) - \nabla_\theta\ell(z_l,y_l)\big)\Big\|_2.
\]
This finishes the whole proof.

Corollary A.1. Let $\sigma_{\min} \ge 0$ denote the smallest among all eigenvalues of the Hessian matrices $\nabla^2_{\hat\theta}\ell(z_i,y_i)$, $\forall v_i\in V_{\text{train}}$, of the original model $\hat\theta$. Let $\sigma'_{\min} \ge 0$ denote the smallest among all eigenvalues of the Hessian matrices $\nabla^2_{\hat\theta(-v_i)}\ell(z_i,y_i)$, $\forall v_i\in V_{\text{train}}$, of the retrained model $\hat\theta(-v_i)$ with $v_i$ removed from graph $G$. Let $S$ denote the set $\{v : z' \neq z\}$ of nodes affected by the node removal, and $\mathrm{Err}(-v_i) = \|I^*(-v_i) - I(-v_i)\|_2$. We have the following upper bound on the estimated error of the model parameters' change:
\[
\mathrm{Err}(-v_i) \le \frac{N^3 m^2 C}{((N-1)\lambda + (N-|S|)\sigma_{\min} + \sigma'_{\min}|S|)^3} + \frac{(N-1)m}{N\lambda + (N-|S|)\sigma_{\min} + \min(\sigma_{\min},\sigma'_{\min})|S|} + \frac{N^3 C}{(N\lambda + (N-1)\sigma_{\min})^3}\cdot\|\nabla_\theta\ell(z_i,y_i)\|_2^2 + \frac{N}{N\lambda + N\sigma_{\min}}\cdot\|\nabla_\theta\ell(z_i,y_i)\|_2, \tag{35}
\]
where $m = \big\|\sum_{v_s\in S}[\nabla_\theta\ell(z'_s,y_s) - \nabla_\theta\ell(z_s,y_s)] - \nabla_\theta\ell(z_i,y_i)\big\|_2$.

Proof. We provide a simple proof of the error bound for removing a complete node. Notice that this error can be decomposed into two parts: (1) the error of removing a single node embedding $z_i$, and (2) the error of adding $z'_s$ and removing $z_s$ for $s\in S$. That is,
\[
\mathrm{Err}(-v_i) \le \sum_{s\in S}\mathrm{Err}(z_s \to z'_s) + \mathrm{Err}(-z_i).
\]
Theorem 4.1 proves the error bound for edge removal; in that proof, we decompose the problem into deriving the error bound of adding $z'_l$ and removing $z_l$ for $l\in L$, where $L$ is the set of node embeddings changed by removing an edge from the graph. Following the same setting as the proof of
Theorem 4.1, and noticing that $S$ is the set of node embeddings changed by removing a node from the graph, we simply substitute $L$ with $S$ to obtain the error bound for $\sum_{s\in S}\mathrm{Err}(z_s \to z'_s)$:
\[
\sum_{s\in S}\mathrm{Err}(z_s \to z'_s) \le \frac{N^3 m^2 C}{(N\lambda + (N-|S|)\sigma_{\min} + \sigma'_{\min}|S|)^3} + \frac{(N-1)m}{N\lambda + (N-|S|)\sigma_{\min} + \min(\sigma_{\min},\sigma'_{\min})|S|},
\]
where $m = \big\|\sum_{v_s\in S}[\nabla_\theta\ell(z'_s,y_s) - \nabla_\theta\ell(z_s,y_s)]\big\|_2$. The term $\mathrm{Err}(-z_i)$ can be derived following the same process as in the proof of Theorem 4.1, except that only one data point is removed. In this case we have:
\[
\mathrm{Err}(-z_i) \le \frac{N^3 C}{(N\lambda + (N-1)\sigma_{\min})^3}\cdot\|\nabla_\theta\ell(z_i,y_i)\|_2^2 + \frac{N}{N\lambda + N\sigma_{\min}}\cdot\|\nabla_\theta\ell(z_i,y_i)\|_2.
\]
Combining the two error bounds, we have:
\[
\mathrm{Err}(-v_i) \le \frac{N^3 m^2 C}{((N-1)\lambda + (N-|S|)\sigma_{\min} + \sigma'_{\min}|S|)^3} + \frac{(N-1)m}{N\lambda + (N-|S|)\sigma_{\min} + \min(\sigma_{\min},\sigma'_{\min})|S|} + \frac{N^3 C}{(N\lambda + (N-1)\sigma_{\min})^3}\cdot\|\nabla_\theta\ell(z_i,y_i)\|_2^2 + \frac{N}{N\lambda + N\sigma_{\min}}\cdot\|\nabla_\theta\ell(z_i,y_i)\|_2.
\]

B DATASET STATISTICS

We present the statistics of the datasets used in our experiments below. We choose only small and medium-sized datasets because validating the influence of each element in a graph requires retraining the model, and retraining to verify every edge's influence by comparing the change in loss is exceptionally expensive. We therefore randomly choose 10000 edges from each dataset and validate their influence. We observe that even for medium-sized datasets, our estimated influence correlates highly with the actual influence.
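The validation procedure above can be sketched in miniature. The snippet below is a hedged stand-in, not the paper's implementation: it uses ridge regression (whose regularized loss is strictly convex, like SGC's) in place of an SGC over graph embeddings, estimates the parameter change from removing one training point via the influence formula $\Delta\theta \approx H^{-1}\nabla\ell_i$, and compares it against the actual change from retraining using Spearman's rank correlation. All sizes, seeds, and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 200, 5, 0.1
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def fit(Xm, ym):
    # Minimizes sum_k (x_k^T w - y_k)^2 / 2 + lam/2 ||w||^2 in closed form.
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(Xm.shape[1]), Xm.T @ ym)

theta = fit(X, y)
H = X.T @ X + lam * np.eye(d)  # Hessian of the regularized training loss

def spearman(a, b):
    # Rank correlation, computed directly to avoid a scipy dependency.
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

est_norm, act_norm = [], []
for i in range(50):
    g = X[i] * (X[i] @ theta - y[i])  # gradient of sample i's loss at theta
    # Influence estimate of removing sample i: delta theta ~ H^{-1} g_i.
    est_norm.append(np.linalg.norm(np.linalg.solve(H, g)))
    # Ground truth: retrain without sample i.
    act = fit(np.delete(X, i, 0), np.delete(y, i, 0)) - theta
    act_norm.append(np.linalg.norm(act))

rho = spearman(np.array(est_norm), np.array(act_norm))  # high rank correlation expected
```

The same comparison, run per edge instead of per sample, is what the retraining-based validation in this appendix performs at much greater cost.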

D EMPIRICAL VERIFICATION OF THEOREM 4.1

As the value of the l2 regularization term decreases, the accuracy of our estimation of the influence of edges drops, and the Spearman correlation coefficient decreases correspondingly. This trend is consistent with the interpretation of the error bound in Theorem 4.1: the estimation error of the influence function is inversely related to the l2 regularization term. We also notice that edges connecting high-degree nodes have overall less influence. Their estimation points lie relatively close to the y=x line and thus tend to have relatively small estimation errors.
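The predicted relationship between regularization strength and estimation error can be checked numerically on a convex stand-in. This sketch (ridge regression as a hypothetical proxy for SGC, with illustrative lambda values) measures the mean relative error between the influence estimate and the retrained ground truth at increasing regularization strengths; per Theorem 4.1, the error should shrink as lambda grows.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def fit(Xm, ym, lam):
    # Closed-form ridge solution for the strictly convex regularized loss.
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(Xm.shape[1]), Xm.T @ ym)

def mean_relative_error(lam):
    theta = fit(X, y, lam)
    H = X.T @ X + lam * np.eye(d)
    errs = []
    for i in range(30):
        g = X[i] * (X[i] @ theta - y[i])
        est = np.linalg.solve(H, g)  # influence estimate of removing sample i
        act = fit(np.delete(X, i, 0), np.delete(y, i, 0), lam) - theta
        errs.append(np.linalg.norm(est - act) / (np.linalg.norm(act) + 1e-12))
    return float(np.mean(errs))

# Error should shrink monotonically as the regularization strength grows.
errors = [mean_relative_error(lam) for lam in (1.0, 50.0, 1000.0)]
```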

E GROUP EFFECT OF REMOVING MULTIPLE EDGES

We study the group effect of influence estimation when removing multiple edges. On the Cora dataset, we randomly sample k edges from the attributed graph, with k increasing over 2, 10, 50, 100, 200, and 350. Each time, we remove the k edges simultaneously and validate their estimated influence. We observe that, although the correlation remains high, the influence plots tend to shift downward as more edges are removed at once. In this regime, our method becomes less accurate and underestimates the influence of a simultaneously removed group of edges.
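The group effect can be illustrated on the same convex stand-in: the group influence estimate is the sum of the individual influence estimates, while the ground truth retrains with the whole group removed. The sketch below (hypothetical sizes, samples standing in for edges) checks that the estimated direction of the parameter change stays well aligned with the retrained one as the group grows, even though its magnitude becomes less accurate.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, lam = 300, 5, 1.0
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def fit(Xm, ym):
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(Xm.shape[1]), Xm.T @ ym)

theta = fit(X, y)
H = X.T @ X + lam * np.eye(d)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cosines = []
for k in (2, 10, 50):
    idx = rng.choice(N, size=k, replace=False)
    # Sum of per-sample gradients = gradient of the removed group's loss.
    G = ((X[idx] @ theta - y[idx])[:, None] * X[idx]).sum(0)
    est = np.linalg.solve(H, G)  # sum of individual influence estimates
    act = fit(np.delete(X, idx, 0), np.delete(y, idx, 0)) - theta
    cosines.append(cosine(est, act))  # alignment with the retrained change
```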

F VALIDATING INFLUENCE FOR ARTIFICIALLY ADDED EDGES

In this section, we validate our influence estimation for artificially added edges on the Cora, Pubmed, and Citeseer datasets. On each dataset, we randomly select 10000 unconnected node pairs, add an artificial edge between each pair, and validate its estimated influence. Figure 7 shows that the estimated influence correlates highly with the actual influence. This demonstrates that our proposed method can successfully evaluate the influence of artificially added edges.
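The addition case mirrors the removal case with a sign flip: adding an element contributes $I(+x) \approx -H^{-1}\nabla\ell(x,y)$. The sketch below is a hedged analogue (adding a training point to a ridge model rather than an edge to a graph; `x_new` and `y_new` are arbitrary illustrative values) comparing the estimate against retraining with the point added.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, lam = 200, 5, 1.0
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def fit(Xm, ym):
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(Xm.shape[1]), Xm.T @ ym)

theta = fit(X, y)
H = X.T @ X + lam * np.eye(d)

# Influence of ADDING a new point: I(+x) = -H^{-1} grad l(x, y).
x_new = rng.normal(size=d)
y_new = 2.0  # arbitrary label for the hypothetical added point
g = x_new * (x_new @ theta - y_new)
est = -np.linalg.solve(H, g)

# Ground truth: retrain with the new point appended.
act = fit(np.vstack([X, x_new]), np.append(y, y_new)) - theta
```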

G STUDY OF EDGES WITH NEGATIVE INFLUENCE

Here we demonstrate the performance obtained by cumulatively removing edges with negative influence, shown in Figure 8. The detailed implementation is discussed in Section 5.3. Because influence estimation becomes less accurate as more edges are removed, we consider a maximum of 50 removed edges for each dataset. We observe an overall increase in model performance as we cumulatively remove edges predicted to have negative influence. This again demonstrates the usefulness of our influence estimation for edges.
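The selection loop behind this experiment can be sketched on a convex stand-in. Here removal of harmful training points plays the role of removing negative-influence edges: the predicted change in validation loss for removing element $i$ is $\nabla L_{val}^\top H^{-1}\nabla\ell_i$, the most negative candidates are removed, and retraining confirms that the validation loss drops. Corrupted labels, sizes, and seeds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, lam = 200, 5, 1.0
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)
y[:10] += 5.0 * rng.normal(size=10)  # corrupt a few labels: harmful points
Xv = rng.normal(size=(100, d))       # clean validation set
yv = Xv @ w_true + 0.1 * rng.normal(size=100)

def fit(Xm, ym):
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(Xm.shape[1]), Xm.T @ ym)

def val_loss(w):
    return float(np.mean((Xv @ w - yv) ** 2))

theta = fit(X, y)
H = X.T @ X + lam * np.eye(d)
grad_val = 2 * Xv.T @ (Xv @ theta - yv) / len(yv)  # gradient of val loss

# Predicted change in validation loss if training point i is removed.
scores = np.array([grad_val @ np.linalg.solve(H, X[i] * (X[i] @ theta - y[i]))
                   for i in range(N)])
harmful = np.argsort(scores)[:5]  # most negative predicted change
theta2 = fit(np.delete(X, harmful, 0), np.delete(y, harmful, 0))
# Validation loss should drop after removing the predicted-harmful points.
```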

H EXTEND INFLUENCE METHOD TO OTHER GNN MODELS

Theoretically, our current pipeline can be extended to other, nonlinear GNNs at the cost of relaxing some assumptions. (1) According to Propositions 3.2 and 3.3, we require the inverse of the Hessian matrix to exist, which relies on the assumption that the loss function is strictly convex in the model parameters. For GNN models with nonlinear activation functions, we can use the pseudo-inverse of the Hessian matrix instead. (2) For the non-convex loss functions of most GNNs, our proposed error bound in Theorem 4.1 does not hold unless a large regularization term is applied to make the Hessian matrix positive definite. From the implementation perspective, (1) nonlinear models usually have more parameters than linear ones, which requires more space to store the Hessian matrix. Accordingly, computing the inverse of the Hessian matrix may run out of memory; one needs to reformulate the gradient calculation and apply optimization methods such as conjugate gradient for approximation. (2) Our current pipeline is built on manually derived gradients adapted from Koh et al. (2019). Existing packages like PyTorch use automatic differentiation to obtain gradients of the model parameters, which can be inaccurate for second-order gradient calculations. Extending the current pipeline to other GNNs may therefore require extensive first- and second-order gradient derivations. We will explore the influence functions of more GNNs in future work.
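The conjugate-gradient workaround mentioned above can be sketched concretely: the influence function only ever needs inverse-Hessian-vector products $H^{-1}v$, and conjugate gradient computes these from Hessian-vector products alone, so the Hessian is never materialized or inverted. The snippet below is a minimal sketch for the ridge Hessian (a hypothetical stand-in; for a real GNN the `hvp` function would come from automatic differentiation).

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, lam = 500, 20, 0.5
X = rng.normal(size=(N, d))

def hvp(v):
    # Hessian-vector product (X^T X / N + lam I) v, without forming the matrix.
    return X.T @ (X @ v) / N + lam * v

def cg_solve(b, iters=100, tol=1e-10):
    # Standard conjugate gradient for the SPD system H x = b.
    x = np.zeros_like(b)
    r = b - hvp(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = hvp(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

b = rng.normal(size=d)
direct = np.linalg.solve(X.T @ X / N + lam * np.eye(d), b)  # explicit inverse, for comparison
approx = cg_solve(b)  # matrix-free, memory cost O(d) instead of O(d^2)
```

For large nonlinear models, `hvp` would be replaced by an autodiff Hessian-vector product, which is the approach taken by Koh et al. (2019) for non-graph models.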

I RUNNING TIME COMPARISON

We present the running time comparison between calculating edge influence via the influence-based method and retrieving the actual edge influence via retraining. We conduct this experiment on the Cora, Pubmed, and Citeseer datasets. Our method is 15-25 times faster than the retraining method. Notably, for tasks such as improving model performance or carrying out adversarial attacks via edge removal, it can save a considerable amount of time in finding the edges with the lowest/highest influence.




Figure 1: The Cora experiment - the estimated influences of individual training nodes/edges on the validation loss. The largest connected component of the Cora dataset is visualized. Left: the dataset; the node size indicates whether a node is in the training subset (large) or not (small). Middle: influence of the training edges; each edge is colored according to its estimated influence value (blue - negative influence, removing it is expected to decrease the validation loss; red - positive influence, removing it is expected to increase the validation loss; grey - little influence; deeper color indicates larger influence). Right: influence of the training nodes, using the same color scheme as the middle plot.

Figure 2: Estimated influence vs. actual influence. Three datasets are used in this illustration: Cora (left column), Pubmed (middle column), and Citeseer (right column). In all plots, the horizontal axes indicate the actual influence on the validation set, the vertical axes indicate the predicted influence, and ρ indicates Spearman's correlation coefficient between our predictions and the actual influences. Top row: influence of node-embedding removals; each point represents a training node embedding. Middle row: influence of edge removals; each point corresponds to a removed edge. Bottom row: influence of node removals; each point represents a removed training node.

Figure 3: Study of edges with positive influence on both the validation (blue) and test (red) sets. Columns correspond to the Cora, Pubmed, and Citeseer datasets. Top: scale of the influence values of the edges with positive influence. Bottom: accuracy drop from cumulatively removing edges with positive influence.

Figure 4: Estimated influence vs. actual influence on medium-sized graphs. Three datasets are used in this illustration: Wiki-CS (left column), Amazon Computers (middle column), and Amazon Photo (right column). In all plots, the horizontal axes indicate the actual influence on the test set, the vertical axes indicate the predicted influence, and ρ indicates Spearman's correlation coefficient between our predictions and the actual influences. Top row: influence of node embeddings. Middle row: influence of edge removals; each point corresponds to a removed training edge. Bottom row: influence of node removals; each point represents a removed training node.

Figure 5: Spearman correlation on the Citeseer dataset under different l2 regularization strengths when validating the influence of edges. The orange points denote edges for which the sum of the degrees of the two endpoint nodes is high; the blue points denote edges for which this degree sum is low.

Figure 6: Estimating group influence on Cora. The horizontal axes indicate the actual influence on the validation set, and the vertical axes indicate the predicted influence. For each setting, we randomly sample k edges (k = 2, 10, 50, 100, 200, 350) from the graph and repeat this process 5000 times. Each time, we remove the k edges simultaneously and validate our influence estimation.

Figure 7: Estimated influence vs. actual influence for artificially added edges. Three datasets are used in this illustration: Cora, Pubmed, and Citeseer. Due to the high time complexity of evaluating the influence of every pair of nodes, we randomly sample 10000 node pairs and add an artificial edge between each pair.

Figure 8: Study of edges with negative influence; each column corresponds to the Cora, Pubmed, or Citeseer dataset. Top: the scale of the influence values of edges with negative influence. Bottom: accuracy from cumulatively removing edges with negative influence. Blue and red lines present the accuracy changes on the validation and test sets, respectively, in response to negative-influence edge removal.

Model performance after eliminating edges with negative influence values.





Dataset statistics. For the Wiki-CS dataset, we randomly select one of the train/val/test splits described in Mernyei & Cangea (2020) to explore the influence of training nodes/edges. For the Amazon Computers and Amazon Photo datasets, we follow the implementation of Shchur et al. (2018) to set random splits: on each dataset, we use 20 * C nodes as the training set, 30 * C nodes as the validation set, and the remaining nodes as the test set, where C is the number of classes.

Running time comparisons for edge removal, in seconds. Self-loop edges are not recorded.

ACKNOWLEDGEMENT

We would like to thank the three anonymous reviewers for their constructive questions and invaluable suggestions. This work is partially supported by NSF DMR 1933525 and NSF OAC 1920147. Any opinions or conclusions in this paper are those of the authors and do not reflect the views of the funding agencies.

