GRAPHEDITOR: AN EFFICIENT GRAPH REPRESENTATION LEARNING AND UNLEARNING APPROACH

Abstract

Graph representation learning has received much attention due to its widespread applications, and removing the effect of a specific node from a pre-trained graph representation learning model due to privacy concerns has become equally important. However, because of the dependency between nodes in the graph, graph representation unlearning is notoriously challenging and remains underexplored. To fill this gap, we propose GRAPHEDITOR, an efficient graph representation learning and unlearning approach that supports node/edge deletion, node/edge addition, and node feature update for linear-GNNs. Compared to existing unlearning approaches, GRAPHEDITOR requires neither retraining from scratch nor access to all training data during unlearning, which is beneficial in settings where not all the training data are available for retraining. Besides, since GRAPHEDITOR performs exact unlearning, the removal of all the information associated with the deleted nodes/edges can be guaranteed. Empirical results on real-world datasets illustrate the effectiveness of GRAPHEDITOR for both node and edge unlearning tasks. The code can be found in the supplementary material.

1. INTRODUCTION

In recent years, graph representation learning has been recognized as a fundamental learning problem and has received much attention due to its widespread use in various domains, including social network analysis Kipf & Welling (2017) ; Hamilton et al. (2017) , traffic prediction Cui et al. (2019) ; Rahimi et al. (2018) , knowledge graphs Wang et al. (2019a; b) , and recommender systems Berg et al. (2017) ; Ying et al. (2018) . However, due to the increasing concerns on data privacy, removing the effect of a specific data point from the pre-trained model has become equally important. Recently, "Right to be forgotten" Wikipedia contributors (2021) empowers the users the right to request the organizations or companies to have their personal data be deleted in a rigorous manner. For example, when Facebook users deregister their account, users not only can request the company to permanently delete the account's profiles from the social network, but also require the company to eliminate the impact of the deleted data on any machine learning model trained based on the deleted data, which is known as machine unlearning Bourtoule et al. (2021) . One of the most straightforward unlearning approaches is to retrain the model from scratch using the remaining data, which could be computationally prohibitive when the dataset size is large or infeasible if not all the data are available to retrain. Recently, many efforts have been made to achieve efficient unlearning, which can be roughly classified into exact unlearning and approximate unlearning, each of which has its own limitations. Exact unlearning: Bourtoule et al. (2021) proposes to randomly split the original dataset into multiple disjoint shards and train each shard model independently. Upon receiving a data deletion request, the model provider only needs to retrain the corresponding shard model. Chen et al. (2021) extends Bourtoule et al. (2021) by taking the graph into consideration for data partition. 
However, splitting the data into too many shards can hurt model performance due to data heterogeneity and the lack of training data for each shard model Ramezani et al. (2021). On the other hand, too few shards result in retraining on massive data, which is computationally prohibitive. Approximate unlearning: Guo et al. (2020); Chien et al. (2022) propose to approximate the unlearned model using a first-order Taylor expansion, Golatkar et al. (2020) proposes to fine-tune with Newton's method on the remaining data, and Wu et al. (2020a) proposes to transfer the gradient computed at one weight to another and retrain the model from scratch with lower computational cost. Since approximate unlearning lacks a guarantee that all information associated with the deleted data is eliminated, these methods require injecting random noise, which can significantly hurt the model performance. Graph representation unlearning is even more challenging due to the dependency between nodes that are connected by edges. We not only need to remove the information related to the deleted nodes, but also need to update their impact on the remaining nodes within multiple hops. Since most existing unlearning methods only support data deletion, extending them to graphs is non-trivial. Motivated by the importance and challenges of graph representation unlearning, we aim to answer the following two questions. Q1: Can approximate unlearning methods remove all information related to the deleted data? To verify this, we introduce the "deleted data replay test" to validate the effectiveness of unlearning in Section 5. Specifically, we add an extra-label category and change the labels of all deleted nodes to this extra-label category. To better distinguish deleted nodes from others, an extra binary feature is appended to all nodes, set to "1" for the deleted nodes and "0" for all other nodes.
We first pre-train the model on the dataset with the extra label and feature, then evaluate the effectiveness of an unlearning method by comparing the number of deleted nodes that are predicted as the extra-label category before and after the unlearning process. Intuitively, an effective unlearning method should unlearn all the knowledge related to the additional category and binary feature, so a model after unlearning should never predict a node as the additional category. However, according to our observation, approximate unlearning fails to remove all information related to the deleted data, which motivates us to design an efficient exact graph representation unlearning method. Q2: If not, can we design an efficient exact graph representation unlearning method? We propose an exact graph learning and unlearning algorithm, GRAPHEDITOR, which can efficiently update the parameters with provably low time complexity. GRAPHEDITOR supports not only node/edge deletion, but also node/edge addition and node feature update. The key idea of GRAPHEDITOR is to reformulate the ordinary GNN training problem as an alternative problem with a closed-form solution. Upon receiving a deletion request, GRAPHEDITOR takes the closed-form solution as input and quickly updates the model parameters based only on a small fraction of nodes in the neighborhood of the deleted node/edge. Compared to retraining from scratch, GRAPHEDITOR requires less data and only a single step of computation, which is more suitable for the online setting where the model provider must obtain the unlearned model immediately or where not all the training data are available for retraining. Compared to the existing exact unlearning method GRAPHERASER Chen et al. (2021), GRAPHEDITOR enjoys better performance since the unlearned model does not suffer from data heterogeneity and the lack of training data on each shard model. Compared to the approximate unlearning methods INFLUENCE Guo et al. (2020) and FISHER Golatkar et al.
(2020), GRAPHEDITOR guarantees removing all information related to deleted nodes/edges and does not require integrating differential privacy noise to prevent information leakage after unlearning. Contributions. We summarize our contributions as follows: (1) We introduce the "deleted data replay test" to validate the effectiveness of unlearning methods and illustrate the insufficiency of approximate unlearning methods for removing all information related to the deleted nodes/edges. (2) We introduce a graph representation learning and unlearning approach, GRAPHEDITOR, on linear-GNNs, which supports node/edge deletion, node/edge addition, and node feature update. (3) To improve scalability and expressiveness, we introduce subgraph sampling and a non-linear extension of GRAPHEDITOR. (4) We conduct empirical studies on real-world datasets that illustrate the effectiveness of GRAPHEDITOR.

2. RELATED WORKS

Exact machine unlearning. Exact unlearning aims to produce the performance of the model trained without the deleted data. The most straightforward way is to retrain the model from scratch, which is in general computationally demanding, except for some model-specific or deterministic problems such as SVM Cauwenberghs & Poggio (2001) , K-means Ginart et al. (2019) , and decision tree Brophy & Lowd (2021) . Recently, efforts have been made to reduce the computation cost for general gradient-based training problems. For example, Bourtoule et al. (2021) proposes to split the dataset into multiple shards and train an independent model on each data shard, then aggregate their prediction during inference. The data partition schema allows for an efficient retrain of models on a smaller fragment of data. However, the model performance suffers because each model has fewer data to be trained on and data heterogeneity can also deteriorate the performance. Besides, GRAPHERASER Chen et al. (2021) extends Bourtoule et al. (2021) to graph-structured data by proposing a graph partition method that can preserve the structural information as much as possible and weighted prediction aggregation for inference. Ullah et al. (2021) proposes to train the model using mini-batch SGD and save the model parameters at each iteration. When receiving the deletion requests, retraining only starts at the iteration that deleted data first time appears. Neel et al. (2020) ; Ullah et al. (2021) ; Sekhari et al. (2021) study the unlearning from the generalization theory perspective, which is not the main focus of this paper. Approximate machine unlearning. INFLUENCE Guo et al. (2020) proposes to unlearn by removing the influence of the deleted data on the model parameters. Formally, let D d ⊂ D denote the deleted subset of training data, D r = D \ D d denote the remaining data, L(w) is the objective function, and w is the model parameters before unlearning. 
Then, INFLUENCE unlearns by $w^u = w + H_r^{-1} g_d$, which is derived from the first-order Taylor approximation of the gradient, where $w^u$ denotes the parameters after unlearning, $H_r = \nabla^2 L(w, D_r)$ is the Hessian computed on the remaining data, and $g_d = \nabla L(w, D_d)$ is the gradient computed on the deleted data. To mitigate potential information leakage, INFLUENCE utilizes a perturbed objective function $L(w) + b^\top w$, where $b$ is random noise. Guo et al. (2020) requires the objective function to be defined over i.i.d. data, so extending its application to graphs is non-trivial because nodes in a graph are non-i.i.d. due to node dependency (details in Appendix E). Chien et al. (2022) extends the analysis of Guo et al. (2020) to graphs. A similar idea is explored in Wu et al. (2022). FISHER Golatkar et al. (2020) performs Fisher forgetting by taking a single step of Newton's method on the remaining training data, then injecting noise into the model parameters to mitigate potential information leakage. The model parameters after unlearning are given by $w^u = w - H_r^{-1} g_r + H_r^{-1/4} b$, where $H_r = \nabla^2 L(w, D_r)$ is the Hessian and $g_r = \nabla L(w, D_r)$ is the gradient computed on the remaining data $D_r$, and $b$ is random noise. Golatkar et al. (2021) generalizes the idea to deep neural networks by assuming a subset of training samples is never forgotten, which can be used to pre-train a neural network as a feature extractor, so that only the last layer is unlearned. Wu et al. (2020a) proposes to save all the intermediate weight parameters $w_t$ and gradients $\nabla L(w_t, D)$ during training. This information is then used to efficiently estimate the optimization path of a strongly convex and smooth objective function after unlearning, which results in very limited applications. Khan & Swaroop (2021) proposes knowledge-adaptation priors to reduce the cost of retraining by enabling adaptation for a wide variety of tasks and models. Similar ideas are explored in Ginart et al. (2019) for K-means and Wu et al.
(2020b) for logistic regression. Wang et al. (2021a) observes that different channels have a varying contribution to different categories in image classification. Inspired by this observation, Wang et al. (2021a) proposes to quantize the class discrimination of channels and prune the most relevant channel of the target category to unlearn its contribution to the model. Izzo et al. (2021) propose approximate data deletion method, which has a time complexity that is linear in the dimension of the deleted data and is independent of the size of the dataset. Fu et al. (2022); Nguyen et al. (2022) study Bayesian inference unlearning, which is different from the neural network unlearning that we focused on.
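To make the INFLUENCE update $w^u = w + H_r^{-1} g_d$ concrete, the following is a minimal sketch on a ridge-regularized squared loss, for which the first-order approximation happens to be exact. The function names, the synthetic data, and the choice of loss are assumptions made for illustration; for a logistic loss the same update would only be approximate, and the noise term $b$ is omitted here.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize 0.5 * ||X w - y||^2 + 0.5 * lam * ||w||^2 in closed form."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def influence_unlearn(w, X_r, X_d, y_d, lam):
    """Apply w_u = w + H_r^{-1} g_d (hypothetical helper, no noise injection)."""
    # Gradient of the deleted samples' loss at the current parameters.
    g_d = X_d.T @ (X_d @ w - y_d)
    # Hessian of the regularized objective restricted to the remaining data.
    H_r = X_r.T @ X_r + lam * np.eye(X_r.shape[1])
    return w + np.linalg.solve(H_r, g_d)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)
lam = 0.1

w = ridge_fit(X, y, lam)                 # model trained on all data
X_d, y_d = X[:10], y[:10]                # deleted subset D_d
X_r, y_r = X[10:], y[10:]                # remaining subset D_r

w_u = influence_unlearn(w, X_r, X_d, y_d, lam)
w_retrain = ridge_fit(X_r, y_r, lam)     # reference: retrain from scratch
print(np.abs(w_u - w_retrain).max())
```

Because the objective is quadratic, the single Newton-style step lands on the retrained solution up to floating-point error. This agreement is specific to the quadratic case and should not be expected for general losses.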

3. PRELIMINARIES ON GRAPH REPRESENTATION UNLEARNING

Problem setup. Given a graph $G(V, E)$ with $N = |V|$ nodes as input, suppose each node $v_i \in V$ is associated with a node feature vector $h_i^{(0)}$. Let $A, D \in \mathbb{R}^{N \times N}$ denote the adjacency matrix and its degree matrix; the normalized propagation matrix is defined as $P = D^{-1/2} A D^{-1/2}$. For ease of exposition, we take semi-supervised node classification as a running example, where a subset of nodes $V_{train} \subset V$ is labeled and our goal is to predict the labels of the remaining nodes $V \setminus V_{train}$ using the information of the labeled nodes. Please notice that GRAPHEDITOR can also be applied to the link prediction task for edge unlearning, which is discussed in detail in the appendix.
Graph neural network. The feed-forward rule in a graph neural network (GNN) is defined as $H^{(\ell)} = \sigma(P H^{(\ell-1)} W^{(\ell)})$, where $\sigma(\cdot)$ is a non-linear activation function and $H^{(\ell)}$ denotes the hidden representation at the $\ell$-th layer. Then, a linear classifier is applied to the final-layer node representation $H^{(L)}$ for prediction. Although GNNs have become the de-facto tool for graph representation learning, employing unlearning strategies on ordinary GNNs is non-trivial. This is because rigorously verifying a data removal guarantee for non-linear models is an open problem, and such a guarantee is also non-trivial to verify empirically. Recently, linear-GNNs have been proposed, which remove non-linearities and use only a single weight matrix in the neural architecture. For example, SGC Wu et al. (2019) computes the node representation by $H^{(L)} = P^L H^{(0)}$. By linearizing the GNNs, these methods not only enjoy a faster training speed but also allow us to rigorously verify, both theoretically and empirically, whether the information has been perfectly unlearned. Despite the lack of non-linearity, recent studies Wei et al. (2022); Wang & Zhang (2022) show that linear-GNNs are almost as expressive as their non-linear counterparts (details in Appendix F). Motivated by the advantages of linear-GNNs, we first illustrate our idea on linear-GNNs in Section 4.1 and then introduce its non-linear extension in Section 4.4. We rigorously test whether the information is perfectly unlearned on linear-GNNs and also demonstrate the possibility of using GRAPHEDITOR with non-linear GNNs.
Challenges in graph unlearning. Graph representation unlearning is challenging for the following three main reasons: (1) High computation cost. Existing unlearning methods suffer from high computation cost. To see this, suppose we are training logistic regression via gradient descent, i.e., $f(w) = -\sum_{i=1}^{N} [y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)]$, where $\mu_i = \sigma(w^\top x_i)$ is the prediction and $y_i \in \{0, 1\}$ is the ground-truth label. Unlearning by re-training from scratch takes $O(dNE)$ time to unlearn a single data point, where $N$ is the number of data points, $d$ is the feature dimension, and $E$ is the number of training epochs, which is infeasible if the deletion request needs to be completed immediately. Although approximate unlearning methods can alleviate the computation burden to some extent, the computation cost is still linear with respect to the number of nodes $N$. For example, INFLUENCE and FISHER require $O(Nd)$ to compute the gradient $\nabla f(w) = X^\top (\mu - y)$ and $O(Nd^2)$ to compute the Hessian $\nabla^2 f(w) = X^\top \mathrm{diag}(\mu \circ (1 - \mu)) X$, which scales poorly on large-scale datasets. (2) Non-triviality of extension to the graph domain. Most existing unlearning methods only support data deletion; however, graph representation unlearning also requires updating the effect of the deleted nodes on their neighborhood due to the convolution operation on the graph.
Figure 1: An illustration of how the output of a 1-layer GCN, $h_i^{(1)} = \sigma(\sum_{j \in N(v_i)} \alpha_{ij} h_j^{(0)} W)$ with $\alpha_{ij} = 1 / \sqrt{\deg(v_i) \deg(v_j)}$, is affected after deleting the node $v_1$.
For example, as shown in Figure 1, suppose our goal is to unlearn the effect of node $v_1$ on a pre-trained 1-layer GCN. After removing node $v_1$, the node representations of nodes $\{v_2, v_3, v_4\}$ are also affected due to the change of the edge weights $\alpha_{ij}$ and the deletion of node $v_1$'s feature. Therefore, a proper graph representation unlearning algorithm not only needs to remove the effect of node $v_1$ (which can be achieved by using INFLUENCE and FISHER), but also needs to be capable of updating the effect of nodes $\{v_2, v_3, v_4\}$ on the model parameters (which is not supported by most unlearning methods). (3) Lack of data removal guarantee. Although approximate unlearning methods are more efficient than retraining from scratch, the removal of all information related to the deleted data is not guaranteed, which we validate with the "deleted data replay test" in Section 5. Intuitively, this observation makes sense because the output of approximate unlearning is not necessarily equivalent to the result of exact unlearning. Furthermore, most approximate unlearning algorithms seek to prove that the approximately unlearned model is close to an exactly retrained model Wu et al. (2020a); Aldaghri et al. (2021); Izzo et al. (2021). However, it has been pointed out by Thudi et al. (2021); Guo et al. (2020) that we cannot infer "whether the data have been deleted" solely from "the closeness of the approximately unlearned and exactly retrained model in the parameter space". In fact, Thudi et al. (2021) shows that one can even unlearn the data without modifying the parameters. Therefore, it is important to show from the algorithm itself that the sensitive information can be perfectly removed, which is lacking in most approximate unlearning methods due to the approximation process. To overcome the above challenges, we propose GRAPHEDITOR, which enjoys a low computation cost with data removal guarantees.
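The neighborhood effect described in the second challenge can be reproduced on a toy graph. The sketch below computes 1-layer SGC-style features $P H^{(0)}$ before and after deleting one node and lists which remaining nodes see their representation change. The 6-node graph, the random features, and the added self-loops are assumptions chosen for illustration (self-loops only keep degrees positive after deletion); this is not the graph of Figure 1.

```python
import numpy as np

def sgc_features(A, H0, num_layers):
    """Compute P^L H0 with P = D^{-1/2} A D^{-1/2} (A already includes self-loops)."""
    inv_sqrt_deg = A.sum(axis=1) ** -0.5
    P = inv_sqrt_deg[:, None] * A * inv_sqrt_deg[None, :]
    H = H0
    for _ in range(num_layers):
        H = P @ H
    return H

# Hypothetical graph: node 0 is linked to 1, 2, 3; then a chain 3 - 4 - 5.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A += np.eye(6)  # self-loops (an assumption of this sketch)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(6, 3))

before = sgc_features(A, H0, num_layers=1)

# Delete node 0: drop its row/column and its feature vector, then recompute.
keep = np.arange(1, 6)
after = sgc_features(A[np.ix_(keep, keep)], H0[keep], num_layers=1)

changed = [int(v) for v, b, a in zip(keep, before[keep], after)
           if not np.allclose(b, a)]
print(changed)
```

In this sketch the direct neighbors 1, 2, 3 change because they lose a neighbor, and node 4 also changes because the degree of its neighbor 3 enters the normalization weight. Node 5 is untouched, so only a local region of the graph needs to be revisited.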

4. GRAPHEDITOR

In this section, we first introduce graph representation learning and unlearning under the notation of linear-GNNs in Section 4.1 and Section 4.2, respectively. Then, we introduce the subgraph sampling strategy to lower the computation cost in Section 4.3 and the application of GRAPHEDITOR to multi-layer GNNs in Section 4.4. We consider both node unlearning (discussed in the main text) and edge unlearning (deferred to the appendix). We summarize the full unlearning process of GRAPHEDITOR in Algorithm 1, which consists of three main functions: find_W(), remove_data(), and add_data().

Algorithm 1 GRAPHEDITOR (Numpy-like pseudo-code)

    import numpy

    # Input: X as the output of GNNs, Y as the label matrix

    # (Before unlearning) Compute the closed-form solution
    # >>> S, W = find_W(X, Y)
    def find_W(X, Y, reg=0):
        # X.T @ X has shape (d_x, d_x), so the identity uses the feature dimension
        XtX = X.T @ X + reg * numpy.eye(X.shape[1])
        S = numpy.linalg.inv(XtX)
        W = S @ X.T @ Y
        return S, W

    # (GraphEditor) Step 1: Delete information
    # >>> S, W = remove_data(X[V_rm ∪ V_upd], Y[V_rm ∪ V_upd], S, W)
    def remove_data(X, Y, S, W):
        I = numpy.eye(X.shape[0])
        A = S @ X.T
        B = numpy.linalg.inv(I - X @ S @ X.T)
        C = Y - X @ W
        D = X @ S
        return S + A @ B @ D, W - A @ B @ C

    # (GraphEditor) Step 2: Update information.
    # X_tilde is X recomputed on the updated graph.
    # >>> S, W = add_data(X_tilde[V_upd], Y[V_upd], S, W)
    def add_data(X, Y, S, W):
        I = numpy.eye(X.shape[0])
        A = S @ X.T
        B = numpy.linalg.inv(I + X @ S @ X.T)
        C = Y - X @ W
        D = X @ S
        return S - A @ B @ D, W + A @ B @ C

    # (Optional) Fine-tune W using cross-entropy loss

4.1. GRAPH REPRESENTATION LEARNING ON LINEAR-GNN

Instead of training ordinary GNNs by optimizing the cross-entropy loss, we propose to first formulate the ordinary GNN training as linear-GNN training with Ridge regression as the objective, which can be solved efficiently in closed form. More specifically, we first solve the following Ridge regression problem $L_{Ridge}(W; X, Y) = \|XW - Y\|_F^2 + \lambda \|W\|_F^2$, where $Y \in \mathbb{R}^{N \times d_y}$ is the zero-one label matrix and $X \in \mathbb{R}^{N \times d_x}$ is the node representation matrix of the linear-GNN (e.g., for SGC we have $X = P^L H^{(0)}$). The closed-form solution of the above objective is $W_\star = \arg\min_W L_{Ridge}(W; X, Y) = S_\star X^\top Y$, where $S_\star = (X^\top X + \lambda I)^{-1}$ is the inverse correlation matrix. After training, we cache both $S_\star \in \mathbb{R}^{d_x \times d_x}$ and $W_\star \in \mathbb{R}^{d_x \times d_y}$ for efficient unlearning. Please refer to find_W() for details. To boost model performance, we can take the closed-form solution as initialization and fine-tune using the cross-entropy loss for a small number of iterations. The time complexity of obtaining an exact unlearning solution by recomputing the closed-form solution is $O(N d_x^2 + N d_x d_y + d_x^2 d_y)$, which makes retraining on large-scale datasets computationally prohibitive due to its linear dependency on the graph size $N$. In the next section, we show that GRAPHEDITOR achieves efficient graph unlearning with a computation cost independent of the graph size, which makes it suitable for unlearning on large graphs.
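As a sanity check on the closed-form solution, the sketch below computes $S_\star$ and $W_\star$ on synthetic data and confirms that the gradient of the Ridge objective vanishes at $W_\star$. The random matrix standing in for the linear-GNN representation $X$, the problem sizes, and the value of the regularizer are assumptions for illustration.

```python
import numpy as np

def find_W(X, Y, reg):
    """Closed-form Ridge solution: S = (X^T X + reg * I)^{-1}, W = S X^T Y."""
    S = np.linalg.inv(X.T @ X + reg * np.eye(X.shape[1]))
    return S, S @ X.T @ Y

rng = np.random.default_rng(0)
N, dx, dy, reg = 500, 16, 4, 1e-2

X = rng.normal(size=(N, dx))            # stand-in for P^L H^(0)
labels = rng.integers(0, dy, size=N)
Y = np.eye(dy)[labels]                  # zero-one label matrix

S, W = find_W(X, Y, reg)

# Gradient of ||XW - Y||_F^2 + reg * ||W||_F^2 evaluated at W.
grad = 2 * X.T @ (X @ W - Y) + 2 * reg * W
print(S.shape, W.shape, np.abs(grad).max())
```

Only the $d_x \times d_x$ matrix `S` and the $d_x \times d_y$ matrix `W` need to be cached, so the stored state does not grow with the number of nodes.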

4.2. GRAPH REPRESENTATION UNLEARNING ON LINEAR-GNN

Let us suppose node $v_i$ is to be deleted. In node unlearning, we not only need to unlearn node $v_i$'s features but also need to unlearn its connections with other nodes. Formally, let $G_u^{node}(V_u^{node}, E_u^{node})$ denote the graph with node $v_i$ and all its associated edges removed, where $V_u^{node} = V \setminus \{v_i\}$ and $E_u^{node} = E \setminus \{(v_i, v_j) \mid v_j \in N(v_i)\}$. The model after node unlearning is expected to produce the same performance as the model trained on $G_u^{node}$. The key idea of GRAPHEDITOR is to leverage the obtained closed-form solution to first efficiently "remove" the effect of the deleted nodes on the weight parameters by remove_data(), then "update" the effect of the neighboring nodes of the deleted nodes on the weight parameters by add_data(). Before delving into the details of the algorithm, let us first take a closer look at the key factors that affect the weight parameters after node deletion. Lemma 1 (Key factors that affect weight parameters). The optimal weight parameters $W_\star^u$ after removing node $v_i$ are affected by two factors: (1) "node representation removal" caused by removing the node set $V_{rm} = \{v_i\}$; (2) "node representation update" due to the inner dependency between nodes in the graph, where the affected node set consists of all nodes whose shortest path distance (SPD) to some node in $V_{rm}$ is at most $2L$, i.e., $V_{upd} = \{v_j \mid SPD(v_i, v_j) \le 2L, \forall v_j \in V, \forall v_i \in V_{rm}\}$. The above lemma shows that the node sets $V_{rm}$ and $V_{upd}$ are the key factors that affect the weight parameters. Therefore, we propose GRAPHEDITOR to first remove the effect of all nodes in $V_{rm} \cup V_{upd}$ on the optimal weight parameters $W_\star$, then update the effect of $V_{upd}$ to derive the optimal weight $W_\star^u$. More specifically, our first step is to remove the effect of $V_{rm} \cup V_{upd}$ by remove_data(). Let $X_{rm} = X[V_{rm} \cup V_{upd}]$ and $Y_{rm} = Y[V_{rm} \cup V_{upd}]$ denote the submatrices of $X$ and $Y$ with rows indexed by $V_{rm} \cup V_{upd}$.
Then, given the initial solution $S_\star$ and $W_\star$, we first update the inverse correlation matrix as $S_{rm} = S_\star + S_\star X_{rm}^\top [I - X_{rm} S_\star X_{rm}^\top]^{-1} X_{rm} S_\star$ and update the optimal solution by $W_{rm} = W_\star - S_\star X_{rm}^\top [I - X_{rm} S_\star X_{rm}^\top]^{-1} (Y_{rm} - X_{rm} W_\star)$. Then, our next step is to update the effect of $V_{upd}$ on the weight parameters by add_data(). To achieve this, we first compute the updated node representation $\tilde{X}$ on the graph without the deleted nodes. Let $X_{upd} = \tilde{X}[V_{upd}]$ and $Y_{upd} = Y[V_{upd}]$ denote the submatrices of $\tilde{X}$ and $Y$ with rows indexed by $V_{upd}$. Then, we update the inverse correlation matrix by $S_{upd} = S_{rm} - S_{rm} X_{upd}^\top [I + X_{upd} S_{rm} X_{upd}^\top]^{-1} X_{upd} S_{rm}$ and update the optimal solution by $W_{upd} = W_{rm} + S_{rm} X_{upd}^\top [I + X_{upd} S_{rm} X_{upd}^\top]^{-1} (Y_{upd} - X_{upd} W_{rm})$. Notice that GRAPHEDITOR's output is equivalent to the optimal solution $W_\star^u = \arg\min_W L_{Ridge}(W; \tilde{X}, Y)$ computed on the remaining nodes. Besides, since we are using the closed-form solution, we do not have to worry that information about the deleted nodes in $V_{rm}$ might remain in the weight parameters. The time complexity of graph unlearning is $O(M^3 + M d_x^2 + M d_x d_y)$, where $M = |V_{rm} \cup V_{upd}|$. According to this time complexity, GRAPHEDITOR enjoys a lower computation cost than retraining from scratch if $M < d_x$. When $M$ is large, we can split the removed nodes into multiple small shards and unlearn them one after another, which alleviates the cubic dependency of the computation on $M$. Since we are using the closed-form solution, the results after unlearning are identical regardless of how many shards the nodes are split into. Besides, since NumPy relies on efficient BLAS and LAPACK routines for matrix inversion, the practical computation time of GRAPHEDITOR is significantly smaller than that of re-training or other unlearning methods according to our observations. For details on the correctness of GRAPHEDITOR and its time complexity, please refer to Appendix C.
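The two-step update can be checked numerically against retraining. The sketch below is a minimal illustration under stated assumptions: the representations are random matrices rather than GNN outputs, the sets `rm` and `upd` are chosen by hand, and the perturbation applied to the `upd` rows merely stands in for representations recomputed on the updated graph. It also removes the same rows in several shards to illustrate that the result does not depend on the sharding.

```python
import numpy as np

def find_W(X, Y, reg):
    S = np.linalg.inv(X.T @ X + reg * np.eye(X.shape[1]))
    return S, S @ X.T @ Y

def remove_data(X, Y, S, W):
    """Downdate (S, W) so that the rows (X, Y) no longer contribute."""
    A = S @ X.T
    B = np.linalg.inv(np.eye(X.shape[0]) - X @ S @ X.T)
    return S + A @ B @ X @ S, W - A @ B @ (Y - X @ W)

def add_data(X, Y, S, W):
    """Update (S, W) so that the rows (X, Y) contribute."""
    A = S @ X.T
    B = np.linalg.inv(np.eye(X.shape[0]) + X @ S @ X.T)
    return S - A @ B @ X @ S, W + A @ B @ (Y - X @ W)

rng = np.random.default_rng(0)
N, dx, dy, reg = 300, 8, 3, 1e-2
X_old = rng.normal(size=(N, dx))
Y = np.eye(dy)[rng.integers(0, dy, size=N)]

rm = np.array([0, 1])                    # deleted nodes (hypothetical choice)
upd = np.array([2, 3, 4, 5])             # affected nodes (hypothetical choice)
X_new = X_old.copy()
X_new[upd] += 0.1 * rng.normal(size=(len(upd), dx))  # stand-in for recomputed rows

# One-shot unlearning: remove V_rm and V_upd, then add V_upd back with new rows.
both = np.concatenate([rm, upd])
S, W = find_W(X_old, Y, reg)
S, W = remove_data(X_old[both], Y[both], S, W)
S, W = add_data(X_new[upd], Y[upd], S, W)

# Same removal split into three shards processed one after another.
S_sh, W_sharded = find_W(X_old, Y, reg)
for shard in np.array_split(both, 3):
    S_sh, W_sharded = remove_data(X_old[shard], Y[shard], S_sh, W_sharded)
S_sh, W_sharded = add_data(X_new[upd], Y[upd], S_sh, W_sharded)

# Reference: retrain from scratch on the remaining nodes with updated rows.
keep = np.setdiff1d(np.arange(N), rm)
S_ref, W_ref = find_W(X_new[keep], Y[keep], reg)
print(np.abs(W - W_ref).max(), np.abs(W_sharded - W_ref).max())
```

Each call inverts only an $M \times M$ matrix, where $M$ is the number of rows passed in, which is where the $O(M^3)$ term in the stated complexity comes from.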
Remark 1 (Node/Edge addition). Node/edge addition follows the same procedure as node/edge deletion. For example, we can add nodes by first unlearning the set of all "affected nodes" using remove_data(), then performing the information update on the "affected nodes" and the "added nodes" using add_data(). Here, the "affected nodes" are the nodes whose SPD to the "added nodes" is at most $2L$, which is similar to the node deletion operation. The same applies to edge addition.
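The affected set used in Lemma 1 and Remark 1 can be collected with a bounded breadth-first search. The following is a small sketch, assuming the graph is given as an adjacency dictionary; the function name and the path graph are hypothetical, and excluding the removed nodes from the returned set is a choice of this sketch (they are handled separately as $V_{rm}$).

```python
from collections import deque

def affected_nodes(adj, removed, num_layers):
    """Nodes within 2 * num_layers hops of any removed node, excluding them."""
    max_dist = 2 * num_layers
    dist = {v: 0 for v in removed}
    queue = deque(removed)
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the 2L-hop boundary
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    removed_set = set(removed)
    return sorted(v for v in dist if v not in removed_set)

# Hypothetical path graph 0 - 1 - 2 - 3 - 4 - 5 - 6.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}

print(affected_nodes(adj, [0], num_layers=1))
print(affected_nodes(adj, [0], num_layers=2))
```

On this path graph the affected set doubles in radius when the depth goes from one layer to two, which is the growth that motivates decoupling the receptive field from the depth.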

4.3. SUBGRAPH SAMPLING FOR BETTER SCALABILITY

Recall from Lemma 1 that the number of nodes in $V_{upd}$ grows exponentially with respect to twice the linear-GNN depth, i.e., letting $D$ denote the maximum node degree, we have $|V_{upd}| \le |V_{rm}| \times D^{2L}$. When the linear-GNN is deep, GRAPHEDITOR becomes computationally prohibitive even when the deleted node set $V_{rm}$ is small, since GRAPHEDITOR's overall computation cost is cubic with respect to $|V_{upd} \cup V_{rm}|$. To overcome this issue, we propose to decouple the receptive field of each node from the GNN depth by extracting a rooted subgraph for each node in the graph using the shaDow-subgraph sampler Zeng et al. (2021), and applying the linear-GNN to compute the feature representation of the root node on the extracted subgraph. In practice, we can either use K-hop sampling to uniformly sample the K-hop neighbors of the root node, or sample a fixed number of nodes according to their PPR score with respect to the root node. Lemma 2 (Affected nodes after sampling). Let $V_j^{sg}$ denote the set of nodes returned by the shaDow-subgraph sampler for root node $v_j$. If the shaDow-subgraph sampler is used, the affected node set of "node representation update" in Lemma 1 is reduced to $V_{upd}^{sg} = \{v_j \mid v_i \in V_j^{sg}, \forall v_j \in V, \forall v_i \in V_{rm}\}$. As shown in Lemma 2, we can reduce the size of the update node set to $|V_{upd}^{sg}|$, which is independent of the GNN depth. In practice, we find that K-hop sampling works well on all datasets. When using K-hop sampling, the update node set size is reduced to $|V_{upd}^{sg}| \le |V_{rm}| \times D^K$, and we only need to update the representation of a node if its K-hop rooted subgraph contains a deleted node.
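Lemma 2 can be read as a simple filter over roots: a node needs an update only if its rooted subgraph contains a deleted node. The sketch below illustrates this with a deterministic K-hop neighborhood that keeps all neighbors. This is an assumption made for illustration and not the actual shaDow-subgraph sampler, which may subsample neighbors; the function names and the path graph are hypothetical.

```python
def k_hop_subgraph_nodes(adj, root, k):
    """Node set of the k-hop rooted subgraph (all neighbors kept, no sampling)."""
    frontier, seen = {root}, {root}
    for _ in range(k):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen

def affected_roots(adj, removed, k):
    """Roots whose k-hop subgraph contains at least one removed node."""
    removed = set(removed)
    return sorted(
        root for root in adj
        if root not in removed and k_hop_subgraph_nodes(adj, root, k) & removed
    )

# Hypothetical path graph with 10 nodes: 0 - 1 - ... - 9.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

print(affected_roots(adj, [0], k=1))
print(affected_roots(adj, [0], k=2))
```

The size of the result depends only on `k`, so stacking more linear-GNN layers on top of the extracted subgraphs would not enlarge the set of representations that must be recomputed.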

4.4. USING GRAPHEDITOR WITH NON-LINEAR MULTI-LAYER GNNS

One of the biggest concerns readers might have is the linear-GNN requirement. Extending the existing unlearning methods to non-linear models requires first pre-training multi-layer GNNs as a feature extractor on public datasets (assuming nodes in the public set will never need to be unlearned in the future), and then using the pre-trained multi-layer GNNs to extract the node representation for a linear classifier. During unlearning, we only unlearn the linear classifier without updating the feature extractor because it does not carry any information on the deleted data. Formally, let $V_{public}$ and $V_{private}$ denote the nodes in the public and private datasets, with $V_{rm} \cap V_{public} = \emptyset$ and $V_{rm} \subset V_{private}$, and define $f_w \circ g_\theta$ as a multi-layer GNN model with $f_w$ as the final linear classifier and $g_\theta$ as all previous layers. Suppose we pre-train $g_\theta$ on the public nodes $V_{public}$, then use $g_\theta$ as a feature extractor to extract the node representation $X$ on the private nodes $V_{private}$. By doing so, we can apply find_W() on $X$ to learn the linear classifier $f_w$ with GRAPHEDITOR. To unlearn the nodes $V_{rm}$, we can first apply remove_data() with $X[V_{rm} \cup V_{upd}^{sg}]$ to unlearn the information of $V_{rm} \cup V_{upd}^{sg}$ on $f_w$, then compute the node representation $\tilde{X}$ on the updated graph using $g_\theta$, and finally apply add_data() with $\tilde{X}[V_{upd}^{sg}]$ to update the information of $V_{upd}^{sg}$ on $f_w$. Please notice that using GRAPHEDITOR with non-linear GNNs is similar to using GRAPHEDITOR with linear-GNNs, except that $X$ and $\tilde{X}$ were originally computed using linear-GNNs, whereas here we use non-linear GNNs instead.
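The pipeline above can be sketched end to end with a frozen non-linear extractor. In the sketch below, `g_theta` is a single GCN-style layer with fixed random weights standing in for a model pre-trained on public nodes. The ring graph, the self-loops, the problem sizes, and the helper `rank_update` (which merges remove_data() and add_data() behind a sign argument) are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def propagate(A):
    A = A + np.eye(A.shape[0])               # self-loops (assumption)
    d = A.sum(axis=1) ** -0.5
    return d[:, None] * A * d[None, :]

def g_theta(A, H0, theta):
    """Frozen 1-layer feature extractor: ReLU(P H0 theta)."""
    return np.maximum(propagate(A) @ H0 @ theta, 0.0)

def find_W(X, Y, reg):
    S = np.linalg.inv(X.T @ X + reg * np.eye(X.shape[1]))
    return S, S @ X.T @ Y

def rank_update(X, Y, S, W, sign):
    """sign=-1 removes the rows (remove_data); sign=+1 adds them (add_data)."""
    A = S @ X.T
    B = np.linalg.inv(np.eye(X.shape[0]) + sign * X @ S @ X.T)
    return S - sign * A @ B @ X @ S, W + sign * A @ B @ (Y - X @ W)

rng = np.random.default_rng(0)
N, d0, dx, dy, reg = 60, 5, 8, 3, 1e-2
idx = np.arange(N)
A = np.zeros((N, N))
A[idx, (idx + 1) % N] = 1.0                  # ring graph on the private nodes
A[(idx + 1) % N, idx] = 1.0
H0 = rng.normal(size=(N, d0))
theta = rng.normal(size=(d0, dx))            # stand-in for pre-trained weights
Y = np.eye(dy)[rng.integers(0, dy, size=N)]

X_old = g_theta(A, H0, theta)
S, W = find_W(X_old, Y, reg)                 # linear classifier f_w

rm = np.array([0])                           # node to unlearn
upd = np.array([1, 2, N - 2, N - 1])         # nodes within 2 hops on the ring
keep = np.arange(1, N)
X_new = np.zeros_like(X_old)
X_new[keep] = g_theta(A[np.ix_(keep, keep)], H0[keep], theta)

both = np.concatenate([rm, upd])
S, W = rank_update(X_old[both], Y[both], S, W, sign=-1)
S, W = rank_update(X_new[upd], Y[upd], S, W, sign=+1)

_, W_ref = find_W(X_new[keep], Y[keep], reg) # retrain the classifier only
print(np.abs(W - W_ref).max())
```

Only the linear classifier is touched; `theta` stays fixed throughout, which mirrors the assumption that the extractor carries no information about the private nodes.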

5. EXPERIMENTS

Datasets and baselines. We select OGB-Arxiv, OGB-Products, Flickr, and Reddit datasets for node unlearning evaluation, and OGB-Collab dataset for edge unlearning evaluation. We compare with approximate unlearning methods INFLUENCE and FISHER on linear-GNNs (by extending their application from the non-structured data to the structured data) and exact unlearning method GRAPHERASER on both linear and non-linear GNNs. Besides, we compare with retraining GCN Kipf & Welling (2017) , GraphSAGE Hamilton et al. (2017) , GAT Veličković et al. (2017) from scratch. Representation computation. For node unlearning, the node representation is extracted from the sampled rooted subgraph of each node and label reuse trick Wang et al. (2021b) is used for linear-GNNs node classification. For edge unlearning, we first extract the subgraph of the two nodes connected by that edge, then the edge representation is extracted from the intersection of the two subgraphs, common neighbor score Liben-Nowell & Kleinberg (2007) is used for linear-GNNs. Overview on experiments. We measure the success of an unlearning method by two criteria: whether the information is unlearned and whether the unlearning algorithm could deteriorate the model performance. Please notice that the second criteria is as important as the first criteria. For example, in an extreme case, one can just unlearn by randomly initializing the model but it might significantly hurt the model performance. To validate the above two criteria, we conduct the following experiments: In section 5.1, we conduct deleted data replay test on linear-GNN to test whether an unlearning method could perfectly unlearn the features-labels correlation from the deleted nodes, and whether the unlearning method hurt the model's prediction accuracy. In section 5.2, we compare the similarity between the unlearned model to the re-training from the scratch model on linear-GNN. 
The two models are expected to be similar to prevent information leakage on the deleted nodes. In Section 5.3, we evaluate the efficiency and effectiveness of using GRAPHEDITOR with non-linear GNNs. We compare the accuracy with re-training non-linear GNNs from scratch. Intuitively, a strong unlearning method should unlearn in a short time and produce model performance similar to re-training. Furthermore, we conduct an edge unlearning test in Appendix A.1, study the effect of the subgraph sampling size on GRAPHEDITOR for node unlearning in Appendix A.2 and for edge unlearning in Appendix A.3, conduct a node addition test in Appendix A.4, compare the prediction confidence on the unlearned nodes in Appendix A.5, compare the running time in Appendix A.6, and compare linear-GNNs with a shallow sampler against different multi-layer GNNs in Appendix A.7. Details on datasets, baselines, and experiment settings are summarized in Appendix B.

5.1. DELETED DATA REPLAY TEST

In this experiment, we randomly select 100 nodes from the training set as the deleted nodes and change their labels to an extra-label category. An extra binary feature is injected into the node features to help linear-GNNs memorize the correlation between the deleted nodes and the extra-label category. To simulate real-world unlearning requests that arrive one after another, we first uniformly split all deleted nodes into S ∈ {10, 50, 100} shards; then, at each unlearning iteration, we randomly select one shard without replacement to unlearn, and repeat this S times. We measure the success of unlearning by checking whether the information on the deleted nodes is unlearned and [...]. As the number of shards S increases, the time required by GRAPHEDITOR increases less than that of the baselines. This is because increasing S also decreases the number of nodes to unlearn at each iteration, and the time complexity of GRAPHEDITOR is independent of the dataset size (refer to Section 4.2). Note that this is not the case for the baselines, because their complexity is always proportional to the dataset size (refer to part three of Section 3). Therefore, GRAPHEDITOR is more efficient than the baseline methods.
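The data preparation for the replay test can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the function names `build_replay_dataset` and `split_into_requests`, the synthetic features and labels, and the 0-based encoding of the extra-label category as index C are all hypothetical choices made for the example.

```python
import numpy as np

def build_replay_dataset(X, y, deleted_idx, num_classes):
    """Append a binary flag feature and relabel deleted nodes to the extra category.

    Labels are assumed 0-based, so the extra category gets index `num_classes`.
    """
    n = X.shape[0]
    flag = np.zeros((n, 1), dtype=X.dtype)
    flag[deleted_idx] = 1.0                  # extra feature is 1 only on deleted nodes
    X_aug = np.hstack([X, flag])
    y_aug = y.copy()
    y_aug[deleted_idx] = num_classes         # extra-label category
    return X_aug, y_aug

def split_into_requests(deleted_idx, num_shards, seed=0):
    """Shuffle the deleted nodes and split them into sequential unlearning requests."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(deleted_idx), num_shards)

# Synthetic stand-in for a real dataset (assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = rng.integers(0, 5, size=1000)
deleted = rng.choice(1000, size=100, replace=False)

X_aug, y_aug = build_replay_dataset(X, y, deleted, num_classes=5)
requests = split_into_requests(deleted, num_shards=10)
```

After training on `(X_aug, y_aug)`, counting how many deleted nodes are still predicted as the extra category after each request gives the kind of replay score the test relies on.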

5.2. COMPARISON TO RETRAINED MODEL

To assess the similarity between the unlearned model and the retrained model, a natural way is to measure the distance between the final activations obtained by the unlearned model $W_u$ and the retrained model $W_r$ on the deleted nodes and on the testing set nodes. More specifically, for $B \in \{V_{rm}, V_{test}\}$ we compare the distance of final activations as

$$\mathbb{E}_{v_i \in B} \, \big\| \mathrm{softmax}(x_i W_u) - \mathrm{softmax}(x_i W_r) \big\|_2 .$$

The deleted nodes are 100 samples randomly selected from the training set and are unlearned through 10 sequential forgetting requests, each of size 10. Intuitively, a powerful unlearning algorithm should generate final activations similar to those of the retrained model on both the deleted nodes and the testing set nodes. Besides, we measure the Euclidean distance between the parameters returned by the unlearning algorithms and the parameters obtained via retraining from scratch. For linear GNN, a small Euclidean distance in the parameter space means the models are likely to make the same predictions. We have the following observations according to Figure 2:

(1) The final activation difference and the parameter difference of GRAPHEDITOR (both with and without fine-tuning) are consistently low compared to the baselines during the 10 unlearning requests. However, this is not the case for the approximate unlearning methods INFLUENCE and FISHER.

(2) The final activation differences of the approximate unlearning methods on the deleted nodes are consistently larger than the values on the test nodes. Therefore, a malicious third party could potentially distinguish the deleted nodes from other nodes by comparing their final activation differences.
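The two distances used in this comparison are simple to compute for a linear model. The snippet below is a minimal NumPy sketch; the helper names `activation_distance` and `parameter_distance` and the random inputs are assumptions for illustration and do not come from the released code.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def activation_distance(X, W_u, W_r):
    """Mean over nodes of the L2 distance between the two models' softmax outputs."""
    diff = softmax(X @ W_u) - softmax(X @ W_r)
    return np.linalg.norm(diff, axis=1).mean()

def parameter_distance(W_u, W_r):
    """Euclidean (Frobenius) distance between the two weight matrices."""
    return np.linalg.norm(W_u - W_r)

# Random stand-ins for node representations and the two models (illustration only).
rng = np.random.default_rng(0)
X_nodes = rng.normal(size=(50, 8))
W_retrained = rng.normal(size=(8, 4))
W_unlearned = W_retrained + 0.01 * rng.normal(size=(8, 4))

d_act = activation_distance(X_nodes, W_unlearned, W_retrained)
d_par = parameter_distance(W_unlearned, W_retrained)
```

Evaluating `activation_distance` separately on the deleted nodes and on the test nodes reproduces the two node sets $V_{rm}$ and $V_{test}$ used above.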

5.3. GRAPHEDITOR WITH NON-LINEAR GNN

In this section, we demonstrate the potential extension of GRAPHEDITOR to non-linear GNNs. This experiment is conducted under the assumption that a subset of the training samples is never forgotten (i.e., a public dataset); this subset can be used to pre-train a neural network as a feature extractor, so that only the final linear classifier needs to be unlearned. To test this scenario, we randomly split the training set into 90% and 10% as the public dataset and the private dataset, respectively. For GRAPHEDITOR, the feature extractor is pre-trained on the 90% training set nodes and is then used to extract node representations for all nodes. The extracted representations are used to train the linear classifier. GRAPHEDITOR is only applied to the linear classifier, since the pre-trained feature extractor does not have any information about the deleted nodes. For both re-training and GRAPHERASER, the multi-layer GNNs are trained on all nodes. We select the 100 nodes with the largest node degree from the 10% private training set to unlearn. We use 3-layer GAT, GCN, and SAGE with hidden dimension 256 and attention head size 8 as the backbone models. We use the official implementation of GRAPHERASER (detailed setup in Appendix B.3). From [...]

[...] .4, compare the prediction confidence on the unlearned nodes in Appendix A.5, compare the running time in Appendix A.6, and compare linear-GNN with a shallow sampler against different multi-layer GNNs in Appendix A.7. In Section B, we provide details on the experiment setup. In Section C, we provide a detailed analysis of the computation complexity and correctness of GRAPHEDITOR. In Section D, we provide the proofs of Lemma 1 and Lemma 2. In Section E, we highlight the dependency issue of applying existing unlearning approaches to graph-structured data. In Section F, we summarize recent theoretical analyses showing that linear-GNNs can be almost as expressive as non-linear GNNs. The code to reproduce the experiment results can be found in the anonymous repository.

A.1 EDGE UNLEARNING

In this section, we introduce our edge unlearning problem formulation and present our results. Suppose edge $(v_i, v_j)$ is to be deleted; our goal is to unlearn the connectivity. Let $G^{edge}_u(V, E^{edge}_u)$ denote the graph with edge $(v_i, v_j)$ removed, where $E^{edge}_u = E \setminus \{(v_i, v_j)\}$. The model after unlearning is expected to produce the same performance as the model trained on $G^{edge}_u$. As shown in Table 3, we compare the model performance (measured by Hits@50 for edge unlearning), the wall-clock time (measured in seconds), and the number of deleted nodes that are predicted as the extra-label category before and after unlearning (reported in parentheses), and have the following observations:

(1) Different from node-level tasks, the performance of GRAPHEDITOR and the baselines is very close. This is potentially due to the nature of the OGB-Collab dataset and the feature extraction strategy we use on linear GNN. In fact, we found that the feature extraction strategy plays a more important role on the OGB-Collab dataset; please refer to the next section for a more detailed discussion and the ablation study results in Table 5.

(2) The exact unlearning methods GRAPHEDITOR and GRAPHERASER can always unlearn all the information related to the deleted nodes; however, this is not the case for the approximate unlearning methods INFLUENCE and FISHER.

(3) GRAPHEDITOR is significantly more efficient than the other baseline methods, mainly because its unlearning complexity is independent of the dataset size, which requires less wall-clock time throughout the unlearning process.

We study the effectiveness of GRAPHEDITOR by comparing it with ordinary GNNs (including GCN and GraphSAGE) and provide an ablation study on the effect of the number of layers per hop on performance and efficiency.
Besides, we also provide experimental results obtained by applying GRAPHERASER to ordinary GNNs, splitting the original graph into 8 subgraphs and using mean-aggregation during inference, where the reported time is the maximum time spent training on a single subgraph. We repeat the experiment 5 times; each time, 100 nodes are randomly selected from the training set as deleted nodes, and for node unlearning we randomly split the deleted nodes into 10 sequential forgetting requests of equal size. The results are reported in Table 4, where we denote the full-neighbor subgraph as "Full", the subgraph with K neighbors per hop as "SG (K)", and the model with fine-tuning as "+ FT". We have the following observations from Table 4. [...] Interestingly, we found that the performance degradation issue with ordinary GNNs is more severe than with the linear GNN and than the results reported in the original paper Chen et al. (2021). This is potentially because ordinary non-linear GNNs require more data for training than linear GNNs due to their higher model complexity.

Table 4: Comparison on the effect of subgraph sampling and fine-tuning of GRAPHEDITOR on linear-GNNs. Besides, we also compare with re-training multi-layer GNNs from scratch and using GRAPHERASER for multi-layer GNNs (marked with †). Here, we use our own implementation of GRAPHERASER, directly applying GCN, GraphSAGE, GraphSAINT, and ClusterGCN to the graph partitioned by METIS and using mean-average pooling for aggregation. We do this to make sure that only the unlearning method differs and that the other parts (e.g., neural architecture, hyper-parameters) are consistent among the unlearning methods. We believe our implementation is general enough and captures the leading spirit of GRAPHERASER, i.e., split the data into multiple shards and train a different model on each graph partition. METIS allows us to split the original graph into multiple subgraphs while preserving the original graph structure as much as possible.
We would like to point out that the experimental results using official implementations (Table 2 ) are consistent with our implementation's results and meet our expectations.

A.3 EFFECTIVENESS FOR EDGE UNLEARNING.

We conduct an experiment similar to that of Appendix A.2 for edge unlearning. We repeat the experiment 5 times; each time, 100 edges are randomly selected from the training set as deleted edges, and we unlearn them through 100 forgetting requests. The results are reported in Table 5, where we denote the full-neighbor subgraph as "Full", the subgraph with K neighbors per hop as "SG (K)", and the model with fine-tuning as "+ FT". We have the following observations from Table 5.

(1) Adding neighbors per hop does not necessarily result in better model performance on the linear GNN used in GRAPHEDITOR, which can be explained by the over-smoothing hypothesis in Cong et al. (2021); Li et al. (2018).

(2) Fine-tuning brings very little improvement on link prediction datasets, potentially because node features are less important in the OGB-Collab dataset.

(3) Linear GNN can achieve results comparable to (or even better than) the ordinary multi-layer non-linear GNN with significantly less computation time, which motivates us to explore better feature engineering tricks for linear GNN as a future direction.

In this experiment, we first randomly select 100 nodes from the training set as the node set $V_{add}$ to add, then remove them from the graph, including all edges connected to $V_{add}$. Similar to the "deleted data replay test", an extra-label category and a binary feature are added to all nodes: we change the label of the nodes in $V_{add}$ to this additional label category, set the extra feature to "1" for all nodes in $V_{add}$, and set it to "0" for all other nodes in $V \setminus V_{add}$. Then, we pre-train our model on the modified dataset. We randomly split the 100 added nodes into S ∈ {10, 50, 100} shards. At each node addition iteration, we randomly select one shard without replacement and ask the model to learn the information about the new nodes.
To evaluate the effectiveness of the node addition operation, we compare the number of nodes that are predicted as the (C + 1)-th category. Notice that the "node addition test" can be thought of as the reverse operation of the "deleted data replay test" for node unlearning. As shown in Table 6, GRAPHEDITOR can efficiently learn the correlation between the extra node feature and the extra-label category.

In this experiment, we measure the difference between the prediction probability of the target category obtained by the unlearned model $W_u$ and the retrained model $W_r$ on the deleted nodes as

$$\mathbb{E}_{v_i \in V_{rm}} \Big( [\mathrm{softmax}(x_i W_u)]_{y_i} - [\mathrm{softmax}(x_i W_r)]_{y_i} \Big).$$

The deleted nodes are 100 samples randomly selected from the training set and are unlearned through 10 sequential forgetting requests, each of size 10. Intuitively, if a model learned a node during training, it is expected to have a higher confidence on that node's target category. A powerful unlearning algorithm should generate a prediction probability of the target category similar to that of the retrained model on the nodes in the deleted node set. We repeat the experiment 5 times with different random seeds. We retrain INFLUENCE and FISHER with the same initialization and number of epochs to eliminate performance differences caused by other factors. We observe that the difference in the prediction probability of the target category for GRAPHEDITOR (both with and without fine-tuning) is consistently low compared to the baselines during the 10 unlearning requests. However, this is not the case for the approximate unlearning methods INFLUENCE and FISHER. Therefore, a malicious third party could potentially distinguish the deleted nodes from other nodes by comparing their final activation differences.

We report the wall-clock time of unlearning 10 nodes with a single unlearning request in Table 7 on OGB-Arxiv and OGB-Products. We use 2-hop neighbor sampling with 15 neighbors for OGB-Arxiv and 10 neighbors for OGB-Products.
In practice, the overall unlearning process of GRAPHEDITOR can be split into the data preparation time on CPU and the computation time on GPU. For example, on OGB-Arxiv, GRAPHEDITOR takes around 190.2 + 0.325 seconds to learn a model using the closed-form solution. The unlearning process takes around 1.21 + 0.092 + 0.0028 + 0.087 seconds, which is relatively small compared to re-training. The computation time on OGB-Arxiv is larger than on OGB-Products because the dimension of the augmented feature in OGB-Arxiv is larger and the computation cost is proportional to the feature dimension. Besides, we would like to point out that another major advantage of GRAPHEDITOR is that it does not require all data to be present during unlearning, which is beneficial for settings where not all the training data are available for re-training.

[...] 2021) that a 2-hop shallow sampler usually achieves good performance when compared to deeper GNNs with larger receptive fields. Then, to answer this question, we compare the performance of the "2-hop sampler with linear-GNN" and "multi-layer GNNs" on more datasets in Table 8. We observe that the performance of the "2-hop linear GNN" is very close to that of the "multi-layer GNNs". Combined with the results of the 2-hop linear GNN in Table 4 and Table 5, we believe that using a shallow depth is not a significant limitation. Meanwhile, note that the computation cost of GRAPHEDITOR depends only on the number of nodes in the sampled subgraph. Therefore, we could still use other sampling strategies to obtain a subgraph with a deeper receptive field but fewer neighbors per node. However, this is not the focus of our unlearning paper.

We conduct experiments on a single machine with an Intel i9 CPU, an Nvidia RTX 3090 GPU, and 64GB of RAM. The code is written in Python 3.7 and we use PyTorch 1.4 on CUDA 10.1 for model training.

B.2 DATASET

We select the OGB-Arxiv, OGB-Products, Flickr, and Reddit datasets for node unlearning evaluation, and the OGB-Collab dataset for edge unlearning evaluation. The detailed dataset statistics are summarized in Table 9.

Details on GRAPHERASER. GRAPHERASER is an exact unlearning method. It splits the original graph into multiple shards (i.e., subgraphs) and trains an independent model on each data shard. During inference, GRAPHERASER averages the predictions of the shard models as the final prediction. Upon receiving an unlearning request, GRAPHERASER only needs to re-train the shard model to which the deleted data belongs. For GRAPHERASER, we use the default official implementation for non-linear GNNs, and we also provide our own implementation for linear-GNNs to achieve a fair comparison with the other linear-GNN methods. For the official implementation, we use "balanced k-means graph partitioning" to split the original graph into 10 shards, "importance score aggregate" for model inference aggregation, and a mini-batch size of 512 for shard model training. Although the official implementation also provides "balanced label propagation" for graph partitioning, this method does not work well with large-scale graphs due to its high memory cost. For example, it takes 173GB for the Arxiv dataset, which is infeasible on our machine. For our implementation, we split all nodes into 8 shards using the graph partition algorithm METIS and use mean averaging for model aggregation. Each shard model is trained for enough epochs, and we return the model from the epoch with the highest validation score. We believe our implementation is general enough and captures the main spirit of GRAPHERASER, i.e., split the data into multiple shards and train a shard model on each shard. METIS allows us to split the original graph into multiple subgraphs while preserving the original graph structure as much as possible.
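The shard-and-retrain idea can be illustrated with a deliberately simplified sketch. Everything below is an assumption made for illustration and is not GRAPHERASER's implementation: the class name `ShardedModel` is hypothetical, a uniformly random partition stands in for balanced k-means or METIS, and a closed-form ridge regressor stands in for each shard's GNN.

```python
import numpy as np

def fit_ridge(X, Y, lam=1e-2):
    """Closed-form ridge regression, used here as a stand-in shard model."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

class ShardedModel:
    """Train one model per shard; unlearning retrains only the affected shard."""

    def __init__(self, X, Y, num_shards, seed=0):
        rng = np.random.default_rng(seed)
        self.X, self.Y = X, Y
        # Random partition (assumption): the paper's setups use k-means or METIS.
        parts = np.array_split(rng.permutation(len(X)), num_shards)
        self.shards = [set(p.tolist()) for p in parts]
        self.weights = [self._fit(s) for s in self.shards]

    def _fit(self, shard):
        idx = sorted(shard)
        return fit_ridge(self.X[idx], self.Y[idx])

    def predict(self, X_new):
        # Mean aggregation over the shard models' outputs.
        return np.mean([X_new @ W for W in self.weights], axis=0)

    def unlearn(self, node):
        """Drop `node` from its shard and retrain that shard only."""
        retrained = []
        for k, shard in enumerate(self.shards):
            if node in shard:
                shard.remove(node)
                self.weights[k] = self._fit(shard)
                retrained.append(k)
        return retrained

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))
Y = rng.normal(size=(120, 2))
model = ShardedModel(X, Y, num_shards=4)
old_weights = [W.copy() for W in model.weights]
retrained = model.unlearn(7)
```

The sketch makes the cost structure visible: each request retrains a full shard model, so the per-request cost grows with the shard size, in contrast to an update whose cost depends only on the number of deleted samples.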

B.4 BASELINE HYPER-PARAMETERS SETUP

Hyper-parameters for linear-GNNs. Unless otherwise specified, all baseline results are obtained with the same linear GNN architecture and the same extracted subgraphs as used in GRAPHEDITOR, for a fair comparison. We select the learning rate from {0.01, 0.001} and the regularization constant λ from $\{0, 10^{-4}, 10^{-5}, 10^{-6}\}$. Notice that the larger λ is, the fewer nodes/edges will be predicted as the extra-label category. Besides, we use a 2-hop subgraph with 15 nodes sampled per hop for the OGB-Arxiv dataset, a 2-hop subgraph with 10 nodes sampled per hop for the OGB-Products dataset, and a 1-hop subgraph with 100 nodes sampled per hop for the OGB-Collab dataset.

Hyper-parameters for non-linear GNNs. Unless otherwise specified, we did not explicitly conduct hyper-parameter selection for non-linear GNNs. Instead, we directly follow the hyper-parameters used in the existing official implementations. For example, the non-linear GNNs in Section 5.3 directly use the implementation and hyper-parameter choices from Zeng et al. (2021). The non-linear GNNs used with GRAPHERASER are the implementations provided by its authors, and we use their default hyper-parameters, changing only the hidden dimension.

B.5 DETAILS FOR NODE DELETION REPLAY TEST

For node unlearning, we consider multi-label node classification as the downstream task. Assume each node $v_i$ is associated with a node feature vector $x_i \in \mathbb{R}^d$ and a label $y_i \in \{1, \ldots, C\}$. Before training, we preprocess the features and labels as $x'_i \in \mathbb{R}^{d+1}$ and categorical label $y'_i \in \{1, \ldots, C+1\}$. In particular:

• For any node $v_i \in V_{rm}$, we set $x'_i = [x_i \mid 1]$ and $y'_i = C + 1$;
• For any node $v_i \notin V_{rm}$, we set $x'_i = [x_i \mid 0]$ and $y'_i = y_i$.

Similarly, for edge unlearning, we consider multi-label link prediction as the downstream task. Suppose edge $e_{ij} = (v_i, v_j)$ has feature $x_{ij} \in \mathbb{R}^d$ and label $y_{ij} \in \{1, \ldots, C\}$. Before training, we preprocess the features and categorical labels as $x'_{ij} \in \mathbb{R}^{d+1}$ and $y'_{ij} \in \{1, \ldots, C+1\}$:

• For any edge $e_{ij} \in E_{rm}$, we set $x'_{ij} = [x_{ij} \mid 1]$ and $y'_{ij} = C + 1$;
• For any edge $e_{ij} \notin E_{rm}$, we set $x'_{ij} = [x_{ij} \mid 0]$ and $y'_{ij} = y_{ij}$.

C GRAPHEDITOR: DETAILS, CORRECTNESS, AND TIME COMPLEXITY

In the following, we first introduce the closed-form solution before unlearning in Section C.1, then show how to remove and add information associated with the node features in Section C.2 and Section C.3, which rely on the following lemma.

Lemma 3 (Sherman-Morrison-Woodbury formula Sherman & Morrison (1950)). Suppose $X \in \mathbb{R}^{N \times N}$ is an invertible square matrix and $u, v \in \mathbb{R}^N$ are column vectors. Then $X + uv^\top$ is invertible if and only if $1 + v^\top X^{-1} u \neq 0$. In this case, we have

$$(X + uv^\top)^{-1} = X^{-1} - \frac{X^{-1} u v^\top X^{-1}}{1 + v^\top X^{-1} u}.$$

The Sherman-Morrison-Woodbury formula can also be generalized to a rank-$k$ modification of $X$ Petersen et al. (2008); Bishop et al. (1995). More specifically, for any $U, V \in \mathbb{R}^{N \times k}$ we have

$$(X + UV^\top)^{-1} = X^{-1} - X^{-1} U (I + V^\top X^{-1} U)^{-1} V^\top X^{-1}.$$

C.1 BEFORE UNLEARNING: CLOSED-FORM SOLUTION BY FINDW(X, Y)

Let $X \in \mathbb{R}^{N \times d_x}$ denote the input node feature matrix and $Y \in \mathbb{R}^{N \times d_y}$ the label matrix. Then, the closed-form solution is as follows:

$$W_\star = \arg\min_W \|XW - Y\|_F^2 + \lambda \|W\|_F^2 = (X^\top X + \lambda I_n)^{-1} X^\top Y.$$

The time complexity for computing $A = X^\top X + \lambda I_n \in \mathbb{R}^{d_x \times d_x}$ is $O(N d_x^2)$, the time complexity for computing $B = X^\top Y \in \mathbb{R}^{d_x \times d_y}$ is $O(N d_x d_y)$, the time complexity for computing $A^{-1}$ is $O(d_x^3)$, and the time complexity for computing $A^{-1} B$ is $O(d_x^2 d_y)$. Then, the total time complexity of computing the closed-form solution is $O(N d_x^2 + N d_x d_y + d_x^2 d_y)$.

C.2 GRAPH UNLEARNING: DELETE INFORMATION BY REMOVEDATA(X, Y, S, W)

Given the initial solution $S_\star$ and $W_\star$, we first update the inverse correlation matrix as

$$S_{rm} = S_\star + S_\star X_{rm}^\top [I - X_{rm} S_\star X_{rm}^\top]^{-1} X_{rm} S_\star,$$

and update the optimal solution by

$$W_{rm} = W_\star - S_\star X_{rm}^\top [I - X_{rm} S_\star X_{rm}^\top]^{-1} (Y_{rm} - X_{rm} W_\star).$$

Let $X_{\setminus i}, Y_{\setminus i}$ denote $X, Y$ with the $i$-th row deleted. By Lemma 3, we have

$$(X_{\setminus i}^\top X_{\setminus i} + \lambda I_n)^{-1} = (X^\top X + \lambda I_n - x_i x_i^\top)^{-1} = (X^\top X + \lambda I_n)^{-1} + \frac{(X^\top X + \lambda I_n)^{-1} x_i x_i^\top (X^\top X + \lambda I_n)^{-1}}{1 - x_i^\top (X^\top X + \lambda I_n)^{-1} x_i}.$$

Let us denote $S_\star = (X^\top X + \lambda I_n)^{-1}$ and $S_{rm} = (X_{\setminus i}^\top X_{\setminus i} + \lambda I_n)^{-1}$ for simplicity. Then, the above equality can be written as

$$S_{rm} = S_\star + \frac{S_\star x_i x_i^\top S_\star}{1 - x_i^\top S_\star x_i}.$$

Therefore, the optimal solution on the data after deletion can be written as

$$
\begin{aligned}
W_{rm} &= \arg\min_W \|X_{\setminus i} W - Y_{\setminus i}\|_F^2 + \lambda \|W\|_F^2 = (X_{\setminus i}^\top X_{\setminus i} + \lambda I_n)^{-1} X_{\setminus i}^\top Y_{\setminus i} \\
&= \left( S_\star + \frac{S_\star x_i x_i^\top S_\star}{1 - x_i^\top S_\star x_i} \right) (X^\top Y - x_i y_i^\top) \\
&= W_\star - S_\star x_i y_i^\top + \frac{S_\star x_i x_i^\top}{1 - x_i^\top S_\star x_i} W_\star - \frac{S_\star x_i x_i^\top S_\star}{1 - x_i^\top S_\star x_i} x_i y_i^\top \\
&= W_\star - \frac{S_\star x_i}{1 - x_i^\top S_\star x_i} \Big[ (1 - x_i^\top S_\star x_i) y_i^\top - x_i^\top W_\star + x_i^\top S_\star x_i \, y_i^\top \Big] \\
&= W_\star - \frac{S_\star x_i}{1 - x_i^\top S_\star x_i} (y_i^\top - x_i^\top W_\star).
\end{aligned}
$$

The above formulation can be written in matrix form as in Eq. 3 and Eq. 4, which allows GRAPHEDITOR to delete all samples in the node set $V_{rm} \cup V_{upd}$ in parallel.

Lemma 5. The time complexity of Eq. 6 and Eq. 7 is $O(d_x^2 + d_x d_y)$, where $d_x, d_y$ are the dimensions of $X, Y$.

Proof of Lemma 5. The time complexity of computing $a = S_\star x_i \in \mathbb{R}^{d_x}$ and $b = S_\star^\top x_i \in \mathbb{R}^{d_x}$ is $O(d_x^2)$, the time complexity of computing $c = y_i - W_\star^\top x_i \in \mathbb{R}^{d_y}$ is $O(d_x d_y)$, the time complexity of computing $ab^\top$ is $O(d_x^2)$, the time complexity of computing $ac^\top$ is $O(d_x d_y)$, and the time complexity of computing $x_i^\top S_\star x_i \in \mathbb{R}$ is $O(d_x^2)$. Therefore, the overall computation cost is $O(d_x^2 + d_x d_y)$.

Lemma 6. The time complexity of Eq. 3 and Eq. 4 is $O(M^3 + M d_x^2 + M d_x d_y)$, where $d_x, d_y$ are the dimensions of $X, Y$ and $M = |V_{rm} \cup V_{upd}|$.

Proof of Lemma 6. Suppose $M = |V_{rm} \cup V_{upd}|$. The time complexity to compute $A = S_\star X_{rm}^\top \in \mathbb{R}^{d_x \times M}$ is $O(M d_x^2)$, the time complexity to compute $B = I - X_{rm} S_\star X_{rm}^\top \in \mathbb{R}^{M \times M}$ is $O(M d_x^2)$, the time complexity to compute $B^{-1} \in \mathbb{R}^{M \times M}$ is $O(M^3)$, the time complexity to compute $C = X_{rm} S_\star \in \mathbb{R}^{M \times d_x}$ is $O(M d_x^2)$, the time complexity to compute $D = Y_{rm} - X_{rm} W_\star \in \mathbb{R}^{M \times d_y}$ is $O(M d_x d_y)$, the time complexity to compute $AB^{-1}C$ is $O(M^2 d_x)$, and the time complexity to compute $AB^{-1}D$ is $O(M^2 d_x + M d_x d_y)$.

C.3 GRAPH UNLEARNING: UPDATE INFORMATION BY ADDDATA(X, Y, S, W)

Let $X_{upd} = X[V_{upd}]$ and $Y_{upd} = Y[V_{upd}]$ denote the submatrices of $X, Y$ with rows indexed by $V_{upd}$. Then, we update the inverse correlation matrix by

$$S_{upd} = S_{rm} - S_{rm} X_{upd}^\top [I + X_{upd} S_{rm} X_{upd}^\top]^{-1} X_{upd} S_{rm},$$

and update the optimal solution by

$$W_{upd} = W_{rm} + S_{rm} X_{upd}^\top [I + X_{upd} S_{rm} X_{upd}^\top]^{-1} (Y_{upd} - X_{upd} W_{rm}).$$

Let $X_+, Y_+$ denote $X, Y$ with a new sample $(x_{n+1}, y_{n+1})$ appended as the $(n+1)$-th row. By Lemma 3, we have

$$(X^\top X + \lambda I_n + x_{n+1} x_{n+1}^\top)^{-1} = (X^\top X + \lambda I_n)^{-1} - \frac{(X^\top X + \lambda I_n)^{-1} x_{n+1} x_{n+1}^\top (X^\top X + \lambda I_n)^{-1}}{1 + x_{n+1}^\top (X^\top X + \lambda I_n)^{-1} x_{n+1}}.$$

Let us denote $S_{rm} = (X^\top X + \lambda I_n)^{-1}$ and $S_{upd} = (X_+^\top X_+ + \lambda I_n)^{-1}$ for simplicity. Then, the above equality can be written as

$$S_{upd} = S_{rm} - \frac{S_{rm} x_{n+1} x_{n+1}^\top S_{rm}}{1 + x_{n+1}^\top S_{rm} x_{n+1}}.$$

Then, the optimal solution on the data after adding the new data point can be written as

$$
\begin{aligned}
W_{upd} &= (X^\top X + \lambda I_n + x_{n+1} x_{n+1}^\top)^{-1} (X^\top Y + x_{n+1} y_{n+1}^\top) \\
&= \left( S_{rm} - \frac{S_{rm} x_{n+1} x_{n+1}^\top S_{rm}}{1 + x_{n+1}^\top S_{rm} x_{n+1}} \right) (X^\top Y + x_{n+1} y_{n+1}^\top) \\
&= W_{rm} + S_{rm} x_{n+1} y_{n+1}^\top - \frac{S_{rm} x_{n+1} x_{n+1}^\top}{1 + x_{n+1}^\top S_{rm} x_{n+1}} W_{rm} - \frac{S_{rm} x_{n+1} x_{n+1}^\top S_{rm}}{1 + x_{n+1}^\top S_{rm} x_{n+1}} x_{n+1} y_{n+1}^\top \\
&= W_{rm} - \frac{S_{rm} x_{n+1}}{1 + x_{n+1}^\top S_{rm} x_{n+1}} \Big[ -(1 + x_{n+1}^\top S_{rm} x_{n+1}) y_{n+1}^\top + x_{n+1}^\top W_{rm} + x_{n+1}^\top S_{rm} x_{n+1} \, y_{n+1}^\top \Big] \\
&= W_{rm} + \frac{S_{rm} x_{n+1}}{1 + x_{n+1}^\top S_{rm} x_{n+1}} (y_{n+1}^\top - x_{n+1}^\top W_{rm}).
\end{aligned}
$$

The above formulation can be written in matrix form as in Eq. 8 and Eq. 9, which allows updating all samples in $V_{upd}$ in parallel. The time complexity of the node information update is similar to Lemma 5 and Lemma 6, with $M = |V_{upd}|$.
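The three operations of this section can be checked numerically on synthetic data. The snippet below is a minimal NumPy sketch of the batch formulas, not the released implementation: the function names `find_w`, `remove_data`, and `add_data` mirror FINDW, REMOVEDATA, and ADDDATA but their signatures are assumptions (they take only the affected rows rather than the full matrices), and the random matrices are placeholders for real node representations.

```python
import numpy as np

def find_w(X, Y, lam):
    """Closed-form ridge solution; returns S = (X^T X + lam I)^{-1} and W = S X^T Y."""
    d = X.shape[1]
    S = np.linalg.inv(X.T @ X + lam * np.eye(d))
    return S, S @ X.T @ Y

def remove_data(X_rm, Y_rm, S, W):
    """Woodbury downdate: remove the rows (X_rm, Y_rm) from the solution (S, W)."""
    M = X_rm.shape[0]
    K = np.linalg.inv(np.eye(M) - X_rm @ S @ X_rm.T)   # M x M system only
    G = S @ X_rm.T @ K                                  # shared d_x x M factor
    return S + G @ X_rm @ S, W - G @ (Y_rm - X_rm @ W)

def add_data(X_add, Y_add, S, W):
    """Woodbury update: add the rows (X_add, Y_add) to the solution (S, W)."""
    M = X_add.shape[0]
    K = np.linalg.inv(np.eye(M) + X_add @ S @ X_add.T)
    G = S @ X_add.T @ K
    return S - G @ X_add @ S, W + G @ (Y_add - X_add @ W)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Y = rng.normal(size=(200, 3))
lam = 1.0

S, W = find_w(X, Y, lam)
rm, keep = np.arange(10), np.arange(10, 200)

# Unlearn the first 10 rows, then compare against retraining on the rest.
S_rm, W_rm = remove_data(X[rm], Y[rm], S, W)
S_ref, W_ref = find_w(X[keep], Y[keep], lam)
print("matches retraining:", np.allclose(W_rm, W_ref))

# Adding the same rows back should recover the original solution.
S_back, W_back = add_data(X[rm], Y[rm], S_rm, W_rm)
print("round trip recovers W:", np.allclose(W_back, W))
```

Note that neither update touches the remaining rows of X: only the M affected rows and the cached pair (S, W) are needed, which is the source of the dataset-size-independent cost stated in Lemma 6.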

C.4 CONNECTION TO SECOND-ORDER UNLEARNING

In the following, we study the connection between GRAPHEDITOR and second-order unlearning methods, e.g., the FISHER- and INFLUENCE-based approximate unlearning methods introduced in Golatkar et al. (2020); Guo et al. (2020). In particular, we show that GRAPHEDITOR is the same as applying one step of Newton's method using all remaining data, which requires time complexity $O(R d_x^2 + N d_x d_y + d_x^2 d_y)$, where $R = |V \setminus V_{rm}|$ is the number of remaining nodes. To see this, first recall that the gradient $\nabla L_{Ridge}(W_\star; \tilde{X}, \tilde{Y})$ and the Hessian $\nabla^2 L_{Ridge}(W_\star; \tilde{X})$ on the updated data $(\tilde{X}, \tilde{Y})$ are computed as

$$\nabla L_{Ridge}(W_\star; \tilde{X}, \tilde{Y}) = (\tilde{X}^\top \tilde{X} + \lambda I) W_\star - \tilde{X}^\top \tilde{Y}, \qquad \nabla^2 L_{Ridge}(W; \tilde{X}) = \tilde{X}^\top \tilde{X} + \lambda I.$$

Then, one step of Newton's method on the updated data $(\tilde{X}, \tilde{Y})$ is computed as

$$
\begin{aligned}
W^u_\star &= W_\star - \big[ \nabla^2 L_{Ridge}(W; \tilde{X}) \big]^{-1} \nabla L_{Ridge}(W; \tilde{X}, \tilde{Y}) \\
&= W_\star - (\tilde{X}^\top \tilde{X} + \lambda I)^{-1} \big[ (\tilde{X}^\top \tilde{X} + \lambda I) W_\star - \tilde{X}^\top \tilde{Y} \big] \\
&= (\tilde{X}^\top \tilde{X} + \lambda I)^{-1} \tilde{X}^\top \tilde{Y} = \arg\min_W L_{Ridge}(W; \tilde{X}, \tilde{Y}),
\end{aligned}
$$

which is equivalent to the solution of GRAPHEDITOR (Eq. 9). Notice that this property does not hold for FISHER-based unlearning Golatkar et al. (2020), because it directly optimizes the logistic regression objective. Similarly, this property does not hold for INFLUENCE-based unlearning Guo et al. (2020), because its gradient is computed on the deleted nodes.
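The one-step Newton equivalence is easy to verify numerically. The following is a small NumPy sketch on random data; the helper names `ridge_grad`, `ridge_hessian`, and `newton_step` and the sizes used are assumptions for illustration only.

```python
import numpy as np

def ridge_grad(W, X, Y, lam):
    """Gradient of the ridge objective, in the form used above."""
    return (X.T @ X + lam * np.eye(X.shape[1])) @ W - X.T @ Y

def ridge_hessian(X, lam):
    """Hessian of the ridge objective; it does not depend on W."""
    return X.T @ X + lam * np.eye(X.shape[1])

def newton_step(W, X, Y, lam):
    """One Newton step on the data (X, Y), starting from W."""
    return W - np.linalg.solve(ridge_hessian(X, lam), ridge_grad(W, X, Y, lam))

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
Y = rng.normal(size=(150, 2))
lam = 0.5

# Model trained on all data.
W_star = np.linalg.solve(ridge_hessian(X, lam), X.T @ Y)

# Remaining data after deleting the first 20 rows.
X_rem, Y_rem = X[20:], Y[20:]
W_newton = newton_step(W_star, X_rem, Y_rem, lam)
W_retrain = np.linalg.solve(ridge_hessian(X_rem, lam), X_rem.T @ Y_rem)
print("one Newton step equals retraining:", np.allclose(W_newton, W_retrain))
```

Because the objective is quadratic, the Newton step lands on the minimizer from any starting point, but it has to touch all R remaining rows to form the gradient and Hessian.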

D ON THE AFFECTED NODES SIZE WITH/WITHOUT SUBGRAPH SAMPLING

In this section, we investigate the number of affected nodes with and without subgraph sampling. Let $N(v_i)$ denote the set of 1-hop neighbors of node $v_i$, let $L$ be the depth of the underlying linear GNN model with node representations computed by $X = P^L H^{(0)}$, where $P = D^{-1/2} A D^{-1/2}$, and let $K$ be the depth of the sampled subgraph. In the following, we first show in Section D.1 that, without sampling, only nodes within the 2L-hop neighborhood of a deleted node (i.e., with shortest path distance not greater than 2L) are affected. Then we consider training with sampling, and show in Section D.2 that only nodes whose sampled subgraph contains the deleted node are affected.

D.1 PROOF OF LEMMA 1

Let us first consider the case when $L = 1$, i.e., we have $X = PH^{(0)}$. Suppose we want to delete node $v_k$. Since the propagation matrix is computed by $[P]_{i,j} = \frac{1}{\sqrt{\deg(v_i)\deg(v_j)}}$, all elements in the $k$-th row and the $k$-th column will be affected after deleting node $v_k$.

All 1-hop neighbors are affected. Suppose $v_l$ is a 1-hop neighbor of $v_k$. Before deleting node $v_k$, the representation of node $v_l$ is

$$x_l = \frac{1}{\sqrt{\deg(v_l)\deg(v_k)}} h^{(0)}_k + \sum_{v_j \in N(v_l) \setminus \{v_k\}} \frac{1}{\sqrt{\deg(v_l)\deg(v_j)}} h^{(0)}_j.$$

Since deleting node $v_k$ can be thought of as setting its node degree $\deg(v_k)$ to 0, the representations of all 1-hop neighbors are affected.

All 2-hop neighbors are affected. Suppose $v_l$ is a 1-hop neighbor of $v_k$, $v_m$ is a 2-hop neighbor of $v_k$, and $v_l$ is a 1-hop neighbor of $v_m$. Before deleting node $v_k$, the representation of node $v_m$ is

$$x_m = \frac{1}{\sqrt{\deg(v_m)\deg(v_l)}} h^{(0)}_l + \sum_{v_j \in N(v_m) \setminus \{v_l\}} \frac{1}{\sqrt{\deg(v_m)\deg(v_j)}} h^{(0)}_j.$$

Since $v_l$ is a 1-hop neighbor of $v_k$, deleting node $v_k$ reduces $\deg(v_l)$. Therefore, the representations of all 2-hop neighbors are also affected.

Neighbors that are more than 2 hops away are not affected. Since deleting a node only affects a single row and column of the propagation matrix, any neighbors that are more than 2 hops away are not affected.

Since the representation of an L-layer linear GNN can be thought of as $X = P(P^{L-1} H^{(0)}) = PH^{(L-1)}$, one can easily generalize the above logic and find that all 2L-hop neighbors are affected.
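The neighborhood argument can be checked empirically on a toy graph. The following NumPy sketch is an illustration under stated assumptions rather than part of the paper: it uses a 12-node path graph, one-hot features, no self-loops, and models deletion by zeroing the deleted node's row and column in the adjacency matrix; all function names are hypothetical.

```python
import numpy as np

def sym_norm(A):
    """P = D^{-1/2} A D^{-1/2}, treating isolated nodes as having zero rows."""
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    inv_sqrt[nz] = deg[nz] ** -0.5
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

def representations(A, H, L):
    """Linear-GNN representations X = P^L H."""
    return np.linalg.matrix_power(sym_norm(A), L) @ H

def hop_distances(A, source):
    """Breadth-first search distances from `source` (-1 if unreachable)."""
    dist = np.full(A.shape[0], -1)
    dist[source] = 0
    frontier = [source]
    while frontier:
        nxt = []
        for u in frontier:
            for v in np.flatnonzero(A[u]):
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return dist

def affected_nodes(A, H, L, k):
    """Remaining nodes whose representation changes after deleting node k."""
    A_del = A.copy()
    A_del[k, :] = 0.0
    A_del[:, k] = 0.0
    before = representations(A, H, L)
    after = representations(A_del, H, L)
    changed = ~np.all(np.isclose(before, after), axis=1)
    changed[k] = False                     # the deleted node itself is excluded
    return set(np.flatnonzero(changed).tolist())

# Path graph 0 - 1 - ... - 11 with one-hot features.
n = 12
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
H = np.eye(n)

dist = hop_distances(A, 5)
for L in (1, 2):
    hit = affected_nodes(A, H, L, k=5)
    print(L, sorted(hit), all(dist[v] <= 2 * L for v in hit))
```

For each depth, the script lists the nodes whose representation changes when node 5 is deleted and checks that all of them lie within 2L hops, in line with the bound of Lemma 1.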

D.2 PROOF OF LEMMA 2

When using subgraph sampling, the representation of any node $v_i$ depends only on the subgraph $G^{sg}_i(V^{sg}_i, E^{sg}_i)$ induced by the sampled nodes. When deleting node $v_k$, the subgraph $G^{sg}_i$ is affected only if $v_k \in V^{sg}_i$.

E DEPENDENCY ISSUE OF APPLYING EXISTING UNLEARNING APPROACHES

Note that this supplementary section is optional. Not reading this section will not affect your understanding of other parts of this paper. Most unlearning approaches Wu et al. (2020a); Guo et al. (2020); Izzo et al. (2021) are designed for the finite-sum problem with an i.i.d. assumption on all the training data. Directly generalizing these general machine unlearning methods to graph-structured data is infeasible due to the non-i.i.d. data issue caused by node dependency. In other words, one cannot directly unlearn a specific node $v_i$, but has to remove the effect of all its multi-hop neighbors in parallel when using these methods. In the following, we use Guo et al. (2020) as an example to illustrate the key issue. The discussion also applies to other machine unlearning methods that assume the input data is i.i.d. We first recall how the influence function is used to update the weight parameters in Guo et al. (2020), then highlight why node dependency makes applying Guo et al. (2020) to graph-structured data challenging, and introduce a solution to alleviate this issue.

Influence function in Guo et al. (2020). From $w_{\setminus n} = \arg\min_w L_{\setminus n}(w)$ and the convexity of the objective function $L_{\setminus n}$, we know that $\nabla L_{\setminus n}(w_{\setminus n}) = 0$. Therefore, using a first-order Taylor expansion around $w_\star$ and $\nabla L(w_\star) = 0$, we have

$$0 = \nabla L(w_{\setminus n}) - \nabla \ell(w_{\setminus n}^\top x_n, y_n) \approx -\nabla \ell(w_\star^\top x_n, y_n) + \big[ \nabla^2 L(w_\star) - \nabla^2 \ell(w_\star^\top x_n, y_n) \big] (w_{\setminus n} - w_\star).$$

Re-arranging the above equation, we have

$$w_{\setminus n} \approx w_\star + \underbrace{\big[ \nabla^2 L(w_\star) - \nabla^2 \ell(w_\star^\top x_n, y_n) \big]^{-1} \nabla \ell(w_\star^\top x_n, y_n)}_{\text{influence function}},$$

where the second term on the right-hand side is the so-called influence function.

Challenges due to dependency in graph. Note that the objective functions in Eq. 19 and Eq. 20 are finite-sum formulations. In the following, we show that directly using the second-order method in Guo et al. (2020) is not allowed due to the node dependency in the graph.
Before getting started, we first introduce some notation:

• Let us denote the graph before node deletion as $G$, where the graph structure is captured by the adjacency matrix $A \in \{0, 1\}^{n \times n}$ and the node feature matrix is $X$. The row-normalized propagation matrix is computed as $P = D^{-1} A$.
• Let us denote the graph after node deletion as $G_{\setminus n}$, where the graph structure is captured by the adjacency matrix $A_{\setminus n} \in \{0, 1\}^{(n-1) \times (n-1)}$ and the node feature matrix is $X_{\setminus n} \in \mathbb{R}^{(n-1) \times d}$. The row-normalized propagation matrix is computed as $P_{\setminus n} = D_{\setminus n}^{-1} A_{\setminus n}$.

For simplicity, let us only consider 1-hop SGC, which is already enough to illustrate why node dependency makes applying machine unlearning methods to graph-structured data challenging. In [...] (24)

Due to the inequality in (a), we cannot directly use the second-order method in Guo et al. (2020) to approximate $w_{\setminus n}$ from $w_\star$. Note that this equality is important in Eq. 21 before using the first-order Taylor expansion.

Get around this issue by deleting more nodes. One way to alleviate this issue is to unlearn all the affected nodes $V_{affect} = \{v_n\} \cup N(v_n)$ in parallel. To see this, according to the definition of $V_{affect}$, we know that $[PX]_i = [P_{\setminus n} X_{\setminus n}]_i$ for all $v_i \in V \setminus V_{affect}$, because the final-layer outputs of all nodes in $V \setminus V_{affect}$ remain the same after node deletion. Then, we can define the new objective function $L_{\setminus V_{affect}}(w)$ on the node set $V \setminus V_{affect}$ [...] (26)

As a result, we can approximate $w_{\setminus V_{affect}}$ by

$$w_{\setminus V_{affect}} = w_\star + \Big[ \nabla^2 L(w_\star) - \sum_{i \in V_{affect}} \nabla^2 \ell(w_\star^\top [PX]_i, y_i) \Big]^{-1} \sum_{i \in V_{affect}} \nabla \ell(w_\star^\top [PX]_i, y_i),$$

where both the Hessian and the gradient are computed on the original graph before node deletion.

F LINEAR-GNN IS ALMOST AS POWERFUL AS NON-LINEAR COUNTERPARTS

Please note that this supplementary section is optional; skipping it will not affect the understanding of the other parts of this paper. In this section, we summarize recent studies Wei et al. (2022); Wang & Zhang (2022) showing that linear-GNNs are almost as expressive as their non-linear counterparts. Although the two papers take different perspectives (i.e., Wei et al. (2022) studies Bayesian inference and Wang & Zhang (2022) studies spectral neural networks), they reach the similar conclusion that linear-GNNs are almost as expressive as their non-linear counterparts under some assumptions on the node features and the graph structure.



Please note that linearity is required not only by GRAPHEDITOR but also by most approximate unlearning methods (e.g., Guo et al. (2020); Golatkar et al. (2020); Wu et al. (2020a); Golatkar et al. (2021)) to theoretically show the data removal guarantee. Unless re-training from scratch, how to rigorously show the data removal guarantee in non-linear models is still an open problem, and it is non-trivial to verify empirically.

Similar assumptions are made in Golatkar et al. (2021) for convolutional neural network unlearning.

We do not compare the difference between the accuracy before and after unlearning because it is not necessarily related to whether the unlearning succeeds, but mainly to the amount of data to unlearn.

https://github.com/MinChen00/Graph-Unlearning
https://github.com/facebookresearch/certified-removal
https://github.com/AdityaGolatkar/SelectiveForgetting
https://github.com/facebookresearch/shaDow_GNN



Figure 2: Comparison on the difference of final activation prediction on deleted nodes (1st column) and testing nodes (2nd column) and difference of weight parameters (3rd column).

Figure 3: Comparison on the difference of the prediction probability of the target category obtained by the unlearned W u and the retrained W r models on deleted nodes

(Metric) Node (Accuracy) Node (Accuracy) Node (Accuracy) Node (Accuracy) Edge (Hits@50)

B.3 DETAILS ON BASELINE METHODS

In this paper, we consider the graph unlearning method GRAPHERASER Chen et al. (2021) and the general machine unlearning methods INFLUENCE Guo et al. (2020) and FISHER Golatkar et al. (2020) as baseline methods.

Details on INFLUENCE. INFLUENCE is an approximate unlearning method that unlearns by removing the influence of the deleted data on the model parameters. Formally, let $D_d \subset D$ denote the deleted subset of the training data, $D_r = D \setminus D_d$ the remaining data, $L(w)$ the objective function, and $w$ the model parameters before unlearning. Then, INFLUENCE unlearns by $w_u = w + H_r^{-1} g_d$, which is derived from the first-order Taylor approximation of the gradient, where $w_u$ denotes the parameters after unlearning, $H_r = \nabla^2 L(w, D_r)$ is the Hessian computed on the remaining data, and $g_d = \nabla L(w, D_d)$ is the gradient computed on the deleted data. To mitigate potential information leakage, INFLUENCE uses a perturbed objective function $L(w) + b^\top w$, where $b$ is random noise. Because INFLUENCE requires a logistic regression loss, we use the one-vs-rest strategy, which splits the multi-class classification into one binary classification problem per class, and train each with logistic regression. Besides, because INFLUENCE-based unlearning requires i.i.d. data and cannot handle graph-structured data directly, we opt to remove both the deleted and the affected nodes. Readers interested in the mathematical details may refer to Section E.

Details on FISHER. FISHER is an approximate unlearning method. FISHER performs Fisher forgetting by taking a single step of Newton's method on the remaining training data, then injecting noise into the model parameters to mitigate potential information leakage. The model parameters after unlearning are given by $w_u = w - H_r^{-1} g_r + H_r^{-1/4} b$, where $H_r = \nabla^2 L(w, D_r)$ is the Hessian and $g_r = \nabla L(w, D_r)$ is the gradient computed on the remaining data $D_r$, and $b$ is random noise.
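The two update rules above can be illustrated with a small numerical sketch. To keep it short and verifiable, it assumes a ridge regression objective (squared loss plus an L2 regularizer that stays with the remaining data) instead of the logistic regression loss used in the experiments; under this assumption the objective is quadratic, so a single Newton-type step recovers the retrained solution exactly. The function names and the synthetic data are hypothetical and are not taken from the baseline repositories.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimizer of 0.5 * ||X w - y||^2 + 0.5 * lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def hessian(X, lam):
    """Hessian of the ridge objective on data X (independent of w)."""
    return X.T @ X + lam * np.eye(X.shape[1])

def influence_unlearn(w, X_r, X_d, y_d, lam):
    """INFLUENCE-style update: w_u = w + H_r^{-1} g_d."""
    g_d = X_d.T @ (X_d @ w - y_d)  # gradient of the loss on the deleted data
    return w + np.linalg.solve(hessian(X_r, lam), g_d)

def fisher_unlearn(w, X_r, y_r, lam, b):
    """FISHER-style update: w_u = w - H_r^{-1} g_r + H_r^{-1/4} b."""
    H_r = hessian(X_r, lam)
    g_r = X_r.T @ (X_r @ w - y_r) + lam * w  # gradient on the remaining data
    evals, evecs = np.linalg.eigh(H_r)       # H_r is symmetric positive definite
    H_inv_quarter = (evecs * evals ** -0.25) @ evecs.T
    return w - np.linalg.solve(H_r, g_r) + H_inv_quarter @ b

# Synthetic i.i.d. data (an assumption of this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=60)
lam, num_deleted = 1.0, 5
X_d, y_d = X[:num_deleted], y[:num_deleted]
X_r, y_r = X[num_deleted:], y[num_deleted:]

w = ridge_fit(X, y, lam)              # model before unlearning
w_retrain = ridge_fit(X_r, y_r, lam)  # re-training from scratch
w_influence = influence_unlearn(w, X_r, X_d, y_d, lam)
w_fisher = fisher_unlearn(w, X_r, y_r, lam, b=np.zeros(4))  # noise switched off
print(np.abs(w_influence - w_retrain).max(), np.abs(w_fisher - w_retrain).max())
```

With a non-quadratic loss such as logistic regression the same steps are only approximations, and a non-zero noise vector `b` moves the FISHER solution away from the retrained one by design.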

Lemma 4. The time complexity for computing Eq. 2 is $O(N d_x^2 + N d_x d_y + d_x^2 d_y)$, where $d_x$ and $d_y$ are the dimensions of $X$ and $Y$, and $N = |V|$ is the number of nodes in the graph.

Proof of Lemma 4. The time complexity for computing

The influence function used in Guo et al. (2020) captures the change in model parameters due to removing a data point from the training set. Let $L(w)$ denote the finite-sum objective function computed on the full training set $\{x_i\}_{i=1}^{n}$ with optimal solution $w_\star = \arg\min_w L(w)$, where

$$L(w) = \sum_{i=1}^{n} \ell(w^\top x_i, y_i), \tag{19}$$

and let $L_{\setminus n}(w)$ denote the objective function without data point $(x_n, y_n)$, with optimal solution $w_{\setminus n} = \arg\min_w L_{\setminus n}(w)$, where

$$L_{\setminus n}(w) = \sum_{i=1}^{n-1} \ell(w^\top x_i, y_i) = L(w) - \ell(w^\top x_n, y_n). \tag{20}$$

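$$\begin{aligned}
0 &= \nabla L(w_{\setminus n}) - \nabla \ell(w_{\setminus n}^\top x_n, y_n) \\
&\overset{(a)}{\approx} \nabla L(w_\star) + \nabla^2 L(w_\star)(w_{\setminus n} - w_\star) - \nabla \ell(w_\star^\top x_n, y_n) - \nabla^2 \ell(w_\star^\top x_n, y_n)(w_{\setminus n} - w_\star) \\
&= \nabla L(w_\star) - \nabla \ell(w_\star^\top x_n, y_n) + \left[\nabla^2 L(w_\star) - \nabla^2 \ell(w_\star^\top x_n, y_n)\right](w_{\setminus n} - w_\star) \\
&\overset{(b)}{=} -\nabla \ell(w_\star^\top x_n, y_n) + \left[\nabla^2 L(w_\star) - \nabla^2 \ell(w_\star^\top x_n, y_n)\right](w_{\setminus n} - w_\star),
\end{aligned} \tag{21}$$

where (a) is the first-order Taylor expansion around $w_\star$ and (b) is due to $\nabla L(w_\star) = 0$ for $w_\star = \arg\min_w L(w)$.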

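In graph-structured data, let $L(w)$ denote the objective function computed on the full training graph $G$ with optimal solution $w_\star = \arg\min_w L(w)$, where

$$L(w) = \sum_{i=1}^{n} \ell(w^\top [PX]_i, y_i), \tag{23}$$

and let $L_{\setminus n}(w)$ denote the objective function computed on the graph $G_{\setminus n}$ without node $v_n$, with optimal solution $w_{\setminus n} = \arg\min_w L_{\setminus n}(w)$, where

$$L_{\setminus n}(w) = \sum_{i=1}^{n-1} \ell(w^\top [P_{\setminus n} X_{\setminus n}]_i, y_i) \overset{(a)}{\neq} L(w) - \ell(w^\top [PX]_n, y_n). \tag{24}$$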

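$$L_{\setminus V_{\text{affect}}}(w) = \sum_{i \in V \setminus V_{\text{affect}}} \ell(w^\top [P_{\setminus n} X_{\setminus n}]_i, y_i) = \sum_{i \in V \setminus V_{\text{affect}}} \ell(w^\top [PX]_i, y_i) \overset{(a)}{=} L(w) - \sum_{i \in V_{\text{affect}}} \ell(w^\top [PX]_i, y_i), \tag{25}$$

where the equality in (a) is what we are looking for and is similar to the last term in Eq. 20. To this end, let us define $w_{\setminus V_{\text{affect}}} = \arg\min_w L_{\setminus V_{\text{affect}}}(w)$; then we have

$$\begin{aligned}
0 &= \nabla L(w_{\setminus V_{\text{affect}}}) - \sum_{i \in V_{\text{affect}}} \nabla \ell(w_{\setminus V_{\text{affect}}}^\top [PX]_i, y_i) \\
&\approx \nabla L(w_\star) - \sum_{i \in V_{\text{affect}}} \nabla \ell(w_\star^\top [PX]_i, y_i) + \left[\nabla^2 L(w_\star) - \sum_{i \in V_{\text{affect}}} \nabla^2 \ell(w_\star^\top [PX]_i, y_i)\right](w_{\setminus V_{\text{affect}}} - w_\star) \\
&= -\sum_{i \in V_{\text{affect}}} \nabla \ell(w_\star^\top [PX]_i, y_i) + \left[\nabla^2 L(w_\star) - \sum_{i \in V_{\text{affect}}} \nabla^2 \ell(w_\star^\top [PX]_i, y_i)\right](w_{\setminus V_{\text{affect}}} - w_\star).
\end{aligned} \tag{26}$$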

Comparison on the accuracy (before parentheses), the number of deleted nodes that are predicted as the extra-label category before and after unlearning (inside parentheses), and wall-clock time.

(1) To measure the first criterion, we compare the number of deleted nodes that are predicted as the extra-label category. The predictions of the deleted nodes are computed on the graph structure before node deletion, hence the name "deleted data replay test". Intuitively, a successful unlearning method should never predict a node as the extra-label category after the unlearning process. As shown in Table 1, the exact unlearning methods GRAPHEDITOR and GRAPHERASER can unlearn all the information because none of the deleted nodes are predicted as the extra-label category. However, this is not the case for the approximate unlearning methods INFLUENCE and FISHER. (2) To measure the second criterion, we report the accuracy before and after unlearning. A strong unlearning method should not hurt model performance, and should therefore produce high accuracy both before and after unlearning. From Table 1, we see that GRAPHERASER, INFLUENCE, and FISHER sacrifice performance to achieve efficient unlearning, and therefore have lower performance both before and after unlearning. GRAPHERASER has lower performance because of data heterogeneity (the data distribution differs from shard to shard) and the lack of training data for each shard model (caused by graph partitioning). INFLUENCE and FISHER have lower performance because a large regularization term is required to stabilize the Hessian inverse computation, and also because of the random noise. (3) Moreover, we also compare the running time in Table

Table 2, we observe that: (1) GRAPHEDITOR attains accuracy similar to re-training from scratch but in shorter time; (2) GRAPHEDITOR has slightly lower precision than re-training because it has less data to train the feature extractor; (3) the performance of GRAPHERASER is lower than that of GRAPHEDITOR and re-training due to the lack of training data for each shard model, the data heterogeneity caused by graph partitioning, and its tendency to over-fit during the shard model training phase.

Comparison on the accuracy after unlearning and running time on non-linear GNNs.

SAGE 72.58 ± 0.17 (1,211 sec) 71.13 ± 0.19 (19 sec) 56.32 ± 0.23 (224 sec)
GAT 72.23 ± 0.18 (1,425 sec) 71.01 ± 0.19 (21 sec) 57.16 ± 0.56 (301 sec)
GCN 71.49 ± 0.27 (1,089 sec) 70.89 ± 0.30 (18 sec) 55.90 ± 0.51 (277 sec)
OGB-Products
SAGE 80.67 ± 0.37 (3,156 sec) 79.00 ± 0.39 (56 sec) 56.53 ± 0.50 (11,648 sec)
GAT 81.42 ± 0.31 (3,235 sec) 79.41 ± 0.37 (59 sec) 62.47 ± 0.62 (15,496 sec)
GCN 79.14 ± 0.44 (3,071 sec) 78.11 ± 0.48 (55 sec) 56.41 ± 0.49 (11,298 sec)

Connection to second-order unlearning
D On the affected nodes size with/without subgraph sampling
D.1 Proof of Lemma 1
D.2 Proof of Lemma 2
From Bayesian inference perspective
F.2 From spectral neural network perspective

Organization. In Section A, we provide additional experimental results on both node/edge unlearning and a comparison of linear GNNs to ordinary multi-layer non-linear GNNs. More specifically, we conduct the edge unlearning test in Appendix A.1, an ablation study on the effect of the subgraph sampling size on GRAPHEDITOR for node unlearning in Appendix A.2 and for edge unlearning in Appendix A.3, and the node addition test in Appendix A

Comparison on the accuracy (in front of the parentheses), the number of the deleted nodes that are predicted as the extra-label category before and after unlearning (inside the parentheses), and wall-clock time.

Table 4: (1) adding more neighbors per hop does not necessarily result in better model performance for the linear GNN used in GRAPHEDITOR, which can be explained by the over-smoothing hypothesis in Cong et al. (2021); Li et al. (2018); (2) fine-tuning brings around a 3% performance boost on node classification datasets, which indicates the importance of fine-tuning; (3) linear GNN can achieve results comparable to (or even better than) the ordinary multi-layer non-linear GNN with significantly less computation time, which motivates us to explore better feature engineering tricks for linear GNN as a future direction; (4) GRAPHERASER suffers from performance degradation due to data heterogeneity and the lack of training data on each subgraph, which is aligned with our observation on using GRAPHERASER with linear GNNs as reported in Table

Comparison on the effect of subgraph sampling and fine-tuning with ordinary GNNs for edge unlearning.

Comparison on the accuracy (in front of the parentheses), the number of the deleted nodes that are predicted as the extra-label category before and after node addition (inside the parentheses), and wall-clock time.

The time of unlearning 10 nodes with a single unlearning request using linear-GNN.

COMPARISON LINEAR-GNN WITH SHALLOW SAMPLER WITH DEEP MULTI-LAYER GNNS

Readers might wonder whether shallow sampling could cause performance degradation. Before answering this question, we would like to point out that it has been shown in existing works Zeng et al. (

Comparison linear-GNN with shallow sampler with deep multi-layer GNNs.

Statistics of the datasets used in our experiments.


F.1 FROM BAYESIAN INFERENCE PERSPECTIVE

Wei et al. (2022) compares linear-GNN and non-linear GNN from the Bayesian inference perspective. They consider binary node classification where the graph is randomly generated by contextual stochastic block models (CSBM), and they measure the success of node classification by the signal-to-noise ratio (SNR). They show that, under some assumptions on the CSBM, the SNR of non-linear GNN is of the same order as that of linear-GNN.

More specifically, Wei et al. (2022) assumes the random graphs are generated by CSBM, where each node has a feature vector $x_i \in \mathbb{R}^m$ and a binary category label $y_i \in \{+1, -1\}$. Let us suppose $G(V, E)$ is a graph generated by $\text{CSBM}(n, p, q, P_{+1}, P_{-1})$ with the following process:

• (Generate node labels) For each node $v_i \in V$, we randomly sample the label $y_i$ from $\{+1, -1\}$, where $n = |V|$ is the number of nodes.
• (Generate graph structure) If two nodes have the same label $y_i = y_j$, then with probability $p$ we add the edge $(v_i, v_j)$ to the edge set $E$. Otherwise, with probability $q$ we add the edge $(v_i, v_j)$ to the edge set $E$.
• (Generate node features) If a node has label $y_i = +1$, then its feature vector $x_i$ is sampled from $P_{+1}$. Otherwise, if $y_i = -1$, its feature vector $x_i$ is sampled from $P_{-1}$.

After that, Wei et al. (2022) formulates non-linear GNN and linear-GNN in the context of Bayesian inference, where the optimal propagation is derived from maximum a posteriori estimation. To classify a node $v_i$, the optimal non-linear propagation is defined in terms of $\psi(x) = \log(P_{+1}(x)/P_{-1}(x))$ and $\phi(\psi, \log(p/q)) = \text{ReLU}(\psi + \log(p/q)) - \text{ReLU}(\psi - \log(p/q)) - \log(p/q)$. Similarly, for linear-GNN, the optimal linear propagation is defined with a parameter $\alpha$ that balances information from the root node and its neighbors.

The minimal Bayesian mis-classification error is measured by the signal-to-noise ratio (SNR), denoted by $\rho_r$ for non-linear GNN and $\rho_l$ for linear-GNN.

They make the following assumptions on the random graph generator CSBM.
More specifically, Wei et al. (2022) assumes in Assumption 1 that the graph structure generated by $p, q$ is neither too strong (e.g., $p \to 1$ and $q \to 0$) nor too weak (e.g., the graph is too sparse), and assumes in Assumption 2 that the feature generation distributions for positive and negative nodes are not too different.

Assumption 1 (Assumption on graph structure). Let us define $S(p, q) = (p - q)^2/(p + q)$. They assume no very weak graph structure information, $S(p, q) = \omega_n((\log n)^2/n)$, and no very strong graph structure information, $S(p, q) \not\to |p - q|$.

Assumption 2 (Assumption on node features). Recall that $\mu_{+1}$ is the mean of the positive node feature distribution and $\mu_{-1}$ is the mean of the negative node feature distribution. The assumption requires that these two distributions are not too different.

Then, Wei et al. (2022) reaches the following conclusion on the SNR of linear-GNN and non-linear GNN. In particular, they show that non-linear GNN behaves similarly to linear-GNN, as their SNRs are of the same order. In other words, under some assumptions on the graph structure and node features, linear-GNN can be as expressive as non-linear GNN.

Theorem 1 (Theorem 2 part 1 of Wei et al. (2022)). If the CSBM satisfies Assumption 1 and Assumption 2, we have $\rho_r = \Theta_n(\rho_l)$.

When is non-linearity helpful? Besides, they show that non-linearity is helpful only if Assumption 2 does not hold, that is, only if the means of the positive and negative node feature distributions are different enough.

F.2 FROM SPECTRAL NEURAL NETWORK PERSPECTIVE

Wang & Zhang (2022) shows that linear-GNNs can produce arbitrary predictions under mild conditions on the Laplacian and node features, without relying on the non-linearity in the MLP. The expressive power of linear-GNNs mainly comes from their weighted combination of multi-hop graph convolution operators.

Let us define $L \in \mathbb{R}^{n \times n}$ as the Laplacian matrix in spectral GNNs, where $U$ is the matrix of eigenvectors of $L$ and $\Lambda$ is the diagonal matrix of eigenvalues. They make the following assumption on $L$.

Assumption 3 (Assumption on $L$).
No eigenvalues of $L$ are identical.

Let us denote $X \in \mathbb{R}^{n \times d}$ as the node features and $\tilde{X} = U^\top X$ as the graph Fourier transform of the node features $X$. They make the following assumption on $\tilde{X}$.

Assumption 4 (Assumption on $\tilde{X}$). No row of $\tilde{X}$ is a zero vector.

Given any target function $z = f(L, X) \in \mathbb{R}^{n \times 1}$ that we want to approximate via linear-GNN, Wang & Zhang (2022) shows that there exists a linear-GNN that can approximate $f(L, X)$ arbitrarily closely if Assumption 3 and Assumption 4 hold.

Theorem 2. Let us define $g_{\alpha, w}(L, X) = \sum_{\ell=1}^{k} \alpha_\ell L^\ell X w$ as the linear-GNN and $f$ as the target function we want to approximate. Under Assumption 3 and Assumption 4, there always exists a set of $\alpha_\star \in \mathbb{R}^k$, $w_\star \in \mathbb{R}^d$ such that $g_{\alpha_\star, w_\star}(L, X) = f(L, X)$.

In practice, Wang & Zhang (2022) found that Assumption 3 and Assumption 4 are very likely to hold on real-world datasets.

When is non-linearity helpful? They show that adding a non-linear MLP to linear-GNNs can relax the conditions on node features (i.e., Assumption 4) because the output of a multi-layer neural network is very likely to satisfy this condition. However, adding non-linearity does not necessarily improve the expressive power if the conditions are already satisfied in the first place.
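The statement of Theorem 2 can be illustrated on a toy example. The sketch below assumes a hypothetical 5-node path graph, whose unnormalized Laplacian has distinct eigenvalues (Assumption 3), together with random features and a random $w$, so that the transformed signal generically has no zero entry. As a further assumption of the sketch, the polynomial includes the $\ell = 0$ term and uses $k = n$ coefficients, because the path-graph Laplacian has a zero eigenvalue; the names `linear_gnn` and `fit_alpha` are illustrative only.

```python
import numpy as np

def linear_gnn(alpha, w, L, X):
    """Compute sum_l alpha[l] * L^l X w, with powers l = 0, ..., len(alpha) - 1."""
    out = np.zeros(L.shape[0])
    h = X @ w                  # L^0 X w
    for a in alpha:
        out += a * h
        h = L @ h              # next power of L applied to X w
    return out

def fit_alpha(L, X, w, z):
    """Find polynomial coefficients alpha such that linear_gnn(alpha, w, L, X) = z."""
    evals, U = np.linalg.eigh(L)            # L = U diag(evals) U^T
    xw_hat = U.T @ (X @ w)                  # Fourier transform of X w, no zero entry assumed
    z_hat = U.T @ z                         # Fourier transform of the target
    V = np.vander(evals, increasing=True)   # V[i, l] = evals[i] ** l
    # Distinct eigenvalues make V invertible (polynomial interpolation).
    return np.linalg.solve(V, z_hat / xw_hat)

# Hypothetical toy setup: path graph on 5 nodes with random features and target.
n, d = 5, 3
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A              # unnormalized graph Laplacian

rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
z = rng.normal(size=n)                      # arbitrary target values

alpha = fit_alpha(L, X, w, z)
print(np.abs(linear_gnn(alpha, w, L, X) - z).max())
```

The fit reduces to interpolating one value per eigenvalue, which is why both assumptions matter: repeated eigenvalues make the Vandermonde system singular, and a zero entry in the transformed signal leaves the corresponding frequency of the target unreachable.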

