EFFICIENT MODEL UPDATES FOR APPROXIMATE UNLEARNING OF GRAPH-STRUCTURED DATA

Abstract

With the adoption of recent laws ensuring the "right to be forgotten", the problem of machine unlearning has become of significant importance. This is particularly the case for graph-structured data and for learning tools specialized for such data, including graph neural networks (GNNs). This work introduces the first known approach for approximate graph unlearning with provable theoretical guarantees. The challenges in addressing the problem are two-fold. First, there exist multiple types of unlearning requests that need to be considered, including node feature, edge and node unlearning. Second, to establish provable performance guarantees, one needs to carefully evaluate the process of feature mixing during propagation. We focus on analyzing Simple Graph Convolutions (SGC) and their generalized PageRank (GPR) extensions, thereby laying the theoretical foundations for unlearning GNNs. Empirical evaluations on six benchmark datasets demonstrate excellent performance/complexity/privacy trade-offs of our approach compared to complete retraining and to general methods that do not leverage graph information. For example, unlearning 200 out of 1208 training nodes of the Cora dataset only leads to a 0.1% loss in test accuracy, but offers a 4-fold speed-up compared to complete retraining with a $(\epsilon, \delta) = (1, 10^{-4})$ "privacy cost". We also exhibit a 12% increase in test accuracy for the same dataset when compared to unlearning methods that do not leverage graph information, with comparable time complexity and the same privacy guarantee. Our code is available online.

1. INTRODUCTION

Machine learning algorithms are used in many application domains, including biology, computer vision and natural language processing. Relevant models are often trained on third-party datasets, internal datasets, or customized subsets of publicly available user data. For example, many computer vision models are trained on images from Flickr users (Thomee et al., 2016; Guo et al., 2020), while many natural language processing (e.g., sentiment analysis) and recommender systems heavily rely on repositories such as IMDB (Maas et al., 2011). Furthermore, numerous ML classifiers in computational biology are trained on data from the UK Biobank (Sudlow et al., 2015), which represents a collection of genetic and medical records of roughly half a million participants (Ginart et al., 2019). With recent demands for increased data privacy, the above referenced and many other data repositories are facing increasing demands for data removal. Certain laws are already in place guaranteeing the rights of certified data removal, including the European Union's General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA) and the Canadian Consumer Privacy Protection Act (CPPA) (Sekhari et al., 2021). Removing user data from a dataset is insufficient to guarantee the desired level of privacy, since models trained on the original data may still contain information about their patterns and features. This consideration gave rise to a new research direction in machine learning, referred to as machine unlearning (Cao & Yang, 2015), in which the goal is to guarantee that the user data information is also removed from the trained model. Naively, one can retrain the model from scratch to meet the privacy demand, yet retraining comes at a high computational cost and is thus not practical when accommodating frequent removal requests. To avoid complete retraining, various methods for machine unlearning have been proposed, including exact approaches (Ginart et al., 2019; Bourtoule et al., 2021) as well as approximate methods (Guo et al., 2020; Sekhari et al., 2021).

Figure 1: Illustration of three different types of approximate graph unlearning problems and a comparison with the case of unlearning without graph information (Guo et al., 2020). The colors of the nodes capture properties of node features, and the red frame indicates node embeddings affected by 1-hop propagation. When no graph information is used, the node embeddings are uncorrelated. However, for the case of graph unlearning problems, removing one node or edge can affect the node embeddings of the entire graph for a large enough number of propagation steps.

At the same time, graph-centered machine learning has received significant interest from the learning community due to the ubiquity of graph-structured data. Usually, the data contains two sources of information: node features and graph topology. Graph Neural Networks (GNNs) leverage both types of information simultaneously and achieve state-of-the-art performance in numerous real-world applications, including Google Maps (Derrow-Pinion et al., 2021), various recommender systems (Ying et al., 2018), self-driving cars (Gao et al., 2020) and bioinformatics (Zhang et al., 2021b). Clearly, user data is involved in training the underlying GNNs and it may therefore be subject to removal. However, it is still unclear how to perform unlearning of GNNs. We take the first step towards solving the approximate unlearning problem by performing a nontrivial theoretical analysis of some simplified GNN architectures. Inspired by the unstructured data certified removal procedure (Guo et al., 2020), we propose the first known approach for approximate graph unlearning. Our main contributions are as follows.
First, we introduce three types of data removal requests for graph unlearning: node feature unlearning, edge unlearning and node unlearning (see Figure 1). Second, we derive theoretical guarantees for approximate graph unlearning mechanisms for all three removal cases on SGC (Wu et al., 2019) and their GPR generalizations. In particular, we analyze $L_2$-regularized graph models trained with differentiable convex loss functions. The analysis is challenging since propagation on graphs "mixes" node features. Our analysis reveals that the degree of the unlearned node plays an important role in the unlearning process, while the number of propagation steps may or may not be important for different unlearning scenarios. To the best of our knowledge, the theoretical guarantees established in this work are the first provable approximate unlearning results for graphs. Furthermore, the proposed analysis encompasses both node classification and node regression problems. Third, our empirical investigation on frequently used datasets for GNN learning shows that our method offers an excellent performance-complexity-privacy trade-off. For example, when unlearning 200 out of 1208 training nodes of the Cora dataset, our method offers test accuracy comparable to that of complete retraining, but with a 4-fold speed-up and a $(\epsilon, \delta) = (1, 10^{-4})$ "privacy cost". We also test our model on datasets for which removal requests are most likely to arise, including Amazon co-purchase networks. Due to space limitations, all proofs and some detailed discussions are relegated to the Appendix.

2. RELATED WORKS

Machine unlearning and certified data removal. Cao & Yang (2015) introduced the concept of machine unlearning and proposed distributed learners for exact unlearning. Bourtoule et al. (2021) introduced sharding-based methods for unlearning, while Ginart et al. (2019) described unlearning approaches for k-means clustering. These works focused on exact unlearning: The unlearned model is required to perform identically to a completely retrained model. As an alternative, Guo et al. (2020) introduced a probabilistic definition of unlearning motivated by differential privacy (Dwork, 2011) . Sekhari et al. (2021) studied the generalization performance of machine unlearning methods. Golatkar et al. (2020) proposed heuristic-based selective forgetting in deep networks. These probabilistic approaches naturally allow for "approximate" unlearning. None of these works addressed the machine unlearning problem on graphs. To the best of our knowledge, the only work in this direction is GraphEraser (Chen et al., 2021) . However, the strategy proposed therein uses sharding, which only works for exact unlearning and is hence completely different from our approximate approach. Also, the approach in Chen et al. (2021) relies on partitioning the graph using community detection methods. It therefore implicitly makes the assumption that the graph is homophilic which is not warranted in practice (Chien et al., 2021b; Lim et al., 2021) . In contrast, our method works for arbitrary graphs and allows for approximate unlearning while ensuring excellent trade-offs among performance, privacy and complexity. Differential privacy (DP) and DP-GNNs. Machine unlearning, especially the approximation version described in Guo et al. (2020) , is closely related to differential privacy (Dwork, 2011) . In fact, differential privacy is a sufficient condition for machine unlearning. 
If a model is differentially private, then the adversary cannot distinguish whether the model is trained on the original dataset or on a dataset in which one data point is removed. Hence, even without model updating, a DP model will automatically unlearn the removed data point (see also the explanation in (Ginart et al., 2019; Sekhari et al., 2021) and Figure 4 ). Although DP is a sufficient condition for unlearning, it is not a necessary condition. Also, most of the DP models suffer from a significant degradation in performance even when the privacy constraint is loose (Chaudhuri et al., 2011; Abadi et al., 2016) . Machine unlearning can therefore be viewed as a means to trade-off between performance and computational cost, with complete retraining and DP on two different ends of the spectrum (Guo et al., 2020) . Several recent works proposed DP-GNNs (Daigavane et al., 2021; Olatunji et al., 2021a; Wu et al., 2021; Sajadmanesh et al., 2022) -however, even for unlearning one single node or edge, these methods require a high "privacy cost" (ϵ ≥ 5) to learn with sufficient accuracy. Graph neural networks. While GNNs are successfully used for many graph-related problems, accompanying theoretical analyses are usually difficult due to the combination of nonlinear feature transformation and graph propagation. Recently, several simplified GNN models were proposed that can further the theoretical understanding of their performance and scalability. SGCs (Wu et al., 2019) simplify Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017) via linearization (i.e., through the removal of all nonlinearities); although SGC in general underperforms compared to state-of-the-art GNNs, they still offer competitive performance on many datasets. The analysis of SGCs elucidated the relationship between low-pass graph filtering and GCNs which reveals both advantages and potential limitations of GNNs. 
The GPR generalization of SGC is closely related to many important models that resolve different issues inherent to GNNs. For example, GPRGNN (Chien et al., 2021b) addresses the problem of universal learning on homophilic and heterophilic graph datasets and the issue of over-smoothing. SIGN-based graph models (Frasca et al., 2020) and S$^2$GC (Zhu & Koniusz, 2020) allow for arbitrarily sized mini-batch training, which improves scalability and leads to further performance improvements of methods (Sun et al., 2021; Zhang et al., 2021a; Chien et al., 2022a) on the Open Graph Benchmark leaderboard (Hu et al., 2020). Hence, developing approximate graph unlearning approaches for SGC and generalizations thereof is not only of theoretical interest, but also of practical importance.

3. PRELIMINARIES

Notation. We reserve bold-font capital letters such as $\mathbf{S}$ for matrices and bold-font lowercase letters such as $\mathbf{s}$ for vectors. We use $\mathbf{e}_i$ to denote the $i$-th standard basis vector, so that $\mathbf{e}_i^T \mathbf{S}$ and $\mathbf{S}\mathbf{e}_i$ represent the $i$-th row and column vector of $\mathbf{S}$, respectively. The absolute value $|\cdot|$ is applied component-wise on both matrices and vectors. We also use the symbols $\mathbf{1}$ for the all-one vector and $\mathbf{I}$ for the identity matrix. Furthermore, we let $G = (V, E)$ stand for an undirected graph with node set $V = [n]$ of size $n$ and edge set $E$. The symbols $\mathbf{A}$ and $\mathbf{D}$ are used to denote the corresponding adjacency and node degree matrix, respectively. The feature matrix is denoted by $\mathbf{X} \in \mathbb{R}^{n \times F}$, where $F$ is the feature dimension. For binary classification, the labels are summarized in $\mathbf{Y} \in \{-1, 1\}^n$, while the nonbinary case is discussed in Section 5. The relevant norms are $\|\cdot\|$, the $\ell_2$ norm, and $\|\cdot\|_F$, the Frobenius norm. Note that we use $\|\cdot\|$ for both row and column vectors to simplify the notation. The matrices $\mathbf{A}$ and $\mathbf{D}$ should not be confused with the symbols for an algorithm $A$ and a dataset $\mathcal{D}$.

Certified removal.

Let $A$ be a (randomized) learning algorithm that trains on $\mathcal{D}$, the set of data points before removal, and outputs a model $h \in \mathcal{H}$, where $\mathcal{H}$ represents a chosen space of models. The removal of a subset of points from $\mathcal{D}$ results in $\mathcal{D}'$. For instance, let $\mathcal{D} = (\mathbf{X}, \mathbf{Y})$. Suppose we want to remove a data point $(\mathbf{e}_i^T \mathbf{X}, \mathbf{e}_i^T \mathbf{Y})$ from $\mathcal{D}$, resulting in $\mathcal{D}' = (\mathbf{X}', \mathbf{Y}')$. Here, $\mathbf{X}', \mathbf{Y}'$ are equal to $\mathbf{X}, \mathbf{Y}$, respectively, except that the row corresponding to the removed data point is deleted. Given $\epsilon, \delta > 0$ and the space of possible datasets $\mathcal{X}$, an unlearning algorithm $M$ applied to $A(\mathcal{D})$ is said to guarantee an $(\epsilon, \delta)$-certified removal for $A$ if, for all $\mathcal{T} \subseteq \mathcal{H}$, $\mathcal{D} \subseteq \mathcal{X}$ and $i \in [n]$,

$$P\big(M(A(\mathcal{D}), \mathcal{D}, \mathcal{D} \setminus \mathcal{D}') \in \mathcal{T}\big) \le \exp(\epsilon)\, P\big(A(\mathcal{D}') \in \mathcal{T}\big) + \delta,$$
$$P\big(A(\mathcal{D}') \in \mathcal{T}\big) \le \exp(\epsilon)\, P\big(M(A(\mathcal{D}), \mathcal{D}, \mathcal{D} \setminus \mathcal{D}') \in \mathcal{T}\big) + \delta. \qquad (1)$$

This definition is related to $(\epsilon, \delta)$-DP (Dwork, 2011), except that we are allowed to update the model based on the removed point (see Figure 4). An $(\epsilon, \delta)$-certified removal method guarantees that the updated model $M(A(\mathcal{D}), \mathcal{D}, \mathcal{D} \setminus \mathcal{D}')$ is "approximately" the same as the model $A(\mathcal{D}')$ obtained by retraining from scratch. Thus, any information about the removed data $\mathcal{D} \setminus \mathcal{D}'$ is "approximately" eliminated from the model. Ideally, we would like to design $M$ such that it satisfies equation (1) and has a complexity that is significantly smaller than that of complete retraining.
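To make the two inequalities in equation (1) concrete, the following minimal sketch checks them for models drawn from a finite set. It relies on the standard fact that the worst-case event collects exactly those outcomes where one probability exceeds $\exp(\epsilon)$ times the other. The function names and the three-model distributions are hypothetical and serve only as an illustration; they are not part of the proposed method.

```python
import math

def certified_removal_gap(p, q, eps):
    """Smallest delta with P(T) <= exp(eps) * Q(T) + delta for every event T.

    p and q map outcomes (models) to probabilities. The worst-case event T
    contains every outcome where p exceeds exp(eps) * q.
    """
    outcomes = set(p) | set(q)
    return sum(max(p.get(o, 0.0) - math.exp(eps) * q.get(o, 0.0), 0.0)
               for o in outcomes)

def satisfies_certified_removal(unlearned, retrained, eps, delta):
    """Check both directions of the (eps, delta) condition in equation (1)."""
    return (certified_removal_gap(unlearned, retrained, eps) <= delta
            and certified_removal_gap(retrained, unlearned, eps) <= delta)

# Hypothetical output distributions over three candidate models.
unlearned = {"h1": 0.50, "h2": 0.30, "h3": 0.20}
retrained = {"h1": 0.45, "h2": 0.35, "h3": 0.20}

print(satisfies_certified_removal(unlearned, retrained, eps=0.2, delta=1e-4))
print(satisfies_certified_removal(unlearned, retrained, eps=0.0, delta=1e-4))
```

With $\epsilon = 0$ the condition reduces to a total-variation-style comparison, which the two hypothetical distributions fail; a small positive $\epsilon$ absorbs the difference.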

4. APPROXIMATE GRAPH UNLEARNING WITH THEORETICAL GUARANTEES

Unlike standard machine unlearning, approximate graph unlearning uses datasets that contain not only node features $\mathbf{X}$ but also the graph topology $\mathbf{A}$, and therefore requires different data removal procedures. We focus on node classification, for which the training dataset equals $\mathcal{D} = (\mathbf{X}, \mathbf{Y}_{T_r}, \mathbf{A})$. Here, $\mathbf{Y}_{T_r}$ is identical to $\mathbf{Y}$ on rows indexed by points of the training set $T_r$, while the remaining rows are all zeros. Without loss of generality, we assume that the training set comprises the first $m$ nodes (i.e., $T_r = [m]$), where $m \le n$. An unlearning method $M$ achieves $(\epsilon, \delta)$-approximate graph unlearning with algorithm $A$ if equation (1) is satisfied for $\mathcal{D} = (\mathbf{X}, \mathbf{Y}_{T_r}, \mathbf{A})$ and $\mathcal{D}'$, which differ based on the type of request: node feature unlearning, edge unlearning, or node unlearning.

4.1. UNLEARNING SGC AND COMPARISON WITH UNSTRUCTURED UNLEARNING

SGC is a simplification of GCN obtained by removing all nonlinearities from the latter model. This leads to the following update rule: $\mathbf{P}^K \mathbf{X}\mathbf{W} \triangleq \mathbf{Z}\mathbf{W}$, where $\mathbf{W}$ denotes the matrix of learnable weights, $K \ge 0$ equals the number of propagation steps and $\mathbf{P}$ denotes the one-step propagation matrix. The standard choice of the propagation matrix is the symmetric normalized adjacency matrix with self-loops, $\mathbf{P} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ and $\tilde{\mathbf{D}}$ equals the degree matrix with respect to $\tilde{\mathbf{A}}$. We will work with the asymmetric normalized version of $\mathbf{P}$, $\mathbf{P} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}}$. This choice is made purely for analytical purposes, and our empirical results confirm that this normalization ensures the competitive performance of our unlearning methods. The resulting node embedding is used for node classification by choosing an appropriate loss (e.g., logistic loss) and minimizing the $L_2$-regularized empirical risk. For binary classification, $\mathbf{W}$ can be replaced by a vector $\mathbf{w}$; the loss equals

$$L(\mathbf{w}, \mathcal{D}) = \sum_{i:\, \mathbf{e}_i^T \mathbf{Y}_{T_r} \ne 0} \left( \ell(\mathbf{e}_i^T \mathbf{Z}\mathbf{w}, \mathbf{e}_i^T \mathbf{Y}_{T_r}) + \frac{\lambda}{2}\|\mathbf{w}\|^2 \right),$$

where $\ell(\mathbf{e}_i^T \mathbf{Z}\mathbf{w}, \mathbf{e}_i^T \mathbf{Y}_{T_r})$ is a convex loss function that is differentiable everywhere. We also write $\mathbf{w}^\star = A(\mathcal{D}) = \arg\min_{\mathbf{w}} L(\mathbf{w}, \mathcal{D})$, where the optimizer is unique whenever $\lambda > 0$.

We start with a high-level description of the approximate unstructured unlearning approach introduced in Guo et al. (2020). Note that "certified removal" in the context of the former work refers to approximate unlearning that provably satisfies (1). Let us denote the Hessian of $L(\cdot; \mathcal{D}')$ at $\mathbf{w}^\star$ by $\mathbf{H}_{\mathbf{w}^\star} = \nabla^2 L(\mathbf{w}^\star; \mathcal{D}')$. The authors of Guo et al. (2020) propose the following mechanism for unlearning the $m$-th training point:

$$\mathbf{w}^- = M(\mathbf{w}^\star, \mathcal{D}, \mathcal{D} \setminus \mathcal{D}') = \mathbf{w}^\star + \mathbf{H}_{\mathbf{w}^\star}^{-1} \Delta_{\mathrm{guo}}, \quad \text{where } \Delta_{\mathrm{guo}} = \lambda \mathbf{w}^\star + \nabla \ell(\mathbf{e}_m^T \mathbf{X}\mathbf{w}^\star, \mathbf{e}_m^T \mathbf{Y}_{T_r}).$$

When $\nabla L(\mathbf{w}^-, \mathcal{D}') = 0$, $\mathbf{w}^-$ is the unique optimizer of $L(\cdot; \mathcal{D}')$. If $\nabla L(\mathbf{w}^-, \mathcal{D}') \ne 0$, then information about the removed data point remains present in the model. One can show that the gradient residual norm $\|\nabla L(\mathbf{w}^-, \mathcal{D}')\|$ determines the error of $\mathbf{w}^-$ when used to approximate the true minimizer of $L(\cdot; \mathcal{D}')$ (Guo et al., 2020). Hence, upper bounds on $\|\nabla L(\mathbf{w}^-, \mathcal{D}')\|$ can be used to establish approximate unlearning guarantees. More precisely, assume that we have $\|\nabla L(\mathbf{w}^-, \mathcal{D}')\| \le \epsilon'$ for some $\epsilon' > 0$. Furthermore, consider training with the noisy loss

$$L_{\mathbf{b}}(\mathbf{w}, \mathcal{D}) = \sum_{i:\, \mathbf{e}_i^T \mathbf{Y}_{T_r} \ne 0} \left( \ell(\mathbf{e}_i^T \mathbf{X}\mathbf{w}, \mathbf{e}_i^T \mathbf{Y}_{T_r}) + \frac{\lambda}{2}\|\mathbf{w}\|^2 \right) + \mathbf{b}^T \mathbf{w},$$

where $\mathbf{b}$ is drawn randomly according to some distribution. Then one can leverage the following result.

Theorem 4.1 (Theorem 3 from Guo et al. (2020)). Let $A$ be the learning algorithm that returns the unique optimum of the loss $L_{\mathbf{b}}(\mathbf{w}, \mathcal{D})$. Suppose that $\|\nabla L_{\mathbf{b}}(\mathbf{w}^-, \mathcal{D}')\| \le \epsilon'$, for some computable bound $\epsilon' > 0$ independent of $\mathbf{b}$ and achieved by $M$. If $\mathbf{b} \sim \mathcal{N}(0, (c_0 \epsilon'/\epsilon)^2)^d$ with $c_0 > 0$, then $M$ satisfies (1) with parameters $(\epsilon, \delta)$ for algorithm $A$ applied to $\mathcal{D}'$, where $\delta = 1.5\, e^{-c_0^2/2}$.

Hence, if we can prove that $\|\nabla L(\mathbf{w}^-, \mathcal{D}')\|$ is appropriately bounded for the graph setting as well, then the unlearning mechanism $M$ will ensure $(\epsilon, \delta)$-approximate graph unlearning. One of the main contributions of Guo et al. (2020) is to bound the gradient residual norm $\|\nabla L(\mathbf{w}^-, \mathcal{D}')\|$ of the proposed unlearning mechanism with $\Delta_{\mathrm{guo}} = \lambda \mathbf{w}^\star + \nabla \ell(\mathbf{e}_m^T \mathbf{X}\mathbf{w}^\star, \mathbf{e}_m^T \mathbf{Y}_{T_r})$.

Motivated by the unlearning approach from Guo et al. (2020) pertaining to unstructured data, we design an unlearning mechanism for graphs. We generalize their unlearning mechanism by replacing $\Delta_{\mathrm{guo}}$ with $\Delta = \nabla L(\mathbf{w}^\star, \mathcal{D}) - \nabla L(\mathbf{w}^\star, \mathcal{D}')$. As an illustrative example, for node unlearning we consequently have

$$\Delta = \lambda \mathbf{w}^\star + \nabla \ell(\mathbf{e}_m^T \mathbf{Z}\mathbf{w}^\star, \mathbf{e}_m^T \mathbf{Y}_{T_r}) + \sum_{i=1}^{m-1} \left[ \nabla \ell(\mathbf{e}_i^T \mathbf{Z}\mathbf{w}^\star, \mathbf{e}_i^T \mathbf{Y}_{T_r}) - \nabla \ell(\mathbf{e}_i^T \mathbf{Z}'\mathbf{w}^\star, \mathbf{e}_i^T \mathbf{Y}_{T_r}) \right]. \qquad (2)$$

Note that our generalized unlearning mechanism matches that of Guo et al. (2020) when no graph information is present. This can be seen by setting $K = 0$, which leads to $\mathbf{Z} = \mathbf{X}$ and $\mathbf{e}_i^T \mathbf{Z} = \mathbf{e}_i^T \mathbf{Z}'$ for all $i \in [m-1]$. Hence, the third term in equation (2) is zero and thus $\Delta = \Delta_{\mathrm{guo}}$. Note that when graph information is present, the third term in equation (2) is, in general, bounded away from zero. This term captures the impact of unlearning node $m$ on all remaining training nodes $[m-1]$, including effects pertaining to edge and feature removal. This highlights not only the necessity of investigating generalized unlearning mechanisms, but also the main difficulty of extending the analysis of Guo et al. (2020) to graphs. A more detailed discussion regarding the intuition behind our approach can be found in Appendix A.3.

The main technical contribution of our work is to establish bounds on the gradient residual norm for all three types of graph unlearning scenarios. For this analysis, we need the loss function $\ell$ to satisfy the following properties.

Assumption 4.2. For any $\mathcal{D}$, $i \in [n]$ and $\mathbf{w} \in \mathbb{R}^F$: (1) $\|\nabla \ell(\mathbf{e}_i^T \mathbf{Z}\mathbf{w}, \mathbf{e}_i^T \mathbf{Y})\| \le c$ (i.e., the norm of $\nabla \ell$ is $c$-bounded); (2) $\ell''$ is $\gamma_2$-Lipschitz; (3) $\|\mathbf{e}_i^T \mathbf{X}\| \le 1$; (4) $\ell'$ is $\gamma_1$-Lipschitz; (5) $\ell'$ is $c_1$-bounded.

Assumptions (1)-(3) are also needed for unstructured unlearning of linear classifiers (Guo et al., 2020). To account for graph-structured data, we require the additional assumptions (4)-(5) to establish worst-case bounds. The additional assumptions may be avoided when working with data-dependent bounds (Section 5). In all subsequent derivations, we assume that the unlearned data point corresponds to the $m$-th node for node feature and node unlearning; for edge unlearning, we wish to unlearn the edge $(1, m)$. Generalizations for multiple unlearning requests are discussed in Section 5.
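The generalized update $\mathbf{w}^- = \mathbf{w}^\star + \mathbf{H}_{\mathbf{w}^\star}^{-1}\Delta$ with $\Delta = \nabla L(\mathbf{w}^\star, \mathcal{D}) - \nabla L(\mathbf{w}^\star, \mathcal{D}')$ can be sanity-checked on a toy problem. The sketch below is an illustration under several assumptions that are not taken from the paper's experiments: a 4-node path graph, scalar features ($F = 1$), and a squared loss in place of the logistic loss. For a quadratic objective the update is a single exact Newton step, so the gradient residual should vanish up to rounding; all names are hypothetical.

```python
def propagate(adj, x, k):
    """Return P^k x with P = D~^{-1} A~ (self-loops added); x is a list of scalars."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    z = list(x)
    for _ in range(k):
        z = [sum(a_tilde[i][j] * z[j] for j in range(n)) / deg[i] for i in range(n)]
    return z

def grad(w, z, y, train, lam):
    # Gradient of sum_i [0.5 * (z_i * w - y_i)^2 + 0.5 * lam * w^2] over training nodes.
    return sum(z[i] * (z[i] * w - y[i]) + lam * w for i in train)

def hessian(z, train, lam):
    # Second derivative of the same objective (independent of w for squared loss).
    return sum(z[i] ** 2 + lam for i in train)

def fit(z, y, train, lam):
    # Closed-form minimizer of the regularized squared loss for a scalar weight.
    return sum(z[i] * y[i] for i in train) / hessian(z, train, lam)

# Toy data (assumed): path graph 0-1-2-3, training nodes 0..2, unlearn node 2's features.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
x = [0.9, -0.4, 0.7, 0.2]
y = [1.0, -1.0, 1.0, 0.0]
K, lam = 2, 0.01
train, train_new = [0, 1, 2], [0, 1]

z = propagate(adj, x, K)
w_star = fit(z, y, train, lam)

x_new = [x[0], x[1], 0.0, x[3]]          # feature row of the unlearned node set to zero
z_new = propagate(adj, x_new, K)         # every row of Z may change after propagation

delta = grad(w_star, z, y, train, lam) - grad(w_star, z_new, y, train_new, lam)
w_minus = w_star + delta / hessian(z_new, train_new, lam)
w_retrain = fit(z_new, y, train_new, lam)

print(abs(w_minus - w_retrain), abs(grad(w_minus, z_new, y, train_new, lam)))
```

For a non-quadratic loss such as the logistic loss, the same update leaves a nonzero gradient residual, which is exactly the quantity the bounds in this section control.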

4.2. NODE FEATURE UNLEARNING FOR SGCS

We start with the simplest type of unlearning, node feature unlearning, for SGCs. In this case, we remove the node feature and label of one node from $\mathcal{D}$, resulting in $\mathcal{D}' = (\mathbf{X}', \mathbf{Y}'_{T_r}, \mathbf{A})$. The matrices $\mathbf{X}', \mathbf{Y}'_{T_r}$ are identical to $\mathbf{X}, \mathbf{Y}_{T_r}$, respectively, except for the $m$-th row of the former being set to zero. Note that in this case, the graph structure remains unchanged.

Theorem 4.3. Suppose that Assumption 4.2 holds. For the node feature unlearning scenario and $\mathbf{Z} = \mathbf{P}^K \mathbf{X}$ and $\mathbf{P} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}}$, we have

$$\|\nabla L(\mathbf{w}^-, \mathcal{D}')\| = \|(\mathbf{H}_{\mathbf{w}_\eta} - \mathbf{H}_{\mathbf{w}^\star})\mathbf{H}_{\mathbf{w}^\star}^{-1}\Delta\| \le \frac{\gamma_2 \left(2c\lambda + (c\gamma_1 + \lambda c_1)\tilde{D}_{mm}\right)^2}{\lambda^4 (m-1)},$$

where $\mathbf{H}_{\mathbf{w}_\eta}$ denotes the Hessian of $L(\cdot; \mathcal{D}')$ at $\mathbf{w}_\eta = \mathbf{w}^\star + \eta \mathbf{H}_{\mathbf{w}^\star}^{-1}\Delta$ for some $\eta \in [0, 1]$.

A similar conclusion holds for the case when we wish to unlearn node features of a node that is not in $T_r$. In this case we just replace $\tilde{D}_{mm}$ by the degree of the corresponding node. This result shows that the norm bound is large if the unlearned node has a large degree, since a large-degree node will affect the values of many rows in $\mathbf{Z}$. Our result also demonstrates that the norm bound is independent of $K$, due to the fact that $\mathbf{P}$ is right stochastic.

We provide next a sketch of the proof to illustrate the analytical challenges of graph unlearning compared to those of unstructured data unlearning. Although for node feature unlearning the graph topology does not change, all rows of $\mathbf{Z} = \mathbf{P}^K \mathbf{X}$ may potentially change due to graph information propagation. Thus, the original analysis from Guo et al. (2020), which corresponds to the special case $\mathbf{Z} = \mathbf{X}$, cannot be applied directly. There are two particular challenges. The first is to ensure that the norm of each row of $\mathbf{Z}$ is bounded by 1. We provide Lemma A.1 to guarantee this. It is critical to choose $\mathbf{P} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}}$ since all other choices of degree normalization lead to worse bounds (see Appendix A.11). The second and more difficult challenge is to bound $\|\Delta\|$. When $\mathbf{Z} = \mathbf{X}$, the third term in equation (2) is exactly zero, in accordance with Guo et al. (2020). Due to graph propagation, we have to further bound the norm of the third term, which is highly nontrivial since the upper bound is not allowed to grow with $m$ or $n$. We first focus on one of the $m-1$ terms in the sum. Using Assumption 4.2, one can bound this term by $\|\mathbf{e}_i^T(\mathbf{Z} - \mathbf{Z}')\|$ (we suppressed the dependency on $\lambda$, $c$, $c_1$ and $\gamma_1$ for simplicity). The key analytical novelty is to exploit the sparsity of $\mathbf{Z} - \mathbf{Z}' = \mathbf{P}^K(\mathbf{X} - \mathbf{X}')$. Note that $\mathbf{X} - \mathbf{X}'$ is an all-zero matrix except for its $m$-th row, which equals $\mathbf{e}_m^T \mathbf{X}$. Thus, we have

$$\|\mathbf{e}_i^T(\mathbf{Z} - \mathbf{Z}')\| = \|\mathbf{e}_i^T \mathbf{P}^K(\mathbf{X} - \mathbf{X}')\| = \|\mathbf{e}_i^T \mathbf{P}^K \mathbf{e}_m \mathbf{e}_m^T \mathbf{X}\| \le \mathbf{e}_i^T \mathbf{P}^K \mathbf{e}_m,$$

where the last bound follows from the Cauchy-Schwarz inequality, (3) in Assumption 4.2 and the fact that $\mathbf{P}^K$ is a (componentwise) nonnegative matrix. Thus, summing over $i \in [m-1]$ leads to the upper bound $\mathbf{1}^T \mathbf{P}^K \mathbf{e}_m$, since $m \le n$. Next, observe that

$$\mathbf{1}^T \mathbf{P}^K \mathbf{e}_m = \mathbf{1}^T \mathbf{P}^K \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{D}} \mathbf{e}_m = \mathbf{1}^T \left(\tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}}\right)^K \tilde{\mathbf{D}}^{-1} \mathbf{e}_m \tilde{D}_{mm} = \mathbf{1}^T \tilde{\mathbf{D}}^{-1} \left(\tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1}\right)^K \mathbf{e}_m \tilde{D}_{mm}.$$

Since $\tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1}$ is a left stochastic matrix, $\tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1} \mathbf{p}$ is a probability vector whenever $\mathbf{p}$ is a probability vector. Clearly, $\mathbf{e}_m$ is a probability vector. Hence, $(\tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1})^K \mathbf{e}_m$ is also a probability vector. Since all diagonal entries of $\tilde{\mathbf{D}}^{-1}$ are nonnegative and upper bounded by 1 given the self-loops for all nodes, $\mathbf{1}^T \tilde{\mathbf{D}}^{-1} \mathbf{p} \le \mathbf{1}^T \mathbf{p} = 1$ for any probability vector $\mathbf{p}$. Hence, the term above is bounded by $\tilde{D}_{mm}$. The bound depends on $\tilde{D}_{mm}$ and does not increase with $m$ or $K$. Although node feature unlearning is the simplest case of graph unlearning, our sketch of the proof illustrates the difficulties associated with bounding the third term in $\Delta$. Similar, but more complicated approaches are needed for the analysis of edge unlearning and node unlearning.
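The key step of the proof sketch, namely that the column sum $\mathbf{1}^T \mathbf{P}^K \mathbf{e}_m$ never exceeds the self-loop degree of node $m$ regardless of $K$, can be checked numerically. The following minimal sketch does so on a star graph, chosen here as an assumption because it contains one hub and several low-degree nodes; the helper names are hypothetical.

```python
def propagation_matrix(adj):
    """Return P = D~^{-1} A~ and the self-loop degrees for an adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    p = [[a_tilde[i][j] / deg[i] for j in range(n)] for i in range(n)]
    return p, deg

def column_mass(p, m, k):
    """Return 1^T P^k e_m, computed by applying P to e_m k times."""
    n = len(p)
    v = [1.0 if i == m else 0.0 for i in range(n)]
    for _ in range(k):
        v = [sum(p[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v)

# Star graph (assumed example): node 0 is the hub, nodes 1..4 are leaves.
adj = [[0, 1, 1, 1, 1],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0]]
p, deg = propagation_matrix(adj)

for k in range(8):
    hub, leaf = column_mass(p, 0, k), column_mass(p, 1, k)
    print(k, round(hub, 4), "<=", deg[0], "|", round(leaf, 4), "<=", deg[1])
```

The hub column carries more mass than a leaf column, which mirrors the observation that unlearning a high-degree node perturbs more rows of the embedding matrix.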

4.3. EDGE AND NODE UNLEARNING FOR SGCS AND GPRS EXTENSIONS

Edge unlearning for SGC. We describe next the bounds for edge unlearning and highlight the technical issues arising in the analysis of this setting. Here, we remove one edge $(1, m)$ from $\mathcal{D}$, resulting in $\mathcal{D}' = (\mathbf{X}, \mathbf{Y}_{T_r}, \mathbf{A}')$. The matrix $\mathbf{A}'$ is identical to $\mathbf{A}$ except for $\tilde{A}'_{1m} = \tilde{A}'_{m1} = 0$. Furthermore, $\tilde{\mathbf{D}}'$ is the degree matrix corresponding to $\tilde{\mathbf{A}}'$. Note that the node features and labels remain unchanged.

Theorem 4.4. Suppose that Assumption 4.2 holds. Under the edge unlearning scenario, and for $\mathbf{P} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}}$ and $\mathbf{Z} = \mathbf{P}^K \mathbf{X}$, we have

$$\|\nabla L(\mathbf{w}^-, \mathcal{D}')\| = \|(\mathbf{H}_{\mathbf{w}_\eta} - \mathbf{H}_{\mathbf{w}^\star})\mathbf{H}_{\mathbf{w}^\star}^{-1}\Delta\| \le \frac{16\gamma_2 K^2 (c\gamma_1 + c_1\lambda)^2}{\lambda^4 m}.$$

Similar to what holds for the node feature unlearning case, Theorem 4.4 still holds when neither of the two end nodes of the removed edge belongs to $T_r$.

Node unlearning for SGC. We now discuss the most difficult case, node unlearning. In this case, one node is entirely removed from $\mathcal{D}$, including node features, labels and edges. This results in $\mathcal{D}' = (\mathbf{X}', \mathbf{Y}'_{T_r}, \mathbf{A}')$. The matrices $\mathbf{X}', \mathbf{Y}'_{T_r}$ are defined similarly to those described for node feature unlearning. The matrix $\mathbf{A}'$ is obtained by replacing the $m$-th row and column in $\mathbf{A}$ by all-zeros (similar changes are introduced in $\tilde{\mathbf{A}}$, with $\tilde{A}_{mm} = 0$). For simplicity, we let $\tilde{D}'_{mm} = 1$, as this assumption does not affect the propagation results.

Theorem 4.5. Suppose that Assumption 4.2 holds. For the node unlearning scenario and $\mathbf{Z} = \mathbf{P}^K \mathbf{X}$ and $\mathbf{P} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}}$, we have

$$\|\nabla L(\mathbf{w}^-, \mathcal{D}')\| = \|(\mathbf{H}_{\mathbf{w}_\eta} - \mathbf{H}_{\mathbf{w}^\star})\mathbf{H}_{\mathbf{w}^\star}^{-1}\Delta\| \le \frac{\gamma_2 \left(2c\lambda + K(c\gamma_1 + c_1\lambda)(2\tilde{D}_{mm} - 1)\right)^2}{\lambda^4 (m-1)}.$$

The main challenge in the proofs of Theorems 4.4 and 4.5 is bounding $\|\Delta\|$ appropriately. Unlike for the node feature unlearning case, now both the graph structure and the node features can change due to the unlearning request. We establish a series of lemmas to characterize the difference between $\mathbf{Z}$ and $\mathbf{Z}'$, which play important roles in our proofs (see Appendix A.8 and A.9 for complete proofs).

Approximate graph unlearning in GPR-based models. Our analysis can be extended to Generalized PageRank (GPR)-based models (Li et al., 2019). The definition of GPR is $\sum_{k=0}^{K} \theta_k \mathbf{P}^k \mathbf{S}$, where $\mathbf{S}$ denotes a node feature or node embedding. The learnable weights $\theta_k$ are called GPR weights, and different choices for the weights lead to different propagation rules (Jeh & Widom, 2003; Chung, 2007). GPR-type propagations include SGC and APPNP rules as special cases (Chien et al., 2021b). If we use linearly transformed features $\mathbf{S} = \mathbf{X}\bar{\mathbf{W}}$, for some weight matrix $\bar{\mathbf{W}}$, the GPR rule can be rewritten as

$$\mathbf{Z}\mathbf{W} = \frac{1}{K+1}\left[\mathbf{X}, \mathbf{P}\mathbf{X}, \mathbf{P}^2\mathbf{X}, \cdots, \mathbf{P}^K\mathbf{X}\right]\mathbf{W}.$$

This constitutes a concatenation of the steps from 0 up to $K$. The learnable weight matrix $\mathbf{W} \in \mathbb{R}^{(K+1)F \times C}$ combines $\theta_k$ and $\bar{\mathbf{W}}$. These represent linearizations of GPR-GNNs (Chien et al., 2021b) and SIGNs (Frasca et al., 2020), simple yet useful models for learning on graphs. For simplicity, we only describe the results for node feature unlearning and relegate the analysis of edge and node unlearning to Appendix A.10.

Theorem 4.6. Suppose that Assumption 4.2 holds and consider the node feature unlearning case. For $\mathbf{Z} = \frac{1}{K+1}\left[\mathbf{X}, \mathbf{P}\mathbf{X}, \mathbf{P}^2\mathbf{X}, \cdots, \mathbf{P}^K\mathbf{X}\right]$ and $\mathbf{P} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}}$, we have

$$\|\nabla L(\mathbf{w}^-, \mathcal{D}')\| = \|(\mathbf{H}_{\mathbf{w}_\eta} - \mathbf{H}_{\mathbf{w}^\star})\mathbf{H}_{\mathbf{w}^\star}^{-1}\Delta\| \le \frac{\gamma_2 \left(2c\lambda + (c\gamma_1 + \lambda c_1)\tilde{D}_{mm}\right)^2}{\lambda^4 (m-1)}.$$

Note that the resulting bound is the same as the bound in Theorem 4.3. This is due to the fact that we used the normalization factor $\frac{1}{K+1}$ in $\mathbf{Z}$. Hence, given the same noise level, the GPR-based models are more sensitive when trained on the noisy loss $L_{\mathbf{b}}$. Whether the generally higher performance of GPR can compensate for this drawback depends on the actual datasets considered.
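The linearized GPR embedding is a scaled concatenation of the propagated features for steps 0 through K. The sketch below builds it for a small graph and checks the resulting shape and row norms. It is a minimal illustration only: the normalization $1/(K+1)$ follows the text above, and the triangle graph, feature values and function name are assumptions.

```python
import math

def gpr_embedding(adj, x, k):
    """Rows of Z = 1/(K+1) * [X, PX, ..., P^K X] with P = D~^{-1} A~.

    adj is an n x n adjacency matrix and x an n x F feature matrix (lists of lists).
    """
    n, f = len(x), len(x[0])
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    blocks, cur = [x], x
    for _ in range(k):
        # One propagation step: each row becomes a degree-normalized neighbor average.
        cur = [[sum(a_tilde[i][j] * cur[j][c] for j in range(n)) / deg[i]
                for c in range(f)] for i in range(n)]
        blocks.append(cur)
    scale = 1.0 / (k + 1)
    # Concatenate the K+1 blocks row by row and apply the normalization factor.
    return [[scale * v for block in blocks for v in block[i]] for i in range(n)]

# Triangle graph with 2-dimensional features whose rows have norm at most 1 (assumed).
adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
x = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]
K = 3
z = gpr_embedding(adj, x, K)

row_norms = [math.sqrt(sum(v * v for v in row)) for row in z]
print(len(z[0]), [round(r, 4) for r in row_norms])
```

Each row has $(K+1)F$ entries, which is why the weight matrix in the text has $(K+1)F$ rows, and the row norms stay below 1 as required by the analysis.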

5. EMPIRICAL ASPECTS OF APPROXIMATE GRAPH UNLEARNING

Logistic and least-squares regression on graphs. For binary logistic regression, the loss equals $\ell(\mathbf{e}_i^T \mathbf{Z}\mathbf{w}, \mathbf{e}_i^T \mathbf{Y}_{T_r}) = -\log(\sigma(\mathbf{e}_i^T \mathbf{Y}_{T_r}\, \mathbf{e}_i^T \mathbf{Z}\mathbf{w}))$, where $\sigma(x) = 1/(1 + \exp(-x))$ denotes the sigmoid function. As shown in Guo et al. (2020), the assumptions (1)-(3) in Assumption 4.2 are satisfied with $c = 1$ and $\gamma_2 = 1/4$. By standard analysis, we show that our loss satisfies (4) and (5) in Assumption 4.2 with $\gamma_1 = 1/4$ and $c_1 = 1$. For multi-class logistic regression, one can adopt the "one-versus-all other-classes" strategy, which leads to the same result. For least-squares regression, since the Hessian is independent of $\mathbf{w}$, our approach offers $(0, 0)$-approximate graph unlearning even without loss perturbations. See Appendix A.5 for the complete discussion and derivation.

Sequential unlearning. In practice, multiple users may request unlearning. Hence, it is desirable to have a model that supports sequential unlearning of all types of data points. One can leverage the same proof as in Guo et al. (2020) (induction coupled with the triangle inequality) to show that the resulting gradient residual norm bound equals $T\epsilon'$ at the $T$-th unlearning request, where $\epsilon'$ is the bound for a single instance of approximate graph unlearning.

Data-dependent bounds. The gradient residual norm bounds derived for different types of approximate graph unlearning contain a constant factor $1/\lambda^4$, and may be loose in practice. Following Guo et al. (2020), we also examined data-dependent bounds.

Corollary 5.1 (Application of Corollary 1 in Guo et al. (2020)). For all three graph unlearning scenarios, we have $\|\nabla L(\mathbf{w}^-, \mathcal{D}')\| \le \gamma_2 \|\mathbf{Z}'\|_{op} \|\mathbf{H}_{\mathbf{w}^\star}^{-1}\Delta\| \|\mathbf{Z}'\mathbf{H}_{\mathbf{w}^\star}^{-1}\Delta\|$.

Hence, there are two ways to accomplish approximate graph unlearning. If we do not allow any retraining, we have to leverage the worst-case bound in Section 4 based on the expected number of unlearning requests. Importantly, we will also need to constrain the degree of the nodes to be unlearned (i.e., not allow unlearning of hub nodes), for both node feature and node unlearning. Otherwise, we can select the noise standard deviation $\alpha$, $\epsilon$ and $\delta$ and compute the corresponding "privacy budget" $\alpha\epsilon/\sqrt{2\log(1.5/\delta)}$. Once the accumulated gradient residual norm exceeds this budget, we retrain the model from scratch. Note that this still greatly reduces the time complexity compared to retraining the model for every unlearning request (see Section 6). The pseudo-code of our method leveraging data-dependent bounds for sequential unlearning is relegated to Appendix A.6.
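The budget-driven retraining rule described above can be sketched in a few lines. This is not the pseudo-code from Appendix A.6: the function names are hypothetical, and the per-request residual bounds are made-up numbers standing in for values that would come from Corollary 5.1. The defaults $\alpha = 0.1$, $\epsilon = 1$, $\delta = 10^{-4}$ match the settings reported in Section 6.

```python
import math

def privacy_budget(alpha, eps, delta):
    """Budget alpha * eps / sqrt(2 * log(1.5 / delta)) on the gradient residual norm."""
    return alpha * eps / math.sqrt(2.0 * math.log(1.5 / delta))

def process_requests(residual_bounds, alpha, eps, delta):
    """Return the indices of unlearning requests that trigger a full retrain.

    residual_bounds[t] is an upper bound on the gradient residual norm added by
    request t. Bounds accumulate across requests; a retrain resets the total.
    """
    budget = privacy_budget(alpha, eps, delta)
    accumulated, retrain_at = 0.0, []
    for t, bound in enumerate(residual_bounds):
        if accumulated + bound > budget:
            retrain_at.append(t)      # budget exceeded: retrain from scratch
            accumulated = 0.0
        else:
            accumulated += bound      # budget holds: apply the cheap update
    return retrain_at

# Hypothetical data-dependent bounds for five sequential requests.
bounds = [0.01, 0.01, 0.01, 0.005, 0.02]
budget = privacy_budget(alpha=0.1, eps=1.0, delta=1e-4)
print(round(budget, 4), process_requests(bounds, alpha=0.1, eps=1.0, delta=1e-4))
```

In this toy run only two of the five requests require retraining, which illustrates where the time savings over per-request retraining come from.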

6. EXPERIMENT

Settings. We test our methods on benchmark datasets for graph learning, including Cora, Citeseer, Pubmed (Sen et al., 2008; Yang et al., 2016; Fey & Lenssen, 2019), the large-scale dataset ogbn-arxiv (Hu et al., 2020) and the Amazon co-purchase networks Computers and Photo (McAuley et al., 2015; Shchur et al., 2018). We either use the public splitting or random splitting based on similar rules as public splitting, and focus on node classification. Following Guo et al. (2020), we use LBFGS as the optimizer for all methods due to its high efficiency on strongly convex problems. Unless specified otherwise, we fix $K = 2$, $\delta = 10^{-4}$, $\lambda = 10^{-2}$, $\epsilon = 1$, $\alpha = 0.1$ for all experiments, and average the results over 5 independent trials with random initializations. Our baseline methods include complete retraining with graph information after each unlearning request (SGC Retraining), complete retraining without graph information after each unlearning request (No Graph Retraining), and Unstructured Unlearning (Guo et al., 2020). Additional details can be found in Appendix A.19.

Bounds on the gradient residual norm. The first row of Figure 2 compares the values of both the worst-case bounds computed in Section 4 and the data-dependent bounds computed from Corollary 5.1 with the true value of the gradient residual norm (True Norm). For simplicity, we set $\alpha = 0$ during training. We observe that the worst-case bounds are looser than the data-dependent bounds, and that both bounds are indeed valid upper bounds for the actual gradient residual norm.

Dependency on node degrees. While an upper bound does not necessarily capture the dependency of each term correctly, we show in Figure 3 (a) and (b) that our Theorems 4.5 and 4.6 indeed do so. Here, each point corresponds to unlearning one node. We test for all nodes in the training set $T_r$ and fix $\lambda = 10^{-4}$, $\alpha = 0$. Our results show that unlearning a large-degree node is more expensive in terms of the privacy budget (i.e., it induces a larger gradient residual norm). For other datasets, refer to Appendix A.19.

Trade-off amongst privacy, performance and time complexity. As indicated in Theorem 4.1, there is a trade-off amongst privacy, performance and time complexity. Compared to exact unlearning (i.e., SGC Retraining), allowing approximate unlearning gives a 4× speedup in time with competitive performance. We further examine this trade-off by fixing $\lambda$ and $\delta$, so that the trade-off is controlled by $\epsilon$ and $\alpha$. The results are shown in Figure 3 (g) for Cora, where we set $\alpha\epsilon = 0.1$. The test accuracy increases when we relax our constraints on $\epsilon$, which agrees with our intuition. Remarkably, we can still obtain performance competitive with SGC Retraining when we require $\epsilon$ to be as small as 1. In contrast, one needs at least $\epsilon \ge 5$ to unlearn even one node or edge by leveraging state-of-the-art DP-GNNs (Sajadmanesh et al., 2022; Daigavane et al., 2021) for reasonable performance, although our tested datasets are different. This shows the benefit of our approximate graph unlearning method as opposed to both retraining from scratch and DP-GNNs. Unfortunately, the code for these DP-GNNs is not publicly available, which prevents us from testing them on our datasets in a unified treatment.

Membership inference attacks on unlearned models. (2021b) and significantly worse than both our unlearning and complete retraining method. The results also highlight the fact that the privacy definitions considered in MI attacks and approximate unlearning are different (see Appendix A.2). Nevertheless, experiments show that our method offers similar privacy-preserving performance (in terms of MI) as complete retraining, and better performance compared to just using the original model without unlearning features.

A.2 FUTURE RESEARCH DIRECTIONS AND LIMITATIONS

Batch unlearning. In practice, it is likely that we not only require sequential unlearning, but also batch unlearning: A number of users may request their data to be unlearned within a certain (short) time frame. The approach in Guo et al. (2020) can ensure certified removal even in this scenario. The generalization of our approach for batch unlearning is also possible, but will be discussed elsewhere. Nonlinear models and hypergraph extension. Also akin to what was described in Guo et al. (2020) , we can leverage pre-trained (nonlinear) feature extractors or special graph feature transforms to further improve the performance of the overall model. For example, Chien et al. (2022a) proposed a node feature extraction method termed GIANT-XRT that greatly improves the performance of simple network models such as MLP and SGC. If a public dataset is never subjected to unlearning, one can pre-train GIANT-XRT on that dataset and use it for subsequent approximate graph unlearning. If such a public dataset is unavailable, we have to make the node feature extractor DP. In this case, we can either design a DP version of GIANT-XRT or leverage the DP-GNN model described in Section 2. By applying Theorem 5 of Guo et al. (2020) , the overall model can be shown to guarantee approximate graph unlearning, where the parameters ϵ and δ now also depend on the DP guarantees of the node feature extractor. There is also another line of work on Graph Scattering Transforms (GSTs) (Gama et al., 2019; Pan et al., 2021) for use as feature extractors for graph information. Since a GST is a predefined mathematical transform and hence does not require training, it can be easily combined with our approach (Pan et al., 2022). Finally, generalizing approximate graph unlearning to hypergraphs can also be an interesting direction. 
Although the current SOTA hypergraph neural networks heavily rely on nonlinear modules such as AllSet (Chien et al., 2022b) , we believe extension to classical hypergraph learning algorithms (Chien et al., 2021a) is possible.

Empirical metrics and MI.

There is currently no empirical metric that can be used to evaluate how well approximate machine unlearning methods preserve privacy. Although the definition of approximate graph unlearning automatically and theoretically ensures that one cannot infer information about the unlearned data point from the updated model (if one chooses to set ϵ, δ to 0), it remains an open problem whether we can design an empirical metric that can accurately quantify this privacy-preserving performance. Note that privacy-based attacks like the membership inference attack (Shokri et al., 2017; Olatunji et al., 2021b) have completely different design goals and may not work well in unlearning practice. For example, assume that there are two nodes that share similar features and neighborhood structures, come from the same class in the graph and are both included in the training set. This scenario frequently arises in practice, especially for graphs with strong homophily properties. In this case, even if we unlearn one of the nodes, the attack model will still have a high probability of recognizing the unlearned node in the training set due to the presence of the "similar" node. Thus, the viability of using the results returned by the attack models to assess the performance of an unlearner is not clear. This is also verified by some preliminary experiments on node unlearning tasks described in Section A.19 and the main text. Societal impacts. The authors believe that for medical and biological sciences research, the right to be forgotten may significantly set back potentially life-saving discoveries due to the need to have access to many diverse data samples. But current trends seem to favor privacy over discovery rates and timings. Hence, a compromise between data availability and the right to be forgotten has to be established in the near future. 
One current limitation of our work is that the newly proposed proof techniques do not apply to general graph neural networks where nonlinear activation functions are used. Nevertheless, our work is the first step towards developing approximate graph unlearning approaches for general GNNs.

A.3 INTUITION BEHIND THE MODEL UPDATE RULE

Our unlearning mechanism proposed in Section 4 is
$$w^- = w^\star + \nabla^2 L(w^\star, \mathcal{D}')^{-1}\left[\nabla L(w^\star, \mathcal{D}) - \nabla L(w^\star, \mathcal{D}')\right],$$
and the intuition is as follows. Our goal for the updated model is $\nabla L(w^-, \mathcal{D}') = 0$. By a first-order Taylor expansion, we have
$$\nabla L(w^-, \mathcal{D}') \approx \nabla L(w^\star, \mathcal{D}') + \nabla^2 L(w^\star, \mathcal{D}')(w^- - w^\star) = 0.$$
Therefore, we have
$$w^- - w^\star = \nabla^2 L(w^\star, \mathcal{D}')^{-1}\left[0 - \nabla L(w^\star, \mathcal{D}')\right],$$
$$w^- = w^\star + \nabla^2 L(w^\star, \mathcal{D}')^{-1}\left[\nabla L(w^\star, \mathcal{D}) - \nabla L(w^\star, \mathcal{D}')\right].$$
The last equality holds because $w^\star$ is the unique minimizer of the strongly convex loss $L(w, \mathcal{D})$ over the entire dataset $\mathcal{D}$, so that $\nabla L(w^\star, \mathcal{D}) = 0$.

Published as a conference paper at ICLR 2023

A.5 ADDITIONAL DISCUSSIONS

Details on Assumption 4.2. Assumptions (2), (4) and (5) in our model and that of Guo et al. (2020) require Lipschitz conditions with respect to the first argument of ℓ, but not the second. We also implicitly assume that the second argument (corresponding to labels) does not affect the norm of gradients or Hessians. One example that meets these constraints is the logistic loss: if $\ell(w^\top x, y) = \ell(y\, w^\top x)$, then all required assumptions hold.

Least-squares and logistic regression on graphs. Paralleling once again the results of Guo et al. (2020), it is clear that our approximate graph unlearning mechanism can be used in conjunction with least-squares and logistic regression. For example, node classification can be performed using a logistic loss. The node regression problem described in Ma et al. (2020); Jia & Benson (2020) is related to least-squares regression. In particular, least-squares regression uses the loss $\ell(e_i^\top Z w, e_i^\top Y_{T_r}) = (e_i^\top Z w - e_i^\top Y_{T_r})^2$. Note that its Hessian is of the form $(e_i^\top Z)^\top e_i^\top Z$, which does not depend on $w$. Thus, based on the same arguments presented in Guo et al. (2020), our proposed unlearning method $M$ offers (0, 0)-approximate graph unlearning even without loss perturbations.
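The least-squares claim can be checked numerically. The sketch below is illustrative only: the 4-node path graph, the random features and labels, and the helper names (`ridge_fit`, `newton_unlearn`, ...) are assumptions made for the example, and the objective is taken to be $\sum_i (e_i^\top Z w - y_i)^2 + \frac{\lambda m}{2}\|w\|^2$. Because this loss is quadratic, a single update of the form given in A.3 should coincide with full retraining after a node feature is unlearned.

```python
import numpy as np


def hessian(Z, lam):
    m, d = Z.shape
    return 2.0 * Z.T @ Z + lam * m * np.eye(d)


def grad(Z, y, lam, w):
    m = Z.shape[0]
    return 2.0 * Z.T @ (Z @ w - y) + lam * m * w


def ridge_fit(Z, y, lam):
    """Exact minimizer of sum_i (z_i^T w - y_i)^2 + (lam * m / 2) * ||w||^2."""
    return np.linalg.solve(hessian(Z, lam), 2.0 * Z.T @ y)


def newton_unlearn(Z, y, Z_new, y_new, lam, w_star):
    """w^- = w* + H(w*, D')^{-1} [grad L(w*, D) - grad L(w*, D')]."""
    delta = grad(Z, y, lam, w_star) - grad(Z_new, y_new, lam, w_star)
    return w_star + np.linalg.solve(hessian(Z_new, lam), delta)


# Hypothetical toy data: a 4-node path graph with self-loops, K = 2.
rng = np.random.default_rng(0)
A_tilde = np.array([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]], dtype=float)
P = A_tilde / A_tilde.sum(axis=1, keepdims=True)  # P = D~^{-1} A~
X = rng.normal(size=(4, 2))
y = rng.normal(size=4)
lam = 0.1

Z = P @ P @ X
w_star = ridge_fit(Z, y, lam)

# Node feature unlearning of the last node: zero its features, drop its label.
X_new = X.copy()
X_new[3] = 0.0
Z_new = (P @ P @ X_new)[:3]
y_new = y[:3]

w_minus = newton_unlearn(Z, y, Z_new, y_new, lam, w_star)
w_retrain = ridge_fit(Z_new, y_new, lam)
```

In this toy setting `w_minus` and `w_retrain` agree up to floating-point error, which matches the (0, 0) guarantee stated for least-squares regression.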
For binary logistic regression, the loss equals $\ell(e_i^\top Z w, e_i^\top Y_{T_r}) = -\log\left(\sigma(e_i^\top Y_{T_r}\, e_i^\top Z w)\right)$, where $\sigma(x) = 1/(1 + \exp(-x))$ denotes the sigmoid function. As shown in Guo et al. (2020), assumptions (1)-(3) in Assumption 4.2 are satisfied with $c = 1$ and $\gamma_2 = 1/4$. We only need to show that (4) and (5) hold. Using some simple algebra, one can also prove that $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, which implies $\max_{x \in \mathbb{R}} |\sigma'(x)| = 1/4$. Thus our loss satisfies assumption (4) in Assumption 4.2 as well, with $\gamma_1 = 1/4$. For multi-class logistic regression, one can adopt the "one-versus-all other-classes" strategy, which leads to the same result.
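The derivative identity used above can be verified numerically. The following is a small sanity check rather than a proof; the grid, the step size `h`, and the helper names are arbitrary choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_prime(x):
    """Closed form sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


xs = np.linspace(-10.0, 10.0, 2001)  # grid that contains x = 0
h = 1e-5

# A central finite difference should agree with the closed-form derivative.
finite_diff = (sigmoid(xs + h) - sigmoid(xs - h)) / (2.0 * h)

# The maximum of |sigma'| is attained at x = 0 and equals 1/4.
max_abs_derivative = float(np.max(np.abs(sigmoid_prime(xs))))
```

The maximum over the grid is attained at $x = 0$, consistent with the constant 1/4 used for the logistic loss.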

A.6 ALGORITHMIC DETAILS

The pseudo-codes for training removal-enabled models and the removal procedure for the case of binary classification are presented below. Note that this procedure is the same for all three types of removal requests (node feature unlearning, edge unlearning and node unlearning). During training, we add a random linear term to the training loss by sampling a Gaussian noise vector b. The choice of standard deviation α determines the "privacy budget" $\alpha\epsilon/\sqrt{2\log(1.5/\delta)}$, as shown in Section 5.

 7: Update the feature matrix $X'$ and propagation matrix $P'$ based on the removal.
 8: Compute the new node embedding after propagation, $Z' = (P')^K X'$.
 9: if $j \in T_r$ then
13: Compute $\Delta = \nabla L(w, \mathcal{D}) - \nabla L(w, \mathcal{D}')$.
14: Compute $H = \nabla^2 L(w; \mathcal{D}')$.
15: Update the accumulated gradient residual norm $\beta = \beta + \gamma_2 \|Z'\|_{op} \|H^{-1}\Delta\| \|Z' H^{-1}\Delta\|$.
16: if $\beta > \alpha\epsilon/\sqrt{2\log(1.5/\delta)}$ then
17:   Recompute $w$ using Algorithm 1 $(\mathcal{D}', \ell, \alpha, \lambda)$, $\beta = 0$.
18: else
19:   $w = w + H^{-1}\Delta$.
20: end if
21: end for
22: return $w$.
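Steps 15 to 19 of the removal procedure can be sketched as a single function. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `unlearning_step` and its signature are hypothetical, `H` and `delta` are assumed to have been computed as in steps 13 and 14, and retraining itself (Algorithm 1) is left to the caller, who is signalled through the returned flag.

```python
import numpy as np


def unlearning_step(w, H, delta, Z_new, beta, budget, gamma2):
    """One removal update; returns (w, beta, needs_retrain).

    Assumes H is the Hessian of the loss on the updated dataset and
    delta the gradient difference, both evaluated at the current w.
    """
    step = np.linalg.solve(H, delta)  # H^{-1} Delta
    # Data-dependent bound: gamma2 * ||Z'||_op * ||H^{-1}D|| * ||Z' H^{-1}D||.
    bound = (gamma2
             * np.linalg.norm(Z_new, 2)        # operator (spectral) norm
             * np.linalg.norm(step)
             * np.linalg.norm(Z_new @ step))
    beta = beta + bound
    if beta > budget:
        # Budget exceeded: the caller should retrain from scratch.
        return w, 0.0, True
    return w + step, beta, False


# Toy usage with made-up numbers (gamma2 = 1/4 as for the logistic loss).
w0 = np.zeros(2)
H = 2.0 * np.eye(2)
delta = np.array([0.2, 0.0])
Z_new = np.eye(2)

w1, beta1, retrain1 = unlearning_step(w0, H, delta, Z_new, 0.0, 0.004, 0.25)
w2, beta2, retrain2 = unlearning_step(w1, H, delta, Z_new, beta1, 0.004, 0.25)
```

In this toy run the first request is absorbed by the budget, while the second pushes the accumulated bound past it and signals retraining.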

A.7 PROOF OF THEOREM 4.3

Theorem. Under the node feature unlearning scenario, $\mathcal{D} = (X, Y_{T_r}, A)$ and $\mathcal{D}' = (X', Y'_{T_r}, A)$. Suppose Assumption 4.2 holds. For $Z = P^K X$ and $P = \tilde{D}^{-1}\tilde{A}$, we have
$$\|\nabla L(w^-, \mathcal{D}')\| = \|(H_{w_\eta} - H_{w^\star}) H_{w^\star}^{-1}\Delta\| \le \frac{\gamma_2 \left(2c\lambda + (c\gamma_1 + \lambda c_1)\tilde{D}_{mm}\right)^2}{\lambda^4 (m-1)}.$$

We need to ensure that the norm of each row of $Z$ is bounded by 1. We state the following lemma in support of this claim.

Lemma A.1. Assume that $\|e_i^\top S\| \le 1$, $\forall i \in [n]$. Then, $\forall i \in [n]$, $K \ge 0$, $\|e_i^\top P^K S\| \le 1$, where $P = \tilde{D}^{-1}\tilde{A}$.

Proof. Our proof is a nontrivial generalization and extension of the proof in Guo et al. (2020). For completeness, we outline every step of the proof. We also emphasize novel approaches used to accommodate our approximate graph unlearning scenario. Let $G(w) = \nabla L(w, \mathcal{D}')$. By Taylor's theorem, $\exists \eta \in [0, 1]$ such that
$$\begin{aligned}
G(w^-) &= G(w^\star + H_{w^\star}^{-1}\Delta) = G(w^\star) + \nabla G(w^\star + \eta H_{w^\star}^{-1}\Delta) H_{w^\star}^{-1}\Delta \\
&\overset{(a)}{=} G(w^\star) + H_{w_\eta} H_{w^\star}^{-1}\Delta \\
&= G(w^\star) + \Delta + H_{w_\eta} H_{w^\star}^{-1}\Delta - \Delta \\
&\overset{(b)}{=} 0 + H_{w_\eta} H_{w^\star}^{-1}\Delta - \Delta \\
&= H_{w_\eta} H_{w^\star}^{-1}\Delta - H_{w^\star} H_{w^\star}^{-1}\Delta \\
&= (H_{w_\eta} - H_{w^\star}) H_{w^\star}^{-1}\Delta.
\end{aligned}$$
In (a), we wrote $H_{w_\eta} \triangleq \nabla G(w^\star + \eta H_{w^\star}^{-1}\Delta)$, corresponding to the Hessian at $w_\eta \triangleq w^\star + \eta H_{w^\star}^{-1}\Delta$. Equality (b) is due to our choice of $\Delta = \nabla L(w^\star, \mathcal{D}) - \nabla L(w^\star, \mathcal{D}')$ and the fact that $w^\star$ is the minimizer of $L(\cdot, \mathcal{D})$. We would like to point out that our choice of $\Delta$ is more general than that of Guo et al. (2020): since unlearning one node may affect the entire node embedding $Z$, a generalization of $\Delta$ is crucial. When $K = 0$ (i.e., when no graph topology is included), one recovers $\Delta$ from Guo et al. (2020) as a special case of our model. In the latter part of the proof, we will see how the graph setting makes the analysis more intricate and complex.

By the Cauchy-Schwarz inequality, we have
$$\|G(w^-)\| \le \|H_{w_\eta} - H_{w^\star}\| \, \|H_{w^\star}^{-1}\Delta\|. \tag{9}$$
Below we bound both norms on the right-hand side separately. We start with the term $\|H_{w_\eta} - H_{w^\star}\|$.
Note that
$$\begin{aligned}
&\|\nabla^2 \ell(e_i^\top Z' w_\eta, e_i^\top Y'_{T_r}) - \nabla^2 \ell(e_i^\top Z' w^\star, e_i^\top Y'_{T_r})\| \\
&\quad = \left\|\left[\ell''(e_i^\top Z' w_\eta, e_i^\top Y'_{T_r}) - \ell''(e_i^\top Z' w^\star, e_i^\top Y'_{T_r})\right] (e_i^\top Z')^\top e_i^\top Z'\right\| \\
&\quad \overset{(a)}{\le} \gamma_2 \|e_i^\top Z' w_\eta - e_i^\top Z' w^\star\| \, \|e_i^\top Z'\|^2 \\
&\quad \le \gamma_2 \|w_\eta - w^\star\| \, \|e_i^\top Z'\|^3 = \gamma_2 \|\eta H_{w^\star}^{-1}\Delta\| \, \|e_i^\top Z'\|^3 \le \gamma_2 \|H_{w^\star}^{-1}\Delta\| \, \|e_i^\top Z'\|^3.
\end{aligned} \tag{10}$$
Here, (a) follows from the Cauchy-Schwarz inequality and the Lipschitz condition on $\ell''$ in Assumption 4.2. Unlike the analysis in Guo et al. (2020), we are faced with the problem of bounding the term $\|e_i^\top Z'\|$. In Guo et al. (2020) (where $Z = X$), a simple bound equals 1, which may be obtained via (3) in Assumption 4.2. However, in our case, due to graph propagation this norm needs more careful examination, and a simple application of the Cauchy-Schwarz inequality does not suffice, as it would lead to a term $\|X\|_{op}$, where $\|\cdot\|_{op}$ denotes the operator norm. The simple worst case (i.e., when all rows of $X$ are identical) leads to a meaningless bound $O(n)$. By leveraging Lemma A.1, we can further upper bound equation 10 according to
$$\|\nabla^2 \ell(e_i^\top Z' w_\eta, e_i^\top Y'_{T_r}) - \nabla^2 \ell(e_i^\top Z' w^\star, e_i^\top Y'_{T_r})\| \le \gamma_2 \|H_{w^\star}^{-1}\Delta\| \, \|e_i^\top Z'\|^3 \overset{(a)}{\le} \gamma_2 \|H_{w^\star}^{-1}\Delta\|,$$
where (a) follows from Lemma A.1. As a result, we arrive at a bound for $\|H_{w_\eta} - H_{w^\star}\|$ of the form
$$\|H_{w_\eta} - H_{w^\star}\| \le \sum_{i=1}^{m-1} \|\nabla^2 \ell(e_i^\top Z' w_\eta, e_i^\top Y'_{T_r}) - \nabla^2 \ell(e_i^\top Z' w^\star, e_i^\top Y'_{T_r})\| \le \gamma_2 (m-1) \|H_{w^\star}^{-1}\Delta\|.$$

Next, we bound $\|H_{w^\star}^{-1}\Delta\|$. Since $L(\cdot, \mathcal{D}')$ is $\lambda(m-1)$-strongly convex, we have $\|H_{w^\star}^{-1}\| \le \frac{1}{\lambda(m-1)}$. For the norm $\|\Delta\|$, we have
$$\Delta = \nabla L(w^\star, \mathcal{D}) - \nabla L(w^\star, \mathcal{D}') = \lambda w^\star + \nabla \ell(e_m^\top Z w^\star, e_m^\top Y_{T_r}) + \sum_{i=1}^{m-1} \left[\nabla \ell(e_i^\top Z w^\star, e_i^\top Y_{T_r}) - \nabla \ell(e_i^\top Z' w^\star, e_i^\top Y_{T_r})\right].$$
The third term does not appear in Guo et al. (2020), since when $K = 0$, $Z = X$ and $Z' = X'$ are identical except for the $m$-th row.
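To see concretely why this third term arises once $K \ge 1$, the following small sketch compares which rows of the embedding change with and without propagation. The fully connected three-node graph (all entries of $P$ equal to 1/3) and the feature values 1, 2, 3 are hypothetical choices made purely for illustration.

```python
import numpy as np

# Fully connected 3-node graph: every entry of P equals 1/3.
P = np.full((3, 3), 1.0 / 3.0)

# One-dimensional node features (arbitrary illustrative values).
X = np.array([[1.0], [2.0], [3.0]])

# Node feature unlearning of the first node: zero out its feature row.
X_prime = X.copy()
X_prime[0] = 0.0


def rows_changed(K):
    """Indices of rows where Z = P^K X and Z' = P^K X' differ."""
    PK = np.linalg.matrix_power(P, K)  # K = 0 gives the identity matrix
    Z, Z_prime = PK @ X, PK @ X_prime
    unchanged = np.isclose(Z, Z_prime).all(axis=1)
    return np.flatnonzero(~unchanged).tolist()


changed_without_graph = rows_changed(0)  # only the unlearned row
changed_with_graph = rows_changed(1)     # every row
```

Without propagation only the unlearned row differs, whereas a single propagation step already alters every row of the embedding matrix.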
In the approximate graph unlearning scenario, even removing one node feature can make the entire node embedding matrix $Z$ change in every row, which creates new analytical challenges. For example, consider the case $X = [x_1, x_2, x_3]^\top$, where we have a graph with three nodes, each with a 1-dimensional feature. Consider the fully connected graph (i.e., all entries in $P$ set to 1/3). Then, unlearning node 1 results in $Z' = [0, x_2, x_3]^\top$ for unstructured unlearning. However, $Z' = [(x_2 + x_3)/3, (x_2 + x_3)/3, (x_2 + x_3)/3]^\top$ for the case of $K = 1$, which is completely different from $Z = [(x_1 + x_2 + x_3)/3, (x_1 + x_2 + x_3)/3, (x_1 + x_2 + x_3)/3]^\top$. Hence, the analysis in Guo et al. (2020) cannot be directly applied to graphs, as $Z'$ changes in more than just one row compared to $Z$ while unlearning a node feature.

By Minkowski's triangle inequality, we only need to bound the norm of the three individual terms in order to bound the norm of $\Delta$. For $\|w^\star\|$, since $w^\star$ is the global optimum of $L(\cdot; \mathcal{D})$, we have
$$0 = \nabla L(w^\star; \mathcal{D}) = \sum_{i=1}^{m} \nabla \ell(e_i^\top Z w^\star, e_i^\top Y_{T_r}) + \lambda m w^\star.$$
By (1) in Assumption 4.2, we have
$$\|w^\star\| = \frac{\left\|\sum_{i=1}^{m} \nabla \ell(e_i^\top Z w^\star, e_i^\top Y_{T_r})\right\|}{\lambda m} \le \frac{c}{\lambda}.$$
Once again, by (1) in Assumption 4.2, we have $\|\nabla \ell(e_m^\top Z w^\star, e_m^\top Y_{T_r})\| \le c$. A bound for the last term is established in the last step, as described below:
$$\begin{aligned}
\left\|\sum_{i=1}^{m-1} \left[\nabla \ell(e_i^\top Z w^\star, e_i^\top Y_{T_r}) - \nabla \ell(e_i^\top Z' w^\star, e_i^\top Y_{T_r})\right]\right\|
&\le \sum_{i=1}^{m-1} \|\nabla \ell(e_i^\top Z w^\star, e_i^\top Y_{T_r}) - \nabla \ell(e_i^\top Z' w^\star, e_i^\top Y_{T_r})\| \\
&= \sum_{i=1}^{m-1} \|\ell'(e_i^\top Z w^\star, e_i^\top Y_{T_r})(e_i^\top Z)^\top - \ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})(e_i^\top Z')^\top\|.
\end{aligned}$$
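Observe that
$$\begin{aligned}
&\|\ell'(e_i^\top Z w^\star, e_i^\top Y_{T_r})(e_i^\top Z)^\top - \ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})(e_i^\top Z')^\top\| \\
&\quad \le \|\ell'(e_i^\top Z w^\star, e_i^\top Y_{T_r})(e_i^\top Z)^\top - \ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})(e_i^\top Z)^\top\| \\
&\qquad + \|\ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})(e_i^\top Z)^\top - \ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})(e_i^\top Z')^\top\|.
\end{aligned}$$
The first term can be bounded as
$$\begin{aligned}
&\|\ell'(e_i^\top Z w^\star, e_i^\top Y_{T_r})(e_i^\top Z)^\top - \ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})(e_i^\top Z)^\top\| \\
&\quad \le \left|\ell'(e_i^\top Z w^\star, e_i^\top Y_{T_r}) - \ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})\right| \, \|(e_i^\top Z)^\top\| \\
&\quad \overset{(a)}{\le} \gamma_1 \|e_i^\top Z w^\star - e_i^\top Z' w^\star\| \, \|(e_i^\top Z)^\top\| \\
&\quad \overset{(b)}{\le} \gamma_1 \|(e_i^\top Z - e_i^\top Z')^\top\| \, \|w^\star\| \\
&\quad \overset{(c)}{\le} \frac{c\gamma_1}{\lambda} \|(e_i^\top Z - e_i^\top Z')^\top\|.
\end{aligned}$$
Here, (a) is due to (4) in Assumption 4.2, while (b) follows from Lemma A.1 and the Cauchy-Schwarz inequality. Inequality (c) is a consequence of the bound for $\|w^\star\|$ that we previously derived. The second term can be bounded as
$$\begin{aligned}
&\|\ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})(e_i^\top Z)^\top - \ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})(e_i^\top Z')^\top\| \\
&\quad \le \left|\ell'(e_i^\top Z' w^\star, e_i^\top Y_{T_r})\right| \, \|(e_i^\top Z)^\top - (e_i^\top Z')^\top\| \overset{(a)}{\le} c_1 \|(e_i^\top Z)^\top - (e_i^\top Z')^\top\|.
\end{aligned}$$
For the inequality in (a), we used (5) from Assumption 4.2. Put together, we have
$$\begin{aligned}
\left\|\sum_{i=1}^{m-1} \left[\nabla \ell(e_i^\top Z w^\star, e_i^\top Y_{T_r}) - \nabla \ell(e_i^\top Z' w^\star, e_i^\top Y_{T_r})\right]\right\|
&\le \sum_{i=1}^{m-1} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \|(e_i^\top Z)^\top - (e_i^\top Z')^\top\| \\
&= \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|e_i^\top (Z - Z')\| \\
&= \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|e_i^\top P^K (X - X')\| \\
&= \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|e_i^\top P^K \tilde{D}^{-1}\tilde{D}(X - X')\| \\
&\overset{(a)}{=} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|e_i^\top P^K \tilde{D}^{-1}\tilde{D} e_m e_m^\top X\| \\
&\overset{(b)}{\le} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|e_i^\top P^K \tilde{D}^{-1}\tilde{D} e_m\| \, \|e_m^\top X\| \\
&\overset{(c)}{\le} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|e_i^\top P^K \tilde{D}^{-1}\tilde{D} e_m\| \\
&= \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|e_i^\top P^K \tilde{D}^{-1} e_m \tilde{D}_{mm}\| \\
&\overset{(d)}{=} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} e_i^\top P^K \tilde{D}^{-1} e_m \tilde{D}_{mm} \\
&\overset{(e)}{\le} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \mathbf{1}^\top P^K \tilde{D}^{-1} e_m \tilde{D}_{mm} \\
&\overset{(f)}{=} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \mathbf{1}^\top \tilde{D}^{-1} p \, \tilde{D}_{mm} \\
&\overset{(g)}{\le} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \tilde{D}_{mm}.
\end{aligned}$$
Equality (a) follows from the fact that $X'$ is identical to $X$ except for the $m$-th row, which is set to all-zeros.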
Thus, $X - X'$ is a matrix whose rows are all zero vectors, except for the $m$-th row, which equals the $m$-th row of $X$. Inequality (b) follows from the Cauchy-Schwarz inequality. Inequality (c) is a result of (3) in Assumption 4.2, while (d) is a consequence of the fact that $e_i^\top P^K \tilde{D}^{-1} e_m$ is the value in the $i$-th row and $m$-th column of the matrix $P^K \tilde{D}^{-1}$, which is entry-wise nonnegative. Inequality (e) is likewise due to the fact that $P^K \tilde{D}^{-1}$ is entry-wise nonnegative. In (f), $p$ stands for a probability vector, and (f) holds since $P^K \tilde{D}^{-1} = (\tilde{D}^{-1}\tilde{A})^K \tilde{D}^{-1} = \tilde{D}^{-1}(\tilde{A}\tilde{D}^{-1})^K$, and $\tilde{A}\tilde{D}^{-1}$ is a left stochastic matrix. Inequality (g) is a consequence of the observation that the maximum entry in $\tilde{D}^{-1}$ is at most 1 and that the latter is a diagonal matrix. Hence, $\mathbf{1}^\top \tilde{D}^{-1} p \le \mathbf{1}^\top p$. Also, $\mathbf{1}^\top p = 1$ by the definition of a probability vector. Combining the bounds, we obtain
$$\|\Delta\| \le c + c + \left(\frac{c\gamma_1}{\lambda} + c_1\right)\tilde{D}_{mm} = \frac{2c\lambda + (c\gamma_1 + \lambda c_1)\tilde{D}_{mm}}{\lambda}.$$
Including the bound on $\|H_{w^\star}^{-1}\|$ and equation 12, we then obtain
$$\|G(w^-)\| \le \gamma_2 (m-1) \|H_{w^\star}^{-1}\Delta\|^2 \le \gamma_2 (m-1) \left(\frac{\frac{2c\lambda + (c\gamma_1 + \lambda c_1)\tilde{D}_{mm}}{\lambda}}{\lambda(m-1)}\right)^2 = \frac{\gamma_2 \left(2c\lambda + (c\gamma_1 + \lambda c_1)\tilde{D}_{mm}\right)^2}{\lambda^4 (m-1)}.$$
This completes the proof.

A.8 PROOF OF THEOREM 4.4

Theorem. For the edge unlearning case, we have $\mathcal{D} = (X, Y_{T_r}, P)$ and $\mathcal{D}' = (X, Y_{T_r}, P')$. If $P = \tilde{D}^{-1}\tilde{A}$ and $Z = P^K X$, then we have
$$\|\nabla L(w^-, \mathcal{D}')\| = \|(H_{w_\eta} - H_{w^\star}) H_{w^\star}^{-1}\Delta\| \le \frac{16\gamma_2 K^2 (c\gamma_1 + c_1\lambda)^2}{\lambda^4 m}.$$
Similar to what holds for the node feature unlearning case, Theorem 4.4 still holds when neither of the two end nodes of the removed edge belongs to $T_r$. Since $P'$ is a right stochastic matrix, Lemma A.1 still applies. Thus, we only need to describe how to bound $\|\Delta\|$. Following an approach similar to the previously described one, we have
$$\|\Delta\| \le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \sum_{j=1}^{n} \left|e_i^\top (P^K - (P')^K) e_j\right|.$$
We also need the following technical lemmas.

Lemma A.2.
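For both edge and node unlearning, we have
$$\left|e_i^\top \left[P^K - (P')^K\right] e_j\right| \le \sum_{k=1}^{K} e_i^\top (P')^{k-1} |P - P'| P^{K-k} e_j, \quad \forall i, j \in [n], \; K \ge 1.$$

Lemma A.3. For edge unlearning, we have $\mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1} \le 4$, $\forall k \in [K]$.

Combining the two lemmas and after some algebraic manipulation, we arrive at the desired result. It is not hard to see that $|P - P'|$ has only two nonzero rows, which correspond to the unlearned edge. One can again construct a left stochastic matrix $\tilde{A}'(\tilde{D}')^{-1}$ and a right stochastic matrix $P$, which lead to the result of Lemma A.3.

Proof. The theorem can be proved as follows. From the previous proof we have
$$\|G(w^-)\| \le \|H_{w_\eta} - H_{w^\star}\| \, \|H_{w^\star}^{-1}\| \, \|\Delta\| \le \frac{\gamma_2 \|\Delta\|^2}{\lambda^2 m}.$$
Since the first term $\|H_{w_\eta} - H_{w^\star}\|$ only involves the updated dataset, the upper bound for this term proved for node feature unlearning still holds. The term $\|H_{w^\star}^{-1}\|$ can again be bounded using the fact that $L(\cdot, \mathcal{D}')$ is $\lambda m$-strongly convex. The main difference between node feature and edge unlearning lies in the bound for $\Delta$. By definition,
$$\Delta = \nabla L(w^\star, \mathcal{D}) - \nabla L(w^\star, \mathcal{D}') = \sum_{i=1}^{m} \left[\nabla \ell(e_i^\top Z w^\star, e_i^\top Y_{T_r}) - \nabla \ell(e_i^\top Z' w^\star, e_i^\top Y_{T_r})\right],$$
and
$$\begin{aligned}
\|\Delta\| &\le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \|(Z - Z')^\top e_i\| = \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \|(P^K X - (P')^K X)^\top e_i\| \\
&= \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \|e_i^\top (P^K - (P')^K) X\| = \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \Big\|e_i^\top (P^K - (P')^K) \sum_{j=1}^{n} e_j e_j^\top X\Big\| \\
&\le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \sum_{j=1}^{n} \|e_i^\top (P^K - (P')^K) e_j e_j^\top X\| \\
&\le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \sum_{j=1}^{n} \left|e_i^\top (P^K - (P')^K) e_j\right| \, \|e_j^\top X\| \\
&\le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \sum_{j=1}^{n} \left|e_i^\top (P^K - (P')^K) e_j\right|.
\end{aligned}$$
By Lemma A.2 we have
$$\begin{aligned}
\left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \sum_{j=1}^{n} \left|e_i^\top (P^K - (P')^K) e_j\right|
&\le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{K} e_i^\top (P')^{k-1} |P - P'| P^{K-k} e_j \\
&\le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{k=1}^{K} \mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1}.
\end{aligned}$$
Using Lemma A.3 we arrive at $\|\Delta\| \le \left(\frac{c\gamma_1}{\lambda} + c_1\right) 4K$. Plugging this expression into equation 26 completes the proof.

A.9 PROOF OF THEOREM 4.5

Theorem. Under the node unlearning scenario, we have $\mathcal{D} = (X, Y_{T_r}, P)$ and $\mathcal{D}' = (X', Y'_{T_r}, P')$. Suppose also that Assumption 4.2 holds.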
For $Z = P^K X$ and $P = \tilde{D}^{-1}\tilde{A}$, we have
$$\|\nabla L(w^-, \mathcal{D}')\| = \|(H_{w_\eta} - H_{w^\star}) H_{w^\star}^{-1}\Delta\| \le \frac{\gamma_2 \left(2c\lambda + K(c\gamma_1 + c_1\lambda)(2\tilde{D}_{mm} - 1)\right)^2}{\lambda^4 (m-1)}.$$
Again, the main challenge is to bound $\Delta$. First we observe that $(P')^K X' = (P')^K X$. This holds because node $m$ is removed from the graph in $\mathcal{D}'$, and thus its corresponding node features do not affect $Z'$. Similarly to the proof of Lemma A.2, we first derive the bound $\sum_{k=1}^{K} \mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1}$. For each term,
$$\mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1} = \sum_{l=1}^{n} \mathbf{1}^\top (P')^{k-1} e_l e_l^\top |P - P'| P^{K-k} \mathbf{1}.$$
To proceed, we need the following two lemmas.

Lemma A.4. For node unlearning and $\forall k \in [K]$, $\forall l \in [n]$, $\mathbf{1}^\top (P')^{k-1} (\tilde{D}')^{-1} e_l \le 1$.

Lemma A.5. For node unlearning and $\forall k \in [K]$, $\sum_{l=1}^{n} e_l^\top \tilde{D}' |P - P'| P^{K-k} \mathbf{1} \le 2\tilde{D}_{mm} - 1$.

These two lemmas give rise to the term $K(2\tilde{D}_{mm} - 1)$ in the bound of Theorem 4.5, and the rest of the analysis is similar to that of the previous cases. Lemma A.5 is rather technical, and relies on the following proposition that exploits the structure of $|P - P'|$.

Proposition A.6. For node unlearning and $\forall i, j \ne m$, $e_i^\top |P - P'| e_j = e_i^\top (P' - P) e_j$. For $i = m$ or $j = m$, $e_i^\top |P - P'| e_j = e_i^\top P e_j$.

Proof. The proof is similar to the proof of Theorem 4.3, although several parts need modifications. First, the result of Lemma A.1 needs to be replaced by the following claim.

Lemma A.7. Assume that $\|e_i^\top S\| \le 1$, $\forall i \ne m$, and that $e_m^\top S = 0^\top$. Then $\forall i \in [n]$, $K \ge 0$, we have $\|e_i^\top (P')^K S\| \le 1$, where $P = \tilde{D}^{-1}\tilde{A}$ and $P' = (\tilde{D}')^{-1}\tilde{A}'$.

Next we have to modify the proof regarding the bound of $\|\Delta\|$. Following a proof similar to that of Theorem 4.3, we have
$$\|\Delta\| \le 2c + \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|(Z - Z')^\top e_i\|.$$
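Plugging in the expressions for $Z$ and $Z'$ leads to
$$\begin{aligned}
\sum_{i=1}^{m-1} \|(Z - Z')^\top e_i\| &= \sum_{i=1}^{m-1} \|(P^K X - (P')^K X')^\top e_i\| \overset{(a)}{=} \sum_{i=1}^{m-1} \|(P^K X - (P')^K X)^\top e_i\| \\
&= \sum_{i=1}^{m-1} \Big\|\big(\left[P^K - (P')^K\right] X\big)^\top e_i\Big\| \overset{(b)}{=} \sum_{i=1}^{m-1} \Big\|e_i^\top \left[P^K - (P')^K\right] \sum_{j=1}^{n} e_j e_j^\top X\Big\| \\
&\overset{(c)}{\le} \sum_{i=1}^{m-1} \sum_{j=1}^{n} \left\|e_i^\top \left[P^K - (P')^K\right] e_j e_j^\top X\right\| \\
&\overset{(d)}{\le} \sum_{i=1}^{m-1} \sum_{j=1}^{n} \left|e_i^\top \left[P^K - (P')^K\right] e_j\right| \, \|e_j^\top X\| \\
&\overset{(e)}{\le} \sum_{i=1}^{m-1} \sum_{j=1}^{n} \left|e_i^\top \left[P^K - (P')^K\right] e_j\right|.
\end{aligned}$$
Equality (a) is due to the fact that $(P')^K X' = (P')^K X$, as the $m$-th row and column of $(P')^K$ are all-zeros. Thus, changing the last row of $X'$ makes no difference to $(P')^K X'$. Equality (b) is a consequence of the fact that $I = \sum_{j=1}^{n} e_j e_j^\top$. Inequality (c) follows from Minkowski's inequality, while (d) follows from the Cauchy-Schwarz inequality. Inequality (e) holds based on (3) in Assumption 4.2. By Lemma A.2, we can proceed with our analysis as follows:
$$\begin{aligned}
\sum_{i=1}^{m-1} \|(Z - Z')^\top e_i\| &\le \sum_{i=1}^{m-1} \sum_{j=1}^{n} \left|e_i^\top \left[P^K - (P')^K\right] e_j\right| \\
&\overset{(a)}{\le} \sum_{i=1}^{m-1} \sum_{j=1}^{n} \sum_{k=1}^{K} e_i^\top (P')^{k-1} |P - P'| P^{K-k} e_j \le \sum_{k=1}^{K} \mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1},
\end{aligned}$$
where (a) is due to Lemma A.2 and the fact that $e_i^\top P e_j$ is a scalar, equal to the entry in the $i$-th row and $j$-th column of the matrix $P$. Next, we bound each term $\mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1}$ separately. For $k \in [K]$, we have
$$\begin{aligned}
\mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1} &= \mathbf{1}^\top (P')^{k-1} (\tilde{D}')^{-1} \tilde{D}' |P - P'| P^{K-k} \mathbf{1} \\
&= \mathbf{1}^\top (P')^{k-1} (\tilde{D}')^{-1} \Big(\sum_{l=1}^{n} e_l e_l^\top\Big) \tilde{D}' |P - P'| P^{K-k} \mathbf{1} \\
&= \sum_{l=1}^{n} \left(\mathbf{1}^\top (P')^{k-1} (\tilde{D}')^{-1} e_l\right) \left(e_l^\top \tilde{D}' |P - P'| P^{K-k} \mathbf{1}\right).
\end{aligned}$$
Note that for each index $l$, the corresponding term in the sum is just a product of two scalars. Let us first analyze $\mathbf{1}^\top (P')^{k-1} (\tilde{D}')^{-1} e_l$. Using this term, we obtain the bound
$$\mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1} = \sum_{j=1}^{n} \mathbf{1}^\top (P')^{k-1} (\tilde{D}')^{-1} e_j e_j^\top \tilde{D}' |P - P'| P^{K-k} \mathbf{1} \overset{(a)}{\le} \sum_{j=1}^{n} e_j^\top \tilde{D}' |P - P'| P^{K-k} \mathbf{1},$$
where (a) follows from Lemma A.4.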
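We now turn our attention to the term $e_l^\top \tilde{D}' |P - P'| P^{K-k} \mathbf{1}$, which can be bounded according to Lemma A.5 as follows:
$$\mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1} \le \sum_{l=1}^{n} e_l^\top \tilde{D}' |P - P'| P^{K-k} \mathbf{1} \le 2\tilde{D}_{mm} - 1.$$
Using these two bounds in equation 32 gives
$$\sum_{i=1}^{m-1} \|(Z - Z')^\top e_i\| \le \sum_{i=1}^{m-1} \sum_{j=1}^{n} \left|e_i^\top \left[P^K - (P')^K\right] e_j\right| \le \sum_{k=1}^{K} \mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1} \le \sum_{k=1}^{K} (2\tilde{D}_{mm} - 1) = K(2\tilde{D}_{mm} - 1).$$
Using this bound in the expression for $\|\Delta\|$ we obtain
$$\|\Delta\| \le 2c + \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|(Z - Z')^\top e_i\| \le 2c + \left(\frac{c\gamma_1}{\lambda} + c_1\right) K (2\tilde{D}_{mm} - 1),$$
which implies
$$\begin{aligned}
\|G(w^-)\| &\le \|H_{w_\eta} - H_{w^\star}\| \, \|H_{w^\star}^{-1}\Delta\| \le \gamma_2 (m-1) \|H_{w^\star}^{-1}\Delta\|^2 \\
&\le \gamma_2 (m-1) \left(\frac{2c + \left(\frac{c\gamma_1}{\lambda} + c_1\right) K (2\tilde{D}_{mm} - 1)}{\lambda(m-1)}\right)^2 = \frac{\gamma_2 \left(2c\lambda + K(c\gamma_1 + c_1\lambda)(2\tilde{D}_{mm} - 1)\right)^2}{\lambda^4 (m-1)}.
\end{aligned}$$
This completes the proof.

A.10 PROOF OF THEOREM 4.6

Theorem. In the node feature unlearning scenario, we are given $\mathcal{D} = (X, Y_{T_r}, A)$ and $\mathcal{D}' = (X', Y'_{T_r}, A)$. Suppose that Assumption 4.2 holds. For $Z = \frac{1}{K+1}\left[X, PX, P^2 X, \cdots, P^K X\right]$ and $P = \tilde{D}^{-1}\tilde{A}$, we have
$$\|\nabla L(w^-, \mathcal{D}')\| = \|(H_{w_\eta} - H_{w^\star}) H_{w^\star}^{-1}\Delta\| \le \frac{\gamma_2 \left(2c\lambda + (c\gamma_1 + \lambda c_1)\tilde{D}_{mm}\right)^2}{\lambda^4 (m-1)}.$$

Proof. The proof is almost identical to the proof of Theorem 4.3. We only need to bound the norms of the terms in $Z$. We start by modifying Lemma A.1 for the GPR case.

Lemma A.8. Assume that $\|e_i^\top S\| \le 1$, $\forall i \in [n]$. Then $\forall i \in [n]$, $K \ge 0$, we have $\left\|\frac{1}{\sqrt{K+1}} e_i^\top \left[S, PS, P^2 S, \cdots, P^K S\right]\right\| \le 1$, where $P = \tilde{D}^{-1}\tilde{A}$.

Another part of the proof that needs to be changed is establishing a bound on $\sum_{i=1}^{m-1} \left[\nabla \ell(e_i^\top Z w^\star, e_i^\top Y_{T_r}) - \nabla \ell(e_i^\top Z' w^\star, e_i^\top Y_{T_r})\right]$.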
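Following a proof similar to that of Theorem 4.3, we have
$$\begin{aligned}
&\left\|\sum_{i=1}^{m-1} \left[\nabla \ell(e_i^\top Z w^\star, e_i^\top Y_{T_r}) - \nabla \ell(e_i^\top Z' w^\star, e_i^\top Y_{T_r})\right]\right\| \\
&\quad \le \sum_{i=1}^{m-1} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \|(e_i^\top Z)^\top - (e_i^\top Z')^\top\| = \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \|(Z - Z')^\top e_i\| \\
&\quad = \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \left\|\Big(\tfrac{1}{K+1}\left[X - X', P(X - X'), \cdots, P^K (X - X')\right]\Big)^\top e_i\right\| \\
&\quad = \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \left\|\Big(\tfrac{1}{K+1}\left[e_m e_m^\top X, P e_m e_m^\top X, \cdots, P^K e_m e_m^\top X\right]\Big)^\top e_i\right\| \\
&\quad = \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \left\|\tfrac{1}{K+1}\left[e_i^\top e_m e_m^\top X, e_i^\top P e_m e_m^\top X, \cdots, e_i^\top P^K e_m e_m^\top X\right]^\top\right\| \\
&\quad \le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \left\|\tfrac{1}{K+1}\left[e_i^\top e_m, e_i^\top P e_m, \cdots, e_i^\top P^K e_m\right]^\top\right\| \, \|(e_m^\top X)^\top\| \\
&\quad \le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \left\|\tfrac{1}{K+1}\left[e_i^\top e_m, e_i^\top P e_m, \cdots, e_i^\top P^K e_m\right]^\top\right\| \\
&\quad \overset{(a)}{\le} \left(\frac{c\gamma_1}{\lambda} + c_1\right) \sum_{i=1}^{m-1} \frac{1}{K+1} \sum_{k=1}^{K} e_i^\top P^k e_m \le \frac{\frac{c\gamma_1}{\lambda} + c_1}{K+1} \sum_{k=1}^{K} \mathbf{1}^\top P^k e_m \\
&\quad = \frac{\frac{c\gamma_1}{\lambda} + c_1}{K+1} \sum_{k=1}^{K} \mathbf{1}^\top P^k \tilde{D}^{-1}\tilde{D} e_m = \frac{\frac{c\gamma_1}{\lambda} + c_1}{K+1} \sum_{k=1}^{K} \mathbf{1}^\top P^k \tilde{D}^{-1} e_m \tilde{D}_{mm} \\
&\quad \overset{(b)}{=} \frac{\frac{c\gamma_1}{\lambda} + c_1}{K+1} \sum_{k=1}^{K} \mathbf{1}^\top \tilde{D}^{-1} p^{(k)} \tilde{D}_{mm} \le \frac{\frac{c\gamma_1}{\lambda} + c_1}{K+1} \sum_{k=1}^{K} \mathbf{1}^\top p^{(k)} \tilde{D}_{mm} \\
&\quad = \frac{\frac{c\gamma_1}{\lambda} + c_1}{K+1} \times K \tilde{D}_{mm} \le \left(\frac{c\gamma_1}{\lambda} + c_1\right) \tilde{D}_{mm},
\end{aligned}$$
where (a) is due to the fact that the $\ell_1$ norm is an upper bound for the $\ell_2$ norm. Also note that $e_i^\top e_m = 0$, $\forall i \ne m$. In (b), $\forall k \in [K]$, $p^{(k)}$ are probability vectors. This completes the proof.

Remark. Note that the GPR extension for the edge and node unlearning cases can be derived through a similar analysis. One can also see that the key step is inequality (a), which still holds for the edge and node unlearning cases. The results are similar to Theorem 4.4 and Theorem 4.5, except that the definition of $Z$ is replaced by one corresponding to the GPR case, as in Theorem 4.6.

A.11 PROOF OF LEMMA A.1

Lemma. Assume that $\|e_i^\top S\| \le 1$, $\forall i \in [n]$. Then $\forall i \in [n]$, $K \ge 0$, we have $\|e_i^\top P^K S\| \le 1$, where $P = \tilde{D}^{-1}\tilde{A}$.

Proof. We prove this lemma by induction. Let $Z^{(k)} = P^k S$. For the base case $k = 0$, it is true by assumption that $\|e_i^\top S\| \le 1$, $\forall i \in [n]$. Assume next that the claim is true for the case $k = K - 1$.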

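Then we have
$$\|e_i^\top P^K S\| = \|e_i^\top P Z^{(K-1)}\| = \Big\|\frac{1}{\tilde{D}_{ii}} \sum_{j: \tilde{A}_{ij} = 1} e_j^\top Z^{(K-1)}\Big\| \le \frac{1}{\tilde{D}_{ii}} \sum_{j: \tilde{A}_{ij} = 1} \|e_j^\top Z^{(K-1)}\| \overset{(a)}{\le} \frac{1}{\tilde{D}_{ii}} \sum_{j: \tilde{A}_{ij} = 1} 1 = \frac{1}{\tilde{D}_{ii}} \times \tilde{D}_{ii} = 1,$$
where (a) is based on the induction hypothesis for $k = K - 1$.

Remark: Note that if we choose a propagation matrix $P$ different from the one used in the SGC analysis, for example the symmetric normalization $P = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, the above expression for $K = 1$ becomes
$$\|e_i^\top P S\| = \Big\|\frac{1}{\sqrt{\tilde{D}_{ii}}} \sum_{j: \tilde{A}_{ij} = 1} \frac{e_j^\top S}{\sqrt{\tilde{D}_{jj}}}\Big\| \le \frac{1}{\sqrt{\tilde{D}_{ii}}} \sum_{j: \tilde{A}_{ij} = 1} \frac{\|e_j^\top S\|}{\sqrt{\tilde{D}_{jj}}} \le \frac{1}{\sqrt{\tilde{D}_{ii}}} \sum_{j: \tilde{A}_{ij} = 1} \frac{1}{\sqrt{\tilde{D}_{jj}}}.$$
We cannot easily simplify the sum $\sum_{j: \tilde{A}_{ij} = 1} \frac{1}{\sqrt{\tilde{D}_{jj}}}$. One way to approach the problem is to simply use the fact that the degree of a node is at least 1, so that the sum can be further upper bounded by $\tilde{D}_{ii}$. This leads to the bound
$$\|e_i^\top P S\| \le \sqrt{\tilde{D}_{ii}}. \tag{43}$$
Obviously, this bound is worse than the one in Lemma A.1 even when $K = 1$. For general $K$, there will be an additional exponent $K/2$ for the maximal degree, which is undesirable. Nevertheless, this bound is tight in the worst case of a star graph centered at node $i$, for which $\tilde{D}_{jj} = 2$ for all $j \ne i$. The same argument applies for other degree normalizations. Thus it is critical to choose $P = \tilde{D}^{-1}\tilde{A}$ to obtain the desired bound in Lemma A.1.

A.12 PROOF OF LEMMA A.2

Lemma. For either the edge or node unlearning case, and $\forall i, j \in [n]$, $K \ge 1$, we have
$$\left|e_i^\top \left[P^K - (P')^K\right] e_j\right| \le \sum_{k=1}^{K} e_i^\top (P')^{k-1} |P - P'| P^{K-k} e_j.$$

Proof. The proof consists of two parts. We first show that $P^K - (P')^K = \sum_{k=1}^{K} (P')^{k-1} (P - P') P^{K-k}$. Then we proceed to analyze the absolute values of all terms in the sum. The proof of the first part follows from a telescoping property of the sum,
$$\begin{aligned}
\sum_{k=1}^{K} (P')^{k-1} (P - P') P^{K-k} &= \sum_{k=1}^{K} \left[(P')^{k-1} P^{K-k+1} - (P')^{k} P^{K-k}\right] \\
&= (P')^0 P^K - (P')^1 P^{K-1} + (P')^1 P^{K-1} - (P')^2 P^{K-2} + \cdots + (P')^{K-1} P^1 - (P')^K P^0 \\
&= P^K - (P')^K.
\end{aligned}$$
Next, note that both $P'$ and $P$ are nonnegative matrices, and the same is true of their $k$-th powers, $k \ge 2$. Thus,
$$\left|e_i^\top \left[P^K - (P')^K\right] e_j\right| = \Big|\sum_{k=1}^{K} e_i^\top (P')^{k-1} (P - P') P^{K-k} e_j\Big| \le \sum_{k=1}^{K} e_i^\top (P')^{k-1} |P - P'| P^{K-k} e_j.$$
This completes the proof.

$$e_1^\top |P - P'| = \cdots = e_1^\top (\tilde{D}')^{-1} \tilde{D}^{-1} \tilde{A} + \frac{d_1 - 2}{d_1 (d_1 - 1)} e_m^\top,$$
where the last equality holds since $\tilde{A}_{1m} = 1$. Similar arguments apply for the $m$-th row, for which we have
$$e_m^\top |P - P'| = e_m^\top (\tilde{D}')^{-1} \tilde{D}^{-1} \tilde{A} + \frac{d_m - 2}{d_m (d_m - 1)} e_1^\top.$$
For a fixed $k \in [K]$,
$$\begin{aligned}
\mathbf{1}^\top (P')^{k-1} |P - P'| P^{K-k} \mathbf{1} &= \mathbf{1}^\top (P')^{k-1} e_1 e_1^\top (\tilde{D}')^{-1} \tilde{D}^{-1} \tilde{A} P^{K-k} \mathbf{1} + \mathbf{1}^\top (P')^{k-1} \frac{d_1 - 2}{d_1 (d_1 - 1)} e_1 e_m^\top P^{K-k} \mathbf{1} \\
&\quad + \mathbf{1}^\top (P')^{k-1} e_m e_m^\top (\tilde{D}')^{-1} \tilde{D}^{-1} \tilde{A} P^{K-k} \mathbf{1} + \mathbf{1}^\top (P')^{k-1} \frac{d_m - 2}{d_m (d_m - 1)} e_m e_1^\top P^{K-k} \mathbf{1}.
\end{aligned}$$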
We analyze these four terms separately. For the first term, we have 1 T P ′ k-1 e 1 e T 1 D′ -1 D-1 ÃP K-k 1 = 1 T P ′ k-1 D′ -1 e 1 e T 1 D-1 ÃP K-k 1 = 1 T P ′ k-1 D′ -1 e 1 e T 1 P K-k+1 1 (51) By the same argument as used in the proof for node feature unlearning, 1 T P ′ k-1 D′ -1 e 1 = 1 T D′ -1 p ≤ 1, for some probability vector p. Also, e T 1 P K-k+1 1 ≤ 1, which holds due to the fact that P is a right-stochastic matrix. We have hence shown that the first term in equation 51 is bounded by 1. For the second term, note that d1-2 d1(d1-1) ≤ 1 (d1-1) . Hence, 1 T P ′ k-1 d 1 -2 d 1 (d 1 -1) e 1 e T m P K-k 1 ≤ 1 T P ′ k-1 1 d 1 -1 e 1 e T m P K-k 1 = 1 T P ′ k-1 D′ -1 e 1 e T m P K-k 1 ≤ 1, where the final inequality follows the same argument as the one used for bounding the first term. For the third and fourth term, the analysis is similar to these two cases and both terms can be shown to be bounded by 1. Hence, we have All our experiments were executed on a Linux machine with 48 cores, 376GB of system memory, and two NVIDIA Tesla P100 GPUs with 12GB of GPU memory each. Information about all datasets can be found in Table 1 . The data split is public and obtained from PyTorch Geometric Fey & Lenssen (2019) . We used the "full" split option for Cora, Citeseer and Pubmed. Since there is no public split for Computers and Photo, we adopted a similar setting as for the citation networks via random splits (i.e., 500 nodes in the validation set and 1, 000 nodes in the test set). The data split for ogbn-arxiv is the public split provided by the Open Graph Benchmark Hu et al. (2020) . 1 T P ′ k-1 |P -P ′ |P K-k 1 ≤ 4. Dependency on the node degree. We verified our Theorem 4.5 and Theorem 4.6 for node degree dependencies on Photo, Cora, Citeseer and Pubmed. The results are presented in Figure 5 . Nonaccumulative time for each of the unlearning procedures. Figure 6 shows the average time complexity for each unlearning step on the Cora dataset. 
The spikes for approximate graph unlearning methods and Unstructured Unlearning (Guo et al., 2020) correspond to retraining after a removal.

Membership inference attacks for unlearned models. We performed experiments for node unlearning tasks and applied the membership inference (MI) attack for GNNs reported in Olatunji et al. (2021b) to our updated models. For simplicity, we used the Cora dataset and removed up to 100 nodes. After each removal, we applied an MI attack on the updated model. We compare the results of our SGC node unlearning approach with those of the original SGC model without updates, which is the model trained on the full dataset, and with SGC retraining, which corresponds to the model obtained after retraining upon each removal request. We repeated the experiments over 10 different trials with random splits and averaged the results. As shown in Figure 7, even for full SGC retraining the attack model can still identify parts of the removed nodes in the training set, and the result of SGC node unlearning is slightly worse (w.r.t. privacy) than retraining since our method is concerned with approximate unlearning. Note that the performance of the MI attack on the original model is consistent with the results from Olatunji et al. (2021b) and significantly worse than both our unlearning and the complete retraining methods. This, from the experimental side, shows that



Figure 2: Comparison of proposed SGC node feature unlearning (left column), edge unlearning (middle column) and node unlearning (right column) with baseline methods. The shaded regions in the second row represent the standard deviation of test accuracy. In the third row, we show the accumulated unlearning time as a function of the number of unlearned points. The time needed for each unlearning procedure is given in Appendix A.19.

Figure 3: (a), (b) Simulation verification of the results in Theorems 4.5 and 4.6 pertaining to node degrees. (d), (f) Accumulated unlearning time as a function of the number of removed nodes. The unlearning time of Unstructured Unlearning is often higher than that of our proposed approximate graph unlearning algorithms, because the number of retraining steps needed may be larger. (c), (e) Performance of approximate graph unlearning methods on different datasets. We set $\alpha = 10$, $\lambda = 10^{-4}$ for Computers and $\lambda = 10^{-4}$ for ogbn-arxiv. The number of repeated trials is 3 due to the large amount of removed data. (g) Tradeoff between privacy $\epsilon$ and performance. To achieve similar numbers of retraining, we set $\alpha\epsilon = 0.1$. (h) The number of data points predicted by the membership inference attack model to lie in the training set.


Figure 4: Difference between machine unlearning (as defined in Guo et al. (2020)) and Differential Privacy (DP).

) of 4.2 hold as well. Observe that $\ell'(x, e_i^T Y_{Tr}) = \sigma(e_i^T Y_{Tr}\, x) - 1$. Since the sigmoid function $\sigma(\cdot)$ is restricted to lie in $[0, 1]$, $|\ell'|$ is bounded by 1, which means that our loss satisfies (5) in 4.2 with $c_1 = 1$. Based on the Mean Value Theorem, one can show that $\sigma(x)$ is $\max_{x \in \mathbb{R}} |\sigma'(x)|$-Lipschitz.
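The two constants in this argument can be checked numerically. The sketch below is only an illustration: the helper names and the evaluation grid are our own choices, and the value $\max_x |\sigma'(x)| = 1/4$ (attained at $x = 0$) is a standard property of the sigmoid rather than something stated in the text above.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def max_slope_on_grid(lo=-10.0, hi=10.0, steps=2001):
    """Largest |sigma'(x)| over an evenly spaced grid (the grid is illustrative)."""
    xs = [lo + (hi - lo) * t / (steps - 1) for t in range(steps)]
    return max(sigmoid_grad(x) for x in xs)


def loss_derivative(margin):
    """The quantity sigma(margin) - 1 that appears as the loss derivative."""
    return sigmoid(margin) - 1.0
```

Since `sigmoid` takes values in $[0, 1]$, `loss_derivative` always lies in $[-1, 0]$, which matches $c_1 = 1$, while `max_slope_on_grid()` evaluates the Lipschitz constant suggested by the Mean Value Theorem argument.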

Algorithm 1 Training procedure
1: input: Training data $Z \in \mathbb{R}^{m \times d}$, training labels $Y \in \mathbb{R}^{m}$, loss $\ell$, parameters $\alpha, \lambda > 0$.
2: Sample the noise vector $b \sim \mathcal{N}(0, \alpha^2)^d$.
3: $w^\star = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^{m} \ell(z_i^T w, y_i) + \frac{\lambda}{2} \|w\|^2 + b^T w$.
4: return $w^\star$.

Algorithm 2 Unlearning procedure
1: input: Feature matrix $X \in \mathbb{R}^{n \times d}$, labels $Y \in \mathbb{R}^{n}$, one-step propagation matrix $P$, loss $\ell$, training set indices $Tr = \{i_1, i_2, \ldots\}$, sequence of removal requests $R_m = \{j_1, j_2, \ldots\}$, parameters $K, \epsilon, \delta, \gamma_2, \alpha, \lambda > 0$.
2: Compute node embedding after propagation $Z = P^K X$.
3: Training set $\mathcal{D} = \{z_i, y_i\}_{i \in Tr}$.
4: Compute $w$ using Algorithm 1 $(\mathcal{D}, \ell, \alpha, \lambda)$.
5: Accumulated gradient residual norm $\beta = 0$.
6: for $j \in R_m$ do
7:
the training indices $Tr = Tr \setminus \{j\}$.
set $\mathcal{D}' = \{z'_i, y_i\}_{i \in Tr}$.
13:
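The training procedure (sample a noise vector, then minimize the perturbed regularized objective) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the logistic loss $\ell(s, y) = -\log \sigma(y s)$ with labels in $\{-1, +1\}$, uses plain gradient descent as the solver, and the names `train`, `objective_grad`, `lr`, `iters` and `seed` are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def objective_grad(w, Z, y, lam, b):
    """Gradient of sum_i l(z_i^T w, y_i) + (lam / 2) ||w||^2 + b^T w.

    Assumes the logistic loss l(s, y) = -log(sigmoid(y * s)), y in {-1, +1}.
    """
    grad = [lam * wk + bk for wk, bk in zip(w, b)]
    for z_i, y_i in zip(Z, y):
        s = sum(zk * wk for zk, wk in zip(z_i, w))
        coeff = (sigmoid(y_i * s) - 1.0) * y_i  # dl/ds for the logistic loss
        for k, zk in enumerate(z_i):
            grad[k] += coeff * zk
    return grad


def train(Z, y, alpha, lam, seed=0, lr=0.1, iters=5000):
    """Sample b ~ N(0, alpha^2)^d, then minimize the perturbed objective."""
    rng = random.Random(seed)
    d = len(Z[0])
    b = [rng.gauss(0.0, alpha) for _ in range(d)]  # standard deviation alpha
    w = [0.0] * d
    for _ in range(iters):  # gradient descent; the solver choice is an assumption
        g = objective_grad(w, Z, y, lam, b)
        w = [wk - lr * gk for wk, gk in zip(w, g)]
    return w, b
```

Because the objective is $\lambda$-strongly convex, the minimizer is unique, so any convergent solver could replace the gradient descent loop; the fixed step size used here is only appropriate for small, norm-bounded embeddings such as rows of $Z$ with norm at most 1.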

consider the case $l \neq m$, $r = m$. Again, by Proposition A.6, we have $e_l^T D' |P - P'| e_m e_m^T$ case (1), we have $\tilde{A}_{lm} = 1$ and $D'_{ll} = D_{ll} - 1 \ge 1$. This leads to $e_l^T D' |P - P'| e_m e_m^T$

Figure 5: Additional examination of the degree dependency result from Theorem 4.5 (top) and Theorem 4.6 (bottom).

Figure 8: Performance of approximate graph unlearning methods on different datasets. First Row: We removed up to 55% of the training data in Citeseer. Second Row: We removed up to 50% of the training data in Pubmed. Third Row: We set $\alpha = 10$, $\lambda = 10^{-4}$, and removed up to 30% of the training data in Amazon Photo.

Figure 9: Performance of approximate graph unlearning methods on Cora. First Row: The reported statistics are based on averaging over 10 repeated trials with random splitting. Second Row: GPR-based models are used to obtain node embeddings. All other settings are the same as in Figure 2.

We performed experiments pertaining to the node unlearning task and applied the membership inference (MI) attack for GNNs reported in Olatunji et al. (2021b) on our updated model. The experimental details are discussed in Appendix A.19. As shown in Figure 3(h), even for full SGC retraining the attack can still identify parts of the removed nodes in the training set (for relevant explanations, see Appendix A.2), and the result of SGC node unlearning is slightly worse (w.r.t. privacy) than retraining since our algorithm performs approximate unlearning. Note that the performance of the MI attack on the original model is consistent with the results in Olatunji et al.

Properties of benchmarking datasets.

ACKNOWLEDGMENTS

This work was funded by NSF grants 1816913 and 1956384. The authors thank Wei-Ning Chen and Pan Li for the helpful discussion.

A APPENDIX

A.1 CONCLUSION

We introduced the first known framework for approximate graph unlearning. In this setting, new analytical unlearning challenges had to be addressed due to the presence of complex graph feature and topology data. Our analytical contributions pertain to novel proof techniques for approximate graph unlearning, while our empirical studies on six benchmark datasets established fundamental performance-complexity trade-offs between unlearning and complete retraining.

A.13 PROOF OF LEMMA A.3

Lemma. For the edge unlearning scenario, and $\forall k \in [K]$, we have (47)

Proof. Let us start by analyzing the matrix $|P - P'|$. Note that all its rows are zeros except for the $1$st and $m$-th rows. The first row of the matrix equals $\tilde{A}_{1(m+1)}, \ldots$

A.14 PROOF OF LEMMA A.4

Lemma. For all $k \in [K]$ and $l \in [n]$,

Proof. For $k = 1$, the claim is obviously true for all $l \in [n]$, as the largest entry in $D^{-1}$ is upper bounded by 1. For $k \ge 2$ and $l \neq m$ we have

In (a), $p$ stands for a probability vector and the result follows since $\tilde{A}' (D')^{-1}$ is a left-stochastic matrix if one ignores the node $m$. For $l = m$, it is easy to see that $\tilde{A}' D' e_m = 0$ by the fact that the $m$-th row and column of $\tilde{A}'$ are all-zeros. This completes the proof.

A.15 PROOF OF LEMMA A.5

Lemma. For node unlearning, and $\forall k \in$

Proof. First, note that

where the $m$-th entry of the first row vector equals 0 and the second row vector is all-zeros except for the $m$-th entry. Note that the first row vector times $\frac{D_{ll}}{D_{ll} - 1} > 1$ is a probability vector. Hence, by the property of $P^{K-k}$ being a right-stochastic matrix, we have

Since $\frac{D_{ll} - 1}{D_{ll}} < 1$, we also have

Together, this shows that for each $j \neq m$ and for the case (1), one has

For case (2), note that $e_l^T D' |P - P'|$ is an all-zero row vector. Note also that, excluding self-loops, there are at most $D_{mm} - 1$ neighbors $l$ of $m$ (case (1)). Thus,

for some probability vector $p$. We have hence shown that for any $k \in$

This completes the proof.

A.16 PROOF OF LEMMA A.7

Lemma. Assume that $\|e_i^T S\| \le 1$, $\forall i \neq m$ and that $e_m^T S = 0^T$. Then $\forall i \in [n]$, $K \ge 0$, we have $\|e_i^T (P')^K S\| \le 1$, where $P = D^{-1} \tilde{A}$ and $P' = (D')^{-1} \tilde{A}'$.

Proof. The proof is similar to the proof of Lemma A.1, and based on induction. The base case $k = 0$ is obviously true by assumption. Now, assume that the claim is true for $k = K - 1$ and let

Here, (a) is due to our hypothesis for $k = K - 1$. For $i = m$, note that $\tilde{A}'_{mj} = 0$, $\forall j \in [n]$. Thus, $\|e_m^T (P')^K S\| = 0 \le 1$. This completes the proof.

A.17 PROOF OF LEMMA A.8

Proof. By Lemma A.1, we have ∥e

which completes the proof.

Remark. Using the normalization $\frac{1}{K+1}$ also leads to a norm bounded by 1. Hence, the norm of each row of $Z$ is bounded by 1. We need the normalization $\frac{1}{K+1}$ instead of $\frac{1}{\sqrt{K+1}}$ to accommodate another claim in the proof.

A.18 PROOF OF PROPOSITION A.6

Proposition. We have $e_i^T |P - P'| e_j = e_i^T (P' - P) e_j$, $\forall i, j \neq m$. For $i = m$ or $j = m$, $e_i^T |P - P'| e_j = e_i^T P e_j$.

Proof. For the first case when $\forall i, j \neq m$,

Recall that by definition, in this case we have $\tilde{A}_{ij} = \tilde{A}'_{ij}$. Now, there are two cases to consider: (1) $i$ is a neighbor of $m$; (2) $i$ is not a neighbor of $m$. For (1), we know that $D'$

This directly implies $e_i^T |P - P'| e_j = e_i^T (P' - P) e_j$. For (2), we know that $D'_{ii} = D_{ii}$ and thus $e_i^T |P - P'| e_j = 0 = e_i^T (P' - P) e_j$. These claims complete the proof for the first part. For the case that $i = m$ or $j = m$, note that since both the $m$-th row and column are all-zeros for $P'$, we simply have $e_i^T |P - P'| e_j = e_i^T P e_j$. Note that in establishing the claim we also used the fact that $P$ is nonnegative. This completes the proof.

Published as a conference paper at ICLR 2023

Figure 6: Nonaccumulative time for each removal step on the Cora dataset. The setting is the same as in Figure 2.

our method offers similar privacy-preserving performance as full retraining, and better performance when compared to the original model without unlearning. Nevertheless, the results also motivate the search for alternatives to MI attacks for unlearning schemes.
Additional experiments. The performance of our proposed approximate graph unlearning methods on three datasets, Citeseer, Pubmed and Amazon Photo, is shown in Figure 8. It is worth pointing out that our bound on the gradient residual norm in Section 4 does not guarantee the generalization ability of the updated model. Therefore, it could happen that the test accuracy increases as we remove information from the training set, as shown in the second row of Figure 8, or that the performance is not very stable, as seen in the third row of Figure 8.

We also performed additional experiments on the Cora dataset, with results shown in Figure 9. The first row shows the average performance over 10 repeated trials with random splitting, and the conclusion is the same as the one stated in Section 6. The second row shows the performance on GPR-based models. Note that when the number of removal requests becomes large, the performance of GPR-based models degrades much faster than that of SGC-based models. This observation is consistent with our discussion of GPR-based models. None of the retraining methods involves noise.
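As a numerical sanity check of the telescoping identity and the entrywise bound in the proof of Lemma A.2, and of the bound by 4 derived for edge unlearning, one can evaluate both sides on a small graph. The sketch below is illustrative only: the 4-node graph, the removed edge, and all helper names are hypothetical, and it assumes that $P = D^{-1}\tilde{A}$ and $P' = (D')^{-1}\tilde{A}'$ are obtained by row-normalizing the adjacency matrices with self-loops before and after the edge removal.

```python
def matmul(A, B):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matpow(A, k):
    """k-th power of a square matrix; k = 0 gives the identity."""
    n = len(A)
    R = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = matmul(R, A)
    return R


def row_normalize(A):
    """Random-walk normalization D^{-1} A, with D the diagonal of row sums."""
    return [[x / sum(row) for x in row] for row in A]


def lemma_terms(P, P_new, K, absolute=False):
    """Matrices (P')^{k-1} M P^{K-k}, k = 1..K, with M = P - P' or |P - P'|."""
    n = len(P)
    M = [[P[i][j] - P_new[i][j] for j in range(n)] for i in range(n)]
    if absolute:
        M = [[abs(x) for x in row] for row in M]
    return [matmul(matmul(matpow(P_new, k - 1), M), matpow(P, K - k))
            for k in range(1, K + 1)]


def add_all(mats):
    """Entrywise sum of a list of equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]


# Hypothetical toy graph: a 4-cycle with self-loops (not taken from the paper).
A = [[1, 1, 0, 1],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 0, 1, 1]]
# Edge unlearning: remove the edge between nodes 0 and 3.
A_new = [row[:] for row in A]
A_new[0][3] = A_new[3][0] = 0

K = 3
P, P_new = row_normalize(A), row_normalize(A_new)
PK, PK_new = matpow(P, K), matpow(P_new, K)
lhs = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(PK, PK_new)]
rhs = add_all(lemma_terms(P, P_new, K))
abs_terms = lemma_terms(P, P_new, K, absolute=True)
bound = add_all(abs_terms)

# Largest entrywise gap between P^K - (P')^K and the telescoping sum.
telescoping_gap = max(abs(l - r) for rl, rr in zip(lhs, rhs) for l, r in zip(rl, rr))
# Entrywise check of |P^K - (P')^K| <= sum_k (P')^{k-1} |P - P'| P^{K-k}.
bound_holds = all(abs(l) <= b + 1e-12 for rl, rb in zip(lhs, bound) for l, b in zip(rl, rb))
# 1^T (P')^{k-1} |P - P'| P^{K-k} 1 for each k = 1..K.
per_k_totals = [sum(map(sum, T)) for T in abs_terms]
```

Here `telescoping_gap` should vanish up to floating-point error, `bound_holds` should be true, and every entry of `per_k_totals` should stay below 4. The example covers edge removal only; after node removal the $m$-th row of $\tilde{A}'$ is all-zeros, so `row_normalize` would need to handle zero rows.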

