EFFICIENT MODEL UPDATES FOR APPROXIMATE UNLEARNING OF GRAPH-STRUCTURED DATA

Abstract

With the adoption of recent laws ensuring the "right to be forgotten", the problem of machine unlearning has become of significant importance. This is particularly true for graph-structured data and for learning tools specialized for such data, including graph neural networks (GNNs). This work introduces the first known approach for approximate graph unlearning with provable theoretical guarantees. The challenges in addressing the problem are two-fold. First, there exist multiple types of unlearning requests that need to be considered, including node feature, edge, and node unlearning. Second, to establish provable performance guarantees, one needs to carefully evaluate the process of feature mixing during propagation. We focus on analyzing Simple Graph Convolutions (SGC) and their Generalized PageRank (GPR) extensions, thereby laying the theoretical foundations for unlearning GNNs. Empirical evaluations on six benchmark datasets demonstrate excellent performance/complexity/privacy trade-offs of our approach compared to complete retraining and to general methods that do not leverage graph information. For example, unlearning 200 out of the 1208 training nodes of the Cora dataset leads to only a 0.1% loss in test accuracy, yet offers a 4-fold speed-up compared to complete retraining at a (ϵ, δ) = (1, 10⁻⁴) "privacy cost". We also exhibit a 12% increase in test accuracy on the same dataset compared to unlearning methods that do not leverage graph information, with comparable time complexity and the same privacy guarantee. Our code is available online.

1. INTRODUCTION

Machine learning algorithms are used in many application domains, including biology, computer vision and natural language processing. The relevant models are often trained on third-party datasets, or on internal or customized subsets of publicly available user data. For example, many computer vision models are trained on images from Flickr users (Thomee et al., 2016; Guo et al., 2020), while many natural language processing models (e.g., for sentiment analysis) and recommender systems rely heavily on repositories such as IMDB (Maas et al., 2011). Furthermore, numerous ML classifiers in computational biology are trained on data from the UK Biobank (Sudlow et al., 2015), a collection of genetic and medical records of roughly half a million participants (Ginart et al., 2019). With recent demands for increased data privacy, these and many other data repositories are facing increasing numbers of data removal requests. Certain laws are already in place guaranteeing the right to certified data removal, including the European Union's General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA) and the Canadian Consumer Privacy Protection Act (CPPA) (Sekhari et al., 2021). Removing user data from a dataset is insufficient to guarantee the desired level of privacy, since models trained on the original data may still contain information about its patterns and features. This consideration gave rise to a new research direction in machine learning, referred to as machine unlearning (Cao & Yang, 2015), in which the goal is to guarantee that the user data information is also removed from the trained model. Naively, one can retrain the model from scratch to meet the privacy demand, yet retraining comes at a high computational cost and is thus impractical when accommodating frequent removal requests.
We take the first step towards solving the approximate unlearning problem by performing a nontrivial theoretical analysis of some simplified GNN architectures. Inspired by the certified removal procedure for unstructured data (Guo et al., 2020), we propose the first known approach for approximate graph unlearning. Our main contributions are as follows. First, we introduce three types of data removal requests for graph unlearning: node feature unlearning, edge unlearning and node unlearning (see Figure 1). Second, we derive theoretical guarantees for approximate graph unlearning mechanisms for all three removal cases on SGC (Wu et al., 2019) and their GPR generalizations. In particular, we analyze L2-regularized graph models trained with differentiable convex loss functions. The analysis is challenging because propagation on graphs "mixes" node features. Our analysis reveals that the degree of the unlearned node plays an important role in the unlearning process, while the number of propagation steps may or may not be important, depending on the unlearning scenario. To the best of our knowledge, the theoretical guarantees established in this work are the first provable approximate unlearning results for graphs. Furthermore, the proposed analysis encompasses both node classification and node regression problems. Third, our empirical investigation on frequently used datasets for GNN learning shows that our method offers an excellent performance-complexity-privacy trade-off. For example, when unlearning 200 out of the 1208 training nodes of the Cora dataset, our method offers test accuracy comparable to complete retraining, while providing a 4-fold speed-up at a (ϵ, δ) = (1, 10⁻⁴) "privacy cost". We also test our model on datasets for which removal requests are most likely to arise, including Amazon co-purchase networks. Due to space limitations, all proofs and some detailed discussions are relegated to the Appendix.
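To make the notion of "feature mixing" concrete, the following toy sketch (not the paper's implementation; the graph, features, and propagation depth are arbitrary illustrative choices) computes K-step SGC embeddings Z = (D̂^{-1/2} Â D̂^{-1/2})^K X with self-loops, and shows that perturbing a single node's features alters the embeddings of all nodes within K hops:

```python
import numpy as np

# Toy graph: 4 nodes, symmetric adjacency matrix (illustrative values only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])

def sgc_embeddings(A, X, K):
    """K-step SGC propagation Z = P^K X with the symmetrically
    normalized adjacency P = D̂^{-1/2} Â D̂^{-1/2}, where Â = A + I."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    P = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.linalg.matrix_power(P, K) @ X

Z = sgc_embeddings(A, X, K=2)

# Feature mixing: perturbing only node 0's features changes the embedding
# of every node reachable within 2 hops -- here, the entire graph.
X_perturbed = X.copy()
X_perturbed[0] += 1.0
Z_perturbed = sgc_embeddings(A, X_perturbed, K=2)
changed = np.abs(Z_perturbed - Z).sum(axis=1) > 1e-12
print(changed)  # every entry is True for this connected toy graph
```

This is precisely why graph unlearning is harder than the unstructured case: a single node feature removal request can, in principle, require updating the model's view of every node within the propagation radius.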



Figure 1: Illustration of three different types of approximate graph unlearning problems and a comparison with the case of unlearning without graph information (Guo et al., 2020). The colors of the nodes capture properties of node features, and the red frame indicates node embeddings affected by 1-hop propagation. When no graph information is used, the node embeddings are uncorrelated. However, for the case of graph unlearning problems, removing one node or edge can affect the node embeddings of the entire graph for a large enough number of propagation steps.

Machine unlearning and certified data removal. Cao & Yang (2015) introduced the concept of machine unlearning and proposed distributed learners for exact unlearning. Bourtoule et al. (2021) introduced sharding-based methods for unlearning, while Ginart et al. (2019) described unlearning approaches for k-means clustering. These works focused on exact unlearning: The unlearned model is required to perform identically to a completely retrained model. As an alternative, Guo et al. (2020) introduced a probabilistic definition of unlearning motivated by differential privacy (Dwork, 2011). Sekhari et al. (2021) studied the generalization performance of machine unlearning methods.
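For completeness, the probabilistic notion of unlearning from Guo et al. (2020) can be paraphrased as follows: a removal mechanism $M$ applied to a (randomized) learning algorithm $A$ performs $(\epsilon, \delta)$-certified removal if, for all datasets $\mathcal{D}$, all removal requests $z \in \mathcal{D}$, and all measurable sets of models $\mathcal{T}$,

$$\Pr\big(M(A(\mathcal{D}), \mathcal{D}, z) \in \mathcal{T}\big) \le e^{\epsilon}\, \Pr\big(A(\mathcal{D} \setminus z) \in \mathcal{T}\big) + \delta,$$
$$\Pr\big(A(\mathcal{D} \setminus z) \in \mathcal{T}\big) \le e^{\epsilon}\, \Pr\big(M(A(\mathcal{D}), \mathcal{D}, z) \in \mathcal{T}\big) + \delta.$$

In words, the updated model must be statistically close (in the sense of differential privacy) to a model retrained from scratch on the dataset with $z$ removed.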

To avoid complete retraining, various methods for machine unlearning have been proposed, including exact approaches (Ginart et al., 2019; Bourtoule et al., 2021) as well as approximate methods (Guo et al., 2020; Sekhari et al., 2021).

At the same time, graph-centered machine learning has received significant interest from the learning community due to the ubiquity of graph-structured data. Usually, the data contains two sources of information: node features and graph topology. Graph neural networks (GNNs) leverage both types of information simultaneously and achieve state-of-the-art performance in numerous real-world ap-

