EFFICIENT MODEL UPDATES FOR APPROXIMATE UNLEARNING OF GRAPH-STRUCTURED DATA

Abstract

With the adoption of recent laws ensuring the "right to be forgotten", the problem of machine unlearning has become of significant importance. This is particularly true for graph-structured data and for learning tools specialized for such data, including graph neural networks (GNNs). This work introduces the first known approach for approximate graph unlearning with provable theoretical guarantees. The challenges in addressing the problem are two-fold. First, there exist multiple types of unlearning requests that need to be considered, including node feature, edge, and node unlearning. Second, to establish provable performance guarantees, one needs to carefully evaluate the process of feature mixing during propagation. We focus on analyzing Simple Graph Convolutions (SGC) and their Generalized PageRank (GPR) extensions, thereby laying the theoretical foundations for unlearning GNNs. Empirical evaluations on six benchmark datasets demonstrate excellent performance/complexity/privacy trade-offs of our approach compared to complete retraining and to general methods that do not leverage graph information. For example, unlearning 200 out of 1208 training nodes of the Cora dataset leads to only a 0.1% loss in test accuracy, yet offers a 4-fold speed-up compared to complete retraining at a (ε, δ) = (1, 10⁻⁴) "privacy cost". We also observe a 12% increase in test accuracy on the same dataset compared to unlearning methods that do not leverage graph information, with comparable time complexity and the same privacy guarantee. Our code is available online.
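For context on the two models named above: SGC removes the nonlinearities of a graph convolutional network, reducing it to K rounds of feature propagation with the symmetrically normalized adjacency matrix followed by a linear classifier, while GPR generalizes this to a weighted combination over all propagation depths (SGC corresponds to a single nonzero weight). The following is a minimal dense-NumPy sketch of both propagation schemes; the function names and formulation are our own illustration, not the paper's implementation:

```python
import numpy as np

def sgc_embeddings(A, X, K=2):
    """Propagate features X for K steps with the symmetrically normalized
    adjacency (self-loops added), as in SGC: X_out = S^K X."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                   # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt     # normalized propagation matrix
    X = X.astype(float)
    for _ in range(K):
        X = S @ X                           # one round of feature mixing
    return X

def gpr_embeddings(A, X, weights):
    """Generalized PageRank propagation: sum_k w_k S^k X, a weighted sum
    over hop depths, of which SGC is the special case of one nonzero w_k."""
    n = A.shape[0]
    A_hat = A + np.eye(n)
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt
    out = np.zeros_like(X, dtype=float)
    Xk = X.astype(float)
    for w in weights:
        out += w * Xk                       # accumulate current hop depth
        Xk = S @ Xk                         # move to the next depth
    return out
```

Because the propagated features are a deterministic function of the graph and inputs, a subsequent linear model trained on them is the natural target for the certified-removal analysis the abstract alludes to.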

1. INTRODUCTION

Machine learning algorithms are used in many application domains, including biology, computer vision and natural language processing. The relevant models are often trained on third-party datasets, or on internal or customized subsets of publicly available user data. For example, many computer vision models are trained on images from Flickr users (Thomee et al., 2016; Guo et al., 2020), while many natural language processing systems (e.g., for sentiment analysis) and recommender systems heavily rely on repositories such as IMDB (Maas et al., 2011). Furthermore, numerous ML classifiers in computational biology are trained on data from the UK Biobank (Sudlow et al., 2015), a collection of genetic and medical records of roughly half a million participants (Ginart et al., 2019). With heightened attention to data privacy, these and many other data repositories are facing increasing requests for data removal. Certain laws are already in place guaranteeing the right to certified data removal, including the European Union's General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA) and the Canadian Consumer Privacy Protection Act (CPPA) (Sekhari et al., 2021). Removing user data from a dataset alone is insufficient to guarantee the desired level of privacy, since models trained on the original data may still contain information about its patterns and features. This consideration gave rise to a new research direction in machine learning, referred to as machine unlearning (Cao & Yang, 2015), in which the goal is to guarantee that the user data information is also removed from the trained model. Naively, one can retrain the model from scratch to meet


