Fast Yet Effective Graph Unlearning through Influence Analysis

Abstract

Recent evolving data privacy policies and regulations have led to increasing interest in the problem of removing information from a machine learning model. In this paper, we consider Graph Neural Networks (GNNs) as the target model and study the problem of edge unlearning in GNNs, i.e., learning a new GNN model as if a specified set of edges had never existed in the training graph. Despite its practical importance, the problem remains elusive due to the non-convex nature of GNNs and the large scale of the input graph. Our main technical contribution is three-fold: 1) we cast the problem of fast edge unlearning as estimating the influence of the edges to be removed and eliminating the estimated influence from the original model in one shot; 2) we design a computationally and memory-efficient algorithm named EraEdge for edge influence estimation and unlearning; 3) under standard regularity conditions, we prove that EraEdge converges to the desired model. A comprehensive set of experiments on four prominent GNN models and three benchmark graph datasets demonstrates that EraEdge achieves significant speedups over retraining from scratch with little loss in model accuracy. The speedup is even more pronounced on large graphs. Furthermore, EraEdge attains significantly higher model accuracy than existing GNN unlearning approaches.

1. INTRODUCTION

Recent legislation such as the General Data Protection Regulation (GDPR) (Regulation, 2018), the California Consumer Privacy Act (CCPA) (Pardau, 2018), and the Personal Information Protection and Electronic Documents Act (PIPEDA) (Parliament, 2000) requires companies to remove private user data upon request. This has prompted the discussion of the "right to be forgotten" (Kwak et al., 2017), which entitles users to greater control over their data by deleting it from learned models. If a company has already used data collected from users to train its machine learning (ML) models, these models need to be manipulated accordingly to reflect data deletion requests.

In this paper, we consider Graph Neural Networks (GNNs) that receive frequent edge removal requests as our target ML model. For example, consider a social network graph collected from an online social network platform that witnesses frequent insertion/deletion of users (nodes) and/or changes of social relations between users (edges). Some of these structural changes can be accompanied by users' requests to withdraw their data. In this paper, we only consider requests to remove social relations (edges). The owner of the platform is then obligated by law to remove the effect of the requested edges, so that the GNN models trained on the graph do not "remember" the corresponding social interactions.

In general, a naive solution to deleting user data from a trained ML model is to retrain the model on the training data with the samples to be removed excluded. However, retraining a model from scratch can be prohibitively expensive, especially for complex ML models and large training data. To address this issue, numerous efforts (Mahadevan & Mathioudakis, 2021; Brophy & Lowd, 2021; Cauwenberghs & Poggio, 2000; Cao & Yang, 2015) have been spent on designing efficient unlearning methods that can remove the effect of particular data samples without model retraining.
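To make the idea of "removing the effect of a sample without retraining" concrete, the following is a minimal sketch (not the paper's EraEdge algorithm) of one-shot influence-based unlearning in the convex setting: a ridge-regression model "unlearns" one training sample by taking a single Newton step on the leave-one-out loss. All variable names are illustrative; for quadratic losses this step recovers the retrained model exactly, whereas for non-convex GNNs it is only an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def fit(X, y):
    # Closed-form ridge solution: argmin 0.5*||Xw - y||^2 + 0.5*lam*||w||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta = fit(X, y)

# One-shot unlearning of sample i: a Newton step at theta on the reduced loss.
i = 7
H_new = X.T @ X - np.outer(X[i], X[i]) + lam * np.eye(d)  # Hessian without sample i
g_new = -X[i] * (X[i] @ theta - y[i])                     # gradient of reduced loss at theta
theta_unlearned = theta - np.linalg.solve(H_new, g_new)

# Retraining from scratch on the remaining samples gives the same parameters here.
theta_retrained = fit(np.delete(X, i, axis=0), np.delete(y, i))
print(np.allclose(theta_unlearned, theta_retrained))  # True
```

The key saving is that the one-shot update touches only the removed sample and a linear solve, rather than re-optimizing over the whole dataset; the challenge addressed by unlearning methods for GNNs is extending this kind of update to non-convex models and graph-structured influence.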
One of the main challenges is how to estimate the effect of a given training sample on model parameters (Golatkar et al., 2021), which has led to research focusing on simpler convex learning problems such as linear/logistic regression (Mahadevan & Mathioudakis, 2021), random forests (Brophy & Lowd, 2021), support vector machines (Cauwenberghs & Poggio, 2000), and k-means clustering (Ginart et al., 2019), for which a theoretical analysis was established. Although there have

