SECURE NETWORK RELEASE WITH LINK PRIVACY

Abstract

Many data mining and analytical tasks rely on the abstraction of networks (graphs) to summarize relational structures among individuals (nodes). Since relational data are often sensitive, we seek effective approaches to release utility-preserving yet privacy-protected structured data. In this paper, we leverage the differential privacy (DP) framework to formulate and enforce rigorous privacy constraints on deep graph generation models, with a focus on edge-DP to guarantee individual link privacy. In particular, we enforce edge-DP by injecting Gaussian noise into the gradients of a link-reconstruction-based graph generation model, while ensuring data utility by improving structure learning with structure-oriented graph comparison. Extensive experiments on two real-world network datasets show that our proposed DPGGAN model is able to generate networks with effectively preserved global structure and rigorously protected individual link privacy.

1. INTRODUCTION

Nowadays, open network data play a pivotal role in data mining and data analytics (Tang et al., 2008; Sen et al., 2008; Blum et al., 2013; Leskovec & Krevl, 2014). By releasing and sharing structured relational data with research facilities and enterprise partners, data companies harvest the enormous potential value of their data, which benefits decision-making in various domains (social, financial, environmental) through collectively improved ads, recommendation, retention, and so on (Yang et al., 2017; 2018; Sigurbjörnsson & Van Zwol, 2008; Kuhn, 2009). However, network data usually encode sensitive information not only about individuals but also about their interactions, which makes direct release and exploitation rather unsafe. More importantly, even with careful anonymization, individual privacy is still at stake under collective attack models facilitated by the underlying network structure (Zhang et al., 2019; Cai et al., 2018). Can we find a way to securely release network data without drastic sanitization that essentially renders the released data useless? In dealing with this tension between the need to release utilizable data and the concern for data owners' privacy, quite a few models have been proposed recently, focusing on grid-based data like images, texts, and gene sequences (Frigerio et al., 2019; Papernot et al., 2018; Triastcyn & Faltings, 2018; Narayanan & Shmatikov, 2008; Xie et al., 2018; Chen et al., 2018; Boob et al., 2018; Dy & Krause, 2018; Lecuyer et al., 2018; Zhang et al., 2018). However, none of the existing models can be directly applied to the network (graph) setting. While a secure generative model on grid-based data clearly aims to preserve high-level semantics (e.g., class distributions) and protect detailed training data (e.g., exact images or sentences), it remains unclear what should be preserved and what should be protected for network data, due to its modeling of complex interactive objects.

Motivating scenario.
In Figure 1, a bank aims to encourage public studies on its customers' community structures. It does so by first anonymizing all customers and then sharing the network (i.e., (a) in Figure 1) with the public. However, an attacker interested in the financial interactions (e.g., money transfers) between particular customers of the bank may happen to have access to another network over a similar set of customers (e.g., as a malicious employee of another financial company). The similarity of simple graph properties, such as node degree and triangle count, between the two networks can then be used to identify specific customers in the released network with high accuracy (e.g., customer A as the only node with degree 5 and within 1 triangle, and customer B as the only node with degree 2 and within 1 triangle). The attacker thus confidently learns A's and B's identities and the fact that they have financial interactions in the bank, which seriously harms the customers' privacy and poses potential risks. As the first contribution of this work, we define and formulate the goals of secure network release as preserving global network structure while protecting individual link privacy. Continuing with the toy example, the solution we propose is to train a graph neural network model on the original network and release the generated networks instead (e.g., (b) in Figure 1). Towards the utility of the generated networks, we require them to be similar to the original networks from a global perspective, as measured by various global graph properties (e.g., network (b) has a very similar degree distribution and the same triangle count as (a)). In this way, we expect many downstream data mining and analytical tasks to produce similar results on them as on the original networks.
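The structural fingerprints exploited by the attacker above (node degree and triangle participation) can be computed directly. Below is a minimal sketch in plain Python; the toy graph and its node labels are hypothetical stand-ins for the anonymized network in Figure 1(a):

```python
# Hypothetical toy network standing in for the anonymized release in Figure 1(a).
edges = {("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"), ("B", "C")}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Structural fingerprints an attacker can match across networks:
# each node's degree and the number of triangles it participates in.
degree = {v: len(nbrs) for v, nbrs in adj.items()}
triangles = {v: sum(len(adj[v] & adj[w]) for w in adj[v]) // 2 for v in adj}

print(degree["A"], triangles["A"])  # 5 1 -- unique in the network
print(degree["B"], triangles["B"])  # 2 1 -- also unique
```

Because each fingerprint is unique here, matching it against an auxiliary network with similar statistics suffices to re-identify the customer despite anonymization.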
(a) Anonymized original net. (b) DPGGAN generated net.

Figure 1: A toy pair of anonymized and generated networks.

As for privacy protection, we require that the information in the generated networks cannot confidently reveal the existence or absence of any individual link in the original networks (e.g., the attacker may still identify customers A and B in network (b), but their link structure has changed). Subsequently, there are two unique challenges in learning such structure-preserving and privacy-protecting graph generation models, which have not been explored by existing literature so far.

Challenge 1: Rigorous protection of individual link privacy. The rich relational structures in graph data often allow attackers to recover private information through various forms of collective inference (Zhang et al., 2014; Narayanan & Shmatikov, 2009; Backstrom et al., 2007). Moreover, graph structure can always be converted into numerical features such as spectral embeddings, after which most attacks on grid-based data, like model inversion (Fredrikson et al., 2015) and membership inference (Shokri et al., 2017), can be directly applied for link identification. How can we design an effective mechanism with rigorous privacy protection on links in networks against various attacks?

Challenge 2: Effective preservation of global network structure. To capture the global network structure, the model has to constantly compare the structures of the input graphs and the currently generated graphs during training. However, unlike images and other grid-based data, graphs have flexible structures and thus lack efficient universal representations (Dong et al., 2019). How can we allow a network generation model to effectively learn from the structural difference between two graphs, without constantly conducting very costly operations like isomorphism tests?

Present work. In this work, for the first time, we draw attention to the secure release of network data with deep generative models. Technically, towards the aforementioned two challenges, we develop Differentially Private Graph Generative Nets (DPGGAN), which imposes DP training over a link-reconstruction-based network generation model for rigorous individual link privacy protection, and further ensures structure-oriented graph comparison for effective global network structure preservation. In particular, we first formulate and enforce edge-DP via Gaussian gradient distortion, injecting designed noise into the sensitive modules during model training. Moreover, we improve structure learning (Gu et al., 2019; Larsen et al., 2016) to enable structure-oriented network comparison.

To evaluate the effectiveness of DPGGAN, we conduct extensive experiments on two real-world network datasets. On one hand, we evaluate the utility of the generated networks by computing a suite of commonly studied graph properties to compare the global structure of the generated networks with that of the original ones. On the other hand, we validate the privacy of individual links by evaluating links predicted from the generated networks against the original networks. Consistent experimental results show that DPGGAN is able to generate networks that are similar to the original ones in terms of global network structure, while remaining useless for individual link prediction.
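The Gaussian gradient distortion mentioned above follows the general recipe of DP-SGD-style training: clip each per-example gradient to bound its sensitivity, then add Gaussian noise calibrated to the clipping norm. The sketch below illustrates only that generic recipe, not the paper's exact mechanism; the function name, clipping norm, and noise multiplier are hypothetical:

```python
import numpy as np

def sanitize_gradients(per_example_grads, clip_norm=1.0,
                       noise_multiplier=1.1, rng=None):
    """One DP-SGD-style sanitization step: clip each per-example gradient
    to bound its sensitivity, sum, add Gaussian noise calibrated to the
    clipping norm, then average."""
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = [g / max(1.0, np.linalg.norm(g) / clip_norm)
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

# Two hypothetical per-example gradients; the first exceeds the clip norm.
grads = [np.array([3.0, 4.0]), np.array([0.3, -0.4])]
g_dp = sanitize_gradients(grads)  # noisy update direction used for training
```

The clipping bounds how much any single training example (here, link) can shift the update, and the noise scale relative to that bound is what yields the formal DP guarantee.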

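One cheap way to compare two graphs structurally without isomorphism tests, in the spirit of the structure-oriented comparison discussed above, is to reduce each graph to a fixed-size global signature such as a normalized degree histogram. This is an illustrative simplification, not the paper's actual comparison mechanism; all names and graphs below are hypothetical:

```python
import numpy as np

def degree_histogram(adj, max_degree=10):
    """Normalized degree histogram: a fixed-size global structural signature."""
    degrees = adj.sum(axis=1).astype(int)
    hist = np.bincount(degrees, minlength=max_degree + 1)[:max_degree + 1]
    return hist / hist.sum()

def structural_distance(a, b, max_degree=10):
    """L1 distance between degree histograms: a cheap, structure-oriented
    comparison that sidesteps graph isomorphism tests entirely."""
    return np.abs(degree_histogram(a, max_degree)
                  - degree_histogram(b, max_degree)).sum()

# Two hypothetical 4-node graphs: a triangle with a pendant node vs. a 4-cycle.
A = np.array([[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]])
B = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
print(structural_distance(A, A))      # 0.0  (identical structure)
print(structural_distance(A, B) > 0)  # True (different degree profiles)
```

Richer signatures (triangle counts, spectra, or learned graph embeddings) fit the same pattern: map each graph to a fixed-size vector once, then compare vectors cheaply during training.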
