SECURE NETWORK RELEASE WITH LINK PRIVACY

Abstract

Many data mining and analytical tasks rely on the abstraction of networks (graphs) to summarize relational structures among individuals (nodes). Since relational data are often sensitive, we seek effective approaches to release utility-preserving yet privacy-protected structured data. In this paper, we leverage the differential privacy (DP) framework to formulate and enforce rigorous privacy constraints on deep graph generation models, with a focus on edge-DP to guarantee individual link privacy. In particular, we enforce edge-DP by injecting Gaussian noise into the gradients of a link-reconstruction-based graph generation model, while ensuring data utility by improving structure learning with structure-oriented graph comparison. Extensive experiments on two real-world network datasets show that our proposed DPGGAN model is able to generate networks with effectively preserved global structure and rigorously protected individual link privacy.
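As background for the mechanism named above, injecting calibrated Gaussian noise into clipped per-example gradients follows the general DP-SGD recipe. The following is a minimal NumPy sketch of that recipe, not the paper's actual implementation; the clip norm, noise multiplier, and function name are illustrative assumptions.

```python
import numpy as np

def dp_noisy_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1,
                      rng=np.random.default_rng(0)):
    """DP-SGD-style gradient step (illustrative, hypothetical parameters):
    clip each example's gradient to L2 norm `clip_norm`, average the clipped
    gradients, then add Gaussian noise calibrated to the clipping bound."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds clip_norm
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean = np.mean(clipped, axis=0)
    # Noise std sigma * C on the gradient sum corresponds to
    # sigma * C / B on the batch mean
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=mean.shape)
    return mean + noise
```

The clipping bounds each training example's influence on the update, which is what makes the added Gaussian noise sufficient for a DP guarantee.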

1. INTRODUCTION

Nowadays, open network data play a pivotal role in data mining and data analytics (Tang et al., 2008; Sen et al., 2008; Blum et al., 2013; Leskovec & Krevl, 2014). By releasing and sharing structured relational data with research facilities and enterprise partners, data companies harvest the enormous potential value of their data, which benefits decision-making in social, financial, and environmental domains through collectively improved ads, recommendation, retention, and so on (Yang et al., 2017; 2018; Sigurbjörnsson & Van Zwol, 2008; Kuhn, 2009). However, network data usually encode sensitive information not only about individuals but also about their interactions, which makes direct release and exploitation rather unsafe. More importantly, even with careful anonymization, individual privacy is still at stake under collective attack models facilitated by the underlying network structure (Zhang et al., 2019; Cai et al., 2018). Can we find a way to securely release network data without drastic sanitization that essentially renders the released data useless? In dealing with this tension between the need to release utilizable data and the concern for data owners' privacy, quite a few models have been proposed recently, focusing on grid-based data like images, texts, and gene sequences (Frigerio et al., 2019; Papernot et al., 2018; Triastcyn & Faltings, 2018; Narayanan & Shmatikov, 2008; Xie et al., 2018; Chen et al., 2018; Boob et al., 2018; Dy & Krause, 2018; Lecuyer et al., 2018; Zhang et al., 2018). However, none of the existing models can be directly applied to the network (graph) setting. While a secure generative model on grid-based data clearly aims to preserve high-level semantics (e.g., class distributions) and protect detailed training data (e.g., exact images or sentences), it remains unclear what should be preserved and what should be protected for network data, due to its modeling of complex interacting objects.
Motivating scenario.
In Figure 1, a bank aims to encourage public studies on its customers' community structures. It does so by first anonymizing all customers and then sharing the network (i.e., (a) in Figure 1) with the public. However, an attacker interested in the financial interactions (e.g., money transfers) between particular customers of the bank may happen to have access to another network over a similar set of customers (e.g., a malicious employee of another financial company). The similarity of simple graph properties such as node degree distribution and triangle count between the two networks can then be used to identify specific customers in the released network with high accuracy (e.g., customer A as the only node with degree 5 and within 1 triangle, and customer B as the only node with degree 2 and within 1 triangle). Thus, the attacker confidently learns A's and B's identities and the fact that they have financial interactions at the bank, which seriously compromises the customers' privacy and creates substantial risks.
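The scenario above can be made concrete with a small sketch: an attacker computes a (degree, triangle-count) fingerprint for each node and re-identifies any node whose fingerprint is unique. The toy 6-node graph below is a hypothetical example, not the paper's Figure 1.

```python
# Hypothetical toy graph; edges are undirected pairs of anonymized node IDs.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (3, 4), (4, 5), (0, 4), (0, 5)]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def fingerprint(node):
    """(degree, number of triangles the node participates in)."""
    nbrs = adj[node]
    # Each triangle through `node` is counted twice across its neighbors
    tri = sum(len(adj[w] & nbrs) for w in nbrs) // 2
    return (len(nbrs), tri)

prints = {n: fingerprint(n) for n in adj}
# Nodes with a globally unique fingerprint are re-identifiable by
# an attacker who observes the same structure in an auxiliary network.
unique = [n for n, fp in prints.items()
          if sum(1 for f in prints.values() if f == fp) == 1]
# Here nodes 0 (degree 5, 3 triangles) and 4 (degree 3, 2 triangles)
# are uniquely fingerprinted.
```

This is exactly why anonymization alone is insufficient: the fingerprints survive the removal of names, motivating the edge-DP guarantees pursued in this paper.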

