CAUSALLY CONSTRAINED DATA SYNTHESIS FOR PRIVATE DATA RELEASE

Abstract

Data privacy is critical in many decision-making contexts, such as healthcare and finance. A common mechanism is to create differentially private synthetic data using generative models. Such synthetic data reflects certain statistical properties of the original data, but often exhibits an unacceptable privacy vs. utility trade-off. Since natural data inherently exhibits causal structure, we propose incorporating causal information into the training process to favorably navigate this trade-off. Under certain assumptions for linear Gaussian models and a broader class of models, we theoretically prove that causally informed generative models provide better differential privacy guarantees than their non-causal counterparts. We evaluate our proposal using variational autoencoders, and demonstrate that the trade-off is mitigated through better utility for comparable privacy.

1. INTRODUCTION

Automating AI-based solutions and making evidence-based decisions both require data analyses. However, in many situations, the data is sensitive and cannot be published directly. Synthetic data generation, which captures certain statistical properties of the original data, is useful in resolving these issues. However, naive data synthesis may not work: when improperly constructed, the synthetic data can leak information about its sensitive counterpart (from which it was constructed). Several membership inference (MI) and attribute inference attacks demonstrated against generative models (Mukherjee et al., 2019; Zhang et al., 2020b) eliminate any privacy advantage provided by releasing synthetic data. Therefore, effective privacy-preserving synthetic data generation methods are needed. The de facto mechanism for providing privacy in synthetic data release is differential privacy (DP) (Dwork et al., 2006), which is known to degrade utility in proportion to the amount of privacy provided. This is further exacerbated in tabular data because of the correlations between different records, and among different attributes within a record. In such settings, the amount of noise required to provide meaningful privacy guarantees often destroys utility. Apart from assumptions made on the independence of records and attributes, prior works make numerous assumptions about how the synthetic data will be used in downstream tasks in order to customize the DP application (Xiao et al., 2010; Hardt et al., 2010; Cormode et al., 2019; Dwork et al., 2009). To this end, we propose a mechanism to create synthetic data that is agnostic of the downstream task. Similar to Jordon et al. (2018), our solution involves training a generative model to provide formal DP guarantees. A key distinction arises as we encode knowledge about the causal structure of the data into the generation process to provide better utility.
Our approach leverages the fact that naturally occurring data exhibits causal structure. In particular, to induce favorable privacy vs. utility trade-offs, our main contribution involves encoding the causal graph (CG) into the training of the generative model used to synthesize data. Considering the case of linear Gaussian models, we formally prove that generative models trained with additional knowledge of the causal structure of the specific dataset are more private than their non-causal counterparts. We extend this proof to a broader class of generative models as well. To validate the theoretical results on real-world data, we present a novel practical solution utilizing variational autoencoders (VAEs) (Kingma & Welling, 2013). These models combine the advantages of both deep learning and probabilistic modeling: they scale to large datasets, flexibly fit complex data in a probabilistic manner, and can be used for data generation (Ma et al., 2019; 2020a). Thus, in designing our solution, we train causally informed and differentially private VAEs. The CG can be obtained from a domain expert, learnt directly from observed data (Zheng et al., 2018; Morales-Alvarez et al., 2021), or obtained using a DP CG discovery algorithm (Wang et al., 2020a). The problem of learning the CG itself is important but orthogonal to the goals of this paper. We evaluate our approach to understand its efficacy both towards improving the utility of downstream tasks and towards robustness to an MI attack (Stadler et al., 2020). Further, we aim to understand the effect of true, partial, and incorrect CGs on the privacy vs. utility trade-off. We experimentally evaluate our solution on a synthetic dataset where the true CG is known.
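To make the linear Gaussian setting concrete, the sketch below samples observational data from a linear Gaussian structural causal model given a causal graph. This is an illustrative example (not code from the paper); the function name, the 3-variable chain X0 → X1 → X2, and the edge weights are all hypothetical choices for demonstration.

```python
import numpy as np

def sample_linear_gaussian_scm(adj, noise_std, n_samples, rng=None):
    """Sample from a linear Gaussian SCM.

    adj[i, j] is the edge weight for X_i -> X_j (0 means no edge).
    Variables are assumed to be indexed in topological order, so each
    variable's parents have strictly smaller indices.
    """
    rng = np.random.default_rng(rng)
    k = adj.shape[0]
    X = np.zeros((n_samples, k))
    for j in range(k):  # visit variables in topological order
        # X_j = weighted sum of its (already sampled) parents + Gaussian noise
        X[:, j] = X @ adj[:, j] + rng.normal(0.0, noise_std[j], n_samples)
    return X

# Hypothetical 3-variable chain: X0 -> X1 (weight 2.0) -> X2 (weight -1.5)
adj = np.array([[0.0, 2.0,  0.0],
                [0.0, 0.0, -1.5],
                [0.0, 0.0,  0.0]])
data = sample_linear_gaussian_scm(adj, noise_std=[1.0, 0.5, 0.5],
                                  n_samples=5000, rng=0)
```

A causally informed generator only needs to model each variable given its parents in the CG, rather than the full joint dependence structure.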
We evaluate on real-world applications: a medical dataset (Tu et al., 2019) and a student response dataset from a real-world online education platform (Wang et al., 2020b), and perform ablation studies using the Lung Cancer dataset (Lauritzen & Spiegelhalter, 1988). Through our evaluation, we show that models that are causally informed are more stable (Kutin & Niyogi, 2012) than associational (either non-causal, or with the incorrect causal structure) models trained using the same dataset. In the absence of DP noise, causal models enhance the baseline utility* by 2.42 percentage points (PPs) on average, while non-causal models degrade it by 3.49 PPs. With respect to privacy evaluation, prior works rely solely on the value of the privacy budget ε. We take this one step further and empirically evaluate resilience to MI. Our experimental results demonstrate the positive impact of causal information in inhibiting the MI adversary's advantage on average. Better still, we demonstrate that DP models that incorporate complete or even partial causal information are more resilient to MI adversaries than purely differentially private models with the exact same ε-DP guarantees. In summary, the contributions of our work include:
1. A deeper understanding of the advantages of causality through a theoretical result that highlights the privacy amplification induced by being causally informed (§ 3), and insight as to how this can be instantiated (§ 4.1).
2. Empirical results demonstrating that causally constrained (and DP) models are more utilitarian in downstream classification tasks (§ 5.1) and are robust (on average) to MI attacks (§ 5.2).

2. PROBLEM STATEMENT & NOTATION

Problem Statement: Formally, we define a dataset D to be the set {x_1, ..., x_n} of n records x_i ∈ X (the universe of records); each record x = (x^1, ..., x^k) has k attributes (a.k.a. variables X_1, ..., X_k). We aim to design a procedure which takes as input a private (or sensitive) dataset D_p and outputs a synthetic dataset D_s. The output should have formal privacy guarantees and maintain statistical properties of the input for downstream tasks. Formally speaking, we wish to design f_θ : Z → X, where θ are the parameters of the method and Z is some underlying latent representation for inputs in X. In our work, we wish for f_θ to provide the guarantee of differential privacy.

Differential Privacy (Dwork et al., 2006): Let ε ∈ R+ be the privacy budget, and H be a randomized mechanism that takes a dataset as input. H is said to provide ε-differential privacy (DP) if, for all datasets D_1 and D_2 that differ on a single record, and all subsets S of the outcomes of running H: P[H(D_1) ∈ S] ≤ e^ε · P[H(D_2) ∈ S], where the probability is over the randomness of H.

Sensitivity: Let d ∈ Z+, let D be a collection of datasets, and define H : D → R^d. The ℓ_1 sensitivity of H, denoted ∆H, is defined by ∆H = max ||H(D_1) − H(D_2)||_1, where the maximum is over all pairs of datasets D_1 and D_2 in D differing in at most one record.

We rely on generative models to enable private data release. If they are trained to provide DP, then any further post-processing (i.e., using them to obtain a synthetic dataset) is also DP by the post-processing property (Dwork et al., 2014). In this work, we use variational autoencoders (VAEs) as our generative models.

Variational Autoencoders (VAEs) (Kingma & Welling, 2013): Data generation, p_θ(x|z), is realized by a deep neural network (DNN) parameterized by θ, known as the decoder.
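The interplay between sensitivity and the privacy budget can be illustrated with the classic Laplace mechanism, which achieves ε-DP for a numeric query by adding noise with scale ∆H/ε. This is a standard textbook example, not part of the paper's mechanism; the function name is our own.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Release `value` with eps-DP by adding Laplace(sensitivity / epsilon) noise.

    Lower epsilon (more privacy) or higher sensitivity means a larger
    noise scale, and hence lower utility of the released answer.
    """
    rng = np.random.default_rng(rng)
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale, size=np.shape(value))

# A counting query ("how many records satisfy P?") has sensitivity 1:
# adding or removing one record changes the count by at most 1.
true_count = 100.0
noisy_releases = [laplace_mechanism(true_count, sensitivity=1.0,
                                    epsilon=0.5, rng=i)
                  for i in range(2000)]
```

Individual releases are noisy (scale ∆H/ε = 2 here), but the mechanism is unbiased, so averaging many independent releases recovers the true count.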
To approximate the posterior of the latent variable, p_θ(z|x), VAEs use another DNN (the encoder) with x as input to produce an approximation of the posterior, q_φ(z|x). VAEs are trained by maximizing an evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence between q_φ(z|x) and the true posterior p_θ(z|x).
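As a minimal sketch of this objective (illustrative, not the paper's implementation): for the common choice of a diagonal Gaussian encoder q_φ(z|x) = N(μ, diag(σ²)) and a standard normal prior p(z) = N(0, I), the KL term of the ELBO has a closed form, KL = ½ Σ (μ² + σ² − log σ² − 1).

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian
    with mean `mu` and log-variance `log_var`, summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def elbo(recon_log_lik, mu, log_var):
    """ELBO = E_q[log p_theta(x|z)] - KL(q_phi(z|x) || p(z)).

    Since log p(x) = ELBO + KL(q_phi(z|x) || p_theta(z|x)), maximizing
    the ELBO minimizes the KL divergence to the true posterior.
    """
    return recon_log_lik - gaussian_kl_to_standard_normal(mu, log_var)
```

When the encoder output matches the prior exactly (μ = 0, log σ² = 0), the KL term vanishes and the ELBO reduces to the reconstruction term.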

* Utility obtained from models trained on the original dataset (without the use of any generative model).

