CAUSALLY CONSTRAINED DATA SYNTHESIS FOR PRIVATE DATA RELEASE

Abstract

Data privacy is critical in many decision-making contexts, such as healthcare and finance. A common mechanism is to create differentially private synthetic data using generative models. Such synthetic data reflects certain statistical properties of the original data, but often suffers from an unacceptable privacy vs. utility trade-off. Since natural data inherently exhibits causal structure, we propose incorporating causal information into the training process to navigate this trade-off more favorably. Under certain assumptions, for linear Gaussian models and a broader class of models, we theoretically prove that causally informed generative models provide better differential privacy guarantees than their non-causal counterparts. We evaluate our proposal using variational autoencoders, and demonstrate that the trade-off is mitigated through better utility for comparable privacy.

1. INTRODUCTION

Automating AI-based solutions and making evidence-based decisions both require data analyses. However, in many situations, the data is sensitive and cannot be published directly. Synthetic data generation, which captures certain statistical properties of the original data, is useful in resolving these issues. However, naive data synthesis may not work: when improperly constructed, the synthetic data can leak information about its sensitive counterpart (from which it was constructed). Several membership inference (MI) and attribute inference attacks demonstrated against generative models (Mukherjee et al., 2019; Zhang et al., 2020b) eliminate any privacy advantage provided by releasing synthetic data. Therefore, effective privacy-preserving synthetic data generation methods are needed. The de facto mechanism for providing privacy in synthetic data release is differential privacy (DP) (Dwork et al., 2006), which is known to degrade utility in proportion to the amount of privacy provided. This is further exacerbated in tabular data because of the correlations between different records, and among different attributes within a record. In such settings, the amount of noise required to provide meaningful privacy guarantees often destroys utility. Beyond assumptions about the independence of records and attributes, prior works make numerous assumptions about how the synthetic data will be used in downstream tasks in order to customize the application of DP (Xiao et al., 2010; Hardt et al., 2010; Cormode et al., 2019; Dwork et al., 2009). To this end, we propose a mechanism to create synthetic data that is agnostic to the downstream task. Similar to Jordon et al. (2018), our solution involves training a generative model that provides formal DP guarantees. A key distinction is that we encode knowledge about the causal structure of the data into the generation process to provide better utility.
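To make the privacy vs. utility tension above concrete, consider the classical Laplace mechanism, the textbook way of achieving &#949;-DP for a numeric query. The sketch below is purely illustrative (it is not the mechanism proposed in this paper, and all function names are ours): the noise scale grows as sensitivity/&#949;, so a stricter privacy budget directly translates into a noisier, less useful answer.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Illustrative epsilon-DP Laplace mechanism: releases
    true_value + Lap(0, sensitivity / epsilon)."""
    rng = np.random.default_rng(rng)  # accepts a seed or an existing Generator
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

# Counting query (sensitivity 1): compare a strict budget (eps = 0.1)
# with a loose one (eps = 1.0) on the true count 100.
rng = np.random.default_rng(0)
strict = np.array([laplace_mechanism(100, 1.0, 0.1, rng) for _ in range(5000)])
loose = np.array([laplace_mechanism(100, 1.0, 1.0, rng) for _ in range(5000)])

err_strict = np.mean(np.abs(strict - 100))  # expected ~ 10
err_loose = np.mean(np.abs(loose - 100))    # expected ~ 1
```

The tenfold gap in mean absolute error between the two budgets is the trade-off this paper targets: rather than spending the entire budget on attribute-wise noise, the proposal uses causal structure so that less noise is needed for the same guarantee.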
Our approach leverages the fact that naturally occurring data exhibits causal structure. In particular, to induce favorable privacy vs. utility trade-offs, our main contribution is to encode the causal graph (CG) into the training of the generative model used to synthesize data. For the case of linear Gaussian models, we formally prove that generative models trained with additional knowledge of the causal structure of the dataset are more private than their non-causal counterparts. We extend this proof to a broader class of generative models as well. To validate the theoretical results on real-world data, we present a novel practical solution using variational autoencoders (VAEs) (Kingma & Welling, 2013). These models combine the advantages of deep learning and probabilistic modeling: they scale to large datasets, are flexible enough to fit complex data in a probabilistic manner, and can be used for data generation (Ma et al., 2019; 2020a). Thus, in designing our solution, we train causally informed and differentially private VAEs. The

