CAUSALLY CONSTRAINED DATA SYNTHESIS FOR PRIVATE DATA RELEASE

Abstract

Data privacy is critical in many decision-making contexts, such as healthcare and finance. A common mechanism is to create differentially private synthetic data using generative models. Such synthetic data reflects certain statistical properties of the original data, but often exhibits an unacceptable privacy vs. utility trade-off. Since natural data inherently exhibits causal structure, we propose incorporating causal information into the training process to favorably navigate this trade-off. Under certain assumptions, for linear Gaussian models and a broader class of models, we theoretically prove that causally informed generative models provide better differential privacy guarantees than their non-causal counterparts. We evaluate our proposal using variational autoencoders, and demonstrate that the trade-off is mitigated through better utility for comparable privacy.

1. INTRODUCTION

Automating AI-based solutions and making evidence-based decisions both require data analyses. However, in many situations, the data is sensitive and cannot be published directly. Synthetic data generation, which captures certain statistical properties of the original data, is useful in resolving these issues. However, naive data synthesis may not work: when improperly constructed, the synthetic data can leak information about the sensitive data from which it was constructed. Several membership inference (MI) and attribute inference attacks demonstrated against generative models (Mukherjee et al., 2019; Zhang et al., 2020b) eliminate any privacy advantage provided by releasing synthetic data. Therefore, effective privacy-preserving synthetic data generation methods are needed. The de facto mechanism for providing privacy in synthetic data release is differential privacy (DP) (Dwork et al., 2006), which is known to degrade utility in proportion to the amount of privacy provided. This is further exacerbated in tabular data because of the correlations between different records, and among different attributes within a record. In such settings, the amount of noise required to provide meaningful privacy guarantees often destroys utility. Apart from assumptions made on the independence of records and attributes, prior works make numerous assumptions about how the synthetic data will be used in downstream tasks to customize the application of DP (Xiao et al., 2010; Hardt et al., 2010; Cormode et al., 2019; Dwork et al., 2009). To this end, we propose a mechanism to create synthetic data that is agnostic to the downstream task. Similar to Jordon et al. (2018), our solution involves training a generative model to provide formal DP guarantees. A key distinction is that we encode knowledge about the causal structure of the data into the generation process to provide better utility.
Our approach leverages the fact that naturally occurring data exhibits causal structure. In particular, to induce favorable privacy vs. utility trade-offs, our main contribution involves encoding the causal graph (CG) into the training of the generative model used to synthesize data. Considering the case of linear Gaussian models, we formally prove that generative models trained with additional knowledge of the causal structure of the specific dataset are more private than their non-causal counterparts. We extend this proof to a broader class of generative models as well. To validate the theoretical results on real-world data, we present a novel practical solution utilizing variational autoencoders (VAEs) (Kingma & Welling, 2013). These models combine the advantages of deep learning and probabilistic modeling: they scale to large datasets, are flexible enough to fit complex data in a probabilistic manner, and can be used for data generation (Ma et al., 2019; 2020a). Thus, in designing our solution, we train causally informed and differentially private VAEs. The CG can be obtained from a domain expert, learnt directly from observed data (Zheng et al., 2018; Morales-Alvarez et al., 2021), or obtained using a DP CG discovery algorithm (Wang et al., 2020a). The problem of learning the CG itself is important but orthogonal to the goals of this paper. We evaluate our approach to understand its efficacy both in improving the utility of downstream tasks and in its robustness to an MI attack (Stadler et al., 2020). Further, we aim to understand the effect of true, partial, and incorrect CGs on the privacy vs. utility trade-off. We experimentally evaluate our solution on a synthetic dataset where the true CG is known.
We evaluate on real-world applications: a medical dataset (Tu et al., 2019), a student response dataset from a real-world online education platform (Wang et al., 2020b), and perform ablation studies using the Lung Cancer dataset (Lauritzen & Spiegelhalter, 1988). Through our evaluation, we show that models that are causally informed are more stable (Kutin & Niyogi, 2012) than associational (either non-causal, or with an incorrect causal structure) models trained using the same dataset. In the absence of DP noise, causal models enhance the baseline utility by 2.42 percentage points (PPs) on average, while non-causal models degrade it by 3.49 PPs. With respect to privacy evaluation, prior works rely solely on the value of the privacy budget ε. We take this one step further and empirically evaluate resilience to MI. Our experimental results demonstrate the positive impact of causal information in inhibiting the MI adversary's advantage on average. Better still, we demonstrate that DP models that incorporate complete or even partial causal information are more resilient to MI adversaries than purely DP models with the exact same ε-DP guarantees. In summary, the contributions of our work include: 1. A deeper understanding of the advantages of causality through a theoretical result that highlights the privacy amplification induced by being causally informed ( § 3), and insight as to how this can be instantiated ( § 4.1). 2. Empirical results demonstrating that causally constrained (and DP) models are more utilitarian in downstream classification tasks ( § 5.1) and are robust (on average) to MI attacks ( § 5.2).

2. PROBLEM STATEMENT & NOTATION

Problem Statement: Formally, we define a dataset D to be the set {x_1, ..., x_n} of n records x_i ∈ X (the universe of records); each record x = (x^1, ..., x^k) has k attributes (a.k.a. variables X_1, ..., X_k). We aim to design a procedure which takes as input a private (or sensitive) dataset D_p and outputs a synthetic dataset D_s. The output should have formal privacy guarantees and maintain statistical properties of the input for downstream tasks. Formally speaking, we wish to design f_θ : Z → X, where θ are the parameters of the method and Z is some underlying latent representation for inputs in X. In our work, we wish for f_θ to provide the guarantee of differential privacy.

Differential Privacy (Dwork et al., 2006): Let ε ∈ R^+ be the privacy budget, and H be a randomized mechanism that takes a dataset as input. H is said to provide ε-differential privacy (DP) if, for all datasets D_1 and D_2 that differ on a single record, and all subsets S of the outcomes of running H: P[H(D_1) ∈ S] ≤ e^ε · P[H(D_2) ∈ S], where the probability is over the randomness of H.

Sensitivity: Let d ∈ Z^+, D be a collection of datasets, and define H : D → R^d. The ℓ_1 sensitivity of H, denoted ∆H, is defined by ∆H = max ||H(D_1) − H(D_2)||_1, where the maximum is over all pairs of datasets D_1 and D_2 in D differing in at most one record.

We rely on generative models to enable private data release. If they are trained to provide DP, then any further post-processing (i.e., using them to obtain a synthetic dataset) is also DP by the post-processing property (Dwork et al., 2014). In this work, we use variational autoencoders (VAEs) as our generative models. Variational Autoencoders (VAEs) (Kingma & Welling, 2013): Data generation, p_θ(x|z), is realized by a deep neural network (DNN) parameterized by θ, known as the decoder.
To approximate the posterior of the latent variable p_θ(z|x), VAEs use another DNN (the encoder) with x as input to produce an approximation of the posterior, q_φ(z|x). VAEs are trained by maximizing an evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence between q_φ(z|x) and p_θ(z|x) (Jordan et al., 1999; Zhang et al., 2018). Solutions using VAEs for data generation concatenate all variables as X, train the model, and generate data by sampling from the prior p(Z). To train the model, we minimize the KL divergence between the true posterior p_θ(z|x) and the approximated posterior q_φ(z|x) by maximizing the ELBO:

ELBO = E_{q_φ(z|x)} [log p_θ(x|z)] − KL[q_φ(z|x) || p(z)]

Causally Consistent Models: Formally, the underlying data generating process (DGP) is characterized by a causal graph that describes the conditional independence relationships between different variables. In this work, we use the term causally consistent models to refer to models that factorize in the causal direction. For example, the graph X_1 → X_2 implies that the factorization following the causal direction is p(X_1, X_2) = p(X_1) · p(X_2|X_1). Due to the modularity property (Woodward, 2005), the mechanism generating X_2 from X_1 is independent of the marginal distribution of X_1. This holds for the causal factorization but not for an anti-causal factorization.
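To make the sensitivity and ε-DP definitions above concrete, here is a minimal sketch of the classic Laplace mechanism on a counting query. This is illustrative background only (our approach instead relies on DP-SGD training); the dataset and parameters are made up.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` with eps-DP by adding Laplace(sensitivity/epsilon) noise."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

# Counting query H(D) = |D|: adding or removing one record changes the
# count by at most 1, so the L1 sensitivity is 1.
rng = np.random.default_rng(0)
D = np.ones(100)          # toy dataset of 100 records
true_count = len(D)
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0, rng=rng)
```

Smaller ε means a larger noise scale, and hence a noisier (more private) release.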

3. PRIVACY AMPLIFICATION THROUGH CAUSALITY

Here, we present our main result. Stated simply: under infinite training data, causally consistent (or simply causal) models are more private than their non-causal (or associational) counterparts. We also characterize the conditions needed for this claim to hold under finite training data. For ease of exposition, we first consider the setting of linear Gaussian structural causal models (SCMs), where each node (i.e., variable) is generated as a linear function of its parents in the causal graph (CG). Causal and Associational models. Let M = ⟨X, f, ε⟩ be a linear Gaussian SCM corresponding to a CG G = (X, E_G). X is the set of variables {x_1, ..., x_k}, E_G are the edges in the CG connecting them, f represents the linear generating function for each variable x_i ∈ X, and ε are the error terms. We assume all variables are standardized to be zero mean and unit variance. We use upper-case variables to capture sets, bold face to capture vectors, and subscripts to capture appropriate indexing. In a linear Gaussian SCM, each node is generated as a linear function of its parents (assuming no interaction between them): x_i = Pa_i β_i + ε_i (1) where Pa_i is the matrix of parent variables of x_i (of size n × k_i, where n is the number of data points and k_i ≤ k denotes the number of parents of x_i), and β_i is the true coefficient vector (of size k_i × 1). The error terms ε_i are mutually independent as well as independent of all other variables. We can also write this as Xβ_i,ext + ε_i, where X is the n × k matrix with all variables as columns and β_i,ext is an extended vector whose value is fixed to 0 for all non-parents of x_i. A causal generative model has additional knowledge of the CG. Since the mechanisms of the SCM are stable and independent (Peters et al., 2017), fitting the causal generative model can be broken down into a set of separately fit linear regression models.
For any variable x_i, parameters β_i are learnt (as β̂_i) by minimizing the least squares error ℓ(x̂_i, x_i) = Σ_{j∈[n]} (x̂_i^j − x_i^j)², where x̂_i is given by x̂_i = Pa_i β̂_i (2). An associational generative model does not have knowledge of the true CG. In general, it can be a generative model such as a VAE. However, for learning linear functional relationships, it makes sense to instead learn a set of linear regression equations, based on an alternative acyclic structure (e.g., an incorrect graph). For each x_i, let H_i be the feature matrix used to predict x_i. We obtain x̂_i = H_i γ̂_i (3) where H_i is the data for all features that generate the value of x_i in the model, analogous to Pa_i. For each x_i, γ̂_i is the learnt parameter vector of the associational model. We show that the sensitivity of β = {β_1, ..., β_k} is lower than or equal to that of γ = {γ_1, ..., γ_k}. To do so, we first prove a result comparing the sensitivity of linear regression when the features are chosen by the true data-generating process (DGP) versus when they are not. Our result on linear regression can be found in Appendix A (see Lemma 1). Note that since our goal is to compare between models, we follow a different set of assumptions than standard DP analyses of linear regression: rather than assuming that the inputs are bounded, we assume that the error terms in the DGP are bounded, thus providing a bound on the values of parameters that are optimal for any point.
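The per-mechanism fitting described above can be sketched with ordinary least squares on a toy chain SCM x1 → x2 → x3; the coefficients and noise scales below are made up for illustration, and standard OLS stands in for each mechanism's regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Linear Gaussian SCM for the chain x1 -> x2 -> x3 (hypothetical coefficients).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
x3 = -0.5 * x2 + rng.normal(scale=0.5, size=n)

def fit_ols(features, target):
    """Least-squares fit of one SCM mechanism: target = features @ beta + err."""
    beta, *_ = np.linalg.lstsq(features, target, rcond=None)
    return beta

# Causal model: fit each node on exactly its parents in the CG,
# one independent regression per mechanism.
beta_2 = fit_ols(x1[:, None], x2)   # x2's only parent is x1
beta_3 = fit_ols(x2[:, None], x3)   # x3's only parent is x2
```

Because the mechanisms are independent, each regression recovers (up to sampling noise) the corresponding structural coefficient.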

3.1. MAIN THEOREM

Theorem 1. Consider a linear Gaussian SCM M = ⟨X, f, ε⟩ with standardized variables (zero mean, unit variance). Let the true generative equations be expressed as ∀x_i ∈ X : x_i = Pa_i β_i + ε_i (4) where (a) Pa_i is the data matrix denoting all parents of x_i in the CG corresponding to M, (b) β_i are the true generative parameters, and (c) ε_i is an independent, symmetric error vector that is bounded such that the maximum deviation from the β_i that minimizes |x_i^j − β_i^T Pa_i^j| for the j-th data point is δ_max. Let a mechanism take as input n data points and output the parameters of the fitted linear generative functions. Then, 1. As n → ∞, the sensitivity of the mechanism based on causal information is lower than that of the mechanism without it. 2. For finite n, whenever the empirical correlation of a variable's parents with the error is much less than the empirical correlation of the features used by the associational model with the error (i.e., ||Pa_i^T ε_i|| << ||H_i^T ε_i||), the sensitivity of the mechanism based on causal information is the lowest. Proof idea. Since an SCM consists of independent mechanisms, it is sufficient to prove the result for the regression parameters of any one mechanism. Given a fitted linear regression for one of the nodes in the CG, its sensitivity is determined by stability: how much a new training point can alter the learnt parameters; the new point is also constrained to be generated by the linear SCM equation, so the optimal learnt parameters for the new point are the true SCM parameters (within a δ_max bound). In the infinite data regime, the fitted parameters of a causal model are exactly the true SCM parameters, since it uses exactly the parents as features. The associational model, in contrast, uses a different feature set (with possibly correlated features), and therefore the learnt parameters for each feature differ from the SCM parameters.
So a new point generated from the true SCM parameters can introduce a significant change in the learnt parameters of an associational model (note that the new input is not bounded), while for causal models the effect is bounded by δ_max. For finite data, we additionally need the associational model to deviate substantially from the causal features (i.e., the contribution of non-causal features to the model is substantial); otherwise, a causal model may not yield a significant improvement in sensitivity over the associational model. The above proof also generalizes to the case where not all variables of the SCM are observed. We need the following assumption. Assumption 1. For each node x_i ∈ X of the SCM M, any unobserved parents Pa_i^unobs are independent of all observed parents Pa_i^obs of x_i: Pa_i^{j,obs} ⊥⊥ Pa_i^{k,unobs} ∀ j, k, where Pa_i^obs ∪ Pa_i^unobs = Pa_i (5) Corollary 1. Under Assumption 1, Theorem 1 is satisfied even if some variables of the SCM are unobserved. By assuming strong convexity of the loss function, we are able to generalize our result beyond linear Gaussian SCMs; the detailed proof is in Appendix B. The key insight again stems from the higher stability of causal models (in both the infinite and finite data regimes), resulting in lower sensitivity. Why does causality provide any privacy benefit? In the case of causal models, we know the proper factorization of the joint probability distribution. These factorized conditional probabilities remain stable even under new data that corresponds to a different joint distribution (Peters et al., 2017). Hence any model learnt using this factorization will be more stable. Due to this higher stability, its (worst-case) loss on unseen points (or points generated by the true DGP) is lower. In the associational case, the model has no such constraint, and thus may learn relationships that are not stable.
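The stability intuition behind Theorem 1 (a new SCM-consistent point barely moves the causal fit, but can move an associational fit) can be checked numerically. The chain graph, coefficients, and the choice of an anti-causal feature for the associational model below are all illustrative assumptions, not the constructions used in the formal proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Base sample from the chain SCM x1 -> x2 -> x3 (hypothetical coefficients).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
x3 = -0.5 * x2 + rng.normal(scale=0.5, size=n)

def slope(features, target):
    """Single-feature least-squares coefficient (no intercept; data is centered)."""
    return np.linalg.lstsq(features[:, None], target, rcond=None)[0][0]

# One extreme new record, generated exactly by the SCM mechanisms (zero error).
x1_new = 10.0
x2_new = 0.8 * x1_new
x3_new = -0.5 * x2_new

# Causal model predicts x2 from its parent x1; an associational model might
# instead predict x2 from its child x3 (the anti-causal direction).
causal_shift = abs(slope(np.append(x1, x1_new), np.append(x2, x2_new)) - slope(x1, x2))
assoc_shift = abs(slope(np.append(x3, x3_new), np.append(x2, x2_new)) - slope(x3, x2))
```

Here the causal fit barely moves (the new point lies on the true mechanism), while the anti-causal fit shifts noticeably, mirroring the sensitivity gap in the theorem.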

4. OUR APPROACH AND IMPLEMENTATION

Our theoretical analysis shows the privacy benefit of training a causally informed generative model in the linear Gaussian case. In practice, the generation of synthetic datasets requires flexibility for downstream tasks and computational efficiency in large-scale applications. Therefore, we propose building causally informed generative models using VAEs, and implement this approach for evaluation on real-world datasets; salient features of each dataset are presented in Table 2. Experiments were performed on a server with 8 NVIDIA GeForce RTX GPUs, 48 vCPU cores, and 252 GB of memory. All our code was implemented in Python.

4.1. CAUSAL DEEP GENERATIVE MODELS

Let us consider that the original dataset contains variables X_1 and X_2 and the causal relationship follows Figure 9(b) (in Appendix C.1). For example, X_1 can be a medical treatment, X_2 can be a medical test, and Z is the patient's health status, which is not observed. Instead of using a vanilla VAE to generate the data, we design a generative model as shown in Figure 9(c), where the solid lines show the model p(X_1, X_2, Z) = p(Z) · p_θ1(X_1|Z) · p_θ2(X_2|X_1, Z), and the dashed lines show the inference network q_φ(Z|X_1, X_2). In this way, the model is consistent with the underlying CG. The modeling principle is similar to that of prior work, CAMA (Zhang et al., 2020a). However, CAMA focuses only on prediction and ignores all variables outside the Markov blanket of the target. In our application, we aim for data generation and need to consider the full causal graph; our approach thus generalizes CAMA. Remark 1: In this work, we assume that the causal relationship is given. In practice, it can be obtained from a domain expert or via a carefully chosen causal discovery algorithm (Glymour et al., 2019); the latter can be run in a differentially private manner (Wang et al., 2020a). Additionally, recent advances enable simultaneously learning the CG and optimizing the parameters of a generative model informed by it (Morales-Alvarez et al., 2021). We stress that the contribution of our work is understanding the influence of causal information on the privacy associated with generative models, not mechanisms to learn the required causal information. Remark 2: The theory we propose in Appendices A and B shows that the privacy amplification induced by causal information is agnostic to the particular choice of generative model. We utilize VAEs as the work of Morales-Alvarez et al. (2021) provides a platform for simultaneously performing (P1) causal discovery and (P2) causal inference.
Recent work provides both the aforementioned properties in the context of generative adversarial networks (GANs) (Geffner et al., 2022) and flow-based models (Kyono et al., 2021) , and validating their efficacy is a subject of future research.
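As a sketch of how the causal factorization p(Z) · p_θ1(X_1|Z) · p_θ2(X_2|X_1, Z) drives generation, the snippet below performs ancestral sampling with simple linear Gaussian functions standing in for the decoder networks p_θ1 and p_θ2. All coefficients are hypothetical; in the actual model these conditionals are DNNs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_causal(n):
    """Ancestral sampling following the factorization p(Z) p(X1|Z) p(X2|X1,Z).
    Linear Gaussian mechanisms with made-up coefficients stand in for the
    decoder networks of the causal VAE."""
    z = rng.normal(size=n)                                   # p(Z): latent health status
    x1 = 1.5 * z + rng.normal(scale=0.3, size=n)             # p(X1 | Z): treatment
    x2 = 0.7 * x1 - 0.4 * z + rng.normal(scale=0.3, size=n)  # p(X2 | X1, Z): test
    return np.column_stack([x1, x2])

synthetic = sample_causal(10000)
```

Each variable is sampled only after its parents in the CG, so the generated joint distribution respects the causal factorization by construction.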

4.2. PRIVACY BUDGET (ε)

For training our DP models, we utilize the opacus v0.10.0 library, which supports the DP-SGD training approach proposed by Abadi et al. (2016) of clipping the gradients and adding noise while training. We ensure that the training parameters for both causal and non-causal models are identical; these are described in Appendix C. For all our experiments, we perform a grid search to obtain the optimal clipping norm and noise multiplier. Once training is done, we calculate the privacy budget using the Rényi DP accountant provided as part of the opacus package. The value of ε for both causal and non-causal models is the same (see Table 2).
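The DP-SGD recipe of Abadi et al. (2016) (clip each per-example gradient, add calibrated Gaussian noise to the sum, then step) can be sketched in a few lines. The toy logistic-regression loop below is illustrative only: it does not use opacus and performs no privacy accounting, and the learning rate, clipping norm, and noise multiplier are made-up values.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr, clip_norm, noise_mult, rng):
    """One DP-SGD step for logistic regression: per-example gradients are
    clipped to clip_norm, summed, perturbed with Gaussian noise, and averaged."""
    grads = []
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xi @ w))
        g = (p - yi) * xi                                        # per-example gradient
        g = g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))  # clip its L2 norm
        grads.append(g)
    total = np.sum(grads, axis=0)
    total += rng.normal(scale=noise_mult * clip_norm, size=w.shape)  # calibrated noise
    return w - lr * total / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = (X[:, 0] > 0).astype(float)   # toy labels determined by the first feature
w = np.zeros(3)
for _ in range(50):
    w = dp_sgd_step(w, X, y, lr=0.5, clip_norm=1.0, noise_mult=0.5, rng=rng)
```

In practice, opacus wires this same clip-and-noise logic into the optimizer and tracks the resulting ε via its Rényi DP accountant.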

4.3. MEMBERSHIP INFERENCE (MI) ATTACK

Prior solutions for private generative models often use the value of ε as the sole measure for privacy (Jordon et al., 2018; Zhang et al., 2020b) . In addition to ε, we use an MI attack specific to generative models to empirically evaluate if the models we train leak information (Stadler et al., 2020) . In this attack, the adversary has access to (a) synthetic data sampled from a generative model trained with a particular record in the training data, and (b) synthetic data sampled from a generative model trained without the same record in the training data. The objective of the adversary is to use this synthetic data (from both cases) and learn a classifier to determine if a particular record was used during training.
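A minimal sketch of such an attack pipeline is shown below, with per-attribute histogram summaries and a nearest-centroid rule standing in for the adversary's feature extractor and classifier; both are simplifying assumptions relative to the full attack of Stadler et al. (2020), and the toy datasets are synthetic stand-ins.

```python
import numpy as np

def histogram_features(data, bins=10, value_range=(-5, 5)):
    """Summarize one synthetic dataset by per-attribute histograms, one common
    choice of feature extractor for MI attacks on generative models."""
    return np.concatenate([
        np.histogram(col, bins=bins, range=value_range, density=True)[0]
        for col in data.T])

def attack(train_feats, train_labels, query_feats):
    """Nearest-centroid stand-in for the adversary's binary classifier:
    label 1 = 'the target record was in the training data'."""
    c1 = train_feats[train_labels == 1].mean(axis=0)
    c0 = train_feats[train_labels == 0].mean(axis=0)
    d1 = np.linalg.norm(query_feats - c1, axis=1)
    d0 = np.linalg.norm(query_feats - c0, axis=1)
    return (d1 < d0).astype(int)

# Toy setup: synthetic datasets generated "with" the target record have a
# slightly shifted distribution (an illustrative assumption).
rng = np.random.default_rng(0)
synth_with = [rng.normal(loc=0.4, size=(200, 2)) for _ in range(25)]
synth_without = [rng.normal(loc=0.0, size=(200, 2)) for _ in range(25)]
feats = np.array([histogram_features(d) for d in synth_with + synth_without])
labels = np.array([1] * 25 + [0] * 25)
preds = attack(feats, labels, feats)   # in-sample sanity check
```

If the generative model leaks the record's influence into its synthetic output, the two feature distributions separate and the attack succeeds above chance.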

4.4. UTILITY METRICS

Utility is preserved if the synthetic data performs comparably to the original sensitive data for any given task using any classifier. To measure the change in utility, we perform the following experiment: if a dataset has k attributes, we utilize k − 1 attributes to predict the k-th attribute. We randomly choose 20 different attributes to be predicted. Furthermore, we train 5 different classifiers for this task, and compare the predictive capabilities of these classifiers when trained on (a) the original sensitive dataset and (b) the synthetically generated private dataset. The 5 classifiers are: (a) linear SVC, (b) SVC (with a non-linear kernel), (c) logistic regression, (d) random forest, and (e) k-nearest neighbors. Additionally, we draw pairplots using features from the original and the synthetic dataset to compare their similarity visually. These pairplots are obtained by choosing 10 random attributes (out of the k available). See Appendix E for more details.
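The attribute-prediction protocol above can be sketched as follows. The least-squares linear classifier and the sign-based binarization of the target attribute are simplifications for illustration (the paper's evaluation uses the five classifiers listed above), and the correlated toy data is a made-up stand-in for an original/synthetic dataset pair.

```python
import numpy as np

def attribute_prediction_accuracy(train, test, target_idx):
    """Fit a least-squares linear classifier on all-but-one attribute to
    predict the (binarized) target attribute; report accuracy on `test`."""
    mask = np.arange(train.shape[1]) != target_idx
    w, *_ = np.linalg.lstsq(train[:, mask],
                            (train[:, target_idx] > 0) - 0.5, rcond=None)
    pred = test[:, mask] @ w > 0
    return (pred == (test[:, target_idx] > 0)).mean()

# Toy correlated data standing in for the original dataset; the same routine
# would be run with synthetic training data against the same held-out test set.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1500, 1))
data = latent + 0.5 * rng.normal(size=(1500, 3))
acc_orig = attribute_prediction_accuracy(data[:1000], data[1000:], target_idx=2)
```

Utility change is then the gap between the accuracy obtained with original training data and with synthetic training data, averaged over target attributes and classifiers.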

5. EVALUATION

Since the accounting mechanism returns the same value of ε, our evaluation is designed to understand: • Do (DP) causally informed models negatively influence accuracy? • Do causally informed models leak more information than their non-causal counterparts? From our evaluation, we observe that: • In the absence of DP, causal models enhance the baseline utility computed using the original dataset by 2.42 percentage points (PPs) on average, while non-causal models degrade it by 3.49 PPs ( § 5.1). • The influence of causality (with DP) on MI resilience is nuanced. The general trend we observe is that, as the accuracy of causal information increases, so does the model's resilience to MI ( § 5.2).

5.1. UTILITY EVALUATION

Table 1 shows the change in utility on different downstream tasks using the generated synthetic data when trained with and without causality, as well as with and without DP. The negative values in the table indicate an improvement in utility. We only present the range of absolute utility values when trained using the original data in the table, and provide individual utility for each classifier in Appendix E. We observe an average performance degradation of 3.49 percentage points (PPs) across all non-causal models trained without DP, and an average increase of 2.42 PPs in their causal counterparts. It is well understood that DP training induces a privacy vs. utility trade-off, and consequently utility suffers (compare rows with and without DP). However, when causal information is incorporated into the generative model, we observe that the utility degradation is less severe (compare pairs of cells with and without causal information). These results suggest that causal information encoded into the generative process improves the privacy vs. utility trade-off, i.e., for the same ε-DP guarantees, the utility of causal models is better than that of their non-causal counterparts. In Figure 1, we plot the utility (measured by the average accuracy across the 5 downstream prediction tasks) for both causal and associational models, for varied values of ε. Observe that for a fixed ε, the causal models always have better utility than their associational counterparts. Note that our work is the first to combine DP and causality. When compared with prior work, DP-PGM (McKenna et al., 2019) (which uses only DP), we observe that our approach (Causal in Figure 1) generates synthetic data that is more utilitarian than DP-PGM. Note that we also evaluated the quality of synthetic data generated using a generative model trained using the approach proposed by Morales-Alvarez et al. (2021), and the results are not significantly different from those presented in Table 1 (omitted for brevity).

5.2. EVALUATING MI RESILIENCE

Understanding the influence of causal information on MI requires nuanced discussion. Thus far, our discussion has focused on how perfect causal information can be used to theoretically minimize ε. However, causal information is often incomplete/partial, or incorrect. Our evaluation answers: 1. What is the effect of complete, partial and incorrect causal information on MI attack accuracy? 2. What is the effect on MI attack accuracy when a causally informed VAE is trained both with and without DP? We summarize our key results below: 1. Knowledge of a complete CG and training with DP consistently reduces the adversary's advantage across different feature extractors and classifiers i.e., provides better privacy (Figure 3 ). 2. We observe that incorrect causal information has disparate impacts on resilience to MI. While introducing spurious correlations improves MI efficacy, removing causal information reduces MI efficacy but also degrades utility ( § 5.2.2). 3. Even partial causal information reduces the advantage of the adversary when the model is trained without DP and in most cases, with DP as well. As the accuracy of causal information increases, so does the model's resilience to MI attacks ( § 5.2.3). Evaluation Methodology. We train 2 generative models: one that encodes information from a CG and one that does not. We train each of them with and without DP, and thus have 4 models in total. For all our datasets, we evaluate these models against the MI adversary ( § 4.3), and compute the attack accuracy when a method (DP/causal consistency) is not used in comparison to when it is used. Note that as part of the MI attack, we are unable to utilize the Correlation and Ensemble feature extractors for the EEDI dataset due to computational constraints in our server.

5.2.1. WITH COMPLETE CG

We conduct a toy experiment with synthetic data where the complete (true) CG is known a priori. The data is generated based on the CG defined in Appendix D. The results are presented in Figure 3. Here, advantage degradation is the change in MI success between a scenario where DP is not used and one where DP is used for training (positive values are better). Observe that for both the causal and non-causal model, training with DP provides an advantage against the MI adversary, though to varying degrees. However, the important observation is that causal models provide greater resilience on average in comparison to the non-causal model.

5.2.2. ABLATION STUDY: INCORRECT CG

Many real-world datasets do not come with their associated CGs. CGs obtained from domain experts, or those learnt through algorithmic means, are potentially erroneous. We wish to understand how these errors assist MI adversaries. To this end, we utilize the causally informed generative modelling framework of Morales-Alvarez et al. (2021) and perturb the true CG: by adding (spurious) edges, we introduce correlations which the MI adversary exploits, while by removing (causal) edges, we remove signal that the MI adversary can use. While removing edges may seem tempting from a privacy perspective, the average (across 8 trials) downstream utility of the data generated from the corresponding model suffers. The utility of the data generated from the true CG is 82.94%, which reduces to 81.8% when an edge is added, and further to 79.67% when an edge is removed. Our results show that while knowing the true CG is always useful both for privacy and utility, having access to an approximate CG may also provide privacy benefits at the cost of utility. Disclaimer: There are two ways a model may use incorrect causal structure: missing a true parent of a node, or adding an incorrect parent. Corollary 1 already covers the first: with missing parents in the causal model, the result of lower sensitivity for causal models holds as long as Assumption 1 is true. Overall, missing parents is a weaker violation, and we may still obtain the same benefits. But if a model adds an incorrect parent, then its sensitivity will definitely increase.

5.2.3. WITH PARTIAL CG

For the real-world EEDI and Pain datasets, we follow two approaches: (a) we utilize information from domain experts to partially construct a condensed CG (where several variables are clubbed into a single node of the condensed CG), and (b) we learn a CG from data and simultaneously train a VAE informed by it (Morales-Alvarez et al., 2021), resulting in a larger, yet still partial, CG. Note that due to space constraints, we only report results for 2 out of the 3 datasets we evaluate. The trends from Pain1000 are similar to those of Pain5000 and are in Appendix F. Effect of only Causality. Figure 5 shows the MI adversary's advantage when the model incorporates causal information in the absence of DP. Observe that across both datasets, causal models result in more resilient models, i.e., lower attack accuracy. While this effect is moderate in the EEDI dataset (Figure 5a) (where the Histogram features enable a more effective attack in the causal model), it is more pronounced in the Pain5000 dataset (Figure 5b). This suggests that standalone causal information provides some privacy benefit. We conjecture that the MI adversary uses spurious correlations among the attributes to perform the attack; even partial causal information is able to eliminate spurious correlations present in the dataset. Hence, a causally informed VAE is also able to reduce the MI adversary's advantage. Effect of Causality with DP. Figure 6 shows the results for models trained using partial causal information with DP. Similar to the earlier results (of causal information and no DP), models trained with DP and causal information are more resilient to MI attacks. These results validate that causality amplifies the privacy provided by DP training. We reiterate that the results in Figure 1 show that for the same ε-DP guarantees, models learnt using causal information provide higher utility, thereby providing a promising direction to balance the privacy vs. utility trade-off.
The results thus far rely on domain experts to partially construct a condensed CG. Morales-Alvarez et al. (2021) discuss an approach to simultaneously learn the CG and train the generative model with said structure. In practice, we observed that causal discovery using this method adds limited overhead (∼ 100 epochs of additional training).

6. RELATED WORK

Private Data Generation: The primary issue associated with private synthetic data generation is dealing with data scale and dimensionality. Solutions involve using Bayesian networks to add calibrated noise to the latent representations (Zhang et al., 2017; Jälkö et al., 2019), or smarter mechanisms to determine correlations (Zhang et al., 2020b). Utilizing synthetic data generated by GANs has been extensively studied, but only a few solutions provide formal guarantees of privacy (Jordon et al., 2018; Wu et al., 2019; Harder et al., 2020; Torkzadehmahani et al., 2019; Ma et al., 2020b; Tantipongpipat et al., 2019; Xin et al., 2020; Long et al., 2019; Liu et al., 2019). Across the spectrum, very few techniques are evaluated against MI adversaries (Mukherjee et al., 2019). Membership Inference: Most MI work focuses on the discriminative setting (Shokri et al., 2017). More recently, several works propose MI attacks against generative models (Chen et al., 2020; Hilprecht et al., 2019), but offer limited explanation as to why these attacks are possible. Tople et al. (2020) show the benefits of causal learning in alleviating membership privacy attacks, but only for classification models, not generative models.

7. CONCLUSIONS

Our work proposes a mechanism for private data release using VAEs trained with differential privacy. Theoretically, we highlight how causal information encoded into the training procedure can potentially amplify the privacy guarantee provided by differential privacy, without degrading utility. Empirically, we show how causal information enables advantageous privacy vs. utility trade-offs.

A DETAILED PROOF: BENEFITS OF CAUSAL LEARNING IN LINEAR GAUSSIAN SCM

A.1 BACKGROUND

Let $M = \langle X, f, \epsilon \rangle$ be a linear gaussian structural causal model corresponding to a causal graph $G = (X, E_G)$, where (a) $X$ is the set of variables and $G$ is the causal graph connecting them through edges $E_G$, (b) $f$ represents the set of linear generating functions for each variable $x_i \in X$, and (c) $\epsilon$ are the error terms. We assume that all variables are standardized to be zero mean and unit variance. In a linear gaussian SCM, each node is generated as a linear function of its parents (we assume no interaction terms). The error terms are mutually independent and independent of all variables:

$$x_i \leftarrow \sum_j \beta_i^j Pa_i^j + \epsilon_i; \qquad x_i = Pa_i\,\beta_i + \epsilon_i$$

where $Pa_i^j$ is a vector referring to the values of the $j$-th parent of node $x_i$, $Pa_i$ refers to a matrix of data values with the parents of $x_i$ as columns ($k_i$ columns in total, where $k_i$ is the number of parents of $x_i$, and $n$ rows of data-points), and $\beta_i$ refers to the true coefficient vector (or structural causal parameter). Alternatively, we can write it as $X\beta_{i,ext} + \epsilon_i$, where $X$ is the matrix with all variables as columns and $\beta_{i,ext}$ is an extended vector whose value is fixed to 0 for all non-parents of $x_i$.

A causal generative model has additional knowledge of the graph structure. Since the mechanisms of an SCM are stable and independent (Peters et al., 2017), fitting the causal generative model can be broken down into a set of separately fit linear regression models. For any variable $x_i$, the parameters $\beta_i = (\beta_i^1, \beta_i^2, \ldots, \beta_i^{k_i})$ are learnt (as $\hat\beta_i$) by minimizing the least squares error $\ell(x_i, \hat{x}_i) = \sum_{j\in[n]} (\hat{x}_i^j - x_i^j)^2$, where $\hat{x}_i$ is given by

$$\hat{x}_i = \sum_j \hat\beta_i^j Pa_i^j; \qquad \hat{x}_i = Pa_i\,\hat\beta_i$$

An associational generative model does not know the true causal graph, so it may learn an alternative generative acyclic structure, which is also reducible to a set of independently fitted linear regressions. For each $x_i$, let $H_i$ be the matrix denoting the features used to predict $x_i$ (columns are features, rows are data-points).
We obtain

$$\hat{x}_i = \sum_j \hat\gamma_i^j H_i^j; \qquad \hat{x}_i = H_i\,\hat\gamma_i$$

where $H_i^j$ and $H_i$ are the $j$-th feature's data vector and the full feature data matrix respectively, analogous to $Pa_i^j$ and $Pa_i$, and $\hat\gamma_i$ is the learnt parameter vector of the associational model for each $i$.

A.2 GOAL

Our goal is to show that the sensitivity of $\beta = \{\beta_1, \cdots, \beta_k\}$ is lower than or equal to that of $\gamma = \{\gamma_1, \cdots, \gamma_k\}$. To do so, we first prove a result about the sensitivity of linear regression, which is used to estimate the parameters of the generative model. Note that since our goal is to compare between models, we follow a different set of assumptions than standard differential privacy on linear regression. Rather than assuming that the inputs are bounded, we assume that the error terms in the DGP are bounded, thus providing a bound on the parameter values that are optimal for any point. Our proof utilizes the following strategy. First, we define two worlds: world 1, where a model is learnt with causal information, and world 2, where a model is learnt without this causal information. Next, we measure the sensitivity of the parameters learnt in both worlds and demonstrate that the sensitivity is lower in world 1 than in world 2. This suggests that if DP is used in both worlds, the privacy budget in world 1 will be lower than that of world 2.

A.3 PRIMER

Lemma 1. Consider a dataset $(X^j, y^j)_{j=1}^n$ where the labels are generated by the following equation: $y = \beta X_c + \epsilon$, where $X_c \subseteq X$ refers to variables having a non-zero coefficient in the true data-generating process (DGP) of $y$, $X_c$ also denotes the data matrix associated with those variables, and $\epsilon$ is independent error that is bounded and symmetric. $\epsilon$ is bounded such that the maximum deviation from $\beta$ that minimizes $|y^j - \beta X_c^j|$ for any $j$-th data-point is $\delta_{max}$.
Consider two linear regression models fit to this dataset: one that has knowledge of the true DGP and includes $X_c$ as features, and one that does not and includes a different subset $X_a \subseteq X$ as features. Assume all variables are standardized (zero mean, unit variance).
1. As $n \to \infty$, the sensitivity of the linear regression model fitted over $X_c$ is lower than or equal to the sensitivity of the linear regression model fitted over $X_a$.
2. For finite $n$, whenever the empirical correlation of $X_c$ with the error is much less than that of the other features $X_a$ with the error, i.e., $|X_c^T \epsilon| \ll |X_a^T \epsilon|$, the sensitivity of the mechanism based on causal information is lower than that of the mechanism without it.
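To make the lemma's setting concrete, the following sketch simulates its DGP with illustrative values (β = 1.5, bounded uniform error, and x2 generated as a noisy child of y) and fits both models with ordinary least squares; the associational fit places non-trivial weight on the spurious feature:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
beta = 1.5

# DGP of Lemma 1: y = beta * x1 + eps, with bounded, symmetric error.
x1 = rng.normal(size=n)
eps = rng.uniform(-0.5, 0.5, size=n)
y = beta * x1 + eps
# x2 is not in the DGP of y, but is correlated with y (a child of y).
x2 = 0.8 * y + rng.uniform(-0.5, 0.5, size=n)

# Causal (X_c) model: regress y on x1 only.
Xc = x1[:, None]
beta_hat, *_ = np.linalg.lstsq(Xc, y, rcond=None)

# Associational (X_a) model: regress y on both x1 and x2.
Xa = np.column_stack([x1, x2])
gamma_hat, *_ = np.linalg.lstsq(Xa, y, rcond=None)

print(beta_hat)   # close to beta
print(gamma_hat)  # non-zero weight on the spurious feature x2
```

All constants here are our choices; the point is only that the associational model absorbs the spurious correlation that the causal model ignores.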

Proof. WARM-UP: TWO VARIABLE REGRESSION

As a warm-up exercise, consider a 2-variable regression where $X = \{x_1, x_2\}$. $x_1$ is a part of the true DGP for $y$ (i.e., causes $y$), while $x_2$ is not part of the DGP for $y$ but is related to $y$ (it may be part of the Markov Blanket of $y$, or simply correlated with $y$). As stated in the Lemma, the corresponding data-generating equation is

$$y = \beta x_1 + \epsilon \qquad (9)$$

(Causal) $X_c$ Model. The model will use the following for predicting $y$,

$$\hat{y} = \hat\beta_1 x_1 \qquad (10)$$

leading to the following learnt parameter,

$$\hat\beta_1 = \frac{\sum x_1 y}{\sum x_1^2} = \frac{\sum x_1(\beta x_1 + \epsilon)}{\sum x_1^2} = \beta + \frac{\sum x_1 \epsilon}{\sum x_1^2} \qquad (11)$$

To calculate sensitivity, we assume the existence of an adversary that wishes to add one more point to the training process such that the estimated parameters are farthest from what is currently achieved (capturing the definition of sensitivity).

Sensitivity. Let us consider a new data-point added to the training set by an adversary to maximize the difference between $\hat\beta_1$ and $\hat\beta_1'$ (the estimated parameter obtained after adding the adversarial data-point). Any new input chosen by the adversary will be generated based on Eqn 9. Note that the definition of DP is for any 2 databases from a universe of databases; Eqn 9 captures this universe. Since the adversary operates in the same world, any point they sample has to also obey the same equation. Their overall distribution can be different, e.g., $\Pr(y|x_1, x_2)$ can be different, but $\Pr(y|x_1)$ has to remain the same. Since the error is bounded, the adversary tries to generate a point such that the new estimate $\hat\beta_1'$ is farthest from the above. That is, the adversary may choose a point such that the parameter obtained after minimizing the squared loss using that point is $\beta \pm \delta_{max}$. Further, the adversary can choose a point with large enough $x_1'$ such that the estimate on the entire dataset matches $\beta \pm \delta_{max}$. This can be done by choosing a point $(x_1', y')$ such that $\left|\frac{\sum x_1 \epsilon + x_1' \epsilon'}{\sum x_1^2 + x_1'^2}\right| = \delta_{max}$.
⇒ Thus, for the parameter corresponding to variable $x_1$, the sensitivity is $|\delta_{max}| + \left|\frac{\sum x_1 \epsilon}{\sum x_1^2}\right|$. For the parameter corresponding to the variable $x_2$, the sensitivity is zero (since there is no such parameter).

(Associational) $X_a$ Model. In contrast, the full regression model will use the following parameters. Since $x_2$ is not independent of $y$ after conditioning on $x_1$, the model may include $x_2$, since doing so may yield a predictive accuracy gain for $y$:

$$\hat{y} = \hat\gamma_1 x_1 + \hat\gamma_2 x_2 \qquad (12)$$

leading to the following learnt parameters,

$$\hat\gamma_1 = \frac{\sum x_2^2 \sum x_1 y - \sum x_1 x_2 \sum x_2 y}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2} \qquad \hat\gamma_2 = \frac{\sum x_1^2 \sum x_2 y - \sum x_1 x_2 \sum x_1 y}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2}$$

Sensitivity. Compared to the $X_c$ model, the second parameter ($\hat\gamma_2$) will have non-zero sensitivity (while that of the causal model is zero). So we focus on showing that the first parameter $\hat\gamma_1$ will also have a higher sensitivity. Observe that the first parameter can be rewritten as

$$\hat\gamma_1 = \frac{\sum x_2^2 \sum x_1 (\beta x_1 + \epsilon) - \sum x_1 x_2 \sum x_2 (\beta x_1 + \epsilon)}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2} = \beta + \frac{\sum x_2^2 \sum x_1 \epsilon - \sum x_1 x_2 \sum x_2 \epsilon}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2}$$

Now the adversary can select a point $(x_1', x_2', y')$ such that $x_2' = 0$. Here $x_2$'s coefficient becomes irrelevant, and the parameter that minimizes error on the new adversarial point is $\beta \pm \delta_{max}$. Thus, the sensitivity of the first parameter is

$$|\hat\gamma_1 - (\beta \pm \delta_{max})| = \left|\frac{\sum x_2^2 \sum x_1 \epsilon - \sum x_1 x_2 \sum x_2 \epsilon}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2}\right| + |\delta_{max}|$$

Note that this sensitivity can be achieved by choosing $x_1', y'$ (and therefore $\epsilon'$) for a new adversarial point such that its value is much higher than for other points, so that the estimated coefficient tends to $\beta + \delta_{max}$. We now prove the main claims.

1. Infinite Data. As $n \to \infty$, $\sum x_1 \epsilon \to 0$ because $x_1$ and $\epsilon$ are independent, by the property of the generative process. But $\sum x_2 \epsilon \neq 0$ because $x_2$ is correlated with $y$ (and hence with $\epsilon$). Thus, for infinite data, the sensitivity of the $X_c$ model ($|\delta_{max}|$) is lower than that of any $X_a$ model.

2. Finite Data. For finite data, if $|\sum x_2 \epsilon| \gg |\sum x_1 \epsilon|$, then the sensitivity of the $X_c$ model is lower than or equal to that of the $X_a$ model.
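The two claims above can be probed empirically. The sketch below (same style of illustrative DGP as before; all constants are our choices) compares how far each learnt coefficient sits from the true β, which is the δmax-independent part of the sensitivity expressions above:

```python
import numpy as np

rng = np.random.default_rng(1)
beta, n, trials = 1.5, 200, 100
causal_dev, assoc_dev = [], []

for _ in range(trials):
    x1 = rng.normal(size=n)
    eps = rng.uniform(-0.5, 0.5, size=n)
    y = beta * x1 + eps
    x2 = 0.8 * y + rng.uniform(-0.5, 0.5, size=n)  # spurious feature

    # Deviation of the causal estimate from beta: sum(x1*eps)/sum(x1^2).
    b1 = (x1 @ y) / (x1 @ x1)
    causal_dev.append(abs(b1 - beta))

    # Deviation of the associational estimate gamma_1 from beta.
    g1, g2 = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]
    assoc_dev.append(abs(g1 - beta))

# The causal model's parameter sits closer to the true beta on average,
# so the delta_max-independent part of its sensitivity is smaller.
print(np.mean(causal_dev), np.mean(assoc_dev))
```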

PROVING THE GENERAL CASE

Using the closed form solution for linear regression, we can write

$$\hat\beta = (Z^T Z)^{-1} Z^T y$$

where $Z$ is a matrix denoting the model's features' values and $y$ is a column vector denoting the values in a dataset for the variable $y$. For the causal model, $Z = X_c$, the true variables from the DGP. Expanding $y$ based on the DGP equation,

$$\hat\beta = (X_c^T X_c)^{-1} X_c^T y = (X_c^T X_c)^{-1}(X_c^T X_c)\beta + (X_c^T X_c)^{-1} X_c^T \epsilon$$
$$\hat\beta = \beta + (X_c^T X_c)^{-1} X_c^T \epsilon \qquad (16)$$

In contrast, for the associational model, we obtain

$$\hat\gamma = (H^T H)^{-1} H^T y = (H^T H)^{-1}(H^T X_c)\beta + (H^T H)^{-1} H^T \epsilon$$

where $H$ represents $X_a$. The set of features $H$ used by the $X_a$ model may not be equal to the true DGP variables $X_c$. Thus, the above can be rewritten in terms of the true parameter $\beta$ as follows,

$$\hat\gamma = (H^T H)^{-1} H^T (X_c \beta) + (H^T H)^{-1} H^T \epsilon = (H^T H)^{-1} H^T (H \beta_{extended} + S\beta') + (H^T H)^{-1} H^T \epsilon$$
$$= \beta_{extended} + (H^T H)^{-1} H^T S \beta' + (H^T H)^{-1} H^T \epsilon \qquad (19)$$

where $\beta_{extended}$ and $\beta'$ are simply re-parameterizations of the true $\beta$; they are zero for all variables $x \notin X_c$. $\beta_{extended}$ is an extension of $\beta$ for variables in $H$, and $\beta'$ is an additional vector which is used only if $H$ does not include all true DGP variables $X_c$. $S$ is the matrix denoting data for all $x \in X_c$ that are not in $H$.

PROOF OF MAIN CLAIM

Infinite data case. The error is independent of the DGP variables, i.e., $X_c^T \epsilon \to 0$ as the number of samples $n \to \infty$. Thus, $\hat\beta = \beta$, and any new training point provided by the adversary will also be generated using the same true $\beta$ (within a bound of $\pm\delta_{max}$). Thus, $\beta \pm \delta_{max}$ will also be optimal for this new point, and the estimated value will change within $\delta_{max}$. For the $X_a$ model, however, note that $\hat\gamma \neq \beta$ unless $H = X_c$. We provide a construction for the adversary's input such that the estimated $\hat\gamma$ will change by more than or equal to $\delta_{max}$ after adding that input.
For all variables $x_k \in X_a$ that are merely correlated (not in $X_c$), the adversary chooses the value $x_k' = 0$ and generates $y'$ using $\beta$, such that the correlation between $y$ and $x_k$ is broken. Further, for the $X_c$ features and $y$, the adversary chooses an input such that $\beta \pm \delta_{max}$ is optimal on the input, irrespective of the value of the other $\gamma$ dimensions (since $x_k' = 0$). That is, after addition of the new point, the least squares optimization will ignore $x_k$ and move $\hat\gamma_{X_c}$ closer to $\beta$ for the $X_c$ features. The total sensitivity is $|\hat\gamma_{X_c} - \beta| + |\delta_{max}|$ for the parameters corresponding to $X_c$. For all other parameters, the $X_c$ model outputs a value of zero (and hence has sensitivity zero), which is trivially lower than or equal to the $X_a$ model's sensitivity for those parameters.

Finite data case. In the finite data case, given a fitted $\hat\beta$, the sensitivity that a new adversarial point can lead to is $|\beta \pm \delta_{max} - \hat\beta| = |(X_c^T X_c)^{-1} X_c^T \epsilon| + |\delta_{max}|$ (from Eqn 16). For large-enough $n$, $X_c^T \epsilon$ should be close to zero due to independence. The sensitivity of coefficients for all non-parents is zero. For the $X_a$ model, the coefficients can be written as

$$\hat\gamma = \beta_{extended} + (H^T H)^{-1} H^T S \beta' + (H^T H)^{-1} H^T \epsilon$$

As we can see, for the variables $\notin X_c$, $\hat\gamma$ depends on the data and therefore will have a non-zero sensitivity to a new data-point, greater than in the $X_c$ model. We next consider the sensitivity of parameters corresponding to variables in $X_c$. As in the infinite data case, to generate a new input, the adversary can set the values of the variables such that the non-$X_c$ features $h \in H$ become 0. Then the optimal $\hat\gamma_{X_c}$ for the new point will be $\beta \pm \delta_{max}$ (which is equivalent to $\beta_{extended} \pm \delta_{max}$), and the sensitivity will be the last two terms from Eqn 19 plus $\delta_{max}$ (which can be achieved by sufficiently high values of the $X_c$ variables for the adversarial point). Thus, the sensitivity is

$$|\beta \pm \delta_{max} - \hat\gamma_{X_c}| = |(H^T H)^{-1} H^T S \beta' + (H^T H)^{-1} H^T \epsilon| + |\delta_{max}|$$
Since $\beta'$ and $S$ correspond to the true $\beta$ and $X_c$ respectively, the sensitivity of the $X_a$ model will be higher than the $X_c$ model's sensitivity for the same parameter whenever $|H^T \epsilon| \gg |X_c^T \epsilon|$. We now use this lemma to prove our main theorem.
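The decomposition $\hat\beta = \beta + (X_c^T X_c)^{-1} X_c^T \epsilon$ (Eqn 16) used above can be checked numerically term by term; a sketch with an illustrative two-parent DGP:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
beta = np.array([0.7, -0.4])

# Two causal parents and a bounded error term.
Xc = rng.normal(size=(n, 2))
eps = rng.uniform(-0.5, 0.5, size=n)
y = Xc @ beta + eps

# Closed-form OLS: beta_hat = (Xc^T Xc)^{-1} Xc^T y.
beta_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)

# Eqn 16: beta_hat = beta + (Xc^T Xc)^{-1} Xc^T eps, term by term.
correction = np.linalg.solve(Xc.T @ Xc, Xc.T @ eps)
assert np.allclose(beta_hat, beta + correction)
print(beta_hat, correction)  # the correction term shrinks as n grows
```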

A.4 MAIN THEOREM

Theorem 1. Consider a linear gaussian SCM $M = \langle X, f, \epsilon \rangle$ with standardized variables (zero mean, unit variance). Let the true generative equations be expressed as

$$\forall x_i \in X: \quad x_i = Pa_i\,\beta_i + \epsilon_i \qquad (4)$$

where (a) $Pa_i$ is the data matrix denoting all parents of $x_i$ in the CG corresponding to $M$, (b) $\beta_i$ are the true generative parameters, and (c) $\epsilon_i$ is an independent, symmetric error vector that is bounded such that the maximum deviation from $\beta_i$ that minimizes $|x_i^j - \beta_i Pa_i^j|$ for the $j$-th data point is $\delta_{max}$. Let a mechanism take as input $n$ data points and output the parameters of the fitted linear generative functions. Then,
1. As $n \to \infty$, the sensitivity of the mechanism based on causal information is lower than that without it.
2. For finite $n$, whenever the empirical correlation of a variable's parents with the error is much less than the empirical correlation of the features used by the associational model with the error (i.e., $|Pa_i^T \epsilon_i| \ll |H_i^T \epsilon_i|$), the sensitivity of the mechanism based on causal information is the lowest.

Proof. WARM-UP: Three node SCM. As a warm-up, consider a 3-node SCM where $X = \{x_1, x_2, x_3\}$. $x_1$ causes $x_3$, while $x_2$ does not have a causal relationship with $x_3$ but is related to it (it may be part of the Markov Blanket of $x_3$, or simply correlated with $x_3$). The corresponding data-generating equations are

$$x_1 = \epsilon_1 \qquad x_2 = f_2(x_1, x_3, \epsilon_2) \qquad x_3 = \beta_3 x_1 + \epsilon_3 \qquad (20)$$

where $f_2$ can be any linear function ($x_2$ is optionally caused by $x_1$). If $x_2$ depends on $x_3$, it will be a child of $x_3$; if $x_2$ depends on $x_1$, it will be correlated with $x_3$; if $x_2$ depends on both $x_1$ and $x_3$, it will be both a child of and correlated with $x_3$. Let us consider the estimation equation for a single $x_i$; the proof logic follows in the same way for the other estimating equations, since they correspond to independent mechanisms.

Causal Model. The causal model will use the following model for predicting $x_3$,

$$\hat{x}_3 = \hat\beta_3 x_1 \qquad (21)$$

Associational Model.
In contrast, the associational generative model will use the following parameters. Since $x_2$ is not independent of $x_3$ after conditioning on $x_1$, the model may include $x_2$, since doing so may yield a predictive accuracy gain for $x_3$:

$$\hat{x}_3 = \hat\gamma_3^1 x_1 + \hat\gamma_3^2 x_2 \qquad (22)$$

Using Lemma 1, we see that the causal model corresponds to the true DGP, whereas the associational model does not. Thus, the sensitivity of the causal model is lower than or equal to that of the associational model, under the same conditions (where $X_c$ is replaced by the causal parents $Pa = \{x_1\}$ and $X_a$ by $H = \{x_1, x_2\}$). We can analogously use Lemma 1 for the general case with all variables.

The above proof also generalizes to the case where not all variables of the SCM are observed. We need the following assumption.

Assumption 1. For each node $x_i \in X$ of the SCM $M$, any unobserved parents $Pa_i^{unobs}$ are independent of all other observed parents of $x_i$, $Pa_i^{obs}$.

If all parents of a node $x_i$ are observed, the above assumption is trivially true. If not all parents are observed (e.g., $x_i$ shares an unobserved common cause with another variable), then this assumption ensures that the estimated error from linear regression is independent of the true parents.

Corollary 2. Under Assumption 1, Theorem 1 is satisfied even if some variables of the SCM are unobserved.


Proof. Under Assumption 1, the empirical error is still independent of the causal parents. In the simple case, if $x_3$ has an unobserved parent, it may share that unobserved parent with $x_2$, and thus the $\epsilon_3$ term can be expanded as $\epsilon_3 = \beta_3^{unobs} x_{unobs} + \epsilon_3'$, where $\epsilon_3'$ is the true SCM error in the presence of the unobserved variable. However, due to Assumption 1, $x_{unobs} \perp\!\!\!\perp x_1$; hence $x_1 \perp\!\!\!\perp \epsilon_3$. Hence the logic of the above proof holds, and the rest of the proof follows identically to Theorem 1.
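For completeness, here is how a fitted linear SCM is used as a generative model: each mechanism is fit independently, and synthetic data is sampled in topological order. The 3-node structure, coefficients, and noise scales below are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

# Ground-truth 3-node SCM: x1 -> x3, and x2 depends on x1 and x3.
x1 = rng.normal(size=n)
x3 = 1.2 * x1 + 0.3 * rng.normal(size=n)
x2 = 0.5 * x1 + 0.4 * x3 + 0.3 * rng.normal(size=n)

# Fit each mechanism independently (node-wise least squares).
b3 = (x1 @ x3) / (x1 @ x1)                  # x3 ~ x1
A = np.column_stack([x1, x3])
b2 = np.linalg.solve(A.T @ A, A.T @ x2)     # x2 ~ x1, x3

# Generate synthetic data in topological order: x1, then x3, then x2.
# (Noise scales are assumed known here for simplicity of the sketch.)
s1 = rng.normal(size=n)
s3 = b3 * s1 + 0.3 * rng.normal(size=n)
s2 = b2[0] * s1 + b2[1] * s3 + 0.3 * rng.normal(size=n)
synthetic = np.column_stack([s1, s2, s3])
print(np.corrcoef(synthetic.T))  # mirrors the original correlations
```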

REMARK: Counter-Examples

Where Causal Information May Not Help Sensitivity. Note that the above proof would not work if the additional features used by the associational model are independent of, or only weakly correlated with, the variable to be generated (so that we cannot claim $|H_i^T \epsilon_i| \gg |Pa_i^T \epsilon_i|$). In the 3-node SCM, for example, if $x_2$ is weakly correlated with $x_3$, then $\sum x_2 \epsilon_3$ will also be close to zero for large $n$, as would $\sum x_1 \epsilon_3$. So the sensitivity of $\hat\gamma_3^1$ may be comparable to that of $\hat\beta_3^1$ (though the parameter $\hat\gamma_3^2$ may still have non-zero sensitivity compared to the causal model's zero sensitivity). Specifically, if the relationship between $x_2$ and $x_3$ is weak enough that $\sum x_2 \epsilon_3$ is comparable to $\sum x_1 \epsilon_3$, then it is not guaranteed that causal information will help.

To provide a contrived example where a model with causal information has worse sensitivity than one without, consider the following counter-example. Suppose a finite dataset is such that $\sum x_2^2 \sum x_1 \epsilon_3 - \sum x_1 x_2 \sum x_2 \epsilon_3 = 0$. Then $\hat\gamma_3^1 = \beta_3$ and will have minimal ($\delta_{max}$) sensitivity, while $\hat\beta_3^1 \neq \beta_3$ and thus its sensitivity is $> \delta_{max}$. In addition, we can construct the relationship of $x_2$ and $x_3$ such that the sensitivity of $\hat\gamma_3^2$ is also zero. Specifically, $x_2$ may be a child of $x_3$ and be fully determined by it: $x_2 = \beta_2 x_3 \Rightarrow x_3 = \frac{1}{\beta_2} x_2$, leading to $\hat\gamma_3^2 = \frac{1}{\beta_2}$. Then $\hat\gamma_3^2$ will also be optimal for any new adversarial point. Thus, comparing the causal and associational models, the sensitivity of both $\hat\gamma_3^2$ and $\hat\beta_3^2$ is zero, but the sensitivity of $\hat\gamma_3^1$ is lower than that of $\hat\beta_3^1$.

Correctness of the Proof: Our proof does not simply depend on the number of features. As a counter-example, consider a graph where a node $x_3$ has two parents, $x_1$ and $x_2$, and is correlated with $x_4$. A causal model will use both $x_1$ and $x_2$, while the associational model may use $x_4$. The logic in the proof (independence of the error w.r.t. the features) will show that the causal model, despite using more features, is less sensitive.
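The contrived counter-example admits a direct construction. Below, x2 = β2·x3 exactly (we pick β2 = 2 so 1/β2 is exactly representable); the associational regression recovers (0, 1/β2) with zero residual and is unchanged by any new point from the same DGP, while the causal estimate shifts:

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta3, beta2 = 200, 1.5, 2.0

x1 = rng.normal(size=n)
x3 = beta3 * x1 + rng.uniform(-0.5, 0.5, size=n)
x2 = beta2 * x3               # x2 is a deterministic child of x3

# Associational model: x3 ~ x1 + x2. Since x3 = x2 / beta2 exactly,
# least squares recovers (0, 1/beta2) with zero residual.
H = np.column_stack([x1, x2])
gamma = np.linalg.lstsq(H, x3, rcond=None)[0]
print(gamma)  # approximately [0, 0.5]

# Adding a new point from the same DGP leaves gamma unchanged
# (zero sensitivity), while the causal estimate x3 ~ x1 can shift.
x1n = np.append(x1, 10.0)
x3n = np.append(x3, beta3 * 10.0 + 0.5)   # error at its bound
x2n = beta2 * x3n
Hn = np.column_stack([x1n, x2n])
gamma_new = np.linalg.lstsq(Hn, x3n, rcond=None)[0]
beta_hat = (x1 @ x3) / (x1 @ x1)
beta_new = (x1n @ x3n) / (x1n @ x1n)
print(abs(gamma_new - gamma).max(), abs(beta_new - beta_hat))
```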

Clarification (& Comparison to Prior Work):

We would like to clarify that the proof involving linear gaussian models is meant to serve as intuition for the benefit that causal side-information provides towards the stability of the learnt parameters. Prior work (Chaudhuri et al., 2011) also studies the learning of linear models under DP and performs a similar style of analysis. However, we remark that (a) their work does not use any causal information, and (b) measuring sensitivity through parameter stability is a fairly general technique (and does not imply that output perturbation is the only mechanism that can be used to achieve DP in such a setting). We would also like to stress that (a) we use a different proof technique that is motivated by the causal graph structure, and (b) their work is for discriminative models while ours is for generative models.

B PRIVACY BENEFIT OF CAUSAL MODELS IN NON-LINEAR SETTINGS

Note: The notation in this section is slightly different from that used earlier.

B.1 NOTATION

A mechanism $H$ takes as input a dataset $D$ and outputs a parameterized model $f_\theta$, where $\theta$ are the parameters of the model. The model (and its parameters) belongs to a hypothesis space $H$. The dataset comprises samples, where each sample $x = (x_1, \cdots, x_k)$ comprises $k$ features. To learn the model, we utilize empirical risk minimization (ERM) with a loss function $L$. The subscript of the loss function denotes what the loss is calculated over: for example, $L_x$ denotes the loss calculated over sample $x$, while $L_D$ denotes the average loss calculated over all samples in the dataset, i.e., $L_D = \frac{1}{|D|}\sum_{x\in D} L_x$. Additionally, $L_x(f_\theta) = \ell(f_\theta(x), f^*(x))$, where $f^*$ is the oracle (responsible for generating the ground truth) and $\ell(\cdot,\cdot)$ can be any loss function (such as the cross entropy loss or the reconstruction loss of a generative model).

1. Data Generating Process (DGP):

The DGP $\langle f^*, \eta \rangle$ is obtained using the following procedure: $f^* = \lim_{n\to\infty} \arg\min_{f_\theta} L_D(f_\theta)$. Essentially, $f^*$ can be thought of as the infinite-data limit of the ERM learner and can be viewed as the ground truth. In a causal setting, the DGP for all variables/features $x$ is defined as $f^*(x) = (f_1^*(Pa(x_1)) + \eta_1, \cdots, f_n^*(Pa(x_n)) + \eta_n)$, where the $\eta_i$ are mutually independent noise values and $Pa(x_i)$ are the parents of $x_i$ in the SCM.
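A toy instantiation of such a DGP (the nonlinear mechanisms and noise bounds are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000

# Toy DGP f* with independent per-node noise eta_i:
# x1 has no parents; x2 = f2*(x1) + eta2; x3 = f3*(x1, x2) + eta3.
eta = rng.uniform(-0.1, 0.1, size=(n, 3))
x1 = eta[:, 0]
x2 = np.tanh(2.0 * x1) + eta[:, 1]   # hypothetical mechanism f2*
x3 = 0.5 * x1 * x2 + eta[:, 2]       # hypothetical mechanism f3*
data = np.column_stack([x1, x2, x3])
print(data.shape)  # (1000, 3)
```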

2. Distinction between Causal and Associational Worlds:

For each feature $x_i$, we call $Pa(x_i)$ the causal features, and $X \setminus \{x_i, Pa(x_i)\}$ the associational features for predicting $x_i$. Correspondingly, the model using only $Pa(x_i)$ for each feature $x_i$ is known as the causal model, and the model using all features $X$ (including associational features) is known as the associational model. We denote the causal model learnt by ERM with loss $L$ as $f_{\theta_c}$, and the associational model learnt by ERM using the same loss $L$ as $f_{\theta_a}$. Note that the hypothesis classes for the models are different: $f_{\theta_c} \in H_C$ and $f_{\theta_a} \in H_A$, where $H_C \subseteq H_A$. Like $f_{\theta_c}$, the true DGP function uses only the causal features. Assuming that the true function $f^*$ belongs to the hypothesis class $H_C$, we write $f^* = \lim_{|D|\to\infty} \arg\min_{f \in H_C} L_D(f)$.

3. Adversary. Given a dataset $D$ and a model $f_\theta$, the role of the adversary is to create a neighboring dataset $D'$ by adding a new point $x'$. We assume that the adversary does so by choosing a point $x'$ where the loss of $f_\theta$ is maximized. Thus, the difference of the empirical loss on $D'$ compared to $D$ will be high, which we expect to lead to high susceptibility to membership inference attacks.

4. Loss-maximizing (LM) Adversary: Given a model $f_\theta$, a dataset $D$, and a loss function $L$, an LM adversary chooses a point $x'$ (to be added to $D$ to obtain $D'$) as $\arg\max_{x} L_x(f_\theta)$. Note that $L_x(f_\theta) = L_x(f_\theta(x))$.

Main Result. Given a dataset $D$ of size $n$, and a strongly convex and Lipschitz continuous loss function $L$, assume we train two models in a differentially private manner: a causal (generative) model $f_{\theta_c}$ and an associational (generative) model $f_{\theta_a}$, such that they minimize $L$ on $D$. Assume that the class of hypotheses $H$ is expressive enough that the true causal function lies in $H$.
1. Infinite sample case. As $n \to \infty$, the privacy budget of the causal model is lower than that of the associational model, i.e., $\varepsilon_c \leq \varepsilon_a$.
2. Finite sample case.
For finite n, assuming certain conditions on the associational models learnt and n, the privacy budget of the causal model is lower than that of the associational model i.e., ε c ≤ ε a .
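The LM adversary of item 4 can be sketched as an argmax over candidate points; the fixed linear model, squared-error loss, and candidate grid here are all illustrative assumptions:

```python
import numpy as np

# Illustrative model: predicts y from x with a fixed linear rule.
theta = np.array([1.0, -0.5])

def loss(point):
    """Squared-error loss L_x(f_theta) on a single (x, y) point."""
    x, y = point[:2], point[2]
    return (y - theta @ x) ** 2

# LM adversary: among candidate points, pick the argmax of the loss.
rng = np.random.default_rng(6)
candidates = rng.uniform(-1, 1, size=(500, 3))
worst = max(candidates, key=loss)
print(loss(worst))  # the point the adversary would add to D
```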

B.2 PROOF OUTLINE

The main steps of our proof are as follows: 1. We instantiate two worlds: one with a causal model, and one with an associational model. 2. We show that the maximum loss of the causal model is lower than or equal to the maximum loss of the corresponding associational model. 3. Using strong convexity and Lipschitz continuity of the loss function, we show how the difference in loss corresponds to the sensitivity of the learning function. 4. Finally, the privacy budget ε is a monotonic function of the sensitivity. We prove step 2 separately for $n \to \infty$ (Appendix B.3) and finite $n$ (Appendix B.4) below. We then prove step 3 in Appendix B.5. Step 4 follows from the differential privacy literature (Dwork et al., 2014).

B.3 PROOF WHEN n → ∞ (FOR STEP 2 FROM OUTLINE)

As $|D| = n \to \infty$, the proof arises from the fact that the causal model becomes the same as the true DGP $f^*$.

Preliminary 1. Given any variable $x_t$, the causal model learns a function based only on its parents, $Pa(x_t)$. The adversary for the causal model chooses points from the DGP $\langle f^*, \eta \rangle$ s.t.

$$x' = \arg\max_x L_x(f_{\theta_c}(x)) \quad \text{s.t. } \forall i:\; x_i = f_i^*(Pa(x_i)) + \eta_i$$

where $f_{\theta_c} = \arg\min_{f \in H_C} L_D(f)$. Assuming that $H_C$ is expressive enough that $f^* \in H_C$, as $n = |D| \to \infty$, we can write

$$\lim_{|D|\to\infty} f_{\theta_c} = \lim_{|D|\to\infty} \arg\min_{f\in H_C} L_D(f) = f^*$$

Therefore, the causal model is equivalent to the true DGP's function. For any target $x_i$ to be predicted, the maximum error on any point is $\eta_i$ for the $\ell_1$ loss, and a function of $\eta_i$ for other losses. Intuitively, the adversary is constrained to choose points at a maximum distance $\eta_i$ away from the causal model. For associational models, however, we have

$$x' = \arg\max_x L_x(f_{\theta_a}(x)) \quad \text{s.t. } \forall i:\; x_i = f_i^*(Pa(x_i)) + \eta_i$$

As $n = |D| \to \infty$, $f_{\theta_a} \neq f^*$. Thus, the adversary is less constrained and can generate points for a target $x_i$ that are generated from a different function than the associational model. For any point, the difference between the associational model's prediction and the true value is $|f_{\theta_a}(x) - f^*(Pa(x_i))| + \eta_i$, which is equivalent to the loss under $\ell_1$. For a general loss function, the loss is a function of $|f_{\theta_a}(x) - f^*(Pa(x_i))| + \eta_i$. Therefore, we obtain

$$\forall i:\; \eta_i \leq |f_{\theta_a}(x) - f^*(Pa(x_i))| + \eta_i \;\Rightarrow\; \max_x L_x(f_{\theta_c}(x)) \leq \max_x L_x(f_{\theta_a}(x))$$

for all losses that are increasing functions of the difference between the predicted and actual value.

B.4 PROOF WHEN FINITE n (FOR STEP 2 FROM OUTLINE)

When $n$ is finite, the proof argument remains the same, but we need an additional assumption on the associational model $f_{\theta_a}$ learnt from $D$. From learning theory (Shalev-Shwartz & Ben-David, 2014), we know that the loss of $f_{\theta_c}$ will converge to that of $f^*$, while the loss of $f_{\theta_a}$ will converge to the loss of $f_{\theta_a}^\infty \neq f^*$. Thus, with high probability, $f_{\theta_c}$ will have a lower loss w.r.t. $f^*$ than $f_{\theta_a}$, and a similar argument follows as for the infinite-data case. However, since this convergence is probabilistic and depends on the size of $n$, it is possible to obtain an $f_{\theta_c}$ that has a higher loss w.r.t. $f^*$ than $f_{\theta_a}$. Therefore, rather than assuming convergence of $f_{\theta_c}$ to $f^*$, we instead rely on the property that the true DGP function $f^*$ does not depend on the associational features $x_a$. As a result, even if the loss of the associational model is lower than that of the causal model on a particular point $x = x_c \cup x_a$, we can change the value of $x_a$ to obtain a higher loss for the associational model (without changing the loss of the causal model). This requires that the associational model have a non-trivial contribution from the associational (non-causal) features, sufficient to change the loss. We state the following assumption.

Assumption 2: If $f_{\theta_c}$ is the causal model and $f_{\theta_a}$ is the associational model, then we assume that the associational model has a non-trivial contribution from the associational features. Specifically, denote $x_c$ as the causal features and $x_a$ as the associational features, such that $x = x_c \cup x_a$. We define two new points: $x' = x_c' \cup x_a$ and $x'' = x_c \cup x_a'$. Let us first assume a fixed value of $x_a$. The LHS (below) denotes the max difference in loss between $f_{\theta_c}$ and $f_{\theta_a}$ (i.e., the gap between the causal and associational models over the same causal features). The RHS (below) denotes the difference in loss of $f_{\theta_a}$ between $x_a$ and another value $x_a'$, keeping $x_c$ constant (i.e., the effect due to the associational features).
Formally,

$$\exists x_a:\; \max_{x_c} \left\{L_{x}(f_{\theta_c}(x_c \cup x_a)) - L_{x}(f_{\theta_a}(x_c \cup x_a))\right\} \;\leq\; \min_{x_c} \max_{x_a'} \left\{L_{x''}(f_{\theta_a}(x_c \cup x_a')) - L_{x}(f_{\theta_a}(x_c \cup x_a))\right\}$$

The inequality above can be interpreted as follows: if adversary 1 aims to find the $x_c$ for which the difference in loss between the associational and causal models is highest for a given $x_a$, then there is always another adversary 2 who can obtain a bigger difference in loss by changing the associational features (from the same $x_a$ to some $x_a'$).

Intuition: Imagine that $f_{\theta_c}$ is trained initially, and then the associational features are introduced to train $f_{\theta_a}$. $f_{\theta_a}$ can obtain a lower loss than $f_{\theta_c}$ by using the associational features $x_a$. In doing so, it might even change the model parameters related to $x_c$. Assumption 2 says that the change in $x_c$'s parameters is small compared to the importance of $x_a$'s parameters in $f_{\theta_a}$. For example, consider $f^*, f_{\theta_c}, f_{\theta_a}$ predicting the value of $x_t$ such that $x_c = \{x_1\}$ and $x_a = \{x_2\}$, and consider the $\ell_1$ loss:

$$f^* = x_1; \qquad f_{\theta_c} = 2x_1; \qquad f_{\theta_a} = 1.9x_1 + \phi(x_2)$$

where $x_t = f^*(x) + \eta$ and $\eta \in [-0.5, 0.5]$. Note that without $\phi(x_2)$, the loss of the associational model is lower than the loss of the causal model on any point. However, if $x_a = x_2 \in \mathbb{R}$, then we can always set $|x_2|$ to an extreme value such that $\phi(x_2)$ overturns the reduction in loss for the associational model, without invoking Assumption 2. When $x_a$ is bounded (e.g., $x_2 \in \{0, 1\}$), then Assumption 2 states that the change in loss possible by changing $\phi(x_2)$ is higher than the loss difference (which is 0.1 for the $\ell_1$ loss). If $H$ were the class of linear functions and we assume the $\ell_1$ loss with all features in the same range (e.g., $[0, 1]$), then Assumption 2 implies that the coefficient of the associational features in $f_{\theta_a}$ is higher than the change in coefficient of the causal features from $f_{\theta_c}$ to $f_{\theta_a}$.

Lemma 2. Assume an LM adversary and a strongly convex loss function $L$.
Given a causal model $f_{\theta_c}$ and an associational model $f_{\theta_a}$ trained on dataset $D$ using ERM, the LM adversary selects two points, $x'$ and $x''$. Then, under Assumption 2, the worst-case loss obtained on the causal ERM model is lower than the worst-case loss obtained on the associational ERM model, i.e., $L_{x'}(f_{\theta_c}) \leq L_{x''}(f_{\theta_a})$, which can be re-written as $\max_{x'} L_{x'}(f_{\theta_c}) \leq \max_{x''} L_{x''}(f_{\theta_a})$.

Proof: Before we discuss the proof, let us establish another preliminary.

Preliminary 2. Let us write $f_{\theta_a}(x) = f_{\theta_a}(x_c \cup x_a)$ as a combination of terms due to $x_c$ and $x_a$, where $x_c$ and $x_a$ are the causal features (parents) and non-causal features respectively, i.e., $x_c \cup x_a = x$ and $x_c \cap x_a = \emptyset$.

Let $x' = x_c' \cup x_a'$ be the point chosen by the causal adversary. We will show that the associational adversary can always choose a point $x'' = x_c' \cup x_a''$ such that the loss of the adversary is higher. We write, for any value $x_a''$,

$$L(f_{\theta_a}(x_c' \cup x_a'')) = L(f_{\theta_a}(x_c' \cup x_a'')) - L(f_{\theta_a}(x_c' \cup x_a')) + L(f_{\theta_a}(x_c' \cup x_a'))$$
$$= \left(L(f_{\theta_a}(x_c' \cup x_a'')) - L(f_{\theta_a}(x_c' \cup x_a'))\right) + \left(L(f_{\theta_a}(x_c' \cup x_a')) - L(f_{\theta_c}(x_c' \cup x_a'))\right) + L(f_{\theta_c}(x_c' \cup x_a'))$$

Rearranging terms, and since $L(f_{\theta_c}(x_c' \cup x_a'')) = L(f_{\theta_c}(x_c' \cup x_a'))$ for any value of $x_a''$ (the causal model does not depend on the associational features),

$$L(f_{\theta_a}(x_c' \cup x_a'')) - L(f_{\theta_c}(x_c' \cup x_a')) = \underbrace{\left(L(f_{\theta_a}(x_c' \cup x_a'')) - L(f_{\theta_a}(x_c' \cup x_a'))\right)}_{\text{Term 1}} - \underbrace{\left(L(f_{\theta_c}(x_c' \cup x_a')) - L(f_{\theta_a}(x_c' \cup x_a'))\right)}_{\text{Term 2}} \qquad (29)$$

Now, Term 1 is $\geq 0$, since the adversary can select $x_a''$ such that the loss increases (or stays constant) for $f_{\theta_a}$. Since the true function $f^*$ does not depend on $x_a$, changing $x_a$ does not change the true function's value, but it will change the value of the associational model (and the adversary can choose it such that the loss on the new point is higher). Term 2 can be either positive or negative. If it is negative, then we are done, as the LHS $> 0$.
If Term 2 is positive, then we need to show that Term 1 is higher in magnitude than Term 2. Let Assumption 2 be satisfied for some $x_a^\circ$. We know that $L(f_{\theta_c}(x_c' \cup x_a')) = L(f_{\theta_c}(x_c' \cup x_a^\circ))$, since the causal model ignores the associational features. Then,

$$L(f_{\theta_c}(x_c' \cup x_a^\circ)) - L(f_{\theta_a}(x_c' \cup x_a^\circ)) \leq \max_{x_c} \left(L(f_{\theta_c}(x_c \cup x_a^\circ)) - L(f_{\theta_a}(x_c \cup x_a^\circ))\right) \leq \min_{x_c} \max_{x_a^*} \left(L(f_{\theta_a}(x_c \cup x_a^*)) - L(f_{\theta_a}(x_c \cup x_a^\circ))\right) \leq \max_{x_a^*} \left(L(f_{\theta_a}(x_c' \cup x_a^*)) - L(f_{\theta_a}(x_c' \cup x_a^\circ))\right)$$

Now suppose the adversary chooses a point such that $x_a'' = x_a^{max}$, where $x_a^{max}$ is the arg max of the RHS above. Then Eqn 29 can be rewritten as

$$L(f_{\theta_a}(x_c' \cup x_a^{max})) - L(f_{\theta_c}(x_c' \cup x_a')) = \left(L(f_{\theta_a}(x_c' \cup x_a^{max})) - L(f_{\theta_a}(x_c' \cup x_a^\circ))\right) - \left(L(f_{\theta_c}(x_c' \cup x_a^\circ)) - L(f_{\theta_a}(x_c' \cup x_a^\circ))\right) > 0 \qquad (31)$$

where the last inequality follows from the chain of inequalities above. Thus, the adversary can always select a different value $x'' = x_c' \cup x_a^{max}$ such that the loss is higher than the max loss on the causal model:

$$L(f_{\theta_c}(x_c' \cup x_a')) = \max_x L_x(f_{\theta_c}) \leq L(f_{\theta_a}(x_c' \cup x_a^{max})) \leq \max_x L_x(f_{\theta_a})$$

B.5 PROOF OF STEP 3 FROM OUTLINE

Theorem 2. Assume the existence of a dataset $D$ of $n$ samples, and let a neighboring dataset be defined by adding a data point to $D$. Let $f_{\theta_c}$ and $f_{\theta_a}$ be the causal and associational models learnt using $D$, and $f_{\theta_c'}$ and $f_{\theta_a'}$ be the causal and associational models learnt using neighboring datasets $D'$ and $D''$ respectively. All models are obtained by ERM on a Lipschitz continuous (with parameter $\rho$), strongly convex (with parameter $\lambda$) loss function $L$. Then the sensitivity of the causal learning function $H_C$ will be lower than that of its associational counterpart $H_A$. Mathematically, assuming large enough $n$ such that $n > \frac{2\rho}{\lambda} - 1$,

$$\max_{D,D'} \|\theta_c - \theta_c'\| \leq \max_{D,D''} \|\theta_a - \theta_a'\|$$

Proof: The proof uses the strongly convex and Lipschitz properties of the loss function. Before we discuss it, let us introduce a requisite preliminary.

Preliminary 3.
Assume the existence of a dataset D of size n. There are two generative models learnt, f θa and f θc using this dataset. Similarly, assume there is a neighboring dataset D which is obtained by adding one point x . Then the corresponding ERM models learnt using D are f θ a and f θ c . We now detail the steps of the proof. Step 1. Assume L is strongly convex. Then by the optimality of ERM predictor on D and the definition of strong convexity, L D (f θ ) ≤ L D (αf θ + (1 -α)f θ ) ≤ αL D (f θ ) + (1 -α)L D (f θ ) - λ 2 α(1 -α)||θ -θ || 2 Step 2. Rearranging terms, and as α → 1, (1 -α)L D (f θ ) ≤ (1 -α)L D (f θ ) - λ 2 α(1 -α)||θ -θ || 2 ⇒ ||θ -θ || 2 ≤ 2 λ (L D (f θ ) -L D (f θ )) Step 3. Further, we can write (L D (f θ ) -L D (f θ )) in terms of loss on x . 34) L D (f θ ) = n n + 1 L D (f θ ) + 1 n + 1 L x (f θ ) (since D = D ∪ x ) ≤ n n + 1 L D (f θ ) + 1 n + 1 L x (f θ ) (from Equation ≤ n n + 1 n + 1 n L D (f θ ) - n n + 1 1 n L x (f θ ) + 1 n + 1 L x (f θ ) (since D = D -{x }) ⇒ L D (f θ ) -L D (f θ ) ≤ 1 n + 1 (L x (f θ ) -L x (f θ )) Step 4. Combining the above two equations, we obtain, ||θ -θ || 2 ≤ 2 λ (L D (f θ ) -L D (f θ )) ≤ 2 λ(n + 1) (L x (f θ ) -L x (f θ )) Step 5. From Claim 1 above, we know that max x L x (f θc ) ≤ max x L x (f θa ) ⇒ L x (f θc ) ≤ L x (f θa ) where x = arg max x L x (f θc ) and x is chosen such that x and x differ only in the associational features. Thus, L x (f θ c ) = L x (f θ c ). Also because H C ⊆ H A , the training loss of the ERM model for any D defined using D and x is higher for a causal model i.e., L x (f θ c ) = L x (f θ c ) ≥ L x (f θ a ) Therefore, we obtain, L x (f θc ) -L x (f θ c ) = max x L x (f θc ) -L x (f θ c ) ≤ L x (f θa ) -L x (f θ a ) So we have now shown that the max loss difference on a point x for causal ERM models trained on neighboring datasets is lower than the corresponding loss difference over x for the associational models. Step 6. 
Now we use the Lipschitz property, to claim, L x (f θa ) -L x (f θ a ) ≤ ρ||θ a -θ a || Step 7. Combining Equations 36 (substituting f θc ) and 40, and taking max on the RHS, we get,  max D,D ||θ c -θ c || 2 ≤ 2 λ(n + 1) max D,D L x (f θc ) -L x (f θ c ) ≤ 2 λ(n + 1) max D,D L x (f θa ) -L x (f θ a ) ≤ 2ρ λ(n + 1) max D,D ||θ a -θ a || (41) ⇒ max D,D ||θ c -θ c || 2 ≤ 2ρ λ(n + 1) max D,D ||θ a -θ a || (42) For n + 1 > 2ρ λ , max D,D ||θ c -θ c || 2 ≤ max D,D ||θ a -θ a || Remark on Theorem 2. Theorem 2 depends on two key assumptions: 1. Assumption 1 that constrains associational model to have non-trivial contribution from associational (non-causal) features. 2. A sufficiently large n as shown above. When any of these assumptions is violated (e.g., a small-n training dataset or an associational model that is negligibly dependent on the associational features), then it is possible that the causal ERM model has higher ε than the associational model.
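Steps 1–4 above can be checked numerically for a one-dimensional $\ell_2$-regularized least-squares ERM, whose per-point loss $\frac{1}{2}(\theta x - y)^2 + \frac{\lambda}{2}\theta^2$ is $\lambda$-strongly convex. This is a minimal sketch, not the paper's setting: the toy data, $\lambda = 1$, and the conservative Lipschitz constant $\rho = 3$ are illustrative assumptions.

```python
import random

def erm(data, lam):
    """Closed-form minimizer of (1/n) * sum 0.5*(theta*x - y)^2 + 0.5*lam*theta^2."""
    n = len(data)
    sx2 = sum(x * x for x, y in data)
    sxy = sum(x * y for x, y in data)
    return sxy / (sx2 + lam * n)

random.seed(0)
lam = 1.0
n = 50
D = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
theta = erm(D, lam)

# With |x|, |y| <= 1 the minimizer satisfies |theta| <= 1/lam, so the per-point
# gradient |x*(theta*x - y) + lam*theta| is at most 3 for lam = 1 (conservative rho).
rho = 3.0
bound = 2 * rho / (lam * (n + 1))

# Worst case over neighboring datasets D' = D + {x'} on a grid of added points.
worst_gap = 0.0
for i in range(-10, 11):
    for j in range(-10, 11):
        theta_prime = erm(D + [(i / 10, j / 10)], lam)
        worst_gap = max(worst_gap, abs(theta - theta_prime))

print(worst_gap <= bound)  # the 2*rho/(lambda*(n+1)) sensitivity bound holds
```

The empirical gap is far below the bound here, as expected: the bound is worst-case over all Lipschitz, strongly convex losses, not tight for this particular one.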

Connections between Theory & Practice:

The proof provided in Appendix A is meant to provide intuition about the benefit of causal information in a simpler setting using linear Gaussian mixtures, similar to what is done in other work (Ilyas et al., 2019). We provide a more general proof (across all classes of generative models), for a general privacy adversary, in Appendix B, but under some constraining assumptions regarding the convexity of the loss function.

SCM (nodes $Z$, $X_1$, $X_2$). Encoder: $q_\phi(z, x_1 \mid x_2) = q_{\phi_1}(z \mid x_2)$. Decoder: $p_\theta(x_2, x_1, z) = p(z) \cdot p(x_1) \cdot p_{\theta_1}(x_2 \mid z) \cdot p_{\theta_2}(x_2 \mid x_1)$.

Note: All encoders and decoders used as part of our experiments comprise simple feed-forward architectures with 3 layers. All embeddings generated are of size 10. We set the learning rate to 0.001.

The more salient features of each dataset are presented in Table 2. We choose these datasets as they encompass diversity in their size (n) and dimensionality (k), and have some prior information on causal structures (refer to Appendix D). We utilize this causal information in building causally informed generative models. The (partial) SCM is given utilizing domain knowledge of the EEDI and Pain contexts.

E UTILITY EVALUATION

E.2 PAIRPLOTS

Observe that the pair-plots obtained from the models trained with causality and DP are comparable to those obtained from models trained with causality and no DP; the utility of both these models should be comparable. 
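As a rough numeric analogue to eyeballing pair-plots, one can compare the pairwise Pearson correlations of two synthetic datasets; a small maximum gap supports the claim that their utility is comparable. The sketch below uses toy stand-in data rather than our actual model samples, and the 0.1 threshold is an illustrative assumption.

```python
import math
import random

def pearson(u, v):
    """Pearson correlation between two equal-length columns."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def corr_gap(data_a, data_b):
    """Max absolute difference between the pairwise correlations of two
    datasets, each given as a list of equal-length columns."""
    k = len(data_a)
    gap = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            gap = max(gap, abs(pearson(data_a[i], data_a[j]) - pearson(data_b[i], data_b[j])))
    return gap

random.seed(1)
# Toy stand-ins for samples from the causal+DP and causal+no-DP models.
z = [random.gauss(0, 1) for _ in range(500)]
synth_dp = [[v + random.gauss(0, 0.3) for v in z], [2 * v + random.gauss(0, 0.3) for v in z]]
synth_nodp = [[v + random.gauss(0, 0.2) for v in z], [2 * v + random.gauss(0, 0.2) for v in z]]

print(corr_gap(synth_dp, synth_nodp) < 0.1)
```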

G MI RESULTS

PA denotes the ability of the attack classifier to correctly classify train (member) samples, and NA denotes its ability to correctly classify test (non-member) samples. The Accuracy is a weighted combination of PA and NA.

EEDI. Observe that when there is causal information, the difference between the DP and No DP columns is larger.
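For concreteness, the weighted combination can be computed as below. Weighting PA and NA by the member and non-member counts is our assumption about how the combination is taken; it is not a detail fixed by the tables.

```python
def mi_accuracy(pa, na, n_members, n_nonmembers):
    """Overall MI attack accuracy: PA weighted by the member count,
    NA weighted by the non-member count."""
    return (pa * n_members + na * n_nonmembers) / (n_members + n_nonmembers)

# Balanced evaluation set: accuracy reduces to the plain average of PA and NA.
print(mi_accuracy(0.8, 0.6, 1000, 1000))  # 0.7
```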



Footnotes:
- Utility obtained from models trained on the original dataset (without the use of any generative model).
- T is used to denote the transpose of the matrix.
- The detailed proof is in Appendix A.
- https://github.com/pytorch/opacus
- All trials were repeated 5 times with different random seeds. The numbers reported in the table are an average of these trials.
- Unlike the other datasets we consider, the EEDI dataset is sparse. The Histogram attack relies on counting the number of entries for a particular feature to aid in disambiguation, while the Naive attack relies on condensing the entire dataset to summary statistics (such as mean, median, and mode); we conjecture that sparsity helps the adversary in this case.
- This equation holds for the data-generating distribution; a dataset is sampled from this distribution. More generally, this is the data-generating equation of the SCM that defines the distribution.
- One should think of the DGP = (f*, η) as the oracle that generates labels.
- Note that x_a and x_c each represent a set of features, and not a single feature.
- We omit the subscript for L for brevity; it can be implied from context.
- A larger SCM for the EEDI dataset was learnt using the VICause methodology proposed by Morales-Alvarez et al. (2021).



Figure 1: Utility vs. Privacy: Causal models always outperform their associational counterparts, for the same ε.

Figure 2: More Causal Information: the EEDI model of Morales-Alvarez et al. (2021) is resilient to MI attacks when given accurate causal information.

Figure 3: True CG: Both causal and non-causal models trained with DP reduce the adversary's advantage. Causal models degrade the adversary's advantage more on average.

Figure 4: Incorrect CGs: For the Lung Cancer dataset, adding edges (introducing spurious relationships) enables the MI adversary, but removing edges (disabling causal relationships) hurts it.

Figure 5: Partial Causal Information & No DP: We plot the (average) MI success when the adversary uses a non-causal model and switches to its causal counterpart. Observe that even in the absence of DP, causal information by itself provides resilience against our MI adversary.

Figure 6: Partial Causal Information & DP: We plot the MI success when the adversary uses a noncausal model and switches to its causal counterpart. Observe that in the presence of DP & causal information, the adversary is less effective.

Now if $\max_{D,D'} \|\theta_c - \theta'_c\| \ge 1$, then the result follows by taking the square root of the LHS. If not, we need a sufficiently large $n$ such that $n + 1 > \frac{2\rho}{\lambda \max_{D,D'} \|\theta_c - \theta'_c\|}$; then we obtain
$$\max_{D,D'} \|\theta_c - \theta'_c\| \le \max_{D,D'} \|\theta_a - \theta'_a\|.$$

Figure 10: Pain5000 Dataset

Downstream Utility Change: We report the utility change induced by synthetic data on downstream classification tasks in comparison to the original data i.e., (original data utility -synthetic data utility). Negative values indicate the percentage point improvement, while positive values indicate degradation. The performance range of the classifiers we consider is reported in parentheses next to each dataset. Observe that (a) DP training induces performance degradation in both causal and non-causal settings, and (b) performance degradation in the causal setting is lower than that of the non-causal setting.
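The sign convention above can be captured in a one-line helper; the accuracy values in the example are illustrative, not numbers from our tables.

```python
def utility_change(original_acc, synthetic_acc):
    """Percentage-point change induced by synthetic data:
    positive = degradation, negative = improvement."""
    return 100 * (original_acc - synthetic_acc)

# e.g. a classifier at 85% on original data and 81% when trained on synthetic data
print(utility_change(0.85, 0.81))  # ~4 percentage points of degradation
```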

To see if additional causal information provides more resilience, we repeat the experiments with the EEDI dataset using the generative model proposed by Morales-Alvarez et al. (2021) that learns a CG from data. We utilize the same training configuration as in the earlier case, resulting in a model learnt with ε ≈ 13, which learns a CG with 57 nodes (compared to the 3-node CG used thus far). From Figure 2, observe that as we provide more accurate causal information to the causal model, its effects on privacy are exacerbated by the presence of DP noise, specifically for the attack using Naive features. The attack using Histogram features is unaffected even with DP, providing empirical evidence for recent work that questions the sufficiency of DP training against MI attacks (Humphries et al., 2020). Loose privacy budget. The values of ε used in our experiments are large. This is a result of the batch size and training duration of the VAEs we use, as DP training is expensive. If computational resources are not a constraint, we should be able to obtain smaller values of ε. Bhowmick et al. (2018) and Nasr et al. (2021) discuss at length different threat models in which larger values of ε are tolerable. 3. Non-convex loss functions. VAEs are deep learning models trained to minimize the ELBO, which is highly non-convex. Research shows that deep learning models exhibit, for example, locally convex properties (Lucas et al., 2021; Littwin & Wolf, 2020), which may explain our results. 4. Overheads: The work of Morales-Alvarez et al. (

Salient features of our experimental setup. More information about the datasets and parameters used can be found in Appendix C.4. We evaluate on datasets from three real-world applications. The first is the EEDI dataset Wang et al. (2020b), one of the largest real-world education datasets, collected from an online education platform. It contains answers by students (of various educational backgrounds) to certain diagnostic questions. The second is the neuropathic pain (Pain) diagnosis dataset obtained from a causally grounded simulator Tu et al. (2019). For this dataset, we consider two variants: one with 1000 data records (Pain1000), and another with 5000 data records (Pain5000). The third dataset (Lung Cancer) contains information about lung diseases and visits to Asia Lauritzen & Spiegelhalter (1988).

E.1 ACCURACY ON ORIGINAL DATA

The results are detailed in Table 4: baseline accuracy calculated on the original (and not synthetic) data. Results presented in Table 1 are based on these values.

Model trained using EEDI and partial causal information.

Model trained using EEDI and causal information obtained from VICause Morales-Alvarez et al. (2021).

Model trained using EEDI and no causal information.

Pain 1000. Observe that when there is causal information, the difference between the DP and No DP columns is larger.

Model trained using Pain1000 and partial causal information.

Model trained using Pain1000 and no causal information.

Pain 5000. Observe that when there is causal information, the difference between the DP and No DP columns is larger.

Model trained using Pain5000 and partial causal information.

Model trained using Pain5000 and no causal information.

C TRAINING DETAILS

C.1 OVERALL SCHEME

We outline the work-flow (refer to Figure 7):
1. The user provides a dataset for which they want to create a synthetic copy;
2. We utilize techniques (e.g., Morales-Alvarez et al. (2021)) to learn the structural causal model (SCM) associated with this dataset (or assume the SCM is given);
3. We encode this structure into a generative model (e.g., Morales-Alvarez et al. (2021); Geffner et al. (2022); Kyono et al. (2021)) and train it with DP-SGD;
4. The generative model is sampled to obtain synthetic data, which is DP (by post-processing) and can be used for arbitrary downstream tasks.
We also clarify that, unlike some works which require information about how the downstream data will be used, ours does not.

C.2 MODELS

For the experiments related to the model proposed by Morales-Alvarez et al. (2021), we utilized the same parameters as for EEDI (in Table 3) to obtain the same privacy expenditure (ε).
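The DP-SGD training step of the work-flow can be sketched as per-example gradient clipping followed by Gaussian noising of the summed gradient. The sketch below uses a scalar least-squares model in plain Python rather than our VAE/Opacus setup; the clip norm, noise multiplier, and learning rate are illustrative assumptions.

```python
import random

def dp_sgd(data, epochs=100, lr=0.1, clip=1.0, noise_multiplier=1.0, seed=0):
    """DP-SGD on a 1-D least-squares model y ~ theta * x.

    Each per-example gradient is clipped to norm `clip`; Gaussian noise with
    std = noise_multiplier * clip is added to the summed gradient before the
    averaged update, as in the standard DP-SGD recipe."""
    rng = random.Random(seed)
    theta = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_sum = 0.0
        for x, y in data:
            g = x * (theta * x - y)          # per-example gradient of 0.5*(theta*x - y)^2
            g = g / max(1.0, abs(g) / clip)  # clip to norm `clip`
            grad_sum += g
        noisy_grad = (grad_sum + rng.gauss(0.0, noise_multiplier * clip)) / n
        theta -= lr * noisy_grad
    return theta

random.seed(0)
data = [(x, 2.0 * x) for x in [random.uniform(-1, 1) for _ in range(200)]]
theta = dp_sgd(data)
print(abs(theta - 2.0) < 0.5)  # noisy, but close to the true slope of 2
```

In practice we rely on Opacus for this clipping-and-noising loop and for accounting the resulting ε; the sketch only illustrates the mechanism.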

D CAUSAL GRAPHS

SCM-1: SCM for the Pain dataset. $X_1$ denotes the causes of a medical condition, and $X_2$ denotes the various conditions.

SCM-2: SCM used for EEDI. $X_2$ denotes the answers to questions, and $X_1$ is the student metadata, such as the year group and school.

Lung Cancer: The causal graph related to the Lung Cancer dataset can be found at https://www.bnlearn.com/bnrepository/discrete-small.html. The edge between tub and either was removed to simulate missing edges, and the edge between asia and smoke was added to simulate new edges.
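The edge edits described above can be expressed over a parent-set representation of the CG. The sketch below includes only a subset of the Asia network's nodes and edges for brevity:

```python
# Parent sets for a subset of the Lung Cancer (Asia) network.
cg = {
    "asia": set(),
    "smoke": set(),
    "tub": {"asia"},
    "lung": {"smoke"},
    "either": {"tub", "lung"},
}

def remove_edge(cg, parent, child):
    """Simulate a missing causal relationship by dropping parent -> child."""
    cg[child].discard(parent)

def add_edge(cg, parent, child):
    """Simulate a spurious relationship by adding parent -> child."""
    cg.setdefault(child, set()).add(parent)

remove_edge(cg, "tub", "either")  # disable a true causal edge
add_edge(cg, "asia", "smoke")     # introduce a spurious edge

print("tub" not in cg["either"] and "asia" in cg["smoke"])  # True
```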

