STATISTICAL THEORY OF DIFFERENTIALLY PRIVATE MARGINAL-BASED DATA SYNTHESIS ALGORITHMS

Abstract

Marginal-based methods achieve promising performance in the synthetic data competitions hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite their promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the L2 distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every ϵ-DP synthetic data generator.

1. INTRODUCTION

In recent years, the problem of privacy-preserving data analysis has become increasingly important, and differential privacy (Dwork et al., 2006) has emerged as the foundation of data privacy. Differential privacy (DP) techniques are widely adopted by industrial companies and the U.S. Census Bureau (Johnson et al., 2017; Erlingsson et al., 2014; Nguyên et al., 2016; The U.S. Census Bureau, 2020; Abowd, 2018). One important method to protect data privacy is differentially private data synthesis (DPDS). In the setting of DPDS, a synthetic dataset is generated from a real dataset by a DP data synthesis algorithm. One can then release the synthetic dataset while the real dataset remains protected. Recently, the National Institute of Standards and Technology (NIST) organized the differential privacy synthetic data competitions (NIST, 2018; 2019; 2020-2021). In the NIST competitions, the state-of-the-art algorithms are marginal-based (McKenna et al., 2021), where the synthetic dataset is drawn from a noisy marginal distribution estimated from the real dataset. To deal with high-dimensional data, the distribution is usually modeled by a probabilistic graphical model (PGM) such as a Bayesian network or a Markov random field (Jordan, 1999; Wainwright et al., 2008; Zhang et al., 2017; Mckenna et al., 2019; Cai et al., 2021). Despite their empirical success in releasing high-dimensional data, as far as we know, the theoretical guarantees of marginal-based DPDS approaches are rarely studied in the literature. In this paper, we focus on a DPDS algorithm based on Bayesian networks (BN) known as PrivBayes (Zhang et al., 2017), which is widely used for synthesizing sparse data (with sparsity measured by the degree of a BN, defined later). A BN is a directed acyclic graph in which each vertex corresponds to an attribute and each edge encodes the conditional dependence of the child attribute on a parent.
It approximates the high-dimensional distribution of the raw data with a set of well-chosen low-dimensional distributions. Random noise is added to each low-dimensional marginal to achieve differential privacy. We aim to analyze the marginal-based approach from a statistical perspective and measure the accuracy of PrivBayes under different statistical distances, including the total variation (TV) distance and the L2 distance. Another metric of synthetic data we are interested in is the utility in downstream machine learning tasks. Empirical evaluation of synthetic data in downstream machine learning tasks is widely studied in the literature. Existing utility metrics include Train on Synthetic data and Test on Real data (TSTR, (Esteban et al., 2017)) and Synthetic Ranking Agreement (SRA, (Jordon et al., 2018)). To the best of our knowledge, most of these utility evaluation methods are empirical and lack a theoretical guarantee. Establishing a statistical learning theory for synthetic data is another concern of this paper. Precisely, we focus on the statistical theory of PrivBayes based on the TSTR error. Our contributions. Our contributions are three-fold. First, we theoretically analyze marginal-based synthetic data generation and derive upper bounds on the TV distance and the L2 distance between real data and synthetic data. The upper bounds show that the Bayesian network structure mitigates the "curse of dimensionality". An upper bound for the sparsity of the real data is also derived from the accuracy bounds. Second, we theoretically evaluate the utility of the synthetic data in downstream supervised learning tasks. Precisely, we bound the TSTR error between the predictors trained on real data and on synthetic data. Third, we establish a lower bound for the TV distance between the synthetic data distribution and the real data distribution.

1.1. RELATED WORKS AND COMPARISONS

Broadly speaking, our work is related to a vast body of work on differential privacy (Dinur & Nissim, 2003; Dwork & Nissim, 2004; Blum et al., 2005; Dwork et al., 2007; Nissim et al., 2007; Barak et al., 2007; McSherry & Talwar, 2007; Machanavajjhala et al., 2008; Dwork et al., 2015). For example, McSherry & Talwar (2007) proposed the exponential mechanism, which is widely used in practice. Machanavajjhala et al. (2008) discussed privacy for histogram data by sampling from the perturbed cell probabilities. However, these methods are not efficient for releasing high-dimensional tabular data, since the domain size grows exponentially in the dimension (known as "the curse of dimensionality"). The state-of-the-art method for this problem is the marginal-based approach (Zhang et al., 2017; Qardaji et al., 2014; Zhang et al., 2021). Zhang et al. (2017) approximated the raw dataset by a sparse Bayesian network and then added noise to each vertex in the graph. Zhang et al. (2021) selected a collection of 2-way marginals and applied a gradual updating method to release synthetic data. Although most of these works provide rigorous privacy guarantees, theoretical analysis of accuracy is rare. Wasserman & Zhou (2010) established a statistical framework for DP and derived the accuracy of distributions estimated from noisy histograms. Our setting is different from theirs. Precisely, we analyze how noise addition and post-processing affect the conditional distributions (Lemma 6.2). Moreover, our proof handles the non-trivial interaction between the Bayesian network and the noise addition. Our lower bound (Theorem 5.1) is related to existing worst-case lower bounds under the DP constraint in the literature (Hardt & Talwar, 2010; Ullman, 2013; Bassily et al., 2014; Steinke & Ullman, 2017). Hardt & Talwar (2010) established lower bounds for the accuracy of answering linear queries with privacy budget ϵ.
Ullman (2013) derived the worst-case result that, in general, it is NP-hard to release private synthetic data that accurately preserves all two-dimensional marginals. Bassily et al. (2014) built on this result and further developed lower bounds on the excess risk of every (ϵ, δ)-DP algorithm. Our result is novel in that we consider private synthetic data and the corresponding TV accuracy. Existing results for linear queries are not directly applicable to TV accuracy since they rely heavily on the linear structure.

2. DIFFERENTIAL PRIVACY

Differential privacy requires that any particular element in the raw dataset has a limited influence on the output (Dwork et al., 2006). The definition is formalized as follows. Here the data domain is denoted as Ω.

Definition 2.1 ((ϵ, δ)-differential privacy). Let A : Ω^n → R be a randomized algorithm that takes a dataset of size n as input, where the output space R is a probability space. For every ϵ, δ ≥ 0, A satisfies (ϵ, δ)-differential privacy if for every two adjacent datasets D_1 and D_2, we have

P[A(D_1) ∈ S] ≤ exp(ϵ) P[A(D_2) ∈ S] + δ, for all measurable S ⊆ R.

Here D_1 and D_2 are datasets of size n. We say that they are adjacent, denoted D_1 ≃ D_2, if they differ in only a single element. For δ = 0, we abbreviate the definition as ϵ-differential privacy (ϵ-DP). A widely used meta-mechanism to ensure ϵ-DP is the Laplace mechanism. The Laplace mechanism privatizes a function f on the dataset D by adding i.i.d. Laplace noise (denoted as η ∼ Lap(λ)) to each output value of f(D). Here the probability density function of η is given by

P[η = x] = (1/(2λ)) exp(−|x|/λ).

Dwork & Nissim (2004) show that the Laplace mechanism ensures ϵ-DP when λ ≥ ∆f/ϵ, where ∆f is the L1 sensitivity of f:

∆f = max_{(D_1, D_2): D_1 ≃ D_2} ∥f(D_1) − f(D_2)∥_1.
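As a concrete illustration, the Laplace mechanism can be sketched in a few lines of Python. This is our own minimal sketch, not code from the paper; the counting-query example and the function name are illustrative.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Release `value` (a scalar or array) with eps-DP by adding
    i.i.d. Lap(sensitivity / epsilon) noise to each coordinate."""
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale, size=np.shape(value))

# A counting query ("how many records satisfy a predicate") has L1
# sensitivity 1: changing one record moves the count by at most 1.
rng = np.random.default_rng(0)
true_count = 412
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

With ϵ = 0.5 the noise has scale 2, so the released count is typically within a few units of the true count while satisfying ϵ-DP.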

3. MARGINAL-BASED DATA SYNTHESIS ALGORITHMS

In this section, we introduce DP marginal-based methods. For simplicity, we consider Boolean data, where Ω = {0, 1}^d and |Ω| = 2^d. Our theory readily generalizes to any categorical dataset with a finite domain size.

3.1. DIFFERENTIALLY PRIVATE ESTIMATE OF LOW-DIMENSIONAL MARGINAL DISTRIBUTIONS

Given a dataset D = {x^{(i)}}_{i=1}^n ⊂ Ω^n drawn independently from a distribution, the probability mass function is estimated by

p̂_D(x) = (1/n) Σ_{i=1}^n 1[x^{(i)} = x], for all x ∈ Ω.   (1)

Noise addition and post-processing. We then sanitize p̂_D(x) by the Laplace mechanism. Note that the sensitivity of p̂_D(x) is 1/n. Then, we define p̃_D = p̂_D + Lap(1/(nϵ)), and p̃_D is ϵ-DP. Adding noise leads to inconsistency. To be specific, some estimated probabilities may be negative, and the overall summation may not be 1. The following two post-processing methods, both widely adopted in marginal-based methods (cf. (Mckenna et al., 2019; Zhang et al., 2017)), address the inconsistency.

Normalization. We convert all negative probabilities to zeros, and then rescale all probabilities by a common scalar so that their summation is 1.

L2-projection. We project the inconsistent distribution onto the probability simplex with respect to the L2 metric. Specifically, for an inconsistent distribution (a_1, · · · , a_m), the output is

(b_1, · · · , b_m) := argmin_{b_i ≥ 0, Σ b_i = 1} Σ_{i=1}^m (a_i − b_i)^2.
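Both post-processing steps are easy to implement. The sketch below is our own minimal implementation, not the paper's code; the L2-projection uses the standard sort-and-threshold algorithm for Euclidean projection onto the probability simplex.

```python
import numpy as np

def normalize(p):
    """Normalization post-processing: clip negatives to zero, rescale to sum 1."""
    q = np.clip(p, 0.0, None)
    s = q.sum()
    # If all mass was clipped away, fall back to the uniform distribution.
    return q / s if s > 0 else np.full_like(q, 1.0 / len(q))

def l2_projection(p):
    """Euclidean (L2) projection onto the probability simplex via the
    standard sort-and-threshold algorithm."""
    u = np.sort(p)[::-1]                  # sorted in decreasing order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(p) + 1)
    rho = ks[u - css / ks > 0][-1]        # largest index with positive gap
    tau = css[rho - 1] / rho
    return np.clip(p - tau, 0.0, None)

noisy = np.array([0.55, 0.30, -0.05, 0.12])  # inconsistent after Laplace noise
p_norm = normalize(noisy)                    # valid distribution
p_proj = l2_projection(noisy)                # valid distribution
```

Both outputs are non-negative and sum to one; the two methods generally produce different (but close) distributions.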

3.2. MARGINAL SELECTION AND BAYESIAN NETWORKS

It is well known that marginal-based methods suffer from the curse of dimensionality. One way to mitigate it is to adopt Bayesian networks (Zhang et al., 2017).

Marginal selection. We first disassemble the raw dataset into a group of lower-dimensional marginal datasets. Precisely, PrivBayes (Zhang et al., 2017) uses a sparse Bayesian network over the attributes {x_1, · · · , x_d} to approximate the raw data. Each node x_i corresponds to an attribute, and each edge from x_j to x_i indicates that x_i depends on x_j. We denote Π_i := {j | x_j → x_i}, the collection of all attributes that x_i depends on, so that the distribution of x_i is specified by the conditional P[x_i | Π_i]. Zhang et al. (2017) also make the following assumptions on the network structure. Here k is a pre-fixed parameter that is much smaller than d.

Assumption 3.1 (Sparsity). The degree of the Bayesian network is no more than k. Precisely, for any i, the size of Π_i is no more than k.

The second assumption ensures that the graph contains no directed cycles, which enables sampling from the graph.

Assumption 3.2. For any i, we have Π_i ⊂ {x_1, · · · , x_{i−1}}.

For example, the Bayesian network in Figure 1 is over 5 attributes and has degree 2.

DP Bayesian networks. In a Bayesian network, each low-dimensional marginal distribution P(x_i, Π_i) is estimated by the marginal function defined in (1). For privacy, we add Laplace noise Lap(d/(nϵ)) to the marginal P(x_i, Π_i) and obtain the DP distribution P̃[x_i, Π_i] by applying post-processing to the noisy marginal. The overall privacy budget, computed by the composition property of DP (Dwork, 2008; Zhang et al., 2017), is then ϵ.

Generating synthetic data. The loop-free Bayesian network provides an efficient sampling approach. Precisely, we draw x_i from P̃[x_i | Π_i] in increasing order of i. Recall that Assumption 3.2 ensures that x_j ∉ Π_i for any j > i. Therefore, by the time x_i is to be sampled, all nodes in Π_i have already been sampled. This verifies that the sampling approach is practical.
Moreover, sampling x_i only requires the marginal P̃[x_i, Π_i] rather than the full distribution. By Assumption 3.1, this marginal involves at most k + 1 attributes, which keeps the computational load small since k is small. With this sampling method, one can show (Zhang et al., 2017) that the synthetic private distribution is

P̃[x_1, · · · , x_d] = Π_{i=1}^d P̃[x_i | Π_i].
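The ancestral sampling procedure described above can be sketched as follows. The toy network, its conditional tables, and all probability values are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

# A toy 4-attribute Boolean Bayesian network satisfying Assumptions 3.1/3.2
# with degree k = 2.  `parents[i]` lists Pi_i (indices < i), and `cond[i]`
# maps each parent assignment to the (privatized) probability P[x_i = 1 | Pi_i].
parents = {0: (), 1: (0,), 2: (0, 1), 3: (1, 2)}
cond = {
    0: {(): 0.6},
    1: {(0,): 0.2, (1,): 0.7},
    2: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    3: {(0, 0): 0.3, (0, 1): 0.6, (1, 0): 0.2, (1, 1): 0.8},
}

def sample_record(rng):
    """Ancestral sampling: draw x_i given Pi_i in increasing order of i.
    Assumption 3.2 guarantees every parent is sampled before its child."""
    x = []
    for i in sorted(parents):
        pa = tuple(x[j] for j in parents[i])
        x.append(int(rng.random() < cond[i][pa]))
    return x

rng = np.random.default_rng(0)
synthetic = [sample_record(rng) for _ in range(1000)]
# Marginal frequency of x_0 should be near P[x_0 = 1] = 0.6.
freq_x0 = sum(r[0] for r in synthetic) / len(synthetic)
```

Note that each draw only consults a table of size at most 2^{k+1}, never the full 2^d-dimensional distribution.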

4. ACCURACY OF PRIVBAYES

In this section, we develop theoretical results on the accuracy of PrivBayes. We briefly discuss the proofs of these results in Section 6.

4.1. STATISTICAL DISTANCES

The goal of this subsection is to establish the accuracy guarantee for PrivBayes with different post-processing methods: normalization and L2-projection. By "accuracy", we mean the TV or the L2 distance between the synthetic distribution and the raw data distribution, respectively. Note that the error comes from two sources: 1) approximating the raw data by a Bayesian network that satisfies Assumption 3.1 and Assumption 3.2; 2) adding noise and post-processing. The first error, however, depends only on the sparsity of the raw data. Since we aim to establish our results for general raw data, we focus on the second one. It is therefore natural to make the following assumption.

Assumption 4.1. We assume the raw data distribution can be represented by a Bayesian network with d vertices that satisfies Assumption 3.1 and Assumption 3.2.

With this assumption, PrivBayes (normalization) has the following accuracy guarantee.

Theorem 4.1. Assume that the raw dataset D is Boolean and satisfies Assumption 4.1. Then we have

∥P̃ − P∥_TV ≤ (12 d^2 2^{2k} (k + 1) / (nϵ)) log(2d/δ),

with probability at least 1 − δ (with respect to the randomness of the Laplace mechanism, and the same below). Here P is the empirical distribution of D and P̃ is the output of PrivBayes (normalization) with privacy budget ϵ.

Proof. See Section 6 for a proof sketch.

The other post-processing method we study (L2-projection) is also efficient. The following result verifies that PrivBayes (L2-projection) enjoys a similar accuracy guarantee in terms of the L2 distance. Here, with a slight abuse of notation, we denote the L2 distance between two distributions as the L2 distance between their density functions.

Theorem 4.2. Assume the raw dataset D is Boolean and Assumption 4.1 is satisfied. Then we have

∥P̃ − P∥_{L2} ≤ (12 d^2 2^k (k + 1) / (nϵ)) log(2d/δ),

with probability at least 1 − δ.
Here P is the empirical distribution of D and P̃ is the output of PrivBayes (L2-projection) with privacy budget ϵ.

Proof. The proof is similar to that of Theorem 4.1. It is nevertheless non-trivial, and we discuss it in detail in Appendix B.

Discussion. Theorem 4.1 and Theorem 4.2 give bounds that are consistent (tend to 0 as n tends to infinity) whenever k ≪ log_2(nϵ/d^2). Moreover, a smaller k leads to smaller upper bounds. Since we assume that the real data and the synthetic data share the same k, we conclude that PrivBayes achieves better performance on sparser real datasets. The size of k is often rather small in real applications; for example, Zhang et al. (2017) chose k ≤ 4 in their simulations. Theorem 4.1 also characterizes the dependence on the privacy budget ϵ: a tighter privacy budget means a better privacy guarantee but leads to worse accuracy. Moreover, our rate is polynomial in the dimension d. Compared with directly applying the Laplace mechanism to the whole domain (see Theorem 4.3), our result shows that by exploiting the network structure, the Bayesian network improves the rate exponentially.

Theorem 4.3. Assume that the raw dataset D is Boolean. Then we have

∥P̃ − P∥_TV ≤ (d 2^{2d} / (nϵ)) log(2/δ),

with probability at least 1 − δ. Here P̃ is the synthetic distribution generated by directly applying the Laplace mechanism to the entire domain.

Proof. By the definition of the Laplace mechanism, we add i.i.d. Lap(1/(nϵ)) noise to all 2^d entries of P(x_1, · · · , x_d) and then normalize them. Theorem 4.3 then follows in the same way as Lemma 6.1 with m = 2^d.
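The gap between noising the full 2^d-cell histogram and noising only low-dimensional marginals is easy to observe numerically. The toy simulation below is our own illustration, not the paper's experiment: it uses independent attributes (a degree-0 network, so the product of noisy one-way marginals is the marginal-based estimate) and arbitrary illustrative parameters, and compares TV errors against the true product distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, eps = 10, 5000, 1.0

# Raw data: d independent Boolean attributes (a degree-0 Bayesian network).
probs = rng.uniform(0.2, 0.8, size=d)
data = (rng.random((n, d)) < probs).astype(int)

# (a) Full-domain Laplace: noise every one of the 2^d histogram cells.
idx = data @ (1 << np.arange(d))                  # encode records as integers
hist = np.bincount(idx, minlength=2**d) / n
noisy = np.clip(hist + rng.laplace(0, 1 / (n * eps), size=2**d), 0, None)
full = noisy / noisy.sum()

# (b) Marginal-based: noise only the d one-way marginals (budget eps/d each),
# then take the product distribution they induce.
m = np.clip(data.mean(axis=0) + rng.laplace(0, d / (n * eps), size=d), 0, 1)
bits = (np.arange(2**d)[:, None] >> np.arange(d)) & 1   # decode cell -> bit pattern
marg = np.prod(np.where(bits == 1, m, 1 - m), axis=1)

truth = np.prod(np.where(bits == 1, probs, 1 - probs), axis=1)
tv_full = 0.5 * np.abs(full - truth).sum()
tv_marg = 0.5 * np.abs(marg - truth).sum()
```

In this run the marginal-based estimate has a much smaller TV error than the full-domain mechanism, mirroring the d-polynomial versus 2^d-exponential rates in Theorem 4.1 and Theorem 4.3.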

4.2. UTILITY ERRORS

In real-world practice, the raw training data in supervised learning may contain sensitive information, such as personal preferences of users, and therefore cannot be released to the public. An alternative is to release differentially private synthetic data instead of the raw data. A central problem is the "utility" of the synthetic training data, i.e., the performance of the synthetic data in downstream tasks. We explain the term utility in detail below.

Consider a dataset D = {x^{(i)}}_{i=1}^n drawn from the domain Ω with sample size n. We denote its corresponding synthetic dataset as D̃ = {x̃^{(i)}}_{i=1}^{ñ} of size ñ. We use the empirical risk minimization (ERM) model to capture supervised learning. Precisely, the ERM estimators for the raw dataset D and the synthetic dataset D̃ are defined as

θ̂ = argmin_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ, x^{(i)}) + λJ(θ),   θ̂_syn = argmin_{θ∈C} (1/ñ) Σ_{i=1}^{ñ} ℓ(θ, x̃^{(i)}) + λJ(θ),

respectively. Here C is a closed convex set. We assume the loss function ℓ(·, x) is convex and L-Lipschitz on C for some L ≥ 0. The regularization term J(·) is adopted to prevent over-fitting. The model captures a wide range of applications. For example, given a data point x^{(i)} = (u_i, v_i) ∈ {0, 1}^{d+1}, defining the hinge loss ℓ(θ, x^{(i)}) = (1 − v_i ⟨θ, u_i⟩)_+ recovers the popular support vector machine (SVM) classifier. This loss is √(d + 1)-Lipschitz in θ since ∥x^{(i)}∥_2 ≤ √(d + 1).

The utility of the synthetic dataset D̃ measures whether θ̂_syn and θ̂ perform similarly on the prediction task (Esteban et al., 2017). To be specific, the following metric is used to evaluate the utility:

U(D̃, D) := (1/n) |R(θ̂) − R(θ̂_syn)|,   (3)

where R(θ) = Σ_{i=1}^n ℓ(θ, x^{(i)}) is the empirical risk on D ((Rankin et al., 2020; Hittmeir et al., 2019)). Intuitively, the asymptotic behavior of U(D̃, D) is affected by the difference between the distributions of the synthetic data and the true data. This fact is characterized in Theorem 4.4.
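The utility metric can be computed concretely. The snippet below is our own minimal illustration, not the paper's code: it fits the regularized hinge-loss ERM by plain subgradient descent (a stand-in for any ERM solver) on a real sample and a stand-in "synthetic" sample drawn from the same toy distribution, then evaluates the risk gap on the real data, exactly as in the TSTR-style metric U(D̃, D).

```python
import numpy as np

def erm_hinge(X, y, lam=0.01, steps=2000, lr=0.1):
    """Regularized hinge-loss ERM: min_theta (1/n) sum (1 - y<theta,x>)_+ + lam||theta||^2,
    fitted by subgradient descent (illustrative solver, not from the paper)."""
    n, d = X.shape
    theta = np.zeros(d)
    for t in range(steps):
        margins = 1 - y * (X @ theta)
        grad = -(X * y[:, None])[margins > 0].sum(axis=0) / n + 2 * lam * theta
        theta -= lr / np.sqrt(t + 1) * grad
    return theta

def utility(theta_real, theta_syn, X, y):
    """U(D_syn, D): gap between the mean empirical risks of the two predictors on D."""
    risk = lambda th: np.maximum(0.0, 1 - y * (X @ th)).mean()
    return abs(risk(theta_real) - risk(theta_syn))

rng = np.random.default_rng(0)
n, d = 500, 5
X = np.hstack([(rng.random((n, d)) < 0.5).astype(float), np.ones((n, 1))])  # bias column
y = np.where(X[:, 0] + X[:, 1] >= 1, 1.0, -1.0)
X_syn = np.hstack([(rng.random((n, d)) < 0.5).astype(float), np.ones((n, 1))])
y_syn = np.where(X_syn[:, 0] + X_syn[:, 1] >= 1, 1.0, -1.0)

theta_real = erm_hinge(X, y)
theta_syn = erm_hinge(X_syn, y_syn)
u = utility(theta_real, theta_syn, X, y)   # small: same underlying distribution
```

Since the "synthetic" sample here follows the same distribution as the real one, the utility gap u is small; Theorem 4.4 below quantifies how u grows with the TV distance between the two distributions.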
We first make the following assumption on the boundedness of the loss function ℓ(·, ·), which is quite natural given its continuity (Bassily et al., 2014).

Assumption 4.2. For any θ ∈ C and any data point x in Ω, we have |ℓ(θ, x)| ≤ 1.

Generating the synthetic dataset from PrivBayes. We still denote the raw dataset as D and write P for its empirical distribution. The corresponding output of PrivBayes is a distribution denoted as Q. To generate the synthetic training data D̃, we draw ñ i.i.d. samples from Q; the corresponding empirical distribution is denoted as Q̃. With the above preparation, we are now ready to state our result characterizing the utility of PrivBayes.

Theorem 4.4. If Assumption 4.1 and Assumption 4.2 hold and the raw dataset D is Boolean, then we have

U(D̃, D) ≤ C(λ) + C_1 ∥Q − P∥_TV + 2R_C + √(log(1/δ)/(2ñ))
         ≤ C(λ) + C_1 (2^{2k} d^2 (k + 1) / (nϵ)) ln(2d/δ) + 2R_C + √(log(1/δ)/(2ñ)),   (4)

with probability at least 1 − δ. Here R_C is the Rademacher complexity of the function class {x ↦ ℓ(θ, x) | θ ∈ C} and C_1 is a positive universal constant. The term C(λ) is non-negative and vanishes when λ = 0, namely C(0) = 0.

Proof. See Section 6 for a proof sketch.

Discussion. The term C(λ) comes from the regularization process; C(λ) = 0 if no regularization is applied (λ = 0). In practice, λ is often much smaller than d^2/(nϵ), so C(λ) is also relatively small. The Rademacher complexity in equation (4) comes from the sampling process. Our result then implies that, when the sample sizes n and ñ are sufficiently large and the regularization parameter λ is sufficiently small, the quality loss caused by the private mechanism is rather small. In other words, private synthetic data generated by PrivBayes performs similarly to raw data in downstream learning tasks.

5. LOWER BOUND

In this section, we complete the picture by deriving a lower bound for the TV distance between the synthetic private distribution and the raw data distribution.

Notations and conventions. As before, the raw dataset D is of size n with empirical distribution P. The data domain is denoted as Ω. A synthetic data generator is a randomized algorithm that sends a dataset of size n to a distribution over Ω. We also need the following assumption on the range of the parameters.

Assumption 5.1. We assume that d/ϵ ≪ n ≪ |Ω|.

The first part of this assumption allows a rather wide choice of ϵ in practice. For instance, in the two real datasets ACS (Ruggles et al., 2015) and Adult (Bache & Lichman, 2013), the size is n ≈ 40,000 and the dimension is d ≈ 40, so Assumption 5.1 only requires ϵ ≥ 1/1,000. Moreover, the size of Ω is at least 2^d, which is clearly much larger than n. Therefore, the second part of Assumption 5.1 holds for real-world datasets. We now state Theorem 5.1, which establishes the lower bound for the TV distance.

Theorem 5.1. If Assumption 5.1 holds, then for any synthetic data generator A(·) with privacy budget ϵ and any 0 ≤ δ ≤ 1/2, there exists a dataset D of size n such that

∥A(D) − P∥_TV ≥ (1/(nϵ)) log(δ|Ω|),

with probability at least 1 − 2δ. Here P is the empirical distribution of D.

Proof. See Appendix D for a detailed proof.

Choosing δ = 1/4 in Theorem 5.1 yields the following corollary.

Corollary 5.2. If |Ω| ≥ 4 exp(d), then for any synthetic data generator A(·) with privacy budget ϵ, there exists a dataset D of size n such that

∥A(D) − P∥_TV ≥ d/(nϵ),

with probability at least 1/2. Here P is the empirical distribution of D.

Discussion and comparison. Compared with the upper bound in Theorem 4.1, PrivBayes is sub-optimal up to a d factor. The sub-optimality is caused by the composition property of DP (the dataset is processed d times in a Bayesian network) and the structure of a Bayesian network.

6. PROOF SKETCH FOR THE TECHNICAL RESULTS

6.1. PROOF SKETCH FOR THEOREM 4.1

We begin with a technical lemma that characterizes the normalization process. See Appendix A for its detailed proof.

Lemma 6.1. For a distribution (a_1, · · · , a_m), denote its outcome after adding i.i.d. Lap(d/(nϵ)) noise and normalizing as (b_1, · · · , b_m). Then, for all large n and all δ > 0, it holds that

max_i |a_i − b_i| ≤ (3md/(nϵ)) log(m/δ),

with probability at least 1 − δ.

Lemma 6.1 characterizes the difference between P(x_i, Π_i) and P̃(x_i, Π_i). However, we need further analysis to establish the conditional version of Lemma 6.1. To be specific, we need to bound |P(x_i | Π_i) − P̃(x_i | Π_i)|. The following result serves this goal.

Lemma 6.2. Consider two non-negative real vectors (a_1, · · · , a_s) and (b_1, · · · , b_s) (not necessarily distributions). If, for some β ≥ 0, we have

max_j |a_j − b_j| ≤ β,   (5)

then, for any l ∈ {1, · · · , s}, the following result holds.

Proof. See Appendix A for a detailed proof.

Combining Lemma 6.1 and Lemma 6.2, the distance between the conditional distributions is bounded in the following result.

Lemma 6.3. If D is Boolean and satisfies Assumption 4.1, then we have

|P(x_i | Π_i) − P̃(x_i | Π_i)| ≤ (6d 2^k (k + 1)/(nϵ)) log(2/δ) · (1/P(Π_i)),   (7)

with probability at least 1 − δ, simultaneously for all i and all choices of (x_i, Π_i).

Proof. Setting

m = 2^{k+1},  s = 2,  a_1 = P(1, Π_i),  a_2 = P(0, Π_i),  b_1 = P̃(1, Π_i),  b_2 = P̃(0, Π_i),  β = (3md/(nϵ)) log(m/δ)

in Lemma 6.1 and Lemma 6.2 concludes the proof.

To bound the TV distance, we rewrite it as a telescoping series and apply Lemma 6.3. One technical impediment for the estimation is the fraction term 1/P(Π_i) in (7). To address this challenge, we need to deploy the Bayesian network structure (Assumption 3.1 and Assumption 3.2). Since deploying the network structure is quite technical and lengthy, we defer the details to Appendix A.

6.2. PROOF SKETCH FOR THEOREM 4.4

We begin with some notations. The non-regularized estimators trained on D and D̃ by the ERM model are denoted as θ* and θ*_syn. We further define the prediction risk with respect to a distribution. For any distribution P on Ω and any θ ∈ C, we define

R(θ, P) := Σ_{x∈Ω} ℓ(θ, x) P(x)   (8)

as the prediction risk with respect to P. Then the normalized empirical risk R(·)/n in (3) is equal to R(·, P), with P the empirical distribution of D.

We are now ready to sketch the proof. The most important step is to decompose the utility in (3) into the following seven terms:

U(D̃, D) ≤ |R(θ̂_syn, P) − R(θ̂_syn, Q)|  [term (i)]
         + |R(θ̂_syn, Q) − R(θ̂_syn, Q̃)|  [term (ii)]
         + |R(θ̂_syn, Q̃) − R(θ*_syn, Q̃)|  [term (iii)]
         + |R(θ*_syn, Q̃) − R(θ*, Q̃)|  [term (iv)]
         + |R(θ*, Q̃) − R(θ*, Q)|  [term (v)]
         + |R(θ*, Q) − R(θ*, P)|  [term (vi)]
         + |R(θ*, P) − R(θ̂, P)|  [term (vii)].

Recall that Q is the output of PrivBayes and Q̃ is the empirical distribution of the ñ samples drawn independently from Q. Terms (i) and (vi) come from the difference between the synthetic distribution Q and the raw distribution P; they are bounded by the distance between the two distributions. Terms (ii) and (v) come from sampling and are bounded by the classical Rademacher method. Terms (iii) and (vii) derive from the regularization process; they combine into the C(λ) term. Bounding term (iv), however, is trickier and requires a more detailed analysis. We discuss each group in detail in Appendix C.

7. DISCUSSIONS AND FUTURE TOPICS

We establish perhaps the first statistical analysis of the accuracy and utility of Bayesian network-based data synthesis algorithms. We also derive a lower bound for the accuracy to complete the picture. Compared with the lower bound, the accuracy bound we achieve is sub-optimal up to a d factor. One way to improve the accuracy is to reduce the effect of the random noise in releasing the synthetic data through some post-processing procedure. However, it is still quite challenging to develop a practical algorithm based on this idea, and we leave it for future work.

Figure 1: A Bayesian network over 5 attributes of degree 2

ACKNOWLEDGMENTS

We appreciate Prof. Ninghui Li and Dr. Zitao Li for discussions about the background and applications of marginal-based data synthesis methods, which motivated us to study the corresponding theory. This research is supported by the Office of Naval Research [ONR N00014-22-1-2680] and the National Science Foundation [NSF-SCALE MoDL (2134209)].

