ANALYSIS OF DIFFERENTIALLY PRIVATE SYNTHETIC DATA: A MEASUREMENT ERROR APPROACH

Abstract

Differentially private (DP) synthetic datasets have been receiving significant attention from academia, industry, and government. However, little is known about how to perform statistical inference using DP synthetic datasets. Naive approaches that do not take into account the uncertainty induced by the DP mechanism result in biased estimators and invalid inferences. In this paper, we present a class of MLE-based, easy-to-implement, bias-corrected DP estimators with valid asymptotic confidence intervals (CIs) for parameters in regression settings, by establishing the connection between additive DP mechanisms and measurement error models. Our simulation shows that our estimator has comparable performance to the widely used sufficient statistic perturbation (SSP) algorithm in some scenarios, with the added advantages of releasing a synthetic dataset and of yielding statistically valid asymptotic CIs, which achieve better coverage than the naive CIs obtained by ignoring the DP mechanism.

1. INTRODUCTION

Differential privacy (DP) is a mathematically rigorous definition that quantifies privacy risk. It builds on the idea of releasing "privacy-protected" query results, such as summary statistics, using randomized responses. In recent years, differential privacy has quickly gained popularity because it promises no additional privacy harm to any individual, whether or not that individual's data belongs to a private dataset, and therefore encourages data sharing. One important characteristic of DP is its composition property (Dwork & Roth, 2014). To prevent an analyst from rerunning the same analysis and averaging away the noise in the randomized responses, the composition property states that running the same analysis twice incurs twice the privacy risk of running it once. The data provider often sets a total amount of allowed privacy risk, commonly referred to as the privacy budget, and each analysis from researchers uses a portion of that budget. Once the total privacy budget is exhausted, no new analysis is possible unless the data provider decides to increase the total privacy budget and thus take on more privacy risk. This can be problematic because it limits the number of analyses that researchers can run, which can result in the dataset not being fully explored. As a consequence, it diminishes the probability of serendipitous discovery and amplifies the odds of being tricked by unanticipated data problems (Evans et al., Working Paper).
To address this problem, various methods of releasing differentially private synthetic datasets (Liu, 2016; Bowen & Liu, 2020; Gambs et al., 2021) have been proposed. By the post-processing property of DP, any analysis on the DP synthetic dataset is differentially private at no additional cost to the privacy budget. Releasing a DP synthetic dataset therefore circumvents the problem of running out of privacy budget.
Here we mention a few notable methods of generating DP synthetic datasets. In general, these methods can be categorized as non-parametric or parametric. In non-parametric methods, the DP dataset is constructed from the empirical distribution of the data. The simplest approach is to add Laplace or Gaussian noise directly to the confidential dataset. In parametric methods, the DP dataset is constructed from a parametric distribution or model of the data. Using the robust and flexible vine copula model, Gambs et al. (2021) draw the DP synthetic dataset from a DP-trained vine copula model. From the Bayesian perspective, Liu (2016) proposes generating a DP synthetic dataset by drawing samples from the DP version of the posterior predictive distribution. For a more comprehensive overview of DP dataset generation methods, refer to Bowen & Liu (2020).
Dwork & Roth (2014) characterize differential privacy as a definition of privacy tailored to the problem of privacy-preserving data analysis. For a statistician, however, the goal of statistical inference is often as important as data analysis. Under the framework of differential privacy, methods for making statistical inferences are under-explored. Fortunately, interest in statistical inference under differential privacy has been rising recently, including works such as Sheffet (2017) and Barrientos et al. (2019). To make statistical inferences, statistical models need to be specified, and different differential privacy algorithms give rise to different statistical models. It turns out that additive mechanisms, such as the Laplace mechanism or the Gaussian mechanism, give statistical models that are naturally related to measurement error models.
In other words, each additive mechanism can be viewed as a variation of the measurement error model, and therefore the tools from measurement error models can be used to make inferences in the differential privacy setting. In this paper, we generate a DP synthetic dataset by adding DP noise directly to the confidential dataset through the Gaussian mechanism. We choose this method for its simplicity and, more importantly, because it allows us to establish the connection to the theory of measurement error, as we will see in section 3.1. Using the established tools of measurement error theory, we then derive an MLE-based, bias-corrected DP estimator and an asymptotic confidence interval for our parameter of interest. To demonstrate the usefulness of this connection, we study statistical inference in the linear regression setting while preserving differential privacy. In particular, we derive a DP consistent estimator and an asymptotic confidence interval for the regression coefficient.

Related work

As one of the most common statistical models, linear regression has been studied before in the differential privacy literature. One of the widely used methods for obtaining a DP estimator for the regression coefficient is the perturbation of sufficient statistics (Dwork et al., 2014; Sheffet, 2017; Wang, 2018). It is commonly used due to its simplicity and is closely related to the classical ordinary least squares method. Motivated by Dwork & Lei (2009), Alabi et al. (2022) show that algorithms based on a robust estimator, such as a median-based estimator, perform better than the classical ordinary least squares estimator in small samples. Similar to our work, Charest & Nombo (2020) use simulation extrapolation (SIMEX), a technique from the literature on measurement error (Carroll et al., 2006), to obtain a DP estimator for the regression coefficient. Unlike our work, however, Charest & Nombo (2020) do not address the construction of confidence intervals. Agarwal et al. (2021) and Agarwal & Singh (2021) also mention the connection between the measurement error model and differentially private mechanisms. Differing from our work, Agarwal et al. (2021) focus on the setting where only covariates are perturbed with differentially private noise and on the goal of learning a predictive linear model using principal component regression. Similar to our work, Agarwal & Singh (2021) use the connection to make inferences on the regression coefficient under a more general and less structured setting, but the methodology is much more involved than our simpler approach. Lastly, Evans & King (2022) also use the connection to obtain a consistent estimator of the regression coefficient, but without any mention of confidence intervals.
Using the Johnson-Lindenstrauss transform (Blocki et al., 2012), Sheffet (2017) studies the DP ordinary least squares estimator and derives DP asymptotic confidence intervals for the regression coefficients. Unlike the additive DP noise used in our work, a random projection could limit the usefulness of the synthetic dataset for other types of analysis. Instead of obtaining confidence intervals for the regression coefficient, Barrientos et al. (2019) study DP hypothesis testing for the regression coefficient by perturbing the t-statistic. However, since the approach achieves DP by randomizing the t-statistic, each hypothesis test costs a portion of the total privacy budget, and the total privacy budget can be exhausted quickly. Lastly, from the Bayesian perspective, Bernstein & Sheldon (2019) study DP Bayesian linear regression, which requires a prior distribution for the regression coefficient, through releasing private sufficient statistics, and it is thus susceptible to the problem of the privacy budget running out, as described above.

Structure of the paper

In section 2, we state the necessary concepts related to differential privacy and measurement error. In section 3, we establish the connection between differential privacy and measurement error and, working in the regression setting, derive a DP consistent estimator and a DP asymptotic confidence interval for the regression coefficient $\beta$ using tools from the measurement error framework. In section 4, we conduct a simulation to compare the performance of our DP estimator with the widely used sufficient statistics perturbation (SSP) method and show that our estimator is comparable to the SSP estimator in some scenarios while being outperformed in others.
Furthermore, we compare the coverage of our DP confidence interval with that of the naive CIs obtained by ignoring the DP noise, and demonstrate the issue with naive inference: not only are the naive CIs centred at the wrong value, but they are also shorter than those that would be obtained with the true data (Carroll et al., 2006).

2. PRELIMINARIES

2.1. DIFFERENTIAL PRIVACY

We begin by reviewing some basics of differential privacy. The central idea of differential privacy is that it gives the assurance that any sequence of query responses is "essentially" equally likely to occur, independent of the presence or absence of any individual (Dwork & Roth, 2014). Naturally, the most basic concept in differential privacy is the notion of neighboring datasets, where two datasets differ by one individual record.
Definition 2.1 (Neighboring datasets). Two datasets of the same dimension (same numbers of columns and rows) are called neighboring datasets if they differ in exactly one row/individual record.
In this paper, we are only concerned with approximate differential privacy, which is a natural relaxation of the original definition of $\varepsilon$-differential privacy.
Definition 2.2 (Approximate differential privacy (Dwork et al., 2006a; b)). A randomized algorithm $M$ is $(\varepsilon, \delta)$-differentially private if for all (measurable) sets $S$ and for all neighboring datasets $X$ and $X'$,
$$P(M(X) \in S) \le \exp(\varepsilon)\, P(M(X') \in S) + \delta.$$
One of the most important properties of differential privacy is its immunity to post-processing. That is, without any additional information on the confidential dataset, it is impossible to make a function of the output of a differentially private algorithm $M$ any less differentially private. More precisely,
Proposition 2.1 (Post-processing property (Dwork & Roth, 2014)). Let $M$ be a randomized algorithm that is $(\varepsilon, \delta)$-differentially private. Let $f$ be an arbitrary randomized mapping with its domain within the range of $M$. Then $f \circ M$ is $(\varepsilon, \delta)$-differentially private.
Another fundamental notion in differential privacy is the idea of a query. A query is a function to be applied to the dataset (Dwork & Roth, 2014). Naturally, to achieve the same degree of privacy protection, different queries will likely require different amounts of noise perturbation.
To quantify this, we need the concept of sensitivity:
Definition 2.3 ($\ell_2$ sensitivity). The $\ell_2$ sensitivity of a query function $f$ is defined as
$$\Delta_f = \max_{X, X'} \|f(X) - f(X')\|_2,$$
where the maximum is taken over all possible pairs of neighboring datasets $X$ and $X'$.
$(\varepsilon, \delta)$-differential privacy can be achieved through the application of the Gaussian mechanism.
Definition 2.4 (Analytic Gaussian Mechanism (Balle & Wang, 2018)). Let $f : \mathcal{X} \to \mathbb{R}^d$ be a function with global $\ell_2$ sensitivity $\Delta$. For any $\varepsilon \ge 0$ and $\delta \in [0, 1]$, the Gaussian output perturbation mechanism $M(x) = f(x) + Z$ with $Z \sim N(0, \sigma^2 I)$ is $(\varepsilon, \delta)$-DP if and only if
$$\Phi\left(\frac{\Delta}{2\sigma} - \frac{\varepsilon\sigma}{\Delta}\right) - e^{\varepsilon}\, \Phi\left(-\frac{\Delta}{2\sigma} - \frac{\varepsilon\sigma}{\Delta}\right) \le \delta,$$
where $\Phi$ is the cumulative distribution function of a standard normal random variable.
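The condition in Definition 2.4 does not give $\sigma$ in closed form, so in practice it has to be solved numerically. The following is a minimal sketch of one way to do so, by bisection on $\sigma$; it assumes $0 < \delta < 1$ and relies on the left-hand side being decreasing in $\sigma$. The function names (`privacy_curve`, `calibrate_sigma`) and the example values of $\varepsilon$ and $\delta$ are illustrative and do not come from the paper.

```python
import math

def std_normal_cdf(t):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def privacy_curve(sigma, eps, sens):
    """Left-hand side of the analytic Gaussian mechanism condition."""
    a = sens / (2.0 * sigma)
    b = eps * sigma / sens
    return std_normal_cdf(a - b) - math.exp(eps) * std_normal_cdf(-a - b)

def calibrate_sigma(eps, delta, sens, rel_tol=1e-10):
    """Approximately the smallest sigma whose privacy curve is at most delta.

    Assumes 0 < delta < 1 and that the curve decreases as sigma grows.
    """
    lo = 1e-12 * sens   # the curve is essentially 1 here, so the condition fails
    hi = sens
    while privacy_curve(hi, eps, sens) > delta:   # grow until the condition holds
        hi *= 2.0
    for _ in range(200):                          # bisection on [lo, hi]
        if hi - lo <= rel_tol * hi:
            break
        mid = 0.5 * (lo + hi)
        if privacy_curve(mid, eps, sens) > delta:
            lo = mid
        else:
            hi = mid
    return hi   # hi always satisfies the condition

# Illustrative budget; the sensitivity is set to 1 for simplicity.
sigma = calibrate_sigma(eps=1.0, delta=1e-5, sens=1.0)
```

Because the condition depends on $\sigma$ only through the ratio $\sigma/\Delta$, the calibrated noise scale is proportional to the sensitivity.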

2.2. MEASUREMENT ERROR MODEL

In simplest terms, measurement error problems can be described as the problem of making inferences about a statistical model in terms of a variable $Z$ that is not directly observable. Instead, a surrogate variable $W$ of $Z$ is observed, and inference must be made through $W$. The corresponding statistical models and inference methods are called measurement error models (Stefanski, 2000). A measurement error model consists of two parts: the first is the error structure relating the surrogate $W$ to the truth $Z$, and the second is the data structure of the true variable $Z$ (Carroll et al., 2006). As an example, consider the following measurement error model,
$$W = X + U, \tag{1a}$$
$$Y = g(X; \beta) + q, \tag{1b}$$
where $U$ is assumed to have mean zero and constant variance and to be independent of $X$. Similarly, $q$ is assumed to have a Gaussian distribution with mean zero and constant variance and to be independent of $X$ and $U$. Model (1) is referred to as the error-in-variable model, where the covariates are measured with error in a regression setting. Eq. (1a) describes the classical measurement error structure, in which only the true (unobserved) covariate $X$ is measured with additive error. Eq. (1b) describes the regression structure of the data $Z = \{X, Y\}$. It reduces to the familiar linear regression structure for $g(X; \beta) = X^\top \beta$.
Remark. There are other types of error structures, such as multiplicative error, but in this paper we restrict ourselves to additive measurement error.
Remark. In the measurement error literature, there is an important distinction between the functional model, where $X$ is not modelled, and the structural model, where $X$ is modelled with a parametric distribution. In this paper, we restrict our attention to structural modelling, where $X$ is assumed to have a Gaussian distribution.
Under (1) with $g(X; \beta) = X^\top \beta$, one of the most well-known effects of measurement error is to bias the regression coefficient towards zero. This phenomenon is commonly referred to as attenuation.
More precisely, the OLS estimator obtained by regressing $Y$ on the surrogate $W$ is not a consistent estimator of $\beta$ but of $\beta^* = \lambda\beta$, where
$$\lambda = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2} < 1. \tag{2}$$
The attenuation factor $\lambda$ is referred to as the reliability ratio (Carroll et al., 2006). The larger the variance of the measurement error, $\sigma_u^2$, the closer to zero the attenuation factor $\lambda$ will be. Therefore, when the measurement error is ignored, the naive method of regressing $Y$ on $W$ results in severe underestimation of $\beta$ when the magnitude of the measurement error is large. Based on equation (2), we can obtain a consistent estimator of $\beta$ as
$$\hat\beta = \hat\beta_{ols} / \hat\lambda = \hat\beta_{ols}\, \frac{S_w}{S_w - \hat\sigma_u^2}, \tag{3}$$
where $S_w$ is the sample variance of the surrogate $W$ and $\hat\sigma_u^2$ is a consistent estimator of $\sigma_u^2$.
So far we have only discussed the scenario where only the explanatory variable $X$ is measured with error, but the scenario where both the explanatory and response variables are measured with error is much more relevant to the differential privacy framework. Although there is admittedly less literature on response measurement error, the extension is surprisingly easy in some cases. As noted in Carroll et al. (2006), for unbiased and homoscedastic response measurement error in linear regression, the response measurement error increases the variability of the fitted lines without causing bias. Furthermore, all hypothesis tests, confidence intervals, etc. remain perfectly valid, albeit less powerful. These conclusions indicate that unbiased error in linear regression requires no special adjustments when extending to response measurement error. For a binary response, however, the measurement error becomes misclassification, which is no longer an unbiased error, and therefore special considerations are required. As a first step toward establishing the connection between differential privacy and the measurement error model, we focus on linear regression with measurement error throughout this paper.
Nonlinear regression, such as logistic regression, is left for future work.
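The attenuation in equation (2) and the correction in equation (3) can be checked numerically. The sketch below simulates a no-intercept-effect example with illustrative values ($\beta = 2$, $\sigma_x^2 = \sigma_u^2 = \sigma_q^2 = 1$, so $\lambda = 1/2$) and treats $\sigma_u^2$ as known rather than estimated, which is an assumption made here for simplicity. The helper names are hypothetical.

```python
import random

def mean(a):
    return sum(a) / len(a)

def sample_cov(a, b):
    """Unbiased sample covariance; sample_cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

def naive_and_corrected_slope(w, y, sigma_u2):
    """OLS slope of y on the surrogate w, and its correction via equation (3)."""
    s_w = sample_cov(w, w)
    beta_ols = sample_cov(w, y) / s_w
    lam_hat = (s_w - sigma_u2) / s_w      # estimated reliability ratio
    return beta_ols, beta_ols / lam_hat

rng = random.Random(0)
n, beta, sigma_u2 = 20000, 2.0, 1.0       # illustrative values, not from the paper
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
w = [xi + rng.gauss(0.0, sigma_u2 ** 0.5) for xi in x]    # W = X + U
y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]         # Y = X * beta + q
beta_naive, beta_corrected = naive_and_corrected_slope(w, y, sigma_u2)
print(beta_naive, beta_corrected)   # expect roughly lambda * beta = 1 and beta = 2
```

With these settings the naive slope should land near $\lambda\beta = 1$, while the corrected slope should land near the true value $\beta = 2$.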

3.1. GAUSSIAN MECHANISM AS MEASUREMENT ERROR MODEL

Let $Z$ denote the private dataset. The analytic Gaussian mechanism (section 2.1) releases a differentially private dataset $\tilde Z$ by adding centred Gaussian noise $U \sim N(0, \sigma_u^2 I_p)$,
$$\tilde Z = Z + U. \tag{4}$$
Here $\Delta$ denotes the sensitivity of the identity query function, that is, $\Delta := \max_{Z, Z'} \|Z - Z'\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm. Referring back to section 2.2, it is easy to see that equation (4) can be viewed as the error structure between the surrogate variable $\tilde Z$ and the true unobservable variable $Z$ of a measurement error model. A key difference is that, in typical measurement error problems, the magnitude of the measurement error $\sigma_u^2$ (the variance of $U$) is unknown and has to be estimated. Fortunately, in the differential privacy setting, the variance of $U$ is fully determined by the privacy budget $(\varepsilon, \delta)$ and the sensitivity $\Delta$; it can therefore be publicized and is assumed to be known.
Remark. For unbounded variables, such as Gaussian random variables, $\Delta$ is infinite and the Gaussian mechanism no longer works without additional procedures. To deal with unbounded predictors, we simply clip the variable to a fixed interval. For $\Delta$ to be disclosed, the fixed interval must be chosen before, or without, seeing the confidential dataset.
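As a rough illustration of the release step in equation (4), the sketch below clips each column to a public interval, computes the sensitivity of the identity query from those intervals, and adds Gaussian noise. The intervals, the toy rows and the value of `sigma_u` are hypothetical placeholders; in particular, `sigma_u` is assumed to have been calibrated separately from $(\varepsilon, \delta)$ and $\Delta$ through the condition in Definition 2.4.

```python
import math
import random

def clip(value, lower, upper):
    """Restrict value to the interval [lower, upper]."""
    return max(lower, min(upper, value))

def identity_sensitivity(bounds):
    """Largest Frobenius-norm change when one row moves within the bounds."""
    return math.sqrt(sum((hi - lo) ** 2 for lo, hi in bounds))

def release_synthetic(rows, bounds, sigma_u, rng):
    """Clip each column to its public interval, then add N(0, sigma_u^2) noise."""
    released = []
    for row in rows:
        clipped = [clip(v, lo, hi) for v, (lo, hi) in zip(row, bounds)]
        released.append([c + rng.gauss(0.0, sigma_u) for c in clipped])
    return released

bounds = [(0.0, 1.0), (-3.0, 3.0)]          # hypothetical public intervals for (x, y)
delta_sens = identity_sensitivity(bounds)    # sqrt(1^2 + 6^2) = sqrt(37)
rng = random.Random(1)
private_rows = [[0.2, 1.4], [0.9, 4.2], [0.5, -0.3]]   # toy confidential data
sigma_u = 2.0   # placeholder; assumed calibrated from (eps, delta) and delta_sens
synthetic_rows = release_synthetic(private_rows, bounds, sigma_u, rng)
```

Since the intervals are fixed in advance, both `delta_sens` and `sigma_u` can be published alongside the synthetic rows, which is what allows the noise variance to be treated as known in the inference step.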

3.2. STATISTICAL INFERENCE FROM MEASUREMENT ERROR PERSPECTIVE

Now equipped with the perspective that the Gaussian mechanism can be viewed as the error structure of a measurement error model, we need the second component of the measurement error model, the data structure, to make statistical inferences. Consider the regression setting where we partition the private dataset as $Z = \{X, y\}$, where $X$ is the explanatory variable and $y$ is the response variable. Furthermore, we assume a functional relationship between $X$ and the expected value of $y$,
$$y = g(X; \beta) + q, \tag{5}$$
where $q \sim N(0, \sigma_q^2 I)$ is assumed to be independent of $X$. Combined with equation (4), we can write our measurement error model as
$$y = g(X; \beta) + q, \qquad \tilde Z = Z + U, \tag{6}$$
where $\tilde Z = (\tilde X, \tilde y)$ and $U = (U_x, u_y)$. Note that the variance of $U$ is assumed to be known. When $u_y$ has zero variance, model (6) reduces to model (1), the error-in-variable model of section 2.2.
Under model (6), one of the classical methods for estimation is the maximum likelihood approach (Wansbeek & Meijer, 2000), owing to properties such as consistency and asymptotic normality that the maximum likelihood estimator (MLE) enjoys. Since only $\tilde Z$ is observed, the likelihood function to maximize comes from the marginal distribution of $\tilde Z$, which is simply a multivariate normal distribution. For some functions $g$, a closed-form solution often does not exist and the likelihood must be maximized numerically. In this paper, however, we focus on a scenario in which a closed-form solution exists, namely $g(X; \beta) = X^\top \beta$, in which case model (6) reduces to
$$y = X\beta + q, \qquad \tilde Z = Z + U. \tag{7}$$
Define (see footnote 1)
$$\hat\beta = \left(\tfrac{1}{n} \tilde X^\top \tilde X - \sigma_u^2 I\right)^{-1} \tfrac{1}{n} \tilde X^\top \tilde y, \qquad \hat\sigma_{qq} = S_v - \sigma_u^2 \left(1 + \|\hat\beta\|_2^2\right),$$
where $S_v = \frac{1}{n-k} \|\tilde y - \tilde X \hat\beta\|_2^2$. The limiting distribution of these estimators is given by the following theorem.
Theorem 3.1 (Fuller (1987)). Let model (7) hold, that is, assume a homoscedastic linear regression model, an additive measurement error structure and a normally distributed predictor. Let $\theta = (\beta^\top, \sigma_{qq})^\top$ and let $\hat\theta = (\hat\beta^\top, \hat\sigma_{qq})^\top$. Then
$$n^{1/2} (\hat\theta - \theta) \rightsquigarrow N(0, \Gamma),$$
where the submatrices of $\Gamma$ are
$$\Gamma_{\beta\beta} = M_x^{-1} \sigma_v^2 + M_x^{-1} \left(\sigma_u^2 \sigma_v^2 I + \sigma_u^4 \beta\beta^\top\right) M_x^{-1},$$
$$\Gamma_{qq} = \operatorname{Var}\left(\tfrac{1}{n} \|v\|_2^2\right),$$
$$\Gamma_{\beta q} = 2 M_x^{-1} \sigma_u^2 \sigma_v^2 \beta,$$
with $v = u_y + q - U_x \beta$ and $M_x = \mu_x \mu_x^\top + \Sigma_x$. Furthermore, the variance of the approximate distribution of $\hat\beta$ can be estimated by
$$\widehat{\operatorname{Var}}\{\hat\beta\} = n^{-1} \left[\hat M_x^{-1} S_v + \hat M_x^{-1} \left(S_v \sigma_u^2 I + \sigma_u^4 \hat\beta \hat\beta^\top\right) \hat M_x^{-1}\right],$$
where $\hat M_x = \frac{1}{n} \tilde X^\top \tilde X - \sigma_u^2 I$.
Remark. The theorem above is not valid if a clipping process is applied to $Z$ to ensure finite sensitivity. Therefore, the clipping interval needs to be sufficiently large to minimize the impact of clipping.
Directly following the theorem above, we can derive the result for simple linear regression, $Y = \beta_0 + X\beta_1 + q$, which will be used in the simulation in section 4.
Corollary 3.1.1 (Simple linear regression (Fuller, 1987)). Suppose $\sigma_u^2$ is known, $\sigma_\varepsilon^2 > 0$, and $\sigma_\xi^2 > 0$. Then
$$\sqrt{n} \begin{pmatrix} \hat\beta_0 - \beta_0 \\ \hat\beta_1 - \beta_1 \end{pmatrix} \rightsquigarrow N(0, \Gamma),$$
where the covariance matrix $\Gamma$ is
$$\Gamma = \begin{pmatrix} \mu_x^2\, \dfrac{\sigma_{\tilde x}^2 \sigma_v^2 + \operatorname{Cov}^2(\tilde x, v)}{\sigma_x^4} + \sigma_v^2 & -\mu_x\, \dfrac{\sigma_{\tilde x}^2 \sigma_v^2 + \operatorname{Cov}^2(\tilde x, v)}{\sigma_x^4} \\[2ex] -\mu_x\, \dfrac{\sigma_{\tilde x}^2 \sigma_v^2 + \operatorname{Cov}^2(\tilde x, v)}{\sigma_x^4} & \dfrac{\sigma_{\tilde x}^2 \sigma_v^2 + \operatorname{Cov}^2(\tilde x, v)}{\sigma_x^4} \end{pmatrix}.$$
Furthermore, $n\, \widehat{\operatorname{Var}}\{(\hat\beta_0, \hat\beta_1)^\top\}$ is a consistent estimator of $\Gamma$, where
$$\widehat{\operatorname{Var}} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} \operatorname{mean}(\tilde X)^2\, \widehat{\operatorname{Var}}\{\hat\beta_1\} + \frac{1}{n} S_v & -\operatorname{mean}(\tilde X)\, \widehat{\operatorname{Var}}\{\hat\beta_1\} \\ -\operatorname{mean}(\tilde X)\, \widehat{\operatorname{Var}}\{\hat\beta_1\} & \widehat{\operatorname{Var}}\{\hat\beta_1\} \end{pmatrix},$$
$$\widehat{\operatorname{Var}}\{\hat\beta_1\} = \frac{1}{n-1} \cdot \frac{S_{\tilde x} S_v + \hat\beta_1^2 \sigma_u^4}{(S_{\tilde x} - \sigma_u^2)^2},$$
and $S_v = \frac{1}{n-2} \left\|\tilde y - \operatorname{mean}(\tilde y) - \hat\beta_1 (\tilde x - \operatorname{mean}(\tilde x))\right\|_2^2$.
Immediately following from the corollary above, we can derive an asymptotic confidence interval for $\beta_1$ as follows.
Corollary 3.1.2 (Asymptotic confidence interval). The interval
$$\hat\beta_1 \pm t_{1-\alpha/2,\, n-2} \sqrt{\widehat{\operatorname{Var}}\{\hat\beta_1\}},$$
where $t_{1-\alpha/2,\, n-2}$ denotes the $1 - \alpha/2$ quantile of Student's t distribution with $n - 2$ degrees of freedom, is an asymptotically correct $1 - \alpha$ confidence interval for the regression coefficient $\beta_1$.
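To make the simple linear regression case concrete, the sketch below computes a bias-corrected slope, the variance estimate of Corollary 3.1.1 and the interval of Corollary 3.1.2 from a synthetic dataset. Several choices here are assumptions made for illustration: the slope is written as $S_{\tilde x \tilde y} / (S_{\tilde x} - \sigma_u^2)$, the form suggested by equation (3); the intercept is taken as $\operatorname{mean}(\tilde y) - \hat\beta_1 \operatorname{mean}(\tilde x)$; and a standard normal quantile replaces the t quantile because the Python standard library has no t distribution, which should matter little for large $n$. The function name and the simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist

def dp_simple_regression(x_tilde, y_tilde, sigma_u2, alpha=0.10):
    """Bias-corrected intercept and slope with an asymptotic CI for the slope.

    sigma_u2 is the known variance of the DP noise added to every column.
    """
    n = len(x_tilde)
    mx = sum(x_tilde) / n
    my = sum(y_tilde) / n
    s_x = sum((x - mx) ** 2 for x in x_tilde) / (n - 1)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(x_tilde, y_tilde)) / (n - 1)
    denom = s_x - sigma_u2            # estimates the variance of the true x
    if denom <= 0:
        raise ValueError("noise variance exceeds the sample variance of x_tilde")
    b1 = s_xy / denom
    b0 = my - b1 * mx
    s_v = sum((y - my - b1 * (x - mx)) ** 2
              for x, y in zip(x_tilde, y_tilde)) / (n - 2)
    var_b1 = (s_x * s_v + b1 ** 2 * sigma_u2 ** 2) / ((n - 1) * denom ** 2)
    # Assumption: normal quantile used in place of the t quantile with n - 2 df.
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = crit * math.sqrt(var_b1)
    return b0, b1, (b1 - half, b1 + half)

# Toy data: true (beta0, beta1) = (1, 1), noise variance 0.25 on both columns.
rng = random.Random(7)
n, sigma_u2 = 5000, 0.25
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + xi + rng.gauss(0.0, 1.0) for xi in x]
x_tilde = [xi + rng.gauss(0.0, math.sqrt(sigma_u2)) for xi in x]
y_tilde = [yi + rng.gauss(0.0, math.sqrt(sigma_u2)) for yi in y]
b0, b1, ci = dp_simple_regression(x_tilde, y_tilde, sigma_u2)
```

The function only touches the released data and the public noise variance, so by the post-processing property its outputs carry no additional privacy cost.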

4. SIMULATION AND RESULTS

In this section, we perform simulations to evaluate the performance of our estimator against the widely used SSP algorithm (Dwork et al., 2014; Sheffet, 2017; Wang, 2018; Alabi et al., 2022). As the results will show, our estimator is comparable to the SSP algorithm in some scenarios. Furthermore, we obtain an asymptotic confidence interval for $\beta_1$ without additional privacy cost, which is one of the advantages of our approach. Compared to the naive CI obtained by ignoring the DP noise, our CI does a much better job of capturing the true value of $\beta_1$.

4.1. METHOD

For this simulation, we assume the following simple linear regression model,
$$Y_t = \beta_0 + \beta_1 X_t + q_t.$$
Additionally, we assume $q_t \sim N(0, 1)$ and $X_t \sim N(0, 1)$. To conduct the simulation, we set the coefficients to $(\beta_0, \beta_1) = (1, 1)$, and then draw $X_t$, $t = 1, 2, \ldots, n$, from $N(0, 1)$ and the regression noise $q_t$, $t = 1, 2, \ldots, n$, from $N(0, 1)$. Once $Y_t = \beta_0 + \beta_1 X_t + q_t$, $t = 1, 2, \ldots, n$, is obtained, we clip $Y_t$ to the interval $[-3, 3]$ to ensure a finite sensitivity $\Delta$. The interval $[-3, 3]$ is chosen because it is relatively large, so that clipping does not have a big impact on the result.
First, we obtain the point estimators for $\beta_0$ and $\beta_1$ using the SSP algorithm. To implement the SSP algorithm, we follow the DPSuffStats algorithm in Alabi et al. (2022) with a few adjustments (see footnote 2). To obtain our estimator, we first construct our DP synthetic dataset as described in section 3.1 with $\Delta = \sqrt{(1 - 0)^2 + (3 - (-3))^2} = \sqrt{37}$, and then obtain the estimates as described in section 3.2. To compare the performance of these two estimators, we report their median absolute error (MAE) for each combination of sample size $n \in \{500, 1000, 2000, 5000, 10^4, 10^5\}$ and privacy budget $\varepsilon \in \{0.1, 0.5, 1, 5\}$, while setting $\delta = 1/n$.
Due to the post-processing property of DP, any statistic derived from the DP synthetic dataset remains differentially private and does not incur any additional privacy risk. Therefore, the asymptotic CI described in Corollary 3.1.2 is differentially private. Similarly, the naive CI obtained by ignoring the DP noise is differentially private as well. To compare our asymptotic CI with the naive CI, we report their relative frequencies of capturing the true value of $\beta_1$ out of 1000 trials.
Lastly, the normal distribution assumption on the covariates might not be realistic in practice. Therefore, we rerun the simulation described above with $X_t$ drawn from $\mathrm{Unif}(0, 1)$ instead, to evaluate the performance of our method under a different setting.
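The coverage comparison can be sketched as a small Monte Carlo loop. The version below is a simplified stand-in for the simulation described above, not a reproduction of it: clipping is omitted, the noise variance `sigma_u2` is set by hand instead of being calibrated from $(\varepsilon, \delta)$, only 200 trials are run, and a normal quantile replaces the t quantile. It uses the observation that setting $\sigma_u^2 = 0$ in the variance formula of Corollary 3.1.1 recovers the usual OLS interval, so one helper serves for both the corrected and the naive CI.

```python
import math
import random
from statistics import NormalDist

def slope_ci(x, y, sigma_u2, crit):
    """Slope CI from Corollary 3.1.1; sigma_u2 = 0 gives the naive OLS interval."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_x = sum((a - mx) ** 2 for a in x) / (n - 1)
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    denom = s_x - sigma_u2
    b1 = s_xy / denom
    s_v = sum((b - my - b1 * (a - mx)) ** 2 for a, b in zip(x, y)) / (n - 2)
    var_b1 = (s_x * s_v + b1 ** 2 * sigma_u2 ** 2) / ((n - 1) * denom ** 2)
    half = crit * math.sqrt(var_b1)
    return b1 - half, b1 + half

def coverage(n, sigma_u2, trials, beta0=1.0, beta1=1.0, alpha=0.10, seed=0):
    """Fraction of trials in which the corrected and naive CIs contain beta1."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # normal approximation
    sd_u = math.sqrt(sigma_u2)
    hits_dp = hits_naive = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [beta0 + beta1 * a + rng.gauss(0.0, 1.0) for a in x]
        x_t = [a + rng.gauss(0.0, sd_u) for a in x]   # noisy release of x
        y_t = [b + rng.gauss(0.0, sd_u) for b in y]   # noisy release of y
        lo, hi = slope_ci(x_t, y_t, sigma_u2, crit)
        hits_dp += lo <= beta1 <= hi
        lo, hi = slope_ci(x_t, y_t, 0.0, crit)
        hits_naive += lo <= beta1 <= hi
    return hits_dp / trials, hits_naive / trials

# Hand-picked noise level for illustration only.
cov_dp, cov_naive = coverage(n=1000, sigma_u2=0.5, trials=200)
```

Under these assumptions the corrected interval should cover the true slope at roughly the nominal 90% rate, whereas the naive interval is centred near the attenuated value $\lambda\beta_1$ and should rarely cover it.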

4.2. RESULT

Table 1 shows the MAE results for our DP estimator (bottom value) and the SSP estimator (top value). As one might expect, the MAE of both estimators decreases as the privacy budget $\varepsilon$ or the sample size $n$ increases. However, the SSP estimator outperforms our estimator except when both the sample size and the privacy budget are large, in which case their performances are similar. The lower performance of our estimator is a finite-sample effect. In simulation, the sample covariance between the DP noise and the data is often non-negligible when the sample size is small relative to the amount of noise injected. This results in poor estimation of $\sigma_x$, which leads to the poor performance of our estimator. Although our estimation performance is worse than that of the SSP method, it is still comparable in some scenarios (the combination of a small privacy budget and a small sample size, or the combination of a large privacy budget and a large sample size). Furthermore, our approach allows the release of a synthetic dataset and, more importantly, it provides a method to obtain a confidence interval without additional privacy budget. We discuss the performance of our confidence interval next.
Table 1: MAE results for uniformly/normally distributed predictor. The top value within each cell is the MAE for the SSP algorithm, and the bottom value within each cell is the MAE for our estimator without applying the Gram-Schmidt process.

Table 1 column layout: Gaussian predictor ($\varepsilon = 0.1, 0.5, 1, 5$) and Uniform predictor ($\varepsilon = 0.1, 0.5, 1, 5$); rows are indexed by sample size $n$.

Figure 1 shows the coverage probabilities and margins of error of our confidence intervals (DP) under a normally distributed and a uniformly distributed covariate $X$. For comparison, the naive CIs derived from the synthetic dataset (naive) and the non-DP CIs derived from the confidential dataset (non-DP) are also plotted. As shown in both sets of plots, the coverage of our CIs is relatively close to the nominal level (90%, indicated by the dotted lines). In comparison, the naive CIs never capture the true value of $\beta_1$, while being much narrower. This highlights the importance of accounting for the DP noise when making statistically valid inferences. The poor coverage of the naive CIs arises, as explained in Carroll et al. (2006), because the variance of the naive estimator can be smaller than that of the true-data estimator when the privacy budget is small (that is, when the DP noise is large), which results in a more precise but biased estimator.

5. CONCLUSION

In this paper, we established a connection between DP mechanisms and measurement error models. Applying tools from the measurement error framework, we developed statistical inference for linear regression while preserving differential privacy. In particular, we derived a DP consistent estimator and a DP asymptotic confidence interval for the regression coefficients. To evaluate the performance of the estimator, we compared it to the widely used SSP method and demonstrated that our estimator has comparable performance in some scenarios while having the advantage of yielding statistically valid asymptotic confidence intervals without additional privacy cost. Furthermore, by comparing the coverage of our asymptotic CIs with that of the naive CIs, we illustrated the importance of incorporating the DP mechanism into the inference method to ensure valid statistical inference. For future work, a theoretical comparison between our estimator and the SSP estimator could be an interesting direction. Similarly, extending Theorem 3.1 to accommodate clipping might be a fruitful path to pursue. Furthermore, there are still many tools from the measurement error literature yet to be utilized. One obvious next step would be to extend the linear regression setting to the more general generalized linear model setting, such as logistic regression. We hope this paper will motivate future work that further explores the connection between differential privacy and measurement error and develops statistical inference under the differential privacy setting.



Footnote 1. Note that $\hat\beta$ is the MLE of $\beta$, but $\hat\sigma_{qq}$ is not the MLE of $\sigma_{qq}$. $\hat\sigma_{qq}$ is used here because its limiting distribution can be derived under less restrictive conditions than those used to obtain the maximum likelihood estimator (Fuller, 1987).
Footnote 2. First, the Gaussian mechanism is used instead of the Laplace mechanism for a better comparison. Then, we extend the algorithm to accommodate different clipping intervals for $X_t$ and $Y_t$.



Figure 1: Comparison of coverage probabilities for our DP confidence intervals, naive CIs and non-DP CIs at various sample sizes. The top 4 plots are for the normally distributed covariate, and the bottom 4 plots are for the uniformly distributed covariate. Note that the horizontal axis is in logarithmic scale with sample sizes $n = 500, 1000, 2000, 5000, 10^4, 10^5$. The dotted line indicates the nominal CI level of 0.9.

