ANALYSIS OF DIFFERENTIALLY PRIVATE SYNTHETIC DATA: A MEASUREMENT ERROR APPROACH

Abstract

Differentially private (DP) synthetic datasets have received significant attention from academia, industry, and government. However, little is known about how to perform statistical inference using DP synthetic datasets. Naive approaches that do not account for the uncertainty induced by the DP mechanism yield biased estimators and invalid inferences. In this paper, we present a class of easy-to-implement, MLE-based bias-corrected DP estimators with valid asymptotic confidence intervals (CIs) for parameters in regression settings, by establishing a connection between additive DP mechanisms and measurement error models. Our simulations show that our estimator performs comparably to the widely used sufficient statistic perturbation (SSP) algorithm in some scenarios, with the added advantages of releasing a synthetic dataset and providing statistically valid asymptotic CIs, which achieve better coverage than the naive CIs obtained by ignoring the DP mechanism.

1. INTRODUCTION

Differential privacy (DP) is a mathematically rigorous definition that quantifies privacy risk. It builds on the idea of releasing "privacy-protected" query results, such as summary statistics, using randomized responses. In recent years, differential privacy has quickly gained popularity, as it promises that no individual incurs additional privacy harm whether or not that individual's data belongs to a private dataset, and it therefore encourages data sharing. One important characteristic of DP is its composition property (Dwork & Roth, 2014). To rule out the scenario where the same analysis is rerun repeatedly and the noise from the randomized responses is averaged away, the composition property stipulates that running the same analysis twice incurs twice the privacy risk of running it once. The data provider typically sets a cap on the total privacy risk allowed, commonly referred to as the privacy budget, and each analysis consumes a portion of it. Once the total privacy budget is exhausted, no new analysis is possible unless the data provider decides to increase the budget and thus take on more privacy risk. This is problematic because it limits the number of analyses researchers can run, which can leave the dataset under-explored. In consequence, it diminishes the probability of serendipitous discovery and amplifies the odds of being misled by unanticipated data problems (Evans et al., Working Paper). To address this problem, various methods of releasing differentially private synthetic datasets have been proposed (Liu, 2016; Bowen & Liu, 2020; Gambs et al., 2021). By the post-processing property of DP, any analysis performed on a DP synthetic dataset remains differentially private at no additional privacy cost. Releasing a DP synthetic dataset therefore circumvents the problem of exhausting the privacy budget.
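As a concrete illustration of the budget accounting implied by basic sequential composition under pure epsilon-DP, the total privacy cost of a sequence of analyses is simply the sum of their individual costs (the function below is an illustrative sketch, not part of any DP library):

```python
def remaining_budget(total_epsilon, spent_epsilons):
    """Privacy budget left after a sequence of DP analyses.

    Under basic sequential composition for pure epsilon-DP, the cost
    of running k analyses is the sum of their individual epsilons.
    """
    return total_epsilon - sum(spent_epsilons)

# A provider sets a total budget of epsilon = 1.0; three queries at
# epsilon = 0.25 each leave only 0.25 for all future analyses.
left = remaining_budget(1.0, [0.25, 0.25, 0.25])
```

Rerunning the same epsilon = 0.25 query does not come for free: each repetition spends another 0.25, which is exactly what prevents averaging away the noise.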
Here we mention a few notable methods of generating DP synthetic datasets. In general, these methods can be categorized into non-parametric and parametric approaches. In the non-parametric methods, the DP dataset is constructed from the empirical distribution of the data; the simplest approach is to add Laplace or Gaussian noise directly to the confidential dataset. Dwork & Roth (2014) characterize differential privacy as a definition of privacy tailored to the problem of privacy-preserving data analysis. For a statistician, however, the goal of statistical inference is often as important as data analysis, and under the framework of differential privacy, methods for making statistical inferences remain under-explored. Fortunately, interest in statistical inference under differential privacy has been rising recently, including works such as Sheffet (2017) and Barrientos et al. (2019). To make statistical inferences, statistical models need to be specified, and different DP algorithms induce different statistical models. It turns out that additive mechanisms, such as the Laplace mechanism or the Gaussian mechanism, give rise to statistical models that are naturally related to measurement error models. In other words, each additive mechanism can be viewed as a variation of a measurement error model, and the tools of measurement error modeling can therefore be used to make inferences in the differential privacy setting. In this paper, we generate a DP synthetic dataset by adding DP noise directly to the confidential dataset through the Gaussian mechanism. We choose this method for its simplicity and, more importantly, because it allows us to establish the connection to the theory of measurement error, as we will see in Section 3.1.
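A minimal sketch of this additive approach is given below. It assumes replace-one neighbouring datasets, so clipping each record to L2 norm at most C bounds the sensitivity of the identity release by 2C, and it uses the classical Gaussian-mechanism calibration sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon (valid for epsilon < 1); the function name and interface are illustrative, not from the paper:

```python
import numpy as np

def gaussian_synthetic(X, clip, epsilon, delta, rng=None):
    """Sketch: DP synthetic data by adding Gaussian noise to each record.

    Each row of X is clipped to L2 norm <= clip, bounding the per-record
    sensitivity of the identity release by 2 * clip under replace-one
    neighbours.  Noise is calibrated by the classical Gaussian mechanism.
    """
    rng = np.random.default_rng() if rng is None else rng
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_clipped = X * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    sensitivity = 2.0 * clip
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
    return X_clipped + rng.normal(0.0, sigma, size=X.shape), sigma
```

The released noise scale sigma is public (it depends only on epsilon, delta, and the clipping bound), which is precisely what lets the measurement error machinery of Section 3.1 treat the perturbation as known-variance measurement error.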
Using established tools from the theory of measurement error, we derive an MLE-based, bias-corrected DP estimator and an asymptotic confidence interval for our parameter of interest. By establishing this connection to measurement error, we are able to develop statistical inference under the differential privacy setting. To demonstrate the usefulness of the connection, we study statistical inference in the linear regression setting while preserving differential privacy. In particular, we derive a consistent DP estimator and an asymptotic confidence interval for the regression coefficient.
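For intuition on why a correction is needed, consider the classical measurement error model W = X + U with known noise variance sigma_u^2 (here fixed by the DP mechanism): the naive OLS slope regressing Y on W is attenuated by the reliability ratio sigma_x^2 / (sigma_x^2 + sigma_u^2), and a simple moment-based fix subtracts the known noise variance from the denominator. The simulation below illustrates this attenuation and correction; it is a sketch of the general idea, not the paper's exact MLE-based estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma_u = 200_000, 2.0, 1.0

x = rng.normal(0.0, 1.0, n)          # confidential covariate
w = x + rng.normal(0.0, sigma_u, n)  # released covariate with additive DP noise
y = beta * x + rng.normal(0.0, 1.0, n)

# Naive OLS on the noisy covariate is attenuated toward zero:
# it estimates beta * sigma_x^2 / (sigma_x^2 + sigma_u^2), here ~1 instead of 2.
naive = np.cov(w, y)[0, 1] / np.var(w, ddof=1)

# Moment correction: subtract the known DP noise variance from var(w).
corrected = np.cov(w, y)[0, 1] / (np.var(w, ddof=1) - sigma_u**2)
```

With sigma_x = sigma_u = 1 the naive slope is roughly half the truth, while the corrected slope is consistent; the paper's contribution is to pair such bias-corrected estimators with valid asymptotic confidence intervals.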

Related work

As one of the most common statistical models, linear regression has been studied before in the differential privacy literature. One widely used method for obtaining a DP estimator of the regression coefficient is the perturbation of sufficient statistics (Dwork et al., 2014; Sheffet, 2017; Wang, 2018). It is commonly used due to its simplicity and its close relation to the classical ordinary least squares method. Motivated by Dwork & Lei (2009), Alabi et al. (2022) show that algorithms based on a robust estimator, such as a median-based estimator, outperform the classical ordinary least squares estimator in small-sample cases. Similar to our work, Charest & Nombo (2020) use simulation extrapolation (SIMEX), a technique from the measurement error literature (Carroll et al., 2006), to obtain a DP estimator of the regression coefficient; unlike our work, however, they do not consider the construction of confidence intervals. Agarwal et al. (2021) and Agarwal & Singh (2021) also note the connection between measurement error models and differentially private mechanisms. Differing from our work, Agarwal et al. (2021) focus on the setting where only the covariates are perturbed with DP noise and on the goal of learning a predictive linear model using principal component regression. Similar to our work, Agarwal & Singh (2021) use the connection to make inferences on the regression coefficient in a more general and less structured setting, but their methodology is considerably more involved than our simpler approach. Lastly, Evans & King (2022) also use the connection to obtain a consistent estimator of the regression coefficient, but without addressing confidence intervals.
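For reference, sufficient statistic perturbation (SSP) for linear regression releases noisy versions of X'X and X'y and then solves the perturbed normal equations. The sketch below illustrates the idea with Gaussian noise; the noise scale sigma is assumed to be calibrated elsewhere to the desired privacy level given bounds on the rows of X and the entries of y, and the function is illustrative rather than any specific published implementation:

```python
import numpy as np

def ssp_linear_regression(X, y, sigma, rng=None):
    """Sufficient statistic perturbation (sketch): perturb the sufficient
    statistics X'X and X'y, then solve the noisy normal equations.

    `sigma` is assumed to be pre-calibrated to the target (epsilon, delta)
    given known bounds on the data."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    E = rng.normal(0.0, sigma, size=(d, d))
    XtX_dp = X.T @ X + (E + E.T) / 2.0   # symmetrized noise keeps X'X symmetric
    Xty_dp = X.T @ y + rng.normal(0.0, sigma, size=d)
    return np.linalg.solve(XtX_dp, Xty_dp)
```

Because the perturbed statistics are released only once, any number of downstream regressions on them are free under post-processing; the drawback relative to releasing a synthetic dataset is that only analyses expressible through these particular statistics are supported.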
Using the Johnson-Lindenstrauss transform (Blocki et al., 2012), Sheffet (2017) studies a DP ordinary least squares estimator and derives DP asymptotic confidence intervals for the regression coefficients. In contrast to the additive DP noise used in our work, a random projection could limit the usefulness of the synthetic dataset for other types of analysis. Instead of constructing confidence intervals for the regression coefficient, Barrientos et al. (2019) study DP hypothesis testing for the regression coefficient by perturbing the t-statistic. However, since their approach achieves DP by randomizing the t-statistic, each hypothesis test consumes a portion of the total privacy budget, which can be exhausted quickly. Lastly, from the Bayesian perspective, Bernstein & Sheldon (2019) study DP Bayesian linear regression, which requires a prior distribution on the regression coefficient, through the release of private sufficient statistics; it is thus vulnerable to the same problem of exhausting the privacy budget described above.



In the parametric methods, the DP dataset is constructed from a parametric distribution or model of the data. Using the robust and flexible vine copula model, Gambs et al. (2021) draw the DP synthetic dataset from a DP-trained vine copula model. From the Bayesian perspective, Liu (2016) proposes generating a DP synthetic dataset by drawing samples from a DP version of the posterior predictive distribution. For a more comprehensive overview of DP dataset generation methods, refer to Bowen & Liu (2020).

