DIFFERENTIALLY PRIVATE SYNTHETIC DATA: APPLIED EVALUATIONS AND ENHANCEMENTS

Abstract

Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the efficacy of differentially private synthetic data? In this paper, we survey four differentially private generative adversarial networks for data synthesis. We evaluate each of them at scale on five standard tabular datasets, and in two applied industry scenarios. We benchmark with novel metrics from recent literature and other standard machine learning tools. Our results suggest some synthesizers are more applicable for different privacy budgets, and we further demonstrate complicating domain-based tradeoffs in selecting an approach. We share experimental insights from applied machine learning scenarios on private internal data with researchers and practitioners alike. In addition, we propose QUAIL, a two-model hybrid approach to generating synthetic data. We examine QUAIL's tradeoffs, and note circumstances in which it outperforms baseline differentially private supervised learning models under the same budget constraint.

1. INTRODUCTION

Maintaining an individual's privacy is a major concern when collecting sensitive information from groups or organizations. A formalization of privacy, known as differential privacy, has become the gold standard with which to protect information from malicious agents (Dwork, 2008). Differential privacy offers some of the most stringent known theoretical privacy guarantees (Dwork et al., 2014). Intuitively, for some query on some dataset, a differentially private algorithm produces an output, regulated by a privacy parameter ε, that is statistically indistinguishable from the output of the same query on the same dataset had any one individual's information been removed. This powerful tool has been adopted by researchers and industry leaders, and has become particularly interesting to machine learning practitioners, who hope to leverage privatized data in training predictive models (Ji et al., 2014; Vietri et al., 2020). Because differential privacy often depends on adding noise, the results of differentially private algorithms can come at the cost of data accuracy and utility. However, differentially private machine learning algorithms have shown promise across a number of domains: they can provide tight privacy guarantees while still producing accurate predictions (Abadi et al., 2016). A drawback to most methods, however, is the one-off nature of training: once the model is produced, the privacy budget for a real dataset can be entirely consumed. The differentially private model is therefore inflexible to retraining and difficult to share or verify: the output model is a black box. This can be especially disadvantageous in the presence of high-dimensional data that require rigorous training techniques like dimensionality reduction or feature selection (Hay et al., 2016). With limited budget to spend, data scientists cannot exercise free rein over a dataset, thus sacrificing model quality.
In an effort to remedy this, and other challenges faced by traditional differentially private methods for querying, we can use differentially private techniques for synthetic data generation, investigate the privatized data, and train informed supervised learning models. In order to use the many state-of-the-art methods for differentially private synthetic data effectively in industry domains, we must first address pitfalls in practical analysis, such as the lack of realistic benchmarking (Arnold & Neunhoeffer, 2020). Benchmarking is non-trivial, as many new state-of-the-art differentially private synthetic data algorithms leverage generative adversarial networks (GANs), making them expensive to evaluate on large-scale datasets (Zhao et al., 2019). Furthermore, many state-of-the-art approaches lack direct comparisons to one another, and by nature of the privatization mechanisms, interpreting experimental results is non-trivial (Jayaraman & Evans, 2019). New metrics presented to analyze differentially private synthetic data methods may themselves need more work to understand, especially in the domain of tabular data (Ruggles et al., 2019; Machanavajjhala et al., 2017). To that end, our contributions in this paper are 3-fold. (1) We introduce more realistic benchmarking. Practitioners commonly collect state-of-the-art approaches for comparison in a shared environment (Xu et al., 2019). We provide our evaluation framework, with extensive comparisons on both standard datasets and our real-world, industry applications. (2) We provide experimentation on novel metrics at scale. We stress the tradeoff between synthetic data utility and statistical similarity, and offer guidelines for untried data. (3) We present a straightforward and pragmatic enhancement, QUAIL, that addresses the tradeoff between utility and statistical similarity.
QUAIL's simple modification to a differentially private data synthesis architecture boosts synthetic data utility in machine learning scenarios without harming summary statistics or privacy guarantees.

2. BACKGROUND

Differential Privacy (DP) is a formal definition of privacy offering strong assurances against various re-identification and reconstruction attacks (Dwork et al., 2006; 2014). In the last decade, DP has attracted significant attention due to its provable privacy guarantees and ability to quantify privacy loss, as well as unique properties such as robustness to auxiliary information, composability enabling modular design, and group privacy (Dwork et al., 2014; Abadi et al., 2016).

Definition 1 (Differential Privacy, Dwork et al. (2006)). A randomized function K provides (ε, δ)-differential privacy if for all S ⊆ Range(K) and all neighboring datasets D, D′ differing on a single entry, Pr[K(D) ∈ S] ≤ e^ε · Pr[K(D′) ∈ S] + δ.

This is the standard definition of DP, implying that the outputs of a differentially private algorithm on datasets that differ by a single individual are indistinguishable, bounded by the privacy parameter ε. Here, ε is a non-negative number otherwise known as the privacy budget. Smaller ε values more rigorously enforce privacy, but often decrease data utility. An important property of DP is its resistance to post-processing: given an (ε, δ)-differentially private algorithm K : D → O and an arbitrary randomized mapping f : O → O′, the composition f ∘ K : D → O′ is also differentially private. Currently, the widespread accessibility of data has increased data protection and privacy regulations, leading to a surge of research into applied scenarios for differential privacy (Allen et al., 2019).

DP-SGD Differentially Private Stochastic Gradient Descent (DP-SGD), introduced by Abadi et al. (2016), is one of the first methods to make the Stochastic Gradient Descent (SGD) computation differentially private. Intuitively, DP-SGD minimizes its loss function while preserving differential privacy by clipping the gradient's l2 norm to reduce the model's sensitivity, and adding noise to protect privacy. Further details can be found in the Appendix.

PATE Papernot et al. (2016) provided Private Aggregation of Teacher Ensembles (PATE), which functions by first deploying multiple teacher models that are trained on disjoint datasets, then deploying the teacher models on unseen data to make predictions. On unseen data, the teacher models "vote" to determine the label; here random noise is introduced to privatize the results of the vote.

Differentially Private GANs Several works apply differential privacy to GANs (Xie et al., 2018; Torkzadehmahani et al., 2019; Xu et al., 2018). These models inject noise into the GAN's discriminator during training to enforce differential privacy. DP's guarantee of post-processing privacy means that privatizing the GAN's discriminator enforces differential privacy on the parameters of the GAN's generator, as the mapping between the two does not involve any private data. We use the Differentially Private Generative Adversarial Network (DPGAN) of Xie et al. (2018) as one of our benchmark synthesizers. DPGAN leverages the Wasserstein GAN proposed by Arjovsky et al. (2017), adds noise to the gradients, and clips only the model weights, ensuring the Lipschitz property of the network. DPGAN has previously been evaluated on image data and Electronic Health Records (EHR). PATE-GAN Jordon et al. (2018b) modified the PATE framework to apply to GANs in order to preserve the differential privacy of synthetic data. Similarly to DPGAN, PATE-GAN applies the PATE mechanism only to the discriminator. The dataset is first partitioned into k subsets, and k teacher discriminators are initialized. Each teacher discriminator is trained to discriminate between a subset of the original data and fake data produced by the generator. A student discriminator is then trained to distinguish real data from fake data using the labels generated by an ensemble of teacher discriminators with random noise added. Lastly, the generator is trained to fool the student discriminator. Jordon et al. (2018b) claim that this method outperforms DPGAN for classification tasks, and present supporting results.
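The noisy-vote aggregation at the heart of PATE can be sketched as follows; the function name, the choice of Laplace noise, and the scale parameter are illustrative (the PATE analysis calibrates the noise scale to the desired privacy guarantee):

```python
import numpy as np

def pate_noisy_vote(teacher_preds, n_classes, noise_scale):
    """Noisy-max aggregation of teacher votes, PATE-style.

    teacher_preds: per-teacher predicted labels for a single sample.
    noise_scale: Laplace noise scale; larger values give stronger privacy
    at the cost of less accurate labels.
    """
    votes = np.bincount(teacher_preds, minlength=n_classes).astype(float)
    votes += np.random.laplace(0.0, noise_scale, size=n_classes)  # privatize the tally
    return int(np.argmax(votes))
```

With a strong teacher consensus, the noisy tally almost always returns the majority label; as the vote narrows, the noise increasingly decides the outcome, which is what limits the privacy loss per query.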

3. ENHANCING PERFORMANCE

The QUAIL Hybrid Method As we explored generating differentially private synthetic data, we noted a disconnect between the distribution of epsilon, or privacy budget, and the algorithm's application. Generating synthetic data to provide summary statistics necessitates an even distribution of budget across the entire privatization effort; we cannot know a user's query in advance. For a known supervised learning task, however, we may want to reallocate the budget. QUAIL (Quail-ified Architecture to Improve Learning) is a simple, two-model hybrid approach to enhancing the utility of a differentially private synthetic dataset for machine learning tasks. Intuitively, QUAIL assembles a DP supervised learning model in tandem with a DP synthetic data model to produce synthetic data with machine learning potential. Algorithm 1 describes the procedure more formally.

Algorithm 1: QUAIL pseudocode

Input: Dataset D, supervised learning target dimension r, budget ε > 0, split factor 0 < p < 1, size n of samples to generate, a differentially private synthesizer M(D, ε), and a differentially private supervised learning model C(D, ε, t) (t is the supervisory signal, i.e. the target dimension). We let X be the universe of samples and N denote the set of all non-negative integers; thus N^|X| is the set of all databases over universe X, as described in Section 2.3 of Dwork et al. (2014).

Split:
1. Split the budget: ε_M = ε * p and ε_C = ε * (1 − p).
2. Create D_M, which is identical to D except that the target dimension is removed (r ∉ D_M).

In parallel:
• Train the differentially private supervised learning model C(D, ε_C, r) to produce C_r(s) : N^|X| → R_1, which can map any arbitrary s ∈ N^|X| to an output label.
• Train the differentially private synthesizer M(D_M, ε_M) : N^|X| → R_2 to produce synthesizer M_{D_M}, which produces synthetic data S ∈ N^|X|.

Sample:

1. Using M_{D_M}, generate a synthetic dataset S_{D_M} with n samples.
2. For each sample s_i ∈ S_{D_M}, apply C_r(s_i) = r_i, i.e. apply the model to each synthetic datapoint to produce a supervised learning target output r_i.

3. Transform S_{D_M} → S_R: for each row s_i ∈ S_{D_M}, set s′_i = [s_i, r_i] such that ∀s′_i, s′_i ∈ dom(D), i.e. append r_i to each row s_i so that S_R is in the same domain as D, the original dataset.

Output: Return S_R, a synthetic dataset with n samples, where each sample in S_R has target dimension r_i produced by the supervised learner C_r.

Theorem 3.1 (QUAIL follows the standard composition theorem for (ε, δ)-differential privacy). The QUAIL method preserves the differential privacy guarantees of C(D, ε_C, r) and M(D_M, ε_M) by the standard composition rules of differential privacy (Dwork et al., 2014).

Proof. Let the first (ε, δ)-differentially private mechanism M_1 : N^|X| → R_1 be C(D, ε_C, r). Let the second (ε, δ)-differentially private mechanism M_2 : N^|X| → R_2 be M(D_M, ε_M). Fix 0 < p < 1, ε_M = p * ε and ε_C = (1 − p) * ε. Then by construction, for neighboring databases x, y and any output pair (r_1, r_2), Pr[M_{1,2}(x) = (r_1, r_2)] / Pr[M_{1,2}(y) = (r_1, r_2)] ≥ exp(−(ε_M + ε_C)), which satisfies the differential privacy constraints for a privacy budget of ε_M + ε_C = ε_total. For more details, see the Appendix.

Differentially Private GANs for Tabular Data In this paper, we focus on tabular synthetic data, and explored state-of-the-art methods for generating tabular data with GANs. CTGAN is a state-of-the-art GAN for generating tabular data presented by Xu et al. (2019). We made CTGAN differentially private using the aforementioned techniques, DP-SGD and PATE. CTGAN addresses specific challenges that a vanilla GAN faces when generating tabular data, such as mode collapse and continuous data following non-Gaussian distributions (Xu et al., 2019). To model continuous data with multi-modal distributions, it leverages mode-specific normalization. In addition, CTGAN introduces a conditional generator, which can generate synthetic rows conditioned on specific discrete columns. CTGAN further trains by sampling, which explores discrete values more evenly.

DPCTGAN Drawing from Xie et al. (2018)'s DPGAN work, we applied DP-SGD to the CTGAN architecture (details can be found in Figure 1 in the Appendix). Similarly to DPGAN, in applying DP-SGD to CTGAN we add random noise to the discriminator and clip the norm to make it differentially private. Based on the post-processing property (Dwork et al., 2014) that any randomized mapping of a differentially private output is also differentially private, the generator is guaranteed to be differentially private when it is trained to maximize the probability of D(G(z)). In CTGAN, the authors add the cross-entropy loss between the conditional vector and the produced set of one-hot discrete vectors into the generator loss. To guarantee differential privacy for the generator, we removed this cross-entropy term when calculating generator loss. Thus, the generator is differentially private as well. See Figure 1 in the Appendix for a diagram. PATE-CTGAN Drawing from work on PATE-GAN, we applied the PATE framework to CTGAN (Jordon et al., 2018b). Similarly to PATE-GAN, we partitioned our original dataset into k subsets and trained k differentially private teacher discriminators to distinguish real and fake data. In order to apply the PATE framework, we further modified CTGAN's teacher discriminator training: instead of using one generator to generate samples, we initialize k conditional generators, one for each subset of data (shown in Figure 2 in the Appendix).
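As a concrete illustration of Algorithm 1, the QUAIL procedure can be sketched as follows; `make_synth` and `make_clf` are hypothetical stand-ins for any DP synthesizer and DP classifier, not a real library API:

```python
def quail(D, target, epsilon, p, n, make_synth, make_clf):
    """Sketch of Algorithm 1 (QUAIL).

    D is a list of dict rows. make_clf(data, eps, target) returns a
    labeling function clf(row) -> label, and make_synth(data, eps)
    returns a sampler sampler(n) -> rows. Both interfaces are
    illustrative stand-ins for real DP implementations.
    """
    eps_m, eps_c = epsilon * p, epsilon * (1 - p)          # split the budget
    clf = make_clf(D, eps_c, target)                       # DP supervised model C
    D_m = [{k: v for k, v in row.items() if k != target}   # drop target column
           for row in D]
    sampler = make_synth(D_m, eps_m)                       # DP synthesizer M
    return [dict(row, **{target: clf(row)})                # relabel synthetic rows
            for row in sampler(n)]
```

The key design point is that the synthesizer never sees the target column: the synthetic rows are labeled after the fact by the DP classifier, so each model consumes only its share of the budget.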

4. EVALUATION: METRICS, INFRASTRUCTURE AND PUBLIC BENCHMARKS

We focus on two sets of metrics in our benchmarks: one for comparing the distributional similarity of two datasets and another for comparing the utility of synthetic datasets given a specific predictive task. These two dimensions should be viewed as complementary: in tandem, they capture the overall quality of the synthetic data. Distributional similarity To provide a quantitative measure for comparing synthetically generated datasets, we use a relatively new metric for assessing synthetic data quality: the propensity score mean-squared error (pMSE) ratio score. Proposed by Snoke & Slavković (2018), pMSE provides a statistic to capture the distributional similarity between two datasets. Given two datasets, we combine them with an indicator labeling which set a specific observation comes from. A discriminator is then trained to predict these indicator labels. To calculate pMSE, we simply compute the mean-squared error of the predicted probabilities for this classification task. If our model is unable to discern between these classes, then the two datasets are said to have high distributional similarity. To help limit the sensitivity of this metric to outliers, Snoke & Slavković (2018) propose transforming pMSE into a ratio by leveraging an approximation to the null distribution: we simply divide the pMSE by the expectation of the null distribution. A ratio score of 0 implies the two datasets are identical. Machine Learning Utility Given the context of this paper, we aim to provide quantitative measures for approximating the utility of differentially private synthetic data with regard to machine learning tasks. Specifically, we used three metrics: AUC-ROC and F1-score, two traditional utility measures, and the synthetic ranking agreement (SRA), a more recent measure.
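A minimal sketch of the pMSE statistic, assuming a logistic-regression discriminator (the original work also considers other classifiers, such as CART models); here c is the share of synthetic rows in the combined data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmse(real, synth):
    """Propensity score mean-squared error between two numeric datasets.

    Stacks the datasets with an indicator label, fits a discriminator,
    and scores how far its predicted propensities stray from c, the
    proportion of synthetic rows in the combined data.
    """
    X = np.vstack([real, synth])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])
    c = len(synth) / len(X)
    probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    return float(np.mean((probs - c) ** 2))
```

Indistinguishable datasets push every propensity toward c, driving pMSE toward 0; the ratio score then divides this value by the expectation under the null distribution.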
SRA can be thought of as the probability that a comparison between any two algorithms on the synthetic data will be similar to a comparison of the same two algorithms on the real data (Jordon et al., 2018a). Descriptions of each metric can be found in the Appendix. Evaluation Infrastructure The design of our pipeline addressed scalability concerns, allowing us to benchmark four computationally expensive GANs on five high-dimensional datasets across the privacy budgets ε = [0.01, 0.1, 0.5, 1.0, 3.0, 6.0, 9.0], averaged across 12 runs. We used varying compute, including CPU nodes (24 Cores, 224 GB RAM, 1440 GB Disk) and GPU nodes (4 x NVIDIA Tesla K80). Despite extensive computational resources, we could not adequately address the problem of hyperparameter tuning for differentially private machine learning algorithms, which is an open research problem (Liu & Talwar, 2019). In our case, a grid search was computationally intractable: for each run of the public datasets on all synthesizers, Car averaged 1.27 hours, Mushroom averaged 8.33 hours, Bank averaged 13.30 hours, Adult averaged 14.47 hours and Shopping averaged 27.37 hours. We trained our GANs using experimentally determined hyperparameters, informed by prior work around each algorithm. We include a description of the parameters used for each synthesizer in the Appendix. Regarding F1-score and AUC-ROC: we averaged across the maximum performance of five classification models: an AdaBoost classifier, a Bagging classifier, a Logistic Regression classifier, a Multilayer Perceptron classifier, and a Random Forest classifier. We decided to focus on one classification scenario specifically: train-synthetic test-real (TSTR), which is far more representative of applied scenarios than train-synthetic test-synthetic. We compare these values to train-real test-real (TRTR). In our Car evaluations in Figure 3, we see strong performance from the QUAIL variants at very low ε values.
However, we note that for ε ≥ 3.0, DPCTGAN and PATECTGAN outperform even the QUAIL-enhanced models. We further note that PATECTGAN performs remarkably well on the pMSE metric across ε values in Figure 28b. In our Mushroom evaluations in Figure 28a, QUAIL variants also outperformed other synthesizers. However, PATECTGAN exhibits the best statistical similarity (pMSE score) at larger ε. In our evaluations on the Adult dataset in Figure 30a, while PATECTGAN performs well, DPCTGAN performs best when ε ≥ 3.0. Our findings suggest that, generally, with larger budgets (ε ≥ 3.0), PATECTGAN improves on other synthesizers, both in terms of utility and statistical similarity. With smaller budgets (ε ≤ 1.0), DPCTGAN may perform better. Synthesizers are not able to achieve reasonable utility under very low budgets (ε ≤ 0.1), but DPCTGAN was able to achieve statistical similarity in this setting.

5. EVALUATION: APPLIED SCENARIO

Supported by learnings from experiments on the public datasets, we evaluated our benchmark DP synthesizers on several private internal datasets, for different scenarios such as classification and regression. We show that DP synthetic data models can perform on real-world data, despite a noisy supervised learning problem and skewed distributions when compared to the more standardized public datasets. Classification The data used in this set of experiments includes ∼100,000 samples and 30 features. The data contains only categorical columns, each with between 2 and 24 categories. One of our tasks with this dataset was to train a classifier with three classes. We faced significant challenges when managing the long-tail distribution of each feature. Figure 26, which can be found in the appendix, shows an example of the data distributions for different attributes in this data. We ran our evaluation suite on the applied internal data scenarios to generate synthetic data from each DP synthesizer and benchmark standard ML models. We also applied a Logistic Regression classifier with differential privacy from IBM's diffprivlib (Chaudhuri et al., 2011) to the real data as a baseline. Figure 5a shows the ML results from our evaluation suite. As expected, as the privacy budget increases, performance generally improves. DPCTGAN had the highest performance without the QUAIL enhancement. QUAIL, however, improved the performance of all synthesizers; in particular, a QUAIL-enhanced DPCTGAN synthesizer had the highest performance across epsilons in this experiment. These experiments demonstrated the advantage of QUAIL's approach of combining a DP synthesizer with a DP classifier for a classification task.

Regression

In this experiment, we used another internal dataset for the task of regression. Our dataset included 27,466 training and 6,867 testing samples. The domain comprised eight categorical and 40 continuous features. After generating DP synthetic data from each model, we used Linear Regression to predict the target variable. Figure 5b shows the results from the evaluation suite. We used RMSE (root-mean-squared error) as the evaluation metric. For QUAIL boosting, we used a Linear Regression model with differential privacy from IBM's diffprivlib (Sheffet, 2015). We also compared the DP synthesizers with a "vanilla" DP Linear Regression (DPLR) using real data. In this experiment, PATECTGAN outperformed other models and even improved on the RMSE of the real data for budget ε > 1.0. For QUAIL-enhanced models, the RMSE is considerably larger than for the real data and the other DP synthetic data. We attribute this to a weakness of the embedded regression model (DP Linear Regression) in QUAIL for this data scenario. Based on our observations, small privacy budgets (ε < 10.0) for DP Linear Regression significantly affect its performance. However, as shown in Figure 5b, we still see some boost for the QUAIL variant synthesizers when compared to the "vanilla" DP Linear Regression. For a distributional similarity comparison, please refer to Figure 27 in the appendix. QUAIL Evaluations QUAIL's hyperparameter, the split factor p (0 < p < 1), determines the distribution of budget between classifier and synthesizer. We generated classification task datasets with 10,000-50,000 samples, 7 feature columns and 10 output classes using the make_classification function from Scikit-learn (Pedregosa et al., 2011). We experimented with the values p = [0.1, 0.3, 0.5, 0.7, 0.9], and report results varying the budget ε = [1.0, 3.0, 10.0]. See the appendix for complete results and a list of DP classifiers we experimented with embedding in QUAIL.
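Datasets in the style of this split-factor study can be regenerated along the following lines; `n_informative` and `random_state` are our choices, as they are not specified above:

```python
from sklearn.datasets import make_classification

# A dataset matching the stated shape: 10,000 samples, 7 feature
# columns, 10 output classes. All informative features, no redundant
# ones (an assumption; the paper does not state these settings).
X, y = make_classification(n_samples=10_000, n_features=7, n_classes=10,
                           n_informative=7, n_redundant=0, n_repeated=0,
                           random_state=0)
```

Note that `make_classification` requires `n_classes * n_clusters_per_class <= 2**n_informative`, which 10 classes comfortably satisfy with 7 informative features.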
Our figures represent the delta δ in F1 score between training the classifier C(D, ε_C, r) on the original dataset (the "vanilla" scenario), F1_v, and training a Random Forest classifier on the differentially private synthetic dataset produced by applying QUAIL to a hybrid of C(D, ε_C, r) and one of our benchmark synthesizers M(D, ε_M), F1_q. We plot δ = F1_v − F1_q across epsilon splits and data sizes. Positive to highly positive deltas are grey→red, indicating the "vanilla" scenario outperformed the QUAIL scenario. Small or negative deltas are blue, indicating the QUAIL scenario matched, or even outperformed, the "vanilla" scenario. Each cell contains δ for some p on datasets of 10,000-50,000 samples. In our results we use DP Gaussian Naive Bayes (DP-GNB) as C(D, ε_C, r) (F1_v), and trained a Random Forest classifier on data generated by QUAIL (F1_q) (recall QUAIL combines C(D, ε_C, r) and a DP synthesizer) (Vaidya et al., 2013; diffprivlib). We average across 75 runs. Note the correlation between epsilon split, data size and classification performance when embedding PATECTGAN in QUAIL, shown in Figure 6, suggesting that a higher p split value increases the likelihood of outperforming C(D, ε_C, r). For an embedded MWEM synthesizer, seen in Figure 7, the relationship between split, scale and performance was more ambiguous. In general, a higher split factor p, which assigns more budget to the differentially private classifier C(D, ε_C, t), could improve the utility of the overall synthetic dataset. However, any perceived improvements were highly dependent on the differentially private synthesizer used. In Figures 8 and 9, we see that not only is the synthetic data produced by QUAIL very similar to the real data, but the accuracy of the labeling from the embedded model (in this case, DPLR) is also very similar. Further investigation into data scale revealed that the QUAIL method takes advantage of excess epsilon allocation when datasets are large.
As data scale increases, the sensitivity of the differentially private model decreases, so less epsilon can be used more efficiently. Thus, we see an exaggerated difference between DPLR embedded in QUAIL (with an epsilon of 2.4) and DPLR with an epsilon of 3.0 for a dataset of 20,000 samples. In this case, the embedded DPLR model's accuracy suffers, and so does the learning utility of the produced synthetic data. Conversely, as we increase the data size to 50,000 and 100,000 samples, we see that the internal model (with epsilon 2.4) can match the performance of the vanilla model (with epsilon 3.0). Then, the synthetic dataset serves only to augment performance by a small but significant margin (in Figure 10, we see a bump of three percent in F1 score). Time Performance Analysis of QUAIL: Making supervised learning more efficient QUAIL benefits the efficiency of training-intensive GANs. In Table 1, the time performance of QUAIL is compared with non-QUAIL methods. Specifically, we select two epsilons (ε = 3.0 and ε = 6.0) and two QUAIL split factors (p = 0.9 and p = 0.5). From this table, it can be seen that for all GAN-based models, QUAIL can improve time efficiency considerably. This is more noticeable as epsilon increases, where training time for models such as DPCTGAN and DPGAN skyrockets.

6. PUNCHLINES

We summarize our findings in the following punchlines: concise takeaways from our work for researchers and applied practitioners exploring DP synthetic data. 1. Holistic Performance. No single model always performed best (but PATECTGAN performed well often). Model performance was domain dependent, with continuous/categorical features, dataset scale and distributional complexity all affecting benchmark results. However, in general, we found that PATECTGAN had better utility and statistical similarity in scenarios with a high privacy budget (ε ≥ 3.0) when compared to the other synthesizers we benchmarked. Conversely, with a low privacy budget (ε ≤ 1.0) we found that DPCTGAN had better utility, but PATECTGAN may still be better in terms of statistical similarity. 2. Computational tradeoff. Our most performant GANs were slow, and MWEM is fast. PATECTGAN and DPCTGAN, while being our most performant synthesizers, were also the slowest to train. With GANs, more computation often correlates with higher performance (Lucic et al., 2018). On categorical data, MWEM performed competitively, and is significantly faster to train in any domain. 3. Using AUC-ROC and F1 Score. One should calculate both, especially to best understand QUAIL's tradeoffs. Our highest-performing models by F1 score often had QUAIL enhancements, which sometimes, but not always, detrimentally affected AUC-ROC. Without both metrics, one risks using a QUAIL enhancement for a model with high training accuracy that struggles to generalize. 4. Using pMSE. pMSE can be used alongside ML utility metrics to balance experiments. pMSE concisely captures statistical similarity, and allows practitioners to easily balance utility against the distributional quality of their synthetic data. 5. Enhancing with QUAIL. QUAIL's effectiveness depends far more on the quality of the embedded differentially private classifier than on the synthesizer. QUAIL showed promising results in almost all the scenarios we evaluated.
Given confidence in the embedded "vanilla" differentially private classifier, QUAIL can be used regularly to improve the utility of DP synthetic data. 6. Reservations for use in applied scenarios. Applied DP for ML is hard, thanks to scale and dimensionality. The applied scenarios we presented assessed large datasets, leading to high computational costs that make tuning performance difficult. Dimensionality is tricky to deal with in large, sparse, imbalanced private applied scenarios (like those we faced with our internal datasets). Practitioners may want to investigate differentially private feature selection or dimensionality reduction before training. We are aware of work being done to embed autoencoders into differentially private synthesizers, and view this as a promising approach (Nguyen et al., 2020).

7. CONCLUSION

With this paper, we set out to assess the efficacy of differentially private synthetic data for use on machine learning tasks. We surveyed a histogram-based approach (MWEM) and four differentially private GANs for data synthesis (DPGAN, PATE-GAN, DPCTGAN and PATECTGAN). We evaluated each approach using an extensive benchmarking pipeline. We proposed and evaluated QUAIL, a straightforward method to enhance synthetic data utility in ML tasks. We reported results from two applied internal machine learning scenarios. Our experiments favored PATECTGAN when the privacy budget ε ≥ 3.0, and DPCTGAN when the privacy budget ε ≤ 1.0. We discussed nuances of domain-based tradeoffs and offered takeaways across current methods of model selection, training and benchmarking. As of writing, our experiments represent one of the largest efforts at benchmarking differentially private synthetic data, and demonstrate the promise of this approach when tackling private real-world machine learning problems.

A APPENDIX

B METHODS

B.1 DP-SGD DETAILED STEPS

The detailed training steps are as follows: 1. A batch of random samples is taken and the gradient for each sample is computed. 2. Each computed gradient g is clipped to g / max(1, ‖g‖₂ / C), where C is a clipping bound hyperparameter. 3. Gaussian noise N(0, σ²C²I) (where σ is the noise scale) is added to the clipped gradients and the model parameters are updated. 4. Finally, the overall privacy cost (ε, δ) is computed using a privacy accountant method.

B.2 DESCRIPTIONS OF METRICS: F1-SCORE, AUC-ROC AND SRA

F1-score measures the accuracy of a classifier, calculating the harmonic mean of precision and recall and thereby favoring the lower of the two. It varies between 0 and 1, where 1 is perfect performance. AUC-ROC: Area Under the Receiver Operating Characteristic (AUC-ROC) represents the Receiver Operating Characteristic curve as a single number between 0 and 1. This provides insight into the true positive vs. false positive rate of the classifier. SRA: SRA can be thought of as the probability that a comparison between any two algorithms on the synthetic data will be similar to a comparison of the same two algorithms on the real data. SRA compares train-synthetic test-real (TSTR, which uses differentially private synthetic data to train the classifier, and real data to test) with train-real test-real (TRTR, which uses real data to both train and test the classifier).

Further Motivation Machine learning practitioners often need a deep understanding of data in order to train predictive models. That can be incredibly difficult when data is private. One-off, blackbox "vanilla" DP classifiers cannot be retrained, as this risks individual privacy, making parameter tuning and feature selection incredibly difficult with these models.
Differentially private synthetic data allows practitioners to treat data normally, without further privacy considerations, giving them an opportunity to fine-tune their models.
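The four DP-SGD steps listed in B.1 can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation; the function and argument names are ours:

```python
import numpy as np

def dp_sgd_step(per_sample_grads, params, lr, clip_C, sigma, rng):
    """One DP-SGD update following steps 1-4 above (illustrative sketch).

    per_sample_grads: array of shape (batch_size, dim), one gradient per sample.
    """
    # Step 2: clip each per-sample gradient g to g / max(1, ||g||_2 / C).
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads / np.maximum(1.0, norms / clip_C)
    # Step 3: add Gaussian noise N(0, sigma^2 C^2 I) to the summed clipped
    # gradients, average over the batch, and update the parameters.
    noise = rng.normal(0.0, sigma * clip_C, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / per_sample_grads.shape[0]
    return params - lr * noisy_mean
```

Step 4 (privacy accounting) is handled outside the update loop by an accountant, e.g. the moments accountant of Abadi et al.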

C QUAIL

C.1 QUAIL FURTHER DETAILS

We evaluated with a few vanilla differentially private classifiers C(R, ε_C, r):

1. Logistic Regression classifier with differential privacy (Chaudhuri et al., 2011; diffprivlib).
2. Gaussian Naive Bayes with differential privacy (Vaidya et al., 2013; diffprivlib).
3. Multi-layer Perceptron (Neural Network) with differential privacy (Abadi et al., 2016).

Theorem C.1 (Standard Composition Theorem (Dwork et al., 2014)). Let M_1 : N^|X| → R_1 be an ε_1-differentially private algorithm, and let M_2 : N^|X| → R_2 be an ε_2-differentially private algorithm. Then their combination M_{1,2} : N^|X| → R_1 × R_2, defined by the mapping M_{1,2}(x) = (M_1(x), M_2(x)), is (ε_1 + ε_2)-differentially private.

For each evaluation dataset specified, we launch a process that synthesizes datasets for each privacy budget (ε) specified on each synthesizer specified. Once the synthesis is complete, the pipeline launches a secondary process that analyzes the synthetic data, training classifiers and running the previously mentioned novel metrics. The run is launched, and the results are logged, using MLflow runs (Zaharia et al., 2018) with an Azure Machine Learning compute-cluster backend. Our compute used CPU nodes (STANDARD_NC24r: 24 cores, 224 GB RAM, 1440 GB disk) and GPU nodes (4 x NVIDIA Tesla K80). We highly encourage future work into hyperparameter tuning for differentially private machine learning tasks, and believe our evaluation pipeline could be of some use in that effort.
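By the composition theorem, running one mechanism after another simply adds their epsilons, which is what lets a single budget be split between two models. A toy sequential-composition accountant illustrating this bookkeeping (the class and method names are ours, not from the paper):

```python
class BudgetAccountant:
    """Minimal pure-epsilon sequential-composition accountant (sketch)."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, eps):
        """Record an eps-DP mechanism run; refuse to exceed the total budget."""
        if self.spent + eps > self.total + 1e-12:
            raise ValueError("privacy budget exhausted")
        self.spent += eps
        return eps
```

Running an ε_1-DP and then an ε_2-DP mechanism spends ε_1 + ε_2 in total, exactly as Theorem C.1 guarantees.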

D.2 DETAILS ON DATA

Results presented on the Public Datasets are averaged across 12 runs. SRA results were moved to the appendix after difficulty interpreting their significance, although there are potential trends that warrant further exploration.

(b) SRA results for PATECTGAN

epsilon   bank   car   shopping   mushroom   adult
0.01      0.6    0.2   0.7        0.7        0.4
0.10      0.4    0.1   0.7        0.7        0.7
0.50      0.4    0.4   0.9        0.6        0.3
1.00      0.8    0.2   0.7        1.0        0.5
3.00      0.9    0.4   0.2        1.0        0.9
6.00      0.9    0.6   0.5        0.6        0.8
9.00      0.4    0.2   0.9        0.9        0.7

(a) SRA results for DPCTGAN

epsilon   bank   car   shopping   mushroom   adult
0.01      0.4    0.4   0.7        1.0        0.5
0.10      0.7    0.3   0.5        0.9        0.5
0.50      0.5    0.8   0.7        0.9        0.4
1.00      0.5    0.3   0.8        0.6        0.5
3.00      0.7    0.4   0.5        0.8        0.3
6.00      0.5    0.4   0.5        1.0        0.4
9.00      0.4    0.3   0.5        0.8        0.5

(b) SRA results for QUAIL (MWEM)

epsilon   bank   car   shopping   mushroom   adult
0.01      0.5    0.1   0.8        0.9        0.0
0.10      0.3    0.1   0.6        0.9        0.2
0.50      0.9    0.2   0.5        0.7        0.3
1.00      0.4    0.2   0.5        0.9        0.0
3.00      0.1    0.1   0.0        0.8        0.8
6.00      0.1    0.5   0.1        0.8        0.8
9.00      0.0    0.5   0.2        1.0        0.7

SRA results (synthesizer label missing)

epsilon   bank   car   shopping   mushroom   adult
0.01      0.6    0.7   1.0        1.0        0.6
0.10      0.8    0.5   1.0        0.4        0.4
0.50      0.2    0.6   0.9        0.4        0.2
1.00      0.5    0.5   0.6        0.5        0.6
3.00      0.5    0.5   0.3        0.4        0.6
6.00      0.7    –     –          –          –

SRA results (synthesizer label missing)

epsilon   bank   car   shopping   mushroom   adult
0.01      0.5    0.7   0.9        1.0        0.2
0.10      0.6    0.6   0.5        0.9        0.6
0.50      0.2    0.5   0.5        0.5        0.3
1.00      0.5    0.6   0.9        0.3        0.6
3.00      0.6    0.5   0.2        0.5        0.5
6.00      0.6    0.5   0.3        0.5        0.8
9.00      0.6    0.5   0.0        0.5        –

(b) SRA results for QUAIL (DPCTGAN)

epsilon   bank   car   shopping   mushroom   adult
0.01      0.9    0.3   0.6        0.3        0.5
0.10      0.4    0.6   0.9        0.3        0.9
0.50      0.7    0.5   0.5        0.3        0.6
1.00      0.0    0.5   0.9        0.3        0.9
3.00      0.6    0.4   0.1        0.3        0.7
6.00      0.7    0.4   0.8        0.4        0.9
9.00      0.7    0.4   0.9        0.3        0.8

SRA results (synthesizer label missing)

epsilon   bank   car   shopping   mushroom   adult
0.01      0.8    0.2   0.9        0.5        0.2
0.10      0.6    0.5   0.8        0.3        0.3
0.50      0.1    0.4   0.7        0.3        0.4
1.00      0.7    0.5   0.5        0.3        0.4
3.00      0.1    0.4   0.8        0.4        0.5
6.00      0.0    0.4   0.7        0.3        0.1
9.00      0.3    0.4   0.1        0.5        0.1
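SRA values such as those reported above can be computed from per-algorithm accuracy scores on real and synthetic data. Below is a sketch of one common formulation, counting the fraction of algorithm pairs whose relative ordering is preserved; the function and variable names are ours:

```python
from itertools import combinations

def sra(real_scores, synth_scores):
    """Synthetic Ranking Agreement (sketch): fraction of algorithm pairs whose
    relative ordering on real-data scores is preserved on synthetic-data scores.

    real_scores[i] and synth_scores[i] are the same algorithm's accuracy under
    TRTR and TSTR evaluation, respectively.
    """
    pairs = list(combinations(range(len(real_scores)), 2))
    agreeing = sum(
        1
        for i, j in pairs
        if (real_scores[i] - real_scores[j]) * (synth_scores[i] - synth_scores[j]) > 0
    )
    return agreeing / len(pairs)
```

An SRA of 1.0 means the synthetic data ranks all candidate algorithms exactly as the real data would; 0.0 means every pairwise comparison is reversed.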



; Ding et al. (2017); Doudalis et al. (2017). There have been several studies into protecting individuals' privacy during model training (Li et al., 2014; Zhang et al., 2015; Feldman et al., 2018). In particular, several studies have attempted to solve the problem of preserving privacy in deep learning (Phan et al., 2017; Abadi et al., 2016; Shokri & Shmatikov, 2015; Xie et al., 2018; Zhang et al., 2018; Jordon et al., 2018b; Torkzadehmahani et al., 2019). Here, two main techniques for training models with differential privacy are discussed: DP-SGD, Differentially Private Stochastic Gradient Descent (DP-SGD), proposed by Abadi et al.

Figure 1: Block diagram of DP-CTGAN model.

Figure 2: Teacher Discriminator of PATE-CTGAN model.

Figure 3: Real Car F1 Score: 0.97

Figure 4: Mushroom pMSE

Figure 5: ML evaluation results for internal dataset

Our QUAIL results are agnostic to the embedded supervised learning algorithm C(R, C , r ), as they depict relative performance, though different methods of supervised learning are more suitable to certain domains. Future work might explore alternative classifiers or regression models, and how purposefully overfitting the model C(R, C , r ) could contribute to improved synthetic data.
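Abstracting over the embedded supervised learner, the QUAIL hybrid can be sketched as follows. This is a simplified illustration: the function signatures are ours, not the paper's API, and any concrete DP classifier (e.g. from diffprivlib) and DP synthesizer can be plugged in:

```python
def quail(features, labels, total_eps, dp_classifier, dp_synthesizer, split=0.5):
    """Sketch of the QUAIL hybrid: divide one privacy budget between an
    embedded DP classifier and a DP feature synthesizer, then label the
    synthetic rows with the classifier. By the standard composition theorem,
    the overall release is total_eps-differentially private.

    dp_classifier(features, labels, eps)  -> a row -> label prediction function
    dp_synthesizer(features, eps)         -> label-free synthetic feature rows
    """
    eps_c = total_eps * split          # budget for the supervised model
    eps_s = total_eps - eps_c          # budget for feature synthesis
    predict = dp_classifier(features, labels, eps_c)
    synth_features = dp_synthesizer(features, eps_s)
    synth_labels = [predict(row) for row in synth_features]
    return synth_features, synth_labels
```

The resulting labeled synthetic dataset can then be handled freely, with no further privacy accounting, which is the practical appeal discussed above.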

Figure 6: Privacy budget = 3.0

Figure 7: Privacy budget = 3.0

Figure 8: Privacy budget = 3.0

Figure 10: Privacy budget = 3.0

Figure 17: Budget = 1.0

Figure 18: Budget = 3.0

Figure 19: Budget = 10.0

Figure 26: Data distribution of the internal dataset for various attributes, included to highlight the imbalanced nature and difficulty of the supervised learning problem.

Figure 27: pMSE evaluation results for internal dataset. PATECTGAN performed best in both cases.

Figure 28: PATECTGAN demonstrated better performance at higher epsilons. QUAIL synthesizers performed best at low epsilon privacy values.

Figure 31: As the most complex benchmark dataset, Bank presented a particular challenge. The results are difficult to interpret, and would require further experimentation to draw conclusions.

log_frequency=True, disabled_dp=False, target_delta=None, sigma=5, max_per_sample_grad_norm=1.0, verbose=True, loss='wasserstein'

log_frequency=True, disabled_dp=False, target_delta=None, sigma=5, max_per_sample_grad_norm=1.0, verbose=True, loss='cross_entropy', binary=False, batch_size=500, teacher_iters=5, student_iters=5

Time Performance Analysis of QUAIL compared to other synthesizers (time is shown in seconds)

Details on Public Datasets used for benchmarking.


Proof. Let x, y ∈ N^|X| be such that ||x − y||_1 ≤ 1. Fix any (r_1, r_2) ∈ R_1 × R_2. Then:

Pr[M_{1,2}(x) = (r_1, r_2)] = Pr[M_1(x) = r_1] · Pr[M_2(x) = r_2]
                            ≤ (e^{ε_1} Pr[M_1(y) = r_1]) · (e^{ε_2} Pr[M_2(y) = r_2])
                            = e^{ε_1 + ε_2} Pr[M_{1,2}(y) = (r_1, r_2)].

Proof. (QUAIL: full proof of differential privacy.) Let the first (ε, δ)-differentially private mechanism

C.2 QUAIL FULL RESULTS

Our experimental pipeline provides an extensible interface for loading datasets from remote hosts, specifically from the UCI ML Dataset repository (Dua & Graff, 2017).

