MIND THE PRIVACY BUDGET: HOW GENERATIVE MODELS SPEND THEIR PRIVACY BUDGETS

Abstract

Numerous Differentially Private (DP) generative models have been presented that aim to produce synthetic data while minimizing privacy risks. As there is no single model that works well in all settings, empirical analysis is needed to establish and optimize trade-offs vis-à-vis the intended use of the synthetic data. In this paper, we identify and address several challenges in the empirical evaluation of such models. First, we analyze the steps in which different algorithms "spend" their privacy budget. We evaluate the effects on the performance of downstream tasks to identify the problem settings in which each model is most likely to be successful. Then, we experiment with increasingly wider and taller training sets with various features, decreasing privacy budgets, and different DP mechanisms and generative models. Our empirical evaluation, performed on both graphical and deep generative models, sheds light on the distinctive features of different models/mechanisms that make them well-suited for different settings and tasks. Graphical models distribute the privacy budget horizontally and cannot handle relatively wide datasets, while their performance on the task they were optimized for monotonically increases with more data. Deep generative models spend their budget per iteration, and their behavior is less predictable with varying dataset dimensions, but they could perform better if trained on more features. Also, low levels of privacy (ε ≥ 100) could help some models generalize, achieving better results than without applying DP.

1. INTRODUCTION

Techniques for creating synthetic data based on generative models have been attracting efforts from the research community (Jordon et al., 2022), government organizations (Benedetto et al., 2018; NIST, 2018a; 2020; NHS England, 2021), regulatory bodies (ICO UK, 2022), and industry alike (TechCrunch, 2022). However, training such models without privacy guarantees can lead to overfitting to the training data or memorization of individual points (Carlini et al., 2019; Webster et al., 2019). This, in turn, enables attacks like membership and property inference (Hayes et al., 2019; Hilprecht et al., 2019; Chen et al., 2020). To mitigate these concerns, models should be trained to satisfy Differential Privacy (DP) (Dwork et al., 2014), whereby noisy/random mechanisms are used to provably minimize the contribution of individual data points to the model.

In this paper, we set out to profile DP generative models for tabular data, aiming to systematically understand the settings in which they perform best. As shown by (Hay et al., 2016), the more complex DP algorithms become, the harder it is to analyze their performance analytically. This motivates the need for solid empirical evaluations; thus far, however, evaluations have only been performed on a few relatively small datasets (i.e., involving around a dozen features) and rarely beyond a single evaluation metric. Furthermore, it is unclear if/how these models scale to larger datasets, even though companies advertise scalable synthetic data, delivered in minutes, as part of their product offerings (Accelario, 2022; Datagen, 2022; Syntho, 2022; Gretel, 2022). The conventional wisdom is that computations/methods satisfying DP become more accurate with more training data and less so with stricter privacy guarantees. On the other hand, in some cases, more data or training iterations make deep learning classifiers worse when optimized with DP-SGD (Near & Abuah, 2021).
Also, satisfying a small degree of DP improves the performance of CNNs on limited data (Pearce, 2022) and of GANs on imbalanced data (Ganev, 2022), and even defends against reconstruction attacks by highly informed adversaries (Balle et al., 2022).

Problem Statement. We perform the first empirical evaluation of the effects of different DP mechanisms and the size and dimensionality of training data in the context of generative models and synthetic data. More broadly, we aim to analyze how generative models spend their privacy budget across the number of rows and columns, while also varying the dimensions of the datasets. We also evaluate how the choice of generative model and DP mechanism affects the quality of the synthetic data for downstream tasks, e.g., capturing simple distributions, maintaining high similarity, clustering, and binary and multi-class classification. We compare two synthetic data approaches, namely, graphical and deep generative models; more precisely, PrivBayes (Zhang et al., 2017) and MST (McKenna et al., 2021b) for the former, and DP-WGAN (Alzantot & Srivastava, 2019) and PATE-GAN (Jordon et al., 2018) for the latter. Overall, we aim to answer the following research questions:

• RQ1: How scalable are DP generative models in terms of the dimensions of the dataset?
• RQ2: Do DP generative models distribute their privacy budgets in a similar way?
• RQ3: What are the effects of different ways to distribute the privacy budget, and of varying dataset dimensions, on the downstream performance of the synthetic data?

Main Findings. Among other things, our experiments reveal that:

1. The graphical models distribute their privacy budget per column, cannot scale to many features (at most 256 for PrivBayes and 128 for MST), and increasing the number of rows does not affect their training time. The deep generative models spend their budget per training iteration and can handle much wider datasets, but become slower with more data.
2. PrivBayes's performance on downstream tasks degrades when a stricter privacy budget is imposed or the number of features increases, while more data counters these effects. Also, it is the only model that properly separates signal from noise in the clustering task.
3. MST is the best-performing model at capturing simple statistics and has the best privacy-utility trade-off for this task. Also, when data is scarce, it benefits from adding a small degree of privacy (ε ≥ 100). Increasing the number of rows too much, however, can cause MST to overfit and degrade its performance on more complex tasks.
4. The GAN models behave more variably across dataset dimensions. While they are not as competitive on simple tasks and frequently cannot beat baseline models, PATE-GAN is well-suited to more complex tasks and often outperforms both graphical models. PATE-GAN can also improve when presented with more features and is better than DP-WGAN in almost all settings.

Overall, we are confident that our work will assist researchers and practitioners deploying DP synthetic data techniques in understanding the trade-offs and choosing the best candidate models, vis-à-vis the dataset features, desired privacy level, and the downstream task at hand.
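The contrast in finding 1 between per-column and per-iteration spending can be made concrete with a minimal accounting sketch. Under basic sequential composition, a total budget ε splits additively across the measurements that consume it; the helper names below are our own, and the sketch deliberately ignores the tighter zCDP/RDP accountants the evaluated models actually use:

```python
# Illustrative privacy-budget accounting under basic sequential composition.
# Real systems (e.g., MST with zCDP, DP-WGAN with an RDP accountant) obtain
# tighter bounds; this sketch only conveys how the budget is divided.

def per_column_budget(eps_total: float, d: int) -> float:
    """Graphical-model style: split the budget across d per-column
    (or per-marginal) measurements, each receiving eps_total / d."""
    return eps_total / d

def per_iteration_budget(eps_total: float, iterations: int) -> float:
    """Deep-generative style: split the budget across training steps,
    so the per-step budget shrinks as training runs longer."""
    return eps_total / iterations

# Doubling the number of columns halves each marginal's budget...
assert per_column_budget(1.0, 32) == 2 * per_column_budget(1.0, 64)
# ...while for a GAN it is the number of gradient steps that matters,
# independently of how many rows the dataset has.
assert per_iteration_budget(1.0, 1000) < per_column_budget(1.0, 64)
```

This also gives an intuition for the row/column asymmetry reported above: adding rows leaves both allocations untouched, whereas adding columns directly dilutes the graphical models' per-measurement budget.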

2. BACKGROUND AND RELATED WORK

In this section, we review related work on Differential Privacy and data dimensionality for queries, classification models, and generative models. In Appendix A, we also provide background information on DP, synthetic data generation, and DP generative models.

Notation. In the rest of the paper, we use n to denote the number of rows and d the number of columns of the real dataset. Also, ε denotes the privacy budget and δ the probability of failure of a DP mechanism (see Equation 1 in Appendix A).

DP Queries and Data Dimensionality. (Hay et al., 2016) benchmark 15 DP algorithms for range queries over 1- and 2-dimensional datasets, showing that increasing values of n reduce error. For small n, data-dependent algorithms tend to perform better; for large n, data-independent algorithms dominate. For (more complex) predicate counting queries and higher-dimensional data, (McKenna et al., 2021a) propose a method with consistent utility improvements and show that increasing d results in larger errors. However, they experiment with datasets with at most 15 features, and their model struggles to scale beyond 30-dimensional datasets.
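One intuition behind the n-dependence observed by (Hay et al., 2016) is that the noise added to a counting query has a scale independent of n, so the relative error shrinks as the dataset grows. A minimal sketch of such a mechanism, the classic Laplace mechanism on a sensitivity-1 count (illustrative only; the function name is our own, and the benchmarked algorithms are considerably more sophisticated):

```python
import math
import random

def laplace_count(data, predicate, eps: float) -> float:
    """(eps, 0)-DP counting query: a count has sensitivity 1 (adding or
    removing one row changes it by at most 1), so Laplace noise with
    scale 1/eps suffices."""
    true_count = sum(1 for row in data if predicate(row))
    # Sample Laplace(0, 1/eps) via the inverse CDF.
    u = random.random() - 0.5
    scale = 1.0 / eps
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise
```

Since the noise magnitude is on the order of 1/ε regardless of dataset size, the relative error of the count behaves like 1/(n·ε), which is consistent with larger n reducing error in the benchmarks discussed above.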

