PRIVATE POST-GAN BOOSTING

Abstract

Differentially private GANs have proven to be a promising approach for generating realistic synthetic data without compromising the privacy of individuals. However, due to the privacy-protective noise introduced during training, the convergence of GANs becomes even more elusive, which often leads to poor utility in the output generator at the end of training. We propose Private post-GAN boosting (Private PGB), a differentially private method that combines samples produced by the sequence of generators obtained during GAN training to create a high-quality synthetic dataset. To that end, our method leverages the Private Multiplicative Weights method (Hardt and Rothblum, 2010) to reweight generated samples. We evaluate Private PGB on two-dimensional toy data, MNIST images, US Census data, and a standard machine learning prediction task. Our experiments show that Private PGB improves upon a standard private GAN approach across a collection of quality measures. We also provide a non-private variant of PGB that improves the data quality of standard GAN training.

1. INTRODUCTION

The vast collection of detailed personal data, ranging from medical histories and voting records to GPS traces and online behavior, promises to enable researchers from many disciplines to conduct insightful data analyses. However, many of these datasets contain sensitive personal information, and there is a growing tension between data analysis and data privacy. To protect the privacy of individual citizens, many organizations, including Google (Erlingsson et al., 2014), Microsoft (Ding et al., 2017), Apple (Differential Privacy Team, Apple, 2017), and more recently the 2020 US Census (Abowd, 2018), have adopted differential privacy (Dwork et al., 2006) as a mathematically rigorous privacy measure. However, working with noisy statistics released under differential privacy requires specialized training on the part of analysts. A natural and promising approach to tackle this challenge is to release differentially private synthetic data: a privatized version of the dataset that consists of fake data records and that approximates the real dataset on important statistical properties of interest. Since the synthetic data already satisfy differential privacy, they enable researchers to interact with the data freely and to perform the same analyses even without expertise in differential privacy.

A recent line of work (Beaulieu-Jones et al., 2019; Xie et al., 2018; Yoon et al., 2019) studies how to generate synthetic data by incorporating differential privacy into generative adversarial networks (GANs) (Goodfellow et al., 2014). Although GANs provide a powerful framework for generating synthetic data, they are also notoriously hard to train, and the privacy constraint imposes even more difficulty. Due to the added noise in the private gradient updates, it is often difficult to reach convergence with private training. In this paper, we study how to improve the quality of the synthetic data produced by private GANs.
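To illustrate where the convergence difficulty comes from, the following is a minimal sketch of a single differentially private gradient step in the style of DP-SGD (Abadi et al., 2016): each per-example gradient is clipped to bound its sensitivity, and Gaussian noise calibrated to the clipping norm is added to the average. The function name and default hyperparameters here are illustrative, not the exact configuration used by any of the cited private GAN methods.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.1, rng=None):
    """One DP-SGD-style update: clip each per-example gradient,
    average, then add Gaussian noise scaled to the clipping norm."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale the gradient down if its norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Noise standard deviation shrinks with the batch size but is added
    # at every step, which is what destabilizes adversarial training.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=avg.shape)
    return params - lr * (avg + noise)
```

Because this noise is injected into every discriminator update, the usual GAN training signal is perturbed throughout, and the final generator can land far from a good approximation of the data distribution.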
Unlike much of the prior work that focuses on fine-tuning network architectures and training techniques, we propose Private post-GAN boosting (Private PGB), a differentially private method that boosts the quality of the generated samples after the training of a GAN. Our method can be viewed as a simple and practical amplification scheme that improves the distribution from any existing black-box GAN training method, private or not. We take inspiration from an empirical observation in Beaulieu-Jones et al. (2019) that even though the generator distribution at the end of private training may be a poor approximation to the data distribution (due to, e.g., mode collapse), there may exist a high-quality mixture distribution given by several generators over different training epochs. PGB is a principled method for finding such a mixture at a moderate privacy cost and without any modification of the GAN training procedure. To derive PGB, we first formulate a two-player zero-sum game, called the post-GAN zero-sum game, between a synthetic data player, who chooses a distribution over generated samples across training epochs to emulate the real dataset, and a distinguisher player, who tries to distinguish generated samples from real samples using the set of discriminators from training epochs. We show that under a "support coverage" assumption the synthetic data player's mixed strategy (given by a distribution over the generated samples) at an equilibrium can successfully "fool" the distinguisher: no mixture of discriminators can distinguish the real versus fake examples better than random guessing. While this strict assumption does not always hold in practice, we demonstrate empirically that the synthetic data player's equilibrium mixture consistently improves the GAN distribution. The Private PGB method then privately computes an approximate equilibrium in the game.
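To make the post-GAN zero-sum game concrete, the following is a minimal non-private sketch of the dynamics: the synthetic data player maintains multiplicative weights over the pool of generated samples, while the distinguisher best-responds by picking the stored discriminator that best separates the current weighted mixture from real data. The names (`pgb_mixture`, `disc_scores`) and the exact best-response rule are illustrative assumptions; the private variant would select the discriminator under noise, which is omitted here.

```python
import numpy as np

def pgb_mixture(disc_scores, num_rounds=50, eta=0.1):
    """Toy sketch of the post-GAN zero-sum game (non-private variant).

    disc_scores: array of shape (n_discriminators, n_samples), where
    entry (j, i) is discriminator j's predicted probability that
    generated sample i is real. Returns a weight vector over samples.
    """
    n_disc, n_samples = disc_scores.shape
    log_w = np.zeros(n_samples)        # synthetic data player's log-weights
    avg_weights = np.zeros(n_samples)  # average play approximates equilibrium
    for _ in range(num_rounds):
        weights = np.exp(log_w - log_w.max())
        weights /= weights.sum()
        # Distinguisher best-responds: choose the discriminator that finds
        # the current weighted fake mixture least realistic.
        j = np.argmin(disc_scores @ weights)
        # Multiplicative-weights update: upweight samples that this
        # discriminator scores as realistic, i.e. samples that "fool" it.
        log_w += eta * disc_scores[j]
        avg_weights += weights
    return avg_weights / num_rounds
```

Averaging the synthetic data player's strategies over rounds, rather than taking the last iterate, is the standard way no-regret dynamics approximate an equilibrium of a zero-sum game.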
The algorithm can be viewed as a computationally efficient variant of MWEM (Hardt & Rothblum, 2010; Hardt et al., 2012), which is an inefficient query release algorithm with near-optimal sample complexity. Since MWEM maintains a distribution over exponentially many "experts" (the set of all possible records in the data domain), it runs in time exponential in the dimension of the data. In contrast, we rely on the private GAN to reduce the support to the set of privately generated samples, which makes PGB tractable even for high-dimensional data. We also provide an extension of the PGB method that incorporates the technique of discriminator rejection sampling (Azadi et al., 2019; Turner et al., 2019). We leverage the fact that the distinguisher's equilibrium strategy, which is a mixture of discriminators, can often accurately predict which samples are unlikely, and thus can be used as a rejection sampler. This allows us to further improve the PGB distribution through rejection sampling without any additional privacy cost, since differential privacy is preserved under post-processing. Our Private PGB method also has a natural non-private variant, which we show improves GAN training without privacy constraints.

We empirically evaluate both the Private and Non-Private PGB methods on several tasks. To visualize the effects of our methods, we first evaluate them on a two-dimensional toy dataset with samples drawn from a mixture of 25 Gaussian distributions. We define a relevant quality score function and show that both the Private and Non-Private PGB methods improve the score of the samples generated from the GAN. We then show that the Non-Private PGB method can also be used to improve the quality of images generated by GANs on the MNIST dataset. Finally, we focus on applications with high relevance for privacy protection.
First, we synthesize US Census datasets and demonstrate that the PGB method can improve the generator distribution on several statistical measures, including 3-way marginal distributions and pMSE. Second, we evaluate the PGB methods on a dataset with a natural classification task. We train predictive models on samples from Private PGB and on samples from a private GAN (without PGB), and show that PGB consistently improves model accuracy on real out-of-sample test data.

Related work. Our PGB method can be viewed as a modular boosting method that can improve on a growing line of work on differentially private GANs (Beaulieu-Jones et al., 2019; Xie et al., 2018; Frigerio et al., 2019; Torkzadehmahani et al., 2020). To obtain formal privacy guarantees, these algorithms optimize the discriminators in GAN under differential privacy, using private SGD, RMSprop, or Adam, and track the privacy cost with the moments accountant (Abadi et al., 2016; Mironov, 2017). Yoon et al. (2019) give a private GAN training method by adapting ideas from the PATE framework (Papernot et al., 2018). Our PGB method is inspired by the Private Multiplicative Weights method (Hardt & Rothblum, 2010) and its more practical variant MWEM (Hardt et al., 2012), which answer a large collection of statistical queries by releasing a synthetic dataset. Our work also draws upon two recent techniques (Turner et al., 2019; Azadi et al., 2019) that use the discriminator as a rejection sampler to improve the generator distribution. We apply their technique by using the mixture discriminator computed in PGB as the rejection sampler. There has also been work that applies the idea of boosting to (non-private) GANs. For example, Arora et al. (2017) and Hoang et al. (2018) propose methods

