DIFFERENTIALLY PRIVATE GENERATIVE MODELS THROUGH OPTIMAL TRANSPORT

Abstract

Although machine learning models trained on massive data have led to breakthroughs in several areas, their deployment in privacy-sensitive domains remains limited due to restricted access to data. Generative models trained with privacy constraints on private data can sidestep this challenge and provide indirect access to the private data instead. We propose DP-Sinkhorn, a novel optimal transport-based generative method for learning data distributions from private data with differential privacy. DP-Sinkhorn relies on minimizing the Sinkhorn divergence, a computationally efficient approximation to the exact optimal transport distance, between the model and the data in a differentially private manner, and also uses a novel technique for conditional generation in the Sinkhorn framework. Unlike existing approaches for training differentially private generative models, which are mostly based on generative adversarial networks, we do not rely on adversarial objectives, which are notoriously difficult to optimize, especially in the presence of noise imposed by the privacy constraints. Hence, DP-Sinkhorn is easy to train and deploy. Experimentally, despite our method's simplicity, we improve upon the state-of-the-art on multiple image modeling benchmarks. We also show differentially private synthesis of informative RGB images, which has not been demonstrated before by differentially private generative models without the use of auxiliary public data.
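For reference, a minimal sketch of the objective named above, written in notation standard in the optimal transport literature (the paper's own formulation appears later and may differ in details): given the data distribution μ, the model distribution ν, a ground cost c, and an entropic regularization strength ε > 0,

% Entropy-regularized optimal transport cost between mu and nu
% (KL denotes the Kullback-Leibler divergence):
W_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)}
    \int c(x, y)\, \mathrm{d}\pi(x, y)
    + \varepsilon\, \mathrm{KL}\!\left(\pi \,\|\, \mu \otimes \nu\right),
% The debiased Sinkhorn divergence subtracts the self-transport terms:
S_\varepsilon(\mu, \nu) = W_\varepsilon(\mu, \nu)
    - \tfrac{1}{2} W_\varepsilon(\mu, \mu)
    - \tfrac{1}{2} W_\varepsilon(\nu, \nu),

where Π(μ, ν) denotes the set of couplings with marginals μ and ν. The entropic regularization makes W_ε computable by the iterative Sinkhorn algorithm, and the two correction terms remove the entropic bias so that S_ε(μ, μ) = 0.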

1. INTRODUCTION

As the full value of data comes to fruition through a growing number of data-centric applications (e.g., recommender systems (Gomez-Uribe & Hunt, 2016), personalized medicine (Ho et al., 2020), face recognition (Wang & Deng, 2020), speech synthesis (Oord et al., 2016), etc.), the importance of privacy protection has become apparent to both the public and academia. At the same time, recent Machine Learning (ML) algorithms and applications are increasingly data hungry, and the use of personal data will eventually become a necessity.

Differential Privacy (DP) is a rigorous definition of privacy that quantifies the amount of information leaked by a user participating in any data release (Dwork et al., 2006; Dwork & Roth, 2014). DP was originally designed for answering queries to statistical databases. In a typical setting, a data analyst (the party wanting to use the data, such as a healthcare or marketing company) sends a query to a data curator (the party in charge of safekeeping the database, such as a hospital), who evaluates the query on the database and replies with a semi-random answer that preserves privacy.

Differentially Private Stochastic Gradient Descent (DPSGD)¹ (Abadi et al., 2016) is the most popular method for training general machine learning models with DP guarantees. DPSGD involves large numbers of queries, in the form of gradient computations, that must be answered quickly by the curator. This requires transferring model designs from the analyst to the curator, as well as strong computational capacity on the curator's side. Furthermore, if the analyst wants to train on multiple tasks, the curator must subdivide the privacy budget across the tasks. As few institutions have simultaneous access to private data, computational resources, and machine learning expertise, these requirements significantly limit the adoption of DPSGD for learning with privacy guarantees.

To address this challenge, generative models, i.e., models with the capacity to synthesize new data, can be applied as a general medium for data-sharing (Xie et al., 2018; Augenstein et al., 2020). The

¹ Including any variants that use gradient perturbation for ensuring privacy.
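To make the DPSGD mechanism described above concrete: following the definition the paper cites (Dwork et al., 2006), a randomized mechanism M is (ε, δ)-differentially private if Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ for all datasets D, D′ differing in a single record and all measurable sets S. DPSGD meets this definition by clipping each per-example gradient to a fixed norm and adding Gaussian noise before every parameter update. Below is a minimal PyTorch sketch of one such step; the function name dpsgd_step and the hyperparameters clip_norm and noise_multiplier are our illustrative choices, not a reproduction of the reference implementation of Abadi et al. (2016).

import torch

def dpsgd_step(model, loss_fn, xs, ys, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    # Accumulate clipped per-example gradients. The naive loop is for
    # clarity; practical implementations vectorize this computation.
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        # Rescale so each example's gradient has norm at most clip_norm.
        scale = (clip_norm / (total_norm + 1e-12)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    # Add Gaussian noise calibrated to the clipping bound, then take an
    # averaged gradient step. Privacy accounting (mapping noise_multiplier,
    # sampling rate, and step count to a final (eps, delta)) is omitted.
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = noise_multiplier * clip_norm * torch.randn_like(s)
            p.add_(-(lr / len(xs)) * (s + noise))

For a classifier, xs and ys would simply be the example and label tensors of a sampled minibatch. The per-example loop is the essential difference from ordinary SGD: clipping must happen before gradients are aggregated, which is also why DPSGD places a heavy computational burden on the curator answering these gradient queries.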

