LEARNING TO NOISE: APPLICATION-AGNOSTIC DATA SHARING WITH LOCAL DIFFERENTIAL PRIVACY

Abstract

In recent years, the collection and sharing of individuals' private data has become commonplace in many industries. Local differential privacy (LDP) is a rigorous approach that preserves privacy even from the database administrator, unlike the more standard central differential privacy, by applying a randomized algorithm to each user's data locally. When noise is applied directly to high-dimensional data, however, the level of noise LDP requires all but destroys data utility. In this paper we introduce a novel, application-agnostic privatization mechanism that leverages representation learning to overcome the prohibitive noise requirements of direct methods, while maintaining the strict guarantees of LDP. We further demonstrate that data privatized with this mechanism can be used to train machine learning algorithms. Applications of this model include private data collection, private novel-class classification, and the augmentation of clean datasets with additional privatized features. We achieve significant gains in performance on downstream classification tasks relative to benchmarks that noise the data directly, which are the state of the art among application-agnostic LDP mechanisms for high-dimensional data sharing.

1. INTRODUCTION

The collection of personal data is ubiquitous, and unavoidable for many in everyday life. While this has undeniably improved the quality and user experience of many products and services, evidence of data misuse and data breaches (Sweeney, 1997; Jolly, 2020) has brought the concept of data privacy into sharp focus, fueling both regulatory changes and a shift in personal preferences. The onus has now fallen on organizations to determine whether they are willing and able to collect personal data under these changing expectations. There is thus a growing need to collect data in a privacy-preserving manner that still allows it to be used to improve products and services. Often coined the 'gold standard' of privacy guarantees, central differential privacy (CDP) (Dwork & Roth, 2014) protects against an adversary determining the presence of a user in a dataset. It provides a quantifiable definition, is robust to post-processing, and allows for the protection of groups of people via its composability property (Dwork et al., 2006; Dwork & Roth, 2014). The CDP framework relies on the addition of noise to the output of statistical queries on a dataset, in such a way that the same information can be extracted from that dataset whether or not any given individual is present. One can train a machine learning model that is CDP with respect to its training set by using training methods such as DP-SGD (Abadi et al., 2016), DP-Adam (Gylberth et al., 2017) or PATE (Papernot et al., 2017). Given access to a clean labelled dataset, one could train, for example, a CDP classifier this way.
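To make the central/local distinction concrete, the following sketch contrasts two textbook mechanisms: the Laplace mechanism, in which a trusted curator noises the output of a count query (CDP), and randomized response, in which each user randomizes their own bit before it leaves their device (LDP). These are standard illustrative mechanisms, not the mechanism proposed in this paper; the function names and the calibration shown are ours.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def laplace_count(data, predicate, epsilon):
    """Central DP: a trusted curator sees raw data and noises the query output.

    A count query has L1 sensitivity 1 (adding or removing one individual
    changes the count by at most 1), so Laplace noise of scale 1/epsilon
    yields epsilon-DP for the released count.
    """
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)

def randomized_response(bit, epsilon):
    """Local DP: each user randomizes their own binary attribute.

    Reporting the true bit with probability e^eps / (1 + e^eps), and its
    flip otherwise, satisfies epsilon-LDP for a single bit.
    """
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def estimate_count(responses, epsilon):
    """Debias the aggregated randomized responses to estimate the true count."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    n = len(responses)
    return (sum(responses) - n * (1.0 - p)) / (2.0 * p - 1.0)
```

Note the cost of the weaker trust model: randomized response noises every record rather than one aggregate, so for the same epsilon the estimator's error grows with the number of users, foreshadowing the utility problem LDP faces on high-dimensional data.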
Similarly, one could 'share' data privately by training a generative model, such as a variational autoencoder (VAE) or generative adversarial network (GAN), with a CDP training method, and then construct a synthetic dataset satisfying CDP with respect to the training set by generating samples from this model (Xie et al., 2018; Triastcyn & Faltings, 2019; Acs et al., 2019; Takagi et al., 2021). While some of these approaches require only an unlabelled training set, their applications are limited in that they generate only synthetic samples that are likely under the original training set distribution. Firstly, changes in the data distribution warrant the re-training of the generative model. Secondly, the synthetic points are only representative samples, and so we lack any information about the features of given individuals. Clearly, this limits the range of applications: the model cannot be used to join private and clean datasets, for applications in which

