LEARNING TO NOISE: APPLICATION-AGNOSTIC DATA SHARING WITH LOCAL DIFFERENTIAL PRIVACY

Abstract

In recent years, the collection and sharing of individuals' private data has become commonplace in many industries. Local differential privacy (LDP) is a rigorous approach that uses a randomized algorithm to preserve privacy even from the database administrator, unlike the more standard central differential privacy. Under LDP, applying noise directly to high-dimensional data requires a level of noise that all but entirely destroys data utility. In this paper we introduce a novel, application-agnostic privatization mechanism that leverages representation learning to overcome the prohibitive noise requirements of direct methods, while maintaining the strict guarantees of LDP. We further demonstrate that data privatized with this mechanism can be used to train machine learning algorithms. Applications of this model include private data collection, private novel-class classification, and the augmentation of clean datasets with additional privatized features. We achieve significant gains in performance on downstream classification tasks relative to benchmarks that noise the data directly, which are state-of-the-art among application-agnostic LDP mechanisms for high-dimensional data sharing.

1. INTRODUCTION

The collection of personal data is ubiquitous, and unavoidable for many in everyday life. While this has undeniably improved the quality and user experience of many products and services, evidence of data misuse and data breaches (Sweeney, 1997; Jolly, 2020) has brought the concept of data privacy into sharp focus, fueling both regulatory changes and a shift in personal preferences. The onus has now fallen on organizations to determine whether they are willing and able to collect personal data under these changing expectations. There is thus a growing need to collect data in a privacy-preserving manner, such that it can still be used to improve products and services. Often coined the 'gold standard' of privacy guarantees, central differential privacy (CDP) (Dwork & Roth, 2014) protects against an adversary determining the presence of a user in a dataset. It provides a quantifiable definition, is robust to post-processing, and allows for the protection of groups of people via its composability property (Dwork et al., 2006; Dwork & Roth, 2014). The CDP framework relies on the addition of noise to the output of statistical queries on a dataset, in such a way that the same information can be extracted from that dataset whether or not any given individual is present. One can train machine learning models that are CDP with respect to the training set by using training methods such as DP-SGD (Abadi et al., 2016), DP-Adam (Gylberth et al., 2017) or PATE (Papernot et al., 2017). Given access to a clean labelled dataset, one could train, for example, a CDP classifier this way.
Similarly, one could 'share' data privately by training a generative model, such as a variational autoencoder (VAE) or generative adversarial network (GAN), with a CDP training method, and then constructing a synthetic dataset satisfying CDP with respect to the training set by generating samples from this model (Xie et al., 2018; Triastcyn & Faltings, 2019; Acs et al., 2019; Takagi et al., 2021). While some of these approaches require only an unlabelled training set, their applications are limited in that they generate only synthetic samples that are likely under the original training set distribution. Firstly, changes in the data distribution warrant the re-training of the generative model. Secondly, the synthetic points are only representative samples, so we lack any information about the features of given individuals. Clearly, this limits the range of applications: the model cannot be used to join private and clean datasets, for applications in which we expect a distributional shift, or for privately collecting new data. Furthermore, CDP requires a trustworthy database administrator. Federated learning (McMahan et al., 2017; Agarwal et al., 2018; Rodríguez-Barroso et al., 2020) is a related technique in which a model can be trained privately, without placing trust in a database administrator. The model is trained across multiple decentralized devices, removing the need to share data with a central server. Unfortunately, by construction, federated learning does not allow for any data to be privately shared or collected, and is only concerned with the privacy of the model. An approach that allows data collection, while protecting the privacy of an individual even against the database administrator, is to construct a mechanism that privatizes the features of a given individual locally, before collection.
Warner (1965) developed a mechanism, known as randomized response, to preserve the privacy of survey respondents: when answering a sensitive (binary) question, the respondent is granted plausible deniability by giving a truthful answer if a fair coin flip returns heads, and otherwise answering according to a second fair coin flip. Recent work has further developed this idea, often referred to as local differential privacy (LDP) (Kasiviswanathan et al., 2008; Duchi et al., 2013). LDP provides a mathematically provable privacy guarantee for members of a database against both adversaries and database administrators. Many existing LDP mechanisms focus on the collection of low-dimensional data such as summary statistics, or do not easily generalize to different data types. Erlingsson et al. (2014), Ding et al. (2017) and Tang et al. (2017) introduce methods for collecting such data repeatedly over time. Erlingsson et al. (2014) find that for one-time collection, directly noising data (after hashing to a Bloom filter) is their best approach for inducing privacy. Ren et al. (2018) extend this Bloom-filter-based approach, and attempt to estimate the clean distribution of their LDP data in order to generate an LDP synthetic dataset, though we note that the range of applications is limited, as with CDP synthetic generation above. In this paper, we adapt well-established techniques from representation learning to address the fundamental limitation of LDP in the context of high-dimensional data: datapoints in high-dimensional space require prohibitive levels of noise to privatize locally (the privacy budget, ε, naively scales linearly with the dimensionality). To motivate our approach, consider the fact that it is often a good approximation to assume that a given high-dimensional dataset lives on a much lower-dimensional manifold. Applying a general privatization mechanism to low-dimensional representations should thus enable us to learn how to add noise to the high-dimensional data efficiently and application-agnostically.
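As a concrete illustration, the standard two-coin variant of randomized response (answer truthfully if the first fair coin lands heads; otherwise answer according to a second fair coin) can be sketched as follows. The ε = ln 3 guarantee follows directly from the ratio of reporting probabilities.

```python
import math
import random

def randomized_response(truth: bool) -> bool:
    """Two-coin randomized response: report the truth if the first fair
    coin lands heads; otherwise report the outcome of a second fair coin."""
    if random.random() < 0.5:      # first coin: heads -> truthful answer
        return truth
    return random.random() < 0.5   # tails -> answer with a second coin

# Reporting probabilities: P(report True | truth True)  = 1/2 + 1/4 = 3/4
#                          P(report True | truth False) = 1/4
# The worst-case likelihood ratio between the two cases is 3, so the
# mechanism satisfies epsilon-LDP with epsilon = ln 3.
epsilon = math.log((0.5 + 0.25) / 0.25)
```

This worst-case ratio is exactly what the LDP definition bounds: no single report can shift an adversary's belief about the true answer by more than a factor of eᵋ = 3.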
Our approach is inspired by the VAE (Kingma & Welling, 2014; Rezende et al., 2014) ; we demonstrate that sampling in latent space is equivalent to passing a datapoint through an LDP Laplace mechanism. Furthermore, reconstructing a datapoint is equivalent to adding a complex noise distribution to the raw features, thereby inducing LDP. Our randomized algorithm, which we refer to as the variational Laplace mechanism (VLM), satisfies the strict guarantees of LDP, and is agnostic to both data type and downstream task. We demonstrate that we can use the data privatized with our mechanism to train downstream machine learning models that act on both clean and privatized data at inference time. Furthermore, we demonstrate multiple concrete applications of our model: we privately collect data from individuals for downstream model training; we use a transfer-learning-inspired approach to privately collect data of an unseen class type upon which we train a classifier; and we augment a clean dataset with additional privatized features to improve the accuracy of a classifier on the combined data. None of these applications can be solved with CDP, and we find significant performance gains over the naive approach of directly noising the data.
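The benefit of noising a learned representation rather than the raw features can be sketched with a toy stand-in: the snippet below uses a fixed random linear projection in place of a trained VAE encoder/decoder (this is an illustrative assumption, not the paper's VLM, which learns its encoder from data), and applies the Laplace mechanism in the low-dimensional latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(z, sensitivity, epsilon, rng):
    # Laplace mechanism: scale b = sensitivity / epsilon yields epsilon-LDP
    # for representations whose L1 diameter is bounded by `sensitivity`.
    return z + rng.laplace(0.0, sensitivity / epsilon, size=z.shape)

# Stand-in "encoder"/"decoder": an orthonormal linear projection in place
# of a trained VAE; real use would learn these maps from data.
d, k = 100, 4                       # ambient and latent dimensionality
W, _ = np.linalg.qr(rng.standard_normal((d, k)))

x = rng.standard_normal(d)          # a raw datapoint
z = np.clip(W.T @ x, -1.0, 1.0)     # encode and clip: L1 diameter <= 2k
z_priv = laplace_mechanism(z, sensitivity=2 * k, epsilon=1.0, rng=rng)
x_priv = W @ z_priv                 # decode the privatized representation

# Noising the k-dimensional code requires far less total noise than
# noising all d raw features for the same epsilon, since the budget
# is split over k rather than d coordinates.
```

Because LDP is robust to post-processing, the decoding step adds no privacy cost: the reconstruction is a deterministic function of the already-privatized code.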

2. BASIC DEFINITIONS AND NOTATION

To formalize the concept of differential privacy, we first introduce some definitions and notation.

Definition ((ε, δ)-central differential privacy): Let A : D → Z be a randomized algorithm that takes as input datasets from the dataset domain D. We say A is (ε, δ)-central differentially private if, for ε, δ ≥ 0, for all subsets S ⊆ Z, and for all neighboring datasets D, D′ ∈ D, we have

p(A(D) ∈ S) ≤ exp(ε) p(A(D′) ∈ S) + δ    (1)

where D and D′ are neighboring if they are identical in all but one datapoint.
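The definition can be checked numerically for the classic Laplace mechanism. The sketch below is a standard textbook example (not specific to this paper), assuming a counting query, which changes by at most 1 between neighboring datasets, so adding Laplace(1/ε) noise satisfies (ε, 0)-CDP.

```python
import math

def laplace_pdf(y: float, mu: float, b: float) -> float:
    """Density of the Laplace distribution with location mu and scale b."""
    return math.exp(-abs(y - mu) / b) / (2 * b)

epsilon = 0.5
b = 1.0 / epsilon            # scale = sensitivity / epsilon, sensitivity 1
count_D, count_Dp = 42, 43   # query outputs on neighboring datasets D, D'

# Definition (1) with delta = 0: for every output y, the ratio of output
# densities under D and D' must be at most exp(epsilon).
max_ratio = max(
    laplace_pdf(y / 10, count_D, b) / laplace_pdf(y / 10, count_Dp, b)
    for y in range(0, 1000)
)
assert max_ratio <= math.exp(epsilon) + 1e-9
```

The ratio attains its maximum exp(ε) on outputs below both query values, where the two densities differ by exactly the sensitivity in the exponent.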




