LEARNING TO NOISE: APPLICATION-AGNOSTIC DATA SHARING WITH LOCAL DIFFERENTIAL PRIVACY

Abstract

In recent years, the collection and sharing of individuals' private data has become commonplace in many industries. Local differential privacy (LDP) is a rigorous approach that uses a randomized algorithm to preserve privacy even from the database administrator, unlike the more standard central differential privacy. Under LDP, however, the level of noise required to privatize high-dimensional data directly all but destroys its utility. In this paper we introduce a novel, application-agnostic privatization mechanism that leverages representation learning to overcome the prohibitive noise requirements of direct methods, while maintaining the strict guarantees of LDP. We further demonstrate that data privatized with this mechanism can be used to train machine learning algorithms. Applications of this model include private data collection, private novel-class classification, and the augmentation of clean datasets with additional privatized features. We achieve significant gains in performance on downstream classification tasks relative to benchmarks that noise the data directly, which are state-of-the-art in the context of application-agnostic LDP mechanisms for high-dimensional data sharing tasks.

1. INTRODUCTION

The collection of personal data is ubiquitous, and unavoidable for many in everyday life. While this has undeniably improved the quality and user experience of many products and services, evidence of data misuse and data breaches (Sweeney, 1997; Jolly, 2020) has brought the concept of data privacy into sharp focus, fueling both regulatory changes and a shift in personal preferences. The onus has now fallen on organizations to determine whether they are willing and able to collect personal data under these changing expectations. There is thus a growing need to collect data in a privacy-preserving manner, such that it can still be used to improve products and services. Often called the 'gold standard' of privacy guarantees, central differential privacy (CDP) (Dwork & Roth, 2014) protects against an adversary determining the presence of a user in a dataset. It provides a quantifiable definition, is robust to post-processing, and allows for the protection of groups of people via its composability property (Dwork et al., 2006; Dwork & Roth, 2014). The CDP framework relies on the addition of noise to the output of statistical queries on a dataset, in such a way that the same information can be extracted from that dataset whether or not any given individual is present. One can train machine learning models such that the model is CDP with respect to the training set by using training methods such as DP-SGD (Abadi et al., 2016), DP-Adam (Gylberth et al., 2017) or PATE (Papernot et al., 2017). Given access to a clean labelled dataset, one could train, for example, a CDP classifier this way.
Similarly, one could 'share' data privately by training a generative model, such as a variational autoencoder (VAE) or generative adversarial network (GAN), with a CDP training method, and then construct a synthetic dataset satisfying CDP with respect to the training set by generating samples from this model (Xie et al., 2018; Triastcyn & Faltings, 2019; Acs et al., 2019; Takagi et al., 2021). While some of these approaches require only an unlabelled training set, their applications are limited in that they generate only synthetic samples that are likely under the original training set distribution. Firstly, changes in the data distribution warrant the re-training of the generative model. Secondly, the synthetic points are only representative samples, and so we lack any information about the features of given individuals. Clearly, this limits the range of applications: the model cannot be used to join private and clean datasets, for applications in which we expect a distributional shift, or for privately collecting new data. Furthermore, CDP requires a trustworthy database administrator. Federated learning (McMahan et al., 2017; Agarwal et al., 2018; Rodríguez-Barroso et al., 2020) is a related technique in which a model can be trained privately without placing trust in a database administrator. The model is trained across multiple decentralized devices, obviating the need to share data with a central server. Unfortunately, by construction, federated learning does not allow for any data to be privately shared or collected, and is only concerned with the privacy of the model. An approach that allows data collection, while protecting the privacy of an individual against even the database administrator, is to construct a mechanism that privatizes the features of a given individual locally, before collection.
Warner (1965) developed a mechanism, known as randomized response, to preserve the privacy of survey respondents: when answering a sensitive (binary) question, the respondent is granted plausible deniability by giving a truthful answer if a fair coin flip returns heads, and answering yes otherwise. Recent work has further developed this idea, often referred to as local differential privacy (LDP) (Kasiviswanathan et al., 2008; Duchi et al., 2013). LDP provides a mathematically provable privacy guarantee for members of a database against both adversaries and database administrators. Many existing LDP mechanisms focus on the collection of low-dimensional data like summary statistics, or on mechanisms which do not easily generalize to different data types. Erlingsson et al. (2014), Ding et al. (2017) and Tang et al. (2017) introduce methods for collecting such data repeatedly over time. Erlingsson et al. (2014) find that for one-time collection, directly noising data (after hashing to a Bloom filter) is their best approach for inducing privacy. Ren et al. (2018) extend this Bloom-filter-based approach, and attempt to estimate the clean distribution of their LDP data in order to generate an LDP synthetic dataset, though we note that the range of applications is limited, as with CDP synthetic generation above. In this paper, we adapt well-established techniques from representation learning to address the fundamental limitation of LDP in the context of high-dimensional data: datapoints in high-dimensional space require prohibitive levels of noise to locally privatize (the privacy budget, ε, naively scales linearly with the dimensionality). To motivate our approach, consider the fact that it is often a good approximation to assume that a given high-dimensional dataset lives on a much lower dimensional manifold.
Applying a general privatization mechanism to low-dimensional representations should thus enable us to learn how to add noise to the high-dimensional data efficiently and application-agnostically. Our approach is inspired by the VAE (Kingma & Welling, 2014; Rezende et al., 2014); we demonstrate that sampling in latent space is equivalent to passing a datapoint through an LDP Laplace mechanism. Furthermore, reconstructing a datapoint is equivalent to adding noise from a complex distribution to the raw features, thereby inducing LDP. Our randomized algorithm, which we refer to as the variational Laplace mechanism (VLM), satisfies the strict guarantees of LDP, and is agnostic to both data type and downstream task. We demonstrate that we can use the data privatized with our mechanism to train downstream machine learning models that act on both clean and privatized data at inference time. Furthermore, we demonstrate multiple concrete applications of our model: we privately collect data from individuals for downstream model training; we use a transfer-learning-inspired approach to privately collect data of an unseen class type upon which we train a classifier; and we augment a clean dataset with additional privatized features to improve the accuracy of a classifier on the combined data. None of these applications can be solved with CDP, and we find significant performance gains over the naive approach of directly noising the data.

2. BASIC DEFINITIONS AND NOTATION

To formalize the concept of differential privacy, we first introduce some definitions and notation.

Definition ((ε, δ)-central differential privacy): Let A : D → Z be a randomized algorithm that takes as input datasets from the dataset domain D. We say A is (ε, δ)-central differentially private if for ε, δ ≥ 0, for all subsets S ⊆ Z, and for all neighboring datasets D, D′ ∈ D, we have

p(A(D) ∈ S) ≤ exp(ε) p(A(D′) ∈ S) + δ    (1)

where for D and D′ to be neighboring means that they are identical in all but one datapoint. Intuitively, this states that one cannot tell (with a level of certainty determined by (ε, δ)) whether an individual is present in a database or not. When δ = 0 we say A satisfies ε-CDP.

Definition (ℓ1 sensitivity): The ℓ1 sensitivity of a function f : D → R^k is defined as

Δf = max_{adjacent(D, D′)} ||f(D) − f(D′)||_1    (2)

where adjacent(D, D′) implies D, D′ ∈ D are neighboring datasets.

Definition (Laplace mechanism): The Laplace mechanism M^(central) : D → R^k is a randomized algorithm defined as

M^(central)(D, f(·), ε) = f(D) + (s_1, . . . , s_k)    (3)

for D ∈ D, s_i ∼ Laplace(0, Δf/ε), and some transformation function f : D → R^k. The Laplace mechanism induces ε-CDP; see Dwork & Roth (2014) for proof.

While CDP relies on a trusted database administrator, LDP provides a much stricter guarantee in which the individual does not need to trust an administrator. Instead, individuals are able to privatize their data before sending it, using a local randomized algorithm.

Definition ((ε, δ)-local differential privacy): A local randomized algorithm A : X → Z, that takes as input a datapoint from the data domain X, satisfies (ε, δ)-local differential privacy if for ε, δ ≥ 0, for all S ⊆ Z, and for any inputs x, x′ ∈ X,

p(A(x) ∈ S) ≤ exp(ε) p(A(x′) ∈ S) + δ    (4)

When δ = 0 we say A satisfies ε-LDP.
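To make the Laplace mechanism of Equation 3 concrete, the following is a minimal NumPy sketch (the function name and the example counting query are ours, for illustration only):

```python
import numpy as np

def laplace_mechanism(dataset, query, sensitivity, epsilon, rng=None):
    """Central Laplace mechanism: add Laplace(0, sensitivity/epsilon) noise
    to each coordinate of the query output, as in Equation 3."""
    rng = rng or np.random.default_rng()
    value = np.atleast_1d(np.asarray(query(dataset), dtype=float))
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=value.shape)
    return value + noise

# Example: a counting query has l1 sensitivity 1, since adding or removing
# one individual changes the count by at most 1.
ages = np.array([34, 29, 41, 52, 47])
count_over_40 = lambda d: np.sum(d > 40)
private_count = laplace_mechanism(ages, count_over_40, sensitivity=1.0, epsilon=1.0)
```

Smaller ε means a larger noise scale Δf/ε and therefore a stronger privacy guarantee at the cost of accuracy.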

Definition (Local Laplace mechanism):

The local Laplace mechanism M^(local) : X → R^k is a randomized algorithm defined as

M^(local)(x, f(·), ε) = f(x) + (s_1, . . . , s_k)    (5)

for x ∈ X, s_i ∼ Laplace(0, Δf/ε), and some transformation function f : X → Z, where Z ⊆ R^k and the ℓ1 sensitivity of f(·) is defined as Δf = max_{x, x′ ∈ X} ||f(x) − f(x′)||_1. The local Laplace mechanism satisfies ε-LDP (see Appendix A for proof). Another common choice of mechanism for privatizing continuous data is the Gaussian mechanism, which satisfies (ε, δ > 0)-LDP. For the remainder of the paper, however, we exclusively study the local Laplace mechanism since it provides a strong privacy guarantee (i.e. δ = 0). We note that our approach could be used to learn a Gaussian mechanism with minimal modification.

3. PROPOSED METHOD

Early work on data collection algorithms, such as randomized response, relied on very simple randomized algorithms to induce privacy. Throughout this paper we benchmark our results against such a mechanism, in which we add Laplace noise to all continuous features, and flip each of the categorical features with some probability. By the composition theorem (Dwork & Roth, 2014), each feature then contributes towards the overall LDP guarantee of the d-dimensional datapoint as ε = Σ_{i=1}^d ε_i. Since we have no prior knowledge about which features are most important, we choose the Laplace noise level (or flip probability) for each feature to be such that ε_i = ε/d. See Appendix E.2 for further details. As with our approach, this benchmark mechanism can act on any data type, and forms a downstream-task-agnostic LDP version of the data. As d increases, ε_i decreases for each feature i. The noise required to induce ε-LDP thus grows with data dimensionality. For high-dimensional datasets like images or large tables, features are often highly correlated; consequently, noising features independently is wasteful towards privatizing the information content in each datapoint. A more effective approach to privatization involves noising a learned lower-dimensional representation of each datapoint using a generic noising mechanism. To this end, we use a VAE-based approach to learn a low-dimensional latent representation of our data. This learned mapping from data space to latent space forms our function f(·) in Equation 5, and requires only a small unlabelled dataset from a similar distribution, which most organizations will typically already have access to, either internally or publicly. Applying the Laplace mechanism (as described in Section 3.1) thus ensures that the encoded latents, as well as reconstructed datapoints, satisfy LDP.
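The benchmark mechanism above can be sketched in NumPy as follows (the function name, argument layout, and per-feature sensitivities are our own illustrative assumptions; the flip probability for a K-category feature uses the randomized-response form p = (K − 1)/(e^{ε_i} + K − 1) derived later in Appendix C):

```python
import numpy as np

def benchmark_privatize(x_cont, x_cat, cat_sizes, epsilon, sensitivities, rng=None):
    """Benchmark mechanism: split the budget as eps_i = epsilon / d over all
    d features. Continuous features receive Laplace(0, sensitivity/eps_i)
    noise; each categorical feature is flipped to another category with
    probability p = (K - 1) / (exp(eps_i) + K - 1)."""
    rng = rng or np.random.default_rng()
    d = len(x_cont) + len(x_cat)
    eps_i = epsilon / d
    noisy_cont = x_cont + rng.laplace(0.0, np.asarray(sensitivities) / eps_i,
                                      size=len(x_cont))
    noisy_cat = []
    for value, K in zip(x_cat, cat_sizes):
        p_flip = (K - 1) / (np.exp(eps_i) + K - 1)
        if rng.random() < p_flip:
            # flip uniformly to one of the other K - 1 categories
            others = [c for c in range(K) if c != value]
            value = others[rng.integers(K - 1)]
        noisy_cat.append(value)
    return noisy_cont, noisy_cat
```

The key point is the ε/d split: as d grows, each feature's budget shrinks and the per-feature noise grows correspondingly.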
We can therefore privatize data at the latent level or the original-feature level; preference between these two options is application specific, and we investigate both experimentally. Data owners can apply this learned LDP mechanism to their data before sharing, and the database administrator forms an LDP data set from the collected data. This set, along with information on the type of noise added, is used to train downstream machine learning algorithms. Though our privatization method is task agnostic, in this paper we focus on classification tasks in which we have some features x for which we want to predict the corresponding label y. At inference time, we show that this classifier can act on either clean or privatized datapoints, depending on the application.

3.1. LEARNING A LAPLACE MECHANISM

We assume that our data x is generated by a random process involving a latent variable z. We then optimize a lower bound on the log likelihood (Kingma & Welling, 2014):

log p(x) = log ∫ p(z) p_θ(x|z) dz ≥ E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z))    (6)

where p(z) is the prior distribution and q_φ(z|x) is the approximate posterior. The generative distribution p_θ(x|z) and approximate inference distribution q_φ(z|x) are parameterized by neural networks, with learnable parameters θ and φ respectively. While the distributions over latent space are commonly modeled as Gaussian, we aim to learn a Laplace mechanism, and so instead we choose p(z) = Π_{i=1}^d p(z_i) and q_φ(z|x) = Π_{i=1}^d q_φ(z_i|x), where p(z_i) = Laplace(0, 1/√2) and q_φ(z_i|x) = Laplace(μ_φ(x)_i, b). We parameterize μ_φ(·) with a neural network and restrict its output via a carefully chosen activation function ν(·) acting on the final layer: μ_φ(·) = ν(h_φ(·)). This clips the output h_φ(·) such that all points are within a constant ℓ1-norm l of the origin, by re-scaling the position vector of points at a larger ℓ1 distance. In this way we ensure that Δμ_φ = 2l. With Δμ_φ finite, we note that if we fix the scale b = 2l/ε_x at inference time, then drawing a sample from our encoder distribution q_φ(z|x) is equivalent to passing a point x through the local Laplace mechanism M^(local)(x, μ_φ(·), ε_x) from Equation 5. Therefore, to obtain a representation z of x that satisfies ε_x-LDP, we simply pass it through the encoder mean function μ_φ(·) and add Laplace(0, 2l/ε_x) noise. We refer to this model as a variational Laplace mechanism (VLM). We further prove in Appendix B that a reconstruction x̃ obtained by passing z through the decoder network also satisfies ε_x-LDP, allowing us to privatize datapoints at either the latent level z or the original-feature level x̃. Note that b is always fixed at inference (i.e. data privatization) time to guarantee that z is ε_x-LDP.
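The inference-time privatization step can be sketched as follows. Here `encoder_mean` stands in for the trained network h_φ, and the function names are ours; the ℓ1-clipping activation ν(·) and the noise scale 2l/ε_x follow the construction above:

```python
import numpy as np

def l1_clip(h, l):
    """Activation nu(.): rescale any output whose l1 norm exceeds l back onto
    the l1 ball of radius l, so the clipped mean has sensitivity at most 2l."""
    norm = np.sum(np.abs(h))
    return h if norm <= l else h * (l / norm)

def privatize_latent(x, encoder_mean, l, epsilon_x, rng=None):
    """VLM at inference time: z = nu(h_phi(x)) + Laplace(0, 2l/epsilon_x) noise,
    i.e. the local Laplace mechanism of Equation 5 applied to mu_phi."""
    rng = rng or np.random.default_rng()
    mu = l1_clip(np.asarray(encoder_mean(x), dtype=float), l)
    return mu + rng.laplace(0.0, 2 * l / epsilon_x, size=mu.shape)
```

Because any two clipped means are at ℓ1 distance at most 2l, adding Laplace(0, 2l/ε_x) noise per coordinate yields the ε_x-LDP guarantee regardless of the encoder weights.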
However, we experiment with learning b during training, as well as fixing it to different values.

Certain applications of our model require us to share either the encoder or decoder of the VLM at inference time. If the VLM training data itself contains sensitive information, then the part of the network that gets shared must satisfy CDP with respect to this training data. We found the following two-stage VLM training approach to be helpful in these cases:

• Stage 1: Train a VLM with encoding distribution q_φ(z|x) and decoding distribution p_θ(x|z) using a non-DP optimizer, such as Adam (Kingma & Ba, 2015).

• Stage 2: If training a DP-encoder model, fix θ, and re-train the encoder with a new distribution q_{φ_private}(z|x). If training a DP-decoder model, fix φ, and replace the decoder with p_{θ_private}(x|z). Optimize φ_private or θ_private as appropriate using DP-Adam (Gylberth et al., 2017).

Section 4 and Appendix D outline applications in which private VLM components are required.

3.2. TRAINING ON PRIVATE DATA

For supervised learning we must also privatize our target y. For classification, y ∈ {1, . . . , K} is a discrete scalar. To obtain a private label ỹ, we flip y with some fixed probability p < (K − 1)/K to one of the other K − 1 categories:

p(ỹ = i | y = j) = (1 − p) I(i = j) + p/(K − 1) I(i ≠ j)    (7)

Setting p = (K − 1)/(e^{ε_y} + K − 1) induces ε_y-LDP (see Appendix C for proof). By the composition theorem (Dwork & Roth, 2014), the tuple (x̃, ỹ) satisfies ε-LDP where ε = ε_x + ε_y. Downstream models may be more robust to label noise than feature noise, or vice versa, so for a fixed ε we set ε_x = λε and ε_y = (1 − λ)ε, where λ is chosen to maximise the utility of the dataset. In practice, we treat λ as a model hyperparameter. Rather than training the classifier p_ψ(·) directly on private labels, we incorporate the known noise mechanism into our objective function:

log p(ỹ|x̃) = log Σ_{y=1}^K p(ỹ|y) p_ψ(y|x̃)    or    log p(ỹ|z) = log Σ_{y=1}^K p(ỹ|y) p_ψ(y|z)    (8)

depending on whether we choose to work at the feature level or the latent level. At inference time we can classify privatized points using p_ψ(y|x̃) or p_ψ(y|z). We also show empirically that we can classify clean points using the same classifier. We refer to these tasks as private and clean classification, with applications given in Sections 4 and 5.
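The label flip of Equation 7 and the marginalized objective of Equation 8 can be sketched as below (function names are ours; `class_probs` stands in for the classifier output p_ψ(y|·)):

```python
import numpy as np

def flip_label(y, K, eps_y, rng=None):
    """Randomized response over K classes: keep y with probability 1 - p,
    else move uniformly to one of the other K - 1 classes (Equation 7)."""
    rng = rng or np.random.default_rng()
    p = (K - 1) / (np.exp(eps_y) + K - 1)
    if rng.random() < p:
        others = [c for c in range(K) if c != y]
        return others[rng.integers(K - 1)]
    return y

def noisy_label_loglik(y_tilde, class_probs, eps_y):
    """log p(y~ | .) = log sum_y p(y~ | y) p_psi(y | .), marginalizing the
    known flip mechanism out of the classifier's clean-label probabilities."""
    K = len(class_probs)
    p = (K - 1) / (np.exp(eps_y) + K - 1)
    flip = np.full((K, K), p / (K - 1))
    np.fill_diagonal(flip, 1 - p)            # flip[i, j] = p(y~ = i | y = j)
    return float(np.log(flip[y_tilde] @ class_probs))
```

Training on `noisy_label_loglik` rather than treating ỹ as a clean target lets the classifier learn the clean-label distribution p_ψ(y|·) directly.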

3.3. PRIVATE VALIDATION AND HYPER-PARAMETER OPTIMIZATION

Typically for model validation, one needs access to clean labels y (and clean data x for validating a clean classifier). However, we note that we need only collect privatized model performance metrics on test and validation sets, rather than actually collect the raw datapoints. To do this, we send the trained classifier to members of the validation set so that they can test whether it classifies their datapoint correctly: c ∈ {0, 1}. They return this answer, flipped with probability p = 1/(e^ε + 1) such that the flipped answer c̃ satisfies ε-LDP, and we estimate the true validation set accuracy A = (1/N_val) Σ_{n=1}^{N_val} c_n from the privatized accuracy Ã = (1/N_val) Σ_{n=1}^{N_val} c̃_n using A = (Ã − p)/(1 − 2p) (Warner, 1965). We use this method to implement a grid search over the hyperparameters of our model.
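The debiasing step above is a one-liner; the following sketch (function name ours) recovers the true accuracy estimate from the flipped correctness bits:

```python
import numpy as np

def debiased_accuracy(flipped_correct, epsilon):
    """Estimate the true validation accuracy from epsilon-LDP correctness
    bits, each flipped with probability p = 1 / (e^epsilon + 1), via the
    Warner (1965) estimator A = (A~ - p) / (1 - 2p)."""
    p = 1.0 / (np.exp(epsilon) + 1.0)
    a_tilde = np.mean(flipped_correct)
    return (a_tilde - p) / (1.0 - 2.0 * p)
```

The estimator is unbiased since E[Ã] = A(1 − p) + (1 − A)p = p + A(1 − 2p), although its variance grows as ε (and hence 1 − 2p) shrinks.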

4. CLASSIFYING CLEAN DATAPOINTS: APPLICATIONS AND EXPERIMENTS

Below we demonstrate the versatility of our model by outlining a non-exhaustive list of potential applications, with corresponding experiments. Experiments are run on the MNIST dataset (LeCun et al., 1998) or the Lending Club dataset. The CDP requirements differ between applications, but are explicitly stated for each application in Appendix D. For all stated (ε, δ)-CDP results we use δ = 10^−5, whilst for (ε, δ)-LDP results, δ = 0. All results quoted are the mean of 3 trials; error bars represent ±1 standard deviation. Appendix E describes the experimental setup and dataset information. In Sections 4.1 and 4.2, we investigate the clean classification task, and report clean accuracy. Namely, the classifiers are trained on privatized data in order to classify clean datapoints at inference time. We also study the classification of privatized datapoints, using a classifier trained on privatized data, which we refer to as private classification (and for which we report private accuracy) in Section 5.

4.1. DATA COLLECTION

Organizations may have access to some (potentially unlabelled) clean internal data D_1, but want to collect privatized labelled data D_2 in order to train a machine learning algorithm. For example, a public health body may have access to public medical images, but want to train a diagnosis classifier, to be used in hospitals, on labelled data collected privately from their patients. Similarly, a tech company with access to data from a small group of users may want to train an in-app classifier; to do so they could collect private labelled training data from a broader group of users, before pushing the trained classifier to the app. Finally, a multinational company may be allowed to collect raw data on their US users, but only LDP data on users from countries with more restrictive data privacy laws. In this experiment, we split the data such that D_1 and D_2 follow the same data distribution; in practice this may not always be the case. For example, when D_2 is sales data collected in a different time period, or user data collected in a different region, we may expect the distribution to change. We have omitted such an experiment here, but the extreme case of this distributional shift is explored experimentally in Section 4.2. We run this experiment on both MNIST and Lending Club. As in Sections 3.1-3.3, we first train a VLM with a DP encoder using D_1, then privatize all datapoints and corresponding labels in D_2 before training a classifier on this privatized training set. Results are shown in Figure 1. For both MNIST and Lending Club, we significantly outperform the baseline approach of noising pixels directly. The benchmark performed at random accuracy for all ε_local ≤ 100 for MNIST (ε_local ≤ 20 for Lending Club). Our model performed well above random for all ε_local values tested. For MNIST, we see that latent-level classification outperforms feature-level classification for higher ε_local values.
Indeed, the data processing inequality states that we cannot gain more information about a given datapoint by passing it through our decoder. However, at lower ε_local we see that feature-level classification accuracy is higher. We hypothesize that at this point so much noise is added to the latent that the latent-level classifier struggles, while at the pixel level the VLM decoder improves classification by adding global dataset information from D_1 to the privatized point. For Lending Club, we do not see a clear difference between latent-level and feature-level accuracy. However, we also note that the features are not as highly correlated as in MNIST, so perhaps the decoder has less influence on results. Finally, we see that reducing ε_central has an adverse effect on MNIST classification accuracy, especially for higher ε_local. The effect seems negligible for Lending Club; we hypothesize that this is due to the large quantity of training data, along with the easier task of binary classification.

4.2. NOVEL-CLASS CLASSIFICATION

As discussed in Section 4.1, the internal data D_1 and the data to be collected D_2 may follow different data distributions. In the extreme case, the desired task on D_2 may be to predict membership of a class that is not even present in dataset D_1. For example, in a medical application there may be a large existing dataset D_1 of chest scans, but only a relatively small dataset D_2 that contains patients with a novel disease. As before, a public health body may want to train a novel-disease classifier to distribute to hospitals. Similarly, a software developer may have access to an existing dataset D_1, but want to predict software usage for D_2, whose label is specific to the UI of a new release. We run this experiment on MNIST, where the internal dataset D_1 contains training images from classes 0 to 8 (with a small number of images held out for classification), and D_2 contains all training images from class 9. As in Sections 3.1 and 3.2, we first train a VLM with a DP encoder on D_1, then privatize all images in D_2 (we are not required to collect or privatize the label since all images have the same label). We then train a binary classifier on the dataset formed of the private 9's and the held-out internal images from classes 0-8 (which we privatize and label 'not 9's'). Results are shown in Figure 2.

4.3. DATA JOINING

An organization training a classifier on some labelled dataset D_1 could potentially improve performance by augmenting their dataset with other informative features, and so may want to join D_1 with features from another dataset D_2. We assume the owner of D_2 may only be willing to share a privatized version of this dataset. For example, two organizations with mutual interests, such as the IRS and a private bank, or a fitness tracking company and a hospital, may want to join datasets to improve the performance of their algorithm. Similarly, it may be illegal for multinational organizations to share and join non-privatized client data between departments in different regions, but legal to do so when the shared data satisfies LDP. We run this experiment on Lending Club, where we divide the dataset slightly differently from previous experiments: both datasets contain all rows, but D_1 contains a subset of (clean) features, along with the clean label, and D_2 contains the remaining features (to be privatized). We follow a privatization procedure similar to that of Section 3.1, with the distinction that the VLM is both trained on D_2 and used to privatize D_2. For the classification problem, instead of Equation 8, we optimize log p_ψ(y_1 | x_1, x̃_2) where (x_1, y_1) ∈ D_1 and x̃_2 ∈ D_2^(private). We are not required to conduct a private grid search over hyperparameters as in Section 3.3, since we have access to all raw data needed for validation. Note that unlike the previous two experiments, we train the classifier on a combination of both clean and privatized features, and we classify this same 'semi-private' group of features at inference time. Results are shown in Figure 3. We can see that using features from D_1 only, we obtain a classification accuracy of 56.1%, while classifying on all (clean) features, we obtain 65.8% accuracy. The benchmark of noising the D_2 features directly never achieves more than a 1 percentage point accuracy increase over classifying the clean features only, whereas our model achieves a significant improvement for ε_local ∈ [4, 10]. We share the privatized features at the latent level in this experiment and so do not need to satisfy CDP.

5. CLASSIFYING PRIVATE DATAPOINTS: APPLICATIONS AND EXPERIMENTS

In Sections 4.1 and 4.2, we investigated the use of privatized training data to train algorithms that classify clean datapoints. In some use cases, however, we may want to train algorithms that act directly on LDP datapoints at inference time. Most notably, in the data collection framework, the organization may want to do inference on individuals whose data they have privately collected. However, from the definition of LDP in Equation 4, it is clear that a considerable amount of information about a datapoint x is lost after privatization, and in fact classification performance is fundamentally limited. In Appendix F, we show that for a given ε_local, the accuracy A of a K-class latent-level classifier acting on a privatized datapoint (where the privatization mechanism has latent dimension d ≥ K/2) is upper bounded by

A ≤ Σ_{j=0, j≠1}^{K/2−1} (K/2−1 choose j) (−1)^j (1 − j e^{−jε/2})/(1 + j) − e^{−ε/2}/(2 − ε) + (1/8)(K − 2) e^{−ε/2}    (10)

In Figure 4, we show the accuracy of our model from Section 4.1 (data collection) when applied to privatized datapoints at inference time, and compare to the upper bound in Equation 10. Running this experiment on MNIST, we see a considerable drop in performance when classifying privatized datapoints, compared with clean classification results. We are clearly not saturating the bound from Equation 10. While it may initially appear that our model is under-performing, we note that our model aims to build a downstream-task-agnostic privatized representation of the data. This means that the representation must contain more information than just the class label. Meanwhile, the upper bound is derived from the extreme setting in which the latent encodes only class information, and would be unable to solve any other downstream task. Though Equation 10 is constructed under the framework of latent-level classification, we do note that our feature-level classifier seems to marginally outperform the latent-level one (see Appendix G).
This may be a result of the decoder de-noising the latent to some extent.

6. CONCLUSION AND FUTURE WORK

In this paper we have introduced a framework for collecting and sharing data under the strict guarantees of LDP. We induce LDP by learning to efficiently add noise to any data type for which representation learning is possible. We have demonstrated a number of different applications of our framework, spanning important issues such as medical diagnosis, financial crime detection, and customer experience improvement, significantly outperforming existing baselines on each of these. This is the first use of latent-variable models for learning LDP Laplace mechanisms. We foresee that even stronger performance could be achieved by combining our method with state-of-the-art latent-variable models that utilise more complex architectures, and often deep hierarchies of latents (Gulrajani et al., 2017; Maaløe et al., 2019; Ho et al., 2019). In this work we sought to show that significant hurdles in LDP for high-dimensional data can be overcome using a representation-learning-driven randomized algorithm, and we believe the results presented here establish this.

A PROOF THAT THE LOCAL LAPLACE MECHANISM SATISFIES LDP

Claim: The local Laplace mechanism satisfies ε-local differential privacy.

Proof: We follow an approach similar to the proof in Dwork & Roth (2014) that the Laplace mechanism satisfies CDP. Let x and x′ be two arbitrary datapoints. Denote M^(local)(x) = f(x) + (s_1, . . . , s_k), where s_i ∼ Laplace(0, Δf/ε). Then for some arbitrary c we know that

p(M^(local)(x) = c) / p(M^(local)(x′) = c)
= Π_{i=1}^k p(M^(local)_i(x) = c_i) / p(M^(local)_i(x′) = c_i)    (11)
= Π_{i=1}^k exp(−ε|f_i(x) − c_i|/Δf) / exp(−ε|f_i(x′) − c_i|/Δf)    (12)
= Π_{i=1}^k exp(ε(|f_i(x′) − c_i| − |f_i(x) − c_i|)/Δf)    (13)
≤ Π_{i=1}^k exp(ε|f_i(x′) − f_i(x)|/Δf)    (14)
= exp(ε ||f(x′) − f(x)||_1 / Δf)    (15)
≤ exp(ε)    (16)

where the first inequality comes from the triangle inequality, and the second comes from the definition of Δf.

B PROOF THAT DECODED PRIVATE LATENTS SATISFY LDP

Claim: If a point in latent space satisfies ε-LDP, then this point still satisfies ε-LDP after being passed through a deterministic function, such as the function that parameterizes the mean of the decoder network.

Proof: We follow an approach similar to the proof that central differential privacy is immune to post-processing (Dwork & Roth, 2014). Consider an arbitrary deterministic mapping g : Z → X. Let S ⊆ X and T = {z ∈ Z : g(z) ∈ S}. Then

p(g(A(x)) ∈ S) = p(A(x) ∈ T)    (17)
≤ exp(ε) p(A(x′) ∈ T)    (18)
= exp(ε) p(g(A(x′)) ∈ S)    (19)

C PROOF THAT THE FLIP MECHANISM SATISFIES LDP

Claim: The flip mechanism p(ỹ = i | y = j) = (1 − p) I(i = j) + p/(K − 1) I(i ≠ j), where p = (K − 1)/(e^ε + K − 1), satisfies ε-local differential privacy.

Proof: We can write, for any i, j, j′:

p(ỹ = i | y = j) / p(ỹ = i | y = j′)
= [(1 − p) I(i = j) + p/(K − 1) I(i ≠ j)] / [(1 − p) I(i = j′) + p/(K − 1) I(i ≠ j′)]    (20)
= { (K − 1)(1 − p)/p if i = j, i ≠ j′;  p/((K − 1)(1 − p)) if i ≠ j, i = j′;  1 otherwise }    (21)
= { e^ε if i = j, i ≠ j′;  e^{−ε} if i ≠ j, i = j′;  1 otherwise }    (22)

Therefore, we have that for any i, j, j′, p(ỹ = i | y = j) / p(ỹ = i | y = j′) ≤ e^ε.
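The claim in Appendix C can also be checked numerically: building the full flip matrix and taking the worst-case likelihood ratio over all (i, j, j′) recovers exactly e^ε. The following sketch (function names ours) performs this check:

```python
import numpy as np

def flip_prob_matrix(K, epsilon):
    """Matrix M with M[i, j] = p(y~ = i | y = j) for the K-class flip
    mechanism with p = (K - 1) / (e^epsilon + K - 1)."""
    p = (K - 1) / (np.exp(epsilon) + K - 1)
    M = np.full((K, K), p / (K - 1))
    np.fill_diagonal(M, 1 - p)
    return M

def max_likelihood_ratio(K, epsilon):
    """Worst-case ratio p(y~ = i | y = j) / p(y~ = i | y = j') over all
    outputs i and input pairs (j, j'); the LDP bound says this is <= e^eps."""
    M = flip_prob_matrix(K, epsilon)
    return float((np.max(M, axis=1) / np.min(M, axis=1)).max())
```

Since (1 − p)(K − 1)/p = e^ε by construction, the worst-case ratio saturates the bound, i.e. the mechanism is exactly ε-LDP and no tighter.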

D IMPLEMENTATION REQUIREMENTS FOR DIFFERENT APPLICATIONS

In the scenario that D_1 contains sensitive information, the encoder or decoder may need to be trained with the two-stage approach (outlined in Section 3.1) in order to guarantee (ε, δ)-CDP with respect to D_1. There are broadly two scenarios in which this is the case. Firstly, if private data is published at the pixel level, then a DP decoder is required. Secondly, if the encoder needs to be shared with the client (for example, in client-side data collection), then a DP encoder is required. For pixel-level data collection a DP decoder is not required, since the client can share their privatized latents and these can be 'decoded' to pixel level on the server side, avoiding the need to share the decoder. Table 1 explicitly outlines the CDP requirements for the applications discussed in our paper.

E EXPERIMENTAL SETUP

For every experiment in the paper, we conducted three trials and report the mean and standard deviation of accuracy across each set of trials. The error bars represent one standard deviation above and below the mean. We use the MNIST dataset and the Lending Club dataset. MNIST contains 70,000 images of handwritten digits from 0-9 with corresponding class labels; the task is 10-way classification to determine the digit. Lending Club is a tabular financial dataset of around 540,000 entries with 23 continuous and categorical features (after pre-processing, before one-hot encoding); the task is binary classification, to determine whether a debt will be repaid.

E.1 DATA PRE-PROCESSING

For MNIST, we converted the images into values between 0 and 1 by dividing each pixel value by 255. These are then passed through a logit transform and treated as continuous. For Lending Club, a number of standard pre-processing steps are performed, including:

• Dropping features that contain too many missing values or would not normally be available to a loan company.

The model hyperparameters searched over (see Section E.3) were:

• The proportion λ of our privacy budget assigned to the datapoint, compared with the label, i.e. λ = ε_x/(ε_x + ε_y).
• The ℓ1 clipping distance l of our inference network mean, i.e. l = Δμ_φ/2.
• The Laplace distribution scale b of our approximate posterior distribution during pre-training of the VLM. Note that we report this as the pre-training ε-LDP value induced by a sample from this posterior distribution, given l from the previous point, i.e. ε_pre-training = 2l/b. This is fixed throughout training, unless 'learned' is specified in the table below, in which case the parameter b is a learned scalar.
• The latent dimension d. This was fixed to 8 for the data collection experiments, but we searched over d ∈ {5, 8} for the data join experiments due to the smaller number of features.

We also did a grid search over the following DP-Adam hyperparameters for ε_central ∈ {1, 5}:

• Noise multiplier
• Batch size
• DP learning rate

The DP-Adam hyperparameter max gradient norm was fixed to 1 throughout. The number of training epochs needed to reach the target ε_central value follows from the choice of hyperparameters, combined with the VLM training set size (45,000 for MNIST, and 341,000 for Lending Club). Note that we fixed δ = 10^-5 for all experiments. The results from these grid searches are given in Tables 2, 3, and 4.
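The MNIST pixel pre-processing described above can be sketched as follows. The clipping constant used to keep the logit finite at pixel values 0 and 255 is an illustrative assumption on our part; the paper does not specify one:

```python
import numpy as np

def mnist_to_logits(pixels, clip=1e-4):
    """Map raw 0-255 pixel values onto the real line: rescale to [0, 1],
    clip away the endpoints so the logit stays finite (the `clip` value
    is an assumed constant), then apply logit(x) = log(x / (1 - x))."""
    x = np.asarray(pixels, dtype=np.float64) / 255.0
    x = np.clip(x, clip, 1.0 - clip)
    return np.log(x) - np.log1p(-x)
```

After this transform the pixels are unbounded and approximately continuous, so they can be modelled by the VLM like any other continuous feature.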

E.4 NETWORK ARCHITECTURES

Throughout the paper, we used feedforward architectures for both the VLM and classifier networks. For MNIST, we use a VLM encoder network with 3 hidden layers of size {400, 150, 50}, and a decoder network with 3 hidden layers of size {50, 150, 400}. For Lending Club, we use VLM encoder and decoder networks each with 2 hidden layers of size {500, 500}. For the latent-level classifier we used a network with 1 hidden layer of size 50, and for the pixel-level classifier we use a network with 3 hidden layers of size {400, 150, 50}.

F PROOF OF UPPER BOUND ON NOISY ACCURACY

In this section, we derive an upper bound on accuracy for the classification of datapoints privatized using the Laplace mechanism (see Equation 4). To simplify the proof, we make the following assumptions:

• We have K equally balanced classes.
• K is even.
• K ≤ 2d, where d is the dimension of the output of f(·) (the latent space on which we add Laplace noise).

These are true for all experiments in this paper. First, suppose that K = 2d. Since we add i.i.d. Laplace noise to each datapoint, we will obtain the highest possible accuracy when f(·) maps datapoints from class i as far away from datapoints from class j ≠ i as possible. This maximum separation distance can be at most Δf; we can separate all K classes by distance Δf in our d-dimensional latent space iff each class is mapped to a separate vertex c^{(y)} of the taxicab sphere of (ℓ1-norm) radius Δf/2. The decision boundary is given by the line that equally separates these vertices in ℓ1 space, as shown (for 2 dimensions) in Figure 5. The accuracy of the classifier C on datapoints privatized by the latent-space Laplace mechanism q(·|x) ∼ Laplace(f(x), Δf/ε) is given by

\begin{align}
A &= \mathbb{E}_{(x,y)\sim p(x,y),\, z \sim q(z|x)}\; p\big(C(z) = y\big) \tag{23}\\
&= \mathbb{E}_{y \sim p(y)}\; p\big(C(c^{(y)} + s) = y\big) \tag{24}\\
&= p\big(C(c^{(1)} + s) = 1\big), \tag{25}
\end{align}

where s = (s_1, ..., s_d) and s_i ∼ Laplace(0, Δf/ε).
The second equality follows from the fact that all points from a given class are mapped to the same point in latent space, and the final equality follows from the symmetry between classes. This final term gives the probability that, when we add Laplace noise to c^{(1)} to obtain the private representation c̃^{(1)}, we do not cross the decision boundary. We assume WLOG that c^{(1)} = (1, 0, ..., 0) and calculate this probability as follows (dropping the superscript for clarity):

\begin{align}
A &= \int_{\tilde{c}_1 > 0,\; |\tilde{c}_i| < \tilde{c}_1\ \forall i \neq 1} d\tilde{c}_1 \cdots d\tilde{c}_d\; \frac{1}{(2b)^d} \exp\left(-\frac{\|\tilde{c} - c\|_1}{b}\right) \tag{26}\\
&= \int_0^\infty d\tilde{c}_1\; \frac{1}{2b} \exp\left(-\frac{|\tilde{c}_1 - 1|}{b}\right) \prod_{i=2}^{d} \int_{-\tilde{c}_1}^{\tilde{c}_1} d\tilde{c}_i\; \frac{1}{2b} \exp\left(-\frac{|\tilde{c}_i|}{b}\right) \tag{27}\\
&= \int_0^\infty d\tilde{c}_1\; \frac{1}{2b} \exp\left(-\frac{|\tilde{c}_1 - 1|}{b}\right) \left(1 - \exp(-\tilde{c}_1/b)\right)^{d-1} \tag{28}\\
&= \sum_{j=0}^{d-1} \binom{d-1}{j} (-1)^j \left[\frac{\exp(-j\epsilon/2)}{1 - j^2} - \frac{\exp(-\epsilon/2)}{2(1-j)}\right],
\end{align}

where in the last step we expanded the binomial, integrated term by term, and used the fact that in this case ε = 2/b. By substituting d = K/2, Equation 10 follows directly.

Now, we consider the case K < 2d. Clearly, the taxicab sphere has more than K vertices, and so classes can occupy different combinations of vertices. The maximum accuracy will be found where the occupied vertices are maximally separated from each other. For the case of Laplace noise, where probability mass decreases exponentially with ℓ1 distance from the mean, it is clear that the probability of a noised datapoint crossing a decision boundary is higher when the classes are centered on vertices aligned along different axes. We have shown this for d = 2 in Figures 6(a) and 6(b), where clearly the probability mass of the blue shaded area for a Laplace distribution with mean c^{(1)} is larger than the green shaded area. Therefore we are more likely to cross the decision boundary in Figure 6(b), given a fixed quantity of noise. Thus, the optimal setup is when the K classes are positioned on vertices aligned along the first K/2 axes. Noise added to the remaining d − K/2 dimensions does not affect the classifier, and so Equation 33 still holds.
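As a sanity check on this derivation, the integral form of the bound can be evaluated numerically and compared against a direct Monte Carlo simulation of the optimal classifier. This is our own sketch, assuming Δf = 2 so that b = 2/ε, and is not code from the paper:

```python
import numpy as np

def accuracy_bound(epsilon, d, grid=200000, cmax=50.0):
    """Evaluate the integral form of the bound,
    A = int_0^inf (1/2b) exp(-|c1 - 1|/b) (1 - exp(-c1/b))^(d-1) dc1,
    with b = 2/epsilon, via the trapezoidal rule on a fine grid."""
    b = 2.0 / epsilon
    c = np.linspace(0.0, cmax, grid)
    integrand = (1.0 / (2.0 * b)) * np.exp(-np.abs(c - 1.0) / b) \
                * (1.0 - np.exp(-c / b)) ** (d - 1)
    return float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(c)) / 2.0)

def accuracy_monte_carlo(epsilon, d, n=200000, seed=0):
    """Simulate the optimal classifier: the class mean sits at
    (1, 0, ..., 0); a noised point is classified correctly iff
    c1 > 0 and |ci| < c1 for all i > 1."""
    rng = np.random.default_rng(seed)
    b = 2.0 / epsilon
    z = rng.laplace(0.0, b, size=(n, d))
    z[:, 0] += 1.0
    correct = (z[:, 0] > 0) & np.all(np.abs(z[:, 1:]) < z[:, 0:1], axis=1)
    return float(correct.mean())
```

For example, with ε = 2 and d = 2 the integral evaluates analytically to 1 − (5/4)e⁻¹, and the simulation agrees to within Monte Carlo error.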
G FEATURE-LEVEL PRIVATE CLASSIFIER FOR DATA COLLECTION

Figure 7 shows the private accuracy results for MNIST data collection at the feature level.



https://www.kaggle.com/wordsforthewise/lending-club



Figure 1: Clean accuracy as a function of ε_local for data collection. Results are shown for the MNIST dataset (top) and Lending Club (bottom), at the latent level (left) and feature level (right). Each line indicates a different value of (ε, δ = 10^-5)-CDP at which the encoder was trained. The x-axis shows the ε-LDP guarantee for the collected training set.

Figure 3: 'Semi-private' accuracy versus ε_local for data joining (private features shared and joined at the latent level). The x-axis shows the ε-LDP guarantee for the collected training set.

Figure 4: Private accuracy versus ε_local for latent-level data collection. Each line indicates a different value of (ε, δ = 10^-5)-CDP at which the encoder was trained. The x-axis shows the ε-LDP guarantee for the collected training set.

Figure 5: The decision boundary for a classifier that equally separates (in ℓ1 distance) vertices c^{(i)} for i ∈ {1, 2, 3, 4} in 2-dimensional space.

Figure 6: The shaded areas represent, for d = 2 and K = 2, the decision boundaries for: (a) a function f(·) that maps the two classes onto opposing vertices; (b) a function f(·) that maps the two classes onto adjacent vertices. Refer to Appendix F for details on the color-coding of shaded areas.

Figure legend: Our method (ε_central = 5); Our method (ε_central = 1); Noise features directly; Theoretical upper bound; Random classifier.

Figure 7: Private accuracy as a function of ε_local for data collection (feature level). Each line indicates a different value of (ε, δ = 10^-5)-CDP at which the encoder was trained; each point on the x-axis shows the ε-LDP guarantee for the collected training set.

Table 1: Central differential privacy requirements for the VLM, with respect to the dataset D_1.

DP-Adam hyperparameters used for data collection and novel-class classification.

VLM hyperparameters used for data join experiments.

VLM hyperparameters used for data collection and novel-class classification.


• Mean imputation to fill remaining missing values.
• Standard scaling of continuous features. Extreme outliers (those with features more than 10 standard deviations from the mean) are removed here.
• Balancing the target classes by dropping the excess class 0 entries.
• One-hot encoding categorical variables.

The target variable denotes whether the loan has been charged off or not, resulting in a binary classification task. The train, validation, and test split is done chronologically according to the feature 'issue date'. In real-world applications the sizes of the VLM training and validation sets, and the classifier training and validation sets, would be pre-determined. For our experiments we used the data splits outlined in the following sections.

MNIST:

We use a similar approach to the above, but split the data between the VLM and the classifier such that the VLM train/validation sets contain 8/9ths of the (unlabelled) training images from classes 0 to 8. The remaining 1/9th of the 0 to 8 images, and all 9s, are used for the classifier train, test, and validation sets. Our VLM datasets then contain equal class balance for the classes 0 to 8, and the classifier datasets contain equal class balance for 9s and 'not 9s'.

E.1.3 DATA JOIN

Lending Club: For this experiment, the datasets were split between the VLM and the classifier column-wise, across the dataset's 23 features. The VLM datasets contain 8 features (month of earliest credit line, number of open credit lines, initial listing status of loan, application type, address (US state), home ownership status, employment length, public record bankruptcies). The classifier datasets contain the remaining features. The feature split was chosen such that the first dataset contains some information to solve the classification task, but the features from the second dataset contain information which, at least before privatization, further improves classifier performance.

E.2 NOISING FEATURES DIRECTLY

For continuous features, we assume that ∆f is equal to the difference between the maximum and minimum value of that feature within the training and validation sets used to train the VLM in the main experiments, after pre-processing. One then has to clip any values that lie outside this interval in the shared/collected dataset at privatization time.
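A minimal sketch of this benchmark follows, assuming the ℓ1 sensitivity of the whole feature vector is the sum of the per-feature ranges (one natural reading of the Laplace mechanism here; the paper does not spell out the budget allocation):

```python
import numpy as np

def noise_features_directly(x, feat_min, feat_max, epsilon, rng=None):
    """Benchmark mechanism: clip each feature to its [min, max] range
    (estimated from the VLM training/validation data), then add Laplace
    noise scaled by the assumed l1 sensitivity of the full vector."""
    rng = np.random.default_rng() if rng is None else rng
    feat_min = np.asarray(feat_min, dtype=float)
    feat_max = np.asarray(feat_max, dtype=float)
    x = np.clip(np.asarray(x, dtype=float), feat_min, feat_max)
    # assumed l1 sensitivity: sum of per-feature ranges
    sensitivity = float(np.sum(feat_max - feat_min))
    return x + rng.laplace(0.0, sensitivity / epsilon, size=x.shape)
```

Because the sensitivity grows with the number of features, the noise scale for high-dimensional data is large at any reasonable ε, which is the utility problem the latent-space mechanism is designed to avoid.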

E.3 HYPERPARAMETER CHOICES

We conducted a grid search over a number of the hyperparameters in our model, in order to find the optimal experimental setup. For stage 1 of the VLM training, a learning rate of 10^-4 and a batch size of 128 were used for Lending Club experiments, and a learning rate of 5 × 10^-4 and a batch size of 64 were used for MNIST. We then searched over the model hyperparameters itemized in Section E.1 above (with ε_central = ∞).
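The grid search itself is straightforward to set up; the values below are placeholders for illustration, not the ones actually searched in the paper:

```python
import itertools

# Illustrative grid over the DP-Adam hyperparameters named in the text;
# the candidate values here are assumed, not taken from the paper.
GRID = {
    "noise_multiplier": [0.7, 1.0, 1.3],
    "batch_size": [64, 128, 256],
    "dp_learning_rate": [1e-4, 5e-4],
}

def grid_configs(grid):
    """Enumerate every hyperparameter combination in the grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

Each configuration would then be trained until the privacy accountant reaches the target (ε_central, δ = 10^-5) budget, and the best configuration selected on validation accuracy.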

