DO NOT LET PRIVACY OVERBILL UTILITY: GRADIENT EMBEDDING PERTURBATION FOR PRIVATE LEARNING

Abstract

Differential privacy mechanisms can bound the leakage of information about the training data through a trained model. However, for meaningful privacy parameters, a differentially private model degrades utility drastically when the model comprises a large number of trainable parameters. In this paper, we propose Gradient Embedding Perturbation (GEP), an algorithm for training differentially private deep models with decent accuracy. Specifically, in each gradient descent step, GEP first projects each individual private gradient into a non-sensitive anchor subspace, producing a low-dimensional gradient embedding and a small-norm residual gradient. Then, GEP perturbs the low-dimensional embedding and the residual gradient separately according to the privacy budget. Such a decomposition permits a small perturbation variance, which greatly helps to break the dimensional barrier of private learning. With GEP, we achieve decent accuracy with reasonable computational cost and a modest privacy guarantee for deep models. In particular, with privacy bound ε = 8, we achieve 74.9% test accuracy on CIFAR-10 and 95.1% test accuracy on SVHN, significantly improving over existing results.

1. INTRODUCTION

Recent works have shown that trained models may leak or memorize information about their training sets (Fredrikson et al., 2015; Wu et al., 2016; Shokri et al., 2017; Hitaj et al., 2017), which raises privacy issues when models are trained on sensitive data. The differential privacy (DP) mechanism provides a way to quantitatively measure and upper bound such information leakage. It theoretically ensures that the influence of any individual sample is negligible, controlled by the DP parameters ε or (ε, δ). Moreover, it has been observed that differentially private models can also resist model inversion attacks (Carlini et al., 2019), membership inference attacks (Rahman et al., 2018; Bernau et al., 2019; Sablayrolles et al., 2019; Yu et al., 2021), gradient matching attacks (Zhu et al., 2019), and data poisoning attacks (Ma et al., 2019). One popular way to achieve differentially private machine learning is to perturb the training process with noise (Song et al., 2013; Bassily et al., 2014; Shokri & Shmatikov, 2015; Wu et al., 2017; Fukuchi et al., 2017; Iyengar et al., 2019; Phan et al., 2020). Specifically, gradient perturbation perturbs the gradient at each iteration of the (stochastic) gradient descent algorithm and guarantees the privacy of the final model via the composition property of DP. It is worth noting that gradient perturbation does not assume a (strongly) convex objective and hence is applicable to various settings (Abadi et al., 2016; Wang et al., 2017; Lee & Kifer, 2018; Jayaraman et al., 2018; Wang & Gu, 2019; Yu et al., 2020). Specifically, for a given gradient sensitivity S, a general form of gradient perturbation adds isotropic Gaussian noise z to the gradient g ∈ R^p independently at each step,

g̃ = g + z, where z ∼ N(0, σ²S²I_{p×p}). (1)

One can set a proper variance σ² to make each update differentially private with parameters (ε, δ). It is easy to see that the intensity of the added noise, E[‖z‖²], scales linearly with the model dimension p.
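As a concrete illustration of Eq. (1) and of how the noise energy grows with the model dimension p, the following NumPy sketch (a minimal illustration of the general mechanism, not the paper's implementation) perturbs a gradient with isotropic Gaussian noise and checks that the noise energy per coordinate concentrates around σ²S²:

```python
import numpy as np

def perturb_gradient(g, sensitivity, sigma, rng):
    """Gaussian gradient perturbation, Eq. (1): g~ = g + z, z ~ N(0, sigma^2 S^2 I)."""
    z = rng.normal(0.0, sigma * sensitivity, size=g.shape)
    return g + z

rng = np.random.default_rng(0)
sigma, S = 1.0, 1.0
for p in (1_000, 100_000):                 # model dimension
    z = rng.normal(0.0, sigma * S, size=p)
    # E[||z||^2] = sigma^2 * S^2 * p: total noise energy scales linearly with p,
    # so the per-coordinate ratio below stays near sigma^2 * S^2 = 1 for any p.
    print(p, np.linalg.norm(z) ** 2 / p)
```

The larger the model, the more total noise energy is injected for the same per-step privacy guarantee, which is the dimensional barrier discussed next.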
Figure 2: Stable rank ‖M‖²_F / ‖M‖²₂ (Tropp et al., 2015) of the batch gradient matrix of given parameter groups (with p parameters). The setting is ResNet20 on CIFAR-10. The stable rank is small throughout training.

This indicates that as the model becomes larger, the useful signal, i.e., the gradient, would be submerged in the added noise (see Figure 1). This dimensional barrier restricts the utility of deep learning models trained with gradient perturbation. The dimensional barrier is attributed to the fact that the added noise is isotropic while the gradients live on a very low dimensional manifold, which has been observed in (Gur-Ari et al., 2018; Vogels et al., 2019; Gooneratne et al., 2020; Li et al., 2020) and is also verified in Figure 2 for the gradients of a 20-layer ResNet (He et al., 2016). Hence, to limit the noise energy, it is natural to ask: "Can we reduce the dimension of the gradients first and then add the isotropic noise onto a low-dimensional gradient embedding?" The answer is affirmative. We propose a new algorithm, Gradient Embedding Perturbation (GEP), illustrated in Figure 3. Specifically, we first compute anchor gradients on some non-sensitive auxiliary data, and identify an anchor subspace that is spanned by several top principal components of the anchor gradient matrix. Then we project the private gradients into the anchor subspace and obtain low-dimensional gradient embeddings and small-norm residual gradients. Finally, we perturb the gradient embedding and residual gradient separately according to their sensitivities and the privacy budget. We argue intuitively why GEP can reduce the perturbation variance and achieve good utility for large models. First, because the gradient embedding has a very low dimension, the isotropic noise added to the embedding has small energy that scales linearly only with the subspace dimension.
Second, if the anchor subspace covers most of the gradient information, the residual gradient, though high dimensional, has small magnitude, which permits smaller added noise for the same level of privacy because of the reduced sensitivity. Overall, we can use much smaller perturbation than original gradient perturbation to guarantee the same level of privacy. We emphasize several properties of GEP. First, the non-sensitive auxiliary data assumption is weak. In fact, GEP only requires a small number of non-sensitive unlabeled data following a feature distribution similar to the private data, which often exist even for learning on sensitive data. In our experiments, we use a few unlabeled samples from ImageNet to serve as auxiliary data for MNIST, SVHN, and CIFAR-10. This assumption is much weaker than the public data assumption in previous works (Papernot et al., 2017; 2018; Alon et al., 2019; Wang & Zhou, 2020), where the public data should follow exactly the same distribution as the private data. Second, GEP produces an unbiased estimator of the target gradient because it releases both the perturbed gradient embedding and the perturbed residual gradient, which turns out to be critical for good utility. Third, we use the power method to estimate the principal components of the anchor gradients, which is achievable with a few matrix multiplications. The fact that GEP is not sensitive to the choice of subspace dimension further allows a very efficient implementation.
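The intuition above can be checked numerically. In the toy sketch below (illustrative only: the orthonormal basis and the low-rank-plus-noise gradient model are our own assumptions, not the paper's setup), a gradient that mostly lies in a k-dimensional subspace leaves only a small-norm residual after projection:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 10_000, 20
B, _ = np.linalg.qr(rng.normal(size=(p, k)))      # columns: orthonormal anchor basis
# Gradient = strong in-subspace component + small full-dimensional residual.
g = B @ rng.normal(size=k) + 0.005 * rng.normal(size=p)

w = B.T @ g            # k-dimensional embedding (cheap to perturb: noise energy ~ k)
r = g - B @ w          # p-dimensional residual, but with small norm (small sensitivity)
print(np.linalg.norm(r) / np.linalg.norm(g))      # residual carries little energy
```

The embedding needs noise whose energy scales with k rather than p, and the residual's small norm allows a small clipping threshold, so both parts can be released with low perturbation.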
Compared with existing works on differentially private machine learning, our contributions can be summarized as follows: (1) we propose a novel algorithm, GEP, that achieves good utility for large models with a modest differential privacy guarantee; (2) we show that GEP returns an unbiased estimator of the target private gradient with much lower perturbation variance than original gradient perturbation; (3) we demonstrate that GEP achieves state-of-the-art utility in differentially private learning on three benchmark datasets. Specifically, for ε = 8, GEP achieves 74.9% test accuracy on CIFAR-10 with a ResNet20 model. To the best of our knowledge, GEP is the first algorithm that can achieve such utility when training deep models from scratch under a "single-digit" privacy budget.

Figure 3: Overview of the proposed GEP approach. 1) We estimate an anchor subspace on some non-sensitive data; 2) we project the private gradients into the anchor subspace, producing low-dimensional embeddings and residual gradients; 3) we perturb the gradient embedding and residual gradient separately to guarantee differential privacy. The auxiliary data are only required to share similar features with the private data. In our experiments, we use 2000 images from ImageNet as auxiliary data for the MNIST, SVHN, and CIFAR-10 datasets.

1.1. RELATED WORK

Existing works studying differentially private machine learning in the high-dimensional setting can be roughly categorized into two lines. One treats the optimization of the machine learning objective as a whole mechanism and adds noise into this process. The other is based on knowledge transfer of machine learning models, which trains a differentially private, publishable student model with private signals from teacher models. We review them one by one. Differentially private convex optimization in the high-dimensional setting has been studied extensively over the years (Kifer et al., 2012; Thakurta & Smith, 2013; Talwar et al., 2015; Wang & Xu, 2019; Wang & Gu, 2019). Although these methods demonstrate good utility in some convex settings, their analyses cannot be directly applied to the non-convex setting. Shortly before submission, we noted two independent and concurrent works (Zhou et al., 2020; Kairouz et al., 2020) that also leverage gradient redundancy to reduce the added noise. Specifically, Kairouz et al. (2020) track historical gradients to perform dimension reduction for private AdaGrad. Zhou et al. (2020) require gradients on some public data and then project the noisy gradients into a public subspace at each update. One core difference between these two works and GEP is that we introduce residual gradient perturbation, so GEP produces an unbiased estimator of the private gradients, which is essential for achieving superior utility. Moreover, we weaken the auxiliary data assumption and introduce several designs that significantly boost the efficiency and applicability of GEP. One recent line of progress towards training arbitrary models with differential privacy is Private Aggregation of Teacher Ensembles (PATE) (Papernot et al., 2017; 2018; Jordon et al., 2019). PATE first trains independent teacher models on disjoint shards of the private data.
Then it trains a student model with a privacy guarantee by distilling noisy predictions of the teacher models on some public samples. In comparison, GEP only requires some non-sensitive data with natural features similar to the private data, while PATE requires the public data to follow exactly the same distribution as the private data; in practice, PATE uses a portion of the test data to serve as public data. Moreover, GEP demonstrates better performance than PATE, especially on complex datasets, e.g., CIFAR-10, because GEP can train the model on the whole private dataset rather than a small shard of it.

2. PRELIMINARIES

We introduce some notations and definitions. We use bold lowercase letters, e.g., v, and bold capital letters, e.g., M, to denote vectors and matrices, respectively. The L2 norm of a vector v is denoted by ‖v‖. The spectral norm and the Frobenius norm of a matrix M are denoted by ‖M‖ and ‖M‖_F, respectively. A sample d = (x, y) consists of a feature x and a label y. A dataset D is a collection of individual samples. A dataset D′ is said to be a neighboring dataset of D if they differ in a single sample, denoted as D ∼ D′. Differential privacy ensures that the outputs of an algorithm on neighboring datasets have approximately indistinguishable distributions.

Definition 1 ((ε, δ)-DP (Dwork et al., 2006a; b)). A randomized mechanism M guarantees (ε, δ)-differential privacy if for any two neighboring input datasets D ∼ D′ and for any subset of outputs S it holds that Pr[M(D) ∈ S] ≤ e^ε Pr[M(D′) ∈ S] + δ.

By its definition, (ε, δ)-DP controls the maximum influence that any individual sample can produce. One can adjust the privacy parameters to trade off between privacy and utility. Differential privacy is immune to post-processing (Dwork et al., 2014), i.e., any function applied to the output of a differentially private algorithm does not increase the privacy loss as long as it has no new interaction with the private dataset. Differential privacy also allows composition, i.e., the composition of a series of differentially private mechanisms is also differentially private, but with different parameters. Several variants of (ε, δ)-DP have been proposed (Bun & Steinke, 2016; Dong et al., 2019) to address certain weaknesses of (ε, δ)-DP, e.g., they achieve better composition properties. In this work, we use Rényi differential privacy (Mironov, 2017) to track the privacy loss and then convert it to (ε, δ)-DP. Suppose that there is a private dataset D = {(x_i, y_i)}_{i=1}^n with n samples. We want to train a model f to learn the mapping in D.
Specifically, f takes x as input and outputs a label y, and f has parameters θ ∈ R^p. The training objective is to minimize the empirical risk (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i), where ℓ(·, ·) is a loss function. We further assume that there is an auxiliary dataset D^(a) = {(x̃_j, ỹ_j)}_{j=1}^m, where x̃_j shares similar features with the x in D while ỹ_j could be random.
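To make the Gaussian-mechanism machinery used throughout the paper concrete, here is a small sketch using the classical calibration σ ≥ √(2 ln(1.25/δ)) · S/ε from the DP literature (valid for ε < 1); the mean-query example is our own illustration, not part of the paper:

```python
import math

def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian-mechanism noise scale (Dwork & Roth; valid for eps < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

# Example query: the mean of n values in [0, 1]. Replacing one sample changes
# the mean by at most 1/n, so the L2 sensitivity is 1/n.
n = 1000
sigma = gaussian_sigma(sensitivity=1.0 / n, eps=0.5, delta=1e-5)
```

The noise scale is proportional to the sensitivity and inversely proportional to ε; shrinking the sensitivity of part of the released quantity is exactly the lever GEP pulls on the residual gradient.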

3. GRADIENT EMBEDDING PERTURBATION

An overview of GEP is given in Figure 3. GEP has three major ingredients: 1) first, estimate an anchor subspace that contains the principal components of some non-sensitive anchor gradients via the power method; 2) then, project the private gradients into the anchor subspace, producing low-dimensional embeddings of the private gradients and residual gradients; 3) finally, perturb the gradient embedding and residual gradient separately to establish the differential privacy guarantee. In Section 3.1, we present the GEP algorithm in detail. In Section 3.2, we give an analysis of the residual gradients. In Section 3.3, we give a differentially private learning algorithm that updates the model with the output of GEP.

3.1. THE GEP ALGORITHM AND ITS PRIVACY ANALYSIS

The pseudocode of GEP is presented in Algorithm 1. For convenience, we write a set of gradients and a set of basis vectors as matrices, with each row being one gradient/basis vector. The anchor subspace is constructed as follows. We first compute the gradients of the model on an auxiliary dataset D^(a) with m samples, referred to as the anchor gradients G^(a) ∈ R^{m×p}. We then use the power method to estimate the principal components of G^(a) and construct a subspace basis B ∈ R^{k×p}, referred to as the anchor subspace. All these matrices are publishable because D^(a) is non-sensitive. We expect the anchor subspace B to cover most of the energy of the private gradients when the auxiliary data are not far from the private data and m, k are reasonably large. Suppose that the private gradients are G ∈ R^{n×p}. We project the private gradients into the anchor subspace B. The projection produces low-dimensional embeddings W = GB^T and residual gradients R = G − GB^T B. The magnitude of the residual gradients is usually much smaller than that of the original gradients, even when k is small, because of the gradient redundancy. We then aggregate the gradient embeddings and the residual gradients, respectively, and perturb the aggregated embedding and the aggregated residual gradient separately to guarantee differential privacy. Finally, we release the perturbed embedding and the perturbed residual gradient and construct an unbiased estimator of the private gradient: ṽ := (w̃^T B + r̃)/n. This construction does not result in additional privacy loss because of DP's post-processing property. The privacy analysis of the whole GEP procedure is given in Theorem 3.1.

Theorem 3.1. Let S1 and S2 be the sensitivities of w and r, respectively. The output of Algorithm 1 satisfies (ε, δ)-DP for any δ ∈ (0, 1) and ε ≤ 2 log(1/δ) if we choose σ1 ≥ 2S1√(2 log(1/δ))/ε and σ2 ≥ 2S2√(2 log(1/δ))/ε.
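The calibration in Theorem 3.1 (as reconstructed here, so the exact constant should be checked against the original statement) can be sketched in a few lines; the threshold and budget values below are illustrative only:

```python
import math

def gep_sigmas(s1, s2, eps, delta):
    """Noise scales for the two releases in Theorem 3.1 (as read above): each
    part's sigma is proportional to its own sensitivity, so a small residual
    sensitivity S2 directly yields small residual noise."""
    scale = 2.0 * math.sqrt(2.0 * math.log(1.0 / delta)) / eps
    return s1 * scale, s2 * scale

sigma1, sigma2 = gep_sigmas(s1=10.0, s2=2.0, eps=8.0, delta=1e-5)
print(sigma1, sigma2)   # sigma1 / sigma2 equals s1 / s2
```

This is the quantitative payoff of the decomposition: the full-dimensional noise is calibrated to the small residual sensitivity S2 rather than to the total gradient norm.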
Algorithm 1: Gradient embedding perturbation
1: Input: anchor gradients G^(a) ∈ R^{m×p}; number of basis vectors k; private gradients G ∈ R^{n×p}; clipping thresholds S1, S2; standard deviations σ1, σ2; number of power iterations t.
2: // First stage: compute an orthonormal basis for the anchor subspace.
3: Initialize B ∈ R^{k×p} randomly.
4: for i = 1 to t do
5:     Compute A = G^(a) B^T and B = A^T G^(a).
6:     Orthogonalize B and normalize its row vectors.
7: end for
8: Delete G^(a) to free memory.
9: // Second stage: project the private gradients G into the anchor subspace B.
10: Compute gradient embeddings W = GB^T and clip its rows with threshold S1 to obtain Ŵ.
11: Compute residual gradients R = G − WB and clip its rows with threshold S2 to obtain R̂.
12: // Third stage: perturb the gradient embedding and the residual gradient separately.
13: Perturb the embedding with noise z^(1) ∼ N(0, σ1² I_{k×k}): w := Σ_i Ŵ_{i,:}, w̃ := w + z^(1).
14: Perturb the residual gradient with noise z^(2) ∼ N(0, σ2² I_{p×p}): r := Σ_i R̂_{i,:}, r̃ := r + z^(2).
15: Return ṽ := (w̃^T B + r̃)/n.

A common practice to control sensitivity is to clip the output with a pre-defined threshold. In our experiments, we use different thresholds S1 and S2 to clip the gradient embeddings and residual gradients, respectively. The privacy loss of GEP consists of two parts: the loss incurred by releasing the perturbed embedding and the loss incurred by releasing the perturbed residual gradient. We compose these two parts via Rényi differential privacy and convert the result to (ε, δ)-DP. We highlight several implementation techniques that make GEP widely applicable and implementable at reasonable computational cost. Firstly, the auxiliary non-sensitive data need not come from the same source as the private data, and they can be randomly labeled. This non-sensitive data assumption is very weak and easy to satisfy in practical scenarios.
To understand why random labels work, a quick example: for the least squares regression problem, the individual gradient is aligned with the feature vector, while the label only scales its length without changing its direction. This auxiliary data assumption avoids conducting principal component analysis (PCA) on the private gradients, which would require releasing private high-dimensional basis vectors and hence introduce a large privacy loss. Secondly, we use the power method (Panju, 2011; Vogels et al., 2019) to approximately estimate the principal components. The only new operation we introduce is standard matrix multiplication, which enjoys efficient implementation on GPUs. The computational complexity of each power iteration is 2mkp, where p is the number of model parameters, m is the number of anchor gradients, and k is the number of subspace basis vectors. Thirdly, we divide the parameters into different groups and compute one orthonormal basis for each group, which further reduces the computational cost. For example, suppose the parameters are divided into two groups of sizes p1, p2 with k1, k2 basis vectors, respectively; the computational complexity of each power iteration is then 2m(k1 p1 + k2 p2), which is smaller than 2m(k1 + k2)(p1 + p2). In Appendix B, we analyze the additional computational and memory costs of GEP compared to standard gradient perturbation. Curious readers may wonder whether we could instead use random projection to reduce the dimensionality, as the Johnson-Lindenstrauss Lemma (Dasgupta & Gupta, 2003) guarantees that one can preserve the pairwise distances between points after projecting into a random subspace of much lower dimension. However, preserving pairwise distances is not sufficient for high-quality gradient reconstruction, as verified by the empirical observations in Appendix C.
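Putting the three stages together, here is a minimal NumPy sketch of Algorithm 1 (a simplified re-implementation for illustration: per-row L2 clipping, a single parameter group, and the function name `gep` are our own choices):

```python
import numpy as np

def gep(G_anchor, G_private, k, s1, s2, sigma1, sigma2, iters=1, seed=0):
    """Sketch of Algorithm 1; each row of the input matrices is one gradient."""
    rng = np.random.default_rng(seed)
    n, p = G_private.shape

    # Stage 1: power iterations for an orthonormal basis B (k rows of length p).
    B = rng.normal(size=(k, p))
    for _ in range(iters):
        A = G_anchor @ B.T                      # m x k
        B = A.T @ G_anchor                      # k x p
        Q, _ = np.linalg.qr(B.T)                # orthonormalize the basis
        B = Q.T

    # Stage 2: project, then clip embeddings and residuals row-wise.
    W = G_private @ B.T                         # n x k embeddings
    R = G_private - W @ B                       # n x p residuals
    W_hat = W * np.minimum(1.0, s1 / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12))
    R_hat = R * np.minimum(1.0, s2 / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-12))

    # Stage 3: aggregate, perturb each part, and reconstruct the estimate.
    w = W_hat.sum(axis=0) + rng.normal(0.0, sigma1, size=k)
    r = R_hat.sum(axis=0) + rng.normal(0.0, sigma2, size=p)
    return (w @ B + r) / n
```

With the noise turned off and clipping inactive, the output recovers the exact average gradient, reflecting that WB + R = G by construction.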

3.2. AN ANALYSIS ON THE RESIDUAL GRADIENTS OF GEP

Let g := (1/n) Σ_i G_{i,:} be the target private gradient. For a given anchor subspace B, the residual gradients are defined as R := G − GB^T B. We now analyze how large the residual gradients can be. The following argument holds for all time steps, and we omit the time step index for simplicity. For ease of discussion, we write ξ_i := (G_{i,:})^T for i ∈ [n] to denote the private gradients and ξ̃_j := (G^(a)_{j,:})^T for j ∈ [m] to denote the anchor gradients. We use λ_k(·) to denote the k-th largest eigenvalue of a given matrix. We assume that the private gradients ξ_1, ..., ξ_n and the anchor gradients ξ̃_1, ..., ξ̃_m are sampled independently from a distribution P, and let Σ := E_{ξ∼P}[ξξ^T] ∈ R^{p×p} be the population (uncentered) gradient covariance matrix. We also consider the (uncentered) empirical gradient covariance matrix Ŝ := (1/m) Σ_{j=1}^m ξ̃_j ξ̃_j^T. One case is that the population gradient covariance matrix Σ has rank k. In this case, we can argue that the residual gradients are 0 once the number of anchor gradients satisfies m > k.

Lemma 3.1. Assume that the population covariance matrix Σ has rank k and the distribution P satisfies P(ξ ∈ F_s) = 0 for all s-flats F_s in R^p with 0 ≤ s < k. Let Σ = V_k Λ V_k^T and Ŝ = V̂_k̂ Λ̂ V̂_k̂^T be the eigendecompositions of Σ and the empirical covariance matrix Ŝ, respectively, such that λ_k̂(Ŝ) > 0 and λ_{k̂+1}(Ŝ) = 0. Then, if m ≥ k, we have with probability 1,

k̂ = k and ‖V_k V_k^T − V̂_k V̂_k^T‖₂ = 0. (2)

Proof. The proof is based on the non-singularity of the covariance matrix. See Appendix D.

We note that an s-flat is the translate F_s = x + F_s(0) of an s-dimensional linear subspace F_s(0) in R^p, and the normal distribution satisfies this condition (Eaton & Perlman, 1973; Muirhead, 2009). Therefore, in the low-rank case of the population covariance matrix, the residual gradients are 0 once m > k. In the general case, we measure the expected norm of the residual gradients.

Lemma 3.2.
Assume that ξ ∼ P and ‖ξ‖² < T almost surely. Let Σ = V Λ V^T be the eigendecomposition of the population covariance matrix Σ, and let Ŝ = V̂ Λ̂ V̂^T be the eigendecomposition of the empirical covariance matrix Ŝ. Then we have with probability 1 − 2 exp(−δ),

E‖ξ − Π_{V̂_k}(ξ)‖² ≤ Σ_{k′>k} λ_{k′}(Σ) + k√(C/m) + T√(2δ/m),

where C = E‖ξ‖⁴ − Σ_i λ_i²(Σ) + (1/m) Σ_{j=1}^m ‖ξ̃_j‖⁴ − Σ_i λ_i²(Ŝ), Π_{V̂_k} is the projection operator onto the subspace spanned by the top-k columns of V̂, and the expectation E is taken over the randomness of ξ ∼ P.

Proof. The proof is an adaptation of Theorem 3.1 in Blanchard et al. (2007).

From Lemma 3.2, we can see that the larger the number of anchor gradients m and the anchor subspace dimension k, the smaller the residual gradients. We can choose m and k properly so that the upper bound on the expected residual gradient norm is small. This indicates that we may use a smaller clipping threshold and consequently add smaller noise while achieving the same privacy guarantee. We next empirically examine the projection error r = Σ_i R_{i,:} by training a 20-layer ResNet on the CIFAR-10 dataset. We try two types of auxiliary data to compute the anchor gradients: 1) samples from the same source as the private data with correct labels, i.e., 2000 random samples from the test data; 2) samples from a different source with random labels, i.e., 2000 random samples from ImageNet. The relation between the anchor subspace dimension k and the projection error rate ((1/n)‖r‖/‖g‖) is presented in Figure 4. We can see that the projection error is small and decreases with k, and that the benefit of increasing k diminishes when k is large, as implied by Lemma 3.2. In practice, one can only use a small or moderate k because of the memory constraint: GEP needs to store at least k individual gradients, and each individual gradient consumes the same amount of memory as the model itself.
Moreover, we can see that projection into the anchor subspace of randomly labeled auxiliary data yields a comparable projection error, corroborating our argument that unlabeled auxiliary data are sufficient for finding the anchor subspace. We also verify that the redundancy of the residual gradients is small by plotting the stable rank of the residual gradient matrix in Figure 5. The stable rank of the residual gradient matrix is an order of magnitude higher than that of the original gradient matrix, implying that it would be hard to further approximate R with low-dimensional embeddings. We next compare GEP with a scheme that simply discards the residual gradients and only outputs the perturbed gradient embedding, i.e., ũ := w̃^T B/n.

Remark 1. Let ũ := w̃^T B/n be the gradient reconstructed from the noisy gradient embedding and ṽ be the output of GEP. Ignoring the effect of gradient clipping, we have E[ũ] = g − r/n and E[ṽ] = g, where r = Σ_i R_{i,:} is the aggregated residual gradient, w̃ and B are given in Algorithm 1, and the expectation is over the added random noise.

This indicates that ũ contains a systematic error that makes it always deviate from g by the average residual gradient. This systematic error is exactly the projection error plotted in Figure 4, and it cannot be mitigated by reducing the noise magnitude (e.g., by increasing the privacy budget or collecting more private data). We refer to the algorithm that releases ũ directly as Biased-GEP, or B-GEP for short, which can be viewed as an efficient implementation of the algorithm in (Zhou et al., 2020). In our experiments, B-GEP can outperform standard gradient perturbation when k is large but is inferior to GEP. We note that the above remark ignores the clipping effect (or assumes a large clipping threshold). In practice, we do clip the individual gradients at each time step, which makes the expectations in Remark 1 obscure (Chen et al., 2020b).
We note that the claim that ṽ is an unbiased estimator of g is no longer precise when gradient clipping is applied.
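The stable rank comparison described above can be reproduced in a few lines; a synthetic low-rank-plus-noise gradient matrix is used here in place of real ResNet gradients, so the numbers are illustrative only:

```python
import numpy as np

def stable_rank(M):
    """||M||_F^2 / ||M||_2^2: a noise-robust proxy for the rank of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

rng = np.random.default_rng(0)
# Synthetic "gradients": rank-8 signal plus small isotropic noise.
G = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 4096)) + 0.01 * rng.normal(size=(256, 4096))
_, _, Vt = np.linalg.svd(G, full_matrices=False)
R = G - (G @ Vt[:8].T) @ Vt[:8]          # residual after removing the top-8 components
print(stable_rank(G), stable_rank(R))    # the residual's stable rank is far higher
```

The original matrix has a small stable rank, while the residual behaves like unstructured noise, mirroring the Figure 2 vs. Figure 5 contrast: the embedding captures the compressible part, and the residual is best handled by norm reduction rather than further projection.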

3.3. PRIVATE LEARNING WITH GRADIENT EMBEDDING PERTURBATION

GEP (Algorithm 1) describes how to release a one-step gradient with a privacy guarantee. In this section, we compose the per-step privacy losses to establish a privacy guarantee for the whole learning process. The differentially private learning process with GEP is given in Algorithm 2, and the privacy analysis is presented in Theorem 3.2.

Algorithm 2: Differentially private gradient descent with GEP
1: Input: initial parameters θ_0; learning rate η; number of steps T; GEP configuration C.
2: for t = 0 to T − 1 do
3:     Compute the private gradients G_t and anchor gradients G^(a)_t of the loss with respect to θ_t.
4:     Call GEP with G_t, G^(a)_t, and configuration C to obtain ṽ_t.
5:     Update the model: θ_{t+1} = θ_t − η ṽ_t.
6: end for

Theorem 3.2. For any ε < 2 log(1/δ) and δ ∈ (0, 1), the output of Algorithm 2 satisfies (ε, δ)-DP if we set σ ≥ 2√(2T log(1/δ))/ε.

If the private gradients are randomly sampled from the full-batch gradients, the privacy guarantee can be strengthened via the privacy amplification by subsampling theorem of DP (Balle et al., 2018; Wang et al., 2019; Zhu & Wang, 2019; Mironov et al., 2019). Theorem 3.3 gives the expected excess error of Algorithm 2, which measures the expected distance between the algorithm's output and the optimal solution.

Theorem 3.3. Suppose the loss L(θ) = (1/n) Σ_{(x,y)∈D} ℓ(f_θ(x), y) is 1-Lipschitz, convex, and β-smooth. If η = 1/β, T = nβε/√p, and θ̄ = (1/T) Σ_{t=1}^T θ_t, then we have

E[L(θ̄)] − L(θ*) ≤ O(√(k log(1/δ))/(nε) + r̄√(p log(1/δ))/(nε)),

where r̄ = (1/T) Σ_{t=0}^{T−1} r_t and r_t = max_i ‖(R_t)_{i,:}‖ is the sensitivity of the residual gradients at step t.

The r̄ term represents the average projection error over the training process. The previous best expected excess error for gradient perturbation is O(√(p log(1/δ))/(nε)) (Wang et al., 2017). As shown in Lemma 3.1, if the gradients lie in a k-dimensional subspace throughout training, then r̄ = 0 and the excess error is O(√(k log(1/δ))/(nε)), independent of the ambient dimension p. When the gradients are in general position, i.e., the gradient matrix is not exactly low-rank, Lemma 3.2 and the empirical results hint at how small the residual gradients can be. However, it is hard to obtain a good bound on max_i ‖(R_t)_{i,:}‖, and the bound in Theorem 3.3 does not explicitly improve over the previous result. One possible solution is to use a clipping threshold based on the expected residual gradient norm. The output gradient then becomes biased because of clipping, and the utility/privacy guarantees in Theorems 3.3/3.2 would require a new, more elaborate derivation. We leave this for future work.
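Under our reading of Theorem 3.2 (the ε symbols are reconstructed here, so the constant should be checked against the original), the cost of composition is visible in how the per-step noise scale grows with the number of steps T:

```python
import math

def sigma_per_step(T, eps, delta):
    """Per-step noise scale from Theorem 3.2 (as read above, unit sensitivity):
    composing T Gaussian releases multiplies the required noise by sqrt(T)."""
    return 2.0 * math.sqrt(2.0 * T * math.log(1.0 / delta)) / eps

print(sigma_per_step(1, 8.0, 1e-5))
print(sigma_per_step(100, 8.0, 1e-5))   # 10x the single-step value: sqrt(T) growth
```

This √T growth is why subsampling amplification and tighter accountants (Rényi DP) matter in practice: they let the same ε budget cover many more update steps.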

4. EXPERIMENTS

We conduct experiments on the MNIST, extended SVHN, and CIFAR-10 datasets. Our implementation is publicly available. The model for MNIST has two convolutional layers with max-pooling and one fully connected layer. The model for SVHN and CIFAR-10 is the ResNet20 of He et al. (2016). We replace all batch normalization (Ioffe & Szegedy, 2015) layers with group normalization (Wu & He, 2018) layers because batch normalization mixes the representations of different samples, which prevents an accurate analysis of the privacy loss. The non-private accuracy for MNIST, SVHN, and CIFAR-10 is 99.1%, 95.9%, and 90.4%, respectively. We also provide experiments with pre-trained models in Appendix A. Tramèr & Boneh (2020) show that a differentially private linear classifier can achieve high accuracy using the features produced by pre-trained models. We examine whether GEP can improve the performance of such private linear classifiers. Notably, using the features produced by a model pre-trained on unlabeled ImageNet, GEP achieves 94.8% validation accuracy on CIFAR-10 with ε = 2.

Evaluated algorithms

We use the algorithm in Abadi et al. (2016) as the benchmark gradient perturbation approach, referred to as "GP". We also compare GEP with PATE (Papernot et al., 2017). We run the PATE experiments using the official implementation. The privacy parameter of PATE is data-dependent and hence cannot be released directly (see Section 3.3 in Papernot et al. (2017)); nonetheless, we report the PATE results for reference.

Implementation details

At each step, GEP needs to release two vectors: the noisy gradient embedding and the noisy residual gradient. The gradient embeddings have sensitivity S1 and the residual gradients have sensitivity S2 because of the clipping. The output of GEP is constructed as follows: (1) normalize the gradient embeddings and residual gradients by 1/S1 and 1/S2, respectively; (2) concatenate the rescaled vectors; (3) release the concatenated vector via the Gaussian mechanism with sensitivity √2; (4) rescale the two components by S1 and S2. B-GEP only needs to release the normalized noisy gradient embedding. We use the numerical tool in Mironov et al. (2019) to compute the privacy loss. For a given privacy budget and sampling probability, σ is set to the smallest value such that the privacy budget allows running the desired number of epochs. All experiments are run on a single Tesla V100 GPU with 16G memory. For ResNet20, the parameters are divided into five groups: input layer, output layer, and the three intermediate stages. For a given quota of basis vectors, we allocate it to the groups according to the square root of the number of parameters in each group and compute an orthonormal subspace basis for each group separately. We then concatenate the projections of all groups to construct the gradient embeddings. Moreover, the performance of GEP is not very sensitive to k. The number of power iterations t is set to 1, as empirical evaluation suggests that more iterations do not improve the performance of GEP or B-GEP.
For all datasets, the anchor gradients are computed on 2000 random samples from ImageNet. In Appendix C, we examine the influence of different numbers of anchor gradients and different sources of auxiliary data. The selected images are downsampled to size 32 × 32 (28 × 28 for MNIST), and we label them randomly at each update. For SVHN and CIFAR-10, k is chosen from [500, 1000, 1500, 2000]; for MNIST, we halve these values. We use SGD with momentum 0.9 as the optimizer. The initial learning rate and batch size are 0.1 and 1000, respectively. The learning rate is divided by 10 at the middle of training. Weight decay is set to 1 × 10^-4. The clipping threshold is 10 for original gradients and 2 for residual gradients. The number of training epochs for CIFAR-10 and MNIST is 50, 100, and 200 for privacy parameter ε = 2, 5, and 8, respectively. The number of training epochs for SVHN is 5, 10, and 20 for ε = 2, 5, and 8, respectively. The privacy parameter δ is 1 × 10^-6 for SVHN and 1 × 10^-5 for CIFAR-10 and MNIST.
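The four-step release procedure described above can be sketched as follows (a minimal NumPy illustration; the function and argument names are ours, and the vectors below are dummies):

```python
import numpy as np

def joint_release(w_sum, r_sum, s1, s2, sigma, rng):
    """(1) normalize each part by its clipping threshold, (2) concatenate
    (joint L2 sensitivity sqrt(2), since each normalized part contributes at
    most 1), (3) add Gaussian noise calibrated to sensitivity sqrt(2), and
    (4) rescale the two components back."""
    v = np.concatenate([w_sum / s1, r_sum / s2])
    v = v + rng.normal(0.0, sigma * np.sqrt(2.0), size=v.shape)
    k = w_sum.shape[0]
    return v[:k] * s1, v[k:] * s2

rng = np.random.default_rng(0)
w_noisy, r_noisy = joint_release(np.ones(500), np.zeros(4096), 10.0, 2.0, 1.0, rng)
```

Releasing both parts through one Gaussian mechanism lets a single noise multiplier σ be tracked by the privacy accountant, while the rescaling keeps the effective noise on each part proportional to its own threshold.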

Results

The best accuracy for each given ε is reported in Table 4. For all datasets, GEP achieves considerable improvement over the GP of Abadi et al. (2016). Specifically, GEP achieves 74.9% test accuracy on CIFAR-10 with (8, 10^-5)-DP, outperforming GP by 18.5%. PATE achieves the best accuracy on MNIST, but its performance drops as the dataset becomes more complex. We also plot the relation between accuracy and k in Figure 6. GEP is less sensitive to the choice of k and outperforms B-GEP for all choices of k. The improvement from increasing k shrinks as k grows. We note that the memory cost of choosing a large k is high because we need to store at least k individual gradients to compute the anchor subspace.

5. CONCLUSION

In this paper, we propose Gradient Embedding Perturbation (GEP) for learning with differential privacy. GEP leverages the redundancy of gradients to reduce the added noise and outputs an unbiased estimator of the target gradient. Its key designs, such as residual gradient perturbation and parameter grouping, significantly boost the applicability of GEP. Extensive experiments on real-world datasets demonstrate the superior utility of GEP.

A EXPERIMENTS WITH PRE-TRAINED MODELS

Recent works have shown that pre-training models on unlabeled data can be beneficial for subsequent learning tasks (Chen et al., 2020a; He et al., 2020). Tramèr & Boneh (2020) demonstrate that a differentially private linear classifier can achieve high accuracy using the features produced by such pre-trained models. We show that GEP can also benefit from pre-trained models. Inspired by Tramèr & Boneh (2020), we use the output of the penultimate layer of a pre-trained ResNet152 model as features to train a private linear classifier. The ResNet152 model is pre-trained on unlabeled ImageNet using SimCLR (Chen et al., 2020a). The feature dimension is 4096.

Implementation details We choose the privacy parameter ε from [0.1, 0.5, 1, 2]. The privacy parameter δ is 1 × 10⁻⁵. We run all experiments 5 times and report the average accuracy.
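The "private linear classifier on frozen features" recipe can be sketched with plain DP-SGD on precomputed features: clip each per-example gradient, sum, add Gaussian noise. The following NumPy sketch is our own toy illustration (binary logistic regression, illustrative hyperparameters), not the paper's setup:

```python
import numpy as np

def dp_linear_logreg(X, y, epochs=5, lr=0.5, clip=1.0, sigma=1.0, seed=0):
    """Toy DP-SGD for a binary linear classifier on precomputed features:
    per-example gradient clipping plus Gaussian noise on the summed gradient.
    X: (n, d) feature matrix, y: (n,) labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))           # sigmoid predictions
        per_ex = (p - y)[:, None] * X               # per-example logistic gradients
        norms = np.linalg.norm(per_ex, axis=1, keepdims=True)
        per_ex *= np.minimum(1.0, clip / np.maximum(norms, 1e-12))  # clip norms
        g = per_ex.sum(axis=0) + rng.normal(0.0, sigma * clip, size=d)
        w -= lr * g / n
    return w
```

Calibrating `sigma` to a target (ε, δ) would use a privacy accountant, exactly as in the main experiments.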

Results

The experiment results are shown in Table 3. GEP outperforms GP for all values of ε. With privacy bound ε = 2, GEP achieves 94.8% validation accuracy on the CIFAR10 dataset, improving over the GP baseline by 1.4%. For a very strong privacy guarantee (ε = 0.1), B-GEP performs on par with GEP because a strong privacy guarantee requires large noise, and the useful signal in the residual gradient is submerged in the added noise. B-GEP benefits less from a larger ε compared to GP or GEP. For ε = 1 and 2, the performance of B-GEP is worse than that of GP. This is because a larger ε cannot reduce the systematic error of B-GEP (see Remark 1 in Section 3.2).

B COMPLEXITY ANALYSIS

We provide an analysis of the computational and memory costs of constructing the anchor subspace, which is the dominant additional cost of GEP compared to conventional gradient perturbation. Notation: k, m, n, and p denote the dimension of the anchor subspace, the number of anchor gradients, the number of private gradients, and the model dimension, respectively. To reduce the computational and memory costs, we divide the parameters into g groups and compute one orthonormal basis for each group. We refer to this approach as 'parameter grouping'. In this section, we assume that both the parameters and the dimension of the anchor subspace are divided evenly across groups. Table 4 summarizes the additional costs of GEP with and without parameter grouping. Parameter grouping reduces the computational and memory costs significantly.
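Under the even-split assumption above, the savings from grouping can be made concrete with a back-of-envelope calculator. The cost model below is our own accounting of one power iteration (two passes against the m × p anchor matrix plus re-orthonormalization); the exact entries of Table 4 may differ in constants:

```python
def power_iteration_costs(k, m, p, g=1):
    """Rough FLOP and memory counts for one power iteration that refines
    k basis vectors against m anchor gradients of dimension p. With
    parameter grouping, the p coordinates and k basis vectors are split
    evenly across g groups, so each group handles an (m x p/g) matrix
    and k/g basis vectors."""
    pg, kg = p // g, k // g
    # per group: B A^T and (B A^T) A multiplies (~2*m*kg*pg FLOPs each),
    # plus ~2*kg^2*pg for QR-style re-orthonormalization
    flops = g * (4 * m * kg * pg + 2 * kg * kg * pg)
    # anchor gradients are stored once; basis vectors are stored per group
    memory = m * p + g * (kg * pg)
    return flops, memory
```

The per-group matrices shrink in both dimensions, so the basis-related FLOPs drop by roughly a factor of g and the basis storage by exactly a factor of g, while the m × p anchor gradients themselves still have to be stored.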

C ABLATION STUDY

The influence of choosing different auxiliary datasets. We conduct experiments with different choices of auxiliary datasets. For CIFAR10, we try 2000 random test samples from CIFAR10, 2000 random samples from CIFAR100, and 2000 random samples from ImageNet. When the auxiliary dataset is CIFAR10, we try both correct labels and random labels. For all choices of auxiliary datasets, the test accuracy is evaluated on the 8000 test samples of CIFAR10 that are not used as auxiliary data. Other implementation details are the same as in Section 4. The results are shown in Table 5. Surprisingly, using samples from CIFAR10 with correct labels yields the worst accuracy. This may be because the model 'overfits' the auxiliary data when it has access to correct labels, which makes the anchor subspace contain less information about the private gradients. The best accuracy is achieved using samples from CIFAR10 with random labels, which makes sense because in this case the features of the auxiliary data and the private data have the same distribution. Using samples from CIFAR100 or ImageNet as auxiliary data has only a small influence on the test accuracy.

The influence of the number of anchor gradients. In the main text, the size of the auxiliary dataset is m = 2000. We conduct more experiments with different sizes of the auxiliary dataset to examine the influence of m. The auxiliary data is randomly sampled from ImageNet. Table 6 reports the test accuracy on CIFAR10 with different choices of m. For both B-GEP and GEP, increasing m leads to slightly improved performance.

The projection error of random basis vectors. It is tempting to construct the anchor subspace using random basis vectors because the Johnson-Lindenstrauss Lemma (Dasgupta & Gupta, 2003) guarantees that pairwise distances between points are approximately preserved after projection into a random subspace of much lower dimension. We empirically verify the projection error of Gaussian random basis vectors on CIFAR10 and SVHN.
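The gap between a data-dependent basis and a random basis is easy to reproduce on synthetic data. The sketch below (our own illustration, not the paper's experiment) projects a matrix with low stable rank onto its top-k singular subspace versus a random k-dimensional subspace:

```python
import numpy as np

def projection_error(G, B):
    """Relative residual after projecting the rows of G onto the row space
    of the orthonormal basis B: ||G - (G B^T) B||_F / ||G||_F."""
    R = G - (G @ B.T) @ B
    return np.linalg.norm(R) / np.linalg.norm(G)

rng = np.random.default_rng(0)
# Synthetic "gradient matrix": n gradients in p dimensions whose energy
# lies in a 20-dimensional subspace (a stand-in for redundant gradients).
n, p, k = 200, 1000, 50
G = rng.normal(size=(n, 20)) @ rng.normal(size=(20, p))

B_pca = np.linalg.svd(G, full_matrices=False)[2][:k]   # top-k right singular vectors
B_rand = np.linalg.qr(rng.normal(size=(p, k)))[0].T    # random orthonormal basis

err_pca = projection_error(G, B_pca)    # near zero: subspace captures G
err_rand = projection_error(G, B_rand)  # large: random subspace misses most energy
```

A random k-dimensional subspace retains only about k/p of each gradient's squared norm, so the residual stays large even though pairwise distances are preserved in the Johnson-Lindenstrauss sense.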
The experiment settings are the same as in Section 4. The projection errors over the training process are plotted in Figure 7. The projection error of random basis vectors is very high (> 95%) throughout training. This is because preserving the pairwise distances between points does not imply that the projection retains most of each gradient's norm.

If we set σ_1 = S_1·σ and σ_2 = S_2·σ for some σ, then Algorithm 1 satisfies (λ, λ/σ²)-RDP because of Lemmas D.2 and D.3. In order to guarantee (ε, δ)-DP, we need

λ/σ² + log(1/δ)/(λ − 1) ≤ ε. (5)

Choosing λ = 1 + 2 log(1/δ)/ε and rearranging Eq (5), we need

σ² ≥ 2(ε + 2 log(1/δ))/ε². (6)

Then using the constraint on ε concludes the proof.

Theorem 3.2. For any ε < 2 log(1/δ) and δ ∈ (0, 1), the output of Algorithm 2 satisfies (ε, δ)-DP if we set σ ≥ 2√(2T log(1/δ))/ε.

Proof of Theorem 3.2. From the proof of Theorem 3.1, each call of GEP satisfies (λ, λ/σ²)-RDP. Then by the composition property of RDP (Lemma D.3), the output of Algorithm 2 satisfies (λ, Tλ/σ²)-RDP. Plugging Tλ/σ² into Equations (5) and (6) concludes the proof.

In Theorem 3.3, r̄ is defined by r̄² = (1/T)·Σ_{t=0}^{T−1} r_t², where r_t = max_i ‖(R_t)_{i,:}‖ is the sensitivity of the residual gradients at step t.

Proof of Theorem 3.3. The β-smoothness condition gives

L(θ_{t+1}) ≤ L(θ_t) + ⟨∇L(θ_t), θ_{t+1} − θ_t⟩ + (β/2)·‖θ_{t+1} − θ_t‖². (7)

Based on the update rule of GEP, we have θ_{t+1} − θ_t = −η·ṽ = −η·∇L(θ_t) − (η/n)·(z_t⁽¹⁾B + z_t⁽²⁾), where z_t⁽¹⁾ ∼ N(0, σ²·I_{k×k}) and z_t⁽²⁾ ∼ N(0, σ²·r_t²·I_{p×p}) are the perturbation noises and r_t = max_i ‖(R_t)_{i,:}‖ is the sensitivity of the residual gradients at step t. Taking expectations in Eq (7) with respect to the perturbation noises gives

E[L(θ_{t+1})] ≤ E[L(θ_t)] − (η − βη²/2)·E[‖∇L(θ_t)‖²] + (βη²σ²/(2n²))·(k + p·r_t²).

The second inequality holds because L is convex. Then choose η = 1/β and plug ∇L(θ_t) = (θ_t − θ_{t+1})/η − (z_t⁽¹⁾B + z_t⁽²⁾)/n into Eq (10):

E[L(θ_{t+1})] − L(θ*) ≤ β·E[⟨θ_t − θ_{t+1}, θ_t − θ*⟩] − (β/2)·E[‖θ_t − θ_{t+1}‖²] + (σ²/(βn²))·(k + p·r_t²)
= (β/2)·(E[‖θ_t − θ*‖²] − E[‖θ_{t+1} − θ*‖²]) + (σ²/(βn²))·(k + p·r_t²).

Summing over t = 0, …, T − 1 and using convexity, we have

E[L(θ̄)] − L(θ*) ≤ (β/(2T))·‖θ_0 − θ*‖² + (σ²/(βn²))·(k + (p/T)·Σ_{t=0}^{T−1} r_t²).

Then substituting T = nεβ/√p and σ = O(√(T log(1/δ))/ε) yields the desired bound.
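The calibration in Theorem 3.2 can be sanity-checked numerically: with the prescribed σ, the (λ, Tλ/σ²)-RDP guarantee converts to (ε′, δ)-DP with ε′ ≤ ε at some order λ > 1. The sketch below uses the standard RDP-to-DP conversion minimized over a grid of orders; the paper itself uses the tighter accountant of Mironov et al. (2019):

```python
import math

def theorem32_sigma(eps, delta, T):
    """Noise multiplier prescribed by Theorem 3.2: 2*sqrt(2*T*log(1/delta))/eps."""
    return 2.0 * math.sqrt(2.0 * T * math.log(1.0 / delta)) / eps

def dp_eps_from_rdp(T, sigma, delta):
    """Best (eps, delta)-DP implied by T compositions of a (lam, lam/sigma^2)-RDP
    step, via eps = T*lam/sigma^2 + log(1/delta)/(lam - 1), minimized over a
    grid of orders lam > 1."""
    log_inv_delta = math.log(1.0 / delta)
    return min(
        T * lam / sigma**2 + log_inv_delta / (lam - 1.0)
        for lam in (1.0 + 0.01 * i for i in range(1, 100000))
    )
```

For example, with ε = 8, δ = 10⁻⁵, and T = 1000 steps, the converted privacy loss comes out strictly below the target ε, confirming that the closed-form σ is a valid (if slightly conservative) choice.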



Abadi et al. (2016) achieve 73% accuracy on CIFAR-10, but they need to pre-train the model on CIFAR-100.

Code: https://github.com/dayu11/Gradient-Embedding-Perturbation



Figure 1: Noise norm vs. gradient norm of ResNet20 at initialization. The noise variance is chosen such that SGD satisfies (5, 10⁻⁵)-DP after 90 epochs in Abadi et al. (2016).

Figure 2: Stable rank ‖·‖_F²/‖·‖₂² (Tropp et al., 2015) of the batch gradient matrix of given groups (with p parameters). The setting is ResNet20 on CIFAR-10. The stable rank is small throughout training.
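The stable rank used in this figure is a one-liner to compute. A small illustrative helper (ours, not from the paper's code):

```python
import numpy as np

def stable_rank(A):
    """Stable rank ||A||_F^2 / ||A||_2^2: always at most rank(A), and small
    whenever the singular values decay quickly."""
    s = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)
    return float(np.sum(s**2) / s[0] ** 2)
```

For an identity matrix the stable rank equals the dimension, while any rank-one matrix has stable rank exactly 1; a gradient matrix with fast spectral decay sits close to the latter.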

Figure 4: Relative projection error ((1/n)·‖r‖/‖g‖) of the second stage in ResNet20. The number of anchor gradients is 2000. The dimension of the anchor subspace is k. The learning rate is decayed by 10 at epoch 30. The left plot uses random samples from ImageNet. The right plot uses random samples from test data. The benefit of increasing k becomes smaller when k is larger.

Input: private dataset D; auxiliary dataset D⁽ᵃ⁾; number of updates T; learning rate η; configuration of GEP C; loss function ℓ. Output: differentially private model θ_T.

Figure 6: Test accuracy when varying the dimension of the anchor subspace. GEP significantly outperforms B-GEP for all k. Moreover, the performance of GEP is not that sensitive to k.

Figure 7: Projection error rate of random basis vectors. The dimension of subspace is denoted by k.

Theorem 3.3. Suppose the loss L(θ) = (1/n)·Σ_{(x,y)∈D} ℓ(f_θ(x), y) is 1-Lipschitz, convex, and β-smooth. If η = 1/β, T = nεβ/√p, and θ̄ = (1/T)·Σ_{t=1}^{T} θ_t, then we have E[L(θ̄)] − L(θ*) ≤ O((√p·‖θ_0 − θ*‖² + (k/√p + √p·r̄²)·log(1/δ))/(nε)).

Test accuracy (in %) with varying choices of privacy bound ε. The numbers under the symbol ∆ denote the improvement over the GP baseline.

The clipping threshold of residual gradients is still one-fifth of the clipping threshold of the original gradients. The dimension of the anchor subspace is set to 200 ≈ √p, where p = 40960 is the model dimension. We randomly sample 500 samples from the test set as auxiliary data and evaluate performance on the rest of the test samples. The optimizer is Adam with default momentum coefficients. Other hyper-parameters are listed in Table 2.

Table 2: Hyperparameter values used in Appendix A.

Table 3: Validation accuracy (in %) on CIFAR10 with varying choices of ε. We train a private linear model on top of the features from a ResNet152 model, which is pre-trained on unlabeled ImageNet.

Computational and memory costs of a single power iteration in Algorithm 1. The computational cost is measured by the number of floating-point operations. The memory cost is measured by the number of floating-point numbers we need to store. 'GEP+PG' denotes GEP with parameter grouping, and g denotes the number of groups. Notation: k, m, n, and p denote the dimension of the anchor subspace, the number of anchor gradients, the number of private gradients, and the model dimension, respectively.

Table 5: Test accuracy on CIFAR10 with different choices of auxiliary datasets. The privacy guarantee is (8, 10⁻⁵)-DP. We report the average accuracy of five runs with standard deviations in brackets.

Table 6: Test accuracy on CIFAR10 with different sizes of the auxiliary dataset. The privacy guarantee is (8, 10⁻⁵)-DP. We report the average accuracy of five runs with standard deviations in brackets.


D MISSING PROOFS

Lemma 3.1. Assume that the population covariance matrix Σ has rank k and that the distribution P satisfies P(ξ ∈ F_s) = 0 for all s-flats F_s in R^p with 0 ≤ s < k. Let the eigendecompositions of Σ and the empirical covariance matrix Ŝ be given, respectively.

Proof. We extend Theorem 3.2 in Eaton & Perlman (1973) to the low-rank case. We note that the subspace spanned by V̂_k̂ is contained in the space spanned by V_k by definition. Hence k̂ ≤ k.

Theorem 3.1. Let S_1 and S_2 be the sensitivities of w and r, respectively. The output of Algorithm 1 satisfies (ε, δ)-DP for any δ ∈ (0, 1) and ε < 2 log(1/δ) if we set σ_1 = S_1·σ and σ_2 = S_2·σ with σ ≥ √(2(ε + 2 log(1/δ)))/ε.

Proof of Theorem 3.1. We first introduce some background on Rényi differential privacy (RDP) (Mironov, 2017). RDP measures the Rényi divergence between the output distributions of a mechanism on neighboring datasets.

Definition 2 ((λ, γ)-RDP). A randomized mechanism f is said to guarantee (λ, γ)-RDP if for any neighboring datasets D, D′ and λ > 1 it holds that D_λ(f(D) ‖ f(D′)) ≤ γ, where D_λ(·‖·) denotes the Rényi divergence of order λ.

We next introduce some useful properties of RDP.

Lemma D.2 (Gaussian mechanism of RDP). Let S = max_{D∼D′} ‖f(D) − f(D′)‖ be the ℓ₂ sensitivity of f; then the Gaussian mechanism M = f(D) + z satisfies (λ, λS²/(2σ²))-RDP, where z ∼ N(0, σ²·I_{p×p}).

Lemma D.3 (Composition of RDP). If M₁ and M₂ satisfy (λ, γ₁)-RDP and (λ, γ₂)-RDP, respectively, then their composition satisfies (λ, γ₁ + γ₂)-RDP.

