BYPASSING THE AMBIENT DIMENSION: PRIVATE SGD WITH GRADIENT SUBSPACE IDENTIFICATION

Abstract

Differentially private SGD (DP-SGD) is one of the most popular methods for solving differentially private empirical risk minimization (ERM). Due to the noisy perturbation added to each gradient update, the error rate of DP-SGD scales with the ambient dimension p, the number of parameters in the model. Such dependence can be problematic for over-parameterized models where p ≫ n, the number of training samples. Existing lower bounds on private ERM show that such dependence on p is inevitable in the worst case. In this paper, we circumvent the dependence on the ambient dimension by leveraging a low-dimensional structure of the gradient space in deep networks: the stochastic gradients for deep nets usually stay in a low-dimensional subspace during training. We propose Projected DP-SGD, which performs noise reduction by projecting the noisy gradients to a low-dimensional subspace given by the top gradient eigenspace on a small public dataset. We provide a general sample complexity analysis on the public dataset for the gradient subspace identification problem and demonstrate that under certain low-dimensional assumptions the public sample complexity only grows logarithmically in p. Finally, we provide a theoretical analysis and empirical evaluations to show that our method can substantially improve the accuracy of DP-SGD in the high-privacy regime (corresponding to low privacy loss ε).

1. INTRODUCTION

Many fundamental machine learning tasks involve solving empirical risk minimization (ERM): given a loss function ℓ, find a model w ∈ R^p that minimizes the empirical risk L̂_n(w) = (1/n) Σ_{i=1}^n ℓ(w, z_i), where z_1, ..., z_n are i.i.d. examples drawn from a distribution P. In many applications, the training data may contain highly sensitive information about individuals. When the models are given by deep neural networks, their rich representations can potentially reveal fine details of the private data. Differential privacy (DP) (Dwork et al., 2006) has by now become the standard approach to providing principled and rigorous privacy guarantees in machine learning. Roughly speaking, DP is a stability notion requiring that no individual example has a significant influence on the trained model. One of the most commonly used algorithms for solving private ERM is differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016; Bassily et al., 2014; Song et al., 2013), a private variant of SGD that perturbs each gradient update with a random noise vector drawn from an isotropic Gaussian distribution N(0, σ²I_p), with appropriately chosen variance σ². Due to this isotropic gradient perturbation, the error rate of DP-SGD depends on the ambient dimension p, the number of parameters in the model. In the case of a convex loss ℓ, Bassily et al. (2014) show that DP-SGD achieves the optimal empirical excess risk of Õ(√p/(nε)). For non-convex losses ℓ, which are more common in neural network training, minimizing L̂_n(w) is in general intractable. However, many (non-private) gradient-based optimization methods are effective in practice and can provably find approximate stationary points with vanishing gradient norm ‖∇L̂_n(w)‖₂ (see, e.g., Nesterov (2014); Ghadimi and Lan (2013)).
Moreover, for a wide family of loss functions L̂_n satisfying the Polyak-Łojasiewicz condition (Polyak, 1963), minimizing the gradient norm implies reaching a global optimum. Under a privacy constraint, Wang and Xu (2019) recently showed that DP-SGD drives the empirical gradient norm down to Õ(p^{1/4}/√(nε)) when the loss function is smooth. Furthermore, existing lower bounds on private ERM (Bassily et al., 2014) show that such dependence on p is inevitable in the worst case. However, many modern machine learning tasks involve training extremely large models, with the number of parameters substantially larger than the number of training samples. For these large models, the error dependence on p can be a barrier to practical private ERM. In this paper, we aim to overcome this dependence on the ambient dimension p by leveraging the structure of the gradient space in the training of neural networks. We take inspiration from the empirical observation of Li et al. (2020); Gur-Ari et al. (2018); Papyan (2019) that even though the ambient dimension of the gradients is large, the set of sample gradients at most iterations along the optimization trajectory is often contained in a much lower-dimensional subspace. While this observation has been made mostly for the non-private SGD algorithm, we also provide our own empirical evaluation of this structure (in terms of eigenvalues of the gradient second moment matrix) in Figure 1. Based on this observation, we provide a modular private ERM optimization framework with two components. At each iteration t, the algorithm performs the following two steps: 1) Gradient dimension reduction. Let g_t be the mini-batch gradient at iteration t. In general, this subroutine solves the following problem: given any k < p, find a linear projection V̂_k(t) ∈ R^{p×k} such that the reconstruction error ‖g_t − V̂_k(t)V̂_k(t)^⊤ g_t‖ is small.
To implement this subroutine, we follow a long line of work that studies private data analysis with access to an auxiliary public dataset S_h drawn from the same distribution P, for which we do not need to provide a formal privacy guarantee (Bassily et al., 2019b; 2020; Feldman et al., 2018; Avent et al., 2017; Papernot et al., 2017). In our case, we compute V̂_k(t) as the top-k eigenspace of the gradients evaluated on S_h. Alternatively, this subroutine could be implemented through private subspace identification on the private dataset. However, to the best of our knowledge, all existing methods have reconstruction error scaling with √p (Dwork et al., 2014), which would propagate to the optimization error. 2) Projected DP-SGD (PDP-SGD). Given the projection V̂_k(t), we perturb the gradient in the projected subspace: g̃_t = V̂_k(t)V̂_k(t)^⊤(g_t + b_t), where b_t is a p-dimensional Gaussian vector. The projection provides a large reduction of the noise and enables higher accuracy for PDP-SGD. Our results. We provide both theoretical analyses and empirical evaluations of PDP-SGD: Uniform convergence for projections. A key step in our theoretical analysis is to bound the reconstruction error on the gradients from the projection V̂_k(t). This reduces to bounding the deviation ‖V̂_k(t)V̂_k(t)^⊤ − V_k(t)V_k(t)^⊤‖₂, where V_k(t) denotes the top-k eigenspace of the population second moment matrix E[∇ℓ(w_t, z)∇ℓ(w_t, z)^⊤]. To handle the adaptivity of the sequence of iterates, we provide a uniform deviation bound for all w ∈ W, where the set W contains all of the iterates. By leveraging generic chaining techniques, we provide a deviation bound that scales linearly with a complexity measure of the set W, the γ₂ function due to Talagrand (2014). We provide low-complexity examples of W that are supported by empirical observations and show that their γ₂ function only scales logarithmically with p. Convergence for convex and non-convex optimization.
Building on the reconstruction error bound, we provide convergence and sample complexity results for our method PDP-SGD for two types of loss functions: 1) smooth and non-convex, and 2) Lipschitz convex. Under suitable assumptions on the gradient space, our rates only scale logarithmically with p. Empirical evaluation. We provide an empirical evaluation of PDP-SGD on two real datasets. In our experiments, we construct the "public" datasets by taking very small random sub-samples of these two datasets (100 samples). While these public datasets are not sufficient for training an accurate predictor, we demonstrate that they provide a useful gradient subspace projection and a substantial accuracy improvement over DP-SGD.
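To build intuition for the noise-reduction step above: projecting an isotropic Gaussian perturbation onto a k-dimensional subspace shrinks its expected squared norm from pσ² to kσ². A minimal numerical sketch (all dimensions chosen arbitrarily for illustration, not taken from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, sigma = 1000, 20, 1.0

# Random orthonormal basis for a k-dimensional subspace of R^p.
Vk, _ = np.linalg.qr(rng.standard_normal((p, k)))

# Isotropic Gaussian noise b ~ N(0, sigma^2 I_p), as added by DP-SGD.
b = sigma * rng.standard_normal(p)

full_noise = np.sum(b**2)                  # squared norm is about p * sigma^2
proj_noise = np.sum((Vk @ (Vk.T @ b))**2)  # squared norm is about k * sigma^2

print(full_noise, proj_noise)  # projected noise is roughly a k/p fraction of the full noise
```

The same projector V_k V_k^T leaves any gradient lying in the subspace untouched, which is the sense in which the projection trades a small reconstruction error for a large noise reduction.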

Related work.

Beyond the aforementioned work, there has been recent work on private ERM that also leverages the low-dimensional structure of the problem. Jain and Thakurta (2014); Song et al. (2020) show dimension-independent excess empirical risk bounds for convex generalized linear problems when the input data matrix is low-rank. Kairouz et al. (2020) study unconstrained convex empirical risk minimization and provide a noisy AdaGrad method that achieves a dimension-free excess risk bound, provided that the gradients along the optimization trajectory lie in a low-dimensional subspace. In comparison, our work studies both convex and non-convex problems, and our analysis applies to more general low-dimensional structures that can be characterized by small γ₂ functions (Talagrand, 2014; Gunasekar et al., 2015) (e.g., low-rank gradients and fast decay in the magnitude of the gradient coordinates). Recently, Tramer and Boneh (2021) showed that private learning with features learned on public data from a similar domain can significantly improve utility. Zhang et al. (2021) leverage the sparsity of the gradients in deep nets to improve the dependence on dimension in the error rate. We also note a recent work (Yu et al., 2021) that proposes an algorithm similar to PDP-SGD. However, in addition to perturbing the projected gradient in the top eigenspace of the public data, their algorithm also adds noise to the residual gradient. Their error rate scales with the dimension p in general due to the noise added to the full space. To achieve a dimension-independent error bound, their analysis requires fresh public samples drawn from the same distribution at each step, which consequently requires a large public dataset with size scaling linearly with T. In comparison, our analysis does not require fresh public samples at each iteration, and our experiments demonstrate that a small public dataset of size no more than 150 suffices.

2. PRELIMINARIES

Given a private dataset S = {z_1, ..., z_n} drawn i.i.d. from an underlying distribution P, we want to solve the following empirical risk minimization (ERM) problem subject to differential privacy: min_w L̂_n(w) = (1/n) Σ_{i=1}^n ℓ(w, z_i), where the parameter w ∈ R^p. We optimize this objective with an iterative algorithm. At each step t, we write w_t for the algorithm's iterate, use g_t to denote the mini-batch gradient, and let ∇L̂_n(w_t) = (1/n) Σ_{i=1}^n ∇ℓ(w_t, z_i) denote the empirical gradient. In addition to the private dataset, the algorithm can also freely access a small public dataset S_h = {z̃_1, ..., z̃_m} drawn from the same distribution P, without any privacy constraint. Notation. We write M_t ∈ R^{p×p} for the second moment matrix of the gradients evaluated on the public dataset S_h, i.e., M_t = (1/m) Σ_{i=1}^m ∇ℓ(w_t, z̃_i)∇ℓ(w_t, z̃_i)^⊤, and Σ_t ∈ R^{p×p} for the population second moment matrix, i.e., Σ_t = E_{z∼P}[∇ℓ(w_t, z)∇ℓ(w_t, z)^⊤]. We use V(t) ∈ R^{p×p} for the full eigenspace of Σ_t, V̂_k(t) ∈ R^{p×k} for the top-k eigenspace of M_t, and V_k(t) ∈ R^{p×k} for the top-k eigenspace of Σ_t. To present our results in the subsequent sections, we introduce the eigen-gap notation α_t: letting λ_1(Σ_t) ≥ ... ≥ λ_p(Σ_t) be the eigenvalues of Σ_t, we use α_t to denote a lower bound on the gap between λ_k(Σ_t) and λ_{k+1}(Σ_t), i.e., λ_k(Σ_t) − λ_{k+1}(Σ_t) ≥ α_t. We also define W ⊆ R^p as the set that contains all possible iterates, i.e., w_t ∈ W for t ∈ [T]. Throughout, for any matrix A and vector v, ‖A‖₂ denotes the spectral norm and ‖v‖₂ denotes the ℓ₂ norm. Definition 1 (Differential Privacy (Dwork et al., 2006)) A randomized algorithm R is (ε, δ)-differentially private if for any pair of datasets D, D′ differing in exactly one data point and for all events Y ⊆ Range(R) in the output range of R, we have P{R(D) ∈ Y} ≤ exp(ε) P{R(D′) ∈ Y} + δ, where the probability is taken over the randomness of R.
To establish the privacy guarantee of our algorithm, we combine three standard tools in differential privacy: 1) the Gaussian mechanism (Dwork et al., 2006), which releases an aggregate statistic (e.g., the empirical average gradient) via Gaussian perturbation; 2) privacy amplification via subsampling (Kasiviswanathan et al., 2008), which reduces the privacy parameters ε and δ by running the private computation on a random subsample; and 3) the advanced composition theorem (Dwork et al., 2010), which tracks the cumulative privacy loss over the course of the algorithm. We analyze our method under two assumptions on the gradients of ℓ. Assumption 1 For any w ∈ R^p and example z, ‖∇ℓ(w, z)‖₂ ≤ G. Assumption 2 For any example z, the gradient ∇ℓ(w, z) is ρ-Lipschitz with respect to a suitable pseudo-metric d: R^p × R^p → R, i.e., ‖∇ℓ(w, z) − ∇ℓ(w′, z)‖₂ ≤ ρ d(w, w′) for all w, w′ ∈ R^p. Note that Assumption 1 implies that L̂_n(w) is G-Lipschitz, and Assumption 2 implies that L̂_n(w) is ρ-smooth when d is the ℓ₂ distance. We discuss additional assumptions regarding the structure of the stochastic gradients and the error rates for different types of functions in Section 3.
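For concreteness, tool 1) can be sketched as follows. The classic calibration σ = Δ₂√(2 ln(1.25/δ))/ε (valid for ε ≤ 1, from Dwork and Roth) and the ℓ₂-sensitivity bound Δ₂ ≤ 2G/n for an average of n gradients with norm at most G under replacement adjacency are standard facts; the function names and the clipping step below are our illustrative choices, not the paper's:

```python
import numpy as np

def gaussian_mechanism_sigma(sensitivity: float, eps: float, delta: float) -> float:
    """Classic Gaussian mechanism calibration: sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / eps,
    valid for eps <= 1."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

def private_mean_gradient(grads: np.ndarray, G: float, eps: float, delta: float,
                          rng=np.random.default_rng(0)) -> np.ndarray:
    """Release the average of n per-sample gradients, each clipped to l2-norm G.
    Replacing one example changes the average by at most 2G/n in l2-norm."""
    n = grads.shape[0]
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads * np.minimum(1.0, G / np.maximum(norms, 1e-12))
    sigma = gaussian_mechanism_sigma(2.0 * G / n, eps, delta)
    return clipped.mean(axis=0) + rng.normal(0.0, sigma, size=grads.shape[1])
```

Tools 2) and 3) then reduce the per-iteration cost of releasing such noisy averages and account for their composition over T iterations.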

3. PROJECTED PRIVATE GRADIENT DESCENT

PDP-SGD follows the classical noisy gradient descent algorithm DP-SGD (Wang et al., 2017; Wang and Xu, 2019; Bassily et al., 2014). DP-SGD adds isotropic Gaussian noise b_t ∼ N(0, σ²I_p) to the gradient g_t, i.e., each coordinate of g_t is perturbed by Gaussian noise. Since the gradient is p-dimensional, this method ends up with a factor of p in the error rate (Bassily et al., 2014; 2019a). Our algorithm is inspired by the recent observations that stochastic gradients stay in a low-dimensional space during the training of deep nets (Li et al., 2020; Gur-Ari et al., 2018). The observation also holds for the private training algorithm DP-SGD (Figure 1 (b) and (c)). Intuitively, most of the information needed for gradient descent is embedded in the top eigenspace of the stochastic gradients. Thus, PDP-SGD performs noise reduction by projecting the noisy gradient g_t + b_t to an approximation of such a subspace computed on a public dataset S_h.

Algorithm 1 Projected DP-SGD (PDP-SGD)
1: Input: Training set S, public set S_h, loss ℓ(·), initial point w_0
2: Set: Noise parameter σ, number of iterations T, step sizes η_t
3: for t = 0, ..., T do
4:   Compute the top-k eigenspace V̂_k(t) of M_t = (1/|S_h|) Σ_{z̃_i∈S_h} ∇ℓ(w_t, z̃_i)∇ℓ(w_t, z̃_i)^⊤.
5:   g_t = (1/|B_t|) Σ_{z_i∈B_t} ∇ℓ(w_t, z_i), with B_t uniformly sampled from S with replacement.
6:   Project the noisy gradient using V̂_k(t): g̃_t = V̂_k(t)V̂_k(t)^⊤(g_t + b_t), where b_t ∼ N(0, σ²I_p).
7:   Update the parameter using the projected noisy gradient: w_{t+1} = w_t − η_t g̃_t.
8: end for

Thus, our algorithm involves two steps at each iteration: subspace identification and noisy gradient projection. The pseudo-code of PDP-SGD is given in Algorithm 1. At each iteration t, in order to obtain an approximate subspace without leaking information about the private dataset S, we evaluate the second moment matrix M_t on S_h and compute the top-k eigenvectors V̂_k(t) of M_t (line 4 in Algorithm 1). Then we project the noisy gradient g_t + b_t onto the top-k eigenspace, i.e., g̃_t = V̂_k(t)V̂_k(t)^⊤(g_t + b_t) (line 6 in Algorithm 1). Finally, PDP-SGD uses the projected noisy gradient g̃_t to update the parameter: w_{t+1} = w_t − η_t g̃_t. Let us first state its privacy guarantee.

Theorem 1 (Privacy) Under Assumption 1, there exist constants c_1 and c_2 so that given the number of iterations T, for any ε ≤ c_1 q² T, where q = |B_t|/n, PDP-SGD (Algorithm 1) is (ε, δ)-differentially private for any δ > 0, if σ² ≥ c_2 G² T ln(1/δ)/(n²ε²).

The privacy proof essentially follows the proof for DP-SGD (Abadi et al., 2016). At each iteration, the update step of PDP-SGD is post-processing of the Gaussian mechanism that computes a noisy estimate of the gradient g_t + b_t. Hence the privacy guarantee of releasing the sequence {g_t + b_t}_t is exactly the same as in the proof of Theorem 1 of Abadi et al. (2016).
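A single iteration of Algorithm 1 (lines 4-7) can be sketched in NumPy. This is an illustrative reimplementation, not the authors' code: `loss_grad` is a stand-in for the per-sample gradient oracle, the gradient norm bound of Assumption 1 is taken for granted (no clipping shown), and all sizes are arbitrary:

```python
import numpy as np

def pdp_sgd_step(w, private_batch, public_data, loss_grad, k, sigma, eta,
                 rng=np.random.default_rng(0)):
    # Line 4: top-k eigenspace of the public second-moment matrix M_t.
    pub = np.stack([loss_grad(w, z) for z in public_data])   # (m, p)
    M = pub.T @ pub / len(public_data)                       # (p, p)
    _, vecs = np.linalg.eigh(M)                              # eigenvalues ascending
    Vk = vecs[:, -k:]                                        # top-k eigenvectors

    # Line 5: mini-batch gradient on the private batch.
    g = np.mean([loss_grad(w, z) for z in private_batch], axis=0)

    # Line 6: perturb with isotropic Gaussian noise, then project.
    g_tilde = Vk @ (Vk.T @ (g + rng.normal(0.0, sigma, size=g.shape)))

    # Line 7: descent step with the projected noisy gradient.
    return w - eta * g_tilde
```

Only g_t + b_t touches the private data, so by post-processing the privacy cost per step is that of releasing the noisy gradient itself, as Theorem 1 states.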

3.1. GRADIENT SUBSPACE IDENTIFICATION

We now analyze the deviation between the approximate subspace V̂_k(t)V̂_k(t)^⊤ and the true (population) subspace V_k(t)V_k(t)^⊤, i.e., ‖V̂_k(t)V̂_k(t)^⊤ − V_k(t)V_k(t)^⊤‖₂. To bound this quantity, we first bound the deviation between the second moment matrices, ‖M_t − Σ_t‖₂ (Dwork et al., 2014; McSherry, 2004). Note that if M_t = (1/m) Σ_{i=1}^m ∇ℓ(w_t, z̃_i)∇ℓ(w_t, z̃_i)^⊤ is evaluated on fresh public samples, then Σ_t is the expectation of M_t, and the deviation of M_t from Σ_t can be analyzed by the Ahlswede-Winter inequality (Horn and Johnson, 2012; Wainwright, 2019): at any iteration t, given fresh public samples {z̃_1(t), ..., z̃_m(t)} drawn i.i.d. from the distribution P, under suitable assumptions the event ‖(1/m) Σ_{i=1}^m ∇ℓ(w_t, z̃_i(t))∇ℓ(w_t, z̃_i(t))^⊤ − E[∇ℓ(w_t, z̃_i(t))∇ℓ(w_t, z̃_i(t))^⊤]‖₂ > u holds with probability at most p exp(−mu²/4G) for any u ∈ [0, 1], where G is as in Assumption 1. However, this concentration bound does not hold for w_t, t > 0, in general, since the public dataset S_h is reused over the iterations and the parameter w_t depends on S_h. To handle this dependency, we bound ‖M_t − Σ_t‖₂ uniformly over all iterations t ∈ [T], i.e., we bound the worst-case counterpart over all possible iterates. Our uniform bound is based on generic chaining (GC) (Talagrand, 2014), an advanced tool from probability theory, and is eventually expressed in terms of a complexity measure called the γ₂ function (Talagrand, 2014). Note that one may also consider sample splitting to bypass the dependency issue, splitting the m public samples into T disjoint subsets, one per iteration. By the Ahlswede-Winter inequality, the resulting deviation error scales with O(√T/√m), leading to a worse trade-off between the subspace construction error and the optimization error due to the dependence on T.
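The fresh-sample concentration above is easy to visualize in simulation. In the toy check below, unit-sphere vectors stand in for bounded gradients (so the population second moment is I/p and each summand has spectral norm 1, matching the setting of the Ahlswede-Winter inequality); the parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 30

def second_moment_deviation(m):
    """||(1/m) sum_i x_i x_i^T - E[x x^T]||_2 for x uniform on the unit sphere,
    where E[x x^T] = I/p and each summand x_i x_i^T has spectral norm 1."""
    X = rng.standard_normal((m, p))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return np.linalg.norm(X.T @ X / m - np.eye(p) / p, 2)

print(second_moment_deviation(100), second_moment_deviation(10000))
# the deviation shrinks as the number of fresh samples m grows
```

The point of the uniform analysis in the paper is precisely that this clean picture breaks when the same m samples are reused across adaptively chosen iterates w_t.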
Definition 2 (γ₂ function (Talagrand, 2014)) For a metric space (A, d), an admissible sequence of A is a collection of subsets of A, Γ = {A_n : n ≥ 0}, with |A_0| = 1 and |A_n| ≤ 2^{2^n} for all n ≥ 1. The γ₂ functional is defined by γ₂(A, d) = inf_Γ sup_{a∈A} Σ_{n≥0} 2^{n/2} d(a, A_n), where the infimum is over all admissible sequences of A.

In Theorem 2, we show that the uniform convergence bound on ‖M_t − Σ_t‖₂ scales with γ₂(W, d), where d is the pseudo-metric in Assumption 2 and W ⊆ R^p is the set that contains all possible iterates of the algorithm, i.e., w_t ∈ W for all t ∈ [T]. By the majorizing measure theorem (e.g., Theorem 2.4.1 in Talagrand (2014)), if the metric d is the ℓ₂ norm, γ₂(W, d) is of the same order as the Gaussian width (Vershynin, 2018; Wainwright, 2019) of the set W, i.e., w(W) = E_v[sup_{w∈W} ⟨w, v⟩] with v ∼ N(0, I_p), which only depends on the size of W. In Appendix A.2, we show that the complexity measure γ₂(W, d) can be expressed as a γ₂ function on the gradient space by mapping the parameter space W to the gradient space via f: W → M, where f(w) = E_{z∼P}[∇ℓ(w, z)]. To simplify notation, we write m_w = E_{z∼P}[∇ℓ(w, z)] for the population gradient at w and M for the space of population gradients. Taking d(m_w, m_{w′}) = ‖m_w − m_{w′}‖₂ for m_w, m_{w′} ∈ M, γ₂(M, d) is of the same order as the Gaussian width w(M). To gauge the value of γ₂(M, d), we empirically explore the gradient space M for deep nets. Figure 2 gives an example of the population gradient along the training trajectory of DP-SGD with σ ∈ {1, 2, 4} for training a 2-layer ReLU network on the MNIST dataset. Figure 2 shows that each coordinate of the gradient is small and that the gradient components decay very fast (Li and Banerjee, 2021). Thus, it is reasonable to model the gradient space M as an ellipsoid: there exists e ∈ R^p such that M = {m ∈ R^p | Σ_{j=1}^p m(j)²/e(j)² ≤ 1}, where m(j) denotes the j-th coordinate. Then γ₂(M, d) ≤ c_1 w(M) ≤ c_2 ‖e‖₂ (Talagrand, 2014), where c_1 and c_2 are absolute constants. If the elements of e, sorted in decreasing order, satisfy e(j) ≤ c_3/√j for all j ∈ [p], then γ₂(M, d) ≤ O(√(log p)).

Now we give the uniform convergence bound on ‖M_t − Σ_t‖₂.

Theorem 2 (Second Moment Concentration) Under Assumptions 1 and 2, the second moment matrix of the public gradients M_t = (1/m) Σ_{i=1}^m ∇ℓ(w_t, z̃_i)∇ℓ(w_t, z̃_i)^⊤ approximates the population second moment matrix Σ_t = E_{z∼P}[∇ℓ(w_t, z)∇ℓ(w_t, z)^⊤] uniformly over all iterations: for any u > 0, sup_{t∈[T]} ‖M_t − Σ_t‖₂ ≤ O(uGρ√(ln p) γ₂(W, d)/√m), with probability at least 1 − c exp(−u²/4), where c is an absolute constant.

Theorem 2 shows that M_t approximates the population second moment matrix Σ_t uniformly over all iterations. This uniform bound is derived via generic chaining, which develops sharp upper bounds on suprema of stochastic processes indexed by a set with a metric structure, in terms of γ₂ functions. In our case, ‖M_t − Σ_t‖₂ is treated as a stochastic process indexed by the set W ⊆ R^p of all possible iterates, with the pseudo-metric d: R^p × R^p → R defined in Assumption 2. To get a more practical bound, following the discussion above, instead of working with W over parameters one can work with the set M of population gradients by defining the pseudo-metric as d(w, w′) = d′(f(w), f(w′)) = d′(m_w, m_{w′}), where f(w) = E_{z∼P}[∇ℓ(w, z)] maps the parameter space to the gradient space. Thus, the complexity measure γ₂(W, d) can be expressed as a γ₂ function on the population gradient space, i.e., γ₂(M, d′). As discussed above, using the ℓ₂ norm as d′, γ₂(M, d′) is a constant if the gradient space is an ellipsoid with fast-decaying axes, and the uniform bound then depends only logarithmically on p.

Using Theorem 2 and the Davis-Kahan sin-θ theorem (McSherry, 2004), we obtain the subspace construction error ‖V̂_k(t)V̂_k(t)^⊤ − V_k(t)V_k(t)^⊤‖₂ in the following theorem.

Theorem 3 (Subspace Closeness) Under Assumptions 1 and 2, with V_k(t) the top-k eigenvectors of the population second moment matrix Σ_t and α_t the eigen-gap at the t-th iterate, i.e., λ_k(Σ_t) − λ_{k+1}(Σ_t) ≥ α_t, the matrix V̂_k(t) in Algorithm 1 satisfies, provided m ≥ O((Gρ√(ln p) γ₂(W, d))²/min_t α_t²): E‖V̂_k(t)V̂_k(t)^⊤ − V_k(t)V_k(t)^⊤‖₂ ≤ O(Gρ√(ln p) γ₂(W, d)/(α_t√m)) for all t ∈ [T].

Theorem 3 gives the sample complexity for the public sample size and the reconstruction error, i.e., the difference between V̂_k(t)V̂_k(t)^⊤ evaluated on the public dataset S_h and V_k(t)V_k(t)^⊤ given by the population second moment Σ_t. The sample complexity and the reconstruction error both depend on the γ₂ function and the eigen-gap α_t; a small eigen-gap α_t requires a larger public sample size m.
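The mechanism behind this result (the empirical second moment concentrates, then Davis-Kahan transfers that to the projectors, at a rate controlled by the eigen-gap) can be checked on synthetic low-rank data. The simulation below is ours, with a planted eigen-gap; it only illustrates that the projector error shrinks as the sample size m grows:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 200, 5

# Population second moment with a clear eigen-gap: top-k eigenvalues 2.0, rest 0.01.
U, _ = np.linalg.qr(rng.standard_normal((p, p)))
eigs = np.concatenate([2.0 * np.ones(k), 0.01 * np.ones(p - k)])
Vk_true = U[:, :k]

def projector_error(m):
    """Spectral distance between the empirical and population top-k projectors."""
    # Synthetic "gradients" with population second moment U diag(eigs) U^T.
    grads = rng.standard_normal((m, p)) @ (U * np.sqrt(eigs)).T
    _, vecs = np.linalg.eigh(grads.T @ grads / m)     # eigenvalues ascending
    Vk_hat = vecs[:, -k:]
    return np.linalg.norm(Vk_hat @ Vk_hat.T - Vk_true @ Vk_true.T, 2)

print(projector_error(50), projector_error(5000))  # error shrinks as m grows
```

Shrinking the planted gap (e.g., raising the bulk eigenvalues toward 2.0) visibly inflates the error at fixed m, matching the 1/α_t dependence in the theorem.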

3.2. EMPIRICAL RISK CONVERGENCE ANALYSIS

In this section, we present the error rate of PDP-SGD for non-convex (smooth) functions; the error rate for convex functions is deferred to Appendix C. For the non-convex case, we first give the error rate of the ℓ₂ norm of the principal component of the gradient, i.e., ‖V_k(t)V_k(t)^⊤ ∇L̂_n(w_t)‖₂. Then we show that the gradient norm also converges if the principal component dominates the residual component of the gradient, as suggested by Figure 1 and recent observations (Papyan, 2019; Li et al., 2020). To present our results, we introduce some notation. We write [∇L̂_n(w_t)]_∥ = V_k(t)V_k(t)^⊤ ∇L̂_n(w_t) for the principal component of the gradient and [∇L̂_n(w_t)]_⊥ = ∇L̂_n(w_t) − V_k(t)V_k(t)^⊤ ∇L̂_n(w_t) for the residual component.

Theorem 4 (Smooth and Non-convex) For a ρ-smooth function L̂_n(w), under Assumptions 1 and 2, let Λ = (1/T) Σ_{t=1}^T 1/α_t². For any ε, δ > 0, with T = O(n²ε²) and η_t = 1/√T, PDP-SGD achieves: (1/T) Σ_{t=1}^T E‖V_k(t)V_k(t)^⊤ ∇L̂_n(w_t)‖₂² ≤ Õ(√k ρG²/(nε)) + O(ΛG⁴ρ²γ₂²(W, d) ln p / m). Additionally, assuming the principal component of the gradient dominates, i.e., there exists c > 0 such that (1/T) Σ_{t=1}^T ‖[∇L̂_n(w_t)]_⊥‖₂² ≤ c (1/T) Σ_{t=1}^T ‖[∇L̂_n(w_t)]_∥‖₂², we have E‖∇L̂_n(w_R)‖₂² ≤ Õ(√k ρG²/(nε)) + O(ΛG⁴ρ²γ₂²(W, d) ln p / m), where w_R is sampled uniformly from {w_1, ..., w_T}.

Theorem 4 shows that PDP-SGD reduces the dimension factor in the error rate from p to k compared to existing results for non-convex and smooth functions (Wang and Xu, 2019). The error rate also includes a term depending on the γ₂ function and the eigen-gaps α_t through Λ = (1/T) Σ_{t=1}^T 1/α_t². This term comes from the subspace reconstruction error. As discussed in the previous section, when the gradients stay in an ellipsoid with fast-decaying axes, the γ₂ function is a constant. The term Λ depends on the eigen-gaps α_t, i.e., λ_k(Σ_t) − λ_{k+1}(Σ_t) ≥ α_t.
As shown in Figure 1, along the training trajectory there are a few dominant eigenvalues and the eigen-gap remains significant (even at the last epoch). In that case the term Λ is a constant and the bound scales logarithmically with p. If one instead assumes the eigen-gap α_t decays as training proceeds, e.g., α_t = t^{−1/4} for t > 0, then Λ = O(√T). In this case, with T = n²ε², PDP-SGD requires a public dataset of size m = O(nε).
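The Λ = O(√T) claim for α_t = t^{−1/4} is a one-line computation, since 1/α_t² = √t and Σ_{t≤T} √t ≈ (2/3)T^{3/2}. A quick arithmetic check:

```python
import math

T = 10**6
# Lambda = (1/T) * sum_{t=1}^T 1/alpha_t^2 with alpha_t = t^(-1/4), i.e. 1/alpha_t^2 = sqrt(t).
Lambda = sum(math.sqrt(t) for t in range(1, T + 1)) / T

# Since sum_{t<=T} sqrt(t) ~ (2/3) T^(3/2), Lambda / sqrt(T) should approach 2/3.
print(Lambda / math.sqrt(T))  # ≈ 2/3
```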

4. EXPERIMENTS

We empirically evaluate PDP-SGD on training neural networks on two datasets: MNIST (LeCun et al., 1998) and Fashion MNIST (Xiao et al., 2017). We compare the performance of PDP-SGD with the baseline DP-SGD for various privacy levels ε. In addition, we explore a heuristic method, DP-SGD with random projection, obtained by replacing the projector with a R^{k×p} Gaussian random projector (Bingham and Mannila, 2001; Blocki et al., 2012); we call this method randomly projected DP-SGD (RPDP-SGD). We present the experimental results after discussing the experimental setup. More details and additional results are in Appendix D. Datasets and Network Structure. The MNIST and Fashion MNIST datasets both consist of 60,000 training examples and 10,000 test examples. To construct the private training set, we randomly sample 10,000 examples from the original training set of MNIST and Fashion MNIST; we then randomly sample 100 examples from the rest to construct the public dataset. Note that the smaller private datasets make the private learning problem more challenging. For both datasets, we use a convolutional neural network that follows the structure in Papernot et al. (2020). Training and Hyper-parameter Setting. Cross-entropy is used as the loss function throughout the experiments. The mini-batch size is set to 250 for both MNIST and Fashion MNIST. For the step size, we follow a grid search; the search space and the step sizes used for DP-SGD, PDP-SGD, and RPDP-SGD are listed in Appendix D. For training, a fixed budget of 30 epochs is assigned to each task. We repeat each experiment 3 times and report the mean and standard deviation of the accuracy on the training and test sets. For PDP-SGD, we use the Lanczos algorithm to compute the top-k eigenspace of the gradient second moment matrix on the public dataset. We use k = 50 for MNIST and k = 70 for Fashion MNIST. For RPDP-SGD, we use k = 800 for both datasets.
Instead of applying the projection in all epochs, we also explored varying the starting point of the projection, i.e., executing the projection from the 1st epoch or from the 15th epoch. We found that for Fashion MNIST, PDP-SGD and RPDP-SGD perform better when the projection starts from the 15th epoch.
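The experiments compute the top-k eigenspace with the Lanczos algorithm. Since m ≤ 150 is far smaller than p, an equivalent route goes through the m×m Gram matrix: if G ∈ R^{m×p} stacks the public gradients, the top-k eigenvectors of M_t = (1/m)GᵀG are obtained by mapping top eigenvectors u_j of (1/m)GGᵀ back via Gᵀu_j/√(mλ_j). A sketch of this standard linear-algebra trick (our illustration, not the authors' code; sizes are arbitrary and it requires k ≤ m):

```python
import numpy as np

def topk_eigenspace(G, k):
    """Top-k eigenvectors of (1/m) G^T G via the m x m Gram matrix (needs k <= m << p)."""
    m = G.shape[0]
    vals, vecs = np.linalg.eigh(G @ G.T / m)          # ascending eigenvalues of (1/m) G G^T
    vals, vecs = vals[-k:][::-1], vecs[:, -k:][:, ::-1]
    # Map back to R^p: v_j = G^T u_j / sqrt(m * lambda_j); guard against tiny eigenvalues.
    return G.T @ vecs / np.sqrt(np.maximum(m * vals, 1e-12))

rng = np.random.default_rng(2)
G = rng.standard_normal((100, 5000))                  # m = 100 public gradients, p = 5000
Vk = topk_eigenspace(G, 50)
print(Vk.shape)  # (5000, 50)
```

This avoids ever forming the p×p matrix; Lanczos (e.g., `scipy.sparse.linalg.eigsh`) achieves a similar cost by using only matrix-vector products.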

Privacy Parameter Setting:

We consider different choices of the noise scale: σ ∈ {18, 14, 10, 8, 6, 4} for MNIST and σ ∈ {18, 14, 10, 6, 4, 2} for Fashion MNIST. Varying the public sample size, the accuracy of PDP-SGD improves as m increases from 50 to 150. This is consistent with the theoretical analysis that increasing m reduces the subspace reconstruction error. Also, PDP-SGD with m = 100 performs similarly to PDP-SGD with m = 150. These results suggest that while a small public dataset is not sufficient for training an accurate predictor, it provides a useful gradient subspace projection and an accuracy improvement over DP-SGD. To reduce the computational cost introduced by the eigendecomposition, we explored PDP-SGD with sparse eigenspace computation, i.e., updating the projector only every s iterations. Note that PDP-SGD with s = 1 computes the top eigenspace at every iteration. Figure 6 reports PDP-SGD with s ∈ {1, 10, 20} for (a) MNIST and (b) Fashion MNIST, showing that PDP-SGD with reduced eigenspace computation still outperforms DP-SGD, even though there is a mild accuracy decay for PDP-SGD with fewer eigenspace computations.

5. CONCLUSION

While DP-SGD and its variants have been well studied for private ERM, the error rate of DP-SGD depends on the ambient dimension p. In this paper, we aim to bypass this dependence by leveraging the low-dimensional structure of the observed gradients in the training of deep networks. We propose PDP-SGD, which projects the noisy gradient onto an approximate subspace computed on a public dataset. We show theoretically that PDP-SGD can obtain a (near) dimension-independent error rate. We evaluate the proposed algorithm on two popular deep learning tasks and demonstrate the empirical advantages of PDP-SGD.

A UNIFORM CONVERGENCE FOR SUBSPACES: PROOFS FOR SECTION 3.1

In this section, we provide the proofs for Section 3.1. We first show that the second moment matrix M_t converges to the population second moment matrix Σ_t uniformly over all iterations t ∈ [T], i.e., we bound sup_{t∈[T]} ‖M_t − Σ_t‖₂. Then we show that the top-k subspace of M_t uniformly converges to the top-k subspace of Σ_t, i.e., we bound ‖V̂_k(t)V̂_k(t)^⊤ − V_k(t)V_k(t)^⊤‖₂ for all t ∈ [T]. Our bound depends on γ₂(W, d), where W is the set of all possible parameters along the training trajectory. In Section A.2, we show that the bound can equivalently be derived in terms of γ₂(M, d), where M is the set of population gradients along the training trajectory; we also provide examples of the set M and the corresponding values of γ₂(M, d).

A.1 UNIFORM CONVERGENCE BOUND

Our proof of Theorem 2 relies heavily on an advanced probability tool, generic chaining (GC) (Talagrand, 2014). Results in generic chaining are typically stated in terms of the γ₂ function (see Definition 2). Talagrand (2014) shows that for a process (X_t)_{t∈T} on a metric space (T, d), if (X_t)_{t∈T} satisfies the increment condition

∀u > 0, P(|X_s − X_t| ≥ u) ≤ 2 exp(−u²/(2 d(s, t)²)),   (5)

then the size of the process can be bounded as E sup_{t∈T} X_t ≤ c γ₂(T, d), with c an absolute constant. To apply this GC result to establish Theorem 2, we treat ‖M_t − Σ_t‖₂ as the process X_t over the iterations. In detail, since M_t = (1/m) Σ_{i=1}^m ∇ℓ(w_t, z̃_i)∇ℓ(w_t, z̃_i)^⊤ and Σ_t = E_{z∼P}[∇ℓ(w_t, z)∇ℓ(w_t, z)^⊤], the quantity ‖M_t − Σ_t‖₂ is a random process indexed by w_t ∈ W, with W the set of all possible iterates obtained by the algorithm. In Lemma 1, we first show that the variable ‖M_t − Σ_t‖₂ satisfies the increment condition stated in equation 5. Before presenting the proof of Lemma 1, we introduce the Ahlswede-Winter inequality (Horn and Johnson, 2012; Wainwright, 2019), which will be used in the proof of Lemma 1. The Ahlswede-Winter inequality shows that the average of i.i.d. positive semi-definite random matrices with bounded spectral norm concentrates around its expectation with high probability.

Theorem 5 (Ahlswede-Winter Inequality) Let Y be a random, symmetric, positive semi-definite p × p matrix such that ‖E[Y]‖₂ ≤ 1. Suppose ‖Y‖₂ ≤ R for some fixed scalar R ≥ 1. Let {Y_1, ..., Y_m} be independent copies of Y (i.e., independently sampled matrices with the same distribution as Y). For any u ∈ [0, 1], we have P(‖(1/m) Σ_{i=1}^m Y_i − E[Y_i]‖₂ > u) ≤ 2p · exp(−mu²/4R).

To make the argument clear, we use a more informative notation for M_t and Σ_t.
Recall that M_t = (1/m) Σ_{i=1}^m ∇ℓ(w_t, z̃_i)∇ℓ(w_t, z̃_i)^⊤ and Σ_t = E_{z∼P}[∇ℓ(w_t, z)∇ℓ(w_t, z)^⊤], given the dataset S_h = {z̃_1, ..., z̃_m} and distribution P with z̃_i ∼ P for i ∈ [m]. Since M_t and Σ_t are functions of the parameter w_t, we use M(w_t) and Σ(w_t) interchangeably with M_t and Σ_t in the rest of this section, i.e., M(w) = (1/m) Σ_{i=1}^m ∇ℓ(w, z̃_i)∇ℓ(w, z̃_i)^⊤ and Σ(w) = E_{z∼P}[∇ℓ(w, z)∇ℓ(w, z)^⊤].

Lemma 1 Under Assumptions 1 and 2, for any w, w′ ∈ W and all u > 0, we have

P(‖M(w) − Σ(w)‖₂ − ‖M(w′) − Σ(w′)‖₂ ≥ (u/√m) · 4Gρ d(w, w′)) ≤ 2p · exp(−u²/4),

where d: W × W → R is the pseudo-metric in Assumption 2.

Proof: Consider the random matrix X_i = ∇ℓ(w, z̃_i)∇ℓ(w, z̃_i)^⊤ − ∇ℓ(w′, z̃_i)∇ℓ(w′, z̃_i)^⊤ + 2Gρ d(w, w′)I_p, where I_p ∈ R^{p×p} is the identity matrix. Note that 2Gρ d(w, w′) is deterministic and the randomness of X_i comes from ∇ℓ(w, z̃_i) and ∇ℓ(w′, z̃_i). By the triangle inequality and the construction of X_i, we have

‖M(w) − Σ(w)‖₂ − ‖M(w′) − Σ(w′)‖₂
= ‖M(w) − E[M(w)]‖₂ − ‖M(w′) − E[M(w′)]‖₂
≤ ‖M(w) − E[M(w)] − (M(w′) − E[M(w′)])‖₂
= ‖M(w) − M(w′) − E[M(w) − M(w′)]‖₂
= ‖(1/m) Σ_{i=1}^m (∇ℓ(w, z̃_i)∇ℓ(w, z̃_i)^⊤ − ∇ℓ(w′, z̃_i)∇ℓ(w′, z̃_i)^⊤) − E[∇ℓ(w, z̃_i)∇ℓ(w, z̃_i)^⊤ − ∇ℓ(w′, z̃_i)∇ℓ(w′, z̃_i)^⊤]‖₂
= ‖(1/m) Σ_{i=1}^m (X_i − E[X_i])‖₂,   (14)

where the last equality holds because the deterministic shift 2Gρ d(w, w′)I_p cancels between X_i and E[X_i]. To apply Theorem 5 to ‖(1/m) Σ_i (X_i − E[X_i])‖₂, we first show that the random symmetric matrix X_i is positive semi-definite.

By Assumptions 1 and 2 and the definition of the spectral norm,
$$\begin{aligned}
\left\|\nabla\ell(w,\tilde z_i)\nabla\ell(w,\tilde z_i)^\top - \nabla\ell(w',\tilde z_i)\nabla\ell(w',\tilde z_i)^\top\right\|_2
&= \sup_{\|x\|=1}\left(\langle x,\nabla\ell(w,\tilde z_i)\rangle^2 - \langle x,\nabla\ell(w',\tilde z_i)\rangle^2\right) \\
&= \sup_{\|x\|=1}\left(\langle x,\nabla\ell(w,\tilde z_i)\rangle + \langle x,\nabla\ell(w',\tilde z_i)\rangle\right)\left(\langle x,\nabla\ell(w,\tilde z_i)\rangle - \langle x,\nabla\ell(w',\tilde z_i)\rangle\right) \\
&\le 2G\left\|\nabla\ell(w,\tilde z_i) - \nabla\ell(w',\tilde z_i)\right\|_2 \\
&\le 2G\rho\,d(w,w'). \quad (15)
\end{aligned}$$
For any non-zero vector $x\in\mathbb{R}^p$, we have
$$x^\top X_i x \ge \left(2G\rho\,d(w,w') - \left\|\nabla\ell(w,\tilde z_i)\nabla\ell(w,\tilde z_i)^\top - \nabla\ell(w',\tilde z_i)\nabla\ell(w',\tilde z_i)^\top\right\|_2\right)\|x\|_2^2 \overset{(a)}{\ge} 0,$$
where (a) holds by equation 15; hence $X_i$ is positive semi-definite. Let $Y_i = X_i/(4G\rho\,d(w,w'))$. By equation 15 and the triangle inequality,
$$\|Y_i\|_2 \le \frac{\left\|\nabla\ell(w,\tilde z_i)\nabla\ell(w,\tilde z_i)^\top - \nabla\ell(w',\tilde z_i)\nabla\ell(w',\tilde z_i)^\top\right\|_2 + 2G\rho\,d(w,w')}{4G\rho\,d(w,w')} \le \frac{2G\rho\,d(w,w') + 2G\rho\,d(w,w')}{4G\rho\,d(w,w')} = 1, \quad (17)$$
so $\|Y_i\|_2\le1$ and $\|\mathbb{E}[Y_i]\|_2\le1$. Then, from Theorem 5 with $R=1$, for any $u\in[0,1]$,
$$\mathbb{P}\left(\left\|\frac1m\sum_{i=1}^m Y_i - \mathbb{E}[Y_i]\right\|_2 > u\right) \le 2p\cdot\exp\left(-mu^2/4\right).$$
Since $\|Y_i\|_2\le1$ and $\|\mathbb{E}[Y_i]\|_2\le1$, the inequality holds trivially for $u>1$ as well, so it holds for all $u>0$. Substituting $u\mapsto u/\sqrt m$ and rescaling back to $X_i$,
$$\mathbb{P}\left(\left\|\frac1m\sum_{i=1}^m X_i - \mathbb{E}[X_i]\right\|_2 > \frac{u}{\sqrt m}\cdot 4G\rho\,d(w,w')\right) \le 2p\cdot\exp\left(-u^2/4\right). \quad (20)$$
Combining equations 20 and 14, we have
$$\mathbb{P}\left(\|M(w)-\Sigma(w)\|_2 - \|M(w')-\Sigma(w')\|_2 \ge \frac{u}{\sqrt m}\cdot 4G\rho\,d(w,w')\right) \le 2p\cdot\exp\left(-u^2/4\right). \quad (21)$$
That completes the proof. Based on the above result, we now come to the proof of Theorem 2.
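The matrix concentration driving Lemma 1 is easy to observe numerically. The following sketch (an illustration only, not part of the analysis; the Gaussian model for the gradient vectors is our own assumption) estimates the spectral deviation between an empirical second moment matrix and its population counterpart, and shows it shrinking as the number of samples $m$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50  # ambient dimension

def spectral_deviation(m):
    """Spectral norm of (1/m) sum_i g_i g_i^T - E[g g^T] for g ~ N(0, I_p / p)."""
    g = rng.normal(scale=1.0 / np.sqrt(p), size=(m, p))
    M = g.T @ g / m          # empirical second moment matrix
    Sigma = np.eye(p) / p    # population second moment matrix
    return np.linalg.norm(M - Sigma, ord=2)

for m in (100, 400, 1600, 6400):
    print(m, spectral_deviation(m))  # deviation decays roughly like 1/sqrt(m)
```

Consistent with the $1/\sqrt m$ rate in Lemma 1 and Theorem 2, quadrupling $m$ roughly halves the observed deviation.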
The proof follows the generic chaining argument, i.e., Chapter 2 of Talagrand (2014).

Theorem 2 (Second Moment Concentration) Under Assumptions 1 and 2, the second moment matrix of the public gradients, $M_t = \frac1m\sum_{i=1}^m \nabla\ell(w_t,\tilde z_i)\nabla\ell(w_t,\tilde z_i)^\top$, approximates the population second moment matrix $\Sigma_t = \mathbb{E}_{z\sim P}[\nabla\ell(w_t,z)\nabla\ell(w_t,z)^\top]$ uniformly over all iterations, i.e., for any $u>0$,
$$\sup_{t\in[T]}\|M_t - \Sigma_t\|_2 \le O\left(\frac{uG\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)}{\sqrt m}\right)$$
with probability at least $1 - c\exp(-u^2/4)$, where $c$ is an absolute constant.

Proof: Note that equation 1 is a uniform bound over iterations $t\in[T]$. To bound $\sup_{t\in[T]}\|M_t-\Sigma_t\|_2$, it suffices to bound
$$\sup_{w\in\mathcal{W}}\|M(w)-\Sigma(w)\|_2,$$
where $\mathcal{W}$ contains all possible trajectories $w_1,\dots,w_T$. We consider a sequence of subsets $\mathcal{W}_n\subseteq\mathcal{W}$ with $\mathrm{card}\,\mathcal{W}_n \le N_n$, where $N_0=1$ and $N_n = 2^{2^n}$ for $n\ge1$. Let $\pi_n(w)\in\mathcal{W}_n$ be the approximation of $w\in\mathcal{W}$ at level $n$. We decompose
$$\|M(w)-\Sigma(w)\|_2 - \|M(\pi_0(w))-\Sigma(\pi_0(w))\|_2 = \sum_{n\ge1}\Big(\|M(\pi_n(w))-\Sigma(\pi_n(w))\|_2 - \|M(\pi_{n-1}(w))-\Sigma(\pi_{n-1}(w))\|_2\Big), \quad (24)$$
which holds since $\pi_n(w)=w$ for $n$ large enough. Based on Lemma 1, for any $u>0$, the event
$$\|M(\pi_n(w))-\Sigma(\pi_n(w))\|_2 - \|M(\pi_{n-1}(w))-\Sigma(\pi_{n-1}(w))\|_2 \ge \frac{u}{\sqrt m}\cdot 4G\rho\,d(\pi_n(w),\pi_{n-1}(w))$$
has probability at most $2p\exp(-u^2/4)$. For any $n>0$ and $w\in\mathcal{W}$, the number of possible pairs $(\pi_n(w),\pi_{n-1}(w))$ is at most
$$\mathrm{card}\,\mathcal{W}_n\cdot\mathrm{card}\,\mathcal{W}_{n-1} \le N_nN_{n-1} \le N_{n+1} = 2^{2^{n+1}}.$$
Applying a union bound over all possible pairs $(\pi_n(w),\pi_{n-1}(w))$, following Talagrand (2014) (Chapter 2.2), for any $n>0$, $u>0$, and $w\in\mathcal{W}$, the event
$$\|M(\pi_n(w))-\Sigma(\pi_n(w))\|_2 - \|M(\pi_{n-1}(w))-\Sigma(\pi_{n-1}(w))\|_2 \ge u\,2^{n/2}\,d(\pi_n(w),\pi_{n-1}(w))\cdot\frac{4G\rho}{\sqrt m} \quad (27)$$
has probability at most
$$\sum_{n\ge1}2p\cdot 2^{2^{n+1}}\exp\left(-u^2 2^{n-2}\right) \le c\,p\exp\left(-\frac{u^2}{4}\right), \quad (28)$$
where $c$ is a universal constant. Then the event
$$\sum_{n\ge1}\Big(\|M(\pi_n(w))-\Sigma(\pi_n(w))\|_2 - \|M(\pi_{n-1}(w))-\Sigma(\pi_{n-1}(w))\|_2\Big) \ge \sum_{n\ge0}u\,2^{n/2}\,d(w,\mathcal{W}_n)\cdot\frac{4G\rho}{\sqrt m} \quad (29)$$
has probability at most $c\,p\exp(-u^2/4)$.
For the base term, apply Theorem 5 with $Y_i = \nabla\ell(\pi_0(w),\tilde z_i)\nabla\ell(\pi_0(w),\tilde z_i)^\top/G^2$, so that $\|Y_i\|_2\le1$ and $\|\mathbb{E}[Y_i]\|_2\le1$. Then the event
$$\|M(\pi_0(w))-\Sigma(\pi_0(w))\|_2 \ge \frac{uG^2}{\sqrt m} \quad (30)$$
has probability at most $2p\exp(-u^2/4)$. Combining equations 24, 29, and 30, the event
$$\sup_{w\in\mathcal{W}}\|M(w)-\Sigma(w)\|_2 \ge \sup_{w\in\mathcal{W}}\sum_{n\ge0}u\,2^{n/2}\,d(w,\mathcal{W}_n)\cdot\frac{4G\rho}{\sqrt m} + \frac{uG^2}{\sqrt m} = \frac{u\left(4G\rho\,\gamma_2(\mathcal{W},d) + G^2\right)}{\sqrt m} \quad (31)$$
has probability at most $(c+2)\,p\exp(-u^2/4)$. That completes the proof.

Now we provide the proof of Theorem 3.

Theorem 3 (Subspace Closeness) Under Assumptions 1 and 2, let $V_k(t)$ be the top-$k$ eigenvectors of the population second moment matrix $\Sigma_t$ and let $\alpha_t$ be the eigen-gap at the $t$-th iterate, i.e., $\lambda_k(\Sigma_t) - \lambda_{k+1}(\Sigma_t) \ge \alpha_t$. For $\hat V_k(t)$ in Algorithm 1, if $m \ge O\!\left(\frac{(G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d))^2}{\min_t\alpha_t^2}\right)$, we have
$$\mathbb{E}\left\|\hat V_k(t)\hat V_k(t)^\top - V_k(t)V_k(t)^\top\right\|_2 \le O\left(\frac{G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)}{\alpha_t\sqrt m}\right), \quad \forall t\in[T].$$

Proof: Recall that $\hat V_k(t)$ spans the top-$k$ eigenspace of $M_t$, and let $V_k(t)$ span the top-$k$ eigenspace of $\Sigma_t$. Write $\Pi^{(k)}_{M_t} = \hat V_k(t)\hat V_k(t)^\top$ for the projection onto the top-$k$ subspace of the symmetric PSD matrix $M_t$, and $\Pi^{(k)}_{\Sigma_t} = V_k(t)V_k(t)^\top$ for the projection onto the top-$k$ subspace of the symmetric PSD matrix $\Sigma_t$. Then
$$\hat V_k(t)\hat V_k(t)^\top - V_k(t)V_k(t)^\top = \Pi^{(k)}_{M_t}\left(I-\Pi^{(k)}_{\Sigma_t}\right) - \left(I-\Pi^{(k)}_{M_t}\right)\Pi^{(k)}_{\Sigma_t}, \quad (32)$$
$$\Rightarrow\quad \left\|\hat V_k(t)\hat V_k(t)^\top - V_k(t)V_k(t)^\top\right\|_2 \le \left\|\Pi^{(k)}_{M_t}\left(I-\Pi^{(k)}_{\Sigma_t}\right)\right\|_2 + \left\|\left(I-\Pi^{(k)}_{M_t}\right)\Pi^{(k)}_{\Sigma_t}\right\|_2.$$
Then, from Davis–Kahan (Corollary 8 in McSherry (2004)) and using the fact that for symmetric PSD matrices eigenvalues and singular values coincide, we have
$$\left\|\Pi^{(k)}_{M_t}\left(I-\Pi^{(k)}_{\Sigma_t}\right)\right\|_2 \le \frac{\|M_t-\Sigma_t\|_2}{\lambda_k(M_t)-\lambda_{k+1}(\Sigma_t)}, \quad (34) \qquad \left\|\left(I-\Pi^{(k)}_{M_t}\right)\Pi^{(k)}_{\Sigma_t}\right\|_2 \le \frac{\|M_t-\Sigma_t\|_2}{\lambda_k(\Sigma_t)-\lambda_{k+1}(M_t)}. \quad (35)$$
Recall, from Horn and Johnson (2012) (Section 4.3) and Golub and Van Loan (1996) (Section 8.1.2, e.g., Corollary 8.1.6), that
$$|\lambda_k(M_t) - \lambda_k(\Sigma_t)| \le \|M_t-\Sigma_t\|_2. \quad (36)$$
From Theorem 2 and Lemma 2, with $Y = \|M_t-\Sigma_t\|_2$, $A = c$, and $B = \frac{4G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)}{\sqrt m}$, we have
$$\mathbb{E}\left[\|M_t-\Sigma_t\|_2\right] \le O\left(\frac{G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)}{\sqrt m}\right). \quad (37)$$
Let $c_0 = O\left(G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)\right)$. For $m \ge 4c_0^2/\alpha_t^2$, we have
$$\mathbb{E}\left[\|M_t-\Sigma_t\|_2\right] \le \frac{c_0}{\sqrt m} \le \frac{\alpha_t}{2}. \quad (38)$$
Then, for equation 34, we have
$$\left\|\Pi^{(k)}_{M_t}\left(I-\Pi^{(k)}_{\Sigma_t}\right)\right\|_2 \le \frac{\|M_t-\Sigma_t\|_2}{\lambda_k(M_t)-\lambda_{k+1}(\Sigma_t)} = \frac{\|M_t-\Sigma_t\|_2}{\left(\lambda_k(\Sigma_t)-\lambda_{k+1}(\Sigma_t)\right) - \left(\lambda_k(\Sigma_t)-\lambda_k(M_t)\right)} \le \frac{\|M_t-\Sigma_t\|_2}{\alpha_t - \|M_t-\Sigma_t\|_2}. \quad (39)$$
Then, for equation 35, we have
$$\left\|\left(I-\Pi^{(k)}_{M_t}\right)\Pi^{(k)}_{\Sigma_t}\right\|_2 \le \frac{\|M_t-\Sigma_t\|_2}{\left(\lambda_k(\Sigma_t)-\lambda_{k+1}(\Sigma_t)\right) + \left(\lambda_{k+1}(\Sigma_t)-\lambda_{k+1}(M_t)\right)} \le \frac{\|M_t-\Sigma_t\|_2}{\alpha_t - \|M_t-\Sigma_t\|_2}. \quad (40)$$
Combining these two bounds with equation 37, where $c_0 = O\left(G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)\right)$, we have
$$\mathbb{E}\left\|\hat V_k(t)\hat V_k(t)^\top - V_k(t)V_k(t)^\top\right\|_2 \le \frac{2\,\mathbb{E}\left[\|M_t-\Sigma_t\|_2\right]}{\alpha_t - \mathbb{E}\left[\|M_t-\Sigma_t\|_2\right]} \le \frac{2c_0/\sqrt m}{\alpha_t - c_0/\sqrt m}.$$
Using equation 38, i.e., $c_0/\sqrt m \le \alpha_t/2$, we have
$$\mathbb{E}\left\|\hat V_k(t)\hat V_k(t)^\top - V_k(t)V_k(t)^\top\right\|_2 \le O\left(\frac{G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)}{\alpha_t\sqrt m}\right).$$
That completes the proof.

Lemma 2 (Lemma 2.2.3 in Talagrand (2014)) Consider a random variable $Y\ge0$ which satisfies, for all $u>0$, $\mathbb{P}(Y\ge u) \le A\exp\left(-u^2/B^2\right)$ for certain numbers $A\ge2$ and $B>0$. Then $\mathbb{E}Y \le CB\sqrt{\log A}$, where $C$ denotes a universal constant.

A.2 GEOMETRY OF GRADIENTS AND γ₂ FUNCTIONS

In this section, we provide more intuition about and explanations of $\gamma_2$ functionals. We justify our assumptions about the gradient space and provide more examples of gradient-space structures and the corresponding $\gamma_2$ functionals. At a high level, for a metric space $(\mathcal{M},d)$, $\gamma_2(\mathcal{M},d)$ is related to $\sqrt{\log N(\mathcal{M},d,\epsilon)}$, where $N(\mathcal{M},d,\epsilon)$ is the covering number of $\mathcal{M}$ by $\epsilon$-balls in the metric $d$, but it is considerably sharper. Such sharpening has happened in two stages in the literature: first via chaining, which considers an integral over all scales $\epsilon$, yielding the Dudley bound; and subsequently via generic chaining, which considers a hierarchical covering, developed by Talagrand and colleagues, and which yields the sharpest bounds of this type. The canonical perspective of generic chaining is to view $\gamma_2(\mathcal{M},d)$ as an upper (and lower) bound on suprema of Gaussian processes indexed by $\mathcal{M}$ with metric $d$ (Theorem 2.4.1 in Talagrand (2014)).
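The subspace-closeness mechanism assembled in the proof of Theorem 3 can be checked numerically. The following sketch (an illustration under a synthetic spectrum of our own choosing, not the paper's experimental setup) perturbs a PSD matrix with a clear eigen-gap and verifies that the top-$k$ projector moves by at most $2\|M-\Sigma\|_2/(\alpha - \|M-\Sigma\|_2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 30, 3

# Population second moment with eigen-gap alpha = lambda_k - lambda_{k+1}.
eigvals = np.concatenate([np.array([5.0, 4.5, 4.0]), 0.1 * np.ones(p - k)])
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
Sigma = Q @ np.diag(eigvals) @ Q.T

# Perturbed estimate M = Sigma + small symmetric noise (a stand-in for sampling error).
E = rng.normal(scale=0.02, size=(p, p))
M = Sigma + (E + E.T) / 2

def top_k_projector(A, k):
    _, V = np.linalg.eigh(A)  # eigh returns eigenvalues in ascending order
    Vk = V[:, -k:]            # top-k eigenvectors
    return Vk @ Vk.T

dist = np.linalg.norm(top_k_projector(M, k) - top_k_projector(Sigma, k), ord=2)
err = np.linalg.norm(M - Sigma, ord=2)
alpha = eigvals[k - 1] - eigvals[k]
bound = 2 * err / (alpha - err)  # the bound assembled in the proof above
print(dist, bound)
```

The larger the eigen-gap $\alpha$ relative to the perturbation, the tighter the identified subspace tracks the population one, matching the $1/\alpha_t$ dependence in Theorem 3.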
Considering $d$ to be the $\ell_2$ norm distance, $\gamma_2(\mathcal{M},d)$ is of the same order as the Gaussian width of $\mathcal{M}$ (Vershynin, 2018), which is a scaled version of the mean width of $\mathcal{M}$. Structured sets (of gradients) have small Gaussian widths: e.g., an $\ell_1$ unit ball in $\mathbb{R}^p$ has a Gaussian width of $O(\sqrt{\log p})$, whereas an $\ell_2$ unit ball in $\mathbb{R}^p$ has a Gaussian width of $O(\sqrt p)$.

To utilize the structure of the gradients, as illustrated in Figure 2, instead of focusing on $\gamma_2(\mathcal{W},d)$ we can also derive the uniform convergence bound using a metric on the set of population gradients. We consider a mapping $f:\mathcal{W}\to\mathcal{M}$ from the parameter space $\mathcal{W}$ to the gradient space $\mathcal{M}$, where $f(w) = \mathbb{E}_{z\sim P}[\nabla\ell(w,z)]$ and $\mathcal{M}$ is the space of population gradients $m_t = \mathbb{E}_{z\sim P}[\nabla\ell(w_t,z)]$. The pseudo-metric $d(w,w')$ can then be written as $d(w,w') = d(f(w),f(w')) = d(m,m')$ with $m = \mathbb{E}_{z\sim P}[\nabla\ell(w,z)]$ and $m' = \mathbb{E}_{z\sim P}[\nabla\ell(w',z)]$. With such a mapping $f$, the admissible sequence $\Gamma_{\mathcal{W}} = \{\mathcal{W}_n : n\ge0\}$ of $\mathcal{W}$ in the proof of Theorem 2 corresponds to an admissible sequence $\Gamma_{\mathcal{M}} = \{\mathcal{M}_n : n\ge0\}$ of $\mathcal{M}$, and
$$\gamma_2(\mathcal{W},d) = \inf_{\Gamma_{\mathcal{W}}}\sup_{w\in\mathcal{W}}\sum_{n\ge0}2^{n/2}d(w,\mathcal{W}_n) = \inf_{\Gamma_{\mathcal{M}}}\sup_{m\in\mathcal{M}}\sum_{n\ge0}2^{n/2}d(m,\mathcal{M}_n) = \gamma_2(\mathcal{M},d).$$
Considering $d(m,m') = \|m-m'\|_2$, $\gamma_2(\mathcal{M},d)$ is of the same order as the Gaussian width of $\mathcal{M}$, i.e., $w(\mathcal{M}) = \mathbb{E}_v\left[\sup_{m\in\mathcal{M}}\langle m,v\rangle\right]$, where $v\sim N(0,I_p)$. Below, we provide more examples of gradient-space structures and their $\gamma_2$ functionals.

Ellipsoid. The Gaussian width $w(\mathcal{M})$ depends on the structure of the gradients $m$. In Figure 2, we observe that each coordinate of the gradient stays small along the training trajectory, so $\mathcal{M}$ includes all gradients living in an ellipsoid, i.e.,
$$\mathcal{M} = \left\{m_t\in\mathbb{R}^p \;\Big|\; \textstyle\sum_{j=1}^p m_t(j)^2/e(j)^2 \le 1\right\}, \quad e\in\mathbb{R}^p.$$
Then we have $\gamma_2(\mathcal{M},d) \le c_1 w(\mathcal{M}) \le c_2\|e\|_2$ (Talagrand, 2014), where $c_1$ and $c_2$ are absolute constants. If the elements of $e$, sorted in decreasing order, satisfy $e(j) \le c_3/\sqrt j$ for all $j\in[p]$, then $\gamma_2(\mathcal{M},d) \le O(\sqrt{\log p})$.

Composition. Based on the composition properties of $\gamma_2$ functionals (Talagrand, 2014), one can construct additional examples of gradient spaces. If $\mathcal{M} = \mathcal{M}_1 + \mathcal{M}_2 = \{m_1+m_2 : m_1\in\mathcal{M}_1, m_2\in\mathcal{M}_2\}$, the Minkowski sum, then $\gamma_2(\mathcal{M},d) \le c\left(\gamma_2(\mathcal{M}_1,d) + \gamma_2(\mathcal{M}_2,d)\right)$.

B PROOFS FOR SECTION 3.2

In this section, we present the proofs for Section 3.2; the error rate for convex problems follows in the subsequent section.

Theorem 4 (Smooth and Non-convex) For a $\rho$-smooth function $\hat L_n(w)$, under Assumptions 1 and 2, let $\Lambda = \frac1T\sum_{t=1}^T 1/\alpha_t^2$. For any $\epsilon,\delta>0$, with $T = O(n^2\epsilon^2)$ and $\eta_t = 1/\sqrt T$, PDP-SGD achieves
$$\frac1T\sum_{t=1}^T\mathbb{E}\left\|V_k(t)V_k(t)^\top\nabla\hat L_n(w_t)\right\|_2^2 \le \tilde O\left(\frac{k\rho G^2}{n\epsilon}\right) + O\left(\frac{\Lambda G^4\rho^2\gamma_2^2(\mathcal{W},d)\ln p}{m}\right).$$
Additionally, assuming the principal components of the gradient dominate, i.e., there exists $c>0$ such that $\frac1T\sum_{t=1}^T\left\|[\nabla\hat L_n(w_t)]_\perp\right\|_2^2 \le c\,\frac1T\sum_{t=1}^T\left\|[\nabla\hat L_n(w_t)]_\parallel\right\|_2^2$, we have
$$\mathbb{E}\left\|\nabla\hat L_n(w_R)\right\|_2^2 \le \tilde O\left(\frac{k\rho G^2}{n\epsilon}\right) + O\left(\frac{\Lambda G^4\rho^2\gamma_2^2(\mathcal{W},d)\ln p}{m}\right),$$
where $w_R$ is uniformly sampled from $\{w_1,\dots,w_T\}$.

Proof: Recall that $g_t = \frac{1}{|B_t|}\sum_{z_i\in B_t}\nabla\ell(w_t,z_i)$, $\tilde g_t = \hat V_k(t)\hat V_k(t)^\top(g_t+b_t)$, and $w_{t+1} = w_t - \eta_t\tilde g_t$ (equation 46); let $\bar g_t = g_t + \hat V_k\hat V_k^\top b_t$ and $\Delta_t = \hat V_k\hat V_k^\top g_t - g_t$, so that $\tilde g_t = \bar g_t + \Delta_t$. For $D_{t,1}$, decompose $\nabla\hat L_n(w_t) = \Pi_{Q_t}[\nabla\hat L_n(w_t)] + \Pi_{Q_t^\perp}[\nabla\hat L_n(w_t)]$, where $Q_t$ is the subspace given by $\hat V_k(t)\hat V_k(t)^\top$, $\Pi_{Q_t}$ is the projection onto it, and $Q_t^\perp$ is its orthogonal complement, so that $\Pi_{Q_t}[\nabla\hat L_n(w_t)] = \hat V_k(t)\hat V_k(t)^\top\nabla\hat L_n(w_t)$ and $\Pi_{Q_t^\perp}[\nabla\hat L_n(w_t)] = \left(I-\hat V_k(t)\hat V_k(t)^\top\right)\nabla\hat L_n(w_t)$. We have $\Delta_t = \Pi_{Q_t}\left[\hat V_k\hat V_k^\top g_t - g_t\right] + \Pi_{Q_t^\perp}\left[\hat V_k\hat V_k^\top g_t - g_t\right] = -\Pi_{Q_t^\perp}[g_t]$, so that
$$D_{t,1} = \mathbb{E}_t\left\langle\nabla\hat L_n(w_t),\Delta_t\right\rangle = -\mathbb{E}_t\left\langle\Pi_{Q_t^\perp}[\nabla\hat L_n(w_t)],\left(I-\hat V_k(t)\hat V_k(t)^\top\right)g_t\right\rangle = -\nabla\hat L_n(w_t)^\top\left(I-\hat V_k(t)\hat V_k(t)^\top\right)\nabla\hat L_n(w_t). \quad (56)$$
Bringing the above to equation 50, the right-hand side of equation 50 becomes:
$$\eta_t\left\|\nabla\hat L_n(w_t)\right\|_2^2 + \eta_t D_{t,1} = \eta_t\left\|\hat V_k(t)\hat V_k(t)^\top\nabla\hat L_n(w_t)\right\|_2^2. \quad (57)$$
For $D_{t,2}$, we have
$$\mathbb{E}_t\|\tilde g_t\|_2^2 = \mathbb{E}_t\left\|\hat V_k\hat V_k^\top g_t + \hat V_k\hat V_k^\top b_t\right\|_2^2 = \mathbb{E}_t\left\|\hat V_k\hat V_k^\top g_t\right\|_2^2 + \mathbb{E}_t\left\|\hat V_k\hat V_k^\top b_t\right\|_2^2 \le G^2 + k\sigma^2. \quad (58)$$
Thus,
$$D_{t,2} = \mathbb{E}_t\left[\|\bar g_t\|_2^2 + \|\Delta_t\|_2^2\right] \le G^2 + k\sigma^2. \quad (59)$$
Bringing the upper bound on $D_{t,2}$ to equation 57, setting $\eta_t = 1/\sqrt T$, telescoping, and taking the expectation over all iterations, we have
$$\frac1T\sum_{t=1}^T\mathbb{E}\left\|\hat V_k(t)\hat V_k(t)^\top\nabla\hat L_n(w_t)\right\|_2^2 \le \frac{\hat L_n(w_1) - \hat L_n^*}{\sqrt T} + \frac{\rho(G^2+k\sigma^2)}{2\sqrt T}. \quad (60)$$
By the triangle inequality and Theorem 3, we have
$$\left\|V_k(t)V_k(t)^\top\nabla\hat L_n(w_t)\right\|_2^2 \le 2\left\|\hat V_k(t)\hat V_k(t)^\top\nabla\hat L_n(w_t)\right\|_2^2 + 2\left\|\left(V_k(t)V_k(t)^\top - \hat V_k(t)\hat V_k(t)^\top\right)\nabla\hat L_n(w_t)\right\|_2^2 \le 2\left\|\hat V_k(t)\hat V_k(t)^\top\nabla\hat L_n(w_t)\right\|_2^2 + O\left(\frac{G^4\rho^2\ln p\,\gamma_2^2(\mathcal{W},d)}{\alpha_t^2 m}\right).$$
With equation 60, we have
$$\frac1T\sum_{t=1}^T\mathbb{E}\left\|V_k(t)V_k(t)^\top\nabla\hat L_n(w_t)\right\|_2^2 \le \frac{\hat L_n(w_1)-\hat L_n^*}{\sqrt T/2} + \frac{\rho(G^2+k\sigma^2)}{\sqrt T} + O\left(\frac{\Lambda G^4\rho^2\gamma_2^2(\mathcal{W},d)\ln p}{m}\right), \quad (61)$$
where $\Lambda = \frac1T\sum_{t=1}^T 1/\alpha_t^2$. Let $[\nabla\hat L_n(w_t)]_\parallel = V_k(t)V_k(t)^\top\nabla\hat L_n(w_t)$ and $[\nabla\hat L_n(w_t)]_\perp = \nabla\hat L_n(w_t) - V_k(t)V_k(t)^\top\nabla\hat L_n(w_t)$. Assuming there exists $c>0$ such that
$$\sum_{t=1}^T\mathbb{E}\left\|[\nabla\hat L_n(w_t)]_\perp\right\|_2^2 \le c\sum_{t=1}^T\mathbb{E}\left\|[\nabla\hat L_n(w_t)]_\parallel\right\|_2^2, \quad (62)$$
we then have
$$\frac1T\sum_{t=1}^T\mathbb{E}\left\|\nabla\hat L_n(w_t)\right\|_2^2 \le (1+c)\left[\frac{\hat L_n(w_1)-\hat L_n^*}{\sqrt T/2} + \frac{\rho(G^2+k\sigma^2)}{\sqrt T} + O\left(\frac{\Lambda G^4\rho^2\gamma_2^2(\mathcal{W},d)\ln p}{m}\right)\right].$$
Taking $T = n^2\epsilon^2$, with $\mathbb{E}\|\nabla\hat L_n(w_R)\|_2^2 = \frac1T\sum_{t=1}^T\mathbb{E}\|\nabla\hat L_n(w_t)\|_2^2$, we have
$$\mathbb{E}\left\|\nabla\hat L_n(w_R)\right\|_2^2 \le \tilde O\left(\frac{k\rho G^2}{n\epsilon}\right) + O\left(\frac{\Lambda G^4\rho^2\gamma_2^2(\mathcal{W},d)\ln p}{m}\right),$$
where $w_R$ is uniformly sampled from $\{w_1,\dots,w_T\}$.
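To make the update analyzed above concrete, here is a minimal sketch of one PDP-SGD step, $\tilde g_t = \hat V_k\hat V_k^\top(g_t + b_t)$, with per-example clipping standing in for the norm bound $G$. Function names, shapes, and hyper-parameter values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def pdp_sgd_step(w, grads, V_k, sigma, clip, lr):
    """One projected DP-SGD update (sketch). grads: per-example gradients (n, p).
    V_k: (p, k) orthonormal basis spanning the top-k eigenspace of the public
    gradient second moment matrix."""
    # Per-example clipping to norm <= clip, as in Abadi et al. (2016).
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    g = clipped.mean(axis=0)
    # Isotropic Gaussian noise, then projection onto the k-dim public subspace:
    # the projected noise V_k V_k^T b has expected squared norm k * sigma^2
    # instead of p * sigma^2, which is the source of the p -> k improvement.
    b = rng.normal(scale=sigma, size=g.shape)
    g_tilde = V_k @ (V_k.T @ (g + b))
    return w - lr * g_tilde

p, k, n = 100, 5, 32
V_k, _ = np.linalg.qr(rng.normal(size=(p, k)))
w = np.zeros(p)
grads = rng.normal(size=(n, p))
w_next = pdp_sgd_step(w, grads, V_k, sigma=1.0, clip=1.0, lr=0.1)
```

Since the projection is applied after the noise is added, the privacy guarantee is unchanged by post-processing, while the effective noise dimension drops from $p$ to $k$.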

C ERROR RATE OF CONVEX PROBLEMS

For convex and Lipschitz functions, we consider a low-rank structure of the gradient space, i.e., the population gradient second moment $\Sigma_t$ has rank $k$, which is a special case of the principal-gradient-dominance assumption with $\|[\nabla\hat L_n(w_t)]_\perp\|_2 = 0$.

Theorem 6 (Convex and Lipschitz) For a $G$-Lipschitz and convex function $\hat L_n(w)$, under Assumptions 1 and 2 and assuming $\Sigma_t$ has rank $k$, let $\Lambda = \frac1T\sum_{t=1}^T 1/\alpha_t$. For any $\epsilon,\delta>0$, with $T = O(n^2\epsilon^2)$ and step size $\eta_t = 1/\sqrt T$, PDP-SGD achieves
$$\mathbb{E}\left[\hat L_n(\bar w) - \hat L_n(w^*)\right] \le O\left(\frac{kG^2}{n\epsilon}\right) + O\left(\frac{\Lambda G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)}{\sqrt m}\right),$$
where $\bar w = \frac1T\sum_{t=1}^T w_t$ and $w^*$ is the minimizer of $\hat L_n(w)$.

PDP-SGD thus replaces the factor of $p$ in the error rate of DP-SGD for convex functions (Bassily et al., 2014; 2019a) with a factor of $k$. PDP-SGD also incurs a subspace reconstruction error, depending on the $\gamma_2$ functional and the eigen-gap term $\Lambda = \frac1T\sum_{t=1}^T 1/\alpha_t$. From the previous discussion, $\gamma_2$ is a constant under suitable assumptions on the gradient structure. For the eigen-gap term, if $\alpha_t$ stays constant during training, as shown in Figure 1, then $\Lambda$ is a constant and the bound scales logarithmically with $p$. If instead one assumes the eigen-gap $\alpha_t$ decays as training proceeds, e.g., $\alpha_t = 1/t^{1/2}$ for $t>0$, then $\Lambda = O(\sqrt T)$; in this case, with $T = O(n^2\epsilon^2)$, PDP-SGD requires a public data size of $m = O(n\epsilon)$.

Recently, Kairouz et al. (2020) proposed a noisy version of the AdaGrad algorithm for unconstrained convex empirical risk minimization. Their algorithm operates with only private data and also achieves a dimension-free excess risk bound, i.e., $\tilde O(r/n\epsilon)$, by assuming the gradients along the optimization path lie in a low-dimensional subspace of constant rank $r$. The bounds are not directly comparable, since the assumptions in Kairouz et al. (2020) differ from those in this paper, i.e., Kairouz et al.
(2020) assume that the accumulated gradients along the training process lie in a fixed subspace, which requires that the rank of the gradient space does not grow when more stochastic gradients are added. Our work imposes no such constant-subspace assumption on $\Sigma_t$, allowing the subspace of $\Sigma_t$ to vary along the training process. In other words, our bounds hold even when the rank of the accumulated stochastic gradient space increases as training proceeds.

Proof of Theorem 6: By the convexity of $\hat L_n(w)$, we have
$$\hat L_n(w_t) - \hat L_n(w^*) \le \left\langle w_t - w^*, \nabla\hat L_n(w_t)\right\rangle. \quad (66)$$
At iteration $t$, we have $w_{t+1} = w_t - \eta_t\tilde g_t$, where $\tilde g_t = \hat V_k\hat V_k^\top g_t + \hat V_k\hat V_k^\top b_t$. Let $\bar g_t = g_t + \hat V_k\hat V_k^\top b_t$ and $\Delta_t = \hat V_k\hat V_k^\top g_t - g_t$; then $\tilde g_t = \bar g_t + \Delta_t$. Recall that $g_t = \frac{1}{|B_t|}\sum_{z_i\in B_t}\nabla\ell(w_t,z_i)$. With $B_t$ uniformly sampled from $S$, we have $\mathbb{E}_t[g_t] = \nabla\hat L_n(w_t)$, and since $b_t$ is a zero-mean Gaussian vector, $\mathbb{E}_t[\bar g_t] = g_t$. By convexity, conditioned on $w_t$, we have
$$\begin{aligned}
\mathbb{E}_t\left[\hat L_n(w_t) - \hat L_n(w^*)\right] &\le \mathbb{E}_t\left\langle w_t-w^*, g_t\right\rangle = \mathbb{E}_t\left\langle w_t-w^*, \bar g_t\right\rangle \\
&= \frac1{\eta_t}\mathbb{E}_t\left\langle w_t-w^*, w_t - w_{t+1} - \eta_t\Delta_t\right\rangle \\
&= \frac1{2\eta_t}\mathbb{E}_t\left[\|w_t-w^*\|_2^2 + \eta_t^2\|\bar g_t\|_2^2 - \|w_{t+1}-w^* - \eta_t\Delta_t\|_2^2\right] \\
&= \frac1{2\eta_t}\mathbb{E}_t\left[\|w_t-w^*\|_2^2 - \|w_{t+1}-w^*\|_2^2\right] + \frac{\eta_t}{2}\mathbb{E}_t\left[\|\bar g_t\|_2^2\right] - \frac{\eta_t}{2}\mathbb{E}_t\left[\|\Delta_t\|_2^2\right] + \mathbb{E}_t\left\langle w_{t+1}-w^*, \Delta_t\right\rangle \\
&\overset{(a)}{\le} \frac1{2\eta_t}\mathbb{E}_t\left[\|w_t-w^*\|_2^2 - \|w_{t+1}-w^*\|_2^2\right] + \frac{\eta_t}{2}\left(G^2+k\sigma^2\right) + B\,\mathbb{E}_t\left[\|\Delta_t\|_2\right],
\end{aligned}$$
where (a) is true since $\mathbb{E}_t\|\bar g_t\|_2^2 \le \mathbb{E}_t\left[\|g_t\|_2^2 + \|\hat V_k\hat V_k^\top b_t\|_2^2\right] \le G^2 + k\sigma^2$ and $\|w_{t+1}-w^*\|_2 \le B$ (footnote 5). Let $\eta_t = 1/\sqrt T$; taking the expectation over all iterations and summing over $t=1,\dots,T$, we have
$$\frac1T\,\mathbb{E}\sum_{t=1}^T\left[\hat L_n(w_t) - \hat L_n(w^*)\right] \le \frac{\|w_0-w^*\|_2^2 + G^2 + k\sigma^2}{2\sqrt T} + \frac{B\sum_{t=1}^T\mathbb{E}_t\|\Delta_t\|_2}{T}. \quad (72)$$
From Theorem 3, we have
$$\mathbb{E}_t\left[\|\Delta_t\|_2\right] = \mathbb{E}\left\|\hat V_k(t)\hat V_k(t)^\top g_t - V(t)V(t)^\top g_t\right\|_2 \le G\,\mathbb{E}\left\|\hat V_k(t)\hat V_k(t)^\top - V_k(t)V_k(t)^\top\right\|_2 \le O\left(\frac{G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)}{\alpha_t\sqrt m}\right),$$
where the last inequality holds because $\Sigma_t$ has rank $k$ and thus $V(t) = V_k(t)$.
Bringing this to equation 72, using the fact that $\|w_0-w^*\|_2 < B$, and applying Jensen's inequality, we have
$$\mathbb{E}\left[\hat L_n(\bar w) - \hat L_n(w^*)\right] \le \frac{B^2 + G^2 + k\sigma^2}{2\sqrt T} + O\left(\frac{\Lambda G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)}{\sqrt m}\right), \quad (73)$$
where $\Lambda = \frac1T\sum_{t=1}^T 1/\alpha_t$. With $\sigma^2 = \frac{G^2T}{n^2\epsilon^2}$ and $T = n^2\epsilon^2$, we have
$$\mathbb{E}\left[\hat L_n(\bar w) - \hat L_n(w^*)\right] \le O\left(\frac{B^2 + kG^2}{n\epsilon}\right) + O\left(\frac{\Lambda G\rho\sqrt{\ln p}\,\gamma_2(\mathcal{W},d)}{\sqrt m}\right).$$
That completes the proof.

Hyper-parameter Setting. We consider different choices of the noise scale, i.e., $\sigma \in \{18,14,10,8,6,4\}$ for MNIST and $\sigma \in \{18,14,10,6,4,2\}$ for Fashion MNIST. Cross-entropy is used as the loss function throughout the experiments. The mini-batch size is set to 250 for both MNIST and Fashion MNIST. For the step size, we use grid search over $\{0.01, 0.05, 0.1, 0.2\}$ for MNIST and over $\{0.01, 0.02, 0.05, 0.1, 0.2\}$ for Fashion MNIST, choosing the step size based on the training accuracy at the last epoch. The best step sizes for DP-SGD and PDP-SGD at different privacy levels are presented in Table 3 and Table 4 for MNIST and Fashion MNIST, respectively. For training, a fixed budget of 30 epochs is assigned to each task. We repeat each experiment 3 times and report the mean and standard deviation of the accuracy on the training and test sets. For PDP-SGD, the projection dimension $k$ is a hyper-parameter that illustrates a trade-off between the reconstruction error and the noise reduction: a smaller $k$ removes a larger amount of noise but introduces a larger reconstruction error. We explored $k \in \{20,30,50\}$ for MNIST and $k \in \{30,50,70\}$ for Fashion MNIST, and found that $k=50$ and $k=70$ achieve the best performance for MNIST and Fashion MNIST, respectively, among the search spaces we consider. Instead of projecting in all epochs, we also explored the starting point of the projection, i.e., executing the projection from the 1st epoch or the 15th epoch.
The information of projection dimension k and the starting epoch for projection are also given in Table 3 and Table 4 for MNIST and Fashion MNIST, respectively.

D EXPERIMENTAL SETUP AND ADDITIONAL RESULTS

Privacy Parameter Setting. Since the gradient norm bound $G$ is unknown in deep learning, we follow the gradient clipping method of Abadi et al. (2016) to guarantee privacy. We implement the micro-batch clipping method in PyTorch (footnote 7), using micro-batch sizes of 1 and 5 for MNIST and Fashion MNIST, respectively. Note that training with micro-batch clipping requires the noise to be scaled by the micro-batch size to guarantee the same privacy level, but it takes less time than per-sample clipping (i.e., micro-batch size 1). We follow the Moments Accountant (MA) method (Bu et al., 2019) to calculate the accumulated privacy cost, which depends on the number of epochs, the batch size, $\delta$, and the noise variance $\sigma$ (with 30 epochs, batch size 250, and the same public samples in both cases). The observation that PDP-SGD outperforms DP-SGD in the small-$\epsilon$ regime in Figure 3 also holds for other numbers of training samples. We also explore PDP-SGD with sparse eigen-space computation, i.e., updating the projector every $s$ iterations; PDP-SGD with $s=1$ means computing the top eigen-space at every iteration. Figure 13 reports PDP-SGD with $s \in \{1,10,20\}$ for (a) MNIST with 50,000 samples and (b) Fashion MNIST with 50,000 samples, showing only a mild decay in accuracy for PDP-SGD with fewer eigen-space computations. PDP-SGD with reduced eigen-space computation still improves accuracy over DP-SGD.



Footnotes:
- Note that the requirement of a large public dataset may remove the need to use the private data in the first place, since training with a large public dataset alone may already provide an accurate model.
- In this paper, we focus on minimizing the empirical risk. However, by relying on the generalization guarantee of $(\epsilon,\delta)$-differential privacy, one can also derive a population risk bound that matches the empirical risk bound up to a term of order $O(\epsilon+\delta)$ (Dwork et al., 2015; Bassily et al., 2016; Jung et al., 2020).
- For convex problems, we consider the typical constrained optimization setting in which the optimal solution $w^*$ lies in a set $\mathcal{H} = \{w : \|w\|\le B\}$, and at each step we project $w_{t+1}$ back onto $\mathcal{H}$.
- In the appendix, we provide more examples of $\mathcal{M}$ that are consistent with empirical observations of stochastic gradient distributions and have small $\gamma_2(\mathcal{M},d)$.
- Assumption 2 implies that $\hat L_n(w)$ is $\rho$-smooth.
- PDP-SGD can work with larger training sets as well. We randomly sample 10,000 samples due to the limitation of computational resources, e.g., GPUs.
- We implement the clipping method based on this repository: https://github.com/ChrisWaites/pyvacy.



Figure 1: Top-500 eigenvalue spectrum of the gradient second moment matrix along the training trajectory of SGD and DP-SGD with σ = 1, 2. Dataset: MNIST; model: 2-layer ReLU network with 128 nodes per layer (roughly 130,000 parameters). The Y-axis is the eigenvalue and the X-axis is the order of the eigenvalues from largest to smallest.



Figure 2: Sorted components of population gradients for DP-SGD with σ = 1.0, 2.0, 4.0. Dataset: MNIST; model: 2-layer ReLU network with 128 nodes per layer (roughly 130,000 parameters). The Y-axis is the absolute value of the sorted gradient coordinates, i.e., |m_t(j)|, and the X-axis is the index of the sorted gradient components.

Figure 3: Training and test accuracy for DP-SGD, PDP-SGD, and RPDP-SGD at different privacy levels for (a) MNIST and (b) Fashion MNIST. The X-axis is ε and the Y-axis is the train/test accuracy. For the small-ε regime, which is more favorable for privacy, PDP-SGD outperforms DP-SGD.

Figure 5: Training and test accuracy for (a) PDP-SGD with k = {10, 20, 30, 50}; (b) PDP-SGD with m = {50, 100, 150} for MNIST (ε = 0.23). The X-axis and Y-axis are as in Figure 4. The performance of PDP-SGD increases as the projection dimension k and the public sample size m increase.

Union. If $\mathcal{M}$ is a union of several subsets, i.e., $\mathcal{M} = \cup_{h=1}^D\mathcal{M}_h$, then by applying a union bound to Theorem 2 (cf. Theorem 2.4.15 in Talagrand (2014)), we have $\gamma_2(\mathcal{M},d) \le c\sqrt{\log D}\,\max_h\gamma_2(\mathcal{M}_h,d)$, where $c$ is an absolute constant. Thus, if $\mathcal{M}$ is a union of $D = p^s$ ellipsoids, i.e., polynomially many in $p$, then $\gamma_2(\mathcal{M},\|\cdot\|_2) \le O(\sqrt{s\log p})$.
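The Gaussian-width gap between structured and unstructured gradient sets discussed above can be estimated by Monte Carlo, using the closed forms $\sup_{\|m\|_1\le1}\langle m,v\rangle = \max_j|v_j|$ and $\sup_{\|m\|_2\le1}\langle m,v\rangle = \|v\|_2$ (an illustration only; the dimension and trial count are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
p, trials = 4096, 200
v = rng.normal(size=(trials, p))  # v ~ N(0, I_p)

# sup over the l1 ball is the max coordinate; sup over the l2 ball is the norm.
w_l1 = np.abs(v).max(axis=1).mean()      # ~ sqrt(2 log p): grows very slowly in p
w_l2 = np.linalg.norm(v, axis=1).mean()  # ~ sqrt(p): grows polynomially in p

print(w_l1, np.sqrt(2 * np.log(p)))  # these two are close
print(w_l2, np.sqrt(p))              # these two are close
```

Even at moderate dimension the gap is dramatic, which is why a gradient set constrained to an $\ell_1$-ball-like or sharply decaying ellipsoid structure yields a $\gamma_2$ functional that grows only logarithmically in $p$.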

Figure 7: Comparison of DP-SGD and PDP-SGD for MNIST. (a-c) report the training accuracy and (d-f) report the test accuracy for ε = {0.23, 0.30, 0.42}. The X-axis is the number of epochs, and the Y-axis is the train/test accuracy. PDP-SGD outperforms DP-SGD for small ε.

Figure 8: Comparison of DP-SGD and PDP-SGD for Fashion MNIST. (a-b) report the training accuracy and (c-d) report the test accuracy for ε = {0.23, 0.30}. The learning rate is 0.01 for both PDP-SGD and DP-SGD. PDP-SGD starts the projection at the 15th epoch. The X-axis is the number of epochs, and the Y-axis is the train/test accuracy. PDP-SGD outperforms DP-SGD for small ε.

Figure 9: Training and test accuracy for PDP-SGD with k = {10, 20, 30, 50} for (a) MNIST with ε = 0.30; (b) MNIST with ε = 0.53. The X-axis and Y-axis are as in Figure 4. PDP-SGD with k = 50 performs better than the others in terms of training and test accuracy.

Figure 10: Training and test accuracy for PDP-SGD with m = {50, 100, 150} for (a) MNIST with ε = 0.30; (b) MNIST with ε = 0.53. The X-axis and Y-axis are as in Figure 4. PDP-SGD with m = 150 and m = 100 performs better than the others in terms of training and test accuracy.

Figure 11: Training and test accuracy for PDP-SGD with m = {50, 150, 200} for (a) MNIST with ε = 0.23; (b) MNIST with ε = 0.43. The X-axis and Y-axis are as in Figure 4. PDP-SGD with m = 150 and m = 200 performs slightly better than m = 50 in terms of training and test accuracy.

Figure 12: Training and test accuracy for DP-SGD and PDP-SGD at different privacy levels for (a) MNIST with 20,000 samples and (b) Fashion MNIST with 50,000 samples. The X-axis and Y-axis are as in Figure 3. For small privacy loss ε, PDP-SGD outperforms DP-SGD.

(g t + b t ) , and w t+1 = w t -η t gt . (46) Let ḡt = g t + Vk V k b t , and ∆ t = Vk V k g t -g t . Then we have gt = ḡt + ∆ t . Ln (w t+1 ) ≤ Ln (w t ) + E t ∇ Ln (w t ), w t+1 -w t +

$$\eta_t\nabla\hat L_n(w_t)^\top\nabla\hat L_n(w_t) - \eta_t\nabla\hat L_n(w_t)^\top\left(I - \hat V_k(t)\hat V_k(t)^\top\right)\nabla\hat L_n(w_t) = \eta_t\nabla\hat L_n(w_t)^\top\hat V_k(t)\hat V_k(t)^\top\nabla\hat L_n(w_t),$$
so that
$$\eta_t\left\|\hat V_k(t)\hat V_k(t)^\top\nabla\hat L_n(w_t)\right\|_2^2 \le \hat L_n(w_t) - \mathbb{E}_t\left[\hat L_n(w_{t+1})\right] + \frac{\rho\eta_t^2}{2}D_{t,2}.$$

Datasets and Network Structure. The MNIST and Fashion MNIST datasets each consist of 60,000 training examples and 10,000 test examples. To construct the private training set, we randomly sample 10,000 examples from the original MNIST training set; we then randomly sample 100 examples from the remainder to construct the public dataset (footnote 7). Details are given in Table 2. For both MNIST and Fashion MNIST, we use a convolutional neural network following the structure in Papernot et al. (2020), whose architecture is described in Table 1. All experiments were run on NVIDIA Tesla K40 GPUs. Network architecture for MNIST and Fashion MNIST.

Neural network and datasets setup.

ACKNOWLEDGEMENT

The research was supported by NSF grants IIS-1908104, OAC-1934634, IIS-1563950, a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Mozilla research grant. We would like to thank the Minnesota Super-computing Institute (MSI) for providing computational resources and support.


Published as a conference paper at ICLR 2021

Table 3: Hyper-parameter settings for DP-SGD and PDP-SGD for MNIST.

The results suggest that for small ε, PDP-SGD can efficiently reduce the noise variance injected into the gradient, which improves the training and test accuracy over DP-SGD. To understand the role of the projection dimension k, we run PDP-SGD with the projection starting from the first epoch. Figure 9 reports PDP-SGD with k ∈ {10, 20, 30, 50} for MNIST with ε = 0.30 (Figure 9(a)) and ε = 0.53 (Figure 9(b)). Among these choices of k, PDP-SGD with k = 50 performs better than the others in terms of training and test accuracy. PDP-SGD with k = 10 proceeds more slowly than PDP-SGD with k = 20 and k = 50, due to the larger reconstruction error introduced by projecting the gradient onto a much smaller subspace, i.e., k = 10. However, compared to the gradient dimension p ≈ 25,000, it is striking that PDP-SGD with k = 50, which projects the gradient onto a far smaller subspace, can achieve better accuracy than DP-SGD.

We also empirically evaluate the effect of the public sample size m. The training and test accuracy of PDP-SGD increase as the public sample size grows from 50 to 150. This is consistent with the theoretical analysis that increasing m helps reduce the subspace reconstruction error, as suggested by the theoretical bound. Also, PDP-SGD with m = 150 and m = 200 performs slightly better than m = 50 in terms of training and test accuracy. The results suggest that while a small public dataset is not sufficient for training an accurate predictor on its own, it provides a useful gradient subspace projection and an accuracy improvement over DP-SGD. We also compare PDP-SGD and DP-SGD for different numbers of training samples, i.e., MNIST with 20,000 samples (Figure 12(a)) and Fashion MNIST with 50,000 samples (Figure 12(b)).

