BYPASSING THE AMBIENT DIMENSION: PRIVATE SGD WITH GRADIENT SUBSPACE IDENTIFICATION

Abstract

Differentially private SGD (DP-SGD) is one of the most popular methods for solving differentially private empirical risk minimization (ERM). Because it perturbs each gradient update with noise, the error rate of DP-SGD scales with the ambient dimension p, the number of parameters in the model. Such dependence can be problematic for over-parameterized models where p ≫ n, the number of training samples. Existing lower bounds on private ERM show that such dependence on p is inevitable in the worst case. In this paper, we circumvent the dependence on the ambient dimension by leveraging a low-dimensional structure of the gradient space in deep networks: the stochastic gradients for deep nets usually stay in a low-dimensional subspace during training. We propose Projected DP-SGD, which performs noise reduction by projecting the noisy gradients onto a low-dimensional subspace given by the top gradient eigenspace on a small public dataset. We provide a general sample complexity analysis on the public dataset for the gradient subspace identification problem and demonstrate that under certain low-dimensional assumptions the public sample complexity grows only logarithmically in p. Finally, we provide a theoretical analysis and empirical evaluations to show that our method can substantially improve the accuracy of DP-SGD in the high-privacy regime (corresponding to low privacy loss $\varepsilon$).

1. INTRODUCTION

Many fundamental machine learning tasks involve solving empirical risk minimization (ERM): given a loss function $\ell$, find a model $w \in \mathbb{R}^p$ that minimizes the empirical risk $\hat{L}_n(w) = \frac{1}{n}\sum_{i=1}^n \ell(w, z_i)$, where $z_1, \ldots, z_n$ are i.i.d. examples drawn from a distribution $\mathcal{P}$. In many applications, the training data may contain highly sensitive information about some individuals. When the models are given by deep neural networks, their rich representation can potentially reveal fine details of the private data. Differential privacy (DP) (Dwork et al., 2006) has by now become the standard approach to provide principled and rigorous privacy guarantees in machine learning. Roughly speaking, DP is a stability notion that requires that no individual example has a significant influence on the trained model. One of the most commonly used algorithms for solving private ERM is differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016; Bassily et al., 2014; Song et al., 2013), a private variant of SGD that perturbs each gradient update with a random noise vector drawn from an isotropic Gaussian distribution $\mathcal{N}(0, \sigma^2 I_p)$, with appropriately chosen variance $\sigma^2$. Due to the gradient perturbation drawn from an isotropic Gaussian distribution, the error rate of DP-SGD has a dependence on the ambient dimension $p$, the number of parameters in the model. In the case of convex loss $\ell$, Bassily et al. (2014) show that DP-SGD achieves the optimal empirical excess risk of $\tilde{O}(\sqrt{p}/(n\varepsilon))$. For non-convex loss $\ell$, which is more common in neural network training, minimizing $\hat{L}_n(w)$ is in general intractable. However, many (non-private) gradient-based optimization methods are shown to be effective in practice and can provably find approximate stationary points with vanishing gradient norm $\|\nabla \hat{L}_n(w)\|_2$ (see e.g. Nesterov (2014); Ghadimi and Lan (2013)). Moreover, for a wide family of loss functions $\hat{L}_n$ under the Polyak-Łojasiewicz condition (Polyak, 1963), minimizing the gradient norm implies achieving the global optimum. Under a privacy constraint, Wang and Xu (2019) recently showed that DP-SGD drives the empirical gradient norm down to $\tilde{O}(p^{1/4}/\sqrt{n})$ when the loss function is smooth.
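The DP-SGD update described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the per-example gradient interface are hypothetical, and the clipping-plus-noise recipe follows the standard Abadi et al. (2016) formulation.

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr, clip_norm, sigma, rng):
    """One DP-SGD update: clip per-example gradients, average, add isotropic noise.

    Hypothetical sketch; `per_example_grads` has shape (batch, p).
    """
    # Clip each per-example gradient to L2 norm at most `clip_norm`.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Average, then perturb with noise drawn from N(0, sigma^2 C^2 I_p) / batch.
    g = clipped.mean(axis=0)
    noise = rng.normal(0.0, sigma * clip_norm / len(per_example_grads), size=g.shape)
    return w - lr * (g + noise)
```

Because the noise vector is isotropic in $\mathbb{R}^p$, its expected squared norm grows linearly in $p$, which is the source of the ambient-dimension dependence discussed above.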
Furthermore, existing lower bound results on private ERM (Bassily et al., 2014) show that such dependence on $p$ is inevitable in the worst case. However, many modern machine learning tasks now involve training extremely large models, with the number of parameters substantially larger than the number of training samples. For these large models, the error dependence on $p$ can be a barrier to practical private ERM. In this paper, we aim to overcome such dependence on the ambient dimension $p$ by leveraging the structure of the gradient space in the training of neural networks. We take inspiration from the empirical observation of Li et al. (2020); Gur-Ari et al. (2018); Papyan (2019) that even though the ambient dimension of the gradients is large, the set of sample gradients at most iterations along the optimization trajectory is often contained in a much lower-dimensional subspace. While this observation has been made mostly for the non-private SGD algorithm, we also provide our empirical evaluation of this structure (in terms of eigenvalues of the gradient second moment matrix) in Figure 1. Based on this observation, we provide a modular private ERM optimization framework with two components. At each iteration $t$, the algorithm performs the following two steps: 1) Gradient dimension reduction. Let $g_t$ be the mini-batch gradient at iteration $t$. In general, this subroutine solves the following problem: given any $k < p$, find a linear projection $\hat{V}_k(t) \in \mathbb{R}^{p \times k}$ such that the reconstruction error $\|g_t - \hat{V}_k(t)\hat{V}_k(t)^\top g_t\|$ is small. To implement this subroutine, we follow a long line of work that studies private data analysis with access to an auxiliary public dataset $S_h$ drawn from the same distribution $\mathcal{P}$, for which we do not need to provide a formal privacy guarantee (Bassily et al., 2019b; 2020; Feldman et al., 2018; Avent et al., 2017; Papernot et al., 2017). In our case, we compute $\hat{V}_k(t)$ as the top-$k$ eigenspace of the gradients evaluated on $S_h$.
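The subspace identification step can be sketched as follows, assuming the public-set gradients are stacked row-wise. This is an illustrative implementation with a hypothetical function name; it uses the standard fact that the top-$k$ right singular vectors of the gradient matrix span the top-$k$ eigenspace of the empirical second moment matrix.

```python
import numpy as np

def top_k_gradient_subspace(public_grads, k):
    """Estimate the top-k eigenspace of the gradient second moment matrix.

    `public_grads` has shape (m, p): m per-example gradients on the public
    set S_h. The top-k right singular vectors of this matrix span the top-k
    eigenspace of (1/m) G^T G, avoiding the explicit p x p matrix.
    """
    _, _, Vt = np.linalg.svd(public_grads, full_matrices=False)
    return Vt[:k].T  # shape (p, k), orthonormal columns
```

When the public gradients indeed concentrate in a $k$-dimensional subspace, projecting onto the returned columns reconstructs them with negligible error.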
Alternatively, this subroutine can potentially be implemented through private subspace identification on the private dataset. However, to the best of our knowledge, all existing methods have reconstruction error scaling with $\sqrt{p}$ (Dwork et al., 2014), which propagates to the optimization error. 2) Projected DP-SGD (PDP-SGD). Given the projection $\hat{V}_k(t)$, we perturb the gradient in the projected subspace: $\tilde{g}_t = \hat{V}_k(t)\hat{V}_k(t)^\top (g_t + b_t)$, where $b_t$ is a $p$-dimensional Gaussian vector. The projection mapping provides a large reduction of the noise and enables higher accuracy for PDP-SGD. Our results. We provide both theoretical analyses and empirical evaluations of PDP-SGD: Uniform convergence for projections. A key step in our theoretical analysis is to bound the reconstruction error on the gradients from the projection $\hat{V}_k(t)$. This reduces to bounding the deviation $\|\hat{V}_k(t)\hat{V}_k(t)^\top - V_k(t)V_k(t)^\top\|_2$, where $V_k(t)$ denotes the top-$k$ eigenspace of the population second moment matrix $\mathbb{E}[\nabla \ell(w_t, z)\nabla \ell(w_t, z)^\top]$. To handle the adaptivity of the sequence of iterates, we provide a uniform deviation bound for all $w \in \mathcal{W}$, where the set $\mathcal{W}$ contains all of the iterates. By leveraging generic chaining techniques, we provide a deviation bound that scales linearly with a complexity measure, the $\gamma_2$ function due to Talagrand (2014), of the set $\mathcal{W}$. We provide low-complexity examples of $\mathcal{W}$ that are supported by empirical observations and show that their $\gamma_2$ function only scales logarithmically with $p$.
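The PDP-SGD update $\tilde{g}_t = \hat{V}_k(t)\hat{V}_k(t)^\top (g_t + b_t)$ can be sketched as follows. The function name is hypothetical; the point is that noise components orthogonal to the estimated gradient subspace are projected away, so the effective noise lives in $k$ dimensions rather than $p$.

```python
import numpy as np

def pdp_sgd_step(w, g, V_hat, lr, sigma, rng):
    """Projected DP-SGD update: perturb the gradient, then project.

    `V_hat` has shape (p, k) with orthonormal columns. The noise b is drawn
    in R^p (so the privacy accounting is unchanged), but the projection
    discards its (p - k) components orthogonal to span(V_hat).
    """
    p = g.shape[0]
    b = rng.normal(0.0, sigma, size=p)        # isotropic Gaussian noise in R^p
    g_tilde = V_hat @ (V_hat.T @ (g + b))     # projected noisy gradient
    return w - lr * g_tilde
```

With $\sigma = 0$ the update reduces to gradient descent restricted to the span of $\hat{V}_k(t)$: coordinates orthogonal to the subspace are left untouched.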



Figure 1: Top 500 eigenvalue spectrum of the gradient second moment matrix along the training trajectory of SGD and of DP-SGD with σ = 1, 2. Dataset: MNIST; model: 2-layer ReLU network with 128 nodes per layer (roughly 130,000 parameters). The y-axis shows the eigenvalues and the x-axis their rank, ordered from largest to smallest.
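The diagnostic behind Figure 1 can be reproduced in outline as follows. This is an illustrative sketch with a hypothetical function name: given per-example gradients at one training step, it computes the eigenvalue spectrum of the empirical second moment matrix via the singular values of the gradient matrix, without forming the $p \times p$ matrix.

```python
import numpy as np

def gradient_spectrum(grads, top=500):
    """Eigenvalue spectrum of the empirical gradient second moment matrix.

    `grads` has shape (m, p): m per-example gradients at one step. The
    eigenvalues of (1/m) G^T G are s_i^2 / m for singular values s_i of G,
    returned in descending order (the quantity plotted in Figure 1).
    """
    s = np.linalg.svd(grads, compute_uv=False)  # descending singular values
    return (s ** 2 / grads.shape[0])[:top]
```

A rapidly decaying spectrum from this diagnostic is what motivates choosing a small $k$ for the projection step.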

