BYPASSING THE AMBIENT DIMENSION: PRIVATE SGD WITH GRADIENT SUBSPACE IDENTIFICATION

Abstract

Differentially private SGD (DP-SGD) is one of the most popular methods for solving differentially private empirical risk minimization (ERM). Because it perturbs every gradient update with noise, the error rate of DP-SGD scales with the ambient dimension $p$, the number of parameters in the model. Such dependence can be problematic for over-parameterized models where $p \gg n$, the number of training samples. Existing lower bounds on private ERM show that this dependence on $p$ is inevitable in the worst case. In this paper, we circumvent the dependence on the ambient dimension by leveraging a low-dimensional structure of the gradient space in deep networks: the stochastic gradients for deep nets usually stay in a low-dimensional subspace during training. We propose Projected DP-SGD, which performs noise reduction by projecting the noisy gradients onto a low-dimensional subspace given by the top gradient eigenspace on a small public dataset. We provide a general sample complexity analysis on the public dataset for the gradient subspace identification problem and demonstrate that, under certain low-dimensional assumptions, the public sample complexity grows only logarithmically in $p$. Finally, we provide a theoretical analysis and empirical evaluations to show that our method can substantially improve the accuracy of DP-SGD in the high privacy regime (corresponding to low privacy loss $\epsilon$).
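To make the projection step concrete, the following is a minimal NumPy sketch of one noisy gradient computation followed by projection onto a subspace estimated from public gradients. It is an illustration under stated assumptions rather than the algorithm as specified in the paper: the function names (`clip_rows`, `top_k_subspace`, `projected_noisy_gradient`), the per-example clipping rule, and the parameter `noise_std` are hypothetical, and no privacy accounting is performed.

```python
import numpy as np


def clip_rows(grads, clip_norm):
    """Scale each per-example gradient (row) so its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return grads * scale


def top_k_subspace(public_grads, k):
    """Return a (p, k) orthonormal basis for the top-k gradient eigenspace.

    The right singular vectors of the (m, p) public gradient matrix G are the
    eigenvectors of G^T G, i.e. of the (unnormalized) second-moment matrix.
    Assumes k <= min(m, p).
    """
    _, _, vt = np.linalg.svd(public_grads, full_matrices=False)
    return vt[:k].T


def projected_noisy_gradient(private_grads, basis, clip_norm, noise_std, rng):
    """Clip, average, add isotropic Gaussian noise in R^p, then project.

    noise_std is an assumed per-coordinate standard deviation; calibrating it
    to a target privacy level is outside the scope of this sketch.
    """
    clipped = clip_rows(private_grads, clip_norm)
    p = clipped.shape[1]
    noisy = clipped.mean(axis=0) + rng.normal(0.0, noise_std, size=p)
    # Orthogonal projection onto span(basis): V V^T g.
    return basis @ (basis.T @ noisy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, k, m, batch = 200, 5, 40, 64
    # Synthetic gradients that lie exactly in a k-dimensional subspace of R^p.
    true_basis = np.linalg.qr(rng.normal(size=(p, k)))[0]
    public = rng.normal(size=(m, k)) @ true_basis.T
    private = rng.normal(size=(batch, k)) @ true_basis.T

    basis = top_k_subspace(public, k)
    clean = clip_rows(private, 1.0).mean(axis=0)
    projected = projected_noisy_gradient(private, basis, 1.0, 0.5, rng)
    unprojected = clean + rng.normal(0.0, 0.5, size=p)

    print("error with projection:   ", np.linalg.norm(projected - clean))
    print("error without projection:", np.linalg.norm(unprojected - clean))
```

In this synthetic setting the gradients lie exactly in a $k$-dimensional subspace, so the projection leaves the signal unchanged and keeps only the $k$-dimensional component of the noise. Real gradients are only approximately low-dimensional, in which case the projection also introduces some reconstruction error.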

1. INTRODUCTION

Many fundamental machine learning tasks involve solving empirical risk minimization (ERM): given a loss function $\ell$, find a model $w \in \mathbb{R}^p$ that minimizes the empirical risk $L_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i)$, where $z_1, \dots, z_n$ are i.i.d. examples drawn from a distribution $\mathcal{P}$. In many applications, the training data may contain highly sensitive information about some individuals. When the models are given by deep neural networks, their rich representation can potentially reveal fine details of the private data. Differential privacy (DP) (Dwork et al., 2006) has by now become the standard approach to provide principled and rigorous privacy guarantees in machine learning. Roughly speaking, DP is a stability notion that requires that no individual example has a significant influence on the trained model. One of the most commonly used algorithms for solving private ERM is differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016; Bassily et al., 2014; Song et al., 2013), a private variant of SGD that perturbs each gradient update with a random noise vector drawn from an isotropic Gaussian distribution $\mathcal{N}(0, \sigma^2 I_p)$ with appropriately chosen variance $\sigma^2$. Because this perturbation is isotropic in all $p$ coordinates, the error rate of DP-SGD depends on the ambient dimension $p$, the number of parameters in the model. In the case of convex loss $\ell$, Bassily et al. (2014) show that DP-SGD achieves the optimal empirical excess risk of $\tilde{O}(\sqrt{p}/(n\epsilon))$. For non-convex loss $\ell$, which is more common in neural network training, minimizing $L_n(w)$ is in general intractable. However, many (non-private) gradient-based optimization methods are shown to be effective in practice and can provably find approximate stationary points with vanishing gradient norm $\|\nabla L_n(w)\|_2$ (see e.g. Nesterov (2014); Ghadimi and

