AUTOMATIC CLIPPING: DIFFERENTIALLY PRIVATE DEEP LEARNING MADE EASIER AND STRONGER

Abstract

Per-example gradient clipping is a key algorithmic step that enables practical differentially private (DP) training for deep learning models. The choice of clipping threshold R, however, has been shown to be vital for achieving high accuracy under DP. We propose an easy-to-use replacement, called automatic clipping, that eliminates the need to tune R for any DP optimizer, including DP-SGD, DP-Adam, DP-LAMB, and many others. The automatic variants are as private and computationally efficient as existing DP optimizers, but require no DP-specific hyperparameters and thus make DP training as amenable as standard non-private training. We give a rigorous convergence analysis of automatic DP-SGD in the non-convex setting, showing that it can enjoy an asymptotic convergence rate that matches standard SGD, under a symmetric noise assumption on the per-sample gradients. We also demonstrate on various language and vision tasks that automatic clipping outperforms or matches the state-of-the-art, and can be easily employed with minimal changes to existing codebases.

1. INTRODUCTION

Deep learning has achieved impressive progress in a wide range of tasks. These successes are made possible, in part, by the collection of large datasets, sometimes containing sensitive private information of individual data points (e.g., chest scan images, DNA sequences). Prior works have illustrated that deep learning models pose severe privacy risks to individual subjects in the training data and are susceptible to various practical attacks. For example, machine learning services such as Google Prediction API and Amazon Machine Learning can leak membership information from purchase records (Shokri et al., 2017); if one feeds the GPT-2 language model a specific prefix, the model will autocomplete texts that contain someone's full name, phone number, email address, etc., from the training data that it memorizes (Carlini et al., 2021). Differential privacy (DP) (Dwork et al., 2006; Dwork, 2008; Dwork et al., 2014) is a formal definition of privacy that has been shown to prevent the aforementioned privacy risks in deep learning (Abadi et al., 2016). On a high level, the key difference between DP deep learning and regular deep learning is whether the gradient is privately released. That is, while standard optimizers update on the summed gradient $\sum_i g_i$, DP optimizers update on the private gradient:

$$\text{DP Optimizer}(\{g_i\}_{i=1}^B) = \text{Optimizer}\Big(\underbrace{\textstyle\sum_i g_i \cdot \text{Clip}(g_i; R) + \sigma R \cdot \mathcal{N}(0, \mathrm{I})}_{\text{private gradient}}\Big) \tag{1.1}$$

$$\text{Standard Optimizer}(\{g_i\}_{i=1}^B) = \text{Optimizer}\Big(\textstyle\sum_i g_i\Big) \tag{1.2}$$

Here $g_i \in \mathbb{R}^d$ is the per-sample gradient of the loss $l_i$, $\mathcal{N}$ is the standard normal distribution, $\sigma$ is the noise multiplier, and $R$ is the clipping threshold. The clipping function $\text{Clip}: \mathbb{R}^d \to \mathbb{R}$ is defined such that $\|g_i \cdot \text{Clip}(g_i; R)\| \leq R$. For instance, the DP-SGD in Abadi et al.
(2016) on batch $B_t$ is

$$\text{DP-SGD}_{\text{Abadi}}: \quad w_{t+1} = w_t - \eta \Big( \textstyle\sum_{i \in B_t} \frac{\partial l_i}{\partial w_t} \cdot \min\big( R / \big\|\frac{\partial l_i}{\partial w_t}\big\|, 1 \big) + \sigma R \cdot \mathcal{N}(0, \mathrm{I}) \Big) \tag{1.3}$$

In comparison to regular training (1.2), two additional DP-specific hyperparameters, $R$ and $\sigma$, need to be determined in DP learning (1.1). On the one hand, setting the noise multiplier $\sigma$ is easy, as it can be derived analytically prior to training. Once the privacy budget $(\epsilon, \delta)$ is determined, one can apply the off-the-shelf privacy accounting tools in Section 2.1 to determine $\sigma$, based on the subsampling probability $p$ and the number of iterations $T$:

$$\text{privacy accountant}(\sigma, p, T; \delta) = \epsilon$$
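To make the update in (1.3) concrete, the following is a minimal NumPy sketch of one DP-SGD step with Abadi et al.'s per-example clipping; the function name, argument layout, and the `(B, d)` gradient shape are illustrative choices, not part of the paper's notation.

```python
import numpy as np


def dp_sgd_step(w, per_sample_grads, lr, R, sigma, rng):
    """One DP-SGD update in the form of eq. (1.3).

    w:                parameter vector of shape (d,)
    per_sample_grads: array of shape (B, d), one gradient g_i per example
    R:                clipping threshold; sigma: noise multiplier
    """
    norms = np.linalg.norm(per_sample_grads, axis=1)            # ||g_i||
    factors = np.minimum(R / np.maximum(norms, 1e-12), 1.0)     # min(R/||g_i||, 1)
    clipped_sum = (per_sample_grads * factors[:, None]).sum(axis=0)
    noise = sigma * R * rng.standard_normal(w.shape)            # sigma * R * N(0, I)
    return w - lr * (clipped_sum + noise)
```

Note that each clipped gradient $g_i \cdot \min(R/\|g_i\|, 1)$ has norm at most $R$, which is exactly the bounded-sensitivity property that makes the added Gaussian noise $\sigma R \cdot \mathcal{N}(0, \mathrm{I})$ sufficient for the privacy guarantee.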

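Because the accountant's $\epsilon$ is monotonically decreasing in $\sigma$ (for fixed $p$, $T$, $\delta$), calibrating $\sigma$ to a target budget reduces to a one-dimensional root search. The sketch below illustrates this with a bisection loop; `toy_accountant` is a deliberately crude placeholder, not a real privacy analysis, and in practice one would plug in an off-the-shelf accountant such as those provided by Opacus or TensorFlow Privacy.

```python
import math


def toy_accountant(sigma, p, T, delta):
    """Placeholder accountant, NOT a valid privacy bound: a moments-
    accountant-style heuristic eps ~ p*sqrt(2*T*log(1.25/delta))/sigma,
    used only so the calibration loop below is runnable."""
    return p * math.sqrt(2 * T * math.log(1.25 / delta)) / sigma


def calibrate_sigma(target_eps, p, T, delta,
                    accountant=toy_accountant,
                    lo=1e-2, hi=1e4, tol=1e-6):
    """Find the smallest sigma with accountant(sigma, p, T, delta) <= target_eps,
    using that eps is decreasing in sigma."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if accountant(mid, p, T, delta) > target_eps:
            lo = mid   # too little noise: eps still above budget
        else:
            hi = mid
    return hi
```

This is the sense in which $\sigma$ is "easy": it is fixed once, before training, by inverting the accountant at the target $(\epsilon, \delta)$, whereas $R$ has no such closed-form calibration and must be tuned, which is the problem automatic clipping removes.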
