AUTOMATIC CLIPPING: DIFFERENTIALLY PRIVATE DEEP LEARNING MADE EASIER AND STRONGER

Abstract

Per-example gradient clipping is a key algorithmic step that enables practical differentially private (DP) training for deep learning models. The choice of clipping threshold R, however, has been shown to be vital for achieving high accuracy under DP. We propose an easy-to-use replacement, called automatic clipping, that eliminates the need to tune R for any DP optimizer, including DP-SGD, DP-Adam, DP-LAMB and many others. The automatic variants are as private and computationally efficient as existing DP optimizers, but require no DP-specific hyperparameters and thus make DP training as amenable as standard non-private training. We give a rigorous convergence analysis of automatic DP-SGD in the non-convex setting, showing that it can enjoy an asymptotic convergence rate matching that of standard SGD, under a symmetric noise assumption on the per-sample gradients. We also demonstrate on various language and vision tasks that automatic clipping outperforms or matches the state-of-the-art, and can be easily employed with minimal changes to existing codebases.

1. INTRODUCTION

Deep learning has achieved impressive progress in a wide range of tasks. These successes are made possible, in part, by the collection of large datasets, some of which contain sensitive private information about individuals (e.g., chest scan images, DNA sequences). Prior works have illustrated that deep learning models pose severe privacy risks to individual subjects in the training data and are susceptible to various practical attacks. For example, machine learning services such as Google Prediction API and Amazon Machine Learning can leak membership information from purchase records (Shokri et al., 2017); if one feeds the GPT2 language model a specific prefix, the model will autocomplete text containing someone's full name, phone number, email address, etc., memorized from the training data (Carlini et al., 2021).

Differential privacy (DP) (Dwork et al., 2006; Dwork, 2008; Dwork et al., 2014) is a formal definition of privacy that has been shown to prevent the aforementioned privacy risks in deep learning (Abadi et al., 2016). At a high level, the key difference between DP deep learning and regular deep learning is whether the gradient is privately released. In other words, while standard optimizers update on the summed gradient ∑_i g_i, DP optimizers update on the private gradient:

DP Optimizer({g_i}_{i=1}^B) = Optimizer( ∑_i g_i · Clip(∥g_i∥; R) + σR · N(0, I) )    (1.1)
Standard Optimizer({g_i}_{i=1}^B) = Optimizer( ∑_i g_i )    (1.2)

Here g_i ∈ R^d is the per-sample gradient of the loss l_i, N(0, I) is the standard normal, σ is the noise multiplier, and R is the clipping threshold. The clipping function Clip : R^d → R is defined such that ∥g_i · Clip(∥g_i∥; R)∥ ≤ R. For instance, the DP-SGD in Abadi et al. (2016) on batch B_t is

DP-SGD (Abadi):  w_{t+1} = w_t − η ( ∑_{i∈B_t} (∂l_i/∂w_t) · min( R/∥∂l_i/∂w_t∥, 1 ) + σR · N(0, I) )    (1.3)

In comparison to regular training (1.2), two additional DP-specific hyperparameters, R and σ, need to be determined in DP learning (1.1). On the one hand, setting the noise multiplier σ is easy and can be derived analytically prior to training: whenever the privacy budget (ε, δ) is determined, one can apply the off-the-shelf privacy accounting tools in Section 2.1 to determine σ, based on the subsampling probability p and the number of iterations. On the other hand, unlike the noise multiplier σ, the clipping threshold R cannot be inferred from the privacy budget (ε, δ) and has to be tuned. Consequently, DP training necessarily requires a 2D grid search over (R, η), as in the lower plot of Figure 1, whereas regular training only requires an easy 1D grid search over η. Even worse, the difficulty of tuning a per-layer clipping threshold vector (McMahan et al., 2018), i.e. one clipping threshold per layer, may grow exponentially as the number of layers increases.

To save the effort of tuning R, previous research has proposed different approaches. In (Andrew et al., 2021; Pichapati et al., 2019; Golatkar et al., 2022), researchers advocate using data-adaptive information to select R, such as a specified quantile of the gradient norm distribution. These adaptive clipping methods can be a little ad hoc: they often replace the need to tune R with the need to tune one or more new hyperparameters, e.g. the quantile to use and the ratio in which to split the privacy budget between the quantile estimation and the gradient perturbation. Another approach used by practitioners is to replace an expensive 2D grid search with multiple cheaper 1D grid searches.
For example, Kurakin et al. (2022, Section 3.3) propose to fine-tune η with non-DP SGD, then fix η and sweep over various values of the clipping threshold R with DP-SGD, and finally fix R and do one more grid search over η. However, tuning R formally in a data-dependent way (e.g. through cross-validation) introduces additional privacy loss (Papernot & Steinke, 2021), and most existing empirical work does not conduct hyperparameter tuning privately.

We take a completely different route by proposing a new clipping principle that removes R, instead of coming up with methods to find an appropriate R. We term our method automatic clipping, and we term the versions of DP optimizers using it automatic DP optimizers. We summarize our contributions as follows.

1. We propose automatic clipping in (4.1), which expunges the clipping threshold from general DP optimizers, making DP learning as amenable as regular learning.
2. We show that automatic DP optimizers are as private and efficient as existing DP optimizers.
3. We show in Theorem 4 that automatic DP-SGD converges in the non-convex setting, at the same asymptotic convergence rate as standard SGD. Our theoretical analysis successfully explains the training behaviors observed in previous empirical works.
4. We demonstrate the superiority of automatic clipping on a variety of vision and language tasks, especially with large models including ResNet, RoBERTa and GPT2.
5. In Appendix K, we include simple code snippets that demonstrate how easy it is to switch from Abadi's clipping to our automatic clipping in popular codebases, e.g. Opacus and ObJAX.
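To make the per-example clipping in (1.3) concrete, the following is a minimal NumPy sketch of one Abadi-style DP-SGD update. All names (dp_sgd_step, the argument names) are illustrative, not from the paper's codebase, and the per-example gradients are assumed to be given as plain arrays:

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, R, sigma, lr, rng):
    """One update w_{t+1} = w_t - lr * (sum_i g_i * min(R/||g_i||, 1) + sigma*R*N(0,I)),
    as in Equation (1.3)."""
    clipped_sum = np.zeros_like(w)
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Abadi-style clipping factor: shrink g_i so its norm is at most R
        clipped_sum += g * min(R / max(norm, 1e-12), 1.0)
    # Gaussian noise calibrated to the clipping threshold R
    noise = sigma * R * rng.standard_normal(w.shape)
    return w - lr * (clipped_sum + noise)

# Example: a single per-example gradient [3, 4] (norm 5) with R = 1
# is scaled by 0.2 before the update.
rng = np.random.default_rng(0)
w_new = dp_sgd_step(np.zeros(2), [np.array([3.0, 4.0])],
                    R=1.0, sigma=0.0, lr=1.0, rng=rng)
```

Note that the 2D grid search discussed above arises exactly because both R and lr appear in this update and interact multiplicatively.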

2. PRELIMINARIES

Formally, a randomized algorithm M is (ε, δ)-differentially private (Dwork et al., 2006) if, for any two neighboring datasets S and S′, and for any event E,

P[M(S) ∈ E] ⩽ e^ε · P[M(S′) ∈ E] + δ,

where S′ is a neighbor of S if one can obtain S′ by adding or removing one data point from S. In words, DP restricts the influence of an arbitrary sample, so that the information contributed by such a sample is limited and less vulnerable to privacy attacks. In deep learning, DP is achieved by applying the subsampled Gaussian mechanism to privatize the minibatch gradients during training. As illustrated in Equation (1.1), the subsampled Gaussian mechanism involves (1) sampling a minibatch by including each data point i.i.d. with probability p, (2) per-sample gradient clipping to bound the norm of each gradient, and (3) adding Gaussian noise to the sum of the clipped gradients.

On the other hand, the choice of clipping threshold R is crucial to the performance of DP models, yet tuning this hyperparameter is labor-intensive. Recent advances of DP deep learning on ImageNet (Kurakin et al., 2022) and on the E2E dataset (Li et al., 2021), using ResNet18 and GPT2 respectively, illustrate that the performance is very sensitive to R. We have reproduced their results in Figure 1. Observe that on ImageNet, ResNet18 can drop from the highest 45% accuracy to 31% if R is chosen 2 times larger, and to 0.1% if R is chosen 4 times larger. A similarly drastic drop can also be observed in another figure of (Kurakin et al., 2022).
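The three steps of the subsampled Gaussian mechanism can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the function name and arguments are hypothetical, and per-sample gradients are assumed to be rows of a matrix:

```python
import numpy as np

def subsampled_gaussian_mechanism(per_sample_grads, p, R, sigma, rng):
    """Privatize a gradient sum: Poisson subsampling, per-sample clipping,
    then Gaussian noise of scale sigma * R (cf. Equation (1.1))."""
    d = per_sample_grads.shape[1]
    total = np.zeros(d)
    for g in per_sample_grads:
        if rng.random() < p:  # (1) include each data point i.i.d. with probability p
            norm = np.linalg.norm(g)
            total += g * min(R / max(norm, 1e-12), 1.0)  # (2) clip to norm <= R
    return total + sigma * R * rng.standard_normal(d)    # (3) add Gaussian noise

# Example: two per-sample gradients, R = 5; the second (norm 10) is halved.
rng = np.random.default_rng(0)
grads = np.array([[0.0, 2.0], [6.0, 8.0]])
private_sum = subsampled_gaussian_mechanism(grads, p=1.0, R=5.0, sigma=0.0, rng=rng)
```

Because the clipped sum changes by at most R when one data point is added or removed, noise of scale σR suffices for the privacy accounting referenced in Section 2.1.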

