AUTOMATIC CLIPPING: DIFFERENTIALLY PRIVATE DEEP LEARNING MADE EASIER AND STRONGER

Abstract

Per-example gradient clipping is a key algorithmic step that enables practical differentially private (DP) training for deep learning models. The choice of clipping threshold R, however, is shown to be vital for achieving high accuracy under DP. We propose an easy-to-use replacement, called automatic clipping, that eliminates the need to tune R for any DP optimizer, including DP-SGD, DP-Adam, DP-LAMB and many others. The automatic variants are as private and computationally efficient as existing DP optimizers, but require no DP-specific hyperparameters and thus make DP training as amenable as standard non-private training. We give a rigorous convergence analysis of automatic DP-SGD in the non-convex setting, showing that it can enjoy an asymptotic convergence rate that matches the standard SGD, under a symmetric noise assumption on the per-sample gradients. We also demonstrate on various language and vision tasks that automatic clipping outperforms or matches the state-of-the-art, and can be easily employed with minimal changes to existing codebases.

1. INTRODUCTION

Deep learning has achieved impressive progress in a wide range of tasks. These successes are made possible, in part, by the collection of large datasets, sometimes containing sensitive private information of individual data points (e.g., chest scan images, DNA sequences). Prior works have illustrated that deep learning models pose severe privacy risks to individual subjects in the training data and are susceptible to various practical attacks. For example, machine learning services such as Google Prediction API and Amazon Machine Learning can leak membership information from purchase records (Shokri et al., 2017); if one feeds the GPT2 language model with a specific prefix, the model will autocomplete texts that contain someone's full name, phone number, email address, etc., from the training data that it memorizes (Carlini et al., 2021). Differential privacy (DP) (Dwork, 2008; Dwork et al., 2014; 2006) is a formal definition of privacy that has been shown to prevent the aforementioned privacy risks in deep learning (Abadi et al., 2016). On a high level, the key difference between DP deep learning and the regular kind is whether the gradient is privately released. In other words, while standard optimizers update on the summed gradient Σ_i g_i, DP optimizers update on the private gradient:

DP Optimizer({g_i}_{i=1}^B) = Optimizer( Σ_i g_i · Clip(∥g_i∥; R) + σR · N(0, I) )   (1.1)
Standard Optimizer({g_i}_{i=1}^B) = Optimizer( Σ_i g_i )   (1.2)

Here g_i ∈ R^d is the per-sample gradient of loss l_i, N is the standard normal, σ is the noise multiplier, and R is the clipping threshold. The clipping function Clip: R^d → R is defined so that ∥g_i · Clip(g_i; R)∥ ≤ R. For instance, the DP-SGD in Abadi et al. (2016) on batch B_t is

DP-SGD (Abadi): w_{t+1} = w_t − η ( Σ_{i∈B_t} (∂l_i/∂w_t) · min(R/∥∂l_i/∂w_t∥, 1) + σR · N(0, I) )   (1.3)

In comparison to regular training (1.2), two additional DP-specific hyperparameters R and σ need to be determined in DP learning (1.1). On the one hand, setting the noise multiplier σ is easy and can be derived analytically prior to training: whenever the privacy budget (ϵ, δ) is determined, one can apply the off-the-shelf privacy accounting tools of Section 2.1 to determine σ, based on the subsampling probability p and the number of iterations T:

privacy_accountant(σ, p, T; δ) = ϵ.

On the other hand, the choice of clipping threshold R is crucial to the performance of DP models, yet tuning it is much more labor-intensive. Recent advances of DP deep learning on ImageNet (Kurakin et al., 2022) and on the E2E dataset (Li et al., 2021), using ResNet18 and GPT2 respectively, illustrate that the performance is very sensitive to R. We have reproduced their results in Figure 1. Observe that on ImageNet, ResNet18 can drop from the highest 45% accuracy to 31% if R is chosen 2 times larger, and to 0.1% if R is chosen 4 times larger. A similarly drastic drop can be observed in (Kurakin et al., 2022, Figure 3) even when the noise multiplier σ = 0. Unlike the noise multiplier σ, the clipping threshold R cannot be inferred from the privacy budget (ϵ, δ) and has to be tuned. Consequently, DP training necessarily requires a 2D grid search over (R, η), as in the lower plot of Figure 1, whereas regular training only requires an easy 1D grid search over η. Even worse, the difficulty of tuning a per-layer clipping threshold vector (McMahan et al., 2018), i.e. one clipping threshold per layer, may increase exponentially as the number of layers increases. To save the effort of tuning R, previous research has proposed different approaches.
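To make the privatization step in (1.3) concrete, below is a minimal pure-Python sketch of per-sample clipping and noising. The function names are ours, not from any library; production implementations (e.g. Opacus) operate on framework tensors.

```python
import math
import random

def l2_norm(g):
    return math.sqrt(sum(x * x for x in g))

def abadi_clip_factor(g, R):
    # Abadi's clipping: min(R / ||g||_2, 1), i.e. rescale only gradients longer than R
    return min(R / l2_norm(g), 1.0)

def private_gradient(per_sample_grads, R, sigma, rng):
    # sum of clipped per-sample gradients, plus Gaussian noise with std sigma * R
    d = len(per_sample_grads[0])
    total = [0.0] * d
    for g in per_sample_grads:
        c = abadi_clip_factor(g, R)
        for j in range(d):
            total[j] += c * g[j]
    return [t + sigma * R * rng.gauss(0.0, 1.0) for t in total]

grads = [[3.0, 4.0], [0.3, 0.4]]   # per-sample gradient norms: 5.0 and 0.5
R = 1.0
factors = [abadi_clip_factor(g, R) for g in grads]
noiseless = private_gradient(grads, R, sigma=0.0, rng=random.Random(0))
```

With sigma=0 the long gradient is rescaled to norm R while the short one passes through unchanged, which is exactly the sensitivity-bounding behavior that the Gaussian noise then protects.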
In (Andrew et al., 2021; Pichapati et al., 2019; Golatkar et al., 2022), researchers advocate using data-adaptive information to select R, such as a specified quantile of the gradient norm distribution. These adaptive clipping methods can be a little ad hoc: they often replace the need to tune R with the need to tune one or more new hyperparameters, e.g. the quantile to use and the ratio in which to split the privacy budget between the quantile estimation and the gradient perturbation. Another approach used by practitioners is to replace an expensive 2D grid search with multiple cheaper 1D grid searches. For example, (Kurakin et al., 2022, Section 3.3) proposes to fine-tune η with non-DP SGD, fix η and sweep over various values of the clipping threshold R with DP-SGD, then further fix R and do one more grid search on η. However, tuning R formally in a data-dependent way (e.g. through cross-validation) introduces additional privacy loss (Papernot & Steinke, 2021), and most existing empirical work does not conduct hyperparameter tuning privately. We take a completely different route by proposing a new clipping principle that removes R, instead of coming up with methods to find an appropriate R. We term our method automatic clipping, and the versions of DP optimizers using it automatic DP optimizers. We summarize our contributions as follows.
1. We propose automatic clipping in (4.1), which expunges the clipping threshold from general DP optimizers, making DP learning as amenable as regular learning.
2. We show that automatic DP optimizers are as private and efficient as existing DP optimizers.
3. We show in Theorem 4 that automatic DP-SGD converges in the non-convex setting, at the same asymptotic convergence rate as the standard SGD. Our theoretical analysis successfully explains the training behaviors observed in previous empirical works.
4. We demonstrate the superiority of automatic clipping on a variety of vision and language tasks, especially with large models including ResNet, RoBERTa and GPT2.
5. In Appendix K, we include simple code snippets that demonstrate how easy it is to switch from Abadi's clipping to our automatic clipping in popular codebases, e.g. Opacus and ObJAX.

2. PRELIMINARIES

2.1 DIFFERENTIAL PRIVACY

We consider (ϵ, δ)-DP in Definition 2.1, where smaller (ϵ, δ) means a stronger privacy guarantee.

Definition 2.1 ((Dwork et al., 2006)). A randomized algorithm M is (ε, δ)-differentially private (DP) if for any two neighboring [foot_0] datasets S, S′, and for any event E,

P[M(S) ∈ E] ≤ e^ε · P[M(S′) ∈ E] + δ.   (2.1)

In words, DP restricts the influence of an arbitrary sample, so that the information contributed by such a sample is limited and less vulnerable to privacy attacks. In deep learning, DP is achieved by applying the subsampled Gaussian mechanism to privatize the minibatch gradients during training. As illustrated in Equation (1.1), the subsampled Gaussian mechanism involves (1) sampling a minibatch by including each data point i.i.d. with probability p, (2) per-sample gradient clipping to bound the l2-norm sensitivity at R, and (3) adding independent Gaussian noise proportional to the sensitivity R and to σ, which is derived from the privacy budget ϵ. The latter step can be realized by leveraging a variety of modern privacy accounting tools, such as Renyi DP (or moments accountant) (Abadi et al., 2016; Mironov, 2017; Wang et al., 2019), privacy loss distribution (Fourier accountants) (Koskela et al., 2020; Gopi et al., 2021; Zhu et al., 2022), or Gaussian DP (Dong et al., 2022; Bu et al., 2020).

2.2. DIFFERENTIALLY PRIVATE OPTIMIZERS WITH GENERAL CLIPPING OPERATIONS

Privately released stochastic gradients (through the Gaussian mechanism) can be used to instantiate various off-the-shelf optimizers, giving rise to DP-SGD in (1.3), DP-HeavyBall, DP-AdaGrad, DP-Adam, DP-FedAvg, DP-FedSGD (McMahan et al., 2018), etc. Previous research on improving the performance of DP optimizers can be classified into two categories. The first category, into which the majority of the research falls, works with Abadi's clipping and focuses on a better design of R. To name a few examples, one can adaptively design R_t for each iteration t (Andrew et al., 2021; Pichapati et al., 2019; Golatkar et al., 2022), or design a per-layer clipping threshold vector R ∈ R^L for L layers (Abadi et al., 2016; McMahan et al., 2018) so as to apply a different clipping threshold to each layer. Fewer works fall into the second category, which proposes new clipping methods. In fact, any function Clip: R^d → R satisfying ∥Clip(g) · g∥ ≤ R can serve as a valid clipping function besides Abadi's. For instance, the global clipping (Bu et al., 2021b) proposes Clip_global(g) := I(∥g∥ < R) to mitigate the bias of the private gradient and alleviate the mis-calibration issue of DP classifiers. Our automatic clipping also belongs to this category. We note that different clipping methods work orthogonally to optimizers, network architectures and gradient norm computation (see Section 7).
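To illustrate the validity condition ∥Clip(g) · g∥ ≤ R, here is a small sketch contrasting Abadi's clipping with the global clipping; the function names are ours.

```python
import math

def norm(g):
    return math.sqrt(sum(x * x for x in g))

def clip_abadi(g, R):
    # Abadi's clipping factor: min(R/||g||, 1)
    return min(R / norm(g), 1.0)

def clip_global(g, R):
    # global clipping factor: I(||g|| < R) -- keep short gradients, drop long ones
    return 1.0 if norm(g) < R else 0.0

R = 1.0
g_short, g_long = [0.3, 0.4], [6.0, 8.0]
norms = {
    "abadi_short": norm([clip_abadi(g_short, R) * x for x in g_short]),
    "abadi_long": norm([clip_abadi(g_long, R) * x for x in g_long]),
    "global_short": norm([clip_global(g_short, R) * x for x in g_short]),
    "global_long": norm([clip_global(g_long, R) * x for x in g_long]),
}
```

Both satisfy the validity condition, but they treat long gradients very differently: Abadi's rescales them to length R, while global clipping discards them entirely.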

3. MOTIVATION

3.1 SMALL CLIPPING THRESHOLD WORKS BEST

One intriguing observation we can make about recent studies on DP learning with large models is that the state-of-the-art (SOTA) results are often achieved with a very small clipping threshold R. This observation is consistent across both vision and language tasks. In Li et al. (2021), GPT2 (about 800 million parameters) and RoBERTa models (over 300 million parameters) achieve the best results under DP on the QNLI, MNLI, SST-2, QQP, E2E, and DART datasets, with each per-sample gradient clipped to length R = 0.1. In (Kurakin et al., 2022; De et al., 2022; Mehta et al., 2022), ResNets and Vision Transformers achieve the best DP results on ImageNet with R = 1; in (Tramer & Boneh, 2020), the best DP results on CIFAR10 use R = 0.1 with ResNeXt-29 and SimCLRv2 (Chen et al., 2020a). The effectiveness of a small clipping threshold together with a proper learning rate is depicted in Figure 1. Intuitively, a smaller R implies that Abadi's clipping (3.1) actually happens, i.e. min(R/∥g_i∥, 1) = R/∥g_i∥. Given that the clipping threshold R is so small compared to the number of parameters in large neural networks, and that strong DP is guaranteed when the number of training iterations is small (i.e. ∥g_i∥ has not yet converged to small values), we expect and empirically observe that clipping happens on a large proportion of per-sample gradients at all iterations. For instance, we find in the GPT2 generation experiments of Li et al. (2021) that 100% of per-sample gradients are clipped at all iterations; in classification tasks such as QQP/QNLI/MNLI, the percentage of clipping is about 20∼60% on average (more details in Appendix H.1).
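A small diagnostic one might use to measure how often Abadi's clipping fires, i.e. how often min(R/∥g_i∥, 1) < 1; the helper name is ours.

```python
import math

def fraction_clipped(per_sample_grads, R):
    # share of per-sample gradients with ||g_i|| >= R, i.e. those that Abadi's
    # min(R/||g_i||, 1) actually rescales
    clipped = sum(1 for g in per_sample_grads
                  if math.sqrt(sum(x * x for x in g)) >= R)
    return clipped / len(per_sample_grads)

grads = [[0.05, 0.0], [3.0, 4.0], [0.6, 0.8], [10.0, 0.0]]  # norms 0.05, 5, 1, 10
small_R = fraction_clipped(grads, R=0.1)
large_R = fraction_clipped(grads, R=100.0)
```

In the small-R regime discussed above, this fraction approaches 1, which is precisely when Abadi's clipping degenerates into per-sample normalization.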

3.2. PER-SAMPLE GRADIENT NORMALIZATION AS NEW CLIPPING

In the small clipping threshold regime, we can approximately view

Clip_Abadi(g_i; R) = min(R/∥g_i∥, 1) ≈ R/∥g_i∥ =: Clip_AUTO-V(g_i; R)   (3.1)

and thus derive a novel private gradient Σ_i R·g_i/∥g_i∥ + σR·N(0, I). Here AUTO-V stands for the vanilla automatic clipping, which essentially performs gradient normalization on each per-sample gradient. As a specific example, we can write the R-dependent automatic DP-SGD as

R-dependent DP-SGD (AUTO-V): w_{t+1} = w_t − η ( Σ_{i∈B_t} R·(∂l_i/∂w_t)/∥∂l_i/∂w_t∥ + σR·N(0, I) )   (3.2)

We may view our AUTO-V clipping as maximizing the dot-product similarity (a commonly used similarity measure, e.g. in the attention blocks of transformers (Vaswani et al., 2017)) between the clipped gradient and the regular gradient. Suppose we want

max_{C_i} ⟨Σ_i C_i g_i, Σ_j g_j⟩  s.t.  0 ≤ C_i ≤ R/∥g_i∥.

Note that the constraint is a sufficient condition for clipping, as discussed in Section 2.2. It is not hard to see that the optimal clipping factor is

C_i = (R/∥g_i∥) · I(⟨g_i, Σ_j g_j⟩ > 0).

If the per-sample gradients are indeed concentrated, in the sense that ⟨g_i, Σ_j g_j⟩ ≥ 0 for all i, then AUTO-V is the optimal per-sample gradient clipping. We compare with Abadi's clipping in Figure 2, where the dot-product similarity is significantly magnified by our AUTO-V clipping.

One potential drawback of AUTO-V clipping is that all gradients lose their magnitude information completely, since ∥g_i · Clip_AUTO-V(g_i; R)∥ = R for all i. This scale-invariance in AUTO-V, and partially in Abadi's clipping (when ∥g_i∥ > R), leads to the "lazy region" issue: the parameters will not be updated by DP-GD even if the true gradients are non-zero. In Figure 3, we illustrate in a logistic regression [foot_1] that AUTO-V and Abadi's clipping have zero clipped gradient for the trainable parameter θ ∈ [−2, 2], as the per-sample gradients from the two classes cancel each other.

To preserve the magnitude information and thus escape the lazy region, we propose the AUTO-S clipping, with a positive stability constant γ:

Clip_AUTO-S(g_i; R) := R/(∥g_i∥ + γ).   (3.3)

We visualize in Figure 4 that AUTO-S allows larger per-sample gradients to have larger magnitudes after clipping, while still allowing smaller gradients to vanish after "clipping". Another benefit of γ is that the update remains stationary as g_i → 0: the clipped gradient C_i g_i → g_i/γ becomes small, rather than retaining magnitude R as in AUTO-V; we elaborate on this point in Section 4.3. This behavior is critical in our convergence analysis and allows DP-SGD with AUTO-S (but not DP-SGD with AUTO-V) to converge to zero gradient norm in Section 5.
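A minimal sketch of the two clipping rules (3.1) and (3.3), illustrating that AUTO-V erases magnitude information while AUTO-S preserves the ordering of magnitudes; the function names are ours.

```python
import math

def norm(g):
    return math.sqrt(sum(x * x for x in g))

def clip_auto_v(g, R):
    # AUTO-V (3.1): R / ||g|| -- pure normalization, magnitude information lost
    return R / norm(g)

def clip_auto_s(g, R, gamma=0.01):
    # AUTO-S (3.3): R / (||g|| + gamma) -- stability constant preserves relative magnitudes
    return R / (norm(g) + gamma)

R = 1.0
g_big, g_small = [30.0, 40.0], [3e-4, 4e-4]
v_norms = [norm([clip_auto_v(g, R) * x for x in g]) for g in (g_big, g_small)]
s_norms = [norm([clip_auto_s(g, R) * x for x in g]) for g in (g_big, g_small)]
```

Under AUTO-V both clipped gradients have norm exactly R; under AUTO-S the tiny gradient stays tiny (roughly ∥g∥/γ when ∥g∥ ≪ γ), which is the behavior that escapes the lazy region.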

4. AUTOMATIC DP TRAINING

One may wonder why our clipping (3.1)(3.3) is automatic at all, if the hyperparameter R is still present and there is an additional parameter γ to choose. It turns out that any constant choice of R > 0 is equivalent to choosing R = 1, and common deep learning optimizers are insensitive to the choice of γ (e.g. for any γ > 0, we show that the gradient norm converges to zero at the same asymptotic rate in Theorem 4; see also the ablation study in Figure 14). Consequently, we set γ = 0.01 as the default. Specifically, let us redefine the R-independent clipping function:

Clip_AUTO-S(g_i) := 1/(∥g_i∥ + γ).   (4.1)

With this clipping, we can design automatic DP optimizers similar to (1.1):

Automatic DP Optimizer({g_i}_{i=1}^B) = Optimizer( ĝ_t ),  where  ĝ_t := Σ_{i∈B_t} g_{t,i}/(∥g_{t,i}∥ + γ) + σ·N(0, I).   (4.2)

Clearly, the new private gradient ĝ_t from our automatic clipping is R-independent, in contrast to the one used in (1.1). A concrete example (in the case γ = 0) that is comparable to (3.2) is

R-independent DP-SGD (AUTO-V): w_{t+1} = w_t − η ( Σ_{i∈B_t} (∂l_i/∂w_t)/∥∂l_i/∂w_t∥ + σ·N(0, I) ).   (4.3)

Leveraging the private gradient ĝ_t in (4.2), we can train DP neural networks without tuning the DP-specific hyperparameters R and σ, as demonstrated in Algorithm 1:

4: Apply automatic clipping to per-sample gradients {g_i}_{i∈B_t}: ĝ_i = g_i/(∥g_i∥_2 + 0.01).
5: Add Gaussian noise to the sum of clipped gradients: ĝ = Σ_i ĝ_i + σ·N(0, I).
6: Update w_t by any optimizer on the private gradient ĝ with learning rate η_t.

We will elaborate two distinct reasons in the next sub-sections for the following statement:

DP Optimizer_Abadi ≈ R-dependent DP Optimizer_AUTO ≡ R-independent DP Optimizer_AUTO.

For DP-SGD, the R-dependent automatic update is

w_{t+1} = w_t − η ( Σ_{i∈B_t} g_{t,i} · R/(∥g_{t,i}∥ + γ) + σR·N(0, I) ) = w_t − ηR·ĝ_t.

We can view η_effective ≡ ηR as a whole: increasing R has the same effect as increasing η, which explains the diagonal pattern in Figure 1 (lower plot), where DP-SGD (Abadi) is applied with a small clipping threshold [foot_2]. We extend to general non-adaptive optimizers in Theorem 1 [foot_3].

Theorem 1. Non-adaptive R-dependent automatic DP optimizers (including SGD, HeavyBall (Polyak, 1964) and NAG (Nesterov, 1983)), with learning rate η and weight decay λ, are equivalent to R-independent automatic DP optimizers with learning rate ηR and weight decay λ/R.
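The clipping and noising steps of Algorithm 1, instantiated with plain SGD, can be sketched in pure Python as follows. This is a hypothetical toy implementation (names are ours); real training would use framework tensors and per-sample gradients from backpropagation.

```python
import math
import random

def automatic_dp_sgd_step(w, per_sample_grads, lr, sigma, rng, gamma=0.01):
    # One step of Algorithm 1 with SGD as the base optimizer:
    # clip each per-sample gradient by 1/(||g_i|| + gamma), sum them,
    # add N(0, sigma^2 I) noise (note: R-independent), then descend.
    d = len(w)
    g_hat = [0.0] * d
    for g in per_sample_grads:
        n = math.sqrt(sum(x * x for x in g)) + gamma
        for j in range(d):
            g_hat[j] += g[j] / n
    g_hat = [x + sigma * rng.gauss(0.0, 1.0) for x in g_hat]
    return [w[j] - lr * g_hat[j] for j in range(d)]

rng = random.Random(0)
w_next = automatic_dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.3, 0.4]],
                               lr=0.1, sigma=0.0, rng=rng)
```

Note there is no R anywhere in the update: the only DP-specific quantity left is σ, which the accountant determines from (ϵ, δ).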

4.2. ADAPTIVE OPTIMIZER CAN BE INSENSITIVE TO CLIPPING THRESHOLD

Adaptive automatic DP optimizers differ from the non-adaptive ones, in that the clipping threshold cancels out instead of coupling with the learning rate. To see this, we scrutinize DP-Adam (Abadi) (which behaves similarly to DP-Adam (AUTO-V)) in Figure 1 (upper plot), where the columns to the left are almost identical. Further evidence is observed in (Mehta et al., 2022, Table 5): shrinking R has zero effect on LAMB. We now give a simple explanation using AdaGrad (Duchi et al., 2011):

w_{t+1} = w_t − η · g_t/√(G_t),

where g_t = Σ_i g_{t,i} is the gradient sum and G_t = Σ_{τ≤t} g_τ ⊙ g_τ is the sum of gradient squares (by Hadamard product) over past iterations. In R-dependent DP-AdaGrad (AUTO-V), the private gradient R·ĝ_t takes the place of the standard gradient sum g_t, and Ĝ_t = R² Σ_{τ≤t} ĝ_τ ⊙ ĝ_τ, so

w_{t+1} = w_t − η · R·ĝ_t/√(Ĝ_t) = w_t − η · ĝ_t/√(Σ_{τ≤t} ĝ_τ ⊙ ĝ_τ).

We generalize to general adaptive optimizers in Theorem 2.

Theorem 2. Adaptive R-dependent automatic DP optimizers (e.g. AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), AdaMax/Adam (Kingma & Ba, 2014), NAdam (Dozat, 2016), RAdam (Liu et al., 2019a), LARS (You et al., 2017), LAMB (You et al., 2020)), with learning rate η and weight decay λ, are equivalent to R-independent automatic DP optimizers with learning rate η and weight decay λ/R. With decoupled weight decay (Loshchilov & Hutter, 2018), R-dependent automatic DP-AdamW is equivalent to R-independent automatic DP-AdamW with the same η and λ.

In Appendix B.3, we also analyze the automatic DP optimizers with per-layer clipping style. In Theorem 3 (proved in Appendix A), we show that the new private gradient ĝ_t in (4.2) has the same level of privacy guarantee as the existing one in (1.1), since the global sensitivity remains the same (see Figure 4).
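The cancellation above is easy to check numerically. Below is a toy 1-D, noiseless DP-AdaGrad with AUTO-S clipping (names are ours): the trajectory is identical for any R, since R scales both the private gradient and the preconditioner √(Ĝ_t). With noise, the same cancellation holds provided the same noise draws are used.

```python
import math

def clip_auto_s(g_norm, R, gamma=0.01):
    # AUTO-S clipping factor: R / (||g|| + gamma)
    return R / (g_norm + gamma)

def dp_adagrad_auto(per_step_grads, lr, R):
    # Toy 1-D, noiseless R-dependent DP-AdaGrad with AUTO-S clipping:
    # the private gradient is proportional to R, and AdaGrad divides by
    # sqrt(sum of its squares), so R cancels from the trajectory.
    w, G = 0.0, 0.0
    for grads in per_step_grads:              # per-sample scalar gradients
        private = sum(clip_auto_s(abs(g), R) * g for g in grads)
        G += private ** 2
        w -= lr * private / math.sqrt(G)
    return w

trajectory = [[0.9, 1.1], [0.5], [2.0, -0.3]]
w_R5 = dp_adagrad_auto(trajectory, lr=0.1, R=5.0)
w_R1 = dp_adagrad_auto(trajectory, lr=0.1, R=1.0)
```

This contrasts with the non-adaptive case of Theorem 1, where R does not cancel but instead folds into an effective learning rate ηR.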
We note that as long as γ > 0, the magnitude information of per-sample gradients is preserved by AUTO-S, in the sense that ∥g i ∥ > ∥g j ∥ ⇐⇒ ∥C i g i ∥ > ∥C j g j ∥, whereas this can be violated in both the AUTO-V and Abadi's clipping (as depicted by the flat curve in Figure 4 when ∥g i ∥ > 1).

4.3. AUTOMATIC CLIPPING IS EQUALLY PRIVATE AND MAXIMIZES UTILITY

Additionally, note that when γ is small, almost all data points "max out" the signal relative to the amount of noise we add. To put it differently, for the same amount of noise, AUTO-S with small γ allows more signal to be pushed through the differentially private channel. Towards the end of training, i.e. in the limit where ∥g_i∥ → 0 for all i, we have

Σ_i g_i/(∥g_i∥ + γ) → (1/γ) Σ_i g_i.

In words, the clipped gradient becomes close to (a rescaling of) the standard SGD gradient, and thus does not suffer from the instability of AUTO-V.

Theorem 3. Under noise multiplier σ, number of iterations T, and subsampling probability B/n, DP optimizers using AUTO-V or AUTO-S clipping satisfy (ϵ_Accountant(δ, σ, B/n, T), δ)-DP, where ϵ_Accountant is any valid privacy accountant for DP-SGD under Abadi's clipping.
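A quick numerical check of the sensitivity argument behind Theorem 3: every AUTO-S per-sample contribution has norm strictly below 1, matching the global sensitivity of Abadi's clipping with R = 1.

```python
import math
import random

def norm(g):
    return math.sqrt(sum(x * x for x in g))

# The per-sample contribution to the sum in (4.2) is g_i/(||g_i|| + gamma), whose
# norm is at most 1. Adding or removing one sample therefore changes the sum by a
# vector of norm at most 1 -- the same global sensitivity as Abadi's clipping at R = 1.
rng = random.Random(1)
gamma = 0.01
max_contribution = 0.0
for _ in range(1000):
    g = [rng.uniform(-10.0, 10.0) for _ in range(5)]
    clipped = [x / (norm(g) + gamma) for x in g]
    max_contribution = max(max_contribution, norm(clipped))
```

Since the sensitivity is unchanged, the same (σ, p, T) that certifies DP-SGD with Abadi's clipping certifies the automatic variants, which is exactly Theorem 3.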

5.1. CONVERGENCE THEORY OF DP-SGD TO STATIONARY POINTS

We highlight that automatic clipping can be more amenable to analysis than Abadi's clipping in Chen et al. (2020b), since we no longer need to decide whether each per-sample gradient is clipped. To analyze the convergence of automatic DP-SGD (4.2) in the non-convex setting, we follow the standard assumptions in the SGD literature (Ghadimi & Lan, 2013; Allen-Zhu, 2018; Bottou et al., 2018), including a symmetry assumption on the gradient noise, which is empirically verified in (Chen et al., 2020b, Figure 3) [foot_4].

Assumption 5.1 (Lower bound of loss). For all w and some constant L_*, we have L(w) ≥ L_*.

Assumption 5.2 (Smoothness). Let g(w) denote the gradient of the objective L(w). Then for all w, v, there is a non-negative constant L such that

L(v) − L(w) − g(w)^⊤(v − w) ≤ (L/2)·∥v − w∥².   (5.1)

Assumption 5.3 (Gradient noise). The per-sample gradient noise g_{t,i} − g_t is i.i.d. from some distribution such that

E(g_{t,i} − g_t) = 0,  E∥g_{t,i} − g_t∥² ≤ ξ²,

and g_{t,i} is centrally symmetric about g_t in distribution: g_{t,i} − g_t =_D g_t − g_{t,i}.

We show in Theorem 4 that DP-SGD with AUTO-S clipping allows the true gradient norm to converge to zero, even though the clipped gradient may still be biased; this does not hold with AUTO-V clipping. We leave the proof to Appendix C.1.

Theorem 4. Under Assumptions 5.1, 5.2 and 5.3, running DP-SGD with automatic clipping for T iterations and setting the learning rate η ∝ 1/√T give [foot_5]

min_{0≤t≤T} E(∥g_t∥) ≤ G( (4/√T)·√((L_0 − L_*)·L·(1 + σ²d/B²)) ; ξ, γ ),   (5.2)

where G(· ; ξ, γ) := min_{r>0} ( ξ/r + F(· ; r, ξ, γ) ). Here "·" represents the first argument of G, and G is increasing and positive. As T → ∞, we have min_t E(∥g_t∥) = O(T^{−1/4}) for AUTO-S, the same rate as the standard SGD given in Theorem 9.

Remark 5.4. We show in Theorem 6 and in Figure 5 that the upper bound (5.2) has G ≥ ξ for AUTO-V (γ = 0), and G reduces to zero only for AUTO-S (γ > 0). We provide real-data evidence in Figure 13 that a strictly positive γ reduces the gradient norm significantly.

5.2. ANALYSIS OF FACTORS AFFECTING THE CONVERGENCE

We now analyze the various factors that affect the convergence in Theorem 4, from a unified viewpoint of both convergence and privacy. We start with the stability constant γ and the learning rate η_t, both of which affect only the convergence, not the privacy. We empirically observe in Figure 7 that small γ benefits convergence at the initial iterations (when the privacy guarantee is strong) but larger γ converges faster asymptotically. For η_t, the optimum is in fact the minimizer of the hyperbola in (C.4), which is unique and tunable. Next, we focus on the hyperparameters that affect both convergence and privacy: the batch size B, the noise multiplier σ, and the number of iterations T. These hyperparameters have to be considered along the privacy-accuracy tradeoff, not just from a convergence perspective. Recall that given a fixed privacy budget (ϵ, δ), we rely on a modern privacy accountant for computing the appropriate combinations of σ, T, B. The exact expression of the bound as a function of (ϵ, δ) is somewhat messy. For this reason, we state our analysis in terms of the surrogate parameter µ for µ-GDP (Dong et al., 2022). Bu et al. (2020) showed that DP-SGD's privacy guarantee asymptotically converges to µ-GDP (as T → ∞) with

µ = (B/n)·√(T·(e^{1/σ²} − 1)).

µ-GDP implies (ϵ, δ)-DP with ϵ = µ²/2 + µ·√(2 log(1/δ)) [foot_6]. We can alternatively leverage ρ-tCDP (Bun et al., 2018) for similar conclusions, using ρ in place of µ² in (5.3).

Theorem 5. Under Assumptions 5.1, 5.2 and 5.3, fixing the asymptotic µ(ϵ, δ)-GDP parameter, running DP-SGD with automatic clipping for T iterations and setting the learning rate η ∝ 1/√T give

min_{0≤t≤T} E(∥g_t∥) ≤ G( 4·√((L_0 − L_*)·L·(1/T + d/(µ²n²) + O(1/(B²T)))) ; ξ, γ ).   (5.3)

To show that our analysis matches the training behaviors observed in SOTA empirical work (Li et al., 2021; Kurakin et al., 2022; De et al., 2022; Tramer & Boneh, 2020; Mehta et al., 2022; Yu et al., 2021), we minimize the first argument of G in (5.3), denoted by X(B, T, µ, d, L, L_0).
1. [Train longer with larger noise] Fixing the expected batch size B, we see that X is decreasing in T. Hence larger T, and consequently larger σ, are preferred.
2. [Larger batch size helps] Fixing the number of iterations T or the number of epochs E = BT/n, we see that X is decreasing in B. Hence larger B, and consequently larger σ, are preferred.
3. [Pretraining is critical] Pretraining can boost the DP accuracy through a much smaller initial loss L_0 and through a smooth (small L) and flat (small ξ, c.f. Figure 7 (left)) initialization.
4. [Learning rate needs tuning] The optimal learning rate, obtained by minimizing (C.4), is √((L_0 − L_*)·µ²n² / (L·(µ²n² + dT))). This indicates that one should use a larger learning rate for a smaller model (smaller d), weaker privacy (larger µ, i.e. larger ϵ), or a smaller iteration budget T. Interestingly, the optimal choice of learning rate is independent of the (expected) batch size B.
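The asymptotic µ-GDP parameter from Bu et al. (2020) is easy to compute; here is a small sketch assuming the formula µ = (B/n)·√(T·(e^{1/σ²} − 1)) as stated above (the function name is ours).

```python
import math

def gdp_mu(B, n, T, sigma):
    # asymptotic mu-GDP parameter of DP-SGD (Bu et al., 2020):
    # mu = (B/n) * sqrt(T * (exp(1/sigma^2) - 1))
    return (B / n) * math.sqrt(T * (math.exp(1.0 / sigma**2) - 1.0))

# e.g. batch 1000 out of n = 50,000 samples, 10,000 iterations, sigma = 1
mu = gdp_mu(B=1000, n=50_000, T=10_000, sigma=1.0)
mu_noisier = gdp_mu(B=1000, n=50_000, T=10_000, sigma=2.0)
```

This makes the trade-offs in items 1-2 concrete: increasing T or B weakens privacy (larger µ) unless σ grows along with them.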

6. EXPERIMENTS

We evaluate our automatic DP training on image classification, sentence classification, and table-to-text generation tasks. Detailed settings including hyperparameters can be found in Appendix G.

6.1. IMAGE CLASSIFICATION

For MNIST/FashionMNIST, we use the same setup as in (Papernot et al., 2021; Tramer & Boneh, 2020; Shamsabadi & Papernot, 2021) with a simple CNN. For CIFAR10, we use the same setup as in Tramer & Boneh (2020) with a pretrained SimCLRv2 (Chen et al., 2020a). For ImageNette, a 10-class sub-task of ImageNet (Deng et al., 2009), we use the same setup as in Klause et al. (2022) without the learning rate decay. For CelebA (Liu et al., 2015), the real human face dataset, we train ResNet9 (He et al., 2016) with group normalization in place of batch normalization. Notice that CelebA contains high-resolution (178×218) images, each with 40 labels. We consider CelebA either for multi-class classification on one label, e.g. 'Smiling' and 'Male', or as a multi-label/multi-task problem learning all labels simultaneously.

Table 1: Average test accuracy and 95% confidence interval on image tasks over 5 runs.

In Table 1, we observe that AUTO-S clipping outperforms existing clipping on all datasets with statistical significance. Interestingly, the standard deviation across runs is smaller for automatic DP optimizers, indicating better reproducibility and stability. We additionally experiment with 40 binary classification problems on CelebA, one per label, and observe that the mean accuracy further improves to 91.63% at ϵ = 8 for AUTO-S (see Appendix J).

6.2. SENTENCE CLASSIFICATION

On five benchmark language datasets (MNLI(m/mm) (Williams et al., 2018), QQP (Iyer et al., 2017), QNLI (Rajpurkar et al., 2016), SST2 (Socher et al., 2013)), we compare our automatic DP training with reparameterized gradient perturbation (RGP, (Yu et al., 2021)) and full-parameter finetuning (full, (Li et al., 2021)) using RoBERTa models (Liu et al., 2019b).

Table 3: Test accuracy on language tasks with RoBERTa-large (24 blocks, 355 million parameters).

In Table 2 and Table 3, we note that full-parameter finetuning with AUTO-S outperforms or at least matches SOTA on all tasks. We use exactly the same hyperparameters as in Li et al. (2021).

6.3. TABLE-TO-TEXT GENERATION

We compare our automatic DP training with a variety of fine-tuning methods on the table-to-text generation task with the E2E dataset (Dusek et al., 2020), where the goal is to generate texts about different aspects of a restaurant's data. We measure success on this task by BLEU and ROUGE-L (in Table 4), and by METEOR, NIST and CIDEr (extended in Table 8), with higher values meaning better model quality. Competitive methods include low-rank adaption (LoRA), prefix-tuning (prefix), RGP, fine-tuning only the top 2 Transformer blocks (top2), and training from scratch (retrain), as recorded in Li et al. (2021). Again, we use exactly the same hyperparameters as in Li et al. (2021). For GPT2 (124 million parameters), GPT2-medium (355 million), and GPT2-large (774 million), Table 4 shows that AUTO-S is scalable, with stronger performance on larger models. Our automatic full-parameter finetuning has the best overall performance. Additionally, we highlight that AUTO-S and methods like LoRA are not mutually exclusive and can be combined to yield strong performance, since AUTO-S modifies the optimizer while LoRA modifies the architecture.

7. DISCUSSION

In this work, we proposed automatic clipping as a drop-in replacement for the standard per-example clipping in differentially private training. This is the first technique that eliminates the need to tune the clipping threshold R, thus making DP deep learning as easy as regular learning. Our AUTO-S method enjoys both a theoretical guarantee of convergence in non-convex problems (under various conditions) and strong empirical performance that advances the state-of-the-art (SOTA) of DP learning on both computer vision and language tasks. We are excited about the future of automatic DP training, especially in combination with other effective techniques. Notably, our automatic clipping applies compatibly with general optimizers (e.g. (Bu et al., 2021a; Du & Mi, 2021)), clipping styles (all-layer or per-layer), architecture modifications (e.g. LoRA, RGP, prefix), and data augmentation (e.g. adversarial training (Goodfellow et al., 2015) and multiple augmentations (De et al., 2022)). Thus, we expect to achieve comparable results to all SOTA in a lightweight fashion.



Footnotes (in order of appearance):
- S′ is a neighbor of S if one can obtain S′ by adding or removing one data point from S.
- The settings are in Appendix F, where the lazy region issue also emerges in the mean estimation problem. We note that the lazy region is also discussed in (Chen et al., 2020b, Example 2).
- When we further consider weight decay in automatic clipping (included in Theorem 1), increasing R is no longer equivalent to increasing η, as η also couples with the weight decay constant λ. This coupling of η and R is also partially observed in (De et al., 2022) through a reparameterization trick of Abadi's clipping. Unlike AUTO-S/V, their coupling is not strict (e.g. doubling R is not equivalent to doubling η in their Figure 8, thus necessitating tuning both (η, R)), and the relationship to weight decay was not discussed.
- This symmetry assumption is relaxed from the Gaussian noise assumption (since a zero-mean Gaussian is symmetric) in the SGD literature (Mandt et al., 2017; Smith et al., 2018; Chaudhari & Soatto, 2018; Xie et al., 2020). By setting the minibatch size to 1, we reduce the noise assumption to the per-sample gradient case.
- The upper bound takes an implicit form G(·; ξ, γ) because it is a lower envelope of the functions ξ/r + F(·; r, ξ, γ) over all possible r > 0, whose forms are detailed in Theorem 6. Notice that G results only from the clipping operation, not from the noise addition.
- More precisely, µ-GDP is equivalent to an entire family of (ϵ, δ)-DP for any ϵ > 0 and δ = Φ(−ϵ/µ + µ/2) − e^ϵ·Φ(−ϵ/µ − µ/2), where Φ is the standard Gaussian CDF.
- See https://github.com/lxuechen/private-transformers and the detailed modification in Appendix K.3.



Figure 1: Ablation study of clipping threshold and learning rate. Upper: BLEU score of GPT2 on the E2E dataset (Li et al., 2021), with DP-AdamW. Lower: Test accuracy of ResNet18 on the ImageNet dataset (Kurakin et al., 2022), with DP-SGD and momentum.

Figure 2: RoBERTa-base with DP-Adam (ϵ = 3) on the SST2 dataset, as in Section 6.2.

3.3 STABILITY CONSTANT BREAKS SCALE-INVARIANCE AND REMAINS STATIONARY

DP Optimizer_AUTO, which reduces the hyperparameter tuning of DP training to that of regular training, i.e. only over the learning rate, weight decay, etc. The significant saving in tuning effort is illustrated in Figure 15.

4.1 NON-ADAPTIVE OPTIMIZER COUPLES CLIPPING THRESHOLD WITH LEARNING RATE

With R-dependent automatic clipping, DP-SGD becomes

Figure 4: Gradient norms before and after clipping by different methods at R = 1.

Figure 5: Left: DP-SGD with AUTO-V clipping. Middle: DP-SGD with AUTO-S clipping. Right: Log-log plot of the convergence rate in comparison to standard SGD. Here ξ = 25, γ = 0.01, and the O(1/√T) term is set to 10 for DP-SGD and to 2 for standard SGD.

Algorithm 1: Automatic Deep Learning with DP. Parameters: initial weights w_0, learning rate η_t, sampling probability p, number of iterations T.

These methods use the same experimental setup. For language models, our automatic training is based on the codebase of Li et al. (2021).

Table 2: Test accuracy on language tasks with RoBERTa-base (12 blocks, 125 million parameters).

Table 4: Test performance on the E2E dataset with GPT2. Additional performance measures are included in Table 8. The best two GPT2 models for each row are marked in bold. Compared methods: AUTO-S, AUTO-V, and the baselines of (Li et al., 2021), (Hu et al., 2021), (Yu et al., 2021), (Li & Liang, 2021).

