EXPLORING THE LIMITS OF DIFFERENTIALLY PRIVATE DEEP LEARNING WITH GROUP-WISE CLIPPING

Abstract

Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of group-wise clipping. To reduce the compute time overhead of private learning, we show that per-layer clipping, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with per-device clipping that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at ϵ = 1 better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.



In our case, M is an optimization algorithm which outputs the learned model parameters θ. Widely used DP optimizers (e.g., DP-SGD) usually introduce two additional steps before each parameter update to privatize gradients: (1) clip per-example gradients of a minibatch by their Euclidean norms according to some threshold C; (2) add Gaussian noise to the sum of clipped gradients. In practice, the clipping threshold C can be either a tunable hyperparameter or set to some (privatized) statistic estimated from data. The standard deviation of the Gaussian noise is determined by the clipping threshold C and the noise multiplier σ, the latter of which is set by a privacy accounting procedure given target privacy parameters (ϵ, δ), the number of iterations T, and the subsampling rate ρ (Abadi et al., 2016; Mironov, 2017; Dong et al., 2021; Gopi et al., 2021). We now present background on different per-example gradient clipping strategies in DP optimization.

Flat Clipping. This is the clipping scheme used in the original DP-SGD algorithm (Abadi et al., 2016). Here, the gradient of example $s_i$'s loss $\ell(\theta, s_i)$ with respect to model parameters, $g^{(i)} := \partial \ell(\theta, s_i) / \partial \theta$, is normalized if its magnitude exceeds the threshold C. Thus, the actual contribution (up to scaling) of the ith instance to the noisy gradient is $\tilde{g}^{(i)} := g^{(i)} \cdot \min\{1, C / \|g^{(i)}\|\}$. Flat clipping cannot be performed until the gradient norms $\{\|g^{(i)}\|\}_i$ are computed. Since the latter quantities are only known after backpropagation completes, flat clipping necessitates a second round of computation after backpropagation to conditionally rescale the gradients. This is a source of overhead and presents complications when model weights don't fit on a single device.

Group-Wise Clipping. This scheme partitions the set of parameters $\theta \in \mathbb{R}^d$ into K disjoint groups $\{\theta_k\}_{k=1}^K$ with $\theta_k \in \mathbb{R}^{d_k}$ for $k \in [K]$. For each group k, the scheme prescribes a clipping threshold $C_k$. Denote example $s_i$'s gradient for the kth group by $g_k^{(i)} := \partial \ell(\theta, s_i) / \partial \theta_k$. Under group-wise clipping, the kth clipped gradient for $s_i$ is $\tilde{g}_k^{(i)} := g_k^{(i)} \cdot \min\{1, C_k / \|g_k^{(i)}\|\}$. Next, we present two instantiations of group-wise clipping that are computationally advantageous in different settings.
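The two clipping rules defined above can be contrasted in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: it assumes per-example gradients have already been materialized as rows of a dense array, and the function names `flat_clip` and `group_wise_clip` as well as the small constant guarding against division by zero are illustrative choices.

```python
import numpy as np


def flat_clip(per_example_grads, C):
    """Rescale each row (one example's gradient) to Euclidean norm at most C."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # min{1, C / ||g||}; the tiny floor only avoids dividing by zero.
    scale = np.minimum(1.0, C / np.maximum(norms, 1e-12))
    return per_example_grads * scale


def group_wise_clip(per_example_grads, group_sizes, thresholds):
    """Clip each parameter group of each example against its own threshold C_k."""
    assert sum(group_sizes) == per_example_grads.shape[1]
    clipped = np.empty_like(per_example_grads)
    start = 0
    for d_k, C_k in zip(group_sizes, thresholds):
        block = per_example_grads[:, start:start + d_k]
        clipped[:, start:start + d_k] = flat_clip(block, C_k)
        start += d_k
    return clipped


# Two examples, three parameters split into groups of sizes 2 and 1.
grads = np.array([[3.0, 4.0, 0.0], [0.3, 0.4, 12.0]])
flat = flat_clip(grads, C=1.0)
grouped = group_wise_clip(grads, group_sizes=[2, 1], thresholds=[1.0, 2.0])
```

After group-wise clipping, each example's full gradient has norm at most $(\sum_k C_k^2)^{1/2}$, which is the quantity that plays the role of the sensitivity when noise is added with unit scaling coefficients.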

3. EFFICIENT PRIVATE LEARNING WITH ADAPTIVE PER-LAYER CLIPPING

The first instantiation of group-wise clipping we study is per-layer clipping, which clips the gradients of separate neural network layers separately. This scheme had been presented in past works (McMahan et al., 2018a; b; Dupuy et al., 2022), but neither its computational properties nor its performance implications had been carefully studied. We show that with proper implementation, per-layer clipping can be as memory-efficient and almost as time-efficient as non-private training for small- to moderate-scale workflows that run on single accelerators. Additionally, we confirm that per-layer clipping with hand-set thresholds underperforms flat clipping and demonstrate that adaptively setting these thresholds eliminates potential performance losses.

3.1. PER-LAYER CLIPPING DP-SGD CAN BE ALMOST AS EFFICIENT AS NON-PRIVATE SGD

Per-layer clipping groups together parameters of a neural network layer (e.g., linear, convolution) and prescribes each of the K layers of the network a clipping threshold $C_k$ to clip the gradient of that layer. This directly implies that gradient clipping for any layer can be performed as soon as backpropagation reaches that layer (to construct per-example gradients or norms) when parameter sharing is absent, and is unlike flat clipping, which cannot be performed until backpropagation completes entirely.

Our efficient implementation of per-layer clipping clips layer-wise gradients as soon as the gradient with respect to the outputs of that layer is returned from backpropagation. The operations of clipping and summing per-example gradients can be fused once input activations, output gradients, and per-example gradient norms are known. In addition, per-example gradient norms can be cheaply computed without materializing actual per-example gradients in memory (Li et al., 2022b, Section 4). This implementation results in private training that is as memory-efficient as non-private training since per-example gradients are not instantiated, and almost as time-efficient per update since the extra computation involving gradient norms and gradient scaling is typically cheap. Figure 1 shows that carefully implemented per-layer clipping matches the memory profile and almost matches the training throughput of non-private learning for an autoregressive fine-tuning task with GPT-2 on a single GPU (we followed the same experimental protocol as that of Section 4 in (Li et al., 2022b) for a fair comparison). See Appendix G for experiments with head-to-head wall time comparisons.

Figure 1: Private learning with (adaptive) per-layer clipping can be almost as efficient as non-private learning (the throughput gap is less than 15% in this case). Here, "(usual) flat clipping" refers to the implementation which first creates and stores in memory all per-example gradients (e.g., adopted in Opacus (Yousefpour et al., 2021)). Ghost clipping (Li et al., 2022b) is based on flat clipping but avoids materializing per-example gradients by performing an additional backward pass each time.
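To make the fused clip-and-sum step concrete, the sketch below covers the simplest case only: a bias-free linear layer applied to non-sequence inputs, where the per-example weight gradient is the outer product of the output gradient and the input activation. Under that assumption the per-example norm factorizes into a product of two vector norms, so no per-example gradient is ever stored. The function names are hypothetical, and the sequence-input case treated by Li et al. (2022b) needs a different norm computation that is not shown here.

```python
import numpy as np


def per_example_norms_linear(activations, output_grads):
    """Per-example weight-gradient norms for y = W a, without materializing them.

    activations:  (B, d_in) layer inputs a_i
    output_grads: (B, d_out) gradients of the loss w.r.t. layer outputs
    The per-example gradient is outer(output_grads[i], activations[i]), whose
    Frobenius norm equals ||output_grads[i]|| * ||activations[i]||.
    """
    return (np.linalg.norm(activations, axis=1)
            * np.linalg.norm(output_grads, axis=1))


def fused_clip_and_sum(activations, output_grads, C):
    """Return the sum over examples of clipped per-example weight gradients."""
    norms = per_example_norms_linear(activations, output_grads)
    scale = np.minimum(1.0, C / np.maximum(norms, 1e-12))
    # Scaling the output gradients per example and then doing one matrix
    # product yields sum_i scale_i * outer(g_i, a_i) with shape (d_out, d_in).
    return (output_grads * scale[:, None]).T @ activations


rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # batch of 5 activations
G = rng.normal(size=(5, 4))   # matching output gradients
clipped_sum = fused_clip_and_sum(A, G, C=0.5)
```

The memory footprint is that of one ordinary weight gradient plus a length-B vector of norms, which is the property the text attributes to the efficient implementation.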

3.2. FIXED PER-LAYER CLIPPING MAY HURT THE UTILITY

Despite the computational advantages, per-layer clipping with fixed clipping thresholds set by hand (which we refer to as fixed per-layer clipping) reportedly underperforms flat clipping (McMahan et al., 2018b) . To further verify this and remove confounding effects of potentially suboptimal hyperparameters, we compare fixed per-layer clipping against (fixed) flat clipping on two tasks: 1) training wide ResNet (WRN16-4) (Zagoruyko & Komodakis, 2016) from scratch to classify CIFAR-10 images, and 2) fine-tuning the pretrained RoBERTa-base for classifying sentiment on SST-2. We carefully tuned the clipping thresholds and learning rate for both clipping methods; see Appendix A for details. Tables 1a and 1b confirm that per-layer clipping with hand-set fixed thresholds underperforms flat clipping. To understand why fixed per-layer clipping gives worse performance, we plot the per-layer gradient norms of randomly sampled CIFAR-10 examples for privately training WRN16-4 in Figure 2 (see Appendix B for the setup). We observe that the general magnitudes of per-layer gradient norms change dramatically across training. Early on, gradient norms are generally uniformly low across all layers. As training proceeds, gradient norms for layers close to the input gradually become high. We present additional evidence for this phenomenon with language model fine-tuning in Appendix B. These observations suggest that clipping with fixed layer-wise thresholds likely removes the structural relation between gradients of different layers. This incurs an extra source of bias in addition to the usual bias of flat clipping that alters the relation of gradients across samples, and makes balancing clipping bias and privacy noise throughout training more challenging. These observations motivate us to set the thresholds based on some adaptively estimated statistic of layer-wise gradients.

3.3. ADAPTIVE PER-LAYER CLIPPING CAN BE AS EFFECTIVE AS FLAT CLIPPING

To overcome the performance issues of fixed per-layer clipping, we consider per-layer clipping with adaptive clipping thresholds that we herein refer to as adaptive per-layer clipping. Our hope is that the adaptive thresholds can track gradient norm shift, capture gradient structure, and consequently mitigate the structural bias caused by clipping gradients of separate layers separately. One candidate statistic for setting the adaptive threshold is some quantile of gradient norms. Notably, Andrew et al. (2019) provided an effective way of estimating quantiles privately for flat clipping via online convex optimization. We adapt their algorithm to the per-layer setup, and let each layer maintain an online estimate of a target gradient norm quantile. Two questions arise with this formulation: 1) How should the per-layer gradient norm quantiles be privately estimated, and 2) how should the noise levels for different layers be decided? We address the two questions below. Algorithm 1 delineates the overall procedure, where per-layer gradient clipping in conjunction with backpropagation occurs on lines 7-12, adaptive private quantile estimation occurs on lines 15-18, and noise allocation occurs on line 13. The pseudocode is based on DP-SGD, but its core ideas naturally apply to private versions of other first-order optimizers (e.g., DP-Adam). Estimating Quantiles Privately. We allocate some privacy budget (in practice r =1% to 10% of total budget) to estimate a target quantile of each layer's gradient norms. Clipping thresholds C 1 , ..., C K are then set to these estimated quantiles. We record the number of gradients clipped before each parameter update and adjust the clipping threshold based on whether too many or too few are clipped. The central quantity which needs to be privatized is then the fraction of clipped gradients. We introduce the additional noise multiplier σ b to privatize this fraction statistic (used in Gaussian mechanism). 
The new noise multiplier $\sigma_{\text{new}}$ (based on a $1 - r$ fraction of the total budget) for noising parameter updates is computed with the following proposition, whose proof we defer to Appendix D.

Proposition 3.1. Let σ be the original noise multiplier for noising parameter updates to achieve a certain level of differential privacy (without private quantile estimation) and $\sigma_b$ be chosen for noising quantile estimates (release of the latter consumes an r fraction of the privacy budget). Then the new noise multiplier $\sigma_{\text{new}}$ for noising parameter updates (consuming a $1 - r$ fraction of the budget) is

$\sigma_{\text{new}} = \left(\sigma^{-2} - K / (2\sigma_b)^2\right)^{-1/2}$. (3.1)

Remark 3.1. The private quantile estimation for K groups costs a fraction $r = K\sigma^2 / (4\sigma_b^2)$ of the privacy budget (in terms of Rényi differential privacy (Mironov, 2017)).

Allocating Noise. The original Gaussian mechanism adds isotropic noise to statistics before their release (Dwork et al., 2014), which results in different coordinates experiencing the same amount of noise. Yet, simply scaling different components with public quantities (before adding noise) allows different components to experience different levels of noise. As an example, let $\gamma_1, \ldots, \gamma_K$ be coefficients for scaling, and recall that $\tilde{g}_k$ is the sum of clipped gradients for layer / group k. Then, applying the Gaussian mechanism to the scaled $\hat{g} := (\hat{g}_1, \ldots, \hat{g}_K)$, where $\hat{g}_k := \tilde{g}_k / \gamma_k$, and rescaling back the privatized quantities afterwards ends up adding noise to $\tilde{g}_k$ that has standard deviation proportional to $\gamma_k$. Among the possible ways of choosing $\{\gamma_1, \ldots, \gamma_K\}$, we outline two simple approaches that we found to be effective for different reasons in our empirical studies. We use the global strategy in all but the experiments with GPT-3. Appendix E includes empirical studies of alternate strategies.

• Global strategy: $\gamma_k = 1$ for $k \in [K]$. This strategy adds the same amount of noise to every component. The total noise has squared $\ell_2$ norm $V_G \propto \left(\sum_k C_k^2\right) \cdot \left(\sum_k d_k\right)$.
• Equal budget strategy: $\gamma_k = C_k$ for $k \in [K]$. Each group has the same amount of privacy budget. The total noise has squared $\ell_2$ norm $V_E \propto K \sum_{k=1}^K d_k C_k^2$.

Algorithm 1 DP-SGD with adaptive per-layer clipping
1: INPUT: Private dataset $D = \{s_i\}_{i=1}^N$; initial iterate $\theta_0$; number of iterations T; learning rate $\eta_t$; learning rate for quantile estimation η; privacy parameters ϵ, δ; per-layer parameters $\{\theta_1, \ldots, \theta_K\}$; initial clipping thresholds $\{C_1, \ldots, C_K\}$; weighting factors $\{\gamma_1, \ldots, \gamma_K\}$; target quantile q; sampling rate $\rho = B/N$.
2: $\sigma \leftarrow$ PrivacyAccountant(ϵ, δ, ρ, T)
3: Choose $\sigma_b$ as the noise multiplier for private quantile estimation
4: Compute the new noise multiplier $\sigma_{\text{new}}$ for gradient privatization with equation 3.1
5: for t = 0 to T - 1 do
6:     Sample a minibatch $S_t$ with sampling rate ρ and perform the forward pass
7:     for k = K to 1 do
8:         Compute per-sample gradient norms $\{\|g_k^{(i)}\|\}_{i \in S_t}$ given activations and output gradients
9:         Compute $\tilde{g}_k \leftarrow \sum_{i \in S_t} \tilde{g}_k^{(i)} = \sum_{i \in S_t} g_k^{(i)} \cdot \min\{1, C_k / \|g_k^{(i)}\|\}$ with fused operation
10:         Record $b_k \leftarrow \sum_{i \in S_t} \mathbb{1}[\|g_k^{(i)}\| \le C_k]$ for quantile estimation
11:         Perform usual backpropagation to obtain input gradients if k > 1
12:     end for
13:     Draw $z \leftarrow (z_1, \ldots, z_K)$, where $z_k \sim \mathcal{N}(0, \sigma_{\text{new}}^2 S^2 \gamma_k^2 I_{d_k})$ and $S = \left(\sum_{k=1}^K C_k^2 / \gamma_k^2\right)^{1/2}$
14:     $\theta_{t+1} \leftarrow \theta_t - \eta_t (\tilde{g} + z) / B$
15:     for k = 1 to K do
16:         Draw $z_k \sim \mathcal{N}(0, \sigma_b^2)$, set $\tilde{b}_k \leftarrow (b_k + z_k)/B$, $C_k \leftarrow C_k \cdot \exp(-\eta(\tilde{b}_k - q))$
17:     end for
18: end for
19: return $\theta_T$ (or some average of all iterates)

With the tools of quantile estimation and noise allocation, we show that adaptive per-layer clipping matches the performance of flat clipping. Figure 3 compares adaptive per-layer clipping against fixed per-layer clipping and flat clipping with and without noise for training WRN16-4 on CIFAR-10 (details in Appendix A), and shows that the performance of adaptive per-layer clipping matches that of flat clipping, while fixed per-layer clipping suffers large performance drops. Section 5 includes additional results to validate this point.
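The bookkeeping around Algorithm 1 (the noise multiplier from equation 3.1, the per-layer noise scales, and the geometric threshold update) can be sketched as follows. This is an illustrative NumPy sketch under the notation of this section, not the released implementation; the function names and the error raised when quantile estimation would consume the whole budget are assumptions made for the example.

```python
import math

import numpy as np


def new_noise_multiplier(sigma, sigma_b, K):
    """Equation 3.1: sigma_new = (sigma^-2 - K / (2 sigma_b)^2)^(-1/2)."""
    inv_sq = sigma ** -2 - K / (2.0 * sigma_b) ** 2
    if inv_sq <= 0:
        raise ValueError("sigma_b is too small: quantile estimation "
                         "would consume the entire privacy budget")
    return inv_sq ** -0.5


def noise_stds(thresholds, gammas, sigma_new):
    """Per-layer noise standard deviations sigma_new * S * gamma_k."""
    C = np.asarray(thresholds, dtype=float)
    gamma = np.asarray(gammas, dtype=float)
    S = math.sqrt(np.sum((C / gamma) ** 2))
    return sigma_new * S * gamma


def update_thresholds(thresholds, unclipped_counts, batch_size,
                      q, eta, sigma_b, rng):
    """Geometric update C_k <- C_k * exp(-eta * (noisy unclipped fraction - q))."""
    C = np.asarray(thresholds, dtype=float)
    b = np.asarray(unclipped_counts, dtype=float)
    noisy_fraction = (b + rng.normal(0.0, sigma_b, size=b.shape)) / batch_size
    return C * np.exp(-eta * (noisy_fraction - q))


sigma_new = new_noise_multiplier(sigma=1.0, sigma_b=10.0, K=4)
global_stds = noise_stds([1.0, 2.0], gammas=[1.0, 1.0], sigma_new=1.0)
equal_stds = noise_stds([1.0, 2.0], gammas=[1.0, 2.0], sigma_new=1.0)
```

With `sigma=1.0`, `sigma_b=10.0`, and `K=4`, the fraction in Remark 3.1 is r = 4 / 400 = 0.01, so `sigma_new` is only slightly larger than `sigma`. The update moves a threshold down when more than a q fraction of gradients fall below it and up otherwise.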

4. EFFICIENT PRIVATE PIPELINE PARALLELISM WITH PER-DEVICE CLIPPING

Past works have shown that DP fine-tuning yields improved privacy-utility trade-offs with the use of larger / better pretrained models. We study whether this trend continues to hold as one leverages larger pretrained models by scaling DP training to work with one of the largest pretrained language models to date, the 175 billion-parameter GPT-3. The sheer size of this model presents challenges in computational efficiency, since model weights cannot fit on a single device (e.g., GPU) and existing approaches for distributing computation do not tend to play well with flat clipping. We base our distributed DP training strategy on the popular pipeline parallelism used in non-private training (Huang et al., 2019; Rasley et al., 2020). We summarize the idea of pipeline parallelism here and defer to the cited works for the specifics. Pipeline parallelism first partitions the model into chunks of consecutive layers / blocks and distributes each onto a single accelerator. Forward computation with a microbatch (created through splitting a minibatch) then chains together local computations with each model piece (hosted on each accelerator) by communicating activations across accelerators. Backward computation (backpropagation) roughly reverses the above process, but on each accelerator, intermediate forward activations of the model piece are recomputed to reduce peak memory (Huang et al., 2019, Section 2.3). Most importantly, pipeline parallelism simultaneously performs computation with different microbatches on different accelerators to reduce the overall idle time. Devices synchronize after all microbatches finish their forward and backward computation and before the optimizer invokes the parameter update. Flat clipping necessitates computing per-example gradient norms to correctly rescale gradients. This calls for the communication of per-example norms of local gradients on each device and leads to an inherent overhead in pipeline parallelism.
We outline two potential approaches for accomplishing communication, both of which unfortunately lead to non-trivial slowdowns as well as complications in implementation. The first approach synchronizes all devices after the full backward pass finishes for each microbatch (within a minibatch) so that each device will retain the same gradient norms for computing the scaling factor in clipping. This approach incurs as many extra synchronization steps as the number of microbatches per minibatch and reduces training efficiency when the number of microbatches is large. While executing primitives like all-gather with local gradient norms is not costly per se, the disruption these calls bring to the pipeline schedule is. Concretely, devices need to perform one of the following: (i) retain the unclipped local per-example gradients for a microbatch-and become idle due to pausing its processing of subsequent microbatches to avoid memory errors-until synchronization for the microbatch is called; (ii) offload the unclipped local per-example gradients to CPU only to transport them back on synchronization; (iii) rematerialize the microbatch's local gradient on synchronization. (i) is costly since it forces devices to be idle, (ii) is costly due to slow CPU-GPU data transfer, and (iii) is costly due to performing the extra round of backpropagation. To reduce the frequency of synchronization, a second approach may instead ask devices to only synchronize after the last microbatch has been processed. This approach, however, does not bypass the complications in the subsequent gradient rescaling step which requires local per-example gradients either be offloaded to CPU and moved back later or rematerialized on synchronization. As the first attempt at experimenting with DP fine-tuning on huge models, we instead turn to an alternative per-device clipping scheme, where each device is prescribed a clipping threshold for clipping per-example gradients of the hosted model piece. 
Leveraging the equal budget strategy, the noise level added to gradients on each device is agnostic of the clipping thresholds of other devices (thus, no extra communication incurred). We present the full pseudocode of the algorithm in Appendix C. Notably, per-device clipping with DP LoRA fine-tuning allowed us to obtain improved results for a challenging summarization task (see Section 5.3).
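The communication-free property follows from the noise formula in Section 3.3: with $\gamma_k = C_k$ the sensitivity term reduces to $S = \sqrt{K}$, so the noise standard deviation on device k is $\sigma_{\text{new}} \sqrt{K} \, C_k$ and involves no other device's threshold. The snippet below is a minimal sketch of that local computation; the function names are hypothetical and it does not model the pipeline schedule itself.

```python
import numpy as np


def equal_budget_noise_std(C_k, num_devices, sigma_new):
    """Noise std on one device under the equal budget strategy.

    With gamma_k = C_k, S = sqrt(sum_k C_k^2 / gamma_k^2) = sqrt(K), so the
    std sigma_new * S * gamma_k depends only on the local threshold C_k.
    """
    return sigma_new * np.sqrt(num_devices) * C_k


def privatize_local_gradient(clipped_grad_sum, C_k, num_devices,
                             sigma_new, rng):
    """Add Gaussian noise to the locally clipped and summed gradient."""
    std = equal_budget_noise_std(C_k, num_devices, sigma_new)
    noise = rng.normal(0.0, std, size=clipped_grad_sum.shape)
    return clipped_grad_sum + noise


rng = np.random.default_rng(0)
local_sum = np.zeros(6)  # stand-in for a device's clipped gradient sum
noisy = privatize_local_gradient(local_sum, C_k=2.0, num_devices=4,
                                 sigma_new=1.5, rng=rng)
```

Each device only needs the shared constants K and $\sigma_{\text{new}}$, which are fixed before training, plus its own threshold.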

5. EXPERIMENTS

Previous sections verified that per-layer clipping has an efficiency advantage over flat clipping. We now show that adaptive per-layer clipping is competitive in terms of privacy vs utility. Our experiments cover training wide ResNets from scratch, fine-tuning RoBERTa on GLUE tasks, and fine-tuning GPT-2 and GPT-3 for table-to-text generation and summarization tasks. For private quantile estimation, we use the geometric update rule by Andrew et al. (2019) and set η = 0.3 for all experiments. Reported numbers are averaged over three seeds unless otherwise stated. Code to reproduce some of our experiments can be found at https://github.com/lxuechen/perlayer-public.

5.1. PRIVATELY LEARN WIDERESNETS FOR CIFAR-10 CLASSIFICATION

We train a wide ResNet (WRN16-4, 2.8M trainable parameters) (Zagoruyko & Komodakis, 2016) from scratch for CIFAR-10 classification with differential privacy. We follow the implementation by De et al. (2022), e.g., batch normalization layers are replaced with group normalization and weight standardization is applied to convolutional layers, except that we do not use augmentation multiplicity for simplicity. We set the privacy parameter $\delta = 10^{-5}$ and choose ϵ from {1, 3, 5, 8}, which are typical privacy parameters used in previous works. We compare the performance of adaptive per-layer clipping with that of flat clipping. For both algorithms, we use hyperparameters suggested by De et al. (2022) and tune learning rates. We use a fraction r = 0.01 of the privacy budget for quantile estimation and choose the target quantile q from {0.5, 0.6, 0.7}. For both algorithms we train for 300 epochs. We summarize the details in Appendix A.1. Table 2 shows that adaptive per-layer clipping achieves training and validation accuracies on par with flat clipping for multiple choices of ϵ.

5.2. PRIVATELY FINE-TUNE ROBERTA ON GLUE TASKS

Our first experiment aims to show that adaptive per-layer clipping is broadly competitive with existing approaches in the literature in terms of the privacy-utility tradeoff. In this experiment, we fine-tune RoBERTa-base (125M) and RoBERTa-large (355M) (Liu et al., 2019) on SST-2, QNLI, QQP, and MNLI from the GLUE benchmark (Wang et al., 2018) with differential privacy. We set ϵ ∈ {3, 8} and $\delta = 1/n^{1.1}$, where n is the size of the training set. We tune the learning rate, batch size, and target quantile on SST-2's training data and transfer the best hyperparameters to other tasks. We use r = 0.1 of the privacy budget for quantile estimation, choose the target quantile q from {0.5, 0.75, 0.85}, and set the number of training epochs E = 20.
Table 3 shows that adaptive per-layer clipping obtains accuracies competitive with established approaches in the literature under fixed privacy constraints. Our second controlled experiment shows that adaptive per-layer clipping gives utility that is competitive with flat clipping under fixed training epochs (when both approaches fine-tune the same set of parameters), effectively verifying its wall time advantage. In this experiment, we constrain the number of training epochs E to be one of {3, 10, 20, 30} and fine-tune RoBERTa models on SST-2 with the two clipping methods. 

5.3. PRIVATELY FINE-TUNE ON LANGUAGE GENERATION TASKS

Table-To-Text Generation. We compare adaptive per-layer clipping against flat clipping for full fine-tuning with GPT-2 on the E2E (Novikova et al., 2017) and DART (Nan et al., 2020) table-to-text generation tasks. Since Li et al. (2022b) performed extensive tuning on these tasks for flat clipping, we recall their results here. For runs with adaptive per-layer clipping, we reused hyperparameter values tuned for SST-2, but re-tuned the target quantile with the E2E

Dialog Summarization. We use the SAMSum dialog summarization task as a testbed for studying model scaling (Gliwa et al., 2019). This task is more challenging than previously tested ones since its training set is small (less than 15k examples) and inputs are long. We fine-tune both GPT-2-xl and the (original) 175 billion-parameter GPT-3 with LoRA (Hu et al., 2021) with and without DP, and compare them against in-context learning with GPT-3 (Brown et al., 2020). Table 6 shows that GPT-3 fine-tuned at ϵ = 1 outperforms non-privately fine-tuned GPT-2-xl and in-context learning with 4 demonstrations (the maximum that can be fitted within the context window of 2048 tokens). See Appendix C for more details.

6. RELATED WORK

Training large deep learning models with DP has gained momentum in recent years. For instance, Anil et al. (2021) privately pretrained BERT models, and Kurakin et al. (2022) privately trained deep ResNets on ImageNet. Recent works have also investigated private fine-tuning (Kerrigan et al., 2020; Tian et al., 2021; Senge et al., 2021; Hoory et al., 2021; Basu et al., 2021; Yu et al., 2021b) and observed that one can achieve favourable privacy-utility trade-offs with large pretrained models for image classification (Luo et al., 2021; Tramèr & Boneh, 2021; Golatkar et al., 2022; De et al., 2022; Mehta et al., 2022) and tasks in NLP (Yu et al., 2022; Li et al., 2022b; a). Group-wise clipping schemes considered in our work improve the efficiency of DP-SGD and further this line of research by making it easier to scale private learning. Several works considered adjusting the clipping threshold of DP-SGD adaptively during training (Pichapati et al., 2019; Asi et al., 2021). The work most related to ours is that by Andrew et al. (2019), who set the threshold for flat clipping as a privately estimated quantile of gradient norms. They showed that doing so eased hyperparameter tuning without affecting the final model performance. Different from these works, ours considers per-layer clipping, where adapting the clipping threshold plays a more crucial role in obtaining good utility. More discussion on related work is in Appendix I.

7. CONCLUSION

We showed that group-wise clipping schemes are effective tools to improve the efficiency of DP-SGD for small- to moderate-scale workflows that run on single accelerators, and to avoid overheads in private distributed pipeline parallel training of models that do not fit on single accelerators. We showed that adaptive clipping algorithms can mitigate known utility losses associated with using fixed and hand-tuned thresholds. Designing group-wise clipping algorithms that can Pareto-dominate flat clipping in terms of privacy vs utility (or showing that this is impossible) is an interesting future direction.

LIMITATIONS

Group-wise clipping schemes offer various advantages but are not without limitations and drawbacks. First, group-wise clipping algorithms tend to have a few extra hyperparameters. This could lead to a need for additional tuning when optimal hyperparameters differ across tasks and domains (although we showed that across the tasks we studied, the optimal values for most of the additional hyperparameters remained stable). Second, the per-layer clipping scheme gives limited efficiency improvements in non-distributed settings when only a few parameters are fine-tuned. Lastly, care must be taken during implementation to fully realize the gains of adaptive per-layer clipping in practice.

ETHICS STATEMENT

Our work studies and improves differentially private learning algorithms along two distinct axes and has the potential to expand the scope of machine learning on sensitive data. We argue that improvements in differentially private machine learning alone should not be the sole motivation to expand the collection of user data or to aggressively train machine learning models on such data without considering the potential long-term harms of developing and releasing models trained with sensitive data. Our efforts on scaling differentially private fine-tuning to work with GPT-3 are purely motivated by an academic research question. We note there are privacy concerns associated with the pretraining corpus of GPT-3, and thus a model fine-tuned from GPT-3 should not be deployed without undergoing careful privacy audits. For deployment purposes, we suggest fine-tuning only models pretrained on carefully curated corpora. Lastly, we note that language is inherently complex, and its complexity may well be reflected in datasets for sophisticated tasks such as dialog completion. Differential privacy as a guarantee alone may fail to fulfill the desired privacy goals if example boundaries are not set appropriately.

worth noting that we found that, for fine-tuning RoBERTa-base on SST-2, it is true for many layers that the 85% clipping threshold (see the red dashed line in Figure 4) is just the point that splits samples into a group with small gradient norms and a group with large ones. Both Figure 2 and Figure 4 demonstrate that the distribution of gradient norms is complex and may be related to many factors: (1) Iterations & Samples: gradient norms are small and spread out across layers in the early epochs, and as training goes on, per-sample gradients become divided: the large become larger and the small become smaller; (2) Layers: gradient norms of layers close to the input are larger than those of layers close to the output; this is more prominent in the later stages of training, and the pattern is well aligned across samples.

Stage microbatch $S_j$'s LocalForward and LocalBackward calls in the schedule C, ensuring the stages are executed sequentially for this microbatch
6: end for
7: Organize the schedule C based on pipeline parallel rules, allowing different devices to process different microbatches simultaneously
8: Execute the schedule C
9: Synchronize all devices
10: for k = 1 to K do
11:     $\theta'_k \leftarrow \theta_k - \eta u_k$
12: end for
13: return $\theta' = (\theta'_1, \ldots, \theta'_K)$

Algorithm 3 LocalForward
1: INPUT: Device id k; microbatch index j
2: Wait for activations $a_{k-1}^{(j)}$ from device k - 1 if k > 1; otherwise transfer microbatch $S_j$ onto device k
3: Perform forward pass with $a_{k-1}^{(j)}$ and model piece $\theta_i$ (stored on device k) to obtain outputs $a_{k-1}$ and communicate this to device k - 1 if k > 1.

C.2 FINE-TUNING GPT-3 ON SAMSUM

Note the term "GPT-3" in the literature is used in multiple occasions and can refer to multiple models. Our experiments are based on fine-tuning or prompting the original GPT-3 model (Brown et al., 2020) and not the more recent variants which had been fine-tuned or adapted in some way (e.g., instruct-GPT-3 (Ouyang et al., 2022) labeled with prefix instruct-in OpenAI API). Larger models are known to have better fine-tuned performance when inputs and outputs are formatted as instructions and responses (Wei et al., 2021; Sanh et al., 2021) . We observed similar results when fine-tuning with differential privacy, and thus augmented the training and test sets by prepending the inputs with the instruction "Summarize the following dialogue" and the outputs with the delimiter "TL;DR". To ensure a fair comparison, we used this instruction-augmented dataset for all experiments. For decoding from models, we used beam search with a beam size of 4 for both GPT-3 and GPT-2-xl (including in-context learning experiments). To ensure we account for the variability in performance with different prompts for in-context learning, we sampled 3 sets of prompts for the 4-shot learning experiments and reported the average metric over runs. Without access to the original pretraining corpus, we cannot completely rule out the possibility of data contamination, which refers to the unfortunate outcome that parts of the fine-tuning or evaluation data occur in the pretraining corpus. Nevertheless, we believe the chances of this happening are small due to two reasons. First, zero-shot prompting GPT-3 with both low temperature sampling and beam search based on instruction-augmented inputs tended to result in completions which either repeated or extended the instruction or the dialog (e.g., "the following is a dialog between..."), or attempted to continue the dialog but digressed. 
In the limited number of examples we inspected, we were unable to find an instance where the output looked similar to a high-quality summary. Second, we looked up the initial time when the SAMSum paper was released to arXiv (late Nov. 2019). Given that the GPT-3 model we based our experiments on was pretrained with shards of Common Crawl uploaded (possibly) at the end of 2019 (Brown et al., 2020), we performed simple searches of the SAMSum paper with their URL index in the Dec. 2019 crawl archive of Common Crawl and were not able to find the link of the paper. Notably, the SAMSum dataset was crafted by linguists and highly curated (as opposed to collected based on web data). For fine-tuning GPT-3 with DP LoRA on SAMSum, we reused hyperparameters adopted by Hu et al. (2021), but re-tuned the learning rate based on preliminary runs for another dataset. We set all per-device clipping thresholds to be 1e-5 and adopted the equal budget noise allocation strategy for simplicity. We fine-tuned for 5 epochs in all runs (both GPT-3 and GPT-2-xl; both private and non-private). For the DP LoRA fine-tuning runs, we used a machine with 16 V100 GPUs each with 32 gigabytes of VRAM. This enabled LoRA fine-tuning with a rank of 32 with a microbatch size of 1 under pipeline parallelism. Fine-tuning with DP LoRA for 5 epochs on SAMSum's training set took 15 hours, and decoding with test inputs using beam search took another 22 hours.

D PROOFS

We present the proof of Proposition 3.1. For easy reference, we restate the proposition here.

Proposition 3.1. Let σ be the original noise multiplier for noising parameter updates to achieve a certain level of differential privacy (without private quantile estimation) and σ_b be chosen for noising quantile estimates (release of the latter consumes an r fraction of the privacy budget). Then the new noise multiplier σ_new for noising parameter updates (consuming a 1 - r fraction of the budget) is

$$\sigma_{\text{new}} = \left(\sigma^{-2} - K/(2\sigma_b)^2\right)^{-1/2}. \tag{3.1}$$

Proof. The proof is based on direct calculation, a simple version of which is given in Andrew et al. (2019). First, we note that each clip count $b_k^{(i)}$ is either 0 or 1 (see line 10 in Algorithm 1). One can make it symmetric by using $b_k^{(i)} - \frac{1}{2}$, whose sensitivity is $\frac{1}{2}$. Suppose the gradient has sensitivity S. For the Gaussian mechanism, to keep the privacy budget constant we have

$$\frac{S^2}{(S\sigma)^2} = \frac{S^2}{(S\sigma_{\text{new}})^2} + K \cdot \frac{(1/2)^2}{\sigma_b^2}.$$

Results for adaptive versus fixed flat clipping are presented in Table 11a and Table 11b. We can see that adaptivity also helps flat clipping, but the improvement is not statistically significant. The performance of adaptive per-layer clipping is on par with that of adaptive flat clipping as well.

All experiments here are performed on a machine with a single Titan RTX GPU with 24 GB of VRAM (different from the configuration in Figure 1, which uses a single A6000 GPU). The direct experiment we perform is to fully fine-tune GPT-2 on E2E with three clipping approaches (adaptive per-layer, ghost, and flat clipping) under the same epoch constraint (which we fix to be 10 for all workflows). Regarding hyperparameters for flat clipping, we adopt the set of values obtained from extensive tuning on this task used by Li et al. (2022b). We reuse the same set of hyperparameter values for ghost clipping, since the approach essentially results in the same gradient updates as flat clipping up to numerical precision (only computed in a different way).
Using these near-optimal hyperparameters for flat and ghost clipping prevents our experiments from unfairly disfavouring the two approaches. Figure 7 shows that adaptive per-layer clipping consistently achieves lower test set negative log-likelihood (NLL) than flat clipping and ghost clipping for any given elapsed wall time. While language generation metrics (e.g., BLEU and ROUGE-L) are generally noisier than the test set NLL, Figure 8 shows that adaptive per-layer clipping generally yields better task metric numbers compared to flat clipping and ghost clipping under the same wall time.

Finally, we note the caveat that the precise run time advantage of adaptive per-layer clipping over flat clipping may vary across machines and GPU types. In addition, the realized gains for actual training workflows might be smaller than those observed in our controlled experiments (e.g., Figure 1) due to compute time spent on auxiliary operations such as data loading and data preprocessing (e.g., padding sequences of different lengths to the same length). For instance, we repeat the controlled experiment in Figure 1 with a different GPU and observe slightly different speedup factors. Overall, adaptive per-layer clipping is generally more than 1.4x as fast as flat clipping in our controlled experiments, and the realized gains are roughly as large for full fine-tuning on E2E with our implementation.

H ADDITIONAL EXPERIMENTS COMPARING ADAPTIVE PER-LAYER AGAINST FLAT CLIPPING

This section complements the results in Table 4 and follows the same experimental protocol as those experiments, except that we now consider ϵ = 8. The goal is to provide further evidence that adaptive per-layer clipping has a privacy-utility tradeoff comparable to that of flat clipping when both methods are used to train models for the same number of training epochs. Table 12 shows that for full fine-tuning and across the epoch constraints we considered, adaptive per-layer clipping is competitive with flat clipping in terms of accuracy for different privacy budgets. Note that the results we obtain for flat clipping in Table 12 are higher than those reported by Li et al. (2022b). This is because, to ensure that we are not unfairly disfavouring flat clipping for this SST-2 task, we tuned hyperparameters on this task (details in Appendix A.2); recall that Li et al. (2022b) did not tune on SST-2, but instead relied on hyperparameter transfer from the E2E task.

I ADDITIONAL RELATED WORK

Faster DP-SGD. Improving the efficiency of DP-SGD is an active research area. One line of work improves the implementation without changing the algorithm, for example by using better parallelism and compile-time optimization (Subramani et al., 2021; Anil et al., 2021). Subramani et al. (2021) show that the running time of carefully implemented DP-SGD is comparable to that of non-private SGD for small models. However, the high memory cost of storing per-example gradients still limits the throughput of DP-SGD when the model is large (Li et al., 2022b). Another line of work avoids instantiating per-example gradients by running backpropagation twice (Goodfellow, 2015; Lee & Kifer, 2021; Bu et al., 2021; Li et al., 2022b; Bu et al., 2022). The high-level idea is to compute or estimate per-example gradient norms in the first backpropagation and reweight loss functions before the second backpropagation. Although these works achieve memory efficiency, they add computational overhead because of the additional backpropagation.

DP-SGD with Per-layer Clipping. Per-layer clipping (or more generally group-wise clipping) has been studied in Abadi et al. (2016); McMahan et al. (2018a; b); Dupuy et al. (2022). However, the advantage of per-layer clipping has not yet been fully understood, for two reasons. First, previous work does not focus on computational efficiency, leaving the empirical advantage of group-wise clipping unexplored. Second, these works simply adopt fixed thresholds for per-layer clipping and hence generally observe performance drops compared to flat clipping. In this work, we use adaptive thresholds to improve the privacy-utility tradeoff of per-layer clipping. Moreover, we give an efficient implementation of per-layer clipping to demonstrate its empirical advantage.

Privacy Attacks Against Deep Models. Deep models may unintentionally leak sensitive information about their training data (Shokri et al., 2017; Hitaj et al., 2017; Zhu et al., 2019; Song et al., 2019; Carlini et al., 2020; Choquette-Choo et al., 2021; Carlini et al., 2022; Balle et al., 2022). For instance, Carlini et al. (2020) show that GPT-2 outputs its training data when short prefixes are provided. Training deep models with differential privacy has become a popular choice to prevent data leakage (Abadi et al., 2016; Papernot et al., 2017; McMahan et al., 2018b; Zhu et al., 2020). In addition to offering theoretical guarantees, differentially private models are also very robust to empirical privacy attacks (Bernau et al., 2019; Carlini et al., 2019; Yu et al., 2021b).

Adapting to the Geometry of Gradients in DP-SGD. Using adaptive clipping thresholds in DP-SGD fits more broadly into a line of work that adapts clipping and noising to the geometry of gradients. The gradients of machine learning models usually have much smaller intrinsic dimensions than the model sizes. This property has been used to prove better theoretical bounds for DP-SGD or improve its empirical performance (Kairouz et al., 2020; Song et al., 2020; Zhou et al., 2021; Yu et al., 2021a; Li et al., 2022a; Ma et al., 2022).



While size does not equate to quality, there is a strong correlation between the two under currently popular pretraining techniques (Liu et al., 2019; Brown et al., 2020). We're optimistic that future smaller models pretrained with improved techniques can be as performant as current large models (Hoffmann et al., 2022).

Note the DP learning library Opacus (Yousefpour et al., 2021) has a per-layer clipping optimizer which supports clipping each layer with a separate threshold. But this implementation conducts per-layer clipping after backpropagation completes and instantiates all per-example gradients before clipping, which is inefficient.

Alternate parallelization schemes can be more flat clipping friendly (e.g., FSDP (Zhao et al., 2022)), but current open source implementations of these schemes are generally not light-weight fine-tuning friendly.

Results in Yu et al. (2021b; 2022) on MNLI are the average of the matched and mismatched accuracy.

We believe the chance of contamination occurring for this task to be small. See Appendix C for discussions.



Figure 2: The distribution of per-layer gradient norms shifts substantially across training. Each column represents one layer, and each row represents one example. Layers of the neural network are placed from input (left) to output (right). Darker colors indicate higher values of gradient norms.

Figure 3: Adaptive per-layer clipping eliminates performance losses experienced by fixed per-layer clipping.

LocalBackward
1: INPUT: Device id k; microbatch index j
2: Transfer activations $a_{k-1}^{(j)}$ from CPU to device k
3: Wait for output gradients $o_k^{(j)}$ if device k < K
4: Rematerialize activations by performing extra forward pass
5: Clip and sum per-example gradients $\{g_k^{(i)}\}_i$ of local model piece $\theta_k$ by backpropagating based on $o_k^{(j)}$ or the loss values with threshold $C_k$; add this to a local accumulator $u_k$ (stored on device k)
6: If j = 1, add noise to accumulator $u_k$ to guarantee DP
7: Compute gradients with respect to input o (j)
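Steps 5 and 6 of LocalBackward can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: it assumes the per-example gradients of the local model piece are already available as rows of an array (the actual procedure fuses clipping with backpropagation), and the names `clip_and_sum`, `local_backward_step`, `noise_std`, and `add_noise` are hypothetical.

```python
import numpy as np


def clip_and_sum(per_example_grads: np.ndarray, threshold: float) -> np.ndarray:
    """Clip each row to L2 norm at most `threshold`, then sum over examples."""
    norms = np.linalg.norm(per_example_grads, axis=1)
    # Rows with norm <= threshold keep scale 1; the floor avoids division by zero.
    scale = np.minimum(1.0, threshold / np.maximum(norms, 1e-12))
    return (per_example_grads * scale[:, None]).sum(axis=0)


def local_backward_step(accumulator, per_example_grads, threshold,
                        noise_std, add_noise, rng):
    """Accumulate clipped gradients for one microbatch on one device.

    `add_noise` plays the role of the `j = 1` check in step 6: noise is
    added once per update, to the local accumulator only.
    """
    accumulator = accumulator + clip_and_sum(per_example_grads, threshold)
    if add_noise:
        accumulator = accumulator + rng.normal(0.0, noise_std, size=accumulator.shape)
    return accumulator


if __name__ == "__main__":
    grads = np.array([[3.0, 4.0], [0.3, 0.4]])  # norms 5.0 and 0.5
    print(clip_and_sum(grads, threshold=1.0))   # first row is rescaled to norm 1
```

Because the clipping threshold and the accumulator are both local to the device, nothing in this step needs gradient norms from other model pieces.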

Figure 5: Validation accuracy (in %) with different target quantiles.

Figure 7: Adaptive per-layer clipping consistently achieves lower test set negative log-likelihood than flat clipping and ghost clipping under the same wall time.



Adaptive per-layer clipping achieves accuracy (in %) on par with flat clipping for CIFAR-10.

Table 4 shows that adaptive per-layer clipping is on par with flat clipping in accuracy for all setups and justifies the wall time advantage claim given that adaptive per-layer clipping is faster per epoch. Li et al. (2022b) demonstrated that the performance of private fine-tuning for text classification with various algorithms improves with a text infilling formulation. The infilling technique reformulates the optimization problem and is orthogonal to the algorithmic aspects under study. To ensure our comparisons are fair, all experiments in this section follow the usual BERT fine-tuning setup without text infilling. Additional details of the two experiments can be found in Appendix A.2.

Adaptive per-layer clipping matches accuracy (in %) results in the literature on GLUE tasks.4
Adaptive per-layer: 87.10/87.20, 86.80, 89.80, 93.87, 87.67/87.57, 87.20, 90.77, 94.03

Adaptive per-layer clipping is competitive with flat clipping on SST-2 in accuracy (%) under fixed fine-tuning epochs (E). Models are full fine-tuned. Numbers in parentheses are standard deviations from three independent runs. See results for ϵ = 8 in Appendix H.

Adaptive per-layer clipping matches the performance of flat clipping for full fine-tuning under the same number of training epochs for common privacy levels. Results based on fine-tuning GPT-2 on E2E and DART. Numbers in the column "flat" are reported by Li et al. (2022b).

Hyperparameters for full fine-tuning GPT-2 with adaptive per-layer clipping. Numbers in bold are best performing hyperparameters used for reporting final results.

C MORE ON PER-DEVICE CLIPPING AND EXPERIMENTS WITH GPT-3

C.1 ADDITIONAL COMMENTS ON PER-DEVICE CLIPPING

Algorithm 2 details the private pipeline parallel training procedure with per-device clipping covered in the main text. The algorithm adopts the equal budget noise allocation strategy to avoid incurring extra communication across devices. For simplicity, the pseudocode covers a single update and omits any subprocedures for adapting clipping thresholds. Extending this pseudocode to adaptive threshold clipping based on quantile estimation is straightforward. Lastly, the pseudocode assumes that the model has already been partitioned into chunks of consecutive layers, where each chunk $\theta_k$ is hosted on device k.

Algorithm 2 Single Update of Private Pipeline Parallel Training With Per-Device Clipping
1: INPUT: Minibatch S; iterate θ; per-device parameters $\{\theta_1, \ldots, \theta_K\}$ hosted on K devices; clipping thresholds $\{C_1, \ldots, C_K\}$; noise multiplier σ; learning rate η; number of microbatches per minibatch J
2: Partition minibatch S into J microbatches $\{S_1, \ldots, S_J\}$
3: Create an empty execution schedule C
4: for j = 1 to J do

Adaptivity helps flat clipping but not as much as for per-layer clipping. Averaged accuracy and standard deviation are given by 3 independent runs.

We include additional experiments comparing the run time of different clipping approaches. Our goal is to further consolidate the claims that 1) adaptive per-layer clipping attains similar or better task performance under the same epoch budget for certain workflows, and that 2) this translates into compute time savings since per-layer clipping is faster than alternative clipping approaches per update (or almost equivalently per epoch).

Adaptive per-layer clipping matches or outperforms flat clipping in accuracy (%) on SST-2 under fixed epoch (E) constraints. This experiment performs full fine-tuning with both clipping approaches. Numbers in parentheses are standard deviations from three independent runs.

RoBERTa-base, Flat clipping (tuned): 89.13 (0.64), 91.53 (0.12), 91.97 (0.67), 92.10 (0.56)
RoBERTa-base, Adaptive per-layer: 90.83 (1.39), 92.27 (0.76), 92.63 (0.32), 92.87 (0.12)
RoBERTa-large, Flat clipping (tuned): 92.43 (0.32), 93.50 (0.20), 94.57 (0.55), 94.87 (0.76)
RoBERTa-large, Adaptive per-layer: 93.20 (0.36), 93.53 (0.47), 94.37 (0.15), 94.33 (0.38)

A EXPERIMENT DETAILS

A.1 SETUP FOR CIFAR-10 CLASSIFICATION WITH WRN16-4 MODEL

In the paper, we use the CIFAR-10 classification task with a wide residual network (WRN16-4) to demonstrate several points. To improve reproducibility, we describe the detailed settings of these experiments.

We modify the standard WRN16-4 (Zagoruyko & Komodakis, 2016) following the suggestions of De et al. (2022), i.e., replacing batch normalization with group normalization and using weight standardization for convolutional weights, except that we do not use augmentation multiplicity for simplicity. Specifically, for flat clipping, De et al. (2022) use a hand-tuned clipping threshold C = 1 and learning rate lr = 4. Because the learning rate and clipping thresholds jointly affect the performance in a complex way, we set the fixed per-layer clipping thresholds and the adaptive per-layer clipping thresholds so that they both have an equivalent global threshold C = 1. For fixed per-layer clipping, each layer's clipping threshold is C/√K, where K is the number of layers. For adaptive clipping thresholds C_1, ..., C_K, we rescale them so that they have the same equivalent global threshold.

To make the comparison fairer, we carefully tune the hyperparameters of fixed per-layer clipping in Table 1, Table 11 and Figure 3. We find that small fixed thresholds can improve the performance of fixed per-layer clipping. We try different values of C from {1.0, 0.5, 0.1, 0.05} while keeping C • lr constant (which is critical for SGD), and set the clipping threshold C_k = C/√K for each layer. Finally, for fixed per-layer clipping, we choose the best hyperparameter combinations (C = 0.05, lr = 40) and (C = 0.1, lr = 20) for ϵ = 3 and ϵ = 8, respectively.

For both fixed and adaptive per-layer clipping, we use the global strategy for noise allocation, i.e., γ_k = 1 for all k ∈ [K]. Moreover, we use the same optimizer, weight decay, momentum, learning rate schedule, batch size and max epochs as flat clipping, as shown in Table 7.
We tune the learning rate from two choices {2, 4} for all three algorithms. For adaptive per-layer clipping, we use a fraction r = 0.01 of the privacy budget to estimate quantiles and a quantile learning rate η = 0.3. We tune the target quantile from three choices {0.5, 0.6, 0.7}. We evaluate the hyperparameter sensitivity in the ablation study (Section F). Hyperparameters are tuned by training from scratch on the training set and evaluating on the test set. We use the best hyperparameter combination for each ϵ and report the test set accuracy of the last epoch in Table 2.

To evaluate the performance of adaptive per-layer clipping, we conduct experiments on GLUE tasks by fine-tuning RoBERTa-base and RoBERTa-large models with differential privacy.

The optimizer setup and dropout rates are the same for adaptive per-layer clipping, fixed per-layer clipping, fixed flat clipping and adaptive flat clipping, as shown in Table 8.

For per-layer clipping, we use the global strategy for noise allocation, i.e., γ_k = 1 for all k ∈ [K].

We tune the other hyperparameters: peak learning rate, batch size, clipping thresholds for fixed per-layer clipping, and target quantile q for adaptive per-layer clipping, as shown in the bottom half of Table 8.

To tune hyperparameters fairly, we split the training set of SST-2 into two parts: a new training set containing 80% of the original training set and a validation set containing the remainder. We select the best hyperparameters based on performance on the validation set, averaged over 3 different seeds.

Table 8 shows the best hyperparameter combinations we use for adaptive and fixed per-layer clipping.

For experiments in Section 5.2, we set the privacy budget for quantile estimation to r = 10%. Figure 6 suggests that using smaller values such as r = 1% or r = 5% may produce slightly better results.

We transfer hyperparameters tuned on SST-2 to the remaining GLUE tasks. Specifically, we follow Li et al. (2022b) and keep the sampling rate the same across different datasets.

For the GLUE tasks considered, we find that training for more epochs generally improves the performance of both flat and per-layer clipping. To ensure runs finish under realistic training times, we fix the max epochs to 20 for the adaptive per-layer clipping runs reported in Table 3. We report the accuracies on the original dev sets for each GLUE task.

For the second experiment in Section 5.2, we only optimize with respect to the self-attention layers and the classification head parameters for both flat clipping and adaptive per-layer clipping so that the total computational cost of this controlled ablation study is manageable. We reused most of the hyperparameters specific to adaptive quantile estimation based on tuning results on SST-2. We retuned the target quantile parameter, as we observed that optimal values of this parameter tend to differ across tasks. To ensure a fair comparison against full fine-tuning, we constrain the runs with adaptive per-layer clipping to have the same batch size and training epochs as in Li et al. (2022b). We adopted the default values set by the Hugging Face transformers library for Adam's β_1, β_2, and ϵ. Table 9 contains the full set of hyperparameters.
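The target quantile q, quantile learning rate η, and quantile noise σ_b tuned above enter through a private quantile estimation step. The sketch below assumes the geometric update of Andrew et al. (2019), which this paper cites for quantile estimation; the exact form used in Algorithm 1 is not reproduced in this appendix, and the function name `update_threshold` is hypothetical.

```python
import math

import numpy as np


def update_threshold(threshold, grad_norms, target_quantile, quantile_lr,
                     sigma_b, rng):
    """One geometric update of a clipping threshold toward a target quantile.

    Assumed form (after Andrew et al., 2019):
        C <- C * exp(-lr * (noisy_unclipped_fraction - q))
    """
    grad_norms = np.asarray(grad_norms, dtype=float)
    # Clip count b = 1 if the example's gradient norm is within the threshold.
    clip_counts = (grad_norms <= threshold).astype(float)
    # Privatize the sum of clip counts with Gaussian noise of std sigma_b.
    noisy_sum = clip_counts.sum() + rng.normal(0.0, sigma_b)
    noisy_fraction = noisy_sum / len(grad_norms)
    return threshold * math.exp(-quantile_lr * (noisy_fraction - target_quantile))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    norms = rng.lognormal(mean=0.0, sigma=1.0, size=1024)
    c = 10.0
    for _ in range(200):
        c = update_threshold(c, norms, target_quantile=0.5, quantile_lr=0.3,
                             sigma_b=1.0, rng=rng)
    print(c, np.median(norms))  # the threshold should drift toward the median
```

In a per-layer setting, one such update would be run independently for each group k with that group's gradient norms, which is why K appears in the noise multiplier adjustment of Proposition 3.1.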

B ADDITIONAL EXPERIMENTS ON GRADIENT NORM SHIFT

In this section, we illustrate the shift in the distribution of gradient norms in both CIFAR-10 training and SST-2 fine-tuning. To visualize the gradient norms, we first randomly select some samples from the training set, and take checkpoints at different epochs of a model privately trained with adaptive per-layer clipping and privacy parameter ϵ = 8. For each sample, we compute the gradient norm of each layer. Specifically, for CIFAR-10, we randomly select 32 samples and place layers of WRN16-4 from input (left) to output (right) in Figure 2. For SST-2, we randomly select 4,096 samples and some layers in the RoBERTa-base model, and plot the histogram of the per-sample per-layer gradient norms in Figure 4 across the first few epochs.

Simplifying the above expression, we get $\sigma_{\text{new}}$. We can further compute the fraction r of the budget that is used to privately estimate quantiles by $r = \frac{K/(2\sigma_b)^2}{\sigma^{-2}} = \frac{K\sigma^2}{4\sigma_b^2}$. We can also derive the value of $\sigma_b$ given r from the above formula. □
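The calculation in Proposition 3.1 can be checked numerically with a short sketch. The function names are hypothetical, and the expression for r assumes that the privacy budget is measured by the inverse squared noise multiplier, as in the Gaussian-mechanism accounting used in the proof, so that σ_new^{-2} = (1 - r) σ^{-2}.

```python
import math


def new_noise_multiplier(sigma: float, sigma_b: float, num_groups: int) -> float:
    """sigma_new = (sigma^-2 - K / (2 sigma_b)^2)^(-1/2), as in Eq. (3.1)."""
    inv_sq = sigma ** -2 - num_groups / (2.0 * sigma_b) ** 2
    if inv_sq <= 0:
        raise ValueError("sigma_b is too small: quantile estimation would "
                         "consume the entire privacy budget")
    return inv_sq ** -0.5


def quantile_budget_fraction(sigma: float, sigma_b: float, num_groups: int) -> float:
    """Fraction r of the budget spent on quantiles (assumed budget ~ sigma^-2)."""
    return num_groups * sigma ** 2 / (2.0 * sigma_b) ** 2


def sigma_b_for_fraction(sigma: float, r: float, num_groups: int) -> float:
    """Invert the expression for r to get the quantile noise sigma_b."""
    return sigma * math.sqrt(num_groups / r) / 2.0


if __name__ == "__main__":
    sigma, K, r = 1.0, 4, 0.01
    sigma_b = sigma_b_for_fraction(sigma, r, K)
    print(sigma_b, new_noise_multiplier(sigma, sigma_b, K))
```

With a small r, the new noise multiplier is only slightly larger than the original one, which matches the observation elsewhere in the appendix that quantile estimation needs only a small share of the budget.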

E NOISE ALLOCATION COMPARISON

We compare the noise allocation strategies empirically. Apart from the global strategy, where γ_k = 1 for all k ∈ [K], and the equal budget strategy, where γ_k = C_k for all k ∈ [K], both of which are discussed in Section 3.3, we also consider another weighted strategy. In this case, the number of parameters plays a role so that each coordinate would roughly have the same signal-to-noise ratio.

We fine-tune RoBERTa-base models on the SST-2 sentence classification task. The hyperparameters are searched for each strategy separately, where the ranges follow Appendix A. Results are presented in Table 10. We can see that the three strategies achieve comparable performance and the global strategy is slightly better. Therefore, we use the global strategy for all experiments except for GPT-3, where the equal budget strategy is used to eliminate the concern of communication across devices.
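To make the two named strategies concrete, the sketch below computes per-group noise standard deviations. It assumes the noise for group k has standard deviation σ · γ_k · sqrt(Σ_j C_j² / γ_j²), which is one standard way to calibrate Gaussian noise to group-wise sensitivity; this appendix does not restate the formula from Section 3.3, so treat it as an assumption. The function name `group_noise_stds` is hypothetical.

```python
import numpy as np


def group_noise_stds(thresholds, gammas, sigma: float) -> np.ndarray:
    """Per-group noise std under an assumed allocation rule.

    Assumption: std_k = sigma * gamma_k * sqrt(sum_j (C_j / gamma_j)^2).
    """
    thresholds = np.asarray(thresholds, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    # Sensitivity of the gradient after rescaling group j by 1 / gamma_j.
    rescaled_sensitivity = np.sqrt(np.sum((thresholds / gammas) ** 2))
    return sigma * gammas * rescaled_sensitivity


if __name__ == "__main__":
    C = np.array([3.0, 4.0])
    # Global strategy (gamma_k = 1): every group gets sigma * sqrt(sum C_k^2).
    print(group_noise_stds(C, np.ones_like(C), sigma=1.0))
    # Equal budget strategy (gamma_k = C_k): group k gets sigma * C_k * sqrt(K).
    print(group_noise_stds(C, C, sigma=1.0))
```

Under this assumed rule, the equal budget noise for group k depends only on its own threshold C_k and the number of groups K, which is consistent with the stated motivation of avoiding communication across devices.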

F ABLATION STUDIES

Here we conduct ablation studies to examine 1) the influence of using different quantiles to perform clipping; 2) the influence of varying the privacy budget for quantile estimation; and 3) whether adaptive flat clipping is significantly better than fixed flat clipping.

Clipping with Different Target Quantiles. We use different target quantiles for clipping on both WRN16-4 and RoBERTa-base. We choose the quantile for CIFAR-10 from {0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and that for SST-2 from {0.05, 0.2, 0.4, 0.6, 0.85, 0.9, 0.95}. Other hyperparameters are the same as those in Sections 5.1 and 5.2. We plot the results in Figure 5. On CIFAR-10, the accuracy is robust to the choice of target quantile across all values considered. On SST-2, all quantiles around 0.9 give good performance. This suggests that setting the target quantile according to the model accuracy is a good default choice for classification tasks. For generation tasks, we tune the target quantile as a hyperparameter in general.

Different Budgets for Quantile Estimation. We show the influence of using different privacy budgets to estimate the target quantile. We fine-tune RoBERTa-base models on SST-2. The fraction of privacy budget for quantile estimation r is chosen from {0.01%, 0.1%, 1%, 5%, 10%, 20%, 40%, 80%}. We plot the results in Figure 6. The performance is good for a wide range of r. When ϵ = 8, using r as small as 0.01% still gives good accuracy. This further confirms the finding in Andrew et al. (2019) that quantiles can be estimated quite accurately with a small privacy budget. Therefore, we only need to allocate a negligible budget to private quantile estimation, without much effect on the noise added to the model updates.

Adaptive Per-layer Clipping vs Adaptive Flat Clipping. We have verified that adaptive per-layer clipping can match the performance of well-tuned flat clipping in Section 5. To fully justify the value of adaptive per-layer clipping, we need to demonstrate that adaptive flat clipping does not achieve significantly better performance than fixed flat clipping. We run experiments on the CIFAR-10 task with WRN16-4 and the SST-2 task with the RoBERTa-base model. Their results are presented in Table 11a and Table 11b.

