DIFFERENTIALLY PRIVATE OPTIMIZATION ON LARGE MODEL AT SMALL COST

Abstract

Differentially private (DP) optimization is the standard paradigm to learn large neural networks that are accurate and privacy-preserving. The computational cost of DP deep learning, however, is notoriously heavy due to the per-sample gradient clipping. Existing DP implementations are 2-1000× more costly in time and space complexity than standard (non-private) training. In this work, we develop a novel Book-Keeping (BK) technique that implements existing DP optimizers (thus achieving the same accuracy), with a substantial improvement in the computational cost. Specifically, BK enables DP training on large models and high-dimensional data to be roughly as efficient as standard training, whereas previous DP algorithms can be inefficient or incapable of training due to memory errors. The computational advantage of BK is supported by the complexity analysis as well as extensive experiments on vision and language tasks. Our implementation achieves state-of-the-art (SOTA) accuracy with very small extra cost: on GPT2 and at almost the same memory cost (< 1% overhead), BK has 1.03× the time complexity of standard training (0.83× training speed in practice), and 0.61× the time complexity of the most efficient DP implementation (1.36× training speed in practice). We will open-source the codebase for the BK algorithm.

1. INTRODUCTION

Deep learning with differential privacy (DP; Dwork et al. (2006)) has shown strong performance while guaranteeing rigorous protection against privacy risks, especially on large models that tend to memorize and leak the training data Carlini et al. (2021); Haim et al. (2022); Shokri et al. (2017). For example, recent advances have shed light on the success of DP GPT2 Li et al. (2021); Bu et al. (2022b); Yu et al. (2021), which achieves a 64.6 BLEU score¹ under a strong privacy guarantee (ϵ = 3) on the text generation task using the E2E restaurant review dataset. This is only marginally below the standard non-private GPT2 (BLEU score 66.8). Similarly, on computer vision tasks (ϵ = 2), DP vision transformers and ResNets have obtained 97.1%/86.2% accuracy on CIFAR10/100 by Bu et al. (2022a) and over 81% accuracy on ImageNet by De et al. (2022); Mehta et al. (2022).

However, DP training of large neural networks is well known to be computationally burdensome in comparison to standard training, in terms of both the training time and the memory cost. For instance, training a small recurrent neural network (0.598M parameters) experiences a 1000× slowdown using DP optimizers in the Tensorflow-Privacy (TF-Privacy) library Bu et al. (2021), and training a small convolutional neural network (CNN, 0.605M parameters) on CIFAR10 has a 24× slowdown with Tensorflow 2 and the XLA compiler Subramani et al. (2021). Even with SOTA efficient implementations, large models such as RoBERTa Liu et al. (2019), GPT2 Radford et al. (2019), ResNet He et al. (2016), VGG Simonyan & Zisserman (2014), ViT Dosovitskiy et al. (2020) and its variants experience about 2-3× slowdown in Pytorch Li et al. (2021); Bu et al. (2022a) and 2-9× slowdown in JAX Kurakin et al. (2022); De et al. (2022), with possibly 4-20× memory overhead Bu et al. (2022a); Li et al. (2021); Subramani et al. (2021), if not out of memory.
The efficiency bottleneck in DP deep learning lies in the per-sample gradient clipping, which restricts the magnitude of each per-sample gradient in the mini-batch. Applying the clipping jointly with the Gaussian noise addition, one can privately release the gradient to arbitrary optimizers like SGD and Adam, and thus guarantee the privacy of the training as described in Section 1.3:

private gradient: $\hat{G} := \sum_i g_i \cdot C(\|g_i\|_2) + \sigma_{DP} \cdot \mathcal{N}(0, I)$, private optimizer (e.g. SGD): $w_{t+1} = w_t - \eta \hat{G}$. (1)

Here $w$ is the model parameters, $L_i$ is the per-sample loss, $g_i = \frac{\partial L_i}{\partial W}$ is the per-sample gradient, $\eta$ is the learning rate, $\sigma_{DP}$ is the noise magnitude that defines the privacy loss, and $C(\|g_i\|)$ or simply $C_i$ is the per-sample clipping factor, e.g. $\min\{R/\|g_i\|, 1\}$ in Abadi et al. (2016), with a clipping threshold $R$.

An orthogonal approach, including this work, focuses on the computation efficiency (part III), i.e. reducing the time and space complexity through efficient implementations, without affecting the DP optimizers (part I) and thus their performance. We will elaborate on multiple methods in Section 1.2. Additionally, these methods can be compiled on different platforms (part IV) such as Tensorflow 2 (XLA), JAX and Pytorch Li et al. (2021); Subramani et al. (2021); De et al. (2022); Kurakin et al. (2022), where remarkable speed differences have been observed in some cases, even with the same implementation. For example, Subramani et al. (2021) implemented DP-SGD using JAX and claimed its efficiency advantage over the same algorithm using Tensorflow or Pytorch.
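As an illustration of Equation (1), the following minimal sketch computes a private gradient in the naive way, by instantiating every per-sample gradient, clipping it with the factor min{R/‖g_i‖, 1}, summing, and adding Gaussian noise. The function names (`clip_factor`, `private_gradient`, `sgd_step`) and the flat-list representation of gradients are assumptions made for this example only; this is the costly baseline that efficient implementations aim to avoid, not the BK algorithm.

```python
import math
import random

def clip_factor(norm, R):
    """Per-sample clipping factor C_i = min(R / ||g_i||, 1); 1 for a zero gradient."""
    return min(R / norm, 1.0) if norm > 0 else 1.0

def private_gradient(per_sample_grads, R, sigma_dp, rng):
    """Sum of clipped per-sample gradients plus sigma_dp * N(0, I), as in Equation (1)."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for g in per_sample_grads:
        norm = math.sqrt(sum(v * v for v in g))
        c = clip_factor(norm, R)
        for j, v in enumerate(g):
            total[j] += c * v
    # One independent standard Gaussian draw per coordinate.
    return [t + sigma_dp * rng.gauss(0.0, 1.0) for t in total]

def sgd_step(w, g_hat, lr):
    """Private SGD update w_{t+1} = w_t - lr * g_hat."""
    return [wi - lr * gi for wi, gi in zip(w, g_hat)]

# Hypothetical toy batch: two per-sample gradients in two dimensions.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
g_hat = private_gradient(grads, R=1.0, sigma_dp=0.5, rng=rng)
w_next = sgd_step([1.0, 1.0], g_hat, lr=0.1)
```

Note that the loop stores and touches every per-sample gradient explicitly, which is the source of the time and memory overhead discussed in this section.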

1.1. CONTRIBUTIONS

1. [Algorithm] We propose the book-keeping (BK) algorithm that makes existing DP optimizers fast and memory efficient, with a cost comparable to that of non-private optimizers. We demonstrate BK via the computation graph in Figure 1. The highlight is that BK only uses one back-propagation and never instantiates per-sample gradients $\{\frac{\partial L_i}{\partial W}\}_{i=1}^{B}$.

2. [Analysis] We analyze the complexity to show that BK has almost the same time and space complexity as non-DP training, especially when the feature dimension is small (see Table 5).

3. [Extension] We strengthen BK using a layerwise decision to mix with Opacus (see Section 3.2), which proves to be efficient when the feature dimension is large (and difficult for GhostClip). We also extend BK to parameter efficient fine-tuning such as DP LoRA and Adapter.

4. [Codebase] We develop a Pytorch (Paszke et al., 2019)

2022a), the per-sample gradients can be clipped without being instantiated, thus both time and space complexity can be further improved if the feature dimension is small. We refer interested readers to Figure 3 and Appendix C for algorithmic details of these implementations. We now compare BK to different implementations in Table 2 and Figure 2. In what follows, $B$ is the batch size², $T_{(l)}$ is the feature dimension³, and $d_{(l)}, p_{(l)}$ are the input and output dimensions of a layer.

| | Non-DP | TF-privacy | Opacus | FastGradClip | GhostClip | BK (ours) |
| Instantiating per-sample grad | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| Storing every layer's grad | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Instantiating non-DP grad | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ |
| Number of back-propagation | 1 | B | 1 | 2 | 2 | 1 |
| Time Complexity of Clipping | 6BTpd | 6BTpd | 8BTpd | 8BTpd | 10BTpd + O(BT²) | ≈ 6BTpd |
| Memory Overhead to non-DP | 0 | 0 | Bpd | Bpd | 2BT² | min{2BT², BTpd} |
| Scalable to large model | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Scalable to high-dim input | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ |

Table 2: Summary of different DP implementations on a linear/convolution layer $\mathbb{R}^{B \times T_{(l)} \times d_{(l)}} \to \mathbb{R}^{B \times T_{(l)} \times p_{(l)}}$. The main bottleneck is marked in red.

1.3. PRELIMINARIES

We work with the (ϵ, δ)-DP by Dwork et al. (2006), which makes it difficult for any privacy attacker to distinguish or detect an arbitrary training sample, even with full access to the model (see Appendix A for details). In deep learning, DP is achieved by training on the private gradient in Equation (1) with any optimizer such as SGD, Adam, FedAvg, etc. Essentially, the private gradient is the addition of Gaussian noise to the sum of clipped per-sample gradients, which guarantees the DP protection through the privacy accounting theorems Abadi et al.

2. BOOK-KEEPING: EFFICIENT DP TRAINING IN LOW DIMENSION

The main computational bottleneck of DP training comes from the per-sample gradient clipping, or from the computation of per-sample gradient norms, to be exact. One widely used approach in Opacus, TF-privacy, and FastGradClip is to instantiate the per-sample gradients and then derive their norms. A straightforward implementation of this approach on a mini-batch of per-sample losses requires $B$ rounds of back-propagation (unacceptable slowdown) or $B\times$ gradient storage (unacceptable memory burden; see Opacus in Figure 2). Consequently, these implementations are not suitable for large model training. For instance, Li et al. (2021) shows that, when training GPT2-large (774M parameters), Opacus Yousefpour et al. (2021) and JAX Subramani et al. (2021) cannot fit even one single sample into a 24GB GPU.

An alternative approach, termed ghost clipping (GhostClip), directly computes the per-sample gradient norms without computing the gradients themselves. This is made possible, unfortunately, through two rounds of back-propagation. During the first back-propagation, one uses the regular loss $\sum_i L_i$ and extracts the activation tensor and the output gradient $(a, \frac{\partial L}{\partial s})$. One can use an algebraic trick in Equation (2) to compute the per-sample gradient norms $\{\|\frac{\partial L_i}{\partial W}\|\}_i$ and the clipping factors $\{C_i\}_i$ in Equation (1). During the second back-propagation, one uses the reweighted loss $\sum_i C_i L_i$, whose gradient is directly the weighted gradient $\sum_i C_i g_i$, which constitutes the private gradient we need. Note that this double back-propagation roughly doubles the training time (or to be more precise, 10/6 ≈ 1.667× when $T$ is small; see Table 2).

2.1. BOOK-KEEPING ALGORITHMS

BK algorithms in their base forms are built on GhostClip and especially the ghost norm trick, so as to avoid instantiating the memory costly per-sample gradients: as can be seen in Algorithm 1 and Figure 3, $\frac{\partial L_i}{\partial W} = a_i^\top \frac{\partial L}{\partial s_i}$ is not computed throughout the training. In comparison to GhostClip, our significant improvement is solely on the speed (see Table 2).

Algorithm 1
2: Get activation tensor $\{a_{(l),i}\}$ by Pytorch forward hook
3: for layer $l \in L, L-1, \cdots, 1$ do
4:     Get output gradient $\{\frac{\partial L}{\partial s_{(l),i}}\}$ by Pytorch backward hook
5:     Compute per-example gradient norm $\|\frac{\partial L_i}{\partial W_{(l)}}\|_F^2$ by ghost norm trick in Equation (2)
6: Aggregate gradient norm across all layers: $\|\frac{\partial L_i}{\partial W}\|_F^2 = \sum_l \|\frac{\partial L_i}{\partial W_{(l)}}\|_F^2$
7: Compute clipping factor: $C_i = C(\|\frac{\partial L_i}{\partial W}\|_F; R)$
8: for layer $l \in L, L-1, \cdots, 1$ do
9:     Compute sum of clipped gradients $G_l = a_{(l)}^\top \mathrm{diag}(C_1, C_2, \cdots) \frac{\partial L}{\partial s_{(l)}}$
10:     Delete $\{a_{(l),i}\}$, $\{\frac{\partial L}{\partial s_{(l),i}}\}$
11: Add Gaussian noise $\hat{G} = G + \sigma R \cdot \mathcal{N}(0, I)$
12: Apply SGD/Adam/LAMB with the private gradient $\hat{G}$ on $W$

Ghost norm trick. The ghost norm trick Goodfellow (2015) computes the gradient norm without the gradient: while the gradient is instantiated by the multiplication in Equation (2), the gradient norm can be computed without $a_i$ meeting $\frac{\partial L}{\partial s_i}$. This is applicable to generalized linear layers including the linear, the embedding Li et al. (2021), and the convolution layers Bu et al. (2022a). We demonstrate this trick using a simple linear layer $s_i = a_i W$, where $W \in \mathbb{R}^{d \times p}$ is the weight matrix, $a \in \mathbb{R}^{B \times T \times d}$ is the mini-batch input of this layer (a.k.a. the activation tensor) and $s \in \mathbb{R}^{B \times T \times p}$ is the output. Given that the output gradient $\frac{\partial L}{\partial s}$ is readily available in the back-propagation, for DP and standard training, one can directly derive the per-sample gradient norm

$\left\|\frac{\partial L_i}{\partial W}\right\|^2_{\text{Frobenius}} = \text{vec}\left(\frac{\partial L}{\partial s_i} \frac{\partial L}{\partial s_i}^\top\right) \cdot \text{vec}\left(a_i a_i^\top\right)$ (2)

without computing $\frac{\partial L_i}{\partial W} = a_i^\top \frac{\partial L}{\partial s_i}$. Here 'vec' means flattening the $T \times T$ matrix to a vector. This trick is particularly efficient when $T$ is small, reducing the space complexity from $O(Bpd)$ to $O(BT^2)$ by Table 3.

Ghost differentiation trick. This trick improves the time complexity on the first back-propagation in GhostClip, further reducing from $8BTpd + O(BT^2)$ to $6BTpd + O(BT^2)$ in Table 2. Our idea is to only compute the output gradient $\frac{\partial L}{\partial s_{(l)}}$ but not the parameter gradient $\frac{\partial L}{\partial W}$. That is, we break the $4BTpd$ time complexity of the full back-propagation into two sub-processes, each of $2BTpd$ complexity, and remove the unnecessary one. To be more specific, during the back-propagation of Opacus and GhostClip, the output gradient $\frac{\partial L}{\partial s}$ and then the parameter gradient $\frac{\partial L}{\partial W} = a^\top \frac{\partial L}{\partial s}$ are computed. However, we can stop after we obtain $\frac{\partial L}{\partial s}$: we only need the output gradient to compute the clipped parameter gradient $\frac{\partial \sum_i C_i L_i}{\partial W}$ in Line 9 of Algorithm 1. Therefore, the ghost differentiation trick sets all parameters to not require gradients (see technical details in Appendix D.2, including the origin parameter trick that propagates on a computation graph even when no parameters require gradients).
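The identity in Equation (2) and the weighted sum in Line 9 of Algorithm 1 can be checked numerically on a single linear layer. The sketch below is a plain-Python illustration under stated assumptions: tensors are nested lists, the helper names (`matmul`, `ghost_norm_sq`, `explicit_norm_sq`, `clipped_sum`) are hypothetical, the clipping factor is taken to be min{R/‖g_i‖, 1}, and random nonzero inputs stand in for real activations and output gradients. It is not the authors' implementation, which relies on Pytorch hooks.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def ghost_norm_sq(a_i, g_i):
    """Squared Frobenius norm of a_i^T g_i from two T x T Gram matrices (Equation (2))."""
    gram_a = matmul(a_i, transpose(a_i))  # a_i a_i^T, shape T x T
    gram_g = matmul(g_i, transpose(g_i))  # (dL/ds_i)(dL/ds_i)^T, shape T x T
    return sum(u * v for ra, rg in zip(gram_a, gram_g) for u, v in zip(ra, rg))

def explicit_norm_sq(a_i, g_i):
    """Reference value: instantiate the d x p per-sample gradient, then take its norm."""
    grad = matmul(transpose(a_i), g_i)
    return sum(v * v for row in grad for v in row)

def clipped_sum(a, g, R):
    """Sum of clipped gradients a^T diag(C) g without forming any per-sample gradient."""
    factors = [min(R / ghost_norm_sq(a_i, g_i) ** 0.5, 1.0) for a_i, g_i in zip(a, g)]
    rows_a = [row for a_i in a for row in a_i]  # stacked activations, (B*T) x d
    # Output gradients with each sample's rows rescaled by its factor, (B*T) x p.
    rows_g = [[c * v for v in row] for c, g_i in zip(factors, g) for row in g_i]
    return matmul(transpose(rows_a), rows_g), factors

# Hypothetical toy sizes: batch B, feature dimension T, input d, output p.
rng = random.Random(0)
B, T, d, p = 3, 4, 5, 2
a = [[[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)] for _ in range(B)]
g = [[[rng.gauss(0, 1) for _ in range(p)] for _ in range(T)] for _ in range(B)]
G, factors = clipped_sum(a, g, R=1.0)
```

In this sketch the ghost norm only ever builds T x T matrices per sample, while the clipped sum is a single d x p matrix for the whole batch, mirroring the space trade-off between $O(BT^2)$ and $O(Bpd)$ described above.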

2.2. COMPLEXITY OF DP IMPLEMENTATIONS: A MODULAR ANALYSIS

In this section, we analyze the complexity of DP implementations from their operation modules. We summarize the time and space complexity in Table 3 and give the derivation in Appendix B. We will refer to these modules by indices, e.g. 2a for the computation of output gradient. When the feature dimension $T$ is small, we claim that BK is almost as efficient as the standard non-private training, with a negligible $O(BT^2)$ time and memory overhead by Table 2:

Memory complexity: non-DP ≈ BK ≈ GhostClip < FastGradClip ≪ Opacus
Time complexity: non-DP ≈ BK < FastGradClip ≈ Opacus < GhostClip

Now, we discuss the cases where the data has low dimension and thus $T$ is small. Generally speaking, the feature dimension $T_{(l)}$ depends on both the data and the model. For non-sequential input and 1D audio data, $T = 1$. For sequential data such as texts ($T$ being sentence length) or time series ($T$ being time duration), $T_{(l)}$ is fixed across layers. In this case, BK is efficient on short-sequence datasets including GLUE. However, on the convolution layers with image data, $T_{(l)}$ is the product of hidden feature sizes (c.f. (Bu et al., 2022a, Section 3)), thus $T_{(l)}$ depends on the original image size and network architecture. For example, larger kernel size/dilation/stride in a convolution layer reduces $T_{(l)}$, while larger images have larger $T_{(l)}$ at each layer. Therefore, BK (and GhostClip) may suffer when training ResNet on ImageNet (224 × 224), as we show in Figure 6 (see also (Bu et al., 2022a, Table 7)), while training the same network efficiently on CIFAR10/100 (32 × 32).

2.4. APPLYING OUR TRICKS TO EXISTING IMPLEMENTATIONS

Our tricks in Section 2.1 can also improve other existing implementations, reducing the time complexity of GhostClip from $10BTpd + 2BT^2(p + d)$ to $6BTpd + 2BT^2(p + d)$, and that of Opacus and FastGradClip from $8BTpd$ to $6BTpd$. We highlight that these improved implementations are leveraged to design hybrid implementations in Section 3.2. In addition to DP full fine-tuning, BK is demonstrated in Appendix E.2 to also apply to parameter efficient fine-tuning like Adapters.

GhostClip = 1 + 2a + 2b + 3 + 2a + 2b → (ghost differentiation, book-keeping) → 1 + 2a + 3 + 2b (our BK)
Opacus = 1 + 2a + 2b + 4 + 5 → (ghost differentiation) → 1 + 2a + 4 + 5
FastGradClip = 1 + 2a + 4 + 2a + 2b → (book-keeping) → 1 + 2a + 4 + 2b
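The module compositions above can be turned into a small cost calculator. The sketch below, shown for GhostClip, BK and Opacus, rests on several assumptions that are not spelled out in this section: modules 1, 2a, 2b and 4 are each charged $2BTpd$ (consistent with the $4BTpd$ back-propagation split and the $8BTpd$ total quoted for Opacus), module 3 is charged $2BT^2(p+d)$, module 5 is taken to be a lower-order $2Bpd$ term, non-DP training is taken to be 1 + 2a + 2b, and the module descriptions in the comments are a reading of the text. All names are hypothetical.

```python
def module_costs(B, T, p, d):
    """Assumed time cost of each module on one generalized linear layer."""
    unit = 2 * B * T * p * d
    return {
        "1": unit,                      # forward pass
        "2a": unit,                     # output gradient
        "2b": unit,                     # parameter gradient
        "3": 2 * B * T * T * (p + d),   # ghost norm
        "4": unit,                      # per-sample gradient instantiation
        "5": 2 * B * p * d,             # weighted sum (assumed lower-order)
    }

# Compositions copied from the text; "non-DP" is an assumption of this sketch.
PIPELINES = {
    "non-DP": ["1", "2a", "2b"],
    "GhostClip": ["1", "2a", "2b", "3", "2a", "2b"],
    "BK": ["1", "2a", "3", "2b"],
    "Opacus": ["1", "2a", "2b", "4", "5"],
    "Opacus (improved)": ["1", "2a", "4", "5"],
}

def time_cost(name, B, T, p, d):
    """Total time cost of a pipeline as the sum of its module costs."""
    costs = module_costs(B, T, p, d)
    return sum(costs[m] for m in PIPELINES[name])

# Hypothetical layer sizes: short sequences on a square weight matrix.
B, T, p, d = 32, 100, 768, 768
ratios = {name: time_cost(name, B, T, p, d) / time_cost("non-DP", B, T, p, d)
          for name in PIPELINES}
```

Under these assumptions, the gap between BK and non-DP training is exactly the ghost norm term, which is why the overhead becomes negligible when $T$ is small.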

3. HYBRID BOOK-KEEPING: EFFICIENT DP TRAINING IN HIGH DIMENSION

In the previous section, we have analyzed DP implementations in the small T regime, where the ghost norm-based GhostClip and BK are efficient. Nevertheless, in the large T and large model regime, none of the base implementations may be efficient (see Figure 6) and we turn to hybrid methods.

3.1. LARGE T NECESSITATES NON-GHOST NORM METHOD

A closer look at the space complexity in Table 3 shows that the ghost norm trick is favored over the per-sample gradient instantiation if and only if $2T_{(l)}^2 < p_{(l)} d_{(l)}$, where $p_{(l)} d_{(l)}$ is the number of parameters at one layer. When this criterion is violated for large $T$, GhostClip/BK can significantly under-perform Opacus/FastGradClip, as shown in Figure 6, Figure 7 and Table 10.

Similar to Section 2.3, we discuss two cases where $T$ is large. For paragraph or document-level language tasks like WikiHop Welbl et al. (2018) and TriviaQA Joshi et al. (2017), $T$ can range from 2000 to 20000, which makes $2T^2$ range from 8M to 800M. For image tasks, particularly on CNN, $T_{(l)}$ varies at each layer with large values on top layers, as the features are less compressed by convolution and pooling. Taking ImageNet and the first convolution layer of VGG11 as an example (Bu et al., 2022a, Table 3), $2T^2 = 5 \times 10^9 \gg p_{(1)} d_{(1)} = 1.7 \times 10^3$. Consequently, ghost norm-based implementations (i.e. GhostClip and BK) cost more than 40GB memory on ResNet18, under $B = 32$, while Opacus only costs 2.5GB. This curse of dimension grows from a difficult issue on ImageNet to an impossible challenge on videos or high-resolution images, e.g. GhostClip cannot train ResNet18 with even one single CelebA-HQ image (1024 × 1024) using a 40GB GPU.

In short, the ghost norm trick is inefficient for large $T$ and the per-sample gradient instantiation is inefficient for large models. Therefore, we must hybridize the base implementations.
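The criterion $2T^2 < pd$ and the figures quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the VGG11 first-layer shape (3 × 3 kernel, 3 input channels, 64 output channels, 224 × 224 output feature map, with $d$ = input channels × kernel area and $p$ = output channels) is an assumption chosen to match the quoted $pd \approx 1.7 \times 10^3$ and $2T^2 \approx 5 \times 10^9$. The transformer-like layer is likewise a made-up example.

```python
def ghost_norm_space(T):
    """Per-sample space of the ghost norm: two T x T Gram matrices."""
    return 2 * T * T

def prefers_ghost_norm(T, p, d):
    """True when the ghost norm is cheaper than instantiating a p*d per-sample gradient."""
    return ghost_norm_space(T) < p * d

# Long-document language tasks: T between 2000 and 20000 tokens.
long_text_low = ghost_norm_space(2000)     # 8 million entries
long_text_high = ghost_norm_space(20000)   # 800 million entries

# Assumed first convolution layer of VGG11 on a 224 x 224 image.
vgg_T = 224 * 224
vgg_d = 3 * 3 * 3   # input channels times kernel area
vgg_p = 64          # output channels
vgg_prefers_ghost = prefers_ghost_norm(vgg_T, vgg_p, vgg_d)

# Hypothetical transformer linear layer with a short sequence.
short_seq_prefers_ghost = prefers_ghost_norm(T=128, p=768, d=768)
```

With these numbers the convolution layer fails the criterion by more than six orders of magnitude, whereas the short-sequence linear layer satisfies it comfortably.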

3.2. HYBRID IMPLEMENTATIONS VIA LAYERWISE DECISION

We adopt the same layerwise decision as Bu et al. (2022a), known as the mixed ghost norm technique: we use the ghost norm trick on a layer if $2T_{(l)}^2 < p_{(l)} d_{(l)}$, and instantiate per-sample gradients otherwise. Therefore, the space complexity of computing the per-sample gradient norm reduces to $\min\{2T_{(l)}^2, p_{(l)} d_{(l)}\}$, which is significantly cheaper than either the ghost norm or the per-sample gradient instantiation in high dimension, as depicted in Table 4 and Figure 7. Consequently, over all layers, the space complexity is lower than both constituting methods, e.g. saving more than 10× memory for the per-sample gradient clipping on ResNet18 (see more models in Table 10).

In contrast to the mixed ghost clipping (MixGhostClip) in Bu et al. (2022a), which hybridizes FastGradClip and GhostClip, we boost the efficiency by hybridizing our BK with the improved FastGradClip/Opacus in Section 2.4. We propose BK-MixOpt (and BK-MixGhostClip as an intermediate product only for comparison) and use MixGhostClip as a reference point,

• MixGhostClip = 1 + 2a + 2b + min{3, 4} + 2a + 2b ≈ min{GhostClip, FastGradClip},
• BK-MixGhostClip = 1 + 2a + min{3, 4} + 2b = min{BK, improved FastGradClip},
• BK-MixOpt = 1 + 2a + min{3 + 2b, 4 + 5} = min{BK, improved Opacus}.

Under review as a conference paper at 2023

The hybrid BK algorithms are presented in Algorithm 5. We summarize the layerwise complexity in Table 5, from which we derive the overall complexity in 

3.3. EFFECT OF MODEL ARCHITECTURE & FEATURE DIMENSION ON HYBRIDIZATION

We dive deeper to understand when the hybridization favors the ghost or non-ghost norm tricks. From a model architecture viewpoint, transformers such as ViT, RoBERTa, GPT tend to prefer the ghost norm: for moderate-sequence text data and moderate-dimension image data, hybrid BK algorithms are close to or equivalent to the base BK algorithm (see right-most plot in Figure 7). However, CNN prefers the per-sample gradient instantiation at top layers, and there exists a depth threshold below which the ghost norm is more efficient. Hence the hybridization is necessary to take advantage of both worlds. From the feature dimension viewpoint, larger input means this depth threshold is deeper, e.g. from the 9-th layer of ResNet18 to the 17-th layer in Figure 7, when the image size increases from 224 × 224 to 512 × 512. We visualize this effect of feature dimension on various models in Appendix G.

4. DISCUSSION

In this work, we propose the Book-Keeping (BK) algorithms to efficiently implement DP optimizers using three tricks: ghost norm, book-keeping, and ghost differentiation. Our BK reduces the time and space complexity of DP training to a level similar to that of standard training. Specifically, we develop hybrid BK to overcome the computational challenge of training large models with high-dimensional data, and we extend BK to parameter efficient fine-tuning such as LoRA and Adapter. One minor limitation of this work is that BK (and GhostClip) only apply to the weights, not the biases, of the generalized linear layers, i.e. embedding, linear, and convolution layers, though these weights constitute 99.9% of the trainable parameters (see Table 7). Implementation-wise, although BK should be as fast as standard training for small T, e.g. on MLP where T = 1, we observe some gap between the theoretical complexity and the throughput in practice. This gap is mainly due to the mechanism of Pytorch hooks, which can possibly be optimized by customizing the CUDA kernel or using symbolic programming.



¹ BLEU (BiLingual Evaluation Understudy) is a metric (0-100) for automatically evaluating translated text. BLEU > 60 is considered as "very high quality, adequate, and fluent translations, often better than human".

² We report the physical batch size, which affects the efficiency; the accuracy is only affected by the logical batch size, which can be implemented through the gradient accumulation of physical batch size.

³ For non-sequential data, T = 1; for texts, T is the sequence length, which is layer-independent; for images (or videos), T(l) is the height × width (× time) of the hidden feature representation, which is layer-dependent.



), with a clipping threshold R. At a high level, previous works have tackled the efficiency bottleneck with various approaches. One approach (part II) focuses on the parameter efficiency by partially training a neural network, in contrast to fully fine-tuning all model parameters, e.g. only the last output layer Tramer & Boneh (2020), the adapter layers Houlsby et al. (2019); Mahabadi et al. (2021), or the Low-Rank Adaptation (LoRA) Hu et al. (2021); Yu et al. (2021). For example, Mehta et al. (2022) accelerate the DP training on ImageNet Deng et al. (2009) up to 30× by only training the last layer of ResNet152. Noticeably, parameter efficient fine-tuning does not improve the efficiency in terms of complexity per parameter; rather, it reduces the number of parameters. Furthermore, this approach oftentimes leads to some accuracy degradation compared to DP full fine-tuning Bu et al. (2020); Mehta et al. (2022); Li et al. (2021); Yu et al. (2021).

Figure 1: Forward pass and back-propagation of the l-th linear layer (standard training is in black; DP training by our book-keeping algorithm is added in red). Here $a_{(l)}$ is the activation tensor, $s_{(l)}$ is the layer output, $W_{(l)}, b_{(l)}$ are weight and bias, $L_i, L$ are the per-sample loss and the summed loss. The dotted arrow represents the inter-layer operation such as activation, pooling, or normalization.

(2016); Mironov (2017); Dong et al. (2019); Zhu et al. (2021); Gopi et al. (2021); Koskela et al. (2020).

Figure 2: Speed and memory on MLP and CIFAR100 (images are flattened into vectors). Left to right: deep network (50 layers, width 1000, 50M parameters, batch size 128), shallow network (10 layers, width 1000, 10M parameters, batch size 128), and wide network (10 layers, width 5000, 250M parameters, batch size 128 or 1024; Opacus is OOM). See more ablation study in Appendix F.

Figure 3: Standard (non-DP), Opacus, FastGradClip, GhostClip, BK implementations. Notice that BK borrows the direct computation of the weighted gradient from Opacus, the computation of the ghost norm from GhostClip, and the use of auto-differentiation instead of full back-propagation from FastGradClip.

Figure 4: Backward propagation of BK algorithm ($L = \sum_i C_i L_i$).

Figure 5: Memory and speed of different DP implementations. Upper: GPT2 on E2E dataset (fixing B, DP speed is 0.86 ∼ 0.89× of non-DP). Lower: RoBERTa-large on GLUE datasets. Note here the hybrid implementations are equivalent to the base ones, because of the short sequence length.

Figure 6: Memory and speed by different implementations on 50000 images. Left: VGG11 (133M; Simonyan & Zisserman (2014)). Right: BEiT-large (304M; Bao et al. (2021)). Memory cost uses a physical batch size 1. Throughput uses the maximum physical batch size.

Figure 7: Layerwise space complexity of computing the per-sample gradient norm. Left to right: ResNet18 (224 × 224), ResNet18 (512 × 512), VGG11 (224 × 224), and ViT-base (224 × 224).

A preview of BK's efficiency on DP tasks (complexity in orange; extended in Table 9).

Table 2) through two novel tricks: the book-keeping and the ghost differentiation. The entire BK algorithm is built on the understanding of the computation graph in Appendix A. Note that these tricks also offer improved efficiency for existing implementations, to be presented in Section 2.4. We now elaborate on these tricks.

Table 3: Time and space complexity of modules in DP training for one generalized linear layer.

Table 4: Space complexity of the per-sample gradient clipping (not the entire DP algorithm) for B = 1 on ImageNet 224 × 224. Layerwise decision of hybrid BK algorithms is highlighted in bold.

Table 8 and observe that BK has almost the same complexity as non-DP training. Note that in low dimension, the mixed ghost norm is equivalent to the ghost norm, hence MixGhostClip/BK-MixOpt is equivalent to GhostClip/BK, respectively.

Table 5: Complexity of DP implementations on one layer. Here ⟨⟩ means between two values. The time complexity of BK-MixOpt is $6BTpd + 2BT^2(p + d) \cdot \mathbb{I}\{2T^2 < pd\}$.

