DIFFERENTIALLY PRIVATE OPTIMIZATION ON LARGE MODEL AT SMALL COST

Abstract

Differentially private (DP) optimization is the standard paradigm to learn large neural networks that are accurate and privacy-preserving. The computational cost of DP deep learning, however, is notoriously heavy due to the per-sample gradient clipping. Existing DP implementations are 2-1000× more costly in time and space complexity than the standard (non-private) training. In this work, we develop a novel Book-Keeping (BK) technique that implements existing DP optimizers (thus achieving the same accuracy), with a substantial improvement on the computational cost. Specifically, BK enables DP training on large models and high-dimensional data to be roughly as efficient as the standard training, whereas previous DP algorithms can be inefficient or incapable of training due to memory errors. The computational advantage of BK is supported by a complexity analysis as well as extensive experiments on vision and language tasks. Our implementation achieves state-of-the-art (SOTA) accuracy with very small extra cost: on GPT2, at almost the same memory cost (<1% overhead), BK has 1.03× the time complexity of the standard training (0.83× training speed in practice), and 0.61× the time complexity of the most efficient DP implementation (1.36× training speed in practice). We will open-source the codebase for the BK algorithm.



The efficiency bottleneck in DP deep learning lies in the per-sample gradient clipping, which restricts the magnitude of each per-sample gradient in the mini-batch. Applying the clipping jointly with the Gaussian noise addition, one can privately release the gradient to arbitrary optimizers such as SGD and Adam, and thus guarantee the privacy of the training as described in Section 1.3:

private gradient: Ĝ := Σ_i g_i · C(∥g_i∥_2) + σ_DP · N(0, I);  private optimizer (e.g. SGD): w_{t+1} = w_t − η Ĝ.    (1)

Here w is the model parameters, L_i is the per-sample loss, g_i = ∂L_i/∂w is the per-sample gradient, η is the learning rate, σ_DP is the noise magnitude that defines the privacy loss, and C(∥g_i∥) or simply C_i is the per-sample clipping factor with a clipping threshold R, e.g. min{R/∥g_i∥, 1} in Abadi et al. (2016). An orthogonal approach, including this work, focuses on the computation efficiency (part III), i.e. reducing the time and space complexity through efficient implementations, without affecting the DP optimizers (part I) and thus their performance. We will elaborate on multiple methods in Section 1. Remarkable speed differences have been observed across platforms in some cases, even with the same implementation: for example, Subramani et al. (2021) implemented DP-SGD using JAX and claimed its efficiency advantage over the same algorithm in Tensorflow or Pytorch.
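The private-gradient step of Eq. (1) can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation; the function names are ours, and the noise convention follows Eq. (1) as written (σ_DP · N(0, I)):

```python
import math
import random

def private_gradient(per_sample_grads, clip_threshold, sigma_dp):
    """Sketch of Eq. (1): sum of clipped per-sample gradients plus Gaussian noise.

    per_sample_grads: list of per-sample gradient vectors g_i (plain lists here).
    clip_threshold:   R, the clipping threshold.
    sigma_dp:         sigma_DP, the noise magnitude defining the privacy loss.
    """
    dim = len(per_sample_grads[0])
    g_hat = [0.0] * dim
    for g_i in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g_i))
        # Per-sample clipping factor C_i = min(R / ||g_i||, 1).
        c_i = min(clip_threshold / norm, 1.0) if norm > 0 else 1.0
        for j in range(dim):
            g_hat[j] += c_i * g_i[j]
    # Gaussian noise sigma_DP * N(0, I), added once to the summed gradient,
    # as in Eq. (1); some conventions additionally scale sigma by R.
    return [g + random.gauss(0.0, sigma_dp) for g in g_hat]

def sgd_step(w, g_hat, lr):
    """One private SGD update: w_{t+1} = w_t - eta * g_hat."""
    return [wi - lr * gi for wi, gi in zip(w, g_hat)]
```

Note that this naive sketch materializes every g_i before clipping, which is exactly the memory cost that efficient implementations (including BK) avoid.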

1.1. CONTRIBUTIONS

1. [Algorithm]

We propose the book-keeping (BK) algorithm that makes existing DP optimizers fast and memory-efficient, comparable to non-private optimizers. We demonstrate BK via the computation graph in Figure 1. The highlight is that BK only uses one back-propagation and never instantiates the per-sample gradients {∂L_i/∂w}_{i=1}^B.
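Avoiding per-sample gradients relies on the fact that, for a linear layer, the per-sample gradient norm can be computed directly from the layer's input activations and output gradients, as in GhostClip-style methods. A minimal pure-Python sketch of this ghost-norm identity (shapes are illustrative):

```python
def ghost_norm_sq(a_i, s_i):
    """Squared Frobenius norm of a linear layer's per-sample gradient, computed
    without materializing g_i (the 'ghost norm' trick used by GhostClip-style
    methods).

    a_i: T x d list of activation rows for one sample.
    s_i: T x p list of output-gradient rows for the same sample.

    Since g_i = s_i^T a_i, one has ||g_i||_F^2 = <a_i a_i^T, s_i s_i^T>:
    two T x T Gram matrices replace the d x p per-sample gradient.
    """
    T = len(a_i)
    total = 0.0
    for t in range(T):
        for u in range(T):
            aa = sum(x * y for x, y in zip(a_i[t], a_i[u]))  # (a_i a_i^T)[t, u]
            ss = sum(x * y for x, y in zip(s_i[t], s_i[u]))  # (s_i s_i^T)[t, u]
            total += aa * ss
    return total
```

With these norms in hand, the clipping factors C_i can be folded into a single weighted back-propagation, which is why no second pass over per-sample gradients is needed.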

2. [Analysis]

We analyze the complexity to show that BK has almost the same time and space complexity as non-DP training, especially when the feature dimension is small (see Table 5).

3. [Extension]

We strengthen BK using a layerwise decision to mix with Opacus (see Section 3.2), which proves efficient when the feature dimension is large (and difficult for GhostClip). We also extend BK to parameter-efficient fine-tuning such as DP LoRA and DP Adapter.
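The layerwise decision can be illustrated as a per-layer cost comparison. The rule below is a sketch under the assumption that the ghost-norm trick stores roughly 2T² extra floats per sample while instantiating a per-sample gradient stores d·p floats; the exact criterion is the one in Section 3.2:

```python
def use_ghost_clipping(T, d, p):
    """Illustrative layerwise rule: prefer the ghost-norm trick when its ~2T^2
    per-sample memory is cheaper than the d x p per-sample gradient that an
    Opacus-style implementation would instantiate.

    T: sequence length (or flattened spatial dimension) of the layer input.
    d: input width, p: output width of the layer.
    """
    return 2 * T * T <= d * p

# A wide layer with a moderate sequence length favors ghost clipping:
#   use_ghost_clipping(128, 768, 768) is True.
# High-resolution features flattened to a very long T with narrow channels
# favor instantiating per-sample gradients (Opacus-style):
#   use_ghost_clipping(50176, 64, 64) is False.
```

Mixing the two per layer is what lets the hybrid handle both large-feature vision layers and long-sequence language layers without paying the worst-case cost of either method everywhere.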



BLEU (BiLingual Evaluation Understudy) is a metric (0-100) for automatically evaluating machine-translated text. BLEU > 60 is considered "very high quality, adequate, and fluent translations, often better than human".
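For concreteness, a simplified single-reference BLEU can be sketched as follows. Real evaluations use corpus-level BLEU with smoothing and multiple references; this pure-Python version is only illustrative:

```python
import math
from collections import Counter

def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU (0-100): geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty.

    candidate, reference: lists of tokens.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Modified precision: each reference n-gram can be matched at most
        # as many times as it occurs in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero n-gram overlap zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 100; a candidate sharing no 4-gram with the reference scores 0 under this unsmoothed variant.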



Differential privacy (DP; Dwork et al. (2006)) has shown strong performance while guaranteeing rigorous protection against privacy risks, especially on large models that tend to memorize and leak the training data Carlini et al. (2021); Haim et al. (2022); Shokri et al. (2017). For example, recent advances have shed light on the success of DP GPT2 Li et al. (2021); Bu et al. (2022b); Yu et al. (2021), which achieves a 64.6 BLEU score 1 at a strong privacy guarantee (ϵ = 3) on the text generation task using the E2E restaurant review dataset. This is only marginally below the standard non-private GPT2 (BLEU score 66.8). Similarly, on computer vision tasks (ϵ = 2), DP vision transformers and ResNets have obtained 97.1%/86.2% accuracy on CIFAR10/100 by Bu et al. (2022a) and over 81% accuracy on ImageNet by De et al. (2022); Mehta et al. (2022). However, DP training of large neural networks is well-known to be computationally burdensome in comparison to the standard training, in terms of both the training time and the memory cost. For instance, training a small recurrent neural network (0.598M parameters) experiences a 1000× slowdown using DP optimizers in the Tensorflow-Privacy (TF-Privacy) library Bu et al. (2021), and training a small convolutional neural network (CNN, 0.605M parameters) on CIFAR10 has a 24× slowdown with Tensorflow 2 and the XLA compiler Subramani et al. (2021). Even with SOTA efficient implementations, large models such as RoBERTa Liu et al. (2019), GPT2 Radford et al. (2019), ResNet He et al. (2016), VGG Simonyan & Zisserman (2014), ViT Dosovitskiy et al. (2020) and its variants experience about a 2-3× slowdown in Pytorch Li et al. (2021); Bu et al. (2022a) and a 2-9× slowdown in JAX Kurakin et al. (2022); De et al. (2022), with possibly 4-20× memory overhead Bu et al. (2022a); Li et al. (2021); Subramani et al. (2021), if not out of memory.

At a high level, previous work has tackled the efficiency bottleneck with various approaches. Part II focuses on the parameter efficiency by partially training a neural network, in contrast to fully fine-tuning all model parameters, e.g. only the last output layer Tramer & Boneh (2020), the adapter layers Houlsby et al. (2019); Mahabadi et al. (2021), or the Low-Rank Adaptation (LoRA) Hu et al. (2021); Yu et al. (2021). For example, Mehta et al. (2022) accelerate the DP training on ImageNet Deng et al. (2009) by up to 30× by only training the last layer of ResNet152. Noticeably, parameter-efficient fine-tuning reduces the number of trainable parameters rather than improving the efficiency in terms of complexity per parameter. Furthermore, this approach oftentimes leads to some accuracy degradation compared to DP full fine-tuning Bu et al. (2020); Mehta et al. (2022); Li et al. (2021); Yu et al. (2021).
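To make the parameter-efficiency point concrete: a LoRA layer computes h = Wx + α·BAx and trains only the low-rank factors A (r×d) and B (p×r), shrinking the number of privatized parameters from p·d to r·(p+d). A pure-Python sketch (shapes and the scaling α are illustrative assumptions, not the exact recipe of any cited work):

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """Sketch of a LoRA forward pass: h = W x + alpha * B (A x).

    x: input vector of length d.
    W: p x d frozen pretrained weight (rows of length d).
    A: r x d and B: p x r trainable low-rank factors; only these would be
       updated (and privatized) under DP training.
    """
    def matvec(M, v):  # M: rows x cols, v: length cols -> length rows
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)                  # frozen path
    low_rank = matvec(B, matvec(A, x))   # rank-r trainable update
    return [b + alpha * l for b, l in zip(base, low_rank)]
```

For instance, with d = p = 768 and rank r = 4, the trainable parameter count drops from 768 · 768 = 589,824 to 4 · (768 + 768) = 6,144, which is why LoRA-style DP fine-tuning privatizes far fewer gradients per step.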

Additionally, these methods can be compiled on different platforms (part IV) such as Tensorflow 2 (XLA), JAX, and Pytorch Li et al. (2021); Subramani et al. (2021); De et al. (2022); Kurakin et al. (2022).

We demonstrate the efficiency of BK on training large models, saving memory by up to 10× and boosting the speed by 30% to 5× over previous DP implementations. A preview of BK's efficiency on DP tasks is given (complexity in orange; extended in Table 9).

