DIFFERENTIALLY PRIVATE BIAS-TERM ONLY FINE-TUNING OF FOUNDATION MODELS

Abstract

We study the problem of differentially private (DP) fine-tuning of large pre-trained models, a recent privacy-preserving approach to solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraints, yet requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy of DP algorithms and the efficiency of standard BiTFiT. DP-BiTFiT is model-agnostic (not modifying the network architecture), parameter efficient (training only about 0.1% of the parameters), and computation efficient (almost entirely removing the overhead caused by DP, in both time and space complexity). On a wide range of tasks, DP-BiTFiT is 2 ∼ 30× faster and uses 2 ∼ 8× less memory than DP full fine-tuning, and is even faster than standard full fine-tuning. This remarkable efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult using existing methods.

1. Algorithmically, we propose Differentially Private Bias-Term Fine-Tuning (DP-BiTFiT) in Algorithm 1, which is highly accurate under the DP constraint and on par with SOTA in Section 4. Specifically, we propose a two-phase training in Section 4.4 to close the accuracy gap between DP-BiTFiT and DP full fine-tuning.

1. INTRODUCTION

Fine-tuning from large pre-trained neural networks is one of the most critical techniques in deep learning, yielding strong performance in a variety of domains (Pan & Yang, 2009; Kenton & Toutanova, 2019; Goyal et al., 2017). Among different methods, full fine-tuning is the most prevalent one: it trains all the model parameters on the downstream tasks and achieves high accuracy within a small number of training epochs. However, full fine-tuning of large models, from hundreds of millions (He et al., 2016; Chen et al., 2016) to billions of parameters (Brown et al., 2020), can be burdensome in terms of computation and deployment, since a full copy of the fine-tuned model parameters is needed for each task. To alleviate this issue, parameter efficient fine-tuning trains only a small portion of the model parameters, in contrast to full fine-tuning. At a high level, parameter efficient fine-tuning methods can be divided into two categories. ⟨1⟩ Model-aware methods, meaning a relatively small number of parameters are introduced into the neural network architecture and only the new parameters are optimized. Examples include LoRA (Hu et al., 2021), Adapter (Houlsby et al., 2019), and Compacter (Mahabadi et al., 2021). ⟨2⟩ Model-agnostic methods, meaning that only a subset of the existing parameters are trainable. Examples include training only the output linear layer (linear probing, Kornblith et al., 2019), only the layer normalization layers (Houlsby et al., 2019), and bias-term fine-tuning (BiTFiT) (Zaken et al., 2022). We illustrate the differences in Equation (1): W_0, b_0 are the pre-trained weights and biases, 'ˆ' indicates trainable parameters, and θ denotes the additional parameters. Empirically, these parameter efficient fine-tuning methods have achieved accuracy comparable to full fine-tuning in the standard non-private setting.
For instance, linear probing of ResNet (He et al., 2016) and Vision Transformer (ViT, Dosovitskiy et al., 2020) achieves 80% accuracy on the ImageNet dataset (Sun et al., 2017; Kornblith et al., 2019); LoRA and BiTFiT of RoBERTa (Liu et al., 2019) and BERT (Kenton & Toutanova, 2019) achieve about 94% on SST2, 87% on MNLI, and on average 85% across the General Language Understanding Evaluation (GLUE) datasets (He et al., 2021; Hu et al., 2021). In addition, parameter efficient methods are faster than full fine-tuning and significantly reduce the communication cost in distributed learning. Parallel to these developments, the success of deep learning models relies on the availability of large datasets, which may contain sensitive information that must be rigorously protected. This privacy issue is well known, since neural networks can be vulnerable to privacy attacks: membership information can be leaked from purchase records via Google and Amazon online services (Shokri et al., 2017); sensitive texts can be reconstructed by specifically designed prefixes on GPT2 (Carlini et al., 2021), and so can images in CIFAR10 and MNIST (Haim et al., 2022). To protect against such privacy risks, the standard technique is differential privacy (DP, formally stated in Definition 2.1), which randomizes the standard optimizers by updating with the private gradient in Equation (2). A recent line of work has extensively studied DP fine-tuning in both computer vision and language tasks, often achieving less than 3% accuracy drop across different settings via full fine-tuning (De et al., 2022; Li et al., 2021; Bu et al., 2022b;a), linear probing (Mehta et al., 2022), LoRA, Adapter, or Compacter (Yu et al., 2021a). In fact, fine-tuning or pre-training from a large dataset is considered necessary in the DP deep learning literature.
As a matter of fact, fully fine-tuned DP-GPT2 only achieves a 24.2 BLEU score (ϵ = 8) on the E2E dataset if randomly initialized (Li et al., 2021), in stark contrast to 63.2 BLEU if pre-trained; similarly, the state-of-the-art (SOTA) DP accuracy on ImageNet is 48% (ϵ = 10) without pre-training (Kurakin et al., 2022), but 86.7% if pre-trained (De et al., 2022). Specifically, parameter efficient DP fine-tuning has empirically demonstrated strong accuracy (see our Table 4) with 3 ∼ 4× memory saving and 2 ∼ 3× speedup compared to DP full fine-tuning by Opacus (c.f. Figure 3 and Yu et al., 2021a, Table 3). Although previous works have shed light on various DP fine-tuning methods, we are the first to study DP-BiTFiT specifically and to show two distinctive advantages of it. Firstly, DP-BiTFiT is model-agnostic and retains its parameter efficiency of around 0.1% across models (see Table 1). While linear probing is also model-agnostic, its parameter (in)efficiency can be as high as 8% on ResNet50. Other methods like LoRA, Adapter, and Compacter are architecture-dependent and possibly parameter inefficient, making them difficult to apply directly to arbitrary neural networks: LoRA and Adapter may need to train more than 12% of the parameters on BART-large (Lewis et al., 2020) to achieve high accuracy, by He et al. (2021, Figures 1 & 4). Secondly, DP-BiTFiT is computationally efficient, almost as much as standard BiTFiT and significantly more so than DP full fine-tuning, particularly with large models and high-dimensional input data. For DP full fine-tuning, for example, Li et al. (2021) have reported 2 ∼ 4× slowdown on large language models across four advanced private codebases and up to 5× memory overhead, compared to standard fine-tuning; even on small networks, 11 codebases across Tensorflow, JAX, and Pytorch have demonstrated 0.2 ∼ 5× slowdown and 3 ∼ 100× reduction in maximum batch size in Subramani et al. (2021). See more discussion in Section 3.3.

Contributions.

In this work, we develop DP-BiTFiT, a fine-tuning method that is model-agnostic, accurate, privacy-preserving, parameter efficient, and computationally efficient.

2. DP-BiTFiT is model-agnostic and only optimizes about 0.1% of the model parameters on BERT, RoBERTa, GPT2, ViT, ResNet, and so on (see Table 1). Thus DP-BiTFiT is one of the most parameter efficient fine-tuning methods, compared with DP LoRA, Adapter, linear probing, etc.

3. We design a computationally efficient implementation of DP-BiTFiT, whose time and space complexity is almost the same as that of standard non-DP BiTFiT, while being faster than non-DP full fine-tuning and other DP fine-tuning (see Figure 1). This advantage is analyzed in Table 2, and demonstrated via the substantial speedup and memory saving in Figure 3 and Figure 4.

4. DP-BiTFiT is the only DP algorithm whose computation overhead is independent of the feature dimension T. This is due to the activation-free forward pass, which is only possible when no weights are trained. Therefore, DP-BiTFiT enjoys a special advantage in working efficiently on long-sequence texts and high-resolution images (see Figure 3).

Novelty. At a glance, our results may appear incremental, as we merely add differential privacy to an existing method (BiTFiT) through a standard mechanism (DP-SGD). This is not the case! Computationally, our implementation of DP-BiTFiT involves substantial algorithmic innovation that exploits the special structure of the forward and backward passes of the per-example gradient computation, hence removing the computational and memory overhead of DP-SGD. Statistically, it is quite surprising that one can achieve nearly the same accuracy in DP fine-tuning, on both vision and language tasks, when optimizing only 0.1% of the parameters.
Algorithm 1 DP-BiTFiT
3: for layer l ∈ {L, L-1, ..., 1} do
4:    Get the output gradient ∂L/∂s_l
5:    Compute the per-example bias gradient and its norm: ∂L_i/∂b_l = (∂L/∂s_{l,i})^⊤ 1 ⟹ ∥∂L_i/∂b_l∥_F^2
6: Aggregate the gradient norms across all layers: ∥∂L_i/∂b∥_F^2 = Σ_l ∥∂L_i/∂b_l∥_F^2
7: Compute the clipping factor: C_i = C(∥∂L_i/∂b∥_F; R)
8: Compute the sum of clipped gradients: G = Σ_i C_i · ∂L_i/∂b
9: Add Gaussian noise: G ← G + σR · N(0, I)
10: Descend on the bias terms with the gradient G by SGD/Adam/...
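To make Lines 5-10 concrete, here is a minimal NumPy sketch of one DP-BiTFiT update (the function name, shapes, and the Abadi-style clipping choice are our own illustration, not the paper's code):

```python
import numpy as np

def dp_bitfit_step(output_grads, R, sigma, rng):
    """output_grads: per-layer arrays of shape (B, T_l, p_l) holding dL/ds_l."""
    # Line 5: per-example bias gradients -- just a sum over the feature dim T
    per_ex = [g.sum(axis=1) for g in output_grads]            # each (B, p_l)
    # Line 6: aggregate squared gradient norms across all layers
    sq_norms = sum((g ** 2).sum(axis=1) for g in per_ex)      # shape (B,)
    # Line 7: clipping factor, here Abadi's C_i = min(R / ||g_i||, 1)
    C = np.minimum(R / np.sqrt(sq_norms + 1e-12), 1.0)
    # Lines 8-9: sum the clipped per-example gradients, then add Gaussian noise
    return [(C[:, None] * g).sum(axis=0)
            + sigma * R * rng.standard_normal(g.shape[1]) for g in per_ex]
```

With σ = 0 and a large R, the result reduces to the ordinary summed bias gradient, which serves as a quick sanity check.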

2. PRELIMINARIES

Fine-tuning methods. Fine-tuning, i.e. training a model on a large dataset for a sufficiently long time and then continuing to train it (or transferring it) on the downstream datasets, is the standard paradigm for achieving high accuracy in both the standard and the DP regimes. In DP deep learning, the pre-training takes place on a public dataset using regular optimizers like SGD, and the fine-tuning takes place on a private dataset that requires privacy protection, using DP optimizers like DP-SGD in Section 2. In a long line of research, various fine-tuning methods have been proposed. One of the most popular methods is full fine-tuning, which simply runs gradient descent on all trainable weights and biases, and thus can be inefficient when the model is large. To improve the efficiency, Li & Liang (2021) propose prefix tuning, which only optimizes the prompts or the input-layer activations (Lester et al., 2021; Liu et al., 2021). However, as pointed out in Hu et al. (2021) and Li et al. (2021), prefix tuning can be difficult to optimize and thus sub-optimal on large models. Another approach is to reduce the number of trainable parameters. For example, LoRA (Hu et al., 2021), Adapter (Houlsby et al., 2019; Rebuffi et al., 2017; Pfeiffer et al., 2021; Rücklé et al., 2021; Lin et al., 2020), and Compacter (Mahabadi et al., 2021) insert small 'adapter' layers (usually 1-10% of the total parameters) between existing layers, and only the newly added adapters are optimized. We describe the forms of LoRA and Adapter in Appendix C and analyze their complexity. In addition to the aforementioned methods, BiTFiT is a special parameter efficient method that rivals full fine-tuning (Zaken et al., 2022; Cai et al., 2020; He et al., 2021). Firstly, BiTFiT optimizes a subset of the original parameters, the bias terms, which usually constitute less than 1/1000 of all parameters, as demonstrated in Table 1.
Therefore, BiTFiT can be readily deployed to any network in a model-agnostic manner. Secondly, BiTFiT is fundamentally different from other parameter efficient methods such as LoRA, since the bias gradients are computed differently from the weight gradients on the computation graph. We will elaborate on this in Equation (4).

Deep learning with differential privacy. We recall the classic (ϵ, δ)-DP, under which we train deep neural networks with provable privacy guarantees.

Definition 2.1 (Dwork et al., 2006). A randomized algorithm M is (ε, δ)-differentially private if, for any two neighboring datasets S, S′ that differ by one datapoint and for any event E, we have P[M(S) ∈ E] ⩽ e^ε P[M(S′) ∈ E] + δ.

In deep learning, DP can be achieved by applying an off-the-shelf optimizer (SGD or Adam) with a privately released stochastic gradient in place of the regular Σ_i g_i. The private stochastic gradient is computed by first drawing a minibatch I via Poisson sampling, then computing

Private gradient: Σ_{i∈I} g_i · C(∥g_i∥; R) + σR · N(0, I),   (2)

where C: R+ → R is any clipping function subject to C(x) ≤ R/x, g_i is the i-th per-sample gradient, R is the clipping threshold, and σ is the noise multiplier. The private gradient is guaranteed to be DP through the sampled Gaussian mechanism and the associated tight privacy accounting that composes over the iterations (see, e.g., Abadi et al., 2016; Wang et al., 2019; Mironov et al., 2019; Koskela et al., 2020; Bu et al., 2020; Gopi et al., 2021, and the references therein).

Backward propagation. We briefly introduce back-propagation, which reveals a simple yet important difference between the gradients of weights and those of biases. We consider a linear layer, indexed as the l-th layer, with weight W_l ∈ R^{d×p} and bias b_l ∈ R^p. We leave the derivation for other layers, such as normalization and convolution, to Appendix A.
We denote the mini-batched input of this layer as a_l ∈ R^{B×T×d} and the immediate output as s_l ∈ R^{B×T×p}, where B is the batch size and T is the feature dimension:

a_{l+1} = ϕ(s_l), s_l = a_l W_l + b_l.

Here ϕ is any non-parametric inter-layer operation, e.g. a non-linear activation (like ReLU), pooling, padding, and so on. We write L = Σ_{i=1}^n L_i as the total loss, with L_i the per-sample loss of the i-th sample. During a standard back-propagation through L layers, the chain rule keeps track of the output gradient at each layer in a just-in-time fashion:

∂L/∂s_l = (∂L/∂a_L) · (∂a_L/∂s_{L-1}) · (∂s_{L-1}/∂a_{L-1}) ··· (∂a_{l+1}/∂s_l) = (∂L/∂s_{l+1} · W_{l+1}^⊤) ◦ ϕ′(s_l).   (3)

This output gradient ∂L/∂s_l is used to compute the per-sample gradients of weights and biases:

(∂L_i/∂W_l)^⊤ = Σ_j (∂L_i/∂s_{l,j})^⊤ (∂s_{l,j}/∂W_l) = (∂L/∂s_{l,i})^⊤ a_{l,i},   (∂L_i/∂b_l)^⊤ = Σ_j (∂L_i/∂s_{l,j})^⊤ (∂s_{l,j}/∂b_l) = (∂L/∂s_{l,i})^⊤ 1.   (4)

Notably, the weight gradient needs the activation tensor a_l in an expensive O(BTpd) tensor multiplication. Memory-wise, {a_l} across all layers is very costly to store (taking more than 95% of the memory across VGG, ResNet, DenseNet, RoBERTa, etc., by Jain et al. (2020, Figure 3)). In sharp contrast, the computation of the bias gradient does not need a_l, and the multiplication with 1 in Equation (4) is actually a cheap O(BTp) summation over ∂L/∂s_l: B × T × p → B × p.

Forward propagation and the hook. During the forward propagation, all Pytorch-based codebases for DP algorithms, such as Private Transformers, Opacus, FastGradClip, and others (Yu et al., 2021a; Bu et al., 2022a), register forward hooks to extract the activation tensors {a_l} of all layers from the computation graph, where each a_l is computed and stored.
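As a sanity check of Equation (4), the following NumPy sketch (toy sizes of our own choosing, with a simple quadratic loss) verifies numerically that the bias gradient of a linear layer is just the output gradient summed over the feature dimension, with no dependence on the activation a_l:

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, d, p = 2, 5, 3, 4
a = rng.standard_normal((B, T, d))
W = rng.standard_normal((d, p))
b = rng.standard_normal(p)

def loss(b_):
    s = a @ W + b_               # bias broadcasts over B and T
    return 0.5 * (s ** 2).sum()  # simple quadratic loss, so dL/ds = s

s = a @ W + b
dL_ds = s
analytic = dL_ds.sum(axis=(0, 1))   # multiply by 1 = sum over B and T

# central finite differences w.r.t. each bias coordinate
eps = 1e-6
numeric = np.array([
    (loss(b + eps * np.eye(p)[j]) - loss(b - eps * np.eye(p)[j])) / (2 * eps)
    for j in range(p)
])
assert np.allclose(analytic, numeric, atol=1e-4)
```

Note that the per-sample gradient ∂L_i/∂b_l sums only over T; the total gradient above additionally sums over the batch.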
Hence, the majority of the memory burden is on the activations, which grow extremely large for huge models like GPT3 (Brown et al., 2020) with 175B parameters: the activation tensors consume more than 3600GB of memory, while the parameters and gradients only consume 300GB (Rajbhandari et al., 2020). On one hand, this issue can be alleviated by the activation recomputation or checkpointing technique (Chen et al., 2016; Jain et al., 2020), whose memory cost reduces from O(L) to O(√L) at the price of an extra 33% slowdown. Alternatively, we note that the activation tensors are not necessary in the forward propagation if we only optimize the bias terms.
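As a rough back-of-the-envelope illustration (our own hypothetical sizes, fp16 storage, ignoring attention-specific buffers), caching one activation tensor per layer already scales as B · T · d · L:

```python
def activation_gib(B, T, d, L, bytes_per=2):
    # memory to cache one (B, T, d) activation tensor for each of L layers, in GiB
    return B * T * d * L * bytes_per / 2**30

# e.g. a GPT2-large-like width (d = 1600, L = 48) at B = 32, T = 1024
cached = activation_gib(B=32, T=1024, d=1600, L=48)
```

Checkpointing cuts the O(L) factor to O(√L) at roughly 33% slowdown, while bias-only training sidesteps this cache entirely for the per-sample gradient computation.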

3. DIFFERENTIALLY PRIVATE BIAS-TERM FINE-TUNING

We propose DP-BiTFiT, which privately trains only the bias terms of a neural network by combining Equation (4) and Equation (2). We use shaded lines to represent the additional DP operations in Algorithm 1, and add the DP-related variables and operations in red in the computation graph of Figure 2. Implementation-wise, DP-BiTFiT differs from all existing DP algorithms (including full fine-tuning, LoRA, Adapter, etc.) that optimize weights, since it does not apply a Pytorch forward hook to store the activation a_l for any layer. We provide the implementation details of DP-BiTFiT in Appendix B. To give a concrete example, we apply DP-BiTFiT to the RoBERTa-large model on the QQP dataset, following the same setting as Li et al. (2021) and using one 40GB A100 GPU. This is the most time-consuming text classification task in our work, taking 119 minutes per epoch at a training batch size of 20 with the fastest DP full fine-tuning implementation, GhostClip (Li et al., 2021). In a simple ablation study, setting all weights to not require gradients (while the forward hooks still operate) reduces the training time to 80 minutes; removing the forward hooks further reduces the training time to 63 minutes; finally, using the maximum batch size allowed by the memory-saving DP-BiTFiT reduces it to 43 minutes.

3.1. PARAMETER EFFICIENCY

DP-BiTFiT enjoys exactly the same parameter efficiency as standard BiTFiT, training merely about 0.1% of the total parameters in large models. We demonstrate that DP-BiTFiT is one of the most parameter efficient fine-tuning methods through the list of models in Table 1. An advantage of this parameter efficiency is reflected in the computation efficiency, given that most parameters do not require gradients to be computed: we show in Table 2 and Section 3.3 that DP-BiTFiT is much more efficient than full fine-tuning (DP, and even non-DP). Additionally, the parameter efficiency also translates into communication efficiency in distributed learning. For example, the 64-bit communication cost of DP full fine-tuning is 64MD, where M is the number of workers and D is the total number of parameters; this can be improved to 0.064MD by DP-BiTFiT.
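To illustrate the order of magnitude, one can count the bias fraction directly; the block structure below is a simplified hypothetical transformer (four attention projections plus a two-layer FFN per block), not an exact architecture:

```python
def bias_fraction(d_model, d_ff, n_layers):
    # per block: 4 attention projections (d x d) and a 2-layer FFN (d x d_ff)
    weights = n_layers * (4 * d_model * d_model + 2 * d_model * d_ff)
    biases = n_layers * (4 * d_model + d_ff + d_model)
    return biases / (weights + biases)

frac = bias_fraction(d_model=768, d_ff=3072, n_layers=12)  # BERT-base-like sizes
# Distributed communication scales the same way:
# 64*M*D bits for full fine-tuning vs roughly 64*M*(frac*D) for BiTFiT.
```

The result lands on the order of 0.1%, consistent with Table 1.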

3.2. COMPLEXITY OF WEIGHT AND BIAS TRAINING

We present in Table 2 the complexity of DP training on weights and biases. Here, DP weight training (full fine-tuning) uses three efficient implementations that are mathematically equivalent but have different complexity: Opacus (Yousefpour et al., 2021), GhostClip (Goodfellow, 2015; Li et al., 2021), and MixGhostClip (Bu et al., 2022a). The first two implementations are illustrated in Figure 2; MixGhostClip is a hybridization of them that reduces to GhostClip when T is small. These implementations have been thoroughly analyzed in Bu et al. (2022a, Appendix C), and we take the complexity results from Bu et al. (2022a, Table 1). For the complexity of bias training in Table 2, it suffices to analyze Line 5 of Algorithm 1. We refer the interested reader to Table 8 for details, where we also apply the complexity analysis of weight training to other methods beyond full fine-tuning, including DP LoRA and DP Adapter.

3.3. SCALABILITY OF DP ALGORITHMS

From the complexity analysis in Table 2, we observe that DP training on weights can be memory costly, especially when the models are large and the data is high-dimensional. As an example of the large-model issue, Li et al. (2021) show that Opacus cannot fit even a single datapoint into a 16GB GPU using GPT2-large (Radford et al.) with 774M parameters, due to its O(B Σ_l p_l d_l) space complexity, where the number of parameters is Σ_l p_l d_l; for high-dimensional data, GhostClip cannot fit a single 400 × 400 image into the same GPU using ResNet18 with 11.7M parameters, due to its O(B Σ_l T_l^2) space complexity. Although MixGhostClip (Bu et al., 2022a) significantly alleviates the memory issue in both cases, it does so at the cost of a roughly 2× slowdown relative to standard full fine-tuning (c.f. Bu et al., 2022a, Figure 4). In sharp contrast, DP-BiTFiT is remarkably scalable, since its computational overhead is negligible and independent of T (though the total complexity, mainly due to the forward pass and output gradients, is still linear in T).
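The dominant space-complexity terms above can be compared directly; the batch size, fp32 storage, and layer shapes below are stand-ins of our own choosing, not measurements:

```python
def gib(n_floats, bytes_per=4):
    return n_floats * bytes_per / 2**30

B = 16
opacus_term = gib(B * 774_000_000)          # O(B * sum_l p_l d_l): per-sample weight grads
ghostclip_term = gib(B * (400 * 400) ** 2)  # O(B * sum_l T_l^2): one early conv layer alone
bitfit_term = gib(B * 1_000_000)            # O(B * sum_l p_l): bias-sized per-sample grads
```

Even at these toy sizes, the bias-only term is orders of magnitude below either weight-training term, which is why both failure modes above disappear for DP-BiTFiT.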

Efficiency of DP training vs. feature dimension

To empirically evaluate the computation efficiency of DP fine-tuning methods, we measure the time and GPU memory for a fixed batch size. We depict the high-dimensional data issue in Figure 3, in which the memory saving and speedup by DP-BiTFiT are substantial. We expect an even greater efficiency advantage of DP-BiTFiT on higher-dimensional data, e.g. in document-level language tasks with T ≈ 20000 by Beltagy et al. (2020), and in high-resolution image tasks such as 1024 × 1024 CelebA-HQ (Karras et al., 2018) and Flickr-Faces-HQ (Karras et al., 2019), where T can be of order 10^5 in the convolution layers.

Efficiency of DP training vs. model size

To stress-test the computation efficiency of DP-BiTFiT with large models, we apply the maximum batch size for each fine-tuning method, instead of using a fixed one across methods. Therefore, DP-BiTFiT can further leverage its memory efficiency to achieve the best throughput. Here we consider one setting of high-dimensional data (T = 512^2) but small ResNets (11.7 ∼ 58.2M parameters), and another setting of low-dimensional data (T = 100) but large GPT2 models (125 ∼ 774M parameters).

4. EXPERIMENTS

We now test the accuracy of DP-BiTFiT on natural language and computer vision tasks, with the settings in Appendix D. For the DP full fine-tuning algorithms, we use GhostClip (Li et al., 2021) on texts and MixGhostClip (Bu et al., 2022a) on images, which achieve SOTA efficiency and accuracy on these datasets, respectively. We compute ϵ using the conversion from RDP, though the tighter privacy accountants in Section 2 are also feasible. We illustrate in Table 3 that tuning the learning rate for BiTFiT is not difficult, and we observe in all experiments that, with or without DP, the optimal learning rate for BiTFiT is larger than that for full fine-tuning.

[Table 3: Test accuracy on SST2 under ϵ = 8, using DP-Adam with AUTO-S clipping, sweeping learning rates on RoBERTa-base for DP-BiTFiT (5e-4 ∼ 1e-2), DP full fine-tuning (1e-4 ∼ 1e-3), and non-DP full fine-tuning (1e-5 ∼ 1e-4).]

4.1. TEXT CLASSIFICATION

We experiment on the MNLI-m (matched) (Williams et al., 2018), QQP (Iyer et al., 2017), QNLI (Rajpurkar et al., 2016), and SST2 (Socher et al., 2013) datasets. Competitive algorithms include reparameterized gradient perturbation (RGP, Yu et al., 2021c), LoRA, Adapter, and Compacter (Yu et al., 2021a). We use the same setup as Li et al. (2021) on RoBERTa models, only increasing the learning rate for DP-BiTFiT. Additional results with different clipping functions and under a stronger privacy guarantee, ϵ = 3, can be found in Table 13.

[Table 4: Accuracy of fine-tuning methods (full fine-tuning (Li et al., 2021), RGP, Adapter, LoRA, Compacter (Yu et al., 2021a), and BiTFiT (ours)) with RoBERTa, under ϵ = 8. More non-private fine-tuning results (similar to those here) can be found in Yu et al. (2021a); Hu et al. (2021); Zaken et al. (2022). Note that linear probing of RoBERTa-base only gets 87.2% on SST2 and 77.3% on QNLI.]

In Table 4, DP-BiTFiT is highly parameter efficient and on par with other DP fine-tuning methods in terms of accuracy. As indicated by Figure 1 and Figure 3, over 2× speedup and over 3× memory saving are observed when switching from DP full fine-tuning to DP-BiTFiT across datasets. Remark 4.1. It is encouraging to observe that the gap between full fine-tuning and BiTFiT, with or without DP, tends to decrease as the model size increases. For instance, on QNLI, this gap reduces from 4.1% to 1.4% without privacy, and from 1.4% to 0.1% with privacy. This scaling pattern is consistently observed on different tasks, e.g. in Table 5 and Table 6.

4.2. NATURAL LANGUAGE GENERATION

We compare DP-BiTFiT with DP LoRA, full fine-tuning, and prefix tuning (Li & Liang, 2021) on the E2E dataset (Dusek et al., 2020), training GPT2 to generate texts that evaluate a restaurant. The performance measures are BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), NIST (Sadjadi et al., 2018), METEOR (Banerjee & Lavie, 2005), CIDEr (Vedantam et al., 2015), and perplexity. We use the same setup as Bu et al. (2022b) with automatic clipping, only increasing the learning rate for DP-BiTFiT. More results under a stronger privacy guarantee, ϵ = 3, can be found in Table 14. In Table 5, DP-BiTFiT shows strong performance, even outperforming DP full fine-tuning on GPT2-large, as well as both computation and parameter efficiency (see Figure 4). Similar to Remark 4.1, the gap in BLEU score between DP-BiTFiT and DP full fine-tuning reduces from -3.06/-3.20 (GPT2-small/medium) to +0.57 (GPT2-large) as the model size increases. We refer to Table 14 for a more significant pattern when ϵ = 3.

4.3. IMAGE CLASSIFICATION

We further experiment with DP-BiTFiT on CIFAR10/CIFAR100 (32 × 32 pixels, resized to 224 × 224) and CelebA (218 × 178 pixels, not resized), after pre-training on ImageNet (224 × 224 pixels).

4.4. TWO-PHASE TRAINING

We introduce two-phase training, denoted as X+BiTFiT, which first applies DP full fine-tuning for X epochs and then DP-BiTFiT for the rest of the training. Hence, X+BiTFiT becomes DP full fine-tuning when X equals the total number of epochs, and reduces to DP-BiTFiT when X = 0. Empirically speaking, X ≤ 2 suffices to achieve accuracy comparable to full fine-tuning, while still enjoying the speedup.

5. DISCUSSION

In this work, we study DP-BiTFiT, which privately trains only the bias terms of neural networks. The highlights of DP-BiTFiT are its accuracy, parameter efficiency, and computation efficiency, realized by not caching the activation tensors during the forward pass and not back-propagating the gradients of the weights. This consequently allows DP-BiTFiT to be as fast and memory-saving as its non-private counterpart, and thus particularly suitable for large models and high-dimensional data. As future directions, DP-BiTFiT can be readily combined with prefix-based tuning and weight-based fine-tuning, e.g. DP Adapter+BiTFiT and DP LoRA+BiTFiT, via f(x; W_0, b̂, θ̂) in the notation of Equation (1). Specifically, such a combination can decide whether the weights and/or biases should be fine-tuned in a layer-wise manner. For instance, we can optimize only the embedding layer (which has no bias terms) and all bias terms in the other layers. We expect this approach, interpolating between full fine-tuning and BiTFiT in parallel to our two-phase training, to circumvent the limitation that DP-BiTFiT is sometimes sub-optimal on small models or difficult tasks.

A DETAILED ANALYSIS OF BACK-PROPAGATION

We rigorously analyze the neural network represented in Section 2: for sample index i ∈ [B],

a_{l+1,i} = ϕ(s_{l,i}), s_{l,i} = a_{l,i} W_l + 1 · b_l,   (5)

where a_{l+1,i} ∈ R^{T×d′}, s_{l,i} ∈ R^{T×p}, a_{l,i} ∈ R^{T×d}, W_l ∈ R^{d×p}, 1 ∈ R^{T×1}, and b_l ∈ R^{1×p}. Then the per-sample weight gradient is given by the chain rule as

(∂L_i/∂W_l)^⊤ = Σ_j (∂L_i/∂s_{l,j})^⊤ (∂s_{l,j}/∂W_l) = (∂L_i/∂s_{l,i})^⊤ (∂s_{l,i}/∂W_l) = (∂L_i/∂s_{l,i})^⊤ a_{l,i} = (∂L/∂s_{l,i})^⊤ a_{l,i},

in which the second equality holds when there is no parameter sharing (so that each per-sample loss only depends on the i-th input and output); the last equality holds for the same reason. Similarly, the per-sample bias gradient is

(∂L_i/∂b_l)^⊤ = Σ_j (∂L_i/∂s_{l,j})^⊤ (∂s_{l,j}/∂b_l) = (∂L_i/∂s_{l,i})^⊤ (∂s_{l,i}/∂b_l) = (∂L_i/∂s_{l,i})^⊤ 1 = (∂L/∂s_{l,i})^⊤ 1.

We additionally demonstrate that the bias gradient is independent of the input a_l for the convolution (1d/2d/3d) and normalization layers. For convolution, s_l is the inversely folded output and a_l is the unfolded input, so the forward pass takes the same form as that of the linear layer in Equation (5). Notice that T is the product of the hidden feature dimensions (c.f. Bu et al. (2022a)), which depends on the padding, kernel sizes, strides, etc. For the batch, layer, group, and instance normalization, the forward pass is

s_{l,i} = (a_{l,i} - E(a_l)) / √(Var(a_l) + 0.00001) ⊙ W_l + 1 · b_l,

which can be analyzed similarly to Equation (5).

B IMPLEMENTATION OF DP-BITFIT

In this section we describe the implementation of DP-BiTFiT, which only uses the Pytorch backward hook but not the forward hook, and is thus different from existing packages such as FastGradClip (Lee & Kifer, 2020), Opacus (Yousefpour et al., 2021), Private Transformers (Li et al., 2021), and Private CNN (Bu et al., 2022a). Notice that in these packages, the forward hook is used to store the activation tensor a_l for all layers, which incurs a huge memory burden, as discussed in Section 2. The Pytorch backward hook is a function, registered on a torch Module (a layer in the neural network), that is executed during the backward propagation. The backward hook automatically extracts the input gradient ∂L/∂a_l and the output gradient ∂L/∂s_l of the layer. In DP-BiTFiT, we call register backward hook to register a backward hook implementing Line 5 of Algorithm 1 for each linear layer R^{B×T×d} → R^{B×T×p}, where biases denotes the collection of all bias terms in all layers.
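Since the original code listing did not survive in this copy, here is a small pure-Python model of the mechanism; the class and hook signatures imitate, but are not, Pytorch's backward-hook API:

```python
import numpy as np

class Linear:
    def __init__(self, W, b):
        self.W, self.b, self._hooks = W, b, []

    def register_backward_hook(self, fn):
        self._hooks.append(fn)

    def forward(self, a):
        return a @ self.W + self.b          # note: a is NOT stored

    def backward(self, dL_ds):
        for fn in self._hooks:
            fn(self, dL_ds)                 # the hook sees only dL/ds
        return dL_ds @ self.W.T             # pass the gradient upstream

def bitfit_hook(layer, dL_ds):
    # Line 5 of Algorithm 1: per-example bias gradient and its squared norm,
    # computed from the output gradient alone
    layer.bias_grad_sample = dL_ds.sum(axis=1)               # (B, p)
    layer.norm_sq = (layer.bias_grad_sample ** 2).sum(axis=1)  # (B,)
```

The key point the sketch makes is that `forward` never caches its input, yet the hook still recovers everything Line 5 needs.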

C COMPLEXITY ANALYSIS

We provide more details on the time and space complexity analysis. The analysis for full fine-tuning has been presented in Bu et al. (2022a, Appendix C) and is adapted here to parameter efficient fine-tuning. For example, Adapter (Houlsby et al., 2019) uses two matrices W_down ∈ R^{p×r} and W_up ∈ R^{r×p} that constitute

x ← x + GeLU(x · W_down) · W_up.

Hence the complexity, in comparison to full fine-tuning, changes by replacing d → 2r. LoRA (Hu et al., 2021) also uses two matrices, W_down ∈ R^{d×r} and W_up ∈ R^{r×p}, which constitute

x ← x · W + x · W_down · W_up.
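In code, the two update forms read as follows (NumPy sketch with toy shapes of our own choosing; the tanh approximation of GeLU is our pick, not the paper's):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def adapter(x, W_down, W_up):
    # x <- x + GeLU(x @ W_down) @ W_up, with W_down: (p, r), W_up: (r, p)
    return x + gelu(x @ W_down) @ W_up

def lora(x, W, W_down, W_up):
    # x <- x @ W + x @ W_down @ W_up, with W_down: (d, r), W_up: (r, p)
    return x @ W + (x @ W_down) @ W_up
```

The rank-r bottleneck is what changes the complexity relative to full fine-tuning: the replaced dimension becomes 2r (Adapter) or an added rank-r path (LoRA).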



As summarized in Table 2 and Table 8, the computation overhead of obtaining the per-sample weight gradient norm is linear in T (by instantiating per-sample gradients) or quadratic in T (if using the ghost norm trick (Goodfellow, 2015; Li et al., 2021)), for DP full and parameter efficient fine-tuning.

Footnotes. (2) We distinguish the weight training and the bias training in Section 2 using the chain rule. Note that activation-free means memory-saving, which is not leveraged by DP full fine-tuning, LoRA, Adapter, Compacter, etc. (3) Examples of gradient clipping include, but are not limited to, Abadi's clipping min(R/∥g_i∥, 1) (Abadi et al., 2016) and automatic clipping (AUTO-S) R/(∥g_i∥ + 0.01) (Bu et al., 2022b; Yang et al., 2022). (4) In sequential data such as text, T is the sequence length; in vision data, T is the product of the input dimensions (e.g. for images, T is the product of height and width). We refer to an input as high-dimensional when T is large.



f(x; W_0, b_0) [pre-trained model] → f(x; Ŵ, b̂) [full fine-tuning], or f(x; W_0, b_0, θ̂) [model-aware fine-tuning], or f(x; W_0, b̂) [bias-term fine-tuning].   (1)

Figure 1: Performance of different fine-tuning methods on MNLI dataset with RoBERTa-large.

Figure 2: Back-propagation for DP (red & black) and non-DP (black) algorithms. Left: full fine-tuning with GhostClip (ghost clipping; Goodfellow, 2015; Li et al., 2021; Bu et al., 2022a). Upper right: full fine-tuning with Opacus (Yousefpour et al., 2021). Lower right: BiTFiT.

Figure 3: Memory and speed by different fine-tuning methods. Left two: SST2 dataset (sequence length T ; MixGhostClip is equivalent to GhostClip for this small T ) with RoBERTa-base and batch size 20. Right two: 50000 images of √ T × √ T pixels with ResNet50 and batch size 200.

Figure 4: Maximum throughput and batch size by different fine-tuning methods. Left two: E2E dataset with GPT2-small/medium/large (MixGhostClip is equivalent to GhostClip for this small T ). Right two: 50000 images of 512 × 512 pixels with ResNet 50/101/152.

Figure 5: Accuracy by epochs with BEiT-large on CIFAR100.

in Equation (4), which consists of the per-sample gradient instantiation (i.e. the summation along the feature dimension, R^{T×p} → R^p, mapping ∂L/∂s_{l,i} to ∂L_i/∂b_l) and the computation of the per-sample gradient norm (i.e. squaring each entry and summing over all indices). Each of these operations takes Bp time complexity, meaning the total time complexity is 3Bp, while the space complexity is Bp if operated in place.

D EXPERIMENT DETAILS

D.1 LANGUAGE TASKS

Throughout this work, the text datasets are processed and loaded from Huggingface (Lhoest et al., 2021). We follow the same setup as Li et al. (2021); Bu et al. (2022b), e.g. δ = 0.5/(sample size). The full fine-tuning is implemented by the Private Transformers codebase, version 0.2.0 (i.e. the GhostClip algorithm, Li et al. (2021)). For text classification, we experiment on four datasets: MNLI(m), the matched split of the Multi-Genre Natural Language Inference Corpus; QQP, the Quora Question Pairs2 dataset; QNLI, the Stanford Question Answering NLI dataset; and SST2, the Stanford Sentiment Treebank dataset.

[Table: accuracy under standard training (ϵ = ∞) and DP training with Abadi and AUTO clipping, at ϵ = 8 and ϵ = 3; the SST2 row begins at 94.5.]

extended in Table 12.

Parameter efficiency of (DP) BiTFiT.

Table 2 gives the complexity of DP training on weights and biases, for one layer mapping B × T_l × d_l to B × T_l × p_l. To elaborate on Footnote 4: for text data, T_l is the sequence length, d_l is the input dimension, and p_l is the output dimension; for image data, and specially in a convolution layer, T_l is the height times the width, d_l is the input channels times the kernel sizes, and p_l is the output channels (c.f. Bu et al., 2022a, Section 2.3). Notice that the total complexity of training a network is summed across all layers, e.g. the time complexity of standard full training is 6B Σ_l T_l p_l d_l, that of DP full fine-tuning is over 8B Σ_l T_l p_l d_l, and that of DP-BiTFiT is about 4B Σ_l T_l p_l d_l. Therefore, our complexity analysis indicates that DP-BiTFiT is 6/4 = 1.5× faster than non-private full fine-tuning and over 8/4 = 2× faster than DP full fine-tuning.

Per-layer time and space complexity of training on weights (full fine-tuning) and biases.

Performance of fine-tuning methods with GPT2, under ϵ = 8. LoRA and prefix results are documented in

32 pixels, resized to 224 × 224) and CelebA (218 × 178 pixels, not resized) after pre-training on ImageNet (224 × 224 pixels). Unlike on language tasks, DP-BiTFiT can be less satisfactory, e.g. inducing a 30% test accuracy gap on the CelebA [Smiling] classification in Table 6, though this gap can often be closed by increasing the model size, similar to Remark 4.1. Alternatively, we can leverage a two-phase training to interpolate between full fine-tuning and BiTFiT in the next section.

Accuracy of DP fine-tuning methods on CIFAR10 and CelebA. More results under different ϵ and network architectures can be found in Appendix E.3.

Table 7 and Appendix E.3. 1+BiTFiT outperforms the previous SOTA by DP full fine-tuning (Bu et al., 2022a), which used BEiT-large: CIFAR10 97.1% → 98.8% and CIFAR100 86.2% → 88.7%, under ϵ = 2. 2+BiTFiT is comparable to the previous SOTA, 87.05/87.58% → 86.54/86.71%, on CelebA in Table 17, under ϵ = 3/8.
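A minimal sketch of the two-phase "X+BiTFiT" schedule: run (DP) full fine-tuning for the first X epochs, then freeze everything except the bias terms. Trainability is modeled here as a per-parameter boolean flag standing in for a framework's requires_grad mechanism; the parameter names and the training loop are illustrative placeholders, not the paper's code.

```python
# Illustrative parameter set: two layers, each with a weight and a bias.
params = {"layer1.weight": True, "layer1.bias": True,
          "layer2.weight": True, "layer2.bias": True}

def switch_to_bitfit(params):
    """Freeze every parameter except the bias terms (BiTFiT selection)."""
    for name in params:
        params[name] = name.endswith(".bias")

X, total_epochs = 2, 5   # X+BiTFiT: X epochs of full fine-tuning first
schedule = []
for epoch in range(total_epochs):
    if epoch == X:
        switch_to_bitfit(params)  # phase 2: bias-only (DP-BiTFiT) training
    # run_dp_epoch(params)  # placeholder for one DP-SGD training epoch
    schedule.append(sorted(n for n, trainable in params.items() if trainable))
# epochs 0..X-1 train all parameters; epochs X.. train only the biases
```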

Hence the complexity, in comparison to full fine-tuning, changes by replacing pd → r(p + d).

Per-layer time and space complexity of training on weights (full and parameter efficient fine-tuning) and biases. '+' means additional overhead to non-DP training.
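The substitution pd → r(p + d) can be checked with a quick count: a rank-r update stores two factors of sizes p × r and r × d in place of a full p × d matrix, which is the standard low-rank (e.g. LoRA) setup. The dimensions below are illustrative.

```python
def full_cost(p, d):
    return p * d          # cost term of a full p x d weight update

def low_rank_cost(p, d, r):
    return r * (p + d)    # cost term after the rank-r substitution

p, d, r = 1024, 1024, 16  # illustrative layer width and rank
ratio = full_cost(p, d) / low_rank_cost(p, d, r)
# the low-rank form is cheaper whenever r < pd / (p + d)
```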

Hyperparameters of text classification in Table 4 and Table 13, using RoBERTa (base/large).

Parameter efficiency of (DP) BiTFiT on various models.

E.2 MORE RESULTS ON DP-BITFIT AND LANGUAGE TASKS

Accuracy of full fine-tuning and BiTFiT with RoBERTa, under different per-sample clipping functions (indicated as subscript: Abadi, Abadi et al. (2016), and AUTO-S, Bu et al. (2022b)), for ϵ = ∞ (standard), ϵ = 8, and ϵ = 3. Same setting as Appendix D.

Accuracy of fine-tuning with GPT2 on the E2E dataset. LoRA and prefix results are taken from Li et al. (2021). Same setting as Appendix D.

Accuracy of two-phase fine-tuning on CIFAR10. Same setting as Appendix D.2 except ViT uses the following learning rate: DP full fine-tuning 5e-4, DP-BiTFiT 5e-3.

Accuracy of two-phase fine-tuning on CIFAR100. Same setting as Appendix D.2 except ViT uses the following learning rate: DP full fine-tuning 5e-4, DP-BiTFiT 5e-3.


For the E2E generation task, we experiment with GPT2 models using the same optimizer as in Bu et al. (2022b)

