DIFFERENTIALLY PRIVATE BIAS-TERM ONLY FINE-TUNING OF FOUNDATION MODELS

Abstract

We study the problem of differentially private (DP) fine-tuning of large pre-trained models, a recent privacy-preserving approach to solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraints, yet requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy of DP algorithms and the efficiency of standard BiTFiT. DP-BiTFiT is model-agnostic (it does not modify the network architecture), parameter efficient (training only about 0.1% of the parameters), and computation efficient (almost removing the overhead caused by DP, in both time and space complexity). On a wide range of tasks, DP-BiTFiT is 2 ∼ 30× faster and uses 2 ∼ 8× less memory than DP full fine-tuning, and is even faster than standard full fine-tuning. This remarkable efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult for existing methods.

Our main algorithmic contribution is the Differentially Private Bias-Term Fine-Tuning (DP-BiTFiT) method in Algorithm 1, which is highly accurate under the DP constraint, on par with SOTA in Section 4. Specifically, we propose a two-phase training in Section 4.4 to close the accuracy gap between DP-BiTFiT and DP full fine-tuning.

1. INTRODUCTION

Fine-tuning from large pre-trained neural networks is one of the most critical techniques in deep learning, yielding strong performance in a variety of domains (Pan & Yang, 2009; Kenton & Toutanova, 2019; Goyal et al., 2017). Among different methods, full fine-tuning is the most prevalent: it trains all the model parameters on the downstream tasks and achieves high accuracy within a small number of training epochs. However, full fine-tuning on large models, from hundreds of millions (He et al., 2016; Chen et al., 2016) to billions of parameters (Brown et al., 2020), can be burdensome in terms of computation and deployment, since a full copy of the fine-tuned model parameters is needed for each task. To alleviate this issue, parameter-efficient fine-tuning trains only a substantially small portion of the model parameters, in contrast to full fine-tuning. At a high level, parameter-efficient fine-tuning methods can be divided into two categories. ⟨1⟩ Model-aware methods, in which a relatively small number of new parameters are introduced into the neural network architecture and only the new parameters are optimized. Examples include LoRA (Hu et al., 2021), Adapter (Houlsby et al., 2019), and Compacter (Mahabadi et al., 2021). ⟨2⟩ Model-agnostic methods, in which only a subset of the existing parameters are trainable. Examples include training only the output linear layer (linear probing, Kornblith et al., 2019), only the layer normalization layers (Houlsby et al., 2019), and bias-term fine-tuning (BiTFiT) (Zaken et al., 2022). We illustrate the differences in Equation (1): W_0, b_0 are the pre-trained weights and biases, 'ˆ' indicates trainable parameters, and θ denotes the additional parameters.
f(x; W_0, b_0) [pre-trained model] → f(x; Ŵ, b̂) [full fine-tuning], or f(x; W_0, b_0, θ̂) [model-aware fine-tuning], or f(x; W_0, b̂) [bias-term fine-tuning]    (1)

Empirically, these parameter-efficient fine-tuning methods have achieved high accuracy, comparable to full fine-tuning in the standard non-private setting.

Contributions. In this work, we develop DP-BiTFiT, a fine-tuning method that is model-agnostic, accurate, privacy-preserving, parameter efficient, and computationally efficient.
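The bias-term variant in Equation (1) can be sketched in a few lines of PyTorch: freeze every parameter except the biases, then train as usual. The toy model and optimizer below are illustrative choices, not the paper's exact setup.

```python
# Sketch of BiTFiT: only bias terms (b-hat in Eq. 1) are trainable;
# the pre-trained weights W_0 stay frozen.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Freeze weights; keep only bias terms trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3)

# One illustrative training step on random data.
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(n_trainable, n_total)  # only the two bias vectors (32 + 4 = 36) train
```

Because the trainable set is just the bias vectors, the optimizer state and the gradients that must be stored are tiny relative to the full model.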



Figure 1: Performance of different fine-tuning methods on MNLI dataset with RoBERTa-large.

For example, RoBERTa (Liu et al., 2019) and BERT (Kenton & Toutanova, 2019) achieve about 94% on SST2, 87% on MNLI, and on average 85% across the General Language Understanding Evaluation (GLUE) datasets (He et al., 2021; Hu et al., 2021). In addition, parameter-efficient methods are faster than full fine-tuning and significantly reduce the communication cost in distributed learning.

Parallel to these developments, the success of deep learning models relies on the availability of large datasets, which may contain sensitive information that must be protected rigorously. This privacy issue is well-known, as neural networks can be vulnerable to privacy attacks: membership information can be leaked from purchase records via Google and Amazon online services (Shokri et al., 2017); sensitive texts can be reconstructed by specifically designed prefixes on GPT2 (Carlini et al., 2021), and so can images in CIFAR10 and MNIST (Haim et al., 2022). To protect against such privacy risks, the standard technique is differential privacy (DP, formally stated in Definition 2.1), which randomizes the standard optimizers by updating with the private gradient in Equation (2).

A recent line of work has extensively studied DP fine-tuning in both computer vision and language tasks, often achieving less than 3% accuracy drop across different settings via full fine-tuning (De et al., 2022; Li et al., 2021; Bu et al., 2022b;a), linear probing (Mehta et al., 2022), LoRA, Adapter, or Compacter (Yu et al., 2021a). In fact, fine-tuning or pre-training from a large dataset is considered necessary in the DP deep learning literature. As a matter of fact, full fine-tuning of DP-GPT2 only achieves a 24.2 BLEU score (ϵ = 8) on the E2E dataset if randomly initialized (Li et al., 2021), in stark contrast to 63.2 BLEU if pre-trained; similarly, the state-of-the-art (SOTA) DP accuracy on ImageNet is 48% (ϵ = 10) without pre-training (Kurakin et al., 2022) but 86.7% if pre-trained (De et al., 2022).
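The private gradient used by DP optimizers is formed by clipping each per-sample gradient to l2 norm at most C and adding Gaussian noise with standard deviation σC. The sketch below uses the standard DP-SGD recipe (Abadi et al., 2016) in NumPy; the symbols C and sigma follow that convention and are not specific to this paper's notation.

```python
# Sketch of the private gradient: per-sample clipping + Gaussian noise.
import numpy as np

def private_gradient(per_sample_grads, C=1.0, sigma=1.0, rng=None):
    """Clip each per-sample gradient to norm <= C, sum, add N(0, (sigma*C)^2) noise,
    and average over the batch."""
    rng = np.random.default_rng(rng)
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, C / max(norm, 1e-12)))  # per-sample clipping
    noise = sigma * C * rng.standard_normal(per_sample_grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_sample_grads)

grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]  # l2 norms 5 and 0.5
g_priv = private_gradient(grads, C=1.0, sigma=0.0, rng=0)
print(g_priv)  # sigma=0: ([0.6, 0.8] + [0.3, 0.4]) / 2 = [0.45, 0.6]
```

The per-sample clipping step is exactly the source of DP's computational overhead: unlike standard training, it requires access to each individual sample's gradient, not just their sum.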
Specifically, parameter-efficient DP fine-tuning has empirically demonstrated strong accuracy (see our Table 4) with 3 ∼ 4× memory saving and 2 ∼ 3× speedup compared to DP full fine-tuning by Opacus (c.f. Figure 3 and Yu et al., 2021a, Table 3). Although previous works have shed light on various DP fine-tuning methods, we are the first to study DP-BiTFiT specifically and to show two of its distinctive advantages.

Firstly, DP-BiTFiT is model-agnostic and retains its parameter efficiency of around 0.1% across models (see Table 1). While linear probing is also model-agnostic, its parameter ratio can be as high as 8% in ResNet50. Other methods like LoRA, Adapter, and Compacter are architecture-dependent and possibly parameter inefficient, making them difficult to apply directly to arbitrary neural networks: LoRA and Adapter may need to train more than 12% of the parameters on BART-large (Lewis et al., 2020) to achieve high accuracy (He et al., 2021, Figures 1 & 4).

Secondly, DP-BiTFiT is computationally efficient, almost as efficient as the standard BiTFiT and significantly more efficient than DP full fine-tuning, particularly with large models and high-dimensional input data. Regarding DP full fine-tuning, Li et al. (2021) have reported a 2 ∼ 4× slowdown on large language models for four advanced private codebases and up to 5× memory overhead, compared to standard fine-tuning; even on small networks, 11 codebases across Tensorflow, JAX, and Pytorch have demonstrated 0.2 ∼ 5× slowdown and 3 ∼ 100× reduction in maximum batch size in Subramani et al. (2021). See more discussion in Section 3.3.
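The parameter-efficiency claim above is easy to check for any architecture: count the bias parameters against the total. A minimal sketch, using a toy model rather than the actual transformers from the paper's tables (on RoBERTa- or GPT-scale models this ratio lands near 0.1%):

```python
# Sketch: measure the fraction of a model's parameters that are bias
# terms, i.e. the trainable fraction under BiTFiT.
import torch.nn as nn

def bias_fraction(model: nn.Module) -> float:
    n_bias = sum(p.numel() for n, p in model.named_parameters()
                 if n.endswith("bias"))
    n_total = sum(p.numel() for p in model.parameters())
    return n_bias / n_total

# Toy stand-in for a transformer block: two wide linear layers + LayerNorm.
toy = nn.Sequential(nn.Linear(1024, 1024), nn.LayerNorm(1024),
                    nn.Linear(1024, 1024))
print(f"{bias_fraction(toy):.4%}")  # prints a small fraction, ~0.15% here
```

Because `bias_fraction` only inspects `named_parameters()`, it works unchanged on any `nn.Module`, which is exactly the model-agnostic property the text describes.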

