DIFFERENTIALLY PRIVATE BIAS-TERM ONLY FINE-TUNING OF FOUNDATION MODELS

Abstract

We study the problem of differentially private (DP) fine-tuning of large pre-trained models, a recent privacy-preserving approach to solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraints, yet requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy of DP algorithms and the efficiency of standard BiTFiT. DP-BiTFiT is model-agnostic (it does not modify the network architecture), parameter-efficient (training only about 0.1% of the parameters), and computation-efficient (almost removing the overhead caused by DP, in both time and space complexity). On a wide range of tasks, DP-BiTFiT is 2∼30× faster and uses 2∼8× less memory than DP full fine-tuning, and is even faster than standard full fine-tuning. This remarkable efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult for existing methods.

1. INTRODUCTION

Fine-tuning from large pre-trained neural networks is one of the most critical techniques in deep learning, yielding strong performance in a variety of domains (Pan & Yang, 2009; Kenton & Toutanova, 2019; Goyal et al., 2017). Among different methods, full fine-tuning is the most prevalent: it trains all the model parameters on the downstream tasks and achieves high accuracy within a small number of training epochs. However, full fine-tuning of large models, from hundreds of millions (He et al., 2016; Chen et al., 2016) to billions of parameters (Brown et al., 2020), can be burdensome in terms of computation and deployment, since a full copy of the fine-tuned model parameters is needed for each task. To alleviate this issue, parameter-efficient fine-tuning trains only a substantially smaller portion of the model parameters, in contrast to full fine-tuning. At a high level, parameter-efficient fine-tuning methods can be divided into two categories, described below. Empirically, these parameter-efficient fine-tuning methods have achieved accuracy comparable to full fine-tuning in the standard non-private setting. For instance, linear probing of ResNet (He et al., 2016) and Vision Transformer (ViT, Dosovitskiy et al., 2020) achieves 80% accuracy on the ImageNet dataset (Sun et al., 2017; Kornblith et al., 2019); LoRA and BiTFiT
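As a concrete sketch of the linear-probing idea mentioned above (not code from the paper, and using a toy two-layer network as a stand-in for a ResNet or ViT backbone), one can freeze every pre-trained parameter and train only a new output linear layer:

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained backbone (e.g. ResNet or ViT).
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
for p in backbone.parameters():
    p.requires_grad = False  # pre-trained weights stay fixed

# The output linear layer is the only trainable module.
head = nn.Linear(64, 10)
model = nn.Sequential(backbone, head)

# Only the head's parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3)
```

Because gradients are never computed for the frozen backbone, the optimizer state and backward pass touch only the small head, which is what makes linear probing cheap in both parameters and memory.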



⟨1⟩ Model-aware methods, meaning a relatively small number of parameters are introduced into the neural network architecture and only the new parameters are optimized. Examples include LoRA (Hu et al., 2021), Adapter (Houlsby et al., 2019), and Compacter (Mahabadi et al., 2021).

⟨2⟩ Model-agnostic methods, meaning that only a subset of the existing parameters are trainable. Examples include training only the output linear layer (linear probing, (Kornblith et al., 2019)), only the layer normalization layer (Houlsby et al., 2019), and bias-term fine-tuning (BiTFiT) (Zaken et al., 2022).

We illustrate the differences in Equation (1): W_0, b_0 are the pre-trained weights and biases, 'ˆ' indicates trainable parameters, and θ is the additional parameters.

f(x; W_0, b_0)   →   f(x; Ŵ, b̂)          full fine-tuning, or
(pre-trained model)  f(x; W_0, b_0, θ̂)    model-aware fine-tuning, or
                     f(x; W_0, b̂)         bias-term fine-tuning.        (1)
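The bias-term variant of Equation (1) can be sketched in a few lines of PyTorch (a toy two-layer model stands in for a large pre-trained network; this is an illustration, not the paper's implementation): freeze the weights W_0 and mark only the bias terms b̂ as trainable.

```python
import torch
import torch.nn as nn

# Toy stand-in for a large pre-trained model.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# BiTFiT: train only bias terms, freeze all weights W_0.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4f}")  # biases only
```

Even in this tiny example, the biases make up under 3% of the parameters; in a realistically sized transformer the trainable fraction drops to roughly the 0.1% quoted in the abstract. The DP variant would additionally clip and noise the per-sample bias gradients.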

