PARAMETER-EFFICIENT FINE-TUNING DESIGN SPACES

Abstract

The aim of parameter-efficient fine-tuning is to achieve performance comparable to full fine-tuning, but with fewer trainable parameters. Several hand-crafted strategies, such as Adapters, Prefix Tuning, BitFit, and LoRA, have been proposed, but it remains unclear whether there are underlying design patterns. Thus, we present a parameter-efficient design paradigm and identify design patterns that are applicable to various experimental settings. Instead of developing yet another individual tuning strategy, we introduce design spaces that parameterize tuning structures and strategies. These design spaces consist of four components: layer grouping, trainable parameter allocation, tunable groups, and strategy assignment. Our experiments reveal the following design patterns: (i) group layers in a spindle pattern, (ii) allocate trainable parameters uniformly across layers, (iii) tune all groups, and (iv) assign an appropriate tuning strategy to each group. These patterns lead to new parameter-efficient fine-tuning methods, which we show experimentally to outperform existing strategies across various backbone models and NLP tasks.

1. INTRODUCTION

Large pre-trained models have been shown to achieve state-of-the-art results on many downstream natural language processing tasks when fine-tuned on task-specific labeled data (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Joshi et al., 2019; Sun et al., 2019; Clark et al., 2019; Lewis et al., 2020a; Bao et al., 2020; He et al., 2020; Raffel et al., 2020; Ziems et al., 2022). However, fine-tuning all parameters and storing a separate copy for each task is costly in both computation and storage, e.g., 355 million parameters for RoBERTa (Liu et al., 2019) and 175 billion parameters for GPT-3 (Brown et al., 2020). This makes such models challenging to deploy in real-world natural language processing (NLP) systems that handle multiple tasks. To adapt pretrained models to specific downstream tasks more efficiently, various strategies have been proposed that learn only a small number of extra parameters while keeping the rest frozen (Houlsby et al., 2019b; Pfeiffer et al., 2021; Li & Liang, 2021; Brown et al., 2020; Lester et al., 2021b; Schick & Schütze, 2021; Ziems et al., 2022). One such strategy is adapter tuning (Houlsby et al., 2019b), which adds small neural modules (adapters) to each layer of the pretrained network and trains only the adapters during fine-tuning. Other methods, such as prefix tuning (Li & Liang, 2021) and prompt tuning (Lester et al., 2021a), are inspired by the success of controlling pretrained models through textual prompts (Brown et al., 2020); they prepend tunable tokens to the input or hidden layers and train only these tokens during fine-tuning. BitFit (Zaken et al., 2021) updates only the bias terms of the pretrained model while freezing the rest, while LoRA (Hu et al., 2021) represents the updates to the attention weights as low-rank matrices to reduce the number of trainable parameters. He et al. (2022) proposed a unified view of these strategies that illustrates their differences and connections, but, like its predecessors, the resulting method is still applied uniformly across the layers of the pretrained network. Most current fine-tuning strategies for adapting pretrained models to specific tasks are effective, but they are developed through manual design processes that do not consider potential design patterns.
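To make the parameter savings of such strategies concrete, the following is a minimal numpy sketch of a LoRA-style low-rank update, not the paper's implementation: the hidden size, rank, and initialization below are illustrative assumptions. The pretrained weight stays frozen, and only two small factor matrices would receive gradients.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the paper):
d, r = 768, 8                           # hidden size and low rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # pretrained weight, kept frozen
A = rng.standard_normal((r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                    # trainable; zero init so the model
                                        # is unchanged before tuning starts

def forward(x):
    # Effective weight is W + B @ A; only A and B would be updated.
    return x @ (W + B @ A).T

full_params = W.size                    # 768 * 768 = 589824
lora_params = A.size + B.size           # 2 * 8 * 768 = 12288
print(full_params, lora_params)         # LoRA trains ~2% of the matrix
```

With rank 8 and hidden size 768, the low-rank factors hold 12,288 parameters versus 589,824 for the full matrix, which is why such methods can store one small delta per task instead of a full model copy.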

