PARAMETER-EFFICIENT FINE-TUNING DESIGN SPACES

Abstract

The aim of parameter-efficient fine-tuning is to achieve performance that is comparable to fine-tuning, but with fewer trainable parameters. Several hand-crafted strategies, such as Adapters, Prefix Tuning, BitFit, and LoRA, have been proposed, but it remains unclear whether there are underlying design patterns. Thus, we present a parameter-efficient design paradigm and identify design patterns that are applicable to various experimental settings. Instead of developing another individual tuning strategy, we introduce design spaces that parameterize tuning structures and strategies. These design spaces consist of four components: layer grouping, trainable parameter allocation, tunable groups, and strategy assignment. Our experiments reveal the following design patterns: (i) group layers in a spindle pattern, (ii) allocate trainable parameters evenly among layers, (iii) tune all groups, and (iv) assign appropriate tuning strategies to each group. These patterns lead to new methods for parameter-efficient fine-tuning, which we show experimentally outperform existing strategies across various backbone models and NLP tasks.

1. INTRODUCTION

Large pre-trained models have been shown to achieve state-of-the-art results on many downstream natural language processing tasks by fine-tuning on task-specific labeled data (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Joshi et al., 2019; Sun et al., 2019; Clark et al., 2019; Lewis et al., 2020a; Bao et al., 2020; He et al., 2020; Raffel et al., 2020; Ziems et al., 2022). However, fine-tuning all parameters and storing them separately for each task is expensive in terms of computation and storage, e.g., 355 million parameters for RoBERTa (Liu et al., 2019) and 175 billion parameters for GPT-3 (Brown et al., 2020). This makes such models challenging to deploy in real-world natural language processing (NLP) systems that handle multiple tasks.

To adapt pretrained models to specific downstream tasks more efficiently, various strategies have been proposed that learn only a small number of extra parameters while keeping the rest frozen (Houlsby et al., 2019b; Pfeiffer et al., 2021; Li & Liang, 2021; Brown et al., 2020; Lester et al., 2021b; Schick & Schütze, 2021; Ziems et al., 2022). One such strategy is adapter tuning (Houlsby et al., 2019b), which adds small neural modules (adapters) to each layer of the pretrained network and trains only the adapters during fine-tuning. Other methods, such as prefix tuning (Li & Liang, 2021) and prompt tuning (Lester et al., 2021a), are inspired by the success of controlling pretrained models through textual prompts (Brown et al., 2020): they prepend tunable tokens to the input or hidden layers and train only these tokens during fine-tuning. BitFit (Zaken et al., 2021) updates only the bias terms of the pretrained model while freezing the rest, and LoRA (Hu et al., 2021) reparameterizes the updates to the attention weights as low-rank matrices to reduce the number of trainable parameters. He et al. (2022) proposed a unified view of these strategies, illustrating their differences and connections; like its predecessors, however, this method is still applied uniformly to the different layers of the pretrained network.

Most current fine-tuning strategies for adapting pretrained models to specific tasks are effective, but they are typically developed through manual design processes without considering potential design patterns across strategies, backbone models, and downstream tasks. The relative effectiveness of different strategies is also unclear, as they are usually applied separately, and it is unknown how they reinforce or complement each other (Mao et al., 2022).

Our aim is to gain a comprehensive understanding of fine-tuning design and to uncover interpretable, widely applicable design patterns. Instead of creating yet another strategy to be applied uniformly across pretrained layers, we present parameter-efficient fine-tuning design spaces that allow customization of both tuning structures and tuning strategies. These design spaces are comprised of four main components, as illustrated in Figure 1: layer grouping, trainable parameter allocation, tunable groups, and strategy assignment. We start our journey towards parameter-efficient fine-tuning design with a relatively unconstrained design space, and then narrow this space through successive rounds of comparison, using random sampling while enforcing constraints such as equal layer grouping.
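The strategies above differ mainly in where the few trainable parameters are placed. As a concrete illustration of one of them, here is a minimal NumPy sketch of LoRA's low-rank reparameterization; the class name, initialization scales, and dimensions are our own illustrative choices, not details from the paper or the LoRA reference implementation:

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer: the pretrained weight W is
    frozen, and only the low-rank factors A and B are trained, cutting the
    trainable parameter count from d_out * d_in to r * (d_in + d_out)."""

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                           # frozen, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.02, size=(r, W.shape[1])) # trainable
        self.B = np.zeros((W.shape[0], r))                   # trainable, zero-initialized
        self.scale = alpha / r

    def forward(self, x):
        # x: (batch, d_in). Because B starts at zero, the low-rank update
        # is zero at initialization, so the layer initially matches the
        # pretrained model exactly.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size
```

For a 64x64 weight and rank r = 4, only 512 of the 4,096 parameters are trainable, which is the kind of saving that makes per-task storage feasible.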
Through this process, we discover several key design patterns: layer grouping in a spindle pattern, uniform allocation of trainable parameters across layers, tuning all groups, and appropriate per-group strategy assignment. The resulting methods outperform existing parameter-efficient fine-tuning strategies. We demonstrate the effectiveness of our approach using T5 (Raffel et al., 2020) on classification tasks, and find that the discovered design patterns transfer to other backbones (such as RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020b), and XLNet (Yang et al., 2019)) and to other NLP tasks (e.g., summarization, machine translation, and the eight SuperGLUE datasets).

Our contributions are: (i) the introduction of parameter-efficient fine-tuning design spaces; (ii) the discovery of several design patterns in parameter-efficient fine-tuning through comprehensive experiments; and (iii) the creation of parameter-efficient fine-tuning methods based on the discovered design patterns, which outperform existing strategies on various backbone models and NLP tasks.
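To make the four design-space components concrete, the following sketch enumerates one point in such a space: a spindle-shaped grouping of consecutive layers, uniform per-layer parameter allocation, all groups marked tunable, and one strategy per group. The function names, the exact spindle proportions, and the budget bookkeeping are our own illustrative assumptions, not the paper's specification:

```python
# Strategy codes as in Figure 1: Adapter, Prefix, BitFit, LoRA.
STRATEGIES = {"A": "Adapter", "P": "Prefix", "B": "BitFit", "L": "LoRA"}

def spindle_groups(n_layers, n_groups=4):
    """Partition consecutive layer indices into groups whose sizes follow a
    spindle shape: smaller groups at the bottom and top, larger ones in the
    middle. The 1:2:...:2:1 proportions are illustrative only."""
    weights = [1] + [2] * (n_groups - 2) + [1]
    total = sum(weights)
    sizes = [max(1, round(n_layers * w / total)) for w in weights]
    sizes[-1] += n_layers - sum(sizes)  # absorb rounding drift
    groups, start = [], 0
    for s in sizes:
        groups.append(list(range(start, start + s)))
        start += s
    return groups

def build_config(n_layers=24, budget=0.5, assignment=("A", "P", "B", "L")):
    """One candidate configuration: pattern (i) spindle grouping,
    (ii) uniform per-layer allocation of a trainable-parameter budget
    (here a fraction of total parameters), (iii) every group tunable,
    (iv) one strategy assigned per group."""
    groups = spindle_groups(n_layers, len(assignment))
    per_layer = budget / n_layers  # uniform allocation across all layers
    return [{"layers": g, "strategy": s, "tunable": True,
             "param_pct_per_layer": per_layer}
            for g, s in zip(groups, assignment)]
```

Under this sketch, a 24-layer backbone with four groups yields group sizes 4, 8, 8, 4, and the per-layer budgets sum back to the global budget, mirroring the "allocate evenly, tune everything" patterns the experiments select.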

2. RELATED WORK

Our work is closely related to, and builds on, work on network design spaces and parameter-efficient fine-tuning. We discuss the connections and differences below.

Network Design Spaces. Many works have designed neural network models via ad-hoc discovery of new design choices that improve performance (Radosavovic et al., 2019), such as the use of deeper architectures or residual connections. Recent work (Radosavovic et al., 2020; You et al., 2020; Radosavovic et al., 2019) instead studies design spaces to discover new design principles for convolutional neural networks (Radosavovic et al., 2020) and graph neural networks (You et al., 2020). Inspired by this work, we focus on design spaces to rethink parameter-efficient fine-tuning, with the goal of discovering design patterns that are applicable across settings.

Parameter-Efficient Fine-Tuning for NLP. As pretrained models increase in size, storing and fine-tuning them becomes increasingly expensive and unfeasible for those without ample computational



Figure 1: The design space is characterized by: (i) grouping of consecutive layers, (ii) allocation of the number of trainable parameters to each layer, (iii) selection of the groups that will be fine-tuned, and (iv) assignment of appropriate strategies, such as Adapter (A), Prefix (P), BitFit (B), or LoRA (L), to each group.

