PARAMETER-EFFICIENT FINE-TUNING DESIGN SPACES

Abstract

The aim of parameter-efficient fine-tuning is to achieve performance comparable to full fine-tuning, but with fewer trainable parameters. Several hand-crafted strategies, such as Adapters, Prefix Tuning, BitFit, and LoRA, have been proposed, but it remains unclear whether there are underlying design patterns. Thus, we present a parameter-efficient design paradigm and identify design patterns that are applicable to various experimental settings. Instead of developing another individual tuning strategy, we introduce design spaces that parameterize tuning structures and strategies. These design spaces consist of four components: layer grouping, trainable parameter allocation, tunable groups, and strategy assignment. Our experiments reveal the following design patterns: (i) group layers in a spindle pattern, (ii) allocate trainable parameters evenly among layers, (iii) tune all groups, and (iv) assign appropriate tuning strategies to each group. These patterns lead to new parameter-efficient fine-tuning methods, which we show experimentally outperform existing strategies across various backbone models and NLP tasks.

1. INTRODUCTION

Large pre-trained models have been shown to achieve state-of-the-art results on many downstream natural language processing tasks when fine-tuned on task-specific labeled data (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Joshi et al., 2019; Sun et al., 2019; Clark et al., 2019; Lewis et al., 2020a; Bao et al., 2020; He et al., 2020; Raffel et al., 2020; Ziems et al., 2022). However, the cost of fine-tuning all parameters and storing them separately for each task is high in terms of computational and storage resources, e.g., 355 million parameters for RoBERTa (Liu et al., 2019) and 175 billion parameters for GPT-3 (Brown et al., 2020). This makes deployment challenging in real-world natural language processing (NLP) systems that handle multiple tasks. To make pretrained models more efficient for specific downstream tasks, various strategies have been proposed that only learn a small number of extra parameters while keeping the rest frozen (Houlsby et al., 2019b; Pfeiffer et al., 2021; Li & Liang, 2021; Brown et al., 2020; Lester et al., 2021b; Schick & Schütze, 2021; Ziems et al., 2022). One such strategy is adapter tuning (Houlsby et al., 2019b), which adds small neural modules (adapters) to each layer of the pretrained network and only trains the adapters during fine-tuning. Other methods, such as prefix tuning (Li & Liang, 2021) and prompt tuning (Lester et al., 2021a), have been inspired by the success of controlling pretrained models through textual prompts (Brown et al., 2020). These methods prepend tunable tokens to the input or hidden layers and only train these tokens during fine-tuning. BitFit (Zaken et al., 2021) updates the bias terms of pretrained models while freezing the rest, while LoRA (Hu et al., 2021) decomposes attention weight gradients into low-rank matrices to reduce the number of trainable parameters. He et al.
(2022) proposed a unified view of these strategies, illustrating their differences and connections, but like its predecessors, the method is still applied uniformly to the different layers of the pretrained network. Most current fine-tuning strategies for adapting pretrained models to specific tasks are effective, but they are often developed through manual design processes, without considering potential design patterns across these strategies, different backbone models, and downstream tasks. The effectiveness of different strategies is also unclear, as they are usually applied separately, and it is unknown how they reinforce or complement each other (Mao et al., 2022). Our aim is to gain a comprehensive understanding of fine-tuning design and uncover interpretable and widely applicable design patterns. Instead of creating yet another strategy to be applied uniformly to various pretrained layers, we present parameter-efficient fine-tuning design spaces that allow customization of both tuning structures and strategies. These design spaces comprise four main components, as illustrated in Figure 1: layer grouping, trainable parameter allocation, tunable groups, and strategy assignment. We start our journey towards parameter-efficient fine-tuning design using a relatively unconstrained design space. We then narrow this space through successive rounds of comparison, using random sampling while enforcing constraints such as equal layer grouping.
Through this process, we discover several key design patterns, including layer grouping in a spindle pattern, uniform allocation of trainable parameters, tuning all groups, and appropriate strategy assignments. Our new methods outperform existing parameter-efficient fine-tuning strategies. We demonstrate the effectiveness of our approach using T5 (Raffel et al., 2020) and classification tasks, and find that the discovered design patterns are applicable to other backbones (such as RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020b), and XLNet (Yang et al., 2019)) and NLP tasks (e.g., summarization, machine translation, and eight SuperGLUE datasets). Our contributions are: (i) the introduction of parameter-efficient fine-tuning design spaces; (ii) the discovery of several design patterns in parameter-efficient fine-tuning through comprehensive experiments; (iii) the creation of parameter-efficient fine-tuning methods based on the discovered design patterns, which outperform existing strategies on various backbone models and NLP tasks.

2. RELATED WORK

Our work is closely related to and builds on work about network design spaces and parameter-efficient fine-tuning. We discuss the connections and differences below. Network Design Spaces. Many works designed neural network models via an ad-hoc discovery of new design choices that improve performance (Radosavovic et al., 2019), such as the use of deeper architectures or residual connections. Recent work (Radosavovic et al., 2020; You et al., 2020; Radosavovic et al., 2019) focuses on design spaces to discover new design principles for convolutional neural networks (Radosavovic et al., 2020) and graph neural networks (You et al., 2020). Inspired by this work, we focus on design spaces to rethink parameter-efficient fine-tuning, with the goal of discovering design patterns that are applicable to different settings. Parameter-Efficient Fine-Tuning for NLP. As pretrained models increase in size, storing and fine-tuning them becomes increasingly expensive and infeasible for those without ample computational resources. A growing body of research is aimed at finding alternatives to fine-tuning large-scale models that reduce memory and storage costs. Some researchers have proposed using bottleneck layers with skip-connections to adapt large models, as in Houlsby et al. (2019a), Stickland & Murray (2019), Pfeiffer et al. (2020), and Rebuffi et al. (2017). Other works focus on identifying and training only a subset of all model parameters, such as Zhao et al. (2020) and Guo et al. (2020). More recent research explores low-rank decomposition (Zhang et al., 2021) and the injection of trainable low-rank decomposition matrices into each layer (Hu et al., 2021; Karimi Mahabadi et al., 2021). Li & Liang (2021) introduced prefix-tuning, where a set of prefixes is added to autoregressive language models or to both encoders and decoders, while Lester et al. (2021b) proposed adding virtual tokens to the embedding layer.
Another approach, side-tuning, was introduced by Sung et al. (2022). He et al. (2022) and Ding et al. (2022) proposed a unified view of existing parameter-efficient fine-tuning strategies. In yet another approach, Mao et al. (2022) introduced a unified framework that combines various methods through a mixture-of-experts. Our research focuses on the general design spaces of parameter-efficient fine-tuning, providing a more comprehensive view of this family of methods. By experimenting with and refining design spaces, we aim to discover design patterns for parameter-efficient fine-tuning.

3. COMPONENTS OF DESIGN SPACES

Our goal is not to list all possible design choices, but to show how design spaces can guide parameter-efficient fine-tuning research. As such, we pick a representative subset for each of the following four components: (i) layer grouping, (ii) trainable parameter allocation, (iii) tunable groups, and (iv) strategy assignment. Figure 1 provides an example. Given these choices, we sample from a distribution over them, then pick a subset that performs the best, narrowing down the set of choices. Given that more restrictive set, we repeat the procedure, sampling and picking an even more restrictive subset until we arrive at a concise description of the design space. Quite understandably, this is very costly when dealing with large language models. Taking a leaf out of Radosavovic et al. (2020), we perform our experiments using sufficiently cheap models, in this case T5-base and T5-3b, unless stated otherwise. Further details will be discussed in the next section. For now, let us review the set of available choices. Layer Grouping. Different layers in pre-trained models capture varying information and behave differently. For example, Jawahar et al. (2019) found that layers 3, 4, 5, 6, 7, 9, and 12 have the most representational power in BERT, and that each layer captures a different type of information, ranging from surface to syntactic to semantic representations of text. For instance, the 9th layer performs well on semantic tasks such as checking random swaps of coordinated clauses, while the 3rd layer is best suited for surface tasks like predicting sentence length. When adapting these pre-trained models to downstream tasks, it is crucial to group layers with similar behaviors together. This is critical to the design and proper implementation of parameter-efficient fine-tuning strategies.
In this design component, we study patterns for grouping consecutive layers in pre-trained models (e.g., transformer layers in T5) during the fine-tuning process. Trainable Parameter Allocation. In parameter-efficient fine-tuning, the total number of trainable parameters is usually set to a small fraction of the total parameters in the pretrained model. Our study explores different ways to allocate this predefined number of trainable parameters to the layers. Tunable Groups. Not all the parameters of a pretrained model need to be updated during fine-tuning for downstream tasks. For example, BitFit (Zaken et al., 2021) only updates the bias parameters while freezing the rest. As a result, we explore which groups of parameters need to be learned during parameter-efficient fine-tuning to achieve better performance. Strategy Assignment. To improve parameter efficiency, different sets of strategies (Li & Liang, 2021; Lester et al., 2021b; Houlsby et al., 2019b; Hu et al., 2021) have been proposed, where only a small number of (extra) parameters are tuned and the remaining parameters of the pretrained model are frozen, adapting its general knowledge to specific downstream tasks. We hypothesize that different groups might benefit from different strategies (or combinations thereof) for capturing different types of information. More formally, given a set of individual strategies A for assignment, for any group G i we assign a subset U i ⊆ A to each layer in G i .
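The four components above can be expressed as a plain data structure. The following is an illustrative sketch, not the authors' code; the names (GroupConfig, validate) and the concrete group sizes and budgets are our own assumptions, used only to make the constraints concrete.

```python
# A minimal representation of one design point: for each group G_i we
# record N_i (layers), n_i (per-layer trainable-parameter budget),
# whether the group is tuned at all, and U_i, the strategy subset.
from dataclasses import dataclass
from typing import FrozenSet, List

STRATEGIES = frozenset({"A", "P", "B", "L"})  # Adapter, Prefix, BitFit, LoRA

@dataclass(frozen=True)
class GroupConfig:
    num_layers: int             # N_i: number of layers in group G_i
    params_per_layer: int       # n_i: trainable-parameter budget per layer
    tuned: bool                 # whether this group is fine-tuned
    strategies: FrozenSet[str]  # U_i, a subset of {A, P, B, L}

def validate(groups: List[GroupConfig], total_layers: int) -> None:
    """Check that a configuration describes a well-formed design point."""
    assert sum(g.num_layers for g in groups) == total_layers
    for g in groups:
        assert g.strategies <= STRATEGIES
        # an untuned group carries no strategies and no budget
        if not g.tuned:
            assert not g.strategies and g.params_per_layer == 0

# Example: the T5-base assignment discovered in Section 4.5, with a
# spindle grouping of 24 layers (the exact split and budgets here are
# illustrative placeholders).
t5_base = [
    GroupConfig(4, 100, True, frozenset({"A", "L"})),
    GroupConfig(8, 100, True, frozenset({"A", "P"})),
    GroupConfig(8, 100, True, frozenset({"A", "P", "B"})),
    GroupConfig(4, 100, True, frozenset({"P", "B", "L"})),
]
validate(t5_base, total_layers=24)
```

A design space is then simply the set of all such configurations satisfying a given constraint, which is what the experiments in Section 4 sample from.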

4. DISCOVERING DESIGN PATTERNS

Each design space, denoted as S i , consists of a set of models (S i -models) that satisfy the constraints characterizing the space with respect to layer grouping, trainable parameter allocation, tunable groups, and strategy assignment. To discover design patterns, we start from a relatively unconstrained parameter-efficient fine-tuning design space S 0 . We progressively refine it via S 1 , . . . , S 4 by comparing the overall quality of models in design spaces enforced with different constraints (e.g., each group has the same number of layers). To quantify the overall quality of models in any design space S i under a low-compute, low-epoch regime (Radosavovic et al., 2020), we randomly sample 100 models from S i , fine-tune each for only 3 epochs, and compute the mean of their GLUE average performance. Such a low number of epochs is enough to obtain a representative score and draw consistent conclusions (see Table 7 in the Appendix) that extend to a full training run. We emphasize that our goal is to demonstrate how the perspective of design spaces can help inform parameter-efficient fine-tuning research, rather than to find the "best" design space or method. It is beyond the scope of this work to enumerate all possible constraints with respect to the design space components (Section 3). For efficiency, we use T5-base as the pretrained backbone model, as it is both representative and sufficiently small to make experimentation with many options computationally affordable. In this work, we follow the discovery sequence "grouping patterns - trainable parameter allocation - tunable groups - strategy assignment": (1) To explore and understand design patterns across all the layers of large pre-trained models at scale, it is necessary and more efficient to study layers in units of groups, so we start with the grouping patterns.
(2) Once the grouping pattern is determined, we explore how to allocate the trainable parameters to the different groups; this enables fair comparisons for the more subtle designs that follow (e.g., it allows comparing different strategy-assignment patterns without confounding from different trainable-parameter budgets). (3) Next, we examine which groups need to be learned during fine-tuning before digging into strategy assignment patterns, because assigning strategies to different groups is only meaningful once we know which groups are to be learned. (4) Finally, we study tuning strategy assignment, which is the most subtle design.
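The refinement procedure above can be sketched as a simple loop. This is an illustrative skeleton, not the authors' implementation: `sample_model` and `glue_average` are hypothetical stand-ins for the real sampling and 3-epoch fine-tuning steps.

```python
# Refinement step: sample models under each candidate constraint, score
# each model cheaply, and keep the constraint with the best mean score.
def refine(space_constraints, sample_model, glue_average, n_samples=100):
    """Return (best constraint name, mean score per constraint)."""
    scores = {}
    for name, constraint in space_constraints.items():
        models = [sample_model(constraint) for _ in range(n_samples)]
        # in the paper, each sampled model is fine-tuned for only 3 epochs
        # before its GLUE average is computed
        scores[name] = sum(glue_average(m) for m in models) / n_samples
    return max(scores, key=scores.get), scores
```

Each round of Section 4 (grouping, allocation, tunable groups, strategy assignment) is one call to such a loop, with the winning constraint folded into the next, more restricted design space.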

4.1. S 0 -THE INITIAL DESIGN SPACE

The initial, relatively unconstrained design space S 0 consists of all models without constraints on the design space components. The individual parameter-efficient fine-tuning strategies are Adapter, Prefix, BitFit, and LoRA. Specifically, without grouping constraints, each layer of the pretrained model has a probability of 0.5 of being tuned. If tuned, a random strategy, or a combination thereof, with a random number of trainable parameters is assigned to that layer. Before comparing more subtle design patterns, such as which tuning strategy among Adapter, Prefix, BitFit, and LoRA to pick, we begin by exploring how to group layers and how to allocate the total number of trainable parameters to layers.
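A sampler for S 0 might look as follows. This is a sketch under our own assumptions (the budget range and the `sample_s0` helper are illustrative, not from the paper):

```python
# Sample one S_0 configuration: each layer is tuned with probability 0.5;
# a tuned layer receives a random non-empty subset of the strategies and
# a random per-layer trainable-parameter budget.
import random

STRATEGIES = ["A", "P", "B", "L"]  # Adapter, Prefix, BitFit, LoRA

def sample_s0(num_layers, max_budget=64, rng=random):
    config = []
    for _ in range(num_layers):
        if rng.random() < 0.5:  # tune this layer
            k = rng.randint(1, len(STRATEGIES))
            config.append({
                "strategies": set(rng.sample(STRATEGIES, k)),
                "budget": rng.randint(1, max_budget),
            })
        else:
            config.append(None)  # layer stays frozen
    return config
```

Later design spaces are obtained by constraining the output of such a sampler (e.g., forcing a grouping pattern or a uniform budget).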

4.2. S 1 -APPLYING GROUPING CONSTRAINTS

Transformers are quite deep by now. This makes it impractical to pick a different tuning strategy for each layer. As such, the first question to ask is how to assemble the layers into groups that will be tuned using the same strategy. Inspired by Radosavovic et al. (2020), we consider 4 groups, G 1 , . . . , G 4 , in the order of the forward pass, in the experiments. Denote by N i the number of layers in G i . As illustrated in Figure 2, we compare the following layer grouping patterns: Increasing (N i+1 > N i ): the number of layers per group gradually increases; Uniform (N i+1 = N i ): the number of layers per group is the same; Decreasing (N i+1 < N i ): the number of layers per group gradually decreases; Spindle (N 1 < N 2 = N 3 > N 4 ): the groups at both ends are smaller; Bottleneck (N 1 > N 2 = N 3 < N 4 ): the groups at both ends are bigger. These layer grouping patterns lead to 5 possible design choices, each consisting of all models in the S 0 design space that satisfy the corresponding grouping pattern constraint. To compare the overall model quality of the different design spaces, we (i) randomly sample 100 models from the S 0 design space that satisfy each grouping pattern constraint (Figure 2); (ii) fine-tune each for 3 epochs; and (iii) compute the average performance for each design space. We follow this procedure as we progressively add new constraints later. The average performance is shown in Table 1. We find that models from the design space with the spindle grouping pattern (Figure 2) consistently outperform those from the other design spaces across all 8 GLUE tasks. In other words, we find that fine-tuning works better if we treat a small number of layers close to the input and close to the output as special, and furthermore, if we divide up the bulk of the network into two blocks, each with their own design choices.
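To make the five patterns concrete, here is one valid instance of each over a 24-layer stack split into 4 groups. The paper only fixes the ordering constraints on the N i ; the specific splits below are our own illustrative choices.

```python
# One concrete group-size vector per grouping pattern (24 layers, 4 groups).
PATTERNS = {
    "increasing": [3, 5, 7, 9],   # N1 < N2 < N3 < N4
    "uniform":    [6, 6, 6, 6],   # N1 = N2 = N3 = N4
    "decreasing": [9, 7, 5, 3],   # N1 > N2 > N3 > N4
    "spindle":    [4, 8, 8, 4],   # N1 < N2 = N3 > N4
    "bottleneck": [8, 4, 4, 8],   # N1 > N2 = N3 < N4
}

def check_spindle(sizes):
    """Verify the spindle constraint N1 < N2 = N3 > N4."""
    n1, n2, n3, n4 = sizes
    return n1 < n2 == n3 > n4

# every instance partitions all 24 layers
assert all(sum(sizes) == 24 for sizes in PATTERNS.values())
assert check_spindle(PATTERNS["spindle"])
```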
Applying the spindle grouping partitioning to S 0 yields the new design space S 1 .

4.3. S 2 -VARYING THE NUMBER OF TRAINABLE PARAMETERS PER LAYER

Now that we know how to group the layers, we need to establish how to allocate the parameters n i within the layers i of each group. In particular, we consider the following options: Increasing (n i+1 ≥ n i ): the number of trainable parameters per layer increases or remains the same; Uniform (n i+1 = n i ): the number of trainable parameters per layer is constant; Decreasing (n i+1 ≤ n i ): the number of trainable parameters per layer decreases or remains the same. As above, we obtain 100 models for each of these 3 new design spaces. Table 2 reports the average performance of these 3 design spaces. The uniform allocation pattern obtains the highest GLUE average performance, making this relatively simple, interpretable design pattern favorable. Allocating the number of trainable parameters uniformly across layers yields the new design space S 2 .
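The winning uniform allocation is straightforward to implement. The sketch below is illustrative (the helper name is ours); the remainder handling simply keeps the total budget exact when it does not divide evenly:

```python
# Spread a fixed trainable-parameter budget uniformly across layers,
# giving any remainder to the first layers so the total is preserved.
def allocate_uniform(total_params, num_layers):
    base, rem = divmod(total_params, num_layers)
    return [base + (1 if i < rem else 0) for i in range(num_layers)]
```

For example, `allocate_uniform(100, 24)` gives the first 4 layers one extra parameter each, so per-layer budgets never differ by more than one.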

4.4. S 3 -SELECTING THE GROUPS

Given that we have established how to partition layers into groups and how to allocate parameters per group, the next step is to assess whether all groups actually need tuning. Rather than exploring all 2^4 - 1 = 15 combinatorial choices, we limit ourselves to the 4(4 + 1)/2 = 10 contiguous options, excluding (G 2 , G 3 ), since focusing only on interior groups does not yield good results (consistent with our findings in Table 3). Based on the GLUE average performance, we find that all the groups need to be tuned to obtain the best results. This suggests that all the groups of pretrained layers have captured useful information that should be adapted to the downstream tasks. Tuning all the groups yields the new design space S 3 .
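The counts above can be checked directly: a contiguous selection of groups is determined by a start and an end index, giving 4·5/2 = 10 options, versus 2^4 − 1 = 15 non-empty subsets overall. A quick sketch (function names are ours):

```python
# Enumerate tunable-group selections over 4 groups (0-indexed).
from itertools import combinations

def contiguous_selections(n_groups=4):
    """All contiguous runs of groups, e.g. (0,), (0, 1), ..., (0, 1, 2, 3)."""
    return [tuple(range(i, j + 1))
            for i in range(n_groups) for j in range(i, n_groups)]

def all_nonempty_subsets(n_groups=4):
    """All 2^n - 1 non-empty subsets, for comparison."""
    return [c for k in range(1, n_groups + 1)
            for c in combinations(range(n_groups), k)]

assert len(contiguous_selections()) == 10   # 4 * (4 + 1) / 2
assert len(all_nonempty_subsets()) == 15    # 2^4 - 1
```

The interior pair excluded in the text, (G 2 , G 3 ), corresponds to the tuple `(1, 2)` in this 0-indexed enumeration.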

4.5. S 4 -SELECTING STRATEGIES PER GROUP

So far the structure we have been exploring is fairly simple: S 3 amounts to a uniform distribution of parameters over the layers of the groups and to tuning all groups. This belies the fact that we still have significant freedom of design in picking specific fine-tuning approaches. Specifically, each design space consists of models that assign a subset of {Adapter (A), Prefix (P), BitFit (B), LoRA (L)} to the layers of each group G i for i ∈ {1, . . . , 4}. This is quite a large space of options. To make headway, we determine the ideal choice progressively, first reviewing strategies for G 1 , then G 2 , up to G 4 . Due to space constraints, the details of this procedure are relegated to the appendix (G 1 in Table 8, G 2 in Table 9, G 3 in Table 10, and G 4 in Table 11). We arrive at the following strategy assignment for the T5-base pretrained backbone: G 1 : (A, L) - G 2 : (A, P) - G 3 : (A, P, B) - G 4 : (P, B, L). Notably, Adapter is favored in groups closer to the input, while BitFit is favored in groups closer to the output. The resulting design space is referred to as S 4 .
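The discovered T5-base assignment can be written as a simple lookup from layer index to strategy set. The mapping below uses the assignment stated in the text; the spindle group sizes (4, 8, 8, 4) over 24 layers are an illustrative assumption, as is the helper name:

```python
# T5-base strategy assignment from Section 4.5 (A=Adapter, P=Prefix,
# B=BitFit, L=LoRA), keyed by group.
T5_BASE_ASSIGNMENT = {
    "G1": {"A", "L"},
    "G2": {"A", "P"},
    "G3": {"A", "P", "B"},
    "G4": {"P", "B", "L"},
}

def strategies_for_layer(layer_idx, group_sizes=(4, 8, 8, 4)):
    """Map a 0-indexed layer to its group's strategy set under an
    assumed spindle split of 24 layers."""
    bound = 0
    for gid, size in enumerate(group_sizes, start=1):
        bound += size
        if layer_idx < bound:
            return T5_BASE_ASSIGNMENT[f"G{gid}"]
    raise IndexError(layer_idx)
```

With this split, layer 0 (closest to the input) gets Adapter and LoRA, while the last layer gets Prefix, BitFit, and LoRA.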

4.6. VERIFICATION OF THE DESIGN CHOICES ON T5-3B

So far our results have led to a competent fine-tuning strategy for T5-base. To assess whether we have actually discovered useful strategies with validity beyond T5-base, we need to apply them to other models as well. For convenience we pick T5-3b. As before, the detailed results are relegated to the appendix (Tables 12, 13, 14, and 15). We observe that the following design patterns still apply: 1. grouping layers in a spindle pattern (Table 12); 2. uniformly allocating the number of trainable parameters to layers (Table 13); 3. tuning all the groups (Table 14); 4. tuning different groups with proper strategies (Table 15). Note that for T5-3b (with final design space S 4 -3b), the discovered strategy assignment is slightly different: G 1 : (P, L) - G 2 : (A, L) - G 3 : (P, B, L) - G 4 : (A, P, B).

4.7. EXPERIMENTAL SETUP

Datasets. Our process is based on the average performance on the widely-used GLUE benchmark (Wang et al., 2018), which covers a wide range of natural language understanding tasks. First, single-sentence tasks include (i) Stanford Sentiment Treebank (SST-2) and (ii) Corpus of Linguistic Acceptability (CoLA). Second, similarity and paraphrase tasks include (i) Quora Question Pairs (QQP), (ii) Semantic Textual Similarity Benchmark (STS-B), and (iii) Microsoft Research Paraphrase Corpus (MRPC). Third, inference tasks include (i) Multi-Genre Natural Language Inference (MNLI), (ii) Question Natural Language Inference (QNLI), and (iii) Recognizing Textual Entailment (RTE). To compare performance, Matthews correlation is measured for CoLA, Spearman correlation is used for STS-B, and accuracy is measured for the remaining GLUE tasks. Pretrained Backbone Models and Model Settings. We use T5-base/3b (Raffel et al., 2020) as the main pretrained backbone models for discovering design patterns via our parameter-efficient fine-tuning design spaces. We use HuggingFace Transformers for our implementations and follow the default settings. During the exploration, we fix the total number of trainable parameters (as a percentage of the backbone model's parameters) following He et al. (2022). By limiting ourselves to a rather concise parameter space, and a small number of parameters within that space that we allow to be fine-tuned, we ensure that exploration remains computationally feasible. Obviously, this exploration would be pointless if the discovered insights were not portable. Hence, we need to evaluate how well the strategies perform on new models and new architectures.

5. EVALUATION

The S 4 model (Section 4.5) and S 4 -3b model (Section 4.6) adopt the design patterns discovered from T5-base and T5-3b, respectively. We will evaluate their effectiveness when applied to different pretrained backbones and different NLP tasks.

5.1. EXPERIMENTAL SETUP

Dataset. Besides the GLUE datasets (Wang et al., 2018) (Section 4.7), we evaluate our methods on two generation tasks used by He et al. (2022): abstractive summarization using XSum (Narayan et al., 2018), and machine translation using the WMT 2016 en-ro dataset (Bojar et al., 2016). We report ROUGE scores (Lin, 2004) on the XSum test set and BLEU scores (Papineni et al., 2002) on the en-ro test set. Models and Model Settings. We mainly compare our methods with the following baselines: (i) Full Fine-tuning (full): fine-tunes all the model parameters of the pretrained model; (ii) Adapter (Houlsby et al., 2019b): adds adapter modules to each transformer layer; (iii) Prefix (Li & Liang, 2021): optimizes a set of small continuous vectors prepended to transformer layers; (iv) BitFit (Zaken et al., 2021): only updates the bias terms in pretrained models; (v) LoRA (Hu et al., 2021): decomposes the attention weights into low-rank matrices to reduce the number of trainable parameters. Besides T5 (Raffel et al., 2020), we additionally apply our methods to other backbone models, including RoBERTa-base/large (Liu et al., 2019) and BART-base/large (Lewis et al., 2020a). We use the default settings. We set the total number of trainable parameters (as a percentage of the backbone model's parameters) following He et al. (2022). Specifically, this value is set to 0.5% for Adapter, Prefix, LoRA, and our methods, and 0.1% for BitFit. For all the experiments, we followed Liu et al. (2019) in using a linear decay scheduler with a warmup ratio of 0.06 for training. The batch size was 128 for base models and 64 for large models. The maximum learning rate was 5e-5 and the maximum number of training epochs was set to either 5 or 10. All experiments were performed using 8 A100 GPUs.
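The learning-rate schedule described above (linear warmup over the first 6% of steps, then linear decay to zero from the 5e-5 peak) can be written as a pure function of the step index. This is our own sketch of the standard schedule, not the authors' training code:

```python
# Linear warmup + linear decay, as a function of the optimizer step.
def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_ratio=0.06):
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # ramp linearly from 0 up to peak_lr
        return peak_lr * step / warmup_steps
    # then decay linearly from peak_lr down to 0 at total_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

HuggingFace's `get_linear_schedule_with_warmup` implements the same shape given the number of warmup steps; here the warmup count is derived from the 0.06 ratio.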

5.2. EFFECTIVENESS ON DIFFERENT BACKBONES

GLUE with T5 Backbone. With our discovered design patterns, we fine-tune T5-base (S 4 -model) and T5-3b (S 4 -3b-model) on GLUE and compare them with all the baseline methods. The results are shown in Table 4, where the key measure is the GLUE average performance (last column). We find that our S 4 -model and S 4 -3b-model consistently outperform the investigated methods on this key measure. By tuning only 0.5% of the parameters, our methods even outperform the full fine-tuning baseline where all parameters are tuned, indicating the effectiveness of our discovered parameter-efficient fine-tuning design patterns. GLUE with RoBERTa Backbone. We directly apply the S 4 -model and S 4 -3b-model (adopting design patterns discovered using T5-base and T5-3b) to fine-tune the RoBERTa-base and RoBERTa-large pretrained backbone models, respectively. We keep all other settings the same and evaluate them on the GLUE datasets. We also compare with variant methods randomly sampled from two design spaces: (i) S 0 -model, where all the designs are randomly selected for RoBERTa as in S 0 ; (ii) S 3 -model, where strategies are randomly assigned to different RoBERTa layer groups as in S 3 . Table 5 shows that (i) the design patterns (adopted by the S 4 -model and S 4 -3b-model) discovered using T5 models are applicable to the RoBERTa backbone models and outperform the investigated methods in GLUE average performance with no extra discovery process; (ii) the improved performance from S 0 -models, through S 3 -models, to S 4 -(3b)-models supports adding more constraints in the pattern discovery process (Section 4). SuperGLUE with XLNet Backbone. We also directly use the S 4 -model and S 4 -3b-model (adopting design patterns discovered using T5-base and T5-3b) to fine-tune the XLNet-base and XLNet-large pretrained backbone models without any extra discovery process. We keep all other settings the same and evaluate them on the SuperGLUE datasets.
Table 17 (in the Appendix) reiterates that our PEFT design patterns discovered from T5 models are generalizable to the XLNet backbone models and outperform the investigated methods on other tasks (SuperGLUE) with no additional discovery process. Generation Tasks with BART Backbone. We further apply the S 4 -model and S 4 -3b-model (adopting design patterns discovered using T5-base and T5-3b) to fine-tune the BART-base and BART-large pretrained backbone models, respectively. We evaluate the models on two generation tasks, summarization (XSum) and machine translation (en-ro), following He et al. (2022). We also compare with PA (parallel adapter) using the same number of trainable parameters (He et al., 2022). Table 6 shows that our methods, although adopting design patterns discovered from classification tasks using T5, still outperform the investigated parameter-efficient fine-tuning strategies on generation tasks with different BART backbones.

6. CONCLUSION

Parameter-efficient fine-tuning adapts the knowledge in pretrained models to downstream tasks in a more parameter-efficient fashion. Instead of designing yet another individual strategy, we introduced parameter-efficient fine-tuning design spaces. We empirically discovered several design patterns in parameter-efficient fine-tuning. These design patterns led to new parameter-efficient fine-tuning methods. Experiments showed that these methods consistently outperform the investigated parameter-efficient fine-tuning strategies across different backbone models and different tasks in natural language processing.



Footnotes. We set the low epoch count by observing whether it is enough for models to obtain stable performance and draw consistent conclusions (see Table 7 in the Appendix). The experimental results with 8 groups are shown in Table 16 in the Appendix. The training time for this step is shown in Table 18 in the Appendix. Future work might repeat the discovery process using RoBERTa to improve performance for this backbone.



Figure 1: The design space is characterized by: (i) grouping of consecutive layers, (ii) the allocation of the number of trainable parameters to each layer, (iii) the selection of groups that will be fine-tuned, and (iv) the assignment of appropriate strategies, such as Adapter (A), Prefix (P), BitFit (B), or LoRA (L), to each group.

Figure 2: Layer grouping patterns: group ID (G 1 , . . . G 4 ) vs. number of layers per group.

Average performance (low-compute, low-epoch regime: 100 random models, 3 tuning epochs) on the GLUE datasets using the T5-base pretrained backbone. We compare adding different layer grouping constraints to the S 0 design space.

Average performance (low-compute, low-epoch regime: 100 random models, 3 tuning epochs) on the GLUE datasets using the T5-base pretrained backbone model. We compare adding different parameter allocation constraints to the S 1 design space.

Average performance (low-compute, low-epoch regime: 100 random models, 3 tuning epochs) on the GLUE datasets using the T5-base pretrained backbone model. We compare adding different tunable group constraints to the S 2 design space.

Performances of different tuning methods on the GLUE datasets using the T5-base (upper part) and T5-3b (lower part) pretrained backbone models, respectively. The results are averaged over 20 random runs (with standard deviations as subscripts). The S 4 -model and the S 4 -3b-model perform significantly better than the second-best PEFT methods on all eight datasets at significance level p < 0.05 (*) or even p < 0.01 (**).

Performances of different tuning methods on GLUE datasets using the RoBERTa-base (upper part) and RoBERTa-large (lower part) pretrained backbone models. The results are averaged over 20 random runs (with standard deviations as subscripts). Here we also include two baselines:

Performance of different tuning methods on generation tasks (XSUM and en-ro) using the BART-base (left) and BART-large (right) pretrained backbone models.

Average performances (low-compute, low-epoch regime: 100 random models, 3 tuning epochs) on the GLUE datasets using the T5-base pretrained backbone model. We compare adding different G 1 strategy assignment constraints to the S 3 design space.

Average performances (low-compute, low-epoch regime: 100 random models, 3 tuning epochs) on the GLUE datasets using the T5-base pretrained backbone model. We compare adding different G 2 strategy assignment constraints with G 1 -(L, A) to the S 3 design space.

Average performances (low-compute, low-epoch regime: 100 random models, 3 tuning epochs) on the GLUE datasets using the T5-base pretrained backbone model. We compare adding different G 3 strategy assignment constraints with G 1 -(L, A) -G 2 -(P, A) to the S 3 design space.

Average performances (low-compute, low-epoch regime: 100 random models, 3 tuning epochs) on the GLUE datasets using the T5-3b pretrained backbone model. We compare adding different tuning groups constraints to the S 2 design space.

Average performances (low-compute, low-epoch regime: 100 random models, 3 tuning epochs) on the GLUE datasets using the T5-3b pretrained backbone model. We compare adding different strategy assignment constraints following the process in Section 4.5.

A MORE EXPERIMENTAL RESULTS

Table 7: Average performances (low-compute, low-epoch regime: 100 random models, tuning epochs = 1, 2, 3, 4, 20 for five different blocks) on the GLUE datasets using the T5-base pretrained backbone model. We compare adding different grouping constraints to the S 0 design space.

B GENERAL EFFECTIVENESS ON SUPERGLUE WITH XLNET BACKBONES

We also directly use the S 4 -model and S 4 -3b-model (adopting design patterns discovered using T5-base and T5-3b) to fine-tune the XLNet-base and XLNet-large pretrained backbone models without any extra discovery process. We keep all other settings the same and evaluate them on the SuperGLUE datasets. Table 17 reiterates that our PEFT design patterns discovered from T5 models are generalizable to the XLNet backbone models and outperform the investigated methods on other tasks (SuperGLUE) with no additional discovery process.

