AHEAD-OF-TIME P-TUNING

Abstract

This paper proposes AoT P-Tuning, a new method for parameter-efficient fine-tuning. The method adds input-dependent biases before evaluating each Transformer layer, reducing the required evaluation time while still allowing multi-task inference, in which a single backbone model evaluates different tasks in a single batch. We evaluated the proposed method on the GLUE and SuperGLUE benchmarking datasets using RoBERTa-Base, RoBERTa-Large, and DeBERTa-XL backbone models. Our findings show that AoT P-Tuning performs on par with or better than P-Tuning v2 and is comparable to other baselines for efficient fine-tuning while being faster during inference.



1 INTRODUCTION

P-Tuning (Liu et al., 2021b;a; Lester et al., 2021) is a promising way to fine-tune large Language Models (LMs) (Devlin et al., 2019; Lan et al., 2020; Liu et al., 2019; Radford et al., 2019). While it currently underperforms other methods for parameter-efficient fine-tuning (Hu et al., 2022; Houlsby et al., 2019) on a wide range of tasks (Ding et al., 2022), it has a practically valuable property: different trained prompts can be evaluated in parallel in a multi-task manner (i.e., a single backbone LM can serve different tasks during inference, which simplifies model serving in real-world applications) (Lester et al., 2021). This property is why researchers aim to develop P-Tuning methods further. Although multi-task evaluation is possible with P-Tuning, it introduces significant computational overhead, since prefixes are concatenated to the input sequences and the attention mechanism (Vaswani et al., 2017) must then be evaluated on longer sequences. We propose a simple mechanism for parameter-efficient fine-tuning of LMs, namely Ahead-of-Time (AoT) P-Tuning, in which an input-dependent bias is added before each Transformer layer. As with P-Tuning, AoT P-Tuning can be used in multi-task inference setups where a single backbone LM serves several downstream tasks. The contributions of this paper can be summarized as follows:

1. We describe the intuition behind AoT P-Tuning, which illustrates the connection of the proposed method with P-Tuning.

3 AHEAD-OF-TIME P-TUNING

For readers' convenience, we provide background on Transformer evaluation and the P-Tuning v1/v2 methods, which are related to the proposed method, in Appendix Section A.

3.1 ON THE OVERHEAD OF RECENT METHODS

The Transformer model has O(n^2) time complexity and GPU memory consumption for sequence length n. For P-Tuning v1, this complexity becomes O((n + p)^2), since the input sequence is extended by the prompt length p, while for P-Tuning v2 the complexity is O(n(n + p)), since only the keys and values are extended by the prefixes.
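The asymptotics above can be made concrete by counting query-key dot products in a single attention head. A minimal sketch, using hypothetical lengths n = 128 and p = 20 (not taken from the paper's experiments):

```python
def attn_score_elems(n_queries, n_keys):
    """Number of query-key dot products one attention head computes."""
    return n_queries * n_keys

n, p = 128, 20  # hypothetical sequence and prompt lengths

plain = attn_score_elems(n, n)           # vanilla Transformer: O(n^2)
v1 = attn_score_elems(n + p, n + p)      # P-Tuning v1: prompt prepended to input
v2 = attn_score_elems(n, n + p)          # P-Tuning v2: prefixes in keys/values only
aot = attn_score_elems(n, n)             # AoT P-Tuning: sequence length unchanged
```

Even for a short prompt (p = 20), v1 computes roughly 34% more attention scores than the plain model, v2 roughly 16% more, while AoT P-Tuning computes exactly as many as the backbone alone.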



Figure 1: GLUE and SuperGLUE Macro scores for different backbone model scales. See Section 4.2 for more details.

Figure 2: Schematic comparison of P-Tuning v2 (left), and AoT P-Tuning (right). Since the sequence length is not increased, AoT P-Tuning takes significantly less time to evaluate, only requiring the overhead of adding biases to the input sequence (See Section 4.3 for experiments with inference speed).
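The bias-addition step sketched in the figure can be written in a few lines. This is an illustrative sketch, assuming the per-layer biases are stored in a lookup table indexed by input token id; the name `bias_table` is hypothetical, not from the paper's code:

```python
def add_aot_bias(hidden_states, token_ids, bias_table):
    # hidden_states: n x d hidden states entering a Transformer layer,
    # bias_table:    vocab_size x d input-dependent bias vectors.
    # The sequence length stays n, so attention cost is unchanged.
    return [
        [h + b for h, b in zip(hidden_states[i], bias_table[tok])]
        for i, tok in enumerate(token_ids)
    ]

hidden = [[1.0, 1.0], [2.0, 2.0]]   # n = 2 tokens, d = 2
table = [[0.0, 0.0], [0.5, -0.5]]   # biases for a vocabulary of 2 token ids
out = add_aot_bias(hidden, [1, 0], table)
```

Because the table can be gathered for a batch ahead of the layer evaluation, the only per-layer overhead is an elementwise addition.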

2 RELATED WORK

A wide range of different methods can be referred to as P-Tuning. Liu et al. (2021b) proposed adding soft prompts to the embeddings of GPT-2's input sequence (Radford et al., 2019) to train it on classification tasks. Lester et al. (2021) proposed a scheme similar to that of Liu et al. (2021b), but trained a T5 model (Raffel et al., 2020) with P-Tuning to show how the method's performance changes with the scale of the backbone model. Recently, Qin & Eisner (2021); Li & Liang (2021); Liu et al. (2021a) proposed adding prefixes not only to the input embeddings but also at each layer of the Transformer model. In addition, Liu et al. (2021a) suggested training a linear classification head on top of the backbone model instead of utilizing an LM head to obtain classification results. Given this range of similar methods, we follow the naming used by Liu et al. (2021a) and refer to Prompt-Tuning (adding soft prompts to the input embeddings) as P-Tuning v1 and to Prefix-Tuning (adding soft prefixes at each layer of the Transformer backbone) as P-Tuning v2. Hu et al. (2022) proposed training low-rank changes of attention weights, while Houlsby et al. (2019) fine-tuned additional model layers, which can also be considered parameter-efficient. Ben Zaken et al. (2022) proposed fine-tuning only the bias terms of the model.
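The core operation of P-Tuning v1 described above is simply prepending trainable vectors to the input embeddings. A minimal sketch, with illustrative shapes and names not taken from any of the cited implementations:

```python
def prepend_soft_prompt(prompt_embeddings, token_embeddings):
    # prompt_embeddings: p x d trainable soft prompt vectors,
    # token_embeddings:  n x d embeddings of the actual input tokens.
    # The backbone then attends over a sequence of length p + n,
    # which is the source of the overhead discussed in Section 3.1.
    return prompt_embeddings + token_embeddings

d = 4
prompt = [[0.5] * d, [0.25] * d]             # p = 2 soft prompt vectors
tokens = [[float(i)] * d for i in range(3)]  # n = 3 token embeddings
extended = prepend_soft_prompt(prompt, tokens)
```

P-Tuning v2 differs in that such prefixes are injected at every layer (into the keys and values) rather than only at the embedding layer.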

2. We propose two reparameterizations of AoT P-Tuning weights: the first based on a factorized matrix trained from scratch, and the second based on the LM's embeddings matrix passed through a trainable fully connected network.

3. We evaluated the proposed method on the GLUE and SuperGLUE benchmarking datasets (Wang et al., 2018; 2019).
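The two reparameterizations of the bias table mentioned above can be sketched as plain matrix products. This is an illustrative sketch with hypothetical shapes and a made-up rank r; it is not the paper's implementation:

```python
def matmul(a, b):
    # (m x k) @ (k x n) -> (m x n), plain-Python matrix product
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

vocab, d, r = 3, 4, 2  # toy vocabulary size, hidden size, and rank

# 1) Factorized from scratch: P = W_A @ W_B trains r * (vocab + d)
#    parameters instead of a full vocab x d bias table.
W_A = [[0.1 * (i + j) for j in range(r)] for i in range(vocab)]
W_B = [[0.2] * d for _ in range(r)]
P_factorized = matmul(W_A, W_B)

# 2) From the LM's (frozen) embedding matrix passed through a small
#    trainable fully connected map: P = E @ W_fc.
E = [[1.0] * d for _ in range(vocab)]   # stand-in for the embedding matrix
W_fc = [[0.05] * d for _ in range(d)]
P_from_embeddings = matmul(E, W_fc)
```

In both cases the resulting vocab x d table can be materialized once, ahead of time, and then used as a per-token-id lookup during inference.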

