AHEAD-OF-TIME P-TUNING

Abstract

This paper proposes AoT P-Tuning, a new parameter-efficient fine-tuning method that adds input-dependent biases before each Transformer layer is evaluated. The method reduces the required evaluation time while still allowing multi-task inference, in which a single backbone model evaluates different tasks within a single batch. We evaluated the proposed method on the GLUE and SuperGLUE benchmarks using RoBERTa-Base, RoBERTa-Large, and DeBERTa-XL backbone models. Our results show that AoT P-Tuning performs on par with or better than P-Tuning v2, is comparable to other baselines for efficient fine-tuning, and is faster during inference.



1 INTRODUCTION

P-Tuning (Liu et al., 2021b;a; Lester et al., 2021) is a promising way to fine-tune large Language Models (LMs) (Devlin et al., 2019; Lan et al., 2020; Liu et al., 2019; Radford et al., 2019). While it currently underperforms other methods for parameter-efficient fine-tuning (Hu et al., 2022; Houlsby et al., 2019) on a wide range of tasks (Ding et al., 2022), it has a practically valuable property: different trained prompts can be evaluated in parallel in a multi-task manner (i.e., a single backbone LM can serve different tasks during inference, which simplifies model serving in real-world applications) (Lester et al., 2021). This property is why researchers aim to develop P-Tuning methods further.

Although multi-task evaluation is possible with P-Tuning, it introduces significant computational overhead: prefixes are concatenated to the input sequences, and the attention mechanism (Vaswani et al., 2017) must then be evaluated on these longer sequences.

We propose a simple mechanism for parameter-efficient fine-tuning of Language Models, namely Ahead-of-Time (AoT) P-Tuning, which adds an input-dependent bias before each Transformer layer. As with P-Tuning, AoT P-Tuning can be used in multi-task inference setups in which a single backbone LM serves several downstream tasks.

The contributions of this paper can be summarized as follows:

1. We describe the intuition behind AoT P-Tuning, which illustrates the connection of the proposed method with P-Tuning.
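To illustrate the core idea, the following is a minimal numpy sketch of an input-dependent bias added ahead of a Transformer layer. It is an assumption-laden simplification, not the paper's implementation: the bias table is kept as a full per-vocabulary-entry matrix (a practical implementation would factorize or otherwise compress it), and the Transformer layer itself is omitted. The names `aot_bias` and `bias_table` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, seq_len = 100, 8, 5

# Hypothetical per-task trainable bias table: one bias vector per
# vocabulary entry, looked up by input id.
bias_table = rng.normal(size=(vocab_size, hidden_dim)) * 0.01

def aot_bias(hidden_states, input_ids, table):
    """Add an input-dependent bias before the Transformer layer.

    Unlike P-Tuning-style prefix concatenation, this keeps the
    sequence length unchanged, so attention cost does not grow.
    """
    return hidden_states + table[input_ids]

input_ids = rng.integers(0, vocab_size, size=seq_len)
h = rng.normal(size=(seq_len, hidden_dim))   # hidden states entering a layer
h_biased = aot_bias(h, input_ids, bias_table)

# Sequence length is unchanged; only the hidden states are shifted.
assert h_biased.shape == h.shape
```

By contrast, P-Tuning v2 would prepend a trained prefix of length `p` to the keys and values at every layer, so attention is computed over `seq_len + p` positions; here the biased sequence stays at `seq_len`, which is the source of the inference speedup claimed above.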



Figure 1: GLUE and SuperGLUE Macro scores for different backbone model scales. See Section 4.2 for more details.

