MULTITASK PROMPT TUNING ENABLES PARAMETER-EFFICIENT TRANSFER LEARNING

Abstract

Prompt tuning, in which a base pretrained model is adapted to each task via conditioning on learned prompt vectors, has emerged as a promising approach for efficiently adapting large language models to multiple downstream tasks. However, existing methods typically learn soft prompt vectors from scratch, and it remains unclear how to exploit rich cross-task knowledge in prompt vectors in a multitask learning setting. We propose multitask prompt tuning (MPT), which first learns a single transferable prompt by distilling knowledge from multiple task-specific source prompts. We then learn multiplicative low-rank updates to this shared prompt to efficiently adapt it to each downstream target task. Extensive experiments on 23 NLP datasets demonstrate that our proposed approach outperforms state-of-the-art methods, in some cases even the full finetuning baseline, despite tuning only 0.035% as many task-specific parameters.

1. INTRODUCTION

Finetuning pretrained language models (PLMs) has led to significant improvements across various downstream NLP tasks (Devlin et al., 2019; Howard & Ruder, 2018; Raffel et al., 2020). However, the conventional paradigm of full task-specific finetuning (FT) is difficult to scale to multiple tasks, given that modern PLMs can have hundreds of millions (or even billions) of parameters. There has thus been growing interest in developing parameter-efficient methods for model tuning (Houlsby et al., 2019; Lester et al., 2021; Ding et al., 2022), where the goal is to learn only a small number of additional parameters per task while achieving performance comparable to full finetuning. Prompt tuning (PT), which prepends tunable continuous prompt vectors to the input, has emerged as a promising approach for parameter-efficient transfer learning with PLMs (Liu et al., 2021a; Li & Liang, 2021; Lester et al., 2021; Liu et al., 2022b; 2021b). PT freezes the PLM parameters and only learns a small set of task-specific prompt vectors. However, despite its impressive performance, there is still a large gap between prompt tuning and full finetuning (Lester et al., 2021). Additionally, this approach is sensitive to initialization and often requires more training time than finetuning (Su et al., 2022; Zhong et al., 2022). Recent work has proposed to address these issues by transferring prompt vectors from various tasks (Su et al., 2022; Zhong et al., 2022). These methods first train soft prompts on multiple source tasks and then use these pretrained prompts to initialize the prompt for further finetuning on a target task, based on a (potentially learned) similarity measure (see Figure 1, top). In this paper, we extend this line of work and introduce multitask prompt tuning (MPT), which uses multitask data to learn a single prompt that can be efficiently transferred to target tasks.
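The vanilla prompt tuning setup described above can be sketched in a few lines. Below is a minimal NumPy illustration with toy dimensions, not an actual PLM implementation; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16      # embedding dimension (toy size)
prompt_len = 4    # number of soft prompt vectors
seq_len = 6       # input sequence length

# Frozen input embeddings for one example (stand-in for a real PLM encoder input).
input_embeds = rng.normal(size=(seq_len, d_model))

# The only trainable parameters in prompt tuning: the soft prompt matrix.
soft_prompt = rng.normal(size=(prompt_len, d_model))

# Prompt tuning conditions the frozen model by prepending the prompt
# vectors to the input embeddings; gradients would flow only into
# `soft_prompt`, leaving the PLM weights untouched.
model_input = np.concatenate([soft_prompt, input_embeds], axis=0)

print(model_input.shape)  # (10, 16): prompt_len + seq_len rows
```

The per-task storage cost is just `prompt_len * d_model` values, which is what makes the approach parameter-efficient relative to full finetuning.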
While conceptually simple, learning a shared prompt space can be practically challenging, as it requires learning commonalities across different source tasks while minimizing interference. Therefore, we decompose the soft prompt of each source task (which can be represented as a prompt matrix) into a multiplication of a shared matrix and a low-rank task-specific matrix, and find that this decomposition is more effective than simply sharing the prompt matrix across all tasks. The decomposition is learned through knowledge distillation from soft prompts obtained via regular prompt tuning. To transfer to new tasks, we perform low-rank multiplicative updates to the shared prompt matrix. Figure 1 (bottom) illustrates our approach. Extensive experiments on 23 NLP datasets across diverse tasks demonstrate the effectiveness of our proposed approach over state-of-the-art prompt transfer methods. On the SuperGLUE benchmark (Wang et al., 2019), MPT with T5-Base (Raffel et al., 2020) yields a 16.3% improvement over the vanilla prompt tuning baseline (PT, Lester et al., 2021), and also outperforms the most competitive multitask prompt transfer baseline (ATTEMPT, Asai et al., 2022) despite tuning far fewer task-specific prompt parameters (77.6K vs. 232K). On some benchmarks, MPT exceeds the performance of full finetuning while requiring only 0.035% tunable parameters per task (see Figure 2). We also find that MPT is very effective for few-shot learning with 4-32 labels for each target task.
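The multiplicative decomposition described above can also be sketched. Assuming, per the paragraph, that each task's prompt is formed as the elementwise (Hadamard) product of a shared prompt matrix and a rank-one task-specific matrix u v^T, a toy NumPy sketch looks like this (names and sizes are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

prompt_len, d_model, num_tasks = 4, 16, 3

# Shared prompt matrix, common across all source tasks.
shared_prompt = rng.normal(size=(prompt_len, d_model))

def task_prompt(u, v, shared=shared_prompt):
    """Compose a task-specific prompt as a multiplicative update:
    shared * (u v^T), where the outer product u v^T is rank one."""
    return shared * np.outer(u, v)

# Low-rank task-specific parameters: one (u, v) pair per task.
task_params = [(rng.normal(size=prompt_len), rng.normal(size=d_model))
               for _ in range(num_tasks)]

prompts = [task_prompt(u, v) for u, v in task_params]

# Each task adds only prompt_len + d_model parameters on top of the
# shared matrix, instead of a full prompt_len * d_model prompt.
```

Transferring to a new target task then amounts to learning a fresh (u, v) pair (and optionally updating the shared matrix), which is what keeps the per-task parameter count small.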

2. RELATED WORK

Parameter-efficient transfer learning. Parameter-efficient transfer learning for pretrained language models is an active research area (Ding et al., 2022). Adapters (Houlsby et al., 2019; Mahabadi et al., 2021) and their variants (Hu et al., 2021; Karimi Mahabadi et al., 2021) insert trainable layers, while BitFit (Zaken et al., 2022) only updates the bias parameters without changing any other model parameters. Diff pruning (Guo et al., 2021) and FISH (Sung et al., 2021) learn sparse updates to the original PLM. Another popular choice is prompt tuning (Lester et al., 2021), which only updates soft prompt vectors prepended to the input. Our approach is most related to work on the transferability of prompts (Wang et al., 2021; Vu et al., 2022; Su et al., 2022), which focuses on boosting the performance of prompt tuning across many tasks. SPoT (Vu et al., 2022) selects a single source prompt using a similarity measure, and ATTEMPT (Asai et al., 2022) adopts an attention mechanism over the source prompts to initialize the prompt for a target task. Unlike these methods, our approach learns a single shared prompt by decomposing and distilling knowledge from source prompts, which enables efficient adaptation to a diverse set of target tasks.

Figure 1: A conceptual overview of our approach. Instead of retrieving or aggregating source prompts (top), multitask prompt tuning (MPT, bottom) learns a single transferable prompt. The transferable prompt is learned via prompt decomposition and distillation.

Figure 2: Parameter efficiency on GLUE (left) and SuperGLUE (right). Our multitask prompt tuning (MPT) approach, which transfers a single shared prompt learned from multiple source tasks using prompt decomposition and distillation, maintains high accuracy (y-axis) while finetuning only a small number of parameters per task (x-axis). All results are based on T5-Base (Raffel et al., 2020). Baselines include Adapters (Houlsby et al., 2019), BitFit (Zaken et al., 2022), PT (Lester et al., 2021), SPoT (Vu et al., 2022), and ATTEMPT (Asai et al., 2022). * indicates multitask training on target tasks. Best viewed in color.


* Work done during an internship at MIT-IBM Watson AI Lab.

