MODEL ENSEMBLE INSTEAD OF PROMPT FUSION: A SAMPLE-SPECIFIC KNOWLEDGE TRANSFER METHOD FOR FEW-SHOT PROMPT TUNING

Abstract

Prompt tuning approaches, which learn task-specific soft prompts for a downstream task conditioned on frozen pre-trained models, have attracted growing interest due to their parameter efficiency. With large language models and sufficient training data, prompt tuning performs comparably to full-model tuning. However, with limited training samples in few-shot settings, prompt tuning fails to match the performance of full-model fine-tuning. In this work, we focus on improving the few-shot performance of prompt tuning by transferring knowledge from the soft prompts of source tasks. Recognizing the good generalization capability of ensemble methods in low-data regimes, we first show empirically that a simple ensemble of model predictions based on different source prompts outperforms existing multi-prompt knowledge transfer approaches, such as source prompt fusion, in the few-shot setting. Motivated by this observation, we further investigate model ensembles and propose Sample-specific Ensemble of Source Models (SESoM). SESoM learns to adjust the contribution of each source model separately for each target sample when ensembling source model outputs. In this way, SESoM inherits the superior generalization of model ensemble approaches while capturing the sample-specific competence of each source prompt. We conduct experiments on a diverse set of eight NLP tasks using models of three scales (T5-{base, large, XL}) and find that SESoM consistently outperforms existing models of the same, and even larger, parametric scale by a large margin.

1. INTRODUCTION

The past few years have witnessed the great success of large pre-trained language models (PLMs) (Kenton & Toutanova, 2019; Liu et al., 2019; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). The size of pre-trained models, which can easily reach billions of parameters (Brown et al., 2020; Raffel et al., 2020), however, hinders their real-world deployment and application: fine-tuning such models for downstream NLP tasks is computationally expensive and memory-inefficient. To alleviate this problem, many parameter-efficient fine-tuning methods have been proposed (Li & Liang, 2021; Houlsby et al., 2019; Zhang et al., 2021; Lester et al., 2021; Liu et al., 2021b). Among them, prompt tuning (Lester et al., 2021) is one of the most widely adopted. Given a downstream task, prompt tuning keeps the entire pre-trained model frozen; only the newly added task-specific soft prompts are updated on the training data of the target task, conditioned on the original pre-trained model. Compared to traditional fine-tuning methods that update the entire pre-trained model, prompt tuning consumes significantly less memory and less training time per iteration (Table 10 in Gu et al. (2022)). Despite prompt tuning's practical advantages and its continuously improving performance on various NLP tasks (Liu et al., 2021a; Vu et al., 2022), its performance in few-shot settings, where labeled training data is limited, still leaves large room for improvement (Gu et al., 2022). In low-data scenarios, one of the most widely applied approaches to alleviating the data shortage of the target task is to seek help from source tasks where labeled training data is abundant. Although such knowledge transfer from multiple source tasks has been analyzed for full-model training in other domains (Chen, 2021; Li et al., 2020; Sasso et al., 2022; Lee et al., 2019), relevant methods for few-shot prompt tuning remain underexplored.
Therefore, in this work, we seek an effective strategy for using soft prompts trained on multiple source tasks to benefit few-shot prompt tuning on a new target task. Given soft prompts trained on several source tasks and full training data for a target task, there are a few existing approaches one could adopt. Vu et al. (2022) find the most suitable source soft prompt to initialize the soft prompt of the target task. Alternatively, Asai et al. (2022) directly fuse all source soft prompts together with a target task-specific prompt. Although both source-prompt-based initialization and fusion improve performance given enough training data for the target task, we empirically find them less effective under few-shot settings. Another tempting alternative for using source prompts is model ensembling, which is known to provide good generalization and low variance (Hansen & Salamon, 1990). For instance, Dvornik et al. (2019) and Liu et al. (2020) show that simple ensemble methods outperform complicated approaches in few-shot settings in the computer vision domain. Therefore, for few-shot prompt tuning, we ask whether an ensemble of model outputs given different source prompts achieves better performance than existing approaches employing source prompts, and if so, what the most effective ensemble strategy is for knowledge transfer from multiple source prompts. To answer these questions, we conduct an empirical analysis and find that a simple uniform logit-averaging ensemble of model predictions based on different source prompts can already outperform existing multi-source knowledge transfer approaches for few-shot prompt tuning. Motivated by this observation, we look further into ensemble approaches and propose our solution, a Sample-specific Ensemble of Source Models (SESoM). Source models refer to the trained soft prompts of the source tasks together with the pre-trained language model on which the source soft prompts were trained.
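The uniform logit-averaging baseline described above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: each "source model" is assumed to have already produced classification logits for a batch of target samples, and the ensemble simply averages them before taking the argmax.

```python
import numpy as np

def uniform_ensemble(source_logits):
    """Uniformly average the logits of K source models.

    source_logits: array of shape (K, batch, num_classes), the output
    logits of the frozen PLM conditioned on each of the K source prompts.
    Returns averaged logits of shape (batch, num_classes).
    """
    return np.mean(source_logits, axis=0)

# Toy example: 3 hypothetical source models, 2 target samples, 2 classes.
logits = np.array([
    [[3.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [1.0, 0.0]],
    [[0.0, 2.0], [2.0, 2.0]],
])
avg = uniform_ensemble(logits)      # shape (2, 2)
preds = avg.argmax(axis=-1)         # one predicted class per sample
```

Note that every target sample receives the same weight (1/K) for every source model; this is exactly the fixed, sample-agnostic behavior that SESoM improves on.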
As the name suggests, SESoM learns from the few-shot target samples to adaptively decide how much each source task should contribute to each target sample. Specifically, our method trains an attention-style network that generates the weights used to ensemble the outputs of the different source models when making a prediction for each target sample. In this way, our model captures sample-specific preferences over the source models from the few-shot labeled target data. Compared to existing knowledge transfer approaches for prompt tuning, which apply a fixed transfer strategy to all target samples, SESoM is therefore more effective thanks to its sample-specific strategy. We conduct experiments across six source tasks and eight target tasks at three model scales: T5-Base, T5-Large, and T5-XL. Experimental results show that SESoM outperforms existing methods, such as source prompt fusion approaches and other model ensemble methods, by a large margin in every scenario tested. Moreover, SESoM consistently outperforms existing methods as the number of few-shot labeled target samples increases. Even in full-data settings, SESoM outperforms existing methods, although not as significantly as in few-shot settings. Finally, we find that SESoM achieves better performance as the number of source tasks increases, even when the newly added tasks are generally less suitable for the target task. Our case study also shows that SESoM generates different ensemble weights for different samples of the same target task, and that the generated weights align with the sample-specific performance of the different source models.
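The sample-specific weighting idea can be illustrated with a minimal attention-style sketch. All names here (`W_q`, `keys`, `SampleSpecificEnsemble`) are hypothetical and the parameterization is not the paper's exact architecture: a query is projected from each target sample's representation, scored against one learned key per source model, and the softmax of those scores mixes the source models' logits per sample. In the actual method these parameters would be trained on the few-shot labeled target data; here they are randomly initialized for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SampleSpecificEnsemble:
    """Toy SESoM-style ensemble: per-sample attention weights over K source models."""

    def __init__(self, d_model, num_sources, seed=0):
        rng = np.random.default_rng(seed)
        self.W_q = rng.normal(scale=0.1, size=(d_model, d_model))      # query projection
        self.keys = rng.normal(scale=0.1, size=(num_sources, d_model)) # one key per source

    def forward(self, sample_repr, source_logits):
        # sample_repr: (batch, d_model); source_logits: (K, batch, C)
        q = sample_repr @ self.W_q            # per-sample queries, (batch, d_model)
        scores = q @ self.keys.T              # attention scores, (batch, K)
        weights = softmax(scores, axis=-1)    # per-sample mixing weights over sources
        # Weighted sum of source logits, one weight vector per sample.
        mixed = np.einsum('bk,kbc->bc', weights, source_logits)
        return mixed, weights

# Usage: 3 sources, batch of 2 samples, 2 classes, 4-dim representations.
ens = SampleSpecificEnsemble(d_model=4, num_sources=3)
reprs = np.arange(8, dtype=float).reshape(2, 4)
logits = np.ones((3, 2, 2))
mixed, w = ens.forward(reprs, logits)
```

Unlike the uniform ensemble, the weight vector `w` differs across rows of the batch, which is what lets the model defer to whichever source task is most competent for each individual sample.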

2. RELATED WORK

Knowledge transfer approaches in the context of prompt tuning. Since the emergence of prompt tuning, much recent research has focused on improving its performance in full-data fine-tuning. Some of this work transfers knowledge from tasks similar to the target task in order to facilitate prompt tuning on the target task. Among them, SPoT (Vu et al., 2022) first learns a prompt on one or more source tasks and then uses it to initialize the prompt for a target task, significantly boosting the performance of prompt tuning across many tasks. Similarly, PPT (Gu et al., 2022) pre-trains the soft prompt of the target task on data formatted similarly to the target data. Both methods apply a fixed knowledge transfer strategy to all target samples, since each only provides an initialization before few-shot prompt tuning of the target task. In contrast, our method provides sample-specific knowledge transfer from the source models to each target sample, leading to better performance in few-shot fine-tuning.

