MODEL ENSEMBLE INSTEAD OF PROMPT FUSION: A SAMPLE-SPECIFIC KNOWLEDGE TRANSFER METHOD FOR FEW-SHOT PROMPT TUNING

Abstract

Prompt tuning approaches, which learn task-specific soft prompts for a downstream task conditioned on a frozen pre-trained model, have attracted growing interest due to their parameter efficiency. With large language models and sufficient training data, prompt tuning performs comparably to full-model tuning. However, with limited training samples in few-shot settings, prompt tuning fails to match the performance of full-model fine-tuning. In this work, we focus on improving the few-shot performance of prompt tuning by transferring knowledge from the soft prompts of source tasks. Recognizing the good generalization capabilities of ensemble methods in low-data regimes, we first experimentally show that a simple ensemble of model predictions based on different source prompts outperforms existing multi-prompt knowledge transfer approaches, such as source prompt fusion, in the few-shot setting. Motivated by this observation, we further investigate model ensembles and propose Sample-specific Ensemble of Source Models (SESoM). SESoM learns to adjust the contribution of each source model separately for each target sample when ensembling source model outputs. In this way, SESoM inherits the superior generalization of model ensemble approaches and simultaneously captures the sample-specific competence of each source prompt. We conduct experiments across a diverse set of eight NLP tasks using models of different scales (T5-{base, large, XL}) and find that SESoM consistently outperforms existing methods of the same as well as larger parametric scale by a large margin.

1. INTRODUCTION

The past few years have witnessed the great success of large pre-trained language models (PLMs) (Kenton & Toutanova, 2019; Liu et al., 2019; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). The size of pre-trained models, which can easily reach billions of parameters (Brown et al., 2020; Raffel et al., 2020), however, hinders their real-world deployment and application: fine-tuning such models for downstream NLP tasks is computationally expensive and memory-inefficient. To alleviate this problem, many parameter-efficient fine-tuning methods have been proposed (Li & Liang, 2021; Houlsby et al., 2019; Zhang et al., 2021; Lester et al., 2021; Liu et al., 2021b). Among them, prompt tuning (Lester et al., 2021) is one of the most widely adopted. Given a downstream task, prompt tuning keeps the entire pre-trained model frozen; only the newly added task-specific soft prompts are updated on the training data of the target task, conditioned on the original pre-trained model. Compared to traditional fine-tuning methods that update the entire pre-trained model, prompt tuning consumes significantly less memory and less training time per iteration (Table 10 in Gu et al. (2022)). Despite prompt tuning's practical advantages and its continuously improving performance on various NLP tasks (Liu et al., 2021a; Vu et al., 2022), its performance in few-shot settings, where labeled training data is limited, still leaves large room for improvement (Gu et al., 2022). In low-data scenarios, one of the most widely applied approaches to alleviate the data shortage of the target task is to seek help from source tasks where labeled training data is abundant. Although such knowledge transfer from multiple source tasks has been analyzed for full-model training in other domains (Chen, 2021; Li et al., 2020; Sasso et al., 2022; Lee et al., 2019), corresponding methods for few-shot prompt tuning remain underexplored.
Therefore, in this work, we seek an effective strategy for using trained soft prompts from multiple source tasks to benefit few-shot prompt tuning on a new target task. Given soft prompts trained on several source tasks and training data for a target task, a few existing approaches could be adopted. Vu et al. (2022) find the most suitable source soft prompt to initialize the soft prompt of the target task. Alternatively, Asai et al. (2022) directly fuse all source soft prompts together with a target task-specific prompt. Although both source-prompt initialization and fusion improve performance when the target task has sufficient training data, we empirically find them less effective under few-shot settings. Another tempting alternative for using source prompts is model ensembling, which is known to provide good generalization and low variance (Hansen & Salamon, 1990). For instance, Dvornik et al. (2019) and Liu et al. (2020) show that simple ensemble methods outperform complicated approaches in few-shot settings in the computer vision domain. Therefore, for few-shot prompt tuning, we ask whether an ensemble of model outputs given different source prompts achieves better performance than existing approaches that employ source prompts, and if so, what the most effective ensemble strategy is for knowledge transfer from multiple source prompts. To answer these questions, we conduct an empirical analysis and find that a simple uniform logit-averaging ensemble of model predictions based on different source prompts can already outperform existing multi-source knowledge transfer approaches for few-shot prompt tuning. Motivated by this observation, we further look into ensemble approaches and propose our solution, a sample-specific ensemble of source models (SESoM). Source models refer to the trained soft prompts of source tasks together with the pre-trained language model that the source soft prompts are trained with.
As the name suggests, SESoM learns from the few-shot target samples to adaptively decide how much each source task should contribute to each target sample. Specifically, our method trains an attention-style network to generate the weights used to ensemble the outputs of the different source models when making the prediction for each target sample. In this way, our model captures sample-specific preferences over the source models from the few-shot labelled target data. Therefore, compared to existing knowledge transfer approaches for prompt tuning that apply a fixed transfer strategy to all target samples, SESoM is more effective due to its sample-specific strategy. We conduct experiments across six source tasks and eight target tasks at three model scales: T5-Base, T5-Large, and T5-XL. Experimental results show that SESoM outperforms existing methods, such as source prompt fusion approaches and other model ensemble methods, by a large margin in every scenario tested. Moreover, SESoM consistently achieves better performance than existing methods as the number of few-shot labeled target samples increases. Even in full-data settings, SESoM outperforms existing methods, although not as significantly as in few-shot settings. Finally, we find that SESoM achieves better performance as the number of source tasks increases, even when the newly added tasks are generally less preferable for the target task. Our case study also shows that SESoM generates different ensemble weights for different samples of one target task, and that the generated weights align with the sample-specific performance of the different source models.

2. RELATED WORK

Knowledge transfer approaches in the context of prompt tuning. Since the emergence of prompt tuning methods, much recent research has focused on improving the performance of prompt tuning in full-data fine-tuning. Some of this work focuses on transferring knowledge from tasks that are similar to the target task to facilitate prompt tuning on the target task. Among them, SPoT (Vu et al., 2022) first learns a prompt on one or more source tasks and then uses it to initialize the prompt for a target task, significantly boosting the performance of prompt tuning across many tasks. Similarly, PPT (Gu et al., 2022) pre-trains the soft prompt of the target task with data formatted similarly to the target data. These two methods provide a fixed knowledge transfer strategy for all target samples, since they both provide an initialization before few-shot prompt tuning of the target task. Different from them, our method provides sample-specific knowledge transfer from the source models to each target sample, leading to better performance in few-shot fine-tuning. More recently, Asai et al. (2022) propose a sample-specific prompt fusion method for full-data prompt tuning. In their method, the soft prompts of source tasks are fused together to construct a new prompt for each target sample, whereas SESoM ensembles the source models' outputs for each target sample.

Ensemble learning has been a popular approach to obtain low-variance and generalizable models (Hansen & Salamon, 1990). Basic ensemble techniques include voting (Hansen & Salamon, 1990), bagging (Breiman, 1994), boosting (Schapire, 1990; Freund & Schapire, 1997), and stacking (Wolpert, 1992), which have been applied to many NLP problems such as model debiasing (Elazar & Goldberg, 2018; Stacey et al., 2020), cross-lingual transfer learning (Wang et al., 2021), calibrating sequence classification and generation models (Reich et al., 2020), and domain adaptation (Kim et al., 2017).
Within the paradigm of prompt-based learning, Schick & Schütze (2021a) explore an unweighted average of the logits corresponding to different human-selected verbalizers for zero-shot classification. Lester et al. (2021) use majority voting over the logits from five prompts trained on the same task. Both employ an identical ensemble strategy for all test samples, which can be sub-optimal: Schick & Schütze (2021a) also found that an oracle that selects the best verbalizer for every test sample significantly outperforms the unweighted-averaging model. SESoM fills this gap by taking over the role of the oracle, training a separate attention module to generate sample-specific importance weights for every source model. The sample-specific weights measure each source model's competence in making a prediction for a given sample and enable better utilization of the available prompts.

3.1. PRELIMINARIES

In SESoM, given the training data of source tasks S_1, ..., S_T and a pre-trained language model, we first train a soft prompt P_j (j ∈ [1, T]) for each source task by running prompt tuning (Lester et al., 2021). Following the typical T5 setting, we restructure all downstream tasks into a text-to-text generation format, where each label of a training sample is represented by a verbalizer (Schick & Schütze, 2021c) and, optionally, a task-specific template (Schick & Schütze, 2021c; b; Mishra et al., 2021). We represent an instance of a source or target task as (X, y), where X is a sequence of token embeddings (X = [x_1, ..., x_l] ∈ R^{l×d}, with l the length of the input token sequence and d the embedding size of the PLM), and y is a classification label. We then map the class label y to its corresponding verbalizer or verbalizer-template sequence, and call it Y. Each soft prompt P_j = [p_1, ..., p_m] ∈ R^{m×d} is also a sequence of embeddings, where m is the number of soft prompt embeddings for the task. Prompt tuning prepends a randomly initialized task-specific soft prompt P_j to the input embedding sequence X, resulting in the final input [P_j; X], which is fed into the pre-trained model to make the prediction. The target task is modeled as Pr_θ(Y | [P_j; X]), where the parameters θ of the pre-trained model are frozen during tuning and only the soft prompt P_j is updated to maximize Pr_θ(Y | [P_j; X]). We show the process of prompt tuning in Fig. 1 (a).
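As a minimal illustration of the input construction described above, the following NumPy sketch prepends a soft prompt to an input embedding sequence. The toy shapes are our own, and only the prompt P_j would be trainable in practice; this is a hedged sketch, not the authors' implementation.

```python
import numpy as np

def build_prompted_input(P_j, X):
    """Prepend a task-specific soft prompt P_j (m x d, trainable) to the
    input token embeddings X (l x d, from the frozen PLM embedding table),
    yielding the final model input [P_j; X] of shape (m + l) x d."""
    assert P_j.shape[1] == X.shape[1], "prompt and input must share embedding size d"
    return np.concatenate([P_j, X], axis=0)

# Toy shapes: m = 4 prompt tokens, l = 7 input tokens, d = 16.
rng = np.random.default_rng(0)
P_j = rng.normal(size=(4, 16))   # randomly initialized soft prompt (updated in tuning)
X = rng.normal(size=(7, 16))     # input embeddings (produced by frozen PLM parameters)
inp = build_prompted_input(P_j, X)
assert inp.shape == (11, 16)
```

During prompt tuning, gradients flow only into `P_j`; the embedding table and all other PLM weights stay fixed.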

3.2. SAMPLE-SPECIFIC ENSEMBLE OF SOURCE MODELS

Given a collection of source prompts [P_1; ...; P_T] from source tasks [S_1; ...; S_T] trained using prompt tuning, and the pre-trained language model θ that these soft prompts are conditioned on, SESoM aims to use their knowledge on a new target task under few-shot settings. We call the prompt from a source task together with the PLM a source model, represented as [P_j; θ]. In SESoM, we first train each source model [P_j; θ] on the few-shot target data in a prompt-tuning manner. This enforces the source models to generate the target task's verbalizers given a target input sample. Then, taking a labeled instance (X, y) from the few-shot target task T_target, SESoM feeds [P_j; X] into the corresponding source model [P_j; θ] and obtains the pre-softmax logit l_{x,j}. It then uses l_{x,j} and X to compute a sample-specific attention weight representing the competence of the source model [P_j; θ] for the given input X. We train an attention-style network G (§ 3.2.1) to compute these sample-specific weights for every source model. Finally, it takes the weighted average of the logits (L_x = [l_{x,1}; ...; l_{x,T}] ∈ R^{T×v}, where v is the vocabulary size of the pre-trained model) from all source models using their attention weights to make its prediction for X.

3.2.1. ATTENTION MODULE G.

The goal of the attention module G is to learn to generate weights to ensemble the source models' pre-softmax logits based on their competence on a target sample X. One input of G is therefore the representation of the target sample X. Other candidate inputs of G that might benefit this sample-specific weight generation are either the source prompts or the pre-softmax logits of the source models. We empirically find that using X and the source prompts as inputs for G is less effective than using X and the source model logits under few-shot settings. We hypothesize that this is because prompts are not self-sufficient: their competence is not determined by them alone, and they are effective only when used along with the PLM for which they were tuned. With abundant training samples, a prompt-based attention module may learn to identify source prompts that are relevant to an input sample from the target task. But with few training samples, it is unable to learn to accurately weight predictions from different source models, which is required to capture their effectiveness for every sample. Logits of source models, on the other hand, are a function of both the prompt and the PLM, and are also capable of modeling the uncertainty in predictions from different models. Consequently, we propose to use the source model logits l_{x,j} along with the input X in our attention module. We first perform max-pooling over the token embedding sequence X = [x_1; ...; x_l] ∈ R^{l×d} to obtain x ∈ R^d as the representation of X. Then, given the representation x of the target input X, we apply a feed-forward network to project x non-linearly into a new representational space,

h̃_x = W_{u,x}^T · γ(W_{d,x}^T · x),    (1)

where γ(·) is a non-linear activation function, and W_{d,x} ∈ R^{d×d'_x} and W_{u,x} ∈ R^{d'_x×d'} are trainable weights. We then apply Layer Norm (Ba et al., 2016) to obtain h_x ∈ R^{d'}, the final projected representation of the target input X.
Similarly, we project the pre-softmax logit l_{x,j} ∈ R^v of each source model given X into the same space,

h̃_{l,j} = W_{u,l}^T · γ(W_{d,l}^T · l_{x,j}),    (2)

where W_{d,l} ∈ R^{v×d'_l} and W_{u,l} ∈ R^{d'_l×d'} are trainable weights, followed by Layer Norm to obtain h_{l,j} ∈ R^{d'}. Then, given h_x and the projected representations of all source logits {h_{l,j}}_{j=1}^T, we compute the attention score a_{x,j} for the source logit l_{x,j} by

a_{x,j} = exp(h_{l,j} · h_x) / Σ_{k=1}^T exp(h_{l,k} · h_x).    (3)

SESoM produces the final output logit l_x ∈ R^v as a linear combination of [l_{x,1}; ...; l_{x,T}] with the computed input-logit attention weights,

l_x = G(x, [l_{x,1}; ...; l_{x,T}]) = Σ_{j=1}^T a_{x,j} l_{x,j}.    (4)

Then, following the prompt tuning approach in Section 3.1, we freeze all parameters θ of the pre-trained language model as well as the source soft prompts [P_1; ...; P_T]. We also show the forward pass of SESoM in Fig. 1 (b). The attention module G is trained by minimizing the cross-entropy loss between softmax(l_x) and the label Y. During few-shot training, we update the attention module G with the few-shot labeled target samples. In this way, G is trained to capture the sample-specific preference over different source models. At inference, we also use G to calculate the sample-specific ensemble weights of all source logits and take their weighted average as the final logit for prediction.

Parameter efficiency. As mentioned in Lester et al. (2021), prompt tuning methods are a natural fit for model ensemble approaches from the perspective of parameter efficiency. Unlike other models, for which an ensemble of T models leads to T times more model parameters, an ensemble of T different prompt-tuning models only adds T soft prompts, because the pre-trained model that the soft prompts condition on is identical across all ensembled models.
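As an illustration, the forward pass of the attention module G described above can be sketched in NumPy. The hidden sizes and the choice of ReLU for the activation γ are illustrative assumptions rather than the paper's hyper-parameters, and bias terms are omitted as in the equations.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    # Layer Norm over a single vector (gain/bias omitted for brevity)
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def attention_module_G(X, L, params):
    """Weight the T source logits L (T x v) for one input X (l x d)."""
    Wdx, Wux, Wdl, Wul = params                    # trainable projections of G
    x = X.max(axis=0)                              # max-pool over tokens -> (d,)
    h_x = layer_norm(Wux.T @ np.maximum(Wdx.T @ x, 0))        # project input, then LN
    H = np.stack([layer_norm(Wul.T @ np.maximum(Wdl.T @ l_j, 0))
                  for l_j in L])                   # project each source logit, (T, d')
    scores = H @ h_x                               # dot products h_{l,j} . h_x
    a = np.exp(scores - scores.max())
    a = a / a.sum()                                # softmax over the T sources
    return a @ L, a                                # ensembled logit l_x (v,) and weights

# Toy sizes: T=3 sources, vocab v=10, input length l=5, d=8, d'_x=d'_l=6, d'=4.
rng = np.random.default_rng(1)
params = (rng.normal(size=(8, 6)), rng.normal(size=(6, 4)),
          rng.normal(size=(10, 6)), rng.normal(size=(6, 4)))
l_x, a_x = attention_module_G(rng.normal(size=(5, 8)),
                              rng.normal(size=(3, 10)), params)
assert l_x.shape == (10,) and np.isclose(a_x.sum(), 1.0)
```

Only the four projection matrices (and, in the full model, the layer-norm gains/biases) would receive gradients; the source logits themselves come from frozen source models.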
Therefore, although SESoM is a model ensemble approach, the additional model parameters introduced by the ensemble are only the soft prompts of the six source tasks, i.e., T × m × d parameters (∼0.6M in our experiments). SESoM also trains one attention module, which consists of four projection layers and two layer norms and requires d × d'_x + d'_x × d' + v × d'_l + d'_l × d' + 4d' parameters (∼0.9M in our experiments). Therefore, the total number of additional trainable model parameters in SESoM is less than 0.5% of a pre-trained T5-base model.
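The parameter counts can be reproduced with simple arithmetic. In the sketch below, the hidden sizes `dx`, `dl`, `dp` (standing in for d'_x, d'_l, d') are placeholder values of our own choosing, not the values used in the experiments.

```python
def sesom_extra_params(T, m, d, v, dx, dl, dp):
    """Trainable parameters SESoM adds on top of the frozen PLM."""
    prompts = T * m * d                                     # T source soft prompts
    # four projection layers + two layer norms (gain and bias, each of size d')
    attention = d * dx + dx * dp + v * dl + dl * dp + 4 * dp
    return prompts, attention

# T5-base: d = 768, v = 32128; T = 6 source tasks, m = 100 prompt tokens.
# dx, dl, dp are hypothetical hidden sizes of G.
prompts, attention = sesom_extra_params(T=6, m=100, d=768, v=32128,
                                        dx=128, dl=24, dp=128)
assert prompts == 6 * 100 * 768       # soft prompt parameters
assert attention == 889_344           # attention module parameters for these sizes
```

With these placeholder sizes the attention module comes to roughly 0.9M parameters, in the same ballpark as the figure quoted above, but the actual sizes are given in Appendix B.3.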

4. EXPERIMENTS

In this section, we first describe our experimental setup and then present our main empirical results compared to baselines. Finally, we conduct various empirical analyses and a case study to show the effectiveness of the different components of SESoM and the reasoning behind our design choices.

4.1. EXPERIMENTAL SETUP

Few-shot prompt tuning setups and hyper-parameters of SESoM. For our main experiments, we follow existing work (Gu et al., 2022) and set the number of few-shot training and validation samples to 32 for every target task. We evaluate all models on the full original validation set of every target task. All presented results are the average of 20 runs with different random seeds. For SESoM, we randomly initialize the attention module G as described in Section 3.2. Details of the other hyper-parameters of SESoM, such as the embedding sizes of G and the optimizer and learning rate used during training, can be found in Appendix B.3. We use pre-trained T5-{base, large, XL} (Raffel et al., 2020) as the language model θ.

Source and target tasks. Following existing work (Asai et al., 2022), we use six tasks that are generally considered helpful to other NLP tasks as our source tasks: MNLI (Williams et al., 2018), QNLI (Demszky et al., 2018), QQP (Wang et al., 2018), SST2 (Socher et al., 2013), SQuAD (Rajpurkar et al., 2016), and ReCoRD (Zhang et al., 2018). For each pre-trained model tested (T5-{base, large, XL}), we train soft prompts for these source tasks separately using prompt tuning. The hyper-parameters for training the soft prompts can be found in Appendix B.1. We then evaluate our model on the following GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) tasks: WNLI (Wang et al., 2018), MRPC (Dolan et al., 2005), BoolQ (Clark et al., 2019), MultiRC (Khashabi et al., 2018), RTE (Giampiccolo et al., 2007), WiC (Pilehvar & Camacho-Collados, 2019), WSC (Levesque et al., 2012), and CB (De Marneffe et al., 2019). They cover a wide range of NLP tasks, including question answering, sentiment analysis, and textual entailment. Details of the source and target tasks can be found in Appendix A.

Baselines. We compare SESoM not only with existing knowledge transfer approaches in the context of prompt tuning, but also with multiple existing ensemble strategies.
The knowledge transfer methods for prompt tuning are:

• SPoT-t (Vu et al., 2022): SPoT-t originally first trains a soft prompt on the full training set of the target task for a few epochs, then retrieves the source soft prompt with the highest cosine similarity to this target prompt and uses it to re-initialize the target prompt, which is then re-trained from this new initialization. To adapt SPoT-t to few-shot settings, we keep everything unchanged except that we train the target prompt with the few-shot labelled data instead of the full target training set.

• PPT (Gu et al., 2022): PPT pre-trains the target prompt with additional data pre-processed into the format of the target task. For a fair comparison with the other approaches, we use the training data of all six source tasks to pre-train the target soft prompt. We then update the target soft prompt with the few-shot target data as in prompt tuning.

• ATTEMPT (Asai et al., 2022): ATTEMPT is a sample-specific knowledge fusion method for full-data prompt tuning. In this method, the soft prompts of source tasks are fused together to construct a new prompt for each target sample, unlike SESoM's model-level ensemble of source models. To apply ATTEMPT in few-shot settings, we keep everything unchanged except that we train the model with the few-shot labelled data instead of the full target training set.

The ensemble baselines are:

• Uniform Ensemble: the simplest ensemble of the source models. As in SESoM, we first fine-tune each source model with the few-shot labelled target data. We then average the pre-softmax logits of all source models' outputs for a given target sample to make the prediction. Uniform Ensemble differs from SESoM in that it simply averages the source logits for all target samples instead of computing a sample-specific weighted average.
• Majority-Vote Ensemble: in this baseline, every individual source model first makes its own prediction, and the prediction with the most votes is taken as the final prediction. The remaining settings of Majority-Vote Ensemble are the same as for Uniform Ensemble.

• Fixed-weight Ensemble: instead of simply averaging the logits of the source models as Uniform Ensemble does, Fixed-weight Ensemble takes a weighted average of the source logits to make predictions. The voting weight of each source model is fixed across all target samples and is calculated from the F1 score of that source model on the few-shot target data: the better a source model performs on the labelled few-shot data, the larger its voting weight for all target samples.

More details about the training of all the baseline methods above can be found in Appendix B.2.
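The three ensemble baselines differ only in how the per-source logits or votes are combined. A minimal sketch with made-up logits follows; normalizing the F1 scores into weights by dividing by their sum is our own assumption, since the exact normalization is an implementation detail.

```python
import numpy as np
from collections import Counter

def uniform_ensemble(L):
    """Average the pre-softmax logits of all T source models (T x C array)."""
    return int(L.mean(axis=0).argmax())

def majority_vote(L):
    """Each source model votes with its own argmax prediction."""
    votes = Counter(int(l.argmax()) for l in L)
    return votes.most_common(1)[0][0]

def fixed_weight_ensemble(L, f1_scores):
    """Weight each source model's logits by its normalized few-shot F1 score;
    the same weights are reused for every target sample."""
    w = np.asarray(f1_scores, dtype=float)
    w = w / w.sum()
    return int((w[:, None] * L).sum(axis=0).argmax())

# Three toy source models, two classes.
L = np.array([[2.0, 1.0],    # source 1 prefers class 0
              [0.5, 3.0],    # source 2 strongly prefers class 1
              [1.2, 1.1]])   # source 3 slightly prefers class 0
assert uniform_ensemble(L) == 1               # logit averaging follows the confident source
assert majority_vote(L) == 0                  # vote counting follows the majority
assert fixed_weight_ensemble(L, [0.3, 0.9, 0.3]) == 1
```

The example shows why logit averaging and majority voting can disagree: averaging is sensitive to how confident each model is, while voting only counts argmax decisions.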

4.2. MAIN RESULTS

The few-shot performance of SESoM and the other methods is shown in Table 1. First, for every pre-trained model (T5-{base, large, XL}), SESoM outperforms all baselines on the average score by a large margin, and it also outperforms the baselines on a clear majority of individual target tasks across model sizes. Specifically, compared to approaches that apply a fixed knowledge transfer strategy to all target samples (SPoT-t, PPT, Uni Ensemble, FW Ensemble), SESoM's superior performance indicates the effectiveness of its sample-specific strategy. Compared to the existing approach that conducts sample-specific soft prompt fusion (ATTEMPT), SESoM's superior performance indicates the effectiveness of its model ensemble strategy in few-shot settings. Moreover, SESoM on T5-Base even outperforms the non-ensemble baselines (SPoT-t, PPT, ATTEMPT) on T5-Large, and similarly, SESoM on T5-Large outperforms the non-ensemble baselines on T5-XL. This indicates that SESoM further improves the parameter efficiency of prompt tuning in few-shot settings. Second, comparing the ensemble methods (Uni Ensemble, FW Ensemble, and our proposed SESoM) with the remaining methods, the ensemble methods perform better in the few-shot setting, and the bigger the pre-trained model, the more apparent their performance advantage. This is likely due to the widely observed strengths of ensemble methods, which provide better generalization and lower variance at inference, as discussed in Section 1 as our motivation. Finally, among all methods, ATTEMPT has the weakest results in few-shot settings. ATTEMPT uses the few-shot target data to train a network that outputs weights to fuse the source soft prompts for each target sample.
Although ATTEMPT achieves strong performance in full-data fine-tuning, the few-shot scenario is difficult for it due to overfitting on the limited training data. In contrast, our proposed SESoM takes the logits of the source models as input and outputs weights for a model ensemble during few-shot training, which takes advantage of the superior generalization and robustness of model ensemble approaches.

4.3. EMPIRICAL ANALYSIS

Would more "ordinary" source tasks help SESoM? In our main experiments, we use the six source tasks mentioned earlier regardless of their actual transferability to the target task. However, if a source task is too different from the target task, it may not affect the target task positively in general. To investigate how SESoM performs when less preferable source tasks are added, we conduct the following experiment: for each target task, we pick the top 1, top 3, and top 5 most generally helpful source tasks, following SPoT-t (Vu et al., 2022), and report the results in Table 2. From Table 2 we can see that SESoM achieves better performance as the number of source tasks increases, even though the newly added source tasks might be generally less preferable for the target task. This indicates that for the target samples on which the generally less favourable source tasks actually perform well, SESoM is able to set proper sample-specific weights and lean more on these source models, which implies the effectiveness of SESoM's sample-specific strategy.

The effect of the number of shots. In the main experiments, the number of few-shot labeled samples per target task is 32, following existing work. To examine how SESoM performs as the number of shots grows, we also set the number of shots to 64, 128, and the full dataset. Results are presented in Table 3. We find that SESoM consistently achieves better performance than existing methods as the number of few-shot labeled target samples increases, suggesting that SESoM can be applied to a wide range of few-shot settings. Even in full-data settings, SESoM outperforms existing methods, although not as significantly as in few-shot settings.

Verification of other design choices.
Before arriving at the current SESoM, we also tried some related design choices that outperform existing baselines to some extent but turned out to be less effective than SESoM. For example, we tried different inputs for the attention module G. Instead of using the source logits as the attention module's second input alongside the input sample X, we tried using the source prompts as inputs, i.e., replacing l_{x,j} in Eq. (2) with P_j; the other components stay the same as in SESoM. We refer to this method as "Ensemble acc. source prompts". We also tried directly generating pseudo labels from the source models to train the target soft prompt. Specifically, we first prompt-tune the source prompts on the target task with the few-shot target training data. We then use these prompt-tuned source prompts with the pre-trained language model to generate pseudo labels for the entire target dataset, where the final pseudo label of each target sample is decided by majority voting among the six source models. These pseudo-labelled samples are used to train the soft prompt of the target task before few-shot prompt tuning with the few-shot target training data. We refer to this method as "Pseudo Label Generation". Results of these methods are shown in Table 4. They achieve performance relatively similar to the FW Ensemble method in Table 1, while all under-perform SESoM.
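The majority-voting step of the "Pseudo Label Generation" variant can be sketched as follows; the label values and the tie-breaking rule (the first label seen wins, via `Counter`) are illustrative assumptions, not details specified in the text.

```python
from collections import Counter

def pseudo_labels(source_predictions):
    """source_predictions: one prediction list per source model, each with one
    label per target sample; returns the majority-vote pseudo label per sample."""
    labels = []
    for sample_preds in zip(*source_predictions):
        votes = Counter(sample_preds)
        labels.append(votes.most_common(1)[0][0])  # ties: first-seen label wins
    return labels

# Six hypothetical source models labeling four unlabeled target samples.
preds = [
    ["yes", "no",  "yes", "no"],
    ["yes", "no",  "no",  "no"],
    ["yes", "yes", "yes", "no"],
    ["no",  "no",  "yes", "yes"],
    ["yes", "no",  "yes", "no"],
    ["yes", "no",  "no",  "no"],
]
assert pseudo_labels(preds) == ["yes", "no", "yes", "no"]
```

The resulting pseudo-labelled set would then be used to pre-train the target soft prompt before the final few-shot prompt tuning.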

4.4. CASE STUDY

In this section, we conduct a case study to verify the effectiveness of SESoM's sample-specific model ensemble strategy. First, we examine whether different samples of one target task indeed require different source models, since this is the basic assumption behind SESoM's sample-specific ensemble strategy. In Table 5, we show two samples from the target task MRPC. Under each sample, the row "preds from individual source" shows each individual source model's prediction for this sample. For Example #1, QQP and QNLI make the correct prediction, while for Example #2, MNLI, SST2, and QNLI make the correct prediction. Such sample-specific performance of different source models on target-task samples is common across target tasks; more examples from other target tasks can be found in Appx. E. Moreover, at least half of the source models make wrong predictions on each example, indicating that uniform or majority-vote ensemble approaches would fail on these examples. Second, we examine whether SESoM's sample-specific contribution weights, generated by its attention module G, align with the performance of the different source models. In Table 5, under each example, the row "weights from SESoM" shows the contribution weights SESoM uses to ensemble each source model. We can see that SESoM generally generates higher weights for the source models that make the correct predictions.

5. CLOSING REMARKS

In this paper we explored the potential of prompt tuning methods in few-shot settings. We found that by properly transferring knowledge from the trained soft prompts of source tasks, prompt tuning's few-shot performance can be significantly improved. Specifically, our proposed method, sample-specific ensemble of source models (SESoM), outperforms existing methods by a large margin in every tested few-shot scenario. SESoM adjusts the contribution of each source model separately for each target sample when ensembling source model outputs. Our empirical analysis suggests that the two key components of SESoM, model-level ensembling instead of prompt-level fusion and the sample-specific ensemble strategy, are both critical to its superior performance. SESoM also consistently achieves superior performance with larger numbers of labelled samples and larger numbers of source tasks.

A DATASETS

A.1 SOURCE TASKS

Following existing work (Asai et al., 2022), we use six tasks that are generally considered helpful to other NLP tasks as our source tasks: MNLI (Williams et al., 2018), QNLI (Demszky et al., 2018), QQP (Wang et al., 2018), SST2 (Socher et al., 2013), SQuAD (Rajpurkar et al., 2016), and ReCoRD (Zhang et al., 2018). Details are shown in Table 6.

A.2 TARGET TASKS

We evaluate our model on the following GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) tasks: WNLI (Wang et al., 2018), MRPC (Dolan et al., 2005), BoolQ (Clark et al., 2019), MultiRC (Khashabi et al., 2018), RTE (Giampiccolo et al., 2007), WiC (Pilehvar & Camacho-Collados, 2019), WSC (Levesque et al., 2012), and CB (De Marneffe et al., 2019). Details are shown in Table 7.

B.1 SOURCE PROMPT TRAINING

We train each source prompt with the corresponding source task data. These source prompts are prompt-tuned (Lester et al., 2021) respectively on MNLI (Williams et al., 2018), QNLI (Demszky et al., 2018), QQP (Wang et al., 2018), SST2 (Socher et al., 2013), SQuAD (Rajpurkar et al., 2016), and ReCoRD (Zhang et al., 2018). Details can be found in Table 6. The soft prompt length (m) is 100 for all pre-trained models, and the embedding dimension (d) is 768 for T5-base and 1024 for T5-{large, 3b}. The remaining hyper-parameters are shown in Table 8.

B.6 INFERENCE TIME

Compared to non-ensemble baselines on the same pre-trained language model, SESoM indeed increases inference time due to the multiple forward passes of the source models. For a fair comparison, we compare SESoM to a prompt fusion model (ATTEMPT (Asai et al., 2022)), which uses the same number of source tasks as SESoM. As a sample-specific knowledge fusion method, ATTEMPT fuses the N soft prompts of the source tasks into a new prompt for each target sample and forwards it together with the target sample through the pre-trained language model, i.e., one forward pass per sample at inference, whereas SESoM runs N forward passes per sample to obtain the source logits. Although SESoM requires more forward passes per target sample, it needs a smaller pre-trained language model (shorter inference time per forward pass) to achieve the same or even better performance.
SESoM with a small pre-trained model (shorter inference time per forward pass) can outperform non-ensemble baselines with a much larger pre-trained model (longer inference time per forward pass). For example, SESoM with T5-base outperforms ATTEMPT with T5-XL on few-shot classification (see Table 1), while the inference time of SESoM with T5-base is 31.38s versus 38.13s for ATTEMPT with T5-XL. The reported inference times are averages of 20 runs with different random seeds on all 8 target tasks, indicating that SESoM can achieve better performance with shorter inference time.

C ANALYSIS ON ATTENTION

Figure 2 shows the average attention weights of G between source and target tasks on D_test. The average weights are produced by the SESoM model with T5-base; hyper-parameters are shown in Appx. B.3. G assigns higher attention scores between more similar target and source tasks: it gives significantly higher attention to QQP for MRPC and WSC, and to MNLI for RTE and WNLI. MRPC and QQP are both tasks that determine whether a pair of questions/sentences are semantically equivalent. WSC is a coreference resolution task that determines the correct referent of a pronoun among the provided nouns, which shares similarities with QQP as well. Besides, WNLI, RTE, and MNLI are all textual entailment challenges, which share similarities with each other. We also find that the attention weights and F1 scores are highly positively correlated on the MRPC, RTE, MultiRC, WSC and CB tasks, as measured by the Pearson product-moment correlation coefficient. We show more examples of attention weights on samples from different target tasks in Table 15, Table 16 and Table 17. They demonstrate SESoM's ability to assign higher attention to the pre-softmax logits that prefer the correct label.
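The correlation just mentioned can be computed as in the following sketch. The attention weights and F1 values below are illustrative placeholders, not the paper's actual numbers:

```python
import numpy as np

# Illustrative per-source attention weights and source-model F1 scores
# on one target task (placeholder numbers for six source models).
attention = np.array([0.05, 0.40, 0.30, 0.10, 0.10, 0.05])
f1_scores = np.array([0.50, 0.80, 0.75, 0.55, 0.60, 0.52])

# Pearson product-moment correlation coefficient between the two vectors
r = np.corrcoef(attention, f1_scores)[0, 1]
```

A value of r close to 1 would indicate that G attends most to the source models that actually perform best on the target task.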



We detail all the verbalizers used in our experiments in Appx. B.5.

²https://github.com/AkariAsai/ATTEMPT




Figure 1: Overview of SESoM. Tunable components (marked in the figure) are updated during training while the others are frozen. (a) describes the training of source prompts in SESoM. Given the input token embedding sequence X of different source tasks and a pre-trained model, prompt tuning trains task-specific source prompts [P1; ...; PT] for source tasks [S1; ...; ST] by prepending them to X and feeding the result to the frozen pre-trained model. (b) describes how SESoM trains the attention module G. G takes the logits [l_x,1; ...; l_x,T] from all T source models as input and uses x̂ as the key to calculate the attention weights a_x over the source logits. The source logits are then weighted-averaged accordingly to construct l_x for the final prediction.
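The ensembling step in (b) can be sketched as follows. This is a minimal illustration: `score_fn` is a hypothetical stand-in for the learned attention module G, whose internal architecture the figure does not fully specify.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D score vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sesom_combine(source_logits, key, score_fn):
    """Weighted-average the source logits [l_x,1; ...; l_x,T] with
    attention weights a_x computed from the key x̂ (pooled input embedding).
    `score_fn` is a hypothetical stand-in for the learned module G."""
    scores = np.array([score_fn(key, l) for l in source_logits])
    a = softmax(scores)                  # attention weights a_x
    l = a @ np.asarray(source_logits)    # ensembled logits l_x
    return l, a
```

The weights a_x are recomputed for every input, which is what makes the ensemble sample-specific.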

Figure 2: Average attention weights of G.

Few-shot performance of all methods. All the scores are the average of 20 runs with different random seeds, with standard errors included in subscripts. The best scores are in bold. "Uni Ensemble" is the acronym for Uniform Ensemble. "MV Ensemble" stands for Majority-vote Ensemble. "FW Ensemble" stands for Fixed-weight Ensemble.
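The three ensemble baselines named in the caption can be sketched as follows. This is a minimal illustration under stated assumptions; in particular, the fixed weights are assumed to be per-task scores such as the source models' F1 on the target task, and the actual implementations may differ:

```python
import numpy as np

def uniform_ensemble(source_logits):
    # Uniform Ensemble: average source pre-softmax logits with equal weights.
    return np.mean(source_logits, axis=0)

def majority_vote(source_preds):
    # Majority-vote Ensemble: pick the most frequent predicted label.
    vals, counts = np.unique(source_preds, return_counts=True)
    return vals[np.argmax(counts)]

def fixed_weight_ensemble(source_logits, weights):
    # Fixed-weight Ensemble: weight each source by a fixed per-task score,
    # identical for every target sample (unlike SESoM's per-sample weights).
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ np.asarray(source_logits)
```

All three use one weight vector for the whole task, which is the key contrast with SESoM's sample-specific weighting.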

SESoM with the top k source models. All results are the average of 20 runs with different random seeds. Standard errors are included in subscripts. "# s." stands for the number of source models used in SESoM.

Results on target tasks with different numbers of few-shot training labels. All the scores are the average of 5 runs, with standard errors included in subscripts.

Results of different design choices on a pre-trained T5-Base model. All the scores are the average of 20 runs, with standard errors included in subscripts. "Ensemble acc SP" stands for Ensemble acc. source prompts. "PLG" stands for Pseudo Label Generation. Training details can be found in Appx. B.4.

The weight distribution is different for different examples; more examples from other target tasks can be found in Appx. E. This indicates that SESoM can generate sample-specific weights for the model ensemble according to the actual sample-specific performance of the different source models.

The tasks included in source models. NLI is natural language inference, and QA is question answering.

We follow Asai et al. (2022)² to train the source prompts [P1; ...; P6].

Parameters of baseline and SESoM training. "Batch size" is different for T5-base, T5-large and T5-XL.

B.2 BASELINES

SPoT-t. We first train a randomly initialized soft prompt on a target task, referred to as the target prompt, using the target few-shot training data. We then calculate the cosine similarity between the target prompt P_t and the source prompts [P1; ...; P6]. The source prompt P_i that shares the highest similarity with P_t is used to initialize the soft prompt training. We then prompt-tune this initialized soft prompt P̃_t on the same target few-shot training data. Finally, the source model [P̃_t; θ] is tested with the verbalizer (Appx. B.5) on the original validation set of the target task. Hyper-parameter details can be found in Table 8.

PPT. The classification text labels of CB are A (no), B (maybe) and C (yes). We prompt-tune a randomly initialized soft prompt P_ppt on these unified datasets, using the same hyper-parameters as for training SPoT-t. Then, for each target task, P_ppt is tuned on the same D_train and D_dev to get the target-specific soft prompt P̃_ppt. Finally, the source model [P̃_ppt; θ] is tested with the verbalizer (Appx. B.5) applied on the same D_test. See training details in Table 8.

ATTEMPT. [P1; ...; P6] (Appx. B.1) are used as source prompts. ATTEMPT first initializes an attention module between the source prompts and the inputs. It then interpolates the source prompts and a newly initialized target-task-specific prompt P_target, using the attention scores generated by the attention module, to produce a target soft prompt P_t; [P_t; θ] is then used to produce predictions. We use the same training parameters as Asai et al. (2022), which are shown in Table 8.

The pre-softmax logit l_i from source model [P_i; θ] on target dataset T is transformed to l̃_i by Algorithm 1, where v is the vocabulary size of tokens. More specifically, for each mapped pair of token positions, the two logit values in l_i are reordered into the corresponding positions of l̃_i.
For example, for the ReCoRD source model, the pre-softmax logit of token 209 in l̃_i is set to the maximum of the logits of tokens 209 and 10998 in l_i, because M_T[10998] = 209, where 10998 is the first token representing "True" and 209 is the first token representing "1".
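This swap can be sketched as follows. It is a minimal illustration of the Algorithm 1 step; the vocabulary size and the single-entry mapping below are illustrative only.

```python
import numpy as np

def remap_logits(logits, mapping):
    """Verbalizer mapping (Algorithm 1 sketch): for each source-token ->
    target-token pair in `mapping` (M_T), place the larger of the two logits
    at the target-token position and the smaller at the source-token position."""
    out = logits.copy()
    for src_tok, tgt_tok in mapping.items():
        lo, hi = sorted((logits[src_tok], logits[tgt_tok]))
        out[src_tok] = lo   # min at the source-verbalizer position
        out[tgt_tok] = hi   # max at the target-verbalizer position
    return out
```

With mapping {10998: 209}, as in the example above, a high logit on "True" (token 10998) is moved onto "1" (token 209), so the source model's preference transfers to the target verbalizer.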

Mapping dictionary details.

High F1 scores of a source model on a target task also indicate underlying task similarities. The Fixed-weight Ensemble model takes advantage of these F1 scores to compute a weighted average of all source pre-softmax logits when making predictions. But as we showed in Table 1 and Section 4.2, these fixed weights are not enough to obtain the best performance on the target tasks.

Case study from the WSC and MultiRC target tasks. "Label" (yellow cells) is the ground-truth label and "Pred" (yellow cells) is the predicted label obtained by SESoM with the weights (orange cells) shown in the table. "Preds from individual source" (pink cells) shows the predictions of each source model [P_1,t; θ], ..., [P_6,t; θ] obtained from the pre-softmax logits [l_x,1; ...; l_x,T].

Case study from the BoolQ and WiC target tasks. "Label" (yellow cells) is the ground-truth label and "Pred" (yellow cells) is the predicted label obtained by SESoM with the weights (orange cells) shown in the table. "Preds from individual source" (pink cells) shows the predictions of each source model [P_1,t; θ], ..., [P_6,t; θ] obtained from the pre-softmax logits [l_x,1; ...; l_x,T].


B.3 SESOM

The hyper-parameters of G are shown in Table 9 . 

B.4 ABLATION STUDY

Before SESoM, we also tried some related methods that turned out to be less effective.

Ensemble acc SP. First, we considered using input-prompt attention, where the max-pooled input embedding x̂ is used as the key and the source prompts [P1; ...; P6] as the values to calculate attention scores for a linear combination of the pre-softmax logits. The architecture of the attention module is the same as in SESoM. The dimensions of the attention module after hyper-parameter tuning are shown in Table 10.

B.5 VERBALIZERS

For example, CB (De Marneffe et al., 2019) is classified as "entailment", "neutral" and "contradiction", and MRPC (Dolan et al., 2005) as "equivalent" and "not equivalent". We show the verbalizers used for all target tasks in Table 11. For the rare cases in which a source model after few-shot prompt tuning is not able to transfer all the verbalizers from the source dataset to the target dataset, for all baselines and our method we conduct a verbalizer mapping on the pre-softmax logits. Based on the original verbalizers in Table 11, we build a mapping dictionary M_T that maps text to tokens for each target dataset T. For example, we map "True" to "1" and "False" to "0" for the predictions of source models whose prompts were trained on SQuAD (Rajpurkar et al., 2016) and ReCoRD (Zhang et al., 2018). Formally, for the i-th source model, the pre-softmax logit l_i is transformed to l̃_i by Algorithm 1.

Published as a conference paper at ICLR 2023

D.1 EFFECT OF THE NUMBER OF SHOTS

Besides our main experiments (Table 3), we also report results on target tasks with 8 training samples in Table 12. SESoM with 8 training samples already outperforms all the baselines with 32 training samples, which are shown in Table 1.

D.2 EFFECT OF PROMPT-TUNING SOURCE MODELS

In the main experiments, we train each source model on the few-shot target data in a prompt-tuning manner. We wonder how SESoM would perform if we skipped this step, so we ran SESoM on non-tuned source models. Results are presented in Table 13. We find that with only 8 or 32 few-shot training samples, SESoM with prompt-tuned source models does not achieve significantly better performance than SESoM with non-tuned source models. However, as the number of few-shot labeled target samples increases, SESoM with prompt-tuned source models achieves better performance, because the source models themselves perform better on the target datasets.

D.3 EFFECT OF ENSEMBLE

We wonder whether our ensemble method is better than choosing the single most appropriate source model's prediction as the final prediction. We implement a hard variant in which, after the attention weights are computed, we set the argmax position to 1 and all remaining entries to 0. Results are presented in Table 14. They show that our attention-based ensemble is more effective than selecting the single most appropriate source model.
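The hard variant can be sketched as follows. This is a minimal illustration; the weight values are assumed to come from the attention module G:

```python
import numpy as np

def ensemble_logits(source_logits, weights, hard=False):
    """Combine pre-softmax logits from T source models.
    soft (default): weighted average with the attention weights, as in SESoM;
    hard variant: keep only the single highest-weighted source model."""
    w = np.asarray(weights, dtype=float)
    if hard:
        h = np.zeros_like(w)
        h[np.argmax(w)] = 1.0  # argmax position -> 1, remaining entries -> 0
        w = h
    return w @ np.asarray(source_logits)
```

The soft combination lets several plausible source models contribute to one prediction, which is the benefit the hard variant gives up.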

