NOT ALL TASKS ARE BORN EQUAL: UNDERSTANDING ZERO-SHOT GENERALIZATION

Abstract

Recent work has achieved remarkable zero-shot performance with multi-task prompted pretraining, but little is understood about why it works. For the first time, we show that training on a small number of key tasks beats using all the training tasks, while removing these key tasks substantially hurts performance. We also find that these key tasks are mostly question answering (QA) tasks. Together, these novel findings deepen our understanding of zero-shot generalization: training on certain tasks such as QA encodes general knowledge transferable to a wide range of tasks. In addition, to automate this procedure, we devise a method that (1) identifies key training tasks without observing the test tasks by examining the pairwise generalization results and (2) resamples training tasks for a better data distribution. Empirically, our approach achieves improved results across various model scales and tasks.

1. INTRODUCTION

Recent work (Brown et al., 2020; Artetxe et al., 2022; Rae et al., 2021) has demonstrated the potential of leveraging pretrained language models (PLMs) for zero-shot generalization. Zero-shot generalization enables PLMs to adapt to a variety of natural language processing (NLP) tasks without relying on any annotated data, which opens the possibility of generic systems. Pretrained models such as GPT-3 (Brown et al., 2020), BERT (Devlin et al., 2019), and T5 (Raffel et al., 2020) can perform zero-shot inference on unseen test tasks by leveraging natural language prompts and formulating NLP tasks as language modeling tasks. More recent advances (Wei et al., 2022; Sanh et al., 2022) performed multi-task prompted training on PLMs and further enhanced zero-shot performance to a large extent.

Despite the substantial progress, few works have studied how multi-task prompted training boosts zero-shot performance, and this lack of understanding hinders further improvement of the field. To this end, we take a further step to understand multi-task training for zero-shot generalization (1) by selecting only a small number of key training tasks and performing multi-task training, and (2) by studying the characteristics of tasks with general transfer ability. The results reveal several interesting findings. First, only a small number of training tasks dominate zero-shot generalization performance: using only these key training tasks for multi-task training leads to good results, while removing them drastically hurts zero-shot performance. Second, not all tasks are born equal; some tasks show general transfer ability by providing widely useful knowledge. Moreover, key tasks with general transfer ability can be automatically detected using pairwise generalization results.
In addition, based on these findings, we propose an improved method, task resampling, which substantially improves multi-task training for zero-shot generalization. Task resampling first automatically identifies a set of key training tasks based on pairwise training and evaluation, without peeking at the test tasks.

2.3. TRANSFER RELATIONSHIPS IN MULTI-TASK LEARNING

Based on the observation that transfer ability comes from key tasks, we would like to further improve zero-shot performance by exploring the transfer relationships in multi-task learning. Some previous works learn transfer relationships in supervised multi-task training. Taskonomy (Zamir et al., 2018) builds a transfer structure over a series of computer vision tasks by training task-specific encoders on each dataset and retraining the decoders on target datasets. Dwivedi & Roig (2019) and Song et al. (2019) obtain the transfer relationship based on the assumption that transferable task-specific models have similar representations or embeddings on the same data. Vu et al. (2020) learn an embedding of each NLP task and try to predict the transfer relationship between different datasets. ExT5 (Aribandi et al., 2022) learns transferability by co-training the source and target tasks and evaluating on the target task. UnifiedQA (Khashabi et al., 2020) explores transferability between different QA tasks. The major challenge is that we cannot access the test task at the training stage, so it is hard to figure out in advance which training tasks will be more useful for unseen tasks. To address this challenge, we propose a reweighting method based on the transfer performance of pairwise training tasks, which is discussed in Section 4.

2.4. DATA AUGMENTATION IN NLP

Data augmentation is widely used in NLP to strengthen the robustness and diversity of the data distribution and promote model performance. However, most approaches operate at the word level (Zhang et al., 2015; Wang & Yang, 2015; Wei & Zou, 2019) or sentence level (Kafle et al., 2017; Hou et al., 2018; Khashabi et al., 2018; Zhang et al., 2018), or in other words, the instance level. In this paper, to naturally cater to the multi-task setting, we adopt a new perspective and design a cross-domain data augmentation that intersects each domain's data with different kinds of tasks, significantly improving the diversity of the data distribution.

3. UNDERSTANDING ZERO-SHOT TASK GENERALIZATION

This section explores how multi-task training contributes to zero-shot generalization. By revealing the mechanism of task transfer, we provide some insights for improving zero-shot performance.

3.1. DATA

We followed the setting of T0 (Sanh et al., 2022) and adopted the tasks therein. There are 38 training tasks across 8 task types, and 11 test tasks ranging from natural language inference (RTE (Candela et al., 2006), CB (De Marneffe et al., 2019), ANLI/R1-R3 (Nie et al., 2020)) and coreference resolution (WSC (Levesque et al., 2012), Winogrande (Sakaguchi et al., 2020)) to sentence completion (COPA (Roemmele et al., 2011), StoryCloze (Mostafazadeh et al., 2017), Hellaswag (Zellers et al., 2019)) and word disambiguation (WiC (Pilehvar & Camacho-Collados, 2019)). The training and test sets are disjoint in task types, which guarantees the zero-shot setting. We report the mean and median accuracy over multiple prompts for each test task.

3.2. A SMALL NUMBER OF KEY TASKS DOMINATE PERFORMANCE

Since there has been little agreement on where the excellent zero-shot generalization performance comes from, we conduct experiments to explore its mechanism. Some researchers believe that the model understands the prompts through multi-task training (Wei et al., 2022; Schick & Schütze, 2021; Mishra et al., 2022), while we take an orthogonal perspective and hypothesize that there might be a few key tasks that are crucial for zero-shot generalization performance. For a straightforward verification of this hypothesis, we conduct two experiments.

Single Task Shows Zero-Shot Transfer Ability We set up an experiment to study the pairwise transfer results between all pairs of tasks. Specifically, for any pair of tasks in the T0 collection, we train on one task and evaluate on the other. This decouples the effects of multi-task learning and lets us observe the transfer ability of single tasks. Results are shown in Figure 1. On the held-out test tasks, the performance gap between multi-task training and single-task training is less than 5 points on average. We also conduct experiments by training on the top-3 tasks for each test dataset; detailed results are presented in Appendix D.1.

A Small Set of Tasks Dominates Performance on the Test Tasks We manually select the top-8 tasks out of all training tasks of T0. These tasks empirically demonstrate good generalization to the test tasks in preliminary experiments. Specifically, we first select the top-3 key tasks for each test task, and then keep the tasks that appear at least twice among these top-3 support tasks; exactly 8 tasks are selected this way. The top-8 tasks are CosmosQA (Huang et al., 2019), Social IQA (Sap et al., 2019), PAWS (Zhang et al., 2019b), QuAIL (Rogers et al., 2020), Wiki QA (Yang et al., 2015), QuaRTz (Tafjord et al., 2019), QASC (Khot et al., 2020), and ROPES (Lin et al., 2019). For comparison, we train a T5-Large model under two variants: using only the top-8 tasks ("Top-8 Only") and using all but the top-8 tasks ("T0 Tasks w/o Top-8"). We also experiment with backbones of different scales (XL) and architectures (decoder-only), with results presented in Appendix C. Results on T5-Large are shown in Table 1. We observe that training with only the top-8 tasks outperforms training with all tasks on the 11 test tasks, while training with all but the top-8 tasks drastically decreases performance. This proves that a few key tasks drive zero-shot task generalization: training with key tasks selected by post-hoc results achieves much better zero-shot performance than training with all tasks. It is generally agreed that training with more tasks should lead to better learning of prompts, but our experiments show that this is not the key reason for the performance improvement; rather, the information contained in the key tasks plays a significant role in improving zero-shot generalization.
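The post-hoc selection described above can be sketched as follows; the pairwise score table, task names, and thresholds here are illustrative stand-ins for the actual T0 task collection, not the paper's exact data:

```python
from collections import Counter

def select_key_tasks(pairwise_scores, test_tasks, train_tasks, top_k=3, min_count=2):
    """Post-hoc key-task selection: for each test task, rank training tasks by
    their pairwise transfer score, then keep the training tasks that appear at
    least `min_count` times among the per-test-task top-k lists."""
    counts = Counter()
    for test in test_tasks:
        ranked = sorted(train_tasks,
                        key=lambda tr: pairwise_scores[(tr, test)],
                        reverse=True)
        counts.update(ranked[:top_k])
    return [task for task, c in counts.items() if c >= min_count]
```

Applied to the T0 pairwise table with `top_k=3` and `min_count=2`, this procedure is what yields exactly 8 key tasks in the paper's setup.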
This result raises a further question: does the model learn to read, understand, and react to instructions broadly from all tasks, or does it just benefit from several key tasks with strong general transferability? In Table 1, "Top-8 Only" means using only the top-8 tasks, and "T0 Tasks w/o Top-8" means using the T0 tasks with the top-8 tasks removed. Results that are comparable to or outperform the T0 baseline are denoted in bold.

3.3. GENERAL TRANSFER AND SPECIFIC TRANSFER

First of all, we divide transfer ability into specific transfer ability and general transfer ability according to the scope of target tasks. Specific transfer ability means that a task can only provide specialized knowledge for a small set of tasks with certain kinds of patterns. A typical example is sentiment analysis, which greatly helps sentiment analysis tasks in other domains but has little effect on other complex NLU tasks. Specific transfer ability is relatively stronger among the tasks in the diagonal blocks of Figure 1. General transfer ability means that a task can provide knowledge required by most downstream tasks; the more it provides beyond the knowledge captured by the pretrained model, the more it contributes to the overall transfer ability of the model. For example, adding question answering (QA) tasks improves performance on most downstream tasks, possibly because they provide valuable commonsense knowledge and reasoning skills. After introducing this division of transfer ability, we raise two questions based on the experimental results in Section 3.2: (1) Do the key tasks embrace general transfer ability or specific transfer ability? (2) Can we reveal the common patterns shared by tasks with general transfer ability?

3.4. NOT ALL TASKS ARE BORN EQUAL

Some QA Tasks Show General Transfer Ability Regarding the first question, Figure 1 shows that most of the tasks selected in the post-hoc experiments bring improvements on a wide range of tasks and thus have a certain degree of general transfer ability. In addition, most of these tasks are QA tasks. Two notable concepts here are the QA format and QA tasks: QA tasks indeed take the QA format, but QA-formatted data are not necessarily QA tasks (e.g., prompted sentiment analysis data in the QA format is not a QA task). Here, QA tasks refer to tasks that require reasoning skills, such as reading comprehension. To isolate the effect of the format, we convert a sentiment analysis task, Yelp Review Full, into the multiple-choice QA format (see Figure 2). We compare the zero-shot performance on the 11 unseen tasks between training on raw Yelp Review Full and training on QA-formatted Yelp Review Full, as displayed in Table 2. The results show that simply using QA-formatted non-QA tasks does not benefit zero-shot performance, proving that it is not the QA format per se that leads to the zero-shot ability.

Why Do QA Tasks Show General Transfer Ability? Since some QA tasks demonstrate general transfer ability, we would like to explore the underlying reasons. We suspect these QA tasks provide knowledge that is not captured during pretraining. From the examples in Table 3, we can see that both CosmosQA and Social IQA require simple, general-domain reasoning ability, which is needed for a wide range of NLP tasks. More importantly, this knowledge is difficult to learn in the pretraining stage, so an additional supplement is necessary for the model to attain good cognitive ability. As a result, those tasks show better general transfer ability.
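The format conversion for Yelp Review Full can be illustrated with a small sketch; the exact question wording and template below are our own illustration in the spirit of Figure 2, not necessarily the precise prompt used in the experiments:

```python
def to_multiple_choice_qa(review_text):
    """Wrap a Yelp Review Full example in a multiple-choice QA template
    (cf. Figure 2). The star labels become the answer options."""
    options = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
    prompt = (
        "Read the context and choose the best option to answer the question\n"
        f"Context: {review_text}\n"
        "Question: How many stars does the reviewer give?\n"
        "Options: " + "\n- ".join(options)
    )
    return prompt, options
```

The point of the experiment is that such a conversion changes only the surface format; the underlying supervision signal remains sentiment classification, which is why it does not confer general transfer ability.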
There may be quantitative methods to evaluate the knowledge provided by these tasks; a possible solution is to design probing tasks, as is done in Pruksachatkun et al. (2020), and we leave this for future work.

Which Kinds of QA Tasks Work? Another important observation is that not all QA tasks show general transfer ability. Specifically, CosmosQA, Social IQA, and QuAIL show outstanding transfer ability, while tasks such as WikiHop and WiQA do not. Through a careful examination of the datasets (see Table 3 for reference), we conjecture that there are two reasons. First, the domain type matters a lot. Taking an extreme case as an example, training on datasets full of math problems is unlikely to provide general transfer ability to other tasks. CosmosQA, Social IQA, and QuAIL all require commonsense knowledge that is useful in the general domain, whereas WikiHop urges the model to memorize specific knowledge that is mainly required for knowledge contests. Another possible factor is the text format: the expressions in WikiHop and WiQA seem much more artificially constructed than those in Social IQA and CosmosQA, and thus showcase limited transfer ability.

So far, we have analyzed phenomena related to zero-shot task generalization. At the same time, three problems remain unsolved: (1) the test set cannot be seen in advance; (2) when a large number of QA tasks are provided, we need to further distinguish which of them are useful; and (3) when the provided training set changes, we need to identify new tasks with general transfer ability. To address these problems, we propose a general data-driven approach to identify tasks with general transfer ability.

4.1. METHOD

We arrived at several findings in Section 3: (1) only a small number of training tasks dominate zero-shot generalization performance, and (2) key tasks with general transfer ability can be detected through pairwise generalization. Following these findings, we hypothesize that one key to improving zero-shot performance is to appropriately adjust the weights of different training tasks, such that the model is trained with an optimized mixture of multi-task datasets. To this end, we devise a novel method, task resampling, that first automatically identifies a set of key training tasks based on pairwise training and evaluation without observing any test tasks, and then adjusts the training data distribution through upsampling or downsampling. Figure 3 shows an overview of our method.

Formally, we are given a set of training tasks T = {t_i}_i, where each task t_i = {(x_{i,j}, y_{i,j})}_j consists of prompted inputs x_{i,j} and prompted targets y_{i,j}, and a pretrained model M. Our goal is to assign an appropriate weight w_i to each training task and use the optimized mixture of training tasks to train M in a multi-task manner such that it performs well on unseen test tasks. Our method consists of three major steps:

1. Pre-detection of key tasks. Identify key training tasks based on pairwise training and evaluation without relying on any test tasks. This can be viewed as a prior approximation to the post-hoc method in Section 3.2.
2. Task resampling. Resample the training tasks by upsampling key tasks or downsampling non-key tasks.
3. Multi-task training. Train a model on the resampled mixture of multi-task datasets.

Pre-detection of Key Tasks Without observing the test tasks, the main idea of our approach is to use pairwise training and evaluation within the training tasks.
To identify the key tasks, we first train a model on each training task and evaluate it on all the training tasks. This yields an N × N table (N is the number of training tasks), which is part of the results in Figure 1. We then select key training tasks from this table as follows. For each task pair A and B, let f(A, B) be the performance of training on A and evaluating on B. We let g(A, B) = 1 if the following conditions are satisfied (and otherwise g(A, B) = 0):

1. A and B are of different task types.
2. The performance f(A, B) is high enough: f(A, B) ≥ max_{A′≠B} f(A′, B) − TH_1 and f(A, B) ≥ mean_{A′≠B} f(A′, B) + TH_2, where TH_1 and TH_2 are constants controlling the tolerated distance to the maximum and average performance, respectively.

The value g(A, B) indicates whether A is a high-performing training task for B. We constrain A and B to be of different types because we ultimately target cross-task generalization. We then aggregate g(A) = Σ_B g(A, B) to count how many times A is a high-performing training task for another task, and take the tasks with the largest g(A) values as the key tasks; in our implementation, we apply a threshold on g(A) to obtain this set.

Task Resampling by Upsampling or Downsampling After detecting the key tasks using only the training tasks, we propose a simple yet effective resampling approach to optimize the mixture of multi-task data, using either an upsampling or a downsampling strategy. For upsampling, we upsample the key tasks N_u times. For downsampling, we cap the number of samples at N_d for each non-key task and keep the original sample size for the key tasks. In our preliminary experiments, we found task resampling more robust than using the key tasks only, because it takes a softer approach that highlights the importance of key tasks while maintaining knowledge from other tasks.
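The pre-detection rule can be sketched as follows, assuming the pairwise table is available as an N × N NumPy array F whose rows are source (training) tasks and whose columns are target (evaluation) tasks; function and variable names are ours:

```python
import numpy as np

def detect_key_tasks(F, task_types, th1=5.0, th2=10.0, min_g=2):
    """Pre-detection of key tasks from the pairwise table F, where F[a, b]
    is the score of training on task a and evaluating on task b.
    g(A, B) = 1 iff A and B have different task types and F[A, B] is both
    close to the best and clearly above the mean over other source tasks."""
    n = F.shape[0]
    g = np.zeros(n, dtype=int)
    for b in range(n):
        others = [a for a in range(n) if a != b]   # exclude training on b itself
        col = F[others, b]
        best, mean = col.max(), col.mean()
        for a in others:
            if (task_types[a] != task_types[b]
                    and F[a, b] >= best - th1
                    and F[a, b] >= mean + th2):
                g[a] += 1
    # Threshold the aggregated counts g(A) to obtain the key-task set.
    return [a for a in range(n) if g[a] >= min_g]
```

With the paper's settings (TH_1 = 5, TH_2 = 10, g(A) ≥ 2), this selects source tasks that repeatedly transfer well across task-type boundaries.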
Data Augmentation In the sections above, we discussed the important discovery that certain tasks are crucial for zero-shot performance. However, some of these key tasks may have limited labeled data, so we further propose a data augmentation method to create additional samples for each task. Specifically, given tasks A and B, we apply the prompts of A to the data of B to obtain augmented data; in other words, we obtain more data from task B that are used to perform task A. We use a trained T0 model to predict the labels of the augmented samples, similar to self-training. Note that data augmentation is optional and independent of task resampling; we conduct ablation studies to investigate the effectiveness of this component.
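The augmentation-plus-self-training loop might look like the following sketch. Here `label_model` stands in for the trained T0 used to label augmented samples, and the confidence-threshold filtering follows the description in Appendix A.2 (the threshold value itself is an assumption):

```python
def cross_task_augment(prompt_template, source_examples, label_model, threshold=0.9):
    """Apply the prompt of task A (prompt_template) to data from task B
    (source_examples), label the result with a trained model, and keep
    only confident predictions (self-training-style filtering)."""
    augmented = []
    for example in source_examples:
        prompted_input = prompt_template.format(**example)
        label, confidence = label_model(prompted_input)
        if confidence >= threshold:
            augmented.append({"input": prompted_input, "target": label})
    return augmented
```

In practice the labeling model would be a finetuned seq2seq model returning a prediction and its probability; the stub interface above is only for illustration.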

Multi-task Training

Our training procedure is the same as T0. The only difference is that we employ an optimized mixture of datasets (and optionally with data augmentation).

4.2. EXPERIMENTAL SETUP

Following the same training and evaluation setting as T0 (Sanh et al., 2022), we finetune the T5-LM-Adapt model on the 38 training tasks discussed in Section 3.1. For data preprocessing, following T0, we balance the number of examples across tasks by restricting each training task to at most 500,000 examples. For our resampling strategy, we set the pre-detection parameters to TH_1 = 5 and TH_2 = 10, and then choose the datasets that are counted as key tasks at least twice (i.e., all tasks A with g(A) ≥ 2). Given each key task D with data size |D|, we duplicate D five times (N_u = 5) for the upsampling strategy, empirically starting from 50,000 samples for each dataset. For the downsampling strategy, we downsample each non-key task to N_d = min(50,000, |D|) samples. Detailed dataset statistics are provided in Appendix B.1.
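Under the stated hyper-parameters, the resampling step can be sketched as follows; dataset handling is simplified to Python lists of examples, and function names are ours:

```python
import random

def resample_mixture(datasets, key_tasks, strategy="downsample",
                     n_u=5, n_d=50_000, cap=500_000, seed=0):
    """Build the training mixture. All tasks are first capped at 500k
    prompted examples (as in T0). Upsampling duplicates each key task
    n_u times; downsampling caps each non-key task at n_d examples
    while key tasks keep their original size."""
    rng = random.Random(seed)
    mixture = {}
    for name, data in datasets.items():
        data = data[:cap]
        if strategy == "upsample":
            mixture[name] = data * n_u if name in key_tasks else data
        else:  # downsample non-key tasks only
            mixture[name] = data if name in key_tasks else rng.sample(data, min(n_d, len(data)))
    return mixture
```

Upsampling increases the relative weight of key tasks without discarding data, whereas downsampling shrinks the total training set; the paper reports both variants (US-T0 and DS-T0).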

4.3. RESULTS

Post-hoc vs. Pre-detection of Key Tasks One vital part of our approach is the detection of key tasks, so we list the key tasks selected by the post-hoc method (i.e., observing the test tasks) and the pre-detection method in Table 4. Five tasks are shared by the two methods, indicating that our approach detects most of the key tasks. Moreover, even though the two sets of key tasks do not match exactly, our experiments demonstrate that this does not affect performance.

Main Results We compare the original T0 with the task-resampled T0 using upsampled key tasks (US-T0) and with downsampled non-key tasks (DS-T0). Besides, we display the performance of the task-resampled T0 with augmented data, dubbed US+DA-T0 and DS+DA-T0, respectively. Table 5 reports zero-shot performance for our improved T0 and the original T0 at three scales; results with † are reported by Sanh et al., and results with ⋆ are reproduced in our experiments. "Our Best" is achieved with the US+DA-T0 setup. We summarize the following key observations from Table 5:

1. Advantage of Task Resampling. Both resampling strategies (US-T0 and DS-T0) boost the performance of the reproduced T0. Specifically, DS-T0 outperforms T0 (⋆) by 3.2% at Large scale and 2.0% at XL scale, and US-T0 outperforms T0 (⋆) by 3.9% at Large scale and 0.6% at XL scale.

2. Advantage of Data Augmentation. Task-resampled T0 achieves better performance with augmented data. Specifically, DS-T0 improves by 1.3% with augmented data at Large scale, US-T0 improves by 1.4% with augmented data at XL scale, and T0 with both task resampling and augmented data achieves a 0.8% gain over the reproduced T0. This indicates that data augmentation can further strengthen the mixture of multi-task data.

3. Advantage of Our Implementation Framework.
Our reproduced T0 result is better than the reported T0 (Sanh et al., 2022) by 9.4% at XL scale and 3.2% at XXL scale.

5. CONCLUSIONS

This work studies the principles of zero-shot generalization through pairwise experiments, and reveals that a small number of training tasks dominate performance. We further divide the transfer relationship into specific transfer and general transfer, and find that adding those tasks with general transfer ability will contribute to the performance gain for most tasks. Moreover, those tasks with general transfer ability can be identified by examining the pairwise generalization results. Based on the findings, we propose the task resampling method to improve the zero-shot performance. Extensive experiments demonstrate the effectiveness of our framework.

A EXPERIMENTAL DETAILS

A.1 HYPER-PARAMETER SELECTION

For all experiments, we adopt the Adam optimizer with a learning rate of 1e-4. Considering the amount of data, the batch size and number of training steps differ across settings. We do not further tune the training hyper-parameters, because preliminary experiments showed that performance is similar as long as we train for sufficiently many epochs; we use different batch sizes and training steps for different amounts of data to save time. The hyper-parameters of our experiments are in Table 6. The remaining hyper-parameters are predefined using preliminary experiments on T5-Large and then applied directly to larger models for the sake of time. We first select the key tasks by searching TH_1 and TH_2 with the upsampling strategy (N_u = 5); the search space is TH_1 ∈ {5} and TH_2 ∈ {5, 10}. We choose g(A) ≥ 2 because the distribution of g(A) values clearly shows that tasks with g(A) ≤ 1 form the long tail. After the key tasks are selected, we search the resampling hyper-parameters over N_u ∈ {2, 5} and N_d ∈ {5, 10}, and choose the best setting on T5-Large. We find that these hyper-parameters do not affect the results much; the gap among them is much smaller than the gap between them and the baseline.

We only report the best result on T5-XXL, because we are unable to run all the experiments due to limited computing resources. Our best result is achieved using US+DA-T0.

Published as a conference paper at ICLR 2023

A.2 DATA AUGMENTATION DETAILS

We propose two algorithms for the domain-task intersection. The first is based on a human-written taxonomy tree, and the other relies on universal fields. We combine the two techniques to achieve a balance between quality and diversity.

Taxonomy Tree Based Domain-Task Intersection To build a general taxonomy tree covering as many prompts as possible, we take both task format and task content into consideration and develop a series of guidelines similar to a decision tree. Then, at the end of each branch, we intersect the source data lying in that branch with the related prompts belonging to that branch. In this way, we can produce reasonable combinations of source data and prompts; for example, classification data like IMDB can also be used for tasks like title generation.

Universal Domain-Task Generation To further improve the diversity of the augmented data, we remove the man-made restrictions and propose universal domain-task generation. In detail, we define unified fields for data from all domains, which are utilized for various tasks. Naturally, each original domain dataset lacks certain fields; e.g., AG News data only have two fields, the category label and the text. Therefore, we leverage T0 to predict the missing fields so that we can conduct different kinds of tasks using prompts already trained in T0. For some tasks, we also train a specific model for prediction to get better performance. After that, we filter samples according to the confidence score (i.e., the probability output by the model).

B MORE EXPLORATIONS ON THE TRANSFER ABILITY

B.1 STATISTICS OF THE CURRENT DATASETS

We report the sizes of the datasets used in our experiments in Table 7. To explore more statistical features of the original datasets, we further provide two statistical indicators: the average sequence length (SLEN) and the mean segmental type-token ratio (MSTTR) of the prompted inputs of the training datasets. MSTTR is calculated with a window size of 50, and we compute these statistics on at most 10,000 examples per prompted task for the sake of time. Results are in Table 7. An interesting phenomenon is that datasets with fewer training examples are more likely to show general transfer ability. We speculate that this is because considerable manual effort is needed to construct datasets with sufficient knowledge, which makes it too costly to scale them up. Statistics such as MSTTR or data length alone do not appear to be good indicators of transfer ability; we leave further exploration of the factors influencing transfer ability for future work.
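MSTTR with a window of 50 can be computed as in the following sketch, using the standard definition (non-overlapping segments, trailing partial segment dropped); tokenization details are omitted:

```python
def msttr(tokens, window=50):
    """Mean segmental type-token ratio: split the token stream into
    non-overlapping windows of `window` tokens and average the
    type/token ratio of each full window."""
    segments = [tokens[i:i + window]
                for i in range(0, len(tokens) - window + 1, window)]
    if not segments:
        return 0.0
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)
```

Segmenting keeps the measure comparable across texts of different lengths, since a plain type-token ratio decreases mechanically as a text grows.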

C MORE RESULTS ON OTHER MODELS AND OTHER BASELINES

C.1 RESULTS ON LARGER MODELS

To verify whether the observation that a small number of key tasks dominate zero-shot performance still holds on larger models, we conduct the top-8 task experiments on T5-XL, analogous to Section 3.2. Results are in Table 8. The model trained on the top-8 tasks only ("Top-8 Only") slightly outperforms the baseline, while greatly outperforming the model trained on all T0 tasks without the top-8 tasks.

C.2 RESULTS ON DECODER-ONLY ARCHITECTURE

To verify whether the results hold for other architectures, we also experiment with the decoder-only architecture (the encoder-only model is not suitable for language modeling tasks). Specifically, we conduct the top-8 experiments on GPT-Neo-1.3B (Black et al., 2021), training with a prefix-LM loss. Results are in Table 9 and show that the observation from Section 3.2, that a small number of key tasks dominate zero-shot performance, still holds.

C.3 RESULTS ON OTHER BASELINES

Since most tasks in T0 are defined as QA tasks, the observation that QA tasks are important might not be entirely fair. We therefore investigate whether the statement that "some general transfer classes dominate the zero-shot performance" still holds with a totally different mixture of datasets, namely the prompted datasets of FLAN (Wei et al., 2022). Note that reading comprehension is one of the task types used in FLAN and can be regarded as narrative QA (i.e., we classify QA tasks based on content rather than format), so we explore what happens if we remove all reading comprehension tasks from the FLAN mixture. We use exactly the same datasets and prompts as FLAN (Wei et al., 2022), except that we include three dialogue datasets in the same way as FLAN-T5 (Chung et al., 2022) and exclude the translation datasets. Following FLAN, we leave NLI and Commonsense Reasoning as the held-out test set. We conduct the following experiments:

1. Training with all remaining tasks in FLAN (43 tasks in total);
2. Training with only the reading comprehension tasks in FLAN (7 tasks in total);
3. Training without the reading comprehension tasks in FLAN (36 tasks in total).

As can be seen in Table 10, the model trained on reading comprehension tasks greatly outperforms the model trained on FLAN tasks without reading comprehension tasks. These experiments thus serve as a supplement to our main experiment on the T0 datasets.

D.1 RESULTS OF TRAINING ON TOP-3 KEY TASKS

The results when training on the top-3 key datasets for each test task are in Table 11 and Table 12. The model trained on the top-3 key datasets shows comparable results to the T0 baseline.

D.2 FULL RESULTS EVALUATED ON HELD-OUT TEST DATASETS

Full results evaluated on held-out test datasets are in Table 13 and Table 14. 



Figure1: Pairwise transfer relationships on T5-XL. The entry at row i and column j denotes the average performance when the model is trained on task i and evaluated on task j. For each entry, the value is the average score of different prompts. (Accuracy if only Accuracy is calculated, and otherwise the mean of Accuracy and F1.) Only those prompts related to the original tasks are included for evaluation. We highlight those entries with high scores for each task (Red is the Top-1). The horizontal and vertical lines denote the boundary of task-type groups.

context and choose the best option to answer the question \n Context: {{text}} \n Question: {{question}} \n Options: {{answer_choices | join("\\n-")}} \n 1 star ||| 2 stars ||| 3 stars ||| 4 stars ||| 5 stars (a) Predict the full choice. context and choose the best option to answer the question \n Context: {{text}} \n Question: {{question}} \n Options: \nA. 1 star \nB. 2 stars \nC. 3 stars \nD. 4 stars \nE.5 stars \n A ||| B ||| C ||| D ||| E (b) Predict the option label.

Figure 2: An illustration of multiple-choice QA formatted sentiment analysis prompts.

Figure 3: The pipeline of the improved multi-task prompted training recipe. We detect key tasks by examining the pairwise generalization results between training tasks. These key tasks are upsampled, or non-key tasks are downsampled to form an optimized mixture of datasets.

Zero-shot performance of training with/without top-8 tasks (out of 38) on T5-Large. The top-8 tasks are CosmosQA, Social IQA, PAWS, QuAIL, Wiki QA, QuaRTz, QASC, and ROPES.



Examples of part of the QA tasks. CosmosQA and Social IQA both show excellent general transfer ability, while WikiHop and WiQA do not. The biggest difference between them is the knowledge domain.



Training hyper-parameters for our experiments.

Statistics of the training sets. "Orig Num" denotes the size of the original dataset. "T0 Num" denotes the size of prompted data in the T0 baseline. "US Num" and "DS Num" denote the sizes of prompted data used in our upsampling and downsampling experiments, respectively. "MSTTR" denotes the mean segmental type-token ratio, an indicator of lexical diversity. "In Len" and "Out Len" denote the average lengths of the prompted input and target data, respectively.



Zero-shot performance of training with/without top-8 tasks (out of 38) on GPT-Neo. The top-8 tasks are CosmosQA, SocialIQA, PAWS, QuAIL, Wiki QA, QuaRTz, QASC, and ROPES. "Top-8 Only" means using only the top-8 tasks. "T0 Tasks w/o Top-8" means using the T0 tasks with top-8 tasks removed.

Results of training on FLAN. We train a T5-Large model using the datasets of FLAN. "RC" denotes Reading Comprehension. "ARC/E." denotes "ARC/Easy", and "ARC/C." denotes "ARC/Challenge".

ACKNOWLEDGMENTS

Jian Li and Jing Zhou are supported in part by the National Natural Science Foundation of China Grant 62161146004, Turing AI Institute of Nanjing and Xi'an Institute for Interdisciplinary Information Core Technology.

