NOT ALL TASKS ARE BORN EQUAL: UNDERSTANDING ZERO-SHOT GENERALIZATION

Abstract

Recent work has achieved remarkable zero-shot performance with multi-task prompted pretraining, but little is understood about why it works. For the first time, we show that training on a small number of key tasks beats using all of the training tasks, while removing these key tasks substantially hurts performance. We also find that these key tasks are mostly question answering (QA) tasks. Together, these novel findings deepen our understanding of zero-shot generalization: training on certain tasks such as QA encodes general knowledge transferable to a wide range of tasks. In addition, to automate this procedure, we devise a method that (1) identifies key training tasks without observing the test tasks by examining pairwise generalization results, and (2) resamples the training tasks for a better data distribution. Empirically, our approach achieves improved results across various model scales and tasks.

1. INTRODUCTION

Recent work (Brown et al., 2020; Artetxe et al., 2022; Rae et al., 2021) has demonstrated the potential of leveraging pretrained language models (PLMs) to perform zero-shot generalization. Zero-shot generalization enables PLMs to adapt to a variety of natural language processing (NLP) tasks without relying on any annotated data, which opens the possibility of building generic systems. Pretrained models such as GPT-3 (Brown et al., 2020), BERT (Devlin et al., 2019), and T5 (Raffel et al., 2020) can perform zero-shot inference on unseen test tasks by leveraging natural language prompts and formulating NLP tasks as language modeling tasks. More recent advances (Wei et al., 2022; Sanh et al., 2022) performed multi-task prompted training on PLMs and further enhanced zero-shot performance to a large extent. Despite this substantial progress, few works have studied how multi-task prompted training boosts zero-shot performance, and this lack of understanding hinders further improvement. To this end, we take a further step toward understanding multi-task training for zero-shot generalization (1) by selecting only a small number of key training tasks and performing multi-task training, and (2) by studying the characteristics of tasks with general transfer ability. The results reveal several interesting findings. First, only a small number of training tasks dominate the performance of zero-shot generalization: using only these key training tasks for multi-task training leads to good results, while removing them drastically hurts zero-shot performance. Second, not all tasks are born equal; some tasks show general transfer ability by providing widely useful knowledge. Third, key tasks with general transfer ability can be automatically detected using pairwise generalization results.
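To make the pairwise-detection idea concrete, the following is a minimal sketch of how key tasks might be identified from a pairwise generalization matrix. The matrix layout, the function name, and the mean-transfer scoring rule are illustrative assumptions, not the paper's exact procedure.

```python
def identify_key_tasks(gen_matrix, top_k):
    """Pick the top_k training tasks with the best average transfer.

    gen_matrix[i][j] is the zero-shot score on training task j of a
    model trained only on training task i; note that only training
    tasks are involved, so no test task is ever observed.
    """
    n = len(gen_matrix)
    # Score task i by its mean transfer to every *other* training task.
    scores = [
        sum(gen_matrix[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # Rank tasks by score and keep the top_k as the key-task set.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return set(ranked[:top_k])
```

For example, if task 0 transfers well to the other tasks while tasks 1 and 2 do not, `identify_key_tasks` would select task 0 as a key task.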
In addition, based on these findings, we propose an improved method, task resampling, which substantially improves multi-task training for zero-shot generalization. Task resampling first automatically identifies a set of key training tasks based on pairwise training and evaluation, without access to any test tasks, and then resamples the training mixture by upsampling key tasks or downsampling non-key tasks, as shown in Figure 3. In this way, we build a better mixture of multi-task training sets that highlights the key tasks with general transfer ability. Experiments show that task resampling consistently outperforms the previous approach T0 (Sanh et al., 2022) across three model scales and on test tasks of various types. To sum up, our contributions are as follows.

1. We conduct experiments to understand and reveal how multi-task training enables zero-shot generalization: (1) only a small number of training tasks dominate zero-shot generalization; (2) some key tasks provide general transfer ability and can be detected using pairwise generalization results.

2. We devise a novel method, task resampling, which improves zero-shot generalization by (1) automatically identifying key training tasks based on pairwise training and evaluation without observing any test tasks, and (2) resampling the training tasks with upsampling and downsampling strategies.

3. Experiments show that our approach achieves new state-of-the-art results across various model scales and tasks.

2. RELATED WORK

2.1. ZERO-SHOT LEARNING IN NLP

Zero-shot learning denotes the setting in which no data correlated with the test set is available during training. The early definition of zero-shot learning referred to predicting samples of unseen classes, so traditional methods require prior information about an unseen class, such as semantic knowledge (Zhang et al., 2019a) or a knowledge graph (Chen et al., 2021), so that the model can predict that class without training data. Meta-learning (Zhang et al., 2022) and reinforcement learning (Ye et al., 2020) methods have also been applied to zero-shot learning. More recently, work has focused on predicting samples of unseen tasks, supported by the development and prevalence of pretrained language models (PLMs) as well as multi-task training. McCann et al. (2018) unify NLP tasks into a QA format to perform multi-task learning. Liu et al. (2019) design a multi-task deep neural network for natural language understanding tasks. Aghajanyan et al. (2021) design an intermediate training stage between pretraining and finetuning using around 50 tasks. Most recently, combining the two approaches above, T0 and FLAN (Sanh et al., 2022; Wei et al., 2022) have shown that explicit multi-task prompted training, in which all tasks are unified by natural language prompts, can vastly promote zero-shot task generalization. We build upon previous work within this new paradigm and devote ourselves to improving the recipe of multi-task prompted training by revealing the mechanism of generalization in Section 3 and further enhancing its performance in Section 4.

2.2. THE INTERPRETATION OF PROMPTED LEARNING

Recent work has shown increased interest in how prompts help models generalize to unseen tasks. Some researchers (Wei et al., 2022; Schick & Schütze, 2021; Mishra et al., 2022) suggest that the model learns to understand what it is doing through prompts. However, other work (Webson & Pavlick, 2022; Logan IV et al., 2022) challenges this assumption, showing that comparable performance can sometimes be obtained without prompts or even with wrong prompts. The authors of T0 (Sanh et al., 2022) state that they only empirically observe the transfer phenomenon and that it remains unclear why it happens. We provide new insights into the reason for generalization. We challenge the idea that the model learns the task through instructions, based on the observation that deleting a small but important set of tasks leads to transfer failure. Instead, we suggest that most of the generalization ability comes from key tasks, whose contribution can be divided into specific and general transfer abilities. We hope our findings will promote the development of this field.
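The up/down-resampling step of task resampling described in Section 1 can be sketched as a simple reweighting of the training mixture. The function name and the specific multipliers below are hypothetical placeholders, not the paper's tuned values.

```python
def resample_weights(n_tasks, key_tasks, up=2.0, down=0.5):
    """Return normalized sampling probabilities over training tasks.

    Key tasks get an upsampling multiplier and all remaining tasks a
    downsampling multiplier; the weights are then normalized so they
    can be used directly as mixture probabilities.
    """
    raw = [up if i in key_tasks else down for i in range(n_tasks)]
    total = sum(raw)
    return [w / total for w in raw]
```

The resulting probabilities could then drive any weighted mixture sampler (for example, drawing each training batch's task index according to these weights), so that key tasks with general transfer ability dominate the multi-task training mixture.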

