NOT ALL TASKS ARE BORN EQUAL: UNDERSTANDING ZERO-SHOT GENERALIZATION

Abstract

Recent work has achieved remarkable zero-shot performance with multi-task prompted pretraining, but little is understood about why it works. For the first time, we show that training on a small number of key tasks outperforms using all of the training tasks, while removing these key tasks substantially hurts performance. We also find that these key tasks are mostly question answering (QA) tasks. Together, these novel findings deepen our understanding of zero-shot generalization: training on certain tasks, such as QA, encodes general knowledge that transfers to a wide range of tasks. In addition, to automate this procedure, we devise a method that (1) identifies key training tasks, without observing the test tasks, by examining pairwise generalization results, and (2) resamples training tasks for a better data distribution. Empirically, our approach achieves improved results across various model scales and tasks.

1. INTRODUCTION

Recent work (Brown et al., 2020; Artetxe et al., 2022; Rae et al., 2021) has demonstrated the potential of leveraging pretrained language models (PLMs) for zero-shot generalization. Zero-shot generalization enables PLMs to adapt to a variety of natural language processing (NLP) tasks without relying on any annotated data, which opens the possibility of building generic systems. Pretrained models such as GPT-3 (Brown et al., 2020), BERT (Devlin et al., 2019), and T5 (Raffel et al., 2020) can perform zero-shot inference on unseen test tasks by leveraging natural language prompts and formulating NLP tasks as language modeling tasks. More recent advances (Wei et al., 2022; Sanh et al., 2022) performed multi-task prompted training on PLMs and further improved zero-shot performance to a large extent.

Despite this substantial progress, few works have studied how multi-task prompted training boosts zero-shot performance, and this lack of understanding hinders further progress in the field. To this end, we take a step toward understanding multi-task training for zero-shot generalization (1) by selecting only a small number of key training tasks for multi-task training and (2) by studying the characteristics of tasks with general transfer ability. The results reveal several interesting findings. First, only a small number of training tasks dominate zero-shot generalization performance: using only these key training tasks for multi-task training leads to strong results, while removing them drastically hurts zero-shot performance. Second, not all tasks are born equal; some tasks show general transfer ability by providing widely useful knowledge. Third, key tasks with general transfer ability can be automatically detected from pairwise generalization results.
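To make the pairwise-detection idea concrete, the following is a minimal sketch of one plausible selection heuristic: train on each source task individually, record the zero-shot score it yields on every other training task, and keep the tasks with the highest mean off-diagonal transfer. The task names, the toy score matrix, and the mean-transfer ranking criterion are all illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def select_key_tasks(transfer, task_names, k=2):
    """Rank source tasks by mean pairwise transfer and keep the top k.

    `transfer[i, j]` is the zero-shot score on target task j after
    training only on source task i. Diagonal entries are masked out,
    since a task's score on itself says nothing about transfer.
    (Hypothetical heuristic for illustration only.)
    """
    transfer = np.asarray(transfer, dtype=float)
    n = transfer.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    mean_transfer = (transfer * off_diag).sum(axis=1) / (n - 1)
    ranked = np.argsort(mean_transfer)[::-1]  # highest mean transfer first
    return [task_names[i] for i in ranked[:k]]

# Toy example: four source tasks evaluated pairwise on one another.
scores = [
    [1.0, 0.7, 0.6, 0.8],   # qa1 transfers broadly
    [0.4, 1.0, 0.3, 0.2],   # sentiment
    [0.5, 0.4, 1.0, 0.3],   # nli
    [0.8, 0.6, 0.7, 1.0],   # qa2 also transfers broadly
]
names = ["qa1", "sentiment", "nli", "qa2"]
print(select_key_tasks(scores, names, k=2))  # the two QA tasks rank highest
```

On this toy matrix the two QA-style tasks have the highest mean off-diagonal transfer and are selected, mirroring the paper's observation that key tasks tend to be QA tasks.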
In addition, based on these findings, we propose an improved method, task resampling, which substantially improves multi-task training for zero-shot generalization. Task resampling first automatically identifies a set of key training tasks based on pairwise training and evaluation without peeking forward

