BIDIRECTIONAL LANGUAGE MODELS ARE ALSO FEW-SHOT LEARNERS

Abstract

Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without fine-tuning when prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them largely incompatible with the existing prompting paradigm. We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models. Using the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate that its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5 having approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization. For the first time, our results demonstrate that prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models.

1. INTRODUCTION

Recent work on GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) has shown that large language models possess few-shot learning and zero-shot instruction-following capabilities, despite being pre-trained only with a self-supervised causal language modeling objective (predicting the next token). An arbitrary task can be converted into a natural language task specification, often called a prompt. Framing a task this way makes its format similar to the language modeling objective used to pre-train large language models. In the zero-shot setting, the prompt contains only the task with instructions, whereas in the few-shot setting, it contains both the task and several example demonstrations. When a language model is asked to generate text completing this prompt, it performs the task in the process. This broader paradigm of reframing all tasks as text generation is known as prompt-based learning. In the few-shot setting, the learning that occurs from examples provided in a given prompt (the context) is known as in-context learning (Liu et al., 2021). In the zero-shot setting, models perform instruction following (Ouyang et al., 2022), with their behavior guided by natural language instructions provided in the prompt.

Emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. Bidirectional language models have stronger learned representations (Devlin et al., 2019; Conneau et al., 2020; Raffel et al., 2020); however, they have not been able to broadly benefit from the prompting paradigm. Brown et al. (2020) acknowledge this limitation:

"GPT-3 has several structural and algorithmic limitations ... as a result our experiments do not include any bidirectional architectures or other training objectives such as denoising ... our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality ... making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the 'best of both worlds'."

In this paper, we directly address this dilemma. We contribute a new technique, SAP (Sequential Autoregressive Prompting), that enables bidirectional language models to take advantage of prompting and allows them to perform at the level of unidirectional models in few- or zero-shot learning without fine-tuning. SAP iteratively prompts a bidirectional model, concatenating previous generations back into the prompt, to produce long generations from models that were pre-trained only to output short, mask-infill spans. We acknowledge efficiency concerns in Section 6 and discuss the importance and impact of SAP and its results to the field regardless of those concerns.

Using the machine translation task as an in-depth case study, we empirically demonstrate that mT5 (Xue et al., 2021), a bidirectional language model, outperforms its unidirectional counterparts, GPT-3 and XGLM (Brown et al., 2020; Lin et al., 2021), when used with SAP, in both the few-shot and zero-shot settings, while utilizing approximately 50% fewer parameters. We then examine SAP's effectiveness on other tasks such as question answering and summarization, demonstrating that bidirectional models can be prompted for tasks beyond machine translation. Our work hints at the possibility of more efficient and performant few-shot learners through pre-trained language models that incorporate bidirectionality. We discuss this impact and outline future research directions to this end in Section 6. In summary, our key contributions are:
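The iterative loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `infill` function is a hypothetical stand-in for a real mask-infilling model such as mT5 (which would be prompted with a sentinel token like `<extra_id_0>` and decoded for the first infill span); the stop token, step limit, and toy model below are all assumptions made for the sake of the example.

```python
def sap_generate(prompt, infill, mask_token="<extra_id_0>", eos="</s>", max_steps=20):
    """Sequential Autoregressive Prompting loop (sketch).

    Each step appends the mask token to the running text, asks the
    bidirectional model to infill it, and concatenates the returned short
    span back into the prompt, yielding a long generation from a model
    pre-trained only to produce short mask-infill spans.
    """
    generated = ""
    for _ in range(max_steps):
        span = infill(prompt + generated + " " + mask_token)
        if not span or span == eos:
            break
        generated += " " + span
    return generated.strip()


def make_toy_infill(spans):
    """Toy stand-in for a real model: emits one pre-set span per call.

    A real system would call mT5 here and return its infill for the mask.
    """
    it = iter(spans)

    def infill(prompt_with_mask):
        return next(it, "</s>")

    return infill


toy_model = make_toy_infill(["Le", "chat", "est", "noir", "</s>"])
print(sap_generate("Translate English to French: The cat is black =>", toy_model))
# -> Le chat est noir
```

Note that each generated span costs one full model call, which makes concrete the efficiency concern discussed in Section 6.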



Figure 1: A visualization of our SAP technique extracting high-quality translations from mT5. In the zero-shot setting, the examples used in the prompt are synthetic examples retrieved in a fully unsupervised manner.

