PRE-TRAINED LANGUAGE MODELS CAN BE FULLY ZERO-SHOT LEARNERS

Abstract

How can we extend a pre-trained model to many language understanding tasks without labeled or additional unlabeled data? Pre-trained language models (PLMs) have been effective for a wide range of NLP tasks. However, existing approaches either require fine-tuning on downstream labeled datasets or manually constructing proper prompts. In this paper, we propose nonparametric prompting PLM (NPPrompt) for fully zero-shot language understanding. Unlike previous methods, NPPrompt uses only pre-trained language models and does not require any labeled data or additional raw corpus for further fine-tuning, nor does it rely on humans to construct a comprehensive set of prompt label words. We evaluate NPPrompt against previous major few-shot and zero-shot learning methods on diverse NLP tasks, including text classification, text entailment, similar text retrieval, and paraphrasing. Experimental results demonstrate that NPPrompt outperforms the previous best fully zero-shot method by large margins, with absolute gains of 12.8% in accuracy on text classification and 15.6% on the GLUE benchmark.

1. INTRODUCTION

Natural language understanding (NLU) is important in many applications such as intelligent dialog assistants, online search, and social media analysis. Recent advances in NLU have been driven by pre-trained language models (PLMs) including BERT (Devlin et al., 2019; Liu et al., 2019b), GPT (Radford et al., 2018; 2019; Brown et al., 2020), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). Prior studies show that PLMs acquire substantial knowledge during pre-training on raw text corpora (Petroni et al., 2019; Feldman et al., 2019). By fine-tuning on task-specific labeled data, PLMs exploit this knowledge and achieve impressive accuracy on a wide range of NLP tasks, such as text classification (Kowsari et al., 2019), question answering (Rajpurkar et al., 2016), and machine reading comprehension (Campos et al., 2016). However, fine-tuning is expensive: it requires labeled datasets, which are rarely available for many tasks; significant computation is needed to update a PLM's parameters for each task; and it produces one distinct model per task to maintain.

How can we generalize a pre-trained model to many NLP tasks without labeled or additional unlabeled data? Existing few-shot and zero-shot approaches construct prompts to elicit the desired predictions from PLMs (Brown et al., 2020). The main idea of prompting a PLM is to convert an input utterance into one containing a masked template. For example, in text classification, the input "The Warriors won the NBA championship 2022" is converted to "A [MASK] news: The Warriors won the NBA championship 2022". A PLM (e.g., BERT) takes the converted text and produces predictions for the masked token, along with their probabilities. Ideally, the PLM assigns a higher probability to the word "sports" than to "politics" at the [MASK] position.
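The prompting idea above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's method: `mask_token_probs` is a hypothetical stand-in for a real PLM's masked-token prediction head, and the template, label words, and probability values are invented for demonstration.

```python
# Toy sketch of prompt-based zero-shot classification. A real system would
# run the prompt through a masked language model (e.g. BERT) and read off
# the vocabulary distribution at the [MASK] position; here that call is
# mocked with a fixed, made-up distribution.

TEMPLATE = "A [MASK] news: {text}"

def mask_token_probs(prompt: str) -> dict:
    """Hypothetical stand-in for a PLM's [MASK] prediction head."""
    # Invented probabilities purely for illustration.
    if "NBA" in prompt:
        return {"sports": 0.71, "politics": 0.04, "business": 0.09}
    return {"sports": 0.10, "politics": 0.45, "business": 0.30}

def classify(text: str, label_words: list) -> str:
    # Convert the raw input into the masked template, then pick the
    # label word the (mocked) PLM scores highest at [MASK].
    prompt = TEMPLATE.format(text=text)
    probs = mask_token_probs(prompt)
    return max(label_words, key=lambda w: probs.get(w, 0.0))

print(classify("The Warriors won the NBA championship 2022",
               ["sports", "politics", "business"]))  # -> sports
```

The key point the sketch captures is that classification is reduced to comparing the PLM's probabilities for a small set of label words, so no task-specific parameters are trained.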
Although these prompting-based methods are effective, they require unlabeled data for training or substantial human effort to construct prompts and to choose designated tokens to represent class labels (Schick & Schütze, 2021a; b; Gao et al., 2021). Moreover, such manually constructed verbalizers, i.e., mappings from words (e.g., "basketball") to class labels (e.g., SPORTS), do not extend to new categories that emerge after a PLM is deployed.
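A manually constructed verbalizer is, in essence, a hand-written lookup table, which makes the limitation above concrete. The words and class names below are illustrative examples, not taken from any cited system.

```python
# Minimal sketch of a hand-written verbalizer: a fixed mapping from
# predicted label words to class labels. Entries are illustrative.
VERBALIZER = {
    "basketball": "SPORTS",
    "football":   "SPORTS",
    "election":   "POLITICS",
    "senate":     "POLITICS",
}

def label_of(predicted_word: str):
    # Any word outside the hand-written table -- including words for
    # categories that emerge only after deployment -- cannot be resolved
    # to a class, which is the limitation discussed above.
    return VERBALIZER.get(predicted_word)

print(label_of("basketball"))  # -> SPORTS
print(label_of("inflation"))   # -> None (no entry for this word)
```

Extending such a table to a new category requires a human to enumerate fresh label words, which is exactly the manual step NPPrompt is designed to avoid.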

