PRE-TRAINED LANGUAGE MODELS CAN BE FULLY ZERO-SHOT LEARNERS

Abstract

How can we extend a pre-trained model to many language understanding tasks, without labeled or additional unlabeled data? Pre-trained language models (PLMs) have been effective for a wide range of NLP tasks. However, existing approaches either require fine-tuning on downstream labeled datasets or manually constructing proper prompts. In this paper, we propose nonparametric prompting PLM (NPPrompt) for fully zero-shot language understanding. Unlike previous methods, NPPrompt uses only pre-trained language models and does not require any labeled data or additional raw corpus for further fine-tuning, nor does it rely on humans to construct a comprehensive set of prompt label words. We evaluate NPPrompt against previous major few-shot and zero-shot learning methods on diverse NLP tasks, including text classification, text entailment, similar text retrieval, and paraphrasing. Experimental results demonstrate that NPPrompt outperforms the previous best fully zero-shot method by large margins, with absolute gains of 12.8% in accuracy on text classification and 15.6% on the GLUE benchmark.

1. INTRODUCTION

Natural language understanding (NLU) is important in many applications such as intelligent dialog assistants, online search, and social media analysis. Recent advances in NLU have been driven by pre-trained language models (PLMs) including BERT (Devlin et al., 2019; Liu et al., 2019b), GPT (Radford et al., 2018; 2019; Brown et al., 2020), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). Prior studies show that PLMs acquire substantial knowledge during pre-training on raw text corpora (Petroni et al., 2019; Feldman et al., 2019). By fine-tuning on task-specific labeled data, PLMs exploit such knowledge and achieve impressive accuracy on a wide range of NLP tasks, such as text classification (Kowsari et al., 2019), question answering (Rajpurkar et al., 2016), and machine reading comprehension (Campos et al., 2016). However, fine-tuning is expensive: it requires labeled datasets, which are rarely available for many tasks, and significant computational effort is needed to update the PLM's parameters for each task. In addition, fine-tuning yields one distinct model per task to maintain.

How can we generalize a pre-trained model to many NLP tasks, without labeled or additional unlabeled data? Existing few-shot and zero-shot approaches construct prompts to elicit desired predictions from PLMs (Brown et al., 2020). The main idea of prompting is to convert an input utterance into one with a masked template. For example, in text classification the input "The Warriors won the NBA championship 2022" is converted to "A [MASK] news: The Warriors won the NBA championship 2022". A PLM (e.g., BERT) takes the converted text and produces predictions for the masked token, along with their probabilities. Ideally, the PLM assigns a higher probability to the word "sports" than to "politics" at the [MASK] position.
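The prompt-conversion step above can be sketched as a small template function (an illustration added for concreteness; the template string follows the running example, and actual templates are chosen per task):

```python
def to_prompt(text, template='A [MASK] news: {text}'):
    """Wrap an input utterance in a masked template for a PLM.

    The PLM then scores candidate fillers for the [MASK] token;
    ideally it assigns "sports" a higher probability than
    "politics" for a sports headline.
    """
    return template.format(text=text)

print(to_prompt("The Warriors won the NBA championship 2022"))
# A [MASK] news: The Warriors won the NBA championship 2022
```

In practice, the converted text is fed to a masked language model and the output distribution at the [MASK] position is read off as the prediction.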
Although these prompting-based methods are effective, they require unlabeled data for training or substantial human effort to construct prompts and to choose designated tokens to represent class labels (Schick & Schütze, 2021a; b; Gao et al., 2021). In addition, these manually constructed verbalizers, i.e., mappings from words (e.g., "basketball") to class labels (e.g., SPORTS), do not extend to new categories that emerge after PLMs are deployed. In this paper, we investigate the fully zero-shot learning problem for NLU, where only the target label names are available and no extra raw text is given. We propose nonparametric prompting PLM (NPPrompt), a novel method to generate predictions for semantic labels without any fine-tuning. NPPrompt uses the PLM's own embeddings to automatically find words relevant to the labels (e.g., "basketball" and "NBA" for SPORTS), so it does not need humans to construct verbalizers. Our key idea is to search for the top-k nearest neighbors of a label name in the embedding manifold and then generate and aggregate the PLM's predicted logits from masked prompts. In the above case, the predicted values for both "basketball" and "NBA" contribute to the final prediction for the SPORTS category. In this way, NPPrompt generalizes easily to any new category whose name is semantically meaningful.

The contributions of this paper are as follows. a) We develop NPPrompt, a novel method for fully zero-shot learning with PLMs. b) We conduct extensive experiments on diverse language understanding tasks including text classification, text entailment, similar text retrieval, and paraphrasing. Experimental results show that NPPrompt outperforms previous zero-shot methods by an absolute 12.8% in accuracy on text classification and 15.6% on the GLUE benchmark. Surprisingly, NPPrompt is on par with the best prior method, which is trained with manual verbalizers, an additional knowledge base, and extra unlabeled data.
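The nearest-neighbor search and logit aggregation described above can be sketched as follows (a minimal NumPy illustration; the function names and the similarity-weighted sum are our assumptions for exposition, not necessarily the exact aggregation used):

```python
import numpy as np

def topk_neighbors(label_vec, embeddings, k=5):
    """Return indices and cosine similarities of the k vocabulary
    entries whose embeddings are closest to the label-name embedding."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    lab = label_vec / np.linalg.norm(label_vec)
    sims = emb @ lab                      # cosine similarity per vocab word
    order = np.argsort(-sims)[:k]         # top-k most similar words
    return order, sims[order]

def class_score(mask_logits, neighbor_idx, neighbor_sims):
    """Aggregate the PLM's [MASK] logits over a label's neighbor words,
    weighting each neighbor by its embedding similarity to the label."""
    return float(np.sum(neighbor_sims * mask_logits[neighbor_idx]))
```

With real PLM embeddings, the neighbors of "sports" would include words such as "basketball" and "NBA", and the class with the highest aggregated score is returned as the prediction.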

2. RELATED WORK

Prompting The success of GPT-3 (Brown et al., 2020) has drawn much attention to prompt engineering, a new way to leverage pre-trained language models. Brown et al. (2020) concatenate a few input-output pairs and feed them to the large-scale GPT-3 language model, an intuitive in-context learning paradigm that lets the model generate answers for additional cases autoregressively. Recent works (Schick & Schütze, 2021a; b) show that smaller pre-trained language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ALBERT (Lan et al., 2019) can also achieve decent performance with prompt-tuning. Prompting has been applied to a large variety of tasks such as text classification (Schick & Schütze, 2021a), natural language understanding (Schick & Schütze, 2021b), knowledge probing (Petroni et al., 2019), and relation extraction (Han et al., 2021). Typically, a prompt consists of a template and a verbalizer: the language model predicts a probability distribution over the vocabulary given the template, and the verbalizer transforms that distribution into a prediction over class labels. In this work, we focus on designing the verbalizer automatically.

Verbalizer Design The verbalizer is an important component of prompting that bridges model outputs and labels and greatly impacts performance. Schick & Schütze (2021a) design hand-written verbalizers for prompting; however, these are biased toward the authors' vocabulary choices and have inadequate coverage. Beyond manually designed verbalizers, some recent studies explore automatic verbalizer construction. Auto-L (Gao et al., 2021) uses re-ranking to find the label-word set by fine-tuning the model on candidates searched by RoBERTa; AutoPrompt (Shin et al., 2020) applies gradient-based search to create both prompts and label words automatically from a few trigger examples.
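As a concrete contrast with the automatic construction studied in this work, a hand-written verbalizer amounts to a fixed word-to-label mapping applied to the model's vocabulary distribution (a minimal sketch; the example words and probabilities are invented for illustration):

```python
def verbalize(vocab_probs, verbalizer):
    """Turn a PLM's probability distribution over vocabulary words into
    class scores via a hand-written word -> label mapping, then predict
    the highest-scoring class."""
    scores = {}
    for word, label in verbalizer.items():
        scores[label] = scores.get(label, 0.0) + vocab_probs.get(word, 0.0)
    return max(scores, key=scores.get)

verbalizer = {"sports": "SPORTS", "basketball": "SPORTS",
              "politics": "POLITICS", "election": "POLITICS"}
vocab_probs = {"sports": 0.45, "basketball": 0.20, "politics": 0.10}
print(verbalize(vocab_probs, verbalizer))
# SPORTS
```

The coverage problem noted above is visible here: any relevant word missing from the hand-written mapping (e.g., "NBA") contributes nothing to the class score.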
However, these approaches need to update parameters with gradient descent, which is infeasible without access to the model weights (e.g., GPT-3). KPT (Han et al., 2021) incorporates external knowledge into the verbalizer but requires an unlabeled dataset to refine the label words, and is thus not applicable to scenarios where only label names are known. In contrast, our approach NPPrompt directly finds words relevant to the label names, without any gradient update, using only the PLM's initial word embeddings.

Zero-shot Text Classification

General zero-shot text classification focuses on classifying texts into classes that are unseen during training. Transferring knowledge from seen classes to unseen ones requires accurate and discriminative descriptions of all classes (Liu et al., 2019a; Xia et al., 2018), joint embeddings of categories and documents (Nam et al., 2016), or semantic correlations among classes (Rios & Kavuluru, 2018; Zhang et al., 2019). However, these methods require supervised data for the known label set and thus cannot be extended to settings where no labeled pairs are available for any category. Meng et al. (2020) propose LOTClass, which uses label names with self-training for zero-shot classification, but LOTClass still requires an unlabeled corpus for extracting topic-related words and performing self-training. In this work, NPPrompt achieves competitive and even better performance without using any unlabeled dataset.

