CAN DISCRETE INFORMATION EXTRACTION PROMPTS GENERALIZE ACROSS LANGUAGE MODELS?

Abstract

We study whether automatically-induced prompts that effectively extract information from a language model can also be used, out-of-the-box, to probe other language models for the same information. After confirming that discrete prompts induced with the AutoPrompt algorithm outperform manual and semi-manual prompts on the slot-filling task, we demonstrate a drop in performance for AutoPrompt prompts learned on one model and tested on another. We introduce a way to induce prompts by mixing language models at training time that results in prompts that generalize well across models. We conduct an extensive analysis of the induced prompts, finding that the more general prompts include a larger proportion of existing English words and have a less order-dependent and more uniform distribution of information across their component tokens. Our work provides preliminary evidence that it is possible to generate discrete prompts that can be induced once and used with a number of different models, and gives insights on the properties characterizing such prompts.¹

1. INTRODUCTION

NLP has shifted to a paradigm where very large pre-trained language models (LMs) are adapted to downstream tasks through relatively minor updates (Bommasani et al., 2021; Liu et al., 2021). In the most extreme case, task adaptation does not require modifying the LM or even accessing its internals at all, but simply formulating a linguistic query that elicits an appropriate, task-specific response from the model (Petroni et al., 2019a; Radford et al., 2019). This has promising practical applications, as one could easily imagine proprietary LMs only exposing a natural-language-based interface, with downstream agents extracting the information they need by formulating the appropriate queries.²

In this scenario, one fundamental question is how robust the querying protocol is to changes in the underlying LM. On the one hand, the same downstream agent might want to query multiple LMs. On the other, if the LM provider updates the model, this should not break the downstream pipeline. On a more theoretical level, the properties of an emergent robust protocol might give us insights on the general language processing capabilities of neural networks, and on how they relate to natural language.

We present a systematic study of the extent to which LM query protocols, which, following current usage, we call prompting methods, generalize across LMs. Extending and confirming prior results, we find that discrete prompts automatically induced through an existing optimization procedure (Shin et al., 2020) outperform manually and semi-manually crafted prompts, reaching a good performance level when tested with the same LM used for prompt induction. While the automatically induced discrete prompts also generalize better to other LMs than (semi-)manual prompts and currently popular "soft" prompts, their overall generalization performance is quite poor. We next show that a simple change to the original training procedure, namely using more than one LM at prompt induction time, leads to discrete prompts that generalize better to new LMs. The proposed procedure, however, is brittle, crucially relying on the "right" choice of LMs to mix at prompt induction. We finally conduct the first extensive analysis of automatically induced discrete prompts, finding that the prompts that generalize better contain a larger proportion of existing English words and distribute information across their component tokens in a more uniform and less order-dependent way.
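To make the setup concrete, the sketch below illustrates discrete prompt induction with more than one LM in the loop. It is a simplified illustration under stated assumptions, not the paper's actual implementation: the model pair (bert-base-cased, distilbert-base-cased), the toy relation data, and the trigger length are arbitrary choices, and candidate tokens are proposed at random rather than through AutoPrompt's gradient-guided candidate selection. What it does demonstrate is the mixing step: the objective is averaged over several LMs, so the induced prompt is discouraged from overfitting to the idiosyncrasies of any single model.

```python
# Minimal sketch of discrete prompt induction with model mixing, in the spirit
# of AutoPrompt (Shin et al., 2020). Assumptions (not from the paper): the
# model pair, the toy "born in" relation data, the 5-token trigger length,
# and random candidate proposal in place of AutoPrompt's gradient-guided
# (HotFlip-style) candidate selection.
import random
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODELS = ["bert-base-cased", "distilbert-base-cased"]  # hypothetical mix
tokenizers = [AutoTokenizer.from_pretrained(m) for m in MODELS]
models = [AutoModelForMaskedLM.from_pretrained(m).eval() for m in MODELS]

# Toy slot-filling data: (subject, gold object) pairs for one relation.
DATA = [("Dante", "Florence"), ("Mozart", "Salzburg"), ("Monet", "Paris")]

def prompt_loss(triggers, tok, model):
    """Average cross-entropy of the gold object at the [MASK] position,
    given prompts of the form '<subject> <trigger tokens> [MASK]'."""
    losses = []
    for subj, obj in DATA:
        text = f"{subj} {' '.join(triggers)} {tok.mask_token}"
        enc = tok(text, return_tensors="pt")
        mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0].item()
        # Simplification: score only the first subword of the gold object.
        gold_id = tok.convert_tokens_to_ids(tok.tokenize(obj)[0])
        with torch.no_grad():
            logits = model(**enc).logits[0, mask_pos]
        losses.append(torch.nn.functional.cross_entropy(
            logits.unsqueeze(0), torch.tensor([gold_id])))
    return torch.stack(losses).mean().item()

def mixed_loss(triggers):
    # The key idea: average the objective over several LMs, so the prompt
    # cannot exploit quirks specific to a single model.
    return sum(prompt_loss(triggers, t, m)
               for t, m in zip(tokenizers, models)) / len(models)

# Greedy coordinate search over a 5-token trigger sequence.
vocab = [w for w in tokenizers[0].get_vocab() if w.isalpha()]
triggers = ["the"] * 5
for step in range(10):
    pos = step % len(triggers)
    best, best_loss = triggers[pos], mixed_loss(triggers)
    for cand in random.sample(vocab, 20):  # random proposals, not HotFlip
        triggers[pos] = cand
        loss = mixed_loss(triggers)
        if loss < best_loss:
            best, best_loss = cand, loss
    triggers[pos] = best
    print(step, round(best_loss, 3), triggers)
```

In the actual AutoPrompt procedure, candidate tokens at each position are instead ranked by the dot product between their embeddings and the gradient of the loss with respect to the current trigger embedding, which makes the search far more sample-efficient than the random proposals used above.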



¹ The code to reproduce our analysis is available at https://github.com/ncarraz/prompt_generalization.

² As a concrete example, one of the most powerful current LMs, GPT3, is only available via a text-based API (https://beta.openai.com/overview).

