CAN DISCRETE INFORMATION EXTRACTION PROMPTS GENERALIZE ACROSS LANGUAGE MODELS?

Abstract

We study whether automatically induced prompts that effectively extract information from a language model can also be used, out-of-the-box, to probe other language models for the same information. After confirming that discrete prompts induced with the AutoPrompt algorithm outperform manual and semi-manual prompts on the slot-filling task, we demonstrate a drop in performance for AutoPrompt prompts learned on one model and tested on another. We introduce a way to induce prompts by mixing language models at training time that results in prompts that generalize well across models. We conduct an extensive analysis of the induced prompts, finding that the more general prompts include a larger proportion of existing English words and have a less order-dependent and more uniform distribution of information across their component tokens. Our work provides preliminary evidence that it is possible to generate discrete prompts that can be induced once and used with a number of different models, and it gives insights into the properties characterizing such prompts.1

1. INTRODUCTION

NLP has shifted to a paradigm where very large pre-trained language models (LMs) are adapted to downstream tasks through relatively minor updates (Bommasani et al., 2021; Liu et al., 2021). In the most extreme case, task adaptation does not require modifying the LM or even accessing its internals at all, but simply formulating a linguistic query that elicits an appropriate, task-specific response from the model (Petroni et al., 2019a; Radford et al., 2019). This has promising practical applications, as one could easily imagine proprietary LMs only exposing a natural-language-based interface, with downstream agents extracting the information they need by formulating the appropriate queries.2

In this scenario, one fundamental question is how robust the querying protocol is to changes in the underlying LM. On the one hand, the same downstream agent might want to query multiple LMs. On the other, if the LM provider updates the model, this should not break the downstream pipeline. On a more theoretical level, the properties of an emergent robust protocol might give us insights into the general language processing capabilities of neural networks, and into how they relate to natural language.

We present a systematic study of the extent to which LM query protocols, which, following current usage, we call prompting methods, generalize across LMs. Extending and confirming prior results, we find that discrete prompts automatically induced through an existing optimization procedure (Shin et al., 2020) outperform manually and semi-manually crafted prompts, reaching a good performance level when tested with the same LM used for prompt induction. While the automatically induced discrete prompts also generalize better to other LMs than (semi-)manual prompts and currently popular "soft" prompts, their overall generalization performance is quite poor. We next show that a simple change to the original training procedure, namely using more than one LM at prompt induction time, leads to discrete prompts that generalize better to new LMs. The proposed procedure, however, is brittle, crucially relying on the "right" choice of LMs to mix at prompt induction. We finally conduct the first extensive analysis of automatically induced discrete prompts, tentatively identifying a set of properties characterizing the more general prompts, such as a higher incidence of existing English words and robustness to token shuffling and deletion.

Figure 1: Cartoon summary of our main results. Prompts induced using a single language model show a significant drop in performance when used to query other models. The problem is alleviated when prompts are exposed to multiple models during the induction phase. Subtle but consistent differences in the nature of the induced prompts also emerge.

2 RELATED WORK

Prior work such as Petroni et al. (2019a) and Radford et al. (2019) demonstrated that LMs can be directly adapted to new tasks through appropriate querying methods. This led to an explosion of work on so-called "prompt engineering" (see Liu et al., 2021, for a thorough review). Much of this work focuses on crafting appropriate manual or semi-manual prompts and/or on tuning LMs to better respond to such prompts (e.g., Schick & Schütze, 2021; Sanh et al., 2022). Going beyond manual prompts, Shin et al. (2020) introduced the AutoPrompt algorithm to generate prompts using gradient-guided search, and demonstrated that such prompts often outperform manual ones.
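To make the gradient-guided search more concrete, the sketch below illustrates an AutoPrompt-style update step written against the HuggingFace transformers API. It is a minimal illustration under several assumptions rather than the original implementation: the model (bert-base-cased), the single Dante/Florence slot-filling fact, the three-token prompt, and the greedy argmax update are placeholders chosen for brevity; the actual algorithm scores a top-k candidate set by re-running the model over batches of training examples.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative assumptions: model name, prompt length, (subject, target) pair
# and the greedy update are placeholders, not the paper's exact setup.
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():              # we only need gradients w.r.t. inputs
    p.requires_grad_(False)

embedding_matrix = model.get_input_embeddings().weight      # (vocab, hidden)

subject, target = "Dante", "Florence"                        # toy slot-filling fact
subj_ids = tokenizer.encode(subject, add_special_tokens=False)
target_id = tokenizer.encode(target, add_special_tokens=False)[0]  # assumes 1 token
num_triggers = 3
trigger_ids = torch.full((num_triggers,), tokenizer.mask_token_id, dtype=torch.long)

def build_inputs(trigger_ids):
    """Prompt layout: [CLS] subject trigger_1 ... trigger_k [MASK] [SEP]."""
    ids = ([tokenizer.cls_token_id] + subj_ids + trigger_ids.tolist()
           + [tokenizer.mask_token_id, tokenizer.sep_token_id])
    trig_pos = list(range(1 + len(subj_ids), 1 + len(subj_ids) + num_triggers))
    mask_pos = 1 + len(subj_ids) + num_triggers
    return torch.tensor([ids]), trig_pos, mask_pos

for step in range(6):
    input_ids, trig_pos, mask_pos = build_inputs(trigger_ids)
    labels = torch.full_like(input_ids, -100)        # -100 marks ignored positions
    labels[0, mask_pos] = target_id

    # Feed embeddings directly so we can take the gradient w.r.t. them.
    inputs_embeds = embedding_matrix[input_ids].detach().clone().requires_grad_(True)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()

    # First-order (HotFlip-style) estimate of the loss change from swapping the
    # trigger token at position p for vocabulary token v: (e_v - e_p) . dL/de_p.
    p = trig_pos[step % num_triggers]                 # update one trigger per step
    with torch.no_grad():
        grad = inputs_embeds.grad[0, p]
        scores = -(embedding_matrix - inputs_embeds[0, p]) @ grad
        trigger_ids[step % num_triggers] = scores.argmax()
    print(f"step {step}: loss={loss.item():.3f}, "
          f"triggers='{tokenizer.decode(trigger_ids)}'")
```

In the same spirit, one could average such candidate scores over several LMs in order to expose the prompt to multiple models at induction time; the exact mixing procedure we use is described later in the paper.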
While automatically induced prompts suffer from issues such as low interpretability, we think it is important to continue focusing on them because, besides their better performance (a result we confirm here for AutoPrompt across a range of LMs), they are more promising than manual prompts in terms of scalability, especially in contexts in which it is not sufficient to formulate a single prompt template for a whole task, but each input query demands a distinct prompt formulation (Zhang et al., 2022).

Concurrent and later work has proposed to replace discrete strings, such as those generated by AutoPrompt, with sequences of arbitrary vectors from the LM's embedding space (Lester et al., 2021; Zhong et al., 2021). We confirm here that these continuous, or "soft", prompts outperform AutoPrompt when trained and tested on the same LM. However, they cannot be used in our envisaged multiple-LM scenario. First, they require access to a model's inner representations, beyond the standard natural-language querying interface, so that embeddings can be passed as input. Second, continuous prompts, by their nature, will not generalize out-of-the-box to other LMs. Trivially, they cannot generalize across models with different embedding dimensionality. Even when models share dimensionality, there is no reason why the absolute position of a vector in the embedding space of one model should meaningfully transfer to another model. Discretizing soft-prompt tokens to their nearest vocabulary neighbours in order to overcome these issues does not help either: Khashabi et al. (2021) demonstrated that it is possible to find well-performing soft prompts whose nearest-neighbour projections are arbitrarily fixed discrete tokens. Appendix B elaborates on the failure of soft prompts to generalize across models, as well as on the problematic behaviour of discretized soft prompts.

We are not aware of much previous work that has addressed the challenge of LM-to-LM transferability. Wallace et al. (2019) studied this problem in the context of textual adversarial attacks (which can be seen as a special case of prompting, and indeed their attack method is closely related to AutoPrompt). Similarly to us, they notice some performance drop when transferring adversarial "triggers" to different LMs, and they show that this can be mitigated by an ensembling approach in which two triggers generated using variants of the same LM are combined. Su et al. (2022) study LM-to-LM transferability in the context of continuous prompts. Since, as we just discussed, such prompts are not directly transferable, they induce a projection from the embedding space of the source LM to that of the target LM, thus considering a very different scenario from the type of "out-of-the-box" transferability we are interested in here.
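To make the embedding-space argument concrete, here is a minimal sketch, again against the HuggingFace transformers API, of why a soft prompt tuned on one model cannot simply be handed to another. The two model names, the randomly initialized (rather than actually tuned) soft prompt, and the example sentence are illustrative assumptions; the only point is that a soft prompt lives in one specific model's embedding space.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative assumptions: two BERT variants with different hidden sizes
# (768 vs 1024) stand in for arbitrary source and target LMs.
src = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
tgt = AutoModelForMaskedLM.from_pretrained("bert-large-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# A soft prompt is just a trainable matrix of vectors prepended to the input
# embeddings (Lester et al., 2021); it need not correspond to any real token.
prompt_len = 5
soft_prompt = torch.nn.Parameter(
    torch.randn(prompt_len, src.config.hidden_size) * 0.02)

def run_with_soft_prompt(model, soft_prompt, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    tok_embeds = model.get_input_embeddings()(ids)           # (1, L, d_model)
    full = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)
    return model(inputs_embeds=full).logits

out = run_with_soft_prompt(src, soft_prompt, "Dante was born in [MASK].")
print("source model ok:", out.shape)      # works in the space it was trained in

try:
    run_with_soft_prompt(tgt, soft_prompt, "Dante was born in [MASK].")
except RuntimeError as err:
    # Fails outright when hidden sizes differ; even with matching sizes, the
    # vectors carry no meaning in the target model's embedding space.
    print("out-of-the-box transfer failed:", err)
```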



1 The code to reproduce our analysis is available at https://github.com/ncarraz/prompt_generalization.
2 As a concrete example, one of the most powerful current LMs, GPT3, is only available via a text-based API (https://beta.openai.com/overview).




