ASK ME ANYTHING: A SIMPLE STRATEGY FOR PROMPTING LANGUAGE MODELS

Abstract

Large language models (LLMs) transfer well to new tasks out-of-the-box when given only a natural language prompt that demonstrates how to perform the task, with no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly crafted "perfect" prompt for a task. To mitigate this effort, we instead ask whether collecting multiple decent, yet imperfect, prompts and aggregating them can lead to a high-quality prompting strategy. Our observations motivate our proposed method, ASK ME ANYTHING PROMPTING (AMA). We first develop an understanding of effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?"), tend to outperform those that restrict the model outputs ("John went to the park. True or False?"). AMA recursively uses the LLM itself to transform task inputs to the effective QA format. AMA generates multiple questions per input and applies these prompts to collect several noisy votes for the input's true label. We find the prompts have varying accuracies and dependencies, and we thus propose to use weak supervision, a procedure for combining noisy predictions, to produce the final predictions. We evaluate AMA across open-source model families (EleutherAI, BLOOM, OPT, and T0) and sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match or exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks; averaged across these tasks, GPT-J-6B outperforms few-shot GPT3-175B. We release our code here: https://github.com/HazyResearch/ama_prompting.

1. INTRODUCTION

Large language models (LLMs) are bringing us closer to the goal of task-agnostic machine learning (Brown et al., 2020; Bommasani et al., 2021). Rather than being trained for new tasks, LLMs are applied to new tasks out-of-the-box with no additional training. In this paradigm, termed in-context learning, LLMs are controlled through user-provided natural language specifications of the task, or prompts, which illustrate how to complete a task. A prompt is defined by a template which contains placeholders for in-context demonstrations of the inputs and outputs for the task. Recent work has evaluated LLM prompting performance on a broad set of tasks and finds the process to be brittle: small changes to the prompt result in large performance variations (Zhao et al., 2021; Holtzman et al., 2021). Performance further varies depending on the chosen LLM family (Ouyang et al., 2022; Sanh et al., 2022, inter alia) and model size (Wei et al., 2022c; Lampinen et al., 2022). To improve reliability, significant effort is dedicated towards designing a painstakingly crafted prompt. For instance, Mishra et al. (2021) and Wu et al. (2022) recommend that users manually explore large search spaces of strategies to tune their prompts on a task-by-task basis. This work instead considers aggregating the predictions of multiple effective, yet imperfect, prompts to improve prompting performance over a broad set of models and tasks. Given a task input, each prompt produces a vote for the input's true label, and these votes are aggregated to produce a final prediction.

Figure 1: AMA uses the LLM itself to reformat task inputs to more effective formats. AMA creates multiple reformatted prompts per input. The LLM predictions from the prompts are aggregated using weak supervision.

¹We draw inspiration from Wu et al. (2022) and focus on task-agnostic and scalable prompt-chains.

In pursuit of high-quality prompting via aggregation, we face the following challenges:

1. Effective prompts: High-quality prompts are a precursor to improvements from aggregation. We take the original prompts, which yield near-random performance using the GPT-3 model in Brown et al. (2020), for two SuperGLUE tasks (CB, RTE). Generating multiple prompts in the same format and taking the majority-vote prediction across prompts has a minor effect (+4% for CB) and can even hurt performance versus the average prompt performance (-2% for RTE). Many proposals for improved prompts focus on a single task type and evaluate on a single model family and/or size (Wei et al., 2022c; Jung et al., 2022). We need a structure for prompting that improves quality across tasks and models.

2. Scalable collection: After identifying effective prompt formats, we need to obtain such prompts at scale. The original format of a task varies widely, and prior works manually rewrite each task input to new formats (Mishra et al., 2021; Wu et al., 2022), which is challenging to scale. Generating multiple prompts per input increases the difficulty.

3. Prompt aggregation: Using the prompts above (for CB and RTE), we see 9.5% average variation in accuracy, and the Jaccard index over errors is 69% higher than if prompt errors were i.i.d., suggesting highly correlated prompt outputs. Majority vote (MV) is the primary unsupervised aggregation strategy in prior prompting work (Jiang et al., 2020; Schick & Schütze, 2021), but it accounts for neither property, making it unreliable.
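The error-correlation diagnostic in challenge 3 can be sketched as follows; the error sets and prompt names below are toy illustrations, not the paper's actual CB/RTE measurements:

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two error sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Indices of examples each prompt mispredicted (toy data).
errors = {
    "prompt_a": [0, 2, 5, 7],
    "prompt_b": [0, 2, 5, 9],  # overlaps heavily with prompt_a -> correlated
    "prompt_c": [1, 4, 8],     # near-disjoint errors
}

names = sorted(errors)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        print(p, q, round(jaccard(errors[p], errors[q]), 2))
```

A pairwise Jaccard index well above the value expected under independent errors signals that majority vote will effectively double-count the correlated prompts.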
We need a strategy that accounts for the varying accuracies and dependencies.

We propose ASK ME ANYTHING PROMPTING (AMA), a simple approach that enables open-source LLMs with 30x fewer parameters to exceed the few-shot performance of GPT3-175B. In AMA:

1. We identify properties of prompts that improve effectiveness across tasks, model types, and model sizes. We study standard prompt formats categorized by prior work (Brown et al., 2020) and find that prompts which encourage open-ended answers ("Where did John go?") are more effective than prompts that restrict the model output to particular tokens (e.g., "John went to the park. Output True or False?"). For instance, converting three SuperGLUE tasks (CB, RTE, WSC) from the original restrictive formats in Brown et al. (2020) to open-ended formats provides a 72% performance improvement (Section 3.2). Given a task input, we find that a simple structure of (1) forming questions based on the input and (2) prompting the LLM to answer those questions applies quite generally and improves performance across diverse benchmark tasks.

2. We propose a strategy for scalably reformatting task inputs to the effective formats found in (1): we transform task inputs to the effective open-ended question-answering format by recursively using the LLM itself in a task-agnostic two-step pipeline. We first use question()-prompts, which contain examples of how to transform statements into various (e.g., yes-no, cloze) questions, and second use answer()-prompts, which demonstrate ways of answering questions (e.g., concise or lengthy answers). Applying the prompt-chain answer(question(x)) gives a final prediction for the input x.¹ Chains are (1) reused across inputs, and (2) different pairs of functional prompts can be combined to create variety. We apply the varying functional prompt-chains to an input to collect multiple votes for the input's true label.

3. We propose the use of weak supervision (WS) to reliably aggregate predictions.
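The two-step pipeline in (2) can be sketched as below; the prompt templates and the canned llm() stub are illustrative placeholders for a real model call, not the paper's exact prompts:

```python
QUESTION_PROMPT = """Write the claim as a yes/no question.

Claim: Jack camped with Mark
Question: Did Jack camp with Mark?

Claim: {claim}
Question:"""

ANSWER_PROMPT = """Answer the question from the context.

Context: {context}
Question: {question}
Answer:"""

def llm(prompt: str) -> str:
    # Stand-in for a real LLM call (e.g., to GPT-J-6B); returns canned text.
    if prompt.startswith("Write the claim"):
        return "Did John go to the park?"
    return "No"

def question(claim: str) -> str:
    # question()-prompt: reformat a statement into an open-ended question.
    return llm(QUESTION_PROMPT.format(claim=claim)).strip()

def answer(context: str, q: str) -> str:
    # answer()-prompt: answer the generated question given the task input.
    return llm(ANSWER_PROMPT.format(context=context, question=q)).strip()

def chain(context: str, claim: str) -> str:
    # The prompt-chain answer(question(x)); swapping in different question()
    # and answer() templates yields multiple chains, hence multiple votes.
    return answer(context, question(claim))
```

Varying the in-context demonstrations inside the two templates is what produces the multiple noisy votes per input that AMA later aggregates.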
We find that the errors produced by the predictions of different chains can be both highly varying and highly correlated. While majority vote (MV) may do well on certain sets of prompts, it performs poorly in the above cases. AMA accounts for these cases by identifying dependencies among prompts and using WS, a procedure for modeling and combining noisy predictions without any labeled data (Ratner et al., 2017; Varma et al., 2019). We apply WS broadly to prompting for the first time in this work,
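As a rough stand-in for the label model of Ratner et al. (2017), the sketch below runs a toy EM that estimates per-prompt accuracies and log-odds-weighted posteriors from unlabeled binary votes; unlike full WS, it does not model dependencies among prompts, and all data is illustrative:

```python
import numpy as np

def label_model(votes, n_iter=20):
    """Toy EM for aggregating binary prompt votes without labels.
    votes: (num_prompts, num_examples) array with entries in {0, 1}."""
    probs = votes.mean(axis=0)  # init posterior P(y=1) from majority vote
    for _ in range(n_iter):
        # M-step: each prompt's agreement rate with the current soft labels.
        acc = (votes * probs + (1 - votes) * (1 - probs)).mean(axis=1)
        acc = np.clip(acc, 1e-3, 1 - 1e-3)
        # E-step: log-odds-weighted vote, so accurate prompts count more.
        w = np.log(acc / (1 - acc))
        score = (w[:, None] * (2 * votes - 1)).sum(axis=0)
        probs = 1 / (1 + np.exp(-score))
    return probs, acc

votes = np.array([
    [1, 1, 0, 1, 0, 1],  # fairly reliable prompt
    [1, 1, 0, 1, 1, 1],  # fairly reliable prompt
    [0, 1, 1, 0, 0, 0],  # noisy prompt
])
probs, acc = label_model(votes)
preds = (probs > 0.5).astype(int)
```

Unlike plain majority vote, the weighted vote discounts the noisy prompt once its estimated accuracy drops toward chance, which is the behavior WS provides in AMA.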

