ASK ME ANYTHING: A SIMPLE STRATEGY FOR PROMPTING LANGUAGE MODELS

Abstract

Large language models (LLMs) transfer well to new tasks out-of-the-box when given only a natural language prompt that demonstrates how to perform the task, with no additional training. Prompting is a brittle process, however: small modifications to the prompt can cause large variations in model predictions, so significant effort is dedicated to painstakingly crafting a perfect prompt for each task. To mitigate this effort, we instead ask whether collecting multiple decent, yet imperfect, prompts and aggregating their predictions can yield a high-quality prompting strategy. Our observations motivate our proposed method, ASK ME ANYTHING PROMPTING (AMA). We first develop an understanding of effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?"), tend to outperform those that restrict the model outputs ("John went to the park. True or False?"). AMA recursively uses the LLM itself to transform task inputs into the effective QA format. For each input, AMA generates multiple questions and applies these prompts to collect several noisy votes for the input's true label. Because the prompts have varying accuracies and dependencies, we propose to combine the noisy predictions using weak supervision to produce the final predictions. We evaluate AMA across open-source model families (EleutherAI, BLOOM, OPT, and T0) and sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match or exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks; averaged across these tasks, GPT-J-6B outperforms few-shot GPT3-175B. Our code is publicly available.

1. INTRODUCTION

Large language models (LLMs) are bringing us closer to the goal of task-agnostic machine learning (Brown et al., 2020; Bommasani et al., 2021). Rather than being trained for new tasks, LLMs are applied to new tasks out-of-the-box with no additional training. In this paradigm, termed in-context learning, LLMs are controlled through user-provided natural language specifications of the task, or prompts, which illustrate how to complete it. A prompt is defined by a template containing placeholders for in-context demonstrations of the task's inputs and outputs. Recent work has evaluated LLM prompting performance on a broad set of tasks and finds the process to be brittle: small changes to the prompt result in large performance variations (Zhao et al., 2021; Holtzman et al., 2021). Performance further varies with the chosen LLM family (Ouyang et al., 2022; Sanh et al., 2022, inter alia) and model size (Wei et al., 2022c; Lampinen et al., 2022). To improve reliability, significant effort is dedicated to designing a painstakingly perfect prompt. For instance, Mishra et al. (2021) and Wu et al. (2022) recommend that users manually explore large search spaces of strategies to tune their prompts on a task-by-task basis.

This work instead considers aggregating the predictions of multiple effective, yet imperfect, prompts to improve prompting performance over a broad set of models and tasks. Given a task input, each prompt produces a vote for the input's true label, and these votes are aggregated to produce a final prediction. In pursuit of high-quality prompting via aggregation, we face the following challenges:

1. Effective prompts: High-quality prompts are a precursor to improvements from aggregation. We take the original prompts which yield near-random performance using the GPT-3 model in
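The vote-and-aggregate idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt templates and the `toy_llm` stand-in are hypothetical, and where AMA combines votes with weak supervision, this sketch uses a simple majority vote.

```python
from collections import Counter

def apply_prompt(llm, template, task_input):
    """Format the task input with one QA-style prompt and query the model."""
    return llm(template.format(input=task_input)).strip()

def aggregate_predictions(llm, templates, task_input):
    """Collect one noisy vote per prompt and return the majority answer.
    (AMA instead combines the votes with weak supervision.)"""
    votes = [apply_prompt(llm, t, task_input) for t in templates]
    return Counter(votes).most_common(1)[0][0]

# Toy stand-in for an LLM call; a real system would query e.g. GPT-J-6B.
def toy_llm(prompt):
    return "John" if "Who" in prompt or "person" in prompt else "unsure"

# Hypothetical open-ended QA prompt templates for the same input.
templates = [
    "Context: {input}\nQuestion: Who went to the park?\nAnswer:",
    "Context: {input}\nQuestion: Which person is mentioned?\nAnswer:",
    "Context: {input}\nQuestion: What is the subject of the sentence?\nAnswer:",
]

print(aggregate_predictions(toy_llm, templates, "John went to the park."))
# two of the three prompts vote "John", so the aggregate prediction is "John"
```

Majority vote treats every prompt as equally reliable and independent; the paper's motivation for weak supervision is precisely that the prompts have varying accuracies and dependencies.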

Code availability: https://github.com/HazyResearch/ama_prompting

