LEVERAGING LARGE LANGUAGE MODELS FOR MULTIPLE CHOICE QUESTION ANSWERING

Abstract

While large language models (LLMs) like GPT-3 have achieved impressive results on multiple choice question answering (MCQA) tasks in the zero, one, and few-shot settings, they generally lag behind the MCQA state of the art (SOTA). MCQA tasks have traditionally been presented to LLMs as cloze tasks. An LLM is conditioned on a question (without the associated answer options) and its chosen option is the one assigned the highest probability after normalization (for length, etc.). A more natural prompting approach is to present the question and answer options to the LLM jointly and have it output the symbol (e.g., "A") associated with its chosen answer option. This approach allows the model to explicitly compare answer options, reduces computational costs, and mitigates the effects of tokenization scheme and answer option representations on answer selection. For the natural approach to be effective, the LLM it is used with must be able to associate answer options with the symbols that represent them. The LLM needs what we term multiple choice symbol binding (MCSB) ability. This ability varies greatly by model. We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated.

1. INTRODUCTION

Current state of the art (SOTA) methods on many multiple choice question answering (MCQA) tasks involve specialized models, extensive per-task engineering, and individualized tuning. What if one model could do just as well as each of these models does individually? This is part of a general vision for so-called foundation models (Bommasani et al., 2021). Foundation models include large pre-trained language models (LLMs) that have derived enough broad knowledge (spanning, for example, linguistic, factual, and commonsense knowledge (Liu et al., 2019; Amrami & Goldberg, 2018; Petroni et al., 2020; Bosselut et al.; Bouraoui et al.; Zuo et al., 2018; Bhagavatula et al., 2019)) to transfer from a simple language modelling objective to a huge array of natural language tasks. Interestingly, while LLMs have achieved SOTA results on many tasks, they generally fall short on MCQA. Why is this the case, given the general language modelling prowess suggested by the low cross-entropy loss they attain with all their parameters, data, and compute (Kaplan et al., 2020; Henighan et al., 2020; Hernandez et al., 2021)? Should they not excel, or at least be highly competitive? In this paper, we argue that they fall short because the dominant methods used with them conflate the probability of a sentence with the probability of a correct answer. We hypothesize that there are fundamental problems with the near-universal approach to MCQA for LLMs, which we refer to as "cloze prompting" (CP). Specifically, these problems are 1) the conflation of a text's grammaticality, commonality, and "naturalness" with its likelihood of being the correct answer, 2) the computational expense of scoring multiple candidate answers, 3) the fact that the LLM cannot explicitly reason about and compare different candidate answers, and 4) finicky normalization made necessary by tokenization schemes.
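To make the cloze prompting procedure concrete, the following is a minimal sketch of CP scoring. It is not the implementation used in any particular LLM evaluation harness: the function names are illustrative, a whitespace split stands in for a real tokenizer, and a toy log-probability function stands in for a real LLM.

```python
import math

def cloze_score(question, option, logprob_fn):
    """Cloze prompting (CP): condition only on the question, sum the
    log-probabilities of the option's tokens, and length-normalize."""
    tokens = option.split()  # toy whitespace "tokenizer"
    total = sum(logprob_fn(question, tok) for tok in tokens)
    return total / len(tokens)  # per-token normalization

def cloze_answer(question, options, logprob_fn):
    """CP selects the option with the highest normalized score; note this
    requires one scoring pass per candidate answer."""
    return max(options, key=lambda o: cloze_score(question, o, logprob_fn))

# Toy stand-in for an LLM: longer "tokens" get lower log-probability,
# and the question is ignored entirely.
def toy_logprob(question, token):
    return -math.log(1 + len(token))

q = "What is the capital of France?"
opts = ["Paris", "the city of Lyon", "Berlin"]
# Under this toy scorer the fluent-looking but wrong phrase outscores the
# correct single-token answer, illustrating problem 1 above.
print(cloze_answer(q, opts, toy_logprob))  # → the city of Lyon
```

The per-token normalization in `cloze_score` is one of several normalization schemes in use (others divide by answer length in characters, or by the score of the answer conditioned on a generic prompt), which is precisely the "finicky normalization" problem noted above.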
The centerpiece of our paper is an extensive investigation of an alternative: we explain how these problems might be solved by what we call multiple choice prompting (MCP). In MCP, the language model receives both the question and a list of candidate answers, as on a multiple choice test, with each answer associated with (or "bound" to) a symbol such as "A", "B", "C", etc. In Section 3 we explain why this approach may lead MCP to outperform CP. More importantly, though, we demonstrate that when we prompt LLMs with MCP instead of CP, performance often dramatically improves, approaching or even surpassing SOTA performance. On a varied group of 20 datasets, we show that MCP outperforms CP on all but 4 of the datasets, with a mean gap of 9.7% across all tasks and a maximum gap of 44%. MCP surpasses prior SOTA scores on 9 of the 20 datasets (by as much as 15% on a single task), and averaged across all datasets, MCP scores fall 0.6% shy of SOTA. This implies that the de facto method for prompting LLMs has led their MCQA ability to be considerably underestimated, and that there exists a better general way to prompt a single LLM so that it scores, on average, within one percentage point of all previous SOTA scores. For the 20 datasets we consider, SOTA accuracy required 14 customized models and approaches, nearly three individualized setups for every four datasets. We argue that the fact that MCP matches or surpasses SOTA with no task-specific tuning is evidence for the efficiency, generality, and overall promise of foundation models in MCQA.
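The MCP formulation described above can be sketched as follows. This is an illustrative sketch rather than the paper's exact implementation: prompt wording and function names are assumptions, and a dictionary of symbol log-probabilities stands in for a real LLM's output distribution.

```python
import string

def format_mcp_prompt(question, options):
    """Multiple choice prompting (MCP): present the question together with
    every candidate answer, each bound to a symbol ("A", "B", ...)."""
    lines = [f"Question: {question}"]
    for symbol, option in zip(string.ascii_uppercase, options):
        lines.append(f"{symbol}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)

def mcp_answer(symbol_logprobs):
    """Choose the answer symbol the model assigns the highest
    log-probability. Unlike CP, this needs only one forward pass, and the
    model sees (and can implicitly compare) all options at once."""
    return max(symbol_logprobs, key=symbol_logprobs.get)

prompt = format_mcp_prompt("What is the capital of France?",
                           ["Berlin", "Paris", "Madrid"])
print(prompt)
# Hypothetical next-token log-probabilities over the option symbols:
print(mcp_answer({"A": -4.1, "B": -0.2, "C": -3.7}))  # → B
```

Because the model's answer is a single symbol token, this strategy sidesteps length normalization and tokenization effects entirely, but it only works if the model can reliably bind each symbol to its option, the MCSB ability defined below.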
Our primary contribution is threefold: 1) We present an argument for multiple choice prompting over cloze prompting and formally define multiple choice symbol binding (MCSB), an ability an LLM must have to benefit from MCP; 2) We show that not all LLMs are equally skilled in this regard; and 3) Across 20 diverse datasets, we show that the models most capable of MCSB can individually approach or beat SOTA on most of the considered tasks when prompted with multiple choice prompting instead of the near-universal approach of cloze prompting. Code is available.1

2. RELATED WORK

Transformers (Vaswani et al., 2017) have revolutionized the field of NLP by allowing models to effectively absorb much larger datasets via massive scaling in parameter count and compute; loss falls predictably as all three of these factors are scaled up (Kaplan et al., 2020; Henighan et al., 2020; Hernandez et al., 2021). Parameter counts have quickly grown from 1.5B in 2018 (Radford et al., 2018) to 540B in 2022 (Chowdhery et al., 2022), and in general, larger models are tested on a more extensive suite of tasks to probe their capacity for transfer. This invariably includes multiple choice question answering tasks, and nearly every LLM we know of uses cloze prompting for these tasks (Brown et al., 2020; Du et al., 2022; Smith et al., 2022; Chowdhery et al., 2022; Lieber et al., 2021). It was, in part, these massive language models that prompted the coining of the phrase "foundation models" (Bommasani et al., 2021): a family of large models that are heavily trained on enormous datasets in a self-supervised fashion. They derive general knowledge about a modality and can transfer with impressive sample efficiency to a great number of downstream tasks. A key part of the vision for these models is that they can be repurposed, avoiding the energy, storage, and human capital costs associated with ad hoc models.
Our work supports this vision of LLMs as foundation models by demonstrating their ability to answer many kinds of multiple choice questions correctly in a zero or few-shot fashion when prompted appropriately. To the best of our knowledge, the only LLM papers that use the MCP approach for evaluation on any dataset are Gopher (Rae et al., 2021) and its follow-up Chinchilla (Hoffmann et al., 2022). The use of MCP in these works is peripheral and limited to a few specific datasets (MMLU (Hendrycks et al., 2021), RACE (Lai et al., 2017), TruthfulQA (Lin et al., 2021b)). One other recent work (Liévin et al., 2022) used MCP when evaluating InstructGPT (Ouyang et al., 2022) on three medical question datasets. In these works, the specific impact of the MCP approach on results is not explored. Ours is the first work to systematically investigate the benefits of this prompting strategy. We show that language models vary greatly in their ability to leverage MCP, and demonstrate that MCP can substantially improve LLM accuracy across a diverse set of tasks. We hope this observation will lead to wider adoption of MCP in LLM work.



1 https://github.com/BYU-PCCL/leveraging-llms-for-mcqa

