MEASURING AND NARROWING THE COMPOSITIONALITY GAP IN LANGUAGE MODELS

Abstract

We investigate the ability of language models to perform compositional reasoning tasks in which the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but fail to generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions whose answers require composing multiple facts that are unlikely to have been observed together during pretraining. In the GPT-3 family of models, we show that as model size increases, single-hop question answering performance improves faster than multi-hop performance does; the compositionality gap therefore does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly instead of implicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and then answers) follow-up questions before answering the initial question. Finally, we show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which further improves accuracy. 1
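The self-ask format described above can be illustrated with a one-shot prompt in which the model decomposes a compositional question into explicit follow-up questions before committing to a final answer. This is a hedged sketch for illustration only; the exact wording and formatting of the scaffold below are assumptions, not the paper's verbatim prompt.

```python
# Hypothetical one-shot self-ask exemplar. A real prompt would prepend one or
# more such exemplars to a new compositional question and let the model
# continue the pattern; the string below is an assumed illustration.
SELF_ASK_EXEMPLAR = """\
Question: Who won the Masters Tournament the year Justin Bieber was born?
Are follow up questions needed here: Yes.
Follow up: When was Justin Bieber born?
Intermediate answer: Justin Bieber was born in 1994.
Follow up: Who won the Masters Tournament in 1994?
Intermediate answer: Jose Maria Olazabal won the Masters Tournament in 1994.
So the final answer is: Jose Maria Olazabal
"""

def build_self_ask_prompt(question: str) -> str:
    """Append a new compositional question for the model to continue."""
    return SELF_ASK_EXEMPLAR + "\nQuestion: " + question + \
        "\nAre follow up questions needed here:"

print(build_self_ask_prompt("Who was president of the U.S. the year "
                            "the Eiffel Tower was completed?"))
```

Because the scaffold marks follow-up questions with a fixed prefix ("Follow up:"), a controller can intercept each one, route it to a search engine, and splice the retrieved answer back in as the intermediate answer.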

1. INTRODUCTION

Compositional reasoning lets models go beyond rote memorization of directly observed facts to deduce previously unseen knowledge. For example, a model should be able to answer "How long was Queen Elizabeth's reign?" even if the answer did not explicitly appear in the training data, by recalling her coronation and death dates and reasoning over these facts. While language models (LMs) have shown strong question answering performance, it remains unclear how much of that performance is due to memorization of huge corpora and how much is due to reasoning.

First, we quantify the reasoning abilities of LMs using multi-hop question answering. We present a new, automatically generated dataset, Compositional Celebrities (CC), of 8.6k 2-hop questions; it combines frequently stated facts in improbable ways (e.g., "Who won the Masters Tournament the year Justin Bieber was born?"), allowing us to disentangle memorization and reasoning. Intuitively, the reasoning required to answer such questions seems trivial given access to the relevant facts. We introduce the term compositionality gap to describe the fraction of compositional questions that the model answers incorrectly, out of all the compositional questions for which the model answers the sub-questions correctly.

Unsurprisingly, we find that performance on both single- and multi-hop question answering improves monotonically with the scale of the pretrained model. Intriguingly, however, we find that the compositionality gap remains at a roughly constant 40% across different model sizes and training techniques, with no apparent improvement from scale (Fig. 1). This result is especially surprising given the straightforward reasoning steps required for such questions, and it suggests that larger-scale pretraining is highly effective at teaching models to memorize facts but not how to compose them.
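The compositionality gap defined above can be computed directly from per-question correctness labels: restrict to questions where the model answered both sub-questions correctly, then take the fraction of those where the composed answer was wrong. The following is a minimal sketch; the field names (`sub1_correct`, `sub2_correct`, `multihop_correct`) are assumptions for illustration, not identifiers from the paper.

```python
def compositionality_gap(results):
    """Fraction of compositional questions answered incorrectly, among
    those whose sub-questions were BOTH answered correctly.

    results: list of dicts with boolean fields sub1_correct,
    sub2_correct, and multihop_correct (assumed schema).
    """
    # Keep only questions where the model got both sub-questions right.
    both_subs = [r for r in results
                 if r["sub1_correct"] and r["sub2_correct"]]
    if not both_subs:
        return 0.0  # gap is undefined; report 0 for the empty case
    # Of those, count the ones where the composed answer was wrong.
    failed = [r for r in both_subs if not r["multihop_correct"]]
    return len(failed) / len(both_subs)

# Toy example: 3 questions with both sub-answers correct, of which the
# model composes correctly on 2, so the gap is 1/3. The fourth question
# is excluded because one sub-question was missed.
demo = [
    {"sub1_correct": True, "sub2_correct": True, "multihop_correct": True},
    {"sub1_correct": True, "sub2_correct": True, "multihop_correct": False},
    {"sub1_correct": True, "sub2_correct": True, "multihop_correct": True},
    {"sub1_correct": True, "sub2_correct": False, "multihop_correct": False},
]
print(compositionality_gap(demo))  # prints 0.3333333333333333
```

Note that the denominator conditions on sub-question correctness, so the gap isolates composition failures from recall failures: a model that simply memorizes more facts grows the denominator without necessarily shrinking the gap.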
We also find a positive correlation between an LM's confidence about a certain fact and its ability to compose that fact with other facts to answer compositional questions about it.

1 No models were trained or finetuned in the making of this preprint.

