LANGUAGE MODELS ARE GREEDY REASONERS: A SYSTEMATIC FORMAL ANALYSIS OF CHAIN-OF-THOUGHT

Abstract

Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts (examples with intermediate reasoning steps). Existing benchmarks measure reasoning ability indirectly, by evaluating accuracy on downstream tasks such as mathematical reasoning. However, it is unclear how these models obtain the answers and whether they rely on simple heuristics rather than the generated chain-of-thought. To enable systematic exploration of the reasoning ability of LLMs, we present a new synthetic question-answering dataset called PRONTOQA, where each example is generated from a synthetic world model represented in first-order logic. This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis. Our analysis on INSTRUCTGPT and GPT-3 shows that LLMs are quite capable of making correct individual deduction steps, and so are generally capable of reasoning, even in fictional contexts. However, they have difficulty with proof planning: when multiple valid deduction steps are available, they are not able to systematically explore the different options.

1. INTRODUCTION

The ability to reason, drawing new conclusions from provided facts, is a hallmark of human intelligence. Recently, chain-of-thought (CoT) prompting has enabled large language models (LLMs) to perform logical reasoning tasks with impressive accuracy (Wei et al., 2022; Chowdhery et al., 2022; Lewkowycz et al., 2022). In CoT prompting, each example consists of a question (e.g., "What is 6 / 3 - 1?"), a short description of the reasoning required to answer the question called the "chain-of-thought" (e.g., "6 / 3 is 2. 2 - 1 is 1."), and a label (e.g., "1"). When prompted with a few CoT examples, the elicited reasoning allows LLMs to predict the label with much higher accuracy than standard question-answer prompting. However, it is unclear to what extent these models can reason, due to several confounding factors. First, existing studies primarily rely on question-answering (QA) tasks from real-world settings such as math word problems (Cobbe et al., 2021; Han et al., 2022; Weston et al., 2016). It is likely that LLMs have already acquired the relevant knowledge through pretraining and simply retrieve the answer rather than reason to derive it. Second, the reasoning task may contain spurious correlations that allow the model to obtain the correct answer through shortcuts (Zhang et al., 2022b).

In this work, we systematically investigate the reasoning capability of LLMs by directly evaluating their predicted chains-of-thought (the interpretable proof steps), rather than the predicted label. To enable easy analysis of the CoT, we construct a new synthetic QA dataset called PRONTOQA, for Proof and Ontology-Generated Question-Answering. Inspired by the PROOFWRITER dataset (Tafjord et al., 2021), each example in PRONTOQA is generated from an ontology and has a unique proof (see figure 1 for an example).
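To make the few-shot CoT format concrete, the helper below assembles a prompt from (question, chain-of-thought, label) triples. This is a hypothetical sketch of the general prompt layout, not the paper's actual prompting code; the function name and the "Q:"/"A:" template are our own assumptions.

```python
def build_cot_prompt(examples, query):
    """Assemble a few-shot chain-of-thought prompt as plain text.

    Each example is a (question, chain_of_thought, label) triple.
    The model is expected to continue the final "A:" with its own
    chain-of-thought followed by a label.
    """
    parts = []
    for question, chain_of_thought, label in examples:
        parts.append(f"Q: {question}\nA: {chain_of_thought} The answer is {label}.")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# One in-context example, then the query the model must answer.
prompt = build_cot_prompt(
    [("What is 6 / 3 - 1?", "6 / 3 is 2. 2 - 1 is 1.", "1")],
    "What is 8 / 2 - 3?",
)
```

The prompt ends at "A:" so that the model's continuation supplies both the chain-of-thought and the label for the query.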
We convert the proofs into syntactically simple sentences using a grammar such that the inverse process is relatively easy: from the predicted CoT, we semantically parse each sentence into a formal language and reconstruct the underlying proof steps. We then directly analyze the model's reasoning by inspecting each step in the reconstructed proof and comparing them against the gold proof. We emphasize here that while the dataset is an important contribution of this paper, the main contribution is the analysis that is facilitated by the dataset.
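Because the generated sentences are syntactically simple, each one can be mapped back to a logical form and each proof step checked for validity. The sketch below illustrates this idea on two hypothetical sentence patterns ("Every X is a Y." and "Z is a X.") with a modus ponens check; the actual PRONTOQA grammar and parser are richer than this.

```python
import re

def parse_sentence(sentence):
    """Parse a syntactically simple ontology sentence into a logical form.

    Two illustrative patterns: a universally quantified rule
    ("Every X is a Y.") and an atomic fact ("Z is a X.").
    Returns a tuple encoding, or None if the sentence does not parse.
    """
    m = re.fullmatch(r"Every (\w+) is an? (\w+)\.", sentence)
    if m:
        return ("forall", m.group(1), m.group(2))  # forall x: X(x) -> Y(x)
    m = re.fullmatch(r"(\w+) is an? (\w+)\.", sentence)
    if m:
        return ("atom", m.group(2), m.group(1))    # X(z)
    return None

def is_valid_modus_ponens(rule, fact, conclusion):
    """Check that `conclusion` follows from `rule` and `fact` by modus ponens."""
    return (rule[0] == "forall" and fact[0] == "atom" and conclusion[0] == "atom"
            and fact[1] == rule[1]         # fact matches the rule's antecedent
            and conclusion[1] == rule[2]   # conclusion matches the consequent
            and conclusion[2] == fact[2])  # same individual throughout

rule = parse_sentence("Every cat is a feline.")
fact = parse_sentence("Fae is a cat.")
conclusion = parse_sentence("Fae is a feline.")
step_is_valid = is_valid_modus_ponens(rule, fact, conclusion)
```

A reconstructed proof is then just a sequence of such steps, each of which can be compared against the corresponding step of the gold proof.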



All analysis code, data, data generation scripts, and model outputs are available at github.com/asaparov/prontoqa.

