LANGUAGE MODELS ARE GREEDY REASONERS: A SYSTEMATIC FORMAL ANALYSIS OF CHAIN-OF-THOUGHT

Abstract

Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts (examples with intermediate reasoning steps). Existing benchmarks measure reasoning ability indirectly, by evaluating accuracy on downstream tasks such as mathematical reasoning. However, it is unclear how these models obtain the answers and whether they rely on simple heuristics rather than the generated chain-of-thought. To enable systematic exploration of the reasoning ability of LLMs, we present a new synthetic question-answering dataset called PRONTOQA, where each example is generated from a synthetic world model represented in first-order logic. This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis. Our analysis on INSTRUCTGPT and GPT-3 shows that LLMs are quite capable of making correct individual deduction steps, and so are generally capable of reasoning, even in fictional contexts. However, they have difficulty with proof planning: when multiple valid deduction steps are available, they are not able to systematically explore the different options.

1. INTRODUCTION

The ability to reason, drawing new conclusions from provided facts, is a hallmark of human intelligence. Recently, chain-of-thought (CoT) prompting has enabled large language models (LLMs) to perform logical reasoning tasks with impressive accuracy (Wei et al., 2022; Chowdhery et al., 2022; Lewkowycz et al., 2022). In CoT prompting, each example consists of a question (e.g., "6 / 3 - 1?"), a short description of the reasoning required to answer the question called the "chain-of-thought" (e.g., "6 / 3 is 2. 2 - 1 is 1."), and a label (e.g., "1"). When prompted with a few CoT examples, the elicited reasoning allows LLMs to predict the label with much higher accuracy than standard question-answer prompting. However, it is unclear to what extent these models can reason due to several confounding factors. First, existing studies primarily rely on question-answering (QA) tasks from real-world settings such as math word problems (Cobbe et al., 2021; Han et al., 2022; Weston et al., 2016). It is likely that LLMs have already acquired the relevant knowledge through pretraining and simply retrieve the answer rather than reason over the provided facts. Second, the reasoning task may contain spurious correlations that allow the model to obtain the correct answer through shortcuts (Zhang et al., 2022b). In this work, we systematically investigate the reasoning capability of LLMs by directly evaluating their predicted chains-of-thought (the interpretable proof steps), rather than the predicted label. To enable easy analysis of the CoT, we construct a new synthetic QA dataset called PRONTOQA, for Proof and Ontology-Generated Question-Answering. Inspired by the PROOFWRITER dataset (Tafjord et al., 2021), each example in PRONTOQA is generated from an ontology and has a unique proof (see figure 1 for an example).
We convert the proofs into syntactically simple sentences using a grammar such that the inverse process is relatively easy: from the predicted CoT, we semantically parse each sentence into a formal language and reconstruct the underlying proof steps. We then directly analyze the model's reasoning by inspecting each step in the reconstructed proof and comparing it against the gold proof.¹ We emphasize here that while the dataset is an important contribution of this paper, the main contribution is the analysis that it facilitates.

[Figure 2 (image): the four-step generative process, ending with the proof step deriving ¬herbivorous(fae), the natural-language context "Q: Each cat is a carnivore. Every carnivore is not herbivorous. Carnivores are mammals. All mammals are warm-blooded. Mammals are vertebrates. Every vertebrate is an animal. Animals are multicellular.", and the query, chain-of-thought, and label "Fae is a cat. True or false: Fae is not herbivorous. A: Fae is a cat. Cats are carnivores. Fae is a carnivore. Every carnivore is not herbivorous. Fae is not herbivorous. True"]

FIGURE 2: Schematic of the generative process for each example in PRONTOQA. Step 1: We generate an ontology from a prior distribution, shown here as a tree. Each node denotes a concept (e.g., mammal), each with an optional property (e.g., warm_blooded), and each blue edge denotes a "subtype of" relation. Step 2: Generate the proof from the ontology. Each horizontal black line indicates a proof step, with its premises written above the line and the conclusion written below. Step 3: Convert the ontology into a natural language context. Step 4: Convert the proof into a natural language query, chain-of-thought, and answer label. There is a one-to-one correspondence between the conclusion of each proof step and the sentences in the chain-of-thought.
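To make the four-step generative process concrete, the following is a minimal sketch in Python. It is deliberately simplified relative to the released generator at github.com/asaparov/prontoqa (which samples tree-shaped ontologies with distractor branches and supports several deduction rules): here we only sample a linear "subtype of" chain and apply modus ponens, and the function name and signature are our own illustration.

```python
import random

def generate_example(concepts, properties, entity="Fae", hops=2):
    """Sample a linear chain of 'subtype of' relations, attach a property
    to the final concept, and render the context, query, chain-of-thought,
    and label as natural-language sentences."""
    chain = random.sample(concepts, hops)   # e.g. ["cat", "carnivore"]
    prop = random.choice(properties)        # e.g. "not herbivorous"

    # Steps 1 and 3: ontology -> natural-language context.
    context = [f"Every {a} is a {b}." for a, b in zip(chain, chain[1:])]
    context.append(f"Each {chain[-1]} is {prop}.")

    # Steps 2 and 4: proof -> query, chain-of-thought, and label.
    # Each proof-step conclusion becomes one sentence of the CoT.
    query = f"{entity} is a {chain[0]}. True or false: {entity} is {prop}."
    cot = [f"{entity} is a {chain[0]}."]
    for a, b in zip(chain, chain[1:]):
        cot += [f"Every {a} is a {b}.", f"{entity} is a {b}."]
    cot += [f"Each {chain[-1]} is {prop}.", f"{entity} is {prop}."]
    return " ".join(context), query, " ".join(cot), "True"
```

Because the rendering grammar is this rigid, the inverse mapping from a predicted chain-of-thought back to formal proof steps stays tractable.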
We systematically evaluate INSTRUCTGPT² (Ouyang et al., 2022) and the original GPT-3 (Brown et al., 2020) on PRONTOQA by controlling a number of variables that characterize the complexity of the reasoning task, such as the ontology type and the number of proof steps required. Our analysis shows that these models are quite good at producing valid individual proof steps, even on fictional and counterfactual ontologies. However, LLMs have difficulty with proof planning: when the models encounter a point in the proof where multiple valid proof steps are available, they sometimes select the wrong step, and this often leads to an incomplete proof and subsequently an incorrect answer. Interestingly, the models are much less likely to be misled when given a true ontology, suggesting that the world knowledge acquired during pretraining plays an important role in LLM reasoning. We also find that our results generalize to more sophisticated/informative prompts, such as self-consistency prompting (Wang et al., 2022), and prompts with example traces of depth-first proof search instead of CoT.
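The step-level analysis behind these claims can be sketched as follows. This is an illustrative simplification (our own representation and names, not the released analysis code): each parsed chain-of-thought step is checked against the facts proved so far, distinguishing valid steps on the gold proof path from "misleading" steps that are valid deductions but stray from it, and from outright invalid steps.

```python
def classify_steps(predicted, axioms, gold):
    """predicted and gold are lists of (premises, conclusion) proof steps;
    axioms are the facts and rules stated in the context.
    Returns one label per predicted step."""
    proved = set(axioms)
    labels = []
    for premises, conclusion in predicted:
        if all(p in proved for p in premises):
            # A valid deduction; is its conclusion also on the gold path?
            on_gold_path = any(conclusion == c for _, c in gold)
            labels.append("correct" if on_gold_path else "misleading")
            proved.add(conclusion)
        else:
            # At least one premise was never established.
            labels.append("invalid")
    return labels
```

Under this view, "greedy reasoning" shows up as runs of individually valid but misleading steps: the model commits to the first applicable rule rather than the one that advances the gold proof.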

2. RELATED WORK

Our proposed dataset is most closely related to PROOFWRITER (Tafjord et al., 2021) and FOLIO (Han et al., 2022), which are QA datasets designed to test reasoning ability. Although PROOFWRITER provides multi-hop proofs for each example, a number of key properties led us to develop our own dataset (see table 1 for a summary). FOLIO does not provide easily parseable proofs/CoTs in its examples, and evaluation is done by inspecting the predicted labels, which may not necessarily be a good measure of reasoning ability. In our analysis, we focus on more specific variables that may affect the reasoning of the model, such as: (1) Is the model's reasoning dependent on whether the example is consistent with pretraining ("true"), inconsistent ("false"), or neither ("fictional")? (2) Is the model's reasoning sensitive to whether the predicates in the examples are unary or binary? (3) Is the model's reasoning dependent on the rules of deduction in the examples? These variables are not controllable in existing datasets. Further, in some datasets, the code to generate examples is not available.



¹ All analysis code, data, data generation scripts, and model outputs are available at github.com/asaparov/prontoqa.

² INSTRUCTGPT is the model resulting from fine-tuning GPT-3 via reinforcement learning from human feedback. Throughout the paper, "INSTRUCTGPT" refers to the model named text-davinci-002, though in our experiments we also evaluate text-ada-001, text-babbage-001, text-curie-001, davinci, and text-davinci-001.



FIGURE 1: An example from PRONTOQA. Context: "Each cat is a carnivore. Every carnivore is not herbivorous. Carnivores are mammals. All mammals are warm-blooded. Mammals are vertebrates. Every vertebrate is an animal. Animals are multicellular." Query: "Fae is a cat. True or false: Fae is not herbivorous." Chain-of-thought: "A: Fae is a cat. Cats are carnivores. Fae is a carnivore. Every carnivore is not herbivorous. Fae is not herbivorous." Label: "True"
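Because sentences like those above come from a small grammar, the inverse semantic parse is simple. The following is an illustrative sketch (our own simplification, not the released parser): a single regular expression maps each sentence to either a universally quantified rule or a ground fact, with polarity.

```python
import re

def stem(word):
    # Naive de-pluralization; leaves adjectives like "herbivorous" intact.
    if word.endswith("s") and not word.endswith(("ss", "ous")):
        return word[:-1]
    return word

def parse_sentence(sentence):
    """Map one grammar-generated sentence to a logical form:
    ("rule", subject_concept, predicate, polarity) for quantified rules,
    ("fact", entity, predicate, polarity) for atoms about a named entity."""
    m = re.match(
        r"((?:Each|Every|All) )?([A-Za-z-]+?) (is|are) (not )?(?:an? )?([A-Za-z-]+)\.?$",
        sentence)
    quant, subj, verb, neg, pred = m.groups()
    subj, pred = stem(subj.lower()), stem(pred.lower())
    polarity = neg is None
    # An explicit quantifier or a bare plural ("Carnivores are mammals.")
    # signals a universally quantified rule; otherwise it is a ground fact.
    if quant or verb == "are":
        return ("rule", subj, pred, polarity)
    return ("fact", subj, pred, polarity)
```

Chaining these parses step by step reconstructs the symbolic proof that the formal analysis in section 1 operates on.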

