SELECTION-INFERENCE: EXPLOITING LARGE LANGUAGE MODELS FOR INTERPRETABLE LOGICAL REASONING

Abstract

Large language models (LLMs) have been shown to be capable of impressive few-shot generalisation to new tasks. However, they still tend to perform poorly on multi-step logical reasoning problems. Here we carry out a comprehensive evaluation of LLMs on 46 tasks that probe different aspects of logical reasoning. We show that language models tend to perform fairly well at single-step inference or entailment tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. In light of this, we propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules, and alternates between selection and inference to generate a series of interpretable, causal reasoning steps leading to the final answer. Focusing on a subset of 10 reasoning tasks from ProofWriter and bAbI, we show that a 7B parameter, decoder-only LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100% compared to an equivalent vanilla baseline. The same model in the same setting even outperforms a significantly larger 280B parameter baseline on the same suite of tasks. Moreover, answers produced by the SI framework are accompanied by a causal, natural-language-based reasoning trace, which has important implications for the safety and trustworthiness of the system.

1. INTRODUCTION

Large language models (LLMs) are powerful few-shot learners (Bommasani et al., 2021; Brown et al., 2020; Lu et al., 2022). However, one area where they tend to perform poorly is logical reasoning (Rae et al., 2021). Yet the ability to perform multi-step, logically valid reasoning is fundamental both to the discovery of new knowledge and to explainability. It underpins many advances that have been made in science, medicine, maths and philosophy. It is also one of the most valued strengths of classical, symbolic AI over contemporary deep learning methods (Marcus & Davis, 2019; Marcus, 2020; Bengio et al., 2021), prompting the recent increase in the use of neurosymbolic approaches to bridge this gap (Garnelo & Shanahan, 2019; Garcez & Lamb, 2020). Here we propose a Selection-Inference (SI) framework that takes inspiration from the neurosymbolic literature to improve the ability of LLMs to do logically valid reasoning.

There are many flavours of neurosymbolic models (Garcez & Lamb, 2020). Those from which we draw inspiration tend to have a modular structure, where each module is specialised for one type of operation (Mao et al., 2019; Andreas et al., 2016). For example, such modules may be neural networks or hand-crafted functions designed to attend to a single object, or to compare the location or size of two inputs (Andreas et al., 2016; Yi et al., 2018). Neurosymbolic models can produce an answer to a complex query by chaining these operations together, passing inputs from one module to another. This has the benefit of producing an interpretable trace of intermediate computations, in contrast to the "black-box" computations common to end-to-end deep learning approaches. Importantly, the modularity of neurosymbolic methods allows them to generalise to significantly harder problems that require long chains of reasoning (Hudson & Manning, 2019).
However, the hand-crafted and specialised nature of the modules often makes the resulting systems brittle and difficult to extend to new domains (Yi et al., 2018). We aim to address this limitation by leveraging LLMs in our modules. Building on other modular (Tafjord et al., 2021; Andreas et al., 2016) and step-wise approaches (Wei et al., 2022; Dalvi et al., 2021), we propose SI, which decomposes logical reasoning into two modular stages: 1) selection, which involves choosing a subset of relevant information sufficient to make a single step of inference, and 2) inference, which only sees the limited information provided by the selection module, and uses it to infer a new intermediate piece of evidence on the way to the final answer (see Fig. 1). We implement both stages using pre-trained LLMs which, thanks to their powerful few-shot generalisation capabilities, serve as more general alternatives to the hand-crafted, specialised modules typically used in neurosymbolic approaches. In the SI framework, multiple steps of selection and inference are chained together to produce a sequence of reasoning steps. As well as underpinning better performance on reasoning problems, this yields an interpretable trace that justifies the final answer. Furthermore, the reasoning trace produced by our system is causal, in the sense that each step follows from, and depends on, the previous step. Each inference step is made in isolation, based solely on the limited information provided by the selection module, without direct access to the question or to previous steps of reasoning. This contrasts with the more common approach of post-hoc rationalisation, where the answer produced by the model has no direct dependence on the explanation, since the explanation is produced either in parallel to the answer or after the fact (Saha et al., 2020; Lampinen et al., 2022; Cobbe et al., 2021).
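The alternating selection-inference loop described above can be sketched as follows. Note that this is a minimal illustration, not the paper's exact procedure: the `select` and `infer` callables stand in for prompted LLM calls (replaced here by trivial stubs), and the fixed step count is an assumed halting criterion. The key structural property it shows is that each inference step sees only the facts chosen by selection, never the question or earlier reasoning directly.

```python
from typing import Callable, List, Tuple

def si_reasoning(
    context: List[str],
    question: str,
    select: Callable[[List[str], str], List[str]],
    infer: Callable[[List[str]], str],
    max_steps: int = 5,
) -> Tuple[str, List[Tuple[List[str], str]]]:
    """Alternate selection and inference, accumulating a causal trace.

    `select` sees the growing fact set and the question; `infer` sees
    ONLY the selected facts, so each derived statement depends causally
    on the selection that produced it.
    """
    facts = list(context)
    trace = []
    for _ in range(max_steps):
        selection = select(facts, question)  # pick facts for one inference step
        new_fact = infer(selection)          # derive one new intermediate fact
        trace.append((selection, new_fact))
        facts.append(new_fact)               # new fact is available to later steps
    return facts[-1], trace

# Toy stubs standing in for the prompted LLM modules (hypothetical):
def toy_select(facts: List[str], question: str) -> List[str]:
    return facts[-2:]  # pretend the selection LLM chose these facts

def toy_infer(selection: List[str]) -> str:
    return " therefore ".join(selection)  # pretend the inference LLM derived this

answer, trace = si_reasoning(
    ["A implies B", "A"], "Is B true?", toy_select, toy_infer, max_steps=1
)
```

In a real instantiation, both stubs would be few-shot prompted calls to a frozen LLM, and the final answer would be read off the last inferred fact.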
A notable example that sits in the grey area between post-hoc rationalisation approaches and more causal explanation approaches is Chain-of-Thought (COT) (Wei et al., 2022) (see Fig. 1). In this approach, LLMs are encouraged to produce a reasoning trace before the answer. However, the dependence of the answer on the reasoning is not explicitly encouraged to be causal (as defined above). Indeed, the authors show that while the COT explanations help boost the final answer accuracy, the reasoning traces produced by the model are often wrong even when the final answer is correct (see the appendices of Wei et al. (2022) for examples), and the model is prone to making up facts (Figure A6). Developing a system that can demonstrate how it reaches its answers using a causal reasoning trace has important benefits in terms of safety, explainability, interpretability, debugging, and trust. In this paper we make the following contributions:



Language model references throughout the paper are removed to preserve anonymity. These will be added back if the paper is accepted for publication. Note that we use decoder-only language models.



Figure 1: Schematic comparison between Selection-Inference and other representative approaches to reasoning in natural language: Chain-of-Thought (Wei et al., 2022), ProofWriter (Tafjord et al., 2021), and EntailmentBank (Dalvi et al., 2021). C - context, Q - question, A - answer, S - selection, I - inference. Grey circles - givens; white circles - model output. Loops indicate multi-step reasoning. The order of letters in a single circle indicates the order in which the corresponding steps are output by the model.


1. We provide a comprehensive evaluation of LLMs on a set of 46 tasks probing different aspects of logical reasoning, and show that LLMs are good at simpler, single-step logical inference in 5-shot generalisation settings, but struggle with harder problems (Sec. 3).

2. We introduce the Selection-Inference (SI) framework, a modular, iterative approach to solving reasoning problems (Sec. 4).

3. We demonstrate the utility of the SI framework by evaluating a 7B parameter, decoder-only LLM¹ on 10 logical reasoning tasks, showing that overall it almost triples the performance of the same model used naively and almost doubles the performance of the same model used in the COT framework. Moreover, it often outperforms a 40x larger 280B parameter, decoder-only LLM baseline used both naively and in the COT framework.

4. We illustrate further benefits of the SI framework in terms of the causal and interpretable reasoning traces produced (Sec. 5). These traces can help humans understand how the model reached its final answer, which is useful for debugging and opens the system's decisions to human critique.

