STREET: A MULTI-TASK STRUCTURED REASONING AND EXPLANATION BENCHMARK

Abstract

We introduce STREET, a unified multi-task and multi-domain natural language reasoning and explanation benchmark. Unlike most existing question-answering (QA) datasets, we expect models not only to answer questions, but also to produce step-by-step structured explanations describing how premises in the question are used to produce intermediate conclusions that can prove the correctness of a certain answer. We perform extensive evaluation with popular language models such as few-shot prompted GPT-3 and fine-tuned T5. We find that these models still lag behind human performance when producing such structured reasoning steps. We believe this work will provide a way for the community to better train and test systems on multi-step reasoning and explanations in natural language.

1. INTRODUCTION

A long-term pursuit in Artificial Intelligence is to endow machines with the ability to reason and manipulate premises to reach conclusions and perform tasks. Initially, most reasoning systems performed multi-step operations over symbolic or probabilistic knowledge (Newell & Simon, 1956; McCarthy et al., 1960; Siler & Buckley, 2005), and even though these systems were able to perform complex tasks (Vernon et al., 2007; Metaxiotis et al., 2002; Ribeiro & Forbus, 2021), they still had shortcomings when it came to encoding such knowledge, learning reasoning rules, and dealing with ambiguity (Bell, 1985; Ribeiro et al., 2019). Some recent works in the field of question-answering (QA) have demonstrated that language models can bypass some of these issues and learn to reason directly over natural language (Clark et al., 2020), allowing for more flexible and adaptable reasoning capabilities. Another advantage of performing multi-step reasoning over natural language is that it allows for more inspectable outputs, improving the explainability of models that are otherwise regarded as black-box systems (Jain & Wallace, 2019; Rajani et al., 2019a; Danilevsky et al., 2020). Despite the recent progress, we notice that there is still a gap in resources for training and evaluating general reasoning capabilities over natural language. To facilitate research in this direction we propose the STructured REasoning and Explanation Multi-Task benchmark (or STREET for short), containing a collection of tasks in various domains including quantitative reasoning (math questions), analytical reasoning (logic puzzle questions), and deductive reasoning (common-sense and science questions). We build upon existing QA datasets by adding multi-premise, multi-step, structured explanations in the form of reasoning graphs, as depicted in Figure 1.
The STREET benchmark contains 35.8k questions, each of which is accompanied by a reasoning graph, either created by expert annotators or programmatically. When combined, all reasoning graphs contain a total of 151.1k reasoning steps (or textual entailments), of which 14.7k were created by our expert annotators. We carefully selected the tasks such that most of the relevant knowledge required to answer the questions is contained within the questions or contexts themselves. We can therefore focus on the reasoning problem, with a greater number of reasoning steps (an average of 7.8 reasoning steps per answer) and a more complex reasoning structure than previous datasets. These properties differentiate our work from single-step reasoning tasks such as Natural Language Inference (NLI) (Bowman et al., 2015; Williams et al., 2018; Zellers et al., 2018) and from multi-hop QA (Yang et al., 2018; Chen et al., 2021), which requires specific factual knowledge retrieval. In our proposed evaluation, models are expected to not only answer the questions, but also generate the reasoning graphs (including the textual intermediate steps) that explain their output answer. With that in mind, we design a few evaluation metrics to verify whether the generated reasoning graphs match the expected gold data. We perform extensive evaluation using popular language models of various sizes, namely T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020), either fine-tuning on training data or using few-shot prompting. Our experiments show that even though these models can achieve high solving rates on many of the original QA datasets, they still struggle to generate coherent and relevant reasoning graphs and remain far below human performance. Our main contributions are as follows: (1) We define reasoning graphs, which are structured chains of reasoning in natural language that provide explainability to the output of models on QA tasks.
(2) We propose STREET, a multi-task and multi-domain benchmark containing questions requiring diverse types of reasoning skills. The answers in the dataset contain annotated or generated reasoning graphs. We make the data and evaluation code available online.[1] (3) We evaluate the performance of LMs such as fine-tuned T5 and few-shot prompted GPT-3 on our proposed task. Our results suggest there is still room for improving language models when it comes to generating complex multi-step reasoning explanations.
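To make the notion of a reasoning graph concrete: TLUs (textual logical units) act as nodes, and each reasoning step links the premise TLUs it consumes to a newly generated intermediate (or final) conclusion. The sketch below is purely illustrative; the class and field names are our own assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TLU:
    """A textual logical unit: one premise, question part, or conclusion."""
    uid: int
    text: str


@dataclass
class ReasoningStep:
    """One entailment: premise TLUs combined into a conclusion TLU."""
    premises: List[int]   # uids of the input TLUs
    conclusion: int       # uid of the generated conclusion TLU


@dataclass
class ReasoningGraph:
    tlus: Dict[int, TLU] = field(default_factory=dict)
    steps: List[ReasoningStep] = field(default_factory=list)

    def add_tlu(self, uid: int, text: str) -> None:
        self.tlus[uid] = TLU(uid, text)

    def add_step(self, premise_ids: List[int], uid: int, text: str) -> None:
        # A step may only reference TLUs that already exist in the graph,
        # which keeps the structure a valid derivation order.
        assert all(p in self.tlus for p in premise_ids)
        self.add_tlu(uid, text)
        self.steps.append(ReasoningStep(list(premise_ids), uid))
```

Under this representation, the "average of 7.8 reasoning steps per answer" reported above corresponds to `len(graph.steps)`, and evaluation amounts to comparing a predicted graph's steps and TLU texts against the gold graph.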



[1] https://github.com/amazon-science/street-reasoning




Figure 1: Two examples from our proposed STREET benchmark. The questions are derived from the Grade School Math (GSM8K) and Analytical Reasoning – Law School Admission Test (AR-LSAT) tasks. The QA components (e.g., question, context, and answer options) are broken into textual logical units, or TLUs. These TLUs are connected to form a reasoning graph. Our proposed benchmark builds upon existing QA datasets by adding structured reasoning explanations that show how one can derive the answer to a given question.
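The GSM8K example from Figure 1 (Natalia sold clips to 48 friends in April and half as many in May) can be encoded as a small reasoning graph in which edges connect premise TLUs to derived conclusions. This is a hypothetical encoding for illustration only; the key names and helper function are ours, not the benchmark's released format.

```python
# Hypothetical reasoning-graph encoding of the GSM8K example in Figure 1.
# "tlus" maps TLU ids to their text; each edge pairs the premise TLU ids
# of a reasoning step with the id of the conclusion that step produces.
reasoning_graph = {
    "tlus": {
        1: "Natalia sold clips to 48 of her friends in April",
        2: "she sold half as many clips in May",
        3: "How many clips did Natalia sell altogether in April and May?",
        4: "Natalia sold 48 / 2 = 24 clips in May",
        5: "Natalia sold 48 + 24 = 72 clips altogether in April and May",
    },
    "edges": [
        ((1, 2), 4),  # April count + "half as many" -> May count
        ((1, 4), 5),  # April count + May count -> total
    ],
}


def final_answer(graph: dict) -> str:
    """Return the text of the last conclusion, which carries the answer."""
    last_conclusion_id = graph["edges"][-1][1]
    return graph["tlus"][last_conclusion_id]
```

Because each conclusion cites the TLUs it was derived from, the structure makes the full derivation inspectable step by step rather than exposing only the final answer.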

