STREET: A MULTI-TASK STRUCTURED REASONING AND EXPLANATION BENCHMARK

Abstract

We introduce STREET, a unified multi-task and multi-domain natural language reasoning and explanation benchmark. Unlike most existing question-answering (QA) datasets, we expect models not only to answer questions, but also to produce step-by-step structured explanations describing how premises in the question are used to produce intermediate conclusions that can prove the correctness of a certain answer. We perform extensive evaluation with popular language models such as few-shot-prompted GPT-3 and fine-tuned T5. We find that these models still lag behind human performance when producing such structured reasoning steps. We believe this work will provide a way for the community to better train and test systems on multi-step reasoning and explanations in natural language.

1. INTRODUCTION

A long-term pursuit in Artificial Intelligence is to endow machines with the ability to reason and to manipulate premises to reach conclusions and perform tasks. Initially, most reasoning systems performed multi-step operations over symbolic or probabilistic knowledge (Newell & Simon, 1956; McCarthy et al., 1960; Siler & Buckley, 2005), and even though these systems were able to perform complex tasks (Vernon et al., 2007; Metaxiotis et al., 2002; Ribeiro & Forbus, 2021), they still had shortcomings in encoding such knowledge, learning reasoning rules, and dealing with ambiguity (Bell, 1985; Ribeiro et al., 2019). Some recent work in the field of question-answering (QA) has demonstrated that language models can bypass some of these issues and learn to reason directly over natural language (Clark et al., 2020), allowing for more flexible and adaptable reasoning capabilities. Another advantage of performing multi-step reasoning over natural language is that it produces more inspectable outputs, improving the explainability of models that are otherwise regarded as black-box systems (Jain & Wallace, 2019; Rajani et al., 2019a; Danilevsky et al., 2020).

Despite this recent progress, we notice that there is still a gap in resources for training and evaluating general reasoning capabilities over natural language. To facilitate research in this direction we propose the STructured REasoning and Explanation Multi-Task benchmark (or STREET for short), containing a collection of tasks in various domains including quantitative reasoning (math questions), analytical reasoning (logic puzzle questions), and deductive reasoning (common-sense and science questions). We build upon existing QA datasets by adding multi-premise, multi-step, structured explanations in the form of reasoning graphs, as depicted in Figure 1.
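To make the notion of a reasoning graph concrete, the sketch below models it as a directed acyclic graph whose nodes hold natural-language statements (question premises, intermediate conclusions, and the final answer) and whose edges record which statements each conclusion is derived from. This is a minimal illustration of the structure described above; the class and field names are our own and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One statement in a reasoning graph."""
    node_id: str
    text: str
    parents: list = field(default_factory=list)  # ids of supporting statements

class ReasoningGraph:
    def __init__(self):
        self.nodes = {}

    def add_premise(self, node_id, text):
        # Premises come directly from the question and have no parents.
        self.nodes[node_id] = Node(node_id, text)

    def add_step(self, node_id, text, parents):
        # One reasoning step is a textual entailment: the statements in
        # `parents` jointly entail the new conclusion `text`.
        assert all(p in self.nodes for p in parents), "unknown supporting statement"
        self.nodes[node_id] = Node(node_id, text, list(parents))

    def num_steps(self):
        # Reasoning steps are exactly the nodes derived from other nodes.
        return sum(1 for n in self.nodes.values() if n.parents)

# A toy deductive example with two premises, one intermediate
# conclusion, and an answer node.
g = ReasoningGraph()
g.add_premise("P1", "All squares have four equal sides.")
g.add_premise("P2", "Shape X is a square.")
g.add_step("C1", "Shape X has four equal sides.", ["P1", "P2"])
g.add_step("A", "Answer: yes, all sides of shape X are equal.", ["C1"])
print(g.num_steps())  # -> 2
```

Counting reasoning steps this way (conclusions with at least one supporting parent) matches the sense in which the benchmark's 151.1k reasoning steps are tallied across all graphs.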
The STREET benchmark contains 35.8k questions, each of which is accompanied by a reasoning graph, either created by expert annotators or programmatically. When combined, all reasoning graphs contain a total of 151.1k reasoning steps (or textual entailments), of which 14.7k

