ROSCOE: A SUITE OF METRICS FOR SCORING STEP-BY-STEP REASONING

Abstract

Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers (Nye et al., 2021; Wei et al., 2022). These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end-task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets covering a diverse set of tasks that require reasoning skills, and show that ROSCOE consistently outperforms baseline metrics.¹

1. INTRODUCTION

Scaling language models has improved state-of-the-art performance on nearly every NLP benchmark (Brown et al., 2020), with large language models (LLMs) performing impressively as few-shot learners (Brown et al., 2020). Despite these achievements, even the largest of these models still struggle with tasks including math word problems (Hendrycks et al., 2021), symbolic manipulation (Rytting & Wingate, 2021), and commonsense reasoning (West et al., 2022). Recent work has shown that prompting (Wei et al., 2022; Wang et al., 2022) or fine-tuning (Lampinen et al., 2022) LLMs to generate step-by-step rationales can lead to improvements on reasoning tasks. Some of these include small-scale analysis of specific error types within step-by-step rationales (Lewkowycz et al., 2022; Chowdhery et al., 2022), as shown in Table 1. However, existing works primarily focus on end-task performance. Although text generation evaluation metrics sometimes offer fine-grained quality evaluations (e.g., adequacy, fluency) against human scores (Opitz & Frank, 2021; Leiter et al., 2022), these metrics generally treat the output as a whole, and many of these generative metrics operate on tasks such as summarization or machine translation rather than reasoning.

Table 1: Step-by-step reasoning for a logical inference task generated from 2-shot GPT-3 (text-davinci-002) (Brown et al., 2020). Error types are highlighted: logical or factual issues, partially correct, somewhat-relevant. Full 2-shot exemplars are in Appendix B.
Context: Shelley is from Virginia, but is visiting that city with that famous market where they throw the fish. Going home next Tuesday!
Question: Is it likely that Shelly will be near the Pacific Ocean this weekend?
GPT-3 step-by-step reasoning:
• No, it is not likely that Shelley will be near the Pacific Ocean this weekend.
• Shelley is from Virginia, which is on the East Coast of the United States.
• The Pacific Ocean is on the West Coast.
• Shelley is visiting a city with a famous market, which could be anywhere in the world.
• Tuesday is her travel day, so she will likely not be traveling this weekend.

In this paper, we present ROSCOE, a suite of interpretable and fine-grained step-by-step generation evaluation metrics to address the above gaps. Rather than providing one score that evaluates the generated text only as a whole, ROSCOE encapsulates fine-grained metrics under four perspectives: (1) semantic alignment measures to what extent the generated reasoning is coherent and grounded in the source context; (2) logical inference evaluates whether the generated reasoning steps are internally consistent and checks for logical fallacies; (3) semantic similarity quantifies the degree of similarity between the generated reasoning and the context, or between intermediate steps, to capture hallucinations or repetitions; and (4) language coherence evaluates whether the whole chain flows naturally.

To evaluate ROSCOE against existing metrics, we devise a taxonomy of reasoning errors for multi-step generations and use it to create synthetic data and collect human evaluations on commonly used reasoning datasets. Our taxonomy and annotated datasets help us gain deeper insights into the causes of reasoning inconsistencies and weaknesses of LLMs. We evaluate ROSCOE with 18 fine-grained metrics under the above four perspectives. ROSCOE demonstrates performance gains against baseline evaluation metrics on all tasks that require reasoning over context. Additional sensitivity analysis shows that ROSCOE is more robust when dealing with tasks that require logical and arithmetic reasoning.
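To make the first and third perspectives concrete, consider scoring each reasoning step against the source context (grounding) and against earlier steps (repetition). The sketch below is illustrative only: ROSCOE's actual metrics use learned sentence embeddings, whereas this toy stand-in uses bag-of-words cosine similarity so the example is self-contained; the function names are ours, not from the paper's released code.

```python
# Toy sketch of ROSCOE-style step-level scoring (illustrative only;
# the real metrics use learned sentence embeddings, not bag-of-words).
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def grounding(steps, source_sents):
    """Semantic alignment: each step's best match against the source.
    Low values suggest a step is ungrounded (possible hallucination)."""
    return [max(cosine(s, src) for src in source_sents) for s in steps]

def repetition(steps):
    """Step-to-step similarity: high values flag repeated steps."""
    scores = []
    for i, s in enumerate(steps):
        prev = [cosine(s, t) for t in steps[:i]]
        scores.append(max(prev) if prev else 0.0)
    return scores

source = ["Shelley is from Virginia.",
          "She is visiting the city with the famous fish market."]
steps = ["Shelley is from Virginia, on the East Coast.",
         "Shelley is from Virginia, on the East Coast.",  # repeated step
         "The Pacific Ocean is on the West Coast."]

align = grounding(steps, source)
rep = repetition(steps)
print(align[0] > align[2])  # step 1 is better grounded than step 3
print(rep[1] > 0.99)        # the duplicated step is flagged
```

Because each score is attached to an individual step rather than the whole chain, a low grounding score or a high repetition score points to the specific step at fault, which is the interpretability property the fine-grained metrics aim for.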

Contributions.

(1) We propose a new taxonomy of reasoning errors and use it to collect human annotations and create synthetic datasets. (2) Using our taxonomy, we propose a new suite of metrics that focus on sequence- and step-level analysis of step-by-step reasoning. (3) We present an extensive comparative analysis on 11 datasets of varied complex reasoning problems, demonstrating the strengths of each metric, especially its interpretability relative to baselines, and considerations for its use.

2. RELATED WORK

Evaluating Explanations. Free-form natural language (NL) explanations of model decisions should enable accurate representation of the reasoning process and its degree of plausibility (Danilevsky et al., 2020; Jacovi & Goldberg, 2021; Jacovi et al., 2021). A qualitative assessment of NL explanations with correctness labels collected from human judges was presented in Camburu et al. (2018). Recent work has also investigated automatic metrics for natural language generation (NLG) evaluation, including word overlap or embedding-based similarity with human-written explanations (Clinciu et al., 2021). Though fast and cost-effective, automatic metrics for NLG are not equipped to measure logical inconsistencies or the information gain across reasoning steps (Reiter, 2019; Celikyilmaz et al., 2020). Explanations have also been evaluated by collecting datasets and running correlation analysis to investigate the degree to which an automatic metric correlates with human judgements of clarity, relevance, and informativeness (Leiter et al., 2022; Welleck et al., 2022). Although reliable, human evaluation is an expensive, domain-specific, and time-consuming process. In comparison, ROSCOE provides generic automatic evaluation procedures that are domain- and task-agnostic.

Automatic Metrics. Many NLG evaluation metrics exist in the literature, including ones based on: n-gram match (Lin, 2004), regression (Sellam et al., 2020), embedding proximity (Zhang et al., 2020), paraphrasing (Thompson & Post, 2020), generation as an evaluator (Yuan et al., 2021), and information alignment (Deng et al., 2021), among others. Although these metrics are easy to use, they evaluate the alignment of two texts as a whole and are not designed to assess individual reasoning steps. The closest metrics to ours are CTC (Deng



¹ Code can be found at https://github.com/facebookresearch/ParlAI/tree/main/projects/roscoe. Annotated datasets can be downloaded from https://dl.fbaipublicfiles.com/parlai/projects/roscoe/annotations.zip.

