ROSCOE: A SUITE OF METRICS FOR SCORING STEP-BY-STEP REASONING

Abstract

Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers (Nye et al., 2021; Wei et al., 2022). These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independently of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end-task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets, covering a diverse set of tasks that require reasoning skills, and show that ROSCOE consistently outperforms baseline metrics.

1. INTRODUCTION

Scaling language models has improved state-of-the-art performance on nearly every NLP benchmark (Brown et al., 2020), with large language models (LLMs) performing impressively as few-shot learners (Brown et al., 2020). Despite these achievements, even the largest of these models still struggle with tasks such as math word problems (Hendrycks et al., 2021), symbolic manipulation (Rytting & Wingate, 2021), and commonsense reasoning (West et al., 2022). Recent work has shown that prompting (Wei et al., 2022; Wang et al., 2022) or fine-tuning (Lampinen et al., 2022) LLMs to generate step-by-step rationales can lead to improvements on reasoning tasks. Some of these works include small-scale analyses of specific error types within step-by-step rationales (Lewkowycz et al., 2022; Chowdhery et al., 2022), as shown in Table 1. However, existing work primarily focuses on end-task performance. Although text generation evaluation metrics sometimes offer fine-grained quality evaluations (e.g., adequacy, fluency) against human scores (Opitz & Frank, 2021; Leiter et al., 2022), these metrics generally treat the output as a whole, and many of them operate on tasks such as summarization or machine translation rather than reasoning.



Code can be found at https://github.com/facebookresearch/ParlAI/tree/main/projects/roscoe. Annotated datasets can be downloaded from https://dl.fbaipublicfiles.com/parlai/projects/roscoe/annotations.zip.

