ROSCOE: A SUITE OF METRICS FOR SCORING STEP-BY-STEP REASONING

Abstract

Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers (Nye et al., 2021; Wei et al., 2022). These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independently of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end-task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human annotated and six programmatically perturbed diagnostics datasets, covering a diverse set of tasks that require reasoning skills, and show that ROSCOE can consistently outperform baseline metrics.

1. INTRODUCTION

Scaling language models has improved state-of-the-art performance on nearly every NLP benchmark (Brown et al., 2020), with large language models (LLMs) performing impressively as few-shot learners (Brown et al., 2020). Despite these achievements, even the largest of these models still struggle with tasks including math word problems (Hendrycks et al., 2021), symbolic manipulation (Rytting & Wingate, 2021), and commonsense reasoning (West et al., 2022). Recent work has shown that prompting (Wei et al., 2022; Wang et al., 2022) or fine-tuning (Lampinen et al., 2022) LLMs to generate step-by-step rationales can lead to improvements on reasoning tasks. Some of this work includes small-scale analysis of specific error types within step-by-step rationales (Lewkowycz et al., 2022; Chowdhery et al., 2022), as shown in Table 1. However, existing works primarily focus on end-task performance. Although text generation evaluation metrics sometimes offer fine-grained quality evaluations (e.g., adequacy, fluency) against human scores (Opitz & Frank, 2021; Leiter et al., 2022), these metrics generally treat the output as a whole, and many of them operate on tasks such as summarization or machine translation rather than reasoning.

Table 1: Step-by-step reasoning for a logical inference task generated from 2-shot GPT-3 (text-davinci-002) (Brown et al., 2020). Error types are highlighted: logical or factual issues, partially correct, somewhat-relevant. Full 2-shot exemplars are in Appendix B.

Context: Shelley is from Virginia, but is visiting that city with that famous market where they throw the fish. Going home next Tuesday!
Question: Is it likely that Shelly will be near the Pacific Ocean this weekend?
Step-by-step reasoning:
• No, it is not likely that Shelley will be near the Pacific Ocean this weekend.
• Shelley is from Virginia, which is on the East Coast of the United States.
• The Pacific Ocean is on the West Coast.
• Shelley is visiting a city with a famous market, which could be anywhere in the world.
• Tuesday is her travel day, so she will likely not be traveling this weekend.

In this paper, we present ROSCOE, a suite of interpretable and fine-grained step-by-step generation evaluation metrics that addresses the above gaps. Rather than providing one score that evaluates the generated text only as a whole, ROSCOE encapsulates fine-grained metrics under four perspectives: (1) semantic alignment measures to what extent the generated reasoning is coherent and grounded in the source context; (2) logical inference evaluates whether the generated reasoning steps are internally consistent and checks for logical fallacies; (3) semantic similarity quantifies the degree of similarity between the generated reasoning and the context, or between intermediate steps, to capture hallucinations or repetitions; and (4) language coherence evaluates whether the whole chain flows naturally. To evaluate ROSCOE against existing metrics, we devise a taxonomy of reasoning errors for multi-step generations and use it to create synthetic data and collect human evaluations on commonly used reasoning datasets. Our taxonomy and annotated datasets help us gain deeper insights into the causes of reasoning inconsistencies and the weaknesses of LLMs. We evaluate ROSCOE with 18 fine-grained metrics under the above four perspectives. ROSCOE demonstrates performance gains against baseline evaluation metrics on all tasks that require reasoning over context. Additional sensitivity analysis shows that ROSCOE is more robust on tasks that require logical and arithmetic reasoning.

Contributions.

(1) We propose a new taxonomy for reasoning errors, and use it for collecting human annotations and creating synthetic datasets. (2) Using our taxonomy, we propose a new suite of metrics that focus on sequence and step level analysis of step-by-step reasoning. (3) We present extensive comparative analysis on 11 datasets of varied complex reasoning problems demonstrating the strengths of each metric, especially in terms of interpretability relative to baselines, and considerations for use.

2. RELATED WORK

Evaluating Explanations. Free-form natural language (NL) explanations of model decisions should enable accurate representation of the reasoning process and its degree of plausibility (Danilevsky et al., 2020; Jacovi & Goldberg, 2021; Jacovi et al., 2021). A qualitative assessment of NL explanations with correctness labels collected from human judges was presented in Camburu et al. (2018). Recent work has also investigated automatic metrics for natural language generation (NLG) evaluation, including word overlap or embedding-based similarity with human-written explanations (Clinciu et al., 2021). Though fast and cost-effective, automatic metrics for NLG are not equipped to measure logical inconsistencies or information gain within thinking steps (Reiter, 2019; Celikyilmaz et al., 2020). Explanations have also been evaluated by collecting datasets and running correlation analysis to investigate the degree to which an automatic metric correlates with human judgements of clarity, relevance, and informativeness (Leiter et al., 2022; Welleck et al., 2022). Although reliable, human evaluation is an expensive, domain-specific, and time-consuming process. In comparison, ROSCOE provides generic automatic evaluation procedures that are domain- and task-agnostic.

Automatic Metrics. Many NLG evaluation metrics exist in the literature, including ones based on: n-gram match (Lin, 2004), regression (Sellam et al., 2020), embedding proximity (Zhang et al., 2020), paraphrasing (Thompson & Post, 2020), generation as an evaluator (Yuan et al., 2021), and information alignment (Deng et al., 2021), among others. Although these metrics are easy to use, they evaluate the alignment of two texts as a whole and are not designed to assess individual reasoning steps. The closest metrics to ours are CTC (Deng et al., 2021) and BARTScore (Yuan et al., 2021), as both introduce a set of interpretable metrics to evaluate the similarity between two texts. However, ROSCOE is unique in providing fine-grained interpretations of reasoning steps, determining contradictions, and identifying ordering issues in the reasoning narrative.

Self-Consistency with LLMs. Recent work on improving LLM performance on complex reasoning tasks uses an ensemble strategy called self-consistency (Wang et al., 2022). This method samples a diverse set of reasoning paths from a language model via reasoning-trace prompting and returns the most consistent final answer in the set. Other work evaluates the diversity of a reasoning path (Li et al., 2022) or the consistency of an inference step (Creswell et al., 2022), or finetunes LLMs (Zelikman et al., 2022), to improve on difficult NLP tasks. In contrast to these works, we present a suite of metrics that focus on determining the type of error (e.g., commonsense or logical inconsistency) in a reasoning path, if one exists.

Table 2: Definitions of reasoning error types.

Error Type | Definition
Grammar | Faulty, unconventional, or controversial grammar usage.
Factuality | Information about an object (i.e., quantity, characteristics) or a named entity does not match the input context.
Hallucination | Information is not provided in the problem statement and is irrelevant or wrong.
Redundancy | Explanation contains redundant information which, even though it might be factual, is not required to answer the question.
Repetition | Step paraphrases information already mentioned in previous reasoning steps.
Missing step | The content of the generated reasoning is incomplete and lacks required information to produce the correct answer.
Coherency | Steps contradict each other or do not follow a cohesive story.
Commonsense | Model lacks relations that should be known from general world knowledge (e.g., "all ducks are birds").
Arithmetic | Error in math calculations.

3. REASONING ERROR TAXONOMY AND DATASETS CONSTRUCTION

Problem Formulation. Our goal is to score step-by-step rationales generated by a language model. We assume that the model is given a source context s = {s_1, ..., s_T} of T sentences indicating a problem statement followed by a question, and is prompted to generate step-by-step reasoning (Nye et al., 2021). We refer to this as a hypothesis h = {h_1, ..., h_N} of N steps, including the final answer as the last step. We do not assume availability of gold step-by-step reasoning references r = {r_1, ..., r_K} of K steps. Taxonomy. We propose a new taxonomy of generic reasoning errors for language problem solving. We first conduct a preliminary manual analysis of the different types of LLM reasoning errors using the five human judged datasets described below. Based on our analysis, we identified nine error types centered on the overall reasoning chain (i.e., the quality of the step-by-step thinking, including consistency with the context and commonsense reasoning). Our taxonomy also includes fine-grained errors marking inconsistency of a reasoning step with the previous steps, whether each step contributes to the final decision, and overall logical inference or fluency issues. The definitions of the error types are in Table 2, and Table 10 provides examples. Datasets and Annotations. To evaluate ROSCOE, we select datasets covering a diverse set of tasks that require reasoning skills (e.g., logical, arithmetic, and commonsense reasoning tasks). We separate these datasets into two categories: (1) diagnostics datasets that contain gold-standard step-wise reasoning chains, where we synthetically perturb some of the reasoning steps to introduce different generation errors (e.g., missing step, mathematical error, etc.); (2) human judged datasets with model-generated step-by-step reasoning outputs, where the reasoning error evaluations are solicited from expert judges. We investigate these in §5.

4. REASONING SCORER: ROSCOE

We present our fine-grained metrics under four perspectives: semantic alignment, semantic similarity, logical inference, and language coherence. Each metric is bounded within [0, 1], where 1 indicates a perfect score and 0 corresponds to failure. A metric is reference-free (unsupervised) when it uses only the source and hypothesis (h → s), and reference-based (supervised) when evaluated between the hypothesis and a reference (h → r).

4.1. SEMANTIC ALIGNMENT METRICS (ROSCOE-SA)

At the core of the ROSCOE semantic alignment metrics is the reasoning alignment vector from the N-step hypothesis h to the source s of length T: r-align(h → s) = {α_1, α_2, ..., α_N}, where each alignment value α_i = r-align(h_i → s) = [1 + max_{j=1..T} cos(h_i, s_j)] / 2 ∈ [0, 1] is the normalized cosine similarity between a hypothesis step and the most similar sentence in the context, and explicitly measures the grounding of the step-wise reasoning with respect to the source text (illustrated in App. D, Fig. 3). We estimate the alignment vector r-align(h → s) by matching the source text and the reasoning chain on the embeddings of tokens and of individual reasoning steps. A similar information alignment score is introduced in CTC (Deng et al., 2021) to measure the confidence that the information in the j-th source document token s_j is grounded by a hypothesis token. Our reasoning alignment is different in that we measure whether a hypothesized reasoning step h_i supports the source context s. Our proposed metrics are summarized in Table 3.

Faithfulness-Step (h → s)

This step-level score is based on the alignment from the hypothesis steps to the source sentences, and is calculated as the mean reasoning alignment score over the steps of reasoning (see illustration in Appendix D, Figure 3): (1/N) Σ_{i=1..N} r-align(h_i → s). Faithfulness measures whether the model misinterpreted the problem statement, or whether the reasoning chain is too vague, irrelevant, or misuses information.
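As a concrete illustration, the alignment value and Faithfulness-Step can be sketched as below. This is a minimal sketch: the tiny numpy vectors are toy stand-ins for the SimCSE step embeddings the paper actually uses, and the function names are ours, not the paper's.

```python
import numpy as np

def cos(a, b):
    # Plain cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def r_align(h_i, source):
    # alpha_i = [1 + max_j cos(h_i, s_j)] / 2: normalized best-match
    # similarity of one hypothesis step against all source sentences.
    return (1.0 + max(cos(h_i, s_j) for s_j in source)) / 2.0

def faithfulness_step(hyp, source):
    # Mean reasoning-alignment score over all hypothesis steps.
    return sum(r_align(h_i, source) for h_i in hyp) / len(hyp)

# Toy 3-d "embeddings" standing in for real step embeddings.
source = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
hyp_good = [np.array([1.0, 0.0, 0.0])]   # identical to a source sentence
hyp_bad = [np.array([-1.0, 0.0, 0.0])]   # points away from the source

print(faithfulness_step(hyp_good, source))  # 1.0: perfectly grounded
print(faithfulness_step(hyp_bad, source))   # 0.5: poorly grounded
```

With real embeddings the scores fall between these extremes, but the ordering (grounded chains score higher) is what the metric relies on.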

Faithfulness-Token (h → s)

We extend the step-level alignment of Faithfulness-Step by also measuring similarities between token embeddings: (1/(N + M)) Σ_{i=1..N} [r-align(h_i → s) + Σ_{j=1..M_i} r-align_token(h_{i,j} → s)], as shown in App. D, Fig. 3. Here M_i is the number of tokens in step h_i, M = Σ_{i=1..N} M_i is the total number of tokens in the reasoning chain, h_{i,j} is the j-th token in the i-th step, and r-align_token is the alignment from the tokens in step h_i to all tokens in s.
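A minimal sketch of the token-extended score, again with toy vectors in place of real sentence and token embeddings (names are illustrative, not the paper's code):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_align(x, pool):
    # Normalized best-match cosine similarity of x against a pool, in [0, 1].
    return (1.0 + max(cos(x, p) for p in pool)) / 2.0

def faithfulness_token(hyp_steps, hyp_tokens, src_sents, src_tokens):
    # (1/(N+M)) * [sum_i r-align(h_i -> s) + sum_{i,j} r-align_token(h_ij -> s)]
    # hyp_steps: N step embeddings; hyp_tokens: per-step lists of token embeddings.
    N = len(hyp_steps)
    M = sum(len(toks) for toks in hyp_tokens)
    step_term = sum(best_align(h, src_sents) for h in hyp_steps)
    token_term = sum(best_align(t, src_tokens)
                     for toks in hyp_tokens for t in toks)
    return (step_term + token_term) / (N + M)

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
score = faithfulness_token([e1], [[e1, e2]], [e1], [e1, e2])
print(score)  # 1.0: every step and every token matches something in the source
```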

Informativeness-Step (Info-Step) (h ↔ s)

Measures how well information present in the source is used in the reasoning steps: [(1/T) Σ_{t=1..T} r-align(s_t → h) + (1/N) Σ_{i=1..N} r-align(h_i → s)] / 2. Info-Step gives a higher score to reasoning steps that are well grounded with respect to the source, and identifies the degree of information from the source that is covered by the generated hypothesis. A lower Info-Step score corresponds to reasoning steps that are not related to the source sentences or that miss information provided in the context.
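Assuming precomputed step and sentence embeddings, the two directions of Info-Step can be sketched as follows (toy vectors stand in for real embeddings):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def r_align(x, pool):
    return (1.0 + max(cos(x, p) for p in pool)) / 2.0

def info_step(hyp, source):
    # Average of two directions: how much of the source is covered by the
    # hypothesis (s -> h) and how well the hypothesis is grounded in the
    # source (h -> s).
    coverage = sum(r_align(s_t, hyp) for s_t in source) / len(source)
    grounding = sum(r_align(h_i, source) for h_i in hyp) / len(hyp)
    return (coverage + grounding) / 2.0

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
full = info_step([e1, e2], [e1, e2])  # hypothesis covers all source info
partial = info_step([e1], [e1, e2])   # second source sentence is unused
print(full, partial)  # 1.0 0.875
```

The asymmetry matters: a hypothesis that ignores part of the context loses coverage even if every step it does produce is well grounded.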

Repetition-Token (h_i → h_j)

To identify repeated or paraphrased steps, we look at the token alignment scores between all steps in the hypothesis chain: 1 - max_{i=2..N} max_{j=1..i-1} [(1/M_i) Σ_{l=1..M_i} r-align_token(h_{i,l} → h_j)]. For each pair of steps, we take the mean token alignment and find the pair that maximizes this alignment score. In other words, Repetition-Token punishes chains in which at least two steps have high overlap in token embeddings.
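The pairwise search above can be sketched as follows, with each step represented by a toy list of token embeddings (a sketch, not the paper's implementation):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tok_align(tok, step_tokens):
    # Normalized best-match similarity of one token against another step's tokens.
    return (1.0 + max(cos(tok, t) for t in step_tokens)) / 2.0

def repetition_token(steps):
    # 1 - max over step pairs (i > j) of the mean token alignment of step i
    # against step j; near-duplicate steps drive the score toward 0.
    # steps: list of steps, each a list of token embeddings (len(steps) >= 2).
    worst = max(
        sum(tok_align(tok, steps[j]) for tok in steps[i]) / len(steps[i])
        for i in range(1, len(steps))
        for j in range(i)
    )
    return 1.0 - worst

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(repetition_token([[e1], [e1]]))  # 0.0: second step repeats the first
print(repetition_token([[e1], [e2]]))  # 0.5: orthogonal steps, no repetition
```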

Hallucination (h → (s, r))

To find irrelevant reasoning steps, we use the alignment score to identify steps that are both unrelated to the context and absent from the reference chain (to avoid punishing possibly relevant commonsense knowledge): 1 - max_{i=1..N} ([1 - r-align(h → s)] • [1 - r-align(h → r)])_i. Here, 1 is an all-ones vector, (•) is the element-wise product, and the maximum is taken over the entries of the resulting vector.
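The "far from both source and reference" logic can be sketched as below; the element-wise product means a step only lowers the score when both of its alignments are poor (toy embeddings, illustrative names):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def r_align_vec(hyp, pool):
    # One normalized best-match alignment score per hypothesis step.
    return np.array([(1.0 + max(cos(h, p) for p in pool)) / 2.0 for h in hyp])

def hallucination(hyp, source, reference):
    # 1 - max_i (1 - align(h_i -> s)) * (1 - align(h_i -> r)):
    # a step hurts the score only when it is poorly aligned with BOTH the
    # source and the reference, so relevant commonsense is not punished.
    a_s = r_align_vec(hyp, source)
    a_r = r_align_vec(hyp, reference)
    return 1.0 - float(np.max((1.0 - a_s) * (1.0 - a_r)))

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
off = -e1  # a step pointing away from everything in source and reference
print(hallucination([e1], [e1], [e2]))   # 1.0: grounded in the source
print(hallucination([off], [e1], [e1]))  # 0.0: aligned with neither
```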

Redundancy (h → r)

To find chains that contain information that is not required to solve the problem (i.e., redundant steps), we identify the hypothesis steps that are least aligned with the reference steps: min_{i=1..N} r-align(h_i → r). This score punishes chains with steps that are not required for the correct solution.
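The min-alignment pattern is simple to sketch (toy embeddings; the same pattern, with hypothesis and reference roles swapped, underlies the Missing Step score later in this section):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def r_align(x, pool):
    return (1.0 + max(cos(x, p) for p in pool)) / 2.0

def redundancy(hyp, reference):
    # min_i r-align(h_i -> r): the hypothesis step least supported by the
    # reference chain; low values flag steps the correct solution never needed.
    return min(r_align(h_i, reference) for h_i in hyp)

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(redundancy([e1], [e1, e2]))  # 1.0: every step matches a reference step
print(redundancy([e1, e2], [e1]))  # 0.5: e2 has no counterpart in the reference
```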

Semantic Coverage-Step ((r, h) → s)

This score can be viewed as a measure of how easily a gold reference could be generated by the hypothesis. It compares the step-level grounding of the hypothesis with respect to the source against the grounding of the gold reference: |(1/K) Σ_{t=1..K} r-align(r_t → s) - (1/N) Σ_{i=1..N} r-align(h_i → s)|, where |·| indicates absolute value.

Reasoning Alignment (h → r)

The most straightforward way to evaluate the correctness of the hypothesis chain is to compare the degree of overlap between the hypothesis and the reference. One way of doing so is to measure the reasoning alignment between them: (1/N) Σ_{i=1..N} r-align(h_i → r).

Commonsense (r → (h, s))

Measures whether the hypothesis lacks steps that are not stated in the source but are required to solve the problem, such as general world knowledge (e.g., "velocity is distance divided by time", "1 foot is 12 inches", "all ducks are birds"). We detect such information by extracting steps in the reference reasoning that are not grounded in the hypothesis or the source text: 1 - max_{i=1..K} ([1 - r-align(r → h)] • [1 - r-align(r → s)])_i.
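A sketch of the two reference-based quantities just defined, under the reading given in the text (note that, as written, Semantic Coverage-Step is a gap, so 0 means the hypothesis and reference are equally well grounded; toy embeddings and illustrative names throughout):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def r_align(x, pool):
    return (1.0 + max(cos(x, p) for p in pool)) / 2.0

def coverage_step(hyp, reference, source):
    # Absolute gap between the mean source-grounding of the reference chain
    # and that of the hypothesis chain; 0 means equally well grounded.
    ref_ground = sum(r_align(r_t, source) for r_t in reference) / len(reference)
    hyp_ground = sum(r_align(h_i, source) for h_i in hyp) / len(hyp)
    return abs(ref_ground - hyp_ground)

def reasoning_alignment(hyp, reference):
    # Mean best-match alignment of hypothesis steps against the reference.
    return sum(r_align(h_i, reference) for h_i in hyp) / len(hyp)

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(coverage_step([e1], [e1], [e1, e2]))      # 0.0: identical grounding
print(reasoning_alignment([e1, e2], [e1, e2]))  # 1.0: hypothesis matches r
```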

Missing Step (r→h)

To identify steps that are missing from the hypothesis but could be required to solve the problem, we look at the alignment between the reference and the hypothesis, similar to Redundancy. However, here we go through each step in the reference and check whether there is a similar step in the hypothesis: min_{i=1..K} r-align(r_i → h).

5. EXPERIMENTAL SETUP

Diagnostics Datasets. We construct our first category of labeled datasets by generating perturbations, i.e., deterministic modifications, on half of the reference reasoning steps, and assign binary labels based on whether or not a chain has been perturbed. We select seven language understanding and entailment datasets that require complex problem-solving skills and have reference step-by-step explanations: EntailmentBank (deductive reasoning) (Dalvi et al., 2021); ProofWriter (logical reasoning) (Tafjord et al., 2021); three arithmetic reasoning datasets, MATH (Hendrycks et al., 2021), ASDIV (Miao et al., 2020), and AQUA (Liang et al., 2018); EQASC (explanations for commonsense question answering) (Aggarwal et al., 2021); and StrategyQA (question answering with implicit reasoning strategies) (Geva et al., 2021) (see dataset details in App. E.1). Using our taxonomy, we introduce 12 error perturbation rules and apply them to these datasets to construct our diagnostics datasets (see details in App. E.3).

Human Judged Datasets. We select our second category of datasets from commonly used complex reasoning tasks: GSM8K (arithmetic reasoning) (Cobbe et al., 2021), DROP (discrete reasoning) (Dua et al., 2019), ESNLI (deductive and commonsense reasoning) (Camburu et al., 2018), COSMOS-QA (commonsense reasoning) (Huang et al., 2019), and SemEVAL (commonsense reasoning) (Ostermann et al., 2018).

ROSCOE Training. To obtain reasoning step embeddings, we finetune SimCSE (Gao et al., 2021), a supervised sentence similarity model extending the RoBERTa word embedding model (Liu et al., 2019), on the multi-step reasoning datasets listed above (see details in Table 11). SimCSE is a contrastive learning model that is trained on triplets of reference reasoning steps and positive and hard-negative hypothesis reasoning steps to minimize a cross-entropy objective with in-batch negatives.
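The perturbation-based construction, which supplies both diagnostic labels and hard negatives for finetuning, can be sketched as below. The rule names and the two-rule set here are illustrative examples in the spirit of the taxonomy, not the paper's exact 12 rules:

```python
import random

def repeat_step(chain, rng):
    # Duplicate one reasoning step (a "Repetition"-type error).
    i = rng.randrange(len(chain))
    return chain[:i + 1] + [chain[i]] + chain[i + 1:]

def remove_step(chain, rng):
    # Drop one reasoning step (a "Missing step"-type error).
    i = rng.randrange(len(chain))
    return chain[:i] + chain[i + 1:]

def make_diagnostic(chain, rng):
    # Perturb a chain with probability 0.5 and return (chain, label),
    # where label=1 marks a perturbed chain.
    if rng.random() < 0.5:
        rule = rng.choice([repeat_step, remove_step])
        return rule(chain, rng), 1
    return list(chain), 0

rng = random.Random(0)
gold = ["step 1", "step 2", "step 3"]
examples = [make_diagnostic(gold, rng) for _ in range(6)]
```

The binary labels produced this way are exactly what the Somers' D meta-evaluation below correlates against.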
For contrastive learning, we use the context and reference reasoning steps as a positive sample (s, r), and the context and perturbed reference steps (s, h) as hard-negative pairs. For finetuning, we embed the source context and the hypothesis chain as a whole, without splitting it into steps. With the finetuned model, we embed each individual step, as well as the reasoning chain as a whole. We use the pretrained checkpoint of the supervised SimCSE model sup-simcse-roberta-base to initialize our model, and further train it for five epochs on our synthetic training data (details in App. G). We also compare ROSCOE scores calculated with the off-the-shelf sup-simcse-roberta-base SimCSE model and the all-mpnet-base-v2 sentence embedding model (Reimers & Gurevych, 2019) to understand the metrics' sensitivity to the embedding method.

Baseline Metrics. We use text generation evaluation metrics as baselines and comprehensively examine the ones outlined in §2: n-gram match based metrics, including ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004); pretrained scores, including BLEURT (Sellam et al., 2020), PRISM (Thompson & Post, 2020), BERTScore (Zhang et al., 2020), BARTScore using the Faithfulness (s → h) direction for factuality and relevance, and its finetuned variant BARTScore+CNN+Para (BARTScore+) (Yuan et al., 2021); and the information alignment metrics CTC-Relevancy and CTC-Consistency (Deng et al., 2021). We also include BARTScore-P, which we obtain by finetuning BART (Lewis et al., 2020) on the same reasoning datasets we use for finetuning our SimCSE embedding models. Most of our ROSCOE metrics are constructed reference-free. We also have metrics that use reference reasoning steps, which we examine against human judgements. We use the official code for each metric.

Meta Evaluation. We use Somers' D (Somers, 1962), which measures the ordinal association between two measured quantities, to meta-evaluate each scorer against synthetic and human scores.
We prefer Somers' D over the more commonly used Kendall's τ or Kendall's τ-b because it better handles ties of a biased random variable (Agresti, 2010, Section 7.1.5), which impose an upper bound on the possible values Kendall's τ(-b) can take. For each score Y considered, our correlations are built against the biased random variable X ∈ {0, 1}, represented by the perturbation or error presence indicator, and evaluated using D(Y|X) = τ(X, Y)/τ(X, X).
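The ratio D(Y|X) = τ(X, Y)/τ(X, X) reduces to counting concordant and discordant pairs, normalized only by pairs untied on X. A minimal pair-counting sketch (O(n²), fine for illustration):

```python
def somers_d(x, y):
    # Somers' D(Y|X) = tau(X, Y) / tau(X, X): concordant minus discordant
    # pairs, normalized by the number of pairs NOT tied on the conditioning
    # variable X (here, the binary perturbation indicator).
    n = len(x)
    num = den = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = (x[i] > x[j]) - (x[i] < x[j])  # sign of x_i - x_j
            dy = (y[i] > y[j]) - (y[i] < y[j])
            num += dx * dy
            den += dx * dx                       # 1 iff the pair is untied on x
    return num / den

# Perturbed chains (x=1) consistently get lower scores than clean ones (x=0):
print(somers_d([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1]))  # -1.0: perfect inverse association
print(somers_d([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))  # 0.0: score carries no signal
```

Because ties on X are excluded from the denominator, a heavily imbalanced perturbation indicator can still reach |D| = 1, which is exactly the property the paper cites for preferring it over Kendall's τ(-b).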

6. EXPERIMENTAL RESULTS

Controlled Experiments with Diagnostics Datasets. (App. Table 12), where a step is represented by an equation with one of the arithmetic perturbations added. We hypothesize that including these patterns in finetuning helped the model better learn relationships between context and equations, and resulted in higher scores. On the EQASC dataset, the Repetition-* scores are able to catch all duplicated steps in a chain, i.e., we can separate perturbed and non-perturbed chains based on a given threshold value for the Repetition-* scores, and achieve perfect correlation scores (App. Table 20). To understand whether finetuning actually helps to improve scoring, we compare non-aggregated metrics (see details in App. Table 18). We observe that finetuning indeed helps to improve ROSCOE: on average across datasets, all correlations except the Repetition-* scores improve (by up to 0.556 on Informativeness-Chain), with mean Repetition-Token not changing and mean Repetition-Step degrading by 0.005. We speculate that since we finetune the model using the reasoning chain and context as a whole, it helps to better capture step-by-step rationales, while possibly degrading on word- and sentence-level semantics.

7. ANALYSIS

How sensitive are the ROSCOE metrics to the level of errors? To evaluate how well metric values match human assessment of reasoning, we measure sensitivity to the level of errors. We perturb sentences in the MATH (arithmetic) and EntailmentBank (deductive reasoning) diagnostic datasets (similar to §5) and inject different levels of errors into the reasoning text. Using randomly selected perturbation types, we construct up to a maximum of 3 perturbations per instance. We measure the correlation (Somers' D) between the reasoning inconsistency level 1, 2, 3 of the reasoning steps (i.e., the number of injected errors) and the metric score. Fig. 1 illustrates the results averaged over different perturbations. We expect the metrics to correlate better with humans when the level of errors is high. Both the semantic alignment metrics ROSCOE-SA and the semantic similarity metrics ROSCOE-SS show consistent behavior on both datasets, while the baseline metrics fluctuate with low correlations. Baseline metrics perform better on EntailmentBank. On MATH, ROSCOE-LC and the baseline metrics show minimal impact, likely because some of the perturbations applied to the MATH dataset (e.g., RandomOperation or ShuffleNumbers) are harder to detect with language-model-based (BARTScore) and NLI-model-based (ROSCOE-LC) metrics.

What does ROSCOE illuminate about scores across errors and tasks? For an ideal, easy-to-use scorer, it would be possible to pick a set of fixed thresholds with error discrimination power across datasets. However, we show that this dataset-agnostic ideal is currently not attainable, an issue endemic across scores, including baselines. We study which metrics correlate strongly with which perturbations, with a focus on consistency across datasets. From this, we plot the interquartile ranges for strongly correlated metric and perturbation pairs. We show a sample of these in Fig. 2, though we find that the trends generally hold across metrics and perturbations (see Fig. 6). We note that within a given dataset, scores are well separated: the perturbed version of a dataset for a given score and perturbation type shows little interquartile overlap with the original version. However, this does not hold across datasets. For example, in (Score: Info-Chain, Perturbation: Repetition), if one were to set a detection threshold for the Repetition perturbation based on EntBank (around 0.95), it would mark almost all values of EQASC as perturbed, even non-perturbed samples. This shows the challenge of using these metrics for classification without calibrating for drifts in both mean and variance across datasets, even when a metric generally correlates well with detecting a given error.

8. CONCLUSION

In this paper, we introduce ROSCOE, a new suite of interpretable, unsupervised metrics that enables evaluation of the step-by-step reasoning generations of LMs even when no gold reference generation exists. We present a taxonomy of reasoning errors used to generate and evaluate our metrics. Experimental results from evaluations on both synthetic and human-labeled datasets exhibiting multiple types of reasoning (commonsense, arithmetic, logical inference, etc.) demonstrate superior performance compared to prior semantic and lexical similarity-based baseline metrics for text generation. Our analysis shows improved capability in evaluating nuances of reasoning, such as factual and logical errors in step-wise decisions.

ETHICS STATEMENT

Explainability builds transparency and trust for users, eases bug-fixing, shortens improvement cycles for metric designers, and will be required by laws and regulations before AI systems can be applied in large-scale, high-stakes domains. In this context, we hope our work will catalyze efforts on the topic of explainable evaluation metrics for language model rationale generations. We should mention that our evaluation metrics do not monitor the explanations from integrity or bias perspectives. Our work also uses five human expert annotators, who in the annotation process rate the model-generated candidate rationales. While model-generated explanations can contain potentially unsafe content, the datasets for annotation cover domains related to logical and arithmetic concepts and general commonsense knowledge. The anecdotal consensus was that the generations were safe and did not include biased statements.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of our empirical results, we will open-source our code on GitHub, including: instructions for installing the virtual environment, data preprocessing, all score generation and correlation scripts (both for ROSCOE and the baselines), and the trained embedding models. Detailed explanations of all the finetuned models and metrics are given in the main paper as well as in the Appendices. We will also release all the diagnostic and human judgment datasets used in our experiments.

APPENDIX A LIMITATIONS

Our study is a first step toward evaluating the step-by-step reasoning produced by large language models. Our taxonomy (in Table 2) covers several reasoning errors, and we designed our metrics to evaluate a spectrum of criteria, including the ones in the taxonomy. Even though we cannot claim to cover all possible reasoning errors, our metrics are generic, work on natural language rationales, and consider the alignment of the generated explanation with the input context. Nevertheless, we believe our study can spur others to investigate different reasoning errors and to use our code and datasets as templates for further extensions. Due to the extensive analysis needed to thoroughly test and communicate the ability of our proposed metrics to capture reasoning errors, we leave some follow-up questions, such as the application of these metrics to improving downstream task performance, for future exploration.

B FEW-SHOT PROMPTING EXAMPLES (CONT. FROM § 1)

Below is the 2-shot example we used to generate the explanations from GPT-3, as shown in Table 1.

I will explain how to make logical inferences:
Context: John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a plate full of food. 2022 has been a great year.
Question: Can we infer that John was born before the inauguration of Barack Obama?
Explanation: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack Obama was inaugurated in 2009, so it has been about 13 years since Obama was inaugurated. John's children are old enough to speak, so if John were born after Obama's inauguration, he would have been younger than 10 when he had children. So, we can infer that John was born before Obama's inauguration. The answer is "yes".
Context: Sam sees a piano in the subway stations. Laughing, his friends prod him to play, saying "Remember how amazing your rendition of Chopsticks was at the Christmas party?"
Question: Can we infer that John majored in piano at Juilliard?
Explanation: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "Chopsticks" is a very simple song to play on the piano. His friends were therefore likely being sarcastic when they told him to play, which would imply that Sam was not an expert piano player. So it is unlikely that Sam majored in piano at Juilliard. The answer is "no".

Table 9 illustrates a sample of the in-context examples we used as prompts to obtain step-wise explanations.

Table 9: Example prompts used to generate step-by-step explanations using GPT-3.

(Used for ESNLI dataset) Explain if premise entails hypothesis.
Premise: Tomato sauce (also known as Neapolitan sauce, and referred to in Italy as Salsa di pomodoro) refers to any of a very large number of sauces made primarily from tomatoes, usually to be served as part of a dish (rather than as a condiment).
Tomato sauces are common for meat and vegetables, but they are perhaps best known as sauces for pasta dishes.
Hypothesis: Tomato sauce is only rarely used to top pasta.
Explanation: Salsa di pomodoro is made of peppers. The answer is No.
Premise: Kirk Johnson (born 1960) is an American scientist, author, curator, and museum administrator, and is currently serving as Director of Smithsonian's National Museum of Natural History. Dr. Johnson is the host of the PBS Nova series "Making North America", which is a three-part series that describes the shaping of North America, which aired on November 4, 11 and 18, 2015.
Hypothesis: johnson hosted Making North America when he turned 50.
Explanation: Johnson was born in 1960 and hosted Making North America in 2015, so he was 55, not 50. The answer is No.
Premise: A person rides his bicycle in the sand beside the ocean.
Hypothesis: A person is on a beach.
Explanation: A person rides beside the ocean means is on a beach if there is sand beside the ocean. The answer is Yes.

(Used for SEMEVAL dataset) Explain if premise entails hypothesis.
Premise: I went into my bedroom and flipped the light switch. Oh, I see that the ceiling lamp is not turning on. It must be that the light bulb needs replacement. I go through my closet and find a new light bulb that will fit this lamp and I place it in my pocket. I also get my stepladder and I place it under the lamp. I make sure the light switch is in the off position. I climb up the ladder and unscrew the old light bulb. I place the old bulb in my pocket and take out the new one. I then screw in the new bulb. I climb down the stepladder and I place it back into the closet. I then throw out the old bulb into the recycling bin. I go back to my bedroom and turn on the light switch. I am happy to see that there is again light in my room. Which room did the light go out in?
Hypothesis: Bedroom
Label: Correct
Explanation: The answer is bedroom, because the light went out in the bedroom.
The person replaced the light bulb in the bedroom. Premise: The weather was so nice today that I decided to have a barbecue. I called up some of my friends to invite them over. I set up some chairs and tables outside for my friends to sit around. I then took out some hot dogs from my fridge and brought them over to my barbecue. My barbecue is a gas barbecue, so all I had to do was press the knob and turn it to ignite it. Once I got a little fire going I put my hot dogs on the grill over it to cook. I closed the lid and left my hot dogs to cook for about ten minutes. After ten minutes I decided to check on them to see if they were done. They looked like they were cooked, so I put them all on a plate to serve to my friends. I went to my kitchen and got hot dog buns, ketchup, mustard, and relish to serve on top of the hot dogs. I laid out all of the food and condiments on the table so my friends could help themselves. We ate the hot dogs and relaxed around the table. Why are we having the barbecue, to celebrate an occasion, or just because we want to? Hypothesis: Just because they want to.

Label: Correct

Explanation: The answer is correct, because the weather was nice. They decided to have a barbecue. There is no mention of any special occasion.

Premise: Another Nurse can in to look at me , took one look at my rash and said OMG and ran out of the room . At that point I started crying , regretted coming to the hospital alone and started wheezing again . Then the doctor came in , her eyes were all bugged out . What will I do after the doctor examines me ? Hypothesis: I will talk with my friends . Label: Incorrect Explanation: The premise is about the nurse's reaction to the patient's rash, and how the patient feels. The hypothesis does not follow up from this premise correctly. Instead, it talks about talking with friends.

C TAXONOMY OF REASONING ERRORS (CONT. FROM § 3)

To gain deeper insights into the types of reasoning errors introduced by LLMs while explaining their decisions, we propose a new taxonomy of generic reasoning errors for language problem solving. Specifically, we sampled from the training portions of the logical inference and commonsense reasoning datasets, and prompted GPT-3 to produce reasoning explanations using prompts similar to those in App. B. We used task-specific in-domain examples for prompting. We also analyzed model generated explanations shared in Wei et al. (2022). We then manually examined each explanation and identified potential errors, both inconsistencies with the source, question, or prompt, and inconsistencies within the reasoning chain itself. Some tasks require a model to classify the logical relationship between a premise and a hypothesis, while others are question answering tasks. We adjusted our contexts and prompts according to the type of the task. Our reasoning error taxonomy is summarized in Table 10. It contains error types concerning either an overall chain or an individual step. Specifically, the chain-level coarse-grained evaluations of the overall reasoning chain deal with the overall quality of the step-by-step thinking, its coherence, the consistency of the explanation within itself, its consistency with the context, etc. The step-level fine-grained evaluations, on the other hand, focus on the consistency of a reasoning step with the previous steps, whether a step conveys new and supporting information beyond the previous steps, and factuality or logical inference issues. We use these error categories to construct diagnostics datasets with perturbed errors as well as human judged datasets of reasoning errors. In the taxonomy, we use the *-step suffix to differentiate step-level errors from the chain-level error types. ROSCOE metrics are constructed under four categories: semantic alignment, semantic similarity, logical inference, and logical coherence. The details of each metric are explained in §4.
At the core of the ROSCOE semantic alignment metrics is the reasoning alignment score, which we designed to measure the grounding of step-by-step reasoning with respect to the source text. Fig. 3 illustrates the reasoning alignment.

Figure 3: Reasoning alignment illustrating the measurement of the Faithfulness-Step and Faithfulness-Token semantic alignment scores. h = {h_1, h_2} is a hypothesis chain with tokens {h_{1,1}, h_{1,2}, h_{1,3}, h_{2,1}, h_{2,2}}, and s = {s_1, s_2, s_3} is a context with tokens {s_{1,1}, s_{2,1}, s_{2,2}, s_{2,3}, s_{3,1}, s_{3,2}, s_{3,3}}. Alignment scores from hypothesis to context are highlighted, and alignment scores from context to hypothesis are underscored. The reasoning alignment combines token- and step-level similarities, where each alignment value (cell) is a cosine similarity, and explicitly measures the grounding of the token- and step-wise reasoning with respect to the source text.

The baseline scorers share some similarities with ROSCOE, so we describe them here.

BARTScore (Yuan et al., 2021) frames evaluation as a text generation task using a sequence-to-sequence model. It can support different evaluation perspectives such as factuality (by evaluating from source to hypothesis) or informativeness (by evaluating in both directions between reference and hypothesis). BARTScore measures the probability of generating a target text y from a source text x:

BARTScore = ∑_{t=1}^{m} w_t log p(y_t | y_{<t}, x, θ)    (1)

BARTScore introduces two variations: (1) finetuning, in which the BART model is finetuned on a task-specific dataset to bring the pre-training domain closer to the evaluation domain, and (2) prompting, in which a task-specific textual prompt is appended to the source x before generating y. In our experiments we compare the BARTScore baseline and the prompting variant BARTScore+.
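Equation 1 reduces to a weighted sum of per-token log-probabilities. As a minimal sketch, assuming the log-probabilities log p(y_t | y_{<t}, x, θ) have already been obtained from a seq2seq model such as BART, and assuming uniform weights w_t = 1/m as one common choice:

```python
import math

def bart_score(token_logprobs, weights=None):
    """Weighted sum of per-token log-probabilities, as in Eq. (1).

    token_logprobs: list of log p(y_t | y_<t, x, theta) for t = 1..m,
    as produced by a seq2seq model such as BART.
    weights: optional per-token weights w_t (uniform 1/m by default).
    """
    m = len(token_logprobs)
    if weights is None:
        weights = [1.0 / m] * m  # uniform weighting
    return sum(w * lp for w, lp in zip(weights, token_logprobs))

# Example: three target tokens scored with probabilities 0.5, 0.25, 0.125
score = bart_score([math.log(0.5), math.log(0.25), math.log(0.125)])
```

Obtaining the per-token log-probabilities themselves requires a forward pass of the seq2seq model conditioned on the source x; the sketch only covers the aggregation step.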
CTC (Compression, Transduction, and Creation) (Deng et al., 2021) is a suite of metrics that unifies the different perspectives of different tasks (e.g., summarization, style transfer, or text rewriting) into information alignment, which measures whether the information in one generation component is grounded in another. The information alignment is defined as follows: let x (e.g., a dialog context) be the source input, c (e.g., external world knowledge) be some additional context, and y be the generated output text (e.g., a generated response). The alignment is measured at the token level as a vector of scores:

align(a → b) = ⟨α_1, ..., α_N⟩    (2)

where each score α_n indicates the confidence that the n-th token in a aligns with the whole text b. Using information alignment, the authors define a list of metrics to evaluate text for different tasks. In our experiments we use the two of these metrics that are closest to ROSCOE: Relevance (CTC Relevance), which measures the consistency of the generated text with the source, balanced against the reference, and Consistency (CTC Consistency), which captures the faithfulness of the generated text to the input context via the alignment between the two.

E EXPERIMENTAL SETUP DETAILS (CONT. FROM § 5)

E.1 DIAGNOSTIC DATASETS

In the following we present details of each diagnostics dataset used in our work.

EntailmentBank (EntBank) (Dalvi et al., 2021) is a complex question answering dataset which contains multi-step entailment trees, i.e., trees of multi-premise entailment steps that lead from known facts, through intermediate conclusions, to the hypothesis of interest (in this case, the question and answer).

ProofWriter (Tafjord et al., 2021) is a question answering dataset for logical reasoning. It contains 500k questions, answers, and proofs over natural-language rulebases. This dataset is mostly used to emulate reasoning over rules expressed in language, including proof generation. The proofs include intermediate conclusions. In our experiments, we used the depth-0, depth-1, depth-2, depth-3, and depth-5 OWA sets.

MATH (Hendrycks et al., 2021) is a dataset of 12,500 problems from high school math competitions. Given a math problem such as the one in Table 12, models generate a sequence, such as 2/3, that encodes the final answer.

ASDIV (Miao et al., 2020) (Academia Sinica Diverse MWP Dataset) is a dataset of 2,305 diverse math word problems. It includes a diverse set of operations, such as basic arithmetic and aggregative operations (e.g., comparisons, set operations).

AQUA (Liang et al., 2018) is a dataset of 100,000 algebraic word problems with step-wise solutions, as shown in Table 12. In the original dataset each question is decomposed into four parts, two inputs and two outputs: the description of the problem and a question, the possible (multiple choice) answer options, and the correct one. In this work we only used the context and question, the step-wise solution, and the correct answer to construct our diagnostic dataset.

EQASC (Aggarwal et al., 2021) is a multi-hop question answering dataset with 98K explanation annotations for multi-step factual reasoning. Each instance in the dataset comes with a question, multiple answer choices, an explanation of each answer choice, and a free-flow explanation of the whole context. In our experiments we used the correct answer's explanation to construct our diagnostic datasets.

StrategyQA (Geva et al., 2021) is another multi-step question answering (QA) dataset that covers a diverse set of reasoning skills. StrategyQA consists of 2,780 questions, annotated with their decomposition and per-step evidence.

Example instances (from Table 12):

EntailmentBank
Its position appears to shift relative to the horizon.
Step1: earth is a kind of celestial object
Step2: a star is a kind of celestial object / celestial body
Step3: apparent motion is when an object appears to move relative to another object's position
Step4: Therefore apparent motion of stars is when stars appear to move relative to earth's position
Step5: The earth rotating on its axis causes stars to appear to move across the sky at night
Step6: Therefore the earth rotating on its axis causes apparent motion of stars
Step7: Stars appear to move relative to the horizon during the night
Step8: Therefore the earth rotating on its axis causes stars to move relative to the horizon during the night.

ProofWriter
Facts: The cow is not big. The cow is not green. The lion eats the tiger. The lion sees the cow. The lion visits the cow. The lion does not visit the squirrel. The lion visits the tiger. The squirrel is big. The squirrel is round. The tiger is not green. The tiger does not see the cow.
Rules: If something sees the squirrel and the squirrel eats the cow then the cow is round. If something is green then it eats the tiger. If the squirrel is round then the squirrel visits the cow. If something eats the cow then it sees the squirrel. If something sees the tiger and the tiger visits the squirrel then it is nice. If something is round then it eats the cow. If something is kind then it eats the cow. If the tiger visits the cow then the cow sees the squirrel. If something sees the cow then the cow eats the tiger.
Question: The cow does not see the squirrel.

MATH
Context: Tom has a red marble, a green marble, a blue marble, and three identical yellow marbles. Question: How many different groups of two marbles can Tom choose?
Step1: There are two cases here:
Step2: either Tom chooses two yellow marbles (1 result), or he chooses two marbles of different colors ((4 choose 2) = 6 results).
Step3: The total number of distinct pairs of marbles Tom can choose is 1 + 6 = 7.
Answer: 7

StrategyQA
Question: Are more people today related to Genghis Khan than Julius Caesar?
Step1: Julius Caesar had three children.
Step2: Genghis Khan had sixteen children.
Step3: Modern geneticists have determined that out of every 200 men today, one has DNA that can be traced to Genghis Khan.
Answer: True

E.2 HUMAN JUDGED DATASET CONSTRUCTION

In the following we present details of each human judged dataset used in our work. Table 11 lists each dataset and illustrates how it is used in our experiments. Specifically, these datasets are used for evaluations in the experimental results and for model finetuning, and one dataset was used for finetuning only.

To construct these datasets, we first sample instances from each dataset (see the number of instances sampled in Table 11). We use GPT-3 with few-shot in-context examples and a prompt to generate step-by-step reasoning (e.g., "explain step-by-step") for each sampled instance (see in-context examples and prompts in App. B). Then, using our taxonomy, we constructed a list of evaluation perspectives to label the model generated step-by-step reasoning for each of these datasets. We explain the details of the perspectives used to label the human judged datasets in § 5 and App. F. All datasets with examples are summarised in Table 13. The dataset details are as follows.

DROP (Dua et al., 2019), Discrete Reasoning Over the content of Paragraphs, is a dataset of 96K instances with a context and a question. To solve the tasks, a system must resolve references in the context that match the question, and perform discrete operations over them (such as addition, counting, or sorting). These operations require comprehensive understanding of the content of the input context.

GSM8K (Cobbe et al., 2021) is a dataset of 8.5K linguistically diverse grade school math word problems. On this dataset, even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution.

CosmosQA (Huang et al., 2019) is a dataset of 35K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. The questions focus on reading between the lines over a diverse collection of people's everyday narratives, asking questions such as "what might be the possible reason of ...?" or "what would have happened if ...?". The dataset does not provide step-by-step reasoning output, and contains multiple choice answers.

ESNLI (Camburu et al., 2018) is the extended version of the Stanford Natural Language Inference corpus (Bowman et al., 2015) of 570K sentence pairs labeled with entailment or contradiction. ESNLI adds human labeled explanations of the entailment decisions.

SemEVAL (Ostermann et al., 2018) is a dataset on machine comprehension using commonsense knowledge. It contains questions that require commonsense knowledge for finding the correct answer.

E.3 SYNTHETIC DIAGNOSTICS DATASET GENERATION WITH PERTURBATION RULES

To construct the diagnostics datasets, we apply synthetic perturbations to half of the chains from six datasets (for details see App. E.1 and the summary in Table 11). In Table 14 we illustrate these synthetic perturbations applied to the reasoning steps {r_i} of the gold reference chains of all datasets. There, g* indicates a grammar error, which includes changing verb tense, dropping a verb, or randomly swapping words. s* represents changing the semantics of one step in the chain by replacing named entities. To simulate extrinsic hallucinations, we use random steps from other chains within the same dataset. To construct diagnostic data from the math datasets, we introduce four additional perturbations to simulate step-wise explanation errors that might arise from arithmetic reasoning (Arithmetic error), general knowledge about relationships and equation construction (Commonsense error), and misinformation about object/subject characteristics (Factuality or Hallucination):
• Shuffle numbers: randomly shuffles all numbers in the chain,
• Shuffle operations: randomly shuffles all math operations in the chain,
• Random number: randomly replaces one number in the chain,
• Random operation: randomly replaces one math operation in the chain.
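Two of the math perturbations above can be sketched with simple regex-based rules; this is our own simplified rendering for illustration, not the actual implementation, and real perturbations may tokenize differently:

```python
import random
import re

NUM = re.compile(r"\d+(?:\.\d+)?")

def shuffle_numbers(chain, rng):
    """Shuffle numbers perturbation: randomly shuffle all numbers in the chain."""
    nums = NUM.findall(chain)
    rng.shuffle(nums)
    it = iter(nums)
    # Substitute each number, in order, with one from the shuffled list.
    return NUM.sub(lambda m: next(it), chain)

def random_operation(chain, rng):
    """Random operation perturbation: replace one math operation in the chain."""
    positions = [m.start() for m in re.finditer(r"[+\-*/]", chain)]
    if not positions:
        return chain
    pos = rng.choice(positions)
    new_op = rng.choice([op for op in "+-*/" if op != chain[pos]])
    return chain[:pos] + new_op + chain[pos + 1:]

rng = random.Random(0)
perturbed = shuffle_numbers("2 + 3 = 5", rng)   # same numbers, possibly new order
swapped = random_operation("2 + 3 = 5", rng)    # '+' becomes another operator
```

The other two perturbations (Shuffle operations, Random number) follow the same pattern with the roles of numbers and operators exchanged.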

F HUMAN ANNOTATIONS (CONT. FROM § 5)

To construct the Human Judged Datasets, we perform human annotations on five datasets, which we summarize in Table 11 (Type='Human judged'). These datasets do not include explanations (except GSM8K and ESNLI), so we construct model generated reasoning steps and label them with reasoning errors. We explain our generation process in §5 and App. E.2. We used five expert human annotators to collect reasoning error labels on the five datasets. We asked the human evaluators to directly rate the generated reasoning errors at the overall chain level using a Likert scale from 1 to 5. We also asked them to mark whether each error type proposed in our error taxonomy (§3) appeared in each step in step-level evaluations. Figs. 4 and 5 illustrate the UI used to collect the data. Table 15 summarizes the questions that the experts were asked. Table 16 reports the distribution of errors for each dataset. In general, we found that it was hard to get anonymous crowd workers to annotate our data accurately even when we paid averages upwards of $30 an hour, hence we relied on expert annotators. For the annotation sessions reported in the text of the paper, we find that it takes an average of 754 seconds for expert annotators to complete a session of at most 5 examples, or slightly over two and a half minutes per example. This highlights the difficulty of obtaining high-quality annotations on these cognitively challenging tasks.

Validation. We replace the original validation procedure on semantic textual similarity tasks with similarity-based validation on perturbed reasoning chains. In particular, during training, we select the best checkpoint as the one that maximizes the cosine similarity between positive pairs and minimizes the cosine similarity between hard-negative pairs within a batch of size B:

(1 / 2B) ∑_{i=1}^{N} [cos(s_i, r_i) - cos(s_i, h_i)]

The model is evaluated every 100 steps on the development dataset, and the best checkpoint is applied at inference.
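This checkpoint-selection criterion can be sketched as follows; the function name is ours, the embeddings are toy vectors, and the 1/2B normalization reflects our reading of the objective (the original implementation may normalize differently):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def checkpoint_score(src, ref, neg):
    """Batch objective: sum of cos(s_i, r_i) - cos(s_i, h_i), normalized by 2B.

    src: source embeddings s_i; ref: positive (reference) embeddings r_i;
    neg: hard-negative (perturbed) embeddings h_i. Higher is better: positives
    should sit close to the source, hard negatives far from it.
    """
    B = len(src)
    gaps = [cosine(s, r) - cosine(s, h) for s, r, h in zip(src, ref, neg)]
    return sum(gaps) / (2 * B)

# Toy embeddings: positives parallel to the sources, negatives orthogonal
src = [[1.0, 0.0], [0.0, 1.0]]
ref = [[2.0, 0.0], [0.0, 3.0]]  # cos(s_i, r_i) = 1 for both pairs
neg = [[0.0, 1.0], [1.0, 0.0]]  # cos(s_i, h_i) = 0 for both pairs
score = checkpoint_score(src, ref, neg)  # (1 + 1) / (2 * 2) = 0.5
```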
Other parameters not described in this section are kept as in the original SimCSE model used for initialization.

Inference. We compare ROSCOE scores calculated against three embedding models: the finetuned SimCSE model, the sup-simcse-roberta-base SimCSE model, and the all-mpnet-base-v2 sentence embedding model (Reimers & Gurevych, 2019). During inference, we set the random seed to 42; without this, the embedding-based scores naturally varied by about 0.01.

H.1 CONTROLLED EXPERIMENTS WITH DIAGNOSTICS DATASETS

In this section, we present the Somers' D correlations of all metrics on all Diagnostics datasets. Table 18 summarizes the evaluations when measured reference-free. One of the characteristics of our ROSCOE metrics is that they can judge model generated reasoning steps both with and without human reference reasoning chains. In the experiments section in §6, we discussed the results of our unsupervised scores in comparison to baseline scores when measured reference-free. In Table 19, we summarize the correlation analysis of ROSCOE metrics in comparison to baselines on the diagnostic datasets when a reference is available for evaluation. Specifically, each score is measured between the human provided reasoning steps (reference) and the model generated reasoning steps (hypothesis). We also display fine-grained meta-evaluations of all metrics on each diagnostics dataset in separate tables: Tables 20 and 26 for EQASC, Tables 21 and 27 for EntailmentBank, Tables 22 and 28 for MATH, Tables 23 and 29 for ProofWriter, Tables 24 and 30 for ASDIV, and Tables 25 and 31 for AQUA. To understand whether the designed reference-free scores capture the targeted error types, we analyze perturbation-level correlations, summarized in Fig. 6. Out of all considered scores, Info-Chain covers 10 out of 12 errors, all except the Remove Step and Semantic error perturbations. In general, we note that ROSCOE fails to consistently identify the missing-step error type, represented by the Remove Step perturbation, across different datasets, while every other synthesized error type is covered by at least one score type. Reference-based scores cover all synthetic errors, with Semantic Coverage-Chain showing strong correlations with all types of perturbations (Table 19). We also note that, along with ROSCOE scores, the highest correlations among all reference-based scores belong to the ROUGE and BERT scores (Tables 26-31).
ROUGE scores consistently perform well on the Repetition, Hallucination, Remove Step, Shuffle Steps, Swap Steps, Negate Step, and Semantic perturbations, while underperforming on Random operation and Shuffle operations. We attribute this to the fact that ROUGE is an n-gram based score, so it is better at catching errors where the wording has significantly changed, while failing to catch small changes within steps. It is worth noting that some scores, especially among the reference-based evaluations, reach the highest possible Somers' D correlation score of 1.0. This means that in some scenarios there is a perfect correlation between the metric and the error type. In other words, for such a metric we can find a threshold such that generated chains with scores greater than the threshold do not have errors of the given type, and all generated chains with scores less than the threshold do have that error. This is especially evident for reference-based metrics that directly compare the reference solution and the hypothesis. In this scenario, we build the correlation over two groups: 1) non-perturbed hypotheses, where the score is calculated by comparing embedding similarities of the reference with itself, and we expect high scores; and 2) perturbed hypotheses, where the reference is compared with its perturbed version and the scores should be lower. In some cases, we are able to perfectly separate perturbed and non-perturbed chains based on the corresponding metric values by selecting a threshold; in other cases we cannot, due to a number of false negatives (i.e., a chain gets a high score although the error is present). As an example, consider the Semantic Coverage-Chain metric calculated on the EQASC dataset using all-mpnet-base-v2 sentence embeddings, and the Hallucination perturbation (Table 26). Here the Somers' D correlation score is 1.0.
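The "perfect separation implies Somers' D of 1.0" intuition can be checked with a minimal pure-Python computation over binary perturbation labels; this simplified pairwise definition is for illustration only (in practice we use SciPy):

```python
def somers_d(scores, labels):
    """Somers' D of scores with respect to binary labels.

    labels[i] = 1 for unperturbed chains (expected high scores),
    0 for perturbed ones. Only pairs with different labels count;
    ties in the scores reduce the correlation.
    """
    conc = disc = ties = 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                continue  # pairs tied on the label are excluded
            hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
            if scores[hi] > scores[lo]:
                conc += 1
            elif scores[hi] < scores[lo]:
                disc += 1
            else:
                ties += 1
    return (conc - disc) / (conc + disc + ties)

# A threshold at 0.5 perfectly separates the two groups -> D = 1.0
labels = [1, 1, 0, 0]
scores = [1.0, 0.9, 0.4, 0.2]
d = somers_d(scores, labels)
```

A single chain crossing the threshold (e.g., a perturbed chain scoring above an unperturbed one) immediately pulls D below 1.0, which is exactly the false-negative effect discussed above.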
Semantic Coverage-Chain is calculated as a normalized cosine distance between the chain embedding of the reference solution r and the chain embedding of the hypothesis h: [1 + cos(r, h)]/2. Recall that in our setup, half of the hypothesis chains are perturbed reference chains, and the other half are the same as the reference. While the Hallucination perturbation is an insertion of a random step from the dataset, and it is hard to predict how it will affect the embedding of the chain as a whole, on the unperturbed chains, where h = r, the Semantic Coverage-Chain is [1 + cos(r, r)]/2 = 1.0. Further review confirmed that in this dataset there are no false positives, i.e., all chains with perturbations had a Semantic Coverage-Chain score less than 1.0. That means we can always identify whether a chain contains a Hallucination error by comparing its Semantic Coverage-Chain value with the threshold value of 1.0, which is reflected in the perfect Somers' D score. The highest correlations among reference-free scores belong to the Repetition-* scores, which exhibit perfect correlation on the EQASC dataset (Tables 20-25). For other datasets, the non-perfect correlations can be attributed to a small number of false positives, i.e., low Repetition-* scores for chains with non-duplicated but similar steps, while all chains with duplicates receive scores of almost 0 (Fig. 7). In EQASC, explanations are created from a set of facts that are not directly related to each other, but are intended to give an answer when combined. Among all datasets considered, these steps are the most dissimilar, and thus can be separated with similarity-based scores.

Figure 7: Distribution of Repetition-* scores. While all perturbed subsets have scores of 0 or near 0, all datasets except EQASC have some chains that were scored equally low despite the absence of duplicates.
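The threshold test described above can be sketched directly; the chain embeddings below are toy vectors (the actual metric uses sentence-embedding models):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_coverage_chain(r_emb, h_emb):
    """Semantic Coverage-Chain: [1 + cos(r, h)] / 2 over chain embeddings."""
    return (1.0 + cosine(r_emb, h_emb)) / 2.0

ref = [0.2, 0.8, 0.1]        # reference chain embedding
unperturbed = list(ref)      # hypothesis identical to the reference (h = r)
perturbed = [0.9, 0.1, 0.3]  # a hallucinated step shifts the chain embedding

same = semantic_coverage_chain(ref, unperturbed)  # exactly 1.0 up to rounding
diff = semantic_coverage_chain(ref, perturbed)    # strictly below the 1.0 threshold
```

Checking the score against the threshold of 1.0 then recovers the perturbed/unperturbed split exactly, as in the EQASC Hallucination case above.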



Code can be found at https://github.com/facebookresearch/ParlAI/tree/main/projects/roscoe. Annotated datasets can be downloaded from https://dl.fbaipublicfiles.com/parlai/projects/roscoe/annotations.zip.

Semantic alignment refers to the determination of relations between concepts with the same or a similar intended meaning (Agirre et al., 2013).

We chose expert annotators over crowd-sourcing because our annotation task is cognitively challenging and requires fine-grained annotation.

The fine-tuned model is available at https://huggingface.co/facebook/roscoe-512-roberta-base

We use SciPy (Virtanen et al., 2020) to calculate correlations and obtain p-values from a hypothesis test where the null hypothesis is an absence of association.

ProofWriter example proof:
Step1: The squirrel is round. Step2: If something is round then it eats the cow. Step3: The squirrel eats the cow. Step4: If something sees the squirrel and the squirrel eats the cow then the cow is round. Step5: The cow is round. Step6: If something is round then it eats the cow. Step7: The cow eats the cow. Step8: If something eats the cow then it sees the squirrel. Step9: The cow sees the squirrel. Answer: True



Figure 1: Sensitivity of selected metrics, measured by Somers' D, to injected levels of error in reasoning steps.

Figure 2: Box-and-whisker plots of interquartile ranges of scores, for perturbations and reference-free metrics with strong Somers' D values. Scores are split by dataset and perturbation use. While interquartile ranges separate well by perturbation use within a single dataset, there is overlap across datasets. This shows the drift of neural scores across datasets and applies to both ROSCOE (left, center) and strong baselines (right).

Context: A sandwich is priced at $0.75. A cup of pudding is priced at $0.25. Tim bought 2 sandwiches and 4 cups of pudding. Question: How much money should Tim pay?

Context: The entrance fee for a fair is $5 for persons under the age of 18 and 20% more for persons older. Each ride at the fair costs $0.50. If Joe goes with her 6 years old twin brothers, and they each took 3 rides in total. Question: How much money does Joe end up spending at the fair? Step1: Total entrance fee is (2*$5)+(1.20*5) = $16 Step2: Total rides fee is (0.50 * 3) * 3 = $4.50 Step3: Total money spent is $20.50 Answer: 20.5

EQASC Question: Where is water likely to form beads? Step1: Beads of water are formed by water vapor condensing Step2: Moisture builds up in condenses air and the wherever the surfaces are cold. Answer: Water beads form on cold surfaces.

Figure 4: Screenshot of expert annotation user interface, showing the context for the initial question as well as the questions regarding the generated response.

Figure 5: Screenshot of expert annotation user interface, showing questions asked for each step, using the question in Fig. 4. The questions are asked of every step generated by the model, with steps separated by sentence-ending periods.

Figure 6: Relative presence of the strong score-perturbation correlation, measured as the number of datasets where, for each score-perturbation pair, the Somers' D correlation value is in the 90th percentile, normalized by the total number of datasets where this type of perturbation occurs. Statistics collected over ROSCOE reference-free scores with finetuned SimCSE embeddings. (Continued from §7)

Taxonomy of Step-by-Step Reasoning Errors. The full list of error types with examples is illustrated in Table 10.

Semantic alignment metrics (ROSCOE-SA).

Wei et al. (2022) provide model generated chain-of-thought reasoning steps for GSM8K. We used chains produced by the 175b_verification model to annotate for reasoning errors. For other datasets, we prompt the GPT-3 LLM (Brown et al., 2020) with few-shot in-context examples to obtain step-by-step reasoning sequences (see examples in App. E.2). We use the error types in our taxonomy in Table 2 as human evaluation perspectives of reasoning errors, for which we solicit five expert annotators.

Table 7 shows the Somers' D correlation for metrics measured reference-free on six different datasets, and compares baselines to the ROSCOE-* aggregated categories calculated with finetuned embeddings: ROSCOE-SA, ROSCOE-SS, ROSCOE-LI, ROSCOE-LC. Results also include ROSCOE metrics with the all-mpnet-base-v2 (ROSCOE-SA 1, ROSCOE-SS 1) and sup-simcse-roberta-base (ROSCOE-SA 2, ROSCOE-SS 2) sentence embedding models. Correlations for ProofWriter are taken on its depth-5 subset. We report the highest correlation scores across perturbations within each dataset. The breakdown of all ROSCOE metrics is in App. Table 18.

Somers' D correlation of different metrics on six Diagnostics datasets. Metrics are measured reference-free on (s, h).

Somers' D correlations of metrics with human judgement.

Taxonomy of Step-by-Step Reasoning Errors. Errors used for perturbations in constructing the diagnostic datasets (Diag.) and for human annotation (Human) of the model generated reasoning chains are also marked. (Cont. from Table 2.)

The basketball team went to the steakhouse to eat dinner. The first player ate a 6-ounce steak. The second player ate beef tips, containing 8 beef tips, each an ounce in size. The third player ate a one-pound steak. And the fourth and fifth players ordered vegetarian meals. In total, how many ounces of meat were consumed by the team? Model Expl: The fourth and fifth players ordered vegetarian meals, for a total of 2 ounces of meat.

Table 11 illustrates how each dataset is used in our experiments. The StrategyQA dataset is only used to finetune the SimCSE embeddings model, because it contains reference reasoning chains in the train and validation partitions, but not in the test partition. The remaining six diagnostic datasets are used for finetuning the sentence embedding model and for evaluating our models, as presented in the experiments results. All datasets with examples are summarised in Table 12.

Summary of datasets used in our work. Reasoning Chain represents whether the dataset contains human written golden step-wise reasoning explanations. Type indicates whether it is used for constructing the Diagnostic or Human judged datasets. Train/Val./Test indicate whether the dataset is used for training, validation, and/or testing. The StrategyQA dataset is only used for finetuning the SimCSE embedding model.

We show instances from seven of the Diagnostics Datasets here. (Continued from §5.) Earth is a kind of celestial object. Stars appear to move relative to the horizon during the night. A star is a kind of celestial object celestial body. The earth rotating on its axis causes stars to appear to move across the sky at night.

We show instances from five of the Human Judged Datasets used in our work. Only GSM8K and ESNLI include human labeled explanations.

CosmosQA Context: A woman had topped herself by jumping off the roof of the hospital she had just recently been admitted to. She was there because the first or perhaps latest suicide attempt was unsuccessful. She put her clothes on, folded the hospital gown and made the bed. She walked through the unit unimpeded and took the elevator to the top floor Question: What would have happened to the woman if the staff at the hospital were doing their job properly?

Evaluation perspectives used to label the Human Judged datasets. The perspectives, which we asked humans to label, align with our taxonomy of reasoning errors. (Continued from § 5)

Statistics of the types of errors in the Human Judged datasets. Each column reports the number of examples where the specified error type exists. (Continued from § 5)

Model training. We use the train portions of the perturbed diagnostics datasets to finetune the SimCSE embeddings model (explained in § 5) and the validation portions to select the best embedding model. The test portions are used to evaluate our metrics against baseline metrics. We randomly select 500,000 samples with replacement from each dataset to create uniform representation and reduce bias. The hyperparameters used to finetune the SimCSE model are described in Table 17. We use NVIDIA Tesla V100 Volta GPU instances with 32GB graphics cards. We perform a hyperparameter search, varying the batch size in {32, 64, 256, 512, 1024, 2048}, the learning rate in {5e-06, 1e-05, 5e-05, 1e-04}, and the max sequence length in {64, 128, 512}. Not all combinations of batch size and max sequence length were explored due to memory limitations.
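The search grid described above can be enumerated as follows; this is a simple sketch of the swept values, and as noted, some memory-infeasible combinations of batch size and sequence length were skipped in practice:

```python
from itertools import product

batch_sizes = [32, 64, 256, 512, 1024, 2048]
learning_rates = [5e-06, 1e-05, 5e-05, 1e-04]
max_seq_lengths = [64, 128, 512]

# Cartesian product of the swept hyperparameters; combinations of large
# batch size and long sequence length were pruned due to GPU memory limits.
grid = [
    {"batch_size": b, "lr": lr, "max_seq_len": m}
    for b, lr, m in product(batch_sizes, learning_rates, max_seq_lengths)
]
```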

Hyperparameters used to fine-tune SimCSE model on perturbed datasets.

Somers' D correlation of all metrics on six Diagnostics datasets. All metrics are measured reference-free on (s, h). The highest correlation overall for each dataset is in bold. The second best models are underlined. Correlations that are not significant (p-value >= 0.05) are omitted when aggregating, and "-" denotes an absence of any significant correlations. Note that ASDIV is a 1-step equation dataset, so there are no repetition and self-consistency scores as there are no steps to compare. (Continued from §6, more details in App. H.1.).
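The Somers' D meta-evaluation reported in these tables can be sketched with `scipy.stats.somersd` (available since SciPy 1.7). This is our illustration rather than the paper's code, and treating the perturbation labels as the independent variable is our assumption:

```python
from scipy import stats

def somers_d_if_significant(labels, scores, alpha=0.05):
    """Somers' D of metric scores against binary perturbation labels.

    Returns None when the correlation is not significant (p >= alpha),
    mirroring how non-significant correlations are omitted in the tables.
    """
    res = stats.somersd(labels, scores)  # labels treated as the independent variable
    return res.statistic if res.pvalue < alpha else None
```

For example, a metric that perfectly separates perturbed from unperturbed chains yields a statistic of 1.0.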

Somers' D correlation of all reference-based metrics on six Diagnostics datasets. Metrics are measured using reference generations on (r, h). The highest correlation overall for each dataset is in bold. The second best models are underlined. (Continued from §6, more details in App. H.1.)

Somers' D correlations of all metrics per different perturbation applied on EQASC Diagnostics datasets. All metrics are measured reference-free on (s, h). The highest correlation overall for each dataset is in bold. The second best models are underlined. Correlation scores with p-value < 0.05 are marked with †. (Continued from §6, more details in App. H.1.)

Somers' D correlations of all metrics per different perturbation applied on MATH Diagnostics datasets. All metrics are measured reference-free on (s, h). The highest correlation overall for each dataset is in bold. The second best models are underlined. Correlation scores with p-value < 0.05 are marked with †. (Continued from §6, more details in App. H.1.)

Somers' D correlations of all metrics per different perturbation applied on AQUA Diagnostics datasets. All metrics are measured reference-free on (s, h). The highest correlation overall for each dataset is in bold. The second best models are underlined. Correlation scores with p-value < 0.05 are marked with †. (Continued from §6, more details in App. H.1.)

Somers' D correlation of metrics on DROP human judged dataset analyzing step-by-step reasoning on overall chain and step-level perspectives. All metrics are measured reference-free on (s, h). The highest correlation overall for each aspect on each dataset is in bold, second best are underlined. Correlation scores with p-value < 0.05 are marked with †. (Continued from § 6, more details in App. H.2)

Somers' D correlation of all metrics on the GSM8K human judged dataset, analyzing step-by-step reasoning on overall chain and step-level perspectives. All metrics are measured reference-free on (s, h). The highest correlation overall for each aspect on each dataset is in bold, second best is underlined. Correlation scores with p-value < 0.05 are marked with †. (Continued from § 6, more details in App. H.2)

Somers' D correlation of all metrics on a human judged dataset, analyzing step-by-step reasoning on overall chain and step-level perspectives. All metrics are measured reference-free on (s, h). The highest correlation overall for each aspect on each dataset is in bold and second best is underlined. Correlation scores with p-value < 0.05 are marked with †. (Continued from § 6, more details in App. H.2)

Somers' D correlation of all metrics on the COSMOS human judged dataset, analyzing step-by-step reasoning on overall chain and step-level perspectives. All metrics are measured reference-free on (s, h). The highest correlation overall for each aspect on each dataset is in bold and second best is underlined. Correlation scores with p-value < 0.05 are marked with †. (Continued from § 6, more details in App. H.2)

Somers' D correlation of all metrics on a human judged dataset, analyzing step-by-step reasoning on overall chain and step-level perspectives. All metrics are measured reference-free on (s, h). The highest correlation overall for each aspect on each dataset is in bold and second best is underlined. Correlation scores with p-value < 0.05 are marked with †. (Continued from § 6, more details in App. H.2)

ROSCOE performance analysis on examples from Human Judged datasets. Errors highlighted in red. (Cont. from § 6)

DROP: Over the next year, however, the Polish forces were subject to attrition, as the Sejm again refused to raise taxes and pay the army, resulting in mass desertions of unpaid soldiery. The Polish problems were further aggravated by the incompetent leadership of hetman Michał Kazimierz Pac, who obstructed Sobieski's leadership, while the Ottomans continued to receive reinforcements. Nonetheless in 1674 the Commonwealth resumed the offensive, taking advantage of a new Muscovy-Ottoman conflict that year, and the Polish-Ottoman war remained undecided. Sobieski's force of 6,000 defeated 20,000 Turks and Tatars under Ibrahim Shyshman in the battle of Lwow in August 1675. Even after the Battle of Trembowla, the Sejm still refused his pleas for more funds and a larger army. In 1676, after Sobieski's 16,000 withstood the two-week siege of Żurawno, by 100,000 men under Ibrahim Pasha, a new peace treaty was signed, the Treaty of Żurawno. The peace treaty partially reversing those from Buczacz: the Ottomans kept approximately two thirds of the territories they gained in 1672, and the Commonwealth no longer was obliged to pay any kind of tribute to the Empire; a large number of Polish prisoners were released by the Ottomans. How many was the difference beween Sobieski's force and the Turks and Tatars? Claim: 14000. Is the Claim supported by the Situation?

Peter the Great ordered his army to advance towards Azov. The army comprised crack regiments and the Don Cossacks and was divided into three units under the command of Franz Lefort, Patrick Gordon and Avtonom Golovin. Supplies were shipped down the Don from Voronezh. In 1693 the Ottoman garrison of the fortress was 3,656, of whom 2,272 were Janissaries. Between June 27-July 5 the Russians blocked Azov from land but could not control the river and prevent resupply.
After two unsuccessful attacks on August 5 and September 25, the siege was lifted on October 1. Another Russian army under the command of Boris Sheremetev set out for the lower reaches of the Dnieper to take the Ottoman forts there. The main fort at Gazi-Kerman was taken when its powder magazine blew up, as well as Islam-Kerman, Tagan and Tavan, but the Russians were not able to hold the area and withdrew most of their forces. By the Treaty of Constantinople the remaining Russians were withdrawn and the lower Dnieper was declared a demilitarized zone. What happened first: Russians blocked Azov or Treaty of Constantinople? Claim: Russians blocked Azov. Is the Claim supported by the Situation?

The first Azov campaign began in the spring of 1695. Peter the Great ordered his army to advance towards Azov. The army comprised crack regiments and the Don Cossacks and was divided into three units under the command of Franz Lefort, Patrick Gordon and Avtonom Golovin. Supplies were shipped down the Don from Voronezh. In 1693 the Ottoman garrison of the fortress was 3,656, of whom 2,272 were Janissaries. Between June 27-July 5 the Russians blocked Azov from land but could not control the river and prevent resupply. After two unsuccessful attacks on August 5 and September 25, the siege was lifted on October 1. Another Russian army under the command of Boris Sheremetev set out for the lower reaches of the Dnieper to take the Ottoman forts there. The main fort at Gazi-Kerman was taken when its powder magazine blew up, as well as Islam-Kerman.


Published as a conference paper at ICLR 2023 

H.2 EXPERIMENTS WITH HUMAN JUDGEMENT DATASETS

In this section, we present Somers' D correlations of all metrics on all Human Judged datasets in separate tables. Specifically, Table 32 summarizes meta-evaluations for ROSCOE metrics in comparison to baselines on all human judged datasets. Fine-grained evaluations are presented in Table 33 for DROP, Tables 34 and 38 for GSM8K, Tables 35 and 39 for ESNLI, Table 36 for CosmosQA, and Table 37 for SemEVAL. The human evaluation perspectives used in these evaluations are described in App. Table 15.

Looking at how errors are captured by ROSCOE reference-free scores (Fig. 8), we observe the strongest correlations between the Redundancy error and the Repetition-* and Self-Consistency scores. The Repetition error is not present in this analysis, as it has at most 3 occurrences per dataset. Of all the considered scores, Self-Consistency covers 6 out of 7 evaluation perspectives, all except Missing Step. We further look at specific human annotated examples where ROSCOE gives its highest and lowest scores to understand the strengths and weaknesses of the proposed approach. Results are summarized in Table 40. A similar analysis for the diagnostics datasets is summarized in Table 41.

Example reasoning chain: The order is $7.50 for the sub, $1.50 for chips and $1.00 for cookies so the total order is 7.50+1.50+1.00 = $«7.50+1.50+1.00=10.0»10.00. There's a 20% delivery fee added at check out so that's 10*.20 = $«10*.20=2.0»2.00. The order is $10.00 and there's a $2.00 delivery fee so 10+2 = $«10+2=12.00»12.00. She also wants to add a $5.00 tip which will make the order 12+5 = $«12+5=17.00»17.00.
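The «expression=result» spans in the reasoning chain quoted above are calculator-style annotations. A minimal checker for such annotations (our sketch, assuming the expressions are plain arithmetic) is:

```python
import re

# Matches calculator annotations of the form «expression=result».
ANNOT = re.compile(r"«([^=»]+)=([^»]+)»")

def check_annotations(step, tol=1e-6):
    """Return (expr, claimed, actual) for each annotation whose arithmetic is wrong."""
    errors = []
    for expr, claimed in ANNOT.findall(step):
        actual = eval(expr, {"__builtins__": {}})  # arithmetic-only expressions
        if abs(actual - float(claimed)) > tol:
            errors.append((expr, float(claimed), actual))
    return errors
```

For instance, `check_annotations("10+2 = $«10+2=12.00»12.00")` returns an empty list, while a step containing a miscomputed annotation surfaces the discrepancy between the claimed and recomputed values.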

