SOLVING MATH WORD PROBLEMS WITH PROCESS-BASED AND OUTCOME-BASED FEEDBACK

Abstract

Recent work has shown that prompting language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise the finetuning of such models: outcome-based approaches, which supervise the final result, or process-based approaches, which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, to produce correct reasoning steps, we find it necessary to use process-based supervision, or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% → 12.7% final-answer error and 14.0% → 3.4% reasoning error among final-answer-correct solutions.

1. INTRODUCTION

Recent work has shown that asking language models to use step-by-step reasoning improves performance on reasoning tasks (Shwartz et al., 2020; Nakano et al., 2021; Cobbe et al., 2021; Wei et al., 2022; Kojima et al., 2022; Lewkowycz et al., 2022). While these works have primarily focused on prompting language models, prior work suggests that finetuning should outperform prompting alone (Stiennon et al., 2020; Perez et al., 2021; Ouyang et al., 2022). This raises the question of how best to supervise such models. Two natural approaches are outcome-based approaches, which supervise the final result, and process-based approaches, which supervise each step of the reasoning process, including the last step, which outputs the final result. In this work, we conduct the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task. For this, we use the recently proposed GSM8K dataset (Cobbe et al., 2021) of math word problems. In all cases, we generate a sequence of reasoning steps leading to the final answer, but vary whether supervision is provided only on the final answer (outcome-based) or on individual reasoning steps (process-based). For process-based approaches, we consider supervision from both offline human-written reasoning traces from the GSM8K dataset itself and online human correctness annotations, which we collect for each step of model-generated samples. We compare these approaches in the context of a number of different modeling and training components, including: few-shot prompting, supervised finetuning, reinforcement learning (RL) via expert iteration, and reward modeling for both reranking and RL. Throughout, we consider two primary metrics: trace error rate, which measures how often the model makes any mistake in its reasoning trace according to human annotators, and final-answer error rate, which considers only the model's final answer and ignores the reasoning trace.
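As a concrete illustration of the reranking component, the sketch below samples several candidate solutions and keeps the one a reward model scores highest. The `reward_model` here is a toy stand-in we introduce purely for illustration, not the learned model from this paper:

```python
def rerank(candidates, reward_model):
    """Return the candidate reasoning trace with the highest reward-model score."""
    return max(candidates, key=reward_model)


# Toy stand-in for a learned reward model: prefer traces ending in "72".
# A real reward model would score the full reasoning trace.
def toy_rm(trace):
    return 1.0 if trace.endswith("72") else 0.0


samples = [
    "18 + 52 = 70, so the answer is 70",
    "18 + 54 = 72, so the answer is 72",
]
best = rerank(samples, toy_rm)
```

In practice the candidates would be sampled from the finetuned language model, and the reward model would be a learned scorer trained on outcome- or process-based labels.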
By "reasoning trace" we refer to all steps of reasoning, including the last step which in GSM8K is the final numeric answer. While process-based approaches may provide multiple benefits (discussed further in Appendix B), including encouraging human understanding of the problem domain, here we concentrate on investigating their effect on the trace error rate. We do so because trace error rate is directly measurable and of interest in many settings. For example, in educational settings, an answer without an (understandable) explanation may often confuse more than it explains. Recent findings suggest that outcome-based approaches often lack in this area. For example, work on natural-language-based reasoning (Zelikman et al., 2022; Creswell et al., 2022) suggests that models optimized exclusively for final-answer correctness can often produce the correct final answer, even when their generated reasoning traces are incorrect.

Key results

We find that our best approach, which combines supervised learning with reward-model-based reinforcement learning, significantly improves the state of the art for both trace error rate, from 14.0% → 3.4%, and final-answer error rate, from 16.8% → 12.7%. Final-answer error rate drops further to 2.7% when the model is allowed to abstain on 30% of questions. Our key findings regarding process- and outcome-based feedback are as follows:

• Outcome-based and process-based approaches lead to similar final-answer error rates. Both without reward models (23.5% vs. 22.2%) and with reward models (16.6% vs. 14.8%), language models supervised with final-answer correctness attain nearly the same final-answer error rate as those trained to imitate human-provided solutions.

• Both process- and outcome-supervised reward models learn to emulate process-based feedback. Somewhat surprisingly, we find that even reward models trained with outcome-based labels (indicating whether the final answer is correct) produce predictions that agree more closely with process-based labels (indicating whether each reasoning step is correct) than with the outcome-based labels themselves. While this effect may be dataset-specific, as discussed in Section 3, it helps explain the effectiveness of reward models for reducing trace error, and we hope it is investigated further in future work.

• Low trace error requires either process-based feedback or a reward model that emulates it. All models trained with reinforcement learning directly against final-answer correctness exhibit high trace error, with a best trace error of 12.4%, compared to only 3.8% for our best process-based method. Building on the previous finding, most of this gap closes when reinforcement learning targets a reward model rather than final-answer correctness directly, which reduces trace error to 5.5%.
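The abstention result above can be sketched with a simple confidence-thresholding rule: abstain on the fraction of questions where a per-question confidence score (e.g. a reward-model score) is lowest, and measure error on the rest. The exact abstention criterion used in the paper may differ; this is an illustrative sketch under that assumption:

```python
def selective_error(correct, confidence, abstain_frac=0.3):
    """Final-answer error rate on the questions kept after abstaining on the
    lowest-confidence fraction.

    `correct` is a list of booleans (was the final answer correct?);
    `confidence` is a parallel list of scalar scores for each question.
    """
    # Indices sorted from least to most confident.
    order = sorted(range(len(correct)), key=lambda i: confidence[i])
    n_drop = int(len(correct) * abstain_frac)
    kept = order[n_drop:]
    return sum(not correct[i] for i in kept) / len(kept)
```

With `abstain_frac=0.3`, error is measured only on the 70% of questions where the scorer is most confident, which is how abstention can trade coverage for accuracy.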

2. METHODS

This section describes the dataset, evaluation metrics, and modeling components evaluated in this paper. See Fig. 1 for an overview of how they fit together.

2.1. DATASET AND EVALUATION METRICS

We conduct all experiments on the GSM8K dataset (Cobbe et al., 2021), composed of grade school math word problems. We chose GSM8K because it is a competitive benchmark and contains natural language reasoning traces. We focus on a single dataset, since the need to recruit human annotators with the domain expertise to accurately evaluate reasoning traces imposes a large up-front cost. See Section 4 for a discussion of other datasets we considered.

We report two main metrics for all methods evaluated on the GSM8K test set. Final-answer error rate is the fraction of problems for which the method does not produce the correct final answer. Because all final answers on GSM8K are integers, this can be measured with exact string matching. Trace error rate is the fraction of problems with correct final answers for which the method produces at least one incorrect reasoning step. We estimate this via human annotations of the correctness of each reasoning step, as detailed in Section 2.7.

We report final-answer and trace errors as two separate metrics because we are particularly interested in errors which remain undetected after checking easily verifiable metrics (in this case, final-answer errors). For example, in an educational setting, it is important to show a student the correct steps to reach the answer: we can easily filter out incorrect traces that lead to the wrong answer, but it is much more difficult to filter out incorrect traces that lead to the correct answer.
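A minimal sketch of how these two metrics could be computed; the function names and data layout are our own, for illustration:

```python
def final_answer_error_rate(predicted, reference):
    """Fraction of problems whose predicted final answer does not exactly
    match the reference. Exact string matching suffices because all GSM8K
    final answers are integers."""
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return wrong / len(reference)


def trace_error_rate(step_labels):
    """Among final-answer-correct solutions, the fraction with at least one
    reasoning step judged incorrect. `step_labels` is a list of per-solution
    boolean lists (True = human annotator marked the step correct)."""
    bad = sum(not all(steps) for steps in step_labels)
    return bad / len(step_labels)
```

Note that `trace_error_rate` is conditioned on final-answer-correct solutions only, which is why the two metrics are reported separately.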



Table 2 and Appendix I show several example problems. We split out our own validation set of 256 examples from the original training set, which leaves us with 7118 training and 1319 test examples.
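The validation split might be reproduced along these lines; the selection procedure and seed are assumptions, since the text does not specify how the 256 validation examples were chosen:

```python
import random


def split_validation(train_examples, n_val=256, seed=0):
    """Hold out a fixed-size validation set from the training examples.
    A deterministic sketch with an assumed seed, not the paper's actual split."""
    rng = random.Random(seed)
    idx = list(range(len(train_examples)))
    rng.shuffle(idx)
    val = [train_examples[i] for i in idx[:n_val]]
    train = [train_examples[i] for i in idx[n_val:]]
    return train, val
```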

