SOLVING MATH WORD PROBLEMS WITH PROCESS-BASED AND OUTCOME-BASED FEEDBACK

Abstract

Recent work has shown that prompting language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise the finetuning of such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, to produce correct reasoning steps, we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% → 12.7% final-answer error and 14.0% → 3.4% reasoning error among final-answer-correct solutions.

1. INTRODUCTION

Recent work has shown that asking language models to use step-by-step reasoning improves performance on reasoning tasks (Shwartz et al., 2020; Nakano et al., 2021; Cobbe et al., 2021; Wei et al., 2022; Kojima et al., 2022; Lewkowycz et al., 2022). While these works have primarily focused on prompting language models, prior work suggests that finetuning should outperform prompting alone (Stiennon et al., 2020; Perez et al., 2021; Ouyang et al., 2022). This raises the question of how best to supervise such models. Two natural approaches are outcome-based approaches, which supervise the final result, and process-based approaches, which supervise each step of the reasoning process, including the last step outputting the final result. In this work, we conduct the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task. For this, we use the recently proposed GSM8K dataset (Cobbe et al., 2021) of math word problems. In all cases, we generate a sequence of reasoning steps leading to the final answer, but vary whether supervision is provided only on the final answers (outcome-based) or on individual reasoning steps (process-based). For process-based approaches, we consider supervision provided both by offline human-generated reasoning traces from the GSM8K dataset itself and by online human correctness annotations, which we collect for each step of model-generated samples. We compare these approaches in the context of a number of different modeling and training components, including: few-shot prompting, supervised finetuning, reinforcement learning (RL) via expert iteration, and reward modeling for both reranking and RL. Throughout, we consider two primary metrics: trace error rate, which measures how often the model makes any mistake in its reasoning trace according to human annotators, and final-answer error rate, which only considers the model's final answer and ignores the reasoning trace.
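The distinction between the two supervision signals and the two metrics can be made concrete with a minimal sketch. The function and field names below are illustrative, not from the paper: outcome-based supervision yields one label per sample from the final answer alone, process-based supervision yields one label per reasoning step, and the two error rates differ whenever a correct final answer is reached through flawed reasoning.

```python
def outcome_label(sample_final_answer, reference_answer):
    """Outcome-based: a single binary label per sample, from final-answer match."""
    return sample_final_answer == reference_answer

def process_labels(step_annotations):
    """Process-based: one binary label per reasoning step (e.g. human-annotated)."""
    return [a == "correct" for a in step_annotations]

def error_rates(samples):
    """Each sample has a 'final_correct' flag and per-step 'steps_correct' flags."""
    n = len(samples)
    final_err = sum(not s["final_correct"] for s in samples) / n
    # A trace error is any mistake in any step, even if the final answer is right.
    trace_err = sum(not all(s["steps_correct"]) for s in samples) / n
    return final_err, trace_err

samples = [
    {"final_correct": True,  "steps_correct": [True, True, True]},
    {"final_correct": True,  "steps_correct": [True, False, True]},  # right answer, flawed reasoning
    {"final_correct": False, "steps_correct": [True, False, False]},
]
final_err, trace_err = error_rates(samples)  # 1/3 final-answer error, 2/3 trace error
```

The second sample is the case that motivates measuring both metrics: outcome-based supervision alone cannot penalize it, which is why trace error rate must be assessed separately.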
By "reasoning trace" we refer to all steps of reasoning, including the last step which in GSM8K is the final numeric answer. While process-based approaches may provide multiple benefits (discussed further in Appendix B), including encouraging human understanding of the problem domain, here we concentrate on investigating their effect on the trace error rate. We do so because trace error rate is directly measurable and of interest in many settings. For example, in educational settings, an answer without an (understandable) explanation may often confuse more than it explains. Recent

