AUTOMATIC CHAIN OF THOUGHT PROMPTING IN LARGE LANGUAGE MODELS

Abstract

Large Language Models (LLMs) can carry out complex reasoning tasks by generating intermediate reasoning steps. These steps are triggered by chain-of-thought (CoT) prompting, which comes in two flavors: one leverages a simple prompt such as "Let's think step by step" to facilitate step-by-step reasoning before answering a question (Zero-Shot-CoT). The other uses manual demonstrations, each composed of a question and a reasoning chain that leads to an answer (Manual-CoT). Unfortunately, the superior performance of the latter strategy hinges crucially on manually generating task-specific demonstrations, which makes it far less scalable and more dependent on the talent of the CoT engineer. We show that such manual effort may be eliminated by leveraging LLMs to generate the reasoning chains on their own. Since these generated chains often come with mistakes, we propose a number of mitigation strategies. Our proposed Auto-CoT method automatically samples diverse questions and performs post-processing quality control to generate usable reasoning chains with Zero-Shot-CoT. On ten public benchmark reasoning tasks, Auto-CoT performs on par with Manual-CoT without the need for human intervention.

1. INTRODUCTION

Large language models (LLMs) (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022) have performed impressively on complex reasoning tasks by decomposing multi-step problems into intermediate steps before giving answers (Nye et al., 2022). This reasoning process is elicited by a recent technique: chain-of-thought (CoT) prompting (Wei et al., 2022b). CoT prompting comes in two major flavors. One is to add a single prompt such as "Let's think step by step" after the test question to facilitate reasoning chains in LLMs (Kojima et al., 2022). Since this strategy is task-agnostic and does not need input-output demonstrations, it is called Zero-Shot-CoT (Figure 1, left). Via Zero-Shot-CoT, LLMs have been shown to be decent zero-shot reasoners. The other strategy is few-shot prompting with manual reasoning demonstrations (Wei et al., 2022b). Each demonstration has a question and a reasoning chain; the latter is composed of a rationale (a series of intermediate reasoning steps) and an expected answer. With all the demonstrations being manually designed, this is referred to as Manual-CoT (Figure 1, right).
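The two prompt formats can be sketched in a few lines; the helper names below are illustrative, and the demonstration text is taken from Figure 1:

```python
# Sketch of the two CoT prompting styles. The function names and the
# string layout are illustrative assumptions, not the paper's exact code.

def zero_shot_cot_prompt(question: str) -> str:
    # Zero-Shot-CoT: append a single task-agnostic trigger phrase.
    return f"Q: {question}\nA: Let's think step by step."

def manual_cot_prompt(question: str, demos: list) -> str:
    # Manual-CoT: prepend hand-crafted (question, reasoning chain) pairs.
    demo_block = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{demo_block}\n\nQ: {question}\nA:"

# Demonstration taken from Figure 1.
demos = [(
    "There are 15 trees in the grove. Grove workers will plant trees in the "
    "grove today. After they are done, there will be 21 trees. How many trees "
    "did the grove workers plant today?",
    "There are 15 trees originally. Then there were 21 trees after some more "
    "were planted. So there must have been 21 - 15 = 6. The answer is 6.",
)]

question = ("A pet store had 64 puppies. In one day they sold 28 of them and "
            "put the rest into cages with 4 in each cage. "
            "How many cages did they use?")
print(zero_shot_cot_prompt(question))
print(manual_cot_prompt(question, demos))
```

Either prompt would then be sent to the LLM; only Manual-CoT requires the human-designed `demos`.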

In practice, Manual-CoT outperforms Zero-Shot-CoT (Wei et al., 2022b; Kojima et al., 2022). However, its superior performance hinges on the hand-crafting of effective demonstrations, which involves nontrivial effort in designing both the questions and their reasoning chains. Even more problematic, different tasks, such as arithmetic (Roy & Roth, 2015) and commonsense reasoning (Talmor et al., 2019), require different kinds of demonstrations to be manually generated. We propose Auto-CoT, which addresses these problems by automatically constructing demonstrations with questions and reasoning chains. Auto-CoT uses LLMs for this task: it generates the reasoning chains with the "Let's think step by step" prompt of Zero-Shot-CoT. Unfortunately, a naive approach is insufficient. For example, given a test question of a dataset, retrieving semantically similar questions and generating their chains with Zero-Shot-CoT can still yield reasoning chains with mistakes, which motivates diverse question sampling and quality control.

[Figure 1: Zero-Shot-CoT (left) appends "Let's think step by step" after the test question, then performs rationale generation followed by answer extraction (e.g., "There are 64 puppies. 28 of them were sold. This leaves 36 puppies. Each cage has 4 puppies, so we need 9 cages. Therefore, the answer (arabic numerals) is 9."). Manual-CoT (right) prepends a few manual demonstrations, each a question paired with a hand-crafted reasoning chain, before the test question.]
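The Auto-CoT recipe, sample diverse questions, generate their chains with Zero-Shot-CoT, and filter the results, can be sketched as follows. The greedy word-overlap sampler and the step-count filter are simplified stand-ins for the actual sampling and quality-control heuristics, and `generate` is a placeholder for an LLM call:

```python
# Illustrative sketch of Auto-CoT-style demonstration construction.
# Names and heuristics are assumptions for exposition, not the paper's code.

def diverse_sample(questions, k):
    """Greedily pick k questions maximizing pairwise word-level dissimilarity."""
    chosen = [questions[0]]
    while len(chosen) < k:
        def dissimilarity(q):
            qw = set(q.split())
            return min(1 - len(qw & set(c.split())) / len(qw | set(c.split()))
                       for c in chosen)
        chosen.append(max((q for q in questions if q not in chosen),
                          key=dissimilarity))
    return chosen

def build_demos(questions, k, generate, max_steps=5):
    """Generate a Zero-Shot-CoT chain per sampled question; keep short chains."""
    demos = []
    for q in diverse_sample(questions, k):
        chain = generate(f"Q: {q}\nA: Let's think step by step.")
        # Quality control: discard overly long rationales (illustrative heuristic).
        if chain.count(".") <= max_steps:
            demos.append((q, chain))
    return demos
```

The returned `demos` then play the role of Manual-CoT's hand-crafted demonstrations when prompting on test questions.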

2. RELATED WORK

Two lines of research are key for the current work: chain-of-thought (CoT) prompting for multi-step reasoning and in-context learning for LLMs. We review both of them below. Closely related, Zelikman et al. (2022) showed that generating rationales by LLMs is practical: an LLM is prompted to generate rationales, and among them, the ones that lead to the correct answer are selected. The selection requires a training dataset of questions with annotated answers. In contrast, we consider a more challenging scenario where only a set of test questions is given (without a training dataset), following CoT prompting (Wei et al., 2022b; Kojima et al., 2022).
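The answer-checked selection used in that line of work can be sketched as follows; the numeric answer extraction and the function names are illustrative assumptions, and `sample_rationales` stands in for repeated LLM sampling:

```python
# Minimal sketch of answer-checked rationale selection: keep only rationales
# whose final answer matches the gold label, which presupposes annotated answers.

import re

def final_answer(rationale):
    """Extract the last number in a rationale as its answer (illustrative)."""
    nums = re.findall(r"-?\d+", rationale)
    return nums[-1] if nums else None

def select_rationales(question, gold_answer, sample_rationales, n=4):
    # sample_rationales(question, n) -> list of candidate rationale strings.
    candidates = sample_rationales(question, n)
    return [r for r in candidates if final_answer(r) == str(gold_answer)]
```

Without gold answers, as in the test-question-only setting considered here, this filter is unavailable, which is why Auto-CoT must rely on other quality-control signals.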

2.1. CHAIN-OF-THOUGHT PROMPTING

Manual-CoT achieves stronger performance by eliciting CoT reasoning via effective, human-designed demonstrations. However, the work required in designing both the questions and their reasoning chains is nontrivial. Instead of addressing this limitation, recent studies have mainly focused on hand-crafting more complex demonstrations or leveraging ensemble-like methods.




CoT prompting is a gradient-free technique for inducing LLMs to produce intermediate reasoning steps that lead to the final answer. Wei et al. (2022b) studied CoT prompting in language models: it elicits LLMs to generate a coherent sequence of intermediate reasoning steps culminating in the final answer. LLMs can perform CoT reasoning with zero-shot prompting (Zero-Shot-CoT) (Kojima et al., 2022) or through human-generated few-shot demonstrations (Manual-CoT) (Wei et al., 2022b). LLMs are decent zero-shot reasoners whose generated rationales already reflect CoT reasoning. This observation inspires our work to leverage self-generated rationales for demonstrations. Generating rationales by LLMs was also shown to be practical by Zelikman et al. (2022).

