RECURSION OF THOUGHT: DIVIDE AND CONQUER REASONING WITH LANGUAGE MODELS

Abstract

With the recent advances in language models, attempts are being made to apply them to multi-step reasoning problems. A major breakthrough in this line of research is to let language models generate intermediate steps, often called Chain of Thought (CoT), before producing a final answer. However, language models have an upper bound on the context size, i.e., the number of input tokens, such as 2048 for the recent GPT-3 and PaLM. Although several thousand tokens are enough to handle various tasks, solving more complex reasoning tasks can require orders of magnitude more tokens. Therefore, the context limit imposes a fundamental limit on the model's reasoning capability. Inspired by humans' incredible reasoning ability based on abstraction and recursion, we propose Recursion of Thought (RoT), a model-agnostic framework built on a novel paradigm: teaching a language model to divide and conquer complex problems by recursively creating multiple contexts. Since RoT casts the context-related operations as tokens, a language model can trigger the recursion operations by simply producing the corresponding tokens. On multiple arithmetic and algorithmic reasoning tasks, we demonstrate that RoT dramatically improves the ability of the recent large-scale language model GPT-3 to solve extremely complex problems. Moreover, RoT enables tiny, randomly initialized Transformers or LSTMs to solve problems that even humans find daunting.

1. INTRODUCTION

Recently, language models (LMs) have become a prominent approach to solving reasoning problems. Given a question sequence, the models are tasked with predicting the answer sequence that follows. One recent line of research for reasoning with LMs is chain of thought (CoT) generation (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022; Lewkowycz et al., 2022). In CoT generation, complex reasoning problems are solved by generating intermediate reasoning steps, or a chain of thought, before producing the final answer. Directly answering a question would require a model to fully solve the problem in a single forward pass, meaning the range of solvable problems is severely limited by the model's capacity. On the other hand, generating CoT before the answer allows the problem's complexity to be spread across the CoT, making each token generation more straightforward given the previous tokens. This is closer to how humans solve complex problems, as we think step by step instead of producing an answer reflexively. Although CoT seems promising, there is a critical issue that significantly limits its utility: the effective context size of sequence models cannot grow without bound. In this work, context refers to the set of input tokens that a model is conditioned on when generating output. In practice, all sequence models have a limit on the maximum context length for various reasons. For instance, Transformers (Vaswani et al., 2017) incur a computational cost that is quadratic in the context length, and RNNs (Hochreiter & Schmidhuber, 1997) struggle with long-term dependency modeling. Therefore, even state-of-the-art language models, such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022), limit the maximum context length to 2048 tokens. However, the length of the intermediate steps can grow rapidly with the problem's complexity and exceed the context limit.
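To make the growth of intermediate steps concrete, the following minimal sketch writes out a flat chain of thought for schoolbook multiplication and counts its length. The trace format, the helper names `cot_trace` and `trace_tokens`, and the one-token-per-character count are illustrative assumptions rather than the setup used in the experiments; only the 2048-token limit comes from the text above.

```python
CONTEXT_LIMIT = 2048  # context size cited in the text for GPT-3 and PaLM


def cot_trace(a: int, b: int) -> str:
    """Write a schoolbook multiplication a*b as one flat chain of thought.

    The line format is a made-up illustration, not a format from the paper.
    """
    lines = [f"{a}*{b}"]
    total = 0
    # One intermediate step per digit of b, starting from the least significant.
    for i, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10**i
        total += partial
        lines.append(f"{a}*{digit}e{i}={partial};sum={total}")
    lines.append(f"answer={total}")
    return "\n".join(lines)


def trace_tokens(n_digits: int) -> int:
    """Length of the trace for two n-digit operands.

    Assumption: one token per character, which is only a rough proxy
    for a real tokenizer.
    """
    operand = int("9" * n_digits)
    return len(cot_trace(operand, operand))


for n in (4, 10, 20, 40):
    tokens = trace_tokens(n)
    status = "fits" if tokens <= CONTEXT_LIMIT else "exceeds the limit"
    print(f"{n:>2}-digit operands: {tokens} tokens ({status})")
```

Under these assumptions the trace has one step per digit and each step grows with the operand length, so the total length grows roughly quadratically: a 4-digit problem fits comfortably in a single context, while a 40-digit problem does not.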
Since CoT can handle a problem only if the process of solving it fits into a single context, the range of problems that CoT can handle is severely constrained by the context limit. This issue must be overcome to solve more challenging and useful reasoning problems, whose solutions may require millions of tokens. Humans handle this issue by using abstraction and recursion. We divide a large problem into smaller subproblems and focus on each subproblem while solving it, instead of considering the entire problem at every step. We can further subdivide a subproblem into even smaller subproblems. With this intuition, we propose Recursion of Thought (RoT) as a model-agnostic framework for recursively solving multi-step reasoning problems. The key feature of RoT is that it grants the model the ability to recursively create and utilize multiple contexts for subproblems. We achieve this by introducing several special tokens that a model can output to control its context. During inference, the model recursively solves the problems by producing the appropriate tokens at the right time. Moreover, RoT supports tail recursion, which enables general computation with an indefinitely long chain of recursion. We demonstrate RoT on four basic arithmetic operations (addition, subtraction, multiplication, and division) and four algorithmic tasks (longest common subsequence, longest palindromic subsequence, 0-1 knapsack, and matrix chain multiplication) to show its generality. All tasks are formulated as autoregressive sequence modeling problems, without any task-specific component such as a calculator. Because the problem space is combinatorially large, these tasks require a model to generalize after seeing only a tiny fraction of it. For example, even in simple arithmetic operations, two operands of up to six digits each yield about one trillion (10^6 × 10^6 = 10^12) possible combinations. Hence, we evaluate whether a model understands the underlying rules rather than memorizing answers by brute force.
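The idea of solving each subproblem in its own context can be sketched as follows. This is a toy illustration under stated assumptions, not the RoT implementation: a hand-written recursion stands in for the trained model, the token name `<THINK>` and the `Q:`/`A:` layout are hypothetical, and the range-sum task is chosen only because its question has constant length (it is not one of the evaluated tasks).

```python
from typing import Dict

THINK = "<THINK>"  # hypothetical special token; the name is an assumption


def solve(lo: int, hi: int, stats: Dict[str, int]) -> int:
    """Return lo + (lo+1) + ... + hi, using one fresh context per subproblem.

    A hand-written rule plays the role of the language model here.
    """
    context = ["Q:", "SUM", str(lo), str(hi)]  # every context starts with its question
    if lo == hi:
        answer = lo  # base case: small enough to answer directly
    else:
        mid = (lo + hi) // 2
        answer = 0
        for sub_lo, sub_hi in ((lo, mid), (mid + 1, hi)):
            # The subproblem is solved in a separate context (the recursive call).
            sub_answer = solve(sub_lo, sub_hi, stats)
            # The parent context keeps only the subquestion and its answer,
            # not the intermediate steps that produced it.
            context += [THINK, "SUM", str(sub_lo), str(sub_hi), "=", str(sub_answer)]
            answer += sub_answer
    context += ["A:", str(answer)]

    # Bookkeeping: how many contexts were used and how long they were.
    stats["contexts"] += 1
    stats["total_tokens"] += len(context)
    stats["max_context_len"] = max(stats["max_context_len"], len(context))
    return answer


stats = {"contexts": 0, "total_tokens": 0, "max_context_len": 0}
result = solve(1, 1024, stats)
print(result, stats)
```

For the range 1 to 1024, this sketch creates 2047 contexts, none longer than 18 tokens, even though the tokens produced in total number in the tens of thousands. The total work grows with the problem size while the length of any single context stays bounded, which is the property that a flat chain of thought lacks.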
In our experiments, the range of problems that CoT can handle is seriously constrained by the context limit. On the other hand, RoT leads language models to achieve near-perfect accuracy, even if the problem size increases to the extreme, where solving one problem requires producing hundreds of thousands of tokens. Moreover, the dramatic improvement is not limited to large pre-trained language models like GPT-3. RoT can make tiny, randomly initialized Transformers or LSTMs perform extremely complex reasoning. The key messages of this work are summarized as follows:
• The reasoning capability of current language models is seriously constrained by the maximum length of a single context.
• Our Recursion of Thought (RoT) unleashes the reasoning capability of language models by letting them recursively create and utilize multiple contexts of subproblems, following the principle of divide and conquer.
In the supplementary file, we provide the source code to fully reproduce our experiments.

2. RELATED WORK

Chain of Thought. Among several prior works on applying language models to reasoning, Scratchpad (Nye et al., 2021) may be the most closely related to our work. It is the first approach to fine-tune language models to produce CoT before generating an answer. It demonstrates its effectiveness on 8-digit addition, polynomial evaluation, and Python program execution. It also mentions the confined context size as a major limitation to be overcome. In order to unlock the full potential of Scratchpad, the authors argue that Transformers should be improved to allow greater context sizes. We address this exact problem from a completely different perspective, i.e., using multiple contexts to divide and conquer. Our approach is more practical and scalable than increasing the context limit. More recently, it has been found that sufficiently large pre-trained language models can be induced to produce CoT by simply tuning the prompt. For instance, CoT prompting (Wei et al., 2022) adds several QA exemplars with CoT before the main question, encouraging the model to generate final answers in a similar manner. The prompting of Kojima et al. (2022) is even simpler; after a question, they start the answer with "Let's think step by step," and then let the model finish the rest. Even without fine-tuning, these methods significantly improve the reasoning accuracy of language models. Minerva (Lewkowycz et al., 2022) utilizes these prompting techniques with a specially curated scientific pre-training dataset to achieve remarkable results on various reasoning benchmarks. However, all of these works are still limited by the maximum context size.
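The two prompting styles described above can be sketched as simple string templates. The function names, the `Q:`/`A:` layout, and the exemplar content are illustrative assumptions, not the exact prompts used in the cited works; only the phrase "Let's think step by step" is taken from the description above.

```python
from typing import List, Tuple


def few_shot_cot_prompt(exemplars: List[Tuple[str, str, str]], question: str) -> str:
    """Prepend worked (question, reasoning, answer) exemplars to the main question.

    The layout is an assumed template in the spirit of CoT prompting.
    """
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in exemplars
    ]
    # The main question is left open so that the model continues with its own CoT.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def zero_shot_cot_prompt(question: str) -> str:
    """Start the answer with a fixed trigger phrase and let the model finish it."""
    return f"Q: {question}\nA: Let's think step by step."


# A made-up exemplar, used only to show the shape of the prompt.
exemplars = [
    ("What is 17 + 25?", "17 + 25 = 17 + 20 + 5 = 37 + 5 = 42.", "42"),
]
print(few_shot_cot_prompt(exemplars, "What is 38 + 47?"))
print()
print(zero_shot_cot_prompt("What is 38 + 47?"))
```

In both cases the question, the exemplars, and the generated reasoning all share one context, which is why these methods remain bound by the maximum context size.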

