SUB-TASK DECOMPOSITION ENABLES LEARNING IN SEQUENCE TO SEQUENCE TASKS

Abstract

The field of Natural Language Processing (NLP) has experienced a dramatic leap in capabilities with the recent introduction of very large Language Models (LMs). Despite this success, natural language problems that involve several compounded steps remain practically unlearnable, even for the largest LMs. This is consistent with experimental failures of end-to-end learning of composite problems, demonstrated in a variety of domains. An effective mitigation is to introduce intermediate supervision for solving sub-tasks of the compounded problem. Recently, several works have demonstrated high gains by taking a straightforward approach to incorporating intermediate supervision in compounded natural language problems: the sequence-to-sequence LM is fed an augmented input, in which the decomposed tasks' labels are simply concatenated to the original input (see Figure 1). In this paper, we prove a positive learning result that motivates these recent efforts. We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. We show that this is true for any family of tasks which, on the one hand, is unlearnable, and on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on O(1) previous sub-task results. Beyond motivating contemporary empirical efforts to incorporate intermediate supervision in sequence-to-sequence language models, our positive theoretical result is the first of its kind in the landscape of results on the benefits of intermediate supervision for neural-network learning: until now, all theoretical results on the subject have been negative, i.e., they show cases where learning is impossible without intermediate supervision, whereas our result is positive, showing that learning is facilitated in the presence of intermediate supervision.

1. INTRODUCTION

Large-scale language models such as BERT (Devlin et al., 2019), T5 (Raffel et al., 2020), and GPT-3 (Brown et al., 2020) have recently pushed the envelope in many NLP tasks. Nevertheless, there are some problem families that even the largest models do not seem capable of solving. One such family is that of "multi-hop" reasoning problems (see, e.g., Geva et al. (2021); Kalyan et al. (2021); Press et al. (2022)), which require compounding operations in order to produce an answer. For example, Gopher (Rae et al., 2021), one of the largest available language models, achieves 61% accuracy on the StrategyQA benchmark (Geva et al., 2021), which requires implicit decomposition into reasoning steps, while human-level performance is around 87% accuracy. The limitations of learning compounded tasks with neural networks in an end-to-end manner have been observed in a variety of non-linguistic domains. A leading experimental approach for addressing these limitations is to first explicitly break the compounded operations into more basic "single-hop" operations and then combine the results. Gülçehre & Bengio (2016), one of the earliest works on this subject, proposed that supervision for the single-hop intermediate steps is crucial for avoiding bad local minima in the optimization of neural networks. Afterward, Glasmachers (2017) demonstrated that gradient-based end-to-end multi-hop learning is inefficient for solving complex problems that are easily solved by a divide-and-conquer strategy. Beyond such position papers, similar limitations were observed in language-related compounded tasks, including commonsense reasoning (Liu et al., 2022; Wei et al., 2022; Zelikman et al., 2022), math word problems (Piękos et al., 2021; Wei et al., 2022), and program execution (Nye et al., 2022). The go-to architectures in this domain are powerful language models, which are trained as sequence-to-sequence models over text.
In this setting, a particular form of introducing intermediate supervision for compounded tasks has emerged: intermediate sub-tasks and their labels are concatenated to the original task's input to form a new input sequence, on which the sequence-to-sequence LM is trained. This approach has recently been widely adopted, e.g., by Rajani et al. (2019); Cobbe et al. (2021); Piękos et al. (2021); Recchia (2021); Nye et al. (2022); Wei et al. (2022); Zelikman et al. (2022). Figure 1 illustrates this approach for math word problems, as done in Ling et al. (2017); Cobbe et al. (2021). These works show that training sequence-to-sequence models with concatenated sub-task decomposition supervision significantly improves results compared to training the same model without the intermediate supervision. For example, Nye et al. (2022) report over 99% accuracy on 8-digit addition when intermediate calculations are concatenated to the input, while accuracy without intermediate supervision is around 35%. While such decomposition-based approaches are intuitive, we are not aware of theoretical results that motivate them and formulate their benefits for learning composite problems with neural networks. In this paper, we provide positive theoretical results in this domain, which are in fact the first of their kind (see related work in Section 2). We show our results for sequential models, integrating the intermediate supervision in a manner that mimics the successful empirical approaches cited above in the language domain. In this formulation, a learner learns to predict a sequence composed of the task input x, followed by the single-hop reasoning steps, referred to as the evidence, and finally, the final answer y. We extend provable guarantees for the convergence of overparameterized recurrent neural networks (Wang et al., 2021) and prove that, with intermediate sub-task supervision, even a simple sequence-to-sequence model provably learns any task that admits an efficient decomposition into simpler sub-tasks that depend only on a small fraction of the input. Importantly, both the sample complexity and the required number of gradient updates are polynomial. In contrast, we rely on existing works (Valiant, 1984; Goldreich et al., 1986; Daniely & Shalev-Shwartz, 2016) to show that in the absence of intermediate supervision, there exist efficiently decomposable tasks that are unlearnable by polynomial-time learning algorithms. Our results apply to a broad family of tasks.

As a first exemplifying step, we show a positive result for learning bit subset parity, a setting that is notoriously not efficiently learnable by gradient-based algorithms without intermediate supervision (Kearns, 1998; Shalev-Shwartz et al., 2017; Abbe & Sandon, 2020; Abbe et al., 2021). In this setting, the family of target functions consists of parities over subsets of unknown input bits. Specifically, the input is d bits and the task is to predict whether the number of 1's in a certain unknown subset of d/2 bits is odd or even. The corresponding sub-tasks we consider are the parities of subsets of the unknown input subset. We prove a theorem guaranteeing that, when intermediate supervision is available, efficient neural network learning is possible.

Figure 1: An illustrative example of the prominent method for introducing sub-task decomposition and intermediate supervision for math word problems (Ling et al., 2017; Cobbe et al., 2021). Intermediate sub-tasks and their labels are concatenated to the original task's input to form a new input sequence. At training time, the likelihood of the entire sequence following the original input is maximized conditioned on the input; at test time, only the original input is given to the model.
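To make the augmented-sequence format concrete, the following minimal sketch (our own illustration, not code from any cited work) builds a training target for the bit subset parity task: the input bits x are followed by the evidence (the cumulative parities over the hidden subset, one single-hop step each) and the final label y.

```python
# Hypothetical illustration of the augmented-sequence format: for bit subset
# parity, the training target is the input bits x, followed by the
# intermediate sub-task labels (the "evidence"), followed by the answer y.

def augmented_sequence(x, subset):
    """Build the sequence x -> evidence -> y for subset parity.

    x      : list of input bits (0/1)
    subset : indices of the hidden subset whose parity defines the label
    """
    evidence = []
    parity = 0
    for i in subset:           # each evidence bit is a single-hop step: the
        parity ^= x[i]         # parity of a growing prefix of the subset
        evidence.append(parity)
    y = parity                 # final answer: parity of the whole subset
    return x + evidence + [y]

x = [1, 0, 1, 1, 0, 1]
subset = [0, 2, 5]             # unknown to the learner; used only to
                               # generate the intermediate supervision
print(augmented_sequence(x, subset))  # -> [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
```

At training time the model maximizes the likelihood of the evidence and y given x; at test time only x is provided and the model generates the rest, so each generated step depends on only O(1) previous sub-task results.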

