UNVEILING TRANSFORMERS WITH LEGO: A SYNTHETIC REASONING TASK

Abstract

We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we manage to understand some of the attention heads as well as how the information flows in the network. Notably, we identify a novel association pattern that globally attends only to identical tokens. Based on these observations, we hypothesize that pretraining helps for LEGO tasks due to certain structured attention patterns, and we experimentally verify this hypothesis. We also observe that in some data regimes the trained Transformer finds "shortcut" solutions to follow the chain of reasoning, which impedes the model's robustness, and we propose ways to prevent this. Motivated by our findings on structured attention patterns, we propose to replace certain attention heads with hardcoded patterns. This architectural change significantly reduces FLOPs and maintains or even improves the model's performance at large-scale pretraining.

1. INTRODUCTION

The deep learning revolution is about training large neural networks on vast amounts of data. The first field transformed by this methodology was computer vision, crucially leveraging the convolutional neural network architecture LeCun et al. (1989); Krizhevsky et al. (2012). More recently, natural language processing was revolutionized by the Transformer architecture Vaswani et al. (2017). Transformers are designed to process input represented as a "set of elements" (e.g., the words in a sentence with their positional encoding). This is of course an incredibly generic assumption, and thus Transformers can be applied to a wide variety of tasks, including vision Dosovitskiy et al. (2021), reinforcement learning Chen et al. (2021a), and protein structure prediction Rives et al. (2021); Jumper et al. (2021) among others, or even jointly across domains to produce generalized agents Reed et al. (2022). In fact, learning with Transformers is rapidly becoming the norm in deep learning.

Transformer models display excellent performance on the standard criterion "training error/test error" (e.g., for masked language prediction or translation). However, what makes them particularly noteworthy is that large-scale Transformer models seem to exhibit unexpected emergent behaviors, such as basic reasoning ability Thoppilan et al. (2022); Brown et al. (2020); Chowdhery et al. (2022); Du et al. (2021); Rae et al. (2021); Hoffmann et al. (2022); Smith et al. (2022); Zhang et al. (2022); Wei et al. (2022); Nye et al. (2022), excellent fine-tuning performance Hu et al. (2022); Thoppilan et al. (2022); Nye et al. (2022); Rae et al. (2021); Polu et al. (2022), or zero-shot learning Brown et al. (2020); Chowdhery et al. (2022); Du et al. (2021); Rae et al. (2021); Hoffmann et al. (2022); Smith et al. (2022); Zhang et al. (2022). Currently, there is a remarkable community effort towards at-scale experimental investigation of Transformers, essentially trying to find out what such models can do when they become large enough and are trained on large/diverse enough datasets. The successes are striking and capture the imagination Brown et al. (2020); Ramesh et al. (2022). Yet, for all of these wonders, there is very little understanding of how these models learn, or in fact what they learn. Answering such questions in at-scale experiments is particularly challenging, as one has little control over the data when hundreds of billions of tokens are harvested from various sources.

In this paper, we propose to take a step back and try to understand how learning occurs and what is being learned in a more controlled setting that captures important aspects of "reasoning". The benefit of such a controlled setting is that we can try to understand some of the most pressing questions in learning with Transformers, particularly around (i) the architecture and (ii) the importance of training data. For (i) we probe the role of multiple heads and depth, and we show that we can successfully understand them in our controlled setting. For (ii) we investigate how much the dataset composition matters, as well as how pretraining on merely vaguely related tasks makes finetuning successful. In turn, these insights can guide our thinking for large-scale experiments, and we give some of the lessons learned below. In particular, our insights crystallize into an architectural change to BERT for faster inference with matching or even better performance (Section 5).

1.1. LEGO: A SYNTHETIC REASONING TASK

Core components of reasoning include the ability to associate concepts and to manipulate them. We propose a simple task that captures these two aspects, which we call LEGO (Learning Equality and Group Operations). In LEGO, the input describes a sequence of variable assignments as well as operations on these variables by a fixed (mathematical) group. One needs to be able to deal with both long-range assignments (the same variable appearing in different parts of the input should be viewed as being equal to the same quantity) and short-range operations (describing which group element is applied to which variable). A key parameter of an input sequence will be its length, which is proportional to the number of sequential reasoning steps one has to do in order to resolve the value of each variable. We will mostly train with a fixed sequence length (say 12). We often provide supervision only on part of the sequence (say the first 6 variables). We do so in order to test the generalization capabilities from shorter sequences to longer sequences without introducing potential errors due to the positional encoding in Transformers.

1.2. SOME TAKEAWAYS

In LEGO, we are interested in both classical generalization (i.e., training and test distribution are the same) and out-of-distribution generalization. For the latter we focus on distribution shifts that vary the length of the chain of reasoning, and thus we refer to this type of generalization as length extrapolation. Specifically, the setting for length extrapolation is to train with supervision on shorter sequence lengths (e.g., supervision on only the first 6 variables) and test on longer sequences (e.g., accuracy computed on 12 variables). A summary of our empirical observations is as follows:

1. First, classical generalization happens reliably for all architectures and data regimes.
2. More interestingly, length extrapolation seems to depend on architectural/data composition choices. Specifically, BERT-like models without special data preparation do not extrapolate to longer sequences, while other models like ALBERT, or BERT with carefully selected data (such as diverse sequence lengths, or pre-trained BERT) do extrapolate.
3. The extrapolating models all seem to evolve attention heads dedicated to either association (long-range identity matching) or manipulation (short-range operations). We provide evidence that pre-trained BERT (which is pre-trained on a seemingly unrelated dataset) generalizes because it has learned such heads.
4. The non-extrapolating models seem to solve the classical generalization problem using a certain shortcut-like solution, whereby using the specificity of the group operations they are able to jump to the end of the chain of reasoning, and then complete the rest of the variables by following the reasoning both from the start and the end of the chain.

We interpret our findings as follows:

(i) Classical generalization can be a deceptive metric, as there might be unexpected ways to solve the problem. This is famously related to the issue of endowing machine learning systems with common sense reasoning. Namely, we hope that when an ML system solves a task, it does so in "the way humans do it", but of course, nothing guarantees that this will happen. Our findings are consistent with the current methodology of increasing the diversity of the training data, which seems crucial for generalization.
(ii) ALBERT-like models, where a layer is repeated several times, seem to be an ideal structure for problems that could be described algorithmically as a "for loop" (as is the case with following


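To make the task description in Section 1.1 concrete, the following is a minimal sketch of how a LEGO-style instance might be generated and resolved. It rests on several assumptions that are not stated in this section: the group is taken to be the sign group {+1, -1}, clauses are written in a hypothetical `x = +y` text form, and the function names `make_lego_chain` and `resolve` are illustrative only. Shuffling the clauses means that resolving a variable requires finding the clause that mentions it elsewhere in the input (association) and then applying the group element (manipulation).

```python
import random
import string


def make_lego_chain(n_vars, rng):
    """Build one hypothetical LEGO-style instance over the sign group {+1, -1}.

    Returns (clauses, values): clauses is a shuffled list of strings, and
    values maps each variable to its resolved value, in chain order.
    """
    names = rng.sample(string.ascii_lowercase, n_vars)  # distinct variable names
    clauses = []
    values = {}
    prev_name, prev_value = None, 1
    for name in names:
        sign = rng.choice([1, -1])
        op = "+" if sign == 1 else "-"
        # The root clause applies the group element to the constant 1;
        # every later clause applies it to the previous variable in the chain.
        rhs = "1" if prev_name is None else prev_name
        clauses.append(f"{name} = {op}{rhs}")
        prev_value = sign * prev_value
        values[name] = prev_value
        prev_name = name
    rng.shuffle(clauses)  # the chain order is hidden from the reader of the input
    return clauses, values


def resolve(clauses):
    """Resolve every variable by repeatedly following the chain of clauses."""
    parsed = {}
    for clause in clauses:
        lhs, rhs = clause.split(" = ")
        parsed[lhs] = (1 if rhs[0] == "+" else -1, rhs[1:])
    values = {"1": 1}  # the constant that roots the chain
    while len(values) <= len(parsed):
        for lhs, (sign, src) in parsed.items():
            if lhs not in values and src in values:
                values[lhs] = sign * values[src]
    del values["1"]
    return values


if __name__ == "__main__":
    rng = random.Random(0)
    clauses, values = make_lego_chain(12, rng)
    print("; ".join(clauses))
    print(values)
```

Under these assumptions, the partial supervision described in Section 1.1 would correspond to computing the training loss only on the first 6 variables in chain order (for example `list(values)[:6]`), while test accuracy is computed on all 12.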