UNVEILING TRANSFORMERS WITH LEGO: A SYNTHETIC REASONING TASK

Abstract

We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain lengths at training and test time), as well as architectural variants such as weight-tied layers or added convolutional components. We study how trained models eventually succeed at the task, and in particular we manage to interpret some of the attention heads as well as how information flows through the network. Notably, we identify a novel association pattern that globally attends only to identical tokens. Based on these observations, we hypothesize that pretraining helps on LEGO because it instills certain structured attention patterns, and we verify this hypothesis experimentally. We also observe that in some data regimes the trained Transformer finds "shortcut" solutions for following the chain of reasoning, which impede the model's robustness, and we propose ways to prevent such shortcuts. Motivated by our findings on structured attention patterns, we propose to replace certain attention heads with hardcoded patterns. This architectural change significantly reduces FLOPs and maintains or even improves the model's performance in large-scale pretraining.
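To make the setup concrete, the following is a minimal sketch of how a LEGO-style instance could be generated, assuming the group of operations is the sign group {+1, -1}. The function name `make_lego_instance`, the clause syntax, and the choice to shuffle clauses are illustrative assumptions rather than the paper's exact specification.

```python
import random
import string

def make_lego_instance(chain_length=4, seed=None):
    """Generate one illustrative LEGO-style chain.

    The root variable is assigned a constant from {+1, -1}; each
    subsequent clause applies a group element (+ or -) to the previous
    variable. The learning task is to resolve every variable's value
    by following the chain of reasoning.
    """
    rng = random.Random(seed)
    names = rng.sample(string.ascii_lowercase, chain_length)
    signs = [rng.choice([+1, -1]) for _ in range(chain_length)]

    clauses, values = [], {}
    # Root clause: the first variable is set to a constant.
    clauses.append(f"{names[0]} = {'+' if signs[0] > 0 else '-'}1")
    values[names[0]] = signs[0]
    # Each later clause rewrites the previous variable with a sign.
    for i in range(1, chain_length):
        op = '+' if signs[i] > 0 else '-'
        clauses.append(f"{names[i]} = {op}{names[i-1]}")
        values[names[i]] = signs[i] * values[names[i - 1]]

    # Shuffling the clauses forces a model to locate the chain rather
    # than read values off left to right.
    rng.shuffle(clauses)
    return "; ".join(clauses), values

sentence, targets = make_lego_instance(chain_length=4, seed=0)
print(sentence)   # e.g. "c = -m; m = +1; x = +c; q = -x"
print(targets)    # ground-truth value in {+1, -1} for each variable
```

Because clause order is shuffled, resolving the last variable requires composing the entire chain, which is exactly the long-range reasoning behavior the task is meant to probe.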

1. INTRODUCTION

The deep learning revolution is about training large neural networks on vast amounts of data. The first field transformed by this methodology was computer vision, crucially leveraging the convolutional neural network architecture LeCun et al. (1989); Krizhevsky et al. (2012). More recently, natural language processing was revolutionized by the Transformer architecture Vaswani et al. (2017). Transformers are designed to process input represented as a "set of elements" (e.g., the words in a sentence together with their positional encodings). This is an incredibly generic assumption, and thus Transformers can be applied to a wide variety of tasks, including vision Dosovitskiy et al. (2021), reinforcement learning Chen et al. (2021a), and protein structure prediction Rives et al. (2021); Jumper et al. (2021), among others, or even jointly across domains to produce generalized agents Reed et al. (2022). In fact, learning with Transformers is rapidly becoming the norm in deep learning.

Transformer models display excellent performance on the standard criteria of training and test error (e.g., for masked language prediction or translation). What makes them particularly noteworthy, however, is that large-scale Transformer models seem to exhibit unexpected emergent behaviors, such as basic reasoning ability Thoppilan et al. (2022); Brown et al. (2020); Chowdhery et al. (2022); Du et al. (2021); Rae et al. (2021); Hoffmann et al. (2022); Smith et al. (2022); Zhang et al. (2022); Wei et al. (2022); Nye et al. (2022), excellent fine-tuning performance Hu et al. (2022); Thoppilan et al. (2022); Nye et al. (2022); Rae et al. (2021); Polu et al. (2022), or zero-shot learning Brown et al. (2020); Chowdhery et al. (2022); Du et al. (2021); Rae et al. (2021); Hoffmann et al. (2022); Smith et al. (2022); Zhang et al. (2022). Currently, there is a remarkable community effort towards at-scale experimental investigation of Transformers, essentially trying to find out what such models can do when they become large enough and are trained on large and diverse enough datasets. The successes are striking and capture the imagination Brown et al. (2020); Ramesh et al. (2022). Yet, for all of these wonders, there is very little understanding of how these models learn, or in fact what they learn. Answering such questions in at-scale experiments is particularly challenging, as one has little control over the data when hundreds of billions of tokens are harvested from various sources.
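As one example of the kind of structure one can hope to uncover, the association pattern described in the abstract, which attends only to identical tokens, is simple enough to write down directly. Below is a minimal sketch in PyTorch of such a hardcoded pattern; the function name, the uniform weighting, and the choice to exclude a position's attention to itself are our illustrative assumptions, not the paper's exact implementation.

```python
import torch

def association_attention(token_ids, values):
    """Hardcoded 'association' pattern: each position attends uniformly
    to all other positions holding an identical token.

    token_ids: (batch, seq) integer token ids
    values:    (batch, seq, d) value vectors to be mixed
    """
    # mask[b, i, j] is True iff tokens i and j match (excluding i == j).
    same = token_ids.unsqueeze(2) == token_ids.unsqueeze(1)  # (B, S, S)
    eye = torch.eye(token_ids.shape[1], dtype=torch.bool,
                    device=token_ids.device)
    mask = same & ~eye
    # Uniform attention over identical tokens; positions with no match
    # attend nowhere (their weights stay zero). Match counts are
    # integer-valued, so clamp(min=1.0) only guards the zero case.
    weights = mask.float()
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1.0)
    return weights @ values  # (B, S, d)
```

Because the pattern depends only on token identity, no query-key parameters are needed, which is what makes replacing a learned head with such a fixed pattern a FLOP-saving architectural change.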

