ADDRESSING SOME LIMITATIONS OF TRANSFORMERS WITH FEEDBACK MEMORY

Abstract

Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it limits the model's ability to fully exploit the sequential nature of the input. The representation at a given layer can only access representations from lower layers, rather than the higher-level representations that are already available. In this work, we propose the Feedback Transformer architecture, which exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity enables small, shallow models to achieve much stronger performance than comparable Transformers.
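
To make this idea concrete, the following minimal sketch (ours, not the authors' implementation) shows one way a feedback-memory stack could be wired: every layer at the current timestep attends over a single memory vector per past timestep, and that vector is a learned mixture of all layer outputs from that step, so that low layers can read high-level summaries of the past. The module names, dimensions, single-head attention, and omitted layer normalization are illustrative assumptions.

    # Minimal sketch of feedback memory (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeedbackTransformerSketch(nn.Module):
        def __init__(self, d_model=64, n_layers=4):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.ModuleDict({
                    "attn": nn.MultiheadAttention(d_model, num_heads=1, batch_first=True),
                    "ff": nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                        nn.Linear(d_model, d_model)),
                }) for _ in range(n_layers)
            )
            # One scalar weight per depth (input embedding + each layer), used to
            # merge all depths of a timestep into a single memory vector.
            self.merge_weights = nn.Parameter(torch.zeros(n_layers + 1))

        def forward(self, embeddings):
            # embeddings: (batch, seq_len, d_model); processed one step at a time.
            batch, seq_len, d_model = embeddings.shape
            memory = []   # one merged vector per previous timestep
            outputs = []
            for t in range(seq_len):
                x = embeddings[:, t:t + 1, :]
                per_layer_states = [x]
                # Every layer attends over the shared past memory plus its current input.
                mem = torch.cat(memory + [x], dim=1)
                for layer in self.layers:
                    attended, _ = layer["attn"](x, mem, mem)
                    x = x + attended              # residual connections; norms omitted
                    x = x + layer["ff"](x)
                    per_layer_states.append(x)
                outputs.append(x)
                # Merge all depths of this step into a single memory vector.
                w = F.softmax(self.merge_weights, dim=0)
                memory.append(sum(wi * s for wi, s in zip(w, per_layer_states)))
            return torch.cat(outputs, dim=1)

    # Example: 2 sequences of length 5 -> output of shape (2, 5, 64).
    model = FeedbackTransformerSketch()
    out = model(torch.randn(2, 5, 64))

Note that the per-timestep loop gives up the training-time parallelism of a standard Transformer over the sequence dimension; this is the trade-off discussed in the introduction.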

1. INTRODUCTION

In recent years, the Transformer architecture (Vaswani et al., 2017) has brought large improvements to a wide range of Natural Language Processing tasks such as machine translation, sentence representation (Devlin et al., 2019), and summarization (Edunov et al., 2019). Transformers are also successfully used as autoregressive models on sequential tasks such as language modeling (Dai et al., 2019; Rae et al., 2020) and reinforcement learning (Parisotto et al., 2019). Unlike more traditional recurrent architectures such as RNNs and LSTMs, the Transformer architecture processes a sequence in parallel in an order-invariant way. Techniques such as position embeddings (Sukhbaatar et al., 2015; Shaw et al., 2018) and attention masking are required to capture input order information.

In this work, we focus on several limitations of the Transformer architecture as an autoregressive model and present a straightforward solution: Feedback memory. These limitations and our proposed solution target sequential token prediction tasks, such as language modeling and other auto-regressive generative tasks.

The feedforward nature of Transformers makes them efficient on modern hardware, but it restricts the Transformer from taking full advantage of the input's sequential nature. In particular, the current hidden representation of a Transformer only accesses the past representations of lower layers, even though higher-level representations of the past have already been computed in an autoregressive model. At generation time, the Transformer produces only one token at a time, so it could access these representations for better performance, but it does not exploit them at training time due to parallelization. However, if these past higher-level representations could be used at training time, they would enrich future lower-level representations, enabling shallower models to have the same representational power.

Another inherent limitation of Transformers on sequential tasks is the lack of recursive computation (Dehghani et al., 2018): the number of transformations possible on the input is bounded by the model depth. Such disadvantages have an impact on tasks that require careful tracking of a world state or modeling hierarchical structures (Tran et al., 2018; Hahn, 2020). In contrast, while RNNs can maintain an internal state for an unbounded time and accumulate further computation upon it, the size of this internal state is limited by the dimension of the hidden state.

In this work, we propose a novel autoregressive model, the Feedback Transformer, that makes all previous hidden representations accessible to the computation of a representation at any depth: the model feeds back previous computations to itself. The feedback allows the model to perform

