ADDRESSING SOME LIMITATIONS OF TRANSFORMERS WITH FEEDBACK MEMORY

Abstract

Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it prevents the model from fully exploiting the sequential nature of the input: the representation at a given layer can only access representations from lower layers, rather than the higher level representations that are already available. In this work, we propose the Feedback Transformer architecture, which exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representations of the past. We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that this increased representation capacity allows small, shallow models to achieve much stronger performance than comparable Transformers.

1. INTRODUCTION

In recent years, the Transformer architecture (Vaswani et al., 2017) has brought large improvements to a wide range of Natural Language Processing tasks such as machine translation, sentence representation (Devlin et al., 2019), and summarization (Edunov et al., 2019). Transformers are also successfully used as autoregressive models on sequential tasks such as language modeling (Dai et al., 2019; Rae et al., 2020) and reinforcement learning (Parisotto et al., 2019). Unlike more traditional recurrent architectures such as RNNs and LSTMs, the Transformer architecture processes a sequence in parallel in an order-invariant way. Techniques such as position embeddings (Sukhbaatar et al., 2015; Shaw et al., 2018) and attention masking are required to capture input order information.

In this work, we focus on several limitations of the Transformer architecture as an autoregressive model and present a straightforward solution: Feedback memory. These limitations and our proposed solution target sequential token prediction tasks, such as language modeling or other auto-regressive generative tasks. The feedforward nature of Transformers makes them efficient on modern hardware, but it restricts the Transformer from taking full advantage of the input's sequential property. In particular, the current hidden representation of a Transformer only accesses the past representations of lower layers, even though, as an autoregressive model, it has already computed higher level representations of the past. At generation time, the Transformer produces one token at a time, so it could access these representations for better performance, but it does not exploit them at training time because training is parallelized across timesteps. However, if these past higher level representations could be used at training time, they would enrich future lower level representations, enabling shallower models to have the same representation power.
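The access-pattern limitation described above can be made concrete with a small dependency sketch (our own illustration, not the paper's code): in a standard decoder-only Transformer, the state at layer l and timestep t reads only layer l-1 at steps up to t, so lower layers never see the higher level representations already computed for the past.

```python
# Toy dependency check (illustration only): which (layer, step) states does a
# given state read directly?
def standard_inputs(layer, t):
    """Standard Transformer: layer `layer` at step `t` attends to layer-1 only."""
    if layer == 0:
        return set()  # the first layer reads only the input embeddings
    return {(layer - 1, s) for s in range(t + 1)}

def feedback_inputs(layer, t, n_layers):
    """Feedback Transformer: every layer reads the merged memory, which
    aggregates all layers at every earlier step."""
    return {(l, s) for s in range(t) for l in range(n_layers)}

# In a 3-layer standard model, layer 1 at step 3 never sees the top layer's past:
assert (2, 2) not in standard_inputs(1, 3)
# With feedback memory, it does:
assert (2, 2) in feedback_inputs(1, 3, n_layers=3)
```

The function names and the set-of-pairs encoding are hypothetical conveniences; the point is only that the feedback access pattern strictly contains the standard one for past steps.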
Another inherent limitation of Transformers on sequential tasks is the lack of recursive computation (Dehghani et al., 2018): the number of transformations possible on the input is bounded by the model depth. Such disadvantages impact tasks that require careful tracking of a world state or modeling hierarchical structures (Tran et al., 2018; Hahn, 2020). On the other hand, while RNNs can maintain an internal state for an unbounded time while accumulating more computations upon it, the size of this internal state is limited by the dimension of the hidden state.

In this work, we propose a novel autoregressive model, the Feedback Transformer, that makes all previous hidden representations accessible to the computation of a representation at any depth: the model feeds back previous computations to itself. The feedback allows the model to perform recursive computation, building stronger representations iteratively upon previous states. To achieve this, we modify self-attention to attend to higher level representations rather than lower ones. As shown in Figure 1, the Feedback Transformer merges the hidden states from all layers into a single vector for every time step and stores them in a memory. Instead of self-attention, all subsequent layers attend to this memory, which means every previously computed representation is accessible by all future layers, mediated by the memory. This allows Feedback Transformers to recursively compute and transform an input as many times as the input length, which standard Transformers cannot achieve. And unlike RNNs, which can also perform recursive computation, the amount of information a Feedback Transformer can maintain is not limited by the dimension of a single hidden state.

There are computational benefits to this straightforward modification. First, it uses less memory because all the layers share a single Feedback memory, reducing the memory size by L times, where L is the number of layers.
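A minimal numpy sketch of this mechanism, under our own assumptions about the details (a softmax-weighted sum as the merging rule, and names like `merge` and `attend` that are purely illustrative): per-layer states at each step are collapsed into one memory vector, key/value projections over that memory are computed once and shared by all layers, and the cache holds one vector per step instead of one per layer per step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 3, 4

# Learnable per-layer mixing logits (softmax-weighted merge is our assumption).
w = np.zeros(n_layers)

def merge(layer_states):
    """Collapse the per-step stack of layer outputs (n_layers, d) into the
    single memory vector that all layers will later attend to."""
    a = np.exp(w - w.max())
    a /= a.sum()
    return a @ layer_states

# One shared key/value projection for the whole model, instead of one per layer.
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

def attend(query, memory):
    """Attention of one query (d,) over the shared feedback memory (t, d)."""
    k, v = memory @ Wk, memory @ Wv  # projected once, reused by every layer
    scores = k @ query / np.sqrt(d)
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return p @ v

# With uniform logits the merged memory vector is just the layer average.
h_t = np.eye(n_layers, d)  # toy per-layer states at one timestep
assert np.allclose(merge(h_t), [1/3, 1/3, 1/3, 0.0])

# The cache stores one vector per step rather than n_layers vectors per step.
T = 10
standard_cache, feedback_cache = n_layers * T * d, T * d
assert standard_cache == n_layers * feedback_cache
```

Note that `attend` sees only the merged memory of earlier steps; this is what lets every layer reuse high level past computation, at the cost of serializing training over timesteps.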
There is also less computation because we share the key and value projections during attention computation, which increases the speed of the attention over the Feedback memory. Further, GPU memory usage is reduced due to the memory sharing (the overall model is 2x smaller), allowing the batch size to be increased for computational efficiency. During inference, the increased batch size contributes to substantially faster decoding speeds.

In summary, our main contributions are: (1) the Feedback Transformer architecture, which completely changes the way a Transformer works so that available higher level representations are accessed immediately; (2) we show that the Feedback Transformer can achieve state-of-the-art results with smaller, shallower models that have faster decoding speed and a smaller memory footprint; (3) the Feedback Transformer uses substantially less memory during training and inference.

2. RELATED WORK

Several previous works have analyzed the limitations of Transformer architectures, such as the inability to process input sequentially (Dehghani et al., 2018) or to represent hierarchical structure (Tran et al., 2018). Hahn (2020) demonstrates that Transformers cannot model structures involving bounded recursion, such as closing parentheses. Pérez et al. (2019) study Transformers in the context of Turing machines, where they must produce an unbounded number of decoding steps. Various works probing Transformers have identified limitations where Transformers may not have the computational capacity of a recurrent architecture such as an LSTM (Hahn, 2020).

From the architectural perspective, our work shares similarities with recurrent networks augmented with external shared memories (Graves et al., 2014; Joulin & Mikolov, 2015; Sukhbaatar et al., 2015). For example, the stack-augmented RNN of Joulin & Mikolov (2015) adds an external memory to a recurrent network to keep long term dependencies. Closer to our work, the Neural Turing Machine of Graves et al. (2014) models an unconstrained memory that resembles the self-attention layer of a Transformer. Further improvements to recurrent networks, such as the Gated Feedback RNN (Chung et al., 2015), are based on better controlling the signal from different layers, and have been extended to feedback through multiple pathways (Jin et al., 2017). These works build on recurrent networks with additional components to store long term dependencies.

Other works have studied modifications to the Transformer architecture that enrich its structure with components inspired by recurrent networks. For example, Wang et al. (2019) propose adding a local recurrent sublayer to the Transformer layer to remove the need for position embeddings in the multi-head self-attention layers. The Universal Transformer (Dehghani et al., 2018) shares the parameters between the layers of a Transformer, yielding a network that is recurrent in depth. Hao et al. (2019) and Chen et al. (2018) augment Transformers with a second, recurrent encoder. As opposed to our work, these prior investigations do not change the computational path in a Transformer to reduce the discrepancy between training and inference time. Closer to our work, Merity (2019) proposes adding a self-attention layer on top of the past outputs of an LSTM cell. However, this approach keeps the recurrent and self-attention mechanisms decoupled, as opposed to ours, which makes the attention mechanism itself recurrent. In particular, the LSTM layer of Merity (2019) still intrinsically has a bottleneck corresponding to the dimension of its hidden layer.

