MEMFORMER: THE MEMORY-AUGMENTED TRANSFORMER

Abstract

Transformer models have achieved remarkable success in various NLP tasks. However, these models are inefficient on long sequences, as the complexity of their self-attention module scales quadratically with the sequence length. To remedy this limitation, we present Memformer, a novel language model that utilizes a single unified memory to encode and retrieve past information. It includes a new optimization scheme, Memory Replay Back-Propagation, which enables long-range back-propagation through time with a significantly reduced memory requirement. Memformer achieves O(n) time complexity and O(1) space complexity in processing long sequences, meaning that the model can handle sequences of theoretically unlimited length during inference. Our model is also compatible with other self-supervised tasks that further improve its performance on language modeling. Experimental results show that Memformer outperforms previous long-range sequence models on WikiText-103 language modeling.

1. INTRODUCTION

Memory plays a fundamental role in human cognition. Humans perceive and encode sensory information into compressed representations in neurons, and our brains can later retrieve that past information to accomplish various tasks. The formation of memories involves complex cognitive processes, and modeling the behavior of human memory remains a challenging research problem in many academic areas. Many researchers have attempted to incorporate memory systems into artificial neural networks. Early works such as recurrent neural networks (RNNs) (Rumelhart et al., 1988), including LSTM (Hochreiter & Schmidhuber, 1997), model temporal sequences with an internal compressed state vector as memory. Although RNNs are theoretically Turing-complete, they are limited in preserving long-term information due to this memory bottleneck. To alleviate the limitation, more powerful memory network architectures such as the Neural Turing Machine (NTM) (Graves et al., 2014) and the Differentiable Neural Computer (DNC) (Graves et al., 2016) leverage a large external memory. However, due to their complex memory addressing mechanisms, they are not widely used in NLP. More recently, Vaswani et al. (2017) proposed the Transformer, which discards memory and recurrence entirely; instead, it maintains all O(N^2) pairwise dependencies in the sequence with self-attention (Bahdanau et al., 2015). The Transformer and its successors have achieved great success in various NLP tasks. Nevertheless, the quadratic complexity is extremely costly when the input sequence is long. Several works address the limitations of self-attention, including Reformer, Sparse Transformer, Longformer, and Linformer (Child et al., 2019; Kitaev et al., 2020; Wang et al., 2020). They successfully reduce the complexity of self-attention and can process longer sequences. However, the space cost still scales with sequence length, and it cannot be fully eliminated without memory and recurrence.
Transformer-XL (Dai et al., 2019) re-introduces the concepts of memory and recurrence. It caches each layer's self-attention hidden states in a fixed-size queue and re-uses them in later attention computations. However, memory in the form of raw hidden states cannot effectively compress high-level information, so in practice Transformer-XL needs a very large memory size to perform well. Compressive Transformer (Rae et al., 2020) improves upon Transformer-XL by further compressing its memories into fewer vectors via a compression network. Still, as noted in the respective papers, both Transformer-XL and Compressive Transformer have a theoretical maximum temporal range due to the uni-directional self-attention constraint.

In this work, we propose Memformer, which combines a more efficient memory system with a Transformer encoder-decoder architecture. The resulting model has a theoretically unlimited temporal range of memorization. We also improve the relative positional encoding of Transformer-XL with a simplified version. As traditional back-propagation through time (BPTT) has an unaffordable memory cost for our model, we introduce a new optimization scheme, memory replay back-propagation (MRBP), to significantly reduce the memory cost of training recurrent networks with large memory. We show that Memformer is compatible with different self-supervised tasks, which can further improve its performance on language modeling. Our main contributions can be summarized as follows: (1) We introduce a new optimization scheme for training recurrent neural networks with large memory and a long temporal range. (2) We propose Memformer, a Transformer-based model that outperforms Transformer-XL and Compressive Transformer on WikiText-103 language modeling. (3) We show that Memformer is compatible with a wide range of self-supervised tasks beyond autoregressive language modeling.
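The core idea of MRBP can be illustrated with a short PyTorch sketch: run the forward pass over all segments without building computation graphs while caching each segment's incoming memory, then replay the segments in reverse order, rebuilding one local graph at a time and threading the memory gradient backward. This is a simplified illustration, not the authors' implementation; the function and variable names (`mrbp_update`, `segments`, `loss_fn`) are hypothetical.

```python
import torch

def mrbp_update(model, segments, memory, loss_fn):
    """Minimal sketch of memory replay back-propagation (MRBP).

    `model(x, mem)` returns (output, new_memory). Only one segment's
    activation graph is alive at any time, so peak memory is O(1) in
    the number of segments rather than O(T) as in full BPTT.
    """
    # Phase 1: forward over all segments without autograd,
    # recording the memory state entering each segment.
    mems = []
    with torch.no_grad():
        for seg in segments:
            mems.append(memory)
            _, memory = model(seg, memory)

    # Phase 2: replay segments in reverse, recomputing each local
    # graph and passing the memory gradient to the previous segment.
    mem_grad = None
    total_loss = 0.0
    for seg, mem_in in zip(reversed(segments), reversed(mems)):
        mem_in = mem_in.detach().requires_grad_(True)
        out, mem_out = model(seg, mem_in)
        loss = loss_fn(out)
        total_loss += loss.item()
        if mem_grad is not None:
            # Back-propagate the segment loss together with the
            # gradient flowing in from later segments' memory reads.
            torch.autograd.backward([loss, mem_out], [None, mem_grad])
        else:
            loss.backward()  # last segment: no future memory gradient
        mem_grad = mem_in.grad
    return total_loss
```

Parameter gradients accumulate across the replayed segments, so a single optimizer step after `mrbp_update` uses the full long-range gradient while only ever materializing one segment's activations.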

2.1. SIMPLIFIED RELATIVE POSITIONAL ENCODING

The standard attention mechanism involves the dot product between the query vector q_i and the key vector k_j, where W_q, W_k, W_v are the projection matrices that produce the queries, keys, and values. Transformer-XL proposes a new relative positional encoding method, in which the attention computation is decomposed into four parts: (a) content-based addressing, (b) content-dependent positional bias, (c) global content bias, and (d) global positional bias. The relative positional embedding R_{i-j} provides the positional information between every pair x_i and x_j, and u and v are trainable parameters:

A_{i,j} = E_{x_i}^T W_q^T W_k E_{x_j} (a) + E_{x_i}^T W_q^T W_r R_{i-j} (b) + u^T W_k E_{x_j} (c) + v^T W_r R_{i-j} (d).   (1)

However, we observe that terms (c) and (d) can be simplified by introducing bias terms into the original query and key projections. Thus, we re-formalize self-attention as shown in Eq. 3. The product of b_q and K_x is equivalent to term (c), the global content bias. For term (d), since v, W_r, and R_{i-j} are all trainable parameters, it can be simplified into the product of b_q and b_k, which has a similar effect to the global positional bias. Different from Transformer-XL, which only injects positional information into the attention computation, our attention mechanism in Eq. 4 also attends over the positional information when aggregating values, yielding more robust output representations.

Q_x = W_q E_x + b_q;   K_x = W_k E_x + b_k;   V_x = W_v E_x + b_v   (2)

A_{i,j} = Q_{x_i}^T K_{x_j} + Q_{x_i}^T R_{i-j}   (3)

H_i = Σ_j A_{i,j} (V_{x_j} + R_{i-j})   (4)
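The simplified scheme of Eqs. 2–4 can be sketched in a few lines of NumPy. This is an illustrative single-head sketch, not the paper's implementation: the softmax normalization and 1/sqrt(d) scaling are standard attention conventions assumed here, and all names (`rel_pos_attention`, the R indexing) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rel_pos_attention(E, Wq, Wk, Wv, bq, bk, bv, R):
    """Single-head attention with simplified relative positions.

    E: (n, d) token embeddings.
    R: (2n-1, d) relative positional embeddings, indexed so that
       R[i - j + n - 1] corresponds to the offset i - j.
    """
    n, d = E.shape
    # Eq. 2: projections with bias terms absorbing the global biases.
    Q = E @ Wq.T + bq
    K = E @ Wk.T + bk
    V = E @ Wv.T + bv
    # Gather R_{i-j} into an (n, n, d) tensor of pairwise offsets.
    idx = np.arange(n)[:, None] - np.arange(n)[None, :] + n - 1
    Rij = R[idx]
    # Eq. 3: A_{i,j} = Q_i . K_j + Q_i . R_{i-j}
    A = Q @ K.T + np.einsum('id,ijd->ij', Q, Rij)
    A = softmax(A / np.sqrt(d), axis=-1)
    # Eq. 4: positional information also enters the value aggregation.
    H = np.einsum('ij,ijd->id', A, V[None, :, :] + Rij)
    return H
```

Note how term (c) of Eq. 1 is recovered by the `bq @ K.T` component hidden inside `Q @ K.T`, so no separate u and v parameters are needed.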

2.2. MEMFORMER

This section explains the details of Memformer. We first review the language modeling background and a new way of formulating language generation as text continuation. We then describe an instance of this formulation, our proposed Memformer model, followed by the multi-task training setting. Finally, we describe the newly proposed optimization scheme, memory replay back-propagation, which tackles the memory cost problem.

