MEMFORMER: THE MEMORY-AUGMENTED TRANSFORMER

Abstract

Transformer models have achieved remarkable success in various NLP tasks. However, these models are inefficient on long sequences, as the complexity of their self-attention module scales quadratically with the sequence length. To remedy this limitation, we present Memformer, a novel language model that utilizes a single unified memory to encode and retrieve past information. It includes a new optimization scheme, Memory Replay Back-Propagation, which promotes long-range back-propagation through time with a significantly reduced memory requirement. Memformer achieves O(n) time complexity and O(1) space complexity in processing long sequences, meaning that the model can handle infinitely long sequences during inference. Our model is also compatible with other self-supervised tasks that further improve performance on language modeling. Experimental results show that Memformer outperforms the previous long-range sequence models on 

1. INTRODUCTION

Memory has a fundamental role in human cognition. Humans perceive and encode sensory information into a compressed representation in neurons, and our brains can later retrieve that information effectively to accomplish various tasks. The formation of memories involves complex cognitive processes, and modeling the behavior of human memory remains a challenging research problem in many academic areas. Many researchers have attempted to incorporate memory systems into artificial neural networks. Early works such as recurrent neural networks (RNNs) (Rumelhart et al., 1988), including LSTMs (Hochreiter & Schmidhuber, 1997), model temporal sequences with an internal compressed state vector as memory. Although RNNs are theoretically Turing-complete, they are limited in preserving long-term information due to this memory bottleneck. To alleviate the limitation, more powerful memory network architectures such as the Neural Turing Machine (NTM) (Graves et al., 2014) and the Differentiable Neural Computer (DNC) (Graves et al., 2016) have been proposed, leveraging a large external memory. However, due to their complex memory addressing mechanisms, they are not widely used in NLP. More recently, Vaswani et al. (2017) proposed the Transformer, which discards memory and recurrence altogether. Instead, it maintains all O(N^2) pairwise dependencies in the sequence with self-attention (Bahdanau et al., 2015). The Transformer and its successors have achieved great success in various NLP tasks. Nevertheless, the quadratic complexity can be extremely costly when the input sequence is long. Several works address this limitation of self-attention, including Reformer, Sparse Transformer, Longformer, and Linformer (Child et al., 2019; Kitaev et al., 2020; Wang et al., 2020). They successfully reduce the complexity of self-attention and can process longer sequences. However, their space cost still scales with the sequence length, which cannot be fully eliminated without memory and recurrence. 
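To make the quadratic cost concrete, the following is a minimal single-head self-attention sketch in NumPy (a generic illustration, not code from this paper): the score matrix holds one entry per pair of tokens, so both its computation time and its storage grow as O(N^2) in the sequence length N. All variable names here (wq, wk, wv) are illustrative.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (N, d) token embeddings; wq, wk, wv: (d, d) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    # The (N, N) score matrix is the source of the quadratic cost.
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Numerically stable row-wise softmax.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (N, d) contextualized representations

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (8, 16)
```

Doubling N quadruples the size of `scores`, which is exactly the scaling the sparse- and low-rank-attention variants above try to avoid, and which a fixed-size recurrent memory eliminates entirely.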
Transformer-XL (Dai et al., 2019) re-introduces the concepts of memory and recurrence. It caches each layer's self-attention hidden states in a fixed-size queue and re-uses them in later attention computations. However, memory stored as raw hidden states cannot effectively compress high-level information, so in practice Transformer-XL needs a very large memory size to perform well. Compressive

