NEURAL ATTENTION MEMORY

Abstract

Scaled dot-product attention has become the essence of state-of-the-art deep neural networks for various machine learning tasks. Despite its ubiquitous success, it is inefficient for long-sequence tasks and ill-suited for tasks that require memory states, such as compositional generalization. We propose a novel perspective on the attention mechanism by reinventing it as a memory architecture for neural networks, namely Neural Attention Memory (NAM). NAM follows the same query-key-value structure while constructing a memory matrix, reducing the computational complexity from quadratic to linear in the sequence length. NAM writes to the memory matrix by adding outer products of value and unit key vectors, and reads from it by multiplying the matrix with a unit query vector. We define the read and write primitives of NAM and mathematically prove their functionalities. One benefit of NAM is that it can serve as a basis for efficient linear attention, namely normalized outer-product attention. We evaluate a NAM-based Transformer on long-range tasks and demonstrate NAM's efficiency and efficacy. Most importantly, NAM provides building blocks for memory-augmented neural networks. We propose two NAM-augmented neural networks, namely Long Short-Term Attention Memory (LSAM) and the NAM Turing Machine (NAM-TM), and test their compositional generalization capabilities on four different tasks. LSAM replaces the LSTM's long-term cell state with a NAM memory matrix, and NAM-TM implements a Turing-tape structure using NAM read/write primitives. The experiments show that both offer greater computational power than the Transformer and LSTM, as well as the DNC. NAM opens up possibilities in diverse research problems, including hierarchical data modeling, efficient edge inference, and few-shot learning.

1. INTRODUCTION

Scaled dot-product attention (Vaswani et al., 2017) has become a core mechanism of state-of-the-art deep learning models for a variety of machine learning tasks, including natural language processing (Devlin et al., 2018), multi-modal tasks (Li et al., 2019), and graph data processing (Hamilton et al., 2017). In particular, Transformers based on self-attention have replaced recurrent neural networks (RNNs) by outperforming them on most tasks.

Despite this success, the mechanism has limitations. First, it needs the entire sequence to compute one attention step, so its computational complexity is quadratic in the sequence length. Hence, it is inefficient for long-sequence tasks (Tay et al., 2020) and edge inference environments (Tambe et al., 2020). Second, its stateless design enables efficient parallelism but makes it impossible to solve tasks that require memory states. As a result, Transformers fail to generalize rules that require inductive bias (Dehghani et al., 2018) or compositional generalization (Lake & Baroni, 2018).

There have been studies designing neural networks with external memory to solve algorithmic tasks where Transformers fail. These memory-augmented neural networks (MANNs) provide differentiable read/write functions that can be trained by backpropagation. Some implement basic data structures such as stacks (Joulin & Mikolov, 2015) and queues (Grefenstette et al., 2015), while others implement complex memory structures using attention mechanisms (Graves et al., 2014; 2016). They outperform generic neural networks on synthetic algorithmic tasks but are considered impractical due to their complexity and inefficiency.

In this work, we reinvent the attention mechanism as a memory architecture for neural networks, namely neural attention memory (NAM). NAM's design objective is to build a simple, efficient, yet powerful external memory that also incorporates the attention mechanism.
Following the same
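As a rough illustration of the primitives described above (not the paper's exact formulation), the following NumPy sketch writes a value under a unit key by accumulating an outer product, and reads by multiplying the memory matrix with a unit query; when the keys are orthonormal, the read recovers the stored value exactly. The function and variable names are illustrative.

```python
import numpy as np

def nam_write(M, k, v):
    """Write value v under key k by adding the outer product v k^T.

    k is normalized to a unit vector before writing, per the unit-key
    assumption; M has shape (value_dim, key_dim).
    """
    k = k / np.linalg.norm(k)
    return M + np.outer(v, k)

def nam_read(M, q):
    """Read from memory by multiplying M with a unit query vector q.

    If q matches one stored unit key and the keys are orthogonal,
    the result is the corresponding stored value.
    """
    q = q / np.linalg.norm(q)
    return M @ q

# Example: store two values under orthogonal keys, then retrieve them.
M = np.zeros((2, 3))                                  # value_dim=2, key_dim=3
M = nam_write(M, np.array([1.0, 0.0, 0.0]), np.array([1.0, 2.0]))
M = nam_write(M, np.array([0.0, 1.0, 0.0]), np.array([3.0, 4.0]))
print(nam_read(M, np.array([1.0, 0.0, 0.0])))         # recovers [1. 2.]
```

Because each write is a rank-one update and each read is a single matrix-vector product, the cost per token is constant in the sequence length, which is the source of the linear (rather than quadratic) complexity claimed above.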

