DISTRIBUTED ASSOCIATIVE MEMORY NETWORK WITH ASSOCIATION REINFORCING LOSS

Abstract

Despite recent progress in memory-augmented neural network research, associative memory networks with a single external memory still show limited performance on complex relational reasoning tasks. The main reason is the lossy representation of relational information stored in a content-based addressing memory and its insufficient association performance on long temporal sequence data. To address these problems, we introduce a novel Distributed Associative Memory architecture (DAM) with an Association Reinforcing Loss (ARL) function, which enhances the relational reasoning performance of memory-augmented neural networks. In this framework, instead of relying on a single large external memory, we form a set of multiple smaller associative memory blocks and update these sub-memory blocks simultaneously and independently with the content-based addressing mechanism. Based on the DAM architecture, we can effectively retrieve complex relational information by integrating the diverse representations distributed across multiple sub-memory blocks with an attention mechanism. Moreover, to further enhance the relation modeling performance of the memory network, we propose ARL, which assists the task's target objective while learning the relational information that exists in the data. ARL enables the memory-augmented neural network to reinforce the association between input data and the task objective by reproducing stochastically sampled input data from stored memory contents. This content-reproducing task enriches the representations with relational information. In experiments, we apply our two main approaches to the Differentiable Neural Computer (DNC), one of the representative content-based addressing memory models, and achieve state-of-the-art performance on both memorization and relational reasoning tasks.

1. INTRODUCTION

The essential part of human intelligence for understanding a story and predicting unobserved facts largely depends on the ability to memorize the past and to reason about relational information from pieces of memory. In this context, research on artificial intelligence has focused on designing human-like associative memory networks that can easily store and recall both events and relational information from partial information. In neural network research, many approaches model sequential data with memory systems, such as Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and memory-augmented neural networks (MANN). In particular, a recent line of MANN research constructs an associative memory with a content-based addressing mechanism and stores both input data and its relational information in a single external memory. MANNs have already proven to be an essential component in many tasks that need long-term context understanding (Weston et al., 2014; Sukhbaatar et al., 2015; Graves et al., 2014; 2016; Gulcehre et al., 2018). Also, compared to recurrent neural networks, they can store more information from sequential input data and correctly recall the desired information from memory given a cue. However, even with its promising performance on a wide range of tasks, MANN still has difficulties in solving complex relational reasoning problems (Weston et al., 2015). Since content-based addressing models implicitly encode a data item and its relational information into one vector representation, they often produce a lossy representation of relational information that is not rich enough for solving relational reasoning tasks. To address this weakness, some studies find relational information by leveraging interactions between memory entities with attention (Palm et al., 2018; Santoro et al., 2018). Others focus on the long-sequence memorization performance of memory (Trinh et al., 2018; Le et al., 2019; Munkhdalai et al., 2019).
Other attempts apply self-attention to memory contents and explicitly encode relational information in a separate external memory (Le et al., 2020b). However, all of those models need to explicitly find relational information among memory entities with a computationally expensive attention mechanism and have to recompute it on every memory update. In this research, we approach the same problem in a much simpler and more efficient way that does not require any explicit relational computation, such as self-attention, among memory entities. We hypothesize that the lossy representation of relational information (Le et al., 2020b) in MANN is caused by both its single-memory-based representation and its long-temporal data association performance. Although MANN learns to correlate sequential events across time, its representation is not rich enough to reflect the complex relational information existing in input data. Therefore, for enhanced relation learning, we focus on the richness of a representation that implicitly embeds the associations existing in input data. For this purpose, we introduce a novel Distributed Associative Memory (DAM) architecture, which is inspired by how the biological brain works (Lashley, 1950; Bruce, 2001). In DAM, we replace the single external memory with multiple smaller sub-memory blocks and update those memory blocks simultaneously and independently. The basic operations of each associative memory block are based on the content-based addressing mechanism of MANN, but the parallel memory architecture allows each sub-memory system to evolve over time independently. Therefore, similar to the underlying insight of multi-head attention (Vaswani et al., 2017), our memory model can jointly attend to information from different representation subspaces at different sub-memory blocks and is able to provide a richer representation of the same common input data.
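As a toy sketch (not the paper's implementation), the distributed update can be pictured as K small memory matrices, each with its own content-based addressing and its own write, all updated in one pass; every name and dimension below is illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_address(M, key, beta=1.0):
    # Cosine-similarity addressing over the rows (addresses) of M.
    M_norm = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-8)
    k_norm = key / (np.linalg.norm(key) + 1e-8)
    return softmax(beta * (M_norm @ k_norm))

class DistributedMemory:
    """K independent sub-memory blocks updated simultaneously (illustrative)."""
    def __init__(self, num_blocks, addresses, word_len):
        self.blocks = [np.zeros((addresses, word_len)) for _ in range(num_blocks)]

    def write(self, keys, values, erases):
        # Each block receives its own key/value/erase, so the blocks evolve
        # independently and can hold diverse representations of the input.
        for i, M in enumerate(self.blocks):
            w = content_address(M, keys[i])        # write weighting, shape (A,)
            M *= 1.0 - np.outer(w, erases[i])      # erase
            M += np.outer(w, values[i])            # add

    def read(self, keys):
        # One read vector per block, to be integrated downstream.
        return np.stack([content_address(M, keys[i]) @ M
                         for i, M in enumerate(self.blocks)])
```

Because each block is addressed with its own key, the blocks accumulate different contents even for a common input stream, analogous to the different representation subspaces of multi-head attention.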
To retrieve rich information for relational reasoning, we apply soft-attention-based interpolation to the diverse representations distributed across multiple memories. Moreover, to enrich long-term relational information in the memory, we introduce a novel Association Reinforcing Loss (ARL), which fortifies the data associations of the memory and generally enhances the memorization capacity of MANN. ARL forces the memory network to learn to reproduce a number of stochastically sampled input data items based only on the stored memory contents. Just as other associated pieces of memory come to mind whenever a person recalls a certain event, this data-reproducing task gives MANN better association and memorization ability for input data. It is designed to reproduce a predefined percentage of the input representations in the memory matrix on average and, while optimizing the two tasks at the same time, keeps the balance between ARL and the target objective loss by dynamically re-weighting each task (Liu & Zhou, 2006; Cui et al., 2019). By combining the two approaches, DAM and ARL, our architecture provides rich representations that can be successfully used for tasks requiring both memorization and relational reasoning. We apply our architecture to the Differentiable Neural Computer (DNC) (Graves et al., 2016), one of the representative content-based addressing memory models, to construct a novel distributed associative memory architecture with ARL. DNC shows promising performance on diverse tasks but is also known to be poor at complex relational reasoning tasks. In experiments, we show that our architecture greatly enhances both the memorization and relational reasoning performance of DNC, and even achieves state-of-the-art records.
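A minimal sketch of the two mechanisms just described, soft-attention interpolation over the per-block reads and a dynamic re-weighting between the target loss and ARL, might look as follows; the re-weighting formula here is a purely illustrative stand-in, not the paper's or the cited works' exact scheme:

```python
import numpy as np

def integrate_reads(reads, query):
    """Soft-attention interpolation over K read vectors, one per sub-memory block.

    reads: (K, L) array; query: (L,) vector. Returns a single (L,) read.
    """
    scores = reads @ query
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return attn @ reads

def combined_loss(task_loss, arl_loss):
    """Illustrative dynamic re-weighting: the currently larger loss is
    down-weighted so that neither objective dominates training."""
    total = task_loss + arl_loss + 1e-8
    return (arl_loss / total) * task_loss + (task_loss / total) * arl_loss
```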

2. DIFFERENTIABLE NEURAL COMPUTER

We first briefly summarize the DNC architecture, which is the baseline model for our approaches. DNC (Graves et al., 2016) is a memory-augmented neural network inspired by conventional computer architecture and mainly consists of two parts: a controller and an external memory. When input data are provided to the controller, usually an LSTM, it generates a collection of memory operators, called the interface vector ξ_t, for accessing the external memory. It consists of several keys and values for read/write operations and is constructed from the controller's internal state h_t as ξ_t = W_ξ h_t at each time step t. Every read/write operation in DNC is performed with these memory operators. During the writing process, DNC finds a write address, w_t^w ∈ [0, 1]^A, where A is the memory address size, from the write memory operators (e.g., the write-in key) and built-in functions. Then it updates the write-in value, v_t ∈ R^L, in the external memory, M_{t-1} ∈ R^{A×L}, along with the erase value, e_t ∈ [0, 1]^L, as M_t = M_{t-1} ∘ (E − w_t^w e_t^⊤) + w_t^w v_t^⊤, where E is the matrix of ones and ∘ denotes element-wise multiplication.
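To make the write path concrete, here is a hedged NumPy sketch of one DNC-style write step: the controller state h_t is projected to the interface vector ξ_t = W_ξ h_t, split into a write-in key, write-in value, erase vector, and key strength, and used to update M. The projection matrix, split layout, and dimensions are illustrative rather than the exact DNC parameterization:

```python
import numpy as np

A, L, H = 8, 6, 16          # memory addresses, word length, controller state size
rng = np.random.default_rng(0)
W_xi = 0.1 * rng.normal(size=(3 * L + 1, H))  # illustrative interface projection

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def write_step(M, h):
    """One content-based write: xi_t = W_xi @ h_t supplies all write operators."""
    xi = W_xi @ h
    k = xi[:L]                                   # write-in key
    v = xi[L:2 * L]                              # write-in value v_t
    e = 1.0 / (1.0 + np.exp(-xi[2 * L:3 * L]))   # erase vector e_t in [0, 1]^L
    beta = 1.0 + np.log1p(np.exp(xi[-1]))        # key strength ("oneplus")
    M_n = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-8)
    w = softmax(beta * (M_n @ (k / (np.linalg.norm(k) + 1e-8))))  # w_t^w
    return M * (1.0 - np.outer(w, e)) + np.outer(w, v)            # M_t
```

The last line mirrors the erase-then-add memory update above: rows selected by the write weighting are partially erased by e_t and then overwritten with v_t.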

