CONTINUAL MEMORY: CAN WE REASON AFTER LONG-TERM MEMORIZATION?

Abstract

Existing reasoning tasks often follow the setting of "end-to-end reasoning", which rests on the important assumption that the input contents can always be accessed during reasoning. However, human beings frequently adopt another reasoning setting in daily life, referred to as "reasoning after memorizing". Concretely, humans are able to unconsciously memorize their experiences within limited memory capacity, from which they can recall information to respond to subsequent tasks. In this setting, the input contents are no longer available during reasoning, so we must compress and memorize the input stream in one pass and then answer general, previously unseen queries. Memory-augmented neural networks introduce a write-read memory to perform such human-like memorization and reasoning, but they continually update the memory with current information and inevitably forget early contents, failing to answer queries relevant to early information. In this paper, we propose Continual Memory (CM) to explore this ability of reasoning after long-term memorization. To alleviate the gradual forgetting of early information, we develop self-supervised memorization training with item-level and sequence-level objectives. We demonstrate several interesting characteristics of our continual memory on synthetic data, and evaluate its performance on several downstream tasks, including long-term text QA, long-term video QA, and recommendation with long sequences.

1. INTRODUCTION

In recent years, the tremendous progress of neural networks has enabled machines to perform reasoning given a query Q and input contents X, e.g., inferring the answer to a given question from a text/video stream in text/video question answering (Seo et al., 2016; Le et al., 2020b), or predicting whether a user will click a given item based on the user's behavior sequence in recommender systems (Ren et al., 2019; Pi et al., 2019). Studies that achieve top performance on such reasoning tasks usually follow the setting of "end-to-end reasoning", where the raw input contents X are available at the time of answering Q. In this setting, complex interaction between X and Q can be designed to extract query-relevant information from X with little loss, such as co-attention interaction (Xiong et al., 2016). Though these methods (Seo et al., 2016; Le et al., 2020b) can effectively handle such reasoning tasks, they require unlimited storage to hold the original input X. Furthermore, they have to encode the whole input and perform the elaborate interaction from scratch, which is time-consuming. This is unacceptable for online services that require instant responses, such as recommender systems, where the input sequence can become extremely long (Ren et al., 2019). Another setting, "reasoning after memorizing", has the restriction that the raw input X is not available at the time of answering Q, and requires the model to first digest X in a streaming manner, i.e., incrementally compress the current subsequence of X into a memory M with very limited capacity (size much smaller than |X|). Under such constraints, in the inference phase we can only capture query-relevant clues from the limited states of M (rather than X) to infer the answer to Q. Since the compression procedure that builds M is entirely unaware of Q, deciding what to remember in M poses great challenges.
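The "reasoning after memorizing" protocol above can be made concrete with a minimal sketch (this is an illustrative toy, not the paper's model: the class, the slot-addressing rule, and the blending weights are all assumptions):

```python
import numpy as np

class StreamingMemory:
    """Toy sketch of "reasoning after memorizing": the input stream X is
    consumed one item at a time and compressed into a fixed-size memory M;
    at query time only M (never X) is read to answer Q."""

    def __init__(self, slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.M = np.zeros((slots, dim))                      # memory, |M| << |X|
        self.W_key = rng.standard_normal((dim, dim)) * 0.1   # toy write-addressing projection

    def write(self, x):
        # Write phase (no access to Q): address the most similar slot
        # and blend the incoming item into it.
        scores = self.M @ (self.W_key @ x)
        slot = int(np.argmax(scores))
        self.M[slot] = 0.5 * self.M[slot] + 0.5 * x

    def answer(self, q):
        # Inference phase: read from memory only, via softmax attention
        # over slots; the raw stream is no longer available.
        logits = self.M @ q
        att = np.exp(logits - logits.max())
        att /= att.sum()
        return att @ self.M
```

Note that `write` never sees the query: whatever `answer` later needs must already have survived the lossy compression into `M`, which is exactly the challenge the paper targets.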
This setting is very similar to the daily situation of human beings: we may not even know the tasks Q that we will answer in the future while we are experiencing current events, and we cannot go back and replay the past when solving problems at hand. However, it is our instinct to continually process information throughout our lives with limited and compressed memory storage, which allows us to recall and draw upon past events to frame our behavior in the present situation (Moscovitch et al., 2016; Baddeley, 1992). Compared to "end-to-end reasoning", "reasoning after memorizing" may not achieve better precision on regular tasks with short sequences according to the literature (Park et al., 2020), but it is naturally a better choice for applications like long-sequence recommendation (Ren et al., 2019) and long-text understanding (Ding et al., 2020): maintaining M is incremental, touching only a small part of the input at each timestep, while inference over M and Q is also tractable for online services. Memory-augmented neural networks (MANNs) (Graves et al., 2014; 2016) introduce a write-read memory that already follows the setting of "reasoning after memorizing": they compress the input contents into a fixed-size memory and only read relevant information from the memory during reasoning. However, existing works do not focus on using MANNs to perform long-term memory-based reasoning. They learn how to maintain the memory only from losses back-propagated from the final answer and do not design a specific training target for long-term memorization, which inevitably leads to gradual forgetting of early contents (Le et al., 2019a). That is, when dealing with a long input stream, the memory may only focus on current contents and naturally neglect long-term clues. Thus, existing MANNs fail to answer queries relevant to early information due to the lack of long-term memorization training.
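The forgetting claim has a simple quantitative intuition. As an illustration (a deliberately simplified stand-in for real MANN write rules, not the paper's analysis), consider a memory maintained by an exponential-moving-average write: the contribution of the first input decays geometrically with stream length, so long streams effectively erase early contents unless training explicitly counteracts it:

```python
# Toy illustration of gradual forgetting under a recurrent-style write
# m_t = (1 - a) * m_{t-1} + a * x_t: after T steps, the first input's
# weight in the memory is a * (1 - a) ** (T - 1) (up to the leading a),
# which vanishes for long streams. The rate `a` is an assumed parameter.
def ema_weight_of_first_item(T, a=0.1):
    return (1 - a) ** (T - 1)

print(ema_weight_of_first_item(10))    # short stream: still noticeable
print(ema_weight_of_first_item(1000))  # long stream: effectively zero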
In this paper, we propose Continual Memory (CM) to further explore the ability of reasoning after long-term memorization. Specifically, we compress the long-term input stream into a continual memory of fixed capacity and infer answers to subsequent queries from the memory. To overcome gradual forgetting of early information and increase the generalization ability of the memorization, we develop an extra self-supervised task that recalls recorded history contents from the memory. This is inspired by the fact that human beings can recall details around specific events and distinguish whether a series of events happened in the past, which respectively correspond to two different memory processes revealed in cognitive, neuropsychological, and neuroimaging studies, namely recollection and familiarity (Yonelinas, 2002; Moscovitch et al., 2016). Concretely, we design self-supervised memorization training with item-level and sequence-level objectives. The item-level objective predicts masked items in history fragments, which are sampled from the original input stream with parts of their items masked as prediction targets; this task aims to endow the recollection ability that enables one to relive past episodes. The sequence-level objective distinguishes whether a historical fragment ever appeared in the input stream, where we directly sample positive fragments from the early input stream and replace parts of the items in positive fragments to form negative ones; this task enables the familiarity process that recognizes experienced events or stimuli as familiar. We also give an implementation with segment-level maintenance of memory to better capture context clues and improve modeling efficiency.
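The data construction behind the two objectives can be sketched as follows (a hypothetical sketch: the function name, fragment length, masking ratio, and corruption ratio are illustrative assumptions, not the paper's values):

```python
import random

def make_memorization_targets(stream, frag_len=5, mask_rate=0.3,
                              corrupt_rate=0.3, seed=0):
    """Build training examples for the two self-supervised objectives.

    Item-level (recollection): a fragment with some items replaced by
    "[MASK]", plus the masked originals as prediction targets.
    Sequence-level (familiarity): a positive fragment taken verbatim from
    the stream, and a negative copy with some items swapped for items
    outside the fragment.
    """
    rng = random.Random(seed)
    start = rng.randrange(len(stream) - frag_len + 1)
    frag = stream[start:start + frag_len]

    # Item-level: mask a subset of positions; the model must recover them
    # from the memory alone.
    masked, targets = [], []
    for item in frag:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(item)
        else:
            masked.append(item)

    # Sequence-level: corrupt a copy of the fragment to form a negative;
    # the model must judge which fragment actually occurred.
    vocab = [x for x in stream if x not in frag]
    negative = [rng.choice(vocab) if rng.random() < corrupt_rate and vocab else x
                for x in frag]
    return (masked, targets), (frag, negative)
```

In training, the memory (rather than the raw stream) would be asked to fill the `"[MASK]"` slots and to score `frag` above `negative`, so both objectives directly reward retaining early contents.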
We illustrate the long-term memorization ability of our continual memory on a synthetic task, and evaluate its performance on real-world downstream tasks, including long-term text QA, long-term video QA, and recommendation with long sequences, showing that it achieves significant advantages over existing MANNs in the "reasoning after memorizing" setting.

2. RELATED WORKS

Memory Augmented Neural Networks (MANNs) introduce an external memory to store and access past information via differentiable write-read operators. Neural Turing Machine (NTM) (Graves et al., 2014) and Differentiable Neural Computer (DNC) (Graves et al., 2016) are the typical MANNs for human-like reasoning under the setting of "reasoning after memorizing", whose inference relies only on the memory with limited capacity rather than starting from the original input data. In this line of research, Rae et al. (2016) adopt sparse memory accessing to reduce computational cost. Csordás & Schmidhuber (2019) introduce the key/value separation problem of content-based addressing and adopt a mask for memory operations as a solution. Le et al. (2019b) manipulate both data and programs stored in memory to perform universal computations. And Santoro et al. (2018); Le et al. (2020a) consider complex relational reasoning with the information they remember. However, these works exploit MANNs mainly to help capture long-range dependencies in input sequences, without addressing the gradual forgetting issue in MANNs (Le et al., 2019a). They share the same training objective as methods developed for the setting of "end-to-end reasoning", inevitably incurring gradual forgetting of early contents (Le et al., 2019a).

