CONTINUAL MEMORY: CAN WE REASON AFTER LONG-TERM MEMORIZATION?

Abstract

Existing reasoning tasks often follow the setting of "end-to-end reasoning", which rests on the important assumption that the input contents can always be accessed while reasoning. However, human beings frequently adopt another reasoning setting in daily life, referred to as "reasoning after memorizing". Concretely, human beings have the ability to unconsciously memorize their experiences within limited memory capacity, from which they can recall and respond to subsequent tasks. In this setting, the input contents are no longer available during reasoning, so the model must compress and memorize the input stream in one pass and then answer general queries that were never seen before. Memory-augmented neural networks introduce a write-read memory to perform such human-like memorization and reasoning, but they continually update the memory with current information and inevitably forget early contents, failing to answer queries relevant to early information. In this paper, we propose the Continual Memory (CM) to explore this ability of reasoning after long-term memorization. To alleviate the gradual forgetting of early information, we develop self-supervised memorization training with item-level and sequence-level objectives. We demonstrate several interesting characteristics of our continual memory on synthetic data, and evaluate its performance on several downstream tasks, including long-term text QA, long-term video QA, and recommendation with long sequences.

1. INTRODUCTION

In recent years, the tremendous progress of neural networks has enabled machines to perform reasoning given a query Q and input contents X, e.g., to infer the answer to a given question from a text/video stream in text/video question answering (Seo et al., 2016; Le et al., 2020b), or to predict whether a user will click a given item based on the user behavior sequence in recommender systems (Ren et al., 2019; Pi et al., 2019). Studies that achieve top performance on such reasoning tasks usually follow the setting of "end-to-end reasoning", where the raw input contents X are available at the time of answering Q. In this setting, complex interactions between X and Q can be designed to extract query-relevant information from X with little loss, such as co-attention interaction (Xiong et al., 2016). Though these methods (Seo et al., 2016; Le et al., 2020b) can effectively handle such reasoning tasks, they require unlimited storage resources to hold the original input X. Further, they have to encode the whole input and perform the elaborate interaction from scratch, which is time-consuming. This is unacceptable for online services that require instant responses, such as recommender systems, where the input sequence becomes extremely long (Ren et al., 2019). Another setting, "reasoning after memorizing", imposes the restriction that the raw input X is not available at the time of answering Q: the model must first digest X in a streaming manner, i.e., incrementally compress the current subsequence of X into a memory M with very limited capacity (size much smaller than |X|). Under such constraints, in the inference phase, we can only capture query-relevant clues from the limited states M (rather than X) to infer the answer to Q. Since the information compression procedure in M is entirely unaware of Q, deciding what to remember in M poses a great challenge.
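To make the protocol concrete, the following is a minimal sketch of the "reasoning after memorizing" interface, not the paper's actual model: a hypothetical slot-based memory that absorbs the stream X one chunk at a time into a fixed-size state M (a query-agnostic write), and later answers Q by attending over M alone, with X discarded. The slot count, blending rule, and projection are illustrative assumptions.

```python
import numpy as np

class StreamingMemory:
    """Hypothetical sketch of compress-then-query reasoning.
    The stream X is written in one pass into a fixed-size memory M
    (|M| << |X|); at inference time only M is available to answer Q."""

    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.M = np.zeros((num_slots, dim))  # fixed-capacity memory
        # toy write projection (stand-in for a learned write network)
        self.W = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def write(self, x):
        """One-pass, query-agnostic write: blend chunk x into the most
        relevant slot. Q is unknown at this stage, so the write cannot
        be tailored to any future query."""
        scores = self.M @ x
        i = int(np.argmax(scores)) if np.any(scores) else 0
        self.M[i] = 0.9 * self.M[i] + 0.1 * (self.W @ x)

    def read(self, q):
        """Inference phase: attend over memory slots with query q.
        The raw input X is no longer accessible."""
        att = np.exp(self.M @ q)
        att /= att.sum()
        return att @ self.M  # query-conditioned readout from M only
```

Note that the memory footprint stays constant no matter how long the stream is; the cost is that early chunks are gradually overwritten by later ones, which is exactly the forgetting problem the CM objectives aim to alleviate.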
This setting closely mirrors the daily experience of human beings: we may not even know the tasks Q that we will answer in the future while we are experiencing current events, and we cannot go back and replay those events when solving problems at hand. However, it is our instinct to continually process information during our entire life with

