Intrinsic Motivation via Surprise Memory

Abstract

We present a new computational model for intrinsic rewards in reinforcement learning that addresses the limitations of existing surprise-driven exploration methods. The reward is the novelty of the surprise rather than the surprise norm. We estimate surprise novelty as the retrieval error of a memory network wherein the memory stores and reconstructs surprises. Our surprise memory (SM) augments the capability of surprise-based intrinsic motivators, maintaining the agent's interest in exciting exploration while reducing unwanted attraction to unpredictable or noisy observations. Our experiments demonstrate that the SM, combined with various surprise predictors, exhibits efficient exploration behaviors and significantly boosts the final performance in sparse-reward environments, including Noisy-TV, navigation and challenging Atari games.

1. Introduction

What motivates agents to explore? Successfully answering this question would enable agents to learn efficiently in formidable tasks. Random exploration strategies such as ε-greedy are inefficient in high-dimensional settings, failing to learn despite training for hundreds of millions of steps in sparse-reward games (Bellemare et al., 2016). Alternative approaches propose to use intrinsic motivation to aid exploration by adding bonuses to the environment's rewards (Bellemare et al., 2016; Stadie et al., 2015). The intrinsic reward is often proportional to the novelty of the visited state: it is high if the state is novel (e.g. different from past ones (Badia et al., 2020; 2019)) or less frequently visited (Bellemare et al., 2016; Tang et al., 2017). Another view of intrinsic motivation is based on surprise, which refers to the result of an experience being unexpected, and is determined by the discrepancy between the expectation (from the agent's prediction) and observed reality (Barto et al., 2013; Schmidhuber, 2010). Technically, surprise is the difference between the prediction and observation representation vectors, and the norm of this residual (i.e. the prediction error) is used as the intrinsic reward.

Under review as a conference paper at ICLR 2023

Here, we will use the terms surprise and surprise norm to refer to the residual vector and its norm, respectively. Recent works have estimated surprise with various predictive models such as dynamics (Stadie et al., 2015), episodic reachability (Savinov et al., 2018) and inverse dynamics (Pathak et al., 2017), and achieved significant improvements with surprise norm (Burda et al., 2018a). However, surprise-based agents tend to be overly curious about noisy or unpredictable observations (Itti and Baldi, 2005; Schmidhuber, 1991). For example, consider an agent watching a television screen showing white noise (the noisy-TV problem).
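To make these terms concrete, here is a minimal NumPy sketch of surprise and surprise norm. The `embed` and `predict` functions are hypothetical stand-ins for the learned representation and predictive networks of the works cited above; the constant offset in `predict` exists only so the residual is nonzero.

```python
import numpy as np

def embed(obs):
    # Hypothetical observation embedding; a fixed linear map stands in
    # for a learned (or random target) feature network.
    W = np.linspace(-1.0, 1.0, 8 * obs.size).reshape(8, obs.size)
    return W @ obs

def predict(obs):
    # Hypothetical prediction of the embedding; shifted by a constant
    # so that the prediction is imperfect.
    return embed(obs) + 0.1

obs = np.ones(4)
surprise = predict(obs) - embed(obs)       # the residual vector ("surprise")
surprise_norm = np.linalg.norm(surprise)   # prediction error, used as intrinsic reward
```

Any of the predictive models above (forward dynamics, inverse dynamics, RND) fits this template; only the definitions of `embed` and `predict` change.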
The TV is boring, yet the agent cannot predict the screen's content and will be attracted to the TV due to its high surprise norm. This distraction, or "fake surprise", is common in partially observable Markov decision processes (POMDPs), including navigation tasks and Atari games (Burda et al., 2018b). Many works have addressed this issue by relying on learning progress (Achiam and Sastry, 2017; Schmidhuber, 1991) or random network distillation (RND) (Burda et al., 2018b). However, the former is computationally expensive, and the latter requires many samples to perform well. This paper overcomes the "fake surprise" issue by using surprise novelty, a new concept that measures the uniqueness of surprise. To identify surprise novelty, the agent needs to compare the current surprise with surprises from past encounters. One way to do this is to equip the agent with some kind of associative memory, which we implement as an autoencoder whose task is to reconstruct a query surprise: the lower the reconstruction error, the lower the surprise novelty. A further mechanism is needed to deal with rapid changes in surprise structure within an episode. For example, if the agent meets the same surprise at two time steps, its surprise novelty should decline, which a simple autoencoder cannot capture. To remedy this, we add an episodic memory, which stores intra-episode surprises. Given the current surprise, this memory retrieves similar surprises seen earlier in the episode through an attention mechanism. These surprises act as a context added to the query to help the autoencoder better recognize whether the query surprise has been encountered in the episode or not. The error between the query and the autoencoder's output is defined as the surprise novelty, to which the intrinsic reward is set proportionally. We argue that using surprise novelty as an intrinsic reward is better than using surprise norm.
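A minimal sketch of this pipeline follows, under simplifying assumptions: a linear autoencoder, softmax dot-product attention, and a query formed by concatenating the current surprise with its retrieved context. These are illustrative choices, not the paper's exact architecture.

```python
import numpy as np

def attention_context(query, episodic_mem):
    # Softmax dot-product attention over surprises stored earlier in the episode.
    scores = episodic_mem @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ episodic_mem

def surprise_novelty(surprise, episodic_mem, W_enc, W_dec):
    # Context-augmented query: the current surprise plus similar past surprises.
    ctx = attention_context(surprise, episodic_mem)
    query = np.concatenate([surprise, ctx])
    # Autoencoder reconstruction of the query; the residual norm is the
    # surprise novelty, to which the intrinsic reward is set proportionally.
    recon = W_dec @ np.tanh(W_enc @ query)
    return np.linalg.norm(query - recon)

rng = np.random.default_rng(0)
d = 4
episodic_mem = rng.standard_normal((5, d))        # intra-episode surprise buffer
W_enc = rng.standard_normal((3, 2 * d)) * 0.1     # toy encoder weights
W_dec = rng.standard_normal((2 * d, 3)) * 0.1     # toy decoder weights
novelty = surprise_novelty(rng.standard_normal(d), episodic_mem, W_enc, W_dec)
```

In this sketch, a surprise that closely matches entries already in `episodic_mem` yields a query the trained autoencoder reconstructs well, and hence a small novelty reward.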
In POMDPs, surprise norms can be very large since the agent cannot predict its environment perfectly, yet there may exist patterns of prediction failure. If the agent can remember these patterns, it will not feel surprised when similar prediction errors appear, regardless of their norms. An important emergent property of this architecture is that when random observations are presented (e.g., white noise in the noisy-TV problem), the autoencoder can act as an identity transformation operator, effectively passing the noise through to reconstruct it with low error. We conjecture that the autoencoder is able to do this on the surprise rather than the observation because the surprise space has lower variance, and we show this in our paper. To make our memory system work at the surprise level, we adopt an intrinsic motivation method to generate surprises for the memory. The surprise generator (SG) can be any predictive-model-based method and is jointly trained with the memory to optimize its own loss function. To train the surprise memory (SM), we optimize the memory's parameters to minimize the reconstruction error. Our contribution is to propose the new concept of surprise novelty for intrinsic motivation. We argue that it reflects the originality of the environment better than surprise norm does (see the motivating examples in Fig. 1). In our experiments, the SM helps RND (Burda et al., 2018b) perform well in our challenging noisy-TV problem, while RND alone performs poorly. Beyond RND, we consistently demonstrate significant performance gains when coupling three different SGs with our SM in sparse-reward tasks. Finally, in hard-exploration Atari games, we boost the scores of two strong SGs, resulting in better performance under the low-sample regime.

2. Methods

2.1. Surprise Novelty

Surprise is the difference between expectation and observation (Ekman and Davidson, 1994). If a surprise repeats, it is no longer a surprise. Based on this intuition, we hypothesize that surprises can be characterized by their novelties, and an agent's curiosity is driven by the



Figure 1: Montezuma Revenge: surprise novelty better reflects the originality of the environment than surprise norm. While surprise norm can be significant even for dull events such as those in the dark room due to unpredictability, surprise novelty tends to be lower (3rd and 6th images). On the other hand, surprise novelty can be higher in truly vivid states on the first visit to the ladder and island rooms (1st and 2nd images) and reduced on the

