Intrinsic Motivation via Surprise Memory

Abstract

We present a new model for computing intrinsic rewards in reinforcement learning that addresses the limitations of existing surprise-driven exploration. The reward is the novelty of the surprise rather than the surprise norm. We estimate surprise novelty as the retrieval error of a memory network wherein the memory stores and reconstructs surprises. Our surprise memory (SM) augments the capability of surprise-based intrinsic motivators, maintaining the agent's interest in exciting exploration while reducing unwanted attraction to unpredictable or noisy observations. Our experiments demonstrate that the SM, combined with various surprise predictors, exhibits efficient exploring behaviors and significantly boosts the final performance in sparse reward environments, including Noisy-TV, navigation and challenging Atari games.

1. Introduction

What motivates agents to explore? Successfully answering this question would enable agents to learn efficiently in formidable tasks. Random explorations such as ε-greedy are inefficient in high dimensional cases, failing to learn despite training for hundreds of millions of steps in sparse reward games (Bellemare et al., 2016). Alternative approaches propose to use intrinsic motivation to aid exploration by adding bonuses to the environment's rewards (Bellemare et al., 2016; Stadie et al., 2015). The intrinsic reward is often proportional to the novelty of the visited state: it is high if the state is novel (e.g. different from past ones (Badia et al., 2020; 2019)) or less frequently visited (Bellemare et al., 2016; Tang et al., 2017).

Figure 1: Montezuma Revenge: surprise novelty better reflects the originality of the environment than surprise norm. While surprise norm can be significant even for dull events, such as those in the dark room, due to unpredictability, surprise novelty tends to be lower (3rd and 6th images). On the other hand, surprise novelty can be higher in truly vivid states on the first visit to the ladder and island rooms (1st and 2nd images) and reduced on the second visit (4th and 5th images). Here, surprise novelty and surprise norm are quantified and averaged over steps in each room.

Another view of intrinsic motivation is from surprise, which refers to the result of an experience being unexpected, and is determined by the discrepancy between the expectation (from the agent's prediction) and the observed reality (Barto et al., 2013; Schmidhuber, 2010). Technically, surprise is the difference between the prediction and observation representation vectors, and the norm of the residual (i.e., the prediction error) is used as the intrinsic reward.
Recent works have estimated surprise with various predictive models such as dynamics (Stadie et al., 2015), episodic reachability (Savinov et al., 2018) and inverse dynamics (Pathak et al., 2017), and achieved significant improvements with surprise norm (Burda et al., 2018a). However, surprise-based agents tend to be overly curious about noisy or unpredictable observations (Itti and Baldi, 2005; Schmidhuber, 1991). For example, consider an agent watching a television screen showing white noise (the noisy-TV problem). The TV is boring, yet the agent cannot predict the screen's content and will be attracted to the TV due to its high surprise norm. This distraction or "fake surprise" is common in partially observable Markov Decision Processes (POMDPs), including navigation tasks and Atari games (Burda et al., 2018b). Many works have addressed this issue by relying on learning progress (Achiam and Sastry, 2017; Schmidhuber, 1991) or random network distillation (RND) (Burda et al., 2018b). However, the former is computationally expensive, and the latter requires many samples to perform well. This paper overcomes the "fake surprise" issue by using surprise novelty, a new concept that measures the uniqueness of surprise. To identify surprise novelty, the agent needs to compare the current surprise with surprises from past encounters. One way to do this is to equip the agent with some kind of associative memory, which we implement as an autoencoder whose task is to reconstruct a query surprise. The lower the reconstruction error, the lower the surprise novelty. A further mechanism is needed to deal with rapid changes in surprise structure within an episode. For example, if the agent meets the same surprise at two time steps, its surprise novelty should decline, and with a simple autoencoder this will not happen. To remedy this, we add an episodic memory, which stores intra-episode surprises.
Given the current surprise, this memory can retrieve similar surprises presented earlier in the episode through an attention mechanism. These surprises act as a context added to the query to help the autoencoder better recognize whether the query surprise has been encountered in the episode or not. The error between the query and the autoencoder's output is defined as surprise novelty, to which the intrinsic reward is set proportionally. We argue that using surprise novelty as an intrinsic reward is better than surprise norm. In POMDPs, surprise norms can be very large since the agent cannot predict its environment perfectly, yet there may exist patterns of prediction failure. If the agent can remember these patterns, it will not feel surprised when similar prediction errors appear, regardless of the surprise norms. An important emergent property of this architecture is that when random observations are presented (e.g., white noise in the noisy-TV problem), the autoencoder can act as an identity transformation operator, effectively passing the noise through to reconstruct it with low error. We conjecture that the autoencoder is able to do this with the surprise rather than the observation because the surprise space has lower variance, and we show this in our paper. To make our memory system work on the surprise level, we adopt an intrinsic motivation method to generate surprise for the memory. The surprise generator (SG) can be of any kind based on predictive models and is jointly trained with the memory to optimize its own loss function. To train the surprise memory (SM), we optimize the memory's parameters to minimize the reconstruction error. Our contribution is to propose the new concept of surprise novelty for intrinsic motivation. We argue that it reflects the environment's originality better than surprise norm (see motivating graphics in Fig. 1).
In our experiments, the SM helps RND (Burda et al., 2018b) perform well in our challenging noisy-TV problem while RND alone performs poorly. Beyond RND, we consistently demonstrate significant performance gains when coupling three different SGs with our SM in sparse-reward tasks. Finally, in hard exploration Atari games, we boost the scores of two strong SGs, resulting in better performance under the low-sample regime.

2.1. Surprise Novelty

Surprise is the difference between expectation and observation (Ekman and Davidson, 1994). If a surprise repeats, it is no longer a surprise. Based on this intuition, we hypothesize that surprises can be characterized by their novelty, and that an agent's curiosity is driven by surprise novelty rather than surprise magnitude. Moreover, surprise novelty should be robust against noise: it is small even for random observations. For example, watching a random-channel TV can always be full of surprises, as we cannot predict which channel will appear next. However, the agent should soon find it boring, since the surprise of random noise recurs repeatedly and the channels are entirely unpredictable.

Figure 2: Surprise Generator + Surprise Memory (SG+SM). The SG takes input I_t from the environment to estimate the surprise u_t at state s_t. The SM consists of two modules: an episodic memory (M) and an autoencoder network (W). M is slot-based, storing past surprises within the episode. At any timestep t, given surprise u_t, M retrieves read-out u^e_t to form a query surprise q_t = [u^e_t, u_t] to W. W tries to reconstruct the query and takes the reconstruction error (surprise novelty) as the intrinsic reward r^i_t.

We propose using a memory-augmented neural network (MANN) to measure surprise novelty. The memory remembers past surprise patterns, and if a surprise can be retrieved from the memory, it is not novel and the intrinsic motivation should be small. The memory can also be viewed as a reconstruction network: it can pass its inputs through for random, pattern-free surprises, making them retrievable. Surprise novelty has an interesting property: if some event is unsurprising (the expectation-reality residual is the zero vector), its surprise (the zero vector, with norm 0) is always perfectly retrievable (surprise novelty is 0). In other words, a low surprise norm implies a low surprise novelty.
On the contrary, a high surprise norm can have little surprise novelty as long as the surprise can be retrieved from the memory, either through associative recall or the pass-through mechanism. Another property is that the variance of surprise is generally lower than that of the observation (state), potentially making learning on the surprise space easier. This property is formally stated as follows.

Proposition 1. Let X and U be random variables representing the observation and surprise at the same timestep, respectively. Under an imperfect SG, the following inequality holds: ∀i: σ²_{X_i} ≥ σ²_{U_i}, where σ²_{X_i} and σ²_{U_i} denote the i-th diagonal elements of var(X) and var(U), respectively.

Proof. See Appendix E.
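Proposition 1 can be illustrated numerically. The following sketch is our own toy setup, not from the paper: observations follow an AR(1) process and the SG has learned the mean dynamics, so the surprise reduces to the innovation noise, whose per-dimension variance sits below the observation variance.

```python
import numpy as np

# Toy check of Proposition 1 on an AR(1) process (illustrative setup):
# x_t = 0.9 * x_{t-1} + eps_t, with an SG that has learned the mean dynamics.
rng = np.random.default_rng(0)
T, n = 100_000, 4
x = np.zeros((T, n))
eps = rng.normal(size=(T, n))
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + eps[t]

pred = 0.9 * x[:-1]            # SG(I_t): prediction of the next observation
u = pred - x[1:]               # surprise u_t = SG(I_t) - O_t (Eq. 1)

var_x = x[1:].var(axis=0)      # sigma^2_{X_i}: roughly 1 / (1 - 0.81) ~ 5.3
var_u = u.var(axis=0)          # sigma^2_{U_i}: roughly 1
assert np.all(var_u <= var_x)  # the inequality of Proposition 1
```

Even though the SG here is imperfect (it cannot predict the noise), the surprise space is much less variable than the observation space.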

2.2. Surprise Generator

Since our MANN requires surprises for its operation, it is built upon a prediction model, which we refer to as the Surprise Generator (SG). In this paper, we adopt several well-known SGs (e.g. RND (Burda et al., 2018b) and ICM (Pathak et al., 2017)) to predict the observation and compute the surprise u_t and its norm at every step in the environment. The surprise norm is the Euclidean distance ‖u_t‖ between the expectation and the actual observation, where:

u_t = SG(I_t) − O_t  (1)

Here, u_t ∈ R^n is the surprise vector of size n, I_t the input of the SG at step t of the episode, and SG(I_t) and O_t the SG's prediction and the observation target, respectively. The input I_t is specific to the SG architecture choice; it can be the current state (s_t) or the previous state and action (s_{t−1}, a_t). The observation target O_t is usually a transformation (identity or random) of the current state s_t, which serves as the target for the SG's prediction. The SG is usually trained to minimize:

L_SG = E_t[‖u_t‖]

Here, predictable observations have small prediction errors, or little surprise. One issue is that a large surprise norm can be due simply to noisy or distracting observations. Next, we propose a remedy for this problem.
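As a concrete illustration of Eq. 1, here is a minimal RND-style SG sketch. All sizes and weights are illustrative assumptions; the paper's SGs are trained networks over image observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n = 64, 16               # state and surprise sizes (illustrative)

# Frozen random target network defines the observation target O_t;
# a second network plays the trainable predictor (RND-style sketch,
# not the paper's exact architecture).
W_target = rng.normal(size=(n_obs, n)) / np.sqrt(n_obs)
W_pred = rng.normal(size=(n_obs, n)) / np.sqrt(n_obs)

def surprise(s):
    """u_t = SG(I_t) - O_t (Eq. 1); here the input I_t is the state s_t."""
    o_t = np.tanh(s @ W_target)   # observation target O_t
    pred = np.tanh(s @ W_pred)    # SG's prediction
    return pred - o_t

s = rng.normal(size=n_obs)
u = surprise(s)
norm = np.linalg.norm(u)          # surprise norm, minimized by L_SG
assert u.shape == (n,) and norm > 0
```

In RND, only `W_pred` would be trained (by minimizing the surprise norm) while `W_target` stays frozen.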

2.3. Surprise Memory

The surprise generated by the SG is stored and processed by a memory network dubbed the Surprise Memory (SM). It consists of an episodic memory M and an autoencoder network W, jointly optimized to reconstruct any surprise. At each timestep, the SM receives a surprise u_t from the SG module and reads content u^e_t from the memory M. [u^e_t, u_t] forms a surprise query q_t to W, which produces the reconstruction q̃_t. This reconstruction is used to estimate the novelty of surprises, forming the intrinsic reward r^i_t. Fig. 2 summarizes the operations of the components of our proposed method. Our two-memory design effectively recovers surprise novelty by handling intra- and inter-episode surprise patterns thanks to M and W, respectively: M can quickly adapt and recall surprises that occur within an episode, while W is slower and focuses more on consistent surprise patterns across episodes during training.

The query q_t could be set directly to the surprise u_t. However, this ignores the rapid change in surprise within an episode. Without M, when the SG and W are fixed (during interaction with environments), their outputs u_t and q̃_t stay the same for the same input I_t. Hence, the intrinsic reward r^i_t also stays the same. This is undesirable: when the agent observes the same input at different timesteps (e.g., I_1 = I_2), we expect its curiosity to decrease on the second visit (r^i_2 < r^i_1). Therefore, we design the SM with M to fix this issue.

The episodic memory M stores representations of surprises that the agent encounters during an episode. For simplicity, M is implemented as a first-in-first-out queue whose size is fixed at N. Notably, the content of M is wiped out at the end of each episode, so its information is limited to a single episode. M can be viewed as a matrix M ∈ R^{N×d}, where d is the size of a memory slot. We denote M(j) as the j-th row of the memory, corresponding to the surprise u_{t−j}.
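The FIFO behavior of M can be sketched as follows; class and method names are ours, and sizes are the defaults used later in the paper.

```python
import numpy as np

class EpisodicSurpriseMemory:
    """Sketch of M: a fixed-size FIFO queue of surprise representations,
    viewed as a matrix M in R^{N x d} and wiped at episode end
    (class and method names are ours)."""
    def __init__(self, N=128, d=16):
        self.N, self.d = N, d
        self.M = np.zeros((N, d))   # row M(j) corresponds to surprise u_{t-j}

    def write(self, m):
        # Newest surprise goes to row 0; the oldest row is dropped.
        self.M = np.vstack([m[None, :], self.M[:-1]])

    def reset(self):
        # Called at the end of each episode.
        self.M[:] = 0.0

mem = EpisodicSurpriseMemory(N=4, d=3)
mem.write(np.ones(3))
mem.write(2 * np.ones(3))
assert np.allclose(mem.M[0], 2.0) and np.allclose(mem.M[1], 1.0)
mem.reset()
assert np.allclose(mem.M, 0.0)
```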
To retrieve from M a read-out u^e_t that is close to u_t, we perform content-based attention (Graves et al., 2014), computing the attention weight as the cosine similarity between the mapped surprise and each slot:

w_t(j) = ((u_t Q) M(j)ᵀ) / (‖u_t Q‖ ‖M(j)‖)

The read-out from M is then u^e_t = w_t M V ∈ R^n. Here, Q ∈ R^{n×d} and V ∈ R^{d×n} are learnable weights mapping between the surprise and the memory space. To force the read-out close to u_t, we minimize:

L_M = E_t[‖u^e_t − u_t‖]

The read-out and the SG's surprise form the query surprise to W: q_t = [u^e_t, u_t] ∈ R^{2n}. M stores intra-episode surprises to assist the autoencoder in preventing the agent from exploring fake surprises within the episode. Since we train the parameters to reconstruct u_t using past surprises in the episode, if the agent visits a state whose surprise is predictable from those in M, ‖u^e_t − u_t‖ should be small. Hence, the read-out context u^e_t contains little information beyond u_t, and reconstructing q_t with W becomes easier, as it is equivalent to reconstructing u_t. In contrast, visiting diverse states leads to a more novel read-out u^e_t and makes it more challenging to reconstruct q_t, generally leading to a higher intrinsic reward.

The autoencoder network W can be viewed as an associative memory of surprises that persists across episodes. At timestep t in any episode during training, W is queried with q_t to produce a reconstructed memory q̃_t. The surprise novelty is then determined as:

r^i_t = ‖q̃_t − q_t‖  (4)

which is the norm of the surprise residual q̃_t − q_t. It is normalized and added to the external reward as an intrinsic reward bonus. The details of computing and using normalized intrinsic rewards can be found in Appendix C. We implement W as a feed-forward neural network that learns to reconstruct its own inputs. This kind of autoencoder has been shown to be equivalent to an associative memory that supports memory encoding and retrieval through attractor dynamics (Radhakrishnan et al., 2020).
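The read-out and query construction above can be sketched as follows. The cosine-similarity weights follow w_t(j); the extra softmax normalization over slots is our assumption, as the paper does not specify how the weights are normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, d = 8, 16, 4              # surprise size, memory length, slot size

Q = rng.normal(size=(n, d))     # learnable map: surprise -> memory space
V = rng.normal(size=(d, n))     # learnable map: memory space -> surprise
M = rng.normal(size=(N, d))     # episodic memory contents

def read(u):
    """Content-based attention read-out u^e_t = w_t M V."""
    k = u @ Q                                              # u_t Q
    sims = (M @ k) / (np.linalg.norm(k)
                      * np.linalg.norm(M, axis=1) + 1e-8)  # w_t(j), cosine
    w = np.exp(sims) / np.exp(sims).sum()                  # assumed softmax
    return w @ M @ V

u = rng.normal(size=n)
u_e = read(u)
q = np.concatenate([u_e, u])    # query surprise q_t = [u^e_t, u_t] in R^{2n}
assert q.shape == (2 * n,)
```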
The query surprise is encoded into the weights of the network via backpropagation as we minimize the reconstruction loss:

L_W = E_t[r^i_t] = E_t[‖W(q_t) − q_t‖]

Here, q̃_t = W(q_t). Intuitively, it is easier to retrieve non-novel surprises experienced many times in past episodes. Thus, the intrinsic reward is lower for states that lead to these familiar surprises. On the contrary, rare surprises are harder to retrieve, which results in high reconstruction errors and intrinsic rewards. W acts as a long-term, inter-episode associative memory. Unlike slot-based memories, it has a fixed memory capacity and can compress information and learn data representations. We could store the surprise in a slot-based memory across episodes, but the size of this memory would grow enormous and the data would be stored redundantly; hence, the quality of the stored surprise would degrade as more and more observations come in. Readers can refer to Appendix A for the architecture details and how W can be interpreted as implementing associative memory. The whole system SG+SM is trained end-to-end by minimizing the loss L = L_SG + L_M + L_W. Here, we block the gradients from L_W backpropagated to the parameters of the SG to avoid trivial reconstructions of q_t. Pseudocode of our algorithm is presented in Appendix B.

3. Experimental Results

3.1. Noisy-TV: Robustness against Noisy Observations

We use Noisy-TV, an environment designed to fool exploration methods (Burda et al., 2018b; Savinov et al., 2018), to confirm that our method can generate intrinsic rewards that (1) are more robust to noise and (2) can discriminate rare and common observations through surprise novelty. We simulate this problem by employing a 3D maze environment with a random map structure. The TV is not fixed at specific locations in the maze; to make it more challenging, the agent carries the TV with it and can choose to watch it at any time. Hence, there are three basic actions (turn left, turn right, move forward) plus an extra action: watch TV. When taking this action, the agent sees a white-noise image sampled from a standard normal distribution, so the number of TV channels can be considered infinite. The agent's state is an image of its viewport, and its goal is to search for a red box randomly placed in the maze (+1 reward if the agent reaches the goal). The baseline is RND (Burda et al., 2018b), a simple yet strong SG that is claimed to obviate the stochasticity problems of Noisy-TV. Our SG+SM model uses RND as the SG, so we name it RND+SM. Since our model and the baseline share the same RND architecture, any difference in performance must be attributed to our SM. When observing the red box, RND+SM shows a higher mean normalized intrinsic reward (MNIR) than RND. The difference between the MNIR for common and rare states is also more prominent in RND+SM than in RND, because RND's prediction is not perfect even for common observations, creating relatively significant surprise norms for seeing walls. The SM fixes that issue by remembering surprise patterns and successfully retrieving them, producing much smaller surprise novelty compared to that of rare events like seeing the red box. Consequently, the agent with SM outperforms the other by a massive margin in task rewards (Fig. 3(b)).
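The watch-TV mechanic described above can be sketched as a simple environment wrapper. Class and method names, the action index, and the observation shape are our assumptions, not the paper's code.

```python
import numpy as np

class NoisyTVWrapper:
    """Sketch of the noisy-TV mechanic: an extra 'watch TV' action replaces
    the agent's view with white noise drawn from a standard normal
    distribution (names and shapes are illustrative assumptions)."""
    WATCH_TV = 3  # actions 0-2: turn left, turn right, move forward

    def __init__(self, env, obs_shape=(64, 64, 3)):
        self.env, self.obs_shape = env, obs_shape

    def step(self, action):
        if action == self.WATCH_TV:
            # Unpredictable screen: effectively infinitely many channels.
            obs = np.random.standard_normal(self.obs_shape)
            return obs, 0.0, False, {}
        return self.env.step(action)

class _DummyMaze:
    """Stand-in maze env for the sketch."""
    def step(self, action):
        return np.zeros((64, 64, 3)), 0.0, False, {}

env = NoisyTVWrapper(_DummyMaze())
noise_obs, r, done, _ = env.step(NoisyTVWrapper.WATCH_TV)
assert noise_obs.shape == (64, 64, 3) and r == 0.0
```

A surprise-norm agent keeps collecting large prediction errors from the white-noise frames, while surprise novelty lets the memory pass the noise through and assigns it little reward.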
Visualizing the number of watch-TV actions and the value of the intrinsic reward produced by RND+SM and RND over training time, we find that RND+SM helps the agent take fewer watch-TV actions and thus collect smaller amounts of intrinsic reward compared to RND. We also verify that our proposed method outperforms a simplified version of the SM that uses counts to measure surprise novelty, as well as a vanilla baseline that does not use intrinsic motivation. The details of these results are given in Appendix D.1.

3.2. MiniGrid: Compatibility with Different Surprise Generators

We show the versatility of our SG+SM framework by applying the SM to four SG backbones: RND (Burda et al., 2018b), ICM (Pathak et al., 2017), NGU (Badia et al., 2019) and an autoencoder (AE; see Appendix D.2 for implementation details). We test the models on three tasks from the MiniGrid environments: Key-Door (KD), Dynamic-Obstacles (DO) and Lava-Crossing (LC) (Chevalier-Boisvert et al., 2018). If the agent reaches the goal in a task, it receives a +1 reward. Otherwise, it can be punished with negative rewards if it collides with obstacles or takes too long to finish the task. These environments are not stochastic like the Noisy-TV, but they still contain other types of distraction. For example, in KD, the agent can be attracted to irrelevant actions such as repeatedly dropping and picking up the key. In DO, instead of going to the destination, the agent may chase obstacle balls flying around the map. In LC, the agent can commit unsafe actions such as going near lava areas, which differ from typical paths. In any case, due to reward sparsity, intrinsic motivation is beneficial. However, surprise alone may not be enough to guide efficient exploration, since the observation can be too complicated for the SG to minimize its prediction error. Thus, the agent quickly feels surprised, even in unimportant states. Table 1 shows the average returns of the models for the three tasks. The Baseline is the PPO backbone trained without intrinsic reward. RND, ICM, NGU and AE are SGs providing the PPO with surprise-norm rewards, while our SG+SM method uses surprise-novelty rewards. The results demonstrate that models with the SM often outperform their SGs significantly and always contain the best performers. Notably, in the LC task, the SGs hinder the performance of the Baseline because the agents are attracted to dangerous vivid states, which are hard to predict but cause the agent's death. The SM models avoid this issue and, in the case of ICM+SM, outperform the Baseline.
Compared to AE, which computes the intrinsic reward based on the novelty of the state, AE+SM shows a much higher average score in all tasks. That manifests the importance of modeling the novelty of surprises instead of states. To analyze the difference between the SG+SM's and SG's MNIR structure, we visualize the MNIR for each cell in the Key-Door map in Appendix's Figs. 5(b) and (c). We create a synthetic trajectory that scans through all the cells in the big room on the left and, at each cell, uses the RND+SM and RND models to compute the corresponding surprise-novelty and surprise-norm MNIRs, respectively. As shown in Fig. 5(b), RND+SM selectively identifies truly surprising events, with only a few cells having high surprise-novelty MNIR. Here, we can visually detect three important events that receive the most MNIR: seeing the key (bottom row), seeing the door side (in the middle of the rightmost column) and approaching the front of the door (the second and fourth rows). Other, less important cells are assigned very low MNIR. On the contrary, RND often gives high surprise-norm MNIR to cells around important ones, which creates a noisy MNIR map, as in Fig. 5(c). As a result, RND's performance is better than the Baseline's, yet far from that of RND+SM. Another analysis of how surprise novelty discriminates between surprises with similar norms is given in Appendix's Fig. 8.

3.3. Atari: Sample-efficient Benchmark

We adopt the sample-efficiency Atari benchmark (Kim et al., 2019) on six hard-exploration games where the training budget is only 50 million frames. We use our SM to augment two SGs: RND (Burda et al., 2018b) and LWM (Ermolov and Sebe, 2020). We also verify the benefit of the SM in the long run on Montezuma Revenge and Frostbite. As shown in Fig. 4(a,b), RND+SM still significantly outperforms RND after 200 million training frames, achieving average scores of 10,000 and 9,000, respectively. The result demonstrates the scalability of our proposed method. Using RND and RND+SM to compute the average MNIR in several rooms of Montezuma Revenge (Fig. 1), we find that the SM makes the MNIR higher for surprising events in rooms with complex structures while suppressing the MNIR of fake surprises in dark rooms. Even in the dark room, the movement of agents (human or spider) is hard to predict, leading to a high average surprise-norm MNIR. On the contrary, the average surprise-novelty MNIR is reduced because the prediction error can be recalled from the memory. The average training times of RND and RND+SM are 26h 24m and 28h 1m, respectively, which corresponds to only 7% more training time while the performance gap is significant (4,000 score).

3.4. Ablation Study

Role of Memories. Here, we use MiniGrid's Dynamic-Obstacles task to study the roles of M and W in the SM (built upon RND as the SG). Disabling W, we directly use the norm of q_t = [u^e_t, u_t] as the intrinsic reward, naming this version SM (no W). To ablate the effect of M, we remove u^e_t from q_t and only use q_t = u_t as the query to W, forming the version SM (no M). We also consider different episodic memory capacities and slot sizes N-d ∈ {32-4, 128-16, 1024-64}. As N and d increase, the short-term context expands and more past surprise information is considered in the attention. In theory, a big M helps capture a longer-term and more accurate context for constructing the surprise query. SM (no W) and SM (no M) show weak signs of learning, confirming the necessity of both modules in this task. Increasing N-d from 32-4 to 1024-64 improves the final performance. However, 1024-64 is not significantly better than 128-16, perhaps because similar surprises are unlikely to occur more than 128 steps apart; thus, a larger attention span does not provide a benefit. As a result, we keep N = 128 and d = 16 in all other experiments for faster computing. We also verify the necessity of M and W in Montezuma Revenge and illustrate how M generates lower MNIR when two similar events occur in the same episode in Key-Door (see Appendix D.4).

No Task Reward. In this experiment, we remove task rewards and merely evaluate the agent's ability to explore using intrinsic rewards. The task is to navigate 3D rooms and get a +1 reward for picking up an object (Chevalier-Boisvert, 2018). The state is the agent's image view, and there is no noise. Without task rewards, it is crucial to maintain the agent's interest in the unique events of seeing the objects. In this partially observable environment, surprise-prediction methods may struggle to explore even without noise, due to lacking the information needed for good predictions, which leads to persistently high prediction errors.
For this testbed, we evaluate a random exploration agent (Baseline), RND and RND+SM in two settings: one room with three objects (easy), and four rooms with one object (hard). To see the difference among the models, we compare the cumulative task rewards over 100 million steps (see Appendix D.4 for details). RND is even worse than the Baseline in the easy setting because its predictions cause high biases (intrinsic rewards) towards the unpredictable, hindering exploration when the map is simple. In contrast, RND+SM uses surprise novelty, generally showing smaller intrinsic rewards (see Appendix Fig. 12 (right)). Consequently, our method consistently demonstrates significant improvements over the other baselines (see Fig. 4(d) for the hard setting).

4. Related Works

Intrinsic motivation approaches usually give the agent reward bonuses for visiting novel states to encourage exploration. The bonus is proportional to the mismatch between prediction and reality, also known as surprise (Schmidhuber, 2010). One kind of predictive model is the dynamics model, wherein the surprise is the error of the model in predicting the next state given the current state and action (Achiam and Sastry, 2017; Stadie et al., 2015). One critical problem of these approaches is the unwanted bias towards transitions where the prediction target is a stochastic function of the inputs, commonly found in partially observable environments. Recent works focus on improving the features of the predictor's input by adopting representation learning mechanisms such as inverse dynamics (Pathak et al., 2017), variational autoencoders, random/pixel features (Burda et al., 2018a), or whitening transforms (Ermolov and Sebe, 2020). Although better representations may improve the reward bonus, they cannot completely solve the problem of stochastic dynamics and thus fail in extreme cases such as the noisy-TV problem (Burda et al., 2018b). Besides dynamics prediction, several works propose to predict other quantities as functions of the current state by using autoencoders (Nylend, 2017), episodic memory (Savinov et al., 2018), and random networks (Burda et al., 2018b). Burda et al. (2018b) claimed that using a deterministic random target network is beneficial in overcoming stochasticity issues. Other methods combine this idea with episodic memory and other techniques, achieving good results in large-scale experiments (Badia et al., 2020; 2019). From an information theory perspective, the notion of surprise can be linked to information gain or uncertainty, and predictive models can be treated as parameterized distributions (Achiam and Sastry, 2017; Houthooft et al., 2016; Still and Precup, 2012).
Furthermore, to prevent the agent from being attracted to unpredictable observations, the reward bonus can be measured by the progress of the model's prediction (Achiam and Sastry, 2017; Lopes et al., 2012; Schmidhuber, 1991). However, these methods are complicated, hard to scale, and require heavy computation. A different angle to handling stochastic observations during exploration is surprise minimization (Berseth et al., 2020; Rhinehart et al., 2021). In this direction, the agents get bigger rewards for seeing more familiar states. Such a strategy is somewhat the opposite of our approach and is suitable for unstable environments where the randomness occurs independently of the agents' actions. These earlier works rely on the principle of using surprise as an incentive for exploration and differ from our principle, which utilizes surprise novelty. Also, our work augments these existing works with a surprise memory module that can be used as a generic plug-in improvement for surprise-based models. We note that our memory formulation differs from memory-based novelty concepts using episodic memory (Badia et al., 2019), momentum memory (Fang et al., 2022), or counting (Bellemare et al., 2016; Tang et al., 2017), because our memory operates on the surprise level, not the state level. In our work, exploration is discouraged not only in frequently visited states but also in states whose surprises can be reconstructed by the SM. Our work provides a more general and learnable novelty detection mechanism, which is more flexible than nearest-neighbour search or a counting lookup table.

5. Discussion

This paper presents the Surprise Generator-Surprise Memory (SG+SM) framework to compute surprise novelty as an intrinsic motivation for reinforcement learning agents. Exploring with surprise novelty is beneficial when there are repeated patterns of surprises or random observations. For example, in the Noisy-TV problem, our SG+SM can curb the agent's tendency to visit noisy states, such as watching random TV channels, while encouraging it to explore rare events with distinctive surprises. We empirically show that our SM can supplement three surprise-based SGs to achieve more rewards in fewer training steps in three grid-world environments. In 3D navigation without external reward, our method significantly outperforms the baselines. On two strong SGs, our SM also achieves superior results in hard-exploration Atari games within 50 million training frames. Even in the long run, our method maintains a clear performance gap from the baselines, as shown in Montezuma Revenge and Frostbite. If we view surprise as the first-order error between the observation and the prediction, then surprise novelty, the retrieval error between the surprise and the reconstructed memory, is essentially a second-order error. It would be interesting to investigate the notion of higher-order errors, study their theoretical properties, and utilize them for intrinsic motivation in future work.

A W as Associative Memory

We present the analysis for a single linear layer, but the idea can extend to multi-layer feed-forward neural networks. For simplicity, assuming W is a square matrix, the objective is to minimize the difference between the input and the output of W:

L = ‖Wx − x‖²₂  (6)

Using gradient descent with learning rate α, we update W as follows:

W ← W − α ∂L/∂W = W − 2α(Wx − x)xᵀ = W − 2αWxxᵀ + 2αxxᵀ = W(I − 2αxxᵀ) + 2αxxᵀ

where I is the identity matrix and x is a column vector. If a batch of inputs {x_i}_{i=1}^B is used in computing the loss in Eq.
6, at step t, we update W as follows:

W_t = W_{t−1}(I − αX_t) + αX_t, where X_t = 2 Σ_{i=1}^B x_i x_iᵀ

Starting from t = 0, after T updates the weight becomes:

W_T = W_0 ∏_{t=1}^T (I − αX_t) − α² Σ_{t=2}^T X_t X_{t−1} ∏_{k=t+1}^T (I − αX_k) + α Σ_{t=1}^T X_t  (7)

Given its form, X_t is symmetric positive-definite. Also, as α is often very small (0 < α ≪ 1), we can show that ‖I − αX_t‖ ≤ 1 − λ_min(αX_t) < 1. This means that as T → ∞, W_0 ∏_{t=1}^T (I − αX_t) → 0, and thus W_T → −α² Σ_{t=2}^T X_t X_{t−1} ∏_{k=t+1}^T (I − αX_k) + α Σ_{t=1}^T X_t, independent of the initialization W_0. Eq. 7 shows how the data (X_t) is integrated into the neural network weight W_T. The component −α² Σ_{t=2}^T X_t X_{t−1} ∏_{k=t+1}^T (I − αX_k) can be viewed as additional encoding noise. Without this component (by assuming α is small enough), W_T ≈ α Σ_{t=1}^T X_t = 2α Σ_{t=1}^T Σ_{i=1}^B x_{i,t} x_{i,t}ᵀ, or equivalently, we have the Hebbian update rule:

W ← W + x_{i,t} ⊗ x_{i,t}

where W can be seen as the memory, ⊗ is the outer product, and x_{i,t} is the data item stored in the memory. This memory update is the same as that of classical associative memory models such as the Hopfield network and the Correlation Matrix Memory (CMM). Given a query q (a row vector), we retrieve the value in W as the output of the network:

q̃ = qW = qR + α Σ_{t=1}^T qX_t = qR + 2α Σ_{t=1}^T Σ_{i=1}^B (q x_{i,t}) x_{i,t}ᵀ

where R = W_0 ∏_{t=1}^T (I − αX_t) − α² Σ_{t=2}^T X_t X_{t−1} ∏_{k=t+1}^T (I − αX_k). If q was presented to the memory W in the past as some x_j, q̃ can be written as:

q̃ = qR + 2α Σ_{t=1}^T Σ_{i=1, i≠j}^B (q x_{i,t}) x_{i,t}ᵀ + 2α (q qᵀ) q

where the term qR is encoding noise and the middle sum is cross-talk. Assuming the noise is insignificant thanks to a small α, we can retrieve exactly q, given that all items in the memory are orthogonal². As a result, after scaling q̃ with 1/(2α), the retrieval error ‖q̃/(2α) − q‖ is 0. If q is new to W, the error will depend on whether the items stored in W are close to q. Usually, the higher the error, the more novel q is with respect to W.
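The associative-memory behavior derived above can be checked numerically: storing orthogonal items with the Hebbian outer-product update yields zero retrieval error for stored queries and nonzero error for novel ones. This is a sketch with the 2α scaling constants absorbed.

```python
import numpy as np

# Hebbian outer-product storage W <- W + x x^T, then retrieval via q W.
n = 8
items = np.eye(n)[:3]              # three orthogonal unit-norm stored items
W = np.zeros((n, n))
for x in items:
    W += np.outer(x, x)            # Hebbian storage

q = items[1]                       # query with a previously stored item
retrieved = q @ W
assert np.allclose(retrieved, q)   # zero retrieval error: not novel

q_new = np.eye(n)[5]               # orthogonal query never stored
err = np.linalg.norm(q_new @ W - q_new)
assert err > 0                     # nonzero retrieval error: novel
```

With orthogonal items there is no cross-talk, so stored queries come back exactly; correlated items would degrade retrieval gradually, which is what the reconstruction error exploits as a novelty signal.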

B SM's Implementation Detail

In practice, the short-term memory M is a tensor of shape [B, N, d], where B is the number of actors, N the memory length and d the slot size. B is an SG hyperparameter, tuned per task based on the SG's performance. For example, for the Noisy-TV, we tune RND as the SG, obtaining B = 64, and use this value directly for M. N and d are the hyperparameters specific to our method. As mentioned in Sec. 3.4, we fix N = 128 and d = 16 in all experiments. As B increases in large-scale experiments, memory storage for M can be demanding. To overcome this issue, we can use the uniform writing trick to optimally preserve information while reducing N (Le et al., 2019). Also, for W, by using a small hidden size, we can reduce the requirement for physical memory significantly. Practically, in all experiments, we implement W as a 2-layer feed-forward neural network with a hidden size of 32 (2n → 32 → 2n). The activation is tanh. With n = 512 and d = 16, the number of parameters of W is only about 65K. Also, Q ∈ R^{n×d} and V ∈ R^{d×n} have about 8K parameters each. In total, our SM introduces fewer than 90K trainable parameters, which is marginal compared to the SG and policy/value networks (up to 10 million parameters). The joint training of SG+SM is presented in Algo. 2. We note that vector notations in the algorithm are row vectors. For simplicity, the algorithm assumes 1 actor. In practice, our algorithm works with multiple actors and mini-batch training.
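The parameter budget quoted above can be checked with quick arithmetic, assuming standard dense layers with biases for W and bias-free projection matrices Q and V (an assumption, since the bias convention is not spelled out here):

```python
# Back-of-envelope parameter count for the SM, using the sizes in the text:
# W is 2n -> 32 -> 2n with tanh, Q is n x d, V is d x n, n = 512, d = 16.
n, d, hidden = 512, 16, 32

def dense(n_in, n_out):
    return n_in * n_out + n_out          # weight matrix + bias vector

w_params = dense(2 * n, hidden) + dense(hidden, 2 * n)
q_params = n * d                          # projection, assumed bias-free
v_params = d * n

total = w_params + q_params + v_params
print(w_params)   # 66592, i.e. about 65K, as stated
print(total)      # 82976, under the 90K quoted for the whole SM
```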

C Intrinsic Reward Normalization

² By certain transformations, this condition can be reduced to linear independence.

Under review as a conference paper at ICLR 2023

Algorithm 1 Intrinsic reward computation via the SG+SM framework.
Require: u_t, and our surprise memory SM consisting of a slot-based memory M, parameters Q, V and a neural network W
1: Compute L_SG = ‖u_t‖
2: Query M with u_t, retrieve u^e_t = w_t M V, where w_t is the attention weight
3: Compute L_M = ‖u^e_t − u_t.detach()‖
4: Query W with q_t = [u^e_t, u_t], retrieve q̃_t = W(q_t)
5: Compute intrinsic reward r^i_t = L_W = ‖q̃_t − q_t.detach()‖
6: return L_SG, L_M, L_W

Algorithm 2 Jointly training SG+SM and the policy.
Require: buffer, policy π_θ, surprise-based predictor SG, and our surprise memory SM consisting of a slot-based memory M, parameters Q, V and a neural network W
1: Initialize π_θ, SG, Q, W
2: for iteration = 1, 2, ... do
...
14: Compute surprise u_t = SG(I_t) − O_t.detach() (Eq. 1)
15: Compute L_SG, L_M, L_W using Algo. 1
16: Update SG, Q and W by minimizing the loss L = L_SG + L_M + L_W
17: Update π_θ with sample (s_{t−1}, s_t, a_t, r_t) from buffer using backbone algorithms
18: end for
19: end for

Following (Burda et al., 2018b), to keep the intrinsic reward on a consistent scale, we normalize it by dividing by a running estimate of the standard deviation of the intrinsic returns. This normalized intrinsic reward (NIR) is used for training. In addition, a hyperparameter named the intrinsic reward coefficient scales the intrinsic contribution relative to the external reward. We denote the running standard deviation and the intrinsic reward coefficient as r^std_t and β, respectively, in Algo. 2. In our experiments, unless otherwise stated, β = 1.
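The episode-level mean normalization used for analysis in this appendix can be illustrated with a small sketch; the constant-200 and 0/1 reward assignments are hypothetical examples of two methods' intrinsic rewards, and the eps guard for the constant case is our addition:

```python
import numpy as np

def mnir(rewards, eps=1e-8):
    # Mean-normalized intrinsic reward: subtract the episode mean, then divide
    # by the episode std (eps guards the degenerate constant-reward episode).
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One method gives every step an intrinsic reward of 200: after mean
# subtraction no step stands out, i.e. there is no exploration signal.
a = mnir([200.0] * 5)
# Another gives 1 only at a novel step: that step stands out after normalization.
b = mnir([0.0, 0.0, 1.0, 0.0, 0.0])

print(np.allclose(a, 0.0))   # True
print(int(b.argmax()))       # 2, the novel step
```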
We note that when comparing the intrinsic reward at different states in the same episode (as in the experiment section), we normalize intrinsic rewards by subtracting the mean, followed by a division by the standard deviation of all intrinsic rewards in the episode. Hence, the mean-normalized intrinsic reward (MNIR) in these experiments is different from the one used in training and can be negative. We argue that normalizing with the mean and std. of the episode's intrinsic rewards is necessary to make the comparison meaningful. For example, in an episode, method A assigns all steps an intrinsic reward of 200, while method B assigns novel steps an intrinsic reward of 1 and all other steps 0. Clearly, method A treats all steps in the episode equally, which is equivalent to giving no motivation at any step (the learned policy will not drive the agent toward novel states). On the contrary, method B triggers motivation for novel steps in the episode (the learned policy will encourage visits to novel states). Without normalizing by mean subtraction, it is tempting to conclude that the relative intrinsic reward of method A for a novel step is higher, which is technically incorrect.

D.1 Noisy-TV

The Noisy-TV environment is built on the MiniWorld library (Apache License) (Chevalier-Boisvert, 2018). The backbone RL algorithm is PPO. We adopt a public code repository for the implementation of PPO and RND (MIT License)³. In this environment, the state is an image of the agent's viewport. The details of the architecture and hyperparameters of the backbone and RND are presented in Table 4. Most of the settings are the same as in the repository. We only tune the number of actors (128, 1024), mini-batch size (4, 16, 64) and ε-clip (0.1, 0.2, 0.3) to suit our hardware and the task. After tuning with RND, we use the same setting for our RND+SM. As mentioned in the main text, RND+SM is better at handling noise than RND.
Note that RND aims to predict the transformed states by minimizing ‖SG(s_t) − f_R(s_t)‖, where f_R is a fixed neural network initialized randomly. If RND learns the transformation, it can pass through the state, similar to reconstruction in an autoencoder. However, learning f_R can be harder and require more samples than learning an identity transformation, since f_R is non-linear and complicated. Hence, it may be more challenging for RND to pass through the noise than for the SM. Another possible reason lies in the operating space (state vs. surprise). If we treat white noise as a random variable X, a surprise generator (SG) can at most learn to predict the mean of this variable and compute the surprise U = E[X|Y] − X, where Y is a random factor that affects the training of the surprise generator. The factor Y makes the SG produce the imperfect reconstruction E[X|Y]⁴. Here, SG and SM learn to reconstruct X and U, respectively. We can prove that the variance of each feature dimension in U is smaller than that of X (see Sec. E). Learning an autoencoder in surprise space is more beneficial than in state space since the data has less variance, and thus it may require fewer data points to learn the data distribution.

The intrinsic reward is then β/√(c(u_t)). We tune the hyperparameter β = {0.5, 1, 5} and the hash matrix size k_h = {32, 64, 128, 256}, and use the same normalization and training process to run this baseline. We report the learning curves of the best variant with β = 0.5 and k_h = 128. The result demonstrates that the proposed SM using memory-augmented neural networks outperforms the count-based SM by a significant margin. One possible reason is that the count-based method cannot handle white noise: it always returns high intrinsic rewards. In contrast, our SM can partly reconstruct white noise via the pass-through mechanism and thus reduces the impact of fake surprise on learning.
Also, the proposed SM is more flexible than its count-based counterpart since it learns to reconstruct from the data rather than using a fixed counting scheme. The result also shows that RND+SM outperforms the vanilla Baseline. Although the improvement is moderate (0.9 vs 0.85), the result is remarkable since the Noisy-TV is designed to fool intrinsic motivation methods and, among all methods, only RND+SM outperforms the vanilla Baseline.
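For reference, the count-based RND+SM (count) baseline can be sketched as follows. The Gaussian projection and binary-code counting follow the standard SimHash recipe; the surprise dimension is illustrative, while β and k_h are the best-reported values above:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n = 16                         # surprise dimension (illustrative)
k_h, beta = 128, 0.5           # the best-reported hash size and coefficient

A = rng.normal(size=(k_h, n))  # fixed Gaussian projection for SimHash
counts = Counter()

def count_based_reward(u):
    # SimHash the surprise into a binary code, count occurrences of the code,
    # and pay beta / sqrt(count): repeated surprises earn less and less.
    code = tuple((A @ u > 0).astype(np.int8).tolist())
    counts[code] += 1
    return beta / np.sqrt(counts[code])

u = rng.normal(size=n)
r1 = count_based_reward(u)     # first occurrence: beta / 1
r2 = count_based_reward(u)     # same surprise again: beta / sqrt(2)
print(r1, r2)
```

Note the failure mode discussed above: white-noise surprises almost never repeat a hash code, so their counts stay at 1 and the reward stays high.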

D.2 MiniGrid

The tasks in this experiment are from the MiniGrid library (Apache License) (Chevalier-Boisvert et al., 2018). In MiniGrid environments, the state is a description vector representing partial observation information such as the location of the agents, objects, moving directions, etc. The three tasks use the hardest maps:

• DoorKey: MiniGrid-DoorKey-16x16-v0
• LavaCrossing: MiniGrid-LavaCrossingS11N5-v0
• DynamicObstacles: MiniGrid-Dynamic-Obstacles-16x16-v0

The SGs used in this experiment are RND (Burda et al., 2018b), ICM (Pathak et al., 2017), NGU (Badia et al., 2019) and AE. Below we describe the input-output structure of these SGs.

• RND: I_t = s_t and O_t = f_R(s_t), where s_t is the current state and f_R is a neural network that has a similar structure to the prediction network, yet its parameters are initialized randomly and fixed during training.
• ICM: I_t = (s_{t−1}, a_t) and O_t = s_t, where s is the embedding of the state and a the action. We note that in addition to the surprise loss (Eq. 2), ICM is trained with an inverse dynamics loss.
• NGU: This agent reuses RND as the SG (I_t = s_t and O_t = f_R(s_t)) and combines the surprise norm with a KNN episodic reward. When applying our SM to NGU, we only take the surprise-based reward as input to the SM. The code for NGU is based on this public repository: https://github.com/opendilab/DI-engine.
• AE: I_t = s_t and O_t = s_t, where s is the embedding of the state. This SG can be viewed as an associative memory of the observations, aiming to remember the states. This baseline is designed to verify the importance of surprise modeling. Despite sharing a similar architecture, it differs from our SM, which operates on surprises and has an augmented episodic memory to support reconstruction.
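The RND input-output structure above can be made concrete with a toy sketch: a fixed random target f_R and a predictor SG, both hypothetical one-layer tanh networks (the real networks are deeper):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, embed_dim = 8, 4

# Fixed, randomly initialized target network f_R: never trained.
Wf = rng.normal(size=(state_dim, embed_dim))
f_R = lambda s: np.tanh(s @ Wf)

# Trainable predictor SG, here freshly initialized, tries to match f_R.
Wp = rng.normal(size=(state_dim, embed_dim))
SG = lambda s: np.tanh(s @ Wp)

def rnd_surprise(s):
    # u_t = SG(I_t) - O_t with I_t = s_t and O_t = f_R(s_t); its norm is the
    # usual RND bonus, while u_t itself is what the SM stores and reconstructs.
    return SG(s) - f_R(s)

s = rng.normal(size=state_dim)
u = rnd_surprise(s)
print(u.shape, float(np.linalg.norm(u)) > 0)
```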

SG and RL backbone implementations

Depending on the setting, the models are trained for 50 or 200 million frames.

Results

Fig. 9 demonstrates the learning curves of all models in 6 Atari games under the low-sample regime. LWM+SM clearly outperforms LWM in Frostbite, Venture, Gravitar and Solaris, and RND+SM clearly outperforms RND in Frostbite, Venture, Gravitar and Montezuma Revenge. Table 5 reports the results of more baselines.

Role of Memories

We conduct more ablation studies to verify the need for the short-term (M) and long-term (W) memories in our SM. We design additional baselines SM (no W) and SM (no M) (see Sec. 3.4). We also show the impact of the episodic memory in decreasing the intrinsic rewards for similar states, as discussed in Sec. 2.3. We select 3 states in MiniGrid's KeyDoor task and compute the MNIR for each state, visualized in Fig. 11. At the step-1 state, the MNIR is low since there is nothing special in the view of the agent. At the step-15 state, the agent first sees the key and gets a high MNIR. At the step-28 state, the agent drops the key and sees the key again. This event is still more interesting than the step-1 state. However, the view is similar to the one at step 15, and thus the MNIR decreases from 0.7 to 0.35, as expected.

No Task Reward

The tasks in this experiment are from the MiniWorld library (Apache License) (Chevalier-Boisvert, 2018). The two tasks are:

• Easy: MiniWorld-PickupObjs-v0
• Hard: MiniWorld-FourRooms-v0

E Variance of Surprise

Using the Law of Iterated Expectations, with $Z = \mathbb{E}[X|Y]$, we have

$$\mathbb{E}[X - Z] = \mathbb{E}[\mathbb{E}[X - Z \mid Y]] = \mathbb{E}[\mathbb{E}[X|Y] - \mathbb{E}[Z|Y]] = \mathbb{E}[Z - Z] = 0$$

and

$$\mathbb{E}[(X - Z)Z] = \mathbb{E}[\mathbb{E}[(X - Z)Z \mid Y]] = \mathbb{E}[\mathbb{E}[XZ - Z^2 \mid Y]] = \mathbb{E}[\mathbb{E}(XZ|Y) - \mathbb{E}(Z^2|Y)] = \mathbb{E}[Z\,\mathbb{E}(X|Y) - Z^2] = \mathbb{E}[Z^2 - Z^2] = 0.$$

Therefore, $\mathrm{var}(X) = \mathrm{var}(X - Z) + \mathrm{var}(Z)$. Let $C^X_{ii}$, $C^{X-Z}_{ii}$ and $C^Z_{ii}$ denote the diagonal entries of these covariance matrices; they are the variances of the components of the random vectors $X$, $X - Z$ and $Z$, respectively. That is, $C^X_{ii} = C^{X-Z}_{ii} + C^Z_{ii}$, and hence $C^{X-Z}_{ii} \le C^X_{ii}$.
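The decomposition var(X) = var(X − Z) + var(Z) can be checked numerically under a toy model X = Y + ε with Y and ε independent, so that Z = E[X|Y] = Y (a hypothetical, perfectly fitted conditional mean):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: X = Y + eps, so Z = E[X|Y] = Y and the surprise U = Z - X = -eps.
Y = rng.normal(size=200_000)
eps = rng.normal(scale=0.5, size=200_000)
X = Y + eps
Z = Y
U = Z - X

# The surprise has strictly smaller variance than the observation here...
print(X.var() > U.var())                              # True
# ...and the additive decomposition holds up to sampling error.
print(abs(X.var() - (U.var() + Z.var())) < 0.02)      # True
```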



Figure 3: Noisy-TV: (a) mean-normalized intrinsic reward (MNIR) produced by RND and RND+SM at 7 selected steps in an episode. (b) Average task return (mean±std. over 5 runs) over 4 million training steps.

Fig. 3 (a) illustrates the mean-normalized intrinsic rewards (MNIR)¹ measured at different states in our Noisy-TV environment. The first two states are noises, the following three states are common walls, and the last two are ones where the agent sees the box.

¹ See Appendix C for more information on this metric.

Figure 4: (a,b) Atari long runs over 200 million frames: average return over 128 episodes. (c) Ablation study on SM's components. (d) MiniWorld exploration without task reward: Cumulative task returns over 100 million training steps for the hard setting. The learning curves are mean±std. over 5 runs.

Fig. 4 (c) depicts the performance curves of the methods after 10 million training steps.

Figure 5: Key-Door: (a) Example map in Key-Door, where the light window is the agent's view window (state). MNIR produced for each cell in a manually created trajectory for RND+SM (b) and RND (c). The green arrows denote the agent's direction at each location. The brighter the cell, the higher the MNIR assigned to the corresponding state.

Fig. 6 reports all results for this environment. Fig. 6 (a) compares the final intrinsic reward (IR) generated by RND and RND+SM over training time. Overall, RND's IR is always higher than RND+SM's, indicating that our method significantly reduces the agent's attention to the noisy TV by assigning less IR to watching it. Fig. 6 (b) compares the number of noisy actions between the two methods, where RND+SM consistently shows fewer TV-watching actions. This confirms that the RND+SM agent is less distracted by the TV.

Fig. 6 (c) reports the performance of all baselines. Besides RND and RND+SM, we also include PPO without intrinsic reward as the vanilla Baseline for reference. In addition, we investigate a simple implementation of the SM using a count-based method to measure surprise novelty. Concretely, we use the SimHash algorithm to count surprise occurrences c(u_t) in a similar manner to (Bellemare et al., 2016) and name the baseline RND+SM (count).

3 https://github.com/jcwleo/random-network-distillation-pytorch
4 In this case, the perfect reconstruction is E[X].

Figure 8: Key-Door: t-SNE 2D representations of surprise (u_t) and surprise residual (q̃_t − q_t). Each point corresponds to the MNIR at some step in the episode. Color denotes the MNIR value (darker means higher MNIR). The red circle in the left picture shows an example cluster of 6 surprise points. The surprise residuals of these points are not clustered, as shown in the 6 red circles in the right picture. In other words, the surprise residual can discriminate surprises with similar norms.

Figure 9: Atari low-sample regime: learning curves over 50 million frames (mean±std. over 5 runs). To aid visualization, we smooth the curves by averaging over a window of size 50.

We compare them with the full SM in Montezuma Revenge and Frostbite. Fig. 10 (a) shows that only SM (full) can reach an average score of more than 5000 after 50 million training frames. The other ablated baselines only achieve around 2000.

Figure 10: Ablation study: average returns (mean±std.) over 5 runs.

Figure 11: MiniGrid's KeyDoor: MNIR of SM at different steps in an episode.

Figure 12: MiniWorld: Exploration without task reward. Left: Cumulative task returns over 100 million training steps for two settings: Easy (1 room, 3 objects) and Hard (4 rooms, 1 object). Right: The average intrinsic reward over training time. The learning curves are averaged (mean±std.) over 5 runs.

MiniGrid: test performance after 10 million training steps. The numbers are average task return ×100 over 128 episodes (mean±std. over 5 runs). Bold denotes the best results on each task. Italic denotes that SG+SM is better than SG with Cohen effect size less than 0.5.

The MNIR bars show that both models are attracted mainly by the noisy TV, resulting in the highest MNIRs. However, our model with SM suffers less from noisy-TV distraction since its MNIR is lower than RND's. We speculate that the SM is able to partially reconstruct the white-noise surprise via the pass-through mechanism, making the normalized surprise novelty generally smaller than the normalized surprise norm in this case. That mechanism is enhanced in the SM with surprise reconstruction (see Appendix D.1 for an explanation).

Atari: average return over 128 episodes after 50 million training frames (mean over 5 runs). ♠ is from a prior work (Ermolov and Sebe, 2020). ♦ is our run. The last two rows are the mean and median human-normalized scores. Bold denotes the best results. Italic denotes that SG+SM is significantly better than SG with Cohen effect size less than 0.5.

In Frostbite and Montezuma Revenge, RND+SM's score is almost twice that of RND. For LWM+SM, games such as Gravitar and Venture show more than 40% improvement. Overall, LWM+SM and RND+SM achieve the best mean and median human-normalized scores, improving 16% and 22% over the best SGs, respectively. Notably, RND+SM shows a significant improvement on the notorious Montezuma Revenge.

Execute policy π_θ to collect s_t, a_t, r_t, forming input I_t = s_t, ... and target O_t. Compute surprise u_t = SG(I_t) − O_t.detach() (Eq. 1).
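These steps (execute the policy, compute the surprise, score its novelty) can be strung together in a toy numpy loop; the random "policy", the linear networks, and the dot-product attention are all hypothetical stand-ins, not the trained components:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, d = 16, 32, 8          # toy sizes: surprise dim, memory length, slot size

Wf = rng.normal(size=(n, n))                 # fixed random target f_R
Wp = rng.normal(size=(n, n))                 # SG's predictor (untrained here)
Q = rng.normal(size=(n, d)) * 0.1            # SM query projection
V = rng.normal(size=(d, n)) * 0.1            # SM value projection
M = rng.normal(size=(N, d))                  # episodic memory slots
W1 = rng.normal(size=(2 * n, 16)) * 0.1      # long-term network W, layer 1
W2 = rng.normal(size=(16, 2 * n)) * 0.1      # long-term network W, layer 2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rewards = []
for t in range(5):
    s_t = rng.normal(size=n)                 # "execute policy": a toy state
    u_t = s_t @ Wp - s_t @ Wf                # surprise u_t = SG(I_t) - O_t
    w_t = softmax((u_t @ Q) @ M.T)           # attention over the N slots
    u_e = (w_t @ M) @ V                      # reconstructed surprise u^e_t
    q_t = np.concatenate([u_e, u_t])
    q_tilde = np.tanh(q_t @ W1) @ W2         # retrieval from W
    rewards.append(np.linalg.norm(q_tilde - q_t))   # r^i_t = L_W (Algo. 1)

print(len(rewards), all(r >= 0 for r in rewards))
```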

Hyperparameters of RND (PPO backbone).

6 https://github.com/htdt/lwm
7 https://github.com/openai/gym

Atari: test performance after 50 million training frames (mean over 5 runs). ♠ is from a prior work (Ermolov and Sebe, 2020). ♦ is our run. The last two rows are the mean and median human-normalized scores. Bold denotes the best results. Italic denotes that SG+SM is significantly better than SG with Cohen effect size less than 0.5.

Appendix

A W as Associative Memory

This section connects the associative memory concept to neural networks trained with the reconstruction loss as in Eq. 5. We will show how the neural network (W) stores and retrieves its data. We use a 1-layer feed-forward neural network W to simplify the analysis, but the idea can extend to multi-layer feed-forward neural networks.

The backbone RL algorithm is PPO. The code for PPO and RND is the same as in Sec. D.1. We adopt a public repository for the implementation of ICM (MIT License)⁵. We implement AE ourselves using a 3-layer feed-forward neural network. For the backbone, we only tune the number of actors (128, 1024), mini-batch size (4, 16, 64) and ε-clip (0.1, 0.2, 0.3) on the DoorKey task. We also tune the architecture of the AE (number of layers: 1, 2 or 3; activation: tanh or ReLU) on the same task. After tuning the SGs, we use the same setting for our SG+SM. The detailed configurations of the SGs for this experiment are reported in Table 3 and Table 4. The full learning curves of the vanilla Baseline, SG and SG+SM are given in Fig. 7. To visualize the difference between surprise and residual vectors, we map these vectors in the trajectory to 2-dimensional space using t-SNE projection in Fig. 8. The surprise points show clustering patterns for high-MNIR steps, which confirms our hypothesis that there are familiar surprises (highly surprising due to high norm, yet repeated). In contrast, the surprise residuals estimated by the SM show no high-MNIR clusters. The SM transforms clustered surprises into scattered surprise residuals, resulting in a broader range of MNIR, thus showing significant discrimination between states that have similar surprise norms.

D.3 Atari

The Atari 2600 games task involves training an agent to achieve high game scores. The state is a 2D image representing the screen of the game.

5 https://github.com/jcwleo/curiosity-driven-exploration-pytorch

In our setting, X and U represent the observation and surprise spaces, respectively. Therefore, the variance of each feature dimension in surprise space is smaller than that in observation space. Equality is obtained when σ²_{Z_i} = 0 or E(X|Y) = E(X), i.e., the SG's prediction is perfect, which is unlikely to happen in practice.

F Limitations

Our method assumes that surprises have patterns and can be remembered by our surprise memory. There might exist environments beyond those studied in this paper where this assumption does not hold, or where surprise-based counterparts already achieve optimal exploration (e.g., a perfect SG) and thus do not need the SM for improvement (e.g., the Freeway game). In addition, M and W require additional physical memory (RAM/GPU) compared to SG-only methods. Finally, a plug-in module like the SM introduces more hyperparameters, such as N and d. Although we find that the default values N = 128 and d = 16 work well across all experiments in this paper, we recommend adjustments if users apply our method to novel domains.

