REVISITING INTRINSIC REWARD FOR EXPLORATION IN PROCEDURALLY GENERATED ENVIRONMENTS

Abstract

Exploration under sparse rewards remains a key challenge in deep reinforcement learning. Recently, studying exploration in procedurally-generated environments has drawn increasing attention. Existing works generally combine lifelong and episodic intrinsic rewards to encourage exploration. Although various lifelong and episodic intrinsic rewards have been proposed, the individual contributions of the two kinds of intrinsic rewards to improving exploration are barely investigated. To bridge this gap, we disentangle these two parts and conduct ablative experiments. We consider the lifelong and episodic intrinsic rewards used in prior works, and compare the performance of all lifelong-episodic combinations on the commonly used MiniGrid benchmark. Experimental results show that using only episodic intrinsic rewards can match or surpass prior state-of-the-art methods. In contrast, using only lifelong intrinsic rewards hardly makes progress in exploration. This demonstrates that the episodic intrinsic reward is more crucial than the lifelong one in boosting exploration. Moreover, through experimental analysis we find that the lifelong intrinsic reward does not accurately reflect the novelty of states, which explains why it does not help much in improving exploration.

1. INTRODUCTION

How to encourage sufficient exploration in environments with sparse rewards is one of the most actively studied challenges in deep reinforcement learning (RL) (Bellemare et al., 2016; Pathak et al., 2017; Ostrovski et al., 2017; Burda et al., 2019a;b; Osband et al., 2019; Ecoffet et al., 2019; Badia et al., 2019). In order to learn exploration behavior that can generalize to similar environments, recent works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Campero et al., 2021; Flet-Berliac et al., 2021; Zha et al., 2021) have paid increasing attention to procedurally-generated grid-like environments (Chevalier-Boisvert et al., 2018; Küttler et al., 2020). Among them, approaches based on intrinsic reward (Raileanu & Rocktäschel, 2020; Zhang et al., 2020), which combine lifelong and episodic intrinsic rewards to encourage exploration, have proven quite effective. The lifelong intrinsic reward encourages visits to novel states that have been less frequently experienced over the entire past, while the episodic intrinsic reward encourages the agent to visit states that are relatively novel within an episode. However, previous works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020) mainly focus on designing the lifelong intrinsic reward while considering the episodic one only as a minor complement. The individual contributions of these two kinds of intrinsic rewards to improving exploration are barely investigated. To bridge this gap, we present in this work a comprehensive empirical study to reveal their contributions. Through extensive experiments on the commonly used MiniGrid benchmark (Chevalier-Boisvert et al., 2018), we ablatively study the effect of lifelong and episodic intrinsic rewards on boosting exploration.
Specifically, we use the lifelong and episodic intrinsic rewards considered in prior works (Pathak et al., 2017; Burda et al., 2019b; Raileanu & Rocktäschel, 2020; Zhang et al., 2020), and compare the performance of all lifelong-episodic combinations in environments with and without extrinsic rewards. Surprisingly, we find that using only episodic intrinsic rewards, the trained agent can match or surpass the performance of agents trained with other lifelong-episodic combinations, in terms of the cumulative extrinsic reward in a sparse reward setting and the number of explored rooms in a pure exploration setting. In contrast, using only lifelong intrinsic rewards makes little progress in exploration. These observations suggest that the episodic intrinsic reward is the more crucial ingredient of the intrinsic reward for efficient exploration. Furthermore, we experimentally analyze why the lifelong intrinsic reward does not offer much help in improving exploration. Specifically, we find that randomly permuting the lifelong intrinsic rewards within a batch does not cause a significant drop in performance. This suggests that the lifelong intrinsic reward may not accurately reflect the novelty of states, explaining its ineffectiveness. We also compare the performance of using only episodic intrinsic rewards to other state-of-the-art methods, including RIDE (Raileanu & Rocktäschel, 2020), BeBold (Zhang et al., 2020), AGAC (Flet-Berliac et al., 2021) and RAPID (Zha et al., 2021), and find that simply using episodic intrinsic rewards outperforms the others by large margins. In summary, our work makes the following contributions:

• We conduct a comprehensive study on lifelong and episodic intrinsic rewards in exploration and find that the episodic intrinsic reward, overlooked by previous works, is actually the more important ingredient for efficient exploration in procedurally-generated gridworld environments.

• We analyze why the lifelong intrinsic reward does not contribute much and find that it is unable to accurately reflect the novelty of states.

• We show that simply using episodic intrinsic rewards serves as a very strong baseline, outperforming current state-of-the-art methods by large margins. This finding should inspire future work on designing and rethinking intrinsic rewards.

2.1. NOTATION

We consider a single-agent Markov Decision Process (MDP) described by a tuple (S, A, R, P, γ), where S denotes the state space and A the action space. At time t the agent observes state s_t ∈ S and takes an action a_t ∈ A by following a policy π : S → A. The environment then yields a reward signal r_t according to the reward function R(s_t, a_t). The next state observation s_{t+1} ∈ S is sampled according to the transition distribution P(s_{t+1} | s_t, a_t). The goal of the agent is to learn an optimal policy π* that maximizes the expected cumulative reward:

$$\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{\pi, P}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \quad (1)$$

where Π denotes the policy space and γ ∈ [0, 1) is the discount factor. Following previous curiosity-driven approaches (Pathak et al., 2017; Burda et al., 2019b; Raileanu & Rocktäschel, 2020; Zhang et al., 2020), we consider that at each timestep the agent receives an intrinsic reward r^i_t in addition to the extrinsic reward r^e_t. The intrinsic reward is designed to capture the agent's curiosity about the states by quantifying how different they are from those already visited; r^i_t can be computed for any transition tuple (s_t, a_t, s_{t+1}). The agent's goal then becomes maximizing the weighted sum of intrinsic and extrinsic rewards:

$$r_t = r^e_t + \beta r^i_t, \quad (2)$$

where β is a hyperparameter balancing the two rewards. We may drop the subscript t to refer to the corresponding reward as a whole rather than the reward at timestep t.
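To make Eqs. (1) and (2) concrete, here is a minimal Python sketch of the reward mixing and the discounted-return objective. The function names are ours for illustration, not from any released implementation:

```python
def mix_rewards(r_ext, r_int, beta=0.005):
    """Eq. (2): r_t = r^e_t + beta * r^i_t, applied over a trajectory."""
    return [re + beta * ri for re, ri in zip(r_ext, r_int)]

def discounted_return(rewards, gamma=0.99):
    """The objective inside Eq. (1) for one trajectory: sum_t gamma^t * r_t.

    Computed backwards so each step is a single multiply-add.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

The beta default is only a placeholder; in practice it is tuned per environment.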

2.2. TWO PARTS OF THE INTRINSIC REWARD

For procedurally-generated environments, recent works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020) form the intrinsic reward as the product of two parts, r^i_t = r^lifelong_t · r^episodic_t: a lifelong intrinsic reward r^lifelong that gradually discourages visits to states that have been visited many times across episodes, and an episodic intrinsic reward r^episodic that discourages revisiting the same states within an episode. r^lifelong is learned and updated throughout the whole training process, while r^episodic is reset at the beginning of each episode. However, these recent works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020) mainly focus on designing a better lifelong intrinsic reward while treating the episodic one as a part of minor importance. In this work, we find that the episodic intrinsic reward is actually more essential for encouraging exploration. For the experiments and discussions in the following sections, we consider 4 choices of r^lifelong and 2 choices of r^episodic used in prior works, introduced below.

ICM  Pathak et al. (2017) learn a state embedding f_emb jointly with inverse and forward dynamics models, and use the forward model's prediction error as the intrinsic reward. For a transition tuple (s_t, a_t, s_{t+1}), r^lifelong_t is computed as

$$r^{\text{ICM}}_t = \frac{1}{2}\left\|\hat{f}_{\text{emb}}(s_{t+1}) - f_{\text{emb}}(s_{t+1})\right\|^2, \quad (3)$$

where $\hat{f}_{\text{emb}}(s_{t+1})$ denotes the forward model's prediction of the next state embedding.

RIDE  Built upon ICM, Raileanu & Rocktäschel (2020) propose a novel intrinsic reward that encourages the agent to take actions with a large impact on the state representation. They similarly train inverse and forward dynamics models. The difference is that, for a transition tuple (s_t, a_t, s_{t+1}), r^lifelong_t is now computed as the change in state representation: $r^{\text{RIDE}}_t = \|f_{\text{emb}}(s_t) - f_{\text{emb}}(s_{t+1})\|_2$.

RND  Burda et al. (2019b) propose to train a state embedding network $\hat{f}(s)$ to predict the output of another state embedding network f(s) with fixed random initialization. The prediction error is expected to be higher for novel states dissimilar to the ones $\hat{f}(s)$ has been trained on. Therefore, for a transition tuple (s_t, a_t, s_{t+1}), the prediction error is used as r^lifelong_t:

$$r^{\text{RND}}_t = \left\|\hat{f}(s_{t+1}) - f(s_{t+1})\right\|^2. \quad (4)$$

BeBold  Built upon RND, Zhang et al. (2020) give a lifelong intrinsic reward that encourages the agent to explore beyond the boundary of explored regions. Intuitively, if the agent takes a step from an explored state to an unexplored state, the RND error of the unexplored state will likely be larger than that of the explored state. For a transition tuple (s_t, a_t, s_{t+1}), r^lifelong_t is computed as the clipped difference of RND prediction errors between s_{t+1} and s_t:

$$r^{\text{BeBold}}_t = \max\left(0,\; \left\|\hat{f}(s_{t+1}) - f(s_{t+1})\right\|^2 - \left\|\hat{f}(s_t) - f(s_t)\right\|^2\right). \quad (5)$$

Episodic visitation count  Raileanu & Rocktäschel (2020) discount their proposed lifelong intrinsic reward with episodic state visitation counts. Concretely, for a transition tuple (s_t, a_t, s_{t+1}), r^episodic_t is computed as r^ep_count_t = 1 / N_ep(s_{t+1}), where N_ep(s) denotes the number of visits to state s in the current episode. N_ep(s) is computed post-factum, so the denominator is always larger than 0.

Episodic first visit  Compared with the above episodic intrinsic reward, Zhang et al. (2020) take a more aggressive strategy: the agent only receives the lifelong intrinsic reward when it visits a state for the first time in an episode. That is, for a transition tuple (s_t, a_t, s_{t+1}), r^episodic_t is computed as r^ep_visit_t = 1(N_ep(s_{t+1}) = 1), where 1(·) denotes the indicator function.
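The two episodic bonuses above, and the multiplicative combination with a lifelong term, can be sketched in a few lines of Python. This is an illustrative sketch with our own names; states are assumed hashable (e.g., a tuple encoding of the grid observation):

```python
from collections import defaultdict

class EpisodicBonus:
    """Episodic intrinsic rewards from per-episode visitation counts."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reset(self):
        """Called at the start of each episode: episodic counts are reset."""
        self.counts.clear()

    def step(self, state):
        """Update N_ep(s_{t+1}) and return (r^ep_count, r^ep_visit)."""
        self.counts[state] += 1
        n = self.counts[state]
        r_ep_count = 1.0 / n                    # 1 / N_ep(s_{t+1})
        r_ep_visit = 1.0 if n == 1 else 0.0     # 1(N_ep(s_{t+1}) = 1)
        return r_ep_count, r_ep_visit

def combined_intrinsic(r_lifelong, r_episodic):
    """r^i_t = r^lifelong_t * r^episodic_t."""
    return r_lifelong * r_episodic
```

Note that with the first-visit bonus, any revisit zeroes out the lifelong term entirely, which is the "more aggressive" behavior described above.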

3. DISENTANGLING THE INTRINSIC REWARD

As discussed above, the intrinsic reward for procedurally-generated environments can be disentangled into a lifelong term and an episodic term. In this section, we conduct extensive experiments on the MiniGrid benchmark (Chevalier-Boisvert et al., 2018) to study their individual contributions to improving exploration. To this end, we ablatively compare different lifelong-episodic combinations under two settings (Sec. 3.2): (i) a sparse extrinsic reward setting where reward is only obtained upon accomplishing the task, and (ii) a pure exploration setting without extrinsic rewards. We find that the episodic intrinsic reward is more important for efficient exploration, while the lifelong intrinsic reward does not help much. Then, in Sec. 3.3, we compare the approach of using only episodic intrinsic rewards to other state-of-the-art methods. Finally, in Sec. 3.4, we analyze why the lifelong intrinsic reward does not help improve exploration.

3.1. SETUP

Environments  Following previous works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Campero et al., 2021; Zha et al., 2021; Flet-Berliac et al., 2021) on exploration in procedurally-generated gridworld environments, we use the MiniGrid benchmark (Chevalier-Boisvert et al., 2018), which runs fast and hence is suitable for large-scale experiments. As shown in Fig. 1, we use 3 commonly used hard-exploration tasks in MiniGrid:

• KeyCorridor: The task is to pick up an object hidden behind a locked door. The agent has to explore different rooms to find the key that opens the locked door.

• MultiRoom: The task is to navigate from the first room, through a sequence of rooms connected by doors, to a goal in the last room. The agent must open each door to enter the next room.

• ObstructedMaze: The task is to pick up an object hidden behind a locked door. The key is now hidden in a box, so the agent has to find the box and open it to get the key. Besides, the agent needs to move the obstruction in front of the locked door.

The suffixes denote different configurations (e.g., N7S8 for MultiRoom refers to 7 rooms of size no larger than 8). For each task, we choose 2 representative configurations from those used in previous works, in order to reduce the computational burden. Following prior works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020), we use the partial observation of the grid, and the agent chooses from 7 actions (see Appx. A.1 for details). In our sparse reward setting, the agent obtains a positive reward only when it reaches the goal position or picks up the goal object. Accomplishing the tasks requires sufficient exploration and the completion of nontrivial subroutines (e.g., picking up keys, opening doors). More details about the environments and tasks are deferred to Appx. A.1.
Training details  For the RL algorithm, we find that IMPALA (Espeholt et al., 2018; Küttler et al., 2019), used in previous works, takes too long to train and requires substantial CPU resources. Therefore, to make the computational cost manageable, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017) to train the policy, which is also a high-performing actor-critic algorithm but easier and more efficient to run on a single machine than IMPALA. Following previous works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Campero et al., 2021; Flet-Berliac et al., 2021), we use convolutional neural networks to process the input observation (see Appx. A.2 for network architectures). The neural networks are optimized with Adam (Kingma & Ba, 2015), with most hyperparameters following Raileanu & Rocktäschel (2020). We also include the testing results in Appx. A.6.1.

3.2. INDIVIDUAL CONTRIBUTIONS OF EPISODIC AND LIFELONG INTRINSIC REWARDS

In this section, we ablatively study how episodic and lifelong intrinsic rewards improve exploration with and without extrinsic reward (i.e., goal-reaching and pure exploration). On the MiniGrid environments, we compare the performance of 15 lifelong-episodic combinations: {r^ICM, r^RIDE, r^RND, r^BeBold, None} × {r^ep_count, r^ep_visit, None}. Here "None" refers to not using the lifelong or episodic intrinsic reward. If neither is used, the method reduces to vanilla PPO without intrinsic rewards. If only one is used, the other factor is set to the constant 1. For the goal-reaching setting, the extrinsic reward is sparse and only provided when the agent achieves the goal. Exploration ability is measured by how fast the agent achieves a high average return on the 6 environments introduced in Sec. 3.1. For the pure exploration setting, we train agents on MultiRoom environments without providing any extrinsic reward when agents reach the goal. MultiRoom environments consist of several consecutive rooms connected by doors, and the agent starts from the first room (see Fig. 1). To explore the environment, the agent needs to open each door and navigate to different rooms. Thus, we can use the average number of explored rooms within an episode as a proxy for quantifying exploration ability. Here we use MultiRoom-N7S8 and MultiRoom-N10S10, which have 7 and 10 rooms respectively, and consider a room explored if the agent visits any tile within it (excluding the connecting door).

Fig. 2 shows the results of exploration with extrinsic rewards. We can see that using only the episodic intrinsic reward can match or surpass the performance of combining lifelong and episodic intrinsic rewards. In contrast, using only the lifelong intrinsic reward hardly makes any progress. These observations show that the episodic intrinsic reward is the key ingredient for boosting exploration.
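The explored-rooms proxy described above can be computed with a small helper. This is an illustrative sketch under the assumption that the per-room interior tile sets (excluding connecting doors) are precomputed from the layout:

```python
def rooms_explored(visited_positions, room_tiles):
    """Count rooms explored in an episode.

    A room counts as explored if the agent visited any of its interior
    tiles; connecting doors are excluded from room_tiles by construction.

    visited_positions: set of (x, y) tiles visited this episode
    room_tiles: list of sets; room_tiles[i] holds room i's interior tiles
    """
    return sum(1 for tiles in room_tiles if tiles & visited_positions)
```

Averaging this quantity over episodes gives the curves reported for the pure exploration setting.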
Moreover, comparing the performance of different lifelong intrinsic rewards under the same episodic intrinsic reward (i.e., dark red or dark blue curves in each column of Fig. 2), we find that the recently proposed ones (r^RIDE and r^BeBold) do not exhibit clear advantages over earlier ones (r^ICM and r^RND). Additionally, when comparing the two episodic intrinsic rewards, we can see they perform comparably well in most environments. An exception is the MultiRoom environments, where r^ep_visit performs slightly better. The reason might be that r^ep_visit is more aggressive than r^ep_count in discouraging revisits to visited states within an episode, pushing the agent to explore new rooms more quickly. The results for the pure exploration setting are summarized in Fig. 3. Again, when using only lifelong intrinsic rewards (i.e., green curves), the agent fails to explore farther than the first 2 rooms. In contrast, it quickly explores more rooms when augmented with episodic intrinsic rewards. This further demonstrates that the episodic intrinsic reward is more important than the lifelong one for efficient exploration. Moreover, comparing r^ep_visit and r^ep_count, we can see r^ep_visit helps the agent explore faster. This difference in exploration efficiency explains the performance difference between r^ep_visit and r^ep_count in MultiRoom environments with extrinsic reward (see Fig. 2).

Figure 5: Performance with and without randomly permuting lifelong intrinsic rewards on 6 hard-exploration environments in MiniGrid. Best viewed in color.

3.3. COMPARISON WITH STATE-OF-THE-ART METHODS

From the previous section, we can see that using only episodic intrinsic rewards achieves very competitive performance. In this section, we compare its performance with other state-of-the-art approaches, including RIDE (Raileanu & Rocktäschel, 2020), BeBold (Zhang et al., 2020), RAPID (Zha et al., 2021) and AGAC (Flet-Berliac et al., 2021), under the sparse extrinsic reward setting. Among the 6 environments used: 1) for environments also used in these works, we contacted the authors and obtained the original results reported in their papers; 2) for other environments, we train agents using their official implementations and the same hyperparameters reported in their papers. Please see Appx. A.4 for more details. As Fig. 4 shows, using only r^ep_visit or r^ep_count clearly outperforms other state-of-the-art methods, achieving high average return with fewer training frames. Specifically, in KeyCorridor-S6R3, ObstructedMaze-1Q and ObstructedMaze-2Q, it is almost an order of magnitude more efficient than previous methods. Note that the performance of BeBold in ObstructedMaze-2Q is directly taken from their paper, as we could not reproduce it. Compared to these sophisticated methods, simply using episodic intrinsic rewards serves as a very strong baseline. In Appx. A.6.2, we also include comparison results on other environments in the MiniGrid benchmark.

3.4. WHY DOESN'T LIFELONG INTRINSIC REWARD HELP?

As shown in Sec. 3.2, the lifelong intrinsic reward contributes little to improving exploration. In this section, we make preliminary attempts to analyze the reason for its ineffectiveness. To begin with, we examine the learned behavior of an agent trained with lifelong intrinsic rewards only (see Fig. 2), by rolling out the learned policies in the environment. We find that the trained agent often gets stuck in the first room, oscillating between two states. An example, shown in Fig. 6, is that the agent keeps opening and closing the first door. In comparison, the agent trained with episodic intrinsic rewards only can explore all rooms without oscillating (see videos in the supplementary material). A possible reason for this failure is that the lifelong intrinsic reward does not accurately reflect the novelty of states. During training, the agent may randomly sample a moving-forward action after it opens the door, but the lifelong intrinsic reward it receives for this step is lower than for the preceding opening-door step. Thus, as training progresses, the agent gets stuck in the opening-closing behavior, since this yields higher intrinsic rewards than moving forward to explore new rooms.

To verify this, we randomly permute the lifelong intrinsic rewards within a batch and evaluate the resulting performance under sparse extrinsic rewards. Specifically, for a transition batch {(s_j, a_j, s'_j)}_{j=1}^B and the corresponding intrinsic reward batch {r_j}_{j=1}^B, we randomly permute {r_j}_{j=1}^B, such that each transition (s_j, a_j, s'_j) may be associated with the wrong intrinsic reward r_k (where k ≠ j) rather than the correct one. If this random permutation does not lead to a significant performance drop, it indicates that the computed lifelong intrinsic reward may not accurately measure the novelty of states. From the results in Fig. 5, we can see that the permutation does not have much negative impact on performance in most cases, especially when using r^ep_visit as the episodic intrinsic reward. This shows that the computed lifelong intrinsic reward may not accurately reflect the novelty of states; otherwise, such permutation would lead to an obvious performance drop. We note the performance impact is not negligible in one case (BeBold + r^ep_count in MultiRoom environments), but we believe it does not alter the message that the lifelong intrinsic reward is not critical to performance.
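The permutation test itself is straightforward to implement; a minimal sketch (names are ours for illustration):

```python
import random

def permute_lifelong(batch_rewards, seed=None):
    """Randomly permute lifelong intrinsic rewards within a batch.

    Each transition (s_j, a_j, s'_j) may thereby be paired with another
    transition's reward r_k (k != j). If training is insensitive to this
    shuffle, the reward cannot be encoding much state-specific novelty.
    """
    rng = random.Random(seed)
    permuted = list(batch_rewards)  # copy; original batch is untouched
    rng.shuffle(permuted)
    return permuted
```

The shuffled rewards then replace the originals when forming the combined reward for the policy update; everything else in the training loop is unchanged.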

4. DISCUSSIONS

In the above experiments, we show that the episodic intrinsic reward alone is able to boost exploration in procedurally-generated grid-world environments, while the lifelong intrinsic reward struggles to make progress. Such intriguing observations provide new perspectives on intrinsic reward design, as discussed below.

Episodic intrinsic reward in latent spaces  Our experiments use grid-world environments as testbeds, which have discrete state spaces, so it is easy to obtain visitation counts for each state. Therefore, one may wonder whether the episodic intrinsic reward can generalize to high-dimensional continuous state spaces, where it is infeasible to calculate r^ep_visit and r^ep_count. One possible solution is to compute the episodic intrinsic reward in a latent space (Badia et al., 2019). However, we empirically find that directly borrowing this solution from singleton environments to MiniGrid is ineffective, as latent embedding learning is much more challenging in procedurally-generated environments. For future research, we may consider designing more powerful embedding models.

Lifelong intrinsic reward in procedurally-generated environments  Lifelong intrinsic rewards like RND (Burda et al., 2019b) and ICM (Pathak et al., 2017) have shown promising performance in singleton environments (e.g., Atari (Bellemare et al., 2013)), but perform poorly in procedurally-generated environments (e.g., MiniGrid). This counter-intuitive phenomenon further shows that the quality of the representation plays an important role in the success of curiosity-driven methods. To make lifelong intrinsic reward work again, one has to think about more generalizable representation learning for RL. One possible direction is to design proper auxiliary or pretext tasks, as in self-supervised learning, instead of just memorizing the environment.
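As a rough illustration of the latent-space alternative, the episodic bonus of Badia et al. (2019) scores a state embedding by its distance to its k nearest neighbours stored in an episodic memory: the further the embedding is from everything seen this episode, the larger the bonus. The sketch below follows that general recipe only; the kernel and constants are simplified assumptions, not the published configuration:

```python
import math

def episodic_knn_bonus(embedding, memory, k=10, eps=1e-3):
    """Episodic novelty bonus in a latent space (simplified sketch).

    embedding: tuple of floats (the current state's latent embedding)
    memory: list of embeddings stored so far in this episode
    Returns a bonus that grows as the embedding gets farther from its
    k nearest neighbours in the episodic memory.
    """
    if not memory:
        return 1.0  # first state of the episode is maximally novel
    # squared Euclidean distances to the k nearest stored embeddings
    dists = sorted(sum((a - b) ** 2 for a, b in zip(embedding, m))
                   for m in memory)[:k]
    mean = sum(dists) / len(dists) or 1.0  # normalize; avoid div by zero
    # inverse-kernel similarity: near neighbours contribute heavily
    sim = sum(eps / (d / mean + eps) for d in dists)
    return 1.0 / math.sqrt(sim)
```

As the text notes, making such a bonus work in procedurally-generated environments hinges on learning an embedding in which genuinely new states are actually far from the episodic memory.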

5. RELATED WORK

Exploration and intrinsic rewards

In RL, the agent improves its policy via trial and error, so a steady stream of reward signals is critical for efficient learning. Since many practical scenarios come with sparse rewards, studying how to encourage exploration behavior in such environments has attracted continuing attention. One of the most popular techniques is to provide intrinsic rewards (Schmidhuber, 2010), which can be viewed as the agent's curiosity about particular states. In the absence of extrinsic rewards from the environment, the agent's behavior is driven by the intrinsic rewards. If we expect the agent to explore novel states, the intrinsic reward can be designed to reflect how different the states are from those already visited. Such difference can be measured by pseudo-counts of states derived from a density model (Bellemare et al., 2016; Ostrovski et al., 2017), or by errors in predicting the next state (Pathak et al., 2017) or a random target (Burda et al., 2019b).

Episodic and lifelong intrinsic reward  Prior works (Pathak et al., 2017; Burda et al., 2019a;b) typically model a lifelong intrinsic reward, which is updated throughout the whole learning process. Recent works focusing on procedurally-generated environments (Raileanu & Rocktäschel, 2020; Zhang et al., 2020) additionally use episodic intrinsic rewards to modulate the lifelong ones. However, they mainly focus on designing the lifelong intrinsic rewards. In this work, we show that the episodic intrinsic reward is actually more important for encouraging exploration. Concurrent to our work, Henaff et al. (2022) also find that the episodic intrinsic reward is essential to good performance. Our work is also related to recent progress on Atari (Badia et al., 2019; 2020), where more attention is paid to episodic intrinsic rewards and the lifelong intrinsic reward is considered only as an optional modulator.
They focus on pushing performance limits on Atari, while our work aims to compare the individual contributions of episodic and lifelong intrinsic rewards in procedurally-generated gridworld environments. One interesting connection to note is that the episodic visitation count bonus considered in our paper is essentially an MBIE-EB-style bonus (Strehl & Littman, 2008).

Generalization and procedurally-generated environments  Recent papers (Rajeswaran et al., 2017; Zhang et al., 2018a;b;c; Machado et al., 2018; Song et al., 2020) find that deep RL is susceptible to overfitting to training environments and suffers from poor generalization. This motivates the use of procedurally-generated environments (Chevalier-Boisvert et al., 2018; Chevalier-Boisvert, 2018; Juliani et al., 2019; Cobbe et al., 2020), where each episode is randomly constructed but corresponds to the same task. To make progress in such environments, an agent must learn a generic policy that can generalize to novel episodes. Most prior works in exploration are also susceptible to overfitting issues, since they train and test on the same environment, such as Montezuma's Revenge (Bellemare et al., 2013) or VizDoom (Wydmuch et al., 2018). To achieve generalizable exploration, recent works (Raileanu & Rocktäschel, 2020; Campero et al., 2021; Zhang et al., 2020; Zha et al., 2021; Flet-Berliac et al., 2021) often use procedurally-generated gridworld environments to benchmark their methods.
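For reference, the MBIE-EB bonus of Strehl & Littman (2008) augments the reward for a state-action pair visited n times with a term proportional to 1/√n; a one-line sketch (function name is ours):

```python
import math

def mbie_eb_bonus(count, beta=1.0):
    """MBIE-EB exploration bonus: beta / sqrt(n(s, a)).

    count: number of times the state-action pair has been visited (>= 1)
    beta: exploration coefficient scaling the bonus
    """
    return beta / math.sqrt(count)
```

The episodic count bonus in this paper plays an analogous role, but with counts reset every episode rather than accumulated over the whole of training.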

6. CONCLUSIONS

Exploration is a long-standing and important problem in reinforcement learning, especially when environments have sparse rewards. To encourage exploration in procedurally-generated environments, recent works combine lifelong and episodic intrinsic rewards. However, how these two kinds of intrinsic rewards actually contribute to exploration has seldom been comprehensively investigated. To answer this question, we conducted the first systematic empirical study by making exhaustive combinations of lifelong and episodic intrinsic rewards in environments with and without extrinsic reward signals. The results are counter-intuitive and intriguing: the episodic intrinsic reward alone is able to boost exploration and set a new state of the art. Our further investigation shows that the poor performance of the lifelong intrinsic reward is rooted in its ineffectiveness in distinguishing whether a state is novel or not. We believe our findings can inspire future work on better designing intrinsic reward functions. We note that our discovery focuses on hard exploration tasks in MiniGrid, and whether it applies to general procedurally-generated environments requires further investigation.

A.1 ENVIRONMENT DETAILS

Observation  By default, the observation is an egocentric view of the environment, as shown in Fig. 7. The agent cannot see the world behind walls or closed doors. Specifically, the observation is compactly encoded as a 7 × 7 × 3 array: the first channel encodes the object type (e.g., wall, door), the second channel encodes the object color (e.g., red, green), and the third channel encodes the object state (e.g., door open, door closed).

Action  In MiniGrid, there are 7 actions available to the agent: turn left, turn right, move forward, pick up an object, drop an object, toggle, and done. The agent can change the direction it is facing by turning left or right. Taking the move forward action moves the agent one tile forward along its current direction.
If moving forward causes a collision (e.g., running into walls or closed doors), the agent remains in its previous location. The agent can use the toggle action to open a closed door in front of it (or a locked one if the agent has the corresponding key). The toggle action can also be used to open a box. The done action is not necessary for the environments considered in this paper, but we keep it in the action space following previous works.

Task  As briefly introduced in Sec. 3.1, we consider 3 tasks in MiniGrid: KeyCorridor, MultiRoom, and ObstructedMaze. Here we provide more details about these tasks.

• KeyCorridor (Fig. 8): The environment consists of a corridor and several rooms connecting to the corridor. The agent starts in the corridor and aims to pick up a goal object placed in one of the rooms. The door to the room with the goal is locked. To open the locked door, the agent has to find the key, which is placed in one of the other rooms. The configuration suffix is specified as SxRy, where x denotes the room size and y denotes the number of rows. Each row has 2 rooms, placed on each side of the corridor.

• MultiRoom (Fig. 9): The environment consists of a sequence of rooms connected by closed doors. The agent starts in the first room and aims to reach the goal position in the last room. The configuration suffix is specified as NxSy, where x denotes the number of rooms and y denotes the maximum room size.

• ObstructedMaze (Fig. 10): The environment consists of a 3 × 3 grid of rooms: 1 center room, 4 corner rooms, and 4 non-corner rooms. The doors connecting the center room and non-corner rooms are closed. The doors connecting corner rooms and non-corner rooms are locked, with the keys hidden in boxes and obstructions placed in front of the locked doors. The agent has to move these obstructions before unlocking the doors with keys.
  - 1Q: Compared to 2Dlhb, the difference is that the agent starts in the center room instead of the middle right one.
  - 2Q: Compared to 1Q, the difference is that the goal object is hidden in either the top right room or the bottom right room.
  - Full: Compared to 1Q, the difference is that the goal object is hidden in one of the 4 corner rooms.

Episode length  In these tasks, there is a limit on the maximum number of steps allowed in each episode, denoted as t_max. It varies from task to task, as summarized in Fig. 11.

Reward  When the agent achieves the goal, it receives an extrinsic reward computed from the number of steps taken t and the maximum number of steps allowed t_max:

$$r^e_t = 1 - 0.9\, t / t_{\max}.$$

The agent is thus encouraged to reach the goal as quickly as possible; only the optimal path gives the highest reward.
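The step-dependent goal reward is a one-liner; for concreteness (illustrative function name):

```python
def extrinsic_reward(steps_taken, t_max):
    """Sparse goal reward in MiniGrid: r^e = 1 - 0.9 * t / t_max.

    Given only upon success; faster completion yields a higher reward,
    decaying linearly from 1.0 (instant) down to 0.1 (at the step limit).
    """
    return 1.0 - 0.9 * steps_taken / t_max
```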

A.2 NETWORK ARCHITECTURES

For the policy and value network in PPO, we use a shared feature extractor and two separate heads (one policy head and one value head), as shown in Fig. 12 (a) . The feature extractor uses a convolutional neural network, similar to the ones used in prior works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Flet-Berliac et al., 2021) . For r RND and r BeBold , the target and predictor networks use the architecture in Figure 12 

A.3 COMPUTATION COST

As mentioned in Sec. 3.1, our experiments can run on a single machine with 8 CPUs and a Titan X GPU. Specifically, one run of an experiment (e.g., training with r^ICM + r^ep_count) takes less than 600 MB of GPU memory on a Titan X GPU, so we can concurrently run several experiments on a single GPU. On average, training for 1e7 frames takes about 1.5 hours; the total wall-clock training time for all conducted experiments is roughly 4000 GPU hours. In comparison, the IMPALA implementation used in the released code of RIDE (Raileanu & Rocktäschel, 2020) is typically 10-20x slower than PPO on our hardware. The low efficiency is mainly because it uses the CPU for the forward pass of the policy during experience collection, and it is non-trivial for us to migrate it to GPU. This also means that a large number of CPUs (typically 100+) is needed to collect samples in parallel to reach full speed. Such a computation requirement (high CPU but low GPU demand) does not fit our computation resources (where each workstation often has fewer than 32 CPUs). Therefore, we choose PPO as the base algorithm to fully utilize our resources.

A.4 TRAINING DETAILS

We provide more training details for the comparison with other state-of-the-art methods in Sec. 3.3. For RIDE (Raileanu & Rocktäschel, 2020), we obtained the results on MultiRoom-N7S8 and MultiRoom-N10S10 from the authors, and reran their released code for the other environments. For KeyCorridor-S4R3 and KeyCorridor-S6R3, we use the same hyperparameters as for KeyCorridor-S3R3 provided in their paper; for ObstructedMaze-1Q and ObstructedMaze-2Q, we use the same hyperparameters as for ObstructedMaze-2Dlh. For AGAC (Flet-Berliac et al., 2021), we obtained the results on MultiRoom-N10S10, ObstructedMaze-1Q, and ObstructedMaze-2Q from the authors, and reran their released code with its default hyperparameters for the other environments. For RAPID (Zha et al., 2021), we obtained the results on MultiRoom-N7S8 and MultiRoom-N10S10 from the authors, and reran their released code with its default hyperparameters for the other environments. For BeBold (Zhang et al., 2020), we obtained the results on KeyCorridor-S4R3, KeyCorridor-S6R3, MultiRoom-N7S8, ObstructedMaze-1Q, and ObstructedMaze-2Q from the authors, and reran their released code with its default hyperparameters for MultiRoom-N10S10. The released code repositories are listed below.



Code repositories:
MiniGrid: https://github.com/maximecb/gym-minigrid
RIDE: https://github.com/facebookresearch/impact-driven-exploration
AGAC: https://github.com/yfletberliac/adversarially-guided-actor-critic
RAPID: https://github.com/daochenzha/rapid
BeBold (NovelD): https://github.com/tianjunz/NovelD



Figure 1: The 6 hard-exploration environments in MiniGrid used for our experiments.

Figure 2: Performance of different combinations of lifelong and episodic intrinsic rewards on 6 hard exploration environments in MiniGrid. Note that simply using episodic intrinsic reward can achieve top performance among all combinations. Best viewed in color.

Figure 3: Average number of explored rooms without extrinsic rewards. Best viewed in color.

Figure 4: Comparison of only using episodic intrinsic rewards and state-of-the-art methods. Note that the result of BeBold in ObstructedMaze-2Q is directly taken from the associated paper (Zhang et al., 2020).

Figure 6: The agent gets stuck, repeatedly opening and closing the door.

Figure 7: Observation encoding in MiniGrid environment. See text for details. Best viewed in color.

Figure 8: KeyCorridor environments.

Figure 9: MultiRoom environments.

Figure 10: ObstructedMaze environments.

Figure 11: Different MiniGrid task configurations used in previous works.

For r ICM and r RIDE, the network architecture of the state embedding function f emb is shown in Figure 12 (b). The network architectures of the forward and inverse dynamics models are shown in Figure 12 (c) and Figure 12 (d), respectively.

Figure 12: Network architectures.

Figure 15: Best searched β values for experiments with r RND .

Figure 16: Best searched β values for experiments with r BeBold .

Figure 17: Best searched β values for experiments with episodic intrinsic rewards only.

Figure 18: Performance of training with r ep visit only on KeyCorridorS4R3, using different β.

Figure 20: Comparison to the lifelong count-based bonus 1/N_life(s_{t+1}) on 6 hard exploration environments in MiniGrid, averaged from 5 runs. Best viewed in color.

Figure 21: Performance of different combinations of lifelong and episodic intrinsic rewards on 6 hard exploration environments in MiniGrid, averaged from 10 runs. Note that simply using episodic intrinsic rewards can achieve top performance among all combinations. Best viewed in color.

Pathak et al. (2017) introduce a lifelong intrinsic reward based on a learned state representation f emb (s). Such a representation ignores the aspects of the environment that have little impact on the agent, and is obtained by jointly training a forward and an inverse dynamics model. The forward model predicts the next state representation, denoted f̂ emb (s t+1 ), and a large prediction error implies that the agent has not explored the corresponding part of the environment well. For a transition tuple (s t , a t , s t+1 ), the lifelong intrinsic reward is the prediction error of the forward model: r lifelong = ||f̂ emb (s t+1 ) - f emb (s t+1 )||^2.
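This prediction-error bonus can be sketched as follows (a toy NumPy illustration with fixed random linear networks; in ICM the embedding and the forward model are learned jointly during training, and all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, EMB_DIM, N_ACT = 16, 8, 7

W_emb = rng.normal(0.0, 0.3, (STATE_DIM, EMB_DIM))        # embedding f_emb
W_fwd = rng.normal(0.0, 0.3, (EMB_DIM + N_ACT, EMB_DIM))  # forward model

def f_emb(s):
    # State embedding; in ICM this is trained via the inverse dynamics loss.
    return np.tanh(s @ W_emb)

def icm_bonus(s_t, a_t, s_t1):
    """Lifelong bonus: squared error between the forward model's
    prediction and the actual next-state embedding."""
    a_onehot = np.eye(N_ACT)[a_t]
    pred = np.tanh(np.concatenate([f_emb(s_t), a_onehot]) @ W_fwd)
    return float(np.sum((pred - f_emb(s_t1)) ** 2))

s_t, s_t1 = rng.random(STATE_DIM), rng.random(STATE_DIM)
r_int = icm_bonus(s_t, a_t=2, s_t1=s_t1)  # non-negative scalar bonus
```

As the forward model improves on frequently visited transitions, their bonus shrinks, so poorly predicted (less explored) transitions keep a larger reward.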

7. REPRODUCIBILITY STATEMENT

We describe the training details (e.g., environment configurations, network architectures, hyperparameters) in Sec. 3.1 and the Appendix. To reproduce the results, we also include the source code in the supplementary material. The computation cost is detailed in Appx A.3.

A.5 HYPERPARAMETERS

For training the policy, we use the hyperparameters listed in Tab. 1. Generalized Advantage Estimation (GAE) (Schulman et al., 2016) is used to estimate the advantage. As mentioned in Sec. 3.1, we search the intrinsic reward coefficient β for each experiment. We start with a search range {5e-2, 1e-2, 5e-3, 1e-3}, and expand it until finding a locally best value or reaching 5e1 or 1e-5. The best searched values for β are summarized in Figs. 13, 14, 15, 16, and 17. We showcase in Fig. 18 that choosing an appropriate value for β is critical for the performance. Other results in the search (i.e., using different β) will be made publicly available.

We also experiment with a lifelong count-based intrinsic reward 1/N_life(s_{t+1}) (instead of episodic ones). The results are shown in Fig. 20. We can see the performance is close to that of only using r ep visit. One possible reason is that lifelong counts behave quite similarly to episodic counts, since every episode is new in procedurally-generated environments. It is also possible that the performance difference between episodic and lifelong intrinsic rewards on MiniGrid actually results from "count-based intrinsic rewards" vs. "prediction-based intrinsic rewards".
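The two count-based bonuses compared here can be sketched as follows (a minimal version assuming hashable state encodings; the `CountBonus` class and its method names are illustrative, not taken from any released code):

```python
from collections import defaultdict

class CountBonus:
    """Intrinsic bonus 1/N(s): the episodic table resets every episode,
    while the lifelong table accumulates over all of training."""

    def __init__(self):
        self.n_life = defaultdict(int)
        self.n_ep = defaultdict(int)

    def new_episode(self):
        self.n_ep.clear()  # episodic novelty resets; lifelong persists

    def bonuses(self, state):
        self.n_life[state] += 1
        self.n_ep[state] += 1
        return 1.0 / self.n_ep[state], 1.0 / self.n_life[state]

cb = CountBonus()
cb.new_episode()
ep1, life1 = cb.bonuses("s0")  # first ever visit: both bonuses are 1.0
cb.new_episode()
ep2, life2 = cb.bonuses("s0")  # new episode: episodic resets to 1.0,
                               # lifelong drops to 0.5
```

In a procedurally-generated environment most states are new in every episode, so the two tables rarely diverge, which is consistent with the similar performance observed in Fig. 20.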

