INTEGRATING EPISODIC AND GLOBAL NOVELTY BONUSES FOR EFFICIENT EXPLORATION

Anonymous

Abstract

Exploration in environments which differ across episodes has received increasing attention in recent years. Current methods use some combination of global novelty bonuses, computed using the agent's entire training experience, and episodic novelty bonuses, computed using only experience from the current episode. However, the use of these two types of bonuses has been ad-hoc and poorly understood. In this work, we first shed light on the behavior of these two kinds of bonuses on hard exploration tasks through easily interpretable examples. We find that the two types of bonuses succeed in different settings: episodic bonuses are most effective when there is little shared structure across episodes' environments, while global bonuses are most effective when more structure is shared. We also find that combining the two bonuses leads to more robust behavior across both of these settings. Motivated by these findings, we then investigate different algorithmic choices for defining and combining function approximation-based global and episodic bonuses. This results in a new algorithm which sets a new state of the art across 18 tasks from the MiniHack suite used in prior work. Our code is public at web-link.

1. INTRODUCTION

Balancing exploration and exploitation is a long-standing challenge in reinforcement learning (RL). A large body of research has studied this problem within the Markov Decision Process (MDP) framework (Sutton & Barto, 2018), both from a theoretical standpoint (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Agarwal et al., 2020) and an empirical one. This has led to practical exploration algorithms such as pseudocounts (Bellemare et al., 2016b), intrinsic curiosity modules (Pathak et al., 2017) and random network distillation (Burda et al., 2019), yielding impressive results on hard exploration problems like Montezuma's Revenge and Pitfall (Bellemare et al., 2012). More recently, there has been increasing interest in algorithms which move beyond the MDP framework. The standard MDP framework assumes that the agent is initialized in the same environment at each episode (we will refer to these MDPs as singleton MDPs). However, several studies have found that agents trained in singleton MDPs exhibit poor generalization, and that even minor changes to the environment can cause substantial degradation in agent performance (Zhang et al., 2018b; Justesen et al., 2018; Zhang et al., 2018a; Cobbe et al., 2019; Kirk et al., 2021a). This has motivated the use of contextual MDPs (CMDPs; Hallak et al., 2015), where different episodes correspond to different environments which nevertheless share some structure. Examples of CMDPs include procedurally-generated environments (Chevalier-Boisvert et al., 2018; Samvelyan et al., 2021; Küttler et al., 2020; Juliani et al., 2019; Cobbe et al., 2020; Beattie et al., 2016; Hafner, 2021; Petrenko et al., 2021) or embodied AI tasks where the agent must generalize across different physical spaces (Savva et al., 2019; Shen et al., 2020; Gan et al., 2020; Xiang et al., 2020). While exploration is well-studied in the singleton MDP case, it becomes more nuanced when dealing with CMDPs.
For singleton MDPs, a common and successful strategy consists of defining an exploration bonus which is added to the reward function being optimized. This exploration bonus typically represents how novel the current state is, where novelty is computed with respect to the entirety of the agent's experience across all episodes. However, it is unclear to what extent this strategy is applicable in the CMDP setting: if two environments corresponding to different episodes are very different, we might not want the experience gathered in one to affect the novelty of a state observed in the other.

An alternative to using global bonuses is to use episodic ones. Episodic bonuses define novelty with respect to the experience gathered in the current episode alone, rather than across all episodes. Recently, several works (Stanton & Clune, 2018; Raileanu & Rocktäschel, 2020; Flet-Berliac et al., 2021; Zhang et al., 2021b; Henaff et al., 2022) have used episodic bonuses, with Henaff et al. (2022) showing that this is an essential ingredient for solving sparse reward CMDPs. However, as we will show, an episodic bonus alone may not be optimal if there is considerable shared structure across different episodes in the CMDP.

In this work, we study how to best define and integrate episodic and global novelty bonuses for exploration in CMDPs. First, through a series of easily interpretable examples using episodic and global count-based bonuses, we shed light on the strengths and weaknesses of both types of bonuses. In particular, we show that global bonuses, which are commonly used in singleton MDPs, can be poorly suited for CMDPs that share little structure across episodes; however, episodic bonuses, which are commonly used in contextual MDPs, can also fail in certain classes of singleton MDPs where knowledge transfer across episodes is crucial.
Second, we show that by multiplicatively combining episodic and global bonuses, we are able to get robust performance on both contextual MDPs that share little structure across episodes and singleton MDPs that are identical across episodes. Third, motivated by these observations, we comprehensively evaluate different combinations of episodic and global bonuses which do not rely on counts, as well as strategies for integrating them, on a wide array of tasks from the MiniHack suite (Samvelyan et al., 2021). Our investigations yield a new algorithm which combines the elliptical episodic bonus of Henaff et al. (2022) and the NovelD global bonus of Zhang et al. (2021b), which sets a new state of the art across 18 tasks from the MiniHack environment, solving the majority of them. Our code is available at web-link.
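As a concrete illustration of the count-based bonuses discussed above, the following minimal tabular sketch maintains both a global visit count (persisting across episodes) and an episodic visit count (reset each episode), and combines the two resulting bonuses multiplicatively. The class name and the 1/√N scaling are illustrative conventions, not necessarily the paper's exact formulation:

```python
from collections import defaultdict


class CountBonus:
    """Tabular novelty bonuses: global counts persist across all of
    training, episodic counts are reset at the start of each episode."""

    def __init__(self):
        self.global_counts = defaultdict(int)    # N(s): all training experience
        self.episodic_counts = defaultdict(int)  # N_e(s): current episode only

    def reset_episode(self):
        # Called at each episode boundary; global counts are untouched.
        self.episodic_counts.clear()

    def bonus(self, state):
        self.global_counts[state] += 1
        self.episodic_counts[state] += 1
        global_bonus = self.global_counts[state] ** -0.5    # 1 / sqrt(N(s))
        episodic_bonus = self.episodic_counts[state] ** -0.5  # 1 / sqrt(N_e(s))
        # Multiplicative combination: the bonus is large only when the state
        # is novel both within this episode and across all of training.
        return global_bonus * episodic_bonus
```

Note how the combination behaves: revisiting a state later in the same episode shrinks both factors, while seeing it again in a fresh episode restores the episodic factor but keeps the decayed global factor, which is what lets the combined bonus interpolate between the two failure modes discussed above.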

2.1. CONTEXTUAL MDPS

We consider a contextual Markov Decision Process (CMDP) defined by (S, A, C, P, r, μ_C, μ_S), where S is the state space, A is the action space, C is the context space, P is the transition function, r is the reward function, μ_S is the initial state distribution conditioned on the context, and μ_C is the context distribution. At each episode, we first sample a context c ∼ μ_C and an initial state s_0 ∼ μ_S(·|c). At each step t in the episode, the next state is then sampled according to s_{t+1} ∼ P(·|s_t, a_t, c) and the reward is given by r_t = r(s_t, a_t, c). Let d_π^c represent the distribution over states induced by following policy π with context c. The goal is to learn a policy which maximizes the expected return, averaged across contexts:

R = E_{c ∼ μ_C, s ∼ d_π^c, a ∼ π(·|s)}[r(s, a)]

Examples of CMDPs include procedurally-generated environments, such as ProcGen (Cobbe et al., 2020), MiniGrid (Chevalier-Boisvert et al., 2018), NetHack (Küttler et al., 2020), or MiniHack (Samvelyan et al., 2021), where each context c corresponds to the random seed used to generate the environment; in this case, the number of contexts |C| is effectively infinite (we will slightly abuse notation and denote this case by |C| = ∞). Other examples include embodied AI environments (Savva et al., 2019; Szot et al., 2021; Gan et al., 2020; Shen et al., 2020; Xiang et al., 2020), where the agent is placed in different simulated houses and must navigate to a location or find an object. In this setting, each context c ∈ C represents a house identifier and the number of houses |C| is typically between 20 and 1000. More recently, CARL (Benjamins et al., 2021) was introduced as a benchmark for testing generalization in contextual MDPs. However, its focus is on using privileged information about the context c to improve generalization, which we do not assume access to here. For an in-depth review of the literature on CMDPs and generalization in RL, see Kirk et al. (2021b).
Singleton MDPs are a special case of contextual MDPs with |C| = 1.
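The episode protocol above (sample c ∼ μ_C, then s_0 ∼ μ_S(·|c), then roll out under P(·|s_t, a_t, c)) can be sketched as a generic rollout loop. This is a hedged illustration: `env_factory`, `policy`, and the uniform context distribution are stand-ins, not the API of any particular benchmark mentioned in the text.

```python
import random


def run_episode(policy, env_factory, contexts, max_steps=100):
    """Roll out one CMDP episode and return its total reward.

    policy:      maps a state to an action, a ~ pi(.|s)
    env_factory: builds the environment for a given context c
    contexts:    the context space C (sampled uniformly here, i.e. mu_C)
    """
    c = random.choice(contexts)   # c ~ mu_C
    env = env_factory(c)          # environment instance for this context
    s = env.reset()               # s_0 ~ mu_S(.|c)
    episode_return = 0.0
    for _ in range(max_steps):
        a = policy(s)
        s, r, done = env.step(a)  # s_{t+1} ~ P(.|s_t, a_t, c), r_t = r(s_t, a_t, c)
        episode_return += r
        if done:
            break
    return episode_return
```

With |contexts| = 1 this reduces to the singleton MDP case; averaging `run_episode` over many sampled contexts estimates the objective R defined above.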

2.2. EXPLORATION BONUSES

At a high level, exploration bonuses operate by estimating the novelty of a given state, and assigning a high bonus if the state is novel according to some measure. The exploration bonus is then combined with the extrinsic reward provided by the environment, and the result is optimized using RL. More precisely, the reward function optimized by the agent is given by:
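The equation itself appears to have been cut off in this excerpt. A standard form for such a bonus-augmented reward, consistent with the bonus-based methods cited above (the symbol α for the scaling coefficient and b for the bonus are our notation, not necessarily the paper's), is:

```latex
r_t^{\text{total}} = r(s_t, a_t) + \alpha \, b(s_t)
```

where r(s_t, a_t) is the extrinsic reward from the environment, b(s_t) is the novelty bonus, and α ≥ 0 trades off intrinsic and extrinsic reward.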

