BEBOLD: EXPLORATION BEYOND THE BOUNDARY OF EXPLORED REGIONS

Abstract

Efficient exploration under sparse rewards remains a key challenge in deep reinforcement learning. To guide exploration, previous work makes extensive use of intrinsic reward (IR). There are many heuristics for IR, including visitation counts, curiosity, and state-difference. In this paper, we analyze the pros and cons of each method and propose the regulated difference of inverse visitation counts as a simple but effective criterion for IR. The criterion helps the agent explore Beyond the Boundary of explored regions and mitigates common issues in count-based methods, such as short-sightedness and detachment. The resulting method, BeBold, solves the 12 most challenging procedurally-generated tasks in MiniGrid with just 120M environment steps, without any curriculum learning. In comparison, the previous SoTA only solves 50% of the tasks. BeBold also achieves SoTA on multiple tasks in NetHack, a popular rogue-like game that contains more challenging procedurally-generated environments.

1. INTRODUCTION

Deep reinforcement learning (RL) has experienced significant progress over the last several years, with impressive performance in games like Atari (Mnih et al., 2015; Badia et al., 2020a), StarCraft (Vinyals et al., 2019) and Chess (Silver et al., 2016; 2017; 2018). However, most work requires either a manually-designed dense reward (Brockman et al., 2016) or a perfect environment model (Silver et al., 2017; Moravčík et al., 2017). This is impractical for real-world settings, where the reward is sparse; in fact, the proper reward function for a task is often unknown due to lack of domain knowledge. Random exploration (e.g., ε-greedy) in these environments is often insufficient and leads to poor performance (Bellemare et al., 2016). Recent approaches have proposed to use intrinsic rewards (IR) (Schmidhuber, 2010) to motivate agents to explore before any extrinsic rewards are obtained. Various criteria have been proposed, including curiosity/surprise-driven (Pathak et al., 2017), count-based (Bellemare et al., 2016; Burda et al., 2018a;b; Ostrovski et al., 2017; Badia et al., 2020b), and state-diff approaches (Zhang et al., 2019; Marino et al., 2019). Each approach has its upsides and downsides: Curiosity-driven approaches look for prediction errors in the learned dynamics model and may be misled by the noisy-TV problem (Burda et al., 2018b), where environment dynamics are inherently stochastic. Count-based approaches favor novel states in the environment but suffer from detachment and derailment (Ecoffet et al., 2019), in which the agent gets trapped in one (long) corridor and fails to try other choices. Count-based approaches are also short-sighted: the agent often settles in local minima, sometimes oscillating between two states that alternately feature lower visitation counts (Burda et al., 2018b). Finally, state-diff approaches offer rewards if, for each trajectory, representations of consecutive states differ significantly.
While state-diff approaches consider the entire trajectory of the agent rather than a single local state, they are asymptotically inconsistent: the intrinsic reward remains positive even as visitation counts approach infinity. As a result, the final policy does not necessarily maximize the cumulative extrinsic reward. In this paper, we propose a novel exploration criterion that combines count-based and state-diff approaches: instead of using the difference of state representations, we use the regulated difference of inverse visitation counts of consecutive states in a trajectory. The inverse visitation count is approximated by Random Network Distillation (Burda et al., 2018b). Our IR provides two benefits: (1) It addresses the asymptotic inconsistency of state-diff approaches, since the inverse visitation count vanishes with sufficient exploration. (2) Our IR is large at the end of a trajectory and at the boundary between the explored and the unexplored regions (Fig. 1). This motivates the agent to move Beyond the Boundary of the explored regions and step into the unknown, mitigating the short-sightedness of count-based approaches. Following this simple criterion, we propose a novel algorithm, BeBold, and evaluate it on two very challenging procedurally-generated (PG) environments: MiniGrid (Chevalier-Boisvert et al., 2018) and NetHack (Küttler et al., 2020). MiniGrid is a popular benchmark for evaluating exploration algorithms (Raileanu and Rocktäschel, 2020; Campero et al., 2020; Goyal et al., 2019) and NetHack is a much more realistic environment with complex goals and skills. BeBold manages to solve the 12 most challenging environments in MiniGrid within 120M environment steps, without curriculum learning. In contrast, Campero et al. (2020) solve 50% of the tasks (those categorized as "easy" and "medium") by training a separate goal-generating teacher network for 500M steps.
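The criterion can be sketched in a few lines. Below, `visit_count` is a hypothetical tabular stand-in for the pseudo-count that the paper approximates with Random Network Distillation in practice; the function name and toy states are illustrative only:

```python
from collections import defaultdict

def bebold_intrinsic_reward(visit_count, s_t, s_next):
    """Regulated difference of inverse visitation counts (sketch).

    Clipping at zero ("regulation") rewards only transitions that move
    from a more-visited state to a less-visited one, i.e. steps that
    push beyond the boundary of the explored region.
    """
    return max(1.0 / visit_count[s_next] - 1.0 / visit_count[s_t], 0.0)

# Toy usage: counts along a corridor; only the frontier transition earns IR.
counts = defaultdict(lambda: 1)  # unseen states default to count 1
counts.update({"A": 10, "B": 10, "C": 2, "D": 1})
print(bebold_intrinsic_reward(counts, "A", "B"))  # interior transition: 0.0
print(bebold_intrinsic_reward(counts, "B", "C"))  # frontier transition: positive (1/2 - 1/10)
```

Because the difference is clipped at zero, backtracking into well-explored territory yields no bonus, which is what discourages the oscillation seen in plain count-based bonuses.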
In NetHack, a more challenging procedurally-generated environment, BeBold also outperforms all baselines by a significant margin on various tasks. In addition, we analyze BeBold extensively in MiniGrid. The quantitative results show that BeBold largely mitigates the detachment problem, with a much simpler design than Go-Explore (Ecoffet et al., 2020), which contains multiple hand-tuned stages and hyper-parameters. Most Related Work. RIDE (Raileanu and Rocktäschel, 2020) also combines multiple criteria. RIDE learns the state representation with curiosity-driven approaches, and then uses the difference of learned representations along a trajectory as the reward, weighted by pseudo-counts of the state. However, as a two-stage approach, RIDE relies heavily on how well the learned representation generalizes to novel states. As a result, BeBold shows substantially better performance in the same procedurally-generated environments. Go-Explore (Ecoffet et al., 2020) stores many visited states (including boundaries), reaches these states without exploration, and explores from them. BeBold focuses on boundaries, performs exploration without a human-designed cell representation (e.g., image downsampling), and is end-to-end. Frontier-based exploration (Yamauchi, 1997; 1998; Topiwala et al., 2018) helps specific robots explore a map by maximizing information gain; the "frontier" is defined as the 2D spatial region beyond the explored parts, and no automatic policy optimization with deep models is used. In contrast, BeBold can be applied to more general partially observable MDPs with deep policies.

2. BACKGROUND

We consider a single-agent Markov Decision Process (MDP) with a state space $S$, an action space $A$, and a (non-deterministic) transition function $T: S \times A \rightarrow P(S)$, where $P(S)$ is the probability distribution over next states given the current state and action. The goal is to maximize the expected return $R = \mathbb{E}\left[\sum_{k=0}^{T} \gamma^k r_{t+k}\right]$, where $r_t$ is the reward, $\gamma$ is the discount factor, and the expectation is taken w.r.t. the policy $\pi$ and the MDP transitions. In this paper, the total reward received at time step $t$ is given by $r_t = r^e_t + \alpha r^i_t$, where $r^e_t$ is the extrinsic reward given by the environment, $r^i_t$ is the intrinsic reward from the exploration criterion, and $\alpha$ is a scaling hyperparameter.



[Figure 1 panel annotations: 3. RND by chance starts exploring the bottom-right corner heavily, resulting in higher IR at the top right than at the bottom right. 4. RND re-explores the upper right, forgets the bottom right, and gets trapped.]

Figure 1: A hypothetical demonstration of how exploration proceeds in BeBold versus Random Network Distillation (Burda et al., 2018b), in terms of the distribution of intrinsic rewards (IR). BeBold reaches the goal by continuously pushing the frontier of exploration, while RND gets trapped. Note that IR is defined differently in RND ($1/N(s_t)$) versus BeBold ($\max(1/N(s_{t+1}) - 1/N(s_t), 0)$; see Eqn. 3), and different color scales are used.
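The contrast in the caption can be reproduced with exact counts on a toy corridor. This is a hypothetical sketch with tabular counts; the paper approximates $1/N$ with Random Network Distillation rather than counting states directly:

```python
def rnd_ir(N, s):
    # RND-style count bonus: high wherever the state itself is novel,
    # including previously explored regions whose bonus has decayed less.
    return 1.0 / N[s]

def bebold_ir(N, s, s_next):
    # BeBold: regulated difference of inverse counts (Eqn. 3 of the paper).
    return max(1.0 / N[s_next] - 1.0 / N[s], 0.0)

# Corridor states 0..4 with visitation counts falling off toward the frontier.
N = {0: 100, 1: 100, 2: 50, 3: 5, 4: 1}
print([round(rnd_ir(N, s), 2) for s in N])                   # nonzero on every low-count state
print([round(bebold_ir(N, s, s + 1), 2) for s in range(4)])  # peaks only at the frontier
```

The RND bonus is positive at every under-visited state, so revisiting any of them pays; the BeBold bonus is concentrated on transitions that cross from the explored region into the unexplored one.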

