RANK THE EPISODES: A SIMPLE APPROACH FOR EXPLORATION IN PROCEDURALLY-GENERATED ENVIRONMENTS

Abstract

Exploration under sparse reward is a long-standing challenge of model-free reinforcement learning. State-of-the-art methods address this challenge by introducing intrinsic rewards to encourage exploration of novel states or uncertain environment dynamics. Unfortunately, methods based on intrinsic rewards often fall short in procedurally-generated environments, where a different environment is generated in each episode so that the agent is unlikely to visit the same state more than once. Motivated by how humans distinguish good exploration behaviors by looking at the entire episode, we introduce RAPID, a simple yet effective episode-level exploration method for procedurally-generated environments. RAPID regards each episode as a whole and assigns it an episodic exploration score from both per-episode and long-term views. The highly-scored episodes are treated as good exploration behaviors and are stored in a small ranking buffer. The agent then imitates the episodes in the buffer to reproduce the past good exploration behaviors. We demonstrate our method on several procedurally-generated MiniGrid environments, a first-person-view 3D maze navigation task from MiniWorld, and several sparse MuJoCo tasks. The results show that RAPID significantly outperforms the state-of-the-art intrinsic reward strategies in terms of sample efficiency and final performance. The code is available at https://github.com/daochenzha/rapid.

1. INTRODUCTION

Deep reinforcement learning (RL) is widely applied in various domains (Mnih et al., 2015; Silver et al., 2016; Mnih et al., 2016; Lillicrap et al., 2015; Andrychowicz et al., 2017; Zha et al., 2019a; Liu et al., 2020). However, RL algorithms often require a tremendous number of samples to achieve reasonable performance (Hessel et al., 2018). This sample efficiency issue becomes more pronounced in sparse reward environments, where the agent may take an extremely long time before bumping into a reward signal (Riedmiller et al., 2018). Thus, how to efficiently explore the environment under sparse reward remains an open challenge (Osband et al., 2019). To address this challenge, many exploration methods have been investigated and demonstrated to be effective on hard-exploration environments. One of the most popular techniques is to use intrinsic rewards to encourage exploration (Schmidhuber, 1991; Oudeyer & Kaplan, 2009; Guo et al., 2016; Zheng et al., 2018; Du et al., 2019). The key idea is to give an intrinsic bonus based on uncertainty, e.g., assigning higher rewards to novel states (Ostrovski et al., 2017), or using the prediction error of a dynamics model as the intrinsic reward (Pathak et al., 2017). While many intrinsic reward methods have demonstrated superior performance on hard-exploration environments, such as Montezuma's Revenge (Burda et al., 2018b) and Pitfall! (Badia et al., 2019), most of the previous studies use the same singleton environment for training and testing, i.e., the agent aims to solve the same environment in each episode. However, recent studies show that an agent trained in this way is susceptible to overfitting and may fail to generalize to even a slightly different environment (Rajeswaran et al., 2017; Zhang et al., 2018a).
To deal with this issue, a few procedurally-generated environments are designed to test the generalization of RL, such as (Beattie et al., 2016; Chevalier-Boisvert et al., 2018; Nichol et al., 2018; Côté et al., 2018; Cobbe et al., 2019; Küttler et al., 2020), in which the agent aims to solve the same task, but a different environment is generated in each episode. Unfortunately, encouraging exploration in procedurally-generated environments is a very challenging task. Many intrinsic reward methods, such as count-based exploration and curiosity-driven exploration, often fall short in procedurally-generated environments (Raileanu & Rocktäschel, 2020; Campero et al., 2020). Figure 1a shows a motivating example of why count-based exploration is less effective in procedurally-generated environments. We consider a 1D grid-world, where the agent (red) needs to explore the environment to reach the goal within 7 timesteps, that is, the agent needs to move right in all the steps. While count-based exploration assigns reasonable intrinsic rewards in the singleton setting, it may generate misleading intrinsic rewards in the procedurally-generated setting because visiting a novel state does not necessarily indicate a good exploration behavior.
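The failure mode in Figure 1a can be illustrated with a minimal count-based bonus of the common form β/√N(s). This is a generic sketch, not the paper's implementation; the coefficient `beta` and the toy state encoding are our own illustrative assumptions:

```python
import math
from collections import defaultdict

def count_based_bonus(counts, state, beta=0.1):
    """Count-based intrinsic bonus beta / sqrt(N(s)).

    `beta` is a hypothetical coefficient chosen for illustration only.
    """
    counts[state] += 1
    return beta / math.sqrt(counts[state])

# Singleton setting: the same 1D grid-world in every episode, so state
# counts accumulate across episodes and the novelty bonus decays as intended.
counts = defaultdict(int)
singleton_bonuses = []
for episode in range(10):
    for state in [0, 1, 2, 3]:
        singleton_bonuses.append(count_based_bonus(counts, state))

# Procedurally-generated setting: the third block's observation is freshly
# sampled in each episode (faked here by tagging the state with the episode
# index), so it always looks novel and its bonus never decays -- the agent
# is rewarded for lingering around the third block.
counts = defaultdict(int)
procgen_bonuses = [count_based_bonus(counts, ("block3", ep)) for ep in range(10)]
```

In the singleton run the bonus shrinks with repeated visits, while in the procedurally-generated run the bonus stays at its maximum forever, which is exactly the misleading signal described above.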
This issue becomes more pronounced in more challenging procedurally-generated environments, where the agent is unlikely to visit the same state more than once. To tackle the above challenge, this work studies how to reward good exploration behaviors that can generalize to different environments. When a human judges how well an agent explores the environment, she often views the agent at the episode level instead of the state level. For instance, one can easily tell whether an agent has explored a maze well by looking at the coverage rate of the current episode, even if the current environment is different from the previous ones. In Figure 1b, we can confidently tell that the episode on the right-hand side is a good exploration behavior because the agent achieves a 0.875 coverage rate. Similarly, we can safely conclude that the episode on the left-hand side does not explore the environment well due to its low coverage rate. Motivated by this, we hypothesize that an episode-level exploration score could be a more general criterion for distinguishing good exploration behaviors than state-level intrinsic rewards in the procedurally-generated setting. To verify our hypothesis, we propose exploration via Ranking the Episodes (RAPID). Specifically, we identify good exploration behaviors by assigning episodic exploration scores to past episodes. To efficiently learn from the past good exploration behaviors, we use a ranking buffer to store the highly-scored episodes. The agent then learns to reproduce the past good exploration behaviors with imitation learning. We demonstrate that RAPID significantly outperforms the state-of-the-art intrinsic reward methods on several procedurally-generated benchmarks. Moreover, we present extensive ablations and analysis for RAPID, showing that RAPID can explore the environment well even without extrinsic rewards and can be generally applied to tasks with continuous state/action spaces.
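The two ingredients described above, a per-episode coverage score and a small ranking buffer of the best episodes, can be sketched as follows. This is a simplified illustration under our own assumptions: the buffer capacity and data layout are hypothetical choices, and we use only the per-episode coverage view of the score, omitting the long-term view that RAPID also includes:

```python
import heapq

def coverage_score(episode_states, num_total_states):
    """Episodic exploration score: # distinct visited states / # total states."""
    return len(set(episode_states)) / num_total_states

class RankingBuffer:
    """Keep only the highest-scored episodes, via a fixed-capacity min-heap."""

    def __init__(self, capacity=10):
        self.capacity = capacity  # illustrative size, not the paper's setting
        self._heap = []           # entries: (score, insertion_id, episode)
        self._next_id = 0         # tie-breaker so episodes are never compared

    def add(self, score, episode):
        entry = (score, self._next_id, episode)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            # Evict the lowest-scored episode if the new one ranks higher.
            heapq.heappushpop(self._heap, entry)

    def episodes(self):
        """Stored episodes, best first; these serve as imitation targets."""
        return [ep for _, _, ep in sorted(self._heap, reverse=True)]

# Figure 1b's example: 7 distinct states visited out of 8 gives 0.875.
score = coverage_score([0, 1, 2, 3, 4, 5, 6], 8)

# The buffer retains only the top-scored episodes.
buf = RankingBuffer(capacity=2)
for s, ep in [(0.2, "ep_a"), (0.9, "ep_b"), (0.5, "ep_c")]:
    buf.add(s, ep)
top = buf.episodes()  # "ep_a" has been evicted
```

A min-heap keeps eviction of the worst episode O(log n) per insertion, which is one natural way to realize a small "keep the best episodes" buffer; the imitation learning step that consumes these episodes is described in the following sections.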

2. EXPLORATION VIA RANKING THE EPISODES

An overview of RAPID is shown in Figure 2 . The key idea is to assign episodic exploration scores to past episodes and store those highly-scored episodes into a small buffer. The agent then treats the episodes in the buffer as demonstrations and learns to reproduce these episodes with imitation learning. RAPID encourages the agent to explore the environment by reproducing the past good



Figure 1: A motivating example of count-based exploration versus episode-level exploration score. While count-based exploration works well in the singleton setting, i.e., the environment is the same in each episode, it may be brittle in the procedurally-generated setting. In (a), if the observation of the third block in an episode is sampled from ten possible values, the third block will have a very high intrinsic reward because the agent is uncertain about the new states. This rewards the agent for exploring the third block and also its neighbors, and hence the agent may get stuck around the third block. In (b), we score behaviors at the episode level with # visited states / # total states. The episodic exploration score can effectively distinguish good exploration behaviors in the procedurally-generated setting.

