REVISITING INTRINSIC REWARD FOR EXPLORATION IN PROCEDURALLY GENERATED ENVIRONMENTS

Abstract

Exploration under sparse rewards remains a key challenge in deep reinforcement learning. Recently, studying exploration in procedurally generated environments has drawn increasing attention. Existing works generally combine lifelong and episodic intrinsic rewards to encourage exploration. Although various lifelong and episodic intrinsic rewards have been proposed, the individual contribution of each kind to improving exploration has barely been investigated. To bridge this gap, we disentangle the two components and conduct ablative experiments. We consider the lifelong and episodic intrinsic rewards used in prior works and compare the performance of all lifelong-episodic combinations on the commonly used MiniGrid benchmark. Experimental results show that using only episodic intrinsic rewards can match or surpass prior state-of-the-art methods, whereas using only lifelong intrinsic rewards makes hardly any progress in exploration. This demonstrates that the episodic intrinsic reward is more crucial than the lifelong one in boosting exploration. Moreover, our experimental analysis shows that the lifelong intrinsic reward does not accurately reflect the novelty of states, which explains why it contributes little to improving exploration.

1. INTRODUCTION

How to encourage sufficient exploration in environments with sparse rewards is one of the most actively studied challenges in deep reinforcement learning (RL) (Bellemare et al., 2016; Pathak et al., 2017; Ostrovski et al., 2017; Burda et al., 2019a;b; Osband et al., 2019; Ecoffet et al., 2019; Badia et al., 2019). In order to learn exploration behavior that generalizes to similar environments, recent works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Campero et al., 2021; Flet-Berliac et al., 2021; Zha et al., 2021) have paid increasing attention to procedurally generated grid-like environments (Chevalier-Boisvert et al., 2018; Küttler et al., 2020). Among them, approaches based on intrinsic reward (Raileanu & Rocktäschel, 2020; Zhang et al., 2020) have proven quite effective; they combine lifelong and episodic intrinsic rewards to encourage exploration. The lifelong intrinsic reward encourages visits to novel states that have been experienced less frequently over the entire course of training, while the episodic intrinsic reward encourages the agent to visit states that are relatively novel within the current episode. However, previous works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020) mainly focus on designing the lifelong intrinsic reward while treating the episodic one only as a minor complement; the individual contributions of these two kinds of intrinsic rewards to improving exploration are barely investigated. To bridge this gap, we present in this work a comprehensive empirical study to reveal their contributions. Through extensive experiments on the commonly used MiniGrid benchmark (Chevalier-Boisvert et al., 2018), we ablatively study the effect of lifelong and episodic intrinsic rewards on boosting exploration.
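To make the distinction concrete, the following is a minimal count-based sketch of how a lifelong and an episodic novelty term are typically combined into a single intrinsic bonus. This is a simplified illustration, not the method of any cited work: actual approaches such as RIDE or BeBold replace the tabular lifelong counts below with learned novelty estimators (e.g., prediction error), and the multiplicative combination is one common choice among several.

```python
import math
from collections import Counter

class CombinedIntrinsicReward:
    """Toy combination of lifelong and episodic count-based novelty bonuses.

    Lifelong counts persist across all episodes; episodic counts are
    cleared at every episode boundary. The bonus for a visited state is
    the product of the two inverse-square-root novelty terms.
    """

    def __init__(self):
        self.lifelong_counts = Counter()  # accumulated over all training
        self.episodic_counts = Counter()  # reset at each new episode

    def reset_episode(self):
        """Call at the start of every episode."""
        self.episodic_counts.clear()

    def bonus(self, state):
        """Return the intrinsic bonus for visiting `state` (hashable)."""
        self.lifelong_counts[state] += 1
        self.episodic_counts[state] += 1
        lifelong = 1.0 / math.sqrt(self.lifelong_counts[state])
        episodic = 1.0 / math.sqrt(self.episodic_counts[state])
        return lifelong * episodic
```

Under this scheme, the first-ever visit to a state yields a bonus of 1.0; revisiting it within the same episode is discounted by both terms, while revisiting it in a later episode is discounted only by the lifelong term, so the episodic term restores part of the incentive each episode. Ablating one component amounts to fixing the corresponding term to a constant.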
Specifically, we use the lifelong and episodic intrinsic rewards considered in prior works (Pathak et al., 2017; Burda et al., 2019b; Raileanu & Rocktäschel, 2020; Zhang et al., 2020), and compare the performance of all lifelong-episodic combinations, in environments with and without extrinsic

