REVISITING INTRINSIC REWARD FOR EXPLORATION IN PROCEDURALLY GENERATED ENVIRONMENTS

Abstract

Exploration under sparse rewards remains a key challenge in deep reinforcement learning. Recently, studying exploration in procedurally-generated environments has drawn increasing attention. Existing works generally combine lifelong and episodic intrinsic rewards to encourage exploration. Though various lifelong and episodic intrinsic rewards have been proposed, the individual contributions of the two kinds of intrinsic reward to improving exploration have barely been investigated. To bridge this gap, we disentangle these two parts and conduct ablative experiments. We consider the lifelong and episodic intrinsic rewards used in prior works and compare the performance of all lifelong-episodic combinations on the commonly used MiniGrid benchmark. Experimental results show that using only episodic intrinsic rewards can match or surpass prior state-of-the-art methods. On the other hand, using only lifelong intrinsic rewards makes hardly any progress in exploration. This demonstrates that the episodic intrinsic reward is more crucial than the lifelong one in boosting exploration. Moreover, we find through experimental analysis that the lifelong intrinsic reward does not accurately reflect the novelty of states, which explains why it does not help much in improving exploration.

1. INTRODUCTION

How to encourage sufficient exploration in environments with sparse rewards is one of the most actively studied challenges in deep reinforcement learning (RL) (Bellemare et al., 2016; Pathak et al., 2017; Ostrovski et al., 2017; Burda et al., 2019a;b; Osband et al., 2019; Ecoffet et al., 2019; Badia et al., 2019). In order to learn exploration behavior that generalizes to similar environments, recent works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020; Campero et al., 2021; Flet-Berliac et al., 2021; Zha et al., 2021) have paid increasing attention to procedurally-generated grid-like environments (Chevalier-Boisvert et al., 2018; Küttler et al., 2020). Among them, approaches based on intrinsic reward (Raileanu & Rocktäschel, 2020; Zhang et al., 2020), which combine lifelong and episodic intrinsic rewards to encourage exploration, have proven quite effective. The lifelong intrinsic reward encourages visits to novel states that have been experienced less frequently over the entire course of training, while the episodic intrinsic reward encourages the agent to visit states that are relatively novel within the current episode. However, previous works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020) mainly focus on designing the lifelong intrinsic reward while treating the episodic one only as a minor complement. The individual contributions of these two kinds of intrinsic reward to improving exploration have barely been investigated. To bridge this gap, we present in this work a comprehensive empirical study to reveal their contributions. Through extensive experiments on the commonly used MiniGrid benchmark (Chevalier-Boisvert et al., 2018), we ablatively study the effect of lifelong and episodic intrinsic rewards on boosting exploration.
Specifically, we use the lifelong and episodic intrinsic rewards considered in prior works (Pathak et al., 2017; Burda et al., 2019b; Raileanu & Rocktäschel, 2020; Zhang et al., 2020) and compare the performance of all lifelong-episodic combinations, in environments with and without extrinsic rewards. Surprisingly, we find that an agent trained using only episodic intrinsic rewards can match or surpass agents trained with the other lifelong-episodic combinations, both in terms of the cumulative extrinsic reward in a sparse-reward setting and in terms of the number of explored rooms in a pure exploration setting. In contrast, using only lifelong intrinsic rewards makes little progress in exploration. These observations suggest that the episodic intrinsic reward is the more crucial ingredient of the intrinsic reward for efficient exploration. Furthermore, we experimentally analyze why the lifelong intrinsic reward does not offer much help in improving exploration. Specifically, we find that randomly permuting the lifelong intrinsic rewards within a batch does not cause a significant drop in performance. This demonstrates that the lifelong intrinsic reward may not accurately reflect the novelty of states, explaining its ineffectiveness. We also compare using only episodic intrinsic rewards against other state-of-the-art methods, including RIDE (Raileanu & Rocktäschel, 2020), BeBold (Zhang et al., 2020), AGAC (Flet-Berliac et al., 2021), and RAPID (Zha et al., 2021), and find that simply using episodic intrinsic rewards outperforms the others by large margins. In summary, our work makes the following contributions:

• We conduct a comprehensive study of lifelong and episodic intrinsic rewards in exploration and find that the episodic intrinsic reward, overlooked by previous works, is actually the more important ingredient for efficient exploration in procedurally-generated gridworld environments.

• We analyze why the lifelong intrinsic reward does not contribute much and find that it fails to accurately reflect the novelty of states.

• We show that simply using episodic intrinsic rewards serves as a very strong baseline that outperforms current state-of-the-art methods by large margins. This finding should inspire future work on designing and rethinking intrinsic rewards.
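The batch-permutation ablation mentioned above can be sketched as follows. This is our own minimal illustration, not the paper's implementation: the function name and the toy reward values are ours, and NumPy is used for brevity. The idea is that if training is insensitive to shuffling the lifelong bonuses across transitions, the lifelong term carries little state-specific novelty signal.

```python
import numpy as np

def permuted_intrinsic_reward(lifelong, episodic, rng):
    """Shuffle the lifelong bonuses across the batch, then combine them
    multiplicatively with the (unshuffled) episodic bonuses.  Each episodic
    bonus stays aligned with its own transition; only the lifelong part is
    decoupled from the states that produced it."""
    perm = rng.permutation(len(lifelong))
    return lifelong[perm] * episodic

rng = np.random.default_rng(0)
lifelong = np.array([0.9, 0.1, 0.5, 0.3])   # e.g. lifelong novelty estimates
episodic = np.array([1.0, 0.5, 0.25, 1.0])  # e.g. episodic count bonuses
r_int = permuted_intrinsic_reward(lifelong, episodic, rng)
```

The permutation preserves the multiset of lifelong values, so any aggregate statistic of the bonus is unchanged; only its association with particular states is destroyed.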

2.1. NOTATION

We consider a single-agent Markov Decision Process (MDP) described by a tuple $(S, A, R, P, \gamma)$, where $S$ denotes the state space and $A$ the action space. At time $t$ the agent observes state $s_t \in S$ and takes an action $a_t \in A$ by following a policy $\pi : S \to A$. The environment then yields a reward signal $r_t$ according to the reward function $R(s_t, a_t)$. The next state observation $s_{t+1} \in S$ is sampled according to the transition distribution function $P(s_{t+1} \mid s_t, a_t)$. The goal of the agent is to learn an optimal policy $\pi^*$ that maximizes the expected cumulative reward:

$$\pi^* = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{\pi, P}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad (1)$$

where $\Pi$ denotes the policy space and $\gamma \in [0, 1)$ is the discount factor. Following previous curiosity-driven approaches (Pathak et al., 2017; Burda et al., 2019b; Raileanu & Rocktäschel, 2020; Zhang et al., 2020), we consider that at each timestep the agent receives some intrinsic reward $r^i_t$ in addition to the extrinsic reward $r^e_t$. The intrinsic reward is designed to capture the agent's curiosity about the states, via quantifying how different the states are compared to those already visited. $r^i_t$ can be computed for any transition tuple $(s_t, a_t, s_{t+1})$. The agent's goal then becomes maximizing the weighted sum of intrinsic and extrinsic rewards, i.e.,

$$r_t = r^e_t + \beta r^i_t, \qquad (2)$$

where $\beta$ is a hyperparameter for balancing the two rewards. We may drop the subscript $t$ to refer to the corresponding reward as a whole rather than the reward at timestep $t$.
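As a concrete illustration of Eq. (2), the reward actually optimized by the agent can be computed as below. This is a minimal sketch: the function name is ours and the default value of β is an arbitrary placeholder, not a value taken from the paper.

```python
def combined_reward(r_ext, r_int, beta=0.1):
    """Reward maximized by the agent: r_t = r^e_t + beta * r^i_t.

    beta trades off the extrinsic task reward against the curiosity bonus;
    beta = 0.1 is an illustrative default, not a tuned hyperparameter.
    """
    return r_ext + beta * r_int

# A typical sparse-reward step: zero extrinsic reward, nonzero curiosity bonus.
r = combined_reward(r_ext=0.0, r_int=0.5)
```

In sparse-reward environments the extrinsic term is zero on almost every step, so early in training the intrinsic term alone drives the learning signal.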

2.2. TWO PARTS OF THE INTRINSIC REWARD

For procedurally-generated environments, recent works (Raileanu & Rocktäschel, 2020; Zhang et al., 2020) form the intrinsic reward as the product of two parts:

$$r^i_t = r^{\text{lifelong}}_t \cdot r^{\text{episodic}}_t,$$
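A minimal sketch of this decomposition, assuming a count-based episodic term $1/\sqrt{N_{ep}(s)}$ (one common choice in prior work) and leaving the lifelong term as an abstract callable. The class and function names are ours, and the lifelong placeholder stands in for whatever novelty estimate a given method uses (e.g. a prediction error).

```python
from collections import Counter

class EpisodicCountBonus:
    """Episodic term: 1 / sqrt(N_ep(s)), where N_ep(s) counts visits to
    state s within the current episode.  Unlike the lifelong term, these
    counts are reset at every episode boundary."""

    def __init__(self):
        self.counts = Counter()

    def reset(self):
        # Call at the start of each episode.
        self.counts.clear()

    def __call__(self, state):
        self.counts[state] += 1
        return 1.0 / self.counts[state] ** 0.5

def intrinsic_reward(lifelong_bonus, episodic_bonus, state):
    """r^i_t = r^lifelong_t * r^episodic_t for the visited state."""
    return lifelong_bonus(state) * episodic_bonus(state)

episodic = EpisodicCountBonus()
lifelong = lambda s: 1.0  # placeholder lifelong novelty estimate
r1 = intrinsic_reward(lifelong, episodic, "s0")  # first visit in episode
r2 = intrinsic_reward(lifelong, episodic, "s0")  # revisit: bonus shrinks
```

With a constant lifelong term, as here, the product degenerates to the purely episodic bonus, which is exactly the ablation condition the paper argues performs best.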

