EMERGENCE OF EXPLORATION IN POLICY GRADIENT REINFORCEMENT LEARNING VIA RESETTING

Abstract

In reinforcement learning (RL), many exploration methods explicitly promote stochastic policies, e.g., by adding an entropy bonus. We argue that exploration matters in RL only because the agent repeatedly encounters the same or similar states, so that it is beneficial to gradually improve performance over these encounters; otherwise, the greedy policy would be optimal. Based on this intuition, we propose ReMax, an objective for RL under which stochastic exploration arises as an emergent property, without any explicit exploration bonus. In ReMax, an episode is modified so that the agent can reset to previous states in the trajectory, and the agent's goal is to maximize the best return in the resulting trajectory tree. We show that the ReMax objective can be directly optimized with an unbiased policy gradient method. Experiments confirm that ReMax leads to the emergence of a stochastic exploration policy and improves performance compared to RL without an exploration bonus.

1. INTRODUCTION

Exploration is widely studied in reinforcement learning (RL) (Sutton & Barto, 2018) (see App. A for an extended overview). The most popular method of exploration is to explicitly promote a stochastic policy by maximizing the entropy in addition to the rewards (Williams, 1992; Ziebart et al., 2008; Mnih et al., 2016; Haarnoja et al., 2018). We note that it is non-obvious why one should add such an entropy bonus: the objective of RL is only to maximize the rewards. Such exploration methods are only retrospectively justified by the fact that they improve the performance of the algorithms. In this article, we propose a method that, paradoxically, promotes exploration by greedily maximizing the rewards.

The motivation for our method is the following: we suppose that exploration is vital in RL because the agent, intentionally or unintentionally, visits the same (or similar) state repeatedly; exploration allows the agent to gain valuable information for making a better decision on the next visit to the same state. Exploration has no value, however, if the agent never encounters the same state again.

Based on this observation, we propose a new objective function for RL, called ReMax, that encourages exploration in a novel way. Briefly, the ReMax objective is computed as follows: while interacting with the environment, in addition to taking the usual actions, the agent may choose to reset to a previously visited state in the trajectory, up to some limited number of times; then, after the interaction, the value of the ReMax objective is computed as the sum of the rewards along the best trajectory. The crucial difference between our approach and previous ones is that, while most previous approaches explicitly set the goal of obtaining a stochastic exploratory policy via an exploration bonus (e.g., a state-visitation bonus or an entropy bonus), in our approach such an exploratory policy is not the explicit goal; rather, optimizing the ReMax objective naturally results in an exploratory policy.
We note that several previous studies successfully utilized resetting. For example, Go-Explore (Ecoffet et al., 2021), which achieved impressive results on the well-known hard-exploration problem Montezuma's Revenge, used resetting to rarely visited states; AlphaGo (Silver et al., 2016) used Monte-Carlo tree search, which can be regarded as a kind of resetting. In practical RL problems, resetting is often possible, such as when we have access to the environment simulator (as in Go). Even when such simulator access is not available, we can use powerful model-based RL (MBRL) methods, e.g., DreamerV2 (Hafner et al., 2021), and perform resetting in simulations with the learned model.

The main objective of this article is to test our hypothesis that ReMax leads to the emergence of a stochastic exploratory policy. To this end, we perform three phases of experiments:

• Step 1. We illustrate the main idea and demonstrate that optimizing the ReMax objective yields a stochastic policy in a simple bandit task (Sec. 3). This experiment was inconclusive, as the emergence of the stochastic policy relied on the partial observability of the environment.

• Step 2. To overcome the limitation of the previous step, we demonstrate that, by optimizing the ReMax objective, a stochastic policy emerges even in a deterministic maze environment, where optimizing the regular RL objective causes the policy to become deterministic and learning to stop (Sec. 5). The limitation here is that the example relied on a simple model parameterization.

• Step 3. To make the scenario of the maze experiment more realistic, we modify the maze to represent the observations as images, and use a neural network function approximator (Sec. 6). This experiment indicates that the failure of regular RL and the emergence of exploration occur in a practical deep RL scenario, even in a deterministic environment.

Finally, we showed that ReMax can promote exploration in modern policy gradient algorithms, such as A2C (Mnih et al., 2016), and improve the performance in MinAtar (Young & Tian, 2019), a simplified version of the Arcade Learning Environment (Bellemare et al., 2013) (Sec. 8.1). We believe ReMax is a viable competitor to classical stochastic exploration approaches, such as entropy bonuses.
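To make the objective concrete, the following minimal sketch computes the best return over a trajectory tree produced by resets. The tree representation and the helper name `best_return` are illustrative assumptions; the paper's exact episode protocol may differ.

```python
# Minimal sketch (illustrative data structure, not the paper's protocol):
# after an episode with resets, the collected transitions form a trajectory
# tree, and the ReMax objective is the return of the best root-to-leaf path.

def best_return(node):
    """Maximum sum of rewards over all root-to-leaf paths.

    `node` is a dict {"reward": float, "children": [subtrees...]};
    the root's reward is 0, since no action led to the initial state.
    """
    if not node["children"]:
        return node["reward"]
    return node["reward"] + max(best_return(c) for c in node["children"])

# Example: the agent takes an action earning reward 1, resets, and tries
# another branch earning reward 3; ReMax keeps the best branch: 0 + 3 = 3.
tree = {"reward": 0.0, "children": [
    {"reward": 1.0, "children": []},
    {"reward": 3.0, "children": []},
]}
print(best_return(tree))  # 3.0
```

Note that the regular RL objective corresponds to a tree with a single branch, in which case the maximum reduces to the ordinary episode return.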

2. PRELIMINARIES

Notation. We consider an episodic Markov decision process (MDP) M, defined as a tuple (S, A, P, r, ρ_0, T), where the state space S and the action space A are discrete and T is a finite horizon. The initial state s_0 ∈ S follows the distribution ρ_0 : S → [0, 1], and the state transition kernel P : S × A × S → [0, 1] defines the probability of transitioning from the current state s ∈ S to the next state s′ ∈ S after action a ∈ A is taken. The reward function r : S × A → [r_min, r_max] determines the immediate reward given the state s and action a. At each state s, the agent can take a legal action a ∈ A(s) ⊆ A, where A(s) denotes the set of legal actions at state s. The agent acts following a parameterized policy π_θ : S × A → [0, 1] with the goal of maximizing the rewards. The trajectory τ := (s_0, a_0, ..., s_T) is the sequence of states and actions from the current episode: τ ∼ ρ_π(τ), where ρ_π(τ) := ρ_0(s_0) ∏_{t=0}^{T−1} π(a_t|s_t) P(s_{t+1}|s_t, a_t). Note that s_T is the terminal state. The RL objective is to maximize the expected return J_RL(π) := E_{τ∼ρ_π}[R(τ)], where R(τ) := ∑_{t=0}^{T−1} r(s_t, a_t).

Policy gradient methods. In this study, we focus on the policy gradient (PG) method, which directly optimizes a parameterized policy π_θ via gradient ascent. The policy gradient theorem (Sutton et al., 1999) provides an expression of the PG, ∇_θ J_RL(π_θ), amenable to estimation. In particular, we use REINFORCE (Williams, 1992) as the simplest PG method, whose gradient estimator is given by ĝ := ∑_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) (R(τ) − b_t), where b_t is a constant baseline for variance reduction. This estimator is unbiased: ∇_θ J_RL(π_θ) = E_τ[ĝ]. One may also average a batch of N gradient estimates from different trajectories, (1/N) ∑_{i=1}^{N} ĝ_i. A common baseline is b_t = (1/N) ∑_{i=1}^{N} R(τ_i), the average of the returns in the batch.
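As a concrete illustration, the batch REINFORCE estimator with the average-return baseline can be sketched as follows for a tabular softmax policy. This is a minimal sketch under simple assumptions (hypothetical helper names, trajectories given as `(state, action, reward)` triples), not the paper's implementation.

```python
import numpy as np

# Tabular softmax policy: pi_theta(a|s) = softmax(theta[s])_a,
# with theta an (n_states, n_actions) array.

def softmax(x):
    z = x - x.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta (same shape as theta)."""
    g = np.zeros_like(theta)
    p = softmax(theta[s])
    g[s] = -p          # derivative of the log-normalizer
    g[s, a] += 1.0     # derivative of the chosen logit
    return g

def reinforce_grad(theta, trajectories):
    """Batch REINFORCE estimator with the average-return baseline.

    `trajectories` is a list of episodes, each a list of (s, a, r) triples.
    Uses b = (1/N) * sum_i R(tau_i) as the baseline.
    """
    returns = [sum(r for (_, _, r) in tau) for tau in trajectories]
    b = np.mean(returns)
    g = np.zeros_like(theta)
    for tau, R in zip(trajectories, returns):
        for (s, a, _) in tau:
            g += grad_log_pi(theta, s, a) * (R - b)
    return g / len(trajectories)
```

A gradient-ascent step is then simply `theta += lr * reinforce_grad(theta, batch)`; the baseline shifts all returns by the same constant, which reduces variance without introducing bias.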
Another common method to reduce the variance is to use the future return R_t(τ) := ∑_{h=t}^{T−1} r(s_h, a_h), which only includes the rewards following the action; this maintains the unbiasedness of the estimator. An important property of PG methods, and part of the reason we focus on them, is that they remain unbiased even when the system is a POMDP (partially observable MDP), i.e., when unobservable hidden states characterize the state transitions.
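The future returns for a whole episode can be computed in a single backward pass over the reward sequence; a minimal sketch (the helper name is ours):

```python
# Future return ("reward-to-go"): R_t = sum of rewards from step t onward,
# computed in O(T) by accumulating from the end of the episode.
def rewards_to_go(rewards):
    out = [0.0] * len(rewards)
    acc = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        acc += rewards[t]
        out[t] = acc
    return out

print(rewards_to_go([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```

In the REINFORCE estimator, each term ∇_θ log π_θ(a_t|s_t) is then weighted by R_t(τ) − b_t instead of the full return R(τ) − b_t.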

3. STEP 1: BANDIT PROBLEM EXAMPLE

In the first step of our three-stage experiment, we illustrate the core idea behind our ReMax objective. Through a simple randomized bandit task, we explain why a stochastic policy is optimal under the ReMax objective, thus leading to the emergence of exploration.

Problem. There are two arms, indexed by 0 and 1. At the beginning of each episode, one arm is chosen as an unobservable "correct" arm z ∈ {0, 1}, drawn from a Bernoulli distribution with P(z = 1) = q = 0.75. In each episode, the agent plays only one arm a ∈ {0, 1}. Playing the correct arm (i.e., a = z) yields return 1, and 0 otherwise: R(z, a) = I_{z=a}, where I_e takes the value 1 if e is true and 0 otherwise. Under the usual RL objective, which maximizes the expected return E_{z,a}[R(z, a)], the optimal policy is deterministic, taking action a = 1 with probability 1, which yields the maximum expected return of 0.75.
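A quick numeric check of this claim: for a policy that plays arm 1 with probability p (p is our notation, not the paper's), the expected return under the usual RL objective is q·p + (1 − q)·(1 − p), which is linear in p and therefore maximized at the deterministic endpoint p = 1.

```python
# Expected return of the two-armed bandit under the standard RL objective,
# for a policy playing arm 1 with probability p and P(z = 1) = q = 0.75.
q = 0.75

def expected_return(p):
    return q * p + (1.0 - q) * (1.0 - p)

print(expected_return(1.0))  # 0.75 -- deterministic policy is optimal
print(expected_return(0.5))  # 0.5  -- a uniform random policy does worse
```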




