EMERGENCE OF EXPLORATION IN POLICY GRADIENT REINFORCEMENT LEARNING VIA RESETTING

Abstract

In reinforcement learning (RL), many exploration methods explicitly promote stochastic policies, e.g., by adding an entropy bonus. We argue that exploration matters in RL only because the agent repeatedly encounters the same or similar states, so that it pays to gradually improve performance across these encounters; otherwise, the greedy policy would be optimal. Based on this intuition, we propose ReMax, an objective for RL under which stochastic exploration arises as an emergent property, without any explicit exploration bonus. In ReMax, an episode is modified so that the agent can reset to previous states in the trajectory, and the agent's goal is to maximize the best return in the resulting trajectory tree. We show that the ReMax objective can be directly optimized with an unbiased policy gradient method. Experiments confirm that ReMax leads to the emergence of a stochastic exploration policy and improves performance compared to RL with no exploration bonus.

1. INTRODUCTION

Exploration is widely studied in reinforcement learning (RL) (Sutton & Barto, 2018) (see App. A for an extended overview). The most popular exploration method is to explicitly promote a stochastic policy by maximizing the entropy in addition to the rewards (Williams, 1992; Ziebart et al., 2008; Mnih et al., 2016; Haarnoja et al., 2018). It is non-obvious why one should add such an entropy bonus: the objective of RL is only to maximize the rewards. Such exploration methods are justified only retrospectively, by the fact that they improve the performance of the algorithms. In this article, we propose a method that, paradoxically, promotes exploration by greedily maximizing the rewards.

The motivation of our method is the following: we suppose that exploration is vital in RL because the agent, intentionally or unintentionally, visits the same (or similar) state repeatedly; exploration allows the agent to gain valuable information for making a better decision on the next visit to the same state. Exploration has no value if the agent never encounters the same state again. Based on this observation, we propose a new objective function for RL, called ReMax, that encourages exploration in a novel way. Briefly, the ReMax objective is computed as follows: while interacting with the environment, in addition to taking usual actions, the agent may choose to reset to a previously visited state in the trajectory, up to some limited number of times; then, after the interaction, the value of the ReMax objective is computed as the sum of the rewards along the best trajectory in the resulting trajectory tree.

The crucial difference between our approach and previous ones is that, while most previous approaches explicitly set the goal of obtaining a stochastic exploratory policy via an exploration bonus (e.g., a state-visitation bonus or an entropy bonus), in our approach such an exploratory policy is not the explicit goal; rather, optimizing the ReMax objective naturally results in an exploratory policy.
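To make the objective concrete, the following is a minimal sketch of evaluating the ReMax objective on a recorded episode. The `TreeNode` structure and the tree-building convention are illustrative assumptions, not the paper's implementation: each node stores the reward received on the transition into it, and a reset creates a sibling branch under the node the agent returned to.

```python
# Illustrative sketch only: computes the ReMax objective (best return
# over all root-to-leaf trajectories) for a given trajectory tree.
# The TreeNode class is a hypothetical representation of an episode
# in which resets spawn additional branches.
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    reward: float = 0.0                            # reward on entering this node
    children: list = field(default_factory=list)   # branches taken from this state

def remax_objective(node: TreeNode) -> float:
    """Maximum cumulative reward over all root-to-leaf paths."""
    if not node.children:
        return node.reward
    return node.reward + max(remax_objective(c) for c in node.children)

# Toy episode: from the start the agent tries action A (rewards 1 then 0),
# resets to the start, and tries action B (rewards 0 then 3).
root = TreeNode(0.0, [
    TreeNode(1.0, [TreeNode(0.0)]),   # branch A: return 1
    TreeNode(0.0, [TreeNode(3.0)]),   # branch B: return 3
])
print(remax_objective(root))  # best return in the tree: 3.0
```

The objective rewards the single best branch, so a deterministic policy that repeats the same actions after every reset gains nothing from the extra budget; this is the mechanism by which stochastic exploration can emerge.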
We note that several previous studies have successfully utilized resetting. For example, Go-Explore (Ecoffet et al., 2021), which achieved impressive results on the well-known hard-exploration problem Montezuma's Revenge, resets to rarely visited states, and AlphaGo (Silver et al., 2016) uses Monte-Carlo tree search, which can be regarded as a kind of resetting. In practical RL problems, resetting is often possible, for example when we have access to the environment simulator (as in Go). Even when such simulator access is unavailable, we can employ powerful model-based RL (MBRL) methods, e.g., DreamerV2 (Hafner et al., 2021), and apply resetting in simulations with the learned model. The main objective of this article is to test our hypothesis that ReMax leads to the emergence of a stochastic exploratory policy. To this end, we perform three phases of experiments:

