TEMPORALLY-EXTENDED ε-GREEDY EXPLORATION

Abstract

Recent work on exploration in reinforcement learning (RL) has led to a series of increasingly complex solutions to the problem. This increase in complexity often comes at the expense of generality. Recent empirical studies suggest that, when applied to a broader set of domains, some sophisticated exploration methods are outperformed by simpler counterparts, such as ε-greedy. In this paper we propose an exploration algorithm that retains the simplicity of ε-greedy while reducing dithering. We build on a simple hypothesis: the main limitation of ε-greedy exploration is its lack of temporal persistence, which limits its ability to escape local optima. We propose a temporally extended form of ε-greedy that simply repeats the sampled action for a random duration. It turns out that, for many duration distributions, this suffices to improve exploration on a large set of domains. Interestingly, a class of distributions inspired by ecological models of animal foraging behaviour yields particularly strong performance.

1. INTRODUCTION

Exploration is widely regarded as one of the most important open problems in reinforcement learning (RL). The problem has been theoretically analyzed under simplifying assumptions, providing reassurance and motivating the development of algorithms (Brafman and Tennenholtz, 2002; Asmuth et al., 2009; Azar, Osband, and Munos, 2017). Recently, there has been considerable progress on the empirical side as well, with new methods that work in combination with powerful function approximators to perform well on challenging large-scale exploration problems (Bellemare et al., 2016; Ostrovski et al., 2017; Burda et al., 2018; Badia et al., 2020b). Despite all of the above, the most commonly used exploration strategies are still simple methods like ε-greedy, Boltzmann exploration and entropy regularization (Peters, Mülling, and Altun, 2010; Sutton and Barto, 2018). This is true for both work of a more investigative nature (Mnih et al., 2015) and practical applications (Levine et al., 2016; Li et al., 2019). In particular, many recent successes of deep RL, from data-center cooling to Atari game playing, rely heavily upon these simple exploration strategies (Mnih et al., 2015; Lazic et al., 2018; Kapturowski et al., 2019).

Why does the RL community continue to rely on such naive exploration methods? There are several possible reasons. First, principled methods usually do not scale well. Second, the exploration problem is often formulated as a separate problem whose solution itself involves quite challenging steps. Moreover, besides having very limited theoretical grounding, practical methods are often complex and have significantly poorer performance outside the small set of domains they were specifically designed for. This last point is essential, as an effective exploration method must be generally applicable.
Naive exploration methods like ε-greedy, Boltzmann exploration and entropy regularization are general because they do not make strong assumptions about the underlying domain. As a consequence, they are also simple, not requiring much implementation effort or per-domain tuning. This makes them appealing alternatives even when they are not as efficient as some more complex variants. Perhaps there is a middle ground between simple yet inefficient exploration strategies and more complex, though efficient, methods. The method we propose in this paper represents such a compromise.

We ask the following question: how can we deviate minimally from the simple exploration strategies adopted in practice and still get clear benefits? In more pragmatic terms, we want a simple-to-implement algorithm that can be used in place of naive methods and lead to improved exploration. In order to achieve our goal we propose a method that can be seen as a generalization of ε-greedy, perhaps the simplest and most widely adopted exploration strategy. As is well known, the ε-greedy algorithm selects an exploratory action uniformly at random with probability ε at each time step. Besides its simplicity, ε-greedy exploration has two properties that contribute to its universality: (1) It is stationary, i.e. its mechanics do not depend on learning progress. Stationarity is important for stability, since an exploration strategy interacting with the agent's learning dynamics results in circular dependencies that can in turn limit exploration progress. In simple terms: bad exploratory decisions can hurt the learned policy, which can lead to more bad exploration. (2) It provides full coverage of the space of possible trajectories. All sequences of states, actions and rewards are possible under ε-greedy exploration, albeit some with exceedingly small probability. This guarantees, at least in principle, that no solutions are excluded from consideration.
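For reference, the mechanics just described fit in a few lines. The following is a generic sketch (the function name and example values are ours, not from the paper):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniform-random action,
    otherwise act greedily with respect to q_values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # exploratory step
    return int(np.argmax(q_values))              # greedy step

# Example: with epsilon = 0.1 the greedy action dominates, but every
# action keeps non-zero probability at every step (full coverage).
rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.2])
actions = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
```

Note that the selection rule depends only on the current action values, not on any learning statistics, which is the stationarity property discussed above.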
Convergence results for RL algorithms rely on this sort of guarantee (Singh et al., 2000). This may also explain why sophisticated exploration methods often fall back on ε-greedy exploration (Bellemare et al., 2016). However, ε-greedy in its original form also comes with drawbacks. Since it does not explore persistently, the likelihood of deviating more than a few steps off the default trajectory is vanishingly small. This can be thought of as an inductive bias (or "prior") that favors transitions that are likely under the policy being learned (it might be instructive to think of a neighbourhood around the associated stationary distribution). Although this is not necessarily bad, it is not difficult to think of situations in which such an inductive bias may hinder learning. For example, it may be very difficult to move away from a local maximum if doing so requires large deviations from the current policy. The issue above arises in part because ε-greedy provides little flexibility to adjust the algorithm's inductive bias to the peculiarities of a given problem. By tuning the algorithm's only parameter, ε, one can make deviations more or less likely, but the nature of such deviations is not modifiable. To see this, note that all sequences of exploratory actions are equally likely under ε-greedy, regardless of the specific value used for ε. This leads to a coverage of the state space that is largely defined by the current ("greedy") policy and the environment dynamics (see Figure 1 for an illustration). In this paper we present an algorithm that retains the beneficial properties of ε-greedy while at the same time allowing for more control over the nature of the induced exploratory behavior. In order to achieve this, we propose a small modification to ε-greedy: we replace actions with temporally-extended sequences of actions, or options (Sutton, Precup, and Singh, 1999). Options then become a mechanism to modulate the inductive bias associated with ε-greedy.
We discuss how, by appropriately defining a set of options, one can "align" the exploratory behavior of ε-greedy with a given environment or class of environments; we then show how a very simple set of domain-agnostic options works surprisingly well across a variety of well-known environments.
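The modification sketched above, repeating a sampled exploratory action for a random duration, can be illustrated as follows. This is a minimal sketch under assumed parameters: the heavy-tailed zeta (Zipf) duration distribution with exponent μ = 2 echoes the foraging-inspired distributions mentioned in the abstract, but the class name and the specific values are ours, not the paper's:

```python
import numpy as np

class TemporallyExtendedEpsilonGreedy:
    """Sketch of temporally-extended epsilon-greedy exploration: when an
    exploratory action is sampled, it is repeated for a random number of
    steps instead of a single step. The duration distribution (zeta with
    exponent mu) and all default values are illustrative assumptions."""

    def __init__(self, num_actions, epsilon=0.1, mu=2.0, seed=0):
        self.num_actions = num_actions
        self.epsilon = epsilon
        self.mu = mu                  # zeta exponent (assumed value)
        self.rng = np.random.default_rng(seed)
        self.steps_left = 0           # remaining repeats of current option
        self.option_action = None

    def act(self, q_values):
        if self.steps_left > 0:       # keep executing the current option
            self.steps_left -= 1
            return self.option_action
        if self.rng.random() < self.epsilon:
            # Start a new exploratory option: random action, random duration
            # sampled from a heavy-tailed distribution (zipf returns n >= 1).
            self.steps_left = int(self.rng.zipf(self.mu)) - 1
            self.option_action = int(self.rng.integers(self.num_actions))
            return self.option_action
        return int(np.argmax(q_values))  # greedy step
```

Setting the duration to a constant 1 recovers ordinary ε-greedy, so the sketch is a strict generalization; the heavy tail of the zeta distribution occasionally produces long persistent runs that carry the agent far from the greedy trajectory.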

2. BACKGROUND AND NOTATION

Reinforcement learning can be set within the Markov Decision Process (MDP) formalism (Puterman, 1994). An MDP M is defined by the tuple (X, A, P, R, γ), where x ∈ X is a state in the state space, a ∈ A is an action in the action space, P(x′ | x, a) is the probability of transitioning from state x to state x′ after taking action a, R : X × A → R is the reward function and γ ∈ [0, 1) is the discount factor. Let P(A) denote the space of probability distributions over actions; then, a policy π : X → P(A) assigns some probability to each action conditioned on a given state. We will denote by π_a = 1_a the policy which takes action a deterministically in every state. The agent attempts to learn a policy π that maximizes the expected return or value in a given state, V^π(x) = E_{A∼π}[Q^π(x, A)] = E_π[∑_{t=0}^∞ γ^t R(X_t, A_t) | X_0 = x], where V^π and Q^π are the value and action-value functions of π. The greedy policy for action-value function Q takes the action arg max_{a∈A} Q(x, a), ∀x ∈ X. In this work we primarily rely upon methods based on the Q-learning algorithm (Watkins and Dayan, 1992), which attempts to learn the

