TEMPORALLY-EXTENDED ε-GREEDY EXPLORATION

Abstract

Recent work on exploration in reinforcement learning (RL) has led to a series of increasingly complex solutions to the problem. This increase in complexity often comes at the expense of generality. Recent empirical studies suggest that, when applied to a broader set of domains, some sophisticated exploration methods are outperformed by simpler counterparts, such as ε-greedy. In this paper we propose an exploration algorithm that retains the simplicity of ε-greedy while reducing dithering. We build on a simple hypothesis: the main limitation of ε-greedy exploration is its lack of temporal persistence, which limits its ability to escape local optima. We propose a temporally extended form of ε-greedy that simply repeats the sampled action for a random duration. It turns out that, for many duration distributions, this suffices to improve exploration on a large set of domains. Interestingly, a class of distributions inspired by ecological models of animal foraging behaviour yields particularly strong performance.

1. INTRODUCTION

Exploration is widely regarded as one of the most important open problems in reinforcement learning (RL). The problem has been theoretically analyzed under simplifying assumptions, providing reassurance and motivating the development of algorithms (Brafman and Tennenholtz, 2002; Asmuth et al., 2009; Azar, Osband, and Munos, 2017). Recently, there has been considerable progress on the empirical side as well, with new methods that work in combination with powerful function approximators to perform well on challenging large-scale exploration problems (Bellemare et al., 2016; Ostrovski et al., 2017; Burda et al., 2018; Badia et al., 2020b). Despite all of the above, the most commonly used exploration strategies are still simple methods like ε-greedy, Boltzmann exploration and entropy regularization (Peters, Mulling, and Altun, 2010; Sutton and Barto, 2018). This is true for both work of a more investigative nature (Mnih et al., 2015) and practical applications (Levine et al., 2016; Li et al., 2019). In particular, many recent successes of deep RL, from data-center cooling to Atari game playing, rely heavily upon these simple exploration strategies (Mnih et al., 2015; Lazic et al., 2018; Kapturowski et al., 2019). Why does the RL community continue to rely on such naive exploration methods? There are several possible reasons. First, principled methods usually do not scale well. Second, the exploration problem is often formulated as a separate problem whose solution itself involves quite challenging steps. Moreover, besides having very limited theoretical grounding, practical methods are often complex and have significantly poorer performance outside a small set of domains they were specifically designed for. This last point is essential, as an effective exploration method must be generally applicable.
Naive exploration methods like ε-greedy, Boltzmann exploration and entropy regularization are general because they do not make strong assumptions about the underlying domain. As a consequence, they are also simple, not requiring too much implementation effort or per-domain tuning. This makes them appealing alternatives even when they are not as efficient as some more complex variants. Perhaps there is a middle ground between simple yet inefficient exploration strategies and more complex, though efficient, methods. The method we propose in this paper represents such a compromise. We ask the following question: how can we deviate minimally from the simple exploration strategies adopted in practice and still get clear benefits? In more pragmatic terms, we want a simple-to-implement algorithm that can be used in place of naive methods and lead to improved exploration. In order to achieve our goal we propose a method that can be seen as a generalization of ε-greedy, perhaps the simplest and most widely adopted exploration strategy. As is well known, the ε-greedy algorithm selects an exploratory action uniformly at random with probability ε at each time step. Besides its simplicity, ε-greedy exploration has two properties that contribute to its universality: (1) It is stationary, i.e. its mechanics do not depend on learning progress. Stationarity is important for stability, since an exploration strategy interacting with the agent's learning dynamics results in circular dependencies that can in turn limit exploration progress. In simple terms: bad exploratory decisions can hurt the learned policy, which can lead to more bad exploration. (2) It provides full coverage of the space of possible trajectories. All sequences of states, actions and rewards are possible under ε-greedy exploration, albeit some with exceedingly small probability. This guarantees, at least in principle, that no solutions are excluded from consideration.
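For reference, the ε-greedy rule described above can be sketched in a few lines (a minimal illustration; the Q-value list, function name, and tie-breaking choice are assumptions, not from the paper):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, pick an action uniformly at random;
    otherwise pick the greedy (argmax) action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # greedy action, breaking ties by first occurrence
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

For example, `epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)` always returns the greedy action, index 1, while `epsilon=1.0` yields a uniformly random action.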
Convergence results for RL algorithms rely on this sort of guarantee (Singh et al., 2000). This may also explain sophisticated exploration methods' use of ε-greedy exploration (Bellemare et al., 2016). However, ε-greedy in its original form also comes with drawbacks. Since it does not explore persistently, the likelihood of deviating more than a few steps off the default trajectory is vanishingly small. This can be thought of as an inductive bias (or "prior") that favors transitions that are likely under the policy being learned (it might be instructive to think of a neighbourhood around the associated stationary distribution). Although this is not necessarily bad, it is not difficult to think of situations in which such an inductive bias may hinder learning. For example, it may be very difficult to move away from a local maximum if doing so requires large deviations from the current policy. The issue above arises in part because ε-greedy provides little flexibility to adjust the algorithm's inductive bias to the peculiarities of a given problem. By tuning the algorithm's only parameter, ε, one can make deviations more or less likely, but the nature of such deviations is not modifiable. To see this, note that all sequences of exploratory actions are equally likely under ε-greedy, regardless of the specific value used for ε. This leads to a coverage of the state space that is largely defined by the current ("greedy") policy and the environment dynamics (see Figure 1 for an illustration). In this paper we present an algorithm that retains the beneficial properties of ε-greedy while at the same time allowing for more control over the nature of the induced exploratory behavior. In order to achieve this, we propose a small modification to ε-greedy: we replace actions with temporally-extended sequences of actions, or options (Sutton, Precup, and Singh, 1999). Options then become a mechanism to modulate the inductive bias associated with ε-greedy.
We discuss how, by appropriately defining a set of options, one can "align" the exploratory behavior of ε-greedy with a given environment or class of environments; we then show how a very simple set of domain-agnostic options works surprisingly well across a variety of well-known environments.

2. BACKGROUND AND NOTATION

Reinforcement learning can be set within the Markov Decision Process (MDP) formalism (Puterman, 1994). An MDP M is defined by the tuple (X, A, P, R, γ), where x ∈ X is a state in the state space, a ∈ A is an action in the action space, P(x′ | x, a) is the probability of transitioning from state x to state x′ after taking action a, R : X × A → R is the reward function and γ ∈ [0, 1) is the discount factor. Let P(A) denote the space of probability distributions over actions; then, a policy π : X → P(A) assigns some probability to each action conditioned on a given state. We will denote by π_a = 1_a the policy which takes action a deterministically in every state. The agent attempts to learn a policy π that maximizes the expected return or value in a given state, V^π(x) = E_{A∼π}[Q^π(x, A)] = E_π[∑_{t=0}^∞ γ^t R(X_t, A_t) | X_0 = x], where V^π, Q^π are the value and action-value functions of π. The greedy policy for action-value function Q takes the action arg max_{a∈A} Q(x, a), ∀x ∈ X. In this work we primarily rely upon methods based on the Q-learning algorithm (Watkins and Dayan, 1992), which attempts to learn the optimal action-value function as the fixed point of

Q(X_t, A_t) = R(X_t, A_t) + γ max_{a∈A} Q(X_{t+1}, a).  (1)

[Figure 1: exploratory behaviour of (a) ε-greedy and (b) temporally-extended ε-greedy, for ε ∈ {0.1, 0.5, 0.9, 1.0}.]

In practice, the state space X is often too large to represent exactly and thus we have Q_θ(x, a) ≈ Q(x, a) for a function approximator parameterized by θ. We will generally use some form of differentiable function approximator Q_θ, whether it be linear in a fixed set of basis functions, or an artificial neural network. We update parameters θ to minimize a squared or Huber loss between the left- and right-hand sides of equation 1, with the right-hand side held fixed (Mnih et al., 2015). In addition to function approximation, it has been argued that in order to scale to large problems, RL agents should be able to reason at multiple temporal scales (Dayan and Hinton, 1993; Parr and Russell, 1998; Sutton, Precup, and Singh, 1999; Dietterich, 2000).
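As a concrete tabular special case of the Q-learning update behind equation 1, the sketch below moves Q(x, a) a step toward the fixed target (the step size `alpha` and the list-of-lists table representation are illustrative assumptions, not from the paper):

```python
def q_learning_update(Q, x, a, r, x_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: the target r + gamma * max_a' Q[x'][a']
    (the right-hand side of equation 1) is held fixed while Q[x][a]
    moves a step of size alpha toward it."""
    target = r + gamma * max(Q[x_next])
    Q[x][a] += alpha * (target - Q[x][a])
    return Q
```

With function approximation, the same fixed target instead feeds a squared or Huber loss on the parameters θ, as described in the text.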
One way to model temporal abstraction is via options (Sutton, Precup, and Singh, 1999), i.e. temporally-extended courses of action. In the most general formulation, an option can depend on the entire history between its initiation time step t and the current time step t + k, h_{t:t+k} ≡ x_t a_t x_{t+1} ... a_{t+k−1} x_{t+k}. Let H be the space of all possible histories; a semi-Markov option is a tuple ω ≡ (I_ω, π_ω, β_ω), where I_ω ⊆ X is the set of states where the option can be initiated, π_ω : H → P(A) is a history-dependent policy, and β_ω : H → [0, 1] gives the probability that the option terminates after observing some history (Sutton, Precup, and Singh, 1999). As in this work we will use options for exploration, we will assume that I_ω = X, ∀ω. Once an option ω is selected, the agent takes actions a ∼ π_ω(· | h) after having observed history h ∈ H, and at each step terminates the option with probability β_ω(h). It is worth emphasizing that semi-Markov options depend on the history since their initiation, but not before. Also, they are usually defined with respect to a statistic of histories h ∈ H; for example, by looking at the length of h one can define an option that terminates after a fixed number of steps.
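The option tuple just defined can be sketched directly in code. This minimal illustration assumes I_ω = X as in the text, encodes the history since initiation as a plain list, and includes the fixed-duration example from the end of the paragraph (all names are illustrative choices, not the paper's):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Option:
    """A semi-Markov option with I_omega = X (initiable everywhere):
    both the policy and the termination condition see only the
    history since initiation."""
    policy: Callable[[List[int]], int]         # history -> action
    termination: Callable[[List[int]], float]  # history -> P(terminate)

def repeat_option(action: int, n: int) -> Option:
    """Repeat `action` and terminate once the history holds n steps:
    a termination rule depending only on len(h), i.e. on a statistic
    of the history, as in the text's example."""
    return Option(policy=lambda h: action,
                  termination=lambda h: 1.0 if len(h) >= n else 0.0)
```

For instance, `repeat_option(2, 3)` always emits action 2 and terminates with probability 1 once three steps have accumulated.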

3. EXPLORATION IN REINFORCEMENT LEARNING

At its core, RL presents the twin challenges of temporal credit assignment and exploration. The agent must accurately, and efficiently, assign credit to past actions for their role in achieving some long-term return. However, to continue improving the policy, it must also consider behaviours it estimates to be sub-optimal. This leads to the well-known exploration-exploitation trade-off. Because of its central importance in RL, exploration has been among the most studied topics in the field. In finite state-action spaces, the theoretical limitations of exploration, with respect to sample complexity bounds, are fairly well understood (Azar, Osband, and Munos, 2017; Dann, Lattimore, and Brunskill, 2017). However, these results are of limited practical use for two reasons. First, they bound sample complexity by the size of the state-action space and horizon, which makes their immediate application in large-scale or continuous state problems difficult. Second, these algorithms tend to be designed based on worst-case scenarios, and can be inefficient on problems of actual interest. Bayesian RL methods for exploration address the explore-exploit problem integrated with the estimation of the value-function itself (Kolter and Ng, 2009). Generally such methods strongly depend upon the quality of their priors, which can be difficult to set appropriately. Thompson sampling methods (Thompson, 1933; Osband, Russo, and Van Roy, 2013) estimate the posterior distribution of value-functions, and act greedily according to a sample from this distribution. As with other methods which integrate learning and exploration into a single estimation problem, this creates non-stationary, but temporally persistent, exploration.
Other examples of this type of exploration strategy include randomized prior functions (Osband, Aslanides, and Cassirer, 2018), uncertainty Bellman equations (O'Donoghue et al., 2018), NoisyNets (Fortunato et al., 2017), and successor uncertainties (Janz et al., 2019). Although quite different from each other, they share key commonalities: non-stationary targets, temporal persistence, and exploration based on the space of value functions. At the other end of the spectrum, there have recently been successful attempts to design algorithms with specific problems of interest in mind. Certain games from the Atari-57 benchmark (e.g. MONTEZUMA'S REVENGE, PITFALL!, PRIVATE EYE) have been identified as 'hard exploration games' (Bellemare et al., 2016), attracting the attention of the research community and leading to significant progress in terms of performance (Ecoffet et al., 2019; Burda et al., 2018). On the downside, these results have usually been achieved by algorithms with little or no theoretical grounding, adopting specialized inductive biases, such as density modeling of images (Bellemare et al., 2016; Ostrovski et al., 2017), error-seeking intrinsic rewards (Pathak et al., 2017; Badia et al., 2020a), or perfect deterministic forward-models (Ecoffet et al., 2019). Generally, such algorithms are evaluated only on the very domains they are designed to perform well on, raising questions of generality. Recent empirical analysis showed that some of these methods perform similarly to each other on hard exploration problems and significantly under-perform ε-greedy otherwise (Ali Taïga et al., 2020). One explanation is that complex algorithms tend to be more brittle and harder to reproduce, leading to lower than expected performance in follow-on work. However, these results also suggest that much of the recent work on exploration is over-fitting to a small number of domains.

4. TEMPORALLY-EXTENDED EXPLORATION

There are many ways to think about exploration: curiosity, experimentation, reducing uncertainty, etc. Consider viewing exploration as a search for undiscovered rewards or shorter paths to known rewards. In this context, the behaviour of ε-greedy appears shortsighted because the probability of moving consistently in any direction decays exponentially with the number of exploratory steps. In Figure 1a we visualize the behaviour of uniform ε-greedy in an open gridworld, where the agent starts at the center-top and the greedy policy moves straight down. Observe that for ε ≤ 0.5 the agent is exceedingly unlikely to reach states outside a narrow band around the greedy policy. Even the purely exploratory policy (ε = 1.0) requires a large number of steps to visit the bottom corners of the grid. This is because, under the uniform policy, the probability of moving consistently in any direction decays exponentially (see Figure 1a). By contrast, a method that explores persistently with a directed policy leads to more efficient exploration of the space at various values of ε (Figure 1b). The importance of temporally-extended exploration has been previously highlighted (Osband et al., 2016), and in general, count-based (Bellemare et al., 2016) or curiosity-based (Burda et al., 2018) exploration methods are inherently temporally-extended due to integrating exploration and exploitation into the greedy policy. Here our goal is to leverage the benefits of temporally-extended exploration without modifying the greedy policy. There has been a wealth of research on learning options (McGovern and Barto, 2001; Stolle and Precup, 2002; Şimşek and Barto, 2004; Bacon, Harb, and Precup, 2017; Harutyunyan et al., 2019), and specifically for exploration (Machado, Bellemare, and Bowling, 2017; Machado et al., 2018b; Jinnai et al., 2019; 2020; Hansen et al., 2020).
These methods use options for exploration and to augment the action-space, adding learned options to the actions available at states where they can be initiated. In the remainder of this work, we argue for temporally-extended exploration, using options to encode a set of inductive biases to improve sample-efficiency. This fundamental message is found throughout the existing work on exploration with options, but producing algorithms that are empirically effective on large environments remains a challenge for the field. In the next section, we discuss in more detail how the options' policy π_ω and termination β_ω can be used to induce different types of exploration. Temporally-Extended ε-Greedy A temporally-extended ε-greedy exploration strategy depends on choosing an exploration probability ε, a set of options Ω, and a sampling distribution p with support Ω. On each step the agent follows the current policy π for one step with probability 1 − ε, or with probability ε samples an option ω ∼ p(Ω) and follows it until termination. Standard ε-greedy has three desirable properties that help explain its wide adoption in practice: it is simple, stationary, and promotes full coverage of the state-action space in the limit (guaranteeing convergence to the optimal policy under the right conditions). We now discuss to what extent the proposed algorithm retains these properties. Although somewhat subjective, it seems fair to call temporally-extended ε-greedy a simple method. It is also stationary when the set of options Ω and distribution p are fixed, for in this case its mechanics are not influenced by the collected data. Finally, it is easy to define conditions under which temporally-extended ε-greedy covers the entire state-action space, as we discuss next. Obviously, the exploratory behavior of temporally-extended ε-greedy will depend on the set of options Ω.
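The sampling scheme just described can be sketched with the simplest option set, actions repeated for a sampled duration, as in the abstract. This is a hedged illustration, not the paper's implementation: `duration_dist`, the fixed `greedy_action` stand-in for the current policy, and all names are assumptions:

```python
import random

def ez_greedy_actions(greedy_action, n_actions, epsilon,
                      duration_dist, rng, steps):
    """Temporally-extended epsilon-greedy with action-repeat options:
    with prob. 1 - epsilon follow the greedy policy for one step; with
    prob. epsilon sample a uniform action and an option duration, then
    repeat that action until the option terminates."""
    remaining, repeated = 0, None
    for _ in range(steps):
        if remaining > 0:               # an exploration option is active
            remaining -= 1
            yield repeated
        elif rng.random() < epsilon:    # start a new repeat option
            remaining = duration_dist(rng) - 1
            repeated = rng.randrange(n_actions)
            yield repeated
        else:
            yield greedy_action
```

With a constant duration of 1 this reduces to standard ε-greedy; the abstract suggests heavy-tailed duration distributions (e.g. those inspired by foraging models) as a particularly effective choice for `duration_dist`.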
Ideally we want all actions a ∈ A to have a nonzero probability of being executed in all states x ∈ X regardless of the greedy policy π. This is clearly not the case for all sets Ω. In fact, this may not be the case even if for all (x, a) ∈ X × A there is an option ω ∈ Ω such that π_ω(a | hx) > 0, where hx represents all histories ending in x. To see why, note that, given a fixed Ω and ε > 0, it may be impossible for an option ω ∈ Ω to be "active" in state x (that is, either start at or visit x). For example, if all options in Ω terminate after a fixed number of steps that is a multiple of k, temporally-extended ε-greedy with ε = 1 will only visit states of a unidirectional chain whose indices are also multiples of k. Perhaps even subtler is that, even if all options can be active at state x, the histories hx ∈ H associated with a given action a may themselves not be realizable under the combination of Ω and the current π. It is clear then that the coverage ability of temporally-extended ε-greedy depends on the interaction between π, Ω, ε, and the dynamics P(· | x, a) of the MDP. One way to reason about this is to consider that, once fixed, these elements induce a stochastic process which in turn gives rise to a well-defined distribution over the space of histories H. Property 1 (Full coverage). Let M be the space of all MDPs with common state-action spaces X, A, and Ω a set of options defined over this state-action space. Then, Ω has full coverage for M if, ∀M ∈ M, ε > 0, and π, the semi-Markov policy µ := (1 − ε)π + ε π_ω, where ω is a random variable uniform over Ω, visits every state-action pair with non-zero probability. Note that µ is itself a random variable and not an average policy. We can then look for simple conditions that would lead to Ω having Property 1. For example, if the options' policies only depend on the last state of the history, π_ω(· | hx) = π_ω(· | x) (i.e.
they are Markov, rather than semi-Markov policies), we can get the desired coverage by having π_ω(a | x) > 0 for all x ∈ X and all a ∈ A. The coverage of X × A also trivially follows from having all primitive actions a ∈ A as part of Ω. Note that if the primitive actions are the only elements of Ω we recover standard ε-greedy, and thus coverage of X × A. Of course, in these and similar cases, temporally-extended ε-greedy allows for convergence to the optimal policy under the same conditions as its precursor. This view of temporally-extended ε-greedy, as inducing a stochastic process, also helps us to understand its differences with respect to its standard counterpart. Since the induced stochastic process defines a distribution over histories, we can also talk about distributions over sequences of actions. With standard ε-greedy, every sequence of k exploratory actions has a probability of occurrence of exactly (ε/|A|)^k, where |A| is the size of the action space. By changing ε one can uniformly change the probabilities of all length-k sequences of actions, but no sequence can be favored over the others. Temporally-extended ε-greedy provides this flexibility through the definition of Ω; specifically, by defining the appropriate set of options one can control the temporal correlation between actions. This makes it possible to control how quickly the algorithm converges, as we discuss next. Efficient exploration For sample-efficiency we want to cover the state-action space quickly. Definition 1. The cover time of an RL algorithm is the number of steps needed to visit all state-action pairs at least once with probability 0.5, starting from the initial state distribution. Even-Dar and Mansour (2003) show that the sample efficiency of Q-learning can be bounded in terms of the cover time of the exploratory policy used. Liu and Brunskill (2018) provide an upper bound for the cover time of a random exploratory policy based on properties of the MDP.
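The (ε/|A|)^k probability of a specific exploratory sequence under standard ε-greedy can be computed directly; the numbers below are purely illustrative:

```python
def seq_probability(epsilon, n_actions, k):
    """Probability that one specific length-k sequence of exploratory
    actions occurs under standard epsilon-greedy: (epsilon / |A|)^k."""
    return (epsilon / n_actions) ** k
```

For example, with |A| = 4 actions and ε = 0.1, any particular run of five consecutive exploratory steps has probability (0.1/4)^5 ≈ 1e-8, which is why long consistent deviations are vanishingly rare.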
Putting these results together, we have the characterization of a class of MDPs for which Q-learning plus ε-greedy exploration is sample efficient (that is, it converges in polynomial time). Normally, the efficiency of ε-greedy Q-learning is completely determined by the MDP: given an MDP, either the algorithm is efficient or it is not. We now discuss how, by replacing ε-greedy exploration with its temporally-extended counterpart, we can have efficient exploration on a much broader class of MDPs. To understand why this is so, note that the definition of the set of options Ω can be seen as the definition of a new MDP in which histories play the role of states and options play the role of actions.

[Figure: panels (a) and (b), Chain: a chain MDP with states 0, 1, 2, ..., n and actions a_0, a_1.]
Q-Learning < l a t e x i t s h a 1 _ b a s e 6 4 = " J N b w M 1 y e e / E 6 E k 3 J Y 2 F c T c l s + d 8 = " > A A A C B X i c b V A 9 S w N B E N 3 z M 8 a v q K U W i 0 G I h e F O A o p V M I 2 F R Q L m A 5 I Q 9 j a T Z H F v 7 9 i d E 8 O R x s a / Y m O h i K 3 / w c 5 / 4 + a j 0 O i D g c d 7 M 8 z M 8 y M p D L r u l 7 O w u L S 8 s p p a S 6 9 v b G 5 t Z 3 Z 2 a y a M N Y c q D 2 W o G z 4 z I I W C K g q U 0 I g 0 s M C X U P d v S 2 O / f g f a i F D d 4 D C C d s D 6 S v Q E Z 2 i l T u a g h X C P S c 4 / p q U B E + q C V k 6 u g W k l V H / U y W T d v D s B / U u 8 G c m S G c q d z G e r G / I 4 A I V c M m O a n h t h O 2 E a B Z c w S r d i A x H j t 6 w P T U s V C 8 C 0 k 8 k X I 3 p k l S 7 t h d q W Q j p R f 0 4 k L D B m G P i 2 M 2 A 4 M P P e W P z P a 8 b Y O 2 8 n Q k U x g u L T R b 1 Y U g z p O B L a F R o 4 y q E l j G t h b 6 V 8 w D T j a I N L 2 x C 8 + Z f / k t p p 3 i v k C 5 V C t n g 5 i y N F 9 s k h y R G P n J E i u S J l U i W c P J A n 8 k J e n U f n 2 X l z 3 q e t C 8 5 s Z o / 8 g v P x D Q A x l 6 E = < / l a t e x i t > (b) Atari: R2D2 < l a t e x i t s h a 1 _ b a s e 6 4 = " q y t y 9 C R 5 j u B c w 9 o D w 1 L U U X 5 l E 1 0 = " > A A A B / 3 i c b V D J S g N B E O 1 x j X E b F b x 4 a Q x C v I S Z E F A 8 x e X g M Y p Z I B l C T 6 c n a d L T M 3 T X i G H M w V / x 4 k E R r / 6 G N / / G z n L Q x A c F j / e q q K r n x 4 J r c J x v a 2 F x a X l l N b O W X d / Y 3 N q 2 d 3 Z r O k o U Z V U a i U g 1 f K K Z 4 J J V g Y N g j V g x E v q C 1 f 3 + 5 c i v 3 z O l e S T v Y B A z L y R d y Q N O C R i p b e + 3 g D 1 A m v e P 8 T k Q x c / w b f G q O G z b O a f g j I H n i T s l O T R F p W 1 / t T o R T U I m g Q q i d d N 1 Y v B S o o B T w Y b Z V q J Z T G i f d F n T U E l C p r 1 0 f P 8 Q H x m l g 4 N I m Z K A x + r v i Z S E W g 9 C 3 3 S G B H p 6 1 h u J / 3 n N B I J T L + U y T o B J O l k U J A J D h E d h 4 A 5 X j I I Y G E K o 4 u Z W T H t E E Q o m s q w J w Z 1 9 e Z 
7 U i g W 3 V C j d l H L l i 2 k c G X S A D l E e u e g E l d E 1 q q A q o u g R P a N X 9 G Y 9 W S / W u / U x a V 2 w p j N 7 6 A + s z x / r k p S / < / l a t e x i t > Zeta < l a t e x i t s h a 1 _ b a s e 6 4 = " r u G M l f w A N 5 c g x k l H V p P G l j H D O L M = " > A A A B 8 n i c b V B N S 8 N A E N 3 U r 1 q / q h 6 9 L B b B U 0 m k o M e i F 4 8 V 7 A e 2 o W y 2 k 3 b p Z h N 2 J 2 I J / R l e P C j i 1 V / j z X / j t s 1 B W x 8 M P N 6 b Y W Z e k E h h 0 H W / n c L a + s b m V n G 7 t L O 7 t 3 9 Q P j x q m T j V H J o 8 l r H u B M y A F A q a K F B C J 9 H A o k B C O x j f z P z 2 I 2 g j Y n W P k w T 8 i A 2 V C A V n a K V u D + E J s w d A N u 2 X K 2 7 V n Y O u E i 8 n F Z K j 0 S 9 / 9 Q Y x T y N Q y C U z p u u 5 C f o Z 0 y i 4 h G m p l x p I G B + z I X Q t V S w C 4 2 f z k 6 f 0 z C o D G s b a l k I 6 V 3 9 P Z C w y Z h I F t j N i O D L L 3 k z 8 z + u m G F 7 5 m V B J i q D 4 Y l G Y S o o x n f 1 P B 0 I D R z m x h H E t 7 K 2 U j 5 h m H G 1 K J R u C t / z y K m l d V L 1 a t X Z X q 9 S v 8 z i K 5 I S c k n P i k U t S J 7 e k Q Z q E k 5 g 8 k 1 f y 5 q D z 4 r w 7 H 4 v W g p P P H J M / c D 5 / A N G o k Z w = < / l a t e x i t > Uniform < l a t e x i t s h a 1 _ b a s e 6 4 = " p q z g i t 1 3 Q E 8 n I Y G i P 3 x L 9 + l 1 V O I = " > A A A B 9 X i c b V B N S 8 N A E J 3 4 W e t X 1 a O X Y B E 8 l U Q K e i x 6 8 V j B t I U 2 l s 1 2 2 y 7 d 3 Y T d i V p C / 4 c X D 4 p 4 9 b 9 4 8 9 + 4 b X P Q 1 g c D j / d m m J k X J Y I b 9 L x v Z 2 V 1 b X 1 j s 7 B V 3 N 7 Z 3 d s v H R w 2 T J x q y g I a i 1 i 3 I m K Y 4 I o F y F G w V q I Z k Z F g z W h 0 P f W b D 0 w b H q s 7 H C c s l G S g e J 9 T g l a 6 7 y B 7 w i y w S q z l p F s q e Hence, by appropriately defining Ω, we can have an MDP in which random exploration is efficient. 
We now formalize this notion by making explicit properties of Ω that lead to efficient exploration: x V v B n e Z + D k p Q 4 5 6 t / T V 6 c U 0 l U w h F c S Y t u 8 l G G Z E I 6 e C T Y q d 1 L C E 0 B E Z s L a l i k h m w m x 2 9 c Q 9 t U r P t Y t t K X R n 6 u + J j E h j x j K y n Z L g 0 C x 6 U / E / r 5 1 i / z L M u E p S Z I r O F / V T 4 W L s T i N w e 1 w z i m J s C a G a 2 1 t d O i S a U L R B F W 0 I / u L L y 6 R x X v G r l e p t t V y 7 y u M o w D G c w B n 4 c A E 1 u I E 6 B E B B w z O 8 w p v z 6 L w 4 7 8 7 H v H X F y W e O 4 A + c z x 9 H E Z M G < / l a t e x i t > Exponential < l a t e x i t s h a 1 _ b a s e 6 4 = " G P W G N s S g W w g i C b W 3 m N H 9 n K k R e a w = " > A A A B + 3 i c b V D L S g M x F M 3 U V 6 2 v s S 7 d B I v g q s x I Q Z d F E V x W s A 9 o h 5 J J M 2 1 o J h m S O 9 I y 9 F f c u F D E r T / i z r 8 x b W e h r Q c C h 3 P u 4 d 6 c M B H c g O d 9 O 4 W N z a 3 t n e J u a W / / 4 P D I P S 6 3 j E o 1 Z U 2 q h N K d k B g m u G R N 4 C B Y J 9 G M x K F g 7 X B 8 O / f b T 0 w b r u Q j T B M W x G Q o e c Q p A S v 1 3 X I P 2 A S y u 0 m i J J P A i Z j 1 3 Y p X 9 R b A 6 8 T P S Q X l a P T d r 9 5 A 0 T S 2 e S q I M V 3 f S y D I i A Z O B Z u V e q l h C a F j M m R d S y W J m Q m y x e 0 z f G 6 V A Y 6 U t k 8 C X q i / E x m J j Z n G o Z 2 M C Y z M q j c X / / O 6 K U T X Q c Z l k g K T d L k o S g U G h e d F 4 A H X j I K Y W k K o 5 v Z W T E d E E w q 2 r p I t w V / 9 8 j p p X V b 9 W r X 2 U K v U b / I 6 i u g U n a E L 5 K M r V E f 3 q I G a i K I J e k a v 6 M 2 Z O S / O u / O x H C 0 4 e e Y E / Y H z + Q P u h J U G < / l a t e x i t > Assumption 1. For an MDP M and set of options Ω, there exists n max ∈ N such that ∀x, y ∈ X , ∃ω ∈ Ω leading to E πω [t | x 0 = x, x t = y] ≤ n max . Theorem 1. For any irreducible MDP, let Ω be a set of options satisfying Assumption 1 with n max ≤ Θ(|X ||A|). 
Then, temporally-extended ε-greedy with a sampling distribution p satisfying 1/ρ(ω) ≤ Θ(|X||A|), ∀ω ∈ Ω, has polynomial sample complexity. In many cases it is easy to define options that satisfy Assumption 1, as we will discuss shortly. But even when this is not the case, one can learn options deliberately designed to have this property. For example, Jinnai et al. (2019; 2020) learn point-options (transitioning from one state to one other state) that explicitly minimize cover time. The approach proposed by Machado et al. (2017; 2018b) also leads to options with a small cover time. Alternatively, Whitney et al. (2020) learn an embedding of action sequences such that sequences with similar representations also have similar future state distributions. They observe that sampling uniformly in this abstracted action space yields action sequences whose future state distribution is nearly uniform over reachable states. We can interpret such an embedding, coupled with a corresponding decoder back into primitive actions, as an appealing approach to learning open-loop options for temporally-extended ε-greedy exploration. Next, we propose a concrete form of temporally-extended ε-greedy which requires neither learning Ω nor specific domain knowledge. These options encode a commonly held inductive bias: actions have (largely) consistent effects throughout the state-space.

εz-greedy. We begin with the options ω_a ≡ (X, π_a, β), where π_a(h) = 1_a and β(h) = 1 for all h ∈ H, and consider a single modification: temporal persistence. Let ω_{a,n} ≡ (X, π_a, β(h) = 1_{|h|=n}) be the option which takes action a for n steps and then terminates. Our proposed algorithm is to let Ω = {ω_{a,n}}_{a∈A, n≥1} and p be uniform over actions, with durations distributed according to some distribution z.
Intuitively, we are proposing the set of semi-Markov options made up of all "action-repeat" policies, for all combinations of actions and repeat durations, with a parametric sampling distribution over durations. This exploration algorithm is described by two parameters: ε, dictating when to explore, and z, dictating the degree of persistence. Notice that when z puts all its mass on n = 1, this is standard ε-greedy; more generally, this combination of distributions forms a composite distribution over durations with support {0, 1, 2, ...}, which is to say that with some probability the agent explores for n = 0 steps, corresponding to following its usual policy, and for all other n > 0 the agent explores following an action-repeat policy. A natural question arises: what distribution over durations should we use? To help motivate this question, and to understand the desirable characteristics, consider Figure 2, which shows a modified chain MDP with two actions. Taking the 'down' action immediately terminates the episode with the specified reward, whereas taking the 'right' action progresses to the next state in the chain. As in other chain-like exploration MDPs, ε-greedy performs poorly here because the agent must move consistently in one direction for an arbitrary number of steps (determined by the discount) to reach the optimal reward. Instead, we consider the effects of three classes of duration distribution: exponential (z(n) ∝ λ^n), uniform (z(n) = 1_{n≤N}/N), and zeta (z(n) ∝ n^{−μ}). Figure 2b shows the average return achieved by these distributions as their hyper-parameters are varied. This problem illustrates that, without prior knowledge of the MDP, it is important to support long durations, for example with a heavy-tailed distribution such as the zeta distribution. Why not simply use a uniform distribution with an extremely large support? Doing so would effectively force 'pure' exploration without any exploitation, because this form of ballistic exploration would simply continue exploring indefinitely.
Indeed, we can see in Figure 2 that the uniform distribution leads to poor performance (the same is true for the zeta distribution as μ → 1, which also leads to ballistic exploration). On the other hand, short durations lead to frequent switching and vanishingly small probabilities of ever reaching larger rewards. This trade-off leads to the existence of an optimal value of μ for the zeta distribution, which can vary by domain (Humphries et al., 2010) and is illustrated by the inverted U-curve in Figure 2. A class of ecological models for animal foraging known as Lévy flights follows a similar pattern: choosing a direction uniformly at random, then following that direction for a duration sampled from a heavy-tailed distribution. Under certain conditions, this has been shown to be an optimal foraging strategy, a form of exploration for a food source of unpredictable location (Viswanathan et al., 1996; 1999). In particular, a value of μ = 2 has consistently been found best for modeling animal foraging, and it also performed best in our hyper-parameter sweep. Thus, in the remainder of this work we use the zeta distribution with μ = 2 unless otherwise specified, and call this combination of a chance ε to explore and zeta-distributed durations εz-greedy exploration.
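To make this trade-off concrete, the following sketch compares how much probability mass each class of duration distribution places on long repeats. The parameter values (λ = 0.5, N = 30) and the truncation point are illustrative choices, not taken from the paper's experiments:

```python
N_MAX = 100_000  # truncation point approximating infinite support

def make_exponential(lam=0.5):
    norm = sum(lam ** n for n in range(1, N_MAX + 1))
    return lambda n: (lam ** n) / norm

def make_uniform(N=30):
    return lambda n: 1.0 / N if n <= N else 0.0

def make_zeta(mu=2.0):
    norm = sum(n ** -mu for n in range(1, N_MAX + 1))
    return lambda n: (n ** -mu) / norm

def tail_mass(pmf, n_from):
    """P(duration >= n_from) under the given pmf."""
    return sum(pmf(n) for n in range(n_from, N_MAX + 1))

for name, pmf in [("exponential", make_exponential()),
                  ("uniform", make_uniform()),
                  ("zeta", make_zeta())]:
    print(f"{name:12s} P(n >= 100) = {tail_mass(pmf, 100):.2e}")
```

Only the zeta distribution retains non-negligible mass on durations of 100 or more; the exponential tail is astronomically small, and the bounded uniform has none at all.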

5. EXPERIMENTAL RESULTS

We have emphasized the importance of simplicity, generality (via convergence guarantees), and stationarity of exploration strategies. We proposed a simple temporally-extended ε-greedy algorithm, εz-greedy, and saw that a heavy-tailed duration distribution yielded the best trade-off between temporal persistence and sample efficiency. In this section, we present empirical results in tabular, linear, and deep RL settings, pursuing two objectives. The first is to demonstrate the generality of our method by applying it across domains as well as across multiple value-based reinforcement learning algorithms (Q-learning, SARSA, Rainbow, R2D2). Second, we make the point that exploration comes at a cost, and that εz-greedy improves exploration with significantly less loss in efficiency on dense-reward domains compared with existing exploration algorithms.

Small-Scale: Tabular & Linear RL. We consider four small-scale environments (DeepSea, GridWorld, MountainCar, CartPole swingup-sparse), configured to be challenging sparse-reward exploration problems (Konidaris, Osentoski, and Thomas, 2011). In Figure 3 we present results comparing ε-greedy and εz-greedy on these four domains. Unless otherwise specified, the hyper-parameters and training settings for the two methods are identical. For each domain we show (i) learning curves of average return against training episodes, and average first-visit times on states during pure (ε = 1.0) exploration for (ii) ε-greedy and (iii) εz-greedy. The results show that εz-greedy provides significantly improved performance on these domains, and the first-visit times indicate significantly better state-space coverage compared to ε-greedy.

Atari-57: Deep RL. Motivated by the results in tabular and linear settings, we now turn to deep RL and evaluate performance on 57 Atari 2600 games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013).
To demonstrate the generality of the approach, we apply εz-greedy to two state-of-the-art deep RL agents, Rainbow (Hessel et al., 2018) and R2D2 (Kapturowski et al., 2019). We compare with baseline performance as well as with a recent intrinsic-motivation-based exploration algorithm in each: CTS-based pseudo-counts (Bellemare et al., 2016) in Rainbow and RND (Burda et al., 2018) in R2D2, each tuned for performance comparable with published results. Finally, in the R2D2 experiments we also compare with a Bootstrapped DQN version of R2D2 (Osband et al., 2016), providing an exploration baseline without intrinsic rewards. We include pseudo-code and hyper-parameters in the Appendix, though the implementation of εz-greedy is in each case trivial; hyper-parameters are mostly identical to previous work, and we fix μ = 2 for the results in this section. Our findings (see Figure 4) show that εz-greedy improves performance on the hard exploration tasks with little to no loss in performance on the rest of the suite. By comparison, we observe that the intrinsic motivation methods often (although not always) outperform εz-greedy on the hard exploration tasks, but at a significant loss of performance on the rest of the benchmark. The results in Figure 4 show the median human-normalized score over the 57 games and the human-gap, which measures how much the agent under-performs humans on average (see Appendix D for details). We take the median to indicate overall performance on the suite and the human-gap to illustrate gains on the hard exploration games where agents still under-perform humans, with full per-game and mean performance given in the Appendix. Table 1 gives the final performance of each of the agents in terms of these summary statistics. Figure 5 shows representative examples of per-game performance for the R2D2-based agents.
These per-game results make a strong point: even on the hard exploration games the inductive biases of intrinsic motivation methods may be poorly aligned, and outside a small number of games these methods significantly hurt performance, whereas our proposed method improves exploration while avoiding this significant loss elsewhere. To demonstrate that the effectiveness of our method does not crucially depend on evaluation in deterministic domains, in Appendix E we additionally show a similar comparison of the Rainbow-based agents on a stochastic variant of the Atari-57 benchmark using 'sticky actions' (Machado et al., 2018a). The results are qualitatively similar: while all agent variants do somewhat worse in the stochastic than in the deterministic case, εz-greedy improves over the baseline and ε-greedy on the hardest exploration domains while not substantially affecting performance on the others, coming out on top in terms of mean and median human-normalized performance as well as human-gap.

6. DISCUSSION AND CONCLUSIONS

We have proposed temporally-extended ε-greedy, a form of random exploration performed by sampling an option and following it until termination, with a simple instantiation which we call εz-greedy. We showed, across domains and algorithms spanning tabular, linear, and deep reinforcement learning, that εz-greedy improves exploration and performance in sparse-reward environments with minimal loss in performance on easier, dense-reward environments. Further, we showed that, compared with other exploration methods (pseudo-counts, RND, Bootstrap), εz-greedy has comparable performance averaged over the hard-exploration games in Atari, but without their significant loss in performance on the remaining games. Although action-repeats have been a part of deep RL algorithms since DQN, and have been considered as a type of option (Schoknecht and Riedmiller, 2002; 2003; Braylan et al., 2015; Lakshminarayanan, Sharma, and Ravindran, 2017; Sharma, Lakshminarayanan, and Ravindran, 2017), their use for exploration with sampled durations does not appear to have been studied before.

Generality and Limitations. Both ε- and εz-greedy are guaranteed to converge in the finite state-action case, but they place probability mass over exploratory trajectories very differently, thus encoding different inductive biases. We expect there to be environments where εz-greedy significantly under-performs ε-greedy. Indeed, these are easy to imagine: DeepSea with action effects randomized per-state (see Appendix Figure 14), GridWorld with many obstacles that immediately end the episode ('mines'), a maze changing direction every few steps, etc. More generally, the limitations of εz-greedy are: (i) Actions may not homogeneously (over states) correspond to a natural notion of shortest-path directions in the MDP. (ii) Action spaces may be biased (e.g. many actions have the same effect), so uniform action sampling may produce an undesirable biased drift through the MDP.
(iii) Obstacles and dynamics in the MDP can cause long exploratory trajectories to waste time (e.g. running into a wall for thousands of steps) or produce uninformative transitions (e.g. end of episode, death). In Appendix F we report on a series of experiments investigating εz-greedy's sensitivity to such modifications of the GridWorld domain and find that its performance degrades gracefully overall. These limitations are precisely where we believe future work is best motivated. How can an agent learn stationary, problem-specific notions of direction, and explore in that space efficiently? How can it avoid wasteful long trajectories, perhaps by truncating them early? This form of exploration bears similarity to the Lévy-flights model of foraging, where an animal will abruptly end foraging as soon as food is within sight. Could we use discrepancies in value along a trajectory to similarly truncate exploration early? Recent work on learning action representations appears to be a promising direction (Tennenholtz and Mannor, 2019; Chandak et al., 2019).

APPENDICES

A COVER TIME ANALYSIS

Assumption 1. For an MDP M and set of options Ω, there exists n_max ∈ ℕ such that ∀x, y ∈ X there exists ω ∈ Ω for which E_{π_ω}[t | x_0 = x, x_t = y] ≤ n_max.

Theorem 1. For any irreducible MDP, let Ω be a set of options satisfying Assumption 1 with n_max ≤ Θ(|X||A|). Then, temporally-extended ε-greedy with a sampling distribution p satisfying 1/ρ(ω) ≤ Θ(|X||A|), ∀ω ∈ Ω, has polynomial sample complexity.

Proof. Liu and Brunskill (2018) establish a PAC RL bound, leading to polynomial sample complexity, for random exploration using primitive actions when 1/min_x φ(x) and 1/h are polynomial in the numbers of states and actions, with steady-state distribution φ and Cheeger constant h. However, their result does not require actions to be primitive, and here we show how temporally-extended actions in the form of exploratory options can be used to obtain a similar result. We begin by bounding the steady-state probability for any state x ∈ X, where x_0 := argmax_x φ(x) and n_max is the maximum expected path distance between two states. We can understand this as bounding the steady-state probability of x by the product of (1) the maximal steady-state probability over states and (2) the probability of choosing an option from x_0 that reaches x. Let ω be any option satisfying Assumption 1 for starting in state x_0 and reaching x; then

φ(x) ≥ φ(x_0) · p(ω)  ⟹  1/min_x φ(x) ≤ Θ(|X||A|)/φ(x_0).

Next, we can similarly bound the Cheeger constant. Recall from Liu and Brunskill (2018) that h = inf_U F(∂U)/min{F(U), F(Ū)}, where Ū denotes the set of states not in U, F(u, v) = φ(u)P(u, v), F(∂U) = Σ_{u∈U, v∉U} F(u, v), and F(U) = Σ_{u∈U} φ(u). Let U = {x_0}; then the Cheeger constant can be bounded by

h = inf_U F(∂U)/min{F(U), F(Ū)}
  ≥ F(∂U)/min{F(U), F(Ū)}
  ≥ Σ_{x≠x_0} φ(x_0)P(x_0, x) / φ(x_0)
  = Σ_{x≠x_0} P(x_0, x)
  ≥ p(ω)
⟹ 1/h ≤ Θ(|X||A|).

EXAMPLE: CHAIN MDP

Theorem 1 clarifies the conditions under which temporally-extended ε-greedy is efficient. Given an MDP, this depends on two factors: the options and the sampling distribution. To illustrate this point, we use the well-known Chain MDP, for which εz-greedy satisfies Assumption 1. Specifically, the requirement on z is satisfied by the zeta distribution (z(n) = n^{−μ}/ζ(μ)) but not by the geometric distribution (exponential decay). This implies that εz-greedy on the Chain MDP has polynomial sample complexity when z is zeta-distributed, but not when it is geometrically distributed. We can see this by observing that in a Chain MDP of size N any state can be reached from the starting state within at most N steps, yielding n_max ≤ |X|. The sampling distribution ρ is uniform over actions, meaning that ρ(ω) ≥ z(n_max)/|A|. Finally, we consider the specific form of the duration distribution z. When given by a zeta distribution, z(n) = n^{−μ}/ζ(μ), we have

1/ρ(ω) ≤ |A|/z(n_max) = |A|ζ(μ)/n_max^{−μ} = |A| n_max^μ ζ(μ) ≤ |A| |X|^μ ζ(μ),

thus satisfying our assumption. On the other hand, if we let the duration distribution be geometric, z(n) = λ(1 − λ)^{n−1}, we have

1/ρ(ω) ≤ |A|/z(n_max) = |A| / (λ(1 − λ)^{n_max−1}) = (|A|/λ) (1/(1 − λ))^{n_max−1} ≤ (|A|/λ) (1/(1 − λ))^{|X|−1}.

As 1/(1 − λ) > 1 and n_max is bounded only by the number of states, this upper bound is exponential in the number of states and therefore does not satisfy the assumptions.
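The contrast between the two bounds can be checked numerically. The sketch below (parameter choices |A| = 2, μ = 2, λ = 0.5 are illustrative) evaluates the two closed-form upper bounds derived above as n_max grows:

```python
import math

NUM_ACTIONS = 2              # |A| for the two-action Chain MDP
MU, LAM = 2.0, 0.5           # zeta exponent and geometric parameter (illustrative)
ZETA_MU = math.pi ** 2 / 6   # ζ(2)

def bound_zeta(n_max):
    """|A| * n_max^mu * zeta(mu): polynomial in n_max."""
    return NUM_ACTIONS * n_max ** MU * ZETA_MU

def bound_geometric(n_max):
    """(|A|/lam) * (1/(1-lam))^(n_max-1): exponential in n_max."""
    return (NUM_ACTIONS / LAM) * (1.0 / (1.0 - LAM)) ** (n_max - 1)

for n in (10, 20, 40, 80):
    print(f"n_max={n:3d}  zeta bound: {bound_zeta(n):10.1f}"
          f"  geometric bound: {bound_geometric(n):.3e}")
```

Doubling n_max quadruples the zeta bound (quadratic growth for μ = 2), while the geometric bound doubles with every single unit increase of n_max.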

B DOMAIN SPECIFICATIONS

DeepSea (Osband, Aslanides, and Cassirer, 2018). Parameterized by problem size N, this environment can be viewed as the lower triangle of an N × N gridworld with two actions, "down-left" and "down-right", which move diagonally down either to the left or to the right. There is a single goal state in the far bottom-right corner, which can only be reached through a single action sequence. The goal reward is 1.0, and there is a per-step reward of −0.01/N. Finally, all episodes end after exactly N steps, once the agent reaches the bottom; therefore, the maximum possible undiscounted return is 0.99. An example with N = 20 is shown in Figure 6a. Average first-passage times are shown for a problem size of N = 20 in Figure 3a and, unlike the other plots, are logarithmically scaled, log(E[fpt] + 1), with contour levels in the range [0, 16]. In this work we use the deterministic variant of DeepSea; however, the standard stochastic version randomizes the action effects at every state. That is, "down-left" may correspond to action index 0 in one state and 1 in another, and these assignments are performed randomly for each training run (consistently across episodes). We briefly mention this variant in our conclusions as an example in which our proposed method should be expected to perform poorly. Indeed, in Figure 14 we show that such an adversarial modification reduces εz-greedy's performance back to that of ε-greedy. For experiments, we used Q-learning with a tabular function approximator, learning rate α = 1.0, and ε = 1.0/(N + 1) for problem size N. Experiment results are averages over 30 random seeds. GridWorld. Shown in Figure 6b, this is an open single-room gridworld with four actions ("up", "down", "left", and "right") and a single non-zero reward at the goal state. The initial state is in the top center of the grid (offset from the wall by one row), and the goal state is diagonally across from it at the other end of the room.
Notice that if the goal were in the same row or column as the start, or placed directly next to a wall, this could be argued to favour an action-repeat exploration method. Instead, the goal location was chosen to be harder for εz-greedy to find (offset from the wall, far from and not in the same row/column as the start state). For experiments, we used Q-learning with a tabular function approximator, learning rate α = 0.1, ε = 0.1, and maximum episode length 1000. Experiment results are averages over 30 random seeds. Figure 1 shows average first-passage times on a similar gridworld, but with a fixed greedy policy which takes the "down" action deterministically. MountainCar (Sutton and Barto, 2018). This environment models an under-powered car stuck in the valley between two hills. The agent must build momentum in order to reach the top of one hill and obtain the goal reward. In this version of the domain all rewards are zero except for the goal, which yields a reward of 1.0. There are two continuous state variables, corresponding to the agent's location, x, and velocity, ẋ. The dense-reward version of this environment can be solved reliably in less than a dozen episodes using linear function approximation on top of a low-order Fourier basis (Konidaris, Osentoski, and Thomas, 2011). In our experiments using the sparse-reward variant of the environment, we used SARSA(λ) with linear function approximation on top of an order-5 Fourier basis. We used learning rate α = 0.005, ε = 0.05, γ = 0.99, and λ = 0.9. The maximum episode length was set to 5000. Experiment results are averages over 30 random seeds. A near-optimal policy, given this discount and ε, but without confounds due to function approximation, should reach an episodic discounted return of approximately 0.29. CartPole (Barto, Sutton, and Anderson, 1983). We use the "swingup sparse" variant as implemented in Tassa et al. (2018).
In this sparse-reward version of the environment, the agent receives zero reward unless |x| < 0.25 and cos(θ) > 0.995, for cart location x and pole angle θ. All episodes run for 1000 steps, and observations are 5-dimensional and continuous. For experiments, we used SARSA(λ) with linear function approximation on top of an order-7 Fourier basis. We used learning rate α = 0.0005, ε = 0.01, γ = 0.99, and λ = 0.7. The maximum episode length was 1000. Weights were initialized randomly from a mean-zero normal distribution with variance 0.001. Experiment results are averages over 30 random seeds. Atari-57 (Bellemare et al., 2013) is a benchmark suite of 57 Atari 2600 games in the Arcade Learning Environment (ALE). Observations are 210 × 160 color images (following Mnih et al. (2015), many agents down-scale these to 84 × 84 and convert them to grayscale). For the primary results in this work we use the original ALE version of the Atari 2600 games, which does not include subsequently added games (beyond the 57) or features such as "sticky actions". For results with sticky actions enabled, consult Appendix E. Many existing results on Atari-57 report the performance of the best agent throughout training, or simply the maximum evaluation performance attained during training. We do not report this metric in the main text because it does not reflect the true learning progress of agents and tends to be an overestimate. However, for comparison purposes, "best" performance is included later in the Appendix. In the next section, alongside other agent details, we give the hyper-parameters used in the Atari-57 experiments. An example frame from the game PRIVATE EYE is shown in Figure 6e.
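As an illustration of the simplest of these domains, the DeepSea dynamics described above can be sketched in a few lines. This is our own minimal re-implementation from the specification, not the authors' code, and the class and method names are ours:

```python
class DeepSea:
    """Deterministic DeepSea: an N x N lower-triangular grid.

    Action 0 = "down-left", 1 = "down-right". Goal reward 1.0 in the
    bottom-right corner, per-step reward -0.01/N, episodes last exactly N steps.
    """
    def __init__(self, n=20):
        self.n = n
        self.reset()

    def reset(self):
        self.row, self.col = 0, 0
        return (self.row, self.col)

    def step(self, action):
        self.row += 1
        self.col = self.col + 1 if action == 1 else max(self.col - 1, 0)
        reward = -0.01 / self.n
        done = self.row == self.n
        if done and self.col == self.n:
            reward += 1.0  # goal: reachable only via N consecutive "down-right"s
        return (self.row, self.col), reward, done
```

Always taking "down-right" yields the stated maximum undiscounted return of 0.99 (N steps of −0.01/N plus the goal reward of 1.0).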

C AGENT AND ALGORITHM DETAILS

Except for the ablation experiments on the duration distribution, all εz-greedy experiments use a duration distribution z(n) ∝ n^{−μ} with μ = 2.0. Durations were capped at n ≤ 10000 for all experiments except for the Rainbow-based agents, which were limited to n ≤ 100; in this case, however, no other values were attempted.
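Sampling from this capped zeta distribution amounts to normalizing n^{−μ} over {1, ..., cap}. A minimal sketch via an explicit inverse-CDF table follows; it is not taken from the paper's code, and the function names are ours:

```python
import bisect
import random

def make_zeta_sampler(mu=2.0, cap=10_000, rng=None):
    """Return a sampler for durations n with P(n) proportional to n^(-mu), truncated to {1, ..., cap}."""
    rng = rng or random.Random()
    weights = [n ** -mu for n in range(1, cap + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    def sample():
        # first index whose cumulative mass exceeds a uniform draw;
        # min() guards against floating-point shortfall at the top of the CDF
        return min(bisect.bisect_left(cdf, rng.random()) + 1, cap)
    return sample
```

With μ = 2 and cap = 100, roughly 61% of draws are n = 1 (1/Σ_{n≤100} n^{−2}), while long repeats still occur regularly.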

PSEUDO-CODE

Algorithm 1: εz-greedy exploration policy
1: function EZGREEDY(Q, ε, z)
2:   n ← 0                  ▷ remaining duration of the current repeat
3:   ω ← −1                 ▷ action being repeated
4:   while True do
5:     Observe state x
6:     if n == 0 then
7:       if random() ≤ ε then
8:         n ∼ z            ▷ sample a duration
9:         ω ∼ Uniform(A)   ▷ sample an action to repeat
10:      else
11:        a ← argmax_b Q(x, b)
12:    if n > 0 then
13:      a ← ω
14:      n ← n − 1
15:    Take action a

NETWORK ARCHITECTURE. Rainbow-based agents use a network architecture identical to the original Rainbow agent (Hessel et al., 2018). In particular, these include the use of NoisyNets (Fortunato et al., 2017), with the exception of Rainbow-CTS, which uses a simple dueling value network like the "no noisy-nets" ablation in Hessel et al. (2018); a preliminary experiment showed that this setting performed slightly better for Rainbow-CTS than including NoisyNets. R2D2-based agents use a slightly enlarged variant of the network used in the original R2D2 (Kapturowski et al., 2019), namely a 4-layer convolutional neural network with layers of 32, 64, 128, and 128 feature planes, with kernel sizes of 7, 5, 5, and 3, and strides of 4, 2, 2, and 1, respectively. These are followed by a fully connected layer with 512 units and an LSTM with another 512 hidden units, which finally feeds a dueling architecture of size 512 (Wang et al., 2015). Unlike the original R2D2, Atari frames are passed to this network without frame-stacking, at their original resolution of 210 × 160 and in full RGB. Like the original R2D2, the LSTM receives the reward and one-hot action vector from the previous time step as inputs.
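Algorithm 1 above can be written as a per-step action selector rather than an infinite loop. This is a runnable sketch, not the authors' implementation; the class and argument names are ours, and `q_values` stands in for the agent's action-value estimates at the current state:

```python
import random

class EzGreedy:
    """Per-step sketch of the εz-greedy action selector (Algorithm 1)."""
    def __init__(self, epsilon, sample_duration, num_actions, rng=None):
        self.epsilon = epsilon
        self.sample_duration = sample_duration  # callable returning n ~ z
        self.num_actions = num_actions
        self.rng = rng or random.Random()
        self.n = 0    # remaining steps of the current repeat
        self.w = -1   # action currently being repeated

    def act(self, q_values):
        """q_values: sequence of action-value estimates for the current state."""
        if self.n == 0 and self.rng.random() <= self.epsilon:
            self.n = self.sample_duration()
            self.w = self.rng.randrange(self.num_actions)
        if self.n > 0:
            self.n -= 1
            return self.w
        return max(range(self.num_actions), key=q_values.__getitem__)
```

Setting `sample_duration` to always return 1 recovers standard ε-greedy; plugging in a capped zeta sampler recovers εz-greedy.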

HYPER-PARAMETERS AND IMPLEMENTATION NOTES

Unless stated otherwise, hyper-parameters for our Rainbow-based agents follow the original implementation in Hessel et al. (2018); see Table 2. An exception is the Rainbow-CTS agent, which uses a regular dueling value network instead of the NoisyNets variant, and also makes use of an ε-greedy policy (whereas the baseline Rainbow relies on its NoisyNets value head for exploration). The parameter ε follows a linear decay schedule from 1.0 to 0.01 over the course of the first 4M frames, remaining constant after that. Evaluation uses an even lower value of ε = 0.001. The same ε-schedule is used in Rainbow+ε-greedy and Rainbow+εz-greedy, on top of Rainbow's regular NoisyNets-based policy. The CTS-based intrinsic reward implementation follows Bellemare et al. (2016), with the scale of intrinsic rewards set to a lower value of 0.0005. This agent was informally tuned for better performance on hard-exploration games: instead of the "mixed Monte-Carlo return" update rule from Bellemare et al. (2016), Rainbow-CTS uses an n-step Q-learning rule with n = 5 (while the baseline Rainbow uses n = 3) and, unlike the baseline, does not use a target network. All of our R2D2-based agents are based on a slightly tuned variant of the published R2D2 agent (Kapturowski et al., 2019), with hyper-parameters unchanged unless stated otherwise; see Table 3. Instead of an n-step Q-learning update rule, our R2D2 uses expected SARSA(λ) with λ = 0.7 (van Seijen et al., 2009). It also uses a somewhat shorter target network update period of 400 update steps and a higher learning rate of 2 × 10⁻⁴. For faster experimental turnaround, we also use a slightly larger number of actors (320 instead of 256). This tuning was performed on the vanilla R2D2 in order to match published results. The RND agent is a modification of our baseline R2D2 with the addition of the intrinsic reward generated by the error signal of the RND network from Burda et al. (2018).
The additional networks ("predictor" and "target" in the terminology of Burda et al. (2018)) are small convolutional neural networks of the same size as the one used in Mnih et al. (2015), followed by a single linear layer with output size 128. The predictor is trained on the same replay batches as the main agent network, using the Adam optimizer with learning rate 0.0005. The intrinsic reward derived from its loss is normalized by dividing by its variance, using running estimates of its empirical mean and variance. Note that the RND agent also makes use of ε-greedy exploration.

The Bootstrapped R2D2 agent closely follows the details of Osband et al. (2016). The R2D2 network is extended to have k = 8 action-value function heads which share a common convolutional and LSTM network, but have distinct fully-connected layers on top (each with the same dimensions as in R2D2). During training, each actor samples a head uniformly at random and follows that action-value function's ε-greedy policy for an entire episode. At each step, a mask is sampled and added to replay, indicating with probability p = 0.5 which heads will be trained on that step of experience. During evaluation, we average the heads' ε-greedy policies to form an ensemble policy, which is then followed.
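The intrinsic-reward normalization described earlier in this section (dividing the RND prediction error by a running estimate of its variance) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class name and the small variance floor are our own choices.

```python
class RunningNormalizer:
    """Running estimates of empirical mean and variance (Welford's
    algorithm), used here to normalize the RND prediction error into an
    intrinsic reward by dividing by the running variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance; fall back to 1.0 before enough samples.
        return self._m2 / self.count if self.count > 1 else 1.0

    def intrinsic_reward(self, prediction_error):
        """Update the running statistics and normalize one error value."""
        self.update(prediction_error)
        return prediction_error / max(self.variance(), 1e-8)
```

Welford's algorithm avoids storing the full history of errors, which matters when the normalizer runs alongside a distributed agent.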

D EXPERIMENT DETAILS

First-visit visualizations These results (e.g., see Figure 1) are intended to illustrate the differences in state-visitation patterns between ε-greedy and εz-greedy. They are generated with some fixed ε, often ε = 1.0 for pure exploration independent of the greedy policy, and are computed using Monte-Carlo rollouts, with each state receiving an integer indicating the first step at which it was visited on a given trial. States never seen in a trial receive the maximum step count, and we then average these values over many trials. For continuous-state problems we discretize the state space and count any state within a small region as a visit. We give the precise values in Table 4.
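The procedure above can be sketched for a simple open gridworld. This is an illustrative reconstruction under our own assumptions (grid size, step budget, and wall-clipping dynamics are ours); passing a duration sampler that always returns 1 recovers plain ε-greedy dithering, while longer sampled durations mimic the temporally persistent variant.

```python
import random

def first_visit_map(size=11, max_steps=300, trials=50,
                    duration=lambda rng: 1, seed=0):
    """Average first-visit step per grid cell under pure exploration
    (epsilon = 1): a uniformly random action is repeated for a sampled
    duration. Cells never visited in a trial keep the maximum step count."""
    rng = random.Random(seed)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    start = size // 2
    totals = [[0.0] * size for _ in range(size)]
    for _ in range(trials):
        first = [[max_steps] * size for _ in range(size)]
        r = c = start
        first[r][c] = 0  # the start state is visited at step 0
        action, remaining = 0, 0
        for step in range(1, max_steps + 1):
            if remaining == 0:  # sample a fresh action and repeat duration
                action, remaining = rng.randrange(4), duration(rng)
            dr, dc = moves[action]
            r = min(max(r + dr, 0), size - 1)  # clip at the walls
            c = min(max(c + dc, 0), size - 1)
            remaining -= 1
            if first[r][c] == max_steps:
                first[r][c] = step
        for i in range(size):
            for j in range(size):
                totals[i][j] += first[i][j] / trials
    return totals
```

Comparing `first_visit_map(duration=lambda rng: 1)` against, say, `duration=lambda rng: rng.randint(1, 16)` reproduces the qualitative contrast shown in Figure 1.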

Atari experiments

The experimental setups for the Rainbow-based and R2D2-based agents match those used in their respective baseline works. In particular, Rainbow-based agents perform a mini-batch gradient update every 4 steps, and every 1M environment frames learning is frozen and the agent is evaluated for 500K environment frames. In the R2D2-based agents, acting, learning, and evaluating all occur simultaneously and in parallel, as in the baseline R2D2. In the Atari-57 experiments, all results for Rainbow-based agents are averaged over 5 random seeds, while results for R2D2-based agents are averaged over 3 random seeds. Atari-57 is most often used with a built-in number (4) of action-repeats for every primitive action taken (Mnih et al., 2015). We did not change this environment parameter, which means that an exploratory action-repeat policy of length n will, in the actual game, produce 4 × n low-level actions. Similarly, DQN-based agents typically use frame-stacking, while agents such as R2D2 that use an RNN do not. The robustness of our results across these different algorithms suggests that εz-greedy is not greatly impacted by the presence or absence of frame-stacking.

Atari metrics

The human-normalized score is defined as score = (agent − random) / (human − random), where agent, random, and human are the per-game scores of the agent, a random policy, and a human player respectively. The human-gap is defined as the expected shortfall from human-level performance over all games: human-gap = 1.0 − E[min(1.0, score)], where the expectation is taken over games.

Computational Resources Small-scale experiments were written in Python and run on commodity hardware using a CPU. Rainbow-based agents were implemented in Python using JAX, with each configuration (game, algorithm, hyper-parameter setting) run on a single V100 GPU. Such experiments generally required less than a week of wall-clock time. R2D2-based agents were implemented in Python using TensorFlow, with each configuration run on a single V100 GPU and a number of actors (specified above) each run on a single CPU. These agents were trained with the distributed training regime described in the R2D2 paper (Kapturowski et al., 2019), and required approximately 3 days to complete.
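The two metrics defined above can be sketched directly from their formulas (the function names are ours, for illustration):

```python
def human_normalized_score(agent, random_score, human):
    """score = (agent - random) / (human - random), computed per game."""
    return (agent - random_score) / (human - random_score)

def human_gap(scores):
    """human-gap = 1 - mean over games of min(1, score); it is 0 exactly
    when the agent is at or above human level on every game."""
    return 1.0 - sum(min(1.0, s) for s in scores) / len(scores)
```

Note that clipping each score at 1.0 means the human-gap rewards closing the gap on below-human games, and is unaffected by super-human margins.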

E EXPERIMENTAL RESULTS: STOCHASTICITY

In this section we present a series of experiments focused on the effects of stochasticity and non-determinism in the environment on the performance of εz-greedy.

SMALL-SCALE ENVIRONMENTS

We consider stochastic versions of two of our small-scale environments: GridWorld and MountainCar. In the GridWorld domain the "noise scale" is the probability of an action transitioning to a random neighboring state. In MountainCar, where actions apply a force in {−1, 0, 1} to the car, we add zero-mean Gaussian noise to this force, with variance given by the specified noise scale; the resulting forces are clipped to remain within the original range of [−1, 1]. Figure 7a shows the discounted return, averaged over the 100-episode training runs, as a function of the noise scale. Figure 7b gives example learning curves for the agents (ε-greedy and εz-greedy) at four levels of noise. We first note that for near-deterministic settings we replicate the original findings in the main text, but that the performance of the ε-greedy agent actually improves for small amounts of transition noise, while both agents degrade as this noise becomes larger. We interpret this to suggest that the ε-greedy agent initially benefits from the increased exploration, whereas εz-greedy was already exploring more and begins to degrade slightly sooner. Figure 8 similarly shows the two agents' performance on MountainCar as we increase the scale of the transition noise. Here, unlike in GridWorld, both agents generally suffer reduced performance as the level of noise increases, and performance is maximal for both in the deterministic case.
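The MountainCar noise injection described above can be sketched as follows; this is an illustrative reconstruction (the function name and the choice of a default generator are ours).

```python
import random

def noisy_force(action_force, noise_scale, rng=None):
    """Apply zero-mean Gaussian noise with variance `noise_scale` to the
    MountainCar force (one of -1, 0, 1), then clip back into [-1, 1]."""
    rng = rng or random
    if noise_scale <= 0.0:
        return float(action_force)
    # random.gauss takes a standard deviation, so take the square root.
    noisy = action_force + rng.gauss(0.0, noise_scale ** 0.5)
    return max(-1.0, min(1.0, noisy))
```

The clipping step matters: without it, large noise scales would let the perturbed force exceed the environment's original action range.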

ATARI-57 WITH STICKY-ACTIONS

The Arcade Learning Environment (Bellemare et al., 2013, ALE) supports a form of non-determinism called sticky actions, where with some probability ζ (typically set to 0.25) the agent's action is ignored and the previously executed action is repeated instead (Machado et al., 2018a). This is not exactly equivalent to the simpler transition noise used in the small-scale experiments above, because there is now a non-Markovian dependence on the previous action executed in the environment. Two additional details are important to keep in mind. First, the agent does not observe the action that was actually executed; it observes only the action it intended to take. Second, the sticky-action effect applies at every step of the low-level ALE interaction. That is, it is standard practice to use an action-repeat (usually 4) in Atari, so that each action selected by the agent is repeated some number of times in the low-level interaction, and sticky actions apply at every step of this low-level interaction. Because of the non-Markovian effects and the relatively fast decay in the probability of long action-repeats, sticky actions do not provide exploration benefits similar to those of εz-greedy. Moreover, unlike with exploratory action-repeats, the agent is unable to learn about the actions that were actually executed, making the underlying learning problem more challenging. In environments where precise control is needed for high performance, sticky actions tend to significantly degrade agent performance. We note that sticky actions are a modification usually applied to the problem, not the agent, and usually make the problem harder: they increase the stochasticity of the environment, but do so with a non-Markovian dependency on the previously executed action.
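The sticky-action mechanism described above can be sketched as a thin wrapper around action selection. This is a minimal illustration, not the ALE implementation; the class name and the NOOP default for the pre-episode action are our own assumptions.

```python
import random

class StickyActions:
    """Sketch of ALE sticky actions: at every low-level step, with
    probability zeta the requested action is ignored and the previously
    *executed* action repeats. The agent observes only the action it
    intended to take, never the one actually executed."""

    def __init__(self, zeta=0.25, seed=None):
        self.zeta = zeta
        self.rng = random.Random(seed)
        self.prev_action = 0  # assume NOOP before the first step

    def executed(self, intended_action):
        if self.rng.random() < self.zeta:
            action = self.prev_action  # stick: intended action is ignored
        else:
            action = intended_action
        self.prev_action = action
        return action
```

Because `prev_action` tracks the executed (not intended) action, a single "stick" can propagate through several consecutive steps, which is the source of the non-Markovian dependence noted above.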
Given the superficial similarity with action-repeats, one might ask whether sticky actions could benefit exploration on hard-exploration games, similarly to εz-greedy. To answer this question we compare our Rainbow-based ε-greedy and εz-greedy agents on the Atari-57 benchmark with sticky actions enabled (ζ = 0.25). Figure 9 gives the median, mean, and human-gap values for the human-normalized scores. We observe that much of the gap in mean and median performance has disappeared, and that even for the human-gap the performance benefits of εz-greedy over ε-greedy have been partially reduced. Nonetheless, we continue to see significant improvements in performance on the same set of harder exploration games as in the non-sticky-action results (see Figure 10). Finally, we give the numeric values for the final performance of these agents in Table 5.

F EXPERIMENTAL RESULTS: LIMITATIONS

To further study the limitations of εz-greedy, and to motivate work on learned options for exploration, we consider the effect on performance of adding "obstacles" and "traps" to the GridWorld domain. For our purposes, an obstacle is an interior wall in the gridworld, such that if the agent attempts to occupy the location of the obstacle the result is a no-op with no state transition. A "trap", on the other hand, results in the immediate end of the episode, with zero reward, if the agent attempts to occupy its location. Figure 11 shows a set of example gridworlds with obstacles and traps generated at varying target densities. We generate the environments by filling a 20 × 20 gridworld with either obstacles or traps (not both), where each state is filled with some probability given by the target density. We then identify the largest connected component of open cells and select the goal and start states randomly, without replacement, from the states in this component. This ensures that all gridworlds can be solved. Note that increasing the density of obstacles and traps has two opposite effects: on one hand it reduces the overall state space, making the problem easier; on the other hand it makes exploration more difficult, since obstacles result in a sparser transition graph and traps result in many absorbing states. Based on our observations, the latter effect tends to be stronger than the former. Our experiments study the performance of ε-greedy and εz-greedy as we scale the density of the obstacles or traps. Figures 12 & 13 show our results for these two sets of experiments. In both cases we observe that the gap in performance between εz-greedy and ε-greedy decreases with the density of the problematic states. Interestingly, ε-greedy is not impacted as seriously.
We believe this is partly to be expected, because ε-greedy explores more locally and more densely around the start state, making navigating around obstacles somewhat easier. Note that we increased the number of episodes from 100 (used in the main text) to 1000 in order to increase the likelihood of both agents solving a given problem. Additionally, unlike in the main text, every trial of these experiments is performed on a randomly generated gridworld with randomly selected start and goal locations; although we ensure that both agents are trained on the same environment, the environment itself differs for each seed. This is reflected in the larger variances in performance, indicated by the shaded regions in the figures.

G FURTHER EXPERIMENTAL RESULTS

In this section, we include several additional experimental results that do not fit into the main text but may be helpful to the reader.

In the conclusions we highlight a limitation of εz-greedy which occurs when the effects of actions differ significantly between states. In Figure 14 we present results for such an adversarial setting in the DeepSea environment, where the action effects are randomly permuted for every state. We observe, as expected, that in this setting εz-greedy no longer provides more efficient exploration than ε-greedy.

In Figure 15 we compare with RMax (Brafman and Tennenholtz, 2002) on the GridWorld domain. We consider two values for the threshold at which a state-action pair is marked as known: N = 1, effectively encoding an assumption that the domain is deterministic, and N = 10, a more generally reasonable value. We observe that, if tuned aggressively, RMax can significantly outperform εz-greedy, as should be expected. However, unlike RMax, εz-greedy does not assume access to a tabular representation and can scale to large-scale problems.

Next, in Figures 16 & 17 we report the per-game percentage of relative improvement, using human-normalized scores, over an ε-greedy baseline for εz-greedy in both the Rainbow and R2D2 agents. In both cases we give the corresponding results for the improvement of CTS (for Rainbow) and RND (for R2D2) over the same baselines. Additionally, we give these results for both the final agent performance and the performance averaged over training. The percent relative improvement of a score over a baseline is computed as 100 × (score − baseline) / baseline. Note that we limit the maximum vertical axis such that the improvement in some games is cut off; for some games these relative improvements are so large that it would otherwise be difficult to see the values for other games on the same scale.

Figure 16: Percent relative improvement of exploration methods (εz-greedy and CTS) over ε-greedy for the Rainbow-based agents on Atari-57, per game. We report this for both final performance (top) and the average over training (bottom).
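The percent relative improvement used in Figures 16 & 17 follows directly from its formula (the function name is ours):

```python
def percent_improvement(score, baseline):
    """Percent relative improvement of `score` over `baseline`:
    100 * (score - baseline) / baseline."""
    return 100.0 * (score - baseline) / baseline
```

Negative values indicate games on which the method underperforms its ε-greedy baseline.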
In the main text we give summary learning curves on Atari-57 for Rainbow- and R2D2-based agents in terms of median human-normalized score and human-gap. In Figure 18 we show these as well as the mean human-normalized score learning curves. In Figures 19 & 20 we give full per-game results for Rainbow- and R2D2-based agents respectively. These results offer additional context on those reported in the main text, demonstrating more concretely the nature of the performance trade-offs made by each algorithm. Finally, in Table 1 we give mean and median human-normalized scores and the human-gap on Atari-57 for the final trained agents. Note that this is a slightly different evaluation method than is often used (Mnih et al., 2015; Hessel et al., 2018), in which only the best performance for each game over training is considered. For purposes of comparison we include these results in Table 6.
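The two evaluation protocols contrasted above differ only in how each game's sequence of periodic evaluation scores is summarized; a minimal sketch (function name ours):

```python
def final_and_best(evaluation_scores):
    """Summarize one game's periodic evaluation scores over training:
    returns (final score, best-over-training score). The final score is
    what Table 1 reports; past work often reported the best score."""
    return evaluation_scores[-1], max(evaluation_scores)
```

Best-over-training is optimistic (it implicitly selects a checkpoint per game), which is why the final-score protocol gives a more conservative summary.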



Footnote: εz-greedy is pronounced "easy-greedy". Footnote: To compare with previous work on DeepSea, we report the expected time to learn versus the problem scale.



Figure 1: Average (estimated) first-visit times, comparing ε-greedy policies (a) without and (b) with temporal persistence, in an open gridworld (blue represents fewer steps to reach a state; red represents states rarely or never seen). The greedy policy moves directly down from the top center. See Appendix for details.

An optimal policy can be obtained by approximating the Bellman optimality equation:

Q(x, a) = R(x, a) + γ E_{X' ~ P(·|x,a)} [ max_{a' ∈ A} Q(X', a') ].    (1)

Figure 2: (a) Modified chain MDP: action a0 moves right, a1 terminates with the specified reward. Rewards follow a pattern of n zeros followed by a single reward n, etc. We evaluate performance under various duration distributions and hyper-parameters on this chain. (b) Duration distributions similarly compared for an R2D2-based deep RL agent in Atari.

Figure 3: Comparing ε-greedy with εz-greedy on four small-scale domains requiring exploration. (a) DeepSea is a tabular problem in which only one action sequence receives positive reward, and uniform exploration is exponentially inefficient; (b) GridWorld is a four-action gridworld with a single reward; (c) MountainCar is a sparse-reward (only at the goal) version of the classic RL domain; and (d) CartPole swingup-sparse only gives non-zero reward when the pole is perfectly balanced and the cart is near the center. For each, we show performance comparing ε-greedy with εz-greedy (left), as well as average first-visit times over states for both algorithms during pure exploration (ε = 1). In all first-visit plots, color levels are linearly scaled, except for DeepSea, where we use a log scale.

Figure 4: Results on the Atari-57 benchmark for (a) Rainbow-based and (b) R2D2-based agents.


Figure 6: Environments used in this work: (a) DeepSea, (b) GridWorld, (c) MountainCar, (d) CartPole, (e) Atari-57.


Figure 7: Stochastic Gridworld experiments. (a) We show averaged training performance (over 100 episodes) with respect to the noise scale. (b) Example learning curves from these experiments showing the effect of stochasticity on both agents.

Figure 8: Stochastic MountainCar experiments. (a) We show averaged training performance (over 1000 episodes) with respect to the noise scale. (b) Example learning curves from these experiments showing the effect of stochasticity on both agents.

Figure 9: Sticky-action Atari-57 summary curves for Rainbow-based agents.

Figure 11: Example Gridworlds with varying density of (top) obstacles and (bottom) traps. The shades of grey represent the type of each cell: white cells are open states, light grey cells are the start state, grey cells are traps, dark grey cells are goal states, and black cells are obstacles. These environments are randomly generated to a target density of obstacles or traps, while ensuring there exists a path between the start and goal states.

Figure 12: Gridworld with obstacles at varying density. (a) We show averaged training performance (over 1000 episodes) with respect to the obstacle density. (b) Example learning curves from these experiments showing the effect on both agents.

Figure 13: Gridworld with traps at varying density. (a) We show averaged training performance (over 1000 episodes) with respect to the trap density. (b) Example learning curves from these experiments showing the effect on both agents.

Figure 15: Experiment in the GridWorld domain comparing RMax (with visitation thresholds 1 and 10) with ε-greedy and εz-greedy.

Figure 17: Percent relative improvement of exploration methods (εz-greedy and RND) over ε-greedy for the R2D2-based agents on Atari-57, per game. We report this for both final performance (top) and the average over training (bottom).

Figure 18: Atari-57 summary curves for R2D2-based methods (top) and Rainbow-based methods (bottom).

Figure 19: Per-game Atari-57 results for Rainbow-based methods.

Figure 21: Per-game Sticky-action Atari-57 results for Rainbow-based methods.

(See Figures 9, 10, 21.)

Atari-57 final performance. H-Gap denotes the human-gap, defined fully in the Appendix.

Hyper-parameter values used in Rainbow-based agents (deviations from Hessel et al. (2018) highlighted in boldface).

Hyper-parameter values used in R2D2-based agents (deviations from Kapturowski et al. (2019) highlighted in boldface).

Settings for experiments used to generate average first-visit visualizations found in main text.

Sticky-action Atari-57 final performance summaries for Rainbow-based agents after 200M environment frames.

Figure 14: Adversarial modification to the DeepSea environment causes εz-greedy to perform no better than ε-greedy.

Atari-57 final performance summaries. R2D2 results are after 30B environment frames, and Rainbow results are after 200M environment frames. We also include median and mean human-normalized scores obtained by using the best (instead of final) evaluation scores for each training run, to allow comparison with past publications, which often used this metric (e.g. Hessel et al. (2018)).

