TEMPORALLY-EXTENDED ε-GREEDY EXPLORATION

Abstract

Recent work on exploration in reinforcement learning (RL) has led to a series of increasingly complex solutions to the problem. This increase in complexity often comes at the expense of generality. Recent empirical studies suggest that, when applied to a broader set of domains, some sophisticated exploration methods are outperformed by simpler counterparts, such as ε-greedy. In this paper we propose an exploration algorithm that retains the simplicity of ε-greedy while reducing dithering. We build on a simple hypothesis: the main limitation of ε-greedy exploration is its lack of temporal persistence, which limits its ability to escape local optima. We propose a temporally extended form of ε-greedy that simply repeats the sampled action for a random duration. It turns out that, for many duration distributions, this suffices to improve exploration on a large set of domains. Interestingly, a class of distributions inspired by ecological models of animal foraging behaviour yields particularly strong performance.
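The repetition scheme described in the abstract can be sketched in a few lines. The sketch below is illustrative only: the state tuple, function name, and duration sampler are assumptions for exposition, not the paper's implementation.

```python
import random

def ez_greedy_step(agent_state, q_values, epsilon, duration_sampler):
    """One action-selection step of temporally-extended epsilon-greedy.

    agent_state is a (remaining_duration, repeated_action) pair; both
    names are illustrative assumptions, not from the paper.
    Returns (action, next_agent_state).
    """
    duration, action = agent_state
    if duration > 0:
        # Inside an exploratory "flight": keep repeating the sampled action.
        return action, (duration - 1, action)
    if random.random() < epsilon:
        # Start a new flight: sample a duration, then a uniform random action.
        duration = duration_sampler()
        action = random.randrange(len(q_values))
        return action, (duration - 1, action)
    # Otherwise act greedily with respect to the current value estimates.
    greedy = max(range(len(q_values)), key=q_values.__getitem__)
    return greedy, (0, greedy)
```

A heavy-tailed `duration_sampler` (e.g., samples from a zeta/power-law distribution, in the spirit of the foraging-inspired distributions the abstract mentions) would occasionally produce long action repeats, giving the persistent exploration the paper argues ε-greedy lacks.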

1. INTRODUCTION

Exploration is widely regarded as one of the most important open problems in reinforcement learning (RL). The problem has been theoretically analyzed under simplifying assumptions, providing reassurance and motivating the development of algorithms (Brafman and Tennenholtz, 2002; Asmuth et al., 2009; Azar, Osband, and Munos, 2017). Recently, there has been considerable progress on the empirical side as well, with new methods that work in combination with powerful function approximators to perform well on challenging large-scale exploration problems (Bellemare et al., 2016; Ostrovski et al., 2017; Burda et al., 2018; Badia et al., 2020b). Despite all of the above, the most commonly used exploration strategies are still simple methods like ε-greedy, Boltzmann exploration and entropy regularization (Peters, Mulling, and Altun, 2010; Sutton and Barto, 2018). This is true both for work of a more investigative nature (Mnih et al., 2015) and for practical applications (Levine et al., 2016; Li et al., 2019). In particular, many recent successes of deep RL, from data-center cooling to Atari game playing, rely heavily upon these simple exploration strategies (Mnih et al., 2015; Lazic et al., 2018; Kapturowski et al., 2019). Why does the RL community continue to rely on such naive exploration methods? There are several possible reasons. First, principled methods usually do not scale well. Second, the exploration problem is often formulated as a separate problem whose solution itself involves quite challenging steps. Moreover, besides having very limited theoretical grounding, practical methods are often complex and perform significantly worse outside the small set of domains they were specifically designed for. This last point is essential, as an effective exploration method must be generally applicable.
Naive exploration methods like ε-greedy, Boltzmann exploration and entropy regularization are general because they do not make strong assumptions about the underlying domain. As a consequence, they are also simple, requiring little implementation effort or per-domain tuning. This makes them appealing alternatives even when they are not as efficient as some more complex variants. Perhaps there is a middle ground between simple yet inefficient exploration strategies and more complex, though efficient, methods. The method we propose in this paper represents such a compromise. We ask the following question: how can we deviate minimally from the simple exploration strategies adopted in practice and still get clear benefits? In more pragmatic terms, we want a

