ROBUST EXPLORATION VIA CLUSTERING-BASED ONLINE DENSITY ESTIMATION

Anonymous

Abstract

Intrinsic motivation is a critical ingredient in reinforcement learning, enabling progress when rewards are sparse. However, many existing approaches that measure the novelty of observations are brittle, or rely on restrictive assumptions about the environment that limit generality. We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method that estimates visitation counts for clusters of states that are similar according to the metric induced by a specified representation learning technique. We adapt classical clustering algorithms to the online setting to design a new type of memory that allows RECODE to efficiently track global visitation counts over thousands of episodes. RECODE can easily leverage both off-the-shelf and novel representation learning techniques. We introduce a novel generalization of the action-prediction representation that leverages transformers for multi-step predictions, which we demonstrate to be more performant on a suite of challenging 3D exploration tasks in DM-HARD-8. We show experimentally that our approach works with a variety of RL agents, obtaining state-of-the-art performance on Atari and DM-HARD-8, and being the first agent to reach the end screen in Pitfall!

1. INTRODUCTION

Exploration mechanisms are a key component of reinforcement learning (RL, Sutton & Barto, 2018) agents, especially in sparse-reward tasks where long sequences of actions must be executed before collecting a reward. The exploration problem has been studied theoretically (Kearns & Singh, 2002; Azar et al., 2017; Brafman & Tennenholtz, 2003; Auer et al., 2002; Agrawal & Goyal, 2012; Audibert et al., 2010; Jin et al., 2020) in the context of bandits (Lattimore & Szepesvári, 2020) and Markov Decision Processes (MDP, Puterman, 1990; Jaksch et al., 2010). Among these theoretical works, one simple and theoretically sound approach to efficient exploration in MDPs is to use a decreasing function of the visitation counts as an exploration bonus (Strehl & Littman, 2008; Azar et al., 2017). However, this approach is intractable with large or continuous state spaces, where counting individual states is not trivial and generalization between states becomes essential. Two partially successful approaches have emerged to empirically estimate visitation counts/densities in deep RL: (i) the parametric approach, which uses neural networks, and (ii) the non-parametric approach, which uses a slot-based memory to store representations of visited states, where the representation learning method serves to induce a more meaningful metric¹ between states. Parametric methods either explicitly estimate the visitation counts using density models (Bellemare et al., 2016; Ostrovski et al., 2017) or implicitly estimate them using, e.g., Random Network Distillation (RND, Burda et al., 2019; Badia et al., 2020b). Non-parametric methods rely on a memory to store encountered state representations (Badia et al., 2020b) and on representation learning to construct a metric that differentiates states meaningfully (Pathak et al., 2017).
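To make the count-based bonus concrete, the following is a minimal sketch (not the paper's method) of the classical scheme in which the bonus is a decreasing function of the visitation count, here β/√N(s). The discretization of states into hashable keys and the parameter name `beta` are illustrative assumptions; in large or continuous state spaces, exact counting of this kind is precisely what becomes intractable.

```python
import math
from collections import defaultdict

def count_bonus(counts, state_key, beta=1.0):
    """Exploration bonus that decreases with the visitation count,
    e.g. beta / sqrt(N(s)) as in classical count-based exploration."""
    counts[state_key] += 1
    return beta / math.sqrt(counts[state_key])

counts = defaultdict(int)
# The first visit to a state yields the largest bonus;
# repeated visits yield progressively smaller ones.
b1 = count_bonus(counts, "s0")  # 1.0
b2 = count_bonus(counts, "s0")  # 1/sqrt(2) ~ 0.707
```

This works only when distinct observations can be collapsed to a small set of discrete keys; the parametric and non-parametric approaches discussed above are two ways of approximating such counts when they cannot.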
Parametric methods do not store individual states explicitly, so their capacity is not directly bounded by memory constraints; however, they are less well suited to rapid adaptation on short timescales (e.g., within a single episode). To obtain the best of both worlds, Never Give Up (NGU, Badia et al., 2020b) combines 



¹ Usually this is a pseudometric on the space of observations, since d(x, y) = 0 for x ≠ y is permitted by typical network architectures, and may be desirable as a means to discard noisy or uncontrollable features.
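The footnote's distinction can be illustrated with a toy example (the encoder below is hypothetical, not the paper's representation): an encoder that discards some observation dimensions induces a pseudometric, since two distinct observations differing only in the discarded (e.g., noisy or uncontrollable) dimensions are at distance zero.

```python
import numpy as np

def embed(obs):
    # Hypothetical encoder: keep only the first two observation
    # dimensions, discarding the rest (e.g. a noise channel).
    return np.asarray(obs, dtype=float)[:2]

def d(x, y):
    # Distance induced by the representation. This is a pseudometric,
    # not a metric: distinct observations can map to the same embedding.
    return float(np.linalg.norm(embed(x) - embed(y)))

x = [1.0, 2.0, 0.3]  # same controllable features,
y = [1.0, 2.0, 0.9]  # different value in the discarded dimension
assert x != y and d(x, y) == 0.0
```

Collapsing such observations is often desirable: states that differ only in irrelevant features should share visitation counts.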

