ROBUST EXPLORATION VIA CLUSTERING-BASED ONLINE DENSITY ESTIMATION

Anonymous

Abstract

Intrinsic motivation is a critical ingredient in reinforcement learning to enable progress when rewards are sparse. However, many existing approaches that measure the novelty of observations are brittle, or rely on restrictive assumptions about the environment which limit generality. We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method that estimates visitation counts for clusters of states that are similar according to the metric induced by a specified representation learning technique. We adapt classical clustering algorithms to the online setting to design a new type of memory that allows RECODE to efficiently track global visitation counts over thousands of episodes. RECODE can easily leverage both off-the-shelf and novel representation learning techniques. We introduce a novel generalization of the action-prediction representation that leverages transformers for multi-step predictions, which we demonstrate to be more performant on a suite of challenging 3D-exploration tasks in DM-HARD-8. We show experimentally that our approach can work with a variety of RL agents, obtaining state-of-the-art performance on Atari and DM-HARD-8, and being the first agent to reach the end-screen in Pitfall!.

1. INTRODUCTION

Exploration mechanisms are a key component of reinforcement learning (RL, Sutton & Barto, 2018) agents, especially in sparse-reward tasks where long sequences of actions must be executed before collecting a reward. The exploration problem has been studied theoretically (Kearns & Singh, 2002; Azar et al., 2017; Brafman & Tennenholtz, 2003; Auer et al., 2002; Agrawal & Goyal, 2012; Audibert et al., 2010; Jin et al., 2020) in the context of bandits (Lattimore & Szepesvári, 2020) and Markov Decision Processes (MDP, Puterman, 1990; Jaksch et al., 2010). Among these theoretical works, one simple and theoretically sound approach to efficient exploration in MDPs is to use a decreasing function of the visitation counts as an exploration bonus (Strehl & Littman, 2008; Azar et al., 2017). However, this approach is intractable with large or continuous state spaces, where generalization between states becomes essential. Two partially successful approaches have emerged to empirically estimate visitation counts/densities in deep RL, where counting is not trivial: (i) the parametric approach, which uses neural networks, and (ii) the non-parametric approach, which uses a slot-based memory to store representations of visited states, with the representation learning method serving to induce a more meaningful metric[1] between states. Parametric methods either explicitly estimate the visitation counts using density models (Bellemare et al., 2016; Ostrovski et al., 2017) or implicitly estimate them using, e.g., Random Network Distillation (RND, Burda et al., 2019; Badia et al., 2020b). Non-parametric methods rely on a memory to store encountered state representations (Badia et al., 2020b) and on representation learning to construct a metric that differentiates states meaningfully (Pathak et al., 2017).
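To make the count-based idea concrete, the following minimal tabular sketch derives an intrinsic reward as a decreasing function of the visitation count, here 1/sqrt(N(s)); the class name and interface are our own illustration, not an implementation from any cited work.

```python
from collections import defaultdict
import math

class CountBonus:
    """Tabular count-based exploration bonus: r_int(s) = 1 / sqrt(N(s))."""

    def __init__(self):
        self.counts = defaultdict(int)  # N(s), initialized to zero

    def bonus(self, state):
        # Increment the visitation count, then return the decayed bonus.
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])

cb = CountBonus()
print(cb.bonus("s0"))  # first visit  -> 1.0
print(cb.bonus("s0"))  # second visit -> ~0.707
```

This is exactly the quantity that becomes intractable with large or continuous state spaces, motivating the clustered estimate introduced below.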
Parametric methods do not store individual states explicitly and as such their capacity is not directly bound by memory constraints; but they are less well suited to rapid adaptation on short timescales (e.g., within a single episode). To obtain the best of both worlds, Never Give Up (NGU, Badia et al., 2020b) combines a short-term novelty signal based on an episodic memory with a long-term novelty signal via RND into a single intrinsic reward. However, this approach also naturally inherits the disadvantages of RND; in particular, susceptibility to uncontrollable or noisy features (see Section 5) and difficulty of tuning. More details on related works are provided in App. C.

In this paper, we propose to decompose the exploration problem into two orthogonal sub-problems. First, (i) Representation Learning, the task of learning an embedding function on observations or trajectories that encodes a meaningful notion of similarity. Second, (ii) Density Estimation, the task of estimating smoothed visitation counts to derive a novelty-based exploration bonus. We first present a general solution to (ii) which is computationally efficient and scalable to complex environments. We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method that estimates visitation counts for clusters of states that are similar according to a metric induced by any arbitrary representation. We adapt classical clustering algorithms to an online setting, resulting in a new type of memory that allows RECODE to keep track of histories of interactions spanning thousands of episodes. This is in contrast to existing non-parametric exploration methods, which store only the recent history and in practice usually only account for the current episode. The resulting exploration bonus is principled, simple, and matches or exceeds state-of-the-art exploration results on Atari, being the first agent to reach the end-screen in Pitfall!.
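The following sketch illustrates the general idea of clustering-based count estimation; it is our own simplified illustration, not RECODE's actual update rule (in particular, the eviction heuristic and the fixed radius are assumptions made for brevity).

```python
import numpy as np

class ClusterCounter:
    """Toy online clustering memory: each slot holds a cluster center in
    embedding space plus a visitation count for that cluster."""

    def __init__(self, dim, n_clusters=64, radius=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.centers = self.rng.normal(size=(n_clusters, dim))
        self.counts = np.zeros(n_clusters)
        self.radius = radius

    def update(self, emb):
        # Find the nearest cluster center to the new embedding.
        dists = np.linalg.norm(self.centers - emb, axis=1)
        i = int(np.argmin(dists))
        if dists[i] <= self.radius:
            # Close enough: count the visit and nudge the center toward emb
            # (incremental mean update).
            self.counts[i] += 1.0
            self.centers[i] += (emb - self.centers[i]) / self.counts[i]
        else:
            # Too far from every cluster: evict a low-count cluster
            # (random tie-break) and start a new one at emb.
            noise = self.rng.uniform(0, 1e-3, len(self.counts))
            i = int(np.argmin(self.counts + noise))
            self.centers[i], self.counts[i] = emb, 1.0
        # Count-based bonus from the (smoothed) cluster count.
        return 1.0 / np.sqrt(self.counts[i])
```

Because the cluster counts persist across episode boundaries, such a memory tracks long-term visitation statistics rather than only the current episode.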
In the presence of noise, we show that it strictly improves over state-of-the-art exploration bonuses such as NGU or RND. The generality of RECODE also allows us to easily leverage both off-the-shelf and novel representation learning techniques, which leads into our second contribution. Specifically, we generalize the action-prediction representations (Pathak et al., 2017), used in several state-of-the-art exploration agents, by applying transformers to masked trajectories of state and action embeddings for multi-step action prediction. We refer to this method as CASM, for Coupled Action-State Masking. In conjunction with RECODE, CASM can yield significant performance gains on the hard 3D-exploration tasks included in the DM-HARD-8 suite, achieving a new state of the art in the single-task setting.
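As a toy illustration of the masking scheme behind such multi-step action prediction (names, shapes, and the mask convention here are our assumptions, not the paper's implementation): a trajectory is laid out as interleaved state and action tokens, a contiguous span of action tokens is replaced by a mask token, and a sequence model would then be trained to predict the masked actions from the surrounding context.

```python
import numpy as np

def mask_actions(states, actions, start, length, mask_token):
    """Interleave state/action embeddings into one token sequence,
    replacing actions in [start, start+length) with a mask token."""
    tokens = []
    for t, (s, a) in enumerate(zip(states, actions)):
        tokens.append(s)  # state token for step t at index 2*t
        # action token for step t at index 2*t + 1, possibly masked
        tokens.append(mask_token if start <= t < start + length else a)
    return np.stack(tokens)

T, D = 6, 8  # trajectory length and embedding size (arbitrary here)
states = np.random.randn(T, D)
actions = np.random.randn(T, D)
seq = mask_actions(states, actions, start=2, length=3,
                   mask_token=np.zeros(D))
# seq has 2*T tokens; the actions at steps 2, 3, 4 are masked out.
```

Predicting several consecutive masked actions forces the representation to capture multi-step controllable structure, rather than only one-step inverse dynamics.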

2. BACKGROUND AND NOTATION

In this section, we provide the necessary background and notation to understand our method (see Sec. 3). First, we present a general setting of interaction between an agent and its environment. Second, we define the terms embeddings, atoms and memory. Third, we present our notation for visitation counts. Finally, we show how we derive intrinsic rewards from visitation counts.

Interaction Process between an Agent and its Environment. We consider a discrete-time interaction process (McCallum, 1995; Hutter, 2004; Hutter et al., 2009; Daswani et al., 2013) between an agent and its environment where, at each time step t ∈ N, the agent receives an observation o_t ∈ O and generates an action a_t ∈ A. We consider an environment with stochastic dynamics p : H × A → ∆O[2] that maps a history of past observations-actions and a current action to a probability distribution over future observations. More precisely, the space of past observations-actions is H = ∪_{t∈N} H_t, where H_0 = O and, for all t ∈ N, H_{t+1} = H_t × A × O. We consider policies π : H → ∆A that map a history of past observations-actions to a probability distribution over actions. Finally, an extrinsic reward function r_e : H × A → R maps a history and an action to a scalar feedback.

Embeddings, Atoms and Memory. An embedder is a parameterized function f_θ : H → E, where E is an embedding space. Typically, the embedding space is the vector space R^N, where N ∈ N* is the embedding size. Therefore, for a given time step t ∈ N, an embedder is a function f_θ that associates to any history h_t ∈ H_t a vector e_t = f_θ(h_t), called an embedding. There are several ways to train an embedder f_θ, such as using an auto-encoding loss on the observation o_t (Burda et al., 2018a), an inverse-dynamics loss (Pathak et al., 2017), or a multi-step prediction-error loss at the latent level (Guo et al., 2020; 2022). These techniques are referred to as representation learning methods.
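The embedder interface above can be sketched as follows; the fixed random linear map is a stand-in for a learned representation, and the choice to embed only the latest observation of the history is our simplifying assumption.

```python
import numpy as np

class RandomEmbedder:
    """Minimal sketch of an embedder f_theta : H -> R^N, using a fixed
    random linear projection instead of a trained network."""

    def __init__(self, obs_dim, emb_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Scaled random projection matrix (untrained stand-in for theta).
        self.W = rng.normal(size=(emb_dim, obs_dim)) / np.sqrt(obs_dim)

    def __call__(self, history):
        # Simplification: embed only the most recent observation o_t.
        o_t = history[-1]
        return self.W @ o_t  # e_t = f_theta(h_t) in R^N
```

A trained embedder (auto-encoding, inverse-dynamics, or latent prediction losses) would replace this projection while keeping the same interface.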
An atom f ∈ E is a vector in the embedding space that is contained in a memory M = {f_i ∈ E}_{i=1}^{|M|}, which is a finite slot-based container, where |M| ∈ N* is the memory size. The memory M is updated at each time step t by a non-parametric function of the memory M and the embedding e_t. In the simplest case, the memory is filled in a first-in first-out (FIFO) manner along
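The FIFO variant of such a slot-based memory can be sketched in a few lines (a minimal illustration, not the memory update used by our method):

```python
from collections import deque

class FIFOMemory:
    """Finite slot-based memory of atoms, filled first-in first-out:
    once full, inserting a new atom evicts the oldest one."""

    def __init__(self, size):
        self.slots = deque(maxlen=size)  # |M| slots

    def update(self, embedding):
        # deque with maxlen drops the oldest slot automatically when full.
        self.slots.append(embedding)

    def __len__(self):
        return len(self.slots)
```

With `maxlen` set, `deque.append` performs the eviction implicitly, so the container never exceeds |M| atoms.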



[1] Usually this is a pseudometric on the space of observations, since d(x, y) = 0 for x ≠ y is permitted by typical network architectures, and may be desirable as a means to discard noisy or uncontrollable features.
[2] We write ∆Y for the set of probability distributions over a set Y.

