IMPACT-DRIVEN EXPLORATION WITH CONTRASTIVE UNSUPERVISED REPRESENTATIONS

Abstract

Procedurally-generated sparse-reward environments pose significant challenges for many RL algorithms. The recently proposed impact-driven exploration method (RIDE) by Raileanu & Rocktäschel (2020), which rewards actions that lead to large changes (measured by ℓ2-distance) in the observation embedding, achieves state-of-the-art performance on such procedurally-generated MiniGrid tasks. Yet, the definition of "impact" in RIDE is not conceptually clear because its learned embedding space is not inherently equipped with any similarity measure, let alone the ℓ2-distance. We resolve this issue in RIDE via contrastive learning. That is, we train the embedding with respect to cosine similarity, where we define two observations to be similar if the agent can reach one from the other within a few steps, and we define impact in terms of this similarity measure. Experimental results show that our method performs similarly to RIDE on the MiniGrid benchmarks while learning a conceptually clear embedding space equipped with the cosine similarity measure. Our modification of RIDE also provides a new perspective connecting RIDE and episodic curiosity (Savinov et al., 2019), a different exploration method which rewards the agent for visiting states that are unfamiliar to the agent's episodic memory. By incorporating episodic memory into our method, we outperform RIDE on the MiniGrid benchmarks.

1. INTRODUCTION

Reinforcement learning (RL) algorithms aim to learn an optimal policy that maximizes expected reward from the environment. The search for better RL algorithms is motivated by the fact that many complex real-world problems can be formulated as RL problems. Yet, environments with sparse rewards, which often occur in the real world, pose a significant challenge for RL algorithms that rely on random actions for exploration. Sparsity of the reward can make it extremely unlikely for the agent to stumble upon any positive feedback by chance, so the agent may spend a long time exploring without receiving a single positive reward. To overcome this issue of exploration, several previous works have used intrinsic rewards (Schmidhuber, 1991; Oudeyer et al., 2007; 2008; Oudeyer & Kaplan, 2009). Intrinsic rewards, as the name suggests, are reward signals generated by the agent itself; they can make RL algorithms more sample-efficient by encouraging exploratory behavior that is more likely to encounter rewards. Previous works have used state novelty in the form of state visitation counts for tabular states (Strehl & Littman, 2008), pseudo-counts for high-dimensional state spaces (Bellemare et al., 2016; Ostrovski et al., 2017; Martin et al., 2017), prediction error of random networks (Burda et al., 2019b), and curiosity about environment dynamics (Stadie et al., 2015; Pathak et al., 2017) as intrinsic rewards. Although such advances in exploration methods have enabled RL agents to achieve high rewards in notoriously difficult sparse-reward environments such as Montezuma's Revenge and Pitfall (Bellemare et al., 2013), many existing exploration methods use the same environment for training and testing (Bellemare et al., 2016; Pathak et al., 2017; Aytar et al., 2018; Ecoffet et al., 2019). As a result, agents trained in this fashion do not generalize to new environments.
Indeed, several recent papers point out that deep RL agents overfit to the environment they were trained on (Rajeswaran et al., 2017; Zhang et al., 2018; Machado et al., 2018), leading to the creation of new benchmarks consisting of procedurally-generated environments (Cobbe et al., 2019; Risi & Togelius, 2020; Küttler et al., 2020). In practice, agents often have to act in environments that are similar to, but different from, the environments they were trained on. Hence, it is crucial that the agent learns a policy that generalizes across diverse (but similar) environments. This adds another layer of difficulty, the diversity of the environment layout for each episode, to the already challenging sparse reward structure. To tackle this challenge head-on, Raileanu & Rocktäschel (2020) focus on exploration in procedurally-generated environments and propose RIDE, an intrinsic rewarding scheme based on the "impact" of a new observation. Denoting the observation embedding function by φ, RIDE measures the impact of a new observation o′ by computing ‖φ(o′) − φ(o)‖₂, where o is the previous observation. Similarly, Savinov et al. (2019) propose episodic curiosity (EC), an intrinsic rewarding scheme which rewards visiting states that are dissimilar to the states in the agent's episodic memory. RIDE uses forward and inverse dynamics prediction (Pathak et al., 2017) to train the observation embedding φ in a self-supervised manner. Hence, a question that one might ask is: what is the ℓ2-distance in this embedding space measuring? We address this question by modifying the embedding training procedure, thereby changing the definition of impact. That is, we modify RIDE so that impact corresponds to an explicitly trained similarity measure in the embedding space, where we define two observations to be similar if they are reachable from each other within a few steps.
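To make the rewarding scheme concrete, the following is a minimal sketch of a RIDE-style intrinsic reward: the ℓ2-distance between consecutive observation embeddings, discounted by an episodic visitation count. This is an illustration under simplifying assumptions, not the authors' implementation; the function names and the scalar-count interface are hypothetical.

```python
import numpy as np

def ride_impact(phi, obs_prev, obs_next, episodic_count):
    """Sketch of a RIDE-style intrinsic reward: the l2-distance between
    consecutive observation embeddings, discounted by an episodic
    visitation count of the new state (hypothetical simplified interface)."""
    delta = phi(obs_next) - phi(obs_prev)      # change in embedding space
    impact = np.linalg.norm(delta, ord=2)      # l2 "impact" of the transition
    return impact / np.sqrt(episodic_count)    # discourage revisiting states
```

For example, with φ taken to be the identity on 2-D vectors, moving from (0, 0) to (3, 4) on a first visit (count 1) yields an intrinsic reward of 5.0, and the same transition with count 4 yields 2.5.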
The original definition of "impact" in RIDE is not conceptually clear because the learned embedding space is not inherently equipped with a similarity measure, let alone the ℓ2-distance. It is still possible that RIDE's measure of impact based on ℓ2-distance implicitly corresponds to some similarity measure in the embedding space, but we leave this investigation for future work. Our main contributions are 1) proposing a conceptually clear measure of impact by training observation embeddings explicitly with a cosine similarity objective instead of forward and inverse dynamics prediction, 2) providing a new perspective which connects RIDE and EC, and 3) outperforming RIDE via episodic memory extensions. We use SimCLR (Chen et al., 2020) to train the embedding function and propose a novel intrinsic rewarding scheme, which we name RIDE-SimCLR. As in EC, the positive pairs used in the contrastive learning component of RIDE-SimCLR correspond to pairs of observations which are within k steps of each other (referred to as "k-step reachability" in their work). Following the experimental setup of Raileanu & Rocktäschel (2020), we use MiniGrid (Chevalier-Boisvert et al., 2018) to evaluate our method, as it provides a simple, diverse suite of tasks that allows us to focus on the issue of exploration rather than on other issues such as visual perception. We focus on the comparison of our approach to RIDE since Raileanu & Rocktäschel (2020) report that RIDE achieves the best performance on all their MiniGrid benchmarks among exploration methods such as intrinsic curiosity (ICM) by Pathak et al. (2017) and random network distillation (RND) by Burda et al. (2019b). We note that MiniGrid provides a sufficiently challenging suite of tasks for RL agents despite its apparent simplicity, as ICM and RND fail to learn any effective policy for some tasks due to the difficulty posed by procedurally-generated environments.
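To illustrate the training objective, the following is a minimal NumPy sketch of SimCLR's NT-Xent loss over a batch of positive pairs, where each pair (z_a[i], z_b[i]) would hold the embeddings of two observations within k steps of each other. All names here are hypothetical, and the sketch omits the projection head and gradient computation used in practice.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarities between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """SimCLR's NT-Xent loss for a batch of n positive pairs (z_a[i], z_b[i]);
    in a RIDE-SimCLR-style setup, a pair would hold embeddings of two
    observations reachable from each other within k steps (hypothetical)."""
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)       # 2n embeddings
    sim = cosine_sim(z, z) / temperature         # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    # the positive of index i is i + n, and vice versa
    pos = np.concatenate([np.arange(n) + n, np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

Minimizing this loss pulls each positive pair together (high cosine similarity) while pushing all other in-batch embeddings apart, which is what equips the embedding space with an explicit similarity measure.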
Our experimental results show that RIDE-SimCLR performs similarly to RIDE on these benchmarks, with the added benefit of a conceptually clear similarity measure for the embedding space. Our qualitative analysis shows interesting differences between RIDE and RIDE-SimCLR. For instance, RIDE highly rewards interactions with controllable objects such as opening a door, which is not the case for RIDE-SimCLR. We also observe that our episodic memory extension improves the quantitative performance of both methods, which demonstrates the benefit of establishing a connection between RIDE and EC. The Never Give Up (NGU) agent by Badia et al. (2020) can be seen as a close relative of our memory extension of RIDE, since it uses the ℓ2-distance in an embedding space trained with the same inverse dynamics objective to compute approximate state counts and aggregates episodic memory to compute a novelty bonus. Our work differs from EC in that we do not explicitly sample negative pairs for training the observation embedding network, and we use cosine similarity, instead of a separately trained neural network, to output similarity scores for pairs of observations. We note that Campero et al. (2020) report state-of-the-art results on even more challenging MiniGrid tasks by training an adversarially motivated teacher network to generate intermediate goals for the agent (AMIGo), but we do not compare against this method since their agent receives full observations of the environment, whereas both RIDE and RIDE-SimCLR agents only receive partial observations.
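A minimal sketch of the kind of episodic-memory bonus discussed above may help fix ideas. This is an assumption-laden simplification rather than the exact EC or NGU mechanism: a new observation embedding earns a bonus only if it is not cosine-similar, beyond a hypothetical threshold, to anything already stored in the episode's memory.

```python
import numpy as np

def episodic_novelty_bonus(memory, z, threshold=0.9):
    """Hypothetical episodic-memory bonus: return 1.0 (and store z) if no
    embedding in `memory` is cosine-similar to z above `threshold`,
    otherwise return 0.0 and leave the memory unchanged."""
    if memory:
        sims = [float(m @ z / (np.linalg.norm(m) * np.linalg.norm(z)))
                for m in memory]
        if max(sims) >= threshold:
            return 0.0          # z is familiar: no bonus
    memory.append(z)            # z is novel: remember it for this episode
    return 1.0
```

Under this scheme, the first visit to a region of embedding space is rewarded and subsequent near-duplicates are not, which is the episodic ingredient we add on top of the impact-based reward.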

