PLANNING GOALS FOR EXPLORATION

Abstract

Dropped into an unknown environment, what should an agent do to quickly learn about the environment and how to accomplish diverse tasks within it? We address this question within the goal-conditioned reinforcement learning paradigm, by identifying how the agent should set its goals at training time to maximize exploration. We propose "Planning Exploratory Goals" (PEG), a method that sets goals for each training episode to directly optimize an intrinsic exploration reward. PEG first chooses goal commands such that the agent's goal-conditioned policy, at its current level of training, will end up in states with high exploration potential. It then launches an exploration policy starting at those promising states. To enable this direct optimization, PEG learns world models and adapts sampling-based planning algorithms to "plan goal commands". In challenging simulated robotics environments, including a multi-legged ant robot in a maze and a robot arm on a cluttered tabletop, PEG exploration enables more efficient and effective training of goal-conditioned policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks upon command.

1. INTRODUCTION

Complex real-world environments such as kitchens and offices afford a large number of configurations. These may be represented, for example, through the positions, orientations, and articulation states of various objects, or indeed of an agent within the environment. Such configurations could plausibly represent desirable goal states for various tasks. Given this context, we seek to develop intelligent autonomous agents that, having first spent some time exploring an environment, can afterwards configure it to match commanded goal states. The goal-conditioned reinforcement learning paradigm (GCRL) (Andrychowicz et al., 2017) offers a natural framework to train such goal-conditioned agent policies on the exploration data. Within this framework, we seek to address the central problem: how should a GCRL agent explore its environment during training time so that it can achieve diverse goals revealed to it only at test time? This requires efficient unsupervised exploration of the environment.

Exploration in the GCRL setting can naturally be reduced to the problem of setting goals for the agent during training time; the current GCRL policy, commanded to the right goals, will generate exploratory data to improve itself (Ecoffet et al., 2021; Nair et al., 2018b). Our question now reduces to the goal-directed exploration problem: how should we choose exploration-inducing goals at training time?

Prior works start by observing that the final GCRL policy will be most capable of reaching familiar states, encountered many times during training. Thus, the most direct approach to exploration is to set goals in sparsely visited parts of the state space, to directly expand the set of these familiar states (Ecoffet et al., 2021; Pong et al., 2019; Pitis et al., 2020). While straightforward, this approach suffers from several issues in practice.
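The "set goals in sparsely visited regions" idea above can be made concrete with a minimal count-based sketch: discretize states into bins, count visits per bin, and command the candidate goal whose bin has been visited least. The bin size, the candidate set, and the function name are illustrative assumptions, not the mechanism of any particular cited method.

```python
import numpy as np

def select_novel_goal(visited_states, candidate_goals, bin_size=1.0):
    """Return the candidate goal lying in the least-visited state bin.

    A hedged sketch of count-based goal selection; real methods replace
    the bin counts with learned density or novelty estimates.
    """
    def to_bin(s):
        # Discretize a continuous state into an integer grid cell.
        return tuple(np.floor(np.asarray(s) / bin_size).astype(int))

    counts = {}
    for s in visited_states:
        b = to_bin(s)
        counts[b] = counts.get(b, 0) + 1

    # Fewer visits => more novel => preferred as an exploration goal.
    return min(candidate_goals, key=lambda g: counts.get(to_bin(g), 0))
```

For example, an agent that has only visited states near the origin would be directed toward a distant, unvisited candidate goal rather than a familiar nearby one.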
First, the GCRL policy during training is not yet proficient at reaching arbitrary goals, and regularly fails to reach commanded goals, often in uninteresting ways that have low exploration value. For example, a novice agent commanded to an unseen portion of a maze environment might respond by instead reaching a previously explored part of the maze, encountering no novel states. To address this, prior works (Pitis et al., 2020; Bharadhwaj et al., 2021) set up additional mechanisms to filter out unreachable goals, typically requiring additional
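The filtering mechanism described above can be sketched in a few lines: keep only goals that some learned estimate says the current policy can reach. Here `reach_estimate(g)` is a stand-in for any such estimator (e.g. a goal-conditioned value function mapped to a success probability); both the name and the threshold are assumptions for illustration, not the specific mechanisms of the cited works.

```python
def filter_reachable_goals(candidate_goals, reach_estimate, threshold=0.5):
    """Discard goals the current policy is estimated to be unable to reach.

    A hedged sketch: `reach_estimate` maps a goal to an estimated success
    probability in [0, 1]; goals below `threshold` are filtered out so the
    agent is not commanded to goals it will fail at uninformatively.
    """
    return [g for g in candidate_goals if reach_estimate(g) >= threshold]
```

A novelty-seeking goal sampler could then be applied only to the surviving candidates, trading off exploration value against reachability.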

availability: https://sites.google.com/view/

