PLANNING GOALS FOR EXPLORATION

Abstract

Dropped into an unknown environment, what should an agent do to quickly learn about the environment and how to accomplish diverse tasks within it? We address this question within the goal-conditioned reinforcement learning paradigm, by identifying how the agent should set its goals at training time to maximize exploration. We propose "Planning Exploratory Goals" (PEG), a method that sets goals for each training episode to directly optimize an intrinsic exploration reward. PEG first chooses goal commands such that the agent's goal-conditioned policy, at its current level of training, will end up in states with high exploration potential. It then launches an exploration policy starting at those promising states. To enable this direct optimization, PEG learns world models and adapts sampling-based planning algorithms to "plan goal commands". In challenging simulated robotics environments including a multi-legged ant robot in a maze, and a robot arm on a cluttered tabletop, PEG exploration enables more efficient and effective training of goal-conditioned policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks upon command.

1. INTRODUCTION

Complex real-world environments such as kitchens and offices afford a large number of configurations. These may be represented, for example, through the positions, orientations, and articulation states of various objects, or indeed of an agent within the environment. Such configurations could plausibly represent desirable goal states for various tasks.

Given this context, we seek to develop intelligent autonomous agents that, having first spent some time exploring an environment, can afterwards configure it to match commanded goal states. The goal-conditioned reinforcement learning (GCRL) paradigm (Andrychowicz et al., 2017) offers a natural framework to train such goal-conditioned agent policies on the exploration data. Within this framework, we seek to address the central problem: how should a GCRL agent explore its environment during training time so that it can achieve diverse goals revealed to it only at test time? This requires efficient unsupervised exploration of the environment. Exploration in the GCRL setting can naturally be reduced to the problem of setting goals for the agent during training time; the current GCRL policy, commanded to the right goals, will generate exploratory data to improve itself (Ecoffet et al., 2021; Nair et al., 2018b). Our question now reduces to the goal-directed exploration problem: how should we choose exploration-inducing goals at training time?

Prior works start by observing that the final GCRL policy will be most capable of reaching familiar states, encountered many times during training. Thus, the most direct approach to exploration is to set goals in sparsely visited parts of the state space, to directly expand the set of these familiar states (Ecoffet et al., 2021; Pong et al., 2019; Pitis et al., 2020). While straightforward, this approach suffers from several issues in practice.
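As a concrete illustration of this "set goals in sparsely visited states" idea, the sketch below picks the candidate goal lying in the least-visited region of a discretized 2D state space. The grid discretization, the 2D unit-square state space, and the function names are assumptions made for this example, not details of any particular prior method.

```python
import numpy as np

def novelty_goal(visited_states, candidate_goals, n_bins=20):
    """Pick the candidate goal in the least-visited region of a 2D state space.

    Illustrative sketch only: a coarse visit-count histogram stands in for the
    novelty estimators used by actual methods.
    """
    # Histogram visit counts over a coarse grid of the unit-square state space.
    counts, xedges, yedges = np.histogram2d(
        visited_states[:, 0], visited_states[:, 1],
        bins=n_bins, range=[[0.0, 1.0], [0.0, 1.0]],
    )
    # Look up the visit count of the bin each candidate goal falls into.
    xi = np.clip(np.digitize(candidate_goals[:, 0], xedges) - 1, 0, n_bins - 1)
    yi = np.clip(np.digitize(candidate_goals[:, 1], yedges) - 1, 0, n_bins - 1)
    goal_counts = counts[xi, yi]
    # Command the goal whose neighborhood has been visited least often.
    return candidate_goals[np.argmin(goal_counts)]
```

Commanding such frontier goals expands the familiar-state set, but, as discussed next, it ignores whether the current policy can actually reach them.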
First, the GCRL policy during training is not yet proficient at reaching arbitrary goals, and regularly fails to reach commanded goals, often in uninteresting ways that have low exploration value. For example, a novice agent commanded to an unseen portion of a maze environment might respond by instead reaching a previously explored part of the maze, encountering no novel states. To address this, prior works (Pitis et al., 2020; Bharadhwaj et al., 2021) set up additional mechanisms to filter out unreachable goals, typically requiring additional hyperparameters.

Second, recent works (Ecoffet et al., 2021; Yang et al., 2022; Pitis et al., 2020) have observed improved exploration in long-horizon tasks by extending training episodes. Specifically, rather than resetting immediately after deploying the GCRL policy to a goal, these methods launch a new exploration phase right afterwards, such as by selecting random actions (Pitis et al., 2020; Kamienny et al., 2022) or by maximizing an intrinsic motivation reward (Guo et al., 2020). In this context, even successfully reaching a rare state through the GCRL policy might be suboptimal; many such states might be poor launchpads for the exploration phase that follows. For example, the GCRL policy might end up in a novel dead end in the maze, from which all exploration is doomed to fail.

To avoid these shortcomings and focus exploration on the most promising parts of the environment, we propose to leverage planning with world models in a new goal-directed exploration algorithm, PEG (short for "Planning Exploratory Goals"). Our key idea is to optimize directly for goal commands that would induce high exploration value trajectories, cognizant of current shortcomings in the GCRL policy, and of the exploration phase during training. Note that this does not mean merely commanding the agent to novel or rarely observed states.
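The extended two-phase episode structure described above (pursue a goal, then keep exploring from wherever the goal-conditioned policy ends up) can be sketched as follows. The `env`, `goal_policy`, and `explore_policy` interfaces are illustrative assumptions, not the interfaces of any of the cited methods.

```python
def two_phase_episode(env, goal_policy, explore_policy, goal,
                      go_steps=50, explore_steps=50):
    """Collect one training episode in two phases.

    Sketch under assumed interfaces: env.reset() -> state,
    env.step(action) -> next state; both policies are plain callables.
    """
    traj = []
    s = env.reset()
    # Phase 1 ("Go"): the goal-conditioned policy pursues the commanded goal.
    for _ in range(go_steps):
        a = goal_policy(s, goal)
        s_next = env.step(a)
        traj.append((s, a, s_next))
        s = s_next
    # Phase 2 ("Explore"): continue from the reached state with an exploration
    # policy, e.g. random actions or an intrinsic-reward maximizer.
    for _ in range(explore_steps):
        a = explore_policy(s)
        s_next = env.step(a)
        traj.append((s, a, s_next))
        s = s_next
    return traj
```

The value of the whole episode thus depends on where phase 1 terminates, which is exactly why a reachable but dead-end state makes a poor goal command.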
Instead, PEG might command the agent to a previously observed state, or indeed, even to a physically implausible state (see Figure 1). PEG only cares that the command will induce the chained GCRL and exploration phases together to generate interesting training trajectories, valuable for policy improvement.

Our key contributions are as follows. We propose a novel paradigm for goal-directed exploration that directly optimizes goal selection to generate trajectories with high exploration value. Next, we show how learned world models permit an effective implementation of goal command planning, by adapting planning algorithms that are more commonly used for low-level action sequence planning. We validate our approach, PEG, on challenging simulated robotics settings, including a multi-legged ant robot in a maze and a robot arm on a cluttered tabletop. In each environment, PEG exploration enables more efficient and effective training of generalist GCRL policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks.
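To make "planning goal commands" concrete, here is a minimal sketch of a sampling-based search over goals in the style of the cross-entropy method, one common sampling-based planner. The `score_fn` callable stands in for "roll out the goal-conditioned and exploration policies inside a learned world model and measure the exploration value of the imagined trajectory"; treating it as an arbitrary black-box function, along with all names and hyperparameters here, is an assumption of this sketch rather than PEG's exact objective or implementation.

```python
import numpy as np

def plan_exploratory_goal(score_fn, goal_dim, iters=5, pop=64, elites=8, seed=0):
    """Search over goal commands, keeping those whose imagined rollouts score
    highest under an exploration objective (cross-entropy-method style).
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(goal_dim), np.ones(goal_dim)
    for _ in range(iters):
        # Sample a population of candidate goal commands.
        goals = rng.normal(mu, sigma, size=(pop, goal_dim))
        # Score each candidate, e.g. by imagined exploration value.
        scores = np.array([score_fn(g) for g in goals])
        # Refit the sampling distribution to the highest-scoring candidates.
        elite = goals[np.argsort(scores)[-elites:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu  # planned goal command for the next training episode
```

Because the search is over goal commands rather than action sequences, nothing constrains the returned goal to be a reachable or even physically plausible state; only its predicted downstream exploration value matters.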

2. PROBLEM SETUP AND BACKGROUND

We wish to build agents that can efficiently explore an environment to autonomously acquire diverse environment-relevant capabilities. Specifically, in our problem setting, the agent is dropped into an unknown environment with no specification of the tasks that might be of interest afterwards. Over episodes of unsupervised exploration, it must learn about its environment, the various "tasks" that it affords to the agent, and also how to perform those tasks effectively. We focus on goal state-reaching tasks. After this exploration, a successful agent would be able to reach diverse, previously unknown goal states in the environment upon command. To achieve this, our method, Planning Exploratory Goals (PEG), focuses on improving unsupervised exploration within the goal-conditioned reinforcement learning framework. We set up notation and preliminaries below, before discussing our approach in Section 3.

Preliminaries. We formalize the unsupervised exploration stage within a goal-conditioned Markov decision process, defined by the tuple (S, A, T, G). The unsupervised goal-conditioned MDP contains neither the test-time goal distribution nor a reward function. At each time t, a goal-conditioned policy π_G(· | s_t, g), in the current state s_t ∈ S and under goal command g ∈ G, selects an action a_t ∈ A, and the agent transitions to the next state s_{t+1} with probability T(s_{t+1} | s_t, a_t). To enable handling of the most broadly expressive goal-reaching tasks, we set the goal space to be identical to the state space: G = S, i.e., every state maps to a plausible goal-reaching command. In our setting, an agent must learn a goal-conditioned reinforcement learning (GCRL) policy π_G(· | s, g) as well as collect exploratory data to train this policy on. Thus, as motivated in Section 1, it



Figure 1: PEG exploration in a U-maze. Brown background dots: explored states; stars: commanded goals; colored lines: resulting paths. (Left) PEG optimizes directly for exploration, even setting unseen goals, and achieving farther-reaching paths. (Right) Setting goals at the frontier of the seen state distribution yields less exploration.

Availability: https://sites.google.com/view/

