RAPID TASK-SOLVING IN NOVEL ENVIRONMENTS

Abstract

We propose the challenge of rapid task-solving in novel environments (RTS), wherein an agent must solve a series of tasks as rapidly as possible in an unfamiliar environment. An effective RTS agent must balance between exploring the unfamiliar environment and solving its current task, all while building a model of the new environment over which it can plan when faced with later tasks. While modern deep RL agents exhibit some of these abilities in isolation, none are suitable for the full RTS challenge. To enable progress toward RTS, we introduce two challenge domains: (1) a minimal RTS challenge called the Memory&Planning Game and (2) One-Shot StreetLearn Navigation, which introduces scale and complexity from real-world data. We demonstrate that state-of-the-art deep RL agents fail at RTS in both domains, and that this failure is due to an inability to plan over gathered knowledge. We develop Episodic Planning Networks (EPNs) and show that deep-RL agents with EPNs excel at RTS, outperforming the nearest baseline by factors of 2-3 and learning to navigate held-out StreetLearn maps within a single episode. We show that EPNs learn to execute a value iteration-like planning algorithm and that they generalize to situations beyond their training experience.

1. INTRODUCTION

An ideal AI system would be useful immediately upon deployment in a new environment, and would become more useful as it gained experience there. Consider, for example, a household robot deployed in a new home. Ideally, the new owner could turn the robot on and ask it to get started, say, by cleaning the bathroom. The robot would use general knowledge about household layouts to find the bathroom and cleaning supplies. As it carried out this task, it would gather information for use in later tasks, noting for example where the clothes hampers are in the rooms it passes. When faced with its next task, say, doing the laundry, it would use its newfound knowledge of the hamper locations to efficiently collect the laundry.

Humans make this kind of rapid task-solving in novel environments (RTS) look easy (Lake et al., 2017), but as yet it remains an aspiration for AI. Prominent deep RL systems display some of the key abilities required, namely, exploration and planning. But they need many episodes over which to explore (Ecoffet et al., 2019; Badia et al., 2020) and to learn models for planning (Schrittwieser et al., 2019). This is in part because they treat each new environment in isolation, relying on generic exploration and planning algorithms. We propose to overcome this limitation by treating RTS as a meta-reinforcement learning (RL) problem, where agents learn exploration policies and planning algorithms from a distribution over RTS challenges. Our contributions are to:

1. Develop two domains for studying meta-learned RTS: the minimal and interpretable Memory&Planning Game and the scaled-up One-Shot StreetLearn.
2. Show that previous meta-RL agents fail at RTS because of limitations in their ability to plan using recently gathered information.
3. Introduce Episodic Planning Networks (EPNs) and show that deep-RL agents equipped with EPNs excel at RTS.
4. Show that EPNs learn exploration and planning algorithms that generalize to larger problems than those seen in training.
5. Demonstrate that EPNs learn a value iteration-like planning algorithm that iteratively propagates information about state-state connectivity outward from the goal state.
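For context, the computation this contribution refers to resembles classical value iteration on the environment's transition graph. The sketch below is purely illustrative (it is tabular value iteration, not the EPN itself) and shows how value propagates outward from the goal state on a small deterministic graph:

```python
# Illustrative only (not the paper's agent): tabular value iteration on a
# small deterministic graph, showing value propagating outward from the goal.
def value_iteration(neighbors, goal, gamma=0.9, iters=10):
    """neighbors: dict mapping each state to its list of reachable states."""
    values = {s: 0.0 for s in neighbors}
    values[goal] = 1.0
    for _ in range(iters):
        new = {}
        for s in neighbors:
            if s == goal:
                new[s] = 1.0  # the goal state anchors the propagation
            else:
                new[s] = gamma * max(values[n] for n in neighbors[s])
        values = new
    return values

# A 4-state chain 0 - 1 - 2 - 3 with the goal at state 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
v = value_iteration(chain, goal=3)
# After a few iterations, states closer to the goal hold higher value:
# v[3] > v[2] > v[1] > v[0].
```

Each iteration extends the "frontier" of informative values by one step of graph connectivity, which is the qualitative behavior the paper attributes to the learned planner.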

2. PROBLEM FORMULATION

Our objective is to build agents that can maximize reward over a sequence of tasks in a novel environment. Our basic approach is to have agents learn to do this through exposure to distributions over multi-task environments. To define such a distribution, we first formalize the notion of an environment e as a 4-tuple (S, A, P_a, R) consisting of states, actions, a state-action transition function, and a distribution over reward functions. We then define the notion of a task t in environment e as a Markov decision process (MDP) (S, A, P_a, r) that results from sampling a reward function r from R.

We can now define a framework for learning to solve tasks in novel environments as a simple generalization of the popular meta-RL framework (Wang et al., 2017; Duan et al., 2016). In meta-RL, the agent is trained on MDPs (S, A, P_a, r) sampled from a task distribution D. In the rapid task-solving in novel environments (RTS) paradigm, we instead sample problems by first sampling an environment e from an environment distribution E, then sequentially sampling tasks, i.e. MDPs, from that environment's reward function distribution (see Figure 1). An agent can be trained for RTS by maximizing the objective E_{e∼E}[E_{r∼R_e}[J_{e,r}(θ)]], where J_{e,r} is the expected reward in environment e under reward function r. When there is only one reward function per environment, the inner expectation disappears and we recover the usual meta-RL objective E_{e∼E}[J_e(θ)]. RTS can be viewed as meta-RL with the added complication that the reward function changes within the inner learning loop.
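The two-level sampling scheme can be made concrete with a short sketch. The environment class, step API, and policy below are illustrative stand-ins (not the paper's code): an episode fixes one environment e ~ E, then repeatedly resamples tasks (goal states) from it as each task is solved.

```python
import random

class RingEnv:
    """Toy environment: n states on a ring; actions move left or right."""
    def __init__(self, n):
        self.n = n
    def sample_task(self):
        # r ~ R_e: a new goal (and start) defines a new reward function.
        return random.randrange(self.n), random.randrange(self.n)
    def step(self, state, action, goal):
        state = (state + (1 if action else -1)) % self.n
        solved = state == goal
        return state, (1.0 if solved else 0.0), solved

def rts_episode(sample_env, policy, num_steps):
    env = sample_env()                   # e ~ E: one environment per episode
    state, goal = env.sample_task()      # first task in this environment
    total_reward = 0.0
    for _ in range(num_steps):
        state, reward, solved = env.step(state, policy(state, goal), goal)
        total_reward += reward
        if solved:                       # resample a task within the episode
            state, goal = env.sample_task()
    return total_reward

random.seed(0)
ret = rts_episode(lambda: RingEnv(random.choice([4, 6, 8])),
                  policy=lambda s, g: random.random() < 0.5,
                  num_steps=100)
```

Note that when each environment admits only one task, the inner resampling never fires and the loop reduces to the standard meta-RL episode, mirroring the collapse of the inner expectation above.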



Figure 1: (a) Rapid Task Solving in Novel Environments (RTS) setup. A new environment is sampled in every episode. Each episode consists of a sequence of tasks which are defined by sampling a new goal state and a new initial state. The agent has a fixed number of steps per episode to complete as many tasks as possible. (b) Episodic Planning Network (EPN) architecture. The EPN uses multiple iterations of a single shared self-attention function over memories retrieved from an episodic storage.
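The planning step in (b) can be sketched as iterated self-attention with shared weights over a set of retrieved memory embeddings. The dimensions, residual update, and single-head form below are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over rows of X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V

def epn_plan(memories, params, num_iters=3):
    """Apply one shared attention function repeatedly (same weights each step)."""
    X = memories
    for _ in range(num_iters):
        X = X + self_attention(X, *params)          # residual update
    return X

rng = np.random.default_rng(0)
d = 8
memories = rng.normal(size=(16, d))   # e.g. 16 retrieved transition embeddings
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]  # Wq, Wk, Wv shared
out = epn_plan(memories, params)      # shape preserved: (16, d)
```

The key property this sketch captures is weight sharing across iterations: each pass lets every memory attend to every other, so repeated application can propagate connectivity information across multiple hops, consistent with the value iteration-like behavior the paper reports.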

