SPRINT: SCALABLE SEMANTIC POLICY PRE-TRAINING VIA LANGUAGE INSTRUCTION RELABELING

Abstract

We propose SPRINT, a scalable offline policy pre-training approach based on natural language instructions. SPRINT pre-trains an agent's policy to execute a diverse set of semantically meaningful skills that the agent can leverage to learn new tasks faster. Prior work on offline pre-training required tedious manual definition of pre-training tasks or learned semantically meaningless skills via random goal-reaching. Instead, our approach, SPRINT (Scalable Pre-training via Relabeling Language INsTructions), leverages natural language instruction labels on offline agent experience, collected at scale (e.g., via crowd-sourcing), to define a rich set of tasks with minimal human effort. Furthermore, because tasks are defined in natural language, SPRINT can use large language models to automatically expand the initial task set. As a result, during pre-training we can learn an extensive collection of new skills via offline RL by relabeling and aggregating task instructions, even across multiple trajectories. Experiments in ALFRED, a realistic household simulator, show that agents pre-trained with SPRINT learn new long-horizon household tasks substantially faster than with previous pre-training approaches.

1. INTRODUCTION

When humans learn a new task, e.g., how to cook a new dish, we rely on a large repertoire of previously learned skills, like "chopping vegetables" or "boiling pasta", that make learning more efficient. Improving learning efficiency is crucial for the practical deployment of artificial agents; thus, many works in reinforcement learning (RL) aim to equip agents with a similar set of skills. To autonomously acquire such skills, recent works optimize for diverse agent behaviors (Eysenbach et al., 2019; Sharma et al., 2020; Mendonca et al., 2021), imitate short action sequences (Lynch et al., 2020; Pertsch et al., 2020), or reach randomly sampled goal states (Chebotar et al., 2021) from pre-collected experience. However, such objectives may result in the agent learning skills that are not semantically plausible in practice, e.g., "placing a knife in the microwave" or "half-closing the microwave door." To focus pre-training on plausible skills, one could instead manually curate a set of pre-training tasks for the policy, but this requires tedious reward function design and does not scale well beyond a few dozen tasks (Yu et al., 2019). Yet, defining a large set of pre-training tasks is crucial: only a policy with a wide range of skills can accelerate learning on many downstream tasks. How can we define a large set of meaningful pre-training tasks in a scalable manner?

In this paper, we propose to leverage natural language instructions to define a large number of semantically meaningful tasks for policy pre-training. Natural language has recently been used to allow humans to effectively interact with agents (Lynch & Sermanet, 2021) or to generate long-horizon plans (Ahn et al., 2022).
In the context of defining pre-training tasks, using natural language has two important benefits: (1) language is a natural and expressive interface for humans to specify tasks (in contrast to, e.g., numerical reward functions), as it is the primary way we communicate tasks in everyday life. Thus, even non-experts can define tasks easily via language instructions. (2) By specifying pre-training tasks via natural language, we can leverage the knowledge captured in large language models to automatically generate more tasks through instruction relabeling. To combine both benefits, we introduce SPRINT (Scalable Pre-training via Relabeling Language INsTructions), a scalable pre-training approach that equips policies with a repertoire of semantically meaningful skills (see Figure 1 for an illustration). SPRINT has three core components: (1) language-conditioned offline RL, (2) LLM-based skill aggregation, and (3) cross-trajectory skill chaining.
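To make the relabeling idea concrete, the following is a minimal sketch (not the paper's implementation) of how instruction-labeled trajectory segments could be aggregated into additional pre-training tasks: every contiguous sub-sequence of segments becomes a new task, with composite labels standing in for what an LLM would summarize. The function name `aggregate_instructions`, the segment format, and the ", then "-joined placeholder labels are all illustrative assumptions.

```python
def aggregate_instructions(segments):
    """Relabel one annotated trajectory into a set of pre-training tasks.

    segments: list of (instruction, transitions) pairs, in temporal order,
              where `transitions` is the list of (s, a, s') tuples the
              instruction describes.
    Returns a list of (label, transitions) tasks, one per contiguous
    sub-sequence of segments.
    """
    tasks = []
    for i in range(len(segments)):
        transitions = []
        for j in range(i, len(segments)):
            transitions = transitions + segments[j][1]
            if j == i:
                # single segment: keep the original human annotation
                label = segments[i][0]
            else:
                # composite task: placeholder for an LLM-generated summary
                # of the instructions for segments i..j
                label = ", then ".join(s[0] for s in segments[i:j + 1])
            tasks.append((label, list(transitions)))
    return tasks
```

For a trajectory annotated with "pick up the knife" and "put the knife in the sink", this yields the two original tasks plus one composite task covering both segments; in SPRINT, the composite label would instead come from an LLM summary (e.g., a higher-level description of tidying the knife away).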

