INSTRUCTION-FOLLOWING AGENTS WITH JOINTLY PRE-TRAINED VISION-LANGUAGE MODELS

Abstract

Humans excel at combining language understanding and visual perception to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained vision-language models typically come with divided language and visual representations, requiring specialized network architectures to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on the encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and on natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better scalability and generalization ability than prior work.1

1. INTRODUCTION

Humans are able to combine language understanding and visual perception to accomplish a wide range of tasks, from driving to whiteboard discussion and cooking. Humans can also generalize to new tasks by building upon knowledge acquired from previously seen tasks. Meanwhile, creating generic instruction-following agents that can generalize to multiple tasks and environments is one of the central challenges of reinforcement learning (RL) and robotics. Driven by significant advances in learning generic pre-trained models for language understanding (Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022), recent work has made great progress towards building instruction-following agents (Lynch & Sermanet, 2020; Mandlekar et al., 2021; Ahn et al., 2022; Jang et al., 2022; Guhur et al., 2022; Shridhar et al., 2022b). For example, SayCan (Ahn et al., 2022) exploits PaLM models (Chowdhery et al., 2022) to generate language descriptions of step-by-step plans from language instructions, then executes the plans by mapping the steps to predefined macro actions. HiveFormer (Guhur et al., 2022) uses a pre-trained language encoder to generalize to multiple manipulation tasks. However, a remaining challenge is that pure language-only pre-trained models are disconnected from visual representations, making it difficult to differentiate vision-related semantics such as colors. Therefore, visual semantics have to be further learned to connect language instructions and visual inputs. Another category of methods uses pre-trained vision-language models, which have shown great success in joint visual and language understanding (Radford et al., 2021). This has led to tremendous progress towards creating a general RL agent (Zeng et al., 2022; Khandelwal et al., 2022; Nair et al., 2022b; Shridhar et al., 2022a).
For example, CLIPort (Shridhar et al., 2022a) uses the CLIP (Radford et al., 2021) vision and language encoders to solve manipulation tasks. However, a drawback is that such models come with limited language understanding compared to pure language-only pre-trained models like BERT (Devlin et al., 2018), lacking the ability to follow long and detailed instructions. In addition, the representations of visual input and textual input are often disjointly learned, so such methods typically require designing specialized network architectures on top of the pre-trained models to fuse them together.

To address the above challenges, we introduce InstructRL, a simple yet effective method based on the multimodal transformer (Vaswani et al., 2017; Tsai et al., 2019). It first encodes fine-grained cross-modal alignment between vision and language using a pre-trained multimodal encoder (Geng et al., 2022), which is a large transformer (Vaswani et al., 2017; He et al., 2022) jointly trained on image-text (Changpinyo et al., 2021; Thomee et al., 2016) and text-only data (Devlin et al., 2018). The generic representations of each camera view and the instructions form a sequence, and are concatenated with the embeddings of proprioception data and actions. These tokens are fed into a multimodal policy transformer, which jointly models dependencies between the current and past observations, and cross-modal alignment between the instruction and views from multiple cameras. Based on the output representations from our multimodal transformer, we predict 7-DoF actions, i.e., position, rotation, and state of the gripper. We evaluate InstructRL on RLBench (James et al., 2020), measuring capabilities for single-task learning, multi-task learning, multi-variation generalization, long-instruction following, and model scalability.
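The observation-to-action pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the pre-trained multimodal encoder is replaced by a random-projection stub, the policy transformer by a pooling stub, and all dimensions (`EMB`, `N_CAMERAS`, `PROPRIO_DIM`) are hypothetical choices for illustration. The 8-dimensional action head (3 position + 4 quaternion rotation + 1 gripper state) is one plausible parameterization of the 7-DoF action.

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
EMB = 32          # embedding size produced by the multimodal encoder stub
N_CAMERAS = 3     # e.g. wrist, left-shoulder, right-shoulder views
PROPRIO_DIM = 4   # proprioception scalars fed alongside visual tokens

rng = np.random.default_rng(0)

def encode_multimodal(image, instruction_tokens):
    """Stand-in for the pre-trained multimodal encoder: maps one camera
    image plus the instruction tokens to a single EMB-dim token."""
    feat = np.concatenate([image.ravel()[:8], instruction_tokens[:8]])
    w = rng.standard_normal((feat.size, EMB)) * 0.01
    return feat @ w

def build_token_sequence(history):
    """Concatenate, per timestep: encoded camera views, a proprioception
    embedding, and the embedding of the previous action."""
    tokens = []
    for obs in history:
        for cam_img in obs["images"]:
            tokens.append(encode_multimodal(cam_img, obs["instruction"]))
        tokens.append(np.pad(obs["proprio"], (0, EMB - PROPRIO_DIM)))
        tokens.append(np.pad(obs["prev_action"], (0, EMB - 8)))
    return np.stack(tokens)

def predict_action(tokens):
    """Policy stub: pool the token sequence and project to an action of
    position (x, y, z), rotation quaternion, and binary gripper state."""
    pooled = tokens.mean(axis=0)
    w = rng.standard_normal((EMB, 8)) * 0.01
    out = pooled @ w
    position, rotation, gripper = out[:3], out[3:7], out[7] > 0
    return position, rotation, gripper
```

In the actual method, the stubs above correspond to the jointly pre-trained vision-language encoder and the policy transformer; the point of the sketch is the token layout, i.e., that per-camera visual-language tokens, proprioception, and past actions share one sequence over the full history.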
Across all 74 tasks, spanning 9 categories (see Figure 1 for example tasks), our InstructRL significantly outperforms state-of-the-art models (Shridhar et al., 2022a; Guhur et al., 2022; Liu et al., 2022), demonstrating the effectiveness of jointly pre-trained vision-language representations. Moreover, InstructRL not only excels at following basic language instructions, but is also able to benefit from human-written long and detailed language instructions. We also demonstrate that InstructRL generalizes to new instructions that represent variations of the task unseen during training, and shows excellent model scalability, with performance continuing to increase with larger model size.

2. RELATED WORK

Language-conditioned RL with pre-trained language models. Pre-trained language models have been shown to improve the generalization capabilities of language-conditioned agents to new instructions and to new low-level tasks (Lynch & Sermanet, 2020; Hill et al., 2020; Nair et al., 2022a; Jang et al., 2022; Ahn et al., 2022; Huang et al., 2022). Some prior work uses prompt engineering with large language models (Brown et al., 2020; Chowdhery et al., 2022) to extract temporally extended plans over predefined skills (Huang et al., 2022; Ahn et al., 2022; Jiang et al., 2022), similar to work that decomposes high-level actions into sub-goals (Team et al., 2021). These works rely purely on language models to drive agents and require converting observations into language through predefined APIs. Others combine pre-trained language representations with visual inputs (Jang et al.,



1 The code of InstructRL is available at https://sites.google.com/view/instructrl/



Figure 1: Examples of RLBench tasks considered in this work. Left: InstructRL can perform multiple tasks from RLBench given language instructions, by leveraging the representations of a pre-trained vision-language transformer model, and learning a transformer policy. Right: Each task can be composed of multiple variations that share the same skills but differ in objects. For example, in the block stacking task, InstructRL can generalize to varying colors and ordering of the blocks.

