INSTRUCTION-FOLLOWING AGENTS WITH JOINTLY PRE-TRAINED VISION-LANGUAGE MODELS

Abstract

Humans excel at combining language understanding and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained vision-language models typically come with separate language and visual representations, requiring specialized network architectures to fuse them. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on the encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained and trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better scalability and generalization than prior work.¹
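The data flow described in the abstract can be sketched at a high level as follows. This is a minimal illustrative sketch, not the authors' implementation: the names (`encode_multimodal`, `policy_step`) are hypothetical, and the pre-trained multimodal transformer and policy transformer are replaced by random-projection stubs so the example stays self-contained. What it shows is only the control flow: each step fuses the current visual observation with the instruction into a joint embedding, and the policy attends over the full history of (embedding, action) pairs to pick the next action autoregressively.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 16        # joint embedding size (illustrative)
N_ACTIONS = 4   # size of a hypothetical discrete action space

def encode_multimodal(image_feats, instruction_tokens):
    """Stub for the pre-trained multimodal transformer: fuses visual
    features and instruction tokens into one joint embedding vector."""
    # Hypothetical fusion: project each modality and average the results.
    img = image_feats @ rng.standard_normal((image_feats.shape[-1], EMB))
    txt = instruction_tokens @ rng.standard_normal((instruction_tokens.shape[-1], EMB))
    return (img.mean(axis=0) + txt.mean(axis=0)) / 2.0

def policy_step(history):
    """Stub for the policy transformer: pools the full history of
    embeddings and scores the next action."""
    context = np.mean([emb for emb, _ in history], axis=0)
    logits = context @ rng.standard_normal((EMB, N_ACTIONS))
    return int(np.argmax(logits))

# Autoregressive rollout: each step appends the new observation embedding
# and the chosen action to the history, mirroring the abstract's description.
instruction = rng.standard_normal((5, 8))    # 5 instruction tokens, dim 8
history = []
for t in range(3):
    obs = rng.standard_normal((10, 8))       # 10 visual patch features, dim 8
    emb = encode_multimodal(obs, instruction)
    action = policy_step(history + [(emb, None)])
    history.append((emb, action))

print(len(history))                                   # 3 steps recorded
print(all(0 <= a < N_ACTIONS for _, a in history))    # actions are valid
```

In the actual method both stubs are transformers, and the multimodal encoder is frozen or fine-tuned from large-scale image-text pre-training rather than randomly initialized.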

1. INTRODUCTION

Humans are able to understand language and vision to accomplish a wide range of tasks. Many tasks, from driving to whiteboard discussion to cooking, require both language understanding and visual perception. Humans can also generalize to new tasks by building upon knowledge acquired from previously seen tasks. Meanwhile, creating generic instruction-following agents that can generalize to multiple tasks and environments is one of the central challenges of reinforcement learning (RL) and robotics. Driven by significant advances in learning generic pre-trained models for language understanding (Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022), recent work has made great progress towards building instruction-following agents (Lynch & Sermanet, 2020; Mandlekar et al., 2021; Ahn et al., 2022; Jang et al., 2022; Guhur et al., 2022; Shridhar et al., 2022b). For example, SayCan (Ahn et al., 2022) exploits PaLM models (Chowdhery et al., 2022) to generate language descriptions of step-by-step plans from language instructions, then executes the plans by mapping the steps to predefined macro actions. HiveFormer (Guhur et al., 2022) uses a pre-trained language encoder to generalize to multiple manipulation tasks. However, a remaining challenge is that pure language-only pre-trained models are disconnected from visual representations, making it difficult to differentiate vision-related semantics such as colors. Visual semantics must therefore be learned separately to connect language instructions with visual inputs. Another category of methods uses pre-trained vision-language models, which have shown great success in joint visual and language understanding (Radford et al., 2021), and has made tremendous progress towards creating a general RL agent (Zeng et al., 2022; Khandelwal et al., 2022; Nair et al., 2022b; Shridhar et al., 2022a).
For example, CLIPort (Shridhar et al., 2022a) uses the CLIP (Radford et al., 2021) vision encoder and language encoder to solve manipulation



¹The code of InstructRL is available at https://sites.google.com/view/instructrl/

