VIMA: GENERAL ROBOT MANIPULATION WITH MULTIMODAL PROMPTS

Abstract

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. These are often considered different tasks and tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts, interleaving textual and visual tokens. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. VIMA achieves strong scalability in both model capacity and data size. It outperforms prior state-of-the-art methods in the hardest zero-shot generalization setting by up to 2.9× task success rate given the same training data. With 10× less training data, VIMA still performs 2.7× better than the top competing approach.

1. INTRODUCTION

Transformers have given rise to remarkable multi-task consolidation across many AI domains. For example, users can describe a task to GPT-3 (Brown et al., 2020) using a natural language prompt, allowing the same model to perform question answering, machine translation, text summarization, etc. Prompt-based learning provides an accessible and flexible interface for communicating a natural language understanding task to a general-purpose model.


We envision that a generalist robot agent should have a similarly intuitive and expressive interface for task specification. What does such an interface for robot learning look like? As a motivating example, consider a personal robot tasked with household activities. We can ask the robot to bring us a cup of water with a simple natural language instruction. If we require more specificity, we can instead instruct the robot to "bring me <image of the cup>". For tasks requiring new skills, the robot should be able to adapt, preferably from a few video demonstrations (Duan et al., 2017). Tasks that require interaction with unfamiliar objects can be easily explained via a few image examples for novel concept grounding (Hermann et al., 2017). Finally, to ensure safe deployment, we can further specify visual constraints like "do not enter <image> room".


To enable a single agent with all these capabilities, we make three key contributions in this work: 1) a novel multimodal prompting formulation that converts a wide spectrum of robot manipulation tasks into one sequence modeling problem; 2) a new robot agent model capable of multi-task learning and zero-shot generalization; and 3) a large-scale benchmark with diverse tasks to systematically evaluate the scalability and generalization of our agents.
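The sequence modeling formulation can be illustrated with a minimal sketch: the agent conditions on a multimodal prompt and the interaction history, and decodes motor actions one step at a time. All names below (`rollout`, `DummyAgent`, `DummyEnv`) are hypothetical illustrations, not VIMA's actual API.

```python
# Illustrative sketch (hypothetical names, not VIMA's actual interface):
# an agent conditioned on a multimodal prompt decodes motor actions
# autoregressively in a closed control loop.

class DummyAgent:
    """Stand-in policy: returns a fixed primitive action for any context."""
    def predict(self, prompt_tokens, obs):
        return "pick_place"

class DummyEnv:
    """Stand-in tabletop environment that terminates after 3 steps."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return f"obs_{self.t}"
    def step(self, action):
        self.t += 1
        return f"obs_{self.t}", self.t >= 3  # (observation, done)

def rollout(agent, env, prompt_tokens, max_steps=20):
    """Control loop: at each step, condition on the prompt and the latest
    observation, emit one action, and feed it back to the environment."""
    obs = env.reset()
    actions = []
    for _ in range(max_steps):
        action = agent.predict(prompt_tokens, obs)
        actions.append(action)
        obs, done = env.step(action)
        if done:
            break
    return actions
```

The point of the sketch is the uniform interface: every task, regardless of how it is specified, reduces to conditioning the same loop on a different prompt.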


We start with the observation that many robot manipulation tasks can be formulated as multimodal prompts that interleave language and images or video frames (Fig. 1). For example, Rearrangement (Batra et al., 2020), a type of Visual Goal, can be formulated as "Please rearrange objects to match this {scene image}"; Novel Concept Grounding looks like "This is a dax {new object 1} and this is a blicket {new object 2}. Put two metal dax on the marble blicket."; Few-shot Imitation reads "Follow this motion: {frame 1}, {frame 2}, {frame 3}, {frame 4}"; and expressing a Visual Constraint is as simple as adding the clause "without touching {safety boundary}".

Figure 1: Many robot manipulation tasks can be expressed as multimodal prompts that interleave language and image/video frames. We propose VIMA, an embodied agent model capable of processing multimodal prompts (left) and controlling a robot arm to solve the task (right).

