VIMA: GENERAL ROBOT MANIPULATION WITH MULTIMODAL PROMPTS

Abstract

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. These are often considered different tasks and tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts, interleaving textual and visual tokens. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively.

Introduction

Transformers have given rise to remarkable multi-task consolidation across many AI domains. For example, users can describe a task to GPT-3 (Brown et al., 2020) with a natural language prompt, allowing the same model to perform question answering, machine translation, text summarization, etc. Prompt-based learning provides an accessible and flexible interface to communicate a natural language understanding task to a general-purpose model.


We envision that a generalist robot agent should have a similarly intuitive and expressive interface for task specification. What would such an interface for robot learning look like? As a motivating example, consider a personal robot tasked with household activities. We can ask the robot to bring us a cup of water through a simple natural language instruction. If we require more specificity, we can instead instruct the robot to "bring me <image of the cup>". For tasks requiring new skills, the robot should preferably be able to adapt from a few video demonstrations (Duan et al., 2017). Tasks that need interaction with unfamiliar objects can be easily explained via a few image examples for novel concept grounding (Hermann et al., 2017). Finally, to ensure safe deployment, we can further specify visual constraints like "do not enter <image> room".


To enable a single agent with all these capabilities, we make three key contributions in this work: 1) a novel multimodal prompting formulation that converts a wide spectrum of robot manipulation tasks into one sequence modeling problem; 2) a new robot agent model capable of multi-task and zero-shot generalization; and 3) a large-scale benchmark with diverse tasks to systematically evaluate the scalability and generalization of our agents.


We start with the observation that many robot manipulation tasks can be formulated as multimodal prompts that interleave language and images or video frames (Fig. 1). For example, Rearrangement (Batra et al., 2020), a type of Visual Goal, can be formulated as "Please rearrange objects to match this {scene image}"; Novel Concept Grounding looks like "This is a dax {new object}_1 and this is a blicket {new object}_2. Put two metal dax on the marble blicket."; Few-shot Imitation can embed a video snippet in the prompt, e.g., "Follow this motion trajectory for the wooden cube:".
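The interleaving described above can be sketched as a simple data structure: a prompt is an ordered sequence mixing text tokens and image tokens. The following minimal Python sketch is illustrative only; the names (TextToken, ImageToken, tokenize_prompt) are assumptions for exposition and not the paper's actual interface, and real image tokens would carry visual features rather than a string placeholder.

```python
# Illustrative sketch of a multimodal prompt as one interleaved token
# sequence. All names here are hypothetical, not from the VIMA codebase.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextToken:
    word: str


@dataclass
class ImageToken:
    # Placeholder for an object crop or scene image; a real agent would
    # store visual features here instead of a string reference.
    ref: str


Token = Union[TextToken, ImageToken]


def tokenize_prompt(segments: List[Union[str, ImageToken]]) -> List[Token]:
    """Flatten mixed text/image segments into one ordered token sequence."""
    tokens: List[Token] = []
    for seg in segments:
        if isinstance(seg, str):
            # Naive whitespace tokenization, for illustration only.
            tokens.extend(TextToken(w) for w in seg.split())
        else:
            tokens.append(seg)
    return tokens


# The Novel Concept Grounding prompt from the text: images ground the
# made-up words "dax" and "blicket" inline with the instruction.
prompt = tokenize_prompt([
    "This is a dax", ImageToken("object_1"),
    "and this is a blicket", ImageToken("object_2"),
    "Put two metal dax on the marble blicket.",
])
```

Once every task is expressed in this shared sequence format, a single transformer can consume the prompt and decode motor actions, which is what reduces the diverse task specifications to one sequence modeling problem.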

To train

