VIMA: GENERAL ROBOT MANIPULATION WITH MULTIMODAL PROMPTS

Abstract

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts, interleaving textual and visual tokens. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and four levels of evaluation protocol for systematic generalization. VIMA achieves strong scalability in both model capacity and data size. It outperforms prior SOTA methods in the hardest zero-shot generalization setting by up to 2.9× task success rate given the same training data. With 10× less training data, VIMA still performs 2.7× better than the top competing approach. Video demos are available at https://iclr3081.github.io/.

1. INTRODUCTION

Transformers have given rise to remarkable multi-task consolidation across many AI domains. For example, users can describe a task in a natural language prompt to GPT-3 (Brown et al., 2020), allowing the same model to perform question answering, machine translation, text summarization, etc. Prompt-based learning provides an accessible and flexible interface for communicating a natural language understanding task to a general-purpose model. We envision that a generalist robot agent should have a similarly intuitive and expressive interface for task specification.

What does such an interface for robot learning look like? As a motivating example, consider a personal robot tasked with household activities. We can ask the robot to bring us a cup of water with a simple natural language instruction. If we require more specificity, we can instead instruct the robot to "bring me <image of the cup>". For tasks requiring new skills, the robot should be able to adapt, preferably from just a few video demonstrations (Duan et al., 2017). Tasks that involve interaction with unfamiliar objects can be easily explained via a few image examples for novel concept grounding (Hermann et al., 2017). Finally, to ensure safe deployment, we can further specify visual constraints like "do not enter <image> room".

To enable a single agent with all these capabilities, we make three key contributions in this work: 1) a novel multimodal prompting formulation that converts a wide spectrum of robot manipulation tasks into one sequence modeling problem; 2) a new robot agent model capable of multi-task and zero-shot generalization; and 3) a large-scale benchmark with diverse tasks to systematically evaluate the scalability and generalization of our agents. We start with the observation that many robot manipulation tasks can be formulated by multimodal prompts that interleave language and images or video frames (Fig. 1).
For example, Rearrangement (Batra et al., 2020), a type of Visual Goal, can be formulated as "Please rearrange objects to match this {scene image}"; Novel Concept Grounding looks like "This is a dax {new object}_1 and this is a blicket {new object}_2. Put two metal dax on the marble blicket."; Few-shot Imitation can embed a video snippet in the prompt: "Follow this motion trajectory for the wooden cube: {frame}_1, {frame}_2, {frame}_3, {frame}_4"; and expressing a Visual Constraint is as simple as adding the clause "without touching {safety boundary}". Multimodal prompts not only have more expressive power than individual modalities, but also enable a uniform sequence IO interface for training generalist robot agents. Previously, different robot manipulation tasks required distinct policy architectures, objective functions, data pipelines, and training procedures (Aceituno et al., 2021; Stengel-Eskin et al., 2022; Lynch & Sermanet, 2021), leading to siloed robot systems that cannot be easily combined for a rich set of use cases. Instead, our multimodal prompt interface allows us to harness the latest advances in large transformer models (Lin et al., 2021; Tay et al., 2020; Khan et al., 2021) for developing scalable multi-task robot learners. To this end, we design a novel VisuoMotor Attention model (VIMA). The architecture follows the encoder-decoder transformer design proven to be effective and scalable in NLP (Raffel et al., 2020). VIMA encodes an input sequence of interleaved textual and visual prompt tokens with a pre-trained language model (Tsimpoukelli et al., 2021), and decodes robot control actions autoregressively at each environment interaction step.
The transformer decoder is conditioned on the prompt via cross-attention layers that alternate with the usual causal self-attention. Instead of operating on raw pixels, VIMA adopts an object-centric approach. We parse all images in the prompt or observation into objects with off-the-shelf detectors (He et al., 2017), and flatten them into sequences of object tokens. All these design choices combined deliver a conceptually simple architecture with strong model and data scaling properties. To systematically evaluate our proposed algorithm, we introduce a new benchmark (VIMA-BENCH) built on the Ravens simulator (Zeng et al., 2020; Shridhar et al., 2021). We provide 17 representative meta-tasks with multimodal prompt templates, which can be procedurally instantiated into thousands of individual tasks by various combinations of textures and tabletop objects. VIMA-BENCH establishes a 4-level protocol to evaluate progressively stronger generalization capabilities, from randomized object placement to novel tasks altogether (Fig. 2). To demonstrate the scalability of VIMA, we train a spectrum of 7 models ranging from 2M to 200M parameters. Our approach outperforms strong prior SOTA methods such as Gato (Reed et al., 2022), Decision Transformer (Chen et al., 2021), and Flamingo (Alayrac et al., 2022).

3. PROBLEM FORMULATION

Diverse task specification formats (e.g., visual goal, video demonstration, natural language instruction) can all be instantiated as multimodal prompts (Fig. 1). Concretely, a multimodal prompt P of length l is defined as an ordered sequence of arbitrarily interleaved texts and images P := [x_1, x_2, ..., x_l], where each element x_i is either text or an image. Task Suite. The flexibility afforded by multimodal prompts allows us to specify and build models for a huge variety of task specification formats. Here we consider the following six task categories. 1. Simple object manipulation: simple tasks like "put <object> into <container>", where each image in the prompt corresponds to a single object; 2.
Visual goal reaching: manipulating objects to reach a goal configuration, e.g., Rearrangement (Batra et al., 2020); 3. Novel concept grounding: the prompt contains unfamiliar words like "dax" and "blicket", which are explained by in-prompt images and then immediately used in an instruction. This tests the agent's ability to rapidly internalize new concepts; 4. One-shot video imitation: watching a video demonstration and learning to reproduce the same motion trajectory for a particular object; 5. Visual constraint satisfaction: the robot must manipulate the objects carefully to avoid violating the (safety) constraints; 6. Visual reasoning: tasks that require reasoning skills, such as appearance matching ("move all objects with the same texture as <object> into a container") and visual memory ("put <object> in a container and then restore it to its original position"). Note that these six categories are not mutually exclusive. For example, a task may introduce a previously unseen verb (Novel Concept) by showing a video demonstration, or combine goal reaching with visual reasoning. More details about the task suite are discussed in Appendix, Sec. B.
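To make the formulation above concrete, the interleaved sequence P can be represented literally as a typed list whose elements are either text segments or image references. The sketch below is our own illustration; the class and field names are hypothetical and not part of the VIMA codebase.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextSegment:
    words: str  # raw text, later mapped to word tokens

@dataclass
class ImageSegment:
    pixels: object        # cropped object image or full scene frame
    is_full_scene: bool   # full scenes are parsed into objects by a detector

# P := [x_1, x_2, ..., x_l], each element either text or an image
MultimodalPrompt = List[Union[TextSegment, ImageSegment]]

# e.g. the Simple Object Manipulation prompt "Put the {object}_1 into the {object}_2."
prompt: MultimodalPrompt = [
    TextSegment("Put the"),
    ImageSegment(pixels=None, is_full_scene=False),
    TextSegment("into the"),
    ImageSegment(pixels=None, is_full_scene=False),
    TextSegment("."),
]
```

The same container covers all six task categories: a video demonstration is simply several full-scene `ImageSegment` entries in a row.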

4. VIMA-BENCH: BENCHMARK FOR MULTIMODAL ROBOT LEARNING

Simulation Environment. Existing benchmarks are generally geared towards a particular task specification. To our knowledge, there is no benchmark that provides a rich suite of multimodal tasks and a comprehensive testbed for targeted probing of agent capabilities. To this end, we introduce a new benchmark suite for multimodal robot learning that we call VIMA-BENCH. We built our benchmark by extending the Ravens robot simulator (Zeng et al., 2020) . VIMA-BENCH supports extensible collections of objects and textures to compose multimodal prompts and procedurally generate a large number of tasks. Specifically, we provide 17 meta-tasks with multimodal prompt templates, which can be instantiated into 1000s of individual tasks. Each meta-task belongs to one or more of 6 task categories mentioned above. VIMA-BENCH can generate large quantities of imitation learning data via scripted oracle agents. More details are elaborated in Appendix, Sec. A. Observation and Actions. The observation space of our simulator includes RGB images rendered from both frontal view and top-down view. Groundtruth object segmentations and bounding boxes are also provided for training object-centric models (Sec. 5). We inherit the high-level action space from Zeng et al. (2020) , which consists of primitive motor skills like "pick and place" and "wipe". These are parameterized by poses of the end effector. Our simulator also features scripted oracle programs that can generate expert demonstrations by using privileged simulator state information, such as the precise location of all objects, and the groundtruth interpretation of the multimodal instruction. Training Dataset. We leverage the pre-programmed oracles to generate a large offline dataset of expert trajectories for imitation learning. Our dataset includes 50K trajectories per meta-task, and 650K successful trajectories in total. 
We hold out a subset of object models and textures for evaluation, and designate 4 out of 17 meta-tasks as a testbed for zero-shot generalization. Evaluation Protocol. The average success rate over all evaluated meta-tasks is the final reported metric. We design a 4-level evaluation protocol (Fig. 2) to systematically probe the generalization capabilities of learned agents. Each level deviates more from the training distribution, and is thus strictly harder than the previous one. Level 1) placement generalization: all prompts are seen verbatim during training, but only the placement of objects on the tabletop is randomized at testing; Level 2) combinatorial generalization: all materials (adjectives) and 3D objects (nouns) are seen during training, but new combinations of them appear in testing; Level 3) novel object generalization: test prompts and the simulated workspace include novel adjectives and objects; Level 4) novel task generalization: new meta-tasks with novel prompt templates appear at test time.
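The reported metric, the average success rate over evaluated meta-tasks, can be sketched as a simple two-level aggregation. The helper and the example numbers below are hypothetical, but the computation matches the stated definition.

```python
def aggregate_success_rate(results):
    """Mean of per-meta-task success rates.

    `results` maps a meta-task name to a list of per-episode booleans.
    Averaging per task first keeps the metric balanced across meta-tasks.
    """
    per_task = [sum(episodes) / len(episodes) for episodes in results.values()]
    return sum(per_task) / len(per_task)

# Hypothetical evaluation outcomes for two meta-tasks:
results = {
    "visual_rearrange": [True, True, False, True],    # 0.75
    "novel_concept":    [True, False, False, False],  # 0.25
}
overall = aggregate_success_rate(results)  # 0.5
```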

5. VIMA: VISUOMOTOR ATTENTION MODEL

Our goal is to build a robot agent capable of performing any task specified by multimodal prompts. To learn an effective multi-task robot policy, we propose VIMA, a minimalistic multi-task encoder-decoder architecture with an object-centric design (Fig. 3). Concretely, we learn a robot policy π(a_t | P, H), where H := [o_1, a_1, o_2, a_2, ..., o_t] denotes the past interaction history, and o_t ∈ O, a_t ∈ A are the observation and action at each interaction step. We encode multimodal prompts via a frozen pre-trained language model and decode robot motor commands conditioned on the encoded prompts via cross-attention layers. Unlike prior works (Florence et al., 2019; Sieb et al., 2019), VIMA adopts an object-centric token representation that computes features from bounding box coordinates and cropped RGB patches. Tokenization. There are 3 formats of raw input in the prompt: text, an image of a single object, and an image of a full tabletop scene (e.g., for Rearrangement or imitation from video frames). For text inputs, we use the pre-trained T5 tokenizer and word embedding to obtain word tokens. For images of full scenes, we first extract individual objects using an off-the-shelf Mask R-CNN (He et al., 2017). Each object is represented as a bounding box and a cropped image. We then compute object tokens by encoding them with a bounding box encoder and a ViT, respectively. Since Mask R-CNN is imperfect, the bounding boxes can be noisy and the cropped images may contain irrelevant pixels. For images of single objects, we obtain tokens in the same way except with a dummy bounding box. Prompt tokenization produces a sequence of interleaved textual and visual tokens. We then follow the practice in Tsimpoukelli et al. (2021) and encode the prompt via a pre-trained T5 encoder (Raffel et al., 2020). Since T5 has been pre-trained on large text corpora, VIMA inherits its semantic understanding capability and robustness properties.
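The object tokenization described above can be sketched as follows. Random NumPy matrices stand in for the learned bounding-box encoder and the ViT crop encoder; the dimensions and the additive fusion are illustrative assumptions, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # token embedding dimension (illustrative)

# Stand-ins for the learned encoders: a bounding-box encoder and a ViT.
W_bbox = rng.standard_normal((4, D)) * 0.02
W_crop = rng.standard_normal((32 * 32 * 3, D)) * 0.02

def object_token(bbox_xyxy, crop_rgb):
    """Fuse bounding-box and cropped-image features into one object token."""
    box_feat = np.asarray(bbox_xyxy, dtype=np.float64) @ W_bbox
    img_feat = crop_rgb.reshape(-1) @ W_crop  # placeholder for the ViT
    return box_feat + img_feat

# A single-object prompt image gets a dummy bounding box.
DUMMY_BOX = [0.0, 0.0, 1.0, 1.0]
tok = object_token(DUMMY_BOX, rng.random((32, 32, 3)))
```

A full scene simply yields one such token per detected object, so the prompt flattens into an interleaved sequence of word tokens and object tokens.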
To accommodate tokens from new modalities, we insert MLPs between the non-textual tokens and T5. To prevent catastrophic forgetting, VIMA fine-tunes the last two layers of the language encoder with layer-wise learning rate decay (He et al., 2021) but freezes all other layers. Our positional embedding is learnable and absolute. Robot Controller. A challenging aspect of designing a multi-task policy is selecting a suitable conditioning mechanism. In our schema (Fig. 3), the robot controller (decoder) is conditioned on the prompt sequence P by a series of cross-attention layers between P and the trajectory history sequence H. We compute key K_P and value V_P sequences from the prompt and queries Q_H from the trajectory history, following the encoder-decoder convention in T5 (Raffel et al., 2020). Each cross-attention layer then generates an output sequence H′ = softmax(Q_H K_P^⊺ / √d) V_P, where d is the embedding dimension. Residual connections (He et al., 2015) are added to connect higher layers with the input rollout trajectory sequence. The cross-attention design enjoys three advantages: 1) a strengthened connection to the prompt; 2) an intact and deep flow of the original prompt tokens; and 3) better computational efficiency, as also demonstrated in VideoGPT (Yan et al., 2021). The VIMA decoder consists of L alternating cross-attention and self-attention layers. Finally, we follow common practice (Baker et al., 2022) and map predicted action tokens to discretized coordinates of the robot arm. See Appendix, Sec. C.2 for more details. Training. We use behavioral cloning to train our models by minimizing the negative log-likelihood of predicted actions. Concretely, for a trajectory with T steps, we optimize min_θ −Σ_{t=1}^{T} log π_θ(a_t | P, H). The entire training is conducted on an offline dataset with no simulator access. To make VIMA robust to detection inaccuracies and failures, we apply object augmentation by randomly injecting false-positive detection outputs.
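The cross-attention conditioning H′ = softmax(Q_H K_P^⊺ / √d) V_P can be illustrated in a few lines of NumPy. This single-head sketch omits the learned query/key/value projections and layer normalization for brevity, so it is an illustration of the mechanism rather than the actual layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(H, P, d):
    """One projection-free cross-attention step: queries come from the
    trajectory history H, keys/values come from the prompt P."""
    Q, K, V = H, P, P  # learned projections omitted for brevity
    scores = Q @ K.T / np.sqrt(d)          # (history, prompt) attention logits
    return softmax(scores, axis=-1) @ V + H  # attend to prompt, add residual

d = 16
H = np.random.default_rng(1).standard_normal((5, d))  # 5 history tokens
P = np.random.default_rng(2).standard_normal((9, d))  # 9 prompt tokens
H_out = cross_attention(H, P, d)  # same shape as H: (5, 16)
```

Note that the output keeps the history sequence length while mixing in prompt information, which is what lets the decoder stack L such layers alternated with causal self-attention.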
After training, we select model checkpoints for evaluation based on the aggregated accuracy on a held-out validation set. The evaluation involves interacting with the physics simulator. We follow best practices for training Transformer models, using the AdamW optimizer (Loshchilov & Hutter, 2019), learning rate warm-up, cosine annealing (Loshchilov & Hutter, 2016), etc. See Appendix, Sec. D for comprehensive training hyperparameters.
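The warm-up plus cosine-annealing schedule mentioned above can be sketched as a scalar function of the step index. The hyperparameter values below are placeholders, not the ones reported in Appendix D.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=1000, min_lr=1e-7):
    """Linear warm-up followed by cosine annealing (illustrative values)."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warm-up phase.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```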

6. EXPERIMENTS

In this section, we aim to answer three main questions: (1) How does VIMA compare with prior SOTA transformer-based agents on a diverse collection of multimodal-prompted tasks? (2) What are the scaling properties of our approach in model capacity and data size? (3) How do different visual tokenizers, prompt conditioning, and prompt encoding affect decision making?

6.1 BASELINES

Gato (Reed et al., 2022) introduces a decoder-only model that solves tasks from multiple domains, where tasks are specified by prompting the model with an observation and action subsequence. For a fair comparison, we provide the same conditioning as VIMA, i.e., our multimodal embedded prompt. Input images are divided into patches and encoded by a ViT (Dosovitskiy et al., 2020) to produce observation tokens. Flamingo (Alayrac et al., 2022) is a vision-language model that learns to generate textual completions in response to multimodal prompts. It embeds a variable number of prompt images into a fixed number of tokens via a Perceiver Resampler (Jaegle et al., 2021b), and conditions the language decoder on the encoded prompt via cross-attention. Flamingo does not work with embodied agents out of the box; we adapt it to support decision-making by replacing the output layer with robot action heads. Multimodal GPT agent is a GPT-based behavior cloning agent conditioned on tokenized multimodal prompts. It autoregressively decodes the next actions given instructions and interaction histories. Similar to prior works that cast RL problems as sequence modeling (Chen et al., 2021; Janner et al., 2021), it encodes each image into a single state token with a ViT encoder, and prepends the prompt tokens to the rollout trajectory. This baseline does not involve cross-attention. A more detailed comparison between these methods can be found in Appendix, Sec. C.1.

6.2. EVALUATION RESULTS

We compare VIMA against other SOTA methods on the four levels of generalization provided in our benchmark, for different model and training dataset sizes. Model scaling. We train all methods over a spectrum of model capacities from 2M to 200M parameters, evenly spaced on the log scale. The encoder size is kept constant (pre-trained T5-Base) for all methods and excluded from the parameter count. Across all levels of zero-shot generalization, we find that VIMA strongly outperforms prior work. Although models like Gato and Flamingo show improved performance with bigger model sizes, VIMA consistently achieves superior performance across all model sizes. We note that this can only be achieved with both cross-attention and the object token sequence representation without any downsampling: altering either component degrades performance significantly, especially in the low model capacity regime (ablations in Sec. 6.3). Data scaling. Next we investigate how different methods scale with varying dataset sizes. We compare model performance at 0.1%, 1%, 10%, and 100% of the imitation learning dataset provided in VIMA-BENCH (Fig. 4). VIMA is extremely sample efficient: with just 1% of the data, it achieves performance similar to baseline methods trained with 10× more data on the L1 and L2 levels of generalization. In fact, for L4 we find that with just 1% of the training data, VIMA already outperforms prior work trained with the entire dataset. Finally, across all levels, with just 10% of the data VIMA outperforms prior work trained with the full dataset by a significant margin. We hypothesize that this data efficiency can be attributed to VIMA's object-centric representation, which is less prone to overfitting than learning directly from pixels in the low-data regime. This is consistent with findings from Sax et al.
(2018), which demonstrates that embodied agents conditioned on mid-level visual representations tend to be significantly more sample-efficient than end-to-end control from raw pixels. Progressive Generalization. Finally, we compare the relative performance degradation as we test the models on progressively challenging zero-shot evaluation levels without further finetuning (Fig. 5). Our method exhibits minimal performance regression, especially between L1 → L2 and L1 → L3. In contrast, other methods can degrade by as much as 20%, particularly in the more difficult generalization scenarios. Although all methods degrade significantly when evaluated on L4 (Novel Tasks), the drop in performance for VIMA is only half as severe as for the other baselines. These results suggest that VIMA has developed a more generalizable policy and more robust representations than the competing approaches.

6.3. ABLATION STUDIES

Fig. 6 shows the ablation results. We highlight a few findings. First, we note that our Mask R-CNN detection pipeline (Appendix, Sec. A.20) incurs a minimal performance loss compared to the oracle bounding boxes, thanks to the object augmentation (Sec. 5) that boosts robustness during training. Second, tokenizing from raw pixels (Image Perceiver, patches, or single embedding) consistently underperforms our object-centric format. We hypothesize that these tokenizers have to allocate extra capacity to parse the scene from raw pixels before they can reason over objects. Prompt Encoding. We vary the size of the pre-trained T5 encoder to study the effect of prompt encoding. We experiment with three T5 capacities: small (30M), base (111M), and large (368M). For all T5 variants, we fine-tune the last two layers and freeze all other layers. We find no significant difference among the variants (Appendix, Sec. E.2), so we set base as the default for all our models. Policy Robustness. We study the policy's robustness against increased numbers of distractors and imperfect task specifications. See Appendix, Sec. E.3 for the exact setup and results. VIMA exhibits minimal performance degradation with increased distractors and corrupted prompts. We attribute this robustness to the high-quality, pre-trained T5 language backbone.

7. CONCLUSION

Similar to GPT-3, a generalist robot agent should have an intuitive and expressive interface for human users to convey their intent. In this work, we introduce a novel multimodal prompting formulation that converts diverse robot manipulation tasks into a uniform sequence modeling problem. We propose VIMA, a conceptually simple transformer-based agent capable of solving tasks like visual goal, one-shot video imitation, and novel concept grounding with a single model. VIMA exhibits superior model and data scaling properties, and provides a strong starting point for future work. The current VIMA experiments are not without limitations. We identify the following weaknesses: (1) limited action primitives (only pick-and-place and wipe for now); (2) limited simulator realism; (3) reliance on domain-finetuned Mask R-CNN to provide object tokens. However, VIMA's algorithm design is general-purpose and does not make assumptions about the particular observation and action formats. This opens the door to future works that may address many of these weaknesses with more sophisticated environments (e.g. BEHAVIOR (Srivastava et al., 2021) ), stronger vision pipeline (large-scale open-vocabulary models like ViLD (Gu et al., 2021) ), and temporally-extended robot controllers (such as MAPLE (Nasiriany et al., 2021) ). With these stronger modules, VIMA could potentially scale to more challenging problems. We open-source all code to facilitate future research.

A SIMULATOR DETAILS

We build our VIMA-BENCH simulation suite upon the Ravens physics simulator (Zeng et al., 2020; Shridhar et al., 2021). Specifically, it is backed by PyBullet (Coumans & Bai, 2016–2021) with a Universal Robots UR5 arm. The size of the tabletop workspace is 0.5 × 1 m. Our benchmark contains extensible sets of object geometries and textures. Instantiated from an object-texture combination, all object instances can be rendered as RGB images appearing in multimodal prompts. The observation space of VIMA-BENCH includes RGB images from both frontal and top-down views. It also includes a one-hot vector ∈ {0, 1}^2 indicating the type of end effector ∈ {suction cup, spatula}. While a suction cup is equipped for most manipulation tasks, a spatula is used in particular for visual constraint tasks, where an agent is asked to "wipe" objects. VIMA-BENCH inherits the action space from Zeng et al. (2020) and Shridhar et al. (2021), which consists of the primitive action "pick and place" for tasks with a suction cup as the end effector, and "push" for tasks with a spatula. Both primitive actions contain two poses ∈ SE(2) specifying target poses of the end effector. For the "pick and place" primitive, they represent the pick pose and the place pose. For the "push" primitive, they represent the push starting pose and push ending pose. Similar to prior work (Zeng et al., 2020; Shridhar et al., 2021), VIMA-BENCH provides scripted oracles to generate successful demonstrations for all tasks. We leverage them to construct an offline imitation dataset for behavioral cloning. Given a prompt, these pre-programmed bots can access privileged information such as the correct object to pick and the target location to place it.
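The two-pose action parameterization described above can be written down directly. The dataclass and field names below are our own illustration, not the VIMA-BENCH API.

```python
from dataclasses import dataclass

@dataclass
class PoseSE2:
    """End-effector pose in the plane: position (x, y) plus yaw rotation."""
    x: float
    y: float
    theta: float  # radians

@dataclass
class PrimitiveAction:
    """Both primitives are parameterized by two SE(2) poses:
    pick/place for the suction cup, push start/end for the spatula."""
    pose_0: PoseSE2  # pick pose, or push starting pose
    pose_1: PoseSE2  # place pose, or push ending pose

# Workspace is 0.5 m x 1 m, so coordinates stay within those bounds.
action = PrimitiveAction(PoseSE2(0.1, 0.2, 0.0), PoseSE2(0.4, 0.8, 1.57))
```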

B TASK SUITE

We develop 17 meta-tasks that belong to 6 diverse categories. Thousands of individual tasks and their corresponding multimodal prompts can be procedurally generated from these meta-task templates. We use PyBullet (Coumans & Bai, 2016–2021) as our backend and the default renderer to produce the RGB frames for training data and interactive test environments. For demonstration purposes, we apply the NVISII (Morrical et al., 2020) ray-tracing renderer to enhance the visual quality. We elaborate on each meta-task in the following subsections.

B.1 SIMPLE OBJECT MANIPULATION

This task category asks agents to follow basic instructions specified by multimodal prompts.

Task 01: Pick the specified object(s) and place them into the specified container.
• Prompt: Put the {object}_1 into the {object}_2.
• Description: The image placeholder {object}_1 is the object to be picked and {object}_2 is the container object. The agent is required to recognize the objects with the correct color-shape combinations. To increase the difficulty, the task supports more than one object to be picked or placed. For example, the prompt "Put the {object}_1 and {object}_2 into the {object}_3." asks the agent to pick two different objects and place them into a target container. We uniformly sample different color-shape combinations for the objects to be picked and the containers.
• Success Criteria: All specified object(s) to pick are within the bounds of the container object(s), with the shapes and textures specified in the prompt.
• Oracle Trajectory: Shown in Fig. A.3 with its multimodal prompt.

Task 02: In the workspace, put the objects with a specified texture shown in the scene image in the prompt into container object(s) with a specified color. This task requires the agent to find the correct object to manipulate by grounding the textural attributes from both natural language descriptions and the visual scene images.
• Prompt: Put the {texture}_1 object in {scene} into the {texture}_2 object.

Task 03: Rotate the specified object(s) by a given angle.
• Prompt: Rotate the {object}_1 {angles} degrees.
• Description: The agent is required to rotate all objects in the workspace specified by the image placeholder {object}_1. There are also objects with different color-shape combinations in the workspace as distractors. {angles} is the sampled number of degrees that the dragged object needs to be rotated. A target angle is sampled from 30°, 60°, 90°, 120°, and 150°.
• Success Criteria: The position of the specified object matches its original position, and its orientation matches the orientation after rotating by the specified angle.
• Oracle Trajectory: Shown in Fig. A.5 with its multimodal prompt.
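Procedural instantiation of a meta-task template like Task 01 can be sketched as sampling attribute combinations into a prompt. The object and texture vocabularies below are illustrative only, not the actual VIMA-BENCH asset lists.

```python
import random

# Illustrative vocabularies (the real benchmark has extensible asset sets).
TEXTURES = ["red", "yellow", "green", "blue", "purple"]
PICKABLE = ["block", "letter", "star"]
CONTAINERS = ["bowl", "pan", "container"]

def instantiate_task01(rng):
    """Fill the Task 01 template 'Put the {object}_1 into the {object}_2.'
    with a uniformly sampled color-shape combination for the object and
    the container."""
    tex1, tex2 = rng.sample(TEXTURES, 2)
    obj = rng.choice(PICKABLE)
    container = rng.choice(CONTAINERS)
    return f"Put the {tex1} {obj} into the {tex2} {container}."

rng = random.Random(0)
prompt = instantiate_task01(rng)
```

In the actual benchmark the `{object}` placeholders are rendered images rather than words, but the sampling logic is the same.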

B.2 VISUAL GOAL REACHING

This task category requires agents to manipulate objects in the workspace to reach goal states represented as images shown in prompts.

Task 04: Rearrange objects in the workspace to match the goal configuration shown in the prompt.
• Prompt: Rearrange to this {scene}.
• Description: Objects in the scene placeholder {scene} are target objects to be manipulated and rearranged. In the workspace, the same target objects are spawned randomly, potentially with distractors randomly spawned as well. With a defined distractor conflict rate, each distractor's position has this probability of occupying the position of a target object, such that the rearrangement can only succeed if that distractor is moved away first.
• Success Criteria: The configuration of target objects in the workspace matches that specified in the prompt.
• Oracle Trajectory: Shown in the corresponding appendix figure with its multimodal prompt.

Task 05: Rearrange objects to a goal configuration and then restore them.
• Prompt: Rearrange objects to this setup {scene} and then restore.
• Description: Same as task 04, except introducing the instruction "restore".
• Success Criteria: Meet the success criteria of task 04, and then, within the allowed maximum steps, restore all target objects to their initial configurations.
• Oracle Trajectory: Shown in Fig. A.7 with its multimodal prompt.

B.3 NOVEL CONCEPT GROUNDING

This task category requires agents to ground new concepts of adjectives, nouns, or verbs via visual perception and language understanding. A similar task design can be found in prior work (Hill et al., 2021). Completing these tasks is challenging, because the model should a) first understand prompts with interleaved texts, images, and even video frames; b) quickly internalize new concepts that differ across task instances, which even tests the ability to meta-learn; and c) perform complicated reasoning, such as comparing "taller" vs. "less taller" vs. "shorter", and then ground this reasoning into the robot action space. Prompts consist of two parts: a definition part followed by an instruction part.
In the definition part, novel concepts are defined by multimodal illustrations with multiple support examples. In the instruction part, agents are asked to achieve the goal by properly applying concepts from the definition part. The assignment of unique nonsense words is varied independently for each task instance, such that tasks can only be solved if the agent applies the reasoning correctly. This ability is also referred to as fast-mapping (Heibeck & Markman, 1987).

Task 06: Ground comparative adjectives by comparing the size or the textural saturation of objects, and manipulate the correct object(s) instructed in the prompt.
• Prompt: {demo object}_1 is {novel adj} than {demo object}_2. Put the {adv} {novel adj} {object}_1 into the {object}_2.
• Description: The sampled adjective {novel adj} is a dummy adjective placeholder for the agent to ground. By default, the novel adjective set is {daxer, blicker, modier, kobar}. The real meaning can be related to size (smaller/larger) or textural saturation (lighter/darker texture). The image placeholders {demo object}_1 and {demo object}_2 illustrate how the novel adjective is defined. For example, if the real comparison is "taller", then the sampled object in {demo object}_1 is taller than {demo object}_2. The choices of the novel adjective and its real meaning are sampled independently for different task instances. For the instruction part, this task is similar to task 01, where the agent is required to pick the specified object(s) with the novel adjective attribute and then place them into the specified container object. To avoid revealing the correct object to manipulate, we use a neutral texture for objects appearing in the instruction part.
• Success Criteria: All target objects with the specified adjective attribute are within the bounds of the specified container object.
• Oracle Trajectory: Shown in Fig. A.8 with its multimodal prompt.
Task 07: Orthogonal to task 06, requiring the agent to learn mappings for novel nouns.
• Prompt: This is a {novel name}1 {object}1. This is a {novel name}2 {object}2. Put {novel name}1 into a {novel name}2.
• Description: Novel noun words are defined with the text placeholders {novel name}1 and {novel name}2, followed by their image placeholders {object}1 and {object}2, for the target object and the container object, respectively. Novel nouns are sampled from {dax, blicket, wug, zup}. In the instruction part, objects are referred to by the novel nouns defined in the preceding definition part. Distractors are defined the same way as in task 01.
• Success Criteria: All target object(s) are within the bounds of the container object(s).
• Oracle Trajectory: Shown in Fig. A.9 with its multimodal prompt.
Example: This is a blicket. This is a zup. Put a zup into a blicket.

Task 08: Combines tasks 06 and 07 by requiring the agent to ground both novel adjectives and novel nouns.
• Prompt: This is a {novel name}1 {object}1. This is a {novel name}2 {object}2. {demo object}1 is {adj} than {demo object}2. Put the {adv} {novel adj} {novel name}1 into the {novel name}2.
• Description: See the task descriptions for tasks 06 and 07.
• Success Criteria: Similar to tasks 06 and 07.
• Oracle Trajectory: Shown in Fig. A.10 with its multimodal prompt.

Task 09: A novel verb "twist" is defined as rotating by a specific angle conveyed through several examples. This task is similar to task 03, but it requires the agent to infer the exact rotation angle from the prompt and to ground novel verbs that are semantically similar but differ in their exact definitions.
• Prompt: "Twist" is defined as rotating object a specific angle. For example: From {before twist}i to {after twist}i. Now twist all {texture} objects.
• Description: Both {before twist}i and {after twist}i are scene placeholders, where {before twist}i shows a randomly sampled object before the "twist" and {after twist}i shows the same object's pose after the "twist". All examples illustrate the same rotation angle.
We follow prior works (Finn et al., 2017; Dasari & Gupta, 2020; Duan et al., 2017) and formulate the problem by giving one video demonstration (represented as consecutive frames in the prompt), then testing the learned imitator's ability to produce the target trajectory. This setup is challenging because a) only one demonstration is available to the agent; b) the model needs to understand video frames interleaved with textual instructions; and c) correspondences between the demonstration and the target trajectory are missing, since demonstrations only show a subset of key frames.

Task 10: Follow motions for specific objects.
• Prompt: Follow this motion for {object}: {frame}1 ... {frame}i ... {frame}n.
• Description: The image placeholder {object} is the target object to be manipulated, and {{frame}i} is a set of workspace-like scene placeholders representing a video trajectory, where n is the trajectory length. An object is spawned at the center of both the workspace and the prompt video, but with a different texture, serving as a distractor. The initial position of the target object matches that in {frame}1.
• Success Criteria: At each step, the pose of the target object matches the pose in the corresponding video frame. Incorrect manipulation sequences are considered failures.
• Oracle Trajectory: Shown in Fig. A.12 with its multimodal prompt.

Task 11: Stack objects in the order shown in the prompt video.
• Prompt: Stack objects in this order {frame}1 ... {frame}i ... {frame}n.
• Description: Multiple objects with the same shape but different textures are spawned in the workspace without any initial stacking. Distractor objects with different shapes are spawned in the workspace but do not appear in the prompt video. At each step of the prompt video, one of the top objects is stacked onto another object or placed at an empty position.
• Success Criteria: Similar to task 10.
• Oracle Trajectory: Shown in Fig. A.13 with its multimodal prompt.
Task 12: Sweep the designated number of objects into a specified region without exceeding the boundary.

• Prompt:

Sweep {quantifier} {object} into {bounds} without exceeding {constraint}.
• Description: {object} is the image placeholder for the target object, spawned in a random amount in the workspace. Distractors have the same amount and shape but a different color from the target objects. {quantifier} is the text placeholder determining the target quantity of objects to be swept, sampled from any, one, two, three, and all. {bounds} is the image placeholder for a three-sided rectangle serving as the goal region. {constraint} is the constraint line.

• Description: {object} is the sampled goal container object. In the workspace, there are objects with the same texture as the container but potentially different shapes. Distractors with different textures are also spawned.
• Success Criteria: All objects with the same texture as the goal container are within the bounds of the container.
• Oracle Trajectory: Shown in Fig. A.16 with its multimodal prompt.
Example: Put all objects with the same texture as {object} into it.

Task 15: By reasoning about "same shape", the agent is required to pick all objects in the workspace with the same top-down shape as the goal container specified in the prompt and place them into it. For example, blocks and boxes have the same rectangular shape.
• Prompt: Put all objects with the same profile as {object} into it.
• Description: Similar to task 14, except that the objects to be picked and placed share the same shape. There are three shape categories: rectangular-like (e.g., block and pallet), circle-like (e.g., ring and bowl), and undetermined for the rest.
• Success Criteria: All objects with the same shape as the container are within the container.
• Oracle Trajectory: Shown in Fig. A.17 with its multimodal prompt.
Example: Put all objects with the same profile as {object} into it.

Task 16: Pick and place a target object, then the object that was previously its neighbor in a specified direction, into the same container.
• Prompt: First put {object}1 into {object}2 then put the object that was previously at its {direction} into the same {object}2.
• Description: The objects in the image placeholders {object}1 and {object}2 are the target object to be picked and the container, respectively. We then ask the agent to put one of the old neighbors of the previous target object into the same container. The old neighboring object is specified through the cardinal directions {north, south, west, east}.
• Success Criteria: The target object and the correct neighboring object are inside the container.
• Oracle Trajectory: Shown in Fig. A.18 with its multimodal prompt.
Example: First put {object}1 into {object}2 then put the object that was previously at its west into the same {object}2.

Task 17: Manipulate an object through a sequence of containers, then restore it.
• Prompt: Put {object}1 into {object}2. Finally restore it into its original container.
• Description: The object in the image placeholder {object}1 is the target object to be manipulated throughout the task. There may be more than one target container (e.g., "Put {object}1 into {object}2 then {object}3. Finally restore it into its original container." for two target base objects to be placed in order). The remaining spawned containers naturally become distractors.
• Success Criteria: The target object is first put into multiple containers in the specified order. Finally, it should be restored into its original container.
• Oracle Trajectory: Shown in Fig. A.19 with its multimodal prompt.

C.1 SUMMARY OF DIFFERENT METHODS

We summarize the differences between VIMA and other baseline methods in Table 1. In the column "Prompt Conditioning", an alternative to cross-attention is to first concatenate the prompt and interaction history into one big sequence, then repeatedly apply transformer decoder layers to predict actions. We refer to this as "direct modeling". Its relative computation cost is quadratically proportional to the number of observation tokens.

    # the last token is the predicted action token
    predicted_act_token = x[-1]
    return predicted_act_token

Pseudocode 1: Cross-attention operation that conditions the trajectory history on the prompt. We alternately apply cross-attention and self-attention to model the trajectory given a specific task.
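To make the cost comparison concrete, here is a toy per-layer accounting of pairwise attention interactions, ignoring constant factors, head counts, and feed-forward cost; the token counts below are purely illustrative:

```python
def direct_modeling_cost(n_prompt, n_obs):
    """Attention interactions when prompt and interaction history are
    concatenated into one sequence and processed with self-attention."""
    return (n_prompt + n_obs) ** 2

def cross_attention_cost(n_prompt, n_obs):
    """Self-attention over the history plus cross-attention from
    history queries to prompt keys/values."""
    return n_obs ** 2 + n_obs * n_prompt

# e.g. a 100-token prompt with 50 observation tokens
print(direct_modeling_cost(100, 50))   # 22500 interactions
print(cross_attention_cost(100, 50))   # 7500 interactions
```

The gap widens as the history grows, since direct modeling re-attends the full prompt-plus-history sequence at every layer.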

C.2.5 ACTION DECODING

After obtaining the predicted action token, we map it to the action space A to obtain the predicted action. This is achieved through a group of action heads. Since the action space consists of two SE(2) poses, for each pose we use six independent heads to decode discrete actions (two for the xy coordinates and four for the rotation represented as a quaternion). These discrete actions are then de-discretized and mapped to continuous actions through an affine transformation. The two poses are modeled independently. Early ablations show that this independent modeling is as good as alternative techniques like autoregressive decoding (Vinyals et al., 2019; OpenAI et al., 2019). Detailed model hyperparameters are listed in Table 4.
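A minimal sketch of this de-discretization step, assuming argmax decoding; the bin counts and coordinate ranges here are hypothetical stand-ins for the actual hyperparameters in Table 4:

```python
import numpy as np

def dediscretize(bin_idx, low, high, n_bins):
    """Affine map from a discrete bin index to the center of that bin
    in the continuous range [low, high]."""
    step = (high - low) / n_bins
    return low + (bin_idx + 0.5) * step

def decode_pose(head_logits, ranges):
    """Each independent head picks its argmax bin; the index is then
    mapped back to a continuous coordinate."""
    coords = []
    for logits, (low, high) in zip(head_logits, ranges):
        idx = int(np.argmax(logits))
        coords.append(dediscretize(idx, low, high, n_bins=len(logits)))
    return coords
```

With, say, 50 bins over an x-range of [0.0, 1.0], bin 0 maps to 0.01, the center of the first bin.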

C.3 BASELINES ARCHITECTURES

In this section, we elaborate on the model architectures of the baseline methods. Some components, such as the action decoder, are the same across all baseline methods and ours. Therefore, we only discuss the unique model components.

C.3.1 GATO

Gato (Reed et al., 2022) introduces a decoder-only model that solves tasks from multiple domains including robotics, video games, image captioning, and language modeling. Different tasks are specified by supplying the model with an initial sequence of corresponding tokens. For example, in tasks involving decision making, these tokens include observation and action tokens. For a fair comparison, we provide the same conditioning as VIMA, i.e., our multimodal tokenized prompts. Similar to our method, Gato also predicts actions in an autoregressive manner. Gato and our method share the same training philosophy of optimizing only the causal behavior cloning objective. However, unlike our method, which adopts an object-centric representation treating individual objects as observation tokens, Gato divides input images into patches and encodes them with a ViT (Dosovitskiy et al., 2020).

Imperfect Prompts. We then study the policy's robustness against imperfect prompts, including incomplete prompts (randomly masking out words with the <UNK> token) and corrupted prompts (randomly swapping words, which could change the task meaning altogether). We ran our largest VIMA model with 200M parameters; results are shown in Table 16.
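As a rough illustration of the patch-based alternative, a ViT front-end splits each image into non-overlapping patch tokens before embedding; the patch size here is illustrative:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into non-overlapping flattened patch
    tokens, as in a ViT front-end (patch size is illustrative)."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch]
    x = x.reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
```

A 64x64 RGB image then yields 16 patch tokens of 768 raw values each, regardless of how many objects the scene contains; this is the key contrast with VIMA's variable-length object tokens.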



We provide comprehensive details to reproduce our work in the Appendix. Concretely, the specifications of each meta-task in the benchmarking suite are explained in Sec. B. Model architectures are elaborated in Sec. C. Hyperparameter configurations are listed in Sec. D. Furthermore, we host anonymized code at https://iclr3081.github.io/ for review.



Figure 1: Multimodal prompts for task specification. We observe that many robot manipulation tasks can be expressed as multimodal prompts that interleave language and image/video frames. We propose VIMA, an embodied agent model capable of processing multimodal prompts (left) and controlling a robot arm to solve the task (right).

Figure 2: Evaluation Protocol in VIMA-BENCH. We design 4 levels of evaluation settings to measure the zero-shot generalization capability of an agent systematically. Each level deviates more from the training distribution, and thus is strictly more challenging than the previous level.

Figure 3: VIMA. We encode the multimodal prompts with a pre-trained T5 model, and condition the robot controller on the prompt through cross-attention layers. The controller is a causal transformer decoder consisting of alternating self- and cross-attention layers that predicts motor commands conditioned on prompts and interaction history.

Evaluating Zero-Shot Generalization. Each task in VIMA-BENCH has a binary success criterion and does not provide partial reward signals. During test time, we execute the agent policies in the physics simulator for multiple episodes to compute a success rate in percentage. The average success

Figure 6: Ablation on visual tokenizers. We compare the performance of the VIMA-200M model across different visual tokenizers. Our proposed object tokens outperform all methods that learn directly from raw pixels, as well as Object Perceiver, which downsamples the object sequence to a fixed number of tokens.

Figure 5: VIMA incurs much less performance drop than baselines as we evaluate on progressively harder zero-shot generalization.

Figure 7: Ablation: Prompt conditioning. We compare our method (xattn: cross-attention prompt conditioning) with a vanilla transformer decoder (gpt-decoder) across different model sizes. Cross-attention is especially helpful in the low-parameter regime and for harder generalization tasks.

internal capacity to parse the objects from low-level pixels, which likely impedes learning. Sax et al. (2018) echoes our finding that using mid-level vision can greatly improve agent generalization compared to an end-to-end pipeline. Third, even though Ours and Object Perceiver both use the same object bounding-box inputs, the latter is significantly worse at decision making. We conclude that it is important to pass the variable-length sequence of objects directly to the robot controller rather than downsampling it to a fixed number of tokens.

Prompt Conditioning. VIMA conditions the robot controller (decoder) on the encoded prompt by cross-attention. A simple alternative is to concatenate the prompt P and interaction history H into one big sequence, and then apply a decoder-only transformer like GPT (Radford et al., 2018) to predict actions. In this ablation, we keep the object tokenizer constant and only switch the conditioning mechanism to causal sequence modeling. Note that this variant is conceptually "Gato with object tokens". Fig. 7 shows the comparison of VIMA (xattn) and the gpt-decoder variant across 4 generalization levels. While the variant achieves comparable performance in larger models, cross-attention still dominates in the small-capacity range and generalizes better in the most challenging L4 (Novel Task) setting. Our hypothesis is that cross-attention helps the controller stay better focused on the prompt instruction at each interaction step. This bears resemblance to the empirical results in Sanh et al. (2021) and Wang et al. (2022b), which show that well-tuned encoder-decoder architectures can outperform GPT-3 in zero-shot generalization.

Figure A.1 displays all object geometries. Figure A.2 displays all textures.

Figure A.1: Object Gallery in VIMA-BENCH with random textures. Bowl and pan are from Google Scanned Objects (Downs et al., 2022), while others are from Ravens (Zeng et al., 2020).

Figure A.2: Texture Gallery in VIMA-BENCH. The first row of image-based textures are from Blender Cloud Libraries(Weikert et al., 2022), while others are hard-coded.

Figure A.3: Simple Object Manipulation: Task 01

Figure A.4: Simple Object Manipulation: Task 02

Figure A.5: Simple Object Manipulation: Task 03

Fig. A.6 with its multimodal prompt.

Figure A.6: Visual Goal Reaching: Task 04

Figure A.7: Visual Goal Reaching: Task 05

Figure A.8: Novel Concept Grounding: Task 06

Figure A.9: Novel Concept Grounding: Task 07

Figure A.10: Novel Concept Grounding: Task 08

Figure A.12: One-shot video imitation: Task 10

Figure A.13: One-shot video imitation: Task 11

Figure A.16: Visual Reasoning: Task 14

Figure A.17: Visual Reasoning: Task 15

Figure A.18: Visual Reasoning: Task 16

Fig. A.19 with its multimodal prompt.

Figure A.19: Visual Reasoning: Task 17

Level generalization results. "Model" indicates robot controller parameter count; table columns list Method and per-task success rates (Tasks 01–07, 09, 11, 12, 15, and 16).

Model hyperparameters for the Perceiver Resampler used in the Flamingo method.

Gato uses this ViT model to produce observation tokens. Furthermore, Gato relies on causal self-attention to model entire trajectory sequences starting with prompt tokens. Hyperparameters of Gato's ViT are listed in Table 5. The transformer-decoder-style sequence modeling is technically illustrated in Pseudocode 2.

Flamingo (Alayrac et al., 2022) is a vision-language model that learns to generate textual completions in response to multimodal prompts. It embeds a variable number of prompt images into a fixed number of tokens via the Perceiver Resampler module (Jaegle et al., 2021b), and conditions the language decoder on encoded prompts through cross-attention. Flamingo does not work with embodied agents out of the box. We adapt it by replacing the output layer with robot action heads (hyperparameters listed in Table 4) and using tokenized rollout histories as inputs. We train it end-to-end with the causal behavior cloning loss. The modified Flamingo agent differs from ours in that it processes image observations into a fixed number of visual tokens through a learned Perceiver Resampler. Model hyperparameters for our reimplementation of the Perceiver Resampler are listed in Table 6.

C.3.3 MULTIMODAL GPT AGENT

L1 level generalization results. "Model" indicates robot controller parameter count.
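The Perceiver Resampler idea mentioned above can be sketched in a few lines of NumPy. This is a single-head toy without the learned projections or feed-forward layers of the real module: a fixed set of latent queries cross-attends to a variable number of image tokens, so the output token count stays constant.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(image_tokens, latents):
    """Latent queries (fixed count) cross-attend to image tokens
    (variable count); the output has as many tokens as latents."""
    d = latents.shape[-1]
    attn = softmax(latents @ image_tokens.T / np.sqrt(d))
    return latents + attn @ image_tokens
```

Whether the scene yields 3 image tokens or 57, the resampled output always has the latent count, which is exactly the fixed-token behavior the ablation above contrasts with VIMA's variable-length object tokens.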

annex

• Success Criteria: The exact number of target objects to be swept are all inside the specified region. Failure cases include 1) any distractor being swept into the region, 2) a target object exceeding the constraint, or 3) an incorrect number of target objects being swept into the goal region.
• Oracle Trajectory: Shown in Fig. A.14 with its multimodal prompt.

Task 13: Sweep objects into a region without touching the constraint line.
• Prompt: Sweep {quantifier} {object} into {bounds} without touching {constraint}.
• Description: Similar to task 12 but requiring a different way to satisfy the constraint. The agent has to learn to avoid "touching" the constraint line in this case.
• Success Criteria: Similar to task 12, except that the constraint is to not touch the red line.
• Oracle Trajectory: Shown in Fig. A.15 with its multimodal prompt.
Example: Sweep two {object} into {bounds} without touching {constraint}.

This task category requires agents to make decisions by reasoning over, or memorizing, information conveyed through multimodal prompts.

Task 14: By reasoning about "same texture", the agent is required to pick all objects in the workspace with the same texture as the container object specified in the prompt and place them into it.
• Prompt: Put all objects with the same texture as {object} into it.

For text inputs, we follow the standard NLP pipeline and first tokenize raw language into discrete indices with the pre-trained t5-base tokenizer. We then obtain the corresponding word tokens from the embedding look-up of the pre-trained t5-base model. For images of full scenes, we first parse the scene with a fine-tuned Mask R-CNN detection model (He et al., 2017; Wu et al., 2019) to extract individual objects. Each object representation contains a bounding box and a cropped image. The bounding box is in the format [x_center, y_center, height, width]. We normalize it to lie within [0, 1] by dividing each dimension by its corresponding upper-bound value. We then pass it through a bounding-box encoder MLP to obtain a feature vector.
To process the cropped image, we first pad a non-square image to a square along its shorter dimension. We then resize it to a pre-configured size and pass it through a ViT (trained from scratch) to obtain the image feature. Finally, an object token is obtained by concatenating the bounding-box feature and the image feature and mapping the result to the embedding dimension. For images of single objects, we obtain tokens in the same way, except with a dummy bounding box. Detailed model hyperparameters for tokenization are listed in Table 2.

After obtaining a sequence of prompt tokens, we follow Tsimpoukelli et al. (2021) and pass it through a pre-trained t5-base encoder to obtain the encoded prompt. Note that we add an adapter MLP between the object tokens and the T5 encoder. We adopt learned absolute positional embeddings. These model hyperparameters are listed in Table 2 as well.
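The object-token construction described above can be sketched as follows. The feature dimensions and the projection matrix are illustrative stand-ins for the bbox-encoder MLP output, the ViT crop feature, and the learned projection (actual dimensions are in Table 2):

```python
import numpy as np

def normalize_bbox(bbox, img_h, img_w):
    """Normalize [x_center, y_center, height, width] into [0, 1] by
    dividing each entry by its upper-bound value."""
    x, y, h, w = bbox
    return np.array([x / img_w, y / img_h, h / img_h, w / img_w])

def object_token(bbox_feat, crop_feat, W_proj):
    """Concatenate the bbox-encoder output and the ViT crop feature,
    then project the result to the shared embedding dimension."""
    return np.concatenate([bbox_feat, crop_feat]) @ W_proj
```

For a single-object image, the same function is called with the feature of a dummy bounding box in place of `bbox_feat`.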

C.2.4 SEQUENCE MODELING

The robot controller in VIMA is a causal decoder that autoregressively predicts actions. To condition the decoder on prompt tokens, we perform cross-attention between history tokens and prompt tokens (Figure 3). Concretely, we pass the history tokens as the query sequence and the prompt tokens as the key-value sequence into cross-attention blocks. The output prompt-aware trajectory tokens then go through causal self-attention blocks. We alternate cross-attention and self-attention L times. This procedure is technically described in Pseudocode 1.

The multimodal GPT agent baseline predicts actions given multimodal prompts and interaction histories. We optimize this method end-to-end with the causal behavior cloning loss. Similar to prior works casting RL problems as sequence modeling (Chen et al., 2021; Janner et al., 2021; Zheng et al., 2022), it encodes an image into a single "state" token through a learned ViT encoder. It also directly models entire trajectory sequences prepended with prompt tokens. Therefore, it differs from our method in the representation of observation tokens and in prompt conditioning. For the visual tokenizer, we employ a learned ViT with hyperparameters listed in Table 5.
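The alternating cross- and self-attention scheme of the VIMA controller can be sketched as a minimal single-head NumPy toy, without the learned projections, layer norms, or feed-forward blocks of the real decoder:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values, causal=False):
    """Single-head scaled dot-product attention without projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    if causal:
        # mask out future positions for self-attention
        scores = np.where(np.triu(np.ones(scores.shape, bool), 1), -1e9, scores)
    return softmax(scores) @ keys_values

def decode_action_token(history, prompt, n_layers=2):
    """Alternate cross-attention (history attends to the prompt) and
    causal self-attention over the history, L times; the last output
    token is the predicted action token."""
    x = history
    for _ in range(n_layers):
        x = x + attend(x, prompt)            # cross-attention block
        x = x + attend(x, x, causal=True)    # causal self-attention block
    return x[-1]
```

Note how the prompt enters only as keys/values, never as queries, so its length does not grow the causal sequence the decoder must model.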

C.4 MASK R-CNN DETECTION MODEL

Finally, we elaborate on the Mask R-CNN model (He et al., 2017) used for scene parsing and object extraction. We fine-tuned a pre-trained lightweight Mask R-CNN (mask_rcnn_R_50_FPN_3x) from Wu et al. (2019).

D VIMA TRAINING DETAILS

We follow best practices for training Transformer models, using the AdamW optimizer (Loshchilov & Hutter, 2019), learning-rate warm-up, cosine annealing (Loshchilov & Hutter, 2017), etc. Training hyperparameters are provided in Table 7. We use the GEGLU activation (Shazeer, 2020) inside Transformer models across all methods.
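For reference, the GEGLU activation (Shazeer, 2020) gates one linear projection of the input with the GELU of a second projection. A NumPy sketch using the tanh approximation of GELU, with bias terms omitted:

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def geglu(x, W, V):
    """GEGLU(x) = (x @ W) * GELU(x @ V): a linear projection gated
    elementwise by the GELU of a second projection."""
    return (x @ W) * gelu(x @ V)
```

In a Transformer feed-forward layer, this replaces the usual single projection followed by an activation; W and V are separate learned weight matrices of the same shape.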

D.1 VARY MODEL CAPACITY

We train a spectrum of 7 models ranging from 2M to 200M parameters. To vary the model capacity, we follow prior work (Chowdhery et al., 2022) and change the embedding dimension and the number of layers. We list configurations for methods with cross-attention prompt conditioning (i.e., ours and Flamingo) in Table 8, and configurations for methods with only causal self-attention (i.e., Gato and DT) in Table 9.

E MORE EXPERIMENT RESULTS

E.1 BREAKDOWN RESULTS

We show breakdown results for Figure 4 in Tables 10, 11, 12, and 13, respectively.

E.2 VARY T5 ENCODER SIZES

We vary the size of the pre-trained T5 encoder (Raffel et al., 2020) to study the effect of prompt encoding. We experiment with three T5 model capacities: t5-small (30M), t5-base (111M), and t5-large (368M). For all T5 variants, we fine-tune the last two layers and freeze all other layers. We fix the parameter count of the decision-making part at 200M. As shown in Table 14, we find no significant difference among the variants. Thus we set the standard t5-base as the default for all our models.
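The partial-freezing recipe can be sketched with a toy stand-in for the encoder's layer stack; in a real PyTorch/Transformers setup one would toggle `requires_grad` on each T5 block's parameters the same way:

```python
# Toy stand-in for a stack of pre-trained encoder layers; t5-base has
# 12 of them. Each "parameter" just carries a requires_grad flag.
class Param:
    def __init__(self):
        self.requires_grad = True

encoder_blocks = [[Param() for _ in range(4)] for _ in range(12)]

# Freeze everything except the last two layers, which stay trainable.
for i, block in enumerate(encoder_blocks):
    trainable = i >= len(encoder_blocks) - 2
    for p in block:
        p.requires_grad = trainable
```

Only gradients for the last two layers are then computed during fine-tuning, which keeps most of the pre-trained prompt encoder intact.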

E.3 POLICY ROBUSTNESS

Increased Amounts of Distractors. We study the policy's robustness against increased amounts of distractors in scenes. For all tasks being evaluated, we add one more distractor object. We ran our largest VIMA model with 200M parameters. The result is presented in Table 15. It turns out that the performance of VIMA degrades minimally with more distractors than in the training distribution. This indicates that our agent has learned a reasonably robust policy against objects that are irrelevant to the task. Our well-trained model exhibits a minimal performance decrease when evaluated on masked prompts and a minor decrease on corrupted prompts. We attribute this robustness to the high-quality, pre-trained T5 language backbone.

Mask R-CNN (He et al., 2017), UberNet (Kokkinos, 2016), and 12-in-1 (Lu et al., 2020) leverage a single backbone model with multiple independent heads for different tasks. UViM (Kolesnikov et al., 2022) is another unified approach for vision that uses a language model to generate guiding code for a second model that predicts raw vision outputs. In multimodal learning, numerous works (Lu et al., 2022; Wang et al., 2022a; Zellers et al., 2021; 2022; Buch et al., 2022; Fu et al., 2021; Yang et al., 2022) investigate the unification of image, video, audio, and/or language modalities to deliver multi-purpose foundation models, though most are not equipped with decision-making facilities. Perceivers (Jaegle et al., 2021b;a) propose an efficient architecture to handle general-purpose inputs and outputs. BEiT-3 (Wang et al., 2022c) performs masked data modeling on images, texts, and image-text pairs to pre-train a backbone for various downstream tasks. MetaMorph (Gupta et al., 2022a) learns a universal controller over a modular robot design space.

Foundation Models for Embodied Agents.
Embodied agent research (Duan et al., 2022; Batra et al., 2020; Ravichandar et al., 2020; Collins et al., 2021) is adopting the large-scale pre-training paradigm, powered by a collection of learning environments (Abramson et al., 2020; Shridhar et al., 2020; Savva et al., 2019; Puig et al., 2018; Team et al., 2021; Toyama et al., 2021; Shi et al., 

