INSTRUCTION-FOLLOWING AGENTS WITH JOINTLY PRE-TRAINED VISION-LANGUAGE MODELS

Abstract

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained vision-language models typically come with divided language and visual representations, requiring specialized network architectures to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on the encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better scalability and generalization ability than prior work.

1. INTRODUCTION

Humans are able to understand language and vision to accomplish a wide range of tasks. Many tasks require language understanding and visual perception, from driving to whiteboard discussion and cooking. Humans can also generalize to new tasks by building upon knowledge acquired from previously seen tasks. Meanwhile, creating generic instruction-following agents that can generalize to multiple tasks and environments is one of the central challenges of reinforcement learning (RL) and robotics. Driven by significant advances in learning generic pre-trained models for language understanding (Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022), recent work has made great progress towards building instruction-following agents (Lynch & Sermanet, 2020; Mandlekar et al., 2021; Ahn et al., 2022; Jang et al., 2022; Guhur et al., 2022; Shridhar et al., 2022b). For example, SayCan (Ahn et al., 2022) exploits PaLM models (Chowdhery et al., 2022) to generate language descriptions of step-by-step plans from language instructions, then executes the plans by mapping the steps to predefined macro actions. HiveFormer (Guhur et al., 2022) uses a pre-trained language encoder to generalize to multiple manipulation tasks. However, a remaining challenge is that pure language-only pre-trained models are disconnected from visual representations, making it difficult to differentiate vision-related semantics such as colors. Therefore, visual semantics have to be further learned to connect language instructions and visual inputs.

Another category of methods uses pre-trained vision-language models, which have shown great success in joint visual and language understanding (Radford et al., 2021). This has enabled tremendous progress towards creating a general RL agent (Zeng et al., 2022; Khandelwal et al., 2022; Nair et al., 2022b; Shridhar et al., 2022a). For example, CLIPort (Shridhar et al., 2022a) uses the CLIP (Radford et al., 2021) vision encoder and language encoder to solve manipulation tasks. However, a drawback is that such models come with limited language understanding compared to pure language-only pre-trained models like BERT (Devlin et al., 2018), lacking the ability to follow long and detailed instructions. In addition, the representations of visual input and textual input are often disjointly learned, so such methods typically require designing specialized network architectures on top of the pre-trained models to fuse them together.

Figure 1: Examples of RLBench tasks considered in this work. Left: InstructRL can perform multiple tasks from RLBench given language instructions, by leveraging the representations of a pre-trained vision-language transformer model and learning a transformer policy. Right: Each task can be composed of multiple variations that share the same skills but differ in objects. For example, in the block stacking task, InstructRL can generalize to varying colors and orderings of the blocks.

To address the above challenges, we introduce InstructRL, a simple yet effective method based on the multimodal transformer (Vaswani et al., 2017; Tsai et al., 2019). It first encodes fine-grained cross-modal alignment between vision and language using a pre-trained multimodal encoder (Geng et al., 2022), which is a large transformer (Vaswani et al., 2017; He et al., 2022) jointly trained on image-text (Changpinyo et al., 2021; Thomee et al., 2016) and text-only data (Devlin et al., 2018).
The generic representations of each camera view and of the instruction form a sequence, and are concatenated with the embeddings of proprioception data and actions. These tokens are fed into a multimodal policy transformer, which jointly models dependencies between the current and past observations, and cross-modal alignment between the instruction and views from multiple cameras. Based on the output representations of our multimodal transformer, we predict 7-DoF actions, i.e., position, rotation, and state of the gripper. We evaluate InstructRL on RLBench (James et al., 2020), measuring capabilities for single-task learning, multi-task learning, multi-variation generalization, long-instruction following, and model scalability. On all 74 tasks, which belong to 9 categories (see Figure 1 for example tasks), our InstructRL significantly outperforms state-of-the-art models (Shridhar et al., 2022a; Guhur et al., 2022; Liu et al., 2022), demonstrating the effectiveness of jointly pre-trained vision-language representations. Moreover, InstructRL not only excels in following basic language instructions, but is also able to benefit from human-written long and detailed language instructions. We also demonstrate that InstructRL generalizes to new instructions that represent task variations unseen during training, and shows excellent model scalability, with performance continuing to increase with larger model size.

2. RELATED WORK

Language-conditioned RL with pre-trained language models. Pre-trained language models have been shown to improve the generalization capabilities of language-conditioned agents to new instructions and to new low-level tasks (Lynch & Sermanet, 2020; Hill et al., 2020; Nair et al., 2022a; Jang et al., 2022; Ahn et al., 2022; Huang et al., 2022). Some prior works use prompt engineering with large language models (Brown et al., 2020; Chowdhery et al., 2022) to extract temporally extended plans over predefined skills (Huang et al., 2022; Ahn et al., 2022; Jiang et al., 2022), similar to work that decomposes high-level actions into sub-goals (Team et al., 2021). These works rely purely on language models to drive agents and require converting observations into language through predefined APIs. Others combine pre-trained language representations with visual inputs (Jang et al., 2022; Lynch & Sermanet, 2020; Team et al., 2021; Khandelwal et al., 2022) using specialized architectures such as UNet (Ronneberger et al., 2015) and FiLM (Perez et al., 2018). Such approaches have demonstrated success in solving challenging robotic manipulation benchmarks (Guhur et al., 2022; Shridhar et al., 2022b). In this work, we argue that using jointly pre-trained vision-language representations with RL can achieve superior performance in solving complex language-specified tasks, and show that our proposed approach enjoys better scalability and a simpler architecture. Our work is also complementary to prompt-based methods, e.g., SayCan (Ahn et al., 2022). This work focuses on improving the mapping from language instructions to robot actions, and we expect that combining our approach with prompt-based methods can achieve greater success.

Language-conditioned RL with pre-trained vision-language models. There has been strong interest in leveraging pre-trained vision-language models for language-conditioned RL (Shridhar et al., 2022a; Zeng et al., 2022; Khandelwal et al., 2022), motivated by the effectiveness of vision-language models such as CLIP (Radford et al., 2021). However, these methods use disentangled pipelines for visual and language input, with the language primarily being used to guide perception. Our work uses jointly pre-trained vision-language models that come with better grounding of text to visual content (Geng et al., 2022). The effectiveness of such jointly pre-trained models enables a simple final model: a jointly pre-trained vision-language transformer followed by a policy transformer. While using pre-trained vision-language models has been explored in grounded navigation (Guhur et al., 2021; Hao et al., 2020; Majumdar et al., 2020; Shah et al., 2022), our work focuses on manipulation tasks, which have combinatorial complexity (composition of objects/actions). Moreover, in contrast to these methods, InstructRL is a simple architecture that scales well to large-scale tasks and can directly merge long, complex language instructions with visual input.

RL with Transformers. Transformers (Vaswani et al., 2017) have led to significant gains in natural language processing (Devlin et al., 2018; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; He et al., 2022), and related fields (Lu et al., 2019; Radford et al., 2021; Geng et al., 2022). They have also been used in the context of supervised reinforcement learning (Chen et al., 2021a; Reed et al., 2022), vision-language navigation (Chen et al., 2021b; Shah et al., 2022), robot learning and behavior cloning from noisy demonstrations (Shafiullah et al., 2022; Cui et al., 2022), and language-conditioned RL (Guhur et al., 2022; Shridhar et al., 2022a). Inspired by their success, we leverage the transformer architecture to extract pre-trained representations from language and vision and to learn a language-conditioned policy.

3. PROBLEM DEFINITION

We consider the problem of robotic manipulation from visual observations and natural language instructions. We assume the agent receives a natural language instruction $x := \{x_1, \ldots, x_n\}$ consisting of $n$ text tokens.

We parameterize the policy $\pi(a_t \mid x, \{o_i\}_{i=1}^{t}, \{a_i\}_{i=1}^{t-1})$ as a transformer model, which is conditioned on the instruction $x$, observations $\{o_i\}_{i=1}^{t}$, and previous actions $\{a_i\}_{i=1}^{t-1}$. For robotic control, we use macro steps (James & Davison, 2022), which are key turning points in the action trajectory where the gripper changes its state (open/close) or the joint velocities are set to near zero. Following James & Davison (2022), we employ an inverse-kinematics-based controller to find a trajectory between macro-steps. In this way, the sequence length of an episode is significantly reduced from hundreds of small steps to typically fewer than 10 macro steps.

Observation space: Each observation $o_t$ consists of images $\{c_t^k\}_{k=1}^{K}$ taken from $K$ different camera viewpoints, as well as proprioception data $o_t^P \in \mathbb{R}^4$. Each image $c_t^k$ is an RGB image of size $256 \times 256 \times 3$. We use $K = 3$ camera viewpoints located on the robot's wrist, left shoulder, and right shoulder. The proprioception data $o_t^P$ consists of 4 scalar values: gripper open state, left finger joint position, right finger joint position, and timestep of the action sequence. Note that we do not use point cloud data, so that our method can be more flexibly applied to other domains. Since RLBench consists of sparse-reward and challenging tasks, using point cloud data can benefit performance (James & Davison, 2022; Guhur et al., 2022), but we leave this as future work.

Action space: Following the standard setup in RLBench (James & Davison, 2022), each action $a_t := (p_t, q_t, g_t)$ consists of the desired gripper position $p_t = (x_t, y_t, z_t)$ in Cartesian coordinates, the quaternion $q_t = (q_t^0, q_t^1, q_t^2, q_t^3)$ relative to the base frame, and the gripper state $g_t$ indicating whether the gripper is open or closed. An object is grasped when it is located between the gripper's two fingers and the gripper is closing its grasp. The execution of an action is achieved by a motion planner in RLBench.
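For concreteness, the sketch below shows one way to represent an observation and a macro-step action as plain data structures; the class and field names are ours and are not part of the RLBench API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    # K = 3 RGB images (wrist, left shoulder, right shoulder), each 256 x 256 x 3.
    images: np.ndarray   # shape (3, 256, 256, 3), dtype uint8
    # Proprioception: gripper open flag, left/right finger joint positions, timestep.
    proprio: np.ndarray  # shape (4,), dtype float32

@dataclass
class Action:
    position: np.ndarray    # (x, y, z) gripper position, shape (3,)
    quaternion: np.ndarray  # (q0, q1, q2, q3) rotation w.r.t. the base frame, shape (4,)
    gripper_open: float     # 1.0 = open, 0.0 = closed

    def to_vector(self) -> np.ndarray:
        """Flatten into the 3 + 4 + 1 numbers (7 degrees of freedom) predicted by the policy."""
        return np.concatenate([self.position, self.quaternion, [self.gripper_open]])
```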

4. INSTRUCTRL

We propose a unified architecture for robotic tasks called InstructRL, which is shown in Figure 3. It consists of two modules: a pre-trained multimodal masked autoencoder (He et al., 2022; Geng et al., 2022) to encode instructions and visual observations, and a transformer-based (Vaswani et al., 2017) policy that predicts actions. First, the feature encoding module (Sec. 4.1) generates token embeddings for language instructions $\{x_j\}_{j=1}^{n}$, observations $\{o_i\}_{i=1}^{t}$, and previous actions $\{a_i\}_{i=1}^{t-1}$. Then, given the token embeddings, the multimodal transformer policy (Sec. 4.2) learns relationships between the instruction, image observations, and the history of past observations and actions, in order to predict the next action $a_t$.

4.1. FEATURE ENCODING WITH PRE-TRAINED TRANSFORMER

We encode the instruction and visual observations using a pre-trained vision-language encoder, as shown in Figure 3. Specifically, we use the encoder of a pre-trained multimodal masked autoencoder (M3AE) (Geng et al., 2022), a large transformer architecture based on ViT (Dosovitskiy et al., 2020) and BERT (Devlin et al., 2018). M3AE learns a unified encoder for both vision and language data via masked token prediction. It is trained on a large-scale image-text dataset (CC12M (Changpinyo et al., 2021)) and a text-only corpus (Devlin et al., 2018), and learns generalizable representations that transfer well to downstream tasks.

Encoding Instructions and Observations. Following the practice of M3AE, we first tokenize the language instruction $\{x_j\}_{j=1}^{n}$ into embedding vectors and then apply 1D positional encodings. We denote the instruction embeddings as $E_x \in \mathbb{R}^{n \times d_e}$, where $n$ is the number of language tokens and $d_e$ is the embedding dimension. We divide each image observation in $\{c_t^k\}_{k=1}^{K}$ into image patches, use a linear projection to convert them to image embeddings with the same dimension as the language embeddings, and then apply 2D positional encodings. Each image is represented as $E_c \in \mathbb{R}^{l_c \times d_e}$, where $l_c$ is the number of image patch tokens. The image and text embeddings are then concatenated along the sequence dimension: $E = \mathrm{concat}(E_c, E_x) \in \mathbb{R}^{(l_c+n) \times d_e}$. The combined language and image embeddings are then processed by a series of transformer blocks to obtain the final representation $\hat{o}_t^k \in \mathbb{R}^{(l_c+n) \times d_e}$. Following the practice of ViT and M3AE, we apply average pooling over the sequence dimension of $\hat{o}_t^k$ to get $o_t^k \in \mathbb{R}^{d_e}$ as the final representation of the $k$-th camera image $c_t^k$ and the instruction. We use multi-scale features $h_t^k \in \mathbb{R}^{d}$, which are a concatenation of all intermediate layer representations; the feature dimension $d = L \cdot d_e$ equals the number of intermediate layers $L$ times the embedding dimension $d_e$. Finally, we obtain the representations over all $K$ camera viewpoints, $h_t = \{h_t^1, \cdots, h_t^K\} \in \mathbb{R}^{K \times d}$, as the representation of the vision-language input.

Encoding Proprioception and Actions. The proprioception data $o_t^P \in \mathbb{R}^4$ is encoded with a linear layer that upsamples the input dimension to $d$ (i.e., each scalar in $o_t^P$ is mapped to $\mathbb{R}^d$), giving a representation $z_t = \{z_t^1, \cdots, z_t^4\} \in \mathbb{R}^{4 \times d}$, one for each scalar in $o_t^P$. Similarly, the action is projected to a feature $f_t \in \mathbb{R}^d$.
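The following is a minimal sketch of this per-view encoding step in jax.numpy; the `encode_view` helper, the toy layers in the usage example, and the exact shapes are our own illustrative assumptions rather than the released M3AE interface.

```python
import jax.numpy as jnp

def encode_view(m3ae_layers, image_patch_emb, text_emb):
    """Encode one camera view together with the instruction.

    image_patch_emb: (l_c, d_e) image patch embeddings (with 2D positional encodings).
    text_emb:        (n,   d_e) instruction token embeddings (with 1D positional encodings).
    m3ae_layers:     list of L transformer blocks from the pre-trained encoder.
    Returns a multi-scale feature of shape (L * d_e,).
    """
    tokens = jnp.concatenate([image_patch_emb, text_emb], axis=0)  # (l_c + n, d_e)
    per_layer_features = []
    for block in m3ae_layers:
        tokens = block(tokens)                          # (l_c + n, d_e)
        per_layer_features.append(tokens.mean(axis=0))  # average-pool over the token dimension
    # Concatenate the pooled outputs of all intermediate layers: d = L * d_e.
    return jnp.concatenate(per_layer_features, axis=-1)

# Toy usage: two "layers" that are identity maps, 4 image patches, 3 text tokens, d_e = 8.
layers = [lambda t: t, lambda t: t]
h = encode_view(layers, jnp.ones((4, 8)), jnp.ones((3, 8)))
print(h.shape)  # (16,) = L * d_e
```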

4.2. TRANSFORMER POLICY

We consider a context-conditional policy, which takes all encoded instructions, observations, and actions as input, i.e., $\{(h_i, z_i)\}_{i=1}^{t}$ and $\{f_i\}_{i=1}^{t-1}$. By default, we use a context length of 4 throughout the paper (i.e., $4(K+5)$ embeddings are processed by the transformer policy). This enables learning relationships among views from multiple cameras, between the current observations and instructions, and between the current observations and the history, for action prediction. The architecture of the transformer policy is illustrated in Figure 4. We pass the output embeddings of the transformer into a feature map to predict the next action $a_t = [p_t; q_t; g_t]$.

We use behavioral cloning to train the models. In RLBench, we generate $D$, a collection of $N$ successful demonstrations for each task. Each demonstration $\delta \in D$ is composed of a sequence of (at most) $T$ macro-steps with observations $\{o_i\}_{i=1}^{T}$, actions $\{a_i^*\}_{i=1}^{T}$, and instructions $\{x_l\}_{l=1}^{n}$. We minimize a loss function $\mathcal{L}$ over a batch of demonstrations $B = \{\delta_j\}_{j=1}^{|B|} \subset D$. The loss function is the mean-squared error (MSE) on the gripper's action:

$\mathcal{L} = \frac{1}{|B|} \sum_{\delta \in B} \Big[ \sum_{t \le T} \mathrm{MSE}(a_t, a_t^*) \Big]. \qquad (1)$
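A minimal sketch of the behavioral-cloning objective in Eq. (1), assuming a hypothetical `policy_apply` function and a padding mask for demonstrations shorter than $T$ macro-steps; it is meant to illustrate the loss, not reproduce the actual training code.

```python
import jax
import jax.numpy as jnp

def bc_loss(params, policy_apply, batch):
    """Mean-squared-error behavioral cloning loss over a batch of demonstrations.

    batch["obs"]     : encoded observations, shape (B, T, ...)
    batch["actions"] : expert actions a*_t,   shape (B, T, 8)  (position, quaternion, gripper)
    batch["mask"]    : 1.0 for real macro-steps, 0.0 for padding, shape (B, T)
    """
    pred = policy_apply(params, batch["obs"], batch["actions"])       # predicted a_t, (B, T, 8)
    per_step_mse = jnp.mean((pred - batch["actions"]) ** 2, axis=-1)  # (B, T)
    per_demo = jnp.sum(per_step_mse * batch["mask"], axis=-1)         # sum over t <= T
    return jnp.mean(per_demo)                                         # average over the batch

# Gradients w.r.t. the policy parameters, as used inside a training step.
grad_fn = jax.value_and_grad(bc_loss)
```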

5. EXPERIMENTAL SETUP

To evaluate the effectiveness of our method, we run experiments on RLBench (James et al., 2020), a benchmark of robotic manipulation tasks (see Figure 1). We use the same setup as in Guhur et al. (2022), including the same set of 74 tasks with 100 demonstrations per task for training, and the same set of instructions for training and inference, unless stated otherwise. We group the 74 tasks into 9 categories according to their key challenges (see Appendix A.7 for each category's description and list of tasks). For each task, there are a number of possible variations, such as the shapes, colors, and ordering of objects; the initial object positions; and the goal of the task. These variations are randomized at the start of each episode, during both training and evaluation. Based on the task and variation, the environment generates a natural language task instruction using a language template (see Appendix A.3).

We compare InstructRL with strong baseline methods from three categories:

• RL with a pre-trained language model: HiveFormer (Guhur et al., 2022) is a state-of-the-art method for instruction-following agents that uses a multimodal transformer to encode multi-camera views, point clouds, proprioception data, and language representations from a pre-trained CLIP language encoder (Radford et al., 2021). We report the published results of HiveFormer unless otherwise mentioned.

• RL with a pre-trained vision-language model: CLIP-RL is inspired by CLIPort (Shridhar et al., 2022a), which demonstrates the effectiveness of CLIP for robotic manipulation. CLIP-RL uses concatenated visual and language representations from a pre-trained CLIP model. Similar to InstructRL, CLIP-RL uses multi-scale features by concatenating intermediate layer representations, and is trained using the same hyperparameters.

• RL trained from scratch: Auto-λ (Liu et al., 2022) is a model trained from scratch that uses the UNet network (Ronneberger et al., 2015) and applies late fusion to predictions from multiple views. We report the published results of Auto-λ unless mentioned otherwise.

InstructRL uses the official pre-trained multimodal masked autoencoder (M3AE) model (Geng et al., 2022), which was jointly pre-trained on a combination of image-text datasets (Changpinyo et al., 2021; Thomee et al., 2016) and a text-only corpus (Devlin et al., 2018). CLIP-RL uses the official pre-trained CLIP models. See Appendix A.2 for more details about the pre-training datasets. All models are trained for 100K iterations. For evaluation, we measure the per-task success rate over 500 episodes per task. Further implementation and training details can be found in Appendix A.1.
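The evaluation protocol can be summarized by the sketch below, which assumes a hypothetical `env.reset`/`env.step` interface and a policy callable; it illustrates the 500-episode per-task success-rate measurement rather than the benchmark's actual API.

```python
def evaluate_task(env, policy, num_episodes=500, max_macro_steps=10):
    """Per-task success rate: fraction of episodes the environment reports as solved."""
    successes = 0
    for _ in range(num_episodes):
        obs, instruction = env.reset()  # random variation + templated instruction
        history = [obs]
        for _ in range(max_macro_steps):
            action = policy(instruction, history)
            obs, done, success = env.step(action)  # motion planner executes the macro-step
            history.append(obs)
            if done:
                successes += int(success)
                break
    return 100.0 * successes / num_episodes
```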

6. EXPERIMENTAL RESULTS

We evaluate the methods on single-task performance (Figure 5), multi-task performance (Figures 6, 7), generalization to unseen instructions and variations (Figure 8a), and scalability to larger model sizes (Figure 8b). On all metrics, we find that InstructRL outperforms state-of-the-art baselines despite being a simpler method.

Single-task performance. Figure 5 shows that, across all 9 categories of tasks, InstructRL performs on par with or better than all state-of-the-art baselines. On average, InstructRL significantly outperforms prior work despite being much simpler.

Multi-task performance. In the multi-task setting, each model is trained on a set of tasks and then evaluated on each of the training tasks. In Figure 6, we compare multi-task performance on 10 RLBench tasks considered in HiveFormer (Guhur et al., 2022) and Auto-λ (Liu et al., 2022), with 100 training demonstrations per task. InstructRL exhibits strong multi-task performance compared to other methods. In particular, even though both InstructRL and HiveFormer use a transformer policy with language instructions, InstructRL outperforms HiveFormer by a large margin. In Figure 7, we further evaluate multi-task performance on all 74 RLBench tasks, and see that InstructRL outperforms CLIP-RL in most categories. These results demonstrate the transferability of jointly pre-trained vision-language models to diverse multi-task settings.

Generalization to unseen instructions and variations. In Figure 8a, we report the performance on the Push Buttons task, which requires the agent to push colored buttons in a specified sequential order given in the instruction. In this task, instructions are necessary to solve the unseen task variations correctly (e.g., pushing the buttons in a blue-red-green vs. red-green-blue order cannot be inferred from only looking at the scene). We evaluate the models on instructions that are both seen and unseen during training; unseen instructions can contain new colors or an unseen ordering of colors. We see that InstructRL achieves higher performance on both seen and unseen instructions, even in the most challenging case where only 10 demonstrations are available per variation.

Scalability to larger model size. One of the key benefits of pre-trained models is that we can use a huge amount of pre-training data that is typically not affordable in RL and robot learning scenarios. In Figure 8b, we evaluate different model sizes on multi-task performance across 14 selected tasks (listed in Appendix A.5). For a fair comparison with training from scratch, we fix the size of the policy transformer and only vary the size of the transformer encoder. We compare four model sizes: B/32, B/16, L/16, and H/16, where "B" denotes ViT-base, "L" denotes ViT-large, "H" denotes ViT-huge, and "16" and "32" denote patch sizes 16 and 32, respectively. Both CLIP-RL and InstructRL improve with larger model size, but InstructRL shows better scalability (we only report CLIP-RL with B/32 and B/16, since larger models are not provided among the open-source released CLIP checkpoints). On the other hand, learning from scratch is unable to benefit from larger model capacity; in fact, a strong weight decay was needed to prevent this baseline from overfitting to the limited data.
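For reference, the encoder sizes compared here correspond to the standard ViT configurations; the dictionary below is our own summary of those commonly used hyperparameters, not a configuration taken from the paper.

```python
# Standard ViT encoder configurations ("B" = base, "L" = large, "H" = huge);
# the trailing number is the image patch size in pixels.
VIT_CONFIGS = {
    "B/32": dict(layers=12, hidden_dim=768,  heads=12, patch_size=32),
    "B/16": dict(layers=12, hidden_dim=768,  heads=12, patch_size=16),
    "L/16": dict(layers=24, hidden_dim=1024, heads=16, patch_size=16),
    "H/16": dict(layers=32, hidden_dim=1280, heads=16, patch_size=16),
}
```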

7. ABLATION STUDY

Instructions are more effective with a joint vision-language model. Table 1 shows models trained with and without instructions on 10 RLBench tasks from Guhur et al. (2022); Liu et al. (2022). While using instructions improves the performance of all methods, they are most effective with InstructRL. This is not surprising, as the CLIP language encoder in HiveFormer was not pre-trained on large text corpora. One can pair CLIP with a large language model such as BERT, but doing so can often hurt performance (Guhur et al., 2022) due to the language representations not being aligned with the visual representations. On the other hand, InstructRL shows the strongest performance due to using the M3AE encoder, which was jointly pre-trained on both image-text and text-only data.

Detailed instructions require language understanding. In Table 2, we also evaluate the methods on longer and more detailed instructions, which include step-by-step instructions specific to the task (see Table 3 for examples). The detailed instructions are automatically generated using a template that contains more details than the default instruction template from RLBench ("Default"). Each "Tuned Short Instruction" has a maximum token length of 77 (to be compatible with CLIP's language encoder), while each "Tuned Long Instruction" can have up to 256 tokens (to be compatible with InstructRL). We compare InstructRL with CLIP-RL, which uses concatenated visual and language representations from CLIP, as well as a variant of CLIP-RL called BERT-RL, which uses a BERT language encoder and a trained-from-scratch vision encoder. Generally, across all methods, performance increases with longer and more detailed instructions, which implies that the pre-trained language encoders can extract task-relevant information from language instructions. InstructRL achieves the best performance, especially with longer instructions (e.g., InstructRL achieves 63.7 vs. 46.1 from BERT-RL on the Push Buttons task with Tuned Long Instructions).

Table 3: Longer instructions for Push Buttons and Stack Blocks.

Push Buttons
- Default: Push the red button, then push the green button, then push the yellow button.
- Short Tuned Inst: Move the gripper close to red button, then push the red button, after that, move the gripper close to the green button to push the green button, finally, move the gripper close to the yellow button to push the yellow button.
- Long Tuned Inst: Move the white gripper closer to red button, then push red button down, after pushing red button, pull the white gripper up and move the gripper closer to green button, then push green button down, after pushing green button, pull the white gripper up and move the gripper closer to the yellow button, then push yellow button down.

Stack Blocks
- Default: Place 3 of the red cubes on top of each other.
- Short Tuned Inst: Choose a red cube as the stack base, then pick another red cube and place it onto the red cube stack, repeat until the stack has 3 red cubes.
- Long Tuned Inst: Choose a red cube as the stack base, then pick another red cube and place it onto the red cube stack, then move the white gripper to pick another red cube and place it onto the red cube stack, repeat until the stack has 3 red cubes.

Fusion strategy. How to fuse representations from multimodal data is one of the key design choices in InstructRL. In Figure 9a, we compare variants of InstructRL with other fusion strategies: Concat runs the encoder twice to obtain vision and language representations separately, then concatenates them together; FiLM (Perez et al., 2018) fuses vision and language features layer-wise before concatenating them together; Default refers to InstructRL, where language and vision are encoded once to obtain joint representations. We see that using a joint representation of language and vision inputs is critical for performance.

History context encoding. The flexibility of the transformer architecture allows us to encode multiple past steps. To investigate the benefit of utilizing historical information, we ablate the history context length. Figure 9b shows a steady increase in performance as we increase the context length. However, using a longer context requires more compute and memory (e.g., increasing the context length from 4 to 8 adds about 20% training time on one task). Thus, we choose a context length of 4 for computational efficiency.

Multi-scale features. In Figure 9c, we compare the performance of InstructRL with features selected from different layers. The results show that using a combination of intermediate representations is most effective.
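To make the fusion strategies compared in Figure 9a concrete, the sketch below contrasts them for a single camera view; `encoder`, `gamma_proj`, and `beta_proj` are hypothetical callables, and the FiLM modulation is applied once rather than layer-wise for brevity.

```python
import jax.numpy as jnp

def fuse_concat(encoder, img_emb, txt_emb):
    # "Concat": run the encoder twice (vision-only, language-only), then concatenate.
    vis = encoder(img_emb).mean(axis=0)
    lang = encoder(txt_emb).mean(axis=0)
    return jnp.concatenate([vis, lang], axis=-1)

def fuse_film(encoder, img_emb, txt_emb, gamma_proj, beta_proj):
    # "FiLM": language features modulate vision features (scale and shift) before concatenation.
    lang = encoder(txt_emb).mean(axis=0)
    vis = encoder(img_emb)
    vis = gamma_proj(lang) * vis + beta_proj(lang)  # feature-wise affine modulation
    return jnp.concatenate([vis.mean(axis=0), lang], axis=-1)

def fuse_joint(encoder, img_emb, txt_emb):
    # "Default" (InstructRL): encode image patches and instruction tokens in a single pass.
    tokens = jnp.concatenate([img_emb, txt_emb], axis=0)
    return encoder(tokens).mean(axis=0)
```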

8. CONCLUSION

In this paper, we propose InstructRL, a novel transformer-based instruction-following method. InstructRL has a simple and scalable architecture consisting of a pre-trained multimodal transformer and a policy transformer. Experimental results show that InstructRL significantly outperforms state-of-the-art pre-trained and trained-from-scratch instruction-following methods. As InstructRL achieves state-of-the-art results and scales well with model capacity, applying our approach to larger-scale problems is an interesting direction for future work.

A APPENDIX

A.1 IMPLEMENTATION AND COMPUTE RESOURCES

We use the AdamW (Kingma & Ba, 2015; Loshchilov & Hutter, 2017) optimizer with a learning rate of 5e-4 and weight decay of 5e-5. All models are trained on TPU v3-128 using Cloud TPU. Since TPU does not support headless rendering with the PyRep simulator (https://github.com/stepjam/PyRep) that RLBench is built upon, we evaluate the models on NVIDIA Tesla V100 SXM2 GPUs using headless rendering. Each per-device training batch consists of 32 demonstration sequences of length 4; in total, the batch size is 512. All models are trained for 100K iterations. We apply data augmentation during training, including jitter over the RGB images $c_t^k$. Our implementation of InstructRL is built upon the official JAX implementation of the multimodal masked autoencoder (M3AE) (Geng et al., 2022) (https://github.com/young-geng/m3ae_public), and we use the official pre-trained M3AE models, which were pre-trained on CC12M (Changpinyo et al., 2021) and a text corpus (Devlin et al., 2018). Our implementation of CLIP-RL is built upon a JAX CLIP implementation (https://github.com/google-research/scenic/tree/main/scenic/projects/baselines/clip), and we use the official pre-trained CLIP models (https://github.com/openai/CLIP) for the visual and language representations. Both CLIP-RL and InstructRL use ViT-B/16 in all experiments unless otherwise specified. As demonstrated in the experiments section, the larger ViT-L/16 and ViT-H/16 models can further boost InstructRL's results, but we use ViT-B/16 to reduce computation cost and to have an apples-to-apples comparison with the baselines. Our code is available at https://sites.google.com/view/instructrl/.
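A minimal optax sketch of the optimizer settings described above (AdamW, learning rate 5e-4, weight decay 5e-5); the toy parameter and gradient pytrees are placeholders.

```python
import jax.numpy as jnp
import optax

# Optimizer as described above: AdamW with learning rate 5e-4 and weight decay 5e-5.
optimizer = optax.adamw(learning_rate=5e-4, weight_decay=5e-5)

# Toy parameter pytree standing in for the policy transformer's weights.
params = {"w": jnp.zeros((4, 4)), "b": jnp.zeros((4,))}
opt_state = optimizer.init(params)

# One update step, given gradients of the behavioral-cloning loss (zeros as a placeholder here).
grads = {"w": jnp.zeros((4, 4)), "b": jnp.zeros((4,))}
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```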

A.2 PRE-TRAINING DATASETS

A.2.1 M3AE

M3AE is trained on a combination of image-text datasets and text-only datasets. The image-text data includes the publicly available Conceptual Caption 12M (CC12M) (Changpinyo et al., 2021), RedCaps (Desai et al., 2021), and a 15M subset of YFCC100M (Thomee et al., 2016) selected according to Radford et al. (2021). The text-only data includes the publicly available Wikipedia and the Toronto BookCorpus (Zhu et al., 2015). For language data, we use the BERT tokenizer from HuggingFace (https://huggingface.co/docs/transformers/main_classes/tokenizer) to tokenize the text. Following Geng et al. (2022), we use 0.5 for the text loss weight and 0.75 for the text mask ratio. We use 0.1 for the text-only loss.
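Below is a minimal example of tokenizing an instruction with the HuggingFace BERT tokenizer mentioned above; the `bert-base-uncased` checkpoint name and the 256-token padding length are our assumptions.

```python
from transformers import BertTokenizer

# Tokenize an instruction with the HuggingFace BERT tokenizer (checkpoint name assumed).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(
    "Place 3 of the red cubes on top of each other",
    padding="max_length",
    max_length=256,   # long tuned instructions use up to 256 tokens
    truncation=True,
    return_tensors="np",
)
print(tokens["input_ids"].shape)  # (1, 256)
```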

A.2.2 CLIP

CLIP is trained on the publicly available YFCC100M (Thomee et al., 2016) which consists of 100M image-text pairs.

A.3 RLBENCH INSTRUCTIONS

We use the task instructions provided in RLBench. For each task, there are a number of possible variations, such as the shapes and colors of objects, the initial object positions, and the goal of the task. Based on the task and variation, the environment generates a natural language task instruction. For example, in the task "Put groceries in cupboard", the generated instruction is of the form "pick up the OBJECT and place it in the cupboard", where OBJECT is one of 9 possible grocery items. A full list of RLBench task instructions can be found at https://github.com/stepjam/RLBench/tree/master/rlbench/tasks.

A.4 EVALUATION DETAILS

In the multi-task setting, models are trained on multiple tasks simultaneously and subsequently evaluated on each task independently. In the generalization experiment setting, models are trained on a subset of all variations, then evaluated on unseen task variations.

A.5 MULTI-TASK MULTI-VARIATION TASKS DETAILS

For the experiments on model scaling (Figure 8b) and the ablation studies (Figure 9), we selected 14 difficult tasks that have multiple variations: Stack wine, Open drawer, Meat off grill, Put item in drawer, Turn tap, Put groceries in cupboard, Sort shape, Screw bulb in, Close jar, Stack blocks, Put item in safe, Insert peg, Stack cups, and Place cups. Each task comes with multiple variations that include object colors, object shapes, and the ordering of objects to be interacted with. To make a fair comparison with the baselines, we use the slightly modified version of RLBench from HiveFormer (Guhur et al., 2022). The changes include adding new tasks and improving the motion planner. The full details of each task can be found in the RLBench GitHub repo (https://github.com/stepjam/RLBench/tree/master/rlbench/tasks) and the HiveFormer GitHub repo (https://github.com/guhur/RLBench/tree/74427e188cf4984fe63a9c0650747a7f07434337).

A.6 SAMPLE EFFICIENCY ABLATION

In the experiments, we used 100 demonstrations, following the same setup as in prior work, and InstructRL outperforms the prior state of the art significantly. However, collecting 100 demonstrations can be expensive and difficult in many real-world tasks. We hypothesize that InstructRL can learn more sample-efficiently thanks to the joint language-vision encoder pre-trained on large-scale passive datasets. To study this, we randomly choose a subset of 8 tasks and evaluate the performance using only 10 demonstrations. The results in Table 4 show that InstructRL outperforms the baselines when using either 100 or 10 demonstrations. Moreover, using only 10 demonstrations, InstructRL achieves competitive or higher success rates than baselines that use 100 demonstrations. For example, on the stack blocks task, InstructRL with 10 demos achieves a 32% success rate while HiveFormer gets 31% with 100 demos. Similarly, on the open drawer task, InstructRL with 10 demos gets 81% while HiveFormer gets 86% with 100 demos. These results show that InstructRL is more sample efficient than the prior state of the art.

A.7 TASK CATEGORIES DETAILS

To compare with baselines including Guhur et al. (2022), our experiments are conducted on the same 74 out of 106 tasks from RLBench. The 74 tasks can be categorized into 9 task groups according to their key challenges, as shown in Table 5.

Task diversity. The 74 RLBench tasks cover a wide range of challenges that are essential for robot learning, including planning over multiple sub-goals (e.g., picking up a basketball and then throwing the ball) and domains where there are multiple possible trajectories to solve a task due to a large affordance area of the target object (e.g., the edge of a cup). RLBench employs three key terms: Task, Variation, and Episode. For each task there are one or more variations, and from each variation an infinite number of episodes can be drawn. Each variation of a task comes with a list of textual descriptions that verbally summarise this variation of the task. An example showing the distinction between task and variation is given in Figure 1 and Figure 4 of James et al. (2020).

Task difficulty. RLBench is a manipulation benchmark that is significantly more difficult than locomotion benchmarks. Some tasks involve precise object manipulation, such as 'put knife on chopping board' and 'take usb out of computer'; some tasks require precise grasping, such as 'close laptop lid' and 'open and close drawer'; and some tasks are long-horizon tasks that involve many composed sets of actions. For example, the 'empty dishwasher' task involves opening the washer door, sliding out the tray, grasping a plate, and then lifting the plate out of the tray. In this work, we consider tasks that can be grouped into 9 categories, as shown in Table 5, to comprehensively evaluate the effectiveness of InstructRL. In addition to manipulation and perception challenges, RLBench comes with a suite of diverse task instructions that require natural language understanding. RLBench has a large and diverse vocabulary, with over 100 content words (e.g., table, cup, open, grasp, box, etc.) after function words are removed; combining this with the diverse visual input and complex interactions leads to a suite of challenging instruction-following manipulation tasks.

B INSTRUCTION TEMPLATES AND GENERATIONS

Default instructions generation. The default instruction templates are shown in Table 6.

Long instructions generation. The tuned instructions, which contain step-by-step and detailed descriptions of objects, are shown in Table 7.

Table 7: Language instruction templates for the tuned instructions used in Table 2 and Table 3.

Push Buttons (variations: ordering of buttons, colors)
- Default: push the {color name} button, then push the {color name} button, then ...
- Short Tuned Inst: Move the gripper close to {1st color name} button, then push the {2nd color name} button, after that, move the gripper close to the {3rd color name} button to push the {4th color name} button, finally, ..., move the gripper close to the {n-th color name} button to push the {n-th color name} button.
- Long Tuned Inst: Move the white gripper closer to {1st color name} button, then push {2nd color name} button down, after pushing {3rd color name} button, pull the white gripper up and move the gripper closer to {4th color name} button, ..., pull the white gripper up and move the gripper closer to the {n-th color name} button, then push {n-th color name} button down.

Stack Blocks
- Short Tuned Inst: Choose a {color name} cube as stack base, then pick another {color name} cube and place it onto the {color name} stack, repeat until the stack has 3 {color name} cubes.
- Long Tuned Inst: Choose a {color name} cube as the stack base, then pick another {color name} cube and place it onto the {color name} cube stack, then move the white gripper to pick another {color name} cube and place it onto the {color name} cube stack, repeat until the stack has 3 {color name} cubes.
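As an illustration of how such templates can be instantiated, the small helper below fills the Stack Blocks short tuned template from a color name; the function is ours and not part of RLBench.

```python
def stack_blocks_short_instruction(color: str, num_cubes: int = 3) -> str:
    """Fill the 'Short Tuned Inst' template for the Stack Blocks task."""
    return (
        f"Choose a {color} cube as stack base, "
        f"then pick another {color} cube and place it onto the {color} stack, "
        f"repeat until the stack has {num_cubes} {color} cubes."
    )

print(stack_blocks_short_instruction("red"))
```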



Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In IEEE International Conference on Computer Vision (ICCV), 2015.



Figure 2: Different frameworks of leveraging pre-trained representations for instruction-following agents. In prior work, additional training-from-scratch is needed to combine the representations of text and image from (I) a pre-trained vision model, (II) a pre-trained language model, or (III) disjointly pre-trained language and vision models. In contrast, InstructRL extracts generic representations from (IV) a jointly pre-trained vision-language model.

Figure 3: InstructRL is composed of a vision-language transformer and a policy transformer. First, the instruction (text) and multi-view image observations are jointly encoded using the pre-trained vision-language transformer. Next, the sequence of representations and a history of actions are encoded by the policy transformer to predict the next action.

Figure 4: The architecture of the transformer policy. The model is conditioned on a history of language-vision representations and actions to predict the next action.

Figure 5: Single-task performance on 74 RLBench tasks from James et al. (2020); Guhur et al. (2022) grouped into 9 categories.

Figure 8: (a): Performance on the Push Buttons task, which has many variations on the ordering of colored buttons. InstructRL achieves higher performance on both seen and unseen instructions, even with only 10 training demonstrations per variation. (b): We compare four different model sizes of the transformer encoder. All methods are evaluated on multi-task performance across 14 selected tasks (listed in Appendix A.5). InstructRL scales well with larger model size.

Figure 9: Ablations of InstructRL evaluated on the 14 selected tasks listed in Appendix A.5. We report the average success % over all tasks. (a): Performance of InstructRL with different strategies for fusing the language and vision features. (b): Effect of history context length on the performance of InstructRL. (c): InstructRL with features selected from different layers.

Table 1: Success rates (%) of all methods with and without instructions on 10 RLBench tasks.

Table 2: Comparison between using default instructions vs. longer and more detailed instructions.


Table 4: Performance (success rate %) of HiveFormer and InstructRL while varying the number of demonstrations, on the stack blocks, slide block, open drawer, push buttons, meat off grill, put money in safe, stack wine, and turn tap tasks. InstructRL can use fewer demonstrations to achieve a high success rate, while HiveFormer requires more demonstrations.

Table 5: RLBench tasks used in our experiments grouped into 9 categories.

Table 6: Language instruction templates in RLBench (James et al., 2020).

