PROGRAMMATICALLY GROUNDED, COMPOSITIONALLY GENERALIZABLE ROBOTIC MANIPULATION

Abstract

Robots operating in the real world require both rich manipulation skills and the ability to semantically reason about when to apply those skills. Towards this goal, recent works have integrated semantic representations from large-scale pretrained vision-language (VL) models into manipulation models, endowing them with more general reasoning capabilities. However, we show that the conventional pretraining-finetuning pipeline for integrating such representations entangles the learning of domain-specific action information and domain-general visual information, leading to less data-efficient training and poor generalization to unseen objects and tasks. To this end, we propose PROGRAMPORT, a modular approach that better leverages pretrained VL models by exploiting the syntactic and semantic structures of language instructions. Our framework uses a semantic parser to recover an executable program composed of functional modules grounded on vision and action across different modalities. Each functional module is realized as a combination of deterministic computation and learnable neural networks. Program execution produces parameters for general manipulation primitives of a robotic end-effector. The entire modular network can be trained with end-to-end imitation learning objectives. Experiments show that our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization across a variety of manipulation behaviors.

1. INTRODUCTION

Robotic manipulation models that map directly from raw pixels to actions are capable of learning diverse and complex behaviors through imitation. To enable more abstract goal specification, many such models also take natural language instructions as input. However, this vision-language manipulation setting introduces a new problem: the agent must jointly learn to ground language tokens in its perceptual inputs, and to map this grounded understanding to the desired actions. Moreover, to fully leverage the flexibility of language, the agent must handle novel vocabulary and compositions not explicitly seen during training, but specified at test time (Fig. 1). To these ends, many recent works have relied on large pretrained vision-language (VL) models such as CLIP (Radford et al., 2021) to tackle both grounding and zero-shot generalization. As shown in Fig. 2a, these works generally treat the pretrained VL model as a semantic prior, for example, by partially initializing the weights of image and text encoders with a pretrained VL model (Ahn et al., 2022; Khandelwal et al., 2022; Shridhar et al., 2022a). These models are then updated by imitating expert demonstrations. However, this training scheme entangles the learning of domain-specific control policies and domain-independent vision-language grounding. Specifically, we find that these VL-enabled agents overfit to their task-specific data, exploiting shortcuts during training to successfully optimize their imitation learning (IL) objective without learning a generalizable grounding of language to vision. This phenomenon is particularly apparent when the agent is given language goals with unknown concepts and objects, or novel compositions of known concepts and objects. For example, an agent that overfits to packing shapes into a box can fail to generalize to other manipulation behaviors (e.g. pushing) involving the same shapes.
In this work, we introduce PROGRAMPORT, a program-based modular approach that enables more faithful vision-language grounding when incorporating pretrained VL models for robotic manipulation. Given a natural language instruction, we first use a Combinatory Categorial Grammar (CCG) (Steedman, 1996) to parse the sentence into a "manipulation program," based on a compact but general domain-specific language (DSL). The program consists of functional modules that are either visual grounding modules (e.g., locate all objects of a given category) or action policies (e.g., produce a control parameter). This enables us to directly leverage a pretrained VL model to ground singular, independent categories or attribute descriptors to their corresponding pixels, and thus disentangles the learning of visual grounding and action policies (Fig. 2b). Our programmatically structured, modular design enables PROGRAMPORT to learn more performant imitation policies from less data across 10 diverse tasks in a tabletop manipulation environment. We show that after training on a subset of objects, visual properties, and actions, our model can zero-shot generalize to completely different subsets and reason over novel compositions of language descriptors, without further finetuning (Fig. 1).
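To make the idea of a "manipulation program" concrete, the following is a minimal toy sketch of how an instruction might parse into grounding and action modules. The module names (Locate, PickPlace), the toy string-matching parser, and the handled sentence template are all illustrative assumptions; the paper's actual DSL and CCG parser are far more general.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Locate:
    """Visual grounding module: grounds one category/attribute phrase
    to image regions (realized with a pretrained VL model)."""
    concept: str

@dataclass(frozen=True)
class PickPlace:
    """Action policy module: consumes two grounded regions and
    produces control parameters for the pick-and-place primitive."""
    source: Locate
    target: Locate

def parse(instruction: str) -> PickPlace:
    """Toy stand-in for the CCG parser: handles only
    'put the X in the Y' (hypothetical, for illustration)."""
    words = instruction.lower().rstrip(".").split()
    in_idx = words.index("in")
    src = " ".join(w for w in words[2:in_idx] if w != "the")
    tgt = " ".join(w for w in words[in_idx + 1:] if w != "the")
    return PickPlace(Locate(src), Locate(tgt))

program = parse("put the red block in the green bowl")
```

Because each Locate module grounds a single concept in isolation, the visual grounding stays independent of which action module consumes it, which is the disentanglement the framework aims for.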

2. BACKGROUND

Problem formulation. We take a manipulation primitive-based approach to robot learning. At a high level, we assume our robot is given a set of primitives P (e.g., pick, place, and push). Such primitives are usually defined by interaction modes between the robot and objects, parameterized by continuous parameters, and their composition spans a wide range of tasks. Throughout the paper, we use the simple primitive set P = {pick, place} in a tabletop environment as a running example, and extend our framework to other primitives such as pushing in the experiment section. Our goal is then to learn a policy π mapping an input observation o_t at time t to an action a_t. Each a_t is a tuple of two control parameters (T_pick, T_place), which, in the pick-and-place setting,



Figure 1: Zero-Shot and Compositional Generalization: Our framework, PROGRAMPORT, is capable of generalizing to combinations of unseen objects and manipulation behaviors at test time.
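The primitive-based interface above can be sketched as follows. Note the concrete Pose fields (2D table position plus end-effector rotation) and the stand-in policy are assumptions for illustrating the tabletop setting; the paper's exact parameterization of T_pick and T_place may differ.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Pose:
    """Assumed pose parameterization: planar position + rotation
    about the table normal (illustrative, not the paper's exact form)."""
    x: float
    y: float
    theta: float

# An action a_t is a tuple of two control parameters (T_pick, T_place).
Action = Tuple[Pose, Pose]

def policy(observation) -> Action:
    """Stand-in for pi: a real policy would be a learned network that
    maps pixels to pick/place parameters, e.g., via the argmax of
    predicted affordance heatmaps. Returns fixed poses for illustration."""
    return (Pose(0.10, 0.20, 0.0), Pose(0.30, 0.40, 1.57))

T_pick, T_place = policy(observation=None)
```

Composing such primitives (pick, place, push, ...) with learned parameters is what lets one policy interface span a wide range of tabletop tasks.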

