A SYSTEM FOR MORPHOLOGY-TASK GENERALIZATION VIA UNIFIED REPRESENTATION AND BEHAVIOR DISTILLATION

Abstract

The rise of generalist large-scale models in natural language and vision raises the expectation that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that controls agents with diverse morphologies to solve diverse tasks by distilling a large amount of proficient behavioral data. To align the input-output (IO) interface across multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce the morphology-task graph, which treats observations, actions, and goals/tasks in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations from a minimal blueprint and a hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find that the morphology-task graph representation coupled with a Transformer architecture improves multi-task performance over other baselines, including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer and sample efficiency in downstream multi-task imitation learning. Our work suggests that large diverse offline datasets, a unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.
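To make the abstract's central idea concrete, the following is a minimal, hypothetical sketch of a morphology-task graph: body parts and task goals become nodes of one graph connected along the kinematic tree, and the graph is flattened into a variable-length token sequence that a Transformer could consume regardless of how many limbs an agent has. All names here (`MTGraph`, `to_token_sequence`, the feature layout) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of a "morphology-task graph": observations, actions,
# and goals for one agent are packed into a single graph whose nodes are
# body parts (plus goal nodes) and whose edges follow the kinematic tree.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    kind: str              # "body" for a limb/joint, "goal" for a task goal
    features: List[float]  # proprioception for body nodes, 3D target for goals

@dataclass
class MTGraph:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)  # kinematic links

    def add_body(self, features):
        self.nodes.append(Node("body", list(features)))
        return len(self.nodes) - 1

    def add_goal(self, xyz, attach_to):
        """Attach a goal node to the body node it refers to (e.g. an end effector)."""
        idx = len(self.nodes)
        self.nodes.append(Node("goal", list(xyz)))
        self.edges.append((attach_to, idx))
        return idx

def to_token_sequence(g: MTGraph, dim=6):
    """Pad every node's features to a fixed width so agents with different
    numbers of limbs become variable-length sequences of same-width tokens
    (one token per node), suitable as Transformer input."""
    return [n.features + [0.0] * (dim - len(n.features)) for n in g.nodes]

# Example: a two-limb agent with one positional goal.
g = MTGraph()
torso = g.add_body([0.0, 1.0, 0.0, 0.1])
limb = g.add_body([0.2, -0.3, 0.5, 0.0])
g.edges.append((torso, limb))
g.add_goal([1.0, 0.0, 0.5], attach_to=limb)

tokens = to_token_sequence(g)
print(len(tokens), len(tokens[0]))  # 3 tokens, each padded to width 6
```

The point of the sketch is the interface, not the learning: because goals live in the same graph as observations, a single sequence model can be conditioned on both the agent's morphology and its task without per-embodiment input heads.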

1. INTRODUCTION

The impressive success of large language models (Devlin et al., 2019; Radford et al., 2019; Bommasani et al., 2021; Brown et al., 2020; Chowdhery et al., 2022) has encouraged other domains, such as computer vision (Radford et al., 2021; Gu et al., 2021b; Alayrac et al., 2022; Jaegle et al., 2021) and robotics (Ahn et al., 2022; Huang et al., 2022b), to leverage large-scale models pre-trained on massive data with a unified input-output interface. These large-scale pre-trained models are innately multi-task learners: they work surprisingly well not only in fine-tuning or few-shot transfer but also in zero-shot transfer settings (Raffel et al., 2020; Chen et al., 2022a). Learning a "generalist" model thus seems an essential goal in the current machine learning paradigm, with the same key ingredients: curate a massive diverse dataset, define a unified IO representation, and perform efficient representation and architecture selection, altogether for best generalization.

In reinforcement learning (RL) for continuous control, several axes of generalization matter. First, we care about "task" generalization. For instance, in robotic manipulation, we want a policy to generalize across different objects and target goal positions (Kalashnikov et al., 2018; Andrychowicz et al., 2017; Yu et al., 2019; Lynch et al., 2019). Recent advances in vision and language models also enable task generalization through compositional natural language instructions (Jiang et al., 2019; Shridhar et al., 2022a; Ahn et al., 2022; Cui et al., 2022). However, to scale the data, equally important is "morphology" generalization, where a single policy can control agents of different embodiments (Wang et al., 2018; Noguchi et al., 2021) and can thereby ingest experiences from as

