A SYSTEM FOR MORPHOLOGY-TASK GENERALIZATION VIA UNIFIED REPRESENTATION AND BEHAVIOR DISTILLATION

Abstract

The rise of generalist large-scale models in natural language and vision suggests that a massive data-driven approach could achieve broader generalization in other domains, such as continuous control. In this work, we explore a method for learning a single policy that controls agents of various morphologies to solve various tasks by distilling a large amount of proficient behavioral data. To align the input-output (IO) interface across multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce the morphology-task graph, which treats observations, actions, and goals/tasks in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and a hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find that a morphology-task graph representation coupled with a Transformer architecture improves multi-task performance compared to other baselines, including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer and sample efficiency in downstream multi-task imitation learning. Our work suggests that large diverse offline datasets, a unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.

1. INTRODUCTION

The impressive success of large language models (Devlin et al., 2019; Radford et al., 2019; Bommasani et al., 2021; Brown et al., 2020; Chowdhery et al., 2022) has encouraged other domains, such as computer vision (Radford et al., 2021; Gu et al., 2021b; Alayrac et al., 2022; Jaegle et al., 2021) and robotics (Ahn et al., 2022; Huang et al., 2022b), to leverage large-scale pre-trained models trained on massive data with a unified input-output interface. These large-scale pre-trained models are innately multi-task learners: they work surprisingly well not only in fine-tuning or few-shot transfer but also in zero-shot transfer settings (Raffel et al., 2020; Chen et al., 2022a). Learning a "generalist" model seems to be an essential goal in the recent machine learning paradigm, with the same key ingredients: curate a massive diverse dataset, define a unified IO representation, and perform efficient representation and architecture selection, altogether for best generalization.

In reinforcement learning (RL) for continuous control, several axes of generalization matter. First, we care about "task" generalization. For instance, in robotic manipulation, we want the policy to generalize to different objects and target goal positions (Kalashnikov et al., 2018; Andrychowicz et al., 2017; Yu et al., 2019; Lynch et al., 2019). Recent advances in vision and language models also enable task generalization through compositional natural language instructions (Jiang et al., 2019; Shridhar et al., 2022a; Ahn et al., 2022; Cui et al., 2022). However, to scale the data, equally important is "morphology" generalization, where a single policy can control agents of different embodiments (Wang et al., 2018; Noguchi et al., 2021) and can thereby ingest experiences from as many robots in different simulators (Freeman et al., 2021; Todorov et al., 2012; Coumans & Bai, 2016) as possible. Most prior works (Mendonca et al., 2021; Gupta et al., 2022) only address either the task or the morphology axis separately, and achieving broad generalization over task and morphology jointly remains a long-standing problem[2].

This paper first proposes MxT-Bench[3], the first multi-morphology and multi-task benchmarking environment, as a step toward building a massive diverse dataset for continuous control. MxT-Bench provides various combinations of different morphologies (ant, centipede, claw, worm, and unimal (Gupta et al., 2022)) and different tasks (reach, touch, and twisters). MxT-Bench is easily scalable to additional morphologies and tasks, and is built on top of Brax (Freeman et al., 2021) for fast behavior generation.

Next, we define a unified IO representation for an architecture to ingest all the multi-morphology multi-task data. Inspired by the scene graph (Johnson et al., 2015) in computer vision, which represents the 3D relational information of a scene, and by the morphology graph (Wang et al., 2018; Chen et al., 2018; Huang et al., 2020; Gupta et al., 2022), which expresses an agent's geometry and actions, we introduce the notion of the morphology-task graph (MTG) as a unified interface that encodes observations, actions, and goals (i.e. tasks) as nodes in a shared graph representation.
Goals are represented as sub-nodes, and different tasks correspond to different node choices: touching means controlling a torso node, while reaching means controlling an end-effector node (Figure 3). In contrast to discretizing and tokenizing every dimension as proposed in recent work (Janner et al., 2021; Reed et al., 2022), this unified IO limits the data representations it can ingest, but strongly preserves the 3D geometric relationships that are crucial for any physics control problem (Wang et al., 2018; Ghasemipour et al., 2022), and we empirically show it outperforms naive tokenization on our control-focused dataset.

Lastly, while conventional multi-task or meta-RL studies generalization through on-policy joint training (Yu et al., 2019; Cobbe et al., 2020), we perform efficient representation and architecture selection, over 11 combinations of unified IO representations and network architectures and 8 local node observations, for optimal generalization through behavior distillation (Figure 1), where RL is essentially treated as a (single-task, low-dimensional) behavior generator (Gu et al., 2021a) and multi-task supervised learning (or offline RL (Fujimoto et al., 2019)) is used to imitate all the behaviors (Singh et al., 2021; Chen et al., 2021b; Reed et al., 2022). Through offline distillation, we controllably and tractably evaluate two variants of the MTG representation, along with multiple network architectures (MLP, GNN (Kipf & Welling, 2017), Transformer (Vaswani et al., 2017)), and show that the MTGv2 variant with a Transformer improves multi-task goal-reaching performance over other possible choices by 23% and provides better prior knowledge for zero-shot generalization (by 14∼18%) and fine-tuning for downstream multi-task imitation learning (by 50∼55%). As the fields of vision and language move toward broad generalization (Chollet, 2019; Bommasani et al., 2021), we hope our work encourages the RL and continuous control communities to continue studying and advancing morphology-task generalization.
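To make the unified IO interface concrete, the following is a minimal sketch (not the paper's implementation) of a morphology-task graph: agent body parts are nodes carrying local observations and actions, edges follow the kinematic tree, and the task is determined by which node a goal sub-node is attached to. The `Node`, `MorphologyTaskGraph`, and `attach_goal` names are hypothetical illustrations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    name: str                           # e.g. "torso", "foot"
    obs: List[float]                    # local observation (position, rotation, ...)
    action_dim: int = 0                 # number of joints actuated at this node
    goal: Optional[List[float]] = None  # goal sub-node, set only if targeted

@dataclass
class MorphologyTaskGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # kinematic tree

    def attach_goal(self, node_name: str, goal_xyz: List[float]) -> None:
        # Different tasks pick different target nodes: "reach" attaches the
        # goal to an end effector, "touch" attaches it to the torso.
        self.nodes[node_name].goal = goal_xyz

# Toy two-node agent; instantiate a "reach" task on its end effector.
g = MorphologyTaskGraph()
g.nodes["torso"] = Node("torso", obs=[0.0, 0.0, 0.5])
g.nodes["foot"] = Node("foot", obs=[0.3, 0.0, 0.0], action_dim=1)
g.edges.append(("torso", "foot"))
g.attach_goal("foot", [1.0, 0.0, 0.2])
```

Because observations, actions, and goals share one graph, the same policy network can consume agents with different numbers of nodes, and switching tasks only changes where the goal sub-node is attached.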



[2] Here, "task" means what each agent should solve. See Section for the detailed definition. When considering the models, the term is slightly overloaded and may imply morphological diversity as well.
[3] Pronounced as "mixed"-bench. It stands for "Morphology × Task".



Figure 1: Behavior distillation pipeline. We first train a single-task policy for each environment on MxT-Bench, and then collect a proficient morphology-task behavior dataset (Section 4.1). To enable a single policy to learn multiple tasks and morphologies simultaneously, we convert stored transitions into the morphology-task graph representation to align with the unified IO interface (Section 4.3) for multi-task distillation (Section 4.2). After behavior distillation, the learned policy can be utilized for in-distribution or zero-shot generalization (Section 5.1), downstream fine-tuning (Section 5.2), and representation and architecture selection (Section 5.3).
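The distillation step of this pipeline can be sketched as plain multi-task supervised learning: pool (observation, expert action) pairs generated per environment by single-task RL, then fit one policy to all of them. The sketch below is a toy stand-in, assuming a linear "policy" and synthetic expert data in place of MxT-Bench rollouts and the MTG + Transformer model; `collect_expert_data`, `distill`, and `W_TRUE` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
W_TRUE = rng.normal(size=(8, 2))  # hidden "expert" mapping, for the sketch only

def collect_expert_data(num_envs=3, steps=64):
    # Stand-in for rolling out trained single-task RL policies in MxT-Bench:
    # each environment contributes (observation, expert action) pairs.
    data = []
    for _ in range(num_envs):
        obs = rng.normal(size=(steps, W_TRUE.shape[0]))
        data.append({"obs": obs, "act": obs @ W_TRUE})
    return data

def distill(datasets, lr=0.05, epochs=500):
    # Fit one linear policy a = s @ W to the pooled data by gradient descent
    # on squared error (stand-in for training the MTG + Transformer policy).
    obs = np.concatenate([d["obs"] for d in datasets])
    act = np.concatenate([d["act"] for d in datasets])
    W = np.zeros_like(W_TRUE)
    for _ in range(epochs):
        W -= lr * obs.T @ (obs @ W - act) / len(obs)
    return W

W = distill(collect_expert_data())  # single policy distilled from all envs
```

Treating RL purely as a behavior generator keeps the expensive exploration single-task and low-dimensional, while the multi-task generalization question is studied tractably in the supervised phase.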

