A SYSTEM FOR MORPHOLOGY-TASK GENERALIZATION VIA UNIFIED REPRESENTATION AND BEHAVIOR DISTILLATION

Abstract

The rise of generalist large-scale models in natural language and vision suggests that a massive data-driven approach could achieve broader generalization in other domains, such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. To align the input-output (IO) interface across multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce the morphology-task graph, which treats observations, actions, and goals/tasks in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations from a minimal blueprint on a hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find that the morphology-task graph representation coupled with a Transformer architecture improves multi-task performance compared to other baselines, including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer and sample efficiency in downstream multi-task imitation learning. Our work suggests that large diverse offline datasets, unified IO representations, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.

1. INTRODUCTION

The impressive success of large language models (Devlin et al., 2019; Radford et al., 2019; Bommasani et al., 2021; Brown et al., 2020; Chowdhery et al., 2022) has encouraged other domains, such as computer vision (Radford et al., 2021; Gu et al., 2021b; Alayrac et al., 2022; Jaegle et al., 2021) and robotics (Ahn et al., 2022; Huang et al., 2022b), to leverage large-scale pre-trained models trained on massive data with a unified input-output interface. These large-scale pre-trained models are innately multi-task learners: they work surprisingly well not only in fine-tuning and few-shot transfer but also in zero-shot transfer settings (Raffel et al., 2020; Chen et al., 2022a). Learning a "generalist" model seems to be an essential goal in the recent machine learning paradigm, with the same key ingredients: curate a massive diverse dataset, define a unified IO representation, and perform efficient representation and architecture selection, altogether for best generalization.

In reinforcement learning (RL) for continuous control, various aspects are important for generalization. First, we care about "task" generalization. For instance, in robotic manipulation, we want the policy to generalize over different objects and target goal positions (Kalashnikov et al., 2018; Andrychowicz et al., 2017; Yu et al., 2019; Lynch et al., 2019). Recent advances in vision and language models also enable task generalization through compositional natural language instructions (Jiang et al., 2019; Shridhar et al., 2022a; Ahn et al., 2022; Cui et al., 2022). However, to scale the data, equally important is "morphology" generalization, where a single policy can control agents of different embodiments (Wang et al., 2018; Noguchi et al., 2021) and can thereby ingest experiences from as many robots in different simulators (Freeman et al., 2021; Todorov et al., 2012; Coumans & Bai, 2016) as possible. Most prior works (Mendonca et al., 2021; Gupta et al., 2022) address either the task or the morphology axis separately, and achieving broad generalization over task and morphology jointly remains a long-standing problem.

Figure 1: We first train a single-task policy for each environment on MxT-Bench, and then collect a proficient morphology-task behavior dataset (Section 4.1). To enable a single policy to learn multiple tasks and morphologies simultaneously, we convert stored transitions to the morphology-task graph representation to align with the unified IO interface (Section 4.3) for multi-task distillation (Section 4.2). After behavior distillation, the learned policy can be utilized for in-distribution or zero-shot generalization (Section 5.1), downstream fine-tuning (Section 5.2), and representation and architecture selection (Section 5.3).

This paper first proposes MxT-Bench, the first multi-morphology and multi-task benchmarking environment, as a step toward building a massive diverse dataset for continuous control. MxT-Bench provides various combinations of different morphologies (ant, centipede, claw, worm, and unimal (Gupta et al., 2022)) and different tasks (reach, touch, and twisters). MxT-Bench is easily scalable to additional morphologies and tasks, and is built on top of Brax (Freeman et al., 2021) for fast behavior generation. Next, we define a unified IO representation for an architecture to ingest all the multi-morphology multi-task data. Inspired by the scene graph (Johnson et al., 2015) in computer vision, which represents the 3D relational information of a scene, and by the morphology graph (Wang et al., 2018; Chen et al., 2018; Huang et al., 2020; Gupta et al., 2022), which expresses an agent's geometry and actions, we introduce the notion of the morphology-task graph (MTG) as a unified interface that encodes observations, actions, and goals (i.e. tasks) as nodes in a shared graph representation.
Goals are represented as sub-nodes, and different tasks correspond to different choices: touching is controlling a torso node, while reaching is controlling an end-effector node (Figure 3). In contrast to discretizing and tokenizing every dimension as proposed in recent work (Janner et al., 2021; Reed et al., 2022), this unified IO limits the data representations it can ingest, but strongly preserves the 3D geometric relationships that are crucial for any physics control problem (Wang et al., 2018; Ghasemipour et al., 2022), and we empirically show it outperforms naive tokenization on our control-focused dataset. Lastly, while conventional multi-task or meta RL studies generalization through on-policy joint training (Yu et al., 2019; Cobbe et al., 2020), we perform efficient representation and architecture selection (over 11 combinations of unified IO representations and network architectures, and 8 local node observations) for optimal generalization through behavior distillation (Figure 1), where RL is essentially treated as a (single-task, low-dimensional) behavior generator (Gu et al., 2021a) and multi-task supervised learning (or offline RL (Fujimoto et al., 2019)) is used to imitate all the behaviors (Singh et al., 2021; Chen et al., 2021b; Reed et al., 2022). Through offline distillation, we controllably and tractably evaluate two variants of the MTG representation, along with multiple network architectures (MLP, GNN (Kipf & Welling, 2017), Transformer (Vaswani et al., 2017)), and show that the MTGv2 variant with Transformer improves multi-task goal-reaching performance over the other choices by 23%, and provides better prior knowledge for zero-shot generalization (by 14∼18%) and fine-tuning for downstream multi-task imitation learning (by 50∼55%).
As the fields of vision and language move toward broad generalization (Chollet, 2019; Bommasani et al., 2021), we hope our work encourages the RL and continuous control communities to continue growing diverse behavior datasets, designing different IO representations, and iterating on representation and architecture selection, eventually optimizing a single policy that can be deployed on any morphology for any task. In summary, our key contributions are:

• We develop MxT-Bench as a test bed for morphology-task generalization with a fast expert behavior generator. MxT-Bench supports the scalable procedural generation of both agents and tasks with minimal blueprints.
• We introduce the morphology-task graph, a universal IO for control that treats the agent's observations, actions, and goals/tasks in a unified graph representation while preserving the task structure.
• We study generalization through offline supervised behavior distillation, where we can efficiently try out various design choices: over 11 combinations of unified IO representations and network architectures, and 8 local node observations. We find that Transformer with MTGv2 achieves the best multi-task performance among the possible designs (MLP, GNN, and Transformer with MTGv1, tokenized MTGv2, etc.) in both in-distribution and downstream tasks, such as zero-shot transfer and fine-tuning for multi-task imitation learning.

2. RELATED WORK

Morphology Generalization  While, in RL for continuous control, a policy typically learns to control only a single morphology (Tassa et al., 2018; Todorov et al., 2012), several works succeed in generalizing control to morphologically different agents solving a locomotion task by using morphology-aware Graph Neural Network (GNN) policies (Wang et al., 2018; Huang et al., 2020; Blake et al., 2021). In addition, several works (Kurin et al., 2021; Gupta et al., 2022; Hong et al., 2022; Trabucco et al., 2022) have investigated the use of Transformer (Vaswani et al., 2017). Other works jointly optimize a morphology-agnostic policy and the morphology itself (Pathak et al., 2019; Gupta et al., 2021; Yuan et al., 2022; Hejna et al., 2021), or transfer a controller over different morphologies (Devin et al., 2017; Chen et al., 2018; Hejna et al., 2020; Liu et al., 2022). While substantial effort has been invested in morphology generalization, these works mainly focus on a single task (e.g. running), and less attention has been paid to multi-task settings, where agents attempt to control different parts of their bodies to reach desired goals. We believe that goal-directed control is a key problem for an embodied single controller. Concurrently, Feng et al. (2022) propose an RL-based single controller that is applied to different quadruped robots and target poses in a sim-to-real setting. In contrast, our work introduces the notion of the morphology-task graph as a unified IO that represents observations, actions, and goals in a shared graph, and can handle more diverse morphologies to solve multiple tasks.

Task Generalization  Task generalization has previously been explored in the multi-task and meta RL literature (Wang et al., 2016; Duan et al., 2017; Cabi et al., 2017; Teh et al., 2017; Colas et al., 2019; Li et al., 2020; Yang et al., 2020; Kurin et al., 2022). Each task might be defined by differences in goals, reward functions, or dynamics (Ghasemipour et al., 2019; Kalashnikov et al., 2021; Eysenbach et al., 2020), under shared state and action spaces. Some works leverage graph representations to embed the compositionality of manipulation tasks (Li et al., 2019; Zhou et al., 2022; Li et al., 2021; Kumar et al., 2022; Ghasemipour et al., 2022), while others use natural language to specify diverse tasks (Jiang et al., 2019; Shridhar et al., 2022a; Ahn et al., 2022; Huang et al., 2022a; Cui et al., 2022). Despite notable success in task generalization, multi-task RL often deals with only a single morphology. We aim to extend the general behavior policy to the "cartesian product" of tasks and morphologies (as shown in Figure 2) to realize a more scalable and capable controller.

Transformer for RL  Recently, Chen et al. (2021a) and Janner et al. (2021) cast offline RL as a supervised sequence modeling problem, and follow-up works achieve impressive success (Reed et al., 2022; Lee et al., 2022; Furuta et al., 2022; Xu et al., 2022; Shafiullah et al., 2022; Zheng et al., 2022; Paster et al., 2022). In contrast, our work leverages Transformer to handle the topological and geometric information of the scene, rather than the sequential nature of the agent's trajectory.

Behavior Distillation  Due to massive trial-and-error and large variance, training a policy from scratch with online RL is inefficient, especially in multi-task settings. It is more efficient to use RL to generate single-task behaviors (often from low-dimensional observations) (Gu et al., 2021a) and then use supervised learning to imitate all behaviors with a large single policy (Levine et al., 2016; Rusu et al., 2016; Parisotto et al., 2016; Singh et al., 2021; Ajay et al., 2021; Chen et al., 2021b). Several works tackle large-scale behavior distillation with Transformer (Reed et al., 2022; Lee et al., 2022), or with a representation that treats observations and actions in the same vision-language space (Zeng et al., 2020; Shridhar et al., 2022b). Our work uses a similar pipeline, but focuses on finding a good representation and architecture to generalize across morphologies and tasks simultaneously with the proposed morphology-task graph. See Appendix M for the connection to policy distillation.

3. PRELIMINARIES

In RL, we consider a Markov Decision Process given by the tuple (S, A, p, p_1, r, γ), which consists of state space S, action space A, state transition probability function p : S × A × S → [0, ∞), initial state distribution p_1 : S → [0, ∞), reward function r : S × A → R, and discount factor γ ∈ [0, 1). The agent follows a Markovian policy π : S × A → [0, ∞), which is often parameterized in deep RL, and seeks an optimal policy π* that maximizes the discounted cumulative reward:

π* = arg max_π (1 / (1 - γ)) E_{s ∼ ρ^π(s), a ∼ π(·|s)} [r(s, a)],   (1)

where p_t^π(s_t) = ∫_{s_{0:t}, a_{0:t-1}} Π_t p(s_t | s_{t-1}, a_{t-1}) π(a_t | s_t) and ρ^π(s) = (1 - γ) Σ_t γ^t p_t^π(s_t = s) represent the time-aligned and time-aggregated state marginal distributions following policy π.

Graph Representation for Morphology-Agnostic Control  Following prior continuous control literature (Todorov et al., 2012), we assume the agents have bodies modeled as simplified skeletons of animals. An agent's morphology is characterized by the parameters of its rigid body modules (torso, limbs), such as radius, length, mass, and inertia, and by the connections among those modules (joints). To handle such geometric and topological information, an agent's morphology can be expressed as an acyclic tree graph G := (V, E), where V is a set of nodes v_i ∈ V and E is a set of edges e_ij ∈ E between v_i and v_j. The node v_i corresponds to the i-th module of the agent, and the edge e_ij corresponds to the hinge joint between nodes v_i and v_j. Each joint may have 1-3 actuators corresponding to its degrees of freedom. If a joint has several actuators, the graph G is considered a multipath one. This graph-based factored formulation can describe various agents' morphologies in a tractable manner (Wang et al., 2018; Huang et al., 2020; Loynd et al., 2020). We assume that node v_i observes a local sensory input s_t^i at time step t, which includes information about limb i such as position, velocity, orientation, joint angle, or morphological parameters. To process these node features and the graph structure, a morphology-agnostic policy can be modeled as a node-based GNN (Kipf & Welling, 2017; Battaglia et al., 2018; Cappart et al., 2021). The objective of morphology-agnostic RL is the average of Equation 1 over the given morphologies.

Figure 4: The overview of MxT-Bench, which can procedurally generate various morphologies and tasks from minimal blueprints. MxT-Bench can not only construct agents with different numbers of limbs, but also randomize missing limbs and the size/mass of bodies. Tasks can be designed with parameterized goal distributions. It also supports importing custom complex agents such as unimals (Gupta et al., 2022).

Table: Comparison with relevant RL benchmarks in terms of (1) multi-task (task coverage), (2) multi-morphology (morphology coverage), and (3) scalability. MuJoCo (Todorov et al., 2012) and DM Control (Tassa et al., 2018) only have a single morphology for a single task. While other existing works (Yu et al., 2019; Huang et al., 2020) partially cover the task/morphology axes with some degree of scalability, they do not satisfy all criteria.

Goal-conditional RL  In goal-conditional RL (Kaelbling, 1993; Schaul et al., 2015), the agent aims to find an optimal policy π*(a | s, s_g) conditioned on a goal s_g ∈ G, where G stands for a goal space that is a sub-dimension of the state space S (e.g. XYZ coordinates, velocity, or quaternion). The desired goal s_g is sampled from a given goal distribution p_ψ : G → [0, ∞), where ψ stands for a task in the task space Ψ (e.g. reaching with the agent's leg or touching the torso to a ball). The reward function can include a goal-reaching term, often modeled as r_ψ(s_t, s_g) = -d_ψ(s_t, s_g), where d_ψ(·, ·) is a task-dependent distance function, such as Euclidean distance, between the sub-dimension of interest in the current state s_t and the given goal s_g.
Some tasks ψ provide multiple goals to the agent. In that case, we overload s_g to represent a set of goals {s_g^i}_{i=1}^{N_ψ}, where N_ψ is the number of goals that should be satisfied in task ψ.

Morphology-Task Generalization  This paper aims to achieve morphology-task generalization, where the learned policy should generalize over tasks and morphologies simultaneously. The optimal policy should generalize over morphology space M and task space Ψ, and minimize the distance to any given goal s_g ∈ G. Mathematically, this objective can be formulated as:

π* = arg max_π (1 / (1 - γ)) E_{m,ψ ∼ M,Ψ} E_{s_g ∼ p_ψ(s_g)} E_{s^m, a^m ∼ ρ^π(s^m), π(·|s^m, s_g)} [-d_ψ(s^m, s_g)],   (2)

where the graph representation of morphology m ∈ M is denoted as G^m = (V^m, E^m).
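As a minimal illustration of the goal-reaching term r_ψ(s_t, s_g) = -d_ψ(s_t, s_g) above, the sketch below computes a negative Euclidean distance over the goal-relevant sub-dimensions of the state. The state layout and goal dimensions are hypothetical, purely for illustration.

```python
import math

def goal_reaching_reward(state, goal, goal_dims):
    """Reward r_psi(s_t, s_g) = -d_psi(s_t, s_g): negative Euclidean
    distance between the goal-relevant sub-dimensions of the state
    (e.g. the XY position of one leg tip) and the desired goal s_g."""
    achieved = [state[i] for i in goal_dims]
    d = math.sqrt(sum((a - g) ** 2 for a, g in zip(achieved, goal)))
    return -d

# Hypothetical 6-D state whose first two dims are a leg tip's XY position.
state = [0.6, 0.8, 0.0, 0.0, 0.0, 0.0]
reward = goal_reaching_reward(state, goal=[1.0, 0.0], goal_dims=[0, 1])  # -> -sqrt(0.8)
```

For multi-goal tasks (e.g. twisters), the same function would be applied once per goal s_g^i with its own sub-dimensions.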

4.1. MXT-BENCH AS A TEST BED FOR MORPHOLOGY-TASK GENERALIZATION

To overcome these shortcomings of existing RL environments, we develop MxT-Bench, which has wide coverage over both tasks and morphologies to test morphology-task generalization, with functionality for procedural generation from minimal blueprints (Figure 4). MxT-Bench is built on top of Brax (Freeman et al., 2021) and Composer (Gu et al., 2021a) for faster iteration of behavior distillation with hardware-accelerated environments. Beyond supporting multi-morphology and multi-task settings, the scalability of MxT-Bench helps to test a broader range of morphology-task generalization, since we can easily generate out-of-distribution tasks and morphologies, compared to manually-designed morphology or task specifications. On the morphology axis, we prepare 4 types of blueprints (ant, claw, centipede, and worm) as base morphologies, since they are well suited to movement on the XY-plane. Through MxT-Bench, we can easily spawn agents that have different numbers of bodies or legs, or different sizes, lengths, and weights. Moreover, we can also import existing complex morphologies used in previous work. For instance, we include 60+ morphologies suitable for goal-reaching, adapted from Gupta et al. (2022) and originally designed in MuJoCo. On the task axis, we design reach, touch, and twisters as basic tasks, which evaluate different aspects of the agents. The simplest is the reach task, where the agent aims to put its leg at an XY goal position. In the touch task, the agent aims to create and maintain contact between a specified torso and a movable ball; this requires reaching behavior while maintaining conservative momentum to avoid kicking the ball away. Twisters tasks are multi-goal problems; for instance, the agent should satisfy an XY position for one leg and a Z height for another leg. We pre-define 4 variants of twisters with at most 3 goals (see Appendix B.2 for the details).
Furthermore, we can easily specify both the initial and goal position distributions with parameterized distributions. In total, we prepare 180+ environments combining the morphology and task axes for the experiments in the later sections. See Appendix B for further details.
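Conceptually, the 180+ environments arise from a cartesian product of morphology blueprints, their procedural parameters, and task definitions. The sketch below illustrates this enumeration; the environment-id naming scheme and parameter choices are hypothetical, not MxT-Bench's actual API.

```python
from itertools import product

# Base blueprints and tasks taken from the text; names are illustrative.
MORPHOLOGIES = ("ant", "claw", "centipede", "worm")
TASKS = ("reach", "touch", "reach_handsup",
         "reach_hard_handsup", "reach2_handsup", "reach_handsup2")

def generate_env_ids(num_limbs=(3, 4, 5, 6)):
    """Enumerate morphology-task combinations; each base morphology is
    additionally varied by its number of limbs/bodies."""
    return [f"{morph}{n}_{task}"
            for morph, n, task in product(MORPHOLOGIES, num_limbs, TASKS)]

envs = generate_env_ids()  # 4 morphologies x 4 sizes x 6 tasks = 96 ids
```

Adding size/mass/missing-limb randomizations and the 60+ imported unimal morphologies along the same axes quickly scales the set past the 180+ environments used in the paper.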

4.2. BEHAVIOR DISTILLATION

Toward broader generalization over morphologies and tasks, a single policy should handle the diversity of scenes across morphology space M, task space Ψ, and goal space G simultaneously. Multi-task online RL from scratch, however, is difficult to tune, slow to iterate, and hard to reproduce. Instead, we employ behavior cloning on RL-generated expert behaviors to study morphology-task generalization. To obtain rich goal-reaching behaviors, we train a single-morphology single-task policy using PPO (Schulman et al., 2017) with a simple MLP policy, which is significantly more efficient than the multi-morphology training done in prior work (Gupta et al., 2022; Kurin et al., 2021; Huang et al., 2020). Since MxT-Bench is built on top of Brax (Freeman et al., 2021), a hardware-accelerated simulator, training a PPO policy takes only about 5∼30 minutes per environment (on an NVIDIA RTX A6000). We collect many behaviors per morphology-task combination from expert policy rollouts. We then train a single policy π_θ with the supervised learning objective:

L_π = -E_{m,ψ ∼ M,Ψ} E_{s^m, a^m, s_g ∼ D_{m,ψ}} [log π_θ(a^m | {s^m, s_g})],

where D_{m,ψ} is an expert dataset for morphology m and task ψ. Importantly, the offline behavior distillation protocol runs (parallelizable) single-task online RL only once, and allows us to reuse the same fixed data to try out various design choices, such as model architectures or local features of the morphology-task graph, which is often intractable in multi-task online RL.
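The distillation objective L_π above can be sketched as follows. We assume a hypothetical Gaussian policy with fixed standard deviation, so the negative log-likelihood reduces to a scaled squared error plus a constant; the toy dataset and "policy" below are placeholders, not the paper's actual networks or data.

```python
import math

def bc_loss(policy_mean_fn, dataset, sigma=1.0):
    """Behavior-cloning loss: average negative log-likelihood of expert
    actions under a Gaussian policy pi_theta(a | s, s_g) with fixed std.
    `dataset` holds (state, goal, expert_action) tuples pooled over all
    morphology-task combinations D_{m,psi}."""
    total, n = 0.0, 0
    for state, goal, action in dataset:
        pred = policy_mean_fn(state, goal)
        for p, a in zip(pred, action):
            # -log N(a; p, sigma^2) per action dimension.
            total += 0.5 * ((a - p) / sigma) ** 2 \
                     + math.log(sigma * math.sqrt(2 * math.pi))
            n += 1
    return total / n

# Dummy "policy" that always predicts zero actions, on a 2-transition toy set.
dataset = [((0.0,), (1.0,), (0.0, 0.0)),
           ((1.0,), (0.0,), (1.0, -1.0))]
loss = bc_loss(lambda s, g: (0.0, 0.0), dataset)
```

In the actual pipeline, `policy_mean_fn` would be the graph-based model over the morphology-task graph, and the loss would be minimized with stochastic gradient descent over mini-batches drawn from all D_{m,ψ}.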

4.3. MORPHOLOGY-TASK GRAPH

To learn a single policy that can solve various morphology-task problems, it is essential to unify the input-output interface among them. Inspired by the concepts of the scene graph (Johnson et al., 2015) and the morphological graph (Wang et al., 2018), we introduce the notion of the morphology-task graph (MTG) representation, which incorporates goal information while preserving the geometric structure of the task. The morphology-task graph expresses the agent's observations, actions, and goals/tasks in a unified graph space. Although most prior morphology-agnostic RL has focused on locomotion (running) with a reward calculated from the velocity of the center of mass, the morphology-task graph naturally extends morphology-agnostic RL to multi-task goal-oriented settings, including static single positional goals (reaching), multiple-goal problems (twister-game), and object interaction tasks (ball-touching). In practice, we develop two different ways to inform the policy of goals and tasks (Figure 3): morphology-task graph v1 (MTGv1) accepts the morphological graph, encoded from the agent's geometric information, as an input-output interface, and merges positional goal information into the features of the corresponding nodes, while morphology-task graph v2 (MTGv2) represents goals as extra sub-nodes of the graph.
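A minimal sketch of the unified representation: each limb becomes a node carrying its local observations, and each goal becomes an extra node attached to the limb it constrains (the goals-as-sub-nodes scheme described above). The data layout here is illustrative, not the paper's exact encoding.

```python
def build_morphology_task_graph(limb_obs, edges, goals):
    """Build a unified graph for one scene.
    `limb_obs` : {limb_name: local feature list} (one node per limb)
    `edges`    : list of (parent_limb, child_limb) hinge joints
    `goals`    : {limb_name: goal feature list}, e.g. a target XY
    Goal nodes are appended and wired to the limb they constrain,
    so reaching vs. touching differ only in which node gets a goal."""
    nodes = {name: {"kind": "limb", "features": feats}
             for name, feats in limb_obs.items()}
    graph_edges = list(edges)
    for i, (limb, goal_feats) in enumerate(goals.items()):
        goal_name = f"goal_{i}"
        nodes[goal_name] = {"kind": "goal", "features": goal_feats}
        graph_edges.append((limb, goal_name))
    return nodes, graph_edges

# A 3-limb toy agent with one reach goal placed on its first leg.
nodes, graph_edges = build_morphology_task_graph(
    limb_obs={"torso": [0.0], "leg1": [0.1], "leg2": [0.2]},
    edges=[("torso", "leg1"), ("torso", "leg2")],
    goals={"leg1": [1.0, 0.5]},
)
```

A graph network or Transformer then consumes these nodes (with the edges as structure or attention bias) and emits per-node actions for the actuated joints.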

5. EXPERIMENTS

We first evaluate the multi-task performance of the morphology-task graph representations (MTGv1, MTGv2) in terms of in-distribution (known morphology and task with different initialization), compositional (known task with unseen morphology, or known morphology with unseen task), and out-of-distribution generalization (either morphology or task is unseen) on MxT-Bench (Section 5.1). Then, we investigate whether the morphology-task graph contributes to obtaining a better control prior for multi-task fine-tuning (Section 5.2). In addition, we conduct offline node feature selection to identify the most suitable node feature set for morphology-task generalization (Section 5.3). The results are averaged over 4 random seeds. See Appendix A for the hyperparameters. We also investigate other axes of representations and architectures (Appendix E, F, G) and test the effect of dataset size, the number of morphology-task combinations, and model size (Appendix H). Lastly, we examine why the morphology-task graph works well by visualizing attention weights (Appendix K).

Evaluation Metric  Goal-reaching tasks are evaluated by the distance to the goals at the end of the episode (Pong et al., 2018; 2020; Ghosh et al., 2021; Choi et al., 2021; Eysenbach et al., 2021). However, this can be problematic in our setting, because the initial distance and the degree of goal-reaching behavior may differ among morphologies and tasks. We therefore measure the performance of a policy π by a normalized final distance metric d(M, Ψ; π) over morphology space M and task space Ψ, with pre-defined max/min values for each morphology m and task ψ:

d(M, Ψ; π) := (1 / |M||Ψ|) Σ_{m ∈ M} Σ_{ψ ∈ Ψ} E_{s_g ∼ p_ψ} [ Σ_{i=1}^{N_ψ} (d_ψ(s_T^m, s_g^i) - d_min^{i,m,ψ}) / (d_max^{i,m,ψ} - d_min^{i,m,ψ}) ],   (3)

where s_T^m is the last state of the episode, and d_max^{i,m,ψ} and d_min^{i,m,ψ} are the maximum and minimum distances for the i-th goal s_g^i with morphology m and task ψ. We use the distance threshold used to train the expert PPO policy as d_min^{i,m,ψ}, and the average distance from an initial position of the scene as d_max^{i,m,ψ}. Equation 3 is normalized to roughly the range [0, 1], and smaller is better. See Appendix B for the details.
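The normalized final distance metric can be sketched directly from its definition: per environment, each goal's final distance is min-max normalized by the PPO success threshold and the average initial distance, and the result is averaged over all morphology-task pairs. The toy numbers below are illustrative.

```python
def normalized_final_distance(final_dists, d_min, d_max):
    """Score for one morphology-task pair: sum over goals of
    (d(s_T, s_g^i) - d_min^i) / (d_max^i - d_min^i), where d_min is
    the expert PPO distance threshold and d_max the average initial
    distance. ~0 means solved, ~1 means no progress from the start."""
    return sum((d - lo) / (hi - lo)
               for d, lo, hi in zip(final_dists, d_min, d_max))

def benchmark_score(per_env_results):
    """Average the per-environment scores over all (m, psi) pairs."""
    scores = [normalized_final_distance(*r) for r in per_env_results]
    return sum(scores) / len(scores)

# Two toy single-goal environments: one nearly solved, one far from goal.
score = benchmark_score([
    ([0.12], [0.1], [1.1]),   # ~0.02
    ([0.9],  [0.1], [1.1]),   # 0.8
])
```

Note that a policy ending exactly at the success threshold scores 0 for that goal, and one ending at the average initial distance scores 1, matching the "smaller is better" reading of the metric.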

5.1. BEHAVIOR DISTILLATION ON MXT-BENCH

We systematically evaluate three types of morphology-task generalization (in-distribution, compositional, and out-of-distribution) through behavior distillation. As baselines, we compare combinations of unified IO representations (MTGv1, MTGv2, and tokenized MTGv2) with network architectures (MLP, GNN, and Transformer). In in-distribution settings, we prepare 50 environments plus 60 unimal environments adapted from Gupta et al. (2022) (see Appendix B.3 for the details) for both training and evaluation. Compositional and out-of-distribution settings evaluate zero-shot transfer. In compositional settings, we test morphology and task generalization separately; we prepare 38 training environments and 12 test environments with held-out morphologies for morphology evaluation, and leverage 50 training environments and 9 test environments with an unseen task for task evaluation. In out-of-distribution settings, we also leverage the 50 environments as a training set, and define 27 environments with diversified morphologies and an unseen task as an evaluation set. The proficient behavioral data contains 12k transitions per environment. See Appendix D for the details of the environment division. Table 1 reveals that MTGv2 achieves the best multi-task goal-reaching performance among the possible combinations in all aspects of generalization. Comparing average normalized distances, MTGv2 improves multi-task performance over the second best, MTGv1, by 23% in in-distribution evaluation. Consistent with previous works (Kurin et al., 2021; Gupta et al., 2022), Transformer with MTGv1 achieves better goal-reaching behaviors than GNN. In compositional and out-of-distribution zero-shot evaluation, MTGv2 outperforms the other choices by 14∼18%. Moreover, the compositional zero-shot performance of MTGv2 is comparable with the in-distribution performance of MTGv1. These results imply MTGv2 is the better formulation for realizing morphology-task generalization.

5.2. DOES MORPHOLOGY-TASK GRAPH OBTAIN BETTER PRIOR FOR CONTROL?

To reveal whether the distilled policy obtains a reusable inductive bias for unseen morphologies or tasks, we test the fine-tuning performance for multi-task imitation learning on MxT-Bench. We adopt the same morphology-task division as the compositional and out-of-distribution evaluation in Section 5.1. Figure 5 shows that fine-tuning outperforms random initialization in all settings, which suggests the behavior-distilled policy serves as better prior knowledge for control. As with the zero-shot transfer results in Section 5.1, MTGv2 outperforms the other baselines, beating the second best, MTGv1, by 50∼55%. Furthermore, MTGv2 works even with a small amount of data (4k or 8k transitions); for instance, in compositional morphology evaluation (left in Figure 5), MTGv2 trained with 8k transitions still outperforms competitive combinations trained with 12k transitions, indicating better sample efficiency (see Appendix J for the detailed scores). These results suggest MTGv2 captures the structure of the morphology-task graph as prior knowledge for downstream tasks.

5.3. WHAT IS THE BEST NODE FEATURES FOR MORPHOLOGY-TASK GRAPH?

In the agent's system, there are many observable variables per module, such as Cartesian position (p), Cartesian velocity (v), quaternion (q), angular velocity (a), joint angle (ja), joint range (jr), limb id (id), joint velocity (jv), relative position (rp), relative rotation (rr), and morphological information (m). Morphological information contains the module's shape, mass, inertia, actuator gear, dof-index, etc. To shed light on the importance of the local representation for generalization, we run an extensive ablation of node feature selection and preprocessing. In prior works, Huang et al. (2020) and others (Kurin et al., 2021; Hong et al., 2022) used {p, v, q, a, ja, jr, id}, and Gupta et al. (2022) used {p, v, q, a, ja, jr, jv, rp, rr, m}. Considering the intersection of those, we define {p, v, q, a, ja, jr} as base_set and test its combinations with the other observations (jv, id, rp, rr, m). In the experiments, we evaluate in-distribution generalization, as in Section 5.1, with the MTGv2 representation. Table 2 shows that, while some additional features (id, rp, m) contribute to improving the performance, the most effective feature is morphological information (m), which suggests that base_set contains sufficient features for control, and that raw morphological properties (m) serve as a better task specification than manually-encoded information such as limb id (id). The key observations are: (1) morphological information is critical for morphology-task generalization, and (2) extra observations can hurt the performance, such as the relative rotation between parent and child nodes (rr). Throughout the paper, we use base_set + m as the node features for the morphology-task graph.
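The feature-set ablation above amounts to concatenating a chosen subset of per-node observations into one feature vector. A minimal sketch (feature dimensions below are hypothetical placeholders, not the benchmark's exact shapes):

```python
BASE_SET = ("p", "v", "q", "a", "ja", "jr")

def node_features(obs, extra=("m",)):
    """Concatenate one node's local features for a chosen feature set.
    `obs` maps a feature key (p, v, q, ...) to its value list.
    Defaults to base_set + m, the best-performing set in the ablation."""
    feats = []
    for key in BASE_SET + tuple(extra):
        feats.extend(obs[key])
    return feats

# Toy per-node observation with illustrative dimensionalities:
# 3-D position/velocity/angular velocity, 4-D quaternion, 1-D joint
# angle, 2-D joint range, and a 2-D morphology descriptor.
obs = {"p": [0.0, 0.0, 0.1], "v": [0.0, 0.0, 0.0],
       "q": [1.0, 0.0, 0.0, 0.0], "a": [0.0, 0.0, 0.0],
       "ja": [0.2], "jr": [-1.0, 1.0], "m": [0.5, 0.01]}
x = node_features(obs)  # base_set + m
```

Swapping `extra` for `("id",)` or `("rp", "rr")` reproduces the other rows of the ablation without touching the distillation pipeline, which is what makes the offline selection cheap.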

6. DISCUSSION AND LIMITATION

While the experimental evaluation on MxT-Bench implies that the morphology-task graph is a simple and effective method to distill diverse proficient behavioral data into a generalizable single policy, there are some limitations. For instance, we focus on distillation from expert policies only, and it is still unclear whether the morphology-task graph works with moderate- or random-quality behaviors in offline RL (Fujimoto et al., 2019; Levine et al., 2020; Fujimoto & Gu, 2021). Combining distillation with iterative data collection (Ghosh et al., 2021; Matsushima et al., 2021) or online fine-tuning (Li et al., 2022) would be promising future work. In addition, we avoided tasks where expert behaviors cannot be generated easily by single-task RL without fine-scale reward engineering or human demonstrations; incorporating such datasets, or bootstrapping single-task RL from the distilled policy, could be critical for scaling the pipeline to more complex tasks such as open-ended and dexterous manipulation (Lynch et al., 2019; Ghasemipour et al., 2022; Chen et al., 2022b). Since the morphology-task graph only uses readily accessible features in any simulator and can be automatically defined from URDFs or MuJoCo XMLs, in future work we aim to keep training our best morphology-task graph architecture policy on additional data from more functional and realistic control behaviors in other simulators such as MuJoCo (Yu et al., 2019; Tassa et al., 2018), PyBullet (Shridhar et al., 2022a), IsaacGym (Chen et al., 2022b; Peng et al., 2021), and Unity (Juliani et al., 2018), and to show that it achieves better scaling laws than other representations (Reed et al., 2022) on broader morphology-task families.

7. CONCLUSION

Broader behavior generalization is a promising paradigm for RL. To achieve morphology-task generalization, we propose the morphology-task graph, which expresses the agent's modular observations, actions, and goals as a unified graph representation while preserving the geometric task structure. As a test bed for morphology-task generalization, we also develop MxT-Bench, which enables the scalable procedural generation of agents and tasks with minimal blueprints. Fast behavior-dataset generation on MxT-Bench with RL allows efficient representation and architecture selection through supervised learning, and MTGv2, a variant of the morphology-task graph, achieves the best multi-task performance among the possible designs (MLP, GNN, and Transformer with MTGv1, tokenized MTGv2, etc.), outperforming them in in-distribution evaluation (by 23%), zero-shot transfer in compositional or out-of-distribution evaluation (by 14∼18%), and fine-tuning for downstream multi-task imitation (by 50∼55%). We hope our work will encourage the community to explore scalable yet incremental approaches to building a universal controller.

B DETAILS OF MXT-BENCH B.1 MORPHOLOGY

We prepare 4 base blueprints (ant, centipede, claw, and worm) for procedural scene generation (Figure 6). The base ant has 1 torso and 2 legs with 2 joints per limb; we procedurally generate agents with different numbers of limbs. The base centipede has 2 bodies and 4 legs with 2 joints per limb; we procedurally generate agents with different numbers of bodies. The base claw has 1 torso and 2 legs with 4 joints per limb (each leg consists of 3 modules); we procedurally generate agents with different numbers of limbs. The base worm has 2 bodies and no legs; we procedurally generate agents with different numbers of bodies. Furthermore, we develop functionality for morphology diversification via missing, mass, and size parameters (Figure 7). Missing randomization removes one module from one leg, which corresponds to that leg being broken. Mass randomization rescales the default mass of each module by a specified factor; the appearance does not change, but the dynamics certainly differ. Size randomization rescales the default length and radius of each module by a specified factor.
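The blueprint-plus-diversification scheme above can be sketched as follows. This is a minimal illustration with hypothetical names (`Module`, `build_ant`, `diversify`), not the MxT-Bench API, which emits full physics-engine configurations:

```python
# Hypothetical sketch of blueprint-based procedural morphology generation.
import random
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    mass: float = 1.0
    length: float = 0.2
    children: list = field(default_factory=list)

def build_ant(num_legs: int) -> Module:
    """Base ant blueprint: one torso, `num_legs` legs, 2 joints per limb."""
    torso = Module("torso", mass=10.0)
    for i in range(num_legs):
        hip = Module(f"hip_{i}")
        hip.children.append(Module(f"knee_{i}"))  # distal module
        torso.children.append(hip)
    return torso

def diversify(root: Module, mode: str, rng: random.Random, scale=(0.8, 1.2)):
    """Missing / mass / size randomization, as described in Appendix B.1."""
    legs = root.children
    if mode == "missing":
        # Remove the distal module of one random leg ("broken leg").
        rng.choice(legs).children.clear()
    elif mode == "mass":
        for m in [root] + legs:
            m.mass *= rng.uniform(*scale)    # dynamics change, looks the same
    elif mode == "size":
        for m in legs:
            m.length *= rng.uniform(*scale)  # kinematics change
    return root

ant5 = build_ant(5)   # a 5-leg ant variant
```

The same pattern applies to the centipede, claw, and worm blueprints by varying the number of bodies or legs instead.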

B.2 TASK

We prepare 4 base tasks with parameterized goal distributions (reach, touch, twisters, and push) for procedural task generation (Figure 8). The reach task requires the agent to place one of its legs at a given XY goal position; the reach_hard variant samples goals from a distribution farther than that of reach. The touch task requires the agent to bring its body or torso into contact with a movable ball (i.e. the movable ball is the goal). Twisters is a multi-goal problem: the agent must satisfy several goals at the same time, built from two basic constraints, reach and handsup, where handsup requires the agent to raise one leg to a given goal Z height. Twisters combinations include reach_handsup, reach_hard_handsup, reach2_handsup, and reach_handsup2. For instance, in reach_handsup2, the agent must place one leg at the given XY goal position while simultaneously raising two legs to the given goal heights. XY goal positions are sampled from a uniform distribution over a predefined donut-shaped range, and Z height goals are sampled from a uniform distribution over a predefined range. The push task requires the agent to move a box to a given XY goal position sampled uniformly from a predefined range. Since it involves richer interaction with an object, push is likely harder than the other three tasks; we use it in Appendix L.
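The goal sampling described above can be sketched as follows. The radii and heights here are illustrative placeholders, not MxT-Bench's actual ranges:

```python
# Hedged sketch of goal sampling: XY goals uniform over a donut-shaped
# (annular) region, Z height goals uniform over a range (Appendix B.2).
import math
import random

def sample_xy_goal(rng: random.Random, r_min=1.0, r_max=2.0):
    # Sample r^2 uniformly so the density is uniform in area over the
    # annulus, then pick a random angle.
    r = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(theta), r * math.sin(theta))

def sample_z_goal(rng: random.Random, z_min=0.3, z_max=0.8):
    return rng.uniform(z_min, z_max)

rng = random.Random(0)
xy_goal = sample_xy_goal(rng)   # e.g. a reach goal
z_goal = sample_z_goal(rng)     # e.g. a handsup goal
```

A twisters environment would draw one XY goal plus one or two Z goals per episode, depending on the combination (e.g. reach_handsup2).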

B.3 CUSTOM MORPHOLOGY

MxT-Bench also supports importing custom morphologies used in previous work. For instance, Gupta et al. (2022) propose unimal agents, generated via an evolutionary strategy and designed for MuJoCo. Since they are not manually designed, their morphologies are more diverse than our ant, centipede, claw, and worm. We inspect which unimals are suitable for goal-reaching, include 72 morphologies, and use 60 of them in the experiments. Figure 9 shows some examples. Since the RL community has paid limited attention to embodied control so far, there are no suitable benchmarks that quantify generalization over both various tasks (beyond single locomotion) and various morphologies at the same time. In addition, a benchmark should be scalable to new morphologies and tasks, to avoid "overfitting" to manually-designed tasks. As summarized in Figure 4, MuJoCo (Todorov et al., 2012) and DM Control (Tassa et al., 2018), the most popular benchmarks in the continuous control domain, cannot evaluate task or morphology generalization; each pre-defined environment pairs a single morphology with a single task. Yu et al. (2019) propose a robot manipulation benchmark for meta RL, but it does not consider morphology, and while it has a quite diverse set of tasks, the scalability of its environments is limited. In contrast, previous morphology-agnostic RL works (Huang et al., 2020; Kurin et al., 2021; Hong et al., 2022; Trabucco et al., 2022) provide sets of different morphologies adapted from MuJoCo agents. Gupta et al. (2022) also provide a much larger set of agents produced via joint optimization of morphology and task reward by an evolutionary strategy (Gupta et al., 2021), with kinematics and dynamics randomization. However, those works only aim to solve a single locomotion task, i.e. running forward as fast as possible.

C DETAILS OF EXPERT DATA GENERATION

We train single-task PPO (Schulman et al., 2017) in each environment to obtain the expert policy. For the reward function, we adopt a dense reward (except for the push task), -d_ψ(s_m, s_g), given until the agent satisfies the success condition 1[d_ψ(s_m, s_g) ≤ d^{m,ψ}_min]; see Appendix D for the threshold d^{m,ψ}_min. After convergence, we collect the proficient behaviors. Unless otherwise specified, we use 12k transitions per environment. The average normalized final distances of those datasets are mostly less than 0.1 (Figure 10).
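The reward structure above can be illustrated with a short sketch. The function name and the Euclidean distance choice are assumptions for illustration; the actual d_ψ depends on the task ψ:

```python
# Illustrative sketch of the dense goal-reaching reward in Appendix C:
# r_t = -d_psi(s_m, s_g) until the distance falls below the per-
# (morphology, task) threshold d_min, i.e. 1[d <= d_min].
import math

def goal_reward(agent_pos, goal_pos, d_min):
    d = math.dist(agent_pos, goal_pos)   # stands in for d_psi(s_m, s_g)
    solved = d <= d_min                  # success indicator
    reward = 0.0 if solved else -d       # dense negative distance
    return reward, solved
```

A multi-goal task like twisters would sum one such term per goal, each with its own threshold.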

D ENVIRONMENTS DIVISION

Throughout the experiments, we test many morphology-task combinations to investigate in-distribution generalization, compositional generalization for morphology and task, and out-of-distribution generalization. In this section, we list the combinations of environments used in the experiments. Table 4 gives the combinations for the experiments on in-distribution generalization and compositional generalization for morphology (both zero-shot transfer and fine-tuning) in Table 1 and Figure 5. For the compositional morphology evaluation, we use the Morph-Train division as the training dataset and the Morph-Test division as the evaluation environments. For the evaluation of dataset size and the number of morphology-task combinations (Appendix H), we use the In-Distribution, 12 Env, and 25 Env divisions as training datasets and test environments. Following prior works (Huang et al., 2020; Kurin et al., 2021; Gupta et al., 2022), for convenience we have chosen a Morph-Test division that does not exceed the maximum number of nodes in the Morph-Train division. We have checked that our trained model generalizes to agents with extra limbs (e.g. 7-leg ant or 8-body centipede), but have not kept increasing the number of limbs until the model fails; we note that this may limit the morphology generalization in an open-ended evaluation. Table 6 gives the combinations for the experiments on in-distribution generalization with unimals (Gupta et al., 2022) in Table 1 and Table 12. Table 5 provides the combinations for the experiments on compositional generalization for task and out-of-distribution generalization (both zero-shot transfer and fine-tuning) in Table 1 and Figure 5; we use them as test environments and, for training datasets, leverage the In-Distribution division in Table 4.
In addition, we extensively evaluate compositional generalization for task and out-of-distribution generalization with a more dissimilar unseen task, push (see Appendix L for the details).

Table 5: The combinations of environments used in the experiments of compositional generalization for task and out-of-distribution generalization.

E ARCHITECTURE SELECTION: TOKENIZED MORPHOLOGY-TASK GRAPH

In the recent literature, offline RL has been cast as a supervised sequence modeling problem (Chen et al., 2021a), and some works (Janner et al., 2021; Reed et al., 2022) tokenize the continuous observations and actions, analogous to vision transformers (Dosovitskiy et al., 2020), treating the input modality like language. As a part of offline architecture selection, we examine the effectiveness of tokenization. We mainly follow the protocol of Reed et al. (2022): we first apply mu-law encoding (Oord et al., 2016), then discretize the pre-processed observations and actions into 1024 bins. To examine a broader range of design choices, we prepare 6 variants of the tokenized morphology-task graph: with or without layer normalization after the embedding function (LN) (Bhatt et al., 2019; Parisotto et al., 2019; Furuta et al., 2021a; Chen et al., 2021a), crossed with outputting discretized actions (D), smoothed actions obtained by averaging bins (DA), or continuous values directly (C). As shown in Table 8, predicting continuous values performs better than predicting discretized actions, possibly due to approximation errors. However, the performance of Token-MTGv2 (C) and Token-MTGv2 (C, LN) is still lower than MTGv2 itself; this might be because tokenization loses some morphological invariance among nodes. In Table 1, we adopt Token-MTGv2 (C) for comparison.

G REPRESENTATION SELECTION: MORPHOLOGY-TASK GRAPH WITH HISTORY

Some meta RL or multi-task RL algorithms take historical observations as inputs, or leverage recurrent neural networks to encode the temporal information of the tasks (Wang et al., 2016; Teh et al., 2017; Rakelly et al., 2019). To examine whether morphology-task graph can be augmented with historical observations, we test morphology-task graph with history, which concatenates the morphology-task graphs of the most recent H frames as inputs while predicting one-step actions, on various types of morphology-task generalization with ant agents.
In all settings, we choose ant_{reach, touch, twisters} (Table 4) for training. In the morphology generalization setting, we hold out ant_5 agent tasks for evaluation. In the task generalization setting, we use ant_reach_hard (Table 5) as the test set. In the out-of-distribution setting, we use ant_reach_hard_diverse (Table 5) for evaluation. We set the history length to H = 3. Table 10 implies that when the policy faces unseen tasks (i.e. in the Compositional (Task) and Out-of-Distribution settings), morphology-task graph with history may help to improve performance, a promising result for extending our framework to more complex tasks.

Table 10: The average normalized final distance in various types of morphology-task generalization on MxT-Bench (especially, ant). We concatenate the three most recent frames of the morphology-task graph as inputs to the Transformer. The results show that when the policy faces unseen tasks, MTGv2-history may help to improve performance.

H DOES TRANSFORMER WITH MORPHOLOGY-TASK GRAPH SCALE UP WITH DATASET/MORPHOLOGY-TASK/MODEL SIZE?

An important aspect of the success of large language models is scalability with respect to the size of the training data, the number of tasks for joint training, and the number of parameters (Raffel et al., 2020; Brown et al., 2020). A natural question is whether a similar trend holds in RL. Figure 11 suggests that performance improves as we increase the dataset size and the number of parameters; a Transformer with 0.4M parameters matches an MLP with 3.1M. In contrast, when we increase the number of environments, performance degrades, although Transformer with MTGv1 and MTGv2 mitigates the degradation; this trend seems inevitable in multi-task RL (Yu et al., 2020; Kurin et al., 2022) and is an important future direction towards generalist controllers.

I PERCENTAGE OF IMPROVEMENT

For clarification, we compute the percentage of improvement between two average normalized final distances (defined in Equation 3), d_1 and d_2 with d_1 < d_2, as: 100 × (d_2 − d_1) / d_2.
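As a sanity check, the metric can be computed as:

```python
# The "percentage of improvement" from Appendix I, where d1 < d2 are
# average normalized final distances (lower is better).
def improvement(d1: float, d2: float) -> float:
    assert d1 < d2, "d1 should be the better (smaller) distance"
    return 100.0 * (d2 - d1) / d2

# e.g. distances 0.05 vs 0.10 correspond to a 50% improvement
print(improvement(0.05, 0.10))
```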

J ADDITIONAL RESULTS

In this section, we provide the detailed performance of in-distribution generalization (Table 11 and Table 12), compositional morphology and task generalization (Table 13 and Table 14), and out-of-distribution generalization (Table 15). For the fine-tuning experiments, we summarize the detailed scores of Figure 5 in Table 16.

K WHY MORPHOLOGY-TASK GRAPH WORKS WELL?: ATTENTION ANALYSIS

In this section, we provide an attention analysis of Transformer with MTGv1 (Figure 13) and MTGv2 (Figure 12). The experimental results reveal that, despite only slight architectural differences, MTGv2 generalizes to various morphologies and tasks better than MTGv1. To find out why, we qualitatively analyze the attention weights in Transformer. Figure 12 shows that MTGv2 consistently focuses on goal nodes over time and activates the nodes important for solving the task; for instance, in centipede_touch (top), MTGv2 pays attention to the corresponding nodes (torso0 and goal0) at the beginning of the episode, and gradually attends to other relevant nodes (torso1 and torso2) to hold the movable ball. Furthermore, in ant_twisters (bottom), MTGv2 first tries to raise the agent's legs to satisfy goal1 and goal2, and then focuses on reaching with a leg (goal0). Temporally-consistent attention to goal nodes and dynamic attention to relevant nodes can contribute to generalization over goal-directed tasks and morphologies. Figure 13 implies that MTGv1 does not show such consistent activation of goal-conditioned nodes; for instance, in centipede_touch_3 (top), the goal information is treated as an extra node feature of torso0, but no nodes are consistently activated together with torso0. Moreover, in ant_reach_handsup2_3 (bottom), MTGv1 does not keep focusing on the agent's limbs during the episode. Rather, MTGv1 tends to show periodic attention patterns during the rollout, as implied in prior works (Huang et al., 2020; Kurin et al., 2021).

L ADDITIONAL RESULTS OF TASK GENERALIZATION

In Table 1 and Figure 5, we examine compositional task generalization and out-of-distribution generalization with reach_hard tasks, where the goal distribution is farther than in the original reach tasks. While these are harder unseen tasks (Furuta et al., 2021b), they still share some task similarity with the training environments. Another question is how morphology-task graph performs on more dissimilar unseen tasks. To evaluate compositional task generalization and out-of-distribution generalization on environments with less similarity to the training datasets, we prepare the push task, where the agent must move a box to a given goal position; see Table 7 for the environment division. For training datasets, we leverage the In-Distribution division in Table 4. Because push requires sufficient interaction with an object, its nature is quite different from the training environments (reach, touch, and twisters). Table 17 shows the results of the compositional task evaluation and Table 18 those of the out-of-distribution evaluation. In contrast to Table 1, the zero-shot performance is limited; transferring pre-trained control primitives to significantly different tasks remains important future work. However, as shown in Figure 14, morphology-task graph provides better prior knowledge for downstream multi-task imitation learning, even in environments with less similarity to the pre-training datasets. As prior work suggested (Mandi et al., 2022), these results indicate that, in RL, a jointly-learned multi-task model carries a strong inductive bias even for unseen and significantly different environments.

Table 18: The average normalized final distance for out-of-distribution evaluation on MxT-Bench with unseen push task. See Table 7 for the environment division.



Footnotes:
Here, "task" means what each agent should solve. See Section for the detailed definition. When considering the models, as slightly overloaded, it may imply morphological diversity as well.
Pronounced as "mixed"-bench. It stands for "Morphology × Task". https://github.com/frt03/mxt_bench
https://en.wikipedia.org/wiki/Twister_(game)



Figure 1: Behavior distillation pipeline. We first train a single-task policy for each environment on MxT-Bench, and then collect a proficient morphology-task behavior dataset (Section 4.1). To enable a single policy to learn multiple tasks and morphologies simultaneously, we convert the stored transitions to the morphology-task graph representation to align them with the unified IO interface (Section 4.3) for multi-task distillation (Section 4.2). After behavior distillation, the learned policy can be utilized for in-distribution or zero-shot generalization (Section 5.1), downstream fine-tuning (Section 5.2), and representation and architecture selection (Section 5.3).

Figure 2: We tackle morphology-task generalization, which requires achieving both task and morphology generalization simultaneously. See Appendix M for the details.

Figure 3: We propose the notion of morphology-task graph, which expresses the agent's observations, actions, and goals/tasks in a unified graph representation while preserving the geometric structure of the task. We develop two practical implementations: morphology-task graph v1 (left) accepts the morphological graph, encoded from the agent's geometric information, as the input-output interface and merges positional goal information into the corresponding node features; morphology-task graph v2 (right) treats given goals as extra disjoint nodes of the morphological graph. While most prior morphology-agnostic RL has focused on locomotion (the run task), i.e. a single static goal node controlling (maximizing) the velocity of the center of mass, morphology-task graph naturally extends morphology-agnostic control to other goal-oriented tasks: a single static goal node for reach, multiple static goal nodes for twisters, and a single dynamic goal node tracking a movable ball for an object interaction task, touch.

and s_m := {s^i_t}_{i=1}^{|V_m|} and a_m := {a^e_t}_{e=1}^{|E_m|} stand for the sets of local observations and actions of morphology m. While we could use multi-task online RL to maximize Equation 2 in principle, it is often sample-inefficient due to the complexity of the problem, which requires a policy that can handle the diversity of scenes across morphology M, task Ψ, and goal space G simultaneously.
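The two graph constructions in Figure 3 can be sketched in plain Python. This is a schematic illustration with a hypothetical dictionary structure, not the actual implementation, which operates on per-node feature arrays:

```python
# Schematic sketch of the two morphology-task graph variants: MTGv1
# appends goal features to the node the goal refers to, while MTGv2
# adds each goal as an extra disjoint node.
def mtg_v1(node_obs, goals):
    """node_obs: {node_name: [features]}, goals: {node_name: [goal feats]}."""
    nodes = {k: list(v) for k, v in node_obs.items()}
    for node, g in goals.items():
        nodes[node] = nodes[node] + list(g)   # merge goal into node features
    return nodes

def mtg_v2(node_obs, goals):
    nodes = {k: list(v) for k, v in node_obs.items()}
    for i, (node, g) in enumerate(goals.items()):
        nodes[f"goal{i}"] = list(g)           # goal becomes a disjoint node
    return nodes

obs = {"torso0": [0.1, 0.2], "leg0": [0.3]}
goals = {"leg0": [1.0, -1.0]}   # e.g. a reach goal for leg0
v1_graph = mtg_v1(obs, goals)   # 2 nodes, leg0 carries the goal
v2_graph = mtg_v2(obs, goals)   # 3 nodes, goal0 is disjoint
```

Note how MTGv2 keeps per-node feature sizes uniform across tasks, which is consistent with the paper's observation that it handles many task-morphology combinations with one interface.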

Figure 5: Multi-task goal-reaching performances on fine-tuning (multi-task imitation) for compositional and out-of-distribution evaluation. These results reveal that fine-tuning outperforms random initialization in all settings, and fine-tuned MTGv2 outperforms others by 50∼55%. See Appendix J for the detailed scores.

Figure 6: Examples of procedurally-generated morphology from base blueprints in MxT-Bench. From left to right, each figure shows the example of 5-leg ant, 4-body centipede, 6-leg claw, and 5-body worm.

Figure 7: Examples of diversified morphology from base blueprints in MxT-Bench. From left to right, each figure shows the example of 5-leg-1-missing ant, 5-leg-size-randomized ant, 4-body-1-missing centipede, and 3-body-size-randomized centipede.

Figure 8: Examples of pre-defined task in MxT-Bench. From left to right, each figure shows the example of reach, touch, twisters, and push task.

Figure 9: Examples of unimal agents, adapted from Gupta et al. (2022)

Figure 10: Average normalized final distance in each dataset (Table 4).

Table 4: The combinations of environments used in the experiments of in-distribution generalization, and compositional generalization for morphology.

to the morphology-task graph representation: mu_law(x) := sgn(x) · log(|x|μ + 1) / log(Mμ + 1), with μ = 100 and M = 256. This pre-processing normalizes the input to the range [−1, 1].
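The tokenization pipeline of Appendix E can be sketched as follows. The exact clipping and bin-assignment details are assumptions; only the mu-law formula (μ = 100, M = 256) and the 1024-bin discretization come from the text:

```python
# Sketch of mu-law encoding followed by 1024-bin discretization,
# following the protocol of Reed et al. (2022) as used in Appendix E.
import math

MU, M, NUM_BINS = 100.0, 256.0, 1024

def mu_law(x: float) -> float:
    # sgn(x) * log(|x| * mu + 1) / log(M * mu + 1), roughly in [-1, 1]
    return math.copysign(math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0), x)

def tokenize(x: float) -> int:
    y = min(max(mu_law(x), -1.0), 1.0)              # clip (assumption)
    return min(int((y + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1)

token = tokenize(0.5)   # a discrete token id in [0, 1023]
```

Each continuous observation and action dimension is mapped to such a token before being fed to the Token-MTGv2 variants.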


Figure 11: The average normalized final distance with different sizes of datasets (left), morphology-task combinations (middle), and model sizes (right). Smaller values mean better multi-task performance (see Appendix D for the environment division). These results suggest that performance improves as the dataset size grows, and that morphology-task graph mitigates the performance degradation as the number of environments increases. Transformer is more parameter-efficient than MLP and improves with more parameters.

Figure 12: Attention analysis of MTGv2 in centipede_touch_3 (top) and ant_reach_handsup2_5 (bottom; from twisters). From left to right, we visualize the attention weights of MTGv2 during the rollout. In contrast to MTGv1, MTGv2 consistently focuses on goal nodes over time, and activates important nodes to solve the task.

Morphology-task graph v1 (MTGv1) merges given goals as a part of the corresponding node features. For instance, in the touch task, MTGv1 includes the XY position of the movable ball as an extra node feature of the body node. Morphology-task graph v2 (MTGv2) instead treats given goals as additional disjoint nodes of the morphological graph representation. These morphology-task graph strategies enable the policy to handle many combinations of tasks and morphologies simultaneously.

Offline node feature selection. We compare the combination of Cartesian position (p), Cartesian

Table 7 also shows the environment division for both zero-shot transfer and fine-tuning.

Table 7: The extra combinations of environments used in the experiments of compositional generalization for task and out-of-distribution generalization with the push task.

Table 8: The average normalized final distance in in-distribution evaluation. We extensively evaluate the tokenized morphology-task graph variants, similar to Reed et al. (2022).

F ARCHITECTURE SELECTION: POSITION EMBEDDING

As a part of architecture selection, we investigate whether position embedding (PE) contributes to generalization. Our empirical results in Table 9 suggest that multi-task goal-reaching performance is comparable with and without PE. However, in more diverse morphology domains, PE plays an important role. Therefore, we include PE in the default design.

Table 9: The average normalized final distance on in-distribution evaluation. We compare the effect of position embedding.


Table 11: The average normalized final distance for in-distribution evaluation on MxT-Bench (as shown in Table 1).

Table 12: The average normalized final distance for in-distribution evaluation on MxT-Bench with challenging morphologies from Gupta et al. (2022) (as shown in Table 1).

Table 13: The average normalized final distance for compositional morphology evaluation on MxT-Bench (as shown in Table 1).

Table 17: The average normalized final distance for compositional task evaluation on MxT-Bench with unseen push task. See Table 7 for the environment division.

ACKNOWLEDGMENTS

This work was supported by JSPS KAKENHI Grant Number JP22J21582. We thank Mitsuhiko Nakamoto and Daniel C. Freeman for converting several MuJoCo agents to Brax, So Kuroki for the support on implementations, Yujin Tang, Kamyar Ghasemipour, Yingtao Tian, and Bert Chan for helpful feedback on this work.

APPENDIX A DETAILS OF IMPLEMENTATION

The hyperparameters we used are listed in Table 3. We implement the MLP and Transformer policies with Jax (Bradbury et al., 2018) and Flax (Heek et al., 2020), and the GNN policy with the graph neural network library Jraph (Godwin et al., 2020).

Transformer for Morphology-Task Graph. The Transformer first encodes the morphology-task graph into a latent representation z_0 with a shared single-layer MLP and a learnable position embedding (PE); in the case of MTGv1, we omit s_g and instead include it in the corresponding node observation s^i: z_0 = [MLP(s^1), ..., MLP(s^{|V_m|}), MLP(s_g)] + PE. Then multi-head attention (MHA) (Vaswani et al., 2017) and layer normalization (LayerNorm) (Ba et al., 2016) are recursively applied to the latent representation z_l at the l-th layer.


Before decoding the per-module actions from the last-layer latent representation z_L, we employ a residual connection of the node features (Kurin et al., 2021), where the shared MLP has a single layer and a tanh activation to clip the output within the range of [-1, 1].
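As a rough illustration of this data flow, the following numpy sketch mirrors the structure described above: a shared node encoder plus PE, an attention block with LayerNorm and residual connections, a residual connection of the node features, and a shared tanh action decoder. It uses a single head, one layer, and random weights purely to show the shapes; the actual implementation is in Jax/Flax with multi-head attention, and all dimensions here are illustrative:

```python
# Framework-agnostic sketch of the Transformer policy in Appendix A.
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(z, Wq, Wk, Wv):
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)          # softmax over nodes
    return w @ v

def policy_forward(node_feats, pe, params):
    # Shared single-layer MLP encoder applied per node, plus learned PE.
    z = np.tanh(node_feats @ params["enc"]) + pe
    z0 = z
    # One (of L) attention blocks with LayerNorm and residual connections.
    z = z + attention(layer_norm(z), params["Wq"], params["Wk"], params["Wv"])
    z = z + np.tanh(layer_norm(z) @ params["ff"])
    # Residual connection of the node features before decoding (Kurin et al.).
    z = z + z0
    # Shared single-layer decoder; tanh keeps actions in [-1, 1].
    return np.tanh(z @ params["dec"])

rng = np.random.default_rng(0)
n_nodes, obs_dim, d, act_dim = 5, 7, 16, 2
params = {"enc": rng.normal(size=(obs_dim, d)) * 0.1,
          "Wq": rng.normal(size=(d, d)) * 0.1,
          "Wk": rng.normal(size=(d, d)) * 0.1,
          "Wv": rng.normal(size=(d, d)) * 0.1,
          "ff": rng.normal(size=(d, d)) * 0.1,
          "dec": rng.normal(size=(d, act_dim)) * 0.1}
actions = policy_forward(rng.normal(size=(n_nodes, obs_dim)),
                         rng.normal(size=(n_nodes, d)) * 0.1, params)
```

Because the encoder and decoder are shared across nodes and attention is permutation-invariant up to the PE, the same parameters handle agents with different numbers of nodes, which is what enables morphology generalization.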

M EXTENDED RELATED WORK

Morphology-Task Generalization. In the previous literature, multi-task RL has arisen from each MDP component: different (1) dynamics (Hallak et al., 2015; Yu et al., 2019), (2) rewards (Kaelbling, 1993; Andrychowicz et al., 2017), and (3) state/action spaces (Wang et al., 2018; Huang et al., 2020). Multi-morphology RL covers (1) and (3), and multi-task (or multi-goal) RL covers (1) and (2). Considering morphology-task generalization therefore lets us study generalization in RL across all the MDP components (dynamics, reward, state space, action space). While, from a broader perspective, the notion of multi-task RL may contain both morphological and goal diversity, in this paper we treat them separately, and multi-task stands only for multi-goal settings.

Connection to Policy Distillation. Policy distillation has been proposed and studied for a while (Rusu et al., 2016; Parisotto et al., 2016; Levine et al., 2016; Czarnecki et al., 2019), where a single student policy distills the knowledge of multiple teacher policies to obtain a better multi-task learner. In contrast, we call our protocol behavior distillation: the single student policy distills knowledge from multi-source offline data (e.g. human teleoperation, scripted behaviors (Singh et al., 2021), play data (Lynch et al., 2019)), not limited to RL policies, since recent RL and robotics research often leverages such a data-driven approach for scalability (Levine et al., 2020; Chen et al., 2021b; Gu et al., 2021a; Lee et al., 2022; Reed et al., 2022; Zeng et al., 2020; Shridhar et al., 2022b).

