UNIVERSAL EMBODIED INTELLIGENCE: LEARNING FROM CROWD, RECOGNIZING THE WORLD, AND REINFORCED WITH EXPERIENCE

Abstract

Interactive artificial intelligence for motion control is an interesting topic, especially when universal knowledge adaptive to multiple tasks and environments is desired. Although there are increasing efforts on reinforcement learning (RL) with the assistance of transformers, such methods may be subject to the limitations of an offline training pipeline, in which exploration and generalization ability are prohibited. Motivated by cognitive and behavioral psychology, such an agent should be able to learn from others, recognize the world, and practice based on its own experience. In this study, we propose the Online Decision MetaMorphFormer (ODM) framework, which attempts to achieve the above learning modes with a unified model architecture that both highlights the agent's own body perception and produces action and observation predictions. ODM can be applied to any arbitrary agent with a multi-joint body, located in different environments, and trained with different types of tasks. A large-scale pretraining dataset is used to warm up ODM, while the targeted environment continues to reinforce the universal policy. Substantial online experiments, as well as few-shot and zero-shot tests in unseen environments and never-experienced tasks, verify ODM's performance and generalization ability. Our study sheds some light on research of general artificial intelligence in the embodied and cognitive fields. Code, results and video examples can be found on the website https://baimaxishi.github.io/.

1. INTRODUCTION

Research on embodied intelligence focuses on learning a control policy for an agent with a given morphology (joints, limbs, motion capabilities), and it has long been debated whether the control policy should be general or specific. With the improvement of large-scale data technology and cloud computing, the idea of artificial general intelligence (AGI) has received substantial interest (Reed et al., 2022). Accordingly, a natural motivation is to develop a universal control policy that works for agents of different morphologies and adapts easily to different scenes. It is argued that such a smart agent should be able to identify its 'active self' by recognizing egocentric, proprioceptive perception, react to exteroceptive observations, and hold a forward model of the world (Hoffmann & Pfeifer, 2012). However, few machine learning frameworks achieve all of this so far, although some previous studies have made similar attempts in one or several aspects. Reinforcement learning (RL) learns the policy interactively based on environment feedback, and can therefore be viewed as a general solution to our embodied control problem. Conventional RL can solve single-task problems in an online paradigm, but it is relatively difficult to implement, slow in practice, and lacks generalization and adaptation ability. Offline RL facilitates implementation, but at the cost of performance degradation. Inspired by recent progress of large models in the language and vision fields, transformer-based RL (Reed et al., 2022; Chen et al., 2021; Lee et al., 2022; Janner et al., 2021; Zheng et al., 2022; Xu et al., 2022) casts RL trajectories as a long time sequence and trains the model in an auto-regressive manner. Such methods provide an effective approach to training a generalist agent for different tasks and environments, but usually perform worse than classic RL and fail to capture morphology information.
In contrast, MetaMorph (Gupta et al., 2022) encodes the agent's body morphology and performs online learning; it therefore performs well but lacks time-dependency consideration. To find a better solution for embodied intelligence, we draw on behavioral psychology, in which an agent improves its skill by actual practice, by learning from others (teachers, peers, or even someone less skilled), or by making decisions based on a perception of 'the world model' (Ha & Schmidhuber, 2018; Wu et al., 2022). It is reasonable to believe that an embodied intelligent agent should have all three learning paradigms simultaneously. We propose such a methodology by designing a morphology-time transformer-based RL architecture which is compatible with both offline and online learning. Offline training is conducted on multi-task datasets and covers both learning from other agents and speculating about future system states. Online training allows the agent to improve its policy in an on-policy way on a single task. In this work, we propose a framework called Online Decision MetaMorphFormer (ODM), which aims to learn general knowledge of embodied control across different body shapes, environments and tasks, as indicated in Figure 1. The model architecture contains a universal backbone and task-specific modules. The task-specific modules capture potential differences in agent body shapes, and the morphological difference is further emphasized by a prompt based on characteristics of the body shape. We first pretrain this model with curriculum learning, learning from demonstrations ordered from the easiest to the hardest task, and from experts down to low-level players. An environment-model prediction term is added as an auxiliary loss. The same architecture can then be finetuned online on a specific task.
At test time, we evaluate ODM on all training environments, transfer the policy to different body shapes, adapt to unseen environments, and accommodate new types of tasks (e.g. from locomotion to reaching, target capturing, or escaping from obstacles).

Main contributions of this paper include:

• We design a unified model architecture that encodes time and morphology dependencies simultaneously, bridging sequential decision making with embodied intelligence.
• We propose a training paradigm which mimics how natural intelligence emerges: learning from others, improving with practice, and recognizing the world.
• We train and test our framework with agents of eight different body shapes, different environment terrains and different task types. These comprehensive analyses verify the general motion-control knowledge learned by our model.

2. RELATED WORKS

Classic RL: Among conventional RL methods, on-policy RL such as Proximal Policy Optimization (PPO) (Schulman et al., 2017) learns the policy from fresh interactions and therefore adapts well to the environment, but it converges slowly and can suffer large trajectory variance. Off-policy RL such as DQN (Mnih et al., 2015) improves sampling efficiency but still requires a dynamically updated data buffer. In contrast, offline RL (Fujimoto et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021) solves the problem in a manner similar to supervised learning, but may have degraded performance because of the distribution shift between the offline dataset and the online environment. In this work we aim at state-of-the-art performance on different embodied control tasks, so we propose a model architecture compatible with on-policy RL.

Transformer-based RL: Among these efforts, Decision Transformer (DT) (Chen et al., 2021) and Multi-game Decision Transformer (Lee et al., 2022) embed the continuous state and action directly and use a GPT-like causal transformer to learn the policy offline. Their action decisions are conditioned on the Return-to-Go (RTG), either set arbitrarily or estimated by the model, since the RTG is unknown during inference. Trajectory Transformer (TT) (Janner et al., 2021) instead discards the RTG in the sequence modeling to avoid that approximation. PromptDT (Xu et al., 2022) accounts for task differences by using demonstration trajectories as prompts. ODT (Zheng et al., 2022) first attempts to solve transformer-based RL in an online manner, but mainly focuses on supervision of actions instead of maximizing rewards. In our work, we propose a similar model architecture that supports both offline learning and on-policy, actor-critic learning; the online learning employs PPO as the policy update rule with reward maximization as the objective.
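For concreteness, the Return-to-Go that DT-style methods condition actions on is simply the future cumulative reward from each step onward. A minimal sketch (our own illustration, not code from any of the cited works):

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """RTG_t = r_t + gamma * RTG_{t+1}: future cumulative reward from step t."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(returns_to_go([1.0, 2.0, 3.0]))  # [6. 5. 3.]
```

During inference the true RTG is unknown, which is why DT sets it arbitrarily or estimates it, and why TT drops it altogether.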

Morphology-based RL:

Several other studies focus on the agent's morphology information, including GNN-based RL, which models the agent's joints as a kinematic tree graph (Huang et al., 2020); Amorpheus (Kurin et al., 2021), which encodes a policy module for each body joint; and MetaMorph (Gupta et al., 2022), which uses a transformer to encode the body morphology as a sequence of joint properties and trains it with PPO. Our work adopts a morphology-aware encoder similar to MetaMorph and the same PPO update rule. However, compared with MetaMorph, we encode the morphology of not only the states but also the historical actions, and take the historical context into account.

3. PRELIMINARIES AND PROBLEM SETUP

We formulate a typical sequential decision-making problem in which, at each time step t, an embodied agent perceives a state s_t ∈ R^{n_s}, performs an action a_t ∈ R^{n_a}, and receives a scalar reward r_t ∈ R. Reinforcement Learning (RL) can then be employed to produce a policy π(a_t|s_t) that maximizes the expected sum of discounted rewards. The actor-critic framework is a popular RL framework in which the critic estimates the state value function V(s) while the actor determines the policy. Classical RL methods such as Proximal Policy Optimization (PPO) can be employed to solve the problem effectively, with a detailed derivation in Appendix A.

Problem Setup: Here we redefine the aforementioned conventional RL notations in a more 'embodied' style, while remaining general enough for any arbitrary agent with a multi-joint body. Inspired by the idea of Gupta et al. (2022), we split the observation into the agent's proprioceptive observation, i.e., the joint-wise self-perception of the embodied agent (e.g. the angle and angular velocity of each joint), and the exteroceptive observation, i.e., the agent's global sensory information (e.g. position, velocity). Given a K-joint agent, we denote the proprioceptive observation by o^pro ∈ R^{K×n}, in which each joint is embedded with n-dimensional observations. The exteroceptive observation is x-dimensional, which results in o^ext ∈ R^x and s := [o^pro, o^ext]. Stepping forward from Gupta et al. (2022), we also define the action in a joint-dependent way; that is, assuming each joint has m degrees of freedom (DoF) of movement (e.g. torque), the action is reshaped as a ∈ R^{K×m}. To allow room for different agent body shapes, we introduce binary masks M_s and M_a, which have the same shapes as o^pro and a, and zero-pad the impossible observations or actions (e.g. the DoF of a humanoid's forearm should be smaller than that of its upper arm due to their physical connection). Table 1 visualizes the comparison between the conventional RL notations and our embodied versions.

Table 1: Conventional RL notations versus their embodied counterparts.

                Conventional     Embodied
Observations    s ∈ R^{n_s}      o^pro, M_s ∈ R^{K×n};  o^ext ∈ R^x;  s = [o^pro, o^ext]
Actions         a ∈ R^{n_a}      a, M_a ∈ R^{K×m}
Connections     -                n_s = K·n − M_s + x;  n_a = K·m − M_a

Attention-based Encoding: Given a stacked time-sequence vector x ∈ R^{T×e}, with T the time length and e the embedding dimension, a time-sequence encoder can be expressed as

Enc_T(x) = Attention(Q = x, K = x, V = x) ∈ R^{T×e}    (1)

according to the self-attention mechanism (Vaswani et al., 2017).

4.1. MODEL ARCHITECTURE

Tokenizers: At each time step t, the proprioceptive observation, the exteroceptive observation and the action are embedded separately:

o^e_t = Embed_o(o^pro_t) ∈ R^{K×e},  x^e_t = Embed_x(o^ext_t) ∈ R^e,  a^e_t = Embed_a(a_t) ∈ R^{K×e}

Morphology-aware Encoder: The corresponding pose embedding vectors are obtained by traversing the agent's kinematic tree and encoding the morphology by Eq. 2:

o^p_t = Enc_M(o^e_t),  s^p_t = MLP_s([o^p_t, x^e_t]) ∈ R^e,  a^p_t = Enc_M(a^e_t)    (2)

Causal Transformer: To capture morphology differences across agents, we apply the prompt technique as in (Xu et al., 2022), but embed the morphology specification instead of demonstration trajectories:

Prompt = Embed(K, n, m, x)

The causal transformer then translates the prompt and the input sequence into the output sequence

output := Enc_T(Prompt, input)    (6)
input := {BOS, s^p_0, a^p_0, s^p_1, ..., a^p_{t-1}, s^p_t}    (7)
output := {ŝ^p_0, â^p_0, ŝ^p_1, â^p_1, ..., ŝ^p_t, â^p_t}    (8)

with a forward causal time mask. The detailed structure is inherited from GPT-2, a decoder-only architecture as in (Chen et al., 2021; Janner et al., 2021; Zheng et al., 2022).

Projectors: Task-specific projectors map the latent outputs back to the original spaces:

â_t = Proj_a(â^p_t),  ŝ_t = Proj_s(ŝ^p_t),  V̂_t = Proj_V(ŝ^p_t)

Embed_o, Embed_x, Embed_a, MLP_s, Proj_a, Proj_s and Proj_V are all modeled as MLPs with LayerNorm and ReLU between layers.
More detailed configuration can be found in Appendix.
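The two-level encoding above can be sketched minimally as follows. This is our own simplification (single-head attention, toy dimensions, no learned weights), not the paper's implementation: Enc_M attends across the K joint tokens of one time step, while Enc_T attends across the T per-step tokens under a forward causal mask.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, mask=None):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)              # (L, L) pairwise scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ x                 # (L, d) encoded tokens

K, T, e = 4, 5, 8                              # joints, time steps, embedding dim
rng = np.random.default_rng(0)

# Enc_M: attend across the K joint tokens of a single time step (morphology axis).
joint_tokens = rng.standard_normal((K, e))
joint_enc = self_attention(joint_tokens)

# Enc_T: attend across the T per-step tokens with a forward causal mask (time axis),
# so the token at step t only sees steps 0..t.
step_tokens = rng.standard_normal((T, e))
causal_mask = np.tril(np.ones((T, T), dtype=bool))
step_enc = self_attention(step_tokens, causal_mask)
```

With the causal mask in place, perturbing a later step token leaves all earlier encoded steps unchanged, which is exactly the property the auto-regressive decoder relies on.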

4.2. TRAINING PARADIGM

ODM has a two-phase training paradigm consisting of pretraining and finetuning, as summarized in Algorithm 1.

Pretraining: To mimic the learning process of a human infant, we design a curriculum-based learning mechanism in which training traverses the datasets from the easiest to the most complicated. During each epoch, the current dataset is trained in an auto-regressive manner with two loss terms:

L_imitation = MSE(â_t, a^p_t),  L_prediction = MSE(ŝ_t, s^p_t),  L_pretrain = η_p L_imitation + η_i L_prediction    (10)

where MSE denotes the mean-squared error. L_imitation drives the imitation of actions from demonstrations, while L_prediction encourages the agent to predict future observations [foot_0].

Finetuning: One extra prediction head is activated to predict the state value V̂_t; this head, together with the action prediction head producing â_t, serves as the outputs of the critic and the actor:

V̂_t → V(s_t),  â_t → π(s_t)

Actor and critic are then trained by PPO, with more details in Appendix A. Keeping some portion of L_pretrain as an auxiliary loss, the finetuning becomes self-supervised plus model-based RL:

L_finetune = η_1 L_PPO + η_2 L_pretrain

Algorithm 1 ODM
1: Initialize θ
2: Pretraining: …

5.1. EXPERIMENT SETUP

Bodies, Environments and Tasks: We experiment with a wide range of agents, environments and tasks to validate the general knowledge learned by ODM. These scenes include:

• Body shape: swimmer (3 joints, no foot), reacher (1 joint, fixed at one end), hopper (1 foot), halfcheetah (2 feet), walker2d (2 feet), ant (4 feet), and humanoid on the gym-mujoco platform [foot_1]; walker (an agent called ragdoll with a realistic humanoid body) [foot_2] on the Unity platform (Juliani et al., 2018); and finally unimal, a mujoco-based environment which contains 100 different morphological agents (Gupta et al., 2021).
• Environment: flat terrain (FT), variable terrain (VT), or escaping from obstacles.
• Task: pure locomotion, standing up (humanoid), or target reaching (reacher, walker).
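The pretraining objective in Eq. 10 can be sketched as follows. All tensors and the loss weights η_p, η_i here are illustrative placeholders, not the paper's values; the only substantive detail carried over is that the t = 0 state prediction is masked out of L_prediction.

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
T, e = 6, 8                  # time steps and latent dimension (toy sizes)
eta_p, eta_i = 1.0, 0.5      # loss weights (illustrative, not the paper's values)

# Hypothetical predicted / target sequences standing in for model outputs.
a_hat = rng.standard_normal((T, e)); a_target = rng.standard_normal((T, e))
s_hat = rng.standard_normal((T, e)); s_target = rng.standard_normal((T, e))

L_imitation = mse(a_hat, a_target)
# The t = 0 state prediction is masked out: the first, randomly initialised
# state cannot be meaningfully predicted.
L_prediction = mse(s_hat[1:], s_target[1:])
L_pretrain = eta_p * L_imitation + eta_i * L_prediction
```

During finetuning the same quantity is kept as an auxiliary term, added to the PPO loss with separate weights.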
Baselines: We compare ODM with four baselines, each representing a different learning paradigm:

• Metamorph: a morphology-encoding-based online learning method that learns a universal control policy (Gupta et al., 2022).
• DT: as a state-of-the-art offline learning baseline, we implement the decision transformer with expert action inference as in Lee et al. (2022), handling continuous spaces as in (Chen et al., 2021). We abbreviate it as DT in the following sections.
• PPO: the classical on-policy RL algorithm (Schulman et al., 2017). Code is cloned from stable-baselines3, in which PPO is implemented in the actor-critic style.
• Random: a random control policy that samples each action uniformly from its action space. This indicates the performance without any prior knowledge, especially for the zero-shot case.

Demonstrating pioneers: For pretraining, we collect offline data samples of hopper, halfcheetah, walker2d and ant from D4RL (Fu et al., 2020) as sources of pioneer demonstrations. For these environments, D4RL provides datasets sampled from agents of different skill levels, which correspond to different learning pioneers in our framework: expert, medium-expert, medium, medium-replay and random. We also train baseline expert agents and use them to sample offline datasets on walker and unimal. These datasets contain more than 25 million data samples, with detailed statistics shown in Appendix, Table 8. Within each curriculum, we also rotate demonstrations from the above pioneers during training, as indicated in Algorithm 1.

5.2. EXPERIMENT RESULTS

Pretraining: The model is trained with datasets of hopper, halfcheetah, walker2d, ant, walker and unimal, ordered from the easiest to the most complex. Figure 3 shows the loss plot. One can observe that the training loss successfully converges within each curriculum course, although its absolute value occasionally jumps to a different level when the environment (and the teacher) switches. Validation-set accuracy also improves, with walker and unimal shown as examples in Figure 3.

Online Experiments: To speed up online learning, we use 32 independent agents to sample trajectories in parallel, with a maximum of 1000 episode steps. Each experiment continues for more than 1500 iterations after the performance converges. Figure 4 provides a snapshot of online performance. Compared with ODM without pretraining, the returns of ODM are higher at the very beginning, indicating that the knowledge from pretraining is helpful. As online learning continues, the performance degrades slightly before finally growing again, and converges faster than the other two methods, although the entire training time (pretraining plus finetuning) is longer. During online testing, 100 independent episodes are sampled, each fed with a unique random seed. We use the averaged episode return and episode length as evaluation metrics. Table 2 shows the online testing performance. One can observe that ODM outperforms, or is at least on par with, the baselines.

Few-shot Experiments: We examine policy transferability with several few-shot experiments. The pretrained ODM is loaded into several unseen tasks, listed in Table 3. As a few-shot test, online training lasts only 500 steps before testing. ODM obtains the best performance except for humanoid on flat terrain, indicating that ODM has better adaptation ability than MetaMorph.

Zero-shot Experiments: Zero-shot experiments are conducted by running inference directly, without any online finetuning.
The unimal environment allows such experiments, in which the flat terrain (FT) can be replaced by variable terrain (VT) or obstacles. Results are shown in Table 4. ODM reaches state-of-the-art performance in the zero-shot tests, indicating that it captures general high-level knowledge from pretraining and generalizes strongly even without any prior experience in the target environment.

Ablation Studies: To verify the effectiveness of each model component, we run ablations of ODM with only the online finetuning phase (wo pretrain) and with only pretraining (wo finetune); within the pretraining scope, we further examine ODM without the curriculum mechanism (wo curriculum) and without the morphology prompt (wo prompt). The DT method can be viewed as the ablation of both L_prediction and L_PPO, so we do not list separate ablations of these two loss terms. We conduct the ablation study on unimal (all 3 tasks) as well as walker, with results shown in Table 5. The results show that ODM remains the best on all these tasks, indicating that both learning from others' demonstrations and learning from self-experience are necessary for intelligent agents.

To examine the rationality and smoothness of the agents' motion, we first visualize the motions of the trained models in the walker environment. Since the walker agent has a humanoid body (the 'ragdoll'), readers can easily judge the plausibility of the motion from real-life experience. Figure 5 exhibits key frames of the videos at the same time points. In this experiment, we force the agents to start from exactly the same state and remove the process noise. Comparing ODM (bottom row) with PPO (top row), one can see that ODM behaves more like a human, while PPO keeps swaying back and forth and side to side, with unnatural movements such as lowered shoulders and a twisted waist. We then compare motion agility in the unimal environment, in which the agent is encouraged to walk in an arbitrary direction.
Figure 6 compares ODM with Metamorph. Metamorph wastes most of its time shaking its feet, sliding and gliding in place; as a result, ODM walks a longer distance than Metamorph within the same time interval [foot_4].
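The online testing protocol described above (averaging episode return and episode length over independently seeded episodes) can be sketched as follows. The policy and environment dynamics here are toy stand-ins of our own, not the trained ODM or a MuJoCo task; only the evaluation loop structure mirrors the text.

```python
import numpy as np

def evaluate(policy, env_step, n_episodes=100, max_steps=1000, state_dim=4):
    """Average episode return and length over independently seeded episodes."""
    returns, lengths = [], []
    for seed in range(n_episodes):
        rng = np.random.default_rng(seed)   # one unique random seed per episode
        s, ep_return, t = rng.standard_normal(state_dim), 0.0, 0
        for t in range(1, max_steps + 1):
            s, r, done = env_step(s, policy(s), rng)
            ep_return += r
            if done:
                break
        returns.append(ep_return)
        lengths.append(t)
    return float(np.mean(returns)), float(np.mean(lengths))

# Toy stand-ins for the trained policy and environment (illustrative only).
policy = lambda s: -0.1 * s
def env_step(s, a, rng):
    s_next = s + a + 0.01 * rng.standard_normal(s.shape)
    return s_next, -float(np.sum(s ** 2)), bool(np.sum(s_next ** 2) < 1e-4)

mean_return, mean_length = evaluate(policy, env_step, n_episodes=10, max_steps=50)
```

Seeding each episode independently keeps the evaluation reproducible while still averaging over distinct initial states and process noise.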

6. DISCUSSION

Our work can be viewed as an early attempt at an embodied-intelligence generalist that accommodates varied body shapes, tasks and environments. One shortcoming of the current approach is that ODM still has task-specific modules (tokenizers and projectors) for different body shapes. By using a self-adaptive model structure (e.g. a Hypernetwork) in these modules, it may be possible to use one unified model to represent the generalist agent. Another potential improvement is to add value/return prediction into the sequence modeled by the causal transformer; the agent would then be able to estimate 'the value of its action' before the action is actually conducted, which is also known as 'metacognition'. A final interesting topic is the potential training conflict when training switches from offline to online. This might be mitigated by hyperparameter tuning (out of this paper's scope), e.g. a warmup schedule of L_PPO during finetuning, but could also be addressed by a different model architecture that better accommodates knowledge learned in the offline and online phases.

7. CONCLUSION

In this paper, motivated by the intelligence development process in the natural world, we propose a learning framework that learns a universal body control policy across arbitrary body shapes, environments and tasks. We combine the ideas of learning from others, reinforcing with self-experience, and recognizing the world model. To achieve this, we design a two-dimensional transformer structure which first encodes the morphological information of agent states and actions at each time step, and then encodes the time-sequential decision process to formulate the policy function. A two-phase training paradigm is designed accordingly, in which the agent first learns from demonstrations of pioneers with different skill levels and from different tasks; after that, the agent interacts with its own environment and further reinforces its skill via on-policy RL. Online, few-shot and zero-shot experiments show that our methodology learns general knowledge for embodied motion control. We believe this work can shed light on embodied intelligence studies in which some kind of generalist intelligence is desired.



[foot_0] L_prediction(t = 0) is masked out, since it is meaningless to predict the very first, randomly initialized state.
[foot_1] https://www.gymlibrary.dev/environments/mujoco/
[foot_2] https://github.com/Unity-Technologies/ml-agents
[foot_4] Full versions of the videos can be found on the website https://baimaxishi.github.io/. The figure grids help the reader recognize the comparison, although the video is more obvious.



Figure 1: Application pipeline of ODM.


Figure 2: Model structure of ODM and its training paradigm. Tokenizer: at each time t, observations and actions are first embedded into the latent space.
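A minimal sketch of such a tokenizer, under assumed names and shapes (the projection matrices stand in for learned embedding layers): each joint's observation and action vectors at time t are projected into a shared latent space, yielding one token per joint per modality.

```python
import numpy as np

rng = np.random.default_rng(0)
J, d_obs, d_act, D = 5, 7, 3, 16       # joints, raw obs/act dims, latent dim
W_obs = rng.normal(size=(d_obs, D))    # stand-in for a learned obs projection
W_act = rng.normal(size=(d_act, D))    # stand-in for a learned act projection

def tokenize(obs, act):
    """Embed per-joint observations and actions into a shared latent space.
    obs: (J, d_obs), act: (J, d_act) -> tokens: (2, J, D)."""
    obs_tok = obs @ W_obs              # (J, D)
    act_tok = act @ W_act              # (J, D)
    return np.stack([obs_tok, act_tok])

tokens = tokenize(rng.normal(size=(J, d_obs)), rng.normal(size=(J, d_act)))
print(tokens.shape)  # (2, 5, 16)
```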

Algorithm fragment (online phase): roll out policy π for T timesteps; compute advantage estimates Â_0, …, Â_T; update θ with the surrogate loss L^PPO on a mini-batch.
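The advantage-estimation and surrogate-update steps above can be sketched as follows. This is a generic GAE plus clipped-PPO sketch, not the paper's exact implementation; hyperparameter values and function names are assumptions.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates from a T-step rollout.
    rewards: length T; values: length T+1 (bootstrap value appended)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def ppo_surrogate(logp_new, logp_old, adv, clip=0.2):
    """Clipped PPO objective L^PPO averaged over a mini-batch."""
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - clip, 1 + clip) * adv).mean()

rewards = np.ones(4)
values = np.zeros(5)
adv = gae_advantages(rewards, values)
loss = ppo_surrogate(np.zeros(4), np.zeros(4), adv)
print(adv.shape)  # (4,)
```

With identical old and new log-probabilities the ratio is 1 and the surrogate reduces to the mean advantage, which is a convenient sanity check.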

Figure 3: Time plot of pretraining performance. Orange: training loss. Green: validation MSE of walker; Blue: validation MSE of unimal.

Figure 4: Comparison of averaged episode returns during online experiments. Left: unimal. Right: walker. Curves are smoothed and values are rescaled for better visualization.

Figure 5: ODM improves motion fluency and coherence in the walker environment. Key frames are captured at seconds 1, 2, 3, 4, and 5, respectively. Video can be found on the website.

Figure 6: ODM improves motion agility of a typical unimal agent. Key frames are sampled evenly from a 30-second video. Video can be found on the website.

Comparison between conventional RL notation and ours.

Averaged episodic performance in online locomotion environments. Walker performance shows substantial deviation because forward process noise is implemented in the walker environment.

Performance in few-shot experiments. The official reacher environment has a maximum episode length limit of 50.

Performance in zero-shot experiments.

Performance in ablation studies on unimal and walker.

Generalist learning aims not only to improve mathematical metrics, but also the reasonableness of motion from a human's viewpoint. Traditional RL, which only solves the mathematical optimization problem, has difficulty addressing this issue. By jointly learning from other agents' demonstrations and bridging them with the agent's own experience, we expect ODM to acquire more universal knowledge about body control by solving many different types of problems. Here we provide some quick visualizations of motions generated by ODM, compared with the original versions 4.

