UNIVERSAL EMBODIED INTELLIGENCE: LEARNING FROM CROWD, RECOGNIZING THE WORLD, AND REINFORCED WITH EXPERIENCE

Abstract

Interactive artificial intelligence for motion control is an appealing topic, especially when universal knowledge that adapts to multiple tasks and environments is desired. Although there is a growing body of reinforcement learning (RL) studies assisted by transformers, they are often subject to the limitations of an offline training pipeline, which prohibits exploration and generalization. Motivated by cognitive and behavioral psychology, such an agent should be able to learn from others, recognize the world, and practice based on its own experience. In this study, we propose the Online Decision MetaMorphFormer (ODM) framework, which attempts to realize these learning modes with a unified model architecture that both highlights the agent's own body perception and produces predictions of actions and observations. ODM can be applied to any agent with a multi-joint body, located in different environments, and trained on different types of tasks. Large-scale pretraining datasets are used to warm up ODM, while the targeted environment continues to reinforce the universal policy. Extensive online experiments, together with few-shot and zero-shot tests in unseen environments and on never-experienced tasks, verify ODM's performance and generalization ability. Our study sheds some light on research into general artificial intelligence in the embodied and cognitive fields. Code, results, and video examples can be found on the website https://baimaxishi.github.io/.

1. INTRODUCTION

Research on embodied intelligence focuses on learning a control policy for an agent with a given morphology (joints, limbs, motion capabilities), and whether the control policy should be general or task-specific has long been debated. With improvements in large-scale data technology and cloud computing capability, the idea of artificial general intelligence (AGI) has received substantial interest (Reed et al., 2022). Accordingly, a natural motivation is to develop a universal control policy that works for agents of different morphologies and adapts easily to different scenes. It has been argued that such a smart agent should be able to identify its 'active self' by recognizing egocentric, proprioceptive perception, react to exteroceptive observations, and maintain a forward model of the world (Hoffmann & Pfeifer, 2012). However, few machine learning frameworks achieve all of this so far, although some previous studies have made similar attempts in one or several aspects. Reinforcement learning (RL) learns a policy interactively from environment feedback and can therefore be viewed as a general solution to our embodied control problem. Conventional RL can solve single-task problems in an online paradigm, but it is relatively difficult to implement, slow in practice, and lacks generalization and adaptation ability. Offline RL facilitates implementation but at the cost of performance degradation. Inspired by the recent progress of large models in the language and vision fields, transformer-based RL (Reed et al., 2022; Chen et al., 2021; Lee et al., 2022; Janner et al., 2021; Zheng et al., 2022; Xu et al., 2022) has been proposed: RL trajectories are cast as long time sequences and the model is trained in an auto-regressive manner. Such methods provide an effective approach to training a generalist agent for different tasks and environments, but they usually perform worse than classic RL and fail to capture morphology information.
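The trajectory-as-sequence idea above can be illustrated with a minimal sketch. The function name `trajectory_to_tokens` and the exact token layout are our own illustrative assumptions (Decision-Transformer-style interleaving of return-to-go, state, and action), not the paper's implementation:

```python
import numpy as np

def trajectory_to_tokens(states, actions, rewards):
    """Interleave a trajectory into a (return-to-go, state, action) token
    sequence.  A transformer is then trained auto-regressively to predict
    each action token from all tokens preceding it."""
    # Return-to-go at step t is the sum of rewards from t to the episode end.
    rtg = np.cumsum(rewards[::-1])[::-1]
    tokens = []
    for g, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", float(g)), ("state", s), ("action", a)])
    return tokens

# Toy 3-step trajectory
tokens = trajectory_to_tokens(states=[0, 1, 2],
                              actions=[1, 0, 1],
                              rewards=[1.0, 0.5, 2.0])
# tokens[0] == ("rtg", 3.5): the total return conditions the whole sequence
```

Conditioning on the return-to-go is what lets such models imitate trajectories of a desired quality at inference time.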
In contrast, MetaMorph (Gupta et al., 2022) encodes the agent's body morphology and performs online learning; it therefore achieves good performance but does not consider time dependency. To better address embodied intelligence, we draw motivation from behavioral psychology, in which an agent improves its skill through actual practice, by learning from others (teachers, peers, or even someone with inferior skills), or by making decisions based on a perception of 'the world model' (Ha & Schmidhuber, 2018; Wu et al., 2022). It is reasonable to believe that an embodied intelligent agent should support all three learning paradigms simultaneously. We realize this by designing a morphology-time transformer-based RL architecture that is compatible with both offline and online learning. Offline training is conducted on multi-task datasets and covers both learning from other agents and speculating about future system states. Online training then allows the agent to improve its policy in an on-policy way on a single task. In this work, we propose a framework called Online Decision MetaMorphFormer (ODM), which aims to learn general knowledge of embodied control across different body shapes, environments, and tasks, as indicated in Figure 1. The model architecture contains a universal backbone and task-specific modules. The task-specific modules capture potential differences in agent body shapes, and the morphological differences are further emphasized by a prompt based on characteristics of the body shape. We first pretrain this model with curriculum learning, presenting demonstrations from the easiest to the hardest task, and from expert players to low-level ones. An environment-model prediction is added as an auxiliary loss. The same architecture can then be finetuned online on a specific task.
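The pretraining objective described above, imitation of demonstrated actions plus an auxiliary world-model term, can be sketched as follows. The function name `odm_pretrain_loss`, the use of mean-squared error, and the weight `aux_weight` are hypothetical choices for illustration; the paper's actual loss terms and weighting may differ:

```python
import numpy as np

def odm_pretrain_loss(pred_actions, target_actions,
                      pred_next_obs, target_next_obs, aux_weight=0.1):
    """Sketch of a combined pretraining objective: an imitation term on the
    predicted actions plus an auxiliary world-model term that predicts the
    next observation.  `aux_weight` balances the two terms (assumed value)."""
    action_loss = np.mean((pred_actions - target_actions) ** 2)
    world_model_loss = np.mean((pred_next_obs - target_next_obs) ** 2)
    return action_loss + aux_weight * world_model_loss
```

The auxiliary term forces the shared backbone to encode environment dynamics, which is one way to instantiate the "recognizing the world" mode described earlier.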
During testing, we evaluate ODM in all training environments, transfer the policy to different body shapes, adapt it to unseen environments, and accommodate new types of tasks (e.g., from locomotion to reaching, target capturing, or escaping from obstacles).

Main contributions of this paper include:

• We design a unified model architecture that encodes time and morphology dependencies simultaneously, bridging sequential decision making with embodied intelligence.
• We propose a training paradigm that mimics the process by which natural intelligence emerges, including learning from others, improving with practice, and recognizing the world.
• We train and test our framework with agents of eight different body shapes, in different environment terrains, and on different task types. These comprehensive analyses verify the general knowledge of motion control that our model learns.

2. RELATED WORKS

Classic RL: Among conventional RL methods, on-policy RL such as Proximal Policy Optimization (PPO) (Schulman et al., 2017) learns the policy directly from current interactions and therefore adapts well to the environment, but it converges slowly and may exhibit large trajectory variance. Off-policy RL



Figure 1: Application pipeline of ODM.

