SIMPLE EMERGENT ACTION REPRESENTATIONS FROM MULTI-TASK POLICY TRAINING

Abstract

The low-level sensory and motor signals in deep reinforcement learning, which exist in high-dimensional spaces such as image observations or motor torques, are inherently challenging to understand or utilize directly for downstream tasks. While sensory representations have been extensively studied, representations of motor actions remain an active area of exploration. Our work reveals that a space containing meaningful action representations emerges when a multi-task policy network takes both states and task embeddings as inputs, with moderate constraints added to improve its representational ability. Interpolated or composed embeddings can then function as a high-level interface within this space, providing instructions to the agent for executing meaningful action sequences. Empirical results demonstrate that the proposed action representations are effective for intra-action interpolation and inter-action composition with limited or no additional learning. Furthermore, our approach exhibits superior task-adaptation ability compared to strong baselines on MuJoCo locomotion tasks. Our work sheds light on the promising direction of learning action representations for efficient, adaptable, and composable RL, forming the basis of abstract action planning and the understanding of motor signal space. Project page: https://sites.google.com/view/

1. INTRODUCTION

Deep reinforcement learning (RL) has shown great success in learning near-optimal policies for performing low-level actions with pre-defined reward functions. However, reusing this learned knowledge to efficiently accomplish new tasks remains challenging. In contrast, humans naturally summarize low-level muscle movements into high-level action representations, such as "pick up" or "turn left", which can be reused in novel tasks with slight modifications. As a result, we carry out the most complicated movements without thinking about the detailed joint motions or muscle contractions, relying instead on high-level action representations (Kandel et al., 2021). By analogy with these human abilities, we ask: can RL agents have action representations of low-level motor controls that can be reused, modified, or composed to perform new tasks? As pointed out in Kandel et al. (2021), "the task of the motor systems is the reverse of the task of the sensory systems. Sensory processing generates an internal representation in the brain of the outside world or of the state of the body. Motor processing begins with an internal representation: the desired purpose of movement." In the past decade, representation learning has made significant progress in representing high-dimensional sensory signals, such as images and audio, to reveal the geometric and semantic structures hidden in raw signals (Bengio et al., 2013; Chen et al., 2018; Kornblith et al., 2019; Chen et al., 2020; Baevski et al., 2020; Radford et al., 2021; Bardes et al., 2021; Bommasani et al., 2021; He et al., 2022; Chen et al., 2022). With the generalization ability of sensory representation learning, downstream control tasks can be accomplished efficiently, as shown by recent studies (Nair et al., 2022; Xiao et al., 2022; Yuan et al., 2022). While there have been significant advances in sensory representation learning, action representation learning remains largely unexplored.
To address this gap, we aim to discover generalizable action representations that can be reused or efficiently adapted to perform new tasks. A central idea in sensory representation learning is pretraining on a comprehensive task or set of tasks, followed by reusing the resulting latent representation. We extend this approach to action representation learning and explore its potential for enhancing the efficiency and adaptability of reinforcement learning agents. We propose a multi-task policy network that enables a set of tasks to share the same latent action representation space. Further, the time-variant sensory representations and time-invariant action representations are decoupled and then concatenated into sensory-action representations, which are finally transformed by a policy network into low-level action controls. Surprisingly, when trained on a comprehensive set of tasks, this simple structure learns an emergent, self-organized action representation that can be reused for various downstream tasks. In particular, we demonstrate the efficacy of this representation in MuJoCo locomotion environments, showing zero-shot interpolation/composition and few-shot task adaptation in the representation space, outperforming strong meta-RL baselines. Additionally, we find that the decoupled time-variant sensory representation exhibits equivariant properties. This evidence suggests that reusable and generalizable action representations may lead to efficient, adaptable, and composable RL, thus forming the basis of abstract action planning and understanding of motor signal space. The primary contributions of this work are as follows:
1. We put forward the idea of leveraging emergent action representations from multi-task learners to better understand the motor action space and accomplish task generalization.
2. We decouple the state-related and task-related information of the sensory-action representations and reuse them to conduct action planning more efficiently.
3. Our approach is a strong adapter, achieving higher rewards with fewer steps than strong meta-RL baselines when adapting to new tasks.
4. Our approach supports intra-action interpolation as well as inter-action composition by modifying and composing the learned action representations.
We begin our technical discussion below and defer the discussion of the many valuable related works to the end.
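To make the described architecture concrete, the following is a minimal sketch, not the authors' implementation: a policy that concatenates a time-variant state encoding with a time-invariant task embedding before a shared policy head. All dimensions, layer sizes, and names (`STATE_DIM`, `act`, etc.) are illustrative assumptions, and the weights here are random rather than trained.

```python
import numpy as np

STATE_DIM, EMB_DIM, HIDDEN, ACTION_DIM, NUM_TASKS = 17, 8, 32, 6, 4
rng = np.random.default_rng(0)

def mlp_params(d_in, d_hidden, d_out):
    """Random weights for a 2-layer MLP (illustrative, untrained)."""
    return {
        "W1": rng.normal(0, 0.1, (d_in, d_hidden)), "b1": np.zeros(d_hidden),
        "W2": rng.normal(0, 0.1, (d_hidden, d_out)), "b2": np.zeros(d_out),
    }

def mlp(p, x):
    h = np.tanh(x @ p["W1"] + p["b1"])
    return h @ p["W2"] + p["b2"]

# Time-invariant action representations: one learned embedding per training task.
task_embeddings = rng.normal(0, 0.1, (NUM_TASKS, EMB_DIM))
# Time-variant state encoder and shared policy head.
state_encoder = mlp_params(STATE_DIM, HIDDEN, HIDDEN)
policy_head = mlp_params(HIDDEN + EMB_DIM, HIDDEN, ACTION_DIM)

def act(state, task_emb):
    """Sensory-action representation -> low-level motor command (action mean)."""
    s_repr = mlp(state_encoder, state)        # time-variant sensory part
    z = np.concatenate([s_repr, task_emb])    # concatenated sensory-action repr.
    return np.tanh(mlp(policy_head, z))       # bounded low-level control

state = rng.normal(size=STATE_DIM)
# Interpolating two task embeddings acts as a new high-level instruction.
z_interp = 0.5 * task_embeddings[0] + 0.5 * task_embeddings[1]
action = act(state, z_interp)
print(action.shape)
```

Because the task embedding enters only through concatenation, intra-action interpolation and inter-action composition reduce to simple vector arithmetic in the embedding space before the forward pass.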

2. PRELIMINARIES

Soft Actor-Critic. In this paper, our approach is built on Soft Actor-Critic (SAC) (Haarnoja et al., 2018), a stable off-policy actor-critic algorithm based on the maximum-entropy reinforcement learning framework, in which the actor maximizes both the expected return and the entropy of its policy. We leave more details of SAC to Appendix A.

Task Distribution. We assume the tasks the agent may encounter are drawn from a pre-defined task distribution p(T). Each task in p(T) corresponds to a Markov decision process (MDP). Therefore, a task T can be defined by a tuple (S, A, P, p_0, R), in which S and A are respectively the state and action space, P the transition probability, p_0 the initial state distribution, and R the reward function. The concept of a task distribution is frequently employed in meta-RL problems, but we extend it to better match the setting of this work. We divide task distributions into two main categories, "uni-modal" and "multi-modal", defined as follows:
• Definition 1 (Uni-modal task distribution): There is only one modality among all the tasks in the distribution. For example, in HalfCheetah-Vel, a MuJoCo locomotion environment, we train the agent to run at different target velocities; running is thus the only modality in this task distribution.
• Definition 2 (Multi-modal task distribution): In contrast to a uni-modal task distribution, there are multiple modalities among the tasks: a multi-modal task distribution includes tasks from several different uni-modal task distributions. For instance, we design a multi-modal task distribution called HalfCheetah-Run-Jump, which contains two modalities, HalfCheetah-BackVel and HalfCheetah-BackJump.
The former has been defined above, and the latter contains tasks that train the agent to jump with different reward weights. In our implementation, we actually train four motions in this environment: running, walking, jumping, and standing. More details are given in Section 4 and Appendix B.1.
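As an illustration of the two definitions, the toy sketch below (our own assumed code, not the paper's implementation; function names, velocity ranges, and reward-weight ranges are hypothetical) samples tasks from a uni-modal distribution of HalfCheetah-Vel target velocities and from a multi-modal distribution mixing the four motions:

```python
import random

def sample_uni_modal(rng, v_min=0.0, v_max=3.0):
    """Uni-modal: every task is 'run', differing only in target velocity."""
    return {"modality": "run", "target_vel": rng.uniform(v_min, v_max)}

def sample_multi_modal(rng):
    """Multi-modal: tasks drawn from several uni-modal distributions."""
    modality = rng.choice(["run", "walk", "jump", "stand"])
    if modality in ("run", "walk"):
        return {"modality": modality, "target_vel": rng.uniform(0.0, 3.0)}
    # Jump/stand tasks vary a reward weight instead of a target velocity.
    return {"modality": modality, "reward_weight": rng.uniform(0.5, 2.0)}

rng = random.Random(0)
tasks = [sample_multi_modal(rng) for _ in range(5)]
print(sorted({t["modality"] for t in tasks}))
```

A task sampled this way fixes the reward function R of one MDP from p(T); the state and action spaces are shared across all tasks in the distribution.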


