PLANNING IMMEDIATE LANDMARKS OF TARGETS FOR MODEL-FREE SKILL TRANSFER ACROSS AGENTS

Anonymous

Abstract

In reinforcement learning applications, agents often face varied input/output features because their state and action spaces are specified differently by their developers or constrained by physical hardware. As a result, policies must be re-trained from scratch with considerable sample inefficiency, even when agents follow similar solution steps to achieve their tasks. In this paper, we aim to transfer pre-trained skills to alleviate this challenge. Specifically, we propose PILoT, i.e., Planning Immediate Landmarks of Targets. PILoT utilizes universal decoupled policy optimization to learn a goal-conditioned state planner; we then distill a goal planner that plans immediate landmarks in a model-free style and can be shared among different agents. In our experiments, we show the power of PILoT on various transfer challenges, including few-shot transfer across action spaces and dynamics, from low-dimensional vector states to image inputs, and from a simple robot to a complicated morphology; we also illustrate that PILoT provides a zero-shot transfer solution from a simple 2D navigation task to the harder Ant-Maze task.

1. INTRODUCTION

Figure 1: Zero-shot transfer on Ant-Maze, where the Ant agent starts from the yellow point and moves toward the desired goal (big blue star). PILoT provides planned immediate landmarks (small red points) given the temporal goal (green points) and the desired goal (small blue point), learned from a naive 2D maze task.

Recent progress in Reinforcement Learning (RL) has driven considerable advances in various decision-making challenges, such as games (Guan et al., 2022), robotics (Gu et al., 2017) and even autonomous driving (Zhou et al., 2020). However, most of these works are designed for a single task with a particular agent. Recently, researchers have developed various goal-conditioned reinforcement learning (GCRL) methods in order to obtain a generalized policy that solves a group of homogeneous tasks with different goals simultaneously (Liu et al., 2022a), but these are still limited to the same environment dynamics/reward and the same state/action space of the agent. Many existing solutions in the domain of Transfer RL (Zhu et al., 2020) or Meta RL (Yu et al., 2020) aim to transfer across different dynamics/rewards with the same agent, but pay less attention to the knowledge shared across agents with different state/action spaces.
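The landmark-planning idea above can be illustrated with a minimal sketch. Here, `plan_landmarks` repeatedly queries a goal-conditioned planner for the next intermediate target, and `toy_planner` is a hypothetical stand-in for the learned state planner (it simply moves a fixed fraction toward the goal each step); both names and the stopping threshold are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def plan_landmarks(state, goal, planner, n_steps=5):
    """Chain a goal-conditioned planner to produce immediate landmarks.

    `planner(state, goal)` is assumed to return the next landmark state;
    iterating it yields a sequence of intermediate targets that a
    low-level, agent-specific policy could then track.
    """
    landmarks = []
    current = np.asarray(state, dtype=float)
    goal = np.asarray(goal, dtype=float)
    for _ in range(n_steps):
        current = planner(current, goal)
        landmarks.append(current)
        # Stop early once the landmark is effectively at the goal.
        if np.linalg.norm(current - goal) < 1e-3:
            break
    return landmarks

# Toy stand-in for a learned state planner on a 2D maze:
# each call moves halfway from the current state toward the goal.
def toy_planner(state, goal, step=0.5):
    return state + step * (goal - state)

path = plan_landmarks([0.0, 0.0], [1.0, 1.0], toy_planner, n_steps=4)
```

Because the landmarks live in (a shared abstraction of) the state space rather than any agent's action space, the same planned sequence can in principle guide agents with different morphologies or actuators.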
There are many motivations and scenarios encouraging us to design a transfer solution among agents: a) deployed agents facing changed observation features, for instance, non-player characters (NPCs) trained and updated for incremental game scenes (Juliani et al., 2018), or robots with new sensors after hardware replacement (Bohez et al., 2017); b) agents with different morphologies that have to finish the same tasks (Gupta et al., 2021), such as running a complicated quadruped robot by following a much simpler simulated robot (Peng et al., 2018); c) improving learning efficiency with rich and redundant observations or complicated action spaces, such as transferring knowledge from compact low-dimensional vector inputs to high-dimensional image features (Sun et al., 2022).

