PLANNING IMMEDIATE LANDMARKS OF TARGETS FOR MODEL-FREE SKILL TRANSFER ACROSS AGENTS

Anonymous

Abstract

In reinforcement learning applications, agents often face diverse input/output features because their developers or physical restrictions specify different state and action spaces; this forces re-training from scratch and causes considerable sample inefficiency, especially when the agents follow similar solution steps to achieve their tasks. In this paper, we aim to transfer pre-trained skills to alleviate this challenge. Specifically, we propose PILoT, i.e., Planning Immediate Landmarks of Targets. PILoT utilizes universal decoupled policy optimization to learn a goal-conditioned state planner; it then distills a goal planner that plans immediate landmarks in a model-free style and can be shared among different agents. In our experiments, we show the power of PILoT on various transfer challenges, including few-shot transfer across action spaces and dynamics, from low-dimensional vector states to image inputs, and from a simple robot to a complicated morphology; we also illustrate that PILoT provides a zero-shot transfer solution from a simple 2D navigation task to the harder Ant-Maze task.

1. INTRODUCTION

Figure 1: Zero-shot transfer on Ant-Maze, where the Ant agent travels from the start (yellow point) to the desired goal (big blue star). PILoT provides planned immediate landmarks (small red points) given the temporal goal (green points) and the desired goal (small blue point), learned from a naive 2D maze task.

Recent progress in Reinforcement Learning (RL) has driven considerable advances on a variety of decision-making challenges, such as games (Guan et al., 2022), robotics (Gu et al., 2017) and even autonomous driving (Zhou et al., 2020). However, most of these works are designed for a single task with a particular agent. Recently, researchers have developed various goal-conditioned reinforcement learning (GCRL) methods in order to obtain a generalized policy that settles a group of homogeneous tasks with different goals simultaneously (Liu et al., 2022a), but these methods are still limited to the same environment dynamics/reward and the same state/action space of the agent. Many existing solutions in the domains of Transfer RL (Zhu et al., 2020) and Meta RL (Yu et al., 2020) aim to transfer among different dynamics/rewards with the same agent, but care less about the knowledge shared across agents with different state/action spaces.
There are many motivations and scenarios encouraging us to design a transfer solution among agents: a) deployed agents face changed observation features, for instance, non-player characters (NPCs) trained and updated for incremental scenes of games (Juliani et al., 2018), or robots with new sensors due to hardware replacement (Bohez et al., 2017); b) agents with different morphologies have to finish the same tasks (Gupta et al., 2021), such as running a complicated quadruped robot by following a much simpler simulated robot (Peng et al., 2018); c) improving learning efficiency with rich and redundant observations or complicated action spaces, like transferring knowledge from compact low-dimensional vector inputs to high-dimensional image features (Sun et al., 2022). Some previous works have made progress on transfer across agents on a single task. Sun et al. (2022) transferred across different observation spaces with structurally similar dynamics and the same action space by learning a shared latent-space dynamics model to regularize policy training. On the other hand, Liu et al. (2022b) decouple a policy into a state planner that predicts the consecutive target state and an inverse dynamics model that delivers the action to achieve that target state, which allows transferring across different action spaces and action dynamics, but is limited to the same state space and state transitions. In this paper, we propose a more general solution for transferring multi-task skills across agents with heterogeneous action and observation spaces, named Planning Immediate Landmarks of Targets (PILoT). Our method works under the assumption that agents share the same goal transitions to finish tasks, but requires no prior knowledge of the inter-task mapping between the different state/action spaces, and the agents cannot interact with each other.
The workflow of PILoT is composed of three stages: pre-training, distillation and transfer. 1) The pre-training stage extends the decoupled policy to train a universal state planner on simple tasks with universal decoupled policy optimization; 2) the distillation stage distills the knowledge of the state planner into an immediate goal planner; 3) the transfer stage then uses this goal planner to plan immediate landmarks in a model-free style, which serve as dense rewards to improve learning efficiency or even as straightforward goal guidance. Fig. 1 provides a quick overview of our algorithm for zero-shot transfer on Ant-Maze. Correspondingly, we first train a decoupled policy on a simple 2D maze task to obtain a universal state planner, then distill the knowledge into a goal planner that predicts the immediate target goal (red points) to reach given the desired goal (blue point) and an arbitrary starting goal (green points). Following this guidance, the Ant control policy, pre-trained on free ground without walls, can be directly deployed in the maze environment without any further training. As the name suggests, we provide immediate landmarks to guide various agents, much like the runway centerline lights at an airport guide a flight during take-off. To examine the skill-transfer ability of PILoT, we design a set of hard transfer challenges, including few-shot transfer across different action spaces and action dynamics, from low-dimensional vectors to image inputs, from simple robots to complicated morphologies, and even zero-shot transfer. The experimental results demonstrate the learning efficiency of PILoT on every task, outperforming various baseline methods.
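To make the transfer stage concrete, the sketch below shows how a distilled goal planner can turn a sparse goal-reaching problem into a densely rewarded one. The learned planner is replaced here by a hypothetical toy version that simply steps toward the desired goal in free 2D space (the function names, the fixed step size, and the Euclidean distance are our own illustrative assumptions, not PILoT's actual implementation):

```python
import numpy as np

def toy_goal_planner(achieved_goal, desired_goal, step=0.5):
    """Stand-in for PILoT's distilled goal planner: propose the next
    immediate landmark between the currently achieved goal and the
    desired goal. In PILoT this is a learned, model-free network."""
    direction = desired_goal - achieved_goal
    dist = np.linalg.norm(direction)
    if dist <= step:
        return desired_goal
    return achieved_goal + step * direction / dist

def landmark_reward(achieved_next, landmark):
    """Dense reward for the downstream agent: negative distance to the
    planned landmark rather than to the far-away desired goal."""
    return -float(np.linalg.norm(achieved_next - landmark))
```

A downstream agent would query the planner every step, chasing each landmark in turn; once a landmark is reached the planner proposes the next one, until the desired goal itself is returned.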

2. PRELIMINARIES

Goal-Augmented Markov Decision Process. We consider the problem of goal-conditioned reinforcement learning (GCRL) as a γ-discounted infinite-horizon goal-augmented Markov decision process (GA-MDP) M = ⟨S, A, T, ρ_0, r, γ, G, p_g, ϕ⟩, where S is the set of states, A is the action space, T : S × A × S → [0, 1] is the environment dynamics function, ρ_0 : S → [0, 1] is the initial state distribution, and γ ∈ [0, 1] is the discount factor. The agent makes decisions through a policy π(a|s) and receives rewards r : S × A → ℝ, in order to maximize its accumulated reward R = Σ_{t=0}^∞ γ^t r(s_t, a_t). Additionally, G denotes the goal space w.r.t. tasks, p_g represents the desired goal distribution of the environment, and ϕ : S → G is a tractable mapping function that maps the state to a specific goal. One typical challenge in GCRL is reward sparsity, where usually the agent is only rewarded once it reaches the goal:

r_g(s_t, a_t, g) = 1(the goal is reached) = 1(∥ϕ(s_{t+1}) − g∥ ≤ ϵ).  (1)

Therefore, GCRL focuses on multi-task learning where the task variability comes only from the difference in the reward function under the same dynamics. To shape a dense reward, a straightforward idea is to utilize a distance measure d between the achieved goal and the final desired goal, i.e., r̃_g(s_t, a_t, g) = −d(ϕ(s_{t+1}), g). However, this reshaped reward fails when the agent must first increase its distance to the goal before finally reaching it, especially when there are obstacles on the way to the target (Trott et al., 2019). In this paper, we work with a deterministic environment dynamics function T, such that s′ = T(s, a), and we allow redundant actions, i.e., an action's transition probabilities can be written as a linear combination of those of other actions. Formally, there exist a state s_m ∈ S, an action a_n ∈ A and a distribution p defined on A \ {a_n} such that ∫_{A\{a_n}} p(a) T(s′|s_m, a) da = T(s′|s_m, a_n).
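The two goal rewards discussed above can be written in a few lines of Python; here the Euclidean norm stands in for the distance measure d, and the threshold ϵ and function names are our own illustrative choices:

```python
import numpy as np

def sparse_goal_reward(phi_next, g, eps=0.05):
    """Eq. (1): indicator reward, 1 only when the achieved goal
    phi(s_{t+1}) lies within eps of the desired goal g."""
    return float(np.linalg.norm(phi_next - g) <= eps)

def shaped_goal_reward(phi_next, g):
    """Distance-shaped reward -d(phi(s_{t+1}), g). Denser learning
    signal, but it can mislead the agent when obstacles require
    first moving away from g (Trott et al., 2019)."""
    return -float(np.linalg.norm(phi_next - g))
```

Note that the shaped variant is strictly monotone in the distance to g, which is exactly why it breaks down in maze-like layouts: every detour around a wall is penalized even when it is the only path to the goal.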

