PLANNING IMMEDIATE LANDMARKS OF TARGETS FOR MODEL-FREE SKILL TRANSFER ACROSS AGENTS

Anonymous

Abstract

In reinforcement learning applications, agents often face diverse input/output features when their developers, or physical restrictions, specify different state and action spaces, which forces re-training from scratch and causes considerable sample inefficiency, even when the agents follow similar solution steps to achieve their tasks. In this paper, we aim to transfer pre-trained skills to alleviate this challenge. Specifically, we propose PILoT, i.e., Planning Immediate Landmarks of Targets. PILoT utilizes universal decoupled policy optimization to learn a goal-conditioned state planner; it then distills a goal planner that plans immediate landmarks in a model-free style and can be shared among different agents. In our experiments, we show the power of PILoT on various transfer challenges, including few-shot transfer across action spaces and dynamics, from low-dimensional vector states to image inputs, and from a simple robot to a complicated morphology; we also illustrate that PILoT provides a zero-shot transfer solution from a simple 2D navigation task to the harder Ant-Maze task.

1. INTRODUCTION

Figure 1: Zero-shot transfer on Ant-Maze, where the Ant agent travels from the starting point (yellow) to the desired goal (big blue star). PILoT provides planned immediate landmarks (small red points) given the temporal goal (green points) and the desired goal (small blue point), learned from a naive 2D maze task.

Recent progress in Reinforcement Learning (RL) has driven considerable advances in all kinds of decision-making challenges, such as games (Guan et al., 2022), robotics (Gu et al., 2017) and even autonomous driving (Zhou et al., 2020). However, most of these works are designed for a single task with a particular agent. Recently, researchers have developed various goal-conditioned reinforcement learning (GCRL) methods in order to obtain a generalized policy that settles a group of homogeneous tasks with different goals simultaneously (Liu et al., 2022a), but these are still limited to the same environment dynamics/reward and the same state/action space of the agent. Many existing solutions in the domain of Transfer RL (Zhu et al., 2020) or Meta RL (Yu et al., 2020) aim to transfer among different dynamics/rewards with the same agent, but care less about the knowledge shared across agents with different state/action spaces.
Many motivations and scenarios encourage us to design a transfer solution across agents: a) deployed agents face changed observation features, for instance, non-player characters (NPCs) trained and updated for incremental game scenes (Juliani et al., 2018), or robots with new sensors after hardware replacement (Bohez et al., 2017); b) agents with different morphology have to finish the same tasks (Gupta et al., 2021), such as running a complicated quadruped robot following a much simpler simulated robot (Peng et al., 2018); c) improving learning efficiency under rich and redundant observations or complicated action spaces, e.g., transferring knowledge from compact low-dimensional vector inputs to high-dimensional image features (Sun et al., 2022). Some previous works have made progress on transfer across agents on a single task. Sun et al. (2022) transferred across different observation spaces with structurally similar dynamics and the same action space by learning a shared latent-space dynamics to regularize policy training. On the other hand, Liu et al. (2022b) decouple a policy into a state planner that predicts the consecutive target state, and an inverse dynamics model that delivers the action to achieve that target state; this allows transfer across different action spaces and action dynamics, but is limited to the same state space and state transitions. In this paper, we propose a more general solution for transferring multi-task skills across agents with heterogeneous action and observation spaces, named Planning Immediate Landmarks of Targets (PILoT). Our method works under the assumption that agents share the same goal transition to finish tasks, but requires no prior knowledge of the inter-task mapping between the different state/action spaces, and the agents cannot interact with each other.
The workflow of PILoT comprises three stages: pre-training, distillation and transfer. 1) The pre-training stage extends the decoupled policy to train a universal state planner on simple tasks with universal decoupled policy optimization; 2) the distillation stage distills the knowledge of the state planner into an immediate goal planner, which is then utilized in 3) the transfer stage, where it plans immediate landmarks in a model-free style that serve as dense rewards to improve learning efficiency, or even as straightforward goal guidance. Fig. 1 provides a quick overview of our algorithm for zero-shot transfer on Ant-Maze. Correspondingly, we first train a decoupled policy on a simple 2D maze task to obtain a universal state planner, then distill the knowledge into a goal planner that predicts the immediate target goal (red points) to reach, given the desired goal (blue point) and an arbitrary starting goal (green points). Following this guidance, an Ant control policy pre-trained on free ground without walls can be deployed directly in the maze environment without any further training. As the name suggests, we provide immediate landmarks to guide various agents, like the runway centerline lights of an airport guiding a flight to take off. To examine the skill transfer ability of PILoT, we design a set of hard transfer challenges, including few-shot transfer across different action spaces and action dynamics, from low-dimensional vectors to image inputs, from simple robots to complicated morphology, and even zero-shot transfer. The experimental results demonstrate the learning efficiency of PILoT on every transfer task, outperforming various baseline methods.

2. PRELIMINARIES

Goal-Augmented Markov Decision Process. We consider the problem of goal-conditioned reinforcement learning (GCRL) as a γ-discounted infinite-horizon goal-augmented Markov decision process (GA-MDP) M = ⟨S, A, T, ρ₀, r, γ, G, p_g, ϕ⟩, where S is the set of states, A is the action space, T : S × A × S → [0, 1] is the environment dynamics function, ρ₀ : S → [0, 1] is the initial state distribution, and γ ∈ [0, 1] is the discount factor. The agent makes decisions through a policy π(a|s) : S → A and receives rewards r : S × A → R, in order to maximize its accumulated reward $R = \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$. Additionally, G denotes the goal space w.r.t. tasks, p_g represents the desired goal distribution of the environment, and ϕ : S → G is a tractable mapping function that maps a state to a specific goal. One typical challenge in GCRL is reward sparsity, where the agent is usually rewarded only once it reaches the goal:

$$r_g(s_t, a_t, g) = \mathbb{1}(\text{the goal is reached}) = \mathbb{1}\left(\|\phi(s_{t+1}) - g\| \le \epsilon\right)~. \quad (1)$$

Therefore, GCRL focuses on multi-task learning where the task variability comes only from the difference of the reward function under the same dynamics. To shape a dense reward, a straightforward idea is to utilize a distance measure d between the achieved goal and the final desired goal, i.e., $\bar{r}_g(s_t, a_t, g) = -d(\phi(s_{t+1}), g)$. However, this reshaped reward fails when the agent must first increase its distance to the goal before finally reaching it, especially when there are obstacles on the way to the target (Trott et al., 2019). In this paper, we work with a deterministic environment dynamics function T, such that s′ = T(s, a), and we allow redundant actions, i.e., transition probabilities that can be written as linear combinations of other actions'. Formally, there exist a state $s_m \in S$, an action $a_n \in A$ and a distribution p defined on $A \setminus \{a_n\}$ such that $\int_{A \setminus \{a_n\}} p(a)\, T(s'|s_m, a)\, \mathrm{d}a = T(s'|s_m, a_n)$.
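The sparse indicator reward of Eq. (1) and the naive distance-shaped alternative can be sketched as follows (a minimal illustration assuming a Euclidean distance measure d; the threshold value is arbitrary):

```python
import numpy as np

def sparse_goal_reward(phi_s_next, g, eps=0.05):
    # Eq. (1): reward 1 only when the achieved goal phi(s_{t+1})
    # lies within eps of the desired goal g, otherwise 0.
    return float(np.linalg.norm(np.asarray(phi_s_next) - np.asarray(g)) <= eps)

def dense_goal_reward(phi_s_next, g):
    # Distance-shaped reward: negative Euclidean distance to g.
    # This fails when the agent must first move away from the goal,
    # e.g. to get around an obstacle (Trott et al., 2019).
    return -float(np.linalg.norm(np.asarray(phi_s_next) - np.asarray(g)))
```

The dense variant gives gradient-like signal everywhere, but its greedy shape is exactly what misleads the agent in maze-like layouts.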
Decoupled Policy Optimization. Classical RL methods learn a state-to-action mapping policy function, whose optimality is ad-hoc to a specific task. In order to free the agent to learn a high-level planning strategy that can be used for transfer, Liu et al. (2022b) proposed Decoupled Policy Optimization (DePO), which decouples the policy into a state transition planner and an inverse dynamics model:

$$\pi(\cdot|s) = \int_{s'} h_\pi(s'|s)\, I(\cdot|s, s')\, \mathrm{d}s' = \mathbb{E}_{\hat{s}' \sim h_\pi(\hat{s}'|s)}\left[ I(\cdot|s, \hat{s}') \right]~. \quad (2)$$

To optimize the decoupled policy, DePO first optimizes the inverse dynamics model via supervised learning, and then performs the policy gradient assuming a fixed but locally accurate inverse dynamics function. DePO provides a way of planning without training an environment dynamics model. The state planner of DePO, pre-trained on simple tasks, can be further transferred to agents with various action spaces or dynamics. However, as noted below, the transfer ability of DePO is limited to the same state space and state transitions. In this paper, we aim to derive a more general skill transfer solution utilizing the common latent goal space shared among tasks and agents.

Figure 2: Comparison of the transfer assumptions of Liu et al. (2022b), Sun et al. (2022) and PILoT. (a) Liu et al. (2022b) allow transfer across action spaces but require the same state space and state transition. (b) Sun et al. (2022) transfer across different state spaces but require a shared latent state space and dynamics. (c) PILoT provides generalized transfer for both state and action spaces, but asks for a shared underlying latent goal transition.
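The decoupled structure of Eq. (2) can be sketched as below, with toy stand-ins for the planner h and the inverse dynamics I (both would be learned networks in practice; the assumed toy dynamics s′ = s + a make the "action" simply the state difference):

```python
import numpy as np

rng = np.random.default_rng(0)

def state_planner(s, noise_scale=0.1):
    # Toy h_pi(s'|s): propose a stochastic next state near s.
    return s + noise_scale * rng.standard_normal(s.shape)

def inverse_dynamics(s, s_next):
    # Toy I(a|s, s'): recover the action that moves s to s_next;
    # under the assumed dynamics s' = s + a, this is a = s' - s.
    return s_next - s

def decoupled_policy(s):
    # Eq. (2): act by first planning a target state, then inverting it.
    s_hat = state_planner(s)
    return inverse_dynamics(s, s_hat)
```

Under these toy dynamics, applying the action reproduces the planned state exactly, which mirrors the locally accurate inverse dynamics assumption DePO relies on.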

3. TRANSFER ACROSS AGENTS

In this section, we explain the problem setup of transfer across agents. Formally, we pre-train and learn knowledge from a source GA-MDP M = ⟨S, A, T, ρ₀, r, γ, G, p_g, ϕ⟩ and transfer to a target GA-MDP M̃ = ⟨S̃, Ã, T̃, ρ̃₀, r̃, γ, G, p_g, ϕ̃⟩. Here we allow significant differences between the state spaces S, S̃ and the action spaces A, Ã. Therefore, both the input and output shapes of the source policy differ entirely from those of the target one, which makes transferring shared knowledge challenging. To accomplish the objective of transfer, prior works always make assumptions about the shared structure (Fig. 2). For example, Sun et al. (2022) proposed to transfer across significantly different state spaces with the same action space and similar structure between dynamics, i.e., there exists a mapping between the source and target state spaces under which the transition dynamics is shared between the two tasks. In comparison, Liu et al. (2022b) pay attention to transferring across action spaces under the same state space and the same state transitions, i.e., there exists an action mapping between the source and target under which the transition dynamics is shared between the two tasks. In this paper, we take a more general assumption, only requiring that agents have a shared goal transition, and allow transferring across different observation and action spaces. We argue that this is a reasonable requirement in the real world: for tasks like robot navigation, different robot agents can share a global positioning system, constructed by techniques like SLAM, that allows them to quickly figure out their 3D position in the world. Formally, the assumption corresponds to:

Assumption 1. There exists a function $f: A \to \tilde{A}$ such that $\forall s, s' \in S$, $\forall a \in A$, there exist $\tilde{s}, \tilde{s}' \in \tilde{S}$, $\tilde{a} = f(a) \in \tilde{A}$ with
$$\tilde{T}(\tilde{s}'|\tilde{s}, f(a)) = T(s'|s, a)~,\quad \tilde{r}(\tilde{s}, \tilde{a}, g_t) = r(s, a, g_t)~,\quad \tilde{\phi}(\tilde{s}) = \phi(s)~,\quad \tilde{\phi}(\tilde{s}') = \phi(s')~.$$
Here ϕ is usually a many-to-one mapping, such as the achieved position or the velocity of a robot. The function f can be arbitrary: many-to-one, where several source actions map to the same target action; one-to-many; or non-surjective, where some target actions do not correspond to any source action.
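A hypothetical illustration of the shared goal space: a 2D point robot and an ant-like robot have different state dimensions, yet their mapping functions ϕ and ϕ̃ (the function names below are ours, for illustration) project both states to the same 2D position goal:

```python
import numpy as np

def phi_point(s):
    # phi: S -> G for the source (point) agent; the state already
    # starts with the (x, y) position, which serves as the goal.
    return np.asarray(s)[:2]

def phi_ant(s_tilde):
    # phi-tilde: S~ -> G for the target (ant) agent; the state is
    # (x, y, joint angles, ...), so the goal is again the first two entries.
    return np.asarray(s_tilde)[:2]
```

Under Assumption 1, transitions of the two agents that correspond under f land on the same goal transitions in this shared space.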

4. PLANNING IMMEDIATE LANDMARKS OF TARGETS

In this section, we introduce our generalized multi-task skill transfer solution, the proposed Planning Immediate Landmarks of Targets (PILoT) framework. First, we demonstrate how we derive the training procedure of a universal decoupled policy structure for multiple tasks in the pre-training stage; then, we characterize the distillation stage, which obtains a goal planner providing informative reward bonuses or zero-shot guidance for the transfer stage. An overview of the method is shown in Fig. 3, and we list the step-by-step algorithm in Algo. 1.

4.1. UNIVERSAL DECOUPLED POLICY OPTIMIZATION

In order to derive a generally transferable solution, we first extend the decoupled policy structure (Liu et al., 2022b) into a goal-conditioned form. Formally, we decouple the goal-conditioned policy π(a|s, g_t) as:

$$\pi(\cdot|s, g_t) = \int_{s'} h_\pi(s'|s, g_t)\, I(\cdot|s, s')\, \mathrm{d}s' = \mathbb{E}_{\hat{s}' \sim h_\pi(\hat{s}'|s, g_t)}\left[ I(\cdot|s, \hat{s}') \right]~, \quad (3)$$

where g_t is the target goal and h_π is a goal-conditioned state planner. Approximating the planner by neural networks (NNs), we can further apply the reparameterization trick and bypass explicitly computing the integral over s′:

$$\hat{s}' = h(\epsilon; s, g_t)~, \qquad \pi(a|s, g_t) = \mathbb{E}_{\epsilon \sim \mathcal{N}}\left[ I(a|s, h(\epsilon; s, g_t)) \right]~, \quad (4)$$

where ϵ is an input noise vector sampled from some fixed distribution, such as a Gaussian. The inverse dynamics model I should serve as a control module, known in advance, for reaching the target predicted by the planner. When it must be learned from scratch, we can minimize a divergence (for example, KL) between the inverse dynamics of a sampling policy π_B and the ϕ-parameterized function I_ϕ, i.e.,

$$\min_\phi \mathcal{L}_I = \mathbb{E}_{(s, s') \sim \pi_B}\left[ D_f\big( I^{\pi_B}(a|s, s') \,\|\, I_\phi(a|s, s') \big) \right]~. \quad (5)$$

It is worth noting that this model is only responsible and accurate for states encountered by the current policy, rather than the overall state space. As a result, the inverse dynamics model is updated each time before updating the policy. To update the decoupled policy, particularly the goal-conditioned state planner h_π, given that the inverse dynamics is an accurate local control module for the current policy and that the inverse dynamics function I is static while optimizing the policy function, we adopt the decoupled policy gradient (DePG) as derived in Liu et al. (2022b):

$$\nabla_\psi \mathcal{L}_\pi = \mathbb{E}_{(s,a) \sim \pi,\, g_t \sim p_g,\, \epsilon \sim \mathcal{N}}\left[ \frac{Q(s, a, g_t)}{\pi(a|s, g_t)} \nabla_h I(a|s, h_\psi(\epsilon; s, g_t))\, \nabla_\psi h_\psi(\epsilon; s, g_t) \right]~, \quad (6)$$

which can be seen as taking the knowledge of the inverse dynamics about the action a to optimize the planner by a prediction error $\Delta s' = \alpha \nabla_h I(a|s, h(\epsilon; s, g_t))$, where α is the learning rate.
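The reparameterization in Eq. (4) can be sketched with a toy Gaussian planner head (the linear parameterization is an illustrative assumption, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(0)

def planner_reparam(s, g_t, W_mu, log_sigma):
    # s'_hat = h(eps; s, g_t): a Gaussian head whose mean depends on
    # (s, g_t) and whose noise eps enters additively, so gradients
    # w.r.t. W_mu can flow through the sampled s'_hat.
    x = np.concatenate([s, g_t])
    mu = W_mu @ x                          # mean next-state prediction
    eps = rng.standard_normal(mu.shape)    # eps ~ N(0, I)
    return mu + np.exp(log_sigma) * eps    # reparameterized sample
```

With the sample expressed as a deterministic function of (parameters, noise), the DePG update of Eq. (6) can differentiate through the planner output.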
However, Liu et al. (2022b) further pointed out the problem of agnostic gradients: the optimization direction does not always lead to a legal state transition. To alleviate this, they proposed a calibrated decoupled policy gradient to prevent the state planner from predicting infeasible state transitions. In this paper, we turn to a simpler additional objective as a constraint: we maximize the probability of the legal transitions sampled by the current policy, which is shown to perform similarly in Liu et al. (2022b):

$$\max_\psi \mathbb{E}_{(s, s') \sim \pi}\left[ h_\psi(s'|s, g_t) \right]~. \quad (7)$$

Therefore, the overall gradient for updating the planner becomes:

$$\nabla_\psi \mathcal{L}_\pi = \mathbb{E}_{(s,a,s') \sim \pi,\, g_t \sim p_g,\, \epsilon \sim \mathcal{N}}\left[ \frac{Q(s, a, g_t)}{\pi(a|s, g_t)} \nabla_h I(a|s, h_\psi(\epsilon; s, g_t))\, \nabla_\psi h_\psi(\epsilon; s, g_t) + \lambda \nabla_\psi h_\psi(s'|s, g_t) \right]~, \quad (8)$$

where λ is the hyperparameter trading off the constraint. Note that such a decoupled learning scheme also allows incorporating various relabeling strategies to further improve sample efficiency, such as HER (Andrychowicz et al., 2017).

4.2. GOAL PLANNER DISTILLATION

In order to transfer the knowledge to new settings, we leverage the shared latent goal space and distill a goal planner from the goal-conditioned state planner, i.e., we want to predict the consecutive goal given the current goal and the target goal. Formally, we aim to obtain an ω-parameterized function f_ω(g′|g, g_t), where g′ is the next goal to achieve, g = ϕ(s) is the goal currently achieved by the agent, and g_t is the target. This can be achieved by treating the state planner h_ψ(s′|s, g_t) as the teacher and f_ω(g′|g, g_t) as the student. The distillation objective is an MLE loss:

$$\max_\omega \mathcal{L}_f = \mathbb{E}_{s \sim \mathcal{B},\, \hat{s}' \sim h_\psi,\, g_t \sim p_g}\left[ \log f_\omega(\hat{g}'|g, g_t) \right]~, \quad \text{where } g = \phi(s),\ \hat{g}' = \phi(\hat{s}')~,$$

where B is the replay buffer and ϕ is the mapping function that translates a state to a specific goal. With the distilled goal planner, we can conduct goal planning without training and querying an environment dynamics model as in Zhu et al. (2021); Chua et al. (2018).
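A minimal sketch of the distillation step, assuming a Gaussian student with fixed variance (so the MLE objective reduces to least-squares regression from (g, g_t) to the teacher's ĝ′); the linear student and the toy teacher below are illustrative assumptions:

```python
import numpy as np

def distill_goal_planner(states, goals_t, teacher_plan, phi, lr=0.05, iters=500):
    # Build the distillation dataset: map replay states and the
    # teacher's planned next states into goal space via phi.
    G = np.stack([phi(s) for s in states])                 # g = phi(s)
    Gp = np.stack([phi(teacher_plan(s, gt))                # g'_hat = phi(s'_hat)
                   for s, gt in zip(states, goals_t)])
    X = np.concatenate([G, np.stack(goals_t)], axis=1)     # student input (g, g_t)
    W = np.zeros((X.shape[1], Gp.shape[1]))
    for _ in range(iters):                                 # gradient descent on MSE
        W -= lr * X.T @ (X @ W - Gp) / len(X)
    return W                                               # student predicts (g, g_t) @ W
```

After distillation, the student proposes next goals directly in goal space, so it no longer depends on the source agent's state representation.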

4.3. TRANSFER MULTI-TASK KNOWLEDGE ACROSS AGENTS

A typical challenge for GCRL is the rather sparse reward function, and simply utilizing the Euclidean distance between the final goal and the currently achieved goal can introduce additional sub-optimality. To this end, with the goal planner distilled by PILoT providing plannable goal transitions, it is natural to construct a reward function (or bonus) from the difference between the intermediate goal to reach and the goal actually achieved. In particular, when the agent aims to go to g_t from the currently achieved goal g, we exploit the distilled planner f_ω to provide a reward as the similarity of goals:

$$r(s, a, \hat{g}') = \frac{\phi(s') \cdot \hat{g}'}{\|\phi(s')\|\,\|\hat{g}'\|}~, \quad \text{where } s' = T(s, a)~,\ \hat{g}' \sim f_\omega(\hat{g}'|g, g_t)~.$$

Note that by using the cosine distance we avoid the problem of differing scales among agents. Thereafter, we can transfer to a totally different agent. For example, we can learn a locomotion task with an easily controllable robot and then transfer the knowledge to a complex one with more joints, which is hard to learn directly from sparse rewards; or we can learn from a low-dimensional ram-based agent and then transfer to high-dimensional image inputs. To verify the effectiveness of PILoT, we design a variety of transfer settings in Section 6.
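The scale-free transfer reward can be sketched as a direct implementation of the cosine-similarity form above (the zero-norm guard is a hypothetical safety addition, not specified in the text):

```python
import numpy as np

def transfer_reward(achieved_goal, planned_goal):
    # Cosine similarity between phi(s'), the goal actually achieved,
    # and g'_hat, the landmark proposed by the distilled planner f_omega.
    # Being normalized, it is comparable across differently scaled agents.
    a = np.asarray(achieved_goal, dtype=float)
    b = np.asarray(planned_goal, dtype=float)
    den = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / den) if den > 0 else 0.0
```

The reward is maximal when the agent moves exactly along the planned landmark direction and negative when it moves away from it.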

5. RELATED WORK

Goal-conditioned RL. Our work lies within the formulation of goal-conditioned reinforcement learning (GCRL). The existence of goals, which can be interpreted as skills, tasks, or targets, makes it possible to transfer skills across various agents with different state and action spaces. In the GCRL literature, researchers focus on alleviating challenges in learning efficiency and generalization ability, e.g., from the perspective of optimization (Zhu et al., 2021). A comprehensive GCRL survey can be found in Liu et al. (2022a). In these works, the goal given to the policy function is either provided by the environment or proposed by a learned function. In comparison, our proposed UDPO algorithm learns the next target states in an end-to-end manner, and this knowledge can be further distilled into a goal planner that proposes the next target goals.

Hierarchical reinforcement learning. The UDPO framework resembles Hierarchical Reinforcement Learning (HRL) structures, where the state planner acts like a high-level policy and the inverse dynamics as the low-level policy. The typical HRL paradigm trains the high-level policy with environment rewards to predict sub-goals (also called options) that the low-level policy should achieve, and trains the low-level policy with handcrafted goal-reaching rewards to produce the actions that interact with the environment. Generally, in most works the sub-goals/options provided by the high-level policy lie in a learned latent space (Konidaris & Barto, 2007; Heess et al., 2016; Kulkarni et al., 2016; Vezhnevets et al., 2017; Zhang et al., 2022), either kept fixed for a number of timesteps (Nachum et al., 2018; Vezhnevets et al., 2017) or changed by a learned option policy (Zhang & Whiteson, 2019; Bacon et al., 2017). On the contrary, Nachum et al. (2018) and Kim et al. (2021) both predicted sub-goals in the raw form, while still training the high-level and low-level policies with separate objectives.
Nachum et al. (2018) trained the high-level policy in an off-policy manner; Kim et al. (2021), who focused on goal-conditioned HRL tasks as we do, sampled and selected specific landmarks according to certain principles and asked the high-level policy to learn to predict those landmarks. Furthermore, they only sampled a goal from the high-level policy every fixed number of steps, otherwise using a pre-defined goal transition process. Like UDPO, Li et al. (2020) optimized the two-level hierarchical policy in an end-to-end way, with a latent skill fixed for c timesteps. The main contribution of HRL works concentrates on improving learning efficiency on complicated tasks, whereas UDPO aims to obtain every next target for efficient transfer.

Transferable RL. Before our work, a few works have investigated transferable RL. In this endeavor, Srinivas et al. (2018) proposed to transfer an encoder learned in the source tasks, which maps the visual observation to a latent representation. When transferring to target tasks/agents, the latent distance from the goal image to the current observation is used to construct an obstacle-aware reward function. To transfer across tasks, Barreto et al. (2017); Borsa et al. (2018) utilized successor features based on strong assumptions about the reward formulation that decouple the information about the dynamics and the rewards into separate components, so that only the relevant module needs to be retrained when the task changes. In order to reuse the policy,

6. EXPERIMENTS

We design a set of transfer challenges to examine the skill transfer capacity of PILoT.

Implementation and baselines. We choose several classical and recent representative works as baselines, for both source and target tasks. For the source tasks, we compare the from-scratch learning performance of the proposed UDPO algorithm with: i) Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), a classical GCRL algorithm that trains a goal-conditioned policy by relabeling the target goals of a given state-action sample with goals achieved later in the same trajectory; it is also the basic strategy used to train the UDPO policy; ii) HIerarchical reinforcement learning Guided by Landmarks (HIGL) (Kim et al., 2021), an HRL algorithm that utilizes a high-level policy to propose landmark states for the low-level policy to explore. For the target tasks, where we aim to show the transfer ability of PILoT from source to target tasks, beyond HER and HIGL we also compare with a recent well-performing GCRL algorithm, Maximum Entropy Gain Exploration (MEGA) (Pitis et al., 2020), which enhances the exploration coverage of the goal space by sampling goals that maximize the entropy of past achieved goals; and a very recent algorithm, Contrastive RL (Eysenbach et al., 2022), which designs a contrastive representation learning solution for GCRL. Note that Contrastive RL requires the goal space to be the same as the state space (e.g., both are images). For each baseline algorithm, we either take the suggested hyperparameters or try our best to tune the important ones. In our transfer experiments, PILoT first trains the decoupled policy by UDPO with HER's relabeling strategy on all source tasks; then we distill the goal planner and generate dense reward signals to train a normal policy by HER on the target tasks, denoted as HER (PILoT).

6.2. RESULTS AND ANALYSIS

In the main text, we focus on the training results on source tasks and the transfer results on target tasks. In-depth analysis, ablation studies and hyperparameter choices are left to Appendix C.

Learning in the source tasks. We first train UDPO on four source tasks and show the training curves in Fig. 5. Compared with HER, which trains a normal policy, we observe that UDPO achieves similar performance and sometimes better efficiency. To compare UDPO with HRL methods, we also include HIGL on these source tasks. To our surprise, HIGL performs badly or even fails on many tasks, indicating its sensitivity to the landmark sampling strategy. In Appendix C.1 we further illustrate the MSE between the planned state and the state the agent actually reaches, with visualizations of the planned states, demonstrating that UDPO has strong planning ability under goal-conditioned challenges.

Few-shot transfer to high-dimensional action spaces. The High-Dim-Action transfer challenge requires the agent to generalize its state planner to various action spaces and action dynamics. In our design, the target task has a higher action dimension with different dynamics (see Appendix B.2 for details), making it hard to learn from scratch. As shown in Fig. 5, on the target Fetch-Reach and Fetch-Push, HER struggles much more to learn well than in the source tasks, and all GCRL, HRL and contrastive RL baselines either require many more samples or fail. For PILoT, since the source and target tasks share the same state space and state transition, we can transfer the goal-conditioned state planner and only have to train the inverse dynamics from scratch. As a result, by simply augmenting HER with the additional transfer reward, HER (PILoT) shows an impressive efficiency advantage from transferring the shared knowledge to new tasks.

Few-shot transfer from vector to image states.
The Vec-to-Image transfer challenge trains agents with high-dimensional visual observation inputs, guided by the planned goals learned from low-dimensional vector inputs. From Fig. 5, we can see that using the transfer reward of PILoT, HER (PILoT) achieves the best efficiency and final performance with far fewer samples on the two given tasks, compared with various baselines designed with complicated techniques.

Few-shot transfer to different morphology. We further test the Complex-Morph transfer challenge, which requires distilling the locomotion knowledge from a simple Point robot to a much more complex Ant robot. The learning results in Fig. 5 again indicate impressive performance of HER (PILoT), while we surprisingly find that MEGA, HIGL and Contrastive RL can hardly learn feasible solutions on this task, performing even worse than HER. On further comparison, we find that success is harder to obtain in this task because the agent must reach the goal within a very small distance (i.e., less than 0.1; see Appendix B). In addition, the reason why MEGA fails to reach a good performance as on the other tasks can be attributed to its dependence on an exploration strategy that always chooses rarely achieved goals, measured by lowest density. This helps exploration a lot when the target goals' distribution is dense, as in Ant-Maze. However, when the goals are scattered, as in Ant-Locomotion, the agent has to explore a wide range of the goal space, which may render MEGA's exploration strategy inefficient. In comparison, PILoT shows that even if the task is difficult or the target goals are hard to explore, as long as we can transfer the necessary knowledge from similar source tasks, agents can learn the skills quickly with few interactions.

Zero-shot transfer for different layouts.
Finally, since the intermediate goals provided by the goal planner are accurate enough at every step the agent encounters, we turn to an intuitive and interesting zero-shot knowledge transfer experiment for different map layouts. Specifically, we try to learn a solution on Ant-Maze, shown in Fig. 4, which is a hard-exploration task since the wall between the starting point and the target position requires the agent to first increase its distance to the goal before finally reaching it. As Fig. 6 illustrates, simply deploying HER fails. With sufficient exploration, all recent GCRL, HRL and contrastive RL baselines can learn a feasible solution, but only after a rather large number of samples. However, since PILoT provides a way of distilling the goal transition knowledge from a much simpler task, i.e., the 2D-Maze task, we take the intermediate goals as short-horizon guidance for the Ant-Locomotion policy pre-trained in Fig. 5. Note that the Ant-Locomotion policy provides the ability to reach arbitrary goals within the map. In this way, PILoT achieves zero-shot transfer performance without any sampling. This shows a promising disentanglement of goal planner and motion controller for resolving complex tasks.
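The zero-shot deployment described above amounts to a simple loop: query the distilled goal planner for the next landmark, then let the pre-trained goal-reaching controller pursue it, with no gradient step ever touching the target environment. The toy planner and controller below are illustrative stand-ins for the learned modules:

```python
import numpy as np

def zero_shot_rollout(g0, g_desired, goal_planner, goal_policy, steps=50):
    # Alternate planning and control: the planner (distilled from the
    # simple 2D maze) emits immediate landmarks toward g_desired, and
    # the pre-trained controller reaches each landmark in turn.
    g = np.asarray(g0, dtype=float)
    for _ in range(steps):
        landmark = goal_planner(g, g_desired)   # immediate landmark g'_hat
        g = goal_policy(g, landmark)            # controller achieves it
    return g

# Toy stand-ins (assumptions): the planner steps 20% toward the desired
# goal, and the controller reaches its landmark exactly.
toy_planner = lambda g, gd: g + 0.2 * (np.asarray(gd) - g)
toy_policy = lambda g, lm: lm
```

Because the planner operates purely in goal space, swapping in a different agent only requires a controller that can reach nearby goals.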

7. CONCLUSION AND LIMITATION

In this paper, we provide a general solution for skill transfer across various agents. In particular, we propose PILoT, i.e., Planning Immediate Landmarks of Targets. First, PILoT utilizes and extends a decoupled policy structure to learn a goal-conditioned state planner by universal decoupled policy optimization; then, a goal planner is distilled to plan immediate landmarks in a model-free style that can be shared among different agents. To validate our proposal, we further design various transfer challenges and show different usages of PILoT, such as few-shot transfer across different action spaces and dynamics, from low-dimensional vector states to image inputs, and from a simple robot to a complicated morphology; we also show a promising case of zero-shot transfer on the harder Ant-Maze task. However, we find that the proposed PILoT solution is mainly limited to tasks that have clear goal transitions that can be easily distilled, such as navigation tasks; on the contrary, for tasks that take the positions of objects as goals, it is much harder to transfer the knowledge since the goals remain static as long as the agents do not touch the objects. We leave those kinds of tasks as future work.

It is worth noting that the Ant robot in the commonly used Ant-Maze environment (e.g., the one used in Pitis et al. (2020); Eysenbach et al. (2022)) differs from the one in Ant-Locomotion (e.g., the one used in Zhu et al. (2021)), e.g., in the gear and ctrlrange attributes. Thus, in order to test the transfer ability, we synchronize the Ant robot across these two tasks and re-run all baseline methods on them.

B.2 ACTION DYNAMICS SETTING FOR HIGH-DIM-ACT CHALLENGE

For the transfer experiments on the High-Dim-Act challenge, we take 80% of the original gravity together with a designed complicated dynamics (different in both action space and dynamics). Particularly, given the original action space dimension m and dynamics s′ = f_s(a) on state s, the new action dimension and dynamics become n = 2m and s′ = f_s(h(a)), where h is constructed as:

$$h(a) = \left(-\exp(a[0:n/2] + 1) + \exp(a[n/2:-1])\right)/1.5~,$$

where a[i : j] selects the i-th to (j−1)-th elements of the action vector a. In other words, we transfer to a different gravity setting while doubling the action space and constructing a more complicated action dynamics for the agent to learn.
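A direct sketch of the mapping h (following the paper's slicing convention; we take the second half of the action vector as a[n/2 : n], an assumption, since pairing it with the m-dimensional first half is the natural reading of the formula):

```python
import numpy as np

def h(a, m):
    # High-Dim-Act mapping: target actions have dimension n = 2m and are
    # folded back into an m-dimensional control via the two exponentials.
    a = np.asarray(a, dtype=float)
    n = 2 * m
    first, second = a[0:n // 2], a[n // 2:n]
    return (-np.exp(first + 1) + np.exp(second)) / 1.5
```

The non-linear, redundant combination of the two halves is what makes the target action dynamics harder than the source ones.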

B.3 IMPLEMENTATION DETAILS

The implementations of PILoT and HER are based on an open-source PyTorch framework. For the compared baselines, we take their official implementations, use their default hyperparameters and try our best to tune the important ones:

• MEGA (Pitis et al., 2020): https://github.com/spitis/mrl

For resolving image-based tasks, we learn an encoder that is shared between the policy and the critic. In particular, we use the same encoder structure for HER, MEGA and HIGL. The encoder has four convolution layers, all with 3 × 3 kernels and 32 output channels. The stride of the first layer is 2 and the stride of the other layers is 1, as shown in Fig. 7. We adopt ReLU as the activation function in all layers. After the convolutions, a fully connected layer with 50 hidden units and a layer-norm layer produce the output of the encoder. During training, only the gradients from the Q network are used to update the encoder. For Contrastive RL, we take its default structures for image-based tasks.
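The feature-map sizes of this encoder follow standard convolution arithmetic; a quick check (assuming no padding and an illustrative 84 × 84 input, which the text does not specify):

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    # Output spatial size of a convolution: floor((size + 2p - k)/s) + 1.
    return (size + 2 * padding - kernel) // stride + 1

def encoder_feature_size(in_size):
    # Four 3x3 convs with 32 channels: one stride-2 layer, then three stride-1.
    s = conv_out(in_size, stride=2)
    for _ in range(3):
        s = conv_out(s, stride=1)
    return s
```

For an 84 × 84 image the spatial size shrinks 84 → 41 → 39 → 37 → 35, so the flattened feature fed to the 50-unit fully connected layer has 32 × 35 × 35 entries.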

B.4 IMPORTANT HYPERPARAMETERS

We list the key hyperparameters for the best performance of HER in Tab. 2 and of UDPO in Tab. 3 on each source task. For each task, we first tune HER to achieve its best performance, based on which we further slightly adjust UDPO's additional hyperparameters. For HER, we tune the replay buffer size ∈ {1e5, 1e6}, the batch size ∈ {128, 2048, 4096}, and the policy π learning rate ∈ {3e-4, 1e-3}. For UDPO, we only tune two hyperparameters: the state planner coefficient λ ∈ {1e-3, 5e-3, 1e-2, 5e-2, 1e-1} and the inverse dynamics I learning interval ∆ ∈ {200, 500, 1500, 2000}. As further shown in Section C.2, these choices slightly affect the success rate but can impose considerable influence on the accuracy of the state planner. A larger λ leads to a stronger constraint on the accuracy of the state planner, but can hurt exploration. On the other hand, ∆ controls the training stability, and a larger ∆ assumes that the local inverse dynamics does not change for a longer time. Therefore, in principle, for tasks where exploration is much more difficult, we tend to choose a small λ; for tasks where the algorithm learns fast, so that the local inverse dynamics changes drastically, we should use a small ∆. By default, we choose λ = 1e-2 and ∆ = 1500.

We also list the hyperparameters of HER (PILoT) in Tab. 4 on each target task, which are the same as those of the baseline HER algorithm (except the additional transferring bonus rate). By default, we set all transferring bonus rates to 1.0 and find that this reaches the desired performance. In Section C.2, we also include an ablation on the choice of this hyperparameter.

In the universal decoupled policy structure, the state planner is decoupled and trained to predict the future plans that the agent is required to reach. Therefore, for distilling and transferring, it is critical to understand the plans given by the planner and to make sure they are accurate enough that the agent can reach where it plans to go.
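The tuned defaults above can be summarized as a small config sketch; the key names are illustrative, not taken from the released code.

```python
# Default PILoT hyperparameters reported above; key names are illustrative.
UDPO_DEFAULTS = {
    "state_planner_coefficient": 1e-2,  # lambda: planner accuracy vs. exploration
    "inverse_dynamics_interval": 1500,  # delta: controls training stability
}

TRANSFER_DEFAULTS = {
    "transferring_bonus_rate": 1.0,     # beta: weight of the planner bonus
}
```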
To this end, we analyze the distance between the reached states and the predicted consecutive states, and plot the mean square error (MSE) along the RL learning procedure in Fig. 8. To our delight, as training goes on, the gap between the planned states and the achieved states becomes smaller, indicating the increasing accuracy of the state planner.
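The metric plotted in Fig. 8 can be computed as follows; the function name and array layout are illustrative.

```python
import numpy as np

def planner_mse(planned_states, reached_states):
    """Mean squared error between the states the planner predicted and the
    states the agent actually reached (the metric plotted in Fig. 8).
    Both arrays have shape (T, state_dim); names are illustrative."""
    planned = np.asarray(planned_states, dtype=float)
    reached = np.asarray(reached_states, dtype=float)
    return float(np.mean((planned - reached) ** 2))
```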



This submission does not violate any ethical concerns and adheres to and acknowledges the ICLR Code of Ethics.

¹https://github.com/Ericonaldo/ILSwiss



Figure 2: Comparison of the assumptions for transferring across agents in previous works (Liu et al., 2022b; Sun et al., 2022) and PILoT. (a) Liu et al. (2022b) allows for transferring across action spaces but requires the same state space and state transition. (b) Sun et al. (2022) transfers across different state spaces but requires a shared latent state space and dynamics. (c) PILoT provides a generalized transferring ability for both state and action spaces, but asks for a shared underlying latent goal transition.

Figure 3: Overview of the PILoT framework with universal decoupled policy optimization. PILoT leverages a transferring-by-pre-training process, including the stages of pre-training, distillation and transferring.

Figure 4: Illustration of tested environments.

Figure 5: Training curves on four source tasks and four target tasks. High-Dim-Action: few-shot transferring to a high-dimensional action space. Vec-to-Image: few-shot transferring from vector to image states. Complex-Morph: few-shot transferring to a different morphology. UDPO denotes the learning algorithm proposed in Section 4.1. HER (PILoT) denotes HER with the transferring rewards provided by the proposed PILoT solution.

Figure 6: Training and zero-shot transferring curves on Ant-Maze task across 10 seeds.

Figure 7: The encoder network architecture.

Trott et al. (2019); Ghosh et al. (2021); Zhu et al. (2021), generating or selecting sub-goals (Florensa et al., 2018; Pitis et al., 2020), and relabeling (Andrychowicz et al., 2017);

They transfer the low-level controller on the target tasks while retraining the high-level one. On the other hand, Liu et al. (2022b) decoupled the policy into a state planner and an inverse dynamics model, and showed that the high-level state planner can be transferred to agents with different action spaces. For generalizing and transferring across modular robots' morphologies, Gupta et al. (2017) tried learning invariant visual features. Wang et al. (2018) and Huang et al.

Table 2: Hyperparameters of HER on the source tasks.

Table 3: Hyperparameters of UDPO on the source tasks.

Table 4: Hyperparameters of HER / HER (PILoT) on the target tasks.

9. REPRODUCIBILITY STATEMENT

All experiments included in this paper are conducted over 5 random seeds for stability and reliability. The algorithm outline is included in Appendix A, the hyperparameters are included in Appendix B.4, and the implementation details are included in Appendix B.3. We promise to release our code to the public after publication.

Appendices

A ALGORITHM OUTLINE

Algorithm 1 Planning Immediate Landmarks of Targets (PILoT)

Source Task Input: Empty replay buffer B_s, state planner h_ψ, inverse dynamics model I_ϕ and goal planner f_ω.
Target Task Input: Empty replay buffer B_t, goal planner f_ω, policy π_θ.
▷ Pre-training stage: train UDPO on the source tasks.
for each iteration do
    Collect trajectories {(s, a, s′, g_t, r, done)} using the current policy π = E_{ϵ∼N}[I_ϕ(a|s, h_ψ(ϵ; s))] and store them in B_s
    Update ω by ∇_ω L_f (Eq. (9))
end for
▷ Transfer stage: train HER (PILoT) on the target tasks.
for each iteration do
    Collect trajectories {(s, a, s′, g_t, r, done)} using the current policy π_θ
    Supplement the reward with an additional bonus following Eq. (10)
    Learn π_θ by HER
end for
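The decoupled policy used in the pre-training stage composes a state planner with an inverse dynamics model, π = E_{ϵ∼N}[I_ϕ(a|s, h_ψ(ϵ; s))]. A minimal NumPy sketch of this composition follows; the linear parameterizations standing in for the learned networks are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, NOISE_DIM = 4, 2, 3

# Illustrative linear stand-ins for the learned networks.
W_plan = rng.normal(size=(STATE_DIM, STATE_DIM + NOISE_DIM))
W_inv = rng.normal(size=(ACTION_DIM, 2 * STATE_DIM))

def state_planner(state, noise):
    """h_psi(eps; s): propose the next landmark state to reach."""
    return W_plan @ np.concatenate([state, noise])

def inverse_dynamics(state, planned_next):
    """I_phi(a | s, s'): the action that moves from s toward s'."""
    return np.tanh(W_inv @ np.concatenate([state, planned_next]))

def decoupled_policy(state):
    """pi = E_eps[ I_phi(a | s, h_psi(eps; s)) ], one-sample estimate."""
    noise = rng.normal(size=NOISE_DIM)
    return inverse_dynamics(state, state_planner(state, noise))
```

Because only the state-planner output crosses between the two modules, the high-level planner can be reused while the low-level inverse dynamics is replaced per agent.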

B EXPERIMENT SETTINGS B.1 ENVIRONMENTS

We list important features of the tested environments in Tab. 1. Note that the goal-reaching distance is rather important in deciding the difficulty of the tasks, so we carefully choose them to match most of the current works.

Additionally, we also visualize the imagined rollouts of the state planner on the source tasks, which are generated by consecutively taking a predicted state as the new input. We compare them with the real rollouts in Fig. 9, showing that the state planner can conduct reasonable multi-step plans. On the target tasks, we visualize the subgoals proposed by the distilled goal planner and the real rollout achieved during the interaction in Fig. 10, showing the effective and explainable guidance from the goal planner.

C.2 ABLATION STUDIES

Pre-training ablation on the inverse dynamics training frequency ∆. We first conduct ablation studies on the inverse dynamics training frequency ∆. In particular, this hyperparameter determines how often we train the low-level inverse dynamics, and for how long we regard the inverse dynamics as static when training the high-level state planner. Intuitively, a larger ∆ assumes that the local inverse dynamics does not change for a longer time; on the contrary, a small ∆ should be used when the local inverse dynamics changes drastically. In our experiments, we find that ∆ mainly affects the stability of training.

Pre-training ablation on the regularization coefficient λ. The regularization coefficient λ is another critical hyperparameter in UDPO training. The choice of λ balances the policy gradient term and the constraint term in the state planner updates. Particularly, a larger λ puts more weight on the supervised penalty objective, which reduces the planning of infeasible next states; however, this can hurt the exploration ability offered by the policy gradient objective. The results in Fig. 12 support this intuition. In both environments, the models trained with λ = 0.1 perform best in reaching where they plan, but cannot finish the goal-reaching task well.
On the other hand, an extremely small λ results in a quite large gap between reached states and planned states. Such a gap can make the subsequent transferring impossible. Therefore, the recipe is to find a medium λ that achieves a competitive success rate while keeping the prediction MSE from exploding.

Transferring ablation on the bonus ratio β. In the transferring stage, the bonus ratio β balances the similarity reward from the distilled planner and the reward from the environment. We observe from Fig. 13 that 1.0 is an appropriate choice, at least for the tasks tested in this paper. When β is smaller (e.g., 0.1, 0.2, 0.5), the success rate converges more slowly and the final performance is also worse. This indicates that a small β cannot provide strong enough signals for the policy to follow the planned landmarks, leading to decreased transfer efficiency. On the other hand, when a much larger β (e.g., 2.0, 5.0) is adopted, the performance becomes even worse than with the insufficient guidance of a small β. Also, in the Fetch-Reach-Image environment, the training curves are quite unstable. Although the planned landmarks are useful in overcoming the sparse reward issue, they cannot completely replace the final sparse reward. As Fig. 8 shows, even in the source environments there exist small but non-negligible errors between planned states and achieved states. Using a large β can let the policy be misled by the goal planner and overfit to those errors.
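The bonus combination used in the transfer stage can be sketched as follows. Since Eq. (10) is not reproduced here, the negative-distance form of the similarity bonus is an assumption for illustration.

```python
import numpy as np

def transfer_reward(env_reward, achieved_goal, planned_landmark, beta=1.0):
    """Supplement the environment reward with a planner-following bonus,
    weighted by the bonus ratio beta (default 1.0, as tuned above).
    The negative-distance form of the bonus is an illustrative assumption;
    the exact form is given by Eq. (10) in the paper."""
    diff = np.asarray(achieved_goal, dtype=float) - np.asarray(planned_landmark, dtype=float)
    bonus = -np.linalg.norm(diff)
    return env_reward + beta * bonus
```

A β that is too small makes the bonus negligible next to the sparse environment reward, while a β that is too large lets planner errors dominate, matching the ablation results above.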

C.3 MORE ZERO-SHOT TRANSFER VISUALIZATION

In this section, we illustrate more zero-shot transfer cases, including both success and failure cases. In fact, the failures should be attributed to the inaccuracy of the controller (policy), since the goal planner always gives the right way to success. If we could train a more accurate local controller, the success rate could no doubt be further improved.

