PRIORS, HIERARCHY, AND INFORMATION ASYMMETRY FOR SKILL TRANSFER IN REINFORCEMENT LEARNING

Abstract

The ability to discover behaviours from past experience and transfer them to new tasks is a hallmark of intelligent agents acting sample-efficiently in the real world. Equipping embodied reinforcement learners with the same ability may be crucial for their successful deployment in robotics. While hierarchical and KL-regularized reinforcement learning individually hold promise here, arguably a hybrid approach could combine their respective benefits. Key to these fields is the use of information asymmetry across architectural modules to bias which skills are learnt. While asymmetry choice has a large influence on transferability, existing methods base their choice primarily on intuition in a domain-independent, potentially suboptimal, manner. In this paper, we theoretically and empirically show the crucial expressivity-transferability trade-off of skills across sequential tasks, controlled by information asymmetry. Given this insight, we introduce Attentive Priors for Expressive and Transferable Skills (APES), a hierarchical KL-regularized method that benefits heavily from both priors and hierarchy. Unlike existing approaches, APES automates the choice of asymmetry by learning it in a data-driven, domain-dependent way based on our expressivity-transferability theorems. Experiments over complex transfer domains of varying levels of extrapolation and sparsity, such as robot block stacking, demonstrate the criticality of the correct asymmetric choice, with APES drastically outperforming previous methods.

1. INTRODUCTION

While Reinforcement Learning (RL) algorithms recently achieved impressive feats across a range of domains (Silver et al., 2017; Mnih et al., 2015; Lillicrap et al., 2015) , they remain sample inefficient (Abdolmaleki et al., 2018; Haarnoja et al., 2018b) and are therefore of limited use for real-world robotics applications. Intelligent agents during their lifetime discover and reuse skills at multiple levels of behavioural and temporal abstraction to efficiently tackle new situations. For example, in manipulation domains, beneficial abstractions could include low-level instantaneous motor primitives as well as higher-level object manipulation strategies. Endowing lifelong learning RL agents (Parisi et al., 2019) with a similar ability could be vital towards attaining comparable sample efficiency. To this end, two paradigms have recently been introduced. KL-regularized RL (Teh et al., 2017; Galashov et al., 2019) presents an intuitive approach for automating skill reuse in multi-task learning. By regularizing policy behaviour against a learnt task-agnostic prior, common behaviours across tasks are distilled into the prior, which encourages their reuse. Concurrently, hierarchical RL also enables skill discovery (Wulfmeier et al., 2019; Merel et al., 2020; Hausman et al., 2018; Haarnoja et al., 2018a; Wulfmeier et al., 2020) by considering a two-level hierarchy in which the high-level policy is task-conditioned, whilst the low-level remains task-agnostic. The lower level of the hierarchy therefore also discovers skills that are transferable across tasks. Both hierarchy and priors offer their own skill abstraction. However, when combined, hierarchical KL-regularized RL can discover multiple abstractions. Whilst prior methods attempted this (Tirumala et al., 2019; 2020; Liu et al., 2022; Goyal et al., 2019) , the transfer benefits from learning both abstractions varied drastically, with approaches like Tirumala et al. (2019) unable to yield performance gains. 
In fact, successful transfer of the aforementioned hierarchical and KL-regularized methods critically depends on the correct choice of information asymmetry (IA). IA more generally refers to an asymmetric masking of information across architectural modules. This masking forces independence of, and ideally generalisation across, the masked dimensions (Galashov et al., 2019). For example, for self-driving cars, conditioning the prior only on proprioceptive information lets it discover skills that are independent of, and shared across, global coordinate frames. For manipulation, by not conditioning on certain object information, such as shape or weight, the robot learns generalisable grasping independent of these factors. Therefore, IA biases learnt behaviours and how they transfer across environments. Previous works predefined their IAs, which were primarily chosen on intuition and independently of domain. In addition, previously explored asymmetries were narrow (Table 1), which, if sub-optimal, limits transfer benefits. We demonstrate that this is indeed the case for many methods on our domains (Galashov et al., 2019; Bagatella et al., 2022; Tirumala et al., 2019; 2020; Pertsch et al., 2021; Wulfmeier et al., 2019). A more systematic, theoretically grounded, data-driven and domain-dependent approach for choosing IA is thus required to maximally benefit from skills for transfer learning.
In this paper, we employ hierarchical KL-regularized RL to effectively transfer skills across sequential tasks. We begin by theoretically and empirically showing the crucial expressivity-transferability trade-off, controlled by choice of IA, of skills across sequential tasks for hierarchical KL-regularized RL. Our expressivity-transferability theorems state that conditioning skill modules on too little or too much information, such as the current observation or the entire history, can both be detrimental for transfer, due to the discovery of skills that are either too general (e.g. motor primitives) or too specialised (e.g. non-transferable task-level skills). We demonstrate this by ablating over a wide range of asymmetries between the hierarchical policy and prior. We show the inefficiencies of previous methods that choose highly sub-optimal IAs for our domains, drastically limiting transfer performance. Given this insight, we introduce APES, 'Attentive Priors for Expressive and Transferable Skills', a method that forgoes user intuition and automates the choice of IA in a data-driven, domain-dependent manner. APES builds on our expressivity-transferability theorems to learn the choice of asymmetry between policy and prior. Specifically, APES conditions the prior on the entire history, allowing expressive skills to be discovered, and learns a low-entropy attention mask over the input, paying attention only where necessary, to minimise covariate shift and improve transferability across domains. Experiments over domains of varying levels of sparsity and extrapolation, including a complex robot block stacking domain, demonstrate APES' consistently superior performance over existing methods, whilst automating IA choice and bypassing arduous IA sweeps. Further ablations show the importance of combining hierarchy and priors for discovering expressive multi-modal behaviours.

2. SKILL TRANSFER IN REINFORCEMENT LEARNING

We consider multi-task reinforcement learning in Partially Observable Markov Decision Processes (POMDPs), defined by $M_k = (\mathcal{S}, \mathcal{X}, \mathcal{A}, r_k, p, p^0_k, \gamma)$, with tasks $k$ sampled from $p(K)$. $\mathcal{S}$, $\mathcal{A}$ and $\mathcal{X}$ denote the observation, action and history spaces. $p(x'|x, a) : \mathcal{X} \times \mathcal{X} \times \mathcal{A} \rightarrow \mathbb{R}_{\geq 0}$ is the dynamics model. We denote the history of observations $s \in \mathcal{S}$ and actions $a \in \mathcal{A}$ up to timestep $t$ as $x_t = (s_0, a_0, s_1, a_1, \ldots, s_t)$. The reward function $r_k : \mathcal{X} \times \mathcal{A} \times K \rightarrow \mathbb{R}$ is history-, action- and task-dependent.

2.1. HIERARCHICAL KL-REGULARIZED REINFORCEMENT LEARNING

The typical multi-task KL-regularized RL objective (Todorov, 2007; Kappen et al., 2012; Rawlik et al., 2012; Schulman et al., 2017) takes the form:
$$\mathcal{L}(\pi, \pi_0) = \mathbb{E}_{\tau \sim p_\pi(\tau),\, k \sim p(K)} \left[ \sum_{t=0}^{\infty} \gamma^t \Big( r_k(x_t, a_t) - \alpha_0 D_{KL}\big(\pi(a|x_t, k) \,\|\, \pi_0(a|x_t)\big) \Big) \right] \tag{1}$$
where $\gamma$ is the discount factor and $\alpha_0$ weighs the individual objective terms. $\pi$ and $\pi_0$ denote the task-conditioned policy and task-agnostic prior respectively. The expectation is taken over tasks and over trajectories $\tau$ generated by policy $\pi$ from the initial observation distribution $p^0_k(s_0)$ (i.e. $p_\pi(\tau)$). Summation over $t$ occurs across all episodic timesteps. When optimised with respect to $\pi$, this objective can be viewed as a trade-off between maximising rewards and remaining close to trajectories produced by $\pi_0$. When $\pi_0$ is learnt, it can learn shared behaviours across tasks and bias multi-task exploration (Teh et al., 2017). We consider the sequential learning paradigm, where skills are learnt from past tasks, $p_{source}(K)$, and leveraged while attempting the transfer set of tasks, $p_{trans}(K)$. While KL-regularized RL has achieved success across various settings (Abdolmaleki et al., 2018; Teh et al., 2017; Pertsch et al., 2020; Haarnoja et al., 2018a), recently Tirumala et al. (2019) proposed a hierarchical extension where policy $\pi$ and prior $\pi_0$ are augmented with latent variables, $\pi(a, z|x, k) = \pi^H(z|x, k)\,\pi^L(a|z, x)$ and $\pi_0(a, z|x) = \pi^H_0(z|x)\,\pi^L_0(a|z, x)$, where $H$ and $L$ denote the higher and lower hierarchical levels. This structure encourages the shared low-level policy ($\pi^L = \pi^L_0$) to discover task-agnostic behavioural primitives, whilst the high level discovers higher-level task-relevant skills. By not conditioning the high-level prior on task-id, Tirumala et al. (2019) encourage the reuse of common high-level abstractions across tasks.
They also propose the following upper bound for approximating the KL-divergence between hierarchical policy and prior:
$$D_{KL}\big(\pi(a|x) \,\|\, \pi_0(a|x)\big) \leq D_{KL}\big(\pi^H(z|x) \,\|\, \pi^H_0(z|x)\big) + \mathbb{E}_{\pi^H}\Big[ D_{KL}\big(\pi^L(a|x, z) \,\|\, \pi^L_0(a|x, z)\big) \Big] \tag{2}$$
We omit task conditioning and the declaration of shared modules to show that this bound is agnostic to both.

Table 1: Explored asymmetries (reference, followed by input).
(Tirumala et al., 2020; Liu et al., 2022): $x_t$
(Bagatella et al., 2022): $a_{t-n:t}$
(Bagatella et al., 2022): $x_{t-1:t}$
(Pertsch et al., 2020; 2021; Ajay et al., 2020): $s_t$
(Rao et al., 2021; Tirumala et al., 2019; 2020): $z_{t-1}$
(Tirumala et al., 2019; Goyal et al., 2019): -

Information Asymmetry (IA) is a key component in both of the aforementioned approaches, promoting the discovery of behaviours that generalise. IA can be understood as the masking of information accessible by certain modules. Not conditioning on specific environment aspects forces independence of, and generalisation across, them (Galashov et al., 2019). In the context of (hierarchical) KL-regularized RL, the explored asymmetries between the (high-level) policy, $\pi^{(H)}$, and prior, $\pi^{(H)}_0$, have been narrow (Tirumala et al., 2019; 2020; Liu et al., 2022; Pertsch et al., 2020; 2021; Rao et al., 2021; Ajay et al., 2020; Goyal et al., 2019). Concurrent with our research, Bagatella et al. (2022) published work exploring a wider range of asymmetries, closer to those we explore. We summarise explored asymmetries in Table 1 (with $a_{t-n:t}$ representing action history up to $n$ steps in the past).
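The upper bound on the KL-divergence between hierarchical policy and prior can be sanity-checked numerically for a single history $x$. The sketch below is illustrative only: it assumes small discrete latent and action spaces with randomly drawn categorical distributions, and the helper names (`random_dist`, `hierarchical_kl_bound`) are hypothetical, not part of any referenced implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(shape):
    """Random strictly positive categorical distribution(s) over the last axis."""
    x = rng.random(shape) + 1e-3
    return x / x.sum(axis=-1, keepdims=True)

def kl(p, q):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def hierarchical_kl_bound(pi_h, pi0_h, pi_l, pi0_l):
    """Return (lhs, rhs) of the bound for one fixed history x.

    pi_h, pi0_h: high-level policy / prior over z, shape (n_z,)
    pi_l, pi0_l: low-level policy / prior over a given z, shape (n_z, n_a)
    """
    # Marginal action distributions: sum_z pi^H(z|x) pi^L(a|x, z).
    pi_a = pi_h @ pi_l
    pi0_a = pi0_h @ pi0_l
    lhs = kl(pi_a, pi0_a)
    # High-level KL plus expected low-level KL under pi^H.
    low = sum(pi_h[z] * kl(pi_l[z], pi0_l[z]) for z in range(len(pi_h)))
    rhs = kl(pi_h, pi0_h) + low
    return lhs, rhs

n_z, n_a = 3, 4  # assumed toy sizes
pi_h, pi0_h = random_dist(n_z), random_dist(n_z)
pi_l, pi0_l = random_dist((n_z, n_a)), random_dist((n_z, n_a))

lhs, rhs = hierarchical_kl_bound(pi_h, pi0_h, pi_l, pi0_l)
print(f"marginal KL = {lhs:.4f} <= bound = {rhs:.4f}")
```

When the low level is shared between policy and prior, the second term vanishes and the bound reduces to the high-level KL alone, which is why regularising only in latent space is sufficient in that setting.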

2.2. INFORMATION ASYMMETRY

Choice of information conditioning heavily influences which skills can be uncovered and how well they transfer. For example, Pertsch et al. (2020) discover observation-dependent behaviours, such as navigating corridors in maze environments, yet are unable to learn history-dependent skills, such as never traversing the same corridor twice. In contrast, Liu et al. (2022), by conditioning on history, are able to learn these behaviours. However, as we will show, in many scenarios naïvely conditioning on the entire history can be detrimental for transfer, because it discovers behaviours that do not generalise favourably across history instances between tasks. We refer to this dilemma as the expressivity-transferability trade-off. Crucially, all previous works predefine the choice of asymmetry based on the practitioner's intuition, which may be sub-optimal for skill transfer. By introducing theory behind the expressivity-transferability of skills, we present a simple data-driven method for automating the choice of IA, by learning it, yielding transfer benefits. To rigorously investigate the contribution of priors, hierarchy, and information asymmetry for skill transfer, it is important to isolate each individual mechanism while enabling the recovery of previous models of interest. To this end, we present the unified architecture in Fig. 1, which introduces information gating functions (IGFs) as a means of decoupling IA from architecture. Each component has its own IGF, depicted with a colored rectangle. Every module is fed all environment information $x_k = (x, k)$, and distinctly chosen IGFs mask which part of the input each network has access to, thereby influencing which skills they learn. By presenting multiple priors, we enable a comparison with existing literature.
With the right masking, one can recover previously investigated asymmetries (Tirumala et al., 2019; 2020; Pertsch et al., 2020; Bagatella et al., 2022; Goyal et al., 2019) , explore additional ones, and also express purely hierarchical (Wulfmeier et al., 2019) and KL-regularized equivalents (Galashov et al., 2019; Haarnoja et al., 2018c) .
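To make the role of IGFs concrete, the following minimal sketch shows how static binary masks over a history can recover a few of the asymmetries discussed above. It assumes, purely for illustration, that the history is stored as an array with one row per timestep and that each row concatenates an observation with an action; the function names and dimensions are hypothetical.

```python
import numpy as np

def hard_igf(history, mask):
    """Hard-attention gating: element-wise product of a binary mask and the input."""
    return mask * history

def last_n_steps_mask(num_steps, step_dim, n):
    """Mask that keeps only the n most recent timesteps of the history."""
    mask = np.zeros((num_steps, step_dim))
    mask[num_steps - n:, :] = 1.0
    return mask

def feature_mask(num_steps, step_dim, keep_dims):
    """Mask that keeps only the selected feature columns at every timestep."""
    mask = np.zeros((num_steps, step_dim))
    mask[:, keep_dims] = 1.0
    return mask

# Assumed toy layout: 5 timesteps, 3 observation dims followed by 2 action dims.
T, OBS_DIM, ACT_DIM = 5, 3, 2
STEP_DIM = OBS_DIM + ACT_DIM
history = np.arange(1, T * STEP_DIM + 1, dtype=float).reshape(T, STEP_DIM)

obs_cols = list(range(OBS_DIM))
act_cols = list(range(OBS_DIM, STEP_DIM))

# Short history window (two most recent steps).
recent = hard_igf(history, last_n_steps_mask(T, STEP_DIM, n=2))
# Action-history only (state-free input).
actions_only = hard_igf(history, feature_mask(T, STEP_DIM, act_cols))
# Current observation only: intersection of a 1-step window and observation columns.
current_obs = hard_igf(
    history,
    last_n_steps_mask(T, STEP_DIM, n=1) * feature_mask(T, STEP_DIM, obs_cols),
)
print(recent, actions_only, current_obs, sep="\n\n")
```

Because the network architecture is unchanged and only the mask differs, comparisons between asymmetries isolate the effect of information conditioning from other design choices.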

3. MODEL ARCHITECTURE AND THE EXPRESSIVITY-TRANSFERABILITY TRADE-OFF

[Figure: architecture diagram; recoverable labels: $\pi_{0,h}$, KL.]

3.1. THE INFORMATION ASYMMETRY EXPRESSIVITY-TRANSFERABILITY TRADE-OFF

While existing works investigating the role of IAs for skill transfer in hierarchical KL-regularized RL have focused on multi-task learning (Galashov et al., 2019), we focus on the sequential task setting, in particular on the ability of the prior $\pi_0$ to handle covariate shift. In contrast to multi-task learning, in the sequential setting there exist abrupt distributional shifts, during training, for the task $p(K)$ and trajectory $p_\pi(\tau)$ distributions. As such, it is important that the prior handles such distributional and covariate shift.
Theorem 3.1. [...]
$$D_{KL}\big(p(b) \,\|\, q(b)\big) \geq D_{KL}\big(p(c) \,\|\, q(c)\big).$$
Proof. See Appendix B.1.
In our case, $p$ and $q$ can be interpreted as the training $p_{source}(\cdot)$ and transfer $p_{trans}(\cdot)$ distributions over network inputs, such as the history $x_t$ for the high-level prior $\pi^H_0$. Intuitively, Theorem 3.1 states that the more variables a network is conditioned on, the less likely it is to transfer, due to the increased covariate shift encountered between source and transfer domains, thus promoting minimal information conditioning. For example, imagine conditioning the high-level prior on either the entire history $x_{0:t}$ or a subset of it $x_{t-n:t}$, $n \in [0, t-1]$ (the subscript referring to the range of history values). According to Theorem 3.1, the covariate shift across sequential tasks will be smaller if we condition on a subset of the history: $D_{KL}\big(p_{source}(x_{0:t}) \,\|\, p_{trans}(x_{0:t})\big) \geq D_{KL}\big(p_{source}(x_{t-n:t}) \,\|\, p_{trans}(x_{t-n:t})\big)$. Interestingly, covariate shift is upper-bounded by trajectory shift: $D_{KL}\big(p_{\pi_{source}}(\tau) \,\|\, p_{\pi_{trans}}(\tau)\big) \geq D_{KL}\big(p_{\pi_{source}}(\tau_f) \,\|\, p_{\pi_{trans}}(\tau_f)\big)$ (using Theorem 3.1), with the right-hand side representing covariate shift over network inputs $\tau_f = IGF(\tau)$, the filtered trajectories (e.g. $\tau_f = x_{t-n:t}$, $\tau_f \subset \tau$), and with $\pi_{source}$ and $\pi_{trans}$ the source and transfer domain policies. It is therefore crucial, if possible, to minimise both trajectory and covariate shifts across domains, to benefit from previous skills.
Nevertheless, the less information a prior is conditioned on, the less knowledge can be distilled and transferred:
Theorem 3.2. The more random variables a network depends on, the greater its ability to distil knowledge in expectation (output distributional shift between network and target distribution, here represented by the expected KL-divergence). That is, for target distribution $p$ and network $q$ with outputs $a$ and possible inputs $b, c, d$, such that $b = (b_0, b_1, \ldots, b_n)$, $d \subset c \subset b$, $e \in d \oplus c$:
$$\mathbb{E}_{q(e|d)}\Big[ D_{KL}\big(p(a|b) \,\|\, q(a|c)\big) \Big] \leq D_{KL}\big(p(a|b) \,\|\, q(a|d)\big).$$
Proof. See Appendix B.2.
In this particular instance, $p$ and $q$ could be interpreted as the policy $\pi$ and prior $\pi_0$ distributions, $a$ as action $a_t$, $b$ as history $x_{0:t}$, and $c, d, e$ as subsets of the history (e.g. $x_{t-n:t}$, $x_{t-m:t}$, $x_{t-n:t-m}$ respectively, with $n > m$ and $m, n \in [0, t]$), with $e$ denoting the set of variables in $c$ but not in $d$. Intuitively, Theorem 3.2 states that, in expectation, conditioning on more information improves knowledge distillation between policy and prior (e.g. $\mathbb{E}_{\pi_0(x_{t-n:t-m}|x_{t-m:t})}\big[ D_{KL}\big(\pi(a_t|x_{0:t}) \,\|\, \pi_0(a_t|x_{t-n:t})\big) \big] \leq D_{KL}\big(\pi(a_t|x_{0:t}) \,\|\, \pi_0(a_t|x_{t-m:t})\big)$, with $\pi_0(x_{t-n:t-m}|x_{t-m:t})$ the conditional distribution, induced by $\pi_0$, of history subset $x_{t-n:t-m}$ given $x_{t-m:t}$). Therefore, IA leads to an expressivity-transferability trade-off of skills (Theorems 3.1 and 3.2). Interestingly, hierarchy does not influence covariate shift and hence does not hurt transferability, but it does increase network expressivity (e.g. of the prior), enabling the distillation and transfer of the rich multi-modal behaviours present in the real world.
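The inequality of Theorem 3.1 can be illustrated numerically on toy discrete distributions. The sketch below is an assumption-laden example rather than the paper's experiment: it draws two random joint distributions over three discrete variables (standing in for three history entries under the source and transfer domains) and compares the KL-divergence when the network input is the full tuple, the two most recent entries, or only the most recent entry. All names and sizes are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions of identical shape."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def random_joint(rng, shape):
    """Random strictly positive joint distribution over a discrete grid."""
    x = rng.random(shape) + 1e-3
    return x / x.sum()

def marginal(joint, keep_axes):
    """Marginalise out every axis that is not listed in keep_axes."""
    drop = tuple(ax for ax in range(joint.ndim) if ax not in keep_axes)
    return joint.sum(axis=drop) if drop else joint

rng = np.random.default_rng(0)
# Axes play the role of history entries (oldest first); sizes are arbitrary.
p_source = random_joint(rng, (4, 5, 3))
p_trans = random_joint(rng, (4, 5, 3))

# Nested inputs: full history, last two entries, last entry only.
subsets = [(0, 1, 2), (1, 2), (2,)]
shifts = [kl(marginal(p_source, s), marginal(p_trans, s)) for s in subsets]

for s, value in zip(subsets, shifts):
    print(f"input axes {s}: covariate shift (KL) = {value:.4f}")
```

Each successive marginalisation can only shrink the divergence, which mirrors the argument that conditioning on fewer inputs reduces the covariate shift a module is exposed to.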

4. APES: ATTENTIVE PRIORS FOR EXPRESSIVE AND TRANSFERABLE SKILLS

While previous works chose IA on intuition (Tirumala et al., 2019; 2020; Galashov et al., 2019; Pertsch et al., 2020; Wulfmeier et al., 2019; Bagatella et al., 2022; Singh et al., 2020; Ajay et al., 2020; Liu et al., 2022), we propose learning it. Consider the information gating functions (IGFs) introduced in Section 3 and depicted in Figure 1. Existing methods can be recovered by having the IGFs perform hard attention: $IGF(x_k) = m \odot x_k$, with $m \in \{0, 1\}^{\dim(x_k)}$ predefined and static, and $\odot$ representing element-wise multiplication. In contrast, we propose performing soft attention with $m \in [0, 1]^{\dim(x_k)}$ and learning $m$ based on: 1) the hierarchical KL-regularized RL objective (Equations (1) and (2)); 2) $\mathcal{L}_{IGF}(m) = -H(m)$, with $H$ denoting entropy (calculated by turning $m$ into a probability distribution by performing Softmax over it), thereby encouraging low-entropy, sparse IGFs (similar to Salter et al. (2019), who apply a related technique for sim2real transfer):
$$\mathcal{L}_{APES}(\pi, \pi_0, \{m_i\}_{i \in I}) = \mathbb{E}_{\tau \sim p_\pi(\tau),\, k \sim p(K)} \left[ \sum_{t=0}^{\infty} \gamma^t \Big( r_k(x_t, a_t) - \alpha_0 D_{KL}\big(\pi(a|x_k) \,\|\, \pi_0(a|x_k)\big) \Big) \right] - \sum_{i \in I} \alpha_{m_i} H(m_i) \tag{3}$$
where $\alpha_{m_i}$ weighs the relative importance of the entropy and KL-regularized RL objectives for each attention mask $m_i$, with one mask per module using self-attention (e.g. $\pi^H_0$) and $I$ denoting the set of such modules. Whilst soft attention does not eliminate dimensions in the same way that hard attention does, thus losing the strict connection with Theorems 3.1 and 3.2, in practice it often leads to many 0-attention elements (Salter et al., 2020; Mott et al., 2019). $m_i$ spans all history dimensions and supports intra-observation and intra-action attention. We train off-policy akin to SAC (Haarnoja et al., 2018b), sampling experiences from the replay buffer and approximating the return of the agent using Retrace (Munos et al., 2016) and double Q-learning (Hasselt, 2010) to train our critic. Refer to Appendices A and D for full training details.
By exposing IGFs to all available information $x_k$, we enable expressive skills that maximize the KL-regularized RL objective, with complex, potentially long-range, temporal dependencies (Theorem 3.2). Encouraging low-entropy masks $m_i$ promotes minimal information conditioning (by limiting the IGF's channel capacity) whilst still capturing expressive behaviours. This is achieved by paying attention only where necessary, to key environment aspects (Salter et al., 2022) that are crucial for decision making and hence heavily influence behaviour expressivity. By minimising the dependence on redundant information (aspects of the observation $s$, action $a$, or history $x$ spaces that behaviours are independent of), we minimise covariate shift and improve the transferability of skills to downstream domains (Theorem 3.1). Consider learning the IGF of the high-level prior $\pi^H_0$ for a humanoid navigation task. Low-level skills $\pi^L$ could correspond to motor primitives, whilst the high-level prior could represent navigation skills. For navigation, joint quaternions are not relevant, but the Cartesian position is. By learning to mask the parts of the observations corresponding to joints, the agent becomes invariant and robust to covariate shifts across these dimensions (unseen joint configurations). We call our method APES, 'Attentive Priors for Expressive and Transferable Skills'.
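The entropy term on the attention mask can be sketched in a few lines. The example below follows the description above (Softmax over the mask, then entropy) but everything else is an assumption made for illustration: the function names are hypothetical, the RL term is replaced by a placeholder scalar, and the mask is a fixed vector rather than a learnt parameter.

```python
import numpy as np

def softmax(m):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(m - np.max(m))
    return e / e.sum()

def mask_entropy(m):
    """Entropy H(m) of the distribution obtained by a softmax over the mask."""
    p = softmax(np.asarray(m, dtype=float))
    return float(-np.sum(p * np.log(p)))

def soft_igf(x, m):
    """Soft-attention gating: element-wise product of mask values in [0, 1] and input."""
    return np.asarray(m, dtype=float) * np.asarray(x, dtype=float)

def entropy_regularised_objective(rl_objective, masks, alphas):
    """Placeholder objective: RL term minus the weighted mask entropies (to be maximised)."""
    penalty = sum(a * mask_entropy(m) for m, a in zip(masks, alphas))
    return rl_objective - penalty

dense_mask = np.ones(4)                       # attends equally to everything
sparse_mask = np.array([1.0, 0.0, 0.0, 0.0])  # attends to a single dimension

print("H(dense)  =", mask_entropy(dense_mask))
print("H(sparse) =", mask_entropy(sparse_mask))
print("gated input:", soft_igf([2.0, 3.0, 4.0, 5.0], sparse_mask))
```

For an identical RL term, the sparse mask scores higher under this objective than the dense one, so gradient ascent is nudged towards masks that attend to fewer input dimensions. Note that with mask values restricted to $[0, 1]$ the softmax is never fully peaked, so entropy differences between masks are modest in absolute terms.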

4.1. TRAINING REGIME AND THE INFORMATION ASYMMETRY SETUP

We are concerned with investigating the roles of priors, hierarchy and IA for transfer in sequential task learning, where skills learnt over past tasks $p_{source}(K)$ are leveraged for transfer tasks $p_{trans}(K)$. While one could investigate IA between hierarchical levels ($\pi^H$, $\pi^L$) as well as between policy and prior ($\pi$, $\pi_0$), we concern ourselves solely with the latter. Specifically, to keep our comparisons with existing literature fair, we condition $\pi^L$ on $s_t$ and $z_t$, and share it with the prior, $\pi^L = \pi^L_0$, thus enabling expressive multi-modal behaviours to be discovered with respect to $s_t$ (Tirumala et al., 2019; 2020; Wulfmeier et al., 2019). In this paper, we focus on the role of IA between the high-level policy $\pi^H$ and prior $\pi^H_0$ for supporting expressive and transferable high-level skills between tasks. As is common, we assume the source tasks are solved before tackling the transfer tasks. Therefore, for analysis purposes it does not matter whether we learn skills from source domain demonstrations provided by a hardcoded expert or by an optimal RL agent. For simplicity, we discover skills and skill priors using variational behavioural cloning from expert policy $\pi_e$ samples:
$$\mathcal{L}_{bc}(\pi, \pi_0, \{m_i\}_{i \in I}) = \sum_{j \in \{0, e\}} \mathcal{L}_{APES}(\pi, \pi_j, \{m_i\}_{i \in I}), \quad \text{with } r_k = 0,\ \gamma = 1,\ \tau \sim p_{\pi_e}(\tau) \tag{4}$$
Equation (4) can be viewed as hierarchical KL-regularized RL in the absence of rewards and with two priors: the one we learn, $\pi_0$, and the expert, $\pi_e$. See Appendix A.2 for a deeper discussion on the similarities with KL-regularized RL. We then transfer the skills and solve the transfer domains using hierarchical KL-regularized RL (as per Equation (3)). To compare the influence of distinct IAs for transfer in a controlled way, we propose the following regime: Stage 1) train a single hierarchical policy $\pi$ in the multi-task setup using Equation (4), but prevent gradient flow from prior $\pi_0$ to policy.
Simultaneously, based on the ablation, train distinct priors (with differing IAs) on Equation (4) to imitate the policy. As such, we compare the influence of various IAs on skill distillation and transfer in a controlled manner, as they all distil behaviours from the same policy; Stage 2) freeze the shared modules ($\pi^L$, $\pi^H_0$) and train a newly instantiated $\pi^H$ on the transfer task. By freezing $\pi^L$, we assume the low-level skills from the source domains suffice for the transfer domains, often called the modularity assumption (Khetarpal et al., 2020a; Salter et al., 2022). While this appears restrictive, the more diverse the source domains are (commonly desired in settings like lifelong learning (Khetarpal et al., 2020a) and offline RL), the more probable it is that the optimal transfer policy can be obtained by recomposing the learnt skills. If this assumption does not hold, one could either fine-tune $\pi^L$ during transfer, which would require tackling the catastrophic forgetting of skills (Kirkpatrick et al., 2017), or train additional skills (by expanding $z$ dimensionality for discrete $z$ spaces). We leave this as future work. We also leave extending APES to learning from sub-optimal demonstrations as future work, potentially using advantage-weighted regression (Peng et al., 2019), rather than behavioural cloning, to learn skills. For further details refer to Appendices A and D.

5. EXPERIMENTS AND RESULTS

Our experiments are designed to answer the following sequential task questions: (1) Can we benefit from both hierarchy and priors for effective transfer? 2019), on similar navigation and manipulation tasks for which they were originally designed and tested against.

5.1. ENVIRONMENTS

We evaluate on two domains: one designed for controlled investigation of core agent capabilities and the other a more practical robotics domain (see Figure 2). Both exhibit modular behaviours whose discovery could yield transfer benefits. See Appendix C for full environmental setup details.
• CorridorMaze. The agent must traverse corridors in a given ordering. We collect $4 \times 10^3$ trajectories from a scripted policy traversing any random ordering of two corridors ($p_{source}(K)$). For transfer ($p_{trans}(K)$), an inter- or extrapolated ordering must be traversed (number of sequential corridors = {2, 4}), allowing us to inspect the generalization ability of distinct priors to increasing levels of covariate shift. We also investigate the influence of covariate shift on effective transfer across reward sparsity levels: s-sparse (short for semi-sparse), rewarding per half-corridor completion; sparse, rewarding on task completion. Our transfer tasks are sparse 2 corr and s-sparse 4 corr.
• Stack. The agent must stack a subset of four blocks over a target pad in a given ordering. The blocks have distinct masses and only lighter blocks should be placed on heavier ones. Therefore, discovering temporal behaviours corresponding to sequential block stacking according to mass is beneficial. We collect $17.5 \times 10^3$ trajectories from a scripted policy, stacking any two blocks given this requirement ($p_{source}(K)$). The extrapolated transfer task ($p_{trans}(K)$), called 4 blocks, requires all blocks to be stacked according to mass. Rewards are given per individual block stacked.
Our full setup, APES, leverages hierarchy and priors for skill transfer. The high-level prior is given access to the history (as is common for POMDPs) and learns sparse self-attention $m$. To investigate the importance of priors, we compare against APES-no prior, a baseline from Tirumala et al. (2019), with the full APES setup except without a learnt prior.
Comparing transfer results in Table 2, we see APES' drastic gains, highlighting the importance of temporal high-level behavioural priors. To inspect the transfer importance of the traditional hierarchical setup (with $\pi^L(s_t, z_t)$), we compare APES-no prior against two baselines trained solely on the transfer task. RecSAC represents a history-dependent SAC (Haarnoja et al., 2018d) and Hier-RecSAC a hierarchical equivalent from Wulfmeier et al. (2019). APES-no prior has marginal benefits, showing the importance of both hierarchy and priors for transfer. See Table 6 for a detailed explanation of all baseline and ablation setups. To investigate the importance of IA for transfer, we ablate over high-level priors with increasing levels of asymmetry (each input a subset of the previous): APES-{H20, H10, H1, S}, with S denoting an observation-dependent high-level prior and H$i$ a history-dependent one, conditioned on $x_{t-i:t}$. Crucially, these ablations do not learn $m$, unlike APES, our full method. Ablating history lengths is a natural dimension for POMDPs, where discovering belief states by history conditioning is crucial (Thrun, 1999). 2020) with $\pi^L(s_t, z_t)$ rather than $\pi^L(z_t)$. Table 2 shows the heavy influence of IA, with the trend that conditioning on too little or too much information limits performance. The level of influence depends on the reward sparsity level: the sparser the reward, the heavier the influence, because rewards guide exploration less. Regardless of whether the transfer domain is interpolated or extrapolated, IA is influential, suggesting that IA is important over varying levels of sparsity and extrapolation. 8 for detailed tabular results.

5.3. INFORMATION ASYMMETRY FOR KNOWLEDGE TRANSFER

To investigate whether Theorems 3.1 and 3.2 are the reason for the apparent expressivity-transferability trade-off seen in Table 2, we plot Figure 4 showing, on the vertical axis, the distillation loss $D_{KL}(\pi^H \,\|\, \pi^H_0)$ at the end of training over $p_{source}(K)$, versus, on the horizontal axis, the increase in transfer performance (on $p_{trans}(K)$) when initialising $\pi^H$ as a task-agnostic high-level policy pre-trained over $p_{source}(K)$ (instead of randomly, as is default). We ran these additional pre-trained $\pi^H$ experiments to investigate whether covariate shift is the culprit for the degradation in transfer performance when conditioning the high-level prior $\pi^H_0$ on additional information. By pre-training and transferring $\pi^H$, we reduce initial trajectory shift and thus initial covariate shift between source and transfer domains (see Section 2.2). This is because no networks are reinitialized during transfer, which would usually lead to an initial shift in policy behaviour across domains. As per Theorem 3.1, we would expect a larger reduction in covariate shift for the priors that condition on more information. If covariate shift were the culprit for reduced transfer performance, we would expect a larger performance gain for those priors conditioned on more information. Figure 4 demonstrates that this is the case in general, regardless of whether the transfer domain is inter- or extrapolated. The trend is significantly less apparent for the semi-sparse domain, as here denser rewards guide learning significantly, alleviating covariate shift issues. We show results for APES-{H20, H10, H1, S} as each input is a subset of the previous. These relations govern Theorems 3.1 and 3.2. Figure 4 and Table 3 show that conditioning on more information improves prior expressivity, reducing distillation losses, as per Theorem 3.2.
These results, together with Table 2, show the impactful expressivity-transferability trade-off of skills, controlled by IA (as per Theorems 3.1 and 3.2), where conditioning on too little or too much information limits performance. As seen in Table 2, APES, our full method, strongly outperforms (almost) every baseline and ablation on each transfer domain. Comparing APES with APES-H20, the most comparable approach with the prior fed the same input ($x_{t-20:t}$), we observe drastic performance gains. These results demonstrate the importance of reducing covariate shift (by minimising information conditioning) whilst still supporting expressive behaviours (by exposing the prior to maximal information), which is only achieved by APES. Table 3 shows $H(m)$, the entropy of each $\pi^H_0$ mask (a proxy for the amount of information conditioning), against $D_{KL}(\pi^H \,\|\, \pi^H_0)$ (distillation loss), reporting max/min scores across all experiment cycles. We ran 4 random seeds but omit standard deviations as they were negligible. APES not only attends to minimal information ($H(m)$), but for that given level achieves a far lower distillation loss than comparable methods. This demonstrates that APES pays attention only where necessary. We inspect APES' attention masks $m$ in Figure 5 (aggregated at the observation and action levels). Firstly, many attention values tend to 0 (seen clearly in Figure 7), aligning APES closely with Theorems 3.1 and 3.2. Secondly, APES primarily pays attention to the recent history of actions. This is interesting, as it is in line with recent concurrent work (Bagatella et al., 2022) demonstrating the effectiveness of state-free priors, conditioned on a history of actions, for effective generalization. Unlike Bagatella et al. (2022), who need exhaustive history-length sweeps for effective transfer, our approach learns the length in an automated, domain-dependent manner. As such, our learnt history lengths are distinct for CorridorMaze and Stack.
For CorridorMaze, some attention is paid to the most recent observation s t , which is unsurprising as this information is required to infer how to optimally act whilst at the end of a corridor. In Figure 7 , we plot intra-observation and action attention, and note that for Stack, APES ignores various dimensions of the observation-space, further reducing covariate shift. Refer to Appendix E.1 for an in-depth analysis of APES' full attention maps. To further investigate whether hierarchy is necessary for effective transfer, we compare APES-H1 with APES-H1-flat, which has the same setup except with a flat, non-hierarchical prior (conditioned on x t-1:t ). With a flat prior, KL-regularization must occur over the raw, rather than latent, action space. Therefore, to adequately compare, we additionally investigate a hierarchical setup where regularization occurs only over the action-space, APES-H1-KL-a. Transfer results for CorridorMaze are shown in Table 4 (7 seeds). Comparing APES-H1-KL-a and APES-H1-KL-flat, we see the benefits of a hierarchical prior, more significant for sparser domains. Upon inspection (see Section 5.7), APES-H1-KL-flat is unable to solve the task as it cannot capture multi-modal behaviours (at corridor intersections). Contrasting APES-H1 with APES-H1-KL-a, we see minimal benefits for latent regularization, suggesting with alternate methods for multi-modality, hierarchy may not be necessary.

5.7. SKILL-LEVEL EXPLORATION ANALYSIS

To gain further understanding of the effects of hierarchy and priors, we visualise policy rollouts early on during transfer ($5 \times 10^3$ steps). For CorridorMaze (Figure 3), with hierarchy and priors APES explores randomly at the corridor level. Hierarchy alone, unable to express preference over high-level skills, leads to temporally uncorrelated behaviours, unable to explore at the corridor level. The flat prior, unable to represent multi-modal behaviours, leads to suboptimal exploration at the intersection of corridors, with the agent often remaining static. With neither priors nor hierarchy, exploration is further hindered, rarely traversing corridor depths. For Stack (Figure 6), APES explores at the block stacking level, alternating block orderings but primarily stacking lighter upon heavier. Hierarchy alone is unable to stack blocks with temporally uncorrelated skills, exploring at the intra-block stacking level, switching between blocks before successfully stacking, or often interacting with, any individual one.

(2019) primarily use priors to tackle value overestimation (Levine et al., 2020). In the variational literature, priors have been used to guide latent-space learning (Hausman et al., 2018; Igl et al., 2019; Pertsch et al., 2020; Merel et al., 2018) 2022) use priors to reduce covariate shifts during planning. In contrast, APES ensures the priors themselves experience reduced shifts. In the multi-task literature, priors have been used to guide exploration (Pertsch et al., 2020; Galashov et al., 2019; Siegel et al., 2020; Pertsch et al., 2021; Teh et al., 2017), yet without hierarchy the expressivity of learnt behaviours is limited.
In the sequential transfer literature, priors have also been used to bias exploration (Pertsch et al., 2020; Ajay et al., 2020; Singh et al., 2020; Bagatella et al., 2022; Goyal et al., 2019; Rao et al., 2021; Liu et al., 2022 ), yet either do not leverage hierarchy (Pertsch et al., 2020) or condition on minimal information (Ajay et al., 2020; Singh et al., 2020; Rao et al., 2021) (2020) investigate a way of learning asymmetry for sim2real domain adaptation, but condition m on observation and state. We consider exploring this direction as future work. We provide a principled investigation on the role of IA for transfer, proposing a method for automating the choice.

7. CONCLUSION

We employ hierarchical KL-regularized RL to efficiently transfer skills across sequential tasks, showing the effectiveness of combining hierarchy and priors. We theoretically and empirically show the crucial expressivity-transferability trade-off, controlled by IA choice, of skills for hierarchical KL-regularized RL. Our experiments validate the importance of this trade-off for both interpolated and extrapolated domains. Given this insight, we introduce APES, 'Attentive Priors for Expressive and Transferable Skills', automating the IA choice for the high-level prior by learning it in a data-driven, domain-dependent manner. This is achieved by feeding the entire history to the prior, capturing expressive behaviours, whilst encouraging its attention mask to have low entropy, minimising covariate shift and improving transferability. Experiments over domains of varying sparsity levels demonstrate APES' consistently superior performance over existing methods, whilst bypassing arduous IA sweeps. Ablations demonstrate the importance of hierarchy for prior expressivity, by supporting multi-modal behaviours. Future work will focus on additionally learning the IGFs between hierarchical levels.

A METHOD

A.1 TRAINING REGIME

In this section, we algorithmically describe our training setup. We relate each training phase to the principal equations in the main paper, but note that Appendices A.2 and A.3 outline the more detailed versions of these equations that were actually used. We note that during BC, we apply DAGGER (Ross et al., 2011), as per Algorithm 2, improving learning rates. For further details refer to Appendix D.

Algorithm
    π^H ← RL policy update(π, π_0, R_rl)    # Eq. 3
    x ← env.observation()
    Q_k ← RL critic update(Q_k, π, R_rl)    #

    π_e ← env.expert()
    a_i ← π_i(x)
    a_e ← π_e(x)
    a ← Bernoulli([a_i, a_e], [r, 1 - r])
    x′, r_k, env ← env.step(a)
    if dag then
        a_f ← a_e
    else
        a_f ← a_i
    end if
    R_j ← R_j.update(x, a_f, r_k, x′)
    return R_j, env
end function
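The data-collection step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `collect_step`, `env_step`, and `mix_rate` (the rate written r in the pseudocode) are hypothetical names, and the environment is reduced to a callable returning the next observation and reward.

```python
import random


def collect_step(x, policy, expert, env_step, mix_rate, dagger=True, rng=random):
    """One DAgger-style collection step (illustrative sketch).

    policy, expert: callables mapping an observation to an action.
    env_step: hypothetical callable mapping an action to (next_obs, reward).
    mix_rate: probability of executing the learner's action instead of the expert's.
    """
    a_learner = policy(x)
    a_expert = expert(x)
    # Execute the learner's action with probability mix_rate, else the expert's.
    a_executed = a_learner if rng.random() < mix_rate else a_expert
    x_next, reward = env_step(a_executed)
    # With DAgger, the stored target is always the expert's action for x.
    a_target = a_expert if dagger else a_learner
    transition = (x, a_target, reward, x_next)
    return transition, x_next
```

Storing the expert action as the target while sometimes executing the learner's action exposes the learner to its own state distribution, which is the mechanism the text credits with mitigating covariate shift during BC.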

A.2 VARIATIONAL BEHAVIORAL CLONING AND REINFORCEMENT LEARNING

In the following section, we omit APES' specific information gating function objective $IGF(x_k)$ for simplicity and generality. Nevertheless, it is trivial to extend the following derivations to APES. Behavioral Cloning (BC) and KL-Regularized RL, when considered from the variational-inference perspective, share many similarities. These similarities become even more apparent when dealing with hierarchical models. A particularly unifying choice of objective functions for BC and RL that fits with off-policy, generative, hierarchical RL (desirable for sample efficiency) is:

$$L_{bc}(\pi, \{\pi_i\}_{i \in I}) = -\sum_{i \in I} D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau)), \qquad L_{rl}(\pi, \{\pi_i\}_{i \subset I}) = \mathbb{E}_{\pi(\tau)}[R(\tau)] + L_{bc}(\pi, \{\pi_i\}_{i \subset I}) \tag{5}$$

$L_{bc}$ corresponds to the KL-divergence between trajectories from the policy, $\pi$, and various priors, $\pi_i$. For BC, $i \in \{0, u, e\}$ denotes the learnt, uniform, and expert priors. For BC, in practice, we train multiple priors in parallel: $\pi_0 = \{\pi_i\}_{i \in \{0, ..., N\}}$. We leave this notation out for the remainder of this section for simplicity. When considering only the expert prior, this is the reverse KL-divergence, opposite to what is usually evaluated in the literature (Pertsch et al., 2020). $L_{rl}$ refers to a lower bound on the expected optimality of each prior, $\log p_{\pi_i}(O = 1)$, with $O$ denoting the event of achieving maximum return (return referred to as $R(\cdot)$); refer to Abdolmaleki et al. (2018) and Appendix B.4.3 for proof, further explanation, and necessary conditions. During transfer using RL, we do not have access to the expert or its demonstrations ($i \subset I := i \in \{0, u\}$). For hierarchical policies, the KL terms are not easily evaluable. The following (Tirumala et al., 2019) is a commonly chosen upper bound:

$$D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau)) \le \sum_t \mathbb{E}_{\pi(\tau)}\Big[ D_{KL}\big(\pi^H(z_t|x_k) \,\|\, \pi^H_i(z_t|x_k)\big) + \mathbb{E}_{\pi^H(z_t|x_k)}\big[ D_{KL}\big(\pi^L(a_t|x_k, z_t) \,\|\, \pi^L_i(a_t|x_k, z_t)\big) \big] \Big]$$

If sharing modules, e.g. $\pi^L_i = \pi^L$, or using non-hierarchical networks, this bound can be simplified (removing the second or first term respectively). To make both objectives in Eq. (5) amenable to off-policy training (experience from $\{\pi_e, \pi_b\}$, for BC/RL respectively; $\pi_b$ representing the behavioural policy), we introduce importance weighting (IW), removing off-policy bias at the expense of higher variance. Combining all the above with additional individual term weighting hyperparameters, $\{\beta^z_i, \beta^a_i\}$, we attain:

$$D^{q(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)) := \mathbb{E}_{q(\tau)}\Big[ \sum_t \nu^q[t] \cdot \Big( \beta^z_i \cdot C_{i,h}(z_t|x_k) + \beta^a_i \cdot \mathbb{E}_{\pi^H(z_t|x_k)}\big[ C_{i,l}(a_t|x_k, z_t) \big] \Big) \Big]$$

$$\zeta^n_i = \mathbb{E}_{\pi^H(z_i|x_i,k)}\left[ \frac{\pi^L(a_i|x_i, z_i, k)}{n(a_i|x_i, k)} \right], \qquad \nu^n = \Big[ \zeta^n_1,\ \zeta^n_1 \zeta^n_2,\ \ldots,\ \prod_{i=1}^{\tau_t} \zeta^n_i \Big], \qquad C_{\mu,\epsilon}(y) = \log \frac{\pi^{r}_{\epsilon}(y)}{\pi^{r}_{\mu,\epsilon}(y)}$$

$$-D_{KL}(\pi(\tau) \| \pi_e(\tau)) \ge -\sum_{i \in \{0,u,e\}} D^{\pi_e(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)) \tag{6}$$

$$\mathbb{E}_{p(K),\, \pi_0(\tau)}\big[\log(O = 1 | \tau, k)\big] \ge \mathbb{E}_{\pi_b(\tau)}\Big[ \sum_t \nu^{\pi_b}[t] \cdot r_k(x_t, a_t) \Big] - \sum_{i \in \{0,u\}} D^{\pi_b(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)) \tag{7}$$

Here $D^{q(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau))$ (for $\{\beta^z_i, \beta^a_i\} = 1$) is an unbiased estimate of the aforementioned upper bound, using $q$'s experience. $\zeta^n_i$ is the IW for timestep $i$, between $\pi$ and an arbitrary policy $n$. $\nu^n[t]$ is the $t$-th element of $\nu^n$: the cumulative IW product at timestep $t$. Equations 6 and 7 are the BC/RL lower bounds used for policy gradients. See Appendix B.4 for a derivation, and necessary conditions, of these bounds. For BC, this bounds the KL-divergence between the hierarchical and expert policies, $\pi$ and $\pi_e$. For RL, this bounds the expected optimality for the learnt prior policy, $\pi_0$. Intuitively, maximising this particular bound maximises return for both policy and prior, whilst minimising the disparity between them. Regularising against an uninformative prior, $\pi_u$, encourages highly-entropic policies, further aiding exploration and stabilising learning (Igl et al., 2019).
In RL, IWs are commonly ignored (Lillicrap et al., 2015; Abdolmaleki et al., 2018; Haarnoja et al., 2018b), thereby considering each sample equally important. This is also convenient for BC, as IWs require the expert probability distribution, which is not usually provided. We did not observe benefits from using them and therefore ignore them too. We employ module sharing ($\pi^L_i = \pi^L$, unless stated otherwise) and freeze certain modules during distinct phases, and thus never employ more than 2 hyperparameters, $\beta$, at any given time, simplifying the hyperparameter optimisation. These weights balance an exploration/exploitation trade-off. We use a categorical latent space, explicitly marginalising over it rather than using sampling approximations (Jang et al., 2016). For BC, we train for 1 epoch (referring to training, in expectation, once over each sample in the replay buffer).
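As a small illustration of explicit marginalisation over a categorical latent, the per-timestep hierarchical KL term can be computed exactly by summing over latent categories. The sketch below is hedged: it assumes discrete actions purely so that every quantity is a finite sum, and the function names are hypothetical, not taken from the authors' code.

```python
import numpy as np


def categorical_kl(p, q):
    """KL(p || q) along the last axis for strictly positive probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


def hierarchical_kl_bound(pi_h, prior_h, pi_l, prior_l):
    """High-level KL plus the expected low-level KL, marginalising exactly over z.

    pi_h, prior_h: shape (Z,), distributions over latent categories.
    pi_l, prior_l: shape (Z, A), action distributions per category
                   (discrete actions are an assumption made for illustration).
    """
    kl_high = categorical_kl(pi_h, prior_h)
    kl_low_per_z = categorical_kl(pi_l, prior_l)  # shape (Z,)
    return float(kl_high + np.dot(pi_h, kl_low_per_z))


def marginal_action_kl(pi_h, prior_h, pi_l, prior_l):
    """KL between the marginal action distributions of policy and prior."""
    return float(categorical_kl(pi_h @ pi_l, prior_h @ prior_l))
```

With a shared low level (`pi_l` equal to `prior_l`) the second term vanishes and only the latent-level KL remains, which matches the simplification described for module sharing.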

A.3 CRITIC LEARNING

The lower bound presented in Eq. (7) is non-differentiable due to rewards being sampled from the environment. Therefore, as is common in the RL literature (Mnih et al., 2015; Lillicrap et al., 2015), we approximate the return of policy $\pi$ with a critic, $Q$. To be sample efficient, we train in an off-policy manner with TD-learning (Sutton, 1988) using the Retrace algorithm (Munos et al., 2016) to provide a low-variance, low-bias policy evaluation operator:

$$Q^{ret}_t := Q'(x_t, a_t, k) + \sum_{j=t}^{\infty} \epsilon^t_j \Big[ r_k(x_j, a_j) + \mathbb{E}_{\pi^H(z|x_{j+1},k),\, \pi^L(a'|x_{j+1},z,k)}\big[ Q'(x_{j+1}, a', k) \big] - Q'(x_j, a_j, k) \Big] \tag{8}$$

$$L(Q) = \mathbb{E}_{p(K),\, \pi_b(\tau)}\Big[ \big( Q(x_t, a_t, k) - \arg\min_{Q^{ret}_t}(Q^{ret}_t) \big)^2 \Big], \qquad \epsilon^t_j = \gamma^{j-t} \prod_{i=t+1}^{j} \zeta^b_i \tag{9}$$

Here $Q^{ret}_t$ represents the policy return evaluated via Retrace. $Q'$ is the target Q-network, commonly used to stabilize critic learning (Mnih et al., 2015), and is updated periodically with the current $Q$ values. IWs are not ignored here, and are clipped between [0, 1] to prevent exploding gradients (Munos et al., 2016). To further reduce bias and overestimates of our target, $Q^{ret}_t$, we apply the double Q-learning trick (Hasselt, 2010) and concurrently learn two target Q-networks, $Q'$. Our critic is trained to minimize the loss in Eq. (9), which regularizes the critic against the minimum of the two targets produced by both target networks.
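To make the Retrace target concrete, the following sketch computes targets for a single finite episode by backward recursion. It is an illustration under assumptions rather than the authors' code: it follows the standard Retrace form of Munos et al. (2016), in which the bootstrap term is discounted by $\gamma$ and traces are $\lambda \cdot \min(1, \text{ratio})$; the function and argument names are hypothetical, and the expectation over $\pi^H, \pi^L$ is assumed to be precomputed in `v_next`.

```python
import numpy as np


def retrace_targets(q, v_next, rewards, ratios, gamma=0.99, lam=1.0):
    """Retrace targets for one episode (illustrative sketch).

    q[t]:       target-network value Q'(x_t, a_t).
    v_next[t]:  E_pi[Q'(x_{t+1}, a')], set to 0 after a terminal step.
    rewards[t]: reward r(x_t, a_t).
    ratios[t]:  importance ratio pi(a_t | x_t) / b(a_t | x_t).
    """
    q = np.asarray(q, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Clip importance weights to [0, 1] before scaling by lambda.
    traces = lam * np.clip(np.asarray(ratios, dtype=float), 0.0, 1.0)
    td_errors = rewards + gamma * v_next - q
    targets = np.empty_like(q)
    carry = 0.0  # gamma * c_{t+1} * (Q_ret[t+1] - q[t+1]); zero past the episode end
    for t in reversed(range(len(q))):
        targets[t] = q[t] + td_errors[t] + carry
        carry = gamma * traces[t] * (targets[t] - q[t])
    return targets
```

When all clipped weights equal one and `v_next[t]` coincides with `q[t + 1]`, the TD errors telescope and the targets reduce to discounted Monte Carlo returns; when all weights are zero, they reduce to one-step targets.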

B THEORY AND DERIVATIONS

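In this section we provide proofs for the theory introduced in the main paper and in Appendix A.

B.1 THEOREM 1.

Theorem 1. The more random variables a network depends on, the larger the covariate shift (input distributional shift, here represented by KL-divergence) encountered across sequential tasks. That is, for distributions $p$, $q$:

$$D_{KL}(p(b) \,\|\, q(b)) \ge D_{KL}(p(c) \,\|\, q(c)) \quad \text{with } b = (b_0, b_1, ..., b_n) \text{ and } c \subset b. \tag{10}$$

Proof

$$\begin{aligned}
D_{KL}(p(b) \,\|\, q(b)) &= \mathbb{E}_{p(b)}\left[ \log \frac{p(b)}{q(b)} \right] \\
&= \mathbb{E}_{p(d|c) \cdot p(c)}\left[ \log \frac{p(d|c) \cdot p(c)}{q(d|c) \cdot q(c)} \right] \quad \text{with } d \in b \oplus c \\
&= \mathbb{E}_{p(c)}\left[ \mathbb{E}_{p(d|c)}[1] \cdot \log \frac{p(c)}{q(c)} \right] + \mathbb{E}_{p(c)}\left[ \mathbb{E}_{p(d|c)}\left[ \log \frac{p(d|c)}{q(d|c)} \right] \right] \\
&= D_{KL}(p(c) \,\|\, q(c)) + \mathbb{E}_{p(c)}\left[ D_{KL}(p(d|c) \,\|\, q(d|c)) \right] \\
&\ge D_{KL}(p(c) \,\|\, q(c)) \quad \text{given } \mathbb{E}_{p(c)}\left[ D_{KL}(p(d|c) \,\|\, q(d|c)) \right] \ge 0
\end{aligned} \tag{11}$$

B.2 THEOREM 2.

Theorem 2. The more random variables a network depends on, the greater its ability to distil knowledge in the expectation (output distributional shift between network and target distribution, here represented by the expected KL-divergence). That is, for target distribution $p$ and network $q$ with outputs $a$ and possible inputs $b$, $c$, $d$, such that $b = (b_0, b_1, ..., b_n)$ and $d \subset c \subset b$:

$$\mathbb{E}_{q(e|d)}\left[ D_{KL}(p(a|b) \,\|\, q(a|c)) \right] \le D_{KL}(p(a|b) \,\|\, q(a|d)) \quad \text{with } e \in d \oplus c \tag{12}$$

Proof

$$\begin{aligned}
D_{KL}(p(a|b) \,\|\, q(a|d)) &= \mathbb{E}_{p(a|b)}\left[ \log \frac{p(a|b)}{q(a|d)} \right] \\
&= \mathbb{E}_{p(a|b)}\left[ \log p(a|b) - \log \mathbb{E}_{q(e|d)}[q(a|c)] \right] \quad \text{with } e \in d \oplus c \\
&\ge \mathbb{E}_{p(a|b) \cdot q(e|d)}\left[ \log \frac{p(a|b)}{q(a|c)} \right] \quad \text{given Jensen's inequality} \\
&= \mathbb{E}_{q(e|d)}\left[ D_{KL}(p(a|b) \,\|\, q(a|c)) \right]
\end{aligned}$$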
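The chain-rule decomposition used in the proof of Theorem 1 can be checked numerically on small discrete distributions. The sketch below is only a sanity check under assumptions: the joint over $b = (c, d)$ is represented as a strictly positive 2-D array, and `kl_decomposition` is a hypothetical helper name.

```python
import numpy as np


def kl_decomposition(p_joint, q_joint):
    """Return (KL over b, KL over c, expected conditional KL over d given c).

    p_joint[c, d] and q_joint[c, d] are strictly positive joint distributions
    over b = (c, d), with c indexing rows and d indexing columns.
    """
    p_joint = np.asarray(p_joint, dtype=float)
    q_joint = np.asarray(q_joint, dtype=float)
    kl_b = float(np.sum(p_joint * (np.log(p_joint) - np.log(q_joint))))
    # Marginalise out d to obtain the distributions over the subset c.
    p_c = p_joint.sum(axis=1)
    q_c = q_joint.sum(axis=1)
    kl_c = float(np.sum(p_c * (np.log(p_c) - np.log(q_c))))
    # Conditionals p(d | c) and q(d | c), one row per value of c.
    p_d = p_joint / p_c[:, None]
    q_d = q_joint / q_c[:, None]
    cond = float(np.sum(p_c * np.sum(p_d * (np.log(p_d) - np.log(q_d)), axis=1)))
    return kl_b, kl_c, cond
```

For any such pair of distributions, the first returned value should equal the sum of the other two, and the conditional term is non-negative, so conditioning on the larger variable set never yields a smaller KL-divergence.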

B.3 HIERARCHICAL KL-DIVERGENCE UPPER BOUND

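All proofs in this section ignore the multi-task setup for simplicity. Extending to this scenario is trivial.

Upper Bound

$$D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau)) \le \sum_t \mathbb{E}_{\pi(\tau)}\Big[ D_{KL}\big(\pi^H(z_t|x_t) \,\|\, \pi^H_i(z_t|x_t)\big) + \mathbb{E}_{\pi^H(z_t|x_t)}\big[ D_{KL}\big(\pi^L(a_t|x_t, z_t) \,\|\, \pi^L_i(a_t|x_t, z_t)\big) \big] \Big]$$

Proof

$$\begin{aligned}
D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau)) &= \mathbb{E}_{\pi(\tau)}\left[ \log \frac{\pi(\tau)}{\pi_i(\tau)} \right] \\
&= \mathbb{E}_{\pi(\tau)}\left[ \log \frac{p(s_0) \cdot \prod_t p(s_{t+1}|x_t, a_t) \cdot \pi(a_t|x_t)}{p(s_0) \cdot \prod_t p(s_{t+1}|x_t, a_t) \cdot \pi_i(a_t|x_t)} \right] \\
&= \mathbb{E}_{\pi(\tau)}\left[ \log \prod_t \frac{\pi(a_t|x_t)}{\pi_i(a_t|x_t)} \right] \\
&= \sum_t \mathbb{E}_{\pi(\tau)}\left[ D_{KL}(\pi(a_t|x_t) \,\|\, \pi_i(a_t|x_t)) \right] \\
&\le \sum_t \mathbb{E}_{\pi(\tau)}\left[ D_{KL}(\pi(a_t|x_t) \,\|\, \pi_i(a_t|x_t)) + \mathbb{E}_{\pi(a_t|x_t)}\left[ D_{KL}(\pi(z_t|x_t, a_t) \,\|\, \pi_i(z_t|x_t, a_t)) \right] \right] \\
&= \sum_t \mathbb{E}_{\pi(\tau)}\left[ \mathbb{E}_{\pi(a_t, z_t|x_t)}\left[ \log \frac{\pi(a_t|x_t)}{\pi_i(a_t|x_t)} + \log \frac{\pi(z_t|x_t, a_t)}{\pi_i(z_t|x_t, a_t)} \right] \right] \\
&= \sum_t \mathbb{E}_{\pi(\tau)}\left[ D_{KL}(\pi(a_t, z_t|x_t) \,\|\, \pi_i(a_t, z_t|x_t)) \right] \\
&= \sum_t \mathbb{E}_{\pi(\tau)}\Big[ D_{KL}\big(\pi^H(z_t|x_t) \,\|\, \pi^H_i(z_t|x_t)\big) + \mathbb{E}_{\pi^H(z_t|x_t)}\big[ D_{KL}\big(\pi^L(a_t|x_t, z_t) \,\|\, \pi^L_i(a_t|x_t, z_t)\big) \big] \Big]
\end{aligned}$$

B.4 POLICY GRADIENT LOWER BOUNDS

B.4.1 IMPORTANCE WEIGHTS DERIVATION

$$D^{q(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)) = ub(D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau)))$$

for $\beta^z_i, \beta^a_i = 1$, where $ub(D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau)))$ corresponds to the hierarchical upper bound introduced in Appendix A.2.

Proof

$$\begin{aligned}
ub(D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau))) &= \sum_t \mathbb{E}_{\pi(\tau)}\Big[ D_{KL}\big(\pi^H(z_t|x_t) \,\|\, \pi^H_i(z_t|x_t)\big) + \mathbb{E}_{\pi^H(z_t|x_t)}\big[ D_{KL}\big(\pi^L(a_t|x_t, z_t) \,\|\, \pi^L_i(a_t|x_t, z_t)\big) \big] \Big] \\
&= \sum_t \mathbb{E}_{q(\tau) \cdot \frac{\pi(\tau)}{q(\tau)}}\Big[ D_{KL}\big(\pi^H(z_t|x_t) \,\|\, \pi^H_i(z_t|x_t)\big) + \mathbb{E}_{\pi^H(z_t|x_t)}\big[ D_{KL}\big(\pi^L(a_t|x_t, z_t) \,\|\, \pi^L_i(a_t|x_t, z_t)\big) \big] \Big] \\
&= \sum_t \mathbb{E}_{q(\tau) \cdot \prod_{i=0}^{t} \frac{\pi(a_i|x_i)}{q(a_i|x_i)}}\Big[ D_{KL}\big(\pi^H(z_t|x_t) \,\|\, \pi^H_i(z_t|x_t)\big) + \mathbb{E}_{\pi^H(z_t|x_t)}\big[ D_{KL}\big(\pi^L(a_t|x_t, z_t) \,\|\, \pi^L_i(a_t|x_t, z_t)\big) \big] \Big] \\
&= D^{q(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau))
\end{aligned} \tag{17}$$

B.4.2 BEHAVIORAL CLONING UPPER BOUND

$$-D_{KL}(\pi(\tau) \| \pi_e(\tau)) \ge -\sum_{i \in \{0,u,e\}} D^{\pi_e(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)) \quad \text{for } \beta^z_i, \beta^a_i \ge 1 \tag{18}$$

Proof

$$\begin{aligned}
D_{KL}(\pi(\tau) \,\|\, \pi_e(\tau)) &\le \sum_{i \in \{0,u,e\}} D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau)) \\
&\le \sum_{i \in \{0,u,e\}} ub(D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau))) \\
&= \sum_{i \in \{0,u,e\}} D^{q(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)) \quad \text{for } \beta^z_i, \beta^a_i = 1 \\
&\le \sum_{i \in \{0,u,e\}} D^{q(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)) \quad \text{for } \beta^z_i, \beta^a_i \ge 1
\end{aligned}$$

The last line holds as each weighted term in $D^{q(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau))$ corresponds to a KL-divergence, which is non-negative.

B.4.3 REINFORCEMENT LEARNING UPPER BOUND

$$\mathbb{E}_{p(K),\, \pi_0(\tau)}\big[\log(O = 1 | \tau, k)\big] \ge \mathbb{E}_{\pi_b(\tau)}\Big[ \sum_t \nu^{\pi_b}[t] \cdot r_k(x_t, a_t) \Big] - \sum_{i \in \{0,u\}} D^{\pi_b(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)) \quad \text{for } \beta^z_i, \beta^a_i \ge 1 \text{ and } r_k < 0 \tag{20}$$

Proof

$$\begin{aligned}
\mathbb{E}_{p(K),\, \pi_0(\tau)}\big[\log(O = 1 | \tau, k)\big] &\ge \mathbb{E}_{\pi_b(\tau)}\Big[ \sum_t \nu^{\pi_b}[t] \cdot r_k(x_t, a_t) \Big] - D_{KL}(\pi(\tau) \,\|\, \pi_i(\tau)) \\
&\ge \mathbb{E}_{\pi_b(\tau)}\Big[ \sum_t \nu^{\pi_b}[t] \cdot r_k(x_t, a_t) \Big] - D^{\pi_b(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)), \quad \text{for } \beta^z_i, \beta^a_i = 1 \\
&\ge \mathbb{E}_{\pi_b(\tau)}\Big[ \sum_t \nu^{\pi_b}[t] \cdot r_k(x_t, a_t) \Big] - D^{\pi_b(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)), \quad \text{for } \beta^z_i, \beta^a_i \ge 1 \\
&\ge \mathbb{E}_{\pi_b(\tau)}\Big[ \sum_t \nu^{\pi_b}[t] \cdot r_k(x_t, a_t) \Big] - \sum_{i \in \{0,u\}} D^{\pi_b(\tau)}_{KL}(\pi(\tau) \| \pi_i(\tau)), \quad \text{for } \beta^z_i, \beta^a_i \ge 1
\end{aligned}$$

For a proof of line 1, see Abdolmaleki et al. (2018). The final 2 lines hold as KL-divergences are non-negative.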
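The change of measure used in the importance weights derivation (Appendix B.4.1) can be illustrated on a finite sample space. The sketch below is a hedged example with hypothetical function names: it checks that reweighting samples from $q$ by $\pi / q$ recovers an expectation under $\pi$, and shows how per-step ratios accumulate into a cumulative product of the kind denoted $\nu$.

```python
import numpy as np


def expectation(dist, values):
    """Exact expectation of `values` under the discrete distribution `dist`."""
    return float(np.dot(np.asarray(dist, dtype=float), np.asarray(values, dtype=float)))


def importance_weighted_expectation(q, pi, values):
    """E_q[(pi / q) * values], which equals E_pi[values] when q > 0 everywhere."""
    q = np.asarray(q, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return expectation(q, (pi / q) * np.asarray(values, dtype=float))


def cumulative_weights(step_ratios):
    """Cumulative products [z_1, z_1 * z_2, ...] of per-step importance ratios."""
    return np.cumprod(np.asarray(step_ratios, dtype=float))
```

The identity is exact for expectations, but sample-based estimates inherit the variance of the weights, which is the trade-off noted where importance weighting is introduced.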

C ENVIRONMENTS

Here we cover each environment setup in detail, including the expert setup used for data collection.

C.1 CORRIDORMAZE

Intuitively, the agent starts at the intersection of corridors, at the origin, and must traverse corridors, aligned with each dimension of the observation space, in a given ordering. This requires the agent to reach the end of the corridor (which we call a half-corridor cycle), and return back to the origin, before the corridor is considered complete.

$s \in \{0, l\}^c$, $p(s_0) = 0^c$, $k$ = one-hot task encoding, $a \in [0, 1]$,
$r^{semi\text{-}sparse}_k(x_t, a_t) = 1$ if the agent has correctly completed the entire or half-corridor cycle, else 0,
$r^{sparse}_k(x_t, a_t) = 1$ if the task is complete, else 0.

The task is considered complete when a desired ordering of corridors has been traversed. $c = 5$ represents the number of corridors in our experiments, and $l = 6$ the length of each corridor. Observations transition according to the deterministic transition function $s^{j_t}_{t+1} = f(s^{j_t}_t, a_t)$. $j_t$ corresponds to the index of the corridor that the agent is currently in (i.e. $j_t = 0$ if the agent is in corridor 0 at timestep $t$). $s^i$ corresponds to the $i$-th dimension of the observation. Observations transition incrementally or decrementally down a corridor, for a given observation dimension $s^i$, if actions fall into the corresponding transition action bins $\psi_{inc}$, $\psi_{dec}$. We define the transition function as follows:

$$f(s^{j_t}_t, a_t) = \begin{cases} s^{j_t}_t + 1, & \text{if } \psi^w_{inc}(a_t, j_t), \\ s^{j_t}_t - 1, & \text{elif } \psi^w_{dec}(a_t, j_t), \\ 0, & \text{otherwise,} \end{cases}$$

$$\psi^w_{inc}(a_t, j) = bool(a_t \in [j/c,\ (j + 0.5 \cdot w)/c]), \qquad \psi^w_{dec}(a_t, j) = bool(a_t \in [j/c,\ (2 \cdot j - 0.5 \cdot w)/c]).$$

The smaller the $w$ parameter, the narrower the distribution of actions that lead to transitions. As such, $w$, together with $r_k$, controls the exploration difficulty of task $k$. We set $w = 0.9$. We constrain the observation transitions so that they cannot leave the corridor boundaries.
Furthermore, if the agent is at the origin, $s = 0^c$ (at the intersection of corridors), then the transition function is run for all values of $j_t$, thereby allowing the agent to transition into any corridor.

C.1.1 EXPERT SETUP

The expert samples actions uniformly within the optimal action bin range, from Eq. ( 22), that leads to the optimal state transition, traversing the correct corridor in the correct direction, according to the task, which corridors have been traversed, and which remain.

C.2 STACK

This domain is adapted from the well-known gym robotics FetchPickAndPlace-v0 environment (Plappert et al., 2018). The following modifications were made: 1) 3 additional blocks were introduced, with different colours, and a goal pad; 2) object spawn locations were not randomized and were instantiated equidistantly around the goal pad, see Fig. 2; 3) the number of substeps was increased from 20 to 60, as this reduced episodic lengths; 4) a transparent hollow rectangular tube was placed around the goal pad, to simplify the stacking task and prevent stacked objects from collapsing due to structural instabilities; 5) the arm was always spawned over the goal pad, see Fig. 2; 6) the observation space corresponded to gripper position and grasp state, as well as the object positions and relative positions with respect to the arm: velocities were omitted as access to such information may not be realistic for real robotic systems. $k$ = one-hot task encoding, $r^{sparse}_k(x_t, a_t) = 1$ if the correct object has been placed on the stack in the correct ordering, else 0.

C.2.1 EXPERT SETUP

The expert is set up to stack the blocks in the given ordering. Each individual block stacking cycle consists of six segments: 1) Move the gripper position to a target location 20cm directly above the block, keeping the gripper open; 2) Vertically lower the gripper to 5mm over the block, keeping the gripper open; 3) Close the gripper until the object is grasped; 4) Vertically raise the gripper to 20cm above the initial block position, keeping the gripper closed; 5) Move the gripper to a target location 20cm above the target pad, with the gripper closed; 6) Open the gripper until the object is dropped onto the target pad. We use Mujoco's (Todorov et al., 2012) PD controller, given the target relative desired gripper position, which, coupled with Mujoco's inverse kinematics model, produces desired actions. We apply gains of 21 to these actions. Once the target location is reached for a given stage, we proceed to the next. When opening or closing the gripper we apply actions of 0.05, -0.1, for stages 3 and 6. For stage 3, we continue closing the gripper until there are contact forces between the gripper and cube. For stage 6, we continue opening the gripper until the block has dropped such that it is within 14cm of the target pad. To prevent fully deterministic samples, which can be problematic for behavioural cloning, we inject noise into the expert actions. Specifically, we add Gaussian noise with a diagonal covariance and standard deviation of 0.2 per dimension. We do not apply noise to the gripper closing or opening actions.

Table 6 (recovered column headers: $\pi^L_0$, $\pi^L$, $\pi^H$, KL level, $\pi^L = \pi^L_0$, reused modules):
APES: ✓, $x_{t-20:t}$, $s_t$, $s_t$, $x_k$, $z$, ✓, {$\pi^H_0$, $\pi^L_0$, $\pi^L$}
APES-H20: ✗, $x_{t-20:t}$, $s_t$, $s_t$, $x_k$, $z$, ✓, {$\pi^H_0$, $\pi^L_0$, $\pi^L$}
APES-H10: ✗, $x_{t-10:t}$, $s_t$, $s_t$, $x_k$, $z$, ✓, {$\pi^H_0$, $\pi^L_0$, $\pi^L$}
APES-H1: ✗, $x_{t-1:t}$, $s_t$, $s_t$, $x_k$, $z$, ✓, {$\pi^H_0$, $\pi^L_0$, $\pi^L$}
APES-S: ✗, $s_t$, $s_t$, $s_t$, $x_k$, $z$, ✓, {$\pi^H_0$, $\pi^L_0$, $\pi^L$}
APES-no prior: -, $s_t$, $x_k$, {$\pi^L$}
Hier-RecSAC: -, $s_t$, $x_k$
RecSAC: -, $x_k$
APES-H1-KL-a: ✗, $x_{t-1:t}$, $s_t$, $s_t$, $x_k$, $a$, ✗, {$\pi^H_0$, $\pi^L_0$}
APES-H1-flat: ✗, $x_{t-1:t}$, $s_t$, $x_k$, $a$, ✗, {$\pi^L_0$}
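The expert noise injection described in Appendix C.2.1 can be sketched as below. This is an illustrative sketch only: it assumes zero-mean noise and that the gripper open/close command is the last action dimension, neither of which is stated explicitly in the text, and `noisy_expert_action` is a hypothetical name.

```python
import numpy as np


def noisy_expert_action(action, rng, std=0.2, gripper_index=-1):
    """Add independent Gaussian noise to every action dimension except the gripper.

    action:        1-D expert action (assumed layout: positions..., gripper).
    rng:           a numpy random Generator, e.g. np.random.default_rng(seed).
    std:           per-dimension standard deviation (0.2 in the text).
    gripper_index: index of the gripper command, left noise-free (assumption).
    """
    action = np.asarray(action, dtype=float)
    # Zero-mean noise is an assumption; diagonal covariance means independent dims.
    noise = rng.normal(loc=0.0, scale=std, size=action.shape)
    noise[gripper_index] = 0.0
    return action + noise
```

A call such as `noisy_expert_action([0.1, 0.2, 0.3, -0.1], np.random.default_rng(0))` perturbs the three positional entries and leaves the final gripper entry unchanged.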

D EXPERIMENTAL SETUP

We provide the reader with the experimental setup for all training regimes and environments below. We build on the softlearning code base (Haarnoja et al., 2018b). Algorithmic details not mentioned in the following sections are omitted as they are kept the same as in the original code base. For all experiments, we sample a batch-size number of entire episodes of experience during training.

D.1 MODEL ARCHITECTURES

We continue by outlining the shared model architectures across domains and experiments. Each policy network (e.g. $\pi^H$, $\pi^L$, $\pi^H_0$, $\pi^L_0$) is comprised of a feedforward module outlined in Table 5. The softlearning repository that we build on (Haarnoja et al., 2018b) applies tanh activation over network outputs, where appropriate, to match the predefined output ranges of any given module. The critic is also comprised of the same feedforward module, but is not hierarchical. To handle historical inputs, we tile the inputs and flatten them into one large 1-dimensional input array. We ensure the input always remains of fixed size by appropriately left-padding with zeros. For $\pi^H$, $\pi^H_0$ we use a categorical latent space of size 10. We found this dimensionality sufficed for expressing the diverse behaviours exhibited in our domains. Table 6 describes the setup for all the experiments in the main paper, including inputs to each module, the level over which KL-regularization occurs ($z$ or $a$), which modules are shared (e.g. $\pi^L$ and $\pi^L_0$), and which modules are reused across sequential tasks. For the covariate-shift designed experiments in Table 8, we additionally reuse $\pi^H$ (or $\pi^L$ for RecSAC) across domains, with input $x_t$. For all the above experiments, any reused modules are not given access to task-dependent information, namely task-id ($k$) and exteroceptive information (cube locations for the Stack domain). This choice ensures reused modules generalize across task instances.
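The handling of historical inputs described above (tiling, flattening, and left-padding with zeros to a fixed size) can be sketched as follows. This is a minimal illustration with hypothetical names, assuming each timestep contributes a fixed-length 1-D vector.

```python
import numpy as np


def flatten_history(history, max_len, step_dim):
    """Flatten the most recent `max_len` steps into one fixed-size 1-D array.

    history:  list of per-step 1-D vectors of length `step_dim`, oldest first.
    max_len:  number of timesteps the network input can hold.
    step_dim: dimensionality of each per-step vector.
    Shorter histories are left-padded with zeros so the newest step is last.
    """
    recent = [np.asarray(step, dtype=float) for step in history[-max_len:]]
    padded = np.zeros((max_len, step_dim))
    if recent:
        padded[max_len - len(recent):] = np.stack(recent)
    return padded.reshape(-1)
```

Left-padding keeps the most recent timestep at a fixed position in the input vector regardless of how much history is available early in an episode.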

D.2 BEHAVIOURAL CLONING

For the BC setup, we use a deterministic, noisy, expert controller to create experience to learn from. We apply DAGGER (Ross et al., 2011) during data collection and training of policy $\pi$, as we found this helped achieve a high success rate at the BC tasks. During data collection, our DAGGER setup intermittently, at a predefined rate, samples an action from $\pi$ instead of $\pi_e$, but still saves the BC target action $a_t$ as the one that would have been taken by the expert for $x_k$. This setup helps mitigate covariate shift during training, between policy and expert. Noise levels were chosen to be small enough that the expert still succeeded at the task. We trained our policies for one epoch (once over each collected data sample in expectation). It may be possible to be more sample efficient, by increasing the ratio of gradient steps to data collection, but we did not explore this direction. The interplay we use between data collection and training over the collected experience is akin to the RL paradigm. We build on the softlearning code base (Haarnoja et al., 2018b), so please refer to it for details regarding this interplay. Refer to Table 7a for BC algorithmic details. It is important to note here that, although we report five $\beta$ hyper-parameter values, there are only two degrees of freedom. As we stop gradients flowing from $\pi_0$ to $\pi$, the choice of $\beta_0$ is unimportant (as long as it is not 0), as it does not influence the interplay between gradients from individual loss terms. We set these values to 1. $\beta^a_e$'s absolute value is also unimportant; only its value relative to $\beta^z_u$ and $\beta^a_u$ matters. We also set $\beta^a_e$ to 1. For the remaining two hyper-parameters, $\beta^z_u$, $\beta^a_u$, we performed a hyper-parameter sweep over three orders of magnitude, with three values across each dimension, to obtain the reported optimal values. In practice $\pi_0 = \{\pi_i\}_{i \in \{0, ..., N\}}$: multiple trained priors, each sharing the same $\beta_0$ hyper-parameters.
For $\alpha_m$, denoting the hyper-parameter in Equation (3) weighing the relative contribution of $\pi^H_0$'s self-attention entropy objective $IGF(x_k)$ with relation to the remainder of the RL/BC objectives, we performed a sweep over three orders of magnitude (1e0, 1e-1, 1e-2). This sweep was run independently of all the other sweeps, using the optimal setup for all other hyper-parameters. We chose the hyper-parameter with the lowest combination of $D_{KL}(\pi^H \| \pi^H_0)$ and $H(m)$. Four seeds were run, as for all experiments. We observed very small variation in learning across the seeds, and used the best performing seed to bootstrap from for transfer. We separately also performed a hyper-parameter sweep over the $\pi$ learning rate, in the same way as before. We did not perform a sweep for batch size. We found, for both BC and RL setups, that conditioning on the entire history for $\pi^H$ was not always necessary, and sometimes hurt performance. We state the history lengths used for $\pi^H$ for BC in Table 7a. This value was also used for both $\pi^H$ and $Q$ for the RL setup. We prevent gradient flow from $\pi_0$ to $\pi$, to ensure as fair a comparison between ablations as possible: each prior distils knowledge from the same, high performing, policy $\pi$ and dataset. If we simultaneously trained multiple $\pi$ and $\pi_0$ pairs (for each distinct prior), it is possible that different learnt priors would influence the quality of each policy $\pi$ from which knowledge is distilled. In this paper, we are not interested in investigating how priors affect $\pi$ during BC, but instead how priors influence what knowledge can be distilled and transferred. We observed prior KL-distillation loss convergence across tasks and seeds, ensuring a fair comparison.

D.3 REINFORCEMENT LEARNING

During this stage of training we freeze the prior and low-level policy (if applicable, depending on the ablation). In general, any modules reused across sequential tasks are frozen (apart from $\pi^H$ for the covariate shift experiments in Table 8). Any modules that are not shared (such as $\pi^H$ for most experiments) are initialized randomly across tasks. The RL setup is akin to the softlearning repository (Haarnoja et al., 2018b) that we build on. We note any changes in Table 7b. We regularize at the latent or action level depending on the ablation: whether or not our models are hierarchical, share low-level policies, or use pre-trained modules (low-level policy and prior). Therefore, we only ever use, at most, two $\beta$ hyper-parameters. Hyper-parameter sweeps are performed in the same way as previously. We did not sweep over Retrace $\lambda$, batch size, or episodic length. For Retrace, we clip the importance weights between [0, 1], as in Munos et al. (2016), and perform $\lambda$-returns rather than n-step returns. We found the Retrace operator important for sample-efficient learning.

We plot the full attention maps for APES in Figure 7, including intra-observation and action attention. For CorridorMaze, attention is primarily paid to the most recent action $a_t$. For most observations in the environment (excluding end-of-corridor or corridor-intersection observations), conditioning on the previous action suffices to infer the optimal next action (e.g. continue traversing the depths of a corridor). Therefore, it is understandable that APES has learnt such an attention mechanism. For the remainder of the environment observations (such as corridor ends), conditioning on (a history of) observations is necessary to infer optimal behaviour. As such, APES pays some attention to observations. For Stack, attention is primarily paid to a short recent history of actions, with the weights decaying further into the past.
Interestingly, attention over actions corresponding to opening/closing the gripper (the bottom row of Figure 7) decays much more quickly, suggesting that this information is redundant. This makes sense, as there exists a strong correlation between gripper actions at successive time-steps, but this correlation decays very quickly. Additionally, APES does not pay attention to observations corresponding to gripper position (the final 3 observation rows in Figure 7), as this can be inferred from the remainder of the observation-space as well as the recent history of gripper actions.

In Table 8 we report the additional return (over a randomly initialised $\pi^H$) achieved by pre-training (over $p_{source}(K)$) and transferring a task-agnostic high-level policy $\pi^H$ during transfer ($p_{trans}(K)$). For experiments that are not hierarchical we pre-train an equivalent non-hierarchical agent. Theorem 3.1 suggests we would expect a larger improvement in transfer performance for priors that condition on more information. Table 8 confirms this trend, demonstrating the importance of prior covariate shift in the transferability of behaviours. This trend is less apparent for the semi-sparse domain. Additionally, for the interpolated transfer task (2 corridor), the solution is entirely in the support of the training set of tasks. Naïvely, one would expect pre-training to fully recover lost performance and match the most performant method. However, this is not the case, as the critic, trained solely on the transfer task, quickly encourages sub-optimal out-of-distribution behaviours.

final policy rollouts (episodes vertically, rollouts horizontally) for most performant method. Task: (Blue, Orange, Green, Red) corridors. Policy distribution over categorical latent space for $\pi^H$ and $\pi^H_0$ plotted (between policy rollouts, denoted by $x_t$) as vertical histograms, with colour and width denoting category and probability.

E.3 FINAL POLICY ROLLOUTS

In this section, we show final policy performance (in terms of episodic rollouts) for APES-H1, across each transfer domain. We additionally display the categorical probability distributions for $\pi^H(x_k)$ and $\pi^H_0(F(x_k))$ across each rollout to analyse the behaviour of each. $F(\cdot)$ denotes the chosen information gating function for the prior (referred to as $IGF(\cdot)$ in the main text). For CorridorMaze 2 and 4 corridor, seen in Figs. 8 and 9 (a), we see that the full method successfully solves each respective task, traversing the corridors in the correct order. The categorical distributions for these domains remain relatively entropic. In general, latent categories cluster into those that lead the agent deeper down a corridor, and those that return the agent to the hallway. Policy and prior align their categorical distributions in general, as expected. Interestingly, however, the two categorical distributions deviate the most from each other at the hallway, the bottleneck state (Sutton et al., 1999), where prior multimodality (for hierarchical $\pi_0$) exists most (e.g. which corridor to traverse next). In this setting, the policy needs to deviate from the multimodal prior, and traverse only the optimal next corridor. We also observe, for the hallway, that the prior allocates one category to each of the five corridors. Such behaviour would not be possible with a flat prior. Fig. 9 (b) plots the same information for Stack 4 blocks. APES-H1 successfully solves the transfer task, stacking all blocks according to their masses. Similar categorical latent-space trends exist for this domain as for the previous one. Most noteworthy is the behaviour of both policy and prior at the bottleneck state, the location above the block stack, where blocks are placed. This location is visited five times within the episode: at the start $s_0$, and four more times upon each stacked block.
Interestingly, for this state, the prior becomes increasingly less entropic upon each successive visit. This suggests that the prior has learnt that the number of feasible high-level actions (corresponding to which block to stack next) reduces upon each visit, as there remain fewer lighter blocks to stack. It is also interesting that for $s_0$, the red categorical value is more favoured than the rest. Here, the red categorical value corresponds to moving towards cube 0, the heaviest cube. This behaviour is as expected, as during BC this cube was stacked first more often than the others, given its mass. For this domain, akin to CorridorMaze, the policy deviates most from the prior at the bottleneck state, as here it needs to behave deterministically (regarding which block to stack next).



Concurrent with our research, Bagatella et al. (2022) also investigated various IAs for sequential transfer.
For the proof, refer to Appendix B.3.



Figure 1: Hierarchical KL-regularized architecture. The hierarchical policy modules $\pi^H$ and $\pi^L$ are regularized against their corresponding prior modules $\pi^H_i$ and $\pi^L_i$. The inputs to each module are filtered by an information gating function (IGF), depicted with colored rectangles.

Figure 2: Environments. a) CorridorMaze: The agent starts in the hallway and must traverse a given sequence of corridors. The agent completes a corridor by traversing to its depth and back. b) Stack: The agent must stack the cubes in a given ordering over the light blue target pad.

shifts (see Theorem 3.1 for a definition). For multi-task learning, trajectory shifts are gradual and skills are continually retrained over such shifts, alleviating transfer issues. In general, IA plays a crucial role, influencing the level of covariate shift encountered by the prior during learning:

Theorem 3.1. The more random variables a network depends on, the larger the covariate shift (input distributional shift, here represented by KL-divergence) encountered across sequential tasks. That is, for distributions $p$, $q$ and inputs $b$, $c$ such that $b = (b_0, b_1, \ldots, b_n)$ and $c \subset b$:
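As a numerical sanity check on the statement of Theorem 3.1, the sketch below assumes the inequality takes the standard form $D_{KL}(p(b) \| q(b)) \geq D_{KL}(p(c) \| q(c))$ for $c \subset b$, i.e. that marginalising out variables can never increase the KL divergence. This form is our assumption based on the theorem's wording. The two random joint distributions stand in for the input distributions of two sequential tasks and are purely synthetic.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for strictly positive arrays of equal shape."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    return float(np.sum(p * np.log(p / q)))

def random_joint(rng, shape):
    """Random strictly positive joint distribution over discrete variables."""
    w = rng.random(shape) + 1e-3
    return w / w.sum()

rng = np.random.default_rng(0)
shape = (3, 3, 3)  # three discrete inputs b = (b_0, b_1, b_2)

p = random_joint(rng, shape)  # synthetic "source task" input distribution
q = random_joint(rng, shape)  # synthetic "transfer task" input distribution

kl_b0_b1_b2 = kl(p, q)                                # network sees all of b
kl_b0_b1 = kl(p.sum(axis=2), q.sum(axis=2))           # b_2 gated out
kl_b0 = kl(p.sum(axis=(1, 2)), q.sum(axis=(1, 2)))    # only b_0 kept

print("KL over (b0, b1, b2):", kl_b0_b1_b2)
print("KL over (b0, b1):    ", kl_b0_b1)
print("KL over (b0):        ", kl_b0)
```

Under this reading, gating inputs out of the prior (increasing IA) can only reduce the covariate shift it faces, which is the transferability side of the trade-off.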

Figure 3: Skill-level exploration for Sparse 2 corr. 4 rollouts, episodes unrolled horizontally. Corridors are colour coded and depth within them is denoted by shade; the darker the deeper. Hallway is white. Only the full setup leads to corridor-level exploration.

(2) How important is the IA choice between high-level policy and prior? Does it lead to an impactful expressivity-transferability trade-off? In practice, how detrimental is covariate shift for transfer? (3) How favourably does APES automate the choice of IA for effective transfer? Which IAs are discovered? (4) How important is hierarchy for transferring expressive skills? Is hierarchy necessary? We compare against competitive skill transfer baselines, primarily Tirumala et al. (2019; 2020); Pertsch et al. (2020); Wulfmeier et al. (

APES-H1 and APES-S are hierarchical extensions of Bagatella et al. (2022) and Galashov et al. (2019), respectively, and APES-H20 (representing entire history conditioning) is from Tirumala et al. (2020). APES-S is also an extension of Pertsch et al. (

Figure 4: Expressivity-Transferability Trade-Off. Distillation loss vs transfer performance increase when pre-training $\pi^H$. The level of information conditioning is shown by marker size. Absolute transfer performance is shown by marker colour (red: high; blue: low). We fit a second-order curve (in red) to show the general expressivity-transferability trend, per Theorems 3.1 and 3.2, that distilling more, by reducing IA, can hurt transfer, due to increased covariate shift. See Table 8 for detailed tabular results.

Figure 5: APES attention for $\pi^H_0$, plotted as $\log_{10}(m)$, for each domain (key on right; red and blue as high and low values). We aggregate attention to the observation/action levels, e.g. $\log_{10}(m_{s_t}) = \log_{10}\left(\sum_{i=0}^{\dim(S)} m^i_{s_t}\right)$ (with $m_{s_t}$ the aggregate attention for $s_t$ and $m^i_{s_t}$ the attention for the $i$-th dimension of $s_t$). For intra-observation/action attention visualisation, reporting values per individual dimension, see Figure 7. APES learns sparse, domain-dependent attention, primarily focusing on recent actions.
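The aggregation described in the Figure 5 caption can be sketched as follows, assuming the per-dimension attention weights are summed within each observation or action before taking $\log_{10}$. The function name, the input layout and the numeric weights are hypothetical and serve only to illustrate the computation.

```python
import numpy as np

def aggregate_log_attention(m, layout):
    """Sum per-dimension attention within each input block, then take log10.

    m:      1D array of per-dimension attention weights (all positive).
    layout: ordered list of (block_name, block_size) pairs covering m.
    """
    m = np.asarray(m, dtype=float)
    assert sum(size for _, size in layout) == m.size, "layout must cover m"
    aggregated = {}
    start = 0
    for name, size in layout:
        aggregated[name] = float(np.log10(m[start:start + size].sum()))
        start += size
    return aggregated

# Hypothetical attention over a 3-dimensional state s_t and a
# 3-dimensional previous action a_{t-1}; most mass sits on the action.
m = np.array([0.001, 0.002, 0.007, 0.30, 0.60, 0.09])
layout = [("s_t", 3), ("a_{t-1}", 3)]

agg = aggregate_log_attention(m, layout)
for name, value in agg.items():
    print(f"log10(m_{name}) = {value:.3f}")
```

Plotting on a log scale makes sparse attention easy to read: inputs that are effectively ignored sit orders of magnitude below those the prior relies on.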

Figure 6: Skill-level exploration for 4 Blocks. End-effector and block rollouts are depicted by dotted and dashed lines respectively, colour-coded per rollout. The end positions for gripper and blocks are represented by crosses and cubes, respectively. Cube numbers represent their mass order, with 0 the heaviest. APES explores at the stacking level whilst hierarchy alone is unable to stack. See Figure 10 for video rollouts.

, limiting expressivity. Unlike APES, Singh et al. (2020); Bagatella et al. (2022) leverage flow-based transformations to achieve multi-modality. Unlike many previous works, we consider the POMDP setting, arguably more suited for robotics, and learn the information conditioning of priors based on our expressivity-transferability theorems. Whilst most previous works rely on IA, the choice is primarily motivated by intuition. For example, Igl et al. (2019); Wulfmeier et al. (2019; 2020) only employ task or goal asymmetry and Tirumala et al. (2019); Merel et al. (2020); Galashov et al. (2019) use exteroceptive asymmetry. Salter et al.

Figure 7: APES attention for $\pi^H_0$, plotted as $\log_{10}(m)$, for each domain (key on right; red and blue as high and low values). APES learns sparse, domain-dependent attention.

Figure 8: CorridorMaze 4 corridor. 10 final policy rollouts (episodes vertically, rollouts horizontally) for most performant method. Task: (Blue, Orange, Green, Red) corridors. Policy distribution over categorical latent space for $\pi^H$ and $\pi^H_0$ plotted (between policy rollouts, denoted by $x_t$) as vertical histograms, with colour and width denoting category and probability.

Previously Explored IAs

Average return across 100 episodes (mean ± standard deviation for 7 random seeds). Experiments that do not ((Hier-)RecSAC) / do leverage prior experience were run for $10^6$ vs $1.5 \times 10^5$ environment steps. APES employs hierarchy, priors and learns $m$ for $\pi^H_0$. APES-{H20 -S} do not learn $m$. The remainder do not employ priors.

$H(m)$ vs $D_{KL}(\pi^H \| \pi^H_0)$

Hierarchy Ablation

Algorithm 1: APES training regime
# Full training and transfer regime. For BC, gradients are prevented from flowing from $\pi_0$ to $\pi$. In practice $\pi_0 = \{\pi_i\}_{i \in \{0, \ldots, N\}}$, multiple trained priors. During transfer, $\pi^H$ is reinitialized.
R_bc, env ← collect(π, R_bc, env, True, r)
← BC update(π, π_0, R_bc)  # Eq. 4
end for
# Reinforcement Learning
Initialize: high level policy $\pi^H$, critics $Q_{k \in \{1,2\}}$, replay R_rl, transfer environment env_t
for Number of RL training steps do

Feedforward Module, π

Full experimental setup. Describes inputs to each module, level over which KL-regularization occurs, which modules are shared, and which are reused across training and transfer tasks.

Full Training Setup

Covariate Shift Analysis. In general, reduced IA benefits more from reduced covariate shift. Sparse domains suffer more from shift, seen by clearer IA covariate trends.


Figure 9: Final transfer performance for most performant method. Tasks are solved. Displaying latent distribution for both policy and prior. For both domains, policy deviates most from prior at the bottleneck state (hallway/stacking zone, for CorridorMaze/Stack), where prior multimodality exists (e.g. which corridor/block to stack next), but where determinism is required for the task at hand.

