PRIORS, HIERARCHY, AND INFORMATION ASYMMETRY FOR SKILL TRANSFER IN REINFORCEMENT LEARNING

Abstract

The ability to discover behaviours from past experience and transfer them to new tasks is a hallmark of intelligent agents acting sample-efficiently in the real world. Equipping embodied reinforcement learners with the same ability may be crucial for their successful deployment in robotics. While hierarchical and KL-regularized reinforcement learning individually hold promise here, arguably a hybrid approach could combine their respective benefits. Key to both fields is the use of information asymmetry across architectural modules to bias which skills are learnt. While the choice of asymmetry has a large influence on transferability, existing methods base this choice primarily on intuition in a domain-independent, potentially suboptimal, manner. In this paper, we theoretically and empirically demonstrate the crucial expressivity-transferability trade-off of skills across sequential tasks, controlled by information asymmetry. Given this insight, we introduce Attentive Priors for Expressive and Transferable Skills (APES), a hierarchical KL-regularized method that benefits heavily from both priors and hierarchy. Unlike existing approaches, APES automates the choice of asymmetry by learning it in a data-driven, domain-dependent way based on our expressivity-transferability theorems. Experiments over complex transfer domains of varying levels of extrapolation and sparsity, such as robot block stacking, demonstrate the criticality of the correct asymmetry choice, with APES drastically outperforming previous methods.

1. INTRODUCTION

While Reinforcement Learning (RL) algorithms have recently achieved impressive feats across a range of domains (Silver et al., 2017; Mnih et al., 2015; Lillicrap et al., 2015), they remain sample-inefficient (Abdolmaleki et al., 2018; Haarnoja et al., 2018b) and are therefore of limited use for real-world robotics applications. Over their lifetimes, intelligent agents discover and reuse skills at multiple levels of behavioural and temporal abstraction to efficiently tackle new situations. For example, in manipulation domains, beneficial abstractions could include low-level instantaneous motor primitives as well as higher-level object-manipulation strategies. Endowing lifelong-learning RL agents (Parisi et al., 2019) with a similar ability could be vital for attaining comparable sample efficiency. To this end, two paradigms have recently been introduced. KL-regularized RL (Teh et al., 2017; Galashov et al., 2019) presents an intuitive approach for automating skill reuse in multi-task learning: by regularizing policy behaviour against a learnt task-agnostic prior, behaviours common across tasks are distilled into the prior, which encourages their reuse. Concurrently, hierarchical RL also enables skill discovery (Wulfmeier et al., 2019; Merel et al., 2020; Hausman et al., 2018; Haarnoja et al., 2018a; Wulfmeier et al., 2020) by considering a two-level hierarchy in which the high-level policy is task-conditioned whilst the low-level remains task-agnostic; the lower level of the hierarchy therefore also discovers skills that are transferable across tasks. Both hierarchy and priors offer their own skill abstraction, and when combined, hierarchical KL-regularized RL can discover multiple abstractions. Whilst prior methods attempted this (Tirumala et al., 2019; 2020; Liu et al., 2022; Goyal et al., 2019), the transfer benefits from learning both abstractions varied drastically, with approaches like Tirumala et al. (2019) unable to yield performance gains.
In fact, successful transfer for the aforementioned hierarchical and KL-regularized methods critically depends on the correct choice of information asymmetry (IA). IA refers to an asymmetric masking of information across architectural modules. This masking forces independence from, and ideally generalisation across, the masked dimensions (Galashov et al., 2019). For example, in self-driving, conditioning the prior only on proprioceptive information leads it to discover skills that are independent of, and shared across, global coordinate frames. In manipulation, by not conditioning on certain object information, such as shape or weight, the robot learns generalisable grasping that is independent of these factors. IA therefore biases which behaviours are learnt and how they transfer across environments. Previous works predefined their IAs, chosen primarily on intuition and independently of the domain. In addition, the previously explored asymmetries were narrow (Table 1), which, if sub-optimal, limits transfer benefits. We demonstrate that this is indeed the case for many methods on our domains (Galashov et al., 2019; Bagatella et al., 2022; Tirumala et al., 2019; 2020; Pertsch et al., 2021; Wulfmeier et al., 2019). A more systematic approach for choosing IA, one that is theoretically grounded, data-driven, and domain-dependent, is thus required to maximally benefit from skills for transfer learning.
In this paper, we employ hierarchical KL-regularized RL to effectively transfer skills across sequential tasks. We begin by theoretically and empirically demonstrating the crucial expressivity-transferability trade-off, controlled by the choice of IA, of skills across sequential tasks for hierarchical KL-regularized RL. Our expressivity-transferability theorems state that conditioning skill modules on too little or too much information, such as only the current observation or the entire history, can both be detrimental to transfer, since the discovered skills are either too general (e.g. motor primitives) or too specialised (e.g. non-transferable task-level behaviours). We demonstrate this by ablating over a wide range of asymmetries between the hierarchical policy and prior, and show the inefficiencies of previous methods that choose highly sub-optimal IAs for our domains, drastically limiting transfer performance. Given this insight, we introduce APES, 'Attentive Priors for Expressive and Transferable Skills', a method that forgoes user intuition and automates the choice of IA in a data-driven, domain-dependent manner. APES builds on our expressivity-transferability theorems to learn the choice of asymmetry between policy and prior. Specifically, APES conditions the prior on the entire history, allowing expressive skills to be discovered, and learns a low-entropy attention mask over the input, attending only where necessary, to minimise covariate shift and improve transferability across domains. Experiments over domains of varying levels of sparsity and extrapolation, including complex robot block stacking, demonstrate APES' consistently superior performance over existing methods, whilst automating the choice of IA and bypassing arduous IA sweeps. Further ablations show the importance of combining hierarchy and priors for discovering expressive multi-modal behaviours.
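APES' learnt mask can be pictured as a sigmoid gate per input dimension whose binary entropy is penalised, so that the prior attends to as few input dimensions as possible. The following is a minimal numpy sketch of this idea; the function names and the exact entropy penalty are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def attention_mask(logits):
    """Per-dimension soft mask over the prior's input (sigmoid of learnt logits)."""
    return 1.0 / (1.0 + np.exp(-logits))

def mask_entropy(mask, eps=1e-8):
    """Mean binary entropy of the mask. Adding this term as a penalty drives
    each gate towards 0 or 1, yielding a low-entropy, near-binary mask."""
    return float(np.mean(-mask * np.log(mask + eps)
                         - (1.0 - mask) * np.log(1.0 - mask + eps)))

def prior_input(full_history, logits):
    """The prior sees the full history, scaled elementwise by the learnt mask,
    so masked-out dimensions are suppressed before reaching the prior."""
    return full_history * attention_mask(logits)
```

Minimising the entropy term alongside the transfer objective is what lets the asymmetry be chosen by the data rather than by hand: dimensions whose gates collapse to zero are exactly the information the prior is made independent of.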

2. SKILL TRANSFER IN REINFORCEMENT LEARNING

We consider multi-task reinforcement learning in Partially Observable Markov Decision Processes (POMDPs), defined by M_k = (S, X, A, r_k, p, p^0_k, γ), with tasks k sampled from p(K). S, A, and X denote the observation, action, and history spaces respectively. The dynamics model p(x'|x, a) : X × X × A → R_{≥0} governs transitions between histories. We denote the history of observations s ∈ S and actions a ∈ A up to timestep t as x_t = (s_0, a_0, s_1, a_1, . . . , s_t). The reward function r_k : X × A × K → R is history-, action-, and task-dependent.
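As a concrete illustration of the notation, a history x_t interleaves the t+1 observations seen so far with the t actions taken between them. A minimal sketch in Python (the function name `history` is our own, not from the paper):

```python
def history(observations, actions):
    """Build x_t = (s_0, a_0, s_1, a_1, ..., s_t).

    A history contains t+1 observations and the t actions taken between
    them, so observations must outnumber actions by exactly one.
    """
    assert len(observations) == len(actions) + 1
    x = []
    for s, a in zip(observations, actions):
        x.extend([s, a])
    x.append(observations[-1])
    return tuple(x)
```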

2.1. HIERARCHICAL KL-REGULARIZED REINFORCEMENT LEARNING

The typical multi-task KL-regularized RL objective (Todorov, 2007; Kappen et al., 2012; Rawlik et al., 2012; Schulman et al., 2017) takes the form:

L(π, π_0) = E_{τ∼p_π(τ), k∼p(K)} [ Σ_{t=0}^∞ γ^t ( r_k(x_t, a_t) − α_0 D_KL(π(a|x_t, k) ∥ π_0(a|x_t)) ) ]    (1)

where γ is the discount factor and α_0 weighs the individual objective terms. π and π_0 denote the task-conditioned policy and task-agnostic prior respectively. The expectation is taken over tasks and over trajectories τ induced by policy π and the initial observation distribution p^0_k(s_0), i.e. p_π(τ); the summation over t runs across all episodic timesteps. When optimised with respect to π, this objective can be viewed as a trade-off between maximising rewards and remaining close to trajectories produced by π_0. When π_0 is learnt, it can capture shared behaviours across tasks and bias multi-task exploration (Teh et al., 2017). We consider the sequential learning paradigm, where skills are learnt from past tasks, p_source(K), and leveraged while attempting the transfer set of tasks, p_trans(K).

While KL-regularized RL has achieved success across various settings (Abdolmaleki et al., 2018; Teh et al., 2017; Pertsch et al., 2020; Haarnoja et al., 2018a), Tirumala et al. (2019) recently proposed a hierarchical extension in which policy π and prior π_0 are augmented with latent variables:

π(a, z|x, k) = π^H(z|x, k) π^L(a|z, x),    π_0(a, z|x) = π^H_0(z|x) π^L_0(a|z, x),

where superscripts H and L denote the higher and lower hierarchical levels. This structure encourages the shared low-level policy (π^L = π^L_0) to discover task-agnostic behavioural primitives, whilst the high-level discovers higher-level task-relevant skills. By not conditioning the high-level prior on task-id, Tirumala et al.
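For intuition, the regularized return in Eq. (1) can be estimated along a sampled trajectory by subtracting the weighted KL term from each reward before discounting. A minimal sketch assuming diagonal-Gaussian policy and prior; the function names are illustrative, not from the paper:

```python
import numpy as np

def diag_gaussian_kl(mu_p, sig_p, mu_q, sig_q):
    """D_KL(N(mu_p, diag(sig_p^2)) || N(mu_q, diag(sig_q^2))),
    summed over action dimensions (closed form for diagonal Gaussians)."""
    return float(np.sum(np.log(sig_q / sig_p)
                        + (sig_p**2 + (mu_p - mu_q)**2) / (2.0 * sig_q**2)
                        - 0.5))

def kl_regularized_return(rewards, kls, gamma=0.99, alpha0=0.1):
    """Monte-Carlo estimate of Eq. (1) along one trajectory:
    sum_t gamma^t * (r_k(x_t, a_t) - alpha0 * KL_t)."""
    return sum(gamma**t * (r - alpha0 * kl)
               for t, (r, kl) in enumerate(zip(rewards, kls)))
```

When the policy matches the prior at every step, each KL term vanishes and the objective reduces to the ordinary discounted return; any deviation from the prior is paid for at rate α_0 per nat.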



* Equal contribution. Correspondence to: sasha.salter@hotmail.com.

