PRIORS, HIERARCHY, AND INFORMATION ASYMMETRY FOR SKILL TRANSFER IN REINFORCEMENT LEARNING

Abstract

The ability to discover behaviours from past experience and transfer them to new tasks is a hallmark of intelligent agents acting sample-efficiently in the real world. Equipping embodied reinforcement learners with the same ability may be crucial for their successful deployment in robotics. While hierarchical and KL-regularized reinforcement learning individually hold promise here, arguably a hybrid approach could combine their respective benefits. Key to both fields is the use of information asymmetry across architectural modules to bias which skills are learnt. While the choice of asymmetry has a large influence on transferability, existing methods base this choice primarily on intuition, in a domain-independent and potentially suboptimal manner. In this paper, we theoretically and empirically demonstrate the crucial expressivity-transferability trade-off of skills across sequential tasks, controlled by information asymmetry. Given this insight, we introduce Attentive Priors for Expressive and Transferable Skills (APES), a hierarchical KL-regularized method that benefits substantially from both priors and hierarchy. Unlike existing approaches, APES automates the choice of asymmetry by learning it in a data-driven, domain-dependent way, based on our expressivity-transferability theorems. Experiments over complex transfer domains of varying levels of extrapolation and sparsity, such as robot block stacking, demonstrate the criticality of the correct asymmetry choice, with APES drastically outperforming previous methods.

1. INTRODUCTION

While Reinforcement Learning (RL) algorithms have recently achieved impressive feats across a range of domains (Silver et al., 2017; Mnih et al., 2015; Lillicrap et al., 2015), they remain sample inefficient (Abdolmaleki et al., 2018; Haarnoja et al., 2018b) and are therefore of limited use for real-world robotics applications. During their lifetime, intelligent agents discover and reuse skills at multiple levels of behavioural and temporal abstraction to efficiently tackle new situations. For example, in manipulation domains, beneficial abstractions could include low-level instantaneous motor primitives as well as higher-level object manipulation strategies. Endowing lifelong learning RL agents (Parisi et al., 2019) with a similar ability could be vital towards attaining comparable sample efficiency. To this end, two paradigms have recently been introduced. KL-regularized RL (Teh et al., 2017; Galashov et al., 2019) presents an intuitive approach for automating skill reuse in multi-task learning. By regularizing policy behaviour against a learnt task-agnostic prior, common behaviours across tasks are distilled into the prior, which encourages their reuse. Concurrently, hierarchical RL also enables skill discovery (Wulfmeier et al., 2019; Merel et al., 2020; Hausman et al., 2018; Haarnoja et al., 2018a; Wulfmeier et al., 2020) by considering a two-level hierarchy in which the high-level policy is task-conditioned, whilst the low level remains task-agnostic. The lower level of the hierarchy therefore also discovers skills that are transferable across tasks. Both hierarchy and priors offer their own skill abstraction. However, when combined, hierarchical KL-regularized RL can discover multiple abstractions. Whilst prior methods have attempted this (Tirumala et al., 2019; 2020; Liu et al., 2022; Goyal et al., 2019), the transfer benefits from learning both abstractions varied drastically, with approaches like Tirumala et al. (2019) unable to yield performance gains.
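For reference, the KL-regularized objective underlying this line of work can be written schematically as follows (in the spirit of Teh et al., 2017 and Galashov et al., 2019; the exact formulations vary across the cited papers, and the symbols here are chosen for illustration):

```latex
\mathcal{J}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t} \gamma^{t}
  \Big( r(s_t, a_t)
  \;-\; \alpha\,\mathrm{KL}\big[\,\pi(\cdot \mid s_t, k)\;\big\|\;\pi_0(\cdot \mid s_t)\,\big] \Big)\right],
```

where $k$ denotes the task, $\pi$ is the task-conditioned policy, $\pi_0$ is the task-agnostic prior, and $\alpha$ trades off reward against staying close to the prior. Minimizing the KL term distils behaviours shared across tasks into $\pi_0$, which in turn encourages their reuse on new tasks.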
In fact, successful transfer with the aforementioned hierarchical and KL-regularized methods critically depends on the correct choice of information asymmetry (IA). IA refers more generally to an asymmetric masking of information across architectural modules. This masking forces independence from, and ideally generalisation across, the masked dimensions (Galashov et al., 2019). For example, for self-driving cars, by conditioning the prior only on proprioceptive information it discovers skills

* Equal contribution. Correspondence to: sasha.salter@hotmail.com.
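To make the role of IA concrete, suppose the observation factorizes as $o_t = (o_t^{\mathrm{p}}, o_t^{\mathrm{e}})$ into proprioceptive and exteroceptive components (a decomposition assumed here purely for illustration). An information-asymmetric prior is then conditioned only on the unmasked part:

```latex
\underbrace{\pi\big(a_t \mid o_t^{\mathrm{p}},\, o_t^{\mathrm{e}},\, k\big)}_{\text{policy: full information}}
\qquad \text{regularized towards} \qquad
\underbrace{\pi_0\big(a_t \mid o_t^{\mathrm{p}}\big)}_{\text{prior: masked information}} .
```

Because the prior cannot condition on $o_t^{\mathrm{e}}$ or $k$, the KL term pushes the policy towards behaviours that are independent of those dimensions; this is precisely what encourages transfer across them, or, if the mask is chosen poorly, what limits the expressivity of the learnt skills.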

