DATA-EFFICIENT HINDSIGHT OFF-POLICY OPTION LEARNING

Abstract

Hierarchical approaches to reinforcement learning aim to improve data efficiency and accelerate learning by incorporating different abstractions. We introduce Hindsight Off-policy Options (HO2), an efficient off-policy option learning algorithm, and isolate the impact of temporal and action abstraction in the option framework by comparing flat policies, mixture policies without temporal abstraction, and finally option policies, all with comparable policy optimization. We demonstrate the importance of off-policy optimization: even flat policies trained off-policy can outperform on-policy option methods. In addition, off-policy training and backpropagation through a dynamic programming inference procedure (through time and through the policy components at every time step) enable us to train all components' parameters independently of the data-generating behavior policy. Experimentally, we demonstrate that HO2 outperforms existing option learning methods and that both action and temporal abstraction provide strong benefits, particularly in more demanding simulated robot manipulation tasks from raw pixel inputs. We additionally illustrate challenges in off-policy option learning and highlight the importance of trust-region constraints. Finally, we develop an intuitive extension to further encourage temporal abstraction and investigate how its impact differs between learning from scratch and using pre-trained options.

1. INTRODUCTION

Despite deep reinforcement learning's considerable successes in recent years (Silver et al., 2017; OpenAI et al., 2018; Vinyals et al., 2019), applications in domains with limited or expensive data have so far been rare. To address this challenge, data efficiency can be improved through additional structure imposed on the solution space. Hierarchical methods, such as the options framework (Sutton et al., 1999; Precup, 2000), present one approach to integrating different abstractions into the agent. By representing an agent as a combination of low-level and high-level controllers, option policies can support reuse of low-level behaviours and can ultimately accelerate learning. The advantages introduced by this hierarchical control scheme are partially balanced by additional complexities, including possible degenerate cases (Precup, 2000; Harb et al., 2018), trade-offs regarding option length (Harutyunyan et al., 2019), and additional stochasticity. Overall, the interaction of algorithm and environment, as exemplified by the factors above, can become increasingly difficult to handle, especially in an off-policy setting (Precup et al., 2006).

With Hindsight Off-policy Options (HO2), we present a method for data-efficient, robust off-policy learning of options that partially combats the previously mentioned challenges in option learning. We evaluate against current methods for option learning to demonstrate the importance of off-policy learning for data efficiency. In addition, we compare HO2 with comparable policy optimization methods for flat policies and for mixture policies without temporal abstraction. This allows us to isolate the individual impact of action abstraction (in mixture and option policies) and of temporal abstraction in the option framework.

The algorithm updates the policy via critic-weighted maximum likelihood (similar to Abdolmaleki et al. (2018b); Wulfmeier et al. (2020)) and combines this with an efficient dynamic programming procedure to infer option probabilities along trajectories, updating all policy parts via backpropagation through the inference graph (conceptually related to Rabiner (1989); Shiarlis et al. (2018); Smith et al. (2018)). Intuitively, the approach can be understood as inferring option and action probabilities for off-policy trajectories in hindsight and maximizing the likelihood of good actions and options by backpropagating through the inference procedure. To stabilize policy updates, HO2 uses adaptive trust-region constraints, demonstrating the importance of robust policy optimization for hierarchical reinforcement learning (HRL), in line with recent work (Zhang & Whiteson, 2019). Rather than conditioning on executed options, the algorithm treats them as unobserved variables and computes the marginalized likelihood, which enables exact gradient computation and renders the algorithm independent of further approximations such as Monte Carlo estimates or continuous relaxation (Li et al., 2019). As an additional benefit, the formulation of the inference graph allows us to impose hard constraints on the option termination frequency, thereby regularizing the learned solution without introducing additional weighted loss terms that depend on reward scale. We exploit this to further investigate temporal abstraction, both when learning from scratch and with pre-trained options. Experimentally, we demonstrate that HO2 outperforms existing option learning methods and that both action and temporal abstraction provide strong benefits, particularly in more demanding simulated robot manipulation tasks from raw pixel inputs.

Our main contributions include:

• a data-efficient off-policy option learning algorithm which outperforms existing option learning methods on common benchmarks.
• careful analysis to isolate the impact of action abstraction and temporal abstraction in the option framework by comparing flat policies, mixture policies without temporal abstraction, and finally option policies. Our experiments demonstrate individual improvements from both forms of abstraction in complex 3D robot manipulation tasks from raw pixel inputs, and include ablations of other factors such as trust-region constraints and off-policy versus on-policy data.

• an intuitive method to further encourage temporal abstraction beyond the core method, using the inference graph to constrain option switches without additional weighted loss terms. We investigate differences between temporal abstraction when learning from scratch and with pre-trained options, and show that explicitly optimizing for temporally abstract behaviour, in addition to simply enabling its emergence, mostly provides benefits in the context of pre-trained options.
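The dynamic programming inference over option probabilities described above is conceptually similar to the forward algorithm for hidden Markov models. The following is a minimal sketch under our own assumptions (a uniform initial option distribution, with per-step low-level log-likelihoods and high-level transition log-probabilities supplied as arrays), not the paper's exact implementation:

```python
import numpy as np
from scipy.special import logsumexp


def option_forward_pass(log_pi_H, log_pi_L):
    """Infer marginal option probabilities along an off-policy trajectory.

    log_pi_H: (T, K, K) array; entry [t, i, j] is the high-level
              log-probability of option j at step t given option i at t-1
              (this encodes switching/termination behaviour).
    log_pi_L: (T, K) array; log pi_L(a_t | s_t, o_t = k) for the
              executed actions of the trajectory.
    Returns a (T, K) array of normalized log marginals over the active option.
    """
    T, K = log_pi_L.shape
    alpha = np.zeros((T, K))
    # Uniform prior over the initial option (an assumption for this sketch).
    alpha[0] = -np.log(K) + log_pi_L[0]
    for t in range(1, T):
        # Dynamic programming recursion: log-sum-exp over the previous option.
        trans = alpha[t - 1][:, None] + log_pi_H[t]  # (K, K)
        alpha[t] = log_pi_L[t] + logsumexp(trans, axis=0)
    # Normalize per step to obtain log p(o_t | s_{0:t}, a_{0:t}).
    return alpha - logsumexp(alpha, axis=1, keepdims=True)
```

Because every operation here is differentiable, implementing the same recursion in an autodiff framework allows gradients to flow through the inference graph into all policy components, which is the mechanism HO2 relies on.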

2. PRELIMINARIES

Problem Setup. We consider a reinforcement learning setting with an agent operating in a Markov Decision Process (MDP) consisting of the state space $\mathcal{S}$, the action space $\mathcal{A}$, and the transition probability $p(s_{t+1}|s_t, a_t)$ of reaching state $s_{t+1}$ from state $s_t$ when executing action $a_t$. The actions $a_t$ are drawn from the agent's policy $\pi(a_t|x_t)$, where $x_t$ can either refer to the current state $s_t$ of the agent or, in order to model dependencies on previous steps, to the trajectory up to the current step, $h_t = \{s_t, a_{t-1}, s_{t-1}, \ldots, s_0, a_0\}$. Jointly, the transition dynamics and policy induce the marginal state visitation distribution $p(s_t)$. The discount factor $\gamma$ together with the reward $r_t = r(s_t, a_t)$ gives rise to the expected return, which the agent aims to maximize: $J(\pi) = \mathbb{E}_{p(s_t), \pi(a_t|s_t)}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.

We start by describing mixture policies as an intermediate between flat policies and the option framework (see Figure 1), which introduce a type of action abstraction via multiple low-level policies. Note that both Gaussian and mixture policies have been trained in prior work via similar policy optimization methods (Abdolmaleki et al., 2018a; Wulfmeier et al., 2020), which we will extend towards option policies. In particular, we will make use of this connection to isolate the impact of action abstraction and temporal abstraction in the option framework in Section 4.2. The next paragraphs focus on computing likelihoods of actions and options, which forms the foundation for the proposed critic-weighted maximum-likelihood algorithm for learning hierarchical policies.

Mixture Policies can be seen as a simplification of the options framework without initiation or termination conditions, with resampling of options after every step (i.e. no dependency between the options of timesteps $t$ and $t+1$ in Figure 1).
The joint probability of action and option is given as:

$$\pi_\theta(a_t, o_t | s_t) = \pi_L(a_t | s_t, o_t)\, \pi_H(o_t | s_t), \quad \text{with} \quad \pi_H(o_t | s_t) = \pi_C(o_t | s_t),$$

where $\pi_H$ and $\pi_L$ respectively denote the high-level policy (which for the mixture equals a categorical distribution $\pi_C$) and the low-level policy (the components of the resulting mixture distribution), and $o$ is the index of the sub-policy or mixture component.
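As a concrete illustration, the marginal action likelihood of such a mixture policy sums the joint probability over the component index $o$. The sketch below assumes diagonal Gaussian low-level components; the shapes and names are our own, not the paper's implementation:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm


def mixture_log_prob(a, mu, sigma, log_pi_C):
    """Marginal log-likelihood of action a under a Gaussian mixture policy.

    a:        (D,) executed action.
    mu, sigma: (K, D) means and stddevs of the K diagonal Gaussian components
               (the low-level policies pi_L evaluated at state s).
    log_pi_C: (K,) log categorical probabilities of the high-level policy.
    Returns log pi_theta(a | s) = log sum_o pi_C(o | s) * pi_L(a | s, o).
    """
    # Per-component log-density, summed over independent action dimensions.
    log_pi_L = norm.logpdf(a, loc=mu, scale=sigma).sum(axis=-1)  # (K,)
    # Marginalize the component index o in log space for numerical stability.
    return logsumexp(log_pi_C + log_pi_L)
```

Working in log space with `logsumexp` avoids underflow when component likelihoods are small, which matters once this quantity is used as a maximum-likelihood training signal.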

