DATA-EFFICIENT HINDSIGHT OFF-POLICY OPTION LEARNING

Abstract

Hierarchical approaches for reinforcement learning aim to improve data efficiency and accelerate learning by incorporating different abstractions. We introduce Hindsight Off-policy Options (HO2), an efficient off-policy option learning algorithm, and isolate the impact of temporal and action abstraction in the option framework by comparing flat policies, mixture policies without temporal abstraction, and finally option policies; all with comparable policy optimization. We demonstrate the importance of off-policy optimization, as even flat policies trained off-policy can outperform on-policy option methods. In addition, off-policy training and backpropagation through a dynamic programming inference procedure (through time and through the policy components at every time step) enable us to train all components' parameters independently of the data-generating behavior policy. Experimentally, we demonstrate that HO2 outperforms existing option learning methods and that both action and temporal abstraction provide strong benefits, particularly in more demanding simulated robot manipulation tasks from raw pixel inputs. We additionally illustrate challenges in off-policy option learning and highlight the importance of trust-region constraints. Finally, we develop an intuitive extension to further encourage temporal abstraction and investigate differences in its impact between learning from scratch and using pre-trained options.

1. INTRODUCTION

Despite deep reinforcement learning's considerable successes in recent years (Silver et al., 2017; OpenAI et al., 2018; Vinyals et al., 2019), applications in domains with limited or expensive data have so far been rare. To address this challenge, data efficiency can be improved through additional structure imposed on the solution space. Hierarchical methods, such as the options framework (Sutton et al., 1999; Precup, 2000), present one approach to integrate different abstractions into the agent. By representing an agent as a combination of low-level and high-level controllers, option policies can support reuse of low-level behaviors and can ultimately accelerate learning.

The advantages introduced by the hierarchical control scheme imposed by the options framework are partially balanced by additional complexities, including possible degenerate cases (Precup, 2000; Harb et al., 2018), trade-offs regarding option length (Harutyunyan et al., 2019), and additional stochasticity. Overall, the interaction of algorithm and environment, as exemplified by the factors above, can become increasingly difficult to manage, especially in an off-policy setting (Precup et al., 2006).

With Hindsight Off-policy Options (HO2), we present a method for data-efficient, robust off-policy learning of options that partially addresses the aforementioned challenges in option learning. We evaluate against current methods for option learning to demonstrate the importance of off-policy learning for data efficiency. In addition, we compare HO2 with comparable policy optimization methods for flat policies and for mixture policies without temporal abstraction. This allows us to isolate the individual impact of action abstraction (in mixture and option policies) and of temporal abstraction in the option framework. The algorithm updates the policy via critic-weighted maximum likelihood (similar to Abdolmaleki et al. (2018b); Wulfmeier et al. (2020)) and combines this with an efficient dynamic programming procedure to infer option probabilities along trajectories, updating all policy parts via backpropagation through the inference graph (conceptually related to Rabiner (1989); Shiarlis et al. (2018); Smith et al. (2018)). Intuitively, the approach can be understood as inferring option and action probabilities for off-policy trajectories in hindsight and maximizing the likelihood of good actions and options by backpropagating through the inference procedure. To stabilize

