HIGHWAY REINFORCEMENT LEARNING

Abstract

Traditional Dynamic Programming (DP) approaches suffer from slow backward credit assignment (CA): one time step per update. A popular remedy is to use multi-step Bellman operators. Existing control methods, however, typically suffer either from the large variance of multi-step off-policy corrections or from bias that prevents convergence. To overcome these problems, we introduce a novel multi-step Bellman Optimality operator that quickly transports credit information from the future into the past through multiple "highways" induced by various behavioral policies. Our operator is unbiased with respect to the optimal value function and converges faster than the traditional Bellman Optimality operator. Its computational complexity is linear in the number of behavioral policies and the lookahead depth. Moreover, it yields a family of novel multi-step off-policy algorithms that do not require importance sampling. We derive a convergent multi-step off-policy variant of Q-learning called Highway Q-Learning, as well as a deep function approximation variant called Highway DQN. Experiments on toy tasks and visual MinAtar games (Young & Tian, 2019) show that our algorithms outperform comparable multi-step methods.

1. INTRODUCTION

Recent advances in multi-step reinforcement learning (RL) have achieved remarkable empirical success (Horgan et al., 2018; Barth-Maron et al., 2018). However, a major challenge of multi-step RL is balancing the trade-off between traditional "safe" one-time-step-per-trial credit assignment (CA), which relies on knowledge stored in a learned Value Function (VF), and large CA jumps across many time steps. A traditional way of addressing this issue is to impose a fixed prior distribution over the possible numbers of CA steps, e.g., TD(λ) (Sutton & Barto, 2018) and GAE(λ) (Schulman et al., 2016). This typically ignores the state-specific quality of the current VF, which improves dynamically during learning. Moreover, the prior distribution usually has to be tuned case by case. Multi-step RL should also work for off-policy learning, that is, learning from data generated by other behavioral policies. Most previous research on this has focused on Policy Iteration (PI)-based approaches (Sutton & Barto, 2018), which must correct the discrepancy between the target policy and the behavior policy to evaluate the VF (Precup, 2000; Harutyunyan et al., 2016; Munos et al., 2016; Sutton & Barto, 2018; Schulman et al., 2016). Classic importance sampling (IS)-based methods are provably unbiased, but suffer from high variance due to the product of IS ratios (Cortes et al., 2010; Metelli et al., 2018). Recently, several variance reduction methods have been proposed and shown to be effective in practice, such as Q(λ) (Harutyunyan et al., 2016), Retrace(λ) (Munos et al., 2016), and C-trace (Rowland et al., 2020), among others (Espeholt et al., 2018; Horgan et al., 2018; Asis et al., 2017). In contrast to PI, Value Iteration (VI) methods propagate the values of the most promising actions backward one step at a time (Sutton & Barto, 2018; Szepesvári, 2010). Such methods can safely use data from any behavioral policy.
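The contrast between the two families can be summarized in two standard equations. The notation below is ours, given here as a simplified sketch rather than any method's exact form: $\pi$ is the target policy, $\mu$ the behavior policy, $\gamma$ the discount factor, and $V$ the value function.

```latex
% Ordinary importance-sampling (IS) correction of an n-step return:
% the product of ratios \rho_t = \pi(a_t \mid s_t) / \mu(a_t \mid s_t)
% makes the variance grow with the lookahead depth n.
G^{(n)}_{\mathrm{IS}} = \left( \prod_{t=0}^{n-1} \rho_t \right)
  \left( \sum_{t=0}^{n-1} \gamma^{t} r_t + \gamma^{n} V(s_n) \right)

% One-step Bellman Optimality backup used by Value Iteration (VI):
% safe under any behavior policy, but credit moves only one step
% per application of the operator \mathcal{T}.
(\mathcal{T} V)(s) = \max_{a \in \mathcal{A}}
  \mathbb{E} \left[ r(s,a) + \gamma V(s') \right]
```

The first estimator assigns credit across $n$ steps at once but pays for it in variance; the second is variance-friendly and off-policy-safe but slow.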
However, step-by-step value propagation makes them somewhat ill-suited for general multi-step CA. Here we provide a new tool for multi-step off-policy learning by extending VI approaches to the multi-step setting. The foundation of our method is a new Bellman operator, the Highway Operator, which connects current and future states through multiple "highways" and focuses on the most promising one. Highways are constructed through various policies looking ahead for multiple steps. Our operator has the following desirable properties: 1) it yields a new Bellman Optimality Equation that reflects the latent structure of multi-step CA, providing a novel sufficient condition for the optimal VF; 2) it effectively assigns future credit to past states across multiple time steps and has

