HIGHWAY REINFORCEMENT LEARNING

Abstract

Traditional Dynamic Programming (DP) approaches suffer from slow backward credit assignment (CA): one time step per update. A popular solution for multi-step CA is to use multi-step Bellman operators. Existing control methods, however, typically suffer from large variance of multi-step off-policy corrections or are biased, preventing convergence. To overcome these problems, we introduce a novel multi-step Bellman Optimality operator, which quickly transports credit information from the future into the past through multiple "highways" induced by various behavioral policies. Our operator is unbiased with respect to the optimal value function and converges faster than the traditional Bellman Optimality operator. Its computational complexity is linear in the number of behavioral policies and the lookahead depth. Moreover, it yields a family of novel multi-step off-policy algorithms that do not need importance sampling. We derive a convergent multi-step off-policy variant of Q-learning called Highway Q-Learning, and also a deep function approximation variant called Highway DQN. Experiments on toy tasks and visual MinAtar Games (Young & Tian, 2019) illustrate that our algorithms outperform similar multi-step methods.

1. INTRODUCTION

Recent advances in multi-step reinforcement learning (RL) have achieved remarkable empirical success (Horgan et al., 2018; Barth-Maron et al., 2018). However, a major challenge of multi-step RL is to balance the trade-off between traditional "safe" one-time-step-per-trial credit assignment (CA), which relies on knowledge stored in a learned Value Function (VF), and large CA jumps across many time steps. A traditional way of addressing this issue is to impose a fixed prior distribution over the possible numbers of CA steps, e.g., TD(λ) (Sutton & Barto, 2018) or GAE(λ) (Schulman et al., 2016). This typically ignores the state-specific quality of the current VF, which dynamically improves during learning. Besides, the prior distribution usually has to be tuned case by case. Multi-step RL should also work for off-policy learning, that is, learning from data obtained by other behavioral policies. Most previous research on this has focused on Policy Iteration (PI)-based approaches (Sutton & Barto, 2018), which need to correct the discrepancy between target policy and behavior policy to evaluate the VF (Precup, 2000; Harutyunyan et al., 2016; Munos et al., 2016; Sutton & Barto, 2018; Schulman et al., 2016). Classic importance sampling (IS)-based methods are provably unbiased, but suffer from high variance due to the product of IS ratios (Cortes et al., 2010; Metelli et al., 2018). Recently, several variance reduction methods have been proposed and shown to be effective in practice, such as Q(λ) (Harutyunyan et al., 2016), Retrace(λ) (Munos et al., 2016), and C-trace (Rowland et al., 2020), among others (Espeholt et al., 2018; Horgan et al., 2018; Asis et al., 2017). In contrast to PI, Value Iteration (VI) methods propagate the values of the most promising actions backward one step at a time (Sutton & Barto, 2018; Szepesvári, 2010). Such methods can safely use data from any behavioral policy.
However, step-by-step value propagation makes them somewhat ill-suited for general multi-step CA. Here we provide a new tool for multi-step off-policy learning by extending VI approaches to the multi-step setting. The foundation of our method is a new Bellman operator, the Highway Operator, which connects current and future states through multiple "highways," and focuses on the most promising one. Highways are constructed through various policies looking ahead for multiple steps. Our operator has the following desirable properties: 1) It yields a new Bellman Optimality Equation that reflects the latent structure of multi-step CA, providing a novel sufficient condition for the optimal VF; 2) It effectively assigns future credit to past states across multiple time steps and has remarkable convergence properties; 3) It yields a family of novel multi-step off-policy algorithms that do not need importance sampling, safely using arbitrary off-policy data. Experiments on toy tasks and visual MinAtar Games (Young & Tian, 2019) illustrate that our Highway RL algorithms outperform existing multi-step methods.

2. PRELIMINARIES

A Markov Decision Process (MDP) (Puterman, 2014) is described by the tuple M = (S, A, γ, T, µ0, R), where S is the state space, A is the action space, and γ ∈ [0, 1) is the discount factor. We assume MDPs with countable S (discrete topology) and finite A. T : S × A → ∆(S) is the transition probability function; µ0 denotes the initial state distribution; R : S × A → ∆(R) denotes the reward probability function. We write the associated conditional probabilities as T(s′|s, a) and R(·|s, a), for s, s′ ∈ S, a ∈ A, and use r(s, a) ≜ E_{R∼R(·|s,a)}[R] for convenience. To make the space of value functions complete (an assumption of the Banach fixed-point theorem), we assume bounded rewards, which with discounting yield bounded value functions. We denote by l∞(X) the space of bounded sequences with supremum norm ∥·∥∞ supported on X, assuming X is countable with discrete topology. Completeness of our value spaces then follows from completeness of l∞(N)foot_0. The goal is to find a policy π : S → ∆(A) that yields maximal return, where the return is the accumulated discounted reward from time step t:

$$G_t = \sum_{n=0}^{\infty} \gamma^n r(s_{t+n}, a_{t+n}).$$

The state-value function (VF) of a policy π is defined as the expected return of being in state s and following policy π, V^π(s) ≜ E[G_t | s_t = s; π]. Let Π denote the space of all policies. The optimal VF is V* = max_{π∈Π} V^π. It is also convenient to define the action-VF, Q^π(s, a) ≜ E[G_t | s_t = s, a_t = a; π]; the optimal action-VF is denoted Q* = max_{π∈Π} Q^π. The Bellman Expectation/Optimality Equations and the corresponding operators are as follows:

$$\mathcal{B}^{\pi} V^{\pi} = V^{\pi}, \quad \text{where } (\mathcal{B}^{\pi} V)(s) \triangleq \mathbb{E}_{a \sim \pi(\cdot|s),\, s' \sim T(\cdot|s,a)}\left[ r(s,a) + \gamma V(s') \right], \tag{1}$$

$$\mathcal{B} V^{*} = V^{*}, \quad \text{where } (\mathcal{B} V)(s) \triangleq \max_{a} \left\{ r(s,a) + \gamma\, \mathbb{E}_{s' \sim T(\cdot|s,a)}\left[ V(s') \right] \right\}. \tag{2}$$
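The Bellman Optimality Operator of eq. (2), iterated to its fixed point, is classic Value Iteration. Below is a minimal NumPy sketch on a hypothetical two-state, two-action MDP (the transition and reward numbers are illustrative only, not from the paper):

```python
import numpy as np

def bellman_optimality_operator(V, T, r, gamma):
    """One application of the Bellman Optimality Operator B (eq. 2).

    T: transition tensor of shape (S, A, S'); r: reward matrix (S, A).
    Returns (BV)(s) = max_a [ r(s, a) + gamma * sum_{s'} T(s'|s, a) V(s') ].
    """
    return np.max(r + gamma * T @ V, axis=1)

# Hypothetical two-state MDP, for illustration only.
T = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[1.0, 0.0], [0.5, 0.5]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

# Value Iteration: since B is a gamma-contraction, iterating it
# converges to the unique fixed point V* (Banach fixed-point theorem).
V = np.zeros(2)
for _ in range(500):
    V = bellman_optimality_operator(V, T, r, gamma)
```

Each application of the operator propagates value information backward by exactly one step, which is the slow credit-assignment behavior the Highway Operator is designed to overcome.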

3. HIGHWAY REINFORCEMENT LEARNING

Value Iteration (VI) looks ahead for one step to identify a promising value using only short-term information (see eq. 2). How can we quickly exploit long-term information through larger lookaheads? Our idea is to exploit the information conveyed by policies. We connect current and future states through multiple "highways" induced by policies, allowing for the unimpeded flow of credit across various lookaheads, and then focus on the most promising highway. Formally, we propose the novel Highway Bellman Optimality Operator $\mathcal{G}^{\hat{\Pi}}_{\mathcal{N}}$ (Highway Operator for short), defined by

$$\mathcal{G}^{\hat{\Pi}}_{\mathcal{N}} V(s_0) \triangleq \max_{\pi \in \hat{\Pi}}\, \max_{n \in \mathcal{N}}\, \mathbb{E}_{\tau^{n}_{s_0} \sim \pi}\left[ \sum_{t=0}^{n-1} \gamma^t r(s_t, a_t) + \gamma^n \max_{a'_n} \left( r(s_n, a'_n) + \gamma\, \mathbb{E}_{s'_{n+1}}\left[ V(s'_{n+1}) \right] \right) \right], \tag{3}$$

where $\hat{\Pi} = \{\pi_1, \dots, \pi_m, \dots, \pi_M \mid \pi_m \in \Pi\}$ is a set of behavioral policies, which are used to collect data; n is called the lookahead depth (also named the bootstrapping step in the RL literature); $\mathcal{N}$ is the set of lookahead depths, which we assume always includes 0 ($0 \in \mathcal{N}$) unless explicitly stated otherwise; and $\tau^{n}_{s_0} = (s_0, a_0, s_1, a_1, s_2, a_2, \dots, s_n)$, with $\tau^{n}_{s_0} \sim \pi$ denoting the trajectory starting from $s_0$ obtained by executing policy π for n steps. Fig. 1 (Left) illustrates the backup diagram of this operator. Our operator can be rewritten using the Bellman Operators:

$$\mathcal{G}^{\hat{\Pi}}_{\mathcal{N}} V \triangleq \max_{\pi \in \hat{\Pi}}\, \max_{n \in \mathcal{N}}\, (\mathcal{B}^{\pi})^n \mathcal{B} V. \tag{4}$$

As implied by eq. (3) and eq. (4), given some trial, we pick a policy and a possible lookahead depth (up to the trial end if $\mathcal{N}$ is sufficiently large) that maximize the cumulative reward during the lookahead.
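The operator form of eq. (4) is directly computable in the tabular, model-based setting: apply B once, then apply B^π up to max(N) times, and take the state-wise maximum over all policies and depths. The sketch below assumes full access to T and r (the paper's actual algorithms instead estimate this quantity from off-policy trajectories); all MDP shapes and names are illustrative:

```python
import numpy as np

def bellman_B(V, T, r, gamma):
    # (B V)(s) = max_a [ r(s,a) + gamma * E_{s'~T(.|s,a)} V(s') ]   (eq. 2)
    return np.max(r + gamma * T @ V, axis=1)

def bellman_B_pi(V, T, r, gamma, pi):
    # (B^pi V)(s) = E_{a~pi(.|s)} [ r(s,a) + gamma * E_{s'} V(s') ]  (eq. 1)
    return np.einsum('sa,sa->s', pi, r + gamma * T @ V)

def highway_operator(V, T, r, gamma, policies, depths):
    """G^Pi_N V = max_{pi in Pi} max_{n in N} (B^pi)^n B V   (eq. 4).

    Tabular sketch. `policies` is a list of (S, A) stochastic matrices;
    `depths` is the set N of lookahead depths, assumed to contain 0.
    """
    BV = bellman_B(V, T, r, gamma)   # n = 0 term: plain one-step backup
    candidates = [BV]
    for pi in policies:
        W = BV
        for n in range(1, max(depths) + 1):
            W = bellman_B_pi(W, T, r, gamma, pi)  # one more B^pi on top of B
            if n in depths:
                candidates.append(W)
    return np.max(np.stack(candidates), axis=0)  # state-wise "best highway"
```

Because 0 ∈ N, the Highway Operator always dominates the plain Bellman Optimality Operator state-wise, so deeper lookaheads along good behavioral policies can only speed up credit propagation, never slow it down.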



foot_0: The space of all bounded sequences with supremum norm, which is known to be a complete metric space.

