HIGHWAY REINFORCEMENT LEARNING

Abstract

Traditional Dynamic Programming (DP) approaches suffer from slow backward credit assignment (CA): one time step per update. A popular solution for multi-step CA is to use multi-step Bellman operators. Existing control methods, however, typically suffer from large variance of multi-step off-policy corrections or are biased, preventing convergence. To overcome these problems, we introduce a novel multi-step Bellman Optimality operator, which quickly transports credit information from the future into the past through multiple "highways" induced by various behavioral policies. Our operator is unbiased with respect to the optimal value function and converges faster than the traditional Bellman Optimality operator. Its computational complexity is linear in the number of behavioral policies and lookahead depth. Moreover, it yields a family of novel multi-step off-policy algorithms that do not need importance sampling. We derive a convergent multi-step off-policy variant of Q-learning called Highway Q-Learning, and also a deep function approximation variant called Highway DQN. Experiments on toy tasks and visual MinAtar Games (Young & Tian, 2019) illustrate that our algorithms outperform similar multi-step methods.

1. INTRODUCTION

Recent advances in multi-step reinforcement learning (RL) have achieved remarkable empirical success (Horgan et al., 2018; Barth-Maron et al., 2018). However, a major challenge of multi-step RL is to balance the trade-off between traditional "safe" one-time-step-per-trial credit assignment (CA) relying on knowledge stored in a learned Value Function (VF), and large CA jumps across many time steps. A traditional way of addressing this issue is to impose a fixed prior distribution over the possible numbers of CA steps, e.g., TD(λ) (Sutton & Barto, 2018), GAE(λ) (Schulman et al., 2016). This typically ignores the current state-specific quality of the current VF, which dynamically improves during learning.
Besides, the prior distribution usually has to be tuned case by case. Multi-step RL should also work for off-policy learning, that is, learning from data obtained by other behavioral policies. Most previous research on this has focused on Policy Iteration (PI)-based approaches (Sutton & Barto, 2018), which need to correct the discrepancy between target policy and behavior policy to evaluate the VF (



remarkable convergence properties; 3) It yields a family of novel multi-step off-policy algorithms that do not need importance sampling, safely using arbitrary off-policy data. Experiments on toy tasks and visual MinAtar Games (Young & Tian, 2019) illustrate that our Highway RL algorithms outperform existing multi-step methods.

2. PRELIMINARIES

A Markov Decision Process (MDP) (Puterman, 2014) is described by the tuple $M = (S, A, \gamma, T, \mu_0, R)$, where $S$ is the state space, $A$ is the action space, and $\gamma \in [0, 1)$ is the discount factor. We assume MDPs with countable $S$ (discrete topology) and finite $A$. $T : S \times A \to \Delta(S)$ is the transition probability function, $\mu_0$ denotes the initial state distribution, and $R : S \times A \to \Delta(\mathbb{R})$ denotes the reward probability function. We use the following symbols to denote the related conditional probabilities: $T(s' \mid s, a)$, $R(\cdot \mid s, a)$, with $s, s' \in S$, $a \in A$. We also write $r(s, a) \triangleq \mathbb{E}_{R \sim R(\cdot \mid s, a)}[R]$ for convenience. To make the space of value functions complete (an assumption of the Banach fixed-point theorem), we assume bounded rewards, which together with discounting produce bounded value functions. We denote by $l^\infty(X)$ the space of bounded sequences with supremum norm $\|\cdot\|_\infty$ and support $X$, assuming $X$ is countable and has the discrete topology. Completeness of our value spaces then follows from completeness of $l^\infty(\mathbb{N})$.

The goal is to find a policy $\pi : S \to \Delta(A)$ that yields maximal return. The return is defined as the accumulated discounted reward from time step $t$, i.e., $G_t = \sum_{n=0}^{\infty} \gamma^n r(s_{t+n}, a_{t+n})$. The state-value function (VF) of a policy $\pi$ is defined as the expected return of being in state $s$ and following policy $\pi$, $V^\pi(s) \triangleq \mathbb{E}[G_t \mid s_t = s; \pi]$. Let $\Pi$ denote the space of all policies. The optimal VF is $V^* = \max_{\pi \in \Pi} V^\pi$. It is also convenient to define the action-VF, $Q^\pi(s, a) \triangleq \mathbb{E}[G_t \mid s_t = s, a_t = a; \pi]$; the optimal action-VF is denoted as $Q^* = \max_{\pi \in \Pi} Q^\pi$. The Bellman Expectation/Optimality Equations and the corresponding operators are as follows:

$B^\pi V^\pi = V^\pi$, where $(B^\pi V)(s) \triangleq \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim T(\cdot \mid s, a)}\left[r(s, a) + \gamma V(s')\right]$, (1)

$B V^* = V^*$, where $(B V)(s) \triangleq \max_{a}\left[r(s, a) + \gamma\, \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\left[V(s')\right]\right]$. (2)
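The two Bellman operators above can be made concrete on a tiny tabular MDP. The following is a minimal sketch, not code from the paper: the 2-state MDP, the names `lookahead`, `bellman_expectation`, `bellman_optimality`, and the uniform policy are all illustrative assumptions. It iterates each operator to its fixed point and uses only the Python standard library.

```python
GAMMA = 0.9

# Hypothetical 2-state, 2-action MDP (illustrative values, not from the paper).
# T[s][a] is the distribution over next states; r[s][a] is the expected reward.
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
r = [[0.0, 1.0],
     [0.0, 2.0]]
N_S, N_A = 2, 2


def lookahead(V, s, a):
    """One-step lookahead r(s, a) + gamma * E_{s'}[V(s')]."""
    return r[s][a] + GAMMA * sum(p * v for p, v in zip(T[s][a], V))


def bellman_expectation(V, pi):
    """(B^pi V)(s): average the lookahead over actions drawn from pi(.|s)."""
    return [sum(pi[s][a] * lookahead(V, s, a) for a in range(N_A))
            for s in range(N_S)]


def bellman_optimality(V):
    """(B V)(s): maximize the lookahead over actions."""
    return [max(lookahead(V, s, a) for a in range(N_A)) for s in range(N_S)]


def sup_norm(U, V):
    return max(abs(u - v) for u, v in zip(U, V))


def fixed_point(op, n_iter=500):
    """Iterate a gamma-contraction from V = 0 until it is numerically converged."""
    V = [0.0] * N_S
    for _ in range(n_iter):
        V = op(V)
    return V


uniform = [[0.5, 0.5], [0.5, 0.5]]
V_star = fixed_point(bellman_optimality)
V_uniform = fixed_point(lambda V: bellman_expectation(V, uniform))
print("V* =", V_star)
print("V^pi (uniform) =", V_uniform)
```

In this toy MDP the optimal policy always takes the second action, so $V^*$ dominates the value of the uniform policy state by state, as the definition $V^* = \max_\pi V^\pi$ requires.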

3. HIGHWAY REINFORCEMENT LEARNING

Value Iteration (VI) looks ahead for one step to identify a promising value using only short-term information (see eq. 2). How can we quickly exploit long-term information through larger lookaheads? Our idea is to exploit the information conveyed by policies. We connect current and future states through multiple "highways" induced by policies, allowing for the unimpeded flow of credit across various lookaheads, and then focus on the most promising highway. Formally, we propose the novel Highway Bellman Optimality Operator $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ (Highway Operator in short), defined by

$(\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V)(s_0) \triangleq \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} \mathbb{E}_{\tau^n_{s_0} \sim \pi}\left[\sum_{t=0}^{n-1} \gamma^t r(s_t, a_t) + \gamma^n \max_{a'_n}\left[r(s_n, a'_n) + \gamma\, \mathbb{E}_{s'_{n+1}}\left[V(s'_{n+1})\right]\right]\right]$, (3)

where $\widehat{\Pi} = \{\pi_1, \dots, \pi_m, \dots, \pi_M \mid \pi_m \in \Pi\}$ is a set of behavioral policies, which are used to collect data; $n$ is called the lookahead depth (also named bootstrapping step in the RL literature), and $\mathcal{N}$ is the set of lookahead depths, which we assume always includes 0 ($0 \in \mathcal{N}$) unless explicitly stated otherwise; $\tau^n_{s_0} = (s_0, a_0, s_1, a_1, s_2, a_2, \dots, s_n)$, and $\tau^n_{s_0} \sim \pi$ is the trajectory starting from $s_0$ obtained by executing policy $\pi$ for $n$ steps. Fig. 1 (Left) illustrates the backup diagram of this operator. Our operator can be rewritten using the Bellman operators:

$\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V \triangleq \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} (B^\pi)^n B V$. (4)

As implied by eq. (3) and eq. (4), given some trial, we pick a policy and a possible lookahead (up to the trial end if $\mathcal{N}$ is sufficiently large) that maximize the cumulative reward during the lookahead interval plus a dynamic programming-based estimate of the future return. This provides a highway to quickly transport credit information from the future into the past. That is why our operator generally converges faster than the classical Bellman Optimality Operator. See the detailed theoretical analysis in Section 5. Fig. 1 (Right) illustrates this highway through an example: an $N$-horizon MDP problem and two policies. A highway connects the start state $s_A$ to information derived from deeper lookaheads. In this example, our operator converges to the optimal VF within 2 iterations, while $B$ needs $N$.

Figure 1 (Right): An illustrative example of a "highway" under a simple $N$-horizon MDP, where $N$ is the horizon between the start state $s_A$ and the end state $s_Z$. With the highway (dashed lines) induced by rolling out behavioral policies, the start state $s_A$ can directly access various deeper horizon information. By taking the maximum over this information, the high credit information can be directly assigned to the previous states through the highway.

The following result shows that our Highway operator is a contraction on the complete metric space $l^\infty$ and admits $V^*$ as a unique fixed point.

Theorem 1 (Properties of Highway Operator) For any $\widehat{\Pi}$ and $\mathcal{N}$ (s.t. $0 \in \mathcal{N}$), we have
1) $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ is a contraction on the complete metric space $l^\infty(S)$, i.e., for any $V, V' \in l^\infty(S)$, we have $\|\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V - \mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V'\| \le \|BV - BV'\| \le \gamma \|V - V'\|$;
2) (Highway Bellman Optimality Equation, Highway Equation in short) $V^*$ is the only fixed point of $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$, that is, for all $V \in l^\infty(S)$, $V = \mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V$ holds if and only if $V = V^*$. Formally, we have $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V^* \triangleq \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} (B^\pi)^n B V^* = V^*$;
3) For any $V_0 \in l^\infty(S)$ and any sequence of policy sets $(\widehat{\Pi}_k)$, $k \in \mathbb{N}$, the sequence $(\mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}} \circ \mathcal{G}^{\widehat{\Pi}_{k-1}}_{\mathcal{N}} \circ \dots \circ \mathcal{G}^{\widehat{\Pi}_1}_{\mathcal{N}})[V_0]$, $k \in \mathbb{N}$, converges R-linearly to $V^*$ with convergence rate $\gamma$.

All the proofs are provided in Appendix A. Note that $0 \in \mathcal{N}$ is necessary for the guarantee of the fixed point property (Point 2), but not for the contraction property. This theorem implies that our Highway operator can provide a powerful extension to the Bellman Optimality Equation, which can potentially be applied to various RL control methods. Table 1 summarizes the comparison to classical Bellman Operators and some advanced operators. More details on the comparison are in Section 5.
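To make the operator form $\max_{\pi} \max_{n} (B^\pi)^n B V$ tangible, here is a minimal sketch on a hypothetical deterministic chain MDP. The chain, the two deterministic behavioral policies ("always right" and "always stay"), and all function names are assumptions made for illustration; this is not the paper's Fig. 1 example or its implementation. The sketch applies $B$ once and then rolls each policy's expectation operator forward, so the cost is linear in the number of policies and in the maximal depth.

```python
GAMMA = 0.9
N_STATES = 6            # states 0..5; state 5 is absorbing with zero reward
RIGHT, STAY = 0, 1


def step(s, a):
    """Deterministic chain dynamics: returns (next_state, reward)."""
    if s == N_STATES - 1 or a == STAY:
        return s, 0.0
    nxt = s + 1
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)


def backup(V, s, a):
    nxt, rew = step(s, a)
    return rew + GAMMA * V[nxt]


def bellman_optimality(V):
    return [max(backup(V, s, a) for a in (RIGHT, STAY)) for s in range(N_STATES)]


def bellman_expectation(V, pi):
    # pi[s] is the action a deterministic policy takes in state s
    return [backup(V, s, pi[s]) for s in range(N_STATES)]


def highway(V, policies, depths):
    """(G V)(s) = max over policies and depths n of ((B^pi)^n B V)(s)."""
    BV = bellman_optimality(V)
    best = [float("-inf")] * N_STATES
    for pi in policies:
        W = BV                                  # W holds (B^pi)^n B V
        for n in range(max(depths) + 1):
            if n in depths:
                best = [max(b, w) for b, w in zip(best, W)]
            W = bellman_expectation(W, pi)
    return best


def iterations_to_converge(op, target, tol=1e-9, max_iter=100):
    V = [0.0] * N_STATES
    for k in range(1, max_iter + 1):
        V = op(V)
        if max(abs(v - t) for v, t in zip(V, target)) < tol:
            return k
    return None


# Closed-form optimal values for this chain: gamma^(4 - s) for s < 5, else 0.
V_STAR = [GAMMA ** (N_STATES - 2 - s) for s in range(N_STATES - 1)] + [0.0]
POLICIES = [[RIGHT] * N_STATES, [STAY] * N_STATES]
DEPTHS = {0, 1, 2, 3, 4}

vi_iters = iterations_to_converge(bellman_optimality, V_STAR)
hw_iters = iterations_to_converge(lambda V: highway(V, POLICIES, DEPTHS), V_STAR)
print("value iteration:", vi_iters, "iterations; highway:", hw_iters)
```

In this particular chain the "always right" policy happens to be optimal, so a single Highway update already recovers $V^*$, whereas one-step value iteration needs one sweep per state on the path to the goal. With less favorable behavioral policies the gain would be smaller, but the fixed point stays $V^*$ as long as $0 \in \mathcal{N}$.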

4. ALGORITHMS

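Here we illustrate three applications of our Highway theory for model-based and model-free RL algorithms. Although we list only three instances in this paper, our theory can potentially be applied to various RL methods that involve value estimation, such as actor-critic methods.

Bellman Optimality Operator $B$ (eq. 2) | $V^*/Q^*$ | / | $\gamma$ | ✓
Bellman Expectation Operator $B^{\pi'}$ (eq. 1) | $V^{\pi'}/Q^{\pi'}$ | / | $\gamma$ | ✓
Multi-Step Off-Policy Operators
† Multi-Step Bellman Optimality Operator ($N \ge 2$) $\mathbb{E}_{\pi \sim P_{\widehat{\Pi}}} (B^\pi)^{N-1} B$ (Hessel et al., 2018; Horgan et al., 2018) | $V^*/Q^*$ | For $\widehat{\Pi}$ s.t. $\forall \pi \in \widehat{\Pi}$, $\pi = \pi^*$ | $\gamma^N$ | ✓
§ Multi-Step IS-based Bellman Expectation Operator $\mathbb{E}_{\pi \sim P_{\widehat{\Pi}}} (B^{\pi'}_{\pi})^N$ (Sutton & Barto, 2018) | $V^{\pi'}/Q^{\pi'}$ | For any $\widehat{\Pi}$, $N$ | $\gamma^N$ | ×
Q(λ) Operator (Harutyunyan et al., 2016) | $V^{\pi'}/Q^{\pi'}$ | For $\widehat{\Pi}$ s.t. $\forall \pi \in \widehat{\Pi}$, $\pi$ is close to $\pi'$ | $\gamma$ | ✓
Retrace(λ) Operator (Munos et al., 2016) | $V^{\pi'}/Q^{\pi'}$ | For any $\widehat{\Pi}$ | $\gamma$ | ✓
Highway Operator and its Variants (Ours)
Highway Operator (eq. 4) $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} \triangleq \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} (B^\pi)^n B$ | $V^*/Q^*$ | For any $\widehat{\Pi}$, $\mathcal{N}$ | $\gamma$ ($\gamma^2$, $\gamma^N$ under some conditions) | ✓
Softmax Highway Operator (eq. 9) $\widetilde{\mathcal{G}}^{\widehat{\Pi}}_{\mathcal{N}} \triangleq \mathrm{smax}^{\alpha}_{\pi \in \widehat{\Pi}}\, \mathrm{smax}^{\alpha}_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} (B^\pi)^n B$ | $V^*/Q^*$ | For any $\widehat{\Pi}$, $\mathcal{N}$, $\alpha$ | $\gamma$ | ✓
‡ Expectation Highway Operator $\mathcal{G}^{\mathcal{N}}_{\widehat{\Pi}} \triangleq \mathbb{E}_{\pi \sim P_{\widehat{\Pi}}} \max_{n \in \mathcal{N}} (B^\pi)^n B$ | $V^*/Q^*$ | For any $\widehat{\Pi}$, $\mathcal{N}$ | $\gamma$ | ✓

Table 1: Properties of the operators. $\widehat{\Pi}$ denotes the set of behavioral policies; $P_{\widehat{\Pi}}$ is a distribution over $\widehat{\Pi}$ and $P_{\widehat{\Pi}}(\pi)$ is the probability of selecting $\pi$; $\pi'$ denotes the target policy of policy evaluation. † and ‡: Please refer to Appendix A.2 for details. §: $B^{\pi'}_{\pi}$ is the importance sampling-based (IS-based) Bellman Expectation Operator (see Appendix A.4).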

4.1. MODEL-BASED REINFORCEMENT LEARNING

Highway Value Iteration. From the new operator, we can naturally derive a new Value Iteration algorithm (Algorithm B.1). Specifically, for a finite $S$, the update using the Highway Operator $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ (eq. 3) can be written as

$v_{k+1} = \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} \underbrace{\Big[\underbrace{\sum_{i=1}^{n} (\gamma T^\pi)^{i-1} r^\pi}_{\text{PART 1}} + \underbrace{(\gamma T^\pi)^n}_{\text{PART 2}} \max_{a}\left(r^a + \gamma T^a v_k\right)\Big]}_{\text{PART 3}}$, (6)

where the maxima are taken element-wise; $v_k$ is a $|S| \times 1$ column vector of the VF; $r^a$ and $r^\pi$ are $|S| \times 1$ column vectors of rewards for action $a$ and policy $\pi$ respectively, with $[r^a]_s = r(s, a)$ and $[r^\pi]_s = \sum_a \pi(a \mid s)\, r(s, a)$. $T^a$ and $T^\pi$ are $|S| \times |S|$ matrices of transition probabilities for action $a$ and policy $\pi$ respectively, with $[T^a]_{s,s'} = T(s' \mid s, a)$ and $[T^\pi]_{s,s'} = \sum_a \pi(a \mid s)\, T(s' \mid s, a)$. The computational complexity of each iteration is $O\big((|A| + |\widehat{\Pi}||\mathcal{N}|)\,|S|^2\big)$. Two strategies can be adopted to accelerate the update process. First, the vector in PART 1 and the matrix in PART 2 of eq. (6) can be computed once and reused for each $\pi$ and $n$, as they are fixed during the iteration process. Second, PART 3 can be computed in parallel for each policy $\pi \in \widehat{\Pi}$ and $n \in \mathcal{N}$.
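The matrix-form update and the "compute PART 1 and PART 2 once" strategy can be sketched as follows. This is an illustrative sketch under stated assumptions, not Algorithm B.1 itself: the 3-state MDP, the depth set, the choice of constant-action behavioral policies, and the helper names (`precompute`, `highway_step`, `run`) are all hypothetical. PART 1 is taken here as the $n$-term discounted reward sum, consistent with eq. (3). Plain Python lists stand in for matrices so the block has no dependencies.

```python
GAMMA = 0.9
N_S = 3
DEPTHS = [0, 1, 2, 3]

# Hypothetical MDP: T[a][s][s2] = T(s2 | s, a), r[a][s] = r(s, a).
T = [
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],
]
r = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.5]]


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N_S)) for j in range(N_S)]
            for i in range(N_S)]


def precompute(T_pi, r_pi, depths):
    """For one policy, return {n: (PART 1 vector, PART 2 matrix)}."""
    gT = [[GAMMA * p for p in row] for row in T_pi]
    part1 = [0.0] * N_S                                   # empty sum for n = 0
    part2 = [[float(i == j) for j in range(N_S)] for i in range(N_S)]  # identity
    table = {}
    for n in range(max(depths) + 1):
        if n in depths:
            table[n] = (list(part1), [row[:] for row in part2])
        part1 = [c + x for c, x in zip(part1, mat_vec(part2, r_pi))]  # add (gT)^n r
        part2 = mat_mul(part2, gT)                                    # (gT)^(n+1)
    return table


# Behavioral policies: one constant-action policy per action (an assumption).
TABLES = [precompute(T[a], r[a], DEPTHS) for a in range(2)]


def bellman_optimality(v):
    cands = [[ra + GAMMA * x for ra, x in zip(r[a], mat_vec(T[a], v))]
             for a in range(2)]
    return [max(c[s] for c in cands) for s in range(N_S)]


def highway_step(v):
    bv = bellman_optimality(v)
    best = [float("-inf")] * N_S
    for table in TABLES:
        for part1, part2 in table.values():
            cand = [c + x for c, x in zip(part1, mat_vec(part2, bv))]
            best = [max(b, c) for b, c in zip(best, cand)]
    return best


def run(step_fn, n_iter):
    v, history = [0.0] * N_S, []
    for _ in range(n_iter):
        v = step_fn(v)
        history.append(v)
    return history


v_star = run(bellman_optimality, 2000)[-1]
hw, vi = run(highway_step, 30), run(bellman_optimality, 30)
for k in (0, 4, 9):
    print(k + 1, [round(x, 3) for x in hw[k]], [round(x, 3) for x in vi[k]])
```

Because rewards are non-negative here, the start vector $v_0 = 0$ lies below $V^*$, so every Highway iterate should be at least as close to $V^*$ as the corresponding value-iteration iterate, state by state.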

4.2. OFF-POLICY LEARNING IN MODEL-FREE REINFORCEMENT LEARNING

In model-free RL, it is convenient to use $Q$ instead of $V$. The corresponding operator is defined as

$(\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} Q)(s_0, a_0) \triangleq \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} \mathbb{E}_{\tau^{n+1}_{s_0,a_0} \sim \pi}\Big[\underbrace{\sum_{t=0}^{n} \gamma^t r_t + \gamma^{n+1} \max_{a'_{n+1}} Q(s_{n+1}, a'_{n+1})}_{G^{n+1}_Q(\tau^{n+1}_{s_0,a_0})}\Big]$, (7)

where $G^{n+1}_Q(\tau^{n+1}_{s_0,a_0})$ is the $(n+1)$-step return and $\tau^{n+1}_{s_0,a_0} \triangleq (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{n+1})$. (We use the same notation $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ to denote the operator w.r.t. the VF $V$ and the action VF $Q$ when there is no ambiguity. Similarly, we reuse the same symbols for the Bellman operators.) This operator can also be represented by the Bellman Operators, i.e., $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} Q \triangleq \max_{\pi} \max_{n} (B^\pi)^n B Q$, and it converges to the optimal action VF $Q^*$ with any set of behavioral policies $\widehat{\Pi}$, i.e., $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} Q^* = Q^*$ (similar to Theorem 1; see Theorem 6 in the Appendix for a formal statement). This means that it can utilize any off-policy data collected by an arbitrary policy, without additional corrections. We propose two methods, for the tabular VF and for VF approximation, named Highway Q-Learning and Highway DQN respectively.

Highway Q-Learning. Let $D^{(m)}_{s_0,a_0} = \{\tau^{n+1}_{s_0,a_0} \mid \tau^{n+1}_{s_0,a_0} \sim \pi_m\}$ denote the trajectory data collected by the policy $\pi_m$. The $k$-th VF $Q_k$ is updated in the following way:

$Q_{k+1}(s_0, a_0) = \max_{m \in M_{s_0,a_0}} \max_{n \in \mathcal{N}} \mathbb{E}_{D^{(m)}_{s_0,a_0}}\left[G^{n+1}_{Q_k}(\tau^{n+1}_{s_0,a_0})\right]$,

where $M_{s_0,a_0} \subseteq \{m : |D^{(m)}_{s_0,a_0}| \neq 0\}$ is a subset of indexes of the datasets that are not empty under $(s_0, a_0)$, and $\mathbb{E}_{D^{(m)}_{s_0,a_0}}[\cdot] = \frac{1}{|D^{(m)}_{s_0,a_0}|} \sum_{\tau^{n+1}_{s_0,a_0} \in D^{(m)}_{s_0,a_0}} [\cdot]$ is the empirical average. Note that all we need to do is save the trajectory data into the corresponding datasets $D^{(m)}_{s_0,a_0}$ and then search over them, without having to know the form of $\pi_m$ or to store $\pi_m$ in the set of behavioral policies $\widehat{\Pi}$. The algorithm, Highway Q-Learning, is presented in Appendix B, Algorithm B.3.

Highway DQN. For large-scale or continuous MDPs, the VF is usually approximated.
Due to the estimation noise involved in function approximation (Van Hasselt et al., 2016), our method may lead to overestimation caused by the two maximization operations (over policies and lookahead depths). However, as we show below, this issue can be easily solved through a minor modification. We propose a new variant of eq. (7), named Softmax Highway Operator, as follows:

$(\widetilde{\mathcal{G}}^{\widehat{\Pi}}_{\mathcal{N}} Q)(s_0, a_0) \triangleq \mathrm{smax}^{\alpha}_{\pi \in \widehat{\Pi}}\, \mathrm{smax}^{\alpha}_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} \mathbb{E}_{\tau^{n+1}_{s_0,a_0} \sim \pi}\left[G^{n+1}_Q(\tau^{n+1}_{s_0,a_0})\right]$, (9)

where the softmax operator $\mathrm{smax}^{\alpha}$ with the temperature parameter $\alpha$ is defined as

$\mathrm{smax}^{\alpha}_{x \in X} f(x) \triangleq \sum_{x \in X} \frac{\exp(\alpha f(x))}{\sum_{x' \in X} \exp(\alpha f(x'))} f(x)$;

$\mathrm{smax}^{\alpha}$ reduces to $\max$ when $\alpha \to \infty$. We have the following theorem.

Theorem 2 For any $\alpha$, any $\widehat{\Pi}$, and any $\mathcal{N}$, we have $\widetilde{\mathcal{G}}^{\widehat{\Pi}}_{\mathcal{N}} Q^* = Q^*$ and $(\forall Q \in l^\infty(S \times A)) : \|\widetilde{\mathcal{G}}^{\widehat{\Pi}}_{\mathcal{N}} Q - Q^*\| \le \gamma \|Q - Q^*\|$.

The operator in eq. (9) is derived in the following way. First, the Highway Q Operator $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ in eq. (7) can be rewritten in the equivalent form $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} Q \triangleq \max_{\pi \in \widehat{\Pi}} \max_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} (B^\pi)^n B Q$. Then, we replace the first two max operators $\max_{\pi} \max_{n'}$ with the softmax operators $\mathrm{smax}^{\alpha}_{\pi}\, \mathrm{smax}^{\alpha}_{n'}$. This modification is necessary to a) remain unbiased w.r.t. $Q^*$ (as shown in Theorem 2 in the Appendix); and b) alleviate the overestimation issue and improve exploration with the softmax operation, which has been shown to be effective in recent RL literature (Fox et al., 2015; Haarnoja et al., 2017; Schulman et al., 2017; Song et al., 2019). Based on this theoretically justified operator, we propose the following objective function for updating the action VF $Q_\theta(s, a)$ parametrized by $\theta$:

$L(\theta) = \sum_{(s_0, a_0) \in \widehat{D}} \Big(Q_\theta(s_0, a_0) - \mathrm{smax}^{\alpha}_{m \in M_{s_0,a_0}}\, \mathrm{smax}^{\alpha}_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} \mathbb{E}_{D^{(m)}_{s_0,a_0}}\left[G^{n+1}_{Q_{\theta'}}(\tau^{n+1}_{s_0,a_0})\right]\Big)^2$,

where $Q_{\theta'}$ is the target network parametrized by $\theta'$, occasionally copied from $\theta$, and $\widehat{D} = \{(s_0, a_0)\}$ is the sampled batch of state-action pairs $(s_0, a_0)$.
The computational complexity of the method implied by the equation above is close to that of existing eligibility trace-based methods (Schulman et al., 2016; Munos et al., 2016), which also need to compute the $n$-step returns $G^n_Q$ for each $n$. The resulting algorithm is called Highway DQN, presented in Algorithm B.4 in the Appendix. In practice, our algorithm balances the trade-off between accuracy and sample efficiency through the number of trials per policy, the size of the search space (of behavioral policies and lookahead depths), and the softmax temperature. While more trials per policy may improve the estimation accuracy, they also cost more samples and reduce sample efficiency. On the other hand, while a larger search space may increase efficiency, it might incur overestimation issues when the estimate is biased, leading to high variance. In summary, our Highway Q-Learning and Highway DQN can recycle any trajectory data collected by an arbitrary policy, and use multi-step trajectory data without requiring Importance Sampling-based corrections (as stated in Theorems 1 and 2).
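The softmax-weighted target inside the objective can be sketched for a single state-action pair as follows. This is a hedged illustration, not Algorithm B.3 or B.4: the data layout (each trajectory as a `(rewards, states)` pair that is at least `max(depths) + 1` steps long), the tabular `Q`, and the names `smax`, `n_step_return`, and `highway_target` are assumptions introduced here. The sketch assumes $\alpha \ge 0$.

```python
import math


def smax(values, alpha):
    """Softmax-weighted average of `values` (alpha >= 0 assumed).

    alpha = 0 gives the plain mean; large alpha approaches max(values).
    Subtracting the maximum before exponentiating avoids overflow.
    """
    top = max(values)
    weights = [math.exp(alpha * (v - top)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def n_step_return(traj, n, Q, gamma):
    """(n+1)-step return: n+1 discounted rewards plus a bootstrapped max-Q."""
    rewards, states = traj            # states[t] is s_t; len(states) = len(rewards) + 1
    ret = sum(gamma ** t * rewards[t] for t in range(n + 1))
    return ret + gamma ** (n + 1) * max(Q[states[n + 1]])


def highway_target(datasets, depths, Q, gamma, alpha):
    """smax over datasets, smax over depths n', max over n in {0, n'}."""
    per_policy = []
    for data in datasets:
        if not data:                  # skip policies with no data for this pair
            continue

        def avg(n):
            return sum(n_step_return(tr, n, Q, gamma) for tr in data) / len(data)

        per_depth = [max(avg(0), avg(n)) for n in depths]
        per_policy.append(smax(per_depth, alpha))
    return smax(per_policy, alpha)


# Hypothetical data for one (s0, a0): two behavioral policies, one trajectory each.
Q = [[0.0, 0.0] for _ in range(4)]            # tabular Q with 4 states, 2 actions
D1 = [([0.0, 0.0, 1.0], [0, 1, 2, 3])]        # delayed reward
D2 = [([1.0, 0.0, 0.0], [0, 1, 2, 3])]        # immediate reward
depths = [0, 1, 2]

near_max = highway_target([D1, D2], depths, Q, gamma=0.5, alpha=1000.0)
mean_like = highway_target([D1, D2], depths, Q, gamma=0.5, alpha=0.0)
print(near_max, mean_like)
```

The inner `max` over $\{0, n'\}$ is kept as a hard maximum, mirroring eq. (9), so the one-step return is always among the candidates; only the choices over policies and over $n'$ are softened by the temperature.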

5. THEORETICAL ANALYSIS

In this section, we study the theoretical properties of the Highway operator $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ and show its superiority over classical Bellman operators, e.g., $B$ and $B^\pi$. For convenience, our analysis is in the space of state VFs, i.e., the operators $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}, B, B^\pi, \dots$ are assumed to be mappings on $l^\infty(S)$. First, we compare our Highway Operator $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ to the Bellman Optimality Operator $B$.

Theorem 3 (Comparison to Bellman Optimality Operator $B$) For all $V, V_0 \in l^\infty(S)$ the following holds:
1) $\|\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V - V^*\| \le \|BV - V^*\|$;
2) Assume $V_0 \le V^*$. For any $s$ we have $|\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V_0(s) - V^*(s)| \le |BV_0(s) - V^*(s)|$, where the strict inequality holds as long as there exists $\pi' \in \widehat{\Pi}$ such that $\arg\max_{n \in \mathcal{N}} \big((B^{\pi'})^n B V_0\big)(s) > 0$.

The first point of the theorem implies that our operator converges at the same rate as the Bellman Optimality Operator in the worst case. The second point gives a state-wise convergence comparison for the case $V_0 \le V^*$. Our operator generally converges faster than the Bellman Optimality Operator as long as one behavioral policy finds a better path by looking forward for $n$ steps ($n > 0$). Note that the condition $V_0 \le V^*$ can be easily satisfied by setting $V_0 = \min_{s', a'} r(s', a')$ when rewards are non-negative (in general, $V_0 = \min_{s', a'} r(s', a') / (1 - \gamma)$ is a valid lower bound on $V^*$). Moreover, as long as $V_0 \le V^*$, we have $(\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}})^{\circ k} V_0 \le V^*$ for any $k$ (see Lemma 2 in the Appendix).

Then, we show the relation of our Highway operator to the Multi-step Bellman Expectation operator $B^{\pi}_{N} \triangleq (B^\pi)^{\circ N}$, which is adopted in Generalized Policy Iteration (GPI) (see Algorithm B.2 in the Appendix). GPI needs to balance the evaluation-improvement trade-off by adapting the hyperparameter $N$ in $(B^\pi)^{\circ N}$. Our Highway operator provides an optimal solution for choosing this hyperparameter $N$ in terms of approaching the optimal VF.

Theorem 4 (Comparison to Multi-Step Bellman Expectation Operator) Assume that the VF $V_k \le V^*$. Let $V^{\mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}}}_{k+1}$ and $V^{B^{\pi_k}_{N}}_{k+1}$ denote the $(k+1)$-th VF of Highway Value Iteration with hyperparameter $\mathcal{N}$ and of Generalized Policy Iteration with hyperparameter $N$, respectively.
We have

$\big\|V^{\mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}}}_{k+1} - V^*\big\| \le \min_{N \in \mathcal{N}} \big\|V^{B^{\pi_k}_{N+1}}_{k+1} - V^*\big\|$. (12)

Next, we show that by assuming that some of the behavioral policies act optimally within a few time steps, our Highway operator can achieve better convergence rates.

Assumption A($\widehat{\Pi}$, $n$) Given a set of behavioral policies $\widehat{\Pi}$ and a lookahead depth $n$, let $S^{\pi'}_{s,n}$ denote the set of all states that can be visited by executing policy $\pi'$ for $n$ steps from state $s$. For each $s \in S$, there exists at least one policy $\pi' \in \widehat{\Pi}$ such that $\pi'(s') = \pi^*(s')$ for any $s' \in S^{\pi'}_{s,n}$, where $\pi^*$ refers to an optimal policy. Note the quantification order: $(\forall s \in S, \exists \pi' \in \widehat{\Pi}, \forall s' \in S^{\pi'}_{s,n}) : \pi'(s') = \pi^*(s')$.

Theorem 5 (Better Contraction Rate) Assume $N - 1 \in \mathcal{N}$, $\widehat{\Pi}$ satisfies Assumption A($\widehat{\Pi}$, $N - 1$), and $V \in l^\infty(S)$ with $V \le V^*$. Then the convergence rate of $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ is $\gamma^N$, i.e., $\|\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V - V^*\| \le \gamma^N \|V - V^*\|$.

This theorem implies that when the set of behavioral policies $\widehat{\Pi}$ satisfies Assumption A($\widehat{\Pi}$, $N - 1$), our operator $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ has a convergence rate of $\gamma^N$. Note that, in Assumption A($\widehat{\Pi}$, $N - 1$), $\widehat{\Pi}$ is not required to include an optimal policy $\pi^*$. Instead, it only requires that, for each state, there exists one policy that behaves well within a period ($N - 1$ consecutive steps), and the well-behaved policy may differ from one starting state to another. This condition is much weaker than having some optimal or near-optimal policies included in the behavioral policy set. Specifically, Assumption A($\widehat{\Pi}'$, 1) can be satisfied by constructing $\widehat{\Pi}' = \{\pi_a \mid a \in A, (\forall s \in S) : \pi_a(a \mid s) = 1\}$, yielding a convergence rate of $\gamma^2$. Note that although the operator $(B)^{\circ N}$ also leads to a convergence rate of $\gamma^N$, our Highway operator $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ differs essentially from it. A major difference is that the Highway operator $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ applies the Multi-step Bellman Expectation Operator $(B^\pi)^{\circ n}$ instead of the Multi-step Bellman Optimality Operator $(B)^{\circ N}$.
In the model-free case, $(B^\pi)^{\circ n} V$ can utilize the $n$-step trajectory data generated by $\pi$ at minor cost (just by accumulating rewards within $n$ steps, i.e., $\sum_{n'=0}^{n-1} \gamma^{n'} r_{t+n'} + \gamma^n V(s_{t+n})$). In contrast, $(B)^{\circ N}$ can only utilize 1-step data and needs to update the VF $N$ times (as implied by eq. 2).
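The $\gamma^2$ rate claimed for the constant-action policy set $\{\pi_a\}$ can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the paper's analysis: the 3-state MDP and all names are hypothetical, and the depth set is taken to be $\{0, 1\}$. For this policy set and depth set, $\max_a (B^{\pi_a} B V)(s) = (B B V)(s)$, so the Highway update reduces to the state-wise maximum of $BV$ and $BBV$.

```python
import random

GAMMA = 0.9
N_S, N_A = 3, 2

# Hypothetical MDP: T[a][s][s2] = T(s2 | s, a), r[a][s] = r(s, a).
T = [
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],
]
r = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.5]]


def backup(V, s, a):
    return r[a][s] + GAMMA * sum(p * v for p, v in zip(T[a][s], V))


def bellman_optimality(V):
    return [max(backup(V, s, a) for a in range(N_A)) for s in range(N_S)]


def highway_depth01(V):
    """Highway update with constant-action policies pi_a and depths {0, 1}."""
    BV = bellman_optimality(V)
    # max over a of (B^{pi_a} B V)(s); this equals (B B V)(s)
    one_step = [max(backup(BV, s, a) for a in range(N_A)) for s in range(N_S)]
    return [max(b, o) for b, o in zip(BV, one_step)]


def sup_norm(U, V):
    return max(abs(u - v) for u, v in zip(U, V))


# Reference optimal values from many sweeps of plain value iteration.
V_star = [0.0] * N_S
for _ in range(2000):
    V_star = bellman_optimality(V_star)

# Draw random V <= V* and record the worst violation of the gamma^2 bound.
random.seed(0)
worst_gap = float("-inf")
for _ in range(100):
    V = [v - random.uniform(0.0, 5.0) for v in V_star]
    lhs = sup_norm(highway_depth01(V), V_star)
    rhs = GAMMA ** 2 * sup_norm(V, V_star)
    worst_gap = max(worst_gap, lhs - rhs)

print("largest value of lhs - gamma^2 * rhs:", worst_gap)
```

A non-positive `worst_gap` (up to floating-point error) is what the theorem predicts for every $V \le V^*$; the check says nothing about starting points above $V^*$, where only the rate $\gamma$ of Theorem 1 is guaranteed.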

6. RELATED WORK

Multi-step RL methods have a long history, including multi-step SARSA (Sutton & Barto, 2018), Tree Backup (Precup, 2000), Q(σ) (Asis et al., 2017), and Monte Carlo methods (which can be regarded as ∞-step learning). The λ-return assigns exponentially decaying weights to the n-step returns, governed by the decay factor λ (Sutton & Barto, 2018; Schulman et al., 2016; White & White, 2016). Sahil et al. (2017) proposed a more general form called weighted returns, which assigns weights to all n-step returns. Roughly, the lookahead depth, the decay factor, and the weights represent prior knowledge or bias regarding appropriate CA intervals, and are usually tuned case by case. Our method instead adaptively adjusts the lookahead depth in line with the quality of the data and the learned VF. Another work related to CA is RUDDER (Arjona-Medina et al., 2019), which trains an LSTM to re-assign credit to previous actions. Our method derives a simple but sound principle for transporting credit at minor cost. Existing off-policy learning methods usually apply additional corrections to off-policy data. Classical importance sampling (IS)-based methods suffer from high variance due to the products of IS ratios (Cortes et al., 2010; Metelli et al., 2018). Several variance reduction techniques have been proposed. Retrace(λ) (Munos et al., 2016) reduces the variance by clipping the IS ratios, and has achieved great success in practice. Other work avoids IS: TB(λ) (Precup, 2000) corrects the off-policy data by computing expectations w.r.t. data and estimated VF, using the probabilities of the target policy. Q(λ) (Harutyunyan et al., 2016) corrects the off-policy data by adding an off-policy correction term. Our method provides an alternative IS-free tool for off-policy learning. Similar to Retrace(λ), it can safely use arbitrary off-policy data. Moreover, it is very efficient, offering a faster convergence rate under mild conditions.
More research is needed, however, to understand how the variance of our method compares to that of advanced IS-based variance reduction methods like Retrace(λ). Searching over various policies and/or lookahead depths to improve the convergence of RL systems is an active field. Barreto et al. (2020) search over various policies while using only a fixed lookahead depth until the end of the trial. He et al. (2017) were the first to search over various lookahead depths along the trajectory data, but they do not search over various policies. Moreover, they employ the lookahead returns to construct additional inequality bounds on the VF. Our work instead contributes a novel Bellman operator that updates the VF directly, together with a thorough theoretical analysis of its convergence properties. We also address the overestimation that greedy operations (Barreto et al., 2020; He et al., 2017) may cause, by proposing a novel softmax operation that alleviates this issue in an unbiased fashion. Compared to previous methods that search over the product of the original policy space and action space (Efroni et al., 2018; 2019; 2020; Jiang et al., 2018; Tomar et al., 2020), ours has a smaller search space (i.e., a limited set of policies). In summary, this paper contributes a theoretically justified Bellman operator that searches over both policies and lookahead depths, leading to more flexibility and scalability across different settings and applications. Our method can also be viewed as combining the best of both worlds: (1) direct policy search based on policy gradients (Williams, 1992) or evolutionary computation (Rechenberg, 1971) (where the lookahead equals the trial length, with no runtime-based identification of useful subprograms and subgoals), and (2) dynamic programming-based (Bellman, 1957) identification of useful subgoals during runtime.
This naturally and safely combines best subgoals/sub-policies (Schmidhuber, 1991) derived from data collected by arbitrary previously encountered policies. It can be viewed as a soft hierarchical chunking method (Schmidhuber, 1992) for rapid CA, with a sound Bellman-based foundation.

7. EXPERIMENTS

We designed our experiments to evaluate the algorithms under different cases and investigate their properties. Please refer to Appendix C for additional details of the experiment settings. Model-based Toy Tasks. We first evaluate on a model-based task: Multi-Room environments with different numbers of rooms. The agent needs to go through many rooms and reach the goal to get a reward. We compare our Highway Value Iteration to classical Value Iteration (VI) and Policy Iteration (PI). The algorithms are evaluated until convergence. As shown in the three plots of Fig. 2 (a), our Highway Value Iteration outperforms VI and PI in terms of the number of iterations required, the total number of samples (total number of queries to the MDP model), and computation time. Model-free Toy Tasks. To evaluate credit assignment efficiency, we evaluate the algorithms on two toy tasks involving delayed rewards (Arjona-Medina et al., 2019), in which a reward is only provided at the end of each trial. We compare our method to classical eligibility trace methods, including Q(λ), SARSA(λ), and Monte Carlo methods, and also to the advanced credit assignment method RUDDER. As shown in Fig. 2 (b) and (c), our method significantly outperforms all competitors on both tasks. For example, on Trace Back, our method requires only 20 episodes to solve the task, while the second best algorithm, RUDDER, requires more than 1000. Notably, the costs of Highway Q-Learning do not observably increase with the reward delays. In contrast, other methods such as Q(λ) require exponentially increasing numbers of trials. MinAtar Games. We evaluate our algorithms on benchmark tasks from MinAtar (Young & Tian, 2019). We compare to several advanced multi-step off-policy methods, including Multi-Step DQN (Horgan et al., 2018; Barth-Maron et al., 2018) and Retrace(λ) (Munos et al., 2016). All multi-step methods are implemented on top of Maxmin DQN (Lan et al., 2020), which chooses the smallest Q-value among multiple target Q networks.
All the compared methods adopt the same implementations to ensure that we measure differences between algorithms, not implementations. Fig. 3 shows the performance of the algorithms. Highway DQN significantly outperforms all competitors in terms of both reward and sample efficiency on almost all the tasks. Compared to the advanced Retrace(λ), our method performs significantly better on 3 of 5 tasks while performing on par with it on the remaining 2 tasks. Ablation Study. We conducted the following ablation studies to investigate properties of Highway DQN. 

8. CONCLUSIONS

We introduced a novel multi-step Bellman Optimality Equation for efficient multi-step credit assignment (CA) in reinforcement learning (RL). We proved that its solution is the optimal value function and that the corresponding policy adjustments generally converge faster than the traditional Bellman Optimality operator. Our Highway RL methods combine the best of direct policy search (where CA is performed without trying to identify useful environmental states or subgoals during runtime), and standard RL, which finds useful states/subgoals through dynamic programming. Highway RL quickly and safely extracts useful sub-policies derived from data obtained through previously tested policies. The derived algorithms have several advantages over existing off-policy algorithms. Their feasibility and effectiveness were experimentally illustrated on a series of standard benchmark datasets. Future work will theoretically analyze the algorithm's behavior in the model-free case.

A THEOREM PROOFS

A.1 PROOF OF PROPERTIES OF HIGHWAY OPERATOR $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$

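Theorem 1 (Properties of Highway Operator) For any $\widehat{\Pi}$ and $\mathcal{N}$ (s.t. $0 \in \mathcal{N}$), we have
1) $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ is a contraction on the complete metric space $l^\infty(S)$, i.e., for any $V, V' \in l^\infty(S)$, we have $\|\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V - \mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V'\| \le \|BV - BV'\| \le \gamma \|V - V'\|$;
2) (Highway Bellman Optimality Equation, Highway Equation in short) $V^*$ is the only fixed point of $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$, that is, for all $V \in l^\infty(S)$, $V = \mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V$ holds if and only if $V = V^*$. Formally, we have $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V^* \triangleq \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} (B^\pi)^n B V^* = V^*$;
3) For any $V_0 \in l^\infty(S)$ and any sequence of policy sets $(\widehat{\Pi}_k)$, $k \in \mathbb{N}$, the sequence $(\mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}} \circ \mathcal{G}^{\widehat{\Pi}_{k-1}}_{\mathcal{N}} \circ \dots \circ \mathcal{G}^{\widehat{\Pi}_1}_{\mathcal{N}})[V_0]$, $k \in \mathbb{N}$, converges R-linearly to $V^*$ with convergence rate $\gamma$.

Proof: 1) The contraction property is obtained as follows:

$\|\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V - \mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V'\| = \big\|\max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} (B^\pi)^n B V - \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} (B^\pi)^n B V'\big\| \le \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} \|(B^\pi)^n B V - (B^\pi)^n B V'\| \le \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} \gamma^n \|BV - BV'\| \le \|BV - BV'\| \le \gamma \|V - V'\|$.

2) By applying the Banach fixed-point theorem to point 1), the operator $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$ must have a unique fixed point. Therefore, it suffices to verify that $V^*$ is a fixed point of $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}}$. Our operator can be rewritten as $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V \triangleq \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} (B^\pi)^n B V$. Using the fact that $B^\pi V^* \le B V^* = V^*$ and the monotonicity of $B^\pi$, we obtain $(B^\pi)^n B V^* \le V^*$ for any $\pi$ and $n \ge 0$ (with equality when $n = 0$), which implies $\max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} (B^\pi)^n B V^* = V^*$.

3) From 1) and 2) it follows that

$\|(\mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}} \circ \mathcal{G}^{\widehat{\Pi}_{k-1}}_{\mathcal{N}} \circ \dots \circ \mathcal{G}^{\widehat{\Pi}_1}_{\mathcal{N}})[V_0] - V^*\| = \|(\mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}} \circ \mathcal{G}^{\widehat{\Pi}_{k-1}}_{\mathcal{N}} \circ \dots \circ \mathcal{G}^{\widehat{\Pi}_1}_{\mathcal{N}})[V_0] - \mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}} V^*\| \le \gamma \|(\mathcal{G}^{\widehat{\Pi}_{k-1}}_{\mathcal{N}} \circ \dots \circ \mathcal{G}^{\widehat{\Pi}_1}_{\mathcal{N}})[V_0] - V^*\|$.

By repeating the same argument we end up with $\|(\mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}} \circ \mathcal{G}^{\widehat{\Pi}_{k-1}}_{\mathcal{N}} \circ \dots \circ \mathcal{G}^{\widehat{\Pi}_1}_{\mathcal{N}})[V_0] - V^*\| \le \gamma^k \|V_0 - V^*\|$, from which the statement follows. □

Theorem 2 For any $\alpha$, any $\widehat{\Pi}$, and any $\mathcal{N}$, we have $\widetilde{\mathcal{G}}^{\widehat{\Pi}}_{\mathcal{N}} Q^* = Q^*$ and $(\forall Q \in l^\infty(S \times A)) : \|\widetilde{\mathcal{G}}^{\widehat{\Pi}}_{\mathcal{N}} Q - Q^*\| \le \gamma \|Q - Q^*\|$.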
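Proof: Recall that $(\widetilde{\mathcal{G}}^{\widehat{\Pi}}_{\mathcal{N}} Q)(s_0, a_0) \triangleq \mathrm{smax}^{\alpha}_{\pi \in \widehat{\Pi}}\, \mathrm{smax}^{\alpha}_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} \mathbb{E}_{\tau^{n+1}_{s_0,a_0} \sim \pi}\left[G^{n+1}_Q(\tau^{n+1}_{s_0,a_0})\right]$. This operator can be represented as $\widetilde{\mathcal{G}}^{\widehat{\Pi}}_{\mathcal{N}} Q \triangleq \mathrm{smax}^{\alpha}_{\pi \in \widehat{\Pi}}\, \mathrm{smax}^{\alpha}_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} (B^\pi)^n B Q$. First, following the proof of Theorem 1, we have $(B^\pi)^n B Q^* \le Q^*$ for any $\pi$ and $n \ge 0$ (with equality when $n = 0$). Hence $\max_{n \in \{0, n'\}} (B^\pi)^n B Q^* = Q^*$ for every $\pi$ and $n'$, and therefore

$\mathrm{smax}^{\alpha}_{\pi \in \widehat{\Pi}}\, \mathrm{smax}^{\alpha}_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} (B^\pi)^n B Q^* = Q^*$.

Given an action VF $Q$, let us define two distributions over $\widehat{\Pi}$ and $\mathcal{N}$, denoted by $P^{s,a}_{\widehat{\Pi}}$ and $P^{s,a}_{\mathcal{N}}$, such that

$\mathbb{E}_{\pi \sim P^{s,a}_{\widehat{\Pi}},\, n' \sim P^{s,a}_{\mathcal{N}}}\left[\max_{n \in \{0, n'\}} \big((B^\pi)^n B Q\big)(s, a)\right] = \mathrm{smax}^{\alpha}_{\pi \in \widehat{\Pi}}\, \mathrm{smax}^{\alpha}_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} \big((B^\pi)^n B Q\big)(s, a)$.

It follows that

$\|\widetilde{\mathcal{G}}^{\widehat{\Pi}}_{\mathcal{N}} Q - Q^*\| = \big\|\mathrm{smax}^{\alpha}_{\pi \in \widehat{\Pi}}\, \mathrm{smax}^{\alpha}_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} (B^\pi)^n B Q - Q^*\big\| = \big\|\mathbb{E}_{\pi \sim P_{\widehat{\Pi}},\, n' \sim P_{\mathcal{N}}} \max_{n \in \{0, n'\}} (B^\pi)^n B Q - \mathbb{E}_{\pi \sim P_{\widehat{\Pi}},\, n' \sim P_{\mathcal{N}}} \max_{n \in \{0, n'\}} (B^\pi)^n B Q^*\big\| \le \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} \|(B^\pi)^n B Q - (B^\pi)^n B Q^*\| \le \gamma \|Q - Q^*\|$. □

Corollary 1 Assume $Q_0 \in l^\infty(S \times A)$ and let $\widehat{\Pi}_k$, $k \in \mathbb{N}$, be a sequence of sets of behavioral policies. Then the sequence $(\widetilde{\mathcal{G}}^{\widehat{\Pi}_k}_{\mathcal{N}} \circ \widetilde{\mathcal{G}}^{\widehat{\Pi}_{k-1}}_{\mathcal{N}} \circ \dots \circ \widetilde{\mathcal{G}}^{\widehat{\Pi}_1}_{\mathcal{N}})[Q_0]$, $k \in \mathbb{N}$, converges R-linearly to $Q^*$ with convergence rate $\gamma$.

Proof: The proof follows from Theorem 2 and proceeds analogously to the proof of Theorem 1, point 3). □

Theorem 3 (Comparison to Bellman Optimality Operator $B$) For all $V, V_0 \in l^\infty(S)$ the following holds:
1) $\|\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V - V^*\| \le \|BV - V^*\|$;
2) Assume $V_0 \le V^*$. For any $s$ we have $|\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V_0(s) - V^*(s)| \le |BV_0(s) - V^*(s)|$, where the strict inequality holds as long as there exists $\pi' \in \widehat{\Pi}$ such that $\arg\max_{n \in \mathcal{N}} \big((B^{\pi'})^n B V_0\big)(s) > 0$.

Proof: 1) has been proved in Theorem 1. 2) Note that $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V_0 \ge B V_0$. Further, if $V_0 \le V^*$, then $\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V_0 \le V^*$ and $B V_0 \le V^*$ (as both operators are monotonic). Putting everything together we obtain $0 \le V^* - \mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V_0 \le V^* - B V_0$, from which the desired inequality follows. □

Theorem 4 (Comparison to Multi-Step Bellman Expectation Operator) Assume that the VF $V_k \le V^*$.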
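Let $V^{\mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}}}_{k+1}$ and $V^{B^{\pi_k}_{N}}_{k+1}$ denote the $(k+1)$-th VF of Highway Value Iteration with hyperparameter $\mathcal{N}$ and of Generalized Policy Iteration with hyperparameter $N$, respectively. We have

$\big\|V^{\mathcal{G}^{\widehat{\Pi}_k}_{\mathcal{N}}}_{k+1} - V^*\big\| \le \min_{N \in \mathcal{N}} \big\|V^{B^{\pi_k}_{N+1}}_{k+1} - V^*\big\|$.

Proof: First, for any $N \in \mathcal{N}$, $\pi' \in \widehat{\Pi}$, and any $s \in S$,

$\max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} \big((B^\pi)^n B V_k\big)(s) \ge \big((B^{\pi'})^N B V_k\big)(s)$.

According to Algorithm B.1, we know that $\pi_k \in \widehat{\Pi}_k$. Then, for any $s$ we have

$|\mathcal{G}^{\widehat{\Pi}}_{\mathcal{N}} V_k(s) - V^*(s)| = \big|\max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} \big((B^\pi)^n B V_k\big)(s) - V^*(s)\big| = V^*(s) - \max_{\pi \in \widehat{\Pi}} \max_{n \in \mathcal{N}} \big((B^\pi)^n B V_k\big)(s) = \min_{\pi \in \widehat{\Pi}} \min_{n \in \mathcal{N}} \big(V^*(s) - \big((B^\pi)^n B V_k\big)(s)\big) = \min_{\pi \in \widehat{\Pi}} \min_{n \in \mathcal{N}} \big|\big((B^\pi)^n B V_k\big)(s) - V^*(s)\big| \le \min_{n \in \mathcal{N}} \big|\big((B^{\pi_k})^n B V_k\big)(s) - V^*(s)\big| = \min_{N \in \mathcal{N}} \big|\big((B^{\pi_k})^{N+1} V_k\big)(s) - V^*(s)\big| = \min_{N \in \mathcal{N}} \big|V^{B^{\pi_k}_{N+1}}_{k+1}(s) - V^*(s)\big|$,

from which the conclusion follows. □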

Assumption A($\Pi$, $n$) Given a set of behavioral policies $\Pi$ and a lookahead depth $n$, let $S^{\pi'}_{s,n}$ denote the set of all states that can be visited by executing policy $\pi'$ for $n$ steps from state $s$. For each $s \in S$, there exists at least one policy $\pi' \in \Pi$ such that $\pi'(s') = \pi^*(s')$ for any $s' \in S^{\pi'}_{s,n}$, where $\pi^*$ refers to an optimal policy. Note the quantification order: $(\forall s \in S, \exists \pi' \in \Pi, \forall s' \in S^{\pi'}_{s,n}) : \pi'(s') = \pi^*(s')$.

Theorem 5 (Better Contraction Rate) Assume $N - 1 \in \mathcal{N}$, $\Pi$ satisfies Assumption A($\Pi$, $N-1$), $V \in l^{\infty}(S)$ and $V \le V^*$. Then the convergence rate of $\mathcal{G}^{\Pi}_{\mathcal{N}}$ is $\gamma^{N}$, i.e., $\| \mathcal{G}^{\Pi}_{\mathcal{N}} V - V^* \| \le \gamma^{N} \| V - V^* \|$.

Proof: First, if $\Pi$ satisfies Assumption A($\Pi$, $N-1$), that is, starting from any $s$ there exists at least one $\pi \in \Pi$ executing the optimal actions for $N-1$ steps, then we have

$$\mathcal{G}^{\Pi}_{\mathcal{N}} V = \max_{\pi \in \Pi} \max_{n' \in \mathcal{N}} (B^{\pi})^{n'} B V \ge \max_{\pi \in \Pi} (B^{\pi})^{N-1} B V \ge (B^{\pi^*})^{N-1} B V.$$

Second, if $V \le V^*$, then $\mathcal{G}^{\Pi}_{\mathcal{N}} V \le V^*$ ($\mathcal{G}^{\Pi}_{\mathcal{N}}$ is monotonic) and $(B^{\pi^*})^{N-1} B V \le V^*$. Finally, with the results above, we have

$$\left\| \mathcal{G}^{\Pi}_{\mathcal{N}} V - V^* \right\| \le \left\| (B^{\pi^*})^{N-1} B V - V^* \right\| = \left\| (B^{\pi^*})^{N-1} B V - (B^{\pi^*})^{N-1} B V^* \right\| \le \gamma^{N} \left\| V - V^* \right\|.$$

To utilize the better convergence rates obtained in Theorem 5, we have to be a bit more careful than in Theorem 1, point 3), and take care of the monotonicity requirements in Theorem 5. This is done in the following lemma.

Lemma 1 Assume a sequence of operators $T_k$, $k \in \mathbb{N}$, on $l^{\infty}(S)$, where all $T_k$ satisfy $(\forall V \in l^{\infty}(S), V \le V^*) : \| T_k V - V^* \| \le \gamma' \| V - V^* \|$ with common convergence rate $\gamma'$ and common limit $V^*$. Further assume all $T_k$ are monotonic and have fixed point $V^*$. Then for any $V_0 \in l^{\infty}(S)$, $V_0 \le V^*$, the sequence $(T_k \circ T_{k-1} \circ \dots \circ T_1)[V_0]$, $k \in \mathbb{N}$, converges R-linearly to $V^*$ with convergence rate $\gamma'$.

Proof: Assuming $V_0 \in l^{\infty}(S)$ and $V_0 \le V^*$, the monotonicity and the fixed point of the operators $T_1, \dots, T_{k-1}$ imply $V := (T_{k-1} \circ T_{k-2} \circ \dots \circ T_1)[V_0] \le V^*$. Now we can apply the inequality with $\gamma'$ to get

$$\left\| (T_k \circ T_{k-1} \circ \dots \circ T_1)[V_0] - V^* \right\| \le \gamma' \left\| (T_{k-1} \circ \dots \circ T_1)[V_0] - V^* \right\|.$$

By repeating the same argument we end up with

$$\left\| (T_k \circ T_{k-1} \circ \dots \circ T_1)[V_0] - V^* \right\| \le \gamma'^{k} \left\| V_0 - V^* \right\|.$$

Regardless of the sets $\Pi_k$, the algorithm converges to the optimal value function by Theorem 1, point 3). This means that policies increasingly close to optimal are added to the set $\Pi_k$ as the iteration number $k$ grows. When Assumption A($\cdot$, $n$) becomes satisfied for some $n$ at iteration $k_0$, then, due to the monotonicity of the sequence $\Pi_k$, $k \in \mathbb{N}$ (i.e., the set $\Pi_k$ can only grow over time), Assumption A($\cdot$, $n$) is satisfied for the whole suffix $\Pi_k$, $k \ge k_0$. We can then use Lemma 1 on the corresponding suffix of the operator sequence $\mathcal{G}^{\Pi_k}_{\mathcal{N}}$, $k \ge k_0$, with $\gamma' = \gamma^{n+1}$ to claim the better convergence rate ($\gamma^{n+1}$) of the corresponding suffix of value functions $V_k$, $k \ge k_0$. When $S$ is finite, this gives us a monotonic improvement of the convergence rate of the sequence $V_k$, $k \in \mathbb{N}$, as Assumption A($\cdot$, $n$) eventually becomes satisfied for larger and larger $n$.

Theorem 6 (Highway Bellman Optimality Equation (Highway Equation) w.r.t. action VF $Q$) For any $\Pi$ and $\mathcal{N}$, we have
1. $\mathcal{G}^{\Pi}_{\mathcal{N}}$ is a contraction on the complete metric space $l^{\infty}(S \times A)$, i.e., for any $Q, Q' \in l^{\infty}(S \times A)$, we have $\| \mathcal{G}^{\Pi}_{\mathcal{N}} Q - \mathcal{G}^{\Pi}_{\mathcal{N}} Q' \| \le \| B Q - B Q' \| \le \gamma \| Q - Q' \|$;
2. $Q^*$ is the only fixed point of $\mathcal{G}^{\Pi}_{\mathcal{N}}$. That is, $\mathcal{G}^{\Pi}_{\mathcal{N}} Q^* = Q^*$, and for all $Q \in l^{\infty}(S \times A)$, $Q = \mathcal{G}^{\Pi}_{\mathcal{N}} Q$ holds if and only if $Q = Q^*$.

Proof: Note that $\mathcal{G}^{\Pi}_{\mathcal{N}}$ w.r.t. the action VF $Q$ can also be represented by the Bellman operators w.r.t. $Q$, i.e., $\mathcal{G}^{\Pi}_{\mathcal{N}} Q \triangleq \max_{\pi \in \Pi} \max_{n \in \mathcal{N}} (B^{\pi})^{n} B Q$. Except for the change of value spaces ($l^{\infty}(S \times A)$ instead of $l^{\infty}(S)$), the proof is the same as for Theorem 1. □

Lemma 2 If $V_0 \le V^*$, then $(\mathcal{G}^{\Pi_k}_{\mathcal{N}} \circ \mathcal{G}^{\Pi_{k-1}}_{\mathcal{N}} \circ \dots \circ \mathcal{G}^{\Pi_1}_{\mathcal{N}}) V_0 \le V^*$ for any $k$ and any sequence of policy sets $\Pi_1, \Pi_2, \dots, \Pi_k$.

Proof: As $\mathcal{G}^{\Pi}_{\mathcal{N}}$ is monotonic (for any set of behavioral policies $\Pi$), we have $\mathcal{G}^{\Pi_1}_{\mathcal{N}} V_0 \le \mathcal{G}^{\Pi_1}_{\mathcal{N}} V^* = V^*$.
Further, applying $\mathcal{G}^{\Pi_2}_{\mathcal{N}}$ to both sides, we obtain $(\mathcal{G}^{\Pi_2}_{\mathcal{N}} \circ \mathcal{G}^{\Pi_1}_{\mathcal{N}}) V_0 \le \mathcal{G}^{\Pi_2}_{\mathcal{N}} V^* = V^*$. Repeating the argument a further $k - 2$ times, we get the result.

A.2 PROOF OF PROPERTIES OF MULTI-STEP BELLMAN OPTIMALITY OPERATOR

Recent works combined multi-step trajectory data and estimated the values of the most promising actions without any correction for the off-policy data (Horgan et al., 2018; Barth-Maron et al., 2018). The underlying operator is defined as

$$B^{\Pi}_{N} Q(s_0, a_0) \triangleq \mathbb{E}_{\pi \sim P_{\Pi}}\left[ (B^{\pi})^{N-1} B Q(s_0, a_0) \right] = \mathbb{E}_{\pi \sim P_{\Pi},\, \tau^{N}_{s_0,a_0} \sim \pi}\left[ \sum_{t=0}^{N-1} \gamma^{t} r_t + \gamma^{N} \max_{a'_N} Q(s_N, a'_N) \right],$$

where $P_{\Pi}$ is a distribution over $\Pi$ and $P_{\Pi}(\pi)$ is the probability of selecting $\pi$. Here we use $\pi \sim P_{\Pi}$ and $\tau^{N}_{s_0,a_0} \sim \pi$ to formulate the procedure of prioritized experience replay (Schaul et al., 2015; Horgan et al., 2018), which samples the trajectory data collected by various behavioral policies according to a prior distribution. Although these methods have shown promising results in practice, below we will show that this operator is generally biased w.r.t. the optimal action VF $Q^*$. In other words, the corresponding fixed point $Q^*_{B^{\Pi}_{N}}$ is different from the optimal VF $Q^*$, and unbiased learning only happens when all behavioral policies are optimal.

Theorem 7 (Properties of the Multi-step Bellman Optimality Operator $B^{\Pi}_{N}$) For any $N \ge 2$,
1) The operator $B^{\Pi}_{N}$ is a contraction on the complete metric space $l^{\infty}(S \times A)$, i.e., for any two vectors $Q, Q' \in l^{\infty}(S \times A)$, $\| B^{\Pi}_{N} Q - B^{\Pi}_{N} Q' \| \le \gamma^{N} \| Q - Q' \|$.
2) Let $Q^*_{B^{\Pi}_{N}}$ denote the fixed point of $B^{\Pi}_{N}$, i.e., $B^{\Pi}_{N} Q^*_{B^{\Pi}_{N}} = Q^*_{B^{\Pi}_{N}}$. Then $Q^*_{B^{\Pi}_{N}} \le Q^*$.
3) $Q^*_{B^{\Pi}_{N}} = Q^*$ if and only if every $\pi \in \Pi$ with $P_{\Pi}(\pi) > 0$ satisfies $\pi(s) = \pi^*(s)$ for all $s \in U := \{ s_1 \in S \mid \exists s_0, a_0 : T(s_1 | s_0, a_0) > 0 \}$, where $\pi^*$ is an optimal policy.
Before giving the proof of the above theorem, we would like to mention a variant of our Highway Operator, named Expectation Highway Operator, defined as

$$\mathcal{G}^{\mathcal{N}}_{\Pi} Q(s_0, a_0) \triangleq \mathbb{E}_{\pi \sim P_{\Pi}}\left[ \max_{n \in \mathcal{N}} (B^{\pi})^{n-1} B Q(s_0, a_0) \right].$$

Compared to the above Multi-Step Bellman Optimality Operator in eq. (17), our operator uses maximization over lookahead depths instead of a fixed lookahead depth. It is interesting to note that this operator is unbiased w.r.t. $Q^*$ for any set of behavioral policies $\Pi$. Formally, we have the following theorem.

Using the previous result we obtain $((B^{\pi})^{N-1} B Q^*)(s_0, a_0) \le B^{\pi} Q^*(s_0, a_0) < Q^*(s_0, a_0)$, and applying the expectation over the finite set $\Pi$ using $P(\pi) > 0$ we get $B^{\Pi}_{N} Q^*(s_0, a_0) < Q^*(s_0, a_0)$, which can be combined with eq. (20) to show $Q^*_{B^{\Pi}_{N}}(s_0, a_0) < Q^*(s_0, a_0)$.

We now give the proof of Theorem 8.

Proof of Theorem 8: 1) We first prove the contraction property,

$$\left\| \mathcal{G}^{\mathcal{N}}_{\Pi} Q - \mathcal{G}^{\mathcal{N}}_{\Pi} Q' \right\| \le \left\| \mathbb{E}_{\pi \sim P_{\Pi}} \max_{n \in \mathcal{N}} (B^{\pi})^{n} B Q - \mathbb{E}_{\pi \sim P_{\Pi}} \max_{n \in \mathcal{N}} (B^{\pi})^{n} B Q' \right\| \le \mathbb{E}_{\pi \sim P_{\Pi}} \max_{n \in \mathcal{N}} \left\| (B^{\pi})^{n} B Q - (B^{\pi})^{n} B Q' \right\| \le \gamma \left\| Q - Q' \right\|.$$

2) Similar to the proof in Theorem 1, $(B^{\pi})^{n} B Q^* \le Q^*$ for any $\pi$ and $n \ge 0$ (and the equality holds when $n = 0$), so we have

$$\mathcal{G}^{\mathcal{N}}_{\Pi} Q^* \triangleq \mathbb{E}_{\pi \sim P_{\Pi}} \max_{n \in \mathcal{N}} (B^{\pi})^{n} B Q^* = Q^*. \quad \square$$

A.3 DISCUSSION ABOUT SOFTMAX HIGHWAY OPERATOR

The Softmax Highway Operator is defined by

$$\mathcal{G}^{\Pi}_{\mathcal{N}} Q \triangleq \operatorname{smax}^{\alpha}_{\pi \in \Pi} \operatorname{smax}^{\alpha}_{n' \in \mathcal{N}} \max_{n \in \{0, n'\}} (B^{\pi})^{n} B Q.$$

If we remove the $\max_{n \in \{0, n'\}}$ in the above operator, i.e., $\operatorname{smax}^{\alpha}_{\pi \in \Pi} \operatorname{smax}^{\alpha}_{n' \in \mathcal{N}} (B^{\pi})^{n'} B Q$, then the resulting operator is biased w.r.t. $Q^*$. It can be regarded as an extension of the multi-step Bellman Optimality Operator $B^{\Pi}_{n}$, averaged over various $n$ with weights $\operatorname{smax}^{\alpha}_{n' \in \mathcal{N}}(\cdot)$. Therefore, it is biased in a similar way to $B^{\Pi}_{n}$. However, by simply adding $\max_{n \in \{0, n'\}}$ at minor computational cost, our Softmax Highway operator is unbiased while alleviating the overestimation issue and improving the exploration.
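The contrast drawn in A.3 can be checked numerically on a toy problem. The following is only a minimal sketch under stated assumptions: the random tabular MDP, the two random behavioral policies, and all function names are hypothetical, and `smax` is assumed to be a Boltzmann-weighted average with temperature `alpha`, which may differ from the exact definition used in the main text. The sketch applies the operator with and without the inner maximization to $Q^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, alpha = 4, 3, 0.9, 1.0          # toy sizes, chosen arbitrarily
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)            # T[s, a, s'] transition probabilities
r = rng.random((S, A))                       # r[s, a] rewards

def B(Q):
    """Bellman optimality operator on action VFs."""
    return r + gamma * T @ Q.max(axis=1)

def B_pi(Q, pi):
    """Bellman expectation operator for a stochastic policy pi[s, a]."""
    return r + gamma * T @ (pi * Q).sum(axis=1)

def smax(x, alpha):
    """ASSUMED softmax operator: Boltzmann-weighted average over axis 0."""
    w = np.exp(alpha * (x - x.max(axis=0, keepdims=True)))
    w /= w.sum(axis=0, keepdims=True)
    return (w * x).sum(axis=0)

def softmax_highway(Q, policies, depths, alpha, inner_max=True):
    """smax over policies of smax over depths n' of max_{n in {0, n'}} (B^pi)^n B Q."""
    per_policy = []
    for pi in policies:
        vals, cur = [], B(Q)                 # cur = (B^pi)^n B Q, starting at n = 0
        for _ in range(max(depths) + 1):
            vals.append(cur)
            cur = B_pi(cur, pi)
        per_depth = [np.maximum(vals[0], vals[n]) if inner_max else vals[n]
                     for n in depths]
        per_policy.append(smax(np.stack(per_depth), alpha))
    return smax(np.stack(per_policy), alpha)

# Q* by plain value iteration on the toy MDP.
Q_star = np.zeros((S, A))
for _ in range(2000):
    Q_star = B(Q_star)

policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(2)]  # suboptimal
depths = [0, 1, 2]
with_max = softmax_highway(Q_star, policies, depths, alpha)
without_max = softmax_highway(Q_star, policies, depths, alpha, inner_max=False)
print("max |G Q* - Q*| with inner max:   ", np.abs(with_max - Q_star).max())
print("max (Q* - G Q*) without inner max:", (Q_star - without_max).max())
```

On this toy instance the variant with the inner maximization leaves $Q^*$ unchanged up to floating-point error, whereas the variant without it maps $Q^*$ strictly below $Q^*$ in at least one entry, in line with the discussion above.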

A.4 MULTI-STEP IMPORTANCE SAMPLING-BASED BELLMAN EXPECTATION OPERATOR

In this section, we describe the classical off-policy learning method based on importance sampling (IS). IS-based off-policy methods evaluate the value function of a policy $\pi'$ (called target policy) using the data collected by a different policy $\pi$ (called behavior policy). The underlying operator, called Importance Sampling-based Bellman Expectation Operator (Sutton & Barto, 2018), is defined as follows:

$$B^{\pi'}_{\pi} Q(s, a) \triangleq \mathbb{E}_{s' \sim T(\cdot | s, a),\, a' \sim \pi(\cdot | s')}\left[ \frac{\pi'(a' | s')}{\pi(a' | s')} \left( r(s, a) + \gamma Q(s', a') \right) \right]$$

Algorithm B.2 Generalized Policy Iteration (Sutton & Barto, 2018)
Input: the lookahead depth $N$.
Initialize: Initial VF $V_0 \in \mathbb{R}^{|S|}$, $\epsilon$.
for $k = 1, 2, \dots$ do
    $\pi_k(s) = \arg\max_a \left[ r(s, a) + \gamma \mathbb{E}_{s'}[V_{k-1}(s')] \right]$
    $V_k \leftarrow (B^{\pi_k})^{\circ N} V_{k-1}$
    if $\| V_k - V_{k-1} \|_{\infty} \le \epsilon$

Table 2: Hyperparameters of the implemented algorithms. We reused hyperparameters and settings of neural networks in the Maxmin DQN paper (Lan et al., 2020).

Q k+1 (s0, a0) = max m∈Ms 0 ,a 0 max n∈N E D (m) s 0 ,a 0 G n+1 Q k (τ n+1 s 0 ,a 0 ) where E D (m) s 0 ,a 0 [•] = 1 D (m) s,a



The space of all bounded sequences with supremum norm, which is known to be a complete metric space. We denote by ∥·∥ the supremum norm throughout the paper.

Note that another variant, $\operatorname{smax}^{\alpha}_{\pi} \operatorname{smax}^{\alpha}_{n} (B^{\pi})^{n} B Q$ (without $\max_{n \in \{0, n'\}}$), is generally biased w.r.t. $Q^*$. Please refer to Appendix A.3 for the reason.

Note that, unless otherwise stated, the results hold for any set of behavioral policies $\Pi$ and any set of lookahead depths $\mathcal{N}$. For convenience, we analyze under a fixed $\Pi$. However, these results can also be extended to the case of a dynamically changing $\Pi$ (as shown for Highway Value Iteration in Algorithm B.1, where $\Pi_k$ can change across iterations $k$ as new policies are added to the set of behavioral policies).



|s,a) [r(s, a) + γV (s ′ )] BV * = V * , where (BV )(s) ≜ max a r(s, a) + γE s ′ ∼T (•|s,a) [V (s ′ )] .

Figure 1: (Left) Backup diagram of Highway Operator G ΠN with N = {0, 1, 2}. (Right) An illustrative example of "highway" under a simple N -horizon MDP, where N is the horizon between the start state s A and the end state s Z . With the highway (dashed lines) induced by rolling-out behavioral policies, the start state s A can directly access various deeper horizon information. By taking maximization over this information, the high credit information can be directly assigned to the previous states through the highway.

Figure 2: (a) shows results of model-based algorithms in Multi-Room environments. The x-axis is the number of rooms. The y-axes for the three figures are total iteration, total samples, and total running time required by the algorithm, respectively. (b) and (c) in Choice and Trace Back environments: number of episodes required to solve the task, as a function of the delay of reward. Average over 100 seeds, 1 standard deviation.

Figure 3: Episode rewards during the training on MinAtar Games (Young & Tian, 2019). Average over 5 seeds, 1 standard deviation.

Figure 4: The performance of Highway DQN and Multi-Step DQN using (a) varying lookahead depth and (b) varying replay buffer sizes. Average over 5 seeds, 1 standard deviation.

(1) Lookahead depth ($\mathcal{N} = \{0, 1, \dots, N-1\}$ for Highway DQN and $N$ for Multi-step DQN). As shown in Fig. 4a, compared to Multi-step DQN, our Highway DQN shows strong robustness against variations of the lookahead depth. This is because Highway DQN can adaptively choose the lookahead depth. (2) Replay buffer size. As shown in Fig. 4b, when the memory size increases to $8 \times 10^{5}$ (orange line), our Highway DQN shows a performance improvement, while Multi-Step DQN shows a degradation. (3) For the results with varying softmax temperature α and number of target networks, please refer to Appendix C.4.

from which the statement follows. □

Let us denote by Π k (k ∈ N) the sequence of sets of behavioral policies generated by Algorithm B.1. In Algorithm B.1, the set of behavioral policies Π k changes by adding a greedy policy every K iterations.

$M_{s_0,a_0} = \{m_1, \dots, m_M\}$, $m_i \sim \mathrm{Uniform}\left( \{ m' \mid |D^{(m')}_{s_0,a_0}| \neq 0 \} \right)$
k = k + 1
end for
end for

C.1.2 DETAILS OF ALGORITHMS

We compare our Highway Value Iteration to Policy Iteration and Value Iteration. For Policy Iteration, the lookahead depth is set to $N = 10$ (see Algorithm B.2). For our Highway Value Iteration method, the interval for adding a policy is set to $K = 7$; the set of lookahead depths is $\mathcal{N} = \{0, 1, 2, \dots, 9\}$; the number of behavioral policies is set to $|\Pi| = 5$. The error bound ($\| V_k - V_{k-1} \|_{\infty} \le \epsilon$) for all algorithms is set to $\epsilon = 10^{-10}$.

C.2 EXPERIMENTS OF MODEL-FREE ALGORITHMS IN TOY TASKS

Here we present the details of the model-free algorithms in artificial tasks. We adopt the experimental setting of Arjona-Medina et al. (2019).

Figure 5: The illustration of the Minimalistic Gridworld Environment

Figure 6: (a) Performance (left) and Q values (right) of Highway DQN using varying softmax temperatures α. (b) Corresponding results using varying numbers of target networks. Average over 5 seeds, 1 standard deviation.

Set of lookahead depths N ; Number of behavior policies M (for computing target Q value); Epochs of running algorithm I run ; Number of searched behavior policies M ; Epochs of rolling-out policy I rollout ; Epochs of updating value function I update ; Initial value function Q 0 ∈ R |S|×|A| ; Exploration rate ϵ.
2: Initialize: k = 0; State-action replay buffer D = ∅.

ETHICS STATEMENT

This paper provides a new operator for off-policy learning-based methods. The authors have not identified any particular concerns regarding potential ethical or societal consequences.

REPRODUCIBILITY STATEMENT

The code is available at https://anonymous.4open.science/r/Highway-Reinforcement-Learning-4202, and more implementation details can be found in Appendices B and C.


Theorem 8 For any $\Pi$, any $P_{\Pi}$, and any $\mathcal{N}$, 1) $\mathcal{G}^{\mathcal{N}}_{\Pi}$ is a contraction on the complete metric space $l^{\infty}(S \times A)$, i.e., for any two vectors

We now give the proofs of Theorems 7 and 8, respectively.

Proof of Theorem 7:

For simplicity, we will use $P$ instead of $P_{\Pi}$.

1)

2) Applying Banach's fixed point theorem to $B^{\Pi}_{N}$ using the contraction result, we know that $B^{\Pi}_{N}$ has only one fixed point. Following the proof in Theorem 1, we have $(B^{\pi})^{N-1} B Q^* \le Q^*$ for any $\pi$ and $N \ge 2$. This implies that

based on the contraction property and Banach's fixed point theorem:

3) For the implication "⇐" it suffices to show for all $\pi \in \Pi$, P(π

Since in the first expectation we only care about $s_1$ for which $T(s_1 | s_0, a_0) > 0$, we can assume $s_1 \in U$. As $\pi$ on $U$ can be replaced by the optimal policy $\pi^*$ from the assumption, we get

The remaining implication "⇒" will be proved by contradiction. Assume that the conclusion does not hold, i.e., there exist $\pi \in \Pi$ with $P(\pi) > 0$ and $s_1 \in U$ such that $\pi(\cdot | s_1)$ is not optimal. Since $s_1 \in U$, there exist $s_0 \in S$, $a_0 \in A$ such that $T(s_1 | s_0, a_0) > 0$. First we aim to prove the inequality $B^{\pi} Q^*(s_0, a_0) < Q^*(s_0, a_0)$. Since $\pi(\cdot | s_1)$ assigns positive probability to a non-optimal action, it is easy to obtain (especially for finite $A$) that

For other states different from $s_1$ we can still have equality, but the countable sum leaves the inequality strict.

To utilize the multi-step data collected by different behavior policies, the above operator can be extended to a multi-step version with a set of behavioral policies (which we call the Multi-Step IS-based Bellman Expectation Operator) (Sutton & Barto, 2018), which is defined as

where $N$ is the lookahead depth/bootstrapping step; $\tau^{N}_{s_0} = (s_0, a_0, s_1, a_1, s_2, a_2, \dots, s_N)$; $\tau^{N}_{s_0} \sim \pi$ is the trajectory starting from $s_0$ by executing policy $\pi$ for $N$ steps; and

is the product of IS ratios. The products of IS ratios could cause high variance.
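The closing remark on the variance of IS-ratio products can be illustrated with a small calculation. This is a hedged sketch and not part of the paper's experiments: it assumes a stateless setting with two actions in which the per-step ratios are independent and identically distributed, and the policies `pi_b` (behavior) and `pi_t` (target) as well as both function names are hypothetical. Under that assumption each ratio has mean 1, so the variance of a product of $n$ ratios is $(\mathbb{E}[\rho^2])^n - 1$, which grows geometrically in $n$.

```python
import numpy as np

def is_product_variance(pi_b, pi_t, n):
    """Exact variance of a product of n i.i.d. IS ratios pi_t(a)/pi_b(a), a ~ pi_b."""
    second_moment = np.sum(pi_t ** 2 / pi_b)   # E[rho^2]; note that E[rho] = 1
    return second_moment ** n - 1.0

def sample_is_products(pi_b, pi_t, n, num_samples, seed=0):
    """Monte Carlo samples of the product of n IS ratios under the behavior policy."""
    rng = np.random.default_rng(seed)
    actions = rng.choice(len(pi_b), size=(num_samples, n), p=pi_b)
    return np.prod(pi_t[actions] / pi_b[actions], axis=1)

pi_b = np.array([0.5, 0.5])    # hypothetical behavior policy
pi_t = np.array([0.9, 0.1])    # hypothetical target policy
for n in (1, 3, 10):
    rho = sample_is_products(pi_b, pi_t, n, 200_000)
    print(n, is_product_variance(pi_b, pi_t, n), rho.mean(), rho.var())
```

The sample mean stays close to 1 for every lookahead depth, while the variance increases rapidly with $n$; for long lookaheads the empirical variance itself becomes a noisy estimate of the exact value.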

C EXPERIMENTAL RESULTS

The code of this paper is publicly available at https://anonymous.4open.science/r/Highway-Reinforcement-Learning-4202.

C.1.1 DETAILS OF ENVIRONMENTS

Multi-Room is a grid world environment with multiple rooms connected by doors. The agent's goal is to reach a goal square in the opposite corner and get a reward (r = 1000). In addition, the agent gets a small reward (r = 0.001) when it finds the exit door of a room. We use an implementation based on gym-minigrid (Chevalier-Boisvert et al., 2018).

Algorithm B.1 Highway Value Iteration

Input: Initial set of behavioral policies Π 0 ; the set of lookahead depths N ; interval for adding new policy K . Initialize:
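Since the listing of Algorithm B.1 is incomplete here, the following is a rough, non-authoritative sketch of what Highway Value Iteration could look like on a tabular MDP. It relies only on the operator $\max_{\pi \in \Pi} \max_{n \in \mathcal{N}} (B^{\pi})^{n} B V$ and on the rule that a greedy policy is added every $K$ iterations. The random MDP, the zero initialization, the cap `max_policies` with its keep-the-most-recent rule, and all function names are assumptions made for illustration.

```python
import numpy as np

def bellman_opt(V, T, r, gamma):
    """(B V)(s) = max_a [ r(s, a) + gamma * E_{s'}[V(s')] ]."""
    return (r + gamma * T @ V).max(axis=1)

def bellman_pi(V, pi, T, r, gamma):
    """(B^pi V)(s) for a deterministic policy given as an action index per state."""
    q = r + gamma * T @ V
    return q[np.arange(len(V)), pi]

def greedy(V, T, r, gamma):
    return (r + gamma * T @ V).argmax(axis=1)

def highway_operator(V, policies, depths, T, r, gamma):
    """max over policies and over n in depths of (B^pi)^n B V."""
    base = bellman_opt(V, T, r, gamma)
    best = np.full_like(base, -np.inf)
    for pi in policies:
        cur = base                              # cur = (B^pi)^n B V, n = 0
        for n in range(max(depths) + 1):
            if n in depths:
                best = np.maximum(best, cur)
            cur = bellman_pi(cur, pi, T, r, gamma)
    return best

def highway_value_iteration(T, r, gamma, policies, depths, K=7,
                            max_policies=5, eps=1e-10, max_iter=10_000):
    V = np.zeros(T.shape[0])                    # V_0 = 0 <= V* for rewards >= 0
    policies = list(policies)
    for k in range(1, max_iter + 1):
        V_new = highway_operator(V, policies, depths, T, r, gamma)
        if k % K == 0:                          # add a greedy policy every K iterations
            policies.append(greedy(V_new, T, r, gamma))
            policies = policies[-max_policies:] # ASSUMED rule: keep the most recent ones
        done = np.max(np.abs(V_new - V)) <= eps
        V = V_new
        if done:
            break
    return V, k

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)
r = rng.random((S, A))
V_hw, iters = highway_value_iteration(T, r, gamma, [rng.integers(A, size=S)],
                                      depths=range(10))

V_star = np.zeros(S)                            # reference: plain value iteration
for _ in range(3000):
    V_star = bellman_opt(V_star, T, r, gamma)
print(iters, np.abs(V_hw - V_star).max())
```

The stopping test mirrors the error bound $\| V_k - V_{k-1} \|_{\infty} \le \epsilon$ used in the experiments; because every highway operator is a $\gamma$-contraction with fixed point $V^*$, stopping on this criterion bounds the remaining error by $\gamma \epsilon / (1 - \gamma)$.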

C.2.1 DETAILS OF ENVIRONMENTS

We evaluate the model-free algorithms on two toy tasks involving delayed rewards (Arjona-Medina et al., 2019) , where a reward is only provided at the end of each trial and is associated with the previous actions. For example, in task "Trace Back," the final reward depends on the first two actions. Each task is run with 100 random seeds. In task "Choice," the stochastic reward depends on the first action at the beginning; the final reward depends on the first two actions. Please refer to Arjona-Medina et al. (2019) for more details.

C.2.2 ALGORITHMIC DETAILS

The following methods are compared:
• RUDDER with reward redistribution for Q-value estimation, and RUDDER applied on top of Q-learning.
• Q-learning with eligibility traces according to Watkins (Q(λ)).
• SARSA with eligibility traces (SARSA(λ)).
• Monte Carlo.

For RUDDER, we use the default setting of Arjona-Medina et al. (2019). For Q(λ) and SARSA(λ), the hyperparameter of eligibility traces is λ = 0.9. For Q(λ), we use Watkins' implementation. The algorithms are evaluated until the task is solved. For MC, Q-values are the exponential moving average of the episode return. In all experiments, an ϵ-greedy policy with ϵ = 0.2 is adopted. For our Highway Q-Learning, the set of lookahead depths N is set by {0, 1, 2, 2 .

We reuse the hyperparameters and settings of neural networks in the Maxmin DQN paper (Lan et al., 2020). For Maxmin DQN, the best number of target networks (a hyperparameter) was chosen from [2, 3, 4, 5, 6, 7, 8, 9] and the best learning rates were chosen from [3×10 -3 ; 3×10 -4 ; 3×10 4, 8, 12] ; the best learning rate from [3 × 10 -3 ; 3 × 10 -4 ], and the best number of target networks was chosen from [2, 4, 6]. For Retrace(λ), λ is set to 1, as suggested in the Retrace(λ) paper (Munos et al., 2016). For our Highway DQN, the best softmax temperature α was chosen from {0.1, 0.5}. The number of epochs of rolling out the policy is set to I rollout = 1.

In practice, we adopt several measures to accelerate the computation of Highway DQN. We cache the Q values generated by the target network for data (s, a) so that they can be reused when the data is sampled again for training, until the target network is updated. With the cached Q values, all we need is a softmax over |N | × |K| numbers, which is typically fast on GPUs.
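The caching remark can be made concrete with a small sketch of how a target might be assembled from cached values. This is an illustration under assumptions, not the released implementation: the array layout (`rewards[t]` for $r_t$, `max_q_next[t]` for the cached $\max_a Q_{\text{target}}(s_{t+1}, a)$), the function names, and the Boltzmann-weighted form of `smax` are all hypothetical.

```python
import numpy as np

def n_step_targets(rewards, max_q_next, gamma, depths):
    """Bootstrapped returns G^{n+1} for each lookahead depth n in `depths`.

    rewards[t] is r_t and max_q_next[t] is the cached max_a Q_target(s_{t+1}, a).
    """
    discounts = gamma ** np.arange(len(rewards))
    partial = np.cumsum(discounts * rewards)       # sum_{t <= n} gamma^t r_t
    return np.array([partial[n] + gamma ** (n + 1) * max_q_next[n] for n in depths])

def smax(x, alpha, axis):
    """ASSUMED softmax operator: Boltzmann-weighted average along `axis`."""
    w = np.exp(alpha * (x - x.max(axis=axis, keepdims=True)))
    w /= w.sum(axis=axis, keepdims=True)
    return (w * x).sum(axis=axis)

def highway_target(returns, alpha):
    """returns[m, j]: target of trajectory m at depth depths[j], with depths[0] == 0."""
    maxed = np.maximum(returns[:, :1], returns)    # max over n in {0, n'}
    return smax(smax(maxed, alpha, axis=1), alpha, axis=0)

depths = [0, 1, 2]
traj_a = n_step_targets(np.array([1.0, 0.0, 2.0]), np.array([0.5, 1.0, 0.0]), 0.5, depths)
traj_b = n_step_targets(np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.2, 0.1]), 0.5, depths)
print(traj_a, traj_b, highway_target(np.stack([traj_a, traj_b]), alpha=0.5))
```

Once the per-depth returns are available, the remaining work is one elementwise maximum and two softmax reductions over a small array, which is why reusing cached target-network values keeps the target computation cheap.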

