GRADIENT DESCENT TEMPORAL DIFFERENCE-DIFFERENCE LEARNING

Abstract

Off-policy algorithms, in which a behavior policy that differs from the target policy is used to gain experience for learning, have proven to be of great practical value in reinforcement learning. However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable. To address this, alternative algorithms that are provably convergent in such cases have been introduced, the best known being gradient descent temporal difference (GTD) learning. This algorithm and others like it, however, tend to converge much more slowly than conventional temporal difference learning. In this paper we propose gradient descent temporal difference-difference (Gradient-DD) learning, which improves GTD learning by introducing second-order differences in successive parameter updates. We investigate this algorithm in the framework of linear value function approximation, analytically showing its improvement over GTD learning. Studying the model empirically on the random walk and Boyan-chain prediction tasks, we find substantial improvement over GTD learning and, in several cases, even better performance than conventional TD learning.

1. INTRODUCTION

Off-policy algorithms for value function learning enable an agent to use a behavior policy that differs from the target policy in order to gain experience for learning. However, because off-policy methods learn a value function for a target policy from data generated by a different behavior policy, they often exhibit greater variance in parameter updates. When applied to problems involving function approximation, off-policy methods are slower to converge than on-policy methods and may even diverge (Baird, 1995; Sutton & Barto, 2018). Two general approaches have been investigated to address the challenge of developing stable and effective off-policy temporal-difference algorithms. One approach is to use importance sampling methods to warp the update distribution back to the on-policy distribution (Precup et al., 2000; Mahmood et al., 2014). This approach is useful for decreasing the variance of parameter updates, but it does not address stability issues. The second main approach is to develop true gradient descent-based methods that are guaranteed to be stable regardless of the update distribution. Sutton et al. (2009a;b) proposed the first off-policy gradient-descent-based temporal difference algorithms (GTD and GTD2, respectively). These algorithms are guaranteed to be stable, with computational complexity scaling linearly in the size of the function approximator. Empirically, however, their convergence is much slower than conventional temporal difference (TD) learning, limiting their practical utility (Ghiassian et al., 2020; White & White, 2016). Building on this work, extensions to the GTD family of algorithms (see Ghiassian et al.
(2018) for a review) have allowed for incorporating eligibility traces (Maei & Sutton, 2010; Geist & Scherrer, 2014), non-linear function approximation such as with a neural network (Maei, 2011), and reformulation of the optimization as a saddle point problem (Liu et al., 2015; Du et al., 2017). However, due to their slow convergence, none of these stable off-policy methods are commonly used in practice. In this work, we introduce a new gradient descent algorithm for temporal difference learning with linear value function approximation. This algorithm, which we call gradient descent temporal difference-difference (Gradient-DD) learning, is an acceleration technique that employs second-order differences in successive parameter updates. The basic idea of Gradient-DD is to modify the error objective function by additionally considering the prediction error obtained in the last time step, and then to derive a gradient-descent algorithm based on this modified objective function. In addition to exploiting the Bellman equation to obtain the solution, this modified error objective function avoids drastic changes in the value function estimate by encouraging local search around the current estimate. Algorithmically, the Gradient-DD approach adds only one additional term to the update rule of the GTD2 method, and the extra computational cost is negligible. We show mathematically that this method significantly improves the convergence rate relative to the GTD2 method for linear function approximation. This result is supported by numerical experiments, which also show that Gradient-DD obtains better convergence in many cases than conventional TD learning.

1.1. RELATED WORK

In approaches related to ours, some previous studies have attempted to improve Gradient-TD algorithms by adding regularization terms to the objective function. Liu et al. (2012) used $\ell_1$ regularization on the weights to learn sparse representations of value functions, and Ghiassian et al. (2020) used $\ell_2$ regularization on the weights. Unlike these references, our approach modifies the error objective function by regularizing the evaluation error obtained in the most recent time step. With this modification, our method provides a learning rule that contains second-order differences in successive parameter updates. Our approach is similar to trust region policy optimization (Peters & Schaal, 2008; Schulman et al., 2015) and relative entropy policy search (Peters et al., 2010), which penalize large changes in policy learning. In those methods, constrained optimization is used to update the policy subject to a constraint on some measure of distance between the new policy and the old policy. Here, however, our aim is to find the optimal value function, and the regularization term uses the previous value function estimate to avoid drastic changes during the updating process.

2.1. PROBLEM DEFINITION AND BACKGROUND

In this section, we formalize the problem of learning the value function for a given policy under the Markov Decision Process (MDP) framework. In this framework, the agent interacts with the environment over a sequence of discrete time steps, $t = 1, 2, \ldots$. At each time step the agent observes a partial summary of the state $s_t \in \mathcal{S}$ and selects an action $a_t \in \mathcal{A}$. In response, the environment emits a reward $r_t \in \mathcal{R}$ and transitions the agent to its next state $s_{t+1} \in \mathcal{S}$. The state and action sets are finite. State transitions are stochastic and depend on the immediately preceding state and action. Rewards are stochastic and depend on the preceding state and action, as well as on the next state. The process generating the agent's actions is termed the behavior policy. In off-policy learning, this behavior policy is in general different from the target policy $\pi: \mathcal{S} \to \mathcal{A}$. The objective is to learn an approximation to the state-value function under the target policy in a particular environment:
$$V(s) = E_\pi\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\Big|\, s_1 = s\right], \tag{1}$$
where $\gamma \in [0, 1)$ is the discount rate. In problems for which the state space is large, it is practical to approximate the value function. In this paper we consider linear function approximation, where states are mapped to feature vectors with fewer components than the number of states. Specifically, for each state $s \in \mathcal{S}$ there is a corresponding feature vector $\mathbf{x}(s) \in \mathbb{R}^p$, with $p \le |\mathcal{S}|$, such that the approximate value function is given by
$$V_\mathbf{w}(s) := \mathbf{w}^\top \mathbf{x}(s). \tag{2}$$
The goal is then to learn the parameters $\mathbf{w}$ such that $V_\mathbf{w}(s) \approx V(s)$.

2.2. GRADIENT TEMPORAL DIFFERENCE LEARNING

A major breakthrough for the study of the convergence properties of MDP systems came with the introduction of the GTD and GTD2 learning algorithms (Sutton et al., 2009a;b). We begin by briefly recapitulating the GTD algorithms, which we will then extend in the following sections. To begin, we introduce the Bellman operator $B$ such that the true value function $\mathbf{V} \in \mathbb{R}^{|\mathcal{S}|}$ satisfies the Bellman equation:
$$\mathbf{V} = \mathbf{R} + \gamma\mathbf{P}\mathbf{V} =: B\mathbf{V},$$
where $\mathbf{R}$ is the reward vector with components $E(r_{n+1} \mid s_n = s)$, and $\mathbf{P}$ is the matrix of state transition probabilities. In temporal difference methods, an appropriate objective function should minimize the difference between the approximate value function and the solution to the Bellman equation.

Having defined the Bellman operator, we next introduce the projection operator $\Pi$, which takes any value function $\mathbf{V}$ and projects it to the nearest value function within the space of approximate value functions of the form (2). Letting $\mathbf{X}$ be the matrix whose rows are $\mathbf{x}(s)^\top$, the approximate value function can be expressed as $\mathbf{V}_\mathbf{w} = \mathbf{X}\mathbf{w}$. We will also assume that there exists a limiting probability distribution such that $d_s = \lim_{n\to\infty} p(s_n = s)$ (or, in the episodic case, that $d_s$ is the proportion of time steps spent in state $s$). The projection operator is then given by
$$\Pi = \mathbf{X}(\mathbf{X}^\top\mathbf{D}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{D},$$
where the matrix $\mathbf{D}$ is diagonal, with diagonal elements $d_s$.

The natural measure of how closely the approximation $\mathbf{V}_\mathbf{w}$ satisfies the Bellman equation is the mean squared Bellman error:
$$\mathrm{MSBE}(\mathbf{w}) = \|\mathbf{V}_\mathbf{w} - B\mathbf{V}_\mathbf{w}\|_\mathbf{D}^2,$$
where the norm is weighted by $\mathbf{D}$, such that $\|\mathbf{V}\|_\mathbf{D}^2 = \mathbf{V}^\top\mathbf{D}\mathbf{V}$. However, because the Bellman operator follows the underlying state dynamics of the Markov chain, irrespective of the structure of the linear function approximator, $B\mathbf{V}_\mathbf{w}$ will typically not be representable as $\mathbf{V}_\mathbf{w}$ for any $\mathbf{w}$. An alternative objective function, therefore, is the mean squared projected Bellman error (MSPBE), which we define as
$$J(\mathbf{w}) = \|\mathbf{V}_\mathbf{w} - \Pi B\mathbf{V}_\mathbf{w}\|_\mathbf{D}^2. \tag{4}$$
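As a concrete illustration, the MSPBE can be evaluated directly for a small MDP. The following sketch is our own illustrative example, not from the paper: the toy chain, its features, and the distribution d are all assumptions made for demonstration.

```python
import numpy as np

# Hypothetical 4-state chain with 2 linear features per state (illustrative).
np.random.seed(0)
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])     # state transition probabilities
R = np.array([0.0, 0.0, 0.0, 1.0])      # expected reward vector
gamma = 0.9
X = np.random.randn(4, 2)               # feature matrix; rows are x(s)^T
d = np.array([0.25, 0.25, 0.25, 0.25])  # assumed limiting distribution
D = np.diag(d)

def mspbe(w):
    """Mean squared projected Bellman error ||V_w - Pi B V_w||_D^2."""
    V = X @ w
    BV = R + gamma * P @ V                            # Bellman operator B
    Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D     # projection onto span(X)
    err = V - Pi @ BV
    return float(err @ D @ err)
```

Note that Π is a (D-weighted) projection, so applying it twice has no further effect; this idempotence is a quick sanity check on the construction.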
Following Sutton et al. (2009b), our objective is to minimize this error measure. As usual in stochastic gradient descent, the weights at each time step are updated by $\Delta\mathbf{w} = -\alpha\nabla_\mathbf{w}J(\mathbf{w})$, where $\alpha > 0$, and
$$-\tfrac{1}{2}\nabla_\mathbf{w}J(\mathbf{w}) = -E[(\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)\mathbf{x}_n^\top]\left[E(\mathbf{x}_n\mathbf{x}_n^\top)\right]^{-1}E(\delta_n\mathbf{x}_n) \approx -E[(\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)\mathbf{x}_n^\top]\boldsymbol{\eta}. \tag{5}$$
For notational simplicity, we have denoted the feature vector associated with $s_n$ as $\mathbf{x}_n = \mathbf{x}(s_n)$. We have also introduced the temporal difference error $\delta_n = r_n + (\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)^\top\mathbf{w}_n$, as well as $\boldsymbol{\eta}$, a linear predictor that approximates $[E(\mathbf{x}_n\mathbf{x}_n^\top)]^{-1}E(\delta_n\mathbf{x}_n)$. Because the factors in Eqn. (5) can be directly sampled, the resulting updates in each step are
$$\begin{aligned}
\delta_n &= r_n + (\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)^\top\mathbf{w}_n,\\
\boldsymbol{\eta}_{n+1} &= \boldsymbol{\eta}_n + \beta_n(\delta_n - \mathbf{x}_n^\top\boldsymbol{\eta}_n)\mathbf{x}_n,\\
\mathbf{w}_{n+1} &= \mathbf{w}_n - \alpha_n(\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)(\mathbf{x}_n^\top\boldsymbol{\eta}_n).
\end{aligned} \tag{6}$$
These updates define the GTD2 learning algorithm, which we will build upon in the following section.
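For concreteness, one step of the GTD2 updates in Eqn. (6) can be sketched as follows. This is a minimal NumPy sketch; the function name and calling convention are ours, not the paper's.

```python
import numpy as np

def gtd2_step(w, eta, x, r, x_next, alpha, beta, gamma):
    """One GTD2 update (Eqn. 6) from a sampled transition (x, r, x_next)."""
    delta = r + (gamma * x_next - x) @ w                  # TD error delta_n
    w_new = w - alpha * (gamma * x_next - x) * (x @ eta)  # main weight update
    eta_new = eta + beta * (delta - x @ eta) * x          # auxiliary predictor
    return w_new, eta_new
```

Note that the main weights w move only through the auxiliary predictor η, which is why GTD2 is a two-timescale algorithm: when η is still near zero, w barely changes.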

3. GRADIENT DESCENT TEMPORAL DIFFERENCE-DIFFERENCE LEARNING

In order to improve the GTD2 algorithm described above, in this section we modify the objective function by additionally considering the approximation error $\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}$ relative to the previous time step $n-1$. Specifically, we modify Eqn. (4) as follows:
$$J_{\mathrm{GDD}}(\mathbf{w}|\mathbf{w}_{n-1}) = J(\mathbf{w}) + \kappa\|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2, \tag{7}$$
where $\kappa \ge 0$ is a regularization parameter. Minimizing Eqn. (7) is equivalent to the following optimization:
$$\arg\min_\mathbf{w} J(\mathbf{w}) \quad \text{s.t.} \quad \|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2 \le \mu, \tag{8}$$
where $\mu > 0$ is a parameter that becomes large when $\kappa$ is small, so that the MSPBE objective is recovered as $\mu \to \infty$, equivalent to $\kappa \to 0$ in Eqn. (7). We show in the Appendix that for any $\mu > 0$ there exists a $\kappa \ge 0$ such that the solutions of Eqn. (7) and Eqn. (8) coincide. Eqns. (7) and (8) represent a tradeoff between minimizing the MSPBE and preventing the estimated value function from changing too drastically. Rather than simply minimizing the optimal prediction from the projected Bellman equation, the agent makes use of the most recent update to look for the solution.

Figure 1 gives a schematic view of the effect of the regularization. Rather than directly following the direction of the MSPBE gradient, the update chooses a $\mathbf{w}$ that minimizes the MSPBE while obeying the constraint that the estimated value function should not change too greatly. In effect, the regularization term encourages searching around the estimate from the previous time step, especially when the state space is large.

With these considerations in mind, the negative gradient of $J_{\mathrm{GDD}}(\mathbf{w}|\mathbf{w}_{n-1})$ is
$$\begin{aligned}
-\tfrac{1}{2}\nabla_\mathbf{w}J_{\mathrm{GDD}}(\mathbf{w}|\mathbf{w}_{n-1}) &= -E[(\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)\mathbf{x}_n^\top]\left[E(\mathbf{x}_n\mathbf{x}_n^\top)\right]^{-1}E(\delta_n\mathbf{x}_n) - \kappa E[(\mathbf{x}_n^\top\mathbf{w}_n - \mathbf{x}_n^\top\mathbf{w}_{n-1})\mathbf{x}_n]\\
&\approx -E[(\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)\mathbf{x}_n^\top]\boldsymbol{\eta}_n - \kappa E[(\mathbf{x}_n^\top\mathbf{w}_n - \mathbf{x}_n^\top\mathbf{w}_{n-1})\mathbf{x}_n].
\end{aligned} \tag{9}$$
Because the terms in Eqn. (9) can be directly sampled, the stochastic gradient descent updates are given by
$$\begin{aligned}
\delta_n &= r_n + (\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)^\top\mathbf{w}_n,\\
\boldsymbol{\eta}_{n+1} &= \boldsymbol{\eta}_n + \beta_n(\delta_n - \mathbf{x}_n^\top\boldsymbol{\eta}_n)\mathbf{x}_n,\\
\mathbf{w}_{n+1} &= \mathbf{w}_n - \kappa_n(\mathbf{x}_n^\top\mathbf{w}_n - \mathbf{x}_n^\top\mathbf{w}_{n-1})\mathbf{x}_n - \alpha_n(\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)(\mathbf{x}_n^\top\boldsymbol{\eta}_n).
\end{aligned} \tag{10}$$
These update equations (10) define the Gradient-DD method, in which the GTD2 update equations (6) are generalized by including a second-order update term in the third update equation; this term originates from the squared bias term in the objective (7). In the following sections, we analytically and numerically investigate the convergence and performance of Gradient-DD learning.
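The Gradient-DD updates in Eqn. (10) differ from GTD2 only by the extra κ term. One step can be sketched as follows (illustrative NumPy; the names are ours). With κ = 0 the rule reduces to GTD2.

```python
import numpy as np

def gradient_dd_step(w, w_prev, eta, x, r, x_next, alpha, beta, kappa, gamma):
    """One Gradient-DD update (Eqn. 10). The kappa term penalizes the change
    in the value estimate at the visited state relative to the previous w."""
    delta = r + (gamma * x_next - x) @ w                   # TD error
    w_new = (w
             - kappa * (x @ w - x @ w_prev) * x            # second-order term
             - alpha * (gamma * x_next - x) * (x @ eta))   # GTD2 term
    eta_new = eta + beta * (delta - x @ eta) * x           # auxiliary update
    return w_new, eta_new
```

The caller keeps the previous weight vector w_prev alongside w, which is the only extra state (and negligible extra computation) that Gradient-DD requires over GTD2.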

4. IMPROVED CONVERGENCE RATE

In this section we analyze the convergence rate of Gradient-DD learning. Note that the second-order update in the last line of Eqn. (10) can be rewritten as a system of first-order difference equations:
$$(\mathbf{I} + \kappa_n\mathbf{x}_n\mathbf{x}_n^\top)(\mathbf{w}_{n+1} - \mathbf{w}_n) = \kappa_n\mathbf{x}_n\mathbf{x}_n^\top(\mathbf{u}_{n+1} - \mathbf{u}_n) - \alpha_n(\gamma\mathbf{x}_{n+1} - \mathbf{x}_n)(\mathbf{x}_n^\top\boldsymbol{\eta}_n); \qquad \mathbf{u}_{n+1} = \mathbf{w}_{n+1} - \mathbf{w}_n. \tag{11}$$
Let $\beta_n = \zeta\alpha_n$ with $\zeta > 0$. We consider constant step sizes in the updates, i.e., $\kappa_n = \kappa$ and $\alpha_n = \alpha$. Denote
$$\mathbf{H}_n = \begin{pmatrix}\mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{x}_n\mathbf{x}_n^\top\end{pmatrix} \quad \text{and} \quad \mathbf{G}_n = \begin{pmatrix}\sqrt{\zeta}\,\mathbf{x}_n\mathbf{x}_n^\top & \mathbf{x}_n(\mathbf{x}_n - \gamma\mathbf{x}_{n+1})^\top\\ -(\mathbf{x}_n - \gamma\mathbf{x}_{n+1})\mathbf{x}_n^\top & \mathbf{0}\end{pmatrix}.$$
We rewrite the two coupled iterations in Eqn. (11) as a single iteration in a combined parameter vector with $2p$ components, $\boldsymbol{\rho}_n = (\boldsymbol{\eta}_n^\top/\sqrt{\zeta},\, \mathbf{w}_n^\top)^\top$, and a reward-related vector with $2p$ components, $\mathbf{g}_{n+1} = (r_n\mathbf{x}_n^\top,\, \mathbf{0}^\top)^\top$, as follows:
$$\boldsymbol{\rho}_{n+1} = \boldsymbol{\rho}_n - \kappa\mathbf{H}_n(\boldsymbol{\rho}_n - \boldsymbol{\rho}_{n-1}) - \sqrt{\zeta}\,\alpha(\mathbf{G}_n\boldsymbol{\rho}_n - \mathbf{g}_{n+1}). \tag{12}$$
Denoting $\boldsymbol{\psi}_{n+1} = \alpha^{-1}(\boldsymbol{\rho}_{n+1} - \boldsymbol{\rho}_n)$, Eqn. (12) is rewritten as
$$\begin{pmatrix}\boldsymbol{\rho}_{n+1} - \boldsymbol{\rho}_n\\ \boldsymbol{\psi}_{n+1} - \boldsymbol{\psi}_n\end{pmatrix} = \alpha\begin{pmatrix}\mathbf{I} + \kappa\mathbf{H}_n & -\kappa\alpha\mathbf{H}_n\\ \mathbf{I} & -\alpha\mathbf{I}\end{pmatrix}^{-1}\begin{pmatrix}-\sqrt{\zeta}(\mathbf{G}_n\boldsymbol{\rho}_n - \mathbf{g}_{n+1})\\ \boldsymbol{\psi}_n\end{pmatrix} = \alpha\begin{pmatrix}-\sqrt{\zeta}\mathbf{G}_n & -\kappa\mathbf{H}_n\\ -\sqrt{\zeta}\alpha^{-1}\mathbf{G}_n & -\alpha^{-1}(\mathbf{I} + \kappa\mathbf{H}_n)\end{pmatrix}\begin{pmatrix}\boldsymbol{\rho}_n\\ \boldsymbol{\psi}_n\end{pmatrix} + \alpha\begin{pmatrix}\sqrt{\zeta}\,\mathbf{g}_{n+1}\\ \sqrt{\zeta}\,\alpha^{-1}\mathbf{g}_{n+1}\end{pmatrix}, \tag{13}$$
where the second step uses the identity
$$\begin{pmatrix}\mathbf{I} + \kappa\mathbf{H}_n & -\kappa\alpha\mathbf{H}_n\\ \mathbf{I} & -\alpha\mathbf{I}\end{pmatrix}^{-1} = \begin{pmatrix}\mathbf{I} & -\kappa\mathbf{H}_n\\ \alpha^{-1}\mathbf{I} & -\alpha^{-1}(\mathbf{I} + \kappa\mathbf{H}_n)\end{pmatrix}.$$
Denote
$$\mathbf{J}_n = \begin{pmatrix}-\sqrt{\zeta}\mathbf{G}_n & -\kappa\mathbf{H}_n\\ -\sqrt{\zeta}\alpha^{-1}\mathbf{G}_n & -\alpha^{-1}(\mathbf{I} + \kappa\mathbf{H}_n)\end{pmatrix}.$$
Eqn. (13) tells us that $\mathbf{J}_n$ is the update matrix of the Gradient-DD algorithm. (Note that $\mathbf{G}_n$ is the update matrix of the GTD2 algorithm.) Therefore, assuming the stochastic approximation in Eqn. (13) converges to the solution of an associated ordinary differential equation (ODE) under some regularity conditions (a convergence property is provided in the appendix by following Borkar & Meyn (2000)), we can analyze the improved convergence rate of Gradient-DD learning by comparing the eigenvalues of the matrices $\mathbf{G} = E(\mathbf{G}_n)$ and $\mathbf{J} = E(\mathbf{J}_n)$ (Atkinson et al., 2008). Clearly,
$$\mathbf{J} = \begin{pmatrix}-\sqrt{\zeta}\mathbf{G} & -\kappa\mathbf{H}\\ -\sqrt{\zeta}\alpha^{-1}\mathbf{G} & -\alpha^{-1}(\mathbf{I} + \kappa\mathbf{H})\end{pmatrix},$$
where $\mathbf{H} = E(\mathbf{H}_n)$. To simplify, we consider the case in which $E(\mathbf{x}_n\mathbf{x}_n^\top) = \mathbf{I}$. Let $\lambda_G$ be a real eigenvalue of the matrix $\sqrt{\zeta}\mathbf{G}$.
(Note that $\mathbf{G}$ is defined here with the opposite sign relative to $\mathbf{G}$ in Maei (2011).) From Maei (2011), the eigenvalues of the matrix $-\mathbf{G}$ are strictly negative; in other words, $\lambda_G > 0$. Let $\lambda$ be an eigenvalue of the matrix $\mathbf{J}$, i.e., a solution of
$$|\lambda\mathbf{I} - \mathbf{J}| = (\lambda + \lambda_G)(\lambda + \alpha^{-1}) + \kappa\alpha^{-1}\lambda = \lambda^2 + \left[\alpha^{-1}(1 + \kappa) + \lambda_G\right]\lambda + \alpha^{-1}\lambda_G = 0. \tag{14}$$
The smaller eigenvalue $\lambda_m$ of each pair of solutions to Eqn. (14) satisfies $\lambda_m < -\lambda_G$, where the details of this derivation are given in the appendix. Since $-\lambda_G$ is the corresponding eigenvalue of the GTD2 update matrix, this explains the enhanced speed of convergence of Gradient-DD learning, which we illustrate in numerical experiments in Section 5. Additionally, we show a convergence property of Gradient-DD under constant step sizes by applying the ordinary differential equation method of stochastic approximation (Borkar & Meyn, 2000). Let the TD fixed point be $\mathbf{w}^*$, such that $\mathbf{V}_{\mathbf{w}^*} = \Pi B\mathbf{V}_{\mathbf{w}^*}$. Under some conditions, we prove that for any $\epsilon > 0$ there exists $b_1 < \infty$ such that $\limsup_{n\to\infty} P(\|\mathbf{w}_n - \mathbf{w}^*\| > \epsilon) \le b_1\alpha$. Details are provided in the appendix. For tapered step sizes, which would be necessary to obtain an even stronger convergence proof, the analysis framework of Borkar & Meyn (2000) does not apply to the Gradient-DD algorithm. Although theoretical investigation of convergence under tapered step sizes remains an open question, we find empirically that the algorithm does in fact converge with tapered step sizes, and even obtains much better performance in that case than with fixed step sizes.
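The eigenvalue comparison above is easy to check numerically. The sketch below uses illustrative parameter values of our own choosing, solves the quadratic in Eqn. (14), and confirms that the smaller root lies below -λ_G.

```python
import numpy as np

# Solve Eqn. (14): lambda^2 + [alpha^{-1}(1+kappa) + lam_G] lambda
#                  + alpha^{-1} lam_G = 0,
# and compare the slower GTD2 mode (-lam_G) with the smaller root lam_m
# of Gradient-DD, which the analysis predicts decays strictly faster.
alpha, kappa, lam_G = 0.1, 0.5, 0.3            # illustrative values
b = (1.0 + kappa) / alpha + lam_G              # linear coefficient
c = lam_G / alpha                              # constant coefficient
lam_m = (-b - np.sqrt(b * b - 4.0 * c)) / 2.0  # smaller root of the pair
```

Sweeping α, κ and λ_G over other positive values gives the same ordering, consistent with the case analysis in the appendix.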

5. EMPIRICAL STUDY

In this section, we assess the practical utility of the Gradient-DD method in numerical experiments. To validate the performance of Gradient-DD learning, we compare it with GTD2 learning, TDC learning (TD with gradient correction (Sutton et al., 2009b)), TD learning, and Emphatic TD learning (Sutton & Mahmood, 2016), using a tabular representation in a random-walk task and a linear representation in the Boyan-chain task. For each method and each task, we performed a scan over the step sizes $\alpha_n$ and the parameter $\kappa$ so that the comprehensive performance of the different algorithms can be compared. We considered two choices of the step-size sequence $\{\alpha_n\}$:
• (Case 1) $\alpha_n$ is constant, i.e., $\alpha_n = \alpha_0$.
• (Case 2) $\alpha_n$ is tapered according to the schedule $\alpha_n = \alpha_0(10^3 + 1)/(10^3 + n)$.
We set $\kappa = c\alpha_0$ with $c = 1, 2, 4$. Additionally, we allow $\kappa$ to depend on $n$ and consider a third case (Case 3): $\alpha_n$ is tapered as in Case 2, but $\kappa_n = c\alpha_n$. To simplify the presentation, the results of Case 3 are reported in the Appendix. To begin, we set $\beta_n = \alpha_n$; later we allow $\beta_n = \zeta\alpha_n$ with $\zeta \in \{1/4, 1/2, 1, 2\}$ in order to investigate the effect of the two-timescale approach of the gradient-based TD algorithms on Gradient-DD. In all cases, we set $\gamma = 1$.

5.1. RANDOM WALK TASK

As a first test of Gradient-DD learning, we conducted a simple random walk task (Sutton & Barto, 2018) with a tabular representation of the value function. The random walk task has a linear arrangement of $m$ states plus an absorbing terminal state at each end. Thus there are $m + 2$ sequential states, $S_0, S_1, \cdots, S_m, S_{m+1}$, where $m = 10$, 20, or 50. Every walk begins in the center state. At each step, the walk moves to a neighboring state, either to the right or to the left with equal probability. If either edge state ($S_0$ or $S_{m+1}$) is entered, the walk terminates. A walk's outcome is defined to be $r = 0$ at $S_0$ and $r = 1$ at $S_{m+1}$. Our aim is to learn the value of each state $V(s)$, where the true values are $(1, \cdots, m)/(m + 1)$. In all cases the approximate value function is initialized to the intermediate value $V_0(s) = 0.5$. In order to investigate the effect of the initialization, we also tried initializing $V_0(s) = 0$; the results, reported in Figure 7 of the Appendix, are very similar to those with the initialization $V_0(s) = 0.5$. We first compare the methods by plotting the empirical RMS error from the final episode during training as a function of step size $\alpha$ in Figure 2, where 5000 episodes are used. From the figure, we can make several observations. (1) Emphatic TD works well but is sensitive to $\alpha$: it prefers very small $\alpha$ even in the tapering case, and this preference becomes stronger as the state space grows. (2) Gradient-DD works well and is robust to $\alpha$, as is conventional TD learning. (3) TDC performs similarly to the GTD2 method, but requires slightly larger $\alpha$ than GTD2. (4) Gradient-DD performs similarly to conventional TD learning and better than the GTD2 method; this advantage is consistent across settings. (5) The range of $\alpha$ leading to effective learning for Gradient-DD is roughly similar to that for GTD2.
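The random walk described above can be generated with a few lines of code. This is a minimal sketch of the environment; the function and variable names are ours.

```python
import numpy as np

def random_walk_episode(m, rng):
    """One episode of the random walk with interior states 1..m.
    States 0 and m+1 are absorbing, with outcomes r = 0 and r = 1.
    Returns a list of (state, reward, next_state) transitions."""
    s = (m + 1) // 2                        # start in the center state
    transitions = []
    while True:
        s_next = s + rng.choice([-1, 1])    # left or right, p = 1/2 each
        r = 1.0 if s_next == m + 1 else 0.0
        transitions.append((s, r, s_next))
        if s_next == 0 or s_next == m + 1:
            return transitions
        s = s_next

# The true value of interior state s is s / (m + 1).
```

Feeding these transitions (with tabular one-hot features) into any of the compared update rules reproduces the prediction setting of this task.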
Next we look closely at the performance during training, shown in Figure 3, where each method and parameter setting was run for 5000 episodes. Guided by the observations in Figure 2, in order to facilitate comparison of the methods we set $\alpha_0 = 0.1$ for the 10-state case, $\alpha_0 = 0.2$ for the 20-state case, and $\alpha_0 = 0.5$ for the 50-state case. Because Emphatic TD requires an especially small step size, as shown in Figure 2, its step sizes $\alpha_0^{(\mathrm{ETD})}$ are tuned separately from the values used in the algorithm defined in Sutton & Mahmood (2016), chosen from $\{0.5\%, 0.1\%, 0.05\%, 0.01\%\}$ by the smallest area under the performance curve. Additionally, we also tune $\alpha_0$ for TDC, because TDC requires a slightly larger $\alpha_n$ than GTD2, as shown in Figure 2; the step sizes for TDC are set as $\alpha_n^{(\mathrm{TDC})} = a\alpha_n$, where $a$ is chosen from $\{1, 1.5, 2, 3\}$ by the smallest area under the performance curve. From the results shown in Figure 3a, we draw several observations. (1) For all conditions tested, Gradient-DD converges much more rapidly than GTD2 and TDC. The results indicate that Gradient-DD even converges faster than TD learning in some cases, though it is not as fast in the beginning episodes. (2) The advantage of Gradient-DD learning over other methods grows as the state space increases in size. (3) Gradient-DD learning is robust to the choice of $c$, which controls the size $\kappa$ of the second-order update, as long as $c$ is not too large (empirically, $c = 2$ is a good choice). (4) Gradient-DD has consistently good performance under both the constant and the tapered step size settings. In summary, compared with GTD2 learning and other methods, Gradient-DD learning in this task leads to improved learning with good convergence.
In addition to investigating the effects of the learning rate, size of the state space, and magnitude of the regularization parameter, we also investigated the effect of using distinct values for the two learning rates, α n and β n . To do this, we set β n = ζα n with ζ ∈ {1/4, 1/2, 1, 2} and report the results in Figure 8 of the appendix. The results show that comparably good performance of Gradient-DD is obtained under these various β n settings.

5.2. BOYAN-CHAIN TASK

We next investigate Gradient-DD learning on the Boyan-chain problem, which is a standard task for testing linear value-function approximation (Boyan, 2002). In this task we allow for $4p - 3$ states, with $p = 20$, each of which is represented by a $p$-dimensional feature vector. The representation of every fourth state, starting from the first, is one-hot: $[1, 0, \cdots, 0]$ for state $s_1$, $[0, 1, 0, \cdots, 0]$ for $s_5$, $\cdots$, and $[0, 0, \cdots, 0, 1]$ for the terminal state $s_{4p-3}$. The representations of the remaining states are obtained by linearly interpolating between these. The optimal coefficients of the feature vector are $(-4(p-1), -4(p-2), \cdots, 0)/5$. Simulations with $p = 50$ and 100 give similar results to those from the random walk task, and hence are not shown here. In each state, except for the last one before the end, there are two possible actions: move forward one step or move forward two steps, each with probability 0.5. Both actions lead to reward $-0.3$. The last state before the end has only one action, moving forward to the terminal state with reward $-0.2$. As in the random-walk task, the $\alpha_0$ used in Emphatic TD is tuned from $\{0.5\%, 0.2\%, 0.1\%, 0.05\%\}$. We report the results in Figure 4, which leads to conclusions similar to those drawn from Figure 3. (1) Gradient-DD converges much faster than GTD2 and TDC, and generally converges to better values, despite being somewhat slower than TD learning in the beginning episodes. (2) Gradient-DD is competitive with Emphatic TD, and its improvement over other methods grows as the state space becomes larger. (3) As $\kappa$ increases, the performance of Gradient-DD improves; moreover, performance is robust to changes in $\kappa$ as long as $\kappa$ is not very large (empirically, a good choice is $\kappa = \alpha$ or $2\alpha$). (4) Comparing constant versus tapered step sizes, the Gradient-DD method performs better with tapered step sizes.
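The interpolated features described above can be constructed explicitly. The following sketch is our own construction of the feature matrix, with $p$ features and $4p - 3$ states as in the text.

```python
import numpy as np

def boyan_features(p):
    """Feature matrix for the Boyan chain: 4p - 3 states, p features.
    States s_1, s_5, s_9, ... get one-hot vectors; intermediate states
    interpolate linearly between the surrounding one-hot representations."""
    n_states = 4 * p - 3
    X = np.zeros((n_states, p))
    for i in range(n_states):        # i = 0 corresponds to state s_1
        k, frac = divmod(i, 4)       # anchor index and offset within segment
        if frac == 0:
            X[i, k] = 1.0            # anchor state: one-hot
        else:
            X[i, k] = 1.0 - frac / 4.0      # linear interpolation
            X[i, k + 1] = frac / 4.0
    return X
```

Each non-anchor row places weight on exactly two adjacent features, so every row sums to one; the value estimate at an intermediate state is a convex combination of the two neighboring anchor values.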

5.3. BAIRD'S COUNTEREXAMPLE

We also verify the performance of Gradient-DD on Baird's off-policy counterexample (Baird, 1995), for which TD learning famously diverges. We consider three cases: 7-state, 100-state, and 500-state. We set $\alpha = 0.02$ (but $\alpha = 10^{-5}$ for ETD), $\beta = \alpha$, and $\gamma = 0.99$. We set $\kappa = 0.2$ for GDD1, $\kappa = 0.4$ for GDD2, and $\kappa = 0.8$ for GDD3. The initial parameter values are $(1, \cdots, 1, 10, 1)^\top$. We measure performance by the empirical RMS error as a function of sweeps, and report the results in Figure 5. The figure demonstrates that Gradient-DD works as well on this well-known counterexample as GTD2 does, and even works better than GTD2 in the 100-state case. We also observe that the performance improvement of Gradient-DD increases as the state space grows. TDC is not reported here due to its similarity to GTD2. We note that, because the linear approximation leaves a residual error in the value estimation due to the projection error, the RMS errors in this task do not go to zero. Interestingly, Gradient-DD reduces this residual error as the size of the state space increases.
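For reference, the feature matrix of the classic 7-state counterexample can be written down directly. This sketch follows one common parameterization of Baird's example (as in Sutton & Barto, 2018); the 100- and 500-state variants used in the paper extend this pattern, and how exactly they do so is not specified here.

```python
import numpy as np

# Features for the 7-state Baird counterexample: 8 features per state.
# States 1-6: x(s_i) = 2 e_i + e_8;  state 7: x(s_7) = e_7 + 2 e_8.
X = np.zeros((7, 8))
for i in range(6):
    X[i, i] = 2.0    # private feature, weight 2
    X[i, 7] = 1.0    # shared feature, weight 1
X[6, 6] = 1.0        # state 7: private feature, weight 1
X[6, 7] = 2.0        # state 7: shared feature, weight 2
```

The representation is over-parameterized (8 weights for 7 states), which, combined with the off-policy update distribution, is what makes semi-gradient TD diverge here.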

6. CONCLUSION AND DISCUSSION

In this work, we have proposed Gradient-DD learning, a new gradient-descent-based TD learning algorithm. The algorithm modifies the projected Bellman error objective function for value function approximation by introducing a second-order difference term. It significantly improves upon existing methods for gradient-based TD learning, obtaining better convergence performance than conventional linear TD learning. Since GTD learning was originally proposed, the Gradient-TD family of algorithms has been extended to incorporate eligibility traces and to learn optimal policies (Maei & Sutton, 2010; Geist & Scherrer, 2014), as well as to apply to neural networks (Maei, 2011). Additionally, many variants of the vanilla Gradient-TD methods have been proposed, including HTD (Hackman, 2012) and Proximal Gradient-TD (Liu et al., 2016). Because Gradient-DD merely modifies the objective error of GTD2 by adding a squared-bias term, it may be extended and combined with these other methods, potentially broadening its utility for more complicated tasks. In this work we have focused on value function prediction in the two simple cases of tabular representations and linear approximation. An especially interesting direction for future study will be the application of Gradient-DD learning to tasks requiring more complex representations, including neural network implementations. Such approaches are especially useful in cases where state spaces are large, and indeed our results indicate that Gradient-DD confers the greatest advantage over other methods in such cases. Intuitively, we expect that this is because the difference between the optimal update direction and that chosen by gradient descent becomes greater in higher-dimensional spaces (cf. Fig. 1). This performance benefit in large state spaces suggests that Gradient-DD may be of practical use for these more challenging cases.
APPENDIX

6.1. ON THE EQUIVALENCE OF EQNS. (7) & (8)

The Karush-Kuhn-Tucker conditions of Eqn. (8) are the following system of equations:
$$\begin{cases}
\frac{d}{d\mathbf{w}}J(\mathbf{w}) + \kappa\frac{d}{d\mathbf{w}}\left(\|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2 - \mu\right) = 0;\\
\kappa\left(\|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2 - \mu\right) = 0;\\
\|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2 \le \mu;\\
\kappa \ge 0.
\end{cases}$$
These equations are equivalent to
$$\begin{cases}
\frac{d}{d\mathbf{w}}J(\mathbf{w}) + \kappa\frac{d}{d\mathbf{w}}\|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2 = 0 \text{ and } \kappa > 0, & \text{if } \|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2 = \mu;\\
\frac{d}{d\mathbf{w}}J(\mathbf{w}) = 0 \text{ and } \kappa = 0, & \text{if } \|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2 < \mu.
\end{cases}$$
Thus, for any $\mu > 0$, there exists a $\kappa \ge 0$ such that
$$\frac{d}{d\mathbf{w}}J(\mathbf{w}) + \kappa\frac{d}{d\mathbf{w}}\|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2 = 0.$$

6.2. EIGENVALUES OF J

Let $\lambda$ be an eigenvalue of the matrix $\mathbf{J}$. We have
$$|\lambda\mathbf{I} - \mathbf{J}| = \begin{vmatrix}\lambda\mathbf{I} + \sqrt{\zeta}\mathbf{G} & \kappa\mathbf{H}\\ \sqrt{\zeta}\alpha^{-1}\mathbf{G} & \lambda\mathbf{I} + \alpha^{-1}(\mathbf{I} + \kappa\mathbf{H})\end{vmatrix} = \begin{vmatrix}\lambda\mathbf{I} + \sqrt{\zeta}\mathbf{G} & \kappa\mathbf{H}\\ -\lambda\alpha^{-1}\mathbf{I} & \lambda\mathbf{I} + \alpha^{-1}\mathbf{I}\end{vmatrix} = \begin{vmatrix}\lambda\mathbf{I} + \sqrt{\zeta}\mathbf{G} & \kappa\mathbf{H}\\ \mathbf{0} & \lambda\mathbf{I} + \alpha^{-1}\mathbf{I} + \kappa\alpha^{-1}\lambda(\lambda\mathbf{I} + \sqrt{\zeta}\mathbf{G})^{-1}\mathbf{H}\end{vmatrix} = \left|(\lambda\mathbf{I} + \sqrt{\zeta}\mathbf{G})(\lambda\mathbf{I} + \alpha^{-1}\mathbf{I}) + \kappa\alpha^{-1}\lambda\mathbf{H}\right|.$$
From the assumption $E(\mathbf{x}_n\mathbf{x}_n^\top) = \mathbf{I}$ and the definition of $\mathbf{H}$, some eigenvalues $\lambda$ of the matrix $\mathbf{J}$ are solutions to
$$(\lambda + \lambda_G)(\lambda + \alpha^{-1}) = 0,$$
and the other eigenvalues are solutions to
$$(\lambda + \lambda_G)(\lambda + \alpha^{-1}) + \kappa\alpha^{-1}\lambda = \lambda^2 + \left[\alpha^{-1}(1 + \kappa) + \lambda_G\right]\lambda + \alpha^{-1}\lambda_G = 0.$$
Note $\lambda_G > 0$. The pair of solutions to the equation above is
$$\lambda = -\tfrac{1}{2}\left[\alpha^{-1}(1 + \kappa) + \lambda_G\right] \pm \tfrac{1}{2}\sqrt{\left[\alpha^{-1}(1 + \kappa) + \lambda_G\right]^2 - 4\alpha^{-1}\lambda_G} = -\tfrac{1}{2}\left[\alpha^{-1}(1 + \kappa) + \lambda_G\right] \pm \tfrac{1}{2}\sqrt{\left[\alpha^{-1}(1 + \kappa) - \lambda_G\right]^2 + 4\alpha^{-1}\lambda_G\kappa}.$$
Thus, the smaller eigenvalue of the pair is
$$\lambda_m = -\tfrac{1}{2}\left[\alpha^{-1}(1 + \kappa) + \lambda_G\right] - \tfrac{1}{2}\sqrt{\left[\alpha^{-1}(1 + \kappa) - \lambda_G\right]^2 + 4\alpha^{-1}\lambda_G\kappa} < -\tfrac{1}{2}\left[\alpha^{-1}(1 + \kappa) + \lambda_G\right] - \tfrac{1}{2}\left|\alpha^{-1}(1 + \kappa) - \lambda_G\right|,$$
where the inequality follows from $\lambda_G > 0$. When $\alpha^{-1}(1 + \kappa) - \lambda_G > 0$, then
$$\lambda_m < -\tfrac{1}{2}\left[\alpha^{-1}(1 + \kappa) + \lambda_G\right] - \tfrac{1}{2}\left(\alpha^{-1}(1 + \kappa) - \lambda_G\right) = -\alpha^{-1}(1 + \kappa) < -\lambda_G.$$
When $\alpha^{-1}(1 + \kappa) - \lambda_G \le 0$, then
$$\lambda_m < -\tfrac{1}{2}\left[\alpha^{-1}(1 + \kappa) + \lambda_G\right] + \tfrac{1}{2}\left(\alpha^{-1}(1 + \kappa) - \lambda_G\right) = -\lambda_G.$$
In either case, $\lambda_m < -\lambda_G$.

6.3. CONVERGENCE WITH CONSTANT STEP SIZES

Finally, we apply the ODE method of stochastic approximation to obtain the convergence result.

Theorem 1. Consider the update rules (10) with constant step sizes $\kappa$, $\alpha$ and $\beta$ satisfying $\kappa \ge 0$, $\beta = \zeta\alpha$ with $\zeta > 0$, $\alpha \in (0, 1)$, and $\beta > 0$. Let the TD fixed point be $\mathbf{w}^*$, such that $\mathbf{V}_{\mathbf{w}^*} = \Pi B\mathbf{V}_{\mathbf{w}^*}$. Suppose that (A1) $(\mathbf{x}_n, r_n, \mathbf{x}_{n+1})$ is an i.i.d. sequence with uniformly bounded second moments, and (A2) $E[(\mathbf{x}_n - \gamma\mathbf{x}_{n+1})\mathbf{x}_n^\top]$ and $E(\mathbf{x}_n\mathbf{x}_n^\top)$ are non-singular. Then for any $\epsilon > 0$, there exists $b_1 < \infty$ such that
$$\limsup_{n\to\infty} P(\|\mathbf{w}_n - \mathbf{w}^*\| > \epsilon) \le b_1\alpha.$$

Proof. Since the step sizes are constant, we denote $\kappa_n = \kappa$ and $\alpha_n = \alpha$. Thus, Eqn. (12) is equivalent to
$$(\mathbf{I} + \kappa\mathbf{H}_n)(\boldsymbol{\rho}_{n+1} - \boldsymbol{\rho}_n) - \kappa\mathbf{H}_n(\boldsymbol{\rho}_{n+1} - 2\boldsymbol{\rho}_n + \boldsymbol{\rho}_{n-1}) = -\sqrt{\zeta}\,\alpha(\mathbf{G}_n\boldsymbol{\rho}_n - \mathbf{g}_{n+1}). \tag{A.1}$$
Denoting $\boldsymbol{\psi}_{n+1} = \alpha^{-1}(\boldsymbol{\rho}_{n+1} - \boldsymbol{\rho}_n)$, Eqn. (A.1) is rewritten as
$$\begin{pmatrix}\boldsymbol{\rho}_{n+1} - \boldsymbol{\rho}_n\\ \boldsymbol{\psi}_{n+1} - \boldsymbol{\psi}_n\end{pmatrix} = \alpha\begin{pmatrix}\mathbf{I} + \kappa\mathbf{H}_n & -\kappa\alpha\mathbf{H}_n\\ \mathbf{I} & -\alpha\mathbf{I}\end{pmatrix}^{-1}\begin{pmatrix}-\sqrt{\zeta}(\mathbf{G}_n\boldsymbol{\rho}_n - \mathbf{g}_{n+1})\\ \boldsymbol{\psi}_n\end{pmatrix} = \alpha\begin{pmatrix}-\sqrt{\zeta}\mathbf{G}_n & -\kappa\mathbf{H}_n\\ -\sqrt{\zeta}\alpha^{-1}\mathbf{G}_n & -\alpha^{-1}(\mathbf{I} + \kappa\mathbf{H}_n)\end{pmatrix}\begin{pmatrix}\boldsymbol{\rho}_n\\ \boldsymbol{\psi}_n\end{pmatrix} + \alpha\begin{pmatrix}\sqrt{\zeta}\,\mathbf{g}_{n+1}\\ \sqrt{\zeta}\,\alpha^{-1}\mathbf{g}_{n+1}\end{pmatrix}, \tag{A.2}$$
where the second step uses the identity
$$\begin{pmatrix}\mathbf{I} + \kappa\mathbf{H}_n & -\kappa\alpha\mathbf{H}_n\\ \mathbf{I} & -\alpha\mathbf{I}\end{pmatrix}^{-1} = \begin{pmatrix}\mathbf{I} & -\kappa\mathbf{H}_n\\ \alpha^{-1}\mathbf{I} & -\alpha^{-1}(\mathbf{I} + \kappa\mathbf{H}_n)\end{pmatrix}.$$
Denoting $\mathbf{G} = E(\mathbf{G}_n)$, $\mathbf{g} = E(\mathbf{g}_n)$ and $\mathbf{H} = E(\mathbf{H}_n)$, the TD fixed point of Eqn. (A.1) is given by
$$-\mathbf{G}\boldsymbol{\rho} + \mathbf{g} = \mathbf{0}. \tag{A.3}$$
We apply the ODE approach of stochastic approximation, Lemma 1 (Theorem 2.3 of Borkar & Meyn (2000)), to Eqn. (A.2). Note that Sutton et al. (2009a) and Sutton et al. (2009b) also applied Theorem 2.3 of Borkar & Meyn (2000) to obtain convergence results for gradient-descent-based temporal-difference learning. To simplify notation, denote
$$\mathbf{J}_n = \begin{pmatrix}-\sqrt{\zeta}\mathbf{G}_n & -\kappa\mathbf{H}_n\\ -\sqrt{\zeta}\alpha^{-1}\mathbf{G}_n & -\alpha^{-1}(\mathbf{I} + \kappa\mathbf{H}_n)\end{pmatrix}, \quad \mathbf{J} = \begin{pmatrix}-\sqrt{\zeta}\mathbf{G} & -\kappa\mathbf{H}\\ -\sqrt{\zeta}\alpha^{-1}\mathbf{G} & -\alpha^{-1}(\mathbf{I} + \kappa\mathbf{H})\end{pmatrix},$$
$$\mathbf{y}_n = \begin{pmatrix}\boldsymbol{\rho}_n\\ \boldsymbol{\psi}_n\end{pmatrix}, \quad \mathbf{h}_n = \begin{pmatrix}\sqrt{\zeta}\,\mathbf{g}_{n+1}\\ \sqrt{\zeta}\,\alpha^{-1}\mathbf{g}_{n+1}\end{pmatrix}, \quad \mathbf{h} = \begin{pmatrix}\sqrt{\zeta}\,\mathbf{g}\\ \sqrt{\zeta}\,\alpha^{-1}\mathbf{g}\end{pmatrix}.$$
Eqn. (A.2) is then rewritten as
$$\mathbf{y}_{n+1} = \mathbf{y}_n + \alpha\left(f(\mathbf{y}_n) + \mathbf{h} + \mathbf{M}_{n+1}\right), \tag{A.4}$$
where $f(\mathbf{y}_n) = \mathbf{J}\mathbf{y}_n$ and $\mathbf{M}_{n+1} = (\mathbf{J}_n - \mathbf{J})\mathbf{y}_n + \mathbf{h}_n - \mathbf{h}$. We now verify the conditions (c1)-(c4) of Lemma 1. Firstly, Condition (c1) is satisfied under the assumption of constant step sizes.

Secondly, $f(\mathbf{y})$ is linear and hence Lipschitz, with $f_\infty(\mathbf{y}) = \mathbf{J}\mathbf{y}$. Following Sutton et al. (2009a), Assumption A2 implies that the real parts of all the eigenvalues of $\mathbf{G}$ are positive; by the eigenvalue analysis of Section 4, the eigenvalues of $\mathbf{J}$ then have negative real parts, so the ODE $\dot{\mathbf{y}} = f_\infty(\mathbf{y})$ has the origin as a globally asymptotically stable equilibrium. Therefore, Condition (c2) is satisfied.



Figure 1: Schematic diagram of Gradient-DD learning with $\mathbf{w} \in \mathbb{R}^2$. Rather than updating $\mathbf{w}$ directly along the gradient of the MSPBE (arrow), the update rule selects $\mathbf{w}_n$ that minimizes the MSPBE while satisfying the constraint $\|\mathbf{V}_\mathbf{w} - \mathbf{V}_{\mathbf{w}_{n-1}}\|_\mathbf{D}^2 \le \mu$ (shaded ellipse).

Figure 2: Performance in the random walk task depends on step size. (a) Constant step size $\alpha_n = \alpha_0$. (b) Tapering step size $\alpha_n = \alpha_0(10^3 + 1)/(10^3 + n)$. In (a) and (b), the state space size is 10 (left) or 20 (right). GDD(c) denotes Gradient-DD with the given value of $c$. The curves are averaged over 20 runs, with error bars denoting standard deviations across runs.

Tapering step size $\alpha_n = \alpha_0(10^3 + 1)/(10^3 + n)$.

Figure 3: Performance of Gradient-DD in the random walk task. From left to right in each subfigure, the size of the state space is 10 ($\alpha_0 = 0.1$), 20 ($\alpha_0 = 0.2$), or 50 ($\alpha_0 = 0.5$). The curves are averaged over 20 runs, with error bars denoting standard deviations across runs.

Constant step size $\alpha_n = \alpha_0$, where $\alpha_0 = 0.05, 0.1, 0.2$ from left to right. Tapering step size $\alpha_n = \alpha_0(10^3 + 1)/(10^3 + n)$, where $\alpha_0 = 0.3, 0.5, 0.8$ from left to right.

Figure 4: Performance of Gradient-DD in the Boyan-chain task with 20 features. Note that the case Gradient-DD(4), i.e. $c = 4$, is not shown in panels where it does not converge.

Figure 5: Baird's off-policy counterexample. From left to right: 7-state, 100-state, and 500-state. TDC is not reported here due to its similarity to GTD2. We set $\alpha = 0.02$ (but $\alpha = 10^{-5}$ for ETD), $\beta = \alpha$, and $\kappa = 0.02c$. GDD(c) denotes Gradient-DD with the given value of $c$.


Because $E(\mathbf{M}_{n+1} \mid \mathcal{F}_n) = \mathbf{0}$, where $\mathcal{F}_n = \sigma(\mathbf{y}_i, \mathbf{M}_i, i \le n)$, the sequence $\{\mathbf{M}_n\}$ is a martingale difference sequence, and we have
$$E(\|\mathbf{M}_{n+1}\|^2 \mid \mathcal{F}_n) \le c_0(1 + \|\mathbf{y}_n\|^2). \tag{A.5}$$
From Assumption A1, Eqn. (A.5) follows because there are constants $c_J$ and $c_h$ bounding the second moments of $\mathbf{J}_n - \mathbf{J}$ and $\mathbf{h}_n - \mathbf{h}$, respectively. Thus, Condition (c3) is satisfied. Finally, Condition (c4) is satisfied by noting that the equilibrium $\mathbf{y}^*$, whose first component is $\boldsymbol{\rho}^* = \mathbf{G}^{-1}\mathbf{g}$, is the unique globally asymptotically stable equilibrium.

Theorem 1 bounds the estimation error of $\mathbf{w}$ in probability. Note that the convergence of Gradient-DD learning provided in Theorem 1 is a somewhat weaker result than the statement that $\mathbf{w}_n \to \mathbf{w}^*$ with probability 1 as $n \to \infty$. The technical reason for this is the condition on the step sizes: in Theorem 1 we consider constant step sizes, with $\alpha_n = \alpha$ and $\kappa_n = \kappa$. This restriction is imposed so that Eqn. (12) can be written as a system of first-order difference equations, which cannot be done rigorously when step sizes are tapered as in Sutton et al. (2009b). As shown in Section 5, however, we find empirically that the algorithm does in fact converge with tapered step sizes, and even obtains much better performance in this case than with fixed step sizes.

AN ODE RESULT ON STOCHASTIC APPROXIMATION

We introduce an ODE result on stochastic approximation in the following lemma, and then prove Theorem 1 by applying this result.

Lemma 1 (Theorem 2.3 of Borkar & Meyn (2000)). Consider the stochastic approximation algorithm described by the $d$-dimensional recursion
$$\mathbf{y}_{n+1} = \mathbf{y}_n + \alpha_n\left(f(\mathbf{y}_n) + \mathbf{h} + \mathbf{M}_{n+1}\right).$$
Suppose the following conditions hold: (c1) the sequence $\{\alpha_n\}$ satisfies $\underline{\alpha} < \alpha_n < \bar{\alpha}$ for some constants $0 < \underline{\alpha} < \bar{\alpha} < 1$; (c2) the function $f$ is Lipschitz, and there exists a function $f_\infty$ such that $\lim_{r\to\infty} f_r(\mathbf{y}) = f_\infty(\mathbf{y})$, where the scaled function $f_r: \mathbb{R}^d \to \mathbb{R}^d$ is given by $f_r(\mathbf{y}) = f(r\mathbf{y})/r$, and the ODE $\dot{\mathbf{y}} = f_\infty(\mathbf{y})$ has the origin as a globally asymptotically stable equilibrium; (c3) the sequence $\{\mathbf{M}_n\}$ is a martingale difference sequence that, for some $c_0 < \infty$ and any initial condition, satisfies $E(\|\mathbf{M}_{n+1}\|^2 \mid \mathcal{F}_n) \le c_0(1 + \|\mathbf{y}_n\|^2)$; (c4) the ODE $\dot{\mathbf{y}} = f(\mathbf{y}) + \mathbf{h}$ has a unique globally asymptotically stable equilibrium $\mathbf{y}^*$. Then for any $\epsilon > 0$, there exists $b_1 < \infty$ such that
$$\limsup_{n\to\infty} P(\|\mathbf{y}_n - \mathbf{y}^*\| > \epsilon) \le b_1\bar{\alpha}.$$

Figure: Performance in the random walk task (Case 3) with tapering step size $\alpha_n = \alpha_0(10^3 + 1)/(10^3 + n)$, where $\alpha_0 = 0.3, 0.5, 0.8$ from left to right, and with $\kappa$ dependent on $n$: $\kappa_n = c\alpha_n$. Note that the case GDD(4), i.e. $c = 4$, is not shown where it does not converge.

