GRADIENT DESCENT TEMPORAL DIFFERENCE-DIFFERENCE LEARNING

Abstract

Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning. However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable. To address this, alternative algorithms that are provably convergent in such cases have been introduced, the most well known being gradient descent temporal difference (GTD) learning. This algorithm and others like it, however, tend to converge much more slowly than conventional temporal difference learning. In this paper we propose gradient descent temporal difference-difference (Gradient-DD) learning in order to improve GTD learning by introducing second-order differences in successive parameter updates. We investigate this algorithm in the framework of linear value function approximation, analytically showing its improvement over GTD learning. Studying the model empirically on the random walk and Boyan-chain prediction tasks, we find substantial improvement over GTD learning and, in several cases, better performance even than conventional TD learning.

1. INTRODUCTION

Off-policy algorithms for value function learning enable an agent to use a behavior policy that differs from the target policy in order to gain experience for learning. However, because off-policy methods learn a value function for a target policy from data generated by a different behavior policy, they often exhibit greater variance in parameter updates. When applied to problems involving function approximation, off-policy methods are slower to converge than on-policy methods and may even diverge (Baird, 1995; Sutton & Barto, 2018).

Two general approaches have been investigated to address the challenge of developing stable and effective off-policy temporal-difference algorithms. One approach is to use importance sampling methods to warp the update distribution back to the on-policy distribution (Precup et al., 2000; Mahmood et al., 2014). This approach is useful for decreasing the variance of parameter updates, but it does not address stability issues. The second main approach is to develop true gradient-descent-based methods that are guaranteed to be stable regardless of the update distribution. Sutton et al. (2009a;b) proposed the first off-policy gradient-descent-based temporal difference algorithms (GTD and GTD2, respectively). These algorithms are guaranteed to be stable, with computational complexity scaling linearly with the size of the function approximator. Empirically, however, their convergence is much slower than conventional temporal difference (TD) learning, limiting their practical utility (Ghiassian et al., 2020; White & White, 2016). Building on this work, extensions to the GTD family of algorithms (see Ghiassian et al. (2018) for a review) have allowed for incorporating eligibility traces (Maei & Sutton, 2010; Geist & Scherrer, 2014), non-linear function approximation such as with a neural network (Maei, 2011), and reformulation of the optimization as a saddle point problem (Liu et al., 2015; Du et al., 2017). However, due to their slow convergence, none of these stable off-policy methods are commonly used in practice.

In this work, we introduce a new gradient descent algorithm for temporal difference learning with linear value function approximation. This algorithm, which we call gradient descent temporal difference-difference (Gradient-DD) learning, is an acceleration technique that employs second-order differences in successive parameter updates. The basic idea of Gradient-DD is to modify the error objective function by additionally considering the prediction error obtained in the last time step, and then to derive a gradient-descent algorithm based on this modified objective function. In addition to exploiting the Bellman equation to obtain the solution, this modified error objective function avoids drastic changes in the value function estimate by encouraging local search around the current estimate. Algorithmically, the Gradient-DD approach only adds an additional term to the update rule of the GTD2 method, and the extra computational cost is negligible. We show mathematically that applying this method significantly improves the convergence rate relative to the GTD2 method for linear function approximation. This result is supported by numerical experiments, which also show that Gradient-DD obtains better convergence in many cases than conventional TD learning.
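As a point of reference for the update rule that Gradient-DD modifies, the standard GTD2 update for linear value function approximation can be sketched as follows. This is a minimal illustration of the GTD2 baseline only; the additional Gradient-DD term is derived later in the paper, and the step sizes, variable names, and update ordering here are illustrative assumptions.

```python
import numpy as np

def gtd2_step(w, v, x, x_next, r, gamma=0.99, alpha=0.05, beta=0.05):
    """One GTD2 update for linear value approximation V_w(s) = w @ x(s).

    w: primary weight vector; v: auxiliary weight vector of the same shape.
    x, x_next: feature vectors of the current and next state; r: reward.
    alpha, beta: step sizes for the primary and auxiliary updates.
    """
    # TD error for the current transition.
    delta = r + gamma * (w @ x_next) - (w @ x)
    # Auxiliary weights track a linear estimate of the expected TD error.
    v = v + beta * (delta - v @ x) * x
    # Primary update descends the projected Bellman error objective.
    w = w + alpha * (x - gamma * x_next) * (v @ x)
    return w, v
```

Per the text above, Gradient-DD keeps this rule intact and only adds one extra term to it, so its per-step cost remains linear in the number of features.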

1.1. RELATED WORK

In related approaches to ours, some previous studies have attempted to improve Gradient-TD algorithms by adding regularization terms to the objective function. Liu et al. (2012) used ℓ1 regularization on weights to learn sparse representations of value functions, and Ghiassian et al. (2020) used ℓ2 regularization on weights. Unlike these references, our approach modifies the error objective function by regularizing the evaluation error obtained in the most recent time step. With this modification, our method provides a learning rule that contains second-order differences in successive parameter updates. Our approach is similar to trust region policy optimization (Peters & Schaal, 2008; Schulman et al., 2015) and relative entropy policy search (Peters et al., 2010), which penalize large policy changes during policy learning. In these methods, constrained optimization is used to update the policy subject to a constraint on some measure of distance between the new policy and the old policy. Here, however, our aim is to find the optimal value function, and the regularization term uses the previous value function estimate to avoid drastic changes during the updating process.

2.1. PROBLEM DEFINITION AND BACKGROUND

In this section, we formalize the problem of learning the value function for a given policy under the Markov Decision Process (MDP) framework. In this framework, the agent interacts with the environment over a sequence of discrete time steps, t = 1, 2, .... At each time step the agent observes the state s_t ∈ S and selects an action a_t ∈ A. In response, the environment emits a reward r_t ∈ R and transitions the agent to its next state s_{t+1} ∈ S. The state and action sets are finite. State transitions are stochastic and depend on the immediately preceding state and action. Rewards are stochastic and depend on the preceding state and action, as well as on the next state. The process generating the agent's actions is termed the behavior policy. In off-policy learning, this behavior policy is in general different from the target policy π : S → A. The objective is to learn an approximation to the state-value function under the target policy in a particular environment:

$$V(s) = \mathbb{E}_\pi\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s\right],$$

where γ ∈ [0, 1) is the discount rate. In problems for which the state space is large, it is practical to approximate the value function. In this paper we consider linear function approximation, where states are mapped to feature vectors with fewer components than the number of states. Specifically, for each state s ∈ S there is a corresponding feature vector x(s) ∈ R^p, with p ≤ |S|, such that the approximate value function is given by

$$V_{\mathbf{w}}(s) := \mathbf{w}^\top \mathbf{x}(s). \qquad (2)$$

The goal is then to learn the parameters w such that V_w(s) ≈ V(s).
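As a concrete instance of this setting, the sketch below runs conventional semi-gradient TD(0) with one-hot (tabular) features on the 5-state random walk mentioned in the abstract; one-hot features are the special case p = |S|. This illustrates only the baseline learning of w with V_w(s) = w^T x(s), not Gradient-DD, and all constants (step size, episode count, seed) are illustrative assumptions.

```python
import numpy as np

# Five-state random walk: start in the middle, step left or right with equal
# probability; the right terminal yields reward 1, the left terminal 0.
# The true values are s/6 for states s = 1..5.
n = 5
X = np.eye(n)            # one-hot feature vectors x(s), so p = |S| here
w = np.zeros(n)          # parameters of V_w(s) = w @ x(s)
alpha, gamma = 0.05, 1.0
rng = np.random.default_rng(0)

for _ in range(5000):
    s = 2                                    # middle state (0-indexed)
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        terminal = s_next in (-1, n)
        r = 1.0 if s_next == n else 0.0
        v_next = 0.0 if terminal else w @ X[s_next]
        # Semi-gradient TD(0): w <- w + alpha * delta * x(s)
        delta = r + gamma * v_next - w @ X[s]
        w = w + alpha * delta * X[s]
        if terminal:
            break
        s = s_next

print(np.round(w, 2))    # estimates should lie near s/6 for s = 1..5
```

Because the step size is held constant, the estimates fluctuate around the true values rather than converging exactly; this on-policy tabular case is the simplest setting in which TD methods can be compared.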

