CLOSING THE GAP BETWEEN SVRG AND TD-SVRG WITH GRADIENT SPLITTING

Abstract

Temporal difference (TD) learning is a simple algorithm for policy evaluation in reinforcement learning. The performance of TD learning is affected by high variance, and it can naturally be enhanced with variance reduction techniques such as the Stochastic Variance Reduced Gradient (SVRG) method. Recently, multiple works have sought to fuse TD learning with SVRG to obtain a policy evaluation method with a geometric rate of convergence. However, the resulting convergence rates are significantly weaker than what is achieved by SVRG in the setting of convex optimization. In this work we utilize a recent interpretation of TD learning as the splitting of the gradient of an appropriately chosen function, which simplifies the algorithm and its fusion with SVRG. We prove a geometric convergence bound with a predetermined learning rate of 1/8, identical to the convergence bound available for SVRG in the convex setting.

1. INTRODUCTION

Reinforcement learning (RL) is a learning paradigm which addresses a class of problems in sequential decision-making environments. Policy evaluation is one such problem: determining the expected reward an agent will achieve if it chooses actions according to a stationary policy. Temporal Difference learning (TD learning, Sutton (1988)) is a popular algorithm because it is simple and can be performed online on single samples or small mini-batches. TD learning uses the Bellman equation to bootstrap the estimation process and update the value function from each incoming sample or mini-batch. Like all methods in RL, TD learning suffers from the "curse of dimensionality" when the number of states is large. To address this issue, linear or nonlinear feature approximation of state values is often used in practice. Despite its simple formulation, the theoretical analysis of approximate TD learning is subtle. There are a few important milestones in this line of work, one of which is the work of Tsitsiklis & Van Roy (1997), which established asymptotic convergence guarantees. More recently, advances were made by Bhandari et al. (2018), Srikant & Ying (2019) and Liu & Olshevsky (2020). In particular, the last paper shows that TD learning may be viewed as an example of gradient splitting, a process analogous to gradient descent. TD learning has an inherent variance problem: the variance of the update does not go to zero as the method converges. The same problem arises in a class of convex optimization problems where the target function is represented as a sum of functions and SGD-type methods are applied (Robbins & Monro (1951)). Such methods proceed incrementally by sampling a single function, or a mini-batch of functions, to use for stochastic gradient evaluations. Several variance reduction techniques were developed to address this problem and speed up convergence, including SAG (Schmidt et al. (2013)), SVRG (Johnson & Zhang (2013)) and SAGA (Defazio et al. (2014)).
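The SVRG scheme referenced above can be illustrated with a minimal sketch for a finite-sum objective. This is not the paper's algorithm, only the generic convex-optimization loop it builds on; all names here are illustrative.

```python
import numpy as np

def svrg(grad_i, full_grad, n, theta0, eta, num_epochs, inner_steps, rng=None):
    """Minimal SVRG loop for minimizing f(theta) = (1/n) sum_i f_i(theta).

    grad_i(i, theta): gradient of the i-th component f_i.
    full_grad(theta): the full gradient of f.
    """
    rng = np.random.default_rng(rng)
    theta_tilde = np.asarray(theta0, dtype=float)
    for _ in range(num_epochs):
        mu = full_grad(theta_tilde)  # anchor gradient, computed once per epoch
        theta = theta_tilde.copy()
        for _ in range(inner_steps):
            i = rng.integers(n)
            # variance-reduced stochastic gradient: unbiased, with variance
            # shrinking as theta approaches theta_tilde
            v = grad_i(i, theta) - grad_i(i, theta_tilde) + mu
            theta -= eta * v
        theta_tilde = theta
    return theta_tilde
```

The correction term `grad_i(i, theta) - grad_i(i, theta_tilde) + mu` keeps the stochastic gradient unbiased while driving its variance to zero near the optimum, which is what yields geometric convergence.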
These methods are collectively known as variance-reduced gradient methods. Their distinguishing feature is that they converge geometrically. The first attempt to adapt variance reduction to TD learning with online sampling was made by Korda & La (2015). All these results obtained geometric convergence of the algorithms, improving on the sub-geometric convergence of standard TD methods. However, the convergence rates obtained in these papers are significantly worse than the convergence of SVRG in the convex setting. In particular, the resulting convergence times for policy evaluation scaled with the square of the condition number, as opposed to SVRG, which retains the linear scaling in the condition number enjoyed by SGD. Quadratic scaling makes practical application of the theoretically obtained values almost infeasible, since the number of computations becomes very large even for simple problems. Moreover, the convergence time bounds contained additional terms coming from the condition number of a matrix that diagonalizes some of the matrices appearing in the problem formulation, which can be arbitrarily large. In this paper we analyze the convergence of the SVRG technique applied to TD (TD-SVRG) in two settings: (i) a pre-sampled trajectory of the Markov Decision Process (MDP) (finite sampling), and (ii) when states are sampled directly from the MDP (online sampling). Our contribution is threefold:
• For the finite-sample case we achieve significantly better results with a simpler analysis. We are the first to show that TD-SVRG has the same convergence rate as SVRG in the convex optimization setting, with a pre-determined learning rate of 1/8.
• For i.i.d. online sampling, we similarly achieve better results with a simpler analysis, and are again the first to show that TD-SVRG matches the convergence rate of SVRG in the convex optimization setting with a predetermined learning rate of 1/8.
In addition, for Markovian online sampling, we provide convergence guarantees that in most cases improve on state-of-the-art results.
• We are the first to develop theoretical guarantees for an algorithm that can be directly applied in practice. In previous works, the batch sizes required to guarantee convergence were so large as to be impractical (see Subsection H.1), and grid search was needed to tune the learning rate and batch size. We include experiments showing that our theoretically obtained batch size and learning rate can be applied in practice and achieve geometric convergence.
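To make the object of study concrete, the following is a minimal sketch of the finite-sample TD-SVRG loop: each epoch computes the full mean-path update at an anchor point and then takes variance-reduced TD steps on sampled transitions. The learning rate of 1/8 is the paper's predetermined value; the dataset layout and function names are assumptions for illustration, not the paper's exact pseudocode.

```python
import numpy as np

def td_svrg(phi, phi_next, rewards, gamma, num_epochs, inner_steps,
            eta=1 / 8, rng=None):
    """Sketch of TD-SVRG on a fixed dataset of N transitions.

    phi, phi_next: (N, d) feature matrices for states s and s'.
    rewards: (N,) rewards r(s, s').
    """
    rng = np.random.default_rng(rng)
    N, d = phi.shape

    def g(i, theta):
        # single-sample TD update direction g_{s,s'}(theta)
        td_err = rewards[i] + gamma * phi_next[i] @ theta - phi[i] @ theta
        return td_err * phi[i]

    theta_tilde = np.zeros(d)
    for _ in range(num_epochs):
        # full mean-path update at the anchor point (one pass over the data)
        g_bar = np.mean([g(i, theta_tilde) for i in range(N)], axis=0)
        theta = theta_tilde.copy()
        for _ in range(inner_steps):
            i = rng.integers(N)
            # variance-reduced direction: g_i(theta) - g_i(theta_tilde) + g_bar
            theta += eta * (g(i, theta) - g(i, theta_tilde) + g_bar)
        theta_tilde = theta
    return theta_tilde
```

Note that TD steps move *along* the update direction rather than against a gradient, since the TD update is a gradient splitting rather than a gradient; otherwise the structure mirrors SVRG.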

2. PROBLEM FORMULATION

We consider a discounted reward Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $\mathcal{P} = \mathcal{P}(s'|s,a)_{s,s' \in \mathcal{S}, a \in \mathcal{A}}$ are the transition probabilities, $r = r(s, s')$ are the rewards and $\gamma \in [0, 1)$ is a discount factor. In this MDP the agent follows a policy $\pi$, which is a mapping $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$. Since the policy is fixed, for the remainder of the paper we consider the transition matrix $P$ defined by $P(s, s') = \sum_a \pi(s, a) \mathcal{P}(s'|s, a)$. We assume that the Markov process produced by this transition matrix is irreducible and aperiodic with stationary distribution $\mu_\pi$. The policy evaluation problem is to compute $V^\pi$, defined as $V^\pi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$.

Here $V^\pi$ is the value function, formally defined as the unique vector which satisfies $T^\pi V^\pi = V^\pi$, where $T^\pi$ is the Bellman operator, defined as $T^\pi V^\pi(s) = \sum_{s'} P(s, s') \left( r(s, s') + \gamma V^\pi(s') \right)$. The TD(0) method is defined as follows: one iteration performs a fixed-point update on a randomly sampled pair of states $s, s'$ with learning rate $\eta$:
$$V(s) \leftarrow V(s) + \eta \left( r(s, s') + \gamma V(s') - V(s) \right).$$
When the state space size $|\mathcal{S}|$ is large, tabular methods which update a value for every state $V(s)$ become impractical. For this reason linear approximation is often used: each state $s$ is represented by a feature vector $\phi(s) \in \mathbb{R}^d$, and the state value $V^\pi(s)$ is approximated by $V^\pi(s) \approx \phi(s)^T \theta$, where $\theta$ is a tunable parameter vector. A single TD update on a randomly sampled transition $s, s'$ becomes
$$\theta \leftarrow \theta + \eta g_{s,s'}(\theta) = \theta + \eta \left( r(s, s') + \gamma \phi(s')^T \theta - \phi(s)^T \theta \right) \phi(s),$$
where the second equation should be viewed as the definition of $g_{s,s'}(\theta)$. Our goal is to find a parameter vector $\theta^*$ such that the average update vector is zero: $\mathbb{E}_{s,s'}[g_{s,s'}(\theta^*)] = 0$. This expectation, also called the mean-path update $\bar{g}(\theta)$, can be written as
$$\bar{g}(\theta) = \sum_{s} \mu_\pi(s) \sum_{s'} P(s, s') \left( r(s, s') + \gamma \phi(s')^T \theta - \phi(s)^T \theta \right) \phi(s).$$
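For a small MDP with known transition matrix and stationary distribution, the mean-path update can be evaluated exactly; a sketch follows, assuming all quantities fit in memory (the function and variable names are illustrative).

```python
import numpy as np

def mean_path_update(theta, Phi, P, R, mu, gamma):
    """Exact mean-path TD update g_bar(theta) under linear approximation.

    Phi: (n, d) feature matrix, row s holds phi(s).
    P: (n, n) transition matrix, R: (n, n) rewards r(s, s'),
    mu: (n,) stationary distribution of P.
    """
    n, d = Phi.shape
    g_bar = np.zeros(d)
    v = Phi @ theta  # approximate values phi(s)^T theta
    for s in range(n):
        for s_next in range(n):
            td_err = R[s, s_next] + gamma * v[s_next] - v[s]
            g_bar += mu[s] * P[s, s_next] * td_err * Phi[s]
    return g_bar
```

The parameter vector sought by TD learning is precisely the root of this function: `mean_path_update(theta_star, ...)` returns the zero vector.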
Their results were discussed by Dalal et al. (2018) and Narayanan & Szepesvári (2017); Xu et al. (2020) reanalyzed their results and showed geometric convergence of the Variance Reduced Temporal Difference (VRTD) algorithm for both Markovian and i.i.d. sampling. The work of Du et al. (2017) directly applies SVRG and SAGA to a version of policy evaluation by transforming it into an equivalent convex-concave saddle-point problem. Since their algorithm uses two sets of parameters, in this paper we call it Primal-Dual SVRG (PD-SVRG).

