CLOSING THE GAP BETWEEN SVRG AND TD-SVRG WITH GRADIENT SPLITTING

Abstract

Temporal difference (TD) learning is a simple algorithm for policy evaluation in reinforcement learning. The performance of TD learning is affected by high variance, and it can be naturally enhanced with variance reduction techniques such as the Stochastic Variance Reduced Gradient (SVRG) method. Recently, multiple works have sought to fuse TD learning with SVRG to obtain a policy evaluation method with a geometric rate of convergence. However, the resulting convergence rate is significantly weaker than what is achieved by SVRG in the setting of convex optimization. In this work we utilize a recent interpretation of TD learning as the splitting of the gradient of an appropriately chosen function, thus simplifying the algorithm and fusing TD with SVRG. We prove a geometric convergence bound with a predetermined learning rate of 1/8, identical to the convergence bound available for SVRG in the convex setting.

1. INTRODUCTION

Reinforcement learning (RL) is a learning paradigm which addresses a class of problems in sequential decision-making environments. Policy evaluation is one of these problems: it consists of determining the expected reward an agent will achieve if it chooses actions according to a stationary policy. Temporal Difference learning (TD learning, Sutton (1988)) is a popular algorithm for this task, since it is simple and can be performed online on single samples or small mini-batches. TD learning uses the Bellman equation to bootstrap the estimation process, updating the value function from each incoming sample or minibatch. Like all methods in RL, TD learning suffers from the "curse of dimensionality" when the number of states is large. To address this issue, in practice a linear or nonlinear feature approximation of the state values is often used.

Despite its simple formulation, the theoretical analysis of approximate TD learning is subtle. There are a few important milestones in this line of work, one of which is the work of Tsitsiklis & Van Roy (1997), which established asymptotic convergence guarantees. More recently, advances were made by Bhandari et al. (2018), Srikant & Ying (2019) and Liu & Olshevsky (2020). In particular, the last paper shows that TD learning may be viewed as an instance of gradient splitting, a process analogous to gradient descent.

TD learning has an inherent variance problem: the variance of the update does not go to zero as the method converges. This problem is also present in the class of convex optimization problems where the target function is represented as a sum of functions and SGD-type methods (Robbins & Monro (1951)) are applied. Such methods proceed incrementally by sampling a single function, or a minibatch of functions, to use for stochastic gradient evaluations. A number of variance reduction techniques were developed to address this problem and speed up convergence, including SAG (Schmidt et al. (2013)), SVRG (Johnson & Zhang (2013)) and SAGA (Defazio et al. (2014)).
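Concretely, the basic TD(0) update with linear value-function approximation described above can be sketched as follows; the feature vectors and step size here are illustrative placeholders, not quantities from this paper:

```python
import numpy as np

def td0_update(theta, phi_s, phi_next, reward, gamma=0.99, alpha=0.1):
    """One TD(0) step with linear value approximation V(s) = phi(s)^T theta.

    Illustrative sketch: `phi_s` and `phi_next` are the feature vectors of
    the current and next state, and `alpha` is a placeholder step size.
    """
    # Bootstrapped one-step TD error: r + gamma * V(s') - V(s).
    td_error = reward + gamma * phi_next @ theta - phi_s @ theta
    # Move the weights along the feature direction, scaled by the TD error.
    return theta + alpha * td_error * phi_s
```

Because each step uses a single sampled transition, the update is cheap and can be applied online; the variance issue discussed below stems from this single-sample stochasticity.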
SAG, SVRG and SAGA are collectively known as variance-reduced gradient methods; their distinguishing feature is that they converge geometrically. The first attempt to adapt variance reduction to TD learning with online sampling was made by Korda & La (2015).
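For reference, the vanilla SVRG scheme of Johnson & Zhang (2013) that these works adapt to TD learning can be sketched as follows; the step size, epoch length and problem interface are illustrative choices, not this paper's:

```python
import numpy as np

def svrg(grad_i, theta0, n, step=0.01, epochs=20, inner=None, rng=None):
    """Sketch of SVRG for minimizing (1/n) * sum_i f_i(theta).

    `grad_i(theta, i)` returns the gradient of the i-th component f_i.
    Hyperparameters here are illustrative defaults.
    """
    rng = np.random.default_rng(rng)
    inner = inner or n
    theta = theta0.copy()
    for _ in range(epochs):
        snapshot = theta.copy()
        # Full gradient computed once per epoch at the snapshot point.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient: unbiased, and its
            # variance shrinks as theta approaches the snapshot.
            g = grad_i(theta, i) - grad_i(snapshot, i) + full_grad
            theta -= step * g
    return theta
```

The control-variate term `grad_i(theta, i) - grad_i(snapshot, i) + full_grad` is what drives the variance of the update to zero at the optimum, enabling the geometric convergence that plain SGD lacks.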
The results of Korda & La (2015) were discussed by Dalal et al. (2018) and Narayanan & Szepesvári (2017); Xu et al. (2020) reanalyzed these results and showed geometric convergence of the Variance Reduction Temporal Difference (VRTD) algorithm for both Markovian and i.i.d. sampling. The work of Du et al. (2017) directly applies SVRG and SAGA to a version of policy

