CLOSING THE GAP BETWEEN SVRG AND TD-SVRG WITH GRADIENT SPLITTING

Abstract

Temporal difference (TD) learning is a simple algorithm for policy evaluation in reinforcement learning. The performance of TD learning is affected by high variance, so it can be naturally enhanced with variance reduction techniques such as the Stochastic Variance Reduced Gradient (SVRG) method. Recently, multiple works have sought to fuse TD learning with SVRG to obtain a policy evaluation method with a geometric rate of convergence. However, the resulting convergence rates are significantly weaker than what is achieved by SVRG in the setting of convex optimization. In this work we utilize a recent interpretation of TD learning as the splitting of the gradient of an appropriately chosen function, thus simplifying the fusion of TD with SVRG and its analysis. We prove a geometric convergence bound with a predetermined learning rate of 1/8, identical to the convergence bound available for SVRG in the convex setting.

1. INTRODUCTION

Reinforcement learning (RL) is a learning paradigm which addresses a class of problems in sequential decision-making environments. Policy evaluation is one of those problems; it consists of determining the expected reward an agent will achieve if it chooses actions according to a stationary policy. Temporal Difference learning (TD learning, Sutton (1988)) is a popular algorithm, since it is simple and can be performed online on single samples or small mini-batches. The TD learning method uses the Bellman equation to bootstrap the estimation process and update the value function from each incoming sample or minibatch. Like all methods in RL, TD learning suffers from the "curse of dimensionality" when the number of states is large. To address this issue, in practice linear or nonlinear feature approximation of state values is often used. Despite its simple formulation, the theoretical analysis of approximate TD learning is subtle. There are a few important milestones in this process, one of which is the work of Tsitsiklis & Van Roy (1997), in which asymptotic convergence guarantees were established. More recently, advances were made by Bhandari et al. (2018), Srikant & Ying (2019) and Liu & Olshevsky (2020). In particular, the last paper shows that TD learning may be viewed as an example of gradient splitting, a process analogous to gradient descent. TD learning has an inherent variance problem: the variance of the update does not go to zero as the method converges. This problem is also present in the class of convex optimization problems where the target function is represented as a sum of functions and SGD-type methods are applied (Robbins & Monro (1951)). Such methods proceed incrementally by sampling a single function, or a minibatch of functions, to use for stochastic gradient evaluations. A few variance reduction techniques were developed to address this problem and speed up convergence, including SAG (Schmidt et al. (2013)), SVRG (Johnson & Zhang (2013)) and SAGA (Defazio et al. (2014)).
These methods are collectively known as variance-reduced gradient methods; their distinguishing feature is that they converge geometrically. The first attempt to adapt variance reduction to TD learning with online sampling was made by Korda & La (2015). Their results were discussed by Dalal et al. (2018) and Narayanan & Szepesvári (2017); Xu et al. (2020) reanalyzed their results and showed geometric convergence of the Variance Reduced Temporal Difference (VRTD) algorithm for both Markovian and i.i.d. sampling. The work of Du et al. (2017) directly applies SVRG and SAGA to a version of policy evaluation by transforming it into an equivalent convex-concave saddle-point problem; since their algorithm uses two sets of parameters, in this paper we call it Primal Dual SVRG, or PD SVRG. All these works obtained geometric convergence, improving on the sub-geometric convergence of standard TD methods. However, the convergence rates obtained in these papers are significantly worse than the convergence of SVRG in the convex setting. In particular, the resulting convergence times for policy evaluation scale with the square of the condition number, as opposed to SVRG, which retains the linear scaling in the condition number of SGD. Quadratic scaling makes practical application of the theoretically obtained values almost infeasible, since the number of computations becomes very large even for simple problems. Moreover, the convergence time bounds contain additional terms coming from the condition number of a matrix that diagonalizes some of the matrices appearing in the problem formulation, which can be arbitrarily large. In this paper we analyze the convergence of the SVRG technique applied to TD (TD-SVRG) in two settings: (i) a pre-sampled trajectory of the Markov Decision Process (MDP) (finite sampling), and (ii) when states are sampled directly from the MDP (online sampling).
Our contribution is threefold:

• For the finite sample case, we achieve significantly better results with a simpler analysis. We are the first to show that TD-SVRG has the same convergence rate as SVRG in the convex optimization setting, with a predetermined learning rate of 1/8.

• For i.i.d. online sampling, we similarly achieve better results with a simpler analysis, and again show that TD-SVRG matches the convergence rate of SVRG in the convex optimization setting with a predetermined learning rate of 1/8. In addition, for Markovian online sampling, we provide convergence guarantees that in most cases are better than state-of-the-art results.

• We are the first to develop theoretical guarantees for an algorithm that can be directly applied in practice. In previous works, the batch sizes required to guarantee convergence were so large that they were impractical (see Subsection H.1), and grid search was needed to tune the learning rate and batch size. We include experiments showing that our theoretically obtained batch size and learning rate can be applied in practice and achieve geometric convergence.

2. PROBLEM FORMULATION

We consider a discounted reward Markov Decision Process (MDP) (S, A, P, r, γ), where S is the state space, A is the action space, P = {P(s'|s, a)}_{s,s'∈S, a∈A} are the transition probabilities, r = r(s, s') are the rewards, and γ ∈ [0, 1) is the discount factor. In this MDP the agent follows a policy π, which is a mapping π : S × A → [0, 1]. Since the policy is fixed, for the remainder of the paper we consider the transition matrix P given by P(s, s') = Σ_a π(s, a)P(s'|s, a). We assume that the Markov process produced by this transition matrix is irreducible and aperiodic with stationary distribution μ^π. The policy evaluation problem is to compute V^π, defined as V^π(s) := E[Σ_{t=0}^∞ γ^t r_{t+1}]. Here V^π is the value function, formally defined to be the unique vector which satisfies the equality T^π V^π = V^π, where T^π is the Bellman operator, defined as T^π V^π(s) = Σ_{s'} P(s, s')(r(s, s') + γV^π(s')).

The TD(0) method is defined as follows: one iteration performs a fixed-point update on a randomly sampled pair of states s, s' with learning rate η:

V(s) ← V(s) + η(r(s, s') + γV(s') − V(s)).

When the state space size |S| is large, tabular methods which update a value V(s) for every state become impractical. For this reason linear approximation is often used. Each state s is represented by a feature vector φ(s) ∈ R^d, and the state value V^π(s) is approximated by V^π(s) ≈ φ(s)ᵀθ, where θ is a tunable parameter vector. A single TD update on a randomly sampled transition s, s' becomes

θ ← θ + ηg_{s,s'}(θ) = θ + η(r(s, s') + γφ(s')ᵀθ − φ(s)ᵀθ)φ(s),

where the second equality should be viewed as the definition of g_{s,s'}(θ). Our goal is to find the parameter vector θ* at which the average update vector is zero, E_{s,s'}[g_{s,s'}(θ*)] = 0. This expectation is also called the mean-path update ḡ(θ); its explicit form is given below Table 1.

Table 1: Comparison with prior work: PD SVRG and PD SAGA results are from Du et al. (2017), VRTD and TD results from Xu et al. (2020), and GTD2 results from Touati et al. (2018). λ_min(Q) and κ(Q) denote, respectively, the minimum eigenvalue and the condition number of a matrix Q; λ_A denotes the minimum eigenvalue of the matrix (A + Aᵀ)/2. Other notation is taken from the original papers. For simplicity, 1 + γ is upper-bounded by 2.

Finite sample setting:
    GTD2 — learning rate: diminishing, as in Touati et al. (2018); batch size: 1; total complexity: O(κ(Q)²Hd / (λ_min(G)ε)).
    PD SVRG — learning rate: λ_min(AᵀC⁻¹A) / (48κ(C)L_G²); batch size: 51κ²(C)L_G² / λ_min(AᵀC⁻¹A)²; total complexity: O((κ²(C)L_G² / λ_min(AᵀC⁻¹A)²) log(1/ε)).
    PD SAGA — learning rate: λ_min(AᵀC⁻¹A) / (3(8κ_C²L_G² + nμρ)); batch size: 1; total complexity: O((κ²(C)L_G² / λ_min(AᵀC⁻¹A)²) log(1/ε)).
    This paper — learning rate: 1/8; batch size: 16/λ_A; total complexity: O((1/λ_A) log(1/ε)).

i.i.d. sampling setting:
    TD — learning rate: min(λ_A/16, 2/λ_A); batch size: 1; total complexity: O((1/(ελ_A²)) log(1/ε)).
    VRTD — learning rate: λ_A/64; batch size: 132/λ_A²; total complexity: O(max(1/ε, 1/λ_A²) log(1/ε)).
    This paper — learning rate: 1/8; batch size: 16/λ_A; total complexity: O(max(1/ε, 1/λ_A) log(1/ε)).

The mean-path update can be written as

ḡ(θ) = E_{s,s'}[g_{s,s'}(θ)] = E_{s,s'}[(γφ(s')ᵀθ − φ(s)ᵀθ)φ(s)] + E_{s,s'}[r(s, s')φ(s)] := −Aθ + b,

where the last equality should be taken as the definition of A and b. Finally, the minimum eigenvalue of the matrix (A + Aᵀ)/2 plays an important role in our analysis and will be denoted λ_min. There are a few possible settings of the problem. The samples s, s' may be drawn from the MDP online (Markovian sampling) or independently (i.i.d. sampling): first the state s is drawn from μ^π, then s' is drawn from the corresponding row of P; the analysis of the latter case is covered in Section 6. Another possible setting is the "finite sample set" setting, in which a data set D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^N of size N is drawn ahead of time following Markovian sampling, and TD(0) proceeds by drawing samples from this data set. We analyze this case in Sections 4 and 5. We make the following standard assumptions:

Assumption 1. The matrix A is non-singular.

Assumption 2. ||φ(s)||² ≤ 1 for all s ∈ S.

Assumption 1 is needed to guarantee that A⁻¹b exists and the problem is solvable. Assumption 2 is introduced for simplicity; it can always be satisfied by rescaling the feature matrix.
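As a concrete illustration of the update above, here is a minimal NumPy sketch of a single linearly-approximated TD(0) step; the function name and argument layout are illustrative, not from the paper's code.

```python
import numpy as np

def td0_step(theta, phi_s, phi_s_next, r, gamma, eta):
    """One TD(0) update with linear function approximation:
    g_{s,s'}(theta) = (r(s,s') + gamma*phi(s')^T theta - phi(s)^T theta) * phi(s)."""
    td_error = r + gamma * phi_s_next @ theta - phi_s @ theta
    g = td_error * phi_s
    return theta + eta * g

# Two states with one-hot features; starting from theta = 0 the TD error
# is just the reward r, so theta moves by eta * r * phi(s).
theta = np.zeros(2)
theta = td0_step(theta, np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                 r=1.0, gamma=0.9, eta=0.5)
# theta is now [0.5, 0.0]
```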

2.1. KEY IDEA OF THE ANALYSIS

In our analysis we often use the function f(θ), defined as:

f(θ) = (θ − θ*)ᵀA(θ − θ*). (1)

The function f(θ) is a key characteristic function of TD learning. Liu & Olshevsky (2020) introduce f(θ) as (1 − γ)||V_θ − V_θ*||²_D + γ||V_θ − V_θ*||²_Dir. They then define gradient splitting (a linear function h(x) = B(x − a) is a gradient splitting of the quadratic function j(x) = (x − a)ᵀQ(x − a), where Q is a symmetric positive semi-definite matrix, if B + Bᵀ = 2Q) and show that the negation of the mean-path update, −ḡ(θ), is indeed a gradient splitting of f(θ). In this paper we do not use the fact that f(θ) can be represented as a weighted sum of the D-norm and the Dirichlet norm and, for convenience, define f(θ) through its gradient splitting properties. We rely on this interpretation of TD learning as a splitting of gradient descent throughout our analysis. Xu et al. (2020) note: "In Johnson & Zhang (2013), the convergence proof relies on the relationship between the gradient and the value of the objective function, but there is not such an objective function in the TD learning problem." Viewing TD learning as gradient splitting provides exactly such a relationship between the update and an objective function, which allows an analysis similar to Johnson & Zhang (2013) and yields stronger results. It also provides the objective function f(θ), which is a better measure of the distance to the optimal solution than ||θ − θ*||² and yields tighter convergence bounds.
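The key consequence of the splitting definition can be checked numerically. The sketch below uses an arbitrary well-conditioned matrix B (not one coming from an MDP) and verifies the identity the analysis relies on: for h(x) = B(x − a) with Q = (B + Bᵀ)/2, the inner product ⟨h(x), x − a⟩ equals j(x) exactly, because the antisymmetric part of B vanishes inside a quadratic form. In TD terms, ⟨−ḡ(θ), θ − θ*⟩ = f(θ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3x3 example: B is non-symmetric; shifting by 3*I keeps
# its symmetric part Q positive definite.
B = rng.normal(size=(3, 3)) + 3 * np.eye(3)
Q = (B + B.T) / 2          # B + B^T = 2Q, so h below is a splitting of j
a = rng.normal(size=3)

def j(x):
    # Quadratic target j(x) = (x - a)^T Q (x - a).
    return (x - a) @ Q @ (x - a)

def h(x):
    # Linear map h(x) = B(x - a); for TD, -g_bar(theta) = A(theta - theta*).
    return B @ (x - a)

x = rng.normal(size=3)
# Splitting identity used throughout the convergence proofs:
assert np.isclose(h(x) @ (x - a), j(x))
# For comparison, the true gradient gives <grad j(x), x - a> = 2 j(x).
assert np.isclose((2 * Q @ (x - a)) @ (x - a), 2 * j(x))
```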

3. THE TD-SVRG ALGORITHM

We next propose a modification of the TD(0) method (TD-SVRG) which can attain a geometric rate. The algorithm is given below as Algorithm 1. It works in the "fixed sample set" setting, which assumes there already exists a sampled data set D. This is the same setting as considered in Du et al. (2017). However, the method we propose does not add regularization and does not use dual parameters, which makes it considerably simpler.

Algorithm 1 TD-SVRG for the finite sample case

Parameters: update frequency M and learning rate η.
Initialize θ̃_0.
for m = 1, 2, …
    θ̃ = θ̃_{m−1}
    ḡ(θ̃) = (1/N) Σ_{(s,s')∈D} g_{s,s'}(θ̃), where g_{s,s'}(θ) = (r(s, s') + γφ(s')ᵀθ − φ(s)ᵀθ)φ(s)
    θ_0 = θ̃
    for t = 1, 2, …, M
        Randomly sample s, s' from the data set and compute the update vector v_t = g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ̃) + ḡ(θ̃)
        Update parameters θ_t = θ_{t−1} + ηv_t
    end
    Set θ̃_m = θ_t for a uniformly random t ∈ {0, …, M − 1}
end

Like the classic SVRG algorithm, our proposed TD-SVRG has two nested loops. We refer to one step of the outer loop as an epoch and one step of the inner loop as an iteration. TD-SVRG keeps two parameter vectors: the current parameter vector θ_t, which is updated every iteration, and the vector θ̃, which is updated at the end of each epoch. At the beginning of the outer loop, the mean-path TD update vector ḡ(θ̃) is computed with a pass through the entire data set. This vector is used in the inner loop to compute the local updates v_t = g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ̃) + ḡ(θ̃), where g_{s,s'}(θ_{t−1}) is a TD update computed on a uniformly sampled data point from D with the current parameter vector θ_{t−1}, and g_{s,s'}(θ̃) is the TD update computed on the same data point with θ̃. Each epoch ends with an update of the epoch vector, which is chosen uniformly at random from the parameter vectors produced during the epoch.
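The loop structure above can be sketched compactly in NumPy. This is a minimal illustration, not the authors' code: the data layout (tuples (s, r, s′)), the feature map `phi`, and the eigenvalue-based default for M (the paper's M = 16/λ_min, with A estimated from the data set) are assumptions made for the sketch.

```python
import numpy as np

def td_svrg(D, phi, gamma=0.95, eta=1/8, num_epochs=20, M=None, seed=0):
    """Sketch of TD-SVRG (Algorithm 1) on a finite data set of transitions."""
    rng = np.random.default_rng(seed)
    d = len(phi(D[0][0]))
    theta_tilde = np.zeros(d)

    def g(theta, s, r, s_next):
        # Single-sample TD update direction g_{s,s'}(theta).
        return (r + gamma * phi(s_next) @ theta - phi(s) @ theta) * phi(s)

    if M is None:
        # Paper's choice M = 16 / lambda_min, estimating A from the data set.
        A = sum(np.outer(phi(s), phi(s) - gamma * phi(sn)) for s, _, sn in D) / len(D)
        lam = np.linalg.eigvalsh((A + A.T) / 2).min()
        M = int(np.ceil(16 / lam))

    for _ in range(num_epochs):
        # Full pass: mean-path update at the anchor point theta_tilde.
        g_bar = sum(g(theta_tilde, *tr) for tr in D) / len(D)
        theta = theta_tilde.copy()
        iterates = [theta.copy()]
        for _ in range(M):
            s, r, s_next = D[rng.integers(len(D))]
            v = g(theta, s, r, s_next) - g(theta_tilde, s, r, s_next) + g_bar
            theta = theta + eta * v
            iterates.append(theta.copy())
        # New anchor: a uniformly random iterate theta_0 ... theta_{M-1}.
        theta_tilde = iterates[rng.integers(M)]
    return theta_tilde
```

On a small balanced data set the iterates approach θ* = A⁻¹b, the root of the mean-path update.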

4. CONVERGENCE ANALYSIS

In this section we show that, under simple assumptions, Algorithm 1 attains geometric convergence in terms of the specially chosen function f(θ), with η being O(1) and M being O(1/λ_min).

4.1. PRELIMINARIES

In order to analyze the convergence of the presented algorithm, we define the expected squared norm of the difference between TD updates at the current and optimal parameters as w(θ):

w(θ) = E_{s,s'} ||g_{s,s'}(θ) − g_{s,s'}(θ*)||². (2)

With this notation we state a technical lemma; all of our proofs are based on variations of it.

Lemma 1. If Assumptions 1 and 2 hold, the epoch parameters of two consecutive epochs m − 1 and m are related by the following inequality:

2ηM E[f(θ̃_m)] − 2Mη² E[w(θ̃_m)] ≤ E||θ̃_{m−1} − θ*||² + 2η²M E[w(θ̃_{m−1})], (3)

where the expectation is taken with respect to all previous epochs and the choices of states s, s' during epoch m.

Proof. The proof of the lemma generally follows the analysis logic in Johnson & Zhang (2013); it can be found in Appendix A.

Lemma 1 plays an auxiliary role in our analysis and significantly simplifies it. It introduces a new approach to the convergence proof by carrying the iteration-to-iteration and epoch-to-epoch bounds to the earlier part of the analysis. In particular, deriving bounds in terms of some arbitrary function u(θ) is now reduced to deriving upper bounds on ||θ̃_{m−1} − θ*||² and w(θ) and a lower bound on f(θ) in terms of the function u. In fact, the function f(θ) itself will play the role of u(θ) in our proof. In addition, it is now easy to demonstrate the point we made in Subsection 2.1. The main obstacle to a direct application of the SVRG convergence analysis to TD learning is that it requires the target function P(w) to be a sum of convex, L-smooth functions ψ_i(w), with P itself strongly convex (notation here is from Johnson & Zhang (2013)). The TD learning problem cannot be represented in this form. However, the main use of the smoothness and convexity properties in the original analysis is to derive a bound on the expected norm of the difference between updates at the current and optimal parameters. As will be shown later, in TD learning this expected norm (w(θ) in our notation) can still be bounded, even though a sum-of-convex-functions representation does not exist.

4.2. WARM-UP: CONVERGENCE IN TERMS OF SQUARED NORM

First, we derive a convergence bound in terms of ||θ − θ*||² and show that it is consistent with previous results.

Proposition 1. Suppose Assumptions 1 and 2 hold. If we choose the learning rate η = λ_min/32 and the number of inner loop iterations M = 32/λ_min², then Algorithm 1 has a convergence rate of E[||θ̃_m − θ*||²] ≤ (5/7)^m ||θ̃_0 − θ*||².

Proof. The proof is given in Appendix B.

Note that deriving a convergence rate in terms of the squared norm ||θ̃_m − θ*||² leads to an epoch length M of O(1/λ_min²). This is still better than the results of Du et al. (2017), whose complexity is O(κ²(C)κ_G²), where κ(C) is the condition number of the matrix C = E_{s∈D}[φ(s)φ(s)ᵀ] and κ_G ∝ 1/λ_min(AᵀC⁻¹A). An experimental comparison of these values is provided in Subsection H.1.

4.3. FIRST MAIN RESULT: CONVERGENCE IN TERMS OF f pθq

In this section we derive a bound in terms of f(θ). To keep the analysis simple and illustrative, we introduce one more assumption, which we relax in Appendix D.

Assumption 3 (Dataset Balance). In the data set D, the first state of the first sample and the second state of the last sample coincide, i.e. s_1 = s_{N+1}.

We need this assumption to remove dataset bias, so that the states s and s' have the same distribution, as in the original MDP. Under this assumption we can prove Theorem 1:

Theorem 1. Suppose Assumptions 1, 2 and 3 hold. If we choose the learning rate η = 1/8 and the number of inner loop iterations M = 16/λ_min, then Algorithm 1 has a convergence rate of E[f(θ̃_m)] ≤ (2/3)^m f(θ̃_0).

Note that θ̃_m refers to the iterate after m iterations of the outer loop. Because we choose the length M of the inner loop to be 16/λ_min, the total number of samples guaranteed by this theorem until E[f(θ̃_m)] ≤ ε is (16/λ_min) log(1/ε).

Proof of Theorem 1. The proof is given in Appendix C. The convergence analysis without Assumption 3 is provided in Appendix D.

4.4. SIMILARITY OF SVRG AND TD-SVRG

Liu & Olshevsky (2020) show that the negation of the mean-path update, −ḡ(θ), is a gradient splitting of f(θ). In this work we show an even greater importance of the function f(θ) for the TD learning process. Recall the convergence rate obtained in Johnson & Zhang (2013) for the sum-of-convex-functions setting:

1/(γη(1 − 2Lη)m) + 2Lη/(1 − 2Lη),

where γ is the strong convexity parameter and L is the Lipschitz smoothness parameter (employing the notation of the original paper). The function f(θ) = (θ − θ*)ᵀA(θ − θ*) is 2λ_min(A)-strongly convex and 2-Lipschitz smooth, which means that the convergence rate obtained in this paper is identical to the convergence rate of SVRG in the convex setting (we even obtain a slightly better bound, L in place of 2L, due to the stronger bound on w(θ) we derived for this setting). This fact further extends the analogy between TD learning and convex optimization explored earlier by Bhandari et al. (2018) and Liu & Olshevsky (2020).
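The correspondence can be checked numerically: plugging η = 1/8 and M = 16/λ_min into the epoch contraction factor derived in Appendix C, 1/(2λ_min·ηM(1 − 2η)) + 2η/(1 − 2η), gives exactly 2/3 regardless of the value of λ_min.

```python
# Each term contributes 1/3: with eta = 1/8 and M = 16/lam,
# 2*lam*eta*M = 4 and (1 - 2*eta) = 3/4, so the first term is 1/3,
# and 2*eta/(1 - 2*eta) = (1/4)/(3/4) = 1/3 as well.
for lam in (1e-4, 1e-2, 0.5):
    eta = 1 / 8
    M = 16 / lam
    rate = 1 / (2 * lam * eta * M * (1 - 2 * eta)) + 2 * eta / (1 - 2 * eta)
    assert abs(rate - 2 / 3) < 1e-12
```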

5. BATCHING SVRG CASE ANALYSIS

In this section we extend our results to inexact mean-path update computation, applying the results of Babanezhad et al. (2015) to the TD-SVRG algorithm. We show that a geometric convergence rate can be achieved with a smaller number of computations by estimating the mean-path TD update instead of computing it exactly. This approach is similar to Peng et al. (2019), but again does not require the introduction of dual variables. In addition, we provide an explicit way to compute n_m, which can be used in practice. Since the computation of the mean-path error is not related to dataset balance, in this section we assume for simplicity that the data set is balanced.

Theorem 2. Suppose Assumptions 1, 2 and 3 hold. If the learning rate is chosen as η = 1/8, the number of inner loop iterations as M = 16/λ_min, and the batch size as n_m = min(N, (N/((N − 1)cρ^{2m}))(2|r_max|² + 8||θ̃_m||²)), where c is a parameter, then Algorithm 2 has a convergence rate of E[f(θ̃_m)] ≤ ρ^m (f(θ̃_0) + C), where ρ ∈ (0, 1) is the convergence rate and C is some constant.

Algorithm 2 TD-SVRG with batching for the finite sample case

Parameters: update frequency M and learning rate η.
Initialize θ̃_0.
for m = 1, 2, …
    θ̃ = θ̃_{m−1}
    Choose batch size n_m, sample a batch D_m of size n_m from D without replacement, and compute μ = (1/n_m) Σ_{(s,s')∈D_m} g_{s,s'}(θ̃), where g_{s,s'}(θ) = (r(s, s') + γφ(s')ᵀθ − φ(s)ᵀθ)φ(s)
    θ_0 = θ̃
    for t = 1, 2, …, M
        Randomly sample s, s' from D and compute the update vector v_t = g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ̃) + μ
        Update parameters θ_t = θ_{t−1} + ηv_t
    end
    Set θ̃_m = θ_t for a uniformly random t ∈ {0, …, M − 1}
end

Proof. The proof is given in Appendix E.

This theorem shows that during early epochs an approximation of the mean-path update is good enough to guarantee geometric convergence. However, the batch size used for the approximation increases geometrically with each epoch at rate 1/ρ², where ρ is the desired convergence rate, until it reaches the size of the data set N. The constant C depends on the parameter c and on the upper bound Z = max ||θ − θ*||, where the max is taken over all parameter vectors seen during the run of the algorithm.
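The growth of the batch sizes n_m can be sketched as follows. This is a hypothetical helper written against the batch-size formula as rendered in Theorem 2; the argument names and the stand-in values for ||θ̃_m|| are illustrative.

```python
import numpy as np

def batch_schedule(N, rho, c, r_max, theta_norms):
    """Batch sizes n_m for Algorithm 2's mean-path estimate.

    Grows geometrically at rate 1/rho^2 per epoch until capped at the
    data set size N; theta_norms[m] stands in for ||theta_tilde_m||.
    """
    sizes = []
    for m, tn in enumerate(theta_norms):
        n_m = N / ((N - 1) * c * rho ** (2 * m)) * (2 * r_max**2 + 8 * tn**2)
        sizes.append(int(min(N, np.ceil(n_m))))
    return sizes

sizes = batch_schedule(N=50_000, rho=2/3, c=1.0, r_max=1.0, theta_norms=[1.0] * 8)
# With rho = 2/3 the batch grows by a factor (3/2)^2 = 2.25 per epoch.
assert all(a <= b for a, b in zip(sizes, sizes[1:]))
```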

6. SECOND MAIN RESULT: ONLINE I.I.D. SAMPLING FROM THE MDP

In this section we apply the gradient splitting view of TD learning to the case of online i.i.d. sampling from the MDP each time a new state s is needed, and show that this view again yields tighter convergence bounds. One issue for TD-SVRG in the i.i.d. setting is that the mean-path update cannot be computed directly. However, this can be addressed with the sampling technique introduced in Section 5, which makes the i.i.d. case very similar to TD-SVRG with inexact mean-path computation in the finite sample case. In this setting, geometric convergence is clearly not attainable with a variance reduction scheme that relies on a pass through the entire data set: here there is no data set, one samples from the MDP at every step, and a pass through all states of the MDP is unrealistic in practice. To obtain convergence, one needs to take increasing batch sizes. Our next theorem does so, while improving the scaling with the condition number from quadratic to linear. The TD-SVRG algorithm for the i.i.d. sampling case is very similar to Algorithm 2, the only difference being that the states s, s' are sampled from the MDP instead of the data set D. A formal definition of Algorithm 3 can be found in Appendix F.

Theorem 3. Suppose Assumptions 1 and 2 hold. If the learning rate is chosen as η = 1/8, the number of inner loop iterations as M = 16/λ_min, and the batch size as n_m = (1/(cρ^{2m}))(2|r_max|² + 8||θ̃||²), where c is an arbitrarily chosen constant, then Algorithm 3 has a convergence rate of E[f(θ̃_m)] ≤ ρ^m (f(θ̃_0) + C), where ρ ∈ (0, 1) is the convergence rate and C is some constant.

Proof. The proof is given in Appendix F.

To parse this, observe that, as in Section 5, each epoch requires n_m computations to estimate the mean-path update and 16/λ_min inner loop iterations.
Thus, as the epoch number m grows, n_m comes to dominate 16/λ_min, which results in a total computational complexity of O(max(1/ε, 1/λ_min) log(1/ε)). This is better than the O(max(1/ε, 1/λ_min²) log(1/ε)) shown by Xu et al. (2020) whenever λ_min > ε. In practice, λ_min is determined by the MDP and the feature matrix, while ε is the desired accuracy; the former is given and the latter can be chosen. In most scenarios λ_min is a small number and ε is chosen such that ε < λ_min. Even when this is not the case, λ_min² is a very small number and most likely ε < λ_min². Thus, in the absolute majority of cases the results shown in this paper are stronger. The same convergence result with a predetermined constant learning rate cannot be derived for the Markovian sampling case, in which the sampling strategy differs from that of classical SVRG. However, the gradient splitting interpretation of TD learning still allows us to achieve better convergence guarantees than previous works for the absolute majority of problems. The algorithm, discussion and convergence proof are provided in Appendix G.

7. EXPERIMENTS

7.1. FINITE SAMPLE EXPERIMENTS

We compare the algorithms on a Random MDP environment and on environments from OpenAI gym (Brockman et al., 2016). For Random MDP, we construct an MDP environment with |S| = 400 states, 21 features and 10 actions, with action-choice probabilities generated from U[0, 1). For the OpenAI gym environments, the agent selects states uniformly at random. Features are constructed by applying RBF kernels to the original states and then removing highly correlated features (correlation coefficient > 0.5). To produce data sets of similar sizes, we resampled a data set if the smallest eigenvalue of its matrix A was outside the interval [3.2, 5.4]·10⁻⁴, which corresponds to TD-SVRG batch sizes between 30000 and 50000. The decay rate γ is set to 0.95. Hyperparameters for the algorithms are selected as follows: for TD-SVRG, the theoretically justified parameters, learning rate η = 1/8 and number of inner loop iterations M = 16/λ_min; for GTD2, the parameters suggested for small problems, α = 1 and β = 1. For vanilla TD, the decreasing learning rates are set to α = 1/√t and α = 1/t. For PD-SVRG, setting the parameters to the theoretically suggested values is not feasible, since even for simple problems the required number of inner loop iterations M is too large (see Appendix Subsection H.1). Following the original paper (Du et al., 2017), we run a simple grid search and pick the best-performing values, which are σ_θ = σ_w = 0.1/λ_max(Ĉ). Results are presented in Figure 1. Each algorithm in each setting was run 10 times, and the average result is reported. As the theory predicts, TD-SVRG and PD-SVRG converge geometrically, while GTD and vanilla TD converge sublinearly.

7.2. I.I.D. SAMPLING EXPERIMENTS

In this set of experiments we compare the performance of TD-SVRG, VRTD and vanilla TD with fixed and decreasing learning rates in the i.i.d. sampling case. States and rewards are sampled from the same MDP as in Section 7.1. Hyperparameters are chosen as follows: for TD-SVRG, learning rate η = 1/8 and number of inner loop iterations M = 16/λ_min; for VRTD, learning rate α = 0.1 and batch size M = 2000. For vanilla TD with a constant learning rate, the value is set to 0.1; the decreasing learning rate is 1/t, where t is the number of updates performed. Average results over 10 runs are presented in Figure 2.

7.3. REPRODUCIBILITY

The authors provide a link to an anonymous GitHub repository with code and instructions on how to reproduce the experiments.

8. CONCLUSION

In this paper we utilize the view of TD learning as a splitting of gradient descent to show that the SVRG technique applied to TD updates attains the same convergence rate as SVRG in the convex function setting. Our analysis addresses both the finite sample and i.i.d. sampling cases, which were previously analyzed separately, and improves the state-of-the-art bounds in both. In addition, we show that the gradient splitting interpretation helps improve convergence guarantees in the Markovian sampling case. The algorithms based on our analysis have a fixed learning rate and a small number of inner loop iterations, are easy to implement, and demonstrate good performance in experiments.

A PROOF OF LEMMA 1

The proof follows the same logic as in Johnson & Zhang (2013) and is organized in four steps.

Step A.1. In the original paper, the proof starts by deriving a bound on the expected squared norm of the difference between updates at the current and optimal parameters. With the introduction of w(θ), this step in our proof is trivial:

E_{s,s'} ||g_{s,s'}(θ) − g_{s,s'}(θ*)||² = w(θ).

Step A.2. We derive a bound on the norm of a single iteration-t update v_t = g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ̃) + ḡ(θ̃), where the states s, s' are sampled randomly at step t (recall that ḡ(θ*) = 0):

E_{s,s'}[||v_t||²] = E||g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ̃) + ḡ(θ̃)||²
= E_{s,s'}||(g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ*)) + (g_{s,s'}(θ*) − g_{s,s'}(θ̃) + ḡ(θ̃))||²
≤ 2E_{s,s'}||g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ*)||² + 2E_{s,s'}||g_{s,s'}(θ̃) − g_{s,s'}(θ*) − (ḡ(θ̃) − ḡ(θ*))||²
= 2E_{s,s'}||g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ*)||² + 2E_{s,s'}||g_{s,s'}(θ̃) − g_{s,s'}(θ*) − E_{s,s'}[g_{s,s'}(θ̃) − g_{s,s'}(θ*)]||²
≤ 2E_{s,s'}||g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ*)||² + 2E_{s,s'}||g_{s,s'}(θ̃) − g_{s,s'}(θ*)||²
= 2w(θ_{t−1}) + 2w(θ̃).

The first inequality uses E||a + b||² ≤ 2E||a||² + 2E||b||². The second inequality uses the fact that a second central moment is smaller than the second moment. The last equality uses the identity from Step A.1.

Step A.3. We derive a bound on the expected squared distance to the optimal parameter vector after a single update t:

E_{s,s'}||θ_t − θ*||² = E_{s,s'}||θ_{t−1} − θ* + ηv_t||²
= ||θ_{t−1} − θ*||² + 2η(θ_{t−1} − θ*)ᵀE[v_t] + η²E||v_t||²
≤ ||θ_{t−1} − θ*||² + 2η(θ_{t−1} − θ*)ᵀḡ(θ_{t−1}) + 2η²w(θ_{t−1}) + 2η²w(θ̃)
= ||θ_{t−1} − θ*||² − 2ηf(θ_{t−1}) + 2η²w(θ_{t−1}) + 2η²w(θ̃),

where the inequality uses the bound obtained in Step A.2 and the last equality uses the gradient splitting property (θ_{t−1} − θ*)ᵀḡ(θ_{t−1}) = −f(θ_{t−1}). After rearranging terms this becomes:

E||θ_t − θ*||² + 2ηf(θ_{t−1}) − 2η²w(θ_{t−1}) ≤ ||θ_{t−1} − θ*||² + 2η²w(θ̃).

Step A.4. We sum the inequality obtained in Step A.3 over the epoch, take expectations with respect to all choices of the pairs of states s, s' and all previous history, and use the random choice of the epoch iterate to obtain inequality (3), which relates the parameter vectors of two consecutive epochs:

Σ_{t=1}^M E||θ_t − θ*||² + Σ_{t=1}^M 2ηE[f(θ_{t−1})] − Σ_{t=1}^M 2η²E[w(θ_{t−1})] ≤ Σ_{t=1}^M E||θ_{t−1} − θ*||² + Σ_{t=1}^M 2η²E[w(θ̃)].

We analyze this expression term-wise. Denote by θ̃_m the parameter vector chosen for the epoch parameters at the end of the epoch. Since this vector is chosen uniformly at random among the iterates θ_0, …, θ_{M−1}, we have Σ_{t=1}^M E[f(θ_{t−1})] = M E[f(θ̃_m)] and Σ_{t=1}^M E[w(θ_{t−1})] = M E[w(θ̃_m)]. The sums Σ_{t=1}^M E||θ_t − θ*||² and Σ_{t=1}^M E||θ_{t−1} − θ*||² telescope, leaving E||θ_M − θ*||² ≥ 0 on the left and E||θ_0 − θ*||² on the right. At the same time, θ̃, which was chosen at the end of the previous epoch, remains fixed throughout the epoch, so Σ_{t=1}^M E[w(θ̃)] = M E[w(θ̃)]. Note that the current epoch starts by setting θ_0 = θ̃; to underline that it comes from the previous epoch, we denote it θ̃_{m−1}. Plugging these values in, we obtain (3):

2ηM E[f(θ̃_m)] − 2Mη² E[w(θ̃_m)] ≤ E||θ̃_{m−1} − θ*||² + 2η²M E[w(θ̃_{m−1})].

B PROOF OF PROPOSITION 1

To transform inequality (3) from Lemma 1 into a convergence rate guarantee, we need to bound w(θ) and f(θ) in terms of ||θ − θ*||². Both bounds are easy to show:

w(θ) = E_{s,s'}||g_{s,s'}(θ) − g_{s,s'}(θ*)||²
= E_{s,s'}[((γφ(s') − φ(s))ᵀ(θ − θ*))² ||φ(s)||²]
≤ E_{s,s'}[||γφ(s') − φ(s)||² · ||φ(s)||²] ||θ − θ*||²
≤ 4||θ − θ*||²,

f(θ) = (θ − θ*)ᵀ E_{s,s'}[φ(s)(φ(s) − γφ(s'))ᵀ](θ − θ*) ≥ λ_min ||θ − θ*||².

Plugging these bounds into inequality (3), we have:

(2ηMλ_min − 8Mη²)||θ̃_m − θ*||² ≤ (1 + 8Mη²)||θ̃_{m−1} − θ*||²,

which yields an epoch-to-epoch convergence rate of (1 + 8Mη²)/(2ηMλ_min − 8Mη²). For this expression to be < 1, we need ηM to be O(1/λ_min), which means that η needs to be O(λ_min) for Mη² to be O(1). Therefore, M needs to be O(1/λ_min²). Setting η = λ_min/32 and M = 32/λ_min² yields a convergence rate of 5/7.

C PROOF OF THEOREM 1

As in the previous section, we start by deriving bounds, but this time we bound ||θ − θ*||² and w(θ) in terms of f(θ). The first bound is straightforward:

f(θ) = (θ − θ*)ᵀ E_{s,s'}[φ(s)(φ(s) − γφ(s'))ᵀ](θ − θ*) ⟹ ||θ − θ*||² ≤ (1/λ_min) f(θ).

For w(θ) we have:

w(θ) = (θ − θ*)ᵀ E_{s,s'}[(γφ(s') − φ(s))φ(s)ᵀφ(s)(γφ(s') − φ(s))ᵀ](θ − θ*)
= (θ − θ*)ᵀ [ (1/N) Σ_{(s,s')∈D} (γφ(s') − φ(s))φ(s)ᵀφ(s)(γφ(s') − φ(s))ᵀ ](θ − θ*)
≤ (θ − θ*)ᵀ [ (1/N) Σ_{(s,s')∈D} (γφ(s') − φ(s))(γφ(s') − φ(s))ᵀ ](θ − θ*)
= (θ − θ*)ᵀ [ (1/N) Σ_{(s,s')∈D} (γ²φ(s')φ(s')ᵀ − γφ(s')φ(s)ᵀ) ](θ − θ*) + f(θ)
= (θ − θ*)ᵀ [ (1/N) Σ_{(s,s')∈D} (γ²φ(s)φ(s)ᵀ − γφ(s)φ(s')ᵀ) ](θ − θ*) + f(θ)
≤ 2f(θ). (4)

The first inequality uses Assumption 2; the third equality uses Assumption 3 (Σ_{s'} γ²φ(s')φ(s')ᵀ = Σ_s γ²φ(s)φ(s)ᵀ, since s and s' run over the same set of states); the last inequality uses γ < 1. Plugging these bounds into inequality (3), we have:

2ηM E[f(θ̃_m)] − 4Mη² E[f(θ̃_m)] ≤ (1/λ_min) E[f(θ̃_{m−1})] + 4η²M E[f(θ̃_{m−1})],

which yields an epoch-to-epoch convergence rate of:

E[f(θ̃_m)] ≤ [ 1/(2λ_min ηM(1 − 2η)) + 2η/(1 − 2η) ] E[f(θ̃_{m−1})].

Setting η = 1/8 and M = 16/λ_min, we obtain the desired inequality.

D CONVERGENCE ANALYSIS WITHOUT DATASET BALANCE

In this section we show a convergence bound for the problem without Assumption 3. In this case the problem is that, after sampling, s and s' in the data set do not have the same distribution; i.e., the first elements of the tuples (s_t, a_t, r_t, s_{t+1}) in our data set D need not have the same distribution as the last elements. Indeed, it could happen that a particular state occurs a different number of times as the first element of the tuples in D than as the last element, which would not happen under Assumption 3. When this happens, we say that the data set is unbalanced. In that case, −ḡ(θ) need not be a gradient splitting of the function f(θ). One might hope that, when the size N of the data set is large, this effect has an impact which decays to zero with N. Our next result shows something even stronger: the effect of unbalancedness disappears completely for large N, so the theorem below completely recovers the performance attained by Theorem 1 for large N. The catch is that the size of the data set has to be at least as large as λ_min⁻¹ for this to happen.

Theorem 4. Suppose Assumptions 1 and 2 hold and the data set is unbalanced. Define the error term J = 4γ²/(Nλ_min). Then, if we choose the learning rate η = 1/(8 + J) and the number of inner loop iterations M = 2/(λ_min η), Algorithm 1 has a convergence rate of E[f(θ̃_m)] ≤ (2/3)^m f(θ̃_0).

Proof. The proof is given in Appendix D.1.

Note that in this case η ∈ O(1/max(1, 1/(Nλ_min))) and M ∈ O(1/(λ_min η)), which is always better than the parameters required to guarantee convergence of ||θ − θ*||² (Proposition 1). Note also that the guarantees of this theorem are identical to the guarantees of Theorem 1 in the unbalanced case whenever N ≥ λ_min⁻¹.

E PROOF OF THEOREM 3

In the first part of the proof we derive an inequality which relates the model parameters of two consecutive epochs, similar to what we achieved in the previous proofs, but with an error term. In this part of the proof we follow the same four-step logic as in the proof of Lemma 1. In the second part of the proof we show that there are conditions under which the error term converges to 0.

Step E.1. During the first step we use the bound obtained in Inequality 4: w(θ) ≤ 2f(θ).

Step E.2. During this step we derive a bound on the squared norm of a single update, E[||v_t||²]. Compared to the previous case, we do not compute the exact mean-path update ḡ(θ̃) but only an estimate of it, and we assume our computation has error e: μ = ḡ(θ̃) + e. Thus the single update vector will be

    v_t = g(θ_{t−1}) − g(θ̃) + ḡ(θ̃) + e.

The bound on the single update can be derived as:

    E||v_t||² = E||g(θ_{t−1}) − g(θ̃) + ḡ(θ̃) + e||²
    ≤ 2E||g(θ_{t−1}) − g(θ*)||² + 2E||g(θ̃) − g(θ*) − (ḡ(θ̃) − ḡ(θ*)) − e||²
    = 2E||g(θ_{t−1}) − g(θ*)||² + 2E||g(θ̃) − g(θ*) − E[g(θ̃) − g(θ*)] − e||²
    = 2E||g(θ_{t−1}) − g(θ*)||² + 2E||g(θ̃) − g(θ*) − E[g(θ̃) − g(θ*)]||²
      − 4E⟨g(θ̃) − g(θ*) − E[g(θ̃) − g(θ*)], e⟩ + 2E||e||²
    ≤ 2E||g(θ_{t−1}) − g(θ*)||² + 2E||g(θ̃) − g(θ*)||² + 2E||e||²
    = 2w(θ_{t−1}) + 2w(θ̃) + 2E||e||²
    ≤ 4f(θ_{t−1}) + 4f(θ̃) + 2E||e||²,

where the first inequality uses E||A + B||² ≤ 2E||A||² + 2E||B||², the second inequality uses E||A − E[A]||² ≤ E||A||² together with the fact that the inner-product term vanishes (its first factor has zero mean), and the third inequality uses the result of Step E.1. Step E.3.
During this step we derive a bound on the vector norm after a single update:

    E||θ_t − θ*||² = E||θ_{t−1} − θ* − ηv_t||²
    = ||θ_{t−1} − θ*||² − 2η(θ_{t−1} − θ*)ᵀ E[v_t] + η² E||v_t||²
    ≤ ||θ_{t−1} − θ*||² − 2η(θ_{t−1} − θ*)ᵀ ḡ(θ_{t−1}) − 2η(θ_{t−1} − θ*)ᵀ e
      + 4η²f(θ_{t−1}) + 4η²f(θ̃) + 2η²E||e||².

Rearranging terms we obtain:

    E||θ_t − θ*||² + 2ηf(θ_{t−1}) − 4η²f(θ_{t−1})
    ≤ ||θ_{t−1} − θ*||² + 4η²f(θ̃) − 2η(θ_{t−1} − θ*)ᵀ e + 2η²E||e||²
    ≤ ||θ_{t−1} − θ*||² + 4η²f(θ̃) + 2η||θ_{t−1} − θ*||·||e|| + 2η²E||e||².

Step E.4. We now derive a bound on the epoch update. We assume that the quantity ||θ_{t−1} − θ*|| can be bounded by a constant Z, and we denote the error term from the previous epoch by e_{m−1}. We use the same logic as in the proof of Theorem 1. Since the error term does not change over the epoch, summing over the epoch we have:

    E||θ_M − θ*||² + 2ηM E f(θ̃_m) − 4η²M E f(θ̃_m)
    ≤ E||θ_0 − θ*||² + 4η²M E f(θ̃) + 2MηZ E||e_{m−1}|| + 2η²M E||e_{m−1}||².

Rearranging terms we have the bound:

    E f(θ̃_m) ≤ [ 1/(2λ_min ηM(1 − 2η)) + 2η/(1 − 2η) ] E f(θ̃_{m−1})
               + (1/(1 − 2η)) ( Z E||e_{m−1}|| + η E||e_{m−1}||² ).

To obtain convergence, we need to guarantee geometric convergence of the first and the second term in the sum separately. The first term depends on the inner-loop updates; its convergence is analyzed in Theorem 1. Here we show how to achieve a similar geometric convergence rate for the second term. Since the error term has zero mean and we are in the finite-sample case, its expected squared norm can be bounded by:

    E||e_m||² ≤ ((N − n_m)/(N n_m)) S² = (1 − n_m/N) S²/n_m ≤ S²/n_m,

where S² is a bound on the update-vector norm variance. If we want the error to be bounded by cρ^{2m}, we need the number of batch computations n_m to satisfy the condition:

    n_m ≥ S²/(c ρ^{2m}).

Satisfying this condition guarantees that the second term converges geometrically:

    (1/(1 − 2η)) ( Z E||e_{m−1}|| + η E||e_{m−1}||² ) ≤ (2/(1 − 2η)) max(Z√c, ηcρ) ρ^m.

It only remains to derive the bound S² on the update-vector norm sample variance:

    (1/(N − 1)) Σ_{s,s'} ||g_{s,s'}(θ) − ḡ(θ)||²
    ≤ (N/(N − 1)) (1/N) Σ_{s,s'} ||g_{s,s'}(θ)||²
    = (N/(N − 1)) (1/N) Σ_{s,s'} ||(r(s,s') + γφ(s')ᵀθ − φ(s)ᵀθ) φ(s)||²
    ≤ (N/(N − 1)) (1/N) Σ_{s,s'} [ 2||r(s,s')φ(s)||² + 4||γφ(s')ᵀθ φ(s)||² + 4||φ(s)ᵀθ φ(s)||² ]
    ≤ (N/(N − 1)) ( 2|r_max|² + 4γ²||θ||² + 4||θ||² )
    ≤ (N/(N − 1)) ( 2|r_max|² + 8||θ||² ) = S².

F PROOF OF THEOREM 4

The TD-SVRG algorithm for the i.i.d. sampling case is described as Algorithm 3. The proof of its convergence is very similar to Appendix E; the only difference is that expectations are now taken with respect to the MDP instead of a fixed dataset. Repeating the chain of inequalities from Step E.2 with the new expectation yields:

    E||v_t||² = E||g(θ_{t−1}) − g(θ̃) + ḡ(θ̃) + e||²
    ≤ 2E||g(θ_{t−1}) − g(θ*)||² + 2E||g(θ̃) − g(θ*)||² + 2E||e||²
    = 2w(θ_{t−1}) + 2w(θ̃) + 2E||e||²
    ≤ 4f(θ_{t−1}) + 4f(θ̃) + 2E||e||²,

where, exactly as before, the first inequality uses E||A + B||² ≤ 2E||A||² + 2E||B||² and E||A − E[A]||² ≤ E||A||², and the last inequality uses the result of Step F.1.

Step F.3. The bound on the vector norm after a single update is obtained exactly as in Step E.3:

    E||θ_t − θ*||² + 2ηf(θ_{t−1}) − 4η²f(θ_{t−1})
    ≤ ||θ_{t−1} − θ*||² + 4η²f(θ̃) + 2η||θ_{t−1} − θ*||·||e|| + 2η²E||e||².

Step F.4. We now derive a bound on the epoch update.
We assume that the quantity ||θ_{t−1} − θ*|| can be bounded by a constant Z and denote the error term from the previous epoch by e_{m−1}. Using the same logic as in the proof of Theorem 1, and since the error term does not change over the epoch, summing over the epoch we have:

    E||θ_M − θ*||² + 2ηM E f(θ̃_m) − 4η²M E f(θ̃_m)
    ≤ E||θ_0 − θ*||² + 4η²M E f(θ̃) + 2MηZ E||e_{m−1}|| + 2η²M E||e_{m−1}||².

Rearranging terms we have the bound:

    E f(θ̃_m) ≤ [ 1/(2λ_min ηM(1 − 2η)) + 2η/(1 − 2η) ] E f(θ̃_{m−1})
               + (1/(1 − 2η)) ( Z E||e_{m−1}|| + η E||e_{m−1}||² ).

Similarly to Appendix C, convergence of the first term is obtained by setting the learning rate η = 1/8 and the number of inner-loop iterations M = 16/λ_min. To guarantee convergence of the second term, we need to bound E||e_m||². In the infinite-population, with-replacement case the norm of the error vector is bounded by:

    E||e_m||² ≤ S²/n_m.

Turning to the Markovian sampling case (used in the proof of Theorem 5 below), the error terms can be bounded by slightly modified lemmas from the original papers. For ζ(θ), we apply the bound from Lemma 11 in Bhandari et al. (2018): |E[ζ_t(θ)]| ≤ G²(4 + 6τ_mix(η))η. In the original lemma a bound on E[ζ_t(θ)] is stated; however, in the proof a bound on the absolute value of the expectation is also derived. For the mean-path estimation error term, we use a modified version of Lemma 1 in Xu et al. (2020). The proof of this lemma in the original paper starts by applying the inequality aᵀb ≤ (k/2)||a||² + (1/(2k))||b||² to the expression (θ − θ*)ᵀ(μ(θ) − ḡ(θ)), with k = λ_A/2 (using the notation of Xu et al. (2020)). For the purposes of our proof we use k = λ_min. Thus, we will have the expression:

    E[ξ_m(θ)] ≤ (λ_min/2) E[ ||θ − θ*||² | s_{m−1} ] + (4(1 + (m − 1)ρ))/(λ_min(1 − ρ) n_m) [4R² + r_max²]
             = (λ_min/2) E[ ||θ − θ*||² | s_{m−1} ] + C/(λ_min n_m).

Also, note that the term E||v_t||² can be bounded as E||v_t||² ≤ 18R².
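The growing batch-size schedule implied by the condition n_m ≥ S²/(cρ^{2m}) can be sketched as follows (S², c and ρ below are illustrative values of ours, not constants from the paper); each choice makes the error bound S²/n_m decay at the target geometric rate:

```python
import math

def batch_size(m: int, S2: float, c: float, rho: float) -> int:
    # smallest integer n_m satisfying n_m >= S^2 / (c * rho^(2m))
    return math.ceil(S2 / (c * rho ** (2 * m)))

S2, c, rho = 10.0, 1.0, 0.8
for m in range(1, 6):
    n_m = batch_size(m, S2, c, rho)
    # the resulting error bound E||e_m||^2 <= S^2 / n_m sits below c * rho^(2m)
    assert S2 / n_m <= c * rho ** (2 * m)
    print(m, n_m)
```

The batch size grows by a factor of 1/ρ² per epoch, which is what makes the total sample count a geometric series rather than a product of epoch counts.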
Plugging bounds 8 and 9 into 7, we obtain:

    E||θ_t − θ*||² ≤ ||θ_{t−1} − θ*||² − 2ηf(θ_{t−1}) + 4η²G²(4 + 6τ_mix(η))
                    + 2η( (λ_min/2)||θ̃ − θ*||² + C/(λ_min n_m) ) + 18η²R².

Summing the inequality over the epoch and taking the expectation with respect to all previous history, we have:

    2ηM E[f(θ̃_s)] ≤ ||θ̃_{s−1} − θ*||² + 2ηM( (λ_min/2)||θ̃_{s−1} − θ*||² + C/(λ_min n_m) )
                    + η²M( 4G²(4 + 6τ_mix(η)) + 18R² ).

We then divide both sides by 2ηM and use ||θ̃_{s−1} − θ*||² ≤ (1/λ_min) f(θ̃_{s−1}) to obtain:

    E[f(θ̃_s)] ≤ ( 1/(2λ_min ηM) + 1/2 ) f(θ̃_{s−1}) + C/(λ_min n_m) + η( 2G²(4 + 6τ_mix(η)) + 9R² ).

We choose η and M such that ηMλ_min = 2, so the contraction factor is 1/4 + 1/2 = 3/4. We then apply this inequality to the value of the function f in the first term on the right-hand side recursively, which yields the desired result:

    E[f(θ̃_s)] ≤ (3/4)^s f(θ_0) + 8C/(λ_min n_m) + 4η( 2G²(4 + 6τ_mix(η)) + 9R² ).

H DETAILS ON ALGORITHM IMPLEMENTATION

H.1 COMPARISON OF THEORETICAL BATCH SIZES

In this subsection we compare the batch sizes which are theoretically required to guarantee convergence. We compare the batch sizes of three algorithms: TD-SVRG, PD-SVRG (Du et al. (2017)) and VRTD (Xu et al. (2020)). Note that PD-SVRG and VRTD are algorithms for different settings, but for TD-SVRG the batch size value is the same in both settings, 16/λ_min; thus, we compare against the two algorithms in the same tables. We compare the batch sizes required by each algorithm for three MDPs: the first with 50 states, 20 actions and γ = 0.8; the second with 400 states, 10 actions and γ = 0.95; the third with 1000 states, 20 actions and γ = 0.99, with action-choice probabilities generated from U[0, 1) (similar to the MDP used for the experiments in Subsections 7.1 and 7.2). Since the batch size depends on the smallest eigenvalue of the matrix A, which in turn depends on the dimensionality of the feature vector, we perform the comparison for different feature vector sizes: 5, 10, 20 and 40 randomly generated features plus 1 constant feature for each state. We generate 10 datasets and environments for each feature size. Our results are summarized in Tables 2, 3 and 4.

In this set of experiments we compare the performance of TD-SVRG and batched TD-SVRG in the finite-sample case. We generate 10 datasets of size 50000 from an MDP similar to the one in Section 7.1. The algorithms are run with the same hyperparameters. Average results over 10 runs are presented in Figure 3 and show that batched TD-SVRG saves a substantial amount of computation during the earlier epochs, which provides faster convergence.
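To illustrate how these batch sizes arise, the following toy computation (ours, not the paper's experiment code) builds the matrix A = Φᵀ D (I − γP) Φ for a small two-state chain with tabular features, takes λ_min as the smallest eigenvalue of its symmetric part, and reports the theory-suggested TD-SVRG batch size 16/λ_min:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                 # transition matrix of a 2-state chain
gamma = 0.9
Phi = np.eye(2)                            # tabular features

# stationary distribution: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()                         # here mu = (2/3, 1/3)

A = Phi.T @ np.diag(mu) @ (np.eye(2) - gamma * P) @ Phi
lam_min = np.linalg.eigvalsh((A + A.T) / 2).min()
print("lam_min =", lam_min, "-> batch size", int(np.ceil(16.0 / lam_min)))
```

Even for this tiny chain λ_min is small (≈ 0.05), so the required batch size is in the hundreds; larger discount factors and feature dimensions drive it higher, which is what the tables below quantify.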



Figure 1: Average performance of different algorithms in the finite-sample case. Columns correspond to the dataset source environments: MDP, Acrobot, CartPole and Mountain Car. Rows correspond to the performance measures: log(f(θ)) and log(||θ − θ*||).

Figure 2: Average performance of TD-SVRG, VRTD and vanilla TD in the i.i.d. sampling case. "TD-decr" refers to vanilla TD with decreasing learning rate, "TD-const" to vanilla TD with constant learning rate. Left figure: performance in terms of log(f(θ)); right figure: performance in terms of log(||θ − θ*||).

Figure 3: Average performance of TD-SVRG and batched TD-SVRG in the finite-sample case. Datasets are sampled from MDP environments. Left figure: performance in terms of log(f(θ)); right figure: performance in terms of log(||θ − θ*||).

Algorithm parameter comparison. PD-SVRG and PD-SAGA results are reported from Du et al. (2017).

The sums of E||θ_t − θ*||² on the two sides of the inequality consist of the same terms, except for the first term in the first sum and the last term in the last sum, which are E||θ_0 − θ*||² and E||θ_M − θ*||². Since E||θ_M − θ*||² is always positive and appears on the left-hand side of the inequality, we can drop it.

Table 2: Comparison of theory-suggested batch sizes for an MDP with 50 states, 20 actions and γ = 0.8. Values in the first row are feature vector dimensionalities; values in the other rows give the batch size of the corresponding method. Values are averages over 10 generated datasets and environments.

Table 3: Comparison of theory-suggested batch sizes for an MDP with 400 states, 10 actions and γ = 0.95. Values in the first row are feature vector dimensionalities; values in the other rows give the batch size of the corresponding method. Values are averages over 10 generated datasets and environments.

Table 4: Comparison of theory-suggested batch sizes for an MDP with 1000 states, 20 actions and γ = 0.99. Values in the first row are feature vector dimensionalities; values in the other rows give the batch size of the corresponding method. Values are averages over 10 generated datasets and environments.

D.1 PROOF OF THEOREM 4

To prove the theorem we follow the same strategy as in Appendix C. For f(θ) we can use the same bound:

    f(θ) = (θ − θ*)ᵀ E_{φ,φ'}[φ(φ − γφ')ᵀ](θ − θ*)  ⟹  ||θ − θ*||² ≤ (1/λ_min) f(θ).

The bound for w(θ) is a little more difficult:

    w(θ) = (θ − θ*)ᵀ [ (1/N) Σ_{s,s'∈D} (γφ(s') − φ(s)) φ(s)ᵀ φ(s) (γφ(s') − φ(s))ᵀ ] (θ − θ*)
         ≤ (θ − θ*)ᵀ [ (1/N) Σ_{s,s'∈D} (γφ(s') − φ(s)) (γφ(s') − φ(s))ᵀ ] (θ − θ*)
         = (θ − θ*)ᵀ [ (1/N) Σ_{s,s'∈D} γφ(s')(γφ(s') − φ(s))ᵀ − φ(s)(γφ(s') − φ(s))ᵀ ] (θ − θ*)
         = (θ − θ*)ᵀ [ (1/N) Σ_{s,s'∈D} γ²φ(s')φ(s')ᵀ − γφ(s')φ(s)ᵀ ] (θ − θ*) + f(θ).

The first inequality follows from the assumption on the norms of the feature vectors. The third equality is obtained by adding and subtracting (γ²/N)(θ − θ*)ᵀ φ(s')φ(s')ᵀ (θ − θ*). The second inequality uses the fact that γ² < 1. We denote the maximum eigenvalue of the matrix φ(s_{N+1})φ(s_{N+1})ᵀ − φ(s_1)φ(s_1)ᵀ by K (note that K ≤ 1). Plugging the resulting bounds into Equation 3 and rearranging yields the epoch-to-epoch convergence rate. To achieve a constant convergence rate of, for example, 2/3, we set η so that η(2 + γ²/(Nλ_min)) = 1/4; the second term is then equal to 1/3 and η = 1/(8 + 4γ²/(Nλ_min)) = 1/(8 + J). Then, to make the first term equal to 1/3, we set M = 2/(λ_min η), as in the statement of Theorem 4.

Algorithm 3 TD-SVRG for the i.i.d. sampling case
Parameters: update frequency M, learning rate η, batch size n_m.
Initialize θ̃_0.
for m = 1, 2, . . .
    Set θ̃ = θ̃_{m−1}; sample a batch D_m of size n_m; compute μ = (1/n_m) Σ_{s,s'∈D_m} g_{s,s'}(θ̃), where g_{s,s'}(θ) = (r(s,s') + γφ(s')ᵀθ − φ(s)ᵀθ)φ(s); set θ_0 = θ̃.
    for t = 1, 2, . . . , M
        Randomly sample s, s' and compute the update vector v_t = g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ̃) + μ.
        Update the parameters: θ_t = θ_{t−1} − ηv_t.
    end
    Set θ̃_m = θ_t for a randomly chosen t ∈ {0, . . . , M − 1}.
end

Step F.1.
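A minimal runnable sketch of the structure of Algorithm 3 (our own illustration on a synthetic two-state dataset, not the paper's code; we write the inner update as θ ← θ + ηv so that v is the TD direction g_{s,s'}(θ) = (r + γφ(s')ᵀθ − φ(s)ᵀθ)φ(s)):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, eta, M = 0.9, 1 / 8, 200
Phi = np.eye(2)                               # tabular features, ||phi|| <= 1

def g(theta, s, s_next, r):
    # single-sample TD direction g_{s,s'}(theta)
    td_err = r + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    return td_err * Phi[s]

# synthetic dataset of transitions (s, s', r) from a sticky 2-state chain
S = rng.integers(0, 2, size=2000)
S_next = np.where(rng.random(2000) < 0.8, S, 1 - S)
R = (S == 0).astype(float)                    # reward 1 when in state 0

theta_tilde = np.zeros(2)
for epoch in range(30):
    # full-batch mean-path update at the anchor point theta_tilde
    mu = np.mean([g(theta_tilde, s, sn, r)
                  for s, sn, r in zip(S, S_next, R)], axis=0)
    theta = theta_tilde.copy()
    for _ in range(M):
        i = rng.integers(len(S))
        v = g(theta, S[i], S_next[i], R[i]) \
            - g(theta_tilde, S[i], S_next[i], R[i]) + mu
        theta = theta + eta * v               # variance-reduced TD step
    theta_tilde = theta
print(theta_tilde)                            # close to the TD fixed point
```

With variance reduction the iterates settle near the empirical TD fixed point at the constant learning rate η = 1/8; plain stochastic TD at the same constant rate would keep fluctuating because its update variance does not vanish at the fixed point.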
During the first step we use the bound obtained during the proof of Theorem 1:

    w(θ) = (θ − θ*)ᵀ E_{s,s'}[(γφ(s') − φ(s)) φ(s)ᵀ φ(s) (γφ(s') − φ(s))ᵀ](θ − θ*)
         = (θ − θ*)ᵀ [ Σ_{s,s'} μ^π(s)P(s,s') (γφ(s') − φ(s)) φ(s)ᵀ φ(s) (γφ(s') − φ(s))ᵀ ] (θ − θ*)
         ≤ (θ − θ*)ᵀ [ Σ_{s,s'} μ^π(s)P(s,s') (γφ(s') − φ(s)) (γφ(s') − φ(s))ᵀ ] (θ − θ*)
         = (θ − θ*)ᵀ [ Σ_{s,s'} μ^π(s)P(s,s') ( γ²φ(s')φ(s')ᵀ − γφ(s')φ(s)ᵀ ) ] (θ − θ*) + f(θ)
         ≤ 2f(θ).

The first inequality uses Assumption 2; the third equality uses the fact that μ^π is the stationary distribution of the chain (so that Σ_{s,s'} μ^π(s)P(s,s') φ(s')φ(s')ᵀ = Σ_s μ^π(s) φ(s)φ(s)ᵀ); the last inequality uses the fact that γ < 1.

Step F.2. During this step we derive a bound on the squared norm of a single update, E[||v_t||²]. As in Appendix E, we assume inexact computation of the mean-path update, μ = ḡ(θ̃) + e. Thus the single update vector becomes v_t = g(θ_{t−1}) − g(θ̃) + ḡ(θ̃) + e, and its norm is bounded, exactly as in Step E.2, by E||v_t||² ≤ 4f(θ_{t−1}) + 4f(θ̃) + 2E||e||². The expected squared norm of the mean-path estimation error is again bounded by E||e_m||² ≤ S²/n_m, where S² is a bound on the update-vector norm variance. If we want the error to be bounded by cρ^{2m}, we need the number of batch computations n_m to satisfy n_m ≥ S²/(cρ^{2m}); satisfying this condition guarantees that the second term converges geometrically. Similarly to Appendix E, the bound S² on the sample variance can be derived in the same way.

G MARKOVIAN SAMPLING CASE: ALGORITHM AND ANALYSIS

The Markovian sampling case is the hardest to analyse due to its dependence on the MDP properties, which makes establishing bounds on the various quantities used in the proof much harder. Applying the gradient splitting view helps to improve over existing bounds, but the derived algorithm loses the nice property of a constant learning rate. To deal with sample-to-sample dependencies we utilize one more assumption often used in the literature:

Assumption 4.
The considered MDP is irreducible and aperiodic, and there exist constants m > 0 and ρ ∈ (0, 1) such that

    sup_{s∈S} d_TV( P(s_t ∈ · | s_0 = s), π ) ≤ mρ^t,  ∀t ≥ 0,

where d_TV(P, Q) denotes the total-variation distance between the probability measures P and Q.

Another tool we need to employ is projection, which helps to bound the update vector v. Following Bhandari et al. (2018) and Xu et al. (2020), after each iteration we project the parameter vector onto a ball of radius R, denoted Π_R(θ) = argmin_{θ': ||θ'|| ≤ R} ||θ − θ'||₂. We assume that ||θ*|| ≤ R; a choice of R which guarantees this can be found in Bhandari et al. (2018), Section 8.2. Adding the projection results in Algorithm 4.

Algorithm 4 TD-SVRG with batching for the Markovian sampling case
Parameters: update frequency M, learning rate η, projection radius R and batch sizes n_m.
Initialize θ̃_0.
for m = 1, 2, . . .
    Set θ̃ = θ̃_{m−1}; sample a trajectory D_m of length n_m; compute μ = (1/n_m) Σ_{s,s'∈D_m} g_{s,s'}(θ̃), where g_{s,s'}(θ) = (r(s,s') + γφ(s')ᵀθ − φ(s)ᵀθ)φ(s); set θ_0 = θ̃.
    for t = 1, 2, . . . , M
        Randomly sample s, s' and compute the update vector v_t = g_{s,s'}(θ_{t−1}) − g_{s,s'}(θ̃) + μ.
        Update the parameters: θ_t = Π_R(θ_{t−1} − ηv_t).
    end
    Set θ̃_m = θ_t for a randomly chosen t ∈ {0, . . . , M − 1}.
end

Convergence guarantees for Algorithm 4 are given in Theorem 5.

Theorem 5. Suppose Assumptions 1, 2 and 4 hold. Then the output of Algorithm 4 satisfies:

    E[f(θ̃_s)] ≤ (3/4)^s f(θ_0) + 8C/(λ_min n_m) + 4η( 2G²(4 + 6τ_mix(η)) + 9R² ),

where C = (4(1 + (m − 1)ρ)/(1 − ρ)) [4R² + r_max²].

Proof. The proof is given in Appendix G.1.

Theorem 5 implies that if we choose s = O(log(1/ε)), n_m = O(1/(λ_min ε)), η = O(ε/log(1/ε)) and M = O(log(1/ε)/(ελ_min)), the total sample complexity to achieve accuracy ε is:

    O( log²(1/ε) / (ελ_min) ).

In most practical applications this result is better than O( (1/(ελ²_min)) log(1/ε) ), since log(1/ε) < 1/λ_min for practical values of ε and λ_min.
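The projection Π_R is simply Euclidean projection onto the ball of radius R; a minimal sketch (ours, not the paper's code):

```python
import numpy as np

def project_ball(theta: np.ndarray, R: float) -> np.ndarray:
    # Pi_R(theta) = argmin_{||x|| <= R} ||theta - x||: rescale iff outside the ball
    norm = np.linalg.norm(theta)
    return theta if norm <= R else (R / norm) * theta

theta = np.array([3.0, 4.0])                  # norm 5
print(project_ball(theta, R=2.0))             # rescaled to norm 2: [1.2, 1.6]
print(project_ball(theta, R=10.0))            # already inside: unchanged
```

Within Algorithm 4 this is applied after every inner-loop step, θ_t = Π_R(θ_{t−1} − ηv_t), which keeps ||θ_t|| ≤ R and hence makes bounds such as E||v_t||² ≤ 18R² possible under Markovian sampling.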

G.1 PROOF OF THEOREM 5

In the Markovian sampling case we cannot simply apply Lemma 1; due to the high estimation bias, the bounds on f(θ) and w(θ) will not be derived from the current value of θ, but from global constraints on the updates guaranteed by applying the projection.

First, we analyse a single iteration at step t of epoch m, during which we apply the update vector v_t = g_t(θ) − g_t(θ̃) + μ(θ̃), where the expectation below is taken with respect to the s, s' sampled during iteration t. Recall that under Markovian sampling E[g_t(θ_{t−1})] ≠ ḡ(θ_{t−1}), and that for the expectation of the estimated mean-path update we have E[μ(θ̃) | s_{m−1}] ≠ ḡ(θ̃), where s_{m−1} is the last state of epoch m − 1. To tackle this issue, we follow the approach introduced in previous works (Bhandari et al. (2018); Xu et al. (2020)) and rewrite the expectation as a sum of the mean-path update and error terms. Similarly to Bhandari et al. (2018), we denote the error term of a single update by ζ: ζ_t(θ) = (θ − θ*)ᵀ(g_t(θ) − ḡ(θ)). For the error term on the trajectory we follow Xu et al. (2020) and denote it by ξ_m(θ).

