GREEDY-GQ WITH VARIANCE REDUCTION: FINITE-TIME ANALYSIS AND IMPROVED COMPLEXITY

Abstract

Greedy-GQ is a value-based reinforcement learning (RL) algorithm for optimal control. Recently, the finite-time analysis of Greedy-GQ has been developed under linear function approximation and Markovian sampling, and the algorithm is shown to achieve an $\epsilon$-stationary point with a sample complexity in the order of $O(\epsilon^{-3})$. Such a high sample complexity is due to the large variance induced by the Markovian samples. In this paper, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy optimal control. In particular, the algorithm applies the SVRG-based variance reduction scheme to reduce the stochastic variance of the two time-scale updates. We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling and show that the algorithm achieves a much smaller bias and variance error than the original Greedy-GQ. In particular, we prove that VR-Greedy-GQ achieves an improved sample complexity that is in the order of $O(\epsilon^{-2})$. We further compare the performance of VR-Greedy-GQ with that of Greedy-GQ in various RL experiments to corroborate our theoretical findings.

1. INTRODUCTION

In reinforcement learning (RL), an agent interacts with a stochastic environment following a certain policy and receives some reward, and it aims to learn an optimal policy that yields the maximum accumulated reward Sutton & Barto (2018). In particular, many RL algorithms have been developed to learn the optimal control policy, and they have been widely applied to various practical applications such as finance, robotics, computer games and recommendation systems Mnih et al. (2015; 2016); Silver et al. (2016); Kober et al. (2013). Conventional RL algorithms such as Q-learning Watkins & Dayan (1992) and SARSA Rummery & Niranjan (1994) have been well studied and their convergence is guaranteed in the tabular setting. However, it is known that these algorithms may diverge in the popular off-policy setting under linear function approximation Baird (1995); Gordon (1996). To address this issue, the two time-scale Greedy-GQ algorithm was developed in Maei et al. (2010) for learning the optimal policy. This algorithm extends the efficient gradient temporal difference (GTD) algorithms for policy evaluation Sutton et al. (2009b) to policy optimization. In particular, the asymptotic convergence of Greedy-GQ to a stationary point has been established in Maei et al. (2010). More recently, Wang & Zou (2020) studied the finite-time convergence of Greedy-GQ under linear function approximation and Markovian sampling, and it is shown that the algorithm achieves an $\epsilon$-stationary point of the objective function with a sample complexity in the order of $O(\epsilon^{-3})$. Such an undesirably high sample complexity is caused by the large variance induced by the Markovian samples queried from the dynamic environment. Therefore, we want to ask the following question.

• Q1: Can we develop a variance reduction scheme for the two time-scale Greedy-GQ algorithm?
In fact, many recent works have proposed to apply the variance reduction techniques developed in the stochastic optimization literature to reduce the variance of various TD learning algorithms for policy evaluation, e.g., Du et al. (2017); Peng et al. (2019); Korda & La (2015); Xu et al. (2020). Some other works applied variance reduction techniques to Q-learning algorithms, e.g., Wainwright (2019); Jia et al. (2020). Hence, it is much desired to develop a variance-reduced Greedy-GQ algorithm for optimal control. In particular, as many of the existing variance-reduced RL algorithms have been shown to achieve an improved sample complexity under variance reduction, it is natural to ask the following fundamental question.

• Q2: Can variance-reduced Greedy-GQ achieve an improved sample complexity under Markovian sampling?

In this paper, we provide affirmative answers to these fundamental questions. Specifically, we develop a two time-scale variance reduction scheme for the Greedy-GQ algorithm by leveraging the SVRG scheme Johnson & Zhang (2013). Moreover, under linear function approximation and Markovian sampling, we prove that the proposed variance-reduced Greedy-GQ algorithm achieves an $\epsilon$-stationary point with an improved sample complexity $O(\epsilon^{-2})$. We summarize our technical contributions as follows.

1.1. OUR CONTRIBUTIONS

We develop a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for optimal control in reinforcement learning. Specifically, the algorithm leverages the SVRG variance reduction scheme Johnson & Zhang (2013) to construct variance-reduced stochastic updates for updating the parameters in both time-scales.

We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling in the off-policy setting. Specifically, we show that VR-Greedy-GQ achieves an $\epsilon$-stationary point of the objective function $J$ (i.e., $\|\nabla J(\theta)\|^2 \le \epsilon$) with a sample complexity in the order of $O(\epsilon^{-2})$. Such a complexity result improves that of the original Greedy-GQ Wang & Zou (2020) by a significant factor of $O(\epsilon^{-1})$. In particular, our analysis shows that the bias error caused by the Markovian sampling and the variance error of the stochastic updates are in the order of $O(M^{-1})$ and $O(\eta_\theta M^{-1})$, respectively, where $\eta_\theta$ is the learning rate and $M$ corresponds to the batch size of the SVRG reference batch update. This shows that the proposed variance reduction scheme can significantly reduce the bias and variance errors of the original Greedy-GQ update (by a factor of $M$) and lead to an improved overall sample complexity.

The analysis logic of VR-Greedy-GQ partly follows that of the conventional SVRG, but requires substantial new technical developments. Specifically, we must address the following challenges. First, VR-Greedy-GQ involves two time-scale variance-reduced updates that are correlated with each other. Such an extension of the SVRG scheme to the two time-scale updates is novel and requires new technical developments. Specifically, we need to develop tight variance bounds for the two time-scale updates under Markovian sampling. Second, unlike the convex objective functions of the conventional GTD type of algorithms, the objective function of VR-Greedy-GQ is generally non-convex due to the non-stationary target policy. Hence, we need to develop new techniques to characterize the per-iteration optimization progress towards a stationary point under nonconvexity. In particular, to analyze the two time-scale variance reduction updates of the algorithm, we introduce a 'fine-tuned' Lyapunov function of the form $R_t^m = J(\theta_t^{(m)}) + c_t \|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2$, where the parameter $c_t$ is fine-tuned to cancel the additional quadratic terms $\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2$ that are implicitly involved in the tracking error terms. The design of this special Lyapunov function is critical to establish the formal convergence of the algorithm.
With these technical developments, we are able to establish an improved finite-time convergence rate and sample complexity for VR-Greedy-GQ.

1.2. RELATED WORK

Q-learning and SARSA with function approximation. The asymptotic convergence of Q-learning and SARSA under linear function approximation was established in Melo et al. (2008); Perkins & Precup (2003), and their finite-time analyses were developed in Zou et al. (2019); Chen et al. (2019). However, these algorithms may diverge in off-policy training Baird (1995). Also, recent works focused on the Markovian setting. Various analysis techniques have been developed to analyze the finite-time convergence of TD/Q-learning under Markovian samples. Specifically, Wang et al. (2020) developed a multi-step Lyapunov analysis for addressing the biasedness of the stochastic approximation in Q-learning. Srikant & Ying (2019) developed a drift analysis for the linear stochastic approximation problem. Besides linear function approximation, the finite-time analysis of Q-learning under neural network function approximation was developed in Xu & Gu (2019).

GTD algorithms. The GTD2 and TDC algorithms were developed for off-policy TD learning. Their asymptotic convergence was proved in Sutton et al. (2009a; b); Yu (2017), and their finite-time analyses were developed recently in Dalal et al. (2018); Wang et al. (2017); Liu et al. (2015); Gupta et al. (2019); Xu et al. (2019). The Greedy-GQ algorithm is an extension of these algorithms to optimal control and involves nonlinear updates.

RL with variance reduction. Variance reduction techniques have been applied to various RL algorithms. In TD learning, Du et al. (2017) reformulated the MSPBE problem as a convex-concave saddle-point optimization problem and applied SVRG Johnson & Zhang (2013) and SAGA Defazio et al. (2014) to a primal-dual batch gradient algorithm. In Korda & La (2015), the variance-reduced TD algorithm was introduced for solving the MSPBE problem, and later Xu et al. (2020) provided a correct non-asymptotic analysis for this algorithm over Markovian samples.
Recently, some other works applied the SVRG, SARAH Nguyen et al. (2017) and SPIDER Fang et al. (2018) variance reduction techniques to develop variance-reduced Q-learning algorithms, e.g., Wainwright (2019); Jia et al. (2020). In these works, the TD or TDC algorithms are in the form of linear stochastic approximation, and Q-learning has only a single time-scale update. In comparison, our VR-Greedy-GQ takes nonlinear two time-scale updates to optimize a nonconvex MSPBE.

2. PRELIMINARIES: POLICY OPTIMIZATION AND GREEDY-GQ

In this section, we review some preliminaries of reinforcement learning and recap the Greedy-GQ algorithm under linear function approximation.

2.1. POLICY OPTIMIZATION IN REINFORCEMENT LEARNING

In reinforcement learning, an agent takes actions to interact with the environment via a Markov Decision Process (MDP). Specifically, an MDP is specified by the tuple $(S, A, P, r, \gamma)$, where $S$ and $A$ respectively correspond to the state and action spaces that include finite elements, $r : S \times A \times S \to [0, +\infty)$ denotes a reward function and $\gamma \in (0, 1)$ is the associated reward discount factor. At any time $t$, assume that the agent is in the state $s_t \in S$ and takes a certain action $a_t \in A$ following a stationary policy $\pi$, i.e., $a_t \sim \pi(\cdot|s_t)$. Then, at the subsequent time $t + 1$, the current state of the agent transfers to a new state $s_{t+1}$ according to the transition kernel $P(\cdot|s_t, a_t)$. At the same time, the agent receives a reward $r_t = r(s_t, a_t, s_{t+1})$ from the environment for this action-state transition. To evaluate the quality of a given policy $\pi$, we often use the action-state value function $Q^\pi : S \times A \to \mathbb{R}$ that accumulates the discounted rewards as follows:

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[r(s, a, s') + \gamma V^\pi(s')\big],$$

where $V^\pi(s)$ is the state value function defined as $V^\pi(s) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t \,\big|\, s_0 = s\big]$. In particular, define the Bellman operator $T^\pi$ such that $T^\pi Q(s, a) = \mathbb{E}_{s',a'}\big[r(s, a, s') + \gamma Q(s', a')\big]$ for any $Q(s, a)$, where $a' \sim \pi(\cdot|s')$. Then, $Q^\pi(s, a)$ is a fixed point of $T^\pi$, i.e.,

$$T^\pi Q^\pi(s, a) = Q^\pi(s, a), \quad \forall s, a. \qquad (1)$$

The goal of policy optimization is to learn the optimal policy $\pi^*$ that maximizes the expected total reward $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t \,\big|\, s_0 = s\big]$ for any initial state $s \in S$, and this is equivalent to learning the optimal value function $Q^*(s, a) = \sup_\pi Q^\pi(s, a)$, $\forall s, a$. In particular, $Q^*$ is a fixed point of the Bellman operator $T$ that is defined as $T Q(s, a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[r(s, a, s') + \gamma \max_{b \in A} Q(s', b)\big]$.
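As a small illustration of the fixed-point characterization above, the following sketch applies the Bellman operator $T$ repeatedly on a toy tabular MDP. The random transition kernel, the reward table, the sizes and the function name `bellman_optimality` are all hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np

def bellman_optimality(Q, P, r, gamma):
    """Apply (TQ)(s,a) = E_{s'~P(.|s,a)}[ r(s,a,s') + gamma * max_b Q(s',b) ].

    P and r have shape (S, A, S); Q has shape (S, A).
    """
    v_next = Q.max(axis=1)  # max_b Q(s', b) for every next state s'
    return np.einsum("sap,sap->sa", P, r + gamma * v_next[None, None, :])

# Hypothetical toy MDP (assumption: random kernel and rewards in [0, 1)).
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize each P(.|s,a) to a distribution
r = rng.random((S, A, S))

# T is a gamma-contraction in the sup norm, so iterating it converges to Q*.
Q = np.zeros((S, A))
for _ in range(500):
    Q = bellman_optimality(Q, P, r, gamma)

# Sup-norm Bellman residual; it should be numerically zero at the fixed point.
residual = np.abs(bellman_optimality(Q, P, r, gamma) - Q).max()
```

This tabular iteration is only feasible for small state and action spaces, which is the limitation that motivates function approximation.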

2.2. GREEDY-GQ WITH LINEAR FUNCTION APPROXIMATION

The Greedy-GQ algorithm is inspired by the fixed point characterization in eq. (1), and in the tabular setting it aims to minimize the Bellman error $\|T^\pi Q^\pi - Q^\pi\|^2_{\mu_{s,a}}$. Here, $\|\cdot\|^2_{\mu_{s,a}}$ is induced by the state-action stationary distribution $\mu_{s,a}$ (induced by the behavior policy $\pi_b$), and is defined as $\|Q\|^2_{\mu_{s,a}} = \mathbb{E}_{(s,a)\sim\mu_{s,a}}[Q(s, a)^2]$. In practice, the state and action spaces may include a large number of elements, which makes the tabular approach infeasible. To address this issue, function approximation techniques are widely applied. In this paper, we consider approximating the state-action value function $Q(s, a)$ by a linear function. Specifically, consider a set of basis functions $\{\phi^{(i)} : S \times A \to \mathbb{R},\ i = 1, 2, \ldots, d\}$, each of which maps a given state-action pair to a certain value. Define $\phi_{s,a} = [\phi^{(1)}(s, a); \ldots; \phi^{(d)}(s, a)]$ as the feature vector for $(s, a)$. Then, under linear function approximation, the value function $Q(s, a)$ is approximated by $Q_\theta(s, a) = \phi_{s,a}^\top \theta$, where $\theta \in \mathbb{R}^d$ denotes the parameter of the linear approximation. Consequently, Greedy-GQ aims to find the optimal $\theta^*$ that minimizes the following mean squared projected Bellman error (MSPBE):

$$\text{(MSPBE):} \quad J(\theta) := \frac{1}{2}\big\|\Pi T^{\pi_\theta} Q_\theta - Q_\theta\big\|^2_{\mu_{s,a}}, \qquad (2)$$

where $\mu_{s,a}$ is the stationary distribution induced by the behavior policy $\pi_b$, and $\Pi$ is a projection operator that maps an action-value function $Q$ to the space $\mathcal{Q}$ spanned by the feature vectors, i.e., $\Pi Q = \arg\min_{U \in \mathcal{Q}} \|U - Q\|_{\mu_{s,a}}$. Moreover, the policy $\pi_\theta$ is parameterized by $\theta$. In this paper, we consider the class of Lipschitz and smooth policies (see Assumption 4.2).

Next, we introduce the Greedy-GQ algorithm. Define $V_{s'}(\theta) = \sum_{a' \in A} \pi_\theta(a'|s')\,\phi_{s',a'}^\top \theta$, $\delta_{s,a,s'}(\theta) = r(s, a, s') + \gamma V_{s'}(\theta) - \phi_{s,a}^\top \theta$ and denote $\hat\phi_{s'}(\theta) = \nabla V_{s'}(\theta)$. Then, the gradient of the objective function $J(\theta)$ in eq. (2) is expressed as

$$\nabla J(\theta) = -\mathbb{E}\big[\delta_{s,a,s'}(\theta)\phi_{s,a}\big] + \gamma\,\mathbb{E}\big[\hat\phi_{s'}(\theta)\phi_{s,a}^\top\big]\,\omega^*(\theta),$$

where $\omega^*(\theta) = \mathbb{E}\big[\phi_{s,a}\phi_{s,a}^\top\big]^{-1}\mathbb{E}\big[\delta_{s,a,s'}(\theta)\phi_{s,a}\big]$.
To address the double-sampling issue when estimating the product of expectations involved in $\mathbb{E}[\hat\phi_{s'}(\theta)\phi_{s,a}^\top]\,\omega^*(\theta)$, Sutton et al. (2009a) apply a weight doubling trick and construct the following two time-scale update rule for the Greedy-GQ algorithm: for every $t = 0, 1, 2, \ldots$, sample $(s_t, a_t, r_t, s_{t+1})$ using the behavior policy $\pi_b$ and do

$$\text{(Greedy-GQ):} \quad \begin{cases} \theta_{t+1} = \theta_t - \eta_\theta\big(-\delta_{t+1}(\theta_t)\phi_t + \gamma(\omega_t^\top \phi_t)\,\hat\phi_{t+1}(\theta_t)\big), \\ \omega_{t+1} = \omega_t - \eta_\omega\big(\phi_t^\top \omega_t - \delta_{t+1}(\theta_t)\big)\phi_t, \\ \pi_{\theta_{t+1}} = \mathcal{P}(\phi^\top \theta_{t+1}), \end{cases} \qquad (3)$$

where $\eta_\theta, \eta_\omega > 0$ are the learning rates and we denote $\delta_{t+1}(\theta) := \delta_{s_t,a_t,s_{t+1}}(\theta)$, $\phi_t := \phi_{s_t,a_t}$, $\hat\phi_{t+1}(\theta_t) := \hat\phi_{s_{t+1}}(\theta_t)$ for simplicity. To elaborate, the first two steps correspond to the two time-scale updates for updating the value function $Q_\theta$, whereas the last step is a policy improvement operation that exploits the updated value function to improve the target policy, e.g., greedy, $\epsilon$-greedy, softmax and mellowmax Asadi & Littman (2017).

The above Greedy-GQ algorithm uses a single Markovian sample to perform the two time-scale updates in each iteration. Such stochastic Markovian sampling often induces a large variance that significantly slows down the overall convergence. This motivates us to develop variance reduction schemes for the two time-scale Greedy-GQ in the next section.
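The update rule in eq. (3) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a softmax target policy with a hypothetical temperature `tau`, a feature table `phi` of shape (S, A, d), and made-up function names. Under the softmax assumption, the gradient of $V_{s'}(\theta)$ has the closed form used in `value_and_grad`.

```python
import numpy as np

def softmax_policy(phi_s, theta, tau=1.0):
    """pi_theta(a|s) proportional to exp(phi_{s,a}^T theta / tau); phi_s has shape (A, d)."""
    z = phi_s @ theta / tau
    p = np.exp(z - z.max())  # subtract the max for numerical stability
    return p / p.sum()

def value_and_grad(phi_s, theta, tau=1.0):
    """Return V_s(theta) = sum_a pi_theta(a|s) phi_{s,a}^T theta and its gradient."""
    pi = softmax_policy(phi_s, theta, tau)
    q = phi_s @ theta
    v = pi @ q
    mean_phi = pi @ phi_s
    # Product rule: sum_a pi_a phi_a + sum_a q_a * grad(pi_a),
    # with grad(pi_a) = pi_a (phi_a - mean_phi) / tau for the softmax policy.
    grad = mean_phi + ((pi * q) @ phi_s - v * mean_phi) / tau
    return v, grad

def greedy_gq_step(theta, omega, sample, phi, eta_theta, eta_omega, gamma, tau=1.0):
    """One two time-scale Greedy-GQ update from a single sample (s, a, r, s')."""
    s, a, rew, s_next = sample
    phi_t = phi[s, a]
    v_next, phi_hat = value_and_grad(phi[s_next], theta, tau)
    delta = rew + gamma * v_next - phi_t @ theta          # delta_{t+1}(theta_t)
    G = -delta * phi_t + gamma * (omega @ phi_t) * phi_hat
    H = (phi_t @ omega - delta) * phi_t
    # Policy improvement is implicit here: the softmax policy is recomputed from theta.
    return theta - eta_theta * G, omega - eta_omega * H

# Hypothetical toy inputs for a single update.
rng = np.random.default_rng(0)
S, A, d = 4, 3, 5
phi = rng.normal(size=(S, A, d))
theta, omega = rng.normal(size=d), np.zeros(d)
theta_new, omega_new = greedy_gq_step(theta, omega, (0, 1, 1.0, 2), phi, 0.01, 0.1, 0.9)
```

Both parameters are updated from the same pair $(\theta_t, \omega_t)$, which is why the function returns the two new values together rather than updating one in place before the other.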

3. GREEDY-GQ WITH VARIANCE REDUCTION

In this section, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm under Markovian sampling by leveraging the SVRG variance reduction scheme Johnson & Zhang (2013). To simplify notations, we define the stochastic updates regarding a sample $x_t = (s_t, a_t, r_t, s_{t+1})$ used in the Greedy-GQ as follows:

$$G_{x_t}(\theta, \omega) := -\delta_{t+1}(\theta)\phi_t + \gamma(\omega^\top \phi_t)\,\hat\phi_{t+1}(\theta), \qquad H_{x_t}(\theta, \omega) := \big(\phi_t^\top \omega - \delta_{t+1}(\theta)\big)\phi_t.$$

Next, consider a single MDP trajectory $\{x_t\}_{t \ge 0}$ obtained by the behavior policy $\pi_b$. In particular, we divide the entire trajectory into multiple batches of samples $\{B_m\}_{m \ge 1}$ so that $B_m = \{x_{(m-1)M}, \ldots, x_{mM-1}\}$, and our proposed VR-Greedy-GQ uses one batch of samples in every epoch. To elaborate, in the $m$-th epoch, we first initialize this epoch with a pair of reference points $\theta_0^{(m)} = \tilde\theta^{(m)}$, $\omega_0^{(m)} = \tilde\omega^{(m)}$, where $\tilde\theta^{(m)}, \tilde\omega^{(m)}$ are set to be the output points $\theta_M^{(m-1)}, \omega_M^{(m-1)}$ of the previous epoch, respectively. Then, we compute a pair of reference batch updates using the reference points and the batch of samples as follows:

$$\tilde G^{(m)} = \frac{1}{M}\sum_{k=(m-1)M}^{mM-1} G_{x_k}(\tilde\theta^{(m)}, \tilde\omega^{(m)}), \qquad \tilde H^{(m)} = \frac{1}{M}\sum_{k=(m-1)M}^{mM-1} H_{x_k}(\tilde\theta^{(m)}, \tilde\omega^{(m)}). \qquad (4)$$

After that, we use these stochastic updates and the reference batch updates to construct the variance-reduced updates in Algorithm 1 via the SVRG scheme, where for simplicity we denote the stochastic updates $G_{x_{\xi_t^m}}, H_{x_{\xi_t^m}}$ respectively as $G_t^{(m)}, H_t^{(m)}$. In particular, we project the two time-scale updates onto the Euclidean ball with radius $R$ to stabilize the algorithm updates, and we assume that $R$ is large enough to include at least one stationary point of $J$. Lastly, we further update the policy via the policy improvement operation $\mathcal{P}$.

Algorithm 1: Variance-Reduced Greedy-GQ
  Input: learning rates $\eta_\theta, \eta_\omega$, batch size $M$.
  Initialize: $\tilde\theta^{(1)} = \theta_0$, $\tilde\omega^{(1)} = \omega_0$, $\pi_{\tilde\theta^{(1)}} \leftarrow \mathcal{P}(\phi^\top \tilde\theta^{(1)})$.
  for $m = 1, 2, \ldots$ do
    $\theta_0^{(m)} = \tilde\theta^{(m)}$, $\omega_0^{(m)} = \tilde\omega^{(m)}$. Compute $\tilde G^{(m)}, \tilde H^{(m)}$ according to eq. (4).
    for $t = 0, 1, \ldots, M-1$ do
      Query a sample from $B_m$ with replacement.
      $\theta_{t+1}^{(m)} = \Pi_R\big[\theta_t^{(m)} - \eta_\theta\big(G_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - G_t^{(m)}(\tilde\theta^{(m)}, \tilde\omega^{(m)}) + \tilde G^{(m)}\big)\big]$.
      $\omega_{t+1}^{(m)} = \Pi_R\big[\omega_t^{(m)} - \eta_\omega\big(H_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - H_t^{(m)}(\tilde\theta^{(m)}, \tilde\omega^{(m)}) + \tilde H^{(m)}\big)\big]$.
      Policy improvement: $\pi_{\theta_{t+1}^{(m)}} \leftarrow \mathcal{P}(\phi^\top \theta_{t+1}^{(m)})$.
    end
    Set $\tilde\theta^{(m+1)} = \theta_M^{(m)}$, $\tilde\omega^{(m+1)} = \omega_M^{(m)}$.
  end
  Output: parameter $\theta$ chosen among $\{\theta_t^{(m)}\}_{t,m}$ uniformly at random.

The above VR-Greedy-GQ algorithm has several advantages and distinctive features. First, it takes incremental updates that use a single Markovian sample per iteration. This makes the algorithm sample efficient. Second, VR-Greedy-GQ applies variance reduction to both of the two time-scale updates. As we show later in the analysis, such a two time-scale variance reduction scheme significantly reduces the variance error of both of the stochastic updates.

We want to further clarify the incrementalism and online property of VR-Greedy-GQ. Our VR-Greedy-GQ is based on the online SVRG and can be viewed as an incremental algorithm with regard to the batches of samples used in the outer loops, i.e., in every outer loop the algorithm samples a new batch of samples and uses them to perform variance reduction in the corresponding inner loops. Therefore, VR-Greedy-GQ can be viewed as an online batch-incremental algorithm. In general, there is a trade-off between incrementalism and variance reduction for SVRG-type algorithms: a larger batch size in the outer loops enhances the effect of variance reduction, while a smaller batch size makes the algorithm more incremental.

Assumption 4.1 (Feature boundedness). The feature vectors are uniformly bounded, i.e., $\|\phi_{s,a}\| \le 1$ for all $(s, a) \in S \times A$.

Assumption 4.2 (Policy smoothness). The mapping $\theta \mapsto \pi_\theta$ is $k_1$-Lipschitz and $k_2$-smooth. We note that the above class of smooth policies covers a variety of practical policies, including softmax and mellowmax policies Asadi & Littman (2017); Wang & Zou (2020).
Assumption 4.3 (Problem solvability). The matrix $C := \mathbb{E}[\phi_{s,a}\phi_{s,a}^\top]$ is non-singular.

Assumption 4.4 (Geometric uniform ergodicity). There exist $\Lambda > 0$ and $\rho \in (0, 1)$ such that $\sup_{s \in S} d_{TV}\big(\mathbb{P}(s_t \mid s_0 = s), \mu\big) \le \Lambda\rho^t$ for any $t > 0$, where $d_{TV}$ is the total-variation distance.

Based on the above assumptions, we obtain the following finite-time convergence rate result.

Theorem 4.5 (Finite-time convergence). Let Assumptions 4.1-4.4 hold and consider the VR-Greedy-GQ algorithm. Choose learning rates $\eta_\theta, \eta_\omega$ and the batch size $M$ that satisfy the conditions specified in eqs. (15) to (19). Then, after $T$ epochs, the output of the algorithm satisfies

$$\mathbb{E}\big\|\nabla J(\theta_\xi^{(\zeta)})\big\|^2 \le O\bigg(\frac{1}{\eta_\theta T M} + \frac{1}{T}\Big(\eta_\omega + \frac{\eta_\theta^2}{\eta_\omega^2}\Big) + \Big(\eta_\omega + \frac{\eta_\theta^2}{\eta_\omega^2}\Big)^2 + \frac{1}{M}\bigg),$$

where $\xi, \zeta$ are random indexes that are sampled from $\{0, \ldots, M-1\}$ and $\{1, \ldots, T\}$ uniformly at random, respectively.

Theorem 4.5 shows that VR-Greedy-GQ asymptotically converges to a neighborhood of a stationary point at a sublinear rate. In particular, the size of the neighborhood is in the order of $O(M^{-1} + \eta_\theta^4\eta_\omega^{-4} + \eta_\omega^2)$, which can be driven arbitrarily close to zero by choosing a large batch size and sufficiently small learning rates that satisfy the two time-scale condition $\eta_\theta/\eta_\omega \to 0$. Moreover, the convergence error terms implicitly include a bias error $O(\frac{1}{M})$ caused by the Markovian sampling and a variance error $O(\frac{\eta_\theta}{M})$ caused by the stochastic updates, both of which are substantially reduced by the large batch size $M$. This shows that the SVRG scheme can effectively reduce the bias and variance error of the two time-scale stochastic updates.

By further optimizing the choice of hyper-parameters, we obtain the following characterization of the sample complexity of VR-Greedy-GQ.

Corollary 4.6 (Sample complexity). Under the same conditions as those of Theorem 4.5, choose learning rates so that $\eta_\theta = O(\frac{1}{M})$, $\eta_\omega = O(\eta_\theta^{2/3})$, and set $T, M = O(\epsilon^{-1})$. Then, the required sample complexity for achieving $\mathbb{E}\|\nabla J(\theta_\xi^{(\zeta)})\|^2 \le \epsilon$ is in the order of $TM = O(\epsilon^{-2})$.

Such a complexity result is orderwise lower than the complexity $O(\epsilon^{-3})$ of the original Greedy-GQ Wang & Zou (2020). Therefore, this demonstrates the advantage of applying variance reduction to the two time-scale updates of VR-Greedy-GQ. We also note that for online stochastic non-convex optimization, the sample complexity of the SVRG algorithm is in the order of $O(\epsilon^{-5/3})$ Li & Li (2018), which is slightly better than our result. This is reasonable as the SVRG in stochastic optimization is unbiased due to the i.i.d. sampling. In comparison, VR-Greedy-GQ works on a single MDP trajectory that induces Markovian noise, and the two time-scale updates of the algorithm also introduce additional tracking error.
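To make the structure of the variance-reduced updates in Algorithm 1 (Section 3) concrete, the following sketch runs one epoch of the SVRG-style inner loop. It is an illustration under stated assumptions rather than the authors' code: the target policy is taken to be softmax with a hypothetical temperature `tau`, the toy MDP, rewards and step sizes are made up, and the names `updates`, `project` and `vr_epoch` are introduced only for this example.

```python
import numpy as np

def updates(theta, omega, x, phi, gamma, tau=1.0):
    """Return (G_x(theta, omega), H_x(theta, omega)) for one sample x = (s, a, r, s')."""
    s, a, rew, s_next = x
    phi_t, phi_next = phi[s, a], phi[s_next]
    q = phi_next @ theta
    pi = np.exp((q - q.max()) / tau)
    pi /= pi.sum()                       # softmax target policy (assumption)
    v = pi @ q
    mean_phi = pi @ phi_next
    phi_hat = mean_phi + ((pi * q) @ phi_next - v * mean_phi) / tau  # gradient of V
    delta = rew + gamma * v - phi_t @ theta
    G = -delta * phi_t + gamma * (omega @ phi_t) * phi_hat
    H = (phi_t @ omega - delta) * phi_t
    return G, H

def project(v, R):
    """Euclidean projection onto the ball of radius R."""
    n = np.linalg.norm(v)
    return v if n <= R else v * (R / n)

def vr_epoch(theta_ref, omega_ref, batch, phi, eta_theta, eta_omega, gamma, R, rng):
    """One epoch: reference batch updates, then M variance-reduced inner steps."""
    M = len(batch)
    ref = [updates(theta_ref, omega_ref, x, phi, gamma) for x in batch]
    G_ref = np.mean([g for g, _ in ref], axis=0)   # reference batch update for theta
    H_ref = np.mean([h for _, h in ref], axis=0)   # reference batch update for omega
    theta, omega = theta_ref.copy(), omega_ref.copy()
    for _ in range(M):
        x = batch[rng.integers(M)]                 # sample from the batch with replacement
        G_cur, H_cur = updates(theta, omega, x, phi, gamma)
        G_old, H_old = updates(theta_ref, omega_ref, x, phi, gamma)
        g = G_cur - G_old + G_ref                  # variance-reduced update for theta
        h = H_cur - H_old + H_ref                  # variance-reduced update for omega
        theta, omega = project(theta - eta_theta * g, R), project(omega - eta_omega * h, R)
    return theta, omega

# Hypothetical toy trajectory under a uniform behavior policy.
rng = np.random.default_rng(0)
S, A, d, M = 5, 3, 4, 50
phi = rng.normal(size=(S, A, d))
phi /= np.linalg.norm(phi, axis=2, keepdims=True)  # features with norm at most 1
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
batch, s = [], 0
for _ in range(M):
    a = rng.integers(A)
    s_next = rng.choice(S, p=P[s, a])
    batch.append((s, a, rng.random(), s_next))
    s = s_next
theta, omega = vr_epoch(np.zeros(d), np.zeros(d), batch, phi, 0.02, 0.1, 0.95, 10.0, rng)
```

Chaining calls to `vr_epoch` over consecutive batches of one trajectory, with each epoch's output used as the next reference pair, gives the outer loop. Note that at the reference point the correction `G_cur - G_old` vanishes, so the first inner step coincides with a full batch update.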

5. SKETCH OF THE TECHNICAL PROOF

In this section, we provide an outline of the technical proof of the main Theorem 4.5 and highlight the main technical contributions. The details of the proof can be found in the appendix.

We note that our proof logic partly follows that of the conventional SVRG, i.e., exploiting the objective function smoothness and introducing a Lyapunov function. However, our analysis requires substantial new developments to address the challenges of off-policy control, two time-scale updates of VR-Greedy-GQ and correlation of Markovian samples. The key step of the proof is to develop a proper Lyapunov function that drives the parameter to a stationary point along the iterations. In addition, we also need to develop tight bounds for the bias error, variance error and tracking error. We elaborate the key steps of the proof below.

Step 1: We first define the following Lyapunov function with certain $c_t > 0$ to be determined later.

$$R_t^m := J(\theta_t^{(m)}) + c_t\big\|\theta_t^{(m)} - \tilde\theta^{(m)}\big\|^2. \qquad (5)$$

To explain the motivation, note that unlike the analysis of variance-reduced TD learning Xu et al. (2020), where the term $\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2$ can be decomposed into $\|\theta_t^{(m)} - \theta^*\|^2 + \|\tilde\theta^{(m)} - \theta^*\|^2$ to get the desired upper bound, here we do not have $\theta^*$ due to the non-convexity of $J(\theta)$. Hence, we need to properly merge this term into the Lyapunov function $R_t^m$. By leveraging the smoothness of $J(\theta)$ and the algorithm update rule, we obtain the following bound for the Lyapunov function $R_t^m$ (see eq. (10) in the appendix for the details).

$$\mathbb{E}[R_{t+1}^m] \le \mathbb{E}[J(\theta_t^{(m)})] + O\Big(\mathbb{E}\big\|\theta_t^{(m)} - \tilde\theta^{(m)}\big\|^2 + \mathbb{E}\big\|\nabla J(\theta_t^{(m)})\big\|^2 + M^{-1}\Big) + O\Big(\mathbb{E}\big\|\tilde\omega^{(m)} - \omega^*(\tilde\theta^{(m)})\big\|^2 + \mathbb{E}\big\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\big\|^2\Big). \qquad (6)$$

In particular, the error term $\frac{1}{M}$ is due to the noise of Markovian sampling and the variance of the stochastic updates, and the last two terms correspond to tracking errors.

Step 2: To telescope the Lyapunov function over $t$ based on eq. (6), one may want to define $\mathbb{E}[J(\theta_t^{(m)})] + O\big(\mathbb{E}\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2\big) = \mathbb{E}[R_t^m]$ by choosing a proper $c_t$ of $R_t^m$. However, note that eq. (6) involves the last two tracking error terms, which also implicitly depend on $\mathbb{E}\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2$ as we show later in Step 3. Therefore, we need to carefully define the $c_t$ of $R_t^m$ so that after applying the tracking error bounds developed in Step 3, the right hand side of eq. (6) can yield an $R_t^m$ without involving the term $\mathbb{E}\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2$. It turns out that we need to define $c_t$ via the recursion specified in eq. (11) in the appendix. We rigorously show that the sequence $\{c_t\}_t$ is uniformly bounded by a small constant $c = \frac{1}{8}$. Then, plugging these bounds into eq. (6) and summing over one epoch, we obtain the following bound (see eq. (13) in the appendix for the details).

$$\eta_\theta\sum_{t=0}^{M-1}\mathbb{E}\big\|\nabla J(\theta_t^{(m)})\big\|^2 \le \mathbb{E}[R_0^m] - \mathbb{E}[R_M^m] + O\Big(\eta_\theta + \eta_\theta^2 M\,\mathbb{E}\big\|\tilde\omega^{(m)} - \omega^*(\tilde\theta^{(m)})\big\|^2\Big) + O\Big(\eta_\theta\sum_{t=0}^{M-1}\mathbb{E}\big\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\big\|^2 - \eta_\theta c\Big(\eta_\omega + \frac{\eta_\theta^2}{\eta_\omega^2}\Big)\sum_{t=0}^{M-1}\mathbb{E}\big\|\theta_t^{(m)} - \tilde\theta^{(m)}\big\|^2\Big).$$

Step 3: We derive bounds for the tracking error terms $\sum_{t=0}^{M-1}\mathbb{E}\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2$ and $\mathbb{E}\|\tilde\omega^{(m)} - \omega^*(\tilde\theta^{(m)})\|^2$ in the above bound in Lemma D.7 and Lemma D.8.

Step 4: Lastly, by substituting the tracking error bounds obtained in Step 3 into the bound obtained in Step 2, the resulting bound does not involve the term $\sum_{t=0}^{M-1}\mathbb{E}\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2$. Then, summing this bound over the epochs $m = 1, \ldots, T$, we obtain the desired finite-time convergence rate result.

6. EXPERIMENTS

In this section, we conduct two reinforcement learning experiments, namely, the Garnet problem Archibald et al. (1995) and the Frozen Lake game Brockman et al. (2016), to test the performance of VR-Greedy-GQ in the off-policy setting, and compare it with Greedy-GQ in the Markovian setting.

6.1. GARNET PROBLEM

For the Garnet problem, we refer to Appendix F for the details of the problem setup. In Figure 1 (left), we plot the minimum gradient norm vs. the number of pseudo stochastic gradient computations for both algorithms using 40 Garnet MDP trajectories, where each trajectory contains 10k samples. The upper and lower envelopes of the curves correspond to the 95% and 5% percentiles of the 40 curves, respectively. It can be seen that VR-Greedy-GQ outperforms Greedy-GQ and achieves a significantly smaller asymptotic gradient norm.

In Figure 1 (middle), we track the estimated variance of the stochastic update for both algorithms along the iterations. Specifically, we query 500 Monte Carlo samples per iteration to estimate the pseudo gradient variance $\mathbb{E}\|G_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - \nabla J(\theta_t^{(m)})\|^2$. It can be seen from the figure that the stochastic updates of VR-Greedy-GQ induce a much smaller variance than those of Greedy-GQ. This demonstrates the effectiveness of the two time-scale variance reduction scheme of VR-Greedy-GQ.

We further study the asymptotic convergence error of VR-Greedy-GQ under different batch sizes $M$. We use the default learning rate setting that is mentioned previously and run 100k iterations for one Garnet trajectory. We use the mean of the convergence error of the last 10k iterations as an estimate of the asymptotic convergence error (the training curves are already saturated and flattened). Figure 1 (right) shows the asymptotic convergence error of VR-Greedy-GQ under different batch sizes $M$. It can be seen that VR-Greedy-GQ achieves a smaller asymptotic convergence error with a larger batch size, which matches our theoretical result.

In Figure 2 (left), we plot the MSPBE $J(\theta)$ vs. the number of gradient computations for both Greedy-GQ and VR-Greedy-GQ, where one can see that VR-Greedy-GQ achieves a much smaller MSPBE than Greedy-GQ. In Figure 2 (middle), we plot the estimated expected maximum reward (see Appendix F for details) vs. the number of gradient computations for Greedy-GQ, VR-Greedy-GQ and actor-critic, where for actor-critic we set the learning rate $\eta_\theta = 0.02$ for the actor update and $\eta_\omega = 0.01$ for the critic update. One can see that VR-Greedy-GQ achieves a higher reward than the other two algorithms, demonstrating the high quality of its learned policy. In addition, we also plot the estimated expected maximum reward vs. the number of iterations for Greedy-GQ, VR-Greedy-GQ and policy gradient in Figure 2 (right). For policy gradient, we apply the standard off-policy policy gradient algorithm. For each update, we sample 30 independent trajectories with a fixed length of 60 to estimate the expected discounted return. The learning rate of policy gradient is set as $\eta_\theta$. We note that each iteration of policy gradient consumes 1800 samples and hence it is very sample inefficient. Hence, we set the x-axis to be the number of iterations for a clear presentation (otherwise it becomes a flat curve). One can see that VR-Greedy-GQ achieves a much higher expected reward than both Greedy-GQ and policy gradient.
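The curve statistics described for Figure 1 (left), namely the minimum gradient norm so far and the 5% and 95% percentile envelopes over 40 runs, can be computed as in the sketch below. The array `grad_norms` is synthetic stand-in data, and plotting the median as the central curve is an assumption, since the text does not say which central statistic is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 40 runs, 1000 recorded gradient norms per run.
n_runs, n_steps = 40, 1000
grad_norms = np.abs(rng.normal(size=(n_runs, n_steps))) / np.sqrt(np.arange(1, n_steps + 1))

# Minimum gradient norm observed so far along each run.
running_min = np.minimum.accumulate(grad_norms, axis=1)

# Lower and upper envelopes across runs at every step (5% and 95% percentiles).
lower, upper = np.percentile(running_min, [5, 95], axis=0)

# Central curve (assumption: median across runs).
median = np.median(running_min, axis=0)
```

Taking the running minimum first makes every per-run curve non-increasing, so the envelopes computed afterwards are non-increasing as well.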

6.2. FROZEN LAKE GAME

We further test these algorithms in solving the more complex Frozen Lake game. We refer to Appendix F for the details of the problem setup. Figure 3 shows the comparison between VR-Greedy-GQ and Greedy-GQ, and one can make observations consistent with those made in the Garnet experiment. Specifically, Figure 3 (left) shows that VR-Greedy-GQ achieves a much more stationary policy than Greedy-GQ. Figure 3 (middle) shows that the stochastic updates of VR-Greedy-GQ induce a much smaller variance than those of Greedy-GQ. Moreover, Figure 3 (right) verifies our theoretical result that VR-Greedy-GQ achieves a smaller asymptotic convergence error with a larger batch size.

We further plot the MSPBE vs. the number of gradient computations for both Greedy-GQ and VR-Greedy-GQ in Figure 4 (left), where one can see that VR-Greedy-GQ outperforms Greedy-GQ. In Figure 4 (middle), we plot the estimated expected maximum reward vs. the number of gradient computations for Greedy-GQ, VR-Greedy-GQ and actor-critic, where for actor-critic we set the learning rate $\eta_\theta = 0.2$ for the actor update and $\eta_\omega = 0.1$ for the critic update. It can be seen that VR-Greedy-GQ achieves a higher reward than the other two algorithms. In Figure 4 (right), we plot the estimated expected maximum reward vs. the number of iterations for Greedy-GQ, VR-Greedy-GQ and policy gradient. For policy gradient, we use the same parameter settings as before. One can see that VR-Greedy-GQ achieves a much higher expected reward than both Greedy-GQ and policy gradient.

7. CONCLUSION

In this paper, we develop a variance-reduced two time-scale Greedy-GQ algorithm for optimal control by leveraging the SVRG variance reduction scheme. Under linear function approximation and Markovian sampling, we establish the sublinear finite-time convergence rate of the algorithm to a stationary point and prove an improved sample complexity bound over that of the original Greedy-GQ. The RL experiments demonstrate the effectiveness of the proposed two time-scale variance reduction scheme. Our algorithm design may inspire new developments of variance reduction for two time-scale RL algorithms. In the future, we will explore Greedy-GQ with other nonconvex variance reduction schemes to possibly further improve the sample complexity.

Appendix

Let $x_t^{(m)}$ be the sample picked in the $t$-th iteration of the $m$-th epoch. Then, we define the filtration for Markovian samples as follows:

$$\begin{aligned}
&\mathcal{F}_{1,0} = \sigma\big(B_0 \cup \sigma(\tilde\theta^{(0)}, \tilde\omega^{(0)})\big), \quad \mathcal{F}_{1,1} = \sigma\big(\mathcal{F}_{1,0} \cup \sigma(x_0^{(1)})\big), \quad \ldots, \quad \mathcal{F}_{1,M} = \sigma\big(\mathcal{F}_{1,M-1} \cup \sigma(x_{M-1}^{(1)})\big), \\
&\mathcal{F}_{2,0} = \sigma\big(B_1 \cup \mathcal{F}_{1,M} \cup \sigma(\tilde\theta^{(1)}, \tilde\omega^{(1)})\big), \quad \mathcal{F}_{2,1} = \sigma\big(\mathcal{F}_{2,0} \cup \sigma(x_0^{(2)})\big), \quad \ldots, \quad \mathcal{F}_{2,M} = \sigma\big(\mathcal{F}_{2,M-1} \cup \sigma(x_{M-1}^{(2)})\big), \\
&\quad\vdots \\
&\mathcal{F}_{m,0} = \sigma\big(B_{m-1} \cup \mathcal{F}_{m-1,M} \cup \sigma(\tilde\theta^{(m-1)}, \tilde\omega^{(m-1)})\big), \quad \mathcal{F}_{m,1} = \sigma\big(\mathcal{F}_{m,0} \cup \sigma(x_0^{(m)})\big), \quad \ldots, \quad \mathcal{F}_{m,M} = \sigma\big(\mathcal{F}_{m,M-1} \cup \sigma(x_{M-1}^{(m)})\big).
\end{aligned}$$

Moreover, we define $\mathbb{E}_{m,t}$ as the conditional expectation with respect to the $\sigma$-field $\mathcal{F}_{m,t}$.

List of Constants

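We summarize all the constants that are used in the proof as follows.

• $G = r_{\max} + (1 + \gamma)R + \gamma(|A| R k_1 + 1)R$.
• $H = (2 + \gamma)R + r_{\max}$.
• $C_1 = \big(1 + \frac{2\Lambda\rho}{1-\rho}\big)(G + C_{\nabla J})^2$.
• $C_2 = H^2\big(1 + \frac{\Lambda\rho}{1-\rho}\big)$.
• $C_3 = \frac{8R^2}{\lambda_C}\big(1 + \frac{\rho\Lambda}{1-\rho}\big)$.
• $C_4 = \frac{2}{\lambda_C}\big(R(2 + \gamma) + r_{\max}\big)^2\big(1 + \frac{\rho\Lambda}{1-\rho}\big)$.

B. PROOF OF THEOREM 4.5

We first define the following Lyapunov function

$$R_t^m := J(\theta_t^{(m)}) + c_t\big\|\theta_t^{(m)} - \tilde\theta^{(m)}\big\|^2,$$

where $c_t > 0$ is to be determined later. Our strategy is to characterize the per-iteration progress of $R_t^m$. In particular, we use Lemma E.9 to bound the first term of $R_t^m$ and use Lemma E.10 to bound the second term of $R_t^m$. Note that Lemma E.9 implies that

$$J(\theta_{t+1}^{(m)}) \le J(\theta_t^{(m)}) - \eta_\theta\big\langle\nabla J(\theta_t^{(m)}),\, g_t^{(m)} - \nabla J(\theta_t^{(m)})\big\rangle - \eta_\theta\big\|\nabla J(\theta_t^{(m)})\big\|^2 + \frac{L}{2}\eta_\theta^2\big\|g_t^{(m)}\big\|^2.$$

Let $\xi_t^{(m)} := \big\langle\nabla J(\theta_t^{(m)}),\, g_t^{(m)} - \nabla J(\theta_t^{(m)})\big\rangle$. Then, we obtain that

$$J(\theta_{t+1}^{(m)}) \le J(\theta_t^{(m)}) - \eta_\theta\xi_t^{(m)} - \eta_\theta\big\|\nabla J(\theta_t^{(m)})\big\|^2 + \frac{L}{2}\eta_\theta^2\big\|g_t^{(m)}\big\|^2. \qquad (7)$$

Substituting eq. (7) and eq. (30) into the definition of $R_{t+1}^m$, we obtain that

$$\begin{aligned}
R_{t+1}^m :={}& J(\theta_{t+1}^{(m)}) + c_{t+1}\big\|\theta_{t+1}^{(m)} - \tilde\theta^{(m)}\big\|^2 \\
\le{}& J(\theta_t^{(m)}) - \eta_\theta\xi_t^{(m)} - \eta_\theta\big\|\nabla J(\theta_t^{(m)})\big\|^2 + \frac{L}{2}\eta_\theta^2\big\|g_t^{(m)}\big\|^2 \\
&+ c_{t+1}\Big[\eta_\theta^2\big\|g_t^{(m)}\big\|^2 + \big\|\theta_t^{(m)} - \tilde\theta^{(m)}\big\|^2 - 2\eta_\theta\zeta_t^{(m)} + \eta_\theta\Big(\frac{1}{\beta_t}\big\|\nabla J(\theta_t^{(m)})\big\|^2 + \beta_t\big\|\theta_t^{(m)} - \tilde\theta^{(m)}\big\|^2\Big)\Big] \\
={}& J(\theta_t^{(m)}) + c_{t+1}(\eta_\theta\beta_t + 1)\big\|\theta_t^{(m)} - \tilde\theta^{(m)}\big\|^2 + \Big(-\eta_\theta + \frac{c_{t+1}\eta_\theta}{\beta_t}\Big)\big\|\nabla J(\theta_t^{(m)})\big\|^2 \\
&+ \Big(\frac{L}{2}\eta_\theta^2 + c_{t+1}\eta_\theta^2\Big)\big\|g_t^{(m)}\big\|^2 - \eta_\theta\xi_t^{(m)} - 2c_{t+1}\eta_\theta\zeta_t^{(m)}.
\end{aligned}$$

Next, we bound the two inner product terms $\xi_t^{(m)}$ and $\zeta_t^{(m)}$.

Bounding the term $\xi_t^{(m)}$: Recall the variance-reduced stochastic update $g_t^{(m)} = G_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - G_t^{(m)}(\tilde\theta^{(m)}, \tilde\omega^{(m)}) + \tilde G^{(m)}$.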
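Then, the term $\xi_t^{(m)}$ can be decomposed as
$$
\begin{aligned}
\xi_t^{(m)} &= \big\langle \nabla J(\theta_t^{(m)}),\, g_t^{(m)} - \nabla J(\theta_t^{(m)})\big\rangle\\
&= \big\langle \nabla J(\theta_t^{(m)}),\, G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\rangle\\
&\quad + \big\langle \nabla J(\theta_t^{(m)}),\, G_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)}))\big\rangle\\
&\quad + \big\langle \nabla J(\theta_t^{(m)}),\, -G_t^{(m)}(\tilde\theta^{(m)}, \tilde\omega^{(m)}) + \widetilde G^{(m)}\big\rangle.
\end{aligned}
$$
In the last equality, the first inner product term is the bias caused by Markovian samples, and by Lemma D.3 we have that
$$
\begin{aligned}
\mathbb{E}\big\langle \nabla J(\theta_t^{(m)}),\, G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\rangle
&= \mathbb{E}\big\langle \nabla J(\theta_t^{(m)}),\, \widetilde G^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\rangle\\
&\le \frac14\mathbb{E}\|\nabla J(\theta_t^{(m)})\|^2 + \mathbb{E}\big\|\widetilde G^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\|^2\\
&\le \frac14\mathbb{E}\|\nabla J(\theta_t^{(m)})\|^2 + \frac{C_1}{M}.
\end{aligned}
$$
The second inner product term is the bias caused by tracking error, and we further obtain that
$$\big\langle \nabla J(\theta_t^{(m)}),\, G_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)}))\big\rangle \le \frac14\|\nabla J(\theta_t^{(m)})\|^2 + L_1\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2.$$
The third inner product term is unbiased. Combining all of these bounds, we finally obtain that
$$|\mathbb{E}\xi_t^{(m)}| \le \frac{C_1}{M} + \frac12\|\nabla J(\theta_t^{(m)})\|^2 + L_1\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2. \qquad (8)$$

Bounding the term $\zeta_t^{(m)}$: Recall that $\zeta_t^{(m)} := \big\langle g_t^{(m)} - \nabla J(\theta_t^{(m)}),\, \theta_t^{(m)} - \tilde\theta^{(m)}\big\rangle$. Similar to the previous proof for bounding $\xi_t^{(m)}$, we can decompose $\zeta_t^{(m)}$ as
$$
\begin{aligned}
\zeta_t^{(m)} &= \big\langle \theta_t^{(m)} - \tilde\theta^{(m)},\, g_t^{(m)} - \nabla J(\theta_t^{(m)})\big\rangle\\
&= \big\langle \theta_t^{(m)} - \tilde\theta^{(m)},\, G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\rangle\\
&\quad + \big\langle \theta_t^{(m)} - \tilde\theta^{(m)},\, G_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)}))\big\rangle\\
&\quad + \big\langle \theta_t^{(m)} - \tilde\theta^{(m)},\, -G_t^{(m)}(\tilde\theta^{(m)}, \tilde\omega^{(m)}) + \widetilde G^{(m)}\big\rangle.
\end{aligned}
$$
In the last equality, the first inner product term is the bias caused by Markovian samples. We obtain that
$$
\begin{aligned}
\mathbb{E}\big\langle \theta_t^{(m)} - \tilde\theta^{(m)},\, G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\rangle
&\le \frac12\mathbb{E}\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2 + \frac12\mathbb{E}\big\|G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\|^2\\
&\le \frac12\mathbb{E}\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2 + \frac12\cdot\frac{C_1}{M}.
\end{aligned}
$$
The second inner product term is the bias caused by tracking error. We obtain that
$$\big\langle \theta_t^{(m)} - \tilde\theta^{(m)},\, G_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)}))\big\rangle \le \frac12\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2 + \frac{L_1}{2}\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2.$$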
The third inner product term is unbiased. Combining all of these bounds, we finally obtain that |Eζ (m) t | ≤ 1 2 C 1 M + θ (m) t -θ (m) 2 + L 1 2 ω (m) t -ω * (θ (m) t ) 2 . ( ) Next, we continue to bound the Lyapunov function. Recall we have shown that R m t+1 ≤ J(θ (m) t ) + c t+1 (η θ β t + 1) θ (m) t -θ (m) 2 + -η θ + c t+1 η θ β t ∇J(θ (m) t ) 2 + L 2 η 2 θ + c t+1 η 2 θ g (m) t 2 -η θ ξ (m) t -2c t+1 η θ ζ (m) t . Taking expectation on both sides of the above inequality and applying eq. ( 8), eq. ( 9), and Lemma D.1, we obtain that E[R m t+1 ] ≤ E J(θ (m) t ) + c t+1 (η θ β t + 1) θ (m) t -θ (m) 2 + -η θ + c t+1 η θ β t E ∇J(θ (m) t ) 2 + L 2 η 2 θ + c t+1 η 2 θ 6L 1 E ω (m) t -ω * (θ (m) t ) 2 + 9L 1 E ω (m) -ω * ( θ (m) ) 2 + 9L 2 E θ (m) t -θ (m) 2 + 1 M • 9C 1 + 9E ∇J(θ (m) t ) 2 + η θ C 1 M + 1 2 ∇J(θ (m) t ) 2 + L 1 ω (m) t -ω * (θ (m) t ) 2 + 2c t+1 η θ 1 2 C 1 M + θ (m) t -θ (m) 2 + L 1 2 ω (m) t -ω * (θ (m) t ) 2 . ( ) We note that the tracking error term ω (m) t -ω * (θ (m) t ) 2 has dependence on θ (m) t -θ (m) 2 . Here we use a trick to merge this dependence to the coefficient c t+1 . Specifically, we add and subtract the same term in the above bound and obtain that E[R m t+1 ] ≤ E J(θ (m) t ) + c t+1 (η θ β t + 1 + 2η θ ) + 9L 1 L 2 η 2 θ + c t+1 η 2 θ θ (m) t -θ (m) 2 + - 1 2 η θ + c t+1 η θ β t + 9 L 2 η 2 θ + c t+1 η 2 θ E ∇J(θ (m) t ) 2 + η θ + 2c t+1 η θ + 9 L 2 η 2 θ + c t+1 η 2 θ C 1 M + 9L 1 L 2 η 2 θ + c t+1 η 2 θ E ω (m) -ω * ( θ (m) ) 2 + 6L 1 L 2 η 2 θ + c t+1 η 2 θ + η θ L 1 + η θ L 1 c t+1 E ω (m) t -ω * (θ (m) t ) 2 -6L 1 L 2 η 2 θ + c t+1 η 2 θ + η θ L 1 + η θ L 1 c t+1 • 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω E θ (m) t -θ (m) 2 + 6L 1 L 2 η 2 θ + c t+1 η 2 θ + η θ L 1 + η θ L 1 c t+1 • 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω E θ (m) t -θ (m) 2 . Then, we define R m t := J(θ (m) t ) + c t θ (m) t -θ (m) 2 with c t being specified via the following recursion. 
c t = c t+1 (η θ β t + 1 + 2η θ ) + 9L 1 L 2 η 2 θ + c t+1 η 2 θ + 6L 1 L 2 η 2 θ + c t+1 η 2 θ + η θ L 1 + η θ L 1 c t+1 • 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω . Based on this definition, the previous inequality reduces to E[R m t+1 ] ≤ E[R m t ] + - 1 2 η θ + c t+1 η θ β t + 9 L 2 η 2 θ + c t+1 η 2 θ E ∇J(θ (m) t ) 2 + η θ + 2c t+1 η θ + 9 L 2 η 2 θ + c t+1 η 2 θ C 1 M + 9L 1 L 2 η 2 θ + c t+1 η 2 θ E ω (m) -ω * ( θ (m) ) 2 + 6L 1 L 2 η 2 θ + c t+1 η 2 θ + η θ L 1 + η θ L 1 c t+1 E ω (m) t -ω * (θ (m) t ) 2 -6L 1 L 2 η 2 θ + c t+1 η 2 θ + η θ L 1 + η θ L 1 c t+1 • 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω E θ (m) t -θ (m) 2 . ( ) Assume that c t ≤ c for some universal constant c > 0 (we will formally prove it later). Then, we sum the above inequality over one epoch and obtain that 1 2 η θ -c η θ β t -9 L 2 η 2 θ + cη 2 θ M -1 t=0 E ∇J(θ (m) t ) 2 ≤ E[R m 0 ] -E[R m M ] + η θ + 2 cη θ + 9 L 2 η 2 θ + cη 2 θ C 1 + 9L 1 L 2 η 2 θ + cη 2 θ M E ω (m) -ω * ( θ (m) ) 2 + 6L 1 L 2 η 2 θ + cη 2 θ + η θ L 1 + η θ L 1 c M -1 t=0 E ω (m) t -ω * (θ (m) t ) 2 -6L 1 L 2 η 2 θ + cη 2 θ + η θ L 1 + η θ L 1 c • 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω M -1 t=0 E θ (m) t -θ (m) 2 . ( ) By Lemma D.7, we have that M -1 t=0 E ω (m) t -ω * (θ (m) t ) 2 ≤ 4 λ C 1 η ω + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω E ω (m) -ω * ( θ (m) ) 2 + 4 λ C 9 λ C + 2L 2 3 η 2 θ η 2 ω • 9C 1 + 4 λ C η ω • 12C 2 + 9 λ C + 2L 2 3 η 2 θ η 2 ω 36 λ C M -1 t=0 E ∇J(θ (m) t ) 2 + 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω M -1 t=0 E θ (m) t -θ (m) 2 + 8 λ C (C 3 + C 4 ). For simplicity, we define D := 6L 1 L 2 + c + L 1 + L 1 c. 
Substituting the above bound into the previous inequality and simplifying, we obtain that 1 2 η θ -c η θ β t -9 L 2 η 2 θ + cη 2 θ M -1 t=0 E ∇J(θ (m) t ) 2 ≤ ER m 0 -ER m M + η θ + 2 cη θ + 9 L 2 η 2 θ + cη 2 θ C 1 + 9L 1 L 2 η 2 θ + cη 2 θ M E ω (m) -ω * ( θ (m) ) 2 + Dη θ 4 λ C 1 η ω + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω E ω (m) -ω * ( θ (m) ) 2 + 4 λ C 9 λ C + 2L 2 3 η 2 θ η 2 ω • 9C 1 + 4 λ C η ω • 12C 2 + 9 λ C + 2L 2 3 η 2 θ η 2 ω 36 λ C M -1 t=0 E ∇J(θ (m) t ) 2 + 8 λ C (C 3 + C 4 ). One can see that the above bound is independent of M -1 t=0 E θ (m+1) t -θ (m) 2 , and this is what we desire. After simplification, the above inequality further implies that 1 2 η θ -c η θ β t -9 L 2 η 2 θ + cη 2 θ -D 9 λ C + 2L 2 3 36 λ C η 3 θ η 2 ω M -1 t=0 E ∇J(θ (m) t ) 2 ≤E[R m 0 ] -E[R m M ] + η θ + 2 cη θ + 9 L 2 η 2 θ + cη 2 θ C 1 + 8 λ C (C 3 + C 4 )Dη θ + Dη θ 4 λ C 9 λ C + 2L 2 3 η 2 θ η 2 ω • 9C 1 + 4 λ C η ω • 12C 2 + 9L 1 L 2 η 2 θ + cη 2 θ M + Dη θ 4 λ C 1 η ω + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω E ω (m) -ω * ( θ (m) ) 2 . ( ) Choose optimal learning rates: Here, we provide the omitted proof of our earlier claim made after eq. ( 12), that is, the upper bound of {c t } is a small constant. We first present the following fundamental simple lemma, and the proof is omitted. Lemma B.1. Let {c i } i=0,...,M be a finite sequence with c M = 0 and satisfies the following relation for certain a > 1: c t ≤ a • c t+1 + b. Then, {c i } i=0,...,M is a deceasing sequence and c 0 ≤ ab • a M -1 a -1 . Next, we derive the upper bound c of c t . Set β t = 1 for all t. Then we have that c t ≤ c t+1 (η θ + 1 + 2η θ ) + 9L 1 L 2 η 2 θ + c t+1 η 2 θ + 6L 1 L 2 η 2 θ + c t+1 η 2 θ + η θ L 1 + η θ L 1 c t+1 • 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω := a • c t+1 + b where a = 1 + (3 + 16L 1 )η θ and b = 15 2 L 1 Lη 2 θ + L 1 η θ • 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω . 
Note that here we require 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω ≤ 1 and max{η ω , η θ } ≤ 1. Moreover, let (3 + 16L 1 )η θ ≤ 1 M . Based on the above conditions, we obtain that c 0 ≤ 15 2 L 1 Lη θ + 4L 1 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω • 4 + 16L 1 3 + 16L 1 • (1 + (3 + 16L 1 )η θ ) M -1 ≤ 15 2 L 1 Lη θ + 4L 1 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω • 4 + 16L 1 3 + 16L 1 • (e -1) . Lastly, we choose 15 2 L 1 Lη θ + 4L 1 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω • 4 + 16L 1 3 + 16L 1 • (e -1) ≤ 1 8 . Therefore c 0 ≤ 1 8 . Since {c t } t is decreasing, we obtain that c = 1 8 . Now, substituting β t = 1 and c = 1 8 into the coefficient of the term M -1 t=0 E ∇J(θ (m) t ) 2 in eq. ( 14), the coefficient reduces to the following, and we choose an appropriate (η θ , η ω ) such that the coefficient is greater than 1 4 η θ . 3 8 η θ -9 L 2 η 2 θ + cη 2 θ -D 9 λ C + 2L 2 3 36 λ C η 3 θ η 2 ω ≥ 1 4 η θ . Deriving the final bound: Exploiting the above conditions on the learning rates, eq. ( 14) further implies that 1 4 η θ M -1 t=0 E ∇J(θ (m) t ) 2 ≤E[J( θ (m) )] -E[J( θ (m+1) )] + η θ + 2 cη θ + 9 L 2 η 2 θ + cη 2 θ C 1 + 8 λ C (C 3 + C 4 )Dη θ + Dη θ 4 λ C 9 λ C + 2L 2 3 η 2 θ η 2 ω • 9C 1 + 4 λ C η ω • 12C 2 + 9L 1 L 2 η 2 θ + cη 2 θ M + Dη θ 4 λ C 1 η ω + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω E ω (m) -ω * ( θ (m) ) 2 . ( ) On the other hand, by Lemma D.8 we have that E ω (m) -ω * ( θ (m) ) 2 ≤ (1 - 1 2 λ C η ω ) mM E ω (0) -ω * ( θ (0) ) 2 + 4 λ C C 3 + C 4 1 M + 4 λ C H 2 η ω + 2 λ C 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η 2 ω . Substituting the above bound into eq. 
( 20) and summing over m, we obtain that 1 4 η θ 1 T M T m=1 M -1 t=0 E ∇J(θ (m) t ) 2 ≤ 1 T M E[J( θ (0) )] + 1 M η θ + 2 cη θ + 9 L 2 η 2 θ + cη 2 θ C 1 + 8 λ C (C 3 + C 4 )Dη θ + Dη θ 4 λ C 9 λ C + 2L 2 3 η 2 θ η 2 ω • 9C 1 + 4 λ C η ω • 12C 2 + 1 T M 9L 1 L 2 η 2 θ + cη 2 θ M + Dη θ 4 λ C 1 η ω + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω • E ω (0) -ω * ( θ (0) ) 2 • 1 1 -(1 -1 2 λ C η ω ) M + 1 M 9L 1 L 2 η 2 θ + cη 2 θ M + Dη θ 4 λ C 1 η ω + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω • 4 λ C C 3 + C 4 1 M + 4 λ C H 2 η ω + 2 λ C 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η 2 ω . Rearranging the above inequality, we obtain the following final bound, where ξ, ζ are random indexes that are sampled from {0, ..., M -1} and {1, ..., T } uniformly at random, respectively. E ∇J(θ (ζ) ξ ) 2 ≤ 1 η θ T M • 4E[J( θ (0) )] + 4 M 1 + 2 c + 9 L 2 η θ + cη θ C 1 + 8 λ C (C 3 + C 4 )D + D 4 λ C 9 λ C + 2L 2 3 η 2 θ η 2 ω • 9C 1 + 4 λ C η ω • 12C 2 + 1 T M 9L 1 L 2 η θ + cη θ M + D 4 λ C 1 η ω + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω • E ω (0) -ω * ( θ (0) ) 2 • 1 1 -(1 -1 2 λ C η ω ) M + 9L 1 L 2 η θ + cη θ + D 4 λ C 1 η ω M + 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω • 4 λ C C 3 + C 4 1 M + 4 λ C H 2 η ω + 2 λ C 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η 2 ω . Next, we simplify the above inequality into an asymptotic form. Note that the first term is in the order of O( 1 η θ T M ). The second term is of order O( 1 M ). The third term is of order O( 1 ηωT M + 1 T (η ω + η 2 θ η 2 ω )), and the last term is the product of a term of order O( η 2 θ η 2 ω + η ω + 1 ηωM ) and another term of order O( 1 M + η 2 θ η 2 ω +η ω ) , which leads to the overall order O(( η 2 θ η 2 ω +η ω ) 2 + 1 M ). Combining these asymptotic orders together, we obtain the following asymptotic convergence rate result. E ∇J(θ (ζ) ξ ) 2 = O 1 η θ T M + η 2 θ η 2 ω + η ω 2 + 1 M + 1 T η ω + η 2 θ η 2 ω . 
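As a sanity check on Lemma B.1 and on the step-size condition $(3 + 16L_1)\eta_\theta \le 1/M$ used in this proof, the following minimal Python sketch unrolls the recursion $c_t = a\,c_{t+1} + b$ backward from $c_M = 0$ (the equality case of the lemma). The function name and the numeric values of `M` and `b` are illustrative assumptions and are not taken from the paper.

```python
import math

def unroll_backward(a, b, M):
    """Return [c_0, ..., c_M] for c_t = a * c_{t+1} + b with c_M = 0."""
    c = [0.0] * (M + 1)
    for t in range(M - 1, -1, -1):
        c[t] = a * c[t + 1] + b
    return c

M = 100            # illustrative epoch length (assumption)
a = 1.0 + 1.0 / M  # mimics (3 + 16 L_1) * eta_theta = 1 / M
b = 1e-3           # illustrative additive term (assumption)

c = unroll_backward(a, b, M)
# Bound stated in Lemma B.1: c_0 <= a * b * (a^M - 1) / (a - 1).
lemma_bound = a * b * (a ** M - 1.0) / (a - 1.0)
# Under the step-size condition, a^M - 1 <= e - 1.
growth = a ** M - 1.0
print(c[0], lemma_bound, growth, math.e - 1.0)
```

The sequence is non-increasing in $t$, so bounding $c_0$ is enough to bound every $c_t$, which is how the proof obtains the constant $c$.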
C PROOF OF COROLLARY 4.6

Regarding the convergence rate result of Theorem 4.5, we choose the learning rates such that $\eta_\theta = O(\eta_\omega^{3/2})$, and we obtain that
$$\mathbb{E}\|\nabla J(\theta_\xi^{(\zeta)})\|^2 = O\Big(\frac{1}{\eta_\theta TM} + \eta_\omega^2 + \frac{1}{M} + \frac{\eta_\omega}{T}\Big).$$
Then, we set $\eta_\theta = O(\frac{1}{M})$ such that eq. (17) is satisfied, and moreover $\eta_\omega = O(\frac{1}{M^{2/3}})$. Under this learning rate setting, the learning rate conditions in eq. (15), eq. (16), eq. (18), eq. (19) are all satisfied for a sufficiently large constant-level $M$. Then, the overall convergence rate further becomes
$$\mathbb{E}\|\nabla J(\theta_\xi^{(\zeta)})\|^2 = O\Big(\frac{1}{T} + \frac{1}{M}\Big).$$
By choosing $T, M = O(\epsilon^{-1})$, we conclude that the sample complexity for achieving $\mathbb{E}\|\nabla J(\theta_\xi^{(\zeta)})\|^2 \le \epsilon$ is in the order of $TM = O(\epsilon^{-2})$.
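The order arithmetic in this proof can be checked numerically. The sketch below evaluates the four terms of the asymptotic rate of Theorem 4.5 with every hidden constant set to 1, which is an assumption made purely for illustration, under the choices $\eta_\theta = 1/M$ and $\eta_\omega = M^{-2/3}$. The function name `rate_bound` is hypothetical.

```python
def rate_bound(T, M):
    """Evaluate the rate of Theorem 4.5 with all constants set to 1 (assumption)."""
    eta_theta = 1.0 / M
    eta_omega = M ** (-2.0 / 3.0)
    ratio = eta_theta ** 2 / eta_omega ** 2  # equals eta_omega when eta_theta = eta_omega^(3/2)
    return (
        1.0 / (eta_theta * T * M)      # O(1 / (eta_theta * T * M))
        + (ratio + eta_omega) ** 2     # O((eta_theta^2 / eta_omega^2 + eta_omega)^2)
        + 1.0 / M                      # O(1 / M)
        + (eta_omega + ratio) / T      # O((eta_omega + eta_theta^2 / eta_omega^2) / T)
    )

for M in (10, 100, 1000):
    eps = 1.0 / M  # target accuracy with T = M = 1 / eps
    print(M, rate_bound(M, M), eps)
```

With $T = M = \epsilon^{-1}$ the printed bound stays within a constant multiple of $\epsilon$, while the number of samples used is $TM = \epsilon^{-2}$.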

D TECHNICAL LEMMAS

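In this section, we present all the technical lemmas that are used in the proof of the main theorem.

Bounding $\mathbb{E}\|g_t^{(m)}\|^2$ and $\mathbb{E}\|h_t^{(m)}\|^2$:

Lemma D.1. Under the same assumptions as those of Theorem 4.5, the square norm of the one-step update of $\theta_t^{(m)}$ in Algorithm 1 is bounded as
$$\mathbb{E}\|g_t^{(m)}\|^2 \le 6L_1^2\,\mathbb{E}\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + 9L_1^2\,\mathbb{E}\|\tilde\omega^{(m)} - \omega^*(\tilde\theta^{(m)})\|^2 + 9L_2^2\,\mathbb{E}\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2 + \frac{9C_1}{M} + 9\,\mathbb{E}\|\nabla J(\theta_t^{(m)})\|^2,$$
where the constant $C_1$ is specified in Lemma D.3.

Proof. For convenience, define
$$T_t^{(m)} := G_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - G_t^{(m)}(\tilde\theta^{(m)}, \tilde\omega^{(m)}) \quad\text{and}\quad S_t^{(m)} := G_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - G_t^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)})).$$
Then, we obtain that
$$
\begin{aligned}
\|g_t^{(m)}\|^2 &= \big\|G_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - G_t^{(m)}(\tilde\theta^{(m)}, \tilde\omega^{(m)}) + \widetilde G^{(m)}\big\|^2\\
&= \big\|T_t^{(m)} + \widetilde G^{(m)} - \widetilde G^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)})) + \widetilde G^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)})) - S_t^{(m)} + S_t^{(m)}\big\|^2\\
&\le 3\|T_t^{(m)} - S_t^{(m)}\|^2 + 3\big\|\widetilde G^{(m)} - \widetilde G^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2 + 3\big\|S_t^{(m)} + \widetilde G^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2\\
&\le 6L_1^2\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + 9L_1^2\|\tilde\omega^{(m)} - \omega^*(\tilde\theta^{(m)})\|^2 + 3\big\|S_t^{(m)} + \widetilde G^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2,
\end{aligned}
$$
where $\widetilde G^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))$ is obtained by substituting the arguments $\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)})$ into the definition in eq. (4). Moreover, we have that
$$
\begin{aligned}
\big\|S_t^{(m)} + \widetilde G^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2 &= \big\|S_t^{(m)} + \widetilde G^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)})) - \nabla J(\theta_t^{(m)}) + \nabla J(\theta_t^{(m)})\big\|^2\\
&\le 3\big\|S_t^{(m)} - \mathbb{E}_{m,t}S_t^{(m)}\big\|^2 + 3\big\|\widetilde G^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\|^2 + 3\|\nabla J(\theta_t^{(m)})\|^2,
\end{aligned}
$$
which further implies that
$$
\begin{aligned}
\mathbb{E}\big\|S_t^{(m)} + \widetilde G^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2 &\le 3\,\mathbb{E}\|S_t^{(m)}\|^2 + 3\,\mathbb{E}\big\|\widetilde G^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\|^2 + 3\,\mathbb{E}\|\nabla J(\theta_t^{(m)})\|^2\\
&\le 3L_2^2\,\mathbb{E}\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2 + \frac{3C_1}{M} + 3\,\mathbb{E}\|\nabla J(\theta_t^{(m)})\|^2.
\end{aligned}
$$
Combining all the above bounds, we finally obtain that
$$\mathbb{E}\|g_t^{(m)}\|^2 \le 6L_1^2\,\mathbb{E}\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + 9L_1^2\,\mathbb{E}\|\tilde\omega^{(m)} - \omega^*(\tilde\theta^{(m)})\|^2 + 9L_2^2\,\mathbb{E}\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2 + \frac{9C_1}{M} + 9\,\mathbb{E}\|\nabla J(\theta_t^{(m)})\|^2.$$

Lemma D.2. Under the same assumptions as those of Theorem 4.5, we have that
$$\|h_t^{(m)}\|^2 \le 6L_4^2\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + 9L_4^2\|\tilde\omega^{(m)} - \omega^*(\tilde\theta^{(m)})\|^2 + 6L_5^2\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2 + \frac{6C_2}{M}.$$

Proof.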
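For convenience, define
$$V_t^{(m)} := H_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - H_t^{(m)}(\tilde\theta^{(m)}, \tilde\omega^{(m)}) \quad\text{and}\quad U_t^{(m)} := H_t^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - H_t^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)})).$$
Then, we obtain that
$$
\begin{aligned}
\|h_t^{(m)}\|^2 &= \big\|H_t^{(m)}(\theta_t^{(m)}, \omega_t^{(m)}) - H_t^{(m)}(\tilde\theta^{(m)}, \tilde\omega^{(m)}) + \widetilde H^{(m)}\big\|^2\\
&= \big\|V_t^{(m)} + \widetilde H^{(m)} - \widetilde H^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)})) + \widetilde H^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)})) - U_t^{(m)} + U_t^{(m)}\big\|^2\\
&\le 3\|V_t^{(m)} - U_t^{(m)}\|^2 + 3\big\|\widetilde H^{(m)} - \widetilde H^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2 + 3\big\|U_t^{(m)} + \widetilde H^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2\\
&\le 6L_4^2\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + 9L_4^2\|\tilde\omega^{(m)} - \omega^*(\tilde\theta^{(m)})\|^2 + 3\big\|U_t^{(m)} + \widetilde H^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2.
\end{aligned}
$$
Moreover, note that
$$\big\|U_t^{(m)} + \widetilde H^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2 \le 2\|U_t^{(m)}\|^2 + 2\big\|\widetilde H^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2 \le 2L_5^2\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2 + \frac{2C_2}{M}.$$
Combining the above bounds, we finally obtain that
$$\|h_t^{(m)}\|^2 \le 6L_4^2\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + 9L_4^2\|\tilde\omega^{(m)} - \omega^*(\tilde\theta^{(m)})\|^2 + 6L_5^2\|\theta_t^{(m)} - \tilde\theta^{(m)}\|^2 + \frac{6C_2}{M}.$$

Bounding pseudo-gradient variance:

Lemma D.3. Under the same assumptions as those of Theorem 4.5, we have that
$$\mathbb{E}\big\|\widetilde G^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\|^2 \le \frac{C_1}{M}.$$

Proof. Note that the variance can be expanded as
$$
\begin{aligned}
&\mathbb{E}\big\|\widetilde G^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\|^2\\
&= \frac{1}{M^2}\mathbb{E}\Big[\sum_{s=0}^{M-1}\big\|G_s^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\|^2\\
&\qquad\quad + \sum_{i\ne j}\big\langle G_i^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)}),\, G_j^{(m)}(\theta_t^{(m)}, \omega^*(\theta_t^{(m)})) - \nabla J(\theta_t^{(m)})\big\rangle\Big]\\
&\le \frac{1}{M^2}\Big[\sum_{s=0}^{M-1}(G + C_{\nabla J})^2 + \sum_{i\ne j}\Lambda\rho^{|i-j|}(G + C_{\nabla J})^2\Big]\\
&\le \frac{1}{M}\Big(1 + 2\Lambda\frac{\rho}{1-\rho}\Big)(G + C_{\nabla J})^2.
\end{aligned}
$$
Then, we define the constant $C_1 := \big(1 + 2\Lambda\frac{\rho}{1-\rho}\big)(G + C_{\nabla J})^2$.

Lemma D.4. Under the same assumptions as those of Theorem 4.5, we have that
$$\mathbb{E}\big\|\widetilde H^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2 \le \frac{C_2}{M}.$$

Proof. Note that this second moment term can be expanded as
$$
\begin{aligned}
\mathbb{E}\big\|\widetilde H^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2 &= \frac{1}{M^2}\sum_{i=0}^{M-1}\mathbb{E}\big\|H_i^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\|^2 + \frac{1}{M^2}\sum_{i\ne j}\mathbb{E}\big\langle H_i^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)})),\, H_j^{(m)}(\tilde\theta^{(m)}, \omega^*(\tilde\theta^{(m)}))\big\rangle\\
&\le \frac{H^2}{M} + \frac{1}{M^2}H^2\Lambda\sum_{i\ne j}\rho^{|i-j|} \le H^2\Big(1 + \Lambda\frac{\rho}{1-\rho}\Big)\frac{1}{M}.
\end{aligned}
$$
Lastly, we define the constant $C_2 := H^2\big(1 + \Lambda\frac{\rho}{1-\rho}\big)$.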
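The cross terms in the proofs of Lemma D.3 and Lemma D.4 are controlled through the geometric mixing sum $\sum_{i\ne j}\rho^{|i-j|}$. The short Python sketch below computes this sum by brute force and compares it with $2M\rho/(1-\rho)$, which is the factor that appears in the final step of the proof of Lemma D.3. The function names and the tested values of $M$ and $\rho$ are illustrative assumptions.

```python
def off_diagonal_mixing_sum(M, rho):
    """Brute-force value of sum over i != j in {0, ..., M-1} of rho^|i-j|."""
    return sum(
        rho ** abs(i - j)
        for i in range(M)
        for j in range(M)
        if i != j
    )

def geometric_bound(M, rho):
    """Upper bound 2 * M * rho / (1 - rho) for the off-diagonal mixing sum."""
    return 2.0 * M * rho / (1.0 - rho)

for M in (1, 5, 50):
    for rho in (0.1, 0.5, 0.9):
        print(M, rho, off_diagonal_mixing_sum(M, rho), geometric_bound(M, rho))
```

Dividing the bound by $M^2$ yields the $1/M$ decay of the correlation terms, which is why both lemmas give a second-moment bound of order $1/M$.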
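Bounding Markovian Noise:

Lemma D.5. Let the same assumptions as those of Theorem 4.5 hold and define
$$\varsigma_t^{(m)} := \big\langle \omega_t^{(m)} - \omega^*(\theta_t^{(m)}),\, \big(\phi_t^{(m)}(\phi_t^{(m)})^\top - C\big)\big(\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\big)\big\rangle.$$
Then, it holds that
$$\mathbb{E}[\varsigma_t^{(m)}] \le \frac18\lambda_C\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + \frac{C_3}{M},$$
where $C_3 = \frac{8R^2}{\lambda_C}\big(1 + \frac{\rho\Lambda}{1-\rho}\big)$.

Proof. By definition of $\varsigma_t^{(m)}$, we obtain that
$$
\begin{aligned}
&\mathbb{E}\big\langle \omega_t^{(m)} - \omega^*(\theta_t^{(m)}),\, \big(\phi_t^{(m)}(\phi_t^{(m)})^\top - C\big)\big(\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\big)\big\rangle\\
&= \mathbb{E}\big\langle \omega_t^{(m)} - \omega^*(\theta_t^{(m)}),\, \mathbb{E}_{m,t-1}\big(\phi_t^{(m)}(\phi_t^{(m)})^\top - C\big)\big(\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\big)\big\rangle\\
&\le \frac12\cdot\frac{\lambda_C}{4}\mathbb{E}\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + \frac12\cdot\frac{4R^2}{\lambda_C M^2}\mathbb{E}\Big\|\sum_{i=0}^{M-1}\big(\phi_i^{(m)}(\phi_i^{(m)})^\top - C\big)\Big\|^2.
\end{aligned}
$$
For the last term, note that
$$
\begin{aligned}
\mathbb{E}\Big\|\sum_{i=0}^{M-1}\big(\phi_i^{(m)}(\phi_i^{(m)})^\top - C\big)\Big\|^2 &= \mathbb{E}\sum_{i=0}^{M-1}\big\|\phi_i^{(m)}(\phi_i^{(m)})^\top - C\big\|^2 + \mathbb{E}\sum_{i\ne j}\big\langle \phi_i^{(m)}(\phi_i^{(m)})^\top - C,\, \phi_j^{(m)}(\phi_j^{(m)})^\top - C\big\rangle\\
&\le 4M + 4\sum_{i\ne j}\Lambda\rho^{|i-j|} \le 4M + 4M\frac{\rho\Lambda}{1-\rho}.
\end{aligned}
$$
Combining the above bounds, we finally obtain that
$$\mathbb{E}\big\langle \omega_t^{(m)} - \omega^*(\theta_t^{(m)}),\, \big(\phi_t^{(m)}(\phi_t^{(m)})^\top - C\big)\big(\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\big)\big\rangle \le \frac{\lambda_C}{8}\mathbb{E}\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + \frac{8R^2}{\lambda_C}\Big(1 + \frac{\rho\Lambda}{1-\rho}\Big)\frac{1}{M}.$$
We define $C_3 := \frac{8R^2}{\lambda_C}\big(1 + \frac{\rho\Lambda}{1-\rho}\big)$.

Lemma D.6. Let the same assumptions as those of Theorem 4.5 hold and define
$$\kappa_t^{(m)} := \big\langle \omega_t^{(m)} - \omega^*(\theta_t^{(m)}),\, \big((\phi_t^{(m)})^\top\omega^*(\theta_t^{(m)}) - \delta_{t+1}^{(m)}(\theta_t^{(m)})\big)\phi_t^{(m)}\big\rangle.$$
Then, we obtain that
$$\mathbb{E}\kappa_t^{(m)} \le \frac18\lambda_C\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + \frac{C_4}{M}.$$

Proof. Similar to the proof of Lemma D.5, we have that
$$
\begin{aligned}
&\mathbb{E}\big\langle \omega_t^{(m)} - \omega^*(\theta_t^{(m)}),\, \big((\phi_t^{(m)})^\top\omega^*(\theta_t^{(m)}) - \delta_{t+1}^{(m)}(\theta_t^{(m)})\big)\phi_t^{(m)}\big\rangle\\
&\le \frac12\cdot\frac{\lambda_C}{4}\mathbb{E}\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + \frac12\cdot\frac{4}{\lambda_C M^2}\mathbb{E}\Big\|\sum_{i=0}^{M-1}\big((\phi_i^{(m)})^\top\omega^*(\theta_t^{(m)}) - \delta_{i+1}^{(m)}(\theta_t^{(m)})\big)\phi_i^{(m)}\Big\|^2.
\end{aligned}
$$
For the last term, we can bound it as
$$\mathbb{E}\Big\|\sum_{i=0}^{M-1}\big((\phi_i^{(m)})^\top\omega^*(\theta_t^{(m)}) - \delta_{i+1}^{(m)}(\theta_t^{(m)})\big)\phi_i^{(m)}\Big\|^2 \le \big(R(2+\gamma) + r_{\max}\big)^2 M + \big(R(2+\gamma) + r_{\max}\big)^2\frac{\rho\Lambda}{1-\rho}M.$$
Combining all the above bounds, we finally obtain that
$$\mathbb{E}\big\langle \omega_t^{(m)} - \omega^*(\theta_t^{(m)}),\, \big((\phi_t^{(m)})^\top\omega^*(\theta_t^{(m)}) - \delta_{t+1}^{(m)}(\theta_t^{(m)})\big)\phi_t^{(m)}\big\rangle \le \frac{\lambda_C}{8}\mathbb{E}\|\omega_t^{(m)} - \omega^*(\theta_t^{(m)})\|^2 + \frac{2}{\lambda_C}\big(R(2+\gamma) + r_{\max}\big)^2\Big(1 + \frac{\rho\Lambda}{1-\rho}\Big)\frac{1}{M}.$$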
We then define C 4 := 2 λ C (R(2 + γ) + r max ) 2 (1 + ρΛ 1-ρ ). Bounding Tracking Error: Lemma D.7. Under the same assumptions as those of Theorem 4.5, the tracking error can be bounded as M -1 t=0 E ω (m) t -ω * (θ (m) t ) 2 ≤ 4 λ C 1 η ω + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω E ω (m) -ω * ( θ (m) ) 2 + 4 λ C 9 λ C + 2L 2 3 η 2 θ η 2 ω • 9C 1 + 4 λ C η ω • 12C 2 + 9 λ C + 2L 2 3 η 2 θ η 2 ω 36 λ C M -1 t=0 E ∇J(θ (m) t ) 2 + 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω M -1 t=0 E θ (m) t -θ (m) 2 + 8 λ C (C 3 + C 4 ). Proof. Recall the one-step update at ω (m) t+1 : ω (m) t+1 = Π R ω (m) t -η ω h (m) t . Then, we obtain the following upper bound of the tracking error ω (m) t+1 -ω * (θ (m) t+1 ) 2 , ω (m) t+1 -ω * (θ (m) t+1 ) 2 ≤ ω (m) t -ω * (θ (m) t ) -η ω h (m) t + ω * (θ (m) t ) -ω * (θ (m) t+1 ) 2 ≤ ω (m) t -ω * (θ (m) t ) 2 -2η ω ω (m) t -ω * (θ (m) t ), h (m) t + 2 ω (m) t -ω * (θ (m) t ), ω * (θ (m) t ) -ω * (θ (m) t+1 ) + 2η 2 ω h (m) t 2 + 2 ω * (θ (m) t ) -ω * (θ (m) t+1 ) 2 . Substituting the bound of Lemma D.2 into the above bound, we obtain that ω (m) t+1 -ω * (θ (m) t+1 ) 2 ≤ ω (m) t -ω * (θ (m) t ) 2 -2η ω ω (m) t -ω * (θ (m) t ), h (m) t + λ C η ω ω (m) t -ω * (θ (m) t ) 2 + 9 λ C + 2L 2 3 η 2 θ η ω 6L 2 1 E ω (m) t -ω * (θ (m) t ) 2 + 9L 2 1 E ω (m) -ω * ( θ (m) ) 2 + 9L 2 2 E θ (m) t -θ (m) 2 + 1 M • 9C 1 + 9E ∇J(θ (m) t ) 2 + 2η 2 ω 6L 2 4 E ω (m) t -ω * (θ (m) t ) 2 + 9L 2 4 E ω (m) -ω * ( θ (m) ) 2 + 6L 2 5 E θ (m) t -θ (m) 2 + 6C 2 M . 
Taking expectation on both sides of the above inequality and simplifying, we obtain that E ω (m) t+1 -ω * (θ (m) t+1 ) 2 ≤ 1 -λ C η ω + 9 λ C + 2L 2 3 6L 2 1 η 2 θ η ω + 12L 2 4 η 2 ω E ω (m) t -ω * (θ (m) t ) 2 + 9 λ C + 2L 2 3 9L 2 1 η 2 θ η ω + 18L 2 4 η 2 ω E ω (m) -ω * ( θ (m) ) 2 + 9 λ C + 2L 2 3 η 2 θ η ω • 9C 1 M + η 2 ω • 12C 2 M + 9 λ C + 2L 2 3 η 2 θ η ω 9E ∇J(θ (m) t ) 2 + 12L 2 5 η 2 ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η ω E θ (m) t -θ (m) 2 -2η ω Eκ (m) t -2η ω Eς (m) t , where ς (m) t := ω (m) t -ω * (θ (m) t ), φ (m) t (φ (m) t ) -C ω (m) t -ω * (θ (m) t ) , and κ (m) t := ω (m) t -ω * (θ (m) t ), (φ (m) t ) ω * (θ (m) t ) -δ (m) t+1 (θ (m) t ) φ (m) t η ω -12L 2 4 η 2 ω M -1 t=0 E ω (m) t -ω * (θ (m) t ) 2 ≤ E ω (m) M -ω * (θ (m) M ) 2 + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η ω + 18L 2 4 η 2 ω E ω (m) -ω * ( θ (m) ) 2 + 9 λ C + 2L 2 3 η 2 θ η ω • 9C 1 + η 2 ω • 12C 2 + 9 λ C + 2L 2 3 η 2 θ η ω 9 M -1 t=0 E ∇J(θ (m) t ) 2 + 12L 2 5 η 2 ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η ω M -1 t=0 E θ (m) t -θ (m) 2 + 2η ω (C 3 + C 4 ). Choosing an appropriate (η θ , η ω ) such that 1 2 λ C η ω - 9 λ C + 2L 2 3 6L 2 1 η 2 θ η ω -12L 2 4 η 2 ω ≥ 1 4 λ C η ω , and we finally obtain that M -1 t=0 E ω (m) t -ω * (θ (m) t ) 2 ≤ 4 λ C 1 η ω + M 9 λ C + 2L 2 3 9L 2 1 η 2 θ η 2 ω + 18L 2 4 η ω E ω (m) -ω * ( θ (m) ) 2 + 4 λ C 9 λ C + 2L 2 3 η 2 θ η 2 ω • 9C 1 + 4 λ C η ω • 12C 2 + 9 λ C + 2L 2 3 η 2 θ η 2 ω 36 λ C M -1 t=0 E ∇J(θ (m) t ) 2 + 4 λ C 12L 2 5 η ω + 9 λ C + 2L 2 3 9L 2 2 η 2 θ η 2 ω M -1 t=0 E θ (m) t -θ (m) 2 + 8 λ C (C 3 + C 4 ). Lemma D.8. Under the same assumptions as those of Theorem 4.5, the tracking error can be bounded as E ω (m) -ω * ( θ (m) ) 2 ≤ (1 - 1 2 λ C η ω ) mM E ω (0) -ω * ( θ (0) ) 2 + 4 λ C C 3 + C 4 1 M + 4 λ C H 2 η ω + 2 λ C 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η 2 ω . Proof. Recall the one-step update at ω . Then, we obtain the following upper bound of the tracking error ω (m) t -ω * (θ (m) t ) 2 . 
ω (m) t+1 -ω * (θ (m) t+1 ) 2 ≤ ω (m) t -ω * (θ (m) t ) -η ω h (m) t + ω * (θ (m) t ) -ω * (θ (m) t+1 ) 2 ≤ ω (m) t -ω * (θ (m) t ) 2 -2η ω ω (m) t -ω * (θ (m) t ), h (m) t + 2 ω (m) t -ω * (θ (m) t ), ω * (θ (m) t ) -ω * (θ (m) t+1 ) + 2η 2 ω H 2 + 2 ω * (θ (m) t ) -ω * (θ (m) t+1 ) 2 . Then above inequality can be further bounded as ω (m) t+1 -ω * (θ (m) t+1 ) 2 ≤ ω (m) t -ω * (θ (m) t ) 2 -2η ω ω (m) t -ω * (θ (m) t ), h (m) t + λ C η ω ω (m) t -ω * (θ (m) t ) 2 + 9 λ C + 2L 2 3 η 2 θ η ω G 2 + 2η 2 ω H 2 . Taking conditional expectation on both sides of the above inequality, we obtain that ), -H (m) t ( θ (m) , ω (m) ) + H (m) + λ C η ω ω (m) t -ω * (θ (m) t ) 2 + 9η 2 θ λ C η ω G 2 + 2H 2 η 2 ω + 2L 2 3 G 2 η 2 θ = E m,0 ω (m) t -ω * (θ (m) t ) 2 -2η ω E m,0 ω (m) t -ω * (θ (m) t ), H (m) t (θ (m) t , ω (m) t ) + λ C η ω ω (m) t -ω * (θ (m) t ) 2 + 2H 2 η 2 ω + 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η ω ( ) To further bound the inequality above, we first consider the following explicit form of the pseudogradient term: H (m) t (θ, ω) = (φ (m) t ) ω -δ -ω * (θ (m) t ) ≤ -2ηλ C ω (m) t -ω * (θ (m) t ) 2 . ( ) Substituting eq. ( 27) and eq. ( 26) into eq. ( 25) yields that E ω (m) t+1 -ω * (θ (m) t+1 ) 2 ≤ (1 -λ C η ω )E ω (m) t -ω * (θ (m) t ) 2 -2η ω Eκ (m) t -2η ω Eς (m) t + 2H 2 η 2 ω + 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η ω , where ς ) φ (m) t . Applying Lemma D.5, and Lemma D.6 to the above inequality, we obtain that E ω (m) t+1 -ω * (θ (m) t+1 ) 2 ≤ (1 - 1 2 λ C η ω )E ω (m) t -ω * (θ (m) t ) 2 + 2η ω C 3 + C 4 1 M + 2H 2 η 2 ω + 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η ω . Telescoping the above inequality over one epoch, we obtain that E ω (m) M -ω * (θ (m) M ) 2 ≤ (1 - 1 2 λ C η ω ) M E ω (m) 0 -ω * (θ (m) 0 ) 2 + 2η ω C 3 + C 4 1 M • 1 -(1 -1 2 λ C η ω ) M 1 2 λ C η ω + 2H 2 η 2 ω • 1 -(1 -1 2 λ C η ω ) M 1 2 λ C η ω + 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η ω • 1 -(1 -1 2 λ C η ω ) M 1 2 λ C η ω . 
By definition, ω (m) = ω  1 2 λ C η ω ) M E ω (m) -ω * ( θ (m) ) 2 + 2η ω C 3 + C 4 1 M • 1 -(1 -1 2 λ C η ω ) M 1 2 λ C η ω + 2H 2 η 2 ω • 1 -(1 -1 2 λ C η ω ) M 1 2 λ C η ω + 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η ω • 1 -(1 -1 2 λ C η ω ) M 1 2 λ C η ω . Then, we unroll the inequality above and yield that Proof. By Lemma E.9, ∇J(θ) is smooth. Hence, by the compactness of {θ : θ ≤ R}, we conclude that ∇J(θ) is bounded by a certain constant C ∇J . E ω (m) -ω * ( θ (m) ) 2 ≤ (1 - 1 2 λ C η ω ) mM E ω (0) -ω * ( θ (0) ) 2 + 4 λ C C 3 + C 4 1 M + 4 λ C H 2 η ω + 2 λ C 2L 2 3 G 2 + 9 λ C G 2 η 2 θ η 2 ω .
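The telescoping step in the proof of Lemma D.8 unrolls a contraction of the form $x_{t+1} \le (1-q)x_t + r$ with $q = \frac12\lambda_C\eta_\omega$. The following Python sketch, written for the equality case and with illustrative values of `x0`, `q`, `r` and `steps` that are assumptions rather than values from the paper, compares the iterated recursion with the closed form $(1-q)^M x_0 + r\,\frac{1-(1-q)^M}{q}$.

```python
def unroll_contraction(x0, q, r, steps):
    """Iterate x_{t+1} = (1 - q) * x_t + r for the given number of steps."""
    x = x0
    for _ in range(steps):
        x = (1.0 - q) * x + r
    return x

def closed_form(x0, q, r, steps):
    """Closed form (1 - q)^steps * x0 + r * (1 - (1 - q)^steps) / q."""
    decay = (1.0 - q) ** steps
    return decay * x0 + r * (1.0 - decay) / q

x0, q, r, steps = 5.0, 0.05, 0.01, 200  # illustrative values (assumption)
iterated = unroll_contraction(x0, q, r, steps)
formula = closed_form(x0, q, r, steps)
print(iterated, formula, r / q)
```

As the number of steps grows, the first term decays geometrically and the remainder approaches $r/q$, which matches the structure of the final bound of Lemma D.8: a geometrically decaying initial error plus terms that scale with $1/\lambda_C$.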



FINITE-TIME ANALYSIS OF VR-GREEDY-GQ

In this section, we analyze the finite-time convergence rate of VR-Greedy-GQ. We adopt the following standard technical assumptions from Wang & Zou (2020); Xu et al. (2020).



Figure 1: Comparison of Greedy-GQ and VR-Greedy-GQ in solving the Garnet problem.

Figure 2: Comparison of MSPBE and reward obtained by Greedy-GQ, VR-Greedy-GQ and PG.

Figure 3: Comparison of Greedy-GQ and VR-Greedy-GQ in solving the Frozen Lake problem.

Figure 4: Comparison of MSPBE and reward obtained by Greedy-GQ, VR-Greedy-GQ and PG.

We follow the definition of filtration in VRTD (Appendix D, Xu et al. (2020)). Recall that $B_m$ denotes the set of Markovian samples used in the $m$-th epoch, and we also abuse the notation here by letting $x_t^{(m)}$

Under the same assumptions as those of Theorem 4.5, the square norm of the one-step update of θ (m) t

-C ω -ω * (θ) + C ω -ω * (θ) + (φ (m) t ) ω * (θ) -δ (m) t+1 (θ) φ (m) t .(26)By Assumption 4.3, we have-2η ω E m,0 ω (m) t -ω * (θ (m) t), C ω (m) t

, and the initial parameter for the current inner loop is chosen as the reference parameter, ω (m) 0 = ω (m) and θ (m) 0 = θ (m) . Then we have E ω (m) -ω * ( θ (m) ) 2 ≤ (1 -

Lemma E.1. Within the set $\{\theta : \|\theta\| \le R\}$, there exists a constant $C_{\nabla J}$ such that $\sup_\theta \|\nabla J(\theta)\| \le C_{\nabla J}$. (29)

ACKNOWLEDGEMENT

The work of S. Zou was supported by the National Science Foundation under Grant CCF-2007783. 


Lemma E.2. Let $G := r_{\max} + (1+\gamma)R + \gamma(|A|Rk_1 + 1)R$ be a constant unrelated to $m$ and $t$. Then $\|G_t^{(m)}\| \le G$ for all $m$ and $t$.

Proof. By its definition, we obtain that
$$\|G_t^{(m)}\| = \big\|-\delta_{t+1}(\theta_t)\phi_t + \gamma(\omega_t^\top\phi_t)\phi_{t+1}(\theta_t)\big\| \le \|-\delta_{t+1}(\theta_t)\phi_t\| + \gamma\|(\omega_t^\top\phi_t)\phi_{t+1}(\theta_t)\| \le r_{\max} + (1+\gamma)R + \gamma(|A|Rk_1 + 1)R.$$

Lemma E.3. Let $H = (2+\gamma)R + r_{\max}$ be a constant unrelated to $m$ and $t$. Then $\|H_t^{(m)}\| \le H$ for all $m$ and $t$.

Proof. The result follows from the definition:

Proof. See Lemma 3 of Wang & Zou (2020).

Proof. See Lemma 3 of Wang & Zou (2020).

Lemma E.6. The mapping $\omega^*(\cdot)$ is $L_3$-Lipschitz.

Proof. See eq. (56) of Wang & Zou (2020).

Proof. By definition, we have

Bounding Lyapunov function:

F DETAILS OF EXPERIMENTS

Garnet problem: The Garnet problem Archibald et al. (1995) is specified as $G(n_S, n_A, b, d)$, where $n_S$ and $n_A$ denote the cardinality of the state and action spaces, respectively, $b$ is the branching factor, i.e., the number of states that have strictly positive probability of being visited after an action is taken, and $d$ denotes the dimension of the features. In our experiment, we set $n_S = 5$, $n_A = 3$, $b = 2$, $d = 4$ and generate the features $\Phi \in \mathbb{R}^{n_S \times d}$ via the uniform distribution on $[0, 1]$. We then normalize its rows to have unit norm. Then, we randomly generate a state-action transition kernel $P \in \mathbb{R}^{n_S \times n_A \times n_S}$ via the uniform distribution on $[0, 1]$ (with proper normalization). We set the behavior policy as the uniform policy, i.e., $\pi_b(a|s) = n_A^{-1}$ for any $s$ and $a$. The discount factor is set to be $\gamma = 0.95$. As the transition kernel and the features are known, we compute $\|\nabla J(\theta)\|^2$ to evaluate the performance of all the algorithms. We set the default learning rates as $\eta_\theta = 0.02$ and $\eta_\omega = 0.01$ for both the VR-Greedy-GQ and Greedy-GQ algorithms. For VR-Greedy-GQ, we set the default batch size as $M = 3000$.

Frozen Lake: We generate a Gaussian feature matrix with dimension 8 to linearly approximate the value function and we aim to evaluate a target policy based on a behavior policy. The target policy is generated via the uniform distribution on $[0, 1]$ with proper normalization and the behavior policy is the uniform policy. We set the learning rates as $\eta_\theta = 0.2$ and $\eta_\omega = 0.1$ for both algorithms and set the batch size as $M = 3000$ for VR-Greedy-GQ. We run 200k iterations for each of the 10 trajectories.
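The Garnet construction described above can be sketched as follows, using only the Python standard library. This is a minimal illustration under stated assumptions and not the authors' experiment code: the function name `make_garnet`, the seed, and in particular the way the $b$ reachable next states are selected (a uniformly random subset for each state-action pair) are assumptions, since the text only specifies uniform sampling with proper normalization.

```python
import math
import random

def make_garnet(n_states=5, n_actions=3, branching=2, feat_dim=4, seed=0):
    """Generate features, transition kernel and uniform behavior policy."""
    rng = random.Random(seed)

    # Features: entries uniform on [0, 1], each row scaled to unit Euclidean norm.
    features = []
    for _ in range(n_states):
        row = [rng.random() for _ in range(feat_dim)]
        norm = math.sqrt(sum(v * v for v in row))
        features.append([v / norm for v in row])

    # Kernel P[s][a][s']: only `branching` next states get positive probability.
    # Assumption: the support is a uniformly random subset of the states.
    kernel = []
    for _ in range(n_states):
        per_action = []
        for _ in range(n_actions):
            support = rng.sample(range(n_states), branching)
            weights = [1.0 - rng.random() for _ in support]  # values in (0, 1]
            total = sum(weights)
            probs = [0.0] * n_states
            for s_next, w in zip(support, weights):
                probs[s_next] = w / total
            per_action.append(probs)
        kernel.append(per_action)

    # Uniform behavior policy: pi_b(a | s) = 1 / n_actions.
    behavior = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
    return features, kernel, behavior

features, kernel, behavior = make_garnet()
```

Each `kernel[s][a]` is a probability vector with exactly `branching` positive entries, so the generated MDP matches the $G(5, 3, 2, 4)$ specification used in the experiment.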

Estimated maximum Reward:

In the experiments, we compute the maximum reward as follows: When the policy parameter $\theta_t$ is updated to $\theta_{t+1}$, we estimate the corresponding reward by sampling a trajectory $\{s_1, a_1, s_2, a_2, \ldots, s_N, a_N, s_{N+1}\}$ of the Markov decision process using $\pi_\theta$. Then we estimate the expected reward using the average reward observed along this trajectory, denoted $r_t$. Under the ergodicity assumption, this average reward will tend to the expected reward with respect to the stationary distribution induced by $\pi_\theta$ (Wu et al. (2020)). Then the maximum reward is defined as the maximum estimated expected reward along the training trajectory; that is, Maximum Reward $= \max_t r_t$. In the experiments, we set $N = 100$ when estimating the expected reward.
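The estimation procedure above can be sketched in a few lines of Python. The names `average_reward`, `maximum_reward` and `rollout` are hypothetical: `rollout(theta, n_steps)` stands for any user-supplied routine that samples `n_steps` transitions under $\pi_\theta$ and returns the observed rewards. The toy rollout used for the demonstration is an assumption and is unrelated to the Garnet or Frozen Lake environments.

```python
def average_reward(rewards):
    """Empirical average of the rewards observed along one sampled trajectory."""
    return sum(rewards) / len(rewards)

def maximum_reward(rollout, thetas, n_steps=100):
    """Largest estimated average reward over the sequence of policy parameters."""
    return max(average_reward(rollout(theta, n_steps)) for theta in thetas)

# Toy stand-in for an environment (assumption): the reward equals theta at every step.
def toy_rollout(theta, n_steps):
    return [theta] * n_steps

best = maximum_reward(toy_rollout, thetas=[0.1, 0.5, 0.3], n_steps=100)
print(best)
```

Since each estimate uses a finite trajectory of length $N$, the reported maximum is itself a noisy estimate; a larger `n_steps` reduces this noise at the cost of more samples per evaluation.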

