ASYNCHRONOUS ADVANTAGE ACTOR CRITIC: NON-ASYMPTOTIC ANALYSIS AND LINEAR SPEEDUP

Abstract

Asynchronous and parallel implementations of standard reinforcement learning (RL) algorithms are a key enabler of the tremendous success of modern RL. Among the many asynchronous RL algorithms, arguably the most popular and effective one is the asynchronous advantage actor-critic (A3C) algorithm. Although A3C is becoming the workhorse of RL, its theoretical properties are still not well understood, including its non-asymptotic behavior and the performance gain of parallelism (a.k.a. speedup). This paper revisits the A3C algorithm with TD(0) for the critic update, termed A3C-TD(0), and establishes provable convergence guarantees. With linear value function approximation for the TD update, the convergence of A3C-TD(0) is established under both i.i.d. and Markovian sampling. Under i.i.d. sampling, A3C-TD(0) attains a sample complexity of $\mathcal{O}(\epsilon^{-2.5}/N)$ per worker to achieve $\epsilon$-accuracy, where $N$ is the number of workers. Compared to the best-known sample complexity of $\mathcal{O}(\epsilon^{-2.5})$ for two-timescale AC, A3C-TD(0) achieves linear speedup, which justifies the advantage of parallelism and asynchrony in AC algorithms theoretically for the first time. Numerical tests on synthetically generated instances and OpenAI Gym environments are provided to verify our theoretical analysis.

Our contributions. Compared to the existing literature on both AC algorithms and async-SGD, our contributions can be summarized as follows. c1) We revisit two-timescale A3C-TD(0) and establish its convergence rates with both i.i.d. and Markovian sampling. To the best of our knowledge, this is the first non-asymptotic convergence result for asynchronous parallel AC algorithms. c2) We characterize the sample complexity of A3C-TD(0). In the i.i.d. setting, A3C-TD(0) achieves a sample complexity of $\mathcal{O}(\epsilon^{-2.5}/N)$ per worker, where $N$ is the number of workers. Compared to the best-known complexity of $\mathcal{O}(\epsilon^{-2.5})$ for i.i.d. two-timescale AC [18], A3C-TD(0) achieves linear speedup, thanks to parallelism and asynchrony.
In the Markovian setting, if the delay is bounded, the sample complexity of A3C-TD(0) matches the order of the non-parallel AC algorithm [17].

1. INTRODUCTION

Reinforcement learning (RL) has achieved impressive performance in many domains such as robotics [1, 2] and video games [3]. However, these empirical successes often come at the expense of significant computation. To unlock high computation capabilities, state-of-the-art RL approaches rely on sampling data from massive parallel simulators on multiple machines [3, 4, 5]. Empirically, these approaches can stabilize the learning process and reduce training time when they are implemented in an asynchronous manner. One popular RL method that often achieves the best empirical performance is the asynchronous variant of the actor-critic (AC) algorithm, referred to as A3C [3].

A3C builds on the original AC algorithm [6]. At a high level, AC simultaneously performs policy optimization (a.k.a. the actor step) using the policy gradient method [7] and policy evaluation (a.k.a. the critic step) using the temporal difference (TD) learning algorithm [8]. To ensure scalability, both the actor and critic steps can be combined with various function approximation techniques. To ensure stability, AC is often implemented in a two-timescale fashion, where the actor step runs on the slow timescale and the critic step runs on the fast timescale. Like other on-policy RL algorithms, AC uses samples generated from the target policy. Thus, data sampling is entangled with the learning procedure, which generates significant overhead. To speed up the sampling process of AC, A3C introduces multiple workers with a shared policy, and each worker has its own simulator to perform data sampling. The shared policy can then be updated using the samples collected from the multiple workers.

Despite the tremendous empirical success achieved by A3C, to the best of our knowledge, its theoretical properties are not well understood. The following theoretical questions remain unclear: Q1) Under what assumptions does A3C converge? Q2) What is its convergence rate?
Q3) Can A3C obtain a benefit (or speedup) from parallelism and asynchrony? For Q3), we are interested in the training-time linear speedup with $N$ workers, which is the ratio between the training time using a single worker and that using $N$ workers. Since asynchronous parallelism mitigates the effect of stragglers and keeps all workers busy, the training-time speedup can be measured roughly by the sample (i.e., computational) complexity linear speedup [9], given by
$$\mathrm{Speedup}(N) = \frac{\text{sample complexity when using one worker}}{\text{average sample complexity per worker when using } N \text{ workers}}. \tag{1}$$
If $\mathrm{Speedup}(N) = \Theta(N)$, the speedup is linear, and the training time roughly reduces linearly as the number of workers increases. This paper aims to answer these questions, towards the goal of providing theoretical justification for the empirical successes of parallel and asynchronous RL.

1.1. RELATED WORKS

Analysis of actor-critic algorithms. The AC method was first proposed by [6, 10], with asymptotic convergence guarantees provided in [6, 10, 11]. It was not until recently that non-asymptotic analyses of AC were established. The finite-sample guarantee for the batch AC algorithm was established in [12, 13] with i.i.d. sampling. Later, in [14], a finite-sample analysis was established for the double-loop nested AC algorithm under the Markovian setting. An improved analysis for the Markovian setting with minibatch updates was presented in [15] for the nested AC method. More recently, [16, 17] provided the first finite-time analyses for two-timescale AC algorithms under Markovian sampling, both with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity, which is the best-known sample complexity for two-timescale AC. Through the lens of bilevel optimization, [18] also provided finite-sample guarantees for this two-timescale Markovian sampling setting, with global optimality guarantees when a natural policy gradient step is used in the actor. However, none of the existing works has analyzed the effect of asynchronous and parallel updates in AC.

Empirical parallel and distributed AC. In [3], the original A3C method was proposed and became the workhorse of empirical RL. Later, [19] provided a GPU version of A3C which significantly decreases training time. Recently, the A3C algorithm was further optimized for modern computers by [20], where a large-batch variant of A3C with improved efficiency was also proposed. In [21], an importance-weighted distributed AC algorithm, IMPALA, was developed to solve a collection of problems with one single set of parameters. Recently, a gossip-based distributed yet synchronous AC algorithm was proposed in [5], which achieved performance competitive with A3C.

Asynchronous stochastic optimization. For solving general optimization problems, asynchronous stochastic methods have received much attention recently.
The study of asynchronous stochastic methods can be traced back to the 1980s [22]. With batch size $M$, [23] analyzed asynchronous SGD (async-SGD) for convex functions and derived a convergence rate of $\mathcal{O}(K^{-1/2}M^{-1/2})$ if the delay $K_0$ is bounded by $\mathcal{O}(K^{1/4})$. This result implies linear speedup. [24] extended the analysis of [23] to smooth convex problems with nonsmooth regularization and derived a similar rate. A recent study [25] improved the upper bound on $K_0$ to $\mathcal{O}(K^{1/4}M^{-1/2})$. However, all these works focus on single-timescale SGD with a single variable, which cannot capture the stochastic recursion of the AC and A3C algorithms. To the best of our knowledge, the non-asymptotic analysis of asynchronous two-timescale SGD has remained unaddressed, and its speedup analysis is even an uncharted territory.

1.2. THIS WORK

In this context, we revisit A3C with TD(0) for the critic update, termed A3C-TD(0). The hope is to provide a non-asymptotic guarantee and a linear-speedup justification for this popular algorithm. c3) We test A3C-TD(0) on a synthetically generated environment to verify our theoretical guarantees with both i.i.d. and Markovian sampling. We also test A3C-TD(0) on classic control tasks and Atari games from OpenAI Gym. Code is available in the supplementary material.

Technical challenges. Compared to the recent convergence analyses of non-parallel two-timescale AC in [16, 17, 18], several new challenges arise due to parallelism and asynchrony.

Markovian noise coupled with asynchrony and delay. The analysis of the two-timescale AC algorithm is non-trivial because of the Markovian noise coupled with both the actor and critic steps. Different from non-parallel AC, which involves only a single Markov chain, asynchronous parallel AC introduces multiple Markov chains (one per worker) that mix at different speeds. This is because, at a given iteration, the workers have collected different numbers of samples and thus their chains have mixed to different degrees. As we will show later, the worker with the slowest-mixing chain determines the convergence.

Linear speedup for SGD with two coupled sequences. Parallel async-SGD has recently been shown to achieve linear speedup [9, 26]. Different from async-SGD, asynchronous AC is a two-timescale stochastic semi-gradient algorithm for solving the more challenging bilevel optimization problem (see [18]). The errors induced by asynchrony and delay are intertwined with both the actor and critic updates via a nested structure, which makes a sharp analysis more challenging. Our linear speedup analysis should also be distinguished from that of mini-batch async-SGD [27], where the speedup is a result of variance reduction thanks to the larger batch size generated by parallel workers.

2.1. MARKOV DECISION PROCESS AND POLICY GRADIENT THEOREM

RL problems are often modeled as an MDP described by $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma\}$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(s'|s,a)$ is the probability of transitioning to $s' \in \mathcal{S}$ given the current state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$, $r(s,a,s')$ is the reward associated with the transition $(s,a,s')$, and $\gamma \in [0,1)$ is a discount factor. Throughout the paper, we assume the reward $r$ is upper-bounded by a constant $r_{\max}$. A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ is a mapping from the state space $\mathcal{S}$ to probability distributions over the action space $\mathcal{A}$. Considering discrete time $t$ over an infinite horizon, a policy $\pi$ generates a trajectory of state-action pairs $(s_0, a_0, s_1, a_1, \ldots)$ with $a_t \sim \pi(\cdot|s_t)$ and $s_{t+1} \sim \mathcal{P}(\cdot|s_t, a_t)$. Given a policy $\pi$, we define the state and state-action value functions as
$$V_\pi(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s\Big], \qquad Q_\pi(s,a) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s,\, a_0 = a\Big] \tag{2}$$
where the expectation $\mathbb{E}$ is taken over the trajectory $(s_0, a_0, s_1, a_1, \ldots)$ generated under policy $\pi$. With the above definitions, the advantage function is $A_\pi(s,a) := Q_\pi(s,a) - V_\pi(s)$. With $\eta$ denoting the initial state distribution, the discounted state visitation measure induced by policy $\pi$ is defined as $d_\pi(s) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \mathbb{P}(s_t = s \mid s_0 \sim \eta, \pi)$, and the discounted state-action visitation measure is $d_\pi(s,a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \mathbb{P}(s_t = s \mid s_0 \sim \eta, \pi)\, \pi(a|s)$. The goal of RL is to find a policy that maximizes the expected cumulative reward $J(\pi) := \mathbb{E}_{s \sim \eta}[V_\pi(s)]$. When the state and action spaces are large, finding the optimal policy $\pi^*$ becomes computationally intractable. To overcome the inherent difficulty of learning a function, policy gradient methods search for the best-performing policy over a class of parameterized policies. We parameterize the policy with parameter $\theta \in \mathbb{R}^d$, and solve the optimization problem $\max_{\theta \in \mathbb{R}^d} J(\theta)$ with $J(\theta) := \mathbb{E}_{s \sim \eta}[V_{\pi_\theta}(s)]$.
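To make the discounted-return definitions above concrete, here is a minimal Monte-Carlo sketch (our own illustration, not the paper's code), assuming a tabular MDP with a transition tensor `P[s, a, s']`, a reward tensor `R[s, a, s']`, and a stochastic policy table `pi[s, a]`:

```python
import numpy as np

def rollout_return(P, R, pi, s0, gamma=0.9, T=200, rng=None):
    """Discounted return of one trajectory sampled under policy pi:
    a Monte-Carlo estimate of V_pi(s0), truncated at horizon T."""
    rng = rng if rng is not None else np.random.default_rng(0)
    s, ret, disc = s0, 0.0, 1.0
    for _ in range(T):
        a = rng.choice(P.shape[1], p=pi[s])          # a_t ~ pi(.|s_t)
        s_next = rng.choice(P.shape[2], p=P[s, a])   # s_{t+1} ~ P(.|s_t, a_t)
        ret += disc * R[s, a, s_next]                # accumulate gamma^t * r_t
        disc *= gamma
        s = s_next
    return ret
```

On a one-state, one-action MDP with constant reward 1, the truncated return approaches $\sum_{t=0}^{T-1}\gamma^t = (1-\gamma^T)/(1-\gamma)$, matching the geometric series in (2).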
To maximize $J(\theta)$ with respect to $\theta$, one can update $\theta$ along the policy gradient direction given by [7]
$$\nabla J(\theta) = \mathbb{E}_{(s,a) \sim d_\theta}\big[A_{\pi_\theta}(s,a)\, \psi_\theta(s,a)\big], \tag{4}$$
where $\psi_\theta(s,a) := \nabla \log \pi_\theta(a|s)$ is the score function, and $d_\theta(s,a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \mathbb{P}(s_t = s \mid s_0 \sim \eta, \pi_\theta)\, \pi_\theta(a|s)$. Since computing the expectation in (4) exactly is expensive if not impossible, popular policy-gradient-based algorithms such as REINFORCE [28] and G(PO)MDP [29] iteratively update $\theta$ using a stochastic estimate of (4).

2.2. ACTOR-CRITIC ALGORITHM WITH VALUE FUNCTION APPROXIMATION

Both REINFORCE and G(PO)MDP-based policy gradient algorithms rely on a Monte-Carlo estimate of the value function $V_{\pi_\theta}(s)$, and thus of $\nabla J(\theta)$, obtained by generating a trajectory per iteration. However, policy gradient methods based on Monte-Carlo estimates typically suffer from high variance and large sampling cost. An alternative is to recursively refine the estimate of $V_{\pi_\theta}(s)$. For a policy $\pi_\theta$, it is known that $V_{\pi_\theta}(s)$ satisfies the Bellman equation [30], that is,
$$V_{\pi_\theta}(s) = \mathbb{E}_{a \sim \pi_\theta(\cdot|s),\, s' \sim \mathcal{P}(\cdot|s,a)}\big[r(s,a,s') + \gamma V_{\pi_\theta}(s')\big], \quad \forall s \in \mathcal{S}.$$
In practice, when the state space $\mathcal{S}$ is prohibitively large, one cannot afford the computational and memory complexity of computing $V_{\pi_\theta}(s)$ and $A_{\pi_\theta}(s,a)$ exactly. To overcome this curse of dimensionality, a popular method is to approximate the value function using function approximation techniques. Given a state feature mapping $\phi(\cdot): \mathcal{S} \to \mathbb{R}^d$ for some $d > 0$, we approximate the value function linearly as $V_{\pi_\theta}(s) \approx \hat{V}_\omega(s) := \phi(s)^\top \omega$, where $\omega \in \mathbb{R}^d$ is the critic parameter. Given a policy $\pi_\theta$, the task of finding the best $\omega$ such that $V_{\pi_\theta}(s) \approx \hat{V}_\omega(s)$ is usually addressed by TD learning [8]. Defining the $k$th transition as $x_k := (s_k, a_k, s_{k+1})$ and the corresponding TD error as $\delta(x_k, \omega_k) := r(s_k, a_k, s_{k+1}) + \gamma \phi(s_{k+1})^\top \omega_k - \phi(s_k)^\top \omega_k$, the parameter $\omega$ is updated via
$$\omega_{k+1} = \Pi_{R_\omega}\big(\omega_k + \beta_k\, g(x_k, \omega_k)\big) \quad \text{with} \quad g(x_k, \omega_k) := \delta(x_k, \omega_k)\, \nabla_{\omega_k} \hat{V}_{\omega_k}(s_k)$$
where $\beta_k$ is the critic stepsize and $\Pi_{R_\omega}$ is the projection onto a ball of pre-defined radius $R_\omega$. The projection step is often used to control the norm of the gradient. In AC, it prevents the actor and critic updates from taking too large a step in the 'wrong' direction; see, e.g., [6, 16, 17].
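The projected TD(0) recursion above can be sketched in a few lines (an illustration only; the projection radius here is an assumed placeholder):

```python
import numpy as np

def td0_step(omega, phi_s, phi_s_next, reward, beta, gamma=0.9, radius=10.0):
    """One projected TD(0) step with linear approximation V_omega(s) = phi(s)^T omega.
    Returns the updated critic parameter and the TD error delta."""
    delta = reward + gamma * (phi_s_next @ omega) - phi_s @ omega  # TD error
    omega = omega + beta * delta * phi_s        # semi-gradient g(x_k, omega_k) = delta * phi(s_k)
    norm = np.linalg.norm(omega)                # projection Pi_{R_omega} onto the ball
    if norm > radius:
        omega = omega * (radius / norm)
    return omega, delta
```

Note that $\nabla_\omega \hat{V}_\omega(s_k) = \phi(s_k)$ for a linear critic, which is why the semi-gradient is simply the TD error times the feature vector.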

Using the definition of advantage function

$$A_{\pi_\theta}(s,a) = \mathbb{E}_{s' \sim \mathcal{P}}\big[r(s,a,s') + \gamma V_{\pi_\theta}(s')\big] - V_{\pi_\theta}(s),$$
we can also rewrite (4) as $\nabla J(\theta) = \mathbb{E}_{(s,a) \sim d_\theta,\, s' \sim \mathcal{P}}\big[\big(r(s,a,s') + \gamma V_{\pi_\theta}(s') - V_{\pi_\theta}(s)\big)\, \psi_\theta(s,a)\big]$. Leveraging the value function approximation, we can then approximate the policy gradient as
$$\widehat{\nabla} J(\theta) = \big(r(s,a,s') + \gamma \hat{V}_\omega(s') - \hat{V}_\omega(s)\big)\, \psi_\theta(s,a) = \delta(x, \omega)\, \psi_\theta(s,a)$$
which gives rise to the policy update
$$\theta_{k+1} = \theta_k + \alpha_k\, v(x_k, \theta_k, \omega_k) \quad \text{with} \quad v(x_k, \theta_k, \omega_k) := \delta(x_k, \omega_k)\, \psi_{\theta_k}(s_k, a_k)$$
where $\alpha_k$ is the stepsize for the actor update. To ensure convergence when simultaneously performing the critic and actor updates, the stepsizes $\alpha_k$ and $\beta_k$ often decay at two different rates, which is referred to as two-timescale AC [17, 18].
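The critic and actor recursions above can be sketched together in one function (a minimal illustration under assumed `features` and `score` callables; the projection radius is a placeholder):

```python
import numpy as np

def ac_step(theta, omega, transition, features, score, alpha, beta,
            gamma=0.9, radius=10.0):
    """One two-timescale actor-critic step: the shared TD error delta drives
    both the critic direction g and the actor direction v."""
    s, a, r, s_next = transition
    phi_s, phi_sn = features(s), features(s_next)
    delta = r + gamma * (phi_sn @ omega) - phi_s @ omega   # TD error delta(x, omega)
    omega = omega + beta * delta * phi_s                   # critic: fast stepsize beta_k
    norm = np.linalg.norm(omega)
    if norm > radius:                                      # projection Pi_{R_omega}
        omega = omega * (radius / norm)
    theta = theta + alpha * delta * score(s, a, theta)     # actor: slow stepsize alpha_k
    return theta, omega
```

The two-timescale behavior comes entirely from choosing $\alpha_k$ to decay faster than $\beta_k$, so the policy is quasi-static from the critic's point of view.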

3. ASYNCHRONOUS ADVANTAGE ACTOR CRITIC WITH TD(0)

To speed up training, we implement AC over $N$ workers in a shared-memory setting without coordination among workers, a setting similar to that of A3C [3]. Each worker has its own simulator to perform sampling, and the workers collaboratively update the shared policy $\pi_\theta$ using AC updates. As there is no synchronization after each update, the policy used by a worker to generate samples may be outdated, which introduces staleness.

Notations on transition $(s,a,s')$. Since each worker maintains a separate Markov chain, we hereafter use the subscript $t$ in $(s_t, a_t, s_{t+1})$ to indicate the $t$th transition on a Markov chain. We use $k$ to denote the global counter (or iteration), which increases by one whenever a worker finishes the actor and critic updates in the shared memory. We use the subscript $(k)$ in $(s_{(k)}, a_{(k)}, s'_{(k)})$ to indicate the transition used in the $k$th update. Specifically, we initialize $\theta_0, \omega_0$ in the shared memory, and each worker initializes its simulator with initial state $s_0$. Without coordination, each worker reads $\theta, \omega$ from the shared memory and, after obtaining $x_t := (s_t, a_t, s_{t+1})$, locally computes the TD error $\delta(x_t, \omega) = r(s_t, a_t, s_{t+1}) + \gamma \hat{V}_\omega(s_{t+1}) - \hat{V}_\omega(s_t)$, the TD(0) update $g(x_t, \omega) = \delta(x_t, \omega)\, \nabla_\omega \hat{V}_\omega(s_t)$, and the policy gradient $v(x_t, \theta, \omega) = \delta(x_t, \omega)\, \psi_\theta(s_t, a_t)$, and then updates the parameters in the shared memory asynchronously by
$$\omega_{k+1} = \Pi_{R_\omega}\big(\omega_k + \beta_k\, g(x_{(k)}, \omega_{k-\tau_k})\big), \qquad \theta_{k+1} = \theta_k + \alpha_k\, v(x_{(k)}, \theta_{k-\tau_k}, \omega_{k-\tau_k}), \tag{9}$$
where $\tau_k$ is the delay in the $k$th actor and critic updates. See A3C with TD(0) in Algorithm 1.

Sampling distributions. Since the transition kernels required by the actor and critic updates differ in the discounted MDP, it is difficult to design a two-timescale AC algorithm.
To address this issue, we adopt the sampling method introduced in the seminal works [6, 31] and the recent works [15, 16], which inevitably introduces bias by sampling from an artificial transition kernel $\hat{\mathcal{P}}$ instead of $\mathcal{P}$. However, as we will discuss later, this extra bias is small when the discount factor $\gamma$ is close to 1.

Parallel sampling. The AC updates (6) and (8) use samples generated "on-the-fly" from the target policy $\pi_\theta$, which brings overhead. Compared with (6) and (8), the A3C-TD(0) update (9) allows parallel sampling from $N$ workers, which is the key to linear speedup. We consider the case where only one worker can update the parameters in the shared memory at a time and the update cannot be interrupted. In practice, (9) can also be performed in a mini-batch fashion.

Minor differences from A3C [3]. The A3C-TD(0) algorithm resembles the popular A3C method [3]. With $n_{\max}$ denoting the step horizon, for $n \in \{1, \ldots, n_{\max}\}$, A3C iteratively uses $n$-step TD errors to compute the actor and critic gradients. In A3C-TD(0), we use the TD(0) method, i.e., the 1-step TD method, for the actor and critic updates; when $n_{\max} = 1$, the A3C method reduces to A3C-TD(0). The $n$-step TD method is a hybrid of the TD(0) method and the Monte-Carlo method. The A3C method with Monte-Carlo sampling is essentially a delayed policy gradient method, and thus its convergence follows directly from that of delayed SGD. Therefore, we believe that the convergence of the TD(0)-based A3C method analyzed in this paper can be extended to the A3C method with $n$-step TD; we focus on A3C with TD(0) for ease of exposition.

4. CONVERGENCE ANALYSIS OF TWO-TIMESCALE A3C-TD(0)

In this section, we analyze the convergence of A3C-TD(0) in both the i.i.d. and Markovian settings. Throughout this section, the notation $\mathcal{O}(\cdot)$ contains constants that are independent of $N$ and $K_0$. To analyze the performance of A3C-TD(0), we make the following assumptions.

Assumption 1. There exists $K_0$ such that the delay at each iteration is bounded, i.e., $\tau_k \le K_0$, $\forall k$.

Assumption 1 ensures the viability of analyzing the asynchronous update; see the same assumption in, e.g., [5, 25]. In practice, the delay usually scales with the number of workers, that is, $K_0 = \Theta(N)$. With $\hat{\mathcal{P}}^{\pi_\theta}(s'|s) = \sum_{a \in \mathcal{A}} \hat{\mathcal{P}}(s'|s,a)\, \pi_\theta(a|s)$, we define
$$A_{\theta,\phi} := \mathbb{E}_{s \sim \mu_\theta,\, s' \sim \hat{\mathcal{P}}^{\pi_\theta}}\big[\phi(s)\big(\gamma \phi(s') - \phi(s)\big)^\top\big], \qquad b_{\theta,\phi} := \mathbb{E}_{s \sim \mu_\theta,\, a \sim \pi_\theta,\, s' \sim \hat{\mathcal{P}}}\big[r(s,a,s')\, \phi(s)\big].$$
It is known that, for a given $\theta$, the stationary point $\omega^*_\theta$ of the TD(0) update in Algorithm 1 satisfies $A_{\theta,\phi}\, \omega^*_\theta + b_{\theta,\phi} = 0$.

Assumption 2. For all $s \in \mathcal{S}$, the feature vector is normalized so that $\|\phi(s)\|_2 \le 1$. For all $\gamma \in [0,1]$ and $\theta \in \mathbb{R}^d$, $A_{\theta,\phi}$ is negative definite and its maximum eigenvalue is upper bounded by $-\lambda$.

Assumption 2 is common in analyzing TD with linear function approximation; see, e.g., [17, 32, 33]. With this assumption, $A_{\theta,\phi}$ is invertible, so we have $\omega^*_\theta = -A_{\theta,\phi}^{-1} b_{\theta,\phi}$. Defining $R_\omega := r_{\max}/\lambda$, we then have $\|\omega^*_\theta\|_2 \le R_\omega$, which justifies the projection introduced in Algorithm 1. In practice, the projection radius $R_\omega$ can be estimated online by the methods proposed in [32, Section 8.2] or [34, Lemma 1].

Assumption 3. For any $\theta, \theta' \in \mathbb{R}^d$, $s \in \mathcal{S}$ and $a \in \mathcal{A}$, there exist constants such that: i) $\|\psi_\theta(s,a)\|_2 \le C_\psi$; ii) $\|\psi_\theta(s,a) - \psi_{\theta'}(s,a)\|_2 \le L_\psi \|\theta - \theta'\|_2$; iii) $|\pi_\theta(a|s) - \pi_{\theta'}(a|s)| \le L_\pi \|\theta - \theta'\|_2$.

Assumption 3 is common in analyzing policy-gradient-type algorithms and has also been made by, e.g., [34, 35, 36]. This assumption holds for many policy parameterizations such as the tabular softmax policy [36], the Gaussian policy [37], and the Boltzmann policy [31]. Assumption 4.
For any $\theta \in \mathbb{R}^d$, the Markov chain under policy $\pi_\theta$ and transition kernel $\mathcal{P}(\cdot|s,a)$ or $\hat{\mathcal{P}}(\cdot|s,a)$ is irreducible and aperiodic. Then there exist constants $\kappa > 0$ and $\rho \in (0,1)$ such that
$$\sup_{s \in \mathcal{S}} d_{TV}\big(\mathbb{P}(s_t \in \cdot \mid s_0 = s, \pi_\theta),\, \mu_\theta\big) \le \kappa \rho^t, \quad \forall t$$
where $\mu_\theta$ is the stationary state distribution under $\pi_\theta$, and $s_t$ is the state of the Markov chain at time $t$.

Assumption 4 states that the Markov chain mixes at a geometric rate; see also [32, 33]. The stationary distribution $\mu_\theta$ of the artificial Markov chain with transition kernel $\hat{\mathcal{P}}$ is the same as the discounted visitation measure $d_\theta$ of the Markov chain with transition kernel $\mathcal{P}$ [6]. This means that if we sample according to $a_t \sim \pi_\theta(\cdot|s_t)$, $s_{t+1} \sim \hat{\mathcal{P}}(\cdot|s_t, a_t)$, the marginal distribution of $(s_t, a_t)$ converges to the discounted state-action visitation measure $d_\theta(s,a)$, which allows us to control the gradient bias.

4.1. CONVERGENCE RESULT WITH I.I.D. SAMPLING

In this section, we consider A3C-TD(0) under the i.i.d. sampling setting, which is widely used for analyzing RL algorithms; see, e.g., [13, 18, 38]. We first give the convergence result for the critic update.

Theorem 1 (Critic convergence). Suppose Assumptions 1-4 hold. Consider Algorithm 1 with i.i.d. sampling and $\hat{V}_\omega(s) = \phi(s)^\top \omega$. Select the stepsizes $\alpha_k = \frac{c_1}{(1+k)^{\sigma_1}}$, $\beta_k = \frac{c_2}{(1+k)^{\sigma_2}}$, where $0 < \sigma_2 < \sigma_1 < 1$ and $c_1, c_2$ are positive constants. Then it holds that
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big\|\omega_k - \omega^*_{\theta_k}\big\|_2^2 = \mathcal{O}\Big(\frac{1}{K^{1-\sigma_2}}\Big) + \mathcal{O}\Big(\frac{1}{K^{2(\sigma_1-\sigma_2)}}\Big) + \mathcal{O}\Big(\frac{K_0^2}{K^{2\sigma_2}}\Big) + \mathcal{O}\Big(\frac{K_0}{K^{\sigma_1}}\Big) + \mathcal{O}\Big(\frac{1}{K^{\sigma_2}}\Big). \tag{13}$$
Different from async-SGD (e.g., [9]), the optimal critic parameter $\omega^*_\theta$ is constantly drifting as $\theta$ changes at each iteration. This necessitates setting $\sigma_1 > \sigma_2$ so that the policy changes more slowly than the critic, which can be observed in the second term in (13): if $\sigma_1 > \sigma_2$, then the policy is static relative to the critic in an asymptotic sense.
To introduce the convergence of the actor update, we first define the critic approximation error as
$$\epsilon_{\mathrm{app}} := \max_{\theta \in \mathbb{R}^d}\, \mathbb{E}_{s \sim \mu_\theta}\big|V_{\pi_\theta}(s) - \hat{V}_{\omega^*_\theta}(s)\big|^2 \le \epsilon_{\mathrm{fa}} + \epsilon_{\mathrm{sp}},$$
where $\mu_\theta$ is the stationary distribution under $\pi_\theta$ and $\hat{\mathcal{P}}$. The error $\epsilon_{\mathrm{app}}$ captures the quality of the critic approximation under Algorithm 1. It can be decomposed into the function approximation error $\epsilon_{\mathrm{fa}}$, which is common in analyzing AC with function approximation [14, 15, 17], and the sampling error $\epsilon_{\mathrm{sp}} = \mathcal{O}(1-\gamma)$, which is unique to analyzing two-timescale AC for a discounted MDP. The error $\epsilon_{\mathrm{app}}$ is small when the value function approximation is accurate and the discount factor $\gamma$ is close to 1; see the detailed derivations in Lemma 7 of the supplementary material. Now we are ready to give the actor convergence.

Theorem 2 (Actor convergence). Under the same assumptions as Theorem 1, select the stepsizes $\alpha_k = \frac{c_1}{(1+k)^{\sigma_1}}$, $\beta_k = \frac{c_2}{(1+k)^{\sigma_2}}$, where $0 < \sigma_2 < \sigma_1 < 1$ and $c_1, c_2$ are positive constants. Then it holds that
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big\|\nabla J(\theta_k)\big\|_2^2 = \mathcal{O}\Big(\frac{1}{K^{1-\sigma_1}}\Big) + \mathcal{O}\Big(\frac{K_0}{K^{\sigma_1}}\Big) + \mathcal{O}\Big(\frac{K_0^2}{K^{2\sigma_2}}\Big) + \mathcal{O}\Big(\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big\|\omega_k - \omega^*_{\theta_k}\big\|_2^2\Big) + \mathcal{O}(\epsilon_{\mathrm{app}}). \tag{15}$$
Different from the analysis of async-SGD, in the actor update the stochastic gradient $v(x, \theta, \omega)$ is biased because of the inexact value function approximation. The bias introduced by the critic optimality gap and the function approximation error corresponds to the last two terms in (15). In Theorems 1 and 2, optimizing $\sigma_1$ and $\sigma_2$ gives the following convergence rate.

Corollary 1 (Linear speedup). Given Theorems 1 and 2, select $\sigma_1 = \frac{3}{5}$ and $\sigma_2 = \frac{2}{5}$. If we further assume $K_0 = \mathcal{O}(K^{\frac{1}{5}})$, then it holds that
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big\|\nabla J(\theta_k)\big\|_2^2 = \mathcal{O}\big(K^{-\frac{2}{5}}\big) + \mathcal{O}(\epsilon_{\mathrm{app}}) \tag{16}$$
where $\mathcal{O}(\cdot)$ contains constants that are independent of $N$ and $K_0$. Setting the first term in (16) to $\epsilon$, the total iteration complexity to reach $\epsilon$-accuracy is $\mathcal{O}(\epsilon^{-2.5})$. Since each iteration uses only one sample (one transition), this also implies a total sample complexity of $\mathcal{O}(\epsilon^{-2.5})$.
The average sample complexity per worker is then $\mathcal{O}(\epsilon^{-2.5}/N)$, which indicates linear speedup in the sense of (1). Intuitively, the negative effect of the parameter staleness introduced by parallel asynchrony vanishes asymptotically, which implies linear speedup in terms of convergence.
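The stepsize schedules and per-worker complexity of Corollary 1 can be made concrete with a small sketch (the constants `c1`, `c2`, and `c` are placeholders, not values from the paper):

```python
def stepsizes(k, c1=1.0, c2=1.0, sigma1=0.6, sigma2=0.4):
    """Two-timescale schedules with the exponents of Corollary 1:
    actor alpha_k = c1/(1+k)^{3/5} decays faster than critic beta_k = c2/(1+k)^{2/5}."""
    return c1 / (1 + k) ** sigma1, c2 / (1 + k) ** sigma2

def per_worker_samples(eps, num_workers, c=1.0):
    """Per-worker sample complexity O(eps^{-2.5}/N): the total budget
    c * eps^{-2.5} is split evenly across N parallel workers."""
    return c * eps ** -2.5 / num_workers
```

Dividing the single-worker count by the $N$-worker per-worker count recovers $\mathrm{Speedup}(N) = N$, i.e., linear speedup in the sense of (1).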

4.2. CONVERGENCE RESULT WITH MARKOVIAN SAMPLING

Theorem 3 (Critic convergence). Suppose Assumptions 1-4 hold. Consider Algorithm 1 with Markovian sampling and $\hat{V}_\omega(s) = \phi(s)^\top \omega$. Select the stepsizes $\alpha_k = \frac{c_1}{(1+k)^{\sigma_1}}$, $\beta_k = \frac{c_2}{(1+k)^{\sigma_2}}$, where $0 < \sigma_2 < \sigma_1 < 1$ and $c_1, c_2$ are positive constants. Then it holds that
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big\|\omega_k - \omega^*_{\theta_k}\big\|_2^2 = \mathcal{O}\Big(\frac{1}{K^{1-\sigma_2}}\Big) + \mathcal{O}\Big(\frac{1}{K^{2(\sigma_1-\sigma_2)}}\Big) + \mathcal{O}\Big(\frac{K_0^2}{K^{2\sigma_2}}\Big) + \mathcal{O}\Big(\frac{K_0^2 \log^2 K}{K^{\sigma_1}}\Big) + \mathcal{O}\Big(\frac{K_0 \log K}{K^{\sigma_2}}\Big). \tag{17}$$
The following theorem gives the convergence rate of the actor update in Algorithm 1.

Theorem 4 (Actor convergence). Under the same assumptions as Theorem 3, select the stepsizes $\alpha_k = \frac{c_1}{(1+k)^{\sigma_1}}$, $\beta_k = \frac{c_2}{(1+k)^{\sigma_2}}$, where $0 < \sigma_2 < \sigma_1 < 1$ and $c_1, c_2$ are positive constants. Then it holds that
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big\|\nabla J(\theta_k)\big\|_2^2 = \mathcal{O}\Big(\frac{1}{K^{1-\sigma_1}}\Big) + \mathcal{O}\Big(\frac{K_0^2 \log^2 K}{K^{\sigma_1}}\Big) + \mathcal{O}\Big(\frac{K_0^2}{K^{2\sigma_2}}\Big) + \mathcal{O}\Big(\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big\|\omega_k - \omega^*_{\theta_k}\big\|_2^2\Big) + \mathcal{O}(\epsilon_{\mathrm{app}}). \tag{18}$$
Given Theorems 3 and 4, select $\sigma_1 = \frac{3}{5}$ and $\sigma_2 = \frac{2}{5}$, and assume $K_0 = \mathcal{O}(K^{\frac{1}{5}})$. Then it holds that
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big\|\nabla J(\theta_k)\big\|_2^2 = \mathcal{O}\big(K_0 K^{-\frac{2}{5}}\big) + \mathcal{O}(\epsilon_{\mathrm{app}}),$$
where $\mathcal{O}(\cdot)$ hides constants and logarithmic factors of $K$.

With Markovian sampling, the stochastic gradients $g(x, \omega)$ and $v(x, \theta, \omega)$ are biased, and the bias decreases as the Markov chain mixes. The mixing time corresponds to the logarithmic term $\log K$ in (17) and (18). Because of asynchrony, at a given iteration the workers have collected different numbers of samples, so their chains have mixed to different degrees; the worker with the slowest-mixing chain determines the rate of convergence. The product of $K_0$ and $\log K$ in (17) and (18) appears due to this slowest-mixing chain. As the last term in (17) dominates the other terms asymptotically, the convergence rate degrades as the number of workers increases. While the theoretical linear speedup is difficult to establish in the Markovian setting, we empirically demonstrate it in Section 5.2.

5. NUMERICAL EXPERIMENTS

We test the speedup performance of A3C-TD(0) on both synthetically generated and OpenAI Gym environments. The settings, parameters, and code are provided in the supplementary material.

5.1. A3C-TD(0) IN SYNTHETIC ENVIRONMENT

To verify the theoretical results, we tested A3C-TD(0) with linear value function approximation in a synthetic environment. We use the tabular softmax policy parameterization [36], which satisfies Assumption 3. The MDP has a state space of size $|\mathcal{S}| = 100$ and a discrete action space of size $|\mathcal{A}| = 5$. Each state feature has dimension 10. The elements of the transition matrix, the rewards, and the state features are sampled from a uniform distribution over $(0,1)$. We evaluate the convergence of the actor and the critic with the running average of the test reward and the critic optimality gap $\|\omega_k - \omega^*_{\theta_k}\|_2$, respectively. Figures 1 and 2 show the training time and sample complexity of running A3C-TD(0) with i.i.d. sampling and Markovian sampling, respectively. The speedup plot is measured by the number of samples needed to achieve a target running-average reward under different numbers of workers. All results are averaged over 10 Monte-Carlo runs. Figure 1 shows that the sample complexity of A3C-TD(0) stays about the same for different numbers of workers under i.i.d. sampling. It can also be observed from the speedup plot of Figure 1 that A3C-TD(0) achieves roughly linear speedup with i.i.d. sampling, which is consistent with Corollary 1. The speedup of A3C-TD(0) with Markovian sampling, shown in Figure 2, is roughly linear when the number of workers is small.
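The synthetic environment described above can be generated in a few lines (a sketch reflecting our reading of the setup; transition rows are normalized so that each $\mathcal{P}(\cdot|s,a)$ is a valid distribution):

```python
import numpy as np

def random_mdp(num_states=100, num_actions=5, feat_dim=10, seed=0):
    """Random synthetic MDP in the spirit of Section 5.1: uniform(0,1)
    rewards and state features, and row-normalized random transitions."""
    rng = np.random.default_rng(seed)
    P = rng.uniform(size=(num_states, num_actions, num_states))
    P /= P.sum(axis=-1, keepdims=True)    # make each P[s, a, :] a distribution
    R = rng.uniform(size=(num_states, num_actions))
    Phi = rng.uniform(size=(num_states, feat_dim))
    return P, R, Phi
```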

5.2. A3C-TD(0) IN OPENAI GYM ENVIRONMENTS

We have also tested A3C-TD(0) with neural network parameterization in the classic control (CartPole) environment and the Atari game (Seaquest and BeamRider) environments. In Figures 3, 4, and 5, each curve is generated by averaging over 5 Monte-Carlo runs and is shown with a 95% confidence interval. Figures 3, 4, and 5 show the speedup of A3C-TD(0) under different numbers of workers, where the average reward is computed by taking the running average of the test rewards. The speedup and runtime-speedup plots are measured, respectively, by the number of samples and the training time needed to achieve a target running-average reward under different numbers of workers. Although not justified theoretically, Figures 3, 4, and 5 suggest that the sample-complexity speedup is roughly linear, while the runtime speedup degrades slightly as the number of workers increases. This is partially due to our hardware limits; a similar observation has been made for async-SGD [9].

Supplementary Material

A PRELIMINARY LEMMAS

A.1 GEOMETRIC MIXING

The operation $p \otimes q$ denotes the tensor product between two distributions $p(x)$ and $q(y)$, i.e., $(p \otimes q)(x,y) = p(x) \cdot q(y)$.

Lemma 1. Suppose Assumption 4 holds for the Markov chain generated by the rule $a_t \sim \pi_\theta(\cdot|s_t)$, $s_{t+1} \sim \hat{\mathcal{P}}(\cdot|s_t, a_t)$. For any $\theta \in \mathbb{R}^d$, we have
$$\sup_{s_0 \in \mathcal{S}} d_{TV}\big(\mathbb{P}((s_t, a_t, s_{t+1}) \in \cdot \mid s_0, \pi_\theta),\, \mu_\theta \otimes \pi_\theta \otimes \hat{\mathcal{P}}\big) \le \kappa \rho^t,$$
where $\mu_\theta(\cdot)$ is the stationary distribution under policy $\pi_\theta$ and transition kernel $\hat{\mathcal{P}}(\cdot|s,a)$.

Proof. We start with
$$\begin{aligned}
&\sup_{s_0 \in \mathcal{S}} d_{TV}\big(\mathbb{P}((s_t, a_t, s_{t+1}) \in \cdot \mid s_0, \pi_\theta),\, \mu_\theta \otimes \pi_\theta \otimes \hat{\mathcal{P}}\big) \\
&= \sup_{s_0 \in \mathcal{S}} d_{TV}\big(\mathbb{P}(s_t \in \cdot \mid s_0, \pi_\theta) \otimes \pi_\theta \otimes \hat{\mathcal{P}},\, \mu_\theta \otimes \pi_\theta \otimes \hat{\mathcal{P}}\big) \\
&= \sup_{s_0 \in \mathcal{S}} \frac{1}{2} \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \int_{s' \in \mathcal{S}} \big|\mathbb{P}(s_t = ds \mid s_0, \pi_\theta)\, \pi_\theta(a|s)\, \hat{\mathcal{P}}(ds'|s,a) - \mu_\theta(ds)\, \pi_\theta(a|s)\, \hat{\mathcal{P}}(ds'|s,a)\big| \\
&= \sup_{s_0 \in \mathcal{S}} \frac{1}{2} \int_{s \in \mathcal{S}} \big|\mathbb{P}(s_t = ds \mid s_0, \pi_\theta) - \mu_\theta(ds)\big| \sum_{a \in \mathcal{A}} \pi_\theta(a|s) \int_{s' \in \mathcal{S}} \hat{\mathcal{P}}(ds'|s,a) \\
&= \sup_{s_0 \in \mathcal{S}} d_{TV}\big(\mathbb{P}(s_t \in \cdot \mid s_0, \pi_\theta),\, \mu_\theta\big) \le \kappa \rho^t,
\end{aligned}$$
which completes the proof.

For use in the later proofs, given $K > 0$, we define $m_K$ as
$$m_K := \min\big\{m \in \mathbb{N}^+ \,\big|\, \kappa \rho^{m-1} \le \min\{\alpha_K, \beta_K\}\big\},$$
where $\kappa$ and $\rho$ are the constants defined in Assumption 4. $m_K$ is the minimum number of samples needed for the Markov chain to approach its stationary distribution closely enough that the bias incurred by Markovian sampling is sufficiently small.
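The quantity $m_K$ has a closed form; a small helper (our own illustration, assuming the mixing constants $\kappa$ and $\rho$ are known) computes it by solving $\kappa \rho^{m-1} \le \text{step}$ for the smallest positive integer $m$:

```python
import math

def m_K(kappa, rho, step):
    """Smallest m in N+ with kappa * rho^(m-1) <= step, where
    step = min(alpha_K, beta_K): the number of samples after which the
    Markovian-sampling bias falls below the stepsize level."""
    if kappa <= step:
        return 1                       # already mixed at m = 1
    # m - 1 >= log_rho(step / kappa), which is positive since step < kappa
    return math.ceil(math.log(step / kappa, rho)) + 1
```

For example, with $\kappa = 1$, $\rho = 0.5$, and stepsize level $0.1$, the bound $\kappa\rho^{m-1} \le 0.1$ first holds at $m = 5$ (since $0.5^4 = 0.0625 \le 0.1$ but $0.5^3 = 0.125 > 0.1$).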

A.2 AUXILIARY MARKOV CHAIN

The auxiliary Markov chain is a virtual Markov chain with no policy drift -- a technique developed in [34] to analyze stochastic approximation algorithms in non-stationary settings.

Lemma 2. Under Assumptions 1 and 3, consider the update (9) in Algorithm 1 with Markovian sampling. For a given number of samples $m$, consider the Markov chain of the worker that contributes to the $k$th update:
$$s_{t-m} \xrightarrow{\theta_{k-d_m}} a_{t-m} \xrightarrow{\mathcal{P}} s_{t-m+1} \xrightarrow{\theta_{k-d_{m-1}}} a_{t-m+1} \cdots s_{t-1} \xrightarrow{\theta_{k-d_1}} a_{t-1} \xrightarrow{\mathcal{P}} s_t \xrightarrow{\theta_{k-d_0}} a_t \xrightarrow{\mathcal{P}} s_{t+1},$$
where $(s_t, a_t, s_{t+1}) = (s_{(k)}, a_{(k)}, s'_{(k)})$, and $\{d_j\}_{j=0}^m$ is some increasing sequence with $d_0 := \tau_k$. Given $(s_{t-m}, a_{t-m}, s_{t-m+1})$ and $\theta_{k-d_m}$, we construct its auxiliary Markov chain by repeatedly applying $\pi_{\theta_{k-d_m}}$:
$$s_{t-m} \xrightarrow{\theta_{k-d_m}} a_{t-m} \xrightarrow{\mathcal{P}} s_{t-m+1} \xrightarrow{\theta_{k-d_m}} \tilde{a}_{t-m+1} \cdots \tilde{s}_{t-1} \xrightarrow{\theta_{k-d_m}} \tilde{a}_{t-1} \xrightarrow{\mathcal{P}} \tilde{s}_t \xrightarrow{\theta_{k-d_m}} \tilde{a}_t \xrightarrow{\mathcal{P}} \tilde{s}_{t+1}.$$
Define $x_t := (s_t, a_t, s_{t+1})$ and $\tilde{x}_t := (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1})$. Then we have
$$d_{TV}\big(\mathbb{P}(x_t \in \cdot \mid \theta_{k-d_m}, s_{t-m+1}),\, \mathbb{P}(\tilde{x}_t \in \cdot \mid \theta_{k-d_m}, s_{t-m+1})\big) \le \frac{1}{2}|\mathcal{A}| L_\pi \sum_{i=\tau_k}^{d_m} \mathbb{E}\big[\|\theta_{k-i} - \theta_{k-d_m}\|_2 \,\big|\, \theta_{k-d_m}, s_{t-m+1}\big]. \tag{22}$$
Proof. Throughout the proof, all expectations and probabilities are conditioned on $\theta_{k-d_m}$ and $s_{t-m+1}$; we omit this conditioning for convenience. First, we have
$$\begin{aligned}
d_{TV}\big(\mathbb{P}(s_{t+1} \in \cdot),\, \mathbb{P}(\tilde{s}_{t+1} \in \cdot)\big) &= \frac{1}{2} \int_{s' \in \mathcal{S}} \big|\mathbb{P}(s_{t+1} = ds') - \mathbb{P}(\tilde{s}_{t+1} = ds')\big| \\
&= \frac{1}{2} \int_{s' \in \mathcal{S}} \Big| \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mathbb{P}(s_t = ds, a_t = a, s_{t+1} = ds') - \mathbb{P}(\tilde{s}_t = ds, \tilde{a}_t = a, \tilde{s}_{t+1} = ds') \Big| \\
&\le \frac{1}{2} \int_{s' \in \mathcal{S}} \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \big|\mathbb{P}(s_t = ds, a_t = a, s_{t+1} = ds') - \mathbb{P}(\tilde{s}_t = ds, \tilde{a}_t = a, \tilde{s}_{t+1} = ds')\big| \\
&= \frac{1}{2} \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \int_{s' \in \mathcal{S}} \big|\mathbb{P}(s_t = ds, a_t = a, s_{t+1} = ds') - \mathbb{P}(\tilde{s}_t = ds, \tilde{a}_t = a, \tilde{s}_{t+1} = ds')\big| \\
&= d_{TV}\big(\mathbb{P}(x_t \in \cdot),\, \mathbb{P}(\tilde{x}_t \in \cdot)\big),
\end{aligned}$$
where the second-to-last equality is due to Tonelli's theorem.
Next, we have
$$\begin{aligned}
d_{TV}\big(\mathbb{P}(x_t \in \cdot),\, \mathbb{P}(\tilde{x}_t \in \cdot)\big) &= \frac{1}{2} \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \int_{s' \in \mathcal{S}} \big|\mathbb{P}(s_t = ds, a_t = a, s_{t+1} = ds') - \mathbb{P}(\tilde{s}_t = ds, \tilde{a}_t = a, \tilde{s}_{t+1} = ds')\big| \\
&= \frac{1}{2} \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \big|\mathbb{P}(s_t = ds, a_t = a) - \mathbb{P}(\tilde{s}_t = ds, \tilde{a}_t = a)\big| \int_{s' \in \mathcal{S}} \mathbb{P}(s_{t+1} = ds' \mid s_t = ds, a_t = a) \\
&= \frac{1}{2} \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \big|\mathbb{P}(s_t = ds, a_t = a) - \mathbb{P}(\tilde{s}_t = ds, \tilde{a}_t = a)\big| \\
&= d_{TV}\big(\mathbb{P}((s_t, a_t) \in \cdot),\, \mathbb{P}((\tilde{s}_t, \tilde{a}_t) \in \cdot)\big).
\end{aligned}$$
Since $\theta_{k-\tau_k}$ is dependent on $s_t$, we need to write $\mathbb{P}(s_t, a_t)$ as
$$\mathbb{P}(s_t, a_t) = \int_{\theta \in \mathbb{R}^d} \mathbb{P}(s_t, \theta_{k-\tau_k} = d\theta, a_t) = \mathbb{P}(s_t) \int_{\theta \in \mathbb{R}^d} \mathbb{P}(\theta_{k-\tau_k} = d\theta \mid s_t)\, \pi_{\theta_{k-\tau_k}}(a_t|s_t) = \mathbb{P}(s_t)\, \mathbb{E}\big[\pi_{\theta_{k-\tau_k}}(a_t|s_t) \,\big|\, s_t\big].$$
Then we have
$$\begin{aligned}
&d_{TV}\big(\mathbb{P}((s_t, a_t) \in \cdot),\, \mathbb{P}((\tilde{s}_t, \tilde{a}_t) \in \cdot)\big) \\
&= \frac{1}{2} \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \big|\mathbb{P}(s_t = ds)\, \mathbb{E}[\pi_{\theta_{k-\tau_k}}(a_t = a \mid s_t = ds) \mid s_t = ds] - \mathbb{P}(\tilde{s}_t = ds)\, \pi_{\theta_{k-d_m}}(\tilde{a}_t = a \mid \tilde{s}_t = ds)\big| \\
&\le \frac{1}{2} \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \big|\mathbb{P}(s_t = ds)\, \mathbb{E}[\pi_{\theta_{k-\tau_k}}(a_t = a \mid s_t = ds) \mid s_t = ds] - \mathbb{P}(s_t = ds)\, \pi_{\theta_{k-d_m}}(a_t = a \mid s_t = ds)\big| \\
&\quad + \frac{1}{2} \int_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \big|\mathbb{P}(s_t = ds)\, \pi_{\theta_{k-d_m}}(\tilde{a}_t = a \mid \tilde{s}_t = ds) - \mathbb{P}(\tilde{s}_t = ds)\, \pi_{\theta_{k-d_m}}(\tilde{a}_t = a \mid \tilde{s}_t = ds)\big| \\
&= \frac{1}{2} \int_{s \in \mathcal{S}} \mathbb{P}(s_t = ds) \sum_{a \in \mathcal{A}} \big|\mathbb{E}[\pi_{\theta_{k-\tau_k}}(a_t = a \mid s_t = ds) \mid s_t = ds] - \pi_{\theta_{k-d_m}}(a_t = a \mid s_t = ds)\big| + \frac{1}{2} \int_{s \in \mathcal{S}} \big|\mathbb{P}(s_t = ds) - \mathbb{P}(\tilde{s}_t = ds)\big|.
\end{aligned}$$
Using Jensen's inequality, we have
$$\begin{aligned}
&d_{TV}\big(\mathbb{P}((s_t, a_t) \in \cdot),\, \mathbb{P}((\tilde{s}_t, \tilde{a}_t) \in \cdot)\big) \\
&\le \frac{1}{2} \int_{s \in \mathcal{S}} \mathbb{P}(s_t = ds) \sum_{a \in \mathcal{A}} \mathbb{E}\big[\big|\pi_{\theta_{k-\tau_k}}(a_t = a \mid s_t = ds) - \pi_{\theta_{k-d_m}}(a_t = a \mid s_t = ds)\big| \,\big|\, s_t = ds\big] + \frac{1}{2} \int_{s \in \mathcal{S}} \big|\mathbb{P}(s_t = ds) - \mathbb{P}(\tilde{s}_t = ds)\big| \\
&\le \frac{1}{2} \int_{s \in \mathcal{S}} \mathbb{P}(s_t = ds)\, |\mathcal{A}| L_\pi\, \mathbb{E}\big[\|\theta_{k-\tau_k} - \theta_{k-d_m}\|_2 \,\big|\, s_t = ds\big] + \frac{1}{2} \int_{s \in \mathcal{S}} \big|\mathbb{P}(s_t = ds) - \mathbb{P}(\tilde{s}_t = ds)\big| \\
&= \frac{1}{2} |\mathcal{A}| L_\pi\, \mathbb{E}\big[\|\theta_{k-\tau_k} - \theta_{k-d_m}\|_2\big] + d_{TV}\big(\mathbb{P}(s_t \in \cdot),\, \mathbb{P}(\tilde{s}_t \in \cdot)\big),
\end{aligned}$$
where the second inequality follows from Assumption 3. Combining the above relations, we obtain
$$d_{TV}\big(\mathbb{P}(x_t \in \cdot),\, \mathbb{P}(\tilde{x}_t \in \cdot)\big) \le d_{TV}\big(\mathbb{P}(x_{t-1} \in \cdot),\, \mathbb{P}(\tilde{x}_{t-1} \in \cdot)\big) + \frac{1}{2} |\mathcal{A}| L_\pi\, \mathbb{E}\big[\|\theta_{k-\tau_k} - \theta_{k-d_m}\|_2\big]. \tag{27}$$
( ) Since d T V (P(x t-m ∈ •), P(x t-m ∈ •)) = 0, recursively applying (27) for {t -1, ..., t -m} gives d T V (P(x t ∈ •), P( x t ∈ •)) ≤ 1 2 |A|L π m j=0 E θ k-dj -θ k-dm 2 ≤ 1 2 |A|L π dm i=τ k E θ k-i -θ k-dm 2 , which completes the proof.
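The chain-coupling argument of Lemma 2 can be illustrated numerically. Below is a minimal sketch, with all quantities hypothetical (a randomly generated tabular MDP, a softmax policy, and a small per-update parameter drift standing in for the differences θ_{k-d_j} - θ_{k-d_m}), that propagates the exact joint distribution of (s_t, a_t) under the drifting chain and under the auxiliary chain frozen at θ_{k-d_m}, then reports their total-variation distance:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, m = 4, 3, 6

# Random transition kernel P[s, a, s'] for a small synthetic MDP.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)

def policy(theta):
    # Tabular softmax policy pi_theta(a|s).
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def step(p_s, pi):
    # Propagate a state distribution one step; return the exact joint
    # distribution over (s_t, a_t) and the next-state distribution.
    joint = p_s[:, None] * pi
    return joint, np.einsum('sa,sax->x', joint, P)

theta0 = rng.normal(size=(nS, nA))
drift = 0.005 * rng.normal(size=(nS, nA))    # small per-update parameter drift

p_act = p_aux = np.ones(nS) / nS             # shared initial state distribution
for j in range(m):
    joint_act, p_act = step(p_act, policy(theta0 + j * drift))  # drifting chain
    joint_aux, p_aux = step(p_aux, policy(theta0))              # frozen at theta0

tv = 0.5 * np.abs(joint_act - joint_aux).sum()
print(f"TV distance over (s_t, a_t) after {m} steps: {tv:.4f}")
```

As the lemma predicts, the reported distance is small when the cumulative parameter drift is small, since the bound scales with the sum of E‖θ_{k-i} - θ_{k-d_m}‖.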

A.3 LIPSCHITZ CONTINUITY OF VALUE FUNCTION

Lemma 3. Suppose Assumption 3 holds. For any $\theta_1, \theta_2 \in \mathbb{R}^d$ and $s \in \mathcal{S}$, we have
$$\|\nabla V^{\pi_{\theta_1}}(s)\|_2 \le L_V, \qquad |V^{\pi_{\theta_1}}(s) - V^{\pi_{\theta_2}}(s)| \le L_V \|\theta_1 - \theta_2\|_2,$$
where the constant is $L_V := C_\psi r_{\max}/(1-\gamma)$ with $C_\psi$ defined as in Assumption 3. Proof. First we have
$$Q^{\pi}(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s, a_0 = a\Big] \le \sum_{t=0}^{\infty} \gamma^t r_{\max} = \frac{r_{\max}}{1-\gamma}.$$
By the policy gradient theorem [7], we have
$$\|\nabla V^{\pi_{\theta_1}}(s)\|_2 = \big\|\mathbb{E}\big[Q^{\pi_{\theta_1}}(s,a)\,\psi_{\theta_1}(s,a)\big]\big\|_2 \le \mathbb{E}\big\|Q^{\pi_{\theta_1}}(s,a)\,\psi_{\theta_1}(s,a)\big\|_2 \le \mathbb{E}\big[|Q^{\pi_{\theta_1}}(s,a)|\, \|\psi_{\theta_1}(s,a)\|_2\big] \le \frac{r_{\max}}{1-\gamma} C_\psi,$$
where the first inequality is due to Jensen's inequality, and the last inequality follows from Assumption 3 and the fact that $Q^{\pi}(s,a) \le \frac{r_{\max}}{1-\gamma}$. By the mean value theorem, we immediately have
$$|V^{\pi_{\theta_1}}(s) - V^{\pi_{\theta_2}}(s)| \le \sup_{\theta \in \mathbb{R}^d} \|\nabla V^{\pi_{\theta}}(s)\|_2 \,\|\theta_1 - \theta_2\|_2 \le L_V \|\theta_1 - \theta_2\|_2,$$
which completes the proof.
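As a sanity check of Lemma 3, the sketch below (assuming a randomly generated tabular MDP and a softmax policy, neither from the paper) computes V^{π_θ} in closed form via (I - γP_π)^{-1} r_π, then verifies the bound V ≤ r_max/(1-γ) and that a small parameter perturbation changes V by O(‖θ_1 - θ_2‖):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9
r_max = 1.0
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))           # rewards in [0, 1], so r_max = 1

def softmax(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def value(theta):
    # Exact V^{pi_theta} = (I - gamma P_pi)^{-1} r_pi for the tabular MDP.
    pi = softmax(theta)
    P_pi = np.einsum('sa,sax->sx', pi, P)          # induced state chain
    r_pi = np.einsum('sa,sax,sax->s', pi, P, R)    # expected one-step reward
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

theta = rng.normal(size=(nS, nA))
delta = 1e-4 * rng.normal(size=(nS, nA))
v1, v2 = value(theta), value(theta + delta)

ratio = np.abs(v1 - v2).max() / np.linalg.norm(delta)
print(f"max_s V(s) = {v1.max():.3f}  (bound r_max/(1-gamma) = {r_max/(1-gamma):.1f})")
print(f"|V_1 - V_2| / ||theta_1 - theta_2|| = {ratio:.3f}")
```

The empirical ratio stays bounded as the perturbation shrinks, consistent with the L_V-Lipschitz continuity established in the lemma.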

A.4 LIPSCHITZ CONTINUITY OF POLICY GRADIENT

We state a proposition on the $L_J$-Lipschitz continuity of the policy gradient under suitable assumptions, which was shown by [35]. Proposition 1. Suppose Assumptions 3 and 4 hold. For any $\theta, \theta' \in \mathbb{R}^d$, we have $\|\nabla J(\theta) - \nabla J(\theta')\|_2 \le L_J \|\theta - \theta'\|_2$, where $L_J$ is a positive constant.

A.5 LIPSCHITZ CONTINUITY OF OPTIMAL CRITIC PARAMETER

We provide a justification for Lipschitz continuity of ω * θ in the next proposition. Proposition 2. Suppose Assumption 3 and 4 hold. For any θ 1 , θ 2 ∈ R d , we have ω * θ1 -ω * θ2 2 ≤ L ω θ 1 -θ 2 2 , where L ω := 2r max |A|L π (λ -1 + λ -2 (1 + γ))(1 + log ρ κ -1 + (1 -ρ) -1 ). Proof. We use A 1 , A 2 , b 1 and b 2 as shorthand notations of A π θ 1 , A π θ 2 , b π θ 1 and b π θ 2 respectively. By Assumption 2, A θ,φ is invertible for any θ ∈ R d , so we can write ω * θ = -A -1 θ,φ b θ,φ . Then we have ω * 1 -ω * 2 2 = -A -1 1 b 1 + A -1 2 b 2 2 = -A -1 1 b 1 -A -1 1 b 2 + A -1 1 b 2 + A -1 2 b 2 2 = -A -1 1 (b 1 -b 2 ) -(A -1 1 -A -1 2 )b 2 2 ≤ A -1 1 (b 1 -b 2 ) 2 + (A -1 1 -A -1 2 )b 2 2 ≤ A -1 1 2 b 1 -b 2 2 + A -1 1 -A -1 2 2 b 2 2 = A -1 1 2 b 1 -b 2 2 + A -1 1 (A 2 -A 1 )A -1 2 2 b 2 2 ≤ A -1 1 2 b 1 -b 2 2 + A -1 1 2 A -1 2 2 b 2 2 (A 2 -A 1 ) 2 ≤ λ -1 b 1 -b 2 2 + λ -2 r max A 1 -A 2 2 , ( ) where the last inequality follows Assumption 2, and the fact that b 2 2 = E[r(s, a, s )φ(s)] 2 ≤ E r(s, a, s )φ(s) 2 ≤ E [|r(s, a, s )| φ(s) 2 ] ≤ r max . Denote (s 1 , a 1 , s 1 ) and (s 2 , a 2 , s 2 ) as samples drawn with θ 1 and θ 2 respectively, i.e. s 1 ∼ µ θ1 , a 1 ∼ π θ1 , s 1 ∼ P and s 2 ∼ µ θ2 , a 2 ∼ π θ2 , s 2 ∼ P. Then we have b 1 -b 2 2 = E r(s 1 , a 1 , s 1 )φ(s 1 ) -E r(s 2 , a 2 , s 2 )φ(s 2 ) 2 ≤ sup s,a,s r(s, a, s )φ(s) 2 P((s 1 , a 1 , s 1 ) ∈ •) -P((s 2 , a 2 , s 2 ) ∈ •) T V ≤ r max P((s 1 , a 1 , s 1 ) ∈ •) -P((s 2 , a 2 , s 2 ) ∈ •) T V = 2r max d T V µ θ1 ⊗ π θ1 ⊗ P, µ θ2 ⊗ π θ2 ⊗ P ≤ 2r max |A|L π (1 + log ρ κ -1 + (1 -ρ) -1 ) θ 1 -θ 2 2 , where the first inequality follows the definition of total variation (TV) norm, and the last inequality follows Lemma A.1. in [17] . Similarly we have: A 1 -A 2 2 ≤ 2(1 + γ)d T V (µ θ1 ⊗ π θ1 , µ θ2 ⊗ π θ2 ) = (1 + γ)|A|L π (1 + log ρ κ -1 + (1 -ρ) -1 ) θ 1 -θ 2 2 . (31) Substituting ( 30) and ( 31) into (29) completes the proof.
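The fixed point ω*_θ = -A^{-1}_{θ,φ} b_{θ,φ} used in Proposition 2 can be computed exactly on a toy problem. The sketch below (a random MDP with random linear features Φ; all names are illustrative, not the paper's setup) builds A and b under the stationary distribution μ_θ, solves for ω*_θ, and estimates the Lipschitz ratio ‖ω*_{θ_1} - ω*_{θ_2}‖/‖θ_1 - θ_2‖ by a small perturbation:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, d, gamma = 5, 3, 3, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))
Phi = rng.normal(size=(nS, d))         # linear features: row s of Phi is phi(s)

def softmax(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary(P_pi):
    # Left eigenvector of P_pi with eigenvalue 1, normalized to a distribution.
    w, V = np.linalg.eig(P_pi.T)
    mu = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return mu / mu.sum()

def td_fixed_point(theta):
    pi = softmax(theta)
    P_pi = np.einsum('sa,sax->sx', pi, P)
    mu = stationary(P_pi)
    # A = E[phi(s)(gamma phi(s') - phi(s))^T],  b = E[r(s,a,s') phi(s)].
    A = gamma * np.einsum('s,sx,si,xj->ij', mu, P_pi, Phi, Phi) \
        - np.einsum('s,si,sj->ij', mu, Phi, Phi)
    b = np.einsum('s,sa,sax,sax,si->i', mu, pi, P, R, Phi)
    return -np.linalg.solve(A, b)      # omega*_theta = -A^{-1} b

theta = rng.normal(size=(nS, nA))
delta = 1e-4 * rng.normal(size=(nS, nA))
w1, w2 = td_fixed_point(theta), td_fixed_point(theta + delta)
ratio = np.linalg.norm(w1 - w2) / np.linalg.norm(delta)
print(f"||omega*_1 - omega*_2|| / ||theta_1 - theta_2|| = {ratio:.3f}")
```

The finite ratio reflects the L_ω-Lipschitz continuity of ω*_θ: both A and b vary smoothly with θ through μ_θ and π_θ, which is exactly how the proof bounds ‖b_1 - b_2‖ and ‖A_1 - A_2‖ via the TV distance of the sampling distributions.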

B PROOF OF MAIN THEOREMS B.1 PROOF OF THEOREM 1

For brevity, we first define the following notations: x := (s, a, s ), δ(x, ω) := r(s, a, s ) + γφ(s ) ω -φ(s) ω, g(x, ω) := δ(x, ω)φ(s), g(θ, ω) := E s∼µ θ ,a∼π θ ,s ∼ P [g(x, ω)] . We also define constant C δ := r max + (1 + γ) max{ rmax 1-γ , R ω }, and we immediately have g(x, ω) 2 ≤ |r(x) + γφ(s ) ω -φ(s) ω| ≤ r max + (1 + γ)R ω ≤ C δ (32) and likewise, we have g(x, ω) 2 ≤ C δ . The critic update in Algorithm 1 can be written compactly as: ω k+1 = Π Rω ω k + β k g(x (k) , ω k-τ k ) , where τ k is the delay of the parameters used in evaluating the kth stochastic gradient, and x (k) := (s (k) , a (k) , s (k) ) is the sample used to evaluate the stochastic gradient at kth update. Proof. Using ω * k as shorthand notation of ω * θ k , we start with the optimality gap ω k+1 -ω * k+1 2 2 = Π Rω ω k + β k g(x (k) , ω k-τ k ) -ω * k+1 2 2 ≤ ω k + β k g(x (k) , ω k-τ k ) -ω * k+1 2 2 = ω k -ω * k 2 2 + 2β k ω k -ω * k , g(x (k) , ω k-τ k ) + 2 ω k -ω * k , ω * k -ω * k+1 + ω * k -ω * k+1 + β k g(x (k) , ω k-τ k ) 2 2 = ω k -ω * k 2 2 + 2β k ω k -ω * k , g(x (k) , ω k-τ k ) -g(x (k) , ω k ) + 2β k ω k -ω * k , g(x (k) , ω k ) -g(θ k , ω k ) + 2β k ω k -ω * k , g(θ k , ω k ) + 2 ω k -ω * k , ω * k -ω * k+1 + ω * k -ω * k+1 + β k g(x (k) , ω k-τ k ) 2 2 ≤ ω k -ω * k 2 2 + 2β k ω k -ω * k , g(x (k) , ω k-τ k ) -g(x (k) , ω k ) + 2β k ω k -ω * k , g(x (k) , ω k ) -g(θ k , ω k ) + 2β k ω k -ω * k , g(θ k , ω k ) + 2 ω k -ω * k , ω * k -ω * k+1 + 2 ω * k -ω * k+1 2 2 + 2C 2 δ β 2 k . ( ) We first bound ω k -ω * k , g(θ k , ω k ) in (34) as ω k -ω * k , g(θ k , ω k ) = ω k -ω * k , g(θ k , ω k ) -g(θ k , ω * k ) = ω k -ω * k , E (γφ(s ) -φ(s)) (ω k -ω * k )φ(s) = ω k -ω * k , E φ(s) (γφ(s ) -φ(s)) (ω k -ω * k ) = ω k -ω * k , A π θ k (ω k -ω * k ) ≤ -λ ω k -ω * k 2 2 , where the first equality is due to g(θ, ω * θ ) = A θ,φ ω * θ + b = 0, and the last inequality follows Assumption 2. 
Substituting (35) into (34) , then taking expectation on both sides of (34) yield E ω k+1 -ω * k+1 2 2 ≤ (1 -2λβ k ) E ω k -ω * k 2 2 + 2β k E ω k -ω * k , g(x (k) , ω k-τ k ) -g(x (k) , ω k ) + 2β k E ω k -ω * k , g(x (k) , ω k ) -g(θ k , ω k ) + 2 E ω k -ω * k , ω * k -ω * k+1 + 2 E ω * k -ω * k+1 2 2 + 2C 2 δ β 2 k . ( ) We then bound the term E ω k -ω * k , g(x (k) , ω k-τ k ) -g(x (k) , ω k ) in (36) as E ω k -ω * k , g(x (k) , ω k-τ k ) -g(x (k) , ω k ) = E ω k -ω * k , γφ(s (k) ) -φ(s (k) ) (ω k-τ k -ω k )φ(s (k) ) ≤ (1 + γ) E [ ω k -ω * k 2 ω k-τ k -ω k 2 ] ≤ (1 + γ) E   ω k -ω * k 2 k-1 i=k-τ k (ω i+1 -ω i ) 2   ≤ (1 + γ) E ω k -ω * k 2 k-1 i=k-τ k β i g(x i , ω i-τi ) 2 ≤ (1 + γ) E ω k -ω * k 2 k-1 i=k-τ k β k-K0 g(x i , ω i-τi ) 2 ≤ C δ (1 + γ)K 0 β k-K0 E ω k -ω * k 2 , where the second last inequality is due to the monotonicity of step size, and the last inequality follows the definition of C δ in (32) . Next we jointly bound the fourth and fifth term in (36) as 2 E ω k -ω * k , ω * k -ω * k+1 + 2 E ω * k -ω * k+1 2 2 ≤ 2 E ω k -ω * k 2 ω * k -ω * k+1 2 + 2 E ω * k -ω * k+1 2 2 ≤ 2L ω E [ ω k -ω * k 2 θ k -θ k+1 2 ] + 2L 2 ω E θ k -θ k+1 2 2 = 2L ω α k E ω k -ω * k 2 δ(x (k) , ω k-τ k )ψ θ k-τ k (s (k) , a (k) ) 2 + 2L 2 ω α 2 k E δ(x (k) , ω k-τ k )ψ θ k-τ k (s (k) , a (k) ) 2 2 ≤ 2L ω C p α k E ω k -ω * k 2 + 2L 2 ω C 2 p α 2 k , where constant C p := C δ C ψ . The second inequality is due to the L ω -Lipschitz of ω * θ shown in Proposition 2, and the last inequality follows the fact that δ(x (k) , ω k-τ k )ψ θ k-τ k (s (k) , a (k) ) 2 ≤ C δ C ψ = C p . Substituting ( 37) and ( 38) into ( 36) yields E ω k+1 -ω * k+1 2 2 ≤ (1 -2λβ k ) E ω k -ω * k 2 2 + 2β k (C 1 α k β k + C 2 K 0 β k-K0 ) E ω k -ω * k 2 + 2β k E ω k -ω * k , g(x (k) , ω k ) -g(θ k , ω k ) + C q β 2 k , where C 1 := L ω C p , C 2 := C δ (1 + γ) and C q := 2C 2 δ + 2L 2 ω C 2 p max (k) α 2 k β 2 k = 2C 2 δ + 2L 2 ω C 2 p c 2 1 c 2 2 . 
For brevity, we use x ∼ θ to denote s ∼ µ θ , a ∼ π θ and s ∼ P in this proof. Consider the third term in (40) conditioned on θ k , ω k , θ k-τ k . We bound it as E ω k -ω * k , g(x (k) , ω k ) -g(θ k , ω k ) |θ k , ω k , θ k-τ k = ω k -ω * k , E x (k) ∼θ k-τ k g(x (k) , ω k )|ω k -g(θ k , ω k ) = ω k -ω * k , g(θ k-τ k , ω k ) -g(θ k , ω k ) ≤ ω k -ω * k 2 g(θ k-τ k , ω k ) -g(θ k , ω k ) 2 ≤ 2R ω E x∼θ k-τ k [g(x, ω k )] -E x∼θ k [g(x, ω k )] 2 ≤ 2R ω sup x g(x, ω k ) 2 µ θ k-τ k ⊗ π θ k-τ k ⊗ P -µ θ k ⊗ π θ k ⊗ P T V ≤ 4R ω C δ d T V (µ θ k-τ k ⊗ π θ k-τ k ⊗ P, µ θ k ⊗ π θ k ⊗ P), where second last inequality follows the definition of TV norm and the last inequality uses the definition of C δ in (32) . Define constant C 3 := 2R ω C δ |A|L π (1 + log ρ κ -1 + (1 -ρ) -1 ). Then by following the third item in Lemma A.1 shown by [17] , we can write (41) as E ω k -ω * k , g(x (k) , ω k ) -g(θ k , ω k ) |θ k , ω k , θ k-τ k ≤ 4R ω C δ d T V (µ θ k-τ k ⊗ π θ k-τ k ⊗ P, µ θ k ⊗ π θ k ⊗ P) ≤ C 3 θ k-τ k -θ k 2 ≤ C 3 k-1 i=k-τ k α i g(x i , ω i-τi ) 2 ≤ C 3 C δ K 0 α k-K0 , where we used the monotonicity of α k and Assumption 1. Taking total expectation on both sides of (42) and substituting it into (40) yield E ω k+1 -ω * k+1 2 2 ≤ (1 -2λβ k ) E ω k -ω * k 2 2 + 2β k C 1 α k β k + C 2 K 0 β k-K0 E ω k -ω * k 2 + 2C 3 C δ K 0 β k α k-K0 + C q β 2 k . ( ) Taking summation on both sides of (43) and rearranging yield 2λ K k=K0 E ω k -ω * k 2 2 ≤ K k=K0 1 β k E ω k -ω * k 2 2 -E ω k+1 -ω * k+1 2 2 I1 +C q K k=K0 β k I2 + 2 K k=K0 2C 3 C δ K 0 α k-K0 I3 +2 K k=K0 C 1 α k β k + C 2 K 0 β k-K0 E ω k -ω * k 2 I4 . 
We bound I 1 as I 1 = K k=M K 1 β k E ω k -ω * k 2 2 -E ω k+1 -ω * k+1 2 2 = K k=M K 1 β k - 1 β k-1 E ω k -ω * k 2 2 + 1 β M K -1 E ω M K -ω * M K 2 2 - 1 β k E ω K+1 -ω * K+1 2 2 ≤ K k=M K 1 β k - 1 β k-1 E ω k -ω * k 2 2 + 1 β M K -1 E ω M K -ω * M K 2 2 ≤ 4R 2 ω K k=M K 1 β k - 1 β k-1 + 1 β M K -1 = 4R 2 ω β k = O(K σ2 ), where the last inequality is due to the fact that ω k -ω * θ 2 ≤ ω k 2 + ω * θ 2 ≤ 2R ω . We bound I 2 as K k=M K β k = K k=M K c 2 (1 + k) σ2 = O(K 1-σ2 ) where the inequality follows from the integration rule b k=a k -σ ≤ b 1-σ 1-σ . I 3 = K k=K0 2C 3 C δ K 0 α k-K0 = 2C 3 C δ c 1 K 0 K-K0 k=0 (1 + k) -σ1 = O(K 0 K 1-σ1 ). For the last term I 4 , we have I 4 = K k=K0 C 1 α k β k + C 2 K 0 β k-K0 E ω k -ω * k 2 ≤ K k=K0 C 1 α k β k + C 2 K 0 β k-K0 2 K k=K0 E ω k -ω * k 2 2 ≤ K k=K0 C 1 α k β k + C 2 K 0 β k-K0 2 K k=K0 E ω k -ω * k 2 2 , where the first inequality follows Cauchy-Schwartz inequality, and the second inequality follows Jensen's inequality. In (48), we have K k=K0 C 1 α k β k + C 2 K 0 β k-K0 2 ≤ K-K0 k=0 C 1 α k β k + C 2 K 0 β k 2 = C 2 1 K-K0 k=0 α 2 k β 2 k + 2C 1 C 2 K 0 K-K0 k=0 α k + C 2 2 K 2 0 K-K0 k=0 β 2 k = O K 2(σ2-σ1)+1 + O K 0 K -σ1+1 + O K 2 0 K 1-2σ2 (49) where the first inequality is due to the fact that α k β k and β k-K0 are monotonically decreasing. Substituting (49) into (48) gives I 4 ≤ O K 2(σ2-σ1)+1 + O (K 0 K -σ1+1 ) + O (K 2 0 K 1-2σ2 ) K k=M K E ω k -ω * k 2 2 . Substituting (45), ( 46), ( 47) and ( 50) into (44), and dividing both sides of (44) by K -K 0 + 1 give 2λ 1 K -K 0 + 1 K k=K0 E ω k -ω * k 2 2 ≤ O K 2(σ2-σ1)+1 + O (K 0 K -σ1+1 ) + O (K 2 0 K 1-2σ2 ) K -K 0 + 1 K k=K0 E ω k -ω * k 2 2 + O 1 K 1-σ2 + O 1 K σ2 + O K 0 K σ1 . ( ) We define the following functions: T 1 (K) := 1 K -K 0 + 1 K k=K0 E ω k -ω * k 2 2 , T 2 (K) := O 1 K 1-σ2 + O 1 K σ2 + O K 0 K σ1 , T 3 (K) := O K 2(σ2-σ1)+1 + O K 0 K -σ1+1 + O K 2 0 K 1-2σ2 K -K 0 + 1 . 
Then (51) can be written as: T 1 (K) - 1 2λ T 1 (K) T 3 (K) ≤ 1 2λ T 2 (K). Solving this quadratic inequality in terms of T 1 (K), we obtain T 1 (K) ≤ 1 λ T 2 (K) + 1 2λ 2 T 3 (K), which implies 1 K -K 0 + 1 K k=K0 E ω k -ω * k 2 2 = O 1 K 1-σ2 + O 1 K 2(σ1-σ2) + O K 2 0 K 2σ2 + O K 0 K σ1 + O 1 K σ2 . We further have 1 K K k=1 E ω k -ω * k 2 2 ≤ 1 K K0-1 k=1 4R 2 ω + K k=K0 E ω k -ω * k 2 2 = K 0 -1 K 4R 2 ω + K -K 0 + 1 K 1 K -K 0 + 1 K k=K0 E ω k -ω * k 2 2 = O K 0 K + O 1 K -K 0 + 1 K k=K0 E ω k -ω * k 2 2 = O 1 K -K 0 + 1 K k=K0 E ω k -ω * k 2 2 (53) which completes the proof.
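The critic recursion (33), ω_{k+1} = Π_{R_ω}(ω_k + β_k g(x^{(k)}, ω_{k-τ_k})), analyzed in this proof can be sketched as follows. This is a toy serial simulation, not the paper's implementation: features and rewards are random, and the bounded delay τ_k ≤ K_0 of Assumption 1 is mimicked by reading a stale entry from the parameter history:

```python
import numpy as np

def project(omega, R_omega):
    # Euclidean projection onto the ball ||omega||_2 <= R_omega.
    n = np.linalg.norm(omega)
    return omega if n <= R_omega else omega * (R_omega / n)

def td0_semi_gradient(x, omega, gamma=0.9):
    # g(x, omega) = delta(x, omega) * phi(s), with x = (phi(s), r, phi(s')).
    phi_s, r, phi_sp = x
    delta = r + gamma * phi_sp @ omega - phi_s @ omega
    return delta * phi_s

rng = np.random.default_rng(3)
d, K, K0, R_omega = 4, 200, 3, 10.0
history = [np.zeros(d)]                            # omega_0, omega_1, ...
for k in range(K):
    tau_k = int(rng.integers(0, min(K0, k) + 1))   # bounded delay (Assumption 1)
    stale = history[-1 - tau_k]                    # omega_{k - tau_k}
    x = (rng.normal(size=d), rng.random(), rng.normal(size=d))
    beta_k = 0.5 / (1 + k) ** 0.6                  # beta_k = c_2 (1+k)^{-sigma_2}
    history.append(project(history[-1] + beta_k * td0_semi_gradient(x, stale),
                           R_omega))
```

The projection keeps every iterate inside the ball of radius R_ω, which is exactly what yields the bound ‖ω_k - ω*_k‖ ≤ 2R_ω used when bounding I_1.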

B.2 PROOF OF THEOREM 2

We first clarify the notations: x := (s, a, s ), δ(x, ω) := r(s, a, s ) + γφ(s ) ω -φ(s) ω, δ(x, θ) := r(s, a, s ) + γV π θ (s ) -V π θ (s). The update in Algorithm 1 can be written compactly as: θ k+1 = θ k + α k δ(x (k) , ω k-τ k )ψ θ k-τ k (s (k) , a (k) ). For brevity, we use ω * k as shorthand notation of ω * θ k . Then we are ready to give the proof. Proof. From L J -Lipschitz of policy gradient shown in Proposition 1, we have: J(θ k+1 ) ≥ J(θ k ) + ∇J(θ k ), θ k+1 -θ k - L J 2 θ k+1 -θ k 2 2 = J(θ k ) + α k ∇J(θ k ), δ(x (k) , ω k-τ k ) -δ(x (k) , ω * k ) ψ θ k-τ k (s (k) , a (k) ) + α k ∇J(θ k ), δ(x (k) , ω * k )ψ θ k-τ k (s (k) , a (k) ) - L J 2 α 2 k δ(x (k) , ω k-τ k )ψ θ k-τ k (s (k) , a (k) ) 2 2 ≥ J(θ k ) + α k ∇J(θ k ), δ(x (k) , ω k-τ k ) -δ(x (k) , ω * k ) ψ θ k-τ k (s (k) , a (k) ) + α k ∇J(θ k ), δ(x (k) , ω * k )ψ θ k-τ k (s (k) , a (k) ) - L J 2 C 2 p α 2 k , where the last inequality follows the definition of C p in (39) . Taking expectation on both sides of the last inequality yields E[J(θ k+1 )] ≥ E[J(θ k )] + α k E ∇J(θ k ), δ(x (k) , ω k-τ k ) -δ(x (k) , ω * k ) ψ θ k-τ k (s (k) , a (k) ) I1 + α k E ∇J(θ k ), δ(x (k) , ω * k )ψ θ k-τ k (s (k) , a (k) ) I2 - L J 2 C 2 p α 2 k . We first decompose I 1 as I 1 = E ∇J(θ k ), δ(x (k) , ω k-τ k ) -δ(x (k) , ω * k ) ψ θ k-τ k (s (k) , a (k) ) = E ∇J(θ k ), δ(x (k) , ω k-τ k ) -δ(x (k) , ω k ) ψ θ k-τ k (s (k) , a (k) ) I (1) 1 + E ∇J(θ k ), δ(x (k) , ω k ) -δ(x (k) , ω * k ) ψ θ k-τ k (s (k) , a (k) ) I . 
We bound I 1 as I (1) 1 = E ∇J(θ k ), γφ(s (k) ) -φ(s (k) ) (ω k-τ k -ω k )ψ θ k-τ k (s (k) , a (k) ) ≥ -E ∇J(θ k ) 2 γφ(s (k) ) -φ(s (k) ) 2 ω k -ω k-τ k 2 ψ θ k-τ k (s (k) , a (k) ) 2 ≥ -2C ψ E [ ∇J(θ k ) 2 ω k -ω k-τ k 2 ] ≥ -2C ψ C δ K 0 β k-1 E ∇J(θ k ) 2 , where the last inequality follows ω k -ω k-τ k 2 = k-1 i=k-τ k (ω i+1 -ω i ) 2 ≤ k-1 i=k-τ k β i g(x i , ω i-τi ) 2 ≤ β k-1 k-1 i=k-τ k g(x i , ω i-τi ) 2 ≤ β k-1 K 0 C δ , where the second inequality is due to the monotonicity of step size, and the third one follows (32) . Then we bound I 1 as I (2) 1 = E ∇J(θ k ), δ(x (k) , ω k ) -δ(x (k) , ω * k ) ψ θ k-τ k (s (k) , a (k) ) = -E ∇J(θ k ), γφ(s (k) ) -φ(s (k) ) (ω * k -ω k )ψ θ k-τ k (s (k) , a (k) ) ≥ -E ∇J(θ k ) 2 γφ(s (k) ) -φ(s (k) ) 2 ω k -ω * k 2 ψ θ k-τ k (s (k) , a (k) ) 2 ≥ -2C ψ E [ ∇J(θ k ) 2 ω k -ω * k 2 ] . Collecting the lower bounds of I 1 and I (2) 1 gives I 1 ≥ -2C ψ E [ ∇J(θ k ) 2 (C δ K 0 β k-1 + ω k -ω * k 2 )] . Now we consider I 2 . We first decompose I 2 as I 2 = E ∇J(θ k ), δ(x (k) , ω * k )ψ θ k-τ k (s (k) , a (k) ) = E ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , ω * k-τ k ) ψ θ k-τ k (s (k) , a (k) ) I (1) 2 + E ∇J(θ k ), δ(x (k) , ω * k-τ k ) -δ(x (k) , θ k-τ k ) ψ θ k-τ k (s (k) , a (k) ) I (2) 2 + E ∇J(θ k ), δ(x (k) , θ k-τ k )ψ θ k-τ k (s (k) , a (k) ) -∇J(θ k ) I (3) 2 + ∇J(θ k ) 2 2 . We bound I 2 as I (1) 2 = E ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , ω * k-τ k ) ψ θ k-τ k (s (k) , a (k) ) = E ∇J(θ k ), γφ(s (k) ) -φ(s (k) ) ω * k -ω * k-τ k ψ θ k-τ k (s (k) , a (k) ) ≥ -E ∇J(θ k ) 2 γφ(s (k) ) -φ(s (k) ) 2 ω * k -ω * k-τ k 2 ψ θ k-τ k (s (k) , a (k) ) 2 ≥ -L V C ψ (1 + γ) E ω * k -ω * k-τ k 2 ≥ -L V L ω C ψ (1 + γ) E θ k -θ k-τ k 2 ≥ -L V L ω C ψ C p (1 + γ)K 0 α k-K0 , where the second last inequality follows from Proposition 2 and the last inequality uses (39) as θ k -θ k-τ k 2 ≤ k-1 i=k-τ k θ i+1 -θ i 2 = k-1 i=k-τ k α i δ(x i , ω i-τi )ψ θi-τ i (s i , a i ) 2 ≤ k-1 i=k-τ k α k-τ k C p ≤ C p K 0 α k-K0 . 
We bound I 2 as I (2) 2 = E ∇J(θ k ), δ(x (k) , ω * k-τ k ) -δ(x (k) , θ k-τ k ) ψ θ k-τ k (s (k) , a (k) ) ≥ -E ∇J(θ k ) 2 δ(x (k) , ω * k-τ k ) -δ(x (k) , θ k-τ k ) ψ θ k-τ k (s (k) , a (k) ) 2 ≥ -C ψ E ∇J(θ k ) 2 δ(x (k) , ω * k-τ k ) -δ(x (k) , θ k-τ k ) = -C ψ E ∇J(θ k ) 2 γ φ(s (k) ) ω * k-τ k -V π θ k-τ k (s (k) ) + V π θ k-τ k (s (k) ) -φ(s (k) ) ω * k-τ k ≥ -C ψ E ∇J(θ k ) 2 γ φ(s (k) ) ω * k-τ k -V π θ k-τ k (s (k) ) + V π θ k-τ k (s (k) ) -φ(s (k) ) ω * k-τ k = -C ψ E ∇J(θ k ) 2 E γ φ(s (k) ) ω * k-τ k -V π θ k-τ k (s (k) ) + V π θ k-τ k (s (k) ) -φ(s (k) ) ω * k-τ k θ k , θ k-τ k ≥ -2C ψ app E ∇J(θ k ) 2 ≥ -2C ψ L V fa -2C ψ sp E ∇J(θ k ) 2 where the second last inequality follows from the fact that E γ φ(s (k) ) ω * k-τ k -V π θ k-τ k (s (k) ) + V π θ k-τ k (s (k) ) -φ(s (k) ) ω * k-τ k ≤ γ E φ(s (k) ) ω * k-τ k -V π θ k-τ k (s (k) ) 2 + E V π θ k-τ k (s (k) ) -φ(s (k) ) ω * k-τ k 2 ≤ 2 app . Define artificial transition x(k) := (s (k) , a (k) , s (k) ∼ P), then I 2 can be bounded as I (3) 2 = E ∇J(θ k ), δ(x (k) , θ k-τ k )ψ θ k-τ k (s (k) , a (k) ) -∇J(θ k ) = E E ∇J(θ k ), δ(x (k) , θ k-τ k )ψ θ k-τ k (s (k) , a (k) ) -∇J(θ k ) θ k-τ k , θ k = E ∇J(θ k ), E δ(x (k) , θ k-τ k ) -δ(x (k) , θ k-τ k ) ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k + E ∇J(θ k ), E δ(x (k) , θ k-τ k )ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k -∇J(θ k ) ≥ -E ∇J(θ k ) 2 E δ(x (k) , θ k-τ k ) -δ(x (k) , θ k-τ k ) ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k 2 -E ∇J(θ k ) 2 E δ(x (k) , θ k-τ k )ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k -∇J(θ k ) 2 . 
( ) The first term in the last inequality can be bounded as E δ(x (k) , θ k-τ k ) -δ(x (k) , θ k-τ k ) ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k = E δ(x (k) , θ k-τ k ) -δ(x (k) , θ k-τ k ) ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k = E r(x (k) ) + γ E[r(s k , a , s )] -r(x (k) ) + γ E[r(s k , a , s )] ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k ≤ 2C ψ r max P -P T V ≤ 8C ψ r max (1 -γ), where the last inequality follows P -P T V = 2 s ∈S P(s |s, a) -P(s |s, a) = 2(1 -γ) s ∈S |P(s |s, a) -η(s )| ≤ 4(1 -γ). The second term in (59) can be rewritten as E δ(x (k) , θ k-τ k )ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k = E s (k) ∼µ θ k-τ k a (k) ∼π θ k-τ k s (k) ∼P r(x (k) ) + γV π θ k-τ k (s (k) ) -V π θ k-τ k (s (k) ) ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k = E s (k) ∼µ θ k-τ k a (k) ∼π θ k-τ k Q π θ k-τ k (s (k) , a (k) ) -V π θ k-τ k (s (k) ) ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k = E s (k) ∼µ θ k-τ k a (k) ∼π θ k-τ k A π θ k-τ k (s (k) , a (k) )ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k = E s (k) ∼d θ k-τ k a (k) ∼π θ k-τ k A π θ k-τ k (s (k) , a (k) )ψ θ k-τ k (s (k) , a (k) ) θ k-τ k , θ k = ∇J(θ k-τ k ) ( ) where the second last equality follows µ θ (•) = d θ (•) with d θ being a shorthand notation of d π θ [6] . Under review as a conference paper at ICLR 2021 Substituting (60) and ( 62) into (59) yields I (3) 2 ≥ -8C ψ r max (1 -γ) E ∇J(θ k ) 2 -E [ ∇J(θ k ) 2 ∇J(θ k-τ k ) -∇J(θ k ) 2 ] ≥ -8C ψ r max (1 -γ) E ∇J(θ k ) 2 -L V L J E θ k-τ k -θ k 2 ≥ -8C ψ r max (1 -γ) E ∇J(θ k ) 2 -L V L J C p K 0 α k-K0 , where the second last inequality is due to L J -Lipschitz of policy gradient shown in Proposition 1, and the last inequality follows (57).

Collecting the lower bounds of
I (1) 2 , I 2 and I (3) 2 gives I 2 ≥ -D 1 K 0 α k-K0 -(2C ψ sp + 8C ψ r max (1 -γ)) E ∇J(θ k ) 2 -2C ψ L V fa + ∇J(θ k ) 2 2 , (64) where the constant is D 1 := L V L ω C ψ C p (1 + γ) + L V L J C p . Substituting ( 56) and ( 64) into (55) yields E[J(θ k+1 )] ≥ E[J(θ k )] -2α k C ψ ( sp + 4r max (1 -γ) + C δ K 0 β k-1 + ω k -ω * k 2 ) E ∇J(θ k ) 2 -α k D 1 K 0 α k-K0 -2α k C ψ L V fa + α k ∇J(θ k ) 2 2 - L J 2 C 2 p α 2 k . ( ) By following Cauchy-Schwarz inequality, the second term in (65) can be bounded as ( sp + 4r max (1 -γ) + C δ K 0 β k-1 + ω k -ω * k 2 ) E ∇J(θ k ) 2 ≤ E ∇J(θ k ) 2 2 E ( sp + 4r max (1 -γ) + C δ K 0 β k-1 + ω k -ω * k 2 ) 2 ≤ E ∇J(θ k ) 2 2 E 4C 2 δ K 2 0 β 2 k-1 + 4 ω k -ω * k 2 2 + 4 2 sp + 64r 2 max (1 -γ) 2 = 2 E ∇J(θ k ) 2 2 C 2 δ K 2 0 β 2 k-1 + E ω k -ω * k 2 2 + O( 2 sp ), where the last inequality follows the order of sp in Lemma 7. Collecting the upper bound gives E[J(θ k+1 )] ≥ E[J(θ k )] -4α k C ψ E ∇J(θ k ) 2 2 C 2 δ K 2 0 β 2 k-1 + E ω k -ω * k 2 2 + O( 2 sp ) -α k D 1 K 0 α k-K0 -2α k C ψ L V fa + α k ∇J(θ k ) 2 2 - L J 2 C 2 p α 2 k . ( ) Dividing both sides of (67) by α k , then rearranging and taking summation on both sides give K k=K0 E ∇J(θ k ) 2 2 ≤ K k=K0 1 α k (E[J(θ k+1 )] -E[J(θ k )]) I3 + K k=K0 D 1 K 0 α k-K0 + L J 2 C 2 p α k I4 + 4C ψ K k=K0 E ∇J(θ k ) 2 2 C 2 δ K 2 0 β 2 k-1 + E ω k -ω * k 2 2 + O( 2 sp ) I5 + 2C ψ L V (K -K 0 + 1) fa . ( ) We bound I 3 as I 3 = K k=K0 1 α k (E [J(θ k+1 )] -E [J(θ k )]) = K k=K0 1 α k-1 - 1 α k E [J(θ k )] - 1 α M K -1 E [J(θ M K )] + 1 α K E [J(θ K+1 )] ≤ 1 α K E [J(θ K+1 )] ≤ r max 1 -γ 1 α K = O(K σ1 ), where the first inequality is due to the α k is monotonic decreasing and positive, and last inequality is due to V π θ (s) ≤ rmax 1-γ for any s ∈ S and π θ . We bound I 4 as I 4 = K k=K0 D 1 K 0 α k-K0 + L J 2 C 2 p α k ≤ K-K0 k=0 D 1 K 0 α k + L J 2 C 2 p α k = O(K 0 K 1-σ1 ). 
We bound I 5 as I 5 = K k=K0 E ∇J(θ k ) 2 2 C 2 δ K 2 0 β 2 k-1 + E ω k -ω * k 2 2 + O( 2 sp ) ≤ K k=K0 E ∇J(θ k ) 2 2 K k=K0 C 2 δ K 2 0 β 2 k-1 + E ω k -ω * k 2 2 + O( 2 sp ) = K k=K0 E ∇J(θ k ) 2 2 C 2 δ K 2 0 K k=K0 β 2 k-1 + K k=K0 E ω k -ω * k 2 2 + O(K 2 sp ), where the first inequality follows Cauchy-Schwartz inequality. In (70), we have K k=K0 β 2 k-1 ≤ K-K0 k=0 β 2 k = K-K0 k=0 c 2 2 (1 + k) -2σ2 = O(K 1-2σ2 ). Substituting the last equality into (70) gives I 5 ≤ K k=M K E ∇J(θ k ) 2 2 O(K 2 0 K 1-2σ2 ) + K k=M K E ω k -ω * k 2 2 + O(K 2 sp ). ( ) Dividing both sides of (67) by K -K 0 + 1 and collecting upper bounds of I 3 , I 4 and I 5 give 1 K -K 0 + 1 K k=K0 E ∇J(θ k ) 2 2 ≤ 4C ψ K -K 0 + 1 K k=K0 E ∇J(θ k ) 2 2 O(K 2 0 K 1-2σ2 ) + K k=K0 E ω k -ω * k 2 2 + O(K 2 sp ) + O 1 K 1-σ1 + O K 0 K σ1 + O( fa ). ( ) Define the following functions T 4 (K) := 1 K -K 0 + 1 K k=K0 E ∇J(θ k ) 2 2 , T 5 (K) := 1 K -K 0 + 1 O(K 2 0 K 1-2σ2 ) + K k=K0 E ω k -ω * k 2 2 + O(K 2 sp ) , T 6 (K) := O 1 K 1-σ1 + O K 0 K σ1 + O( fa ). Then (72) can be rewritten as T 4 (K) ≤ T 6 (K) + √ 2(1 + γ)C ψ T 4 (K) T 5 (K). Solving this quadratic inequality in terms of T 4 (K), we obtain T 4 (K) ≤ 2T 6 (K) + 4(1 + γ) 2 C 2 ψ T 5 (K), (73) which implies 1 K -K 0 + 1 K k=K0 E ∇J(θ k ) 2 2 = O 1 K 1-σ1 + O K 0 K σ1 + O K 2 0 K 2σ2 + O 1 K -K 0 + 1 K k=K0 E ω k -ω * k 2 2 + O( app ). We further have 1 K K k=1 E ∇J(θ k ) 2 2 ≤ 1 K K0-1 k=1 L 2 V + K k=K0 E ∇J(θ k ) 2 2 = K 0 -1 K L 2 V + K -K 0 + 1 K 1 K -K 0 + 1 K k=K0 E ∇J(θ k ) 2 2 = O K 0 K + O 1 K -K 0 + 1 K k=K0 E ∇J(θ k ) 2 2 = O 1 K -K 0 + 1 K k=K0 E ∇J(θ k ) 2 2 (74) which completes the proof.
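The actor recursion θ_{k+1} = θ_k + α_k δ(x^{(k)}, ω_{k-τ_k}) ψ_{θ_{k-τ_k}}(s^{(k)}, a^{(k)}) analyzed above can be sketched for a tabular softmax policy, where the score function ψ_θ(s,a) = ∇_θ log π_θ(a|s) has the closed form e_a - π_θ(·|s) on the visited state's row. This is a serial toy sketch with random transitions and a fixed critic, omitting the asynchronous delay bookkeeping:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def score(theta, s, a):
    # psi_theta(s, a) = grad_theta log pi_theta(a|s) for a tabular softmax
    # policy: e_a - pi_theta(.|s) on row s, zero elsewhere.
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])
    g[s, a] += 1.0
    return g

rng = np.random.default_rng(4)
nS, nA, gamma = 4, 3, 0.9
theta = np.zeros((nS, nA))
omega = rng.normal(size=nS)            # one-hot critic: omega[s] plays V_hat(s)

s = 0
for k in range(100):
    a = int(rng.choice(nA, p=softmax(theta[s])))
    sp = int(rng.integers(nS))         # stand-in for sampling s' ~ P(.|s, a)
    r = float(rng.random())
    delta = r + gamma * omega[sp] - omega[s]       # TD error from the critic
    alpha_k = 0.1 / (1 + k) ** 0.8                 # alpha_k = c_1 (1+k)^{-sigma_1}
    theta = theta + alpha_k * delta * score(theta, s, a)
    s = sp
```

In the analysis, the TD error δ computed from the critic replaces the true advantage; the error terms I_1 and I_2 above quantify exactly the bias introduced by using ω_{k-τ_k} in place of ω*_{θ_k} and by the stale score function ψ_{θ_{k-τ_k}}.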

B.3 PROOF OF THEOREM 3

Given the definition in Section B.1, we now give the convergence proof of critic update in Algorithm 1 with linear function approximation and Markovian sampling. By following the derivation of (40), we have E ω k+1 -ω * k+1 2 2 ≤ (1 -2λβ k ) E ω k -ω * k 2 2 + 2β k (C 1 α k β k + C 2 K 0 β k-K0 ) E ω k -ω * k 2 + 2β k E ω k -ω * k , g(x (k) , ω k ) -g(θ k , ω k ) + C q β 2 k , where C 1 := C p L ω , C 2 := C δ (1 + γ) and C q := 2C 2 δ + 2L 2 ω C 2 p max (k) α 2 k β 2 k = 2C 2 δ + 2L 2 ω C 2 p c 2 1 c 2 2 . Now we consider the third item in the last inequality. For some m ∈ N + , we define M := (K 0 + 1)m + K 0 . Following Lemma 4 (to be presented in Sec. C.1), for some d m ≤ M and positive constants C 4 , C 5 , C 6 , C 7 , we have E ω k -ω * k , g(x (k) , ω k ) -g(θ k , ω k ) ≤ C 4 E θ k -θ k-dm 2 + C 5 dm i=τ k E θ k-i -θ k-dm 2 + C 6 E ω k -ω k-dm 2 + C 7 κρ m-1 ≤ C 4 k-1 i=k-dm E θ i+1 -θ i 2 + C 5 dm-1 i=τ k k-i-1 j=k-dm E θ j+1 -θ j 2 + C 6 k-1 i=k-dm E ω i+1 -ω i 2 + C 7 κρ m-1 ≤ C 4 k-1 i=k-dm α i C p + C 5 dm-1 i=τ k k-i-1 j=k-dm α j C p + C 6 k-1 i=k-dm β i C δ + C 7 κρ m-1 ≤ C 4 α k-dm k-1 i=k-dm C p + C 5 α k-dm dm-1 i=τ k k-i-1 j=k-dm C p + C 6 β k-dm k-1 i=k-dm C δ + C 7 κρ m-1 ≤ C 4 d m C p α k-dm + C 5 (d m -τ k ) 2 C p α k-dm + C 6 d m C δ β k-dm + C 7 κρ m-1 ≤ C 4 M + C 5 M 2 C p α k-M + C 6 M C δ β k-M + C 7 κρ m-1 , where the third last inequality is due to the monotonicity of step size, and the last inequality is due to τ k ≥ 0 and d m ≤ M . Further letting m = m K which is defined in ( 21) yields E ω k -ω * k , g(x (k) , ω k ) -g(θ k , ω k ) = C 4 M K + C 5 M 2 K C p α k-M K + C 6 C δ M K β k-M K + C 7 κρ m K -1 ≤ C 4 M K + C 5 M 2 K C p α k-M K + C 6 C δ M K β k-M K + C 7 α K , where M K = (K 0 + 1)m K + K 0 , and the last inequality follows the definition of m K . 
Substituting (77) into (75), then rearranging and summing up both sides over k = M K , ..., K yield 2λ K k=M K E ω k -ω * k 2 2 ≤ K k=M K 1 β k E ω k -ω * k 2 2 -E ω k+1 -ω * k+1 2 2 I1 +C q K k=M K β k I2 + 2 K k=M K C 4 M K + C 5 M 2 K C p α k-M K + C 6 C δ M K β k-M K + C 7 α K I3 + 2 K k=M K C 1 α k β k + C 2 K 0 β k-K0 E ω k -ω * k 2 I4 . ( ) where the order of I 1 , I 2 and I 4 have already been given by ( 45), ( 46) and (50) respectively. We bound I 3 as I 3 = C 4 M K + C 5 M 2 K C p K k=M K α k + C 6 C δ M K K k=M K β k + C 7 α K K k=M K 1 ≤ C 4 M K + C 5 M 2 K C p c 1 K 1-σ1 1 -σ 1 + C 6 C δ M K c 2 K 1-σ2 1 -σ 2 + C 7 c 1 K(1 + K) -σ1 = O (K 2 0 log 2 K)K 1-σ1 + O (K 0 log K)K 1-σ2 , where the last inequality follows from the integration rule b k=a k -σ ≤ b 1-σ 1-σ , and the last equality is due to O(M K ) = O(K 0 m K ) = O(K 0 log K). Collecting the bounds of I 1 , I 2 , I 3 and I 4 , and dividing both sides of (78) by K -M K + 1 yield 2λ 1 K -M K + 1 K k=M K E ω k -ω * k 2 2 ≤ O K 2(σ2-σ1)+1 + O (K 0 K -σ1+1 ) + O (K 2 0 K 1-2σ2 ) K -M K + 1 K k=M K E ω k -ω * k 2 2 + O 1 K 1-σ2 + O K 2 0 log 2 K K σ1 + O K 0 log K K σ2 . ( ) Similar to the derivation of (52), (80) implies 1 K -M K + 1 K k=M K E ω k -ω * k 2 2 = O 1 K 1-σ2 + O 1 K 2(σ1-σ2) + O K 2 0 K 2σ2 + O K 2 0 log 2 K K σ1 + O K 0 log K K σ2 . Under review as a conference paper at ICLR 2021 Similar to (53), we have 1 K K k=1 E ω k -ω * k 2 2 = O K 0 log K K + O 1 K -M K + 1 K k=M K E ω k -ω * k 2 2 = O 1 K -M K + 1 K k=M K E ω k -ω * k 2 2 (81) which completes the proof.
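The step-size sums appearing throughout these proofs (e.g. when bounding I_2 and I_3) rely on the integral rule Σ_{k=1}^{K} k^{-σ} ≤ K^{1-σ}/(1-σ) for σ ∈ (0,1), which follows from k^{-σ} ≤ ∫_{k-1}^{k} x^{-σ} dx. A quick numerical check:

```python
import numpy as np

# Check sum_{k=1}^{K} k^{-sigma} <= K^{1-sigma} / (1 - sigma) for sigma in (0, 1).
for sigma in (0.4, 0.6, 0.8):
    for K in (100, 10_000):
        lhs = np.sum(np.arange(1, K + 1.0) ** (-sigma))
        rhs = K ** (1 - sigma) / (1 - sigma)
        print(f"sigma={sigma}, K={K}: sum={lhs:.2f} <= bound={rhs:.2f}")
```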

B.4 PROOF OF THEOREM 4

Given the definition in section B.2, we now give the convergence proof of actor update in Algorithm 1 with linear value function approximation and Markovian sampling method. By following the derivation of (55), we have E[J(θ k+1 )] ≥ E[J(θ k )] + α k E ∇J(θ k ), δ(x (k) , ω k-τ k ) -δ(x (k) , ω * k ) ψ θ k-τ k (s (k) , a (k) ) I1 + α k E ∇J(θ k ), δ(x (k) , ω * k )ψ θ k-τ k (s (k) , a (k) ) I2 - L J 2 C 2 p α 2 k . The item I 1 can be bounded by following (56) as I 1 ≥ -2C ψ E [ ∇J(θ k ) 2 (C δ K 0 β k-1 + ω k -ω * k 2 )] . Next we consider I 2 . We first decompose it as I 2 = E ∇J(θ k ), δ(x (k) , ω * k )ψ θ k-τ k (s (k) , a (k) ) = E ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) I (1) 2 + E ∇J(θ k ), δ(x (k) , θ k )ψ θ k-τ k (s (k) , a (k) ) -∇J(θ k ) I (2) 2 + E ∇J(θ k ) 2 2 . For some m ∈ N + , define M := (K 0 + 1)m + K 0 . Following Lemma 5, for some d m ≤ M and positive constants D 2 , D 3 , D 4 , D 5 , I 2 can be bounded as I (1) 2 = E ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) ≥ -D 2 E θ k-τ k -θ k-dm 2 -D 3 E θ k -θ k-dm 2 -D 4 k-τ k i=k-dm E θ i -θ k-dm 2 -D 5 κρ m-1 -2C ψ L V fa -2C ψ sp E ∇J(θ k ) 2 ≥ -D 2 (d m -τ k )C p α k-dm -D 3 d m C p α k-dm -D 4 (d m -τ k ) 2 C p α k-dm -D 5 κρ m-1 -2C ψ L V fa -2C ψ sp E ∇J(θ k ) 2 , where the derivation of the last inequality is similar to that of (76). By setting m = m K in (85), and following the fact that d m K ≤ M K and τ k ≥ 0, we have I (1) 2 ≥ -D 2 M K C p α k-M K -D 3 M K C p α k-M K -D 4 M 2 K C p α k-M K -D 5 κρ m K -1 -2C ψ L V fa -2C ψ sp E ∇J(θ) 2 = -(D 2 + D 3 )C p M K + D 4 C p M 2 K α k-M K -D 5 κρ m K -1 -2C ψ L V fa -2C ψ sp E ∇J(θ k ) 2 ≥ -(D 2 + D 3 )C p M K + D 4 C p M 2 K α k-M K -D 5 α K -2C ψ L V fa -2C ψ sp E ∇J(θ k ) 2 , where the last inequality is due to the definition of m K . 
Following Lemma 6, for some positive constants D 6 , D 7 , D 8 and D 9 , we bound I 2 as I (2) 2 = E ∇J(θ k ), δ(x (k) , θ k )ψ θ k-τ k (s (k) , a (k) ) -∇J(θ k ) ≥ -D 6 E θ k-τ k -θ k-dm 2 -D 7 E θ k -θ k-dm 2 -D 8 dm i=τ k E θ k-i -θ k-dm 2 -D 9 κρ m-1 -8C ψ r max (1 -γ) E ∇J(θ k ) 2 . Similar to the derivation of (86), we have I (2) 2 ≥ -(D 6 + D 7 + D 8 M K ) C p M K α k-M K -D 9 α K -8C ψ r max (1 -γ) E ∇J(θ k ) 2 . ( ) Collecting the lower bounds of I 2 and I 2 yields I 2 ≥ -2C ψ L V fa -2C ψ ( sp + 4r max (1 -γ)) E ∇J(θ k ) 2 + E ∇J(θ k ) 2 2 -D K α k-M K -(D 5 + D 9 )α K , where we define 83) and ( 88) into (82) yields D K := (D 4 + D 8 )C p M 2 K + (D 2 + D 3 + D 6 + D 7 )C p M K for brevity. Substituting ( E[J(θ k+1 )] ≥ E[J(θ k )] -2α k C ψ E [ ∇J(θ k ) 2 ( sp + 4r max (1 -γ) + C δ K 0 β k-1 + ω k -ω * k 2 )] -α k (D K α k-M K + (D 5 + D 9 )α K ) -2C ψ L V fa α k + α k E ∇J(θ k ) 2 2 - L J 2 C 2 p α 2 k . Similar to the derivation of (67), the last inequality implies E[J(θ k+1 )] ≥ E[J(θ k )] -4α k C ψ E ∇J(θ k ) 2 2 C 2 δ K 2 0 β 2 k-1 + E ω k -ω * k 2 2 + O( 2 sp ) -α k (D K α k-M K + (D 5 + D 9 )α K ) -2C ψ L V fa α k + α k E ∇J(θ k ) 2 2 - L J 2 C 2 p α 2 k . Rearranging and dividing both sides by α k yield E ∇J(θ k ) 2 2 ≤ 1 α k (E[J(θ k+1 )] -E[J(θ k )]) + D K α k-M K + (D 5 + D 9 )α K + L J 2 C 2 p α k + 4C ψ E ∇J(θ k ) 2 2 C 2 δ K 2 0 β 2 k-1 + E ω k -ω * k 2 2 + O( 2 sp ) + 2C ψ L V fa . Taking summation gives K k=M K E ∇J(θ k ) 2 2 ≤ K k=M K 1 α k (E[J(θ k+1 )] -E[J(θ k )]) I3 + K k=M K D K α k-M K + L J 2 C 2 p α k + (D 5 + D 9 )α K I4 + 4C ψ K k=M K E ∇J(θ k ) 2 2 C 2 δ K 2 0 β 2 k-1 + E ω k -ω * k 2 2 + O( 2 sp ) I5 + 2C ψ L V (K -M K + 1) fa . in which the upper bounds of I 3 and I 5 have already been given by ( 69) and (71) respectively. 
We bound I 4 as I 4 = K k=M K D K α k-M K + L J 2 C 2 p α k + (D 5 + D 9 )α K ≤ K k=M K D K α k-M K + L J 2 C 2 p α k-M K + (D 5 + D 9 )α K = D K + L J 2 C 2 p K k=M K α k-M K + (D 5 + D 9 )(K -M K + 1)α K = D K + L J 2 C 2 p K-M K k=0 α k + (D 5 + D 9 )(K -M K + 1)α K ≤ D K + L J 2 C 2 p c 1 1 -σ 1 K 1-σ1 + c 1 (D 5 + D 9 )(K + 1) 1-σ1 = O (K 2 0 log 2 K)K 1-σ1 where the last inequality uses b k=a k -σ ≤ b 1-σ 1-σ , and the last equality is due to the fact that O(D K ) = O(M 2 K + M K ) = O((K 0 m K ) 2 + K 0 m K ) = O(K 2 0 log 2 K). Substituting the upper bounds of I 3 , I 4 and I 5 into (89), and dividing both sides by K -M K + 1 give 1 K -M K + 1 K k=M K E ∇J(θ k ) 2 2 ≤ 4C ψ K -M K + 1 K k=M K E ∇J(θ k ) 2 2 O(K 2 0 K 1-2σ2 ) + K k=M K E ω k -ω * k 2 2 + O(K 2 sp ) + O 1 K 1-σ1 + O K 2 0 log 2 K K σ1 + O( fa ). Following the similar steps of those in (73), (91) essentially implies 1 K -M K + 1 K k=M K E ∇J(θ k ) 2 2 = O 1 K 1-σ1 +O K 2 0 log 2 K K σ1 +O K 2 0 K 2σ2 +O 1 K -M K + 1 K k=M K E ω k -ω * θ k 2 2 +O( app ). Similar to (74), we have 1 K K k=1 E ∇J(θ k ) 2 2 = O K 0 log K K + O 1 K -M K + 1 K k=M K E ∇J(θ k ) 2 2 = O 1 K -M K + 1 K k=M K E ∇J(θ k ) 2 2 which completes the proof.

C SUPPORTING LEMMAS

C.1 SUPPORTING LEMMAS FOR THEOREM 3 Lemma 4. For any m ≥ 1 and k ≥ (K 0 + 1)m + K 0 + 1, we have E ω k -ω * θ k , g(x (k) , ω k ) -g(θ k , ω k ) ≤ C 4 E θ k -θ k-dm 2 + C 5 dm i=τ k E θ k-i -θ k-dm 2 + C 6 E ω k -ω k-dm 2 + C 7 κρ m-1 , where d m ≤ (K 0 + 1)m + K 0 , and C 4 := 2C δ L ω + 4R ω C δ |A|L π (1 + log ρ κ -1 + (1 -ρ) -1 ), C 5 := 4R ω C δ |A|L π and C 6 := 4(1 + γ)R ω + 2C δ , C 7 := 8R ω C δ . Proof. Consider the collection of random samples {x (k-K0-1) , x (k-K0) , ..., x (k) }. Suppose x (k) is sampled by worker n, then due to Assumption 1, {x (k-K0-1) , x (k-K0) , ..., x (k-1) } will contain at least another sample drawn by worker n. Therefore, {x (k-(K0+1)m) , x (k-(K0+1)m+1) , ..., x (k-1) } will contain at least m samples from worker n. Consider the Markov chain formed by m + 1 samples in {x (k-(K0+1)m) , x (k-(K0+1)m+1) , ..., x (k) }: s t-m θ k-dm ----→ a t-m P -→ s t-m+1 θ k-d m-1 ------→ a t-m+1 • • • s t-1 θ k-d 1 ----→ a t-1 P -→ s t θ k-d 0 ----→ a t P -→ s t+1 , where (s t , a t , s t+1 ) = (s (k) , a (k) , s (k) ), and {d j } m j=0 is some increasing sequence with d 0 := τ k . Suppose θ k-dm was used to do the k m th update, then we have x t-m = x (km) . Following Assumption 1, we have τ km = k m -(k -d m ) ≤ K 0 . Since x (km) is in {x (k-(K0+1)m) , ..., x (k) }, we have k m ≥ k -(K 0 + 1)m. Combining these two inequalities, we have d m ≤ (K 0 + 1)m + K 0 . Given (s t-m , a t-m , s t-m+1 ) and θ k-dm , we construct an auxiliary Markov chain as that in Lemma 2: s t-m θ k-dm ----→ a t-m P -→ s t-m+1 θ k-dm ----→ a t-m+1 • • • s t-1 θ k-dm ----→ a t-1 P -→ s t θ k-dm ----→ a t P -→ s t+1 . For brevity, we define ∆ 1 (x, θ, ω) := ω -ω * θ , g(x, ω) -g(θ, ω) . Throughout this proof, we use θ, θ , ω, ω , x and x as shorthand notations of θ k , θ k-dm , ω k , ω k-dm , x t and x t respectively. 
First we decompose ∆ 1 (x, θ, ω) as ∆ 1 (x, θ, ω) = ∆ 1 (x, θ, ω) -∆ 1 (x, θ , ω) I1 + ∆ 1 (x, θ , ω) -∆ 1 (x, θ , ω ) I2 + ∆ 1 (x, θ , ω ) -∆ 1 ( x, θ , ω ) I3 + ∆ 1 ( x, θ , ω ) I4 where the last inequality is due to Proposition 2. We use x ∼ θ as shorthand notations to represent that s ∼ µ θ , a ∼ π θ , s ∼ P. For the second term in (94), we have | ω -ω * θ , g(x, ω) -g(θ, ω) -ω -ω * θ , g(x, ω) -g(θ , ω) | = | ω -ω * θ , g(θ , ω) -g(θ, ω) | ≤ ω -ω * θ 2 g(θ , ω) -g(θ, ω) 2 ≤ 2R ω g(θ , ω) -g(θ, ω) 2 = 2R ω E x∼θ [g(x, ω)] -E x∼θ [g(x, ω)] 2 ≤ 2R ω sup x g(x, ω) 2 µ θ ⊗ π θ ⊗ P -µ θ ⊗ π θ ⊗ P T V ≤ 2R ω C δ µ θ ⊗ π θ ⊗ P -µ θ ⊗ π θ ⊗ P T V = 4R ω C δ d T V µ θ ⊗ π θ ⊗ P, µ θ ⊗ π θ ⊗ P ≤ 4R ω C δ |A|L π (1 + log ρ κ -1 + (1 -ρ) -1 ) θ -θ 2 , where the third inequality follows the definition of TV norm, the second last inequality follows (32) , and the last inequality follows Lemma A.1. in [17] . Collecting the upper bounds of the two terms in (94) yields I 1 ≤ 2C δ L ω + 4R ω C δ |A|L π (1 + log ρ κ -1 + (1 -ρ) -1 ) θ -θ 2 . Next we bound E[I 2 ] in (93) as E[I 2 ] = E[∆ 1 (x, θ , ω) -∆ 1 (x, θ , ω )] = E ω -ω * θ , g(x, ω) -g(θ , ω) -ω -ω * θ , g(x, ω ) -g(θ , ω ) ≤ E | ω -ω * θ , g(x, ω) -g(θ , ω) -ω -ω * θ , g(x, ω ) -g(θ , ω ) | + E | ω -ω * θ , g(x, ω ) -g(θ , ω ) -ω -ω * θ , g(x, ω ) -g(θ , ω ) | . (95) We bound the first term in (95) as E | ω -ω * θ , g(x, ω) -g(θ , ω) -ω -ω * θ , g(x, ω ) -g(θ , ω ) | = E | ω -ω * θ , g(x, ω) -g(x, ω ) + g(θ , ω ) -g(θ , ω) | ≤ 2R ω (E g(x, ω) -g(x, ω ) 2 + E g(θ , ω ) -g(θ , ω) 2 ) ≤ 2R ω E g(x, ω) -g(x, ω ) 2 + E E x∼θ [g(x, ω )] -E x∼θ [g(x, ω)] 2 = 2R ω E (γφ(s ) -φ(s)) (ω -ω ) 2 + E E x∼θ (γφ(s ) -φ(s)) (ω -ω) 2 ≤ 2R ω ((1 + γ) E ω -ω 2 + (1 + γ) E ω -ω 2 ) = 4R ω (1 + γ) E ω -ω 2 . We bound the second term in (95) as E | ω -ω * θ , g(x, ω ) -g(θ , ω ) -ω -ω * θ , g(x, ω ) -g(θ , ω ) | = E | ω -ω , g(x, ω ) -g(θ , ω ) | ≤ 2C δ E ω -ω 2 . 
Collecting the upper bounds of the two terms in (95) yields E[I 2 ] ≤ (4(1 + γ)R ω + 2C δ ) E ω -ω 2 . We first bound I 3 as E[I 3 |θ , ω , s t-m+1 ] = E [∆ 1 (x, θ , ω ) -∆ 1 ( x, θ , ω )|θ , ω , s t-m+1 ] ≤ |E [∆ 1 (x, θ , ω )|θ , ω , s t-m+1 ] -E [∆ 1 ( x, θ , ω )|θ , ω , s t-m+1 ]| ≤ sup x |∆ 1 (x, θ , ω )| P(x ∈ •|θ , ω , s t-m+1 ) -P( x ∈ •|θ , ω , s t-m+1 ) T V ≤ 8R ω C δ d T V (P(x ∈ •|θ , s t-m+1 ), P( x ∈ •|θ , s t-m+1 )) , where the second last inequality follows the definition of TV norm, and the last inequality follows the fact that |∆ 1 (x, θ , ω )| ≤ ω -ω * θ 2 g(x, ω ) -g(θ , ω ) 2 ≤ 4R ω C δ . By following (22) in Lemma 2, we have d T V (P(x ∈ •|θ , s t-m+1 ), P( x ∈ •|θ , s t-m+1 )) ≤ 1 2 |A|L π dm i=τ k E [ θ k-i -θ k-dm 2 | θ , s t-m+1 ] . Substituting the last inequality into (96), then taking total expectation on both sides yield E[I 3 ] ≤ 4R ω C δ |A|L π dm i=τ k E θ k-i -θ k-dm 2 . Next we bound I 4 . Define x := (s, a, s ) where s ∼ µ θ , a ∼ π θ and s ∼ P. It is immediate that E[∆ 1 (x, θ , ω )|θ , ω , s t-m+1 ] = ω -ω * θ , E[g(x, ω )|θ , ω , s t-m+1 ] -g(θ , ω ) = ω -ω * θ , g(θ , ω ) -g(θ , ω ) = 0. Then we have E[I 4 |θ , ω , s t-m+1 ] = E [∆ 1 ( x, θ , ω ) -∆ 1 (x, θ , ω )|θ , ω , s t-m+1 ] ≤ |E [∆ 1 ( x, θ , ω )|θ , ω , s t-m+1 ] -E [∆ 1 (x, θ , ω )|θ , ω , s t-m+1 ]| ≤ sup x |∆ 1 (x, θ , ω )| P( x ∈ •|θ , s t-m+1 ) -P(x ∈ •|θ , s t-m+1 ) T V ≤ 8R ω C δ d T V (P( x ∈ •|θ , s t-m+1 ), P(x ∈ •|θ , s t-m+1 )) = 8R ω C δ d T V P( x ∈ •|θ , s t-m+1 ), µ θ ⊗ π θ ⊗ P , where the second inequality follows the definition of TV norm, and the third inequality follows (97). The auxiliary Markov chain with policy π θ starts from initial state s t-m+1 , and s t is the (m -1)th state on the chain. Following Lemma 1, we have: d T V P( x ∈ •|θ , s t-m+1 ), µ θ ⊗ π θ ⊗ P = d T V P (( s t , a t , s t+1 ) ∈ •|θ , s t-m+1 ) , µ θ ⊗ π θ ⊗ P ≤ κρ m-1 . 
Substituting the last inequality into (98) and taking total expectation on both sides yield E[I 4 ] ≤ 8R ω C δ κρ m-1 . Taking total expectation on (93) and collecting bounds of I 1 , I 2 , I 3 , I 4 yield E [∆ 1 (x, θ, ω)] ≤ C 4 E θ k -θ k-dm 2 + C 5 dm i=τ k E θ k-i -θ k-dm 2 + C 6 E ω k -ω k-dm 2 + C 7 κρ m-1 , where C 4 := 2C δ L ω + 4R ω C δ |A|L π (1 + log ρ κ -1 + (1 -ρ) -1 ), C 5 := 4R ω C δ |A|L π , C 6 := 4(1 + γ)R ω + 2C δ and C 7 := 8R ω C δ .
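The κρ m-1 terms above come from the geometric mixing of the underlying Markov chain (Lemma 1). As a sanity check, this decay can be observed numerically; the 3-state kernel below is an illustrative toy of our own choosing, not one from the paper:

```python
import numpy as np

# A small ergodic Markov chain; the specific kernel is illustrative only.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

# Stationary distribution: normalized left Perron eigenvector of P.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

def tv_distance(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * np.abs(p - q).sum()

# TV distance of the t-step state distribution to mu decays geometrically.
dist = np.eye(3)[0]          # start deterministically in state 0
tv = []
for t in range(10):
    tv.append(tv_distance(dist, mu))
    dist = dist @ P

# Successive ratios tv[t+1]/tv[t] stay below some rho < 1 (geometric decay).
ratios = [tv[t + 1] / tv[t] for t in range(9)]
print(max(ratios) < 1.0)
```

Running this prints True: every one-step contraction ratio is below one, matching the d TV ≤ κρ m-1 behavior the lemmas rely on.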

C.2 SUPPORTING LEMMAS FOR THEOREM 4

Lemma 5. For any m ≥ 1 and k ≥ (K 0 + 1)m + K 0 + 1, we have E ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) ≥ -D 2 E θ k-τ k -θ k-dm 2 -D 3 E θ k -θ k-dm 2 -D 4 dm i=τ k E θ k-i -θ k-dm 2 -D 5 κρ m-1 -2C ψ L V √ε sp -2C ψ √ε fa E ∇J(θ k ) 2 , where D 2 := 2L V L ψ C δ , D 3 := (2C δ C ψ L J + L V C ψ (L ω + L V )(1 + γ) + 2C ψ L J √ε app ), D 4 := 2L V C ψ C δ |A|L π and D 5 := 4L V C ψ C δ . Proof. For the worker that contributes to the kth update, we construct its Markov chain: s t-m θ k-dm ----→ a t-m P -→ s t-m+1 θ k-d m-1 ------→ a t-m+1 • • • s t-1 θ k-d 1 ----→ a t-1 P -→ s t θ k-d 0 ----→ a t P -→ s t+1 , where (s t , a t , s t+1 ) = (s (k) , a (k) , s' (k) ), and {d j } m j=0 is an increasing sequence with d 0 := τ k . By (92) in Lemma 4, we have d m ≤ (K 0 + 1)m + K 0 . Given (s t-m , a t-m , s t-m+1 ) and θ k-dm , we construct an auxiliary Markov chain: s t-m θ k-dm ----→ a t-m P -→ s t-m+1 θ k-dm ----→ a t-m+1 • • • s t-1 θ k-dm ----→ a t-1 P -→ s t θ k-dm ----→ a t P -→ s t+1 . First we have ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) = ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) + ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-dm (s (k) , a (k) ) . (99)
We first bound the first term in (99) as ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) ≥ -∇J(θ k ) 2 | δ(x (k) , ω * k ) -δ(x (k) , θ k )| ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) 2 ≥ -∇J(θ k ) 2 (| δ(x (k) , ω * k )| + |δ(x (k) , θ k )|) ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) 2 ≥ -L V (| δ(x (k) , ω * k )| + |δ(x (k) , θ k )|) ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) 2 ≥ -2L V C δ ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) 2 ≥ -2L V L ψ C δ θ k-τ k -θ k-dm 2 , (100) where the last inequality follows Assumption 3, and the second last inequality follows | δ(x, ω * θ )| ≤ |r(x)| + γ φ(s ) 2 ω * θ 2 + φ(s) 2 ω * θ 2 ≤ r max + (1 + γ)R ω ≤ C δ and |δ(x, θ)| ≤ |r(x)| + γ|V π θ (s )| + |V π θ (s)| ≤ r max + (1 + γ)r max /(1 -γ) ≤ C δ . Substituting (100) into (99) gives ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) ≥ -2L V L ψ C δ θ k-τ k -θ k-dm 2 + ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-dm (s (k) , a (k) ) . (101) Then we bound the second term in (101). For brevity, we define ∆ 2 (x, θ) := ∇J(θ), δ(x, ω * θ ) -δ(x, θ) ψ θ k-dm (s, a) . In the following proof, we use θ, θ , ω * θ , ω * θ , x and x as shorthand for θ k , θ k-dm , ω * k , ω * k-dm , x t and x t respectively. We also define x := (s, a, s ), where s ∼ µ θ , a ∼ π θ and s ∼ P. We decompose the second term in (101) as ∆ 2 (x, θ) = ∆ 2 (x, θ) -∆ 2 (x, θ ) I1 + ∆ 2 (x, θ ) -∆ 2 ( x, θ ) I2 + ∆ 2 ( x, θ ) -∆ 2 (x, θ ) I3 + ∆ 2 (x, θ ) I4 . We bound I 1 by splitting it into a term handled by the L J -Lipschitz continuity of ∇J(θ) in Proposition 1, and a term handled by the L ω -Lipschitz continuity of ω * θ shown in Proposition 2 together with the L V -Lipschitz continuity of V π θ (s) shown in Lemma 3. Collecting these bounds gives the lower bound I 1 ≥ -(2C δ C ψ L J + L V C ψ (L ω + L V )(1 + γ)) θ -θ 2 .
Next we bound I 2 as E[I 2 |θ , s t-m+1 ] = E [∆ 2 (x, θ ) -∆ 2 ( x, θ )|θ , s t-m+1 ] ≥ -|E [∆ 2 (x, θ )| θ , s t-m+1 ] -E [∆ 2 ( x, θ )| θ , s t-m+1 ]| ≥ -sup x |∆ 2 (x, θ )| P(x ∈ •|θ , s t-m+1 ) -P( x ∈ •|θ , s t-m+1 ) T V ≥ -4L V C ψ C δ d T V (P(x ∈ •|θ , s t-m+1 ), P( x ∈ •|θ , s t-m+1 )) ≥ -2L V C ψ C δ |A|L π dm i=τ k E [ θ k-i -θ k-dm 2 | θ , s t-m+1 ] , (102) where the second inequality is due to the definition of the TV norm, the last inequality follows (22) in Lemma 2, and the second last inequality follows the fact that |∆ 2 (x, θ )| ≤ ∇J(θ ) 2 | δ(x, ω * θ ) -δ(x, θ )| ψ θ (s, a) 2 ≤ 2L V C δ C ψ . (103) Taking total expectation on both sides of (102) yields E[I 2 ] ≥ -2L V C ψ C δ |A|L π dm i=τ k E θ k-i -θ k-dm 2 . Next we bound I 3 as E[I 3 |θ , s t-m+1 ] = E [∆ 2 ( x, θ ) -∆ 2 (x, θ )| θ , s t-m+1 ] ≥ -|E [∆ 2 ( x, θ )| θ , s t-m+1 ] -E [∆ 2 (x, θ )| θ , s t-m+1 ]| ≥ -sup x |∆ 2 (x, θ )| P( x ∈ •|θ , s t-m+1 ) -P(x ∈ •|θ , s t-m+1 ) T V ≥ -4L V C ψ C δ d T V P( x ∈ •|θ , s t-m+1 ), µ θ ⊗ π θ ⊗ P , where the second inequality is due to the definition of the TV norm, and the last inequality follows (103). Following Lemma 1, d T V P( x ∈ •|θ , s t-m+1 ), µ θ ⊗ π θ ⊗ P ≤ κρ m-1 , so taking total expectation yields E[I 3 ] ≥ -4L V C ψ C δ κρ m-1 . Next we bound I 4 . By the Cauchy–Schwarz inequality, Jensen's inequality and the definition of ε app , we have E[I 4 ] = E[∆ 2 (x, θ )] ≥ -2C ψ √ε app E ∇J(θ ) 2 = -2C ψ √ε app E ∇J(θ ) -∇J(θ) + ∇J(θ) 2 ≥ -2C ψ √ε app E ∇J(θ ) -∇J(θ) 2 -2C ψ √ε app E ∇J(θ) 2 ≥ -2C ψ √ε app E ∇J(θ ) -∇J(θ) 2 -2C ψ √ε fa E ∇J(θ) 2 -2C ψ L V √ε sp ≥ -2C ψ L J √ε app E θ -θ 2 -2C ψ √ε fa E ∇J(θ) 2 -2C ψ L V √ε sp , where the third inequality uses √ε app ≤ √ε fa + √ε sp together with ∇J(θ) 2 ≤ L V , and the last inequality follows Proposition 1. Taking total expectation on both sides of (101), and collecting the lower bounds of I 1 , I 2 , I 3 and I 4 yield E ∇J(θ k ), δ(x (k) , ω * k ) -δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) ≥ -D 2 E θ k-τ k -θ k-dm 2 -D 3 E θ k -θ k-dm 2 -D 4 dm i=τ k E θ k-i -θ k-dm 2 -D 5 κρ m-1 -2C ψ L V √ε sp -2C ψ √ε fa E ∇J(θ k ) 2 , where D 2 := 2L V L ψ C δ , D 3 := (2C δ C ψ L J + L V C ψ (L ω + L V )(1 + γ) + 2C ψ L J √ε app ), D 4 := 2L V C ψ C δ |A|L π and D 5 := 4L V C ψ C δ . Lemma 6.
For any m ≥ 1 and k ≥ (K 0 + 1)m + K 0 + 1, we have E ∇J(θ k ), δ(x (k) , θ k )ψ θ k-τ k (s (k) , a (k) ) -∇J(θ k ) ≥ -D 6 E θ k-τ k -θ k-dm 2 -D 7 E θ k -θ k-dm 2 -D 8 dm i=τ k E θ k-i -θ k-dm 2 -D 9 κρ m-1 -8C ψ r max (1 -γ) E ∇J(θ k ) 2 , where D 6 := L V C δ L ψ , D 7 := C p L J + (1 + γ)L 2 V C ψ + 2L V L J + 8C ψ r max L J (1 -γ), D 8 := L V (C p + L V )|A|L π , D 9 := 2L V (C p + L V ). Proof. For the worker that contributes to the kth update, we construct its Markov chain: s t-m θ k-dm ----→ a t-m P -→ s t-m+1 θ k-d m-1 ------→ a t-m+1 • • • s t-1 θ k-d 1 ----→ a t-1 P -→ s t θ k-d 0 ----→ a t P -→ s t+1 , where (s t , a t , s t+1 ) = (s (k) , a (k) , s' (k) ), and {d j } m j=0 is an increasing sequence with d 0 := τ k . By (92) in Lemma 4, we have d m ≤ (K 0 + 1)m + K 0 . Given (s t-m , a t-m , s t-m+1 ) and θ k-dm , we construct an auxiliary Markov chain: s t-m θ k-dm ----→ a t-m P -→ s t-m+1 θ k-dm ----→ a t-m+1 • • • s t-1 θ k-dm ----→ a t-1 P -→ s t θ k-dm ----→ a t P -→ s t+1 . First we have ∇J(θ k ), δ(x (k) , θ k )ψ θ k-τ k (s (k) , a (k) ) -∇J(θ k ) = ∇J(θ k ), δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) + ∇J(θ k ), δ(x (k) , θ k )ψ θ k-dm (s (k) , a (k) ) -∇J(θ k ) . (105) We bound the first term in (105) as ∇J(θ k ), δ(x (k) , θ k ) ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) ≥ -∇J(θ k ) 2 δ(x (k) , θ k ) 2 ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) 2 ≥ -L V δ(x (k) , θ k ) 2 ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) 2 ≥ -L V C δ ψ θ k-τ k (s (k) , a (k) ) -ψ θ k-dm (s (k) , a (k) ) 2 ≥ -L V C δ L ψ θ k-τ k -θ k-dm 2 , (106) where the last inequality follows Assumption 3, and the second last inequality follows the fact that |δ(x, θ)| ≤ |r(x)| + γ|V π θ (s )| + |V π θ (s)| ≤ r max + (1 + γ)r max /(1 -γ) ≤ C δ .
Substituting (106) into (105) gives ∇J(θ k ), δ(x (k) , θ k )ψ θ k-τ k (s (k) , a (k) ) -∇J(θ k ) ≥ -L V C δ L ψ θ k-τ k -θ k-dm 2 + ∇J(θ k ), δ(x (k) , θ k )ψ θ k-dm (s (k) , a (k) ) -∇J(θ k ) . (107) Then we bound the second term in (107). For brevity, we define ∆ 3 (x, θ) := ∇J(θ), δ(x, θ)ψ θ k-dm (s, a) -∇J(θ) . Throughout the following proof, we use θ, θ , x and x as shorthand for θ k , θ k-dm , x t and x t respectively. We decompose ∆ 3 (x, θ) as ∆ 3 (x, θ) = ∆ 3 (x, θ) -∆ 3 (x, θ ) I1 + ∆ 3 (x, θ ) -∆ 3 ( x, θ ) I2 + ∆ 3 ( x, θ ) I3 . We first bound I 1 as |I 1 | = |∆ 3 (x, θ) -∆ 3 (x, θ )| = ∇J(θ), δ(x, θ)ψ θ (s, a) -∇J(θ) 2 2 -∇J(θ ), δ(x, θ )ψ θ (s, a) + ∇J(θ ) 2 2 ≤ | ∇J(θ), δ(x, θ)ψ θ (s, a) -∇J(θ ), δ(x, θ )ψ θ (s, a) | + | ∇J(θ ) 2 2 -∇J(θ) 2 2 | ≤ | ∇J(θ), δ(x, θ)ψ θ (s, a) -∇J(θ ), δ(x, θ )ψ θ (s, a) | + ∇J(θ ) + ∇J(θ) 2 ∇J(θ ) -∇J(θ) 2 ≤ | ∇J(θ), δ(x, θ)ψ θ (s, a) -∇J(θ ), δ(x, θ )ψ θ (s, a) | + 2L V L J θ -θ 2 , (108) where the last inequality is due to the L V -Lipschitz continuity of the value function and the L J -Lipschitz continuity of the policy gradient. We bound the first term in (108) as | ∇J(θ), δ(x, θ)ψ θ (s, a) -∇J(θ ), δ(x, θ )ψ θ (s, a) | ≤ L V C ψ |δ(x, θ) -δ(x, θ )| + C p ∇J(θ) -∇J(θ ) 2 = L V C ψ |γ(V π θ (s ) -V π θ (s )) + V π θ (s) -V π θ (s)| + C p ∇J(θ) -∇J(θ ) 2 ≤ L V C ψ (γ|V π θ (s ) -V π θ (s )| + |V π θ (s) -V π θ (s)|) + C p ∇J(θ) -∇J(θ ) 2 ≤ L V C ψ (γL V θ -θ 2 + L V θ -θ 2 ) + C p L J θ -θ 2 = (C p L J + (1 + γ)L 2 V C ψ ) θ -θ 2 . Substituting the above inequality into (108) gives the lower bound of I 1 : I 1 ≥ -(C p L J + (1 + γ)L 2 V C ψ + 2L V L J) θ -θ 2 . Next we bound I 2 . Similar to (102), we have E[I 2 |θ , s t-m+1 ] ≥ -L V (C p + L V )|A|L π dm i=τ k E [ θ k-i -θ k-dm 2 |θ , s t-m+1 ] , (109) where the second inequality is due to the definition of the TV norm, the last inequality is due to (22) in Lemma 2, and the second last inequality follows the fact that |∆ 3 (x, θ )| ≤ ∇J(θ ) 2 δ(x, θ )ψ θ k-dm (s, a) 2 + ∇J(θ ) 2 2 ≤ L V (C p + L V ). (110) Taking total expectation on both sides of (109) yields E[I 2 ] ≥ -L V (C p + L V )|A|L π dm i=τ k E θ k-i -θ k-dm 2 . Define x := (s, a, s ), where s ∼ d θ , a ∼ π θ and s ∼ P. Then we have E[I 3 ] = E [∆ 3 ( x, θ ) -∆ 3 (x, θ )] + E [∆ 3 (x, θ )] . (111)
We bound the first term in (111) as E [∆ 3 ( x, θ ) -∆ 3 (x, θ )|θ , s t-m+1 ] ≥ -2L V (C p + L V )d T V P( x ∈ •|θ , s t-m+1 ), µ θ ⊗ π θ ⊗ P , (112) where the second inequality follows the definition of the total variation norm, and the third inequality follows (110). The last equality is due to the fact shown by [6] that µ θ (•) = d θ (•), where µ θ is the stationary distribution of an artificial MDP with transition kernel P(•|s, a) and policy π θ . The auxiliary Markov chain with policy π θ starts from initial state s t-m+1 , and s t is the (m -1)th state on the chain. Following Lemma 1, we have: d T V P( x ∈ •|θ , s t-m+1 ), µ θ ⊗ π θ ⊗ P = d T V P (( s t , a t , s t+1 ) ∈ •|θ , s t-m+1 ) , µ θ ⊗ π θ ⊗ P ≤ κρ m-1 . Substituting the last inequality into (112) and taking total expectation on both sides yield E [∆ 3 ( x, θ ) -∆ 3 (x, θ )] ≥ -2L V (C p + L V )κρ m-1 . Consider the second term in (111). Note its form is similar to (59), so by following the derivation of (63), we directly have E[∆ 3 (x, θ )] = E ∇J(θ ), δ(x, θ )ψ θ (s, a) -∇J(θ ) ≥ -8C ψ r max (1 -γ) E ∇J(θ ) 2 , which further implies E[∆ 3 (x, θ )] ≥ -8C ψ r max (1 -γ) E ∇J(θ ) -∇J(θ) 2 -8C ψ r max (1 -γ) E ∇J(θ) 2 ≥ -8C ψ r max L J (1 -γ) E θ -θ 2 -8C ψ r max (1 -γ) E ∇J(θ) 2 , where the last inequality follows from Proposition 1. Collecting the lower bounds gives E[I 3 ] ≥ -2L V (C p + L V )κρ m-1 -8C ψ r max (1 -γ) (L J E θ -θ 2 + E ∇J(θ) 2 ) . Taking total expectation on ∆ 3 (x, θ) and collecting the lower bounds of I 1 , I 2 , I 3 yield E[∆ 3 (x, θ)] ≥ -(C p L J + (1 + γ)L 2 V C ψ + 2L V L J + 8C ψ r max L J (1 -γ)) E θ k -θ k-dm 2 -L V (C p + L V )|A|L π dm i=τ k E θ k-i -θ k-dm 2 -2L V (C p + L V )κρ m-1 -8C ψ r max (1 -γ) E ∇J(θ k ) 2 . Taking total expectation on (107) and substituting the above inequality into it yield the bound stated in Lemma 6.
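The quantities recurring in Lemmas 5 and 6 — the TD error δ(x, θ) and the score function ψ θ (s, a) = ∇ θ log π θ (a|s) — are easy to instantiate concretely. Below is a minimal sketch with a tabular softmax policy and a tabular critic; the parameterization and numbers are illustrative only, not the paper's experimental setup. The final check confirms E a∼π θ [ψ θ (s, a)] = 0, which is why ψ-weighted updates tolerate subtracting a baseline:

```python
import numpy as np

# Tabular softmax policy over |S| states and |A| actions;
# theta holds one logit per (state, action) pair. Illustrative only.
num_states, num_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
theta = rng.normal(size=(num_states, num_actions))
omega = rng.normal(size=num_states)      # linear critic with one-hot features

def pi(theta, s):
    """Softmax policy pi_theta(.|s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def score(theta, s, a):
    """psi_theta(s, a) = grad_theta log pi_theta(a|s) for the softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -pi(theta, s)
    g[s, a] += 1.0
    return g

def td_error(x, omega, gamma):
    """delta(x, omega) = r + gamma * V(s') - V(s) with one-hot features."""
    s, a, r, s_next = x
    return r + gamma * omega[s_next] - omega[s]

# One actor update direction: delta(x, omega) * psi_theta(s, a).
x = (0, 1, 1.0, 2)
delta = td_error(x, omega, gamma)
direction = delta * score(theta, 0, 1)

# Sanity check: the score averaged under the policy is zero.
avg = sum(pi(theta, 0)[a] * score(theta, 0, a) for a in range(num_actions))
print(np.allclose(avg, 0.0))
```

This prints True; the zero-mean property of ψ θ under π θ is what makes the advantage (baseline-subtracted) form of the update unbiased.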

C.3 EXPLANATION OF THE APPROXIMATION ERROR

In this section, we provide a justification for the circumstances under which the approximation error ε app defined in (14) is small. Lemma 7. Suppose Assumptions 2 and 4 hold. Then it holds that ε app ≤ max θ∈R d E s∼µ θ |V π θ (s) -V ω̄ * θ (s)| 2 + 4r max (λ -1 + λ -2 r max )(1 + log ρ κ -1 + (1 -ρ) -1 )(1 -γ), (113) where ω̄ * θ is the critic stationary point of the original Markov chain with policy π θ and transition kernel P. In (113), the first term captures the quality of the critic parameterization, which also appears in previous works [14, 15, 17]. With linear critic function approximation, it becomes zero whenever the value function V π θ belongs to the linear function space for every θ. The second term corresponds to the error introduced by sampling from the artificial transition kernel P̃(•|s, a) = γP(•|s, a) + (1 -γ)η(•). For γ close to 1, the artificial Markov chain is close to the original one, and the second error term is therefore small. This is also consistent with practice, where a large γ is commonly used in two-timescale actor-critic algorithms [3]. Before going into the proof, we define Ā θ,φ := E s∼μ̄ θ ,a∼π θ ,s ∼P [φ(s)(γφ(s ) -φ(s)) ] and b̄ θ,φ := E s∼μ̄ θ ,a∼π θ ,s ∼P [r(s, a, s )φ(s)], where μ̄ θ is the stationary distribution of the original Markov chain with π θ and transition kernel P. Proof. Recall the definition of the approximation error: ε app = max θ∈R d E s∼µ θ |V π θ (s) -V ω * θ (s)| 2 , where µ θ is the stationary distribution of the artificial Markov chain with π θ and transition kernel P̃, and ω * θ is the stationary point of the critic update under the artificial Markov chain. We decompose ε app as ε app = max



This paper revisits the A3C algorithm with TD(0) for the critic update, termed A3C-TD(0). With linear value function approximation, the convergence of A3C-TD(0) has been established under both i.i.d. and Markovian sampling. Under i.i.d. sampling, A3C-TD(0) achieves linear speedup relative to the best-known sample complexity of two-timescale AC, theoretically justifying the benefit of parallelism and asynchrony in AC algorithms for the first time. Under Markovian sampling, a similar linear speedup is observed empirically in most classic benchmark tasks.



LINEAR SPEEDUP RESULT WITH I.I.D. SAMPLING

Figure 1: Convergence results of A3C-TD(0) with i.i.d. sampling in synthetic environment.

Figure 3: Speedup of A3C-TD(0) in the OpenAI Gym classic control task (CartPole).

Figure 4: Speedup of A3C-TD(0) in OpenAI Gym Atari game (Seaquest).

Figure 5: Speedup of A3C-TD(0) in OpenAI Gym Atari game (Beamrider).

∇J(θ), δ(x, θ)ψ θ (s, a) -∇J(θ ), δ(x, θ )ψ θ (s, a) | ≤ | ∇J(θ), δ(x, θ)ψ θ (s, a) -∇J(θ), δ(x, θ )ψ θ (s, a) | + | ∇J(θ), δ(x, θ )ψ θ (s, a) -∇J(θ ), δ(x, θ )ψ θ (s, a) | = | ∇J(θ), (δ(x, θ) -δ(x, θ )) ψ θ (s, a) | + | ∇J(θ) -∇J(θ ), δ(x, θ )ψ θ (s, a) | ≤ L V C ψ |δ(x, θ) -δ(x, θ )| + C p ∇J(θ) -∇J(θ ) 2

First we bound I 2 asE[I 2 |θ , s t-m+1 ] = E [∆ 3 (x, θ ) -∆ 3 ( x, θ )|θ , s t-m+1 ] ≥ -|E [∆ 3 (x, θ )|θ , s t-m+1 ] -E [∆ 3 ( x, θ )|θ , s t-m+1 ]| ≥ -sup x |∆ 3 (x, θ )| P(x ∈ •|θ , s t-m+1 ) -P( x ∈ •|θ , s t-m+1 ) T V ≥ -2L V (C p + L V )d T V (P(x ∈ •|θ , s t-m+1 ), P( x ∈ •|θ , s t-m+1 ))

x, θ )| ≤ ∇J(θ) 2 δ(x, θ)ψ θ k-dm (s, a) 2 + ∇J(θ) 2 ≤ L V (C p + L V ).

θ ) -∆ 3 (x, θ )|θ , s t-m+1 ] ≥ -|E [∆ 3 ( x, θ )|θ , s t-m+1 ] -E [∆ 3 (x, θ )|θ , s t-m+1 ]| ≥ -sup x |∆ 3 (x, θ )| P( x ∈ •|θ , s t-m+1 ) -P(x ∈ •|θ , s t-m+1 ) T V ≥ -2L V (C p + L V )d T V (P( x ∈ •|θ , s t-m+1 ), P(x ∈ •|θ , s t-m+1 )) = -2L V (C p + L V )d T V P( x ∈ •|θ , s t-m+1 ), d θ ⊗ π θ ⊗ P = -2L V (C p + L V )d T V P( x ∈ •|θ , s t-m+1 ), µ θ ⊗ π θ ⊗ P(112)

E ∇J(θ k ), δ(x (k) , θ k )ψ θ k-τ k (s (k) , a (k) ) -∇J(θ k ) ≥ -D 6 E θ k-τ k -θ k-dm 2 -D 7 E θ k -θ k-dm 2 -D 8 dm i=τ k E θ k-i -θ k-dm 2 -D 9 κρ m-1 -8C ψ r max (1 -γ) E ∇J(θ k ) 2 , where D 6 := L V C δ L ψ , D 7 := C p L J + (1 + γ)L 2 V C ψ + 2L V L J + 8C ψ r max L J (1 -γ), D 8 := L V (C p + L V )|A|L π , D 9 := 2L V (C p + L V ).

s∼μ θ ,s ∼Pπ θ [φ(s)(γφ(s ) -φ(s)) ],bθ,φ := E s∼μ θ ,a∼π θ ,s ∼P [r(s, a, s )φ(s)],

θ∈R d E s∼µ θ |V π θ (s) -V ω̄ * θ (s) + V ω̄ * θ (s) -V ω * θ (s)| 2 ≤ max θ∈R d E s∼µ θ |V π θ (s) -V ω̄ * θ (s)| 2 =: ε fa + max θ∈R d E s∼µ θ |V ω̄ * θ (s) -V ω * θ (s)| 2 =: ε sp , (114) where the first term corresponds to the function approximation error ε fa , and the second term corresponds to the sampling error ε sp . With A, b and Ā, b̄ as shorthand for A θ,φ , b θ,φ and Ā θ,φ , b̄ θ,φ respectively, we bound the second term in (114) as |V ω̄ * θ (s) -V ω * θ (s)| = |φ(s) ω * θ -φ(s) ω̄ * θ | ≤ ω * θ -ω̄ * θ 2 = A -1 b -Ā -1 b̄ 2 (since ω * θ = -A -1 b and ω̄ * θ = -Ā -1 b̄) ≤ A -1 b -A -1 b̄ 2 + A -1 b̄ -Ā -1 b̄ 2 = A -1 (b -b̄) 2 + (A -1 -Ā -1 ) b̄ 2 ≤ λ -1 b -b̄ 2 + r max A -1 -Ā -1 2 = λ -1 b -b̄ 2 + r max A -1 ( Ā -A) Ā -1 2 ≤ λ -1 b -b̄ 2 + λ -2 r max Ā -A 2 . (115) We bound the first term in the last inequality as b -b̄ 2 = E s∼µ θ ,a∼π θ ,s ∼ P̃ [r(s, a, s )φ(s)] -E s∼μ̄ θ ,a∼π θ ,s ∼P [r(s, a, s )φ(s)] 2 ≤ sup r(s, a, s )φ(s) 2 µ θ ⊗ π θ ⊗ P̃ -μ̄ θ ⊗ π θ ⊗ P T V ≤ 2r max d T V (µ θ ⊗ π θ ⊗ P̃, μ̄ θ ⊗ π θ ⊗ P). (116) We now bound the divergence term in the last inequality as d T V (µ θ ⊗ π θ ⊗ P̃, μ̄ θ ⊗ π θ ⊗ P) = s∈S a∈A s ∈S |µ θ (s)π θ (a|s) P̃(s |s, a) -μ̄ θ (s)π θ (a|s)P(s |s, a)| = s∈S a∈A s ∈S |µ θ (s)π θ (a|s) P̃(s |s, a) -µ θ (s)π θ (a|s)P(s |s, a) + µ θ (s)π θ (a|s)P(s |s, a) -μ̄ θ (s)π θ (a|s)P(s |s, a)| ≤ s∈S a∈A µ θ (s)π θ (a|s) s ∈S | P̃(s |s, a) -P(s |s, a)| + s∈S |µ θ (s) -μ̄ θ (s)| . (117)


(119), where the last inequality follows (118). Substituting (118) and (119) into (117), and then substituting the result into (116), gives the bound (120) on b -b̄ 2 . Similarly, we also have the corresponding bound (121) on Ā -A 2 . Substituting (120) and (121) into (115), then substituting (115) into (114), completes the proof.
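The two error sources separated in (114) can be illustrated numerically. Assuming the artificial kernel takes the restart form γP(•|s, a) + (1 -γ)η(•) described above, the sketch below (with a toy 2-state chain of our own choosing, not from the paper) checks that (i) with full-rank tabular features the critic stationary point recovers V π exactly, so ε fa = 0, and (ii) the stationary distribution of the artificial chain approaches that of the original chain as γ → 1, which is what drives the (1 -γ) factor in (113):

```python
import numpy as np

def stationary(P):
    """Stationary distribution: normalized left Perron eigenvector of P."""
    evals, evecs = np.linalg.eig(P.T)
    mu = np.real(evecs[:, np.argmax(np.real(evals))])
    return mu / mu.sum()

# Toy 2-state chain with rewards (illustrative only, not from the paper).
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])
r = np.array([1.0, 0.0])
eta = np.ones(2) / 2       # restart distribution eta of the artificial kernel
gamma = 0.9

# (i) Function approximation error: with full-rank (tabular) features Phi = I,
# the critic stationary point solving A w + b = 0 recovers V exactly.
Phi = np.eye(2)
D = np.diag(stationary(P))
A = Phi.T @ D @ (gamma * P - np.eye(2)) @ Phi
b = Phi.T @ D @ r
w_star = -np.linalg.solve(A, b)
V = np.linalg.solve(np.eye(2) - gamma * P, r)
print(np.allclose(Phi @ w_star, V))      # eps_fa = 0 for tabular features

# (ii) Sampling error: the artificial kernel gamma*P + (1-gamma)*eta has a
# stationary distribution whose TV distance to that of P vanishes as gamma -> 1.
mu = stationary(P)
gaps = []
for g in (0.5, 0.9, 0.99):
    P_tilde = g * P + (1 - g) * np.outer(np.ones(2), eta)
    gaps.append(0.5 * np.abs(stationary(P_tilde) - mu).sum())
print(gaps[0] > gaps[1] > gaps[2])
```

Both checks print True: tabular features eliminate the first term of (113), and the distribution mismatch induced by the restart shrinks with 1 -γ.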

D EXPERIMENT DETAILS

Hardware device. The tests on the synthetic environment and CartPole were performed on a 16-core CPU machine. The tests on the Atari games were run on a machine with 4 GPUs. Parameterization. For the synthetic environment, we used linear value function approximation and a tabular softmax policy [36]. For CartPole, we used a 3-layer MLP with 128 neurons and a sigmoid activation function in each layer; the first two layers are shared between the actor and critic networks. For the Atari Seaquest game, we used a convolutional-LSTM network; for network details, see [40]. Hyper-parameters. For the synthetic environment tests, we ran Algorithm 1 with actor step size α k = 0.05/(1+k) 0.6 and critic step size β k = 0.05/(1+k) 0.4 . In the CartPole tests, we ran Algorithm 1 with a minibatch of 20 samples, updating the actor network with step size α k = 0.01/(1+k) 0.6 and the critic network with step size β k = 0.01/(1+k) 0.4 . See Table 1 for the hyper-parameters used to generate the Atari game results in Figure 4.
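The two-timescale step-size schedules above can be written down directly; the snippet below mirrors the synthetic-environment choice α k = 0.05/(1+k) 0.6 and β k = 0.05/(1+k) 0.4 , and checks that the actor operates on the slower timescale (α k /β k → 0):

```python
# Two-timescale step sizes used in the synthetic tests: the actor step size
# decays faster than the critic's, matching alpha_k = c/(1+k)^0.6 and
# beta_k = c/(1+k)^0.4 with c = 0.05.
def actor_step(k, c=0.05):
    return c / (1 + k) ** 0.6

def critic_step(k, c=0.05):
    return c / (1 + k) ** 0.4

# The ratio alpha_k / beta_k = (1+k)^(-0.2) -> 0, so the critic runs on the
# faster timescale, as required by the two-timescale analysis.
ratios = [actor_step(k) / critic_step(k) for k in (0, 100, 10000)]
print(ratios[0] > ratios[1] > ratios[2])
```

This prints True; the separation of timescales is what lets the critic track ω * θ k while the actor moves slowly.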

