SINGLE-TIMESCALE ACTOR-CRITIC PROVABLY FINDS GLOBALLY OPTIMAL POLICY

Abstract

We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms. While most existing works on actor-critic employ bi-level or two-timescale updates, we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously. Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once, while the actor is updated in the policy gradient direction computed using the critic. Moreover, we consider two function approximation settings where both the actor and critic are represented by linear or deep neural networks. For both cases, we prove that the actor sequence converges to a globally optimal policy at a sublinear O(K^{-1/2}) rate, where K is the number of iterations. To the best of our knowledge, we establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time. Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove that actor-critic with deep neural networks finds the globally optimal policy at a sublinear rate for the first time.

1. INTRODUCTION

In reinforcement learning (RL) (Sutton et al., 1998), the agent aims to make sequential decisions that maximize the expected total reward through interacting with the environment and learning from the experiences, where the environment is modeled as a Markov Decision Process (MDP) (Puterman, 2014). To learn a policy that achieves the highest possible total reward in expectation, the actor-critic method (Konda and Tsitsiklis, 2000) is among the most commonly used algorithms. In actor-critic, the actor refers to the policy and the critic corresponds to the value function that characterizes the performance of the actor. This method directly optimizes the expected total return over the policy class by iteratively improving the actor, where the update direction is determined by the critic. In particular, actor-critic combined with deep neural networks (LeCun et al., 2015) has recently achieved tremendous empirical success in solving large-scale RL tasks, such as the game of Go (Silver et al., 2017), StarCraft (Vinyals et al., 2019), Dota (OpenAI, 2018), Rubik's cube (Agostinelli et al., 2019; Akkaya et al., 2019), and autonomous driving (Sallab et al., 2017). See Li (2017) for a detailed survey of the recent developments of deep reinforcement learning. Despite these great empirical successes of actor-critic, there is still an evident chasm between theory and practice. Specifically, to establish convergence guarantees for actor-critic, most existing works focus on either the bi-level setting or the two-timescale setting, which are seldom adopted in practice.
In particular, under the bi-level setting (Yang et al., 2019a; Wang et al., 2019; Agarwal et al., 2019; Fu et al., 2019; Liu et al., 2019; Abbasi-Yadkori et al., 2019a;b; Cai et al., 2019; Hao et al., 2020; Mei et al., 2020; Bhandari and Russo, 2020), the actor is updated only after the critic solves the policy evaluation sub-problem completely, which is equivalent to applying the Bellman evaluation operator to the previous critic infinitely many times. Consequently, actor-critic under the bi-level setting is a double-loop iterative algorithm where the inner loop is allocated for solving the policy evaluation sub-problem of the critic. In terms of theoretical analysis, such a double-loop structure decouples the analysis for the actor and critic. For the actor, the problem is essentially reduced to analyzing the convergence of a variant of the policy gradient method (Sutton et al., 2000; Kakade, 2002) where the error of the gradient estimate depends on the policy evaluation error of the critic. Besides, under the two-timescale setting (Borkar and Konda, 1997; Konda and Tsitsiklis, 2000; Xu et al., 2020; Wu et al., 2020; Hong et al., 2020), the actor and the critic are updated simultaneously, but with disparate stepsizes. More concretely, the stepsize of the actor is set to be much smaller than that of the critic, with the ratio between these stepsizes converging to zero. In an asymptotic sense, such a separation between stepsizes ensures that the critic completely solves its policy evaluation sub-problem asymptotically. In other words, such a two-timescale scheme results in a separation between the actor and critic in an asymptotic sense, which leads to asymptotically unbiased policy gradient estimates. In sum, in terms of convergence analysis, the existing theory of actor-critic hinges on decoupling the analysis for the critic and actor, which is ensured by focusing on the bi-level or two-timescale settings.
However, most practical implementations of actor-critic are under the single-timescale setting (Peters and Schaal, 2008a; Schulman et al., 2015; Mnih et al., 2016; Schulman et al., 2017; Haarnoja et al., 2018), where the actor and critic are updated simultaneously, and in particular, the actor is updated without the critic reaching an approximate solution to the policy evaluation sub-problem. Meanwhile, in comparison with the two-timescale setting, the actor is equipped with a much larger stepsize in the single-timescale setting, such that the asymptotic separation between the analysis of the actor and critic is no longer valid. Furthermore, when it comes to function approximation, most existing works only analyze the convergence of actor-critic with either linear function approximation (Xu et al., 2020; Wu et al., 2020; Hong et al., 2020) or shallow-neural-network parameterization (Wang et al., 2019; Liu et al., 2019). In contrast, practically used actor-critic methods such as asynchronous advantage actor-critic (Mnih et al., 2016) and soft actor-critic (Haarnoja et al., 2018) oftentimes represent both the actor and critic using deep neural networks.

Thus, the following question is left open:

Does single-timescale actor-critic provably find a globally optimal policy under the function approximation setting, especially when deep neural networks are employed?

To answer this question, we make the first attempt to investigate the convergence and global optimality of single-timescale actor-critic with linear and neural network function approximation. In particular, we focus on the family of energy-based policies and aim to find the optimal policy within this class. Here we represent both the energy function and the critic as linear or deep neural network functions. In our actor-critic algorithm, the actor update follows proximal policy optimization (PPO) (Schulman et al., 2017), and the critic update is obtained by applying the Bellman evaluation operator only once to the current critic iterate. As a result, the actor is updated before the critic solves the policy evaluation sub-problem. Such a coupled updating structure persists even when the number of iterations goes to infinity, which implies that the update direction of the actor is always biased compared with the policy gradient direction. This brings an additional challenge that is absent in the bi-level and two-timescale settings, where the actor and critic are decoupled asymptotically. To tackle this challenge, our analysis captures the joint effect of the actor and critic updates on the objective function, dubbed the "double contraction" phenomenon, which plays a pivotal role in the success of single-timescale actor-critic. Specifically, thanks to the discount factor of the MDP, the Bellman evaluation operator is contractive, which implies that, after each update, the critic makes noticeable progress by moving towards the value function associated with the current actor. As a result, although we use a biased estimate of the policy gradient, the contraction brought by the discount factor keeps the accumulated effect of the biases under control.
Such a phenomenon enables us to characterize the progress of each iteration of the joint actor and critic update, and thus yields convergence to the globally optimal policy. In particular, for both the linear and neural settings, we prove that single-timescale actor-critic finds an O(K^{-1/2})-globally optimal policy after K iterations. To the best of our knowledge, this is the first theoretical guarantee of global convergence and global optimality for actor-critic with function approximation in the single-timescale setting. Moreover, under the broader scope of policy optimization with nonlinear function approximation, our work appears to be the first to prove convergence and optimality guarantees for actor-critic with deep neural networks. Contribution. Our contribution is two-fold. First, in the single-timescale setting with linear function approximation, we prove that, after K iterations of actor and critic updates, actor-critic returns a policy that is at most O(K^{-1/2}) inferior to the globally optimal policy. Second, when both the actor and critic are represented by deep neural networks, we prove a similar O(K^{-1/2}) rate of convergence to the globally optimal policy when the architectures of the neural networks are properly chosen. Related Work. Our work extends the line of works on the convergence of actor-critic under the function approximation setting. In particular, actor-critic is first introduced in Sutton et al. (2000); Konda and Tsitsiklis (2000). Later, Kakade (2002); Peters and Schaal (2008b) propose the natural actor-critic method, which updates the policy via the natural gradient (Amari, 1998) direction. The convergence of (natural) actor-critic with linear function approximation is studied in Bhatnagar et al. (2008; 2009); Bhatnagar (2010); Castro and Meir (2010); Maei (2018).
However, these works only characterize the asymptotic convergence of actor-critic, and their proofs all resort to tools from stochastic approximation via ordinary differential equations (Borkar, 2008). As a result, these works only show that actor-critic with linear function approximation converges to the set of stable equilibria of a set of ordinary differential equations. Recently, Zhang et al. (2019) propose a variant of actor-critic where Monte-Carlo sampling is used to ensure the critic and the policy gradient estimates are unbiased. Although they incorporate nonlinear function approximation in the actor, they only establish a finite-time convergence result to a stationary point of the expected total reward. Moreover, due to having an inner loop for solving the policy evaluation sub-problem, they focus on the bi-level setting. Furthermore, under the two-timescale setting, Wu et al. (2020); Xu et al. (2020) show that actor-critic with linear function approximation finds an ε-stationary point with O(ε^{-5/2}) samples, where ε measures the squared norm of the policy gradient. All of these results establish the convergence of actor-critic without characterizing the optimality of the policy obtained by actor-critic. In terms of the global optimality of actor-critic, Fazel et al. (2018); Malik et al. (2018); Tu and Recht (2018); Yang et al. (2019a); Bu et al. (2019); Fu et al. (2019) show that policy gradient and bi-level actor-critic methods converge to the globally optimal policies under the linear-quadratic setting, where the state transitions follow a linear dynamical system and the reward function is quadratic. For general MDPs, Bhandari and Russo (2019) recently prove the global optimality of vanilla policy gradient under the assumption that the families of policies and value functions are both convex. In addition, our work is also related to Liu et al. (2019) and Wang et al.
(2019), which establish the global optimality of proximal policy optimization and (natural) actor-critic, respectively, where both the actor and critic are parameterized by two-layer neural networks. Our work is also related to Agarwal et al. (2019); Abbasi-Yadkori et al. (2019a;b); Cai et al. (2019); Hao et al. (2020); Mei et al. (2020); Bhandari and Russo (2020), which focus on characterizing the optimality of natural policy gradient in tabular and/or linear settings. However, these aforementioned works all focus on bi-level actor-critic, where the actor is updated only after the critic solves the policy evaluation sub-problem to an approximate optimum. Besides, these works consider linear or two-layer neural network function approximation, whereas we focus on the setting with deep neural networks. Furthermore, under the two-timescale setting, Xu et al. (2020); Hong et al. (2020) prove that linear actor-critic requires a sample complexity of O(ε^{-4}) for obtaining an ε-globally optimal policy. In comparison, our O(K^{-1/2}) convergence for single-timescale actor-critic can be translated directly into a similar O(ε^{-4}) sample complexity. Moreover, when reusing the data, our result leads to an improved O(ε^{-2}) sample complexity. In addition, our work is also related to Geist et al. (2019), which proposes a variant of the policy iteration algorithm with Bregman divergence regularization. Without considering an explicit form of function approximation, their algorithm is shown to converge to the globally optimal policy at a similar O(K^{-1/2}) rate, where K is the number of policy updates. In contrast, our method is single-timescale actor-critic with linear or deep neural network function approximation, which enjoys both global convergence and global optimality.
Meanwhile, our proof is based on a finite-sample analysis, which involves dealing with the algorithmic errors that track the performance of the actor and critic updates as well as the statistical error due to having finite data. Our work is also related to the literature on deep neural networks. Previous works (Daniely, 2017; Jacot et al., 2018; Wu et al., 2018; Allen-Zhu et al., 2018a;b; Du et al., 2018; Zou et al., 2018; Chizat and Bach, 2018; Li and Liang, 2018; Cao and Gu, 2019a;b; Arora et al., 2019; Lee et al., 2019; Gao et al., 2019) analyze the computational and statistical rates of supervised learning methods with overparameterized neural networks. In contrast, our work employs overparameterized deep neural networks in actor-critic for solving RL tasks, which is significantly more challenging than supervised learning due to the interplay between the actor and the critic. Notation. We denote by [n] the set {1, 2, . . . , n}. For any measure ν and 1 ≤ p ≤ ∞, we denote by ‖f‖_{ν,p} = (∫_X |f(x)|^p dν)^{1/p} and ‖f‖_p = (∫_X |f(x)|^p dμ)^{1/p}, where μ is the Lebesgue measure.

2. BACKGROUND

In this section, we introduce the background on discounted Markov decision processes (MDPs) and actor-critic methods.

2.1. DISCOUNTED MDP

A discounted MDP is defined by a tuple (S, A, P, ζ, r, γ). Here S and A are the state and action spaces, respectively, P : S × S × A → [0, 1] is the Markov transition kernel, ζ : S → [0, 1] is the initial state distribution, r : S × A → R is the deterministic reward function, and γ ∈ [0, 1) is the discount factor. A policy π(a | s) gives the probability of taking the action a at the state s. We focus on a family of parameterized policies defined as

Π = {π_θ(· | s) ∈ P(A) : s ∈ S},   (2.1)

where P(A) is the probability simplex on the action space A and θ is the parameter of the policy π_θ. For any state-action pair (s, a) ∈ S × A, we define the action-value function as

Q^π(s, a) = (1 − γ) · E_π[ Σ_{t=0}^∞ γ^t · r(s_t, a_t) | s_0 = s, a_0 = a ],   (2.2)

where s_{t+1} ∼ P(· | s_t, a_t) and a_{t+1} ∼ π(· | s_{t+1}) for any t ≥ 0. We use E_π[·] to denote that the actions follow the policy π, which in turn affects the transition of the states. We aim to find an optimal policy π* such that Q^{π*}(s, a) ≥ Q^π(s, a) for any policy π and state-action pair (s, a) ∈ S × A. That is to say, such an optimal policy π* attains a higher expected total reward than any other policy π, regardless of the initial state-action pair (s, a). For notational convenience, we write Q*(s, a) = Q^{π*}(s, a) for any (s, a) ∈ S × A hereafter. Meanwhile, we denote by ν_π(s) and ρ_π(s, a) = ν_π(s) · π(a | s) the stationary state distribution and the stationary state-action distribution of the policy π, respectively, for any (s, a) ∈ S × A. Correspondingly, we denote by ν*(s) and ρ*(s, a) the stationary state distribution and the stationary state-action distribution of the optimal policy π*, respectively, for any (s, a) ∈ S × A.
For ease of presentation, given any functions g_1 : S → R and g_2 : S × A → R, we define two operators P and P^π as

[P g_1](s, a) = E[ g_1(s_1) | s_0 = s, a_0 = a ],   [P^π g_2](s, a) = E_π[ g_2(s_1, a_1) | s_0 = s, a_0 = a ],   (2.3)

where s_1 ∼ P(· | s_0, a_0) and a_1 ∼ π(· | s_1). Intuitively, given the current state-action pair (s_0, a_0), the operator P pushes the agent to its next state s_1 following the Markov transition kernel P(· | s_0, a_0), while the operator P^π pushes the agent to its next state-action pair (s_1, a_1) following the Markov transition kernel P(· | s_0, a_0) and the policy π(· | s_1). These operators also relate to the Bellman evaluation operator T^π, which is defined for any function g : S × A → R as

T^π g = (1 − γ) · r + γ · P^π g.   (2.4)

The Bellman evaluation operator T^π is used to characterize the actor-critic method in the following section. By the definition in (2.2), it is straightforward to verify that the action-value function Q^π is the fixed point of the Bellman evaluation operator T^π defined in (2.4), that is, Q^π = T^π Q^π for any policy π. For notational convenience, we let P^ℓ denote the ℓ-fold composition P P · · · P, in which ℓ operators P are composed together. Such notation is also adopted for other linear operators such as P^π and T^π.
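For concreteness, the following sketch checks the two key properties of T^π stated above numerically on a tiny synthetic MDP (the transition kernel, rewards, and policy below are made up for illustration): iterating T^π converges to its fixed point Q^π, and T^π is a γ-contraction in the sup norm.

```python
import numpy as np

# A tiny synthetic MDP, used only to illustrate the operators in (2.3)-(2.4).
gamma = 0.9
nS, nA = 2, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] = P(s' | s, a)
r = rng.uniform(0.0, 1.0, size=(nS, nA))       # reward r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)       # policy pi[s, a] = pi(a | s)

def P_pi(Q):
    """Operator P^pi in (2.3): E_pi[Q(s_1, a_1) | s_0 = s, a_0 = a]."""
    V = (pi * Q).sum(axis=1)   # V(s') = sum_{a'} pi(a' | s') Q(s', a')
    return P @ V               # shape (nS, nA)

def T_pi(Q):
    """Bellman evaluation operator (2.4): T^pi Q = (1 - gamma) r + gamma P^pi Q."""
    return (1.0 - gamma) * r + gamma * P_pi(Q)

# Iterating the gamma-contraction T^pi converges to its fixed point Q^pi.
Q = np.zeros((nS, nA))
for _ in range(500):
    Q = T_pi(Q)
assert np.max(np.abs(T_pi(Q) - Q)) < 1e-8  # fixed point: Q^pi = T^pi Q^pi

# Contraction check on two arbitrary Q-tables.
Q1, Q2 = rng.standard_normal((nS, nA)), rng.standard_normal((nS, nA))
assert np.max(np.abs(T_pi(Q1) - T_pi(Q2))) <= gamma * np.max(np.abs(Q1 - Q2)) + 1e-12
```

The contraction property verified here is exactly what drives the "double contraction" argument later: each single application of T^π shrinks the critic's distance to Q^π by a factor of γ.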

2.2. ACTOR-CRITIC METHOD

To obtain an optimal policy π*, the actor-critic method (Konda and Tsitsiklis, 2000) aims to maximize the expected total reward as a function of the policy, which is equivalent to solving the following maximization problem,

max_{π∈Π} J(π) = E_{s∼ζ, a∼π(· | s)}[ Q^π(s, a) ],   (2.5)

where ζ is the initial state distribution, Q^π is the action-value function defined in (2.2), and the family of parameterized policies Π is defined in (2.1). The actor-critic method solves the maximization problem in (2.5) via first-order optimization using an estimator of the policy gradient ∇_θ J(π). Here θ is the parameter of the policy π. In detail, by the policy gradient theorem (Sutton et al., 2000), we have

∇_θ J(π) = E_{(s,a)∼σ_π}[ Q^π(s, a) · ∇_θ log π(a | s) ].   (2.6)

Here σ_π is the state-action visitation measure of the policy π, which is defined as σ_π(s, a) = (1 − γ) · Σ_{t=0}^∞ γ^t · Pr[s_t = s, a_t = a]. Based on the closed form of the policy gradient in (2.6), the actor-critic method consists of the following two parts: (i) the critic update, where a policy evaluation algorithm is invoked to estimate the action-value function Q^π, e.g., by applying the Bellman evaluation operator T^π to the current estimator of Q^π, and (ii) the actor update, where a policy improvement algorithm, e.g., the policy gradient method, is invoked using the updated estimator of Q^π. In this paper, we consider the following variant of the actor-critic method,

π_{k+1} ← argmax_{π∈Π} E_{ν_{π_k}}[ ⟨Q_k(s, ·), π(· | s)⟩ − β · KL(π(· | s) ‖ π_k(· | s)) ],
Q_{k+1}(s, a) ← E_{π_{k+1}}[ (1 − γ) · r(s_0, a_0) + γ · Q_k(s_1, a_1) | s_0 = s, a_0 = a ],   (2.7)

for any (s, a) ∈ S × A, where s_1 ∼ P(· | s_0, a_0), a_1 ∼ π_{k+1}(· | s_1), and KL(π(· | s) ‖ π_k(· | s)) = Σ_{a∈A} log(π(a | s)/π_k(a | s)) · π(a | s).
In (2.7), the actor update uses the proximal policy optimization (PPO) method (Schulman et al., 2017), while the critic update applies the Bellman evaluation operator T^{π_{k+1}} defined in (2.4) only once to Q_k, the current estimator of the action-value function. Furthermore, we remark that the updates in (2.7) provide a general framework in the following two aspects. First, the critic update can be extended to letting Q_{k+1} ← (T^{π_{k+1}})^τ Q_k for any fixed τ ≥ 1, which corresponds to updating the value function via τ-step rollouts following π_{k+1}. Here we only focus on the case with τ = 1 for simplicity; our theory can be easily modified for any fixed τ. Moreover, the KL divergence used in the actor step can also be replaced by other Bregman divergences between probability distributions over A. Second, the actor and critic updates in (2.7) are a general template that admits both on- and off-policy evaluation methods and various function approximators in the actor and critic. In the next section, we present an incarnation of (2.7) with on-policy sampling and linear and neural network function approximation. Furthermore, for analyzing the actor-critic method, most existing works (Yang et al., 2019a; Wang et al., 2019; Agarwal et al., 2019; Fu et al., 2019; Liu et al., 2019) rely on (approximately) obtaining Q^{π_{k+1}} at each iteration, which is equivalent to applying the Bellman evaluation operator T^{π_{k+1}} infinitely many times to Q_k. This is usually achieved by minimizing the mean-squared Bellman error ‖Q − T^{π_{k+1}} Q‖²_{ρ_{π_{k+1}},2} using stochastic semi-gradient descent, e.g., as in the temporal-difference method (Sutton, 1988), to update the critic for sufficiently many iterations. The unique global minimizer of the mean-squared Bellman error gives the action-value function Q^{π_{k+1}}, which is used in the actor update.
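As a tabular illustration of the coupled updates in (2.7), the sketch below runs, per iteration, one KL-regularized actor step (using its closed-form softmax solution, cf. Proposition 3.1, which here reads π_{k+1} ∝ π_k · exp(Q_k/β)) followed by a single application of T^{π_{k+1}} to the critic. The MDP, β, and the iteration count K are synthetic choices made up for this sketch.

```python
import numpy as np

gamma, beta, K = 0.9, 1.0, 300
nS, nA = 3, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] = P(s' | s, a)
r = rng.uniform(size=(nS, nA))                 # reward r(s, a)

def true_Q(pi):
    """Solve Q^pi = (1 - gamma) r + gamma P^pi Q^pi exactly as a linear system."""
    # M[(s,a), (s',a')] = P(s' | s, a) * pi(a' | s').
    M = np.einsum('sap,pb->sapb', P, pi).reshape(nS * nA, nS * nA)
    q = np.linalg.solve(np.eye(nS * nA) - gamma * M, (1 - gamma) * r.ravel())
    return q.reshape(nS, nA)

pi = np.full((nS, nA), 1.0 / nA)  # uniform initial actor
Q = np.zeros((nS, nA))            # initial critic
for _ in range(K):
    # Actor: closed-form solution of the KL-regularized step, pi_{k+1} ∝ pi_k exp(Q_k / beta).
    logits = np.log(pi) + Q / beta
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    # Critic: apply the Bellman evaluation operator T^{pi_{k+1}} to Q_k only once.
    V = (pi * Q).sum(axis=1)
    Q = (1.0 - gamma) * r + gamma * (P @ V)

# Despite never solving policy evaluation exactly, the critic tracks Q^{pi}.
assert np.max(np.abs(Q - true_Q(pi))) < 1e-2
```

Note that the critic is always "behind" the current policy by one Bellman application, which is precisely the bias the single-timescale analysis must control.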
Meanwhile, the two-timescale setting is also considered in existing works (Borkar and Konda, 1997; Konda and Tsitsiklis, 2000; Xu et al., 2019; 2020; Wu et al., 2020; Hong et al., 2020), which require the actor to be updated more slowly than the critic in an asymptotic sense. Such a requirement is usually satisfied by forcing the ratio between the stepsizes of the actor and critic updates to go to zero asymptotically. In comparison with the bi-level setting, we consider the single-timescale actor and critic updates in (2.7), where the critic involves only one step of update, that is, applying the Bellman evaluation operator T^{π_{k+1}} to Q_k only once. Meanwhile, in comparison with the two-timescale setting, where the actor and critic are updated simultaneously but with the ratio between their stepsizes asymptotically going to zero, the single-timescale setting achieves a faster rate of convergence by allowing the actor to be updated with a larger stepsize while updating the critic simultaneously. In particular, such a single-timescale setting better captures a broader range of practical algorithms (Peters and Schaal, 2008a; Schulman et al., 2015; Mnih et al., 2016; Schulman et al., 2017; Haarnoja et al., 2018), where the stepsize of the actor is not asymptotically zero. In §3, we discuss the implementation of the updates in (2.7) for different schemes of function approximation. In §4, we compare the rates of convergence between the two-timescale and single-timescale settings.

3. ALGORITHMS

We consider two settings, where the actor and critic are parameterized using linear functions and deep neural networks (deferred to §A of the appendix), respectively. We consider the energy-based policy π_θ(a | s) ∝ exp(τ^{−1} f_θ(s, a)), where the energy function f_θ(s, a) is parameterized by θ. Also, for the (estimated) action-value function, we consider the parameterization Q_ω(s, a) for any (s, a) ∈ S × A, where ω is the parameter. For such parameterizations of the actor and critic, the updates in (2.7) take the following forms.

Actor Update. The following proposition gives the closed form of π_{k+1} in (2.7).

Proposition 3.1. Let π_{θ_k}(a | s) ∝ exp(τ_k^{−1} f_{θ_k}(s, a)) be an energy-based policy and π_{k+1} = argmax_π E_{ν_k}[ ⟨Q_{ω_k}(s, ·), π(· | s)⟩ − β · KL(π(· | s) ‖ π_{θ_k}(· | s)) ]. Then π_{k+1} has the following closed form:

π_{k+1}(a | s) ∝ exp( β^{−1} Q_{ω_k}(s, a) + τ_k^{−1} f_{θ_k}(s, a) ),

for any (s, a) ∈ S × A, where ν_k = ν_{π_{θ_k}} is the stationary state distribution of π_{θ_k}. See §G.1 for a detailed proof of Proposition 3.1. Motivated by Proposition 3.1, to implement the actor update in (2.7), we update the actor parameter θ by solving the following minimization problem,

θ_{k+1} ← argmin_θ E_{ρ_k}[ ( f_θ(s, a) − τ_{k+1} · (β^{−1} Q_{ω_k}(s, a) + τ_k^{−1} f_{θ_k}(s, a)) )² ],   (3.1)

where ρ_k = ρ_{π_{θ_k}} is the stationary state-action distribution of π_{θ_k}.

Critic Update. To implement the critic update in (2.7), we update the critic parameter ω by solving the following minimization problem,

ω_{k+1} ← argmin_ω E_{ρ_{k+1}}[ ( [Q_ω − (1 − γ) · r − γ · P^{π_{θ_{k+1}}} Q_{ω_k}](s, a) )² ],   (3.2)

where ρ_{k+1} = ρ_{π_{θ_{k+1}}} is the stationary state-action distribution of π_{θ_{k+1}} and the operator P^π is defined in (2.3).

3.1. LINEAR FUNCTION APPROXIMATION

In this section, we consider linear function approximation. More specifically, we parameterize the action-value function as Q_ω(s, a) = ω^⊤φ(s, a) and the energy function of the energy-based policy π_θ as f_θ(s, a) = θ^⊤φ(s, a). Here φ(s, a) ∈ R^d is the feature vector, where d > 0 is the dimension. Without loss of generality, we assume that ‖φ(s, a)‖_2 ≤ 1 for any (s, a) ∈ S × A, which can be achieved by normalization.

Actor Update. The minimization problem in (3.1) admits the following closed-form solution,

θ_{k+1} = τ_{k+1} · (β^{−1} ω_k + τ_k^{−1} θ_k),   (3.3)

which corresponds to a step of the natural policy gradient method (Kakade, 2002).

Critic Update. The minimization problem in (3.2) admits the following closed-form solution,

ω_{k+1} = E_{ρ_{k+1}}[φ(s, a)φ(s, a)^⊤]^{−1} · E_{ρ_{k+1}}[ [(1 − γ) · r + γ · P^{π_{θ_{k+1}}} Q_{ω_k}](s, a) · φ(s, a) ].   (3.4)

Since the closed-form solution ω_{k+1} in (3.4) involves expectations over the stationary state-action distribution ρ_{k+1} of π_{θ_{k+1}}, we use data to approximate such expectations. More specifically, we sample {(s_{ℓ,1}, a_{ℓ,1})}_{ℓ∈[N]} and {(s_{ℓ,2}, a_{ℓ,2}, r_{ℓ,2}, s'_{ℓ,2}, a'_{ℓ,2})}_{ℓ∈[N]} such that (s_{ℓ,1}, a_{ℓ,1}) ∼ ρ_{k+1}, (s_{ℓ,2}, a_{ℓ,2}) ∼ ρ_{k+1}, r_{ℓ,2} = r(s_{ℓ,2}, a_{ℓ,2}), s'_{ℓ,2} ∼ P(· | s_{ℓ,2}, a_{ℓ,2}), and a'_{ℓ,2} ∼ π_{θ_{k+1}}(· | s'_{ℓ,2}), where N is the sample size. We approximate ω_{k+1} by ω̂_{k+1}, which is defined as follows,

ω̂_{k+1} = Γ_R{ [ Σ_{ℓ=1}^N φ(s_{ℓ,1}, a_{ℓ,1})φ(s_{ℓ,1}, a_{ℓ,1})^⊤ ]^{−1} · Σ_{ℓ=1}^N [ (1 − γ) · r_{ℓ,2} + γ · Q_{ω_k}(s'_{ℓ,2}, a'_{ℓ,2}) ] · φ(s_{ℓ,2}, a_{ℓ,2}) }.   (3.5)

Here Γ_R is the projection operator, which projects the parameter onto the centered ball with radius R in R^d. Such a projection operator stabilizes the algorithm (Konda and Tsitsiklis, 2000; Bhatnagar et al., 2009).
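A minimal sketch of the closed-form actor step (3.3) and the sampled critic step (3.5) follows; the features, transitions, rewards, and constants below are all synthetic stand-ins for draws from ρ_{k+1}, made up for illustration only.

```python
import numpy as np

d, N, R, gamma = 4, 2000, 10.0, 0.9
beta, tau_k, tau_next = 10.0, 1.0, 0.5
rng = np.random.default_rng(2)

theta_k = rng.standard_normal(d)  # current actor parameter
omega_k = rng.standard_normal(d)  # current critic parameter

# Actor update (3.3): a natural policy gradient step in closed form.
theta_next = tau_next * (omega_k / beta + theta_k / tau_k)

# Synthetic samples standing in for draws from rho_{k+1} in (3.5).
phi_1 = rng.standard_normal((N, d)) / np.sqrt(d)       # phi(s_{l,1}, a_{l,1})
phi_2 = rng.standard_normal((N, d)) / np.sqrt(d)       # phi(s_{l,2}, a_{l,2})
phi_2_next = rng.standard_normal((N, d)) / np.sqrt(d)  # phi(s'_{l,2}, a'_{l,2})
r_2 = rng.uniform(size=N)                              # rewards r_{l,2}

# Critic update (3.5): least-squares solve, then projection Gamma_R.
A = phi_1.T @ phi_1                                    # sum_l phi phi^T
targets = (1.0 - gamma) * r_2 + gamma * (phi_2_next @ omega_k)  # TD targets
b = phi_2.T @ targets                                  # sum_l target_l * phi_l
omega_next = np.linalg.solve(A, b)
norm = np.linalg.norm(omega_next)
if norm > R:                                           # Gamma_R: project onto the radius-R ball
    omega_next *= R / norm

assert theta_next.shape == (d,) and np.linalg.norm(omega_next) <= R + 1e-9
```

Note that only one least-squares solve is performed per iteration, matching the single application of the Bellman operator in (2.7), rather than an inner loop that runs policy evaluation to convergence.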
It is worth mentioning that one may also view the update in (3.5) as one step of the least-squares temporal difference method (Bradtke and Barto, 1996), which can be modified for the off-policy setting (Antos et al., 2007; Yu, 2010; Liu et al., 2018; Nachum et al., 2019; Xie et al., 2019; Zhang et al., 2020; Uehara and Jiang, 2019; Nachum and Dai, 2020). Such a modification allows the data points in (3.5) to be reused in subsequent iterations, which further improves the sample complexity. Specifically, let ρ_bhv ∈ P(S × A) be the stationary state-action distribution induced by a behavioral policy π_bhv. We replace the actor and critic updates in (3.1) and (3.2) by

θ_{k+1} ← argmin_θ E_{ρ_bhv}[ ( f_θ(s, a) − τ_{k+1} · (β^{−1} Q_{ω_k}(s, a) + τ_k^{−1} f_{θ_k}(s, a)) )² ],   (3.6)

ω_{k+1} ← argmin_ω E_{ρ_bhv}[ ( [Q_ω − (1 − γ) · r − γ · P^{π_{θ_{k+1}}} Q_{ω_k}](s, a) )² ],   (3.7)

respectively. With linear function approximation, the actor update in (3.6) reduces to (3.3), while the critic update in (3.7) admits the closed-form solution

ω_{k+1} = E_{ρ_bhv}[φ(s, a)φ(s, a)^⊤]^{−1} · E_{ρ_bhv}[ [(1 − γ) · r + γ · P^{π_{θ_{k+1}}} Q_{ω_k}](s, a) · φ(s, a) ],

which can be well approximated using state-action pairs drawn from ρ_bhv. See §4 for a detailed discussion. Finally, by assembling the updates in (3.3) and (3.5), we present the linear actor-critic method in Algorithm 1, which is deferred to §B of the appendix.

4. THEORETICAL RESULTS

In this section, we upper bound the regret of the linear actor-critic method. We defer the analysis of the deep neural actor-critic method to §C of the appendix. Hereafter we assume that |r(s, a)| ≤ r_max for any (s, a) ∈ S × A, where r_max is a positive absolute constant. First, we impose the following assumptions. Recall that ρ* is the stationary state-action distribution of π*, while ρ_k is the stationary state-action distribution of π_{θ_k}. Moreover, let ρ ∈ P(S × A) be a state-action distribution with respect to which we aim to characterize the performance of the actor-critic algorithm. Specifically, after K + 1 actor updates, we are interested in upper bounding the following regret,

E[ Σ_{k=0}^K ‖Q* − Q^{π_{θ_{k+1}}}‖_{ρ,1} ] = E[ Σ_{k=0}^K ( Q*(s, a) − Q^{π_{θ_{k+1}}}(s, a) ) ],   (4.1)

where the expectation is taken with respect to {θ_k}_{k∈[K+1]} and (s, a) ∼ ρ. Here we allow ρ to be any fixed distribution for generality, which may differ from ρ*.

Assumption 4.1 (Concentrability Coefficient). The following statements hold. (i) There exists a positive absolute constant φ* such that φ*_k ≤ φ* for any k ≥ 1, where φ*_k = ‖dρ*/dρ_k‖_{ρ_k,2}. (ii) For any k ≥ 1 and any sequence of policies {π_i}_{i≥1}, the k-step future state-action distribution ρ P^{π_1} · · · P^{π_k} is absolutely continuous with respect to ρ*, where ρ is the same as the one in (4.1). Also, it holds for such ρ that

C_{ρ,ρ*} = (1 − γ)² · Σ_{k=1}^∞ k² γ^k · c(k) < ∞,  where  c(k) = sup_{{π_i}_{i∈[k]}} ‖d(ρ P^{π_1} · · · P^{π_k}) / dρ*‖_{ρ*,∞}.

In Assumption 4.1, C_{ρ,ρ*} is known as the discounted-average concentrability coefficient of the future state-action distributions. Such an assumption measures the stochastic stability properties of the MDP, and the class of MDPs with such properties is quite large. See Szepesvári and Munos (2005) for a detailed discussion.

Assumption 4.2. It holds for any ω ∈ B(0, R) and θ ∈ B(0, R) that inf_{ω̄∈B(0,R)} E_{ρ_{π_θ}}[ ( [T^{π_θ} Q_ω − ω̄^⊤φ](s, a) )² ] = 0, where T^{π_θ} is defined in (2.4).
Assumption 4.2 imposes a structural assumption on the MDP under the linear setting. Specifically, it assumes that the Bellman operator of each policy maps a linear value function to a linear function. Therefore, the value function associated with each policy (which is the fixed point of the corresponding Bellman operator) lies in the linear function class. Since the value functions are linear here, the energy-based policy class approximately covers the optimal policy as the temperature parameter τ goes to zero. In summary, Assumption 4.2 ensures that the energy-based policy class approximately captures the optimal policy, so that there is no approximation error. When Assumption 4.2 does not hold, we only need to add an additional bias term to the regret upper bound in our theorem, without much change in the proof. Assumption 4.3 (Well-Conditioned Feature). The minimum singular value of the matrix E_{ρ_k}[φ(s, a)φ(s, a)^⊤] is uniformly lower bounded by a positive absolute constant σ* for any k ≥ 1. Assumption 4.3 ensures that the minimization problem in (3.2) admits a unique minimizer, which is used in the critic update. Similar assumptions are commonly imposed in the literature (Bhandari et al., 2018; Xu et al., 2019; Zou et al., 2019; Wu et al., 2020). Under Assumptions 4.1, 4.2, and 4.3, we upper bound the regret of Algorithm 1 in the following theorem. Theorem 4.4. Suppose Assumptions 4.1, 4.2, and 4.3 hold. Let ρ be a state-action distribution satisfying (ii) of Assumption 4.1. Also, for any sufficiently large K > 0, let β = K^{1/2}, N = Ω(K · C²_{ρ,ρ*} · (φ*/σ*)² · log² N), and let the sequence of policy parameters {θ_k}_{k∈[K+1]} be generated by Algorithm 1. It holds that

E[ Σ_{k=0}^K ( Q*(s, a) − Q^{π_{θ_{k+1}}}(s, a) ) ] ≤ [ 2(1 − γ)^{−3} · log |A| + O(1) ] · K^{1/2},   (4.2)

where the expectation is taken with respect to {θ_k}_{k∈[K+1]} and (s, a) ∼ ρ. We sketch the proof in §D. See §E.1 for a detailed proof.
Theorem 4.4 establishes an O(K^{1/2}) regret for Algorithm 1, where K is the total number of iterations. Here O(·) omits terms involving (1 − γ)^{−1} and log |A|. To better understand Theorem 4.4, consider the ideal setting where we have access to the action-value function Q^π of any policy π. In such an ideal setting, the critic update is unnecessary. However, the natural policy gradient method, which only uses the actor update, achieves the same O(K^{1/2}) regret (Liu et al., 2019; Agarwal et al., 2019; Cai et al., 2019). In other words, in terms of the iteration complexity, Theorem 4.4 shows that in the single-timescale setting, using only one step of the critic update along with one step of the actor update is as efficient as the natural policy gradient method in the ideal setting. Furthermore, by the regret bound in (4.2), to obtain an ε-globally optimal policy, it suffices to set K ≳ (1 − γ)^{−6} · ε^{−2} · log² |A| in Algorithm 1 and output a randomized policy drawn uniformly from {π_{θ_k}}_{k=1}^{K+1}. Plugging such a K into N = Ω(K · C²_{ρ,ρ*} · (φ*/σ*)² · log² N), we obtain N = O(ε^{−2}), where O(·) omits the logarithmic terms. Thus, to achieve an ε-globally optimal policy, the total sample complexity of Algorithm 1 is O(ε^{−4}). This matches the sample complexity established in Xu et al. (2020); Hong et al. (2020) for two-timescale actor-critic methods. Meanwhile, notice that here the critic updates are on-policy and we draw N new data points in each critic update. As discussed in §3.1, under the off-policy setting, the critic updates given in (3.7) can be implemented using a fixed dataset sampled from ρ_bhv, the stationary state-action distribution induced by the behavioral policy. In that scenario, the total number of data points used by the algorithm is equal to N.
Moreover, by imposing assumptions on $\rho_{\mathrm{bhv}}$ similar to (i) of Assumption 4.1 and Assumption 4.3, we can establish an $O(K^{1/2})$ regret analogous to (4.2) for the off-policy setting. As a result, with data reuse, to obtain an ε-globally optimal policy, the sample complexity of Algorithm 1 is essentially $\tilde O(\varepsilon^{-2})$, which demonstrates the advantage of our single-timescale actor-critic method. Besides, focusing only on convergence to an ε-stationary point, Wu et al. (2020); Xu et al. (2020) establish a sample complexity of $\tilde O(\varepsilon^{-5/2})$ for two-timescale actor-critic, where ε measures the squared Euclidean norm of the policy gradient. In contrast, by adopting the natural policy gradient (Kakade, 2002) in the actor updates, we achieve convergence to the globally optimal policy. We remark that the idea of off-policy evaluation cannot be applied to the typical two-timescale setting (Wu et al., 2020; Xu et al., 2020), where the critic is updated using TD learning (e.g., TD(0) and TD(λ)), since off-policy TD methods may diverge even with linear function approximation (Baird et al., 1995; Sutton et al., 2008). To the best of our knowledge, we establish the rate of convergence and global optimality of the actor-critic method with function approximation in the single-timescale setting for the first time. Furthermore, as we show in Theorem C.5 of §C, when both the actor and the critic are represented by overparameterized deep neural networks, we establish a similar $O((1-\gamma)^{-3} \cdot \log|\mathcal{A}| \cdot K^{1/2})$ regret when the architectures of the actor and critic neural networks are properly chosen. To the best of our knowledge, this is the first theoretical guarantee on the rate of convergence and global optimality of actor-critic with deep neural network function approximation.

A DEEP NEURAL NETWORK APPROXIMATION

In this section, we consider deep neural network approximation. We first formally define deep neural networks and then introduce the actor-critic method under such a parameterization. A deep neural network (DNN) $u_\theta(x)$ with input $x \in \mathbb{R}^d$, depth H, and width m is defined as

$x^{(0)} = x$,  $x^{(h)} = \frac{1}{\sqrt m}\cdot\sigma(W_h x^{(h-1)})$ for $h \in [H]$,  $u_\theta(x) = b^\top x^{(H)}$.   (A.1)

Here $\sigma: \mathbb{R}^m \to \mathbb{R}^m$ is the rectified linear unit (ReLU) activation function, which is defined as $\sigma(y) = (\max\{0, y_1\}, \ldots, \max\{0, y_m\})^\top$ for any $y = (y_1, \ldots, y_m)^\top \in \mathbb{R}^m$. Also, we have $b \in \{-1, 1\}^m$, $W_1 \in \mathbb{R}^{m\times d}$, and $W_h \in \mathbb{R}^{m\times m}$ for $2 \le h \le H$, and we write $\theta = (\mathrm{vec}(W_1)^\top, \ldots, \mathrm{vec}(W_H)^\top)^\top$. We denote the initialization of the parameter θ by $\theta^0 = (\mathrm{vec}(W_1^0)^\top, \ldots, \mathrm{vec}(W_H^0)^\top)^\top$. Meanwhile, we restrict θ within the ball $B(\theta^0, R)$ during training, which is defined as follows,

$B(\theta^0, R) = \{\theta \in \mathbb{R}^{m_{\mathrm{all}}} : \|W_h - W_h^0\|_F \le R \text{ for } h \in [H]\}$.   (A.2)

Here $\{W_h\}_{h\in[H]}$ and $\{W_h^0\}_{h\in[H]}$ are the weight matrices of θ and $\theta^0$, respectively. By (A.2), we have $\|\theta - \theta^0\|_2 \le R\sqrt H$ for any $\theta \in B(\theta^0, R)$. Now, we define the family of DNNs as

$\mathcal{U}(m, H, R) = \{u_\theta : \theta \in B(\theta^0, R)\}$,   (A.3)

where $u_\theta$ is a DNN with depth H and width m. We parameterize the action-value function by $Q_\omega(s,a) \in \mathcal{U}(m_c, H_c, R_c)$ and the energy function of the energy-based policy $\pi_\theta$ by $f_\theta(s,a) \in \mathcal{U}(m_a, H_a, R_a)$. Here $\mathcal{U}(m_c, H_c, R_c)$ and $\mathcal{U}(m_a, H_a, R_a)$ are the families of DNNs defined in (A.3). Hereafter we assume that the energy function $f_\theta$ and the action-value function $Q_\omega$ share the same architecture and initialization, i.e., $m_a = m_c$, $H_a = H_c$, $R_a = R_c$, and $\theta^0 = \omega^0$. Such shared architecture and initialization of the DNNs ensure that the parameterizations of the policy and the action-value function are approximately compatible. See Sutton et al. (2000); Konda and Tsitsiklis (2000); Kakade (2002); Peters and Schaal (2008a); Wang et al. (2019) for a detailed discussion. Actor Update.
To solve (3.1), we use projected stochastic gradient descent, whose n-th iteration takes the form

$\theta(n+1) \leftarrow \Gamma_{B(\theta^0, R_a)}\bigl\{\theta(n) - \alpha \cdot \bigl(f_{\theta(n)}(s,a) - \tau_{k+1} \cdot (\beta^{-1} Q_{\omega_k}(s,a) + \tau_k^{-1} f_{\theta_k}(s,a))\bigr) \cdot \nabla_\theta f_{\theta(n)}(s,a)\bigr\}$.

Here $\Gamma_{B(\theta^0, R_a)}$ is the projection operator, which projects the parameter onto the ball $B(\theta^0, R_a)$ defined in (A.2). The state-action pair (s, a) is sampled from the stationary state-action distribution $\rho_k$. We summarize the update in Algorithm 3, which is deferred to §B of the appendix.

Critic Update. To solve (3.2), we also apply projected stochastic gradient descent. More specifically, at the n-th iteration of projected stochastic gradient descent, we sample a tuple $(s, a, r, s', a')$, where $(s,a) \sim \rho_{k+1}$, $r = r(s,a)$, $s' \sim P(\cdot\,|\,s,a)$, and $a' \sim \pi_{\theta_{k+1}}(\cdot\,|\,s')$. We define the residual at the n-th iteration as $\delta(n) = Q_{\omega(n)}(s,a) - (1-\gamma)\cdot r - \gamma\cdot Q_{\omega_k}(s', a')$. Then the n-th iteration of projected stochastic gradient descent takes the form

$\omega(n+1) \leftarrow \Gamma_{B(\omega^0, R_c)}\bigl\{\omega(n) - \eta \cdot \delta(n) \cdot \nabla_\omega Q_{\omega(n)}(s,a)\bigr\}$.

Here $\Gamma_{B(\omega^0, R_c)}$ is the projection operator, which projects the parameter onto the ball $B(\omega^0, R_c)$ defined in (A.2). We summarize the update in Algorithm 4, which is deferred to §B of the appendix. By assembling Algorithms 3 and 4, we obtain the deep neural actor-critic method in Algorithm 2, which is also deferred to §B of the appendix. Finally, we remark that the off-policy actor and critic updates given in (3.6) and (3.7) can also incorporate deep neural network approximation with slight modification, which enables data reuse in the algorithm.
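For concreteness, the two ingredients above, the forward pass (A.1) and the projection onto the Frobenius ball (A.2), can be sketched as follows. The dimensions, the random initialization, and the random surrogate "gradient" in the final step are illustrative assumptions of this sketch, not the paper's construction.

```python
import numpy as np

def relu(y):
    return np.maximum(0.0, y)

def dnn_forward(x, weights, b):
    """Forward pass of the DNN in (A.1):
    x^{(h)} = sigma(W_h x^{(h-1)}) / sqrt(m), u_theta(x) = b^T x^{(H)}."""
    m = weights[0].shape[0]
    h = x
    for W in weights:
        h = relu(W @ h) / np.sqrt(m)
    return float(b @ h)

def project_to_ball(W, W0, R):
    """Projection Gamma_{B(theta^0, R)} of (A.2), applied per weight matrix:
    pull W back into the Frobenius ball of radius R around its initialization."""
    diff = W - W0
    norm = np.linalg.norm(diff)  # Frobenius norm for matrices
    return W if norm <= R else W0 + diff * (R / norm)

rng = np.random.default_rng(0)
d, m, H, R_a, alpha = 3, 8, 2, 0.5, 0.1
W0s = [rng.normal(size=(m, d))] + [rng.normal(size=(m, m)) for _ in range(H - 1)]
b = rng.choice([-1.0, 1.0], size=m)
Ws = [W.copy() for W in W0s]

out = dnn_forward(rng.normal(size=d), Ws, b)

# A crude update with a random surrogate "gradient", followed by projection;
# afterwards every layer satisfies ||W_h - W_h^0||_F <= R_a as required by (A.2).
Ws = [project_to_ball(W - alpha * rng.normal(size=W.shape), W0, R_a)
      for W, W0 in zip(Ws, W0s)]
print(all(np.linalg.norm(W - W0) <= R_a + 1e-12 for W, W0 in zip(Ws, W0s)))  # -> True
```

Projecting each weight matrix separately is exactly what the constraint set in (A.2) prescribes, since the ball is a product of per-layer Frobenius balls.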

B DETAILS OF ALGORITHMS

In this section, we summarize the algorithms in §3. We first introduce the actor-critic method with linear function approximation in Algorithm 1.

Algorithm 1 Linear Actor-Critic Method
Input: Number of iterations K, sample size N, temperature parameter β.
Initialization: Set $\tau_0 \leftarrow \infty$, and randomly initialize the actor parameter $\theta_0$ and the critic parameter $\omega_0$.
for k = 0, 1, 2, ..., K do
  Actor Update: Update $\theta_{k+1}$ via (3.3) with $\tau_{k+1}^{-1} = (k+1)\cdot\beta^{-1}$.
  Critic Update: Sample $\{(s_{\ell,1}, a_{\ell,1})\}_{\ell\in[N]}$ and $\{(s_{\ell,2}, a_{\ell,2}, r_{\ell,2}, s'_{\ell,2}, a'_{\ell,2})\}_{\ell\in[N]}$ as specified in §3.1. Update $\omega_{k+1}$ via (3.5).
end for
Output: $\{\pi_{\theta_k}\}_{k\in[K+1]}$, where $\pi_{\theta_k} \propto \exp(\tau_k^{-1} f_{\theta_k})$.

We introduce the actor-critic method with DNN approximation in Algorithm 2, which relies on Algorithms 3 and 4 for the actor and critic updates.

Algorithm 2 Deep Neural Actor-Critic Method
Input: Number of iterations K, $N_a$, $N_c$, stepsizes α, η, and temperature parameter β.
Initialization: Set $\tau_0 \leftarrow \infty$ and initialize DNNs $f_{\theta_0}$ and $Q_{\omega_0}$ as specified in §A.
for k = 0, 1, 2, ..., K do
  Actor Update: Update $\theta_{k+1}$ via Algorithm 3 with input $\pi_{\theta_k}$, $\theta^0$, $Q_{\omega_k}$, α, β, $\tau_{k+1} = (k+1)^{-1}\cdot\beta$, and $N_a$.
  Critic Update: Update $\omega_{k+1}$ via Algorithm 4 with input $\pi_{\theta_{k+1}}$, $Q_{\omega_k}$, $\omega^0$, η, and $N_c$.
end for
Output: $\{\pi_{\theta_k}\}_{k\in[K+1]}$, where $\pi_{\theta_k} \propto \exp(\tau_k^{-1} f_{\theta_k})$.

Algorithm 3 Actor Update for Deep Neural Actor-Critic Method
Input: Policy $\pi_\theta \propto \exp(\tau^{-1} f_\theta)$, initial actor parameter $\theta^0$, action-value function $Q_\omega$, stepsize α, temperature parameter β, temperature τ, and number of iterations $N_a$.
Initialization: Set $\theta(0) \leftarrow \theta^0$.
for n = 0, 1, 2, ..., $N_a$ - 1 do
  Sample (s, a) as specified in §A.
  Set $\theta(n+1) \leftarrow \Gamma_{B(\theta^0, R_a)}\bigl(\theta(n) - \alpha\cdot(f_{\theta(n)}(s,a) - \tau\cdot(\beta^{-1} Q_\omega(s,a) + \tau^{-1} f_\theta(s,a)))\cdot\nabla_\theta f_{\theta(n)}(s,a)\bigr)$.
end for
Output: $\bar\theta = (1/N_a)\cdot\sum_{n=1}^{N_a}\theta(n)$.

Algorithm 4 Critic Update for Deep Neural Actor-Critic Method

Input: Policy $\pi_\theta$, action-value function $Q_\omega$, initial critic parameter $\omega^0$, stepsize η, and number of iterations $N_c$.
Initialization: Set $\omega(0) \leftarrow \omega^0$.
for n = 0, 1, 2, ..., $N_c$ - 1 do
  Sample (s, a, r, s', a') as specified in §A.
  Set $\delta(n) \leftarrow Q_{\omega(n)}(s,a) - (1-\gamma)\cdot r - \gamma\cdot Q_\omega(s', a')$.
  Set $\omega(n+1) \leftarrow \Gamma_{B(\omega^0, R_c)}\bigl(\omega(n) - \eta\cdot\delta(n)\cdot\nabla_\omega Q_{\omega(n)}(s,a)\bigr)$.
end for
Output: $\bar\omega = (1/N_c)\cdot\sum_{n=1}^{N_c}\omega(n)$.
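The single-timescale structure that Algorithms 1 and 2 share, exactly one critic step interleaved with each actor step, can be sketched on a toy problem. The 2-state MDP, tabular features, and exact population Bellman step below are simplifying assumptions of this sketch, not the paper's sampled updates.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, K = 2, 2, 0.9, 200
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition kernel
r = rng.uniform(size=(nS, nA))                 # rewards in [0, 1]
f = np.zeros((nS, nA))                         # actor: energy function
Q = np.zeros((nS, nA))                         # critic
beta = np.sqrt(K)                              # beta = K^{1/2} as in Theorem 4.4

for k in range(K):
    # Actor update: natural-gradient-style energy update,
    # tau_{k+1}^{-1} f_{k+1} = tau_k^{-1} f_k + beta^{-1} Q_k
    # (exact with tabular features).
    tau_prev_inv = k / beta        # tau_0^{-1} = 0, i.e. tau_0 = infinity
    tau_inv = (k + 1) / beta
    f = (tau_prev_inv * f + Q / beta) / tau_inv
    logits = tau_inv * f
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    # Critic update: apply the Bellman evaluation operator of pi_{k+1} ONCE,
    # (T^pi Q)(s,a) = (1-gamma) r(s,a) + gamma E_{s'}[ E_{a'~pi}[Q(s',a')] ].
    V = (pi * Q).sum(axis=1)
    Q = (1 - gamma) * r + gamma * P @ V

print(Q.round(3))
```

Note that the critic is never run to convergence inside an iteration; the "double contraction" analysis in §D is precisely about why this single Bellman application per step still suffices.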

C CONVERGENCE RESULTS OF ALGORITHM 2

In this section, we upper bound the regret of the deep neural actor-critic method. Hereafter we assume that $|r(s,a)| \le r_{\max}$ for any $(s,a) \in S \times A$, where $r_{\max}$ is a positive absolute constant. First, we impose the following assumption in parallel to Assumption 4.1. Recall that $\rho^*$ is the stationary state-action distribution of $\pi^*$, while $\rho_k$ is the stationary state-action distribution of $\pi_{\theta_k}$.

Assumption C.1 (Concentrability Coefficient). The following statements hold. (i) There exists a positive absolute constant $\phi^*$ such that $\phi_k^* \le \phi^*$ for any $k \ge 1$, where $\phi_k^* = \|d\rho^*/d\rho_k\|_{\rho_k, 2}$. (ii) For the state-action distribution ρ used to define the regret in (4.1), we assume that for any $k \ge 1$ and any sequence of policies $\{\pi_i\}_{i\ge 1}$, the k-step future state-action distribution $\rho P^{\pi_1}\cdots P^{\pi_k}$ is absolutely continuous with respect to $\rho^*$. Also, it holds that $C_{\rho,\rho^*} = (1-\gamma)^2 \sum_{k=1}^\infty k^3\gamma^k \cdot c(k) < \infty$, where $c(k) = \sup_{\{\pi_i\}_{i\in[k]}} \|d(\rho P^{\pi_1}\cdots P^{\pi_k})/d\rho^*\|_{\rho^*, \infty}$.

Meanwhile, we impose the following assumption in parallel to Assumption 4.2.

Assumption C.2 (Zero Approximation Error). For any $Q_\omega \in \mathcal{U}(m_c, H_c, R_c)$ and any policy π, it holds that $T^\pi Q_\omega \in \mathcal{U}(m_c, H_c, R_c)$, where $T^\pi$ is defined in (2.4).

Assumption C.2 states that $\mathcal{U}(m_c, H_c, R_c)$ is closed under the Bellman evaluation operator $T^\pi$, which is commonly imposed in the literature (Munos and Szepesvári, 2008; Antos et al., 2008a; Farahmand et al., 2010; 2016; Tosatto et al., 2017; Yang et al., 2019b; Liu et al., 2019). We upper bound the regret of the deep neural actor-critic method in Algorithm 2 in the sequel. To establish such an upper bound, we first establish the rates of convergence of Algorithms 3 and 4 as follows.

Proposition C.3. For any sufficiently large $N_a > 0$, let $m_a = \Omega(d^{3/2} R_a^{-1} H_a^{-3/2} \log^{3/2}(m_a^{1/2}/R_a))$, $H_a = O(N_a^{1/4})$, and $R_a = O(m_a^{1/2} H_a^{-6} (\log m_a)^{-3})$.
We denote by $\bar\theta$ the output of Algorithm 3 with input $\pi_\theta \propto \exp(\tau^{-1} f_\theta)$, $\theta^0$, $Q_\omega$, α, β, temperature $(\tau^{-1} + \beta^{-1})^{-1}$, and $N_a$. Also, let $\bar f = (\tau^{-1} + \beta^{-1})^{-1}\cdot(\beta^{-1} Q_\omega + \tau^{-1} f_\theta)$. With probability at least $1 - \exp(-\Omega(R_a^{2/3} m_a^{2/3} H_a))$ over the random initialization $\theta^0$, we have

$\mathbb{E}\bigl[(f_{\bar\theta}(s,a) - \bar f(s,a))^2\bigr] = O(R_a^2 N_a^{-1/2} + R_a^{8/3} m_a^{-1/6} H_a^7 \log m_a)$.

Here the expectation is taken over the randomness of $\bar\theta$ conditioning on the initialization $\theta^0$ and $(s,a) \sim \rho_{\pi_\theta}$, where $\rho_{\pi_\theta}$ is the stationary state-action distribution of $\pi_\theta$.

Proof. See §G.2 for a detailed proof.

Proposition C.4. For any sufficiently large $N_c > 0$, let $m_c = \Omega(d^{3/2} R_c^{-1} H_c^{-3/2} \log^{3/2}(m_c^{1/2}/R_c))$, $H_c = O(N_c^{1/4})$, and $R_c = O(m_c^{1/2} H_c^{-6} (\log m_c)^{-3})$. We denote by $\bar\omega$ the output of Algorithm 4 with input $\pi_\theta$, $Q_\omega$, $\omega^0$, η, and $N_c$. Also, let $\bar Q = (1-\gamma)\cdot r + \gamma\cdot P^{\pi_\theta} Q_\omega$. With probability at least $1 - \exp(-\Omega(R_c^{2/3} m_c^{2/3} H_c))$ over the random initialization $\omega^0$, we have

$\mathbb{E}\bigl[(Q_{\bar\omega}(s,a) - \bar Q(s,a))^2\bigr] = O(R_c^2 N_c^{-1/2} + R_c^{8/3} m_c^{-1/6} H_c^7 \log m_c)$.

Here the expectation is taken over the randomness of $\bar\omega$ conditioning on the initialization $\omega^0$ and $(s,a) \sim \rho_{\pi_\theta}$, where $\rho_{\pi_\theta}$ is the stationary state-action distribution of $\pi_\theta$.

Proof. See §G.3 for a detailed proof.

Based on Propositions C.3 and C.4, we upper bound the regret of Algorithm 2 in the following theorem, which is in parallel to Theorem 4.4.

Theorem C.5. We assume that Assumptions C.1 and C.2 hold. Let ρ be a state-action distribution satisfying (ii) of Assumption C.1. Also, for any sufficiently large $K > 0$, let $N_a = \Omega(K^2 C_{\rho,\rho^*}^4 (\phi^* + \psi^* + 1)^4 R_a^4)$, $N_c = \Omega(K^2 C_{\rho,\rho^*}^4 {\phi^*}^4 R_c^4)$, $H_a = H_c = O(N_c^{1/4})$, $R_a = R_c = O(m_c^{1/2} H_c^{-6}(\log m_c)^{-3})$, $m_a = m_c = \Omega(d^{3/2} K^6 C_{\rho,\rho^*}^{12} (\phi^* + \psi^* + 1)^{12} R_c^{16} H_c^{42} \log^{3/2}(m_c^{1/2}/R_c))$, $\beta = K^{1/2}$, and let the sequence $\{\theta_k\}_{k\in[K]}$ be generated by Algorithm 2.
With probability at least $1 - 1/K$ over the random initializations $\theta^0$ and $\omega^0$, it holds that

$\mathbb{E}\bigl[\sum_{k=0}^K \bigl(Q^*(s,a) - Q^{\pi_{\theta_{k+1}}}(s,a)\bigr)\bigr] \le \bigl(2(1-\gamma)^{-3}\log|\mathcal{A}| + O(1)\bigr)\cdot K^{1/2}$,

where the expectation is taken over the randomness of $(s,a)\sim\rho$ and $\{\theta_{k+1}\}_{k\in[K]}$ conditioning on the initializations $\theta^0$ and $\omega^0$.

Proof. See §E.2 for a detailed proof.

When the architectures of the actor and critic neural networks are properly chosen, Theorem C.5 establishes an $O(K^{1/2})$ regret for Algorithm 2, where K is the total number of iterations. Specifically, to establish such a regret upper bound, we need the widths $m_a$ and $m_c$ of the DNNs $f_\theta$ and $Q_\omega$ to be sufficiently large. Meanwhile, to control the errors of the actor and critic updates in Algorithm 2, we also run sufficiently many iterations in Algorithms 3 and 4. In terms of the total sample complexity, to simplify the discussion, we omit constant and logarithmic terms here. To obtain an ε-globally optimal policy, it suffices to set $K \asymp \varepsilon^{-2}$ in Algorithm 2. Plugging such a K into $N_a = \Omega(K^2 C_{\rho,\rho^*}^4(\phi^*+\psi^*+1)^4 R_a^4)$ and $N_c = \Omega(K^2 C_{\rho,\rho^*}^4 {\phi^*}^4 R_c^4)$ as required in Theorem C.5, we have $N_a = O(\varepsilon^{-4})$ and $N_c = O(\varepsilon^{-4})$. Thus, to achieve an ε-globally optimal policy, the total sample complexity of Algorithm 2 is $O(\varepsilon^{-6})$. With the modification to the off-policy setting as in §3.1, the total sample complexity of Algorithm 2 becomes $O(\varepsilon^{-4})$. In comparison, Liu et al. (2019) require a total sample complexity of $O(\varepsilon^{-8})$ to achieve an ε-globally optimal policy, which is worse than our single-timescale algorithm. Meanwhile, since Liu et al. (2019) use TD(0) in the critic update, which is known to diverge in the off-policy setting even with linear function approximation (Baird et al., 1995), data reuse cannot be applied to Liu et al. (2019) to reduce the total sample complexity.
To the best of our knowledge, we establish the rate of convergence and global optimality of the actor-critic method in the single-timescale setting with DNN approximation for the first time.

D PROOF SKETCH OF MAIN THEOREM 4.4

In this section, we sketch the proof of Theorem 4.4. Recall that ρ is a state-action distribution satisfying (ii) of Assumption 4.1. We first upper bound $\sum_{k=0}^K (Q^*(s,a) - Q^{\pi_{\theta_{k+1}}}(s,a))$ for any $(s,a)\in S\times A$ in part 1. Then by further taking the expectation over ρ in part 2, we conclude the proof of Theorem 4.4. See §E.1 for a detailed proof.

Part 1. In the sequel, we upper bound $\sum_{k=0}^K (Q^*(s,a) - Q^{\pi_{\theta_{k+1}}}(s,a))$ for any $(s,a)\in S\times A$. We first decompose $Q^* - Q^{\pi_{\theta_{k+1}}}$ into the following three terms,

$\sum_{k=0}^K [Q^* - Q^{\pi_{\theta_{k+1}}}](s,a) = \sum_{k=0}^K \bigl[(I - \gamma P^{\pi^*})^{-1}(A_{1,k} + A_{2,k} + A_{3,k})\bigr](s,a)$,   (D.1)

the proof of which is deferred to (E.1) and (E.2) in §E.1 of the appendix. Here the operator $P^{\pi^*}$ is defined in (2.3), $(I - \gamma P^{\pi^*})^{-1} = \sum_{i=0}^\infty (\gamma P^{\pi^*})^i$, and $A_{1,k}$, $A_{2,k}$, and $A_{3,k}$ are defined as follows,

$A_{1,k}(s,a) = [\gamma(P^{\pi^*} - P^{\pi_{\theta_{k+1}}}) Q_{\omega_k}](s,a)$,   (D.2)
$A_{2,k}(s,a) = [\gamma P^{\pi^*}(Q^{\pi_{\theta_{k+1}}} - Q_{\omega_k})](s,a)$,   (D.3)
$A_{3,k}(s,a) = [T^{\pi_{\theta_{k+1}}} Q_{\omega_k} - Q^{\pi_{\theta_{k+1}}}](s,a)$.   (D.4)

To understand the intuition behind $A_{1,k}$, $A_{2,k}$, and $A_{3,k}$, we interpret them as follows.

Interpretation of $A_{1,k}$. As defined in (D.2), $A_{1,k}$ arises from the actor update and measures the convergence of the policy $\pi_{\theta_{k+1}}$ towards a globally optimal policy $\pi^*$, which implies the convergence of $P^{\pi_{\theta_{k+1}}}$ towards $P^{\pi^*}$.

Interpretation of $A_{3,k}$. Note that by (2.2) and (2.4), we have $Q^{\pi_{\theta_{k+1}}} = T^{\pi_{\theta_{k+1}}} Q^{\pi_{\theta_{k+1}}}$ and $T^{\pi_{\theta_{k+1}}}$ is a γ-contraction, which implies that applying the Bellman evaluation operator $T^{\pi_{\theta_{k+1}}}$ to any Q, e.g., $Q_{\omega_k}$, infinitely many times yields $Q^{\pi_{\theta_{k+1}}}$.
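The identity $(I - \gamma P^{\pi^*})^{-1} = \sum_{i=0}^\infty(\gamma P^{\pi^*})^i$ underlying (D.1) is a Neumann series, and it can be sanity-checked numerically on a small row-stochastic matrix (the 3-state matrix below is an arbitrary illustration):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])  # row-stochastic transition matrix

# The resolvent (I - gamma * P)^{-1} exists because gamma < 1 bounds the
# spectral radius of gamma * P strictly below 1.
resolvent = np.linalg.inv(np.eye(3) - gamma * P)

# Truncated Neumann series sum_{i=0}^{499} (gamma * P)^i; the truncation
# error decays geometrically at rate gamma.
series = sum(np.linalg.matrix_power(gamma * P, i) for i in range(500))

print(np.max(np.abs(resolvent - series)))  # tiny truncation error
```

The same expansion is what lets the proof distribute $(I - \gamma P^{\pi^*})^{-1}$ across the three error terms $A_{1,k}$, $A_{2,k}$, and $A_{3,k}$.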
As defined in (D.4), $A_{3,k}$ measures the error of tracking the action-value function $Q^{\pi_{\theta_{k+1}}}$ of $\pi_{\theta_{k+1}}$ by applying the Bellman evaluation operator $T^{\pi_{\theta_{k+1}}}$ to $Q_{\omega_k}$ only once, which arises from the critic update. Also, since $A_{3,k} = T^{\pi_{\theta_{k+1}}} Q_{\omega_k} - T^{\pi_{\theta_{k+1}}} Q^{\pi_{\theta_{k+1}}} = \gamma P^{\pi_{\theta_{k+1}}}(Q_{\omega_k} - Q^{\pi_{\theta_{k+1}}})$, $A_{3,k}$ also measures the difference between $Q^{\pi_{\theta_k}}$, which is approximated by $Q_{\omega_k}$ as discussed subsequently, and $Q^{\pi_{\theta_{k+1}}}$. Such a difference can in turn be viewed as the difference between $\pi_{\theta_k}$ and $\pi_{\theta_{k+1}}$, which arises from the actor update. Therefore, the convergence of $A_{3,k}$ to zero reflects the contraction of not only the critic update but also the actor update, which illustrates the "double contraction" phenomenon. We establish the convergence of $A_{3,k}$ to zero in (D.10) subsequently.

Interpretation of $A_{2,k}$. Assuming that $A_{3,k-1}$ converges to zero, we have $T^{\pi_{\theta_k}} Q_{\omega_{k-1}} \approx Q^{\pi_{\theta_k}}$. Moreover, assuming that the number of data points N is sufficiently large and ignoring the projection in (3.5), we have $T^{\pi_{\theta_k}} Q_{\omega_{k-1}} = Q_{\bar\omega_k} \approx Q_{\omega_k}$, since $\omega_k$ defined in (3.4) is an estimator of $\bar\omega_k$. Hence, we have $Q^{\pi_{\theta_k}} \approx Q_{\omega_k}$. Such an approximation error is characterized by $c_k$ defined in (D.5) subsequently. Hence, $A_{2,k}$ measures the difference between $\pi_{\theta_k}$ and $\pi_{\theta_{k+1}}$ through the difference between $Q^{\pi_{\theta_k}} \approx Q_{\omega_k}$ and $Q^{\pi_{\theta_{k+1}}}$, which relies on the convergence of $A_{3,k-1}$ to zero.

In the sequel, we upper bound $A_{1,k}$, $A_{2,k}$, and $A_{3,k}$, respectively. To establish such upper bounds, we define the following quantities,

$c_{k+1}(s,a) = [T^{\pi_{\theta_{k+1}}} Q_{\omega_k} - Q_{\omega_{k+1}}](s,a)$,   (D.5)
$e_{k+1}(s,a) = [Q_{\omega_k} - T^{\pi_{\theta_{k+1}}} Q_{\omega_k}](s,a)$,   (D.6)
$\vartheta_k(s) = \mathrm{KL}\bigl(\pi^*(\cdot\,|\,s)\,\|\,\pi_{\theta_k}(\cdot\,|\,s)\bigr) - \mathrm{KL}\bigl(\pi^*(\cdot\,|\,s)\,\|\,\pi_{\theta_{k+1}}(\cdot\,|\,s)\bigr)$.   (D.7)

To understand the intuition behind $c_{k+1}$, $e_{k+1}$, and $\vartheta_k$, we interpret them as follows.

Interpretation of $c_{k+1}$. Recall that $\bar\omega_{k+1}$ is defined in (3.4), which parameterizes $T^{\pi_{\theta_{k+1}}} Q_{\omega_k}$ (ignoring the projection in (3.5)).
Here $c_{k+1}$ arises from approximating $\bar\omega_{k+1}$ by the estimator $\omega_{k+1}$, which is constructed based on $\omega_k$ and the N data points. In particular, $c_{k+1}$ decreases to zero as $N \to \infty$, and it is used in characterizing $A_{2,k}$ defined in (D.3).

Interpretation of $e_{k+1}$. Assuming that $A_{3,k-1}$ defined in (D.4) and $c_k$ defined in (D.5) converge to zero, which imply $T^{\pi_{\theta_k}} Q_{\omega_{k-1}} \approx Q^{\pi_{\theta_k}}$ and $T^{\pi_{\theta_k}} Q_{\omega_{k-1}} \approx Q_{\omega_k}$, respectively, we have $Q_{\omega_k} \approx Q^{\pi_{\theta_k}}$. Therefore, as defined in (D.6), $e_{k+1} = Q_{\omega_k} - T^{\pi_{\theta_{k+1}}} Q_{\omega_k} \approx Q^{\pi_{\theta_k}} - T^{\pi_{\theta_{k+1}}} Q^{\pi_{\theta_k}} = (T^{\pi_{\theta_k}} - T^{\pi_{\theta_{k+1}}}) Q^{\pi_{\theta_k}}$ measures the difference between $\pi_{\theta_k}$ and $\pi_{\theta_{k+1}}$, which implies the difference between $T^{\pi_{\theta_k}}$ and $T^{\pi_{\theta_{k+1}}}$. We remark that $e_{k+1}$ fully characterizes $A_{3,k}$ defined in (D.4), as shown in (D.8) subsequently.

Interpretation of $\vartheta_k$. As defined in (D.7), $\vartheta_k$ measures the difference between $\pi_{\theta_k}$ and $\pi_{\theta_{k+1}}$ in terms of their differences from $\pi^*$, which are measured by the corresponding KL-divergences. In particular, $\vartheta_k$ is used in characterizing $A_{1,k}$ and $A_{2,k}$ defined in (D.2) and (D.3), respectively. We remark that $c_{k+1}$ measures the statistical error in the critic update, while $\vartheta_k$ measures the optimization error in the actor update. As discussed above, the convergence of $A_{3,k}$ to zero implies the

(Figure 1 diagram: node labels omitted.) Figure 1: Illustration of the relationship among $A_{1,k}$, $A_{2,k}$, $A_{3,k}$, $c_{k+1}$, $e_{k+1}$, and $\vartheta_k$. Here $\{\theta_k, \omega_k\}$ and $\{\theta_{k+1}, \omega_{k+1}\}$ are two consecutive iterates of actor-critic. The red arrow from $Q_{\omega_k}$ to $Q_{\omega_{k+1}}$ represents the critic update, and the red arrow from $Q^{\pi_{\theta_k}}$ to $Q^{\pi_{\theta_{k+1}}}$ represents the action-value functions associated with the two policies in an actor update.
Here $\vartheta_k$, given in (D.7), quantifies the difference between $\pi_{\theta_k}$ and $\pi_{\theta_{k+1}}$ in terms of their KL distances to $\pi^*$. In addition, the cyan arrows represent the quantities $A_{1,k}$, $A_{2,k}$, and $A_{3,k}$ introduced in (D.2)-(D.4), which are intermediate terms used for analyzing the error $Q^* - Q^{\pi_{\theta_{k+1}}}$. Finally, the blue arrows represent $c_{k+1}$ and $e_{k+1}$ defined in (D.5) and (D.6), respectively. Here $c_{k+1}$ corresponds to the statistical error due to having finite data, whereas $e_{k+1}$ essentially quantifies the difference between $\pi_{\theta_k}$ and $\pi_{\theta_{k+1}}$.

contraction of both the actor update and the critic update, which illustrates the "double contraction" phenomenon. Meanwhile, since $e_{k+1}$ fully characterizes $A_{3,k}$, as shown in (D.8) subsequently, $e_{k+1}$ plays a key role in the "double contraction" phenomenon. In particular, the convergence of $e_{k+1}$ to zero is established in (D.9) subsequently. See Figure 1 for an illustration of these quantities. With the quantities defined in (D.5), (D.6), and (D.7), we upper bound $A_{1,k}$, $A_{2,k}$, and $A_{3,k}$ as follows,

$A_{1,k}(s,a) \le \gamma\beta\cdot[P\vartheta_k](s,a)$,
$A_{2,k}(s,a) \le \bigl[(\gamma P^{\pi^*})^{k+1}(Q^* - Q_{\omega_0})\bigr](s,a) + \gamma\beta\cdot\sum_{i=0}^{k-1}\bigl[(\gamma P^{\pi^*})^{k-i} P\vartheta_i\bigr](s,a) + \sum_{i=0}^{k-1}\bigl[(\gamma P^{\pi^*})^{k-i} c_{i+1}\bigr](s,a)$,
$A_{3,k}(s,a) = \bigl[\gamma P^{\pi_{\theta_{k+1}}}(I - \gamma P^{\pi_{\theta_{k+1}}})^{-1} e_{k+1}\bigr](s,a)$,   (D.8)

$e_{k+1}(s,a) \le \Bigl[\gamma^k \prod_{s=1}^k P^{\pi_{\theta_s}} e_1 + \sum_{i=1}^k \gamma^{k-i}\prod_{s=i+1}^k P^{\pi_{\theta_s}}(I - \gamma P^{\pi_{\theta_i}}) c_i\Bigr](s,a)$,   (D.9)

the proof of which is deferred to Lemma E.4 in §E.1 of the appendix. By plugging (D.9) into (D.8), we have

$A_{3,k}(s,a) \le \gamma P^{\pi_{\theta_{k+1}}}(I - \gamma P^{\pi_{\theta_{k+1}}})^{-1}\Bigl[\gamma^k \prod_{s=1}^k P^{\pi_{\theta_s}} e_1 + \sum_{i=1}^k \gamma^{k-i}\prod_{s=i+1}^k P^{\pi_{\theta_s}}(I - \gamma P^{\pi_{\theta_i}}) c_i\Bigr](s,a)$.   (D.10)

To better understand (D.10) and how it relates to the convergence of $A_{3,k}$, $A_{2,k}$, and $A_{1,k}$ to zero, we proceed in the following two steps.

Step (i). We assume $c_i = 0$, which corresponds to the number of data points $N \to \infty$. Then (D.10) yields $A_{3,k} = O(\gamma^k)$, which implies that $A_{3,k}$ defined in (D.4) converges to zero at a rate driven by the discount factor γ.
As discussed above, the convergence of $A_{3,k}$ to zero also implies the contraction between $\pi_{\theta_k}$ and $\pi_{\theta_{k+1}}$ in the actor update and the contraction between $Q_{\omega_k}$ and $Q^{\pi_{\theta_k}}$ in the critic update, which illustrates the "double contraction" phenomenon.

Step (ii). The convergence of $A_{3,k}$ to zero further ensures that $A_{2,k}$ converges to zero. To see this, we further assume $A_{3,k} = 0$, which together with the assumption $c_{k+1} = 0$ implies $Q^{\pi_{\theta_{k+1}}} = T^{\pi_{\theta_{k+1}}} Q_{\omega_k} = Q_{\omega_{k+1}}$ by their definitions in (D.4) and (D.5), respectively. Then, telescoping the sum of $A_{2,k}$ defined in (D.3), which cancels out $Q_{\omega_{k+1}}$ and $Q^{\pi_{\theta_{k+1}}}$, yields the convergence of $A_{2,k}$ to zero. Meanwhile, telescoping the sum of $A_{1,k}$ defined in (D.2) and the sum of its upper bound in (D.8) implies that $A_{1,k}$ converges to zero.

Now, by plugging (D.8) and (D.10) into (D.1), we establish an upper bound of $\sum_{k=0}^K (Q^*(s,a) - Q^{\pi_{\theta_{k+1}}}(s,a))$ for any $(s,a)\in S\times A$, which is deferred to (E.12) in §E.1 of the appendix. Hence, we conclude the proof in part 1. See part 1 of §E.1 for details.

Part 2. Recall that ρ is a state-action distribution satisfying (ii) of Assumption 4.1. In the sequel, we take the expectation over ρ in (E.12) and upper bound each term. We first introduce the following lemma, which upper bounds $c_{k+1}$ defined in (D.5).

Lemma D.1. Under Assumptions 4.2 and 4.3, it holds for any $k \ge 1$ that

$\mathbb{E}\bigl[(c_{k+1}(s,a))^2\bigr] = \mathbb{E}\bigl[\bigl(Q_{\omega_{k+1}}(s,a) - [T^{\pi_{\theta_{k+1}}} Q_{\omega_k}](s,a)\bigr)^2\bigr] \le \frac{16(r_{\max} + R)^2}{N\sigma^{*4}}\cdot\log^2(N + d)$,

where the expectation is taken with respect to the randomness of $\omega_{k+1}$ and $(s,a) \sim \rho_{k+1}$.

Proof. See §H.1 for a detailed proof.

On the right-hand side of (E.12) in §E.1 of the appendix, for the terms not involving $c_{k+1}$, i.e., $M_1$, $M_2$, and $M_3$ in (E.13), we take the expectation over ρ and establish their upper bounds in the ∞-norm over (s,a) in Lemma E.5.
On the other hand, for the terms involving c k+1 , i.e., M 4 and M 5 in (E.14), we take the expectation over ρ and then change the measure from ρ to ρ k+1 . By Assumption 4.1 and Lemma D.1, which relies on ρ k+1 , we establish the upper bounds in Lemma E.6. See part 2 of §E.1 for details. Combining Lemmas E.5 and E.6 yields Theorem 4.4. See §E.1 for a detailed proof.

E PROOFS OF THEOREMS

E.1 PROOF OF THEOREM 4.4

Recall that ρ is a state-action distribution satisfying (ii) of Assumption 4.1. We first upper bound $\sum_{k=0}^K(Q^*(s,a) - Q^{\pi_{\theta_{k+1}}}(s,a))$ for any $(s,a)\in S\times A$ in part 1. Then by further taking the expectation over ρ and invoking Lemma D.1 in part 2, we conclude the proof of Theorem 4.4.

Part 1. In the sequel, we upper bound $\sum_{k=0}^K(Q^*(s,a) - Q^{\pi_{\theta_{k+1}}}(s,a))$ for any $(s,a)\in S\times A$. By the definition of $Q^*$ in (2.2), it holds for any $(s,a)\in S\times A$ that

$[Q^* - Q^{\pi_{\theta_{k+1}}}](s,a) = \sum_{\ell=0}^\infty\bigl[(1-\gamma)\cdot(\gamma P^{\pi^*})^\ell r\bigr](s,a) - Q^{\pi_{\theta_{k+1}}}(s,a)$
$= \sum_{\ell=0}^\infty\bigl[(1-\gamma)\cdot(\gamma P^{\pi^*})^\ell r + (\gamma P^{\pi^*})^{\ell+1} Q^{\pi_{\theta_{k+1}}} - (\gamma P^{\pi^*})^{\ell+1} Q^{\pi_{\theta_{k+1}}}\bigr](s,a) - Q^{\pi_{\theta_{k+1}}}(s,a)$
$= \sum_{\ell=0}^\infty\bigl[(1-\gamma)\cdot(\gamma P^{\pi^*})^\ell r + (\gamma P^{\pi^*})^{\ell+1} Q^{\pi_{\theta_{k+1}}} - (\gamma P^{\pi^*})^\ell Q^{\pi_{\theta_{k+1}}}\bigr](s,a)$
$= \sum_{\ell=0}^\infty\bigl[(\gamma P^{\pi^*})^\ell\bigl((1-\gamma)\cdot r + \gamma\cdot P^{\pi^*} Q^{\pi_{\theta_{k+1}}} - Q^{\pi_{\theta_{k+1}}}\bigr)\bigr](s,a)$,   (E.1)

where $P^{\pi^*}$ is defined in (2.3). We upper bound $[(1-\gamma)\cdot r + \gamma\cdot P^{\pi^*} Q^{\pi_{\theta_{k+1}}} - Q^{\pi_{\theta_{k+1}}}](s,a)$ on the RHS of (E.1) in the sequel. By calculation, we have

$\bigl[(1-\gamma)\cdot r + \gamma\cdot P^{\pi^*} Q^{\pi_{\theta_{k+1}}} - Q^{\pi_{\theta_{k+1}}}\bigr](s,a)$
$= \bigl[\bigl((1-\gamma)\cdot r + \gamma\cdot P^{\pi^*} Q_{\omega_k}\bigr) - \bigl((1-\gamma)\cdot r + \gamma\cdot P^{\pi_{\theta_{k+1}}} Q_{\omega_k}\bigr)\bigr](s,a)$
$\quad + \bigl[\bigl((1-\gamma)\cdot r + \gamma\cdot P^{\pi^*} Q^{\pi_{\theta_{k+1}}}\bigr) - \bigl((1-\gamma)\cdot r + \gamma\cdot P^{\pi^*} Q_{\omega_k}\bigr)\bigr](s,a)$
$\quad + \bigl[\bigl((1-\gamma)\cdot r + \gamma\cdot P^{\pi_{\theta_{k+1}}} Q_{\omega_k}\bigr) - Q^{\pi_{\theta_{k+1}}}\bigr](s,a)$
$= A_{1,k}(s,a) + A_{2,k}(s,a) + A_{3,k}(s,a)$,   (E.2)

where $A_{1,k}$, $A_{2,k}$, and $A_{3,k}$ are defined as follows,

$A_{1,k}(s,a) = [\gamma(P^{\pi^*} - P^{\pi_{\theta_{k+1}}}) Q_{\omega_k}](s,a)$,
$A_{2,k}(s,a) = [\gamma P^{\pi^*}(Q^{\pi_{\theta_{k+1}}} - Q_{\omega_k})](s,a)$,
$A_{3,k}(s,a) = [T^{\pi_{\theta_{k+1}}} Q_{\omega_k} - Q^{\pi_{\theta_{k+1}}}](s,a)$.   (E.3)

Here $T^{\pi_{\theta_{k+1}}}$ is defined in (2.4). By the following three lemmas, we upper bound $A_{1,k}$, $A_{2,k}$, and $A_{3,k}$ on the RHS of (E.2), respectively.

Lemma E.1. It holds for any $(s,a)\in S\times A$ that

$A_{1,k}(s,a) = [\gamma(P^{\pi^*} - P^{\pi_{\theta_{k+1}}}) Q_{\omega_k}](s,a) \le \gamma\beta\cdot\bigl[P(\vartheta_k + a_{k+1})\bigr](s,a)$,

where $\vartheta_k$ and $a_{k+1}$ are defined as follows,

$\vartheta_k(s) = \mathrm{KL}\bigl(\pi^*(\cdot\,|\,s)\,\|\,\pi_{\theta_k}(\cdot\,|\,s)\bigr) - \mathrm{KL}\bigl(\pi^*(\cdot\,|\,s)\,\|\,\pi_{\theta_{k+1}}(\cdot\,|\,s)\bigr)$,   (E.4)
$a_{k+1}(s) = \bigl\langle \log\bigl(\pi_{\theta_{k+1}}(\cdot\,|\,s)/\pi_{\theta_k}(\cdot\,|\,s)\bigr) - \beta^{-1}\cdot Q_{\omega_k}(s,\cdot),\ \pi^*(\cdot\,|\,s) - \pi_{\theta_{k+1}}(\cdot\,|\,s)\bigr\rangle$.   (E.5)

Proof.
See §H.2 for a detailed proof.

We remark that $a_{k+1} = 0$ for any k in the linear actor-critic method; the term is included in Lemma E.1 only to generalize to the deep neural actor-critic method.

Lemma E.2. It holds for any $(s,a)\in S\times A$ that

$A_{2,k}(s,a) \le \bigl[(\gamma P^{\pi^*})^{k+1}(Q^* - Q_{\omega_0})\bigr](s,a) + \gamma\beta\cdot\sum_{i=0}^{k-1}\bigl[(\gamma P^{\pi^*})^{k-i} P(\vartheta_i + a_{i+1})\bigr](s,a) + \sum_{i=0}^{k-1}\bigl[(\gamma P^{\pi^*})^{k-i} c_{i+1}\bigr](s,a)$,

where $\vartheta_i$ is defined in (E.4) of Lemma E.1, $a_{i+1}$ is defined in (E.5) of Lemma E.1, and $c_{i+1}$ is defined as follows,

$c_{i+1}(s,a) = [T^{\pi_{\theta_{i+1}}} Q_{\omega_i} - Q_{\omega_{i+1}}](s,a)$.   (E.6)

Proof. See §H.3 for a detailed proof.

We remark that $a_{k+1} = 0$ for any k in the linear actor-critic method; the term is included in Lemma E.2 only to generalize to the deep neural actor-critic method.

Lemma E.3. It holds for any $(s,a)\in S\times A$ that

$A_{3,k}(s,a) = \bigl[\gamma P^{\pi_{\theta_{k+1}}}(I - \gamma P^{\pi_{\theta_{k+1}}})^{-1} e_{k+1}\bigr](s,a)$,

where $e_{k+1}$ is defined as follows,

$e_{k+1}(s,a) = [Q_{\omega_k} - T^{\pi_{\theta_{k+1}}} Q_{\omega_k}](s,a)$.   (E.7)

Proof. See §H.4 for a detailed proof.

We upper bound $e_{k+1}$ in (E.7) of Lemma E.3 using Lemma E.4 as follows.

Lemma E.4. It holds for any $(s,a)\in S\times A$ that

$e_{k+1}(s,a) \le \Bigl[\gamma^k\prod_{s=1}^k P^{\pi_{\theta_s}} e_1 + \sum_{i=1}^k\gamma^{k-i}\prod_{s=i+1}^k P^{\pi_{\theta_s}}\bigl(\gamma\beta P b_{i+1} + (I - \gamma P^{\pi_{\theta_i}}) c_i\bigr)\Bigr](s,a)$,

where $c_i$ is defined in (E.6) of Lemma E.2 and $b_{i+1}$ is defined as follows,

$b_{i+1}(s) = \bigl\langle \log\bigl(\pi_{\theta_{i+1}}(\cdot\,|\,s)/\pi_{\theta_i}(\cdot\,|\,s)\bigr) - \beta^{-1}\cdot Q_{\omega_i}(s,\cdot),\ \pi_{\theta_i}(\cdot\,|\,s) - \pi_{\theta_{i+1}}(\cdot\,|\,s)\bigr\rangle$.   (E.8)

Proof. See §H.5 for a detailed proof.

We remark that $b_{i+1} = 0$ for any i in the linear actor-critic method; the term is included in Lemma E.4 only to generalize to the deep neural actor-critic method. Combining Lemmas E.3 and E.4, we obtain the following upper bound of $A_{3,k}$,

$A_{3,k}(s,a) = \bigl[\gamma P^{\pi_{\theta_{k+1}}}(I-\gamma P^{\pi_{\theta_{k+1}}})^{-1} e_{k+1}\bigr](s,a) \le \gamma P^{\pi_{\theta_{k+1}}}(I-\gamma P^{\pi_{\theta_{k+1}}})^{-1}\Bigl[\gamma^k\prod_{s=1}^k P^{\pi_{\theta_s}} e_1 + \sum_{i=1}^k\gamma^{k-i}\prod_{s=i+1}^k P^{\pi_{\theta_s}}\bigl(\beta\gamma P b_{i+1} + (I-\gamma P^{\pi_{\theta_i}}) c_i\bigr)\Bigr](s,a)$.   (E.9)
Combining (E.1), (E.2), Lemma E.1, and Lemma E.2, it holds for any $(s,a)\in S\times A$ that

$\sum_{k=0}^K [Q^* - Q^{\pi_{\theta_{k+1}}}](s,a) \le \sum_{k=0}^K \Bigl[(I-\gamma P^{\pi^*})^{-1}\Bigl((\gamma P^{\pi^*})^{k+1}(Q^* - Q_{\omega_0}) + \sum_{i=0}^k(\gamma P^{\pi^*})^{k-i}\gamma\beta P(\vartheta_i + a_{i+1}) + \sum_{i=0}^{k-1}(\gamma P^{\pi^*})^{k-i} c_{i+1} + A_{3,k}\Bigr)\Bigr](s,a)$
$= \Bigl[(I-\gamma P^{\pi^*})^{-1}\Bigl(\sum_{k=0}^K(\gamma P^{\pi^*})^{k+1}(Q^* - Q_{\omega_0}) + \sum_{k=0}^K\sum_{i=0}^k(\gamma P^{\pi^*})^{k-i}\gamma\beta P a_{i+1} + \sum_{k=0}^K\sum_{i=0}^{k-1}(\gamma P^{\pi^*})^{k-i} c_{i+1} + \sum_{k=0}^K A_{3,k} + \sum_{k=0}^K\sum_{i=0}^k(\gamma P^{\pi^*})^{k-i}\gamma\beta P\vartheta_i\Bigr)\Bigr](s,a)$,   (E.10)

where $\vartheta_i$, $a_{i+1}$, $c_{i+1}$, and $e_{k+1}$ are defined in (E.4) of Lemma E.1, (E.5) of Lemma E.1, (E.6) of Lemma E.2, and (E.7) of Lemma E.3, respectively. We upper bound the last term as follows,

$\sum_{k=0}^K\sum_{i=0}^k\bigl[(\gamma P^{\pi^*})^{k-i}\gamma\beta P\vartheta_i\bigr](s,a) = \sum_{k=0}^K\sum_{i=0}^k\bigl[\gamma\beta(\gamma P^{\pi^*})^i P\vartheta_{k-i}\bigr](s,a) = \sum_{i=0}^K\Bigl[\gamma\beta(\gamma P^{\pi^*})^i P\sum_{k=i}^K\vartheta_{k-i}\Bigr](s,a)$
$= \sum_{i=0}^K\Bigl[\gamma\beta(\gamma P^{\pi^*})^i P\sum_{k=i}^K\bigl(\mathrm{KL}(\pi^*\,\|\,\pi_{\theta_{k-i}}) - \mathrm{KL}(\pi^*\,\|\,\pi_{\theta_{k-i+1}})\bigr)\Bigr](s,a)$
$= \sum_{i=0}^K\bigl[\gamma\beta(\gamma P^{\pi^*})^i P\bigl(\mathrm{KL}(\pi^*\,\|\,\pi_{\theta_0}) - \mathrm{KL}(\pi^*\,\|\,\pi_{\theta_{K-i+1}})\bigr)\bigr](s,a)$
$\le \sum_{i=0}^K\bigl[\gamma\beta(\gamma P^{\pi^*})^i P\,\mathrm{KL}(\pi^*\,\|\,\pi_{\theta_0})\bigr](s,a)$,   (E.11)

where we use the definition of $\vartheta_{k-i}$ in (E.4) of Lemma E.1 and the non-negativity of the KL-divergence in the second equality and the last inequality, respectively. By plugging (E.9) and (E.11) into (E.10), we have

$\sum_{k=0}^K[Q^* - Q^{\pi_{\theta_{k+1}}}](s,a)$
$\le \Bigl[(I-\gamma P^{\pi^*})^{-1}\Bigl(\sum_{k=0}^K(\gamma P^{\pi^*})^{k+1}(Q^* - Q_{\omega_0}) + \sum_{k=0}^K\sum_{i=0}^k(\gamma P^{\pi^*})^{k-i}\gamma\beta P a_{i+1} + \sum_{k=0}^K\sum_{i=0}^{k-1}(\gamma P^{\pi^*})^{k-i} c_{i+1}$
$\quad + \sum_{k=0}^K\gamma^{k+1} P^{\pi_{\theta_{k+1}}}(I-\gamma P^{\pi_{\theta_{k+1}}})^{-1}\prod_{s=1}^k P^{\pi_{\theta_s}} e_1$
$\quad + \sum_{k=0}^K P^{\pi_{\theta_{k+1}}}(I-\gamma P^{\pi_{\theta_{k+1}}})^{-1}\sum_{\ell=1}^k\gamma^{k-\ell+1}\prod_{s=\ell+1}^k P^{\pi_{\theta_s}}\bigl(\gamma\beta P b_{\ell+1} + (I-\gamma P^{\pi_{\theta_\ell}}) c_\ell\bigr)$
$\quad + \sum_{i=0}^K(\gamma P^{\pi^*})^i\gamma\beta P\,\mathrm{KL}(\pi^*\,\|\,\pi_{\theta_0})\Bigr)\Bigr](s,a)$.   (E.12)

We remark that $a_{i+1} = b_{i+1} = 0$ for any i in the linear actor-critic method; such terms are included in (E.12) only to generalize to the deep neural actor-critic method. This concludes the proof in part 1.

Part 2. Recall that ρ is a state-action distribution satisfying (ii) of Assumption 4.1. In the sequel, we take the expectation over ρ in (E.12) and upper bound each term. Recall that $a_{i+1} = b_{i+1} = 0$ for any i in the linear actor-critic method.
Hence, we only need to consider the terms in (E.12) that do not involve $a_{i+1}$ or $b_{i+1}$. We first upper bound the terms on the RHS of (E.12) that do not involve $c_{i+1}$. More specifically, for any measure ρ satisfying (ii) of Assumption 4.1, we upper bound the following three terms,

$M_1 = \mathbb{E}_\rho\bigl[(I-\gamma P^{\pi^*})^{-1}\sum_{k=0}^K(\gamma P^{\pi^*})^{k+1}(Q^* - Q_{\omega_0})\bigr]$,
$M_2 = \mathbb{E}_\rho\bigl[(I-\gamma P^{\pi^*})^{-1}\sum_{k=0}^K\gamma^{k+1} P^{\pi_{\theta_{k+1}}}(I-\gamma P^{\pi_{\theta_{k+1}}})^{-1}\prod_{s=1}^k P^{\pi_{\theta_s}} e_1\bigr]$,
$M_3 = \mathbb{E}_\rho\bigl[(I-\gamma P^{\pi^*})^{-1}\sum_{i=0}^K(\gamma P^{\pi^*})^i\gamma\beta P\,\mathrm{KL}(\pi^*\,\|\,\pi_{\theta_0})\bigr]$.   (E.13)

We upper bound $M_1$, $M_2$, and $M_3$ in the following lemma.

Lemma E.5. It holds that

$|M_1| \le 4(1-\gamma)^{-2}\cdot(r_{\max} + R)$,  $|M_2| \le (1-\gamma)^{-3}\cdot(2R + r_{\max})$,  $|M_3| \le (1-\gamma)^{-2}\cdot\log|\mathcal{A}|\cdot K^{1/2}$,

where $M_1$, $M_2$, and $M_3$ are defined in (E.13).

Proof. See §H.6 for a detailed proof.

Now, we upper bound the terms on the RHS of (E.12) that involve $c_{i+1}$. More specifically, for any measure ρ satisfying (ii) of Assumption 4.1, we upper bound the following two terms,

$M_4 = \mathbb{E}_\rho\bigl[(I-\gamma P^{\pi^*})^{-1}\sum_{k=0}^K\sum_{i=0}^{k-1}(\gamma P^{\pi^*})^{k-i} c_{i+1}\bigr]$,   (E.14)
$M_5 = \mathbb{E}_\rho\bigl[(I-\gamma P^{\pi^*})^{-1}\sum_{k=0}^K P^{\pi_{\theta_{k+1}}}(I-\gamma P^{\pi_{\theta_{k+1}}})^{-1}\sum_{\ell=1}^k\gamma^{k-\ell+1}\prod_{s=\ell+1}^k P^{\pi_{\theta_s}}(I-\gamma P^{\pi_{\theta_\ell}}) c_\ell\bigr]$.

We upper bound $M_4$ and $M_5$ in the following lemma.

Lemma E.6. It holds that $|M_4| \le 3KC_{\rho,\rho^*}\cdot\varepsilon_Q$ and $|M_5| \le KC_{\rho,\rho^*}\cdot\varepsilon_Q$, where $M_4$ and $M_5$ are defined in (E.14).

Proof. See §H.7 for a detailed proof.

Now, by plugging Lemmas E.5 and E.6 into (E.12), we have

$\mathbb{E}_\rho\bigl[\sum_{k=0}^K\bigl(Q^*(s,a) - Q^{\pi_{\theta_{k+1}}}(s,a)\bigr)\bigr] \le 2(1-\gamma)^{-3}\cdot\log|\mathcal{A}|\cdot K^{1/2} + 4KC_{\rho,\rho^*}\cdot\varepsilon_Q + O(1)$.   (E.15)

Meanwhile, by changing the measure from $\rho^*$ to $\rho_{k+1}$, it holds for any k that

$\mathbb{E}_{\rho^*}[|c_{k+1}|] \le \bigl(\mathbb{E}_{\rho_{k+1}}\bigl[(c_{k+1}(s,a))^2\bigr]\bigr)^{1/2}\cdot\phi_{k+1}^*$,   (E.16)

where $\phi_{k+1}^*$ is defined in Assumption 4.1. Also, by Lemma D.1, it holds that

$\bigl(\mathbb{E}_{\rho_{k+1}}\bigl[(c_{k+1}(s,a))^2\bigr]\bigr)^{1/2} = O\bigl(1/(\sqrt N\sigma^{*2})\cdot\log N\bigr)$.   (E.17)

Now, by plugging (E.17) into (E.16) and combining the definition $\varepsilon_Q = \max_k \mathbb{E}_{\rho^*}[|c_{k+1}|]$, we have

$\varepsilon_Q = O\bigl(\phi^*/(\sqrt N\sigma^{*2})\cdot\log N\bigr)$.
(E.18) Combining (E.15), (E.18), and the choice of parameters stated in the theorem, namely $N = \Omega\bigl(KC_{\rho,\rho^*}^2(\phi^*/\sigma^*)^2\cdot\log^2 N\bigr)$, we have

$\mathbb{E}_\rho\bigl[\sum_{k=0}^K\bigl(Q^*(s,a) - Q^{\pi_{\theta_{k+1}}}(s,a)\bigr)\bigr] \le \bigl(2(1-\gamma)^{-3}\log|\mathcal{A}| + O(1)\bigr)\cdot K^{1/2}$,

which concludes the proof of Theorem 4.4.

E.2 PROOF OF THEOREM C.5

We follow the proof of Theorem 4.4 in §E.1. By similar arguments to those used in deriving (E.12) in §E.1, it holds for any $(s,a)\in S\times A$ that

$\sum_{k=0}^K[Q^* - Q^{\pi_{\theta_{k+1}}}](s,a)$
$\le \Bigl[(I-\gamma P^{\pi^*})^{-1}\Bigl(\sum_{k=0}^K(\gamma P^{\pi^*})^{k+1}(Q^* - Q_{\omega_0}) + \sum_{k=0}^K\sum_{i=0}^k(\gamma P^{\pi^*})^{k-i}\cdot\gamma\beta P a_{i+1} + \sum_{k=0}^K\sum_{i=0}^{k-1}(\gamma P^{\pi^*})^{k-i} c_{i+1} + \sum_{i=0}^K(\gamma P^{\pi^*})^i\cdot\gamma\beta P\,\mathrm{KL}(\pi^*\,\|\,\pi_{\theta_0}) + \sum_{k=0}^K\gamma^{k+1} P^{\pi_{\theta_{k+1}}}(I-\gamma P^{\pi_{\theta_{k+1}}})^{-1}\prod_{s=1}^k P^{\pi_{\theta_s}} e_1 + \sum_{k=0}^K P^{\pi_{\theta_{k+1}}}(I-\gamma P^{\pi_{\theta_{k+1}}})^{-1}\sum_{\ell=1}^k\gamma^{k-\ell+1}\prod_{s=\ell+1}^k P^{\pi_{\theta_s}}\bigl(\beta\gamma P b_{\ell+1} + (I-\gamma P^{\pi_{\theta_\ell}}) c_\ell\bigr)\Bigr)\Bigr](s,a)$.   (E.19)

Here $a_{i+1}$, $b_{\ell+1}$, $c_{i+1}$, and $e_1$ are defined in (E.5), (E.8), (E.6), and (E.7), respectively. Now, it remains to upper bound each term on the RHS of (E.19). We introduce the following error propagation lemma.

Lemma E.7. Suppose that

$\bigl(\mathbb{E}_{\rho_k}\bigl[\bigl(f_{\theta_{k+1}}(s,a) - \tau_{k+1}\cdot(\beta^{-1} Q_{\omega_k}(s,a) + \tau_k^{-1} f_{\theta_k}(s,a))\bigr)^2\bigr]\bigr)^{1/2} \le \varepsilon_{k+1,f}$.   (E.20)

Then, we have

$\mathbb{E}_{\nu^*}[|a_{k+1}(s)|] \le \sqrt 2\,\tau_{k+1}^{-1}\cdot\varepsilon_{k+1,f}\cdot(\phi_k^* + \psi_k^*)$,
$\mathbb{E}_{\nu^*}[|b_{k+1}(s)|] \le \sqrt 2\,\tau_{k+1}^{-1}\cdot\varepsilon_{k+1,f}\cdot(1 + \psi_k^*)$,

where $a_{k+1}$ and $b_{k+1}$ are defined in (E.5) and (E.8), respectively, and $\phi_k^*$ and $\psi_k^*$ are defined in Assumption C.1.

Proof. See §H.8 for a detailed proof.

It follows from Lemma F.4 that, with probability at least $1 - O(H_c)\exp(-\Omega(H_c^{-1} m_c))$, we have $|Q_{\omega_0}| \le 2$. Also, from the fact that $|r(s,a)| \le r_{\max}$, we know that $|Q^*| \le r_{\max}$. Therefore, for any measure ρ, we have

$\mathbb{E}_\rho\bigl[(I-\gamma P^{\pi^*})^{-1}\sum_{k=0}^K(\gamma P^{\pi^*})^{k+1}(Q^* - Q_{\omega_0})\bigr] \le \mathbb{E}_\rho\bigl[(I-\gamma P^{\pi^*})^{-1}\sum_{k=0}^K(\gamma P^{\pi^*})^{k+1}|Q^* - Q_{\omega_0}|\bigr] \le (r_{\max} + 2)(1-\gamma)^{-1}\sum_{k=0}^K\gamma^{k+1} \le (r_{\max} + 2)(1-\gamma)^{-2}$.   (E.21)

Also, by changing the index of summation, we have

$\mathbb{E}_\rho\bigl[(I-\gamma P^{\pi^*})^{-1}\sum_{k=0}^K\sum_{i=0}^k(\gamma P^{\pi^*})^{k-i}\gamma\beta P a_{i+1}\bigr] = \mathbb{E}_\rho\bigl[\sum_{k=0}^K\sum_{i=0}^k\sum_{j=0}^\infty(\gamma P^{\pi^*})^{k-i+j}\gamma\beta P a_{i+1}\bigr] = \mathbb{E}_\rho\bigl[\sum_{k=0}^K\sum_{i=0}^k\sum_{t=k-i}^\infty(\gamma P^{\pi^*})^t\gamma\beta P a_{i+1}\bigr] \le \sum_{k=0}^K\sum_{i=0}^k\sum_{t=k-i}^\infty\mathbb{E}_\rho\bigl[(\gamma P^{\pi^*})^t\gamma\beta P|a_{i+1}|\bigr]$,   (E.22)

where we expand $(I-\gamma P^{\pi^*})^{-1}$ into an infinite sum in the first equality.
Further, by changing the measure of the expectation on the RHS of (E.22), we have K k=0 k i=0 ∞ t=k-i E ρ (γP π * ) t γβP a i+1 ≤ K k=0 k i=0 ∞ t=k-i βγ t+1 c(t) • E ν * [| A i+1 |], (E.23) where c(t) is defined in Assumption C.1. Further, by Lemma E.7 and interchanging the summation on the RHS of (E.23), we have E ρ (I -γP π * ) -1 K k=0 k i=0 (γP π * ) k-i γβP a i+1 ≤ 2 K k=0 ∞ t=0 k i=max{0,k-t} βγ t+1 c(t) • τ -1 i+1 ε f (φ * i + ψ * i ) ≤ K k=0 ∞ t=0 4ktγ t+1 c(t) • ε f (φ * + ψ * ) ≤ γ K k=0 4C ρ,ρ * • ε f (φ * + ψ * ) ≤ 2γKC ρ,ρ * (φ * + ψ * ) • ε f , (E.24) where ε f = max i E ρi [(f θi+1 (s, a)-τ i+1 •(β -1 Q ωi (s, a)-τ -1 i f θi (s, a))) 2 ] 1/2 , and C ρ,ρ * is defined in Assumption C.1. Here in the second inequality, we use the fact that τ -1 i+1 = (i + 1) • β -1 , and φ * i ≤ φ * and ψ * i ≤ ψ * by Assumption C.1. By similar arguments in the derivation of (E.24), we have E ρ (I -γP π * ) -1 K k=0 k-1 i=0 (γP π * ) k-i c i+1 ≤ 2(K + 1)C ρ,ρ * φ * • ε Q , (E.25) E ρ (I -γP π * ) -1 K i=0 (γP π * ) i γβPKL(π * π θ0 ) ≤ log |A| • K 1/2 (1 -γ) -2 , E ρ (I -γP π * ) -1 K k=0 γ k+1 P π θ k+1 (I -γP π θ k+1 ) -1 k s=1 P π θs e 1 ≤ (2 + r max ) • (1 -γ) -3 , where ε Q = max i E ρ * [| c i+1 |]. And we use the fact that β = K 1/2 . Now, it remains to upper bound the last term on the RHS of (E.19). We first consider the terms involving b +1 . We have E ρ (I -γP π * ) -1 K k=0 P π θ k+1 (I -γP π θ k+1 ) -1 k =1 γ k-+1 k s= +1 P π θs βγP b +1 = ∞ j=0 ∞ i=0 K k=0 k =1 E ρ (γP π * ) j (γP π θ k+1 ) i+1 γ k- k s= +1 P π θs βγP b +1 ≤ βγ K k=0 k =1 ∞ j=0 ∞ i=0 γ i+j+k-+1 • E ρ * [|P b +1 |] • c(i + j + k -+ 1) ≤ 2γ K k=0 k =1 ∞ j=0 ∞ i=0 γ i+j+k-+1 • ( + 1)ε f • (1 + ψ * ) • c(i + j + k -+ 1), (E.26) where we expand (I -γP π * ) -1 and (I -γP π θ k+1 ) -1 to infinite sums in the first equality, change the measure of the expectation in the first inequality, and use Lemma E.7 in the last inequality. 
Now, by changing the index of the summation, we have γ K k=0 k =1 ∞ j=0 ∞ i=0 γ i+j+k-+1 • ( + 1)ε f • (1 + ψ * ) • c(i + j + k -+ 1) = γ K k=0 k =1 ∞ j=0 ∞ t=j+k-+1 γ t • ( + 1)ε f • (1 + ψ * ) • c(t) ≤ γ K k=0 ∞ j=0 ∞ t=j+1 k =max{0,j+k-t+1} γ t • ( + 1)ε f • (1 + ψ * ) • c(t), where we use the fact that ψ * ≤ ψ * from Assumption C.1 in the last inequality. By further manipulating the order of summations of the RHS of (E.27), we have γ K k=0 ∞ j=0 ∞ t=j+1 k =max{0,j+k-t+1} γ t • ( + 1)ε f (1 + ψ * ) • c(t) ≤ γ K k=0 ∞ j=0 j+k+1 t=j+1 (t -j)(2k + j -k + 1) • γ t c(t) + ∞ t=j+k+2 k 2 • γ t c(t) • ε f (1 + ψ * ) = γ K k=0 ∞ t=1 t-1 j=max{0,t-k-1} (t -j)(2k + j -k + 1) • γ t c(t) + ∞ t=k+2 t-k-2 j=1 k 2 • γ t c(t) • ε f (1 + ψ * ) ≤ 20γ K k=0 ∞ t=1 k 2 • tγ t c(t) + ∞ t=1 k 2 • tγ t c(t) • ε f (1 + ψ * ) ≤ 20γK • C ρ,ρ * • ε f (1 + ψ * ), (E.28) where we use the definition of C ρ,ρ * from Assumption C.1 in the last inequality. Now, combining (E.26), (E.27), and (E.28), we have E ρ (I -γP π * ) -1 K k=0 P π θ k+1 (I -γP π θ k+1 ) -1 k =1 γ k-+1 k s= +1 P π θs βγP b +1 ≤ 20γK • C ρ,ρ * • ε f • (1 + ψ * ). (E.29) Following from similar arguments when deriving (E.29), we have E ρ (I -γP π * ) -1 K k=0 P π θ k+1 (I -γP π θ k+1 ) -1 k =1 γ k-+1 k s= +1 P π θs (I -γP π θ ) c ≤ 20K • C ρ,ρ * φ * • ε Q , (E.30) Now, by plugging (E.21), (E.24), (E.25), (E.29), and (E.30) into (E.19), with probability at least 1 -O(H c ) exp(-Ω(H -1 c m c )), we have E ρ K k=0 Q * (s, a) -Q π θ k+1 (s, a) (E.31) ≤ 2 log |A| • K 1/2 (1 -γ) -3 + 60KC ρ,ρ * (φ * + ψ * + 1) • ε f + 50KC ρ,ρ * φ * • ε Q . Meanwhile, following from Propositions C.3 and C.4, it holds with probability at least 1 -1/K that ε f = O R a N -1/4 a + R 4/3 a m -1/12 a H 7/2 a (log m a ) 1/2 ), ε Q = O R c N -1/4 c + R 4/3 c m -1/12 c H 7/2 c (log m c ) 1/2 ). 
(E.32) Combining (E.31), (E.32), and the choices of parameters stated in the theorem, it holds with probability at least $1 - 1/K$ that
$$\mathbb{E}_\rho\Big[\sum_{k=0}^{K} Q^*(s,a) - Q^{\pi_{\theta_{k+1}}}(s,a)\Big] \le \big(2(1-\gamma)^{-3}\log|\mathcal{A}| + O(1)\big)\cdot K^{1/2},$$
which concludes the proof of Theorem C.5.

F SUPPORTING RESULTS

In this section, we provide supporting results used in the proofs of Theorems 4.4 and C.5. We first introduce Lemma F.1, which applies to both Algorithms 1 and 2. For any policy $\pi$ and action-value function $Q$, we define $\tilde\pi(a\,|\,s) \propto \exp(\beta^{-1}Q(s,a)) \cdot \pi(a\,|\,s)$.

Lemma F.1. For any $s \in \mathcal{S}$ and policy $\pi^\dagger$, we have
$$\beta^{-1}\cdot\big\langle Q(s,\cdot),\ \pi^\dagger(\cdot\,|\,s) - \tilde\pi(\cdot\,|\,s)\big\rangle \le \mathrm{KL}\big(\pi^\dagger(\cdot\,|\,s)\,\big\|\,\pi(\cdot\,|\,s)\big) - \mathrm{KL}\big(\pi^\dagger(\cdot\,|\,s)\,\big\|\,\tilde\pi(\cdot\,|\,s)\big) + \big\langle \log\big(\tilde\pi(\cdot\,|\,s)/\pi(\cdot\,|\,s)\big) - \beta^{-1}\cdot Q(s,\cdot),\ \pi^\dagger(\cdot\,|\,s) - \tilde\pi(\cdot\,|\,s)\big\rangle.$$

Proof. By direct calculation, it suffices to show that
$$\big\langle \log\big(\tilde\pi(\cdot\,|\,s)/\pi(\cdot\,|\,s)\big),\ \pi^\dagger(\cdot\,|\,s) - \tilde\pi(\cdot\,|\,s)\big\rangle \le \mathrm{KL}\big(\pi^\dagger(\cdot\,|\,s)\,\big\|\,\pi(\cdot\,|\,s)\big) - \mathrm{KL}\big(\pi^\dagger(\cdot\,|\,s)\,\big\|\,\tilde\pi(\cdot\,|\,s)\big).$$
By the definition of the KL divergence, it holds for any $s \in \mathcal{S}$ that
$$\mathrm{KL}\big(\pi^\dagger(\cdot\,|\,s)\,\big\|\,\pi(\cdot\,|\,s)\big) - \mathrm{KL}\big(\pi^\dagger(\cdot\,|\,s)\,\big\|\,\tilde\pi(\cdot\,|\,s)\big) = \big\langle \log\big(\tilde\pi(\cdot\,|\,s)/\pi(\cdot\,|\,s)\big),\ \pi^\dagger(\cdot\,|\,s)\big\rangle. \quad (\text{F.1})$$
Meanwhile, for the term on the RHS of (F.1), we have
$$\big\langle \log\big(\tilde\pi(\cdot\,|\,s)/\pi(\cdot\,|\,s)\big),\ \pi^\dagger(\cdot\,|\,s)\big\rangle = \big\langle \log\big(\tilde\pi(\cdot\,|\,s)/\pi(\cdot\,|\,s)\big),\ \pi^\dagger(\cdot\,|\,s) - \tilde\pi(\cdot\,|\,s)\big\rangle + \mathrm{KL}\big(\tilde\pi(\cdot\,|\,s)\,\big\|\,\pi(\cdot\,|\,s)\big) \ge \big\langle \log\big(\tilde\pi(\cdot\,|\,s)/\pi(\cdot\,|\,s)\big),\ \pi^\dagger(\cdot\,|\,s) - \tilde\pi(\cdot\,|\,s)\big\rangle. \quad (\text{F.2})$$
Combining (F.1) and (F.2), we obtain the desired inequality, which concludes the proof of Lemma F.1.
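The proof of Lemma F.1 hinges on the exact identity $\mathrm{KL}(\pi^\dagger\|\pi) - \mathrm{KL}(\pi^\dagger\|\tilde\pi) = \langle \log(\tilde\pi/\pi), \pi^\dagger\rangle$ for $\tilde\pi \propto \exp(\beta^{-1}Q)\cdot\pi$, together with nonnegativity of the KL divergence. As a small numerical sanity check (a three-action example with arbitrary values, not taken from the paper):

```python
import math

def kl(p, q):
    """KL divergence between two finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

beta = 2.0
Q = [1.0, -0.5, 0.3]
pi = normalize([0.2, 0.5, 0.3])       # base policy at a fixed state
pi_dag = normalize([0.6, 0.1, 0.3])   # comparator policy pi^dagger

# tilde_pi(a) ∝ exp(Q(a)/beta) * pi(a), as in the definition before Lemma F.1
pi_t = normalize([math.exp(q / beta) * p for q, p in zip(Q, pi)])

# identity (F.1): KL(pi†‖pi) − KL(pi†‖pi~) = <log(pi~/pi), pi†>
lhs = kl(pi_dag, pi) - kl(pi_dag, pi_t)
rhs = sum(pd * math.log(pt / p) for pd, pt, p in zip(pi_dag, pi_t, pi))
assert abs(lhs - rhs) < 1e-12

# the bound (F.2): <log(pi~/pi), pi† − pi~> ≤ KL difference, since KL(pi~‖pi) ≥ 0
inner = sum((pd - pt) * math.log(pt / p) for pd, pt, p in zip(pi_dag, pi_t, pi))
assert inner <= lhs + 1e-12
```

The check passes for any choice of $\beta$, $Q$, and the two distributions, since both steps are exact algebra plus $\mathrm{KL}\ge 0$.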

G PROOFS OF PROPOSITIONS

G.1 PROOF OF PROPOSITION 3.1

The proof follows that of Proposition 3.1 in Liu et al. (2019). First, we write the update $\pi_{k+1} \leftarrow \mathop{\mathrm{argmax}}_\pi \mathbb{E}_{\nu_k}\big[\langle Q_{\omega_k}(s,\cdot), \pi(\cdot\,|\,s)\rangle - \beta\cdot\mathrm{KL}\big(\pi(\cdot\,|\,s)\,\|\,\pi_{\theta_k}(\cdot\,|\,s)\big)\big]$ as a constrained optimization problem,
$$\max_\pi\ \mathbb{E}_{\nu_k}\Big[\big\langle \pi(\cdot\,|\,s), Q_{\omega_k}(s,\cdot)\big\rangle - \beta\cdot\mathrm{KL}\big(\pi(\cdot\,|\,s)\,\big\|\,\pi_{\theta_k}(\cdot\,|\,s)\big)\Big] \quad \text{s.t.}\ \sum_{a\in\mathcal{A}}\pi(a\,|\,s) = 1,\ \text{for any } s\in\mathcal{S}.$$
We consider the Lagrangian of the above program,
$$\int_{\mathcal{S}}\big\langle \pi(\cdot\,|\,s), Q_{\omega_k}(s,\cdot)\big\rangle - \beta\cdot\mathrm{KL}\big(\pi(\cdot\,|\,s)\,\big\|\,\pi_{\theta_k}(\cdot\,|\,s)\big)\,\mathrm{d}\nu_k(s) + \int_{\mathcal{S}}\Big(\sum_{a\in\mathcal{A}}\pi(a\,|\,s) - 1\Big)\,\mathrm{d}\lambda(s),$$
where the dual parameter $\lambda(\cdot)$ is a function on $\mathcal{S}$. Now, by plugging in
$$\pi_{\theta_k}(a\,|\,s) = \frac{\exp\big(\tau_k^{-1} f_{\theta_k}(s,a)\big)}{\sum_{a'\in\mathcal{A}}\exp\big(\tau_k^{-1} f_{\theta_k}(s,a')\big)},$$
we have the following optimality condition,
$$Q_{\omega_k}(s,a) + \beta\tau_k^{-1} f_{\theta_k}(s,a) - \beta\cdot\log\sum_{a'\in\mathcal{A}}\exp\big(\tau_k^{-1} f_{\theta_k}(s,a')\big) - \beta\cdot\big(\log\pi(a\,|\,s) + 1\big) + \frac{\lambda(s)}{\nu_k(s)} = 0,$$
for any $(s,a)\in\mathcal{S}\times\mathcal{A}$. Note that $\log\sum_{a'\in\mathcal{A}}\exp(\tau_k^{-1} f_{\theta_k}(s,a'))$ is a function of $s$ only. Thus, we have $\pi_{k+1}(a\,|\,s) \propto \exp\big(\beta^{-1}Q_{\omega_k}(s,a) + \tau_k^{-1} f_{\theta_k}(s,a)\big)$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, which concludes the proof of Proposition 3.1.
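Proposition 3.1 says the KL-regularized update has the closed form $\pi_{k+1}(a\,|\,s) \propto \pi_{\theta_k}(a\,|\,s)\cdot\exp(\beta^{-1}Q_{\omega_k}(s,a))$. As a quick numerical sketch (a single state with four actions, arbitrary values), the closed form dominates random points of the probability simplex for the regularized objective:

```python
import math, random

def objective(p, q, p0, beta):
    """<p, q> - beta * KL(p || p0) for finite distributions."""
    return sum(pi * qi for pi, qi in zip(p, q)) - beta * sum(
        pi * math.log(pi / p0i) for pi, p0i in zip(p, p0))

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
beta = 1.5
Q = [0.4, -1.0, 2.0, 0.1]
p0 = normalize([random.random() + 0.1 for _ in Q])  # plays the role of pi_{theta_k}(.|s)

# closed-form maximizer from Proposition 3.1: p*(a) ∝ p0(a) * exp(Q(a)/beta)
p_star = normalize([p * math.exp(q / beta) for p, q in zip(p0, Q)])
best = objective(p_star, Q, p0, beta)

# the closed form beats random interior points of the simplex
for _ in range(1000):
    p = normalize([random.random() + 1e-3 for _ in Q])
    assert objective(p, Q, p0, beta) <= best + 1e-9
```

Strict concavity of the objective in $\pi$ makes the maximizer unique, so the comparison holds for every sampled point.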

G.2 PROOF OF PROPOSITION C.3

We define the local linearization of f θ as follows, fθ = f θ0 + (θ -θ 0 ) ∇ θ0 f θ . (G.1) Meanwhile, we denote by g n = f θ(n) -τ • (β -1 Q ω + τ -1 f θ ) • ∇ θ f θ(n) , g e n = E ρπ θ [g n ], ḡn = fθ(n) -τ • (β -1 Q ω + τ -1 f θ ) • ∇ θ f θ0 , ḡe n = E ρπ θ [ḡ n ], g * = f θ * -τ • (β -1 Q ω + τ -1 f θ ) • ∇ θ f θ * , g e * = E ρπ θ [g * ], ḡ * = fθ * -τ • (β -1 Q ω + τ -1 f θ ) • ∇ θ f θ0 , ḡe * = E ρπ θ [ḡ * ], (G.2) where θ * satisfies that θ * = Γ B(θ0,Ra) (θ * -α • ḡe * ). (G.3) By Algorithm 3, we know that θ(n + 1) = Γ B(θ0,Ra) (θ(n) -α • g n ). (G.4) By (G.3) and (G.4), we have E ρπ θ θ(n + 1) -θ * 2 2 | θ(n) = E ρπ θ Γ B(θ0,Ra) (θ(n) -α • g n ) -Γ B(θ0,Ra) (θ * -α • ḡe * ) 2 2 | θ(n) ≤ E ρπ θ (θ(n) -α • g n ) -(θ * -α • ḡe * ) 2 2 | θ(n) = θ(n) -θ * 2 2 + 2α • θ * -θ(n), g e n -ḡe * (i) +α 2 • E ρπ θ g n -ḡe * 2 2 | θ(n) (ii) , (G.5) where use the fact that Γ B(θ0,Ra) is a contraction in the first inequality. We upper bound term (i) and term (ii) on the RHS of (G.5) in the sequel. Upper Bound of Term (i). By Cauchy-Schwarz inequality, it holds that θ * -θ(n), g e n -ḡe * = θ * -θ(n), g e n -ḡe n + θ * -θ(n), ḡe n -ḡe * ≤ θ * -θ(n) 2 • g e n -ḡe n 2 + θ * -θ(n), ḡe n -ḡe * ≤ 2R a • g e n -ḡe n 2 + θ * -θ(n), ḡe n -ḡe * , (G.6 ) where we use the fact that θ(n), θ * ∈ B(θ 0 , R a ) in the last inequality. Further, by the definitions in (G.2), it holds that θ * -θ(n), ḡe n -ḡe * = E ρπ θ ( fθ(n) -fθ * ) • θ * -θ(n), ∇ θ f θ0 = E ρπ θ ( fθ(n) -fθ * ) • ( fθ * -fθ(n) ) = -E ρπ θ ( fθ(n) -fθ * ) 2 , (G.7 ) where we use (G.1) in the second equality. Combining (G.6) and (G.7), we obtain the following upper bound of term (i), θ * -θ(n), g e n -ḡe * ≤ 2R a • g e n -ḡe n 2 -E ρπ θ ( fθ(n) -fθ * ) 2 . (G.8) Upper Bound of Term (ii). We now upper bound term (ii) on the RHS of (G.5). 
It holds by Cauchy-Schwarz inequality that E ρπ θ g n -ḡe * 2 2 | θ(n) ≤ 2E ρπ θ g n -g e n 2 2 | θ(n) + 2 g e n -ḡe * 2 2 ≤ 2 E ρπ θ g n -g e n 2 2 | θ(n) (ii).a +4 g e n -ḡe n 2 2 (ii).b +4 ḡe n -ḡe * 2 2 (ii).c . (G.9) We upper bound term (ii).a, term (ii).b, and term (ii).c in the sequel. Upper Bound of Term (ii).a. Note that E ρπ θ g n -g e n 2 2 | θ(n) = E ρπ θ g n 2 2 -g e n 2 2 | θ(n) ≤ E ρπ θ g n 2 2 | θ(n) . (G.10) Meanwhile, by the definition of g n in (G.2), it holds that g n 2 2 = f θ(n) -τ • (β -1 Q ω + τ -1 f θ ) 2 • ∇ θ f θ(n) 2 2 . (G.11) We first upper bound f θ as follows, f 2 θ = x (Ha) bb x (Ha) = x (Ha) x (Ha) = x (Ha) 2 2 , where x (Ha) is the output of the H a -th layer of the DNN f θ . Further combining Lemma F.4, it holds with probability at least 1 -O(H a ) exp(-Ω(H -1 a m a )) that |f θ | ≤ 2. (G.12) Following from similar arguments, with probability at least 1-O(H a ) exp(-Ω(H -1 a m a )), we have |Q ω | ≤ 2, |f θ(n) | ≤ 2. (G.13) Combining Lemma F.2, (G.10), (G.11), (G.12), and (G.13), it holds with probability at least 1exp(-Ω(R 2/3 a m 2/3 a H a )) that E ρπ θ g n -g e n 2 2 | θ(n) = O(H 2 a ), (G.14) which establishes an upper bound of term (ii).a. Upper Bound of Term (ii).b. It holds that g e n -ḡe n 2 = E ρπ θ f θ(n) -τ • (β -1 Q ω + τ -1 f θ ) • ∇ θ f θ(n) -fθ(n) -τ • (β -1 Q ω + τ -1 f θ ) • ∇ θ f θ0 2 ≤ E ρπ θ f θ(n) ∇ θ f θ(n) -fθ(n) ∇ θ f θ0 2 + τ • E ρπ θ (β -1 Q ω + τ -1 f θ ) • (∇ θ f θ0 -∇ θ f θ(n) ) 2 ≤ E ρπ θ f θ(n) ∇ θ f θ0 -fθ(n) ∇ θ f θ0 2 + E ρπ θ f θ(n) ∇ θ f θ(n) -f θ(n) ∇ θ f θ0 2 (G.15) + E ρπ θ τ • (β -1 Q ω + τ -1 f θ ) • (∇ θ f θ0 -∇ θ f θ(n) ) 2 . We upper bound the terms on the RHS of in the sequel, respectively. For the term f θ(n) ∇ θ f θ0 -fθ(n) ∇ θ f θ0 2 on the RHS of (G.15), following from Lemmas F.2 and F.3, it holds with probability at least 1 -exp(-Ω(R  2/3 a m 2/3 a H a )) that f θ(n) ∇ θ f θ0 -fθ(n) ∇ θ f θ0 2 = O R 4/3 a m -1/6 a H 7/2 a (log m a ) 1/2 . 
(G.16) For the term f θ(n) ∇ θ f θ(n) -f θ(n) ∇ θ f θ0 f θ(n) ∇ θ f θ(n) -f θ(n) ∇ θ f θ0 2 = O R 1/3 a m -1/6 a H 5/2 a (log m a ) 1/2 . (G.17) For the term τ • (β -1 Q ω + τ -1 f θ ) • (∇ θ f θ0 -∇ θ f θ(n) ) 2 on the RHS of (G.15), we first upper bound τ • (β -1 Q ω + τ -1 f θ ) as follows, | τ • (β -1 Q ω + τ -1 f θ )| ≤ 2, where we use (G.12), (G.13), and the fact that τ -1 = β -1 + τ -1 . Further combining Lemma F.2, it holds with probability at least 1 -exp(-Ω(R 2/3 a m 2/3 a H a )) that τ • (β -1 Q ω + τ -1 f θ ) • (∇ θ f θ0 -∇ θ f θ(n) ) 2 = O R 1/3 a m -1/6 a H 5/2 a (log m a ) 1/2 . (G.18) Now, combining (G.15), (G.16), (G.17), and (G.18), it holds with probability at least 1exp(-Ω(R 2/3 a m 2/3 a H a )) that g e n -ḡe n 2 2 = O R 8/3 a m -1/3 a H 7 a log m a , (G.19) which establishes an upper bound of term (ii).b. Upper Bound of Term (ii).c. It holds that ḡe n -ḡe * 2 2 = E ρπ θ [( fθ(n) -fθ * )∇ θ f θ0 ] 2 2 ≤ E ρπ θ ( fθ(n) -fθ * ) 2 • ∇ θ f θ0 2 2 . Further combining Lemma F.2, it holds with probability at least 1 -exp(-Ω(R (G.20) which establishes an upper bound of term (ii).c. Now, combining (G.9), (G.14), (G.19), and (G.20), we have 2/3 a m 2/3 a H a )) that ḡe n -ḡe * 2 2 ≤ O(H 2 a ) • E ρπ θ ( fθ(n) -fθ * ) 2 , E ρπ θ g n -ḡe * 2 2 | θ(n) ≤ O R 8/3 a m -1/3 a H 7 a log m a + O(H 2 a ) • E ρπ θ ( fθ(n) -fθ * ) 2 , (G.21 ) which is an upper bound of term (ii) on the RHS of (G.5). By plugging the upper bound of term (i) in (G.8) and the upper bound of term (ii) in (G.21) into (G.5), combining (G.19), with probability at least 1 -exp(-Ω(R 2/3 a m 2/3 a H a )), we have E ρπ θ θ(n + 1) -θ * 2 2 | θ(n) ≤ θ(n) -θ * 2 2 + 2α • O R 7/3 a m -1/6 a H 7/2 a (log m a ) 1/2 -E ρπ θ ( fθ(n) -fθ * ) 2 (G.22) + α 2 • O R 8/3 a m -1/3 a H 7 a log m a + O(H 2 a ) • E ρπ θ ( fθ(n) -fθ * ) 2 . 
Rearranging terms in (G.22), it holds with probability at least 1 -exp(-Ω(R 2/3 a m 2/3 a H a )) that (2α -α 2 • O(H 2 a )) • E ρπ θ ( fθ(n) -fθ * ) 2 ≤ θ(n) -θ * 2 2 -E ρπ θ θ(n + 1) -θ * 2 2 | θ(n) + α • O R 8/3 a m -1/6 a H 7 a log m a . (G.23) By telescoping sum and using Jensen's inequality in (G.23), we have ρπ θ ( fθ -fθ * ) 2 ≤ 1 N a • Na-1 n=0 E ρπ θ ( fθ(n) -fθ * ) 2 ≤ 1/N a • 2α -α 2 • O(H 2 a ) -1 • θ 0 -θ * 2 2 + αN a • O(R 8/3 a m -1/6 a H 7 a log m a ) ≤ N -1/2 a • θ 0 -θ * 2 2 + O(R 8/3 a m -1/6 a H 7 a log m a ), where the last line comes from the choices that α = N -1/2 a and H a = O(N 1/4 a ). Further combining Lemma F.3 and using triangle inequality, we have E ρπ θ (fθ -fθ * ) 2 = O(R 2 a N -1/2 a + R 8/3 a m -1/6 a H 7 a log m a ). (G.24) By the definition of θ * in (G.3), we know that ḡe * , θ -θ * ≥ 0, for any θ ∈ B(θ 0 , R a ). (G.25) By plugging the definition of ḡe * into (G.25), we have E ρπ θ fθ * -τ • (β -1 Q ω + τ -1 f θ ), fθ † -fθ * ≥ 0, for any θ † ∈ B(θ 0 , R a ), which is equivalent to θ * = argmin θ † ∈B(θ0,Ra) E ρπ θ fθ † -τ • (β -1 Q ω + τ -1 f θ ) 2 . (G.26) Meanwhile, by the fact that θ 0 = ω 0 , we have τ • (β -1 Qω + τ -1 fθ ) = τ • β -1 • (Q ω0 + (ω -ω 0 ) ∇ ω Q ω0 ) + τ -1 • (f θ0 + (θ -θ 0 ) ∇ θ f θ0 ) = f θ0 + τ • (β -1 ω + τ -1 θ) -θ 0 ∇ θ f θ0 , where the second line comes from τ -1 = β -1 + τ -1 . Note that θ ∈ B(θ 0 , R a ), ω ∈ B(ω 0 , R c ), θ 0 = ω 0 , and R a = R c , we know that τ • (β -1 ω + τ -1 θ) ∈ B(θ 0 , R a ). Therefore, with probability at least 1 -exp(-Ω(R 2/3 a m 2/3 a H a )) we have E ρπ θ fθ * -τ • (β -1 Q ω + τ -1 f θ ) 2 ≤ E ρπ θ τ • (β -1 Qω + τ -1 fθ ) -τ • (β -1 Q ω + τ -1 f θ ) 2 ≤ τ 2 • β -2 • E ρπ θ [( Qω -Q ω ) 2 ] + τ 2 • τ -2 • E ρπ θ [( fθ -f θ ) 2 ] = O(R 8/3 a m -1/3 a H 5 a log m a ), (G.27) where the first inequality comes from (G.26), and the last inequality comes from Lemma F.3 and the fact that R c = R a , m c = m a , and H c = H a . 
Combining (G.24) and (G.27), by triangle inequality, we have E ρπ θ f θ (s, a) -τ • (β -1 Q ω (s, a) + τ -1 f θ (s, a)) 2 = O(R 2 a N -1/2 a + R 8/3 a m -1/6 a H 7 a log m a ), which finishes the proof of Proposition C.3.
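The actor analysis above is built around a projected stochastic update, $\theta(n+1) = \Gamma_{B(\theta_0,R_a)}(\theta(n) - \alpha\cdot g_n)$, and repeatedly uses that the projection $\Gamma_{B(\theta_0,R_a)}$ is a non-expansion. The following is a hedged numerical sketch of this mechanism on a toy quadratic objective (all names and values hypothetical, not the paper's neural setting):

```python
import math

def project_ball(theta, theta0, R):
    """Euclidean projection onto the ball B(theta0, R); a non-expansive map."""
    d = [t - t0 for t, t0 in zip(theta, theta0)]
    n = math.sqrt(sum(x * x for x in d))
    if n <= R:
        return theta
    return [t0 + x * R / n for t0, x in zip(theta0, d)]

# minimize f(theta) = ||theta - target||^2 over B(theta0, R) via projected GD
theta0 = [0.0, 0.0]
R = 1.0
target = [3.0, 4.0]  # outside the ball, so the optimum lies on the boundary
theta = list(theta0)
alpha = 0.1
for _ in range(500):
    grad = [2 * (t - s) for t, s in zip(theta, target)]
    theta = project_ball([t - alpha * g for t, g in zip(theta, grad)], theta0, R)

# constrained optimum is theta0 + R * target/||target|| = [0.6, 0.8]
opt = [0.6, 0.8]
assert all(abs(t - o) < 1e-6 for t, o in zip(theta, opt))
```

Because the gradient step is a contraction and the projection is non-expansive, the composed map is a contraction, which is exactly the structure exploited in (G.5) and (G.22).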

G.3 PROOF OF PROPOSITION C.4

The proof is similar to that of Proposition C.3 in §G.2. For the completeness of the paper, we present it here. We define the local linearization of Q ω as follows, Qω = Q ω0 + (ω -ω 0 ) ∇ ω0 Q ω . (G.28) We denote by g n = Q ω(n) (s 0 , a 0 ) -γ • Q ω (s 1 , a 1 ) -(1 -γ) • r 0 • ∇ ω Q ω(n) (s 0 , a 0 ), g e n = E π θ [g n ], ḡn = Qω(n) (s 0 , a 0 ) -γ • Q ω (s 1 , a 1 ) -(1 -γ) • r 0 • ∇ ω Q ω0 (s 0 , a 0 ), ḡe n = E π θ [ḡ n ], g * = Q ω * (s 0 , a 0 ) -γ • Q ω (s 1 , a 1 ) -(1 -γ) • r 0 • ∇ ω Q ω * (s 0 , a 0 ), g e * = E π θ [g * ], ḡ * = Qω * (s 0 , a 0 ) -γ • Q ω (s 1 , a 1 ) -(1 -γ) • r 0 • ∇ ω Q ω0 (s 0 , a 0 ), ḡe * = E π θ [ḡ * ], (G.29) where ω * satisfies ω * = Γ B(ω0,Rc) * -α • ḡe * ). (G.30) Here the expectation E π θ [•] is taken following (s 0 , a 0 ) ∼ ρ π θ (•), s 1 ∼ P (• | s 0 , a 0 ), a 1 ∼ π θ (• | s 1 ) , and r 0 = r(s 0 , a 0 ). By Algorithm 4, we know that ω(n + 1) = Γ B(ω0,Rc) (ω(n) -η • g n ). Note that E π θ ω(n + 1) -ω * 2 2 | ω(n) = E π θ Γ B(ω0,Rc) (ω(n) -η • g n ) -Γ B(ω0,Rc) (ω * -η • ḡe * ) 2 2 | ω(n) ≤ E π θ (ω(n) -η • g n ) -(ω * -η • ḡe * ) 2 2 | ω(n) = ω(n) -ω * 2 2 + 2η • ω * -ω(n), g e n -ḡe * (iii) +η 2 • E π θ g n -ḡe * 2 2 | ω(n) (iv) . (G.31) We upper bound term (iii) and term (iv) on the RHS of (G.31) in the sequel. Upper Bound of Term (iii). By Hölder's inequality, it holds that ω * -ω(n), g e n -ḡe * = ω * -ω(n), g e n -ḡe n + ω * -ω(n), ḡe n -ḡe * ≤ ω * -ω(n) 2 • g e n -ḡe n 2 + ω * -ω(n), ḡe n -ḡe * ≤ 2R c • g e n -ḡe n 2 + ω * -ω(n), ḡe n -ḡe * , (G.32) where we use the fact that ω(n), ω * ∈ B(ω 0 , R c ) in the last line. Further, by the definitions in (G.29), it holds that (G.33) where the second equality comes from (G.28), and the last equality comes from the fact that the expectation is only taken to the state-action pair (s 0 , a 0 ). 
Combining (G.32) and (G.33), we obtain the following upper bound of term (i), ω * -ω(n), ḡe n -ḡe * = E π θ ( Qω(n) (s 0 , a 0 ) -Qω * (s 0 , a 0 )) • ω * -ω(n), ∇ ω Q ω0 (s 0 , a 0 ) = E π θ ( Qω(n) (s 0 , a 0 ) -Qω * (s 0 , a 0 )) • ( Qω * (s 0 , a 0 ) -Qω(n) (s 0 , a 0 )) = -E π θ ( Qω(n) (s 0 , a 0 ) -Qω * (s 0 , a 0 )) 2 = -E ρπ θ ( Qω(n) -Qω * ) 2 , ω * -ω(n), g e n -ḡe * ≤ 2R c • g e n -ḡe n 2 -E ρπ θ ( Qω(n) -Qω * ) 2 . (G.34) Upper Bound of Term (iv). We now upper bound term (iv) on the RHS of (G.31). It holds by Cauchy-Schwarz inequality that E π θ g n -ḡe * 2 2 | ω(n) ≤ 2E π θ g n -g e n 2 2 | ω(n) + 2 g e n -ḡe * 2 2 ≤ 2 E π θ g n -g e n 2 2 | ω(n) (iv).a +4 g e n -ḡe n 2 2 (iv).b +4 ḡe n -ḡe * 2 2 (iv).c . (G.35) We upper bound term (iv).a, term (iv).b, and term (iv).c in the sequel. Upper Bound of Term (iv).a. We now upper bound term (iv).a on the RHS of (G.35). By expanding the square, we have E π θ g n -g e n 2 2 | ω(n) = E π θ g n 2 2 -g e n 2 2 | ω(n) ≤ E π θ g n 2 2 | ω(n) . (G.36) Meanwhile, by the definition of g n in (G.29), it holds that g n 2 2 = Q ω(n) (s 0 , a 0 ) -γ • Q ω (s 1 , a 1 ) -(1 -γ) • r 0 2 • ∇ ω Q ω(n) (s 0 , a 0 ) 2 2 . (G.37) We first upper bound Q ω as follows, Q 2 ω x (Hc) bb x (Hc) = x (Hc) x = x (Hc) 2 2 , where x (Hc) is the output of the H c -th layer of the DNN Q ω . Further combining Lemma F.4, it holds that |Q ω | ≤ 2. (G.38) Similarly, we have |Q ω(n) | ≤ 2. (G.39) Combining Lemma F.2, (G.36), (G.37), (G.38), and (G.39), we have E π θ g n -g e n 2 2 | ω(n) = O(H 2 c ). (G.40) Upper Bound of Term (iv).b. We now upper bound term (iv).b on the RHS of (G.35). 
It holds that g e n -ḡe n 2 = E π θ Q ω(n) (s 0 , a 0 ) -γ • Q ω (s 1 , a 1 ) -(1 -γ) • r 0 • ∇ ω Q ω(n) (s 0 , a 0 ) -Qω(n) (s 0 , a 0 ) -γ • Q ω (s 1 , a 1 ) -(1 -γ) • r 0 • ∇ ω Q ω0 (s 0 , a 0 ) 2 ≤ E π θ γ • Q ω (s 1 , a 1 ) + (1 -γ) • r t • (∇ ω Q ω0 (s 0 , a 0 ) -∇ ω Q ω(n) (s 0 , a 0 )) 2 + E ρπ θ Q ω(n) ∇ ω Q ω(n) -Qω(n) ∇ ω Q ω0 2 ≤ E π θ γ • Q ω (s 1 , a 1 ) + (1 -γ) • r 0 • (∇ ω Q ω0 (s 0 , a 0 ) -∇ ω Q ω(n) (s 0 , a 0 )) 2 (G.41) + E ρπ θ (Q ω(n) -Qω(n) ) • ∇ ω Q ω0 2 + E ρπ θ Q ω(n) • (∇ ω Q ω(n) -∇ ω Q ω0 ) 2 . We now bound the three terms on the RHS of (G.41) in the sequel, respectively. For the term E ρπ θ [ (Q ω(n) -Qω(n) ) • ∇ ω Q ω0 2 ] on the RHS of (G.41), following from Lemmas F.2 and F.3, it holds with probability at least 1 -exp(-Ω(R 2/3 c m 2/3 c H c )) that E ρπ θ (Q ω(n) -Qω(n) ) • ∇ ω Q ω0 2 = O R 4/3 c m -1/6 c H 7/2 c (log m c ) 1/2 . (G.42) For the term E ρπ θ [ Q ω(n) • (∇ ω Q ω(n) -∇ ω Q ω0 ) 2 ] on the RHS of (G.41), following from (G.39) and Lemma F.2, with probability at least 1 -exp(-Ω(R 2/3 c m 2/3 c H c )), we have E ρπ θ Q ω(n) • (∇ ω Q ω(n) -∇ ω Q ω0 ) 2 = O R 1/3 c m -1/6 c H 5/2 c (log m c ) 1/2 . (G.43) For the term E π θ [ (γ • Q ω (s 1 , a 1 ) + (1 -γ) • r 0 ) • (∇ ω Q ω0 (s 0 , a 0 ) -∇ ω Q ω(n) (s 0 , a 0 )) 2 ] on the RHS of (G.41), we first upper bound |γ • Q ω (s 1 , a 1 ) + (1 -γ) • r 0 | as follows, |γ • Q ω (s 1 , a 1 ) + (1 -γ) • r 0 | ≤ 2 + r max , where we use (G.38) and the fact that |r(s, a)| ≤ r max for any (s, a) ∈ S × A. Further combining Lemma F.2, with probability at least 1 -exp(-Ω(R 2/3 c m 2/3 c H c )), we have E π θ γ • Q ω (s 1 , a 1 ) + (1 -γ) • r 0 • (∇ ω Q ω0 (s 0 , a 0 ) -∇ ω Q ω(n) (s 0 , a 0 )) 2 = O R 1/3 c m -1/6 c H 5/2 c (log m c ) 1/2 . (G.44) Now, combining (G.41), (G.42), (G.43), and (G.44), it holds with probability at least 1exp(-Ω(R 2/3 c m 2/3 c H c )) that g e n -ḡe n 2 2 = O(R 8/3 c m -1/3 c H 7 c log m c ). (G.45) Upper Bound of Term (iv).c. 
We now upper bound term (iv).c on the RHS of (G.35). It holds that ḡe n -ḡe * 2 2 = E ρπ θ [( Qω(n) -Qω * )∇ ω Q ω0 ] 2 2 ≤ E ρπ θ ( Qω(n) -Qω * ) 2 • ∇ ω Q ω0 2 2 . Further combining Lemma F.2, it holds that E π θ n -ḡe * 2 | ω(n) ≤ O(H 2 c ) • E ρπ θ ( Qω(n) -Qω * ) 2 . (G.46) Combining (G.35), (G.40), (G.45), and (G.46), we obtain the following upper bound for term (iv) on the RHS of (G.31), E π θ g n -ḡe * 2 2 | ω(n) ≤ O(R 8/3 c m -1/3 c H 7 c log m c ) + O(H 2 c ) • E ρπ θ ( Qω(n) -Qω * ) 2 . (G.47) We continue upper bounding (G.31). By plugging (G.34) and (G.47) into (G.31), it holds with probability at least 1 -exp(-Ω(R 2/3 c m 2/3 c H c )) that E π θ ω(n + 1) -ω * 2 2 | ω(n) ≤ ω(n) -ω * 2 2 + 2η • O R 7/3 c m -1/6 c H 7/2 c (log m c ) 1/2 -E ρπ θ ( Qω(n) -Qω * ) 2 + η 2 • O R 8/3 c m -1/3 c H 7 c log m c + O(H 2 c ) • E ρπ θ ( Qω(n) -Qω * ) 2 . (G.48) Rearranging terms in (G.48), it holds with probability at least 1 -exp(-Ω(R 2/3 c m 2/3 c H c )) that (2η -η 2 • O(H 2 c )) • E ρπ θ ( Qω(n) -Qω * ) 2 ≤ ω(n) -ω * 2 2 -E ρπ θ [ ω(n + 1) -ω * 2 2 | ω(n)] + η • O(R 8/3 c m -1/3 c H 7 c log m c ). (G.49) By telescoping the sum and using Jensen's inequality in (G.49), we have E ρπ θ ( Qω -Qω * ) 2 ≤ 1 N c • Nc-1 n=0 E ρπ θ ( Qω(n) -Qω * ) 2 ≤ 1/N c • 2η -η 2 • O(H 2 c ) -1 • ω 0 -ω * 2 2 + ηN c • O(R 8/3 c m -1/6 c H 7 c log m c ) ≤ N -1/2 c • θ 0 -θ * 2 2 + O(R 8/3 c m -1/6 c H 7 c log m c ), where the last line comes from the choices that η = N -1/2 c and H c = O(N 1/4 c ). Further combining Lemma F.3 and using triangle inequality, we have E ρπ θ (Q ω -Qω * ) 2 = O(R 2 c N -1/2 c + R 8/3 c m -1/6 c H 7 c log m c ). (G.50) To establish the upper bound of E ρπ θ [( Qω * -Q) 2 ], we upper bound E ρπ θ [( Qω * -Q) 2 ] in the sequel. By the definition of ω * in (G.30), following a similar argument to derive (G.26), we have ω * = argmin ω † ∈B(ω0,Rc) E ρπ θ ( Qω † (s 0 , a 0 ) -Q(s 0 , a 0 )) 2 . 
(G.51) From the fact that Q ∈ U(m c , H c , R c ) by Assumption C.2, we know that Q = Q ω for some ω ∈ B(ω 0 , R c ). Therefore, by (G.51), with probability at least 1 -exp(-Ω(R 2/3 c m 2/3 c H c )), we have E ρπ θ ( Qω * -Q) 2 ≤ E ρπ θ ( Q ω -Q) 2 = O(R 8/3 c m -1/3 c H 5 c log m c ), (G.52) where we use Lemma F.3 in the last inequality. Now, combining (G.50) and (G.52), by triangle inequality, with probability at least 1 -exp(-Ω(R 2/3 c m 2/3 c H c )), we have E ρπ θ (Q ω -Q) 2 ≤ 2E ρπ θ (Q ω -Qω * ) 2 + 2E ρπ θ ( Qω * -Q) 2 = O(R 2 c N -1/2 c + R 8/3 c m -1/6 c H 7 c log m c ), which concludes the proof of Proposition C.4.

H PROOFS OF LEMMAS

H.1 PROOF OF LEMMA D.1

We denote by Q = T π θ k Q ω k . In the sequel, we upper bound E ρ k+1 [(Q ω k+1 -Q ωk+1 ) 2 ], where ωk+1 = Γ R ( ω k+1 ) and ω k+1 is defined in (3.4). Note that by the fact that ϕ(s, a) 2 ≤ 1 uniformly, it suffices to upper bound ω k+1 -ωk+1 2 . By the definitions of ω k+1 and ωk+1 in (3.5) and (3.4), respectively, we have ω k+1 -ωk+1 2 ≤ Φ v -Φv 2 ≤ Φ 2 • v -v 2 + Φ -Φ 2 • v 2 . (H.1) Here, we use the fact that the projection Γ R (•) is a contraction in the first inequality, and the triangle inequality in the second inequality. For notational convenience, we denote Φ, Φ, v, and v in (H.1) as follows, Φ = 1 N N =1 ϕ(s ,1 , a ,1 )ϕ(s ,1 , a ,1 ) -1 , Φ = E ρ k+1 [ϕ(s, a)ϕ(s, a) ] -1 , v = 1 N N =1 (1 -γ)r ,2 + γQ ω k (s ,2 , a ,2 ) • ϕ(s ,2 , a ,2 ), v = E ρ k+1 (1 -γ)r + γP π θ k+1 Q ω k (s, a) • ϕ(s, a) . By the fact that ϕ(s, a) 2 ≤ 1, |r(s, a)| ≤ r max , and ω k 2 ≤ R, we have Φ 2 ≤ 1/σ * , v 2 ≤ r max + R. (H.2) Now, following from the matrix Bernstein inequality (Tropp, 2015) and Assumption 4.3, we have E Φ -Φ 2 ≤ 2 √ N (σ * ) 2 • log(N + d), (H.3) where σ * is defined in Assumption 4.3. Similarly, we have E v -v 2 ≤ 2(r max + R)/ √ N • log(N + d). (H.4) Now, combining (H.1), (H.2), (H.3), and (H.4), we have E ω k+1 -ωk+1 2 ≤ 4(r max + R) √ N (σ * ) 2 • log(N + d).
Therefore, since $\|\phi(s,a)\|_2 \le 1$, it holds that
$$\mathbb{E}\big[(Q_{\omega_{k+1}} - Q_{\bar\omega_{k+1}})^2\big] \le \frac{16(r_{\max}+R)^2}{N(\sigma^*)^4}\cdot\log^2(N+d). \quad (\text{H.5})$$
Meanwhile, by Assumption 4.2 and the definition of $\bar\omega_{k+1}$, we have $\bar Q = Q_{\bar\omega_{k+1}}$. (H.6) Combining (H.5) and (H.6), we have
$$\mathbb{E}\big[(Q_{\omega_{k+1}} - \bar Q)^2\big] \le \frac{16(r_{\max}+R)^2}{N(\sigma^*)^4}\cdot\log^2(N+d),$$
which concludes the proof of Lemma D.1.
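Lemma D.1 bounds the critic error by showing the empirical quantities $\hat\Phi$, $\hat v$ concentrate around their population counterparts at roughly the $1/\sqrt{N}$ rate, so the plug-in estimate $\hat\Phi^{-1}\hat v$ inherits that rate. A minimal numerical sketch (a scalar feature and a hypothetical linear sampling model, not the paper's MDP setting) of this plug-in concentration:

```python
import random

random.seed(1)

def sample_pair():
    # hypothetical 1-d feature phi and noisy linear target y = 2*phi + noise
    phi = random.uniform(0.5, 1.5)
    y = 2.0 * phi + random.gauss(0, 0.5)
    return phi, y

def lstd_estimate(N):
    """Plug-in estimate: hat_omega = hat_Phi^{-1} hat_v (scalar case)."""
    Phi = v = 0.0
    for _ in range(N):
        phi, y = sample_pair()
        Phi += phi * phi / N
        v += phi * y / N
    return v / Phi

def avg_err(N, trials=20):
    return sum(abs(lstd_estimate(N) - 2.0) for _ in range(trials)) / trials

e_small = avg_err(100)
e_large = avg_err(10000)
# 100x more samples should shrink the average error (roughly by 10x, per 1/sqrt(N))
assert e_large < e_small
```

The comparison is averaged over independent trials so the $1/\sqrt{N}$ shrinkage dominates the sampling noise.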

H.2 PROOF OF LEMMA E.1

Following from the definitions of $P^\pi$ and $P$ in (2.3), we have
$$A_{1,k}(s,a) = \gamma\big(P^{\pi^*} - P^{\pi_{\theta_{k+1}}}\big)Q_{\omega_k}(s,a) = \gamma P\big\langle Q_{\omega_k},\ \pi^* - \pi_{\theta_{k+1}}\big\rangle(s,a). \quad (\text{H.7})$$
By invoking Lemma F.1 and combining (H.7), it holds for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ that
$$A_{1,k}(s,a) = \gamma\big(P^{\pi^*} - P^{\pi_{\theta_{k+1}}}\big)Q_{\omega_k}(s,a) \le \gamma\beta\cdot\big[P(\vartheta_k + a_{k+1})\big](s,a),$$
where $\vartheta_k$ and $a_{k+1}$ are defined in (E.4) and (E.5) of Lemma E.1, respectively. We conclude the proof of Lemma E.1.

H.3 PROOF OF LEMMA E.2

By the definition that $Q^*$ is the action-value function of the optimal policy $\pi^*$, we know that $Q^*(s,a) \ge Q^\pi(s,a)$ for any policy $\pi$ and state-action pair $(s,a)\in\mathcal{S}\times\mathcal{A}$. Therefore, for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, we have
$$A_{2,k}(s,a) = \big[\gamma P^{\pi^*}\big(Q^{\pi_{\theta_{k+1}}} - Q_{\omega_k}\big)\big](s,a) \le \big[\gamma P^{\pi^*}\big(Q^* - Q_{\omega_k}\big)\big](s,a). \quad (\text{H.8})$$
In the sequel, we upper bound $Q^*(s,a) - Q_{\omega_k}(s,a)$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$. We define $\bar Q_{k+1} = (1-\gamma)\cdot r + \gamma\cdot P^{\pi_{\theta_{k+1}}} Q_{\omega_k}$. By its definition, we know that $\bar Q_{k+1} = T^{\pi_{\theta_{k+1}}} Q_{\omega_k}$. It holds for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ that
$$\begin{aligned} Q^*(s,a) - Q_{\omega_{k+1}}(s,a) &= Q^*(s,a) - \bar Q_{k+1}(s,a) + \bar Q_{k+1}(s,a) - Q_{\omega_{k+1}}(s,a)\\ &= \big[(1-\gamma)\cdot r + \gamma\cdot P^{\pi^*} Q^*\big](s,a) - \big[(1-\gamma)\cdot r + \gamma\cdot P^{\pi_{\theta_{k+1}}} Q_{\omega_k}\big](s,a) + c_{k+1}(s,a)\\ &= \gamma\cdot\big[P^{\pi^*}Q^* - P^{\pi_{\theta_{k+1}}}Q_{\omega_k}\big](s,a) + c_{k+1}(s,a)\\ &= \gamma\cdot\big[P^{\pi^*}Q^* - P^{\pi^*}Q_{\omega_k}\big](s,a) + \gamma\cdot\big[P^{\pi^*}Q_{\omega_k} - P^{\pi_{\theta_{k+1}}}Q_{\omega_k}\big](s,a) + c_{k+1}(s,a)\\ &= \gamma\cdot\big[P^{\pi^*}\big(Q^* - Q_{\omega_k}\big)\big](s,a) + A_{1,k}(s,a) + c_{k+1}(s,a)\\ &\le \gamma\cdot\big[P^{\pi^*}\big(Q^* - Q_{\omega_k}\big)\big](s,a) + \gamma\beta\cdot\big[P(\vartheta_k + a_{k+1})\big](s,a) + c_{k+1}(s,a), \end{aligned} \quad (\text{H.9})$$
where $c_{k+1}$ and $A_{1,k}$ are defined in (E.6) and (E.3), respectively. Here, we use Lemma E.1 to upper bound $A_{1,k}$ in the last line. We remark that (H.9) upper bounds $Q^* - Q_{\omega_{k+1}}$ in terms of $Q^* - Q_{\omega_k}$. By recursively applying a similar argument as in (H.9), we have
$$Q^*(s,a) - Q_{\omega_k}(s,a) \le \big[(\gamma P^{\pi^*})^k\big(Q^* - Q_{\omega_0}\big)\big](s,a) + \gamma\beta\cdot\sum_{i=0}^{k-1}\big[(\gamma P^{\pi^*})^{k-i-1} P(\vartheta_i + a_{i+1})\big](s,a) + \sum_{i=0}^{k-1}\big[(\gamma P^{\pi^*})^{k-i-1} c_{i+1}\big](s,a). \quad (\text{H.10})$$
Combining (H.8) and (H.10), it holds for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ that
$$A_{2,k}(s,a) \le \big[\gamma P^{\pi^*}\big(Q^* - Q_{\omega_k}\big)\big](s,a) \le \big[(\gamma P^{\pi^*})^{k+1}\big(Q^* - Q_{\omega_0}\big)\big](s,a) + \gamma\beta\cdot\sum_{i=0}^{k-1}\big[(\gamma P^{\pi^*})^{k-i} P(\vartheta_i + a_{i+1})\big](s,a) + \sum_{i=0}^{k-1}\big[(\gamma P^{\pi^*})^{k-i} c_{i+1}\big](s,a),$$
where $\vartheta_i$, $a_{i+1}$, and $c_{i+1}$ are defined in (E.4) of Lemma E.1, (E.5) of Lemma E.1, and (E.6) of Lemma E.2, respectively. We conclude the proof of Lemma E.2.
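The recursion (H.9) has the elementary form $x_{k+1} \le \gamma x_k + \varepsilon$, whose unrolled form $x_k \le \gamma^k x_0 + \varepsilon\sum_{i<k}\gamma^i \le \gamma^k x_0 + \varepsilon/(1-\gamma)$ is exactly the pattern behind (H.10). A quick numerical check of this scalar bound (values arbitrary):

```python
gamma, eps, x0 = 0.9, 0.05, 10.0
x = x0
for k in range(1, 51):
    # worst case of the recursion x_{k+1} <= gamma * x_k + eps
    x = gamma * x + eps
    # unrolled bound: x_k <= gamma^k x_0 + eps * (1 + gamma + ...) <= gamma^k x_0 + eps/(1-gamma)
    bound = gamma ** k * x0 + eps / (1 - gamma)
    assert x <= bound + 1e-12
```

The geometric contraction kills the initialization term while the per-step error accumulates to at most $\varepsilon/(1-\gamma)$; in the operator setting of (H.10), $\gamma P^{\pi^*}$ plays the role of the contraction factor.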

H.4 PROOF OF LEMMA E.3

Note that for any (s, a) ∈ S × A, we have A 3,k (s, a) = [T π θ k+1 Q ω k -Q π θ k+1 ](s, a) = (1 -γ) • r + γP π θ k+1 Q ω k -Q π θ k+1 (s, a) = (1 -γ) • r + γP π θ k+1 Q ω k - ∞ t=0 (1 -γ)(γP π θ k+1 ) t r (s, a) = ∞ t=1 (γP π θ k+1 ) t Q ω k -(γP π θ k+1 ) t+1 Q ω k - ∞ t=1 (1 -γ)(γP π θ k+1 ) t r (s, a) = ∞ t=1 (γP π θ k+1 ) t Q ω k -γP π θ k+1 Q ω k -(1 -γ) • r (s, a) = ∞ t=1 (γP π θ k+1 ) t Q ω k -T π θ k+1 Q ω k (s, a) = ∞ t=1 (γP π θ k+1 ) t e k+1 (s, a) = γP π θ k+1 (I -γP π θ k+1 ) -1 e k+1 (s, a), where the term e k+1 in the last line is defined in (E.7). We conclude the proof of Lemma E.3.

H.5 PROOF OF LEMMA E.4

We invoke Lemma F.1 in §F, which gives (H.12). By the definition of e k+1 in (E.7), we have e k+1 (s, a) = Q ω k -γ • P π θ k+1 Q ω k -(1 -γ) • r (s, a) ≤ Q ω k -γ • P π θ k Q ω k -(1 -γ) • r (s, a) + βγ • [P b k+1 ](s, a) (H.13) = Q k -γ • P π θ k Q k -(1 -γ) • r (s, a) + βγP b k+1 -(I -γP π θ k ) c k (s, a), where we use (H.12) in the first inequality, and Q k = (1 -γ) • r + γ • P π θ k Q ω k-1 . (H.14) For the first term on the RHS of (H.13), by (H.14), it holds that Q k -γ • P π θ k Q k -(1 -γ) • r = (1 -γ) • r + γ • P π θ k Q ω k-1 -γ(1 -γ) • P π θ k r -(γP π θ k ) 2 Q ω k-1 -(1 -γ) • r = γ • P π θ k Q ω k-1 -γP π θ k Q ω k-1 -(1 -γ)r = γ • P π θ k e k . (H.15)

For M 3 , we have M 3 ≤ (1 -γ) -2 • log |A| • K 1/2 , (H.21) where we use β = K 1/2 . We see that (H.17), (H.19), and (H.21) upper bound M 1 , M 2 , and M 3 , respectively. We conclude the proof of Lemma E.5.

H.7 PROOF OF LEMMA E.6

For M 4 , by changing the index of summation, we have (H.22), where we expand (I -γP π * ) -1 into an infinite sum in the first equality. Further, by changing the measure of the expectation from ρ to ρ * on the RHS of (H.22), we have (H.23), where c(t) is defined in Assumption 4.1.
Further, by changing the index of on the RHS of (H.23), combining (H.22), we have |M 4 | = E ρ K k=0 k i=0 ∞ j=0 (γP π * ) k-i+j c i+1 = E ρ K k=0 k i=0 ∞ t=k-i (γP π * ) t c i+1 ≤ K k=0 k i=0 ∞ t=k-i E ρ (γP π * ) t c i+1 , K k=0 k i=0 ∞ t=k-i E ρ (γP π * ) t c i+1 ≤ K k=0 k i=0 ∞ t=k-i γ t c(t) • E ρ * [| c i+1 |], 4 | ≤ K k=0 ∞ t=0 k i=max{0,k-t} γ t c(t) • ε Q ≤ K k=0 ∞ t=0 2tγ t c(t) • ε Q ≤ γ K k=0 2C ρ,ρ * • ε Q ≤ 3KC ρ,ρ * • ε Q , (H.24) where ε Q = max i E ρ * [| c i+1 |], and C ρ,ρ * is defined in Assumption 4.1. Now, for M 5 , by a similar argument as in the derivation of (H.24), we have M 5 ≤ ∞ i=0 K k=0 ∞ j=0 k =1 γ i+j+k-+1 c(i + j + k -+ 1) • ε Q = ∞ i=0 K k=0 ∞ j=0 i+j+k t=i+j+1 γ t c(t) • ε Q ≤ K k=0 ∞ t=1 t 2 γ t c(t) • ε Q ≤ KC ρ,ρ * • ε Q . (H.25) We see that (H.24) and (H.25) upper bound M 4 and M 5 , respectively. We conclude the proof of Lemma E.6. Thus, we have Thus, it remains to upper bound the right-hand side of (H.26). We have log(π θ k+1 (• | s)/π θ k (• | s)) -β -1 Q ω k (s, •), π * (• | s) -π θ k+1 (• | s) = τ -1 k+1 f θ k+1 (s, •) -(β -1 Q ω k (s, •) + τ -1 k f θ k (s, •)), π * (• | s) -π θ k (• | s) , τ -1 k+1 f θ k+1 (s, •) -(β -1 k Q ω k (s, •) + τ -1 k f θ k (s, •)), π * (• | s) -π θ k+1 (• | s) (H.27) = τ -1 k+1 f θ k+1 (s, •) -(β -1 k Q ω k (s, •) + τ -1 k f θ k (s, •)), π θ k (• | s)• π * (• | s) π θ k (• | s) - π θ k+1 (• | s) π θ k (• | s) . 
Taking expectation with respect to s ∼ ν * on both sides of (H.27) and using the Cauchy-Schwarz inequality, we obtain E ν * τ -1 k+1 f θ k+1 (s, •) -(β -1 k Q ω k (s, •) + τ -1 k f θ k (s, •)), π * (• | s) -π θ k+1 (• | s) = S τ -1 k+1 f θ k+1 (s, •) -(β -1 k Q ω k (s, •) + τ -1 k f θ k (s, •)), π θ k (• | s) • ν k (s)• π * (• | s) π θ k (• | s) - π θ k+1 (• | s) π θ k (• | s) • ν * (s) ν k (s) ds = S×A τ -1 k+1 f θ k+1 (s, a) -(β -1 k Q ω k (s, a) + τ -1 k f θ k (s, a)) • ρ * (a | s) ρ k (a | s) - π θ k+1 (a | s) • ν * (s) ρ k (a | s) dρ k (s, a) ≤ E ρ k τ -1 k+1 f θ k+1 (s, a) -(β -1 k Q ω k (s, a) + τ -1 k f θ k (s, a)) 2 1/2 • E ρ k dρ * dρ k - d(π θ k+1 ν * ) dρ k 2 1/2 ≤ √ 2τ -1 k+1 • ε k+1,f • (φ * k + ψ * k ), where in the last inequality we use the error bound in (E.20) and the definitions of φ * k and ψ * k in Assumption C.1. This finishes the proof of the first inequality. Part 2. The proof of the second inequality follows from a similar argument. We have log(π θ k+1 (• | s)/π θ k (• | s)) -β -1 Q ω k (s, •), π θ k (• | s) -π θ k+1 (• | s) = τ -1 k+1 f θ k+1 (s, •) -(β -1 Q ω k (s, •) + τ -1 k f θ k (s, •)), π θ k (• | s) -π θ k+1 (• | s) . (H.28) Thus, it remains to upper bound the right-hand side of (H.28). We have τ -1 k+1 f θ k+1 (s, •) -(β -1 k Q ω k (s, •) + τ -1 k f θ k (s, •)), π θ k (• | s) -π θ k+1 (• | s) = τ -1 k+1 f θ k+1 (s, •) -(β -1 k Q ω k (s, •) + τ -1 k f θ k (s, •)), π θ k (• | s) • 1 - π θ k+1 (• | s) π θ k (• | s) . (H.29)
Taking expectation with respect to s ∼ ν * on both sides of (H.29) and using the Cauchy-Schwarz inequality, we obtain E ν * τ -1 k+1 f θ k+1 (s, •) -(β -1 k Q ω k (s, •) + τ -1 k f θ k (s, •)), π θ k (• | s) -π θ k+1 (• | s) = S τ -1 k+1 f θ k+1 (s, •) -(β -1 k Q ω k (s, •) + τ -1 k f θ k (s, •)), π θ k (• | s) • ν k (s)• 1 - π θ k+1 (• | s) π θ k (• | s) • ν * (s) ν k (s) ds = S×A τ -1 k+1 f θ k+1 (s, a) -(β -1 k Q ω k (s, a) + τ -1 k f θ k (s, a)) • 1 - π θ k+1 (a | s) • ν * (s) ρ k (a | s) dρ k (s, a) ≤ E ρ k τ -1 k+1 f θ k+1 (s, a) -(β -1 k Q ω k (s, a) + τ -1 k f θ k (s, a)) 2 1/2 • E ρ k 1 - d(π θ k+1 ν * ) dρ k 2 1/2 ≤ √ 2τ -1 k+1 • ε k+1,f • (1 + ψ * k ), where in the last inequality we use the error bound in (E.20) and the definition of ψ * k in Assumption C.1. This finishes the proof of the second inequality, which concludes the proof of Lemma E.7.
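Both parts of Lemma E.7 use the same two steps: a change of measure into $\rho_k$, followed by the Cauchy-Schwarz inequality. A minimal finite-state sketch of these two steps (distributions and test function arbitrary):

```python
import math

# two distributions over a small finite space
nu_k = [0.5, 0.3, 0.2]
nu_star = [0.2, 0.2, 0.6]
g = [1.0, -2.0, 0.5]  # arbitrary test function

# change of measure: E_{nu*}[g] = E_{nu_k}[g * (nu*/nu_k)]
lhs = sum(p * x for p, x in zip(nu_star, g))
rhs = sum(p * x * (ps / p) for p, ps, x in zip(nu_k, nu_star, g))
assert abs(lhs - rhs) < 1e-12

# Cauchy-Schwarz under nu_k: |E[g * w]| <= E[g^2]^{1/2} * E[w^2]^{1/2},
# with w the density ratio, as in the last two inequalities of the proof
w = [ps / p for p, ps in zip(nu_k, nu_star)]
lhs2 = abs(sum(p * x * wi for p, x, wi in zip(nu_k, g, w)))
rhs2 = math.sqrt(sum(p * x * x for p, x in zip(nu_k, g))) * math.sqrt(
    sum(p * wi * wi for p, wi in zip(nu_k, w)))
assert lhs2 <= rhs2 + 1e-12
```

In the lemma, the first factor is the actor error $\varepsilon_{k+1,f}$ and the second factor is the density-ratio moment controlled by $\phi_k^*$ and $\psi_k^*$ in Assumption C.1.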



Propositions C.3 and C.4 characterize the errors that arise from the actor and critic updates in Algorithm 2, respectively. In particular, if the widths m a and m c of the DNNs f θ and Q ω are sufficiently large, the errors characterized in Propositions C.3 and C.4 decay to zero at the rates of O(N a -1/2 ) and O(N c -1/2 ), respectively. Propositions C.3 and C.4 act as the key ingredients in upper bounding the regret of the deep neural actor-critic method.

We bound the second term on the RHS of (G.15) by (G.13) and Lemma F.2, which hold with probability at least $1 - \exp(-\Omega(R^{2/3} m^{2/3} H))$ over the random initialization.

\[
\beta^{-1}\cdot\bigl\langle Q_{\omega_k}(s,\cdot),\, \pi_{\theta_k}(\cdot\,|\,s) - \pi_{\theta_{k+1}}(\cdot\,|\,s)\bigr\rangle - \mathrm{KL}\bigl(\pi_{\theta_k}(\cdot\,|\,s)\,\big\|\,\pi_{\theta_{k+1}}(\cdot\,|\,s)\bigr) \le \bigl\langle \log\bigl(\pi_{\theta_{k+1}}(\cdot\,|\,s)/\pi_{\theta_k}(\cdot\,|\,s)\bigr) - \beta^{-1}\cdot Q_{\omega_k}(s,\cdot),\, \pi_{\theta_k}(\cdot\,|\,s) - \pi_{\theta_{k+1}}(\cdot\,|\,s)\bigr\rangle = b_{k+1}(s). \tag{H.11}
\]
Combining (H.11) and the definition of $P^{\pi}$ in (2.3), we have
\[
\bigl[P^{\pi_{\theta_k}} Q_{\omega_k} - P^{\pi_{\theta_{k+1}}} Q_{\omega_k}\bigr](s,a) \le \beta\cdot[P b_{k+1}](s).
\]

Combining (H.13) and (H.15), we have for any $(s,a) \in \mathcal S \times \mathcal A$ that
\[
e_{k+1}(s,a) \le \bigl[\gamma P^{\pi_{\theta_k}} e_k\bigr](s,a) + \bigl[\beta\gamma P b_{k+1} - (I - \gamma P^{\pi_{\theta_k}})\, c_k\bigr](s,a). \tag{H.16}
\]
By telescoping (H.16), it holds that
\[
e_{k+1}(s,a) \le \Bigl[\prod_{s'=1}^{k} \gamma P^{\pi_{\theta_{s'}}} e_1\Bigr](s,a) + \sum_{i=1}^{k} \gamma^{k-i}\Bigl[\prod_{s'=i+1}^{k} P^{\pi_{\theta_{s'}}}\Bigr]\bigl(\beta\gamma P b_{i+1} - (I - \gamma P^{\pi_{\theta_i}})\, c_i\bigr)(s,a).
\]
This finishes the proof of the lemma.

H.6 PROOF OF LEMMA E.5

Note that $\|\omega_0\| \le R$ and $|r(s,a)| \le r_{\max}$ for any $(s,a) \in \mathcal S \times \mathcal A$, which implies that $|Q_{\omega_0}(s,a)| \le R$ and $|Q^*(s,a)| \le r_{\max}$ by their definitions. Thus, for $M_1$, we have
\[
|M_1| \le \mathbb{E}_\rho\bigl[\bigl|(I - \gamma P^{\pi^*})^{-1} e_{k+1}\bigr|\bigr] \le 4(1-\gamma)^{-2}\cdot(r_{\max} + R). \tag{H.17}
\]
For $M_2$, by the definition of $e_1$ in (E.7), $\|\omega_k\| \le R$, $\|\phi(s,a)\| \le 1$, and $|r(s,a)| \le r_{\max}$, we have
\[
|e_1(s,a)| = \bigl|\bigl[Q_{\omega_k} - T^{\pi_{\theta_{k+1}}} Q_{\omega_k}\bigr](s,a)\bigr| = \bigl|\omega_k^\top \phi(s,a) - \gamma\cdot\omega_k^\top\bigl[P^{\pi_{\theta_{k+1}}}\phi\bigr](s,a) - (1-\gamma)\cdot r(s,a)\bigr| \le 2R + r_{\max} \tag{H.18}
\]
for any $(s,a) \in \mathcal S \times \mathcal A$. Therefore, we have
\[
|M_2| \le (1-\gamma)^{-3}\cdot(2R + r_{\max}). \tag{H.19}
\]
Meanwhile, by the initialization $\tau_0 = \infty$ in Algorithm 1, the initial policy $\pi_{\theta_0}(\cdot\,|\,s)$ is the uniform distribution over $\mathcal A$. Therefore, it holds for any $s \in \mathcal S$ that
\begin{align*}
\mathrm{KL}\bigl(\pi^*(\cdot\,|\,s)\,\big\|\,\pi_{\theta_0}(\cdot\,|\,s)\bigr) &= \int_{\mathcal A} \pi^*(a\,|\,s)\log\frac{\pi^*(a\,|\,s)}{\pi_{\theta_0}(a\,|\,s)}\,\mathrm{d}a\\
&= \int_{\mathcal A} \pi^*(a\,|\,s)\log\pi^*(a\,|\,s)\,\mathrm{d}a - \int_{\mathcal A} \pi^*(a\,|\,s)\log\pi_{\theta_0}(a\,|\,s)\,\mathrm{d}a\\
&\le -\int_{\mathcal A} \pi^*(a\,|\,s)\log\pi_{\theta_0}(a\,|\,s)\,\mathrm{d}a = \int_{\mathcal A} \pi^*(a\,|\,s)\log|\mathcal A|\,\mathrm{d}a = \log|\mathcal A|. \tag{H.20}
\end{align*}
The desired bound then follows from (H.20).
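The bound in (H.20) is the standard fact that the KL divergence from any distribution to the uniform distribution is at most $\log|\mathcal A|$. A quick numerical check over random finite action distributions (a sketch; the discrete setting and names are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
num_actions = 6
uniform = np.full(num_actions, 1.0 / num_actions)  # pi_{theta_0}(.|s)

max_kl = 0.0
for _ in range(1000):
    p = rng.random(num_actions)
    p /= p.sum()                                   # a random policy pi*(.|s)
    kl = np.sum(p * np.log(p / uniform))           # KL(p || uniform)
    assert kl <= np.log(num_actions) + 1e-12       # the bound in (H.20)
    max_kl = max(max_kl, kl)

assert max_kl <= np.log(num_actions)
```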

PROOF OF LEMMA E.7

Part 1. We first show that the first inequality holds. Note that
\[
\pi_{\theta_k}(a\,|\,s) = \exp\bigl(\tau_k^{-1} f_{\theta_k}(s,a)\bigr)/Z_{\theta_k}(s), \qquad \pi_{\theta_{k+1}}(a\,|\,s) = \exp\bigl(\tau_{k+1}^{-1} f_{\theta_{k+1}}(s,a)\bigr)/Z_{\theta_{k+1}}(s).
\]
Here $Z_{\theta_k}(s), Z_{\theta_{k+1}}(s) \in \mathbb{R}$ are normalization factors, which are defined as
\[
Z_{\theta_k}(s) = \sum_{a' \in \mathcal A} \exp\bigl(\tau_k^{-1} f_{\theta_k}(s,a')\bigr), \qquad Z_{\theta_{k+1}}(s) = \sum_{a' \in \mathcal A} \exp\bigl(\tau_{k+1}^{-1} f_{\theta_{k+1}}(s,a')\bigr).
\]
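Under this energy-based parameterization, the log-ratio of consecutive policies equals the difference of temperature-scaled energies up to the constant $\log Z_{\theta_{k+1}}(s) - \log Z_{\theta_k}(s)$, which is the identity behind (H.29). A minimal numerical check for a single state with randomly drawn energies (the names and temperature values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
A = 5
f_k, f_next = rng.normal(size=A), rng.normal(size=A)  # energies f_theta(s, .)
tau_k, tau_next = 1.0, 0.5                            # temperatures tau_k, tau_{k+1}

def softmax_policy(f, tau):
    """pi(a|s) = exp(f(s,a)/tau) / Z(s); returns the policy and log Z."""
    logits = f / tau
    z = np.exp(logits).sum()
    return np.exp(logits) / z, np.log(z)

pi_k, logZ_k = softmax_policy(f_k, tau_k)
pi_next, logZ_next = softmax_policy(f_next, tau_next)

# log(pi_{k+1}/pi_k) = f_{k+1}/tau_{k+1} - f_k/tau_k - (log Z_{k+1} - log Z_k)
log_ratio = np.log(pi_next / pi_k)
identity = f_next / tau_next - f_k / tau_k - (logZ_next - logZ_k)
assert np.allclose(log_ratio, identity)
```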

(H.26)
where we use the fact that
\[
\bigl\langle \log Z_{\theta_{k+1}}(s) - \log Z_{\theta_k}(s),\, \pi^*(\cdot\,|\,s) - \pi_{\theta_{k+1}}(\cdot\,|\,s)\bigr\rangle = \bigl(\log Z_{\theta_{k+1}}(s) - \log Z_{\theta_k}(s)\bigr)\cdot\sum_{a' \in \mathcal A}\bigl(\pi^*(a'\,|\,s) - \pi_{\theta_{k+1}}(a'\,|\,s)\bigr) = 0.
\]

(H.28)
where we use the fact that
\[
\bigl\langle \log Z_{\theta_{k+1}}(s) - \log Z_{\theta_k}(s),\, \pi_{\theta_k}(\cdot\,|\,s) - \pi_{\theta_{k+1}}(\cdot\,|\,s)\bigr\rangle = \bigl(\log Z_{\theta_{k+1}}(s) - \log Z_{\theta_k}(s)\bigr)\cdot\sum_{a' \in \mathcal A}\bigl(\pi_{\theta_k}(a'\,|\,s) - \pi_{\theta_{k+1}}(a'\,|\,s)\bigr) = 0.
\]
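The cancellations used in (H.26) and (H.28) hold because a term that is constant in $a$ pairs to zero against the difference of any two probability distributions. A one-line numerical check (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(3)
A = 7
p, q = rng.random(A), rng.random(A)
p, q = p / p.sum(), q / q.sum()     # two policies over the same action set
c = rng.normal()                    # plays the role of log Z_{k+1} - log Z_k

# <c * 1, p - q> = c * sum(p - q) = 0, since both p and q sum to one
assert np.isclose(c * np.sum(p - q), 0.0)
```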

and $W_h \in \mathbb{R}^{m \times m}$ for $2 \le h \le H$. Meanwhile, we denote the parameter of the DNN $u_\theta$ as $\theta = (\mathrm{vec}(W_1)^\top, \ldots, \mathrm{vec}(W_H)^\top)^\top \in \mathbb{R}^{m_{\mathrm{all}}}$ with $m_{\mathrm{all}} = md + (H-1)m^2$. We call $\{W_h\}_{h \in [H]}$ the weight matrices of $\theta$. Without loss of generality, we normalize the input $x$ such that $\|x\|_2 = 1$. We initialize the DNN such that each entry of $W_h$ follows the standard Gaussian distribution $N(0,1)$ for any $h \in [H]$, while each entry of $b$ follows the uniform distribution $\mathrm{Unif}(\{-1,1\})$. Without loss of generality, we fix $b$ during training and only optimize $\{W_h\}_{h \in [H]}$.
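As a concrete sketch of this architecture, the following builds and evaluates such a network in NumPy. The $1/\sqrt{m}$ layer scaling is an assumption (a common convention in analyses of overparameterized networks); the paper's exact normalization is not shown in this excerpt.

```python
import numpy as np

def init_dnn(d, m, H, rng):
    """Gaussian weight matrices and fixed +/-1 output weights, as described."""
    W = [rng.normal(size=(m, d))]                         # W_1 in R^{m x d}
    W += [rng.normal(size=(m, m)) for _ in range(H - 1)]  # W_h in R^{m x m}
    b = rng.choice([-1.0, 1.0], size=m)                   # b ~ Unif({-1, 1}), fixed
    return W, b

def forward(x, W, b, m):
    """u_theta(x): ReLU layers with an assumed 1/sqrt(m) scaling."""
    h = x
    for Wh in W:
        h = np.maximum(Wh @ h / np.sqrt(m), 0.0)          # ReLU activation
    return b @ h / np.sqrt(m)

rng = np.random.default_rng(4)
d, m, H = 8, 64, 3
x = rng.normal(size=d)
x /= np.linalg.norm(x)                                    # normalize so ||x||_2 = 1
W, b = init_dnn(d, m, H, rng)
out = forward(x, W, b, m)

m_all = m * d + (H - 1) * m * m                           # total parameter count
assert m_all == sum(Wh.size for Wh in W)
assert np.isfinite(out)
```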

F.1 LOCAL LINEARIZATION OF DNNS

In the proofs of Propositions C.3 and C.4 in §G.2 and §G.3, respectively, we utilize the local linearization of DNNs. We introduce some related auxiliary results here. First, we define the linearization $\bar u_\theta$ of the DNN $u_\theta \in \mathcal U(m, H, R)$ as
\[
\bar u_\theta(s,a) = u_{\theta_0}(s,a) + (\theta - \theta_0)^\top \nabla_\theta u_{\theta_0}(s,a),
\]
where $\theta_0$ is the initialization of $u_\theta$. The following lemmas characterize the linearization error.

Lemma F.2. Suppose that $H = O(m^{1/12} R^{-1/6} (\log m)^{-1/2})$ and $m = \Omega(d^{3/2} R^{-1} H^{-3/2} \cdot \log(m^{1/2}/R)^{3/2})$. Then with probability at least $1 - \exp(-\Omega(R^{2/3} m^{2/3} H))$ over the random initialization $\theta_0$, the gradient linearization bound holds for any $\theta \in B(\theta_0, R)$ and any $(s,a) \in \mathcal S \times \mathcal A$.

Proof. See the proof of Lemma A.5 in Gao et al. (2019) for a detailed proof.

Lemma F.3. With probability at least $1 - \exp(-\Omega(R^{2/3} m^{2/3} H))$ over the random initialization $\theta_0$, the corresponding bound on the linearization error $|u_\theta(s,a) - \bar u_\theta(s,a)|$ holds for any $\theta \in B(\theta_0, R)$ and any $(s,a) \in \mathcal S \times \mathcal A$.

Proof. By the mean value theorem, there exists $t \in [0,1]$, which depends on $\theta$ and $(s,a)$, such that
\[
u_\theta(s,a) - \bar u_\theta(s,a) = (\theta - \theta_0)^\top \bigl(\nabla_\theta u_{\theta_0 + t(\theta - \theta_0)}(s,a) - \nabla_\theta u_{\theta_0}(s,a)\bigr).
\]
Further, by the Cauchy-Schwarz inequality and Lemma F.2, the claimed bound follows. This concludes the proof of Lemma F.3.

We denote by $x^{(h)}$ the output of the $h$-th layer of the DNN $u_\theta \in \mathcal U(m, H, R)$, and by $x^{(h),0}$ the output of the $h$-th layer of the DNN $u_{\theta_0} \in \mathcal U(m, H, R)$. The following lemma upper bounds the distance between $x^{(h)}$ and $x^{(h),0}$.

Lemma F.4. With probability at least $1 - \exp(-\Omega(R^{2/3} m^{2/3} H))$ over the random initialization $\theta_0$, for any $\theta \in B(\theta_0, R)$ and any $h \in [H]$, the stated bound on $\|x^{(h)} - x^{(h),0}\|_2$ holds. Also, with probability at least $1 - O(H)\exp(-\Omega(H^{-1} m))$ over the random initialization $\theta_0$, for any $\theta \in B(\theta_0, R)$ and any $h \in [H]$, it holds that $2/3 \le \|x^{(h)}\|_2 \le 4/3$.

Proof. The first inequality follows from Lemma A.5 in Gao et al. (2019), and the second inequality follows from Lemma 7.1 in Allen-Zhu et al. (2018b).
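The definition of $\bar u_\theta$ and the mean-value-theorem argument can be illustrated numerically. The sketch below uses a smooth one-layer tanh network in place of the deep ReLU network (tanh is everywhere differentiable, so the mean value theorem applies cleanly), and checks that the linearization error at $\theta \in B(\theta_0, R)$ shrinks as the radius shrinks; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 4, 32

# Toy smooth network u_theta(x) = b^T tanh(W x) / sqrt(m), with theta = vec(W).
b = rng.choice([-1.0, 1.0], size=m)        # fixed output weights, as in the setup
x = rng.normal(size=d)
x /= np.linalg.norm(x)                     # normalize so ||x||_2 = 1

def u(theta):
    W = theta.reshape(m, d)
    return b @ np.tanh(W @ x) / np.sqrt(m)

def grad_u(theta):
    W = theta.reshape(m, d)
    s = 1.0 - np.tanh(W @ x) ** 2          # derivative of tanh
    return ((b * s)[:, None] * x[None, :]).ravel() / np.sqrt(m)

theta0 = rng.normal(size=m * d)            # random initialization theta_0
g0 = grad_u(theta0)

def u_bar(theta):
    # Local linearization at theta0: u_{theta0} + (theta - theta0)^T grad u_{theta0}.
    return u(theta0) + (theta - theta0) @ g0

direction = rng.normal(size=m * d)
direction /= np.linalg.norm(direction)

errs = []
for radius in (0.1, 0.01):
    theta = theta0 + radius * direction    # theta in B(theta0, radius)
    errs.append(abs(u(theta) - u_bar(theta)))
assert errs[1] < errs[0]                   # linearization error shrinks with the radius
```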

