FINITE-TIME ANALYSIS OF SINGLE-TIMESCALE ACTOR-CRITIC ON LINEAR QUADRATIC REGULATOR

Abstract

Actor-critic (AC) methods have achieved state-of-the-art performance in many challenging tasks. However, their convergence in most practical applications is still poorly understood. Existing works mostly consider the uncommon double-loop or two-timescale stepsize variants for ease of analysis. We investigate the practical yet more challenging single-sample single-timescale natural AC for solving the canonical linear quadratic regulator (LQR) problem. Specifically, the actor and the critic each update only once with a single sample in each iteration, using proportional stepsizes. We prove that single-sample single-timescale natural AC (NAC) can attain an ϵ-optimal solution with a sample complexity of O(ϵ^{-2}), which elucidates the practical efficiency of single-sample single-timescale NAC. We develop a novel analysis framework that directly bounds the whole interconnected iteration system, without the conservative decoupling commonly adopted in previous analyses of AC and NAC. Our work presents the first finite-time analysis of single-sample single-timescale NAC with a global optimality guarantee.

Under review as a conference paper at ICLR 2023

Besides, Olshevsky & Gharesifard (2022) only considers the simple tabular case. We attempt to answer the more general yet more challenging question: Can single-sample single-timescale AC (NAC) find a globally optimal policy, especially on a general unbounded continuous state-action space with unbounded reward? To this end, we take the first step by considering the classic Linear Quadratic Regulator (LQR), a fundamental continuous state-action space control problem that is commonly employed to study the performance and the limits of RL algorithms (Fazel et al., 2018; Yang et al., 2019; Tu & Recht, 2018; Duan et al., 2021).
In particular, under the time-average cost, the single-sample single-timescale AC (NAC) algorithm for solving LQR consists of three parallel updates in each iteration: the cost estimator, the critic, and the actor. Unlike the aforementioned double-loop, two-timescale, or multi-sample structures, there is no specialized design in single-sample single-timescale AC (NAC) that facilitates a decoupled analysis of its three interconnected updates. In fact, it is both conservative and difficult to bound the three iterations separately. Moreover, the existing perturbed gradient analysis can no longer be applied to establish the convergence of the actor. To tackle these challenges, we instead propose a novel framework that directly bounds the overall interconnected iteration system altogether, without resorting to a conservative decoupled analysis. In particular, despite the inaccurate estimation in all three updates, we prove that the estimation errors diminish to zero if the (constant) ratio of the stepsizes between the actor and the critic is below a threshold. The identified threshold provides new insights into the practical choice of stepsizes for single-timescale AC. Overall, our contributions are summarized as follows:

• Our work furthers the theoretical understanding of AC (NAC) in its most practical form. We show for the first time that single-sample single-timescale NAC can provably find an ϵ-accurate global optimum with a sample complexity of O(ϵ^{-2}) for tasks with unbounded continuous state-action space. Previous works consider either specialized algorithm variants (Fu et al., 2020; Zhou & Lu, 2022) or settings with only local convergence guarantees (Chen et al., 2021; Olshevsky & Gharesifard, 2022).

1. INTRODUCTION

Actor-critic (AC) methods have achieved substantial success in solving many difficult reinforcement learning (RL) problems (LeCun et al., 2015; Mnih et al., 2016; Silver et al., 2017). In addition to a policy update, AC methods employ a parallel critic update to bootstrap the Q-value for policy gradient estimation, which often enjoys reduced variance and fast convergence in training. Despite the empirical success, theoretical analysis of AC in its most practical form remains challenging. Most existing works focus on either the double-loop setting or the two-timescale setting, both of which are uncommon in practical implementations. In double-loop AC, the actor is updated in the outer loop only after the critic takes sufficiently many steps to obtain an accurate estimate of the Q-value in the inner loop (Yang et al., 2019; Kumar et al., 2019; Wang et al., 2019). Hence, the convergence of the critic is decoupled from that of the actor, and the analysis separates into a policy evaluation sub-problem in the inner loop and a perturbed gradient descent in the outer loop. In two-timescale AC, the actor and the critic are updated simultaneously in each iteration using stepsizes of different timescales. The actor stepsize (denoted by α_t) is typically smaller than that of the critic (denoted by β_t), with their ratio going to zero as the iteration number goes to infinity (i.e., lim_{t→∞} α_t/β_t = 0). The two-timescale design allows the critic to approximate the correct Q-value asymptotically, which essentially decouples the analysis of the actor and the critic. The aforementioned AC variants are considered mainly for ease of analysis. In practice, single-timescale AC, where the actor and the critic are updated simultaneously using constantly proportional stepsizes (i.e., with α_t/β_t = c_α > 0), is more favorable due to its simplicity of implementation and empirical sample efficiency (Schulman et al., 2015; Mnih et al., 2016).
However, its analysis is significantly more difficult than that of the other variants. To understand its finite-time convergence, some recent works (Fu et al., 2020; Zhou & Lu, 2022) consider multi-sample variants of single-timescale AC, where the critics are updated by the least squares temporal difference (LSTD) estimator rather than the TD(0) update. The idea is still to obtain an accurate policy gradient estimate at each iteration by using sufficiently many samples (LSTD), and then follow the common perturbed gradient analysis to guarantee the convergence of the actor, decoupling the convergence analysis of the actor and the critic. Beyond the multi-sample settings, only a few attempts have analyzed single-sample single-timescale AC (NAC), and they only attest local convergence (Chen et al., 2021; Olshevsky & Gharesifard, 2022).

• We also contribute to the literature on RL for continuous control. Notably, even with the actor updated by a roughly estimated gradient, the single-sample single-timescale NAC algorithm can still find the globally optimal policy for LQR under general assumptions. Compared with all other model-free RL algorithms for solving LQR (see related work in Section 1.1), our work is the first to adopt the simplest single-sample single-loop structure, which may serve as a first step towards understanding the limits of AC (NAC) methods on continuous control tasks. In addition, compared with the state-of-the-art double-loop AC for solving LQR (Yang et al., 2019), we improve the sample complexity from O(ϵ^{-5}) to O(ϵ^{-2}). We also show empirically in Section 5 that the algorithm is much more sample-efficient than a few classic baselines, which unveils the practical wisdom of the AC (NAC) algorithm.

• Technically, we provide a new proof framework that establishes the finite-time convergence of single-timescale AC.
In the finite-time analyses of double-loop AC (Yang et al., 2019) and two-timescale AC (Wu et al., 2020), the previous techniques hinge on decoupling the analysis of the actor and the critic, establishing the convergence of the critic first and then that of the actor. The novelty of our proof framework is that we formulate the estimation errors of the time-average cost, the critic, and the natural policy gradient as an interconnected iteration system and establish their convergence simultaneously rather than separately. This proof framework may provide new insights for the finite-time analysis of other single-timescale algorithms.

1.1. RELATED WORK

In this section, we review the existing works most relevant to ours.

Actor-critic methods. The first AC algorithm was proposed by Konda & Tsitsiklis (1999). Kakade (2001) extended it to the natural AC algorithm. The asymptotic convergence of AC algorithms has been well established in Kakade (2001); Bhatnagar et al. (2009); Castro & Meir (2010); Zhang et al. (2020). Many recent works have focused on the finite-time convergence of AC methods. Under the double-loop setting, Yang et al. (2019) established the global convergence of AC methods for solving LQR. Wang et al. (2019) studied the global convergence of AC methods with both the actor and the critic parameterized by neural networks. Kumar et al. (2019) studied the finite-time local convergence of a few AC variants with linear function approximation. Under the two-timescale setting, Wu et al. (2020); Xu et al. (2020) established finite-time convergence to a stationary point with a sample complexity of O(ϵ^{-2.5}). Under the single-timescale setting, all the related works (Fu et al., 2020; Chen et al., 2021; Zhou & Lu, 2022; Olshevsky & Gharesifard, 2022) have been reviewed in the Introduction.

RL algorithms for LQR. RL algorithms in the context of LQR have seen increased interest in recent years. These works can be mainly divided into two categories: model-based methods (Dean et al., 2018; Mania et al., 2019; Cohen et al., 2019; Dean et al., 2020) and model-free methods. Our main interest lies in the model-free methods. Notably, Fazel et al. (2018) established the first global convergence result for LQR under the policy gradient method using zeroth-order optimization. Krauth et al. (2019) studied the convergence and sample complexity of the LSTD policy iteration method under the LQR setting. On the subject of adopting AC to solve LQR, Yang et al. (2019) provided the first finite-time analysis with a convergence guarantee and sample complexity under the double-loop setting.
Zhou & Lu (2022) considered the multi-sample (LSTD) single-timescale setting. For the more practical yet challenging single-sample single-timescale AC, there has been no such theoretical guarantee so far, which is the focus of this paper.

Notation. Unless otherwise specified, for two sequences {x_n} and {y_n}, we write x_n = O(y_n) if there exists a constant C such that x_n ≤ C y_n. We use Õ(·) to further hide logarithmic factors. For any symmetric matrix M ∈ R^{n×n}, let svec(M) ∈ R^{n(n+1)/2} denote the vectorization of the upper triangular part of M, and let smat(·) denote its inverse, so that smat(svec(M)) = M. Finally, we denote by A ⊗_s B the symmetric Kronecker product of two matrices A and B.

2. PRELIMINARIES

In this section, we introduce the AC algorithm and provide the theoretical background of LQR.

2.1. ACTOR-CRITIC ALGORITHMS

We consider reinforcement learning for a standard Markov Decision Process (MDP) defined by (X, U, P, c), where X is the state space, U is the action space, P(x_{t+1}|x_t, u_t) denotes the transition kernel under which the agent transits to state x_{t+1} after taking action u_t at the current state x_t, and c(x_t, u_t) is the running cost. A policy π_θ(u|x) parameterized by θ is defined as a mapping from a given state to a probability distribution over actions. In this paper, we aim to find a policy π_θ that minimizes the infinite-horizon time-average cost:

θ* = arg min_θ J(θ) := lim_{T→∞} (1/T) E_θ[Σ_{t=0}^{T} c(x_t, u_t)] = E_{x∼ρ_θ, u∼π_θ}[c(x, u)],   (1)

where ρ_θ denotes the stationary state distribution induced by policy π_θ. In the time-average cost setting, the state-action value (Q-value) of policy π_θ is defined as

Q_θ(x, u) = E_θ[Σ_{t=0}^{∞} (c(x_t, u_t) − J(θ)) | x_0 = x, u_0 = u],

which describes the accumulated differences between the running costs and the average cost when selecting u in state x and thereafter following policy π_θ (Sutton & Barto, 2018). Based on this definition, the policy gradient theorem (Sutton et al., 1999) expresses the gradient of J(θ) with respect to θ as

∇_θ J(θ) = E_{x∼ρ_θ, u∼π_θ}[∇_θ log π_θ(u|x) Q_θ(x, u)].   (2)

One can also choose to update the policy using the natural policy gradient (Kakade, 2001), which is given by

∇^N_θ J(θ) = F(θ)† ∇_θ J(θ),   (3)

where F(θ) = E_{x∼ρ_θ, u∼π_θ}[∇_θ log π_θ(u|x) ∇_θ log π_θ(u|x)^⊤] is the Fisher information matrix and F(θ)† denotes its Moore-Penrose pseudoinverse. Optimizing J(θ) in (1) with (3) requires evaluating the Q-value of the current policy π_θ, which is usually unknown. AC estimates both the Q-value and the policy: the critic update approximates the Q-value of the current policy π_θ using temporal difference (TD) learning (Sutton & Barto, 2018), while the actor improves the policy to reduce the time-average cost J(θ) via gradient descent.
Note that the AC with natural policy gradient is also known as natural AC.

2.2. NATURAL ACTOR-CRITIC FOR LINEAR QUADRATIC REGULATOR

In this paper, we aim to demystify the convergence properties of natural AC by focusing on the infinite-horizon time-average linear quadratic regulator (LQR) problem:

minimize_{u_t} J({u_t}) := lim_{T→∞} (1/T) E[Σ_{t=1}^{T} (x_t^⊤ Q x_t + u_t^⊤ R u_t)]   (4)
subject to x_{t+1} = A x_t + B u_t + ϵ_t,   (5)

where x_t ∈ R^d is the state and u_t ∈ R^k is the action; A ∈ R^{d×d} and B ∈ R^{d×k} are the system matrices; Q ∈ S^{d×d} and R ∈ S^{k×k} are the performance matrices; and ϵ_t ∼ N(0, D_0) with D_0 > 0 are i.i.d. Gaussian random variables. From optimal control theory (Anderson & Moore, 2007), the optimal policy of (4) is a linear feedback of the state,

u_t = −K* x_t,   (6)

where K* ∈ R^{k×d} is the optimal policy, which can be uniquely found by solving an Algebraic Riccati Equation (ARE) (Anderson & Moore, 2007) that depends on A, B, Q, R. This means that finding K* via the ARE relies on complete model knowledge. In the sequel, we pursue the optimal policy in a model-free way using the natural AC method, without knowing or estimating A, B, Q, R. The structure of the optimal policy in (6) allows us to reformulate (4) as a static optimization problem over all feasible policy matrices K ∈ R^{k×d}. To encourage exploration, we parameterize the policy as

{π_K(·|x) = N(−Kx, σ² I_k), K ∈ R^{k×d}},   (7)

where σ > 0 is the standard deviation of the exploration noise. In other words, given a state x_t, the agent takes an action u_t = −K x_t + σ ζ_t, where ζ_t ∼ N(0, I_k). As a consequence, the closed-loop form of system (5) under policy (7) is given by

x_{t+1} = (A − BK) x_t + ξ_t,   (8)

where ξ_t = ϵ_t + σ B ζ_t ∼ N(0, D_σ) with D_σ = D_0 + σ² B B^⊤. Note that optimizing over the set of stochastic policies (7) leads to the same optimal K*. The set K of all stabilizing policies is given by K := {K ∈ R^{k×d} : ρ(A − BK) < 1}, where ρ(·) denotes the spectral radius.
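For reference, the model-based route mentioned above can be sketched numerically: iterate the Riccati recursion to (approximately) solve the ARE and read off K*. This is a minimal illustration with made-up system matrices, not part of the paper's model-free algorithm; the function names `dare` and `optimal_gain` are our own.

```python
import numpy as np

def dare(A, B, Q, R, iters=2000):
    # Fixed-point (value) iteration for the discrete-time ARE:
    #   P = Q + A'PA - A'PB (R + B'PB)^{-1} B'PA.
    P = Q.copy()
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ G
    return P

def optimal_gain(A, B, Q, R):
    # K* such that u_t = -K* x_t is optimal for (4)-(5).
    P = dare(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Illustrative 2-dimensional system (not the one used in Section 5).
A = np.array([[1.0, 0.2], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K_star = optimal_gain(A, B, Q, R)
print(np.max(np.abs(np.linalg.eigvals(A - B @ K_star))))  # spectral radius < 1
```

The paper's point is precisely that this computation needs A, B, Q, R, whereas the NAC algorithm below does not.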
It is well known that if K ∈ K, the Markov chain in (8) admits a stationary state distribution N(0, D_K), where D_K satisfies the Lyapunov equation

D_K = D_σ + (A − BK) D_K (A − BK)^⊤.   (9)

Similarly, we define P_K as the unique positive definite solution to

P_K = Q + K^⊤ R K + (A − BK)^⊤ P_K (A − BK).   (10)

Based on D_K and P_K, the following lemma characterizes J(K) and its gradient ∇_K J(K).

Lemma 2.1. (Yang et al., 2019) For any K ∈ K, the time-average cost J(K) and its gradient ∇_K J(K) take the following forms:

J(K) = Tr(P_K D_σ) + σ² Tr(R),   (12a)
∇_K J(K) = 2 E_K D_K, where E_K := (R + B^⊤ P_K B) K − B^⊤ P_K A.   (12b)

Then, the natural gradient of J(K) can be calculated as (Fazel et al., 2018; Yang et al., 2019)

∇^N_K J(K) = ∇_K J(K) D_K^{-1} = E_K,   (13)

which eliminates the burden of estimating D_K. Note that we omit the constant coefficient 2 since it can be absorbed into the stepsize. Calculating the natural gradient ∇^N_K J(K) requires estimating P_K, which depends on A, B, Q, R. To estimate the gradient without knowledge of the model, we instead directly utilize the Q-value.

Lemma 2.2. (Bradtke et al., 1994; Yang et al., 2019) For any K ∈ K, the Q-value Q_K(x, u) takes the following form:

Q_K(x, u) = (x^⊤, u^⊤) Ω_K (x^⊤, u^⊤)^⊤ − σ² Tr(R + P_K B B^⊤) − Tr(P_K D_K),   (14)

where

Ω_K := [Ω^11_K, Ω^12_K; Ω^21_K, Ω^22_K] := [Q + A^⊤ P_K A, A^⊤ P_K B; B^⊤ P_K A, R + B^⊤ P_K B].   (15)

Clearly, if we can estimate Ω_K, then E_K in (13) can be readily estimated using Ω^21_K and Ω^22_K.
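The quantities in (9), (10), (12a), and (13) can be evaluated numerically by solving the two Lyapunov equations with fixed-point iteration (valid whenever ρ(A − BK) < 1). The sketch below is ours, with illustrative parameters; `dlyap` and `cost_and_natural_grad` are hypothetical helper names.

```python
import numpy as np

def dlyap(M, S, iters=3000):
    # Fixed-point iteration for X = S + M X M^T (converges when rho(M) < 1).
    X = S.copy()
    for _ in range(iters):
        X = S + M @ X @ M.T
    return X

def cost_and_natural_grad(K, A, B, Q, R, D_sigma, sigma):
    M = A - B @ K
    D_K = dlyap(M, D_sigma)              # stationary covariance, eq. (9)
    P_K = dlyap(M.T, Q + K.T @ R @ K)    # value matrix, eq. (10)
    J = np.trace(P_K @ D_sigma) + sigma**2 * np.trace(R)   # eq. (12a)
    E_K = (R + B.T @ P_K @ B) @ K - B.T @ P_K @ A          # natural gradient (13)
    return J, E_K
```

At K = K* the natural gradient E_K vanishes; away from K*, moving against E_K decreases J(K), which is exactly what the actor update exploits.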

3. SINGLE-SAMPLE SINGLE-TIMESCALE NATURAL ACTOR-CRITIC

In this section, we describe the single-sample single-timescale natural AC algorithm for solving LQR. In view of the structure of the Q-value given in (14), we define the feature function

ϕ(x, u) = svec((x^⊤, u^⊤)^⊤ (x^⊤, u^⊤)).

Then, we can parameterize the Q-estimator (critic) as

Q̂_K(x, u; ω, b) = ϕ(x, u)^⊤ ω + b.

Using TD(0) learning, the critic update follows

ω_{t+1} = ω_t + β_t [c_t − J(K) + ϕ(x_{t+1}, u_{t+1})^⊤ ω_t − ϕ(x_t, u_t)^⊤ ω_t] ϕ(x_t, u_t),   (16)

where β_t is the stepsize of the critic and K denotes the policy under which the state-action pairs are sampled. Note that the constant b is not required for updating the linear coefficient ω. Taking the expectation of ω_{t+1} in (16) with respect to the stationary distribution, conditioned on ω_t, the expected subsequent critic can be written as

E[ω_{t+1}|ω_t] = ω_t + β_t (b_K − A_K ω_t),   (17)

where

A_K = E_{(x,u)}[ϕ(x, u)(ϕ(x, u) − ϕ(x′, u′))^⊤], b_K = E_{(x,u)}[(c(x, u) − J(K)) ϕ(x, u)].   (18)

For ease of exposition, we denote by (x′, u′) the state-action pair following (x, u) and abbreviate E_{x∼ρ_K, u∼π_K(·|x)} as E_{(x,u)}. Given a policy π_K, it is not hard to show that if the update in (17) converges to some limiting point ω*_K, i.e., lim_{t→∞} ω_t = ω*_K, then ω*_K must be the solution of A_K ω = b_K.

Proposition 3.1. Suppose K ∈ K. Then the matrix A_K defined in (18) is invertible, and A_K ω = b_K has a unique solution ω*_K that satisfies

ω*_K = svec(Ω_K),   (19)

where Ω_K is defined in (15). Combining (13), (15), and (19), we can express the natural gradient of J(K) using ω*_K:

∇^N_K J(K) = Ω^22_K K − Ω^21_K = smat(ω*_K)^22 K − smat(ω*_K)^21.

This allows us to estimate the natural policy gradient using the critic parameter ω_t, and then update the actor in a model-free manner:

K_{t+1} = K_t − α_t ∇̂^N_{K_t} J(K_t),   (20)

where α_t is the actor stepsize and ∇̂^N_{K_t} J(K_t) is the natural gradient estimate depending on ω_t:

∇̂^N_{K_t} J(K_t) = smat(ω_t)^22 K_t − smat(ω_t)^21.   (21)
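The svec/smat pair and the feature function ϕ above can be written down directly. A minimal sketch (our own helper names), with off-diagonal entries scaled by √2 as in Definition A.1 so that the Euclidean inner product of svec's matches the trace inner product of matrices:

```python
import numpy as np

def svec(M):
    # Stack the lower triangle column by column, scaling off-diagonal
    # entries by sqrt(2), so that svec(M) @ svec(N) == trace(M @ N).
    n = M.shape[0]
    out = []
    for j in range(n):
        for i in range(j, n):
            out.append(M[i, j] if i == j else np.sqrt(2.0) * M[i, j])
    return np.array(out)

def smat(v):
    # Inverse map: smat(svec(M)) == M for symmetric M.
    n = int(round((np.sqrt(8 * len(v) + 1) - 1) / 2))
    M = np.zeros((n, n))
    k = 0
    for j in range(n):
        for i in range(j, n):
            M[i, j] = v[k] if i == j else v[k] / np.sqrt(2.0)
            M[j, i] = M[i, j]
            k += 1
    return M

def feature(x, u):
    # phi(x, u) = svec([x; u][x; u]^T), so that phi(x, u) @ svec(Omega)
    # equals the quadratic form [x; u]^T Omega [x; u].
    z = np.concatenate([x, u])
    return svec(np.outer(z, z))
```

The √2 scaling is what makes the linear critic ϕ(x, u)^⊤ ω with ω = svec(Ω_K) reproduce the quadratic part of the Q-value in (14) exactly.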
Furthermore, we introduce a cost estimator η_t to estimate the time-average cost J(K_t). Combining the critic update (16) and the actor update (20), the single-sample single-timescale natural AC for solving LQR is listed below (core updates of Algorithm 1):

6: TD error: δ_t = c_t − η_t + ϕ(x′_t, u′_t)^⊤ ω_t − ϕ(x_t, u_t)^⊤ ω_t
7: Cost estimator update: η_{t+1} = η_t + γ_t (c_t − η_t)
8: Critic update: ω_{t+1} = Π_ω(ω_t + β_t δ_t ϕ(x_t, u_t))
9: Actor update: K_{t+1} = K_t − α_t (smat(ω_t)^22 K_t − smat(ω_t)^21)
10: end for

Note that "single-sample" refers to the fact that only one sample is used to update the critic per actor step. Line 3 of Algorithm 1 samples from the stationary distribution induced by the policy π_{K_t}, which is a mild requirement in the analysis of uniformly ergodic Markov chains, such as the one arising in the LQR problem (Yang et al., 2019); it is only made to simplify the theoretical analysis. Indeed, as shown in Tu & Recht (2018), when K ∈ K, (8) is geometrically β-mixing and thus its distribution converges to the stationary distribution exponentially fast. In practice, one can run the Markov chain in (8) for a sufficient number of steps and sample one state from the last step. In addition, "single-timescale" refers to the fact that the stepsizes for the critic and the actor updates are constantly proportional. Since the update of the critic parameter in (16) requires the time-average cost J(K_t), Line 7 provides an estimate of it. Besides, on top of (16), we additionally introduce a projection Π_ω in Line 8 to keep the critic norm-bounded, which is common in the literature (Wu et al., 2020; Yang et al., 2019; Xu et al., 2020). In our analysis, the projection is handled using its nonexpansive property.
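A minimal, self-contained sketch of Algorithm 1 in code. All hyperparameter values are illustrative (not the paper's), and the stationary sampling of Line 3 is approximated by running the chain for a few mixing steps, as the paragraph above suggests; `nac_lqr` is our own name.

```python
import numpy as np

def svec(M):
    # Compact svec/smat pair (ordering differs from Definition A.1, but any
    # fixed ordering works as long as the two maps are mutual inverses).
    i, j = np.tril_indices(M.shape[0])
    return M[i, j] * np.where(i == j, 1.0, np.sqrt(2.0))

def smat(v):
    n = int(round((np.sqrt(8 * len(v) + 1) - 1) / 2))
    M = np.zeros((n, n))
    i, j = np.tril_indices(n)
    M[i, j] = v / np.where(i == j, 1.0, np.sqrt(2.0))
    return M + M.T - np.diag(np.diag(M))

def nac_lqr(A, B, Q, R, K0, sigma=0.3, T=800, c_alpha=0.002,
            omega_radius=50.0, mix_steps=20, seed=0):
    # Single-sample single-timescale NAC for LQR (sketch of Algorithm 1).
    rng = np.random.default_rng(seed)
    d, k = B.shape
    K = K0.astype(float).copy()
    omega = np.zeros((d + k) * (d + k + 1) // 2)
    eta = 0.0
    phi = lambda x, u: svec(np.outer(np.r_[x, u], np.r_[x, u]))

    def step(x):
        u = -K @ x + sigma * rng.standard_normal(k)
        x_next = A @ x + B @ u + 0.1 * rng.standard_normal(d)  # eps ~ N(0, 0.01 I)
        return u, x_next

    x = np.zeros(d)
    for t in range(T):
        beta = gamma = 1.0 / np.sqrt(1.0 + t)   # critic / cost-estimator stepsizes
        alpha = c_alpha * beta                  # proportional actor stepsize
        for _ in range(mix_steps):              # crude stand-in for Line 3
            _, x = step(x)
        u, x_next = step(x)
        u_next = -K @ x_next + sigma * rng.standard_normal(k)
        c = x @ Q @ x + u @ R @ u
        delta = c - eta + phi(x_next, u_next) @ omega - phi(x, u) @ omega  # Line 6
        eta = eta + gamma * (c - eta)                                      # Line 7
        omega = omega + beta * delta * phi(x, u)                           # Line 8
        nrm = np.linalg.norm(omega)             # projection Pi_omega
        if nrm > omega_radius:
            omega *= omega_radius / nrm
        Om = smat(omega)
        K = K - alpha * (Om[d:, d:] @ K - Om[d:, :d])                      # Line 9
        x = x_next
    return K, omega, eta
```

Note how all three estimators (η_t, ω_t, K_t) advance once per iteration from the same single transition, which is exactly the coupling the analysis has to handle.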

4. MAIN THEORY

In this section, we establish the global convergence and analyze the finite-time performance of Algorithm 1. All proofs can be found in Appendix A. Before proceeding, we make the following standard assumptions.

Assumption 4.1. There exists a constant K̄ > 0 such that ∥K_t∥ ≤ K̄ for all t.

The above assumes the uniform boundedness of the actor parameter (Konda & Tsitsiklis, 1999; Karmakar & Bhatnagar, 2018; Barakat et al., 2022; Zhou & Lu, 2022). As can be seen from our proof, it is only made to guarantee the boundedness of the feature functions, which is a standard assumption in the literature on analyzing AC with linear function approximation (Xu et al., 2020; Wu et al., 2020; Zhou & Lu, 2022).

Assumption 4.2. There exists a constant ρ ∈ (0, 1) such that ρ(A − BK_t) ≤ ρ for all t.

Assumption 4.2 ensures the stability of the closed-loop system induced in each iteration and thus the existence of the stationary distribution corresponding to policy π_{K_t}. In the single-sample case, the estimate of the natural gradient of J(K_t) can be highly noisy and biased, so it is difficult to obtain a theoretical guarantee for this condition in general. Nevertheless, we present numerical examples to support this assumption. Moreover, assuming the existence of the stationary distribution is common and has been widely used in Chen et al. (2021); Zhou & Lu (2022); Olshevsky & Gharesifard (2022). Under these two assumptions, we can now prove the convergence of Algorithm 1, which consists of three estimators: η_t, ω_t, and K_t.

Theorem 4.3. Suppose that Assumptions 4.1 and 4.2 hold, and choose α_t = c_α/√(1+t), β_t = γ_t = 1/√(1+t), where c_α is a small positive constant. Then, with probability at least 1 − 10^{-10}, we have

(1/T) Σ_{t=0}^{T-1} E(η_t − J(K_t))² = O(1/√T),
(1/T) Σ_{t=0}^{T-1} E∥ω_t − ω*_{K_t}∥² = O(1/√T),
min_{0≤t<T} E[J(K_t) − J(K*)] = O(1/√T).
The theorem shows that the cost estimator, the critic, and the actor all converge at a sublinear rate of O(T^{-1/2}). Correspondingly, to obtain an ϵ-optimal policy, the required sample complexity is O(ϵ^{-2}). This order is consistent with existing results on single-timescale AC (Fu et al., 2020; Chen et al., 2021; Olshevsky & Gharesifard, 2022). Nevertheless, our result is the first finite-time analysis of single-sample single-timescale AC with a global optimality guarantee.

4.1. PROOF SKETCH

The main challenge in the finite-time analysis lies in the fact that the estimation errors of the time-average cost, the critic, and the natural policy gradient are strongly coupled. To overcome this issue, we view the propagation of these errors as an interconnected system and analyze them jointly. To illustrate the merit of our analysis framework, we sketch the main proof steps of Theorem 4.3 below; the supporting propositions and theorems can be found in the Appendix. We define three measures A(T), B(T), C(T), which denote the average values of the cost estimation error, the critic error, and the squared norm of the natural policy gradient, respectively:

A(T) := (1/T) Σ_{t=0}^{T-1} E y_t², B(T) := (1/T) Σ_{t=0}^{T-1} E∥z_t∥², C(T) := (1/T) Σ_{t=0}^{T-1} E∥E_{K_t}∥²,   (22)

where y_t := η_t − J(K_t) is the cost estimation error and z_t := ω_t − ω*_t with ω*_t := ω*_{K_t} is the critic error. Note that E_{K_t} = ∇^N_{K_t} J(K_t) is the natural policy gradient according to (13). We first derive implicit (coupled) upper bounds for the cost estimation error y_t, the critic error z_t, and the natural gradient E_{K_t}, respectively. After that, we solve an interconnected system of inequalities in terms of A(T), B(T), C(T) to establish the finite-time convergence.

Step 1: Cost estimation error analysis. From the cost estimator update rule (Line 7 of Algorithm 1), we decompose the cost estimation error as

y²_{t+1} = (1 − 2γ_t) y_t² + 2γ_t y_t (c_t − J(K_t)) + 2 y_t (J(K_t) − J(K_{t+1})) + [J(K_t) − J(K_{t+1}) + γ_t (c_t − η_t)]².   (23)

The second term on the right-hand side of (23) is a noise term introduced by the random sampling of state-action pairs, which vanishes after taking expectations. The third term is the variation of the moving target J(K_t) tracked by the cost estimator; it is bounded in terms of y_t, z_t, and E_{K_t} using the Lipschitz continuity of J(K_t) (Proposition A.6), the actor update rule (21), and the Cauchy-Schwarz inequality.
The last term reflects the variance of the cost estimate, which is controlled by a high-probability bound on c_t (Proposition A.4).

Step 2: Critic error analysis. By the critic update rule (Line 8 of Algorithm 1), we decompose the squared error as (neglecting the projection for the time being)

∥z_{t+1}∥² = ∥z_t∥² + 2β_t ⟨z_t, h̄(ω_t, K_t)⟩ + 2β_t Λ(O_t, ω_t, K_t) + 2β_t ⟨z_t, ∆h(O_t, η_t, K_t)⟩ + 2⟨z_t, ω*_t − ω*_{t+1}⟩ + ∥β_t (h(O_t, ω_t, K_t) + ∆h(O_t, η_t, K_t)) + (ω*_t − ω*_{t+1})∥²,   (24)

where the definitions of h, h̄, ∆h, Λ, and O_t can be found in (27) in the Appendix. The second term on the right-hand side of (24) is bounded by −µ∥z_t∥², where µ is a lower bound on σ_min(A_{K_t}) proved in Proposition A.8. The third term is random noise introduced by sampling, which vanishes after taking expectations. The fourth term is caused by the inaccurate cost and critic estimates and can be bounded by the norms of y_t and z_t. The fifth term tracks the difference between the drifting critic targets; we control it via the Lipschitz continuity of the critic target established in Proposition A.9. The last term reflects the variances of the various estimates, which is bounded by the diminishing stepsize β_t.

Step 3: Natural gradient norm analysis. From the actor update rule (Line 9 of Algorithm 1) and the almost-smoothness property of LQR (Lemma A.11), we derive

2 Tr(D_{K_{t+1}} E^⊤_{K_t} E_{K_t}) = (1/α_t)[J(K_t) − J(K_{t+1})] − 2 Tr(D_{K_{t+1}} (Ê_{K_t} − E_{K_t})^⊤ E_{K_t}) + α_t Tr(D_{K_{t+1}} Ê^⊤_{K_t} (R + B^⊤ P_{K_t} B) Ê_{K_t}),   (25)

where Ê_{K_t} denotes the estimate of the natural gradient E_{K_t}. The term on the left-hand side of (25) can be viewed as a scaled squared norm of the natural gradient. The first term on the right-hand side compares the actor's performance between consecutive updates, which is bounded via Abel summation by parts.
The second term accounts for the error of the natural gradient estimate, which is bounded by the critic error z_t and the natural gradient E_{K_t}. The last term can be viewed as the variance of the perturbed natural gradient update, which is controlled by the diminishing stepsize.

Step 4: Interconnected iteration system analysis. Taking expectations and summing (23), (24), and (25) from 0 to T − 1, respectively, we obtain the following interconnected iteration system in terms of A(T), B(T), C(T):

A(T) ≤ O(1/√T) + b B(T) + b C(T),
B(T) ≤ O(1/√T) + d √(A(T) B(T)) + e C(T),
C(T) ≤ O(1/√T) + g √(B(T) C(T)),   (26)

where b, d, e, g are positive constants. By solving this system of inequalities, we further prove that if bd² + bd²g² + 2eg² < 1, then A(T), B(T), C(T) all converge at a rate of O(T^{-1/2}). This condition can easily be satisfied by choosing the stepsize ratio c_α to be smaller than a threshold defined in (52).

Step 5: Global convergence analysis. To prove global optimality, we utilize the gradient domination condition of LQR (Lemma A.12):

J(K) − J(K*) ≤ (1/σ_min(R)) ∥D_{K*}∥ Tr(E^⊤_K E_K).

This property shows that the actor performance error can be bounded by the squared norm of the natural gradient (that is, Tr(E^⊤_K E_K)). Since we have proved that the average natural gradient norm C(T) converges to zero, combining the above inequality over all iterations yields

min_{0≤t<T} E[J(K_t) − J(K*)] = O(1/√T),

which is the convergence of the actor performance error. This completes the proof of Theorem 4.3.
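To make Step 4 concrete, here is one standard way (a sketch with generic constants, repeatedly applying 2√(xy) ≤ x + y) to see why a coupled system of this shape forces an O(T^{-1/2}) rate:

```latex
% From the third inequality of the system,
C(T) \le O(T^{-1/2}) + g\sqrt{B(T)\,C(T)}
     \le O(T^{-1/2}) + \tfrac{g^2}{2}\,B(T) + \tfrac{1}{2}\,C(T)
\;\Longrightarrow\;
C(T) \le O(T^{-1/2}) + g^2\, B(T).
% Substituting this bound on C(T) into the second inequality and applying
% 2\sqrt{A(T)B(T)} \le A(T) + B(T) in the same way linearizes the system:
% each of A(T), B(T), C(T) is bounded by O(T^{-1/2}) plus a linear
% combination of the others. When the stepsize-ratio condition makes the
% associated coefficient matrix a contraction, solving the resulting
% linear system gives A(T), B(T), C(T) = O(T^{-1/2}).
```

The smallness condition on b, d, e, g (hence on c_α) is exactly what guarantees the contraction in the last step.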

5. EXPERIMENTS

We provide two numerical examples to illustrate our theoretical results. The first example is a two-dimensional system and the second is a four-dimensional system (see Appendix D for the system matrices and other settings). The performance of Algorithm 1 is shown in Figure 1, where the left column corresponds to the first example and the right column to the second. The solid lines plot the mean values and the shaded regions denote the 95% confidence interval over 10 independent runs. Consistent with our theorem, Figure 1(a) shows that the cost estimation error, the critic error, and the actor performance error all diminish at a rate of at least T^{-1/2}. The convergence also suggests that the intermediate closed-loop linear systems during the iterations are uniformly stable. We also compare Algorithm 1 with the zeroth-order method (Fazel et al., 2018) and the double-loop AC algorithm proposed in Yang et al. (2019) (listed in Algorithm 2 and Algorithm 3, respectively, in Appendix D). We plot the relative errors of the actor parameters for all methods in Figure 1(b). Algorithm 1 demonstrates superior sample efficiency compared to the other algorithms, which is well supported by our theoretical analysis.

Figure 1: (a) Learning results of Algorithm 1 (axes: Iteration vs. Error; curves: Cost estimation error, Critic error, Actor performance error). (b) ∥K − K*∥_F against sample complexity for single-timescale AC, two-timescale AC, the zeroth-order method, and double-loop AC. The Cost estimation error refers to (1/T) Σ_{t=0}^{T-1} (η_t − J(K_t))², the Critic error refers to (1/T) Σ_{t=0}^{T-1} ∥ω_t − ω*_{K_t}∥², and the Actor performance error refers to (1/T) Σ_{t=0}^{T-1} [J(K_t) − J(K*)].

6. CONCLUSION AND DISCUSSION

In this paper, we establish the first finite-time global convergence analysis of the single-sample single-timescale natural actor-critic method under the linear quadratic regulator (LQR) setting. Our work is the first to adopt the simplest single-sample single-timescale structure for solving LQR, which may serve as a first step towards understanding the limits of AC (NAC) on continuous control tasks. We provide a novel analysis framework that systematically establishes the convergence of the actor and the critic simultaneously. Our framework can be extended to analyze other single-timescale reinforcement learning algorithms.


A PROOF OF MAIN THEOREMS

We choose the stepsizes α_t = c_α/√(1+t), β_t = γ_t = 1/√(1+t); additional constant multipliers c_β, c_γ can be handled in a similar way. Before proceeding, we define the following notations for ease of presentation:

ω*_t := ω*_{K_t}, y_t := η_t − J(K_t), z_t := ω_t − ω*_t, O_t := (x_t, u_t, x′_t, u′_t), Ê_{K_t} := ∇̂^N_{K_t} J(K_t),   (27)

∆h(O, η, K) := [J(K) − η] ϕ(x, u),
h(O, ω, K) := [c(x, u) − J(K) + (ϕ(x′, u′) − ϕ(x, u))^⊤ ω] ϕ(x, u),
h̄(ω, K) := E_{(x,u)∼(ρ_K, π_K)}[[c(x, u) − J(K) + (ϕ(x′, u′) − ϕ(x, u))^⊤ ω] ϕ(x, u)],
Λ(O, ω, K) := ⟨ω − ω*_K, h(O, ω, K) − h̄(ω, K)⟩.

In the sequel, we establish implicit (coupled) upper bounds for the cost estimator, the critic, and the actor in Theorem A.7, Theorem A.10, and Theorem A.13, respectively. Then we prove the main Theorem 4.3 by solving an interconnected system of inequalities in Appendix A.4. Before starting, we define two notations that are used frequently in our proof.

Definition A.1. For any symmetric matrix M ∈ S^n, we define the vector svec(M) ∈ R^{n(n+1)/2} as

svec(M) = (m_11, √2 m_21, …, √2 m_n1, m_22, √2 m_32, …, √2 m_n2, …, m_nn)^⊤.

We further define its inverse smat(·) such that smat(svec(M)) = M.

A.1 COST ESTIMATION ERROR ANALYSIS

In this section, we establish an implicit upper bound for the cost estimator η_t in terms of the critic error and the natural gradient norm. We first give a uniform upper bound for the covariance matrix D_{K_t}.

Proposition A.2. (Upper bound for the covariance matrix). Suppose that Assumption 4.2 holds. The covariance matrix of the stationary distribution N(0, D_{K_t}) induced by the Markov chain in (8) can be upper bounded as

∥D_{K_t}∥ ≤ c_1 / (1 − ((1+ρ)/2)²) · ∥D_σ∥   (28)

for all t, where c_1 is a constant.

Note that the distribution of the state-action pair is unbounded, so the feature function is also unbounded. We can establish an upper bound on the tail probability of (x, u) by the Hanson-Wright inequality, whose proof can be found in Rudelson & Vershynin (2013).

Lemma A.3. (Hanson-Wright inequality). For any integer m > 0, let A be a matrix in R^{m×m} and let η ∼ N(0, I_m) be a standard Gaussian random vector in R^m. Then there exists an absolute constant c > 0 such that, for any θ ≥ 0,

P[|η^⊤ A η − E(η^⊤ A η)| > θ] ≤ 2 exp(−c · min{θ² ∥A∥_F^{-2}, θ ∥A∥^{-1}}).

With this lemma, we can provide a uniform upper bound on the cost with high probability.

Proposition A.4. (Upper bound for the cost). With probability at least 1 − 10^{-10}, for t = 0, 1, 2, …, T − 1, the cost satisfies ∥x_t∥² + ∥u_t∥² ≤ Ū and c(x_t, u_t) ≤ Ū, where

Ū = 2 c_2 (σ_max(Q) + σ_max(R) + 1) [σ² + (1 + K̄²) c_1 / (1 − ((1+ρ)/2)²) · ∥D_σ∥] log(10)

and c_2 is a constant. Hereafter, we use Ū as an upper bound for all costs c(x_t, u_t). As a consequence, we choose η_0 ≤ Ū so that η_t ≤ Ū for all t.

Lemma A.5. (Perturbation of P_K). Suppose K′ is a small perturbation of K in the sense that

∥K′ − K∥ ≤ (σ_min(D_0)/4) ∥D_K∥^{-1} ∥B∥^{-1} (∥A − BK∥ + 1)^{-1}.   (30)

Then we have

∥P_{K′} − P_K∥ ≤ 6 σ_min^{-1}(D_0) ∥D_K∥ ∥K∥ ∥R∥ (∥K∥∥B∥ · ∥A − BK∥ + ∥K∥∥B∥ + 1) ∥K − K′∥.

Proof. See Lemma 5.7 in Yang et al. (2019) for a detailed proof.
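The concentration behavior behind Lemma A.3 is easy to check empirically. The snippet below draws many standard Gaussian vectors and measures how often the quadratic form η^⊤Aη deviates from its mean Tr(A) by more than a few Frobenius norms; the dimension and threshold are arbitrary illustrative choices, not constants from the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
G = rng.standard_normal((m, m))
A = (G + G.T) / 2                                  # symmetric test matrix
eta = rng.standard_normal((100_000, m))            # 100k standard Gaussian vectors
q = np.einsum('ij,jk,ik->i', eta, A, eta)          # quadratic forms eta' A eta
mean = np.trace(A)                                 # E[eta' A eta] = trace(A)
theta = 4.0 * np.linalg.norm(A, 'fro')             # deviation threshold
tail = np.mean(np.abs(q - mean) > theta)           # empirical tail probability
print(tail)                                        # small, as Lemma A.3 predicts
```

Since Var(η^⊤Aη) = 2∥A∥_F² for symmetric A, the threshold above sits at roughly 2.8 standard deviations, so a small empirical tail is exactly what the lemma's sub-exponential bound predicts.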
With the perturbation bound on $P_K$, we are ready to prove the local Lipschitz continuity of $J(K)$.

Proposition A.6 (Local Lipschitz continuity of $J(K)$). Suppose Lemma A.5 holds. For any $K_t, K_{t+1}$, we have $|J(K_{t+1}) - J(K_t)| \le l_1 \|K_{t+1} - K_t\|$, where
$$l_1 := 6 c_1 d \bar K\,\sigma_{\min}^{-1}(D_0)\,\frac{\|D_\sigma\|^2}{1 - (\frac{1+\rho}{2})^2}\,\|R\|\big(\bar K\|B\|(\|A\| + \bar K\|B\| + 1) + 1\big). \tag{31}$$

Equipped with the above propositions and lemmas, we are able to bound the cost estimation error.

Theorem A.7. Suppose that Assumptions 4.1 and 4.2 hold and choose $\alpha_t = \frac{c_\alpha}{\sqrt{1+t}}$, $\beta_t = \gamma_t = \frac{1}{\sqrt{1+t}}$, where $c_\alpha$ is a small positive constant. With probability at least $1 - 10^{-10}$, we have
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E} y_t^2 \le \big(4 l_1^2 (\bar K + 1)^2 \bar\omega^2 c_\alpha^2 + 3\bar U^2\big)\frac{1}{\sqrt{T}} + \frac{l_1 c_\alpha}{T}\sum_{t=0}^{T-1}\mathbb{E}\|z_t\|^2 + \frac{l_1 c_\alpha}{T}\sum_{t=0}^{T-1}\mathbb{E}\|E_{K_t}\|^2. \tag{32}$$

Proof. From line 5 of Algorithm 1, we have
$$y_{t+1}^2 = \big(y_t + J(K_t) - J(K_{t+1}) + \gamma_t (c_t - \eta_t)\big)^2$$
$$\le y_t^2 + 2\gamma_t y_t (c_t - \eta_t) + 2 y_t \big(J(K_t) - J(K_{t+1})\big) + 2\big(J(K_t) - J(K_{t+1})\big)^2 + 2\gamma_t^2 (c_t - \eta_t)^2$$
$$= (1 - 2\gamma_t) y_t^2 + 2\gamma_t y_t \big(c_t - J(K_t)\big) + 2\gamma_t^2 (c_t - \eta_t)^2 + 2 y_t \big(J(K_t) - J(K_{t+1})\big) + 2\big(J(K_t) - J(K_{t+1})\big)^2.$$
Taking expectation up to $(x_t, u_t)$ on both sides, we have
$$\mathbb{E}[y_{t+1}^2] \le (1 - 2\gamma_t)\mathbb{E} y_t^2 + 2\gamma_t \mathbb{E}\big[y_t (c_t - J(K_t))\big] + 2\gamma_t^2 \mathbb{E}(c_t - \eta_t)^2 + 2\mathbb{E}\, y_t \big(J(K_t) - J(K_{t+1})\big) + 2\mathbb{E}\big(J(K_t) - J(K_{t+1})\big)^2.$$
To compute $\mathbb{E}[y_t (c_t - J(K_t))]$, we use $v_t$ to denote the vector $(x_t, u_t)$ and $v_{0:t}$ to denote the sequence $(x_0, u_0), (x_1, u_1), \dots, (x_t, u_t)$. Hence, we have
$$\mathbb{E}[y_t (c_t - J(K_t))] = \mathbb{E}_{v_{0:t}}\big[y_t (c_t - J(K_t))\big] = \mathbb{E}_{v_{0:t-1}}\, \mathbb{E}_{v_{0:t}}\big[y_t (c_t - J(K_t)) \,\big|\, v_{0:t-1}\big].$$
Once $v_{0:t-1}$ is known, $y_t$ is no longer random. Thus we get
$$\mathbb{E}_{v_{0:t-1}}\, \mathbb{E}_{v_{0:t}}\big[y_t (c_t - J(K_t)) \,\big|\, v_{0:t-1}\big] = \mathbb{E}_{v_{0:t-1}}\, y_t\, \mathbb{E}_{v_t}\big[c_t - J(K_t) \,\big|\, v_{0:t-1}\big] = 0.$$
Hereafter, we first verify that Lemma A.5 applies, and then use the local Lipschitz continuity of $J(K)$ provided by Proposition A.6 to bound the cost estimation error.
Since we have ∥K t+1 -K t ∥ = α t ∥(smat(ω t ) 22 K t -smat(ω t ) 21 )∥, to satisfy (30), we choose c α ≤ (1 -( 1+ρ 2 ) 2 )σ min (D 0 ) 4c 1 ∥D σ ∥∥B∥(1 + ∥A∥ + K∥B∥)( K + 1)ω . Hence, according to the update rule, we have ∥K t+1 -K t ∥ =α t ∥(smat(ω t ) 22 K t -smat(ω t ) 21 )∥ ≤ c α (1 + t) δ ( K∥smat(ω t ) 22 ∥ + ∥smat(ω t ) 21 ∥) ≤ c α (1 + t) δ ( K∥ω t ∥ + ∥ω t ∥) ≤ c α (1 + t) δ ( K + 1)ω ≤ (1 -( 1+ρ 2 ) 2 )σ min (D 0 ) 4c 1 ∥D σ ∥∥B∥(1 + ∥A∥ + K∥B∥) 1 (1 + t) δ ≤ σ min (D 0 ) 4 ∥D Kt ∥ -1 ∥B∥ -1 (∥A -BK t ∥ + 1) -1 , where the last inequality comes from (28). Thus Lemma A.5 holds for Algorithm 1. As a consequence, Proposition A.6 is also guaranteed. Combining the fact 2γ t E[y t (c t -J(K t ))] = 0, we get E[y 2 t+1 ] ≤(1 -2γ t )Ey 2 t + 2Ey t (J(K t ) -J(K t+1 )) + 2E(J(K t ) -J(K t+1 )) 2 + 2γ 2 t E(c t -η t ) 2 ≤(1 -2γ t )Ey 2 t + 2E|y t ||J(K t ) -J(K t+1 )| + 2E(J(K t ) -J(K t+1 )) 2 + 2γ 2 t E(c t -η t ) 2 ≤(1 -2γ t )Ey 2 t + 2l 1 E|y t |∥K t -K t+1 ∥ + 2E(J(K t ) -J(K t+1 )) 2 + 2γ 2 t E(c t -η t ) 2 ≤(1 -2γ t )Ey 2 t + 2l 1 α t E|y t |∥ E Kt ∥ + 2E(J(K t ) -J(K t+1 )) 2 + 2γ 2 t E(c t -η t ) 2 ≤(1 -2γ t )Ey 2 t + 2l 1 α t E|y t |∥ E Kt -E Kt + E Kt ∥ + 2E(J(K t ) -J(K t+1 )) 2 + 2γ 2 t E(c t -η t ) 2 ≤(1 -2γ t )Ey 2 t + 2l 1 α t E[(2 K2 + 2)|y t |∥z t ∥ + |y t |∥E Kt ∥] + 2E(J(K t ) -J(K t+1 )) 2 + 2γ 2 t E(c t -η t ) 2 ≤(1 -2γ t )Ey 2 t + 2l 1 α t E[2( K2 + 1) 2 y 2 t + ∥z t ∥ 2 /2 + y 2 t /2 + ∥E Kt ∥ 2 /2] + 2E(J(K t ) -J(K t+1 )) 2 + 2γ 2 t E(c t -η t ) 2 ≤(1 -(2γ t -2l 1 α t (2( K2 + 1) 2 + 1 2 )))Ey 2 t + l 1 α t E∥z t ∥ 2 + l 1 α t E∥E Kt ∥ 2 + 2E(J(K t ) -J(K t+1 )) 2 + 2γ 2 t E(c t -η t ) 2 , where we use the fact that ∥ E Kt -E Kt ∥ ≤ 2( K + 1)∥ω t -ω * t ∥. Choose c α small enough such that 2l 1 c α (2( K2 + 1) 2 + 1 2 ) ≤ 1. Then we get γ t ≥ 2l 1 α t (2( K2 + 1) 2 + 1 2 ). 
Thus we have E[y 2 t+1 ] ≤(1 -γ t )Ey 2 t + l 1 α t E∥z t ∥ 2 + l 1 α t E∥E Kt ∥ 2 + 2E(J(K t ) -J(K t+1 )) 2 + 2γ 2 t E(c t -η t ) 2 Rearranging and summing from 0 to T -1, we have T -1 t=0 Ey 2 t ≤ T -1 t=0 1 γ t E(y 2 t -y 2 t+1 ) I1 + T -1 t=0 2 γ t E(J(K t ) -J(K t+1 )) 2 I2 + T -1 t=0 2γ t E(c t -η t ) 2 I3 + l 1 c α T -1 t=0 E∥z t ∥ 2 + l 1 c α T -1 t=0 E∥E Kt ∥ 2 . In the sequel, we need to control I 1 , I 2 , I 3 respectively. For I 1 , following Abel summation by parts, we have I 1 = T -1 t=0 1 γ t E(y 2 t -y 2 t+1 ) = T -1 t=1 ( 1 γ t - 1 γ t-1 )E(y 2 t ) + 1 γ 0 E(y 2 0 ) - 1 γ T -1 E(y 2 T ) ≤ Ū 2 T -1 t=1 ( 1 γ t - 1 γ t-1 ) + 1 γ 0 Ū 2 ≤ Ū 2 γ T -1 = Ū 2 √ T , where we use the fact that |y t | ≤ Ū . For I 2 , we get I 2 = T -1 t=0 2 γ t E(J(K t ) -J(K t+1 )) 2 ≤ 2l 2 1 ( K + 1) 2 ω2 T -1 t=0 1 γ t α 2 t = 2l 2 1 ( K + 1) 2 ω2 c 2 α T -1 t=0 1 (1 + t) ≤ 4l 2 1 ( K + 1) 2 ω2 c 2 α √ T , where the last inequality is due to T -1 t=0 1 (1 + t) ≤ T 0 t -1 2 dt = 2 √ T . For I 3 , we have I 3 = T -1 t=0 γ t E(c t -η t ) 2 ≤ T -1 t=0 γ t Ū 2 ≤ 2 Ū 2 √ T . where we use the fact 0 ≤ c t , η t ≤ Ū derived by Proposition A.4. Combining all terms together, we get T -1 t=0 Ey 2 t ≤(4l 2 1 ( K + 1) 2 ω2 c 2 α + 3 Ū 2 ) √ T + l 1 c α T -1 t=0 E∥z t ∥ 2 + l 1 c α T -1 t=0 E∥E Kt ∥ 2 . Dividing by T , we have 1 T T -1 t=0 Ey 2 t ≤ (4l 2 1 ( K + 1) 2 ω2 c 2 α + 3 Ū 2 ) 1 √ T + l 1 c α T T -1 t=0 E∥z t ∥ 2 + l 1 c α T T -1 t=0 E∥E Kt ∥ 2 . Thus we finish our proof.
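A small simulation illustrates the mechanism behind Theorem A.7: the running estimator $\eta_{t+1} = \eta_t + \gamma_t(c_t - \eta_t)$ with $\gamma_t = 1/\sqrt{1+t}$ tracks a slowly drifting average cost, so the average of $y_t^2$ shrinks over time. The drift model below is a hypothetical stand-in for the slowly moving $J(K_t)$.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20000
eta, ys = 0.0, []
for t in range(T):
    J_t = 1.0 + 0.5 / np.sqrt(1 + t)       # slowly drifting average cost (stand-in for J(K_t))
    c_t = J_t + rng.standard_normal()      # one-sample noisy cost observation
    gamma_t = 1.0 / np.sqrt(1 + t)
    ys.append((eta - J_t) ** 2)            # y_t^2 = (eta_t - J(K_t))^2
    eta += gamma_t * (c_t - eta)           # line 5 of Algorithm 1
first, last = np.mean(ys[:1000]), np.mean(ys[-1000:])
print(first, last)                         # late-window error is much smaller
```

The late-window mean-squared error is on the order of $\gamma_T$, consistent with the $O(1/\sqrt{T})$ rate in (32).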

A.2 CRITIC ERROR ANALYSIS

In this section, we derive an implicit bound for the critic error in terms of the cost estimation error and the natural gradient norm. First, we need the following propositions.

Proposition A.8. For all $K_t$, there exists a constant $\mu > 0$ such that $\sigma_{\min}(A_{K_t}) \ge \mu$.

Proposition A.9 (Lipschitz continuity of $\omega_t^*$). For any $\omega_t^*, \omega_{t+1}^*$, we have $\|\omega_t^* - \omega_{t+1}^*\| \le l_2 \|K_t - K_{t+1}\|$, where $l_2 = 6 c_1 d\,\bar K (\|A\| + \|B\|)^2 \sigma_{\min}^{-1}(D_0)$.

Theorem A.10. Suppose that Assumptions 4.1 and 4.2 hold and choose $\alpha_t = \frac{c_\alpha}{\sqrt{1+t}}$, $\beta_t = \gamma_t = \frac{1}{\sqrt{1+t}}$, where $c_\alpha$ is a small positive constant. With probability at least $1 - 10^{-10}$, we have
$$\frac{1}{T}\sum_{t=1}^{T-1}\mathbb{E}\|z_t\|^2 \le \frac{4}{\mu}\big(\bar U^4 (1 + 2\bar\omega)^2 + \bar\omega^2 + l_2^2 c_3^2\big)\frac{1}{\sqrt{T}} + \frac{l_2 c_\alpha}{\mu T}\sum_{t=0}^{T-1}\mathbb{E}\|E_{K_t}\|^2 + \frac{2\bar U}{\mu}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E} y_t^2\Big)^{\frac12}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|z_t\|^2\Big)^{\frac12}. \tag{38}$$

Proof. Since we have $A_{K_t}\omega_t^* = b_{K_t}$, where $b_{K_t} = \mathbb{E}_{(x_t, u_t)}\big[(c(x_t, u_t) - J(K_t))\phi(x_t, u_t)\big]$, we can further get
$$\|\omega_t^*\| = \|A_{K_t}^{-1} b_{K_t}\| \le \frac{1}{\mu}\bar U \|\phi(x_t, u_t)\| \le \frac{1}{\mu}\bar U^2,$$
where in the last inequality we use the fact that
$$\|\phi(x, u)\| = \Big\|\binom{x}{u}(x^\top\ u^\top)\Big\|_F \le \mathrm{Tr}\Big(\binom{x}{u}(x^\top\ u^\top)\Big) = \|x\|^2 + \|u\|^2 \le \bar U.$$
Hence, we set $\bar\omega = \frac{1}{\mu}\bar U^2$ (39) so that every $\omega_t^*$ lies within this projection radius. From the critic update rule in Algorithm 1, we have $\omega_{t+1} = \Pi_{\bar\omega}\big(\omega_t + \beta_t \delta_t \phi(x_t, u_t)\big)$, which further implies $\omega_{t+1} - \omega_{t+1}^* = \Pi_{\bar\omega}\big(\omega_t + \beta_t \delta_t \phi(x_t, u_t)\big) - \omega_{t+1}^*$. By the 1-Lipschitz continuity of the projection map, we have
$$\|\omega_{t+1} - \omega_{t+1}^*\| = \big\|\Pi_{\bar\omega}\big(\omega_t + \beta_t \delta_t \phi(x_t, u_t)\big) - \Pi_{\bar\omega}(\omega_{t+1}^*)\big\| \le \big\|\omega_t + \beta_t \delta_t \phi(x_t, u_t) - \omega_{t+1}^*\big\| = \big\|\omega_t - \omega_t^* + \beta_t \delta_t \phi(x_t, u_t) + (\omega_t^* - \omega_{t+1}^*)\big\|.$$
This means ∥z t+1 ∥ 2 ≤∥z t + β t δ t ϕ(s t , a t ) + (ω * t -ω * t+1 )∥ 2 =∥z t + β t (h(O t , ω t , K t ) + ∆h(O t , η t , K t )) + (ω * t -ω * t+1 )∥ 2 =∥z t ∥ 2 + 2β t ⟨z t , h(O t , ω t , K t )⟩ + 2β t ⟨z t , ∆h(O t , η t , K t )⟩ + 2⟨z t , ω * t -ω * t+1 ⟩ + ∥β t (h(O t , ω t , K t ) + ∆h(O t , η t , K t )) + (ω * t -ω * t+1 )∥ 2 =∥z t ∥ 2 + 2β t ⟨z t , h(ω t , K t )⟩ + 2β t Λ(O t , ω t , K t ) + 2β t ⟨z t , ∆h(O t , η t , K t )⟩ + 2⟨z t , ω * t -ω * t+1 ⟩ + ∥β t (h(O t , ω t , K t ) + ∆h(O t , η t , K t )) + (ω * t -ω * t+1 )∥ 2 ≤∥z t ∥ 2 + 2β t ⟨z t , h(ω t , K t )⟩ + 2β t Λ(O t , ω t , K t ) + 2β t ⟨z t , ∆h(O t , η t , K t )⟩ + 2⟨z t , ω * t -ω * t+1 ⟩ + 2β 2 t ∥h(O t , ω t , K t ) + ∆h(O t , η t , K t ))∥ 2 + 2∥ω * t -ω * t+1 ∥ 2 . From Proposition A.8, we know that σ min (A Kt ) ≥ µ for all K t . Then we have ⟨z t , h(ω t , K t )⟩ = ⟨z t , b Kt -A Kt ω t ⟩ = ⟨z t , b Kt -A Kt w t -(b Kt -A Kt ω * t )⟩ = ⟨z t , -A Kt z t ⟩ = -z ⊤ t A Kt z t ≤ -µ∥z t ∥ 2 , where we use the fact A K ω * Kt -b Kt = 0. Hence, we have ∥z t+1 ∥ 2 ≤(1 -2µβ t )∥z t ∥ 2 + 2β t Λ(O t , ω t , K t ) + 2β t ⟨z t , ∆h(O t , η t , K t )⟩ + 2⟨z t , ω * t -ω * t+1 ⟩ + 2β 2 t Ū 4 (1 + ω) 2 + 2∥ω * t -ω * t+1 ∥ 2 , where we use the fact that ∥h(O t , ω t , K t ) + ∆h(O t , η t , K t ))∥ =∥(c(x t , u t ) -η t )ϕ(x t , u t ) + (ϕ(x ′ t , u ′ t ) -ϕ(x t , u t )) ⊤ ω t ϕ(x t , u t )∥ ≤∥(c(x t , u t ) -η t )ϕ(x t , u t )∥ + ∥(ϕ(x ′ t , u ′ t ) -ϕ(x t , u t )) ⊤ ω t ϕ(x t , u t )∥ ≤ Ū 2 + 2 Ū 2 ω = Ū 2 (1 + 2ω). Taking expectation up to (x t , u t ) and noticing that E[Λ(O t , ω t , K t )] = E v0:t [⟨ω t -ω * Kt , h(O t , ω t , K t ) -h(ω t , K t )⟩] = E v0:t-1 E v0:t [⟨ω t -ω * Kt , h(O t , ω t , K t ) -h(ω t , K t )⟩|v 0:t-1 ] = E v0:t-1 ⟨ω t -ω * Kt , E vt [h(O t , ω t , K t ) -h(ω t , K t )|v 0:t-1 ]⟩ = 0, we get E∥z t+1 ∥ 2 ≤(1 -2µβ t )E∥z t ∥ 2 + 2β t E⟨z t , ∆h(O t , η t , K t )⟩ + 2E⟨z t , ω * t -ω * t+1 ⟩ + 2E∥ω * t -ω * t+1 ∥ 2 + 2 Ū 4 (1 + 2ω) 2 β 2 t . 
Therefore, using ∥∆h(O t , η t , K t )∥ ≤ Ū |y t |, we can further rewrite (40) as E∥z t+1 ∥ 2 ≤(1 -2µβ t )E∥z t ∥ 2 + 2E⟨z t , ω * t -ω * t+1 ⟩ + 2 Ū β t E|y t |∥z t ∥ + 2β 2 t Ū 4 (1 + 2ω) 2 + 2E∥ω * t -ω * t+1 ∥ 2 . Based on (36), we can rewrite the above inequality as E∥z t+1 ∥ 2 ≤(1 -2µβ t )E∥z t ∥ 2 + 2 Ū β t E|y t |∥z t ∥ + 2l 2 E∥z t ∥∥K t -K t+1 ∥ + 2 Ū 4 (1 + 2ω) 2 β 2 t + 2l 2 2 E∥K t -K t+1 ∥ 2 ≤(1 -2µβ t )E∥z t ∥ 2 + 2 Ū β t E|y t |∥z t ∥ + 2l 2 α t E∥z t ∥∥ E Kt ∥ + 2 Ū 4 (1 + 2ω) 2 β 2 t + 2l 2 2 E∥K t -K t+1 ∥ 2 ≤(1 -2µβ t )E∥z t ∥ 2 + 2 Ū β t E|y t |∥z t ∥ + 2l 2 α t E∥z t ∥∥ E Kt -E Kt + E Kt ∥ + 2 Ū 4 (1 + 2ω) 2 β 2 t + 2l 2 2 E∥K t -K t+1 ∥ 2 ≤(1 -2µβ t )E∥z t ∥ 2 + 2l 2 α t E[∥z t ∥∥ E Kt -E Kt ∥ + ∥z t ∥∥E Kt ∥] + 2 Ū β t E|y t |∥z t ∥ + 2 Ū 4 (1 + 2ω) 2 β 2 t + 2l 2 2 E∥K t -K t+1 ∥ 2 ≤(1 -2µβ t )E∥z t ∥ 2 + 2l 2 α t E[2( K + 1)∥z t ∥ 2 + ∥z t ∥ 2 2 + ∥E Kt ∥ 2 2 ] + 2 Ū β t E|y t |∥z t ∥ + 2 Ū 4 (1 + 2ω) 2 β 2 t + 2l 2 2 E∥K t -K t+1 ∥ 2 ≤(1 -2µβ t )E∥z t ∥ 2 + 2 Ū β t E|y t |∥z t ∥ + (4 K + 5)l 2 α t E∥z t ∥ 2 + l 2 α t E∥E Kt ∥ 2 + 2( Ū 4 (1 + 2ω) 2 + l 2 2 c 2 3 )β 2 t ( ) where the second inequality is due to ∥K t -K t+1 ∥ ≤ c3 (1+t) δ = c 3 β t from (34), where c 3 := (1 -( 1+ρ 2 ) 2 )σ min (D 0 ) 4c 1 ∥D σ ∥∥B∥(1 + ∥A∥ + K∥B∥) . ( ) Choose c α small enough such that (4 K + 5)l 2 c α ≤ µ. ( ) Thus we can rewrite 41 as E∥z t+1 ∥ 2 ≤(1 -µβ t )E∥z t ∥ 2 + 2 Ū β t E|y t |∥z t ∥ + l 2 α t E∥E Kt ∥ 2 + 2( Ū 4 (1 + 2ω) 2 + l 2 2 c 2 3 )β 2 t Rearranging the inequality and summing from 0 to T -1 yields µ T -1 t=1 E∥z t ∥ 2 ≤ T -1 t=0 1 β t E(∥z t ∥ 2 -∥z t+1 ∥ 2 ) + 2 Ū T -1 t=0 E|y t |∥z t ∥ + l 2 c α T -1 t=0 E∥E Kt ∥ 2 + 2( Ū 4 (1 + 2ω) 2 + l 2 2 c 2 3 ) T -1 t=0 β t ≤ T -1 t=0 1 β t E(∥z t ∥ 2 -∥z t+1 ∥ 2 ) I1 +2 Ū T -1 t=0 E|y t |∥z t ∥ I2 +l 2 c α T -1 t=0 E∥E Kt ∥ 2 + 4( Ū 4 (1 + 2ω) 2 + l 2 2 c 2 3 ) √ T . orc We need to control I 1 and I 2 , respectively. 
For term I 1 , from Abel summation by parts, we have I 1 = T -1 t=0 1 β t E(∥z t ∥ 2 -∥z t+1 ∥ 2 ) = T -1 t=1 ( 1 β t - 1 β t-1 )E∥z t ∥ 2 + 1 β 0 E∥z 0 ∥ 2 - 1 β T -1 E∥z T ∥ 2 ≤ T -1 t=1 ( 1 β t - 1 β t-1 )E∥z t ∥ 2 + 1 β 0 E∥z 0 ∥ 2 ≤4ω 2 ( T -1 t=1 ( 1 β t - 1 β t-1 ) + 1 β 0 ) =4ω 2 1 β T -1 =4ω 2 √ T . For I 2 , from Cauchy-Schwartz inequality, we have I 2 = T -1 t=0 E|y t |∥z t ∥ ≤ T -1 t=0 (Ey 2 t ) 1 2 (E∥z t ∥ 2 ) 1 2 ≤( T -1 t=0 Ey 2 t ) 1 2 ( T -1 t=0 E∥z t ∥ 2 ) 1 2 . 1 σ min (R) ∥D K * ∥Tr(E ⊤ K E K ). Theorem A.13. Suppose that Assumptions 4.1 and 4.2 hold and choose α t = cα √ 1+t , β t = γ t = 1 √ 1+t , where c α is a small positive constant. With probability at least 1 -10 -10 , we have 1 T T -1 t=0 E∥E Kt ∥ 2 ≤ ( Ū + 2c 4 c 2 α 2σ min (D 0 )c α ) 1 √ T + c 5 ( K + 1) σ min (D 0 ) ( 1 T T -1 t=0 E∥z t ∥ 2 ) 1 2 ( 1 T T -1 t=0 E∥E Kt ∥) 1 2 . ( ) Proof. Combining the almost smoothness property, we get J(K t+1 ) -J(K t ) = -2Tr(D Kt+1 (K t -K t+1 ) ⊤ E Kt ) + Tr(D Kt+1 (K t -K t+1 ) ⊤ (R + B ⊤ P Kt B)(K t -K t+1 )) = -2α t Tr(D Kt+1 Ê⊤ Kt E Kt ) + α 2 t Tr(D Kt+1 Ê⊤ Kt (R + B ⊤ P Kt B) ÊKt ) = -2α t Tr(D Kt+1 ( ÊKt -E Kt ) ⊤ E Kt ) -2α t Tr(D Kt+1 E ⊤ Kt E Kt ) + α 2 t Tr(D Kt+1 Ê⊤ Kt (R + B ⊤ P Kt B) ÊKt ) . By the similar trick to the proof of Proposition A.2, we can bound P Kt by ∥P Kt ∥ ≤ ĉ1 1 -( 1+ρ 2 ) 2 ∥Q + K ⊤ RK∥ ≤ ĉ1 (σ max (Q) + K2 σ max (R)) 1 -( 1+ρ 2 ) 2 , where ĉ1 is a constant. Hence we further have Tr(D Kt+1 Ê⊤ Kt (R + B ⊤ P Kt B) ÊKt ) ≤d∥D Kt+1 ∥∥R + B ⊤ P Kt B∥∥ ÊKt ∥ 2 F ≤d( K + 1) 2 ω2 c 1 ∥D σ ∥ 1 -( 1+ρ 2 ) 2 (σ max (R) + σ 2 max (B) ĉ1 (σ max (Q) + K2 σ max (R)) 1 -( 1+ρ 2 ) 2 ), where we use ∥ ÊKt ∥ F ≤ ( K + 1)ω. Hence we define c 4 as follows c 4 :=d( K + 1) 2 ω2 c 1 ∥D σ ∥ 1 -( 1+ρ 2 ) 2 (σ max (R) + σ 2 max (B) ĉ1 (σ max (Q) + K2 σ max (R)) 1 -( 1+ρ 2 ) 2 ). 
Then we get J(K t+1 ) -J(K t ) ≤ -2α t Tr(D Kt+1 ( ÊKt -E Kt ) ⊤ E Kt ) -2α t Tr(D Kt+1 E ⊤ Kt E Kt ) + c 4 α 2 t ≤α t 2c 1 d 3 2 ∥D σ ∥ 1 -( 1+ρ 2 ) 2 ∥E Kt ∥∥ ÊKt -E Kt ∥ -2α t σ min (D 0 )∥E Kt ∥ 2 + c 4 α 2 t =c 5 α t ∥E Kt ∥∥ ÊKt -E Kt ∥ -2α t σ min (D 0 )∥E Kt ∥ 2 + c 4 α 2 t , c 5 := 2c 1 d 3 2 ∥D σ ∥ 1 -( 1+ρ 2 ) 2 . Taking expectation up to (x t , u t ) and rearranging the above inequality, we have E∥E Kt ∥ 2 ≤ E[J(K t ) -J(K t+1 )] 2α t σ min (D 0 ) + c 5 2σ min (D 0 ) E∥E Kt ∥∥ ÊKt -E Kt ∥ + c 4 α t 2σ min (D 0 ) . Summing over t from 0 to T -1 gives T -1 t=0 E∥E Kt ∥ 2 ≤ T -1 t=0 E[J(K t ) -J(K t+1 )] 2α t σ min (D 0 ) I1 + c 5 2σ min (D 0 )) T -1 t=0 E∥E Kt ∥∥ ÊKt -E Kt ∥ I2 + c 4 c α σ min (D 0 ) √ T . For term I 1 , using Abel summation by parts, we have T -1 t=0 E[J(K t ) -J(K t+1 )] 2α t σ min (D 0 ) = 1 2σ min (D 0 ) ( T -1 t=1 ( 1 α t - 1 α t-1 )E[J(K t )] + 1 α 0 E[J(K 0 )] - 1 α T -1 E[J(K T )]) ≤ Ū 2σ min (D 0 ) ( T -1 t=1 ( 1 α t - 1 α t-1 ) + 1 α 0 ) = Ū 2σ min (D 0 ) 1 α T -1 = Ū 2c α σ min (D 0 ) √ T . For term I 2 , by Cauchy-Schwartz inequality, we have T -1 t=0 E∥E Kt ∥∥ ÊKt -E Kt ∥ ≤ ( T -1 t=0 E∥E Kt ∥ 2 ) 1 2 ( T -1 t=0 E∥ ÊKt -E Kt ∥ 2 ) 1 2 . Combining the results of I 1 and I 2 , we have T -1 t=0 E∥E Kt ∥ 2 ≤( Ū + 2c 4 c 2 α 2σ min (D 0 )c α ) √ T + c 5 2σ min (D 0 ) ( T -1 t=0 E∥E Kt ∥ 2 ) 1 2 ( T -1 t=0 E∥ ÊKt -E Kt ∥ 2 ) 1 2 ≤( Ū + 2c 4 c 2 α 2σ min (D 0 )c α ) √ T + c 5 ( K + 1) σ min (D 0 ) ( T -1 t=0 E∥z t ∥ 2 ) 1 2 ( T -1 t=0 E∥E Kt ∥) 1 2 . Dividing by T , we get 1 T T -1 t=0 E∥E Kt ∥ 2 ≤ ( Ū + 2c 4 c 2 α 2σ min (D 0 )c α ) 1 √ T + c 5 ( K + 1) σ min (D 0 ) ( 1 T T -1 t=0 E∥z t ∥ 2 ) 1 2 ( 1 T T -1 t=0 E∥E Kt ∥) 1 2 . Thus we conclude our proof.
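The quantity $E_K = (R + B^\top P_K B)K - B^\top P_K A$ that drives Theorem A.13 can be computed exactly for a toy system: it vanishes at the optimal gain obtained from the discrete algebraic Riccati equation, and a small natural-gradient step along $-E_K$ decreases the cost. The matrices, perturbation, and stepsize below are all illustrative choices of ours.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# illustrative system (not from the paper)
A = np.array([[1.0, 0.2], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R, D_sigma = np.eye(2), np.eye(1), np.eye(2)

def P_of(K):
    # P_K solves the Lyapunov equation P = Q + K^T R K + (A - BK)^T P (A - BK)
    L = A - B @ K
    return solve_discrete_lyapunov(L.T, Q + K.T @ R @ K)

def J_of(K):
    # time-average cost up to the additive sigma^2 Tr(R) constant: Tr(P_K D_sigma)
    return np.trace(P_of(K) @ D_sigma)

def E_of(K):
    # natural gradient direction E_K = (R + B^T P_K B) K - B^T P_K A
    P = P_of(K)
    return (R + B.T @ P @ B) @ K - B.T @ P @ A

P_star = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)
print(np.linalg.norm(E_of(K_star)))        # ~ 0: E_{K*} vanishes at the DARE solution

K = K_star + 0.1                           # perturbed (still stabilizing) gain
assert max(abs(np.linalg.eigvals(A - B @ K))) < 1
K_next = K - 0.05 * E_of(K)                # one small natural-gradient step
print(J_of(K) - J_of(K_next))              # positive: the step decreases the cost
```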

A.4 INTERCONNECTED ITERATION SYSTEM ANALYSIS

From the definition in (22), we have
$$A(T) = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E} y_t^2, \qquad B(T) = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|z_t\|^2, \qquad C(T) = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|E_{K_t}\|^2.$$
In the following, we analyze the interconnected iteration system formed by $A(T)$, $B(T)$, and $C(T)$.

Theorem A.14. Combining (32), (38) and (44), we have $A(T) = O(\frac{1}{\sqrt{T}})$, $B(T) = O(\frac{1}{\sqrt{T}})$, $C(T) = O(\frac{1}{\sqrt{T}})$.

Proof. From (32), (38) and (44), we have
$$A(T) \le \big(4 l_1^2 (\bar K + 1)^2 \bar\omega^2 c_\alpha^2 + 3\bar U^2\big)\frac{1}{\sqrt{T}} + l_1 c_\alpha B(T) + l_1 c_\alpha C(T),$$
$$B(T) \le \frac{4}{\mu}\big(\bar U^4(1 + 2\bar\omega)^2 + \bar\omega^2 + l_2^2 c_3^2\big)\frac{1}{\sqrt{T}} + \frac{2\bar U}{\mu}\sqrt{A(T)B(T)} + \frac{l_2 c_\alpha}{\mu} C(T),$$
$$C(T) \le \frac{\bar U + 2 c_4 c_\alpha^2}{2\sigma_{\min}(D_0) c_\alpha}\,\frac{1}{\sqrt{T}} + \frac{c_5(\bar K + 1)}{\sigma_{\min}(D_0)}\sqrt{B(T)C(T)}.$$
For simplicity, we denote
$$a = \big(4 l_1^2 (\bar K + 1)^2 \bar\omega^2 c_\alpha^2 + 3\bar U^2\big)\frac{1}{\sqrt{T}}, \quad b = l_1 c_\alpha, \quad c = \frac{4}{\mu}\big(\bar U^4(1 + 2\bar\omega)^2 + \bar\omega^2 + l_2^2 c_3^2\big)\frac{1}{\sqrt{T}},$$
$$d = \frac{2\bar U}{\mu}, \quad e = \frac{l_2 c_\alpha}{\mu}, \quad f = \frac{\bar U + 2 c_4 c_\alpha^2}{2\sigma_{\min}(D_0) c_\alpha}\,\frac{1}{\sqrt{T}}, \quad g = \frac{c_5(\bar K + 1)}{\sigma_{\min}(D_0)}.$$

Thus we further have

$$A(T) \le a + b B(T) + b C(T), \qquad B(T) \le c + d\sqrt{A(T)B(T)} + e C(T), \qquad C(T) \le f + g\sqrt{B(T)C(T)}. \tag{49}$$
By Young's inequality, $d\sqrt{A(T)B(T)} \le \frac12\big(d^2 A(T) + B(T)\big)$, so
$$B(T) \le c + \tfrac12\big(d^2 A(T) + B(T)\big) + e C(T), \quad\text{i.e.,}\quad B(T) \le 2c + d^2 A(T) + 2e C(T). \tag{50}$$
For $C(T)$, we similarly get
$$C(T) \le f + \tfrac12\big(g^2 B(T) + C(T)\big), \quad\text{i.e.,}\quad C(T) \le 2f + g^2 B(T). \tag{51}$$
Combining (49), (50) and (51), we have
$$B(T) \le 2c + d^2\big(a + b B(T) + b(2f + g^2 B(T))\big) + 2e\big(2f + g^2 B(T)\big) = 2c + a d^2 + 2 b d^2 f + 4 e f + \big(b d^2 + b d^2 g^2 + 2 e g^2\big) B(T).$$
If $b d^2 + b d^2 g^2 + 2 e g^2 < 1$, we have
$$B(T) \le \frac{2c + a d^2 + 2 b d^2 f + 4 e f}{1 - b d^2 - b d^2 g^2 - 2 e g^2}.$$
Note that
$$b d^2 + b d^2 g^2 + 2 e g^2 = c_\alpha\Big(\frac{4 l_1 \bar U^2}{\mu^2} + \frac{4 l_1 \bar U^2}{\mu^2}\,\frac{c_5^2(\bar K + 1)^2}{\sigma_{\min}^2(D_0)} + \frac{2 l_2 c_5^2(\bar K + 1)^2}{\mu\,\sigma_{\min}^2(D_0)}\Big).$$
Thus we can ensure $b d^2 + b d^2 g^2 + 2 e g^2 < 1$ by choosing the stepsize ratio $c_\alpha$ smaller than the threshold
$$1\Big/\Big(\frac{4 l_1 \bar U^2}{\mu^2} + \frac{4 l_1 \bar U^2}{\mu^2}\,\frac{c_5^2(\bar K + 1)^2}{\sigma_{\min}^2(D_0)} + \frac{2 l_2 c_5^2(\bar K + 1)^2}{\mu\,\sigma_{\min}^2(D_0)}\Big).$$
Therefore, since $a$, $c$, and $f$ are all $O(\frac{1}{\sqrt{T}})$, we get
$$B(T) \le \frac{2c + a d^2 + 2 b d^2 f + 4 e f}{1 - b d^2 - b d^2 g^2 - 2 e g^2} = O\Big(\frac{1}{\sqrt{T}}\Big), \qquad C(T) \le 2f + g^2 B(T) = O\Big(\frac{1}{\sqrt{T}}\Big), \qquad A(T) \le a + b B(T) + b C(T) = O\Big(\frac{1}{\sqrt{T}}\Big),$$
which concludes the proof.
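The structure of the argument can be illustrated numerically: iterating the right-hand sides of the coupled inequalities with hypothetical constants that satisfy the threshold condition $bd^2 + bd^2g^2 + 2eg^2 < 1$ drives all three quantities to a fixed point of size $O(1/\sqrt{T})$. All constants below are our own illustrative choices, not the paper's.

```python
import numpy as np

# Hypothetical constants for the coupled system
#   A <= a + b*B + b*C,  B <= c + d*sqrt(A*B) + e*C,  C <= f + g*sqrt(B*C),
# chosen so that the contraction condition b*d^2 + b*d^2*g^2 + 2*e*g^2 < 1 holds.
b, d, e, g = 0.05, 0.8, 0.05, 0.6
assert b * d**2 + b * d**2 * g**2 + 2 * e * g**2 < 1

def solve(T):
    a = c = f = 1.0 / np.sqrt(T)           # O(1/sqrt(T)) drive terms
    A = B = C = 1.0                        # pessimistic initial guess
    for _ in range(200):                   # iterate the right-hand sides to a fixed point
        A, B, C = (a + b * B + b * C,
                   c + d * np.sqrt(A * B) + e * C,
                   f + g * np.sqrt(B * C))
    return A, B, C

for T in (1e2, 1e4, 1e6):
    A, B, C = solve(T)
    print(T, B * np.sqrt(T))               # roughly constant => B(T) = O(1/sqrt(T))
```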

A.5 GLOBAL CONVERGENCE ANALYSIS

Proof of Theorem 4.3. Proof. From gradient domination (Lemma A.12), we know that
$$\mathbb{E}\big(J(K_t) - J(K^*)\big) \le \frac{\|D_{K^*}\|}{\sigma_{\min}(R)}\,\mathbb{E}\big[\mathrm{Tr}(E_{K_t}^\top E_{K_t})\big] \le \frac{d\|D_{K^*}\|}{\sigma_{\min}(R)}\,\mathbb{E}\|E_{K_t}\|^2. \tag{53}$$
From the convergence of $C(T)$, we know that $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|E_{K_t}\|^2 = O(\frac{1}{\sqrt{T}})$. Hence, we have
$$\min_{0 \le t < T} \frac{d\|D_{K^*}\|}{\sigma_{\min}(R)}\,\mathbb{E}\|E_{K_t}\|^2 \le \frac{d\|D_{K^*}\|}{\sigma_{\min}(R)}\,\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|E_{K_t}\|^2 = O\Big(\frac{1}{\sqrt{T}}\Big).$$
Therefore, from (53) we get $\min_{0 \le t < T}\mathbb{E}\big(J(K_t) - J(K^*)\big) = O(\frac{1}{\sqrt{T}})$, which concludes the proof of Theorem 4.3.
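Gradient domination can be checked numerically on a toy instance: for random stabilizing gains, $J(K) - J(K^*)$ is indeed bounded by $\frac{\|D_{K^*}\|}{\sigma_{\min}(R)}\mathrm{Tr}(E_K^\top E_K)$. The system and perturbation scale below are illustrative choices of ours.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[0.9, 0.1], [0.0, 0.8]])     # toy stable system
B = np.array([[0.0], [1.0]])
Q, R, D_sigma = np.eye(2), np.eye(1), np.eye(2)

def P_of(K):
    L = A - B @ K
    return solve_discrete_lyapunov(L.T, Q + K.T @ R @ K)

def D_of(K):                               # stationary covariance D_K = L D_K L^T + D_sigma
    return solve_discrete_lyapunov(A - B @ K, D_sigma)

def J_of(K):                               # cost up to an additive constant
    return np.trace(P_of(K) @ D_sigma)

P_star = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)

rng = np.random.default_rng(2)
checked = 0
for _ in range(5):
    K = K_star + 0.2 * rng.standard_normal(K_star.shape)
    if max(abs(np.linalg.eigvals(A - B @ K))) >= 1:
        continue                           # only stabilizing gains have finite cost
    P = P_of(K)
    E_K = (R + B.T @ P @ B) @ K - B.T @ P @ A
    lhs = J_of(K) - J_of(K_star)
    rhs = np.linalg.norm(D_of(K_star), 2) / min(np.linalg.eigvalsh(R)) * np.trace(E_K.T @ E_K)
    assert 0 <= lhs <= rhs + 1e-9          # gradient domination (Lemma A.12)
    checked += 1
print(checked)
```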

B PROOF OF PROPOSITIONS

To establish the Proposition 3.1, we need the following lemma, the proof of which can be found in Nagar (1959); Magnus (1978) . Lemma B.1. Let g ∼ N (0, I n ) be the standard Gaussian random variable in R n and let M, N be two symmetric matrices. Then we have E[g ⊤ M gg ⊤ N g] = 2Tr(M N ) + Tr(M )Tr(N ). Proof of Proposition 3.1: Proof. This proposition is a slight modification of lemma 3.2 in Yang et al. (2019) and the proof is inspired by the proof of this lemma. For any state-action pair (x, u) ∈ R d+k , we denote the successor state-action pair following policy π K by (x ′ , u ′ ). With this notation, as we defined in (7), we have x ′ = Ax + Bu + ϵ, u ′ = -Kx ′ + σζ. where ϵ ∼ N (0, D 0 ) and ζ ∼ N (0, I k ). We further denote (x, u) and (x ′ , u ′ ) by ϑ and ϑ ′ respectively. Therefore, we have ϑ ′ = Lϑ + ε, where L := A B -KA -KB = I d -K [A B] , ε := ϵ -Kϵ + σζ . Therefore, by definition, we have ε ∼ N (0, D0 ) where D0 = D 0 -D 0 K ⊤ -KD 0 KD 0 K ⊤ + σ 2 I k . Since for any two matrices M and N , it holds that ρ(M N ) = ρ(N M ). Then we get ρ(L) = ρ(A -BK) < 1. Consequently, the Markov chain defined in (54) have a stationary distribution N (0, DK ) denoted by ρK , where DK is the unique positive definite solution of the following Lyapunov equation DK = L DK L ⊤ + D0 Meanwhile, from the fact that x ∼ N (0, D K ) and u = -Kx + σζ, by direct computation we have DK = D K -D K K ⊤ -KD K KD K K ⊤ + σ 2 I k = 0 0 0 σ 2 I k + I d -K D K I d -K ⊤ . From the fact that ∥AB∥ F ≤ ∥A∥ F ∥B∥ and ∥A∥ ≤ ∥A∥ F , we have ∥ DK ∥ ≤ ∥ DK ∥ F ≤ σ 2 k + ∥D K ∥(d + ∥K∥ 2 F ). Then we get E (x,u) [ϕ(x, u)ϕ(x, u) ⊤ ] = E ϑ∼ ρK [ϕ(ϑ)ϕ(ϑ) ⊤ ]. Let M, N be any two symmetric matrices with appropriate dimension, we have svec(M ) ⊤ E ϑ∼ ρK [ϕ(ϑ)ϕ(ϑ) ⊤ ]svec(N ) = E ϑ∼ ρK [svec(M ) ⊤ ϕ(ϑ)ϕ(ϑ) ⊤ svec(N )] = E ϑ∼ ρK [⟨ϑϑ ⊤ , M ⟩⟨ϑϑ ⊤ , N ⟩] = E ϑ∼ ρK [ϑ ⊤ M ϑϑ ⊤ N ϑ] = E g∼N (0,I d+k ) [g ⊤ D1/2 K M D1/2 K gg ⊤ D1/2 K N D1/2 K g], where D1/2 K is the square root of DK . 
By applying Lemma B.1, we have svec(M ) ⊤ E ϑ∼ ρK [ϕ(ϑ)ϕ(ϑ) ⊤ ]svec(N ) =E g∼N (0,I d+k ) [g ⊤ D1/2 K M D1/2 K gg ⊤ D1/2 K N D1/2 K g] =2Tr( D1/2 K M DK N D1/2 K ) + Tr( D1/2 K M D1/2 K )Tr( D1/2 K N D1/2 K ) =2⟨M, DK N DK ⟩ + ⟨M, DK ⟩⟨N, DK ⟩ =svec(M ) ⊤ (2 DK ⊗ s DK + svec( DK )svec( DK ) ⊤ )svec(N ), where the last equality follows from the fact that svec( 1 2 (N SM ⊤ + M SN ⊤ )) = (M ⊗ s N )svec(S). for any two matrix M, N and a symmetric matrix S (Schacke, 2004 ). Thus we have E ϑ∼ ρK [ϕ(ϑ)ϕ(ϑ) ⊤ ] = 2 DK ⊗ s DK + svec( DK )svec( DK ) ⊤ . ( ) Similarly ϕ(ϑ ′ ) = svec[(Lϑ + ε)(Lϑ + ε) ⊤ ] = svec(Lϑϑ ⊤ L ⊤ + Lϑε ⊤ -εϑ ⊤ L ⊤ + εε ⊤ ). Since ϵ is independent of ϑ, we get E ϑ∼ ρK [ϕ(ϑ)ϕ(ϑ ′ ) ⊤ ] = E ϑ∼ ρK [ϕ(ϑ)svec(Lϑϑ ⊤ L ⊤ + D0 )]. By the same argument, we have svec(M ) ⊤ E ϑ∼ ρK [ϕ(ϑ)ϕ(ϑ ′ ) ⊤ ]svec(N ) =E ϑ∼ ρK [⟨ϑϑ ⊤ , M ⟩⟨Lϑϑ ⊤ L ⊤ + D0 , N ⟩] =E ϑ∼ ρK [ϑ ⊤ M ϑϑ ⊤ L ⊤ N Lϑ] + ⟨M, DK ⟩⟨ D0 , N ⟩] =E g∈N (0,I d+k ) [g ⊤ D 1 2 K M D 1 2 K gg ⊤ D 1 2 K L ⊤ N L D 1 2 K g] + ⟨M, DK , ⟩⟨ D0 , N ⟩] =2Tr(M DK L ⊤ N L DK ) + Tr(M DK )Tr(L ⊤ N L DK ) + ⟨M, DK ⟩⟨ D0 , N ⟩ =2⟨M, DK L ⊤ N L DK ⟩ + ⟨M, DK ⟩⟨L DK L ⊤ , N ⟩ + ⟨M, DK ⟩⟨ D0 , N ⟩ =2⟨M, DK L ⊤ N L DK ⟩ + ⟨M, DK ⟩⟨ DK , N ⟩ =svec(M ) ⊤ (2 DK L ⊤ ⊗ s DK L ⊤ + svec( DK )svec( DK ) ⊤ )svec(N ), Since all norms are equivalent on the finite dimensional Euclidean space, there exists a constant c 1 satisfies ∥D Kt ∥ ≤ c 1 1 -( 1+ρ 2 ) 2 ∥D σ ∥, which concludes our proof. Proof of Proposition A.4: Proof. Since x t ∼ N (0, D Kt ) and u t = -K t x t + σζ t , we denote the joint distribution of ϑ t = (x t , u t ) by ρKt = N (0, DKt ) where DKt = D Kt -D Kt K ⊤ t -K t D Kt K t D Kt K ⊤ t + σ 2 I k = 0 0 0 σ 2 I k + I d -K t D Kt I d -K t ⊤ . Based on Lemma A.3, for (x t , u t ) ∼ N (0, DKt ) with DKt defined in (58), we obtain P[|∥x t ∥ 2 2 + ∥u t ∥ 2 2 -Tr( DKt )| > θ] ≤ 2e -c•min{θ 2 ∥ DK t ∥ -2 F ,θ∥ DK t ∥ -1 } . 
Choose θ = c 2 log(10)∥ DKt ∥ with c 2 sufficiently large such that cc 2 > 12 and θ 2 ∥ DKt ∥ -2 F > θ∥ DKt ∥ -1 , where we make use of the fact that ∥ DKt ∥ F is bounded. Hence we have the following probability inequality 10) . We define the following event P[|∥x t ∥ 2 2 + ∥u t ∥ 2 2 -Tr( DKt )| ≤ c 2 log(10)∥ DKt ∥] ≥ 1 -2e -cc2log A t = {|∥x t ∥ 2 2 + ∥u t ∥ 2 2 -Tr( DKt )| ≤ c 2 log(10)∥ DKt ∥}. Then we have P(A t ) ≥ 1 -2e -cc2log(10) ≥ 1 -2 • 10 -12 ≥ 1 -10 -11 . We further define Ā = ∩ 0≤t≤T -1 A t . Thus we get P( Ā) ≥ 1 -10 -10 . In the sequel, we only consider the case when Ā holds. That is, for any 0 ≤ t ≤ T -1, we have ∥x t ∥ 2 + ∥u t ∥ 2 ≤ c 2 log(10)∥ DKt ∥ + tr( DKt ) ≤ (c 2 log(10) + d + k)∥ DKt ∥ ≤ 2c 2 log(10)∥ DKt ∥ ≤ 2c 2 log(10)[σ 2 k + (d + ∥K t ∥ 2 F )∥D Kt ∥], where the third inequality holds since we choose c 2 large enough such that c 2 log(10) ≥ d + k and the last inequality is due to the fact ∥ DKt ∥ ≤ σ 2 k + (d + ∥K t ∥ 2 F )∥D Kt ∥. From Assumption 4.1, we know that ∥K t ∥ ≤ K, so we have ∥x t ∥ 2 + ∥u t ∥ 2 ≤ 2c 2 log(10)[σ 2 k + d(1 + K2 )∥D Kt ∥]. From Proposition A.2, we know that ∥D Kt ∥ ≤ c 1 1 -( 1+ρ 2 ) 2 ∥D σ ∥. Substitute ∥D Kt ∥ into (59), we get ∥x t ∥ 2 + ∥u t ∥ 2 ≤2c 2 log(10)[σ 2 k + d(1 + K2 ) c 1 1 -( 1+ρ 2 ) 2 ∥D σ ∥]. For ∥a -K ⊤ t b∥ 2 , we have ∥a -K ⊤ t b∥ 2 ≥∥a∥ 2 + ∥K ⊤ t b∥ 2 -2∥a∥∥K ⊤ t ∥∥b∥ ≥∥a∥ 2 -2 K∥a∥∥b∥ ≥∥a∥ 2 - 1 2 (∥a∥ 2 + 4 K2 ∥b∥ 2 ) = 1 2 ∥a∥ 2 -2 K2 ∥b∥ 2 . Hence we get a ⊤ b ⊤ DKt a b ≥σ min (D Kt )∥a -K ⊤ t b∥ 2 + σ 2 ∥b∥ 2 ≥σ min (D Kt )( 1 2 ∥a∥ 2 -2 K2 ∥b∥ 2 ) + σ 2 ∥b∥ 2 ≥ min{σ min (D 0 ), σ 2 4 K2 }( 1 2 ∥a∥ 2 -2 K2 ∥b∥ 2 ) + σ 2 ∥b∥ 2 ≥ min{ σ min (D 0 ) 2 , σ 2 8 K2 , σ 2 2 }(∥a∥ 2 + ∥b∥ 2 ). Thus we have σ min ( DKt ) ≥ min{ σ min (D 0 ) 2 , σ 2 8 K2 , σ 2 2 } > 0, which further implies ∥A -1 Kt ∥ ≤ 1 2(1 -ρ 2 )σ 2 min ( DKt ) ≤ 1 2(1 -ρ 2 )(min{ σmin(D0) 2 , σ 2 8 K2 , σ 2 2 }) 2 . 
We define µ := 2(1 -ρ 2 )(min{ σ min (D 0 ) 2 , σ 2 8 K2 , σ 2 2 }) 2 such that we get σ min (A Kt ) ≥ µ, which concludes the proof. Proof of Proposition A.9: respectively. Therefore, for any positive definite matrices S 1 and S 2 , we get Tr(S 1 Γ K (S 2 )) = t≥0 Tr(S 1 (A -BK) t S 2 [(A -BK) t ] ⊤ ) = t≥0 Tr([(A -BK) t ] ⊤ S 1 (A -BK) t S 2 ) = Tr(Γ ⊤ K (S 1 )S 2 ). Combining ( 10), ( 55), ( 64) and ( 65), we know that D K = Γ K (D σ ), P K = Γ ⊤ K (Q + K ⊤ RK). Thus (62) implies J(K) = Tr((Q + K ⊤ RK)D K ) + σ 2 Tr(R) = Tr((Q + K ⊤ RK)Γ K (D σ )) + σ 2 Tr(R) = Tr(Γ ⊤ K (Q + K ⊤ RK)D σ ) + σ 2 Tr(R) = Tr(P K D σ ) + σ 2 Tr(R). It remains to establish the gradient of J(K). Based on (62), we have ∇ K J(K) =∇ K Tr((Q + K ⊤ RK)C))| C=D K + ∇ K Tr(CD K )| C=Q+K ⊤ RK , where we use C to denote that we compute the gradient with respect to K and then substitute the expression of C. Hence we get ∇ K J(K) = 2RKD K + ∇ K Tr(C 0 D K )| C0=Q+K ⊤ RK . Furthermore, we have ∇ K Tr(C 0 D K ) =∇ K Tr(C 0 Γ K (D σ )) =∇ K Tr(C 0 D σ + C 0 (A -BK)Γ K (D σ )(A -BK) ⊤ ) =∇ K Tr(C 0 D σ ) + ∇ K Tr((A -BK) ⊤ C 0 (A -BK)Γ K (D σ )) = -2B ⊤ C 0 (A -BK)Γ K (D σ ) + ∇ K Tr(C 1 Γ K (D σ ))| C1=(A-BK) ⊤ C0(A-BK) . Then it reduces to compute ∇ K Tr(C 1 Γ K (D σ ))| C1=(A-BK) ⊤ C0(A-BK) . Applying this iteration for n times, we get ∇ K Tr(C 0 D K ) = -2B ⊤ n t=0 C t (A -BK)Γ K (D σ ) + ∇ K Tr(C n Γ K (D σ ))| Cn=[(A-BK) n ] ⊤ C0(A-BK) n . Meanwhile, by Lyapunov equation defined in (11), we have ∞ t=0 C t = ∞ t=0 [(A -BK) t ] ⊤ (Q + K ⊤ RK)(A -BK) t =P K . Since ρ(A -BK) < 1, we further get lim n→∞ Tr(C n Γ K (D σ )) ≤ lim n→∞ ∥(Q + K ⊤ RK)∥ρ(A -BK) 2n Tr(Γ K (D σ )) =0. Thus by letting n go to infinity in (68), we get ∇ K Tr(C 0 D K )| C0=Q+K ⊤ RK = -2B ⊤ P K (A -BK)Γ K (D σ ) = -2B ⊤ P K (A -BK)D K . Hence, combining (67), we have ∇ K J(K) = 2RKD K -2B ⊤ P K (A -BK)D K = 2[(R + B ⊤ P K B)K -B ⊤ P K A]D K , which concludes our proof. Proof of Lemma 2.2: Proof. 
By definition, we have the state-value function as follows V θ (x) : = ∞ t=0 E θ [(c(x t , u t ) -J(θ))|x 0 = x] = E u∼π θ (•|x) [Q θ (x, u)], Therefore, we have V K (x) = ∞ t=0 E[c(x t , u t ) -J(K)|x 0 = x, u t = -Kx t + σζ t ] = ∞ t=0 E{[x ⊤ t (Q + K ⊤ RK)x t ] + σ 2 Tr(R) -J(K)}. Combining the linear dynamic system in (8) and the form of (70), we see that V K (x) is a quadratic function, which can be denoted by V K (x) = x ⊤ P K x + C K , where P K is defined in (11) and C K only depends on K. Moreover, by definition, we know that E x∼ρ K [V K (x)] = 0, which implies E x∼ρ K [x ⊤ P K x + C K ] = Tr(P K D K ) + C K = 0. Thus we have C K = -Tr(P K D K ). Hence, the expression of V K (x) is given by V K (x) = x ⊤ P K x -Tr(P K D K ). Therefore, the action-value function Q K (x, u) can be written as Q(x, u) =c(x, u) -J(K) + E[V K (x ′ )|x, u] =c(x, u) -J(K) + (Ax + Bu) ⊤ P K (Ax + Bu) + Tr(P K D 0 ) -Tr(P K D K ) =x ⊤ Qx + u ⊤ Ru + (Ax + Bu) ⊤ P K (Ax + Bu) -σ 2 Tr(R + P K BB ⊤ ) -Tr(P K Σ K ). Thus we finish the proof. Proof of Lemma A.11: Proof. By the definition of operator in ( 63) and (66), we have x ⊤ P K ′ x =x ⊤ Γ ⊤ K ′ (Q + K ′⊤ RK ′ )x = t≥0 x ⊤ [(A -BK ′ ) t ] ⊤ (Q + K ′⊤ RK ′ )(A -BK ′ ) t x. Hereafter, we define (A -BK ′ ) t x = x ′ t and u ′ t = -K ′ x ′ t . Hence, we further have x ⊤ P K ′ x = t≥0 x ′⊤ t (Q + K ′⊤ RK ′ )x ′ t = t≥0 (x ′⊤ t Qx ′ t + u ′⊤ t Ru ′ t ). 
Therefore, we get x ⊤ P K ′ x -x ⊤ P K x = t≥0 [(x ′⊤ t Qx ′ t + u ′⊤ Ru ′ t ) + x ′⊤ t P K x ′ t -x ′⊤ t P K x ′ t ] -x ′⊤ 0 P K x ′ 0 = t≥0 [(x ′⊤ t Qx ′ t + u ′⊤ Ru ′ t ) + x ′⊤ t+1 P K x ′ t+1 -x ′⊤ t P K x ′ t ] = t≥0 [(x ′⊤ t Qx ′ t + u ′⊤ t Ru ′ t ) + [(A -BK ′ )x ′ t ] ⊤ P K (A -BK ′ )x ′ t -x ′ t P K x ′ t ] = t≥0 {x ′⊤ t [Q + (K ′ -K + K) ⊤ R(K ′ -K + K)]x ′ t + x ′⊤ t [A -BK -B(K ′ -K) ⊤ P K [A -BK -B(K ′ -K)]x ′ t -x ′ t P K x ′ t } = t≥0 {2x ⊤ t (K ′ -K) ⊤ [(R + B ⊤ P K B)K -B ⊤ P K A]x ′ t + x ′⊤ t (K ′ -K) ⊤ (R + B ⊤ P K B)(K ′ -K)x ′ t } = t≥0 [2x ′⊤ t (K ′ -K) ⊤ E K x ′ t + x ′⊤ t (K ′ -K) ⊤ (R + B ⊤ P K B)(K ′ -K)x ′ t ]. Define A K,K ′ (x) := 2x ⊤ (K ′ -K) ⊤ E K x + x ⊤ (K ′ -K) ⊤ (R + B ⊤ P K B)(K ′ -K)x. Then, from the expression of J(K) in (12a), we have J(K ′ ) -J(K) =E x∼N (0,Dσ) [x ⊤ (P K ′ -P K )x] =E x ′ 0 ∼N (0,Dσ) t≥0 A K,K ′ (x t ) =E x ′ 0 ∼N (0,Dσ) t≥0 [2x ′⊤ t (K ′ -K) ⊤ E K x ′ t + x ′⊤ t (K ′ -K) ⊤ (R + B ⊤ P K B)(K ′ -K)x ′ t ] =Tr(2E x ′ 0 ∼N (0,Dσ) [ t≥0 x ′⊤ t x ′ t ](K ′ -K) ⊤ E K )+ Tr(E x ′ 0 ∼N (0,Dσ) [ t≥0 x ′⊤ t x ′ t ](K ′ -K) ⊤ (R + B ⊤ P K B)(K ′ -K)) = -2Tr(D K ′ (K -K ′ ) ⊤ E K ) + Tr(D K ′ (K -K ′ ) ⊤ (R + B ⊤ P K B)(K -K ′ )). where the last equation is due to the fact that E x ′ 0 ∼N (0,Dσ) [ t≥0 x ′ t (x ′ t ) ⊤ ] =E x∼N (0,Dσ) { t≥0 (A -BK ′ ) t xx ⊤ [(A -BK ′ ) t ] ⊤ } =Γ K ′ (D σ ) = D K ′ . Hence, we finish our proof. We compare our considered single-sample single-timescale AC with two other baseline algorithms that have been analyzed in the state-of-the-art theoretical works: the zeroth-order method (Fazel et al., 2018) (listed in Algorithm 2 on the next page) and the double loop AC (Yang et al., 2019) (listed in Algorithm 3 on the next page). For the considered single-sample single-timescale AC, we set for both examples α t = 0.005 √ 1+t , β t = 0.01 √ 1+t , γ t = 0.1 √ 1+t , σ = 1, T = 10 6 . Note that multiplying small constants to these stepsizes does not affect our theoretical results. 
For the zeroth-order method proposed in Fazel et al. (2018), we set $z = 5000$, $l = 20$, $r = 0.1$, stepsize $\eta = 0.01$, and iteration number $J = 1000$ for the first numerical example; in the second example, we set $z = 20000$, $l = 50$, $r = 0.1$, $\eta = 0.01$, $J = 1000$. We chose these parameters to trade off performance against sample complexity. For the double-loop AC proposed in Yang et al. (2019), we set for both examples $\alpha_t = \frac{0.01}{\sqrt{1+t}}$, $\sigma = 0.2$, $\eta = 0.05$, inner-loop iteration number $T = 500000$, and outer-loop iteration number $J = 100$. We note that the algorithm is fragile and sensitive to the practical choice of these parameters. Moreover, we found it difficult for the algorithm to converge without an accurate critic estimate in the inner loop. In our implementation, we had to set the inner-loop iteration number to $T = 500000$ to barely get the algorithm to converge to the global optimum, which demands a significant amount of computation. A larger $T$ yields a more accurate critic estimate, and consequently more stable convergence, but at the price of an even longer running time. We run the outer loop 100 times for each run of the algorithm, and run the whole algorithm 10 times independently to obtain the results shown in Figure 1. Even with a parallel computing implementation, it takes more than 2 weeks on our desktop workstation (Intel Xeon(R) W-2225 CPU @ 4.10GHz × 8) to finish the computation. In comparison, it takes about half an hour to run the single-sample single-timescale AC and 5 hours for the zeroth-order method.

Algorithm 2 Zeroth-order Natural Policy Gradient
Input: stabilizing policy gain $K_0$ such that $\rho(A - BK_0) < 1$, number of trajectories $z$, roll-out length $l$, perturbation amplitude $r$, stepsize $\eta$
while updating current policy do
  Gradient Estimation:
  for $i = 1, \dots, z$ do
    Sample $x_0$ from $\mathcal{D}$
    Simulate $K_j$ for $l$ steps starting from $x_0$ and observe $y_0, \dots, y_{l-1}$ and $c_0, \dots, c_{l-1}$.
Draw U i uniformly over matrices such that ∥U i ∥ F = 1, and generate a policy K j,Ui = K j + rU i . Simulate K j,Ui for l steps starting from x 0 and observe c ′ 0 , • • • , c ′ l-1 . Calculate empirical estimates: J i Kj = l-1 t=0 c t , L i Kj = l-1 t=0 y t y ⊤ t , J K j,U i = l-1 t=0 c ′ t . end for Return estimates: ∇J(K j ) = 1 z z i=1 J K j,U i -J i Kj r U i , L Kj = 1 z z i=1 L i Kj . Policy Update: K j+1 = K j -η ∇J(K j ) L Kj -1 . j = j + 1. end while
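A deliberately simplified, runnable sketch of the zeroth-order idea on a scalar LQR: for brevity it replaces trajectory rollouts with an exact cost oracle and uses a two-sided smoothed estimate of the vanilla (not natural) gradient, so it is a variant of Algorithm 2 rather than a faithful implementation; all constants are ours.

```python
import numpy as np

# scalar LQR: x' = a x + b u; J(K) is proportional to p_K, where
# p_K = (q + r_cost K^2) / (1 - (a - b K)^2) solves the scalar Lyapunov equation.
a, b, q, r_cost = 0.9, 0.5, 1.0, 1.0

def J(K):
    closed = a - b * K
    assert abs(closed) < 1                  # only stabilizing gains have finite cost
    return (q + r_cost * K ** 2) / (1 - closed ** 2)

rng = np.random.default_rng(3)
K, eta_step, r_pert, z = 0.0, 0.05, 0.05, 8
for _ in range(200):
    # two-sided smoothed estimate of dJ/dK from z random sign perturbations
    grad = np.mean([(J(K + r_pert * u) - J(K - r_pert * u)) / (2 * r_pert) * u
                    for u in rng.choice([-1.0, 1.0], size=z)])
    K -= eta_step * grad

p = 1.0                                     # scalar Riccati fixed point, for reference
for _ in range(2000):
    p = q + a * a * p - (a * b * p) ** 2 / (r_cost + b * b * p)
K_star = a * b * p / (r_cost + b * b * p)
print(abs(K - K_star))                      # small: the zeroth-order iterates reach K*
```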






Figure 1: (a) Learning results of Algorithm 1, corresponding to the conclusion in Theorem 4.3 empirically. (b) Comparison of Algorithm 1 with two other algorithms; the plots show the actor error $\|K - K^*\|_F$.

Algorithm 1 Single-Sample Single-Timescale Natural Actor-Critic for Linear Quadratic Regulator
1: Input: initialize actor parameter $K_0 \in \mathcal{K}$, critic parameter $\omega_0$, time-average cost estimate $\eta_0$, stepsizes $\alpha_t$ for the actor, $\beta_t$ for the critic, and $\gamma_t$ for the cost estimator.
Take action $u_t \sim \pi_{K_t}(\cdot \mid x_t)$ and receive the cost $c_t = c(x_t, u_t)$ and the next state $x'_t$.
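Algorithm 1 can be sketched end-to-end on a scalar LQR. The following is a minimal single-sample single-timescale NAC loop (our own toy instantiation with assumed constants): quadratic features follow Definition A.1, the critic is projected onto a ball, and the actor takes the natural-gradient step $K \leftarrow K - \alpha_t(\mathrm{smat}(\omega)_{22}K - \mathrm{smat}(\omega)_{21})$.

```python
import numpy as np

# toy scalar LQR (our own constants): x' = a x + b u + noise,  c = q x^2 + r_c u^2
a, b, q, r_c, sigma, noise_std = 0.9, 0.5, 1.0, 1.0, 0.3, 0.3
rng = np.random.default_rng(0)

def phi(x, u):
    # svec((x, u)(x, u)^T) as in Definition A.1
    return np.array([x * x, np.sqrt(2.0) * x * u, u * u])

K, omega, eta, x = 0.0, np.zeros(3), 0.0, 0.0
c_alpha, radius, T = 0.01, 10.0, 100_000
for t in range(T):
    u = -K * x + sigma * rng.standard_normal()              # Gaussian policy
    cost = q * x * x + r_c * u * u
    x_next = a * x + b * u + noise_std * rng.standard_normal()
    u_next = -K * x_next + sigma * rng.standard_normal()
    alpha = c_alpha / np.sqrt(1 + t)                        # proportional stepsizes
    beta = gamma = 1.0 / np.sqrt(1 + t)
    delta = cost - eta + (phi(x_next, u_next) - phi(x, u)) @ omega   # TD error
    eta += gamma * (cost - eta)                             # cost estimator update
    omega += beta * delta * phi(x, u)                       # single-sample critic update
    nrm = np.linalg.norm(omega)
    if nrm > radius:                                        # projection Pi_omega
        omega *= radius / nrm
    # actor: K <- K - alpha (smat(omega)_22 K - smat(omega)_21)
    K -= alpha * (omega[2] * K - omega[1] / np.sqrt(2.0))
    x = x_next

p = 1.0                                                     # scalar Riccati fixed point
for _ in range(2000):
    p = q + a * a * p - (a * b * p) ** 2 / (r_c + b * b * p)
K_star = a * b * p / (r_c + b * b * p)
print(K, K_star)
```

All three iterates are updated once per sample with proportional stepsizes, mirroring the structure analyzed in Theorems A.7, A.10, and A.13.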

A.1 Cost Estimate Error Analysis
A.2 Critic Error Analysis
A.3 Natural Gradient Norm Analysis
A.4 Interconnected Iteration System Analysis
A.5 Global Convergence Analysis


Combining the upper bounds of the above two terms and dividing by $T$, we conclude the convergence of the critic.

A.3 NATURAL GRADIENT NORM ANALYSIS

In this subsection, we derive an implicit bound for the natural gradient norm in terms of the critic error. Before proceeding, we need the following two lemmas, which characterize two important properties of the LQR system.

Lemma A.11 (Almost Smoothness). For any two stable policies K and K′, J(K) and J(K′) satisfy

J(K′) − J(K) = 2 Tr(D_{K′}(K′ − K)^⊤ E_K) + Tr(D_{K′}(K′ − K)^⊤ (R + B^⊤ P_K B)(K′ − K)),

where E_K = (R + B^⊤ P_K B)K − B^⊤ P_K A.

Lemma A.12 (Gradient Domination). Let K* be an optimal policy. Suppose K has finite cost. Then it holds that

J(K) − J(K*) ≤ ∥D_{K*}∥ σ_min(R)^{−1} Tr(E_K^⊤ E_K).

Here we make use of the Lyapunov equation (55). Thus we obtain the stated bound. Therefore, combining (56) and (57), we have the claimed inequality, where in the last equality we use the fact established above. Multiplying each side by ϕ(x, u) and taking the expectation with respect to (x, u), we obtain the result, where the first equality comes from the law of total expectation. Thus we conclude our proof.

Proof of Proposition A.2:

Proof. Since D_{K_t} satisfies the Lyapunov equation defined in (10), and since Assumption 4.2 gives ρ(A − BK_t) ≤ ρ < 1, for any ϵ > 0 there exists a sub-multiplicative matrix norm ∥·∥_* under which the norm of D_{K_t} can be bounded. Thus we can use Ū as an upper bound on both c(x, u) and ∥x_t∥² + ∥u_t∥², which concludes the proof.

Proof of Proposition A.6:

Proof. The second inequality is due to the perturbation bound on P_K in Lemma A.5, with the constant factor ∥R∥(K̄∥B∥(∥A∥ + K̄∥B∥ + 1) + 1). Thus we finish our proof.

Proof of Proposition A.8:

Proof. From Proposition 3.1, we know the explicit form of the feature correlation matrix. By Assumption 4.2, we have ρ(L) = ρ(A − BK_t) ≤ ρ < 1, which yields the stated norm bound. To bound σ_min(D̃_{K_t}), we consider arbitrary a ∈ R^d and b ∈ R^k, which concludes the proof.

C PROOF OF AUXILIARY LEMMAS

The following lemmas are well known and have been established in several papers (Yang et al., 2019; Fazel et al., 2018). We include the proofs here only for completeness.

Proof of Lemma 2.1:

Proof. Since we focus on the family of linear-Gaussian policies defined in (7), the claim follows directly. Furthermore, for K ∈ R^{k×d} such that ρ(A − BK) < 1 and a positive definite matrix S ∈ R^{d×d}, we define the following two operators:

Γ_K(S) = Σ_{t=0}^{∞} (A − BK)^t S [(A − BK)^⊤]^t,  Γ_K^⊤(S) = Σ_{t=0}^{∞} [(A − BK)^⊤]^t S (A − BK)^t.

Hence, Γ_K(S) and Γ_K^⊤(S) satisfy the Lyapunov equations

Γ_K(S) = S + (A − BK) Γ_K(S) (A − BK)^⊤,  Γ_K^⊤(S) = S + (A − BK)^⊤ Γ_K^⊤(S) (A − BK).

Proof of Lemma A.12:

Proof. By the definition of A_{K,K′} in (71), the upper bound follows, where the equality is satisfied at the minimizing choice of K′. Thus we complete the proof of the upper bound. It remains to establish the lower bound, which follows since the equality is attained at the corresponding choice of K′, which concludes our proof.
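The Lyapunov-equation property of the operators Γ_K(S) can be checked numerically. The sketch below (our own sanity check with hypothetical system matrices, not part of the paper) verifies that the truncated series Σ_t L^t S (L^⊤)^t with L = A − BK agrees with the solution of the discrete Lyapunov equation Γ = S + LΓL^⊤ returned by SciPy.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical stable closed-loop system: rho(A - BK) < 1 must hold for the
# series defining Gamma_K(S) to converge.
A = np.array([[0.6, 0.2], [0.0, 0.5]])
B = np.array([[1.0], [0.5]])
K = np.array([[0.1, 0.05]])
L = A - B @ K
assert max(abs(np.linalg.eigvals(L))) < 1   # stability check

S = np.eye(2)
# Truncated series evaluation of Gamma_K(S) = sum_t L^t S (L^T)^t.
Gamma = sum(np.linalg.matrix_power(L, t) @ S @ np.linalg.matrix_power(L.T, t)
            for t in range(200))
# Closed-form solution of the same discrete Lyapunov equation.
Gamma_ref = solve_discrete_lyapunov(L, S)
assert np.allclose(Gamma, Gamma_ref, atol=1e-8)
assert np.allclose(Gamma, S + L @ Gamma @ L.T, atol=1e-8)
```

Since ρ(L) < 1, the series converges geometrically, so 200 terms are far more than enough for the tolerance used here.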

D EXPERIMENTAL DETAILS

Example D.1. Consider a two-dimensional system with

Algorithm 3 Double-Loop Natural Actor-Critic
Input: initial policy π_{K_0} such that ρ(A − BK_0) < 1, stepsize η for the policy update.
while updating current policy do
  Gradient Estimation:
  Initialize the primal and dual variables by v_0 ∈ X_Θ and ω_0 ∈ X_Ω, respectively.
  Sample the initial state x_0 ∈ R^d from the stationary distribution ρ_{K_j}. Take action u_0 ∼ π_{K_j}(·|x_0) and obtain the cost c_0 and the next state x_1.
  for t = 1, 2, ..., T do
    Take action u_t according to policy π_{K_j}, observe the cost c_t and the next state x_{t+1}.
    δ_t = v^1_{t−1} − c_{t−1} + [ϕ(x_{t−1}, u_{t−1}) − ϕ(x_t, u_t)]^⊤ v^2_{t−1}.
    ω^2_t = (1 − α_t) ω^2_{t−1} + α_t δ_t ϕ(x_{t−1}, u_{t−1}).
    Project v_t and ω_t onto X_Θ and X_Ω, respectively.
  end for
  Return estimates: Θ̂_{K_j}.
  Policy Update: K_{j+1} = K_j − η(Θ̂^{22}_{K_j} K_j − Θ̂^{21}). j = j + 1.
end while
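To see why the outer-loop update of Algorithm 3 targets the global optimum, the following model-based sketch (our own illustration with hypothetical system matrices, not the paper's experiment) replaces the inner-loop critic estimate with the exact Q-function blocks Θ^{22} = R + B^⊤P_K B and Θ^{21} = B^⊤P_K A, and checks that the damped update K ← K − η(Θ^{22}K − Θ^{21}) converges to the Riccati solution K* = (R + B^⊤P B)^{−1}B^⊤P A.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# Hypothetical open-loop-stable system, so K = 0 is a valid stabilizing start.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

# Ground truth from the discrete algebraic Riccati equation.
P_star = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)

K = np.zeros((1, 2))
eta = 0.1
for _ in range(500):
    L = A - B @ K
    # Policy evaluation: P_K solves P = Q + K^T R K + L^T P L.
    P = solve_discrete_lyapunov(L.T, Q + K.T @ R @ K)
    Theta22 = R + B.T @ P @ B   # uu-block of the exact Q-function matrix
    Theta21 = B.T @ P @ A       # ux-block
    K = K - eta * (Theta22 @ K - Theta21)   # outer-loop natural-gradient step

assert np.allclose(K, K_star, atol=1e-6)
```

The fixed point of this update satisfies (R + B^⊤P_K B)K = B^⊤P_K A, which for LQR is exactly the optimality condition; the inner loop of Algorithm 3 exists only to estimate the Θ-blocks from samples.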

