FINITE-TIME ANALYSIS OF SINGLE-TIMESCALE ACTOR-CRITIC ON LINEAR QUADRATIC REGULATOR

Abstract

Actor-critic (AC) methods have achieved state-of-the-art performance in many challenging tasks. However, their convergence in most practical applications is still poorly understood. Existing works mostly consider the uncommon double-loop or two-timescale stepsize variants for ease of analysis. We investigate the practical yet more challenging single-sample single-timescale natural AC for solving the canonical linear quadratic regulator (LQR) problem. Specifically, the actor and the critic each update only once with a single sample in each iteration, using proportional stepsizes. We prove that single-sample single-timescale natural AC (NAC) can attain an ϵ-optimal solution with a sample complexity of O(ϵ^-2), which elucidates the practical efficiency of single-sample single-timescale NAC. We develop a novel analysis framework that directly bounds the whole interconnected iteration system, without the conservative decoupling commonly adopted in previous analyses of AC and NAC. Our work presents the first finite-time analysis of single-sample single-timescale NAC with a global optimality guarantee.

1. INTRODUCTION

Actor-critic (AC) methods have achieved substantial success in solving many difficult reinforcement learning (RL) problems (LeCun et al., 2015; Mnih et al., 2016; Silver et al., 2017). In addition to a policy update, AC methods employ a parallel critic update to bootstrap the Q-value for policy gradient estimation, which often enjoys reduced variance and fast convergence in training. Despite the empirical success, theoretical analysis of AC in its most practical form remains challenging. Most existing works focus on either the double-loop setting or the two-timescale setting, both of which are uncommon in practical implementations. In double-loop AC, the actor is updated in the outer loop only after the critic takes sufficiently many steps in the inner loop to obtain an accurate estimate of the Q-value (Yang et al., 2019; Kumar et al., 2019; Wang et al., 2019). Hence, the convergence of the critic is decoupled from that of the actor, and the analysis separates into a policy evaluation sub-problem in the inner loop and a perturbed gradient descent in the outer loop. In two-timescale AC, the actor and the critic are updated simultaneously in each iteration using stepsizes of different timescales. The actor stepsize (denoted by α_t) is typically smaller than that of the critic (denoted by β_t), with their ratio going to zero as the iteration number goes to infinity (i.e., lim_{t→∞} α_t/β_t = 0). The two-timescale design allows the critic to approximate the correct Q-value asymptotically, which again essentially decouples the analysis of the actor and the critic. The aforementioned AC variants are considered mainly for the ease of analysis. In practice, the single-timescale AC, where the actor and the critic are updated simultaneously using constantly proportional stepsizes (i.e., with α_t/β_t = c_α > 0), is more favorable due to its simplicity of implementation and empirical sample efficiency (Schulman et al., 2015; Mnih et al., 2016).
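The contrast between the two stepsize regimes can be made concrete with a small numeric sketch. The decay exponents below are common illustrative choices, not the specific schedules analyzed in this paper: under a two-timescale schedule the ratio α_t/β_t vanishes as t grows, while under a single-timescale schedule it stays at a fixed constant c_α.

```python
def two_timescale(t):
    """Actor decays strictly faster than critic, so alpha_t / beta_t -> 0."""
    beta = (1 + t) ** -0.6      # critic stepsize
    alpha = (1 + t) ** -1.0     # actor stepsize (faster decay)
    return alpha, beta

def single_timescale(t, c_alpha=0.1):
    """Actor stepsize is a constant multiple of the critic stepsize."""
    beta = (1 + t) ** -0.6
    return c_alpha * beta, beta  # constant ratio alpha_t / beta_t = c_alpha

for t in [10, 1000, 100000]:
    a2, b2 = two_timescale(t)
    a1, b1 = single_timescale(t)
    print(f"t={t:>6}: two-timescale ratio {a2 / b2:.4f}, "
          f"single-timescale ratio {a1 / b1:.4f}")
```

Because the single-timescale ratio never vanishes, the critic never gets the asymptotic "head start" that two-timescale analyses rely on, which is precisely what makes the single-timescale setting harder to analyze.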
However, its analysis is significantly more difficult than that of the other variants. To understand its finite-time convergence, some recent works (Fu et al., 2020; Zhou & Lu, 2022) consider multi-sample variants of single-timescale AC, where the critic is updated by the least-squares temporal difference (LSTD) estimator rather than the TD(0) update. The idea is still to obtain an accurate policy gradient estimate at each iteration by using sufficiently many samples (LSTD), and then follow the common perturbed gradient analysis to guarantee the convergence of the actor, decoupling the convergence analysis of the actor and the critic. Beyond the multi-sample settings, there are a few attempts that analyzed single-sample single-timescale AC (NAC), but they only establish local convergence (Chen et al., 2021; Olshevsky & Gharesifard, 2022).

To this end, we make the first step by considering the classic linear quadratic regulator (LQR), a fundamental continuous state-action space control problem that is commonly employed to study the performance and the limits of RL algorithms (Fazel et al., 2018; Yang et al., 2019; Tu & Recht, 2018; Duan et al., 2021). In particular, under the time-average cost, the single-sample single-timescale AC (NAC) algorithm for solving LQR consists of three parallel updates in each iteration: the cost estimator, the critic, and the actor. Unlike the aforementioned double-loop, two-timescale, or multi-sample structures, there is no specialized design in single-sample single-timescale AC (NAC) that facilitates a decoupled analysis of its three interconnected updates. In fact, it is both conservative and difficult to bound the three iterations separately. Moreover, the existing perturbed gradient analysis can no longer be applied to establish the convergence of the actor either. To tackle these challenges, we instead propose a novel framework that directly bounds the overall interconnected iteration system altogether, without resorting to a conservative decoupled analysis.
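The three-update structure described above can be sketched in a few lines of code. Everything concrete here is an illustrative assumption rather than the paper's exact algorithm or constants: the dynamics (A, B), cost matrices (Q, R), stepsizes, noise scales, and iteration count are all placeholder choices. The sketch follows the standard recipe for NAC on average-cost LQR: a running estimate η of the time-average cost, a TD(0) critic over quadratic state-action features, and an actor taking a natural-gradient-style step read off from the blocks of the critic's quadratic form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2, 1                                  # state / action dimensions
A = np.array([[0.9, 0.1], [0.0, 0.8]])       # illustrative stable dynamics
B = np.array([[0.0], [1.0]])
Q, R = np.eye(n), np.eye(d)                  # illustrative cost matrices

def feature(x, u):
    """Quadratic features: the flattened outer product of (x, u)."""
    z = np.concatenate([x, u])
    return np.outer(z, z).ravel()

K = np.zeros((d, n))                         # actor: linear policy u = -K x + noise
theta = np.zeros((n + d) ** 2)               # critic: weights of the quadratic Q-value
eta = 0.0                                    # estimator of the time-average cost
alpha, beta, sigma = 2e-3, 1e-2, 0.1         # proportional (single-timescale) stepsizes

x = rng.standard_normal(n)
for t in range(20000):
    u = -K @ x + sigma * rng.standard_normal(d)          # single sample, with exploration
    c = x @ Q @ x + u @ R @ u                            # stage cost
    x_next = A @ x + B @ u + 0.05 * rng.standard_normal(n)
    u_next = -K @ x_next + sigma * rng.standard_normal(d)

    # All three updates happen once per iteration, in parallel:
    phi, phi_next = feature(x, u), feature(x_next, u_next)
    delta = c - eta + theta @ phi_next - theta @ phi     # differential TD error
    eta += beta * (c - eta)                              # (1) cost estimator
    theta += beta * delta * phi                          # (2) TD(0) critic

    # (3) actor: natural-gradient-style step K <- K - alpha (Theta_uu K - Theta_ux),
    # reading the (u,x) and (u,u) blocks out of the critic's quadratic form.
    Theta = theta.reshape(n + d, n + d)                  # stays symmetric by construction
    K -= alpha * (Theta[n:, n:] @ K - Theta[n:, :n])
    x = x_next

print("estimated average cost:", eta)
print("policy gain K:", K)
```

The point of the sketch is the structure, not the tuning: no update waits for another to converge, which is exactly why the three estimation errors must be analyzed jointly.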
In particular, despite the inaccurate estimation in all three updates, we prove that the estimation errors diminish to zero if the (constant) ratio of the stepsizes between the actor and the critic is below a threshold. The identified threshold provides new insights into the practical choice of stepsizes for single-timescale AC. Overall, our contributions are summarized as follows:

• Our work furthers the theoretical understanding of AC (NAC) in its most practical form. We show for the first time that single-sample single-timescale NAC can provably find the ϵ-accurate global optimum with a sample complexity of O(ϵ^-2) for tasks with unbounded continuous state-action space. Previous works consider either specialized algorithm variants (Fu et al., 2020; Zhou & Lu, 2022), or more restricted settings with only local convergence guarantees (Chen et al., 2021; Olshevsky & Gharesifard, 2022).

• We also contribute to the study of RL for continuous control tasks. It is notable that even with the actor updated by a roughly estimated gradient, the single-sample single-timescale NAC algorithm can still find the globally optimal policy for LQR under general assumptions. Compared with all other model-free RL algorithms for solving LQR (see related work, Section 1.1), our work is the first to adopt the simplest single-sample single-loop structure, which may serve as a first step towards understanding the limits of AC (NAC) methods on continuous control tasks. In addition, compared with the state-of-the-art double-loop AC for solving LQR (Yang et al., 2019), we improve the sample complexity from O(ϵ^-5) to O(ϵ^-2). We also show empirically in Section 5 that the algorithm is much more sample-efficient than a few classic baselines, which unveils the practical wisdom of the AC (NAC) algorithm.

• Technically, we provide a new proof framework that establishes the finite-time convergence of single-timescale AC.
In the finite-time analysis of double-loop AC (Yang et al., 2019) and two-timescale AC (Wu et al., 2020), the previous techniques hinge on decoupling the analysis of the actor and the critic, establishing the convergence of the critic first and then the convergence of the actor consequently. The novelty of our proof framework is that we formulate the estimation errors of the time-average cost, the critic, and the natural policy gradient as an interconnected iteration system and establish their convergence simultaneously rather than separately. This proof framework may provide new insights for the finite-time analysis of other single-timescale algorithms.
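A toy numeric example, entirely our own construction rather than the paper's actual error system, illustrates why bounding an interconnected system jointly can succeed where per-coordinate bounds are conservative. Suppose three coupled error recursions evolve as e_{t+1} = M e_t. If some row sums of M exceed 1, a naive coordinate-by-coordinate worst-case bound fails; yet the joint system still contracts whenever the spectral radius of M is below 1.

```python
import numpy as np

# Hypothetical coupling matrix: each error is fed by its neighbors,
# mimicking how the cost-estimator, critic, and actor errors interact.
M = np.array([
    [0.5, 0.6, 0.0],   # cost-estimator error, driven by the critic error
    [0.1, 0.5, 0.6],   # critic error, driven by both neighbors
    [0.0, 0.1, 0.5],   # actor (gradient) error, driven by the critic error
])

rho = max(abs(np.linalg.eigvals(M)))   # joint contraction factor
row_sums = M.sum(axis=1)               # what a decoupled worst-case bound sees

e = np.ones(3)                         # initial errors
for _ in range(200):
    e = M @ e                          # run the coupled recursion

print(f"spectral radius = {rho:.3f}, worst row sum = {row_sums.max():.1f}")
print("errors after 200 steps:", e)
```

Here the worst row sum is 1.2, so no single coordinate admits a standalone contraction bound, but the spectral radius is about 0.85 and all three errors decay geometrically together. This is the flavor of the joint argument, in miniature.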

1.1. RELATED WORK

In this section, we review the existing works that are most relevant to ours.

Actor-Critic methods. The first AC algorithm was proposed by Konda & Tsitsiklis (1999). Kakade (2001) extended it to the natural AC algorithm. The asymptotic convergence of AC algorithms has been well established in Kakade (2001); Bhatnagar et al. (2009); Castro & Meir (2010); Zhang et al. (2020). Many recent works have focused on the finite-time convergence of AC methods. Under the double-loop setting, Yang et al. (2019) established the global convergence of AC methods for solving LQR. Wang et al. (2019) studied the global convergence of AC methods with both the actor and the critic parameterized by neural networks. Kumar et al. (2019) studied the finite-time

