SINGLE-TIMESCALE ACTOR-CRITIC PROVABLY FINDS GLOBALLY OPTIMAL POLICY

Abstract

We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms. While most existing works on actor-critic employ bi-level or two-timescale updates, we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously. Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once, while the actor is updated in the policy gradient direction computed using the critic. Moreover, we consider two function approximation settings in which both the actor and critic are represented by linear or deep neural networks. For both cases, we prove that the actor sequence converges to a globally optimal policy at a sublinear O(K^{-1/2}) rate, where K is the number of iterations. To the best of our knowledge, we establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time. Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove for the first time that actor-critic with deep neural networks finds the globally optimal policy at a sublinear rate.

1. INTRODUCTION

In reinforcement learning (RL) (Sutton et al., 1998), the agent aims to make sequential decisions that maximize the expected total reward through interacting with the environment and learning from experience, where the environment is modeled as a Markov decision process (MDP) (Puterman, 2014). To learn a policy that achieves the highest possible expected total reward, the actor-critic method (Konda and Tsitsiklis, 2000) is among the most commonly used algorithms. In actor-critic, the actor refers to the policy and the critic corresponds to the value function that characterizes the performance of the actor. This method directly optimizes the expected total return over the policy class by iteratively improving the actor, where the update direction is determined by the critic. In particular, actor-critic combined with deep neural networks (LeCun et al., 2015) has recently achieved tremendous empirical success in solving large-scale RL tasks, such as the game of Go (Silver et al., 2017), StarCraft (Vinyals et al., 2019), Dota (OpenAI, 2018), Rubik's cube (Agostinelli et al., 2019; Akkaya et al., 2019), and autonomous driving (Sallab et al., 2017). See Li (2017) for a detailed survey of the recent developments in deep reinforcement learning. Despite these great empirical successes of actor-critic, there is still an evident chasm between theory and practice. Specifically, to establish convergence guarantees for actor-critic, most existing works focus on either the bi-level setting or the two-timescale setting, which are seldom adopted in practice.
In particular, under the bi-level setting (Yang et al., 2019a; Wang et al., 2019; Agarwal et al., 2019; Fu et al., 2019; Liu et al., 2019; Abbasi-Yadkori et al., 2019a;b; Cai et al., 2019; Hao et al., 2020; Mei et al., 2020; Bhandari and Russo, 2020), the actor is updated only after the critic solves the policy evaluation sub-problem completely, which is equivalent to applying the Bellman evaluation operator to the previous critic infinitely many times. Consequently, actor-critic under the bi-level setting is a double-loop iterative algorithm whose inner loop is allocated to solving the policy evaluation sub-problem of the critic. In terms of theoretical analysis, such a double-loop structure decouples the analyses of the actor and critic. For the actor, the problem essentially reduces to analyzing the convergence of a variant of the policy gradient method (Sutton et al., 2000; Kakade, 2002) in which the error of the gradient estimate depends on the policy evaluation error of the critic. Meanwhile, under the two-timescale setting (Borkar and Konda, 1997; Konda and Tsitsiklis, 2000; Xu et al., 2020; Wu et al., 2020; Hong et al., 2020), the actor and the critic are updated simultaneously, but with disparate stepsizes. More concretely, the stepsize of the actor is set to be much smaller than that of the critic, with the ratio between these stepsizes converging to zero. Asymptotically, such a separation between stepsizes ensures that the critic completely solves its policy evaluation sub-problem. In other words, the two-timescale scheme yields an asymptotic separation between actor and critic, which leads to asymptotically unbiased policy gradient estimates. In sum, in terms of convergence analysis, the existing theory of actor-critic hinges on decoupling the analyses of critic and actor, which is ensured by focusing on the bi-level or two-timescale settings.
However, most practical implementations of actor-critic are under the single-timescale setting (Peters and Schaal, 2008a; Schulman et al., 2015; Mnih et al., 2016; Schulman et al., 2017; Haarnoja et al., 2018), where the actor and critic are updated simultaneously, and, in particular, the actor is updated before the critic reaches even an approximate solution to the policy evaluation sub-problem. Meanwhile, compared with the two-timescale setting, the actor is equipped with a much larger stepsize in the single-timescale setting, so the asymptotic separation between the analyses of actor and critic is no longer valid. Furthermore, when it comes to function approximation, most existing works only analyze the convergence of actor-critic with either linear function approximation (Xu et al., 2020; Wu et al., 2020; Hong et al., 2020) or shallow-neural-network parameterization (Wang et al., 2019; Liu et al., 2019). In contrast, practically used actor-critic methods such as asynchronous advantage actor-critic (Mnih et al., 2016) and soft actor-critic (Haarnoja et al., 2018) oftentimes represent both the actor and critic using deep neural networks.

Thus, the following question is left open:

Does single-timescale actor-critic provably find a globally optimal policy under the function approximation setting, especially when deep neural networks are employed?

To answer this question, we make the first attempt to investigate the convergence and global optimality of single-timescale actor-critic with linear and neural network function approximation. In particular, we focus on the family of energy-based policies and aim to find the optimal policy within this class. Here we represent both the energy function and the critic as linear functions or deep neural networks. In our actor-critic algorithm, the actor update follows proximal policy optimization (PPO) (Schulman et al., 2017) and the critic update is obtained by applying the Bellman evaluation operator only once to the current critic iterate. As a result, the actor is updated before the critic solves the policy evaluation sub-problem. Such a coupled updating structure persists even as the number of iterations goes to infinity, which implies that the update direction of the actor is always biased relative to the policy gradient direction. This brings an additional challenge that is absent in the bi-level and two-timescale settings, where the actor and critic are decoupled asymptotically. To tackle this challenge, our analysis captures the joint effect of the actor and critic updates on the objective function, dubbed the "double contraction" phenomenon, which plays a pivotal role in the success of single-timescale actor-critic. Specifically, thanks to the discount factor of the MDP, the Bellman evaluation operator is contractive, which implies that, after each update, the critic makes noticeable progress by moving toward the value function associated with the current actor. As a result, although we use a biased estimate of the policy gradient, the contraction brought by the discount factor keeps the cumulative effect of the biases controlled.
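The coupled update described above can be sketched on a small tabular MDP, a special case of linear function approximation with one-hot features. The random MDP, the shared stepsize `eta`, and the multiplicative-weights actor step (an NPG/PPO-style update for softmax energy-based policies) are illustrative assumptions, not the paper's exact algorithm; the point is only that the critic applies the Bellman evaluation operator once per iteration, before the actor moves.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                       # hypothetical small MDP
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] = next-state distribution
R = rng.uniform(0.0, 1.0, size=(S, A))        # reward r(s, a)

theta = np.zeros((S, A))  # actor: energy function of a softmax policy
Q = np.zeros((S, A))      # critic: action-value estimate

def policy(theta):
    """Energy-based (softmax) policy pi(a|s) ∝ exp(theta[s, a])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

eta = 0.2  # single shared stepsize: actor and critic move at the same timescale
for k in range(1000):
    pi = policy(theta)
    # Critic: apply the Bellman evaluation operator T^pi exactly ONCE,
    # i.e. Q <- r + gamma * P^pi Q, without solving policy evaluation.
    V = (pi * Q).sum(axis=1)
    Q = R + gamma * P @ V
    # Actor: PPO/NPG-style step in the direction suggested by the (biased) critic.
    theta = theta + eta * Q

print(policy(theta).argmax(axis=1))  # learned (near-)greedy action per state
```

Because the exact operator `T^pi` is a `gamma`-contraction, the one-step critic lags only geometrically behind `Q^{pi_k}`, which is the mechanism the paper's "double contraction" analysis formalizes.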
Such a phenomenon enables us to characterize the progress of each iteration of the joint actor and critic update, and thus yields convergence to the globally optimal policy. In particular, for both the linear and neural settings, we prove that single-timescale actor-critic finds an O(K^{-1/2})-globally optimal policy after K iterations. To the best of our knowledge, we establish the first theoretical guarantee of global convergence and global optimality for actor-critic with function approximation in the single-timescale setting. Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove for the first time that actor-critic with deep neural networks finds the globally optimal policy at a sublinear rate.
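The gamma-contraction of the Bellman evaluation operator invoked in this argument can be checked numerically. The random MDP and the fixed policy below are hypothetical stand-ins; the sup-norm inequality they verify, ||T^pi Q1 - T^pi Q2||_inf <= gamma * ||Q1 - Q2||_inf, holds for any such instance.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel (illustrative)
R = rng.uniform(0.0, 1.0, size=(S, A))       # reward table (illustrative)
pi = rng.dirichlet(np.ones(A), size=S)       # an arbitrary fixed policy

def bellman(Q):
    """Bellman evaluation operator (T^pi Q)(s, a) = r(s, a) + gamma * E[V(s')]."""
    V = (pi * Q).sum(axis=1)
    return R + gamma * P @ V

Q1 = rng.normal(size=(S, A))
Q2 = rng.normal(size=(S, A))
before = np.abs(Q1 - Q2).max()
after = np.abs(bellman(Q1) - bellman(Q2)).max()
print(before, after)  # the distance after one application shrinks by at least gamma
```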

