ASYNCHRONOUS ADVANTAGE ACTOR CRITIC: NON-ASYMPTOTIC ANALYSIS AND LINEAR SPEEDUP

Abstract

Asynchronous and parallel implementation of standard reinforcement learning (RL) algorithms is a key enabler of the tremendous success of modern RL. Among many asynchronous RL algorithms, arguably the most popular and effective one is the asynchronous advantage actor-critic (A3C) algorithm. Although A3C is becoming the workhorse of RL, its theoretical properties are still not well understood, including the non-asymptotic analysis and the performance gain of parallelism (a.k.a. speedup). This paper revisits the A3C algorithm with TD(0) for the critic update, termed A3C-TD(0), with provable convergence guarantees. With linear value function approximation for the TD update, the convergence of A3C-TD(0) is established under both i.i.d. and Markovian sampling. Under i.i.d. sampling, A3C-TD(0) obtains a sample complexity of O(ε^{-2.5}/N) per worker to achieve ε accuracy, where N is the number of workers. Compared to the best-known sample complexity of O(ε^{-2.5}) for two-timescale AC, A3C-TD(0) achieves linear speedup, which justifies the advantage of parallelism and asynchrony in AC algorithms theoretically for the first time. Numerical tests on synthetically generated instances and OpenAI Gym environments have been provided to verify our theoretical analysis.

1. INTRODUCTION

Reinforcement learning (RL) has achieved impressive performance in many domains such as robotics [1, 2] and video games [3]. However, these empirical successes often come at the expense of significant computation. To unlock high computation capabilities, state-of-the-art RL approaches rely on sampling data from massive parallel simulators on multiple machines [3, 4, 5]. Empirically, these approaches can stabilize the learning process and reduce training time when they are implemented in an asynchronous manner. One popular RL method that often achieves the best empirical performance is the asynchronous variant of the actor-critic (AC) algorithm, referred to as A3C [3]. A3C builds on the original AC algorithm [6]. At a high level, AC simultaneously performs policy optimization (a.k.a. the actor step) using the policy gradient method [7] and policy evaluation (a.k.a. the critic step) using the temporal difference (TD) learning algorithm [8]. To ensure scalability, both the actor and critic steps can be combined with various function approximation techniques. To ensure stability, AC is often implemented in a two-timescale fashion, where the actor step runs on the slow timescale and the critic step runs on the fast timescale. Like other on-policy RL algorithms, AC uses samples generated from the target policy. Thus, data sampling is entangled with the learning procedure, which generates significant overhead. To speed up the sampling process of AC, A3C introduces multiple workers with a shared policy, and each worker has its own simulator to perform data sampling. The shared policy can then be updated using samples collected from the multiple workers. Despite the tremendous empirical success achieved by A3C, to the best of our knowledge, its theoretical properties are not well understood. The following theoretical questions remain unclear: Q1) Under what assumptions does A3C converge? Q2) What is its convergence rate?
Q3) Can A3C obtain a benefit (i.e., speedup) from parallelism and asynchrony? For Q3), we are interested in the training-time linear speedup with N workers, which is the ratio between the training time using a single worker and that using N workers. Since asynchronous parallelism mitigates the effect of stragglers and keeps all workers busy, the training-time speedup can be measured roughly by the sample (i.e., computational) complexity linear speedup [9], given by

Speedup(N) = (sample complexity when using one worker) / (average sample complexity per worker when using N workers).    (1)

If Speedup(N) = Θ(N), the speedup is linear, and the training time roughly decreases linearly as the number of workers increases. This paper aims to answer these questions, towards the goal of providing theoretical justification for the empirical successes of parallel and asynchronous RL.

1.1. RELATED WORKS

Analysis of actor-critic algorithms. The AC method was first proposed in [6, 10], with asymptotic convergence guarantees provided in [6, 10, 11]. It was not until recently that non-asymptotic analyses of AC were established. The finite-sample guarantee for the batch AC algorithm was established in [12, 13] with i.i.d. sampling. Later, in [14], a finite-sample analysis was established for the double-loop nested AC algorithm under the Markovian setting. An improved analysis for the Markovian setting with minibatch updates was presented in [15] for the nested AC method. More recently, [16, 17] have provided the first finite-time analyses for two-timescale AC algorithms under Markovian sampling, both with Õ(ε^{-2.5}) sample complexity, which is the best-known sample complexity for two-timescale AC. Through the lens of bilevel optimization, [18] has also provided finite-sample guarantees for this two-timescale Markovian-sampling setting, with global optimality guarantees when a natural policy gradient step is used in the actor. However, none of the existing works has analyzed the effect of asynchronous and parallel updates in AC.

Empirical parallel and distributed AC. In [3], the original A3C method was proposed and became the workhorse of empirical RL. Later, [19] provided a GPU version of A3C that significantly decreases training time. Recently, the A3C algorithm was further optimized for modern hardware in [20], where a large-batch variant of A3C with improved efficiency was also proposed. In [21], an importance-weighted distributed AC algorithm, IMPALA, was developed to solve a collection of problems with a single set of parameters. Recently, a gossip-based distributed yet synchronous AC algorithm was proposed in [5], which achieves performance competitive with A3C.

Asynchronous stochastic optimization. For solving general optimization problems, asynchronous stochastic methods have received much attention recently; their study can be traced back to the 1980s [22]. With batch size M, [23] analyzed asynchronous SGD (async-SGD) for convex functions and derived a convergence rate of O(K^{-1/2} M^{-1/2}) when the delay K_0 is bounded by O(K^{1/4} M^{-3/4}). This result implies linear speedup. [24] extended the analysis of [23] to smooth convex objectives with nonsmooth regularization and derived a similar rate. A recent study [25] improved the upper bound on K_0 to O(K^{1/2} M^{-1/2}). However, all these works focus on single-timescale SGD with a single variable, which cannot capture the coupled stochastic recursions of the AC and A3C algorithms. To the best of our knowledge, the non-asymptotic analysis of asynchronous two-timescale SGD has remained unaddressed, and its speedup analysis is even uncharted territory.

1.2. THIS WORK

In this context, we revisit A3C with TD(0) for the critic update, termed A3C-TD(0). The hope is to provide a non-asymptotic guarantee and a linear-speedup justification for this popular algorithm. Our contributions. Compared to the existing literature on both AC algorithms and async-SGD, our contributions can be summarized as follows. c1) We revisit two-timescale A3C-TD(0) and establish its convergence rates under both i.i.d. and Markovian sampling. To the best of our knowledge, this is the first non-asymptotic convergence result for asynchronous parallel AC algorithms. c2) We characterize the sample complexity of A3C-TD(0). In the i.i.d. setting, A3C-TD(0) achieves a sample complexity of O(ε^{-2.5}/N) per worker, where N is the number of workers. Compared to the best-known complexity of O(ε^{-2.5}) for i.i.d. two-timescale AC [18], A3C-TD(0) achieves linear speedup, thanks to parallelism and asynchrony. In the Markovian setting, if the delay is bounded, the sample complexity of A3C-TD(0) matches the order of the non-parallel AC algorithm [17].
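To make the asynchronous two-timescale update pattern concrete, the following toy sketch simulates N asynchronous workers serially: each worker computes a TD(0) critic increment and a policy-gradient actor increment at a parameter snapshot that is up to N-1 updates stale, mimicking the bounded-delay model. This is our own illustrative construction (a random synthetic MDP, softmax tabular policy, linear critic; all sizes, seeds, and step sizes are assumptions), not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP and linear features (all sizes and values are illustrative assumptions).
n_states, n_actions, d = 4, 2, 3
Phi = rng.normal(size=(n_states, d))        # state features for the linear critic
P = rng.random(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)           # row-stochastic transition kernel
R = rng.random(size=(n_states, n_actions))  # reward table
gamma, alpha, beta = 0.9, 0.01, 0.1         # discount; actor (slow) / critic (fast) steps

theta = np.zeros((n_states, n_actions))     # shared softmax policy parameters
w = np.zeros(d)                             # shared critic weights

def pi(s, th):
    """Softmax policy at state s under parameters th."""
    z = np.exp(th[s] - th[s].max())
    return z / z.sum()

def worker_update(snapshot, s):
    """Compute actor/critic increments at a (possibly stale) parameter snapshot."""
    th_old, w_old = snapshot
    a = rng.choice(n_actions, p=pi(s, th_old))
    s_next = rng.choice(n_states, p=P[s, a])
    delta = R[s, a] + gamma * Phi[s_next] @ w_old - Phi[s] @ w_old  # TD(0) error
    dw = delta * Phi[s]                                             # critic direction
    g = -pi(s, th_old)
    g[a] += 1.0                                                     # grad of log pi(a|s)
    dth = np.zeros_like(th_old)
    dth[s] = delta * g                                              # actor direction
    return dth, dw, s_next

# Serial simulation of N asynchronous workers: worker k re-reads the shared
# parameters only once every N steps, so its increments are computed at
# parameters that are up to N-1 updates stale (the bounded-delay model).
N, states = 4, [0, 1, 2, 3]
snapshots = [(theta.copy(), w.copy()) for _ in range(N)]
for t in range(200):
    k = t % N
    dth, dw, states[k] = worker_update(snapshots[k], states[k])
    theta += alpha * dth                     # shared actor update (slow timescale)
    w += beta * dw                           # shared critic update (fast timescale)
    snapshots[k] = (theta.copy(), w.copy())  # worker re-reads shared parameters
```

In a real A3C implementation the workers run in parallel threads or processes and the staleness is random rather than round-robin; the round-robin schedule here is just a deterministic stand-in for a delay bounded by N-1.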


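The speedup metric in (1) can be read off numerically from the complexity bounds: if one worker needs on the order of ε^{-2.5} samples and each of N workers needs only ε^{-2.5}/N, the ratio is exactly N. A hypothetical back-of-the-envelope check (constants dropped, mirroring the i.i.d. case):

```python
def speedup(n_workers: int, eps: float) -> float:
    # Eq. (1): single-worker sample complexity divided by the
    # average per-worker sample complexity with n_workers workers.
    single = eps ** -2.5                   # O(eps^{-2.5}) with one worker
    per_worker = eps ** -2.5 / n_workers   # O(eps^{-2.5}/N) per worker
    return single / per_worker
```

Since the ratio equals the number of workers for every ε, the speedup is Θ(N), i.e., linear.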