ASYNCHRONOUS ADVANTAGE ACTOR CRITIC: NON-ASYMPTOTIC ANALYSIS AND LINEAR SPEEDUP

Abstract

Asynchronous and parallel implementation of standard reinforcement learning (RL) algorithms is a key enabler of the tremendous success of modern RL. Among many asynchronous RL algorithms, arguably the most popular and effective one is the asynchronous advantage actor-critic (A3C) algorithm. Although A3C is becoming the workhorse of RL, its theoretical properties are still not well understood, including the non-asymptotic analysis and the performance gain of parallelism (a.k.a. speedup). This paper revisits the A3C algorithm with TD(0) for the critic update, termed A3C-TD(0), with provable convergence guarantees. With linear value function approximation for the TD update, the convergence of A3C-TD(0) is established under both i.i.d. and Markovian sampling. Under i.i.d. sampling, A3C-TD(0) obtains a sample complexity of O(ε^{-2.5}/N) per worker to achieve ε accuracy, where N is the number of workers. Compared to the best-known sample complexity of O(ε^{-2.5}) for two-timescale AC, A3C-TD(0) achieves linear speedup, which justifies the advantage of parallelism and asynchrony in AC algorithms theoretically for the first time. Numerical tests on synthetically generated instances and OpenAI Gym environments have been provided to verify our theoretical analysis.

1. INTRODUCTION

Reinforcement learning (RL) has achieved impressive performance in many domains such as robotics [1, 2] and video games [3]. However, these empirical successes often come at the expense of significant computation. To unlock high computation capabilities, state-of-the-art RL approaches rely on sampling data from massive parallel simulators on multiple machines [3, 4, 5]. Empirically, these approaches can stabilize the learning process and reduce training time when they are implemented in an asynchronous manner.

One popular RL method that often achieves the best empirical performance is the asynchronous variant of the actor-critic (AC) algorithm, referred to as A3C [3]. A3C builds on the original AC algorithm [6]. At a high level, AC simultaneously performs policy optimization (a.k.a. the actor step) using the policy gradient method [7] and policy evaluation (a.k.a. the critic step) using the temporal difference (TD) learning algorithm [8]. To ensure scalability, both actor and critic steps can be combined with various function approximation techniques. To ensure stability, AC is often implemented in a two-timescale fashion, where the actor step runs on the slow timescale and the critic step runs on the fast timescale. Like other on-policy RL algorithms, AC uses samples generated from the target policy; data sampling is thus entangled with the learning procedure, which generates significant overhead. To speed up the sampling process of AC, A3C introduces multiple workers sharing a common policy, and each worker has its own simulator to perform data sampling. The shared policy can then be updated using samples collected from the multiple workers.

Despite the tremendous empirical success achieved by A3C, to the best of our knowledge, its theoretical properties are not well understood. The following theoretical questions remain unclear: Q1) Under what assumptions does A3C converge? Q2) What is its convergence rate?
Q3) Can A3C obtain a benefit (or speedup) from parallelism and asynchrony? For Q3), we are interested in the training-time linear speedup with N workers, which is the ratio between the training time using a single worker and that using N workers. Since asynchronous parallelism mitigates the effect of stragglers and keeps all workers busy, the training time speedup
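To make the algorithm under study concrete, the per-worker A3C-TD(0) update with linear value function approximation can be sketched as below. This is a minimal serial, single-worker sketch on a hypothetical toy MDP: the MDP sizes, one-hot features, softmax policy parameterization, stepsizes, and all variable names are illustrative assumptions, and the asynchronous aggregation of gradients from N workers into the shared parameters is omitted. Note the two-timescale stepsizes: the critic stepsize `beta` is larger than the actor stepsize `alpha`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9                 # toy MDP sizes and discount (assumed)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a] = distribution over next states
R = rng.standard_normal((n_s, n_a))               # reward table

phi = np.eye(n_s)                 # linear critic features (one-hot for illustration)
theta = np.zeros(n_s)             # critic parameters: V(s) ~ phi[s] @ theta
omega = np.zeros((n_s, n_a))      # actor parameters: softmax policy logits

def policy(s):
    """Softmax policy pi(.|s) parameterized by omega."""
    z = omega[s] - omega[s].max()
    p = np.exp(z)
    return p / p.sum()

alpha, beta = 0.01, 0.05          # actor (slow) and critic (fast) stepsizes
s = 0
for t in range(2000):
    p = policy(s)
    a = rng.choice(n_a, p=p)
    s_next = rng.choice(n_s, p=P[s, a])
    # TD(0) error, which also serves as the advantage estimate
    delta = R[s, a] + gamma * phi[s_next] @ theta - phi[s] @ theta
    # critic step (fast timescale): TD(0) with linear function approximation
    theta += beta * delta * phi[s]
    # actor step (slow timescale): policy gradient weighted by the advantage
    grad_log = -p
    grad_log[a] += 1.0            # gradient of log-softmax w.r.t. the logits
    omega[s] += alpha * delta * grad_log
    s = s_next
```

In the actual A3C-TD(0) algorithm analyzed in this paper, each of the N workers computes such updates from its own simulator and applies them asynchronously to the shared (theta, omega); the sketch above only shows the local computation.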

