Decentralized Deterministic Multi-Agent Reinforcement Learning

Abstract

Zhang et al. [2018] provided the first decentralized actor-critic algorithm for multi-agent reinforcement learning (MARL) that offers convergence guarantees. In that work, policies are stochastic and are defined on finite action spaces. We extend those results to offer a provably-convergent decentralized actor-critic algorithm for learning deterministic policies on continuous action spaces. Deterministic policies are important in real-world settings. To handle the lack of exploration inherent in deterministic policies, we consider both off-policy and on-policy settings. We provide the expression of a local deterministic policy gradient, decentralized deterministic actor-critic algorithms, and convergence guarantees for linearly-approximated value functions. This work will help enable decentralized MARL in high-dimensional action spaces and pave the way for more widespread use of MARL.

Submitted to 34th Conference on Neural Information Processing Systems (NeurIPS 2020). Do not distribute.

Introduction

Multi-agent reinforcement learning (MARL) has seen considerably less use than its single-agent analog, in part because often no central agent exists to coordinate the cooperative agents. As a result, decentralized architectures have been advocated for MARL. Recently, decentralized architectures have been shown to admit convergence guarantees comparable to those of their centralized counterparts under mild network-specific assumptions (see Zhang et al. [2018], Suttle et al. [2019]).

In this work, we develop a decentralized actor-critic algorithm with deterministic policies for multi-agent reinforcement learning. Specifically, we extend results for actor-critic with stochastic policies (Bhatnagar et al. [2009], Degris et al. [2012], Maei [2018], Suttle et al. [2019]) to handle deterministic policies. Indeed, theoretical and empirical work has shown that deterministic algorithms outperform their stochastic counterparts in high-dimensional continuous-action settings (Silver et al. [2014b], Lillicrap et al. [2015], Fujimoto et al. [2018]). Deterministic policies further avoid estimating the complex integral over the action space; empirically, this allows for lower variance of the critic estimates and faster convergence. On the other hand, deterministic policy gradient methods suffer from reduced exploration. For this reason, we provide both off-policy and on-policy versions of our results, the off-policy version allowing for significant improvements in exploration. The contributions of this paper are three-fold: (1) we derive the expression of the gradient in terms of the long-term average reward, which is needed in the undiscounted multi-agent setting with deterministic policies; (2) we show that the deterministic policy gradient is the limiting case, as policy variance tends to zero, of the stochastic policy gradient; and (3) we provide a decentralized deterministic multi-agent actor-critic algorithm and prove its convergence under linear function approximation.
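Contribution (2), that the deterministic policy gradient is the zero-variance limit of the stochastic policy gradient, can be illustrated numerically. The sketch below uses a toy critic Q(s, a) = cos(a - s) and a linear policy mu_theta(s) = theta * s (both illustrative assumptions, not taken from this paper), and compares a score-function gradient estimate under a Gaussian policy N(mu_theta(s), sigma^2) against the deterministic gradient grad_theta mu * grad_a Q evaluated at a = mu_theta(s), as sigma shrinks.

```python
import math
import random

# Illustrative toy problem (not from the paper): a single state s,
# critic Q(s, a) = cos(a - s), deterministic policy mu_theta(s) = theta * s,
# and a Gaussian exploration policy N(mu_theta(s), sigma^2).

def q(s, a):
    return math.cos(a - s)

def det_grad(theta, s):
    # Deterministic policy gradient: grad_theta mu(s) * grad_a Q(s, a) at a = mu.
    mu = theta * s
    return s * (-math.sin(mu - s))

def stoch_grad(theta, s, sigma, n=200_000, seed=0):
    # Score-function (likelihood-ratio) gradient estimate, with Q(s, mu)
    # as a baseline so the Monte Carlo variance stays bounded as sigma -> 0.
    rng = random.Random(seed)
    mu = theta * s
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        a = mu + sigma * eps
        # grad_theta log pi(a | s) = (a - mu) / sigma^2 * s
        total += (q(s, a) - q(s, mu)) * (a - mu) / sigma**2 * s
    return total / n

theta, s = 0.5, 1.0
print("deterministic gradient:", det_grad(theta, s))
for sigma in (1.0, 0.5, 0.1):
    print(f"sigma={sigma}: stochastic gradient estimate = {stoch_grad(theta, s, sigma):.3f}")
```

For this choice of Q, the exact stochastic gradient is sin(s - mu) * s * exp(-sigma^2 / 2), so the estimates visibly approach the deterministic value as sigma decreases, matching the limiting-case claim.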

