Decentralized Deterministic Multi-Agent Reinforcement Learning

Abstract

Prior work provided the first decentralized actor-critic algorithm for multi-agent reinforcement learning (MARL) that offers convergence guarantees. In that work, policies are stochastic and are defined on finite action spaces. We extend those results to offer a provably-convergent decentralized actor-critic algorithm for learning deterministic policies on continuous action spaces. Deterministic policies are important in real-world settings. To handle the lack of exploration inherent in deterministic policies, we consider both off-policy and on-policy settings. We provide the expression of a local deterministic policy gradient, decentralized deterministic actor-critic algorithms, and convergence guarantees for linearly-approximated value functions. This work will help enable decentralized MARL in high-dimensional action spaces and pave the way for more widespread use of MARL.

1. Introduction

Cooperative multi-agent reinforcement learning (MARL) has seen considerably less use than its single-agent analog, in part because often no central agent exists to coordinate the cooperative agents. As a result, decentralized architectures have been advocated for MARL. Recently, decentralized architectures have been shown to admit convergence guarantees comparable to their centralized counterparts under mild network-specific assumptions (see Zhang et al. [2018], Suttle et al. [2019]).

In this work, we develop a decentralized actor-critic algorithm with deterministic policies for multi-agent reinforcement learning. Specifically, we extend results for actor-critic with stochastic policies (Bhatnagar et al. [2009], Degris et al. [2012], Maei [2018], Suttle et al. [2019]) to handle deterministic policies. Indeed, theoretical and empirical work has shown that deterministic algorithms outperform their stochastic counterparts in high-dimensional continuous action settings (Silver et al. [2014b], Lillicrap et al. [2015], Fujimoto et al. [2018]). Deterministic policies further avoid estimating the complex integral over the action space. Empirically, this allows for lower variance of the critic estimates and faster convergence. On the other hand, deterministic policy gradient methods suffer from reduced exploration. For this reason, we provide both off-policy and on-policy versions of our results, the off-policy version allowing for significant improvements in exploration. The contributions of this paper are three-fold: (1) we derive the expression of the gradient in terms of the long-term average reward, which is needed in the undiscounted multi-agent setting with deterministic policies; (2) we show that the deterministic policy gradient is the limiting case, as policy variance tends to zero, of the stochastic policy gradient; and (3) we provide a decentralized deterministic multi-agent actor-critic algorithm and prove its convergence under linear function approximation.

Submitted to 34th Conference on Neural Information Processing Systems (NeurIPS 2020). Do not distribute.

We consider a system of $N$ agents, denoted by $\mathcal{N} = [N]$, in a decentralized setting. Agents determine their decisions independently based on observations of their own rewards. Agents may however communicate via a possibly time-varying communication network, characterized by an undirected graph $G_t = (\mathcal{N}, E_t)$, where $E_t$ is the set of communication links connecting the agents at time $t \in \mathbb{N}$. The networked multi-agent MDP is characterized by a tuple $(S, \{A^i\}_{i \in \mathcal{N}}, P, \{R^i\}_{i \in \mathcal{N}}, \{G_t\}_{t \geq 0})$, where $S$ is a finite global state space shared by all agents in $\mathcal{N}$, $A^i$ is the action space of agent $i$, and $\{G_t\}_{t \geq 0}$ is a time-varying communication network. In addition, let $A = \prod_{i \in \mathcal{N}} A^i$ denote the joint action space of all agents. Then, $P : S \times A \times S \to [0, 1]$ is the state transition probability of the MDP, and $R^i : S \times A \to \mathbb{R}$ is the local reward function of agent $i$. States and actions are assumed globally observable whereas rewards are only locally observable. At time $t$, each agent $i$ chooses its action $a^i_t \in A^i$ given state $s_t \in S$, according to a local parameterized policy $\pi^i_{\theta^i} : S \times A^i \to [0, 1]$, where $\pi^i_{\theta^i}(s, a^i)$ is the probability of agent $i$ choosing action $a^i$ at state $s$, and $\theta^i \in \Theta^i \subseteq \mathbb{R}^{m_i}$ is the policy parameter. We pack the parameters together as $\theta = [(\theta^1)^\top, \cdots, (\theta^N)^\top]^\top \in \Theta$, where $\Theta = \prod_{i \in \mathcal{N}} \Theta^i$. We denote the joint policy by $\pi_\theta : S \times A \to [0, 1]$, where $\pi_\theta(s, a) = \prod_{i \in \mathcal{N}} \pi^i_{\theta^i}(s, a^i)$.

Note that decisions are decentralized in that rewards are observed locally, policies are evaluated locally, and actions are executed locally. We assume that for any $i \in \mathcal{N}$, $s \in S$, $a^i \in A^i$, the policy function satisfies $\pi^i_{\theta^i}(s, a^i) > 0$ for any $\theta^i \in \Theta^i$, and that $\pi^i_{\theta^i}(s, a^i)$ is continuously differentiable with respect to the parameters $\theta^i$ over $\Theta^i$. In addition, for any $\theta \in \Theta$, let $P^\theta : S \times S \to [0, 1]$ denote the transition matrix of the Markov chain $\{s_t\}_{t \geq 0}$ induced by policy $\pi_\theta$, that is, for any $s, s' \in S$, $P^\theta(s'|s) = \sum_{a \in A} \pi_\theta(s, a) \cdot P(s'|s, a)$. We make the standard assumption that the Markov chain $\{s_t\}_{t \geq 0}$ is irreducible and aperiodic under any $\pi_\theta$, and denote its stationary distribution by $d_\theta$.

Our objective is to find a policy $\pi_\theta$ that maximizes the long-term average reward over the network. Let $r^i_{t+1}$ denote the reward received by agent $i$ as a result of taking action $a^i_t$. Then, we wish to solve:
$$\max_\theta \; J(\theta) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\frac{1}{N} \sum_{i \in \mathcal{N}} r^i_{t+1}\right] = \sum_{s \in S,\, a \in A} d_\theta(s)\, \pi_\theta(s, a)\, \bar{R}(s, a),$$
where $\bar{R}(s, a) = (1/N) \cdot \sum_{i \in \mathcal{N}} R^i(s, a)$ is the globally averaged reward function. Let $\bar{r}_t = (1/N) \cdot \sum_{i \in \mathcal{N}} r^i_t$; then $\bar{R}(s, a) = \mathbb{E}[\bar{r}_{t+1} \,|\, s_t = s, a_t = a]$, and therefore the global relative action-value function is $Q_\theta(s, a) = \sum_{t \geq 0} \mathbb{E}[\bar{r}_{t+1} - J(\theta) \,|\, s_0 = s, a_0 = a, \pi_\theta]$, and the global relative state-value function is $V_\theta(s) = \sum_{a \in A} \pi_\theta(s, a)\, Q_\theta(s, a)$.

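The critic in this setting estimates the relative (average-reward) action-value function under linear function approximation. As a minimal illustration of what a linear average-reward temporal-difference update looks like, here is a generic single-agent TD(0) sketch on a toy two-state cycle; this is our own illustrative code with made-up names, not the paper's decentralized algorithm, which would additionally average critic parameters over the communication network $G_t$.

```python
import numpy as np

def td_step(w, mu, phi_t, phi_tp1, r, alpha=0.05, beta=0.05):
    """One generic average-reward TD(0) step for a linear critic Q_w = w^T phi.

    mu is a running estimate of the long-run average reward J(theta);
    the TD error subtracts mu instead of discounting (undiscounted setting).
    """
    delta = r - mu + w @ phi_tp1 - w @ phi_t  # average-reward TD error
    w_new = w + alpha * delta * phi_t         # critic parameter update
    mu_new = mu + beta * (r - mu)             # track the average reward
    return w_new, mu_new

# Toy chain: two states cycling 0 -> 1 -> 0 -> ..., reward 1 when leaving
# state 0, reward 0 when leaving state 1, so the average reward is 0.5.
w = np.zeros(2)
mu = 0.0
s = 0
for _ in range(4000):
    s_next = 1 - s
    r = 1.0 if s == 0 else 0.0
    w, mu = td_step(w, mu, np.eye(2)[s], np.eye(2)[s_next], r)
    s = s_next

print(f"average reward estimate: {mu:.3f}")          # approaches 0.5
print(f"relative value gap w[0]-w[1]: {w[0]-w[1]:.3f}")  # approaches 0.5
```

With one-hot features this is exact tabular TD, so the average-reward estimate settles near 0.5 and the relative values satisfy the average-reward Bellman equation up to step-size oscillation.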

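Contribution (2) above states that the deterministic policy gradient arises as the limit of the stochastic policy gradient as the policy variance tends to zero. The following self-contained sketch illustrates that claim numerically for a one-dimensional Gaussian policy and a toy critic; the function names and the choice of critic $Q(a) = \sin(a)$ are our own assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(a):
    # Toy critic; any smooth Q works for this illustration.
    return np.sin(a)

def stochastic_pg(theta, sigma, n=200_000):
    """Score-function estimate of grad_theta E_{a ~ N(theta, sigma^2)}[Q(a)].

    For this toy problem the exact value is cos(theta) * exp(-sigma^2 / 2),
    which tends to the deterministic gradient cos(theta) as sigma -> 0.
    """
    a = theta + sigma * rng.standard_normal(n)
    score = (a - theta) / sigma**2  # grad_theta log N(a; theta, sigma^2)
    return float(np.mean(score * q(a)))

def deterministic_pg(theta):
    """Deterministic policy gradient: grad_a Q(mu_theta) * grad_theta mu_theta.

    With mu_theta = theta this is simply Q'(theta) = cos(theta).
    """
    return float(np.cos(theta))

theta = 1.0
print(f"deterministic PG        : {deterministic_pg(theta):.4f}")  # cos(1) ~ 0.540
for sigma in (1.0, 0.3, 0.1):
    print(f"stochastic PG, sigma={sigma}: {stochastic_pg(theta, sigma):.4f}")
```

As sigma shrinks, the stochastic estimates approach the deterministic gradient, at the cost of a score-function variance that scales like $1/\sigma^2$, which is exactly the exploration/variance trade-off motivating the off-policy variants discussed above.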