Decentralized Deterministic Multi-Agent Reinforcement Learning

Abstract

Zhang et al. [2018] provided the first decentralized actor-critic algorithm for multi-agent reinforcement learning (MARL) that offers convergence guarantees. In that work, policies are stochastic and are defined on finite action spaces. We extend those results to offer a provably-convergent decentralized actor-critic algorithm for learning deterministic policies on continuous action spaces. Deterministic policies are important in real-world settings. To handle the lack of exploration inherent in deterministic policies, we consider both off-policy and on-policy settings. We provide the expression of a local deterministic policy gradient, decentralized deterministic actor-critic algorithms, and convergence guarantees for linearly-approximated value functions. This work will help enable decentralized MARL in high-dimensional action spaces and pave the way for more widespread use of MARL.

1. Introduction

Cooperative multi-agent reinforcement learning (MARL) has seen considerably less use than its single-agent analog, in part because often no central agent exists to coordinate the cooperative agents. As a result, decentralized architectures have been advocated for MARL. Recently, decentralized architectures have been shown to admit convergence guarantees comparable to their centralized counterparts under mild network-specific assumptions (see Zhang et al. [2018], Suttle et al. [2019]). In this work, we develop a decentralized actor-critic algorithm with deterministic policies for multi-agent reinforcement learning. Specifically, we extend results for actor-critic with stochastic policies (Bhatnagar et al. [2009], Degris et al. [2012], Maei [2018], Suttle et al. [2019]) to handle deterministic policies. Indeed, theoretical and empirical work has shown that deterministic algorithms outperform their stochastic counterparts in high-dimensional continuous action settings (Silver et al. [2014b], Lillicrap et al. [2015], Fujimoto et al. [2018]). Deterministic policies further avoid estimating the complex integral over the action space. Empirically, this allows for lower variance of the critic estimates and faster convergence. On the other hand, deterministic policy gradient methods suffer from reduced exploration. For this reason, we provide both off-policy and on-policy versions of our results, the off-policy version allowing for significant improvements in exploration.
The contributions of this paper are three-fold: (1) we derive the expression of the gradient of the long-term average reward, which is needed in the undiscounted multi-agent setting with deterministic policies; (2) we show that the deterministic policy gradient is the limiting case, as policy variance tends to zero, of the stochastic policy gradient; and (3) we provide a decentralized deterministic multi-agent actor-critic algorithm and prove its convergence under linear function approximation.

Consider a system of N agents denoted by N = {1, …, N} in a decentralized setting. Agents determine their decisions independently based on observations of their own rewards. Agents may however communicate via a possibly time-varying communication network, characterized by an undirected graph G_t = (N, E_t), where E_t is the set of communication links connecting the agents at time t ∈ ℕ. The networked multi-agent MDP is thus characterized by a tuple (S, {A^i}_{i∈N}, P, {R^i}_{i∈N}, {G_t}_{t≥0}), where S is a finite global state space shared by all agents in N, A^i is the action space of agent i, and {G_t}_{t≥0} is a time-varying communication network. In addition, let A = ∏_{i∈N} A^i denote the joint action space of all agents. Then, P : S × A × S → [0, 1] is the state transition probability of the MDP, and R^i : S × A → ℝ is the local reward function of agent i. States and actions are assumed globally observable, whereas rewards are only locally observable. At time t, each agent i chooses its action a^i_t ∈ A^i given state s_t ∈ S, according to a local parameterized policy π^i_{θ^i} : S × A^i → [0, 1], where π^i_{θ^i}(s, a^i) is the probability of agent i choosing action a^i at state s, and θ^i ∈ Θ^i ⊆ ℝ^{m_i} is the policy parameter. We pack the parameters together as θ = [(θ^1)ᵀ, …, (θ^N)ᵀ]ᵀ ∈ Θ, where Θ = ∏_{i∈N} Θ^i. We denote the joint policy by π_θ : S × A → [0, 1], where π_θ(s, a) = ∏_{i∈N} π^i_{θ^i}(s, a^i).
Note that decisions are decentralized in that rewards are observed locally, policies are evaluated locally, and actions are executed locally. We assume that for any i ∈ N, s ∈ S, a^i ∈ A^i, the policy function satisfies π^i_{θ^i}(s, a^i) > 0 for any θ^i ∈ Θ^i, and that π^i_{θ^i}(s, a^i) is continuously differentiable with respect to the parameters θ^i over Θ^i. In addition, for any θ ∈ Θ, let P_θ : S × S → [0, 1] denote the transition matrix of the Markov chain {s_t}_{t≥0} induced by policy π_θ, that is, for any s, s′ ∈ S, P_θ(s′|s) = Σ_{a∈A} π_θ(s, a) P(s′|s, a). We make the standard assumption that the Markov chain {s_t}_{t≥0} is irreducible and aperiodic under any π_θ, and denote its stationary distribution by d_θ. Our objective is to find a policy π_θ that maximizes the long-term average reward over the network. Let r^i_{t+1} denote the reward received by agent i as a result of taking action a^i_t. Then, we wish to solve

max_θ J(π_θ) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} (1/N) Σ_{i∈N} r^i_{t+1} ] = Σ_{s∈S, a∈A} d_θ(s) π_θ(s, a) R̄(s, a),

where R̄(s, a) = (1/N) Σ_{i∈N} R^i(s, a) is the globally averaged reward function. Let r̄_t = (1/N) Σ_{i∈N} r^i_t; then R̄(s, a) = E[r̄_{t+1} | s_t = s, a_t = a]. Therefore, the global relative action-value function is Q_θ(s, a) = Σ_{t≥0} E[r̄_{t+1} − J(θ) | s_0 = s, a_0 = a, π_θ], and the global relative state-value function is V_θ(s) = Σ_{a∈A} π_θ(s, a) Q_θ(s, a). For simplicity, we refer to V_θ and Q_θ as simply the state-value function and action-value function. We define the advantage function as A_θ(s, a) = Q_θ(s, a) − V_θ(s).

Zhang et al. [2018] provided the first provably convergent MARL algorithm in the context of the above model. The fundamental result underlying their algorithm is a local policy gradient theorem:

∇_{θ^i} J(θ) = E_{s∼d_θ, a∼π_θ}[ ∇_{θ^i} log π^i_{θ^i}(s, a^i) · A^i_θ(s, a) ],

where A^i_θ(s, a) = Q_θ(s, a) − Ṽ^i_θ(s, a^{−i}) is a local advantage function and Ṽ^i_θ(s, a^{−i}) = Σ_{a^i∈A^i} π^i_{θ^i}(s, a^i) Q_θ(s, a^i, a^{−i}).
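As a concrete illustration, the quantities above can be computed directly for a tiny instance. The sketch below (hypothetical sizes, with randomly drawn transition kernel, local rewards, and local policies; not taken from the paper) builds the induced chain P_θ from a product policy, finds the stationary distribution d_θ by power iteration, and evaluates the long-term average reward J(π_θ):

```python
import numpy as np

# Minimal sketch: 2 agents, 2 states, 2 actions each (hypothetical sizes).
# The joint policy factorizes as pi_theta(s, a) = prod_i pi^i_{theta^i}(s, a^i).
rng = np.random.default_rng(0)
S, A1, A2 = 2, 2, 2

pi1 = rng.dirichlet(np.ones(A1), size=S)          # agent 1 local policy pi^1(s, .)
pi2 = rng.dirichlet(np.ones(A2), size=S)          # agent 2 local policy pi^2(s, .)
P = rng.dirichlet(np.ones(S), size=(S, A1, A2))   # P(s'|s, a^1, a^2)
R1 = rng.normal(size=(S, A1, A2))                 # local reward R^1(s, a)
R2 = rng.normal(size=(S, A1, A2))                 # local reward R^2(s, a)
Rbar = 0.5 * (R1 + R2)                            # globally averaged reward

# Induced state chain: P_theta(s'|s) = sum_a pi_theta(s, a) P(s'|s, a)
pi_joint = np.einsum('si,sj->sij', pi1, pi2)
P_theta = np.einsum('sij,sijt->st', pi_joint, P)

# Stationary distribution d_theta via power iteration
d = np.ones(S) / S
for _ in range(1000):
    d = d @ P_theta
d = d / d.sum()

# Long-term average reward J(theta) = sum_s d(s) sum_a pi(s, a) Rbar(s, a)
J = np.einsum('s,sij,sij->', d, pi_joint, Rbar)
print(J)
```

The same construction extends to any number of agents by taking further outer products of the local policies.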
This theorem has important practical value, as it shows that the policy gradient with respect to each local parameter θ^i can be obtained locally using the corresponding score function ∇_{θ^i} log π^i_{θ^i}, provided that agent i has an unbiased estimate of the advantage function A^i_θ or A_θ. With only local information, however, the advantage functions A^i_θ or A_θ cannot be well estimated, since the estimation requires the rewards {r^i_t}_{i∈N} of all agents. Therefore, they proposed a consensus-based actor-critic that leverages the communication network to share information between agents by placing a weight c_t(i, j) on the message transmitted from agent j to agent i at time t. Their action-value function Q_θ was approximated by a parameterized function Q̂_ω : S × A → ℝ, and each agent i maintains its own parameter ω^i, which it uses to form a local estimate Q̂_{ω^i} of the global Q_θ. At each time step t, each agent i shares its local parameter ω^i_t with its neighbors on the network, and the shared parameters are used to arrive at a consensual estimate of Q_θ over time.

In this work, we consider deterministic policies: each agent i selects its action according to a local deterministic policy μ^i_{θ^i} : S → A^i, and we write μ_θ : S → A for the induced joint policy. Our objective is to find a policy μ_θ that maximizes the long-run average reward:

max_θ J(μ_θ) = E_{s∼d_{μ_θ}}[ R̄(s, μ_θ(s)) ] = Σ_{s∈S} d_{μ_θ}(s) R̄(s, μ_θ(s)).

Analogous to the stochastic policy case, we denote the action-value function by Q_θ(s, a) = Σ_{t≥0} E[r̄_{t+1} − J(μ_θ) | s_0 = s, a_0 = a, μ_θ], and the state-value function by V_θ(s) = Q_θ(s, μ_θ(s)). When there is no ambiguity, we denote J(μ_θ) and d_{μ_θ} by simply J(θ) and d_θ, respectively. We present three results for the long-run average reward: (1) an expression for the local deterministic policy gradient ∇_{θ^i} J(μ_θ) in the on-policy setting, (2) an expression for the gradient in the off-policy setting, and (3) a proof that the deterministic policy gradient is the limit of the stochastic one.

On-Policy Setting

Theorem 1 (Local Deterministic Policy Gradient Theorem, On-Policy). For any θ ∈ Θ, i ∈ N, ∇_{θ^i} J(μ_θ) exists and is given by

∇_{θ^i} J(μ_θ) = E_{s∼d_{μ_θ}}[ ∇_{θ^i} μ^i_{θ^i}(s) ∇_{a^i} Q_θ(s, μ^{−i}_{θ^{−i}}(s), a^i) |_{a^i = μ^i_{θ^i}(s)} ].

The first step of the proof consists in showing that ∇_θ J(μ_θ) = E_{s∼d_θ}[ ∇_θ μ_θ(s) ∇_a Q_θ(s, a)|_{a=μ_θ(s)} ]. This is an extension of the well-known stochastic case, for which we have ∇_θ J(π_θ) = E_{s∼d_θ}[ ∇_θ log π_θ(a|s) Q_θ(s, a) ], which holds for a long-term averaged return with a stochastic policy (e.g., Theorem 1 of Sutton et al. [2000a]). See the Appendix for the details.

Off-Policy Setting

In the off-policy setting, we are given a behavior policy π : S → P(A), and our goal is to maximize the long-run average reward under the state distribution d_π:

J_π(μ_θ) = E_{s∼d_π}[ R̄(s, μ_θ(s)) ] = Σ_{s∈S} d_π(s) R̄(s, μ_θ(s)). (1)

Note that we consider here an excursion objective (Sutton et al. [2009], Silver et al. [2014a], Sutton et al. [2016]), since we average, over the state distribution of the behaviour policy π, the state-action reward obtained when selecting the action given by the target policy μ_θ. We thus have:

Theorem 2 (Local Deterministic Policy Gradient Theorem, Off-Policy). For any θ ∈ Θ, i ∈ N, and any fixed stochastic policy π : S → P(A), ∇_{θ^i} J_π(μ_θ) exists and is given by

∇_{θ^i} J_π(μ_θ) = E_{s∼d_π}[ ∇_{θ^i} μ^i_{θ^i}(s) ∇_{a^i} R̄(s, μ^{−i}_{θ^{−i}}(s), a^i) |_{a^i = μ^i_{θ^i}(s)} ].

Proof. Since d_π is independent of θ, we can take the gradient on both sides of (1):

∇_θ J_π(μ_θ) = E_{s∼d_π}[ ∇_θ μ_θ(s) ∇_a R̄(s, a)|_{a=μ_θ(s)} ].

Given that ∇_{θ^i} μ^j_{θ^j}(s) = 0 if i ≠ j, we have ∇_θ μ_θ(s) = Diag(∇_{θ^1} μ^1_{θ^1}(s), …, ∇_{θ^N} μ^N_{θ^N}(s)), and the result follows.

This result implies that, off-policy, each agent needs access to μ^{−i}_{θ^{−i}_t}(s_t) for every t.

Limit Theorem

As noted by Silver et al. [2014b], the fact that the deterministic gradient is a limit case of the stochastic gradient enables the standard machinery of policy gradient, such as compatible-function approximation (Sutton et al. [2000b]), natural gradients (Kakade [2001]), on-line feature adaptation (Prabuchandran et al. [2016]), and actor-critic (Konda [2002]), to be used with deterministic policies. We show that this limit result holds in our setting. The proof can be found in the Appendix.

Theorem 3 (Limit of the Stochastic Policy Gradient for MARL). Let π_{θ,σ} be a stochastic policy such that π_{θ,σ}(a|s) = ν_σ(μ_θ(s), a), where σ is a parameter controlling the variance and ν_σ satisfies Conditions 1 in the Appendix. Then,

lim_{σ↓0} ∇_θ J_{π_{θ,σ}}(π_{θ,σ}) = ∇_θ J_{μ_θ}(μ_θ),

where on the l.h.s. the gradient is the standard stochastic policy gradient and on the r.h.s. the gradient is the deterministic policy gradient.
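Theorem 3 can be sanity-checked numerically on a single-state example. The sketch below (a hypothetical quadratic-reward bandit with Gaussian smoothing playing the role of ν_σ; all values are illustrative) compares a Monte Carlo score-function estimate of the stochastic policy gradient with the deterministic gradient as σ shrinks:

```python
import numpy as np

# Toy single-state check: with mu_theta = theta and reward
# Rbar(a) = -(a - a_star)^2, the deterministic gradient is -2 (theta - a_star).
a_star = 1.5
Rbar = lambda a: -(a - a_star) ** 2
dRda = lambda a: -2.0 * (a - a_star)

theta = 0.3
det_grad = dRda(theta)       # d mu / d theta = 1, so this is the full gradient

rng = np.random.default_rng(1)
for sigma in [1.0, 0.1, 0.01]:
    # Gaussian exploration nu_sigma around mu_theta(s) = theta
    a = rng.normal(theta, sigma, size=1_000_000)
    # score-function (stochastic policy gradient) estimator
    score = (a - theta) / sigma ** 2
    sto_grad = np.mean(score * Rbar(a))
    print(sigma, sto_grad, det_grad)
```

For this quadratic reward the smoothed objective has gradient −2(θ − a*) for every σ, so the Monte Carlo estimates should hover around the deterministic value at all three noise levels, with the estimator variance (not its bias) growing as σ ↓ 0.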

4. Algorithms

We provide two decentralized deterministic actor-critic algorithms, one on-policy and the other off-policy, and demonstrate their convergence in the next section; assumptions and proofs are provided in the Appendix.

On-Policy Deterministic Actor-Critic

Algorithm 1 Networked deterministic on-policy actor-critic

Initialize: step t = 0; parameters Ĵ^i_0, ω^i_0, ω̃^i_0, θ^i_0, ∀i ∈ N; state s_0; stepsizes {β_{ω,t}}_{t≥0}, {β_{θ,t}}_{t≥0}.
Draw a^i_0 = μ^i_{θ^i_0}(s_0) and compute ψ^i_0 = ∇_{θ^i} μ^i_{θ^i_0}(s_0).
Observe joint action a_0 = (a^1_0, …, a^N_0) and ψ_0 = (ψ^1_0, …, ψ^N_0).
repeat
  for i ∈ N do
    Observe s_{t+1} and reward r^i_{t+1} = r^i(s_t, a_t).
    Update Ĵ^i_{t+1} ← (1 − β_{ω,t}) Ĵ^i_t + β_{ω,t} r^i_{t+1}.
    Draw action a^i_{t+1} = μ^i_{θ^i_t}(s_{t+1}) and compute ψ^i_{t+1} = ∇_{θ^i} μ^i_{θ^i_t}(s_{t+1}).
  end for
  Observe joint action a_{t+1} = (a^1_{t+1}, …, a^N_{t+1}) and ψ_{t+1} = (ψ^1_{t+1}, …, ψ^N_{t+1}).
  for i ∈ N do
    Update: δ^i_t ← r^i_{t+1} − Ĵ^i_t + Q̂_{ω^i_t}(s_{t+1}, a_{t+1}) − Q̂_{ω^i_t}(s_t, a_t).
    Critic step: ω̃^i_t ← ω^i_t + β_{ω,t} δ^i_t ∇_ω Q̂_ω(s_t, a_t)|_{ω=ω^i_t}.
    Actor step: θ^i_{t+1} ← θ^i_t + β_{θ,t} ∇_{θ^i} μ^i_{θ^i_t}(s_t) ∇_{a^i} Q̂_{ω^i_t}(s_t, a^{−i}_t, a^i)|_{a^i=a^i_t}.
    Send ω̃^i_t to the neighbors {j ∈ N : (i, j) ∈ E_t} over G_t.
    Consensus step: ω^i_{t+1} ← Σ_{j∈N} c^{ij}_t ω̃^j_t.
  end for
  Update t ← t + 1.
until end

Consider the following on-policy algorithm. The actor step is based on an expression for ∇_{θ^i} J(μ_θ) in terms of ∇_{a^i} Q_θ (see Equation (15) in the Appendix). We approximate the action-value function Q_θ using a family of functions Q̂_ω : S × A → ℝ parameterized by ω, a column vector in ℝ^K. Each agent i maintains its own parameter ω^i and uses Q̂_{ω^i} as its local estimate of Q_θ. The parameters ω^i are updated in the critic step using consensus updates through a weight matrix C_t = [c^{ij}_t]_{i,j} ∈ ℝ^{N×N}, where c^{ij}_t is the weight on the message transmitted from i to j at time t, namely:

Ĵ^i_{t+1} = (1 − β_{ω,t}) Ĵ^i_t + β_{ω,t} r^i_{t+1}, (2)
ω̃^i_t = ω^i_t + β_{ω,t} δ^i_t ∇_ω Q̂_ω(s_t, a_t)|_{ω=ω^i_t}, (3)
ω^i_{t+1} = Σ_{j∈N} c^{ij}_t ω̃^j_t, (4)

with δ^i_t = r^i_{t+1} − Ĵ^i_t + Q̂_{ω^i_t}(s_{t+1}, a_{t+1}) − Q̂_{ω^i_t}(s_t, a_t).
For the actor step, each agent i improves its policy via

θ^i_{t+1} = θ^i_t + β_{θ,t} ∇_{θ^i} μ^i_{θ^i_t}(s_t) ∇_{a^i} Q̂_{ω^i_t}(s_t, a^{−i}_t, a^i)|_{a^i=a^i_t}. (5)

Since Algorithm 1 is an on-policy algorithm, each agent updates the critic using only (s_t, a_t, s_{t+1}) at time t, knowing that a_{t+1} = μ_{θ_t}(s_{t+1}). The gradient terms ∇_{θ^i} μ^i_{θ^i_t}(s_{t+1}) are additional terms that need to be shared when using compatible features (this is explained further in the next section).

Off-Policy Deterministic Actor-Critic

We further propose an off-policy actor-critic algorithm, defined in Algorithm 2, to enable better exploration. Here, the goal is to maximize J_π(μ_θ), where π is the behavior policy. To do so, the globally averaged reward function R̄(s, a) is approximated using a family of functions R̂_λ : S × A → ℝ parameterized by λ, a column vector in ℝ^K. Each agent i maintains its own parameter λ^i and uses R̂_{λ^i} as its local estimate of R̄. Based on (1), the actor update is

θ^i_{t+1} = θ^i_t + β_{θ,t} ∇_{θ^i} μ^i_{θ^i_t}(s_t) ∇_{a^i} R̂_{λ^i_t}(s_t, μ^{−i}_{θ^{−i}_t}(s_t), a^i)|_{a^i=μ^i_{θ^i_t}(s_t)}, (6)

which requires each agent i to have access to μ^j_{θ^j_t}(s_t) for j ∈ N. The critic update is

λ̃^i_t = λ^i_t + β_{λ,t} δ^i_t ∇_λ R̂_λ(s_t, a_t)|_{λ=λ^i_t}, (7)
λ^i_{t+1} = Σ_{j∈N} c^{ij}_t λ̃^j_t, (8)

with δ^i_t = r^i(s_t, a_t) − R̂_{λ^i_t}(s_t, a_t). In this case, δ^i_t is motivated by distributed optimization results and is not related to the local TD-error (as there is no "temporal" relationship for R̄). Rather, it is simply the difference between the sampled reward and the bootstrap estimate. The gradient terms ∇_{θ^i} μ^i_{θ^i_t}(s_{t+1}) are additional terms that need to be shared when using compatible features (this is explained further in the next section).
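The consensus step (4) can be illustrated in isolation. Below is a minimal sketch (a hypothetical 4-agent ring network with Metropolis weights, a doubly stochastic special case of the weight matrices used here) showing the local critic parameters reaching agreement:

```python
import numpy as np

# Sketch of the consensus step: each agent mixes its critic parameter with
# its neighbors' through a weight matrix C built from a ring graph.
N, K = 4, 3
rng = np.random.default_rng(2)
omega = rng.normal(size=(N, K))      # local critic parameters omega^i
target = omega.mean(axis=0)          # consensus limit for doubly stochastic C

# Metropolis weights on a ring: c_ij = 1/3 for neighbors, rest on the diagonal
C = np.zeros((N, N))
for i in range(N):
    for j in [(i - 1) % N, (i + 1) % N]:
        C[i, j] = 1.0 / 3.0
    C[i, i] = 1.0 - C[i].sum()

for t in range(200):
    omega = C @ omega                # omega^i_{t+1} = sum_j c_ij omega^j_t

print(np.max(np.abs(omega - target)))   # disagreement shrinks to ~0
```

Because this C is doubly stochastic with spectral gap bounded away from zero, the disagreement contracts geometrically, which is the mechanism behind the consensus analysis in the next section.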

5. Convergence

To show convergence, we use a two-timescale technique in which the actor's update of the deterministic policy parameter θ^i occurs more slowly than the updates of ω^i and Ĵ^i in the critic. We study the asymptotic behaviour of the critic by freezing the joint policy μ_θ, then study the behaviour of θ_t under convergence of the critic. To ensure stability, projection is often assumed, since it is not clear how boundedness of θ^i_t can otherwise be ensured (see Bhatnagar et al. [2009]). However, in practice, convergence is typically observed even without the projection step (see Bhatnagar et al. [2009], Degris et al. [2012], Prabuchandran et al. [2016], Zhang et al. [2018], Suttle et al. [2019]). We also introduce the following technical assumptions, which are needed in the statement of the convergence results.

Algorithm 2 Networked deterministic off-policy actor-critic

Initialize: step t = 0; parameters λ^i_0, λ̃^i_0, θ^i_0, ∀i ∈ N; state s_0; stepsizes {β_{λ,t}}_{t≥0}, {β_{θ,t}}_{t≥0}.
Draw a^i_0 ∼ π^i(s_0), compute ȧ^i_0 = μ^i_{θ^i_0}(s_0) and ψ^i_0 = ∇_{θ^i} μ^i_{θ^i_0}(s_0).
Observe joint action a_0 = (a^1_0, …, a^N_0), ȧ_0 = (ȧ^1_0, …, ȧ^N_0) and ψ_0 = (ψ^1_0, …, ψ^N_0).
repeat
  for i ∈ N do
    Observe s_{t+1} and reward r^i_{t+1} = r^i(s_t, a_t).
  end for
  for i ∈ N do
    Update: δ^i_t ← r^i_{t+1} − R̂_{λ^i_t}(s_t, a_t).
    Critic step: λ̃^i_t ← λ^i_t + β_{λ,t} δ^i_t ∇_λ R̂_λ(s_t, a_t)|_{λ=λ^i_t}.
    Actor step: θ^i_{t+1} ← θ^i_t + β_{θ,t} ∇_{θ^i} μ^i_{θ^i_t}(s_t) ∇_{a^i} R̂_{λ^i_t}(s_t, μ^{−i}_{θ^{−i}_t}(s_t), a^i)|_{a^i=μ^i_{θ^i_t}(s_t)}.
    Send λ̃^i_t to the neighbors {j ∈ N : (i, j) ∈ E_t} over G_t.
  end for
  for i ∈ N do
    Consensus step: λ^i_{t+1} ← Σ_{j∈N} c^{ij}_t λ̃^j_t.
    Draw action a^i_{t+1} ∼ π^i(s_{t+1}), compute ȧ^i_{t+1} = μ^i_{θ^i_{t+1}}(s_{t+1}) and ψ^i_{t+1} = ∇_{θ^i} μ^i_{θ^i_{t+1}}(s_{t+1}).
  end for
  Observe joint action a_{t+1} = (a^1_{t+1}, …, a^N_{t+1}), ȧ_{t+1} = (ȧ^1_{t+1}, …, ȧ^N_{t+1}) and ψ_{t+1} = (ψ^1_{t+1}, …, ψ^N_{t+1}).
  Update t ← t + 1.
until end
Assumption 1 (Linear approximation, average reward). For each agent i, the average-reward function is approximated by linear functions, i.e., R̂_λ(s, a) = w(s, a)ᵀ λ, where w(s, a) = (w_1(s, a), …, w_K(s, a))ᵀ ∈ ℝ^K is the feature vector associated with the pair (s, a).

Assumption 2 (Linear approximation, action value). For each agent i, the action-value function is approximated by linear functions, i.e., Q̂_ω(s, a) = φ(s, a)ᵀ ω, where φ(s, a) ∈ ℝ^K is the feature vector associated with (s, a). Furthermore, we assume that for any θ ∈ Θ, the feature matrix Φ_θ ∈ ℝ^{|S|×K} has full column rank, where the k-th column of Φ_θ is [φ_k(s, μ_θ(s)), s ∈ S] for any k ∈ {1, …, K}. Also, for any u ∈ ℝ^K, Φ_θ u ≠ 1.

Assumption 3 (Bounding θ). The update of the policy parameter θ^i includes a local projection Γ^i : ℝ^{m_i} → Θ^i that projects any θ^i_t onto a compact set Θ^i that can be expressed as {θ^i | q^i_j(θ^i) ≤ 0, j = 1, …, s^i} ⊂ ℝ^{m_i}, for some real-valued, continuously differentiable functions {q^i_j}_{1≤j≤s^i} defined on ℝ^{m_i}. We also assume that Θ = ∏_{i=1}^N Θ^i is large enough to include at least one local minimum of J(θ).

We use {F_t} to denote the filtration with F_t = σ(s_τ, C_{τ−1}, a_{τ−1}, r_{τ−1}, τ ≤ t).

Assumption 4 (Random matrices). The sequence of non-negative random matrices {C_t = (c^{ij}_t)_{ij}} satisfies:
1. C_t is row stochastic and E(C_t | F_t) is a.s. column stochastic for each t, i.e., C_t 1 = 1 and 1ᵀ E(C_t | F_t) = 1ᵀ a.s. Furthermore, there exists a constant η ∈ (0, 1) such that c^{ij}_t ≥ η whenever c^{ij}_t > 0.
2. C_t respects the communication graph G_t, i.e., c^{ij}_t = 0 if (i, j) ∉ E_t.
3. The spectral norm of E[C_tᵀ (I − 11ᵀ/N) C_t] is smaller than one.
4. Given the σ-algebra generated by the random variables before time t, C_t is conditionally independent of s_t, a_t and r^i_{t+1} for any i ∈ N.

Assumption 5 (Step size rules, on-policy). The stepsizes β_{ω,t}, β_{θ,t} satisfy:
Σ_t β_{ω,t} = Σ_t β_{θ,t} = ∞, Σ_t (β²_{ω,t} + β²_{θ,t}) < ∞, Σ_t |β_{θ,t+1} − β_{θ,t}| < ∞.
In addition, β_{θ,t} = o(β_{ω,t}) and lim_{t→∞} β_{ω,t+1}/β_{ω,t} = 1.

Assumption 6 (Step size rules, off-policy). The step-sizes β_{λ,t}, β_{θ,t} satisfy:
Σ_t β_{λ,t} = Σ_t β_{θ,t} = ∞, Σ_t (β²_{λ,t} + β²_{θ,t}) < ∞, β_{θ,t} = o(β_{λ,t}), lim_{t→∞} β_{λ,t+1}/β_{λ,t} = 1.
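Concrete step-size sequences satisfying Assumption 5 are easy to construct. A sketch with hypothetical exponents: β_{ω,t} = t^{−0.6} and β_{θ,t} = t^{−0.9} are non-summable, square-summable, and satisfy β_{θ,t} = o(β_{ω,t}), so the actor runs on the slower timescale:

```python
# Two-timescale step sizes (hypothetical exponents 0.6 and 0.9; any pair
# p, q with 0.5 < p < q <= 1 works the same way).
def beta_omega(t):   # critic (fast) timescale: sum diverges, sum of squares converges
    return t ** -0.6

def beta_theta(t):   # actor (slow) timescale
    return t ** -0.9

# The ratio beta_theta / beta_omega = t^-0.3 vanishes, i.e. beta_theta = o(beta_omega)
ratios = [beta_theta(t) / beta_omega(t) for t in (10, 1000, 100000)]
print(ratios)
```

The vanishing ratio is exactly what lets the analysis treat the policy as frozen while the critic converges.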
On-Policy Convergence

To state convergence of the critic step, we define D^s_θ = Diag(d_θ(s), s ∈ S), R̄_θ = [R̄(s, μ_θ(s)), s ∈ S] ∈ ℝ^{|S|}, and the operator T^Q_θ : ℝ^{|S|} → ℝ^{|S|}, acting on any action-value vector Q′ ∈ ℝ^{|S|} (and not ℝ^{|S|·|A|}, since the deterministic policy associates one action to each state), as

T^Q_θ(Q′) = R̄_θ − J(μ_θ) 1 + P_θ Q′.

Theorem 4. Under Assumptions 3, 4, and 5, for any given deterministic policy μ_θ, with {Ĵ_t} and {ω_t} generated from (2)–(4), we have lim_{t→∞} (1/N) Σ_{i∈N} Ĵ^i_t = J(μ_θ) and lim_{t→∞} ω^i_t = ω_θ a.s. for any i ∈ N, where J(μ_θ) = Σ_{s∈S} d_θ(s) R̄(s, μ_θ(s)) is the long-term average return under μ_θ, and ω_θ is the unique solution to

Φ_θᵀ D^s_θ [ T^Q_θ(Φ_θ ω_θ) − Φ_θ ω_θ ] = 0.

Moreover, ω_θ is the minimizer of the Mean Square Projected Bellman Error (MSPBE), i.e., the solution to

minimize_ω ‖Φ_θ ω − Π T^Q_θ(Φ_θ ω)‖²_{D^s_θ},

where Π is the operator that projects a vector onto the space spanned by the columns of Φ_θ, and ‖·‖²_{D^s_θ} denotes the Euclidean norm weighted by the matrix D^s_θ.

To state convergence of the actor step, we define the quantities ψ^i_{t,θ}, ξ^i_t and ξ^i_{t,θ} as

ψ^i_{t,θ} = ∇_{θ^i} μ^i_{θ^i}(s_t), ψ^i_t = ψ^i_{t,θ_t} = ∇_{θ^i} μ^i_{θ^i_t}(s_t),
ξ^i_{t,θ} = ∇_{a^i} Q̂_{ω_θ}(s_t, a^{−i}_t, a^i)|_{a^i=μ^i_{θ^i_t}(s_t)} = ∇_{a^i} φ(s_t, a^{−i}_t, a^i)ᵀ|_{a^i=μ^i_{θ^i_t}(s_t)} ω_θ,
ξ^i_t = ∇_{a^i} Q̂_{ω^i_t}(s_t, a^{−i}_t, a^i)|_{a^i=μ^i_{θ^i_t}(s_t)} = ∇_{a^i} φ(s_t, a^{−i}_t, a^i)ᵀ|_{a^i=μ^i_{θ^i_t}(s_t)} ω^i_t.

Additionally, we introduce the operator Γ̂^i(·) as

Γ̂^i[g(θ)] = lim_{0<η→0} [ Γ^i(θ^i + η g(θ)) − θ^i ] / η (11)

for any θ ∈ Θ and any continuous function g : Θ → ℝ^{m_i}. In case the limit above is not unique, we take Γ̂^i[g(θ)] to be the set of all possible limit points of (11).

Theorem 5. Under Assumptions 2, 3, 4, and 5, the policy parameter θ^i_t obtained from (5) converges a.s. to a point in the set of asymptotically stable equilibria of

θ̇^i = Γ̂^i[ E_{s_t∼d_θ, a_t∼μ_θ}( ψ^i_{t,θ} ξ^i_{t,θ} ) ],

for any i ∈ N.
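For a frozen policy, the critic fixed point in Theorem 4 is linear in ω and can be solved directly. The sketch below (a hypothetical 3-state chain with random features; with Gaussian features the full-rank condition holds and the constant vector is almost surely outside their span) verifies that the solution satisfies the projected Bellman equation:

```python
import numpy as np

# Solve Phi^T D (T(Phi w) - Phi w) = 0 for a toy average-reward chain.
S, K = 3, 2
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(S), size=S)      # P_theta under a frozen policy
Rbar = rng.normal(size=S)                  # R(s, mu_theta(s))
Phi = rng.normal(size=(S, K))              # feature matrix, full column rank a.s.

# stationary distribution d_theta: left Perron eigenvector of P
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d = d / d.sum()
D = np.diag(d)
J = d @ Rbar                               # long-run average reward

# Phi^T D (Rbar - J 1 + P Phi w - Phi w) = 0  <=>  A w = b
A = Phi.T @ D @ (np.eye(S) - P) @ Phi
b = Phi.T @ D @ (Rbar - J)
w = np.linalg.solve(A, b)

residual = Phi.T @ D @ (Rbar - J + P @ (Phi @ w) - Phi @ w)
print(residual)   # ~0: w is the TD fixed point / MSPBE minimizer
```

Note that A would be singular if the columns of Phi could represent the constant vector 1 (since (I − P)1 = 0), which is precisely why that case is excluded by assumption.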
In the case of multiple limit points, the above is treated as a differential inclusion rather than an ODE. The convergence of the critic step can be proved by taking steps similar to those in Zhang et al. [2018]. For the convergence of the actor step, difficulties arise from the projection (which is handled using the Kushner-Clark Lemma, Kushner and Clark [1978]) and from the state-dependent noise (which is handled by "natural" timescale averaging, Crowder [2009]). Details are provided in the Appendix.

Remark. Note that with a linear function approximator of Q_θ,

ψ_{t,θ} ξ_{t,θ} = ∇_θ μ_θ(s_t) ∇_a Q̂_{ω_θ}(s_t, a)|_{a=μ_θ(s_t)}

may not be an unbiased estimate of ∇_θ J(θ):

E_{s∼d_θ}[ψ_{t,θ} ξ_{t,θ}] = ∇_θ J(θ) + E_{s∼d_θ}[ ∇_θ μ_θ(s) ( ∇_a Q̂_{ω_θ}(s, a)|_{a=μ_θ(s)} − ∇_a Q_θ(s, a)|_{a=μ_θ(s)} ) ].

A standard approach to overcome this approximation issue is via compatible features (see, e.g., Silver et al. [2014b]): take φ(s, a) = a ∇_θ μ_θ(s)ᵀ, giving, for ω ∈ ℝ^m,

Q̂_ω(s, a) = aᵀ ∇_θ μ_θ(s)ᵀ ω = (a − μ_θ(s))ᵀ ∇_θ μ_θ(s)ᵀ ω + V̂_ω(s),

with V̂_ω(s) = Q̂_ω(s, μ_θ(s)) and ∇_a Q̂_ω(s, a)|_{a=μ_θ(s)} = ∇_θ μ_θ(s)ᵀ ω. We thus expect the convergent point of (5) to correspond to a small neighborhood of a local optimum of J(μ_θ), i.e., of ∇_{θ^i} J(μ_θ) = 0, provided that the error in the gradient of the action-value function, ∇_a Q̂_ω(s, a)|_{a=μ_θ(s)} − ∇_a Q_θ(s, a)|_{a=μ_θ(s)}, is small. However, note that using compatible features requires computing, at each step t, φ(s_t, a_t) = a_t ∇_θ μ_θ(s_t)ᵀ. Thus, in Algorithm 1, each agent observes not only the joint action a_{t+1} = (a^1_{t+1}, …, a^N_{t+1}) but also (∇_{θ^1} μ^1_{θ^1_t}(s_{t+1}), …, ∇_{θ^N} μ^N_{θ^N_t}(s_{t+1})) (these are the extra shared gradient terms in Algorithm 1).
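The algebra of compatible features in the remark above can be checked mechanically: with Q̂_ω linear in (a − μ_θ(s))ᵀ ∇_θ μ_θ(s)ᵀ ω, the action gradient at a = μ_θ(s) is exactly ∇_θ μ_θ(s)ᵀ ω. A sketch with hypothetical dimensions and random values:

```python
import numpy as np

# Compatible-feature critic: Q_w(s, a) = (a - mu(s))^T G^T w + V(s),
# where G stands for grad_theta mu_theta(s) (m x n, hypothetical values).
m, n = 4, 2                                 # theta in R^m, actions in R^n
rng = np.random.default_rng(4)
G = rng.normal(size=(m, n))                 # grad_theta mu_theta(s)
mu_s = rng.normal(size=n)                   # mu_theta(s)
w = rng.normal(size=m)                      # critic parameter

def Q(a, V_s=0.0):
    return (a - mu_s) @ (G.T @ w) + V_s     # linear in a

# grad_a Q is G^T w everywhere; verify at a = mu(s) by central differences
eps = 1e-6
num_grad = np.array([(Q(mu_s + eps * e) - Q(mu_s - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
print(num_grad, G.T @ w)
```

Since Q̂_ω is affine in a, the finite-difference gradient matches ∇_θ μ_θ(s)ᵀ ω up to rounding, which is the property that removes the gradient bias in the actor update.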

Off-Policy Convergence

Theorem 6. Under Assumptions 1, 4, and 6, for any given behavior policy π and any θ ∈ Θ, with {λ^i_t} generated from (7), we have lim_{t→∞} λ^i_t = λ_θ a.s. for any i ∈ N, where λ_θ is the unique solution to

B_{π,θ} λ_θ = A_{π,θ} d^s_π,

where d^s_π = [d_π(s), s ∈ S], A_{π,θ} = [ ∫_A π(a|s) R̄(s, a) w(s, a) da, s ∈ S ] ∈ ℝ^{K×|S|}, and B_{π,θ} = Σ_{s∈S} d_π(s) ∫_A π(a|s) w(s, a) w(s, a)ᵀ da ∈ ℝ^{K×K}.

From here on we let

ξ^i_{t,θ} = ∇_{a^i} R̂_{λ_θ}(s_t, μ^{−i}_{θ^{−i}_t}(s_t), a^i)|_{a^i=μ^i_{θ^i_t}(s_t)} = ∇_{a^i} w(s_t, μ^{−i}_{θ^{−i}_t}(s_t), a^i)ᵀ|_{a^i=μ^i_{θ^i_t}(s_t)} λ_θ,
ξ^i_t = ∇_{a^i} R̂_{λ^i_t}(s_t, μ^{−i}_{θ^{−i}_t}(s_t), a^i)|_{a^i=μ^i_{θ^i_t}(s_t)} = ∇_{a^i} w(s_t, μ^{−i}_{θ^{−i}_t}(s_t), a^i)ᵀ|_{a^i=μ^i_{θ^i_t}(s_t)} λ^i_t,

and we keep ψ^i_{t,θ} = ∇_{θ^i} μ^i_{θ^i}(s_t), ψ^i_t = ψ^i_{t,θ_t} = ∇_{θ^i} μ^i_{θ^i_t}(s_t).

Theorem 7. Under Assumptions 1, 3, 4, and 6, the policy parameter θ^i_t obtained from (6) converges a.s. to a point in the set of asymptotically stable equilibria of

θ̇^i = Γ̂^i[ E_{s∼d_π}( ψ^i_{t,θ} ξ^i_{t,θ} ) ].

We define compatible features for the average-reward function in a manner analogous to those for the action-value function: w_θ(s, a) = (a − μ_θ(s)) ∇_θ μ_θ(s)ᵀ. For λ ∈ ℝ^m,

R̂_{λ,θ}(s, a) = (a − μ_θ(s))ᵀ ∇_θ μ_θ(s)ᵀ λ, ∇_a R̂_{λ,θ}(s, a) = ∇_θ μ_θ(s)ᵀ λ,

and we have, for λ* = argmin_λ E_{s∼d_π} ‖∇_a R̂_{λ,θ}(s, a)|_{a=μ_θ(s)} − ∇_a R̄(s, a)|_{a=μ_θ(s)}‖²:

∇_θ J_π(μ_θ) = E_{s∼d_π}[ ∇_θ μ_θ(s) ∇_a R̄(s, a)|_{a=μ_θ(s)} ] = E_{s∼d_π}[ ∇_θ μ_θ(s) ∇_a R̂_{λ*,θ}(s, a)|_{a=μ_θ(s)} ].

The use of compatible features requires each agent to observe not only the joint action taken, a_{t+1} = (a^1_{t+1}, …, a^N_{t+1}), and the "on-policy action" ȧ_{t+1} = (ȧ^1_{t+1}, …, ȧ^N_{t+1}), but also (∇_{θ^1} μ^1_{θ^1_t}(s_{t+1}), …, ∇_{θ^N} μ^N_{θ^N_t}(s_{t+1})) (these are the extra shared gradient terms in Algorithm 2).

We illustrate algorithm convergence on a multi-agent extension of a continuous bandit problem from Section 5.1 of Silver et al. [2014b]. Details are in the Appendix. Figure 2 shows the convergence of Algorithms 1 and 2 averaged over 5 runs.
In all cases, the system converges and the agents are able to coordinate their actions to minimize system cost. 

6. Conclusion

We have provided the tools needed to implement decentralized, deterministic actor-critic algorithms for cooperative multi-agent reinforcement learning. We provide the expressions for the policy gradients and the algorithms themselves, and prove their convergence in on-policy and off-policy settings. We also provide numerical results for a continuous multi-agent bandit problem that demonstrate the convergence of our algorithms. Our work differs from Zhang and Zavlanos [2019] in that the latter was based on policy consensus, whereas ours is based on critic consensus. Our approach represents agreement between agents on every participant's contribution to the global reward, and as such provides a consensus scoring function with which to evaluate agents. Our approach may be used in compensation schemes to incentivize participation. An interesting extension of this work would be to prove convergence of our actor-critic algorithm for continuous state spaces, as it may hold with assumptions on the geometric ergodicity of the stationary state distribution induced by the deterministic policies (see Crowder [2009]). The expected policy gradient (EPG) of Ciosek and Whiteson [2018], a hybrid between stochastic and deterministic policy gradients, would also be interesting to leverage. The Multi-Agent Deep Deterministic Policy Gradient algorithm (MADDPG) of Lowe et al. [2017] assumes partial observability for each agent and would be a useful extension, but it is likely difficult to extend our convergence guarantees to the partially observed setting.

Numerical experiment details

We demonstrate the convergence of our algorithm in a continuous bandit problem that is a multi-agent extension of the experiment in Section 5.1 of Silver et al. [2014b]. Each agent chooses an action a^i ∈ ℝ^m. We assume all agents have the same reward function, given by R^i(a) = −(Σ_{j∈N} a^j − a*)ᵀ C (Σ_{j∈N} a^j − a*). The matrix C is positive definite with eigenvalues chosen from {0.1, 1}, and a* = [4, …, 4]ᵀ. We consider 10 agents and action dimensions m = 10, 20, 50. Note that there are multiple possible solutions to this problem, requiring the agents to coordinate their actions so that they sum to a*. We assume a target policy of the form μ_{θ^i} = θ^i for each agent i and a Gaussian behaviour policy β(·) ∼ N(θ^i, σ²_β), where σ_β = 0.1. We use the Gaussian behaviour policy for both Algorithms 1 and 2. Strictly speaking, Algorithm 1 is on-policy, but in this simplified setting where the target policy is constant over states, the on-policy version would be degenerate, in that the Q estimate would not affect the TD-error. Therefore, we add a Gaussian behaviour policy to Algorithm 1. Each agent maintains an estimate Q̂_{ω^i}(a) of the critic using a linear function of the compatible features a − θ and a bias feature. The critic is recomputed from each successive batch of 2m steps and the actor is updated once per batch. The critic step size is 0.1 and the actor step size is 0.01. Performance is evaluated by measuring the cost of the target policy (without exploration). Figure 2 shows the convergence of Algorithms 1 and 2 averaged over 5 runs. In all cases, the system converges and the agents are able to coordinate their actions to minimize system cost. The Jupyter notebook will be made available for others to use. In fact, in this simple experiment, we also observe convergence under discounted rewards.
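A simplified version of this experiment can be sketched as follows. It departs from the paper's setup in two hedged ways: a single batch least-squares critic replaces the per-agent critics with consensus, and the critic is fit on the summed perturbation (which is sufficient here because the reward depends on the joint action only through Σ_i a^i); the step size and exploration noise follow the text:

```python
import numpy as np

# Multi-agent continuous bandit (simplified sketch, not the paper's exact code).
rng = np.random.default_rng(5)
N, m = 10, 10
C = np.diag(rng.choice([0.1, 1.0], size=m))   # positive definite cost matrix
a_star = np.full(m, 4.0)

def reward(actions):                           # actions: (N, m) joint action
    diff = actions.sum(axis=0) - a_star
    return -diff @ C @ diff

theta = rng.normal(size=(N, m))                # target policy mu_{theta^i} = theta^i
sigma_b, batch, lr_actor = 0.1, 4 * m, 0.01

for it in range(500):
    # Gaussian behaviour policy around the target policy
    Delta = sigma_b * rng.normal(size=(batch, N, m))
    y = np.array([reward(theta + Delta[k]) for k in range(batch)])
    # linear critic (plus intercept) in dbar = sum_i (a^i - theta^i)
    dbar = Delta.sum(axis=1)                   # (batch, m)
    X = np.hstack([dbar, np.ones((batch, 1))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    g = coef[:m]                               # estimate of grad_{a^i} R at a = theta
    theta = theta + lr_actor * g[None, :]      # deterministic actor step, all agents

cost = -reward(theta)
print(cost)   # small: the agents' actions coordinate to sum to a_star
```

As in the paper's experiment, the individual θ^i do not converge to a unique point; only their sum is pinned down, which is why the agents must coordinate.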

Proof of Theorem 1

The proof follows the same scheme as Sutton et al. [2000a], naturally extending their results to a deterministic policy μ_θ and a continuous action space A. Note that our regularity assumptions ensure that, for any s ∈ S, V_θ(s), ∇_θ V_θ(s), J(θ), ∇_θ J(θ), and d_θ(s) are Lipschitz-continuous functions of θ (since μ_θ is twice continuously differentiable and Θ is compact), and that Q_θ(s, a) and ∇_a Q_θ(s, a) are Lipschitz-continuous functions of a (Marbach and Tsitsiklis [2001]).

We first show that ∇_θ J(θ) = E_{s∼d_θ}[ ∇_θ μ_θ(s) ∇_a Q_θ(s, a)|_{a=μ_θ(s)} ]. The Poisson equation under policy μ_θ is given by (Puterman [1994]):

Q_θ(s, a) = R̄(s, a) − J(θ) + Σ_{s′∈S} P(s′|s, a) V_θ(s′).

So,

∇_θ V_θ(s) = ∇_θ Q_θ(s, μ_θ(s))
= ∇_θ [ R̄(s, μ_θ(s)) − J(θ) + Σ_{s′∈S} P(s′|s, μ_θ(s)) V_θ(s′) ]
= ∇_θ μ_θ(s) ∇_a R̄(s, a)|_{a=μ_θ(s)} − ∇_θ J(θ) + ∇_θ Σ_{s′∈S} P(s′|s, μ_θ(s)) V_θ(s′)
= ∇_θ μ_θ(s) ∇_a R̄(s, a)|_{a=μ_θ(s)} − ∇_θ J(θ) + Σ_{s′∈S} [ ∇_θ μ_θ(s) ∇_a P(s′|s, a)|_{a=μ_θ(s)} V_θ(s′) + P(s′|s, μ_θ(s)) ∇_θ V_θ(s′) ]
= ∇_θ μ_θ(s) ∇_a [ R̄(s, a) + Σ_{s′∈S} P(s′|s, a) V_θ(s′) ]|_{a=μ_θ(s)} − ∇_θ J(θ) + Σ_{s′∈S} P(s′|s, μ_θ(s)) ∇_θ V_θ(s′)
= ∇_θ μ_θ(s) ∇_a Q_θ(s, a)|_{a=μ_θ(s)} + Σ_{s′∈S} P(s′|s, μ_θ(s)) ∇_θ V_θ(s′) − ∇_θ J(θ).

Hence,

∇_θ J(θ) = ∇_θ μ_θ(s) ∇_a Q_θ(s, a)|_{a=μ_θ(s)} + Σ_{s′∈S} P(s′|s, μ_θ(s)) ∇_θ V_θ(s′) − ∇_θ V_θ(s).

Summing over s with weights d_θ(s),

Σ_{s∈S} d_θ(s) ∇_θ J(θ) = Σ_{s∈S} d_θ(s) ∇_θ μ_θ(s) ∇_a Q_θ(s, a)|_{a=μ_θ(s)} + Σ_{s∈S} d_θ(s) Σ_{s′∈S} P(s′|s, μ_θ(s)) ∇_θ V_θ(s′) − Σ_{s∈S} d_θ(s) ∇_θ V_θ(s).

Using the stationarity of d_θ, we get Σ_{s∈S} Σ_{s′∈S} d_θ(s) P(s′|s, μ_θ(s)) ∇_θ V_θ(s′) = Σ_{s′∈S} d_θ(s′) ∇_θ V_θ(s′), so the last two terms cancel. Therefore,

∇_θ J(θ) = Σ_{s∈S} d_θ(s) ∇_θ μ_θ(s) ∇_a Q_θ(s, a)|_{a=μ_θ(s)} = E_{s∼d_θ}[ ∇_θ μ_θ(s) ∇_a Q_θ(s, a)|_{a=μ_θ(s)} ].

Given that ∇_{θ^i} μ^j_{θ^j}(s) = 0 if i ≠ j, we have ∇_θ μ_θ(s) = Diag(∇_{θ^1} μ^1_{θ^1}(s), …, ∇_{θ^N} μ^N_{θ^N}(s)), which implies

∇_{θ^i} J(θ) = E_{s∼d_θ}[ ∇_{θ^i} μ^i_{θ^i}(s) ∇_{a^i} Q_θ(s, μ^{−i}_{θ^{−i}}(s), a^i)|_{a^i=μ^i_{θ^i}(s)} ]. (15)
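The gradient expression just derived can be verified numerically on a small synthetic MDP (hypothetical softmax transition kernel and smooth reward; scalar action per state and one parameter per state, so ∇_θ μ_θ(s) is a coordinate vector). The policy-gradient formula should match brute-force finite differences of J(θ):

```python
import numpy as np

S = 3                        # hypothetical 3-state MDP with scalar actions
rng = np.random.default_rng(6)
W = rng.normal(size=(S, S))
b = rng.normal(size=(S, S))

def trans(s, a):             # P(.|s, a): softmax in a, smooth and positive
    z = W[s] * a + b[s]
    e = np.exp(z - z.max())
    return e / e.sum()

def rbar(s, a):              # smooth globally averaged reward
    return np.sin(a + s) - 0.1 * a ** 2

def avg_reward(theta):       # J(theta) for mu_theta(s) = theta[s]
    P = np.array([trans(s, theta[s]) for s in range(S)])
    evals, evecs = np.linalg.eig(P.T)
    d = np.real(evecs[:, np.argmax(np.real(evals))])
    d = d / d.sum()          # stationary distribution d_theta
    R = np.array([rbar(s, theta[s]) for s in range(S)])
    return d @ R, P, d, R

theta = rng.normal(size=S)
J, P, d, R = avg_reward(theta)

# differential values: (I - P) V = R - J 1 (consistent singular system)
V, *_ = np.linalg.lstsq(np.eye(S) - P, R - J, rcond=None)

def dQ_da(s, a):             # central difference of Q_theta(s, .) at a
    eps = 1e-5
    q = lambda aa: rbar(s, aa) - J + trans(s, aa) @ V
    return (q(a + eps) - q(a - eps)) / (2 * eps)

# formula (15): since grad_theta mu(s) = e_s, component s is d(s) * dQ/da
grad_formula = np.array([d[s] * dQ_da(s, theta[s]) for s in range(S)])

# brute-force finite differences of J(theta)
eps = 1e-5
grad_fd = np.zeros(S)
for k in range(S):
    tp, tm = theta.copy(), theta.copy()
    tp[k] += eps
    tm[k] -= eps
    grad_fd[k] = (avg_reward(tp)[0] - avg_reward(tm)[0]) / (2 * eps)

print(grad_formula, grad_fd)
```

Shifting V by a constant does not change dQ_da (the transition rows sum to one), mirroring the fact that differential values are only defined up to an additive constant.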

Proof of Theorem 3

We extend the notation for the off-policy reward function to stochastic policies as follows. Let β be a behavior policy under which {s_t}_{t≥0} is irreducible and aperiodic, with stationary distribution d_β. For a stochastic policy π : S → P(A), we define J_β(π) = Σ_{s∈S} d_β(s) ∫_A π(a|s) R̄(s, a) da. Recall that for a deterministic policy μ : S → A, we have J_β(μ) = Σ_{s∈S} d_β(s) R̄(s, μ(s)). We introduce the following conditions, which are identical to Conditions B1 from Silver et al. [2014a].

Conditions 1. Functions ν_σ parameterized by σ are said to be a regular delta-approximation on R ⊆ A if they satisfy the following conditions:
1. The distributions ν_σ converge to a delta distribution: lim_{σ↓0} ∫_A ν_σ(a′, a) f(a) da = f(a′) for a′ ∈ R and suitably smooth f. Specifically, we require that this convergence is uniform in a′ and over any class F of L-Lipschitz and bounded functions, ‖∇_a f(a)‖ < L < ∞, sup_a |f(a)| < b < ∞, i.e.,
lim_{σ↓0} sup_{f∈F, a′∈R} | ∫_A ν_σ(a′, a) f(a) da − f(a′) | = 0.
2. For each a′ ∈ R, ν_σ(a′, ·) is supported on some compact C_{a′} ⊆ A with Lipschitz boundary bd(C_{a′}), vanishes on the boundary, and is continuously differentiable on C_{a′}.
3. For each a′ ∈ R and each a ∈ A, the gradient ∇_{a′} ν_σ(a′, a) exists.
4. Translation invariance: for all a ∈ A, a′ ∈ R, and any δ ∈ ℝ^n such that a + δ ∈ A, a′ + δ ∈ A, we have ν_σ(a′, a) = ν_σ(a′ + δ, a + δ).

The following lemma is an immediate corollary of Lemma 1 from Silver et al. [2014a].

Lemma 1. Let ν_σ be a regular delta-approximation on R ⊆ A. Then, wherever the gradients exist,

∇_{a′} ν_σ(a′, a) = −∇_a ν_σ(a′, a).

Theorem 3 is a less technical restatement of the following result.

Theorem 8. Let μ_θ : S → A. Denote the range of μ_θ by R_θ ⊆ A, and let R = ∪_θ R_θ. For each θ, consider a stochastic policy π_{θ,σ} such that π_{θ,σ}(a|s) = ν_σ(μ_θ(s), a), where ν_σ satisfies Conditions 1 on R.
Then, there exists r > 0 such that, for each θ ∈ Θ, the maps σ ↦ J_{π_{θ,σ}}(π_{θ,σ}), σ ↦ J_{π_{θ,σ}}(μ_θ), σ ↦ ∇_θ J_{π_{θ,σ}}(π_{θ,σ}), and σ ↦ ∇_θ J_{π_{θ,σ}}(μ_θ) are properly defined on [0, r] (with J_{π_{θ,0}}(π_{θ,0}) = J_{π_{θ,0}}(μ_θ) = J_{μ_θ}(μ_θ) and ∇_θ J_{π_{θ,0}}(π_{θ,0}) = ∇_θ J_{π_{θ,0}}(μ_θ) = ∇_θ J_{μ_θ}(μ_θ)), and we have

lim_{σ↓0} ∇_θ J_{π_{θ,σ}}(π_{θ,σ}) = lim_{σ↓0} ∇_θ J_{π_{θ,σ}}(μ_θ) = ∇_θ J_{μ_θ}(μ_θ).

To prove this result, we first state and prove the following lemma.

Lemma 2. There exists r > 0 such that, for all θ ∈ Θ and σ ∈ [0, r], the stationary distribution d_{π_{θ,σ}} exists and is unique. Moreover, for each θ ∈ Θ, σ ↦ d_{π_{θ,σ}} and σ ↦ ∇_θ d_{π_{θ,σ}} are properly defined on [0, r] and both are continuous at 0.

Proof of Lemma 2. For any policy β, we let (P^β_{s,s′})_{s,s′∈S} be the transition matrix associated with the Markov chain {s_t}_{t≥0} induced by β. In particular, for each θ ∈ Θ, σ > 0, s, s′ ∈ S, we have

P^{μ_θ}_{s,s′} = P(s′|s, μ_θ(s)), P^{π_{θ,σ}}_{s,s′} = ∫_A π_{θ,σ}(a|s) P(s′|s, a) da = ∫_A ν_σ(μ_θ(s), a) P(s′|s, a) da.

Let θ ∈ Θ, s, s′ ∈ S, (θ_n) ∈ Θ^ℕ such that θ_n → θ, and (σ_n)_{n∈ℕ} ∈ (ℝ_+)^ℕ with σ_n ↓ 0. Then

|P^{π_{θ_n,σ_n}}_{s,s′} − P^{μ_θ}_{s,s′}| ≤ |P^{π_{θ_n,σ_n}}_{s,s′} − P^{μ_{θ_n}}_{s,s′}| + |P^{μ_{θ_n}}_{s,s′} − P^{μ_θ}_{s,s′}|.

Applying the first of Conditions 1 with f : a ↦ P(s′|s, a) belonging to F,

|P^{π_{θ_n,σ_n}}_{s,s′} − P^{μ_{θ_n}}_{s,s′}| = | ∫_A ν_{σ_n}(μ_{θ_n}(s), a) P(s′|s, a) da − P(s′|s, μ_{θ_n}(s)) | ≤ sup_{f∈F, a′∈R} | ∫_A ν_{σ_n}(a′, a) f(a) da − f(a′) | → 0 as n → ∞.

By the regularity assumptions on θ ↦ μ_θ(s) and P(s′|s, ·), we have |P^{μ_{θ_n}}_{s,s′} − P^{μ_θ}_{s,s′}| = |P(s′|s, μ_{θ_n}(s)) − P(s′|s, μ_θ(s))| → 0 as n → ∞. Hence, |P^{π_{θ_n,σ_n}}_{s,s′} − P^{μ_θ}_{s,s′}| → 0.

Therefore, for each s, s′ ∈ S, the map (θ, σ) ↦ P^{π_{θ,σ}}_{s,s′}, with P^{π_{θ,0}}_{s,s′} = P^{μ_θ}_{s,s′}, is continuous on Θ × {0}. Note that, for each n ∈ ℕ, P ↦ ∏_{s,s′} (P^n)_{s,s′} is a polynomial function of the entries of P. Thus, for each n ∈ ℕ, f_n : (θ, σ) ↦ ∏_{s,s′} ((P^{π_{θ,σ}})^n)_{s,s′}, with f_n(θ, 0) = ∏_{s,s′} ((P^{μ_θ})^n)_{s,s′}, is continuous on Θ × {0}.
Moreover, for each $\theta\in\Theta$ and $\sigma\ge0$, by the structure of $P^{\pi_{\theta,\sigma}}$, if there is some $n^*\in\mathbb N$ such that $f_{n^*}(\theta,\sigma)>0$, then $f_n(\theta,\sigma)>0$ for all $n\ge n^*$. Now suppose that there exists $(\theta_n)\in\Theta^{\mathbb N^*}$ such that, for each $n>0$, there is a $\sigma_n\le n^{-1}$ with $f_n(\theta_n,\sigma_n)=0$. By compactness of $\Theta$, we can take $(\theta_n)$ converging to some $\theta\in\Theta$. For each $n^*\in\mathbb N$, by continuity we have $f_{n^*}(\theta,0) = \lim_{n\to\infty} f_{n^*}(\theta_n,\sigma_n) = 0$. Since $P^{\mu_\theta}$ is irreducible and aperiodic, there is some $n\in\mathbb N$ such that, for all $s,s'\in S$ and all $n^*\ge n$, $\big(\big(P^{\mu_\theta}\big)^{n^*}\big)_{s,s'} > 0$, i.e. $f_{n^*}(\theta,0)>0$. This is a contradiction. Hence, there exists $n^*>0$ such that, for all $\theta\in\Theta$ and $\sigma\le (n^*)^{-1}$, $f_{n^*}(\theta,\sigma)>0$. We let $r = (n^*)^{-1}$. It follows that, for all $\theta\in\Theta$ and $\sigma\in[0,r]$, $P^{\pi_{\theta,\sigma}}$ is the transition matrix of an irreducible and aperiodic Markov chain, so $d_{\pi_{\theta,\sigma}}$ is well defined as the unique stationary probability distribution associated with $P^{\pi_{\theta,\sigma}}$.

We fix $\theta\in\Theta$ for the remainder of the proof. Let $\beta$ be a policy for which the Markov chain with transition matrix $P^\beta$ is irreducible and aperiodic, and let $s^*\in S$. As asserted in Marbach and Tsitsiklis [2001], viewing the stationary distribution $d_\beta$ as a vector $(d_\beta(s))_{s\in S}\in\mathbb R^{|S|}$, $d_\beta$ is the unique solution of the balance equations:
$$\sum_{s\in S} d_\beta(s)\, P^\beta_{s,s'} = d_\beta(s') \quad \forall s'\in S\setminus\{s^*\}, \qquad \sum_{s\in S} d_\beta(s) = 1.$$
Hence, there exist an $|S|\times|S|$ matrix $A^\beta$ and a constant vector $a\neq0$ in $\mathbb R^{|S|}$ such that the balance equations take the form $A^\beta d_\beta = a$, with $A^\beta_{s,s'}$ depending on $P^\beta_{s',s}$ in an affine way for each $s,s'\in S$. Moreover, $A^\beta$ is invertible, so $d_\beta$ is given by
$$d_\beta = \frac{1}{\det(A^\beta)}\,\mathrm{adj}(A^\beta)\, a.$$
The entries of $\mathrm{adj}(A^\beta)$ and $\det(A^\beta)$ are polynomial functions of the entries of $P^\beta$. Thus $\sigma\mapsto d_{\pi_{\theta,\sigma}} = \frac{1}{\det(A^{\pi_{\theta,\sigma}})}\,\mathrm{adj}(A^{\pi_{\theta,\sigma}})\, a$ is defined on $[0,r]$ and is continuous at $0$.
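The balance-equation construction of $d_\beta$ translates directly into code. A minimal sketch follows (the two-state chain is an invented example; the row replaced by the normalization constraint plays the role of the $s^*$ equation):

```python
import numpy as np

def stationary_from_balance(P, s_star=0):
    # Balance equations A d = a: rows of (P^T - I) d = 0 for s' != s_star,
    # with the s_star row replaced by the normalization sum_s d(s) = 1.
    n = P.shape[0]
    A = P.T - np.eye(n)
    A[s_star, :] = 1.0
    a = np.zeros(n)
    a[s_star] = 1.0
    return np.linalg.solve(A, a)

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])      # irreducible, aperiodic toy chain
d = stationary_from_balance(P)
assert np.allclose(d @ P, d) and np.isclose(d.sum(), 1.0)
```

Since the entries of `A` depend affinely on those of `P`, the solution $d = \mathrm{adj}(A)a/\det(A)$ inherits the smoothness in $P$ used in the proof.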
Lemma 1 and integration by parts imply that, for $s,s'\in S$ and $\sigma\in(0,r]$, the Leibniz rule applies, as the following two points show.

• Let $a^*\in R$. For all $\theta\in\Theta$,
$$\big\|\nabla_\theta \pi_{\theta,\sigma}(a|s)\, P(s'|s,a)\big\| = \big\|\nabla_\theta\mu_\theta(s)\, \nabla_{a'}\nu_\sigma(a',a)\big|_{a'=\mu_\theta(s)}\, P(s'|s,a)\big\| \le \big\|\nabla_\theta\mu_\theta(s)\big\|_{op}\, \big\|\nabla_{a'}\nu_\sigma(a',a)\big|_{a'=\mu_\theta(s)}\big\| \le \sup_{\theta\in\Theta}\big\|\nabla_\theta\mu_\theta(s)\big\|_{op}\, \big\|\nabla_a\nu_\sigma(\mu_\theta(s),a)\big\| = \sup_{\theta\in\Theta}\big\|\nabla_\theta\mu_\theta(s)\big\|_{op}\, \big\|\nabla_a\nu_\sigma(a^*, a-\mu_\theta(s)+a^*)\big\| \quad (18)$$
$$\le \sup_{\theta\in\Theta}\big\|\nabla_\theta\mu_\theta(s)\big\|_{op}\, \sup_{\tilde a\in C_{a^*}}\big\|\nabla_a\nu_\sigma(a^*,\tilde a)\big\|\, 1_{a\in C_{a^*}},$$
where $\|\cdot\|_{op}$ denotes the operator norm, and (18) comes from translation invariance (we take $\nabla_a\nu_\sigma(a^*,a)=0$ for $a\in\mathbb R^n\setminus C_{a^*}$). The function $a\mapsto \sup_{\theta\in\Theta}\|\nabla_\theta\mu_\theta(s)\|_{op}\, \sup_{\tilde a\in C_{a^*}}\|\nabla_a\nu_\sigma(a^*,\tilde a)\|\, 1_{a\in C_{a^*}}$ is measurable, bounded and supported on $C_{a^*}$, so it is integrable on $A$.

• Dominated convergence ensures that, for each $k\in\{1,\dots,m\}$, the partial derivative $g_k(\theta) = \partial_{\theta_k} \int_A \pi_{\theta,\sigma}(a|s)\, P(s'|s,a)\, da$ is continuous: let $\theta_n\to\theta$; then
$$g_k(\theta_n) = \partial_{\theta_k}\mu_{\theta_n}(s) \int_{C_{a^*}} \nu_\sigma(a^*, a-\mu_{\theta_n}(s)+a^*)\,\nabla_a P(s'|s,a)\, da \xrightarrow[n\to\infty]{} \partial_{\theta_k}\mu_\theta(s) \int_{C_{a^*}} \nu_\sigma(a^*, a-\mu_\theta(s)+a^*)\,\nabla_a P(s'|s,a)\, da = g_k(\theta),$$
with the dominating function $a\mapsto \sup_{\tilde a\in C_{a^*}}|\nu_\sigma(a^*,\tilde a)|\, \sup_{a\in A}\|\nabla_a P(s'|s,a)\|\, 1_{a\in C_{a^*}}$.

Since $d_{\pi_{\theta,\sigma}} = \frac{1}{\det(A^{\pi_{\theta,\sigma}})}\,\mathrm{adj}(A^{\pi_{\theta,\sigma}})\, a$ with $|\det(A^{\pi_{\theta,\sigma}})| > 0$ for all $\sigma\in[0,r]$, and since the entries of $\mathrm{adj}(A^{\pi_{\theta,\sigma}})$ and $\det(A^{\pi_{\theta,\sigma}})$ are polynomial functions of the entries of $P^{\pi_{\theta,\sigma}}$, it follows that $\sigma\mapsto\nabla_\theta d_{\pi_{\theta,\sigma}}$ is properly defined on $[0,r]$ and is continuous at $0$, which concludes the proof of Lemma 2.

We now proceed to prove Theorem 8. Let $\theta\in\Theta$, $\pi_{\theta,\sigma}$ as in Theorem 3, and $r>0$ such that $\sigma\mapsto d_{\pi_{\theta,\sigma}}$ and $\sigma\mapsto\nabla_\theta d_{\pi_{\theta,\sigma}}$ are well defined on $[0,r]$ and continuous at $0$. Then the two functions
$$\sigma\mapsto J_{\pi_{\theta,\sigma}}(\pi_{\theta,\sigma}) = \sum_{s\in S} d_{\pi_{\theta,\sigma}}(s) \int_A \pi_{\theta,\sigma}(a|s)\, R(s,a)\, da, \qquad \sigma\mapsto J_{\pi_{\theta,\sigma}}(\mu_\theta) = \sum_{s\in S} d_{\pi_{\theta,\sigma}}(s)\, R(s,\mu_\theta(s)),$$
are properly defined on $[0,r]$ (with $J_{\pi_{\theta,0}}(\pi_{\theta,0}) = J_{\pi_{\theta,0}}(\mu_\theta) = J_{\mu_\theta}(\mu_\theta)$). Let $s\in S$; by arguments similar to those in the proof of Lemma 2, we have
$$\nabla_\theta \int_A \pi_{\theta,\sigma}(a|s)\, R(s,a)\, da = \int_A \nabla_\theta \pi_{\theta,\sigma}(a|s)\, R(s,a)\, da = \nabla_\theta\mu_\theta(s) \int_{C_{\mu_\theta(s)}} \nu_\sigma(\mu_\theta(s),a)\,\nabla_a R(s,a)\, da.$$
Thus, $\sigma\mapsto\nabla_\theta J_{\pi_{\theta,\sigma}}(\pi_{\theta,\sigma})$ is properly defined on $[0,r]$ and
$$\nabla_\theta J_{\pi_{\theta,\sigma}}(\pi_{\theta,\sigma}) = \sum_{s\in S} \nabla_\theta d_{\pi_{\theta,\sigma}}(s) \int_A \pi_{\theta,\sigma}(a|s)\, R(s,a)\, da + \sum_{s\in S} d_{\pi_{\theta,\sigma}}(s)\,\nabla_\theta \int_A \pi_{\theta,\sigma}(a|s)\, R(s,a)\, da$$
$$= \sum_{s\in S} \nabla_\theta d_{\pi_{\theta,\sigma}}(s) \int_A \nu_\sigma(\mu_\theta(s),a)\, R(s,a)\, da + \sum_{s\in S} d_{\pi_{\theta,\sigma}}(s)\,\nabla_\theta\mu_\theta(s) \int_{C_{\mu_\theta(s)}} \nu_\sigma(\mu_\theta(s),a)\,\nabla_a R(s,a)\, da.$$
Similarly, $\sigma\mapsto\nabla_\theta J_{\pi_{\theta,\sigma}}(\mu_\theta)$ is properly defined on $[0,r]$ and
$$\nabla_\theta J_{\pi_{\theta,\sigma}}(\mu_\theta) = \sum_{s\in S} \nabla_\theta d_{\pi_{\theta,\sigma}}(s)\, R(s,\mu_\theta(s)) + \sum_{s\in S} d_{\pi_{\theta,\sigma}}(s)\,\nabla_\theta\mu_\theta(s)\,\nabla_a R(s,a)\big|_{a=\mu_\theta(s)}.$$
To prove continuity at $0$ of both $\sigma\mapsto\nabla_\theta J_{\pi_{\theta,\sigma}}(\pi_{\theta,\sigma})$ and $\sigma\mapsto\nabla_\theta J_{\pi_{\theta,\sigma}}(\mu_\theta)$ (with $\nabla_\theta J_{\pi_{\theta,0}}(\pi_{\theta,0}) = \nabla_\theta J_{\pi_{\theta,0}}(\mu_\theta) = \nabla_\theta J_{\mu_\theta}(\mu_\theta)$), let $(\sigma_n)_{n\ge0}\downarrow0$:
$$\big\|\nabla_\theta J_{\pi_{\theta,\sigma_n}}(\pi_{\theta,\sigma_n}) - \nabla_\theta J_{\pi_{\theta,0}}(\pi_{\theta,0})\big\| \le \big\|\nabla_\theta J_{\pi_{\theta,\sigma_n}}(\pi_{\theta,\sigma_n}) - \nabla_\theta J_{\pi_{\theta,\sigma_n}}(\mu_\theta)\big\| + \big\|\nabla_\theta J_{\pi_{\theta,\sigma_n}}(\mu_\theta) - \nabla_\theta J_{\mu_\theta}(\mu_\theta)\big\|. \quad (19)$$
For the first term of the r.h.s. we have
$$\big\|\nabla_\theta J_{\pi_{\theta,\sigma_n}}(\pi_{\theta,\sigma_n}) - \nabla_\theta J_{\pi_{\theta,\sigma_n}}(\mu_\theta)\big\| \le \sum_{s\in S} \big\|\nabla_\theta d_{\pi_{\theta,\sigma_n}}(s)\big\| \left| \int_A \nu_{\sigma_n}(\mu_\theta(s),a)\, R(s,a)\, da - R(s,\mu_\theta(s)) \right| + \sum_{s\in S} d_{\pi_{\theta,\sigma_n}}(s)\,\big\|\nabla_\theta\mu_\theta(s)\big\|_{op} \left\| \int_A \nu_{\sigma_n}(\mu_\theta(s),a)\,\nabla_a R(s,a)\, da - \nabla_a R(s,a)\big|_{a=\mu_\theta(s)} \right\|.$$

Then, for each $\theta\in\Theta$, we can introduce $\nu_\theta: S\to\mathbb R^n$, the solution to the Poisson equation
$$\big(I - P^\theta\big)\,\nu_\theta(\cdot) = H(\theta,\cdot) - h(\theta),$$
which is given by $\nu_\theta(s) = \sum_{k\ge0} \mathbb E_{s_{k+1}\sim P^\theta(\cdot|s_k)}\big[H(\theta,s_k) - h(\theta)\,\big|\, s_0=s\big]$ and is properly defined (similarly to the differential value function $V$). With projection, actor update (5) becomes
$$\theta_{t+1} = \Gamma\big[\theta_t + \beta_{\theta,t}\, H(\theta_t, s_t, \omega_t)\big] \quad (20)$$
$$= \Gamma\big[\theta_t + \beta_{\theta,t}\, h(\theta_t) - \beta_{\theta,t}\big(h(\theta_t) - H(\theta_t,s_t)\big) - \beta_{\theta,t}\big(H(\theta_t,s_t) - H(\theta_t,s_t,\omega_t)\big)\big]$$
$$= \Gamma\big[\theta_t + \beta_{\theta,t}\, h(\theta_t) + \beta_{\theta,t}\big(I-P^{\theta_t}\big)\nu_{\theta_t}(s_t) + \beta_{\theta,t}\, A^1_t\big]$$
$$= \Gamma\big[\theta_t + \beta_{\theta,t}\, h(\theta_t) + \beta_{\theta,t}\big(\nu_{\theta_t}(s_t) - \nu_{\theta_t}(s_{t+1})\big) + \beta_{\theta,t}\big(\nu_{\theta_t}(s_{t+1}) - P^{\theta_t}\nu_{\theta_t}(s_t)\big) + \beta_{\theta,t}\, A^1_t\big]$$
$$= \Gamma\big[\theta_t + \beta_{\theta,t}\big(h(\theta_t) + A^1_t + A^2_t + A^3_t\big)\big],$$
where $A^1_t = H(\theta_t,s_t,\omega_t) - H(\theta_t,s_t)$, $A^2_t = \nu_{\theta_t}(s_t) - \nu_{\theta_t}(s_{t+1})$, and $A^3_t = \nu_{\theta_t}(s_{t+1}) - P^{\theta_t}\nu_{\theta_t}(s_t)$.
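The limiting gradient $\nabla_\theta J_{\mu_\theta}(\mu_\theta) = \sum_s d(s)\,\nabla_\theta\mu_\theta(s)\,\nabla_a R(s,a)|_{a=\mu_\theta(s)}$ can be illustrated on a single-state continuous bandit, the setting of the experiments in Figures 1 and 2. The quadratic reward, the identity policy $\mu_\theta=\theta$, and the step size below are illustrative assumptions, not the paper's experiment.

```python
# One-state continuous bandit: J(theta) = R(mu_theta) with mu_theta = theta,
# so the deterministic policy gradient is grad_theta mu * R'(mu_theta) = R'(theta).
def R(a):
    return -(a - 0.7) ** 2        # toy reward, maximized at a = 0.7

def dR(a):
    return -2.0 * (a - 0.7)

theta = 0.0
for _ in range(500):
    theta += 0.1 * dR(theta)      # deterministic policy-gradient ascent
assert abs(theta - 0.7) < 1e-6    # converges to the optimal action
```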
For $r<t$ we have
$$\sum_{k=r}^{t-1} \beta_{\theta,k}\, A^2_k = \sum_{k=r}^{t-1} \beta_{\theta,k}\big(\nu_{\theta_k}(s_k) - \nu_{\theta_k}(s_{k+1})\big) = \sum_{k=r}^{t-1} \beta_{\theta,k}\big(\nu_{\theta_k}(s_k) - \nu_{\theta_{k+1}}(s_{k+1})\big) + \sum_{k=r}^{t-1} \beta_{\theta,k}\big(\nu_{\theta_{k+1}}(s_{k+1}) - \nu_{\theta_k}(s_{k+1})\big)$$
$$= \sum_{k=r}^{t-1} \big(\beta_{\theta,k+1} - \beta_{\theta,k}\big)\,\nu_{\theta_{k+1}}(s_{k+1}) + \beta_{\theta,r}\,\nu_{\theta_r}(s_r) - \beta_{\theta,t}\,\nu_{\theta_t}(s_t) + \sum_{k=r}^{t-1} \epsilon^{(2)}_k = \sum_{k=r}^{t-1} \epsilon^{(1)}_k + \sum_{k=r}^{t-1} \epsilon^{(2)}_k + \eta_{r,t},$$
where $\epsilon^{(1)}_k = \big(\beta_{\theta,k+1} - \beta_{\theta,k}\big)\,\nu_{\theta_{k+1}}(s_{k+1})$, $\epsilon^{(2)}_k = \beta_{\theta,k}\big(\nu_{\theta_{k+1}}(s_{k+1}) - \nu_{\theta_k}(s_{k+1})\big)$, and $\eta_{r,t} = \beta_{\theta,r}\,\nu_{\theta_r}(s_r) - \beta_{\theta,t}\,\nu_{\theta_t}(s_t)$.

Lemma 5. $\sum_{k=0}^{t-1} \beta_{\theta,k}\, A^2_k$ converges a.s. as $t\to\infty$.

Proof of Lemma 5. Since $\nu_\theta(s)$ is uniformly bounded for $\theta\in\Theta$, $s\in S$, we have for some $K>0$
$$\sum_{k=0}^{t-1} \big\|\epsilon^{(1)}_k\big\| \le K \sum_{k=0}^{t-1} \big|\beta_{\theta,k+1} - \beta_{\theta,k}\big|,$$
which converges given Assumption 5. Moreover, since $\mu_\theta(s)$ is twice continuously differentiable, $\theta\mapsto\nu_\theta(s)$ is Lipschitz for each $s$, and so we have
$$\sum_{k=0}^{t-1} \big\|\epsilon^{(2)}_k\big\| \le \sum_{k=0}^{t-1} \beta_{\theta,k}\,\big\|\nu_{\theta_k}(s_{k+1}) - \nu_{\theta_{k+1}}(s_{k+1})\big\| \le K_2 \sum_{k=0}^{t-1} \beta_{\theta,k}\,\|\theta_k - \theta_{k+1}\| \le K_3 \sum_{k=0}^{t-1} \beta^2_{\theta,k}.$$

We introduce the following operators, as in Zhang et al. [2018]:
• $\langle\cdot\rangle: \mathbb R^{KN}\to\mathbb R^K$, $\langle\lambda\rangle = \frac1N\big(\mathbf 1^\top\otimes I\big)\lambda = \frac1N\sum_{i\in\mathcal N}\lambda^i$;
• $J = \frac1N\big(\mathbf 1\mathbf 1^\top\otimes I\big): \mathbb R^{KN}\to\mathbb R^{KN}$, so that $J\lambda = \mathbf 1\otimes\langle\lambda\rangle$;
• $J_\perp = I - J: \mathbb R^{KN}\to\mathbb R^{KN}$, and we write $\lambda_\perp = J_\perp\lambda = \lambda - \mathbf 1\otimes\langle\lambda\rangle$.

We then proceed in two steps, as in Zhang et al. [2018]: first, we show the a.s. convergence of the disagreement vector sequence $\{\lambda_{\perp,t}\}$ to zero; second, we show that the consensus vector sequence $\{\langle\lambda_t\rangle\}$ converges to the equilibrium, such that $\langle\lambda_t\rangle$ is a solution to (13).

Lemma 8. Under Assumptions 4, 1 and 6, for any $M>0$, we have
$$\sup_t\ \mathbb E\Big[\big\|\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big\|^2\, 1_{\{\sup_t\|\lambda_t\|\le M\}}\Big] < \infty.$$
Since the dynamics of $\{\lambda_t\}$ described by (21) are similar to (5.2) in Zhang et al. [2018], we have
$$\mathbb E\Big[\big\|\beta^{-1}_{\lambda,t+1}\lambda_{\perp,t+1}\big\|^2\,\Big|\,\mathcal F_{t,1}\Big] \le \frac{\beta^2_{\lambda,t}}{\beta^2_{\lambda,t+1}}\,\rho\,\Big[\big\|\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big\|^2 + 2\,\big\|\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big\|\,\mathbb E\big(\|y_{t+1}\|^2\,\big|\,\mathcal F_{t,1}\big)^{\frac12} + \mathbb E\big(\|y_{t+1}\|^2\,\big|\,\mathcal F_{t,1}\big)\Big] \quad (22)$$
where $\rho$ is the spectral norm of $\mathbb E\big[C_t^\top\big(I - \mathbf 1\mathbf 1^\top/N\big)C_t\big]$, with $\rho\in[0,1)$ by Assumption 4.
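The operators $\langle\cdot\rangle$, $J$ and $J_\perp$ are plain averaging and projection maps. The sketch below (arbitrary $N$, $K$ and data, chosen only for illustration) checks that $J$ and $J_\perp$ decompose $\lambda$ into orthogonal consensus and disagreement parts.

```python
import numpy as np

N, K = 4, 3                                 # N agents, K-dimensional parameters
lam = np.arange(N * K, dtype=float)         # stacked vector (lambda^1, ..., lambda^N)

avg = lam.reshape(N, K).mean(axis=0)        # <lambda> = (1/N) sum_i lambda^i
J_lam = np.tile(avg, N)                     # J lambda = 1 (x) <lambda>
lam_perp = lam - J_lam                      # J_perp lambda: disagreement vector

# J and J_perp are complementary orthogonal projections
assert np.allclose(J_lam + lam_perp, lam)
assert np.isclose(np.dot(J_lam, lam_perp), 0.0)
```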
Since $y^i_{t+1} = \delta^i_t\, w(s_t,a_t)$, we have
$$\mathbb E\big[\|y_{t+1}\|^2\,\big|\,\mathcal F_{t,1}\big] = \mathbb E\Big[\sum_{i\in\mathcal N}\big\|\big(r^i(s_t,a_t) - w(s_t,a_t)^\top\lambda^i_t\big)\, w(s_t,a_t)\big\|^2\,\Big|\,\mathcal F_{t,1}\Big] \le 2\,\mathbb E\Big[\sum_{i\in\mathcal N}\big\|r^i(s_t,a_t)\, w(s_t,a_t)\big\|^2 + \|w(s_t,a_t)\|^4\,\|\lambda^i_t\|^2\,\Big|\,\mathcal F_{t,1}\Big].$$
By uniform boundedness of $r(s,\cdot)$ and $w(s,\cdot)$ (Assumption 1) and finiteness of $S$, there exists $K_1>0$ such that
$$\mathbb E\big[\|y_{t+1}\|^2\,\big|\,\mathcal F_{t,1}\big] \le K_1\big(1 + \|\lambda_t\|^2\big).$$
Thus, for any $M>0$ there exists $K_2>0$ such that, on the set $\{\sup_{\tau\le t}\|\lambda_\tau\| < M\}$,
$$\mathbb E\big[\|y_{t+1}\|^2\, 1_{\{\sup_{\tau\le t}\|\lambda_\tau\|<M\}}\,\big|\,\mathcal F_{t,1}\big] \le K_2.$$
We let $v_t = \big\|\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big\|^2\, 1_{\{\sup_{\tau\le t}\|\lambda_\tau\|<M\}}$. Taking expectations over (22) and noting that $1_{\{\sup_{\tau\le t+1}\|\lambda_\tau\|<M\}} \le 1_{\{\sup_{\tau\le t}\|\lambda_\tau\|<M\}}$, we get
$$\mathbb E(v_{t+1}) \le \frac{\beta^2_{\lambda,t}}{\beta^2_{\lambda,t+1}}\,\rho\,\Big[\mathbb E(v_t) + 2\sqrt{\mathbb E(v_t)\, K_2} + K_2\Big],$$
which is the same expression as (5.10) in Zhang et al. [2018], so the conclusions of Step 1 of Zhang et al. [2018] hold.

Besides, since $r^i_{t+1}$ and $w$ are uniformly bounded, there exists $K_5<\infty$ such that $\big\|\mathbb E\big[\langle y_{t+1}\rangle\,\big|\,\mathcal F_{t,1}\big]\big\|^2 \le K_5\,\big(1+\|\lambda_t\|^2\big)$. Thus, for any $M>0$, there exists some $K_6<\infty$ such that, over the set $\{\sup_t\|\lambda_t\|\le M\}$,
$$\mathbb E\big[\|M_{t+1}\|^2\,\big|\,\mathcal F_{t,1}\big] \le K_6\,\big(1+\|\lambda_t\|^2\big).$$
Hence, for any $M>0$, assumptions (a.1)–(a.5) of B.1 from Zhang et al. [2018] are verified on the set $\{\sup_t\|\lambda_t\|\le M\}$. Finally, we consider the ODE asymptotically followed by $\langle\lambda_t\rangle$:
$$\langle\dot\lambda_t\rangle = -B_{\pi,\theta}\,\langle\lambda_t\rangle + A_{\pi,\theta}\, d_\pi,$$
which has a unique globally asymptotically stable equilibrium $\lambda^*\in\mathbb R^K$, since $B_{\pi,\theta}$ is positive definite: $\lambda^* = B^{-1}_{\pi,\theta}\, A_{\pi,\theta}\, d_\pi$. Since $\langle\lambda_t\rangle\to\lambda^*$ and $\lambda_{\perp,t}\xrightarrow[t\to\infty]{}0$ a.s., we have, for each $i\in\mathcal N$, a.s., $\lambda^i_t \xrightarrow[t\to\infty]{} B^{-1}_{\pi,\theta}\, A_{\pi,\theta}\, d_\pi$.

Proof of Theorem 7. Let $\mathcal F_{t,2} = \sigma(\theta_\tau, \tau\le t)$ be the $\sigma$-field generated by $\{\theta_\tau, \tau\le t\}$, and let $\zeta^i_{t,1} = \psi^i_t\,\xi^i_t - \mathbb E_{s_t\sim d_\pi}\big[\psi^i_t\,\xi^i_t\,\big|\,\mathcal F_{t,2}\big]$



Submitted to 34th Conference on Neural Information Processing Systems (NeurIPS 2020). Do not distribute.



is parameterized by the class of linear functions, i.e., $R_{\lambda^i,\theta}(s,a) = w_\theta(s,a)^\top\lambda^i$, where $w_\theta(s,a) = \big(w_{\theta,1}(s,a),\dots,w_{\theta,K}(s,a)\big)^\top\in\mathbb R^K$ is the feature associated with the state-action pair $(s,a)$. The feature vectors $w_\theta(s,a)$, as well as $\nabla_a w_{\theta,k}(s,a)$, are uniformly bounded for any $s\in S$, $a\in A$, $k\in\{1,\dots,K\}$. Furthermore, we assume that the feature matrix $W_{\pi,\theta}\in\mathbb R^{|S|\times K}$ has full column rank, where the $k$-th column of $W_{\pi,\theta}$ is $\big(\int_A \pi(a|s)\, w_{\theta,k}(s,a)\, da,\ s\in S\big)$ for any $k\in\{1,\dots,K\}$.

Assumption 2 (Linear approximation, action-value). For each agent $i$, the action-value function is parameterized by the class of linear functions, i.e., $Q_{\omega^i}(s,a) = \phi(s,a)^\top\omega^i$, where $\phi(s,a) = \big(\phi_1(s,a),\dots,\phi_K(s,a)\big)^\top\in\mathbb R^K$ is the feature associated with the state-action pair $(s,a)$. The feature vectors $\phi(s,a)$, as well as $\nabla_a\phi_k(s,a)$, are uniformly bounded for any $s\in S$, $a\in A$, $k\in\{1,\dots,K\}$.
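Assumption 2 only requires bounded features and bounded action-gradients. Below is a minimal sketch of such a linear critic with invented one-dimensional features (bounded via `tanh`/`cos`; all choices are illustrative, not the paper's), including the action-gradient $\nabla_a Q_{\omega}$ that deterministic policy gradients rely on, checked against a finite difference.

```python
import numpy as np

def phi(s, a):
    # bounded features with bounded action-gradients (toy choices)
    return np.array([1.0, np.cos(s), np.tanh(a), np.tanh(a) * np.cos(s)])

def Q(s, a, omega):
    return phi(s, a) @ omega                 # Q_omega(s, a) = phi(s, a) . omega

def grad_a_Q(s, a, omega):
    sech2 = 1.0 - np.tanh(a) ** 2            # d/da tanh(a)
    dphi = np.array([0.0, 0.0, sech2, sech2 * np.cos(s)])
    return dphi @ omega

omega = np.array([0.5, -1.0, 2.0, 0.3])
s, a, eps = 1.0, 0.2, 1e-6
fd = (Q(s, a + eps, omega) - Q(s, a - eps, omega)) / (2 * eps)
assert abs(grad_a_Q(s, a, omega) - fd) < 1e-6
```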

example, Silver et al. [January 2014a] and Zhang and Zavlanos [2019]), i.e. φ(s

Figure 1: Convergence of Algorithms 1 and 2 on the multi-agent continuous bandit problem.

Figure 2: Convergence of Algorithms 1 and 2 on the multi-agent continuous bandit problem.

$$\int_A \nabla_{a'}\nu_\sigma(a',a)\big|_{a'=\mu_\theta(s)}\, P(s'|s,a)\, da = -\int_A \nabla_a\nu_\sigma(\mu_\theta(s),a)\, P(s'|s,a)\, da = \int_{C_{\mu_\theta(s)}} \nu_\sigma(\mu_\theta(s),a)\,\nabla_a P(s'|s,a)\, da + \text{boundary terms} = \int_{C_{\mu_\theta(s)}} \nu_\sigma(\mu_\theta(s),a)\,\nabla_a P(s'|s,a)\, da,$$
where the boundary terms are zero since $\nu_\sigma$ vanishes on the boundary by Conditions 1. Thus, for $s,s'\in S$, $\sigma\in(0,r]$:
$$\nabla_\theta P^{\pi_{\theta,\sigma}}_{s,s'} = \nabla_\theta \int_A \pi_{\theta,\sigma}(a|s)\, P(s'|s,a)\, da = \int_A \nabla_\theta \pi_{\theta,\sigma}(a|s)\, P(s'|s,a)\, da \quad (17)$$
$$= \int_A \nabla_\theta\mu_\theta(s)\,\nabla_{a'}\nu_\sigma(a',a)\big|_{a'=\mu_\theta(s)}\, P(s'|s,a)\, da = \nabla_\theta\mu_\theta(s) \int_{C_{\mu_\theta(s)}} \nu_\sigma(\mu_\theta(s),a)\,\nabla_a P(s'|s,a)\, da,$$
where the exchange of derivative and integral in (17) follows by application of the Leibniz rule with:
• for all $a\in A$, $\theta\mapsto\pi_{\theta,\sigma}(a|s)\, P(s'|s,a)$ is differentiable, and $\nabla_\theta\big[\pi_{\theta,\sigma}(a|s)\, P(s'|s,a)\big] = \nabla_\theta\mu_\theta(s)\,\nabla_{a'}\nu_\sigma(a',a)\big|_{a'=\mu_\theta(s)}\, P(s'|s,a)$.

The dominating function $a\mapsto \sup_{\tilde a\in C_{a^*}}|\nu_\sigma(a^*,\tilde a)|\,\sup_{a\in A}\|\nabla_a P(s'|s,a)\|\, 1_{a\in C_{a^*}}$ justifies the passage to the limit. Thus $\sigma\mapsto\nabla_\theta P^{\pi_{\theta,\sigma}}_{s,s'}$ is defined for $\sigma\in[0,r]$ and is continuous at $0$, with
$$\nabla_\theta P^{\pi_{\theta,0}}_{s,s'} = \nabla_\theta\mu_\theta(s)\,\nabla_a P(s'|s,a)\big|_{a=\mu_\theta(s)}.$$
Indeed, let $(\sigma_n)_{n\in\mathbb N}\in(0,r]^{\mathbb N}$, $\sigma_n\downarrow0$; then, applying the first condition of Conditions 1 with $f: a\mapsto\nabla_a P(s'|s,a)$ belonging to $\mathcal F$, we get
$$\big\|\nabla_\theta\mu_\theta(s)\big\|_{op}\,\left\| \int_{C_{\mu_\theta(s)}} \nu_{\sigma_n}(\mu_\theta(s),a)\,\nabla_a P(s'|s,a)\, da - \nabla_a P(s'|s,a)\big|_{a=\mu_\theta(s)} \right\| \xrightarrow[n\to\infty]{} 0.$$

In particular, $\sup_t \mathbb E\big[\|\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\|^2\, 1_{\{\sup_t\|\lambda_t\|\le M\}}\big] < \infty$. Turning to the consensus vector, the critic step gives
$$\langle\lambda_{t+1}\rangle = \langle\lambda_t\rangle + \beta_{\lambda,t}\big(h(\langle\lambda_t\rangle, s_t) + M_{t+1}\big),$$
where $h(\langle\lambda_t\rangle, s_t) = \mathbb E_{a_t\sim\pi}\big[\langle y_{t+1}\rangle\,\big|\,\mathcal F_{t,1}\big]$ and $M_{t+1} = \big\langle(C_t\otimes I)\big(y_{t+1} + \beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big)\big\rangle - \mathbb E_{a_t\sim\pi}\big[\langle y_{t+1}\rangle\,\big|\,\mathcal F_{t,1}\big]$. Since $\delta^i_t = r^i(s_t,a_t) - w(s_t,a_t)^\top\lambda^i_t$, we have
$$h(\langle\lambda_t\rangle, s_t) = \mathbb E_{a_t\sim\pi}\big[\langle r(s_t,a_t)\rangle\, w(s_t,a_t)\,\big|\,\mathcal F_{t,1}\big] - \mathbb E_{a_t\sim\pi}\big[w(s_t,a_t)\, w(s_t,a_t)^\top\,\big|\,\mathcal F_{t,1}\big]\,\langle\lambda_t\rangle,$$
so $h$ is Lipschitz-continuous in its first argument. Moreover, since $\langle\lambda_{\perp,t}\rangle = 0$ and $\mathbf 1^\top\mathbb E(C_t\,|\,\mathcal F_{t,1}) = \mathbf 1^\top$ a.s.:
$$\mathbb E_{a_t\sim\pi}\Big[\big\langle(C_t\otimes I)\big(y_{t+1}+\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big)\big\rangle\,\Big|\,\mathcal F_{t,1}\Big] = \mathbb E_{a_t\sim\pi}\Big[\tfrac1N\big(\mathbf 1^\top\otimes I\big)(C_t\otimes I)\big(y_{t+1}+\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big)\,\Big|\,\mathcal F_{t,1}\Big] = \tfrac1N\big(\mathbf 1^\top\otimes I\big)\big(\mathbb E(C_t\,|\,\mathcal F_{t,1})\otimes I\big)\,\mathbb E_{a_t\sim\pi}\big[y_{t+1}+\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\,\big|\,\mathcal F_{t,1}\big] = \mathbb E_{a_t\sim\pi}\big[\langle y_{t+1}\rangle\,\big|\,\mathcal F_{t,1}\big] \quad\text{a.s.}$$
So $\{M_t\}$ is a martingale difference sequence. Additionally, we have
$$\mathbb E\big[\|M_{t+1}\|^2\,\big|\,\mathcal F_{t,1}\big] \le 2\,\mathbb E\Big[\big\|y_{t+1}+\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big\|^2_{G_t}\,\Big|\,\mathcal F_{t,1}\Big] + 2\,\big\|\mathbb E\big[\langle y_{t+1}\rangle\,\big|\,\mathcal F_{t,1}\big]\big\|^2,$$
with $G_t = N^{-2}\,\big(C_t^\top\mathbf 1\mathbf 1^\top C_t\otimes I\big)$, whose spectral norm is bounded since $C_t$ is stochastic. From (23) and (24) we have that, for any $M>0$, over the set $\{\sup_t\|\lambda_t\|\le M\}$, there exist $K_3, K_4<\infty$ such that
$$\mathbb E\Big[\big\|y_{t+1}+\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big\|^2_{G_t}\,\Big|\,\mathcal F_{t,1}\Big]\, 1_{\{\sup_t\|\lambda_t\|\le M\}} \le K_3\,\mathbb E\Big[\|y_{t+1}\|^2 + \big\|\beta^{-1}_{\lambda,t}\lambda_{\perp,t}\big\|^2\,\Big|\,\mathcal F_{t,1}\Big]\, 1_{\{\sup_t\|\lambda_t\|\le M\}} \le K_4.$$

$\lambda^* = B^{-1}_{\pi,\theta}\, A_{\pi,\theta}\, d_\pi$. By Lemma 7, $\sup_t\|\lambda_t\| < \infty$ a.s., so all conditions needed to apply Theorem B.2 of Zhang et al. [2018] hold a.s., which means that $\langle\lambda_t\rangle \xrightarrow[t\to\infty]{} \lambda^*$ a.s.

and $\zeta^i_{t,2} = \mathbb E_{s_t\sim d_\pi}\big[\psi^i_t\big(\xi^i_t - \xi^i_{t,\theta_t}\big)\,\big|\,\mathcal F_{t,2}\big]$. The actor update can then be written
$$\theta^i_{t+1} = \theta^i_t + \beta_{\theta,t}\,\mathbb E_{s_t\sim d_\pi}\big[\psi^i_t\,\xi^i_{t,\theta_t}\,\big|\,\mathcal F_{t,2}\big] + \beta_{\theta,t}\,\zeta^i_{t,1} + \beta_{\theta,t}\,\zeta^i_{t,2}. \quad (26)$$
So, with $h^i(\theta_t) = \mathbb E_{s_t\sim d_\pi}\big[\psi^i_t\,\xi^i_{t,\theta_t}\,\big|\,\mathcal F_{t,2}\big]$ and $h(\theta) = \big(h^1(\theta),\dots,h^N(\theta)\big)$, update (26) is a stochastic approximation of $h$. Our regularity assumptions ensure that $\theta\mapsto\psi^i_{t,\theta}$ is continuous for each $i\in\mathcal N$, $s_t\in S$. Moreover, $\theta\mapsto d_\theta(s)$ is also Lipschitz-continuous for each $s\in S$. Hence, $\theta\mapsto g(\theta)$ is Lipschitz-continuous in $\theta$ and the ODE (12) is well-posed. This holds even when using compatible features. By the critic's faster convergence, we have $\lim_{t\to\infty}\|\xi^i_t - \xi^i_{t,\theta_t}\| = 0$. Let $M^i_t = \sum_{\tau=0}^{t-1}\beta_{\theta,\tau}\,\zeta^i_{\tau,1}$; then $M^i_t$ is a martingale sequence with respect to $\mathcal F_{t,2}$. Since $\{\omega_t\}_t$, $\{\nabla_a\phi_k(s,a)\}_{s,k}$, and $\{\nabla_\theta\mu_\theta(s)\}_s$ are bounded (Lemma 3, Assumption 2), the sequence $\{\zeta^i_{t,1}\}$ is bounded. Thus, by Assumption 5, $\sum_t \mathbb E\big[\|M^i_{t+1} - M^i_t\|^2\big] < \infty$, so $M^i_t$ converges a.s. Hence, by the Kushner-Clark lemma (Kushner and Clark [1978], pp. 191–196), the update in (26) converges a.s. to the set of asymptotically stable equilibria of the ODE (12).

Annex

Applying the first condition of Conditions 1 with $f: a\mapsto R(s,a)$ and $f: a\mapsto\nabla_a R(s,a)$ belonging to $\mathcal F$, we have, for each $s\in S$:
$$\left| \int_A \nu_{\sigma_n}(\mu_\theta(s),a)\, R(s,a)\, da - R(s,\mu_\theta(s)) \right| \xrightarrow[n\to\infty]{} 0 \quad\text{and}\quad \left\| \int_A \nu_{\sigma_n}(\mu_\theta(s),a)\,\nabla_a R(s,a)\, da - \nabla_a R(s,a)\big|_{a=\mu_\theta(s)} \right\| \xrightarrow[n\to\infty]{} 0.$$
Moreover, for each $s\in S$, $d_{\pi_{\theta,\sigma_n}}(s) \xrightarrow[n\to\infty]{} d_{\mu_\theta}(s)$ and $\nabla_\theta d_{\pi_{\theta,\sigma_n}}(s) \xrightarrow[n\to\infty]{} \nabla_\theta d_{\mu_\theta}(s)$ (by Lemma 2), and $\|\nabla_\theta\mu_\theta(s)\|_{op} < \infty$, so the first term of the r.h.s. of (19) vanishes as $n\to\infty$. For the second term of the r.h.s. of (19), a similar argument applies. Hence $\|\nabla_\theta J_{\pi_{\theta,\sigma_n}}(\mu_\theta) - \nabla_\theta J_{\mu_\theta}(\mu_\theta)\| \to 0$, so $\sigma\mapsto\nabla_\theta J_{\pi_{\theta,\sigma}}(\pi_{\theta,\sigma})$ and $\sigma\mapsto\nabla_\theta J_{\pi_{\theta,\sigma}}(\mu_\theta)$ are continuous at $0$.

Proof of Theorem 4

We will use a two-time-scale stochastic approximation analysis. We let the policy parameter $\theta_t$ be fixed as $\theta_t\equiv\theta$ when analysing the convergence of the critic step. Thus, we can show the convergence of $\omega_t$ towards an $\omega_\theta$ depending on $\theta$, which is then used to prove convergence for the slow time-scale.

Lemma 3. Under Assumptions 3–5, the sequence $\{\omega^i_t\}$ generated from (2) is bounded a.s., i.e., $\sup_t\|\omega^i_t\| < \infty$ a.s., for any $i\in\mathcal N$.

The proof follows the same steps as that of Lemma B.1 in the PMLR version of Zhang et al. [2018].

Lemma 4. Under Assumption 5, the sequence ...

The proof follows the same steps as that of Lemma B.2 in the PMLR version of Zhang et al. [2018].

The desired result holds since Step 1 and Step 2 of the proof of Theorem 4.6 in Zhang et al. [2018] can both be repeated in the setting of deterministic policies.
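The two-time-scale argument can be visualized on a toy problem: a fast iterate tracks its equilibrium $\omega^*(\theta)$ while the slow iterate performs gradient ascent using it. Everything below (the objective, the relation $\omega^*(\theta)=2\theta$, the step-size exponents) is an invented illustration of the step-size separation $\beta_{\theta,t}/\beta_{\omega,t}\to0$, not the paper's algorithm.

```python
# Fast iterate omega tracks omega*(theta) = 2*theta; slow iterate theta
# ascends the toy objective J(theta) = -(theta - 1)^2.
theta, omega = 0.0, 0.0
for t in range(1, 200001):
    b_omega = t ** -0.6                      # fast (critic-like) step size
    b_theta = t ** -0.9                      # slow (actor-like) step size
    omega += b_omega * (2.0 * theta - omega)     # tracks omega*(theta)
    theta += b_theta * (-2.0 * (theta - 1.0))    # gradient ascent on J
assert abs(theta - 1.0) < 1e-2 and abs(omega - 2.0 * theta) < 1e-2
```

Because the fast step sizes dominate, analysing the critic with $\theta$ frozen (as in the proofs above) correctly predicts the limit $\omega_t\to\omega^*(\theta)$.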

Proof of Theorem 5

Let $\mathcal F_{t,2} = \sigma(\theta_\tau, s_\tau, \tau\le t)$ be a filtration. In addition, we define ... Finally, $\lim_{t\to\infty}\eta_{0,t} = \beta_{\theta,0}\,\nu_{\theta_0}(s_0) < \infty$ a.s. Thus ...

Proof of Lemma 6. We set $Z_t = $ ... Since $Z_t$ is $\mathcal F_t$-adapted and $\mathbb E\big[\nu_{\theta_t}(s_{t+1})\,\big|\,\mathcal F_t\big] = P^{\theta_t}\nu_{\theta_t}(s_t)$, $Z_t$ is a martingale. The remainder of the proof is similar to the proof of Lemma 2 on page 224 of Benveniste et al. [1990]. We also set $g(\theta) = \big(g^1(\theta),\dots,g^N(\theta)\big)$.

Given (10), $\theta\mapsto\omega_\theta$ is continuously differentiable and $\theta\mapsto\nabla_\theta\omega_\theta$ is bounded, so $\theta\mapsto\omega_\theta$ is Lipschitz-continuous. Thus $\theta\mapsto\xi^i_{t,\theta}$ is Lipschitz-continuous for each $s_t\in S$. Due to our regularity assumptions, $\theta\mapsto\psi^i_{t,\theta}$ is also continuous for each $i\in\mathcal N$, $s_t\in S$. Moreover, $\theta\mapsto d_\theta(s)$ is also Lipschitz-continuous for each $s\in S$. Hence, $\theta\mapsto g(\theta)$ is Lipschitz-continuous in $\theta$ and the ODE (12) is well-posed. This holds even when using compatible features.

By the critic's faster convergence, we have $\lim_{t\to\infty}\|\xi^i_t - \xi^i_{t,\theta_t}\| = 0$, so $\lim_{t\to\infty} A^1_t = 0$.

Hence, by the Kushner-Clark lemma (Kushner and Clark [1978], pp. 191–196), the update in (20) converges a.s. to the set of asymptotically stable equilibria of the ODE (12).

Proof of Theorem 6

We use the two-time-scale technique: since the critic updates at a faster rate than the actor, we let the policy parameter $\theta_t$ be fixed as $\theta$ when analysing the convergence of the critic update.

Lemma 7. Under Assumptions 4, 1 and 6, for any $i\in\mathcal N$, the sequence $\{\lambda^i_t\}$ generated from (7) is bounded almost surely.

To prove this lemma, we verify that the conditions for Theorem A.2 of Zhang et al. [2018] hold. We use $\{\mathcal F_{t,1}\}$ to denote the filtration with $\mathcal F_{t,1} = \sigma(s_\tau, C_{\tau-1}, a_{\tau-1}, r_\tau, \lambda_\tau, \tau\le t)$. With $\lambda_t = \big((\lambda^1_t)^\top,\dots,(\lambda^N_t)^\top\big)^\top$, critic step (7) has the form
$$\lambda_{t+1} = (C_t\otimes I)\big(\lambda_t + \beta_{\lambda,t}\, y_{t+1}\big),$$
with $y_{t+1} = \big(\delta^1_t\, w(s_t,a_t)^\top,\dots,\delta^N_t\, w(s_t,a_t)^\top\big)^\top \in \mathbb R^{KN}$, where $\otimes$ denotes the Kronecker product and $I$ is the identity matrix. Using the same notation as in Assumption A.1 from Zhang et al. [2018], we verify each condition. Since the feature vectors are uniformly bounded for any $s\in S$ and $a\in A$, $h^i$ is Lipschitz-continuous in its first argument. Since, for $i\in\mathcal N$, the $r^i$ are also uniformly bounded, $\mathbb E\big[\|M_{t+1}\|^2\,\big|\,\mathcal F_{t,1}\big] \le K\,\big(1+\|\lambda_t\|^2\big)$ for some $K>0$. Furthermore, finiteness of $|S|$ ensures that, a.s., $\|h(\lambda_t) - h(\lambda_t, s_t)\|^2 \le K\,\big(1+\|\lambda_t\|^2\big)$. Finally, $h_\infty(y)$ exists and has the form $h_\infty(y) = -B_{\pi,\theta}\, y$. From Assumption 1, $-B_{\pi,\theta}$ is a Hurwitz matrix, thus the origin is a globally asymptotically stable attractor of the ODE $\dot y = h_\infty(y)$. Hence, Theorem A.2 of Zhang et al. [2018] applies, which concludes the proof of Lemma 7.
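A single consensus critic step of the form $\lambda_{t+1} = (C_t\otimes I)(\lambda_t + \beta_{\lambda,t}\, y_{t+1})$ can be sketched in a few lines. The mixing matrix, features, rewards, and step size below are arbitrary stand-ins; with the uniform doubly-stochastic choice $C_t = \mathbf 1\mathbf 1^\top/N$, one step already brings all agents to consensus.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 4, 3
C = np.full((N, N), 1.0 / N)         # doubly stochastic mixing weights
lam = rng.normal(size=(N, K))        # one row per agent: lambda^i_t

w = rng.normal(size=K)               # shared features w(s_t, a_t)
r = rng.normal(size=N)               # local rewards r^i_t
beta = 0.1
delta = r - lam @ w                  # local TD-like errors delta^i_t
y = delta[:, None] * w[None, :]      # y^i_{t+1} = delta^i_t w(s_t, a_t)
lam_next = C @ (lam + beta * y)      # critic step lambda_{t+1} = C (lambda + beta y)

# uniform mixing reaches consensus in one step: all rows identical
assert np.allclose(lam_next, lam_next[0])
```

In the algorithm, $C_t$ only needs to be doubly stochastic in expectation with spectral gap (Assumption 4), so consensus is reached asymptotically rather than in one step.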

