MODEL-BASED DECENTRALIZED POLICY OPTIMIZATION

Abstract

Decentralized policy optimization has been commonly used in cooperative multi-agent tasks. However, since all agents update their policies simultaneously, the environment is non-stationary from the perspective of each individual agent, making it hard to guarantee monotonic policy improvement. To make policy improvement stable and monotonic, we propose model-based decentralized policy optimization (MDPO), which incorporates a latent variable function to help construct the transition and reward function from an individual perspective. We theoretically show that the policy optimization of MDPO is more stable than model-free decentralized policy optimization. Moreover, due to non-stationarity, the latent variable function varies over time and is hard to model. We further propose a latent variable prediction method to reduce the error of the latent variable function, which theoretically contributes to monotonic policy improvement. Empirically, MDPO indeed obtains superior performance over model-free decentralized policy optimization in a variety of cooperative multi-agent tasks.

1. INTRODUCTION

Decentralized multi-agent reinforcement learning (MARL) has been commonly used in practice for cooperative multi-agent tasks where global information is inaccessible, e.g., traffic signal control (Wei et al., 2018), unmanned aerial vehicles (Qie et al., 2019), and IoT (Cao et al., 2020). Independently performing policy optimization using local information, e.g., independent PPO (IPPO) (Schulman et al., 2017), is one of the most straightforward methods for decentralized MARL. Recent empirical studies (de Witt et al., 2020; Yu et al., 2021a; Papoudakis et al., 2021) demonstrate that IPPO performs surprisingly well in several cooperative multi-agent benchmarks, which shows great promise for fully decentralized policy optimization.

However, since all agents are updating their policies, the environment is non-stationary from the perspective of an individual agent (Zhang et al., 2019). Thus, the monotonic policy improvement that policy optimization achieves in single-agent settings (Schulman et al., 2015; 2017) may not be guaranteed in decentralized MARL. Concretely, policy optimization assumes that the state visitation frequency is stationary because the agent's policy is limited to slight updates, which is necessary to guarantee monotonic policy improvement (Schulman et al., 2015). In decentralized multi-agent settings, however, as all agents update their policies simultaneously, the state visitation frequency can change greatly, contradicting this fundamental assumption, so the monotonic improvement of policy optimization may not be preserved.

To address this problem, we resort to exploiting an environment model to stabilize the state visitation frequency and support monotonic policy improvement. However, learning an environment model in decentralized settings is non-trivial, since the information of other agents, e.g., their policies, is unobservable and changing.
Therefore, we introduce a latent variable to help distinguish different transitions resulting from the unobservable information. We then build an environment model for each agent, which contains a transition function, a reward function, and a latent variable function that infers the latent variable given the observation. The agents are trained using independent policy optimization methods, e.g., TRPO (Schulman et al., 2015) or PPO (de Witt et al., 2020), on both the experiences generated by the environment model and those collected in the environment.

Since the environment is non-stationary, the latent variable function also varies during learning. We theoretically show that independently performing policy optimization on experiences generated by the environment model with the varying latent variable function yields a more stationary observation visitation frequency than on the experiences collected in the non-stationary environment. Thus, independent policy optimization is more stable on the environment model. Moreover, to obtain monotonic improvement, the gap between the return of interacting with the environment and the return predicted by the environment model should be small. We theoretically show that this return gap is bounded by the prediction error of the latent variable function. As the latent variable function varies due to non-stationarity, to minimize the prediction error, we propose a latent variable prediction method that uses the historical latent variable functions to predict the future one. Thus, latent variable prediction reduces the return gap and helps monotonic policy improvement.

The proposed algorithm, model-based decentralized policy optimization (MDPO), is theoretically grounded and empirically effective for fully decentralized learning. We evaluate MDPO on a variety of cooperative multi-agent tasks, i.e., a stochastic game, the multi-agent particle environment (MPE) (Lowe et al., 2017), and multi-agent MuJoCo (Peng et al., 2021a).
MDPO outperforms the model-free independent policy optimization baseline, and the proposed latent variable prediction brings an additional performance gain, verifying that MDPO helps stable and monotonic policy improvement in fully decentralized learning.

2. PRELIMINARIES

Dec-POMDP. A cooperative multi-agent task is generally modeled as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek & Amato, 2016). Specifically, a Dec-POMDP is defined as a tuple G = {S, I, A, O, Ω, P, R, γ}. S is the state space, I is the set of agents, and A = A_1 × ⋯ × A_{|I|} is the joint action space, where A_i is the action space of each agent i. At each state s, each agent i ∈ I merely gets access to an observation o_i ∈ O drawn from the observation function Ω(s, i), and selects an action a_i ∈ A_i; all the actions form a joint action a ∈ A. The state transitions to the next state s′ according to the transition function P(s′|s, a) : S × A × S → [0, 1], and all agents receive a shared reward r = R(s, a) : S × A → ℝ. The objective is to maximize the expected return η(π) = E[∑_{t=0}^{∞} γ^t r_t | ρ_0, π] under the joint policy π of all agents and the distribution of initial states ρ_0, where γ ∈ [0, 1) is the discount factor. The joint policy π can be represented as the product of each agent's policy π_i, and we denote π_{-i} as the joint policy of all agents except i.

Fully decentralized learning. We consider the fully decentralized way to solve the Dec-POMDP (Tan, 1993; de Witt et al., 2020), where each agent independently learns a policy and executes actions without communication or parameter sharing in both the training and execution phases. Since all agents update their policies, the environment is non-stationary from the perspective of each individual agent, which fundamentally challenges decentralized learning (Zhang et al., 2019). Existing decentralized MARL methods are limited. Independent Q-learning (IQL) (Tan, 1993) and independent policy optimization, e.g., IPPO (de Witt et al., 2020), are the most straightforward fully decentralized algorithms. Despite good empirical performance (Papoudakis et al., 2021), these methods lack theoretical guarantees due to non-stationarity; to the best of our knowledge, IQL has no convergence guarantee.
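The Dec-POMDP tuple defined above can be sketched as a minimal tabular environment. This is an illustrative sketch only (randomly generated dynamics stand in for a real task; all class and attribute names are our own):

```python
import numpy as np

class TabularDecPOMDP:
    """Minimal tabular Dec-POMDP sketch of G = {S, I, A, O, Omega, P, R, gamma}:
    shared reward, per-agent observations, randomly generated dynamics."""

    def __init__(self, n_states=4, n_agents=2, n_actions=3, gamma=0.95, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_states, self.n_agents, self.n_actions = n_states, n_agents, n_actions
        self.gamma = gamma
        n_joint = n_actions ** n_agents              # size of the joint action space
        P = self.rng.random((n_states, n_joint, n_states))
        self.P = P / P.sum(axis=-1, keepdims=True)   # P(s'|s, a), rows normalized
        self.R = self.rng.random((n_states, n_joint))  # shared reward R(s, a)
        # Omega(s, i): a fixed (here random) observation per state and agent.
        self.obs_table = self.rng.integers(0, n_states, (n_states, n_agents))
        self.state = 0

    def step(self, joint_action):
        a = 0
        for a_i in joint_action:                     # flatten the joint action
            a = a * self.n_actions + a_i
        r = self.R[self.state, a]                    # all agents share the reward
        self.state = self.rng.choice(self.n_states, p=self.P[self.state, a])
        return [int(self.obs_table[self.state, i]) for i in range(self.n_agents)], r
```

In fully decentralized learning, each agent would see only its own entry of the returned observation list and the shared reward, never the state or the other agents' actions.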
Although there has been some study (Sun et al., 2022), IPPO may not guarantee policy improvement by independent policy optimization, since the assumption of a stationary state visitation frequency underlying policy optimization may not hold in fully decentralized settings, as we discuss in the following.

Monotonic policy improvement. In Dec-POMDP, from a centralized perspective, we can obtain a TRPO objective (Schulman et al., 2015) of the joint policy π for monotonic improvement,

η(π_new) − η(π_old) ≥ ∑_s ρ_{π_new}(s) ∑_a π_new(a|s) A_{π_old}(s, a) − C · D^max_KL(π_old ∥ π_new)   (1)
                   ≈ ∑_s ρ_{π_old}(s) ∑_a π_new(a|s) A_{π_old}(s, a) − C · D^max_KL(π_old ∥ π_new),   (2)

where ρ_{π_old}(s) = ∑_{t=0}^{∞} γ^t Pr(s_t = s | π_old) is the discounted state visitation frequency given π_old (similarly for ρ_{π_new}), A_{π_old} is the advantage function under π_old, D^max_KL(π_old ∥ π_new) = max_s D_KL(π_old(·|s) ∥ π_new(·|s)), and C is a constant. The step from (1) to (2) is an approximation, or an assumption (Schulman et al., 2015): as ρ_{π_new} is unknown and the policy is limited to slight updates, ρ_{π_new} is approximated by ρ_{π_old}. However, in fully decentralized MARL, this assumption may not hold, because all agents update their policies simultaneously and their joint policy may change significantly, especially when the number of agents is large. This can severely affect the performance of independent policy optimization. Although we could constrain the policy update of each agent to be slight as in TRPO, doing so leads to much slower convergence, especially in fully decentralized MARL, where the joint policy has a much larger search space and is merely optimized by the independent learning of individual agents.
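For tabular policies, the penalized surrogate on the right-hand side of (2) can be evaluated directly. A minimal sketch (the state weights, advantages, and penalty coefficient below are arbitrary illustrative values):

```python
import numpy as np

def penalized_surrogate(rho_old, pi_new, pi_old, adv, C):
    """Right-hand side of (2): sum_s rho_old(s) sum_a pi_new(a|s) A_old(s, a)
    minus C times the maximum per-state KL(pi_old || pi_new)."""
    gain = np.einsum('s,sa,sa->', rho_old, pi_new, adv)
    kl = np.sum(pi_old * np.log(pi_old / pi_new), axis=1)  # KL per state
    return gain - C * kl.max()

rho_old = np.array([0.5, 0.5])                 # discounted visitation weights
pi = np.array([[0.5, 0.5], [0.5, 0.5]])        # 2 states x 2 actions
adv = np.array([[1.0, -1.0], [2.0, 0.0]])      # advantages under pi_old
val = penalized_surrogate(rho_old, pi, pi, adv, C=1.0)  # KL term vanishes here
```

Maximizing this quantity over `pi_new` while the KL penalty keeps the update small is exactly the step whose ρ_{π_old} ≈ ρ_{π_new} assumption breaks down when other agents also move.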

3. MODEL-BASED DECENTRALIZED POLICY OPTIMIZATION

In this paper, we provide a novel perspective and resort to an environment model to bridge the gap between ρ_{π_new} and ρ_{π_old} for each agent, such that monotonic joint-policy improvement can potentially be achieved by fully decentralized policy optimization. We turn the learning process into a Dyna-style (Sutton, 1990) decentralized model-based method. Each agent i additionally learns a decentralized model using local information from policy rollout, and can optionally perform policy optimization on the experiences from model rollout. When optimizing the policy with model rollout, we essentially have

➀  ∥ρ_{π_new} − ρ_{π_old}∥ > ∥ρ^model_{π^model_i} − ρ^model_{π^old_i}∥,

which means the state visitation frequency in model rollout (ρ^model) is more stable. Thus, the approximation from (1) to (2) becomes acceptable under looser constraints on the policy update. Further, we can bound the gap between the returns of policy rollout (η) and model rollout (η^model),

➁  η(π^model_i, π^new_{-i}) − η^model(π^model_i) < B.

Once the bound B is controllable throughout the learning process, it can potentially guarantee the monotonic improvement of the joint policy in the real environment. Thus, ➀ and ➁ together highlight the potential benefits of incorporating an environment model into decentralized policy optimization. In the following, we discuss how to learn such a decentralized model, theoretically investigate its benefits for decentralized policy optimization, and analyze the return bound for monotonic policy improvement.

3.1. LATENT VARIABLE MODEL

In decentralized learning, from the perspective of each agent i, the transition function and the reward function are, respectively,

P_i(s′|s, a_i) = E_{a_{-i}∼π_{-i}}[P(s′|s, a_i, a_{-i})]  and  R_i(s, a_i) = E_{a_{-i}∼π_{-i}}[R(s, a_i, a_{-i})],

where a_{-i} denotes the joint action of all agents except i. As the other agents are also updating their policies, P_i and R_i vary throughout the learning process, which is the well-known non-stationarity problem. Moreover, as each agent i usually obtains an observation instead of the state in decentralized learning, the model can only be learned on tuples (o_i, a_i, o′_i, r). Thus, it is challenging to construct an environment model from the perspective of an individual agent.

To build a decentralized environment model, we introduce a latent variable z_i, which helps distinguish different transitions resulting from the varying unobservable information of the full state and the other agents' policies. The transition function and the reward function can then be redefined as P_i(o′_i|o_i, a_i, z_i) and R_i(o_i, a_i, z_i). As we discuss fully decentralized learning, we drop the subscript i for simplicity in the following. To model the transition function and the reward function with the latent variable, we define the latent variable function from the perspective of an individual agent, ψ(z|o), which gives the probability of latent variable z given observation o. As z is related to the policies of the other agents, ψ(z|o) also varies during policy updates. A latent variable model consists of three modules that together predict the next observation and reward: the transition function P_θ, the reward function R_ϕ, and the latent variable function ψ_ω. As the impact of the unobservable information is designed to be reflected only in the latent variable, the transition function and reward function stay constant while other agents update their policies; only the latent variable function varies.
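The agent-view transition P_i(s′|s, a_i) = E_{a_{-i}∼π_{-i}}[P(s′|s, a_i, a_{-i})] can be checked by enumeration in a small tabular case. A sketch with two agents and made-up dynamics (all arrays are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 3, 2                               # states and per-agent actions
P = rng.random((nS, nA, nA, nS))            # joint transition P(s'|s, a_i, a_-i)
P /= P.sum(axis=-1, keepdims=True)
pi_other = rng.random((nS, nA))             # the other agent's policy pi_-i(a_-i|s)
pi_other /= pi_other.sum(axis=-1, keepdims=True)

def agent_view_transition(s, a_i):
    """P_i(s'|s, a_i): marginalize the joint transition over a_-i ~ pi_-i.
    This is the quantity that drifts whenever pi_-i is updated."""
    return sum(pi_other[s, b] * P[s, a_i, b] for b in range(nA))
```

Because `pi_other` changes after every update of the other agent, the marginal `agent_view_transition` is non-stationary even though `P` itself is fixed, which is exactly what the latent variable z is introduced to absorb.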
We learn such a model by maximizing the likelihood of the experiences D of policy rollout,

max_{θ,ω,ϕ} E_{(o,a,o′,r)∼D, z∼ψ_ω(·|o)} [ P_θ(o′|o, a, z) − (R_ϕ(o, a, z) − r)² ].   (3)

We examine the correlation between the latent variable learned end-to-end and the inaccessible information in a simple setting, and the learned latent variable is indeed correlated with the inaccessible information. More details can be found in Appendix B.

Moreover, when using the learned latent variable model to train an agent, we adopt the k-step branched model rollout of MBPO (Janner et al., 2019) to avoid the compounding model error of long-horizon rollout. Concretely, at each policy update of an agent, we sample h-step experiences {(o_1, a_1, o′_1, r_1), ⋯, (o_h, a_h, o′_h, r_h)} from the policy-rollout buffer D and perform a k-step model rollout starting from the last observation o′_h under the current policy π. The policy π is updated on the merged (h + k)-step experiences {(o_1, a_1, o′_1, r_1), ⋯, (o_{h+k}, a_{h+k}, o′_{h+k}, r_{h+k})} by policy optimization, e.g., PPO (Schulman et al., 2017).
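The k-step branched rollout described above can be sketched as follows; the policy, model, and buffer here are stand-in stubs (our own illustrative names), not the paper's implementation:

```python
import random

def branched_rollout(buffer, model, policy, h, k):
    """Sample an h-step segment of real experience, then branch a k-step
    model rollout from its last observation; return the merged batch."""
    start = random.randrange(len(buffer) - h + 1)
    batch = list(buffer[start:start + h])      # h real transitions (o, a, o', r)
    o = batch[-1][2]                           # branch from the last observation
    for _ in range(k):
        a = policy(o)
        o_next, r = model(o, a)                # model predicts (o', r)
        batch.append((o, a, o_next, r))
        o = o_next
    return batch

# Stand-in stubs: a deterministic "model" and a constant policy.
buffer = [(t, 0, t + 1, 0.0) for t in range(20)]
batch = branched_rollout(buffer, model=lambda o, a: (o + 1, 1.0),
                         policy=lambda o: 0, h=5, k=3)
```

Keeping k small is what limits the compounding model error: only the last k of the h + k transitions are synthetic, and each starts at most k steps from real data.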

3.2. STABLE POLICY OPTIMIZATION ON MODEL

Now, we turn to analyze the benefits of such a model-based method over model-free independent policy optimization. We first theoretically show that independently performing policy optimization, e.g., TRPO (Schulman et al., 2015) or PPO (Schulman et al., 2017), on the latent variable model makes the learning process more stable.

In decentralized learning, from the perspective of an agent, given the true latent variable function ψ, the discounted observation visitation frequency of the buffer D obtained by policy rollout is defined as ρ^{π,ψ}(o) = ρ^{π,ψ}_0(o) + γρ^{π,ψ}_1(o) + γ²ρ^{π,ψ}_2(o) + ⋯, where ρ^{π,ψ}_t(o) ≜ Pr(o_t = o) and o_t is the observation at timestep t of an experience from D. Note that ρ^{π,ψ}(o) is an unbiased estimate of the discounted observation visitation frequency when interacting in the environment. Similarly, ρ^{π,ψ_ω} denotes the discounted observation visitation frequency of the experiences obtained by model rollout. During the learning process, π^n and ψ^n respectively denote the policy and the latent variable function after the nth policy update. Then, we have the following theorem. All proofs are available in Appendix A.

Theorem 1. Define Δρ^n(o) ≜ ρ^{π^n,ψ^n}(o) − ρ^{π^{n−1},ψ^{n−1}}(o), and denote ∥Δρ^n∥ ≜ max_o |Δρ^n(o)|, similarly for ∥Δπ^n∥ and ∥Δψ^n∥. It holds that

∥Δρ^n∥ ≤ C (E_π + E_ψ),

where E_π ≜ max_n ∥Δπ^n∥, E_ψ ≜ max_n ∥Δψ^n∥, and C is a constant. Assume ψ^n_ω = (1 − α)ψ^{n−1}_ω + αψ^n with ψ^0_ω = ψ^0. Then E_ψ > E_{ψ_ω}, and the bound above is lower when substituting ψ with ψ_ω.

According to Theorem 1, the divergence of the discounted observation visitation frequency is bounded by the divergences of the policy and the latent variable function. The policy divergence can be constrained via policy optimization, as in TRPO. Thus, the main difference lies in the divergence of the latent variable function.
As indicated by Theorem 1, the learned latent variable function ψ_ω has a smaller divergence between consecutive policy rollouts than the true latent variable function ψ. Therefore, independent policy optimization on experiences generated by the latent variable model obtains a more stationary observation visitation frequency than on experiences collected in the varying environment, so the learning process of independent policy optimization becomes more stable on the model.
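The smoothing effect assumed in Theorem 1 (ψ_ω^n as an exponential moving average of the true ψ^n, with ψ_ω^0 = ψ^0) can be illustrated numerically. The sequence of "true" latent distributions below is purely synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_rollouts, n_latent = 0.5, 30, 4

# A synthetic sequence of true latent distributions psi^0, ..., psi^N.
psi = rng.random((n_rollouts, n_latent))
psi /= psi.sum(axis=1, keepdims=True)

# Learned functions follow the EMA: psi_w^n = (1 - alpha) psi_w^{n-1} + alpha psi^n.
psi_w = np.empty_like(psi)
psi_w[0] = psi[0]
for n in range(1, n_rollouts):
    psi_w[n] = (1 - alpha) * psi_w[n - 1] + alpha * psi[n]

E_psi = np.abs(np.diff(psi, axis=0)).max()      # max_n ||psi^n - psi^{n-1}||
E_psi_w = np.abs(np.diff(psi_w, axis=0)).max()  # max_n ||psi_w^n - psi_w^{n-1}||
```

For any such sequence, Lemma 4 (Appendix A.2) guarantees E_psi_w < E_psi: the learned function moves strictly less between consecutive rollouts than the true one.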

3.3. RETURN BOUNDS

We then analyze the bound on the return gap between interacting in the environment and interacting with the model. If the return improvement from interacting with the model is higher than the bound, the agent obtains monotonic policy improvement when interacting in the environment. However, the return of interacting in the environment is hard to analyze in decentralized learning, since the policies of other agents are inaccessible; we therefore analyze the return in policy rollout, which is an unbiased estimate of the expected return in the environment.

Several bounds have been introduced in MBPO (Janner et al., 2019) for return-bound analysis, which however are not sufficient in decentralized learning. Thus, we introduce two new bounds that capture the divergence of the latent variable function between consecutive policy rollouts and the error of the learned latent variable function. Formally, with ψ^n, ψ^n_ω, and π^n respectively referring to the true latent variable function, the learned latent variable function, and the policy of the nth policy rollout, we define the bounds as follows:
• reward bound: r_max ≜ max_{o,a,z} max{R(o, a, z), R_ϕ(o, a, z)};
• policy divergence: ϵ_π ≜ max_o D_TV(π ∥ π^n), where D_TV is the total variation distance;
• transition function error: ϵ_θ ≜ max_t E_{a_t,z_t∼(π,ψ^n_ω)} D_TV(P(o_{t+1}|o_t, a_t) ∥ P_θ(o_{t+1}|o_t, a_t, z_t));
• latent variable function divergence: ϵ_ψ ≜ max_o D_TV(ψ^n ∥ ψ^{n+1});
• learned latent variable function error: ϵ_ω ≜ max_o D_TV(ψ^n ∥ ψ^n_ω).

Additionally, we use several notations to represent different returns. The return in the nth policy rollout with the true latent variable function ψ is denoted as η(π, ψ), the return in model rollout with the nth learned latent variable function ψ_ω is denoted as η^model(π, ψ^n_ω), and the return in the k-step branched model rollout with h-step experiences of the nth policy rollout is denoted as η^branch((π^n, π), (ψ^n, ψ^n_ω)).
Now we give the return bounds of model rollout and branched model rollout with the newly introduced ϵ_ψ and ϵ_ω in the following two theorems.

Theorem 2. The return gap between the (n+1)th policy rollout and model rollout with the nth learned model is bounded as:

η(π, ψ^{n+1}) − η^model(π, ψ^n_ω) ≤ (2r_max / (1 − γ)²) (γϵ_θ + 2ϵ_π + 2ϵ_ω + ϵ_ψ) ≜ C(ϵ_θ, ϵ_π, ϵ_ω, ϵ_ψ).

Theorem 3. The return gap between the (n+1)th policy rollout and branched model rollout with the nth learned model is bounded as:

η(π, ψ^{n+1}) − η^branch((π^n, π), (ψ^n, ψ^n_ω)) ≤ C(ϵ_θ, ϵ_π, ϵ_ω, ϵ_ψ).

According to Theorems 2 and 3, we can guarantee monotonic improvement in the environment by improving the return in model rollout or branched model rollout beyond a bound linear in (ϵ_π, ϵ_θ, ϵ_ω, ϵ_ψ). In these bounds, ϵ_θ and ϵ_ω are limited via supervised learning and ϵ_π is constrained by policy optimization. However, ϵ_ψ is left unrestricted. In the following, we seek a better bound in which all terms are controllable.
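The bound of Theorem 2 is a closed-form expression and easy to evaluate; a sketch follows (the numbers plugged in are arbitrary):

```python
def return_gap_bound(r_max, gamma, eps_theta, eps_pi, eps_omega, eps_psi):
    """C(eps_theta, eps_pi, eps_omega, eps_psi) from Theorem 2:
    2 r_max / (1 - gamma)^2 * (gamma*eps_theta + 2*eps_pi + 2*eps_omega + eps_psi)."""
    return 2 * r_max / (1 - gamma) ** 2 * (
        gamma * eps_theta + 2 * eps_pi + 2 * eps_omega + eps_psi)

bound = return_gap_bound(r_max=1.0, gamma=0.5,
                         eps_theta=0.1, eps_pi=0.1, eps_omega=0.1, eps_psi=0.1)
```

The bound grows linearly in every error term, so leaving ϵ_ψ uncontrolled (as Theorems 2 and 3 do) makes the guarantee arbitrarily loose when other agents change their policies quickly.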

3.4. LATENT VARIABLE PREDICTION

In order to restrict the impact of the divergence of the latent variable function, we introduce a new error bound, which measures the divergence between the learned latent variable function and the true latent variable function of the incoming policy rollout. Formally, such an error bound in the nth policy rollout is defined as:

ε̃_ω ≜ max_o D_TV(ψ^{n+1} ∥ ψ^n_ω).

Now we have the following theorems.

Theorem 4. The return gap between the (n+1)th policy rollout and model rollout with the nth learned model is bounded as:

η(π, ψ^{n+1}) − η^model(π, ψ^n_ω) ≤ C(ϵ_θ, ϵ_π, ε̃_ω).

Theorem 5. The return gap between the (n+1)th policy rollout and branched model rollout with the nth learned model is bounded as:

η(π, ψ^{n+1}) − η^branch((π^n, π), (ψ^n, ψ^n_ω)) ≤ C(ϵ_θ, ϵ_π, ϵ_ω, ε̃_ω).

Now all terms of the bounds are controllable, once we can constrain ε̃_ω during learning. To achieve this, we introduce a latent variable prediction function, which predicts the latent variable distribution of an observation o in the incoming policy rollout from the latent variable distributions of o in the latest l − 1 policy rollouts. As the true latent variable function cannot be obtained directly by an agent, the latent variable prediction function f_ζ instead minimizes:

max_o D_TV(ψ_{ω_l}(o) ∥ f_ζ(ψ_{ω_1}(o), ⋯, ψ_{ω_{l−1}}(o))).

With such a latent variable prediction function, ε̃_ω is controllable.

[Figure 1: For learning, each agent maintains the experiences of l consecutive policy rollouts; P_θ and R_ϕ are learned on the l consecutive rollouts, ψ_{ω_l} is learned on the experiences of the lth policy rollout, and f_ζ learns to predict the lth latent variable given the previous l − 1 latent variable functions.]
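A minimal instance of latent variable prediction, with a simple linear extrapolation standing in for the learned f_ζ (the drifting distributions are synthetic, and a real f_ζ would be a learned network rather than this fixed rule):

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

p, q = np.array([0.8, 0.1, 0.1]), np.array([0.2, 0.3, 0.5])
# psi_t drifts linearly from p toward q across consecutive policy rollouts,
# mimicking the drift caused by other agents' policy updates.
psi = [(1 - t / 10) * p + (t / 10) * q for t in range(5)]

def f(history):
    """Stand-in predictor: linear extrapolation from the last two latent
    distributions, clipped and renormalized to stay a distribution."""
    pred = np.clip(2 * history[-1] - history[-2], 0, None)
    return pred / pred.sum()

pred_err = tv(f(psi[:4]), psi[4])   # predicted vs. true next psi
naive_err = tv(psi[3], psi[4])      # reusing the last psi (no prediction)
```

Under this linear drift the extrapolation recovers the next distribution almost exactly, while simply reusing the last latent function (the w/o-prediction variant) incurs a persistent error, which is the gap ε̃_ω is meant to capture.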

3.5. ALGORITHM

With all the theoretical analysis and discussion above, we are ready to present the learning procedure of model-based decentralized policy optimization (MDPO). As illustrated in Figure 1, the environment model consists of the transition function P_θ, the reward function R_ϕ, the latent variable prediction function f_ζ, and the latent variable functions {ψ_{ω_1}, ⋯, ψ_{ω_l}} over the most recent l consecutive policy rollouts. The experiences of the l consecutive policy rollouts D_env = {D_1, ⋯, D_l} are also stored.

After the latest policy rollout l, we update the transition function and reward function, and learn the latent variable function ψ_{ω_l} of policy rollout l by optimizing the objective:

max_{θ,ω_l,ϕ} ∑_{j=1}^{l} E_{(o,a,o′,r)∼D_j, z∼ψ_{ω_j}(o)} [ P_θ(o′|o, a, z) − (R_ϕ(o, a, z) − r)² ],   (4)

where ψ_{ω_l} is obtained by updating ψ_{ω_{l−1}} using D_l, while P_θ and R_ϕ are updated using D_env to make sure they are stable across policy rollouts. Then, the latent variable prediction function f_ζ is updated using D_l by optimizing the objective:

max_ζ E_{o∼D_l, z_1∼ψ_{ω_1}(o), ⋯, z_l∼ψ_{ω_l}(o)} [ f_ζ(z_l | z_1, ⋯, z_{l−1}) ].   (5)

For model rollout, the model predicts the transition in the incoming policy rollout given observation o and action a via the l − 1 latest learned latent variable functions (ψ_{ω_2}, ⋯, ψ_{ω_l}) as:

z ∼ f_ζ(· | z_2, ⋯, z_l),  z_j ∼ ψ_{ω_j}(o),  ô′ ∼ P_θ(o, a, z),  r̂ = R_ϕ(o, a, z).   (6)

Finally, the policy is updated on the branched model rollout by policy optimization, such as PPO or TRPO. We summarize the full learning procedure of MDPO in Algorithm 1.

Algorithm 1. MDPO
1: Initialize D_env = {D_1, ⋯, D_l}, π, P_θ, R_ϕ, Ψ = {ψ_{ω_1}, ⋯, ψ_{ω_l}}, f_ζ.
2: repeat
3:   policy rollout in the environment and obtain D_l
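The procedure above can be sketched end-to-end with stubs. This is an illustrative skeleton under our reading of the algorithm (every function name and stub below is our own), not the authors' code:

```python
from collections import deque

def mdpo_skeleton(env_rollout, fit_model, fit_predictor, model_rollout,
                  policy_update, policy, l=4, iterations=3):
    """Skeleton of one agent's MDPO loop: keep the last l rollout buffers,
    refit the model and latent predictor each round, then update the policy
    on the branched (real + model) batch."""
    D_env = deque(maxlen=l)                  # experiences of the l recent rollouts
    for _ in range(iterations):
        D_l = env_rollout(policy)            # policy rollout in the environment
        D_env.append(D_l)
        model = fit_model(D_env)             # update P_theta, R_phi, psi_{w_l}  (4)
        predictor = fit_predictor(D_env)     # update f_zeta on D_l              (5)
        batch = model_rollout(model, predictor, D_l, policy)   # branched batch (6)
        policy = policy_update(policy, batch)  # PPO/TRPO step on merged batch
    return policy, D_env

# Stubs so the skeleton runs: each stage just tags what it did.
policy, D_env = mdpo_skeleton(
    env_rollout=lambda p: ['transition'] * 5,
    fit_model=lambda D: 'model',
    fit_predictor=lambda D: 'predictor',
    model_rollout=lambda m, f, D, p: D + ['model_transition'] * 2,
    policy_update=lambda p, batch: p + 1,
    policy=0)
```

Each agent runs this loop independently on its own local data, which is what makes the scheme fully decentralized.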

4. EXPERIMENTS

For evaluation, we compare MDPO, MDPO without latent variable prediction (denoted by MDPO w/o prediction), and independent PPO (IPPO) (Schulman et al., 2017) on a set of cooperative multi-agent tasks, including a stochastic game, the multi-agent particle environment (MPE) (Lowe et al., 2017), and multi-agent MuJoCo (Peng et al., 2021b). We do not consider the StarCraft multi-agent challenge (SMAC) (Samvelyan et al., 2019), because IPPO has been shown to perform very well in SMAC (de Witt et al., 2020; Papoudakis et al., 2021), close enough to centralized training with decentralized execution methods like QMIX (Rashid et al., 2018) and MAPPO (Yu et al., 2021a); thus, the gain of MDPO may not be clearly evidenced there.

With the experiments, we try to answer the following three questions:
1. Does the latent variable model help to generate experiences with more stationary observation visitation frequency in practice?
2. Does latent variable prediction help to control ε̃_ω?
3. Does MDPO help to improve performance in decentralized learning?

For a fair comparison, the network architecture and hyperparameters are the same for IPPO and MDPO. The number of environment steps taken in each round (policy rollout, network update) is consistent, so we compare the methods under the same number of environment steps and policy updates. Note that since we consider fully decentralized learning, agents do not use parameter sharing in any method; indeed, parameter sharing should not be allowed in decentralized learning (Terry et al., 2020). More details on the experimental settings, implementation, and hyperparameters are available in Appendix C.

4.1. STOCHASTIC GAME

The stochastic game is a cooperative game with 30 observations (states), 3 agents, and 5 actions for each agent; every episode consists of 40 steps. The transition function and the shared reward function are randomly generated. This game is chosen to verify our theoretical results.

Figure 2 (left) shows the learning curves of MDPO, MDPO w/o prediction, and IPPO, among which MDPO performs better throughout the learning process. With a finite observation space in this game, we can calculate the divergence of observation visitation frequencies (∥Δρ∥ in Section 3.2) in consecutive rollouts. Concretely, we calculate the L1 distance of observation visitation frequency over all observations in consecutive rollouts (policy rollouts for IPPO and branched model rollouts for MDPO); the curves are shown in Figure 2 (mid). The latent variable model generates experiences with more stationary observation visitation frequency than IPPO, which is consistent with Theorem 1 and may account for the superior performance of MDPO w/o prediction over IPPO.

We also examine how well latent variable prediction helps to control the prediction error ε̃_ω. As the true latent variable function is inaccessible, we examine ε̃_ω by comparing how well the learned environment model predicts with and without latent variable prediction.

As MDPO helps to handle non-stationarity in multi-agent settings from the perspective of an individual agent, it is natural to also apply MDPO to non-stationary single-agent settings. We therefore modify this stochastic game into a non-stationary single-agent game and show that MDPO also outperforms the baselines. More details are available in Appendix D.
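The L1 divergence of observation visitation frequencies reported in Figure 2 (mid) can be computed from two rollout batches as follows; the observation sequences below are made up for illustration:

```python
import numpy as np

def visitation_l1(obs_a, obs_b, n_obs):
    """L1 distance between the empirical observation visitation frequencies
    of two consecutive rollouts over a finite observation space."""
    freq_a = np.bincount(obs_a, minlength=n_obs) / len(obs_a)
    freq_b = np.bincount(obs_b, minlength=n_obs) / len(obs_b)
    return np.abs(freq_a - freq_b).sum()

# Two consecutive (toy) rollouts over a 30-observation space.
d = visitation_l1(np.array([0, 0, 1, 2]), np.array([0, 1, 1, 2]), n_obs=30)
```

Plotting this quantity over consecutive policy rollouts (IPPO) versus consecutive branched model rollouts (MDPO) is what produces the comparison in Figure 2 (mid).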

4.2. MPE

MPE is a multi-agent environment with continuous observations. In our MPE tasks, agents observe their own positions and velocities and the relative positions of the others, and they are expected to fulfill a certain goal by controlling their accelerations in each direction, which are continuous in our experiments. Two MPE tasks, 4-agent Cooperative Navigation and 5-agent Regular Polygon Control, are chosen for the performance comparison. In 4-agent Cooperative Navigation, 4 agents learn to cooperate to reach 4 landmarks respectively. In 5-agent Regular Polygon Control, 4 agents learn to cooperate with another agent, which is controlled by a fixed policy, aiming to form a regular pentagon; the reward is given according to the similarity to a regular pentagon.

Figure 3 shows the learning curves of all methods. Generally, MDPO w/o prediction performs better than IPPO, which verifies that the latent variable model helps decentralized policy improvement by making the observation visitation frequency more stationary. And MDPO outperforms MDPO w/o prediction, which verifies that latent variable prediction reduces the gap between the return of interaction and the return predicted by the environment model.

It is worth noting that the unobservable information required to fulfill the goal is at completely different levels in the two tasks. Concretely, knowing the general direction of the others is enough to decide which landmark to approach in Cooperative Navigation, yet the precise positions of the others are needed to form a regular pentagon in Regular Polygon Control.

4.3. MULTI-AGENT MUJOCO

Multi-agent MuJoCo is a continuous multi-agent robotic control environment based on OpenAI's MuJoCo Gym environments. In a multi-agent MuJoCo task, each agent controls several joints of the robot to move it forward, where both the observation space and the action space are continuous. We choose 3-agent Hopper, 4-agent Ant, and 4 versions of HalfCheetah with different agent numbers or joint allocations for the performance comparison. Details of the joint allocations are given in Appendix C.

As illustrated in Figure 4, MDPO consistently performs better in these tasks of different difficulties and with various agent numbers. Compared with MPE, agents in multi-agent MuJoCo have deeper impacts on each other due to the interaction between adjacent joints. Consequently, the transitions of each agent are closely related to the policies of the other agents. Thus, the non-stationarity caused by the policy updates of other agents is more severe in these tasks, which is why IPPO struggles and converges to low performance. Moreover, note that MDPO w/o prediction performs almost the same as IPPO, or even worse in some tasks. The poor performance of MDPO w/o prediction is a consequence of a larger ϵ_ψ caused by the strongly coupled agents in these tasks. Thus, latent variable prediction is necessary in tasks with closely associated agents.

5. CONCLUSION

In this paper, we propose model-based decentralized policy optimization (MDPO). By introducing a latent variable into the environment model, we theoretically show that the model helps to generate experiences with more stationary observation visitation frequency and benefits decentralized policy optimization. Furthermore, we theoretically show that the return bound for monotonic policy improvement is controlled by the prediction error of the latent variable function, and consequently propose a latent variable prediction method to constrain this error. We examine all the theoretical results and design choices via experiments on a set of cooperative multi-agent tasks. The results verify our theory and show that MDPO indeed obtains superior performance over model-free decentralized policy optimization.

A PROOFS

A.1 OBSERVATION VISITATION FREQUENCY DIVERGENCE

In this section, we provide proofs for the upper bound on the divergence of the observation visitation frequency.

Lemma 1. Given two pairs of policy and latent variable function, (π_1, ψ_1) and (π_2, ψ_2), ∀o ∈ O it holds that

∑_{a,z} |π_1(a|o)ψ_1(z|o) − π_2(a|o)ψ_2(z|o)| ≤ |A| · ∥π_1 − π_2∥ + |Z| · ∥ψ_1 − ψ_2∥,

where ∥π_1 − π_2∥ ≜ max_{a,o} |π_1(a|o) − π_2(a|o)| and ∥ψ_1 − ψ_2∥ ≜ max_{z,o} |ψ_1(z|o) − ψ_2(z|o)|.

Proof.
∑_{a,z} |π_1(a|o)ψ_1(z|o) − π_2(a|o)ψ_2(z|o)|
≤ ∑_{a,z} |π_1(a|o)ψ_1(z|o) − π_1(a|o)ψ_2(z|o)| + ∑_{a,z} |π_1(a|o)ψ_2(z|o) − π_2(a|o)ψ_2(z|o)|
= ∑_a π_1(a|o) ∑_z |ψ_1(z|o) − ψ_2(z|o)| + ∑_z ψ_2(z|o) ∑_a |π_1(a|o) − π_2(a|o)|
≤ |A| · ∥π_1 − π_2∥ + |Z| · ∥ψ_1 − ψ_2∥.

Lemma 2 (timestep observation visitation frequency recursion). Given two pairs of policy and latent variable function (π_1, ψ_1) and (π_2, ψ_2), define

Δρ^{(π_1,ψ_1),(π_2,ψ_2)}_t(o) ≜ ρ^{π_1,ψ_1}_t(o) − ρ^{π_2,ψ_2}_t(o),  ∥Δρ^{(π_1,ψ_1),(π_2,ψ_2)}_t∥ ≜ max_o |Δρ^{(π_1,ψ_1),(π_2,ψ_2)}_t(o)|.

Then, for every o′,

|Δρ^{(π_1,ψ_1),(π_2,ψ_2)}_{t+1}(o′)| ≤ |A| · ∥π_1 − π_2∥ + |Z| · ∥ψ_1 − ψ_2∥ + |O| · ∥Δρ^{(π_1,ψ_1),(π_2,ψ_2)}_t∥.

Proof. The observation visitation frequency at timestep t + 1 satisfies the recurrence

ρ^{π,ψ}_{t+1}(o′) = ∑_o ρ^{π,ψ}_t(o) ∑_{a,z} P(o′|o, a, z) π(a|o) ψ(z|o).

Thus, abbreviating Δρ_t ≜ Δρ^{(π_1,ψ_1),(π_2,ψ_2)}_t, the divergence at timestep t + 1 decomposes as

Δρ_{t+1}(o′) = ∑_o ρ^{π_1,ψ_1}_t(o) ∑_{a,z} P(o′|o, a, z) (π_1(a|o)ψ_1(z|o) − π_2(a|o)ψ_2(z|o)) + ∑_o Δρ_t(o) ∑_{a,z} P(o′|o, a, z) π_2(a|o)ψ_2(z|o).

Using Lemma 1 and P(o′|o, a, z) ≤ 1, we can bound the divergence at timestep t + 1:

|Δρ_{t+1}(o′)| ≤ ∑_o ρ^{π_1,ψ_1}_t(o) (|A| ∥π_1 − π_2∥ + |Z| ∥ψ_1 − ψ_2∥) + ∑_o |Δρ_t(o)| ∑_{a,z} P(o′|o, a, z) π_2(a|o)ψ_2(z|o)
≤ |A| · ∥π_1 − π_2∥ + |Z| · ∥ψ_1 − ψ_2∥ + |O| · ∥Δρ_t∥.

Lemma 3 (discounted observation visitation frequency divergence bound). Given two pairs of policy and latent variable function (π_1, ψ_1) and (π_2, ψ_2) with the same initial observation distribution ρ_0(o), it holds that

∥Δρ^{(π_1,ψ_1),(π_2,ψ_2)}∥_1 ≤ C (∥π_1 − π_2∥ + ∥ψ_1 − ψ_2∥),

where C is a certain constant.

Proof. We write the discounted divergence as a sum over timesteps and bound it using Lemma 2:

Δρ^{(π_1,ψ_1),(π_2,ψ_2)}(o) = ∑_{t=0}^{∞} γ^t Δρ^{(π_1,ψ_1),(π_2,ψ_2)}_t(o)
≤ ∑_{t=1}^{T−1} γ^t Δρ_t(o) + γ^T ∑_{t=T}^{∞} γ^{t−T} Δρ_t(o)
≤ ∑_{t=1}^{T−1} γ^t (|A| ∥π_1 − π_2∥ + |Z| ∥ψ_1 − ψ_2∥) + γ|O| ∑_{t=1}^{T−2} γ^t ∥Δρ_t∥ + 2γ^T / (1 − γ)
≤ ∑_{t=1}^{T−1} γ^t ∑_{k=0}^{T−1−t} (γ|O|)^k (|A| ∥π_1 − π_2∥ + |Z| ∥ψ_1 − ψ_2∥) + 2γ^T / (1 − γ).

Thus, we obtain the bounds on the discounted observation visitation frequency divergence:

∥Δρ^{(π_1,ψ_1),(π_2,ψ_2)}∥_∞ ≤ C_1 (∥π_1 − π_2∥ + ∥ψ_1 − ψ_2∥),
∥Δρ^{(π_1,ψ_1),(π_2,ψ_2)}∥_1 ≤ |O| · ∥Δρ^{(π_1,ψ_1),(π_2,ψ_2)}∥_∞ ≤ C_2 (∥π_1 − π_2∥ + ∥ψ_1 − ψ_2∥).
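Lemma 1 can be sanity-checked numerically on random distributions (a purely illustrative check; the distributions are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)

def check_lemma1(nA=4, nZ=5):
    """Compare sum_{a,z} |pi1(a)psi1(z) - pi2(a)psi2(z)| against
    |A| max|pi1 - pi2| + |Z| max|psi1 - psi2| on random distributions."""
    pi1, pi2 = rng.random(nA), rng.random(nA)
    psi1, psi2 = rng.random(nZ), rng.random(nZ)
    pi1, pi2 = pi1 / pi1.sum(), pi2 / pi2.sum()
    psi1, psi2 = psi1 / psi1.sum(), psi2 / psi2.sum()
    lhs = np.abs(np.outer(pi1, psi1) - np.outer(pi2, psi2)).sum()
    rhs = nA * np.abs(pi1 - pi2).max() + nZ * np.abs(psi1 - psi2).max()
    return lhs, rhs
```

Running many random trials never violates the inequality, matching the triangle-inequality argument in the proof.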

A.2 LATENT VARIABLE FUNCTION DIVERGENCE

In this section, we provide the proof for the divergence comparison between the latent variable function of the policy rollout and the learned latent variable function in the model.

Lemma 4 (Latent variable function divergence comparison). Assume $\psi_\omega^n = (1-\alpha)\psi_\omega^{n-1} + \alpha\psi^n$ with initially $\psi_\omega^0 = \psi^0$, where $n$ indexes the $n$th policy rollout. Then $E_\psi > E_{\psi_\omega}$, where $E_\psi \triangleq \max_n \|\psi^n - \psi^{n-1}\|$ and $E_{\psi_\omega} \triangleq \max_n \|\psi_\omega^n - \psi_\omega^{n-1}\|$.

Proof. First, we construct a recursive inequality:
$$\big\|\psi^n - \psi_\omega^{n-1}\big\| = \big\|\psi^n - (1-\alpha)\psi_\omega^{n-2} - \alpha\psi^{n-1}\big\| \le \big\|\psi^n - \psi^{n-1}\big\| + (1-\alpha)\big\|\psi^{n-1} - \psi_\omega^{n-2}\big\|.$$
Expanding it recursively:
$$\begin{aligned}
\big\|\psi^n - \psi_\omega^{n-1}\big\| &\le \big\|\psi^n - \psi^{n-1}\big\| + (1-\alpha)\big\|\psi^{n-1} - \psi_\omega^{n-2}\big\| \\
&\le \big\|\psi^n - \psi^{n-1}\big\| + (1-\alpha)\big\|\psi^{n-1} - \psi^{n-2}\big\| + \cdots + (1-\alpha)^{n-1}\big\|\psi^1 - \psi_\omega^0\big\| \\
&= \big\|\psi^n - \psi^{n-1}\big\| + (1-\alpha)\big\|\psi^{n-1} - \psi^{n-2}\big\| + \cdots + (1-\alpha)^{n-1}\big\|\psi^1 - \psi^0\big\| \\
&\le E_\psi \big(1 + (1-\alpha) + \cdots + (1-\alpha)^{n-1}\big) < \frac{E_\psi}{\alpha}.
\end{aligned}$$
Using this inequality, we can bound $\|\psi_\omega^n - \psi_\omega^{n-1}\|$:
$$\big\|\psi_\omega^n - \psi_\omega^{n-1}\big\| = \big\|(1-\alpha)\psi_\omega^{n-1} + \alpha\psi^n - \psi_\omega^{n-1}\big\| = \alpha\big\|\psi^n - \psi_\omega^{n-1}\big\| < E_\psi.$$
Thus $E_{\psi_\omega} = \max_n \|\psi_\omega^n - \psi_\omega^{n-1}\| < E_\psi$.

Now we combine Lemmas 3 and 4 to prove Theorem 1.

Theorem 1 (Latent variable model benefits). Define $\Delta\rho^n(o) \triangleq \rho^{\pi^n,\psi^n}(o) - \rho^{\pi^{n-1},\psi^{n-1}}(o)$ and $\|\Delta\rho^n\| \triangleq \max_o |\rho^{\pi^n,\psi^n}(o) - \rho^{\pi^{n-1},\psi^{n-1}}(o)|$, and similarly for $\|\Delta\pi^n\|$ and $\|\Delta\psi^n\|$. It holds that
$$\|\Delta\rho^n\| \le C (E_\pi + E_\psi),$$
where $E_\pi \triangleq \max_n \|\Delta\pi^n\|$, $E_\psi \triangleq \max_n \|\Delta\psi^n\|$, and $C$ is a constant. Assume $\psi_\omega^n = (1-\alpha)\psi_\omega^{n-1} + \alpha\psi^n$ and initially $\psi_\omega^0 = \psi^0$.² Then $E_\psi > E_{\psi_\omega}$, and the bound above is lower when substituting $\psi$ with $\psi_\omega$.

Proof. Using Lemma 3, we can bound $\|\Delta\rho^n\|$:
$$\|\Delta\rho^n\| \le C\big(\|\Delta\pi^n\| + \|\Delta\psi^n\|\big) \le C(E_\pi + E_\psi).$$
Using Lemma 4, $E_\psi > E_{\psi_\omega}$. Thus the bound above is lower for $E_{\psi_\omega}$.
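Lemma 4 can likewise be checked numerically. The sketch below represents each $\psi^n$ as a random probability vector over $|Z|$ latent values at a fixed observation and compares the maximum successive change of the raw sequence with that of the soft-updated sequence; the sizes and the value of $\alpha$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.3

# A sequence of latent variable functions psi^n, represented as
# probability vectors over |Z| latent values (one fixed observation).
Z, N = 4, 50
psi = [rng.dirichlet(np.ones(Z)) for _ in range(N)]

# Soft-updated sequence: psi_omega^n = (1 - alpha) psi_omega^{n-1} + alpha psi^n,
# initialized with psi_omega^0 = psi^0.
psi_omega = [psi[0]]
for n in range(1, N):
    psi_omega.append((1 - alpha) * psi_omega[-1] + alpha * psi[n])

max_diff = lambda seq: max(np.abs(seq[n] - seq[n - 1]).max() for n in range(1, N))
E_psi, E_psi_omega = max_diff(psi), max_diff(psi_omega)

# Lemma 4: the soft-updated sequence changes more slowly between rollouts.
assert E_psi_omega < E_psi
```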

A.3 LEMMAS FOR RETURN BOUND ANALYSIS

In this section, we prove several lemmas as preparation for the return bound analysis.

Lemma 5 (TVD bound of joint distributions). Consider two joint distributions of $n+1$ variables of the form
$$P_1(x, y_1, \dots, y_n) = P_1(x)\prod_{i=1}^n P_1(y_i|x), \qquad P_2(x, y_1, \dots, y_n) = P_2(x)\prod_{i=1}^n P_2(y_i|x).$$
The total variation distance between the joint distributions can be bounded as
$$D_{TV}\big(P_1(x, y_1, \dots, y_n)\,\big\|\,P_2(x, y_1, \dots, y_n)\big) \le D_{TV}\big(P_1(x)\|P_2(x)\big) + \sum_{i=1}^n \max_x D_{TV}\big(P_1(y_i|x)\|P_2(y_i|x)\big).$$

Proof. We start from the base case $n = 1$:
$$\begin{aligned}
D_{TV}(P_1\|P_2) &= \frac{1}{2}\sum_{x,y} |P_1(x,y) - P_2(x,y)| \\
&\le \frac{1}{2}\sum_{x,y} \big(|P_1(x)P_1(y|x) - P_2(x)P_1(y|x)| + |P_2(x)P_1(y|x) - P_2(x)P_2(y|x)|\big) \\
&= \frac{1}{2}\sum_{x,y} \big(|P_1(x) - P_2(x)|\,P_1(y|x) + P_2(x)\,|P_1(y|x) - P_2(y|x)|\big) \\
&= \frac{1}{2}\sum_x |P_1(x) - P_2(x)| + \sum_x P_2(x)\,D_{TV}\big(P_1(y|x)\|P_2(y|x)\big) \\
&\le D_{TV}\big(P_1(x)\|P_2(x)\big) + \max_x D_{TV}\big(P_1(y|x)\|P_2(y|x)\big).
\end{aligned}$$
The multi-variable case follows by induction:
$$\begin{aligned}
D_{TV}(P_1\|P_2) &\le D_{TV}\big(P_1(x)\|P_2(x)\big) + \max_x D_{TV}\big(P_1(y_1,\dots,y_n|x)\|P_2(y_1,\dots,y_n|x)\big) \\
&\le D_{TV}\big(P_1(x)\|P_2(x)\big) + \max_x D_{TV}\big(P_1(y_1|x)\|P_2(y_1|x)\big) + \max_x D_{TV}\big(P_1(y_2,\dots,y_n|x)\|P_2(y_2,\dots,y_n|x)\big) \\
&\le D_{TV}\big(P_1(x)\|P_2(x)\big) + \sum_{i=1}^n \max_x D_{TV}\big(P_1(y_i|x)\|P_2(y_i|x)\big).
\end{aligned}$$

Before proving the following lemmas, we clarify the premise on which the discussion in this section is based. In a Dec-POMDP, denote the co-occurrence probability of the tuple $(o, a, z)$ at timestep $t$ as $P_t(o,a,z) \triangleq P(o_t = o, a_t = a, z_t = z)$. Consider two Dec-POMDPs $G_1$ and $G_2$ that differ only in the transition function and reward function; $P_1$ denotes probabilities in $G_1$, while $P_2$ denotes those in $G_2$.
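A numerical check of Lemma 5 (again a sketch, separate from the proof): random distributions stand in for $P_1$ and $P_2$, each with two variables that are conditionally independent given $x$.

```python
import numpy as np

rng = np.random.default_rng(2)

def tvd(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * np.abs(p - q).sum()

nx, ny1, ny2 = 3, 4, 5  # |X| and the two conditional variable sizes

def random_joint():
    """P(x) and conditionals P(y1|x), P(y2|x), with y1, y2 independent given x."""
    px = rng.dirichlet(np.ones(nx))
    py1 = np.array([rng.dirichlet(np.ones(ny1)) for _ in range(nx)])
    py2 = np.array([rng.dirichlet(np.ones(ny2)) for _ in range(nx)])
    joint = px[:, None, None] * py1[:, :, None] * py2[:, None, :]
    return px, py1, py2, joint

px1, py11, py21, j1 = random_joint()
px2, py12, py22, j2 = random_joint()

lhs = tvd(j1.ravel(), j2.ravel())
rhs = (tvd(px1, px2)
       + max(tvd(py11[x], py12[x]) for x in range(nx))
       + max(tvd(py21[x], py22[x]) for x in range(nx)))

assert lhs <= rhs + 1e-12
```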
Different policies and latent variable functions, $(\pi_1, \psi_1)$ and $(\pi_2, \psi_2)$, are used to roll out in $G_1$ and $G_2$ respectively. We denote several bounds between them:
reward bound: $r_{\max} \triangleq \max_{o,a,z} \max\{R_1(o,a,z), R_2(o,a,z)\}$;
policy bound: $\epsilon_\pi \triangleq \max_o D_{TV}(\pi_1\|\pi_2)$;
latent variable function bound: $\epsilon_\psi \triangleq \max_o D_{TV}(\psi_1\|\psi_2)$;
transition function bound: $\epsilon_m \triangleq \max_t \mathbb{E}_{o,a,z\sim P_2^{t-1}}\, D_{TV}\big(P_1(o_t|o,a,z)\|P_2(o_t|o,a,z)\big)$.
Additionally, consider the branched model rollout mentioned in Section 3. The policy, latent variable function, transition function, and reward function vary before and after the rollout branch. We denote these functions with the superscript 'Pre' for functions before the branch and 'Post' for functions after the branch, and correspondingly extend the bounds above (e.g., $\epsilon_\pi^{Pre}$ and $\epsilon_\pi^{Post}$).

Lemma 6 (Observation distribution TVD bound). The total variation distance between the observation distributions at timestep $t$, $P_1(o_t)$ and $P_2(o_t)$, can be bounded as
$$D_{TV}\big(P_1(o_t)\|P_2(o_t)\big) \le t(\epsilon_\pi + \epsilon_\psi + \epsilon_m).$$
Proof.
$$\begin{aligned}
D_{TV}\big(P_1(o_t)\|P_2(o_t)\big) &= \frac{1}{2}\sum_{o_t}|P_1(o_t) - P_2(o_t)| = \frac{1}{2}\sum_{o_t}\Big|\sum_{o,a,z} P_1^{t-1}(o,a,z)P_1(o_t|o,a,z) - P_2^{t-1}(o,a,z)P_2(o_t|o,a,z)\Big| \\
&\le \frac{1}{2}\sum_{o_t}\Big(\sum_{o,a,z}\big|P_1^{t-1}(o,a,z) - P_2^{t-1}(o,a,z)\big|\,P_1(o_t|o,a,z) + \sum_{o,a,z}P_2^{t-1}(o,a,z)\,\big|P_1(o_t|o,a,z) - P_2(o_t|o,a,z)\big|\Big) \\
&= \frac{1}{2}\sum_{o,a,z}\big|P_1^{t-1}(o,a,z) - P_2^{t-1}(o,a,z)\big|\sum_{o_t}P_1(o_t|o,a,z) + \mathbb{E}_{o,a,z\sim P_2^{t-1}}\, D_{TV}\big(P_1(o_t|o,a,z)\|P_2(o_t|o,a,z)\big) \\
&\le D_{TV}\big(P_1^{t-1}(o,a,z)\|P_2^{t-1}(o,a,z)\big) + \epsilon_m.
\end{aligned}$$
According to Lemma 5,
$$D_{TV}\big(P_1^{t-1}(o,a,z)\|P_2^{t-1}(o,a,z)\big) \le D_{TV}\big(P_1(o_{t-1})\|P_2(o_{t-1})\big) + \max_o D_{TV}(\pi_1\|\pi_2) + \max_o D_{TV}(\psi_1\|\psi_2) = D_{TV}\big(P_1(o_{t-1})\|P_2(o_{t-1})\big) + \epsilon_\pi + \epsilon_\psi.$$
Thus,
$$D_{TV}\big(P_1(o_t)\|P_2(o_t)\big) \le D_{TV}\big(P_1(o_{t-1})\|P_2(o_{t-1})\big) + \epsilon_\pi + \epsilon_\psi + \epsilon_m \le D_{TV}\big(P_1(o_0)\|P_2(o_0)\big) + t(\epsilon_\pi + \epsilon_\psi + \epsilon_m) = t(\epsilon_\pi + \epsilon_\psi + \epsilon_m).$$

Lemma 7 (Rollout return bound).
The gap between the rollout returns in $G_1$ with $(\pi_1,\psi_1)$ and in $G_2$ with $(\pi_2,\psi_2)$ is bounded as
$$|\eta_1(\pi_1,\psi_1) - \eta_2(\pi_2,\psi_2)| \le \frac{2r_{\max}}{1-\gamma}\Big(\frac{\gamma(\epsilon_\pi + \epsilon_\psi + \epsilon_m)}{1-\gamma} + \epsilon_\pi + \epsilon_\psi\Big).$$
Proof.
$$|\eta_1(\pi_1,\psi_1) - \eta_2(\pi_2,\psi_2)| = \Big|\sum_{o,a,z} R(o,a,z)\sum_t \gamma^t\big(P_1^t(o,a,z) - P_2^t(o,a,z)\big)\Big| \le 2r_{\max}\sum_t \gamma^t D_{TV}\big(P_1^t(o,a,z)\|P_2^t(o,a,z)\big).$$
Using Lemmas 5 and 6,
$$D_{TV}\big(P_1^t(o,a,z)\|P_2^t(o,a,z)\big) \le D_{TV}\big(P_1(o_t)\|P_2(o_t)\big) + \max_o D_{TV}(\pi_1\|\pi_2) + \max_o D_{TV}(\psi_1\|\psi_2) \le t(\epsilon_\pi + \epsilon_\psi + \epsilon_m) + \epsilon_\pi + \epsilon_\psi.$$
Thus,
$$|\eta_1(\pi_1,\psi_1) - \eta_2(\pi_2,\psi_2)| \le \frac{2r_{\max}}{1-\gamma}\Big(\frac{\gamma(\epsilon_\pi + \epsilon_\psi + \epsilon_m)}{1-\gamma} + \epsilon_\pi + \epsilon_\psi\Big).$$

Lemma 8 (Branched rollout return bound). In a $k$-step branched rollout that takes the $h$-step length before the branch into consideration, denote the gap between the branched rollout returns in $G_1$ with $(\pi_1^{Pre}, \pi_1^{Post}, \psi_1^{Pre}, \psi_1^{Post})$ and in $G_2$ with $(\pi_2^{Pre}, \pi_2^{Post}, \psi_2^{Pre}, \psi_2^{Post})$ as
$$|\eta_1 - \eta_2| \triangleq \big|\eta_1^{branch}\big(\pi_1^{Pre}, \pi_1^{Post}, \psi_1^{Pre}, \psi_1^{Post}\big) - \eta_2^{branch}\big(\pi_2^{Pre}, \pi_2^{Post}, \psi_2^{Pre}, \psi_2^{Post}\big)\big|,$$
which is bounded as
$$|\eta_1 - \eta_2| \le \frac{2r_{\max}}{1-\gamma}\Big(\Big(h + \frac{\gamma^{h+k+1}}{1-\gamma}\Big)\big(\epsilon_\pi^{Pre} + \epsilon_\psi^{Pre} + \epsilon_m^{Pre}\big) + \epsilon_\pi^{Pre} + \epsilon_\psi^{Pre} + \gamma^{h+1}\big(k(\epsilon_\pi^{Post} + \epsilon_\psi^{Post} + \epsilon_m^{Post}) + \epsilon_\pi^{Post} + \epsilon_\psi^{Post}\big)\Big).$$
Proof. According to Lemma 6,
$$D_{TV}\big(P_1(o_t)\|P_2(o_t)\big) \le D_{TV}\big(P_1(o_{t-1})\|P_2(o_{t-1})\big) + \epsilon_\pi + \epsilon_\psi + \epsilon_m,$$
which continues to hold in the branched rollout case with the phase-appropriate bounds. We can discuss $\delta_t \triangleq D_{TV}\big(P_1^t(o,a,z)\|P_2^t(o,a,z)\big)$ for different ranges of $t$. When $t \le h$,
$$\delta_t \le t\big(\epsilon_\pi^{Pre} + \epsilon_\psi^{Pre} + \epsilon_m^{Pre}\big) + \epsilon_\pi^{Pre} + \epsilon_\psi^{Pre};$$
when $h < t \le h+k$,
$$\delta_t \le h\big(\epsilon_\pi^{Pre} + \epsilon_\psi^{Pre} + \epsilon_m^{Pre}\big) + (t-h)\big(\epsilon_\pi^{Post} + \epsilon_\psi^{Post} + \epsilon_m^{Post}\big) + \epsilon_\pi^{Post} + \epsilon_\psi^{Post};$$
when $t > h+k$,
$$\begin{aligned}
\delta_t &\le h\big(\epsilon_\pi^{Pre} + \epsilon_\psi^{Pre} + \epsilon_m^{Pre}\big) + k\big(\epsilon_\pi^{Post} + \epsilon_\psi^{Post} + \epsilon_m^{Post}\big) + \epsilon_\pi^{Post} + \epsilon_\psi^{Post} + (t-h-k)\big(\epsilon_\pi^{Pre} + \epsilon_\psi^{Pre} + \epsilon_m^{Pre}\big) + \epsilon_\pi^{Pre} + \epsilon_\psi^{Pre} \\
&= (t-k)\big(\epsilon_\pi^{Pre} + \epsilon_\psi^{Pre} + \epsilon_m^{Pre}\big) + \epsilon_\pi^{Pre} + \epsilon_\psi^{Pre} + k\big(\epsilon_\pi^{Post} + \epsilon_\psi^{Post} + \epsilon_m^{Post}\big) + \epsilon_\pi^{Post} + \epsilon_\psi^{Post}.
\end{aligned}$$
Using the inequalities above, summing
$$\delta \triangleq \sum_t \gamma^t D_{TV}\big(P_1^t(o,a,z)\|P_2^t(o,a,z)\big)$$
over the three ranges of $t$ yields the stated bound.

We now use these lemmas to provide proofs of the return bounds in different cases.

Theorem 2 (Rollout return bound for decentralized model). The gap between the return of the $(n+1)$th policy rollout, $\eta(\pi, \psi^{n+1})$, and the return of the model rollout with the $n$th learned model, $\eta^{model}(\pi, \psi_\omega^n)$, is bounded as
$$\big|\eta(\pi, \psi^{n+1}) - \eta^{model}(\pi, \psi_\omega^n)\big| \le \underbrace{\frac{2r_{\max}}{(1-\gamma)^2}\big(\gamma\epsilon_\theta + 2\epsilon_\pi + \epsilon_\omega + \epsilon_\psi\big)}_{C(\epsilon_\theta,\, \epsilon_\pi,\, \epsilon_\omega,\, \epsilon_\psi)}.$$
Proof.
$$\big|\eta(\pi, \psi^{n+1}) - \eta^{model}(\pi, \psi_\omega^n)\big| \le \underbrace{\big|\eta(\pi, \psi^{n+1}) - \eta(\pi_D^n, \psi^n)\big|}_{L_1} + \underbrace{\big|\eta(\pi_D^n, \psi^n) - \eta^{model}(\pi, \psi_\omega^n)\big|}_{L_2}.$$
Applying Lemma 7 to $L_1$ and $L_2$:
$$L_1 \le \frac{2r_{\max}}{1-\gamma}\Big(\frac{\gamma(\epsilon_\pi + \epsilon_\psi)}{1-\gamma} + \epsilon_\pi + \epsilon_\psi\Big), \qquad L_2 \le \frac{2r_{\max}}{1-\gamma}\Big(\frac{\gamma(\epsilon_\pi + \epsilon_\theta + \epsilon_\omega)}{1-\gamma} + \epsilon_\pi + \epsilon_\omega\Big).$$
Thus,
$$\big|\eta(\pi, \psi^{n+1}) - \eta^{model}(\pi, \psi_\omega^n)\big| \le \frac{2r_{\max}}{(1-\gamma)^2}\big(2\epsilon_\pi + \epsilon_\psi + \epsilon_\omega + \gamma\epsilon_\theta\big).$$

Theorem 3 (Branched rollout return bound for decentralized model). The gap between the return of the $(n+1)$th policy rollout, $\eta(\pi, \psi^{n+1})$, and the return of the $k$-step branched rollout with $h$-step experience with the $n$th learned model, $\eta^{branch}((\pi_D^n, \pi), (\psi^n, \psi_\omega^n))$, is bounded as
$$\big|\eta(\pi, \psi^{n+1}) - \eta^{branch}\big((\pi_D^n, \pi), (\psi^n, \psi_\omega^n)\big)\big| \le C(\epsilon_\theta, \epsilon_\pi, \epsilon_\omega, \epsilon_\psi).$$
Proof.
$$\begin{aligned}
\delta &\triangleq \big|\eta(\pi, \psi^{n+1}) - \eta^{branch}\big((\pi_D^n, \pi), (\psi^n, \psi_\omega^n)\big)\big| \\
&\le \underbrace{\big|\eta(\pi, \psi^{n+1}) - \eta^{branch}\big((\pi_D^n, \pi_D^n), (\psi^n, \psi^n)\big)\big|}_{L_1} + \underbrace{\big|\eta^{branch}\big((\pi_D^n, \pi_D^n), (\psi^n, \psi^n)\big) - \eta^{branch}\big((\pi_D^n, \pi), (\psi^n, \psi_\omega^n)\big)\big|}_{L_2}.
\end{aligned}$$
Applying Lemma 8 to $L_1$ and $L_2$:
$$L_1 \le \frac{2r_{\max}}{1-\gamma}\Big(\Big(h + \frac{\gamma^{h+k+1}}{1-\gamma}\Big)(\epsilon_\pi + \epsilon_\psi) + (\epsilon_\pi + \epsilon_\psi) + \gamma^{h+1}\big(k(\epsilon_\pi + \epsilon_\psi + \epsilon_\theta) + \epsilon_\pi + \epsilon_\psi\big)\Big) = \frac{2r_{\max}}{1-\gamma}\Big(\Big(h + \frac{\gamma^{h+k+1}}{1-\gamma} + 1 + (k+1)\gamma^{h+1}\Big)(\epsilon_\pi + \epsilon_\psi) + k\gamma^{h+1}\epsilon_\theta\Big),$$
$$L_2 \le \frac{2r_{\max}}{1-\gamma}\Big(\Big(h + \frac{\gamma^{h+k+1}}{1-\gamma}\Big)\epsilon_\omega + \epsilon_\omega + \gamma^{h+1}\big(k(\epsilon_\pi + \epsilon_\omega) + \epsilon_\pi + \epsilon_\omega\big)\Big) = \frac{2r_{\max}}{1-\gamma}\Big(\Big(h + \frac{\gamma^{h+k+1}}{1-\gamma} + 1\Big)\epsilon_\omega + (k+1)\gamma^{h+1}(\epsilon_\pi + \epsilon_\omega)\Big).$$
Thus,
$$\delta \le \frac{2r_{\max}}{1-\gamma}\Big(\Big(\frac{\gamma^{h+k+1}}{1-\gamma} + h + 1 + (k+1)\gamma^{h+1}\Big)(\epsilon_\pi + \epsilon_\psi + \epsilon_\omega) + k\gamma^{h+1}\epsilon_\theta + (k+1)\gamma^{h+1}\epsilon_\pi\Big).$$

We now use the latent variable prediction error bound $\bar\epsilon_\omega$ in place of $\epsilon_\psi$ to analyze the return bounds of the model rollout and the branched model rollout again in the following two theorems.

Theorem 4 (Rollout return bound for decentralized model with prediction error). The gap between the return of the $(n+1)$th policy rollout, $\eta(\pi, \psi^{n+1})$, and the return of the model rollout with the $n$th learned model, $\eta^{model}(\pi, \psi_\omega^n)$, is bounded as
$$\big|\eta(\pi, \psi^{n+1}) - \eta^{model}(\pi, \psi_\omega^n)\big| \le \underbrace{\frac{2r_{\max}}{(1-\gamma)^2}\big(2\epsilon_\pi + \bar\epsilon_\omega + \gamma\epsilon_\theta\big)}_{C(\epsilon_\theta,\, \epsilon_\pi,\, \bar\epsilon_\omega)}.$$
Proof.
$$\big|\eta(\pi, \psi^{n+1}) - \eta^{model}(\pi, \psi_\omega^n)\big| \le \underbrace{\big|\eta(\pi, \psi^{n+1}) - \eta(\pi_D^n, \psi_\omega^n)\big|}_{L_1} + \underbrace{\big|\eta(\pi_D^n, \psi_\omega^n) - \eta^{model}(\pi, \psi_\omega^n)\big|}_{L_2}.$$
Applying Lemma 7 to $L_1$ and $L_2$:
$$L_1 \le \frac{2r_{\max}}{1-\gamma}\Big(\frac{\gamma(\epsilon_\pi + \bar\epsilon_\omega)}{1-\gamma} + \epsilon_\pi + \bar\epsilon_\omega\Big), \qquad L_2 \le \frac{2r_{\max}}{1-\gamma}\Big(\frac{\gamma(\epsilon_\pi + \epsilon_\theta)}{1-\gamma} + \epsilon_\pi\Big).$$
Thus,
$$\big|\eta(\pi, \psi^{n+1}) - \eta^{model}(\pi, \psi_\omega^n)\big| \le \frac{2r_{\max}}{(1-\gamma)^2}\big(2\epsilon_\pi + \bar\epsilon_\omega + \gamma\epsilon_\theta\big).$$

Theorem 5 (Branched rollout return bound for decentralized model with prediction error).
The gap between the return of the $(n+1)$th policy rollout, $\eta(\pi, \psi^{n+1})$, and the return of the $k$-step branched rollout with $h$-step experience with the $n$th learned model, $\eta^{branch}((\pi_D^n, \pi), (\psi^n, \psi_\omega^n))$, is bounded as
$$\big|\eta(\pi, \psi^{n+1}) - \eta^{branch}\big((\pi_D^n, \pi), (\psi^n, \psi_\omega^n)\big)\big| \le C(\epsilon_\theta, \epsilon_\pi, \epsilon_\omega, \bar\epsilon_\omega).$$
Proof.
$$\begin{aligned}
\delta &\triangleq \big|\eta(\pi, \psi^{n+1}) - \eta^{branch}\big((\pi_D^n, \pi), (\psi^n, \psi_\omega^n)\big)\big| \\
&\le \underbrace{\big|\eta(\pi, \psi^{n+1}) - \eta^{branch}\big((\pi_D^n, \pi_D^n), (\psi_\omega^n, \psi_\omega^n)\big)\big|}_{L_1} + \underbrace{\big|\eta^{branch}\big((\pi_D^n, \pi_D^n), (\psi_\omega^n, \psi_\omega^n)\big) - \eta^{branch}\big((\pi_D^n, \pi), (\psi^n, \psi_\omega^n)\big)\big|}_{L_2}.
\end{aligned}$$
Applying Lemma 8 to $L_1$ and $L_2$:
$$L_1 \le \frac{2r_{\max}}{1-\gamma}\Big(\Big(h + \frac{\gamma^{h+k+1}}{1-\gamma}\Big)(\epsilon_\pi + \bar\epsilon_\omega) + (\epsilon_\pi + \bar\epsilon_\omega) + \gamma^{h+1}\big(k(\epsilon_\pi + \bar\epsilon_\omega + \epsilon_\theta) + \epsilon_\pi + \bar\epsilon_\omega\big)\Big) = \frac{2r_{\max}}{1-\gamma}\Big(\Big(h + \frac{\gamma^{h+k+1}}{1-\gamma} + 1 + (k+1)\gamma^{h+1}\Big)(\epsilon_\pi + \bar\epsilon_\omega) + k\gamma^{h+1}\epsilon_\theta\Big),$$
$$L_2 \le \frac{2r_{\max}}{1-\gamma}\Big(\Big(h + \frac{\gamma^{h+k+1}}{1-\gamma}\Big)\epsilon_\omega + \epsilon_\omega + \gamma^{h+1}\big(k\epsilon_\pi + \epsilon_\pi\big)\Big) = \frac{2r_{\max}}{1-\gamma}\Big(\Big(h + \frac{\gamma^{h+k+1}}{1-\gamma} + 1\Big)\epsilon_\omega + (k+1)\gamma^{h+1}\epsilon_\pi\Big).$$
Thus,
$$\delta \le \frac{2r_{\max}}{1-\gamma}\Big(\Big(\frac{\gamma^{h+k+1}}{1-\gamma} + h + 1\Big)(\epsilon_\pi + \bar\epsilon_\omega + \epsilon_\omega) + (k+1)\gamma^{h+1}(2\epsilon_\pi + \bar\epsilon_\omega) + k\gamma^{h+1}\epsilon_\theta\Big).$$

5-Agent Regular Polygon Control. In 5-agent Regular Polygon Control, as shown in Figure 6 (right), 4 agents learn to cooperate with another agent, which is controlled by a fixed policy, aiming to form a regular pentagon. The fixed policy always accelerates its agent in the direction of the relative position between the center of the other 4 agents and itself. The reward is given according to the area $S$ of the current pentagon, scaled by its perimeter $C$:
$$S_{scaled} = \begin{cases} S \cdot (10/C)^2, & \text{agents form a convex pentagon} \\ 0, & \text{otherwise}, \end{cases}$$
which represents the area of the similar pentagon with a perimeter of 10. So when the pentagon is a regular pentagon, $S_{scaled}$ reaches its maximum, $5\cot\frac{\pi}{5}$. Additionally, two penalty terms are applied. The bound penalty $p_b$ restricts agents to stay in bounds:
$$bound(x) = \begin{cases} 0, & |x| < 0.9 \\ 10\,(|x| - 0.9), & |x| < 1.0 \\ e^{2|x|-2}, & \text{otherwise}, \end{cases} \qquad p_b = \sum_{i=0}^{3} \big(bound(x_i) + bound(y_i)\big),$$
where $(x_i, y_i)$ is the position of agent $i$. The collision penalty $p_c$ is the same as that in 4-agent Cooperative Navigation. Finally, we design the reward as
$$r = \min\Big(\max\Big(S_{scaled},\ \frac{1}{5\cot\frac{\pi}{5} - S_{scaled}}\Big),\ 1000\Big) - 4p_c - p_b,$$
where the max operator helps to distinguish the case when the pentagon is relatively large, and the min operator handles the situation of division by zero. This task is more difficult than Cooperative Navigation.

Multi-Agent MuJoCo. In our multi-agent MuJoCo experiments, the state in the MuJoCo environment, which describes the position, velocity, angular velocity of each joint, etc., is used as the observation distributed to each agent. Specifically, in the Ant task, we only use dimensions 0 to 26 of the state. We limit the episode length of HalfCheetah to 250 steps, and 500 steps for Ant and Hopper. We provide the joint allocation of each task in Table 1.
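For concreteness, the Regular Polygon Control reward shaping described above can be sketched as follows. This is an illustrative reimplementation, not the code used in the experiments; the convexity test is assumed to be done elsewhere and passed in as a flag.

```python
import numpy as np

CONVEX_MAX = 5 / np.tan(np.pi / 5)  # area of a regular pentagon with perimeter 10

def shoelace_area(pts):
    """Polygon area via the shoelace formula (vertices given in order)."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def perimeter(pts):
    return np.linalg.norm(pts - np.roll(pts, -1, axis=0), axis=1).sum()

def scaled_area(pts, convex):
    """S_scaled: area of the similar pentagon rescaled to perimeter 10."""
    if not convex:
        return 0.0
    return shoelace_area(pts) * (10.0 / perimeter(pts)) ** 2

def bound_penalty(x):
    """Out-of-bounds penalty for one coordinate."""
    ax = abs(x)
    if ax < 0.9:
        return 0.0
    if ax < 1.0:
        return 10.0 * (ax - 0.9)
    return np.exp(2.0 * ax - 2.0)

def reward(pts, convex, p_c):
    """r = min(max(S_scaled, 1/(5 cot(pi/5) - S_scaled)), 1000) - 4 p_c - p_b."""
    s = scaled_area(pts, convex)
    denom = CONVEX_MAX - s
    inv = float("inf") if denom <= 0 else 1.0 / denom
    p_b = sum(bound_penalty(v) for p in pts[:4] for v in p)  # 4 learning agents
    return min(max(s, inv), 1000.0) - 4.0 * p_c - p_b
```

A regular pentagon maximizes the scaled area, so the capped term saturates and (absent penalties) the reward reaches its ceiling of 1000.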

C.2 IMPLEMENTATION & HYPERPARAMETERS

In this section, we provide details of the implementation and hyperparameters. For the experiment environments, we adopt MPE (MIT license) and MuJoCo Gym (MIT license). For PPO, we follow the version in OpenAI's Spinning Up (MIT license). In the implementation of the latent variable function, both deterministic and stochastic latent variables satisfy our analysis, and we choose between them according to environment properties and experimental performance. In MPE and the stochastic game, except for Appendix B, we use a deterministic latent variable with L2 regularization. In the stochastic game in Appendix B, we use a categorical distribution, and in the multi-agent MuJoCo environments we use a Gaussian distribution. As for the implementation of the transition function and reward function, we use a categorical distribution for the transition function in the stochastic game and deterministic outputs for the others.

Table 1: Joint allocation in multi-agent MuJoCo tasks.

Task | Env actions | Agent actions | Relation
Hopper 3×1 | $(a_0, \dots, a_2)$ | $[(a^0_0), \dots, (a^2_0)]$ | $a_i = a^i_0$
Ant 4×2 | $(a_0, \dots, a_7)$ | $[(a^0_0, a^0_1), \dots, (a^3_0, a^3_1)]$ | $a_{2i+k} = a^i_k$
HalfCheetah 3×2 | $(a_0, \dots, a_5)$ | $[(a^0_0, a^0_1), \dots, (a^2_0, a^2_1)]$ | $a_{2i+k} = a^i_k$
HalfCheetah 6×1 | $(a_0, \dots, a_5)$ | $[(a^0_0), \dots, (a^5_0)]$ | $a_i = a^i_0$
HalfCheetah 5:[1,1,1,1,2] | $(a_0, \dots, a_5)$ | $[(a^0_0), \dots, (a^3_0), (a^4_0, a^4_1)]$ | $a_{i+k} = a^i_k$
HalfCheetah 5×2 | $(a_0, \dots, a_5)$ | $[(a^0_0, a^0_1), \dots, (a^4_0, a^4_1)]$ | $a_i = a^i_0$, $a_5 = \frac{1}{5}\sum_i a^i_1$

The experiments are carried out on an Intel i9-10900K CPU and an NVIDIA GTX 3080Ti GPU. Training the stochastic game task costs 6 hours, while it takes 14 hours for each MPE task and 25 hours for each multi-agent MuJoCo task.

D ADDITIONAL RESULTS

Since MDPO helps to handle non-stationarity in decentralized MARL from the perspective of an individual agent, it is natural to also apply MDPO to single-agent RL in non-stationary environments. In this section, we investigate how MDPO performs in such a non-stationary single-agent environment. We adapt the cooperative stochastic game into a single-agent non-stationary version. Concretely, we fix the policies of two agents and leave only one agent to update its policy. We then randomly generate 5 noise matrices $(N_0, \dots, N_4)$, which have the same shape as the transition matrix $T$ and influence the transition probability in a rotating manner. Formally, in the $n$th policy rollout, the transition matrix is $T + N_{n \bmod 5}$, and we guarantee that such a transition matrix is valid when generating the noise matrices. We compare the performance of MDPO, MDPO w/o prediction, and IPPO on the single-agent non-stationary stochastic game, and the learning curves are shown in Figure 7. As illustrated in Figure 7 (left), MDPO still performs the best in the single-agent non-stationary environment. As shown in Figure 7 (right), latent variable prediction helps to predict the non-stationary transition, and since no noise is applied to the reward matrix, there is merely a slight difference in reward prediction. Since the latent variable in this environment (the noise matrices) rotates in a regular pattern, the prediction function is easier to learn than in decentralized MARL settings. However, unlike in decentralized MARL, the non-stationarity in this setting does not fade away as the policy converges, so MDPO w/o prediction may keep oscillating and generate experiences with larger observation visitation frequency divergence than MDPO, as shown in Figure 7 (mid). Generally, MDPO also works in single-agent non-stationary environments, especially when there is a regular pattern of non-stationarity. More thorough studies are left as future work.
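The rotating-noise transition described above can be sketched as follows. The sizes are toy values, and the noise construction (zero-sum rows scaled below the smallest base probability) is just one simple way to guarantee that the perturbed matrix remains a valid transition matrix, as the text requires; the construction actually used in the experiments is not specified.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_act, n_noise = 6, 3, 5  # toy sizes for illustration

# Base transition matrix T[o, a] -> distribution over next observations
# (a single effective action here; the fixed agents are folded in).
T = rng.dirichlet(np.ones(n_obs), size=(n_obs, n_act))

def make_noise():
    """Noise whose rows sum to zero and are bounded by half the smallest
    base probability, so T + noise stays a valid transition matrix."""
    raw = rng.normal(size=(n_obs, n_act, n_obs))
    raw -= raw.mean(axis=-1, keepdims=True)          # rows sum to zero
    scale = T.min() / (np.abs(raw).max() + 1e-12)
    return 0.5 * scale * raw

noises = [make_noise() for _ in range(n_noise)]

def transition_matrix(rollout_n):
    """Non-stationary transition used in the n-th policy rollout: T + N_{n mod 5}."""
    return T + noises[rollout_n % n_noise]

for n in range(7):
    Tn = transition_matrix(n)
    assert np.all(Tn >= 0) and np.allclose(Tn.sum(axis=-1), 1.0)
```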

E ADDITIONAL RELATED WORK

By utilizing an environment model, model-based RL has shown many advantages, such as sample efficiency (Wang et al., 2019) and exploration (Pathak et al., 2017). Many paradigms have been proposed for exploiting the environment model. Model-based planning methods, such as model predictive control, select actions through model rollouts. Dyna-style methods (Sutton, 1990; Feinberg et al., 2018; Janner et al., 2019) use both data collected in the real environment and data generated by the learned model to update the policy. Recent studies have extended model-based methods to multi-agent settings for sample efficiency in zero-sum games (Zhang et al., 2020), stochastic games (Zhang et al., 2021), and networked systems (Du et al., 2022), as well as for centralized training (Willemsen et al., 2021), opponent modeling (Yu et al., 2021b), and communication (Kim et al., 2021).



Related work on model-based MARL can be found in Appendix E. However, none of the existing work considers exploiting the environment model to help fully decentralized policy optimization.

² Since ψ is varying and ψ_ω is continuously updated using the experiences from several recent policy rollouts, we use the form of soft-update for the relation between ψ and ψ_ω.

³ c_{o'} = 1 in the 4-Agent Cooperative Navigation task and c_{o'} = 100 in 5-Agent Regular Polygon Control.




Figure1: The environment model includes four modules: transition function P θ , reward function R ϕ , latent variable prediction function f ζ , and latent variable functions {ψ ω1 , • • • , ψ ω l } over l consecutive policy rollouts. For learning, each agent maintains the experiences of l consecutive policy rollouts, P θ and R ϕ learn on l consecutive policy rollouts, ψ ω l learns on the experiences of lth policy rollout, and f ζ learns to predict lth latent variable given l -1 latent variable functions.

4: optimize P_θ, R_φ, and ψ_{ω_l} on D_env with (4)
5: optimize prediction function f_ζ on D_l with (5)
6: obtain branched model rollout D_rollout based on D_l using P_θ, R_φ, π, Ψ, and f_ζ with (6)
7: optimize policy π using D_rollout by PPO or TRPO
8: for j ← 1, . . . , l − 1 do
9:     D_j ← D_{j+1}, ψ_{ω_j} ← ψ_{ω_{j+1}}
10: end for
11: until terminate
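The loop in steps 4-10 can be summarized as a minimal control-flow skeleton. All functions below are hypothetical no-op stubs standing in for the real network updates; only the buffer bookkeeping and the order of operations mirror the algorithm.

```python
from collections import deque

# Hypothetical stubs so the control flow is runnable; a real implementation
# would train the networks described in the algorithm above.
def init_model(): return {}
def init_policy(): return {}
def init_predictor(): return {}
def collect_rollout(env, policy): return [env]            # placeholder data
def fit_latent_function(data): return {"psi": len(data)}  # psi_{omega_l} on D_l
def fit_model(model, buffers): model["n_buffers"] = len(buffers)
def fit_predictor(pred, psi_list): pred["n_psi"] = len(psi_list)
def branched_rollout(model, policy, psi_list, pred, start): return start
def ppo_update(policy, data): policy["updates"] = policy.get("updates", 0) + 1

def train_mdpo(env, l=3, n_iters=10):
    """Skeleton of the MDPO loop: keep the l most recent rollout buffers and
    latent functions, refit model and predictor, generate branched model
    rollouts, then update the policy with PPO (or TRPO)."""
    # deque(maxlen=l) implements the shift D_j <- D_{j+1} in steps 8-10.
    buffers, psi_list = deque(maxlen=l), deque(maxlen=l)
    model, policy, predictor = init_model(), init_policy(), init_predictor()
    for _ in range(n_iters):
        d_l = collect_rollout(env, policy)
        buffers.append(d_l)
        psi_list.append(fit_latent_function(d_l))
        fit_model(model, buffers)             # step 4: P_theta, R_phi, psi
        fit_predictor(predictor, psi_list)    # step 5: f_zeta on D_l
        d_roll = branched_rollout(model, policy, psi_list, predictor, d_l)
        ppo_update(policy, d_roll)            # step 7
    return policy

policy = train_mdpo("dummy-env")
```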

Figure 2: Learning curves of MDPO compared with MDPO w/o prediction and IPPO on the stochastic game: average return (left), observation visitation frequency divergence (mid), and model prediction errors (right). Each round is 1600 environment steps.


Figure 6: Illustration of MPE tasks: Cooperative Navigation (left) and Regular Polygon Control (right).

Figure 7: Learning curves of MDPO compared with MDPO w/o prediction and IPPO on the single-agent non-stationary stochastic game: average return (left), observation visitation frequency divergence (mid), and model prediction errors (right). Each round is 1600 environment steps.

Figure 4: Learning curves of MDPO, MDPO w/o prediction, and IPPO in six multi-agent MuJoCo tasks. Each round is 4000 environment steps for 4-agent Ant and 2000 for other tasks.

Joint allocation in multi-agent MuJoCo tasks. The relation column indicates how agents control the joints of robotics.

Structure of the neural networks we used in experiments. All neural networks used in our implementation are multi-layer perceptrons (MLPs). In particular, the transition function and reward function are each learned using an ensemble formed by 3 individual versions of the last layer. The likelihood of the next observation is multiplied by a coefficient c_{o'} to balance the scale of the elements in (3) and (4).³ The hidden sizes and activation functions used in the networks are provided in Table 2, and the parameters used in training are provided in Table 3.

Structure of the neural networks we used in GRF experiments.

B VERIFICATION ON LEARNED LATENT VARIABLE

To examine how related the learned latent variable and the inaccessible information are, we design a simple tabular case, where the policies, transition matrix, and reward matrix are preset. There are 3 states and 3 agents with 2 actions each, and the size of the latent variable space is set to 4. For agent 0, we collect experiences ⟨s, a_0, s', r⟩ and train a latent variable model end-to-end. For visualization, we design ψ_ω as an explicit network to fetch the learned z_0 and preset the forward pass for the state in P_θ to avoid the correspondence being conditioned on the state. Then we sample z_0 from the learned latent variable function network ψ_ω for each experience in the buffer and calculate the conditional probabilities P(z_0|a_{−0}) and P(a_{−0}|z_0). As shown in Figure 5, there is a one-to-one correspondence between the latent variable and the other agents' joint action. This demonstrates that the latent variable model can implicitly capture inaccessible information relevant to the transition and reward via end-to-end learning.
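The conditional-probability computation described above can be sketched as follows, with fabricated samples in which the latent $z_0$ and the other agents' joint action $a_{-0}$ are (mostly) in one-to-one correspondence; the sizes and noise level are illustrative, not the experiment's values.

```python
import numpy as np

rng = np.random.default_rng(4)

# Fabricated buffer: each entry pairs the other agents' (discretized) joint
# action with the sampled latent z0. We plant a one-to-one correspondence
# plus a little noise, then recover the conditionals empirically.
n_vals = 4
joint_actions = rng.integers(0, n_vals, size=5000)
latents = np.where(rng.random(5000) < 0.95,
                   joint_actions,                        # matched latent
                   rng.integers(0, n_vals, size=5000))   # occasional mismatch

def conditional(xs, ys, nx, ny):
    """Empirical P(y|x) as an (nx, ny) matrix from paired samples."""
    counts = np.zeros((nx, ny))
    for x, y in zip(xs, ys):
        counts[x, y] += 1
    return counts / counts.sum(axis=1, keepdims=True)

p_z_given_a = conditional(joint_actions, latents, n_vals, n_vals)

# With a one-to-one correspondence, each row concentrates on one latent value.
assert (p_z_given_a.argmax(axis=1) == np.arange(n_vals)).all()
```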

C EXPERIMENT DETAILS

C.1 ENVIRONMENT SETTING

In this section, we introduce the environment settings used in the experiments.

Stochastic Game. In our stochastic game, there are 30 observations, 3 agents with 5 actions each, and the episode length is limited to 40 steps. We generate a transition matrix T and a reward matrix R in advance as the transition function and reward function. Concretely, T is a matrix of shape [30, 5, 5, 5, 30] and R is a matrix of shape [30, 5, 5, 5, 1]. At each timestep t, the transition is determined by the observation o_t and the agents' joint actions (a^0_t, a^1_t, a^2_t).

MPE. In our MPE tasks, agents observe their own positions, velocity, and the others' relative positions. The actions of agents control their accelerations in every direction, which are continuous in our experiments. In MPE tasks, the episode length is limited to 40 steps.

4-Agent Cooperative Navigation. In 4-agent Cooperative Navigation, as shown in Figure 6 (left), 4 agents learn to cooperate to reach 4 landmarks respectively. Concretely, we denote the radius of agent i as d_i, the position of agent i as (x^i_a, y^i_a), and the position of landmark i as (x^i_l, y^i_l). The reward is defined in terms of these quantities together with a collision penalty p_c. Thus, the reward upper bound at each step is −4.
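The stochastic game dynamics described above can be sketched directly from the stated matrix shapes; random T and R below stand in for the pre-generated matrices.

```python
import numpy as np

rng = np.random.default_rng(5)
n_obs, n_act, n_agents = 30, 5, 3

# Transition tensor T[o, a0, a1, a2] -> distribution over next observations,
# and reward tensor R[o, a0, a1, a2] -> scalar reward, matching the shapes
# [30, 5, 5, 5, 30] and [30, 5, 5, 5, 1] in the text.
T = rng.dirichlet(np.ones(n_obs), size=(n_obs, n_act, n_act, n_act))
R = rng.random((n_obs, n_act, n_act, n_act, 1))

def step(o, actions):
    """One environment step given observation o and the 3 agents' actions."""
    a0, a1, a2 = actions
    o_next = rng.choice(n_obs, p=T[o, a0, a1, a2])
    return o_next, float(R[o, a0, a1, a2, 0])

# A rollout with random actions; the episode length is limited to 40 steps.
o = int(rng.integers(n_obs))
for t in range(40):
    o, r = step(o, rng.integers(n_act, size=n_agents))
```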
However, none of them strictly tackles the fully decentralized setting of our paper. Specifically, the DMPO (decentralized model-based policy optimization) algorithm in prior work (Du et al., 2022) is designed for a networked system, where agents are able to communicate along the edges with their neighbors. The similar names DMPO and MDPO may suggest similar settings, but the two algorithms actually address different settings. In the fully decentralized setting of our paper, no information sharing is allowed between agents; when the number of neighbors in DMPO is set to zero, it degenerates into the version of MDPO w/o prediction in our algorithm.

F ADDITIONAL EXPERIMENTS

To enrich our baselines and test MDPO in a more complex environment, we supplement experiments on the Google Research Football (GRF) environment with ITRPO as the baseline. ITRPO is the decentralized version of TRPO (Schulman et al., 2015), and we use TRPO to optimize the policy in MDPO and MDPO w/o prediction. Specifically, we choose 'simple115v2' as the observation representation, which encodes the state with 115 floats, and 'scoring+checkpoint' as the reward, which encodes the domain knowledge that scoring is aided by advancing across the pitch. We examine MDPO, MDPO w/o prediction, and ITRPO in two scenarios, Run and Pass and 3 vs 1 with Keeper, in both of which MDPO improves the average goal rate. The learning curves are shown in Figure 8. In our implementation, we use a categorical distribution for the one-hot dimensions in the observation and a Gaussian distribution for the others to model the transition. We use an MLP for the reward function and a Gaussian distribution for the latent variable function. The network structures are listed in Table 4 and the hyperparameters in Table 5.
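A minimal sketch of such a mixed transition likelihood, assuming an illustrative split of one-hot and continuous dimensions (not the real GRF 'simple115v2' layout) and a diagonal Gaussian:

```python
import numpy as np

# Hypothetical split of the observation vector: a block of one-hot features
# modeled with a categorical distribution, and continuous features modeled
# with an independent (diagonal) Gaussian.
onehot_dims = slice(0, 10)   # illustrative one-hot block
cont_dims = slice(10, 20)    # illustrative continuous block

def transition_log_likelihood(obs_next, logits, mean, log_std):
    """log p(o') = log Cat(one-hot index | logits) + log N(continuous | mean, std)."""
    # Categorical part: index of the active one-hot entry.
    idx = int(np.argmax(obs_next[onehot_dims]))
    log_p = logits - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
    ll_cat = log_p[idx]
    # Gaussian part: sum of per-dimension log densities.
    x = obs_next[cont_dims]
    var = np.exp(2 * log_std)
    ll_gauss = (-0.5 * ((x - mean) ** 2 / var + 2 * log_std
                        + np.log(2 * np.pi))).sum()
    return ll_cat + ll_gauss

obs = np.zeros(20)
obs[3] = 1.0
obs[10:] = 0.5
# Uniform logits and a Gaussian centered on the target with unit std:
# equals log(1/10) - 5*log(2*pi).
ll = transition_log_likelihood(obs, np.zeros(10), np.full(10, 0.5), np.zeros(10))
```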

