MODEL-BASED DECENTRALIZED POLICY OPTIMIZATION

Abstract

Decentralized policy optimization is commonly used in cooperative multi-agent tasks. However, since all agents update their policies simultaneously, the environment is non-stationary from the perspective of each individual agent, making monotonic policy improvement hard to guarantee. To make policy improvement stable and monotonic, we propose model-based decentralized policy optimization (MDPO), which incorporates a latent variable function to help each agent construct the transition and reward function from its individual perspective. We theoretically show that the policy optimization of MDPO is more stable than model-free decentralized policy optimization. Moreover, due to non-stationarity, the latent variable function varies over time and is hard to model. We further propose a latent variable prediction method that reduces the error of the latent variable function, which theoretically contributes to monotonic policy improvement. Empirically, MDPO indeed obtains superior performance over model-free decentralized policy optimization in a variety of cooperative multi-agent tasks.

1. INTRODUCTION

Decentralized multi-agent reinforcement learning (MARL) is commonly used in practice for cooperative multi-agent tasks where global information is inaccessible, e.g., traffic signal control (Wei et al., 2018), unmanned aerial vehicles (Qie et al., 2019), and IoT (Cao et al., 2020). Independently performing policy optimization using local information, e.g., independent PPO (IPPO) (Schulman et al., 2017), is one of the most straightforward methods for decentralized MARL. Recent empirical studies (de Witt et al., 2020; Yu et al., 2021a; Papoudakis et al., 2021) demonstrate that IPPO performs surprisingly well in several cooperative multi-agent benchmarks, which shows great promise for fully decentralized policy optimization. However, since all agents are updating their policies, from the perspective of an individual agent the environment is non-stationary (Zhang et al., 2019). Thus, the monotonic policy improvement that policy optimization achieves in single-agent settings (Schulman et al., 2015; 2017) may not be guaranteed in decentralized MARL. Concretely, policy optimization assumes the state visitation frequency is approximately stationary because the agent's policy is limited to small updates, and this assumption is necessary to guarantee monotonic policy improvement (Schulman et al., 2015). In decentralized multi-agent settings, however, as all agents update their policies simultaneously, the state visitation frequency can change greatly, which contradicts this fundamental assumption of policy optimization, so monotonic improvement may not be preserved. To address this problem, we resort to exploiting an environment model to stabilize the state visitation frequency and support monotonic policy improvement. However, learning an environment model in decentralized settings is non-trivial, since information about other agents, e.g., their policies, is unobservable and changing.
Therefore, we introduce a latent variable to help distinguish the different transitions that result from the unobservable information. We then build an environment model for each agent, consisting of a transition function, a reward function, and a latent variable function that infers the latent variable from the observation. The agents are trained using independent policy optimization methods, e.g., TRPO (Schulman et al., 2015) or PPO (de Witt et al., 2020), on both the experiences generated by the environment model and those collected in the environment. Since the environment is non-stationary, the latent variable function also varies during learning. We theoretically show that independently performing policy optimization on experiences generated by the environment model with the varying latent variable function yields a more stationary observation visitation frequency than performing it on experiences collected in the non-stationary environment. Thus, independent policy optimization is more stable on the environment model. Moreover, to obtain monotonic improvement, the gap between the return of interacting with the environment and the return predicted by the environment model should be small. We theoretically show that this return gap is bounded by the prediction error of the latent variable function. As the latent variable function varies due to non-stationarity, to minimize the prediction error we propose a latent variable prediction method that uses historical latent variables to predict the future one. The latent variable prediction thus reduces the return gap and helps monotonic policy improvement. The proposed algorithm, model-based decentralized policy optimization (MDPO), is theoretically grounded and empirically effective for fully decentralized learning. We evaluate MDPO on a variety of cooperative multi-agent tasks, i.e., a stochastic game, the multi-agent particle environment (MPE) (Lowe et al., 2017), and multi-agent MuJoCo (Peng et al., 2021a).
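To make the per-agent model concrete, the following is a minimal sketch of an agent's environment model with a latent variable function, a transition function, and a reward function. The class name, the linear maps standing in for learned networks, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class AgentEnvModel:
    """Illustrative per-agent model (not the paper's architecture):
    a latent variable function z = phi(o), a transition function
    o' = T(o, a, z), and a reward function r = R(o, a, z).
    Random linear maps stand in for learned networks."""
    def __init__(self, obs_dim, act_dim, z_dim):
        self.W_z = rng.normal(size=(z_dim, obs_dim)) * 0.1                       # latent variable function
        self.W_t = rng.normal(size=(obs_dim, obs_dim + act_dim + z_dim)) * 0.1   # transition function
        self.w_r = rng.normal(size=obs_dim + act_dim + z_dim) * 0.1              # reward function

    def latent(self, o):
        # Infer the latent variable from the agent's local observation only
        return np.tanh(self.W_z @ o)

    def step(self, o, a):
        z = self.latent(o)
        x = np.concatenate([o, a, z])
        o_next = self.W_t @ x      # predicted next observation
        r = float(self.w_r @ x)    # predicted shared reward
        return o_next, r, z

# Generate one model-based transition from a dummy observation and action
model = AgentEnvModel(obs_dim=4, act_dim=2, z_dim=3)
o, a = np.zeros(4), np.ones(2)
o_next, r, z = model.step(o, a)
```

In a full training loop, each agent would roll out short trajectories from this model and mix them with real experience for its independent policy update.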
MDPO outperforms the model-free independent policy optimization baseline, and the proposed latent variable prediction yields an additional performance gain, verifying that MDPO helps stable and monotonic policy improvement in fully decentralized learning.
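The latent variable prediction idea above, predicting the next latent variable from its history, can be sketched with a simple least-squares autoregressive fit. The AR order, the shared lag weights, and the fitting procedure are assumptions for illustration; the paper's predictor may differ.

```python
import numpy as np

def fit_lag_weights(history, order=2):
    """Fit weights w so that z_t ~ sum_k w[k] * z_{t-1-k}, by least squares
    over all time steps and latent dimensions. history: (T, z_dim) array."""
    T, d = history.shape
    # Each row of A holds (z_{t-1,j}, z_{t-2,j}, ...) for one (t, j) pair
    A = np.stack([history[t - order:t][::-1] for t in range(order, T)])  # (T-order, order, d)
    A = A.transpose(0, 2, 1).reshape(-1, order)
    b = history[order:].reshape(-1)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def predict_next(history, w):
    """Predict the next latent variable from the most recent `len(w)` ones."""
    order = len(w)
    return w @ history[-order:][::-1]

# Synthetic latent sequence obeying z_t = 0.8 z_{t-1} + 0.1 z_{t-2}
rng = np.random.default_rng(0)
hist = [rng.normal(size=3), rng.normal(size=3)]
for _ in range(20):
    hist.append(0.8 * hist[-1] + 0.1 * hist[-2])
hist = np.array(hist)

w = fit_lag_weights(hist, order=2)   # recovers roughly (0.8, 0.1)
z_next = predict_next(hist, w)
```

The fitted predictor replaces the stale latent variable with an extrapolated one, which is what tightens the return-gap bound discussed above.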

2. PRELIMINARIES

Dec-POMDP. A cooperative multi-agent task is generally modeled as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek & Amato, 2016). Specifically, a Dec-POMDP is defined as a tuple G = {S, I, A, O, Ω, P, R, γ}. S is the state space, I is the set of agents, and A = A_1 × · · · × A_{|I|} is the joint action space, where A_i is the action space of each agent i. At each state s, each agent i ∈ I merely gets access to its observation o_i ∈ O, drawn from the observation function Ω(s, i), and selects an action a_i ∈ A_i; all the actions form a joint action a ∈ A. The state transitions to the next state s′ according to the transition function P(s′|s, a): S × A × S → [0, 1], and all agents receive a shared reward r = R(s, a): S × A → R. The objective is to maximize the expected return η(π) = E[Σ_{t=0}^∞ γ^t r_t | ρ_0, π] under the joint policy of all agents π and the initial state distribution ρ_0, where γ ∈ [0, 1) is the discount factor. The joint policy π can be represented as the product of each agent's policy π_i. We also denote by π_{-i} the joint policy of all agents except i.

Fully decentralized learning. We consider solving the Dec-POMDP in a fully decentralized way (Tan, 1993; de Witt et al., 2020), where each agent independently learns a policy and executes actions without communication or parameter sharing in both the training and execution phases. Since all agents are updating their policies, from the perspective of each individual agent the environment is non-stationary, which fundamentally challenges decentralized learning (Zhang et al., 2019). Existing decentralized MARL methods are limited. Independent Q-learning (IQL) (Tan, 1993) and independent policy optimization, e.g., IPPO (de Witt et al., 2020), are the most straightforward fully decentralized algorithms. Despite good empirical performance (Papoudakis et al., 2021), due to non-stationarity these methods lack theoretical guarantees.
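The objective η(π) above can be estimated by Monte Carlo rollouts under independent policies. The toy two-state, two-agent game below is purely illustrative (fully observable to keep the sketch short, so Ω is the identity); the transition table, reward table, and policies are assumptions, not a benchmark from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.95

# Toy 2-agent Dec-POMDP with 2 states and 2 actions per agent.
# P[s, a1, a2] = probability of transitioning to state 1;
# R[s, a1, a2] = shared reward for the joint action (a1, a2) in state s.
P = np.array([[[0.1, 0.9], [0.8, 0.2]],
              [[0.7, 0.3], [0.4, 0.6]]])
R = np.array([[[0.0, 1.0], [1.0, 0.0]],
              [[0.5, 0.0], [0.0, 0.5]]])

def rollout(policies, horizon=100):
    """One episode's discounted return; pi[s] = Pr(action 1 | state s)
    and each agent samples its action independently."""
    s, ret, disc = 0, 0.0, 1.0
    for _ in range(horizon):
        a = [int(rng.random() < pi[s]) for pi in policies]
        ret += disc * R[s, a[0], a[1]]
        disc *= GAMMA
        s = int(rng.random() < P[s, a[0], a[1]])
    return ret

# Monte Carlo estimate of eta(pi) for two uniform-random independent policies
policies = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
eta = np.mean([rollout(policies) for _ in range(200)])
```

Because rewards here lie in [0, 1], the estimate is bounded by 1/(1 − γ) = 20, which gives a quick sanity check.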
To the best of our knowledge, IQL has no convergence guarantee. Although there has been some study (Sun et al., 2022), IPPO may not guarantee policy improvement by independent policy optimization, since the assumption of a stationary state visitation frequency for policy optimization may not hold in fully decentralized settings, which we discuss in the following.

Monotonic policy improvement. In Dec-POMDP, from a centralized perspective, we can obtain a TRPO objective (Schulman et al., 2015) of the joint policy π for monotonic improvement,

η(π_new) − η(π_old) ≥ Σ_s ρ_{π_new}(s) Σ_a π_new(a|s) A_{π_old}(s, a) − C · D^max_KL(π_old ∥ π_new)   (1)
                    ≈ Σ_s ρ_{π_old}(s) Σ_a π_new(a|s) A_{π_old}(s, a) − C · D^max_KL(π_old ∥ π_new),   (2)

where ρ_{π_old}(s) = Σ_{t=0}^∞ γ^t Pr(s_t = s | π_old) is the discounted state visitation frequency given π_old, and similarly for ρ_{π_new}(s); A_{π_old} is the advantage function under π_old; D^max_KL(π_old ∥ π_new) = max_s D_KL(π_old(·|s) ∥ π_new(·|s)); and C is a constant. The step from (1) to (2) is an approximation that replaces ρ_{π_new} with ρ_{π_old}, which is justified only when the state visitation frequency changes little between the old and new policies.
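For tabular policies, the right-hand side of the TRPO-style lower bound, a surrogate advantage term minus a maximum-KL penalty, can be computed directly. The state distribution, policies, advantages, and the choice C = 1 below are made-up numbers for illustration only.

```python
import numpy as np

def kl_max(pi_old, pi_new):
    """Maximum over states of D_KL(pi_old(.|s) || pi_new(.|s)),
    for tabular policies given as arrays pi[s, a]."""
    per_state = np.sum(pi_old * np.log(pi_old / pi_new), axis=1)
    return per_state.max()

def surrogate_lower_bound(rho_old, pi_new, adv_old, pi_old, C):
    """RHS of the bound:
    sum_s rho_old(s) sum_a pi_new(a|s) A_old(s, a) - C * D_KL^max."""
    surrogate = np.sum(rho_old[:, None] * pi_new * adv_old)
    return surrogate - C * kl_max(pi_old, pi_new)

# Two states, two actions; small policy change from pi_old to pi_new
rho_old = np.array([0.6, 0.4])                     # discounted state visitation frequency
pi_old  = np.array([[0.5, 0.5], [0.5, 0.5]])
pi_new  = np.array([[0.6, 0.4], [0.45, 0.55]])
adv_old = np.array([[0.2, -0.2], [-0.1, 0.1]])     # advantages under pi_old

lb = surrogate_lower_bound(rho_old, pi_new, adv_old, pi_old, C=1.0)
```

A positive value of this lower bound certifies that the update improves the return; maximizing it subject to a KL constraint is exactly the TRPO update.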