MODEL-BASED DECENTRALIZED POLICY OPTIMIZATION

Abstract

Decentralized policy optimization is commonly used in cooperative multi-agent tasks. However, since all agents update their policies simultaneously, the environment is non-stationary from the perspective of each individual agent, making monotonic policy improvement hard to guarantee. To make policy improvement stable and monotonic, we propose model-based decentralized policy optimization (MDPO), which incorporates a latent variable function to help construct the transition and reward functions from an individual perspective. We theoretically show that policy optimization in MDPO is more stable than model-free decentralized policy optimization. Moreover, due to non-stationarity, the latent variable function varies during learning and is hard to model. We further propose a latent variable prediction method to reduce the error of the latent variable function, which theoretically contributes to monotonic policy improvement. Empirically, MDPO indeed outperforms model-free decentralized policy optimization in a variety of cooperative multi-agent tasks.

1. INTRODUCTION

Decentralized multi-agent reinforcement learning (MARL) is commonly used in practice for cooperative multi-agent tasks where global information is inaccessible, e.g., traffic signal control (Wei et al., 2018), unmanned aerial vehicles (Qie et al., 2019), and IoT (Cao et al., 2020). Independently performing policy optimization using local information, e.g., independent PPO (IPPO) (Schulman et al., 2017), is one of the most straightforward methods for decentralized MARL. Recent empirical studies (de Witt et al., 2020; Yu et al., 2021a; Papoudakis et al., 2021) demonstrate that IPPO performs surprisingly well on several cooperative multi-agent benchmarks, which shows great promise for fully decentralized policy optimization. However, since all agents update their policies simultaneously, the environment is non-stationary from the perspective of each individual agent (Zhang et al., 2019). Thus, the monotonic policy improvement that policy optimization achieves in single-agent settings (Schulman et al., 2015; 2017) may not be guaranteed in decentralized MARL. Concretely, policy optimization assumes the state visitation frequency is stationary because the agent's policy is limited to small updates, and this assumption is necessary to guarantee monotonic policy improvement (Schulman et al., 2015). In decentralized multi-agent settings, however, as all agents update their policies simultaneously, the state visitation frequency can change greatly, contradicting this fundamental assumption of policy optimization, so monotonic improvement may not be preserved. To address this problem, we resort to exploiting an environment model to stabilize the state visitation frequency and aid monotonic policy improvement. However, learning an environment model in decentralized settings is non-trivial, since information about other agents, e.g., their policies, is unobservable and changing.
Therefore, we introduce a latent variable to help distinguish the different transitions that result from this unobservable information. We then build an environment model for each agent, consisting of a transition function, a reward function, and a latent variable function that infers the latent variable from the observation. The agents are trained with independent policy optimization methods, e.g., TRPO (Schulman et al., 2015) or PPO (de Witt et al., 2020), on both the experiences generated by the environment model and those collected from the environment. Since the environment is non-stationary, the latent variable function also varies during learning. We theoretically show that independently performing policy optimization on experiences generated
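As a rough illustration of how such a per-agent model could be organized, the sketch below uses toy linear function approximators. The class and method names (`AgentWorldModel`, `latent`, `step`, `rollout`) and the linear parameterization are illustrative assumptions for exposition, not the architecture proposed in this paper; in practice each component would be a learned neural network trained on the agent's local experiences.

```python
import numpy as np

rng = np.random.default_rng(0)


class AgentWorldModel:
    """Hypothetical per-agent model sketch: a latent variable function z = e(o),
    a transition function o' = f(o, a, z), and a reward function r(o, a, z).
    The latent z stands in for unobservable information such as other agents'
    policies. Linear weights here are placeholders for learned networks."""

    def __init__(self, obs_dim, act_dim, z_dim):
        # Randomly initialized toy parameters (assumed, for illustration only).
        self.W_z = rng.normal(scale=0.1, size=(z_dim, obs_dim))
        self.W_t = rng.normal(scale=0.1, size=(obs_dim, obs_dim + act_dim + z_dim))
        self.W_r = rng.normal(scale=0.1, size=(1, obs_dim + act_dim + z_dim))

    def latent(self, obs):
        # Latent variable function: infer z from the local observation.
        return np.tanh(self.W_z @ obs)

    def step(self, obs, act):
        # Predict next observation and reward, conditioned on the latent.
        z = self.latent(obs)
        x = np.concatenate([obs, act, z])
        next_obs = self.W_t @ x
        reward = float(self.W_r @ x)
        return next_obs, reward

    def rollout(self, obs, policy, horizon):
        # Generate imagined transitions; these model-generated experiences
        # would be mixed with real ones for independent policy optimization.
        traj = []
        for _ in range(horizon):
            act = policy(obs)
            next_obs, rew = self.step(obs, act)
            traj.append((obs, act, rew, next_obs))
            obs = next_obs
        return traj
```

Each agent would maintain its own such model and update both the latent variable function and its policy as the other agents' behaviors drift.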

