LEARNING PREDICTIVE COMMUNICATION BY IMAGINATION IN NETWORKED SYSTEM CONTROL

Anonymous authors
Paper under double-blind review

Abstract

Multi-agent control in networked systems is one of the biggest challenges in Reinforcement Learning (RL), and progress has been limited compared to recent deep RL successes in the single-agent domain. In particular, obstacles remain in handling delayed global information when each agent learns a decentralized control policy based on local observations and messages from connected neighbors. This paper addresses delayed global information sharing by combining the delayed global information with latent imagination of farsighted states in differentiable communication. Our model allows an agent to imagine its future states and communicate them to its neighbors. The predictive message sent to the connected neighbors reduces the delay in global information. On networked multi-agent traffic control tasks, experimental results show that our model helps stabilize the training of each local agent and outperforms existing algorithms for networked system control.

1. INTRODUCTION

Networked system control (NSC) is extensively studied and widely applied, including connected vehicle control (Jin & Orosz, 2014), traffic signal control (Chu et al., 2020b), distributed sensing (Xu et al., 2016), and networked storage operation (Qin et al., 2015). In NSC, agents are connected via a communication network for a cooperative control objective. For example, in an adaptive traffic signal control system, each traffic light performs decentralized control based on its local observations and messages from connected neighbors. Although deep reinforcement learning has been successfully applied to some complex problems, such as Go (Silver et al., 2016) and Starcraft II (Vinyals et al., 2019), it is still not scalable to many real-world networked control problems. Multi-agent reinforcement learning (MARL) addresses the issue of scalability by performing decentralized control. Recent decentralized MARL methods rely on assumptions of global observations and local or global rewards (Zhang et al., 2018; 2019a; Qu et al., 2019; 2020b;a), which are reasonable in multi-agent gaming but not suitable for NSC. A practical solution is to allow each agent to perform decentralized control based on its local observations and messages from connected neighbors. Various communication-based methods have been proposed to stabilize training and improve observability, and communication is studied to enable agents to behave as a group rather than a collection of individuals (Sukhbaatar & Fergus, 2016; Chu et al., 2020a). Despite recent advances in neural communication (Sukhbaatar & Fergus, 2016; Foerster et al., 2016; Chu et al., 2020a), delayed global information sharing remains an open problem that arises widely in NSC applications. A communication protocol not only reflects the situation at hand but also guides policy optimization.
Recent deep neural models (Sukhbaatar & Fergus, 2016; Foerster et al., 2016; Hoshen, 2017) implement differentiable communication over available connections. However, in NSC tasks such as traffic signal control, each agent connects only to its neighbors, leading to a delay in receiving messages from distant agents in the system, and non-stationarity mainly arises from this partial observability (Chu et al., 2020a). Communication with delayed global information limits the learnability of RL, because agents can only use stale information and cannot leverage potential future information. Moreover, it is inefficient in environments that are sensitive to changes in agents' behaviors. It is therefore of great practical relevance to develop algorithms that can learn beyond communication with delayed information sharing. In this paper we introduce ImagComm, which learns communication by imagination for multi-agent reinforcement learning in NSC. We leverage a model of the agent's world to provide an estimate of farsighted information in latent space for communication. At each time step, the agent imagines its future states in an abstract space and conveys this information to its neighbors. Therefore, unlike previous works, our communication protocol conveys not only the current shared information but also imagined future information. It is applicable whenever communication changes frequently, e.g., when agents receive new messages at every time step. We summarize our main contributions as follows: (1) We introduce an imagination module that learns latent dynamics for communication in networked multi-agent system control. (2) We predict the future state of each local agent and allow each agent to convey the latent state to its neighbors as messages, which reduces the delay of global information.
(3) We demonstrate that leveraging predictive communication by imagination in latent space succeeds in networked system control. We explore this model on a range of NSC tasks, and our results demonstrate that our method consistently outperforms baselines.
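The core mechanism of contributions (1) and (2) can be sketched in a few lines: an agent encodes its local observation into a latent state, rolls a learned latent dynamics model forward to imagine future states, and sends the current plus imagined latents to its neighbors as the message. The sketch below is a toy illustration only; the linear encoder and dynamics (`W_enc`, `W_dyn`) and all names stand in for learned networks and are our assumptions, not the paper's architecture.

```python
import numpy as np

class ImaginingAgent:
    """Toy sketch of predictive communication by imagination
    (hypothetical names; not the paper's implementation)."""

    def __init__(self, obs_dim, latent_dim, horizon, seed=0):
        rng = np.random.default_rng(seed)
        # Random linear maps stand in for a learned encoder and latent dynamics.
        self.W_enc = rng.normal(scale=0.1, size=(latent_dim, obs_dim))
        self.W_dyn = rng.normal(scale=0.1, size=(latent_dim, latent_dim))
        self.horizon = horizon

    def encode(self, obs):
        # Map the local observation into an abstract latent state.
        return np.tanh(self.W_enc @ obs)

    def imagine(self, z):
        # Roll the latent dynamics forward for `horizon` imagined steps.
        trajectory = []
        for _ in range(self.horizon):
            z = np.tanh(self.W_dyn @ z)
            trajectory.append(z)
        return trajectory

    def build_message(self, obs):
        # Message = current latent plus imagined future latents,
        # so neighbors receive farsighted rather than stale information.
        z = self.encode(obs)
        return np.concatenate([z] + self.imagine(z))

agent = ImaginingAgent(obs_dim=4, latent_dim=3, horizon=2)
msg = agent.build_message(np.ones(4))
assert msg.shape == (9,)  # (1 current + 2 imagined) latents of size 3
```

In training, the encoder and dynamics would be learned end-to-end through the differentiable communication channel, but the message layout is the same.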

2. RELATED WORK

Networked system control (NSC) considers the problem where agents are connected via a communication network for a cooperative control objective, such as autonomous vehicle control (Jin & Orosz, 2014), adaptive traffic signal control (Chu et al., 2020b), and distributed sensing (Xu et al., 2016). Recently, reinforcement learning has become popular for NSC through decentralized control and communication among networked agents. Communication is an important component of multi-agent RL, compensating for the information loss of partial observations. Heuristic communication allows agents to share certain forms of information, such as policy fingerprints from other agents (Foerster et al., 2017) and averaged neighbor policies (Yang et al., 2018). Recently, end-to-end differentiable communication has become popular (Foerster et al., 2016; Sukhbaatar & Fergus, 2016; Chu et al., 2020a), since the communication channel is learned to optimize performance. Attention-based communication methods (Hoshen, 2017; Das et al., 2019; Singh et al., 2019) selectively send messages to chosen agents; however, they are not suitable for NSC, where communication is allowed only between connected neighbors. Our method adopts differentiable communication with end-to-end training. Compared to existing works, we introduce a new predictive communication module through learning latent dynamics. Learning latent dynamics has been studied for single-agent tasks, such as E2C (Watter et al., 2015), RCE (Banijamali et al., 2018), PlaNet (Hafner et al., 2019), and SOLAR (Zhang et al., 2019b). Lee et al. (2019) and Gregor et al. (2019) learn belief representations to accelerate model-free agents. World Models (Ha & Schmidhuber, 2018) learns latent dynamics in a two-stage process to evolve linear controllers in imagination. I2A (Racanière et al., 2017) hands imagined trajectories to a model-free policy based on a rollout encoder.
In contrast to these works, our work considers multi-agent tasks and learns predictive communication by imagination in latent space.

3. PRELIMINARIES

In the networked system control problem, we work with a networked system described by a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, where $i \in \mathcal{V}$ denotes the $i$-th agent and $ij \in \mathcal{E}$ denotes the communication link between agents $i$ and $j$. The corresponding networked (cooperative) multi-agent MDP is defined by a tuple $(\mathcal{G}, \{\mathcal{S}_i, \mathcal{A}_i\}_{i \in \mathcal{V}}, \{\mathcal{M}_{ij}\}_{ij \in \mathcal{E}}, p, \{r_i\}_{i \in \mathcal{V}})$. $\mathcal{S}_i$ and $\mathcal{A}_i$ are the local state space and action space of agent $i$. Let $\mathcal{S} := \cup_{i \in \mathcal{V}} \mathcal{S}_i$ and $\mathcal{A} := \cup_{i \in \mathcal{V}} \mathcal{A}_i$; the MDP transitions follow a stationary probability distribution $p : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$. The global reward is denoted by $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and defined as $r = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} r_i$, indicating that all local rewards are shared globally. In NSC, the system is decentralized and communication is limited to neighborhoods; $\mathcal{M}_{ij}$ denotes the message space for the communication model. That is, each agent $i$ observes $\tilde{s}_{i,t} := s_{i,t} \cup m_{\mathcal{N}_i i,t}$, where $s_{i,t} \in \mathcal{S}_i$ denotes the local state of agent $i$, $m_{\mathcal{N}_i i,t} := \{m_{ji,t}\}_{j \in \mathcal{N}_i}$, and $\mathcal{N}_i := \{j \in \mathcal{V} \mid ji \in \mathcal{E}\}$. Message $m_{ji,t} \in \mathcal{M}_{ji}$ denotes all the available information at an agent's neighbor. Each agent $i$ follows a decentralized policy $\pi_i : \tilde{\mathcal{S}}_i \times \mathcal{A}_i \to [0, 1]$.
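The augmented observation $\tilde{s}_{i,t}$ and the globally shared reward can be made concrete with a toy three-agent line graph. Everything below (the graph, state values, and function names) is an illustrative assumption for exposition, not code from the paper.

```python
import numpy as np

# Toy networked system: a line graph 0 - 1 - 2 (illustrative only).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
local_states = {0: np.array([0.1]), 1: np.array([0.5]), 2: np.array([0.9])}
# Here each message m_{ji} is simply neighbor j's local state.
messages = {(j, i): local_states[j] for i in neighbors for j in neighbors[i]}

def augmented_observation(i):
    """s~_{i,t}: agent i's local state joined with messages m_{ji,t}
    from its neighbors j in N_i."""
    return np.concatenate(
        [local_states[i]] + [messages[(j, i)] for j in neighbors[i]]
    )

def global_reward(local_rewards):
    """r = (1/|V|) * sum_i r_i: all local rewards are shared globally."""
    return sum(local_rewards.values()) / len(local_rewards)

obs1 = augmented_observation(1)  # agent 1 sees itself and both neighbors
r = global_reward({0: 1.0, 1: 2.0, 2: 3.0})
assert obs1.shape == (3,)
assert abs(r - 2.0) < 1e-9
```

Note that agent 0 never observes agent 2 directly; information from agent 2 can reach agent 0 only through agent 1 on the next step, which is exactly the delay the paper's predictive messages aim to reduce.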

