LEARNING PREDICTIVE COMMUNICATION BY IMAGINATION IN NETWORKED SYSTEM CONTROL

Anonymous authors
Paper under double-blind review

Abstract

Multi-agent control in networked systems is one of the biggest challenges in Reinforcement Learning (RL), and progress has been limited compared to the recent success of deep RL in single-agent domains. A central obstacle is delayed global information: each agent learns a decentralized control policy based only on its local observations and messages from connected neighbors, so information from distant agents arrives with delay. This paper addresses delayed global information sharing by combining the delayed global information with latent imagination of farsighted states in differentiable communication. Our model allows an agent to imagine its future states and communicate them to its neighbors; these predictive messages reduce the delay in global information. On networked multi-agent traffic control tasks, experimental results show that our model helps stabilize the training of each local agent and outperforms existing algorithms for networked system control.

1. INTRODUCTION

Networked system control (NSC) is extensively studied and widely applied, including connected vehicle control (Jin & Orosz, 2014), traffic signal control (Chu et al., 2020b), distributed sensing (Xu et al., 2016), and networked storage operation (Qin et al., 2015). In NSC, agents are connected via a communication network and pursue a cooperative control objective. For example, in an adaptive traffic signal control system, each traffic light performs decentralized control based on its local observations and messages from connected neighbors. Although deep reinforcement learning has been successfully applied to some complex problems, such as Go (Silver et al., 2016) and StarCraft II (Vinyals et al., 2019), it is still not scalable in many real-world networked control problems. Multi-agent reinforcement learning (MARL) addresses the issue of scalability by performing decentralized control. Recent decentralized MARL methods assume global observations and local or global rewards (Zhang et al., 2018; 2019a; Qu et al., 2019; 2020b;a), which are reasonable assumptions in multi-agent gaming but not in NSC. A practical alternative is to let each agent perform decentralized control based on its local observations and messages from its connected neighbors. Various communication-based methods have been proposed to stabilize training and improve observability, and communication enables agents to behave as a group rather than as a collection of individuals (Sukhbaatar & Fergus, 2016; Chu et al., 2020a). Despite this progress in differentiable communication (Sukhbaatar & Fergus, 2016; Foerster et al., 2016; Chu et al., 2020a), delayed global information sharing remains an open problem that arises widely in NSC applications. A communication protocol not only reflects the situation at hand but also guides policy optimization.
Recent deep neural models (Sukhbaatar & Fergus, 2016; Foerster et al., 2016; Hoshen, 2017) implement differentiable communication over the available connections. However, in NSC settings such as traffic signal control, each agent connects only to its neighbors, so messages from distant agents in the system arrive with delay, and non-stationarity mainly arises from this partial observability (Chu et al., 2020a). Communication with delayed global information limits the learnability of RL because agents can only act on stale information and cannot leverage potential future information. Moreover, it is inefficient in environments that are sensitive to changes in agent behaviour. It is therefore of great practical relevance to develop algorithms that can learn beyond communication with delayed information sharing.
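The idea of compensating for communication delay with imagined future states can be illustrated with a toy numerical sketch. This is not the paper's actual model; the linear dynamics matrix `A`, the one-step delay, and the `imagine` rollout are all hypothetical stand-ins for a learned latent dynamics model, chosen only to show why a sender that transmits a *predicted* state can hand its neighbor fresher information than one that transmits its current (soon-to-be-stale) state.

```python
import numpy as np

# Hypothetical linear latent dynamics for one agent: s_{t+1} = A @ s_t.
# In the paper's setting this would be a learned world model, not a fixed matrix.
A = np.array([[0.9, 0.1],
              [0.0, 0.95]])

def step(s):
    """Advance the latent state by one environment step."""
    return A @ s

def imagine(s, horizon):
    """Roll the dynamics model forward `horizon` steps without acting."""
    for _ in range(horizon):
        s = step(s)
    return s

# Assume messages between neighbors arrive with a one-step delay.
delay = 1
s = np.array([1.0, -1.0])

# Naive communication: the neighbor consumes a message that is `delay` steps old.
stale_msg = s
true_future = step(s)  # the sender's actual state when the message is consumed

# Predictive communication: the sender imagines `delay` steps ahead and
# transmits the predicted state, compensating for the transmission latency.
predictive_msg = imagine(s, delay)

err_stale = np.linalg.norm(true_future - stale_msg)
err_pred = np.linalg.norm(true_future - predictive_msg)
print(err_stale, err_pred)
```

With an exact dynamics model the predictive message matches the sender's future state perfectly (`err_pred` is zero), while the stale message does not; with a learned, imperfect model the prediction error would be nonzero but can still be smaller than the staleness error, which is the premise behind sending imagined states to neighbors.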

