LEARNING PREDICTIVE COMMUNICATION BY IMAGINATION IN NETWORKED SYSTEM CONTROL

Anonymous authors
Paper under double-blind review

Abstract

Multi-agent control in networked systems is one of the biggest challenges in Reinforcement Learning (RL), and success has been limited compared with recent deep RL in single-agent domains. A key obstacle is delayed global information: each agent learns a decentralized control policy based on local observations and messages from connected neighbors, so information from distant agents arrives with a delay. This paper addresses delayed global information sharing by combining the delayed global information with a latent imagination of farsighted states in differentiable communication. Our model allows an agent to imagine its future states and communicate them to its neighbors. The predictive message sent to the connected neighbors reduces the delay in global information. On tasks of networked multi-agent traffic control, experimental results show that our model helps stabilize the training of each local agent and outperforms existing algorithms for networked system control.

1. INTRODUCTION

Networked system control (NSC) is extensively studied and widely applied, including connected vehicle control (Jin & Orosz, 2014), traffic signal control (Chu et al., 2020b), distributed sensing (Xu et al., 2016), and networked storage operation (Qin et al., 2015). In NSC, agents are connected via a communication network for a cooperative control objective. For example, in an adaptive traffic signal control system, each traffic light performs decentralized control based on its local observations and messages from connected neighbors. Although deep reinforcement learning has been successfully applied to complex problems such as Go (Silver et al., 2016) and StarCraft II (Vinyals et al., 2019), it is still not scalable to many real-world networked control problems. Multi-agent reinforcement learning (MARL) addresses the issue of scalability by performing decentralized control. Recent decentralized MARL methods rely on assumptions of global observations and local or global rewards (Zhang et al., 2018; 2019a; Qu et al., 2019; 2020b;a), which are reasonable in multi-agent gaming but not in NSC. A practical alternative is to let each agent perform decentralized control based on its local observations and messages from connected neighbors. Various communication-based methods have been proposed to stabilize training and improve observability, and communication enables agents to behave as a group rather than a collection of individuals (Sukhbaatar & Fergus, 2016; Chu et al., 2020a). Despite recent advances in neural communication (Sukhbaatar & Fergus, 2016; Foerster et al., 2016; Chu et al., 2020a), delayed global information sharing remains an open problem that arises in many NSC applications. A communication protocol not only reflects the situation at hand but also guides policy optimization.
Recent deep neural models (Sukhbaatar & Fergus, 2016; Foerster et al., 2016; Hoshen, 2017) implement differentiable communication over the available connections. However, in NSC settings such as traffic signal control, each agent connects only to its neighbors, so messages from distant agents arrive with a delay, and non-stationarity mainly comes from this partial observability (Chu et al., 2020a). Communication with delayed global information limits the learnability of RL, because agents can use only the delayed information and cannot leverage potential future information. Moreover, it is inefficient in environments that are sensitive to changes in agents' behaviours. It is therefore of great practical relevance to develop algorithms that can learn beyond communication with delayed information sharing. In this paper we introduce ImagComm, which learns communication by imagination for multi-agent reinforcement learning in NSC. We leverage a model of the agent's world to provide an estimate of farsighted information in latent space for communication. At each time step, an agent imagines its future states in an abstract space and conveys this information to its neighbors. Unlike previous works, our communication protocol therefore conveys not only the current shared information but also the imagined shared information. It is applicable whenever communication changes frequently, e.g., when agents may receive new communication information at every time step. We summarize our main contributions as follows: (1) We introduce an imagination module that learns latent dynamics for communication in networked multi-agent system control. (2) We predict the future state of each local agent and allow each agent to convey the latent state to its neighbors as messages, which reduces the delay of global information.
(3) We demonstrate that leveraging predictive communication by imagination in latent space succeeds in networked system control. We explore this model on a range of NSC tasks, and our results demonstrate that our method consistently outperforms baselines on these tasks.

2. RELATED WORK

Networked system control (NSC) considers the problem where agents are connected via a communication network for a cooperative control objective, such as autonomous vehicle control (Jin & Orosz, 2014), adaptive traffic signal control (Chu et al., 2020b), and distributed sensing (Xu et al., 2016). Recently, reinforcement learning has become popular for NSC through decentralized control and communication among networked agents. Communication is an important component of multi-agent RL that compensates for the information loss of partial observations. Heuristic communication allows agents to share certain fixed forms of information, such as policy fingerprints from other agents (Foerster et al., 2017) and averaged neighbor policies (Yang et al., 2018). Recently, end-to-end differentiable communication has become popular (Foerster et al., 2016; Sukhbaatar & Fergus, 2016; Chu et al., 2020a), since the communication channel is learned to optimize performance. Attention-based communication (Hoshen, 2017; Das et al., 2019; Singh et al., 2019) selectively sends messages to chosen agents; however, this is not suitable for NSC, where communication is allowed only between connected neighbors. Our method adopts differentiable communication with end-to-end training. Compared to existing works, we introduce a new predictive communication module through learning latent dynamics. Learning latent dynamics has been studied for single-agent tasks, e.g., E2C (Watter et al., 2015), RCE (Banijamali et al., 2018), PlaNet (Hafner et al., 2019), and SOLAR (Zhang et al., 2019b). Lee et al. (2019) and Gregor et al. (2019) learn belief representations to accelerate model-free agents. World Models (Ha & Schmidhuber, 2018) learn latent dynamics in a two-stage process to evolve linear controllers in imagination. I2A (Racanière et al., 2017) hands imagined trajectories to a model-free policy through a rollout encoder.
In contrast to these works, our work considers multi-agent tasks and learns predictive communication by imagination in latent space.

3. PRELIMINARIES

In the networked system control problem, we work with a networked system described by a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, where $i \in \mathcal{V}$ denotes the $i$-th agent and $ij \in \mathcal{E}$ denotes the communication link between agents $i$ and $j$. The corresponding networked (cooperative) multi-agent MDP is defined by a tuple $(\mathcal{G}, \{\mathcal{S}_i, \mathcal{A}_i\}_{i\in\mathcal{V}}, \{\mathcal{M}_{ij}\}_{ij\in\mathcal{E}}, p, \{r_i\}_{i\in\mathcal{V}})$. $\mathcal{S}_i$ and $\mathcal{A}_i$ are the local state space and action space of agent $i$. Let $\mathcal{S} := \cup_{i\in\mathcal{V}} \mathcal{S}_i$ and $\mathcal{A} := \cup_{i\in\mathcal{V}} \mathcal{A}_i$; the MDP transitions follow a stationary probability distribution $p : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$. The global reward is denoted by $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and defined as $r = \frac{1}{|\mathcal{V}|} \sum_{i\in\mathcal{V}} r_i$, indicating that all local rewards are shared globally. Communication is limited to neighborhoods, and $\mathcal{M}$ denotes the message space of the communication model. That is, each agent $i$ observes $\tilde{s}_{i,t} := s_{i,t} \cup m_{\mathcal{N}_i i,t}$, where $s_{i,t} \in \mathcal{S}_i$ denotes the local state of agent $i$, $m_{\mathcal{N}_i i,t} := \{m_{ji,t}\}_{j\in\mathcal{N}_i}$, and $\mathcal{N}_i := \{j \in \mathcal{V} \mid ji \in \mathcal{E}\}$. Message $m_{ji,t} \in \mathcal{M}_{ji}$ denotes all the information available at an agent's neighbor. Each agent $i$ follows a decentralized policy $\pi_i : \tilde{\mathcal{S}}_i \times \mathcal{A}_i \to [0, 1]$ to choose its own action $a_{i,t} \sim \pi_i(\cdot \mid \tilde{s}_{i,t})$ at time $t$. The objective is to maximize $\mathbb{E}_\pi[R_0]$, where $R_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$ and $\gamma$ is a discount factor.

Figure 1. Left: an illustration of a networked system, an adaptive cruise control (ACC) system with vehicle-to-vehicle (V2V) communication. With this information sharing, agent $i$ knows $s_{i,t} \cup s_{i-1,t} \cup s_{i-2,t-1} \cup s_{i-3,t-2}$ at time $t$, i.e., the delayed global observations; based on world models, $b_{i,t}$ is learned from its neighbors' future states, which compensates for the delayed information. Right: the policy representation, which includes messages of neighbors generated by the imagination module.
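The delay structure above depends only on the hop distance between agents on the graph $\mathcal{G}$. A minimal sketch (with hypothetical helper names, not from the paper) computes the hop distance $d_{ij}$ by breadth-first search, so that agent $i$ receives $s_j$ with delay $d_{ij} - 1$:

```python
from collections import deque

def hop_distances(n_agents, edges, source):
    """BFS hop distance d_ij from `source` to every other agent.

    `edges` is a list of undirected pairs (i, j) taken from the graph G(V, E).
    """
    adj = {i: [] for i in range(n_agents)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# A 4-agent chain 0-1-2-3, as in the ACC illustration of Figure 1:
d = hop_distances(4, [(0, 1), (1, 2), (2, 3)], source=0)
# d[3] == 3, so agent 0 sees s_3 with delay d_03 - 1 = 2 steps.
```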

4. METHODOLOGY

Our goal is to learn predictive communication conditioned on a particular observation or environment state. We start by introducing the networked MDP with neighborhood communication and the delayed-information issue in communication. We then describe ImagComm, which utilizes predictive communication: we learn the agent's world model to provide additional context for communication.

4.1. DELAYED COMMUNICATION IN NETWORKED SYSTEM CONTROL

Following the NSC setting of Chu et al. (2020a), we assume that all messages sent from agent $i$ are identical and denote $m_{ij} = m_i, \forall j \in \mathcal{N}_i$. The message explicitly includes the state $s$, the policy $\pi$, and the agent belief $h$, i.e., $m_{i,t} = s_{i,t} \cup \pi_{i,t-1} \cup h_{i,t-1}$, where $\pi_{i,t-1}$ is the probability distribution over discrete actions. Thus, for each agent in NSC, $\tilde{s}_{i,t} := s_{\mathcal{V}_i,t} \cup \pi_{\mathcal{N}_i,t-1} \cup h_{\mathcal{N}_i,t-1}$. Note that the communication phase is prior to the decision, so only $h_{i,t-1}$ and $\pi_{i,t-1}$ are available. This protocol can be easily extended to multi-pass communication. We assume that any information agent $j$ knows at time $t$ can be included in $m_{ji,t}$, with $m_{ji,t} = s_{j,t} \cup \{m_{kj,t-1}\}_{k\in\mathcal{N}_j}$. Then $\tilde{s}_{i,t} := s_{i,t} \cup \{s_{j,t+1-d_{ij}}\}_{j\in\mathcal{V}\setminus\{i\}}$, which includes the delayed global observations, where $d_{ij}$ is the distance between $i$ and $j$, i.e., the number of hops between the two agents on the graph of the networked system. We illustrate the delayed information in Figure 1. A more rigorous analysis of this conclusion can be found in Appendix A.
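As a sanity check on the claim that $\tilde{s}_{i,t}$ contains $s_{j,t+1-d_{ij}}$, the one-hop relay $m_{ji,t} = s_{j,t} \cup \{m_{kj,t-1}\}$ can be simulated directly. The sketch below (illustrative bookkeeping, not the paper's implementation) tracks, for each pair $(i, j)$, the newest time index of $s_j$ available at agent $i$:

```python
def newest_known(n_agents, edges, T):
    """Simulate the one-hop relay and return known[i][j], the newest
    time index of s_j available at agent i after T steps. Once the
    relay has run long enough, known[i][j] == t + 1 - d_ij, i.e.
    agent i holds s_j with delay d_ij - 1."""
    nbrs = {i: set() for i in range(n_agents)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    NONE = -10**9  # "nothing received yet"
    known = [[NONE] * n_agents for _ in range(n_agents)]
    for t in range(T):
        prev = [row[:] for row in known]
        for i in range(n_agents):
            known[i][i] = t                       # local state is current
            for k in nbrs[i]:
                known[i][k] = max(known[i][k], t) # neighbor sends s_{k,t}
                for j in range(n_agents):
                    # ... plus everything k knew at time t-1
                    known[i][j] = max(known[i][j], prev[k][j])
    return known

# 4-agent chain 0-1-2-3 run for 6 steps (final t = 5):
K = newest_known(4, [(0, 1), (1, 2), (2, 3)], T=6)
# K[0] == [5, 5, 4, 3]: agent 0 holds s_{j, t+1-d_0j}, i.e. delay d_0j - 1.
```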

4.2. PREDICTIVE COMMUNICATION

To reduce the delay of global information, we consider a forward model for predicting the future states of each agent $j$; then $s_{j,t+1}$ can be encoded as a message for communication, and agent $i$ can benefit from this information. Let $\hat{s}_{i,t}$ be the abstract state of the $i$-th agent, let $W_i \in \mathcal{W}_i$ be a world model of the transition dynamics from $\hat{s}_{i,t}$ to the abstract state $\hat{s}_{i,t+1}$, and let $b_{i,t} := \cup_{\tau=1}^{k} \hat{s}_{i,t+\tau}$ denote the predictive message. We aim to build a policy based on delayed global observations and predictive messages. The value of policy $\pi_i$ under the model $W_i$ is defined as

$$V^{\pi,W}_i(s, a_{\mathcal{N}_i}) = \mathbb{E}_{a_{i,t} \sim \pi_i(\cdot \mid \tilde{s}_{i,t}, b_{i,t})}\left[ R^\pi_{i,t} \mid \tilde{s}_t = s,\ a_{\mathcal{N}_i,t} = a_{\mathcal{N}_i} \right]. \qquad (1)$$

Learning based on (1) has the benefit of reduced delay in global information compared to learning without $b_{i,t}$; this is formally presented in Proposition 1, with proofs provided in Appendix A.

Proposition 1. ImagComm can reduce the delay of global information by incorporating a predictive model in the communication protocol.

We are now interested in constructing an abstract model $\widehat{W}_i(\cdot; \varphi)$ to approximate $W_i$, operating on abstract states. Let $\hat{s}_{i,t+1}$ be the new abstract state sampled as $\hat{s}_{i,t+1} \sim \widehat{W}_i(\hat{s}_{i,t})$. We want to minimize $\|\hat{s}_{i,t+1} - g_i(s_{i,t+1}; \psi)\|$, where $g_i(\cdot; \psi)$ is an embedding of raw states. Let $V^{\pi,\widehat{W}}_i$ be the value function of the policy on the estimated model $\widehat{W}_i$. Towards optimizing $V^{\pi,W^*}_i(s, a_{\mathcal{N}_i})$, we build the following lower bound and maximize it iteratively:

$$V^{\pi,W^*}_i(s, a_{\mathcal{N}_i}) \geq V^{\pi,\widehat{W}}_i(s, a_{\mathcal{N}_i}) - D(\widehat{W}, \pi), \qquad (2)$$

where $D(\widehat{W}, \pi) \in \mathbb{R}$ bounds the discrepancy between $V^{\pi,W}_i$ and $V^{\pi,\widehat{W}}_i$. In practice, $D(\widehat{W}_i, \pi_i)$ is defined as

$$D^{\pi^{\mathrm{ref}}}_i(\widehat{W}_i, \pi_i) = \alpha \cdot \mathbb{E}_{s_0,\dots,s_t \sim \pi^{\mathrm{ref}}_i}\left[ \left\| \widehat{W}_i(\hat{s}_{i,t}) - g_i(s_{i,t+1}) \right\| \right], \qquad (3)$$

where $\alpha$ is a hyperparameter and $\pi^{\mathrm{ref}}_i$ is the policy used for sampling. For each agent, we solve the following problem:

$$\pi^{k+1}, \widehat{W}^{k+1} = \operatorname*{argmax}_{\pi\in\Pi,\ W\in\mathcal{W}}\ V^{\pi,W}_i - D_{\pi^k_i,\delta}(W, \pi). \qquad (4)$$
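The predictive message $b_{i,t}$ is just a $k$-step rollout of the latent model from the current abstract state. A minimal sketch, with a toy linear map standing in for the learned network $\widehat{W}_i$ (an assumption for illustration only):

```python
import numpy as np

def predictive_message(s_hat, world_model, k=3):
    """Roll the latent model forward k steps from the current abstract
    state and stack the imagined states into the predictive message
    b_{i,t} = (s_hat_{t+1}, ..., s_hat_{t+k}).

    `world_model` is any callable s_hat' = W(s_hat); here a linear map
    stands in for the learned two-layer network of the paper."""
    imagined = []
    for _ in range(k):
        s_hat = world_model(s_hat)
        imagined.append(s_hat)
    return np.concatenate(imagined)

# toy "world model" in a 4-d latent space: contraction by 0.9 per step
A = 0.9 * np.eye(4)
b = predictive_message(np.ones(4), lambda s: A @ s, k=3)
# b has shape (12,): the stacked 3-step imagination
```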
With the predictive imagination module, each agent utilizes the estimated predictive state information to learn its belief and optimize the control performance together with all other agents. Following the analysis in (Luo et al., 2018), we can show that ImagComm leads to monotonic improvement in policy iteration; proofs are deferred to Appendix A.

Proposition 2. Suppose that $W^*_i \in \mathcal{W}_i$ is the optimal model and the optimization problem in equation (4) is solvable at each iteration. Solving (4) produces a sequence of policies $\pi^0_i, \dots, \pi^T_i$ with monotonically increasing values: $V^{\pi^0,W^*}_i \leq V^{\pi^1,W^*}_i \leq \cdots \leq V^{\pi^T,W^*}_i$.

A conclusion following directly from Proposition 2 is that solving (4) converges to a local maximum. ImagComm builds a world model and predicts farsighted states with an imagination module to eliminate the delay in global information and thereby reduce the negative influence of partial observability, since the future information after time $t$ compensates for some of the delayed information at time $t$. Next, we present the differentiable neural communication with imagination.

4.3. DIFFERENTIABLE NEURAL COMMUNICATION

In our approach, an agent performs the following operations throughout its lifetime: learning the latent dynamics model from the dataset of past experience to predict its own future states, encoding the imagined features into messages, and learning differentiable communication models together with the predictive information. The predictive model lets us predict states ahead in the latent space without having to observe them. Unlike previous works that use $h_{i,t} = g_{\mathcal{V}_i}(h_{i,t-1}, s_{i,t}, m_{\mathcal{N}_i,t})$, we propose to learn communication with imagination (ImagComm), as shown in Figure 1, adding the imagined delayed information to communication:

$$h_{i,t} = g_{\mathcal{V}_i}(h_{i,t-1}, s_{i,t}, m_{\mathcal{N}_i,t}, b_{\mathcal{N}_i,t}), \qquad (5)$$

where $b_{\mathcal{N}_i,t}$ indicates the predictive message module and $m_{\mathcal{N}_i,t}$ represents the standard communication module, the same as in previous communication work. $g_{\mathcal{V}_i}$ is a differentiable function that extracts information for the agent's belief; for example, $g_{\mathcal{V}_i}$ can be an LSTM (Hochreiter & Schmidhuber, 1997). Compared to (Chu et al., 2020a), ImagComm uses an imagination module to provide an augmented observation for agents and to reduce delays in global information. Compared to model-based approaches (Luo et al., 2018; Janner et al., 2019), the differences are two-fold: i) instead of learning a model for a single-agent MDP, each agent learns a decentralized predictive model locally; ii) instead of aiming to improve sampling efficiency, we aim to augment the message for communication, thus reducing the delay in global information from each agent's view; this can be viewed as a combination of model-based and model-free aspects.
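The belief update in Eq. (5) can be sketched as follows; a plain tanh-RNN cell stands in for the LSTM used in the paper (a simplifying assumption), and the inputs are concatenated as in ImagComm:

```python
import numpy as np

rng = np.random.default_rng(0)

def belief_update(h_prev, s, m, b, params):
    """One step of Eq. (5): h_{i,t} = g(h_{i,t-1}, s_{i,t}, m_{N_i,t}, b_{N_i,t}).

    A tanh-RNN cell replaces the paper's LSTM for brevity; W_x, W_h and
    bias are placeholder parameter names, not the paper's notation."""
    x = np.concatenate([s, m, b])       # local state, messages, imagination
    W_x, W_h, bias = params
    return np.tanh(W_x @ x + W_h @ h_prev + bias)

dim_h, dim_x = 8, 4 + 6 + 6             # |s| + |m| + |b|
params = (0.1 * rng.standard_normal((dim_h, dim_x)),
          0.1 * rng.standard_normal((dim_h, dim_h)),
          np.zeros(dim_h))
h = belief_update(np.zeros(dim_h), np.ones(4), np.ones(6), np.ones(6), params)
# h has shape (8,) and all entries lie in (-1, 1)
```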

4.4. PRACTICAL IMPLEMENTATIONS

For simplicity, we assume each agent is based on an A2C (Advantage Actor-Critic) model. Let $\{\pi_{\theta_i}\}_{i\in\mathcal{V}}$ and $\{V_{\omega_i}\}_{i\in\mathcal{V}}$ be the decentralized actor-critics, and let $\{(s_{i,\tau}, m_{\mathcal{N}_i i,\tau}, a_{i,\tau}, r_{i,\tau}, b_{\mathcal{N}_i i,\tau})\}_{i\in\mathcal{V},\tau\in B}$ be the on-policy minibatch from networked MDPs under stationary policies $\{\pi_{\theta_i}\}_{i\in\mathcal{V}}$. For each agent with belief $h_{i,t}$, the actor and critic become $\pi_{\theta_i}(h_i)$ and $V_{\omega_i}(h_i, a_{\mathcal{N}_i})$, fitting the optimal policy $\pi^*_i$ and value function $V^{\pi_i}$. ImagComm has three components based on (5). $m_{\mathcal{N}_i,t}$ is the message passed to $h_{i,t}$, with $m_{\mathcal{N}_i,t} = s_{\mathcal{N}_i,t} \cup \pi_{\mathcal{N}_i,t-1} \cup h_{\mathcal{N}_i,t-1}$. For the predictive message $b_{i,t}$, $\hat{s}_{i,t} = f_i(s_{\mathcal{V}_i,t}, h_{i,t-1}; \phi_i)$ and $\hat{s}_{i,t+1} = \widehat{W}_i(\hat{s}_{i,t})$. Let $\Phi_i = \{\phi_i, \psi_i, \varphi_i\}$ denote the parameters for abstract state embedding, raw state embedding, and the imagination module of agent $i$. Each actor and critic is then updated with the losses

$$\mathcal{L}(\theta_i) = \frac{1}{|B|} \sum_{\tau\in B} \left( -\log \pi_{\theta_i}(a_{i,\tau} \mid h_{i,\tau})\, \hat{A}^\pi_{i,\tau} + \beta \mathcal{H}\left[\pi_{\theta_i}(a_i \mid h_{i,\tau})\right] \right), \qquad (6a)$$

$$\mathcal{L}(\omega_i) = \frac{1}{|B|} \sum_{\tau\in B} \left( \hat{R}^\pi_{i,\tau} - V_{\omega_i}(h_{i,\tau}, a_{\mathcal{N}_i,\tau}) \right)^2, \qquad (6b)$$

$$\mathcal{L}(\Phi_i) = \frac{\alpha}{|B|} \sum_{\tau\in B} \left\| \widehat{W}_i(f_i(s_{\mathcal{V}_i,\tau}, h_{i,\tau-1})) - g_i(s_{i,\tau+1}) \right\|, \qquad (6c)$$

where $\hat{R}^\pi_{i,\tau} = \sum_{\tau'=\tau}^{\tau_B - 1} \gamma^{\tau'-\tau} r_{i,\tau'} + \gamma^{\tau_B-\tau} v_{i,\tau_B}$ is the target action-value, $v_{i,\tau} = V_{\omega_i^-}(s_{i,\tau}, a_{\mathcal{N}_i,\tau})$ is the state-value baseline, $\hat{A}^\pi_{i,\tau} = \hat{R}^\pi_{i,\tau} - v_{i,\tau}$ is the advantage, and $\beta$ is the entropy-loss hyperparameter. In the implementation of Eq. (6c), we adopt the root-mean-square-error (RMSE) loss. The overall procedure is provided in Algorithm 1.

Algorithm 1 Imagination-based policy optimization
1: Initialize policies $\{\pi^0_i\}_{i\in\mathcal{V}}$, predictive models $\{\widehat{W}^0_i\}_{i\in\mathcal{V}}$, and an empty minibatch $D_B$.
2: for $k = 0, 1, \dots, T$ do
3: &nbsp;&nbsp;Collect a batch of data with $\{\pi^k_i\}_{i\in\mathcal{V}}$ in the real environment: $D_B = \{(s_{i,\tau}, m_{\mathcal{N}_i i,\tau}, a_{i,\tau}, r_{i,\tau}, b_{\mathcal{N}_i i,\tau})\}_{i\in\mathcal{V},\tau\in B}$.
4: &nbsp;&nbsp;Update policies $\{\pi^{k+1}_i\}_{i\in\mathcal{V}}$ under imagination by solving (6a).
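The target $\hat{R}^\pi_{i,\tau}$ used in the critic and advantage above is a standard discounted n-step return with a bootstrap value; a minimal sketch (function name is ours, not the paper's):

```python
import numpy as np

def nstep_targets(rewards, bootstrap_value, gamma=0.99):
    """Backward recursion for the n-step targets
    R_hat_tau = sum_{t'=tau}^{tau_B - 1} gamma^{t'-tau} r_{t'} + gamma^{tau_B - tau} v_{tau_B},
    matching the target action-value used in losses (6a)-(6b)."""
    targets = np.empty(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets

R = nstep_targets(np.array([1.0, 1.0, 1.0]), bootstrap_value=0.0, gamma=0.5)
# R == [1.75, 1.5, 1.0]; the advantage is then A = R - v for a baseline v.
```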

5. NUMERICAL EXPERIMENTS

We evaluate ImagComm on several challenging environments in networked system control and compare it to current state-of-the-art algorithms for communication.

5.1. ENVIRONMENTS

We use four existing simulation environments: ATSC Grid, ATSC Monaco, Cooperative Adaptive Cruise Control (CACC) Catch-up, and CACC Slow-down (Chu et al., 2020b). The ATSC environments are developed based on SUMO (Krajzewicz et al., 2012).

5.1.1. ADAPTIVE TRAFFIC SIGNAL CONTROL

ATSC Grid simulates a 5 × 5 synthetic traffic grid, as shown in Figure 2a, while ATSC Monaco simulates a real-world 28-intersection traffic network from Monaco city, as shown in Figure 2b. In the homogeneous scenario, ATSC Grid, all agents have the same action space consisting of five pre-defined signal phases, whereas in the heterogeneous scenario, ATSC Monaco, agents have a variety of action spaces. For both scenarios, the objective of ATSC is to adaptively control traffic lights at the intersections to minimize traffic congestion based on real-time road-traffic measurements. The local state is defined as $s_{i,t} = \{\mathrm{wait}_t[l], \mathrm{wave}_t[l]\}_{ji\in\mathcal{E}, l\in L_{ji}}$, where $l$ indexes the incoming lanes of intersection $i$; $\mathrm{wait}[\cdot]$ measures the cumulative delay of the first vehicle and $\mathrm{wave}[\cdot]$ measures the total number of approaching vehicles along each incoming lane within 50 m of the intersection. The reward of each agent is defined as $r_{i,t} = -\sum_{ji\in\mathcal{E}, l\in L_{ji}} \mathrm{queue}_{t+\Delta t}[l]$, where $\mathrm{queue}[\cdot]$ denotes the number of queuing vehicles on an approaching lane, as measured by induction-loop detectors (ILDs).
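The ATSC reward above is simply the negated sum of queue counts over an intersection's approach lanes; a one-line sketch (hypothetical helper name):

```python
def atsc_reward(queues):
    """r_{i,t} = -sum_l queue_{t+dt}[l] over an intersection's approach
    lanes, with queue counts as measured by induction-loop detectors."""
    return -sum(queues)

r = atsc_reward([3, 0, 5, 2])   # four incoming lanes
# r == -10
```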


5.1.2. COOPERATIVE ADAPTIVE CRUISE CONTROL

The objective of CACC is to adaptively coordinate a platoon of vehicles to minimize the car-following headway and speed perturbations based on real-time vehicle-to-vehicle communication. Here we implement two CACC scenarios: "Catch-up" and "Slow-down", with physical vehicle dynamics. To save space, the setup details are deferred to C.4.1.

The CACC Catch-up scenario simulates a string of 8 vehicles for 60 s with a 0.1 s control interval, where the target speed is $v^*_t = 15$ m/s and the initial headway satisfies $h_{1,0} > h_{i,0}, \forall i \neq 1$. CACC Slow-down also simulates a string of 8 vehicles for 60 s with a 0.1 s control interval, where the initial headway is $h_{i,0} = h^*$ and the target speed $v^*_t$ linearly decreases to 15 m/s during the first 30 s and then stays constant. In both scenarios, the objective is to adaptively coordinate the platoon of vehicles to minimize the car-following headway and speed perturbations based on real-time vehicle-to-vehicle communication.
Each vehicle observes and shares its headway $h$, velocity $v$, and acceleration $a$ with neighbors within one hop. Models are trained to recommend appropriate hyperparameters $(\alpha^\circ, \beta^\circ)$ for each OVM controller (Bando et al., 1995), selected from four levels $\{(0,0), (0.5,0), (0,0.5), (0.5,0.5)\}$. Rewards are designed as a cost function. Assuming the target headway and velocity profile are $h^* = 20$ m and $v^*_t$, respectively, the cost of each agent is $(h_{i,t} - h^*)^2 + (v_{i,t} - v^*_t)^2 + 0.1 u^2_{i,t}$. Whenever a collision happens ($h_{i,t} < 1$ m), a large penalty of 1000 is assigned to each agent and the state becomes absorbing. An additional cost $5\,\big((2h_s - h_{i,t})_+\big)^2$, with $h_s$ the safe headway, is imposed in training for potential collisions.
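The per-agent CACC cost above can be sketched directly; the function name and argument layout are ours, and the training-time safety term is omitted for brevity:

```python
def cacc_cost(h, v, u, v_target, h_target=20.0, h_collision=1.0):
    """Per-agent CACC cost (h - h*)^2 + (v - v*)^2 + 0.1 u^2, with the
    collision penalty of 1000 when the headway drops below 1 m
    (after which the state becomes absorbing)."""
    if h < h_collision:
        return 1000.0
    return (h - h_target) ** 2 + (v - v_target) ** 2 + 0.1 * u ** 2

c = cacc_cost(h=21.0, v=14.0, u=0.5, v_target=15.0)
# c == 1.0 + 1.0 + 0.025 == 2.025
```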

5.2. BASELINES AND SETUP

All baselines are implemented based on the A2C agent (Mnih et al., 2016) following the methods in Eqs. (6a)-(6b). The baselines include one non-communicative policy, IA2C (Mnih et al., 2016), and three communicative policies, NeurComm (Chu et al., 2020a), CommNet (Sukhbaatar & Fergus, 2016), and DIAL (Foerster et al., 2016); details are provided in Appendix B. For ImagComm, we found that two-hop information already leads to compelling performance, so we use one-step imagination. We use $h_{i,t} = \mathrm{LSTM}(h_{i,t-1}, \mathrm{concat}(\mathrm{relu}(s_{\mathcal{V}_i,t}), \mathrm{relu}(\pi_{\mathcal{N}_i,t-1}), \mathrm{relu}(h_{\mathcal{N}_i,t-1}), \mathrm{relu}(b_{\mathcal{N}_i,t})))$ and $\hat{s}_{i,t+1} = \tanh(\mathrm{concat}(\mathrm{relu}(s_{\mathcal{V}_i,t}), h_{i,t-1}))$. Then $h_{i,t}$ is fed into two fully-connected neural networks producing the policy and value separately. For the state encoding and $g_i(\cdot)$, we use one fully-connected layer with tanh activation; for message extraction $g_{\mathcal{V}_i}$, we use an LSTM layer. All layers have 64 hidden units, and the imagination module stacks two fully-connected layers. Other settings include an actor learning rate of $5 \times 10^{-4}$ and a critic learning rate of $2.5 \times 10^{-4}$. Each method is trained over 1M steps under an actor-critic framework. In ATSC, the entropy coefficient is $\beta = 0.01$ and the batch size is $|B| = 120$; in CACC, $\beta = 0.05$ and $|B| = 60$. We use a different random seed to initialize each environment without loss of generality. As different initial seeds may cause fluctuation, we smooth the learning curves using moving averages with a window size of 100 episodes, following NeurComm (Chu et al., 2020a).
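The imagination path of the architecture above (state-and-belief encoding followed by a two-layer model) can be sketched in a few lines; weight names W1 and W2 are placeholders, not the paper's notation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def imagination_forward(s, h_prev, W1, W2):
    """Sketch of the imagination path: encode the raw state and previous
    belief into an abstract state s_hat = tanh(concat(relu(s), h_prev)),
    then a two-layer MLP (standing in for the stacked fully-connected
    imagination module) maps it to the imagined next abstract state."""
    s_hat = np.tanh(np.concatenate([relu(s), h_prev]))
    return np.tanh(W2 @ relu(W1 @ s_hat))

rng = np.random.default_rng(1)
W1 = 0.1 * rng.standard_normal((64, 12))   # 64 hidden units, as in the paper
W2 = 0.1 * rng.standard_normal((8, 64))
s_next_hat = imagination_forward(rng.standard_normal(4), np.zeros(8), W1, W2)
# s_next_hat has shape (8,): the imagined next abstract state
```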

5.3. TRAINING RESULTS

Figure 3 compares the learning curves of different methods on the four environments. The results show that ImagComm overall performs better than the others. In both ATSC environments, our model learns quickly with good stability and performs well in the end. In CACC, the standard deviation of episode returns is high due to the large collision penalty. In ATSC Grid, CommNet and DIAL gain little performance improvement even after 0.4M training steps, while ImagComm and NeurComm learn faster and end with lower deviation. ImagComm learns more slowly in the initial stage of training, owing to the learning of the imagination module, but it quickly outperforms NeurComm after 0.5M steps, showing the benefit of incorporating predictive communication. In ATSC Monaco, DIAL and NeurComm learn fast before 0.3M steps, but NeurComm does not improve much after that, and ImagComm outperforms all baselines after 0.4M steps. In the CACC environments, ImagComm works better and more stably than the others, with a better final performance, showing that predictive communication is useful for these tasks. Though IA2C sometimes works well, it is very unstable, which demonstrates the effectiveness of communication. NeurComm and CommNet also work stably in both environments, but, suffering from delayed communication in such real-time scenarios, they perform worse than ImagComm.

Hyperparameter study. We investigate the impact of the coefficient $\alpha$ of $D(\widehat{W}, \pi)$ in ImagComm by comparing the learning curves among $\{0.01, 0.1, 1, 10, 100\}$ on the ATSC and CACC environments in Figure 4. In the ATSC environments, $\alpha = 1$ and $\alpha = 0.1$ work best for ATSC Grid and Monaco, respectively. In the CACC Catch-up scenario, different $\alpha$ values yield similar results, but in CACC Slow-down, $\alpha = 1$ performs best.

5.4. EXECUTION PERFORMANCE

Additionally, we investigate the execution results of different methods. We use two further metrics, average queue length and average intersection delay, to better understand the controllers' execution impact. Average intersection delay is the mean waiting time of all vehicles, which is another congestion metric. Figures 5a and 5b compare the average queue length of different trained models in one episode. As shown in the figures, ImagComm achieves the best performance in both scenarios. In ATSC Grid, IA2C and CommNet both fail to reduce traffic pressure and the congestion level remains high; in contrast, ImagComm helps agents communicate with each other effectively, so the queue length increases more slowly and remains low in the end. Figures 5c and 5d compare the average intersection delay of different trained models in one episode. In Figure 5c, all four communicative policies outperform IA2C dramatically, which demonstrates the effectiveness of communication. However, for ATSC Monaco the results differ: in Figure 5d, IA2C achieves the lowest average intersection delay while the communicative policies perform worse. This can be explained by the emphasis on queue length, as only queue length is considered in the reward functions; intersection delay is not explicitly included in the rewards, so communicative models may tend to block vehicles in short queues. Although this inclination helps

6. CONCLUSIONS

In this work, we study the delayed communication problem for decentralized MARL in networked system control. We introduce an imagination module that predicts farsighted information for predictive communication. ImagComm combines the delayed global information with predictive state information and performs end-to-end training of the neural communication and imagination modules to optimize control performance in NSC. Extensive empirical studies demonstrate that by leveraging world models to learn the latent delayed information for communication, our method achieves compelling performance gains on challenging traffic signal control and adaptive cruise control tasks. We hope that our work provides inspiration for research on model-based learning for (networked) multi-agent systems.

A PROOFS

Chu et al. (2020a) have shown that the communication protocol allows the local agent to utilize the delayed global information. We cite the proof here to stay self-contained.

Lemma 1. (Chu et al., 2020a) By communicating through $h_{i,t} = g_{\mathcal{V}_i}(h_{i,t-1}, s_{i,t}, m_{\mathcal{N}_i,t})$, the delayed global information is utilized to estimate each hidden state, that is,

$$h_{i,t} \supset s_{i,0:t} \cup \left\{ s_{j,0:t+1-d_{ij}},\ \pi_{j,0:t-d_{ij}} \right\}_{j\in\mathcal{V}\setminus\{i\}},$$

where $x \supset y$ if information $y$ is utilized to estimate $x$, and $x_{0:t} := \{x_0, x_1, \dots, x_t\}$.

Proof. Based on the definition of the communication protocol, $m_{i,t} \supset h_{i,t-1}$ and $h_{i,t} \supset h_{i,t-1} \cup s_{\mathcal{V}_i,t} \cup \pi_{\mathcal{N}_i,t-1} \cup m_{\mathcal{N}_i,t}$. Hence,

$$\begin{aligned}
h_{i,t} &\supset s_{i,t} \cup \{s_{j,t}, \pi_{j,t-1}\}_{j\in\mathcal{N}_i} \cup \{h_{j,t-1}\}_{j\in\mathcal{V}_i} \\
&\supset s_{i,t} \cup \{s_{j,t}, \pi_{j,t-1}\}_{j\in\mathcal{N}_i} \cup \left\{ s_{j,t-1} \cup \{s_{k,t-1}, \pi_{k,t-2}\}_{k\in\mathcal{N}_j} \cup \{h_{k,t-2}\}_{k\in\mathcal{V}_j} \right\}_{j\in\mathcal{V}_i} \\
&= s_{i,t-1:t} \cup \{s_{j,t-1:t}, \pi_{j,t-2:t-1}\}_{j\in\mathcal{N}_i} \cup \{s_{j,t-1}, \pi_{j,t-2}\}_{j\in\{\mathcal{V}|d_{ij}=2\}} \cup \{h_{j,t-2}\}_{j\in\{\mathcal{V}|d_{ij}\le 2\}} \\
&\supset \cdots \supset s_{i,0:t} \cup \{s_{j,0:t}, \pi_{j,0:t-1}\}_{j\in\mathcal{N}_i} \cup \{s_{j,0:t-1}, \pi_{j,0:t-2}\}_{j\in\{\mathcal{V}|d_{ij}=2\}} \cup \dots \cup \{s_{j,0:t+1-d_{\max}}, \pi_{j,0:t-d_{\max}}\}_{j\in\{\mathcal{V}|d_{ij}=d_{\max}\}},
\end{aligned}$$

which concludes the proof.

A.1 PROPOSITION 1

Proof of Proposition 1. Based on (5), we have
$$h_{i,t} \supset \{m_{j,t}\}_{j \in \mathcal{N}_i} \cup \{b_{j,t}\}_{j \in \mathcal{N}_i} \supset \{h_{j,t-1}\}_{j \in \mathcal{N}_i} \supset \{m_{j,t-1}\}_{j \in \{\mathcal{V} \mid d_{ij}=2\}} \cup \{b_{j,t-1}\}_{j \in \{\mathcal{V} \mid d_{ij}=2\}} \supset \cdots \supset \{m_{j,t+1-d}\}_{j \in \{\mathcal{V} \mid d_{ij}=d\}} \cup \{b_{j,t+1-d}\}_{j \in \{\mathcal{V} \mid d_{ij}=d\}} \supset \cdots$$
Since $m_{j,t} = s_{j,t} \cup \pi_{j,t-1} \cup h_{j,t-1}$ and $b_{j,t} = \cup_{\tau=1}^{k} \hat{s}_{j,t+\tau}$, with $k$ the number of forward imagination steps,
$$h_{i,t} \supset \{s_{j,t+1-d}\}_{j \in \{\mathcal{V} \mid d_{ij}=d\}} \cup \{\hat{s}_{j,t+k+1-d}\}_{j \in \{\mathcal{V} \mid d_{ij}=d\}}.$$
Hence, $\hat{s}_{j,\tau}$ is included in the hidden state of agent $i$ at time $\tau + d_{ij} - 1 - k$, ahead of $s_{j,\tau}$, which arrives at time $\tau + d_{ij} - 1$, by $k$ steps.
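Proposition 1 can likewise be checked with a toy tag-propagation sketch (illustrative, not the paper's code): besides real observations, each agent broadcasts $k$ imagined future states $b_{j,t} = \{\hat{s}_{j,t+1}, \ldots, \hat{s}_{j,t+k}\}$, so a receiver at distance $d$ holds $\hat{s}_{j,\tau}$ a full $k$ steps before $s_{j,\tau}$ itself arrives. The line topology and the `("s"/"imag", agent, time)` tag encoding are assumptions of this sketch.

```python
def simulate_with_imagination(num_agents, num_steps, k):
    """Propagate both real and k-step imagined observation tags on a line graph."""
    knowledge = [set() for _ in range(num_agents)]
    for t in range(num_steps):
        prev = [s.copy() for s in knowledge]              # messages carry h_{j,t-1}
        for i in range(num_agents):
            knowledge[i].add(("s", i, t))                 # local observation s_{i,t}
            for tau in range(1, k + 1):
                knowledge[i].add(("imag", i, t + tau))    # imagined s_hat_{i,t+tau}
            for j in (i - 1, i + 1):                      # neighbors on the line graph
                if 0 <= j < num_agents:
                    knowledge[i] |= prev[j]               # last-step hidden state
                    knowledge[i].add(("s", j, t))         # neighbor's message m_{j,t}
                    for tau in range(1, k + 1):
                        knowledge[i].add(("imag", j, t + tau))  # neighbor's b_{j,t}

    return knowledge

k, t, d = 1, 3, 3
K = simulate_with_imagination(num_agents=5, num_steps=t + 1, k=k)
# the real state of the agent at distance d is available only up to time t+1-d,
# but its imagined copy is already available for time t+k+1-d, i.e. k steps ahead
assert ("s", d, t + 1 - d) in K[0]
assert ("imag", d, t + k + 1 - d) in K[0]
assert ("s", d, t + k + 1 - d) not in K[0]
```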

A.2 PROPOSITION 2

Proof of Proposition 2. We follow the proof sketch in (Luo et al., 2018). From (2), we have
$$V_i^{\pi_{k+1}, W^*} \ge V_i^{\pi_{k+1}, \widehat{W}} - D_{\pi_k}(\widehat{W}, \pi_{k+1}).$$
By solving (4),
$$V_i^{\pi_{k+1}, \widehat{W}} - D_{\pi_k}(\widehat{W}, \pi_{k+1}) \ge V_i^{\pi_k, W^*} - D_{\pi_k}(W^*, \pi_k),$$
with $D_{\pi_k}(W^*, \pi_k) = 0$. Thus $V_i^{\pi_{k+1}, W^*} \ge V_i^{\pi_k, W^*}$, which completes the proof.

B EXPERIMENT DETAILS

We describe the baselines used in our experiments as follows:

• NeurComm (Chu et al., 2020a): one of the state-of-the-art methods for NSC. It formulates neighborhood communication under a spatiotemporal MDP and performs independent learning for actors and critics. The messages include the current state, policy fingerprints, and the hidden state of the last time step.
• CommNet (Sukhbaatar & Fergus, 2016): it allows agents to communicate by broadcasting a communication vector, which is the average of the neighbors' hidden states. The messages include the current state and the last-step hidden state.
• DIAL (Foerster et al., 2016): each agent encodes the received messages instead of averaging them, but still sums all encoded inputs. It uses the observations of neighbors as the message.
• IA2C (Mnih et al., 2016): an advantage actor-critic method that trains a decentralized policy and critic for each agent. Agents do not communicate with nearby agents. It is implemented similarly to MADDPG (Lowe et al., 2017) in that the critic takes neighboring actions.
• ImagComm: our method adopts an imagination module to predict the delayed messages and mitigate the nonstationarity caused by partial observations during training.

Fig. 6 illustrates the diagram of the policy representation. IA2C is a non-communicative policy, while the other four approaches are communicative policies requiring messages from the neighbors. For the baselines, communication is based on $h_{i,t} = g_{\mathcal{V}_i}(h_{i,t-1}, s_{i,t}, m_{\mathcal{N}_i,t})$, and the implementation details are listed below:

IA2C: $h_{i,t} = \mathrm{LSTM}(h_{i,t-1}, \mathrm{relu}(s_{i,t}))$.
NeurComm: $h_{i,t} = \mathrm{LSTM}(h_{i,t-1}, \mathrm{concat}(\mathrm{relu}(s_{\mathcal{V}_i,t}), \mathrm{relu}(\pi_{\mathcal{N}_i,t-1}), \mathrm{relu}(h_{\mathcal{N}_i,t-1})))$.
DIAL: $h_{i,t} = \mathrm{LSTM}(h_{i,t-1}, \mathrm{relu}(s_{\mathcal{V}_i,t}) + \mathrm{relu}(\mathrm{relu}(h_{i,t-1})) + \mathrm{onehot}(a_{i,t-1}))$.

CommNet: $h_{i,t} = \mathrm{LSTM}(h_{i,t-1}, \tanh(s_{\mathcal{V}_i,t}) + \mathrm{linear}(\mathrm{mean}(h_{\mathcal{N}_i,t-1})))$.

For ImagComm, the communication is based on (5), with $h_{i,t} = \mathrm{LSTM}(h_{i,t-1}, \mathrm{concat}(\mathrm{relu}(s_{\mathcal{V}_i,t}), \mathrm{relu}(\pi_{\mathcal{N}_i,t-1}), \mathrm{relu}(h_{\mathcal{N}_i,t-1}), \mathrm{relu}(b_{\mathcal{N}_i,t})))$. For the imagination module, $\hat{s}_{i,t+1} = \tanh(\mathrm{concat}(\mathrm{relu}(s_{\mathcal{V}_i,t}), h_{i,t-1}))$.
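A minimal numpy sketch of the imagination step described above, not the paper's implementation: a linear map `W` (an assumption, since the formula leaves the projection implicit) maps the concatenation of the rectified state and the previous hidden state to the next predicted state, and rolling it forward for `k` steps yields the message payload $b_{i,t}$. All dimensions and weights here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

state_dim, hidden_dim, k = 4, 8, 2
# hypothetical weights of a one-layer imagination module (shape assumed)
W = rng.normal(scale=0.1, size=(state_dim, state_dim + hidden_dim))

def imagine(s, h, steps):
    """Roll the imagination module forward: s_hat_{t+1} = tanh(W [relu(s); h]).
    Reusing h across steps is an assumption of this sketch."""
    b = []
    for _ in range(steps):
        s = np.tanh(W @ np.concatenate([relu(s), h]))
        b.append(s)
    return np.concatenate(b)  # b_{i,t}: the k imagined states, flattened

s_t = rng.normal(size=state_dim)      # neighborhood state s_{V_i,t}
h_prev = rng.normal(size=hidden_dim)  # last hidden state h_{i,t-1}
b_t = imagine(s_t, h_prev, steps=k)
assert b_t.shape == (k * state_dim,)  # one state vector per imagined step
```

The resulting `b_t` would be passed through `relu` and concatenated into the LSTM input alongside the state, policy fingerprints, and neighbor hidden states, as in the update above.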

C ADDITIONAL RESULTS

Figure 7 shows the convergence of the learning curves for the imagination module. Unlike the other three tasks, ATSC Monaco involves heterogeneous agents, so each agent's state has a different dimension. Thus, in Figure 7b we plot the loss as the sum of the squared errors of the N agents, without normalization by state dimension. In Figures 7a, 7c, and 7d, the y-axis is the RMSE loss averaged over the N agents.
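The distinction between the two reported losses can be made concrete with a toy computation (made-up numbers, mirroring the description rather than the paper's code): for heterogeneous agents the summed squared error over agents is reported, while for homogeneous agents the per-agent RMSE is averaged.

```python
import numpy as np

# hypothetical per-agent predictions and targets; the second agent has a
# different state dimension, as in ATSC Monaco
pred   = [np.array([0.1, 0.2]), np.array([0.0, 0.5, 1.0])]
target = [np.array([0.0, 0.2]), np.array([0.0, 0.5, 0.0])]

# heterogeneous case: sum of squared errors, no normalization by state dimension
sse = sum(float(np.sum((p - t) ** 2)) for p, t in zip(pred, target))

# homogeneous case (equal dimensions assumed): mean over agents of per-agent RMSE
rmse = float(np.mean([np.sqrt(np.mean((p - t) ** 2))
                      for p, t in zip(pred, target)]))

assert abs(sse - 1.01) < 1e-9  # 0.1**2 + 1.0**2
```

Summing without normalization keeps the heterogeneous loss well defined when state dimensions differ, at the cost of weighting high-dimensional agents more heavily.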



Figure 1: Flow diagram of predictive communication. Left: an example networked system, an adaptive cruise control (ACC) system with vehicle-to-vehicle (V2V) communication. Through information sharing, agent i knows, at time t, the delayed global observations $s_{i,t} \cup s_{i-1,t} \cup s_{i-2,t-1} \cup s_{i-3,t-2}$. For the communication, $b_{i,t}$ is learned from world models of the neighbor's future states, which compensates for the delayed information. Right: policy representation that includes the messages of neighbors generated by the imagination module.


Traffic flows within the grid.

Figure 2: ATSC scenarios for NMARL. (a) Synthetic traffic grid, with major and minor traffic flows shown in solid and dotted arrows. (b) Simulated time-variant traffic flows within the traffic grid. (c) Monaco traffic network, with traffic flow collections shown in colored arrows.



Figure 2: Environments for adaptive traffic signal control (ATSC) and cooperative adaptive cruise control (CACC) systems.

Figure 3: Learning curves for our method and the baselines on the four environments.

Figure 4: Learning curves for ImagComm with different α.

Figure 5: Performance on average queue length (a, b) and average intersection delay (c, d) in the ATSC setting.

Figure 6: Policy representation that includes the messages of neighbors generated by the imagination module.

Figure 8 compares the performance of ImagComm and NeurComm-2Hop, a modified version of NeurComm, on the CACC task. ImagComm performs one-step prediction to implicitly include the information of two-hop neighbors, whereas NeurComm-2Hop uses the two-hop information directly. Interestingly, ImagComm outperforms NeurComm-2Hop, which demonstrates the effectiveness of the imagination module: due to the one-step imagination before communication, the receiver obtains both the information of neighbors two hops away and the next-step information of neighbors one hop away.

Average reward comparison over trained models.

Performance of MARL controllers in ATSC environments: synthetic traffic grid (top) and Monaco traffic network (bottom). Best values are in bold. We freeze and evaluate our model for another 50 episodes using different seeds and present the average rewards in Table 1. The results show that in the ATSC scenarios, ImagComm outperforms the other models by a large margin. We further investigate queue length and vehicle speed in the ATSC environments and present the results in Table 2. In ATSC Grid, ImagComm achieves the lowest queue length and the highest average vehicle speed, which means that vehicles flow smoothly and ImagComm greatly reduces the congestion level. In ATSC Monaco, ImagComm achieves the lowest queue length, but CommNet attains a higher average vehicle speed. This indicates that ImagComm keeps vehicles moving steadily rather than blocked on the road for long periods, at the cost of lower speed.

