LEARNING TO COMMUNICATE THROUGH IMAGINATION WITH MODEL-BASED DEEP MULTI-AGENT REINFORCEMENT LEARNING

Abstract

The human imagination is an integral component of our intelligence, and its core utility is deeply coupled with communication. Language, argued to have developed through complex interaction within growing collective societies, serves as an instruction to the imagination, giving us the ability to share abstract mental representations and perform joint spatiotemporal planning. In this paper, we explore communication through imagination with multi-agent reinforcement learning. Specifically, we develop a model-based approach where agents jointly plan through recurrent communication of their respective predictions of the future. Each agent has access to a learned world model capable of producing model rollouts of future states and predicted rewards, conditioned on the actions sampled from the agent's policy. These rollouts are then encoded into messages and used to learn a communication protocol during training via differentiable message passing. We highlight the benefits of our model-based approach, compared to a set of strong baselines, by developing a set of specialised experiments using both novel and well-known multi-agent environments.

1. INTRODUCTION

"We use imagination in our ordinary perception of the world. This perception cannot be separated from interpretation." (Warnock, 1976) . The human brain, and the mind that emerges from its working, is currently our best example of a general purpose intelligent learning system. And our ability to imagine, is an integral part of it (Abraham, 2020) . The imagination is furthermore intimately connected to other parts of our cognition such as our use of language (Shulman, 2012) . In fact, Dor (2015) argues that: "The functional specificity of language lies in the very particular functional strategy it employs. It is dedicated to the systematic instruction of imagination: we use it to communicate directly with our interlocutors' imaginations." However, the origin of language resides not only in individual cognition, but in society (Von Humboldt, 1999) , grounded in part through interpersonal experience (Bisk et al., 2020) . The complexity of the world necessitates our use of individual mental models (Forrester, 1971) , to store abstract representations of the information we perceive through the direct experiences of our senses (Chang and Tsao, 2017) . As society expanded, the sharing of direct experiences within groups reached its limit. Growing societies could only continue to function through the invention of language, a unique and effective communication protocol where a sender's coded message of abstract mental representations delivered through speech, could serve as a direct instruction to the receiver's imagination (Dor, 2015) . Therefore, the combination of language and imagination gave us the ability to solve complex tasks by performing abstract reasoning (Perkins, 1985) and joint spatiotemporal planning (Reuland, 2010) . In this work, we explore a plausible learning system architecture for the development of an artificial multi-agent communication protocol of the imagination. 
Based on the above discussion, the minimum set of required features of such a system includes: (1) that it be constructed from multiple individual agents, where (2) each agent possesses an abstract model of the world that can serve as an imagination, (3) has access to a communication medium, or channel, and (4) jointly learns and interacts in a collective society. Consequently, these features map most directly onto the learning framework of model-based deep multi-agent reinforcement learning. Reinforcement learning (RL) has demonstrated close connections with neuroscientific models of learning (Barto, 1995; Schultz et al., 1997). Besides this connection, however, RL has proven to be an extremely useful computational framework for building effective artificial learning systems (Sutton and Barto, 2018). This is true not only in simulated environments and games (Mnih et al., 2015; Silver et al., 2017), but also in real-world applications (Gregurić et al., 2020). Furthermore, RL approaches are being considered for some of humanity's most pressing problems, such as the need to build sustainable food supply (Binas et al., 2019) and energy forecasting systems (Jeong and Kim, 2020), brought about through global climate change (Manabe and Wetherald, 1967; Hays et al., 1976; Hansen et al., 2012; Rolnick et al., 2019).

Our system. We develop our system specifically in the context of cooperative multi-agent RL (OroojlooyJadid and Hajinezhad, 2019), where multiple agents jointly attempt to learn how to act in a partially observable environment by maximising a shared global reward. Our agents make use of model-based reinforcement learning (Langlois et al., 2019; Moerland et al., 2020). To learn an artificial language of the imagination, each individual agent in our system is given access to a recurrent world model capable of learning rich abstract representations of real and imagined future states.
We combine this world model with an encoder function to encode world model rollouts as messages, and use a recurrent differentiable message passing channel for communication. To show the benefits of our system, we develop a set of ablation tests and specialised experiments using both novel and well-known multi-agent environments, and compare the performance of our system to a set of strong model-free deep MARL baselines.

Our findings and contributions. We find that joint planning using learned communication through imagination can significantly improve MARL system performance when compared to a set of state-of-the-art baselines. We demonstrate this advantage of planning in a set of specialised environments specifically designed to test for the use of communication combined with imagined future prediction. Our present work is not at scale and we only consider situations containing two agents. However, to the best of our knowledge, this is the first demonstration of a model-based deep MARL system that combines world models with differentiable communication for joint planning, able to solve tasks successfully where state-of-the-art model-free deep MARL methods fail. We see this work as a preliminary step towards building larger-scale joint planning systems using model-based deep multi-agent RL.
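To make the overall flow concrete, the following is a minimal sketch of one planning step of the kind of system described above: each agent unrolls a learned world model under its policy, encodes the imagined trajectory into a message, and the messages are then available for exchange. All function bodies, the scalar state, and the toy policy and model are hypothetical placeholders for illustration only, not the actual learned components of our system.

```python
# Toy sketch: world model rollouts encoded into messages.
# The policy, world model, and encoder below are hypothetical stand-ins
# for the learned recurrent components described in the text.

def world_model_rollout(state, policy, model, horizon):
    """Imagine `horizon` future (state, reward) pairs using the world model."""
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        state, reward = model(state, action)  # predicted next state and reward
        trajectory.append((state, reward))
    return trajectory

def encode_message(trajectory):
    """Toy encoder: summarise imagined rewards into a fixed-size message."""
    rewards = [r for _, r in trajectory]
    return [sum(rewards) / len(rewards), max(rewards)]

# Two agents with a scalar state and a deterministic toy model.
policy = lambda s: 1 if s < 5 else -1
model = lambda s, a: (s + a, float(s + a))  # (next state, predicted reward)

messages = []
for start_state in (0, 3):  # one imagined rollout per agent
    rollout = world_model_rollout(start_state, policy, model, horizon=3)
    messages.append(encode_message(rollout))

print(messages)
```

In the actual system the encoder is learned and the channel is differentiable, so gradients from the shared reward can shape both the messages and the policies; here the summary statistics merely illustrate the rollout-to-message mapping.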

2. BACKGROUND AND RELATED WORK

Reinforcement learning is concerned with optimal sequential decision making within a particular environment. In single-agent RL, the problem is modeled as a Markov decision process (MDP) defined by the tuple $(S, A, r, p, \rho_0, \gamma)$ (Andreae, 1969; Watkins, 1989). At time step $t$, in a state $s_t$, which is a member of the state space $S$, the agent can select an action $a_t$ from a set of actions $A$. The environment state transition function $p(s_{t+1}|s_t, a_t)$ provides a distribution over next states $s_{t+1}$, and a reward function $r(s_t, a_t, s_{t+1})$ returns a scalar reward, given the current state, action and next state. The initial state distribution is given by $\rho_0$, with $s_0 \sim \rho_0$, and $\gamma \in (0, 1]$ is a discount factor controlling the influence of future reward. The goal of RL is to find an optimal policy $\pi^*$, where the policy is a mapping from states to a distribution over actions, that maximises long-term discounted future reward such that $\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1})\right]$. If the environment state is only partially observed by the agent, an observation function $o(s_t)$ is assumed and the agent has access only to the observation $o_t = o(s_t)$ at each time step, with the full observation space defined as $O = \{o(s) \mid s \in S\}$. In this work, we focus only on the case of partial observability.

Deep RL. Popular algorithms for solving the RL problem include value-based methods such as Q-learning (Watkins and Dayan, 1992) and policy gradient methods such as the REINFORCE algorithm (Williams, 1992). Q-learning learns a value function $Q(s, a)$ for state-action pairs and obtains a policy by selecting actions according to these learned values using a specific action selector, e.g. $\epsilon$-greedy (Watkins, 1989) or UCB (Auer et al., 2002). In contrast, policy gradient methods learn a parameterised policy $\pi_\theta$, with parameters $\theta$, directly by following a performance gradient signal with respect to $\theta$.
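As a concrete illustration of the tabular form of Q-learning with $\epsilon$-greedy exploration discussed above, consider the following sketch on a hypothetical four-state chain MDP (the environment, learning rates and episode count are illustrative assumptions, not part of this work):

```python
import random

random.seed(0)

# Tabular Q-learning with epsilon-greedy exploration on a toy chain MDP.
# Action 1 moves right toward a terminal reward; action 0 stays in place.
N_STATES, ACTIONS = 4, (0, 1)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Transition p(s'|s, a) and reward r(s, a, s') for the toy chain."""
    s_next = min(s + a, N_STATES - 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

for _ in range(500):  # episodes
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s_next, r = step(s, a)
        # Q-learning update: move Q(s, a) toward the TD target
        target = r + GAMMA * max(Q[(s_next, a_)] for a_ in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s_next

# Greedy policy recovered from the learned values: always move right.
greedy = [max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in range(N_STATES - 1)]
print(greedy)
```

After training, the greedy policy selects the "move right" action in every non-terminal state, since the discounted value of reaching the terminal reward dominates the value of staying put.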
The above approaches are combined in actor-critic methods (Sutton et al., 2000), where a parameterised policy (the actor) is learned alongside a value function (the critic) that estimates expected return and serves as a baseline to reduce the variance of the policy gradient.
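The actor-critic idea mentioned above can be sketched in its simplest form on a hypothetical two-armed bandit, with a softmax policy as the actor and a scalar running baseline as the critic (the payoffs, learning rates and step count are illustrative assumptions):

```python
import math
import random

random.seed(1)

# Minimal actor-critic on a hypothetical two-armed bandit: the actor is a
# softmax policy over logits; the critic is a scalar baseline b estimating
# expected reward. Arm 1 pays 1.0 on average, arm 0 pays 0.2.
logits = [0.0, 0.0]   # actor parameters (theta)
b = 0.0               # critic estimate (baseline)
LR_ACTOR, LR_CRITIC = 0.1, 0.1

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

for _ in range(2000):
    probs = softmax(logits)
    a = 0 if random.random() < probs[0] else 1
    reward = random.gauss(0.2 if a == 0 else 1.0, 0.1)
    advantage = reward - b  # critic supplies the baseline
    # policy-gradient step: d/d_theta_i log pi(a) = 1[i == a] - probs[i]
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += LR_ACTOR * advantage * grad
    b += LR_CRITIC * advantage  # critic tracks expected reward

probs = softmax(logits)
```

Subtracting the critic's baseline from the sampled reward leaves the direction of the policy gradient unchanged in expectation while shrinking its variance, which is precisely the benefit actor-critic methods add over plain REINFORCE.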

