EXPLORING ZERO-SHOT EMERGENT COMMUNICATION IN EMBODIED MULTI-AGENT POPULATIONS

Anonymous

Abstract

Effective communication is an important skill for enabling information exchange and cooperation in multi-agent settings. Indeed, emergent communication is now a vibrant field of research, with common settings involving discrete cheap-talk channels. One limitation of this setting is that it does not allow the emergent protocols to generalize beyond the training partners. Furthermore, emergent communication has so far primarily focused on symbolic channels. In this work, we extend this line of work to a new modality by studying agents that learn to communicate via actuating their joints in a 3D environment. We show that, under two realistic assumptions, a non-uniform distribution of intents and a common-knowledge energy cost, these agents can find protocols that generalize to novel partners. We also explore and analyze the specific difficulties associated with finding these solutions in practice. Finally, we propose and evaluate initial training improvements to address these challenges, involving both specific training curricula and providing, during training, the latent feature that can be coordinated on.

1. INTRODUCTION

The ability to communicate effectively with other agents is part of the necessary skill repertoire of intelligent agents and, by definition, can only be studied in multi-agent contexts. Over the last few years, a number of papers have studied emergent communication in multi-agent settings (Lazaridou et al., 2016; Havrylov & Titov, 2017; Cao et al., 2018; Bouchacourt & Baroni, 2018; Eccles et al., 2019; Graesser et al., 2019; Chaabouni et al., 2019; Lowe et al., 2019b). This work typically assumes a symbolic (discrete) cheap-talk channel, through which agents can send messages that have no impact on the reward function or transition dynamics. A common task is the so-called referential game, in which a sender observes an intent that needs to be communicated to a listener via a message. In these cheap-talk settings, the solution space typically contains many equivalent but mutually incompatible (self-play) policies. For example, permuting the bits in the channel and adapting the receiver policy accordingly would preserve payoffs, but differently permuted senders and receivers are mutually incompatible. This makes it difficult for independently trained agents to utilize the cheap-talk channel at test time, a setting formalized as zero-shot (ZS) coordination (Hu et al., 2020). In contrast, we study how gesture-based communication can emerge under realistic assumptions. Specifically, this work considers emergent communication in the context of embodied agents that learn to communicate through actuating and observing their joints in simulated physical environments. In other words, our setup is a referential game in which each message is a multi-step process that produces an entire trajectory of limb motion (continuous actions) in a simulated 3D world.
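The mutual incompatibility of equally optimal cheap-talk protocols can be illustrated with a minimal sketch (hypothetical codebooks, not the trained policies studied in this work): each self-play pair agrees on an internally consistent but arbitrary permutation of intents to symbols, so self-play accuracy is perfect while cross-play collapses.

```python
def make_pair(codebook):
    """A self-play pair defined by a shared codebook: the sender maps
    intent -> symbol, the receiver inverts that same mapping."""
    sender = lambda intent: codebook[intent]
    receiver = lambda symbol: codebook.index(symbol)
    return sender, receiver

n = 10
code_a = list(range(n))            # one self-play convention
code_b = list(reversed(range(n)))  # an equally optimal, permuted convention

sender_a, receiver_a = make_pair(code_a)
sender_b, receiver_b = make_pair(code_b)

def accuracy(sender, receiver):
    """Fraction of intents correctly recovered by the receiver."""
    return sum(receiver(sender(i)) == i for i in range(n)) / n

print(accuracy(sender_a, receiver_a))  # self-play: 1.0
print(accuracy(sender_a, receiver_b))  # cross-play: 0.0
```

Both codebooks earn maximal self-play reward, yet pairing a sender from one convention with a receiver from the other recovers no intents, which is exactly the zero-shot failure mode described above.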
Not only does body language play a crucial role in social interactions, but zoomorphic agents, robotic manipulators, and prelingual infants are generally not expected to use symbolic language to communicate at all. From a practical point of view, it is clear that future AI agents will need to signal to, and interpret the body language of, other (human) agents, e.g., when self-driving cars decide whether it is safe to cross an intersection. Relatedly, there has been work on the emergence of grounded language for robots (Steels et al., 2012; Spranger, 2016). To the best of our knowledge, however, we are the first to explore deep reinforcement learning for emergent communication in the context of embodied agents using articulated motion. Moreover, while cheap talk is a good proxy for symbolic communication across dedicated channels, communication through articulated motion means that agents have to control their joints to signal. One universal feature of this physical actuation is that it requires the expenditure of energy, a scarce resource both for biological agents and for man-made robots. Another ubiquitous factor of the physical world (and many other domains) is that communicative intents are not distributed uniformly. In particular, the Zipf distribution (Zipf, 2016) is known to be a good proxy for many real-world distributions associated with human activity. A consequence of combining an energy cost with a non-uniform distribution over intents in the context of referential games is that, in principle, it allows for ZS communication: trajectories requiring lower energy exertion should be used for encoding more common intents, while those associated with higher energy encode less common ones. In contrast, superficially related auxiliary losses, such as entropy penalties, do not allow for ZS coordination without further assumptions.
While these auxiliary losses are design decisions, energy cost is an example of a universal (common-knowledge) cost grounded in the environment, which can be exploited for ZS communication. Unfortunately, training agents that successfully learn these strategies is a difficult problem for current state-of-the-art machine learning models. There are three major challenges: 1) Local optima associated with the lock-in between sender and receiver: the interpretation of a message depends on the entire policy, not just the state and action. While recent methods have been developed to address this in discrete action spaces (Foerster et al., 2019), to the best of our knowledge none have been proposed for costly, continuous action spaces. 2) The latent structure underlying the protocol that can be coordinated on, in our case energy, needs to be discovered, requiring agents to ignore other (redundant) degrees of freedom. 3) Even when this structure is provided, optimization is difficult since the landscape contains a large number of local optima. As a consequence, on top of the continuous optimization problem there is a combinatorial problem of ordering the energy values for each intent. We explore and analyze these difficulties, suggesting initial steps for addressing them in our setting. First, we show that providing the latent variable (energy) at training time does indeed allow for some amount of ZS coordination. To do so, we adapt our method in two ways: 1) During training, we restrict the observation to only the energy value of a given trajectory. 2) We add an observer agent that trains on an entire population of fully trained self-play (SP) agents. To evaluate ZS performance, we test this observer on an independently trained set of SP agents. The ZS performance in this setting is around 35% for 10 intents, comparable to the frequency of the most common class under the Zipf distribution (34%).
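The observer-based evaluation protocol can be sketched in miniature (the energy values below are invented for illustration, and the nearest-mean observer stands in for the learned observer agent): fit an observer on the energies emitted by one self-play population, then measure its accuracy on a second, independently trained population.

```python
# Hypothetical intent -> emitted-energy samples from two SP populations.
train_pop = {0: [0.51, 0.48], 1: [1.02, 0.97], 2: [1.95, 2.05]}
test_pop = {0: [0.55], 1: [0.90], 2: [2.10]}

# Toy observer: predict the intent whose mean training energy is nearest.
means = {i: sum(es) / len(es) for i, es in train_pop.items()}

def observe(energy):
    return min(means, key=lambda i: abs(means[i] - energy))

correct = sum(observe(e) == i for i, es in test_pop.items() for e in es)
zs_accuracy = correct / sum(len(es) for es in test_pop.values())
print(zs_accuracy)
```

In this toy case the two populations happen to order their energies identically, so zero-shot accuracy is perfect; when independently trained populations order or overlap their energy levels differently, as discussed above, accuracy degrades toward guessing the most frequent intent.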
Next, we pretrain sender agents to minimize the energy associated with each intent, but ZS performance remains around 34% (10 intents). This is intuitive given the challenge of re-ordering energy values on a 1D line without incurring a high cross-entropy while the different distributions overlap. Finally, we show that using the entire trajectory during SP, but only the energy value for the external observer, in combination with pretraining, consistently achieves a much higher ZS performance of around 56%. All of this illustrates that learning an optimal ZS policy given a specific set of assumptions about the problem setting, which are common knowledge to all parties, is a challenging problem, even in seemingly simple instances. Furthermore, since this work focuses on the two extreme ends of this problem (self-play and ZS), our ideas are relevant for a broad range of intermediate settings as well.



Figure 1: Embodied Referential Game

