

Abstract

To develop computational agents that communicate better using their own emergent language, we endow the agents with the ability to focus their attention on particular concepts in the environment. Humans often understand an object or scene as a composite of concepts, and those concepts are further mapped onto words. We implement this intuition as attention mechanisms in Speaker and Listener agents in a referential game and show that attention leads to more compositional and interpretable emergent language. We also demonstrate how attention helps us understand the learned communication protocol, by investigating the attention weights associated with each message symbol and the alignment of attention weights between Speaker and Listener agents. Overall, our results suggest that attention is a promising mechanism for developing more human-like emergent language.

1. INTRODUCTION

We endow computational agents with the ability to focus their attention on a particular concept in the environment and communicate using emergent language. Emergent language refers to a communication protocol based on discrete symbols that agents develop to solve a specific task (Nowak & Krakauer, 1999; Lazaridou et al., 2017). One important goal in the study of emergent language is clarifying the conditions that lead to more compositional languages. Seeking compositionality provides insights into the origin of the compositional nature of human language and helps develop efficient communication protocols in multi-agent systems that are interpretable by humans.

Much recent work studies emergent language with generic deep agents (Lazaridou & Baroni, 2020), making minimal assumptions about the inductive bias of the model architecture. For example, a typical speaker agent encodes information into a single fixed-length vector that initializes the hidden state of an RNN decoder, which then generates symbols (Lazaridou et al., 2017; Mordatch & Abbeel, 2018; Ren et al., 2020). Only a few studies have explored architectural variations (Słowik et al., 2020) to improve the compositionality of emergent language, and much remains to be discussed about the effects of the inductive bias provided by different architectures. We posit that, to obtain more human-like emergent language, we need to explore other modeling choices that reflect human cognitive processes.

In this study, we focus on the attention mechanism. Attention is one of the most successful neural network components (Bahdanau et al., 2015; Xu et al., 2015; Vaswani et al., 2017) and has an analogy in psychology (Lindsay, 2020). The conceptual core of attention is the adaptive control of limited resources, and we hypothesize that this creates pressure toward learning more compositional emergent languages. Compositionality entails a whole consisting of subparts. Attention allows the agents to dynamically highlight different subparts of an object when producing or understanding each symbol, which can result in clearer associations between object attributes and symbols.

Another reason to explore the attention mechanism is its interpretability. Emergent language is optimized for task success, and the learned communication protocol often results in counter-intuitive and opaque encodings (Bouchacourt & Baroni, 2018). Several metrics have been proposed to measure specific characteristics of emergent language (Brighton & Kirby, 2006; Lowe et al., 2019), but these metrics provide a rather holistic view of emergent language and do not offer a fine-grained view of what each symbol is meant for or understood as. Attention weights, on the other hand, have been shown to provide insights into the basis of a network's predictions (Bahdanau et al., 2015; Xu et al., 2015; Yang et al., 2016). Incorporating attention into the process of symbol production and comprehension allows us to inspect the meaning of each symbol in a message.
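To make the intuition concrete, the following minimal sketch (not the paper's actual architecture) shows how a Speaker could attend over an object's attribute embeddings before emitting each symbol: the decoder state acts as a query, and the resulting attention weights indicate which attribute each symbol is "about". The function names (`softmax`, `attend`), dimensions, and random embeddings are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Scaled dot-product attention over an object's attribute embeddings.

    Returns the attended context vector and the attention weights,
    which can later be inspected to interpret each emitted symbol.
    """
    scores = keys @ query / np.sqrt(len(query))  # one score per attribute
    weights = softmax(scores)                    # distribution over attributes
    context = weights @ values                   # weighted sum of embeddings
    return context, weights

# Hypothetical object with 3 attributes (e.g. color, shape, size),
# each embedded as a 4-dimensional vector; keys double as values here.
rng = np.random.default_rng(0)
attributes = rng.normal(size=(3, 4))
query = rng.normal(size=4)  # decoder state before emitting a symbol

context, weights = attend(query, attributes, attributes)
```

Because `weights` sums to one and is recomputed for every symbol, inspecting it per time step is what allows the per-symbol analysis described above.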

