

Abstract

To develop computational agents that communicate better using their own emergent language, we endow the agents with an ability to focus their attention on particular concepts in the environment. Humans often understand an object or scene as a composite of concepts, and those concepts are further mapped onto words. We implement this intuition as attention mechanisms in Speaker and Listener agents in a referential game and show that attention leads to more compositional and interpretable emergent language. We also demonstrate how attention helps us understand the learned communication protocol by investigating the attention weights associated with each message symbol and the alignment of attention weights between the Speaker and Listener agents. Overall, our results suggest that attention is a promising mechanism for developing more human-like emergent language.

1. INTRODUCTION

We endow computational agents with the ability to focus their attention on a particular concept in the environment and communicate using emergent language. Emergent language refers to a communication protocol based on discrete symbols that agents develop to solve a specific task (Nowak & Krakauer, 1999; Lazaridou et al., 2017). One important goal in the study of emergent language is clarifying the conditions that lead to more compositional languages. Seeking compositionality provides insights into the origin of the compositional nature of human language and helps develop efficient communication protocols in multi-agent systems that are interpretable by humans.

Much recent work studies emergent language with generic deep agents (Lazaridou & Baroni, 2020) under minimal assumptions about the inductive bias of the model architecture. For example, a typical speaker agent encodes information into a single fixed-length vector to initialize the hidden state of an RNN decoder and generate symbols (Lazaridou et al., 2017; Mordatch & Abbeel, 2018; Ren et al., 2020). Only a few studies have explored architectural variations (Słowik et al., 2020) to improve the compositionality of emergent language, and much remains to be discussed about the effects of the inductive bias provided by different architectures. We posit that, toward more human-like emergent language, we need to explore other modeling choices that reflect the human cognitive process.

In this study, we focus on the attention mechanism. Attention is one of the most successful neural network architectures (Bahdanau et al., 2015; Xu et al., 2015; Vaswani et al., 2017) and has an analogy in psychology (Lindsay, 2020). The conceptual core of attention is the adaptive control of limited resources, and we hypothesize that this creates pressure for learning more compositional emergent languages. Compositionality entails a whole consisting of subparts.
Attention allows the agents to dynamically highlight different subparts of an object when producing or understanding each symbol, which potentially results in clear associations between object attributes and symbols.

Another reason to explore the attention mechanism is its interpretability. Emergent language is optimized for task success, and the learned communication protocol often results in counter-intuitive and opaque encoding (Bouchacourt & Baroni, 2018). Several metrics have been proposed to measure specific characteristics of emergent language (Brighton & Kirby, 2006; Lowe et al., 2019), but these metrics provide a rather holistic view of emergent language and do not give a fine-grained view of what each symbol is meant for or understood as. Attention weights, on the other hand, have been shown to provide insights into the basis of a network's predictions (Bahdanau et al., 2015; Xu et al., 2015; Yang et al., 2016). Incorporating attention into the process of symbol production and comprehension will allow us to inspect the meaning of each symbol in the messages.

In this paper, we test attention agents with the referential game (Lewis, 1969; Lazaridou et al., 2017), which involves two agents: Speaker and Listener. The goal of the game is to convey the type of object that Speaker sees to Listener. We conduct extensive experiments with two types of agent architectures, LSTM and Transformer, and two types of environments: the one-hot game and the Fashion-MNIST game. We compare the attention agents against their non-attention counterparts and show that adding attention mechanisms with disentangled inputs to Speaker, Listener, or both helps develop a more compositional language. We also analyze the attention weights and explore how they can help us understand the learned language.
We visualize the learned symbol-concept mapping and demonstrate how the emergent language can deviate from human language and, through experiments with pixel images, how it can be affected by the visual similarity of referents. We also investigate the alignment between Speaker's and Listener's attention weights as a proxy for the establishment of common understanding and show that this alignment correlates with task success.

2.1. REFERENTIAL GAME

We study emergent language in the referential game (Lewis, 1969; Lazaridou et al., 2017). The game focuses on the most basic feature of language: referring to things. The version of the referential game we adopt in this paper is structured as follows:

1. Speaker is presented with a target object o_tgt ∈ O and generates a message m that consists of a sequence of discrete symbols.

2. Listener receives the message m and a candidate set C = {o_1, o_2, ..., o_|C|} that includes the target object o_tgt and distractor objects sampled randomly without replacement from O.

3. Listener chooses one object from the candidate set; if it is the target object, the game is considered successful.

The objects can be represented as a set of attributes. The focus of this game is whether the agents can represent the objects in a compositional message that is supposedly based on those attributes.
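The three steps above can be sketched as a single game round. This is a minimal illustration, not the paper's implementation; `speaker` and `listener` are hypothetical callables standing in for the trained agents, and objects are represented as attribute tuples.

```python
import random

def referential_game(speaker, listener, objects, num_distractors=4):
    """Play one round of the referential game; return True on success."""
    # 1. Speaker sees the target object o_tgt and emits a discrete message m.
    target = random.choice(objects)
    message = speaker(target)
    # 2. Listener receives m together with a candidate set C: the target
    #    plus distractors sampled randomly without replacement from O.
    distractors = random.sample(
        [o for o in objects if o is not target], num_distractors)
    candidates = distractors + [target]
    random.shuffle(candidates)
    # 3. Listener picks one candidate; the game succeeds if it is the target.
    choice = listener(message, candidates)
    return choice is target
```

An oracle Speaker that transmits the target's attributes verbatim, paired with a Listener that matches them against the candidates, always wins this game; emergent-language agents must instead learn a discrete protocol for the same job end-to-end.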

2.2. AGENT ARCHITECTURES

Our goal in this paper is to test the effect of the attention mechanism on emergent language. The attention mechanism in its general form takes a query vector x and key-value vectors {y_1, ..., y_L}. The key-value vectors are optionally transformed into separate key and value vectors or used as is. Attention scores {s_1, ..., s_L} are then calculated as the similarity between the query vector and the key vectors, and normalized into attention weights via the softmax function. Finally, the attention weights are used to produce a weighted sum of the value vectors.

A key feature of attention is that it allows the agents to selectively attend to a part of disentangled vector inputs. Our intuition is that modeling direct associations between symbol representations as queries and disentangled input representations as keys and values will bias the agents toward packing each symbol with the information of a meaningful subpart of the inputs rather than with opaque, non-compositional information. To test these hypotheses, we design non-attention and attention agents for both Speaker and Listener (Figure 1).
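The general form described above can be written out directly. The following is a minimal sketch in plain Python (the function names are ours): the key-value vectors are used as is, i.e., the optional key/value transformations are omitted.

```python
import math

def softmax(scores):
    """Normalize similarity scores into attention weights that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, key_values):
    """Single-query dot-product attention over key-value vectors.

    query:      the query vector x (a list of floats)
    key_values: the vectors {y_1, ..., y_L}, serving as both keys and values
    Returns the weighted sum of the value vectors and the attention weights.
    """
    # Attention scores {s_1, ..., s_L}: dot-product similarity between
    # the query vector and each key vector.
    scores = [sum(q * k for q, k in zip(query, y)) for y in key_values]
    # Softmax turns the scores into attention weights.
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    context = [sum(w * y[d] for w, y in zip(weights, key_values))
               for d in range(len(query))]
    return context, weights
```

In the agents, the query would be a symbol representation and the key-value vectors the disentangled input representations, so the weights expose which subpart of the input each symbol is associated with.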



Figure 1: Illustration of the attention agents in the referential game.

