NEURAL AGENTS STRUGGLE TO TAKE TURNS IN BIDIRECTIONAL EMERGENT COMMUNICATION

Abstract

The spontaneous exchange of turns is a central aspect of human communication. Although turn-taking conventions come to us naturally, artificial dialogue agents struggle to coordinate and must rely on hard-coded rules to engage in interactive conversations with human interlocutors. In this paper, we investigate the conditions under which artificial agents may naturally develop turn-taking conventions in a simple language game. We describe a cooperative task where success is contingent on the exchange of information along a shared communication channel in which talking over each other hinders communication. Despite these environmental constraints, neural-network-based agents trained to solve this task with reinforcement learning do not systematically adopt turn-taking conventions. However, we find that agents that do agree on turn-taking protocols end up performing better. Moreover, agents that are forced to perform turn-taking can learn to solve the task more quickly. This suggests that turn-taking may help to generate conversations that are easier for speakers to interpret.



1 INTRODUCTION

Figure 1: Illustration of our proposed game. Both agents can exchange utterances through a shared communication channel. At each step of the conversation, agents can decide to either speak or stay silent. However, information cannot be transmitted if both agents decide to speak at the same time.

Natural conversations involve a rapid exchange of utterances where speakers coordinate on-the-fly to avoid talking over each other. This turn-taking phenomenon is ubiquitous across cultures (Stivers et al., 2009) and is even found in some forms of animal communication (Pika et al., 2018; Demartsev et al., 2018). The ability to engage in spontaneous turn-taking develops early in humans, even before linguistic competence (Nguyen et al., 2021), and allows us to hold fluent conversations with very little downtime between utterances (Heldner & Edlund, 2010). In contrast, fluid turn-taking is difficult to replicate in artificial dialogue systems. Modern conversational agents often rely on explicit cues, for instance pressing "enter" in text-based chatbots, the use of specific wake-words (Gao et al., 2020), or long silences of pre-determined length (Skantze, 2021).

The goal of this paper is to provide a testbed for studying the conditions under which artificial agents may develop a turn-taking convention to resolve a cooperative task. We describe a simple two-player game where agents observe partial views on an object which they must reconstruct. Agents can exchange information by emitting symbols across a shared communication channel over multiple rounds. The game exhibits two key features: first, at each round agents can decide to either talk or stay silent; however, if both agents decide to talk at the same time, they are not able to hear the other agent's message. Second, agents get a higher score if they solve the task in fewer rounds.
This creates an explicit pressure towards a protocol that allows the agents to solve the task quickly while minimizing overlap. In our experiments, we find that simple neural-network-based agents trained with reinforcement learning do not consistently develop natural turn-taking strategies. However, agents that do develop a turn-taking protocol achieve a much higher score, sometimes solving the task perfectly. To shed light on this finding, we perform an in-depth analysis of an asymmetric version of the game, where one agent has all the information. We show that in this case, there is an optimal strategy that does not rely on turn-taking. However, we demonstrate empirically that agents fail to solve the game when they are forced to use this strategy. In contrast, agents that are forced to use turn-taking strategies over multiple turns rapidly achieve almost perfect accuracy.
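The collision rule at the heart of the game can be made concrete with a small sketch. The following is our own minimal illustration, not the paper's implementation: the function name, the use of integers as symbols, and `None` as the "stay silent" action are all assumptions.

```python
from typing import Optional, Tuple

SILENT = None  # sentinel action: the agent stays silent this round

def channel_step(msg1: Optional[int],
                 msg2: Optional[int]) -> Tuple[Optional[int], Optional[int]]:
    """One round of the shared communication channel.

    Each agent either emits a symbol (an int) or stays silent (None).
    Returns what each agent hears. An agent hears the other's symbol
    only if it stayed silent itself: when both agents speak at once,
    the messages collide and nothing is transmitted.
    """
    heard_by_1 = msg2 if msg1 is SILENT else None
    heard_by_2 = msg1 if msg2 is SILENT else None
    return heard_by_1, heard_by_2

# Clean turn-taking: agent 1 speaks while agent 2 listens.
assert channel_step(7, SILENT) == (None, 7)
# Overlap: both speak, so neither hears the other.
assert channel_step(7, 3) == (None, None)
```

Under this rule, any pair of policies that wants to exchange information reliably must coordinate so that at most one agent speaks per round, which is exactly the pressure towards turn-taking described above.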

2 RELATED WORK

Schegloff (2000) attributes the first description of turn-taking to Goffman (1955), although the term "turn-taking" itself was coined much later (Yngve, 1970). Since then, and following early seminal work in the 70s (Duncan, 1972; Sacks et al., 1978), turn-taking has been the subject of considerable study under the umbrella of "conversation analysis" (Levinson, 1983). For example, researchers have sought to characterize linguistic and paralinguistic cues involved in organizing turns (Stephens & Beattie, 1986; Kendon, 1967; Clancy et al., 1996), to understand the time-scales involved in turn changes (Stivers et al., 2009; Heldner & Edlund, 2010), or to identify turn-taking conventions in non-human primates (Rossano, 2013; Rossano & Liebal, 2014; Demartsev et al., 2018).

Research in automated dialogue systems dates back to early efforts in the 60s and 70s (Weizenbaum, 1966; Bobrow et al., 1977). Particularly relevant to our work is a line of research on task-based methods, which formulates dialogue as a reinforcement learning problem to satisfy some user-relevant task (Walker, 2000; Levin et al., 2000). After the success of early deep-learning-based models (Vinyals & Le, 2015; Li et al., 2016), most state-of-the-art systems (Adiwardana et al., 2020; Zhang et al., 2020) are now generally built on top of large pretrained models (Radford et al.; Lewis et al., 2020). Research in dialogue systems focuses primarily on text-based models, and spoken dialogue systems are generally implemented as speech recognition/generation scaffolding around a text-based core (Chen et al., 2018). However, there have been recent efforts to address phenomena specific to spontaneous spoken dialogue, such as filled pauses (Székely et al., 2019) or non-verbal vocalizations and turn changes (Nguyen et al., 2021).

Our work belongs to a long line of research devoted to the use of computational models for simulating the emergence of language (Batali, 1998; Cangelosi & Parisi, 2002).
With the advent of deep learning, there has been a surge of interest in emergent communication among neural-network agents trained with reinforcement learning (Lazaridou et al., 2017; Foerster et al., 2016; Lazaridou & Baroni, 2020). The overwhelming majority of research in this area focuses on unidirectional sender-receiver communication (see Havrylov & Titov (2017); Bouchacourt & Baroni (2018); Chaabouni et al. (2020), inter alia) based on variants of the Lewis signaling game (Lewis, 1969; Skyrms, 2010). Although several authors have studied bidirectional communication (Kottur et al., 2017; Graesser et al., 2019), alternating turns are usually hard-coded into the game.

3 METHOD: A TESTBED FOR THE EMERGENCE OF TURN-TAKING

There is evidence in the conversation analysis literature that turn-taking systems in human languages almost universally tend to favor (1) few instances of overlapping speech and (2) minimal pauses between turns (Stivers et al., 2009). In this section, we describe a communication game based on a variant of the Lewis signaling game (Lewis, 1969) which encapsulates these features.
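One way these two pressures could be encoded in an episode score is sketched below. The paper's actual reward function is not specified in this section, so this formula is purely illustrative: success earns a base reward that decays with the number of rounds used (penalizing long pauses), while overlap is penalized implicitly because colliding messages waste a round.

```python
def episode_reward(solved: bool, rounds_used: int, max_rounds: int) -> float:
    """Hypothetical reward shaping for a game of this kind.

    Returns 0 on failure; on success, returns a reward in (0, 1] that
    shrinks as more rounds are consumed, so agents are pressured to
    finish quickly and to avoid wasted (overlapping) rounds.
    """
    if not solved:
        return 0.0
    return 1.0 - rounds_used / (max_rounds + 1)
```

With this shaping, a pair of agents that takes clean alternating turns finishes in fewer effective rounds than a pair that repeatedly talks over each other, and so earns a strictly higher score.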

3.1 GAME DESCRIPTION

The game is played between two agents, Agent 1 and Agent 2. Before every episode of the game, an object x is drawn according to a distribution p from a pool X, and both agents observe partial views of the object, x1 and x2. The goal of the game is for the agents to communicate enough information to reconstruct the original input x. Throughout the paper, we will experiment with two versions of

