ON THE WORD BOUNDARIES OF EMERGENT LANGUAGES BASED ON HARRIS'S ARTICULATION SCHEME

Abstract

This paper shows that emergent languages in signaling games lack meaningful word boundaries in terms of Harris's Articulation Scheme (HAS), a universal property of natural language. Emergent languages are artificial communication protocols arising among agents. However, it is not obvious whether such simulated languages have the same properties as natural language. In this paper, we test whether they satisfy HAS, which states that word boundaries in natural language can be obtained solely from phonemes. We adopt HAS-based word segmentation and verify whether emergent languages have meaningful word segments. Our experiments suggest that they do not, although they meet some preconditions for HAS. We thus identify a gap between emergent and natural languages that remains to be bridged, indicating that the standard signaling game satisfies the prerequisites but still lacks some necessary ingredients.

1. INTRODUCTION

Communication protocols emerging among artificial agents in a simulated environment are called emergent languages (Lazaridou & Baroni, 2020). It is important to investigate their structure to recognize and bridge the gap between natural and emergent languages, as several structural gaps have been reported (Kottur et al., 2017; Chaabouni et al., 2019). For instance, Kottur et al. (2017) indicated that emergent languages are not necessarily compositional. Such gaps are undesirable because major motivations in this area are to develop interactive AI (Foerster et al., 2016; Mordatch & Abbeel, 2018; Lazaridou et al., 2020) and to simulate the evolution of human language (Kirby, 2001; Graesser et al., 2019; Dagan et al., 2021). Previous work examined whether emergent languages have the same properties as natural languages, such as grammar (van der Wal et al., 2020), entropy minimization (Kharitonov et al., 2020), compositionality (Kottur et al., 2017), and Zipf's law of abbreviation (ZLA) (Chaabouni et al., 2019).[1]
Word segmentation is another direction for understanding the structure of emergent languages, because natural languages are constructed not only from words into sentences but also from phonemes into words (Martinet, 1960). However, previous studies have not gone so far as to address word segmentation: they either treat each symbol in emergent messages as if it were a "word" (Kottur et al., 2017; van der Wal et al., 2020) or ensure that a whole message constitutes just one "word" (Chaabouni et al., 2019; Kharitonov et al., 2020).
The purpose of this paper is to study whether Harris's articulation scheme (HAS) (Harris, 1955; Tanaka-Ishii, 2021) also holds in emergent languages. HAS is a statistical universal in natural languages.[2] Its basic idea is that we can obtain word segments from the statistical information of phonemes without referring to word meanings. HAS holds not only for phonemes but also for other symbol units such as characters.
HAS can be used for unsupervised word segmentation (Tanaka-Ishii, 2005), which allows us to study the structure of emergent languages. Such unsupervised methods are appropriate because neither word segments nor meanings are available beforehand in emergent languages. In addition to the absence of ground-truth segmentation data, it is not even evident whether emergent languages have meaningful word segments in the first place. If they do not, we have found another gap between emergent and natural languages. In this paper, we pose several verifiable questions to determine whether their segments are meaningful. Importantly, some of the questions are applicable to any word segmentation scheme, not only HAS, so they can serve as general criteria for future work on the segmentation of emergent languages.
Figure 1: Illustration of a signaling game. Section 3.1 gives its formal definition. In each play, a sender agent obtains an input and converts it to a sequential message. A receiver agent receives the message and converts it to an output. Each agent is represented as an encoder-decoder model.
To simulate the emergence of language, we adopt Lewis's signaling game (Lewis, 1969). This game involves two agents, called sender S and receiver R, and allows only one-way communication from S to R. In each play, S obtains an input i ∈ I and converts i into a sequential message m = S(i) ∈ M. Then, R receives m ∈ M and predicts the original input. The goal of the game is the correct prediction R(m) = i. Figure 1 illustrates the signaling game. Here, we consider the set {m ∈ M | m = S(i), i ∈ I} as the dataset of an emergent language, to which the HAS-based boundary detection (Tanaka-Ishii, 2005) is applicable. The algorithm yields the segments of messages.
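As a rough illustration of the game just described, the following Python sketch plays it with hand-coded stand-ins for the two agents. Everything in it is an assumption made for illustration: the input space INPUTS, the alphabet size, and the functions sender, receiver, and play are hypothetical names, and the lookup-table agents merely stand in for the trained encoder-decoder models used in the paper.

```python
from itertools import product

# Hypothetical toy setup (not the paper's configuration): inputs are pairs of
# attribute values, and messages are length-2 tuples of integer symbols.
INPUTS = list(product(range(3), range(3)))  # input space I
ALPHABET_SIZE = 3


def sender(i):
    """Stand-in for S: deterministically encode input i as a message m = S(i)."""
    a, b = i
    return (a, (a + b) % ALPHABET_SIZE)


# Stand-in for R: invert the sender's code by table lookup.
DECODE = {sender(i): i for i in INPUTS}


def receiver(m):
    """Predict the original input from message m, i.e. compute R(m)."""
    return DECODE.get(m)


def play(i):
    """One play of the game: return the message and whether R(S(i)) == i."""
    m = sender(i)
    return m, receiver(m) == i


# The emergent-language dataset is the collection of messages S(i) for i in I.
dataset = [sender(i) for i in INPUTS]
success_rate = sum(play(i)[1] for i in INPUTS) / len(INPUTS)
```

Because this hand-coded sender is injective, every play succeeds. With trained agents, the list dataset would instead hold the learned messages, and it is this list that a boundary detection algorithm would take as input.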
Our experimental results showed that emergent languages arising from signaling games satisfy two preconditions for HAS: (i) the conditional entropy (Eq. 2) decreases monotonically, and (ii) the branching entropy (Eq. 1) repeatedly falls and rises. However, the results also suggest that the HAS-based boundaries are not necessarily meaningful: segments divided by the boundaries may not serve as meaning units, whereas words in natural languages do (Martinet, 1960). Bridging this gap between emergent and natural languages in terms of HAS, by giving rise to meaningful word boundaries, is left for future work.

2. HARRIS'S ARTICULATION SCHEME

This section introduces Harris's articulation scheme (HAS) and the HAS-based boundary detection algorithm. HAS is simple, deterministic, and easy to interpret while being linguistically motivated. Such simplicity is particularly important when we have neither prior knowledge nor ground-truth data for the target languages, as is the case for emergent languages.
In the paper "From phoneme to morpheme" (Harris, 1955), Harris hypothesized that word boundaries tend to occur at points where the number of possible successive phonemes reaches a local peak in a context. Harris (1955) illustrates this with the utterance "He's clever", which has the phoneme sequence /hiyzklevər/.[3] The number of possible successors after the first phoneme /h/ is 9: /w, y, i, e, æ, a, ə, o, u/. Next, the number of possible successors after /hi/ increases to 14. Likewise, the number of possible successors increases to 29 after /hiy/, stays at 29 after /hiyz/, decreases to 11 after /hiyzk/, decreases to 7 after /hiyzkl/, and so on. Peaks are found at /y/, /z/, and /r/, which divide the phoneme sequence into /hiy/+/z/+/klevər/. Thus, the utterance is divided into "He", "s", and "clever".
Harris's hypothesis can be reformulated from an information-theoretic point of view by replacing the number of successors with entropy. We review the mathematical formulation of the hypothesis as Harris's articulation scheme (HAS) and the HAS-based boundary detection (Tanaka-Ishii, 2005). HAS involves statistical information about phonemes but not word meanings. This is important because it gives a natural explanation for a well-known linguistic concept called double articulation (Martinet, 1960). Martinet (1960) pointed out that languages have two structures: phonemes (irrelevant to meanings) and meaning units (i.e., words and morphemes). HAS can construct meaning units without referring to meanings.
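The peak-finding idea in Harris's example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the boundary detection algorithm of Tanaka-Ishii (2005): it uses fixed-length contexts of n symbols, treats a plateau as a run of peaks so that both /y/ and /z/ qualify, and ignores utterance-final positions. The function names and the toy corpus are hypothetical. The only figures taken from the text are the six successor counts quoted for Harris's example.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def successor_table(corpus, n):
    """Count which symbols follow each length-n context in the corpus."""
    table = defaultdict(Counter)
    for seq in corpus:
        for k in range(n, len(seq)):
            table[seq[k - n:k]][seq[k]] += 1
    return table


def successor_variety(counter):
    """Harris-style score: number of distinct observed successors."""
    return len(counter)


def branching_entropy(counter):
    """Entropy-style score: entropy in bits of the next-symbol distribution."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def local_peaks(scores):
    """Interior indices not below either neighbor and above at least one."""
    return [j for j in range(1, len(scores) - 1)
            if scores[j] >= scores[j - 1] and scores[j] >= scores[j + 1]
            and (scores[j] > scores[j - 1] or scores[j] > scores[j + 1])]


def segment(seq, table, n, score=branching_entropy):
    """Split seq after every length-n context whose score is a local peak."""
    scores = [score(table.get(seq[k - n:k], Counter()))
              for k in range(n, len(seq))]
    cuts = [n + j for j in local_peaks(scores)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Successor counts quoted in the text for the prefixes
# /h/, /hi/, /hiy/, /hiyz/, /hiyzk/, /hiyzkl/.
harris_counts = [9, 14, 29, 29, 11, 7]
harris_peaks = local_peaks(harris_counts)  # positions of /hiy/ and /hiyz/

# Hypothetical toy corpus: three distinct words concatenated without spaces.
vocab = ["the", "cat", "dog", "sat", "ran"]
corpus = ["".join(p) for p in permutations(vocab, 3)]
table = successor_table(corpus, n=2)
segments = segment("thecatsat", table, n=2)
```

On the quoted counts, the peaks fall at the prefixes /hiy/ and /hiyz/, in line with the boundaries after /y/ and /z/ described above. On the toy corpus, both scoring functions split "thecatsat" into "the", "cat", and "sat", because the next symbol is fully predictable inside a word and uncertain only at its end.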



[1] ZLA states that the more frequently a word is used, the shorter it tends to be (Zipf, 1935).
[2] Note that this is different from the famous distributional hypothesis (Harris, 1954).
[3] There may be other representations for the phonemes, but we follow Harris's notation.

