ON THE WORD BOUNDARIES OF EMERGENT LANGUAGES BASED ON HARRIS'S ARTICULATION SCHEME

Abstract

This paper shows that emergent languages in signaling games lack meaningful word boundaries in terms of Harris's Articulation Scheme (HAS), a universal property of natural language. Emergent languages are artificial communication protocols arising among agents. However, it is not obvious whether such simulated languages have the same properties as natural language. In this paper, we test whether they satisfy HAS. HAS states that word boundaries can be obtained solely from phonemes in natural language. We adopt HAS-based word segmentation and verify whether emergent languages have meaningful word segments. The experiment suggests that they do not, although they meet some preconditions for HAS. We thus identify a gap between emergent and natural languages that remains to be bridged, indicating that the standard signaling game satisfies the prerequisites but still lacks some necessary ingredients.

1. INTRODUCTION

Communication protocols emerging among artificial agents in a simulated environment are called emergent languages (Lazaridou & Baroni, 2020). It is important to investigate their structure to recognize and bridge the gap between natural and emergent languages, as several structural gaps have been reported (Kottur et al., 2017; Chaabouni et al., 2019). For instance, Kottur et al. (2017) indicated that emergent languages are not necessarily compositional. Such gaps are undesirable because major motivations in this area are to develop interactive AI (Foerster et al., 2016; Mordatch & Abbeel, 2018; Lazaridou et al., 2020) and to simulate the evolution of human language (Kirby, 2001; Graesser et al., 2019; Dagan et al., 2021). Previous work examined whether emergent languages have the same properties as natural languages, such as grammar (van der Wal et al., 2020), entropy minimization (Kharitonov et al., 2020), compositionality (Kottur et al., 2017), and Zipf's law of abbreviation (ZLA) (Chaabouni et al., 2019).¹

Word segmentation is another direction for understanding the structure of emergent languages, because natural languages are constructed not only from words into sentences but also from phonemes into words (Martinet, 1960). However, previous studies have not addressed word segmentation: they either treat each symbol in emergent messages as if it were a "word" (Kottur et al., 2017; van der Wal et al., 2020), or ensure that a whole message constitutes just one "word" (Chaabouni et al., 2019; Kharitonov et al., 2020).

The purpose of this paper is to study whether Harris's articulation scheme (HAS) (Harris, 1955; Tanaka-Ishii, 2021) also holds in emergent languages. HAS is a statistical universal in natural languages.² Its basic idea is that word segments can be obtained from the statistical information of phonemes alone, without referring to word meanings. HAS holds not only for phonemes but also for other symbol units such as characters. HAS can be used for unsupervised word segmentation (Tanaka-Ishii, 2005), which allows us to study the structure of emergent languages. Such unsupervised methods are appropriate because neither word segments nor meanings are available beforehand in emergent languages.
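To make the idea of segmenting from symbol statistics alone concrete, the following is a minimal sketch of one common way to operationalize HAS: compute the entropy of the next symbol after each context (often called branching entropy) and place a boundary where that entropy rises as the context grows. The function names, the `max_n` and `threshold` parameters, the exact boundary criterion, and the toy corpus are all assumptions made for illustration; they are not taken from this paper and may differ from the procedure it actually uses.

```python
import math
from collections import Counter, defaultdict


def branching_entropies(corpus, max_n):
    """Entropy (in bits) of the next symbol after each context of length 1..max_n."""
    successors = defaultdict(Counter)
    for msg in corpus:
        for i in range(len(msg)):
            for n in range(1, max_n + 1):
                if i + n < len(msg):  # a successor symbol must exist
                    successors[tuple(msg[i:i + n])][msg[i + n]] += 1
    entropies = {}
    for context, counts in successors.items():
        total = sum(counts.values())
        entropies[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropies


def segment(msg, entropies, max_n, threshold=0.0):
    """Cut msg wherever branching entropy rises by more than threshold.

    Assumed criterion: extending a context by one symbol increases the
    entropy of what follows, which is read as evidence of a boundary.
    """
    boundaries = set()
    for i in range(len(msg)):
        prev = None
        for n in range(1, max_n + 1):
            j = i + n  # candidate boundary between msg[j - 1] and msg[j]
            if j >= len(msg):
                break
            h = entropies.get(tuple(msg[i:j]))
            if h is None:  # context never observed in the corpus
                break
            if prev is not None and h - prev > threshold:
                boundaries.add(j)
            prev = h
    words, start = [], 0
    for b in sorted(boundaries):
        words.append(msg[start:b])
        start = b
    words.append(msg[start:])
    return words


# Hypothetical toy corpus: every message concatenates the "words" ab and cd.
corpus = ["abab", "abcd", "cdab", "cdcd"]
entropies = branching_entropies(corpus, max_n=2)
segmented = [segment(m, entropies, max_n=2) for m in corpus]
```

In this toy corpus, the symbol after `a` is always `b` (entropy 0), whereas the symbol after `ab` is `a` or `c` with equal frequency (entropy 1 bit), so the sketch places a boundary after `ab`. No meanings are consulted at any point; only symbol co-occurrence counts are used.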



¹ ZLA states that the more frequently a word is used, the shorter it tends to be (Zipf, 1935).

² Note that HAS is different from the famous distributional hypothesis (Harris, 1954).

