ON THE WORD BOUNDARIES OF EMERGENT LANGUAGES BASED ON HARRIS'S ARTICULATION SCHEME

Abstract

This paper shows that emergent languages in signaling games lack meaningful word boundaries in terms of Harris's articulation scheme (HAS), a universal property of natural language. Emergent languages are artificial communication protocols that arise among agents. However, it is not obvious whether such simulated languages share the properties of natural language. In this paper, we test whether they satisfy HAS. HAS states that word boundaries can be obtained solely from phonemes in natural language. We adopt HAS-based word segmentation and verify whether emergent languages have meaningful word segments. Our experiments suggest that they do not, although they meet some preconditions for HAS. We thus uncover a gap between emergent and natural languages that remains to be bridged: the standard signaling game satisfies the prerequisites for HAS but still lacks some necessary ingredients.

1. INTRODUCTION

Communication protocols emerging among artificial agents in a simulated environment are called emergent languages (Lazaridou & Baroni, 2020). It is important to investigate their structure to recognize and bridge the gap between natural and emergent languages, as several structural gaps have been reported (Kottur et al., 2017; Chaabouni et al., 2019). For instance, Kottur et al. (2017) indicated that emergent languages are not necessarily compositional. Such gaps are undesirable because the major motivations in this area are to develop interactive AI (Foerster et al., 2016; Mordatch & Abbeel, 2018; Lazaridou et al., 2020) and to simulate the evolution of human language (Kirby, 2001; Graesser et al., 2019; Dagan et al., 2021). Previous work examined whether emergent languages have the same properties as natural languages, such as grammar (van der Wal et al., 2020), entropy minimization (Kharitonov et al., 2020), compositionality (Kottur et al., 2017), and Zipf's law of abbreviation (ZLA) (Chaabouni et al., 2019).

Word segmentation would be another way to understand the structure of emergent languages, because natural languages are built not only from word to sentence but also from phoneme to word (Martinet, 1960). However, previous studies have not gone so far as to address word segmentation: they either treat each symbol in emergent messages as if it were a "word" (Kottur et al., 2017; van der Wal et al., 2020) or ensure that a whole message constitutes just one "word" (Chaabouni et al., 2019; Kharitonov et al., 2020). The purpose of this paper is to study whether Harris's articulation scheme (HAS) (Harris, 1955; Tanaka-Ishii, 2021) also holds in emergent languages. HAS is a statistical universal in natural languages. Its basic idea is that word segments can be obtained from the statistical information of phonemes, without referring to word meanings. HAS holds not only for phonemes but also for other symbol units such as characters.
HAS can be used for unsupervised word segmentation (Tanaka-Ishii, 2005), which allows us to study the structure of emergent languages. Such unsupervised methods are appropriate here, since word segments and meanings are not available beforehand in emergent languages.

Figure 1: Illustration of a signaling game. Section 3.1 gives its formal definition. In each play, a sender agent obtains an input and converts it to a sequential message. A receiver agent receives the message and converts it to an output. Each agent is represented as an encoder-decoder model.

In addition to the absence of ground-truth segmentation data, it is not even evident, in the first place, whether emergent languages have meaningful word segments at all. If they do not, we have found another gap between emergent and natural languages. In this paper, we pose several verifiable questions to answer whether their segments are meaningful. Importantly, some of the questions apply not only to HAS but to any word segmentation scheme, so they can serve as general criteria for future work on the segmentation of emergent languages. To simulate the emergence of language, we adopt Lewis's signaling game (Lewis, 1969). This game involves two agents called the sender S and the receiver R, and allows only one-way communication from S to R. In each play, S obtains an input i ∈ I and converts i into a sequential message m = S(i) ∈ M. Then, R receives m ∈ M and predicts the original input. The goal of the game is the correct prediction R(m) = i. Figure 1 illustrates the signaling game. Here, we consider the set {S(i) | i ∈ I} as the dataset of an emergent language, to which the HAS-based boundary detection (Tanaka-Ishii, 2005) is applicable. The algorithm yields the segments of messages.
Our experimental results show that emergent languages arising from signaling games satisfy two preconditions for HAS: (i) the conditional entropy (Eq. 2) decreases monotonically, and (ii) the branching entropy (Eq. 1) repeatedly falls and rises. However, the results also suggest that the HAS-based boundaries are not necessarily meaningful: the segments they delimit may not serve as meaning units, while words in natural languages do (Martinet, 1960). Bridging this gap between emergent and natural languages, i.e., giving rise to meaningful word boundaries, is left for future work.

2. HARRIS'S ARTICULATION SCHEME

This section introduces Harris's articulation scheme (HAS) and the HAS-based boundary detection algorithm. HAS has the advantages of being simple, deterministic, and easy to interpret while being linguistically motivated. Such simplicity is particularly important when we have neither prior knowledge nor ground-truth data of the target languages, as is the case for emergent languages.

In the paper "From phoneme to morpheme" (Harris, 1955), Harris hypothesized that word boundaries tend to occur at points where the number of possible successive phonemes reaches a local peak in a context. Harris (1955) exemplifies this with the utterance "He's clever", which has the phoneme sequence /hiyzklev@r/. The number of possible successors after the first phoneme /h/ is 9: /w, y, i, e, ae, a, @, o, u/. Next, the number of possible successors after /hi/ increases to 14. Likewise, it increases to 29 after /hiy/, stays at 29 after /hiyz/, decreases to 11 after /hiyzk/, decreases to 7 after /hiyzkl/, and so on. Peaks are found at /y/, /z/, and /r/, dividing the phoneme sequence into /hiy/ + /z/ + /klev@r/. Thus, the utterance is divided into "He", "s", and "clever". Harris's hypothesis can be reformulated from an information-theoretic point of view by replacing the number of successors with entropy. We review the mathematical formulation of the hypothesis as Harris's articulation scheme (HAS) and the HAS-based boundary detection (Tanaka-Ishii, 2005). HAS involves the statistical information of phonemes but not word meanings. This is important because it gives a natural explanation for the well-known linguistic concept of double articulation (Martinet, 1960): Martinet (1960) pointed out that languages have two levels of structure, phonemes (irrelevant to meanings) and meaning units (i.e., words and morphemes). HAS can construct meaning units without referring to meanings.
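Harris's successor-counting procedure is easy to operationalize. The following sketch is our own illustration (the toy corpus and the function name are assumptions, not Harris's data); it counts the distinct symbols observed after each prefix of a sequence:

```python
def successor_counts(corpus, sequence):
    """Number of distinct symbols observed after each prefix of `sequence`.

    Local peaks in this count are Harris's candidate word boundaries.
    """
    counts = []
    for n in range(1, len(sequence) + 1):
        prefix = sequence[:n]
        successors = {seq[i + n]
                      for seq in corpus
                      for i in range(len(seq) - n)
                      if seq[i:i + n] == prefix}
        counts.append(len(successors))
    return counts

# Toy corpus: after "a" we see {b, e}; after "ab" we see {c, d}.
print(successor_counts(["abc", "abd", "aec"], "ab"))  # → [2, 2]
```

In Harris's procedure, one would scan the counts for local peaks and cut the sequence there.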

2.1. MATHEMATICAL FORMULATION OF HARRIS'S HYPOTHESIS

While Harris (1955) focuses on phonemes for word boundary detection, Tanaka-Ishii (2021) suggests that the hypothesis is also applicable to units other than phonemes. Therefore, in this section, a set of units is called an alphabet X, as a purely mathematical notion that is not restricted to phonemes. Tanaka-Ishii (2005) uses characters for the same purpose, and Frantzi & Ananiadou (1996) and Tanaka-Ishii & Ishii (2007) investigate the detection of collocations from words.

Let X be an alphabet and X^n the set of all n-grams over X. We denote by X_i a random variable over X indexed by i, and by X_{i:j} the random-variable sequence from X_i to X_j. The formulation by Tanaka-Ishii (2005) involves two kinds of entropy: branching entropy and conditional entropy (Cover & Thomas, 2006). The branching entropy of a random variable X_n after a sequence s = x_0 ⋯ x_{n−1} ∈ X^n is defined as:

h(s) ≡ H(X_n | X_{0:n−1} = s) = −Σ_{x∈X} P(x | s) log₂ P(x | s),  (1)

where P(x | s) = P(X_n = x | X_{0:n−1} = s). Intuitively, the branching entropy h(s) measures how many elements can occur after s, i.e., the uncertainty of the next element after s. In addition to h(s), the conditional entropy of a random variable X_n after an n-gram sequence X_{0:n−1} is defined as:

H(n) ≡ H(X_n | X_{0:n−1}) = −Σ_{s∈X^n} P(s) Σ_{x∈X} P(x | s) log₂ P(x | s),  (2)

where P(s) = P(X_{0:n−1} = s). The conditional entropy H(n) can be regarded as the mean of h(s) over n-gram sequences s ∈ X^n, since H(n) = Σ_{s∈X^n} P(s) h(s). H(n) is known to decrease monotonically in natural languages (Bell et al., 1990). Thus, for a partial sequence x_{0:n−1} ∈ X^n, h(x_{0:n−2}) > h(x_{0:n−1}) holds on average, although h(s) repeatedly falls and rises depending on the specific s. Based on these properties, Harris's articulation scheme (HAS) is formulated as:

If there is some partial sequence x_{0:n−1} ∈ X^n (n > 1) such that h(x_{0:n−2}) < h(x_{0:n−1}), then x_n is at a boundary.  (3)
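The two entropies can be estimated empirically from a corpus of sequences with plug-in frequency counts; a minimal sketch (the function names are ours):

```python
from collections import Counter
from math import log2

def branching_entropy(corpus, s):
    """h(s): empirical entropy of the next symbol after context s."""
    nexts = Counter()
    n = len(s)
    for seq in corpus:
        for i in range(len(seq) - n):
            if tuple(seq[i:i + n]) == tuple(s):
                nexts[seq[i + n]] += 1
    total = sum(nexts.values())  # assumes s occurs with a successor
    return -sum(c / total * log2(c / total) for c in nexts.values())

def conditional_entropy(corpus, n):
    """H(n): mean of h(s) over all observed n-grams s, weighted by P(s)."""
    ngrams = Counter()
    for seq in corpus:
        for i in range(len(seq) - n):
            ngrams[tuple(seq[i:i + n])] += 1
    total = sum(ngrams.values())
    return sum(c / total * branching_entropy(corpus, s)
               for s, c in ngrams.items())
```

For a uniformly random source, h(s) is flat at log₂|X|; the interesting case for HAS is when h(s) falls inside units and rises at their edges.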

2.2. BOUNDARY DETECTION ALGORITHM BASED ON HARRIS'S ARTICULATION SCHEME

Algorithm 1: Boundary Detection Algorithm
1: i ← 0; w ← 1; B ← {}
2: while i < n do
3:   compute h(s_{i:i+w−1})
4:   if w > 1 and h(s_{i:i+w−1}) − h(s_{i:i+w−2}) > threshold then
5:     B ← B ∪ {i + w}
6:   end if
7:   if i + w < n − 1 then
8:     w ← w + 1
9:   else
10:    i ← i + 1; w ← 1
11:  end if
12: end while

This section introduces the HAS-based boundary detection algorithm (Tanaka-Ishii, 2005). Let s = x_0 ⋯ x_{n−1} ∈ X^n, and denote by s_{i:j} its partial sequence x_i ⋯ x_j. Given s and a parameter threshold, the boundary detection algorithm yields boundaries B, proceeding as shown in Algorithm 1. Entropy-based methods have the advantages of simplicity and interpretability, though they are inferior to state-of-the-art methods in terms of F-score. Simplicity and interpretability are important when analyzing emergent languages, for which we have neither prior knowledge nor ground-truth data of word segmentation; recall that it is not even evident whether they have meaningful word segments at all. Since our targets are emergent languages, the outputs of the boundary detection algorithm do not necessarily mark articulatory boundaries. Instead, we call them hypothetical boundaries (hypo-boundaries) and refer to the segments delimited by hypo-boundaries as hypo-segments.
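A direct Python transcription of Algorithm 1 might look as follows, with the branching entropy h supplied as a function (estimated elsewhere); the entropy values in the usage example are made up for illustration:

```python
def detect_boundaries(s, h, threshold):
    """HAS-based boundary detection (a transcription of Algorithm 1).

    s: a sequence; h: branching-entropy function over subsequences.
    Note that s_{i:i+w-1} (inclusive) corresponds to Python's s[i:i + w].
    """
    n = len(s)
    boundaries = set()  # B
    i, w = 0, 1
    while i < n:
        if w > 1 and h(s[i:i + w]) - h(s[i:i + w - 1]) > threshold:
            boundaries.add(i + w)  # entropy rose by more than threshold
        if i + w < n - 1:
            w += 1
        else:
            i, w = i + 1, 1
    return boundaries

# Made-up entropy values: h rises from "a" to "ab", so 2 is a boundary.
toy_h = lambda x: {"a": 1.0, "ab": 3.0}.get(x, 0.0)
print(detect_boundaries("abc", toy_h, 1.0))  # → {2}
```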

3. EMERGENT LANGUAGE ARISING FROM SIGNALING GAME

We have to define the environment, agent architecture, and optimization method for a language emergence simulation. This paper adopts the framework of Chaabouni et al. (2020). We define the environment in Section 3.1, specify the agent architecture and optimization method in Section 3.2, and explain the compositionality of emergent languages in Section 3.3.

3.1. SIGNALING GAME

An environment is formulated based on Lewis's signaling game (Lewis, 1969). A signaling game G is a quadruple (I, M, S, R), where I is an input space, M is a message space, S : I → M is a sender agent, and R : M → I is a receiver agent. The goal is the correct reconstruction i = R(S(i)) for all i ∈ I. While the input space I and the message space M are fixed, the agents S and R are trained toward this goal. An illustration of a signaling game is shown in Figure 1. Following Chaabouni et al. (2020), we define I as an attribute-value set D^{n_att}_{n_val} (defined below) and M as the set of discrete sequences of fixed length k over a finite alphabet A:

I ≡ D^{n_att}_{n_val},  M ≡ A^k = {a_1 ⋯ a_k | a_j ∈ A}.

Attribute-Value Set. Let n_att and n_val be positive integers called the number of attributes and the number of values, respectively. The attribute-value set D^{n_att}_{n_val} is the set of ordered tuples defined as follows:

D^{n_att}_{n_val} ≡ {(v_1, …, v_{n_att}) | v_j ∈ {1, …, n_val}}.

This is Chaabouni et al. (2020)'s abstraction of the attribute-value object paradigm (e.g., Kottur et al., 2017). Intuitively, each index j of a vector (v_1, …, v_j, …, v_{n_att}) is an attribute (e.g., color), while each v_j is an attribute value (e.g., blue, green, red, or purple).
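The attribute-value set is simply a Cartesian product; a sketch of how such an input space might be enumerated (the function name is our own):

```python
from itertools import product

def attribute_value_set(n_att, n_val):
    """D^{n_att}_{n_val}: all ordered n_att-tuples over values {1, ..., n_val}."""
    return list(product(range(1, n_val + 1), repeat=n_att))

# Two attributes with three values each: |D| = n_val ** n_att = 9 objects.
objects = attribute_value_set(2, 3)
print(len(objects))  # → 9
```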

3.2. ARCHITECTURE AND OPTIMIZATION

We follow Chaabouni et al. (2020) for the architecture and optimization method as well. Each agent is represented as an encoder-decoder model (Figure 1): the sender's decoder and the receiver's encoder are single-layer GRUs (Cho et al., 2014), while the sender's encoder and the receiver's decoder are linear functions. Each element i ∈ D^{n_att}_{n_val} is vectorized to be fed into or output from the linear functions. Formally, each i = (v_1, …, v_{n_att}) is converted into the (n_att × n_val)-dimensional vector that concatenates the n_att one-hot representations of the v_j. During training, the sender samples messages probabilistically; at test time, it decodes greedily and thus serves as a deterministic function. Similarly, the receiver's output layer, followed by a softmax, defines n_att categorical distributions over the values {1, …, n_val} during training; at test time, the n_att values are chosen greedily. The agents are optimized with a stochastic computation graph (Schulman et al., 2015), a combination of REINFORCE (Williams, 1992) and standard backpropagation: the sender is optimized with the former, the receiver with the latter.

3.3. COMPOSITIONALITY OF EMERGENT LANGUAGES

The attribute-value set D^{n_att}_{n_val} of Chaabouni et al. (2020) is an extension of the attribute-value setting (Kottur et al., 2017) introduced to measure the compositionality of emergent languages. While the concept of compositionality varies from domain to domain, researchers in this area typically regard it as the disentanglement of representation learning. Kottur et al. (2017), for instance, set up an environment where objects have two attributes, color and shape, each with several possible values (e.g., blue, red, … for color and circle, star, … for shape). They assumed that if a language is sufficiently compositional, each message would be a composition of symbols denoting the color value and the shape value separately. This concept has been the basis for subsequent studies (Li & Bowling, 2019; Andreas, 2019; Ren et al., 2020; Chaabouni et al., 2020).

Topographic Similarity. Topographic similarity (TopSim) (Brighton & Kirby, 2006; Lazaridou et al., 2018) is the de facto compositionality measure in the emergent communication literature. Suppose we have distance functions d_I and d_M for the spaces I and M, respectively. TopSim is defined as the Spearman correlation between the distances d_I(i_1, i_2) and d_M(S(i_1), S(i_2)) for all i_1, i_2 ∈ I such that i_1 ≠ i_2. This definition reflects the intuition that compositional languages should map similar (resp. dissimilar) inputs to similar (resp. dissimilar) messages. Following previous work using attribute-value objects (e.g., Chaabouni et al., 2020), we define d_I as the Hamming distance and d_M as the edit distance. Because this paper is about message segmentation, we can consider two types of edit distance: the "character" edit distance, which regards elements a ∈ A as symbols, and the "word" edit distance, which regards hypo-segments as symbols. We call the former C-TopSim and the latter W-TopSim.
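As an illustration, TopSim can be computed with nothing beyond the standard library; the sketch below hand-rolls the Spearman correlation with average ranks for ties (in practice one would use scipy.stats.spearmanr), and all function names are our own:

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def spearman(xs, ys):
    def rank(v):
        order = sorted(range(len(v)), key=lambda k: v[k])
        r = [0.0] * len(v)
        k = 0
        while k < len(v):  # assign average ranks to ties
            j = k
            while j + 1 < len(v) and v[order[j + 1]] == v[order[k]]:
                j += 1
            for t in range(k, j + 1):
                r[order[t]] = (k + j) / 2 + 1
            k = j + 1
        return r
    rx, ry = rank(xs), rank(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def topsim(inputs, messages):
    pairs = list(combinations(range(len(inputs)), 2))
    d_i = [hamming(inputs[a], inputs[b]) for a, b in pairs]
    d_m = [edit_distance(messages[a], messages[b]) for a, b in pairs]
    return spearman(d_i, d_m)

inputs = [(1, 1), (1, 2), (2, 1), (2, 2)]
messages = ["aa", "ab", "ba", "bb"]  # perfectly compositional toy language
```

On the perfectly compositional toy language above, where each attribute is mapped to its own symbol position, topsim returns 1.0.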
4. PROBLEM DEFINITION

The purpose of this paper is to study whether Harris's articulation scheme (HAS) also holds in emergent languages. However, this question is too vague to answer directly. We first divide it into the following:

Q1. Does the conditional entropy H decrease monotonically?
Q2. Does the branching entropy h repeatedly fall and rise?
Q3. Do hypo-boundaries represent meaningful boundaries?

Q3 is the same as the original question, only slightly more formal. However, we have to answer Q1 and Q2 beforehand, because HAS implicitly takes it for granted that H decreases monotonically and that h jitters. Although both Q1 and Q2 generally hold in natural languages, neither is trivial in emergent languages. Figure 2 illustrates Q1, Q2, and Q3.

It is straightforward to answer Q1 and Q2, as we just need to calculate H and h. In contrast, Q3 is still too vague to answer, since we have no prior knowledge about the boundaries of emergent languages and do not even know whether they have such boundaries. To mitigate this, we posit the following necessary conditions for Q3. Let G be a game (D^{n_att}_{n_val}, A^k, S, R). If the answer to Q3 is yes, then:

C1. the mean number of hypo-boundaries per message should increase as n_att increases,
C2. the size of the vocabulary (the set of all hypo-segments) should increase as n_val increases,
C3. W-TopSim should be higher than C-TopSim.

About C1 and C2. The attribute-value set D^{n_att}_{n_val} was originally introduced to measure compositionality. Compositionality, in this context, means the extent to which symbols in a message separately denote the components of its meaning. In our case, each segment, or word, can be thought of as a unit that denotes attribute values, so the number of words in a message should increase as the number of attributes increases. Therefore, if the answer to Q3 is yes, C1 should hold. Likewise, the vocabulary size should grow in proportion to the number of values n_val, motivating C2. Here, by vocabulary we mean the set of all hypo-segments. Note that the message length is fixed, because otherwise the number of hypo-segments would be subject to variable message length as well as (n_att, n_val), obscuring the implication of the results.

About C3. C3 comes from an analogy with the linguistic concept of double articulation (Martinet, 1960). In natural languages, meanings are quite arbitrarily related to the phonemes that construct them, while they are less arbitrarily related to words: phonemes do not denote meaning units, but words do.
In our case, for example, the attribute-value object (RED, CIRCLE) seems less compositionally related to the character sequence "r,e,d,c,i,r,c,l,e", while it seems more compositionally related to the word sequence "red,circle". This intuition motivates C3. Based on conditions C1, C2, and C3, Q3 is restated as follows:

Q3-1. Does the mean number of hypo-boundaries per message increase as n_att increases?
Q3-2. Does the vocabulary size increase as n_val increases?
Q3-3. Is W-TopSim higher than C-TopSim?

Importantly, Q3-1, Q3-2, and Q3-3 are applicable not only to HAS but to any word segmentation scheme, as they no longer involve Eq. 1 or Eq. 2. Thus, they can serve as general criteria for future work on the segmentation of emergent languages. We also verified that the conditions C1, C2, and C3 are indeed satisfied by synthetic languages that are ideally compositional (see Appendix A).
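For concreteness, once boundary positions are known, hypo-segments and the vocabulary can be extracted as follows (a sketch of our own; the messages and boundary positions below are toy values, not detector outputs):

```python
def hypo_segments(message, boundaries):
    """Split a message at the given boundary positions into hypo-segments."""
    cuts = [0] + sorted(boundaries) + [len(message)]
    return [message[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

# vocabulary = the set of all hypo-segments across the language's messages
language = {"12345": {2, 4}, "12367": {2}}
vocab = {seg for msg, b in language.items() for seg in hypo_segments(msg, b)}
print(sorted(vocab))  # → ['12', '34', '367', '5']
```

C1 then concerns the mean of len(boundaries) per message, and C2 concerns len(vocab).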

5.1. PARAMETER SETTINGS

Input Space. n_att and n_val have to be varied to answer Q3-1, Q3-2, and Q3-3, while the sizes of the input spaces |I| = (n_val)^{n_att} must be equal to each other to balance the complexities of the games. Therefore, we fix |I| = 4096 and vary (n_att, n_val) as follows:

(n_att, n_val) ∈ {(1, 4096), (2, 64), (3, 16), (4, 8), (6, 4), (12, 2)}.

Message Space. The message length k and alphabet A have to be determined for a message space M = A^k. We set k = 32, similarly to previous work on ZLA (Chaabouni et al., 2019; Rita et al., 2020; Ueda & Washio, 2021) that regards each a ∈ A as a "character." Note that k = 32 is much longer than in previous work on compositionality (Chaabouni et al., 2020; Ren et al., 2020; Li & Bowling, 2019), which typically adopts k ≈ n_att, as if each symbol a ∈ A were a "word." We set A = {1, …, 8}. Its size |A| should be as small as possible to avoid data sparsity in boundary detection and to ensure that each symbol a ∈ A serves as a "character." In preliminary experiments, we tested |A| ∈ {2, 4, 8, 16} and found that learning is stable when |A| ≥ 8.
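A quick sanity check that all six configurations indeed share the same input-space size |I| = (n_val)^{n_att} = 4096:

```python
configs = [(1, 4096), (2, 64), (3, 16), (4, 8), (6, 4), (12, 2)]
sizes = [n_val ** n_att for n_att, n_val in configs]
assert all(size == 4096 for size in sizes)  # every game has |I| = 4096
```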

Architecture and Optimization

We follow Chaabouni et al. (2020) for the agent architectures, objective function, and optimization method. The hidden size of the GRU (Cho et al., 2014) is set to 500, following Chaabouni et al. (2020). All data from an input space I = D^{n_att}_{n_val} are used for training. This dataset is upsampled 100 times, following the default setting of the code of Chaabouni et al. (2020), and the learning rate is set to 0.001, also following Chaabouni et al. (2020). Based on preliminary experiments to explore stable learning, the sender S and receiver R are trained for 200 epochs, and the coefficient of the entropy regularizer is set to 0.01.

Boundary Detection Algorithm

The boundary detection algorithm involves a parameter threshold. Since its appropriate value is unclear, we vary it as follows:

threshold ∈ {0, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2}.

5.2. IMPLEMENTATION, NUMBER OF TRIALS, AND LANGUAGE VALIDITY

We implemented the code for training agents using the EGG toolkit (Kharitonov et al., 2019). EGG also includes the implementation of Chaabouni et al. (2020), to which we largely refer. Each run took a few hours on a single GPU. In the following sections, an emergent language with a communication success rate of more than 90% is called a successful language.

6. RESULTS

As a result of training the agents, we obtained 7, 8, 6, 8, 7, and 6 successful languages out of 8 runs for the configurations (n_att, n_val) = (1, 4096), (2, 64), (3, 16), (4, 8), (6, 4), and (12, 2), respectively.

6.1. CONDITIONAL ENTROPY MONOTONICALLY DECREASES

To verify Q1, we show the conditional entropy H(n) (Eq. 2) in Figure 3. The conditional entropies of the successful languages (solid red lines) decrease monotonically, which confirms Q1 for successful languages. Interestingly, the conditional entropies of languages derived from untrained senders do not necessarily decrease, shown as dashed blue lines in Figure 3 (see Appendix B for why the conditional entropy can increase in our setting). The monotonic decrease in conditional entropy thus emerges through training.

6.2. BRANCHING ENTROPY REPEATEDLY FALLS AND RISES

Next, to answer Q2, we computed the branching entropy h(s) (Eq. 1) of the successful languages and applied boundary detection. As an example, Figure 4 shows a few actual transitions of h(s), where the y-axis represents the value of h(s) and the x-axis represents a message "3,8,4,…,4,4,4" randomly sampled from a successful language with (n_att, n_val) = (2, 64). The boundary detection algorithm with threshold = 1 yields three hypo-boundaries, represented as dashed black lines in Figure 4. Blue, yellow, and green lines with triangle markers represent the transitions of h(s) that yield hypo-boundaries. Note that the (i + 1)-th transition of h(s) does not necessarily start from the i-th hypo-boundary, due to the definition of the algorithm; for instance, the second transition overlaps the first hypo-boundary. While the conditional entropy decreases monotonically (Figure 3), the branching entropy repeatedly falls and rises (Figure 4). Moreover, Figure 5 shows the mean number of hypo-boundaries per message: for every (n_att, n_val) configuration, there are hypo-boundaries whenever threshold < 2, i.e., the branching entropy repeatedly falls and rises. These results confirm Q2.

6.3. HYPO-BOUNDARIES MAY NOT BE MEANINGFUL BOUNDARIES

Next, we investigate whether Q3-1, Q3-2, and Q3-3 hold in the successful languages. The results in the following sections falsify all of them. Thus, Q3 may not be true: hypo-boundaries may not be meaningful boundaries.

Mean Number of Hypo-Boundaries per Message. See Figure 5 again. The figure shows that the mean number of hypo-boundaries per message does not increase as n_att increases; it does not decrease, either. This result falsifies Q3-1. Even when n_att = 1, there are as many hypo-boundaries as in the other configurations.

Vocabulary Size. Figure 6 shows the mean vocabulary sizes for each (n_att, n_val). The vocabulary size does not increase as n_val increases, which falsifies Q3-2. However, focusing on (n_att, n_val) ∈ {(2, 64), (3, 16), (4, 8), (6, 4)} and 0.25 ≤ threshold ≤ 1, there is a weak tendency to support C2, suggesting that hypo-segments are not completely meaningless either.

C-TopSim vs. W-TopSim. Figure 7 shows C-TopSim and W-TopSim for each (n_att, n_val) and threshold (note that TopSim is defined only when n_att > 1). Recall that C-TopSim is TopSim with the "character" edit distance, and W-TopSim is TopSim with the "word" edit distance. In Figure 7, threshold = −∞ corresponds to C-TopSim, while the others correspond to W-TopSim. Our assumption in Q3-3 was C-TopSim < W-TopSim. On the contrary, Figure 7 shows a clear tendency for C-TopSim > W-TopSim, which falsifies Q3-3. Hypo-boundaries may not be meaningful. However, they may not be completely meaningless, either: the hypo-boundary-based W-TopSim is higher than the random-boundary-based W-TopSim in Figure 8. Here, by random boundaries we mean boundaries chosen randomly in the same number as the hypo-boundaries in each message. Other (n_att, n_val) configurations show similar tendencies (see Appendix C).

6.4. FURTHER INVESTIGATION

Ease-of-Teaching Setting Does Not Contribute to HAS. Methods proposed to improve TopSim might also contribute to giving rise to HAS. We picked the Ease-of-Teaching setting (EoT, Li & Bowling, 2019), as it is known to be a simple and effective way to improve TopSim. However, the EoT setting did not contribute to the emergence of HAS, while it did improve C-TopSim (see Appendix D).

Word Length and Word Frequency. The results so far are related to the compositionality of emergent languages (e.g., Kottur et al., 2017). We further connect them to previous discussions of Zipf's law of abbreviation (ZLA) in emergent languages in Appendix E.

7. DISCUSSION

In Section 6.1, we showed that the conditional entropy H(n) decreases monotonically in emergent languages, confirming Q1. In Section 6.2, we demonstrated that the branching entropy h(s) repeatedly falls and rises, confirming Q2. This is an intriguing result in light of Kharitonov et al. (2020), who showed that the entropy decreases to the minimum required for successful communication when the message length is k = 1. In contrast, our results suggest that the (branching) entropy does not simply fall to the minimum when messages are longer. However, in Section 6.3, our results indicated that the hypo-boundaries may not be meaningful, since Q3-1, Q3-2, and Q3-3 were all falsified. Nevertheless, hypo-boundaries may not be completely meaningless either, because the hypo-boundary-based W-TopSim is higher than the random-boundary-based W-TopSim, suggesting that HAS-based boundary detection worked to some extent. One limitation of this paper is that we focused on a specific neural architecture, the GRU (Cho et al., 2014), as the agent basis, following Chaabouni et al. (2020); whether articulation structure arises with other architectures remains to be answered. Another limitation is that our HAS-based criteria rely heavily on attribute-value settings, which obscures how to apply them to other settings, e.g., visual referential games (Havrylov & Titov, 2017; Choi et al., 2018). At least, however, C3 is applicable to the referential game by using cosine dissimilarity between hidden states (Lazaridou et al., 2018). Finally, the question of why the current emergent languages fail to give rise to HAS is left open.

8. RELATED WORK

Segmentation of Emergent Language. Resnick et al. (2020) proposed a compositionality measure called residual entropy (RE) that considers the partition of emergent messages, akin to the segmentation in this paper. However, RE differs from our HAS-based criteria in the following respects. First, RE assumes that all messages are partitioned in the same way, i.e., boundary positions are identical across messages. Second, computing a partition involves an explicit reference to attribute values, which is opposed to HAS's statement that boundaries can be estimated without reference to meanings. Third, RE can be hard to compute.

Unsupervised Word Segmentation. Regarding unsupervised word segmentation, this paper focused on HAS for its simplicity, interpretability, and linguistic motivation. Of course, statistical models for word segmentation have made progress (e.g., Goldwater et al., 2009; Mochihashi et al., 2009). In addition, Taniguchi et al. (2016) tried to discover the double articulation structure (Martinet, 1960), i.e., both phonemes and words, from continuous speech signals. As emergent languages become more complex and natural-language-like, it will be worthwhile to adopt more sophisticated segmentation methods.

9. CONCLUSION

This paper investigated whether Harris's articulation scheme (HAS) also holds in emergent languages. Emergent languages are artificial communication protocols emerging between agents. HAS has been considered a statistical universal in natural languages and can be used for unsupervised word segmentation. Our experimental results suggest that although emergent languages satisfy some prerequisites for HAS, HAS-based word boundaries may not be meaningful. Our contributions are (1) to focus on the word segmentation of emergent languages, (2) to pose verifiable questions to answer whether emergent languages have meaningful segments, and (3) to show another gap between emergent and natural languages. The standard signaling game satisfies prerequisites but still lacks some necessary ingredients for HAS. 

B WHY CAN CONDITIONAL ENTROPY INCREASE IN SIGNALING GAME?

One might wonder why the conditional entropy H(n) can increase, thinking it cannot by definition. This is true when we have a single (possibly infinite) sequence. For example, the conditional entropy of an infinite monkey-typing sequence is constant, since for any n ∈ ℕ and s ∈ X^n,

h(s) = −Σ_{x∈X} P(x | s) log₂ P(x | s) = −Σ_{x∈X} |X|⁻¹ log₂ |X|⁻¹ = log₂ |X|,
H(n) = Σ_{s∈X^n} P(s) h(s) = log₂ |X|.

In general, the conditional entropy H(n) of a single sequence is a weakly decreasing function. However, emergent languages arising from signaling games are not single sequences. Each emergent language is a set of finite sequences: L = {S(i) ∈ M | i ∈ I}. Consider, for instance, the following toy language:

L_toy = {aaaaa, aaaab, aaaac}.

In L_toy, H(1) < H(4) holds, as the symbol after a unigram is most likely a, while the symbol after a 4-gram is equally likely to be a, b, or c.

D EASE-OF-TEACHING SETTING DOES NOT CONTRIBUTE TO HAS

Some methods proposed to improve TopSim might also contribute to the emergence of HAS. We picked the Ease-of-Teaching (EoT) setting (Li & Bowling, 2019), as it is known to be a simple and effective way to improve TopSim. As a result, the EoT setting did not contribute to the emergence of HAS, while it did improve TopSim (C-TopSim).

Setup. The EoT setting is quite simple: the receiver agent is reset periodically during training. In this paper, we reset the receiver every 20 epochs.

Result. We show the results in Figure 16, Figure 17, and Figure 18. The mean number of hypo-boundaries does not increase as n_att increases (Figure 16). The vocabulary size does not increase as n_val increases (Figure 17). C-TopSim > W-TopSim in Figure 18.
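The toy-language argument in Appendix B can be checked numerically; the following sketch (our own, with an assumed function name) estimates the plug-in conditional entropy of a set of finite sequences:

```python
from collections import Counter
from math import log2

def cond_entropy(corpus, n):
    """Empirical H(n) = -sum_{s,x} P(s, x) log2 P(x | s) over a set of sequences."""
    ctx, joint = Counter(), Counter()
    for seq in corpus:
        for i in range(len(seq) - n):
            ctx[seq[i:i + n]] += 1       # context s observed with a successor
            joint[seq[i:i + n + 1]] += 1  # joint occurrence of (s, x)
    total = sum(joint.values())
    return -sum(c / total * log2(c / ctx[sx[:-1]]) for sx, c in joint.items())

L_toy = ["aaaaa", "aaaab", "aaaac"]
# H(1) < H(4): the conditional entropy of this language increases with n,
# and H(4) equals log2(3), since a 4-gram context "aaaa" has 3 equiprobable successors.
```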

E HYPO-SEGMENTS AND ZIPF'S LAW OF ABBREVIATION

The results in the main content are related to the compositionality of emergent languages (e.g., Kottur et al., 2017). In this supplemental section, we further associate our results with previous discussions of Zipf's law of abbreviation (ZLA) in emergent languages (Chaabouni et al., 2019; Rita et al., 2020; Ueda & Washio, 2021). ZLA is the statistical property of natural languages that the more frequently a word is used, the shorter it is (Zipf, 1935). By regarding hypo-segments as "words," we can check whether hypo-segments follow ZLA. Figure 19 shows the hypo-segment lengths sorted by frequency rank for (n_att, n_val) = (1, 4096). If hypo-segments follow ZLA ideally, the lengths should increase monotonically with rank. The distribution of hypo-segment lengths shows a clear ZLA-like tendency for threshold ∈ {0, 0.5}, although the tendency is less clear for the other thresholds. This means that hypo-segments follow ZLA for appropriate threshold values. Other (n_att, n_val) configurations show similar tendencies (see Figure 20, Figure 21, Figure 22, Figure 23, and Figure 24). This is a suggestive result, because we neither imposed a length penalty on messages (Chaabouni et al., 2020), modeled the laziness/impatience of agents (Rita et al., 2020), nor modeled short-term memories (Ueda & Washio, 2021). Note, however, that it may be just an artifact, analogous to the fact that even a monkey-typing sequence divided by the "white space" follows ZLA (Miller, 1957).
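By way of illustration, the ZLA check amounts to sorting hypo-segments by frequency and reading off their lengths; a sketch with a made-up toy segmentation (function name and data are our own):

```python
from collections import Counter

def length_by_rank(segmented_messages):
    """Hypo-segment length by frequency rank (rank 1 = most frequent).

    Under ideal ZLA, the returned list increases monotonically.
    """
    freq = Counter(seg for msg in segmented_messages for seg in msg)
    return [len(seg) for seg, _ in freq.most_common()]

# toy segmented language: the frequent segment is short, rare ones are longer
toy = [["a", "a", "bb"], ["a", "ccc"]]
```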



Footnotes

1. ZLA states that the more frequently a word is used, the shorter it tends to be (Zipf, 1935).
2. Note that this is different from the famous distributional hypothesis (Harris, 1954).
3. There may be other representations for the phonemes, but we follow Harris's notation.
4. The term "branching entropy" is from Tanaka-Ishii & Jin (2008), but the definition per se is quite basic.
5. Although this is called a hypothesis in Tanaka-Ishii (2005), Tanaka-Ishii & Jin (2006), and Tanaka-Ishii & Ishii (2007), we refer to it as a scheme following the recent publication (Tanaka-Ishii, 2021).
6. For practical reasons, the original algorithm involves another parameter maxlen to ensure |w| < maxlen. We omit it because the message length in emergent languages is fixed in this paper (see Section 3).
7. Tanaka-Ishii & Jin (2008) reported F-score = 83.6% in English and F-score = 83.8% in Chinese. These are considerably high scores for entropy-based settings.
8. Although the game is extremely simple, it is suitable for avoiding some pitfalls. Lowe et al. (2019) indicated that agents might not communicate effectively in games more complex than a signaling game. Bouchacourt & Baroni (2018) suggested that agents fail to capture conceptual properties when I is a set of images.
9. The EGG toolkit: https://github.com/facebookresearch/EGG
10. Our code: https://github.com/wedddy0707/HarrisSegmentation
11. One might think that the conditional entropy cannot increase, by definition. However, this is not the case in our setting (see Appendix B for more details).
12. Note that TopSim can only be defined when n_att > 1.
13. If we were to apply boundary detection with threshold = −∞, it would regard every data point in a message as a boundary. In other words, W-TopSim with threshold = −∞ would be identical to C-TopSim. We adopt this notation in order to represent C-TopSim and W-TopSim in a unified manner in a single figure.
14. We have to compute entropy values for all possible partitions to obtain RE. In our case, the number of possible partitions in one language can be astronomical: up to n_att^k = 12^32 ≈ 3.4 × 10^34.
15. We picked only threshold ∈ {0.5, 1.5, 2} and adopted a log-log graph for readability.
16. The plot shows zigzagging behavior for threshold = 1.5, and most of the hypo-segment lengths hit the message length k = 32 for threshold = 2.
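The threshold mechanism referred to above can be illustrated with a small sketch of HAS-based boundary detection: a boundary is proposed wherever the branching entropy rises by more than threshold as the context grows. This is a simplified reading of the algorithm (prefix contexts only, no maxlen parameter), with hypothetical names:

```python
from collections import defaultdict, Counter
from math import log2

def branching_entropy(language):
    """h(s): entropy of the symbol following context s, estimated
    from every occurrence of s inside the messages."""
    nxt = defaultdict(Counter)
    for msg in language:
        for i in range(len(msg)):
            for j in range(i + 1, len(msg)):
                nxt[msg[i:j]][msg[j]] += 1
    h = {}
    for s, counts in nxt.items():
        tot = sum(counts.values())
        h[s] = -sum(c / tot * log2(c / tot) for c in counts.values())
    return h

def detect_boundaries(msg, h, threshold=0.0):
    """Propose a boundary before position j+1 whenever the branching
    entropy increases by more than `threshold`. With threshold = -inf,
    every candidate position becomes a boundary, as noted in footnote 13."""
    return [j + 1 for j in range(1, len(msg) - 1)
            if h.get(msg[:j + 1], 0.0) - h.get(msg[:j], 0.0) > threshold]

toy = ["abab", "abcd", "cdab", "cdcd"]   # built from two "words": ab, cd
```

On this toy language, the entropy after context "a" is 0 (always followed by b), but after "ab" it is 1 bit (a or c follows), so the rise flags the ab|cd boundary in "abcd".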



Figure 2: Illustration of questions.

Figure 3: Conditional entropy H(n). Dashed blue lines represent H(n) of languages from untrained agents that finally learned successful languages, while solid red lines represent H(n) of successful languages.

Figure 4: Example transition sequences of the branching entropy h in a message "3,8,4,...,4,4,4" in a successful language for (n_att, n_val) = (2, 64).

Figure 5: Mean number of hypo-boundaries per message in successful languages. threshold varies according to Eq. 7. Each data point is averaged over random seeds, and shaded regions represent one standard error of mean (SEM).

Figure 6: Vocabulary size in successful languages. threshold varies according to Eq. 7. Each data point is averaged over random seeds, and shaded regions represent one SEM.

Figure 7: C-TopSim and W-TopSim in successful languages. "-∞" corresponds to C-TopSim, while others correspond to W-TopSim. Each data point is averaged over random seeds, and shaded regions represent one SEM.

Figure 8: Hypo-boundary-based W-TopSim compared to random-boundary-based W-TopSim in successful languages for (n att , n val ) = (2, 64). Each data point is averaged over random seeds, and shaded regions represent one SEM.

Figure 9: Mean number of hypo-boundaries per message in synthetic languages. Each data point is averaged over random seeds, and shaded regions represent one standard error of mean (SEM).

Figure 10: Vocabulary size in synthetic languages. Each data point is averaged over random seeds, and shaded regions represent one SEM. The y-axis is on a log scale, as the data point for (n_att, n_val) = (1, 4096) is very large compared to the others.

Figure 11: C-TopSim and W-TopSim in synthetic languages. "-∞" corresponds to C-TopSim, while other threshold values correspond to W-TopSim. Each data point is averaged over random seeds, and shaded regions represent one SEM.

Figure 16: Mean number of hypo-boundaries per message in successful languages with the EoT setting. Each data point is averaged over random seeds, and shaded regions represent one SEM.

Figure 17: Vocabulary size in successful languages with the EoT setting. Each data point is averaged over random seeds, and shaded regions represent one SEM.

Figure 18: C-TopSim and W-TopSim in successful languages with the EoT setting. "-∞" corresponds to C-TopSim, while other threshold values correspond to W-TopSim. Each data point is averaged over random seeds, and shaded regions represent one SEM.
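The EoT schedule evaluated in these figures (Appendix D) amounts to re-initializing the receiver on a fixed period while the sender keeps training. A minimal sketch with hypothetical factory/step callables (not the EGG implementation):

```python
def train_with_eot(sender, receiver_factory, step, n_epochs, reset_every=20):
    """Ease-of-Teaching (Li & Bowling, 2019) training-loop sketch:
    the receiver is replaced by a freshly initialized one every
    `reset_every` epochs, while the sender persists throughout."""
    receiver = receiver_factory()
    for epoch in range(n_epochs):
        if epoch > 0 and epoch % reset_every == 0:
            receiver = receiver_factory()   # periodic reset
        step(sender, receiver)              # one epoch of training
    return receiver
```

With n_epochs = 100 and reset_every = 20, the receiver is re-initialized at epochs 20, 40, 60, and 80, so five receivers are instantiated in total.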

Figure 19: Hypo-segment lengths sorted by frequency rank for (n att , n val ) = (1, 4096). Each data point is averaged over random seeds, and shaded regions represent one SEM.

Figure 20: Hypo-segment lengths sorted by frequency rank for (n att , n val ) = (2, 64). Each data point is averaged over random seeds and shaded regions represent one SEM.

Figure 21: Hypo-segment lengths sorted by frequency rank for (n att , n val ) = (3, 16). Each data point is averaged over random seeds and shaded regions represent one SEM.

Figure 22: Hypo-segment lengths sorted by frequency rank for (n att , n val ) = (4, 8). Each data point is averaged over random seeds and shaded regions represent one SEM.

Figure 23: Hypo-segment lengths sorted by frequency rank for (n_att, n_val) = (6, 4). Each data point is averaged over random seeds, and shaded regions represent one SEM.

Figure 24: Hypo-segment lengths sorted by frequency rank for (n att , n val ) = (12, 2). Each data point is averaged over random seeds and shaded regions represent one SEM.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for fruitful discussions and insightful comments.

A IDEALLY COMPOSITIONAL SYNTHETIC LANGUAGE INDEED SATISFIES CONDITIONS C1, C2, AND C3

The purpose of this supplemental section is to show that the HAS-based boundary detection algorithm works for an ideal language and that the ideal language indeed satisfies conditions C1, C2, and C3.

A.1 SETUP

Recall that n_att is the number of attributes, n_val is the number of values, the attribute-value set D_{n_att, n_val} is a meaning space, and (n_att, n_val) ∈ {(1, 4096), (2, 64), (3, 16), (4, 8), (6, 4), (12, 2)}. Set the alphabet size |A| = 8 (the same as in the main content). Set the message length k = 36 (slightly different from the main content, but with the good property that k mod n_att = 0 for all n_att). Make an ideal synthetic language by the following procedure: sample a segment s_{a,v} ∈ A^{k/n_att} with replacement for all a ∈ {1, . . . , n_att} and v ∈ {1, . . . , n_val}. Assume that each s_{a,v} denotes one attribute-value meaning unit: the value v of the attribute a.

A.2 RESULT
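A minimal sketch of the A.1 construction above (the uniform with-replacement sampling of segments is our reading of the truncated procedure; all names are ours):

```python
import random
from itertools import product

def make_synthetic_language(n_att, n_val, alphabet_size=8, k=36, seed=0):
    """Ideal compositional language: each attribute-value pair (a, v)
    gets a fixed segment s[a][v] of length k // n_att, and the message
    for a meaning (v_1, ..., v_{n_att}) concatenates those segments."""
    rng = random.Random(seed)
    seg_len = k // n_att
    s = [[tuple(rng.randrange(alphabet_size) for _ in range(seg_len))
          for _ in range(n_val)] for _ in range(n_att)]
    return [tuple(sym for a, v in enumerate(meaning) for sym in s[a][v])
            for meaning in product(range(n_val), repeat=n_att)]
```

For every configuration above, the meaning space has n_val^n_att = 4096 points, so each synthetic language contains 4096 messages of length 36.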

