COMPUTATIONAL LANGUAGE ACQUISITION WITH THEORY OF MIND

Abstract

Unlike current state-of-the-art language models, young children actively acquire language through interactions with their surrounding environment and caretakers. One mechanism that has been argued to be critical to language learning is the ability to infer the mental states of other agents in social environments, coined Theory of Mind (ToM) by Premack & Woodruff (1978). Drawing inspiration from the modern operationalized versions of ToM implemented in Rabinowitz et al. (2018) and Zhu et al. (2021), we build language-learning agents equipped with ToM and measure its effects on the learning process. We model ToM by giving the speaker agent an internal listener model that is trained alongside the speaker and used to rerank potential utterances. We experiment with varying task difficulty, hypothesizing that models will acquire more complex language to adapt to stronger environmental pressures. We find that training speakers with a highly weighted ToM listener component leads to performance gains in our image referential game setting. We also find some evidence that increasing task difficulty during training results in more fluent and precise utterances at evaluation time. This suggests the potential utility of further incorporating ToM, as well as other insights from child language acquisition, into computational models of language acquisition.¹

1. INTRODUCTION

Human languages are fundamentally shaped by social-communicative goals in the grounded world. Modern theories from developmental psychology often attribute humans' unique ability to quickly acquire and adapt language to their ability to ascribe mental states to other agents (Tomasello, 2005), an ability also known as Theory of Mind (ToM). Some previous studies have attempted to model ToM computationally. For instance, ToM-like mechanisms have been shown to allow models to better predict the behavior of a future agent (Rabinowitz et al., 2018), model agents' beliefs in a negotiation (Cao et al., 2018) or a cooperative game (Bard et al., 2020), or choose good utterances based on the listener's linguistic abilities (Zhu et al., 2021). However, the effects of ToM have not yet been studied in the higher-level context of computational language acquisition.

In this paper, we study how an internal ToM mechanism and external environmental pressure contribute to language learning. We use an image referential game setting consisting of a series of training episodes between a speaker, which represents a language learner (Zhu et al., 2022), and a listener, which represents a fluent teacher. When presented with a set of images, one of which is the target referent, the speaker must learn to generate an English utterance that the listener can use to select the target. The speaker is rewarded for generating utterances that lead to the target image being guessed correctly. Additionally, the speaker may be given feedback depending on the confidence the listener has in its selection. This setting provides an attractive test-bed for studying the effects of various reward signals and model designs on the speaker's learned language; previous studies of pragmatics in language acquisition, such as Andreas & Klein (2016), have used similar settings.



¹ Code and data can be found at https://github.com/neulab/ToM-Language-Acquisition.

Figure 1: Given a target image and a distractor image, the speaker samples candidate utterances (e.g. "three people play frisbee", "frisbee front of trees", "yellow shirt man throw frisbee"); the ToM listener reranks them, and the top-ranked utterance ("yellow shirt man throw frisbee") is output.

Within this setting, we seek to better understand how models with and without ToM adapt to their environments. We focus on two specific research questions in this area:

RQ1. How does the inclusion of ToM in language acquisition speaker models affect their performance and learned language? (internal ToM mechanism)
RQ2. How do our models adapt to more difficult referential game environments during the language acquisition process? (external environmental pressure)

We study the impact of ToM (RQ1) by modeling an internal listener module within our speakers that aims to predict which utterances are most likely to result in the desired listener behavior. By incorporating the probabilities that the ToM listener assigns to the target image into the utterance reranking process, as shown in Fig. 1, we select for more pragmatic utterances among those sampled from the speaker's distribution. To study the impact of environmental pressure (RQ2) on our speakers, we create referential games of varying difficulty by sampling distractors from different distributions. These distributions are based on inter-image similarity calculated by various representation models, including CLIP (Radford et al., 2021), RoBERTa (Liu et al., 2020), and TF-IDF variants.

In experiments, we find that (RQ1) speaker models that include ToM components generally outperform those that do not in terms of fluency and final accuracy.
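The reranking step can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the linear mixing of speaker and listener scores and the `weight` knob are assumptions chosen to mirror the "highly weighted ToM listener component" described above.

```python
import numpy as np

def tom_rerank(candidates, speaker_scores, listener_target_probs, weight=0.8):
    """Rerank candidate utterances by mixing the speaker's own score for
    each utterance with the internal (ToM) listener's probability of
    selecting the target image given that utterance.

    weight: how heavily the ToM listener's judgment counts
            (a hypothetical knob for this sketch).
    """
    speaker_scores = np.asarray(speaker_scores, dtype=float)
    listener_target_probs = np.asarray(listener_target_probs, dtype=float)
    combined = (1.0 - weight) * speaker_scores + weight * listener_target_probs
    order = np.argsort(-combined)  # best-first
    return [candidates[i] for i in order]

# Toy example mirroring Fig. 1: three sampled captions, where the ToM
# listener is most confident that the third one identifies the target.
candidates = ["three people play frisbee",
              "frisbee front of trees",
              "yellow shirt man throw frisbee"]
speaker_scores = [0.5, 0.3, 0.2]   # speaker's own preference
listener_probs = [0.2, 0.3, 0.9]   # ToM listener's P(target | utterance)
reranked = tom_rerank(candidates, speaker_scores, listener_probs)
```

With a high `weight`, the listener's confidence dominates, so the pragmatically informative caption is promoted to the top even though the speaker itself preferred a different one.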
We also find that (RQ2) training with more visually and semantically similar distractor referents leads the speaker model to develop longer, more fluent, and more precise utterances to distinguish between potential referents, although these gains do not always translate into better referential game performance. These results suggest that both ToM and environmental pressure contribute to language acquisition in this setting. Significant gaps remain between the language our speaker model acquires and human captions, and we restrict both the vocabulary and the maximum utterance length of our speakers; both points indicate room for improvement in this class of models. Nevertheless, we hope our results hint at both better training methods for pragmatic language models and a deeper computational understanding of human language acquisition.
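One way to realize similarity-based distractor sampling is sketched below. The cosine-similarity measure and temperature parameterization are illustrative assumptions, not the paper's exact procedure; the idea is simply that lowering the temperature concentrates sampling on images most similar to the target, producing harder games.

```python
import numpy as np

def sample_distractors(target_idx, embeddings, n_distractors, temperature, rng):
    """Sample distractor images with probability increasing in their
    similarity to the target; lower temperature yields harder games.

    embeddings: (num_images, dim) array of image representations
                (e.g. from a model like CLIP), assumed L2-normalized.
    """
    sims = embeddings @ embeddings[target_idx]  # cosine similarities
    sims[target_idx] = -np.inf                  # never pick the target itself
    logits = sims / temperature
    probs = np.exp(logits - np.max(logits))     # stable softmax
    probs /= probs.sum()
    return rng.choice(len(embeddings), size=n_distractors,
                      replace=False, p=probs)
```

At high temperature this approaches uniform sampling over non-target images (an easy game); as the temperature drops, distractors become near-duplicates of the target, pressuring the speaker toward more discriminative utterances.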

