GROUNDED LANGUAGE LEARNING FAST AND SLOW

Abstract

Recent work has shown that large text-based neural language models acquire a surprising propensity for one-shot learning. Here, we show that an agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional RL algorithms. After a single introduction to a novel object via visual perception and language ("This is a dax"), the agent can manipulate the object as instructed ("Put the dax on the bed"), combining short-term, within-episode knowledge of the nonsense word with long-term lexical and motor knowledge. We find that, under certain training conditions and with a particular memory writing mechanism, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful later. Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for fast-mapping, a fundamental pillar of human cognitive development and a potentially transformative capacity for artificial agents.

1. INTRODUCTION

Language models that exhibit one- or few-shot learning are of growing interest in machine learning applications because they can adapt their knowledge to new information (Brown et al., 2020; Yin, 2020). One-shot language learning in the physical world is also of interest to developmental psychologists; fast-mapping, the ability to bind a new word to an unfamiliar object after a single exposure, is a much-studied facet of child language learning (Carey & Bartlett, 1978). Our goal is to enable an embodied learning system to perform fast-mapping, and we take a step towards this goal by developing an embodied agent situated in a 3D game environment that can learn the names of entirely unfamiliar objects in a single exposure, and immediately apply this knowledge to carry out instructions based on those objects. The agent observes the world via active perception of raw pixels, and learns to respond to linguistic stimuli by executing sequences of motor actions. It is trained by a combination of conventional RL and predictive (semi-supervised) learning. We find that an agent architecture consisting of standard neural network components is sufficient to follow language instructions whose meaning is preserved across episodes. However, learning to fast-map novel names to novel objects in a single episode relies on semi-supervised prediction mechanisms and a novel form of external memory, inspired by the dual-coding theory of knowledge representation (Paivio, 1969). With these components, an agent can exhibit both slow word learning and fast-mapping. Moreover, the agent exhibits an emergent propensity to integrate both fast-mapped and slowly acquired word meanings in a single episode, successfully executing instructions such as "put the dax in the box" that depend on both slow-learned ("put", "box") and fast-mapped ("dax") word meanings.
Via controlled generalization experiments, we find that the agent is reasonably robust to variation in the number of objects involved in a given fast-mapping task at test time. The agent also exhibits above-chance success when presented with the name for a particular object in the ShapeNet taxonomy (Chang et al., 2015) and then instructed (using that name) to interact with a different exemplar from the same object class, and this propensity can be further enhanced by specific meta-training. We find that both the number of unique objects observed by the agent during training and the temporal aspect of its perceptual experience of those objects contribute critically to its ability to generalize, particularly its ability to execute fast-mapping with entirely novel objects. Finally, we show that a dual-coding memory schema can provide a more effective basis for deriving a signal for intrinsic motivation than a more conventional (unimodal) memory.

2. AN ENVIRONMENT FOR FAST WORD LEARNING

We conduct experiments in a 3D room built with the Unity game engine. In a typical episode, the room contains a pre-specified number N of everyday 3D-rendered objects drawn from a global set G. In all training and evaluation episodes, the initial positions of the objects and the agent are randomized. The objects include everyday household items such as kitchenware (cup, glass), toys (teddy bear, football), homeware (cushion, vase), and so on. Episodes consist of two phases: a discovery phase, followed by an instruction phase (see Figure 1). In the discovery phase, the agent must explore the room and fixate on each of the objects in turn. When it fixates on an object, the environment returns a string with the name of the object (which is a nonsense word), for example "This is a dax" or "This is a blicket". Once the environment has returned the name of each of the objects (or a time limit of 30s is reached), the positions of all the objects and the agent are re-randomized and the instruction phase begins. The environment then emits an instruction, for example "Pick up a dax" or "Pick up a blicket". To succeed, the agent must lift the specified object and hold it above 0.25m for 3 consecutive timesteps, at which point the episode ends, and a new episode begins with a discovery phase and a fresh sample of objects from the global set G. If the agent first lifts an incorrect object, the episode also ends (so it is not possible to pick up more than one object in the instruction phase).

To provide a signal for the agent to learn from, it receives a scalar reward of 1.0 if it picks up the correct object in the instruction phase. In the default training setting, to encourage the necessary information-seeking behaviour, a smaller shaping reward of 0.1 is provided for visiting each of the objects in the discovery phase. Given this two-phase episode structure, two distinct learning challenges can be posed to the agent.
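The two-phase episode structure and reward scheme can be sketched in code. The following is a minimal Python sketch under stated assumptions, not the authors' Unity implementation: the class and method names, the object-to-nonsense-word assignment, and the agent-facing API are all illustrative; only the numerical constants (30s discovery limit, 0.25m lift height, 3 consecutive timesteps, rewards of 1.0 and 0.1) come from the text above.

```python
# Illustrative sketch of the two-phase episode logic; names and API are
# hypothetical, only the constants are taken from the environment description.
import random

LIFT_HEIGHT = 0.25      # metres the object must be held above
LIFT_STEPS = 3          # consecutive timesteps at that height
DISCOVERY_LIMIT = 30.0  # seconds allowed for the discovery phase
VISIT_REWARD = 0.1      # shaping reward per object visited in discovery
TASK_REWARD = 1.0       # reward for lifting the instructed object

class FastMappingEpisode:
    def __init__(self, global_set, num_objects):
        # Sample a fresh subset of objects and assign nonsense names
        # (here simply "dax0", "dax1", ... for illustration).
        self.objects = random.sample(global_set, num_objects)
        self.names = {obj: f"dax{i}" for i, obj in enumerate(self.objects)}
        self.visited = set()
        self.target = None
        self._lift_count = 0

    def discovery_step(self, fixated_obj):
        """Return (language string, shaping reward) when the agent fixates an object."""
        reward = 0.0
        if fixated_obj in self.names and fixated_obj not in self.visited:
            self.visited.add(fixated_obj)
            reward = VISIT_REWARD  # each object pays out only once
        return f"This is a {self.names[fixated_obj]}", reward

    def begin_instruction_phase(self):
        # Object and agent positions are re-randomized (not modelled here);
        # one object becomes the target of the instruction.
        self.target = random.choice(self.objects)
        self._lift_count = 0
        return f"Pick up a {self.names[self.target]}"

    def instruction_step(self, lifted_obj, height):
        """Return (reward, done) for one timestep of the instruction phase."""
        if lifted_obj is None:
            return 0.0, False
        if lifted_obj != self.target:
            return 0.0, True          # wrong object lifted: episode ends, no reward
        if height > LIFT_HEIGHT:
            self._lift_count += 1     # count consecutive timesteps above 0.25m
        else:
            self._lift_count = 0
        if self._lift_count >= LIFT_STEPS:
            return TASK_REWARD, True  # held high enough for 3 consecutive steps
        return 0.0, False
```

The sketch makes explicit why the shaping reward encourages information seeking: the only way to collect the 0.1 bonuses is to fixate each object, which is also the only way to observe the name string needed later in the instruction phase.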
In a slow-learning regime, the environment can assign the permanent name (e.g. "cup", "chair") to objects in the environment whenever they are sampled. By contrast, in the fast-mapping regime,



Figure 1: Top: The two phases of a fast-mapping episode. Bottom: Screenshots of the task from the agent's perspective at important moments (including the contents of the language channel). Rendered images are higher resolution than those passed to the agent.

