GROUNDED LANGUAGE LEARNING FAST AND SLOW

Abstract

Recent work has shown that large text-based neural language models acquire a surprising propensity for one-shot learning. Here, we show that an agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional RL algorithms. After a single introduction to a novel object via visual perception and language ("This is a dax"), the agent can manipulate the object as instructed ("Put the dax on the bed"), combining short-term, within-episode knowledge of the nonsense word with long-term lexical and motor knowledge. We find that, under certain training conditions and with a particular memory writing mechanism, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful later. Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for fast-mapping, a fundamental pillar of human cognitive development and a potentially transformative capacity for artificial agents.

1. INTRODUCTION

Language models that exhibit one- or few-shot learning are of growing interest in machine learning applications because they can adapt their knowledge to new information (Brown et al., 2020; Yin, 2020). One-shot language learning in the physical world is also of interest to developmental psychologists; fast-mapping, the ability to bind a new word to an unfamiliar object after a single exposure, is a much-studied facet of child language learning (Carey & Bartlett, 1978). Our goal is to enable an embodied learning system to perform fast-mapping, and we take a step towards this goal by developing an embodied agent situated in a 3D game environment that can learn the names of entirely unfamiliar objects in a single exposure, and immediately apply this knowledge to carry out instructions based on those objects. The agent observes the world via active perception of raw pixels, and learns to respond to linguistic stimuli by executing sequences of motor actions. It is trained by a combination of conventional RL and predictive (semi-supervised) learning. We find that an agent architecture consisting of standard neural network components is sufficient to follow language instructions whose meaning is preserved across episodes. However, learning to fast-map novel names to novel objects in a single episode relies on semi-supervised prediction mechanisms and a novel form of external memory, inspired by the dual-coding theory of knowledge representation (Paivio, 1969). With these components, an agent can exhibit both slow word learning and fast-mapping. Moreover, the agent exhibits an emergent propensity to integrate both fast-mapped and slowly acquired word meanings in a single episode, successfully executing instructions such as "put the dax in the box" that depend on both slow-learned ("put", "box") and fast-mapped ("dax") word meanings.
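The core idea behind a dual-coding external memory can be illustrated with a minimal sketch. The code below is a hypothetical simplification (the class name, slot layout, and attention scheme are our assumptions, not the paper's exact implementation): each write stores a visual code and a language code in an aligned slot, so that a query in one modality attends over the keys of that modality and retrieves the paired codes from the other. This is how a single exposure to "This is a dax" can later let the word "dax" cue the object's visual representation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a vector of attention scores."""
    e = np.exp(x - x.max())
    return e / e.sum()


class DualCodingMemory:
    """Illustrative sketch of a dual-coding external memory.

    Each slot holds a (visual, language) pair written together, so a
    query in either modality can retrieve the paired code from the
    other modality via content-based attention.
    """

    def __init__(self):
        self.visual = []    # visual codes, one per slot
        self.language = []  # language codes, aligned by slot index

    def write(self, v_code, l_code):
        """Store one visual/language embedding pair in a new slot."""
        self.visual.append(np.asarray(v_code, dtype=float))
        self.language.append(np.asarray(l_code, dtype=float))

    def read(self, query, modality="language"):
        """Attend over one modality's keys; return the attention-weighted
        mixture of the paired codes from the other modality."""
        if modality == "language":
            keys, values = np.stack(self.language), np.stack(self.visual)
        else:
            keys, values = np.stack(self.visual), np.stack(self.language)
        weights = softmax(keys @ np.asarray(query, dtype=float))
        return weights @ values
```

For example, after writing the pair (visual code of the dax object, embedding of the word "dax"), reading with a query close to the "dax" embedding returns a vector dominated by the dax object's visual code, which a policy could then use to locate the referent. The cross-modal pairing within a slot, rather than the attention mechanism itself, is what the dual-coding structure adds over a generic episodic memory.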
Via controlled generalization experiments, we find that the agent is reasonably robust to variation in the number of objects involved in a given fast-mapping task at test time. The agent also exhibits above-chance success when presented with the name for a particular object in the ShapeNet dataset and subsequently tested on a novel exemplar from the same category.

arXiv:2009.01719v4 [cs.CL] 14 Oct 2020

