ANALOGICAL REASONING FOR VISUALLY GROUNDED COMPOSITIONAL GENERALIZATION

Abstract

Children acquire language subconsciously by observing the surrounding world and listening to descriptions. They can discover the meaning of words even without explicit language instruction, and generalize to novel compositions effortlessly. In this paper, we bring this ability to AI by studying the task of multimodal compositional generalization within the context of visually grounded language acquisition. We propose a multimodal transformer model augmented with a novel mechanism for analogical reasoning, which approximates novel compositions by learning semantic mappings and reasoning operations from previously seen compositions. Our proposed method, Analogical Reasoning Transformer Networks (ARTNET), is trained on raw multimedia data (video frames and transcripts); after observing a set of compositions such as "washing apple" or "cutting carrot", it can generalize and recognize new compositions in new video frames, such as "washing carrot" or "cutting apple". To this end, ARTNET retrieves relevant instances from the training data and uses their visual features and captions to establish analogies with the query image. It then chooses a suitable verb and noun to create a new composition that best describes the new image. Extensive experiments on an instructional video dataset demonstrate that the proposed method achieves significantly better generalization capability and recognition accuracy compared to state-of-the-art transformer models.

1. INTRODUCTION

Visually grounded Language Acquisition (VLA) is an innate ability of the human brain. It refers to the way children learn their native language from scratch, through exploration, observation, and listening (i.e., self-supervision), without taking language training lessons (i.e., explicit supervision). Two-year-old children are able to quickly learn the semantics of phrases and their constituent words after repeatedly hearing phrases like "washing apple" or "cutting carrot" and observing such situations. More interestingly, they will also understand new compositions such as "washing carrot" or "cutting apple", even before experiencing them. This ability of human cognition is called compositional generalization (Montague (1970); Minsky (1988); Lake et al. (2017)). It helps humans use a limited set of known components (a vocabulary) to understand and produce unlimited new compositions (e.g., verb-noun, adjective-noun, or adverb-verb compositions). This is also one of the long-term goals of Artificial Intelligence (AI), e.g., in robotics, where it would enable a robot to follow new instructions that it has never heard before. Nevertheless, contemporary machine intelligence needs to overcome several major challenges to accomplish this task. On one hand, learning compositional generalization can be difficult without using data-hungry models. The power of existing language models relies mainly on large-scale language corpora (Lake & Baroni (2017); Pennington et al. (2014); Devlin et al. (2018)). They are still inadequate at compositional generalization (Marcus (1998); Lake & Baroni (2018); Surís et al. (2019)): their goal is to recognize training examples rather than to account for what is missing from the training data. On the other hand, the designed model should close the paradigmatic gap (Nikolaus et al. (2019)) between seen compositions and new compositions. For instance, given seen verb-noun compositions "1A" and "2B" (where the digit denotes a verb and the letter a noun), the model should be able to link seen compositions to new compositions (like "1B" or "2A") in completely new cases.
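The paradigmatic gap described above can be made concrete with a small sketch. The snippet below (illustrative only, not part of ARTNET) enumerates the unseen verb-noun pairings implied by a set of seen compositions, i.e., the novel compositions a generalizing model is expected to cover:

```python
# Illustrative sketch of the paradigmatic gap: given seen verb-noun
# compositions, list the unseen pairings of the same verbs and nouns.
from itertools import product

def novel_compositions(seen):
    """Return the unseen (verb, noun) pairings formed from the verbs
    and nouns that appear in the seen compositions."""
    verbs = {v for v, _ in seen}
    nouns = {n for _, n in seen}
    return sorted(set(product(verbs, nouns)) - set(seen))

# Seen: "1A" ("washing apple") and "2B" ("cutting carrot").
seen = [("washing", "apple"), ("cutting", "carrot")]
print(novel_compositions(seen))
# -> [('cutting', 'apple'), ('washing', 'carrot')]
```

Enumerating the candidates is, of course, the trivial part; the challenge addressed in this paper is grounding such unseen compositions in new visual observations.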

