ANALOGICAL REASONING FOR VISUALLY GROUNDED COMPOSITIONAL GENERALIZATION

Abstract

Children acquire language subconsciously by observing the surrounding world and listening to descriptions. They can discover the meaning of words even without explicit language knowledge, and generalize to novel compositions effortlessly. In this paper, we bring this ability to AI, by studying the task of multimodal compositional generalization within the context of visually grounded language acquisition. We propose a multimodal transformer model augmented with a novel mechanism for analogical reasoning, which approximates novel compositions by learning semantic mapping and reasoning operations from previously seen compositions. Our proposed method, Analogical Reasoning Transformer Networks (ARTNET), is trained on raw multimedia data (video frames and transcripts), and after observing a set of compositions such as "washing apple" or "cutting carrot", it can generalize and recognize new compositions in new video frames, such as "washing carrot" or "cutting apple". To this end, ARTNET refers to relevant instances in the training data and uses their visual features and captions to establish analogies with the query image. Then it chooses a suitable verb and noun to create a new composition that describes the new image best. Extensive experiments on an instructional video dataset demonstrate that the proposed method achieves significantly better generalization capability and recognition accuracy compared to state-of-the-art transformer models.

1. INTRODUCTION

Visually grounded Language Acquisition (VLA) is an innate ability of the human brain. It refers to the way children learn their native language from scratch, through exploration, observation, and listening (i.e., self-supervision), without taking language lessons (i.e., explicit supervision). Two-year-old children can quickly learn the semantics of phrases and their constituent words after repeatedly hearing phrases like "washing apple" or "cutting carrot" and observing such situations. More interestingly, they will also understand new compositions such as "washing carrot" or "cutting apple", even before experiencing them. This ability of human cognition is called compositional generalization (Montague (1970); Minsky (1988); Lake et al. (2017)). It allows humans to use a limited set of known components (a vocabulary) to understand and produce unlimited new compositions (e.g., verb-noun, adjective-noun, or adverb-verb compositions). This is also one of the long-term goals of Artificial Intelligence (AI), e.g., in robotics, where it would enable a robot to follow new instructions it has never heard before. Nevertheless, contemporary machine intelligence must overcome several major challenges to achieve this. On one hand, learning compositional generalization is difficult without data-hungry models: the power of existing language models relies mainly on large-scale language corpora (Lake & Baroni (2017); Pennington et al. (2014); Devlin et al. (2018)). They are still inadequate at compositional generalization (Marcus (1998); Lake & Baroni (2018); Surís et al. (2019)), since their goal is to fit the training examples rather than to account for what is missing from the training data. On the other hand, the designed model should close the paradigmatic gap (Nikolaus et al. (2019)) between seen compositions and new compositions. For instance, given seen verb-noun compositions "1A" and "2B" (where the digit indicates a verb and the letter indicates a noun), the model should be able to link seen compositions to new compositions (like "1B" or "2A") in completely new cases. To address these challenges, we take inspiration from a process of human cognition: Analogical Reasoning (AR). An analogy is a comparison between similar concepts or situations, and AR is analogical semantic reasoning that relies upon such an analogy.
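The "1A"/"2B" notation above can be made concrete with a small sketch (illustrative only; the function and names are not from the paper): a compositional split holds out verb-noun pairings whose verb and noun were each observed in training, but never together.

```python
# Illustrative sketch (not the paper's code): building a compositional
# train/test split in the "1A"/"2B" sense, where each held-out pairing
# reuses a verb and a noun that were both seen during training.
from itertools import product

def compositional_split(verbs, nouns, seen):
    """Return (seen, novel) sets of verb-noun compositions. Every pair in
    `novel` has its verb and its noun appear somewhere in `seen`; only the
    pairing itself is new, which is what compositional generalization tests."""
    all_pairs = set(product(verbs, nouns))
    seen = set(seen)
    novel = {(v, n) for (v, n) in all_pairs - seen
             if any(v == sv for sv, _ in seen)      # verb observed before
             and any(n == sn for _, sn in seen)}    # noun observed before
    return seen, novel

verbs = ["wash", "cut"]
nouns = ["apple", "carrot"]
seen, novel = compositional_split(
    verbs, nouns, seen=[("wash", "apple"), ("cut", "carrot")])
# novel is {("wash", "carrot"), ("cut", "apple")}: the "1B"/"2A" cases.
```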
The human brain spontaneously engages in AR to make sense of unfamiliar situations in everyday life (Vamvakoussi (2019)). Inspired by the AR process in the human brain, we design its counterpart for machine language acquisition. To this end, we create a language model that generates appropriate novel compositions by associating relevant seen compositions, forming analogies, and applying arithmetic operations to express the new compositions (e.g., "washing carrot" = "washing apple" + "cutting carrot" - "cutting apple"). We describe this process in three steps: association, reasoning, and inference, as shown in Figure 1. Given an image (a video frame in our case) and a narrative sentence describing it, we mask the main verb-noun composition in the sentence and ask the model to guess the correct composition that completes the sentence, considering the provided image. To this end, we propose a novel self-supervised, reasoning-augmented framework, Analogical Reasoning Transformer Networks (ARTNET). ARTNET adopts a multimodal transformer (similar to ViLBERT (Lu et al. (2019))) as its backbone to represent visual-textual data in a common space. It then builds three novel modules on top of the backbone, corresponding to the aforementioned AR steps: association, reasoning, and inference. First, we design the Analogical Memory Module (AMM), which discovers analogical exemplars for a given query scenario from a reference pool of observed samples. Second, we propose the Analogical Reasoning Networks (ARN), which take the retrieved samples as input, select candidate analogy pairs from the relevant reference samples, and learn proper reasoning operations over the selected analogy pairs, resulting in an analogy context vector. Third, we devise the Conditioned Composition Engine (CCE), which combines the analogy context vector with the representation of the query sample to predict the masked words and complete the target sentence with a novel composition.
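The analogy arithmetic above ("washing carrot" = "washing apple" + "cutting carrot" - "cutting apple") can be illustrated numerically in the spirit of word-vector arithmetic. This is a toy sketch under loud assumptions: the random embeddings and the additive composition function are ours for illustration, not the representations ARTNET learns.

```python
# Toy numeric sketch of the analogy arithmetic (assumption: an additive
# composition embedding over random verb/noun vectors; NOT the learned model).
import numpy as np

rng = np.random.default_rng(0)
dim = 16
verbs = {v: rng.normal(size=dim) for v in ["wash", "cut"]}
nouns = {n: rng.normal(size=dim) for n in ["apple", "carrot"]}

def embed(verb, noun):
    # Hypothetical composition embedding: sum of verb and noun vectors.
    return verbs[verb] + nouns[noun]

# Analogy context vector for the unseen composition "wash carrot":
#   "wash carrot" ~= "wash apple" + "cut carrot" - "cut apple"
context = embed("wash", "apple") + embed("cut", "carrot") - embed("cut", "apple")

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank every verb-noun composition by similarity to the context vector.
candidates = [(v, n) for v in verbs for n in nouns]
best = max(candidates, key=lambda vn: cosine(context, embed(*vn)))
# best is ("wash", "carrot"): under the additive toy embedding the analogy
# identity holds exactly, so the unseen composition is recovered.
```

The additive embedding makes the identity hold by construction; the point of ARTNET's ARN module is to *learn* reasoning operations that approximate such identities in a representation space where they do not hold for free.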
We show how ARTNET generalizes to new compositions and excels at visually grounded language acquisition through experiments with various evaluations: novel composition prediction, assessment of affordance, and sensitivity to data scarcity. The results on an ego-centric video dataset (EPIC-Kitchens) demonstrate the effectiveness of the proposed solution in various aspects: accuracy, capability, robustness, etc. The project code is publicly available at https://github.com/XX. The main contributions of this paper include the following:

• We call attention to a challenging and seldom-studied problem, compositional generalization, in the context of machine language acquisition.

• We propose ideas supported by human analogical reasoning: approximating new verb-noun compositions by learned arithmetic operations over relevant compositions seen before.

• We propose a novel reasoning-augmented architecture for visually grounded language acquisition, which addresses the compositional generalization problem through association and analogical reasoning.

• We evaluate the proposed model in various aspects, such as composition prediction, validity testing, and robustness against data scarcity. The results show that ARTNET achieves significant improvements in new-composition accuracy on a large-scale video dataset.
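The masked-composition evaluation used throughout can be sketched as follows. This is a hedged, generic harness: `evaluate`, the stub scorer, and the candidate set are illustrative names of ours, not the paper's evaluation code.

```python
# Illustrative sketch of the masked verb-noun composition evaluation:
# the model sees a frame and a sentence with its composition masked,
# scores each candidate composition, and is judged on top-1 accuracy.
def evaluate(predict_fn, examples, candidates):
    """examples: list of (frame, masked_sentence, gold_composition) triples.
    predict_fn(frame, sentence, candidate) returns a score; higher is better.
    Returns top-1 accuracy of the highest-scoring candidate."""
    correct = 0
    for frame, sentence, gold in examples:
        scores = {c: predict_fn(frame, sentence, c) for c in candidates}
        if max(scores, key=scores.get) == gold:
            correct += 1
    return correct / len(examples)

# Trivial usage with a stub scorer that prefers the gold pair (illustration only).
candidates = [("wash", "apple"), ("wash", "carrot"),
              ("cut", "apple"), ("cut", "carrot")]
examples = [("frame1", "I am [MASK] the [MASK].", ("wash", "carrot"))]
stub = lambda frame, sent, c: 1.0 if c == ("wash", "carrot") else 0.0
acc = evaluate(stub, examples, candidates)  # -> 1.0
```

Restricting `examples` to compositions absent from training (the novel split) turns this same harness into the compositional-generalization test.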




Figure 1: We propose a multimodal language acquisition approach inspired by human language learning, consisting of three steps: association, reasoning, and inference.

