MULTIMODAL ANALOGICAL REASONING OVER KNOWLEDGE GRAPHS

Abstract

Analogical reasoning is fundamental to human cognition and plays an important role in various fields. However, previous studies mainly focus on single-modal analogical reasoning and do not exploit structured knowledge. Notably, research in cognitive psychology has demonstrated that information from multimodal sources yields stronger cognitive transfer than single-modal sources. To this end, we introduce the new task of multimodal analogical reasoning over knowledge graphs, which requires multimodal reasoning ability with the help of background knowledge. Specifically, we construct a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG. We evaluate with multimodal knowledge graph embedding and pre-trained Transformer baselines, illustrating the potential challenges of the proposed task. We further propose a novel model-agnostic Multimodal analogical reasoning framework with Transformer (MarT), motivated by structure mapping theory, which obtains better performance. We hope our work can deliver benefits and inspire future research.

1. INTRODUCTION

Analogical reasoning - the ability to perceive and use relational similarity between two situations or events - holds an important place in human cognition (Johnson-Laird, 2006; Wu et al., 2020; Bengio et al., 2021; Chen et al., 2022a) and provides back-end support for various fields such as education (Thagard, 1992) and creativity (Goel, 1997), thus appealing to the AI community. Early on, Mikolov et al. (2013b); Gladkova et al. (2016a); Ethayarajh et al. (2019a) propose word analogy tasks to probe relational regularities in word representations, while work in Computer Vision (CV) aims at lifting machine intelligence by associating vision with relational, structural, and analogical reasoning. Meanwhile, researchers in Natural Language Processing (NLP) hold the connectionist assumption (Gentner, 1983) of linear analogy (Ethayarajh et al., 2019b); for example, the relation between two words can be inferred through vector arithmetic over their word embeddings. However, it remains an open question whether artificial neural networks are also capable of recognizing analogies among different modalities.

Note that humans can quickly acquire new abilities by finding a common relational system between two exemplars, situations, or domains. According to Mayer's Cognitive Theory of multimedia learning (Hegarty & Just, 1993; Mayer, 2002), human learners often perform better on analogy tests when they have learned from multimodal sources rather than single-modal sources. Evolving from recognizing single-modal analogies to exploring multimodal reasoning for neural models, we emphasize the importance of a new kind of analogical reasoning task with Knowledge Graphs (KGs). In this paper, we introduce the task of multimodal analogical reasoning over knowledge graphs to fill this blank. Unlike the previous multiple-choice QA setting, we directly predict the analogical target and formulate the task as link prediction without explicitly providing relations. Specifically, the task can be formalized as (e_h, e_t) : (e_q, ?)
with the help of a background multimodal knowledge graph G, in which e_h, e_t, or e_q may be in different modalities. We collect a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG to support this task. The data are collected and annotated from seed entities and relations in E-KAR (Chen et al., 2022a) and BATs (Gladkova et al., 2016a), with linked external entities in Wikidata and images from Laion-5B (Schuhmann et al., 2021). To evaluate the multimodal analogical reasoning process, we follow guidelines from psychological theories and conduct comprehensive experiments on MARS with multimodal knowledge graph embedding baselines and multimodal pre-trained Transformer baselines. We further propose a novel Multimodal analogical reasoning framework with Transformer, namely MarT, which is readily pluggable into any multimodal pre-trained Transformer model and yields better performance.

To summarize, our contributions are three-fold: (1) We advance the traditional setting of analogy learning by introducing a new multimodal analogical reasoning task; our work may open up new avenues for improving analogical reasoning through multimodal resources. (2) We collect and build a dataset MARS together with a multimodal knowledge graph MarKG, which can serve as a scaffold for investigating the multimodal analogical reasoning ability of neural networks. (3) We report the performance of various multimodal knowledge graph embedding baselines, multimodal pre-trained Transformer baselines, and our proposed framework MarT. We further discuss the potential of this task and hope it facilitates future research on zero-shot learning and domain generalization in both CV and NLP.
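To make the task formulation concrete, the following toy sketch (not the paper's model; the entities and embedding vectors are invented for illustration) ranks candidate answer entities for an analogy query (e_h, e_t) : (e_q, ?) by the parallelogram rule over hypothetical entity embeddings, with no relation given explicitly:

```python
import numpy as np

# Hypothetical 2-d embeddings for entities in a toy knowledge graph.
# In MARS, e_h / e_t / e_q may additionally carry visual or textual content.
emb = {
    "sun":   np.array([1.0, 0.0]),
    "day":   np.array([1.0, 1.0]),
    "moon":  np.array([0.0, 0.0]),
    "night": np.array([0.0, 1.0]),
}

def predict_tail(e_h, e_t, e_q, candidates):
    """Answer (e_h, e_t) : (e_q, ?) by the parallelogram rule:
    answer ~ e_q + (e_t - e_h); return the nearest candidate entity."""
    target = emb[e_q] + (emb[e_t] - emb[e_h])
    return min(candidates, key=lambda c: np.linalg.norm(emb[c] - target))

print(predict_tail("sun", "day", "moon", ["day", "night", "sun"]))  # night
```

The example pair implicitly encodes the relation (here, a "bright period of" analogy), and the model must transfer it to the query entity, which mirrors the link-prediction setting without explicit relations.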

2. BACKGROUND

2.1 ANALOGICAL REASONING IN PSYCHOLOGY

To better understand analogical reasoning, we introduce some crucial theories from cognitive psychology, which we take as guidelines for designing the multimodal analogical reasoning task.

Structure Mapping Theory (SMT) (Gentner, 1983). SMT occupies a fundamental position in analogical reasoning. Specifically, SMT emphasizes that humans conduct analogical reasoning based on shared relational structure rather than the superficial attributes of domains, and it distinguishes analogical reasoning from literal similarity. Minnameier (2010) further develops the inferential process of analogy into three steps: abduction, mapping, and induction, which inspires us to design benchmark baselines for multimodal analogical reasoning.

Mayer's Cognitive Theory (Hegarty & Just, 1993; Mayer, 2002). Humans live in a multi-source heterogeneous world and spontaneously engage in analogical reasoning to make sense of unfamiliar situations in everyday life (Vamvakoussi, 2019). Mayer's Cognitive Theory shows that human learners often perform better on tests of recall and transfer when they have learned from multimodal sources than from single-modal sources. However, relatively little attention has been paid to multimodal analogical reasoning, and it is still unknown whether neural network models possess this ability.

2.2 ANALOGICAL REASONING IN CV AND NLP

Visual Analogical Reasoning. Analogical reasoning in CV aims at lifting machine intelligence by associating vision with relational, structural, and analogical reasoning (Johnson et al., 2017; Prade & Richard, 2021; Hu et al., 2021; Malkinski & Mandziuk, 2022). Several datasets have been built in the context of Raven's Progressive Matrices (RPM), including PGM (Santoro et al., 2018) and RAVEN (Zhang et al., 2019). Meanwhile, Hill et al. (2019) demonstrate that incorporating structural differences with structure mapping benefits machine learning models on analogical visual reasoning. Hayes & Kanan (2021) investigate online continual analogical reasoning and demonstrate the importance of a selective replay strategy. However, the aforementioned works still focus on analogical reasoning among visual objects while ignoring the role of complex texts.

Natural Language Analogical Reasoning. In the NLP area, early attempts were devoted to word analogy recognition (Mikolov et al., 2013b; Gladkova et al., 2016a; Jurgens et al., 2012; Ethayarajh et al., 2019a; Gladkova et al., 2016b), which can often be effectively solved by vector arithmetic over neural word embeddings such as Word2Vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014). Recent studies have also evaluated pre-trained language models (Devlin et al., 2019; Brown et al.,
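As a concrete illustration of solving word analogies by vector arithmetic, here is a minimal 3CosAdd-style sketch; the tiny vocabulary and hand-made vectors are invented stand-ins for trained Word2Vec/GloVe embeddings:

```python
import numpy as np

# Toy vectors standing in for trained word embeddings (values are illustrative).
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via 3CosAdd: argmax cos(d, b - a + c).
    As in the standard word-analogy evaluation, the query words are excluded."""
    target = vocab[b] - vocab[a] + vocab[c]
    return max((w for w in vocab if w not in {a, b, c}),
               key=lambda w: cos(vocab[w], target))

print(analogy("man", "king", "woman"))  # queen
```

Note that this linear-offset trick operates within a single modality and assumes the relation is implicitly encoded as a vector difference, which is exactly the assumption the proposed multimodal task moves beyond.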

