MULTIMODAL ANALOGICAL REASONING OVER KNOWLEDGE GRAPHS

Abstract

Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus on single-modal analogical reasoning and do not take advantage of structural knowledge. Notably, research in cognitive psychology has demonstrated that information from multimodal sources always brings more powerful cognitive transfer than information from single-modality sources. To this end, we introduce the new task of multimodal analogical reasoning over knowledge graphs, which requires multimodal reasoning ability with the help of background knowledge. Specifically, we construct a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG. We evaluate multimodal knowledge graph embedding baselines and pre-trained Transformer baselines, illustrating the potential challenges of the proposed task. We further propose a novel model-agnostic Multimodal analogical reasoning framework with Transformer (MarT), motivated by the structure mapping theory, which obtains better performance. We hope our work can deliver benefits and inspire future research.

1. INTRODUCTION

Analogical reasoning, the ability to perceive and use relational similarity between two situations or events, holds an important place in human cognition (Johnson-Laird, 2006; Wu et al., 2020; Bengio et al., 2021; Chen et al., 2022a) and can provide back-end support for various fields such as education (Thagard, 1992) and creativity (Goel, 1997), thus appealing to the AI community. Early on, Mikolov et al. (2013b), Gladkova et al. (2016a) and Ethayarajh et al. (2019a) propose visual analogical reasoning, aiming at lifting machine intelligence in Computer Vision (CV) by associating vision with relational, structural, and analogical reasoning. Meanwhile, researchers in Natural Language Processing (NLP) hold the connectionist assumption (Gentner, 1983) of linear analogy (Ethayarajh et al., 2019b); for example, the relation between two words can be inferred through vector arithmetic on word embeddings. However, it is still an open question whether artificial neural networks are also capable of recognizing analogies among different modalities.

Note that humans can quickly acquire new abilities by finding a common relational system between two exemplars, situations, or domains. According to Mayer's Cognitive Theory of multimedia learning (Hegarty & Just, 1993; Mayer, 2002), human learners often perform better on tests with analogy when they have learned from multimodal sources than from single-modal sources. Moving from recognizing single-modal analogies to exploring multimodal reasoning for neural models, we emphasize the importance of a new kind of analogical reasoning task with Knowledge Graphs (KGs).

In this paper, we introduce the task of multimodal analogical reasoning over knowledge graphs to fill this gap. Unlike the previous multiple-choice QA setting, we directly predict the analogical target and formulate the task as link prediction without explicitly providing relations. Specifically, the task can be formalized as $(e_h, e_t) : (e_q, ?)$
with the help of a background multimodal knowledge graph $\mathcal{G}$, in which $e_h$, $e_t$ and $e_q$ can have different modalities. We collect a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG to support this task. These data are collected and annotated from seed entities and relations in E-KAR (Chen et al., 2022a) and BATs (Gladkova et al., 2016a), with linked external entities from Wikidata and images from Laion-5B (Schuhmann et al., 2021).

To evaluate the multimodal analogical reasoning process, we follow the guidelines from psychological theories and conduct comprehensive experiments on MARS with multimodal knowledge graph embedding baselines and multimodal pre-trained Transformer baselines. We further propose a novel Multimodal analogical reasoning framework with Transformer, namely MarT, which is readily pluggable into any multimodal pre-trained Transformer model and can yield better performance.

To summarize, our contributions are three-fold: (1) We advance the traditional setting of analogy learning by introducing a new multimodal analogical reasoning task. Our work may open up new avenues for improving analogical reasoning through multimodal resources. (2) We collect and build a dataset MARS with a multimodal knowledge graph MarKG, which can serve as a scaffold for investigating the multimodal analogical reasoning ability of neural networks. (3) We report the performance of various multimodal knowledge graph embedding baselines, multimodal pre-trained Transformer baselines, and our proposed framework MarT. We further discuss the potential of this task and hope it facilitates future research on zero-shot learning and domain generalization in both CV and NLP.

2. BACKGROUND

2.1. ANALOGICAL REASONING IN PSYCHOLOGY

To better understand analogical reasoning, we introduce some crucial theories from cognitive psychology, which we take as guidelines for designing the multimodal analogical reasoning task.

Structure Mapping Theory (SMT) (Gentner, 1983). SMT takes a fundamental position in analogical reasoning. Specifically, SMT emphasizes that humans conduct analogical reasoning based on shared relational structure rather than the superficial attributes of domains, and it distinguishes analogical reasoning from literal similarity. Minnameier (2010) further develops the inferential process of analogy into three steps: abduction, mapping and induction, which inspires us to design benchmark baselines for multimodal analogical reasoning.

Mayer's Cognitive Theory (Hegarty & Just, 1993; Mayer, 2002). Humans live in a multi-source heterogeneous world and spontaneously engage in analogical reasoning to make sense of unfamiliar situations in everyday life (Vamvakoussi, 2019). Mayer's Cognitive Theory shows that human learners often perform better on tests of recall and transfer when they have learned from multimodal sources than from single-modal sources. However, relatively little attention has been paid to multimodal analogical reasoning, and it is still unknown whether neural network models are capable of multimodal analogical reasoning.

2.2. ANALOGICAL REASONING IN CV AND NLP

Visual Analogical Reasoning. Analogical reasoning in CV aims at lifting machine intelligence by associating vision with relational, structural, and analogical reasoning (Johnson et al., 2017; Prade & Richard, 2021; Hu et al., 2021; Malkinski & Mandziuk, 2022). Several datasets have been built in the context of Raven's Progressive Matrices (RPM), including PGM (Santoro et al., 2018) and RAVEN (Zhang et al., 2019). Meanwhile, Hill et al. (2019) demonstrate that incorporating structural differences with structure mapping in analogical visual reasoning benefits machine learning models. Hayes & Kanan (2021) investigate online continual analogical reasoning and demonstrate the importance of the selective replay strategy. However, these works still focus on analogical reasoning among visual objects while ignoring the role of complex texts.

Natural Language Analogical Reasoning. In the NLP area, early attempts are devoted to word analogy recognition (Mikolov et al., 2013b; Gladkova et al., 2016a; Jurgens et al., 2012; Ethayarajh et al., 2019a; Gladkova et al., 2016b), which can often be effectively solved by vector arithmetic on neural word embeddings such as Word2Vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014). Recent studies have also evaluated pre-trained language models (Devlin et al., 2019; Brown et al.,

3. THE MULTIMODAL ANALOGICAL REASONING TASK

3.1. TASK DEFINITION

In this section, we introduce the task of Multimodal Analogical Reasoning, which can be formulated as link prediction without explicitly providing relations. As shown in Figure 1, given an analogy example $(e_h, e_t)$ and a question-answer entity pair $(e_q, ?)$, where $e_h, e_t, e_q \in \mathcal{E}_a$ and $\mathcal{E}_a \subset \mathcal{E}$, the goal of analogical reasoning is to predict the missing entity $e_a \in \mathcal{E}_a$. Moreover, multimodal analogical reasoning is based on a background multimodal knowledge graph $\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{I}, \mathcal{T})$, where $\mathcal{E}$ and $\mathcal{R}$ are the sets of entities and relations, and $\mathcal{I}$ and $\mathcal{T}$ are the images and textual descriptions of entities. Note that the relations of $(e_h, e_t)$ and $(e_q, e_a)$ are identical but unavailable, so the relation structure must be transferred implicitly from the source domain to the target domain without knowing the relations. Specifically, the task can be formalized as $(e_h, e_t) : (e_q, ?)$ and is further divided into Single Analogical Reasoning and Blended Analogical Reasoning according to the modalities of $e_h$, $e_t$, $e_q$ and $e_a$.

Single Analogical Reasoning. In this setting, the analogy example and the question-answer entity pair each involve only one modality. As shown in the middle column of Figure 1, the two entities of the analogy example $(e_h, e_t)$ share one modality, and the question-answer pair $(e_q, e_a)$ uses the other one. Based on the visual and textual modalities, this setting can be further divided into $(I_h, I_t) : (T_q, ?)$ and $(T_h, T_t) : (I_q, ?)$, where $I_h$ and $T_h$ indicate that the modality of $e_h$ is visual or textual, respectively.

Blended Analogical Reasoning. In this setting, the two entities of the analogy example $(e_h, e_t)$ have different modalities, which is similar to real-world human cognition and perception. Note that Mayer's theory indicates that humans can have powerful transfer and knowledge recall abilities in multimodal scenarios. Inspired by this, we propose blended analogical reasoning, which can be formalized as $(I_h, T_t) : (I_q, ?)$, meaning that the modality of $e_h$ (and $e_q$) differs from the modality of $e_t$ (and $e_a$).
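To make the two settings concrete, the following minimal Python sketch represents one instance and derives its setting from the modalities. The class name, field names and example entities are hypothetical and are not taken from the released MARS files; whether a text-head blended variant occurs in MARS is not stated above, so the check only mirrors the $(I_h, T_t) : (I_q, ?)$ pattern.

```python
from dataclasses import dataclass


# Hypothetical container for one analogy instance (illustrative field names).
@dataclass(frozen=True)
class AnalogyInstance:
    head: str               # e_h
    tail: str               # e_t
    question: str           # e_q
    answer: str             # e_a, hidden from the model at test time
    head_modality: str      # "image" or "text"
    tail_modality: str
    question_modality: str


def task_setting(inst: AnalogyInstance) -> str:
    """Return 'single' or 'blended' based on the modalities of the instance."""
    if inst.head_modality == inst.tail_modality:
        # (I_h, I_t):(T_q, ?) or (T_h, T_t):(I_q, ?): question uses the other modality.
        if inst.question_modality == inst.head_modality:
            raise ValueError("single setting expects the question in the other modality")
        return "single"
    # (I_h, T_t):(I_q, ?): the question shares the modality of the head entity.
    if inst.question_modality != inst.head_modality:
        raise ValueError("blended setting expects the question to match the head modality")
    return "blended"
```

For example, an instance with an image head, a text tail and an image question is classified as blended, while an image-image example with a text question is classified as single.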

3.2. DATA COLLECTION AND PREPROCESSING

We briefly introduce the construction process of the dataset in Figure 2. Firstly, we collect a multimodal knowledge graph dataset MarKG and a multimodal analogical reasoning dataset MARS, which are developed from seed entities and relations in E-KAR (Chen et al., 2022a) and BATs (Gladkova et al., 2016a). Secondly, we link these seed entities to the free and open knowledge base Wikidata for formalization and normalization. Thirdly, to acquire the image data, we search with the Google search engine and query the multimodal dataset Laion-5B (Schuhmann et al., 2021) using the text descriptions of entities. Then, an image validation strategy is applied to filter out low-quality images. Lastly, we sample high-quality analogy data to construct MARS. A detailed description of the data collection and processing used to create our datasets is given in Appendix B.1 and B.2.

3.3. DATASET STATISTICS

MARS is the evaluation dataset of the multimodal analogical reasoning task and contains the analogy instances, while MarKG provides the relational structure information of those analogy entities, retrieved from Wikidata. The statistics of MARS and MarKG are shown in Table 1 and Table 5. MarKG has 11,292 entities, 192 relations and 76,424 images, including 2,063 analogy entities and 27 analogy relations. MARS has 10,685 training, 1,228 validation and 1,415 test instances, which makes it larger than previous language analogy datasets. The original intention of MarKG is to provide prior knowledge of analogy entities and relations for better reasoning. We release the dataset with a leaderboard at https://zjunlp.github.io/project/MKG_Analogy/. More details, including quality control, can be found in Appendix B.3.

3.4. EVALUATION METRICS

A previous study (Chen et al., 2022a) adopts multiple-choice QA to conduct analogical reasoning and uses accuracy for evaluation. However, the multiple-choice QA setting may struggle to handle one-to-more entities, which are very common in real-world analogy scenarios. Thus, we formulate the task as link prediction that directly predicts the answer entity $e_a \in \mathcal{E}_a$. Our evaluation metrics include Hits@k (the proportion of correct entities ranked in the top k) and MRR (the mean of the reciprocal ranks of the correct entities). More details can be found in Appendix B.4.

4. BENCHMARK METHODS

In this section, we introduce some baselines to establish the initial benchmark results on MARS, including multimodal knowledge graph embedding baselines and multimodal pre-trained Transformer baselines. We further propose MarT, a multimodal analogical reasoning framework with Transformer, which can capture fine-grained associations between one analogy example and one analogy question-answer pair for better multimodal analogical reasoning.

[Figure 3: overview of the baseline methods. Panel (a): MKGE methods with the Abduction, Mapping and Induction steps. The remaining panels show the prompt inputs built from [CLS], [R], [MASK] and [SEP] tokens for the single and blended settings. Legend: matrix multiplication, image of the entity, text of the entity; [PAD] is a blank image for padding.]

4.1. MULTIMODAL KNOWLEDGE GRAPH EMBEDDING BASELINES

We consider three multimodal knowledge graph embedding (MKGE) approaches as our baselines: IKRL (Xie et al., 2017), TransAE (Wang et al., 2019), and RSME (Wang et al., 2021). These methods are typically based on TransE (Bordes et al., 2013) or ComplEx (Trouillon et al., 2016) and are combined with visual encoders that encode images for multimodal knowledge representation learning. They cannot be directly applied to the multimodal analogical reasoning task. To use MKGE methods, we first pre-train them on MarKG to obtain entity embeddings and then follow the structure-mapping theory (Minnameier, 2010) and use Abduction-Mapping-Induction as explicit pipeline steps. As shown in Figure 3.a, Abduction predicts the relation $r$ of $(e_h, e_t)$, similar to the relation classification task; Mapping maps the structural relation onto entity candidates, analogous to template filling; and Induction uses the relation $r$ to predict the tail entity of $(e_q, r, ?)$, similar to the link prediction task. Although previous MKGE methods achieve excellent performance on KG-related tasks, their backbones, such as TransE, are not designed for analogical reasoning, which may hinder performance. Thus, we also replace the backbone of the MKGE methods with ANALOGY (Liu et al., 2017), which models analogical structure explicitly, as additional baselines.
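The Abduction-Mapping-Induction pipeline can be illustrated with a small Python sketch. It is only a toy under stated assumptions: it uses a TransE-style score on hand-written 2-d embeddings, with hypothetical entity and relation names, and it omits the visual encoders and the ANALOGY backbone used by the actual baselines.

```python
import math

# Toy 2-d embeddings; every name and value here is hypothetical.
ENTITY = {
    "paris": (1.0, 0.0), "france": (1.0, 1.0),
    "tokyo": (3.0, 0.0), "japan": (3.0, 1.0), "river": (0.0, 4.0),
}
RELATION = {"capital_of": (0.0, 1.0), "flows_through": (2.0, -3.0)}


def transe_score(h, r, t):
    """TransE-style plausibility: negative L2 distance between h + r and t."""
    return -math.dist([a + b for a, b in zip(h, r)], t)


def abduction(e_h, e_t):
    """Step 1: choose the relation that best explains the analogy example."""
    return max(RELATION, key=lambda r: transe_score(ENTITY[e_h], RELATION[r], ENTITY[e_t]))


def induction(e_q, relation):
    """Step 3: rank candidate tail entities for (e_q, relation, ?)."""
    candidates = [e for e in ENTITY if e != e_q]
    return sorted(
        candidates,
        key=lambda e: transe_score(ENTITY[e_q], RELATION[relation], ENTITY[e]),
        reverse=True,
    )


def analogy(e_h, e_t, e_q):
    relation = abduction(e_h, e_t)   # Abduction
    # Step 2 (Mapping): the abduced relation is carried over to the question entity.
    return induction(e_q, relation)  # Induction


ranking = analogy("paris", "france", "tokyo")
```

In this toy setup the abduced relation for the example pair is `capital_of`, and the candidate whose embedding is closest to `tokyo + capital_of` is ranked first. The relation is never given as input, which mirrors the task definition.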

4.2. MULTIMODAL PRE-TRAINED TRANSFORMER BASELINES

We select multimodal pre-trained Transformer (MPT) approaches as strong baselines, including the single-stream models VisualBERT (Li et al., 2019) and ViLT (Kim et al., 2021), the dual-stream model ViLBERT (Lu et al., 2019), and the mixed-stream models FLAVA (Singh et al., 2022) and MKGformer (Chen et al., 2022b). However, current multimodal pre-trained Transformers cannot directly deal with analogical reasoning. To address this bottleneck, we devise an end-to-end approach that equips the MPT models with analogical reasoning ability. As shown in Figure 3, we first pre-train the model over the sparse MarKG to obtain the representations of entities and relations. We then perform prompt-based analogical reasoning over MARS.

4.2.1. PRE-TRAIN OVER MARKG

We represent the entities $e \in \mathcal{E}$ and relations $r \in \mathcal{R}$ as special tokens and denote by $E$ the learnable embeddings of these special tokens in the word vocabulary of the language model. In the pre-training stage, we design masked entity and relation prediction, similar to the Masked Language Modeling (MLM) task, to learn the embeddings of the special tokens over the MarKG dataset. As shown in Figure 3.b, we devise a prompt template that converts the input into predicting the missing entity or relation via the [MASK] token. In addition, we mix missing relation and missing entity prediction in the pre-training stage and consider different modalities of the input entities. Specifically, we represent the visual entity $e_h$ by its image $I_h$ and special entity embedding $E_{e_h}$, and the text entity $e_t$ by its text description $T_t$ and special entity embedding $E_{e_t}$. Benefiting from the mixed entity and relation prediction with multimodal entities in the pre-training stage, we obtain KG embeddings with multimodal semantics over the knowledge graph MarKG.

4.2.2. PROMPT-BASED ANALOGICAL REASONING

Based on the entity and relation embeddings pre-trained over MarKG, we propose prompt-based analogical reasoning with implicit structure mapping on the downstream MARS dataset. Taking blended analogical reasoning as an example, we feed the analogy example $(I_h, T_t)$ and the analogy question-answer pair $(I_q, ?)$ as input, and the goal is to predict the missing answer entity $e_a \in \mathcal{E}_a$. We use an analogical prompt template to convert the input as follows:

$$T_{(I_h, T_t, I_q)} = T_E \,\|\, T_A = I_h\, I_q\, [\mathrm{CLS}]\, e_h\, [\mathrm{R}]\, T_t\, e_t\, [\mathrm{SEP}] \;\|\; e_q\, [\mathrm{R}]\, [\mathrm{MASK}]\, [\mathrm{SEP}],$$

where $\|$ denotes concatenation in the template input, $I_h$ and $I_q$ are the images of the entities $e_h$ and $e_q$, and $T_t$ is the text description of the entity $e_t$. Moreover, $e_h$, $e_t$, $e_q$ are entity ids and are encoded as the special entity tokens $E_{e_h}$, $E_{e_t}$, $E_{e_q}$ in the word embedding layer. Since the relations are not explicitly provided in the actual analogical reasoning task, we assign [R] as a special token that stands in for the explicit relation between $(I_h, T_t)$; it is initialized with the average of the relation embeddings. Finally, we train the model to predict the [MASK] over the special token embeddings $E$ via a cross-entropy loss, as in the MLM task.

Remark 1 We summarize the two parts $T_E$ and $T_A$ of the template as the implicit Abduction and Induction respectively, which are unified in an end-to-end learning manner with prompt tuning. In addition, analogical reasoning is reformulated as predicting the [MASK] over the multimodal analogy entity embeddings to obtain $e_a$.
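As an illustration of the template, the following Python sketch assembles the input for the blended setting as a list of image placeholders plus a token sequence. The function name, the `[ENT:...]` and `<image:...>` placeholder formats, and the whitespace tokenization of the description are assumptions made for readability; a real model would use its own tokenizer and image processor.

```python
def build_analogy_prompt(e_h, text_t, e_t, e_q):
    """Return (images, tokens) for the blended setting (I_h, T_t) : (I_q, ?)."""
    # I_h and I_q: stand-ins for the image inputs of e_h and e_q.
    images = [f"<image:{e_h}>", f"<image:{e_q}>"]
    # T_E: analogy example with the special relation token [R].
    example = ["[CLS]", f"[ENT:{e_h}]", "[R]", *text_t.split(), f"[ENT:{e_t}]", "[SEP]"]
    # T_A: question-answer pair; the answer entity is predicted at [MASK].
    question = [f"[ENT:{e_q}]", "[R]", "[MASK]", "[SEP]"]
    return images, example + question


images, tokens = build_analogy_prompt("Q1", "a round fruit", "Q2", "Q3")
```

The entity ids `Q1`, `Q2`, `Q3` are hypothetical. The two occurrences of `[R]` correspond to the relation slot in the example part and in the question part of the template.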

[Figure 4: adaptive interaction across analogy (gated attention scores in the self-attention layer) and relation-oriented structure mapping (close relations, alienate entities).]

Although the approach above enables multimodal pre-trained Transformer models to perform multimodal analogical reasoning, it only superficially considers implicit Abduction and Induction and ignores the fine-grained associations between the analogy example and the analogy question-answer pair.

Adaptive Interaction Across Analogy. Since the analogy question may interfere with the representation of the analogy example, and since noisy data are inevitable, we propose adaptive interaction across analogy in the encoding process, which lets the analogy example and the question-answer pair interact adaptively, as shown in Figure 4. Denote the input to a Transformer layer as $X = [X_E \,\|\, X_A]$, where $X_E$ and $X_A$ are the hidden representations of the analogy example $T_E$ and the question-answer pair $T_A$, respectively. In each attention head of the layer, the query and key representations can be formalized as:

$$Q = X W_Q = \begin{pmatrix} X_E \\ X_A \end{pmatrix} W_Q = \begin{pmatrix} Q_E \\ Q_A \end{pmatrix}, \qquad K = X W_K = \begin{pmatrix} X_E \\ X_A \end{pmatrix} W_K = \begin{pmatrix} K_E \\ K_A \end{pmatrix},$$

where $W_Q$, $W_K$ are projection matrices. A similar expression also holds for the values $V$. Then the attention probability matrix $P$ can be defined in terms of four sub-matrices:

$$P = Q K^\top = \begin{pmatrix} Q_E \\ Q_A \end{pmatrix} \begin{pmatrix} K_E^\top & K_A^\top \end{pmatrix} = \begin{pmatrix} Q_E K_E^\top & Q_E K_A^\top \\ Q_A K_E^\top & Q_A K_A^\top \end{pmatrix} = \begin{pmatrix} P_{EE} & P_{EA} \\ P_{AE} & P_{AA} \end{pmatrix},$$

where $P_{EE}$, $P_{AA}$ (the diagonal of $P$) are intra-analogy attentions and $P_{EA}$, $P_{AE}$ (the anti-diagonal of $P$) are inter-analogy attentions. We use the gate $G$ to regulate the inter-analogy interactions adaptively:

$$P' = G \odot P = \begin{pmatrix} 1 & g_{EA} \\ g_{AE} & 1 \end{pmatrix} \odot \begin{pmatrix} P_{EE} & P_{EA} \\ P_{AE} & P_{AA} \end{pmatrix} = \begin{pmatrix} P_{EE} & g_{EA} P_{EA} \\ g_{AE} P_{AE} & P_{AA} \end{pmatrix},$$

where $G \in \mathbb{R}^{2 \times 2}$ is the adaptive association gate, which has two learnable variables $g_{EA}, g_{AE} \in [0, 1]$.
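The block-wise gating can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in the experiments: the gates are passed in as fixed numbers rather than learned, the matrix is a list of rows, and the function name and argument names are hypothetical.

```python
def gate_attention(P, n_example, g_ea, g_ae):
    """Scale the inter-analogy blocks of an attention score matrix P.

    Rows and columns with index < n_example belong to the analogy example (E);
    the remaining ones belong to the question-answer pair (A).
    """
    if not (0.0 <= g_ea <= 1.0 and 0.0 <= g_ae <= 1.0):
        raise ValueError("gates must lie in [0, 1]")
    gated = []
    for i, row in enumerate(P):
        new_row = []
        for j, p in enumerate(row):
            row_in_e, col_in_e = i < n_example, j < n_example
            if row_in_e and not col_in_e:
                p = g_ea * p   # P_EA block: example queries, question keys
            elif not row_in_e and col_in_e:
                p = g_ae * p   # P_AE block: question queries, example keys
            new_row.append(p)  # P_EE and P_AA blocks stay unchanged
        gated.append(new_row)
    return gated


P = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
P_gated = gate_attention(P, n_example=2, g_ea=0.5, g_ae=0.0)
```

With `g_ae = 0.0` the question-answer tokens no longer attend to the example tokens, and with `g_ea = 0.5` the influence of the question on the example is halved, while the intra-analogy blocks are left untouched.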

[Table 2 (columns: Method, Baselines, Backbone, Hits@1, Hits@3, Hits@5, Hits@10, MRR): The main performance results on MARS. We report pipeline baselines with multimodal knowledge graph embedding (MKGE) methods and replace their backbone models with the analogy-aware model ANALOGY. We also apply our MarT to end-to-end baselines with multimodal pre-trained Transformer (MPT) methods and obtain the best performance with MarT MKGformer.]

Remark 2 On the one hand, the query from $T_A$ may interfere with the example from $T_E$. On the other hand, $T_E$ may have a weaker impact on $T_A$ in noisy data. The adaptive association gates can increase and decrease the inter-analogy interaction automatically based on the closeness of $T_E$ and $T_A$.

Relation-Oriented Structure Mapping. The structure mapping theory emphasizes relation transfer rather than object similarity in analogical reasoning: it is the relations between objects, rather than the attributes of objects, that are mapped from base to target. For example, a battery can be analogized to a reservoir because they both store potential, not because both are cylindrical. Motivated by this, we propose the relaxation loss to bring the relations closer and alienate the entities:

$$\mathcal{L}_{rel} = \frac{1}{|S|} \sum_{i=1}^{|S|} \Big( \underbrace{1 - \mathrm{sim}\big(h^{E}_{[R]}, h^{A}_{[R]}\big)}_{\text{close relations}} + \underbrace{\max\big(0, \mathrm{sim}(h_{e_h}, h_{e_q})\big)}_{\text{alienate entities}} \Big),$$

where $|S|$ is the total number of instances in the training set $S$, $h^{E}_{[R]}$ is the hidden feature of [R] in the analogy example $T_E$ output from the MLM head ($h^{A}_{[R]}$ is defined analogously for $T_A$), and $\mathrm{sim}(\cdot)$ is the cosine similarity.
We use the masked entity prediction task to obtain the answer entity $e_a$ with a cross-entropy loss:

$$\mathcal{L}_{mem} = -\frac{1}{|S|} \sum_{(e_h, e_t, e_q, e_a) \in S} \log p\big([\mathrm{MASK}] = e_a \mid T_E \,\|\, T_A\big).$$

Afterwards, we interpolate the relaxation loss $\mathcal{L}_{rel}$ and the masked entity prediction loss $\mathcal{L}_{mem}$ using the parameter $\lambda$ to produce the final loss $\mathcal{L}$:

$$\mathcal{L} = \lambda \mathcal{L}_{rel} + (1 - \lambda) \mathcal{L}_{mem} \tag{7}$$

Remark 3 The relaxation loss is composed of a pull-in term and a pull-away term, which correspond to the close relations and alienate entities terms, respectively. It constrains the model to focus on relation structure transfer and implicitly realizes the Structure Mapping process.
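The interplay of the two loss terms can be checked numerically with a small Python sketch. It assumes the standard cross-entropy form for the masked entity prediction loss (the negative log-probability of the gold answer), works on plain lists of numbers instead of model outputs, and uses hypothetical function names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def relaxation_loss(batch):
    """batch: list of (h_R_example, h_R_question, h_e_h, h_e_q) feature vectors."""
    total = 0.0
    for h_r_e, h_r_a, h_eh, h_eq in batch:
        close_relations = 1.0 - cosine(h_r_e, h_r_a)      # pull the two [R] features together
        alienate_entities = max(0.0, cosine(h_eh, h_eq))  # push e_h and e_q apart
        total += close_relations + alienate_entities
    return total / len(batch)


def masked_entity_loss(gold_probs):
    """Assumed cross-entropy form: gold_probs[i] is the probability of the gold e_a."""
    return -sum(math.log(p) for p in gold_probs) / len(gold_probs)


def total_loss(l_rel, l_mem, lam):
    """Interpolate the two losses with the parameter lambda."""
    return lam * l_rel + (1.0 - lam) * l_mem


# Identical [R] features and orthogonal entity features: no penalty.
best_case = relaxation_loss([((1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0))])
# Orthogonal [R] features and identical entity features: both terms are active.
worst_case = relaxation_loss([((1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 0.0))])
```

The relaxation loss vanishes when the two `[R]` features coincide and the entity features are not positively correlated, which is the configuration that the close relations and alienate entities terms reward.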

5.1. MAIN RESULTS

The main performance results of all benchmark methods can be seen in Table 2. In general, we find that the performance of the multimodal knowledge graph embedding (MKGE) baselines and the multimodal pre-trained Transformer (MPT) baselines is comparable, except for MKGformer, which establishes a

Published as a conference paper at ICLR 2023

[Table: columns Model, Hits@1, Hits@3, Hits@5, Hits@10, MRR]

Making analogies from one domain to another novel domain is a fundamental ingredient of human creativity. In this section, we conduct a novel relation transfer experiment (including both task settings) to measure how well the models generalize by analogy to unfamiliar relations. Specifically, we randomly split the 27 analogy relations into source and target relations. The models are then trained on the source relations and tested on the novel target relations. As shown in Table 3, we observe that MarT MKGformer can indeed learn to make sense of unfamiliar relations. We further evaluate the model without pre-training on MarKG and find that the performance decreases, which indicates that the structure knowledge provided by MarKG is critical for generalization. Note that the novel relation transfer setting is somewhat similar to zero-shot learning or domain generalization, and we hope our work can benefit other communities.
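A relation-level split of this kind can be sketched as follows. The number of target relations, the random seed, the relation names and the `relation` key of an instance are all assumptions for illustration; the split ratio actually used is not stated here.

```python
import random


def split_relations(relations, n_target, seed=0):
    """Randomly partition relation names into (source, target) sets."""
    rng = random.Random(seed)
    shuffled = sorted(relations)  # sort first so the split is reproducible
    rng.shuffle(shuffled)
    return set(shuffled[n_target:]), set(shuffled[:n_target])


def split_instances(instances, source, target):
    """Train on instances of source relations, test on instances of target relations."""
    train = [x for x in instances if x["relation"] in source]
    test = [x for x in instances if x["relation"] in target]
    return train, test


relations = [f"relation_{i}" for i in range(27)]   # hypothetical names
source, target = split_relations(relations, n_target=7)
instances = [{"relation": r, "id": i} for i, r in enumerate(relations)]
train, test = split_instances(instances, source, target)
```

Because the partition is made over relations rather than over instances, no relation seen during training appears at test time, which is what makes the setting comparable to zero-shot evaluation.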

5.3. ABLATION STUDY

To validate the effectiveness of MarKG and MarT, we conduct an ablation study, as shown in Table 4. We observe that discarding pre-training on MarKG results in worse performance for both MKGE and MPT baselines. This indicates that the structural knowledge provided by MarKG helps learn the representations of entities and relations, which further benefits analogical reasoning. We also find that the performance clearly drops when ablating each component of MarT and is lowest when ablating all of them, which demonstrates the effectiveness of each analogical component of MarT. Moreover, when we ablate the analogy example from the input, the performance drops substantially, which reveals the importance of analogical prompts.

5.4. ANALYSIS

Analysis across Different Sub-Tasks. In Table 2, ANALOGY significantly improves the performance of the MKGE baselines. Therefore, we further compare the performance of the vanilla baselines with that of their analogy-aware counterparts in the different sub-task settings. As shown in Figure 5, we observe that vanilla TransAE performs poorly in the blended task setting. However, when the backbone TransE is replaced with ANALOGY, TransAE becomes competent in the blended analogical reasoning setting and even outperforms its results in the single setting. On the other hand, RSME with ComplEx as backbone can barely handle the blended setting and performs worse there than in the single setting. ANALOGY improves the performance of RSME in this situation. Meanwhile, MarT further exploits the potential of MKGformer and improves its performance in the various tasks. All in all, the analogical components consistently improve the multimodal analogical reasoning ability of all baseline methods, especially in blended analogical reasoning, which is consistent with Mayer's theory (Mayer, 2002) that analogical reasoning has a stronger affinity for multimodal scenarios.

[Figure 5: panels $(T_h, T_t):(I_q, ?)$, $(I_h, I_t):(T_q, ?)$ and $(I_h, T_t):(I_q, ?)$.]

Case Analysis. As shown in Figure 6, we provide a case analysis and observe that the top-ranked entities (film, life, etc.) of the baselines without analogical components are usually irrelevant to the question entity "campaign". Analogical components make the predictions more reasonable and successfully predict the answer entity "battle". In the difficult blended analogical reasoning setting, the mixed visual and textual input is challenging. We find that vanilla MKGformer and TransAE fail to understand the visual semantics of "apple" and incorrectly link it with "capital, phone, shipping", which are related to "Apple Company".
We also notice that TransAE with ANALOGY as backbone significantly decreases the prediction error but incorrectly predicts "plant" as the top-1 entity because of the interference of "Panax notoginseng". In contrast, MarT MKGformer with the relaxation loss can alienate the entities, focus on the transfer of relation structures, and obtain reasonable predictions. These observations reveal that multimodal analogical reasoning is a highly challenging task and that analogy-aware components can enhance the analogical ability of models. Besides, we discuss limitations in Appendix A and provide a comprehensive error analysis in Appendix D.

6. DISCUSSION AND CONCLUSION

In this work, we introduce the new task of multimodal analogical reasoning over knowledge graphs. Preliminary experiments show that this task poses a rather difficult challenge and is worth further exploration. Besides evaluating the analogical reasoning ability of models, there are some potential applications to explore: (1) knowledge graph completion with analogies, (2) transfer learning and zero-shot learning by analogy, and (3) analogical question answering. We hope our work inspires future research on analogical reasoning and its applications, especially in the multimodal world.

and we leave this for future work. Besides, we have not evaluated very large-scale pre-trained models on MARS because of limited GPU resources, and it is well worth investigating whether the multimodal analogical reasoning ability can emerge in large-scale pre-trained models.

B ADDITIONAL DATASETS INFORMATION

Step 1: Collect Analogy Entities and Relations. Since E-KAR and BATs are widely used text analogy datasets with high-quality and semantically specific entities, we collect the analogy seed entities $\mathcal{E}_a$ and relations from them according to the following criteria:

(1) Relations and entities that have the same meaning are merged. For example, we merge the relation is a of E-KAR and the relation Hypernyms of BATs since they both represent the hypernym relationship between entities. We obtain 38 relations after this step.

(2) The relation must imply analogical knowledge reasoning rather than simple linear word analogy. For example, we discard the analogy relations of the BATs dataset that only reflect simple word changes, such as Inflections (Nouns, Verbs, etc.) and Derivation (Stem change, etc.). In this step, we filter out 11 relations and retain 27 analogy relations.

(3) The entity must be visualizable and realistic. We filter out those entities that cannot be linked to Wikidata and manually drop extremely abstract entities such as virtue (some entities that have no image after Step 3 are also filtered out). We discard a total of 463 entities after filtering. Finally, we obtain 2,063 seed entities and 27 relations.

Step 2: Link to Wikidata and Retrieve Neighbors. Complex analogical reasoning is difficult when only the individual information (descriptions or images) of entities is available. We therefore link the analogy seed entities to Wikidata with the MediaWiki API and retrieve the one-hop neighbors of the seed entities, as well as the possible relationships between the seed entities, to obtain their neighbor structure information. In this step, we also take the descriptions retrieved from Wikidata as the textual information of entities and relations.

Step 3: Acquire and Validate Images. We collect images from two sources: the Google search engine and the Laion-5B query service. We search Google with the descriptions of entities and crawl 5 images per entity. The Laion-5B service relies on CLIP retrieval and queries a kNN index; we use the CLIP text embedding of the description and also query 5 images for each entity. Then we apply four filters to these images: (1) we check the format of the images and filter out invalid files, (2) we remove corrupted images (damaged files that cannot be opened), low-quality images (image size less than 50 × 50 or non-panchromatic images) and duplicate images, (3) we use CLIP (Radford et al., 2021) to remove the images with outlier visual embeddings, and (4) we delete unreasonable images manually.

Step 4: Sample Analogical Reasoning Data. From Step 1 to Step 3, we obtain MarKG, which includes 2,063 analogy entities, 8,881 neighbor entities, 27 analogy relations and 165 other relations.
To construct the MARS dataset, we sample an analogy example $(e_h, r, e_t)$ and an analogy question-answer pair $(e_q, r, e_a)$ with the same relation $r$ from the 2,063 analogy entities, but we do not explicitly provide the relation in the input. Then we split the data evenly into the different task settings. More details about the sampling strategy of MARS can be found in Section B.2.
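The sampling step can be sketched in Python as follows, using the two-category scheme detailed in Section B.2. The function name, the half-and-half partition, the number of samples per relation and the output dictionary layout are assumptions for illustration, not the released sampling script.

```python
import random
from collections import defaultdict


def sample_instances(triples, n_per_relation, seed=0):
    """Build analogy instances from (head, relation, tail) triples."""
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for h, r, t in triples:
        by_relation[r].append((h, t))
    instances = []
    for r, pairs in sorted(by_relation.items()):
        pairs = sorted(pairs)
        rng.shuffle(pairs)
        half = len(pairs) // 2
        # Two disjoint categories: one for examples, one for question-answer pairs.
        examples, questions = pairs[:half], pairs[half:]
        if not examples:
            continue  # a relation with a single pair cannot form an analogy
        for _ in range(n_per_relation):
            (e_h, e_t), (e_q, e_a) = rng.choice(examples), rng.choice(questions)
            # The relation r only guides sampling; it is not stored in the input.
            instances.append({"example": (e_h, e_t), "question": e_q, "answer": e_a})
    return instances


triples = [  # hypothetical triples
    ("paris", "capital_of", "france"), ("tokyo", "capital_of", "japan"),
    ("rome", "capital_of", "italy"), ("cairo", "capital_of", "egypt"),
    ("lonely", "unpaired_relation", "pair"),
]
sampled = sample_instances(triples, n_per_relation=3)
```

Drawing the example and the question-answer pair from disjoint categories guarantees that the same entity pair never appears on both sides of one instance.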

B.2 SAMPLE STRATEGY OF MARS

In Section B.1, we obtain the analogy seed entities $\mathcal{E}_a$ and the analogy relations between them. Then we sample the analogy example $(e_h, e_t)$ and the analogy question-answer pair $(e_q, e_a)$ from $\mathcal{E}_a$. Guided by SMT, we make sure that $(e_h, e_t)$ and $(e_q, e_a)$ have the same relation $r$. Specifically, we divide the entity pairs that share the same relation into two categories to avoid overlap. Then we randomly sample the analogy examples from one category and the analogy question-answer pairs from the other to construct analogy input instances. Last, we split the instances evenly into the different task settings.

The statistical comparison of MarKG with two multimodal knowledge graph datasets, WN9-IMG (Xie et al., 2017) and FB15k-IMG (Liu et al., 2019), is shown in Table 5, where we report the numbers of entities, relations, triples and images, as well as the data source. Note that WN9-IMG and FB15k-IMG aim at knowledge completion and triple classification tasks, while our MarKG aims to support multimodal analogical reasoning on MARS. We also show the complete relations of MARS in Table 6 and the distribution of relation categories in Figure 7.

Human evaluation on MARS. To evaluate the complexity and difficulty of the multimodal analogical reasoning task, we conduct a human evaluation in this section. However, humans encounter the following problems in this entity prediction task: (1) The candidate entity set is too large for humans to select one entity from. (2) The Hits@k metric is not applicable because it is hard for humans to produce ranked predictions. Therefore, we use the multiple-choice format for human annotators and apply the Accuracy metric for evaluation. Specifically, we randomly sample 100 instances from the test set to construct the evaluation set, and we use the top 10 ranked entities in the TransAE prediction as candidate choices for each instance. If the gold answer entity is not among the top 10 entities, we randomly replace one candidate with the gold entity.
Then humans must select one entity from the candidate choices as the answer entity. The results can be seen in Table 7 . We limit the prediction space of baseline models in candidate choices for a fair comparison. We find that the performance of the baselines in the Hit@1 metric has a large gap with human, which indicates the difficulty of the multimodal analogical reasoning task.
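
The construction of the multiple-choice candidates can be sketched as follows. This is a minimal illustration that assumes the model's predictions are already available as a ranked list of entity names; the helper names `build_choices` and `accuracy` are hypothetical and not taken from the paper's code.

```python
import random


def build_choices(ranked_entities, gold, k=10, rng=None):
    """Return k candidate choices that are guaranteed to contain the gold entity.

    ranked_entities: entities sorted from best to worst by a model (assumed input).
    If the gold entity is missing from the top k, one candidate chosen at
    random is replaced by it, as described in the text.
    """
    rng = rng or random.Random(0)
    choices = list(ranked_entities[:k])
    if gold not in choices:
        choices[rng.randrange(len(choices))] = gold
    return choices


def accuracy(selected, gold_answers):
    """Fraction of instances where the selected entity equals the gold entity."""
    correct = sum(s == g for s, g in zip(selected, gold_answers))
    return correct / len(gold_answers)
```

Restricting a baseline to the same candidate list and taking its top-scoring candidate makes its Hits@1 directly comparable to human accuracy.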

B.4 DETAILED EVALUATION METRICS

The evaluation method of Chen et al. (2022a) cannot reflect one-to-more entities and, because of its limited search space, does not fully explore the internal knowledge of the models. We therefore follow the link prediction task and choose Hits@k and MRR as our evaluation metrics. Both metrics lie in the range [0, 1]; higher is better. Given the prediction score of each entity in the candidate entity set, we sort the scores to obtain the ranking of each entity. Hits at k (Hits@k) is the proportion of cases in which the gold entity appears within the first k positions of the predicted ranking. Denote the rank of the gold entity of the i-th triple as rank_i; its reciprocal rank is 1/rank_i. The Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks across all triples:

$$\mathrm{MRR} = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{1}{\mathrm{rank}_i}$$

where |S| is the total number of the training set.
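
Both metrics can be computed from the rank of the gold entity alone. The sketch below is a minimal illustration under two assumptions that are not stated in the text: higher scores are better, and ties are resolved optimistically (entities with the same score as the gold entity do not push its rank down). All function names are hypothetical.

```python
def gold_rank(scores, gold):
    """1-based rank of the gold entity.

    scores: dict mapping entity -> prediction score (higher is better, assumed).
    Ties are resolved optimistically, which is an assumption of this sketch.
    """
    gold_score = scores[gold]
    return 1 + sum(1 for s in scores.values() if s > gold_score)


def hits_at_k(ranks, k):
    """Proportion of cases whose gold entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mrr(ranks):
    """Mean of the reciprocal ranks 1 / rank_i."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For instance, with gold ranks [1, 2, 4], Hits@1 is 1/3, Hits@3 is 2/3, and MRR is (1 + 1/2 + 1/4) / 3 ≈ 0.583.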

C ADDITIONAL EXPERIMENTS

C.1 IMPLEMENTATION DETAILS

This section details the training procedures and hyper-parameters for the various models. For multimodal knowledge representation methods, we first perform knowledge representation learning on MarKG to obtain the entity and relation embedding matrices. We then apply the abduction and induction processes to continue training the models on the MARS dataset. Note that these processes are serial and share models. For multimodal pre-trained Transformer models, we likewise pre-train on MarKG and then fine-tune on MARS end-to-end with our analogy prompt tuning strategy. We conduct all experiments with PyTorch on a single NVIDIA 3090 GPU. The hyper-parameters are detailed in Table 8.

the high semantic entities exist. As shown in example (a), "management" and "control" are abstract entities for which equivalent images are difficult to find. Moreover, the uncoordinated convergence problem in multimodal learning further exacerbates the difficulty of the multimodal analogical reasoning task (Peng et al., 2022; Wang et al., 2020). 2) One-to-more problem. It is challenging for the models to handle one-to-more entities. In example (b), "Memba" is an instance of both "snake" and "animal", which confuses MKGformer. 3) Unintuitive relations. In our MARS dataset, some relations are not intuitive and require models to have strong relational reasoning ability. As shown in example (c), the relation "intersection to" means that the extensions of the head and tail entities intersect. All four models struggle, and their predictions are far from the golden answer entity.



For example, humans invented hieroglyphics by analogy from the concrete world.
https://www.wikidata.org
A Huggingface Demo at https://huggingface.co/spaces/zjunlp/MKG_Analogy.
https://www.wikidata.org/w/api.php
https://knn5.laion.ai/



Figure 1: Overview of the Multimodal Analogical Reasoning task. We divide the task into single and blended settings with a multimodal knowledge graph. Note that the relations marked by dashed arrows and the text in parentheses under the images are only for annotation and are not provided in the input.

Figure 3: Overview of baseline methods. (a) Pipeline of MKGE methods for multimodal analogical reasoning. (b) and (c) are two stages of multimodal pre-trained Transformer (MPT) baselines.

Figure 4: The MarT framework.

$$\sum_{(e_h, e_t, e_q, e_a) \in S} \log p\big([\mathrm{MASK}] = e_a \mid \mathcal{T}_{(e_h, e_t, e_q)}\big)$$

Figure 6: Case examples of MARS. We show the analogy example and analogy question-answer pair with their implicit relations. "Top-3 Entity" means top-3 ranking entities in the prediction. "Gold Rank" refers to the rank of the gold answer entity in the prediction. * denotes the baseline model with analogical components (MarT or ANALOGY).

Figure 5: Performance on MARS in different sub-task settings.

Figure 7: Relation distribution of MARS.

Figure 10: Error case examples.

An illustration of data collection and processing steps to create MARS and MarKG.

Comparison between MARS and previous analogical reasoning datasets. "KB" refers to the knowledge base, # denotes the number. "Knowledge Intensive" means reasoning requires external knowledge. Our MarKG focuses on knowledge-intensive reasoning across multiple modalities.

Ablation experiments on MARS. w/o MarKG refers to the model without pre-training on the MarKG dataset. w/o MarT ablates all components of MarT, which is equivalent to MKGformer.

competitive baseline of MARS. In addition, when the backbone of MKGE methods is replaced with ANALOGY, which models analogical structure explicitly, the performance improves significantly. Meanwhile, the MPT models, which have no analogy-related structures, obtain substantial performance once their analogical reasoning ability is enhanced by MarT. For example, although MKGformer already achieves outstanding performance, MarT MKGformer improves further and obtains state-of-the-art performance, exceeding other methods by 4.9 to 12.4 percentage points in MRR. This shows that the MarT framework stimulates the ability of Transformer-based models for multimodal analogical reasoning. We also report the pre-training results on MarKG in Appendix C.2.



Data statistics of MarKG. # denotes the number.

The complete relations of MARS with definitions and examples. Some relations and definitions follow Chen et al. (2022a) and Wikidata Properties.

Quality Control of Datasets. We devise several quality control strategies while constructing the MarKG and MARS datasets: (1) Entity and relation formalization and normalization. We link the analogy entities collected from E-KAR and SAT to Wikidata and filter out items that cannot be linked. Since Wikidata is a quality-assured knowledge base, some rare or worthless entities are excluded. (2) Image validation mechanism. We devise complex image filtering strategies to control the robustness of the image data, as described in Section B.1. (3) Control of text description. We take the Wikidata description as the textual information of each entity.

Human evaluation on MARS.

ACKNOWLEDGMENT

We would like to express gratitude to the anonymous reviewers for their kind comments. This work was supported by the National Natural Science Foundation of China (No.62206246 and U19B2027), Zhejiang Provincial Natural Science Foundation of China (No. LGG22F030011), Ningbo Natural Science Foundation (2021J190), and Yongjiang Talent Introduction Programme (2021A-156-G), CAAI-Huawei MindSpore Open Fund, and NUS-NCS Joint Laboratory (A-0008542-00-00).

REPRODUCIBILITY STATEMENT

The source MARS and MarKG datasets will be released on GitHub soon. To support reproduction of our experiments in Section 5, we provide the detailed source code of all pipeline baselines (IKRL, TransAE, RSME) and end-to-end baselines (VisualBERT, ViLBERT, ViLT, FLAVA, MKGformer) in the supplementary materials, together with all scripts and hyper-parameters. We also provide a README that explains how to run the code.

Published as a conference paper at ICLR 2023

Table columns: Method; Baselines; Entity Prediction (Hits@1, Hits@3, Hits@10, MRR); Relation Prediction (Hits@1, Hits@3, Hits@10, MRR).

We report the pre-training results on MarKG in Table 9. MPT baselines consistently outperform MKGE baselines, which reveals the strong fitting ability of Transformer-based models. As shown in Figure 8, the trends in the pre-training and fine-tuning stages are roughly the same, especially among baselines of the same type, which indicates that pre-training on MarKG benefits analogical reasoning on MARS.

C.3 RESULTS OF IMPLICIT RELATION INFERENCE OF MPT.

We evaluate the relation inference of MKGE and MPT methods. For MKGE methods, we evaluate the relation predicted by the abduction process with Hits@k metrics. Since MPT methods solve the analogical reasoning task end-to-end without any explicit relation prediction process, we evaluate their relation-aware abilities in two ways. First, we predict the relation via the special relation token [R], similar to masked entity prediction, and evaluate the predictions with Hits@k metrics. However, this evaluation method does not precisely reflect the relation-aware abilities of models since [...] where [...] is the hidden state of [R] in the last Transformer layer and E_{e_r} is the special relation embedding (described in Section 4.2.1) of the golden relation r.

The evaluation results are shown in Table 6. We find that MKGE methods perform better than most MPT methods on Hits@k metrics, especially on Hits@3, which may benefit from the explicit relation perception in the pipeline process. Moreover, MarT FLAVA achieves the best relation-aware performance on the Hits@k and Euclidean distance metrics, yet it performs worse than MarT MKGformer in answer entity prediction, as shown in Table 2. We speculate that the special token [R] contains not only the golden relation but also information about other related relations.

C.4 COMPARISON OF PERFORMANCE AND MODEL SIZE

In this section, we detail the sizes of the MPT baseline models and compare them with their performance. Among MPT models, the single-stream models (VisualBERT, ViLT) are the smallest, the dual-stream model (ViLBERT) is of medium size, and the mixed-stream models (FLAVA, MKGformer) are the biggest. The performance of the models is roughly proportional to their sizes, as shown in Figure 9. MKGformer outperforms all other models, including the biggest FLAVA model.

D ERROR CASE ANALYSIS

In this section, we conduct an error case study on MARS, shown in Figure 10. The error cases illustrate the difficulty of the multimodal analogical reasoning task: 1) Multimodal imbalance. The semantic scales of images and text are inconsistent, which leads to incorrect matching (Zhu et al., 2022). Although we filter some hard-to-visualize entities in data collection in Section B.1,

