MULTIMODALQA: COMPLEX QUESTION ANSWERING OVER TEXT, TABLES AND IMAGES

Abstract

When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been relatively little work on question answering models that reason across multiple modalities. In this paper, we present MULTIMODALQA (MMQA): a challenging question answering dataset that requires joint reasoning over text, tables and images. We create MMQA using a new framework for generating complex multi-modal questions at scale, harvesting tables from Wikipedia, and attaching images and text paragraphs using entities that appear in each table. We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions. Last, crowdsourcing workers take these automatically generated questions and rephrase them into more fluent language. We create 29,918 questions through this procedure, and empirically demonstrate the necessity of a multi-modal multi-hop approach to solve our task: our multi-hop model, ImplicitDecomp, achieves an average F1 of 51.7 over cross-modal questions, substantially outperforming a strong baseline that achieves 38.2 F1, but still lagging significantly behind human performance, which is at 90.1 F1.

1. INTRODUCTION

When presented with complex questions, people often do not know in advance which source(s) of information are relevant for answering them. In general scenarios, these sources can encompass multiple modalities, be they paragraphs of text, structured tables, images, or combinations thereof. For instance, a user might ponder "When was the famous painting with two touching fingers completed?", if she cannot remember the exact name of the painting. Answering this question is made possible by integrating information across both the textual and visual modalities.

Recently, there has been substantial interest in question answering (QA) models that reason over multiple pieces of evidence (multi-hop questions (Yang et al., 2018; Talmor & Berant, 2018; Welbl et al., 2017)). In most prior work, the question is phrased in natural language and the answer is found in a context from a single modality, which may be a paragraph (Rajpurkar et al., 2016), a table (Pasupat & Liang, 2015), or an image (Antol et al., 2015). However, there has been relatively little work on answering questions that require integrating information across modalities. Hannan et al. (2020) created MANYMODALQA, a dataset where the context for each question includes information from multiple modalities. However, the answer to each question can be derived from a single modality only, and no cross-modality reasoning is needed; thus, the task is focused on identifying the relevant modality. Recently, Chen et al. (2020b) presented HYBRIDQA, a dataset that requires reasoning over tabular and textual data. While HYBRIDQA requires cross-modal reasoning, it does not require visual inference, limiting the types of questions that can be represented (see Table 1 for a comparison between the datasets).

In this work, we present MMQA, the first large-scale (29,918 examples) QA dataset that requires integrating information across free text, semi-structured tables, and images, where 35.7% of the questions require cross-modality reasoning. Figure 1 shows an example question: "Which B.Piazza title came earlier: the movie S. Stallon's son starred in, or the movie with half of a lady's face on the poster?". Answering this question entails (i) decomposing the question into a sequence of simpler questions, (ii) determining the modality of each simpler question and answering it: the information on the poster is in an image, the information on "S. Stallon's son" is in free text, and the years of the movies are in the table, and (iii) combining the information from the simpler questions to compute the answer: "Tell Me that you love me, Junie Moon".

Our methodology for creating MMQA involves three high-level steps. (a) Context construction: we harvest tables from Wikipedia and connect each table to images and paragraphs that appear in existing Reading Comprehension (RC) datasets (Kwiatkowski et al., 2019; Clark et al., 2019; Yang et al., 2018). (b) Question generation: following past work (Talmor & Berant, 2018), we use the linked structure of the context to automatically generate pseudo-language questions that require multiple reasoning operations (composition, conjunction, comparison) across modalities. (c) Paraphrasing: we use crowdsourcing workers to paraphrase the pseudo-language questions into more fluent English.
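To illustrate step (b), the following is a minimal sketch of how single-modality questions could be combined by composition and comparison operators into cross-modal pseudo-language questions. The data structures, function names, bracket notation, and toy values are assumptions made for exposition and are not taken from the actual MMQA generation code.

```python
from dataclasses import dataclass

# Illustrative-only sketch of cross-modal question composition; all names here
# are hypothetical and do not correspond to the MMQA codebase.

@dataclass
class SimpleQuestion:
    pseudo: str      # pseudo-language question answerable from a single modality
    modality: str    # "text", "table", or "image"

def compose(outer_pseudo: str, bridge_entity: str, inner: SimpleQuestion) -> str:
    """Composition: replace a bridge entity in the outer question with the inner
    question, yielding a question that requires both modalities to answer."""
    return outer_pseudo.replace(bridge_entity, f"[{inner.pseudo}]")

def compare(attribute: str, q1: SimpleQuestion, q2: SimpleQuestion) -> str:
    """Comparison: ask which of two (possibly cross-modal) entities comes first
    according to an attribute stored in the table, e.g. a release year."""
    return f"Which came earlier by {attribute}: [{q1.pseudo}] or [{q2.pseudo}]?"

# Toy analogue of the Figure 1 question.
txt_q = SimpleQuestion("the B.Piazza movie that S. Stallon's son starred in", "text")
img_q = SimpleQuestion("the B.Piazza movie with half of a lady's face on the poster", "image")

print(compose("In which year was Movie X released?", "Movie X", txt_q))
print(compare("release year", txt_q, img_q))
```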
We empirically evaluate MMQA by comparing ImplicitDecomp to strong baselines that do not perform cross-modal reasoning and to human performance. We find that on multimodal questions, ImplicitDecomp improves F1 from 38.2 to 51.7 over a single-hop approach. Humans reach 90.1 F1, significantly outperforming our best model. Because automatic evaluation is non-trivial, we also manually analyze human performance and find that humans correctly answer 94.5% of the questions in MMQA. Finally, our dataset can be used in an open-domain setup over all of Wikipedia; in this setup, human F1 is 84.8.
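To make the decompose-route-combine recipe described above concrete, the sketch below executes a two-hop cross-modal "program" by routing each hop to a modality-specific reader and feeding the previous answer forward. The reader interface, the explicit hop list, and the toy data are assumptions for this illustration; this is not the paper's ImplicitDecomp model, which resolves the decomposition implicitly rather than taking it as input.

```python
from typing import Callable, Dict, List, Tuple

# (question, context for that modality) -> answer; in practice these would be
# trained readers over text, tables, and images.
Reader = Callable[[str, dict], str]

def run_program(hops: List[Tuple[str, str]],
                readers: Dict[str, Reader],
                context: Dict[str, dict]) -> str:
    """Execute a sequence of (modality, question template) hops.
    The placeholder "{prev}" in a template is filled with the previous hop's answer."""
    answer = ""
    for modality, template in hops:
        question = template.replace("{prev}", answer)
        answer = readers[modality](question, context[modality])
    return answer

if __name__ == "__main__":
    # Toy lookup-table readers standing in for trained text/table/image QA models.
    readers: Dict[str, Reader] = {
        "text":  lambda q, ctx: ctx.get(q, "?"),
        "table": lambda q, ctx: ctx.get(q, "?"),
        "image": lambda q, ctx: ctx.get(q, "?"),
    }
    context = {
        "text":  {"Which movie did the actor's son star in?": "Movie A"},
        "table": {"Which came earlier: Movie A or Movie B?": "Movie B"},
        "image": {},
    }
    hops = [("text",  "Which movie did the actor's son star in?"),
            ("table", "Which came earlier: {prev} or Movie B?")]
    print(run_program(hops, readers, context))  # -> "Movie B"
```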



Figure 1: Example of an MMQA question, answer, and context. In green are the text-modality question and answer, and in red the image-modality question and answer. The table is used to perform the year comparison between the answers to the text and image question parts.

