MULTIMODALQA: COMPLEX QUESTION ANSWERING OVER TEXT, TABLES AND IMAGES

Abstract

When answering complex questions, people can seamlessly combine information from visual, textual, and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been relatively little work on question answering models that reason across multiple modalities. In this paper, we present MULTIMODALQA (MMQA): a challenging question answering dataset that requires joint reasoning over text, tables, and images. We create MMQA using a new framework for generating complex multi-modal questions at scale, harvesting tables from Wikipedia and attaching images and text paragraphs using entities that appear in each table. We then define a formal language that allows us to take questions that can be answered from a single modality and combine them to generate cross-modal questions. Finally, crowdsourcing workers take these automatically generated questions and rephrase them into more fluent language. We create 29,918 questions through this procedure and empirically demonstrate the necessity of a multi-modal multi-hop approach to solve our task: our multi-hop model, ImplicitDecomp, achieves an average F1 of 51.7 over cross-modal questions, substantially outperforming a strong baseline that achieves 38.2 F1, but still lags significantly behind human performance, which is at 90.1 F1.

1. INTRODUCTION

When presented with complex questions, people often do not know in advance which source(s) of information are relevant for answering them. In general scenarios, these sources can encompass multiple modalities, be they paragraphs of text, structured tables, images, or combinations thereof. For instance, a user who cannot remember the exact name of a painting might ponder "When was the famous painting with two touching fingers completed?". Answering this question is made possible by integrating information across both the textual and visual modalities.

Recently, there has been substantial interest in question answering (QA) models that reason over multiple pieces of evidence (multi-hop questions (Yang et al., 2018; Talmor & Berant, 2018; Welbl et al., 2017)). In most prior work, the question is phrased in natural language and the answer lies in a given context, which may be a paragraph (Rajpurkar et al., 2016), a table (Pasupat & Liang, 2015), or an image (Antol et al., 2015). However, there has been relatively little work on answering questions that require integrating information across modalities. Hannan et al. (2020) created MANYMODALQA, a dataset where the context for each question includes information from multiple modalities. However, the answer to each question can be derived from a single modality only, and no cross-modal reasoning is needed; thus, the task is focused on identifying the relevant modality. Recently, Chen et al. (2020b) presented HYBRIDQA, a dataset that requires reasoning over tabular and textual data. While HYBRIDQA requires cross-modal reasoning, it does not require visual inference, limiting the types of questions that can be represented (see Table 1 for a comparison between the datasets).

* The authors contributed equally.

