OPEN-DOMAIN VISUAL ENTITY LINKING

Abstract

We introduce the task of Open-domain Visual Entity Linking (OVEN), targeting a wide range of entities including animals, plants, buildings, locations, and more. Given an image (e.g., an image of an aircraft), a text query ('What is the model?' or 'What is the airline?'), and a multi-modal knowledge base (e.g., Wikipedia), the goal is to link to an entity (BOEING-777 or EVA AIR) out of all entities in the knowledge base. We build a benchmark dataset (OVEN-Wiki) by repurposing 14 existing image classification, image retrieval, and visual QA datasets. We link all existing labels to Wikipedia entities when possible, using a state-of-the-art entity linking system and human annotators, creating a diverse and unified label space. OVEN is a rich and challenging task, which requires models to recognize and link visual content to both a small set of seen entities as well as a much larger set of unseen entities (e.g., unseen aircraft models). OVEN also requires models to generalize to previously unseen intents that may require more fine-grained reasoning ('What is the model of the aircraft in the back?'). We build strong baselines based on state-of-the-art pre-trained models and find that current pre-trained models struggle to address the challenges posed by OVEN. We hope OVEN will inspire next-generation pre-training techniques and pave the way to future knowledge-intensive vision tasks.

1. INTRODUCTION

Recent interest in knowledge-intensive visual applications such as KVQA (Shah et al., 2019) and OK-VQA (Marino et al., 2019) has demonstrated the value of grounding images to knowledge bases such as Wikipedia. However, while models for these applications focus on integrating information from the knowledge base, there has been little focus on systematic, broad-coverage approaches to the grounding problem itself. Existing work typically combines various closed-set classifiers (e.g., ImageNet (Russakovsky et al., 2015), COCO (Lin et al., 2014), etc.) in an ad-hoc manner, without a clear formulation of image grounding as a general task. In this paper, we propose and formally define the task of Open-domain Visual Entity Linking (OVEN), with the goal of building vision systems that ground visual content to entities in large-scale multi-modal knowledge bases (such as Wikipedia). In contrast to existing informal setups that rely on closed-set classifiers in an ad-hoc way, OVEN is open-domain: predictions cover a large space of entities governed by a knowledge base. The OVEN task takes as input an image, a text query¹ that expresses intent with respect to the image, and a knowledge base which contains the entire set of entities, along with supporting text descriptions and a relevant set of images for each. Given these inputs, the goal is to predict an entity that is both physically present in the input image and satisfies the unambiguous intent expressed in the query. For instance, given the same image of an aircraft, different text queries such as 'Which model is this?' or 'Which airline is this?' can lead to different answers, i.e., BOEING-777 or EVA AIR. A strong OVEN model should learn to use all three inputs, i.e., the query, the image, and the multi-modal knowledge base, when making predictions. Figure 1 illustrates the input-output mapping of OVEN.
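The input-output mapping above can be sketched as a minimal interface. The sketch below is purely illustrative: the entity ids, descriptions, and the `link` function are hypothetical stand-ins (a real OVEN model must fuse the image, the query, and the multi-modal knowledge base; here a trivial query-description word-overlap stands in for the model so the interface is runnable).

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: str                     # e.g., a Wikipedia page title
    description: str                   # supporting text from the knowledge base
    images: list = field(default_factory=list)  # relevant images for the entity

@dataclass
class OvenExample:
    image: object                      # the input image (placeholder type)
    query: str                         # text query expressing the intent

# Toy knowledge base: every prediction must come from this entity space.
KB = {
    "Boeing_777": Entity("Boeing_777",
                         "The Boeing 777 is a wide-body aircraft model."),
    "EVA_Air":    Entity("EVA_Air",
                         "EVA Air is the flag carrier airline of Taiwan."),
}

def link(example: OvenExample, kb: dict) -> str:
    """Illustrative stand-in for an OVEN model: scores each KB entity by
    word overlap between the query and the entity's description. A real
    model would also condition on the image and the KB images."""
    terms = set(example.query.lower().replace("?", "").split())
    def score(entity: Entity) -> int:
        return len(terms & set(entity.description.lower().rstrip(".").split()))
    return max(kb.values(), key=score).entity_id
```

With this interface, the same image paired with different queries resolves to different entities, e.g. `link(OvenExample(img, "What is the model?"), KB)` yields `Boeing_777` while `link(OvenExample(img, "What is the airline?"), KB)` yields `EVA_Air`.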
By connecting images to entities in the knowledge base, OVEN creates a universal entity space that organizes existing image labels. This setup poses challenges in the form of (1) generalization to UNSEEN entities and (2) generalization to UNSEEN queries. For example, classic object recognition systems trained on FGVCAircraft (Maji et al., 2013) can recognize the ninety-six aircraft categories



¹ A query can be expressed in different formats; in this paper, we choose to use a question to reflect the intent.

