OPEN-DOMAIN VISUAL ENTITY LINKING

Abstract

We introduce the task of Open-domain Visual Entity Linking (OVEN), targeting a wide range of entities including animals, plants, buildings, locations, and many more. Given an image (e.g., an image of an aircraft), a text query ('What is the model?' or 'What is the airline?'), and a multi-modal knowledge base (e.g., Wikipedia), the goal is to link to an entity (Boeing 777 or EVA Air) out of all entities in the knowledge base. We build a benchmark dataset (OVEN-Wiki) by repurposing 14 existing image classification, image retrieval, and visual QA datasets. We link all existing labels to Wikipedia entities when possible, using a state-of-the-art entity linking system and human annotators, creating a diverse and unified label space. OVEN is a rich and challenging task, which requires models to recognize and link visual content to both a small set of seen entities as well as a much larger set of unseen entities (e.g., unseen aircraft models). OVEN also requires models to generalize to previously unseen intents that may require more fine-grained reasoning ('What is the model of the aircraft in the back?'). We build strong baselines based on state-of-the-art pre-trained models and find that current pre-trained models struggle to address the challenges posed by OVEN. We hope OVEN will inspire next-generation pre-training techniques and pave the way for future knowledge-intensive vision tasks.

1. INTRODUCTION

Recent interest in knowledge-intensive visual applications such as KVQA (Shah et al., 2019) and OK-VQA (Marino et al., 2019) has demonstrated the value of grounding images to knowledge bases such as Wikipedia. However, while models for these applications focus on integrating information from the knowledge base, there has been little focus on systematic, broad-coverage approaches to the grounding problem itself. Existing work typically combines various closed-set classifiers (e.g., ImageNet (Russakovsky et al., 2015), COCO (Lin et al., 2014), etc.) in an ad-hoc manner, without a clear formulation of image grounding as a general task. In this paper, we propose and formally define the task of Open-domain Visual Entity Linking (OVEN), with the goal of building vision systems that ground visual content to entities in large-scale multi-modal knowledge bases (such as Wikipedia). In contrast to existing informal setups that rely on closed-set classifiers in an ad-hoc way, OVEN is open-domain: predictions cover a large space of entities governed by a knowledge base. The OVEN task takes as input an image, a text query[1] that expresses intent with respect to the image, and a knowledge base which contains the entire set of entities, along with a supporting text description and a relevant set of images for each. Given these inputs, the goal is to predict an entity that is both physically present in the input image and satisfies the unambiguous intent expressed in the query. For instance, given the same image of an aircraft, different text queries such as 'Which model is this?' or 'Which airline is this?' can lead to different answers, i.e., Boeing 777 or EVA Air. A strong OVEN model should learn to use all three inputs, i.e., the query, the image, and the multi-modal knowledge base, when making predictions. Figure 1 illustrates the input-output mapping of OVEN.
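To make the input-output mapping concrete, here is a minimal sketch of query-conditioned entity linking with hand-made toy embeddings. The entity names come from the running example above; the vectors and the additive fusion scheme are illustrative assumptions, not the paper's method (real systems would use pre-trained encoders such as CLIP-style dual encoders):

```python
import numpy as np

# Toy query-conditioned linking: fuse an image embedding with a query
# embedding, then return the nearest entity by cosine similarity.
# All vectors below are hand-made toys, not learned representations.

aircraft_image = np.array([1.0, 0.0, 1.0, 0.0])   # same image for both queries

queries = {
    "Which model is this?":   np.array([0.0, 1.0, 0.0, 0.0]),
    "Which airline is this?": np.array([0.0, 0.0, 0.0, 1.0]),
}
entities = {
    "Boeing 777": np.array([1.0, 1.0, 1.0, 0.0]),
    "EVA Air":    np.array([1.0, 0.0, 1.0, 1.0]),
}

def link(image_vec, query_vec):
    """Link (image, query) to the entity with the highest cosine similarity."""
    fused = image_vec + query_vec  # naive fusion of the two inputs
    return max(entities, key=lambda e: float(
        entities[e] @ fused / (np.linalg.norm(entities[e]) * np.linalg.norm(fused))))

print(link(aircraft_image, queries["Which model is this?"]))    # -> Boeing 777
print(link(aircraft_image, queries["Which airline is this?"]))  # -> EVA Air
```

The point of the sketch is that the same image yields different entities depending on the query, which is exactly the behavior OVEN requires.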
By connecting images to entities in the knowledge base, OVEN creates a universal entity space that organizes existing image labels. This setup poses challenges in the form of (1) generalization to UNSEEN entities and (2) generalization to UNSEEN queries. For example, classic object recognition systems trained on FGVCAircraft (Maji et al., 2013) can recognize the ninety-six aircraft categories defined in FGVCAircraft, but would necessarily fail to recognize Boeing 787, as it does not belong to the pre-defined categories. In contrast, models for OVEN are required to recognize entities that were UNSEEN in the training data. This requirement reflects a more realistic scenario, since there will never be enough training data to cover all knowledge base entities, especially when the number of knowledge base entities is constantly growing.[2] OVEN also evaluates a model's ability to understand UNSEEN queries, since it is impossible to observe all text queries to a real-world KB during training.

[Figure 1 examples: 'What is the model of this vehicle?' → Bugatti Veyron ('The Bugatti Veyron EB 16.4 is a mid-engine sports car, designed and developed in Germany by the Volkswagen Group...'); 'What is this building called?' → Skanderbeg Museum ('The National History Museum "Gjergj Kastrioti Skënderbeu" (Albanian: Muzeu Historik Kombëtar), also known as the Skanderbeg Museum...'); 'What piece of equipment is placed on the animal in the image?' → Bridle ('A bridle is a piece of equipment used to direct a horse. As defined in the Oxford English Dictionary, the "bridle" includes both the headstall that...'); 'Who manufactured the plane?' → McDonnell Douglas.]

To benchmark OVEN, we construct OVEN-Wiki by repurposing 14 existing image classification, image retrieval, and visual QA datasets, and grounding all labels to the most prominent knowledge base, Wikipedia. The training, validation, and testing splits are formed such that models need to generalize to entities that are unseen during training. Grounding labels from different datasets into Wikipedia and combining datasets is challenging because of the ambiguity of language. For example, due to language polysemy, 'Yoke' can represent a wooden beam or part of the construction of a garment, and 'Tornado' can be a weather phenomenon or a type of airplane (Panavia Tornado). To reduce such ambiguity in the grounding, we take multiple steps to refine the labels, including the use of a state-of-the-art textual entity linking system (De Cao et al., 2020). For more accurate evaluation, a subset of examples is thoroughly annotated by human annotators: entity linking errors are corrected and ambiguous queries are rewritten so that no other objects can be the answer. To ensure that the dataset is computationally manageable for the community, we use a 100k subset of Wikipedia entities.

Using multiple types of information (the image, the input query, and multi-modal knowledge) is important for succeeding on OVEN, but existing pre-trained models like CLIP (Radford et al., 2021) or SimVLM (Wang et al., 2021) cannot use all of the available information natively. We experiment with baselines applying different combinations of strong pre-trained models and conduct error analysis to show possible headroom.

The contributions of the paper are as follows:
• Towards the broader goal of linking visual entities to large open-domain KBs, we formalize the task of Open-domain Visual Entity Linking (OVEN).
• We construct the first large-scale visual entity linking benchmark, OVEN-Wiki, by re-purposing 14 existing datasets with all of their labels grounded to Wikipedia. A subset of the dataset is annotated by human annotators for high-quality evaluation.
• We build strong baselines based on state-of-the-art pre-trained models and find that the best-performing systems are the ones that better leverage multi-modal knowledge.
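The ambiguity problem in label grounding can be illustrated with a toy disambiguator that picks a Wikipedia entity for a label by comparing candidate entity descriptions against the label's source-dataset context. Everything here (the candidate table, descriptions, and function names) is a hypothetical illustration; the actual pipeline uses the entity linking system of De Cao et al. (2020) plus human annotation:

```python
# Toy label disambiguation: an ambiguous dataset label maps to the candidate
# Wikipedia entity whose description best overlaps the dataset context.
# Candidate descriptions below are hand-written stand-ins for Wikipedia text.

CANDIDATES = {
    "Tornado": {
        "Tornado": "a violently rotating column of air and a weather phenomenon",
        "Panavia Tornado": "a family of multirole combat aircraft and jet airplane",
    },
    "Yoke": {
        "Yoke": "a wooden beam used between a pair of oxen",
        "Yoke (clothing)": "a shaped pattern piece in the construction of a garment",
    },
}

def disambiguate(label: str, dataset_context: str) -> str:
    """Pick the candidate entity whose description best overlaps the context."""
    context_words = set(dataset_context.lower().split())
    def overlap(entity: str) -> int:
        return len(context_words & set(CANDIDATES[label][entity].lower().split()))
    return max(CANDIDATES[label], key=overlap)

print(disambiguate("Tornado",
                   "labels from an aircraft recognition dataset of airplane photos"))
# -> Panavia Tornado
```

A bag-of-words overlap like this would be far too weak in practice, which is why a learned entity linker and human verification are needed, but it captures the shape of the decision being made.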

2. TASK FORMULATION FOR OVEN

The proposed task of open-domain visual entity linking relies on inputs that consist of text queries, images, and a multi-modal knowledge base (KB), all corresponding to a unified label space. The input image-text pairs x = (x_p, x_t) include a text query x_t expressing intent with respect to the corresponding image x_p. Given a unified label space E which defines the set of all possible entities, the input knowledge base K = {(e, p(e), t(e)) | e ∈ E} is a set of triples, each containing an entity e, its corresponding text description t(e), and a (possibly empty) set of relevant images p(e). For instance, an entity e = Sabatia campestris would have a corresponding textual description t(e) = 'Sabatia campestris is a species of Sabatia ...' and a set p(e) containing one
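The formulation above can be sketched as a small data model plus a retrieval step over K. The concrete entities, descriptions, and the word-overlap scoring function are toy assumptions for illustration; any real system would substitute a learned compatibility model:

```python
from dataclasses import dataclass, field

# Data model for the formulation: examples x = (x_p, x_t) and KB triples
# (e, p(e), t(e)). Entity descriptions here are hand-written stand-ins.

@dataclass
class Example:
    image: bytes   # x_p: the input image
    query: str     # x_t: the intent, phrased as a question

@dataclass
class Entity:
    name: str                 # e, e.g. "Sabatia campestris"
    description: str          # t(e): supporting text description
    images: list = field(default_factory=list)  # p(e): possibly empty

# Knowledge base K = {(e, p(e), t(e)) | e in E}, keyed by entity name.
kb = {
    "Boeing 777": Entity("Boeing 777", "the Boeing 777 is a wide-body airliner"),
    "EVA Air": Entity("EVA Air", "EVA Air is a Taiwanese international airline"),
}

def toy_score(example: Example, entity: Entity) -> int:
    """Word overlap between x_t and t(e); a stand-in for a learned model."""
    return len(set(example.query.lower().split()) &
               set(entity.description.lower().split()))

def link_entity(example: Example, kb: dict, score) -> Entity:
    """Predict the entity in K that maximizes the compatibility score."""
    return max(kb.values(), key=lambda e: score(example, e))

ex = Example(image=b"", query="what airline is this")
print(link_entity(ex, kb, toy_score).name)  # -> EVA Air
```

Note that `link_entity` scores every entity in K, which is only feasible for a toy KB; at the 100k-entity scale of OVEN-Wiki, models need efficient retrieval or generation over the entity space.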



[1] A query can be expressed in different formats; in this paper, we choose to use a question to reflect the intent.
[2] English Wikipedia has grown from 3M to 6.5M articles in the last decade and continues to grow.



Figure 1: An illustration of the proposed OVEN task. Examples on the right are sampled from the constructed OVEN-Wiki dataset. OVEN aims at linking to entities that are physically present in the image or revealed by it.

