CROSS-MODAL RETRIEVAL AUGMENTATION FOR MULTI-MODAL CLASSIFICATION

Abstract

Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves state-of-the-art image-caption retrieval performance with respect to similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model significantly improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel inference-time applications such as hot-swapping indices.

1. INTRODUCTION

Neural networks augmented with non-parametric retrieval components have recently shown impressive results in NLP (Khandelwal et al., 2019; Guu et al., 2020; Lewis et al., 2020; Izacard & Grave, 2020). In this work, we train a state-of-the-art image-caption alignment model and utilize it in various retrieval-augmented multi-modal transformer architectures, achieving significant improvements on visual question answering (VQA) over strong baselines, including the winner of the 2020 VQA 2.0 challenge.

Retrieval components are promising because they allow for easy revision and expansion of their memory, as compared to their parametric, pre-training counterparts. They also provide more interpretability, as well as direct factual consistency with trusted knowledge sources. In the multi-modal setting, retrieval augmentation allows for leveraging the strengths of text-based models, as evidenced by the strong performance of BERT-based models in vision-and-language (Lu et al., 2019; Li et al., 2019b; Kiela et al., 2019), via cross-modal translation from images to text. Being able to seamlessly "hot swap" knowledge sources without re-training the model affords a scalability not typically seen in the traditional deep learning literature.

Nearest neighbor methods are known to be strong baselines in the vision and language domain (Devlin et al., 2015). We introduce a simple yet effective novel dense cross-modal alignment architecture called DXR (Dense X-modal Retriever). DXR achieves state-of-the-art performance on both COCO (Chen et al., 2015) and Flickr30k (Young et al., 2014) image-caption retrieval, with respect to similar methods. We subsequently use DXR to augment several multi-modal transformer architectures with a retrieval component.
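The retrieval mechanism described above can be sketched in a few lines: images and captions are embedded in a shared space, retrieval is nearest-neighbor search over caption embeddings, and "hot swapping" amounts to replacing the index at inference time. The sketch below is illustrative only; the `DenseRetriever` class and its random-projection inputs are stand-ins, not the DXR encoder or its actual index implementation.

```python
import numpy as np

class DenseRetriever:
    """Minimal sketch of dense cross-modal retrieval over a caption index.
    Assumes image and caption embeddings already live in a shared space
    (in the paper, produced by the trained alignment model)."""

    def __init__(self, caption_embs, captions):
        # L2-normalize so that inner product equals cosine similarity.
        self.index = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
        self.captions = captions

    def retrieve(self, image_emb, k=2):
        # Embed-and-search: score every caption against the image query.
        q = image_emb / np.linalg.norm(image_emb)
        scores = self.index @ q
        top = np.argsort(-scores)[:k]
        return [self.captions[i] for i in top]

    def hot_swap(self, caption_embs, captions):
        # Replace the knowledge source at inference time -- no retraining.
        self.index = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
        self.captions = captions
```

In practice the exhaustive dot-product search would be replaced by an approximate nearest-neighbor index (e.g., FAISS) for large caption collections, but the hot-swap property is the same: only the index and its caption store change, never the model weights.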
We show that retrieval augmentation yields impressive results for a variety of well-known multi-modal transformer architectures, ranging from VisualBERT (Li et al., 2019b) and ViLBERT (Lu et al., 2019), which use bounding-box features, to Movie+MCAN (Jiang et al., 2020), which uses grid features. We name our overall method XTRA, for X-modal Transformer Retrieval Augmentation. Specifically, our contributions are as follows:

• We introduce a novel image-caption retrieval architecture, DXR, that achieves state-of-the-art performance on COCO and Flickr30k, with respect to similar methods.

• We introduce a new retrieval-augmented multi-modal transformer architecture, XTRA, that achieves significant improvement on VQA over strong baselines.

