CROSS-MODAL RETRIEVAL AUGMENTATION FOR MULTI-MODAL CLASSIFICATION

Abstract

Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves state-of-the-art image-caption retrieval performance relative to similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model significantly improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel inference-time applications such as hot-swapping indices.

1. INTRODUCTION

Neural networks augmented with non-parametric retrieval components have recently shown impressive results in NLP (Khandelwal et al., 2019; Guu et al., 2020; Lewis et al., 2020; Izacard & Grave, 2020). In this work, we train a state-of-the-art image-caption alignment model and utilize it in various retrieval-augmented multi-modal transformer architectures, achieving significant improvements on visual question answering (VQA) over strong baselines, including the winner of the VQA 2.0 2020 challenge. Retrieval components are promising because they allow for easy revision and expansion of their memory, as compared to their parametric, pre-trained counterparts. They provide more interpretability, as well as direct factual consistency with trusted knowledge sources. In the multi-modal setting, retrieval augmentation allows for leveraging the strengths of text-based models, as evidenced by the strong performance of BERT-based models in vision-and-language (Lu et al., 2019; Li et al., 2019b; Kiela et al., 2019), via cross-modal translation from images to text. Being able to seamlessly "hot swap" knowledge sources without re-training the model affords a scalability not typically seen in the traditional deep learning literature. Nearest neighbor methods are known to be strong baselines in the vision and language domain (Devlin et al., 2015). We introduce a simple yet effective novel dense cross-modal alignment architecture called DXR (Dense X-modal Retriever). DXR achieves state-of-the-art performance on both COCO (Chen et al., 2015) and Flickr30k (Young et al., 2014) image-caption retrieval, with respect to similar methods. We subsequently use DXR to augment several multi-modal transformer architectures with a retrieval component.
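The shared embedding space that makes such cross-modal retrieval possible can be illustrated with a minimal dual-encoder sketch. The encoders and data below are toy stand-ins, not DXR's actual architecture: retrieval reduces to a nearest-neighbor search of caption embeddings against an image embedding under cosine similarity.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    """Normalize vectors so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_captions(image_vec, caption_matrix, k=3):
    """Return indices of the k captions closest to the image embedding.

    image_vec:      (d,) embedding of the query image
    caption_matrix: (n, d) embeddings of the caption index
    """
    sims = l2_normalize(caption_matrix) @ l2_normalize(image_vec)
    return np.argsort(-sims)[:k]

# Toy index: 4 caption embeddings in a 5-dim shared space.
rng = np.random.default_rng(0)
captions = rng.normal(size=(4, 5))
query = captions[2] + 0.01 * rng.normal(size=5)  # image embedded near caption 2
print(retrieve_captions(query, captions, k=1))   # → [2]
```

At scale, the exhaustive matrix product would be replaced by an approximate nearest-neighbor index, which is also what makes the index hot-swappable at inference time.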
We show that retrieval augmentation yields impressive results for a variety of well-known multi-modal transformer architectures, ranging from VisualBERT (Li et al., 2019b) and ViLBERT (Lu et al., 2019), which use bounding-box features, to Movie+MCAN (Jiang et al., 2020), which uses grid features. We name our overall method XTRA, for X-modal Transformer Retrieval Augmentation. Specifically, our contributions are as follows:

• We introduce a novel image-caption retrieval architecture, DXR, that achieves state-of-the-art performance on COCO and Flickr30k, with respect to similar methods.

• We introduce a new retrieval-augmented multi-modal transformer architecture, XTRA, that achieves significant improvements on VQA over strong baselines. To our knowledge, this is the first work to showcase the promise of hybrid parametric and non-parametric models for the vision and language domain.

• We conduct extensive experiments to shed light on this novel approach. We explore different datasets for training the alignment model, as well as the effect of in-domain versus out-of-domain retrieval indices, the index size, and inference-time applications. Our experiments show that our proposed method significantly improves over a variety of strong multi-modal baselines, and demonstrates superior results over pre-training.
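As a toy illustration of the augmentation step, one simple way to inject retrieved captions is to concatenate them to the question text before it reaches the multi-modal transformer. The function and separator token below are illustrative assumptions, not the paper's actual interface:

```python
def augment_question(question, retrieved_captions, sep=" [SEP] "):
    """Append retrieved captions to the question as extra textual context.

    The augmented string would then be tokenized together with the image
    features by the downstream multi-modal transformer.
    """
    return question + sep + sep.join(retrieved_captions)

# Captions retrieved for the query image (toy examples).
caps = ["a dog catching a frisbee", "a man throwing a disc in a park"]
print(augment_question("What is the dog doing?", caps))
```

The benefit of this formulation is that the transformer itself is unchanged: all retrieval-derived knowledge arrives through the text channel, where BERT-style models are strongest.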

2. RELATED WORK

Cross-Modal Retrieval Prior work in cross-modal retrieval can be divided into two primary categories: (i) methods that use grid features and/or global vector representations in a joint embedding space, and (ii) methods that use detection features, sequence representations, or share information between the two modalities when computing the similarity metric. Chen et al. (2020) learn to align images and text by using positive and negative tuples of images and captions. While these models perform well, they suffer from high computational complexity, as we discuss in Sec.

External Knowledge Source Methods The use of an external knowledge source (KS) has gained much attention in natural language processing (NLP), such as the work of Verga et al. (2020). Our work is inspired by that of Lewis et al. (2020), which introduced RAG, a generic approach for a variety of downstream NLP tasks that uses a learned retriever (DPR by Karpukhin et al. (2020)) to augment the inputs by marginalizing across several passages retrieved from Wikipedia. In the multi-modal domain, previous efforts have focused on building different types of KS, such as the work of Zhu et al. (2014), Chen et al. (2013), Divvala et al. (2014), Sadeghi et al. (2015) and Zhu et al. (2015), which use web information for the construction of the KS. Methods that use an external KS for a downstream task typically use a structured KS, such as the work of Narasimhan et al. (2018), Narasimhan & Schwing (2018), Wang et al. (2015), Wang et al. (2018) and Zhu et al. (2017). Zhu et al. (2017) introduced an iterative method for VQA tasks. Marino et al. (2019) introduced OK-VQA, a novel VQA dataset that requires the use of an external KS. Fan et al. (2020) applied a KS to multi-modal dialogue. In our work, we focus on a more natural KS of images and captions, which better reflects the data generated in newspapers and social media.

Multi-modal Classification In this work, we investigate the potential advantages of using an external KS for the popular and challenging VQA domain, a multi-modal classification task. Current methods for VQA use pre-training on different datasets in order to gain better performance. In our experiments, we show performance for common methods such as VisualBERT (Li et al., 2019b), which concatenates the text and image modalities as an input to a pre-trained BERT (Devlin et al., 2018) model. ViLBERT (Lu et al., 2019) fuses the text and image modalities using co-attentional transformer layers.
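The RAG-style marginalization mentioned above can be sketched as a retrieval-weighted mixture over per-document answer distributions. The numbers below are toy values standing in for real retriever scores and model outputs:

```python
import numpy as np

def marginalize(retriever_scores, per_doc_answer_probs):
    """p(y|x) = sum_z p(z|x) * p(y|x,z)  (RAG-style marginalization).

    retriever_scores:     (k,) unnormalized scores for k retrieved docs
    per_doc_answer_probs: (k, v) answer distribution given each doc
    """
    p_z = np.exp(retriever_scores - retriever_scores.max())
    p_z /= p_z.sum()                      # softmax over retrieved documents
    return p_z @ per_doc_answer_probs     # mixture over the answer vocabulary

scores = np.array([2.0, 0.5])             # doc 0 is the better match
probs = np.array([[0.9, 0.1],             # doc 0 supports answer 0
                  [0.2, 0.8]])            # doc 1 supports answer 1
print(marginalize(scores, probs))
```

Because the mixture weights come from the retriever, gradients from the downstream loss can flow back into the retriever, which is what lets the retrieval component be trained end-to-end.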
Other methods, such as Pythia (Jiang et al., 2018), VL-BERT (Su et al., 2019) and MMBT (Kiela et al., 2019), can benefit from our method, as can more recent work such as Oscar (Li et al., 2020b) and UNITER (Chen et al., 2020), which use the alignment task for pre-training their models. In this paper, we choose to show our performance on the two common VisualBERT and ViLBERT models, and on the winner of the VQA 2.0 2020 challenge, Movie+MCAN (Jiang et al., 2020), which uses grid features instead of detection features, a modulated convolutional bottleneck for the image backbone, and MCAN (Yu et al., 2019) for fusion.

The first category consists of methods such as RRF (Liu et al., 2017) and DPC (Zheng et al., 2017), which use two convolutional network branches for image and text. CMPM by Zhang & Lu (2018) combines a pre-trained image backbone with a bi-directional LSTM to learn image and text embeddings. The most relevant work in this category is VSE++ (Faghri et al., 2017), which focuses on hard negative mining and a ranking loss. The second category generally exploits detection features, which incurs additional computational complexity.
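As a sketch, the max-of-hinges ranking loss with in-batch hard negatives, in the spirit of VSE++, can be written as follows. The similarity matrix below contains toy values standing in for a learned similarity s(i, c); this is an illustration of the loss, not VSE++'s actual implementation:

```python
import numpy as np

def hard_negative_loss(sim, margin=0.2):
    """Max-of-hinges ranking loss over a batch of image-caption pairs.

    sim: (b, b) similarity matrix where sim[i, i] are the positive pairs.
    Only the hardest in-batch negative contributes per positive pair.
    """
    b = sim.shape[0]
    pos = np.diag(sim)
    off = sim + np.where(np.eye(b, dtype=bool), -np.inf, 0.0)  # mask positives
    hardest_c = off.max(axis=1)   # hardest negative caption for each image
    hardest_i = off.max(axis=0)   # hardest negative image for each caption
    loss_c = np.maximum(0.0, margin + hardest_c - pos)
    loss_i = np.maximum(0.0, margin + hardest_i - pos)
    return (loss_c + loss_i).mean()

sim = np.array([[0.9, 0.8],   # caption 1 is a hard negative for image 0
                [0.2, 0.8]])
print(hard_negative_loss(sim))
```

Taking the maximum over negatives, rather than summing all of them, is the key design choice of VSE++: easy negatives contribute nothing, so the gradient concentrates on the pairs the model currently confuses.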

