HOW TO DESIGN SAMPLE AND COMPUTATIONALLY EFFICIENT VQA MODELS

Abstract

In multi-modal reasoning tasks such as visual question answering (VQA), many modeling and training paradigms have been tested. Previous models propose different methods for the vision and language components, but which perform best while remaining sample and computationally efficient? Based on our experiments, we find that representing text as probabilistic programs and images as object-level scene graphs best satisfies these desiderata. We extend existing models to leverage these soft programs and scene graphs to train on question-answer pairs in an end-to-end manner. Empirical results demonstrate that this differentiable end-to-end program executor maintains state-of-the-art accuracy while being sample and computationally efficient.

1. INTRODUCTION

Many complex real-world tasks, such as VQA, require both perception and reasoning (or System 1 and System 2 intelligence (Kahneman, 2011)). What is the best way to integrate perception and reasoning components in a single model? Furthermore, how can such an integration yield accurate models that are also sample and computationally efficient? These questions are important when scaling reasoning systems to real-world use cases, where empirical computation bounds must be understood in addition to final model performance.

The literature spans a spectrum of methods for integrating perception and reasoning. Perception is typically carried out by neural models: CNNs for vision, and LSTMs (Gers et al., 1999) or Transformers (Vaswani et al., 2017) for language. Depending on the representation of the perception output and its reasoning interface, a method can sit closer to the neural or to the symbolic end of the spectrum. For vision, models use either pixel-level or object-level symbolic representations. For language, models generate either textual attention or programs, in which the text is decomposed into a sequence of functions; within program representations, models typically operate either on a single selected discrete program or on probabilistic programs. The reasoning component that produces the final answer can use neural models, symbolic reasoning, or something in between, such as neural module networks (NMNs) or soft logic blocks. Existing NMN methods leverage pixel-level representations together with program representations, e.g., NMN (Hu et al., 2017), Prob-NMN (Vedantam et al., 2019), and Stack-NMN (Hu et al., 2018). Representative models that use object-level vision also span both neural and symbolic choices for language and reasoning.
Models on the more neural side include LXMERT (Tan & Bansal, 2019) and NSM (Hudson & Manning, 2019), while more symbolic models include NS-VQA (Yi et al., 2018), NS-CL (Mao et al., 2019), and NGS (Li et al., 2020). A systematic comparison across these models is illustrated in Table 1, with more details in Appendix A. Overall, more-neural models have greater expressive power but more parameters, while more-symbolic models have more prior structure built in but fewer parameters. There is thus an interesting bias-variance trade-off in model design: by encoding as much bias into the model as possible, one can reduce the number of samples required.

The choice of perception and reasoning components also constrains how QA models can be trained. If both components are neural modules, training can be done in a very efficient end-to-end fashion. If reasoning is carried out using more discrete operations, the perception model must emit or consume discrete outputs to interface with downstream reasoning. For instance, if symbolic reasoning is used, REINFORCE (Williams, 1992) is typically used to train the perception models, which may require many samples during optimization. Alternatively, one can use expensive abduction (Li et al., 2020) to correct the perception models' outputs so that reasoning succeeds, and then optimize the perception models on these pseudo-labels. Overall, more-neural models are easier to optimize, while more-symbolic models require additional expensive discrete sampling during optimization. To highlight this fact, we call it the neuro-symbolic trade-off. This trade-off also affects sample efficiency and computational efficiency: to be more sample efficient, a model needs to be less neural, yet a more neural model can be more computationally efficient during training.
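To make the discrete-sampling cost concrete, the sketch below trains a toy policy over a handful of candidate discrete "programs" with the REINFORCE score-function estimator, using answer correctness as the only reward. The setup (four candidate programs, one of which yields the correct answer) is purely illustrative and not taken from any of the cited models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 candidate discrete "programs"; only program 2 yields the
# correct answer (reward 1), all others reward 0. These specifics are
# illustrative assumptions, not details from the paper.
K, correct_program = 4, 2
logits = np.zeros(K)   # policy parameters over programs
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(500):
    probs = softmax(logits)
    a = rng.choice(K, p=probs)       # sample a discrete program
    reward = 1.0 if a == correct_program else 0.0
    # Score-function (REINFORCE) update: reward * grad of log pi(a).
    # For a softmax policy, grad_logits log pi(a) = onehot(a) - probs.
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * reward * grad_logp
```

Note that learning only makes progress on the steps where the sampled program happens to be rewarded, which is why such training can require many samples compared to a fully differentiable end-to-end model.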
Thus, a method that achieves good overall performance in terms of both sample and computational efficiency requires systematically determining which perception and reasoning components should be used and how to integrate them. To design such a model, we first test which method within each perception and reasoning component works most efficiently. From this neuro-symbolic trade-off exploration, we design a model that uses the most efficient components and compare its overall performance against existing models.

2. PROBLEM SETTING

Before the exploration, we formally define the choices for the vision, language, and reasoning components. In the general VQA setting, we are provided with an image I, a natural-language question Q, and an answer A. We now define how these basic inputs are used in each component.
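The (I, Q, A) triple above can be captured in a minimal container; the field types below are illustrative assumptions (e.g., the image as an RGB array and the answer as a string), not fixed by the setting.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VQASample:
    """One example in the general VQA setting: image I, question Q, answer A."""
    image: np.ndarray   # I, e.g. an H x W x 3 RGB array
    question: str       # Q, a natural-language question
    answer: str         # A, the ground-truth answer

sample = VQASample(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    question="What color is the cube?",
    answer="red",
)
```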

2.1. REPRESENTATION FOR VISION

Given the image I, there are two predominant visual representations: pixel-level attention and object-level representations.

Pixel Attention. Given an image, one can leverage standard deep architectures for image representation and classification, such as ResNets (He et al., 2016). The image is passed through many residual convolution layers before entering an MLP sub-network that performs a classification task. From one of the MLP's linear layers, an intermediate dense image feature f_I ∈ R^D can be extracted, denoted f_I = ResNet(I). This feature is used further down the VQA pipeline, where the downstream model computes attention over the relevant parts of the feature based on the question asked.

Object-level. Another paradigm is to leverage object detection models, such as Faster R-CNNs (Ren et al., 2015), to identify the individual objects within an image. Given the objects in the image, one can



Table 1: A breakdown of VQA models indicating which method each uses for its vision, language, inference, and training components. Refer to Appendix A for a detailed description of these methods.

