HOW TO DESIGN SAMPLE AND COMPUTATIONALLY EFFICIENT VQA MODELS

Abstract

In multi-modal reasoning tasks such as visual question answering (VQA), many modeling and training paradigms have been tested. Prior models propose different methods for the vision and language components, but which ones perform best while being sample and computationally efficient? Based on our experiments, we find that representing text as probabilistic programs and images as object-level scene graphs best satisfies these desiderata. We extend existing models to leverage these soft programs and scene graphs to train on question-answer pairs in an end-to-end manner. Empirical results demonstrate that this differentiable end-to-end program executor maintains state-of-the-art accuracy while being sample and computationally efficient.

1. INTRODUCTION

Many complex real-world tasks, such as VQA, require both perception and reasoning (or System 1 and System 2 intelligence (Sutton & Barto, 2018)). What is the best way to integrate perception and reasoning components in a single model? Furthermore, how can such an integration yield accurate models that are also sample and computationally efficient? These questions are important to address when scaling reasoning systems to real-world use cases, where empirical computation bounds must be understood in addition to final model performance.

There is a spectrum of methods in the literature exploring different ways of integrating perception and reasoning. Perception is typically carried out by neural models, such as CNNs for vision and LSTMs (Gers et al., 1999) or Transformers (Vaswani et al., 2017) for language. Depending on the representation of the perception input and its reasoning interface, a method can sit closer to the neural end of the spectrum or closer to the symbolic end. For vision, models can use either pixel-level representations or object-level symbolic representations. For language, models can generate either textual attention or programs, where the text is decomposed into a sequence of functions. Within program representations, models typically operate either on a single selected discrete program or on probabilistic programs. The reasoning component that produces the final answer can use neural models, symbolic reasoning, or something in between, such as neural module networks (NMN) or soft logic blocks. Existing NMN-style methods, such as NMN (Hu et al., 2017), Prob-NMN (Vedantam et al., 2019), and Stack-NMN (Hu et al., 2018), combine pixel-level vision with program representations. Representative models that use object-level vision also leverage both neural and symbolic language and reasoning.
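To make these representations concrete, the following is a minimal hypothetical sketch (not any particular model's implementation): a question such as "What color is the cube left of the sphere?" is decomposed into a sequence of functions, which are then executed over an object-level scene graph. The scene contents, function names, and relation encoding here are illustrative assumptions.

```python
# Object-level scene graph: each object carries attributes; relations map
# an object index to the indices of objects standing in that relation to it.
scene = {
    "objects": [
        {"shape": "sphere", "color": "red"},
        {"shape": "cube", "color": "blue"},
    ],
    "relations": {"left": {0: [1]}},  # object 1 is to the left of object 0
}

def filter_shape(obj_ids, shape):
    # Keep only objects of the given shape.
    return [i for i in obj_ids if scene["objects"][i]["shape"] == shape]

def relate(obj_ids, rel):
    # Follow a relation edge from each selected object.
    return [j for i in obj_ids for j in scene["relations"][rel].get(i, [])]

def query_color(obj_ids):
    # Read off an attribute of the (single) selected object.
    return scene["objects"][obj_ids[0]]["color"]

# Program: filter(sphere) -> relate(left) -> filter(cube) -> query(color)
all_ids = list(range(len(scene["objects"])))
answer = query_color(
    filter_shape(relate(filter_shape(all_ids, "sphere"), "left"), "cube")
)
print(answer)  # -> blue
```

Executing such a program with hard, discrete set operations corresponds to the symbolic end of the spectrum; the neural end would instead attend over pixels with learned modules.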
Models that are more neural include LXMERT (Tan & Bansal, 2019) and NSM (Hudson & Manning, 2019), while more symbolic ones include NS-VQA (Yi et al., 2018), NS-CL (Mao et al., 2019), and NGS (Li et al., 2020). A systematic comparison across these models is illustrated in Table 1, with more details in Appendix A. Overall, neural models have more expressive power but more parameters, while more-symbolic models have more prior structure built in but fewer parameters. There is an interesting bias-variance trade-off in the model design: by encoding more inductive bias into the model, one can reduce the number of training samples required. The choice of perception and reasoning components also constrains how the QA model can be trained. If both components are neural modules, then training can be done in a very efficient end-to-end fashion. If the reasoning is carried out using more discrete operations,
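For intuition on why probabilistic (soft) programs keep training end-to-end trainable, consider the following sketch. It is an illustrative toy, with hypothetical names and data rather than an actual executor: instead of committing to one discrete parse, the parser assigns probabilities to candidate programs, and the executor returns the probability-weighted mixture of their answer distributions. Because the answer is a smooth function of the parse probabilities, question-answer supervision can propagate gradients through both parser and executor.

```python
def execute(program, scene):
    # Stand-in executor returning a distribution over answers.
    # (A real executor would run a module sequence over a scene graph.)
    return program(scene)

def soft_execute(candidates, scene):
    """candidates: list of (probability, program) pairs whose probabilities sum to 1."""
    answers = {}
    for prob, program in candidates:
        for answer, p_ans in execute(program, scene).items():
            answers[answer] = answers.get(answer, 0.0) + prob * p_ans
    return answers

# Two candidate parses of the same question, weighted by the parser.
scene = {"cube_color": "blue", "sphere_color": "red"}
candidates = [
    (0.7, lambda s: {s["cube_color"]: 1.0}),    # parse 1: query the cube's color
    (0.3, lambda s: {s["sphere_color"]: 1.0}),  # parse 2: query the sphere's color
]
print(soft_execute(candidates, scene))  # -> {'blue': 0.7, 'red': 0.3}
```

A hard discrete executor would instead pick the argmax parse, making the answer non-differentiable in the parser's outputs and typically requiring REINFORCE-style training.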

