TRANSFORMER MODULE NETWORKS FOR SYSTEMATIC GENERALIZATION IN VISUAL QUESTION ANSWERING

Abstract

Transformers achieve strong performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., handling novel combinations of known concepts, remain unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve similar or better systematic generalization performance than conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, in this paper we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce Transformer Module Networks (TMNs), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers on novel compositions of sub-tasks. We show that not only the module composition but also the module specialization for each sub-task are key to this performance gain.

1. INTRODUCTION

Visual Question Answering (VQA) (Antol et al., 2015) is a fundamental testbed for assessing the capability of learning machines to perform complex visual reasoning. The compositional structure inherent to visual reasoning is at the core of VQA: visual reasoning is a composition of visual sub-tasks, and visual scenes are compositions of objects, which are in turn composed of attributes such as textures, shapes, and colors. This compositional structure yields a distribution of image-question pairs of combinatorial size, which cannot be fully reflected in an unbiased way by any training distribution. Systematic generalization is the ability to generalize to novel compositions of known concepts beyond the training distribution (Lake & Baroni, 2018; Bahdanau et al., 2019; Ruis et al., 2020). A learning machine capable of systematic generalization remains a distant goal, which contrasts with the exquisite ability of current learning machines to generalize in-distribution. In fact, the most successful learning machines, i.e., Transformer-based models, have been tremendously effective for VQA when evaluated in-distribution (Tan & Bansal, 2019; Chen et al., 2020; Zhang et al., 2021). Yet, recent studies have stressed the need to evaluate systematic generalization instead of in-distribution generalization (Gontier et al., 2020; Tsarkov et al., 2021; Bergen et al., 2021), as the systematic generalization capabilities of Transformers for VQA are largely unknown.

A recent strand of research on systematic generalization in VQA investigates Neural Module Networks (NMNs) (Bahdanau et al., 2019; 2020; D'Amario et al., 2021). NMNs decompose a question into sub-tasks, and each sub-task is tackled by a shallow neural network called a module. Thus, NMNs use a question-specific composition of modules to answer novel questions. NMNs narrow the gap between in-distribution generalization and systematic generalization thanks to their inherent compositional structure.
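The NMN idea above can be sketched in a few lines: a question is parsed into a "program" (a sequence of sub-task names), and one specialized module is applied per sub-task, chained in program order. The module internals and sub-task names below are hypothetical stand-ins, not the exact architecture of any cited NMN.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # illustrative feature dimension

def make_module(dim):
    """One shallow network per sub-task (CNN-based in classic NMNs);
    here a toy linear + ReLU layer stands in for it."""
    w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    return lambda x: np.maximum(x @ w, 0.0)

# One module per sub-task, shared across all questions.
modules = {t: make_module(DIM) for t in ["filter_color", "filter_shape", "count"]}

def run_nmn(features, program):
    """Question-specific composition: chain the modules named by the program."""
    x = features
    for sub_task in program:
        x = modules[sub_task](x)
    return x

feats = rng.standard_normal((1, DIM))
out = run_nmn(feats, ["filter_color", "count"])
print(out.shape)  # (1, 8)
```

The key property is that the set of modules is fixed and reused, while the composition varies per question, so a novel question can be answered by a novel chaining of familiar modules.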
In our experiments, we found that CNN-based NMNs outperform Transformers on systematic generalization to novel compositions of sub-tasks. This begs the question of whether we can combine the strengths of Transformers and NMNs in order to improve the systematic generalization capabilities of learning machines.

In this paper, we introduce Transformer Module Networks (TMNs), a novel NMN for VQA based on compositions of Transformer modules. In this way, we take the best of both worlds: the capabilities of Transformers given by attention mechanisms, and the flexibility of NMNs to adapt to questions through novel compositions of modules. TMNs allow us to investigate whether and how modularity brings benefits to Transformers in VQA. An overview of TMNs is depicted in Fig. 1. To foreshadow the results, we find that TMNs achieve state-of-the-art systematic generalization accuracy on three VQA datasets: CLEVR-CoGenT (Johnson et al., 2017), CLOSURE (Bahdanau et al., 2020), and a novel test set based on GQA (Hudson & Manning, 2019) that we introduce for evaluating systematic generalization with natural images, which we call GQA-SGL (Systematic Generalization to Linguistic combinations). Remarkably, TMNs improve systematic generalization accuracy over standard Transformers by more than 30% on the CLOSURE dataset, i.e., on systematic generalization to novel combinations of known linguistic constructs (equivalently, sub-tasks). Our results also show that both module composition and module specialization to a sub-task are key to TMNs' performance gain.
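To make the TMN idea concrete, the sketch below replaces each sub-task module with a small single-head self-attention block and stacks the blocks in the order given by the question's program. All shapes, sub-task names, and layer details are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8  # token embedding size (illustrative)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_transformer_module(dim):
    """A single-head self-attention block specialized to one sub-task."""
    wq, wk, wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    def block(tokens):  # tokens: (seq_len, dim)
        q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
        attn = softmax(q @ k.T / np.sqrt(dim))  # (seq_len, seq_len)
        return tokens + attn @ v  # residual connection
    return block

# One Transformer module per sub-task, reused across questions.
modules = {t: make_transformer_module(DIM)
           for t in ["filter_color", "relate", "count"]}

def run_tmn(tokens, program):
    """Compose Transformer modules according to the question's program."""
    for sub_task in program:
        tokens = modules[sub_task](tokens)
    return tokens

image_tokens = rng.standard_normal((5, DIM))  # e.g., 5 visual tokens
out = run_tmn(image_tokens, ["filter_color", "relate", "count"])
print(out.shape)  # (5, 8)
```

In this framing, modularity enters a Transformer in two ways that the paper's analysis separates: the per-question composition (the order of blocks) and the per-sub-task specialization (distinct weights per block).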

2. RELATED WORK

We review previous work on systematic generalization in VQA. We first revisit the available benchmarks and then introduce existing approaches.

Benchmarking VQA. Even though systematic generalization capabilities are the crux of VQA, attempts to benchmark these capabilities are only recent. The first VQA datasets evaluated in-distribution generalization, and later ones evaluated generalization under distribution shifts that do not require systematicity. In the following, we review the progress made towards benchmarking systematic generalization in VQA:

- In-distribution generalization: There is a plethora of datasets to evaluate in-distribution generalization, e.g., VQA-v2 (Goyal et al., 2019) and GQA (Hudson & Manning, 2019). It has been reported that these datasets are biased and that models achieve high accuracy by relying on spurious correlations instead of performing visual reasoning (Agrawal et al., 2018; Kervadec et al., 2021).
- Out-of-distribution generalization: VQA-CP (Agrawal et al., 2018) and GQA-OOD (Kervadec et al., 2021) were proposed to evaluate generalization under a shifted distribution of question-answer pairs. While this requires a stronger form of generalization than in-distribution evaluation, it does not require tackling the combinatorial nature of visual reasoning, and models can still leverage biases in the images and questions.

Fig. 1: Overview of Transformer Module Networks (TMNs).