TRANSFORMER MODULE NETWORKS FOR SYSTEMATIC GENERALIZATION IN VISUAL QUESTION ANSWERING

Abstract

Transformers achieve strong performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., their ability to handle novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance better than or comparable to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, in this paper we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers for novel compositions of sub-tasks. We show that not only the module composition but also the specialization of each module to its sub-task is key to this performance gain.

1. INTRODUCTION

Visual Question Answering (VQA) (Antol et al., 2015) is a fundamental testbed for assessing the capability of learning machines to perform complex visual reasoning. The compositional structure inherent to visual reasoning is at the core of VQA: visual reasoning is a composition of visual sub-tasks, and visual scenes are compositions of objects, which are in turn composed of attributes such as textures, shapes, and colors. This compositional structure yields a distribution of image-question pairs of combinatorial size, which cannot be fully reflected in an unbiased way by any training distribution. Systematic generalization is the ability to generalize to novel compositions of known concepts beyond the training distribution (Lake & Baroni, 2018; Bahdanau et al., 2019; Ruis et al., 2020). A learning machine capable of systematic generalization remains a distant goal, which contrasts with the exquisite ability of current learning machines to generalize in-distribution. In fact, the most successful learning machines, i.e., Transformer-based models, have been tremendously effective for VQA when evaluated in-distribution (Tan & Bansal, 2019; Chen et al., 2020; Zhang et al., 2021). Yet, recent studies have stressed the need to evaluate systematic generalization rather than in-distribution generalization (Gontier et al., 2020; Tsarkov et al., 2021; Bergen et al., 2021), as the systematic generalization capabilities of Transformers for VQA are largely unknown.

A recent strand of research on systematic generalization in VQA investigates Neural Module Networks (NMNs) (Bahdanau et al., 2019; 2020; D'Amario et al., 2021). NMNs decompose a question into sub-tasks, and each sub-task is tackled by a shallow neural network called a module. Thus, NMNs use a question-specific composition of modules to answer novel questions. NMNs narrow the gap between in-distribution generalization and systematic generalization thanks to their inherent compositional structure.
In our experiments, we found that CNN-based NMNs outperform Transformers on systematic generalization to novel compositions of sub-tasks. This raises the question of whether we can combine the strengths of Transformers and NMNs to improve the systematic generalization capabilities of learning machines.
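To make the NMN idea concrete, the question-specific composition of modules described above can be sketched as follows. This is a minimal toy illustration, not the actual architecture: the module names (find, count), the program layout, and the symbolic region features are hypothetical stand-ins for what real NMNs implement with learned neural modules over image features.

```python
# Toy sketch of an NMN-style forward pass: a question is parsed into a
# program (a sequence of sub-tasks), and each sub-task is handled by a
# dedicated module. Module names and data layout are hypothetical.

def find(regions, concept):
    """Hypothetical 'find' module: keep regions matching a concept."""
    return [r for r in regions if concept in r["labels"]]

def count(regions, _arg=None):
    """Hypothetical 'count' module: answer with the number of regions."""
    return len(regions)

MODULES = {"find": find, "count": count}

def execute(program, regions):
    """Compose modules according to a question-specific program."""
    state = regions
    for op, arg in program:
        state = MODULES[op](state, arg)
    return state

# "How many red cubes are there?" -> program: find(red) -> find(cube) -> count
regions = [{"labels": {"red", "cube"}},
           {"labels": {"blue", "sphere"}},
           {"labels": {"red", "cube"}}]
program = [("find", "red"), ("find", "cube"), ("count", None)]
print(execute(program, regions))  # -> 2
```

Because the program is assembled per question, a novel question such as "how many blue spheres" reuses the same find and count modules in a new composition; this reuse is what the compositional-structure argument above relies on.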

