ITERATED LEARNING FOR EMERGENT SYSTEMATICITY IN VQA

Abstract

Although neural module networks have an architectural bias towards compositionality, they require gold standard layouts to generalize systematically in practice. When instead learning layouts and modules jointly, compositionality does not arise automatically and an explicit pressure is necessary for the emergence of layouts exhibiting the right structure. We propose to address this problem using iterated learning, a cognitive science theory of the emergence of compositional languages in nature that has primarily been applied to simple referential games in machine learning. Considering the layouts of module networks as samples from an emergent language, we use iterated learning to encourage the development of structure within this language. We show that the resulting layouts support systematic generalization in neural agents solving the more complex task of visual question-answering. Our regularized iterated learning method can outperform baselines without iterated learning on SHAPES-SyGeT (SHAPES Systematic Generalization Test), a new split of the SHAPES dataset we introduce to evaluate systematic generalization, and on CLOSURE, an extension of CLEVR also designed to test systematic generalization. We demonstrate superior performance in recovering ground-truth compositional program structure with limited supervision on both SHAPES-SyGeT and CLEVR.

1. INTRODUCTION

Although great progress has been made in visual question-answering (VQA), recent methods still struggle to generalize systematically to inputs drawn from a distribution different from that seen during training (Bahdanau et al., 2019a;b). Neural module networks (NMNs) present a natural solution to improve generalization in VQA, using a symbolic layout or program to arrange neural computational modules into computation graphs. If these modules learn specialized roles, they can be composed in arbitrary legal layouts to produce different processing flows. However, for modules to learn specialized roles, programs must support this type of compositionality; if programs reuse modules in non-compositional ways, modules are unlikely to become layout-invariant. This poses a substantial challenge for the training of NMNs. Although Bahdanau et al. (2019b) and Bahdanau et al. (2019a) both observe that NMNs can generalize systematically when given human-designed ground-truth programs, creating these programs imposes substantial practical costs. It is therefore natural to jointly learn a program generator alongside the modules (Johnson et al., 2017b; Hu et al., 2017; Vedantam et al., 2019), but the generated programs often fail to generalize systematically, leading to worse performance (Bahdanau et al., 2019b).

Iterated learning (IL) offers one way to address this problem. Originating in cognitive science, IL explains how language evolves to become more compositional and easier to acquire through a repeated transmission process, in which each new generation acquires the previous generation's language from a limited number of samples (Kirby et al., 2014). Early work with human participants (Kirby et al., 2008) as well as agent-based simulations (Zuidema, 2003) supports this hypothesis. In machine learning, IL has so far been applied primarily to simple referential games. Different from previous work, we believe that IL is an algorithmic principle that is equally applicable to recovering compositional structure in more general tasks.
We thus propose treating NMN programs as samples from a "layout language" and applying IL to the challenging problem of VQA. Our efforts highlight the potential of IL for broader machine learning applications beyond the previously-explored scope of language emergence and preservation (Lu et al., 2020). To demonstrate our method, we introduce a lightweight benchmark for systematic generalization research based on the popular SHAPES dataset (Andreas et al., 2016), called SHAPES-SyGeT (SHAPES Systematic Generalization Test). Our experiments on SHAPES-SyGeT, CLEVR (Johnson et al., 2017a), and CLOSURE (Bahdanau et al., 2019a) show that our IL algorithm accelerates the learning of compositional program structure, leading to better generalization both to unseen questions from the training question templates and to unseen question templates. Using only 100 ground-truth programs for supervision, our method achieves CLEVR performance comparable to Johnson et al. (2017b) and Vedantam et al. (2019), which use 18,000 and 1,000 programs for supervision, respectively.
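As a concrete illustration of the transmission bottleneck at the heart of IL (a toy sketch of the cognitive-science setting, not the VQA method developed in this paper), consider a chain of memorizing agents; the `Agent` class and its vocabulary are illustrative assumptions.

```python
import random

class Agent:
    """Toy memorizing agent, a stand-in for a neural learner: it
    recalls trained (input -> message) pairs and invents a random
    message for unseen inputs."""
    def __init__(self, vocab=("a", "b")):
        self.vocab = vocab
        self.memory = {}

    def produce(self, x):
        return self.memory.get(x, random.choice(self.vocab))

    def train(self, dataset):
        self.memory.update(dataset)

def iterated_learning(inputs, num_generations=10, bottleneck=2):
    """One chain of iterated learning: each freshly initialized
    student learns only from a limited sample of the teacher's
    productions, then becomes the next generation's teacher."""
    teacher = Agent()
    for _ in range(num_generations):
        sample = random.sample(inputs, bottleneck)  # transmission bottleneck
        data = {x: teacher.produce(x) for x in sample}
        student = Agent()
        student.train(data)
        teacher = student
    return teacher
```

Because each student sees only `bottleneck` samples, mappings that are easier to acquire from few examples are preferentially transmitted across generations; in the richer neural setting this pressure is what favors compositional structure.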

2. RELATED WORK

Systematic generalization. Systematicity was first proposed as a topic of research in neural networks by Fodor & Pylyshyn (1988), who argue that cognitive capabilities exhibit certain symmetries, and that representations of mental states have combinatorial syntactic and semantic structure. Whether or not neural networks can exhibit systematic compositionality has been a subject of much debate in the research community (Fodor & Pylyshyn, 1988; Christiansen & Chater, 1994; Marcus, 1998; Phillips, 1998; Chang, 2002; Fodor & Lepore, 2002; van der Velde et al., 2004; Botvinick & Plaut, 2009; Bowers et al., 2009; Brakel & Frank, 2009; Calvo & Symons, 2014; Marcus, 2018).



Figure 1: An overview of neural module networks (NMNs). A question q is read by the program generator to produce a program ẑ. The execution engine assembles neural modules according to the layout ẑ and feeds the input image x into the assembled module network. A classifier takes the output of the top-level module to produce an answer ŷ for the given (q, x) pair.
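The pipeline described in this caption can be sketched in PyTorch as follows. This is a minimal illustration only: the postfix program format, the per-module `arity` attribute, and the module and classifier interfaces are assumptions for the sketch, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class NeuralModuleNetwork(nn.Module):
    """Sketch of the Figure-1 pipeline: program generator ->
    execution engine -> classifier."""
    def __init__(self, program_generator, modules, classifier):
        super().__init__()
        self.program_generator = program_generator  # maps q to a layout z_hat
        self.modules_by_name = nn.ModuleDict(modules)
        self.classifier = classifier

    def forward(self, question_tokens, image_features):
        # 1. Predict a layout from the question; here we assume a
        #    postfix sequence of module names.
        program = self.program_generator(question_tokens)
        # 2. Execute the layout bottom-up; every module also sees the
        #    image features (a common NMN convention).
        stack = []
        for name in program:
            module = self.modules_by_name[name]
            arity = getattr(module, "arity", 0)
            args = [stack.pop() for _ in range(arity)]
            stack.append(module(image_features, *args))
        # 3. Classify the top-level module's output into an answer.
        return self.classifier(stack[-1])
```

Because the execution engine only interprets the predicted layout, modules can in principle be rearranged into any legal program; whether they actually become layout-invariant is precisely the compositionality question studied in this paper.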

Bahdanau et al. (2019b) investigate various VQA architectures, such as neural module networks (NMNs) (Andreas et al., 2016), MAC (Hudson & Manning, 2018), FiLM (Perez et al., 2018), and relation networks (Santoro et al., 2017), on their ability to systematically generalize on a new synthetic dataset called SQOOP. They show that only NMNs are able to robustly solve test problems, and only when a fixed tree-structured layout is provided. When learning to infer the module

