ITERATED LEARNING FOR EMERGENT SYSTEMATICITY IN VQA

Abstract

Although neural module networks have an architectural bias towards compositionality, in practice they require gold-standard layouts to generalize systematically. When layouts and modules are instead learned jointly, compositionality does not arise automatically, and explicit pressure is necessary for layouts with the right structure to emerge. We propose to address this problem using iterated learning, a cognitive science theory of the emergence of compositional languages in nature that has so far been applied in machine learning primarily to simple referential games. Treating the layouts of module networks as samples from an emergent language, we use iterated learning to encourage the development of structure within this language. We show that the resulting layouts support systematic generalization in neural agents solving the more complex task of visual question-answering. Our regularized iterated learning method outperforms baselines without iterated learning on SHAPES-SyGeT (SHAPES Systematic Generalization Test), a new split of the SHAPES dataset we introduce to evaluate systematic generalization, and on CLOSURE, an extension of CLEVR also designed to test systematic generalization. We further demonstrate superior performance in recovering ground-truth compositional program structure with limited supervision on both SHAPES-SyGeT and CLEVR.

1. INTRODUCTION

Although great progress has been made in visual question-answering (VQA), recent methods still struggle to generalize systematically to inputs drawn from a distribution different from that seen during training (Bahdanau et al., 2019b;a). Neural module networks (NMNs) offer a natural route to better generalization in VQA: a symbolic layout, or program, arranges neural computational modules into a computation graph. If these modules learn specialized roles, they can be composed in arbitrary legal layouts to produce different processing flows. However, for modules to learn specialized roles, programs must support this type of compositionality; if programs reuse modules in non-compositional ways, the modules are unlikely to become layout-invariant. This poses a substantial challenge for training NMNs. Although Bahdanau et al. (2019b) and Bahdanau et al. (2019a) both observe that NMNs can generalize systematically when given human-designed ground-truth programs, creating these programs imposes substantial practical costs. It is therefore natural to learn a program generator jointly with the modules (Johnson et al., 2017b; Hu et al., 2017; Vedantam et al., 2019), but the generated programs often fail to generalize systematically and lead to worse performance (Bahdanau et al., 2019b).

Iterated learning (IL) offers one way to address this problem. Originating in cognitive science, IL explains how language evolves to become more compositional and easier to acquire through a repeated transmission process, in which each new generation acquires the previous generation's language from a limited number of samples (Kirby et al., 2014). Early work with human participants (Kirby et al., 2008) as well as agent-based simulations (Zuidema, 2003) supports this hypothesis.
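The transmission dynamic at the heart of IL can be illustrated with a toy simulation. The sketch below is purely illustrative and is not the paper's training procedure; all names (the shape/color meaning space, the two learners) are our own assumptions. It shows why a compositional language survives a transmission bottleneck better than a holistic one: a learner with a compositional bias can recombine observed feature tokens to express meanings it never saw, while a purely memorizing learner cannot.

```python
import random
from itertools import product

SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "blue", "green"]
MEANINGS = list(product(SHAPES, COLORS))

def compositional_language():
    # One token per feature value, concatenated: fully systematic.
    return {(s, c): f"{s[:2]}-{c[:2]}" for s, c in MEANINGS}

def holistic_language(rng):
    # An arbitrary, unrelated string per meaning: no shared structure.
    return {m: "".join(rng.choices("abcdefgh", k=5)) for m in MEANINGS}

def learn_holistic(data):
    # A learner with no inductive bias can only memorize what it saw.
    return dict(data)

def learn_compositional(data):
    # A learner with a compositional bias factors each utterance into
    # per-feature tokens and recombines them for unseen meanings.
    shape_tok, color_tok = {}, {}
    for (s, c), utt in data:
        left, _, right = utt.partition("-")
        shape_tok[s], color_tok[c] = left, right
    return {(s, c): f"{shape_tok[s]}-{color_tok[c]}"
            for s in shape_tok for c in color_tok}

def transmit(language, learn, bottleneck, generations, rng):
    # Iterated learning: each generation observes only `bottleneck`
    # samples of its teacher's language, then becomes the next teacher.
    for _ in range(generations):
        sample = rng.sample(sorted(language), min(bottleneck, len(language)))
        language = learn([(m, language[m]) for m in sample])
    return language

rng = random.Random(0)
comp = transmit(compositional_language(), learn_compositional, 6, 10, rng)
hol = transmit(holistic_language(rng), learn_holistic, 6, 10, rng)
# Coverage = how many of the 9 meanings each final language expresses.
print("compositional:", len(comp), "holistic:", len(hol))
```

The holistic language collapses to exactly the six memorized meanings after the first bottleneck, whereas the compositional learner's coverage is the product of the shapes and colors it has seen, which can never fall below the holistic learner's. The paper's contribution is to apply this pressure to NMN layouts rather than toy strings.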

