ON THE SPECIALIZATION OF NEURAL MODULES

Abstract

A number of machine learning models have been proposed with the goal of achieving systematic generalization: the ability to reason about new situations by combining aspects of previous experiences. These models leverage compositional architectures which aim to learn specialized modules dedicated to structures in a task that can be composed to solve novel problems with similar structures. While the compositionality of these architectures is guaranteed by design, the specialization of the modules is not. Here we theoretically study the ability of network modules to specialize to useful structures in a dataset and achieve systematic generalization. To this end we introduce a minimal space of datasets motivated by practical systematic generalization benchmarks. From this space of datasets we present a mathematical definition of systematicity and study the learning dynamics of linear neural modules when solving components of the task. Our results shed light on the difficulty of module specialization, what is required for modules to successfully specialize, and the necessity of modular architectures to achieve systematicity. Finally, we confirm that the theoretical results in our tractable setting generalize to more complex datasets and non-linear architectures.

1. INTRODUCTION

Humans frequently display the ability to systematically generalize, that is, to leverage specific learning experiences in diverse new settings (Lake et al., 2019). For instance, exploiting the approximate compositionality of natural language, humans can combine a finite set of words or phonemes into a near-infinite set of sentences, words, and meanings. Someone who understands "brown dog" and "black cat" also likely understands "brown cat," to take one example from Szabó (2012). The result is that a human's ability to reason about situations or phenomena extends far beyond their ability to directly experience and learn from all such situations or phenomena. Deep learning techniques have made great strides in tasks like machine translation and language prediction, providing proof of principle that they can succeed in quasi-compositional domains. However, these methods are typically data-hungry, and the same networks often fail to generalize in even simple settings when training data are scarce (Lake & Baroni, 2018b; Lake et al., 2019). Empirically, the degree of systematicity in deep networks is influenced by many factors. One possibility is that the learning dynamics in a deep network could impart an implicit inductive bias toward systematic structure (Hupkes et al., 2020); however, a number of studies have identified situations where depth alone is insufficient for structured generalization (Pollack, 1990; Niklasson & Sharkey, 1992; Phillips & Wiles, 1993; Lake & Baroni, 2018b; Mittal et al., 2022). Another significant factor is architectural modularity, which can enable a system to generalize when modules are appropriately configured (Vani et al., 2021; Phillips, 1995). However, identifying the right modularity through learning remains challenging (Mittal et al., 2022).
In spite of these (and many other) possibilities for improving systematicity (Hupkes et al., 2020), it remains unclear when standard deep neural networks will exhibit systematic generalization (Dankers et al., 2021), reflecting a long-standing theoretical debate stretching back to the first wave of connectionist deep networks (Rumelhart & McClelland, 1986; Pollack, 1990; Fodor & Pylyshyn, 1988; Smolensky, 1991; 1990; Hadley, 1993; 1994). In this work we theoretically study the ability of neural modules to specialize to structures in a dataset. Our goal is to provide a formalism for systematic generalization and to begin to concretize some of the intuitions and concepts in the systematic generalization literature. To begin, we make a careful distinction between the compositionality and systematicity of a neural network architecture. Specifically, in this work we maintain that compositionality is a feature of an architecture, such as a Neural Module Network (Andreas et al., 2016; Andreas, 2018), or dataset where modules or components can be composed by design. Systematicity is a property of a (potentially compositional) architecture exploiting structure in the world (dataset), such as the compositional structure of natural language from the Szabó (2012) example. Intuitively, if a dataset does not have structure which can be exploited for generalization, then no compositional architecture will be able to systematically generalize. As a result, in this work we are concerned with formalizing both the dataset and neural network learning dynamics to study this interplay between domain and architecture. The main approach of this work can be summarized as follows: we introduce a minimal space of datasets that contain compositional and non-compositional features, and examine the impact of implicit biases and architectural modularity on the learned input-output mappings of deep linear network modules.
In particular:

• We derive exact training dynamics for deep linear network modules as a function of the dataset parameters. This is a novel theoretical means of analysing the effect of dataset structure on neural network learning dynamics.

• We formalize the goal of modularity as finding lower-rank sub-structure within a dataset that can be exploited to improve generalization.

• We show that for all datasets in the space, despite the possibility of learning a systematic mapping, non-modular networks do not do so under gradient descent dynamics.

• We show that modular network architectures can learn fully systematic network mappings, but only when the modularity perfectly segregates the underlying lower-rank sub-structure in the dataset.

In Section 7 we show that our findings, which rely on a simplified setting for mathematical tractability, generalize to more complicated datasets and non-linear architectures by training a convolutional neural network to label handwritten digits between 0 and 999. Overall, our results help clarify the interplay between dataset structure and architectural biases which can facilitate systematic generalization when neural modules specialize.
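The kind of deep linear learning dynamics analysed here can be illustrated with a small simulation. The sketch below is our own toy construction (the dimensions, singular values, and learning rate are illustrative choices, not the paper's setting): a two-layer linear network trained by full-batch gradient descent converges to the teacher map, and singular modes of the target with larger singular values are learned earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 4, 3
X = np.eye(n_in)                       # one-hot inputs, one item per column

# Teacher map with well-separated singular values 3.0 > 1.0 > 0.3.
U, _ = np.linalg.qr(rng.normal(size=(n_out, n_out)))
V, _ = np.linalg.qr(rng.normal(size=(n_in, n_in)))
S = np.array([3.0, 1.0, 0.3])
W_target = U @ np.diag(S) @ V[:, :n_out].T   # shape (n_out, n_in)
Y = W_target @ X

# Small random initialization; hidden width equals the target rank.
W1 = 0.01 * rng.normal(size=(n_out, n_in))
W2 = 0.01 * rng.normal(size=(n_out, n_out))

lr, steps = 0.05, 5000
strengths = []                          # per-step strength of each singular mode
for _ in range(steps):
    E = Y - W2 @ W1 @ X                 # error on the full batch
    W2 += lr * E @ (W1 @ X).T           # gradient step on the output layer
    W1 += lr * W2.T @ E @ X.T           # gradient step on the input layer
    strengths.append(np.diag(U.T @ (W2 @ W1) @ V[:, :n_out]))

# The composed map recovers the teacher map.
print(bool(np.allclose(W2 @ W1, W_target, atol=1e-2)))

# The strongest mode crosses half its final strength before the weakest one.
def half_time(k):
    return next(t for t, m in enumerate(strengths) if m[k] > S[k] / 2)

print(half_time(0) < half_time(2))
```

Tracking `strengths` (the projection of the end-to-end map onto the teacher's singular directions) makes the stage-like learning visible: each mode rises roughly sigmoidally, ordered by singular value.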

2. BACKGROUND

Systematic generalization has been proposed as a key feature of intelligent learning agents which can generalize to novel stimuli in their environment (Hockett & Hockett, 1960; Fodor & Pylyshyn, 1988; Hadley, 1993; Kirby et al., 2015; Lake et al., 2017; Mittal et al., 2022). In particular, the closely related concept of compositional structure has been shown to have benefits for both learning speed (Shalev-Shwartz & Shashua, 2016; Ren et al., 2019) and generalizability (Lazaridou et al., 2018). There are, however, counter-examples which find only a weak correlation between compositionality and generalization (Andreas, 2018) or learning speed (Kharitonov & Baroni, 2020). In most cases neural networks do not manage to generalize systematically (Ruis et al., 2020; Mittal et al., 2022), or systematic generalization occurs only with the addition of explicit regularizers or a degree of supervision on the learned features (Shalev-Shwartz et al., 2017; Wies et al., 2022), which is also termed "mediated perception" (Shalev-Shwartz & Shashua, 2016). Neural Module Networks (NMNs) (Andreas et al., 2016; Hu et al., 2017; 2018) have become one successful method of creating network architectures which generalize systematically. By (jointly) training individual neural modules on particular subsets of the data or on particular subtasks, the modules specialize. These modules can then be combined in new ways when an unseen data point is input to the model. Thus, through the composition of the modules, the model will systematically generalize, assuming that the correct modules can be structured together. Bahdanau et al. (2019b) show, however, that strong regularizers are required for the correct module structures to be learned. Specifically, it was shown that neural modules become coupled and, as a result, do not specialize in a manner which can be useful to the compositional design of the architecture.
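The idealized recombination that NMNs aim for can be sketched with a deliberately simplified toy. The modules below are hand-specified rather than learned, and the scene, module names, and layout are our own illustrative choices; the point is only that specialized modules, once correct, compose to answer queries never seen as a whole:

```python
# Toy NMN-style composition: a query is answered by applying a "layout"
# (a sequence of named module applications) to a scene of objects.

def filter_color(objects, color):
    return [o for o in objects if o["color"] == color]

def filter_shape(objects, shape):
    return [o for o in objects if o["shape"] == shape]

def count(objects):
    return len(objects)

MODULES = {"filter_color": filter_color,
           "filter_shape": filter_shape,
           "count": count}

def run_layout(layout, objects):
    """Apply each (module, argument) step in sequence to the scene."""
    state = objects
    for name, arg in layout:
        module = MODULES[name]
        state = module(state, arg) if arg is not None else module(state)
    return state

scene = [
    {"color": "brown", "shape": "dog"},
    {"color": "black", "shape": "cat"},
    {"color": "brown", "shape": "cat"},
]

# A composition never required before: "brown" and "cat" together,
# built from modules that individually handled other combinations.
layout = [("filter_color", "brown"), ("filter_shape", "cat"), ("count", None)]
print(run_layout(layout, scene))
```

This recombination only works because each module here is, by construction, perfectly specialized; when the modules are trained jointly, as discussed above, they can become coupled and the idealized composition fails.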
Thus, without task-specific regularizers, systematic mappings did not emerge with NMNs. Similar problems arise with other models which are compositional in nature. Tensor Product Networks (Smolensky et al., 2022), for example, aim to learn an encoding of place and content for features in a data point. These encodings are then combined via tensor products and summed. By the nature of separating

