FAST AND SLOW LEARNING OF RECURRENT INDEPENDENT MECHANISMS

Abstract

Decomposing knowledge into interchangeable pieces promises a generalization advantage when there are changes in distribution. A learning agent interacting with its environment is likely to be faced with situations requiring novel combinations of existing pieces of knowledge. We hypothesize that such a decomposition of knowledge is particularly relevant for generalizing in a systematic way to out-of-distribution changes. To study these ideas, we propose a training framework in which we assume that the pieces of knowledge an agent needs, and its reward function, are stationary and can be re-used across tasks. An attention mechanism dynamically selects which modules can be adapted to the current task; the parameters of the selected modules are allowed to change quickly as the learner is confronted with variations in what it experiences, while the parameters of the attention mechanisms act as stable, slowly changing meta-parameters. We focus on pieces of knowledge captured by an ensemble of modules sparsely communicating with each other via a bottleneck of attention. We find that meta-learning the modular aspects of the proposed system greatly helps in achieving faster adaptation in a reinforcement-learning setup involving navigation in a partially observed grid world with image-level input. We also find that reversing the roles of the parameters and meta-parameters does not work nearly as well, suggesting a particular role for fast adaptation of the dynamically selected modules.

1. INTRODUCTION

The classical statistical framework of machine learning is built on the assumption of independent and identically distributed (i.i.d.) data, implying that the test data comes from the same distribution as the training data. However, a learning agent interacting with the world is often faced with nonstationarities and changes in distribution, which could be caused by the actions of the agent itself, or by other agents in the environment. A long-standing goal of machine learning has been to build models that can better handle such changes in distribution and hence achieve better generalization (Lake & Baroni, 2018; Bahdanau et al., 2018). At the same time, deep learning systems built in the form of a single big network, consisting of a layered but otherwise monolithic architecture, tend to co-adapt across different components of the network. Due to this monolithic structure, when the task or the distribution changes, a majority of the components of the network are likely to adapt in response, potentially leading to catastrophic interference between different tasks or pieces of knowledge (Andreas et al., 2016; Fernando et al., 2017; Shazeer et al., 2017; Jo et al., 2018; Rosenbaum et al., 2019; Alet et al., 2018; Kirsch et al., 2018; Goyal et al., 2019; 2020; Goyal & Bengio, 2020). An interesting challenge of current machine learning research is thus out-of-distribution adaptation and generalization. Humans seem to be able to learn a new task quickly by re-using relevant prior knowledge, raising two fundamental questions which we explore here: (1) how to separate knowledge into easily recomposable pieces (which we call modules), and (2) how to do this so as to achieve fast adaptation to new tasks or changes in distribution, when a module may need to be modified or when different modules may need to be combined in new ways.
For the former objective, instead of representing knowledge with a homogeneous architecture as in standard neural networks, we adopt recently proposed approaches (Goyal et al., 2019; Mittal et al., 2020; Goyal et al., 2020; Rahaman et al., 2019) to learn a modular architecture consisting of a set of independent modules which compete with each other to attend to an input and sparsely communicate using a key-value attention mechanism (Bahdanau et al., 2014; Vaswani et al., 2017; Santoro et al., 2018). For the second objective of fast transfer, we adopt a meta-learning approach on the two sets of parameters (those of the modules and those of the attention mechanisms), with the goal of achieving fast adaptation to changes in distribution or to a new task in a reinforcement-learning agent (Duan et al., 2016; Mishra et al., 2017; Wang et al., 2016; Nichol et al., 2018; Xu et al., 2018; Houthooft et al., 2018; Kirsch et al., 2019). In this paper, we study the generalization ability of the proposed modular architecture on tasks not seen during training. We conduct experiments in the domain of grounded language learning, in which poor data efficiency is one of the major obstacles preventing agents from learning efficiently and generalizing well (Hermann et al., 2017; Chaplot et al., 2017; Wu et al., 2018; Yu et al., 2018; Chevalier-Boisvert et al., 2018).
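The key-value attention used here can be sketched as standard scaled dot-product attention (Vaswani et al., 2017). The snippet below is a generic illustration of the mechanism, not the paper's exact implementation: each query scores every key, and the resulting softmax weights form a convex combination of the values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product key-value attention: each query reads a
    weighted mixture of the values, weighted by query-key match."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ values, weights
```

In a modular network of this kind, active modules emit the queries while all modules emit keys and values, so each module reads only the contextual information relevant to it.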
We show empirically that the proposed learning agent not only generalizes better on seen data, but is also more sample-efficient, faster to train and adapt, and transfers better in the face of changes in distribution. We thus present evidence that combining meta-learning with modular architectures can help build agents that learn and leverage the compositional properties of the environment to generalize to novel domains, transfer better, and achieve a more systematic generalization.

2. META LEARNING OF RECURRENT INDEPENDENT MECHANISMS

We intend to assay whether a modular architecture, combined with learning different parts of the model on different timescales, can help in decomposing knowledge into re-usable pieces such that the resulting model is not only more sample-efficient, but also generalizes well across changes in task distributions. We first give a high-level overview, and then describe the components of the model and the two learning phases in more detail: Section 2.1 contains an overview of the modular architecture consisting of an ensemble of recurrent modules, and Section 2.2 explains the meta-learning approach to learn the parameters of the modular network at different timescales.

Modular Network. The proposed method is based on the RIMs architecture (Goyal et al., 2019), which consists of a set of competing modules, such that each module acts independently and interacts with other modules sparingly through attention.

Attention Networks to modulate information. The soft-attention mechanisms control the flow of information in the model, such that different modules attend to different parts of the input via input attention, and a module queries relevant contextual information from other modules via communication attention. The two attention mechanisms can use multiple attention heads and have their own set of parameters, and the different modules have their own independent parameters.
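The input-attention competition can be sketched roughly as follows. This is a simplified caricature of the RIMs selection step: each module's query scores the real input against a learned "null" input, and the k modules that attend most strongly to the real input are activated. All weight names and the exact form of the null-input trick are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N, k, d = 6, 2, 8                      # N modules, top-k active, feature size
module_states = rng.standard_normal((N, d))

# Hypothetical learned parameters: per-module query projection, plus keys
# for the real input and for a "null" input that inactive modules prefer.
W_q = rng.standard_normal((d, d))
input_key = rng.standard_normal(d)
null_key = rng.standard_normal(d)

queries = module_states @ W_q                                  # (N, d)
scores = np.stack([queries @ input_key,
                   queries @ null_key], axis=1) / np.sqrt(d)   # (N, 2)
e = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = e / e.sum(axis=1, keepdims=True)                        # softmax per module

# Modules with the highest attention on the real input win the competition;
# only these k modules are updated in the inner loop at this time step.
active = np.argsort(-attn[:, 0])[:k]
```

The design choice here is that selection is competitive rather than independent: because every module's attention is normalized between the real and null inputs, modules that do not find the current input relevant effectively opt out, yielding the sparse, dynamic activation described above.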



Mila, University of Montreal, Mila, Polytechnique Montréal, Max Planck Institute for Intelligent Systems. Corresponding author: madankanika.s@gmail.com



Figure 1: Proposed model architecture: input images are processed through an encoder and, together with the embedded mission instruction, are passed through a set of recurrent independent mechanisms, or RIMs (Goyal et al., 2019), which compete using attention mechanisms to capture the dynamics of the system. The network's state is divided into N modules, such that at every time step, conditioned on a given input x_t, only a subset k out of the N modules are dynamically activated and updated fast in the inner learning loop, while the parameters of the two attention mechanisms, Input Attention and Communication Attention, are learnt slowly in the outer loop as meta-parameters.
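The fast/slow split described in the caption can be caricatured with a toy two-timescale loop on a quadratic loss. This is purely illustrative: the parameter names are hypothetical stand-ins, and in the actual setup the inner objective is the per-task reinforcement-learning loss while the outer step optimizes a meta-objective across tasks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy split of parameters: fast module weights vs. slow attention weights.
theta_mod = rng.standard_normal(4)    # fast: adapted in the inner loop
theta_att = rng.standard_normal(4)    # slow: meta-parameters, outer loop
theta_att0 = theta_att.copy()

def loss_grad(theta, target):
    """Gradient of the toy loss 0.5 * ||theta - target||^2."""
    return theta - target

inner_lr, outer_lr = 0.5, 0.05
for task in range(3):                 # a stream of related tasks
    target_mod = rng.standard_normal(4)   # task-specific optimum
    # Inner loop: the selected modules adapt quickly to the current task.
    for _ in range(10):
        theta_mod -= inner_lr * loss_grad(theta_mod, target_mod)
    # Outer loop: attention parameters take one small, stable step
    # (here toward a fixed meta-target at the origin).
    theta_att -= outer_lr * loss_grad(theta_att, np.zeros(4))
```

After this loop, the fast parameters track each task's optimum closely, while the slow parameters have moved only slightly: the same asymmetry the paper exploits, with the attention meta-parameters providing stability across tasks.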

