FAST AND SLOW LEARNING OF RECURRENT INDEPENDENT MECHANISMS

Abstract

Decomposing knowledge into interchangeable pieces promises a generalization advantage when the distribution changes. A learning agent interacting with its environment is likely to face situations requiring novel combinations of existing pieces of knowledge. We hypothesize that such a decomposition of knowledge is particularly relevant for generalizing in a systematic way to out-of-distribution changes. To study these ideas, we propose a training framework in which we assume that the pieces of knowledge an agent needs, and its reward function, are stationary and can be re-used across tasks. An attention mechanism dynamically selects which modules can be adapted to the current task; the parameters of the selected modules are allowed to change quickly as the learner is confronted with variations in what it experiences, while the parameters of the attention mechanism act as stable, slowly changing meta-parameters. We focus on pieces of knowledge captured by an ensemble of modules that communicate sparsely with each other via a bottleneck of attention. We find that meta-learning the modular aspects of the proposed system greatly helps achieve faster adaptation in a reinforcement learning setup involving navigation in a partially observed grid world with image-level input. We also find that reversing the roles of parameters and meta-parameters does not work nearly as well, suggesting a particular benefit from fast adaptation of the dynamically selected modules.
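To make the division of labor described above concrete, the following is a minimal sketch (not the paper's actual implementation) of the two parameter groups: "slow" attention weights that score and select a sparse subset of modules, and "fast" per-module weights that would be adapted quickly to each new task. All names, dimensions, and the top-k selection rule shown here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not taken from the paper).
n_modules, d_in, d_hid, top_k = 4, 8, 16, 2

# "Slow" meta-parameters: attention weights that decide which modules engage.
attn_query = rng.normal(size=(d_in, n_modules))

# "Fast" parameters: one small module per slot, adapted quickly per task.
modules = [{"W": rng.normal(size=(d_in, d_hid)) * 0.1} for _ in range(n_modules)]

def select_and_apply(x):
    """Attention scores pick the top-k modules; only those process the input."""
    scores = x @ attn_query                  # one score per module, shape (n_modules,)
    active = np.argsort(scores)[-top_k:]     # indices of the k selected modules
    # Softmax over the selected modules only (the sparse bottleneck).
    w = np.exp(scores[active] - scores[active].max())
    w /= w.sum()
    # Weighted combination of the active modules' outputs.
    out = sum(wi * np.tanh(x @ modules[i]["W"]) for wi, i in zip(w, active))
    return out, active

x = rng.normal(size=d_in)
out, active = select_and_apply(x)
print(out.shape, len(active))  # output dimension d_hid, and top_k active modules
```

In this sketch, fast adaptation would update only `modules[i]["W"]` for the active indices, while `attn_query` would change only in a slower outer loop; reversing those roles is the ablation the abstract refers to.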

1. INTRODUCTION

The classical statistical framework of machine learning rests on the assumption of independent and identically distributed (i.i.d.) data, implying that the test data comes from the same distribution as the training data. However, a learning agent interacting with the world is often faced with nonstationarities and changes in distribution, which could be caused by the actions of the agent itself or by other agents in the environment. A long-standing goal of machine learning has been to build models that can better handle such changes in distribution and hence achieve better generalization (Lake & Baroni, 2018; Bahdanau et al., 2018). At the same time, deep learning systems built in the form of a single big network, consisting of a layered but otherwise monolithic architecture, tend to co-adapt across different components of the network. Due to this monolithic structure, when the task or the distribution changes, a majority of the components of the network are likely to adapt in response, potentially leading to catastrophic interference between different tasks or pieces of knowledge (Andreas et al., 2016; Fernando et al., 2017; Shazeer et al., 2017; Jo et al., 2018; Rosenbaum et al., 2019; Alet et al., 2018; Kirsch et al., 2018; Goyal et al., 2019; 2020; Goyal & Bengio, 2020). An interesting challenge of current machine learning research is thus out-of-distribution adaptation and generalization. Humans seem to be able to learn a new task quickly by re-using relevant prior knowledge, raising two fundamental questions which we explore here: (1) how to separate knowledge into easily recomposable pieces (which we call modules), and (2) how to do this so as to achieve fast adaptation to new tasks or changes in distribution, when a module may need to be modified or when different modules may need to be combined in new ways.
For the former objective, instead of representing knowledge with a homogeneous architecture as in standard neural networks, we adopt recently proposed approaches (Goyal et al., 2019; Mittal et al., 2020; Goyal et al., 2020; Rahaman

1 Mila, University of Montreal, 2 Mila, Polytechnique Montréal, 3 Max Planck Institute for Intelligent Systems. Corresponding author: madankanika.s@gmail.com

