LIFELONG LEARNING OF COMPOSITIONAL STRUCTURES

Abstract

A hallmark of human intelligence is the ability to construct self-contained chunks of knowledge and adequately reuse them in novel combinations for solving different yet structurally related problems. Learning such compositional structures has been a significant challenge for artificial systems, due to the combinatorial nature of the underlying search problem. To date, research into compositional learning has largely proceeded separately from work on lifelong or continual learning. We integrate these two lines of work to present a general-purpose framework for lifelong learning of compositional structures that can be used for solving a stream of related tasks. Our framework separates the learning process into two broad stages: learning how to best combine existing components in order to assimilate a novel problem, and learning how to adapt the set of existing components to accommodate the new problem. This separation explicitly handles the trade-off between the stability required to remember how to solve earlier tasks and the flexibility required to solve new tasks, as we show empirically in an extensive evaluation.

1. INTRODUCTION

A major goal of artificial intelligence is to create an agent capable of acquiring a general understanding of the world. Such an agent would require the ability to continually accumulate and build upon its knowledge as it encounters new experiences. Lifelong machine learning addresses this setting, whereby an agent faces a continual stream of diverse problems and must strive to capture the knowledge necessary for solving each new task it encounters. If the agent is capable of accumulating knowledge in some form of compositional representation (e.g., neural net modules), it could then selectively reuse and combine relevant pieces of knowledge to construct novel solutions. Various compositional representations for multiple tasks have been proposed recently (Zaremba et al., 2016; Hu et al., 2017; Kirsch et al., 2018; Meyerson & Miikkulainen, 2018).

We address the novel question of how to learn these compositional structures in a lifelong learning setting. We design a general-purpose framework that is agnostic to the specific algorithms used for learning and the form of the structures being learned. Evoking Piaget's (1976) assimilation and accommodation stages of intellectual development, this framework embodies the benefits of dividing the lifelong learning process into two distinct stages. In the first stage, the learner strives to solve a new task by combining existing components it has already acquired. The second stage uses discoveries from the new task to improve existing components and to construct fresh components if necessary. Our proposed framework, which we depict visually in Appendix A, is capable of incorporating various forms of compositional structures, as well as different mechanisms for avoiding catastrophic forgetting (McCloskey & Cohen, 1989).
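The two-stage split described above can be sketched as a simple loop. This is a toy illustration under our own simplifying assumptions (components as fixed functions, a task's structure as the choice of a single component, accommodation only by adding components), not the paper's actual algorithm; all names are illustrative.

```python
class CompositionalLearner:
    """Toy sketch of the assimilation/accommodation split (illustrative only)."""

    def __init__(self, error_threshold=0.5):
        self.components = []       # shared pool of reusable components
        self.structures = {}       # task_id -> index of the chosen component
        self.error_threshold = error_threshold

    def _error(self, component, data):
        # mean squared error of a candidate component on (x, y) pairs
        return sum((component(x) - y) ** 2 for x, y in data) / len(data)

    def assimilate(self, task_id, data):
        """Stage 1: solve the new task with EXISTING components only.

        The component pool stays frozen; only the task-specific structure
        (here, which component to use) is chosen.
        """
        if not self.components:
            return None, float("inf")
        errors = [self._error(c, data) for c in self.components]
        best = min(range(len(errors)), key=errors.__getitem__)
        self.structures[task_id] = best
        return best, errors[best]

    def accommodate(self, task_id, data, candidate):
        """Stage 2: adapt the shared pool -- here only by ADDING a new
        component when no existing one fits the new task well enough."""
        _, best_err = self.assimilate(task_id, data)
        if best_err > self.error_threshold:
            self.components.append(candidate)
            self.structures[task_id] = len(self.components) - 1


learner = CompositionalLearner()
learner.accommodate("t1", [(1, 1), (2, 4), (3, 9)], candidate=lambda x: x * x)
learner.accommodate("t2", [(1, 2), (2, 4), (3, 6)], candidate=lambda x: 2 * x)
# A third, structurally related task is solved by reusing an existing component:
idx, err = learner.assimilate("t3", [(4, 16), (5, 25)])  # idx == 0, err == 0.0
```

Real instantiations search over much richer structures (e.g., soft orderings of neural modules) and also refine existing components during accommodation while guarding against forgetting; the hard error threshold here merely stands in for a criterion for spawning new components.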
As examples of the flexibility of our framework, it can incorporate naïve fine-tuning, experience replay, and elastic weight consolidation (Kirkpatrick et al., 2017) as knowledge retention mechanisms, and linear combinations of linear models (Kumar & Daumé III, 2012; Ruvolo & Eaton, 2013), soft layer ordering (Meyerson & Miikkulainen, 2018), and a soft version of gating networks (Kirsch et al., 2018; Rosenbaum et al., 2018) as the compositional structures. We instantiate our framework with the nine combinations of these examples and evaluate it on eight different data sets, consistently showing that separating the lifelong learning process into two stages increases the capabilities of the learning system, reducing catastrophic forgetting and achieving higher overall performance. Qualitatively, we show that the components learned by an algorithm that adheres to our framework correspond to self-contained, reusable functions.

2. RELATED WORK

Lifelong learning. In continual or lifelong learning, agents must handle a variety of tasks over their lifetimes, and should accumulate knowledge in a way that enables them to learn new problems more efficiently. Recent efforts have mainly focused on avoiding catastrophic forgetting. At a high level, these algorithms designate parts of parametric models (e.g., deep neural networks) to be shared across tasks. As the agent encounters tasks sequentially, it strives to retain the knowledge that enabled it to solve earlier tasks. One common approach imposes regularization to prevent parameters from deviating in directions that are harmful to performance on the earlier tasks (Kirkpatrick et al., 2017; Zenke et al., 2017; Li & Hoiem, 2017; Ritter et al., 2018). Another approach retains a small buffer of data from all tasks and continually updates the model parameters using data from this buffer, thereby maintaining the knowledge required to solve all tasks (Lopez-Paz & Ranzato, 2017; Nguyen et al., 2018; Isele & Cosgun, 2018).
A related technique learns a generative model to "hallucinate" data, reducing the memory footprint at the cost of lower-quality data and costlier data access (Shin et al., 2017; Achille et al., 2018; Rao et al., 2019). These approaches, although effective at avoiding catastrophic forgetting, make no substantial effort toward the discovery of reusable knowledge. One could argue that the model parameters are learned in such a way that they are reusable across all tasks. However, it is unclear what the reusability of these parameters means, and moreover the way in which parameters are reused is hard-coded into the architecture design. This latter issue is a major drawback when attempting to learn tasks with a high degree of variability, since the exact form in which tasks are related is often unknown. Ideally, the algorithm would determine these relations autonomously. Other methods learn a set of models that are reusable across many tasks and automatically select how to reuse them (Ruvolo & Eaton, 2013; Nagabandi et al., 2019). However, such methods selectively reuse entire models, enabling knowledge reuse, but not explicitly in a compositional manner.

Compositional knowledge. A largely separate line of work has explored the learning of compositional knowledge. The majority of such methods either learn the structure for piecing together a given set of components (Cai et al., 2017; Xu et al., 2018; Bunel et al., 2018) or learn the set of components given a known structure for composing them (Bošnjak et al., 2017). A more interesting case is when neither the structure nor the set of components is given, and the agent must autonomously discover the compositional structure underlying a set of tasks.
Some approaches to this problem assume access to a solution descriptor (e.g., in natural language), which the agent can map to a solution structure (Hu et al., 2017; Johnson et al., 2017; Pahuja et al., 2019). However, many agents (e.g., service robots) are expected to learn in more autonomous settings, where this kind of supervision is not available. Other approaches instead learn the structure directly by optimizing a cost function (Rosenbaum et al., 2018; Kirsch et al., 2018; Meyerson & Miikkulainen, 2018; Alet et al., 2018; Chang et al., 2019). Many of these works can be viewed as instances of neural architecture search, a closely related area (Elsken et al., 2019). Note, however, that the approaches above assume the agent has access to a large batch of tasks, enabling it to evaluate numerous combinations of components and structures on all tasks simultaneously. More realistically, the agent faces a sequence of tasks in a lifelong learning fashion. Most work in this line assumes that each component can be fully learned by training on a single task, and then reused for other tasks (Reed & de Freitas, 2016; Fernando et al., 2017; Valkov et al., 2018). Unfortunately, this is infeasible in many real-world scenarios in which the agent has access to little data for each task. One notable exception, proposed by Gaunt et al. (2017), improves early components with experience on new tasks, but is limited to very simplistic settings.

Unlike prior work, our approach explicitly learns compositional structures in a lifelong learning setting. We do not assume access to a large batch of tasks or the ability to learn definitive components after training on a single task. Instead, we train on a small initial batch of tasks (four tasks, in our experiments), and then autonomously update the existing components to accommodate new tasks. Our framework also permits incorporating new components over time.
Related work has increased network capacity in the non-compositional setting (Yoon et al., 2018) or in a compositional setting where previously learned parameters are kept fixed (Li et al., 2019). Another method enables adaptation of existing parameters (Rajasegaran et al., 2019), but requires expensively storing and training multiple models for each task in order to select the best one before adapting the existing parameters, and is designed for a specific choice of architecture, unlike our general framework.
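As a concrete illustration of the regularization-based retention mechanisms discussed above, elastic weight consolidation (Kirkpatrick et al., 2017) adds a quadratic penalty that discourages parameters from drifting away from values that solved earlier tasks, weighted by an estimate of each parameter's importance. A minimal sketch using the common diagonal Fisher approximation (the function name and interface are ours, not from any of the cited works):

```python
import numpy as np

def ewc_penalty(params, anchor_params, fisher_diag, lam=1.0):
    """EWC-style penalty: each coordinate is pulled toward its value after
    earlier tasks (anchor_params), weighted by its estimated importance
    (diagonal of the Fisher information) and a strength hyperparameter lam."""
    return 0.5 * lam * float(np.sum(fisher_diag * (params - anchor_params) ** 2))

# During training on a new task, the penalty is added to the task loss:
#   total_loss = task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)
theta_old = np.array([1.0, 2.0])
fisher = np.array([0.5, 2.0])
no_drift = ewc_penalty(np.array([1.0, 2.0]), theta_old, fisher)      # 0.0
drifted = ewc_penalty(np.array([2.0, 2.0]), theta_old, fisher, 2.0)  # 0.5
```

In the framework's instantiations, such a penalty would be applied during the accommodation stage, so that updating shared components for a new task does not erase what they contribute to earlier tasks.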

