SELF-ACTIVATING NEURAL ENSEMBLES FOR CONTINUAL REINFORCEMENT LEARNING

Abstract

The ability of an agent to continuously learn new skills without catastrophically forgetting existing knowledge is of critical importance for the development of generally intelligent agents. Most methods devised to address this problem depend heavily on well-defined task boundaries, which simplify the problem considerably. Our task-agnostic method, Self-Activating Neural Ensembles (SANE), uses a hierarchical modular architecture designed to avoid catastrophic forgetting without making any such assumptions. At each timestep a path through the SANE tree is activated; during training only activated nodes are updated, ensuring that unused nodes do not undergo catastrophic forgetting. Additionally, new nodes are created as needed, allowing the system to leverage and retain old skills while growing and learning new ones. We demonstrate our approach on MNIST and a set of grid world environments, showing that SANE does not undergo catastrophic forgetting where existing methods do.

1. INTRODUCTION

Lifelong learning is of critical importance for the field of robotics; an agent that interacts with the world should be able to continuously learn from it, acting intelligently in a wide variety of situations. In marked contrast to this ideal, most standard deep reinforcement learning methods are centered around a single task. First, a task is defined; then a policy is learned to maximize the rewards the agent receives in that setting. If the task is changed, a completely new model is learned, throwing away the previous model and previous interactions. Task specification therefore plays a central role in current end-to-end deep reinforcement learning frameworks. But is task-driven learning scalable? Humans do not require concrete task boundaries to learn separate tasks effectively; instead, we perform continual (lifelong) learning. The same model is used to learn new skills, leveraging the lessons of previous skills to learn more efficiently, without forgetting old behaviors. When placed into continual learning settings, however, current deep reinforcement learning approaches do neither: their transfer properties are negligible and they suffer from catastrophic forgetting (McCloskey & Cohen, 1989; French, 2006). The core issue of catastrophic forgetting is that a neural network trained on one task starts to forget what it knows when trained on a second task, and this issue is only exacerbated as more tasks are added. The problem ultimately stems from training one network end-to-end sequentially; the shared nature of the weights and the backpropagation used to update them mean that later tasks overwrite earlier ones (McCloskey & Cohen, 1989; Ratcliff, 1990).
To handle this, past approaches have attempted a wide variety of ideas: from task-based regularization (Kirkpatrick et al., 2017), to learning different sub-modules for different tasks (Rusu et al., 2016), to dual-system slow/fast learners inspired by the human hippocampus (Schwarz et al., 2018). The core problem of continual learning, which none of these methods address, is that the agent needs to autonomously determine how and when to adapt to changing environments, as it is infeasible for a human to indefinitely provide an agent with task-boundary supervision. Specifically, these approaches rely on the notion of tasks to identify when to spawn new sub-modules, when to freeze weights, when to save parameters, and so on. Leaning on task boundaries is unscalable and side-steps the core problem. A few task-free methods do exist. Some address the problem by utilizing fixed subdivisions of the input space (Aljundi et al., 2018; Veness et al., 2019), which we believe limits their flexibility. Recent experience-based approaches such as Rolnick et al. (2018) forgo explicit task IDs, but to do so they must maintain a large replay buffer, which does not scale to an ever-increasing number of tasks. To our knowledge only one other continual learning method, Neural Dirichlet Process Mixture Models (Lee et al., 2020), adaptively creates new clusters, and we show in our experiments that SANE outperforms it. Our task-agnostic hierarchical method, Self-Activating Neural Ensembles (SANE), depicted in Figure 1, approaches the problem head-on by dynamically adapting to changing environments. Every node in the tree is a separate module, none of which is task-specific. At each timestep a single path through the tree is activated to determine the action to take. Only activated nodes are updated, leaving unused modules unchanged and therefore immune to catastrophic forgetting.
Critically, our tree is dynamic; new nodes are created when existing nodes are found to be insufficient. This allows new paths through the tree to be created for novel scenarios, preventing destructive parameter updates to the other nodes. SANE provides the following desirable properties for continual reinforcement learning: (a) it mitigates catastrophic forgetting by only updating relevant modules; (b) because of its task-agnostic nature, unlike previous approaches, it does not require explicit supervision with task IDs; (c) it achieves these goals with bounded resources and computation. In our results, we demonstrate the ability of SANE to learn and retain MNIST digits when they are presented fully sequentially, a challenging setting in which our baselines struggle. We also demonstrate SANE on a series of three grid world tasks, showing that it works in the reinforcement learning setting.

2. MODEL

Figure 1: The overall tree structure of the SANE system, here with 2 layers of nodes. Each node contains an actor and a critic.

The structure of our Self-Activating Neural Ensembles (SANE) system is a tree, as shown in Figure 1. Each node in the tree contains both an actor and a critic, described in Section 2.1. At each timestep we activate a path through the tree from the root to a leaf, at each layer selecting the node whose critic estimates the highest activation score (described in Section 2.2). We then sample an action from that leaf node's policy to execute in the environment. During training, only the activated nodes are updated; all other nodes are left unchanged, so behaviors learned along other paths through the tree are preserved.

The tree structure is critical to the success of the system: a node's children can be seen as partitioning the input space that activates the node, specializing their policies to these subspaces. When children differ sufficiently from their parent in performance, they are promoted to a higher level of the tree and obtain children of their own, further partitioning the space. Nodes are constantly being created, promoted to higher levels, or merged as the situation demands (described in Section 2.4).

Static vs. Dynamic Trees: One critical question is why we use a dynamic structure rather than a huge static tree with parameter updates alone. In the static case, if a path through the network is beneficial for one task, it is likely to be chosen as a starting point for the subsequent task and updated in place for the new setting. Examples of this can be found in Jacobs et al. (1991); Shazeer et al. (2017), where specialized losses are necessary to distribute activation across the ensemble of experts, even in the case of a stationary distribution. With SANE, the change of environment would be detected and a new path created, thus minimizing disruption of the previously beneficial behaviors.
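The per-timestep activate-then-update loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: in SANE the actor and critic are neural networks, whereas here they are plain callables, and the `update_fn` hook and `Node` fields are assumed names introduced for this sketch.

```python
class Node:
    """One SANE module: a critic that scores activation and an actor (policy)."""
    def __init__(self, critic, actor, children=None):
        self.critic = critic            # state -> activation score
        self.actor = actor              # state -> action
        self.children = children or []

def activate_path(root, state):
    """Descend from root to a leaf, at each layer picking the child whose
    critic reports the highest activation score for the current state."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.critic(state))
        path.append(node)
    return path

def act_and_update(root, state, update_fn):
    """Sample an action from the activated leaf's policy; apply updates only
    to nodes on the activated path, leaving all other nodes untouched (and
    therefore immune to catastrophic forgetting)."""
    path = activate_path(root, state)
    action = path[-1].actor(state)
    for node in path:
        update_fn(node, state)          # training update for activated nodes only
    return action
```

In this sketch the selective update is the key property: modules off the activated path receive no gradient traffic at all, which is how unused behaviors are preserved.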

