EFFICIENT ARCHITECTURE SEARCH FOR CONTINUAL LEARNING

Abstract

Continual learning with neural networks is an important learning framework in AI that aims to learn a sequence of tasks well. However, it is often confronted with three challenges: (1) overcoming the catastrophic forgetting problem, (2) adapting the current network to new tasks, and (3) controlling the model complexity. To reach these goals, we propose a novel approach named Continual Learning with Efficient Architecture Search, or CLEAS in short. CLEAS works closely with neural architecture search (NAS), which leverages reinforcement learning techniques to search for the best neural architecture that fits a new task. In particular, we design a neuron-level NAS controller that decides which old neurons from previous tasks should be reused (knowledge transfer) and which new neurons should be added (to learn new knowledge). Such a fine-grained controller allows finding a very concise architecture that fits each new task well. Meanwhile, since we do not alter the weights of the reused neurons, we fully preserve the knowledge learned from previous tasks. We evaluate CLEAS on numerous sequential classification tasks, and the results demonstrate that CLEAS outperforms state-of-the-art alternatives, achieving higher classification accuracy while using simpler neural architectures.

1. INTRODUCTION

Continual learning, or lifelong learning, refers to the ability to continually learn new tasks while still performing well on previously learned tasks. It has attracted enormous attention in AI as it mimics a human learning process: constantly acquiring and accumulating knowledge throughout a lifetime (Parisi et al., 2019). Continual learning often works with deep neural networks (Javed & White, 2019; Nguyen et al., 2017; Xu & Zhu, 2018), as the flexibility of a network design can effectively allow knowledge transfer and knowledge acquisition. However, continual learning with neural networks usually faces three challenges. The first is to overcome the so-called catastrophic forgetting problem (Kirkpatrick et al., 2017), in which the network forgets what has been learned on previous tasks. The second is to effectively adapt the current network parameters or architecture to fit a new task, and the last is to control the network size so as not to generate an overly complex network.

In continual learning, two main categories of strategies attempt to solve the aforementioned challenges. The first category trains all tasks within a network with fixed capacity. For example, (Rebuffi et al., 2017; Lopez-Paz & Ranzato, 2017; Aljundi et al., 2018) replay some old samples together with the new task samples and then learn a new network from the combined training set. The drawback is that these methods typically require a memory system that stores past data. (Kirkpatrick et al., 2017; Liu et al., 2018) employ regularization terms to prevent the re-optimized parameters from deviating too much from the previous ones. Approaches using a fixed network architecture, however, cannot avoid a fundamental dilemma: they must either retain good model performance on learned tasks, leaving little room for learning new tasks, or compromise the learned model performance to better accommodate new tasks.
To overcome such a dilemma, the second category is to expand the neural network dynamically (Rusu et al., 2016; Yoon et al., 2018; Xu & Zhu, 2018). These methods typically fix the parameters of the old neurons (partially or fully) to eliminate the forgetting problem, and permit adding new neurons to adapt to the learning of a new task. In general, expandable networks achieve better model performance on all tasks than non-expandable ones. However, a new issue appears: expandable networks can gradually become overly large or complex, which may exceed the limits of the available computing resources and/or lead to over-fitting. In this paper, we aim to solve the continual learning problem by proposing a new approach that requires only minimal expansion of a network to achieve high model performance on both the learned tasks and the new task. At the heart of our approach, we leverage Neural Architecture Search (NAS) to find a very concise architecture to fit each new task. Most notably, we design NAS to provide neuron-level control. That is, NAS selects two types of individual neurons to compose a new architecture: (1) a subset of the previous neurons that are most useful for modeling the new task; and (2) a minimal number of new neurons that should be added. Reusing part of the previous neurons allows efficient knowledge transfer, and adding new neurons provides additional room for learning new knowledge. Our approach is named Continual Learning with Efficient Architecture Search, or CLEAS in short. Below are the main features and contributions of CLEAS.

• CLEAS dynamically expands the network to adapt to the learning of new tasks and uses NAS to determine the new network architecture;
• CLEAS achieves zero forgetting of the learned knowledge by keeping the parameters of the previous architecture unchanged;
• The NAS component of CLEAS provides neuron-level control that expands the network minimally, which leads to effective control of network complexity;
• The RNN-based controller behind CLEAS uses an entire network configuration (with all neurons) as a state. This state definition deviates from the current practice in related problems, which would define a state as an observation of a single neuron. Our state definition leads to improvements of 0.31%, 0.29% and 0.75% on three benchmark datasets;
• If the network is a convolutional network (CNN), CLEAS can also decide the best filter size to use for the new task. The optimized filter size can further improve model performance.

We start the rest of the paper by reviewing related work in Section 2. We then detail the CLEAS design in Section 3. Experimental evaluations and results are presented in Section 4.
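To make the neuron-level control concrete, the following is a minimal sketch (not the actual CLEAS implementation; `expand_layer` and all shapes are illustrative) of how a single dense layer for a new task could be composed from frozen, reused old neurons and a small number of freshly added trainable ones:

```python
import numpy as np

def expand_layer(W_old, reuse_mask, n_new, rng):
    """Compose a layer for a new task from frozen old neurons and fresh ones.

    W_old:      (d_in, n_old) weights learned on previous tasks (kept frozen).
    reuse_mask: boolean vector of length n_old, chosen by the NAS controller;
                True marks an old neuron reused for the new task.
    n_new:      number of newly added neurons (the only trainable part).
    """
    d_in = W_old.shape[0]
    W_reused = W_old[:, reuse_mask]                     # frozen: never altered
    W_new = 0.01 * rng.standard_normal((d_in, n_new))   # trainable, small init
    return np.concatenate([W_reused, W_new], axis=1)

rng = np.random.default_rng(0)
W_old = rng.standard_normal((4, 6))                     # 6 neurons from old tasks
mask = np.array([True, False, True, True, False, False])
W_task = expand_layer(W_old, mask, n_new=2, rng=rng)
print(W_task.shape)  # (4, 5): 3 reused + 2 new neurons
```

Because the reused columns of `W_old` are copied unchanged and only the new columns are trained, the previous tasks' predictions (which use the original `W_old`) are untouched, which is the zero-forgetting property described above.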

2. RELATED WORK

Continual Learning. Continual learning is often considered an online learning paradigm in which new skills or knowledge are constantly acquired and accumulated. Recently, remarkable advances have been made in many applications based on continual learning: sequential task processing (Thrun, 1995), streaming data processing (Aljundi et al., 2019), self-management of resources (Parisi et al., 2019; Diethe et al., 2019), etc. A primary obstacle in continual learning, however, is the catastrophic forgetting problem, and many previous works have attempted to alleviate it. We divide them into two categories depending on whether their networks are expandable.

The first category uses a large network with fixed capacity. These methods try to retain the learned knowledge by either replaying old samples (Rebuffi et al., 2017; Rolnick et al., 2019; Robins, 1995) or constraining the learning with regularization terms (Kirkpatrick et al., 2017; Lopez-Paz & Ranzato, 2017; Liu et al., 2018; Zhang et al., 2020). Sample replaying typically requires a memory system that stores old data; when learning a new task, part of the old samples are selected and added to the training data. As for regularized learning, a representative approach is Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), which uses the Fisher information matrix to regularize the optimization so that the weights important for previous tasks are not altered too much. Other methods (Lopez-Paz & Ranzato, 2017; Liu et al., 2018; Zhang et al., 2020) also constrain the optimization direction of the weights to prevent the network from forgetting previously learned knowledge. The major limitation of fixed networks is that they cannot properly balance the learned tasks and new tasks, resulting in either forgetting old knowledge or acquiring limited new knowledge.
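As a concrete illustration of the EWC idea, here is a minimal sketch of its diagonal-Fisher quadratic penalty; the regularization strength `lam` and the toy numbers are illustrative, not values from any of the cited papers:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta:      current parameters while training on the new task
    theta_star: parameters learned on the previous task (the anchor point)
    fisher:     diagonal Fisher information, one importance weight per parameter
    """
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

theta_star = np.array([1.0, -2.0, 0.5])
fisher = np.array([10.0, 0.1, 1.0])   # first weight matters most for old tasks
theta = np.array([1.5, -2.0, 0.5])    # only the important weight has moved
print(ewc_penalty(theta, theta_star, fisher))  # 1.25
```

The total loss for the new task is then the task loss plus this penalty, so moving a high-Fisher (important) weight is penalized far more than moving a low-Fisher one, which is how EWC trades off plasticity against forgetting within a fixed network.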
To address the above issue, another stream of works proposes to dynamically expand the network, providing more room for obtaining new knowledge. For example, Progressive Neural Networks (PGN) (Rusu et al., 2016) allocate a fixed number of neurons and layers to the current model for each new task. As a result, PGN may end up generating an overly complex network with high redundancy, which can easily exhaust the limited resources of the underlying computing system.
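A back-of-the-envelope sketch shows why fixed per-task expansion can blow up. Assume a one-hidden-layer progressive network in which each task's column laterally reads the hidden units of all earlier columns (a simplification of PGN's actual lateral adapters; the widths below are illustrative): the hidden-layer parameter count then grows quadratically in the number of tasks.

```python
def pgn_params(n_tasks, width, d_in):
    """Hidden-layer parameter count of a toy one-hidden-layer progressive net.

    Each task adds a fixed-width column; column t also receives lateral
    inputs from the hidden units of all t earlier columns, so the per-column
    cost grows linearly and the total grows quadratically in n_tasks.
    """
    total = 0
    for t in range(n_tasks):
        lateral = t * width                 # hidden units of earlier columns
        total += (d_in + lateral) * width   # weights into column t
    return total

for n_tasks in (1, 2, 4, 8):
    print(n_tasks, pgn_params(n_tasks, width=100, d_in=784))
# parameter count more than decuples from 1 task to 8
```

This redundancy, i.e. every new column paying for connections to all earlier ones regardless of whether those neurons are useful, is exactly what a neuron-level selection mechanism is meant to avoid.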

