CONTINUAL LEARNING WITH NEURAL ACTIVATION IMPORTANCE

Abstract

Continual learning is a concept of online learning across multiple sequential tasks. One of the critical barriers of continual learning is that a network should learn a new task while keeping the knowledge of old tasks, without access to any data of the old tasks. In this paper, we propose a neuron-importance-based regularization method for stable continual learning. We also propose a comprehensive experimental evaluation framework on existing benchmark data sets to evaluate not only the accuracy of continual learning under a certain task order but also the robustness of that accuracy to changes in the order of tasks.

1. INTRODUCTION

Continual learning, or sequential learning, is a concept of online learning over multiple sequential tasks. The aim of continual learning is to learn a set of related tasks that are observed irregularly or separately online. Therefore, each task does not necessarily contain classes that overlap with other tasks and, in the worst case, different tasks consist of mutually disjoint classes. Consequently, one of the main challenges in continual learning is training on new tasks with new classes without catastrophically forgetting the existing knowledge of prior tasks (and their classes). A model must adapt to a new task without access to some or all classes of past tasks while maintaining the knowledge acquired from those past tasks (Thrun, 1996). Since the training of a neural network is influenced more by recently and frequently observed data, a machine learning model forgets what it has learned in prior tasks if it cannot continue to access them in the current task. On the other hand, rigorous methods that maintain the knowledge of entire previous tasks are impractical for adapting to new tasks. Thus, researchers have developed diverse methods to achieve both stability (remembering past tasks) and plasticity (adapting to new tasks) in continual learning. Previous continual learning methods fall into three major types: 1) architectural approaches, which modify the architecture of neural networks (Yoon et al., 2017; Sharif Razavian et al., 2014; Rusu et al., 2016); 2) rehearsal approaches, which use sampled data from previous tasks (Riemer et al., 2018; Aljundi et al., 2019; Gupta et al., 2020); and 3) regularization approaches, which freeze significant weights of a model by calculating the importance of weights or neurons (Li & Hoiem, 2017; Kirkpatrick et al., 2017; Zenke et al., 2017; Nguyen et al., 2017; Aljundi et al., 2018a; b; Zeno et al., 2018; Ahn et al., 2019; Javed & White, 2019; Jung et al., 2020).
Most recent regularization methods have tackled the problem in a more fundamental way, with regularization approaches that fully exploit the weights of a given network. The basic idea of regularization approaches is to constrain the essential weights of old tasks so that they do not change. In general, they alleviate catastrophic interference by imposing a penalty on the difference between the weights of past tasks and the current task. The degree of the penalty is determined by the importance of the weights or neurons under each method's respective measure. The significance of a weight reflects how important that weight is for solving a certain task. EWC (Kirkpatrick et al., 2017) introduces elastic weight consolidation, which estimates parameter importance using the diagonal of the Fisher information matrix, equivalent to the second derivative of the loss. However, EWC computes the weights' importance only after network training. SI (Zenke et al., 2017) measures the importance of weights in an online manner by calculating each parameter's sensitivity to the loss change while training a task. To be specific, when a certain parameter changes slightly during training batches but its contribution to the loss is high (i.e., rapid change of its gradient), the parameter is considered to be crucial and is restricted from being updated while learning future tasks. However, the accuracy of their method shows limited stability even with the same order of tasks of
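As a minimal sketch of the shared structure of these regularization approaches (not the implementation of any specific published method), the penalty takes the form L(θ) = L_new(θ) + (λ/2) Σ_i Ω_i (θ_i − θ*_i)², where θ* are the parameters after the previous task and Ω_i is a per-parameter importance estimate (e.g., the diagonal Fisher information in EWC, or a path-integral measure in SI). The function and variable names below are illustrative assumptions:

```python
import numpy as np

def importance_penalty(theta, theta_old, omega, lam=1.0):
    """Quadratic surrogate penalty discouraging changes to parameters
    deemed important for previous tasks.

    theta     : current parameters (flat array)
    theta_old : parameters saved after training the previous task
    omega     : per-parameter importance weights (method-dependent)
    lam       : regularization strength (hyperparameter)
    """
    return lam / 2.0 * np.sum(omega * (theta - theta_old) ** 2)

# Illustrative values: the penalty grows only where omega is large.
theta_old = np.array([1.0, -0.5, 2.0])
omega     = np.array([10.0, 0.0, 1.0])   # parameter 0 is "important"
theta     = np.array([1.2, 3.0, 2.0])    # drifted on parameters 0 and 1

# Only the small drift on the important parameter (index 0) is
# penalized; the large drift on parameter 1 (omega = 0) is free.
print(importance_penalty(theta, theta_old, omega, lam=1.0))  # ≈ 0.2
```

During training on a new task, this term is simply added to the new task's loss; the per-method difference lies entirely in how `omega` is computed (after training in EWC, online during training in SI).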

