CONTINUAL LEARNING WITH NEURAL ACTIVATION IMPORTANCE

Abstract

Continual learning is a form of online learning over multiple sequential tasks. One of the critical barriers in continual learning is that a network should learn a new task while keeping the knowledge of old tasks, without access to any data of those old tasks. In this paper, we propose a neuron importance based regularization method for stable continual learning. We also propose a comprehensive experimental evaluation framework on existing benchmark data sets that evaluates not only the accuracy under a certain fixed order of tasks but also the robustness of the accuracy to changes in the order of tasks.

1. INTRODUCTION

Continual learning, or sequential learning, is online learning over multiple sequential tasks. The aim of continual learning is to learn a set of related tasks that are observed irregularly or separately online. Each task therefore does not necessarily share classes with the others and, in the worst case, different tasks consist of mutually disjoint classes. Hence, one of the main challenges in continual learning is training on new tasks with new classes without catastrophically forgetting the existing knowledge of prior tasks (and their classes). A model must adapt to a new task without access to some or all classes of past tasks while maintaining the knowledge acquired from those tasks (Thrun, 1996). Since the training of a neural network is influenced more by recently and frequently observed data, a machine learning model forgets what it has learned in prior tasks unless it keeps access to them in the current task. On the other hand, rigorous methods that maintain the knowledge of all previous tasks are impractical for adapting to new tasks. Thus, researchers have developed diverse methods to achieve both stability (remembering past tasks) and plasticity (adapting to new tasks) in continual learning. Previous continual learning methods fall into three major types: 1) architectural approaches that modify the architecture of neural networks (Yoon et al., 2017; Sharif Razavian et al., 2014; Rusu et al., 2016), 2) rehearsal approaches that use sampled data from previous tasks (Riemer et al., 2018; Aljundi et al., 2019; Gupta et al., 2020), and 3) regularization approaches that freeze significant weights of a model by calculating the importance of weights or neurons (Li & Hoiem, 2017; Kirkpatrick et al., 2017; Zenke et al., 2017; Nguyen et al., 2017; Aljundi et al., 2018a;b; Zeno et al., 2018; Ahn et al., 2019; Javed & White, 2019; Jung et al., 2020).
Most recent regularization methods have tackled the problem in a more fundamental way by utilizing the weights of the given network. The basic idea of regularization approaches is to constrain the essential weights of old tasks not to change. In general, they alleviate catastrophic interference by imposing a penalty on the difference between the weights of the past tasks and those of the current task. The degree of the penalty follows the importance of the weights or neurons under each method's respective measurement; the significance of a weight stands for how important that weight is in solving a certain task. EWC (Kirkpatrick et al., 2017) introduces elastic weight consolidation, which estimates parameter importance using the diagonal of the Fisher information matrix, equivalent to the second derivative of the loss. However, it computes the importance of the weights only after network training. SI (Zenke et al., 2017) measures the importance of weights in an online manner by calculating each parameter's sensitivity to the loss change while training a task. To be specific, when a certain parameter changes only slightly during training batches but its contribution to the loss is high (i.e., its gradient changes rapidly), the parameter is considered crucial and is restricted from being updated while learning future tasks. However, the accuracy of their method shows limited stability even with the same order of tasks of the test data. Unlike Zenke et al. (2017), MAS (Aljundi et al., 2018a) assesses the contribution of each weight to the change of the learned function. In other words, it considers the gradient of the outputs of a model with a mean squared error loss; the gradient itself represents the change of the outputs with respect to the weights. The strength of the method lies in its scheme of data utilization: since it considers only the degree of change of the output values of the network, any data (even unlabeled data) is valid for computing the gradient of the learned function with respect to the weights. VCL (Nguyen et al., 2017) is a Bayesian neural network based method that decides weight importance through variational inference. BGD (Zeno et al., 2018) is another Bayesian neural network based approach; it finds posterior parameters (e.g., mean and variance) assuming that the posterior and the prior distributions are Gaussian. These methods (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018a; Nguyen et al., 2017; Zeno et al., 2018) calculate and assign importance to weights directly, as described in Figure 1a. In order to alleviate interference across multiple tasks, weight importance based approaches such as EWC, SI, and MAS assign an importance Ω_k to each weight ω_k in the network.

However, in the case of convolutional neural networks, weights in the same convolutional filter map should have the same importance. Furthermore, since those methods basically consider the amount of change of each weight, it is impossible to reinitialize the weights at each training of a new task, which decreases the plasticity of the network (additional explanation of weight reinitialization is given in Section 2.2). UCL (Ahn et al., 2019) proposes a Bayesian neural network based method to mitigate catastrophic forgetting by incorporating a weight uncertainty measurement that indicates the variance of the weight distribution. They claim that the distribution of the weights essential for past tasks has low variance, and such weights that stay stable while training a task are regarded as important weights not to forget. As illustrated in Figure 1b, they suggest a node based importance in the neural network: first, the smallest variance among the weights incoming to and outgoing from a corresponding neuron decides the importance of that neuron, and then all those weights take the same neuron importance as their weight importance. However, the method of Ahn et al. (2019) is applicable only to a Bayesian neural network and is highly dependent upon hyper-parameters. Furthermore, it is computationally expensive to train compared to Zenke et al. (2017) and our proposed method. Jung et al. (2020) is another recent algorithm based on neuron importance, whose node importance depends on the average activation value. This idea is simple but powerful: the activation value itself is a measurement of neuron importance, and the weights connected to the neuron get identical weight importance. This corresponds to the type in Figure 1c.

One of our key observations in prior experimental evaluations is that the accuracy of each task changes significantly when we change the order of tasks, as is also claimed and discussed in (Yoon et al., 2019), which proposes an order robust architectural approach to continual learning. Evaluating with a fixed task order does not coincide with the fundamental aim of continual learning, where no dedicated order of tasks is given. Figure 2 shows example test results of state-of-the-art continual learning methods compared to our proposed method. The classification accuracy of prior methods fluctuates as the order of tasks changes (from Figure 2a to Figure 2b). Based on this observation, we propose to evaluate the robustness to the order of tasks in a comprehensive manner, in which we report the average and standard deviation of classification accuracy over multiple sets of randomly shuffled orders.
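The quadratic importance penalty shared by the weight-importance methods above (EWC, SI, and MAS differ mainly in how the importance Ω_k is measured) can be sketched as follows; the function name and dictionary-based parameter layout are illustrative and not taken from any of the cited implementations:

```python
import numpy as np

def importance_penalty(params, old_params, importance, lam=1.0):
    """Quadratic surrogate penalty of weight-importance regularization:
    lam * sum_k Omega_k * (theta_k - theta_k_old)^2.

    params, old_params, importance: dicts mapping a parameter name to an
    np.ndarray, where importance[k] holds Omega for each weight in params[k].
    """
    return lam * sum(
        float(np.sum(importance[k] * (params[k] - old_params[k]) ** 2))
        for k in params
    )
```

During training on a new task, this term is added to the task loss so that weights with large Ω_k are held close to their values from past tasks, while unimportant weights stay free to adapt.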

Figure 1: Three different types of measurement-importance assignment. (a) EM-EI (Edge Measurement, Edge Importance): weight importance is calculated based on a weight measurement. (b) EM-NI (Edge Measurement, Node Importance): neuron importance is calculated based on a weight measurement, and weight importance is redefined as the importance of its connected neuron. (c) NM-NI (Node Measurement, Node Importance): neuron importance is calculated based on a neuron measurement, and weight importance is defined as the importance of its connected neuron.
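As a concrete illustration of the NM-NI type in Figure 1c, a neuron's importance can be taken as its average absolute activation over a batch, and every weight connected to that neuron then inherits the neuron's importance. This is a minimal sketch under that assumption; the exact activation statistic used by activation-based methods such as Jung et al. (2020) may differ:

```python
import numpy as np

def neuron_importance(activations):
    """Node measurement: mean absolute activation of each neuron over a batch.
    activations: (batch, n_neurons) array of post-activation values."""
    return np.mean(np.abs(activations), axis=0)

def weight_importance_from_neurons(node_imp, fan_in):
    """NM-NI assignment: every incoming weight of neuron j inherits the
    importance of neuron j, so each column of the (fan_in, n_neurons)
    weight-importance matrix shares a single value."""
    return np.tile(node_imp, (fan_in, 1))
```

Because the importance is attached to the neuron rather than to individual edges, all weights of the same filter map in a convolutional layer naturally receive the same importance, which addresses the concern raised above for edge-measurement methods.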


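The order-robustness evaluation proposed above, which reports the average and standard deviation of accuracy over randomly shuffled task orders, can be sketched as follows; `train_and_eval` is a hypothetical callback standing in for a full continual-learning run:

```python
import random
import statistics

def order_robustness(tasks, train_and_eval, n_orders=5, seed=0):
    """Run a continual-learning pipeline under several random task orders
    and report (mean, std) of the final per-task accuracy.

    train_and_eval(order) is a hypothetical callback: it trains sequentially
    on `order` and returns a dict mapping each task to its final accuracy."""
    rng = random.Random(seed)
    per_task = {t: [] for t in tasks}
    for _ in range(n_orders):
        order = list(tasks)
        rng.shuffle(order)
        accs = train_and_eval(order)
        for t in tasks:
            per_task[t].append(accs[t])
    return {t: (statistics.mean(v), statistics.pstdev(v))
            for t, v in per_task.items()}
```

Under this protocol, a method is order robust when the per-task standard deviations stay small, rather than merely achieving high accuracy for one fixed order.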