UNIFYING REGULARISATION METHODS FOR CONTINUAL LEARNING

Abstract

Continual Learning addresses the challenge of learning a number of different distributions sequentially. The goal of maintaining knowledge of earlier distributions without re-accessing them conflicts starkly with standard SGD training of artificial neural networks. An influential family of methods to tackle this are the so-called regularisation approaches. They measure the importance of each parameter for modelling a given distribution and subsequently protect important parameters from large changes. In the literature, three ways to measure parameter importance have been put forward, and they have inspired a large body of follow-up work. Here, we present strong theoretical and empirical evidence that these three methods, Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI) and Memory Aware Synapses (MAS), are all related to the Fisher Information. Only EWC intentionally relies on the Fisher, while the other two methods stem from rather different motivations. We find that for SI the relation to the Fisher, and in fact its performance, is due to a previously unknown bias. Altogether, this unifies a body of regularisation approaches. It provides better explanations for the effectiveness of SI- and MAS-based algorithms and offers more justified versions of these algorithms. From a practical viewpoint, our insights uncover conditions needed for different algorithms to work, allow predicting their relative performance in some situations, and suggest improvements.

1. INTRODUCTION

Despite considerable progress, many gaps between biological and machine intelligence remain. Animals, for example, flexibly learn new tasks throughout their lives, while at the same time maintaining robust memories of previous knowledge. This ability conflicts with traditional training procedures for artificial neural networks, which overwrite previous skills when optimizing new tasks (McCloskey & Cohen, 1989; Goodfellow et al., 2013). The field of continual learning is dedicated to mitigating this crucial shortcoming of machine learning. It exposes neural networks to a sequence of distinct tasks. While training on a new task, the algorithm is not allowed to revisit old data, but should retain previous skills while remaining flexible enough to also acquire new knowledge. An influential line of work to approach this challenge was introduced by Kirkpatrick et al. (2017), who proposed the first regularisation algorithm, Elastic Weight Consolidation (EWC). After training a task, EWC measures the importance of each parameter and introduces an auxiliary loss penalising large changes in important parameters. Naturally, this raises the question of how to measure this 'importance'. While EWC uses the diagonal Fisher Information, two alternatives have been proposed: Synaptic Intelligence (SI, Zenke et al. (2017)) aims to attribute the decrease in loss during training to individual parameters, and Memory Aware Synapses (MAS, Aljundi et al. (2018)) introduces a heuristic measure of output sensitivity. Together, these three approaches have inspired many further regularisation-based approaches, including combinations of them (Chaudhry et al., 2018), refinements (Huszár, 2018; Ritter et al., 2018; Chaudhry et al., 2018; Yin et al., 2020), extensions (Schwarz et al., 2018; Liu et al., 2018; Park et al., 2019; Lee et al., 2020) and applications in different continual learning settings (Aljundi et al., 2019) as well as different domains of machine learning (Lan et al., 2019).
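To make the EWC-style penalty concrete, here is a minimal sketch (our own illustration: the variable names, the strength `lam` and the empirical diagonal-Fisher estimate are illustrative choices, not the paper's exact recipe):

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: average squared per-sample gradient
    of the log-likelihood, giving one importance value per parameter."""
    return np.mean(per_sample_grads ** 2, axis=0)

def ewc_loss(task_loss, theta, theta_star, fisher_diag, lam=1.0):
    """New-task loss plus a quadratic penalty that discourages moving
    parameters that were important (high Fisher) for the previous task."""
    penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)
    return task_loss + penalty
```

Here `theta_star` is the parameter vector stored after finishing the previous task; the penalty vanishes as long as the network has not moved away from it.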
Almost every new continual learning method compares to at least one of EWC, SI and MAS. Despite their popularity and influence, basic practical and theoretical questions regarding these algorithms have remained unanswered. Notably, it was unknown how similar these importance measures are. Additionally, for SI and MAS (and their follow-ups) there was no solid theoretical understanding of their effectiveness. Here, we close both these gaps through a theoretical analysis confirmed by a series of carefully designed experiments on standard continual learning benchmarks. Our main findings can be summarised as follows: (a) We show that MAS is almost identical to the 'Absolute Fisher', a quantity similar to the Fisher Information. (b) We show that SI's importance approximation is biased, that the bias is responsible for SI's performance, and that the bias, like MAS, is tightly linked to the Absolute Fisher. (c) Our fine-grained analysis leads to new baselines, including more justified versions of MAS and SI. (d) Together, (a) and (b) show that all three regularisation approaches (and their follow-ups) are closely linked to the Fisher Information. This unifies a large body of regularisation literature. It also gives a more solid theoretical justification for the effectiveness of SI- and MAS-based algorithms. (e) Our precise understanding of SI allows predicting and improving its performance in different situations and offers a better performing alternative.

2. RELATED WORK

The problem of catastrophic forgetting in neural networks has been studied for many decades (McCloskey & Cohen, 1989; Ratcliff, 1990; French, 1999). In the context of deep learning, it has received renewed attention (Goodfellow et al., 2013; Srivastava et al., 2013). All previous work we are aware of proposes new or modified algorithms to tackle continual learning. No attention has been directed towards understanding or unifying existing methods, and we hope that our work will not remain the only effort of this kind. We now review the broad body of continual learning algorithms. Following Parisi et al. (2019), they are often categorised into regularisation, replay and architectural approaches. Most regularisation methods will be reviewed more closely in the next section. There we cover regularisation for standard neural nets, but the same idea has also been applied to Bayesian neural networks (Nguyen et al., 2017; Swaroop et al., 2019). An alternative approach (Mirzadeh et al., 2020) does not penalize deviating from important parameters, but rather modifies training so that the network is robust to such deviations. Replay methods refer to algorithms which either store a small sample or generate data of old distributions and use this data while training on new tasks (Rebuffi et al., 2017; Lopez-Paz & Ranzato, 2017; Shin et al., 2017; Kemker & Kanan, 2017). These approaches can be seen as investigating how far standard i.i.d. training can be relaxed towards the (highly non-i.i.d.) continual learning setting without losing too much performance. They are interesting, but usually circumvent the original motivation of continual learning to maintain knowledge without accessing old distributions. Intriguingly, the most effective way to use old data appears to be simply replaying it, i.e. mimicking training with i.i.d. batches sampled from all tasks simultaneously (Chaudhry et al., 2019).
Architectural methods extend the network as new tasks arrive (Fernando et al., 2017; Li et al., 2019; Schwarz et al., 2018; Golkar et al., 2019; von Oswald et al., 2019). This can be seen as a study of how old parts of the network can be effectively used to solve new tasks and touches upon transfer learning. Typically, it avoids the challenge of integrating new knowledge into an existing network. However, at test time the algorithm is evaluated on all K tasks; usually the average accuracy is taken as the measure of performance, but see also Chaudhry et al. (2018).

In the appendix, we critically review and question some of their experimental results.
Finally, van de Ven & Tolias (2019), Hsu et al. (2018) and Farquhar & Gal (2018) point out that continual learning scenarios and assumptions of varying difficulty have been used across the literature.

3. REVIEW OF EXISTING REGULARISATION METHODS

3.1 FORMAL DESCRIPTION OF CONTINUAL LEARNING

In continual learning we are given K datasets D_1, ..., D_K sequentially. When training a neural net with N parameters θ ∈ R^N on dataset D_k, we have no access to the previously seen datasets D_{1:k-1}.
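To make this protocol concrete, the toy sketch below trains a single parameter vector on tasks in sequence. It is entirely our own construction (a two-feature linear-regression "network", a crude MAS-flavoured importance given by the mean squared input, and the strength `lam` are all illustrative assumptions): data from earlier tasks is never revisited, and only a stored parameter snapshot plus per-parameter importances enter the penalty.

```python
import numpy as np

def train_sequential(tasks, n_params, lam=100.0, lr=0.01, steps=2000):
    """Train theta on datasets D_1, ..., D_K in order, never revisiting old data.
    After each task, store (theta_star, omega): a parameter snapshot and a
    crude per-parameter importance (here: mean squared input)."""
    theta = np.zeros(n_params)
    anchors = []  # one (theta_star, omega) pair per finished task
    for X, y in tasks:
        for _ in range(steps):
            grad = 2 * X.T @ (X @ theta - y) / len(y)  # squared-error gradient
            for theta_star, omega in anchors:          # importance penalty
                grad += lam * omega * (theta - theta_star)
            theta -= lr * grad
        anchors.append((theta.copy(), np.mean(X ** 2, axis=0)))
    return theta
```

With `lam=0` the penalty disappears and training on the second task overwrites what was learned on the first, which is exactly the catastrophic forgetting that regularisation approaches aim to prevent.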

