UNIFYING REGULARISATION METHODS FOR CONTINUAL LEARNING

Abstract

Continual learning addresses the challenge of learning a number of different distributions sequentially. The goal of maintaining knowledge of earlier distributions without re-accessing them starkly conflicts with standard SGD training for artificial neural networks. An influential family of methods to tackle this consists of so-called regularisation approaches. They measure the importance of each parameter for modelling a given distribution and subsequently protect important parameters from large changes. In the literature, three ways to measure parameter importance have been put forward, and they have inspired a large body of follow-up work. Here, we present strong theoretical and empirical evidence that these three methods, Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI) and Memory Aware Synapses (MAS), are all related to the Fisher Information. Only EWC intentionally relies on the Fisher, while the other two methods stem from rather different motivations. We find that for SI the relation to the Fisher (and in fact its performance) is due to a previously unknown bias. Altogether, this unifies a body of regularisation approaches. It provides better explanations for the effectiveness of SI- and MAS-based algorithms and offers more justified versions of these algorithms. From a practical viewpoint, our insights uncover conditions needed for different algorithms to work, allow predicting their relative performance in some situations, and offer improvements.

1. INTRODUCTION

Despite considerable progress, many gaps between biological and machine intelligence remain. Animals, for example, flexibly learn new tasks throughout their lives, while at the same time maintaining robust memories of previous knowledge. This ability conflicts with traditional training procedures for artificial neural networks, which overwrite previous skills when optimizing new tasks (McCloskey & Cohen, 1989; Goodfellow et al., 2013). The field of continual learning is dedicated to mitigating this crucial shortcoming of machine learning. It exposes neural networks to a sequence of distinct tasks. While training on new tasks, the algorithm is not allowed to revisit old data, but should retain previous skills while remaining flexible enough to also acquire new knowledge. An influential line of work to approach this challenge was introduced by Kirkpatrick et al. (2017), who proposed the first regularisation algorithm, Elastic Weight Consolidation (EWC). After training a task, EWC measures the importance of each parameter and introduces an auxiliary loss penalising large changes in important parameters. Naturally, this raises the question of how to measure this 'importance'. While EWC uses the diagonal Fisher Information, two alternatives have been proposed: Synaptic Intelligence (SI, Zenke et al. (2017)) aims to attribute the decrease in loss during training to individual parameters, and Memory Aware Synapses (MAS, Aljundi et al. (2018)) introduces a heuristic measure of output sensitivity. Together, these three approaches have inspired many further regularisation-based approaches, including combinations of them (Chaudhry et al., 2018), refinements (Huszár, 2018; Ritter et al., 2018; Chaudhry et al., 2018; Yin et al., 2020), extensions (Schwarz et al., 2018; Liu et al., 2018; Park et al., 2019; Lee et al., 2020) and applications in different continual learning settings (Aljundi et al., 2019) as well as different domains of machine learning (Lan et al., 2019).
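To make the regularisation recipe concrete, the following is a minimal numpy sketch (not the authors' implementation) of the EWC-style scheme described above: the empirical diagonal Fisher Information is estimated from squared per-example log-likelihood gradients, and a quadratic penalty discourages movement of important parameters away from their values after the previous task. The toy logistic-regression model and all function names here are our own illustrative choices.

```python
import numpy as np

def diag_fisher(theta, X, y):
    """Empirical diagonal Fisher for a toy logistic-regression model:
    average squared per-example gradient of the log-likelihood."""
    F = np.zeros_like(theta)
    for x_i, y_i in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-x_i @ theta))  # predicted probability
        grad = (y_i - p) * x_i                  # d log p(y|x, theta) / d theta
        F += grad ** 2
    return F / len(X)

def ewc_penalty(theta, theta_star, F, lam=1.0):
    """EWC-style auxiliary loss: (lam/2) * sum_i F_i (theta_i - theta*_i)^2,
    penalising changes to parameters important for the old task."""
    return 0.5 * lam * np.sum(F * (theta - theta_star) ** 2)

# Toy usage: estimate importance after "task A", then penalise a shift.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_star = np.array([1.0, -2.0, 0.5])      # parameters after task A
y = (X @ theta_star > 0).astype(float)       # toy labels for task A
F = diag_fisher(theta_star, X, y)
print(ewc_penalty(theta_star + 0.1, theta_star, F))
```

SI and MAS follow the same quadratic-penalty template and differ only in how the importance weights `F` are computed, which is precisely the design choice this paper analyses.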
Almost every new continual learning method compares to at least one of EWC, SI and MAS. Despite their popularity and influence, basic practical and theoretical questions regarding these algorithms had previously been unanswered. Notably, it was unknown how similar these importance

