GRADIENT PROJECTION MEMORY FOR CONTINUAL LEARNING

Abstract

The ability to learn continually without forgetting past tasks is a desired attribute for artificial learning systems. Existing approaches that enable such learning in artificial neural networks usually rely on network growth, importance-based weight updates, or replay of old data from memory. In contrast, we propose a novel approach in which a neural network learns new tasks by taking gradient steps orthogonal to the gradient subspaces deemed important for the past tasks. We find the bases of these subspaces by analyzing network representations (activations) after learning each task with Singular Value Decomposition (SVD) in a single-shot manner and store them in memory as Gradient Projection Memory (GPM). With qualitative and quantitative analyses, we show that such orthogonal gradient descent induces minimum to no interference with the past tasks, thereby mitigating forgetting. We evaluate our algorithm on diverse image classification datasets with short and long sequences of tasks and report better or on-par performance compared to the state-of-the-art approaches.¹

1. INTRODUCTION

Humans exhibit a remarkable ability to continually adapt and learn new tasks throughout their lifetime while maintaining the knowledge gained from past experiences. In stark contrast, Artificial Neural Networks (ANNs) under such a Continual Learning (CL) paradigm (Ring, 1998; Thrun & Mitchell, 1995; Lange et al., 2021) forget the information learned in past tasks upon learning new ones. This phenomenon is known as 'Catastrophic Forgetting' or 'Catastrophic Interference' (Mccloskey & Cohen, 1989; Ratcliff, 1990). The problem is rooted in the general optimization methods (Goodfellow et al., 2016) used to encode the input data distribution into the parametric representation of the network during training. Upon exposure to a new task, gradient-based optimization methods, without any constraint, change the learned encoding to minimize the objective function with respect to the current data distribution. Such parametric updates lead to forgetting. Given a fixed-capacity network, one way to address this problem is to put constraints on the gradient updates so that task-specific knowledge can be preserved. To this end, Kirkpatrick et al. (2017), Zenke et al. (2017), Aljundi et al. (2018), and Serrà et al. (2018) add a penalty term to the objective function while optimizing for the new task. Such a term acts as a structural regularizer and dictates the degree of stability-plasticity of individual weights. Though these methods provide a resource-efficient solution to the catastrophic forgetting problem, their performance suffers on longer task sequences and when task identity is unavailable during inference. Approaches (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a) that store episodic memories of old data essentially solve an optimization problem with 'explicit' constraints on the new gradient directions so that losses for the old tasks do not increase. In Chaudhry et al. (2019b), the performance on old tasks is retained by taking gradient steps in the average gradient direction obtained from the new data and memory samples. To minimize interference, Farajtabar et al. (2020) store gradient directions (instead of data) of the old tasks and optimize the network in directions orthogonal to these gradients for the new task, whereas Zeng et al. (2018) update gradients orthogonal to the old input directions using projector matrices calculated iteratively during training. However, these methods either compromise data privacy by storing raw data or utilize resources poorly, which limits their scalability.

In this paper, we address the problem of catastrophic forgetting in a fixed-capacity network when data from the old tasks are not available. To mitigate forgetting, our approach puts explicit constraints on the gradient directions that the optimizer can take. However, unlike contemporary methods, we neither store old gradient directions nor store old examples for generating reference directions. Instead, we propose an approach that, after learning each task, partitions the entire gradient space of the weights into two orthogonal subspaces: the Core Gradient Space (CGS) and the Residual Gradient Space (RGS) (Saha et al., 2020). Leveraging the relationship between the input and the gradient spaces, we show how learned representations (activations) form the bases of these gradient subspaces in both fully-connected and convolutional networks. Using Singular Value Decomposition (SVD) on these activations, we show how to obtain the minimum set of bases of the CGS by which past knowledge is preserved and learnability for the new tasks is ensured. We store these bases in memory, which we define as Gradient Projection Memory (GPM). In our method, we propose to learn any new task by taking gradient steps in the direction orthogonal to the space (CGS) spanned by the GPM.
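As a rough illustration of this basis-extraction step, the following NumPy sketch computes core-space bases for one layer from its activation matrix via SVD. The function name, the fixed energy threshold, and the residual-projection step for subsequent tasks are simplifications for illustration, not the paper's exact selection criterion:

```python
import numpy as np

def update_gpm(activations, threshold=0.97, gpm=None):
    """Sketch: extract core-space bases from a layer's activation
    matrix (features x samples) via SVD and append the top left
    singular vectors to the Gradient Projection Memory `gpm`."""
    R = activations  # columns are representations of training samples
    if gpm is not None:
        # remove the part of R already explained by the stored bases
        R = R - gpm @ (gpm.T @ R)
    U, S, _ = np.linalg.svd(R, full_matrices=False)
    # keep the minimum number of bases capturing `threshold` of the energy
    energy = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(energy, threshold) + 1)
    new_bases = U[:, :k]
    return new_bases if gpm is None else np.hstack([gpm, new_bases])
```

Because the residual activations for a later task are first projected out of the stored subspace, the newly added bases remain orthogonal to the existing ones, keeping the accumulated GPM columns orthonormal.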
Our analysis shows that such orthogonal gradient descent induces minimum to no interference with the old learning and is thus effective in alleviating catastrophic forgetting. We evaluate our approach in the context of image classification with miniImageNet, CIFAR-100, PMNIST, and a sequence of 5-Datasets on a variety of network architectures including ResNet. We compare our method with related state-of-the-art approaches and report comparable or better classification performance. Overall, we show that our method is memory efficient and scalable to complex datasets with long task sequences while preserving data privacy.
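The orthogonal update itself amounts to a projection applied before each optimizer step. A minimal sketch, assuming the stored GPM bases for a layer are collected as orthonormal columns of a matrix (the function name is illustrative):

```python
import numpy as np

def project_gradient(grad, gpm_bases):
    """Remove the component of `grad` that lies in the core gradient
    space spanned by the (orthonormal) columns of `gpm_bases`, so the
    update is taken orthogonal to directions important for past tasks."""
    M = gpm_bases
    return grad - M @ (M.T @ grad)
```

In practice such a projection would be applied layer-wise to the flattened weight gradients on every step of training a new task; gradients for the first task need no projection since the GPM is empty.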

2. RELATED WORKS

Approaches to continual learning for ANNs can be broadly divided into three categories. In this section we present a detailed discussion of representative works from each category, highlighting their contributions and differences with our approach.

Expansion-based methods: Methods in this category overcome catastrophic forgetting by dedicating different subsets of network parameters to each task. With no constraint on network architecture, Progressive Neural Network (PGN) (Rusu et al., 2016) preserves old knowledge by freezing the base model and adding new sub-networks with lateral connections for each new task. Dynamically Expandable Networks (DEN) (Yoon et al., 2018) either retrain or expand the network by splitting/duplicating important units on new tasks, whereas Sarwar et al. (2020) grow the network to learn new tasks while sharing part of the base network. Li et al. (2019) use neural architecture search (NAS) to find optimal network structures for each sequential task. RCL (Xu & Zhu, 2018) adaptively expands the network at each layer using reinforcement learning, whereas APD (Yoon et al., 2020) additively decomposes the parameters into shared and task-specific parameters to minimize the increase in network complexity. In contrast, our method avoids network growth or expensive NAS operations and performs sequential learning within a fixed network architecture.

Regularization-based methods: These methods attempt to overcome forgetting in a fixed-capacity model through structural regularization, which penalizes major changes in the parameters that were important for the previous tasks. Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) computes such importance from the diagonal of the Fisher information matrix after training, whereas Zenke et al. (2017) compute it during training based on loss sensitivity with respect to the parameters. Additionally, Aljundi et al. (2018) compute importance from the sensitivity of model outputs to the inputs. Other methods, such as PackNet (Mallya & Lazebnik, 2018), use iterative pruning to fully restrict gradient updates on important weights via binary masks, whereas HAT (Serrà et al., 2018) identifies important neurons by learning attention masks that control gradient propagation in the individual parameters. Saha et al. (2020), using a PCA-based pruning on activations, and Garg et al. (2020) partition the parametric space of the weights (filters) into core and residual (filter) spaces after learning each task. The past knowledge is preserved in the frozen core space, whereas the residual space is updated when learning the next task. In contrast to these methods, we do not ascribe importance to or restrict the gradients of any individual parameters or filters. Rather, we put constraints on the 'direction' of gradient descent.

Memory-based methods: Methods under this class mitigate forgetting by either storing a subset of (raw) examples from the past tasks in the memory for rehearsal (Robins, 1995; Rebuffi et al., 2017; Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a;b; Riemer et al., 2019) or synthesizing old data from generative models to perform pseudo-rehearsal (Shin et al., 2017). For instance,

¹Our code is available at https://github.com/sahagobinda/GPM

