GRADIENT PROJECTION MEMORY FOR CONTINUAL LEARNING

Abstract

The ability to learn continually without forgetting past tasks is a desired attribute for artificial learning systems. Existing approaches to enable such learning in artificial neural networks usually rely on network growth, importance-based weight updates, or replay of old data from memory. In contrast, we propose a novel approach in which a neural network learns new tasks by taking gradient steps orthogonal to the gradient subspaces deemed important for the past tasks. We find the bases of these subspaces by analyzing network representations (activations) after learning each task with Singular Value Decomposition (SVD) in a single-shot manner and store them in memory as Gradient Projection Memory (GPM). With qualitative and quantitative analyses, we show that such orthogonal gradient descent induces minimum to no interference with the past tasks, thereby mitigating forgetting. We evaluate our algorithm on diverse image classification datasets with short and long task sequences and report better or on-par performance compared to state-of-the-art approaches.

1. INTRODUCTION

Humans exhibit a remarkable ability to adapt continually and learn new tasks throughout their lifetime while maintaining the knowledge gained from past experiences. In stark contrast, Artificial Neural Networks (ANNs) under such a Continual Learning (CL) paradigm (Ring, 1998; Thrun & Mitchell, 1995; Lange et al., 2021) forget the information learned in past tasks upon learning new ones. This phenomenon is known as 'Catastrophic Forgetting' or 'Catastrophic Interference' (Mccloskey & Cohen, 1989; Ratcliff, 1990). The problem is rooted in the general optimization methods (Goodfellow et al., 2016) that are used to encode the input data distribution into the parametric representation of the network during training. Upon exposure to a new task, gradient-based optimization methods, without any constraint, change the learned encoding to minimize the objective function with respect to the current data distribution. Such parametric updates lead to forgetting. Given a fixed-capacity network, one way to address this problem is to put constraints on the gradient updates so that task-specific knowledge can be preserved. To this end, Kirkpatrick et al. (2017), Zenke et al. (2017), Aljundi et al. (2018) and Serrà et al. (2018) add a penalty term to the objective function while optimizing for a new task. Such a term acts as a structural regularizer and dictates the degree of stability-plasticity of individual weights. Though these methods provide a resource-efficient solution to the catastrophic forgetting problem, their performance suffers when learning longer task sequences and when task identity is unavailable during inference. Approaches (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a) that store episodic memories of old data essentially solve an optimization problem with 'explicit' constraints on the new gradient directions so that losses for the old tasks do not increase. In Chaudhry et al. 
(2019b), the performance of old tasks is retained by taking gradient steps in the average gradient direction obtained from the new data and memory samples. To minimize interference, Farajtabar et al. (2020) store gradient directions (instead of data) of the old tasks and optimize the network in directions orthogonal to these gradients for the new task, whereas Zeng et al. (2018) update gradients orthogonal to the old input directions using projector matrices calculated iteratively during training. However, these methods either compromise data privacy by storing raw data or utilize resources poorly, which limits their scalability. In this paper, we address the problem of catastrophic forgetting in a fixed-capacity network when data from the old tasks are not available. To mitigate forgetting, our approach puts explicit constraints on the gradient directions that the optimizer can take. However, unlike contemporary methods, we neither store old gradient directions nor store old examples for generating reference directions. Instead, we propose an approach that, after learning each task, partitions the entire gradient space of the weights into two orthogonal subspaces: the Core Gradient Space (CGS) and the Residual Gradient Space (RGS) (Saha et al., 2020). Leveraging the relationship between the input and the gradient spaces, we show how learned representations (activations) form the bases of these gradient subspaces in both fully connected and convolutional networks. Using Singular Value Decomposition (SVD) on these activations, we show how to obtain the minimum set of bases of the CGS by which past knowledge is preserved and learnability for the new tasks is ensured. We store these bases in a memory which we define as Gradient Projection Memory (GPM). In our method, we propose to learn any new task by taking gradient steps orthogonal to the space (CGS) spanned by the GPM. 
Our analysis shows that such orthogonal gradient descent induces minimum to no interference with old learning, and is thus effective in alleviating catastrophic forgetting. We evaluate our approach in the context of image classification with miniImageNet, CIFAR-100, PMNIST and a sequence of 5-Datasets on a variety of network architectures including ResNet. We compare our method with related state-of-the-art approaches and report comparable or better classification performance. Overall, we show that our method is memory efficient and scalable to complex datasets with longer task sequences while preserving data privacy.

2. RELATED WORKS

Approaches to continual learning for ANNs can be broadly divided into three categories. In this section we discuss representative works from each category, highlighting their contributions and differences with our approach. Expansion-based methods: Methods in this category overcome catastrophic forgetting by dedicating different subsets of network parameters to each task. With no constraint on network architecture, Progressive Neural Network (PGN) (Rusu et al., 2016) preserves old knowledge by freezing the base model and adding new sub-networks with lateral connections for each new task. Dynamically Expandable Networks (DEN) (Yoon et al., 2018) either retrain or expand the network by splitting/duplicating important units on new tasks, whereas Sarwar et al. (2020) grow the network to learn new tasks while sharing part of the base network. Li et al. (2019) use neural architecture search (NAS) to find optimal network structures for each sequential task. RCL (Xu & Zhu, 2018) adaptively expands the network at each layer using reinforcement learning, whereas APD (Yoon et al., 2020) additively decomposes the parameters into shared and task-specific parameters to minimize the increase in network complexity. In contrast, our method avoids network growth or expensive NAS operations and performs sequential learning within a fixed network architecture. Regularization-based methods: These methods attempt to overcome forgetting in a fixed-capacity model through structural regularization, which penalizes major changes in the parameters that were important for the previous tasks. Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) computes such importance from the diagonal of the Fisher information matrix after training, whereas Zenke et al. (2017) compute it during training based on loss sensitivity with respect to the parameters. Additionally, Aljundi et al. (2018) compute importance from the sensitivity of model outputs to the inputs. 
Other methods, such as PackNet (Mallya & Lazebnik, 2018), use iterative pruning to fully restrict gradient updates on important weights via binary masks, whereas HAT (Serrà et al., 2018) identifies important neurons by learning attention masks that control gradient propagation in the individual parameters. Saha et al. (2020), using a PCA-based pruning on activations (Garg et al., 2020), partition the parametric space of the weights (filters) into core and residual (filter) spaces after learning each task. The past knowledge is preserved in the frozen core space, whereas the residual space is updated when learning the next task. In contrast to these methods, we do not ascribe importance to, or restrict the gradients of, any individual parameters or filters. Rather, we put constraints on the 'direction' of gradient descent. Memory-based methods: Methods in this class mitigate forgetting by either storing a subset of (raw) examples from the past tasks in memory for rehearsal (Robins, 1995; Rebuffi et al., 2017; Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a;b; Riemer et al., 2019) or synthesizing old data from generative models to perform pseudo-rehearsal (Shin et al., 2017). For instance, Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) avoids interference with previous tasks by projecting the new gradients into the feasible region outlined by previous-task gradients calculated from the samples in episodic memory. Averaged-GEM (A-GEM) (Chaudhry et al., 2019a) simplifies this optimization problem to projection in one direction estimated from randomly selected memory samples. Guo et al. (2020) propose a unified view of episodic-memory-based CL methods that includes GEM and A-GEM, and improve performance over these methods using a loss-balancing update rule. 
Additionally, Experience Replay (ER) (Chaudhry et al., 2019b) and Meta-Experience Replay (MER) (Riemer et al., 2019) mitigate forgetting in the online CL setup by jointly training on samples from the new tasks and the episodic memory. All these methods, however, rely on access to old data, which might not be possible when users have concerns over data privacy. Like all memory-based methods, we also use a storage unit, which we call GPM. However, we do not save any raw data in GPM, thus satisfying the data-privacy criterion. Our method is closely related to the recently proposed Orthogonal Gradient Descent (OGD) (Farajtabar et al., 2020) and Orthogonal Weight Modulation (OWM) (Zeng et al., 2018). OGD stores a set of gradient directions in memory for each task and minimizes catastrophic forgetting by taking gradient steps in orthogonal directions for new tasks. In contrast to OGD, we compute and store the bases of the core gradient space from network representations (activations), which reduces the memory requirement by orders of magnitude. Moreover, OGD is shown to work under a locality assumption with small learning rates, which limits its scalability to longer task sequences with complex datasets. Since our method does not use gradient directions (like OGD) to describe the core gradient spaces, we do not need to obey such assumptions and can thus use higher learning rates. On the other hand, OWM reduces forgetting by modifying the weights of the network in directions orthogonal to the inputs of the past tasks. This is achieved by multiplying new gradients with projector matrices. These matrices are computed from the stored past projectors and the inputs with a recursive least squares (RLS) method at each training step. However, such an iterative method not only slows down the training process but also shows limited scalability in end-to-end task learning with modern network architectures. Like OWM, we aim to encode new learning in directions orthogonal to the old inputs. 
In contrast to the iterative projector computation in OWM, we identify a low-dimensional subspace in the gradient space by analyzing the learned representations with SVD in a one-shot manner at the end of each task. We store the bases of these subspaces in GPM and learn new tasks orthogonal to these spaces to protect old knowledge. We quantitatively show that our method is memory efficient, fast and scalable to deeper networks for long sequences of complex tasks.

3. NOTATIONS AND BACKGROUND

In this section, we introduce the notations used throughout the paper and give a brief overview of SVD for matrix approximation. In section 4, we establish the relationship between input and gradient spaces. In section 5, we show the steps of our algorithm that leverage this relationship. Continual Learning: We consider a supervised learning setup where T tasks are learned sequentially. Each task has a task descriptor, τ ∈ {1, 2, ..., T}, with a corresponding dataset D_τ = {(x_{i,τ}, y_{i,τ})}_{i=1}^{n_τ} of n_τ example pairs. Let's consider an L-layer neural network where, at each layer, the network computes the following function for task τ:

x^{l+1}_{i,τ} = σ(f(W^l_τ, x^l_{i,τ})).   (1)

Here l = 1, ..., L, σ(·) is a non-linear function and f(·,·) is a linear function. We use vector notation for the input (x_{i,τ}) in fully connected layers and matrix notation for the input (X_{i,τ}) in convolutional layers. At the first layer, x^1_{i,τ} = x_{i,τ} represents the raw input data from task τ, whereas in subsequent layers we define x^l_{i,τ} as the representation of input x_{i,τ} at layer l. The set of parameters of the network is W_τ = {(W^l_τ)}_{l=1}^{L}, where W_0 denotes the set of parameters at initialization. Matrix approximation with SVD: SVD can be used to factorize a rectangular matrix A ∈ R^{m×n} into the product of three matrices, A = UΣV^T, where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal and Σ contains the sorted singular values along its main diagonal (Deisenroth et al., 2020). If the rank of the matrix is r (r ≤ min(m, n)), A can be expressed as A = Σ_{i=1}^{r} σ_i u_i v_i^T, where u_i ∈ U and v_i ∈ V are the left and right singular vectors and σ_i ∈ diag(Σ) are the singular values. Also, a k-rank approximation of this matrix can be expressed as A_k = Σ_{i=1}^{k} σ_i u_i v_i^T, where k ≤ r and its value can be chosen as the smallest k that satisfies ||A_k||²_F ≥ ε_th ||A||²_F. Here ||·||_F is the Frobenius norm of the matrix and ε_th (0 < ε_th ≤ 1) is the threshold hyperparameter.
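The k-rank approximation criterion above can be sketched in a few lines of NumPy. The helper name `k_rank_approx`, the matrix size, and the threshold value 0.9 are illustrative assumptions, not code from the paper:

```python
import numpy as np

def k_rank_approx(A, th=0.97):
    """Return the smallest k with ||A_k||_F^2 >= th * ||A||_F^2,
    together with the rank-k approximation A_k."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # ||A||_F^2 equals the sum of squared singular values
    ratio = np.cumsum(S ** 2) / (S ** 2).sum()
    k = int(np.searchsorted(ratio, th) + 1)
    A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
    return k, A_k

A = np.random.randn(20, 8)
k, A_k = k_rank_approx(A, th=0.9)
# the chosen k satisfies the energy-retention criterion
assert np.linalg.norm(A_k, 'fro')**2 >= 0.9 * np.linalg.norm(A, 'fro')**2
```

Because the squared Frobenius norm of A_k equals the sum of the first k squared singular values, the smallest valid k can be read off the cumulative energy ratio directly.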
[Figure 1: A convolutional layer viewed as matrix multiplication. (a) Forward pass: the input is reshaped into a matrix X whose rows are patch vectors p_1, ..., p_n of length C_i × k × k, the weights into a (C_i × k × k) × C_o matrix W, and the output is O = XW. (b) Backward pass: the weight gradient is ∇_W L = X^T Δ, where Δ is the error from the next layer.]

4. INPUT AND GRADIENT SPACES

Our algorithm leverages the fact that stochastic gradient descent (SGD) updates lie in the span of input data points (Zhang et al., 2017) . In the following subsections we will establish this relationship for both fully connected and convolutional layers. The analysis presented in this section is generally applicable to any layer of the network for any task, and hence we drop the task and layer identifiers.

4.1. FULLY CONNECTED LAYER

Let's consider a single-layer linear neural network in a supervised learning setup where each (input, label) training pair comes from a training dataset D. Let x ∈ R^n be the input vector, y ∈ R^m the label vector, and W ∈ R^{m×n} the parameters (weights) of the network. The network is trained by minimizing the mean-squared-error loss L = (1/2)||Wx - y||²_2. The gradient of this loss with respect to the weights is ∇_W L = (Wx - y)x^T = δx^T, where δ ∈ R^m is the error vector. Thus, the gradient update lies in the span of the input x, where the elements of δ scale the magnitude of x by different factors. Here, we have considered the per-example loss (batch size of 1) for simplicity; the relation also holds in the mini-batch setting (see appendix B.1). This input-gradient relation is generically applicable to any fully connected layer of a neural network, where x is the input to that layer and δ is the error coming from the next layer. Moreover, it also holds for networks with non-linear units (e.g., ReLU) and cross-entropy losses, except that the calculation of δ differs.
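This relation is easy to verify numerically. The following sketch (with illustrative dimensions and a fixed random seed) checks the analytic gradient δxᵀ against a finite difference and confirms that it is rank-1, i.e. each row lies in the span of x:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3
x = rng.standard_normal(n)          # input vector
y = rng.standard_normal(m)          # label vector
W = rng.standard_normal((m, n))     # weight matrix

delta = W @ x - y                   # error vector
grad = np.outer(delta, x)           # analytic gradient: delta x^T

# finite-difference check of a single weight entry
eps = 1e-6
loss = lambda V: 0.5 * np.sum((V @ x - y) ** 2)
Wp = W.copy()
Wp[1, 2] += eps
assert abs((loss(Wp) - loss(W)) / eps - grad[1, 2]) < 1e-4

# every row of the gradient is a scaled copy of x => rank 1
assert np.linalg.matrix_rank(grad) == 1
```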

4.2. CONVOLUTIONAL LAYER

Filters in a convolutional (Conv) layer operate on the inputs differently than the weights in a fully connected (FC) layer. Let's consider a Conv layer with input tensor X ∈ R^{C_i×h_i×w_i} and filters W ∈ R^{C_o×C_i×k×k}. Their convolution (X * W) produces the output feature map O ∈ R^{C_o×h_o×w_o} (Liu et al., 2018). Here, C_i (C_o) denotes the number of input (output) channels of the Conv layer, h_i, w_i (h_o, w_o) denote the height and width of the input (output) feature maps, and k is the kernel size of the filters. As shown in Figure 1(a), if X is reshaped into a (h_o × w_o) × (C_i × k × k) matrix X, and W is reshaped into a (C_i × k × k) × C_o matrix W, then the convolution can be expressed as the matrix multiplication O = XW, where O ∈ R^{(h_o×w_o)×C_o}. Each row of X contains an input patch vector p_j ∈ R^{(C_i×k×k)×1}, where j = 1, 2, ..., n (n = h_o · w_o). This formulation of convolution as matrix multiplication provides an intuitive picture of the gradient computation during backpropagation. Similar to the FC layer case, during the backward pass of a Conv layer an error matrix Δ of size (h_o × w_o) × C_o (the same size as O) is obtained from the next layer. As shown in Figure 1(b), the gradient of the loss with respect to the filter weights is calculated as ∇_W L = X^T Δ, where ∇_W L has shape (C_i × k × k) × C_o (the same size as W). Since the columns of X^T are the input patch vectors (p), the gradient updates of the convolutional filters lie in the space spanned by these patch vectors.
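The reshaping described here (commonly called im2col) can be sketched as follows. The tensor sizes are illustrative, the loop-based check is only for clarity, and stride 1 with no padding is assumed:

```python
import numpy as np

rng = np.random.default_rng(1)
Ci, Co, k, hi, wi = 2, 3, 3, 5, 5
ho, wo = hi - k + 1, wi - k + 1           # stride 1, no padding
Xt = rng.standard_normal((Ci, hi, wi))    # input tensor
Wt = rng.standard_normal((Co, Ci, k, k))  # filters

# im2col: each row of X is one flattened (Ci*k*k) input patch
X = np.stack([Xt[:, i:i+k, j:j+k].ravel()
              for i in range(ho) for j in range(wo)])   # (ho*wo, Ci*k*k)
W = Wt.reshape(Co, -1).T                                # (Ci*k*k, Co)
O = X @ W                                               # (ho*wo, Co)

# check against direct (valid) convolution
for c in range(Co):
    for i in range(ho):
        for j in range(wo):
            direct = np.sum(Xt[:, i:i+k, j:j+k] * Wt[c])
            assert np.isclose(O[i*wo + j, c], direct)

# backward pass: weight gradient X^T Delta has the same shape as W
Delta = rng.standard_normal((ho*wo, Co))
gradW = X.T @ Delta
assert gradW.shape == W.shape
```

Since `gradW = X.T @ Delta`, each column of the gradient is a linear combination of the patch vectors, exactly the span property the text states.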

5. CONTINUAL LEARNING WITH GRADIENT PROJECTION MEMORY (GPM)

In this section, we describe our continual learning algorithm, which leverages the relationship between gradient and input spaces to identify the core gradient spaces of the past tasks. We show how gradient descent orthogonal to these spaces enables us to learn continually without forgetting. Learning Task 1: We learn the first task (τ = 1) using dataset D_1 without imposing any constraint on parameter updates. At the end of Task 1, we obtain a learned set of parameters W_1. To preserve the knowledge of the learned task, we impose constraints on the direction of gradient updates for the next tasks. To do so, we partition the entire gradient space into two (orthogonal) subspaces: the Core Gradient Space (CGS) and the Residual Gradient Space (RGS), such that gradient steps along the CGS induce high interference on the learned tasks whereas gradient steps along the RGS have minimum to no interference. We aim to find and store the bases of the CGS and take gradient steps orthogonal to the CGS for the next task. In our formulation, each layer has its own CGS. To find the bases, after learning Task 1, for each layer we construct a representation matrix R^l_1 = [x^l_{1,1}, x^l_{2,1}, ..., x^l_{n_s,1}] (for Conv layers, R^l_1 = [(X^l_{1,1})^T, (X^l_{2,1})^T, ..., (X^l_{n_s,1})^T]) by concatenating along the columns the n_s representations obtained from a forward pass of n_s random samples from the current training dataset through the network. Next, we perform SVD on R^l_1 = U^l_1 Σ^l_1 (V^l_1)^T followed by its k-rank approximation (R^l_1)_k according to the following criterion for the given threshold ε^l_th: ||(R^l_1)_k||²_F ≥ ε^l_th ||R^l_1||²_F. We define the space S^l = span{u^l_{1,1}, u^l_{2,1}, ..., u^l_{k,1}}, spanned by the first k vectors in U^l_1, as the space of significant representation for Task 1 at layer l, since it contains all the directions with the highest singular values in the representation. 
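A minimal NumPy sketch of this basis-extraction step, assuming a representation matrix R with the sampled activations stacked along the columns; the helper name `compute_gpm_bases`, the matrix size, and the threshold 0.97 are illustrative assumptions:

```python
import numpy as np

def compute_gpm_bases(R, eps_th=0.97):
    """Given a layer's representation matrix R (features x n_samples),
    return the first k left singular vectors whose squared singular
    values capture at least eps_th of ||R||_F^2."""
    U, S, _ = np.linalg.svd(R, full_matrices=False)
    sval_sq = S ** 2
    k = int(np.searchsorted(np.cumsum(sval_sq) / sval_sq.sum(), eps_th) + 1)
    return U[:, :k]   # bases of the Core Gradient Space (CGS) for this layer

R = np.random.randn(100, 125)   # e.g. 100 features, 125 sampled activations
M = compute_gpm_bases(R)
# the stored bases are orthonormal columns
assert np.allclose(M.T @ M, np.eye(M.shape[1]), atol=1e-8)
```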
For the next task, we aim to take gradient steps such that the correlation between this task-specific significant representation and the weights in each layer is preserved. Since inputs span the space of gradient descent (section 4), the bases of S^l span a subspace of the gradient space which we define as the Core Gradient Space (CGS). Thus, gradient descent along the CGS would cause maximum change in the input-weight correlation, whereas gradient steps in directions orthogonal to the CGS (the space of low representational significance) induce very small to no interference with the old tasks. We define this subspace orthogonal to the CGS as the Residual Gradient Space (RGS). We save the bases of the CGS in the memory M = {(M^l)}_{l=1}^{L}, where M^l = [u^l_{1,1}, u^l_{2,1}, ..., u^l_{k,1}]. We define this memory as Gradient Projection Memory (GPM). Learning Task 2 to T: We learn Task 2 with examples from dataset D_2 only. Before taking a gradient step, the bases of the CGS are retrieved from GPM. New gradients (∇_{W^l_2} L_2) are first projected onto the CGS and the projected components are then subtracted from the new gradient, so that the remaining gradient components lie in the space orthogonal to the CGS. Gradients are updated as:

FC layer: ∇_{W^l_2} L_2 ← ∇_{W^l_2} L_2 - (∇_{W^l_2} L_2) M^l (M^l)^T,
Conv layer: ∇_{W^l_2} L_2 ← ∇_{W^l_2} L_2 - M^l (M^l)^T (∇_{W^l_2} L_2).

At the end of Task 2 training, we update the GPM with new task-specific bases (of the CGS). To obtain such bases, we construct R^l_2 = [x^l_{1,2}, x^l_{2,2}, ..., x^l_{n_s,2}] using data from Task 2 only. However, before performing SVD and the subsequent k-rank approximation, we eliminate from R^l_2 the common directions (bases) that are already present in the GPM, so that the newly added bases are unique and orthogonal to the existing bases in memory. To do so, we perform the following step:

R̂^l_2 = R^l_2 - M^l (M^l)^T R^l_2 = R^l_2 - R^l_{2,proj}. 
Afterwards, SVD is performed on R̂^l_2 (= Û^l_2 Σ̂^l_2 (V̂^l_2)^T) and k new orthogonal bases are chosen for the minimum value of k satisfying the following criterion for the given threshold ε^l_th: ||R^l_{2,proj}||²_F + ||(R̂^l_2)_k||²_F ≥ ε^l_th ||R^l_2||²_F. The GPM is updated by adding the new bases as M^l = [M^l, û^l_{1,2}, ..., û^l_{k,2}]. Thus, after learning each new task the CGS grows and the RGS becomes smaller, where the maximum size of M^l (hence the dimension of the gradient bases) is fixed by the choice of initial network architecture. Once the GPM update is complete, we move on to the next task and repeat the same procedure that we followed for Task 2. The pseudo-code of the algorithm is given in Algorithm 1 in the appendix.
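The gradient projection and GPM update steps described above can be sketched as follows. Function names, shapes and the threshold are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def project_gradient_fc(grad, M):
    """Remove from an FC-layer gradient its component in span(M) (the CGS).
    grad: (out, in) gradient matrix; M: (in, k) orthonormal GPM bases."""
    return grad - (grad @ M) @ M.T

def update_gpm(R_new, M, eps_th=0.97):
    """Add new orthogonal bases covering the current task's representations."""
    R_proj = M @ (M.T @ R_new)            # part already captured by GPM
    R_hat = R_new - R_proj                # directions not yet in GPM
    total = np.linalg.norm(R_new, 'fro') ** 2
    if np.linalg.norm(R_proj, 'fro') ** 2 >= eps_th * total:
        return M                          # existing bases already suffice
    U, S, _ = np.linalg.svd(R_hat, full_matrices=False)
    acc = np.linalg.norm(R_proj, 'fro') ** 2 + np.cumsum(S ** 2)
    k = int(np.searchsorted(acc, eps_th * total) + 1)
    return np.hstack([M, U[:, :k]])       # grow the GPM

# projected gradient is orthogonal to every stored basis vector
rng = np.random.default_rng(2)
M = np.linalg.qr(rng.standard_normal((50, 10)))[0]   # 10 orthonormal bases
g = rng.standard_normal((20, 50))
g_perp = project_gradient_fc(g, M)
assert np.allclose(g_perp @ M, 0, atol=1e-8)
```

The combined criterion mirrors the paper's update rule: the energy already captured by the stored bases (||R_proj||²_F) counts toward the threshold, so only genuinely new directions are appended.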

6. EXPERIMENTAL SETUP

Datasets: We evaluate our continual learning algorithm on Permuted MNIST (PMNIST) (Lecun et al., 1998), 10-Split CIFAR-100 (Krizhevsky, 2009), 20-Split miniImageNet (Vinyals et al., 2016) and a sequence of 5-Datasets (Ebrahimi et al., 2020b). The PMNIST dataset is a variant of the MNIST dataset where each task is a random permutation of the original MNIST pixels. For PMNIST, we create 10 sequential tasks using different permutations where each task has 10 classes (Ebrahimi et al., 2020a). The 10-Split CIFAR-100 is constructed by splitting the 100 classes of CIFAR-100 into 10 tasks with 10 classes per task, whereas 20-Split miniImageNet, used in Chaudhry et al. (2019a), is constructed by splitting the 100 classes of miniImageNet into 20 sequential tasks where each task has 5 classes. Finally, we use a sequence of 5-Datasets including CIFAR-10, MNIST, SVHN (Netzer et al., 2011), notMNIST (Bulatov, 2011) and Fashion MNIST (Xiao et al., 2017), where classification on each dataset is considered a task. In our experiments we do not use any data augmentation. The dataset statistics are given in Tables 4 & 5 in the appendix. Network Architecture: We use a fully connected network with two hidden layers of 100 units each for PMNIST, following Lopez-Paz & Ranzato (2017). For experiments with split CIFAR-100 we use a 5-layer AlexNet similar to Serrà et al. (2018). For split miniImageNet and 5-Datasets, similar to Chaudhry et al. (2019b), we use a reduced ResNet18 architecture. No bias units are used, and batch normalization parameters are learned for the first task and shared with all the other tasks (following Mallya & Lazebnik (2018)). Details of the architectures are given in appendix section C.2. For Permuted MNIST, we evaluate and compare our algorithm in the 'single-head' setting (Hsu et al., 2018; Farquhar & Gal, 2018), where all tasks share the final classifier layer and inference is performed without task hints. 
For all other experiments, we evaluate our algorithm in the 'multi-head' setting, where each task has a separate classifier on which no gradient constraint is imposed during learning. Baselines: We compare our method with state-of-the-art approaches from both memory-based and regularization-based methods that consider sequential task learning in a fixed network architecture. From memory-based approaches, we compare with Experience Replay with reservoir sampling (ER Res) (Chaudhry et al., 2019b), Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017), Averaged GEM (A-GEM) (Chaudhry et al., 2019a), Orthogonal Gradient Descent (OGD) (Farajtabar et al., 2020) and Orthogonal Weight Modulation (OWM) (Zeng et al., 2018). Moreover, we compare with the state-of-the-art HAT (Serrà et al., 2018) baseline and Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) from regularization-based methods. Additionally, we add a 'multitask' baseline where all the tasks are learned jointly using the entire dataset at once in a single network. Multitask is not a continual learning strategy but serves as an upper bound on the average accuracy over all tasks. Details of the implementation, along with the hyperparameters considered for each of these baselines, are provided in section C.4 and Table 6 in the appendix. Training Details: We train all the models with plain stochastic gradient descent (SGD). For each task in PMNIST and split miniImageNet we train the network for 5 and 10 epochs respectively with a batch size of 10. In the split CIFAR-100 and 5-Datasets experiments, we train each task for a maximum of 200 and 100 epochs respectively with the early termination strategy based on the validation loss proposed in Serrà et al. (2018). For both datasets, the batch size is set to 64. For GEM, A-GEM and ER Res the episodic memory size is chosen to be approximately the same as the maximum GPM size (GPM Max). The calculation of the GPM size is given in Table 7 in the appendix. 
Moreover, the selection of the threshold values (ε_th) in our method is discussed in section C.5 in the appendix. Performance Metrics: To evaluate classification performance, we use the ACC metric, the average test classification accuracy over all tasks. To measure forgetting, we report the backward transfer, BWT, which indicates the influence of new learning on past knowledge; for instance, negative BWT indicates (catastrophic) forgetting. Formally, ACC and BWT are defined as:

ACC = (1/T) Σ_{i=1}^{T} R_{T,i},   BWT = (1/(T-1)) Σ_{i=1}^{T-1} (R_{T,i} - R_{i,i}).

Here, T is the total number of sequential tasks and R_{T,i} is the accuracy of the model on the i-th task after learning the T-th task sequentially (Lopez-Paz & Ranzato, 2017).
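Given a lower-triangular matrix R of per-task accuracies recorded after each training stage, these two metrics can be computed as in the sketch below; the accuracy values are made up purely for illustration:

```python
import numpy as np

def acc_bwt(R):
    """R[j, i]: accuracy on task i after training task j (T x T,
    lower triangle used). Returns (ACC, BWT)."""
    T = R.shape[0]
    acc = R[-1].mean()                                   # avg final accuracy
    bwt = np.mean([R[-1, i] - R[i, i] for i in range(T - 1)])
    return acc, bwt

# illustrative 3-task run: rows = after training task j
R = np.array([[0.95, 0.00, 0.00],
              [0.90, 0.93, 0.00],
              [0.88, 0.91, 0.94]])
acc, bwt = acc_bwt(R)
# ACC averages the last row; BWT averages the accuracy drop on tasks 1 and 2
```

Here ACC = (0.88 + 0.91 + 0.94)/3 = 0.91 and BWT = ((0.88 - 0.95) + (0.91 - 0.93))/2 = -0.045, i.e. a small amount of forgetting.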

7. RESULTS AND DISCUSSIONS

Single-head inference with PMNIST: First, we evaluate our algorithm in the single-head setup for 10 sequential PMNIST tasks. In this setup a task hint is not necessary. As HAT cannot perform inference without a task hint, it is not included in this comparison. (In Figure 2, for memory the size of GPM Max and for time the method with the highest complexity are used as references (value of 1); all other methods are reported relative to these references.) Since the network size is very small (0.1M parameters), with 87% of the parameters in the first layer, we choose a threshold value (ε_th) of 0.95 for that layer and 0.99 for the other layers to ensure better learnability. From the results, shown in Table 1(a), we observe that our method (GPM) achieves the best average accuracy (93.91 ± 0.16%). In addition, we achieve the least amount of forgetting, except for OWM, which essentially trades off accuracy to minimize forgetting. Figure 2(a) compares the memory utilization of all the memory-based approaches. While OWM, GEM, A-GEM and ER Res use memory of the size of GPM Max, we obtain better performance using only 69% of the GPM Max. Moreover, compared to OGD, we use about 400 times less memory and achieve ∼10% better accuracy. In Figure 2(b), we compare the per-epoch training time of different memory-based methods and find our method to be the fastest, primarily due to the precomputation of the reference gradient bases (of the CGS). Additionally, in the single-epoch setting (Lopez-Paz & Ranzato, 2017), as shown in Table 8 in the appendix, we obtain the best average accuracy (91.74 ± 0.15%), which demonstrates the potential of our algorithm in the online CL setup. Split CIFAR-100: Next, we switch to the multi-head setup, which enables us to compare with strong baselines such as HAT. For the ten split CIFAR-100 tasks, as shown in Table 1(b), we outperform all the memory-based approaches while using 45% less memory (Figure 2(c)). We also outperform EWC, and our accuracy is marginally better than HAT while achieving zero forgetting. 
Also, we obtain ∼20% better accuracy than OWM, which shows high forgetting (BWT = -0.30), demonstrating its limited scalability to convolutional architectures. Split miniImageNet: With this experiment, we test the scalability of our algorithm to a deeper network (ResNet18) on a long task sequence from the miniImageNet dataset. The average accuracies of different methods after learning 20 sequential tasks are given in Table 1(b). Again, in this case we outperform A-GEM, ER Res and EWC using 76% of the GPM Max (Figure 2(d)). Also, we achieve marginally better accuracy than HAT; however, unlike HAT (and the other methods) we completely avoid forgetting (BWT = 0.00). Moreover, compared to other methods, sequential learning in our method is more stable, meaning the accuracies of the past tasks have minimal to no degradation over the course of learning (shown for task 1 accuracy in Figure 4 in the appendix). 5-Datasets: Next, we validate our approach on learning across diverse datasets, where classification on each dataset is treated as one task. Even in this challenging setting, we achieve better accuracy (91.22 ± 0.20%) than A-GEM, ER Res and EWC while utilizing 78% of the GPM Max (Figure 2(e)). Though HAT performs marginally better than our method, both HAT and our method achieve the lowest BWT (-0.01). In this experiment, we used tasks that are less related to each other; after learning 5 such tasks, 78% of the gradient space is already constrained. This implies that if the tasks are less similar or dissimilar, GPM will be populated faster and reach its maximum capacity, after which no new learning will be possible. Since we use a fixed-capacity network and the size of GPM is determined by the network architecture, the ability to learn sequences of hundreds of such tasks with our method will be limited by the chosen network capacity. Training Time: Table 2 shows the total training time over all sequential tasks for the different algorithms. 
This includes the time spent on memory management for the memory-based methods, Fisher importance calculation for EWC and learning activation masks for HAT. Details of the time measurement are given in appendix section C.6. For the PMNIST, CIFAR-100 and 5-Datasets tasks, our algorithm trains faster than all the other baselines while spending only 0.2%, 3% and 6% of its total training time, respectively, in the GPM update (using SVD). Since each miniImageNet task is trained for only 10 epochs, our method has a relatively higher overhead (30% of the total time) due to the GPM update and thus runs a bit slower than the fastest method, ER Res. Overall, our formulation uses GPM bases and projection matrices of reasonable dimensions (see appendix section C.8); their precomputation at the start of each task leads to fast per-epoch training. This gain in time essentially compensates for the extra time required for the GPM update, which is done only once per task, enabling fast training. Comparison with Expansion-based methods: To compare our method with the state-of-the-art expansion-based methods, we perform experiments with the 20-task CIFAR-100 Superclass dataset (Yoon et al., 2020). In this experiment, each task contains 5 different but semantically related classes from the CIFAR-100 dataset. Similar to APD, here we use the LeNet-5 architecture. Details of the architecture and training setup are given in appendix section C.3. Results are shown in Table 3, where ACC represents the average accuracy over 5 different task sequences (used in APD) and Capacity denotes the percentage of network capacity used with respect to the original network. We outperform all the CL methods, achieving the best average accuracy (57.72 ± 0.37%) with a BWT of -0.01 using the smallest network. For instance, we outperform RCL and APD utilizing 84% and 30% fewer network parameters respectively, which shows that our method induces more sharing between tasks. 
Overall, we outperform the memory-based methods with less memory utilization, achieve better accuracy than the expansion-based methods using a smaller network, and obtain better or on-par performance compared to HAT in the given experimental setups. However, in the class-incremental learning setup (Rebuffi et al., 2017), a method (Kamra et al., 2017) that uses data replay achieves better performance than GPM (see the experiment in appendix section D.3). In this setup, we believe a subset of old data replay, either from storage or via generation, is inevitable for attaining better performance with minimal forgetting (Rajasegaran et al., 2019). In that quest, a hybrid approach such as combining GPM with small data replay would be an interesting direction for future exploration. Controlling Forgetting: Finally, we discuss the factors that implicitly or explicitly control the amount of forgetting in our algorithm. As discussed in section 5, we propose to minimize interference by taking gradient steps orthogonal to the CGS, where the CGS bases are computed such that the space of significant representations of the past tasks is well approximated by these bases. The degree of this approximation is controlled by the threshold hyperparameter, $\epsilon_{th}$ (through equations 5 and 9). For instance, a low value of $\epsilon_{th}$ (closer to 0) would allow the optimizer to change the weights along directions where the past data has high representational significance, thereby significantly altering the past input-weight correlations and inducing (catastrophic) interference. On the other hand, a high value of $\epsilon_{th}$ (closer to 1) would preserve such correlations; however, learnability of the new task might suffer due to the high volume of constraints in the gradient space. Therefore, in our continual learning algorithm, $\epsilon_{th}$ mediates the stability-plasticity dilemma.
To show this analytically, consider a network after learning T sequential tasks, with the weights of the network at any layer l expressed as: $W_T^l = W_1^l + \sum_{i=1}^{T-1} \Delta W_{i \to i+1}^l = W_1^l + \Delta W_{1 \to T}^l$. Here, $W_1^l$ is the weights after task 1 and $\Delta W_{1 \to T}^l$ is the change of weights from task 1 to task T. The weight update with our method ensures that $\Delta W_{1 \to T}^l$ lies in the orthogonal space of the data (representations) of task 1. The linear operation at layer l with data from task 1 ($x_1^l$) would produce: $W_T^l x_1^l = W_1^l x_1^l + \Delta W_{1 \to T}^l x_1^l$. If $\Delta W_{1 \to T}^l x_1^l = 0$, then the output of the network for task 1 data after learning task T will be the same as the output after learning task 1 (i.e. $W_T^l x_1^l = W_1^l x_1^l$), which means no interference for task 1. We define $\Delta W_{1 \to T}^l x_1^l$ as the interference activation for task 1 at layer l (for any task $\tau < T$: $\Delta W_{\tau \to T}^l x_\tau^l$). As discussed above, the degree of such interference is dictated by $\epsilon_{th}$. Figure 3(a)-(b) (and Figure 5 in the appendix) show histograms (distributions) of interference activations at each layer of the network for the split CIFAR-100 experiment. For lower values of $\epsilon_{th}$, these distributions have higher variance (spread), implying high interference, whereas with increasing $\epsilon_{th}$, the variance shrinks around the (zero) mean. As a direct consequence, as shown in Figure 3(c), backward transfer (forgetting) reduces with increasing $\epsilon_{th}$, along with improvement in accuracy.
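The no-interference argument above can be checked numerically. The following NumPy sketch (our own illustration, not the paper's code) builds a random task-1 input subspace, confines a weight update to its orthogonal complement, and verifies that the interference activation $\Delta W x_1$ vanishes, so the task-1 outputs are unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# Task-1 inputs (columns) spanning a 3-dimensional subspace of R^8.
X1 = rng.standard_normal((8, 3))
W1 = rng.standard_normal((4, 8))     # weights after learning task 1

# Orthonormal basis U of the task-1 input subspace (from SVD of X1).
U, _, _ = np.linalg.svd(X1, full_matrices=False)

# A raw update, projected onto the orthogonal complement of span(X1).
dW_raw = rng.standard_normal((4, 8))
dW = dW_raw - dW_raw @ (U @ U.T)     # remove components along span(X1)

WT = W1 + dW                         # weights after learning later tasks

# Interference activation dW @ X1 is (numerically) zero, so outputs match.
print(np.abs(dW @ X1).max())         # ~0 (floating-point round-off)
print(np.allclose(WT @ X1, W1 @ X1)) # True
```

This is exactly why a wider interference-activation histogram (lower $\epsilon_{th}$) signals forgetting: any component of the update left inside the task-1 subspace shifts these outputs.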

8. CONCLUSION

In this paper we propose a novel continual learning algorithm that finds the important gradient subspaces for the past tasks and minimizes catastrophic forgetting by taking gradient steps orthogonal to these subspaces when learning a new task. We show how to analyze the network representations to obtain a minimal number of bases of these subspaces, by which past information is preserved and learnability for the new tasks is ensured. Evaluations on diverse image classification tasks with different network architectures, and comparisons with state-of-the-art algorithms, show the effectiveness of our approach in achieving high classification performance while mitigating forgetting. We also show that our algorithm is fast, makes efficient use of memory, and is capable of learning long sequences of tasks in deeper networks while preserving data privacy.

Since the batch loss is the summation of the losses due to the individual examples, the total batch loss for n samples can be expressed as $L_{batch} = \sum_{i=1}^{n} L_i = \sum_{i=1}^{n} \frac{1}{2} \|W x_i - y_i\|_2^2$. The gradient of this loss with respect to the weights can be expressed as $\nabla_W L_{batch} = \delta_1 x_1^T + \delta_2 x_2^T + \dots + \delta_n x_n^T$. Hence, the gradient update will remain in the subspace spanned by the n input examples.
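The span property at the end of the derivation above can be verified directly: for the quadratic loss, the gradient $\nabla_W L_{batch} = \sum_i \delta_i x_i^T$ has every row inside $\mathrm{span}\{x_1, \dots, x_n\}$, so projecting it onto that span changes nothing. A small NumPy check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

n, d_in, d_out = 5, 10, 4
X = rng.standard_normal((d_in, n))   # columns are the inputs x_i
Y = rng.standard_normal((d_out, n))  # columns are the targets y_i
W = rng.standard_normal((d_out, d_in))

# L_batch = sum_i 0.5 * ||W x_i - y_i||^2  =>  grad = sum_i delta_i x_i^T
Delta = W @ X - Y                    # columns are the errors delta_i
grad = Delta @ X.T                   # = sum_i delta_i x_i^T

# Projecting each row of grad onto span{x_i} (via an orthonormal basis U
# of the column space of X) leaves the gradient unchanged.
U, _, _ = np.linalg.svd(X, full_matrices=False)
print(np.allclose(grad @ (U @ U.T), grad))  # True
```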

B.2 ALGORITHM PSEUDO CODE

Algorithm 1: Continual Learning with GPM (memory-update fragment)
    for layer, l = 1, 2, ..., L do
18:     R̂ˡ_τ ← PROJECT(Rˡ_τ, Mˡ)   // see equation (8)
19:     Ûˡ_τ ← SVD(R̂ˡ_τ)
20:     k ← criteria(R̂ˡ_τ, Rˡ_τ, εˡ_th)   // see equation (9)
21:     Mˡ ← [Mˡ, Ûˡ_τ[0 : k]]

Reduced ResNet18 architecture: This is similar to the architecture used by Lopez-Paz & Ranzato (2017). For the miniImageNet experiment, we use convolution with stride 2 in the first layer. For both the miniImageNet and 5-Datasets experiments we replace the 4 × 4 average-pooling before the classifier layer with 2 × 2 average-pooling. All the networks use ReLU in the hidden units and softmax with cross-entropy loss in the final layer.
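The GPM memory-update fragment above (lines 18-21) can be sketched in NumPy. Note this is our own reading of equations (8)-(9), not the authors' code: `update_gpm` is a hypothetical helper, and we assume the threshold criterion picks the smallest k such that the retained energy (the part of R already in the memory plus the top-k residual directions) exceeds an $\epsilon_{th}$ fraction of $\|R\|_F^2$:

```python
import numpy as np

def update_gpm(M, R, th):
    """One layer's GPM update (a sketch of Algorithm 1, lines 18-21).

    M  : (d, r) matrix of existing orthonormal bases (r may be 0)
    R  : (d, n) representation matrix for the current task
    th : threshold in (0, 1) controlling how much of R's energy to keep
    """
    total = np.linalg.norm(R) ** 2
    # Step 18: remove the part of R already explained by the memory.
    R_hat = R - M @ (M.T @ R) if M.shape[1] > 0 else R
    # Step 19: SVD of the residual.
    U, S, _ = np.linalg.svd(R_hat, full_matrices=False)
    # Step 20: smallest k such that the retained energy (old projection
    # plus top-k of the residual) exceeds th * ||R||_F^2 (our reading
    # of the criterion in equation (9)).
    explained = total - np.linalg.norm(R_hat) ** 2
    k = 0
    while k < len(S) and explained < th * total:
        explained += S[k] ** 2
        k += 1
    # Step 21: append the k new bases to the memory.
    return np.hstack([M, U[:, :k]])

rng = np.random.default_rng(2)
M = np.zeros((20, 0))                # empty memory before task 1
M = update_gpm(M, rng.standard_normal((20, 50)), th=0.97)
print(M.shape)                       # (20, k) for some k <= 20
```

Because each new basis is drawn from the residual (step 18), the columns of M stay orthonormal across tasks without re-orthogonalization.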

C.3 CIFAR-100 SUPERCLASS EXPERIMENT

For this experiment, similar to APD (Yoon et al., 2020), we use a modified LeNet-5 architecture with 20-50-800-500 neurons. All the baseline results are reported from APD. Like APD, we do not use any data augmentation or preprocessing. We keep 5% of the training data from each task for validation. We train the network with our algorithm with a batch size of 64 and an initial learning rate of 0.01. We train each task for a maximum of 50 epochs with a decay schedule and early-termination strategy similar to Serrà et al. (2018). We use $\epsilon_{th}$ = 0.98 for all the layers and increase the value of $\epsilon_{th}$ by 0.001 for each new task.
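The per-task threshold schedule described above is simple enough to state as code. The helper below is hypothetical (the paper does not name such a function); it just reproduces "start at 0.98, grow by 0.001 per task", capped at 1:

```python
# Hypothetical helper for the per-task threshold schedule described above.
def th_for_task(task_id, base=0.98, step=0.001):
    """Threshold for task `task_id` (0-indexed), capped at 1.0."""
    return min(base + step * task_id, 1.0)

print([round(th_for_task(t), 3) for t in range(5)])
# [0.98, 0.981, 0.982, 0.983, 0.984]
```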

C.4 BASELINE IMPLEMENTATIONS

GEM (Lopez-Paz & Ranzato, 2017), A-GEM (Chaudhry et al., 2019a), ER Res (Chaudhry et al., 2019b) and OWM (Zeng et al., 2018) are implemented from their respective official implementations. EWC and HAT are implemented from the official implementation provided by Serrà et al. (2018), while OGD is implemented by adapting the code provided by Bennani et al. (2020).

C.5 THRESHOLD HYPERPARAMETER

As discussed in sections 5 and 7, the threshold hyperparameter $\epsilon_{th}$ controls the degree of interference through the approximation of the space of significant representations of the past tasks. Since the characteristics of learned representations in a neural network vary across architectures and datasets, the same value of $\epsilon_{th}$ may not capture a similar space of significance everywhere. In our experiments we use $\epsilon_{th}$ in the range of 0.95 to 1. For the PMNIST experiment, as discussed in section 7, we use $\epsilon_{th}$ = 0.95 in the first layer and 0.99 in the other layers. For the split CIFAR-100 experiment, we use $\epsilon_{th}$ = 0.97 for all the layers and increase it by 0.003 for each new task. For the split miniImageNet experiment, we use $\epsilon_{th}$ = 0.985 for all the layers and increase it by 0.0003 for each new task. For the experiment with 5-Datasets, we use $\epsilon_{th}$ = 0.965 for all the layers across all the tasks.

As discussed in section 4, the gradient updates in fully connected layers lie in the span of the input vectors (x), whereas the gradient updates of the convolutional filters lie in the space spanned by the input patch vectors (p). Therefore, each basis stored in the GPM for a particular layer l has the same dimension as xˡ or pˡ. Thus, for any particular layer, the GPM matrix Mˡ can have a maximum size of size(xˡ) × size(xˡ) or size(pˡ) × size(pˡ). The maximum size of the GPM, which we refer to as GPM Max, is computed by including the GPM matrices (Mˡ) from all the layers; thus the size of GPM Max is fixed by the choice of network architecture. In Table 7, we show the maximum size of the GPM matrix (Mˡ) for each layer of the architectures used in our experiments, along with the size of GPM Max.

Results are listed in Table 11, where the baselines are reported from DGDMN. While EWC forgets catastrophically, we perform better than DGR (Shin et al., 2017), which employs data replay through old data generation.
DGDMN, an improved data replay method, outperforms all. In this setup, we believe a subset of old data replay either from storage or via generation is inevitable for attaining better performance with minimal forgetting (Rebuffi et al., 2017; Rajasegaran et al., 2019) . 
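The threshold hyperparameter discussed in section C.5 above admits a compact numerical illustration. Under our reading of the equation-(5) criterion (pick the smallest k whose top-k singular values capture an $\epsilon_{th}$ fraction of $\|R\|_F^2$; `num_bases` is a hypothetical helper, not the authors' code), raising the threshold stores more bases, trading plasticity for stability:

```python
import numpy as np

def num_bases(R, th):
    """Smallest k whose top-k singular values capture a th fraction of
    ||R||_F^2 (our reading of the criterion in equation (5))."""
    s2 = np.linalg.svd(R, compute_uv=False) ** 2
    return int(np.searchsorted(np.cumsum(s2), th * s2.sum()) + 1)

rng = np.random.default_rng(3)
# Representations with a fast-decaying spectrum: a few directions dominate.
R = rng.standard_normal((64, 200)) * (0.8 ** np.arange(64))[:, None]

for th in (0.90, 0.97, 0.99):
    print(th, num_bases(R, th))  # k grows monotonically with th
```

The monotone growth of k with the threshold is the mechanism behind the stability-plasticity trade-off and the faster filling of GPM for dissimilar tasks.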



Our code is available at https://github.com/sahagobinda/GPM



Figure 1: Illustration of convolution operation in matrix multiplication format during (a) Forward Pass and (b) Backward Pass.

Figure 2: (a) Memory utilization and (b) per-epoch training time for the PMNIST tasks for different methods. Memory utilization of different approaches for (c) CIFAR-100, (d) miniImageNet and (e) 5-Datasets tasks. For memory, the size of GPM Max, and for time, the method with the highest complexity, are used as references (value of 1). All the other methods are reported relative to these references.

Figure 3: Histograms of interference activations as a function of the threshold ($\epsilon_{th}$) at (a) Conv layer 2 and (b) FC layer 2 for split CIFAR-100 tasks. (c) Impact of $\epsilon_{th}$ on ACC (%) and BWT (%). With increasing values of $\epsilon_{th}$, the spread of interference reduces, which improves accuracy and reduces forgetting.

AND GRADIENT SPACES (CONT.)

function TRAIN(f_W, D^train, α, ε_th)
 2:   Initialize Mˡ ← [ ], for all l = 1, 2, ..., L   // till L-1 in the multi-head setting
 3:   M ← {(Mˡ) for l = 1, ..., L}
 4:   W ← W₀
 5:   for τ ∈ 1, 2, ..., T do
 ...:     sample a mini-batch, B_n, of size n from task τ
 8:       gradient, ∇_W L_τ ← SGD(B_n, f_W)
 9:       ∇_W L_τ ← PROJECT(∇_W L_τ, M)   // see equation (6)
 ...:     sample a mini-batch, B_{n_s}, of size n_s from task τ
15:       // construct representation matrices for each layer by forward pass (section 5)
16:       R_τ ← forward(B_{n_s}, f_W), where R_τ = {(Rˡ_τ) for l = 1, ..., L}
17:
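The projection step (line 9) can be sketched in NumPy for the fully connected case, where, following equation (6), the components of the gradient lying in the span of the stored bases are removed so the SGD step is orthogonal to the past tasks' input subspace. `project_gradient` is a hypothetical helper for illustration, not the official implementation:

```python
import numpy as np

def project_gradient(grad, M):
    """Remove gradient components lying in span(M), so the update is
    orthogonal to the stored past-task directions (fully connected case)."""
    if M.shape[1] == 0:
        return grad          # empty memory: no constraint yet
    return grad - grad @ (M @ M.T)

rng = np.random.default_rng(4)
M = np.linalg.qr(rng.standard_normal((10, 3)))[0]  # 3 orthonormal bases
grad = rng.standard_normal((4, 10))

g = project_gradient(grad, M)
# The projected gradient has no component along the stored directions.
print(np.allclose(g @ M, 0))  # True
```

Since M M^T can be precomputed once per task, each training step adds only a matrix multiply, which is consistent with the fast per-epoch training reported in section 7.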

Figure 4: Evolution of task 1 accuracy over the course of incremental learning of 20 sequential tasks from the miniImageNet dataset. Learned accuracy in our method remains stable throughout learning.

Continual learning on different datasets. Methods that do not adhere to the CL setup are indicated by (*). All the results are (re)produced by us and averaged over 5 runs. Standard deviations are reported in Tables 8 and 9 in the appendix.



Total wall-clock training time measured on a single GPU after learning all the tasks.

Continual learning of 20 tasks from the CIFAR-100 Superclass dataset. (†) denotes results reported from APD. (*) indicates methods that do not adhere to the CL setup. Single-task learning (STL), where a separate network is trained for each task, serves as an upper bound on accuracy.

Table 5 shows a summary of the datasets used in the experiments.

Dataset Statistics.

5-Datasets Statistics. For the datasets with monochromatic images, we replicate the image across all RGB channels so that the size of each image becomes 3 × 32 × 32.

This is the same architecture used by Serrà et al. (2018), with batch normalization added in each layer except the classifier layer. The network consists of 3 convolutional layers of 64, 128, and 256 filters with 4 × 4, 3 × 3, and 2 × 2 kernel sizes, respectively, plus two fully connected layers of 2048 units each. Rectified linear units are used as activations, with 2 × 2 max-pooling after the convolutional layers. Dropout of 0.2 is used for the first two layers and 0.5 for the rest.

For the ten sequential tasks in the PMNIST experiment, we computed the per-epoch training time for each task and report the average over all tasks. The training times for the different algorithms reported in Table 2(a) for the PMNIST tasks were measured on a single NVIDIA GeForce GTX 1060 GPU. For all the other datasets, the training times reported in Table 2(b) were measured on a single NVIDIA GeForce GTX 1080 Ti GPU.

List of hyperparameters for the baselines and our approach. Here, 'lr' represents the (initial) learning rate. In the table we represent PMNIST as 'perm', 10-Split CIFAR-100 as 'cifar', Split miniImageNet as 'minImg' and 5-Datasets as '5data'.

Size of GPM matrices for each layer for the architectures used in our experiments. Maximum sizes of the GPM in terms of the number of parameters are also given. For ResNet18 (17 layers + 3 short-cut connections), the per-layer sizes are: 27 × 27, 180 × 180, 180 × 180, 180 × 180, 180 × 180, 180 × 180, 360 × 360, 20 × 20, 360 × 360, 360 × 360, 360 × 360, 720 × 720, 40 × 40, 720 × 720, 720 × 720, 720 × 720, 1440 × 1440, 80 × 80, 1440 × 1440, 1440 × 1440; GPM Max is 8.98M parameters.
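As a sanity check, under the assumption that GPM Max is simply the sum of d × d over the per-layer basis dimensions listed for ResNet18, the total indeed lands at about 8.98M parameters:

```python
# Sum d*d over the ResNet18 per-layer GPM basis dimensions listed above.
dims = [27, 180, 180, 180, 180, 180, 360, 20, 360, 360,
        360, 720, 40, 720, 720, 720, 1440, 80, 1440, 1440]
gpm_max = sum(d * d for d in dims)
print(gpm_max / 1e6)  # ~8.98
```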

This contains the additional results for the PMNIST experiment in the single-epoch setting, along with the standard deviation values for the results shown in Table 1(a) for the multi-epoch (5-epoch) setting. The method that does not adhere to the CL setup is indicated by (*) in the table. Results are reported from 5 different runs.

Continual learning on PMNIST in the single-epoch and multi-epoch settings (ACC / BWT per setting):
… — … ± 0.27 / -0.05 ± 0.00 | 87.24 ± 0.53 / -0.11 ± 0.01
EWC — 88.27 ± 0.39 / -0.04 ± 0.01 | 89.97 ± 0.57 / -0.04 ± 0.01
GPM (ours) — 91.74 ± 0.15 / -0.03 ± 0.00 | 93.91 ± 0.16 / -0.03 ± 0.00

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation, Vannevar Bush Faculty Fellowship, Army Research Office, MURI, and by Center for Brain Inspired Computing (C-BRIC), one of six centers in JUMP, a Semiconductor Research Corporation program sponsored by DARPA.

D.3 CLASS-INCREMENTAL LEARNING

In this section, we evaluate our algorithm in a class-incremental learning setup (Rebuffi et al., 2017), where disjoint classes are learned one by one and classification is performed over all the learned classes without a task hint (Kamra et al., 2017). This setup is different (Hsu et al., 2018) from the (single-head/multi-head) evaluation setups used throughout this paper. This scenario is also very challenging and often infeasible for the regularization-based (HAT, EWC, etc.) and expansion-based (DEN, APD, etc.) methods, which do not use old data replay. Using an experimental setting similar to DGDMN (Kamra et al., 2017), we implemented the 'Digit dataset' experiment, where a single class (a single MNIST digit) is learned per task.

