GRADIENT PROJECTION MEMORY FOR CONTINUAL LEARNING

Abstract

The ability to learn continually without forgetting past tasks is a desired attribute for artificial learning systems. Existing approaches to enable such learning in artificial neural networks usually rely on network growth, importance-based weight updates, or replay of old data from memory. In contrast, we propose a novel approach in which a neural network learns new tasks by taking gradient steps orthogonal to the gradient subspaces deemed important for past tasks. We find the bases of these subspaces by analyzing network representations (activations) after learning each task with Singular Value Decomposition (SVD) in a single-shot manner, and store them in memory as Gradient Projection Memory (GPM). With qualitative and quantitative analyses, we show that such orthogonal gradient descent induces minimal to no interference with past tasks, thereby mitigating forgetting. We evaluate our algorithm on diverse image classification datasets with short and long task sequences and report better or on-par performance compared to state-of-the-art approaches.¹
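The single-shot SVD step described in the abstract can be sketched in numpy. This is an illustrative sketch, not the paper's exact procedure: the function name and the energy-threshold criterion for choosing how many singular vectors to keep are assumptions for illustration.

```python
import numpy as np

def compute_gpm_bases(activations, threshold=0.97):
    """Return orthonormal bases of the subspace capturing most of the
    activation energy for one task (single-shot SVD sketch).

    activations: (n_features, n_samples) matrix of layer representations
    collected after training on the task. The 0.97 energy threshold is
    an illustrative choice, not the paper's prescribed value.
    """
    U, S, _ = np.linalg.svd(activations, full_matrices=False)
    # keep the smallest k leading directions whose squared singular
    # values reach the `threshold` fraction of the total energy
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(energy, threshold)) + 1
    return U[:, :k]  # columns form the stored GPM bases
```

The returned columns are orthonormal by construction (they come from the left singular vectors), which is what later allows gradients to be projected onto their orthogonal complement with a single matrix product.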

1. INTRODUCTION

Humans exhibit a remarkable ability to continually adapt and learn new tasks throughout their lifetime while maintaining the knowledge gained from past experiences. In stark contrast, Artificial Neural Networks (ANNs) under such a Continual Learning (CL) paradigm (Ring, 1998; Thrun & Mitchell, 1995; Lange et al., 2021) forget the information learned in past tasks upon learning new ones. This phenomenon is known as 'Catastrophic Forgetting' or 'Catastrophic Interference' (Mccloskey & Cohen, 1989; Ratcliff, 1990). The problem is rooted in the general optimization methods (Goodfellow et al., 2016) that are used to encode the input data distribution into the parametric representation of the network during training. Upon exposure to a new task, gradient-based optimization methods, without any constraint, change the learned encoding to minimize the objective function with respect to the current data distribution. Such parametric updates lead to forgetting. Given a fixed-capacity network, one way to address this problem is to put constraints on the gradient updates so that task-specific knowledge can be preserved. To this end, Kirkpatrick et al. (2017), Zenke et al. (2017), Aljundi et al. (2018), and Serrà et al. (2018) add a penalty term to the objective function while optimizing for the new task. Such a term acts as a structural regularizer and dictates the degree of stability-plasticity of individual weights. Though these methods provide a resource-efficient solution to the catastrophic forgetting problem, their performance suffers when learning longer task sequences and when task identity is unavailable during inference. Approaches (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a) that store episodic memories of old data essentially solve an optimization problem with 'explicit' constraints on the new gradient directions so that losses for the old tasks do not increase. In Chaudhry et al. (2019b) the performance of old tasks is retained by taking gradient steps in the average gradient direction obtained from the new data and memory samples. To minimize interference, Farajtabar et al. (2020) store gradient directions (instead of data) of the old tasks and optimize the network in directions orthogonal to these gradients for the new task, whereas Zeng et al. (2018) update gradients orthogonal to the old input directions using projector matrices calculated iteratively during training. However, these methods either compromise data privacy by storing raw data or utilize resources poorly, which limits their scalability.
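The orthogonal-projection idea common to these approaches (and to GPM) reduces to one matrix product once the protected subspace has orthonormal bases. A minimal numpy sketch, assuming `bases` holds such orthonormal columns in memory (the function name is illustrative):

```python
import numpy as np

def project_orthogonal(grad, bases):
    """Remove from `grad` its component inside the subspace spanned by
    `bases` (orthonormal columns), so the update cannot disturb
    directions deemed important for past tasks."""
    return grad - bases @ (bases.T @ grad)
```

Since the result lies in the orthogonal complement of the stored subspace, a gradient step along it leaves the network's response in the protected directions (to first order) unchanged, which is the mechanism behind the minimal-interference claim.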



¹ Our code is available at https://github.com/sahagobinda/GPM




