ITERATIVE RELAXING GRADIENT PROJECTION FOR CONTINUAL LEARNING

Abstract

A critical capability for intelligent systems is to continually learn given a sequence of tasks. An ideal continual learner should avoid catastrophic forgetting and effectively leverage past learning experiences to master new knowledge. Among continual learning algorithms, gradient projection approaches impose hard constraints on the optimization space for new tasks to minimize task interference, yet hinder forward knowledge transfer at the same time. Recent methods use expansion-based techniques to relax these constraints, but a growing network can be computationally expensive. It therefore remains an open challenge to improve forward knowledge transfer for gradient projection approaches within a fixed network architecture. In this work, we propose the Iterative Relaxing Gradient Projection (IRGP) framework. The basic idea is to iteratively search for the parameter subspaces most related to the current task and relax these parameters, then reuse the frozen spaces to facilitate forward knowledge transfer while consolidating previous knowledge. Our framework requires neither memory buffers nor extra parameters. Extensive experiments demonstrate the superiority of our framework over several strong baselines. We also provide theoretical guarantees for our iterative relaxing strategies.

1. INTRODUCTION

A critical capability for intelligent systems is to continually learn given a sequence of tasks (Thrun & Mitchell, 1995; McCloskey & Cohen, 1989). Unlike human beings, vanilla neural networks straightforwardly update parameters toward the current data distribution when learning new tasks, suffering from catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990; Kirkpatrick et al., 2017). As a result, continual learning has been gaining increasing attention in recent years (Kurle et al., 2019; Ehret et al., 2020; Ramesh & Chaudhari, 2021; Liu & Liu, 2022; Teng et al., 2022). An ideal continual learner is expected to not only avoid catastrophic forgetting but also facilitate forward knowledge transfer (Lopez-Paz & Ranzato, 2017), i.e., to leverage past learning experiences to master new knowledge efficiently and effectively (Parisi et al., 2019; Finn et al., 2019).

Several families of methods have been proposed for continual learning. Replay-based methods (Lopez-Paz & Ranzato, 2017; Shin et al., 2017) alleviate catastrophic forgetting by storing some old samples in a memory buffer, since old data are inaccessible when new tasks arrive, while expansion-based methods (Rusu et al., 2016; Yoon et al., 2017; 2019) expand the model structure to accommodate incoming knowledge. However, these methods require either extra memory buffers (Parisi et al., 2019) or a network architecture that grows as new tasks continually arrive (Kong et al., 2022), which often incurs substantial computational costs (De Lange et al., 2021). To maintain a fixed network capacity, regularization-based methods (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018) penalize changes to parameters according to their importance to previous tasks via regularization terms.
While these regularization terms are applied to individual neurons, recent gradient projection methods (Zeng et al., 2019; Saha et al., 2021; Wang et al., 2021) instead modify the gradients in the feature space by constraining the directions of the gradient update, achieving outstanding performance. However, although gradient projection methods effectively mitigate forgetting within a fixed network capacity (Zeng et al., 2019), their capability to learn new tasks is hindered by the restricted optimization space, resulting in insufficient forward knowledge transfer. In other words, constraining the directions of the gradient update sacrifices plasticity in the stability-plasticity dilemma (French, 1997).
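To make the constraint concrete, the following is a minimal NumPy sketch of the core operation shared by gradient projection methods (in the spirit of GPM, Saha et al., 2021): the gradient for a new task is projected onto the orthogonal complement of a frozen subspace spanned by directions deemed important for previous tasks. The function name `project_gradient` and the variable names are illustrative, not the paper's actual implementation.

```python
import numpy as np

def project_gradient(grad, basis):
    """Remove the components of `grad` lying in the frozen subspace
    spanned by the orthonormal columns of `basis` (old-task directions)."""
    if basis is None or basis.shape[1] == 0:
        return grad
    # g' = g - M M^T g : updating along g' does not move activations
    # of previous tasks along the frozen directions.
    return grad - basis @ (basis.T @ grad)

rng = np.random.default_rng(0)
# Frozen subspace from previous tasks: two orthonormal directions in R^5
M, _ = np.linalg.qr(rng.standard_normal((5, 2)))
g = rng.standard_normal(5)        # raw gradient for the new task
g_proj = project_gradient(g, M)   # constrained update direction

# The projected gradient is orthogonal to every frozen direction,
# which is exactly why interference with old tasks is minimized
# and, simultaneously, why the optimization space shrinks as the
# frozen subspace grows with each task.
print(np.allclose(M.T @ g_proj, 0))  # True
```

As the frozen basis `M` accumulates columns over tasks, the feasible update directions form an ever-smaller subspace; this shrinking optimization space is precisely the plasticity limitation that motivates the relaxation strategy discussed above.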

