ITERATIVE RELAXING GRADIENT PROJECTION FOR CONTINUAL LEARNING

Abstract

A critical capability for intelligent systems is to continually learn given a sequence of tasks. An ideal continual learner should be able to avoid catastrophic forgetting and effectively leverage past learned experiences to master new knowledge. Among continual learning algorithms, gradient projection approaches impose hard constraints on the optimization space for new tasks to minimize task interference, yet hinder forward knowledge transfer at the same time. Recent methods use expansion-based techniques to relax these constraints, but a growing network can be computationally expensive. It therefore remains an open challenge to improve forward knowledge transfer for gradient projection approaches within a fixed network architecture. In this work, we propose the Iterative Relaxing Gradient Projection (IRGP) framework. The basic idea is to iteratively search for the parameter subspaces most related to the current task and relax these parameters, then reuse the frozen spaces to facilitate forward knowledge transfer while consolidating previous knowledge. Our framework requires neither memory buffers nor extra parameters. Extensive experiments demonstrate the superiority of our framework over several strong baselines. We also provide theoretical guarantees for our iterative relaxing strategies.

1. INTRODUCTION

A critical capability for intelligent systems is to continually learn given a sequence of tasks (Thrun & Mitchell, 1995; McCloskey & Cohen, 1989). Unlike human beings, vanilla neural networks straightforwardly update their parameters to fit the current data distribution when learning new tasks, and thus suffer from catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990; Kirkpatrick et al., 2017). As a result, continual learning has gained increasing attention in recent years (Kurle et al., 2019; Ehret et al., 2020; Ramesh & Chaudhari, 2021; Liu & Liu, 2022; Teng et al., 2022). An ideal continual learner is expected not only to avoid catastrophic forgetting but also to facilitate forward knowledge transfer (Lopez-Paz & Ranzato, 2017), i.e., to leverage past learning experiences to master new knowledge efficiently and effectively (Parisi et al., 2019; Finn et al., 2019). Several types of methods have been proposed for continual learning. Replay-based methods (Lopez-Paz & Ranzato, 2017; Shin et al., 2017) alleviate catastrophic forgetting by storing some old samples in a memory buffer, since they are otherwise inaccessible when new tasks arrive, while expansion-based methods (Rusu et al., 2016; Yoon et al., 2017; 2019) expand the model structure to accommodate incoming knowledge. However, these methods require either extra memory buffers (Parisi et al., 2019) or a network architecture that grows as new tasks continually arrive (Kong et al., 2022), which often results in high computational costs (De Lange et al., 2021). To maintain a fixed network capacity, regularization-based methods (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018) penalize changes to parameters in proportion to their estimated importance via regularization terms.
While these regularization terms are applied to individual neurons, recent gradient projection methods (Zeng et al., 2019; Saha et al., 2021; Wang et al., 2021) instead modify the gradients in the feature space by constraining the directions of the gradient update, achieving outstanding performance. However, although gradient projection methods effectively mitigate forgetting within a fixed network capacity (Zeng et al., 2019), the ability to learn new tasks is hindered by the limited optimization space, resulting in insufficient forward knowledge transfer. In other words, constraining the directions of the gradient update sacrifices plasticity in the stability-plasticity dilemma (French, 1997).

Figure 1: Illustration of our proposed IRGP method and two baselines, GPM and TRGP. Blocks painted in different colors denote the parameters optimized after different tasks. The relaxing subspace within the frozen space is shown as painted stripes in our IRGP pipeline.
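To make the mechanism concrete, the core operation shared by these gradient projection methods can be sketched as follows: the gradient for a new task is projected onto the orthogonal complement of a subspace deemed important for previous tasks, so the update cannot disturb directions that earlier tasks rely on. This is a generic illustration, not the exact procedure of any single cited method.

```python
import numpy as np

def project_gradient(grad, frozen_basis):
    """Project a gradient onto the orthogonal complement of a frozen subspace.

    grad: (d,) gradient vector for one layer.
    frozen_basis: (d, k) matrix whose orthonormal columns span the
        subspace considered important for previous tasks.
    """
    # Remove the component lying inside the frozen subspace so the
    # update does not interfere with previously learned tasks.
    return grad - frozen_basis @ (frozen_basis.T @ grad)

# Toy example: freeze the first coordinate axis in R^3.
basis = np.array([[1.0], [0.0], [0.0]])   # orthonormal basis, k = 1
g = np.array([3.0, 2.0, 1.0])
g_proj = project_gradient(g, basis)
# g_proj has no component along the frozen direction: [0.0, 2.0, 1.0]
```

The stability-plasticity tension discussed above is visible here: every frozen direction removed from `g` is one fewer direction along which the new task can be optimized.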

…

Trust Region Gradient Projection (TRGP; Lin et al., 2022) tackles this problem by expanding the selected subspaces of old tasks as trust regions with scaled weight projection, similar to other expansion-based methods (Yoon et al., 2019). Despite substantial improvements, these methods are computationally expensive as a result of the growing network architecture (Wang et al., 2021). Therefore, insufficient forward knowledge transfer remains a key challenge for gradient projection methods. To address this challenge, we propose the Iterative Relaxing Gradient Projection (IRGP) framework to facilitate forward knowledge transfer within a fixed network capacity. We design a simple yet effective strategy to find the critical subspace within the frozen space. During the training phase, we iteratively reuse the parameters within the selected subspace. Instead of strictly freezing those parameters, our method explores a larger optimization space, which allows better forward knowledge transfer and thus achieves better performance on new tasks. The procedure of our approach is illustrated in Figure 1. Extensive experiments on various continual learning benchmarks demonstrate that our IRGP framework promotes forward knowledge transfer and achieves better classification performance than related state-of-the-art approaches. Moreover, our framework can also be extended into an expansion-based method by storing the parameters of the selected relaxing subspace, universally surpassing TRGP (Lin et al., 2022) and other expansion-based approaches. We also provide theoretical guarantees for the efficiency of our relaxing strategy.
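The relaxing idea can be illustrated with a minimal sketch. Here we use gradient alignment as a stand-in selection criterion: the frozen basis directions most aligned with the new task's gradient are released back into the optimization space, while the rest remain frozen. The selection rule and the names `relax_subspace` and `task_grad` are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def relax_subspace(frozen_basis, task_grad, k):
    """Release the k frozen directions most aligned with the new task's
    gradient so they can be optimized again (illustrative criterion)."""
    # Alignment of each frozen direction with the current task gradient.
    scores = np.abs(frozen_basis.T @ task_grad)
    relax_idx = np.argsort(scores)[-k:]           # most related directions
    keep_idx = np.setdiff1d(np.arange(frozen_basis.shape[1]), relax_idx)
    return frozen_basis[:, keep_idx], frozen_basis[:, relax_idx]

rng = np.random.default_rng(0)
# Four orthonormal frozen directions in an 8-dimensional layer space.
B = np.linalg.qr(rng.standard_normal((8, 4)))[0]
g = rng.standard_normal(8)
kept, relaxed = relax_subspace(B, g, k=1)
# Gradients are then projected against `kept` only, enlarging the
# optimization space by the span of `relaxed`.
```

Because `kept` has fewer columns than the original frozen basis, subsequent projected gradients retain more usable directions, which is the source of the improved plasticity described above.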

2. RELATED WORK

In this section, we review representative approaches for continual learning and briefly analyze how they differ from our method. Conceptually, these approaches can be roughly divided into the following four categories.

Replay-based methods: These methods maintain a complementary memory of old samples, which are replayed while learning novel tasks. GEM (Lopez-Paz & Ranzato, 2017) constrains gradients with respect to previous samples, and Chaudhry et al. (2018) further propose estimating the constraint with random samples to accelerate training. Since past samples are often inaccessible in the real world, auxiliary deep generative models have been deployed to synthesize pseudo data (Chenshen et al., 2018; Cong et al., 2020). Recent approaches (PourKeshavarzi et al., 2021; Choi et al., 2021) leverage a single model for both classification and pseudo-data generation. However, including extra data in the current task introduces excessive training time (De Lange et al., 2021), especially on long task sequences. Our approach requires no previous data; in other words, it is a replay-free method.

Expansion-based methods: Expansion-based methods dynamically allocate new parameters or modules to learn new tasks. Rusu et al. (2016) propose to incrementally introduce additional subnetworks, each with a fixed capacity. DEN (Yoon et al., 2017) selectively retrains the frozen model and expands only with necessary neurons. Moreover, Li et al. (2019) perform an explicit network architecture search to decide where to expand. APD (Yoon et al., 2019) further decomposes the network and utilizes sparse task-specific parameters. However, these methods inevitably face capacity explosion after learning a long sequence of tasks. In contrast, our approach maintains a fixed network architecture to avoid expensive model growth.
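The gradient constraint used by the replay-based line of work above can be sketched in a few lines. This follows the single-reference-gradient variant in the spirit of Chaudhry et al. (2018): if the new task's gradient would increase the loss on replayed samples (negative inner product with a reference gradient), it is projected back onto the feasible half-space. The function name is our own; this is a hedged illustration rather than the exact published algorithm.

```python
import numpy as np

def constrained_update(grad, ref_grad):
    """Correct the current gradient against one reference gradient
    computed on replayed samples (illustrative sketch)."""
    dot = grad @ ref_grad
    if dot >= 0:
        # No interference with the old tasks: keep the gradient as-is.
        return grad
    # Project onto the half-space {g : g @ ref_grad >= 0}.
    return grad - (dot / (ref_grad @ ref_grad)) * ref_grad

g = np.array([-1.0, 2.0])
g_ref = np.array([1.0, 0.0])      # reference gradient from memory
g_new = constrained_update(g, g_ref)   # -> [0.0, 2.0]
```

The dependence on `ref_grad`, which must be recomputed from stored samples at every step, is precisely the memory and training-time cost that our replay-free approach avoids.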

