ITERATIVE RELAXING GRADIENT PROJECTION FOR CONTINUAL LEARNING

Abstract

A critical capability for intelligent systems is to continually learn from a sequence of tasks. An ideal continual learner should avoid catastrophic forgetting and effectively leverage past learning experience to master new knowledge. Among continual learning algorithms, gradient projection approaches impose hard constraints on the optimization space for new tasks to minimize task interference, but in doing so they hinder forward knowledge transfer. Recent methods relax these constraints with expansion-based techniques, yet a growing network can be computationally expensive. It therefore remains an open challenge to improve forward knowledge transfer for gradient projection approaches within a fixed network architecture. In this work, we propose the Iterative Relaxing Gradient Projection (IRGP) framework. The basic idea is to iteratively search for the parameter subspaces most related to the current task, relax these parameters, and reuse the frozen spaces to facilitate forward knowledge transfer while consolidating previous knowledge. Our framework requires neither memory buffers nor extra parameters. Extensive experiments demonstrate the superiority of our framework over several strong baselines, and we provide theoretical guarantees for our iterative relaxing strategies.

1. INTRODUCTION

A critical capability for intelligent systems is to continually learn given a sequence of tasks (Thrun & Mitchell, 1995; McCloskey & Cohen, 1989). Unlike human beings, vanilla neural networks straightforwardly update their parameters toward the current data distribution when learning new tasks, and thus suffer from catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990; Kirkpatrick et al., 2017). As a result, continual learning has gained increasing attention in recent years (Kurle et al., 2019; Ehret et al., 2020; Ramesh & Chaudhari, 2021; Liu & Liu, 2022; Teng et al., 2022). An ideal continual learner is expected not only to avoid catastrophic forgetting but also to facilitate forward knowledge transfer (Lopez-Paz & Ranzato, 2017), i.e., to leverage past learning experience to master new knowledge efficiently and effectively (Parisi et al., 2019; Finn et al., 2019). Several types of methods have been proposed for continual learning. Replay-based methods (Lopez-Paz & Ranzato, 2017; Shin et al., 2017) alleviate catastrophic forgetting by storing some old samples in a memory buffer, since the original data become inaccessible when new tasks arrive, while expansion-based methods (Rusu et al., 2016; Yoon et al., 2017; 2019) expand the model structure to accommodate incoming knowledge. However, these methods require either extra memory buffers (Parisi et al., 2019) or a network architecture that grows as new tasks continually arrive (Kong et al., 2022), which usually results in expensive computation costs (De Lange et al., 2021). To maintain a fixed network capacity, regularization-based methods (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018) penalize the change of parameters according to their estimated plasticity via regularization terms.
While these regularization terms are applied to individual neurons, recent gradient projection methods (Zeng et al., 2019; Saha et al., 2021; Wang et al., 2021) modify the gradients in the feature space by constraining the directions of the gradient update, which achieves outstanding performance. However, although gradient projection methods effectively mitigate forgetting within a fixed network capacity (Zeng et al., 2019), the capability of learning new tasks is hindered by the limited optimization space, resulting in insufficient forward knowledge transfer. In other words, constraining the directions of the gradient update sacrifices the plasticity side of the stability-plasticity dilemma (French, 1997).

Figure 1: Illustration of our proposed IRGP method and two baselines, GPM and TRGP. Blocks painted in different colors denote the parameters optimized after different tasks. The painted stripes in our IRGP pipeline denote the relaxing subspace within the frozen space.

…

Trust Region Gradient Projection (TRGP) (Lin et al., 2022) tackles this problem by expanding the selected subspaces of old tasks into trust regions with scaled weight projection, similar to other expansion-based methods (Yoon et al., 2019). Despite substantial improvement, these methods are computationally expensive because of the growing network architecture (Wang et al., 2021). Therefore, insufficient forward knowledge transfer remains a key challenge for gradient projection methods. To address this challenge, we propose the Iterative Relaxing Gradient Projection (IRGP) framework to facilitate forward knowledge transfer within a fixed network capacity. We design a simple yet effective strategy to find the subspace within the frozen space that is most critical for the current task. During the training phase, we iteratively reuse the parameters within the selected subspace. Instead of strictly freezing those parameters, our method explores a larger optimization space, which allows better forward knowledge transfer and thus achieves better performance on new tasks. The procedure of our approach is illustrated in Figure 1. Extensive experiments on various continual learning benchmarks demonstrate that our IRGP framework promotes forward knowledge transfer and achieves better classification performance than related state-of-the-art approaches. Moreover, our framework can also be extended into an expansion-based method by storing the parameters of the selected relaxing subspace, universally surpassing TRGP (Lin et al., 2022) and other expansion-based approaches. We also provide theoretical proofs that guarantee the efficiency of our relaxing strategy.

2. RELATED WORK

In this section, we review representative approaches for continual learning and briefly analyze their differences from our method. Conceptually, these approaches can be roughly divided into the following four categories. Replay-based methods: These methods maintain a complementary memory of old samples, which are replayed while learning novel tasks. GEM (Lopez-Paz & Ranzato, 2017) constrains gradients with respect to previous samples, and Chaudhry et al. (2018) further propose to estimate the constraint with random samples for acceleration. Since past samples are commonly inaccessible in the real world, auxiliary deep generative models have been deployed to synthesize pseudo data (Chenshen et al., 2018; Cong et al., 2020), and recent approaches (PourKeshavarzi et al., 2021; Choi et al., 2021) leverage a single model for both classification and pseudo data generation. However, including extra data in the current task introduces excessive training time (De Lange et al., 2021), especially on long task sequences. Our approach requires no previous data; in other words, it is a replay-free method. Expansion-based methods: Expansion-based methods dynamically allocate new parameters or modules to learn new tasks. Rusu et al. (2016) propose to incrementally introduce additional sub-networks with a fixed capacity. DEN (Yoon et al., 2017) selectively retrains the frozen model and expands only with the necessary neurons. Moreover, Li et al. (2019) perform an explicit network architecture search to decide where to expand, and APD (Yoon et al., 2019) further decomposes the network and utilizes sparse task-specific parameters. However, these methods inevitably face capacity explosion after learning a long sequence of tasks. In contrast, our approach maintains a fixed network architecture to avoid expensive model growth. Regularization-based methods: Methods in this category introduce extra regularization terms into the objective function to penalize the modification of parameters.
EWC (Kirkpatrick et al., 2017) first proposes to constrain parameter changes based on importance weights approximated by the Fisher Information Matrix. MAS (Aljundi et al., 2018) measures importance by the sensitivity of the model outputs in an unsupervised setting. Other methods, also called parameter-isolation methods, defy catastrophic forgetting by freezing the gradient updates of particular parameters (De Lange et al., 2021). PackNet (Mallya & Lazebnik, 2018) iteratively prunes and allocates parameter subsets to the corresponding tasks, whereas HAT (Serra et al., 2018) learns task-based hard attention to identify important parameters. Instead of restricting individual parameters with estimated importance, the main idea of our approach is to constrain the direction of the gradients. Gradient projection methods: Gradient projection methods directly constrain the gradients to overcome catastrophic forgetting, and our approach belongs to this category. Mehta et al. (2021) implicitly expand the model with respect to the frozen space, and GEM (Lopez-Paz & Ranzato, 2017) utilizes complementary memory to restrict the update; in contrast, our approach requires neither storing old samples nor expanding the network. OWM (Zeng et al., 2019) first proposes to modify the gradients with projector matrices. OGD (Farajtabar et al., 2020) keeps the gradients orthogonal to the space spanned by previous gradients, whereas GPM (Saha et al., 2021) computes the frozen space based on old data. NCL (Kao et al., 2021) combines gradient projection with Bayesian weight regularization to mitigate catastrophic forgetting. Despite minimizing backward interference, these approaches suffer from poor forward knowledge transfer and lack plasticity (Kong et al., 2022). TRGP (Lin et al., 2022) expands the model with trust regions based on task relationships to achieve better performance on new tasks.
In contrast, we focus on facilitating forward knowledge transfer within a fixed capacity network by iteratively relaxing frozen regions with constraints.

3.1. PRELIMINARIES

In a continual learning setting, we consider $T$ tasks arriving as a sequence. The dataset of task $t$ is denoted as $\mathcal{D}^{(t)} = \{x_i^{(t)}, y_i^{(t)}\}_{i=1}^{N_t}$, where $N_t$ is the number of samples. When learning the current task, the datasets of old tasks are inaccessible. We use an $L$-layer neural network with fixed capacity, whose parameters are defined as $\mathcal{W} = \{W^l\}_{l=1}^{L}$, where $W^l$ denotes the parameters of the $l$-th layer. The model is optimized by minimizing the objective function

$$L(\mathcal{W}, \mathcal{D}^{(t)}) = \frac{1}{N_t} \sum_{i=1}^{N_t} L_t\big(f(x_i^{(t)}; \mathcal{W}), y_i^{(t)}\big), \qquad (1)$$

where $L_t$ is the loss function for task $t$. Gradient projection methods mitigate catastrophic forgetting by only updating the model in directions orthogonal to the frozen spaces. Saha et al. (2021) propose to compute the frozen spaces based on the inputs of each layer. For task $t$, the frozen gradient spaces of the first $t-1$ tasks are denoted as $U_{t-1} = \{U_{t-1}^l\}_{l=1}^{L}$, where $U_{t-1}^l$ is the frozen space of layer $l$. During the training phase, for each layer $l$, the gradient $g_t^l$ is constrained to be orthogonal to $U_{t-1}^l$. Particularly, letting $B_{t-1}^l = [u_{t-1,1}^l, \dots, u_{t-1,N}^l]$ collect the $N$ basis vectors of $U_{t-1}^l$, the gradient is modified as:

$$g_t^l = g_t^l - \mathrm{Proj}_{U_{t-1}^l}(g_t^l) = g_t^l - g_t^l B_{t-1}^l (B_{t-1}^l)^T. \qquad (2)$$

After learning task $t$ and obtaining the model $\mathcal{W}_t = \{W_t^l\}_{l=1}^{L}$, the singular value decomposition of the representation matrix of each layer is computed, and the smallest $k$ is chosen according to the criterion

$$\|\Sigma_t^l[0:k]\|_F^2 \ge \epsilon_{th}^l \, \|\Sigma_t^l\|_F^2 \qquad (3)$$

to construct the significant representation space $R_t^l = \mathrm{span}\{U_{H,t}^l[0:k]\}$, where $\|\cdot\|_F$ denotes the Frobenius norm. The significant representation spaces, regarded as the frozen spaces for the current task $t$, are then merged into the overall frozen gradient spaces of the first $t$ tasks:

$$U_t = \{U_t^l\}_{l=1}^{L} = \{U_{t-1}^l \cup R_t^l\}_{l=1}^{L}.$$

Although freezing the gradient update significantly mitigates catastrophic forgetting, the limited optimization space hinders forward knowledge transfer and compromises the performance on new tasks.
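To make the projection step concrete, here is a minimal NumPy sketch of the gradient projection in Eq. (2) and the subspace-selection criterion in Eq. (3). The function names, threshold value, and toy dimensions are our own illustration (written in column-vector convention), not the authors' code.

```python
import numpy as np

def significant_subspace(R, eps_th=0.97):
    """Pick the smallest k left singular vectors of the representation
    matrix R whose squared singular values retain a fraction eps_th of
    the total squared Frobenius norm, mirroring criterion (3)."""
    U, S, _ = np.linalg.svd(R, full_matrices=False)
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(energy, eps_th) + 1)
    return U[:, :k]                       # basis B of the frozen subspace

def project_orthogonal(g, B):
    """Eq. (2) in column-vector form: remove from gradient g its
    component lying inside span(B)."""
    return g - B @ (B.T @ g)

rng = np.random.default_rng(0)
R = rng.standard_normal((10, 30))         # toy layer-input representations
B = significant_subspace(R, eps_th=0.9)
g = rng.standard_normal(10)
g_proj = project_orthogonal(g, B)
print(np.abs(B.T @ g_proj).max())         # ~0: orthogonal to the frozen space
```

Updating only with `g_proj` is what keeps new-task training from interfering with the directions already used by old tasks.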
TRGP (Lin et al., 2022) tackles this problem by selecting old tasks relevant to the current task and expanding the corresponding frozen spaces into trust regions. The scaled weight projection is further designed to update and store the parameters within the trust regions in a memory-efficient way by scaling the basis instead of directly changing the parameters. Supposing task $i$ is selected as a trust region, the scaled weight projection is:

$$\mathrm{Proj}_{U_i^l}^{S_i^l}(g_t^l) = g_t^l B_i^l S_i^l (B_i^l)^T, \qquad (5)$$

where $S_i^l$ denotes the scale matrix. The parameters in the trust regions are retrained with the scaled weight projection, and the learned scale matrices are stored in memory for the inference phase. Particularly, during the forward pass, the parameters are modified with the scale matrices as:

$$W_t^l = \mathrm{Proj}_{(U_i^l)^{\perp}}(W_t^l) + \mathrm{Proj}_{U_i^l}^{S_i^l}(W_t^l) = W_t^l - \mathrm{Proj}_{U_i^l}(W_t^l) + \mathrm{Proj}_{U_i^l}^{S_i^l}(W_t^l), \qquad (6)$$

where $(\cdot)^{\perp}$ denotes the orthogonal complement. However, as more tasks arrive, storing the scale matrices introduces an increasing number of extra parameters. Our experiments demonstrate that TRGP requires around 5000% extra parameters relative to the network architecture after learning 20 tasks on MiniImageNet; see Figure 3-(c). Therefore, we propose our Iterative Relaxing Gradient Projection framework to facilitate forward knowledge transfer while maintaining a fixed network capacity by wisely reusing parameters within the frozen space.
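The scaled weight projection of Eqs. (5)-(6) can be sketched as follows; this is our own matrix-convention rendering for illustration, not TRGP's released code.

```python
import numpy as np

def scaled_weight_projection(W, B, S):
    """Eq. (6): replace the component of W lying in span(B) with a version
    rescaled by the learnable scale matrix S. B has orthonormal columns;
    S is square with size equal to the number of columns of B."""
    return W - W @ B @ B.T + W @ B @ S @ B.T

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))
B, _ = np.linalg.qr(rng.standard_normal((6, 2)))  # 2-dim trust-region basis
print(np.allclose(scaled_weight_projection(W, B, np.eye(2)), W))  # True
```

With $S$ equal to the identity the projection is a no-op: only the learned deviation of $S$ from the identity changes the effective weights, which is why storing $S$ per task suffices.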

3.2. RELAXING SUBSPACE SEARCHING

We first design a searching strategy that determines which part of the frozen space to relax, based on an importance estimate characterized by the angle to the representation space spanned by the current gradients $g_t^l$. The angle between a space and a vector is given in Definition 3.1.

Definition 3.1. (Angle between vector and space) We denote the angle between two inputs as $\Theta(\cdot)$ and the inner product between two vectors as $\langle \cdot \rangle$. The angle between a vector $v \in \mathbb{R}^n$ and a space $U^{n \times c} \subset \mathbb{R}^n$ is defined as the minimum angle between $v$ and any unit vector $u \in U$:

$$\Theta(v, U) = \arccos \max_{u \in U} \frac{\langle v, u \rangle}{\|v\|}. \qquad (7)$$

Moreover, given the threshold $\gamma_t^l$, we call a vector $d$ relaxable when

$$\Theta(d, R_{g,t}^l) \le \gamma_t^l, \qquad (8)$$

where $R_{g,t}^l$ is constructed by compressing $g_t^l$ with Equation (3). For task $t$, we aim to find the relaxing subspace $V_t^l \subseteq U_{t-1}^l$ spanned entirely by relaxable vectors from $U_{t-1}^l$. In practice, we implement this with the modulus of the projection, namely:

$$\min_{v \in V_t^l} \|\mathrm{Proj}_{R_{g,t}^l}(v)\|_F \ge \zeta_t^l \|v\|_F, \qquad \max_{u \in U_{t-1}^{l,c}} \|\mathrm{Proj}_{R_{g,t}^l}(u)\|_F < \zeta_t^l \|u\|_F, \qquad (9)$$

where $\zeta_t^l = \cos \gamma_t^l$ is the threshold and $U_{t-1}^{l,c} = U_{t-1}^l \setminus V_t^l$ denotes the complement of $V_t^l$ with respect to $U_{t-1}^l$. The above criterion guarantees $\max_{u \in V_t^l} \Theta(u, R_{g,t}^l) \le \gamma_t^l$ and $\min_{v \in U_{t-1}^{l,c}} \Theta(v, R_{g,t}^l) > \gamma_t^l$. However, it is hard to construct $V_t^l$ directly from $U_{t-1}^l$. Therefore, we propose a simple yet efficient strategy to find the relaxing subspaces. With $V_t^l$ initialized as $\emptyset$, we select the vector in $U_{t-1}^{l,c}$ closest to $R_{g,t}^l$ by $\arg\min_{d \in U_{t-1}^{l,c}} \Theta(d, R_{g,t}^l)$. The selected vector $d$ is appended to the basis of $V_t^l$ if it satisfies criterion (8). We repeat this procedure until no relaxable vector is left, obtaining the target $V_t^l$. The pseudo-code of our searching strategy is provided in Algorithm 1.
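Definition 3.1 admits a closed form when $U$ is given by an orthonormal basis $B$: the maximizing unit vector is the normalized projection of $v$, so $\cos \Theta(v, U) = \|B^T v\| / \|v\|$. A small NumPy sketch (our own illustration, with toy dimensions):

```python
import numpy as np

def angle_to_subspace(v, B):
    """Theta(v, U) of Definition 3.1 for U = span(B), B orthonormal:
    the cosine of the angle is ||B^T v|| / ||v||."""
    cos = np.linalg.norm(B.T @ v) / np.linalg.norm(v)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def is_relaxable(v, B, zeta):
    """Modulus form used in criterion (9): v is relaxable iff
    ||Proj(v)|| >= zeta * ||v||, with zeta = cos(gamma)."""
    return np.linalg.norm(B.T @ v) >= zeta * np.linalg.norm(v)

rng = np.random.default_rng(2)
B, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # 2-dim subspace of R^5
v = rng.standard_normal(5)
v_orth = v - B @ (B.T @ v)                        # orthogonal component of v
print(angle_to_subspace(B[:, 0], B))              # ~0: lies inside the subspace
print(angle_to_subspace(v_orth, B))               # ~pi/2
print(is_relaxable(B[:, 0], B, zeta=0.9))         # True
```

Working with the projection modulus rather than the angle itself avoids the `arccos` entirely during the search, which is exactly what the threshold $\zeta_t^l = \cos \gamma_t^l$ enables.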
Considering the scope of our procedure, we further provide a theoretical analysis of the upper bound on the dimension of the selected subspace $V_t^l$, which is also the number of iterations. Lemma 3.2 and Theorem 3.3 below guarantee that the dimension of $V_t^l$ is no more than the dimension of the representation subspace. Detailed proofs are provided in Appendix A.1 and A.2.

Algorithm 1 Relaxing Subspace Searching

Input: gradients $\{g_t^l\}_{l=1}^L$, frozen subspaces $\{U_{t-1}^l\}_{l=1}^L$, and thresholds $\{\epsilon_{th}^l, \gamma_t^l\}_{l=1}^L$
Output: relaxing subspaces $\{V_t^l\}_{l=1}^L$
1: for $l = 1, \dots, L$ do
2:   Construct the significant representation space $R_{g,t}^l$ from the gradients $g_t^l$ by Equation (3).
3:   $V_t^l \leftarrow \emptyset$
4:   repeat
5:     $d \leftarrow \arg\min_{d \in U_{t-1}^{l,c}} \Theta(d, R_{g,t}^l)$
6:     if $\Theta(d, R_{g,t}^l) \le \gamma_t^l$ then
7:       $V_t^l \leftarrow V_t^l \cup d$
8:       $U_{t-1}^{l,c} \leftarrow U_{t-1}^l \setminus V_t^l$
9:     end if
10:  until $\Theta(d, R_{g,t}^l) > \gamma_t^l$
11: end for

Lemma 3.2. Denote the relaxed subspace as $V = \mathrm{span}\{v_1, v_2, \dots, v_N\}$, where $v_N$ is the last basis vector included in $V$. Given a representation subspace $U$, $\forall v \in V$, we have $\Theta(v, U) \le \Theta(v_N, U)$.

Theorem 3.3. Denote $k_p$ as the dimension of the representation subspace and $k_l$ as the dimension of the relaxed subspace. The upper bound of $k_l$ is $k_p$, regardless of the frozen subspace.

Moreover, according to Theorem 3.4, our strategy is guaranteed to find the maximum space within the whole solution set satisfying criterion (9), which further substantiates the efficiency of our searching strategy. The corresponding proof is included in Appendix A.3.

Theorem 3.4. The relaxed subspace obtained by Algorithm 1 takes up the maximum subspace of the whole solution set.

To further validate our searching strategy, we propose IRGP-Exp, a modified version of IRGP that directly stores the parameters in the relaxed subspaces. For task $t$, we retrieve the corresponding relaxed subspaces $\{V_t^l\}_{l=1}^L$ and the scale matrices $\{S_t^l\}_{l=1}^L$ during the inference phase, similar to TRGP. The modified parameters $W_{t,I}^l$ used for inference on task $t$ are:

$$W_{t,I}^l = W^l - \mathrm{Proj}_{V_t^l}(W^l) + \mathrm{Proj}_{V_t^l}^{S_t^l}(W^l), \qquad (10)$$

where $W^l$ denotes the parameters of layer $l$ of the current network. By replacing the parameters in the relaxed subspaces with the parameters optimized for task $t$, the model achieves better performance.
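The greedy loop of Algorithm 1 can be sketched in NumPy as follows. This is our own rendering under the assumption that all bases are stored as orthonormal columns; in that case the direction in the complement closest to the gradient space is the top right singular vector of $R^T U_c$, whose singular value is the cosine of the minimal angle.

```python
import numpy as np

def relaxing_subspace(U_frozen, R, gamma):
    """Greedy search of Algorithm 1 (illustrative sketch, not the authors'
    code). U_frozen: orthonormal basis of the frozen space (columns).
    R: orthonormal basis of the significant gradient space R^l_{g,t}.
    Returns an orthonormal basis V of the relaxing subspace."""
    Uc = U_frozen.copy()           # complement; shrinks as vectors move to V
    V = []
    while Uc.shape[1] > 0:
        # closest direction in span(Uc) to span(R): top right singular
        # vector of R^T Uc; its singular value is cos(min angle)  (line 5)
        _, s, Wt = np.linalg.svd(R.T @ Uc)
        if np.arccos(np.clip(s[0], 0.0, 1.0)) > gamma:
            break                  # no relaxable vector left  (line 10)
        d = Uc @ Wt[0]             # unit vector, since Uc is orthonormal
        V.append(d)
        # remove d from the complement and re-orthonormalize  (line 8)
        Uc = Uc - np.outer(d, d @ Uc)
        Q, s2, _ = np.linalg.svd(Uc, full_matrices=False)
        Uc = Q[:, s2 > 1e-10]
    return np.stack(V, axis=1) if V else np.zeros((U_frozen.shape[0], 0))

e = np.eye(6)
# frozen space spanned by e1 and e3; gradient space spanned by e1 and e2
V = relaxing_subspace(e[:, [0, 2]], e[:, :2], gamma=0.1)
print(V.shape)   # (6, 1): only the e1 direction is close enough to relax
```

Consistent with Theorem 3.3, the loop can add at most `R.shape[1]` directions: once the relaxable directions are exhausted, every remaining complement vector has an angle above `gamma`.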

3.3. ITERATIVE MODIFYING THE SCALE MATRIX WITH CONSTRAINTS

After obtaining the relaxed subspaces, we want to retrain the parameters inside them while consolidating previous knowledge, so as to facilitate forward knowledge transfer within a fixed network capacity. One direct way is to fine-tune the parameters with regularization such as EWC (Kirkpatrick et al., 2017). However, such regularization terms are designed for explicit parameters and are not applicable to the implicit subspaces in our framework. Therefore, we instead introduce the scaled weight projection (Lin et al., 2022) to modify explicit parameters. With scaled weight projection, we fine-tune the parameters within $V_t^l$ by adding a regularization term on the scaling matrices $S_t = \{S_t^l\}_{l=1}^L$ rather than directly on the target parameters. Specifically, the objective function of task $t$ is:

$$L_t = L(\mathcal{W}_t, \mathcal{D}^{(t)}) + \sum_{l=1}^{L} \beta^l \|S_t^l - \mathbb{1}(S_t^l)\|_2^2, \qquad (11)$$

where $\mathbb{1}(\cdot)$ denotes the identity matrix with the size of the rank of the input matrix and $\beta^l$ is the weight of the regularization term for layer $l$. During back-propagation, gradients within the frozen space $U_{t-1}^l$ are eliminated and parameters within $V_t^l$ are modified through $S_t^l$. Generally, in our Iterative Relaxing Gradient Projection framework, we adopt our searching strategy to determine the relaxed subspaces and modify those parameters with constraints on the scaling matrices. However, during the training phase, the direction of the gradients shifts sharply and frequently due to the steep loss landscape of deep neural networks, so diverse subspaces would be selected at different training stages. Therefore, we iteratively execute Algorithm 1 during training until no extra subspace is required. Particularly, for each task, our model is first optimized for a limited number of epochs. Then we search for the target relaxing subspace and examine whether a new subspace exists within the remaining frozen space.
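The regularizer in Eq. (11) simply pulls each scaling matrix toward the identity. A minimal sketch of the penalty term (the weights and shapes are illustrative, not taken from the paper's configuration):

```python
import numpy as np

def scale_regularizer(S_list, beta_list):
    """Penalty of Eq. (11): sum_l beta_l * ||S^l - I||^2, pulling each
    scaling matrix toward the identity so the relaxed parameters stay
    close to their consolidated values."""
    return sum(b * np.sum((S - np.eye(S.shape[0])) ** 2)
               for S, b in zip(S_list, beta_list))

S1 = np.eye(3)            # untouched scaling matrix: zero penalty
S2 = np.eye(2) + 0.1      # every entry perturbed by 0.1
print(scale_regularizer([S1, S2], [1.0, 1.0]))   # ~0.04
```

During training this term is added to the task loss, so the optimizer trades off fitting the new task against drifting the relaxed directions away from their consolidated state.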
If extra frozen subspace is released, the size of the scaling matrix needs to be modified, whereas TRGP maintains a fixed-size scaling matrix throughout training. Thus, to accommodate the growing relaxing subspace, we propose to expand the scaling matrix with an identity block of the corresponding size:

$$S_t^{l,\mathrm{new}} = \begin{bmatrix} S_t^{l,\mathrm{old}} & 0 \\ 0 & \mathbb{1}(V_t^{l,\mathrm{new}}) \end{bmatrix},$$

where $V_t^{l,\mathrm{new}}$ denotes the newly included relaxing subspace. If there is no extra relaxing subspace, we optimize our model thoroughly on the current task. After training, the parameters within $V_t^l$ are further consolidated by Equation (6) and the scaling matrices are reset to identity matrices. The pseudo-code of our Iterative Relaxing Gradient Projection framework is provided in Appendix D.
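The block-diagonal growth of the scaling matrix can be sketched as follows (an illustrative helper of our own, not the authors' code):

```python
import numpy as np

def expand_scale_matrix(S_old, k_new):
    """Grow the scaling matrix when k_new extra relaxing directions are
    released mid-training: append an identity block so the newly released
    directions start unscaled, matching the displayed block-diagonal update."""
    k_old = S_old.shape[0]
    S_new = np.eye(k_old + k_new)
    S_new[:k_old, :k_old] = S_old
    return S_new

S = np.array([[1.2, 0.0],
              [0.1, 0.9]])
print(expand_scale_matrix(S, 1))
```

Starting the new block at the identity means the newly released directions initially behave exactly as they did while frozen, so expanding the matrix never perturbs the model mid-training.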

4.1. EXPERIMENTAL SETUP

Datasets: We evaluate our framework on five datasets. Following Saha et al. (2021), we conduct experiments on CIFAR-100 Split (Krizhevsky & Hinton, 2009), MiniImageNet (Vinyals et al., 2016), Permuted MNIST (PMNIST) (Kirkpatrick et al., 2017), and CIFAR-100 Sup (Yoon et al., 2019). Moreover, Serra et al. (2018) first propose Mixture, consisting of CIFAR-10 (Krizhevsky & Hinton, 2009), MNIST (LeCun et al., 1998), CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), FashionMNIST (Xiao et al., 2017), TrafficSigns (Stallkamp et al., 2011), FaceScrub (Ng & Winkler, 2014), and NotMNIST (Bulatov, 2011). Here we evaluate our framework on Mixture as a sequence of seven tasks, excluding TrafficSigns. Details and statistics of the datasets can be found in Appendix B.1, and the network architectures are described in Appendix B.2. Baselines: We compare our approach with competitive and well-established approaches that maintain a fixed network capacity, following Saha et al. (2021). We adopt ER Res (Chaudhry et al., 2019) and A-GEM (Chaudhry et al., 2018) as representative replay-based methods; the memory buffer sizes for PMNIST, CIFAR-100 Split, MiniImageNet, and Mixture are 1000, 2000, 500, and 3000, respectively. For gradient projection approaches, we consider OWM (Zeng et al., 2019) and GPM (Saha et al., 2021). For regularization approaches, we compare against EWC (Kirkpatrick et al., 2017) and the state-of-the-art HAT (Serra et al., 2018). We also include the "multitask" baseline that jointly trains all tasks in a single network, which is commonly considered an upper bound for continual learning. Other implementation details are listed in Appendix B.3. We exclude expansion-based methods from the main experiments since they use continually growing architectures, which is out of the scope of our work.

Metrics:

We first employ two standard evaluation metrics: Average Accuracy (ACC) (Mirzadeh et al., 2020) and Backward Transfer (BWT) (Lopez-Paz & Ranzato, 2017). Denote $A_{i,j}$ as the test accuracy on task $j$ after learning task $i$. ACC is the average test accuracy evaluated after learning all tasks, defined as $\mathrm{ACC} = \frac{1}{T}\sum_{i=1}^{T} A_{T,i}$. BWT is the average accuracy decrease caused by learning subsequent tasks, defined as $\mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1} (A_{T,i} - A_{i,i})$. To evaluate forward knowledge transfer, we further introduce Forward Transfer (FWT) (Lopez-Paz & Ranzato, 2017) and $\Omega_{\mathrm{new}}$ (Kemker et al., 2018). FWT reflects the influence of the observed tasks on new tasks in a zero-shot manner, while $\Omega_{\mathrm{new}}$ indicates the capability of acquiring new tasks. The detailed definitions are provided in Appendix B.4. In this paper, we mainly focus on $\Omega_{\mathrm{new}}$ among the three transfer metrics, and results on FWT are provided as well. Generally, the larger the ACC, the better the approach; forward and backward knowledge transfer evaluate the capabilities of learning and memorizing, respectively.
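The two formulas translate directly into code; here is a small NumPy sketch with a toy accuracy matrix (the numbers are illustrative, not experimental results):

```python
import numpy as np

def acc_bwt(A):
    """A: T x T matrix where A[i, j] is the test accuracy on task j after
    learning task i (only the lower triangle j <= i is meaningful).
    Returns (ACC, BWT) as defined in the text."""
    T = A.shape[0]
    ACC = A[T - 1, :].mean()
    BWT = np.mean([A[T - 1, i] - A[i, i] for i in range(T - 1)])
    return ACC, BWT

# toy 3-task example with slight forgetting on earlier tasks
A = np.array([[0.90, 0.00, 0.00],
              [0.88, 0.85, 0.00],
              [0.86, 0.84, 0.80]])
ACC, BWT = acc_bwt(A)
print(round(ACC, 4), round(BWT, 4))   # 0.8333 -0.025
```

A negative BWT quantifies forgetting: each earlier task's final accuracy is compared against its accuracy right after it was learned.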

4.2. MAIN RESULTS

We show the comparative results on four benchmarks in Table 1. The experiments on Mixture are implemented by us, while the other results are taken from Saha et al. (2021). We run each experiment five times and report the mean results. Implementation details are included in Appendix B.3, and detailed results including forward transfer can be found in Appendix C.2. As shown in Table 1, our approach obtains the best accuracy with comparable forgetting across all datasets. Compared with the replay-based methods A-GEM and ER Res, IRGP achieves at least around 2% higher ACC with less forgetting. Among regularization-based methods, IRGP significantly dominates EWC across all benchmarks and outperforms HAT on MiniImageNet and PMNIST; although HAT obtains less forgetting on the other two datasets, IRGP gains 1% better ACC on average. For gradient projection methods, we observe that IRGP achieves around 1% higher ACC than GPM on CIFAR-100 Split and MiniImageNet with comparable forgetting. On PMNIST and Mixture, IRGP improves the accuracy with less forgetting, reducing BWT by 1% compared with GPM. The average accuracy after learning each task on CIFAR-100 Split, shown in Figure 2-(a), further validates that IRGP universally outperforms GPM. We include the detailed results on other benchmarks in Appendix C.1. Moreover, we compare the accuracy evolution of specific tasks over the task sequence against GPM, which achieves the highest accuracy among the selected baselines according to Table 1. Here we show the results of the second task on CIFAR-100 Split in Figure 2-(b), and we present the results of three randomly selected tasks on all benchmarks in Appendix C.3. We notice that IRGP achieves better accuracy right after learning a new task, in other words, a better $\Omega_{\mathrm{new}}$, which is exactly the purpose of our relaxing strategy. Without forward knowledge transfer, approaches may exhibit limited performance even with less forgetting (Lopez-Paz & Ranzato, 2017).
Thus, we examine the accuracy tested right after learning each task. As shown in Figure 2-(c), our approach achieves 2.7% better average accuracy than GPM on the CIFAR-100 Split setting. Results on other benchmarks, provided in Appendix C.2, further substantiate this phenomenon. As tasks keep coming, the accumulated frozen spaces lead to a shrinking optimization space for GPM. In contrast, IRGP explores larger optimization spaces by relaxing previous frozen spaces, and thus achieves better forward knowledge transfer by implicitly reusing the weights within the relaxed subspaces. In brief, our approach universally outperforms the selected baselines on all datasets within a fixed capacity. With comparable forgetting, IRGP achieves better forward knowledge transfer through larger optimization spaces than GPM. To validate the efficiency of our relaxing strategy, we further compare IRGP-Exp with well-established and competitive expansion-based methods in the next section.

4.3. COMPARED WITH EXPANSION-BASED METHODS

The above experiments exhibit the outstanding performance of our approach while maintaining a fixed network capacity. By allocating new neurons or modules, expansion-based methods significantly mitigate backward interference with increasing capacity. Thus, to further validate our strategy, we compare IRGP and IRGP-Exp with related expansion-based methods in this section. We take the results of the baselines from Saha et al. (2021). Capacity denotes the model capacity normalized with respect to the network used in GPM; here we use the same model as GPM. According to Table 2, IRGP outperforms all baselines, including GPM, with the smallest capacity. Lin et al. (2022) proposed TRGP to expand the limited optimization spaces by retraining parameters within the selected trust regions, achieving superior performance. During the inference phase, TRGP reuses the parameters in the corresponding trust regions memorized after learning each task, whereas GPM and our IRGP only store the representation of the frozen space. Therefore, although a stable network capacity is indeed allocated for each task, the entire memory size of TRGP grows continually. As shown in Figure 3-(c), after learning the last task on the MiniImageNet setting, TRGP requires around 5000% extra parameters with respect to the network capacity. Results on other benchmarks, provided in Appendix C.5, further substantiate that TRGP introduces a significant number of extra parameters. Thus, we categorize TRGP as an expansion-based method here. As mentioned in Section 3.2, we propose IRGP-Exp to further validate our searching strategy. In this setting, the main difference between IRGP-Exp and TRGP is the strategy for deciding which part of the frozen space to reuse. We conduct experiments on all four benchmarks against TRGP. The ACC results are provided in the left part of Table 3. The percentages indicate the ratios of the rank of the relaxing subspaces with respect to the corresponding frozen space.
We evaluate three constant ratios and additionally use the ratios chosen by TRGP, denoted as T%. According to the left part of Table 3, IRGP-Exp already outperforms TRGP when relaxing only 50% of the frozen spaces on CIFAR-100 Split and PMNIST. Since TRGP selects the top-2 tasks as trust regions, T% exceeds 80% most of the time. Moreover, our approach gains better ACC on all benchmarks with a comparable size of relaxing subspaces, which substantiates the efficiency of our subspace searching strategy. We further modify TRGP into TRGP-Reg, with regularization terms on the scale matrices similar to our IRGP, to compare the relaxing strategies. We report the results on four benchmarks with three representative regularization weights $w$ for TRGP-Reg in the right part of Table 3. As shown in Table 3, IRGP significantly outperforms TRGP-Reg, especially on PMNIST, gaining over 20% ACC improvement. Generally, IRGP achieves better or comparable $\Omega_{\mathrm{new}}$ than TRGP, with or without the constraint of a fixed network capacity. Detailed results are included in Appendix C.7.

5. ANALYSIS AND DISCUSSION

To gain deeper insight into IRGP, we investigate the scales of the subspaces relaxed by our strategy. Besides the theoretical upper bound on the rank of the relaxed subspace provided in Theorem 3.3, we inspect the ratios of the relaxed subspaces with respect to the corresponding frozen spaces in practice. Results for the last layer on three different settings are provided in Figure 3; the ratio (%) fluctuates over sequential tasks on all benchmarks. As different tasks explore different optimization directions, the ideal relaxing subspaces vary across tasks, in accordance with the fluctuation in our results. The dimension of the frozen spaces keeps growing as tasks arrive, enlarging the search range for relaxing subspaces, so the computational complexity and time consumption are expected to increase gradually. To investigate the practical efficiency of our approach, we report the time consumption of IRGP on CIFAR-100 Split and MiniImageNet against other baselines in Appendix C.4. According to Table 14, our approach takes around 60% more time than GPM on both settings; in general, the practical efficiency of our approach is acceptable. To better understand our relaxing strategy, we further conduct experiments on different thresholds $\epsilon$ mentioned in Equation (3), which regulate the criterion of the frozen spaces. Saha et al. (2021) argue that $\epsilon$ controls the scale of the frozen space to mediate the stability-plasticity dilemma and is thus critical for GPM. However, IRGP enables the frozen space to be dynamically regulated with respect to the current task, so $\epsilon$ plays a much less important role in IRGP. We present the performance for different $\epsilon$ on CIFAR-100 Split in Figure 3-(b): the performance of GPM drops significantly when $\epsilon \ge 0.97$, the optimal value reported in GPM, while IRGP consistently performs well even with $\epsilon = 0.98$. Generally, IRGP is more robust to the threshold $\epsilon$.
In contrast, IRGP mediates the stability-plasticity dilemma by controlling the dimension and flexibility of the relaxing space through $\zeta$ in Equation (9) and $\beta$ in Equation (11), respectively. We present the results on CIFAR-100 Split in Table 4, where $\zeta_{conv}$ denotes the hyper-parameter for convolutional layers and $\zeta_{fc}$ the one for fully connected layers. As shown in Table 4, a larger $\zeta$ guarantees less forgetting as a result of smaller relaxing subspaces. Detailed results are provided in Appendix C.6. Similarly, we observe less forgetting with a larger $\beta$, which constrains the update of parameters within the relaxing subspace more strictly. However, strict constraints also limit the performance on new tasks, as discussed in Section 4. Generally, $\zeta$ and $\beta$ work together to overcome catastrophic forgetting while enabling better forward transfer.

6. CONCLUSION

In this paper, we propose a novel continual learning approach that facilitates forward knowledge transfer in gradient projection methods with a fixed network capacity, by iteratively searching for and relaxing subspaces within the frozen space to expand the optimization space. Extensive experiments demonstrate that our IRGP framework surpasses related state-of-the-art approaches on diverse benchmarks. Moreover, we propose a modified version that expands the architecture with relaxing subspaces, achieving better average accuracy than other expansion-based methods. We further provide theoretical proofs and analyses validating the effectiveness and efficiency of our algorithm.

A PROOF

A.1 PROOF OF LEMMA 3.2

For simplicity, all vectors here are assumed to be unit vectors, namely $\|v\| = 1$. Expressing the angle in projection form, Lemma 3.2 can be stated as:
$$\forall v \in V, \quad \|\mathrm{Proj}_U(v)\| \ge \|\mathrm{Proj}_U(v_t)\|.$$
Denote by $B = [u_1, \dots, u_m]$ the representation matrix of the representation subspace $U = \mathrm{span}\{u_1, \dots, u_m\}$, where $u_i$ is the $i$-th normalized basis vector of $U$. Lemma 3.2 can then be expressed as:
$$\forall v \in V, \quad v^T B B^T v \ge v_t^T B B^T v_t. \quad (14)$$
As the $v_i$ are the basis vectors iteratively appended by Algorithm 1, for any $i \le j$ we have:
$$v_i^T B B^T v_i \ge v_j^T B B^T v_j. \quad (15)$$
Therefore, to prove Lemma 3.2 it suffices to prove the following: if there exists a vector $v = \sum_{i=1}^{t+1} w_i v_i$ satisfying $v^T B B^T v < v_t^T B B^T v_t$, then we can find another vector $v' = \sum_{i=k}^{t+1} w'_i v_i$ such that $v'^T B B^T v' > v_k^T B B^T v_k$, which contradicts Algorithm 1, where $\Theta(v_k, U) \le \Theta(v, U)$ for all $v \in \mathrm{span}\{v_k, \dots, v_{t+1}\}$.

First, we consider the special case where the current relaxed subspace has only one basis vector, $V = \mathrm{span}\{v_1\}$. Assume there exists $v = w_1 v_1 + w_2 v_2$ such that $v^T B B^T v < v_2^T B B^T v_2$. Since $w_1^2 + w_2^2 = 1$, we have:
$$w_1^2\, v_2^T B B^T v_2 > w_1^2\, v_1^T B B^T v_1 + 2 w_1 w_2\, v_1^T B B^T v_2. \quad (16)$$
Constructing $v' = w_2 v_1 - w_1 v_2$, we have:
$$v'^T B B^T v' = w_2^2\, v_1^T B B^T v_1 + w_1^2\, v_2^T B B^T v_2 - 2 w_1 w_2\, v_1^T B B^T v_2 > w_2^2\, v_1^T B B^T v_1 + w_1^2\, v_2^T B B^T v_2 + w_1^2\, v_1^T B B^T v_1 - w_1^2\, v_2^T B B^T v_2 = v_1^T B B^T v_1, \quad (17)$$
which contradicts $\Theta(v_1, U) \le \Theta(v, U)$ for all $v \in \mathrm{span}\{v_1, v_2\}$.

We then consider the general case $V = \mathrm{span}\{v_1, \dots, v_t\}$. For all $v \in V$ we have $\Theta(v, U) \le \Theta(v_t, U)$. After $v_{t+1}$ is included, assume there exists $v$ with $\Theta(v, U) > \Theta(v_{t+1}, U)$. Then we can find the minimum $s$ satisfying: there exists $v = \sum_{i=1}^{s} w_i v_i + w_{t+1} v_{t+1}$ with $\Theta(v, U) > \Theta(v_{t+1}, U)$, while for every $v' = \sum_{i=1}^{s-1} w_i v_i + w_{t+1} v_{t+1}$ we have $\Theta(v', U) \le \Theta(v_{t+1}, U)$.
When $s = 1$, the argument is the same as in the special case above, so the proof is omitted. Thus, we consider the case $s \ge 2$. For simplicity, we express $v$ as $v = c_0 v_0 + c_1 v_s + c_2 v_{t+1}$ with $v_0 = \sum_{i=1}^{s-1} a_i v_i$, where the $c_i$ and $a_i$ are coefficients. We have:
$$(c_0 v_0 + c_1 v_s + c_2 v_{t+1})^T B B^T (c_0 v_0 + c_1 v_s + c_2 v_{t+1}) < v_{t+1}^T B B^T v_{t+1}. \quad (18)$$
As $\Theta(w_1 v_0 + w_2 v_s, U) \le \Theta(v_s, U) \le \Theta(v_{t+1}, U)$, we have:
$$(w_1 v_0 + w_2 v_s)^T B B^T (w_1 v_0 + w_2 v_s) \ge v_s^T B B^T v_s, \quad (19)$$
which gives:
$$w_1^2\, v_0^T B B^T v_0 + 2 w_1 w_2\, v_0^T B B^T v_s \ge w_1^2\, v_s^T B B^T v_s \ge w_1^2\, v_{t+1}^T B B^T v_{t+1}. \quad (20)$$
Similarly, we have:
$$w_1^2\, v_0^T B B^T v_0 + 2 w_1 w_2\, v_0^T B B^T v_{t+1} \ge w_1^2\, v_{t+1}^T B B^T v_{t+1}. \quad (21)$$
Then we can express Equation (18) as:
$$v_{t+1}^T B B^T v_{t+1} > (c_0^2 + c_2^2)\, v_{t+1}^T B B^T v_{t+1} + c_1^2\, v_s^T B B^T v_s + 2 c_1 c_2\, v_s^T B B^T v_{t+1}. \quad (22)$$
As $\|v\| = \|v_i\| = 1$, we have $c_0^2 + c_1^2 + c_2^2 = 1$. Then:
$$-2 c_1 c_2\, v_s^T B B^T v_{t+1} > c_1^2\, v_s^T B B^T v_s - c_1^2\, v_{t+1}^T B B^T v_{t+1}. \quad (23)$$
Constructing $v' = \frac{c_2 v_s - c_1 v_{t+1}}{\sqrt{c_1^2 + c_2^2}}$, we have:
$$v'^T B B^T v' = \frac{1}{c_1^2 + c_2^2}\left(c_2^2\, v_s^T B B^T v_s + c_1^2\, v_{t+1}^T B B^T v_{t+1} - 2 c_1 c_2\, v_s^T B B^T v_{t+1}\right) > \frac{1}{c_1^2 + c_2^2}\left(c_2^2\, v_s^T B B^T v_s + c_1^2\, v_{t+1}^T B B^T v_{t+1} + c_1^2\, v_s^T B B^T v_s - c_1^2\, v_{t+1}^T B B^T v_{t+1}\right) = v_s^T B B^T v_s, \quad (24)$$
which contradicts $\Theta(v_s, U) \le \Theta(v, U)$ for all $v \in \mathrm{span}\{v_s, v_{s+1}, \dots, v_t, v_{t+1}\}$. Thus, for all $v \in V = \mathrm{span}\{v_1, \dots, v_t\}$, we have $\Theta(v, U) \le \Theta(v_t, U)$.

A.2 PROOF OF THEOREM 3.3

Denote the relaxed subspace and the representation subspace as $V$ and $U = \mathrm{span}\{u_1, \dots, u_{k_p}\}$ respectively. By Lemma 3.2, for all $v \in V$ we have $\Theta(v, U) \le \Theta(v_t, U) < \pi/2$. In other words,
$$\forall v \in V, \quad v \not\perp U. \quad (25)$$
For the sake of contradiction, assume the dimension of $V$ is larger than $k_p$, namely $\dim(V) > \dim(U) = k_p$. Denote by $U^c$ the complement subspace of $U$ with respect to the whole space $\mathbb{R}^n$. Obviously, $\dim(U^c) = n - k_p$, where $n$ is the dimension of the whole space.
Then we have:
$$\dim(V \cap U^c) = \dim(V) + \dim(U^c) - \dim(V + U^c) > k_p + (n - k_p) - n = 0.$$
Thus, there exists $v' \in V$ such that $v' \in U^c$ as well. As $U^c$ is the complement subspace, every $u \in U^c$ satisfies $u \perp U$. Then we have $v' \perp U$, which contradicts Equation (25). Therefore, the assumption fails, and the dimension of $V$ is upper-bounded by $k_p$, namely the dimension of the representation subspace $U$.

A.3 PROOF OF THEOREM 3.4

Denote the whole solution set as $S = \{u \mid \Theta(u, U) \le \gamma \text{ and } u \in U_f\}$, where $U$ is the representation subspace, $U_f$ is the frozen space, and $\gamma$ is the threshold. Theorem 3.4 can be stated as: every subspace $V' \subseteq S$ satisfies $\dim(V') \le \dim(V)$, where $V$ is the relaxed subspace obtained by Algorithm 1. As in Theorem 3.3, assume there exists $V' \subseteq S$ with $\dim(V') > \dim(V)$. Denote by $V_f^c$ the complement subspace of $V$ with respect to the frozen space $U_f$. According to Algorithm 1, for every $v \in V_f^c$ we have $\Theta(v, U) > \gamma$. We also have:
$$\dim(V' \cap V_f^c) = \dim(V') + \dim(V_f^c) - \dim(V' + V_f^c) > 0.$$
Thus, there exists $v' \in V'$ such that $v' \in V_f^c$, namely there exists $v' \in S$ with $\Theta(v', U) > \gamma$, which is a contradiction. Therefore, the relaxed subspace obtained by Algorithm 1 takes up the maximal subspace of the whole solution set.
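Both Lemma 3.2 and Theorem 3.3 can be illustrated numerically via principal angles: the greedy maximizers of $\|\mathrm{Proj}_U(v)\|$ over unit vectors in the frozen space are the principal directions between the two subspaces, their projection norms are non-increasing (Lemma 3.2 in projection form), and the number of directions not orthogonal to $U$ is at most $\dim(U) = k_p$ (Theorem 3.3). A NumPy sketch, with all variable names our own:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k_p, f = 10, 3, 7                                # ambient dim, dim(U), dim of frozen space
B, _ = np.linalg.qr(rng.standard_normal((n, k_p)))  # orthonormal basis of U
F, _ = np.linalg.qr(rng.standard_normal((n, f)))    # orthonormal basis of the frozen space

# The SVD of B^T F yields the principal angles between the subspaces; the
# right singular vectors map to the greedy v_1, v_2, ... inside the frozen space.
_, sigma, Vt = np.linalg.svd(B.T @ F)
vs = F @ Vt.T
proj = [float(np.linalg.norm(B.T @ vs[:, i])) for i in range(f)]

# Lemma 3.2 (projection form): earlier v_i project onto U at least as strongly.
assert all(proj[i] >= proj[i + 1] - 1e-12 for i in range(f - 1))

# Theorem 3.3: at most k_p = dim(U) directions are non-orthogonal to U,
# since rank(B^T F) <= k_p however large the frozen space grows.
relaxable = sum(p > 1e-10 for p in proj)
assert relaxable <= k_p
```

The projection norms equal the cosines of the principal angles, which is why they arrive already sorted in non-increasing order.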

B EXPERIMENTAL SETUP B.1 DATASETS

Here we introduce the datasets we use for evaluation. 1) CIFAR-100 Split: Saha et al. (2021) constructed CIFAR-100 Split by splitting CIFAR-100 (Krizhevsky & Hinton, 2009) into 10 tasks, where each task has 10 classes. 2) MiniImageNet: Following Saha et al. (2021), we split MiniImageNet (Vinyals et al., 2016) into 20 sequential tasks with 5 classes each. 3) Permuted MNIST (PMNIST): PMNIST (Kirkpatrick et al., 2017) is a variant of MNIST (LeCun et al., 1998) where each task applies a different permutation to the input images; it consists of 10 sequential tasks with 10 classes each. 4) CIFAR-100 Sup: Following Yoon et al. (2019), we adopt CIFAR-100 Sup, consisting of 20 superclasses as sequential tasks. 5) Mixture: Serra et al. (2018) first proposed Mixture, consisting of 8 datasets, including CIFAR-10 (Krizhevsky & Hinton, 2009), MNIST (LeCun et al., 1998), CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), FashionMNIST (Xiao et al., 2017), TrafficSigns (Stallkamp et al., 2011), FaceScrub (Ng & Winkler, 2014), and NotMNIST (Bulatov, 2011), from which Ebrahimi et al. (2020) further constructed 5-Datasets. Here we follow the original harder benchmark. In particular, we consider all tasks as a sequence except TrafficSigns (Stallkamp et al., 2011), which we failed to access. Among all evaluated datasets, PMNIST is a benchmark under the domain-incremental scenario, while the other four datasets are under the task-incremental scenario. Moreover, we provide the statistics of the selected datasets in Table 5 and Table 6. For the Mixture benchmark, the images of MNIST, FashionMNIST, and NotMNIST are replicated across all RGB channels following Serra et al. (2018).

We adopt a 3-layer model including two hidden layers with 100 neurons each for the PMNIST setting, the same as Lopez-Paz & Ranzato (2017). ReLU is used as the activation function here and for all other architectures. Also, we use softmax with cross-entropy loss in all settings.
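The Permuted MNIST construction described above can be sketched in a few lines: each task applies one fixed random pixel permutation to every flattened image. `make_pmnist_tasks` and the toy array are our own illustration, not the benchmark code.

```python
import numpy as np

def make_pmnist_tasks(images, num_tasks=10, seed=0):
    """Build Permuted-MNIST-style tasks: each task applies one fixed
    random permutation to the flattened pixels of every image. Every
    task gets its own permutation here, a simplifying assumption."""
    rng = np.random.default_rng(seed)
    flat = images.reshape(len(images), -1)
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(flat.shape[1])  # one permutation per task
        tasks.append(flat[:, perm])
    return tasks

toy = np.arange(2 * 784, dtype=np.float32).reshape(2, 28, 28)  # stand-in for MNIST
tasks = make_pmnist_tasks(toy, num_tasks=3)
# Each task sees exactly the same pixel values, just reordered.
assert np.array_equal(np.sort(tasks[0][0]), np.sort(toy.reshape(2, -1)[0]))
```

Because only the pixel order changes while the label space stays the same, the resulting benchmark is domain-incremental, as noted above.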
AlexNet architecture: For the CIFAR-100 Split setting, we adopt the same network as Serra et al. (2018) with batch normalization, including three convolutional layers and two fully connected layers. The convolutional layers have 4 × 4, 3 × 3, and 2 × 2 kernel sizes with 64, 128, and 256 filters respectively. After each convolutional layer, we add batch normalization and 2 × 2 max-pooling. Each fully connected layer has 2048 units. For the first two layers we use a dropout of 0.2, and for the remaining layers a dropout of 0.5. Modified LeNet-5 architecture: For the CIFAR-100 Sup setting, we adopt a modified LeNet-5 architecture consisting of two convolutional layers and two fully connected layers, similar to Saha et al. (2021). Max-pooling of 3 × 2 is used after each convolutional layer. The last two layers have 800 and 500 units respectively. Reduced ResNet-18 architecture: We adopt the same reduced ResNet-18 architecture as Saha et al. (2021) for the MiniImageNet and Mixture settings, using 2 × 2 average-pooling before the classifier layer instead of the 4 × 4 average-pooling used by Lopez-Paz & Ranzato (2017). Moreover, we present the dimension of the representation space of each layer of our architectures in Table 7.
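As a sanity check on the AlexNet-style stack above, the spatial size of the feature maps can be walked through in a few lines of plain Python. Stride 1 and zero padding are our assumptions here, not values stated in the text.

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Standard convolution output-size formula.
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel):
    # Non-overlapping max-pooling: floor division by the window size.
    return size // kernel

# 32x32 CIFAR input through conv kernels 4, 3, 2, each followed by 2x2
# max-pooling, assuming stride 1 and no padding (our assumptions).
size = 32
for k in (4, 3, 2):
    size = pool_out(conv_out(size, k), 2)
print(size)  # final spatial side length before the fully connected layers
```

Under these assumptions the side length shrinks 32 → 14 → 6 → 2, so the flattened input to the first fully connected layer would be 256 × 2 × 2 features.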

B.3 IMPLEMENTATION DETAILS

We use the official implementations of GPM (Saha et al., 2021), OWM (Zeng et al., 2019), HAT (Serra et al., 2018), and TRGP (Lin et al., 2022). Moreover, we implement A-GEM and ER Res with the official implementation by Chaudhry et al. (2018) and implement EWC with the implementation by Serra et al. (2018). Following Saha et al. (2021) and Lin et al. (2022), we run all experiments five times on an established seed without fixing the CUDA settings for a fair comparison. In particular, we use five random seeds on PMNIST, where there is no diversity on a single seed. For CIFAR-100 Sup, we use five different orders provided by Yoon et al. (2019). Following Saha et al. (2021), we report the experimental results of the replay-based methods A-GEM and ER Res on the Mixture dataset with the same buffer size as GPM and our IRGP, which is 8.98M in terms of the number of parameters for the ResNet18 architecture. On CIFAR-100 Split, MiniImageNet, and PMNIST, we follow the hyper-parameters utilized by Saha et al. (2021) and Lin et al. (2022), including the learning rate, batch size, and the threshold ϵ. On Mixture, as we adopt the same network architecture Saha et al. (2021) use on their 5-Datasets setting, we follow the provided learning rate and batch size as well. Moreover, for the threshold ϵ in GPM, we conduct experiments with ϵ in the range of 0.95 to 1 provided in (Saha et al., 2021) and report the best results, whose ϵ is 0.955. We further use ϵ = 0.96 for all layers in our IRGP. As discussed in Section 5, the threshold ζ = cos γ controls the criterion of the relaxing subspace. For CIFAR-100 Split and PMNIST, we use ζ = 0.95 for convolutional layers and ζ = 0.9 for fully connected layers. For MiniImageNet and Mixture, we use the same ζ for all layers, 0.95 and 0.9 respectively. Furthermore, we set the regularization weight to 5 for the ResNet18 architecture and 1 for the others.

All experiments are run on a single NVIDIA GeForce RTX 2080 Ti GPU.

B.4 METRICS

Here we present the detailed definitions of the metrics evaluating forward knowledge transfer: FWT (Lopez-Paz & Ranzato, 2017) and $\Omega_{new}$ (Kemker et al., 2018). Denote $b_i$ as the test accuracy of task $i$ at random initialization. FWT, first proposed by Lopez-Paz & Ranzato (2017), is defined as
$$\mathrm{FWT} = \frac{1}{T-1} \sum_{i=2}^{T} (A_{i-1,i} - b_i),$$
evaluating the zero-shot performance of the initialization with respect to the observed tasks. $\Omega_{new}$, first proposed by Kemker et al. (2018), is defined as
$$\Omega_{new} = \frac{1}{T-1} \sum_{i=2}^{T} (A_{i,i} - b_i),$$
reflecting the test accuracy on new tasks based on the learnt knowledge. As $b_i$ stays fixed across different approaches, we consider $\Omega_{new} = \frac{1}{T-1} \sum_{i=2}^{T} A_{i,i}$ for simplicity. For this simplified $\Omega_{new}$, we have:
$$\Omega_{new} = \frac{T}{T-1}\,\mathrm{ACC} - \mathrm{BWT} - \frac{1}{T-1}\,A_{1,1},$$
with ACC and BWT defined in Section 4.1.
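The definitions above can be checked on a toy accuracy matrix. The code below is our own sketch (the function name and toy values are hypothetical); it computes ACC, BWT, FWT, and the simplified Ω_new, and verifies the stated identity relating them.

```python
def cl_metrics(A, b):
    """A[i][j]: test accuracy on task j+1 after learning task i+1
    (1-indexed in the text); b[j]: accuracy of task j+1 at random
    initialization."""
    T = len(A)
    acc = sum(A[T - 1][j] for j in range(T)) / T                      # ACC
    bwt = sum(A[T - 1][j] - A[j][j] for j in range(T - 1)) / (T - 1)  # BWT
    fwt = sum(A[i - 1][i] - b[i] for i in range(1, T)) / (T - 1)      # FWT
    omega_new = sum(A[i][i] for i in range(1, T)) / (T - 1)           # simplified
    return acc, bwt, fwt, omega_new

A = [[60.0, 10.0, 5.0],
     [55.0, 70.0, 12.0],
     [50.0, 65.0, 80.0]]   # toy accuracy matrix, T = 3 tasks
b = [8.0, 9.0, 10.0]

acc, bwt, fwt, omega_new = cl_metrics(A, b)
T = len(A)
# The identity from the text: Omega_new = T/(T-1)*ACC - BWT - A_{1,1}/(T-1).
assert abs(omega_new - (T / (T - 1) * acc - bwt - A[0][0] / (T - 1))) < 1e-9
```

On this toy matrix ACC is 65, BWT is −7.5, and Ω_new is 75, and the identity holds exactly.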

C ADDITIONAL RESULTS

C.1 FINAL ACCURACY

We provide the test accuracy after learning each task on the other benchmarks here. As discussed in Section 4, our IRGP universally outperforms GPM over the task sequence on all benchmarks.

C.2 FORWARD KNOWLEDGE TRANSFER

We provide the detailed forward knowledge transfer performance on all four benchmarks here. First, we present the results of Ω_new and the detailed accuracy of each task after learning it for the baselines without extra memory buffers. On CIFAR-100 Split, IRGP even gains a 2.7% better Ω_new than HAT, the second best baseline in this setting. Moreover, we provide the results of FWT (using the definition in (Lopez-Paz & Ranzato, 2017)) on all benchmarks in Table 13. According to Table 13, although our method targets the forward knowledge transfer reflected by Ω_new, IRGP also achieves better FWT than GPM on all three task-incremental benchmarks.

C.3 ACCURACY EVOLUTION

Here we present the accuracy tested on three randomly selected tasks after learning them on the four benchmarks. We select the 2nd, 4th, and 6th tasks for CIFAR-100 Split, MiniImageNet, and PMNIST. As there are only 7 tasks in Mixture, we select the 1st, 3rd, and 5th tasks. Generally, IRGP outperforms GPM on the selected tasks over the sequence. We further notice that the improvement is more significant on later tasks, as a result of larger relaxing subspaces, as discussed in Section 5.

C.4 TIME CONSUMPTION

We report the time consumption of IRGP on two benchmarks compared with related baselines. TRGP and IRGP are both evaluated on a single NVIDIA GeForce RTX 2080 Ti GPU, and we report the results of the other baselines according to (Lin et al., 2022). As discussed in Section 5, IRGP takes acceptable extra time compared with GPM on both datasets. For CIFAR-100 Split, IRGP takes a similar time to TRGP, which is similar to ER Res and HAT and much less than A-GEM and OWM. For MiniImageNet, IRGP takes around 30% more time than TRGP, and is similar to A-GEM.

Table 14: Time comparison evaluated on two benchmarks. We use the results reported in (Lin et al., 2022) and the time is normalized with respect to GPM.

C.5 MEMORY USAGE

We provide a comparison between TRGP and IRGP on the ratio of the number of extra parameters with respect to the number of parameters of the initial network architecture. According to Figure 9, TRGP requires at least 200% extra parameters after learning all tasks on the other three benchmarks, while IRGP only stores the representation of the frozen space, which can further be released in the inference phase.



We failed to access the TrafficSigns dataset, as the links provided in (Stallkamp et al., 2011; Serra et al., 2018; Saha et al., 2021) are all expired.



Figure 2: Results on CIFAR-100 Split setting: (a) averaged accuracy after learning each task; (b) accuracy evolution of a randomly selected task; (c) accuracy tested on task i after learning task i.

Figure 3: (a) Relaxing ratios of the last layer on CIFAR-100 Split, MiniImageNet, and PMNIST. (b) Test accuracy of GPM and IRGP of different ϵ on CIFAR-100 Split. The optimum value of GPM is annotated by a red circle. (c) Ratios of the amount of extra parameters concerning the amount of the parameters of the network architecture on MiniImageNet.

Figure 4: Average accuracy after learning each task on (a) MiniImageNet, (b) PMNIST, and (c) Mixture.

Figure 5: Accuracy evolution of the (a) 2nd, (b) 4th, and (c) 6th task on CIFAR-100 Split.

Figure 7: Accuracy evolution of the (a) 2nd, (b) 4th, and (c) 6th task on PMNIST.

Figure 9: Ratio of the amount of extra parameters concerning the amount of the parameters of the initial network architecture on (a) CIFAR100-Split, (b) PMNIST, and (c) Mixture.

We provide the results on CIFAR-100 Split for the relationship between the hyper-parameter ζ and the relaxing ratio of all five layers of the AlexNet architecture in Table 15.

As shown in Figure 10-(a) and Figure 10-(b), our IRGP-Exp dominates TRGP on both benchmarks when relaxing either 80% or T% of the frozen spaces under an expansion setting. We further compare IRGP with TRGP modified with the same regularization terms. As shown in Figure 10-(c) and Figure 10-(d), the performance of TRGP drops significantly when constrained to a fixed network capacity on both benchmarks. Detailed results for IRGP and TRGP are provided in Tables 16 and 17.

Figure 10: Test accuracy after learning each task under an expansion setting on (a) CIFAR-100 Split and (b) PMNIST, and within a fixed network capacity on (c) CIFAR-100 Split and (d) PMNIST.

Comparison of average accuracy and forgetting tested after learning all tasks. Multitask is under non-incremental setting. All results reported are averaged over 5 runs.

Results of ACC (%) and Capacity on CIFAR-100 Sup setting. STL is under non-incremental setting. All baselines are expansion-based methods except GPM.

L: compare IRGP-Exp with TRGP under an expansion setting. R: compare IRGP with TRGP within a fixed network capacity. We report the results as (ACC / Ω_new) for each experimental setting. Detailed results are provided in Tables 16 and 17.

Following Saha et al. (2021), we perform experiments on CIFAR-100 Sup (Yoon et al., 2019). ACC results shown in Table 2 are averaged over 5 different sequence orders proposed by Yoon et al. (2019).



Statistics of CIFAR-100 Split, MiniImageNet, and PMNIST.

Statistics of Mixture benchmark.

Dimension of the representation space of each layer.

We provide the test accuracy over the task sequence in Tables 8 to 11. According to Table 10 and Table 11, IRGP achieves a similar forward knowledge transfer compared with GPM. For the other benchmarks, IRGP improves Ω_new by 2.7% and 1.8% on CIFAR-100 Split and MiniImageNet respectively, as shown in Table 8 and Table 9. Moreover, we provide the detailed results, including the standard deviation and other baselines, in Table 12. According to Table 12, our IRGP consistently achieves the best or second best forward knowledge transfer compared with all baselines.

The accuracy tested on task i after learning task i and Ω new on CIFAR-100 Split.

The accuracy tested on task i after learning task i and Ω new on MiniImageNet.

The accuracy tested on task i after learning task i and Ω new on PMNIST.

The accuracy tested on task i after learning task i and Ω new on Mixture.

Comparison of forward knowledge transfer on four benchmarks, evaluated by Ω new .

Comparison of forward knowledge transfer between GPM and IRGP, evaluated by FWT.

β is set to 1.0 for all experiments here. According to Table 15, a larger ζ generally guarantees smaller relaxing ratios. As discussed in Section 5, forgetting is mitigated by constraining the percentage of the relaxed weights, which is achieved by increasing ζ.

The relationship between ζ and the relaxing ratio of different layers on CIFAR-100 Split.

Compare IRGP-Exp with TRGP on four benchmarks under an expansion setting. All results reported are averaged over 5 runs.

Compare IRGP with TRGP-Reg on four benchmarks within a fixed network capacity. All results reported are averaged over 5 runs.

Comparison of forward knowledge transfer on CIFAR100-Sup, evaluated by Ω new .

D ALGORITHM

We present the pseudo-code of our Iterative Relaxing Gradient Projection here.

Algorithm 2 Iterative Relaxing Gradient Projection
1: Initialize frozen subspaces $U_0 = \{U_0^l\}_{l=1}^L$ as empty sets and optimize $W_1$ for task 1
2: Compute frozen subspace $U_1$ with Equation (4)

