CONTINUAL LEARNING IN LOW-COHERENCE SUBSPACE: A STRATEGY TO MITIGATE LEARNING CAPACITY DEGRADATION

Abstract

Methods using gradient orthogonal projection, an efficient strategy in continual learning, have achieved promising success in mitigating catastrophic forgetting. However, these methods often suffer from a learning capacity degradation problem as the number of tasks increases. To address this problem, we propose to learn new tasks in low-coherence subspaces rather than in orthogonal subspaces. Specifically, we construct a unified cost function involving the regular DNN parameters and gradient projections on the Oblique manifold, and develop a gradient descent algorithm on this smooth manifold that jointly minimizes the cost function and both the inter-task and the intra-task coherence. Numerical experiments show that, compared with baselines, the proposed method has prominent advantages in maintaining learning capacity as tasks are added, especially for large numbers of tasks.

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved promising performance on many tasks. However, they lack the ability to learn continually, i.e., they suffer from catastrophic forgetting French (1999) when learning sequential tasks, a phenomenon in which new knowledge interferes with old knowledge. Research on continual learning, also known as incremental learning Aljundi et al. (2018a); Chaudhry et al. (2018a); Chen & Liu (2018); Aljundi et al. (2017) or sequential learning Aljundi et al. (2018b); McCloskey & Cohen (1989), aims to find effective algorithms that enable DNNs to simultaneously achieve plasticity and stability, i.e., both high learning capacity and high memory capacity. Various methods have been proposed to avoid or mitigate catastrophic forgetting De Lange et al. (2019), either by replaying training samples Rolnick et al. (2019); Ayub & Wagner (2020); Saha et al. (2021), or by reducing the mutual interference of model parameters, features, or model architectures between different tasks Zenke et al. (2017); Mallya & Lazebnik (2018); Wang et al. (2021). Among these methods, Gradient Orthogonal Projection (GOP) Chaudhry et al. (2020); Zeng et al. (2019); Farajtabar et al. (2020); Li et al. (2021) is an efficient continual learning strategy that advocates projecting gradients with an orthogonal projector to prevent knowledge interference between tasks. GOP-based methods have achieved encouraging results in mitigating catastrophic forgetting. However, as Fig. 1 shows, these methods suffer from a learning capacity degradation problem: their learning capacity gradually degrades as the number of tasks increases, until new tasks eventually become unlearnable. Specifically, when learning many tasks, e.g., more than 30 tasks in Fig. 1, their performance on new tasks dramatically decreases. These results suggest that GOP-based methods focus on stability and somewhat neglect plasticity.
Neglecting plasticity may limit the task learning capacity of a model, i.e., the number of tasks it can learn without forgetting. To address this issue, we propose to learn new tasks in low-coherence subspaces rather than orthogonal subspaces. Specifically, low-coherence projectors are used at each layer to project features and gradients into low-coherence subspaces. To achieve this, we construct a unified cost function for finding the projectors and develop a gradient descent algorithm on the Oblique manifold that jointly minimizes inter-task and intra-task coherence. Minimizing the inter-task coherence reduces the mutual interference between tasks, and minimizing the intra-task coherence enhances the model's expressive power. Restricting the projectors to the Oblique manifold avoids scale ambiguity Aharon et al. (2006); Wei et al. (2017), i.e., it prevents the parameters of the projector from becoming extremely large or extremely small. The main contributions of this work are summarized as follows. First, to address the learning capacity degradation problem of GOP, we propose a novel method, Low-coherence Subspace Projection (LcSP), that replaces orthogonal projectors with low-coherence gradient projectors, allowing the DNN to maintain both plasticity and stability. Additionally, we observe that GOP models with Batch Normalization (BN) Ioffe & Szegedy (2015) layers can suffer catastrophic forgetting. This paper proposes two strategies in LcSP to solve this problem, i.e., replacing BN with Group Normalization (GN) Wu & He (2018) or learning a task-specific BN for each task.

2. RELATED WORK

In this section, we briefly review existing work on continual learning and on GOP-based methods.

Replay-based Strategy

The basic idea of this type of approach is to use limited memory to store small amounts of data (e.g., raw samples) from previous tasks, called episodic memory, and to replay them when training a new task. Some of the existing works focused on selecting a subset of raw samples from the previous tasks Rolnick et al. (2019) ; Isele & Cosgun (2018) ; Chaudhry et al. (2019) ; Zhang et al. (2020) . In contrast, others concentrated on training a generative model to synthesize new data that can substitute for the old data Shin et al. (2017) ; Van de Ven & Tolias (2018) ; Lavda et al. (2018) ; Ramapuram et al. (2020) .

Regularization-based Strategy

This strategy prevents catastrophic forgetting by introducing a regularization term in the loss function to penalize changes in the network parameters. Existing works can be divided into data-focused and prior-focused methods De Lange et al. (2021). Data-focused methods take the previous model as the teacher and the current model as the student, transferring knowledge from the teacher to the student through knowledge distillation. Typical methods include LwF Li & Hoiem (2017), LFL Jung et al. (2016), EBLL Rannen et al. (2017), DMC Zhang et al. (2020) and GD-WILD Lee et al. (2019). Prior-focused methods estimate a distribution over the model parameters, assigning an importance score to each parameter and penalizing changes to significant parameters during learning. Relevant works include SI Zenke et al. (2017), EWC Kirkpatrick et al. (2017), RWalk Chaudhry et al. (2018a), AGS-CL Jung et al. (2020) and IMM Lee et al. (2017).

Parameter Isolation-based Strategy

This strategy dynamically modifies the network architecture by pruning, parameter masking, or expansion to greatly or even completely reduce catastrophic forgetting. Existing works can be roughly divided into two categories. One is dedicated to isolating a separate sub-network for each task from a large network through pruning and parameter masking, including PackNet Mallya & Lazebnik (2018), PathNet Fernando et al. (2017), HAT Serra et al. (2018) and Piggyback Mallya et al. (2018). Another class of methods dynamically expands the network architecture, increasing the number of neurons or sub-network branches to break the limits of expressive capacity (Rusu et al., 2016; Aljundi et al., 2017; Xu & Zhu, 2018; Rosenfeld & Tsotsos, 2018). However, as the number of tasks grows, this approach complicates the network architecture and increases computation and memory consumption.
Gradient Orthogonal Projection-based Strategy Methods based on the GOP strategy, which reduce catastrophic forgetting by projecting gradients or features with orthogonal projectors, have been shown to be effective in continual learning with encouraging results Farajtabar et al. (2020); Zeng et al. (2019); Saha et al. (2021); Wang et al. (2021); Chaudhry et al. (2020). According to how the projector is found, we further divide existing works into Context Orthogonal Projection (COP) and Subspace Orthogonal Projection (SOP). COP-based methods, such as OWM Zeng et al. (2019), Adam-NSCL Wang et al. (2021), and GPM Saha et al. (2021), rely on the context of previous tasks to build projectors. In contrast, SOP-based methods such as ORTHOG-SUBSPACE Chaudhry et al. (2020) use hand-crafted, task-specific orthogonal projectors and yield competitive results. This paper proposes a novel approach to continual learning called LcSP. Compared with other GOP-based methods, LcSP trains the network in low-coherence subspaces to balance plasticity and stability, overcoming the learning capacity degradation problem that significantly degrades the performance of GOP methods as the number of tasks increases.

3. CONTINUAL LEARNING SETUP

This work adopts the Task-Incremental Learning (TIL) setting, where multiple tasks are learned sequentially. Assume there are $T$ tasks, denoted by $\mathcal{T}_t$ for $t = 1, \ldots, T$, each with training data $D_t = \{(x_i, y_i, \tau_t)\}_{i=1}^{N_t}$. Here, each $(x_i, y_i) \in X \times Y_t$ is assumed to be drawn from some independent and identically distributed random variable, and $\tau_t \in \mathcal{T}$ denotes the task identifier. In the TIL setting, the data $D_t$ can be accessed if and only if task $\mathcal{T}_t$ arrives. When episodic memory is adopted, a limited number of samples drawn from old tasks can be stored in the replay buffer $\mathcal{M}$, so that $D_t \cup \mathcal{M}$ can be used for training when task $\mathcal{T}_t$ arrives. Assume a network $f$ parameterized by $\Phi = \{\theta, \phi\}$ consists of two parts, where $\theta$ denotes the parameters of the backbone network and $\phi$ those of the classifier. Let $f(\cdot; \theta): X \times \mathcal{T} \to H$ denote the backbone network parameterized by $\theta = \{W^l\}_{l=1}^{L}$, which encodes a data sample $x$ into a feature vector, and let $f(\cdot; \phi): H \to Y$ denote the classifier parameterized by $\phi = w$, which returns the classification result for the feature vector produced by the backbone. The goal of TIL is to learn the $T$ tasks sequentially with the network $f$ and achieve the optimal loss on all of them.

Evaluation Metrics Once training on all tasks is finished, we evaluate performance by the average accuracy $A$ and forgetting $F$ Chaudhry et al. (2020) over the $T$ tasks $\{\mathcal{T}_1, \ldots, \mathcal{T}_T\}$. Let $\mathrm{Acc}_{i,j}$ denote the test accuracy on task $\mathcal{T}_i$ after learning task $\mathcal{T}_j$, where $i \leq j$. The average accuracy is defined as
$$A = \frac{1}{T} \sum_{i=1}^{T} \mathrm{Acc}_{i,T},$$
and the forgetting as
$$F = \frac{1}{T-1} \sum_{i=1}^{T-1} \max_{j \in \{i, \ldots, T-1\}} \left( \mathrm{Acc}_{i,j} - \mathrm{Acc}_{i,T} \right).$$
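As a concrete illustration, both metrics can be computed directly from the matrix of test accuracies. The sketch below is ours (function names and the layout of the accuracy matrix are assumptions, not from the paper):

```python
import numpy as np

def average_accuracy(acc):
    """A = (1/T) * sum_i Acc_{i,T}: mean accuracy over all tasks after
    the final task has been learned. acc[i][j] holds Acc_{i,j} (i <= j)."""
    acc = np.asarray(acc, dtype=float)
    return float(acc[:, -1].mean())

def forgetting(acc):
    """F = (1/(T-1)) * sum_{i<T} max_{j in {i,...,T-1}} (Acc_{i,j} - Acc_{i,T}):
    mean gap between each task's best past accuracy and its final accuracy."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    # for 0-based task i, j ranges over columns i .. T-2 (1-based tasks i .. T-1)
    drops = [acc[i, i:T - 1].max() - acc[i, -1] for i in range(T - 1)]
    return float(np.mean(drops))
```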

4. CONTINUAL LEARNING IN LOW-COHERENCE SUBSPACES

In this section, we first introduce how to find task-specific, low-coherence projectors for LcSP on the Oblique manifold. We then describe how to use it in a specific DNN architecture to project features and gradients. Finally, we analyze the factors that enable LcSP to maintain plasticity and stability.

4.1. CONSTRUCTING LOW-COHERENCE PROJECTORS ON OBLIQUE MANIFOLD

Here, we first introduce the coherence metric, which is commonly used in compressed sensing and sparse signal recovery to describe the correlation between the columns of a measurement matrix Candes et al. (2011); Candes & Romberg (2007). Formally, the coherence of matrices $M$ and $N$ is defined as
$$\mu(M, N) = \begin{cases} \max_{j < k} \dfrac{|\langle M_j, M_k \rangle|}{\|M_j\|_2 \|M_k\|_2}, & M = N \\[2ex] \max_{i, j} \dfrac{|\langle M_i, N_j \rangle|}{\|M_i\|_2 \|N_j\|_2}, & M \neq N \end{cases}$$
where $M_j$ and $M_k$ denote the column vectors of matrix $M$. When no confusion arises, we write $\mu(M)$ for $\mu(M, M)$. To measure the coherence between different projectors, we introduce the Babel function Li & Lin (2018), which measures the maximum total coherence between a fixed atom and a collection of $k$ other atoms in a dictionary:
$$B(M) = \max_{\Lambda, |\Lambda| = k} \; \max_{i \notin \Lambda} \sum_{j \in \Lambda} \frac{|\langle M_i, M_j \rangle|}{\|M_i\| \|M_j\|}.$$
With the coherence metric in mind, we now introduce the main optimization objective for finding projectors. Suppose the DNN has learned tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_{t-1}$ in subspaces $S_1, S_2, \ldots, S_{t-1}$, respectively, with projectors $P_1, P_2, \ldots, P_{t-1}$. When learning task $\mathcal{T}_t$, we project features and gradients into a $d_t$-dimensional low-coherence subspace $S_t$ with projector $P_t$ so that LcSP can prevent catastrophic forgetting. The projector $P_t$ can be found by solving
$$\arg\min_{P_t} B(P_t), \quad \text{s.t.} \ P_t \in \mathbb{R}^{m \times m}, \ \mathrm{rank}(P_t) = d_t.$$
Two considerations arise in solving this problem: handling the rank constraint and preventing the entries of $P_t$ from becoming extremely large or extremely small. With these considerations in mind, we rephrase the rank- and scale-constrained problem as a problem on a Riemannian manifold, specifically the Oblique manifold $\mathcal{OM}(m, d_t)$, by setting $P_t = O_t O_t^{\top}$ and normalizing the columns of $O_t$, i.e., $\mathrm{diag}(O_t^{\top} O_t) = I_{d_t}$, where $\mathrm{diag}(\cdot)$ extracts the diagonal and $I_{d_t}$ is the $d_t \times d_t$ identity matrix.
With these settings, the new cost function $J(\cdot)$ and optimization problem can be described as follows:
$$J(O_t) = \begin{cases} \lambda \cdot B(O_t O_t^{\top}) + \gamma \cdot \mu(O_t O_t^{\top}), & t > 1 \\ \mu(O_t O_t^{\top}), & t = 1 \end{cases} \qquad O_t = \arg\min_{O_t \in \mathcal{OM}(m, d_t)} J(O_t).$$
In the cost function $J(O_t)$, we divide the optimization objective into an inter-task term $B(O_t O_t^{\top})$ and an intra-task term $\mu(O_t O_t^{\top})$, and use the parameters $\lambda$ and $\gamma$ to trade off between them. The intra-task coherence is optimized to keep $O_t$ full rank; meeting this full-rank constraint helps balance plasticity and stability as the number of tasks grows. Relevant ablation studies and numerical analyses are given in the appendix.

Optimization on the Oblique manifold, i.e., with the solution constrained to lie on the Oblique manifold, is a well-established area of research Absil et al. (2009); Absil & Gallivan (2006); Selvan et al. (2012). Here, we briefly summarize the main steps. Formally, the Oblique manifold $\mathcal{OM}(n, p)$ is defined as
$$\mathcal{OM}(n, p) \triangleq \{X \in \mathbb{R}^{n \times p} : \mathrm{diag}(X^{\top} X) = I_p\},$$
the set of all $n \times p$ matrices with normalized columns. $\mathcal{OM}$ can also be viewed as an embedded Riemannian submanifold of $\mathbb{R}^{n \times p}$, endowed with the canonical inner product $\langle X_1, X_2 \rangle = \mathrm{trace}(X_1^{\top} X_2)$, where $\mathrm{trace}(\cdot)$ is the sum of the diagonal elements of the given matrix. For a given point $X$ on $\mathcal{OM}$, the tangent space at $X$, denoted $T_X \mathcal{OM}$, is defined as
$$T_X \mathcal{OM}(n, p) = \{U \in \mathbb{R}^{n \times p} : \mathrm{diag}(X^{\top} U) = 0\}.$$
Further, the tangent space projector $P_X$ at $X$, which projects $H \in \mathbb{R}^{n \times p}$ into $T_X \mathcal{OM}$, is
$$P_X(H) = H - X \,\mathrm{ddiag}(X^{\top} H),$$
where $\mathrm{ddiag}$ sets all off-diagonal entries of a matrix to zero. When optimizing on $\mathcal{OM}$, the $k$th iterate $X_k$ must move along a descent curve on $\mathcal{OM}$ for the cost function, such that the next iterate $X_{k+1}$ stays on the manifold. This is achieved by the retraction $R_{X_k}(U) = \mathrm{normalize}(X_k + U)$, where $\mathrm{normalize}$ scales each column of its input to unit length.
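The tangent-space projection $P_X(H) = H - X\,\mathrm{ddiag}(X^\top H)$ and the retraction $R_X(U) = \mathrm{normalize}(X + U)$ are simple to express in NumPy. A minimal sketch (function names are ours):

```python
import numpy as np

def tangent_project(X, H):
    """Project ambient matrix H onto the tangent space of OM at X:
    P_X(H) = H - X ddiag(X^T H). Multiplying each column X[:, j] by the
    j-th diagonal entry of X^T H implements X @ ddiag(X^T H)."""
    return H - X * np.diag(X.T @ H)

def retract(X, U):
    """R_X(U) = normalize(X + U): rescale each column to unit length,
    mapping the updated point back onto the Oblique manifold."""
    Y = X + U
    return Y / np.linalg.norm(Y, axis=0)
```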
Finally, with this machinery, we can extend gradient descent to unconstrained optimization problems on $\mathcal{OM}$:
$$U = P_{X_k}(\nabla_{X_k} J), \qquad X_{k+1} = R_{X_k}(-\alpha U),$$
where $\nabla_{X_k} J$ denotes the Euclidean gradient at the $k$th iterate and $\alpha$ is the step size. Our algorithm for finding $O_t$ on $\mathcal{OM}$ for task $\mathcal{T}_t$ is summarized in Algorithm 1.

Algorithm 1 Construct $O_t$ on $\mathcal{OM}$ for task $\mathcal{T}_t$
Input: $O_1, \ldots, O_{t-1}$; backtracking parameters $\alpha \in (0, 0.5)$, $\beta \in (0, 1)$
Output: $O_t$
1: $R_X(U) := \mathrm{normalize}(X + U)$ ▷ normalize scales each column of the input matrix to unit length
2: $X_0 \leftarrow$ random initialization on $\mathcal{OM}$
3: $k \leftarrow 0$
4: Set tolerance error $0 \leq E \ll 1$
5: while True do
6:   $G \leftarrow \nabla J(X_k)$ ▷ Calculate Euclidean gradient
7:   $U \leftarrow G - X_k \,\mathrm{ddiag}(X_k^{\top} G)$ ▷ Calculate Riemannian gradient
8:   if $\|U\| \leq E$ then
9:     break
10:  end if
11:  $t \leftarrow 1$ ▷ Initial step size
12:  while $J(R_{X_k}(-t \cdot U)) > J(X_k) - \alpha \cdot t \cdot \|U\|_2^2$ do ▷ Backtracking search for the step size of the next iterate
13:    $t \leftarrow \beta \cdot t$
14:  end while
15:  $X_{k+1} \leftarrow R_{X_k}(-t \cdot U)$ ▷ Updating
16:  $k \leftarrow k + 1$
17: end while
18: $O_t \leftarrow X_k$
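Putting the pieces together, Algorithm 1 is Riemannian gradient descent on the Oblique manifold with Armijo backtracking. The sketch below is a generic version for any smooth cost `J` with Euclidean gradient `gradJ` (in LcSP, `J` would be the coherence objective $J(O_t)$; the function names, default parameters, and the step-size floor are our assumptions):

```python
import numpy as np

def oblique_gd(J, gradJ, X0, alpha=0.3, beta=0.5, tol=1e-6, max_iter=500):
    """Minimize J over the Oblique manifold (unit-norm columns) by
    Riemannian gradient descent with backtracking line search."""
    normalize = lambda Y: Y / np.linalg.norm(Y, axis=0)
    X = normalize(X0)
    for _ in range(max_iter):
        G = gradJ(X)                          # Euclidean gradient
        U = G - X * np.diag(X.T @ G)          # Riemannian gradient (tangent projection)
        if np.linalg.norm(U) <= tol:
            break                             # converged
        t = 1.0                               # initial step size
        # Armijo backtracking: shrink t until sufficient decrease is achieved
        while t > 1e-12 and J(normalize(X - t * U)) > J(X) - alpha * t * np.linalg.norm(U) ** 2:
            t *= beta
        X = normalize(X - t * U)              # retraction back onto the manifold
    return X
```

As a sanity check, minimizing $\|X - A\|_F^2$ over unit-norm columns should recover $A$ with its columns rescaled to unit length.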

4.2. THE APPLICATION OF LOW-COHERENCE PROJECTORS IN DNNS

With LcSP at hand, we now give some technical details of applying it in DNNs. When learning task $\mathcal{T}_t$, LcSP first constructs a task-specific projector $P_t^l$ for each layer before training and freezes them during training. These projectors project the features and gradients, ensuring that the DNN learns in the low-coherence subspace. Specifically, suppose a network $f$ with $L$ linear layers is used as the DNN architecture, and let $W_t^l$, $x_t^l$, $z_t^l$, $\sigma^l$, and $P_t^l$ denote the model parameters, the input features, the output features, the activation function, and the introduced low-coherence projector in layer $l \in \{1, \ldots, L\}$, respectively. LcSP introduces $P_t^l$ immediately after $W_t^l$ so that the pre-activation features are projected into the subspace, i.e.,
$$z_t^l = (x_t^l W_t^l) P_t^l, \qquad x_t^{l+1} = \sigma^l(z_t^l).$$
By the chain rule, the gradients at $W_t^l$ are also multiplied by $P_t^l$ in backpropagation:
$$\frac{\partial \mathcal{L}}{\partial (W_t^l)_{(i,:)}} = \frac{\partial \mathcal{L}}{\partial z_t^l} \frac{\partial z_t^l}{\partial (W_t^l)_{(i,:)}} = \frac{\partial \mathcal{L}}{\partial z_t^L} \left( \prod_{k=l}^{L-1} \frac{\partial z_t^{k+1}}{\partial z_t^k} \right) \cdot (x_t^l)_i \cdot P_t^l,$$
where $(W_t^l)_{(i,:)}$ is the $i$th row of $W_t^l$ and $(x_t^l)_i$ is the $i$th element of $x_t^l$. In Convolutional Neural Networks (CNNs), the input and output typically represent image features and have more than two dimensions, e.g., input channel, output channel, height, and width. In this case, we reshape $z^l \in \mathbb{R}^{c_{out} \times (c_{in} \cdot h \cdot w)}$ to $z^l \in \mathbb{R}^{(c_{in} \cdot h \cdot w) \times c_{out}}$ and align the dimension of the projector with the output channel, so that $P_t^l \in \mathbb{R}^{c_{out} \times c_{out}}$. After the projection, we restore the shape of $z_t^l$ so it can be used as input to the next layer.

Overcoming Catastrophic Forgetting in BN-based Models BN is a widely used module that makes the training of DNNs faster and more stable by normalizing the layers' features, i.e., re-centering and re-scaling them Ioffe & Szegedy (2015).
However, re-centering and re-scaling change the data distribution (e.g., the mean and the variance) of the features of previous tasks, which often leads to catastrophic forgetting in LcSP. For example, when learning a new task $\mathcal{T}_t$, updating the BN statistics shifts the feature distribution seen by previously learned parameters, so they may no longer work for an earlier task. To solve this problem, we propose two strategies in LcSP: strategy (1), learning a task-specific BN for each task, and strategy (2), using GN instead of BN. We verify the effectiveness of these two strategies experimentally and compare their performance in § 5.
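The per-layer projection of § 4.2 is mechanically simple. The sketch below shows the fully connected case $z = (xW)P$ and the channel-axis reshape used for CNN features; for illustration we use a toy projector onto the first $d$ coordinate axes (all names and shapes are ours):

```python
import numpy as np

def project_layer(x, W, P):
    """One projected linear layer: z = (x W) P. By the chain rule, the
    gradient w.r.t. W is multiplied by the same (frozen) projector P."""
    return (x @ W) @ P

def project_conv_features(z, P):
    """Project CNN features along the output-channel axis: z of shape
    (c_out, c_in*h*w) is transposed so that P in R^{c_out x c_out} acts
    on the channel dimension, then the original layout is restored."""
    return (z.T @ P).T

# toy example: a rank-d projector onto the first d output coordinates
rng = np.random.default_rng(0)
d, width = 3, 6
O = np.eye(width)[:, :d]              # unit columns: O lies on OM(width, d)
P = O @ O.T                           # P = O O^T, rank d
x = rng.standard_normal((2, 4))
W = rng.standard_normal((4, width))
z = project_layer(x, W, P)            # pre-activations confined to a d-dim subspace
```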

4.3. METHOD ANALYSIS

In this section, we analyze the plasticity and stability of LcSP.

Stability Analysis Let $\theta = \{W_t^l\}_{l=1}^{L}$ denote the parameter set of $f$; $\Delta\theta = \{\Delta W_t^1, \ldots, \Delta W_t^L\}$ the set of parameter changes after learning task $\mathcal{T}_t$; $\mathcal{P}_t = \{P_t^l\}_{l=1}^{L}$ the set of projectors obtained by Algorithm 1; and $x_{q,t}^l$ and $z_{q,t}^l$ the input and output of layer $l$ when the data of task $\mathcal{T}_q$ ($q \leq t$) are fed into the network $f$ after it has been trained on task $\mathcal{T}_t$.

Lemma 1. Assume that $f$ is fed the data of task $\mathcal{T}_q$ ($q < t$). Then $f$ effectively overcomes catastrophic forgetting if
$$z_{q,q}^l \approx z_{q,t}^l, \quad \forall q \leq t$$
holds for all $l \in \{1, 2, \ldots, L\}$.

Lemma 1 says that $f$ overcomes catastrophic forgetting if its outputs on previous tasks are invariant. In the following, we show that LcSP achieves approximate invariance of the outputs on previous tasks.

Proof. Suppose $q = t - 1$. When $l = 1$, $x_{q,t}^l = x_{q,q}^l$. Then
$$z_{q,t}^l = x_{q,t}^l (W_q^l + \Delta W_t^l) P_q^l = x_{q,t}^l W_q^l P_q^l + x_{q,t}^l \Delta W_t^l P_q^l = z_{q,q}^l + x_{q,t}^l \Delta W_t^l P_q^l.$$
Let $g_t^l$ denote the gradient when training the network on task $\mathcal{T}_t$. In backpropagation, $\Delta W_t^l = g_t^l P_t^l$, so $x_{q,t}^l \Delta W_t^l P_q^l = x_{q,t}^l g_t^l P_t^l P_q^l$. If the inter-task coherence $\mu(P_t^l, P_q^l) \approx 0$, then $P_t^l P_q^l \approx 0$; projectors satisfying this condition can be found by Algorithm 1. Repeating this argument layer by layer shows that $z_{q,q}^l \approx z_{q,t}^l$ holds for all layers, and the proof generalizes to any previous task $\mathcal{T}_q$.

Plasticity Analysis Let $\tilde{g}_t^l = g_t^l P_t^l$ denote the projected gradient at $W_t^l$. $f$ can achieve the optimal loss on task $\mathcal{T}_t$ if $\langle g_t^l, \tilde{g}_t^l \rangle > 0$ holds for each $l \in \{1, \ldots, L\}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product. We now show that this condition holds. Proof.
Since $P_t^l = O_t^l (O_t^l)^{\top}$ and $\tilde{g}_t^l = g_t^l P_t^l$, we have
$$\langle g_t^l, \tilde{g}_t^l \rangle = g_t^l (\tilde{g}_t^l)^{\top} = g_t^l O_t^l (O_t^l)^{\top} (g_t^l)^{\top} = \langle g_t^l O_t^l, g_t^l O_t^l \rangle = \|g_t^l O_t^l\|^2 > 0.$$
Note that $\|g_t^l O_t^l\|^2$ is always positive unless $g_t^l O_t^l = 0$. This result generalizes to every layer.
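The inner-product identity in the plasticity proof is easy to verify numerically. A quick sketch (the dimensions and random seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 8, 3
O = rng.standard_normal((m, d))
O /= np.linalg.norm(O, axis=0)       # unit columns: O is a point on OM(m, d)
P = O @ O.T                          # projector P_t = O_t O_t^T (symmetric)
g = rng.standard_normal(m)           # Euclidean gradient, as a row vector

inner = float(g @ (P @ g))           # <g, g P> = g P g^T
# identity from the proof: <g, g P> = ||g O||^2 >= 0
assert abs(inner - np.linalg.norm(g @ O) ** 2) < 1e-10
assert inner > 0                     # the projected gradient remains a descent direction
```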

5. EXPERIMENTS

In this section, we evaluate our approach on several popular continual learning benchmarks and compare LcSP with previous state-of-the-art methods. The accuracy and forgetting results demonstrate the effectiveness of LcSP, especially when the number of tasks is large.

5.2. BASELINES

We compare the proposed method with SOTA methods based on GOP. As mentioned above, we divide GOP into COP and SOP. For the COP strategy, we compare the proposed method with OWM Zeng et al. (2019), Adam-NSCL Wang et al. (2021), and GPM Saha et al. (2021). For the SOP strategy, we compare it with ORTHOG-SUBSPACE Chaudhry et al. (2020). Moreover, we also compare our method with HAT Serra et al. (2018) and EWC Kirkpatrick et al. (2017).

Figure 2: Average accuracy and forgetting for different λ and γ on Permuted MNIST. A fully connected network with 2 hidden layers, each with 64 neurons, is used for this experiment.

5.3. IMPLEMENTATION DETAILS

Learning 20 Tasks For experiments on Permuted MNIST and Rotated MNIST, all methods use a fully connected network with two hidden layers, each with 256 neurons and ReLU activations. For experiments on CIFAR and miniImageNet, all methods use the standard ResNet18 architecture except OWM, HAT, and GPM, which use AlexNet Krizhevsky et al. (2012). As described in § 4.2, the proposed LcSP makes two changes to the BN layers in standard ResNet18: (1) learning a task-specific BN for each task, and (2) replacing BN with GN. We also apply these strategies to ORTHOG-SUBSPACE for additional comparison. For experiments on MNIST, all tasks share the same classifier; for experiments on CIFAR and miniImageNet, each task has a task-specific classifier. In all experiments, LcSP does not use episodic memory to store data samples for replay. All methods are trained with Stochastic Gradient Descent (SGD). The learning rate is set to 0.01 for experiments on MNIST and 0.003 for experiments on CIFAR and miniImageNet. Both λ and γ in the cost function J are set to 1. All experiments were run five times with five different random seeds, with a batch size of 10.

Learning 150 Tasks and 64 Tasks For experiments on Permuted CIFAR10, LcSP uses the ResNet18 architecture and applies strategy (1) to the BN layers. To compare the performance of the different GOP strategies, we did not use episodic memory for ORTHOG-SUBSPACE. The other experimental settings are the same as described above.

5.4. EXPERIMENTAL RESULTS

Comparisons of Learning 20 Tasks Table 1 compares the average accuracy and forgetting of the proposed LcSP and its variants (LcSP-BN and LcSP-GN) with the baselines on the four continual learning benchmarks. LcSP-BN and LcSP-GN adopt strategies (1) and (2) described in § 4.2, respectively. First, as shown in Table 1, the proposed methods outperform all baselines on MNIST and miniImageNet. On Permuted MNIST and Rotated MNIST, the average accuracy of LcSP surpasses the baselines by 5.6% ∼ 23.8% and 4.8% ∼ 43%, respectively. On miniImageNet, the average accuracy of LcSP-BN surpasses the baselines by 4.63% ∼ 44.9%. On CIFAR100, the proposed LcSP-BN achieves competitive performance with the second-highest average accuracy, 3.25% lower than Adam-NSCL, and a forgetting rate of 0. The average accuracy of LcSP-GN also outperforms most baselines, being lower than Adam-NSCL, HAT, and GPM on CIFAR100 but higher than all compared methods on miniImageNet. These results suggest that minimizing inter-task and intra-task coherence with low-coherence projectors is an effective strategy against catastrophic forgetting. Second, the results on CIFAR100 and miniImageNet also show that BN in ORTHOG-SUBSPACE Chaudhry et al. (2020) and LcSP may change previous tasks' data distributions and lead to catastrophic forgetting; both strategies (1) and (2) described in § 4.2 effectively solve this problem.

Comparisons of Learning 150 Tasks and 64 Tasks To demonstrate the advantage of the proposed methods in learning long sequences of tasks, the following experiments report results with 150 tasks and 64 tasks. Note that, in Fig. 1, LcSP (orthogonal) and LcSP-BN (orthogonal) use orthogonal projectors for comparison, while LcSP (low-coherence) and LcSP-BN (low-coherence) use low-coherence projectors. Fig. 1(a) and 1(b) report the average accuracy and forgetting of the last 10 tasks when learning 150 tasks on Permuted MNIST. Fig. 1(c) and 1(d) report the average accuracy and forgetting of the last 5 tasks when learning 64 tasks on Permuted CIFAR10. The average accuracy of all methods, except the proposed LcSP-BN (low-coherence), dramatically degrades or is consistently low as the number of tasks increases. Furthermore, Fig. 1(d) shows that all methods except ORTHOG-SUBSPACE exhibit almost no forgetting. Together, these results indicate that methods using orthogonal projectors gradually lose their learning capacity as the number of tasks increases; the proposed method relaxes the orthogonality restriction with low-coherence projectors, effectively solving this problem. Fig. 2 gives an ablation study of our method for different λ and γ. When λ equals γ, the average accuracy on Permuted MNIST is highest; results are worst when either λ or γ equals zero. This indicates that both inter-task and intra-task coherence should be minimized to resolve the plasticity-stability dilemma.

6. CONCLUSION

This paper proposed a novel gradient projection approach for continual learning to address the learning capacity degradation problem of gradient orthogonal projection. Instead of learning in orthogonal subspaces, we propose projecting features and gradients via low-coherence projectors that minimize inter-task and intra-task coherence. Additionally, two strategies were proposed to mitigate the catastrophic forgetting caused by the BN layer, i.e., replacing BN with GN or learning a task-specific BN for each task. Extensive experiments show that our approach works well in alleviating forgetting and has a significant advantage in maintaining learning capacity, especially when learning long sequences of tasks.

Ablation studies and experiments for rank and scale constraints Further ablation studies and experiments investigate the effects of the rank and scale constraints on the expressive power (plasticity) and stability of DNNs. The result in Fig. 5(a) suggests that projecting features or gradients into subspaces of low dimension (e.g., lower than 5 in Fig. 5(a)) decreases the expressive power of the DNN. GOP methods and LcSP rely on projectors to map the features or gradients (or both) into a d-dimensional subspace, which can be viewed as a form of dimension reduction. The dimension reduction is motivated by a consensus in the high-dimensional data analysis community that data can be summarized in a low-dimensional space embedded in a high-dimensional space, such as a nonlinear manifold Levina & Bickel (2004). The dimension of this low-dimensional space is known as the intrinsic dimension D Carreira-Perpinán (1997). If d is too small, e.g., d ≪ D, important data features will be "collapsed" onto the same dimension.
From the perspective of training a DNN, if the gradients are projected into a subspace with a dimension lower than D, the DNN cannot activate sufficient parameters to learn the representation of the task. Due to the unknown number of tasks to be learned, the limited dimension of the features, and the strict orthogonality constraint between projectors, an orthogonal projector cannot be constrained to a fixed-rank manifold. As shown in Fig. 5(b), the rank of the orthogonal projector decreases as the number of tasks increases. Therefore, methods using orthogonal projectors ignore the intrinsic dimension of the data (features or gradients) and ultimately suffer from the learning capacity degradation problem. In contrast, LcSP relaxes the orthogonality constraint and meets the rank constraint by optimizing the intra-task coherence on the Oblique manifold, and thus does not suffer from this problem. Finally, Fig. 6 gives an ablation study of the scale constraint on the projector's columns. As shown in Fig. 6, the average accuracy is highest when the columns of the projector have unit length, and results worsen when the column length is too small or too large. On an RTX3060 12G, we tested the efficiency of LcSP and the compared methods on Permuted CIFAR10 (which contains 64 tasks, each image of size 3 × 32 × 32), using AlexNet for OWM and GPM and ResNet18 for the other methods. We report four metrics: floating point operations (FLOPs), time per training epoch (Time (s)), the number of epochs needed to converge (Epochs), and the average inference time (Mean inference time (ms)). The detailed results are listed in Table 4.

A.3 METHOD ANALYSIS

The mechanism of low-coherence learning and its difference from orthogonal learning Recall Lemma 1 from Section 4.3: assume that $f$ is fed the data of task $\mathcal{T}_q$ ($q < t$); then $f$ effectively overcomes catastrophic forgetting if $z_{q,q}^l \approx z_{q,t}^l$ holds for all $q \leq t$ and all layers $l$.



Figure 1: Fig. 1(a) and 1(b) show the average accuracy and forgetting of the last 10 tasks on Permuted MNIST when learning 150 tasks. Fig. 1(c) and 1(d) show the average accuracy and forgetting of the last 5 tasks on Permuted CIFAR10 when learning 64 tasks.

Figure 4: The accuracy of the last task on Permuted MNIST (left) and Permuted CIFAR10 (right), respectively.

Figure 6: The average accuracy (left) and forgetting (right) of the last 20 tasks with different scales of columns on Permuted MNIST.

Figure 7: The average accuracy (left) and forgetting (right) of the last 5 tasks on Permuted CIFAR10. All methods use the low-coherence projector.

Table 1: The average accuracy and forgetting results of the proposed LcSP and baselines. Memory denotes whether the method is trained using a replay strategy with episodic memory. In this experimental setup, both of the above MNIST datasets contain 20 tasks, each task containing 10,000 samples from 10 classes. Split CIFAR is constructed by splitting CIFAR100 into multiple tasks, where each task contains the data of five random classes (without replacement) out of the total 100 classes. Split miniImageNet Vinyals et al. (2016) is a subset of ImageNet; each task contains the data of five random classes (without replacement) out of 100 classes. Both CIFAR100 and miniImageNet contain 20 tasks, each containing 250 samples from each of its five classes.

Table 4: Efficiency analysis.

A APPENDIX

In this section, we give the implementation details about the experiments on Permuted MNIST and Permuted CIFAR10, to help readers reproduce these experiments. Additionally, more ablation studies and experimental results are provided here to further support the conclusions and contributions.

A.1 IMPLEMENTATION DETAILS

The main hyperparameter settings are listed in Tables 2 and 3. For the baselines, we adopt the default settings provided in their code to obtain proper performance. For a fair comparison, we use a uniform batch size for all methods.

A.2 ABLATION STUDIES AND ADDITIONAL RESULTS

Additional results on Permuted MNIST and Permuted CIFAR10 Readers may wonder whether our conclusion holds if we evaluate the average performance over more tasks (e.g., the average accuracy and forgetting on the last 20 tasks). As shown in Fig. 3, LcSP still outperforms all baselines by a significant margin. However, the learning capacity degradation of the baselines becomes harder to see, e.g., the average accuracy of OWM on Permuted CIFAR10 is consistently low rather than significantly decreasing. To further investigate the learning capacity degradation problem, we report the accuracy of the baselines on the last task of Permuted MNIST and Permuted CIFAR10. As shown in Fig. 7, all baselines except LcSP suffer from this problem to different degrees, showing a decrease in accuracy compared to the initial task (66.16% ∼ 24.63% on Permuted MNIST and 24.8% ∼ 3.48% on Permuted CIFAR10). These results suggest that the learning capacity degradation problem is the critical factor behind the degraded performance of GOP-based methods in the case of a large number of tasks.

Here, $z_{q,t}^l$ denotes the output feature of the $l$-th layer when data from task $\mathcal{T}_q$ are fed into the DNN after it has been trained on task $\mathcal{T}_t$. Consider the case of only two tasks, i.e., $q = t - 1$. Applying LcSP's projection strategy, we have
$$z_{q,t}^l = z_{q,q}^l + x_{q,t}^l \Delta W_t^l P_q^l = z_{q,q}^l + x_{q,t}^l g_t^l P_t^l P_q^l.$$
Here, $g_t^l$, $x_{q,t}^l$, and $P_t^l$ denote the gradient, the input data, and the (symmetric) projection matrix of task $\mathcal{T}_t$ at the $l$-th layer, respectively. The term $x_{q,t}^l g_t^l P_t^l P_q^l$ can be thought of as the forgetting, since it changes the output of the DNN on previous data. Since we cannot change $x_{q,t}^l$ or $g_t^l$, the key to reducing this term is to minimize $P_t^l P_q^l$. If $P_t^l$ is orthogonal to $P_q^l$, the forgetting equals 0. We now consider reducing forgetting when $P_t^l$ cannot be orthogonal to $P_q^l$ (e.g., when the number of tasks is large and the dimensionality of the space is limited).
If we optimized $P_t^l P_q^l$ directly, we might obtain $P_t^l = 0$, or a $P_t^l$ whose column scale is too small. To find a useful $P_t^l$, LcSP constrains the column vectors to unit length, i.e., it searches for $P_t^l$ on the Oblique manifold ($\mathcal{OM}$). To reduce forgetting, we optimize the inter-task coherence $\mu(P_t^l, P_q^l)$ on $\mathcal{OM}$ so that the maximum entry of $P_t^l P_q^l$ is minimized. In the worst case, we might obtain a $P_t^l$ built from a single repeated column direction (i.e., $P_t^l$ with rank 1). To avoid this, we optimize the intra-task coherence $\mu(P_t^l)$ along with the inter-task coherence, forcing the projector to satisfy the rank constraint and thereby maintain the learning capacity of the DNN (as shown in Fig. 5(a), the rank of the projector affects the plasticity of the DNN). In general, low-coherence projection, which can be seen as a relaxed orthogonality constraint, aims for a better balance between plasticity and stability, motivated by our observation that orthogonal projection reduces the plasticity of the model and leaves the DNN unable to adapt to new environments.
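As an illustration of this mechanism (our own toy check, not an experiment from the paper), the sketch below compares the size of the interference term $P_t^l P_q^l$ when the new task reuses the old subspace versus when it uses an independent random subspace in a high-dimensional space, where mutual coherence is typically low:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 64, 4

def oblique_projector(O):
    """P = O O^T with unit-norm columns of O (a point on OM(m, d))."""
    return O @ O.T

Oq = rng.standard_normal((m, d)); Oq /= np.linalg.norm(Oq, axis=0)  # old task
Ot = rng.standard_normal((m, d)); Ot /= np.linalg.norm(Ot, axis=0)  # new task
Pq, Pt = oblique_projector(Oq), oblique_projector(Ot)

# the forgetting term x g P_t P_q scales with ||P_t P_q||
worst = np.linalg.norm(Pq @ Pq)  # reusing the same subspace: maximal interference
low = np.linalg.norm(Pt @ Pq)    # independent low-coherence subspace: much smaller
assert low < worst
```

In high dimension, independent random subspaces are nearly mutually orthogonal, so the cross term is small without being exactly zero; LcSP's optimization drives it down further while keeping each projector full rank.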

