CONTINUAL LEARNING IN LOW-COHERENCE SUBSPACE: A STRATEGY TO MITIGATE LEARNING CAPACITY DEGRADATION

Abstract

Methods using gradient orthogonal projection, an efficient strategy in continual learning, have achieved promising success in mitigating catastrophic forgetting. However, these methods often suffer from a learning capacity degradation problem as the number of tasks increases. To address this problem, we propose to learn new tasks in low-coherence subspaces rather than orthogonal subspaces. Specifically, we construct a unified cost function involving the regular DNN parameters and gradient projections on the Oblique manifold. We then develop a gradient descent algorithm on this smooth manifold that jointly minimizes the cost function and both the inter-task and intra-task coherence. Numerical experiments show that, compared with baselines, the proposed method has prominent advantages in maintaining learning capacity as tasks are added, especially for a large number of tasks.

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved promising performance on many tasks. However, they lack the ability for continual learning: they suffer from catastrophic forgetting French (1999) when learning sequential tasks, a phenomenon in which new knowledge interferes with old knowledge. Research on continual learning, also known as incremental learning Aljundi et al. (2018a); Chaudhry et al. (2018a); Chen & Liu (2018); Aljundi et al. (2017), or sequential learning Aljundi et al. (2018b); McCloskey & Cohen (1989), aims to find effective algorithms that enable DNNs to simultaneously achieve plasticity and stability, i.e., both high learning capacity and high memory capacity. Various methods have been proposed to avoid or mitigate catastrophic forgetting De Lange et al. (2019), either by replaying training samples Rolnick et al. (2019); Ayub & Wagner (2020); Saha et al. (2021), or by reducing the mutual interference of model parameters, features, or model architectures between different tasks Zenke et al. (2017); Mallya & Lazebnik (2018); Wang et al. (2021). Among these methods, Gradient Orthogonal Projection (GOP) Chaudhry et al. (2020); Zeng et al. (2019); Farajtabar et al. (2020); Li et al. (2021) is an efficient continual learning strategy that projects gradients with an orthogonal projector to prevent knowledge interference between tasks. GOP-based methods have achieved encouraging results in mitigating catastrophic forgetting. However, as Fig. 1 shows, these methods suffer from a learning capacity degradation problem: their learning capacity gradually degrades as the number of tasks increases, until new tasks eventually become unlearnable. Specifically, when learning many tasks, e.g., more than 30 tasks in Fig. 1, their performance on new tasks decreases dramatically. These results suggest that GOP-based methods focus on stability and somewhat ignore plasticity, which may limit the task learning capacity of models, i.e., the number of tasks a model can learn without forgetting. To address this issue, we propose to learn new tasks in low-coherence subspaces rather than orthogonal subspaces. Specifically, low-coherence projectors are used in each layer to project features and gradients into low-coherence subspaces.
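The dimension-counting argument behind this degradation can be made concrete with a small numpy sketch (our own illustration, not code from any cited method): if each task reserves a k-dimensional subspace of an n-dimensional gradient space and later gradients are projected onto the orthogonal complement of all reserved directions, the gradient energy available for learning shrinks toward zero as tasks accumulate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 10          # gradient dimension, dims reserved per task

U = np.zeros((n, 0))    # accumulated orthonormal basis of reserved subspaces
for task in range(10):
    # reserve k fresh directions for this task, orthonormalized against U
    new = rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(np.hstack([U, new]))
    U = Q[:, : (task + 1) * k]

    # orthogonal gradient projection: g' = (I - U U^T) g
    g = rng.standard_normal(n)
    g_proj = g - U @ (U.T @ g)
    print(f"task {task + 1:2d}: retained gradient energy "
          f"{np.linalg.norm(g_proj) ** 2 / np.linalg.norm(g) ** 2:.2f}")
```

By the tenth task the reserved subspaces span all of R^n, the projected gradient is numerically zero, and no further learning is possible; this is exactly the unlearnability observed in Fig. 1.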
To achieve this, we construct a unified cost function to find the projectors and develop a gradient descent algorithm on the Oblique manifold that jointly minimizes inter-task and intra-task coherence. Minimizing the inter-task coherence reduces mutual interference between tasks, and minimizing the intra-task coherence enhances the model's expressive power. Restricting the projectors to the Oblique manifold avoids scale ambiguity. The main contributions of this work are summarized as follows. First, to address the learning capacity degradation problem of GOP, we propose a novel method, Low-coherence Subspace Projection (LcSP), that replaces orthogonal projectors with low-coherence gradient projectors, allowing the DNN to maintain both plasticity and stability. Second, we observe that GOP models with Batch Normalization (BN) Ioffe & Szegedy (2015) layers can cause catastrophic forgetting. This paper proposes two strategies in LcSP to solve this problem: replacing BN with Group Normalization (GN) Wu & He (2018), or learning a task-specific BN for each task.
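A minimal sketch of the quantities involved may help here (our own illustration; the simple squared off-diagonal surrogate below is an assumption, not the paper's exact cost function). The Oblique manifold is the set of matrices with unit-norm columns; mutual coherence is the largest absolute inner product between distinct columns; and one Riemannian descent step removes the radial component of the Euclidean gradient from each column and then retracts by renormalizing.

```python
import numpy as np

def coherence(B):
    """Mutual coherence: max |<b_i, b_j>| over distinct unit-norm columns."""
    G = B.T @ B
    return np.abs(G - np.diag(np.diag(G))).max()

def oblique_descent_step(B, egrad, lr=0.1):
    """One Riemannian gradient step on the Oblique manifold (unit-norm
    columns): project the Euclidean gradient onto the tangent space,
    take a step, then retract by renormalizing each column."""
    # tangent projection: subtract each column's radial component
    rgrad = egrad - B * (B * egrad).sum(axis=0, keepdims=True)
    B_new = B - lr * rgrad
    return B_new / np.linalg.norm(B_new, axis=0, keepdims=True)

rng = np.random.default_rng(1)
B = rng.standard_normal((50, 8))
B /= np.linalg.norm(B, axis=0, keepdims=True)   # start on the manifold

for _ in range(200):
    # Euclidean gradient of the (assumed) intra-task coherence
    # surrogate ||offdiag(B^T B)||_F^2, up to a constant factor
    G = B.T @ B
    off = G - np.diag(np.diag(G))
    B = oblique_descent_step(B, 2 * B @ off)

print(f"coherence after optimization: {coherence(B):.4f}")
```

The same machinery extends to the inter-task term by also penalizing inner products between the current projector's columns and those of previously fixed projectors; the unit-norm constraint is what removes the scale ambiguity mentioned above.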

2. RELATED WORK

In this section, we briefly review existing works on continual learning and GOP-based methods.

Replay-based Strategy

The basic idea of this type of approach is to use limited memory to store small amounts of data (e.g., raw samples) from previous tasks, called episodic memory, and to replay them when training a new task. Some existing works focus on selecting a subset of raw samples from previous tasks Rolnick et al. (2019); Isele & Cosgun (2018); Chaudhry et al. (2019); Zhang et al. (2020). In contrast, others concentrate on training a generative model to synthesize new data that can substitute for the old data Shin et al. (2017); Van de Ven & Tolias (2018); Lavda et al. (2018); Ramapuram et al. (2020).

Regularization-based Strategy

This strategy prevents catastrophic forgetting by introducing a regularization term in the loss function that penalizes changes in the network parameters. Existing works can be divided into data-focused and prior-focused methods De Lange et al. (2021). Data-focused methods take the previous model as the teacher and the current model as the student, transferring knowledge from the teacher to the student through knowledge distillation. Typical methods include LwF Li & Hoiem (2017), LFL Jung et al. (2016), EBLL Rannen et al. (2017), DMC Zhang et al. (2020), and GD-WILD Lee et al. (2019). Prior-focused methods estimate a distribution over the model parameters, assigning an importance score to each parameter and penalizing changes to important parameters during learning. Relevant works include SI Zenke et al. (2017), EWC Kirkpatrick et al. (2017), RWalk Chaudhry et al. (2018a), AGS-CL Jung et al. (2020), and IMM Lee et al. (2017).

Parameter Isolation-based Strategy

This strategy dynamically modifies the network architecture by pruning, parameter masking, or expansion to greatly or even completely reduce catastrophic forgetting. Existing works fall roughly into two categories. One is dedicated to isolating a separate sub-network for each task from a large network through pruning and parameter masking, including PackNet Mallya & Lazebnik (2018), PathNet Fernando et al. (2017), HAT Serra et al. (2018), and Piggyback Mallya et al. (2018). The other class of methods dynamically expands the network architecture, increasing the number of neurons or sub-network branches, to break the limits of expressive capacity (Rusu et al., 2016; Aljundi et al., 2017; Xu & Zhu, 2018; Rosenfeld & Tsotsos, 2018). However, as the number of tasks grows, this approach complicates the network architecture and increases computation and memory consumption.

Orthogonal Projection-based Strategy

Methods based on the GOP strategy, which reduce catastrophic forgetting by projecting gradients or features with orthogonal projectors, have been shown to be effective in continual learning with encouraging results Farajtabar et al. (2020); Zeng et al. (2019); Saha et al. (2021); Wang et al. (2021); Chaudhry et al. (2020). According to how the projector is found, existing works can be further divided into Context Orthogonal Projection (COP) and Subspace Orthogonal Projection (SOP). COP-based methods, such as OWM Zeng et al. (2019), Adam-NSCL Wang et al. (2021), and GPM Saha et al. (2021), rely on the context of previous tasks to build projectors. In contrast, SOP-based methods such as ORTHOG-SUBSPACE Chaudhry et al. (2020) use hand-crafted, task-specific orthogonal projectors and yield competitive results.
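As a concrete illustration of the GOP mechanism (a GPM-style sketch of our own, not code from any of the cited papers): if U is an orthonormal basis for the subspace spanned by a previous task's layer inputs, projecting the weight gradient onto the orthogonal complement of span(U) guarantees that the update leaves the layer's outputs on those old inputs unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, n_old = 20, 5, 8

# inputs seen during the previous task, and the trained weight matrix
X_old = rng.standard_normal((d_in, n_old))
W = rng.standard_normal((d_out, d_in))

# orthonormal basis of the old-input subspace
U, _ = np.linalg.qr(X_old)

# raw gradient for the new task, projected onto the orthogonal
# complement of span(U): G' = G (I - U U^T)
G = rng.standard_normal((d_out, d_in))
G_proj = G - (G @ U) @ U.T

W_new = W - 0.1 * G_proj

# outputs on the old inputs are preserved (up to float error)
print(np.abs(W_new @ X_old - W @ X_old).max())
```

Since every column of X_old lies in span(U), the projected update satisfies G_proj @ X_old = 0, so old-task outputs are untouched; the cost is that new-task updates are confined to an ever-shrinking complement, which is the plasticity issue LcSP targets.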

