CONTINUAL LEARNING IN LOW-COHERENCE SUBSPACE: A STRATEGY TO MITIGATE LEARNING CAPACITY DEGRADATION

Abstract

Methods using gradient orthogonal projection, an efficient strategy in continual learning, have achieved promising success in mitigating catastrophic forgetting. However, these methods often suffer from a learning capacity degradation problem as the number of tasks increases. To address this problem, we propose to learn new tasks in low-coherence subspaces rather than orthogonal subspaces. Specifically, we construct a unified cost function involving the regular DNN parameters and gradient projections on the Oblique manifold. We then develop a gradient descent algorithm on this smooth manifold to minimize the cost function, jointly reducing both the inter-task and the intra-task coherence. Numerical experiments show that, compared with baselines, the proposed method has prominent advantages in maintaining learning capacity as tasks are added, especially when the number of tasks is large.

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved promising performance on many tasks. However, they lack the ability to learn continually: they suffer from catastrophic forgetting French (1999) when learning sequential tasks, a phenomenon in which new knowledge interferes with old knowledge. Research on continual learning, also known as incremental learning Aljundi et al. (2018a); Chaudhry et al. (2018a); Chen & Liu (2018); Aljundi et al. (2017), and sequential learning Aljundi et al. (2018b); McCloskey & Cohen (1989), aims to find effective algorithms that enable DNNs to achieve plasticity and stability simultaneously, i.e., both high learning capacity and high memory capacity. Various methods have been proposed to avoid or mitigate catastrophic forgetting De Lange et al. (2019), either by replaying training samples Rolnick et al. (2019); Ayub & Wagner (2020); Saha et al. (2021), or by reducing the mutual interference of model parameters, features, or architectures between different tasks Zenke et al. (2017); Mallya & Lazebnik (2018); Wang et al. (2021). Among these methods, Gradient Orthogonal Projection (GOP) Chaudhry et al. (2020); Zeng et al. (2019); Farajtabar et al. (2020); Li et al. (2021) is an efficient continual learning strategy that projects gradients with an orthogonal projector to prevent knowledge interference between tasks.

GOP-based methods have achieved encouraging results in mitigating catastrophic forgetting. However, as shown in Fig. 1, these methods suffer from a learning capacity degradation problem: their learning capacity gradually degrades as the number of tasks increases, until the model eventually becomes unable to learn new tasks. Specifically, when learning many tasks, e.g., more than 30 tasks in Fig. 1, their performance on new tasks drops dramatically. These results suggest that GOP-based methods emphasize stability while somewhat neglecting plasticity. Neglecting plasticity limits the task learning capacity of a model, i.e., the number of tasks the model can learn without forgetting. To address this issue, we propose to learn new tasks in low-coherence subspaces rather than orthogonal subspaces. Specifically, low-coherence projectors are used at each layer to project features and gradients into low-coherence subspaces. To achieve this, we construct a unified cost function to find the projectors and develop a gradient descent algorithm on the Oblique manifold that jointly minimizes the inter-task and intra-task coherence, as sketched below. Minimizing the inter-task coherence reduces mutual interference between tasks, while minimizing the intra-task coherence enhances the model's expressive power. Restricting the projectors to the Oblique manifold avoids scale ambiguity.
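To make the idea concrete, the snippet below is a minimal, hypothetical sketch of the two ingredients just described: a coherence measure between column-normalized bases, and a simple projected-gradient loop that keeps a projector's columns on the Oblique manifold (unit-norm columns) while reducing inter-task and intra-task coherence. All names (`cross_coherence`, `self_coherence`, `retract_to_oblique`, `learn_projector`, `project_gradient`) are illustrative assumptions, not this paper's implementation; the unified cost function and the manifold algorithm developed later are not reproduced here, and the "Euclidean step followed by column renormalization" used below is only a retraction-based stand-in for a proper Riemannian update.

```python
import torch

def cross_coherence(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Inter-task coherence: mean squared inner product between the
    unit-norm columns of A and the unit-norm columns of B."""
    return (A.t() @ B).pow(2).mean()

def self_coherence(A: torch.Tensor) -> torch.Tensor:
    """Intra-task coherence: the same measure among A's own columns,
    ignoring the trivial diagonal (each column with itself)."""
    G = A.t() @ A
    off_diag = G - torch.diag(torch.diagonal(G))
    return off_diag.pow(2).mean()

def retract_to_oblique(P: torch.Tensor) -> torch.Tensor:
    """Retraction onto the Oblique manifold: renormalize every column."""
    return P / P.norm(dim=0, keepdim=True).clamp_min(1e-12)

def learn_projector(P_old: torch.Tensor, dim: int, k: int,
                    steps: int = 200, lr: float = 1e-2) -> torch.Tensor:
    """Find a dim x k basis P with low coherence to a previous tasks'
    basis P_old (inter-task) and low coherence among its own columns
    (intra-task), keeping the columns unit-norm after every update."""
    P = retract_to_oblique(torch.randn(dim, k)).requires_grad_(True)
    opt = torch.optim.SGD([P], lr=lr)
    for _ in range(steps):
        loss = cross_coherence(P, P_old) + self_coherence(P)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            P.copy_(retract_to_oblique(P))  # stay on the Oblique manifold
    return P.detach()

def project_gradient(grad_vec: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """Constrain a flattened layer gradient to the subspace spanned by P,
    in the spirit of gradient-projection continual learning."""
    return P @ (P.t() @ grad_vec)
```

In a GOP-style training loop, such a projector would transform each layer's gradient (and, in our setting, its features) before the optimizer step; how the projectors are coupled with the regular network parameters is specified by the unified cost function introduced in the following sections.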

