NEW INSIGHTS FOR THE STABILITY-PLASTICITY DILEMMA IN ONLINE CONTINUAL LEARNING

Abstract

The aim of continual learning is to learn new tasks continuously (i.e., plasticity) without forgetting previously learned knowledge from old tasks (i.e., stability). In online continual learning, wherein data arrives strictly in a streaming manner, plasticity is more vulnerable than in offline continual learning because the training signal that can be obtained from a single data point is limited. To overcome the stability-plasticity dilemma in online continual learning, we propose an online continual learning framework named the multi-scale feature adaptation network (MuFAN), which utilizes a richer context encoding extracted from different levels of a pre-trained network. Additionally, we introduce a novel structure-wise distillation loss and replace the commonly used batch normalization layer with a newly proposed stability-plasticity normalization module, allowing MuFAN to maintain high plasticity and stability simultaneously. MuFAN outperforms other state-of-the-art continual learning methods on the SVHN, CIFAR100, miniImageNet, and CORe50 datasets. Extensive experiments and ablation studies validate the significance and scalability of each proposed component: 1) multi-scale feature maps from a pre-trained encoder, 2) the structure-wise distillation loss, and 3) the stability-plasticity normalization module in MuFAN.

1. INTRODUCTION

Humans excel at learning new skills without forgetting what they have previously learned over their lifetimes. In contrast, in continual learning (CL) (Chen & Liu, 2018), wherein a stream of tasks is observed, a deep learning model forgets prior knowledge when learning a new task if samples from old tasks are unavailable. This problem is known as catastrophic forgetting (McCloskey & Cohen, 1989). In recent years, promising research has been conducted to address this problem (Parisi et al., 2019). However, excessive retention of old knowledge impedes the balance between preventing forgetting (i.e., stability) and acquiring new concepts (i.e., plasticity), which is referred to as the stability-plasticity dilemma (Abraham & Robins, 2005). In this study, we examine how the stability-plasticity dilemma differs between online CL and offline CL and propose a novel approach that addresses the dilemma in online CL.

Most offline CL methods aim to constrain plasticity less while preventing forgetting, rather than to improve plasticity itself, because obtaining high plasticity through iterative training is relatively easy. However, as shown in Figure 1, the learning accuracy (a measure of plasticity) in online CL is substantially lower than in offline CL, with a gap of 10-20% on all three CL benchmarks. That is, for online CL, wherein data arrives in a streaming manner (a single epoch), an approach is required that suppresses excessive forgetting while enhancing plasticity. To this end, we propose a multi-scale feature adaptation network (MuFAN), which consists of three components to obtain high stability and plasticity simultaneously: 1) multi-scale feature maps exploited from shallow to deep layers of a pre-trained model, 2) a novel structure-wise distillation loss across tasks, and 3) a novel stability-plasticity normalization module that handles both the retention of old knowledge and fast adaptation in parallel.
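The plasticity and stability discussed above are typically quantified from a task-accuracy matrix recorded during training. As a hedged illustration (a generic sketch of common CL metrics, not necessarily the exact definitions used here; all names and numbers are illustrative): learning accuracy averages each task's accuracy immediately after it is learned, while forgetting measures the later drop from a task's best accuracy.

```python
import numpy as np

def learning_accuracy(acc):
    """acc[i, j] = accuracy on task j after training on task i.
    Plasticity proxy: accuracy on each task right after learning it."""
    return float(np.mean(np.diag(acc)))

def average_forgetting(acc):
    """Stability proxy: for each earlier task, the drop from its best
    accuracy over the sequence to its accuracy after the last task."""
    T = acc.shape[0]
    drops = [acc[:, j].max() - acc[-1, j] for j in range(T - 1)]
    return float(np.mean(drops))

# Hypothetical 3-task run: rows = after training task i, cols = task j.
acc = np.array([[0.90, 0.10, 0.10],
                [0.70, 0.85, 0.10],
                [0.60, 0.70, 0.80]])
print(learning_accuracy(acc))   # mean of the diagonal (0.90, 0.85, 0.80)
print(average_forgetting(acc))  # mean of (0.90 - 0.60) and (0.85 - 0.70)
```

In this toy matrix, an offline learner would show a higher diagonal (plasticity) but larger drops down each column (forgetting), matching the trend reported in Figure 1.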
First, using pre-trained representations has become common in computer vision (Patashnik et al., 2021; Kolesnikov et al., 2020; Ranftl et al., 2020) and natural language processing (Radford et al.; Peters et al., 2018). Meanwhile, the use of pre-trained representations in CL is still naïve; for example, using an ImageNet-pretrained ResNet as a backbone (Yin et al., 2021; Wu et al., 2021; Park et al., 2021; Hayes et al., 2020) limits the structure or size of the pre-trained model that can be used. As shown in Figure 2(a), instead of using a pre-trained model as a backbone, we propose an approach that uses it as an encoder to obtain a richer multi-context feature map. Rather than operating on the raw RGB image, we accelerate classifier training by leveraging an aggregated feature map drawn from the meaningful feature spaces of the pre-trained encoder. We also verify the scalability of the aggregated multi-scale feature map by integrating it into existing online CL methods.

Second, we present a novel structure-wise distillation loss to suppress catastrophic forgetting. Most distillation losses in CL are point-wise (Chaudhry et al., 2019; Buzzega et al., 2020), and point-wise distillation is indeed effective in alleviating forgetting. However, the relationships between tasks offer another effective way to preserve knowledge in a classification task, especially in a highly non-stationary online continual setting. As described in Figure 2(b), we propose a novel structure-wise distillation loss that generates an extra training signal to alleviate forgetting using the relationships between tasks in a given replay buffer.

Finally, the role of normalization in CL has been investigated in recent studies (Pham et al., 2021b; Cha et al., 2022; Zhou et al., 2022). In the field of online CL, switchable normalization (SN) (Luo et al., 2018)

and continual normalization (CN) (Pham et al., 2021b), which use both minibatch and spatial dimensions to calculate running statistics, have led to improvements in final performance. However, we observed that these approaches do not fully benefit from either batch normalization (BN) (Ioffe & Szegedy, 2015) or spatial normalization layers. To address this problem, we propose a new stability-plasticity normalization (SPN) module that runs one normalization operation efficient for plasticity and another normalization operation efficient for stability in a parallel manner.

Through comprehensive experiments, we validated the superiority of MuFAN over other state-of-the-art CL methods on the SVHN (Netzer et al., 2011), CIFAR100 (Krizhevsky et al., 2009), miniImageNet (Vinyals et al., 2016), and CORe50 (Lomonaco & Maltoni, 2017) datasets. On CORe50, MuFAN significantly outperformed the other state-of-the-art methods. Furthermore, we conducted diverse ablation studies to demonstrate the significance and scalability of the three components.

Figure 1: Comparison of ER-Ring on three CL benchmarks in offline and online CL, shown as bar (left and middle) and scatter (right) plots. For offline CL, plasticity is relatively high, whereas stability is low. In contrast, for online CL, stability is relatively high, whereas plasticity is low. This illustrates the difference in trend between offline and online CL in terms of the stability-plasticity dilemma. Further analysis of this difference is provided in Appendix A.

2. RELATED WORK

2.1. CONTINUAL LEARNING

To date, in the field of CL, pre-trained models have largely been limited to serving as fixed feature extractors that provide only the final feature map projected into a meaningful space. CL methods that use a pre-trained model can be categorized into two groups based on the scenario. The first utilizes an ImageNet-pretrained ResNet as a backbone for CL on fine-grained or video-based datasets to increase the base accuracy (Yin et al., 2021; Wu et al., 2021; Park et al., 2021; Hayes et al., 2020). In general, these methods update an entire classifier continuously during training on
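The relation-based idea behind the structure-wise distillation loss introduced above can be illustrated generically. This is a sketch of relational distillation under our own assumptions, not the paper's exact formulation (all function names are hypothetical): instead of matching replayed features point-wise, it matches the pairwise similarity structure of replay-buffer samples between the previous and current model.

```python
import numpy as np

def pairwise_cosine(feats):
    """Row-normalize features, then take all pairwise cosine similarities."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-8, None)
    return unit @ unit.T

def structure_distillation_loss(old_feats, new_feats):
    """Penalize drift in the *relational* structure of replay-buffer
    features rather than in the individual feature vectors (point-wise)."""
    s_old = pairwise_cosine(old_feats)
    s_new = pairwise_cosine(new_feats)
    return float(np.mean((s_old - s_new) ** 2))

rng = np.random.default_rng(0)
old = rng.normal(size=(8, 16))  # features of 8 buffer samples, frozen old model
loss = structure_distillation_loss(old, old + 0.1 * rng.normal(size=(8, 16)))
print(loss > 0.0)  # True: perturbed features change the similarity structure
```

A point-wise loss would penalize any feature drift, including harmless rotations of the feature space; a structure-wise loss only penalizes changes in how samples (and hence tasks) relate to each other.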
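The intuition behind a parallel normalization module can also be sketched. The following is a simplified illustration under our own assumptions, not the actual SPN design: a batch-statistics path (per-channel statistics over batch and spatial dimensions, as in BN) runs in parallel with a spatial-statistics path (per-sample, per-channel statistics, as in instance normalization), and their outputs are mixed.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Per-channel statistics over batch and spatial dims: axes (N, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    """Per-sample, per-channel statistics over spatial dims only: (H, W)."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def parallel_norm(x, alpha=0.5):
    """Mix a batch-statistics path (adaptation-friendly) with a
    spatial-statistics path (less sensitive to task shift) in parallel."""
    return alpha * batch_norm(x) + (1.0 - alpha) * instance_norm(x)

x = np.random.default_rng(1).normal(size=(4, 3, 8, 8))  # (N, C, H, W)
y = parallel_norm(x)
print(y.shape)  # (4, 3, 8, 8)
```

The batch path tracks the current task's statistics, which helps plasticity but drifts as tasks change; the spatial path is computed per sample and is therefore unaffected by the task distribution, which helps stability.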

Code availability: https://github.com/whitesnowdrop/MuFAN

