A MODEL OR 603 EXEMPLARS: TOWARDS MEMORY-EFFICIENT CLASS-INCREMENTAL LEARNING

Abstract

Real-world applications require classification models to adapt to new classes without forgetting old ones. Correspondingly, Class-Incremental Learning (CIL) aims to train a model with a limited memory budget to meet this requirement. Typical CIL methods save representative exemplars from former classes to resist forgetting, while recent works find that storing models from history can substantially boost performance. However, the stored models are not counted into the memory budget, which implicitly results in unfair comparisons. We find that when the model size is counted into the total budget and methods are compared at aligned memory cost, saving models does not consistently work, especially when the memory budget is limited. Consequently, different CIL methods should be evaluated holistically across memory scales, with measures that consider accuracy and memory size simultaneously. On the other hand, we dive deeply into the construction of the memory buffer for memory efficiency. By analyzing the effect of different layers in the network, we find that shallow and deep layers have different characteristics in CIL. Motivated by this, we propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel. MEMO extends specialized layers on top of shared generalized representations, efficiently extracting diverse features at modest cost while maintaining representative exemplars. Extensive experiments on benchmark datasets validate MEMO's competitive performance.

1. INTRODUCTION

In the open world, training data is often collected in a stream with new classes appearing (Gomes et al., 2017; Geng et al., 2020). Due to storage constraints (Krempl et al., 2014; Gaber, 2012) or privacy issues (Chamikara et al., 2018; Ning et al., 2021), a practical Class-Incremental Learning (CIL) (Rebuffi et al., 2017) model must update with incoming instances from new classes without revisiting former data. The absence of previous training data results in catastrophic forgetting (French, 1999) in CIL: fitting the pattern of new classes erases that of old ones and causes a performance decline. Research on CIL has attracted much interest (Zhou et al., 2021b; 2022a;b; Liu et al., 2021b; 2023; Zhao et al., 2021a;b) in the machine learning field. Saving all the streaming data for offline training is known as the performance upper bound of CIL algorithms, but it requires an unlimited memory budget. Hence, early CIL algorithms were designed in a strict setting without retaining any instances from former classes (Li & Hoiem, 2017; Kirkpatrick et al., 2017; Aljundi et al., 2017; Lee et al., 2017). Keeping only a classification model in memory saves the budget and meanwhile preserves privacy in deployment. Afterward, some works noticed that saving limited exemplars from former classes can boost the performance of CIL models (Rebuffi et al., 2017; Chaudhry et al., 2019). Various exemplar-based methods have since been proposed, aiming to prevent forgetting by revisiting the old during new class learning, which steadily improves performance on CIL tasks (Rolnick et al., 2019; Castro et al., 2018; Wu et al., 2019; Isele & Cosgun, 2018) at the cost of extra memory size (Castro et al., 2018; Rebuffi et al., 2017).
Rather than storing exemplars, recent works (Yan et al., 2021; Wang et al., 2022; Li et al., 2021; Douillard et al., 2021) find that saving backbones from history pushes the performance one step closer to the upper bound. These model-based methods continually train multiple backbones and aggregate their representations for the final prediction. Treating the backbones from history as 'unforgettable checkpoints,' this line of work suffers less forgetting with the help of these diverse representations.

Model-based CIL methods push the performance towards the upper bound, but does that mean catastrophic forgetting is solved? Taking a close look at these methods, we find that they implicitly introduce an extra memory budget, namely a model buffer for keeping old models. This additional buffer implicitly results in an unfair comparison with methods that do not store models. Take CIFAR100 (Krizhevsky et al., 2009) as an example: if we exchange the model buffer of a ResNet32 (He et al., 2015) for exemplars of equal size and append them to iCaRL (Rebuffi et al., 2017) (a baseline that does not retain models), the average accuracy drastically improves from 62% to 70%. How to fairly measure the performance of these methods remains a long-standing problem, since saving exemplars and saving models both consume the memory budget. In this paper, we introduce an extra dimension for evaluating CIL methods that considers both incremental performance and memory cost. Methods with different memory costs must be aligned to the same memory scale for a fair comparison. How can we achieve this? There are two primary sources of memory cost in CIL, i.e., the exemplar buffer and the model buffer. We can align the total memory cost by exchanging the size of extra backbones for extra exemplars.
For example, a ResNet32 model has the same memory size as 603 CIFAR100 images, and a ResNet18 backbone has the same memory size as 297 ImageNet (Deng et al., 2009) images. Figure 1 shows a fair comparison on benchmark datasets, e.g., CIFAR100 and ImageNet100. We report the average accuracy of different methods while varying the memory size from small to large. The memory size at the start point corresponds to the cost of an exemplar-based method with a single backbone, and the endpoint corresponds to the cost of a model-based method with multiple backbones. As we can infer from these figures, there is an intersection between these methods: saving models is less effective when the total budget is limited, and more effective when the total budget is ample.

In this paper, we dive deeply into empirical evaluations of different CIL methods, considering both incremental performance and memory budget. Towards a fair comparison between approaches, we propose several new measures that simultaneously consider performance and memory size, e.g., the area under the performance-memory curve and accuracy per model size. On the other hand, how can we organize the memory buffer efficiently so that we can save more exemplars while maintaining diverse representations? We analyze the effect of different layers of the network by measuring their gradients and shifting range during incremental learning, and find that shallow layers tend to learn generalized features. By contrast, deep layers fit specialized features for the corresponding tasks and exhibit very different characteristics from task to task. As a result, sharing the shallow layers and only creating deep layers for new tasks helps save the memory budget in CIL.
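The model-to-exemplar exchange above reduces to a simple byte-count calculation. The sketch below reproduces the 603 and 297 figures under assumed parameter counts (float32 backbone weights vs. raw uint8 images); the exact counts depend on the backbone implementation and are approximations here:

```python
def model_to_exemplars(n_params, image_shape, bytes_per_param=4, bytes_per_pixel=1):
    """Number of raw uint8 images whose storage matches a float32 backbone."""
    model_bytes = n_params * bytes_per_param
    h, w, c = image_shape
    image_bytes = h * w * c * bytes_per_pixel
    return round(model_bytes / image_bytes)

# Approximate backbone parameter counts (feature extractor only, assumed):
RESNET32_PARAMS = 463_104      # ~0.46M for a CIFAR-style ResNet32
RESNET18_PARAMS = 11_176_512   # ~11.2M for an ImageNet-style ResNet18

print(model_to_exemplars(RESNET32_PARAMS, (32, 32, 3)))     # → 603 CIFAR100 images
print(model_to_exemplars(RESNET18_PARAMS, (224, 224, 3)))   # → 297 ImageNet images
```

With this equivalence, an exemplar-based method can be granted 603 extra CIFAR100 images (or 297 extra ImageNet images) for every backbone a model-based method stores, putting both at the same total budget.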
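The saving from sharing shallow layers can be illustrated with a back-of-the-envelope parameter count. The 60/40 deep/shallow split below is a hypothetical assumption for illustration only, not the paper's actual architectural split:

```python
def total_params_full_backbones(num_tasks, backbone_params):
    """Model buffer when every task stores a full backbone (DER-style)."""
    return num_tasks * backbone_params

def total_params_shared(num_tasks, shared_params, specialized_params):
    """Model buffer with shared shallow layers and one specialized
    deep block per task (MEMO-style)."""
    return shared_params + num_tasks * specialized_params

BACKBONE = 463_104          # assumed ResNet32-scale parameter count
DEEP = BACKBONE * 6 // 10   # hypothetical: deep block holds 60% of the weights
SHALLOW = BACKBONE - DEEP

print(total_params_full_backbones(10, BACKBONE))   # → 4631040 parameters
print(total_params_shared(10, SHALLOW, DEEP))      # → 2963862 parameters
```

Under this assumed split, ten tasks cost roughly 64% of the full-backbone budget; the freed memory can be reinvested in exemplars.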



Figure 1: The average accuracy of different methods as the memory size varies from small to large. The start point corresponds to the memory size of exemplar-based methods with the benchmark backbone (WA (Zhao et al., 2020), iCaRL (Rebuffi et al., 2017), Replay (Chaudhry et al., 2019)), and the endpoint corresponds to the memory cost of model-based methods with the benchmark backbone (DER (Yan et al., 2021) and MEMO (our proposed method)). We align the memory cost by using smaller models for model-based methods or adding exemplars for exemplar-based methods. 'Base' stands for the number of classes in the first task, and 'Inc' represents the number of classes in each incremental task. See Sections 4.1 and 4.2 for more details.
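One concrete way to summarize the curves in Figure 1 with a single number is the area under the performance-memory curve, computed with the trapezoidal rule. This is a minimal sketch of such a measure (the sample budgets and accuracies are hypothetical, and the exact normalization used in the paper may differ):

```python
def area_under_curve(memory_sizes, accuracies):
    """Trapezoidal area under the performance-memory curve.

    memory_sizes: increasing memory budgets (e.g., in MB).
    accuracies:   average incremental accuracy (%) at each budget.
    """
    area = 0.0
    for i in range(1, len(memory_sizes)):
        width = memory_sizes[i] - memory_sizes[i - 1]
        area += width * (accuracies[i] + accuracies[i - 1]) / 2.0
    return area

# Hypothetical curve: accuracy measured at three aligned memory budgets.
print(area_under_curve([0, 2, 4], [60.0, 70.0, 80.0]))  # → 280.0
```

A method that is only strong at large budgets gets a smaller area than one that performs well across the whole memory range, which is exactly the holistic behavior the measure is meant to reward.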

Code availability: https://github.com/wangkiw/

