BEEF: BI-COMPATIBLE CLASS-INCREMENTAL LEARNING VIA ENERGY-BASED EXPANSION AND FUSION

Abstract

Neural networks suffer from catastrophic forgetting when learning tasks sequentially, phase by phase, making them inapplicable in dynamically updated systems. Class-incremental learning (CIL) aims to enable neural networks to learn different categories in multiple stages. Recently, dynamic-structure-based CIL methods have achieved remarkable performance. However, these methods train all modules in a coupled manner and do not consider possible conflicts among modules, corrupting the eventual predictions. In this work, we propose a unifying energy-based theory and framework called Bi-Compatible Energy-Based Expansion and Fusion (BEEF) to analyze and achieve the goal of CIL. We demonstrate the possibility of training independent modules in a decoupled manner while achieving bi-directional compatibility among modules through two additionally allocated prototypes, and then integrating them into a unifying classifier at minimal cost. Furthermore, BEEF extends the exemplar-set to a more challenging setting, where exemplars are randomly selected and imbalanced, and maintains its performance when prior methods fail dramatically. Extensive experiments on three widely used benchmarks, CIFAR-100, ImageNet-100, and ImageNet-1000, demonstrate that BEEF achieves state-of-the-art performance in both the ordinary and the challenging CIL settings.

1. INTRODUCTION

The ability to continuously acquire new knowledge is necessary in our ever-changing world and is considered a crucial aspect of human intelligence. Applicable AI systems are expected to learn new concepts from a stream while retaining knowledge of previously learned concepts. However, deep neural network-based systems, which have achieved great success, face a well-known issue called catastrophic forgetting (French, 1999; Golab & Özsu, 2003; Zhou et al., 2023b), whereby they abruptly forget prior knowledge when directly fine-tuned on new tasks. To address this challenge, the class-incremental learning (CIL) field aims to design learning paradigms that enable deep neural networks to learn novel categories in multiple stages while maintaining discrimination abilities for previous ones (Rebuffi et al., 2017; Zhou et al., 2023a). Numerous approaches have been proposed to achieve the goal of CIL, with typical methods falling into two groups: regularization-based methods and dynamic-structure-based methods. Regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018; Li & Hoiem, 2017; Rebuffi et al., 2017) add constraints (e.g., a parameter-drift penalty) when updating, thus forcing the model to retain crucial information about old categories. However, these methods often suffer from the stability-plasticity dilemma, lacking the capacity to handle all categories simultaneously. Dynamic-structure-based methods (Yan et al., 2021; Li et al., 2021) expand new modules at each learning stage to enhance the model's capacity and learn task-specific knowledge through the new module, achieving remarkable performance. However, these methods have an intrinsic drawback: they directly retain all learned modules without considering conflicts among modules, thus corrupting the joint feature representation and misleading the ultimate predictions. For example, old modules that do not consider possible future updates may mislead the final prediction for new classes. This defect limits their performance on long-term incremental tasks.

Figure 1: At the expansion phase, we independently train the new module for the current task, while classifying all samples from prior tasks into $p_b$ and all samples generated from the built-in energy-based model into $p_f$. At the fusion phase, the output of $p_b$ is equally added to the output of the prior modules to mitigate the task-bias and form a unifying classifier.

In this paper, we aim to reduce the possible conflicts in dynamic-structure-based methods and innovatively propose to achieve bi-directional compatibility, which consists of backward compatibility and forward compatibility. Specifically, backward compatibility is committed to making the discrimination ability of old modules unaffected by new ones. Conversely, forward compatibility aims to reduce the impact of old modules on the ultimate predictions when new categories emerge. By achieving bi-directional compatibility, given a sample from a specific task, the module responsible for that task will, in the ideal case, dominate the ultimate prediction. Fig. 1 displays the BEEF training framework, which comprises two phases: model expansion and model fusion. At the expansion phase, we assume that different modules are independent and train them in isolation. Then, at the fusion phase, all trained modules are combined to form a unifying classifier. To achieve bi-directional compatibility, we introduce two additional prototypes: the forward prototype $p_f$ and the backward prototype $p_b$. The backward prototype $p_b$ is set to measure the confidence of old classes, while the forward prototype $p_f$ measures uncertainty in the open world. Specifically, when training a new module, we set $p_b$ as the cluster prototype for all old classes and use it to learn a task boundary between the current task and prior ones.
Meanwhile, we set $p_f$ as the cluster prototype for samples from unseen distributions generated through energy-based sampling, and thus use it to measure uncertainty and better capture the current distribution. We show that BEEF can be deduced from a unifying, extendable energy-based theoretical framework, which allows us to transform the open-world problem into an ordinary classification problem and model the input distribution while learning the discrimination ability. Extensive experiments on three widely-used benchmarks show that our method achieves state-of-the-art performance. Besides, prevalent CIL methods all require a well-selected, balanced exemplar-set for rehearsal, which might be impractical due to data privacy issues (Delange et al., 2021; Ji et al., 2014) and the computation cost of choosing exemplars. Our method pushes this to a harder setting: with only randomly sampled data from prior tasks, BEEF maintains its effectiveness while the performance of other methods declines drastically.

2. RELATED WORK

Class-incremental learning. Several regularization-based methods utilize knowledge distillation (Hinton et al., 2015) to constrain the model's output. Douillard et al. (2020) propose a novel spatial knowledge distillation. Zhou et al. (2022) propose the concept of forward compatibility (Gheorghioiu et al., 2003; Shen et al., 2020) and squeeze the space of known categories, thereby reserving feature space for future ones. Dynamic-structure-based methods create new modules to enhance the capacity for learning new tasks. Yan et al. (2021) and Li et al. (2021) combine all modules to form a unifying classifier, but this leads to an increasing training cost. Douillard et al. (2021) apply transformers (Dosovitskiy et al., 2020; Touvron et al., 2021) to CIL and dynamically expand task tokens when learning new tasks. Wang et al. (2022a) propose to dynamically expand and compress the model based on gradient boosting (Mason et al., 1999) to adaptively learn new tasks. Liu et al. (2021b; 2023) cleverly apply reinforcement learning in CIL to obtain universally better memory management strategies or hyperparameters. However, prevalent CIL approaches usually require a well-selected, class-balanced exemplar-set for rehearsal (Rebuffi et al., 2017), which has an evident impact on their performance (Masana et al., 2020), as we verify experimentally. BEEF not only achieves state-of-the-art performance but also shows strong robustness to the choice of exemplar-set.

Energy-based learning. EBMs define probability distributions with density proportional to $\exp(-E)$, where $E$ is the energy function (LeCun et al., 2006). The theory and implementation of EBMs have been well studied. Xie et al. (2016) show that the generative random field model can be derived from the discriminative ConvNet. Xie et al. (2018a; 2022) study the cooperative training of two generative models for image modeling and synthesis. Additionally, Nijkamp et al. (2019) propose to treat non-convergent short-run MCMC as a learned generator or flow model and show that it is capable of generating realistic samples. Xie et al. (2021c) propose to learn a VAE to initialize finite-step MCMC for efficient amortized sampling of the EBM. Xiao et al. (2021) propose a symbiotic composition of a VAE and an EBM that can generate high-quality images while achieving fast traversal of the data manifold. Zhao et al. (2021) propose a multi-stage coarse-to-fine expanding and sampling strategy, which starts by learning a coarse-level EBM from low-resolution images and gradually transitions to finer-level EBMs at higher resolutions. Besides, EBMs have been successfully applied in many fields, such as data generation (Zhai et al., 2016; Zhao et al., 2016; Deng et al., 2020; Du & Mordatch, 2019) with various data formats including graphs (Liu et al., 2021a), video (Xie et al., 2017; 2019), 3D volumetric shapes (Xie et al., 2018b; 2020), 3D unordered point clouds (Xie et al., 2021a), image-to-image translation (Xie et al., 2021c; b), and saliency maps (Zhang et al., 2022), as well as out-of-distribution (OOD) detection (Hendrycks & Gimpel, 2016; Bai et al., 2021; Liu et al., 2020; Lee et al., 2020; Lin et al., 2021) and density estimation (Silverman, 2018; Zhao et al., 2016). Wang et al. (2021) model open-world uncertainty as an extra dimension of the classifier, achieving better calibration on OOD datasets. Grathwohl et al. (2019) propose to model the joint distribution $P(x, y)$ for the classification problem. Bian et al. (2022) apply energy-based learning to cooperative games and derive new player valuation methods. Zheng et al. (2021) propose to represent the statistical distribution within a single natural image through an EBM framework. Xu et al. (2022) apply EBMs to inverse optimal control and autonomous driving. There have been several attempts to apply EBMs to incremental learning. Li et al. (2020) propose a novel energy-based classification loss and network structure for continual learning. Joseph et al. (2022) build an energy-based latent aligner that recovers corrupted latent representations. Wang et al. (2022b) propose an anchor-based energy self-normalization classifier for incremental learning.

3. METHOD

In this section, we describe BEEF and how we apply EBMs to CIL to learn a unifying classifier while achieving bi-directional compatibility. In Sec. 3.1, we first introduce basic background on CIL. In Sec. 3.2, we present the definition of energy and then derive the optimization objective for the expansion phase; to avoid the intractability of the normalizing constant, we prove a gradient-equivalent objective and explain why it helps to achieve bi-directional compatibility. After that, we propose an efficient yet effective fusion strategy in Sec. 3.3.

3.1. PROBLEM FORMULATION

In CIL, the model is trained on a sequence of datasets $\{D^1, D^2, \dots\}$, where $D^t = \{(x_i^t, y_i^t)\}_i$, $x_i^t \in X^t$ is an input sample and $y_i^t \in Y^t$ is the corresponding label, which is not accessible in later sessions. Only a small number of exemplars of previous categories are retained in a size-limited exemplar-set $V^t \subseteq \cup_{i=1}^{t-1} D^i$. The model is trained on $D^t \cup V^t$ and evaluated on the test set of all known categories. In the following discussion, we focus on the details of the $t$-th incremental session without loss of generality. In particular, we denote the label spaces of all known classes and novel classes as $Y_o = \cup_{i=1}^{t-1} Y^i$ and $Y_n = Y^t$, respectively, with $|Y_n| = K$ and $|Y_o| = M$ the numbers of new and old categories.
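The session protocol above can be sketched in a few lines. This is our own illustrative sketch (function names are ours, not the paper's code), showing how the label groups and the evaluated label space evolve across sessions:

```python
# Hypothetical sketch of the CIL protocol: tasks arrive as disjoint label groups
# Y^1, Y^2, ..., and after session t the model is evaluated on all seen classes.
def make_sessions(num_classes, classes_per_session):
    """Split labels 0..num_classes-1 into sequential sessions."""
    return [list(range(s, s + classes_per_session))
            for s in range(0, num_classes, classes_per_session)]

def seen_classes(sessions, t):
    """Y_o ∪ Y_n after session t (1-indexed): the evaluation label space."""
    return [c for sess in sessions[:t] for c in sess]

sessions = make_sessions(100, 10)                    # e.g., CIFAR-100 with 10 sessions
assert len(sessions) == 10
assert seen_classes(sessions, 3) == list(range(30))  # evaluated on all classes seen so far
```

The exemplar-set $V^t$ would additionally hold a small budget of samples drawn from the earlier sessions' data.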

3.2. ENERGY-BASED MODEL EXPANSION

Let $h_\theta: \mathcal{X} \to \Delta^{K+1}$ be the newly created module (typically a single-skeleton CNN), where $\mathcal{X} = \cup_{i=1}^{t} X^i$ and $\Delta^{K+1}$ is the $(K+1)$-standard simplex (i.e., $(K+2)$-dimensional vectors with non-negative elements that sum to 1). We can further decompose $h_\theta$ as $S \circ F \circ \Phi$, where $\Phi: \mathcal{X} \to \mathbb{R}^d$ is the non-linear feature extractor, $F: \mathbb{R}^d \to \mathbb{R}^{K+2}$ is a linear classifier transforming the feature into $(K+2)$-dimensional logits, and $S$ denotes the softmax activation, which constrains the final output to the $(K+1)$-standard simplex. $h_\theta(x)[k]$ denotes the $(k+1)$-th element of the final output. Ignoring the bias, $F$ can be written as a $d \times (K+2)$ matrix $F = [\,p_b \;\; F_{\text{base}} \;\; p_f\,]$, where $F_{\text{base}}$, of shape $d \times K$, is the base classifier for the current task, and $p_b$ / $p_f$ is an additional prototype measuring past confidence / future uncertainty, transforming the feature extracted by $\Phi$ into the logit at index $0$ / $K+1$.

First, given an input-label pair $(x, y) \in \mathcal{X} \times (Y_o \cup Y_n)$, we define the energy $E_\theta(x, y)$ as
$$E_\theta(x, y) = \begin{cases} -\log h_\theta(x)[\sigma(y)], & y \in Y_n \\ -\log\left(h_\theta(x)[0]/M\right), & y \in Y_o \end{cases} \quad (1)$$
where $\sigma: Y_n \to \{1, 2, \dots, K\}$ is a bijection mapping a given label to its corresponding class index. The energy $E_\theta(x, y)$ measures the uncertainty of predicting $x$'s label as $y$. This definition is compatible with traditional classification, since we typically use $h_\theta(x)[\sigma(y)]$, the negative exponent of the energy, as the confidence of predicting $x$'s label as $y$. Moreover, we use $h_\theta(x)[0]$ to represent the overall confidence that $x$'s label belongs to $Y_o$ and do not expect the new module to distinguish among the old categories; hence $E_\theta(x, y)$ for any $y \in Y_o$ is $-\log\left(h_\theta(x)[0]/M\right)$. The denominator $M$ enlarges the energy, reflecting the larger uncertainty the current module has about old categories due to the limited supervision for old categories from $V^t$.
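As a toy illustration of the energy definition in Eq. 1 (our own sketch, not the paper's code), the following computes $E_\theta(x, y)$ from a $(K+2)$-way softmax output, with index 0 as $p_b$, indices $1..K$ as the new classes, and index $K+1$ as $p_f$:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def energy(h, y, sigma, M):
    """E_theta(x, y) as in Eq. 1: h is the (K+2)-dim softmax output for x,
    sigma maps new-class labels to indices 1..K, M is the number of old classes."""
    if y in sigma:                       # y in Y_n: per-class confidence
        return -np.log(h[sigma[y]])
    return -np.log(h[0] / M)             # y in Y_o: shared confidence, scaled by 1/M

K, M = 3, 5                               # 3 new classes, 5 old classes
sigma = {'cat': 1, 'dog': 2, 'fox': 3}    # bijection Y_n -> {1..K}
h = softmax(np.ones(K + 2))               # uniform toy output over the K+2 slots
# The 1/M factor makes the energy of any old label strictly larger here:
assert energy(h, 'old_label', sigma, M) > energy(h, 'cat', sigma, M)
```

The labels and prototype indexing are hypothetical; the sketch only shows how the $1/M$ factor encodes higher uncertainty about old categories.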
Since $P_\theta(y|x) = \frac{\exp(-E_\theta(x,y))}{\sum_{y'} \exp(-E_\theta(x,y'))}$ and $P_\theta(x) = \frac{\sum_{y'} \exp(-E_\theta(x,y'))}{\sum_{x'}\sum_{y'} \exp(-E_\theta(x',y'))}$, the conditional and marginal probability densities can be formulated as
$$P_\theta(y|x) = \begin{cases} \frac{h_\theta(x)[\sigma(y)]}{\sum_{k=0}^{K} h_\theta(x)[k]}, & y \in Y_n \\[4pt] \frac{h_\theta(x)[0]/M}{\sum_{k=0}^{K} h_\theta(x)[k]}, & y \in Y_o \end{cases}, \qquad P_\theta(x) = \frac{\sum_{k=0}^{K} h_\theta(x)[k]}{\sum_{x'} \sum_{k=0}^{K} h_\theta(x')[k]}. \quad (2)$$
Defining the energy function $E_\theta(x)$ via $P_\theta(x) = \frac{\exp(-E_\theta(x))}{\sum_{x'} \exp(-E_\theta(x'))}$, we obtain
$$E_\theta(x) = -\log \sum_{k=0}^{K} h_\theta(x)[k]. \quad (3)$$
With the energy functions defined above, we derive our optimization objective for training a new module and demonstrate how it achieves bi-directional compatibility. Instead of simply learning a discriminator $P_\theta(y|x)$, which usually yields overconfident predictions even for samples from unseen distributions, we estimate the joint distribution $P_\theta(x, y)$:
$$\arg\min_\theta \mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(x, y)] = \arg\min_\theta \mathbb{E}_{P_{\text{real}}(x)}[-\log P_\theta(x)] + \mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(y|x)]. \quad (4)$$
Estimating the joint distribution not only encourages the model to distinguish all known categories but also models the input distribution, making the model sensitive to input-distribution drift. Therefore, when unseen categories emerge, the modules alleviate overconfident predictions and reduce their impact on the ultimate predictions. However, due to the intractability of the normalizing constant $\sum_{x'}\sum_{y'}\exp(-E_\theta(x', y'))$, we optimize a gradient-equivalent objective of Eq. 4.

Theorem 3.1 (Marginal Distribution Maximum Likelihood Estimation).
Defining $E'_\theta(x) = -\log h_\theta(x)[K+1]$ and its corresponding marginal distribution as $P'_\theta(x)$, the optimization of $\mathbb{E}_{P_{\text{real}}(x)}[-\log P_\theta(x)]$ is equivalent to that of
$$\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=0}^{K} h_\theta(x)[k]\Big] + \lambda_{\bar\theta}\, \mathbb{E}_{P'_{\bar\theta}(x)}\big[-\log h_\theta(x)[K+1]\big]$$
when gradient descent is applied, where $\lambda_{\bar\theta}$ is the ratio of the normalizing constants determined by $E'_\theta(x)$ and $E_\theta(x)$, and $\bar\theta$ means the parameters $\theta$ are frozen (i.e., instances sampled from $P'_{\bar\theta}(x)$ are detached).

Theorem 3.2 (Conditional Distribution Maximum Likelihood Estimation). With preliminaries from Thm. 3.1, the optimization of $\mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(y|x)]$ is equivalent to that of
$$\mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma'(y)]\big] + \mu_{\bar\theta}\, \mathbb{E}_{P_{\text{real}}(x)}\big[-\log h_\theta(x)[K+1]\big]$$
when gradient descent is applied, where $\mu_{\bar\theta} = \frac{h_{\bar\theta}(x)[K+1]}{\sum_{k=0}^{K} h_{\bar\theta}(x)[k]}$ and $\sigma'(y) = \begin{cases}\sigma(y), & y \in Y_n \\ 0, & y \in Y_o\end{cases}$.

Due to the space limit, detailed proofs for Thm. 3.1 and Thm. 3.2 are deferred to Appendix A. Combining Thm. 3.1 and Thm. 3.2, the ultimate optimization objective can be formulated as
$$\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=0}^{K} h_\theta(x)[k]\Big] + \lambda_{\bar\theta}\, \mathbb{E}_{P'_{\bar\theta}(x)}\big[-\log h_\theta(x)[K+1]\big] + \mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma'(y)]\big] + \mu_{\bar\theta}\, \mathbb{E}_{P_{\text{real}}(x)}\big[-\log h_\theta(x)[K+1]\big],$$
which is upper bounded by
$$2\,\mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma'(y)]\big] + \mu_{\bar\theta}\, \mathbb{E}_{P_{\text{real}}(x)}\big[-\log h_\theta(x)[K+1]\big] + \lambda_{\bar\theta}\, \mathbb{E}_{P'_{\bar\theta}(x)}\big[-\log h_\theta(x)[K+1]\big]. \quad \text{(7, Objective)}$$
We take Eq. 7 as the eventual training objective. Here, we explain the roles of the different components and why optimizing this objective yields better bi-directional compatibility. The term $\mathbb{E}_{P_{\text{real}}(x,y)}[-\log h_\theta(x)[\sigma'(y)]]$ prompts the new module to accurately discriminate all categories of the current task and to build explicit decision boundaries between the current task and prior ones. By setting $p_b$ as the shared prototype for all old categories, we better exploit the shared structure of old categories and reduce the risk of over-fitting, which typically results from the inadequate training samples for old categories.
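The three terms of the objective in Eq. 7 can be sketched for a single batch as follows. This is a simplified illustration under our own assumptions (e.g., the frozen coefficient $\mu_{\bar\theta}$ is approximated by its batch mean, and the generated samples are treated as already detached), not the paper's implementation:

```python
import numpy as np

def beef_expansion_loss(h_real, y_idx, h_gen, lam):
    """One-batch sketch of the Eq. 7 objective.
    h_real: (B, K+2) softmax outputs for real samples; y_idx: sigma'(y) in {0..K}.
    h_gen: (B', K+2) outputs for samples drawn from P'_theta (detached)."""
    ce = -np.log(h_real[np.arange(len(y_idx)), y_idx]).mean()   # doubled classification term
    mu = float((h_real[:, -1] / h_real.sum(axis=1)).mean())     # mu with frozen parameters
    unc = -np.log(h_real[:, -1]).mean()                         # real x: keep mass on p_f
    adv = -np.log(h_gen[:, -1]).mean()                          # generated x: push into p_f
    return 2 * ce + mu * unc + lam * adv

h_real = np.array([[0.1, 0.6, 0.1, 0.1, 0.1],      # K = 3, so K+2 = 5 outputs
                   [0.2, 0.1, 0.5, 0.1, 0.1]])
loss = beef_expansion_loss(h_real, np.array([1, 2]), np.full((1, 5), 0.2), lam=0.1)
assert np.isfinite(loss) and loss > 0
```

In a real training loop each term would be a differentiable tensor expression, with `h_gen` produced by the energy-based sampler and gradients stopped through $\mu_{\bar\theta}$ and $\lambda_{\bar\theta}$.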
Given a sample from a prior task, the new module perceives this task boundary and reduces the confidence of its own task, so that the old modules dominate the ultimate prediction; hence we achieve better backward compatibility for old categories than naively tuning the new module on all categories. The term $\mathbb{E}_{P_{\text{real}}(x)}[-\log h_\theta(x)[K+1]]$ encourages the module to reserve a certain degree of confidence for the virtual class $p_f$, thus measuring the out-of-distribution uncertainty of given samples and mitigating overconfident predictions. As shown in Fig. 2, $\mathbb{E}_{P'_{\bar\theta}(x)}[-\log h_\theta(x)[K+1]]$ introduces an adversarial learning process: we iteratively generate samples believed to have lower energies from $P'_{\bar\theta}(x)$ and then update the energy manifold to increase the energy of generated samples and decrease that of real samples. This process effectively enhances the modeling of the known input distribution, giving in-distribution samples low energy and out-of-distribution data high energy. Therefore, for a sample from an unseen distribution, the module produces predictions with large uncertainty ($h_\theta(x)[K+1]$) and low confidence, since the confidence must be lower than $1 - h_\theta(x)[K+1]$. Modules created in the future to handle these unknown distributions will then dominate the final prediction; hence we achieve forward compatibility for future unseen categories.
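The generation step of the adversarial process above can be sketched with stochastic gradient Langevin dynamics (SGLD) on a toy energy. The quadratic energy and function names below are our own stand-ins; a real sampler would run in input or feature space against $E'_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_sample(grad_E, x0, step=0.01, n_steps=20):
    """SGLD (Welling & Teh, 2011): draw approximate samples from p(x) ∝ exp(-E(x))
    by noisy gradient descent on the energy. grad_E returns dE/dx."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x - 0.5 * step * grad_E(x) + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

# Toy quadratic energy E(x) = ||x||^2 / 2, so grad_E(x) = x:
# the chain drifts from a high-energy start toward the low-energy region near 0.
x = sgld_sample(lambda v: v, x0=np.full(4, 10.0), step=0.1, n_steps=500)
assert np.linalg.norm(x) < 6.0     # far closer to the origin than the start (norm 20)
```

Each generated sample would then be pushed toward the $p_f$ slot while the energies of real samples are lowered.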

Figure 3: $p_b$ alleviates the task-bias and determines the dominant module for the ultimate prediction.

3.3. ENERGY-BASED MODEL FUSION

After training the new module, we fuse it with the prior ones to form a unifying classifier for all seen categories. Assume we have trained a unifying model $h_{\theta_o}$ for all old tasks and that $\sigma_o$ maps a label to its output index in $h_{\theta_o}$. A vanilla way to combine $h_{\theta_o}$ and $h_\theta$ is to redefine the energy function as
$$E_{\{\theta_o,\theta\}}(x, y) = \begin{cases} -\log h_{\theta_o}(x)[\sigma_o(y)], & y \in Y_o \\ -\log h_\theta(x)[\sigma(y)], & y \in Y_n \end{cases}. \quad (8)$$
Then we have
$$P_{\{\theta,\theta_o\}}(y|x) = \begin{cases} \frac{h_{\theta_o}(x)[\sigma_o(y)]}{\sum_{m=1}^{M} h_{\theta_o}(x)[m] + \sum_{k=1}^{K} h_\theta(x)[k]}, & y \in Y_o \\[4pt] \frac{h_\theta(x)[\sigma(y)]}{\sum_{m=1}^{M} h_{\theta_o}(x)[m] + \sum_{k=1}^{K} h_\theta(x)[k]}, & y \in Y_n \end{cases}. \quad (9)$$
However, this may cause task bias: different modules may produce predictions with different entropies, and the combined model is biased toward modules with larger entropies. As shown in Fig. 3, simply combining the modules as in Eq. 9 leads to misclassification due to the larger entropy of module 2. Considering that $p_b$ measures the confidence for old categories, we redefine $E_{\{\theta_o,\theta\}}$ as
$$E_{\{\theta_o,\theta\}}(x, y) = \begin{cases} -\log\left\{h_{\theta_o}(x)[\sigma_o(y)] + \alpha\, h_\theta(x)[0] + \beta\right\}, & y \in Y_o \\ -\log h_\theta(x)[\sigma(y)], & y \in Y_n \end{cases}. \quad (10)$$
Then we have
$$P_{\{\theta,\theta_o\}}(y|x) = \begin{cases} \frac{h_{\theta_o}(x)[\sigma_o(y)] + \alpha h_\theta(x)[0] + \beta}{\sum_{m=1}^{M}\left[h_{\theta_o}(x)[m] + \alpha h_\theta(x)[0] + \beta\right] + \sum_{k=1}^{K} h_\theta(x)[k]}, & y \in Y_o \\[4pt] \frac{h_\theta(x)[\sigma(y)]}{\sum_{m=1}^{M}\left[h_{\theta_o}(x)[m] + \alpha h_\theta(x)[0] + \beta\right] + \sum_{k=1}^{K} h_\theta(x)[k]}, & y \in Y_n \end{cases}. \quad (11)$$
We fine-tune $\alpha, \beta$ to minimize the negative log-likelihood on a tiny sub-dataset (the exemplar-set), thus mitigating the task bias, namely
$$\alpha^*, \beta^* = \arg\min_{\alpha,\beta} \mathbb{E}_{P_{\text{real}}(x,y)}\left[-\log P_{\{\theta,\theta_o\}}(y|x)\right]. \quad (12)$$
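The fusion rule of Eq. 11 can be sketched as follows; this is our own toy rendering (array layouts and values are illustrative), showing how the backward-prototype confidence $h_\theta(x)[0]$ is redistributed over the old classes via $\alpha, \beta$:

```python
import numpy as np

def fused_probs(h_old, h_new, alpha, beta, M, K):
    """Sketch of Eq. 11: combine the old module's M class scores with the new
    module's outputs [p_b, class_1..class_K, p_f], boosting old classes by
    alpha * h_new[0] + beta before renormalizing."""
    old = h_old[:M] + alpha * h_new[0] + beta   # boosted old-class scores
    new = h_new[1:K + 1]                        # new-class scores from the new module
    z = np.concatenate([old, new])
    return z / z.sum()                          # unified (M+K)-way distribution

M, K = 5, 3
h_old = np.full(M, 1.0 / M)                     # old module: uniform over old classes
h_new = np.array([0.4, 0.3, 0.2, 0.05, 0.05])   # [p_b, 3 new classes, p_f]
p = fused_probs(h_old, h_new, alpha=1.0, beta=0.0, M=M, K=K)
assert np.isclose(p.sum(), 1.0) and len(p) == M + K
```

In practice $\alpha, \beta$ would be the only trainable parameters at this phase, fitted by minimizing the negative log-likelihood of Eq. 12 on the exemplar-set.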

3.4. SUMMARY OF BEEF

To conclude, we propose a two-stage training approach: expansion and fusion. The expansion phase is intrinsically similar to naive fine-tuning, but through our novel energy definition and gradient-equivalent optimization simplification, we expand the original $K$-way classification model into a $(K+2)$-way classification model via two additional prototypes (the backward prototype $p_b$ and the forward prototype $p_f$) to achieve bi-directional compatibility. Specifically, $p_b$ learns to measure the confidence on old categories by acting as the cluster prototype for all old samples, achieving backward compatibility. In contrast, $p_f$ learns to measure open-world uncertainty by acting as the cluster prototype for all samples generated by the built-in energy-based model, achieving forward compatibility. By means of our energy-based framework, the model learns the discrimination ability for the current task while synchronously modeling the input distribution. At the fusion phase, the confidence of the backward prototype is added to the outputs of all old modules after passing through a learnable affine transformation, forming the ultimate prediction over all categories. This fusion strategy alleviates the possible task bias and effectively improves performance compared with the naive fusion strategy in Eq. 9. Equipped with an energy-based nature, BEEF allows for a pleasant by-product: test-time alignment. That is, we apply SGLD (Welling & Teh, 2011) to decrease the energy of given test samples. Through test-time alignment, useful characteristics shared by training samples are transferred to the test sample to reduce its energy, producing more convincing predictions. We find that this process does improve performance in many protocols, though it requires additional computational resources at evaluation.
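Test-time alignment can be sketched as gradient steps on the input that lower its energy before predicting. The sketch below is our own simplification: it uses deterministic descent with finite-difference gradients (omitting SGLD's noise term for reproducibility) and a toy energy, not the paper's procedure:

```python
import numpy as np

def align(x, energy, step=0.1, n_steps=50, eps=1e-4):
    """Move a test input toward the low-energy (in-distribution) region by
    descending a scalar energy function, using central-difference gradients."""
    x = x.astype(float).copy()
    for _ in range(n_steps):
        g = np.zeros_like(x)
        for i in range(len(x)):                       # finite-difference dE/dx_i
            d = np.zeros_like(x)
            d[i] = eps
            g[i] = (energy(x + d) - energy(x - d)) / (2 * eps)
        x -= step * g
    return x

E = lambda v: 0.5 * np.sum((v - 1.0) ** 2)            # toy energy, minimum at v = 1
x_aligned = align(np.zeros(3), E)
assert E(x_aligned) < E(np.zeros(3))                  # energy strictly decreased
```

With the model's actual energy $E_\theta(x) = -\log \sum_k h_\theta(x)[k]$ and autograd gradients plus Langevin noise, this becomes the SGLD alignment described above.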
One inevitable drawback of dynamic-structure-based methods is that the number of parameters grows linearly with the number of tasks, which may violate the memory usage limitations of CIL. Therefore, for practical usage and fair comparison, we apply the compression strategy of FOSTER (Wang et al., 2022a) to compress the expanded dual-branch model into a single skeleton after each incremental learning session; we call this variant BEEF-Compress.
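Compression of this kind is typically done by knowledge distillation: a single-backbone student is trained to match the fused dual-branch teacher's outputs. The temperature-scaled KL loss below is a generic distillation sketch of ours, not FOSTER's exact recipe:

```python
import numpy as np

def softmax_t(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL(teacher || student), the standard distillation loss
    (Hinton et al., 2015); the T^2 factor keeps gradient scale comparable."""
    p = softmax_t(teacher_logits, T)
    q = softmax_t(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

t = np.array([[2.0, 0.5, -1.0]])
assert kd_loss(t, t) < 1e-9                 # matching the teacher drives the loss to 0
assert kd_loss(np.zeros((1, 3)), t) > 0     # any mismatch yields positive loss
```

After distillation, only the compressed student is carried into the next session, keeping the parameter count constant across tasks.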

4. EMPIRICAL STUDIES

4.1. EXPERIMENTAL SETTINGS

Datasets. We validate our method on widely used class-incremental learning benchmarks: CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-100/1000 (Deng et al., 2009). CIFAR-100: CIFAR-100 consists of 50,000 training images with 500 images per class and 10,000 test images with 100 images per class. ImageNet-1000: ImageNet-1000 is a large-scale dataset composed of about 1.28 million images for training and 50,000 for validation. ImageNet-100: ImageNet-100 is composed of 100 classes randomly chosen from the original ImageNet-1000 dataset.

CIFAR-100 Protocol. For the CIFAR-100 benchmark, we evaluate two widely recognized protocols. CIFAR-100 B0: all 100 classes are evenly divided into 5, 10, or 20 groups, i.e., we train the 100 classes gradually with 20, 10, or 5 classes per incremental session. In addition, models are allowed to save an exemplar-set storing no more than 2,000 exemplars throughout all sessions. CIFAR-100 B50: we first train half of the 100 classes at the base learning stage; the remaining 50 classes are then evenly divided into 5, 10, or 25 groups, i.e., we train the remaining 50 classes gradually with 10, 5, or 2 classes per incremental session. Slightly differently from the first protocol, models are allowed to store no more than 20 exemplars per class; after training all 100 classes, there are therefore also no more than 2,000 exemplars.

ImageNet Protocol. For ImageNet-100, we evaluate performance on two different incremental tasks. In the first, we split the 100 classes evenly into 10 sequential incremental sessions, and up to 2,000 exemplars may be stored in the exemplar-set. In the second, models are first trained on 50 base classes and then sequentially trained on 10 classes per incremental session (i.e., 5 incremental sessions in total).
As in the CIFAR-100 B50 protocol, models are allowed to store no more than 20 exemplars per class. For the ImageNet-1000 benchmark, we train all 1,000 classes with 100 classes per step (10 steps in total) and an exemplar-set storing no more than 20,000 exemplars.

Compared methods. Our method and all baselines are implemented with PyTorch (Paszke et al., 2017) in PyCIL (Zhou et al., 2021a). We compare BEEF to strong regularization-based methods: iCaRL (Rebuffi et al., 2017), BiC (Wu et al., 2019), WA (Zhao et al., 2020), PodNet (Douillard et al., 2020), and Coil (Zhou et al., 2021b). Besides, we compare to the dynamic-structure-based methods RPSNet (Rajasegaran et al., 2019), DER (Yan et al., 2021), DyTox (Douillard et al., 2021), and FOSTER (Wang et al., 2022a). Apart from the above methods, we also compare with RMM (Liu et al., 2021b), which adjusts the memory partition strategy for new and old data.

Table 2: Performance on CIFAR-100. We report both the top-1 average and last accuracy over incremental sessions.

ImageNet-100/1000: Table 1 and Fig. 8 summarize the experimental results on both the ImageNet-100 and ImageNet-1000 benchmarks. We observe that BEEF, and especially BEEF-Compress, achieves very competitive performance compared to prior methods. Specifically, BEEF improves average accuracy by 0.63 and 0.3 under the two ImageNet-100 protocols. It is also worth noting that the performance improvement of BEEF on ImageNet is relatively less significant than on CIFAR-100, which may be attributed to the larger and more complex nature of the ImageNet dataset, making it more challenging to find suitable cluster prototypes as forward/backward prototypes.

Comparison under imbalanced exemplar-set. To compare the performance of different methods with an imbalanced exemplar-set, we propose three different exemplar selection strategies: exp, random, and half-half; refer to Appendix C.1 for details. Fig. 4 displays the average accuracy of different methods after each incremental session on CIFAR-100 B50 with 5 steps. Fig. 5(c) illustrates the performance when exemplars are randomly sampled from all the available old instances. Though the exemplar-set is statistically balanced, prior methods encounter a performance drop and the gap between BEEF and prior methods is enlarged. Fig. 5(b) and Fig. 5(a) illustrate the performance under extreme class imbalance and with samples from half of the classes missing, respectively. Although the performance of prior methods declines dramatically, BEEF maintains its effectiveness under these two imbalanced protocols, achieving more than 10% performance gain in these challenging settings. In addition, since some classes have no exemplars stored in the exemplar-set from which to compute a class center, iCaRL, based on the NCM classifier (Mensink et al., 2013), fails in the base training phase.

4.3. ABLATION STUDIES

Ablations of key components in BEEF. To verify the effectiveness of the components of BEEF, we conduct ablation studies on CIFAR-100 B50 with 5 incremental sessions. As shown in Table 3, the average and last accuracy gradually increase as we add more components. Notably, the fusion strategy gives the unifying model a large boost. Besides, by learning energy manifolds with the forward prototype $p_f$ and energy alignment, we can further improve the performance.

Sensitivity study of hyper-parameters. There are three hyper-parameters in BEEF when modeling the energy manifold: the hidden layer at which the energy is modeled, the trade-off coefficient $\lambda$, and the number of forward prototypes $F$ in Stable-BEEF. As shown in Fig. 6(a), we conduct experiments on three different hidden layers for modeling the energy manifold, and the results show the robustness of BEEF to the choice of hidden layer. We also vary the trade-off coefficient $\lambda$ over $\{0, 10^{-1}, 10^{-2}, 10^{-3}\}$ and the number of forward prototypes $F$ over $\{1, 5, 20, 50\}$; the average accuracies on CIFAR-100 B50 with 5 incremental sessions are shown in Fig. 6(b). With the increase of $F$ and $\lambda$, the accuracy has an upward trend.

5. CONCLUSION

In this work we presented BEEF for efficient class-incremental learning. Under this framework, we efficiently train a specific module for the current task while achieving bi-directional compatibility, and fuse it with the prior model at minimal cost. BEEF is equipped with a theoretical analysis showing that its training process is inherently the training of an energy-based model. Compression strategies can be applied to address the issue of growing storage overhead.
A PROOFS

Theorem 3.1 (Marginal Distribution Maximum Likelihood Estimation). Defining $E'_\theta(x) = -\log h_\theta(x)[K+1]$ and its corresponding marginal distribution as $P'_\theta(x)$, the optimization of $\mathbb{E}_{P_{\text{real}}(x)}[-\log P_\theta(x)]$ is equivalent to that of $\mathbb{E}_{P_{\text{real}}(x)}\big[-\log \sum_{k=0}^{K} h_\theta(x)[k]\big] + \lambda_{\bar\theta}\, \mathbb{E}_{P'_{\bar\theta}(x)}\big[-\log h_\theta(x)[K+1]\big]$ when gradient descent is applied, where $\lambda_{\bar\theta}$ is the ratio of the normalizing constants determined by $E'_\theta(x)$ and $E_\theta(x)$, and $\bar\theta$ means the parameters $\theta$ are frozen (i.e., instances sampled from $P'_{\bar\theta}(x)$ are detached).

Proof. Since $P_\theta(x) = \frac{\exp(-E_\theta(x))}{\sum_{x'}\exp(-E_\theta(x'))}$, we have
$$\mathbb{E}_{P_{\text{real}}(x)}[-\log P_\theta(x)] = \mathbb{E}_{P_{\text{real}}(x)}[E_\theta(x)] + \log \sum_{x'} \exp(-E_\theta(x')). \quad (13)$$
Taking the gradient of the second term in Eq. 13, we have
$$\begin{aligned}
\nabla_\theta \log \sum_{x'} \exp(-E_\theta(x'))
&= -\sum_{x} \frac{\exp(-E_\theta(x))}{\sum_{x'}\exp(-E_\theta(x'))}\,\nabla_\theta E_\theta(x) \\
&= \sum_{x} \frac{\sum_{k=0}^{K} h_\theta(x)[k]}{\sum_{x'}\exp(-E_\theta(x'))} \cdot \frac{\nabla_\theta \sum_{k=0}^{K} h_\theta(x)[k]}{\sum_{k=0}^{K} h_\theta(x)[k]} \\
&= \sum_{x} \frac{\nabla_\theta\,(1 - h_\theta(x)[K+1])}{\sum_{x'}\exp(-E_\theta(x'))}
= -\sum_{x} \frac{\nabla_\theta h_\theta(x)[K+1]}{\sum_{x'}\exp(-E_\theta(x'))} \\
&= -\sum_{x} \frac{\sum_{x'}\exp(-E'_\theta(x'))}{\sum_{x'}\exp(-E_\theta(x'))} \cdot \frac{\exp(-E'_\theta(x))}{\sum_{x'}\exp(-E'_\theta(x'))} \cdot \frac{\nabla_\theta h_\theta(x)[K+1]}{\exp(-E'_\theta(x))} \\
&= -\sum_{x} \frac{Z'_{\bar\theta}}{Z_{\bar\theta}} \cdot \frac{\exp(-E'_\theta(x))}{\sum_{x'}\exp(-E'_\theta(x'))} \cdot \nabla_\theta \log h_\theta(x)[K+1] \\
&= -\frac{Z'_{\bar\theta}}{Z_{\bar\theta}}\, \mathbb{E}_{P'_{\bar\theta}(x)}\left[\nabla_\theta \log h_\theta(x)[K+1]\right]
= \nabla_\theta \left( \frac{Z'_{\bar\theta}}{Z_{\bar\theta}}\, \mathbb{E}_{P'_{\bar\theta}(x)}\left[-\log h_\theta(x)[K+1]\right] \right), \quad (14)
\end{aligned}$$
where $\bar\theta$ indicates that $\theta$ is frozen (i.e., gradients are not computed), $Z'_{\bar\theta} = \sum_{x'}\exp(-E'_{\bar\theta}(x'))$ is the normalizing constant for the energy $E'_\theta(x)$, and $Z_{\bar\theta}$ is the normalizing constant for the energy $E_\theta(x)$. Hence, the objective $\mathbb{E}_{P_{\text{real}}(x)}[-\log P_\theta(x)]$ is equivalent to
$$\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=0}^{K} h_\theta(x)[k]\Big] + \lambda_{\bar\theta}\, \mathbb{E}_{P'_{\bar\theta}(x)}\big[-\log h_\theta(x)[K+1]\big], \quad (15)$$
where $\lambda_{\bar\theta} = \frac{Z'_{\bar\theta}}{Z_{\bar\theta}}$.

Theorem 3.2 (Conditional Distribution Maximum Likelihood Estimation). With preliminaries from Thm.
3.1, the optimization of $\mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(y|x)]$ is equivalent to that of $\mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma'(y)]\big] + \mu_{\bar\theta}\, \mathbb{E}_{P_{\text{real}}(x)}\big[-\log h_\theta(x)[K+1]\big]$ when gradient descent is applied, where $\mu_{\bar\theta} = \frac{h_{\bar\theta}(x)[K+1]}{\sum_{k=0}^{K} h_{\bar\theta}(x)[k]}$ and $\sigma'(y) = \begin{cases}\sigma(y), & y \in Y_n \\ 0, & y \in Y_o\end{cases}$.

Proof. Considering the definition of $P_\theta(y|x)$ in Eq. 2, we have
$$\begin{aligned}
\mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(y|x)]
&= \mathbb{E}_{P_{\text{real}}(x,y)}\Big[E_\theta(x,y) + \log \sum_{y' \in Y_n \cup Y_o} \exp(-E_\theta(x,y'))\Big] \\
&= \mathbb{E}_{P_{\text{real}}(x,y),\, y\in Y_o}\Big[-\log \frac{h_\theta(x)[0]}{M}\Big] + \mathbb{E}_{P_{\text{real}}(x,y),\, y\in Y_n}\big[-\log h_\theta(x)[\sigma(y)]\big] \\
&\quad + \mathbb{E}_{P_{\text{real}}(x)}\Big[\log \sum_{k=0}^{K} h_\theta(x)[k]\Big]. \quad (16)
\end{aligned}$$

Published as a conference paper at ICLR 2023

The gradient of $\mathbb{E}_{P_{\text{real}}(x,y),\,y\in Y_o}\big[-\log \frac{h_\theta(x)[0]}{M}\big]$ with respect to $\theta$ equals that of $\mathbb{E}_{P_{\text{real}}(x,y),\,y\in Y_o}[-\log h_\theta(x)[0]]$. Therefore, defining $\sigma'(y) = \begin{cases}\sigma(y), & y \in Y_n \\ 0, & y \in Y_o\end{cases}$, the first two components of Eq. 16 can be written in the unified form
$$\mathbb{E}_{P_{\text{real}}(x,y)}\left[-\log h_\theta(x)[\sigma'(y)]\right]. \quad (17)$$
For the last component of Eq. 16, we have
$$\begin{aligned}
\nabla_\theta \log \sum_{k=0}^{K} h_\theta(x)[k]
&= \frac{\nabla_\theta (1 - h_\theta(x)[K+1])}{\sum_{k=0}^{K} h_\theta(x)[k]}
= -\frac{\nabla_\theta h_\theta(x)[K+1]}{\sum_{k=0}^{K} h_\theta(x)[k]} \\
&= -\frac{h_\theta(x)[K+1]}{\sum_{k=0}^{K} h_\theta(x)[k]} \cdot \nabla_\theta \log h_\theta(x)[K+1]
= \nabla_\theta \left( -\frac{h_{\bar\theta}(x)[K+1]}{\sum_{k=0}^{K} h_{\bar\theta}(x)[k]} \cdot \log h_\theta(x)[K+1] \right). \quad (18)
\end{aligned}$$
Therefore, the objective $\mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(y|x)]$ is equivalent to
$$\mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma'(y)]\big] + \mu_{\bar\theta}\, \mathbb{E}_{P_{\text{real}}(x)}\big[-\log h_\theta(x)[K+1]\big], \quad (19)$$
where $\mu_{\bar\theta} = \frac{h_{\bar\theta}(x)[K+1]}{\sum_{k=0}^{K} h_{\bar\theta}(x)[k]}$.

B PROOFS FOR STABLE-BEEF

As claimed in Sec. 3 and Sec. 4, we expand the original BEEF to have multiple backward and forward prototypes (dubbed Stable-BEEF), which stabilizes the training process and improves the performance. Here we give a formal illustration of the expanded form and provide a proof for it. First, at the $t$-th incremental session, instead of creating one backward prototype $p_b$ to measure the confidence for all the old tasks, we create one backward prototype per old task. This expansion prompts the new module to learn a discriminator for each old task, thereby improving the performance and the stability of training, especially when there are clear domain shifts among old tasks. There are therefore $t-1$ backward prototypes for measuring the confidence of old tasks. Furthermore, in Stable-BEEF we create multiple forward prototypes during training: for samples generated through the marginal distribution defined by the energy function, we assign to them the most-likely pseudo labels among the forward prototypes. We set the number of forward prototypes to a constant $F$. After introducing these multiple prototypes, i.e., the $p_b$'s and $p_f$'s, we redefine $h_\theta: \mathcal{X} \rightarrow \Delta^{K+F+(t-2)}$, whose output has $K + (t-1) + F$ entries. Given an input-label pair $(x,y) \in \cup_{i=1}^{t}\mathcal{X}_i \times \cup_{i=1}^{t}\mathcal{Y}_i$, we define the energy $E_\theta(x,y)$ as
$$E_\theta(x,y) = \begin{cases} -\log h_\theta(x)[\sigma(y)], & y \in \mathcal{Y}_t \\ -\log\big(h_\theta(x)[\sigma(y)] / |\mathcal{Y}_i|\big), & y \in \mathcal{Y}_i,\ i = 1, 2, \ldots, t-1 \end{cases},$$
where $\sigma: \cup_{i=1}^{t}\mathcal{Y}_i \rightarrow \{0, 1, \ldots, K+t-2\}$ maps new labels to their $K$ class-corresponding prototypes and maps old labels to their $t-1$ task-corresponding backward prototypes. Similar to Eq. 2, the conditional probability density and marginal probability density for Stable-BEEF can then be re-formulated as
$$P_\theta(y|x) = \begin{cases} \dfrac{h_\theta(x)[\sigma(y)]}{\sum_{k=0}^{K+t-2} h_\theta(x)[k]}, & y \in \mathcal{Y}_t \\[2ex] \dfrac{h_\theta(x)[\sigma(y)]}{|\mathcal{Y}_i| \sum_{k=0}^{K+t-2} h_\theta(x)[k]}, & y \in \mathcal{Y}_i,\ i = 1, \ldots, t-1 \end{cases}, \qquad P_\theta(x) = \frac{\sum_{k=0}^{K+t-2} h_\theta(x)[k]}{\sum_{x'} \sum_{k=0}^{K+t-2} h_\theta(x')[k]}.$$
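The index layout implied by this definition can be made concrete with a small sketch. This is our own illustration: the paper fixes only the three index ranges (in-distribution indices $0..K+t-2$, forward-prototype indices $K+t-1..K+t-2+F$), so the internal ordering below, with backward prototypes first, is an assumption.

```python
def make_sigma(old_task_labels, new_labels, F):
    """Build the sigma mapping for Stable-BEEF at session t.
    old_task_labels: list of label-sets Y_1..Y_{t-1}; new_labels: Y_t (K labels)."""
    t = len(old_task_labels) + 1
    K = len(new_labels)
    sigma = {}
    for i, Y_i in enumerate(old_task_labels):       # t-1 backward prototypes: indices 0..t-2
        for y in Y_i:
            sigma[y] = i
    for j, y in enumerate(sorted(new_labels)):      # K class prototypes: indices t-1..K+t-2
        sigma[y] = (t - 1) + j
    forward = list(range(K + t - 1, K + t - 1 + F))  # F forward prototypes at the end
    return sigma, forward

# session t=3 with two old tasks of two classes each, K=3 new classes, F=2
sigma, fwd = make_sigma([{0, 1}, {2, 3}], {4, 5, 6}, F=2)
assert sigma[0] == sigma[1] == 0 and sigma[2] == sigma[3] == 1   # per-task backward protos
assert [sigma[4], sigma[5], sigma[6]] == [2, 3, 4]               # class prototypes
assert fwd == [5, 6]        # head outputs K + (t-1) + F = 7 entries in total
```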
We can then induce that the energy function in $P_\theta(x)$ is
$$E_\theta(x) = -\log \sum_{k=0}^{K+t-2} h_\theta(x)[k].$$
As in Eq. 4, Stable-BEEF also estimates the joint distribution $P_\theta(x,y)$, that is,
$$\arg\min_\theta\ \mathbb{E}_{P_{\text{real}}(x)}[-\log P_\theta(x)] + \mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(y|x)].$$
And, as in the original BEEF, Stable-BEEF also seeks a gradient-equivalent optimization objective in order to avoid the intractable normalizing constant in the joint distribution defined by the energy $E_\theta(x,y)$.

Theorem B.1 (Marginal Distribution Maximum Likelihood Estimation for Stable-BEEF). Defining $E'_\theta(x) = -\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]$ and its corresponding marginal distribution as $P'_\theta(x)$, the optimization of $\mathbb{E}_{P_{\text{real}}(x)}[-\log P_\theta(x)]$ is equivalent to that of
$$\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=0}^{K+t-2} h_\theta(x)[k]\Big] + \lambda_{\bar{\theta}}\,\mathbb{E}_{P'_{\bar{\theta}}(x)}\Big[-\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big]$$
when gradient descent is applied, where $\lambda_{\bar{\theta}}$ is the ratio of the normalizing constants determined by $E'_\theta(x)$ and $E_\theta(x)$, and $\bar{\theta}$ means that the parameters $\theta$ are frozen (i.e., instances sampled from $P'_{\bar{\theta}}(x)$ are detached).

Proof. Since $P_\theta(x) = \frac{\exp(-E_\theta(x))}{\sum_{x'}\exp(-E_\theta(x'))}$, we have
$$\mathbb{E}_{P_{\text{real}}(x)}[-\log P_\theta(x)] = \mathbb{E}_{P_{\text{real}}(x)}[E_\theta(x)] + \log\sum_{x'}\exp(-E_\theta(x')). \qquad (24)$$
Taking the gradient of the second term in Eq.
24, we have
$$\begin{aligned}
\nabla_\theta \log\sum_{x'}\exp(-E_\theta(x'))
&= -\sum_x \frac{\exp(-E_\theta(x))}{\sum_{x'}\exp(-E_\theta(x'))}\,\nabla_\theta E_\theta(x) \\
&= \sum_x \frac{\sum_{k=0}^{K+t-2} h_\theta(x)[k]}{\sum_{x'}\exp(-E_\theta(x'))}\cdot\frac{\nabla_\theta \sum_{k=0}^{K+t-2} h_\theta(x)[k]}{\sum_{k=0}^{K+t-2} h_\theta(x)[k]} \\
&= \sum_x \frac{\nabla_\theta\big(1 - \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\big)}{\sum_{x'}\exp(-E_\theta(x'))}
= -\sum_x \frac{\nabla_\theta \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]}{\sum_{x'}\exp(-E_\theta(x'))} \\
&= -\sum_x \frac{Z'_\theta}{Z_\theta}\cdot\frac{\exp(-E'_\theta(x))}{\sum_{x'}\exp(-E'_\theta(x'))}\cdot\frac{\nabla_\theta \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]}{\sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]} \\
&= -\frac{Z'_\theta}{Z_\theta}\,\mathbb{E}_{P'_\theta(x)}\Big[\nabla_\theta \log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big]
= \nabla_\theta\,\frac{Z'_{\bar{\theta}}}{Z_{\bar{\theta}}}\,\mathbb{E}_{P'_{\bar{\theta}}(x)}\Big[-\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big],
\end{aligned}$$
where $\bar{\theta}$ represents that $\theta$ is frozen (i.e., gradients are not computed through it), $Z'_\theta = \sum_{x'}\exp(-E'_\theta(x'))$ is the normalizing constant for the energy $E'_\theta(x)$, $Z_\theta$ is the normalizing constant for the energy $E_\theta(x)$, and the fourth equality uses $\exp(-E'_\theta(x)) = \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]$. Hence, the objective $\mathbb{E}_{P_{\text{real}}(x)}[-\log P_\theta(x)]$ is gradient-equivalent to
$$\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=0}^{K+t-2} h_\theta(x)[k]\Big] + \lambda_{\bar{\theta}}\,\mathbb{E}_{P'_{\bar{\theta}}(x)}\Big[-\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big], \quad \text{where } \lambda_{\bar{\theta}} = \frac{Z'_{\bar{\theta}}}{Z_{\bar{\theta}}}.$$

Theorem B.2 (Conditional Distribution Maximum Likelihood Estimation for Stable-BEEF). With preliminaries from Thm. B.1, the optimization of $\mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(y|x)]$ is equivalent to that of
$$\mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma(y)]\big] + \mu_{\bar{\theta}}\,\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big]$$
when gradient descent is applied, where $\mu_{\bar{\theta}} = \frac{\sum_{k=K+t-1}^{K+t-2+F} h_{\bar{\theta}}(x)[k]}{\sum_{k=0}^{K+t-2} h_{\bar{\theta}}(x)[k]}$.

Proof. Considering the definition of $P_\theta(y|x)$ in Eq.
2, we have
$$\begin{aligned}
\mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(y \mid x)]
&= \mathbb{E}_{P_{\text{real}}(x,y)}\Big[E_\theta(x,y) + \log\!\!\sum_{y' \in \cup_{i=1}^{t}\mathcal{Y}_i}\!\!\exp(-E_\theta(x,y'))\Big] \\
&= \sum_{i=1}^{t-1}\mathbb{E}_{P_{\text{real}}(x,y),\,y\in\mathcal{Y}_i}\Big[-\log \frac{h_\theta(x)[\sigma(y)]}{|\mathcal{Y}_i|}\Big] + \mathbb{E}_{P_{\text{real}}(x,y),\,y\in\mathcal{Y}_t}\big[-\log h_\theta(x)[\sigma(y)]\big] + \mathbb{E}_{P_{\text{real}}(x)}\Big[\log\sum_{k=0}^{K+t-2} h_\theta(x)[k]\Big]. \qquad (27)
\end{aligned}$$
The gradient of $\mathbb{E}_{P_{\text{real}}(x,y),\,y\in\mathcal{Y}_i}\big[-\log \frac{h_\theta(x)[\sigma(y)]}{|\mathcal{Y}_i|}\big]$ with respect to $\theta$ is equal to that of $\mathbb{E}_{P_{\text{real}}(x,y),\,y\in\mathcal{Y}_i}[-\log h_\theta(x)[\sigma(y)]]$ $(i = 1, 2, \ldots, t-1)$, regardless of $|\mathcal{Y}_i|$. Therefore, the first $t$ components of Eq. 27 can be written in the unifying form
$$\mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma(y)]\big].$$
For the last component of Eq. 27, we have
$$\begin{aligned}
\nabla_\theta \log \sum_{k=0}^{K+t-2} h_\theta(x)[k]
&= \frac{\nabla_\theta\big(1 - \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\big)}{\sum_{k=0}^{K+t-2} h_\theta(x)[k]}
= -\frac{\nabla_\theta \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]}{\sum_{k=0}^{K+t-2} h_\theta(x)[k]} \\
&= -\frac{\sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]}{\sum_{k=0}^{K+t-2} h_\theta(x)[k]}\cdot\nabla_\theta \log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]
= \nabla_\theta\Big[-\frac{\sum_{k=K+t-1}^{K+t-2+F} h_{\bar{\theta}}(x)[k]}{\sum_{k=0}^{K+t-2} h_{\bar{\theta}}(x)[k]}\cdot\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big].
\end{aligned}$$
Therefore, the objective $\mathbb{E}_{P_{\text{real}}(x,y)}[-\log P_\theta(y \mid x)]$ is gradient-equivalent to
$$\mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma(y)]\big] + \mu_{\bar{\theta}}\,\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big], \quad \text{where } \mu_{\bar{\theta}} = \frac{\sum_{k=K+t-1}^{K+t-2+F} h_{\bar{\theta}}(x)[k]}{\sum_{k=0}^{K+t-2} h_{\bar{\theta}}(x)[k]}.$$
Combining Thm. B.1 and Thm. B.2, we get the final optimization objective
$$\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=0}^{K+t-2} h_\theta(x)[k]\Big] + \lambda_{\bar{\theta}}\,\mathbb{E}_{P'_{\bar{\theta}}(x)}\Big[-\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big] + \mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma(y)]\big] + \mu_{\bar{\theta}}\,\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big]. \qquad (31)$$
Note that $h_\theta(x)[i] \geq 0$ for all $i = 1, 2, \ldots, K+t-2+F$, and hence
$$\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \sum_{k=0}^{K+t-2} h_\theta(x)[k]\Big] \leq \mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma(y)]\big],$$
$$\mathbb{E}\Big[-\log \sum_{k=K+t-1}^{K+t-2+F} h_\theta(x)[k]\Big] \leq \mathbb{E}\Big[-\log \max_{k\in\{K+t-1,\ldots,K+t-2+F\}} h_\theta(x)[k]\Big].$$
Hence, Eq. 31 is upper bounded by
$$2\,\mathbb{E}_{P_{\text{real}}(x,y)}\big[-\log h_\theta(x)[\sigma(y)]\big] + \lambda_{\bar{\theta}}\,\mathbb{E}_{P'_{\bar{\theta}}(x)}\Big[-\log \max_{k\in\{K+t-1,\ldots,K+t-2+F\}} h_\theta(x)[k]\Big] + \mu_{\bar{\theta}}\,\mathbb{E}_{P_{\text{real}}(x)}\Big[-\log \max_{k\in\{K+t-1,\ldots,K+t-2+F\}} h_\theta(x)[k]\Big]. \qquad (34)$$
Taking Eq.
34 as the optimization objective, we expand the expansion phase of the original BEEF into a more stable form with multiple backward and forward prototypes, namely the expansion phase of Stable-BEEF.

Here we further discuss how we achieve the expanded fusion phase in Stable-BEEF. We again assume that we have trained a unifying model $h_{\theta_o}$ for all the prior tasks and that there is a $\sigma_o$ mapping old labels to output indices of $h_{\theta_o}$. Considering that the $t-1$ backward prototypes measure the confidence for the $t-1$ prior tasks, similar to Eq. 10, we define
$$E_{\{\theta_o,\theta\}}(x,y) = \begin{cases} -\log\big\{h_{\theta_o}(x)[\sigma_o(y)] + \alpha_i h_\theta(x)[\sigma(y)] + \beta_i\big\}, & y \in \mathcal{Y}_i,\ i = 1, 2, \ldots, t-1 \\ -\log h_\theta(x)[\sigma(y)], & y \in \mathcal{Y}_t \end{cases}. \qquad (35)$$
Then we have
$$P_{\{\theta,\theta_o\}}(y|x) = \begin{cases} \dfrac{h_{\theta_o}(x)[\sigma_o(y)] + \alpha_i h_\theta(x)[\sigma(y)] + \beta_i}{Z(x)}, & y \in \mathcal{Y}_i,\ i = 1, \ldots, t-1 \\[2ex] \dfrac{h_\theta(x)[\sigma(y)]}{Z(x)}, & y \in \mathcal{Y}_t \end{cases}, \qquad (36)$$
where $Z(x) = \sum_{i=1}^{t-1}\sum_{y' \in \mathcal{Y}_i}\big[h_{\theta_o}(x)[\sigma_o(y')] + \alpha_i h_\theta(x)[\sigma(y')] + \beta_i\big] + \sum_{y' \in \mathcal{Y}_t} h_\theta(x)[\sigma(y')]$. The $\alpha_i$'s and $\beta_i$'s are obtained by minimizing the negative log-likelihood on the exemplar-set.
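The two Stable-BEEF training objectives can be summarized in a compact sketch: the expansion-phase upper bound of Eq. 34 and the fusion rule of Eq. 36. This is our own illustrative code, not the released implementation; the function names, the index layout (forward prototypes occupying the last $F$ indices), and the use of a single trade-off coefficient for both weighting terms are assumptions.

```python
import numpy as np

def expansion_loss(h_real, y_idx, h_gen, K, t, F, lam=1.0):
    """Eq. 34 sketch: h_real/h_gen are softmaxed head outputs (rows sum to 1) for real
    and generated samples; y_idx holds sigma(y) for each real sample."""
    fwd = slice(K + t - 1, K + t - 1 + F)                # forward-prototype indices
    ce = -np.log(h_real[np.arange(len(y_idx)), y_idx]).mean()   # E[-log h[sigma(y)]]
    gen = -np.log(h_gen[:, fwd].max(axis=1)).mean()      # generated -> forward protos
    real = -np.log(h_real[:, fwd].max(axis=1)).mean()    # real samples' forward mass
    return 2 * ce + lam * (gen + real)

def fused_conditional(h_old, h_new, old_tasks, new_labels, sigma_o, sigma, alpha, beta):
    """Eq. 36 sketch: fuse the prior unified head h_old with the new module h_new;
    alpha[i], beta[i] are the per-old-task factors learned on the exemplar-set."""
    scores = {}
    for i, Y_i in enumerate(old_tasks):      # old labels: old-head confidence plus the
        for y in Y_i:                        # rescaled backward-prototype confidence
            scores[y] = h_old[sigma_o[y]] + alpha[i] * h_new[sigma[y]] + beta[i]
    for y in new_labels:                     # new labels: new module's class confidence
        scores[y] = h_new[sigma[y]]
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# toy sanity checks (K=3 new classes, t=3, F=2 -> 7 head outputs)
h_real = np.full((2, 7), 1 / 7)
loss = expansion_loss(h_real, np.array([2, 3]), h_real.copy(), K=3, t=3, F=2)
assert np.isfinite(loss) and loss > 0
p = fused_conditional([0.4, 0.3, 0.2, 0.1], [0.1, 0.1, 0.4, 0.2, 0.1],
                      [{0, 1}, {2, 3}], [4, 5, 6], {0: 0, 1: 1, 2: 2, 3: 3},
                      {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 3, 6: 4},
                      alpha=[0.5, 0.5], beta=[0.0, 0.0])
assert abs(sum(p.values()) - 1) < 1e-9       # Eq. 36 normalizes over all labels
```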

C MORE EXPERIMENTAL SETTINGS

C.1 EXEMPLAR SELECTION. As claimed, all prior methods assume that all the old samples are available when selecting exemplars. Typically, they apply the exemplar selection strategy proposed in Rebuffi et al. (2017), where exemplars are carefully selected by greedily minimizing the deviation between the feature center of the selected exemplars and that of all the old samples. Besides, they treat all old categories as equal and store the same number of exemplars for each old class. This assumption is violated in many application scenarios, where the available exemplars are usually imbalanced and some instances of old categories may even become unavailable in later sessions. Robustness to imbalanced or missing categories is therefore important for CIL methods. Assume there are $k$ old classes and we want to store $m$ exemplars per old class. We design three protocols for exemplar-set imbalance: (1) half-half: half of the $k$ classes are allowed to store more than $m$ exemplars, and the other half less than $m$. Specifically, with a balance factor $\gamma \in [0,1]$, half of the $k$ classes store $(1+\gamma)m$ exemplars while the other half store $(1-\gamma)m$ exemplars. (2) exp: we use a negative exponential sequence of length $k$ as the weight for each category. Specifically, the $i$-th class has the weight $\exp(-\gamma i)$, and its number of exemplars is defined by $\frac{\exp(-\gamma i)}{\sum_{j}\exp(-\gamma j)} \cdot km$.

C.2 IMPLEMENTATION DETAILS. For ImageNet, we adopt the standard ResNet-18 (He et al., 2016) as our feature extractor and set the batch size to 256. The learning rate starts from 0.1 and gradually decays at milestones (170 epochs in total). For CIFAR-100, we use ResNet-32 (He et al., 2016) as our feature extractor and set the batch size to 128. The learning rate also starts from 0.1 and gradually decays at milestones (170 epochs in total). For both ImageNet and CIFAR-100, we use SGD with a momentum of 0.9 and a weight decay of 5e-4 at the expansion phase.
At the fusion phase, we use SGD with a momentum of 0.9, set the weight decay to 0, and train the fused model on the exemplar-set for 60 epochs. We apply the data augmentation following Rebuffi et al. (2017) and Wang et al. (2022a). To generate samples from $P'_\theta(x)$ when learning the energy manifold, we use representations of real instances as the starting point of SGLD (Welling & Teh, 2011), following Contrastive Divergence (Hinton, 2002); this can be seen as a perturbation of the original representations of the real samples. To further accelerate the sampling process, we employ additional representation perturbation strategies on the feature representations, including mixup (Zhang et al., 2018), rotation, etc. We are therefore able to generate new samples efficiently when learning the energy manifold. To further stabilize the training process and improve the performance, we expand the original BEEF into Stable-BEEF with multiple backward and forward prototypes. The number of backward prototypes is set to the number of old tasks, and the number of forward prototypes is set to a constant $F$. A detailed illustration and proof are provided in Appendix B. Besides, to simplify the training procedure and avoid the intractability of the normalizing constants, we use a single trade-off coefficient $\lambda$ in place of $\lambda_{\bar{\theta}}$ and $\mu_{\bar{\theta}}$ in our implementation.
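The feature-space sampling described above can be sketched as follows. This is our own minimal illustration of short-run SGLD with a mixup-perturbed, Contrastive-Divergence-style initialization; the function names, step counts, and step sizes are illustrative, and the real implementation differentiates a learned energy head rather than taking a hand-supplied gradient.

```python
import numpy as np

def sample_features_sgld(feats, grad_energy, steps=20, step_size=0.1, noise=0.01,
                         mix_alpha=0.2, rng=None):
    """Short-run SGLD on feature representations, initialized from real features
    after a mixup perturbation of the start point.
    grad_energy(z) returns the gradient of the energy E'(z) with respect to z."""
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(mix_alpha, mix_alpha, size=(feats.shape[0], 1))
    z = lam * feats + (1 - lam) * feats[rng.permutation(feats.shape[0])]  # mixup init
    for _ in range(steps):  # Langevin update: z <- z - (s/2) * grad E'(z) + sqrt(s) * eps
        z = z - 0.5 * step_size * grad_energy(z) \
              + np.sqrt(step_size) * noise * rng.normal(size=z.shape)
    return z

# sanity check with a quadratic energy E'(z) = ||z||^2 / 2 (its gradient is z itself):
# samples drift from the real features toward the low-energy region near the origin
feats = np.full((8, 4), 5.0)
out = sample_features_sgld(feats, lambda z: z, steps=50)
assert out.shape == feats.shape and np.abs(out).mean() < np.abs(feats).mean()
```

Starting the chain at (perturbed) real features instead of noise is what keeps the number of Langevin steps small, which is the efficiency claim made in the text.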

D MORE DISCUSSIONS D.1 CONTRIBUTIONS OF BEEF

Our contribution is three-fold: 1) We propose a novel training paradigm for CIL: we efficiently train a task-specific module for the current task while achieving bi-directional compatibility, and then fuse it with the prior model at minimal cost. 2) We provide a theoretical proof showing that the training process is inherently the modeling of an energy-based model. 3) We achieve state-of-the-art performance with lower training cost, and maintain the performance when only randomly sampled old data is available, while other methods fail dramatically.

D.2 DIFFERENCE AND CONNECTIONS WITH PRIOR METHODS

Here we discuss the crucial differences and connections between our method and prior works. Though Wang et al. (2021) proposed using an energy-based extra dimension to model open-world uncertainty in OOD detection, that approach is not suitable for CIL, where old categories must be considered to alleviate forgetting. Besides, we take the joint distribution $P(x,y) = P(x)P(y|x)$ into consideration, which forces the model both to learn discrimination ($P(y|x)$) and to be sensitive to input distribution shift ($P(x)$); modules that do not match the input distribution will thus capture the distribution shift, weakening their influence on the ultimate predictions. Similar to Joseph et al. (2022), we also utilize an energy-based model to achieve energy alignment at the evaluation phase. However, while our energy-based model is built in inherently at the training phase, Joseph et al. (2022) need to train an extra energy aligner after the training phase, introducing additional training costs. Apart from that, although they assume that the task-ids of given samples are known in their implementation, we do not require that. Zhou et al. (2022) cleverly utilize multiple virtual prototypes to reserve feature space for future unseen categories, thus achieving forward compatibility. Our method can be seen as a bi-directionally compatible method, with one prototype $p_b$ measuring the confidence for old categories and the other prototype $p_f$ modeling the out-of-distribution probability.

D.3 THE ROBUSTNESS OF BEEF TO THE IMBALANCED EXEMPLAR-SET

Here we explain why BEEF demonstrates strong robustness to an imbalanced exemplar-set. Previous methods, whether dynamic-structure-based or regularization-based, rely on a unified feature representation and classifier. When the exemplar-set is imbalanced, i.e., when some categories have very few or no stored samples, the feature representation and classifier for those classes are damaged. Although our method also fuses all modules into a unified classifier, the modules are decoupled from each other and responsible for different tasks. Thus, when training on new tasks, even if samples of some old classes are unavailable, the old modules responsible for those classes are unaffected. In addition, when training a new module, we classify all old classes into a special backward prototype. This backward prototype learns the characteristics shared by the categories of old tasks, and is therefore insensitive to the imbalance of old samples. In the model fusion phase, the confidence of the backward prototype is uniformly added to the outputs of the old modules, acting as a soft task discriminator without affecting the predictions of the old modules on old tasks.

D.4 THE EFFICIENCY OF BEEF

Training Cost Analysis. Previous methods, whether based on knowledge distillation or dynamic structure, require forward propagation through the old modules when learning new tasks in order to obtain a unifying classifier. Moreover, as the number of retained modules grows, dynamic-structure-based methods suffer from increasing training costs. The training of BEEF consists of two phases: expansion and fusion. At the expansion phase, we only need to train the newly created module independently, without involving the prior modules, so the training cost at the expansion phase is equivalent to tuning a single module and does not increase over incremental stages. Although we follow Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011) to generate samples from $P'_\theta(x)$ when learning the energy manifold, we apply several strategies to simplify and accelerate the process, including Contrastive Divergence, mixup, and rotation. At the fusion phase, we only need to learn two sets of factors, the scale factors $\alpha$ and bias factors $\beta$, by training on the randomly selected exemplar-set, so the training cost is minimal compared with the expansion phase. In particular, we compare the training cost of BEEF with that of the classical dynamic-structure-based method DER (Yan et al., 2021) and regularization-based methods such as iCaRL (Rebuffi et al., 2017) and WA (Wu et al., 2019). Without loss of generality, take the $t$-th incremental stage as an example. Although all the old modules are frozen in DER, the training process still requires forward propagation through all $t-1$ old modules as well as the new module, and then backpropagation to update the new module. Apart from that, DER also needs to finetune the final classifier with a balanced reserved dataset. Therefore, the training cost of DER at the $t$-th incremental stage is: $t\times$ forward propagation, $1\times$ backpropagation, $1\times$ classifier finetuning.
Similarly, due to the knowledge distillation required by regularization-based methods like iCaRL, their training cost is: $2\times$ forward propagation, $1\times$ backpropagation. In contrast, BEEF first trains the new module in a decoupled manner, which only requires: $1\times$ forward propagation, $1\times$ backpropagation. The fusion phase then only needs to tune two sets of parameters on a small subset, whose cost is minimal and even lower than simply tuning the final classifier in DER. Therefore, the training cost of BEEF consists of: $1\times$ forward propagation, $1\times$ backpropagation, $1\times$ finetuning of $\alpha$ and $\beta$. Specifically, we compare the training time of BiC (regularization-based), DER (dynamic-structure-based), and BEEF at each incremental session on the CIFAR-100 B0 10-step protocol; the results are shown in Table 4. We observe that the average training time of BEEF is lower than both the regularization-based method BiC and the dynamic-structure-based method DER.

Memory usage analysis. Here, we report the peak memory for storing the exemplars and the learnable and frozen network parameters throughout training on the B0 5-step protocols of CIFAR-100 and ImageNet-100 in Table 5. All model parameters are stored as float32, i.e., each parameter takes 4 bytes of memory.

We report the detailed accuracy on the standard CIFAR-100 B50 protocols with 5, 10, and 25 incremental sessions in Fig. 7. We can observe that BEEF achieves competitive performance under these protocols. Our overall training strategy is difficult to integrate with other existing CIL methods, but we find that our proposed forward-compatibility implementation can be integrated into many methods. We experimented on the baseline regularization-based method WA and found that by implementing



Figure 1: The conceptual illustration of BEEF. The training consists of two phases: expansion and fusion.

Figure 2: Learning the energy manifold. The left part illustrates the concrete workflow of model training. The middle and right parts illustrate the corresponding high-level, abstract energy-manifold learning process.

Figure 3: p b acts as a soft task-discriminator.

Figure 4: Performance on standard CIFAR-100 B0 protocols.

Figure 6: Sensitivity studies of hyper-parameters.

(3) random: we uniformly sample $km$ exemplars from all available old instances. This protocol is the closest to the original setting, since the expected number of exemplars for each category is $m$.
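The three imbalance protocols can be summarized in code. This is a sketch based on the definitions in C.1; the integer truncation and the way the random protocol draws class assignments are our own choices.

```python
import numpy as np

def exemplar_counts(k, m, protocol, gamma=0.5, rng=None):
    """Per-class exemplar budgets for k old classes with a nominal budget of m each."""
    if protocol == "half-half":       # half the classes get (1+gamma)m, the rest (1-gamma)m
        half = k // 2
        return [int((1 + gamma) * m)] * half + [int((1 - gamma) * m)] * (k - half)
    if protocol == "exp":             # class i weighted by exp(-gamma * i), total budget k*m
        w = np.exp(-gamma * np.arange(k))
        return [int(wi / w.sum() * k * m) for wi in w]
    if protocol == "random":          # uniform sampling: m per class in expectation
        if rng is None:
            rng = np.random.default_rng(0)
        draws = rng.integers(0, k, size=k * m)
        return [int((draws == i).sum()) for i in range(k)]
    raise ValueError(protocol)

counts = exemplar_counts(k=10, m=20, protocol="half-half", gamma=0.5)
assert counts[:5] == [30] * 5 and counts[5:] == [10] * 5
assert sum(exemplar_counts(10, 20, "random")) == 200   # total budget is preserved
```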

Figure 7: Performance on standard CIFAR-100 B50 protocols.

Tolias, 2019). Our work is a class-based method, typically called class-incremental learning (CIL). Prevalent CIL methods can be categorized into two classes: regularization-based and dynamic-structure-based. Regularization-based methods impose constraints when learning new tasks. Kirkpatrick et al. (2017); Aljundi et al. (2018) penalize parameter drift. Li & Hoiem (2017); Rebuffi et al. (2017); Wu et al. (2019); Zhao et al. (

Among the compared methods, DyTox applies a stronger neural architecture (ConViT (d'Ascoli et al., 2021)) and additional data augmentation; FOSTER (Wang et al., 2022a) additionally uses AutoAugment (Cubuk et al., 2019) to enhance sample efficiency and improve classification accuracy. RMM (Liu et al., 2021b) achieves a better memory management strategy: with the same training memory, RMM chooses some new samples to train on and discards the rest, allowing more old exemplars to be stored. Note that all of these are orthogonal to the methods themselves, so for a fair comparison we combine the augmentation and memory management strategies with BEEF, which we denote BEEF-Compress. Performance on ImageNet. We report both the average and last accuracy of top-1 and top-5 for BEEF and BEEF-Compress.

Ablations

Here we provide the detailed proofs for Thm. 3.1 and Thm. 3.2.

Table 4: Training time cost (in minutes) at each incremental session.

For images of width $W$ and height $H$, since each image has three channels and each pixel (with a value in 0-255) takes one byte, each image takes $3 \times W \times H$ bytes of memory. Assuming we store $N$ exemplars and the model has $M$ parameters, the memory usage is $3 \times W \times H \times N + 4 \times M$ bytes. Note that the memory usage of RMM varies with the choice of configs: for CIFAR-100 it varies between 9.66 MB and 24.2 MB, and for ImageNet-100 between 378 MB and 1949 MB. We report the mean values in the table.
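The accounting above can be written out directly. The exemplar count and the parameter count in the example below are illustrative, not the paper's exact configurations.

```python
def peak_memory_mb(width, height, n_exemplars, n_params):
    """Memory accounting from the text: each exemplar image stores 3 channels of
    W x H one-byte pixels, and each float32 model parameter takes 4 bytes."""
    image_bytes = 3 * width * height * n_exemplars
    param_bytes = 4 * n_params
    return (image_bytes + param_bytes) / 2**20   # bytes -> MB

# e.g. 2000 CIFAR-100 exemplars (32x32) with a ~0.46M-parameter ResNet-32-sized model
mb = peak_memory_mb(32, 32, 2000, 464_154)
assert round(mb, 1) == 7.6
```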

Table 5: Peak memory usage comparisons.


ACKNOWLEDGMENT. This work is partially supported by NSFC (61921006, 62006112, 62250069), NSF of Jiangsu Province (BK20200313), Collaborative Innovation Center of Novel Software Technology and Industrialization.

Appendix of BEEF

We report the detailed accuracy on the standard ImageNet-100 protocols in Fig. 8. We can observe that BEEF achieves competitive performance under different ImageNet-100 protocols. We report more results on imbalanced protocols in Fig. 9. From the figure, we observe that even under a more balanced exemplar-set configuration than that of Fig. 5, other methods still suffer significant performance degradation, while BEEF achieves a much higher average accuracy. Specifically, in the half-half case with $\gamma = 0.5$, BEEF outperforms the best competing method by 4.25%, and in the exp case with $\gamma = 0.9$, BEEF outperforms the best competing method by 9.80%. These results further demonstrate the robustness of BEEF to the exemplar-set selection strategy.

