BEEF: BI-COMPATIBLE CLASS-INCREMENTAL LEARNING VIA ENERGY-BASED EXPANSION AND FUSION

Abstract

Neural networks suffer from catastrophic forgetting when learning tasks sequentially, phase by phase, making them inapplicable in dynamically updated systems. Class-incremental learning (CIL) aims to enable neural networks to learn different categories across multiple stages. Recently, dynamic-structure-based CIL methods have achieved remarkable performance. However, these methods train all modules in a coupled manner and do not consider possible conflicts among modules, which corrupts the final predictions. In this work, we propose a unifying energy-based theory and framework called Bi-Compatible Energy-Based Expansion and Fusion (BEEF) to analyze and achieve the goal of CIL. We demonstrate the possibility of training independent modules in a decoupled manner while achieving bi-directional compatibility among modules through two additionally allocated prototypes, and then integrating them into a unifying classifier at minimal cost. Furthermore, BEEF extends the exemplar-set to a more challenging setting, where exemplars are randomly selected and imbalanced, and maintains its performance where prior methods fail dramatically. Extensive experiments on three widely used benchmarks, CIFAR-100, ImageNet-100, and ImageNet-1000, demonstrate that BEEF achieves state-of-the-art performance in both the ordinary and challenging CIL settings.

1. INTRODUCTION

The ability to continuously acquire new knowledge is necessary in our ever-changing world and is considered a crucial aspect of human intelligence. Applicable AI systems are expected to learn new concepts from a stream while retaining knowledge of previously learned concepts. However, deep neural network-based systems, despite their great success, face a well-known issue called catastrophic forgetting (French, 1999; Golab & Özsu, 2003; Zhou et al., 2023b), whereby they abruptly forget prior knowledge when directly fine-tuned on new tasks. To address this challenge, the class-incremental learning (CIL) field aims to design learning paradigms that enable deep neural networks to learn novel categories in multiple stages while maintaining discrimination abilities for previous ones (Rebuffi et al., 2017; Zhou et al., 2023a). Numerous approaches have been proposed to achieve the goal of CIL, with typical methods falling into two groups: regularization-based methods and dynamic-structure-based methods. Regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018; Li & Hoiem, 2017; Rebuffi et al., 2017) add constraints (e.g., a parameter-drift penalty) when updating, thus forcing the model to maintain crucial information about old categories. However, these methods often suffer from the stability-plasticity dilemma, lacking the capacity to handle all categories simultaneously. Dynamic-structure-based methods (Yan et al., 2021; Li et al., 2021) expand new modules at each learning stage to enhance the model's capacity and learn task-specific knowledge through the new module, achieving remarkable performance. However, these methods have an intrinsic drawback: they directly retain all learned modules without considering conflicts among them, thus corrupting the joint feature representation and misleading the ultimate predictions.
For example, old modules that do not account for possible future updates may mislead the final prediction for new classes. This defect limits their performance on long-term incremental tasks. In this paper, we aim to reduce the possible conflicts in dynamic-structure-based methods and propose to achieve bi-directional compatibility, which consists of backward compatibility and forward compatibility. Specifically, backward compatibility ensures that the discrimination ability of old modules is unaffected by new ones. Conversely, forward compatibility aims to reduce the impact of old modules on the ultimate predictions when new categories emerge. With bi-directional compatibility, given a sample from a specific task, the module responsible for that task dominates the ultimate prediction in the ideal case. Fig. 1 displays the BEEF training framework, which consists of two phases: model expansion and model fusion. In the expansion phase, we assume that different modules are independent and train them in isolation. Then, in the fusion phase, all trained modules are combined to form a unifying classifier. To achieve bi-directional compatibility, we introduce two additional prototypes: a forward prototype p_f and a backward prototype p_b. The backward prototype p_b measures the confidence of old classes, while the forward prototype p_f measures uncertainty in the open world. Specifically, when training a new module, we set p_b as the cluster prototype for all old classes and use it to learn a task boundary between the current task and prior ones. Meanwhile, we set p_f as the cluster prototype for samples from unseen distributions, generated through energy-based sampling, and use it to measure uncertainty and better capture the current distribution.
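As a sketch of how the two prototypes might enter the expansion phase, the snippet below builds a classification head with two extra slots and routes training labels accordingly. All names and sizes (`D`, `K`, `route_label`) are hypothetical, illustrative choices, not BEEF's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative only): feature dim and number of new classes.
D, K = 16, 5

# A new module's classifier head: one weight vector per new class,
# plus two extra prototype slots:
#   index K     -> backward prototype p_b (cluster of all old classes)
#   index K + 1 -> forward prototype p_f (energy-sampled "unseen" inputs)
W = rng.normal(size=(K + 2, D))

def logits(feat):
    """Scores over [new classes..., p_b, p_f] for one feature vector."""
    return W @ feat

def route_label(source, new_class=-1):
    """Map a training sample to its target index in the expanded head."""
    if source == "new":        # sample of the current task
        return new_class
    if source == "old":        # exemplar from any prior task
        return K               # all old classes collapse onto p_b
    if source == "generated":  # pseudo-sample from the energy-based model
        return K + 1           # open-world uncertainty cluster p_f
    raise ValueError(source)
```

Training the module with an ordinary cross-entropy loss over these K + 2 targets then learns both the new classes and a task boundary against the past (via p_b) and the unseen (via p_f).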
We show that BEEF can be deduced from a unifying, extendable energy-based theoretical framework, which allows us to transform the open-world problem into an ordinary classification problem and to model the input distribution while learning discrimination ability. Extensive experiments on three widely used benchmarks show that our method achieves state-of-the-art performance. Besides, prevalent CIL methods all require a well-selected, balanced exemplar-set for rehearsal, which may be impractical due to data-privacy issues (Delange et al., 2021; Ji et al., 2014) and the computational cost of choosing exemplars. Our method pushes CIL to a harder setting: with only randomly sampled data from prior tasks, BEEF maintains its effectiveness while the performance of other methods declines drastically.
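The exact sampler behind "energy-based sampling" is not detailed in this excerpt; a common choice in energy-based models is stochastic gradient Langevin dynamics (SGLD) on an energy derived from the classifier's logits. The toy sketch below assumes a linear head and adds a Gaussian base term to keep the energy bounded below; it is an illustration of the general technique, not BEEF's sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 16, 5                       # toy feature dim and class count
W = 0.1 * rng.normal(size=(C, D))  # stand-in for a trained module's head

def energy(x):
    # E(x) = ||x||^2 / 2 - logsumexp_y f(x)[y], with f(x) = W x.
    # The quadratic term is a Gaussian base measure keeping this toy
    # energy bounded below; the logsumexp term is the usual
    # "classifier as energy-based model" construction.
    z = W @ x
    m = z.max()
    return 0.5 * (x @ x) - (m + np.log(np.exp(z - m).sum()))

def grad_energy(x):
    # Analytic gradient: x - softmax(Wx)^T W
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    return x - W.T @ p

def sgld_sample(x0, steps=200, step_size=0.1, noise=0.01):
    """SGLD: x <- x - (step_size/2) * dE/dx + noise * N(0, I)."""
    x = x0.copy()
    for _ in range(steps):
        x = x - 0.5 * step_size * grad_energy(x) + noise * rng.normal(size=x.shape)
    return x

x0 = rng.normal(size=D)    # initialize the chain from noise
x_gen = sgld_sample(x0)    # a pseudo-sample for the p_f cluster
```

Samples drawn this way sit in low-energy regions of the model's own density, which is what makes them usable as stand-ins for "unseen" inputs when training the p_f prototype.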

2. RELATED WORK

Incremental learning. Most recent studies on incremental learning are either task-based or class-based. The crucial difference between them is whether task-id is known at the evaluation phase (Van de



Figure 1: The conceptual illustration of BEEF. Training consists of two phases: expansion and fusion. In the expansion phase, we independently train the new module for the current task, while classifying all samples from prior tasks into p_b and classifying all samples generated from the built-in energy-based model into p_f. In the fusion phase, the output of p_b is added equally to the outputs of prior modules to mitigate the task bias and form a unifying classifier.
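One possible reading of the fusion rule in the caption, sketched in code: each later module's p_b score is added equally to the class scores of all earlier modules, and the per-module heads are then concatenated into one classifier. The function below is a hypothetical illustration, not the paper's implementation:

```python
import numpy as np

def fuse(module_logits):
    """Fuse per-module heads into one unified score vector.

    module_logits: list ordered by task; module t yields scores over
    [its own classes..., p_b, p_f]. Each later module's backward-prototype
    score (index -2) is added equally to every class score of all earlier
    modules, offsetting the bias toward newer tasks.
    """
    fused = []
    for t, z in enumerate(module_logits):
        own = np.asarray(z, dtype=float)[:-2].copy()  # drop p_b, p_f slots
        for z_later in module_logits[t + 1:]:
            own += z_later[-2]                        # redistribute p_b mass
        fused.append(own)
    return np.concatenate(fused)
```

The argmax over the fused vector then serves as the unified prediction across all tasks seen so far.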

