ONLINE PLACEBOS FOR CLASS-INCREMENTAL LEARNING

Abstract

Not forgetting old class knowledge is a key challenge for class-incremental learning (CIL), where the model continuously adapts to newly arriving classes. A common technique to address this is knowledge distillation (KD), which penalizes prediction inconsistencies between the old and new models. Such predictions are made almost entirely with new class data, as old class data are extremely scarce due to the strict memory limitation in CIL. In this paper, we take a deep dive into KD losses and find that "using new class data for KD" not only hinders model adaptation (for learning new classes) but is also inefficient for preserving old class knowledge. We address this by "using placebos of old classes for KD", where the placebos are chosen from a free image stream, such as Google Images, in an automatic and economical fashion. To this end, we train an online placebo selection policy that quickly evaluates the quality of streaming images (good or bad placebos) and uses only the good ones for a one-time feed-forward computation of KD. We formulate the policy training process as an online Markov Decision Process (MDP) and introduce an online learning algorithm that solves this MDP without incurring much computational cost. In experiments, we show that our method 1) is surprisingly effective even when there is no class overlap between the placebos and the original old class data, 2) does not require any additional supervision or memory budget, and 3) significantly outperforms a number of top-performing CIL methods, in particular when using low memory budgets for old class exemplars, e.g., five exemplars per class. The code is available in the supplementary.

1. INTRODUCTION

AI learning systems are expected to learn new concepts while maintaining the ability to recognize old ones. In many practical scenarios, they cannot access past data due to limitations such as storage or data privacy, yet are expected to recognize all seen classes. A pioneering work (Rebuffi et al., 2017) formulated this problem in the class-incremental learning (CIL) pipeline: training samples of different classes are loaded into the memory phase-by-phase, and the model is repeatedly re-trained on new class data (while discarding old class data) and is evaluated on the testing data of both new and old classes. The key challenge is that re-training the model on new class data tends to override the knowledge acquired from the old classes (McCloskey & Cohen, 1989; McRae & Hetherington, 1993; Ratcliff, 1990), a problem called "catastrophic forgetting". To alleviate this problem, most CIL methods (Rebuffi et al., 2017; Hou et al., 2019; Douillard et al., 2020; Liu et al., 2020a; 2021a; Zhu et al., 2021) are equipped with knowledge distillation (KD) losses that penalize any feature and/or prediction inconsistencies between the models in adjacent phases. Ideally, KD losses should be computed on old class data, since the teacher model (i.e., the model from the last phase) was trained on them. This is, however, impossible in the CIL setting, where almost all old class data are inaccessible in the new phase. Existing methods have to use new class data as a substitute when computing KD losses. We argue that this 1) hampers the learning of new classes, as it distracts the model from fitting the ground truth labels of new classes, and 2) cannot achieve the ideal result of KD, as the model cannot generate the same soft labels (or features) on new class data as on old class data.
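The interplay between the two losses can be made concrete with a minimal sketch (plain Python; the two-class logit values are hypothetical, not taken from the paper): the KD term matches the current model's old-class outputs to the frozen last-phase model's soft labels, while the CE term simultaneously pushes those same old-class outputs toward zero on a new class sample.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    """KL divergence from the frozen old model's soft labels (teacher)
    to the current model's predictions (student) at the old-class outputs."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def ce_loss(logits, label):
    """Standard cross-entropy against the ground-truth label."""
    q = softmax(logits)
    return -math.log(q[label])

# On a new-class sample, the total loss mixes both terms: KD pulls the
# old-class outputs toward the teacher's (often confident) soft labels,
# while CE pushes the very same outputs down in favor of the new class.
teacher_old = [2.0, 0.5]      # teacher logits at old-class positions (hypothetical)
student_old = [0.2, 0.1]      # current model logits at the same positions
total = ce_loss(student_old + [1.5], label=2) + kd_loss(teacher_old, student_old)
```

This conflict disappears in the ideal case of computing KD on an old class sample, where the ground-truth label and the teacher's soft label agree on which output should be large.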
We justify this from an empirical perspective as shown in Figure 1. In Figure 1 (a), the upper bound of KD is achieved when using "Old class data"; compared to it, using "New class data" shows a clear performance drop for recognizing both old and new classes. In Figure 1 (b), we show the reason by diving into the loss computation details: when new class samples are used (as substitutes) to compute the CE and KD losses simultaneously, the two losses actually weaken each other, which does not happen in the ideal case of using old class samples. Using unlabeled external data (called placebos in this paper) has been shown to be practical and effective in solving the above issue. Compared to existing works (Lee et al., 2019; Zhang et al., 2020), this paper aims to solve two open questions. Q1: How to adapt the placebo selection process to the non-stationary CIL pipeline. The ideal selection method needs to handle the dynamics of increasing classes in CIL; e.g., in a later incremental phase, it is expected to handle a more complex evaluation over the placebos of more old classes. Q2: How to control the computational and memory-related costs during the selection and utilization of placebos.
It is not intuitive how to process external data without encroaching on the memory allocated for new class data or breaking the strict assumption of memory budget in CIL. We solve these questions by proposing a new method called PlaceboCIL that adjusts the policy of selecting placebos for each new incremental phase, in an online and automatic fashion and without needing extra memory. Specifically, to tackle Q1, we formulate PlaceboCIL as an online Markov Decision Process (MDP) and introduce a novel online learning algorithm to learn a dynamic policy. In each new phase, this policy produces a phase-specific function to evaluate the quality of incoming placebos. The policy itself gets updated before the next phase. For Q2, we propose a mini-batch-based memory reusing strategy for PlaceboCIL. Given a free data stream, we sample a batch of unlabeled data, evaluate their quality using our phase-specific evaluation function (generated by the learned policy), and keep only the high-quality placebos to compute KD losses. After this, we remove the batch entirely from memory before loading a new one. In our implementation, this batch can be very small, e.g., 200 images. We randomly remove the same amount of new class data to accommodate this batch, keeping the strict assumption of memory budget. We evaluate PlaceboCIL by incorporating it into multiple strong baselines such as PODNet (Douillard et al., 2020), LUCIR (Hou et al., 2019), AANets (Liu et al., 2021a), and FOSTER (Wang et al., 2022).
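One round of the mini-batch-based memory reusing strategy described above can be sketched as follows (names are hypothetical: `quality_fn` stands in for the phase-specific evaluation function produced by the learned policy, and `free_stream` for the free image stream):

```python
def select_placebos(free_stream, quality_fn, batch_size=200, keep_ratio=0.25):
    """One round of placebo selection (a sketch, not the paper's exact code):
    load a small batch from the free image stream, score each candidate with
    the phase-specific evaluation function, keep only the top-scoring
    placebos for KD, and drop the whole batch before the next intake."""
    batch = [next(free_stream) for _ in range(batch_size)]   # temporary intake
    ranked = sorted(batch, key=quality_fn, reverse=True)     # best placebos first
    placebos = ranked[: int(batch_size * keep_ratio)]        # keep high quality only
    del batch, ranked        # free the intake memory before loading a new batch
    return placebos
```

Because each intake is released before the next one is loaded, the peak memory overhead is bounded by a single small batch, which is what lets the method stay within the strict CIL memory budget.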






Figure 1: (a) Average accuracy when computing the KD loss on different data using iCaRL (Rebuffi et al., 2017) on CIFAR-100. The KD loss (softmax KL divergence loss) is computed on new class data (dark blue), placebos (of old class data) selected by our method (light blue), and old class data (green), i.e., the ideal case. (b) Conceptual illustrations of the loss problem when using new class data for KD. The dark blue and orange numbers denote the predictions of old and new classes, respectively. It is clear in (i) that the objectives differ when using a new class sample for KD (the oracle case is to have both "ascent"); e.g., the ground truth label for the second old class is 0, while the "KD label" at this position is 0.8. This is not an issue when using the old class sample; e.g., in (ii), its ground truth label and "KD label" have consistent magnitudes at the same position (1 and 0.7, respectively). (c) Our selected placebos for two old classes ("road" and "table") and their activation maps using GradCAM (Selvaraju et al., 2017) on CIFAR-100. The free image stream is ImageNet-1k, which has no class overlap with CIFAR-100. They are selected because parts of their image regions contain visual cues similar to the old classes.

