SOLVING CONTINUAL LEARNING VIA PROBLEM DECOMPOSITION

Abstract

This paper is concerned with class incremental learning (CIL) in continual learning (CL). CIL is a popular continual learning paradigm in which a system receives a sequence of tasks with different classes in each task and is expected to learn to predict the class of each test instance without being given any task-related information for the instance. Although many techniques have been proposed for CIL, it remains highly challenging due to the difficulty of dealing with catastrophic forgetting (CF). This paper starts from first principles and proposes a novel method to solve the problem. The definition of CIL reveals that the problem can be decomposed into two probabilities: the within-task prediction probability and the task-id prediction probability. This paper proposes an effective technique to estimate these two probabilities based on the estimation of feature distributions in the latent space using incremental PCA and the Mahalanobis distance. The proposed method does not require a memory buffer to save replay data, and it outperforms strong baselines including replay-based methods. [1]

1. INTRODUCTION

Continual learning (CL) is a learning problem where a system learns and accumulates knowledge over time without forgetting the previous knowledge (Chen & Liu, 2018). The key challenge is catastrophic forgetting (CF), a phenomenon in which the system corrupts the knowledge learned in the past when learning a new task (McCloskey & Cohen, 1989). This paper focuses on the challenging CL setting of class incremental learning (CIL) (Rebuffi et al., 2017) in the offline (or batch) mode. In this setting, the system learns a sequence of classification tasks incrementally, where each task arrives with all its training data for a set of classes. The resulting classifier can identify the class of a test instance among all the classes learned in the process with no task information provided. The other popular setting of CL is task incremental learning (TIL), which builds a separate model for each task; in testing, the test instance is provided together with the task-id that it belongs to, so that the system can use the model of that specific task to classify the instance.

Existing approaches to CIL can be grouped into several categories. Regularization (Kirkpatrick et al., 2017) or distillation (Li & Hoiem, 2016) tries not to change the parameters or knowledge that are important to old tasks when learning the new task. Replay/memory-based approaches (Rebuffi et al., 2017) save some old data and use them jointly with the new task data to learn the new task and to preserve/adjust the old knowledge. Parameter isolation approaches (Serra et al., 2018) expand the network or mask out the important parameters for old tasks (see Sec. 2 for more details). Our approach is entirely different and is derived directly from the definition of the CIL setting.

Definition: Class incremental learning (CIL) learns a sequence of tasks $1, \dots, t$, where each task $i$ has a training set $D_i = \{(x_j^i, y_j^i)\}_{j=1}^{n_i}$ with $x_j^i \in X_i$ (input space) and $y_j^i \in Y_i$ (class label space). The class labels of tasks are disjoint, $Y_i \cap Y_k = \emptyset$ for any $i \neq k$ [2]. Let $X = \cup_{i=1}^{t} X_i$ and $Y = \cup_{i=1}^{t} Y_i$. The goal is to learn a function $f: X \rightarrow Y$ to predict the class label of a test case $x$.

As tasks have disjoint classes, the CIL probability of a sample $x$ having the $j$th class label $y_j^i$ of task $i$ can be decomposed into two probabilities,

$$P(y_j^i \mid x) = P(y_j^i \mid x, i)\, P(i \mid x). \quad (1)$$

The decomposition implies that two probabilities define the CIL probability. The first probability on the right-hand side (RHS) is the within-task prediction (WTP) probability (or intra-task prediction probability), and the second probability on the RHS is the task-id prediction (TIP) probability (or inter-task prediction probability). Thus, a system makes a correct CIL prediction if it produces accurate within-task and task-id predictions. We note that the WTP probability is exactly the prediction probability in a TIL problem. However, in TIL, the task-id is given in inference or testing. Thus, to solve the CIL problem, one can learn like TIL and then design a mechanism to predict the task-id to which the test instance belongs. Some existing works have taken this approach (Rajasegaran et al., 2020; Abati et al., 2020), but they perform poorly because their task-id predictors are very weak; moreover, these papers did not propose Eq. 1. We discuss these and other related works in the related work section. In fact, the WTP probability can also be improved over that given by a TIL system.
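To make Eq. 1 concrete, the following is a minimal sketch (ours, purely illustrative; the function and variable names are not from the paper) of how per-task WTP outputs and TIP probabilities combine into one CIL prediction:

```python
import numpy as np

def cil_probabilities(wtp, tip):
    """Combine within-task (WTP) and task-id (TIP) probabilities via Eq. 1:
    P(y_j^i | x) = P(y_j^i | x, i) * P(i | x).

    wtp: list of arrays; wtp[i][j] = P(class j of task i | x, task i)
    tip: array; tip[i] = P(task i | x)
    Returns one probability vector over all classes learned so far.
    """
    return np.concatenate([p_in * p_task for p_in, p_task in zip(wtp, tip)])

# Toy example: two tasks with two classes each.
wtp = [np.array([0.9, 0.1]), np.array([0.6, 0.4])]  # within-task softmax outputs
tip = np.array([0.8, 0.2])                          # task-id probabilities
p = cil_probabilities(wtp, tip)
print(p, p.argmax())  # [0.72 0.08 0.12 0.08] 0 -> the first class of task 1
```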
This paper proposes a novel technique to estimate the two probabilities and an exemplar-free CIL system, called EWT (Estimation of WTP and TIP probabilities). EWT makes use of the highly effective hard-attention masking method HAT (Serra et al., 2018) for TIL to learn a feature extractor for each task. HAT has almost no forgetting for TIL as it masks out the parameters and neurons learned for previous tasks. This ensures that the estimated probabilities are robust and are not affected by forgetting in incremental learning. Although we could directly use the probability of each class produced by each task network as the WTP probability, this approach is sub-optimal. We propose a generative approach to improve the estimation by considering possible noisy and/or out-of-distribution samples. This is done by fine-tuning the task classifiers using generated pseudo feature representations for each class. The generation is based on Gaussian distributions in the latent feature space. The Gaussian distribution for each class is estimated incrementally using incremental Principal Component Analysis (iPCA). The TIP probability is estimated using the Mahalanobis distance. Our experiments demonstrate the effectiveness of the proposed method EWT using a pre-trained transformer network that has no information leak, i.e., it is trained using the ImageNet data with all classes similar to the classes in the experiment datasets removed. Both our system and the baselines fix the transformer and train adapter modules inserted at each transformer layer (Houlsby et al., 2019). The experimental results show that EWT outperforms recent state-of-the-art baselines by large margins, including replay-based approaches.

2. RELATED WORK

Numerous techniques have been proposed for CL. We consider the five most relevant categories: exemplar-free, replay-based, generative, network-expansion, and parameter-isolation methods.

Exemplar-free methods (saving no previous task data) often use regularization (Kirkpatrick et al., 2017; Zhu et al., 2021), knowledge distillation (Li & Hoiem, 2016), or orthogonal projection (Zeng et al., 2019) to preserve previously important parameters (Zenke et al., 2017; Wang et al., 2022). Our method is also exemplar-free, and our CF prevention is based on task masking (Serra et al., 2018).

Replay-based CL has been widely studied in CIL. Different saving mechanisms (Rebuffi et al., 2017; Liu et al., 2020b; Bang et al., 2022), replay strategies (Aljundi et al., 2019), and regularizations (Lopez-Paz & Ranzato, 2017; Castro et al., 2018; Chaudhry et al., 2018; Buzzega et al., 2020; Chaudhry et al., 2021) have been used. The goal of these methods is to balance plasticity and stability using the saved samples of previous tasks (Liu et al., 2021; Yan et al., 2021). Our method does not save any samples, and it also performs much better than recent replay-based methods.

Generative methods (Shin et al., 2017; Ostapenko et al., 2019; Ayub & Wagner, 2021) build generators to produce pseudo-replay data to prevent forgetting. Lesort et al. (2018) studied the difficulties of the generative approach. We do not generate pseudo-replay samples similar to the raw data and thus do not have its problems. Our method generates feature vectors rather than raw data. Liu et al. (2020a) and Zhu et al. (2021) also generate feature vectors, but they use the generated features for distilling knowledge of previous tasks. In contrast, we estimate the distributions of features to fine-tune the classifier and to compute the task-id probability rather than for knowledge distillation.

Network expansion methods (Rusu et al., 2016; Mehta et al., 2021; Yan et al., 2021) expand the network to preserve old parameters. IBP-WF (Mehta et al., 2021) uses global weight factors for knowledge sharing and a Bayesian non-parametric approach for network expansion. It first finds the task-id in testing using Gaussian distributions and then uses the task-id to select the correct model for prediction. Our method is different, as we do not expand the network or find the task-id, but directly compute the CIL probability. DER (Yan et al., 2021) expands the network and also does pruning to reduce the network size. Our method does not expand the network.

Another popular branch of continual learning is parameter isolation, which trains a set of task-specific parameters.
These methods are mostly designed for task incremental learning (TIL) as they require the task-id of each test instance to choose the correct task-specific parameters. We leverage hard attention masking (HAT) (Serra et al., 2018) to prevent forgetting. However, our method is for CIL, unlike the original HAT. Although there are attempts (von Oswald et al., 2020; Rajasegaran et al., 2020; Abati et al., 2020; Henning et al., 2021) to use parameter isolation methods for the CIL problem, they do not tackle CIL by probabilistic problem decomposition as we do. These methods are much weaker than ours (see Sec. 4).

Using a pre-trained model (e.g., BERT, GPT-3, ViT, DeiT, or CLIP) has been a standard practice for CL in natural language processing (Ke et al., 2021). For image data, Ostapenko et al. (2022) studied using pre-trained models as foundation models for CL. SLDA (Hayes & Kanan, 2020) fixes the pre-trained feature extractor and fine-tunes the classifier. L2P (Wang et al., 2022) trains a prompt pool with a fixed feature extractor, and Wu et al. (2022) fine-tunes replicate layers of a pre-trained model. Our method EWT also leverages a pre-trained feature extractor, but we use adapters in the fixed feature extractor and train only the adapters to learn new knowledge. Using the same pre-trained model, our method outperforms SLDA and L2P by a large margin (see Sec. 4).

Different CIL settings have been studied as well. The blurry task (or task-free) setting is studied in online CIL (Buzzega et al., 2020; Bang et al., 2022), where task boundaries are not clear as tasks change gradually. Our method is an offline method, where tasks are disjoint. As noted in footnote 2, we split tasks by unseen classes rather than by datasets. We leave training with additional samples of previous tasks for our future work in the online CL setting.

3. PROPOSED METHOD

An overview of the training process of the proposed method is illustrated in Fig. 1. Learning a new task $i$ involves two steps. Step 1 focuses on training the feature extractor. Specifically, it trains the task network $g_i \circ f_i$ using both the training data $X_i$ of task $i$ and the pseudo feature vectors $Z$ generated from the Gaussian distribution $N(\mu_c(f_i(X_i)), \Sigma_c(f_i(X_i)))$ of feature vectors for each class $c$ in the task (see Fig. 1(a)). The Gaussian distribution is dynamically and incrementally estimated using incremental Principal Component Analysis (iPCA) during training (see Sec. 3.1.1). Since the network is jointly trained with the training data and the generated features, which also depend on the values of the feature extractor $f_i$, this encourages the feature vectors to follow the distribution. This step has little forgetting as the training is done based on the hard-attention mechanism of Serra et al. (2018), which protects/masks the parameters learned for previous tasks (see Sec. 3.1.2). Note that although the $f_i$'s are task-specific, they are all learned in the same network with a large amount of parameter sharing. Step 2 computes the two probabilities in Eq. 1 based on the trained feature extractors $f_i$ and the Gaussian distribution of each class.

3.1. STEP 1: TASK TRAINING

We first discuss the detailed training process for learning a task $i$ in Step 1, which performs two functions: (i) estimating the distribution of feature vectors for each class in the task using incremental Principal Component Analysis (iPCA), and (ii) training the feature extractor with the hard attention masking of Serra et al. (2018) to prevent interference or CF when learning task $i$ using the training data $X_i$ and the feature vectors $Z$ generated on the fly from the incrementally estimated distributions. As explained above, we use $Z$ in the Step 1 training because we want to produce better distributions, which will be used in Step 2. We train the network for task $i$ by minimizing the loss

$$\mathcal{L}_{ce} = -\frac{1}{2|B|}\left[\sum_{(x,y)\in B}\log p(y|x,i) + \sum_{(z,y)\in Z}\log p(y|z,i)\right], \quad (2)$$

where the first $p$ on the right is the softmax output of $g_i(f_i(x))$, the second $p$ is the softmax output of $g_i(z)$, $B$ is a batch of training data, and $Z$ is a batch of pseudo feature vectors generated from the Gaussian distributions $\{N(\mu_c, \Sigma_c)\}$ of the classes $c$ in task $i$. The following sub-sections describe how to estimate the distributions of feature vectors and how to train the feature extractor without CF.
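A minimal PyTorch-style sketch of this loss follows (ours; it assumes the task network modules `f_i` and `g_i` are given and that each class Gaussian is stored as a mean plus a low-rank covariance factor, e.g., $U\Lambda$ from the iPCA estimate of Sec. 3.1.1; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def step1_loss(f_i, g_i, x, y, class_gaussians):
    """Sketch of Eq. 2: cross-entropy on a real batch plus an (approximately)
    equal-sized batch of pseudo feature vectors drawn from the per-class
    Gaussians. class_gaussians: {class label c: (mu_c, cov_factor_c)} with
    Sigma_c ~= cov_factor_c @ cov_factor_c.T (e.g., U * Lam from iPCA)."""
    loss_real = F.cross_entropy(g_i(f_i(x)), y, reduction="sum")

    # Draw z ~ N(mu_c, Sigma_c) with a low-rank reparameterization.
    zs, ys = [], []
    n_per_class = max(1, x.shape[0] // len(class_gaussians))
    for c, (mu, cov_factor) in class_gaussians.items():
        eps = torch.randn(n_per_class, cov_factor.shape[1])
        zs.append(mu + eps @ cov_factor.T)
        ys.append(torch.full((n_per_class,), c, dtype=torch.long))
    loss_pseudo = F.cross_entropy(g_i(torch.cat(zs)), torch.cat(ys),
                                  reduction="sum")

    return (loss_real + loss_pseudo) / (2 * x.shape[0])  # the 1/(2|B|) factor
```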

3.1.1. INCREMENTAL ESTIMATION OF GAUSSIAN DISTRIBUTIONS OF FEATURE VECTORS

We use a multivariate Gaussian distribution to approximate the feature distribution of each class. The challenges in estimating feature distributions in CL are: (i) the statistics $(\mu_c, \Sigma_c)$ need to be updated incrementally at each batch on the evolving feature extractor, as using the whole dataset to recompute the statistics after training is computationally demanding, and (ii) saving the statistics of the Gaussian distributions is expensive as the feature vectors have a high dimension $d$. We therefore take ideas from the algorithms developed for incremental Principal Component Analysis (iPCA) to approximate the covariance. In iPCA, only a few ($k \ll d$) principal vectors are saved and updated dynamically at each batch without using any previous data.

Since the following discussion is about estimating the distribution of feature vectors of a single class, we drop the class indicator $c$ for notational simplicity; likewise, we drop the task index $i$. Suppose we have seen $n$ training samples of a class so far. Denote the samples by $X$ and their feature vectors by $f(X) = Z = [z_1, \cdots, z_n]$ from the feature extractor $f$ while minimizing Eq. 2. Denote the sample mean by $\mu = \sum_j z_j / n$. Suppose the system receives a new batch $X_{new}$ of $m$ instances. We obtain the feature vectors $Z_{new} = [z_{n+1}, \cdots, z_{n+m}]$ and update the mean by

$$\bar{\mu} = (n\mu + m\mu_{new})/(n+m), \quad (3)$$

where $\mu_{new}$ is the sample mean of the new batch. Denote the singular value decomposition (SVD) of the centered features by $U\Lambda V^T \overset{svd}{=} [Z - \mu]$, where $T$ is the transpose symbol. We approximate the covariance with the $k$ leading eigenvectors and eigenvalues as

$$(n-1)\Sigma \approx U_k \Lambda_k^2 U_k^T. \quad (4)$$

For simplicity, we denote the reduced matrices $U_k$ by $U$ and $\Lambda_k$ by $\Lambda$ as the following discussion is based on the reduced matrices. Based on Ross et al. (2008), the SVD can be updated for a new set of data $Z_{new}$ as

$$\tilde{U}\tilde{\Lambda}\tilde{V}^T \overset{svd}{=} [\sqrt{n-1}\,U\Lambda \quad K] \quad (5)$$

given the block matrix $K = [Z_{new} - \mu_{new} \quad \sqrt{nm/(n+m)}\,(\mu_{new} - \mu)]$, and we obtain

$$(n+m-1)\Sigma \approx \tilde{U}\tilde{\Lambda}^2\tilde{U}^T. \quad (6)$$

The derivation for Eq. 5 is given in Appendix D. The statistics for the Gaussian distribution of class $c$ are thus estimated dynamically with $k$ ($\ll d$) eigenpairs, and pseudo feature vectors can be drawn from the estimated distribution to train the classifier.
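A minimal NumPy sketch of this per-class update (ours, following Ross et al. (2008), under the invariant that $U\,\text{diag}(\Lambda^2)\,U^T$ approximates the class covariance $\Sigma$; all names are illustrative):

```python
import numpy as np

def ipca_update(mu, U, lam, n, Z_new, k):
    """One incremental PCA update for a class (after Ross et al., 2008).
    Invariant: U @ diag(lam**2) @ U.T approximates the class covariance with
    the k leading eigenpairs. Start with mu=zeros(d), U=None, lam=None, n=0.
    Z_new: (m, d) batch of new feature vectors."""
    m = Z_new.shape[0]
    mu_new = Z_new.mean(axis=0)
    # Block matrix K (Eq. 5): centered new data plus a mean-correction column.
    corr = np.sqrt(n * m / (n + m)) * (mu_new - mu)
    K = np.hstack([(Z_new - mu_new).T, corr[:, None]])      # (d, m+1)
    # M @ M.T accumulates the unnormalized covariance (n+m-1) * Sigma.
    M = K if U is None else np.hstack([np.sqrt(n - 1) * U * lam, K])
    U_t, s, _ = np.linalg.svd(M, full_matrices=False)
    mu = (n * mu + m * mu_new) / (n + m)                    # Eq. 3
    return mu, U_t[:, :k], s[:k] / np.sqrt(n + m - 1), n + m

# Toy stream of feature batches for one class.
d, k = 384, 10
batches = np.random.default_rng(0).normal(size=(5, 32, d))
mu, U, lam, n = np.zeros(d), None, None, 0
for Z_batch in batches:
    mu, U, lam, n = ipca_update(mu, U, lam, n, Z_batch, k)
```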

3.1.2. HARD ATTENTION MASKING

In training the network $g_i \circ f_i$ using the data of task $i$ and the generated pseudo feature vectors, we employ the hard attention mask (Serra et al., 2018) to prevent forgetting in the feature extractor. The hard attention mask $a_l^i$ is a trainable pseudo-binary 0-1 vector at each layer $l$ for task $i$. It is element-wise multiplied with the output of the layer as $a_l^i \otimes h_l$ and blocks (for value 0) or unblocks (for value 1) the information flow between neurons of adjacent layers. Neurons with value 1 are important for the task and thus need to be protected, while neurons with value 0 are not necessary for the task and can be freely modified without affecting other tasks. More specifically, we modify the gradients of the parameters that are important for the previous tasks $(1, \cdots, i-1)$ during the training of task $i$ so that the important parameters of previous tasks are unaffected. The gradient of parameter $w_{kj,l}$ at the $k$th row and $j$th column of layer $l$ is modified as

$$\nabla w'_{kj,l} = \left(1 - \min\left(a^{<i}_{k,l},\, a^{<i}_{j,l-1}\right)\right)\nabla w_{kj,l}, \quad (7)$$

where $a^{<i}_{k,l}$ is the accumulated attention over previous tasks and is 1 if the hard attention of neuron $k$ at layer $l$ has ever been used by any previous task $< i$ (see Serra et al. (2018) for details). To encourage parameter sharing and sparsity in the number of activated masks, a regularization is introduced as

$$\mathcal{L}_r = \frac{\sum_{l,k} a^i_{k,l}\,(1 - a^{<i}_{k,l})}{\sum_{l,k} (1 - a^{<i}_{k,l})}. \quad (8)$$

The final objective to train the task network without forgetting is $\mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_r$, where $\mathcal{L}_{ce}$ is the cross-entropy loss in Eq. 2.
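A toy sketch of the gradient modification in Eq. 7 (ours; the real HAT also anneals the mask temperature and learns per-task mask embeddings, which we omit here):

```python
import torch

def masked_gradients(grads, a_prev_out, a_prev_in):
    """Eq. 7 sketch: block the update of a weight when BOTH its output neuron
    (attention a_prev_out[k]) and input neuron (a_prev_in[j]) were important
    for some previous task, leaving other weights free to change.

    grads: (out_dim, in_dim) gradient of one layer's weight matrix
    a_prev_out, a_prev_in: accumulated hard attentions in [0, 1]"""
    mask = 1.0 - torch.minimum(a_prev_out[:, None], a_prev_in[None, :])
    return grads * mask

# Toy usage: neuron 0 of both layers was used by a previous task, so the
# weight connecting them keeps a zero gradient.
g = torch.ones(2, 3)
a_out = torch.tensor([1.0, 0.0])
a_in = torch.tensor([1.0, 0.0, 0.0])
print(masked_gradients(g, a_out, a_in))
# tensor([[0., 1., 1.],
#         [1., 1., 1.]])
```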

3.2. STEP 2: COMPUTING THE TWO PROBABILITIES IN EQ. 1

We now discuss how to compute the within-task prediction (WTP) probability and the task-id prediction (TIP) probability.

3.2.1. COMPUTING THE WTP PROBABILITY

We could use the softmax probability of each class in a task as the within-task prediction (WTP) probability for the class. However, this method is not the best for computing the probability (see the experiment section) because samples that are outliers, noisy, or otherwise hard to classify are unlikely to get accurate probabilities, which also affects the probabilities of the samples that are easy to classify. We propose to consider possible out-of-distribution (OOD) samples in each task. However, we do not have OOD data for each task to use in learning the task. Since we have already computed the distributions of feature representations for each class in Step 1, for each task we could use the data generated from the distributions of the other tasks as the OOD data for the task. Although we could treat the generated data of previous tasks as OOD data when training a new task, we cannot use the generated data of a later task to update the model of an earlier task because we no longer have the data of the earlier task; even if we could use the generated data of the earlier task, updating its feature extractor would cause serious forgetting because the feature extractors of different tasks share many parameters under the hard attentions (Serra et al., 2018).

We propose a simpler method. We build and fine-tune (see below) a separate linear classifier (with one input layer and one output layer) covering the classes of all tasks learned so far, using the feature vectors generated for each class of each task from the feature distribution estimated in Step 1. The advantage of this approach is that, with a single combined model/classifier, we can consider the OOD data of all tasks because, for the classes of a task, the classes of all other tasks serve as its OOD data. After training each task $i$ in Step 1, we have the set of distributions $\{\{N(\mu_c, \Sigma_c)\}_c\}_i$ of the features of each class $c$ of each task $i \leq t$. We then fine-tune a combined classifier $g$, created by joining the parameters of each task's classifier $g_i$, using pseudo feature vectors $Z$ generated from the distributions. This is illustrated in Fig. 1(b). Note that in Step 1, each task network $g_i \circ f_i$ is trained independently without considering the other task networks. In this step, we consider the outputs of all the tasks together and fine-tune the combined classifier $g$ by minimizing the cross-entropy loss

$$\mathcal{L}_{ce} = -\frac{1}{|Z|}\sum_{(z,y)\in Z}\log p(y|z),$$

where the probability is computed using the softmax over $[g_1(z); \cdots; g_t(z)]$. The WTP probability of class $c$ (which is the class label $y_j^i$ in Eq. 1) is

$$P(c|x, i) = \text{softmax}(g_i(f_i(x))). \quad (9)$$
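A minimal PyTorch sketch of this Step 2 fine-tuning (ours; it assumes the per-task heads are `nn.Linear` modules and the saved Gaussians are given as a mean plus a low-rank factor; the 35 epochs and SGD settings mirror Sec. 4, everything else is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune_combined_classifier(heads, gaussians, d, epochs=35, n_per_class=64):
    """Step 2 sketch: join the per-task heads into one linear classifier over
    all classes seen so far, then fine-tune it only on pseudo feature vectors
    drawn from the saved class Gaussians (no raw data needed).
    heads: list of nn.Linear, one per task; gaussians: {global class index:
    (mu, cov_factor)} with Sigma ~= cov_factor @ cov_factor.T."""
    g = nn.Linear(d, sum(h.out_features for h in heads))
    with torch.no_grad():  # initialize from the trained task heads
        g.weight.copy_(torch.cat([h.weight for h in heads]))
        g.bias.copy_(torch.cat([h.bias for h in heads]))
    opt = torch.optim.SGD(g.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs):
        zs, ys = [], []
        for c, (mu, cov_factor) in gaussians.items():
            eps = torch.randn(n_per_class, cov_factor.shape[1])
            zs.append(mu + eps @ cov_factor.T)
            ys.append(torch.full((n_per_class,), c, dtype=torch.long))
        loss = F.cross_entropy(g(torch.cat(zs)), torch.cat(ys))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return g
```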

3.2.2. COMPUTING THE TIP PROBABILITY

We now compute the task-id prediction (TIP) probability for a given test sample $x$. We make use of the distance between the feature vector $f_i(x)$ of $x$ and a distribution $N(\mu_c, \Sigma_c)$ of features estimated from the training data. This has been used as an effective measure for OOD detection (Lee et al., 2018). We define the covariance of the distribution of task $i$ as $\Sigma^i = \sum_{c \in C_i} \Sigma_c / |C_i|$, where $C_i$ is the set of classes of task $i$ and $\Sigma_c$ is the covariance matrix computed by the method in Sec. 3.1.1 with all the principal components. We discard the class covariances after this computation to save memory. Given the set of distributions $\{N(\mu_c, \Sigma^i)\}_{c \in C_i}$ of task $i$ and a test instance $x$, we define the following score of the feature $f_i(x)$:

$$s_i(x) = 1 / \max_c \{MD(f_i(x); \mu_c, \Sigma^i)\}, \quad (10)$$

where $MD$ is the Mahalanobis distance of the sample $x$ to the distribution $N(\mu_c, \Sigma^i)$. The larger the distance, the further away the sample is from the distributions of task $i$. Finally, the TIP probability for task $i$ is defined as

$$P(i|x) = s_i(x) / \sum_k s_k(x). \quad (11)$$

Eq. 11 is justified because a sample that is closer to a distribution is more likely to belong to the distribution.
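A NumPy sketch of this computation (ours; `task_stats`, holding the class means and the inverse of the shared task covariance, is an assumed precomputed structure):

```python
import numpy as np

def tip_probabilities(x_feats, task_stats):
    """Sketch of Eqs. 10-11: score each task by the inverse of the largest
    Mahalanobis distance from f_i(x) to that task's class means, then
    normalize the scores into task-id probabilities.

    x_feats: x_feats[i] is f_i(x), the feature of x under task i's extractor
    task_stats: per task, (means, Sigma_inv) with means of shape (|C_i|, d)
    and Sigma_inv the inverse of the shared task covariance."""
    scores = []
    for z, (means, Sigma_inv) in zip(x_feats, task_stats):
        diffs = means - z                                     # (|C_i|, d)
        md = np.sqrt(np.einsum("cd,de,ce->c", diffs, Sigma_inv, diffs))
        scores.append(1.0 / md.max())                         # Eq. 10
    scores = np.array(scores)
    return scores / scores.sum()                              # Eq. 11
```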

4. EXPERIMENT

Baselines. We compare the proposed EWT with 11 baselines, among which five are exemplar-free (i.e., saving no previous task data) methods and six are replay-based methods. The exemplar-free methods are: HAT (Serra et al., 2018), OWM (Zeng et al., 2019), SLDA (Hayes & Kanan, 2020), PASS (Zhu et al., 2021), and L2P (Wang et al., 2022). For the multi-head method HAT, we make predictions by taking the arg max over the concatenated logits from each task network, as it works the best among all the considered prediction methods (refer to Appendix C for details). The replay methods are: iCaRL (Rebuffi et al., 2017), A-GEM (Chaudhry et al., 2018), EEIL (Castro et al., 2018), DER++ (Buzzega et al., 2020), HAL (Chaudhry et al., 2021), and DER without pruning (Yan et al., 2021). We could not run (Wu et al., 2022) as no code was released. We also do not include the existing parameter isolation methods that deal with the CIL problem as they are very weak [3].

Datasets. We use four popular continual learning benchmark datasets. (1) CIFAR10 (Krizhevsky & Hinton, 2009). This is an image classification dataset consisting of 60,000 color images of size 32x32, among which 50,000 are training data and 10,000 are testing data. It has 10 different classes. (2) CIFAR100 (Krizhevsky & Hinton, 2009). This dataset consists of 50,000 training images and 10,000 testing images with 100 classes. Each image is colored and of size 32x32. (3) Tiny-ImageNet (Le & Yang, 2015). This classification dataset has 200 classes with 500 training images of size 64x64 per class. The validation data has 50 samples per class. Since no label is provided for the test data, we use the validation set for testing as in (Zhu et al., 2021). (4) ImageNet380. We randomly selected 380 classes from the 389 classes that remain after removing the classes similar to those in CIFAR and Tiny-ImageNet from the original 1,000 classes of the full ImageNet data (Russakovsky et al., 2015) used for pre-training (see below). This dataset has about 1,300 color images per class. As with Tiny-ImageNet, we use the validation set (50 images per class) for testing as the original test data has no labels.

Backbone Architecture. We use the transformer backbone DeiT-S/16 (Touvron et al., 2021). We pre-train the network using 611 classes of ImageNet after removing the 389 classes that are similar or identical to the classes of CIFAR and Tiny-ImageNet. To leverage the strong performance of the pre-trained model while adapting to new knowledge, we fix the feature extractor and append trainable adapter modules (fully-connected networks with one hidden layer) at each transformer layer (Houlsby et al., 2019), except for SLDA and L2P [4]. The number of neurons in each hidden layer is 64 for CIFAR10 and 128 for the other datasets. Note that all baselines and our method use the same architecture and the same pre-trained model for fairness, as using a pre-trained model improves performance (Ostapenko et al., 2022) (e.g., DER improves from 65.2 to 73.3 on 10 tasks of CIFAR100 with pre-training on the same transformer architecture). Note that we do not use pre-trained models like CLIP (Radford et al., 2021) or others trained on the full ImageNet data due to information leak, both in terms of features and class labels, because our experiment data have been used in training those models. This leakage can seriously affect the results.
For example, the L2P system using the pre-trained model trained on the full ImageNet data performs extremely well, but after the overlapping classes are removed in pre-training, its performance drops greatly. In Table 1, we can see that it is in fact quite weak.

Training Details. For saving eigenpairs, we follow the existing memory budget strategy of the replay-based methods (Chaudhry et al., 2019) for fairness. We fix the total number of eigenpairs saved in the CL process. After learning a new task, the system discards the q eigenpairs with the smallest eigenvalues from each class of the previous tasks to accommodate k eigenpairs for each newly learned class. This strategy maintains k eigenvectors and the corresponding eigenvalues per class within the budget. Denote the budget size by |M|. For CIFAR10, we split the 10 classes into 5 tasks with 2 classes per task. The size of the hidden layer of the adapter module is 64 and the number of eigenpairs is 10 per class. We refer to this experiment as C10-5T. The memory budget size |M| for eigenpairs is 100. For CIFAR100, we conduct two experiments. We split the 100 classes into 10 and 20 tasks, where each task has 10 classes and 5 classes, respectively. We refer to these experiments as C100-10T and C100-20T. We choose |M| = 1,000 for both experiments. For Tiny-ImageNet, we conduct two experiments. We split the 200 classes into 5 tasks with 40 classes per task and 10 tasks with 20 classes per task. We refer to these experiments as T-5T and T-10T, respectively. We save 2,000 eigenpairs in total for both experiments. For ImageNet380, we split the classes into 10 tasks with 38 classes per task and save 7,600 eigenpairs in total. We refer to this experiment as I380-10T.

For all the experiments of our system, we find a good set of learning rates and numbers of epochs via validation data made of 10% of the training data. We train our model for 15 epochs with SGD, using a batch size of 128 and momentum 0.9 for Step 1. For CIFAR10 and CIFAR100, we use learning rates of 0.05 and 0.01, respectively. For Tiny-ImageNet and ImageNet, we use a learning rate of 0.005. We train the classifier in Step 2 for 35 epochs with SGD using the same batch size and learning rate as Step 1. Following the random class order protocol of existing methods (Rebuffi et al., 2017; Yan et al., 2021), we randomly generate 5 different class orders for each experiment and report the average accuracy over the 5 random orders. For the replay-based baselines, we follow Rebuffi et al. (2017). These systems use a memory buffer of size 200 for CIFAR10, 2,000 for CIFAR100 and Tiny-ImageNet, and 7,600 for ImageNet380, and save a set of raw training samples according to the saving strategy of the respective original papers [5]. For the other baselines, we follow the experiment setups reported in their official papers.

[Table 1: Average classification accuracy after the final task. '-XT' means X number of tasks. Our system EWT and all baselines used the pre-trained network. The last column shows the average of each method over all datasets and experiments. We highlight the best results in each column in bold.]

Evaluation Metrics. We use two metrics: average classification accuracy (ACA) and average forgetting rate. The ACA after the last task $t$ is $A^t = \sum_{i=1}^{t} A_i^t / t$, where $A_i^t$ is the accuracy of the model on the $i$th task's data after learning task $t$. The average forgetting rate after task $t$ is $F^t = \frac{1}{t-1}\sum_{i=1}^{t-1} \left(A_i^i - A_i^t\right)$. This is also referred to as backward transfer in the literature (Lopez-Paz & Ranzato, 2017). We report the incremental classification accuracy (ICA) and the ACA at each task in Appendix E.
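For clarity, a small sketch (ours) of both metrics computed from the matrix of per-task accuracies; the 1/(t-1) normalization in the forgetting rate follows our reconstruction above:

```python
import numpy as np

def aca_and_forgetting(A):
    """A[t][i] = accuracy on task i's test set after learning task t
    (lower-triangular (T, T) matrix). Returns the final ACA and the
    average forgetting rate."""
    A = np.asarray(A, dtype=float)
    T = A.shape[0]
    aca = A[T - 1, :].mean()                      # A^t = sum_i A_i^t / t
    forgetting = np.mean([A[i, i] - A[T - 1, i] for i in range(T - 1)])
    return aca, forgetting

# Toy example with 3 tasks:
A = [[90,  0,  0],
     [85, 88,  0],
     [80, 84, 86]]
print(aca_and_forgetting(A))  # (83.33..., 7.0)
```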

4.1. RESULTS AND COMPARISON

Average Classification Accuracy. Tab. 1 shows the average classification accuracy after the final task. The last column (Average) indicates the average performance of each method over the 6 experiments. Our proposed method EWT performs the best on average: it achieves 72.98% while the best baseline (DER) achieves 69.67%. The performance gap is even larger when we compare with the non-replay methods. The best exemplar-free baseline is HAT, which achieves 68.21% on average, much lower than our method. The baselines SLDA and L2P were proposed to leverage a strong pre-trained feature extractor in their original papers. SLDA freezes the feature extractor and only fine-tunes the classifier. It performs well on the simple experiment C10-5T but is significantly poorer than our EWT on the other experiments. This is because the fixed feature extractor does not adapt to new knowledge. Our method updates the feature extractor via adapter modules and is thus able to learn more complex problems. L2P trains a set of prompt embeddings. In the original paper, it uses a feature extractor pre-trained on ImageNet-21k, which already includes the classes of the continual learning evaluation datasets. When we remove the classes similar to those in the datasets used in CL, its performance drops dramatically (58.38% on average over the 6 experiments) and is much poorer than that of our method EWT (72.98% on average).

Average Forgetting Rate (Backward Transfer). We compare the forgetting rate of each system after learning the last task in Fig. 2. The forgetting rates of the proposed method EWT are 5.26, 8.75, and 8.85 on C10-5T, C100-10T, and C100-20T, respectively. iCaRL forgets less than ours on C10-5T and C100-20T, achieving 4.95 and 8.31, respectively. However, iCaRL is not able to adapt to new knowledge effectively, as its accuracies are much lower than those of EWT on the same experiments: the average accuracy over the 6 experiments of EWT is 72.98 while that of iCaRL is only 65.47. According to the forgetting rates, the best baseline (DER) adapts to new knowledge well, but it is not able to retain the knowledge as effectively as our method. Its forgetting rates are 13.36, 15.92, and 15.48 on C10-5T, C100-10T, and C100-20T, respectively, much larger than ours. This results in the lower average performance of DER compared with EWT.

4.2. ANALYSIS AND ABLATION

Performance of Different Variants. Tab. 2 shows the performance gain from adding each proposed technique. The methods in the first row (S1) and second row (S1 + S2) only produce the WTP probability without the TIP probability since TIP is not computed. Thus, they cannot decompose the CIL probability as EWT does. Instead, we make a CIL prediction by taking the arg max over the concatenated logits from each task classifier $g_i$, which is better than the other considered prediction methods (refer to Appendix C). From Tab. 2, fine-tuning the classifier with the generated feature vectors in Step 2 already improves the performance over Step 1, as shown in the second row (S1 + S2). On C10-5T, C100-10T, and C100-20T, S2 improves more than 3% over S1. When the proposed task-id prediction (TIP) is introduced, the performance also improves, as shown in the third row (S1 + TIP). In fact, this is slightly better than S1 + S2 without TIP, which indicates the effectiveness of the proposed problem decomposition for CIL. Combining all the proposed techniques delivers the best performance, as shown in the last row, which is the full EWT.

Performance by the Number of Eigenpairs. Steps 1 and 2 are based on generating pseudo feature vectors from Gaussian distributions. To limit memory consumption, we approximate the covariance by incremental PCA and save only |M| eigenvectors with the corresponding eigenvalues. This is equivalent to saving |M|/C eigenpairs per class for a dataset of C classes when learning the last task. Tab. 3 shows the model performance on C100-10T with different |M| sizes. With a single eigenvector per class (i.e., |M| = 100), the model already achieves 73.41% accuracy. The performance increases with the size of |M| but drops at |M| = 2,000 because the less informative eigenpairs generate noisy feature vectors.

5. CONCLUSION

This paper proposed an effective method to solve class incremental learning (CIL) from first principles. Based on the definition of CIL, it first decomposes the CIL prediction probability into two probabilities: the within-task prediction (WTP) probability and the task-id prediction (TIP) probability. Novel methods are designed to estimate these probabilities, based respectively on an incremental PCA-based generative approach that fine-tunes the multi-head task classifiers in a single-head manner, and on the Mahalanobis distance. Experimental results show that the proposed EWT outperforms existing strong baselines by a large margin.

A PSEUDO-CODE

We provide the pseudo-code for training and testing. Our comments start with the symbol "//". The update steps marked with (*) are reconstructed from the surrounding text.

Algorithm 1: Training (Step 1)
1: for the training data D_i of each task i do
2:   for each batch (X_j, Y_j) ⊂ D_i, until convergence, do
3:     Obtain the features Z_j = f_i(X_j) and the outputs g_i(Z_j)
       // Compute the mean and eigenpairs for the features of each class in the task
4:     for the features Z_c ⊂ Z_j of each class c do
5:       Compute µ_c using Eq. 3 and the eigenpairs (U_c, Λ_c) using Eq. 5
6:     end for
       // Train the network with the generated features and remember the distributions
7:     Generate pseudo features Z from the distributions of the current task and obtain g_i(Z)
8:     Update g_i ∘ f_i by minimizing L = L_ce + L_r (Sec. 3.1.2) (*)
9:   end for
10: end for

Algorithm 2: Fine-tuning (Step 2). After training task t, fine-tune the classifier.
1: Construct g = [g_1, ..., g_t] by concatenating the parameters of each task classifier g_i
   // Fine-tuning starts
2: for until convergence do
3:   Generate pseudo features Z from the saved distributions of all classes seen so far (*)
4:   Update g by minimizing the cross-entropy loss on Z (Sec. 3.2.1) (*)
5: end for

Algorithm 3: Testing. Input: a test instance x, the task networks [f_1, ..., f_t], and the classifiers [g_1, ..., g_t] after learning task t.
   // Obtain the CIL probabilities of the classes corresponding to each task
1: for each task i ≤ t do
2:   Obtain WTP using Eq. 9 and TIP using Eq. 11
3:   Obtain the CIL probability p(Y_i|x) = p(Y_i|x, i) p(i|x), where Y_i is the set of class labels of task i
4: end for
   // Concatenate the probabilities for the full CIL probability and make a prediction
5: ŷ = arg max [⊕_i p(Y_i|x)], where ⊕ is concatenation

B REQUIRED MEMORY

We report the network sizes of the systems after learning the last task. We use an 'entry' to denote a parameter or a value required for learning and inference of a task. All the systems except SLDA and L2P use the feature extractor DeiT-S/16 (Touvron et al., 2021) and adapter modules. The transformer consumes 21.6 million (M) entries and the adapters take 1.2M and 2.4M entries for CIFAR10 and the other datasets, respectively. SLDA fine-tunes only the classifier on top of the fixed pre-trained feature extractor as it does not have a protection mechanism. L2P uses a prompt pool with 23k entries. Since each method requires method-specific elements (e.g., task embeddings for HAT), the number of entries required by each method is different. The number of entries for each model is reported in Tab. 4. Our method saves the mean and eigenpairs to approximate the distribution of features for each class in order to draw pseudo feature vectors, while the replay-based methods save raw inputs to replay jointly with the current task data. As the number of eigenpairs and the number of saved inputs affect the performance, we use a budget M of size |M| that each method can save. Since the feature dimension is 384, the total entries required for saving the means and eigenpairs in our method are 42.2k, 422.4k, 844.8k, and 3.1M for CIFAR10, CIFAR100, Tiny-ImageNet, and ImageNet380, respectively.
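These totals are consistent with counting d + k·d entries per class (one mean vector plus k eigenvectors; the k eigenvalues per class are negligible and appear to be excluded from the reported figures, which is our reading). A quick check:

```python
def eigenpair_entries(n_classes, k, d=384):
    """Entries for one mean vector plus k eigenvectors per class (the k
    eigenvalues per class are negligible; our reading is that they are
    excluded from the reported totals)."""
    return n_classes * (d + k * d)

print(eigenpair_entries(10, 10))    # CIFAR10:        42,240  (~42.2k)
print(eigenpair_entries(100, 10))   # CIFAR100:      422,400  (~422.4k)
print(eigenpair_entries(200, 10))   # Tiny-ImageNet: 844,800  (~844.8k)
print(eigenpair_entries(380, 20))   # ImageNet380: 3,064,320  (~3.1M)
```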

C DIFFERENT PREDICTION METHODS

As HAT is designed for task incremental learning and does not provide a task-id prediction mechanism, unlike other parameter isolation methods such as HyperNet (von Oswald et al., 2020), we tried different CIL prediction methods. The values reported in Tab. 1 are the results of the best one among the considered methods. We considered three methods: 1) $\arg\max p(Y_i|x, i)$, where the task-id $i$ is chosen based on the entropy values of each task network, as in HyperNet; 2) $\arg\max [p(Y_1|x, 1); \cdots; p(Y_t|x, t)]$, where the within-task prediction (WTP) probability is obtained by taking the softmax over the logits $g_i(f_i(x))$ of task $i$ (this is equivalent to using an equal task-id prediction (TIP) probability for all tasks); 3) $\arg\max [g_1(f_1(x)); \cdots; g_t(f_t(x))]$. Tab. 5 shows the results of each prediction method. The entropy-based prediction performs the worst. The reason is that the entropy value of each task network is not as informative as the other quantities since the network is not trained with an entropy objective. The softmax-based and logit-based predictions are not different on average over the 6 experiments. Since the logit-based performance is the best, we choose it as the CIL prediction method for HAT. S1 and S1+S2 in Tab. 2 of the main paper also do not have a CIL prediction mechanism, so we tried the same three prediction methods as for HAT. The results are in Tab. 5. We observe similar behaviors for S1 and S1+S2 as for HAT.
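Sketches (ours, illustrative only) of the three prediction rules, each taking the per-task softmax outputs or logits:

```python
import numpy as np

def entropy_based(task_probs):
    """Method 1: pick the task whose softmax output has the lowest entropy,
    then predict within that task (as in HyperNet)."""
    ents = [-(p * np.log(p + 1e-12)).sum() for p in task_probs]
    i = int(np.argmin(ents))
    return i, int(task_probs[i].argmax())

def softmax_concat(task_probs):
    """Method 2: arg max over concatenated per-task softmax outputs
    (equivalent to a uniform task-id prior)."""
    return int(np.concatenate(task_probs).argmax())

def logit_concat(task_logits):
    """Method 3: arg max over concatenated raw logits (the rule used for
    HAT in Tab. 1)."""
    return int(np.concatenate(task_logits).argmax())

probs = [np.array([0.7, 0.3]), np.array([0.5, 0.5])]
logits = [np.array([2.0, 0.1]), np.array([1.0, 0.9])]
print(entropy_based(probs), softmax_concat(probs), logit_concat(logits))
# (0, 0) 0 0
```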

D ADDITIONAL DERIVATION DETAILS

We have claimed that the orthonormal matrix $\tilde{U}$ and the diagonal matrix $\tilde{\Lambda}^2$ obtained from Eq. 5 in the main text are the eigenvectors and eigenvalues of the unnormalized sample covariance $(n+m-1)\Sigma$. Denote the previous sample mean by $\mu$ and the eigenpairs of the previous covariance $\Sigma_{old}$ by $(U, \Lambda^2)$. Following Ross et al. (2008), we provide more details for this claim. Since $K = [Z_{new} - \mu_{new} \quad \sqrt{nm/(n+m)}\,(\mu_{new} - \mu)]$ and $U\Lambda^2 U^T = \Sigma_{old}$, we have

$$\tilde{U}\tilde{\Lambda}^2\tilde{U}^T = \tilde{U}\tilde{\Lambda}\tilde{V}^T \left[\tilde{U}\tilde{\Lambda}\tilde{V}^T\right]^T \quad (12)$$
$$= [\sqrt{n-1}\,U\Lambda \quad K][\sqrt{n-1}\,U\Lambda \quad K]^T \quad (13)$$
$$= (n-1)U\Lambda^2 U^T + KK^T \quad (14)$$
$$= (n-1)\Sigma_{old} + (m-1)\Sigma_{new} + \frac{nm}{n+m}(\mu_{new} - \mu)(\mu_{new} - \mu)^T \quad (15)$$
$$= (n+m-1)\Sigma, \quad (16)$$

where the last derivation from Eq. 15 to Eq. 16 is by Lemma 1 of Ross et al. (2008).
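A small NumPy check (ours) that this identity holds exactly when all eigenpairs are kept:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 200, 50, 8
Z_old = rng.normal(size=(n, d))
Z_new = rng.normal(loc=0.3, size=(m, d))
mu, mu_new = Z_old.mean(0), Z_new.mean(0)

# Eigenpairs (U, Lam^2) of the old covariance: U @ diag(Lam^2) @ U.T = Sigma_old.
evals, U = np.linalg.eigh(np.cov(Z_old, rowvar=False))
Lam = np.sqrt(np.clip(evals, 0, None))

# Left side: M @ M.T with M = [sqrt(n-1) U Lam, K] as in Eqs. 13-14.
K = np.hstack([(Z_new - mu_new).T,
               np.sqrt(n * m / (n + m)) * (mu_new - mu)[:, None]])
M = np.hstack([np.sqrt(n - 1) * U * Lam, K])

# Right side: the unnormalized covariance of all n + m samples (Eq. 16).
right = (n + m - 1) * np.cov(np.vstack([Z_old, Z_new]), rowvar=False)
print(np.allclose(M @ M.T, right))  # True
```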

E INCREMENTAL CLASSIFICATION ACCURACY

In the main paper, we reported the average classification accuracy (ACA) after learning the last task. In this section, we also report the incremental classification accuracy (ICA) over the learning process. The ICA after task $t$ is defined as $\bar{A}^t = \frac{1}{t}\sum_{i=1}^{t} A^i$, where $A^i$ is the ACA after learning task $i$. Tab. 6 shows the ICA of our method EWT and the baselines. On the more challenging datasets (e.g., Tiny-ImageNet and ImageNet), our system outperforms the baselines. SLDA is slightly better than our method on C10-5T, and DER is slightly better than EWT on C100-10T and C100-20T. However, their performances are not consistent over the different experiments. The average performance of our method over the 6 experiments is 80.94 while that of the best-performing baseline DER is 79.56. Fig. 3 shows the ACA at each task.



Footnotes:

[1] The code is included in the Supplementary Material.

[2] In (Bang et al., 2021), tasks are considered to have shared classes. For instance, the system receives two datasets D_1 and D_2 consisting of classes {y1, y2} and {y1, y3, y4}, respectively. We define tasks 1 and 2 as consisting of {y1, y2} and {y3, y4}, respectively, and consider the samples of the shared label y1 as additional training data for task 1. This work does not consider this learning scenario. We leave it for our future work.

[3] HyperNet (von Oswald et al., 2020) and PR (Henning et al., 2021) find the task-id via an entropy function, and SupSup (Wortsman et al., 2020) finds it via a gradient update. They then make a within-task prediction. SupSup, PR, and iTAML (Rajasegaran et al., 2020) assume that the test instances come in batches and all samples in a batch belong to one task. When tested per sample, HyperNet, SupSup, PR, and iTAML achieve 22.4, 11.8, 45.2, and 33.5 on 10 tasks of CIFAR100, respectively, which is much lower than the 51.4 of the baseline iCaRL. CCG (Abati et al., 2020) and IBP-WF (Mehta et al., 2021) do not provide code.

[4] For SLDA and L2P, we follow the original papers. SLDA fine-tunes only the classifier with a fixed feature extractor, and L2P trains learnable prompts.

[5] Note that we save eigenvectors, where each vector has dimension 384, whereas the replay-based methods save raw inputs. For a memory of size |M| and a dataset with C classes, our method and the replay methods save k = |M|/C eigenpairs and raw inputs per class, respectively, after the last task. Thus, for C100-10T, EWT takes 384k elements for the eigenpairs while the replay methods consume 6.1M elements. Refer to Appendix B for details.



Figure 1: Overview of the training process for task i. The dashed lines indicate that the gradient flow is blocked, while the solid lines indicate that gradients are computed along the lines. (a) The network is trained in a multi-head manner (one head per task), in which task-specific parameters are trained based on hard attention and used to effectively eliminate the interference between tasks. During training, the system dynamically estimates the distributions N(µ_c(f_i(x_i)), Σ_c(f_i(x_i))) of the feature vectors f_i(x_i). The estimated distributions are used to generate pseudo feature vectors, and the network is jointly trained with the pseudo feature vectors and the training data of task i. Since the network g_i ∘ f_i is trained to minimize the loss on both the training data and the pseudo feature vectors, this process encourages the feature vectors of f_i to follow the desired distribution. (b) Given the distributions, the system fine-tunes a classifier g created by joining the multi-head classifiers g_k, k ≤ i. The fine-tuning is done using pseudo feature vectors generated from the distributions.

Figure 2: Average forgetting rate. The lower the rate, the better the method is.



Figure 3: Average classification accuracy after each task. The x-axis indicates the number of learned classes after each task. The systems with dashed lines are exemplar-free methods.

Table 2: The average classification accuracy of different variants of the proposed technique. The variant S1 indicates the model after Step 1. The variant S1 + S2 indicates the model after Step 2, and XX + TIP indicates the model with the TIP of Sec. 3.2.2 applied at prediction.

Table 3: The accuracy and the number of saved eigenpairs on C100-10T. |M| = m indicates that a total of m eigenvectors are saved with their corresponding eigenvalues.


Table 4: The size of the model (in entries) required for each method without the memory buffer.

Table 5: The methods 1), 2), and 3) indicate entropy-based prediction, softmax-based prediction, and logit-based prediction, respectively, as described in Appendix C. The last column (Average) shows the average value over the 6 experiments.

Table 6: Incremental classification accuracy. The last column shows the average of the accuracies of each method over all the experiments. We highlight the best results in each column in bold.

Method   C10-5T       C100-10T     C100-20T     T-5T         T-10T        I380-10T     Average
A-GEM    68.19±3.24   43.83±0.69   35.97±1.15   49.26±0.64   39.58±3.32   50.16±6.63   47.83
EEIL     90.50±0.72   81.10±0.37   79.54±0.69   66.63±0.40   66.54±0.61   75.08±1.07   76.57
DER++    89.01±6.29   80.64±2.74   81.72±1.76   66.55±3.73   67.14±1.40   77.41±0.37   77.08
HAL      87.00±7.27   77.42±2.73   77.85±1.71   65.31±3.68   64.48±1.45   75.87±0.40   74.65
DER      92.83±1.10   82.89±0.45   82.79±0.76   70.32±0.57   70.21±0.86   78.30±0.67   79.56
EWT      93.20±1.84   82.57±0.69   80.52±0.85   74.27±0.70   73.87±1.00   81.24±1.65   80.94

