SOLVING CONTINUAL LEARNING VIA PROBLEM DECOMPOSITION

Abstract

This paper is concerned with class incremental learning (CIL) in continual learning (CL). CIL is a popular continual learning paradigm in which a system receives a sequence of tasks with different classes in each task and is expected to predict the class of each test instance without being given any task-related information for the instance. Although many techniques have been proposed to solve CIL, it remains highly challenging due to the difficulty of dealing with catastrophic forgetting (CF). This paper starts from first principles and proposes a novel method to solve the problem. The definition of CIL reveals that the problem can be decomposed into two probabilities: the within-task prediction probability and the task-id prediction probability. This paper proposes an effective technique to estimate these two probabilities based on the estimation of feature distributions in the latent space using incremental PCA and the Mahalanobis distance. The proposed method does not require a memory buffer to save replay data, and it outperforms strong baselines including replay-based methods.

1. INTRODUCTION

Continual learning (CL) is a learning paradigm in which a system learns and accumulates knowledge over time without forgetting previously acquired knowledge (Chen & Liu, 2018). The key challenge is catastrophic forgetting (CF), the phenomenon in which learning a new task corrupts the knowledge learned for earlier tasks (McCloskey & Cohen, 1989). This paper focuses on the challenging CL setting of class incremental learning (CIL) (Rebuffi et al., 2017) in the offline (or batch) mode. In this setting, the system learns a sequence of classification tasks incrementally, where each task arrives with all the training data for its set of classes. The resulting classifier can identify the class of a test instance among all classes learned so far, with no task information provided. The other popular CL setting is task incremental learning (TIL), which builds a separate model for each task; at test time, the test instance and the task-id of the task it belongs to are both provided, so that the system can use the model of that specific task to classify the instance.

Existing approaches to CIL can be grouped into several categories. Regularization (Kirkpatrick et al., 2017) and distillation (Li & Hoiem, 2016) methods try not to change the parameters or knowledge important to old tasks when learning a new task. Replay/memory-based approaches (Rebuffi et al., 2017) save some old data and use them jointly with the new task data to learn the new task and to preserve/adjust the old knowledge. Parameter isolation approaches (Serra et al., 2018) expand the network or mask out the parameters important to old tasks (see Sec. 2 for more details). Our approach is entirely different and is derived directly from the definition of the CIL setting.
Definition: Class incremental learning (CIL) learns a sequence of tasks $1, \dots, t$, where each task $i$ has a training set $D^i = \{(x_j^i, y_j^i)\}_{j=1}^{n_i}$ with $x_j^i \in X^i$ (input space) and $y_j^i \in Y^i$ (class label space). The class label spaces of different tasks are disjoint, i.e., $Y^i \cap Y^k = \emptyset$ for any $i \neq k$. Let $X = \cup_{i=1}^t X^i$ and $Y = \cup_{i=1}^t Y^i$. The goal is to learn a function $f: X \to Y$ to predict the class label of a test instance $x$.

Since tasks have disjoint classes, the CIL probability of a sample $x$ having the $j$th class label $y_j^i$ of task $i$ can be decomposed into two probabilities:

$$P(y_j^i \,|\, x) = P(y_j^i \,|\, x, i)\, P(i \,|\, x). \quad (1)$$

The decomposition implies that two probabilities define the CIL probability. The first probability on the right-hand side (RHS) is the within-task prediction (WTP) probability (or intra-task prediction probability), and the second probability on the RHS is the task-id prediction (TIP) probability (or inter-task prediction probability). Thus, a system makes a correct CIL prediction if it produces accurate within-task and task-id predictions. We note that the WTP probability is exactly the prediction probability in a TIL problem; however, in TIL, the task-id is given at inference or testing time. Thus, to solve the CIL problem, one can learn as in TIL and then design a mechanism to predict the task-id of the test instance. Some existing works have taken this approach (Rajasegaran et al., 2020; Abati et al., 2020), but they perform poorly because their task-id predictors are very weak; moreover, these papers did not propose Eq. 1. We discuss these and other related works in the related work section. In fact, the WTP probability can also be improved over that given by a TIL system. This paper proposes a novel technique to estimate the two probabilities and an exemplar-free CIL system, called EWT (Estimation of WTP and TIP probabilities).
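The decomposition can be illustrated with a small numerical sketch (the probability values below are made up for illustration): each task's WTP softmax output is scaled by that task's TIP probability, yielding a single valid distribution over all classes learned so far.

```python
import numpy as np

# Hypothetical within-task prediction (WTP) probabilities P(y_j^i | x, i):
# one softmax distribution per task, over that task's own classes.
wtp = [
    np.array([0.7, 0.3]),   # task 1: its two classes
    np.array([0.1, 0.9]),   # task 2: its two classes
]

# Hypothetical task-id prediction (TIP) probabilities P(i | x).
tip = np.array([0.8, 0.2])

# CIL probability of each class: P(y_j^i | x) = P(y_j^i | x, i) * P(i | x).
cil = np.concatenate([p * tip[i] for i, p in enumerate(wtp)])
print(cil)                # [0.56 0.24 0.02 0.18]  (sums to 1)
print(int(cil.argmax()))  # 0 -> predicted class is the first class of task 1
```

Note that even though task 2 is very confident within-task (0.9), the low task-id probability (0.2) keeps its classes from dominating the final prediction.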
EWT makes use of the highly effective hard-attention masking method HAT (Serra et al., 2018) for TIL to learn a feature extractor for each task. HAT exhibits almost no forgetting in TIL, as it masks out the parameters and neurons learned for previous tasks. This ensures that the estimated probabilities are robust and not affected by forgetting during incremental learning. Although we could directly use the probability of each class produced by each task model as the WTP probability, this approach is sub-optimal. We propose a generative approach that improves the estimation by accounting for possible noisy and/or out-of-distribution samples. This is done by fine-tuning the task classifiers using pseudo feature representations generated for each class. The generation is based on Gaussian distributions in the latent feature space, where the Gaussian distribution of each class is estimated incrementally using incremental Principal Component Analysis (iPCA). The TIP probability is estimated using the Mahalanobis distance. Our experiments demonstrate the effectiveness of the proposed method EWT using a pre-trained transformer network that has no information leakage, i.e., it is trained on ImageNet data from which all classes similar to those in the experiment datasets have been removed. Both our system and the baselines fix the transformer and train adapter modules inserted at each transformer layer (Houlsby et al., 2019). The experimental results show that EWT outperforms recent state-of-the-art baselines, including replay-based approaches, by large margins.
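As a rough sketch of how a Mahalanobis-distance-based TIP probability might be computed, the code below maintains a per-class Gaussian estimate in the latent feature space and converts per-task distances into task-id probabilities. This is illustrative only: it uses a running (Welford-style) mean/covariance update in place of the paper's incremental PCA, and all names (`ClassGaussian`, `task_id_probs`, etc.) are ours, not the authors'.

```python
import numpy as np

class ClassGaussian:
    """Incrementally estimates a Gaussian (mean, covariance) over the
    feature vectors of one class."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.M2 = np.zeros((dim, dim))  # accumulated outer-product deviations

    def update(self, feats):
        # Welford-style update: no need to store past feature vectors.
        for x in feats:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.M2 += np.outer(delta, x - self.mean)

    def cov(self):
        return self.M2 / max(self.n - 1, 1)

def mahalanobis(x, g):
    """Mahalanobis distance of feature x to the class Gaussian g."""
    inv = np.linalg.inv(g.cov() + 1e-6 * np.eye(len(x)))  # ridge for stability
    d = x - g.mean
    return float(np.sqrt(d @ inv @ d))

def task_id_probs(x, tasks):
    """Turn each task's minimum class distance into a softmax-style
    task-id probability P(i | x); smaller distance -> higher probability."""
    scores = np.array([-min(mahalanobis(x, g) for g in gs) for gs in tasks])
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy usage: two tasks, one class each, in a 2-d latent space.
rng = np.random.default_rng(0)
g1, g2 = ClassGaussian(2), ClassGaussian(2)
g1.update(rng.normal(loc=[0, 0], scale=1.0, size=(200, 2)))
g2.update(rng.normal(loc=[5, 5], scale=1.0, size=(200, 2)))
p = task_id_probs(np.array([0.1, -0.2]), [[g1], [g2]])
print(p.argmax())  # 0 -> the test feature is assigned to task 1
```

Pseudo feature representations for classifier fine-tuning, as described above, could similarly be drawn from each estimated class Gaussian, e.g. via `rng.multivariate_normal(g.mean, g.cov())`.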

2. RELATED WORK

Numerous techniques have been proposed for CL. We consider the five most relevant categories: exemplar-free, replay-based, generative, network-expansion, and parameter-isolation methods.

Exemplar-free methods (which save no previous task data) often use regularization (Kirkpatrick et al., 2017; Zhu et al., 2021), knowledge distillation (Li & Hoiem, 2016), or orthogonal projection (Zeng et al., 2019) to preserve previously important parameters (Zenke et al., 2017; Wang et al., 2022). Our method is also exemplar-free, and our CF prevention is based on task masking (Serra et al., 2018).

Replay-based CL has been widely studied in CIL. Different saving mechanisms (Rebuffi et al., 2017; Liu et al., 2020b; Bang et al., 2022), replay strategies (Aljundi et al., 2019), and regularizations (Lopez-Paz & Ranzato, 2017; Castro et al., 2018; Chaudhry et al., 2018; Buzzega et al., 2020; Chaudhry et al., 2021) have been used. The goal of these methods is to balance plasticity and stability using the saved samples of previous tasks (Liu et al., 2021; Yan et al., 2021). Our method does not save any samples, and it also performs much better than recent replay-based methods.

Generative methods (Shin et al., 2017; Ostapenko et al., 2019; Ayub & Wagner, 2021) build generators to produce pseudo-replay data to prevent forgetting. Lesort et al. (2018) studied the difficulties of the generative approach. We do not generate pseudo-replay samples resembling the raw data and thus avoid these difficulties; our method generates feature vectors rather than raw data. Liu et al. (2020a) and Zhu et al. (2021) also generate feature vectors, but they use the generated features for distilling knowledge of previous tasks. In contrast, we estimate the distributions of features to fine-tune the classifier and to compute the task-id probability rather than for knowledge distillation.



Footnotes:
(1) The code is included in the Supplementary Material.
(2) In (Bang et al., 2021), tasks are allowed to share classes. For instance, the system may receive two datasets $D^1$ and $D^2$ consisting of classes {y1, y2} and {y1, y3, y4}, respectively. One would then define tasks 1 and 2 as consisting of {y1, y2} and {y3, y4}, respectively, and treat the samples of the shared label y1 as additional training data for task 1. This work does not consider this learning scenario; we leave it for future work.

