MAKE MEMORY BUFFER STRONGER IN CONTINUAL LEARNING: A CONTINUOUS NEURAL TRANSFORMATION APPROACH

Abstract

Continual learning (CL) focuses on learning from a non-stationary data distribution without forgetting previous knowledge. However, the most widely used memory-replay approach often suffers from memory overfitting. To mitigate memory overfitting, we propose a continuous and reversible memory transformation method so that the memory data is hard to overfit, thus improving generalization. The transformation is achieved by optimizing a bi-level optimization objective that jointly learns the CL model and memory transformer. Specifically, we propose a deterministic continuous memory transformer (DCMT) modeled by an ordinary differential equation, allowing for infinite memory transformation and generating diverse and hard memory data. Furthermore, we inject uncertainty into the transformation function and propose a stochastic continuous memory transformer (SCMT) modeled by a stochastic differential equation, which substantially enhances the diversity of the transformed memory buffer. The proposed neural transformation approaches have significant advantages over existing ones: (1) we can obtain infinitely many transformed data, thus significantly increasing the memory buffer diversity; (2) the proposed continuous transformations are reversible, i.e., the original raw memory data can be restored from the transformed memory data without the need to make a replica of the memory data. Extensive experiments on both task-aware and task-free CL show significant improvement with our approach compared to strong baselines.

1. INTRODUCTION

Continual learning (CL) aims to learn non-stationary data distribution without forgetting previous knowledge. Depending on whether there are explicit task definitions (partitions) during training, CL can be categorized into task-aware and task-free CL. For task-aware CL, there are explicit tasks and class splits during training; according to whether the task identities are known or not during testing, it can be further categorized into task/domain/class-incremental CL (van de Ven & Tolias, 2019) . For task-free CL (Aljundi et al., 2019b) , there is no explicit task definition, and data distribution shift could happen at any time. Memory replay is an effective way to mitigate forgetting and has been widely used in CL. One major problem of the memory-based methods is that the effectiveness of memory buffer data could gradually decay during training (Delange et al., 2021; Jin et al., 2021) , i.e., the CL model may overfit the limited memory data and could not generalize well to the previous tasks. Recently, gradient-based memory editing (GMED) (Jin et al., 2021) has been proposed to mitigate memory overfitting by editing memory data with hard examples in a way similar to adversarial data augmentation (ADA) (Madry et al., 2018) . Specifically, it creates hard examples that increase model losses at each gradient step but restricts to a few (less than three) discrete gradient-based editing steps. With more editing steps, similar to ADA, GMED would make the memory even harder but cause less data diversity since the adversarial force will drive the feature space of different classes overlap and cluster together (Madry et al., 2018; Wang et al., 2021) . However, as studied in previous work (Gontijo-Lopes et al., 2021) , improving the diversity of training data is crucial to improving model generalization. An illustration of this phenomenon is shown in Figure 1 (b) and (c). 
This naturally leads to a new problem: how can we increase the diversity of the edited memory data and maintain its hardness at the same time? To address this problem, we present a continuous, expressive, and flexible memory transformation method to obtain a diverse set of memory data and make the memory buffer harder to memorize at the same time. We first model the gradual and continuous memory transformation as a deterministic neural ordinary differential equation in the time interval [0, T ], named Deterministic Continuous Memory Transformer (DCMT). There are several advantages compared to existing methods. First, we can obtain infinite time steps of transformed memory data for any t ∈ [0, T ] and thus significantly improve the diversity in the transformed memory data. Second, we do not need to make a replica of the raw memory data since the transformation process is reversible. We can restore original raw memory data from the transformed memory data. As shown in Figure 1 (d), DCMT diversifies memory data while maintaining hardness. The proposed DCMT considers a single transformation function. However, there are infinite possible transformation functions for transforming the memory data, and it is beneficial to model the uncertainty in the transformation function to further avoid overfitting (Lu et al.; Liu et al., 2019) . To model the underlying various transformation functions and further improve the data diversity, we thus generalize the methods in a probabilistic manner to model the memory transformation as a stochastic process with neural stochastic differential equations, named Stochastic Continuous Memory Transformer (SCMT). This enables us to model infinite transformation functions and significantly improves the diversity of the transformed data with some increased computation cost compared to DCMT. The overview of the proposed methods is presented in Figure 2 . Figure 2 : Overview of the proposed approach for memory transformation. 
DCMT and SCMT continuously and gradually transform the memory data to be diverse and hard to memorize. Note that the transformed data could be obtained at any continuous time step, thus providing significantly larger diversity.

We propose a bi-level optimization to jointly learn the memory transformer and CL model. The memory transformer can generate diversified memory data that is hard to memorize. Concretely, after transforming over the continuous interval [0, T], we maximize the loss increase between the data before (t = 0) and after (t = T) the transformation to ensure the hardness of the transformed data. To ensure that the network embeds the original raw memory data and the transformed memory data similarly, we adopt a Jensen-Shannon divergence consistency loss that regularizes the network output toward smoother responses on both. Furthermore, our proposed method is general, versatile, and can be seamlessly applied to both task-aware and task-free CL. Extensive experiments on both task-aware and task-free CL demonstrate the effectiveness of the proposed methods. We summarize our contributions as three-fold:

• We propose a bi-level optimization framework with a continuous memory transformer to address memory overfitting issues. The continuous memory transformer can make the transformed memory data substantially more diverse and harder to memorize.
• We instantiate the continuous memory transformation with deterministic and stochastic memory transformation, which can be seamlessly applied to both task-aware and task-free CL.
• We perform extensive experiments on both task-aware and task-free CL, showing significant improvements compared to strong baselines. Furthermore, we provide detailed ablation studies.

2.1. TASK-AWARE CL

Problem setup. Task-aware CL focuses on the case where there are explicit task definitions during CL. Task/domain/class-incremental learning (van de Ven & Tolias, 2019) are the three most representative CL scenarios. We consider the problem of learning a sequence of tasks denoted as $D^{tr} = \{D^{tr}_1, D^{tr}_2, \cdots, D^{tr}_N\}$, where $N$ is the number of training tasks. The $k$-th task training data $D^{tr}_k$ consists of a set of triplets $\{(x^k_i, y^k_i, T_k)\}_{i=1}^{n_k}$, where $x^k_i$ is the $i$-th data example in the task, $y^k_i$ is the corresponding data label, and $T_k$ is the task identifier. The goal is to learn a model $f_\theta$ on the training task sequence $D^{tr}$ so that it performs well on the test set of all the learned tasks $D^{te} = \{D^{te}_1, D^{te}_2, \cdots, D^{te}_N\}$ without forgetting previously learned knowledge.

Existing work. The proposed approaches for task-aware CL can be categorized into: 1) maintaining a memory buffer that stores previous examples for future replay (Lopez-Paz & Ranzato, 2017; Shin et al., 2017; Chaudhry et al., 2019a; Riemer et al., 2019; Chaudhry et al., 2019b; Aljundi et al., 2019a; PourKeshavarzi et al., 2022; Arani et al., 2022); 2) using dynamic network architectures (Rusu et al., 2016; Fernando et al., 2017; Yoon et al., 2018; Qin et al., 2021; Miao et al., 2022) and remembering past knowledge by dynamically updated architectures; 3) enforcing regularization to slow down forgetting (Kirkpatrick et al., 2017; Zenke et al., 2017b; von Oswald et al., 2020; Liu & Liu, 2022; Raghavan & Balaprakash, 2021); and 4) modeling the parameter update uncertainty with Bayesian methods (Nguyen et al., 2018; Ebrahimi et al., 2020; Henning et al., 2021). In this paper, we focus on memory-replay-based methods since they often achieve SOTA performance.

2.2. TASK-FREE CL

Problem setup. Task-free CL (He et al., 2019; Zeno et al., 2019; Aljundi et al., 2019b; Chrysakis & Moens, 2020; Lee et al., 2020) is a recent generalization of CL to more complex cases, where data distribution shift could happen at any time during CL without explicit definition of tasks.

Existing work. Most existing works in task-free CL are memory-replay-based methods (Chaudhry et al., 2019b; a). Our work shares a similar goal with GMED (Jin et al., 2021), which edits the memory buffer based on ADA, making the memory data harder but less diverse (Madry et al., 2018; Wang et al., 2021). There are several significant differences. First, GMED is a gradient-based memory editing method with discrete steps, whereas our memory transformer can transform memory data at infinitely many continuous time steps, which improves the memory diversity significantly. Second, GMED overwrites the memory data with the edited ones, so that after many epochs of editing the memory buffer data distribution deviates significantly from the original raw memory data distribution, especially in task-aware CL, which would decrease the performance. In contrast, our transformation process is reversible, i.e., the original raw data can be recovered from the transformed data. Thus, we do not need to overwrite the memory buffer data or keep an additional mini-batch of transformed data. We provide more detailed discussions of related work in Appendix C due to space limitations.

3. METHODOLOGY

In this section, we first present standard memory replay of CL in Section 3.1, our proposed deterministic continuous memory transformer (DCMT) in Section 3.3 and stochastic continuous memory transformer (SCMT) in Section 3.4. Then, we present the training objectives in Section 3.5. The overall description of the proposed method is shown in Figure 2 .

3.1. CONVENTIONAL MEMORY REPLAY

Standard memory replay for CL (Chaudhry et al., 2019b) optimizes a risk over data from both the memory buffer $M$ and the current mini-batch. Formally, the optimization can be formulated as:

$$\min_{\forall \theta \in \Theta} \; L(\theta, x_k, y_k) + \mathbb{E}_{(x,y) \sim M}\, L(\theta, x, y), \tag{1}$$

where $k$ is the CL timestamp (the $k$-th CL step), $\theta$ are the model parameters, and $L(\theta, x, y)$ is the loss function associated with the data $(x, y)$. With conventional memory replay, the memory buffer data gradually becomes less effective for mitigating forgetting when training for a long time, as it is easy to overfit the limited memory buffer data (Delange et al., 2021; Jin et al., 2021). Thus, the previously learned knowledge would get lost, and the CL model may not generalize well to the previous tasks. We thus propose a continuous memory transformation method to generate diversified memory buffer data that is hard to memorize in the following sections.

3.2. A PRELIMINARY APPROACH TO INCREASE MEMORY DIVERSITY

A preliminary way to increase the memory diversity is to transform the memory data with a neural network function $g$ parameterized by $\phi$. A sequence of transformations can be applied to the original raw memory data $x^m(t)$ by:

$$x^m(t + (i+1)\Delta_t) = x^m(t + i\Delta_t) + g(x^m(t + i\Delta_t), t, \phi)\,\Delta_t, \quad i = 0, 1, \cdots, n \tag{2}$$

where $\Delta_t$ is the step size. By repeating this discrete transformation process, we can obtain a diverse collection of transformed memory data. We name this preliminary method Discrete Transformation (DT). However, when updating the function $g$ by backpropagation, we need to store all the intermediate transformations, i.e., $\{x^m(t + i\Delta_t),\ i = 0, \cdots, n-1\}$. Thus, the memory cost scales linearly with the number of memory transformation steps, i.e., $O(n)$, which becomes expensive when the memory data is transformed over a large number of steps. Furthermore, the neural network function $g$ is generally not invertible, so DT also needs to store both the raw memory data and the transformed data.

In the following, we address both issues simultaneously by viewing the memory transformation process as a continuous dynamic system. This brings several benefits: (1) we do not need to store any intermediate transformation results, i.e., the memory cost is constant, $O(1)$; (2) we do not need to store both the raw memory data and the transformed data, since the entire transformation is invertible even if the function $g$ is not invertible (more elaboration on this is provided in Appendix B.13), whereas DT must store both because $g$ is generally not invertible; (3) our method yields an infinite amount of transformed memory data, in contrast to the finitely many discrete steps of DT.
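The discrete update of Section 3.2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the rate function `g` is a hypothetical toy stand-in for the neural network, the names `discrete_transform` and `trajectory` are ours, and the sketch evaluates `g` at the current step time. Keeping every intermediate state, as backpropagation through the steps would require, makes the O(n) storage cost visible.

```python
import math

def g(x, t, phi):
    """Hypothetical toy transformation rate (stand-in for the network g)."""
    return [math.tanh(phi * xi) * math.cos(t) for xi in x]

def discrete_transform(x0, t0, dt, n, phi):
    """Apply n steps x <- x + g(x, t, phi) * dt, keeping every intermediate state."""
    trajectory = [list(x0)]
    x, t = list(x0), t0
    for _ in range(n):
        rate = g(x, t, phi)
        x = [xi + ri * dt for xi, ri in zip(x, rate)]
        t += dt
        trajectory.append(x)
    return trajectory

# Five steps store six states: storage grows linearly with the number of steps.
trajectory = discrete_transform([0.5, -1.0, 2.0], t0=0.0, dt=0.01, n=5, phi=0.8)
```

Each entry of `trajectory` is one transformed version of the same memory example, so the list doubles as the "diverse collection" that DT produces.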

3.3. MEMORY TRANSFORMER AS A DETERMINISTIC CONTINUOUS DYNAMIC SYSTEM

In this section, we first view the memory transformation as a deterministic dynamic system. We transform the raw memory data through a continuous system over the time interval $[0, T]$. Suppose that at each CL timestamp $k$ we sample a mini-batch of data $(x^m, y^m)$ from the memory buffer $M$. Since we perform similar continuous mini-batch memory transformation operations at each CL timestamp $k$, we omit $k$ for presentation clarity. We model the gradual and continuous memory data transformation by the following differential equation:

$$\frac{dx^m(t)}{dt} = g(x^m(t), t, \phi), \quad x^m(0) = x^m \tag{3}$$

where the memory transformer is parametrized by the function $g$ with parameters $\phi$, and $g$ represents the instantaneous transformation rate of the memory data. By integrating both sides of Eq. (3) over the time interval $[0, T]$, we obtain the solution to Eq. (3) for the transformed memory data at time $T$:

$$x^m(T) = x^m(0) + \int_0^T g(x^m(t), t, \phi)\,dt \tag{4}$$

where the transformed memory data $x^m(T)$ is a continuous function of $T$. For any $t \in [0, T]$, $x^m(t)$ is a transformation of the original memory data; thus we can obtain an infinite set of transformed memory data, i.e., $\{x^m(t) : t \in [0, T]\}$. This is in contrast to GMED (Jin et al., 2021), which works with a small number (less than three) of discrete-time steps of memory editing. With more editing steps, similar to ADA, the examples edited by GMED become harder but less diverse; thus, performance drops (Wang et al., 2021). Therefore, this design principle restricts the expressiveness and effectiveness of GMED. In contrast, our memory transformation is significantly more expressive and provides substantially more diverse transformed memory data than GMED. We name our method Deterministic Continuous Memory Transformer (DCMT). In practice, we can use a numerical integration scheme, such as the Runge-Kutta method (Schober et al., 2014), similar to the implementation in (Chen et al., 2018), to solve Eq. (4).
We provide the algorithm details in Algorithm 3 in Appendix B.12. The above transformation process is reversible because we can obtain the raw memory data by the following reverse integration:

$$x^m(0) = x^m(T) + \int_T^0 g(x^m(t), t, \phi)\,dt. \tag{5}$$

Eq. (5) transforms $x^m(T)$ back into the raw memory data $x^m(0)$ by integrating over the reverse time interval $[T, 0]$. We can thus discard the raw memory data and only keep the transformed memory data. After the replay, we can invert the transformed memory data into the original data. The transformation function $g$ does not need to be reversible; it only needs to be uniformly Lipschitz continuous in $x^m(t)$ and continuous in $t$ (this condition ensures that the solution to Eq. (3) exists and is unique), thus providing great flexibility for the transformation function design (more elaboration on this is provided in Appendix B.13). Thus, our method has no extra memory cost to store additional raw memory data.
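The reversibility argument can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration rather than the paper's Algorithm 3: `g` is a hypothetical smooth rate function (Lipschitz in x, continuous in t, and not itself invertible), and `rk4_solve` is a plain fixed-step classical Runge-Kutta integrator written for this example. Integrating forward over [0, T] and then backward over [T, 0] with the same `g` recovers the raw data up to numerical error.

```python
import math

def g(x, t, phi):
    """Hypothetical smooth transformation rate; sin is not an invertible map."""
    return [phi * math.sin(xi) + 0.1 * t for xi in x]

def rk4_solve(x, t_start, t_end, phi, steps=100):
    """Integrate dx/dt = g(x, t, phi) from t_start to t_end with classical RK4."""
    h = (t_end - t_start) / steps  # a negative h integrates backwards in time
    t = t_start
    x = list(x)
    for _ in range(steps):
        k1 = g(x, t, phi)
        k2 = g([xi + 0.5 * h * k for xi, k in zip(x, k1)], t + 0.5 * h, phi)
        k3 = g([xi + 0.5 * h * k for xi, k in zip(x, k2)], t + 0.5 * h, phi)
        k4 = g([xi + h * k for xi, k in zip(x, k3)], t + h, phi)
        x = [xi + h / 6.0 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
        t += h
    return x

T = 0.05
x_raw = [0.5, -1.0, 2.0]
x_T = rk4_solve(x_raw, 0.0, T, phi=0.8)        # forward pass, as in Eq. (4)
x_restored = rk4_solve(x_T, T, 0.0, phi=0.8)   # reverse pass, as in Eq. (5)
max_error = max(abs(a - b) for a, b in zip(x_raw, x_restored))
```

Only `x_T` needs to be kept between the two calls, which is the sense in which no replica of the raw memory data is required.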

3.4. MEMORY TRANSFORMER AS A STOCHASTIC CONTINUOUS DYNAMIC SYSTEM

The DCMT method in Section 3.3, i.e., Eq. (3), only considers a single deterministic transformation function $g$, but there are infinitely many possible memory transformation functions. Thus, a single deterministic transformation is insufficient for modeling the underlying high diversity in the memory transformation functions. Furthermore, (Lu et al.; Liu et al., 2019) show that adding uncertainty modeling for the network is beneficial to avoid overfitting. We thus model the memory transformation process as a stochastic dynamic system with a path-valued random variable $X : [0, T] \to \mathbb{R}^d$, where each random variable at time $t$, i.e., $X_t$, models the distribution of the transformed memory data $x^m(t)$ at time $t$. We use $d$ to denote the memory data dimension. Let $W : [0, T] \to \mathbb{R}^w$ be a $w$-dimensional Brownian motion (Øksendal, 2014), which is a continuous-time stochastic process such that $W_{t+s} - W_s$ follows a Gaussian distribution with mean 0 and variance $t$. Let $\mu_\phi : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$ be the network modeling the drift term, and $\sigma_\phi : [0, T] \times \mathbb{R}^d \to \mathbb{R}^{d \times w}$ be the network modeling the diffusion term. They are parameterized together by $\phi$. For notation and presentation clarity, we use the same notation, $\phi$, as in DCMT to denote the parameters of the memory transformer. The memory transformation stochastic process can be modeled as the following stochastic differential equation (SDE):

$$dX_t = \mu_\phi(t, X_t)\,dt + \sigma_\phi(t, X_t) \circ dW_t, \quad x^m \sim X_0, \tag{6}$$

where the initial values of the SDE are the raw memory data, which can be viewed as samples from the initial random variable $X_0$. Let $X : [0, T] \to \mathbb{R}^d$ denote the solution to Eq. (6), where $\circ\, dW_t$ denotes Stratonovich integration (defined in Appendix B.11). The stochastic process $\{X_t\}_{t \in [0,T]}$ determined by Eq. (6) can be equivalently expressed as follows:

$$X_T = X_0 + \int_0^T \mu_\phi(t, X_t)\,dt + \int_0^T \sigma_\phi(t, X_t) \circ dW_t. \tag{7}$$

Each random variable $X_t$ of $\{X_t\}_{t \in [0,T]}$ models a transformed memory data distribution.
Thus, we obtain infinitely many transformed memory data distributions. This is in contrast to DCMT, where each $x^m(t)$ is deterministic. To solve this SDE and achieve cheap backpropagation, we use the adjoint method, i.e., the reversible Heun method proposed in (Kidger et al., 2021) (algorithm details are shown in Algorithm 5 in Appendix B.12.1). We name this method Stochastic Continuous Memory Transformer (SCMT). Because the solver is reversible, SCMT, like DCMT, does not need to make a replica of the raw memory data.
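To illustrate how a stochastic transformer yields a distribution of transformed data from one raw example, the sketch below simulates a Stratonovich SDE with the standard stochastic Heun (predictor-corrector) scheme. It is a simplified stand-in, not the reversible Heun solver of Kidger et al. (2021) used in the paper: `drift` and `diffusion` are hypothetical toy functions in place of the networks, and the noise is assumed diagonal (one Brownian channel per data dimension).

```python
import math
import random

def drift(t, x):
    """Hypothetical drift term mu(t, x)."""
    return [0.8 * math.sin(xi) for xi in x]

def diffusion(t, x):
    """Hypothetical diagonal diffusion term sigma(t, x)."""
    return [0.3 * math.cos(xi) for xi in x]

def heun_sde(x0, T, steps, rng):
    """Stochastic Heun scheme, which converges to the Stratonovich solution."""
    h = T / steps
    x, t = list(x0), 0.0
    for _ in range(steps):
        # Brownian increments: independent Gaussians with variance h.
        dw = [rng.gauss(0.0, math.sqrt(h)) for _ in x]
        mu0, sig0 = drift(t, x), diffusion(t, x)
        pred = [xi + m * h + s * w for xi, m, s, w in zip(x, mu0, sig0, dw)]
        mu1, sig1 = drift(t + h, pred), diffusion(t + h, pred)
        x = [xi + 0.5 * (m0 + m1) * h + 0.5 * (s0 + s1) * w
             for xi, m0, m1, s0, s1, w in zip(x, mu0, mu1, sig0, sig1, dw)]
        t += h
    return x

x_raw = [0.5, -1.0, 2.0]
# Each seed is one Brownian path, i.e., one sampled transformation of x_raw.
samples = [heun_sde(x_raw, T=0.05, steps=50, rng=random.Random(seed))
           for seed in range(4)]
```

Every entry of `samples` is a draw from the terminal distribution of the simulated process for the same raw input, whereas the deterministic transformer would return a single point.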

3.5. TRAINING OBJECTIVES FOR CONTINUOUS NEURAL MEMORY TRANSFORMER

The goal of the memory transformer is to make the memory data hard for the CL model to memorize. Our overall learning objective is the following bi-level optimization:

$$\min_{\theta} \; L(x_k, y_k, \theta) + L(\tilde{x}^m(T), y^m, \theta, \phi^*) \tag{8}$$

$$\text{s.t.} \quad \phi^* = \arg\max_{\phi} \left[ L(x^m(T), y^m, \theta, \phi) - L(x^m, y^m, \theta, \phi) - \lambda\, JS(x^m, x^m(T)) \right] \tag{9}$$

$$\text{where} \quad x^m(T) = x^m(0) + \int_0^T g(x^m(t), t, \phi)\,dt, \qquad \tilde{x}^m(T) = x^m(0) + \int_0^T g(x^m(t), t, \phi^*)\,dt \tag{10}$$

where $x^m(T)$ could be either from DCMT or samples from the terminal state of SCMT, and $\phi$ denotes the parameters of either DCMT or SCMT. The lower-level optimization, i.e., Eq. (9), makes the memory buffer data hard to memorize, and $\phi^*$ is the obtained optimal solution. The last term $JS(x^m, x^m(T))$ ensures smoother model responses on the original raw and transformed data, where $\lambda$ is the regularization strength. We discuss this term in detail below. The upper-level optimization, i.e., Eq. (8), replays the transformed memory data. Note that $\tilde{x}^m(T)$ is the memory data transformed by the optimal memory transformer with parameters $\phi^*$, defined in Eq. (10) (right). The above bi-level optimization is written for DCMT, but the method can also be directly applied to SCMT.

In practice, besides the obtained $\tilde{x}^m(T)$, we can obtain transformed data at any of the infinitely many time steps in the interval $[0, T]$. To make the computation tractable, we can randomly sample time steps, i.e., $0 = t_0 < t_1 < t_2 < \cdots < t_n = T$, and obtain the transformed memory data $x^m(t_0), x^m(t_1), x^m(t_2), \cdots, x^m(T)$ without additional cost since they are already in the integration interval $[0, T]$; these can be used for memory replay. It is worth noting that the transformed data at different time stamps are not combined and are not added to the memory buffer, for two reasons. First, previously transformed data has already been learned by the CL learner, and for new tasks we need to transform the memory data adaptively. Second, storing those data would substantially increase the memory storage cost. To maintain the diversity of the transformed memory buffer and reduce computation cost, we randomly sample $n$ from $[1, 5]$ at each CL step. Note that the number of parameters in DCMT or SCMT, i.e., $\phi$, is much smaller than that of the CL model, $\theta$, and is thus negligible compared to the CL model backbone.

Consistency loss. The memory transformer could generate diverse memory data. To ensure that the network embeds the original raw memory data and the transformed memory data similarly, we use a consistency loss similar to (Hendrycks et al., 2020) that regularizes the network output toward smoother CL model responses. The goal is to make the CL model respond similarly to $x^m(T)$ and $x^m$; we thus minimize the Jensen-Shannon divergence between the posterior distributions of the original sample $x^m$ and its transformed variant $x^m(T)$. The consistency loss is defined as:

$$JS(x^m, x^m(T)) = \frac{1}{2}\left( KL(p_{x^m} \,\|\, p_{mean}) + KL(p_{x^m(T)} \,\|\, p_{mean}) \right), \quad \text{where } p_{mean} = \frac{p_{x^m} + p_{x^m(T)}}{2}$$

where KL denotes the KL divergence between two distributions and $p_{x^m} = f_\theta(x^m)$ is the network output probability of each class for the original raw data $x^m$. Similarly, we define $p_{x^m(T)} = f_\theta(x^m(T))$. The CL model parameters are then updated using the transformed memory data and the mini-batch data received at timestamp $k$ as follows:

$$\theta_{k+1} = \theta_k - \eta \nabla_\theta \left[ L(\theta_k, \tilde{x}^m(T), y^m) + L(\theta_k, x_k, y_k) \right]$$

where $\eta$ is the learning rate. The learning process alternates between updating the memory transformer parameters $\phi$ (using the adjoint method (Pontryagin et al., 1962) to update the parameters in DCMT, or using the reversible Heun method (Kidger et al., 2021) to update the parameters in SCMT) and the CL model parameters $\theta$. The complete memory transformation algorithm is shown in Algorithm 1.

Algorithm 1
4: sample mini-batch data from memory buffer, i.e., $(x^m, y^m) \sim M$
5: $x^m(T) = \text{ODEsolver}(x^m, 0, T)$ (Eq. (4)) by Algorithm 3 in Appendix B.12, or $x^m(T) \sim X_T = \text{SDEsolver}(x^m, 0, T)$ (Eq. (7)) by Algorithm 4 in Appendix B.12
6: calculate loss function $L(\theta, \phi) = [L(x^m(T), y^m, \theta, \phi) - L(x^m, y^m, \theta, \phi) - \lambda\, JS(x^m, x^m(T))]$
7: calculate the gradient $\partial L(\theta, \phi) / \partial \phi$ with details provided in Appendix B.3 and Algorithm 2
8: update memory transformer parameters $\phi$ by gradient ascent to solve $\max_\phi L(\theta, \phi)$
9: transform memory data at randomly sampled time steps $0 = t_0 < t_1 < t_2 < \cdots < t_n = T$
10: $\theta_{k+1} = \theta_k - \eta \nabla_\theta [L(\theta_k, x^m(t_i), y^m) + L(\theta_k, x_k, y_k)]$
11: if restore then
12: restore $x^m(0)$ with Eq. (5)
13: end if
14: update memory buffer by reservoir sampling (RS) (Vitter, 1985; Riemer et al., 2019), $M = RS(M, (x_k, y_k))$
15: end for
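The consistency term and the lower-level objective of Eq. (9) can be written out directly for discrete class probabilities. The following is a minimal sketch under stated assumptions: the probability vectors and loss values are made-up illustrative numbers, and the helper names `kl`, `js_consistency`, and `transformer_objective` are ours, not from the paper's codebase.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_consistency(p_raw, p_transformed):
    """Jensen-Shannon divergence between two class-probability vectors."""
    p_mean = [(a + b) / 2.0 for a, b in zip(p_raw, p_transformed)]
    return 0.5 * (kl(p_raw, p_mean) + kl(p_transformed, p_mean))

def transformer_objective(loss_transformed, loss_raw, p_raw, p_transformed, lam):
    """Quantity maximized over phi in Eq. (9): loss increase minus lam * JS."""
    return loss_transformed - loss_raw - lam * js_consistency(p_raw, p_transformed)

# Hypothetical model outputs on a raw and a transformed memory example.
p_raw = [0.7, 0.2, 0.1]
p_transformed = [0.5, 0.3, 0.2]
js_value = js_consistency(p_raw, p_transformed)
objective = transformer_objective(1.5, 1.0, p_raw, p_transformed, lam=0.1)
```

The divergence is zero when the two outputs coincide and is bounded above by log 2, so the regularizer penalizes transformations that change the model's response while the loss-increase term rewards harder examples.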

4. EXPERIMENTS

We evaluate our methods for task-aware CL in Section 4.1 and task-free CL in Section 4.2. Datasets. We compare different methods on CIFAR10 (Krizhevsky, 2009) with 10 image classes, CIFAR100 (Krizhevsky, 2009) with 100 image classes, MinImageNet (Vinyals et al., 2016) with 100 image classes and Tiny-ImageNet (Stanford, 2015) with 200 classes.

4.1. TASK-AWARE CL

We perform experiments on both task-incremental (Task-IL) and class-incremental (Class-IL) CL (van de Ven & Tolias, 2019). Task-IL provides task identities to the CL learner and is the easiest scenario. Class-IL does not provide task identities and is the hardest scenario in task-aware CL.

Baseline. We compare to various SOTA CL methods, including: 1) regularization-based methods, Classifier-Projection Regularization (CPR) (Cha et al., 2021), PASS (Zhu et al., 2021), Gradient Projection Memory (GPM) (Saha et al., 2021), oEWC (Schwarz et al., 2018), synaptic intelligence (SI) (Zenke et al., 2017a) and Learning without Forgetting (LwF) (Li & Hoiem, 2018); 2) Bayesian-based methods, UCB (Ebrahimi et al., 2020); 3) architecture-based methods, HAT (Serrá et al., 2018); 4) memory-based CL methods, including ER (Chaudhry et al., 2019b), A-GEM (Chaudhry et al., 2019a), GSS (Aljundi et al., 2019c), HAL (Chaudhry et al., 2021), DER++ (Buzzega et al., 2020) and GMED (Jin et al., 2021). The implementation for those baselines (Buzzega et al., 2020) already applies data augmentation, such as random crops and horizontal flips. We apply the proposed method on top of those implementations. We adapt PASS (Zhu et al., 2021) to standard CL by adding noise to memory data. We provide detailed baseline descriptions in Appendix B.2.

Evaluation metrics. We evaluate the proposed methods and the compared methods with average accuracy (ACC) and backward transfer (BWT) at the end of CL training to measure the final performance and the degree of forgetting. We denote $a_{N,k}$ as the testing accuracy on task $k$ after learning on task $N$. The overall accuracy for all the tasks is $ACC = \frac{1}{N} \sum_{k=1}^{N} a_{N,k}$. To measure catastrophic forgetting, we also evaluate BWT, which measures the extent of forgetting on previous tasks after learning new ones. Formally, BWT is defined as: $BWT = \frac{1}{N-1} \sum_{k=1}^{N-1} (a_{N,k} - a_{k,k})$.
$BWT < 0$ indicates forgetting of previous tasks, and $BWT > 0$ indicates that learning new tasks is helpful on previous tasks.

Implementation details. We follow (Buzzega et al., 2020) and use ResNet18 (He et al., 2016) as the classifier for all datasets. Following (Buzzega et al., 2020), we split the CIFAR-10 dataset into 5 disjoint tasks, where each task consists of 2 classes. We split MiniImageNet (Vinyals et al., 2016) into 10 disjoint tasks, where each task has 10 classes. We also split CIFAR-100 into 10 disjoint tasks, where each task has 10 classes. The transformation time $T$ at each CL step is $T = 0.05$. Other hyperparameter settings follow (Buzzega et al., 2020). The memory buffer has a size of 500 data points by default. The DCMT architecture is a four-block ResNet with a filter size of 8. For the SCMT architecture, the drift and diffusion networks are both four-block ResNets with a filter size of 6. DT (Section 3.2) uses the same architecture as DCMT. Thus, DT, DCMT and SCMT have far fewer parameters than ResNet18 (He et al., 2016): the transformer accounts for only about 0.4% of the parameters of ResNet18, which is negligible. To exclude the influence of the number of parameters in the performance comparison, we reduce the number of parameters in our base model ResNet18 to offset the parameters in our transformation component. This ensures that all the compared models have the same number of parameters. All reported results in our experiments are the average accuracy and standard deviation over ten runs. The compared methods and our proposed methods are based on the public implementation.

Result. We compare the proposed methods to various CL baselines in Table 1. Due to space limitations, we put the results on Tiny-ImageNet in Appendix B.9 and backward transfer results in Appendix B.5. We can observe that our method outperforms those baselines.
In particular, for class-IL, combining ER with DCMT or SCMT outperforms ER by 3.2%, 3.1%, and 6.7% on MiniImageNet, CIFAR-100 and CIFAR10, respectively. For task-IL, integrating ER with DCMT or SCMT outperforms ER by 3.3%, 2.6% and 1.7% on MiniImageNet, CIFAR-100, and CIFAR10, respectively. Furthermore, for class-IL, combining DER++ with DCMT or SCMT outperforms DER++ by 2.7%, 2.3% and 2.4% on MiniImageNet, CIFAR-100 and CIFAR10, respectively. For task-IL, integrating DER++ with DCMT or SCMT outperforms DER++ by 3.5%, 3.1%, and 1.8% on MiniImageNet, CIFAR-100 and CIFAR10, respectively. SCMT can further improve over DCMT due to the increased data diversity. GMED brings little improvement or even worse performance, consistent with the observations of (Jin et al., 2021). We believe this is because, in task-aware CL, memory buffer data could be replayed for many epochs, and GMED may edit the memory data so much that they deviate significantly from the original raw data. Adding random noise helps marginally in some cases but hurts performance in others. We believe that adding noise only adds a simple transformation pattern and still lacks diversity. When training for a long time, the network could also memorize the noise pattern. DT improves the performance compared to baselines since a discrete-step transformation by a neural network can be viewed as a discrete approximation of our continuous transformation. The performance of DT is upper-bounded by that of our continuous transformation. Also, DT needs to store intermediate transformations, whereas DCMT and SCMT do not. Our method outperforms baselines because the memory transformation function is highly expressive and flexible and provides significantly more memory diversity. The transformed memory buffer is more difficult for the CL model to overfit.

4.2. TASK-FREE CL

Implementation details. We use ResNet-18, following (Aljundi et al., 2019a). The transformation time $T$ at each CL step is $T = 0.03$.
By default, following (Jin et al., 2021), we set the memory buffer size to be 500 for CIFAR-10, 10K for MiniImageNet, and 5K for CIFAR-100. Other hyperparameters are the same as (Aljundi et al., 2019a). All reported results are the average accuracy and standard deviation over ten runs.

Result. We compare to various CL baselines and to combinations with ER, MIR, and GMED in Table 2. We observe that our method outperforms these baselines. In particular, combining ER with DCMT or SCMT outperforms ER by 4.0%, 3.1%, and 1.8% on CIFAR10, MiniImageNet and CIFAR-100, respectively. Combining MIR with DCMT or SCMT outperforms MIR by 3.2%, 2.5% and 1.7% on CIFAR10, MiniImageNet and CIFAR-100, respectively. Combining GMED with DCMT or SCMT outperforms GMED by 3.1%, 1.2% and 0.9% on CIFAR10, MiniImageNet and CIFAR-100, respectively. Our method outperforms baselines for reasons similar to task-aware CL.

Ablation study. Results with smaller memory sizes (2000 and 3000 for CIFAR-100; 2000 and 5000 for MiniImageNet) are provided in Table 3. They further show the significant improvement of DCMT and SCMT, with more than 3% improvement in many cases.

Complexity analysis. Our current implementation has a computation cost comparable to GMED. We put our efficiency improvement techniques and detailed complexity analysis in Appendix B.8, with the formal complexity analysis in Table 10 and the run time versus performance evaluation in Table 11.
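For reference, the ACC and BWT metrics defined in Section 4.1 can be computed from a matrix of per-task accuracies as in the sketch below. The accuracy values are hypothetical numbers chosen for illustration, not results from the paper, and the function names are ours; indices are 0-based, so `acc[j][k]` corresponds to $a_{j+1,k+1}$.

```python
def average_accuracy(acc):
    """ACC: mean accuracy over all N tasks after training on the last task."""
    n = len(acc)
    return sum(acc[n - 1][k] for k in range(n)) / n

def backward_transfer(acc):
    """BWT: mean change on tasks 1..N-1 between just-learned and final accuracy."""
    n = len(acc)
    return sum(acc[n - 1][k] - acc[k][k] for k in range(n - 1)) / (n - 1)

# acc[j][k] = test accuracy on task k after training on task j (hypothetical values).
acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.75, 0.88],
]
acc_value = average_accuracy(acc)
bwt_value = backward_transfer(acc)
```

With these made-up numbers BWT is negative, which by the definition above indicates forgetting of the first two tasks.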

5. CONCLUSION

This paper explores the memory overfitting issues and proposes a novel continuous memory transformer. We apply the proposed method to both task-aware and task-free CL. Compared to existing works, our proposed methods are very flexible and can make the memory data diverse and hard to overfit. Extensive experiments with both strong baselines of task-aware and task-free CL demonstrate the effectiveness of the proposed methods. Future work includes automatically learning the transformation time interval to obtain the optimal transformed memory data.

REPRODUCIBILITY STATEMENT

We provide implementation details and the codebase we used to implement our methods.

A APPENDIX

We provide more experimental details and results in Section B and theoretical analysis in Section ??.

B EXPERIMENTS

B.1 MORE IMPLEMENTATION DETAILS

We run the experiments on an NVIDIA RTX A6000 GPU. For the hyperparameter values of DER++, ER, and all the other baselines, we follow the implementation in (Buzzega et al., 2020).

B.2 BASELINE DESCRIPTIONS

Experience Replay (ER) (Chaudhry et al., 2019b) stores a small subset of data from previous tasks with reservoir sampling (Chaudhry et al., 2019b). When training on new tasks, we randomly sample a subset of examples from the memory buffer and train on them together with the received new mini-batch data to mitigate forgetting.

Maximally Interfering Retrieval (MIR) (Aljundi et al., 2019a). The goal of MIR is to select for replay the examples that are most easily forgotten. We follow a setting similar to GMED (Jin et al., 2021) for a fair comparison. We evaluate the CL model forgetting with 25 memory examples for the Mini-ImageNet dataset and 50 memory examples for the other datasets.

Averaged Gradient Episodic Memory (AGEM) (Chaudhry et al., 2019a). At every CL training step, AGEM ensures that the average memory buffer loss over the previous tasks does not increase. AGEM projects the gradient update direction onto the closest gradient direction in L2 space that keeps the gradient angle below 90 degrees, so that the memory data are less interfered with by the current data.

Gradient-Based Sample Selection (GSS-Greedy) (Aljundi et al., 2019c) encourages storing diverse examples in the memory buffer. We use GSS-Greedy, which is efficient and performs the best among the variants proposed in (Aljundi et al., 2019c).

Gradient based Memory Editing (GMED) (Jin et al., 2021). The goal of GMED is to edit the memory buffer data with gradient information so that they are harder to memorize, a goal similar to that of MIR (Aljundi et al., 2019a).

Dark Experience Replay (DER) (Buzzega et al., 2020) is a memory-replay-based method that combines memory replay with knowledge distillation to mitigate forgetting. DER++ is one of the state-of-the-art methods in CL.

Hindsight Anchor Learning (HAL) (Chaudhry et al., 2021) regularizes the training objective with one data point per class per task, named anchors, obtained by maximizing its estimated forgetting.
Keeping the model prediction fixed on those anchor points preserves the performance of previous tasks.

Task-free CL additional baselines. Following (Aljundi et al., 2019a), we additionally compare against: (1) iid online, which trains the model with a single pass through the iid-sampled data on the same set of samples; (2) iid offline, which trains the model with multiple epochs through the iid-sampled data. We train the model for 5 epochs for this baseline, and its performance serves as an upper bound. Table 4 shows the results of these two baselines.

B.3 BI-LEVEL OPTIMIZATION

$$x^m(T) = x^m(0) + \int_0^T g(x^m(t), t, \phi)\,dt \tag{11}$$

We first define the adjoint state at time $t$ as $a(t) = \frac{dL(x^m(T), y)}{dx^m(t)} = \frac{dL}{dx^m(t)}$, where we abbreviate $L(x^m(T), y)$ as $L$. The adjoint state follows the differential equation

$$\frac{da(t)}{dt} = -a(t)^\top \frac{\partial g(x^m(t), t, \phi)}{\partial x^m}, \qquad a_T = \frac{dL}{dx^m(T)} \tag{12}$$

The adjoint-state equation Eq. (12) can be viewed as the continuous version of backpropagation (which applies to a discrete number of layers). Therefore, we can obtain $a(0) = \frac{dL}{dx^m(0)}$ by solving the ODE in reverse time over the interval $[T, 0]$ with initial state $a_T = \frac{dL}{dx^m(T)}$. The gradient with respect to the memory transformer parameters $\phi$ is

$$\frac{\partial L}{\partial \phi} = -\int_T^0 a(t)^\top \frac{\partial g(x^m(t), t, \phi)}{\partial \phi}\,dt \tag{13}$$

Eq. (11) transforms the memory data, Eq. (12) calculates the adjoint state $a(t)$, and Eq. (13) calculates the gradient with respect to the memory transformer parameters $\phi$. These three integrals can be solved jointly in a single pass by concatenating the transformed memory data, the adjoint state, and the derivatives with respect to the parameters. The entire algorithm for calculating the derivative with respect to the memory transformer parameters is shown in Algorithm 2.

Algorithm 2 Reverse-mode derivative for calculating the gradient w.r.t. $\phi$ of DCMT.
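The adjoint computation of this section can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a scalar memory example, a hypothetical one-parameter velocity field `g(x, t, phi) = phi * tanh(x)`, a squared-error loss, and a hand-written classical Runge-Kutta solver. All function names are illustrative. The augmented state [x, a, dL/dphi] is integrated backwards from T to 0, mirroring Eqs. (11)-(13).

```python
import math


def rk4(f, s0, t0, t1, n_steps):
    """Classical fourth-order Runge-Kutta for ds/dt = f(s, t).

    Integrates from t0 to t1; t1 < t0 gives a negative step size,
    i.e. integration backwards in time.
    """
    h = (t1 - t0) / n_steps
    s, t = list(s0), t0
    for _ in range(n_steps):
        k1 = f(s, t)
        k2 = f([x + 0.5 * h * k for x, k in zip(s, k1)], t + 0.5 * h)
        k3 = f([x + 0.5 * h * k for x, k in zip(s, k2)], t + 0.5 * h)
        k4 = f([x + h * k for x, k in zip(s, k3)], t + h)
        s = [x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for x, a, b, c, d in zip(s, k1, k2, k3, k4)]
        t += h
    return s


# Hypothetical toy velocity field and its partial derivatives (assumptions).
def g(x, t, phi):
    return phi * math.tanh(x)


def dg_dx(x, t, phi):
    return phi * (1.0 - math.tanh(x) ** 2)


def dg_dphi(x, t, phi):
    return math.tanh(x)


def transform(x0, phi, T, n_steps=200):
    """Eq. (11): x(T) = x(0) + integral of g over [0, T]."""
    return rk4(lambda s, t: [g(s[0], t, phi)], [x0], 0.0, T, n_steps)[0]


def loss(x0, phi, T, y):
    """Toy loss on the transformed example (an assumption of this sketch)."""
    return 0.5 * (transform(x0, phi, T) - y) ** 2


def adjoint_gradients(xT, dL_dxT, phi, T, n_steps=200):
    """Solve Eqs. (11)-(13) jointly, backwards in time from T to 0."""
    def aug_dynamics(s, t):
        x, a, _ = s
        return [g(x, t, phi),              # state dynamics, Eq. (11)
                -a * dg_dx(x, t, phi),     # adjoint dynamics, Eq. (12)
                -a * dg_dphi(x, t, phi)]   # parameter-gradient integrand, Eq. (13)
    x0, dL_dx0, dL_dphi = rk4(aug_dynamics, [xT, dL_dxT, 0.0], T, 0.0, n_steps)
    return x0, dL_dx0, dL_dphi


if __name__ == "__main__":
    x0, phi, T, y = 0.7, 0.5, 1.0, 0.0
    xT = transform(x0, phi, T)
    x0_rec, dL_dx0, dL_dphi = adjoint_gradients(xT, xT - y, phi, T)
    print("recovered x(0):", x0_rec)
    print("dL/dx(0):", dL_dx0, " dL/dphi:", dL_dphi)
```

Note that the backward pass also recovers x(0) from x(T), so the raw example does not have to be stored separately in this toy setting. A finite-difference check of `loss` is a simple way to validate the two gradients.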
Table 5 shows the task-IL and class-IL results on CIFAR-100 and MiniImagenet with memory size 2000. Our methods still outperform strong baselines by a large margin.

B.5 BACKWARD TRANSFER (BWT)

Table 6 shows the backward transfer results with memory size 500 on various datasets. BWT measures the forgetting of the CL model. Note that a method that restrains learning of the current task preserves past knowledge and therefore attains a high BWT, but the current task is then not learned well and the overall accuracy is low. The results indicate that the proposed methods (DCMT and SCMT) significantly mitigate forgetting compared to the baselines ER and DER++. The baseline HAL achieves higher BWT in some cases because HAL does not learn the new task well and achieves much lower overall accuracy, as shown in Table 1 (main text).
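To make the caveat about BWT concrete, the sketch below computes BWT from an accuracy matrix. It assumes the commonly used definition of Lopez-Paz & Ranzato (2017), i.e. the average over earlier tasks of the final accuracy minus the accuracy measured right after learning that task; the exact protocol behind Table 6 may differ. The two accuracy matrices are made-up illustrative numbers, not results from this paper.

```python
def backward_transfer(acc):
    """BWT from acc[i][j] = accuracy on task j after training on task i.

    Assumed definition: mean over j < n-1 of acc[n-1][j] - acc[j][j].
    Negative values indicate forgetting.
    """
    n = len(acc)
    if n < 2:
        raise ValueError("BWT needs at least two tasks")
    return sum(acc[n - 1][j] - acc[j][j] for j in range(n - 1)) / (n - 1)


def final_average_accuracy(acc):
    """Average accuracy over all tasks after training on the last task."""
    return sum(acc[-1]) / len(acc[-1])


# Hypothetical learner that learns every task but forgets somewhat.
learner = [
    [0.90, 0.10, 0.10],
    [0.80, 0.85, 0.10],
    [0.70, 0.75, 0.88],
]

# Hypothetical learner that barely updates after the first task.
frozen = [
    [0.90, 0.10, 0.10],
    [0.90, 0.20, 0.10],
    [0.90, 0.20, 0.20],
]

if __name__ == "__main__":
    for name, acc in [("learner", learner), ("frozen", frozen)]:
        print(name, "BWT:", backward_transfer(acc),
              "final accuracy:", final_average_accuracy(acc))
```

In this toy example the frozen learner has the better BWT and the worse final accuracy, which is the situation described for HAL above.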

B.6 HYPERPARAMETER ANALYSIS

Table 7 shows the effect of the integration (transformation) time T on the performance. The performance improves with a longer integration time T, because a longer T generates more diverse transformed memory data, so the proposed methods generalize better to previous tasks. If T is too large (T = 2.0), the transformed data deviate significantly from the original data distribution, and the performance can become worse. To keep the transformed memory data distribution close to the original one and to reduce the computation cost, we use a moderate time T.

Table 8 shows the effect of λ on the model performance. The model achieves the best performance at λ = 3.0.

Table 9 shows the effect of the number of editing steps N on the performance when combining GMED (Jin et al., 2021) with DER++. With more editing steps, the performance of GMED drops significantly because the memory data become much harder but lack diversity, so the model generalizes worse.

Table 9: Effect of the number of editing steps N (N = 1, 3, 5, 10) for DER++GMED (Jin et al., 2021) on CIFAR10, CIFAR-100 and MiniImagenet, respectively.

According to (Xu et al., 2022), the computation cost of backward propagation is twice that of forward propagation.

• DER++: it needs 3 forward passes and 1 backward pass to replay the memory data, which is equivalent to 3 + 1*2 = 5 forward computations.
• DER++GMED: at each CL step, GMED additionally needs 3 forward passes and 1 backward pass (for two mini-batches of data) to edit the memory examples for each editing step. So, the complexity is 6 + 2*2 = 10 forward propagation calculations.
• DER++DCMT: since we uniformly sample the number n from [0, 5], the amortized cost across CL steps corresponds to using 3 discrete time points in the interval [0, T] at each CL step. First, the forward ODE calculation at those time points costs 3 forward propagations. Then, we need to calculate the gradients for the model parameters, which costs 3 backward gradient calculations at those time points. Thus, the computation cost of DCMT is equivalent to 6 + 3*2 forward propagations. We can further reduce the computation cost of DCMT by backpropagating into the memory transformer only every S CL steps instead of at every CL step. For example, suppose we have 3 CL steps, i.e., 1, 2, 3; we backpropagate into the memory transformer at step 1 but reuse the same memory transformer at steps 2 and 3 to transform the memory data. This reduces the cost of DCMT and SCMT to 6 + 6/3 = 8. Additionally, the integration and backward computation time for the memory transformer in the interval [0, T] is empirically equivalent to 1.6 forward computations of the ResNet backbone, since the memory transformer network is very small. Thus, the total computation cost is 9.6 forward computations.
• DER++SCMT: the complexity is similar, but it requires evaluating both the drift and diffusion terms with the corresponding gradients. Thus, the computation cost doubles the cost of DCMT, and therefore it is equivalent to 6 + 6/3 = 8 forward propagations. Additionally, the integration and backward computation time for the memory transformer in the interval [0, T] is empirically equivalent to 4.2 forward computations of the ResNet backbone, since the memory transformer network is very small. Thus, the total computation cost is 12.2 forward computations.

We summarize the computation complexity of the above methods in Table 10.
We set DER++ as the baseline with a running time unit of 1; we compare the computation complexity of all the other methods with respect to DER++. 
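The forward-equivalent counts quoted above can be reproduced with a few lines of arithmetic. This is only a bookkeeping sketch: the function names are hypothetical, the pass counts and the empirical overheads (1.6 and 4.2) are copied from the text, and the ratios to DER++ are derived here rather than taken from Table 10.

```python
# Assumption from the text: one backward pass costs about two forward passes.
BACKWARD_TO_FORWARD = 2.0


def forward_equivalents(n_forward, n_backward):
    """Cost of a training step expressed in forward passes."""
    return n_forward + BACKWARD_TO_FORWARD * n_backward


def amortized_cost(n_forward, n_backward, every_s, transformer_overhead):
    """Cost when the backward passes are only paid every `every_s` CL steps.

    `transformer_overhead` is the empirically measured extra cost of
    integrating the memory transformer, in backbone forward passes.
    """
    backbone = n_forward + BACKWARD_TO_FORWARD * n_backward / every_s
    return backbone + transformer_overhead


der_pp = forward_equivalents(3, 1)        # 3 + 1*2
der_pp_gmed = forward_equivalents(6, 2)   # 6 + 2*2
der_pp_dcmt = amortized_cost(6, 3, every_s=3, transformer_overhead=1.6)
der_pp_scmt = amortized_cost(6, 3, every_s=3, transformer_overhead=4.2)

if __name__ == "__main__":
    costs = {
        "DER++": der_pp,
        "DER++GMED": der_pp_gmed,
        "DER++DCMT": der_pp_dcmt,
        "DER++SCMT": der_pp_scmt,
    }
    for name, cost in costs.items():
        # Ratio to DER++, which is used as the unit of running time.
        print(f"{name}: {cost:.1f} forward passes, {cost / der_pp:.2f}x DER++")
```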

B.10 T-SNE VISUALIZATION

More T-SNE visualizations are shown in Figure 5 (ER), Figure 6 (GMED), and Figure 7 (our method).

$$\int_0^T X_t \circ dW_t = \lim_{|\Pi| \to 0} \sum_{k=1}^{N} \left(\frac{X_{t_k} + X_{t_{k-1}}}{2}\right)\left(W_{t_k} - W_{t_{k-1}}\right)$$

where $\Pi = \{0 = t_0 < \cdots < t_N = T\}$ is a partition of the interval $[0, T]$, $|\Pi| = \max_k (t_k - t_{k-1})$ denotes the size of the largest segment of the partition, and the limit is to be interpreted in the $L^2$ space.
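The averaged-endpoint sum in the definition above can be evaluated numerically. The sketch below is an illustration under simple assumptions: the integrand is the Brownian path itself (X = W), the path is simulated on a uniform grid with Python's `random` module, and the helper names are hypothetical. For this integrand the sum telescopes to W_T^2 / 2, whereas the left-endpoint (Itô) sum differs from it by half the sum of squared increments.

```python
import math
import random


def brownian_path(T, n, rng):
    """Sample W at n + 1 equally spaced times on [0, T], with W_0 = 0."""
    dt = T / n
    w = [0.0]
    for _ in range(n):
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return w


def stratonovich_sum(x, w):
    """Sum of (x_k + x_{k-1}) / 2 * (w_k - w_{k-1}) over the partition."""
    return sum(0.5 * (x[k] + x[k - 1]) * (w[k] - w[k - 1])
               for k in range(1, len(w)))


def ito_sum(x, w):
    """Left-endpoint sum x_{k-1} * (w_k - w_{k-1}), shown for comparison."""
    return sum(x[k - 1] * (w[k] - w[k - 1]) for k in range(1, len(w)))


def quadratic_variation(w):
    """Sum of squared increments of the sampled path."""
    return sum((w[k] - w[k - 1]) ** 2 for k in range(1, len(w)))


if __name__ == "__main__":
    rng = random.Random(0)
    w = brownian_path(T=1.0, n=10_000, rng=rng)
    print("Stratonovich sum:", stratonovich_sum(w, w))
    print("W_T^2 / 2:       ", 0.5 * w[-1] ** 2)
    print("Ito sum:         ", ito_sum(w, w))
```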

B.12 TRAINING ALGORITHMS FOR DCMT AND SCMT

We present the Runge-Kutta method for integrating an ODE in Algorithm 3 and the adjoint method for calculating the gradient with respect to the DCMT parameters $\phi$ in Algorithm 2. We then present the forward pass of SCMT (Eq. 7) in Algorithm 4 and the reversible Heun algorithm for calculating the gradients with respect to the SCMT parameters in Algorithm 5.

Algorithm 3 Runge-Kutta Method for integrating ODE.
1: REQUIRE: $\frac{dx^m(t)}{dt} = g(x^m(t), t, \phi)$, $x^m(0) = x^m$;
2: divide the time interval $[0, T]$ into sub-intervals $[t_0, t_1], \cdots, [t_n, t_{n+1}], \cdots, [t_{N-1}, t_N]$, where $|t_{n+1} - t_n| = h$ and $h$ is the step size.
3: for $n = 0$ to $N - 1$ do
4: $K_1 = g(x^m_n, t_n)h$
5: $K_2 = g(x^m_n + \frac{1}{2}K_1, t_n + \frac{1}{2}h)h$
6: $K_3 = g(x^m_n + \frac{1}{2}K_2, t_n + \frac{1}{2}h)h$
7: $K_4 = g(x^m_n + K_3, t_n + h)h$
8: $x^m_{n+1} = x^m_n + \frac{1}{6}(K_1 + 2K_2 + 2K_3 + K_4)$
9: end for
10: Return $x^m_N$

Algorithm 4 SCMT (SDE) solver (forward pass).
1: REQUIRE: $t_n$, $x^m_n$, $\hat{x}^m_n$, $\mu_n$, $\sigma_n$, $\Delta t$, Brownian motion $W$.
2: $t_{n+1} = t_n + \Delta t$
3: $\Delta W_n = W_{t_{n+1}} - W_{t_n}$
4: $\hat{x}^m_{n+1} = 2x^m_n - \hat{x}^m_n + \mu_n \Delta t + \sigma_n \Delta W_n$
5: $\mu_{n+1} = \mu(t_{n+1}, \hat{x}^m_{n+1})$
6: $\sigma_{n+1} = \sigma(t_{n+1}, \hat{x}^m_{n+1})$
7: $x^m_{n+1} = x^m_n + \frac{1}{2}(\mu_n + \mu_{n+1})\Delta t + \frac{1}{2}(\sigma_n + \sigma_{n+1})\Delta W_n$
8: Return $t_{n+1}$, $x^m_{n+1}$, $\hat{x}^m_{n+1}$, $\mu_{n+1}$, $\sigma_{n+1}$

B.12.1 REVERSIBLE HEUN (KIDGER ET AL., 2021) FOR SOLVING THE SCMT GRADIENTS

Suppose we have the loss $L$ on the terminal random variable $X_T$. Then the adjoint process $A_t = \frac{dL(X_T)}{dX_t} \in \mathbb{R}^d$ is the solution to the following SDE:

$$dA^i_t = -A^j_t \frac{\partial \mu^j}{\partial X^i}(t, X_t)\,dt - A^j_t \frac{\partial \sigma^{j,k}}{\partial X^i}(t, X_t) \circ dW^k_t$$

where $A_0 = \frac{dL(X_T)}{dX_0}$ is the resulting backpropagated gradient. The gradients with respect to the parameters of the drift and diffusion networks can be obtained by viewing them as an additional part of the state whose dynamics have zero drift and diffusion (Li et al., 2020).
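Furthermore, the adjoint method of reversible Heun (Kidger et al., 2021) utilizes the reversibility of a differential equation: intermediate computations such as $X_t$ for $t < T$ are restored from $X_T$, so that they do not need to be held in memory.

Algorithm 5 Reversible Heun method of backward computation to calculate the gradients w.r.t. the previous state of SCMT.
1: REQUIRE: $t_{n+1}$, $x^m_{n+1}$, $\hat{x}^m_{n+1}$, $\mu_{n+1}$, $\sigma_{n+1}$, $\Delta t$, Brownian motion $W$, $\frac{\partial L(x^m_T)}{\partial x^m_{n+1}}$, $\frac{\partial L(x^m_T)}{\partial \hat{x}^m_{n+1}}$, $\frac{\partial L(x^m_T)}{\partial \mu_{n+1}}$, $\frac{\partial L(x^m_T)}{\partial \sigma_{n+1}}$.
2: $t_n = t_{n+1} - \Delta t$
3: $\Delta W_n = W_{t_{n+1}} - W_{t_n}$
4: $\hat{x}^m_n = 2x^m_{n+1} - \hat{x}^m_{n+1} - \mu_{n+1}\Delta t - \sigma_{n+1}\Delta W_n$
5: $\mu_n = \mu(t_n, \hat{x}^m_n)$
6: $\sigma_n = \sigma(t_n, \hat{x}^m_n)$
7: $x^m_n = x^m_{n+1} - \frac{1}{2}(\mu_n + \mu_{n+1})\Delta t - \frac{1}{2}(\sigma_n + \sigma_{n+1})\Delta W_n$
8: $x^m_{n+1}, \hat{x}^m_{n+1}, \mu_{n+1}, \sigma_{n+1} = \mathrm{Forward}(t_n, x^m_n, \hat{x}^m_n, \mu_n, \sigma_n, \Delta t, W)$
9: Local backpropagation
10: $\frac{\partial L(x^m_T)}{\partial (x^m_n, \hat{x}^m_n, \mu_n, \sigma_n)} = \frac{\partial L(x^m_T)}{\partial (x^m_{n+1}, \hat{x}^m_{n+1}, \mu_{n+1}, \sigma_{n+1})} \frac{\partial (x^m_{n+1}, \hat{x}^m_{n+1}, \mu_{n+1}, \sigma_{n+1})}{\partial (x^m_n, \hat{x}^m_n, \mu_n, \sigma_n)}$
11: Return $t_n$, $x^m_n$, $\hat{x}^m_n$, $\mu_n$, $\sigma_n$, $\frac{\partial L(x^m_T)}{\partial x^m_n}$, $\frac{\partial L(x^m_T)}{\partial \hat{x}^m_n}$, $\frac{\partial L(x^m_T)}{\partial \mu_n}$, $\frac{\partial L(x^m_T)}{\partial \sigma_n}$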
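The state-reconstruction part of the reversible Heun scheme (the forward step of Algorithm 4 and steps 2 to 7 of Algorithm 5) can be sketched for a scalar SDE as follows. This is an illustrative sketch only: the drift `mu` and diffusion `sigma` are hypothetical closed-form functions standing in for the learned networks, the Brownian increments are stored in a list rather than queried from a Brownian-motion object, and the local backpropagation of steps 9 and 10 is omitted.

```python
import math
import random


def mu(t, x):
    """Hypothetical drift (stands in for the learned drift network)."""
    return -0.5 * x


def sigma(t, x):
    """Hypothetical diffusion (stands in for the learned diffusion network)."""
    return 0.3 * math.cos(x)


def forward_step(t, x, x_hat, mu_n, sig_n, dt, dw):
    """One forward step of the reversible Heun solver."""
    t1 = t + dt
    x_hat1 = 2.0 * x - x_hat + mu_n * dt + sig_n * dw
    mu1, sig1 = mu(t1, x_hat1), sigma(t1, x_hat1)
    x1 = x + 0.5 * (mu_n + mu1) * dt + 0.5 * (sig_n + sig1) * dw
    return t1, x1, x_hat1, mu1, sig1


def reverse_step(t1, x1, x_hat1, mu1, sig1, dt, dw):
    """Algebraic inverse of forward_step, using the same increment dw."""
    t = t1 - dt
    x_hat = 2.0 * x1 - x_hat1 - mu1 * dt - sig1 * dw
    mu_n, sig_n = mu(t, x_hat), sigma(t, x_hat)
    x = x1 - 0.5 * (mu_n + mu1) * dt - 0.5 * (sig_n + sig1) * dw
    return t, x, x_hat, mu_n, sig_n


def integrate_forward(x0, dws, dt):
    """Run the solver from t = 0, keeping only the final state."""
    state = (0.0, x0, x0, mu(0.0, x0), sigma(0.0, x0))
    for dw in dws:
        state = forward_step(*state, dt, dw)
    return state


def integrate_backward(state, dws, dt):
    """Reconstruct the initial state from the final state."""
    for dw in reversed(dws):
        state = reverse_step(*state, dt, dw)
    return state


if __name__ == "__main__":
    rng = random.Random(0)
    dt, n = 0.01, 100
    dws = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n)]
    final = integrate_forward(0.8, dws, dt)
    recovered = integrate_backward(final, dws, dt)
    print("x(T):", final[1], " recovered x(0):", recovered[1])
```

The reconstruction is exact up to floating-point round-off because each reverse step undoes the corresponding forward step algebraically, provided the same Brownian increments are reused.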

B.13 WHY THE ENTIRE TRANSFORMATION IS INVERTIBLE AND THE VELOCITY TERM DOES NOT NEED TO BE INVERTIBLE

The forward integration is

$$x^m(T) = x^m(0) + \int_0^T g(x^m(t), t, \phi)\,dt$$

Multiplying both sides of this equation by $-1$ gives

$$-x^m(T) = -x^m(0) - \int_0^T g(x^m(t), t, \phi)\,dt$$

Rearranging yields

$$x^m(0) = x^m(T) - \int_0^T g(x^m(t), t, \phi)\,dt$$

Using the following property of integration,

$$-\int_0^T g(x^m(t), t, \phi)\,dt = \int_T^0 g(x^m(t), t, \phi)\,dt$$

we obtain the reverse integration

$$x^m(0) = x^m(T) + \int_T^0 g(x^m(t), t, \phi)\,dt \tag{20}$$

Nothing in the above derivation requires $g$ to be invertible.

$D^{tr}_k$ consists of a set of triplets $\{(x^k_i, y^k_i, T_k)\}_{i=1}^{n_k}$, where $x^k_i$ is the $i$-th data example in the task, $y^k_i$ is the corresponding data label, and $T_k$ is the task identifier. The goal is to learn a model $f_\theta$ on the training task sequence $D^{tr}$ so that it performs well on the test set of all the learned tasks $D^{te} = \{D^{te}_1, D^{te}_2, \cdots, D^{te}_N\}$ without forgetting previously learned knowledge.

Existing work. The proposed approaches for task-aware CL can be categorized into: 1) maintaining a memory buffer that stores previous examples for future replay (Lopez-Paz & Ranzato, 2017; Shin et al., 2017; Chaudhry et al., 2019a; Riemer et al., 2019; Chaudhry et al., 2019b; Aljundi et al., 2019a; PourKeshavarzi et al., 2022; Arani et al., 2022); 2) using dynamic network architectures (Rusu et al., 2016; Fernando et al., 2017; Yoon et al., 2018; Qin et al., 2021; Miao et al., 2022) and remembering past knowledge by dynamically updated architectures; 3) enforcing regularization to slow down forgetting (Kirkpatrick et al., 2017; Zenke et al., 2017b; von Oswald et al., 2020; Liu & Liu, 2022; Raghavan & Balaprakash, 2021); and 4) modeling the parameter update uncertainty with Bayesian methods (Nguyen et al., 2018; Ebrahimi et al., 2020; Henning et al., 2021).
In this paper, we focus on memory-replay-based methods since they often achieve SOTA performance. Memory-based methods include experience replay (Chaudhry et al., 2019b), which jointly trains on the memory buffer data and the current mini-batch. Meta Experience Replay (MER) (Riemer et al., 2019) adopts meta-learning to maximize transfer from previous examples and minimize interference. Hindsight Anchor Learning (HAL) (Chaudhry et al., 2021) uses anchor points to mitigate forgetting on previous tasks. GEM (Lopez-Paz & Ranzato, 2017) and A-GEM (Chaudhry et al., 2019a) use the losses on the memory buffer data as inequality constraints, avoiding their increase but allowing their decrease to avoid forgetting. DER (Buzzega et al., 2020) further combines rehearsal with knowledge distillation.

C.2 TASK-FREE CL

Problem setup. Task-free CL (He et al., 2019; Zeno et al., 2019; Aljundi et al., 2019b; Chrysakis & Moens, 2020; Lee et al., 2020) is a recent generalization of CL to more complex cases, where a data distribution shift can happen at any time during CL without an explicit definition of tasks. A sequence of labeled mini-batches $(x_k, y_k, h_k)$ arrives sequentially, one at each timestamp $k$, and forms a non-stationary data stream, where $x_k$ denotes the mini-batch data received at timestamp $k$, $y_k$ is the data label associated with $x_k$, and $h_k$ is the hidden task identity associated with $x_k$. During both training and testing, the task identity $h_k$ is not available to the learner. A more general definition of task-free CL in (Aljundi et al., 2019b) assumes no explicit partitions of tasks, and the data distribution can change arbitrarily. Our proposed methods can be seamlessly applied to those more general scenarios as well.

Existing work. Most existing works in task-free CL are memory-replay-based methods (Chaudhry et al., 2019b; a). They directly perform memory replay on the raw data without any transformation. MIR (Aljundi et al., 2019a) proposes to replay the samples that are most interfered. GEN-MIR (Aljundi et al., 2019a) further uses generative models to synthesize the memory examples. Gradient-based Sample Selection (GSS) (Aljundi et al., 2019c) focuses on storing diverse examples, which is completely different from our method. Our work shares a similar goal with GMED (Jin et al., 2021), which edits the memory buffer based on ADA, making the memory data harder but less diverse (Madry et al., 2018; Wang et al., 2021). There are several significant differences. First, GMED is a gradient-based, discrete-step memory editing method. Our memory transformer can transform the memory data over infinitely many continuous time steps, which improves the memory diversity significantly.
Second, GMED overwrites the memory data with the edited ones making the memory buffer data distribution significantly deviate from the original raw memory data distribution after many epochs of editing, especially in task-aware CL, which would decrease the performance. In contrast, our transformation process is reversible, i.e., the original raw data can be recovered from the transformed data. Thus, we do not need to overwrite the memory buffer data or keep an additional mini-batch transformed data with extra cost.



https://github.com/aimagelab/mammoth



Figure 1: T-SNE visualization of existing memory-replay and proposed methods on CIFAR10. We use features extracted from the last layer output of ResNet18 as the input to T-SNE. We use four classes of memory data to illustrate the difference. T-SNE embeds each data point, and each color denotes one class of memory buffer data. (a): ER overfits very easily, and its memory data is easy to classify; (b): GMED with fewer editing steps has limited effectiveness due to the limited hardness; (c): GMED with more editing steps creates memory examples that are harder to classify but lack diversity; (d): DCMT (our method) yields better diversity, and the transformed memory data is hard to classify and overfit.

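1: REQUIRE: memory transformer parameters $\phi$, transformation time interval $[0, T]$, final transformation state at time $T$, i.e., $x^m(T)$, loss gradient $\frac{\partial L}{\partial x^m(T)}$.
2: $s_0 = [x^m(T), \frac{\partial L}{\partial x^m(T)}, 0_{|\phi|}]$; calculate the adjoint state $a_t = \frac{dL}{dx^m(t)}$
3: def aug-dynamics($[x^m(t), a_t, \cdot]$):
4: return $[g(x^m(t), t, \phi), -a_t^\top \frac{\partial g}{\partial x^m}, -a_t^\top \frac{\partial g}{\partial \phi}]$
5: $[x^m(0), \frac{\partial L}{\partial x^m(0)}, \frac{\partial L}{\partial \phi}]$ = ODEsolver($s_0$, aug-dynamics, $T$, $0$)
6: Return $\frac{\partial L}{\partial x^m(0)}$, $\frac{\partial L}{\partial \phi}$

B.4 TASK-AWARE (CLASS-IL AND TASK-IL) MEMORY SIZE 2000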

Figures 3 and 4 show the transformation results at different times t. With a longer transformation time, the transformed data become more diverse when viewing all the data in the transformation interval [0, T]. SCMT brings more diversity than DCMT in terms of appearance. This suggests that a gradual and continuous transformation of memory data can significantly improve the memory data diversity.

Figure 3: Gradual memory transformation by DCMT on CIFAR10.

Figure 4: Gradual memory transformation by SCMT on CIFAR10.

Figure Experience Replay (ER) T-SNE visualization on CIFAR10. We use features extracted from the last layer output of ResNet18 as the input to T-SNE. We use four classes of memory data to illustrate the difference. T-SNE embeds each data point, and each color denotes one class of memory buffer data. Each figure is a plot at different times.

1: REQUIRE: $t_n$, $x^m_n$, $\hat{x}^m_n$, $\mu_n$, $\sigma_n$, $\Delta t$, Brownian motion $W$.
2: $t_{n+1} = t_n + \Delta t$
3: $\Delta W_n = W_{t_{n+1}} - W_{t_n}$

C PRELIMINARY AND RELATED WORK

C.1 TASK-AWARE CL

Problem setup. Task-aware CL focuses on the case where there are explicit task definitions during CL. Task/domain/class-incremental learning (van de Ven & Tolias, 2019) are the three most representative CL scenarios. We consider the problem of learning a sequence of tasks denoted as $D^{tr} = \{D^{tr}_1, D^{tr}_2, \cdots, D^{tr}_N\}$, where $N$ is the number of training tasks. The k-th task training data

Algorithm 1 Continuous Memory Transformation.
1: REQUIRE: model parameters $\theta$, memory transformer parameters $\phi$, CL model learning rate $\eta$, memory transformer learning rate $\beta$; transformation time $T$ at each iteration, memory buffer $M$; $K$ is the number of iterations during the training process for both task-aware and task-free CL.
2: for $k = 1$ to $K$ do

• • • tn = T , i.e. x m (ti) by Eq.

Task-IL and class-IL results on CIFAR10, CIFAR-100 and MiniImagenet, respectively with memory size 500. '-' indicates not applicable.

Joint train  92.20 ± 0.15  98.31 ± 0.12  71.32 ± 0.21  91.31 ± 0.17  65.56 ± 0.18  87.74 ± 0.15
± 1.36  93.88 ± 0.50  36.37 ± 0.85  75.64 ± 0.60  22.09 ± 0.63  61.26 ± 0.57
DER++GMED  72.82 ± 1.79  93.94 ± 0.70  36.25 ± 0.69  75.49 ± 0.64  22.21 ± 0.81  61.42 ± 0.64
DER++noise  72.41 ± 1.83  93.96 ± 0.74  36.02 ± 0.91  75.38 ± 0.68  22.32 ± 0.87  61.51 ± 0.71
DER++DT(Ours)  73.77 ± 1.79  94.68 ± 0.82  37.53 ± 0.97  77.21 ± 0.79  23.45 ± 0.93  63.03 ± 0.75
DER++DCMT(Ours)  74.86 ± 1.53  95.29 ± 0.54  38.68 ± 0.81  78.56 ± 0.82  24.53 ± 0.89  64.08 ± 0.78

Ablation Study. (1) Due to space limitations, we put the hyperparameter analysis, including λ, T, etc., in Appendix B.6. (2) Memory size 2000 in Appendix B.4. (3) Memory visualization in Appendix B.7.

Task-free CL results on CIFAR10, CIFAR-100 and MiniImagenet, respectively.

Task-free CL results on CIFAR-100 and MiniImagenet with different memory size

Chen Zeno, Itay Golan, Elad Hoffer, and Daniel Soudry. Task agnostic continual learning using online variational bayes. https://arxiv.org/abs/1803.10123, 2019.

Fei Zhu, Xu-Yao Zhang, Chuang Wang, Fei Yin, and Cheng-Lin Liu. Prototype augmentation and self-supervision for incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5871-5880, June 2021.

Results for Task-free CL iid online and iid offline training on CIFAR10, CIFAR-100 and MiniImagenet, respectively.

Task-IL and class-IL results on CIFAR-100 and MiniImagenet, respectively with memory size 2000

Backward Transfer of various methods with memory size 500.

Effect of integration (transformation) time T for combining DER++ with DCMT on CIFAR10, CIFAR-100 and MiniImagenet, respectively.

Effect of regularization weight λ for combining DER++ with DCMT on CIFAR10, CIFAR-100 and MiniImagenet, respectively.

shows the running time evaluation of different methods on CIFAR-100 for one epoch training.

Formal analysis of computation complexity of different methods.

Performance versus efficiency evaluation (wall clock time) on CIFAR-100 for one epoch training. '--' indicates not applicable.

Task-IL and class-IL results on Tiny-ImageNet, respectively with memory size 500

