ONLINE BOUNDARY-FREE CONTINUAL LEARNING BY SCHEDULED DATA PRIOR

Abstract

Typical continual learning setups assume that the dataset is split into multiple discrete tasks. We argue that this is unrealistic, as streamed real-world data has no notion of task boundaries. Here, we take a step toward a more realistic online continual learning setting: learning a continuously changing data distribution without explicit task boundaries, which we call the boundary-free setup. Due to the lack of boundaries, it is not obvious when and what past information should be preserved to better address the stability-plasticity dilemma. To this end, we propose a scheduled transfer of previously learned knowledge. In addition, we propose a data-driven balancing between past and present knowledge in the learning objective. Moreover, since previously proposed forgetting measures are not straightforward to use without task boundaries, we further propose novel forgetting and knowledge-gain measures based on information theory. We empirically evaluate our method on a Gaussian data stream and its periodic extension, which is frequently observed in real-life data, as well as on the conventional disjoint task-split. Our method outperforms prior arts by large margins in various setups, using four benchmark datasets from the continual learning literature: CIFAR-10, CIFAR-100, TinyImageNet, and ImageNet. Code is available at https://github.com/yonseivnl/sdp.

Motivated by a real data stream that changes continuously (e.g., the Google search trends of 'swimsuit', 'coat', and 'Christmas gift' over 13 years, depicted in Fig. 1), we propose a new CL setup called online boundary-free CL. We argue that our setup is more realistic than the task-split setup for the following reasons: (1) an ever-changing distribution with a (periodic) Gaussian online stream, (2) no notion of explicit task boundaries, and (3) any-time inference for online continual learning.

1. INTRODUCTION

In real-world continual learning (CL) scenarios (He et al., 2020), data arrive in a streamed manner (Aljundi et al., 2019a; Cai et al., 2021), whereas typical continual learning setups split the data into multiple discrete tasks whose data distributions differ from one another. Moreover, most CL algorithms are studied in an offline CL setup (Kirkpatrick et al., 2017; Rebuffi et al., 2017; Saha et al., 2021), where the model can access the data multiple times. While prevalent in the literature, this setup has a number of issues that make it far from realistic. Although the task setup has been partly addressed by (Prabhu et al., 2020; Koh et al., 2021; Kim et al., 2021b; Bang et al., 2022), the revised setups still retain the notion of task boundaries, whereas real-world data may not have explicit task boundaries, as the data distribution changes continuously. Even though many methods update the model in a boundary-agnostic manner, called task-free CL (Aljundi et al., 2019b; Koh et al., 2021), they still leverage the notion of task boundaries for knowledge transfer and evaluation, e.g., exploiting the fact that the distribution shift in the data stream occurs only at task boundaries. In addition, the definition of forgetting depends on the notion of 'old' and 'new' tasks, which are defined by the task boundary. We argue for an online CL setup where data are learned online (allowing only a single access to each datum) under continuous distribution shift without explicit task boundaries. We refer to this setup as online boundary-free continual learning. In this setup, a small set of data is streamed to the model one by one, and the model only has access to the current data batch (Aljundi et al., 2019c; a), without any notion of task boundaries.
For the distribution of a continuous data stream, following (Shanahan et al., 2021; Wang et al., 2022), we consider the Gaussian distribution as an instance of data-streaming distributions. The Gaussian online data stream models the frequency of each class as a Gaussian distribution over time. Note that classes do not recur after their initial Gaussian mode in this setup. However, in real-world data, the frequency of a class may have multiple recurring modes over time rather than a single mode, as depicted in Fig. 1. To further address such a scenario, we investigate a periodic-Gaussian online stream, where each class recurs periodically. To the best of our knowledge, this is the first work to study CL on continuous data distributions, periodic or not. The boundary-free setup poses several challenges. In CL setups with explicit task boundaries, methods using both episodic memory and distillation (i.e., using a data prior (Buzzega et al., 2020; Wu et al., 2019; Hou et al., 2019)) show compelling performance (Masana et al., 2020). They store the model weights at each task boundary and use the stored model as a distillation teacher to mitigate catastrophic forgetting. In a continuous data stream, however, it is challenging to determine which past models should be stored to serve as a data prior. To decide which data prior to transfer knowledge from, we propose to combine different exponential moving average (EMA) distributions to obtain a particular schedule for transferring past knowledge. In addition, as the past knowledge now comes from diverse contexts, it is not trivial to balance the supervisory signals from the past and the present. Instead of using a fixed balancing hyperparameter, we propose to learn to balance them for better generalization across multiple scenarios, i.e., datasets.
In our empirical studies, we observe that our method outperforms comparable prior arts on Gaussian, periodic-Gaussian, and disjoint task-split data streams on 4 popular benchmarks in the CL literature. Moreover, conventional performance metrics for CL methods, including forgetting, are not trivially applicable to our setup, as they are defined with respect to task boundaries. Here, we propose a new metric for measuring forgetting based on information theory. In contrast to the conventional forgetting metric, it captures the loss and gain of intra-class knowledge, which is appropriate for periodic data distributions, where the model must accumulate different knowledge about the same class over multiple periods. We summarize our contributions as follows:
• Extensively studying online CL on continuous data streams without explicit task boundaries, including the newly proposed periodic CL setup.
• Proposing an online boundary-free CL method that uses a scheduled transfer of past knowledge.
• Proposing to learn to balance the use of past and present knowledge.
• Proposing new metrics that measure the loss of past knowledge (i.e., forgetting) and the gain of new knowledge (i.e., the opposite of intransigence) based on information theory.

2. RELATED WORK

Setups for Continual Learning. With the increasing popularity of CL, there have been several proposals to make learning configurations more realistic. As the first task setup to mimic a real-world data stream that continuously changes over time, prior arts employed the notion of task-split, where the entire dataset is split into multiple subsets forming different tasks. In recent literature, however, there have been efforts to question whether the task-split setup is realistic. To enforce a different class distribution for each split, the disjoint task-split confines each class to be assigned to only a single task (Castro et al., 2018). As the disjoint setup is rather artificial, since a real data stream arrives in a class-agnostic manner, the blurry task-split allows every task to share all classes but with different dominance (Aljundi et al., 2019c; Bang et al., 2021). The i-Blurry task-split further guarantees that some classes are added incrementally to the blurry task-split (Koh et al., 2021). However, these task configurations have explicit task boundaries, which is still artificial. For a more realistic scenario, task-free CL (Aljundi et al., 2019b) has been studied, where models are not allowed to use task-boundary information during training. However, such methods are still trained and evaluated on task-split setups such as disjoint, blurry, and i-Blurry; thus, task-free is more of a restriction on methods than a setup. In contrast to the prior work, we propose a boundary-free setup. It removes the notion of artificial task boundaries; instead, data arrival follows a continuous distribution over time. Although (Shanahan et al., 2021; Wang et al., 2022) investigate data streams following a Gaussian distribution, they still split the data into micro-tasks and neglect the periodic nature of real-world data, which we study further here.
Online Continual Learning Methods. One goal of a successful CL learner is not to forget knowledge obtained from preceding tasks; catastrophic forgetting has become the main challenge for deep neural networks in CL (French, 1999). To mitigate the issue, there are four main directions in the recent literature: distillation, memory replay, parameter isolation, and regularization. For a more comprehensive review, we refer the reader to surveys (De Lange et al., 2021; Mai et al., 2022). The key difference between online and offline CL is the accessibility of streaming inputs; the model sees the entire stream only once, except for the samples kept in the episodic memory. To tackle this online constraint, several methods have been developed along lines similar to offline CL. GEM (Lopez-Paz & Ranzato, 2017) leverages the gradients of samples in the available memory to alleviate forgetting the knowledge of previous tasks. A-GEM (Chaudhry et al., 2018b) instead utilizes the average gradient for each task rather than the projection onto all gradients, which further saves memory and reduces computational cost. GDumb (Prabhu et al., 2020) proposes greedy balanced selection, which randomly selects samples while balancing the number of selected samples per class. GSS (Aljundi et al., 2019c) and RM (Bang et al., 2021) utilize the gradient and the uncertainty of each sample, respectively, to increase the diversity of selected samples. CLIB (Koh et al., 2021) keeps informative samples in the memory by discarding the least informative ones for further training. Unlike in offline CL, however, methods utilizing knowledge distillation are understudied in online CL. DER (Buzzega et al., 2020) applies the well-known and widely used knowledge distillation between the logits of an original image and its augmented version.
In our proposed setup, which is boundary-free, it is not trivial to determine when and what knowledge should be transferred to mitigate the forgetting. Here, we propose a new knowledge distillation method which is suitable for online boundary-free CL by using scheduled data prior.

3.1. GAUSSIAN ONLINE STREAM

For ease of modeling, we assume that a real data stream for each class follows a Gaussian distribution. We first consider a single-mode stream and then extend it to a multi-modal periodic stream. Gaussian-distributed task modeling has been addressed by Shanahan et al. (2021); Wang et al. (2022). However, their settings differ from ours, since their distribution change is not fully continuous: they still split data into multiple micro-tasks and overlook the periodicity of the online stream. Specifically, we model the 'arrival time' of samples with a Gaussian distribution. To create an online stream without task boundaries, samples are streamed in increasing order of their arrival times. Formally, let the distribution of arrival times for class i follow the Gaussian distribution N(µ_i, σ²). For each class i, the mean µ_i (= mode) of the distribution is exclusively chosen from {0, 1/N, ..., (N−1)/N}, where N is the number of classes, and the standard deviation σ is the same for all classes for simplicity (we use σ = 0.1 for our empirical validations and provide an analysis of different values in the appendix).

Periodic Gaussian Online Stream. By a simple modification, we extend the Gaussian stream to a periodic one. The distribution of arrival times of class i now not only follows a Gaussian distribution but is also repeated multiple times:

(1/R) · Σ_{r=0}^{R−1} N(µ_i + r, σ²),

where R is the number of repetitions (i.e., periods; we use R = 5 in our experiments). The mean µ_i and standard deviation σ are the same as in the non-periodic Gaussian online stream.
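As an illustration, the (periodic) Gaussian arrival-time model above can be simulated in a few lines. The following is a minimal NumPy sketch; the function name and the per-class sample count are our own assumptions, not part of the paper's released code:

```python
import numpy as np

def gaussian_stream(samples_per_class, n_classes, sigma=0.1, periods=1, seed=0):
    """Order samples by Gaussian-distributed arrival times (sketch of Sec. 3.1)."""
    rng = np.random.default_rng(seed)
    labels, times = [], []
    for i in range(n_classes):
        mu = i / n_classes                     # class modes spread over [0, 1)
        t = rng.normal(mu, sigma, size=samples_per_class)
        if periods > 1:                        # periodic extension: equal-weight
            t += rng.integers(0, periods, size=samples_per_class)  # mixture of R shifts
        labels.append(np.full(samples_per_class, i))
        times.append(t)
    labels, times = np.concatenate(labels), np.concatenate(times)
    order = np.argsort(times)                  # stream in increasing arrival time
    return labels[order]

stream = gaussian_stream(samples_per_class=100, n_classes=10, periods=5)
```

Sorting by sampled arrival times yields a stream whose class frequencies drift continuously, with no point at which a task boundary could be declared.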

4. EVALUATION METRICS FOR CONTINUAL LEARNERS IN THE BOUNDARY-FREE DATA STREAM

For evaluating the overall performance in boundary-free CL, we use the area under the curve of accuracy (A_AUC), proposed in (Koh et al., 2021), which measures the area under the accuracy-to-(# of samples) curve, and the last accuracy (A_last), which measures the final accuracy after learning all samples. We cannot measure the A_avg metric, since there is no task boundary. Previous CL research uses the Forgetting and Intransigence metrics (Chaudhry et al., 2018a) to investigate the stability and plasticity of CL algorithms. However, these metrics require measuring the accuracy of previous tasks, so they are not readily measurable in the boundary-free setup due to the lack of task boundaries. Here, we propose new metrics for measuring forgetting and the ability to learn new knowledge by computing the loss and gain of knowledge based on information theory.

Knowledge Loss Ratio. Specifically, we want to measure the loss and gain of knowledge between two arbitrary points in training, t_1 and t_2. Let Y_GT be the ground-truth label of a randomly selected sample from the data distribution, and let Y_t be the model's prediction for that sample at time t. We define the Total Knowledge TK(t) at time t as:

TK(t) := I(Y_t; Y_GT),

where I(X; Y) = Σ_{y∈Y} Σ_{x∈X} P(x, y) log [P(x, y) / (P(x)P(y))] is the mutual information, which measures the quantity of information one variable has about the other. In other words, it measures how much information about the ground truth we can obtain by observing the model's prediction at time t. To measure the loss of knowledge between t_1 and t_2, we quantify the knowledge in Y_t1 but not in Y_t2. The knowledge loss can be measured as the difference between the knowledge from having both outputs (Y_t1, Y_t2) and having only Y_t2. Thus, we define the Knowledge Loss KL(t_1, t_2) between t_1 and t_2 as:

KL(t_1, t_2) := I(Y_t1, Y_t2; Y_GT) − I(Y_t2; Y_GT) = I(Y_t1; Y_GT | Y_t2),

where I(X; Y|Z) = I(X, Z; Y) − I(Z; Y) is the conditional mutual information.
By dividing the knowledge loss by the total knowledge at t_1, we obtain the Knowledge Loss Ratio KLR(t_1, t_2):

KLR(t_1, t_2) := KL(t_1, t_2) / TK(t_1) = I(Y_t1; Y_GT | Y_t2) / I(Y_t1; Y_GT).

KLR measures the ratio of the past knowledge that was lost between t_1 and t_2, which we use as an equivalent measure of forgetting in a continuous data stream.

Knowledge Gain Ratio. Similarly, we define the Knowledge Gain (KG) between t_1 and t_2 as:

KG(t_1, t_2) := I(Y_t1, Y_t2; Y_GT) − I(Y_t1; Y_GT) = I(Y_t2; Y_GT | Y_t1). (5)

Similar to KLR, we define the Knowledge Gain Ratio (KGR) as the fraction of the potentially obtainable knowledge that was obtained between t_1 and t_2. The amount of information in the GT label is H(Y_GT), where H(Y) = −Σ_{y∈Y} P(y) log P(y) is the entropy, so the potentially obtainable knowledge at time t_1 is H(Y_GT) − TK(t_1). Thus, KGR(t_1, t_2) is defined as:

KGR(t_1, t_2) := KG(t_1, t_2) / (H(Y_GT) − TK(t_1)) = I(Y_t2; Y_GT | Y_t1) / (H(Y_GT) − I(Y_t1; Y_GT)).

Implications. We illustrate the implications of the defined KL and KG in a Venn diagram of the GT label and the model outputs at t_1 and t_2. One advantage of KL is that it can be interpreted as intra-class forgetting. Since conventional 'forgetting' is measured using task-wise or class-wise accuracy, if forgetting and knowledge gain happen simultaneously within a task or a class, their net effect is zero and thus goes unmeasured. For example, if a class A has features a and b, and the model forgets feature a while learning feature b so that the overall accuracy for class A stays the same, this counts as zero forgetting under the conventional measure. In contrast, under the proposed Knowledge Loss (KL), the current model only has information about feature b while the past and current models combined have information about both a and b, so KL captures the loss of information about feature a.
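The quantities above reduce to simple operations on an empirical joint distribution of (Y_GT, Y_t1, Y_t2). Below is a minimal NumPy sketch of KLR and KGR, under the assumption that all probabilities are estimated by counting test-set predictions; the function names are ours:

```python
import numpy as np

def mutual_info(joint):
    """I(X; Y) in nats from a 2-D joint probability table P(x, y)."""
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px * py)[nz])).sum())

def klr_kgr(y_gt, y1, y2, n_classes):
    """KLR and KGR between predictions y1 (time t1) and y2 (time t2)."""
    joint3 = np.zeros((n_classes,) * 3)        # empirical P(y_gt, y_t1, y_t2)
    for g, a, b in zip(y_gt, y1, y2):
        joint3[g, a, b] += 1
    joint3 /= joint3.sum()
    i_gt_1 = mutual_info(joint3.sum(2))        # I(Y_t1; Y_GT) = TK(t1)
    i_gt_2 = mutual_info(joint3.sum(1))        # I(Y_t2; Y_GT) = TK(t2)
    # I(Y_t1, Y_t2; Y_GT): flatten (y_t1, y_t2) into a single joint variable
    i_gt_12 = mutual_info(joint3.reshape(n_classes, -1))
    p_gt = joint3.sum((1, 2))
    h_gt = -float((p_gt[p_gt > 0] * np.log(p_gt[p_gt > 0])).sum())  # H(Y_GT)
    kl = i_gt_12 - i_gt_2                      # I(Y_t1; Y_GT | Y_t2)
    kg = i_gt_12 - i_gt_1                      # I(Y_t2; Y_GT | Y_t1)
    return kl / i_gt_1, kg / (h_gt - i_gt_1)
```

If the two predictions are identical, both conditional mutual informations vanish, so KLR = KGR = 0, matching the intuition that nothing was lost or gained.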

5. APPROACH

The existing task-split setup partitions the data stream into multiple discrete tasks, and prior arts in the CL literature developed in this setup focus on extracting information from models learned in previous tasks. In the online boundary-free setup, however, where the data stream is not partitioned into explicit tasks, we need to consider from which previous moment information should be transferred to the current time step for a better stability-plasticity trade-off. Although there is no task boundary, a model learned in the past can be stored as each sample arrives and used as a data prior, i.e., a teacher in a distillation framework. However, it is not clear which previous models should be used to transfer knowledge to the current time step. We consider various weighting functions for transferring information from the previous data stream in an online and continuous fashion, and summarize the empirical results in Table 1.

5.1. SCHEDULED DATA PRIOR

To determine the amount of past knowledge to be transferred in a continuous fashion, we schedule the transfer of past knowledge using a composite function of exponential moving averages (EMA). EMA calculates a weighted average in an online manner, with exponentially decaying weights that place higher weight on recent data points. The EMA model θ_α(t) with EMA ratio α at timestep t is defined recursively as:

θ_α(t) = (1 − α)·θ_α(t − 1) + α·θ(t),

where θ(t) is the online model's parameters at time t. In particular, the EMA update emphasizes recent knowledge over past knowledge through an exponentially decreasing weight distribution, as shown in Fig. 3-(a). However, we argue that recently learned information is still well maintained in the model currently being trained (i.e., not yet forgotten), and that the focus should shift to the slightly farther past (as depicted in Fig. 3-(b)). To implement such a weighting scheme, we propose a hypo-exponential distribution, which has a skewed bell-curve shape, for transferring knowledge from the past. We configure the hypo-exponential distribution as a weighted average of two EMA curves with different hyperparameters. The resulting distribution has its mode in the more distant past, whereas the vanilla EMA has its mode at the most recent step with monotonically decreasing weights. Specifically, to construct the scheduled data prior (SDP), we take a weighted sum of two EMA models θ_α(t) and θ_β(t) with EMA ratios α and β, where α > β. Using the coefficients of the hypo-exponential distribution, the proposed SDP model θ_SDP(α,β)(t) is defined as:

θ_SDP(α,β)(t) = α/(α − β)·θ_β(t) − β/(α − β)·θ_α(t).

Its weights are non-negative, sum to 1, and form a skewed bell curve, as depicted in Fig. 3-(b). Instead of using α and β directly, we parameterize SDP with the mean µ and the squared coefficient of variation c² = σ²/µ² of the weight distribution, where σ² is its variance, so that the hyperparameters are interpretable.
The values of α and β can be calculated from µ and c² as:

α = (1 + √(2c² + 2/µ − 1)) / (µ(1 − c²) − 1),    β = (1 − √(2c² + 2/µ − 1)) / (µ(1 − c²) − 1),

where α and β are positive real values when 1/2 − 1/µ < c² < 1 − 1/µ. The derivation can be found in the appendix.
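For concreteness, the conversion from the interpretable hyperparameters (µ, c²) to the EMA ratios (α, β), and the parameter-wise combination into θ_SDP, can be sketched as follows. This is a minimal illustration: the function names are ours, and the θ's are plain arrays rather than network weights:

```python
import math
import numpy as np

def sdp_ratios(mu, c2):
    """EMA ratios (alpha, beta) from the mean mu and squared coefficient of
    variation c2 of the SDP weight distribution; valid when
    1/2 - 1/mu < c2 < 1 - 1/mu (Sec. 5.1)."""
    root = math.sqrt(2 * c2 + 2 / mu - 1)
    denom = mu * (1 - c2) - 1
    return (1 + root) / denom, (1 - root) / denom

def sdp_combine(theta_alpha, theta_beta, alpha, beta):
    """theta_SDP = alpha/(alpha-beta) * theta_beta - beta/(alpha-beta) * theta_alpha."""
    c = alpha - beta
    return alpha / c * theta_beta - beta / c * theta_alpha

alpha, beta = sdp_ratios(mu=20.0, c2=0.7)
# sanity check: the mean of the SDP weight distribution is 1/alpha + 1/beta = mu
assert abs(1 / alpha + 1 / beta - 20.0) < 1e-9
```

Note that when both EMA models hold the same parameters, the combination returns them unchanged, since the two coefficients sum to 1.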


With the learnable balancing parameter λ_t, we write the final objective as:

L(x, y) = λ_t · L_CE(x, y) + (1 − λ_t) · η_k · L_KD(x),

where L_CE(x, y) is the cross-entropy loss for the current time step's knowledge, and L_KD(x) = ||f(x) − f_θSDP(x)||²_2 is the L2 distillation loss for past knowledge transfer, with f_θSDP being the neural network with parameters θ_SDP obtained by equation 5. Before balancing the two terms, since the parameter update is proportional to the gradient, we normalize the scale of the two terms by η_k:

η_k = |∇_f L_CE| / |∇_f L_KD|,

where f is the feature layer of the model and k is a batch index. Note that we use the gradient norm at the feature layer, since the norms at earlier layers are proportional to it by the chain rule. λ_t is typically defined as |C_new| / (|C_new| + |C_old|) (Wu et al., 2019), where C_new denotes the new classes, i.e., classes in the current task, and C_old the old classes, i.e., classes from previous tasks. In the online boundary-free setup, unfortunately, since tasks are not defined, the notion of new or old is not available for each class. Instead of a hard assignment of new and old classes, we measure 0 ≤ γ_i ≤ 1, which represents how new class i is. γ_i is measured by the inverse of the past model's average confidence over samples of class i, since the model will have low confidence on new classes. If a past model predicts class i with p(i) = 1, we consider that the model has completely learned class i, so γ_i = 0. If it predicts class i with p(i) ≤ 1/N, where N is the total number of classes, i.e., no better than a random model, we consider that the model has learned nothing about class i, so γ_i = 1. To compute γ_i, we use the samples' confidence under the currently learned model at the current time step, before using them for training, as a proxy for validation accuracy.
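The class-wise 'newness' γ_i and the resulting balancing parameter can be sketched as below. This is a minimal illustration with a linear interpolation between the two endpoints described above (γ_i = 0 at full confidence, γ_i = 1 at or below chance); the helper names are ours, and for simplicity the sketch assumes one confidence value per class:

```python
def gamma(p_i, n_classes):
    """Newness of class i from the past model's mean confidence p_i:
    1 at (or below) chance level 1/N, 0 at full confidence, linear in
    between (continuous at p_i = 1/N)."""
    if p_i < 1.0 / n_classes:
        return 1.0
    return n_classes / (n_classes - 1) * (1.0 - p_i)

def lambda_balance(class_confidences):
    """Balancing parameter: average newness over all classes
    (assumes class_confidences holds one entry per class)."""
    n = len(class_confidences)
    return sum(gamma(p, n) for p in class_confidences) / n
```

A stream full of unseen classes gives λ near 1 (mostly cross-entropy), while well-learned classes push λ toward 0 (mostly distillation), which matches the intended behavior.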
Finally, we define the online balancing parameter λ_t by averaging γ_i over all classes at the current time step:

λ_t = (1/N) · Σ_{i=0}^{N−1} γ_i,    where γ_i = 1 if p(i) < 1/N, and γ_i = (N/(N−1))·(1 − p(i)) otherwise.

6.1. RESULTS ON GAUSSIAN DATA STREAM

As shown in Table 2, our method outperforms the other baselines on all benchmark datasets. In particular, the last accuracy (A_last) exceeds that of the other baselines by large margins, which verifies that our method can quickly adapt to the current task in the online boundary-free setup. In addition, the lower KLR score shows that our method is more robust to forgetting.

6.2. RESULTS ON PERIODIC GAUSSIAN DATA STREAM

The results of our method and the other baselines on the periodic-Gaussian online stream are displayed in Table 3. Our method is superior to the other baselines on various datasets, with higher accuracies (A_AUC, A_last). It is noteworthy that our forgetting scores (KLR) are considerably lower than those of the other baselines. We believe our method can transfer information from the appropriate past, and is thus advantageous when the data distribution repeats periodically.

6.3. RESULTS ON DISJOINT TASK SPLIT

Although our method is designed for the boundary-free setup, it is not specific to it, so we also compare our method to prior arts in the disjoint setup. We summarize the results on 5-split-CIFAR-10 (i.e., 10 classes partitioned into 5 tasks) and 5-split-CIFAR-100 in Table 4 under the conventional task-split setup. As shown in the table, the proposed SDP outperforms the other methods even in setups with clear task boundaries.

6.4. ABLATION STUDY

We now ablate the two components of our method and summarize the results in

7. CONCLUSION

Toward a more realistic continual learning scenario, we propose a continual learning setup with a continuous data stream and no explicitly defined task boundaries, called boundary-free continual learning, with the constraint of learning the data in an online manner. We also investigate the periodicity of the data distribution. As the existing evaluation metrics for continual learning assume explicit task boundaries, they are not readily applicable in the boundary-free setup. Thus, we propose two new information-theoretic evaluation metrics that quantify the amount of information lost and gained over the data stream, named knowledge loss and knowledge gain. To incrementally update the model on a continuous data stream, we propose a method that leverages previously learned knowledge through a skewed bell-shaped weighting function in a distillation framework, named scheduled data prior. For further generalization across benchmarks, we propose to balance the pace of learning past and present knowledge in a data-driven manner. In our empirical evaluations, the proposed method outperforms comparable prior arts that update the weights per data batch, both in the proposed boundary-free setup and in the conventional disjoint task-split setup. Limitations. Since our method does not leverage any notion of periodicity, it could be further improved by exploiting the fact that the data stream is periodic, even though the duration of the period may be unknown.

ETHICS STATEMENT

Continual learning (CL), including our approach, aims to address real-world scenarios where data distributions change continuously in a non-stationary manner. Therefore, it can alleviate some ethical limitations of models resulting from a lack of up-to-date knowledge, through continuous training and evaluation on newly observed data. This effort can be especially effective for recent large language models (Brown et al., 2020; Kim et al., 2021a; Chowdhery et al., 2022) or foundational vision models (Bommasani et al., 2021). On the other hand, CL methods might be vulnerable to model bias (Rae et al., 2021) caused by the various biases inherent in the changing data they continuously track; this may expose the public to discrimination by deployed deep models owing to such unsolved issues in deep learning. This ethical issue is a future research topic.

REPRODUCIBILITY STATEMENT

We take reproducibility in deep learning very seriously and highlight some of the contents of the manuscript that help in reproducing our work. We will release our implementations, learned models, and the newly derived datasets used in our experiments, as mentioned in Sec. 6, where we also include the relevant implementation details. Finally, we present our final optimization objective (Eq. 10) with the details necessary to reproduce the methodology.

A.2 DERIVATION OF SDP RATIOS FROM MEAN AND VARIANCE

The weights of EMA are the same as the pmf of a geometric distribution, with the k-th weight w_α(k) obtained as:

w_α(k) = α(1 − α)^(k−1),    (13)

where α is the EMA ratio. Using the known mean and variance of the geometric distribution, the mean µ_α of EMA is obtained as:

µ_α = Σ_k k·w_α(k) = 1/α.    (14)

The variance σ²_α is:

σ²_α = Σ_k k²·w_α(k) − µ²_α = (1 − α)/α².    (15)

Thus,

Σ_k k²·w_α(k) = (1 − α)/α² + 1/α² = (2 − α)/α².    (16)

Using the definition of SDP in Eq. 5, we obtain w_(α,β)(k), the k-th weight of SDP with two ratios α and β, as:

w_(α,β)(k) = α/(α − β)·w_β(k) − β/(α − β)·w_α(k).    (17)

The mean µ_(α,β) of SDP is calculated as:

µ_(α,β) = Σ_k k·w_(α,β)(k) = α/(α − β)·Σ_k k·w_β(k) − β/(α − β)·Σ_k k·w_α(k)
        = α/(β(α − β)) − β/(α(α − β)) = (α + β)/(αβ) = 1/α + 1/β.    (18)

Also, the variance σ²_(α,β) of SDP is calculated as:

σ²_(α,β) = Σ_k k²·w_(α,β)(k) − µ²_(α,β)
         = α(2 − β)/(β²(α − β)) − β(2 − α)/(α²(α − β)) − (α + β)²/(α²β²)
         = (α² + β² − αβ(α + β))/(α²β²).    (19)

Thus, the squared coefficient of variation c² = σ²/µ² is:

c² = ((α + β)² − αβ(α + β + 2))/(α + β)².    (20)

Recall that

µ_(α,β) = (α + β)/(αβ).    (21)

Solving these equations for α + β and αβ by substitution, we get:

αβ = 2/(µ²(1 − c²) − µ),    (22)
α + β = 2µ/(µ²(1 − c²) − µ).    (23)

Solving these equations for α and β using the quadratic formula, we get:

α = (1 + √(2c² + 2/µ − 1)) / (µ(1 − c²) − 1),
β = (1 − √(2c² + 2/µ − 1)) / (µ(1 − c²) − 1).    (24)

4: Update M ← GreedyBalancingSampler(M, (x, y))    ▷ Update memory
5: θ_SDP(t) = α/(α − β)·θ_β(t) − β/(α − β)·θ_α(t)
▷ Calculate SDP model parameters
6: Update γ_y ← UpdateConfidenceMean(f_θSDP(x), y)    ▷ Update class-wise confidence
7: Update λ_t ← (1/N)·Σ_{i=0}^{N−1} γ_i    ▷ Update balancing parameter λ_t
8: Sample (X, Y) ← RandomSample(M)    ▷ Get batch (X, Y) from memory
9: L_CE(X, Y) = CrossEntropyLoss(f_θ(X), Y)    ▷ Calculate cross-entropy loss
10: L_KD(X) = ||f(X) − f_θSDP(X)||²_2    ▷ Calculate distillation loss
11: η_k = |∇_f L_CE| / |∇_f L_KD|    ▷ Obtain batch balancing factor η_k
12: L(X, Y) = λ_t·L_CE(X, Y) + (1 − λ_t)·η_k·L_KD(X)    ▷ Calculate total loss
13: θ ← θ − µ·∇_θ L(X, Y)    ▷ Update model
14: Update θ_α ← (1 − α)·θ_α + α·θ    ▷ Update EMA models
15: Update θ_β ← (1 − β)·θ_β + β·θ
16: end for
17: Output f_θ

A.4 PSEUDOCODE FOR THE SDP FRAMEWORK

Algorithm 1 provides detailed pseudocode for SDP. As samples arrive in time order, the memory is updated by the Greedy Balancing Sampler (Prabhu et al., 2020). In addition, the balancing parameter λ_t is updated based on the confidence obtained by running inference on new samples with the SDP model, whose parameters are obtained as a weighted sum of the two EMA models' parameters. The updated SDP model serves as a data prior for distillation during training. After training the model on the sampled batch (X, Y), the EMA models f_θα and f_θβ are updated.
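Algorithm 1 can be condensed into a runnable toy loop. The sketch below substitutes a linear model with squared loss for the network, fixes λ_t, and omits the memory and the η_k normalization for brevity; all of these toy choices are ours, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta = rng.normal(size=d)          # online model parameters
theta_a = theta.copy()              # fast EMA (ratio alpha)
theta_b = theta.copy()              # slow EMA (ratio beta)
alpha, beta = 0.3, 0.05
lam = 0.5                           # balancing parameter (fixed in this toy)

for _ in range(100):
    x, y = rng.normal(size=d), 1.0
    # SDP teacher: weighted combination of the two EMA models
    theta_sdp = alpha / (alpha - beta) * theta_b - beta / (alpha - beta) * theta_a
    pred, pred_sdp = x @ theta, x @ theta_sdp
    # gradient of lam * L_task + (1 - lam) * L_KD for squared losses
    grad = lam * 2 * (pred - y) * x + (1 - lam) * 2 * (pred - pred_sdp) * x
    theta -= 0.01 * grad            # model update
    theta_a = (1 - alpha) * theta_a + alpha * theta   # EMA updates
    theta_b = (1 - beta) * theta_b + beta * theta
```

The slow EMA lags the fast one, so the teacher θ_SDP effectively distills a snapshot from the slightly farther past, as intended by the scheduled data prior.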

A.5 PSEUDOCODE FOR MEASURING KLR AND KGR

Algorithm 2 shows how KLR and KGR are measured in practice. We obtain predictions for the test set at two different timestamps, t_1 and t_2. Then we calculate the joint probability of the GT label, the prediction at t_1, and the prediction at t_2. KL and KG are calculated as conditional mutual information: lines 6 and 7 follow from the identity for conditional mutual information, I(X; Y|Z) = I(X, Z; Y) − I(Z; Y), and the equation for computing the mutual information.

We did not report GDumb, since GDumb does not train at training time. Instead, GDumb trains from scratch at inference time, so its computational cost depends on the frequency of inference queries. SDP shows a good trade-off between computational complexity and performance compared to the other methods.

In the previous experiments, we used a Gaussian distribution for the data stream, since data formed as the sum of independent random variables tends to follow a Gaussian distribution by the Central Limit Theorem. However, real-world data may not strictly follow a Gaussian distribution, so we also consider a more complex alternative. To test the performance of CL methods on a more complex data distribution, we use a Gaussian Mixture, i.e., a mixture of multiple Gaussian distributions. We model the distribution of each class as a mixture of two Gaussians, where the mean of each Gaussian is randomly sampled from [0, 1), the standard deviation of each Gaussian is randomly sampled from [0, 0.2), and the mixture weight for each class is randomly sampled from [0, 1). We also observe in Sec. A.16 that in real-world data, some classes may have a different period from others. To model such a scenario, we randomly assign a period to each class from {1, 2, 4}, where the length of the data stream is 4. For the distribution of each class, we use a Gaussian Mixture with two randomly selected Gaussians, as in the previous experiments.
In summary, each class distribution is a periodic two-component Gaussian mixture with randomly selected means, standard deviations, mixture weights, and periods. We visualize an example of such a distribution in Fig. 9, and summarize the results on this mixed-period Gaussian Mixture setup in Table 12. SDP outperforms the other methods even on the mixed-period Gaussian Mixture setup with complex data distributions, which indicates that SDP works well on various data distributions. We visualize the accuracy trends of the tested CL methods in Fig. 10. From the trends, we observe that SDP performs increasingly well as training progresses, since distillation becomes more important as the amount of knowledge in the past models increases. We also observe that performance in the early phases tends to depend on memory management and usage: methods that use balanced memory and train only on memory (i.e., CLIB, EMA, SDP) show much higher accuracy in the early phases than methods that use reservoir sampling and experience replay (i.e., ER, DER, MIR). This is likely because reservoir sampling is highly class-imbalanced in the early phases.
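The mixed-period Gaussian-Mixture stream summarized above can be sampled as in the following NumPy sketch; the parameter ranges follow the text, while the function name and per-class sample count are our own assumptions:

```python
import numpy as np

def mixed_period_stream(samples_per_class, n_classes, length=4, seed=0):
    """Labels ordered by arrival time under a per-class two-component
    Gaussian mixture with a class-specific period from {1, 2, 4}."""
    rng = np.random.default_rng(seed)
    labels, times = [], []
    for i in range(n_classes):
        mus = rng.uniform(0, 1, 2)                 # two component means
        sigmas = rng.uniform(0, 0.2, 2)            # two component stds
        w = rng.uniform(0, 1)                      # mixture weight
        period = rng.choice([1, 2, 4])             # class-specific period
        comp = rng.random(samples_per_class) < w   # pick a component per sample
        t = rng.normal(np.where(comp, mus[0], mus[1]),
                       np.where(comp, sigmas[0], sigmas[1]))
        # repeat the class with its own period across the stream length
        shifts = rng.integers(0, length // period, samples_per_class) * period
        t = (t + shifts) % length
        labels.append(np.full(samples_per_class, i))
        times.append(t)
    order = np.argsort(np.concatenate(times))
    return np.concatenate(labels)[order]
```

Classes with period 1 recur in every unit interval while period-4 classes appear in only one, so the resulting stream mixes fast- and slow-recurring concepts as described.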
Algorithm 2 Computing KLR and KGR
1: Input: test set (X, Y_GT), previous model output Y_1, current model f_θ(t_2), sets of possible labels Y_GT, Y_1, Y_2
2: Y_2 = arg max f_θ(t_2)(x) for x ∈ X ▷ Inference on the test set with the current model
3: for (y_gt, y_1, y_2) ∈ (Y_GT, Y_1, Y_2) do
4:   P(y_gt, y_1, y_2) = |{(y_gt, y_1, y_2) ∈ (Y_GT, Y_1, Y_2)}| / |Y_GT| ▷ Calculate joint probability
5: end for
6: KL(t_1, t_2) = Σ_{y_gt, y_1, y_2} P(y_gt, y_1, y_2) log [P(y_gt, y_1, y_2) P(y_2)] / [P(y_gt, y_2) P(y_1, y_2)] ▷ I(Y_GT; Y_1 | Y_2)
7: KG(t_1, t_2) = Σ_{y_gt, y_1, y_2} P(y_gt, y_1, y_2) log [P(y_gt, y_1, y_2) P(y_1)] / [P(y_gt, y_1) P(y_1, y_2)] ▷ I(Y_GT; Y_2 | Y_1)
8: TK(t_1) = Σ_{y_gt, y_1} P(y_gt, y_1) log [P(y_gt, y_1)] / [P(y_gt) P(y_1)] ▷ I(Y_GT; Y_1)
9: H(Y_GT) = -Σ_{y_gt} P(y_gt) log P(y_gt) ▷ Entropy of the ground truth
10: KLR(t_1, t_2) = KL(t_1, t_2) / TK(t_1) ▷ Obtain KLR
11: KGR(t_1, t_2) = KG(t_1, t_2) / (H(Y_GT) - TK(t_1)) ▷ Obtain KGR
12: Output: Y_2, KLR(t_1, t_2), KGR(t_1, t_2)

A.7 DISCUSSION ON THE ADDITIONAL MEMORY COST

Method      | Additional Memory    | Theoretical ImageNet (MB)
ER          | N_s·N_c·S            | 3,010.6
DER++       | N_s·N_c·S + N_s·N_c² | 3,090.6
ER-MIR      | N_s·N_c·S + N_θ      | 3,057.3
GDumb       | N_s·N_c·S            | 3,010.6
CLIB        | N_s·N_c·S            | 3,010.6
SDP (Ours)  | N_s·N_c·S + 2·N_θ    | 3,100.0

As training progresses, the reservoir memory becomes close to balanced; thus the gap between the two memory-management strategies closes in the later phases. However, in the mid-to-late phase, the effect of distillation comes into play, and the performance of the methods that use balanced memory (CLIB, EMA, SDP) diverges depending on the distillation method used, as SDP ≫ EMA > CLIB (no distillation).

This is of particular importance, as many real-world systems, such as e-commerce platforms, experience gradual domain shifts in the streamed data due to the natural temporal evolution of concepts (e.g., computers in 2020 look different from those in 2010), while also being subject to (periodically) changing data distributions. Note that in the original CLEAR-10 benchmark, the data distribution is uniform and stationary. We construct a Periodic Gaussian data stream with each bucket as a period, where the frequency of each class follows a Gaussian distribution within the bucket, as explained in Sec. 3.1. We report the results in Table 13. SDP outperforms other methods by significant margins, even though the Greedy Balancing Sampler we use for memory management is reported to be unsuitable for domain shifts (Mai et al., 2022).
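The KLR/KGR measurement of Algorithm 2 can be sketched in numpy as follows. This is our reading of the procedure, with KL and KG computed as the conditional mutual informations I(Y_GT; Y_1 | Y_2) and I(Y_GT; Y_2 | Y_1); the `klr_kgr` helper and its estimator details are our own, not the paper's implementation.

```python
import numpy as np

def klr_kgr(y_gt, y1, y2, n_classes):
    """Estimate KLR and KGR from ground truth and predictions at t1 and t2."""
    eps = 1e-12
    # empirical joint probability table P(y_gt, y1, y2) (Algorithm 2, line 4)
    p = np.zeros((n_classes,) * 3)
    for g, a, b in zip(y_gt, y1, y2):
        p[g, a, b] += 1.0
    p /= p.sum()

    def cond_mi(var_axis):
        # I(Y_GT; Y_var | Y_cond) = sum p * log[p * P(cond) / (P(gt,cond) * P(var,cond))]
        p_c = p.sum(axis=(0, var_axis), keepdims=True)   # P(y_cond)
        p_gc = p.sum(axis=var_axis, keepdims=True)       # P(y_gt, y_cond)
        p_vc = p.sum(axis=0, keepdims=True)              # P(y_var, y_cond)
        return float(np.sum(p * np.log((p * p_c + eps) / (p_gc * p_vc + eps))))

    kl = cond_mi(var_axis=1)          # line 6: I(Y_GT; Y_1 | Y_2)
    kg = cond_mi(var_axis=2)          # line 7: I(Y_GT; Y_2 | Y_1)

    p_g1 = p.sum(axis=2)              # joint of ground truth and t1 prediction
    p_g, p_1 = p_g1.sum(axis=1), p_g1.sum(axis=0)
    tk = float(np.sum(p_g1 * np.log((p_g1 + eps) / (np.outer(p_g, p_1) + eps))))
    h_gt = float(-np.sum(p_g * np.log(p_g + eps)))

    return kl / tk, kg / (h_gt - tk)  # lines 10-11: KLR and KGR
```

For example, if the t2 predictions match the ground truth exactly while the t1 predictions were only partially correct, nothing is lost (KLR ≈ 0) and all previously missing knowledge is gained (KGR ≈ 1).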
This indicates that the distillation component of SDP can effectively deal with domain shifts in the data stream.

A.16 EXAMPLES OF REAL-WORLD PERIODIC DATA

In addition to the examples mentioned in Figure 1, much search data follows a periodic distribution, as shown in Figure 11 and Figure 12. As we can see in Fig. 12, the search frequency of seasonal fruits and home appliances, which are greatly affected by the season, follows a periodic distribution with a period of one year. In the case of clothing, the period is 6 months, as we can see in Figure 11. A nearly identical distribution repeats in each period, with only slight differences between periods. To address these real-world distributions, we propose the periodic benchmark.



Figure 1: Search interest trends of three items (i.e., swimsuit, coat, Christmas gift) from 2009 to the present in Google Trends. Each item follows a periodic distribution with its own mode and duration.

Figure 2: Relation of the knowledge learned by the model to the Knowledge Loss (KL) and Knowledge Gain (KG). The upper, lower-left, and lower-right circles represent Y_GT, Y_1, and Y_2, respectively.

Figure 3: Comparison between the weight distributions of EMA and SDP (SDP weights with µ = 10,000 and c2 = 0.75).

Figure 4: Distributions of (a) Gaussian and (b) Periodic Gaussian data streams.

The resulting distribution with N = 4 is depicted in Fig. 4-(a). The periodic Gaussian distributions with N = 4 and R = 5 are illustrated in Fig. 4-(b).
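A minimal sketch of how such streams can be generated, under our own assumptions about the construction (the function names are ours): each class's samples arrive at times drawn from a Gaussian centered at that class's position in the stream, and the periodic variant repeats the single-period schedule R times.

```python
import numpy as np

def gaussian_stream(n_classes=4, samples_per_class=100, sigma=0.1, seed=0):
    """One period of a Gaussian data stream: arrival times and labels, time-ordered."""
    rng = np.random.default_rng(seed)
    times, labels = [], []
    for c in range(n_classes):
        mu = (c + 0.5) / n_classes            # class centers spread over [0, 1)
        t = rng.normal(mu, sigma, samples_per_class)
        times.append(np.clip(t, 0.0, 1.0))    # keep arrivals inside the period
        labels.append(np.full(samples_per_class, c))
    times, labels = np.concatenate(times), np.concatenate(labels)
    order = np.argsort(times)                 # the stream is ordered by arrival time
    return times[order], labels[order]

def periodic_gaussian_stream(n_classes=4, repeats=5, **kw):
    """Periodic variant: repeat the one-period schedule R times, shifted in time."""
    ts, ys = [], []
    for r in range(repeats):
        t, y = gaussian_stream(n_classes=n_classes, seed=r, **kw)
        ts.append(t + r)                      # shift each repetition by its index
        ys.append(y)
    return np.concatenate(ts), np.concatenate(ys)
```

With N = 4 and R = 5, this yields the kind of schedule illustrated in Fig. 4: four overlapping class-frequency bumps per period, repeated five times.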


Figure 5: A_AUC and A_last of various methods versus training time, on the Non-Periodic Gaussian data stream in CIFAR-10.

Figure 9: Data distribution of 4 randomly selected classes in the mixed-period Gaussian Mixture data distribution setup.

Figure 11: Search data with a 6-month period


Moreover, a fixed parameter is not always optimal, as the relative importance of classification and distillation may vary over the course of training. Thus, we propose to learn to balance them, for better generalization to different data contexts and adaptation across different phases of training.
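One minimal way to realize such data-driven balancing, sketched under our own assumptions rather than as the paper's exact formulation, is to treat the balance λ in L = λ·L_CE + (1 - λ)·L_distill as a learnable quantity: parameterize it through a sigmoid so it stays in (0, 1), and update it by gradient descent on the combined loss. The `LossBalancer` class and its update rule are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LossBalancer:
    """Toy data-driven balancing of L = lam * ce + (1 - lam) * distill."""
    def __init__(self, z0=0.0, lr=0.1):
        self.z = z0       # unconstrained parameter; lam = sigmoid(z)
        self.lr = lr

    @property
    def lam(self):
        return sigmoid(self.z)

    def combined(self, ce, distill):
        return self.lam * ce + (1.0 - self.lam) * distill

    def step(self, ce, distill):
        # dL/dz = (ce - distill) * lam * (1 - lam); descending this gradient
        # shifts weight toward whichever loss is currently smaller.
        lam = self.lam
        self.z -= self.lr * (ce - distill) * lam * (1.0 - lam)
        return self.lam
```

In this toy, if the distillation loss is consistently smaller than the classification loss, λ drifts below 0.5 and the distillation term receives more weight; in practice the balance would be coupled to model performance, as discussed in Sec. A.8.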

For evaluation metrics, we use the area under the accuracy curve (A_AUC) and the last accuracy (A_last) for overall performance, and the knowledge loss ratio (KLR) and knowledge gain ratio (KGR) for stability and plasticity, as defined in Section 4. KLR and KGR are measured every 10,000 samples for CIFAR-10 and CIFAR-100, every 20,000 samples for TinyImageNet, and every 100,000 samples for ImageNet. All results are averaged over 3 different random seeds, except for ImageNet (Bang et al., 2021; Koh et al., 2021) due to computational cost. We will publicly release the implementation of our method and the continually learned models.

Method Detail. SDP uses an episodic memory, updated by the Greedy Balancing Sampler (Prabhu et al., 2020). For training, we use only samples randomly selected from the memory, following (Koh et al., 2021). We provide a pseudocode of SDP in Appendix Sec. A.4. For the hyperparameters of SDP, see Sec. A.9.

Accuracy of the continually learned model on the Non-Periodic Gaussian data stream in CIFAR-10, CIFAR-100, TinyImageNet and ImageNet. The results except ImageNet are averaged over three random seeds.

Implementation Detail. We use ResNet-18 (He et al., 2016) as the network architecture for all experiments, and set the training hyperparameters following (Koh et al., 2021; Bang et al., 2021; Prabhu et al., 2020):

Dataset       | Batch size | Updates per sample | Memory size
CIFAR-10      | 16         | 1                  | 500
CIFAR-100     | 16         | 3                  | 2,000
TinyImageNet  | 32         | 3                  | 4,000
ImageNet      | 256        | 0.25               | 20,000



Accuracy of the continually learned model on the Periodic Gaussian data stream. The results except ImageNet are averaged over three random seeds.

Results of the disjoint task split on CIFAR-10 / CIFAR-100; each dataset is split into 5 tasks. We achieve 1.7∼5.6% higher A_AUC and 5.2∼8.4% higher A_last than the baseline. Since we did not apply adaptive loss balancing here, we use a fixed balancing parameter optimized by grid search on the CIFAR-10 non-periodic Gaussian stream. By applying adaptive loss balancing, we further improve performance, except for A_last in the CIFAR-100 periodic Gaussian setup; even there, our method still outperforms the baseline and other methods by large margins.

Ablation Study

A.13 ANALYSIS ON STANDARD DEVIATION OF GAUSSIAN DISTRIBUTION

Comparison of accuracy for different values of the standard deviation of the Gaussian data stream in CIFAR-100.

Algorithm 1 Pseudocode for SDP
1: Input: model f_θ, EMA models f_θα, f_θβ, EMA ratios α, β, memory M, training data stream D, learning rate µ
2: θ_α ← θ, θ_β ← θ
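The EMA bookkeeping in lines 1-2 of Algorithm 1, where two exponential-moving-average copies of the parameters are kept at different ratios α and β and initialized from the current model, can be sketched as follows. This is our reading; the helper names are ours.

```python
import numpy as np

def init_ema(theta):
    """Line 2 of Algorithm 1: theta_alpha <- theta, theta_beta <- theta."""
    return {k: v.copy() for k, v in theta.items()}

def ema_update(theta_ema, theta, ratio):
    """One EMA step: theta_ema <- (1 - ratio) * theta_ema + ratio * theta."""
    for k in theta_ema:
        theta_ema[k] = (1.0 - ratio) * theta_ema[k] + ratio * theta[k]
    return theta_ema
```

Two such copies, one fast (larger ratio) and one slow (smaller ratio), would be maintained alongside the trained model and queried for distillation.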


Table 8 compares the additional memory required to save models or samples, reporting both the theoretical values and the actual memory cost on ImageNet for storing models, replay buffers, and logits. SDP has an additional memory cost of N_s·N_c·S + 2·N_θ, where N_s is the number of exemplars stored per class, N_c is the number of classes, S is the size of an image, and N_θ is the size of the model parameters. Overall, SDP uses more memory than the compared methods, since SDP requires storing model weights while the other methods only store samples in episodic memory. Since the resolution of real-world images is larger than or similar to that of ImageNet images, we compare memory costs on ImageNet as the closest proxy for real-world applications. As shown in Table 8, the extra memory cost of storing the model checkpoints is negligible compared to the memory cost of the episodic memory.

Storing model parameters from previous tasks for regularization or distillation is a common practice in CL with task boundaries (Kirkpatrick et al., 2017; Chaudhry et al., 2018a; Wu et al., 2019). However, such methods were not actively studied in task-free CL, since one cannot decide which checkpoint to store. For comparison with other regularization or distillation methods: EWC (Kirkpatrick et al., 2017) requires 2·T·N_θ, where T is the number of tasks; EWC++ (Chaudhry et al., 2018a) requires N_s·N_c·S + 3·N_θ (3,150.6 MB on ImageNet); RWalk (Chaudhry et al., 2018a) requires N_s·N_c·S + 5·N_θ (3,244.1 MB on ImageNet); and BiC (Wu et al., 2019) requires N_s·N_c·S + N_θ (3,057.3 MB on ImageNet).

A.8 VISUALIZATION OF LOSS BALANCING PARAMETER OVER TIME

In this section, we discuss the behavior of the data-driven balancing parameter λ_t over time. λ_t learns to adjust the weight of the distillation loss to follow the performance of the SDP model.
Thus, in the early phase, where the SDP model's performance is low, the distillation loss has a small weight (high λ_t); as the model learns and the SDP model's performance increases, the weight of the distillation loss becomes larger (lower λ_t). We provide a visualization of λ_t over time on the CIFAR-10 Gaussian and Periodic Gaussian data streams in Fig. 6, where this behavior is observed.

CL with no episodic memory, on the Gaussian and Periodic Gaussian data streams in CIFAR-10.

However, SDP without memory still falls behind SDP with memory replay, by 25.66% on the Gaussian data stream and 32.70% on the Periodic Gaussian data stream. Closing this gap would be an interesting research direction for future work.

A.12 ABLATIONS ON THE NUMBER OF REPETITIONS IN PERIODIC CONTINUAL LEARNING

Comparison of accuracy for different numbers of repetitions (R) for the Periodic Gaussian data stream in CIFAR-10.

We test CL methods with 4 different numbers of repetitions (R) of the Gaussian data stream in CIFAR-10 and CIFAR-100. As seen in Table 10, SDP outperforms the other methods for all tested values of R, showing that our method is robust to the number of repetitions.

Table 11 summarizes the results for the Non-Periodic and Periodic Gaussian Mixture data streams. We observe that SDP still outperforms the other methods even in the more complex Gaussian Mixture data stream setup.

Comparison of accuracy for the Periodic Gaussian Mixture data stream with mixed periods across classes.

Figure 10: Accuracy trends of CL methods over the course of training, on the CIFAR-10 Non-Periodic Gaussian setup.

Comparison of accuracy for the Periodic Gaussian data stream with natural domain shift using CLEAR-10.

A.15 PERIODIC CONTINUAL LEARNING WITH NATURAL DOMAIN SHIFTS

We conduct additional experiments using the recently proposed CLEAR-10 (Lin et al., 2021) dataset, which contains real-world domain shifts, i.e., the visual concept drift of objects over time in the data stream.

ACKNOWLEDGEMENT

This work is partly supported by the NRF grant (No.2022R1A2C4002300), IITP grants (No.2020-0-01361-003, AI Graduate School Program (Yonsei University) 5%, No.2021-0-02068, AI Innovation Hub 5%, 2022-0-00077, 20%, 2022-0-00113, 20%, 2022-0-00959, 15%, 2022-0-00871, 15%, 2022-0-00951, 15%) funded by the Korea government (MSIT).


Since the model's predictions are almost random in the initial phase, the coefficient tends to start at 1 and decrease as training progresses.

A.9 HYPERPARAMETER TUNING ON SDP

We study the effect of the two hyperparameters of SDP, the mean µ and the coefficient of variation c2, by grid search on the CIFAR-10 non-periodic Gaussian setup, and summarize the results in Fig. 7. We observe some dependence on the SDP mean, but not much on the coefficient of variation. Note that the valid range for the coefficient of variation is 0.5 - 1/µ < c2 < 1 - 1/µ. Since c2 does not affect performance much, we choose the value in the middle of the valid range, c2 = 0.75, for all experiments. The SDP mean has some impact on performance, so we choose the optimal value found on the CIFAR-10 non-periodic Gaussian setup, µ = 10,000. Since dataset- or setup-specific hyperparameter search is not desirable in a CL scenario, we use the same value, µ = 10,000, for all datasets and setups.
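As a quick sanity check of the range above (our own arithmetic): with µ = 10,000 the valid interval 0.5 - 1/µ < c2 < 1 - 1/µ is essentially (0.4999, 0.9999), and its midpoint is approximately 0.75, matching the chosen value.

```python
# Arithmetic check of the valid range for c2 with mu = 10,000.
mu = 10_000
lo, hi = 0.5 - 1.0 / mu, 1.0 - 1.0 / mu   # (0.4999, 0.9999)
mid = (lo + hi) / 2.0                     # midpoint of the valid range, ~0.7499
```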

A.10 HYPERPARAMETER TUNING ON EMA

We also report the hyperparameter tuning results for the EMA model in Table 1, searched on the CIFAR-10 non-periodic Gaussian setup. The performance did not vary much with the EMA ratio α, and we used α = 0.001 as it showed the highest performance.

A.11 ONLINE CL WITHOUT REPLAY

We consider an extreme scenario where episodic memory is not available. Since the other CL methods covered in this work depend on memory replay, we compare only against plain fine-tuning as the baseline. We summarize the results in Table 9. We observe that applying SDP to the baseline (fine-tuning) improves A_AUC by 3.07% on the Gaussian data stream and 5.03% on the Periodic Gaussian data stream on CIFAR-10. While SDP improves performance, the overall performance falls behind SDP using memory replay.

