ONLINE CONTINUAL LEARNING FOR PROGRESSIVE DISTRIBUTION SHIFT (OCL-PDS): A PRACTITIONER'S PERSPECTIVE

Abstract

We introduce the novel OCL-PDS problem - Online Continual Learning for Progressive Distribution Shift. PDS refers to the subtle, gradual, and continuous distribution shift that widely exists in modern deep learning applications. It is widely observed in industry that PDS can cause significant performance drops. While previous work in continual learning and domain adaptation addresses this problem to some extent, our investigation from the practitioner's perspective reveals flawed assumptions that limit their applicability to the daily challenges faced in real-world scenarios, and this work aims to close the gap between academic research and industry. For this new problem, we build 4 new benchmarks from the Wilds dataset (Koh et al., 2021), and implement 12 algorithms and baselines, including both supervised and semi-supervised methods, which we test extensively on the new benchmarks. We hope that this work provides practitioners with tools to better handle realistic PDS, and helps scientists design better OCL algorithms.

1. INTRODUCTION

In most modern deep learning applications, the input data undergoes a continual distribution shift over time. For example, consider a satellite image classification task as illustrated in Figure 1a. In this task, the input data distribution changes over time due to changes in landscape and to camera updates, which can lead to higher image resolutions and wider color bands. Similarly, in a toxic language detection task on social media, illustrated in Figure 1b, the distribution shift can be caused by a shift in trends and hot topics (many people post about hot topics like BLM (Wikipedia contributors, 2022a) and Roe v. Wade (Wikipedia contributors, 2022b) on social media), or a shift in language use (Röttger & Pierrehumbert, 2021; Luu et al., 2022). Such distribution shift can cause a significant performance drop in deep models, a widely observed phenomenon known as model drift. A critical problem for practitioners, therefore, is how to deal with what we term progressive distribution shift (PDS), defined as the subtle, gradual, and continuous distribution shift that widely exists in modern deep learning applications.

In this work, we explore handling PDS with online continual learning (OCL), where the learner collects, learns, and is evaluated on online samples from a continually changing data distribution. In Section 2, we formulate the OCL-PDS problem. The OCL-PDS problem is closely related to two research areas, domain adaptation (DA) and continual learning (CL), in which there is a rich body of academic work. However, through a literature review and our conversations with practitioners, we find that a gap remains between the settings widely used in academic work and those found in real industrial applications. To close this gap, we commit ourselves to thinking from a practitioner's perspective, which is the core spirit of this work. Our primary goal is to build tools for investigating the real issues practitioners face in their day-to-day work.
To achieve this goal, we challenge the prevailing assumptions in previous work, and propose three important modifications to the conventional DA and CL problem settings:

1. Task-free: One point conventional DA and CL settings have in common is assuming clear boundaries between distinct domains (or tasks), but practitioners rarely apply the same model to very different domains in industry. In contrast, OCL studies the task-free CL setting (Aljundi et al., 2019b), where there is no clear boundary and the distribution shift is continuous. Moreover, in OCL both training and evaluation are online, unlike previous task-free CL settings with offline evaluation, which is not as realistic in a "lifelong" setting.
2. Forgetting is allowed: Avoiding catastrophic forgetting is a huge topic in CL, which usually requires no forgetting on any task. However, remembering everything is impractical, infeasible and potentially harmful, so OCL-PDS only requires remembering recent knowledge and important knowledge, the latter described by a regression set (Sec. 2.2).
3. Infinite storage: Previous work in CL usually assumes a limited storage (buffer) size. However, storage is not the most pressing bottleneck in most industrial applications. Thus, in OCL-PDS, we assume an infinitely large storage where all historical samples can be stored. However, the learner cannot replay all samples, because doing so would be too inefficient.

Figure 1: FMoW-WPDS and CivilComments-WPDS benchmarks which we build in this work.

To demonstrate the novelty and practicality of the OCL-PDS problem, in Section 2.3 we discuss related work, compare OCL-PDS with common and similar settings used in previous work, and elaborate on why we believe these three key modifications align OCL-PDS more closely with industrial applications and practitioners' pain points. A more thorough literature review can be found in Appendix A.
Then, in Section 3, we build 4 new benchmarks for OCL-PDS, including both vision and language tasks. When building these benchmarks, we make every effort to ensure that they reflect real PDS scenarios that practitioners need to deal with. In Section 4, we explore OCL algorithms and how to combine them with semi-supervised learning (SSL), as unlabeled data is very common in practice. In total, we implement 12 supervised and semi-supervised OCL algorithms and baselines adapted to OCL-PDS, which we test extensively on our benchmarks in Section 5. Our key observations from these experiments include: (i) a task-dependent relationship between learning and remembering; (ii) some existing methods perform poorly on regression tasks; (iii) SSL helps improve online performance, but it requires a critical virtual update step. Finally, in Section 6 we discuss remaining problems and limitations.

Contributions. Our contributions in this work include: (i) introducing the novel OCL-PDS problem, which more closely aligns with practitioners' needs; (ii) releasing 4 new benchmarks for this novel setting; (iii) adapting and implementing 12 OCL algorithms and baselines, both supervised and semi-supervised, for OCL-PDS; (iv) comparing these algorithms and baselines with extensive experiments, which leads to a number of key observations. Overall, we believe that this work is an important step toward closing the gap between academic research and industry, and we hope that it inspires more practitioners and researchers to investigate and dive deep into real-world PDS. To this end, we release our benchmarks and algorithms, which are easy to use and which we hope can help boost the development of OCL algorithms for handling PDS.

2. THE OCL-PDS PROBLEM

2.1. PROBLEM FORMULATION

We have a stream of online data S_0, S_1, ..., where each S_t is a batch of i.i.d. samples from a distribution D_t that changes continuously with time t, for which we assume that Div(D_t ‖ D_{t+1}) < ρ for all t, for some divergence function Div. Online Continual Learning (OCL) goes as follows:

• At t = 0, receive a labeled training set S_0, on which the initial model f_0 is trained.
• For t = 1, 2, ..., T, ... do:
  1. Data collection: Receive a new unlabeled data batch S_t = {(x_t^(i), y_t^(i))}_{i=1}^{n_t} sampled i.i.d. from D_t.
  2. Evaluation: Predict on S_t with the current model f_{t-1}, and get some feedback.
  3. Fine-tuning: Update the model f_{t-1} → f_t using all previous information.

Evaluation metrics. An OCL algorithm is used for fine-tuning and is evaluated by three metrics:

1. Online performance: Denote the performance of f_s on S_t by A^t_s. The online performance at time t (as computed in Step 2, Evaluation) is A^t_{t-1}, and the average online performance before horizon T is defined as (A^1_0 + ... + A^T_{T-1})/T.
2. Knowledge retention: Unlike conventional CL, which requires no forgetting on any task, in OCL-PDS the model only needs to remember two types of knowledge: recent knowledge and important knowledge. For a recent time window w, the recent performance is defined as (A^{t-w}_{t-1} + ... + A^{t-1}_{t-1})/w. The important data is described by a regression set, and the regression set performance is the model's performance on this set.
3. Training efficiency: This is measured by the average runtime of the fine-tuning step, which is very important for this online setting where the OCL algorithm is run many times.
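The evaluate-then-fine-tune loop and the average online performance metric can be sketched as follows; `evaluate` and `fine_tune` are placeholders for any task metric and any OCL algorithm (this is our own illustration, not part of the released code):

```python
def run_ocl(model, stream, evaluate, fine_tune):
    """Sketch of the OCL loop in Sec. 2.1: at each step t, the model is
    first evaluated on the new batch S_t (giving A^t_{t-1}), then
    fine-tuned using all previously seen information."""
    online_scores = []   # A^1_0, A^2_1, ...
    history = []         # the "infinite storage" of past batches/feedback
    for batch, feedback in stream:
        online_scores.append(evaluate(model, batch))  # Step 2: evaluation
        history.append((batch, feedback))             # Step 1: data collection
        model = fine_tune(model, history)             # Step 3: fine-tuning
    avg_online = sum(online_scores) / len(online_scores)
    return model, avg_online
```

Any concrete OCL algorithm from Section 4 can be plugged in as `fine_tune`; it receives the full history because storage is assumed infinite.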

2.2. DETAILS

Divergence function. For distributions P and Q, Div(P ‖ Q) is the divergence from P to Q, which can differ from Div(Q ‖ P). According to our reasoning in Appendix B, an ideal divergence function for OCL-PDS should be asymmetric and bounded, so we cannot use popular functions such as total variation, Wasserstein distance, MMD, KL-divergence and JS-divergence. In this work, we use the ϵ-KL-divergence (Eqn. (2)), whose definition and properties can be found in Appendix B.

Data batch. If |S_t| = 1, then this is the conventional online learning setting where samples arrive one by one. However, industry practitioners seldom update the model on a single sample, and instead collect a batch of samples before fine-tuning the model, so we consider data batches.

Feedback. Without any feedback from the evaluation process, the problem is fully unsupervised, because we only have the unlabeled batches to fine-tune the model. This setting, however, is too hard and unrealistic. Indeed, for most deployed industrial systems, there are tools for evaluating online performance, either through automated metrics or through feedback provided by end users. Here, we consider the user error report model, where a fraction of the users provide feedback on incorrect model outputs. This model leads to Random Label Feedback (RLF), where the labels of an α fraction of the samples in S_t are provided as feedback. In this scenario, α = 0, α = 1 and α ∈ (0, 1) correspond to the unsupervised, supervised and semi-supervised learning settings, respectively.

All previous information. We allow the learner to store all previously seen samples and feedback (though the learner cannot actually replay all samples, as that would be too inefficient), which is starkly different from most previous papers in CL that assume a limited storage size.

Recent knowledge.
OCL requires no forgetting of recent knowledge, because (i) in general, practitioners expect the model not to forget too quickly, and (ii) in many applications the same distribution repeats periodically, which makes recent knowledge useful. For example, satellite images in summer and winter look very different (e.g. due to snow), but the images in two consecutive summers look similar, so in this case it is useful to remember the knowledge for at least one year.

Regression set. In the software industry, regression refers to the deterioration of performance after an update (Yan et al., 2021). The regression set contains the regression data, on which making a mistake is more expensive. Moreover, the labeling function P(Y|X) of the regression data changes very little over time, if at all (no concept shift). The two most common types of regression data in industry are: (i) frequent data, which appears more often than other data; (ii) critical data, which weighs more in the model evaluation and whose definition depends on the specific application.
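The Random Label Feedback model described above can be simulated in a few lines; the function name and interface here are our own illustration:

```python
import numpy as np

def random_label_feedback(batch_x, batch_y, alpha, rng):
    """Simulate RLF: the labels of an alpha fraction of the samples in
    S_t are revealed as feedback, while the rest of the batch stays
    unlabeled. alpha = 0, alpha = 1, and alpha in (0, 1) give the
    unsupervised, supervised, and semi-supervised settings."""
    n = len(batch_x)
    n_labeled = int(round(alpha * n))
    idx = rng.permutation(n)                 # which users happened to report
    labeled, unlabeled = idx[:n_labeled], idx[n_labeled:]
    return (batch_x[labeled], batch_y[labeled]), batch_x[unlabeled]
```

In a benchmark harness, this split would be applied to every online batch S_t before it is handed to the OCL algorithm.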

2.3. RELATED WORK AND COMPARISON WITH PREVIOUS SETTINGS

This work is related to three areas: domain adaptation (DA), continual learning (CL), and semi-supervised learning (SSL). DA provides a learner with labeled samples from a source distribution P and (partially labeled, unlabeled, or no) samples from a target distribution Q, and requires it to learn a good model on Q. Surveys on DA include Lu et al. (2018); Wang & Deng (2018); Ramponi & Plank (2020); Wang et al. (2022b). In CL, the learner needs to continually learn new knowledge from an online stream of data, and settings include task-incremental CL (including domain-incremental and class-incremental CL), task-aware CL, task-agnostic CL, task-free CL and OCL. Surveys on CL include De Lange et al. (2021); Masana et al. (2020); Biesialska et al. (2020). SSL requires the model to learn from a training set consisting of few labeled samples and many unlabeled samples. Surveys on SSL include Van Engelen & Hoos (2020); Ouali et al. (2020); Yang et al. (2021). A more thorough literature review can be found in Appendix A.

There are several differences between the OCL-PDS problem and conventional DA and CL settings. We now explain why our formulation is more relevant, and thus more useful, for industry practitioners.

PDS vs DA/DG. Domain adaptation (DA) and its sibling domain generalization (DG) mostly study big, one-shot distribution shifts, i.e. training and testing on two very different domains. This has two problems: (i) Practitioners seldom directly apply a model to a very different domain in industry. Instead, they usually train one model for each domain, and train a domain classifier to distinguish among the domains. (ii) Even if a model needs to be applied to a different domain, practitioners would usually first collect some labeled data from the new domain and then fine-tune the model. It is very rare in industry to have no labels or samples at all from the target domain. On the contrary, OCL-PDS is a very common scenario in industry.
First, PDS has been widely reported to cause performance drops in industrial applications (Martinel et al., 2016; Jaidka et al., 2018; Huang & Paul, 2019), and practitioners do not often train new models for PDS. Second, OCL-PDS studies the semi-supervised setting where practitioners can collect a few labeled samples and many unlabeled samples, which is more realistic than having no labels or samples at all.

OCL vs CL. OCL falls under the task-free continual learning setting, where there is a fixed task and no clear task boundary. It differs from conventional task-incremental CL in three ways:

1. Task-incremental CL has N distinct tasks and requires a single model to learn them all, but what practitioners usually do in this case is train N models, one for each task. In contrast, in OCL-PDS the task is fixed but the data distribution is gradually and constantly changing, so it is more reasonable to use and continually fine-tune a single model.
2. Conventional CL requires the model to remember all N tasks with a storage of size M. For studying PDS, this requirement has three problems: (i) It is not very practical, as not all old knowledge is important: a satellite image classifier in 2022 does not need to do very well on images from 2002 with different landscapes. (ii) In the "lifelong" setting with N continually growing, we need M to also grow with N (i.e. M = O(N)). With a fixed M, it is infeasible to remember everything. (iii) Many applications have concept shift, where P(Y|X) can change, so remembering old knowledge can be harmful to the performance on the current data distribution. For instance, language that was not considered offensive 20 years ago is widely recognized as offensive today thanks to the recent civil rights movements. Thus, OCL-PDS only requires remembering recent knowledge and important knowledge.
3. Our setting assumes infinite storage, unlike previous settings. This is an over-optimistic assumption, as there are real applications where the amount of data is so huge that it is impossible to store everything even for big companies. Real applications also have other considerations, such as privacy restrictions, so that the data cannot be stored forever. However, storage size is rarely the bottleneck of industrial applications. The point of making this assumption is to not put too much effort into utilizing every bit of storage. Instead, we want to focus on more practically relevant questions, such as how to leverage unlabeled data and how to improve training efficiency.

Other similar settings. First, OCL is different from the task-free CL formulated in some previous work (Aljundi et al., 2019b; de Masson d'Autume et al., 2019; Wang et al., 2022d), where the training is online but the evaluation is offline. Previous work typically splits the data domain into different sections, which the learner sees sequentially online, and in the end the learner is evaluated on the entire domain offline.
On the contrary, both training and evaluation in OCL are online: for each new batch, the model is first tested and then trained on it. Thus, it is possible to have the real "lifelong" learning setting in OCL where the time horizon T = ∞, but not in the previous setting. Second, OCL-PDS is closely related to reinforcement learning (RL) and time series analysis. The difference from RL is that in RL the agent can learn from a number of episodes, while in OCL-PDS the evaluation is online and one-pass. The difference from time series analysis is that time series analysis focuses on predicting future data and does not care about forgetting. Moreover, though PDS naturally resides in time series data, in our literature review (Appendix A) we find little work on handling PDS with time series analysis. One such line of work is temporal covariate shift (TCS) (Du et al., 2021), which assumes that P(Y|X) is always fixed, an assumption not made in OCL-PDS. Third, there are two related settings, gradual domain adaptation (GDA) (Kumar et al., 2020) and gradual concept drift (Liu et al., 2017; Xu & Wang, 2017), that also study gradual shift from one domain to another with a series of distributions P_0, P_1, ..., P_T, where P_0 is the source domain, P_T is the target domain, and P_t and P_{t+1} are close for each t. Both settings only require good adaptation performance and do not consider forgetting. However, the concept of the regression set widely exists in modern deep learning applications, and no forgetting on the regression set is a critical issue. Finally, Cai et al. (2021) proposed a similar OCL setting, where the model is also first evaluated on the new batch and then fine-tuned on it. However, OCL-PDS has three important distinctions from that setting.

3. BENCHMARKS

There are two existing PDS benchmarks: CLOC (Cai et al., 2021) and CLEAR (Lin et al., 2021). Both are image classification tasks and do not have regression sets.
Thus, we build a new, more comprehensive suite of benchmarks that covers both language and vision tasks, and both classification and regression tasks, with intuitively defined regression sets. Since it is impossible for us to cover all existing tasks, we also provide our 3-step procedure to build our benchmarks, which can be used to construct PDS benchmarks on other existing datasets.

3.1. THE 3-STEP PROCEDURE TO BENCHMARK OCL-PDS

Here we provide the 3-step procedure we use to benchmark OCL-PDS on an existing dataset:

1. Separate the data into groups (domains). For example, group the data by year. Then, do an OOD check, which verifies that there is a significant distribution shift across the groups.
2. Assign shifting group weights to the batches to create a distribution shift across the groups. Then, do a shift continuity check, which verifies that the shift is continuous (not abrupt).
3. Design a separate regression set that does not intersect with any online batch. Then, do a regression check, which verifies that naïve methods have regression on this set.

Moreover, for each batch (including the regression set), we randomly divide the batch into a training batch and a test batch. The initial training set contains both the first batch and the training regression set. The learner sees the training batches during OCL, and the separate test batches are used to evaluate the recent and regression set performances. The model is never evaluated on training samples it has already seen, because this is not useful: the learner can store all these samples in its buffer.

Example: CivilComments-WPDS. Here we briefly demonstrate this procedure; a detailed description can be found in Appendix C.1. The CivilComments dataset contains online comments with topic labels, and we want to model PDS with shifting hot topics on it. In Step 1, we divide the comments into four groups according to their topics, and verify that a model trained on any three groups does poorly on the fourth group. In Step 2, we construct a weight shift among the groups to simulate PDS, and verify that a model trained on labeled samples from previous distributions can do well on the new distribution, so that the shift is continuous. In Step 3, we define the regression set, and verify that catastrophic forgetting happens if we only train on the new data.
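As an illustration of Step 2, the sketch below draws each online batch with group weights that move continuously across the groups. The triangular weighting scheme here is an assumption for illustration only, not the exact weights we use (those are described in Appendix C):

```python
import numpy as np

def sample_pds_batches(groups, T, batch_size, rng):
    """Draw T batches whose group weights shift gradually, so that
    consecutive distributions D_t and D_{t+1} stay close (Step 2).
    `groups` is a list of per-group sample pools (e.g. lists of examples)."""
    n_groups = len(groups)
    centers = np.linspace(0.0, n_groups - 1.0, T)   # drifting "center" group
    batches = []
    for c in centers:
        # triangular weights around the current center, renormalized
        w = np.maximum(0.0, 1.0 - np.abs(np.arange(n_groups) - c))
        w /= w.sum()
        counts = rng.multinomial(batch_size, w)      # per-group sample counts
        batch = [rng.choice(groups[g]) for g in range(n_groups)
                 for _ in range(counts[g])]
        batches.append(batch)
    return batches
```

A shift continuity check then amounts to verifying that a model trained on batches up to t still performs well on batch t+1.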
3.2. THE WILDS-PDS BENCHMARKS

Our 4 benchmarks are built on the Wilds dataset (Koh et al., 2021); please refer to that paper for the potential leverage, broader context, and ethical considerations of these datasets. Here we briefly describe the 4 benchmarks we release in this work; details can be found in Appendix C.

CivilComments-WPDS. This benchmark is based on the CivilComments-Wilds dataset (Borkan et al., 2019), which is a toxic language detection task on social media (Figure 1b). WPDS stands for Wilds-PDS. This benchmark models the shift in hot topics. The regression set contains critical data: comments with severe harassment, including identity attacks and explicit sexual comments.

FMoW-WPDS. This benchmark is based on the FMoW-Wilds dataset (Christie et al., 2018), which is a satellite image facility classification task (Figure 1a). It models the shift in time. The regression set contains frequent data: data from two highly populated regions, Americas and Asia.

Amazon-WPDS. This benchmark is based on the Amazon-Wilds dataset (Ni et al., 2019), which is a review sentiment analysis task on shopping websites. This benchmark models the shift in language use. The regression set contains frequent data: data from 10 popular product categories.

Poverty-WPDS. This benchmark is based on the PovertyMap-Wilds dataset (Yeh et al., 2020), which is a satellite image wealth index regression task. This benchmark models the shift in time. The regression set contains critical data: images from urban areas. Following Koh et al. (2021), performance here is measured by the Pearson correlation between outputs and ground truths.

4. OCL ALGORITHMS

An OCL algorithm consists of the following three components:

• Continual fine-tuning: How to fine-tune the model on the new data?
• Knowledge retention: How to prevent forgetting?
• Semi-supervised learning: How to leverage the unlabeled data?

An OCL algorithm is called supervised if it does not have the semi-supervised learning component, and semi-supervised if it does. In the rest of this section we provide an overview of the OCL algorithms we implement: first baselines, then supervised, and finally semi-supervised algorithms. Implementation details of these algorithms can be found in Appendix D.

4.1. NAÏVE BASELINES

The naïve baselines are used to measure the difficulty of an OCL-PDS task, for interpreting the performances of OCL algorithms. First, we have First Batch Only (FBO), where we train an initial model on the first batch S_0 (including the training regression set) with empirical risk minimization (ERM) and use that model until the end. FBO serves as a lower bound on the online performance as well as an upper bound on the regression set performance (because it directly trains the model on regression data without any forgetting). Then we have i.i.d. offline, where for each t we train a model on a separate training set sampled i.i.d. from D_t and test it on S_t, which serves as an approximate upper bound on the online performance that reflects the generalization gap. Finally, we have New Batch Only (NBO), where we train the model on the new data alone with no regard for forgetting, which serves as a lower bound on the knowledge retention performance.
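The FBO and NBO baselines can be sketched in a few lines against the loop from Section 2.1; `train` and `evaluate` below are placeholders for ERM training and the task metric (our own illustration):

```python
def first_batch_only(train, evaluate, S0, stream):
    """FBO: train once on S_0 (including the training regression set)
    and never update; a lower bound on the online performance."""
    model = train(S0)
    return [evaluate(model, batch) for batch in stream]

def new_batch_only(train, evaluate, S0, stream):
    """NBO: retrain on each new batch alone, ignoring forgetting;
    a lower bound on the knowledge retention performance."""
    model, scores = train(S0), []
    for batch in stream:
        scores.append(evaluate(model, batch))  # evaluate before updating
        model = train(batch)                   # discard everything else
    return scores
```

The i.i.d. offline baseline differs only in training each step's model on a fresh i.i.d. sample from D_t instead of the previous batch.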

4.2. SUPERVISED OCL ALGORITHMS

Rehearsal based methods. Rehearsal was first introduced in Ratcliff (1990); Robins (1995) to prevent catastrophic forgetting in CL: historical data is stored in a memory buffer and replayed to the learner. The simplest method is ER-FIFO (also called a ring buffer (Chaudhry et al., 2019b)), where ER stands for experience replay. There are three sources of data the model needs to learn or remember: new data, recent data and regression data. Thus, ER-FIFO simply fine-tunes the model on the union of these three sets of data. The buffer acts as a first-in-first-out (FIFO) queue, as the new batch replaces the oldest one (while the regression set is never removed). There are some variants of ER-FIFO with different strategies for selecting replay samples. In ER-FIFO-RW, where RW stands for reweighting, the three data sources are balanced so that they have the same probability of being sampled in stochastic gradient descent (SGD), which is useful when different batches have different sizes (the weights can also be customized to trade off learning against remembering). In Maximally Interfered Retrieval (MIR) (Aljundi et al., 2019a), the model is first "virtually" updated on the new data only, and those previous samples whose loss increases the most between before and after the virtual update are selected. The model is then restored and fine-tuned with a real update on the selected samples together with the new samples. Similarly, in MaxLoss (Lin et al., 2022) there is also a virtual update step, and previous samples with the highest loss after the virtual update are selected for replay during the real update.

GEM-PDS. This method is a combination of Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) and Average GEM (A-GEM) (Chaudhry et al., 2019a), which we design specially for OCL-PDS. Denote the gradients of the loss function on the new data, recent data and regression data by g_0, g_1 and g_2, respectively.
Gradient descent along g_0 might cause the model to forget recent and important knowledge, so instead we find a "pseudo-gradient" g that is close to g_0 such that gradient descent along g won't lead to forgetting. We solve the following convex optimization problem:

minimize_g ‖g − g_0‖_2^2  s.t.  ⟨g, g_1⟩ ≥ 0, ⟨g, g_2⟩ ≥ 0    (1)

This problem is always feasible, and the optimal g* can be found with a simple procedure described in Appendix D, Eqn. (3). The constraints ensure that descent along g* won't increase the loss on the recent and regression data, which can be shown with a Taylor expansion of the loss function.

Regularization based methods. The high-level idea of regularization is to keep the model weights close to the initial model, which has a good regression set performance, so as to reduce forgetting. Denote the vectorized model weights at time t by θ_t. In Online L2 Regularization (L2Reg), we add a penalty term (λ/2)‖θ_t − θ_0‖_2^2 to the loss function. In a variant called Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), the penalty term is (λ/2)‖diag(F_t)(θ_t − θ_0)‖_2^2, where F_t is the Fisher information matrix (FIM) and diag(F_t) contains only the diagonal elements of F_t. The FIM ensures that weights that are more influential on the model output change less.
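Eqn. (1) is a tiny QP with only two constraints, so it can be solved in closed form by enumerating the active constraint sets of its KKT conditions. The sketch below is our own illustration of this idea, not the exact procedure of Appendix D; it operates on flattened gradient vectors:

```python
import numpy as np

def gem_pds_project(g0, g1, g2):
    """Solve Eqn. (1): find the g closest to g0 with <g, g1> >= 0 and
    <g, g2> >= 0, by enumerating active sets of the 2-constraint QP."""
    A = np.stack([g1, g2], axis=1)      # d x 2 constraint matrix
    if np.all(A.T @ g0 >= 0):
        return g0.copy()                # g0 is already feasible
    G, b = A.T @ A, A.T @ g0            # 2x2 Gram matrix, inner products
    candidates = []
    for active in ([0], [1], [0, 1]):   # KKT: g* = g0 + A @ lam, lam >= 0
        Ga = G[np.ix_(active, active)]
        try:
            lam_a = np.linalg.solve(Ga, -b[active])
        except np.linalg.LinAlgError:
            continue                    # degenerate (parallel) constraints
        if np.all(lam_a >= 0):
            lam = np.zeros(2)
            lam[active] = lam_a
            g = g0 + A @ lam
            if np.all(A.T @ g >= -1e-9):   # remaining constraint holds
                candidates.append(g)
    if not candidates:
        return np.zeros_like(g0)        # g = 0 is always feasible
    return min(candidates, key=lambda g: np.linalg.norm(g - g0))
```

When only one constraint is active, this reduces to the familiar A-GEM projection g = g0 − (⟨g0, g_i⟩/‖g_i‖²) g_i.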

4.3. SEMI-SUPERVISED OCL ALGORITHMS

Pseudo Labeling (PL). PL was proposed in Lee (2013) and is also known as self-training. Since D_t is close to D_{t-1}, if f_{t-1} does well on D_{t-1}, then it is very likely that it also does well on D_t. Based on this observation, for an unlabeled sample x in S_t, we define its pseudo label simply as f_{t-1}(x). Then, we fine-tune the model f_{t-1} → f_t on the union of the labeled and pseudo-labeled sets with any supervised OCL algorithm. We denote a PL method by adding the suffix "PL" to the supervised algorithm, as in ER-FIFO-PL. Note that in classification, f_{t-1}(x) is the "hard" label, so training f_{t-1} on this label minimizes its entropy on x. Kumar et al. (2020) and Wang et al. (2022a) proved the effectiveness of PL in the GDA context. Moreover, pseudo-labeled samples are not replayed for knowledge retention, as they could have large label noise. Furthermore, in our implementation of PL we introduce a virtual update step, which we find very important in our experiments. The whole algorithm goes as follows:

1. Virtual update: Fine-tune the model f_{t-1} → f'_{t-1} with ERM on the labeled samples only.
2. Pseudo labeling: Pseudo label the unlabeled samples with the updated model: x → (x, f'_{t-1}(x)).
3. Real update: Revert the model weights from f'_{t-1} back to f_{t-1}, and fine-tune f_{t-1} → f_t with any supervised OCL algorithm on the union of the labeled and pseudo-labeled samples.

FixMatch (FM). FixMatch (Sohn et al., 2020) is a variant of PL. In FM, there are two types of data augmentation, a strong one and a weak one. The pseudo labels are generated on the weakly augmented samples, while the model is fine-tuned on the strongly augmented samples, which leads to a consistency regularization effect that makes the model produce consistent outputs on the weakly and strongly augmented samples. Currently we implement FixMatch only for vision tasks. We denote FixMatch by adding the suffix "FM" to a supervised algorithm, as in ER-FIFO-FM.
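The three PL steps can be sketched with any model exposing fit/predict; the toy nearest-centroid classifier below is purely illustrative (a stand-in for a deep model), and "reverting the weights" is realized by updating a deep copy:

```python
import copy
import numpy as np

class CentroidModel:
    """Toy stand-in for f_t: a nearest-centroid classifier."""
    def __init__(self):
        self.centroids = {}
    def fit(self, X, y):
        for c in np.unique(y):
            new = X[y == c].mean(axis=0)
            old = self.centroids.get(c)
            # incremental "fine-tuning": average old and new centroids
            self.centroids[c] = new if old is None else 0.5 * (old + new)
        return self
    def predict(self, X):
        classes = sorted(self.centroids)
        dists = np.stack([np.linalg.norm(X - self.centroids[c], axis=1)
                          for c in classes], axis=1)
        return np.array(classes)[dists.argmin(axis=1)]

def pseudo_label_step(model, X_lab, y_lab, X_unlab):
    # 1. Virtual update: fine-tune a copy f'_{t-1} on labeled samples only.
    virtual = copy.deepcopy(model).fit(X_lab, y_lab)
    # 2. Pseudo-label the unlabeled samples with the updated model.
    y_pseudo = virtual.predict(X_unlab)
    # 3. Real update: the copy is discarded (weights reverted), and the
    #    original f_{t-1} is fine-tuned on labeled + pseudo-labeled data.
    X_all = np.concatenate([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, y_pseudo])
    return model.fit(X_all, y_all)
```

In our actual implementation, step 3 can use any supervised OCL algorithm instead of plain fitting.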

5. EXPERIMENTS

We compare the algorithms on the 4 benchmarks we constructed for OCL-PDS. Each experiment is run 5 times with different random seeds. For the recent performance and the regression set performance (reg performance), we report both the average and the worst performance, which are the mean and minimum of the performance over t = 1, ..., T, respectively. We also report the worst performance because knowledge retention is required at all t. To save space, we put detailed settings and results in Appendix E. In this section we summarize five notable observations from our experiments, first for supervised and then for semi-supervised OCL algorithms.

5.1. RESULTS OF SUPERVISED OCL ALGORITHMS

Observation #1: Strong positive correlation between online and recent performances. On all benchmarks, methods that achieve higher online performances also achieve higher recent performances. In Figures 2a and 2b, we plot the average online and worst recent performances achieved by different supervised OCL algorithms on the CivilComments-WPDS (α = 0.5%) and FMoW-WPDS (α = 50%) benchmarks (α is the fraction of labeled samples), where we can see a strong positive correlation between the two. The reason is that D_t is very close to D_{t-1}, ..., D_{t-w} by formulation, so a model that performs well on D_t naturally performs well on the w recent distributions too. This is also known as the accuracy-on-the-line phenomenon (Miller et al., 2021).

Observation #2: Correlation between online and reg performances differs among benchmarks. In Figures 2c and 2d, we plot the average online and worst reg performances on CivilComments-WPDS and FMoW-WPDS. We can see that the trends on these two benchmarks are quite different. On CivilComments-WPDS, methods with higher online performances have lower reg performances, while there is no such trend on FMoW-WPDS. One possible reason is that this depends on how the regression set is defined, and on how close the regression set distribution is to the overall distribution. For CivilComments-WPDS, the regression set samples (severely offensive comments) are very different from the other samples (mostly normal comments), while for FMoW-WPDS the regression set samples from Americas and Asia are closer to the samples from other regions.

Observation #3: Some existing methods do not work well on regression tasks. On the right, we plot the average online and worst reg performances of different supervised OCL algorithms on the Poverty-WPDS benchmark (α = 50%), which is a regression task.
We observe that while some methods like ER-FIFO and L2Reg do better than the FBO baseline, some others, including MIR and EWC, achieve lower online performance than FBO, which should be a lower bound. Most existing methods have only been tested on classification tasks before, and this experiment shows that they might not work as well on regression tasks.

5.2. RESULTS OF SEMI-SUPERVISED OCL ALGORITHMS

Observation #4: Unlabeled data improves OCL performance. In Figure 4, we plot the performances of ER-FIFO-PL/FM and ER-FIFO-RW-PL/FM (in red) along with the performances of all supervised OCL algorithms (in blue) on CivilComments-WPDS and FMoW-WPDS. We can see that with the same worst reg performance, SSL methods achieve higher average online performances, and vice versa. SSL improves the online performance by learning from more new data, but it does not help knowledge retention. Thus, supervised OCL algorithms with high online performances (such as MIR) do not improve much with SSL. We also observe that FixMatch is slightly better than PL.

Observation #5: Virtual update in PL is important. In Figure 5 we plot the performances of ER-FIFO-PL and ER-FIFO-RW-PL on CivilComments-WPDS with different numbers of epochs of virtual update, as labeled near the points. We can see that without virtual update (0 epochs), the online performance of PL is even lower than the FBO baseline. However, with just 1 epoch of virtual update, the online performance rises above FBO, and with more epochs of virtual update the online performance is higher still (but with a lower reg performance). This shows the importance of the virtual update, even though it was not included in previous methods such as gradual self-training (Kumar et al., 2020). One explanation is that the virtual update step distills the knowledge of P(Y | X) from the new distribution into the current model, so P(f′_{t−1}(X) | X) is closer to P(Y | X, D_t). Without virtual update, f_{t−1} only has knowledge from old distributions, so when the model is trained on the pseudo-labeled samples, it reinforces its old knowledge but learns little new knowledge, resulting in a low online performance but a high reg performance. This also implies that unsupervised OCL is difficult and perhaps infeasible, because without new labels the learner cannot know how P(Y | X) changes.
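To make the virtual update concrete, here is a minimal Python sketch of pseudo-labeling with a virtual update step. The toy nearest-mean classifier stands in for the real model f_{t−1}; the classifier and the data are illustrative assumptions, not our actual implementation.

```python
import copy

class NearestMeanClassifier:
    """Toy stand-in for f_{t-1}: classify 1-D inputs by the nearest class mean."""
    def __init__(self):
        self.sums = {}    # class -> running sum of inputs
        self.counts = {}  # class -> running count

    def fit(self, samples):  # samples: list of (x, y)
        for x, y in samples:
            self.sums[y] = self.sums.get(y, 0.0) + x
            self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        return min(self.sums, key=lambda y: abs(x - self.sums[y] / self.counts[y]))

def pl_with_virtual_update(model, labeled_new, unlabeled_new):
    # Virtual update: fine-tune a *copy* of the model on the new labeled
    # samples, so the pseudo-labeler reflects P(Y|X) under the new D_t.
    virtual = copy.deepcopy(model)
    virtual.fit(labeled_new)
    pseudo = [(x, virtual.predict(x)) for x in unlabeled_new]
    # Real update: train the current model on labeled + pseudo-labeled samples.
    model.fit(labeled_new + pseudo)
    return model

# Old distribution: class 0 near 0.0, class 1 near 4.0.
f = NearestMeanClassifier()
f.fit([(0.0, 0), (0.2, 0), (4.0, 1), (4.2, 1)])
# New distribution drifted upward: class 0 near 2.0, class 1 near 6.0.
f = pl_with_virtual_update(f, labeled_new=[(2.0, 0), (6.0, 1)],
                           unlabeled_new=[2.1, 5.9])
print(f.predict(2.1))  # -> 0
```

Without the virtual update, the copy would pseudo-label with the stale class means (0.1 and 4.1), for which x = 2.1 is ambiguous; with it, the drifted means resolve the label correctly.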

6. DISCUSSION

The application of deep learning has become so wide nowadays that it is very difficult to cover all tasks with a single general problem formulation. Here we briefly discuss two additional problems that stem from OCL-PDS which practitioners in certain areas of industry might find useful.

Fine-grained PDS. In our problem formulation in Section 2.1, the learner can fine-tune the model at every t. However, in applications where new data comes in very fast and the distribution changes very quickly, this could be impractical. One typical example is time series analysis such as stock price prediction. We term this setting fine-grained PDS, as the data batches are much smaller and are received much more frequently. In our benchmarks, the time horizon T is around 20, while in a fine-grained PDS benchmark T would typically be 2,000 or 20,000. One way to handle fine-grained PDS is to have two fine-tune procedures: a fast one that can be done at every t, and a slow one that is run simultaneously with the fast one. For instance, in an ensemble method like Mixture-of-Experts (Masoudnia & Ebrahimpour, 2014), the fast procedure only adjusts the weights of the base models, while the slow procedure trains a new base model with the new data.

Abrupt shift detection. We assumed that Div(D_t ∥ D_{t+1}) < ρ for all t. However, this might not always be true, and abrupt shifts can happen in practice. For example, for toxic language detection on social media (Figure 1b), when the US Supreme Court overturned Roe v. Wade, posts related to gender, religion, politics and civil rights flooded social media, causing a sharp shift in the data distribution and probably a significant drop in the model's performance. Thus, practitioners need a mechanism to detect such abrupt shifts so that prompt human intervention can take place. Some applications in industry have online metrics that keep track of the online performance, and a significant performance drop triggers human intervention.
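As a concrete illustration of such an online-metric trigger, here is a minimal Python sketch that flags a time step when the online accuracy drops sharply below its recent rolling mean. The window size and drop threshold are illustrative assumptions.

```python
from collections import deque

def drift_alarm(acc_stream, window=5, drop_threshold=0.10):
    """Flag time steps where online accuracy falls more than `drop_threshold`
    below the rolling mean of the previous `window` steps."""
    history = deque(maxlen=window)
    alarms = []
    for t, acc in enumerate(acc_stream):
        if len(history) == window and (sum(history) / window) - acc > drop_threshold:
            alarms.append(t)  # significant drop -> trigger human intervention
        history.append(acc)
    return alarms

# Gradual PDS causes a slow decay of the online metric; an abrupt shift
# causes a sharp drop at t = 8 in this toy stream.
stream = [0.90, 0.89, 0.90, 0.88, 0.89, 0.88, 0.87, 0.88, 0.70, 0.75]
print(drift_alarm(stream))  # -> [8]
```

The gradual decay stays within the threshold while the abrupt drop trips the alarm, which is exactly the separation between PDS (handled by OCL) and abrupt shifts (handled by human intervention).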
However, in applications without online metrics or where feedback can be delayed, we need to detect distribution shift with unlabeled data alone, a problem known as OOD detection (Hendrycks & Gimpel, 2017; Lee et al., 2018). A more difficult, recently proposed problem is OOD performance prediction (Chen et al., 2021; Jiang et al., 2022; Garg et al., 2022; Baek et al., 2022), i.e. predicting the model's performance on the new distribution with unlabeled data alone, as distribution shift does not necessarily hurt performance.

Limitation. While we commit ourselves to studying the real PDS that exists in industrial applications, in this paper, for privacy reasons, we are limited to working on public datasets, which (despite our effort to make them realistic) are still different from real applications. We hope that in the future more datasets containing the real PDS of industry applications can be released for the better development of this field.

A.1 DISTRIBUTION SHIFT

In general, there are two types of distribution shift problems: domain shift and subpopulation shift (Koh et al., 2021; Sagawa et al., 2022). In domain shift, the training and testing distributions contain different domains, and the goal is to generalize to new domains. This problem is also called out-of-distribution (OOD) generalization, in the sense that the training set is ID (in-distribution) and the test set is OOD. Related areas include domain adaptation, domain generalization, transfer learning, etc. In subpopulation shift, the training and testing distributions consist of the same domains, but the relative proportions are different. Related areas include fair machine learning, long-tailed learning (learning with imbalanced classes), etc. The difference between the two is that in subpopulation shift, the supports of the training and testing data distributions are the same, while in domain shift they are not. A lot of methods have been proposed to train models that are robust to distribution shift.
Here we introduce two general methods, and in the subsequent sections we will talk about methods for specific areas. The most classic method is importance weighting (Shimodaira, 2000), which multiplies the loss on sample x with an importance weight Q(x)/P(x), because ∫ f(x) dQ = ∫ f(x) · (Q(x)/P(x)) dP. This method requires that P(x) > 0 for all x such that Q(x) > 0, i.e. it only works for subpopulation shift. Another general method that is widely used today is Distributionally Robust Optimization (DRO) (Duchi & Namkoong, 2018), which assumes that Q ∈ U(P), where U(P) is the uncertainty set that contains a family of distributions close to P. Then, DRO trains the model on the worst distribution in U(P) (the one with the highest empirical risk), so as to ensure that the model does well on any distribution in U(P), which includes Q. Many variants of DRO have been proposed and are being used today (Hashimoto et al., 2018; Hu et al., 2018; Sagawa et al., 2020a; Lahoti et al., 2020; Zhai et al., 2021a;b). However, there is also a recent line of work that points out problems with these methods, both empirically and theoretically (Byrd & Lipton, 2019; Sagawa et al., 2020b; Gulrajani & Lopez-Paz, 2021; Xu et al., 2021; Wang et al., 2022c). Remarkably, a recent paper, Zhai et al. (2022), proved the surprising result that generalized reweighting (GRW) methods, a broad family of methods that includes importance weighting, DRO and many more, cannot do better than ERM for linear models and NTK neural networks (Jacot et al., 2018). Thus, there is still huge room for improvement in methods for distribution shift.

A.2 CONTINUAL LEARNING

Continual learning (CL), also known as lifelong learning, comes from the philosophy that a learning agent should be able to continually learn new knowledge and improve itself with new data on its own.
As detailed in Ring (1998), a continual learner should be able to learn context-dependent knowledge autonomously, incrementally and hierarchically. A more recent paper Liu (2020) introduced the concept of on-the-job learning, where the learner is required to detect new tasks on its own, collect data for the new tasks and then learn with the data. Given these general philosophical ideals, the goal of CL is to mathematically formulate a problem setting that reflects these ideals. Here we point our readers to the Avalanche library (Lomonaco et al., 2021) , a recently released Python library that contains many benchmarks and algorithms of continual learning.

A.2.1 SETTINGS

There are many continual learning settings, and here we discuss the most widely studied ones.

Task-incremental CL. In this setting, N tasks T_1, · · · , T_N are sequentially given to the learner, who is required to learn these tasks one by one without forgetting the previous tasks. The learner cannot see the old tasks while learning a new one, but it has a memory buffer in which it can store data from previous tasks. The goal is to perform well on all N tasks, and the performance is usually evaluated with the average or the minimum of the performances on all tasks. Task-incremental CL is the oldest setting and dates back to McCloskey & Cohen (1989), which trained a feed-forward network to first learn addition with one and then learn addition with two, and found that when learning the second task the network forgot the first one. This work pinpointed the catastrophic forgetting problem, and as a result many researchers today believe that "the central problem of continual learning is to overcome the catastrophic forgetting problem" (Aljundi et al., 2019c). Two related settings are domain-incremental CL and class-incremental CL. In domain-incremental CL, the data of each task comes from a different domain. In class-incremental CL, samples from new classes appear one by one.

Task-aware CL. In this setting, the tasks are not sequentially given. Rather, each sample has a "task descriptor" indicating which task it belongs to. This setting dates back to the early paper Ratcliff (1990), which trained a multi-layer encoder model to learn four vectors A, B, C and D, provided in a cyclic fashion: ABCDABCDABCD... Then, Lopez-Paz & Ranzato (2017) studied task-aware CL in the online learning setting where samples come in an online stream. The goal of task-aware CL is the same as that of task-incremental CL: to perform well on all N tasks.

Task-agnostic CL.
In task-agnostic CL, each sample still belongs to a certain task, but the "task descriptor" is not provided. The model is still required to do well on all N tasks, and it is still evaluated by the average or the minimum of the performances on all tasks. Algorithms for this setting usually perform task inference, i.e. inferring the task from the input (Van de Ven & Tolias, 2019), under the assumption that samples from the same task are closer to each other. Note that our definition differs from that of some previous work; for example, work on task-agnostic CL like Zeno et al. (2018) is in fact under the task-free setting.

Task-free CL. In task-free CL, there is no task at all. The data comes in an online stream and each sample only appears once. This setting was first introduced in Aljundi et al. (2019b), where they split a data domain into several partitions and sequentially present each partition to the model. At the end of training, the model is evaluated on the entire data domain, i.e. the union of all the partitions. Thus, if the learner were allowed to store all samples and train the model on them, the problem would become equivalent to supervised learning. However, the model cannot store all samples because the storage size is assumed to be limited. This problem is also called the data incremental learning problem in De Lange et al. (2021). We can see that in this problem, the training is online (partitions are sequentially given), but the evaluation is offline (test once on the entire domain).

Online Continual Learning (OCL). We study OCL in this work. It is a variant of task-free CL, as there is no task in OCL. The key characteristic of OCL is that both training and evaluation are online, which is different from the previous task-free CL problem. In OCL, for each new data batch, the model is first evaluated on it and then trained on it.
Thanks to the online evaluation, we can study the real "lifelong" learning setting in OCL, where the time horizon is infinite, which is not possible in the previous task-free CL problem. Note that the term "online continual learning" was used in quite a few previous papers, but most of them refer to "continual learning with online data" (such as Yin et al. (2021a)), which is not the OCL setting we define in this work.
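The evaluate-then-train protocol can be sketched in a few lines of Python; the toy model, evaluation function, and training function below are illustrative assumptions, not a particular OCL algorithm.

```python
def run_ocl(model, stream, evaluate, train):
    """Online continual learning protocol: for each incoming batch, the model
    is FIRST evaluated on it, and only THEN trained on it."""
    online_scores = []
    for batch in stream:
        online_scores.append(evaluate(model, batch))  # online performance at step t
        train(model, batch)                           # model may now use batch t
    return online_scores

# Toy demo: the "model" is a single number tracking a drifting target mean.
model = {"mean": 0.0}
stream = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # distribution drifts upward
evaluate = lambda m, b: -abs(m["mean"] - sum(b) / len(b))  # negative error
def train(m, b):
    m["mean"] = sum(b) / len(b)  # fit the newest batch

scores = run_ocl(model, stream, evaluate, train)
print(scores)  # -> [-1.0, -1.0, -1.0]: each batch is scored before it is learned
```

Because each batch is scored before the model sees it, the online score directly measures how well the learner tracks the drifting distribution, which is what distinguishes OCL from the offline evaluation of task-free CL.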

Time series analysis.

A related area is time series analysis, where the data comes from a nonstationary online distribution, and the task is to predict future data from the current and past observations. We can see that distribution shift naturally exists in the formulation of time series analysis, but we found little work on dealing with distribution shift in time series analysis. One setting in time series analysis that is similar to OCL is temporal covariate shift (TCS), which was first introduced in Du et al. (2021). TCS makes the covariate shift assumption: P(Y|X) is fixed while P(X) shifts with time. The goal of TCS is r-step ahead prediction, i.e. predicting the labels for inputs whose ground truth labels will be revealed r steps later.

Our setting differs from that of Cai et al. (2021) in the following ways: 1. In Cai et al. (2021), the environment reveals all true labels to the learner after evaluation. In contrast, in OCL-PDS, we use the random label feedback which randomly selects an α fraction of the new samples and provides their labels. In particular, we mainly study a semi-supervised learning setting where the environment only reveals a fraction of the labels, which is a more common scenario in industrial applications. 2. The evaluation metrics used in Cai et al. (2021) were: average online accuracy, backward transfer and forward transfer. The first two metrics correspond to the average online performance and the average recent knowledge retention in OCL-PDS. In addition to these metrics, OCL-PDS also considers the regression set performance, which is a very important metric in real applications, the worst (recent/important) knowledge retention performances, since knowledge retention is required for all t, and the training efficiency, which is very important in an online setting where the algorithm is run many times. 3. Cai et al. (2021) only evaluates knowledge retention performances at three fixed time steps: H/3, 2H/3 and H, where H is the total number of time steps.
In contrast, OCL-PDS is purely online: all metrics, including knowledge retention, are evaluated at each step. 4. Cai et al. (2021) proposed the CLOC benchmark, which is indeed a PDS benchmark. However, CLOC only covers vision classification, while our benchmarks cover both language and vision tasks, and both classification and regression tasks. 5. Cai et al. (2021) only ran experience replay (ER) on their benchmark, while our work also compares regularization-based methods such as EWC, variants of ER such as GEM, MIR and MaxLoss, as well as semi-supervised learning methods like pseudo-labeling and FixMatch.

A.2.2 METHODS

Rehearsal based methods. The concept of "rehearsal" was first introduced in Ratcliff (1990); Robins (1995). Rehearsal-based methods store samples in an operational memory (a memory buffer). When the learner learns a new task, it also reviews the old samples in this buffer in order to prevent catastrophic forgetting, a technique known as experience replay. Almost all previous work assumed that the buffer has a fixed size, so two key questions for these methods are: (i) which samples to store in the buffer, and (ii) which samples to replay to the learner. For problem (i), the most widely used strategy is reservoir sampling (Isele & Cosgun, 2018), which maintains the buffer distribution as the average of all task distributions. Other variants include Chrysakis & Moens (2020); Kim et al. (2020). For problem (ii), MIR (Aljundi et al., 2019a) selects for replay the samples with the highest loss increase after a "virtual update" step, MaxLoss (Lin et al., 2022) selects the samples with the highest loss after the virtual update, and OCS (Yoon et al., 2022) selects an online coreset that is diverse and has a high affinity to previous tasks. Moreover, experience replay (ER) can be combined with other methods. For example, MER (Riemer et al., 2019) combines ER with meta learning, GMED (Jin et al., 2021) combines ER with adversarial attack, and Wang et al. (2022d) combines ER with DRO.

There are alternative ways to use the buffer. For example, iCaRL (Rebuffi et al., 2017) stores "exemplars" in the buffer and uses a nearest neighbor classifier with these exemplars. In other words, samples in the buffer are not used for training, but for inference. Similarly, Continual Prototype Evolution (De Lange & Tuytelaars, 2021) maintains and continually updates a prototype for each class, and uses a nearest neighbor classifier for inference.
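Reservoir sampling, mentioned above as the most widely used strategy for problem (i), can be sketched as follows. This is the textbook Algorithm R, not necessarily the exact variant used in the cited papers.

```python
import random

def reservoir_update(buffer, capacity, item, n_seen):
    """Standard reservoir sampling (Algorithm R): after n_seen stream items,
    every item has probability capacity / n_seen of being in the buffer,
    so the buffer is a uniform sample of the whole stream."""
    if len(buffer) < capacity:
        buffer.append(item)
    else:
        j = random.randrange(n_seen)  # uniform in [0, n_seen)
        if j < capacity:
            buffer[j] = item  # replace a uniformly chosen slot
    return buffer

random.seed(0)
buffer, capacity = [], 5
for n, item in enumerate(range(1, 101), start=1):  # stream of 100 items
    reservoir_update(buffer, capacity, item, n)

print(len(buffer))  # -> 5: a uniform sample of the 100 streamed items
```

Because the buffer is a uniform sample of the stream, its distribution approximates the average of all task distributions seen so far, which is the property the ER methods above rely on.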
Another way is introduced in GEM (Lopez-Paz & Ranzato, 2017), where the buffered samples are not directly replayed for training, but are instead used to find a "pseudo gradient" via a convex optimization problem so that the loss on previous tasks does not increase. Variants of GEM include A-GEM (Chaudhry et al., 2019a) and GSS (Aljundi et al., 2019c). Since the buffer size is assumed to be limited, a number of previous papers studied how to utilize the buffer space more efficiently. One line of work proposes to learn the distributions of previous tasks with a generative model, which includes DGR (Shin et al., 2017), FearNet (Kemker & Kanan, 2018) and so on.

Regularization based methods. The high-level idea of these methods is to not change the model weights too much, so that the model's performance on previous tasks does not drop too much. One type of method directly adds to the objective a penalty term which keeps the model weights close to the old weights, such as EWC (Kirkpatrick et al., 2017). Another type uses synaptic regularization (Zenke et al., 2017; Aljundi et al., 2019b), which controls the learning rate of each weight so that more influential weights change more slowly.

Architecture based methods. These methods continually update the model architecture in order to learn the new tasks. One type changes the architecture by adding a "mask" to some weights, such as Piggyback (Mallya et al., 2018), which learns a mask for each task, hard attention masks learned with gradient descent (Serra et al., 2018), and NCCL (Yin et al., 2021b), which adds weight calibration modules and feature calibration modules to balance between stability and plasticity. Another type uses the isolation approach, where the model has two parts: a shared part and a task-specific part. When a new task arrives, the shared part is updated very little, and a new task-specific part is learned on the new task.
Examples of isolation methods include progressive neural network (PNN) (Rusu et al., 2016), Learning without Forgetting (LwF) (Li & Hoiem, 2017), dynamically expandable network (DEN) (Yoon et al., 2018), etc.

Semi-supervised/Unsupervised continual learning. There are also some existing continual learning methods that work under a semi-supervised or unsupervised setting. For instance, Lump (Madaan et al., 2022) uses Mixup to interpolate between instances of the current task and previous tasks to alleviate catastrophic forgetting, and Fini et al. (2022) combines continual learning with self-supervised learning.

A.3 DOMAIN ADAPTATION

Domain adaptation (DA) is a type of distribution shift where "the tasks are the same, and the differences are only caused by domain divergence" (Wang & Deng, 2018). There is a source distribution P and a target distribution Q, with distribution shift from P to Q. During training, the learner is provided with samples from both the source and the target distributions, and depending on whether the samples from the target distribution are labeled, partially labeled or unlabeled, DA is classified as supervised, semi-supervised or unsupervised DA. In particular, in supervised DA, the number of target samples is usually so small that training on these target samples alone cannot lead to a good model. A related area is domain generalization (Gulrajani & Lopez-Paz, 2021; Blanchard et al., 2021), where the learner does not have any samples from the target domain, not even unlabeled ones. In general, a DA algorithm consists of two parts: feature alignment and class alignment. The goal of feature alignment is to train a feature encoder Φ that encodes invariant features, i.e. the images of the source and target domains in the feature space are close to each other, or Φ(P) ≈ Φ(Q). As summarized in Wang & Deng (2018), there are two common ways to achieve feature alignment. The first one is adversarial-based, i.e.
training a domain discriminator to distinguish features from the source and target domains; the features are aligned if this discriminator cannot achieve a high performance. Methods of this type include DANN (Ganin et al., 2016), SagNet (Nam et al., 2021), etc. The second one is discrepancy-based, i.e. minimizing a divergence function between Φ(P) and Φ(Q), also called the "confusion alignment loss" in Motiian et al. (2017). Methods of this type include CORAL (Sun & Saenko, 2016), IRM (Arjovsky et al., 2019), ARM (Zhang et al., 2021), etc. However, there are also papers that point out problems with these methods (Rosenfeld et al., 2021; Gulrajani & Lopez-Paz, 2021). Moreover, even if we have learned a feature encoder Φ such that Φ(P) ≈ Φ(Q), we cannot be sure that the same classifier works for both domains, because samples of different classes in P and Q might be mapped to the same latent feature. Thus, the goal of class alignment is either to make sure that samples of the same class are mapped together, or to train a new classifier w′ that works for Φ(Q). Note that class alignment requires labels from the target domain, which are unavailable in unsupervised domain adaptation or domain generalization. For instance, Tzeng et al. (2015) used soft labels for class alignment, Long et al. (2016) minimized the cross entropy on the target data while using a residual block to keep the source and target classifiers close, and Motiian et al. (2017) used a similarity penalty between samples from different classes.
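As a minimal illustration of the discrepancy-based idea, the sketch below computes the simplest possible discrepancy between Φ(P) and Φ(Q): the squared distance between domain mean features (a linear-kernel MMD). Real methods such as CORAL use richer statistics; the feature vectors here are illustrative toys.

```python
def mean_feature_discrepancy(feats_src, feats_tgt):
    """Squared distance between the mean features of the two domains:
    the simplest discrepancy penalty one can add to the training loss
    to encourage Phi(P) ~ Phi(Q)."""
    d = len(feats_src[0])
    mu_s = [sum(f[i] for f in feats_src) / len(feats_src) for i in range(d)]
    mu_t = [sum(f[i] for f in feats_tgt) / len(feats_tgt) for i in range(d)]
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))

# An aligned encoder maps both domains to the same region of feature space...
aligned = mean_feature_discrepancy([[0.0, 1.0], [0.2, 0.8]],
                                   [[0.1, 0.9], [0.1, 0.9]])
# ...while a misaligned encoder separates them.
misaligned = mean_feature_discrepancy([[0.0, 1.0], [0.2, 0.8]],
                                      [[5.0, 5.0], [5.0, 5.0]])
print(aligned < misaligned)  # -> True
```

In practice this quantity would be computed on minibatch features and added to the task loss as a weighted penalty, so that minimizing the total loss pulls the two feature distributions together.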

A.4 SEMI-SUPERVISED LEARNING

In many applications, the number of labeled samples is limited, but there is also a large number of unlabeled samples. For example, in image classification tasks, while the labels are hard to obtain, free images can easily be retrieved from the internet. Semi-supervised learning studies how to train a model on a small set of labeled samples and a large set of unlabeled samples. This is an old area of research with a very rich body of work, of which we do not intend to make an exhaustive survey here. We briefly discuss several types of methods that are widely used today; more methods can be found in surveys such as Van Engelen & Hoos (2020).

Generating labels for unlabeled samples. Perhaps the most direct and intuitive way of leveraging the unlabeled samples is to generate labels for them, and then train the model over all samples as if they were all labeled. The simplest such method is pseudo-labeling (Lee, 2013), which first trains a model over the labeled samples alone and then uses this model to pseudo-label the unlabeled samples. This method assumes that the labeled and unlabeled samples come from the same underlying distribution, so if a model does well on the labeled samples, it should be able to generate good pseudo labels for the unlabeled samples. However, the quality of the pseudo labels depends on the generalization ability of the model, and if the number of labeled samples is too small, the pseudo labels could contain large systematic label noise. Thus, a number of techniques have been proposed to improve pseudo-labeling. One such technique is consistency regularization, which is based on the following observation: given two transformations x′ and x′′ of the same sample x, their labels should be the same. For instance, FixMatch (Sohn et al., 2020) uses two augmentation methods: a weak one and a strong one.
It generates pseudo labels on the weakly augmented sample x′, and trains the model on the strongly augmented one x′′. Similarly, Noisy Student (Xie et al., 2020) also uses a weak and a strong augmentation, but it alternates between teacher phases, which generate pseudo labels, and student phases, which learn these labels, until convergence. Some other work defines x′ and x′′ as the outputs of the model at different epochs, such as Temporal Ensembling (Laine & Aila, 2017) and Mean Teachers (Tarvainen & Valpola, 2017). There is also a line of work that leverages adversarial attack, such as Virtual Adversarial Training (VAT) (Miyato et al., 2018).

Interpolation based methods. These methods train the model on interpolations between labeled and unlabeled samples. This type of method was initially introduced in MixUp (Zhang et al., 2018), which interpolates between two labeled samples to combat label noise and adversarial attack, as it makes the model behave linearly in between samples. This method was then applied to semi-supervised learning in MixMatch (Berthelot et al., 2019), which combines MixUp with several other techniques, including pseudo-labeling.

Self-supervised learning. The high-level idea of self-supervised learning (also known as representation learning) is to train a good feature extractor on the unlabeled data set with an auxiliary task (the upstream task), and then train a classifier on top of it on the labeled data set (the downstream task). The most famous and widely used self-supervised learning technique is masked language modeling (MLM) in NLP (Devlin et al., 2018), where the auxiliary task is predicting a masked word within a sentence. MLM has achieved great success in NLP, as the feature extractor it learns can be applied to almost any language task and lead to good performance. Inspired by the success of MLM, researchers have also applied self-supervised learning to vision tasks. For example, Doersch et al.
(2015) extracts random pairs of patches from each image, where the auxiliary task is to learn the relative position between each pair of patches, and Gidaris et al. (2018) rotates each image by different angles, where the auxiliary task is to learn the rotation angle; the latter is applied to semi-supervised learning in S4L (Zhai et al., 2019). Today, the most widely used technique is contrastive learning, which extracts different views from each image, and the auxiliary task is to learn which views come from the same image. The feature extractor is trained to learn the similarity between two views: similar if they come from the same image, and different if they do not. This idea was first introduced in Bachman et al. (2019). Currently the most popular contrastive learning methods include SimCLR (Chen et al., 2020), MoCo (He et al., 2020), BYOL (Grill et al., 2020), SwAV (Caron et al., 2020), etc.
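The interpolation idea behind MixUp, discussed above, can be sketched in a few lines; the inputs, one-hot labels, and mixing coefficient λ below are illustrative toy values.

```python
def mixup(x1, y1, x2, y2, lam):
    """MixUp: train on convex combinations of sample pairs, encouraging the
    model to behave linearly in between training samples."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]  # soft label
    return x, y

# A sample with one-hot label [1, 0] mixed with one labeled [0, 1].
x, y = mixup([0.0, 2.0], [1, 0], [4.0, 6.0], [0, 1], lam=0.75)
print(x)  # -> [1.0, 3.0]
print(y)  # -> [0.75, 0.25]
```

In MixMatch-style semi-supervised training, one of the two samples would carry a pseudo label rather than a ground-truth label; in practice λ is drawn from a Beta distribution rather than fixed.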

B THE DIVERGENCE FUNCTION

In this section, we dive deep into one important problem in the formulation of the OCL-PDS problem: how to choose the divergence function Div(P ∥ Q) that guarantees the continuity of the distribution shift? As mentioned in Section 2.2, Div(P ∥ Q) refers to the divergence from P to Q, and ideally we want it to reflect the performance of a model that is trained on P and tested on Q. To study this problem, we will first review existing divergence functions, and then show that an ideal divergence function for OCL-PDS should be asymmetric and bounded, which is unfortunately not satisfied by any popular divergence function. Finally, we will introduce the ϵ-KL-divergence which is used in this work.

B.1 EXISTING DIVERGENCE FUNCTIONS

This part is based on the NeurIPS tutorial by Gretton et al. (2019). Generally speaking, there are two types of existing divergence functions: integral probability metrics (IPMs) and ϕ-divergences. To quickly understand these two types of divergence, think about how to determine whether two distributions P and Q are equal. There are two ways in general: (a) compute P − Q and see if it is zero almost everywhere, and (b) compute P/Q and see if it is one almost everywhere. IPMs correspond to method (a) and ϕ-divergences correspond to method (b).

IPMs. An IPM is defined as Div(P ∥ Q) = sup_{f∈F} [E_{X∼P} f(X) − E_{Y∼Q} f(Y)] for some function family F. Examples include total variation (TV), defined as Div(P ∥ Q) = (1/2) ∫ |P(x) − Q(x)| dx; MMD, defined as Div(P ∥ Q) = ∥E_{X∼P}[π(X)] − E_{Y∼Q}[π(Y)]∥_H for some feature mapping π and reproducing kernel Hilbert space H; and the Wasserstein distance, defined as Div(P ∥ Q) = inf_{γ∈Γ(P,Q)} ∫ D(x, y) dγ(x, y) for some distance function D(·, ·).

ϕ-divergences. A ϕ-divergence is defined as Div_ϕ(P ∥ Q) = ∫ ϕ(dP/dQ) dQ. For example, when ϕ = −log, the ϕ-divergence becomes the reverse KL-divergence, where the popular KL-divergence is defined as D_KL(Q ∥ P) = ∫ Q(x) log(Q(x)/P(x)) dx (note that P and Q are reversed). Total variation is the only non-trivial function that is both an IPM and a ϕ-divergence.
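For discrete distributions, both types of divergence are straightforward to compute. The following Python sketch implements total variation and the reverse KL-divergence on dictionary-based distributions; finite support is an illustrative assumption.

```python
import math

def total_variation(P, Q):
    """TV(P, Q) = (1/2) * sum_x |P(x) - Q(x)| over a discrete support."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)

def reverse_kl(P, Q):
    """Div(P || Q) = D_KL(Q || P) = sum_x Q(x) log(Q(x) / P(x))."""
    total = 0.0
    for x, q in Q.items():
        if q > 0:
            p = P.get(x, 0.0)
            if p == 0:
                return math.inf  # Q has mass where P has none -> unbounded
            total += q * math.log(q / p)
    return total

P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.9, "b": 0.1}
print(total_variation(P, Q))                # -> 0.4, always bounded by 1
print(round(reverse_kl(P, Q), 3))           # -> 0.368
print(reverse_kl(P, {"a": 0.5, "c": 0.5}))  # new outcome "c" -> inf
```

The last line previews the boundedness issue discussed next: as soon as Q puts mass on an outcome outside the support of P, the reverse KL-divergence is infinite while TV stays bounded.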

B.2 DIVERGENCE FUNCTION FOR OCL-PDS

In this part, we show that an ideal divergence function for the OCL-PDS problem should be asymmetric and bounded.

Asymmetric. Suppose we have two very different data domains A and B. Let P = A and Q = 0.5A + 0.5B. A model trained on P would have a very poor performance on Q, because it has never seen any samples from domain B. On the other hand, a model trained on Q could have a good performance on P, because it has seen samples from both A and B. Thus, in this example, we would like to have Div(P ∥ Q) > Div(Q ∥ P), so Div should be asymmetric.

Bounded. The KL-divergence is widely used in the machine learning literature, but one problem is that it is unbounded. Recall that in the OCL-PDS problem formulation, we assume that Div(D_t ∥ D_{t+1}) < ρ for all t. Now consider what would happen if the function Div were unbounded and we wanted to introduce data from new domains into the problem. Specifically, suppose Q contains samples from new domains that are not in P, i.e. there exists x such that Q(x) > 0 and P(x) = 0. In this situation, we must have Div(P ∥ Q) = ∞, no matter how small Q(x) is. Therefore, if Div is unbounded like the reverse KL-divergence, then we could never introduce new domains into the problem, which is not desirable. There is a variant of the KL-divergence called the JS-divergence, defined as Div(P ∥ Q) = (1/2)[D_KL(P ∥ M) + D_KL(Q ∥ M)] where M = (1/2)P + (1/2)Q. Although it is bounded, it is also symmetric, so it is not ideal for OCL-PDS.
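The asymmetry and unboundedness arguments can be checked numerically. In the sketch below, A and B are illustrative point masses on distinct outcomes, so P = A and Q = 0.5A + 0.5B; Div(P ∥ Q), i.e. the reverse KL, is infinite because Q introduces an unseen domain, while Div(Q ∥ P) = log 2 is finite.

```python
import math

def d_kl(Q, P):
    """D_KL(Q || P) over a discrete support; inf if Q has mass outside supp(P)."""
    total = 0.0
    for x, q in Q.items():
        if q > 0:
            if P.get(x, 0.0) == 0:
                return math.inf
            total += q * math.log(q / P[x])
    return total

# Domain A = point mass on "a", domain B = point mass on "b".
P = {"a": 1.0}            # trained distribution: only domain A
Q = {"a": 0.5, "b": 0.5}  # shifted distribution: half A, half B

div_P_Q = d_kl(Q, P)  # Div(P || Q): model trained on P, tested on Q
div_Q_P = d_kl(P, Q)  # Div(Q || P): model trained on Q, tested on P

print(div_P_Q)            # -> inf: Q introduces an unseen domain
print(round(div_Q_P, 3))  # -> 0.693 = log 2: finite, P is covered by Q
```

The directions agree with the performance intuition above: the direction that corresponds to encountering unseen data is the one that blows up.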

B.3 ϵ-KL-DIVERGENCE

In this work, we use the ϵ-KL-divergence, defined as follows:

Div(P ∥ Q) = E_{X∼P}[g(X)] + E_{Y∼Q}[log(−g(Y))] + 1, where g(x) = −Q(x) / max{P(x), ϵ}.

This divergence function has the following properties: (i) It is a lower bound of the reverse KL-divergence, and it is bounded. (ii) If g(x) = −Q(x)/P(x), then this function is equivalent to the reverse KL-divergence (this is its dual formulation). Thus, if P(x) ≥ ϵ for every x such that Q(x) > 0, then the ϵ-KL-divergence is equivalent to the reverse KL-divergence. (iii) In the case where Q contains new domains that are not in P, for example Q = (1 − β)P + βP̃ for some P̃ ⊥ P, we have Div(P ∥ Q) = β + β log(β/ϵ) + (1 − β) log(1 − β) ≈ β + D_KL(Q ∥ (1 − ϵ)P + ϵP̃). From these properties, we can see that what the ϵ-KL-divergence essentially does is add an ϵ lower bound to the denominator of g(x) so that the function becomes bounded.

Example. Given two groups A and B, let ρ = 0.17 and ϵ = 0.02. Then, we have the following group weight allocation schedule, which is widely used in our benchmarks:

Limitations. The major limitation of the ϵ-KL-divergence is that it cannot measure how similar two domains are. For example, suppose we have three domains A, B and C that do not overlap with one another. Samples in A and B are very similar, but they are vastly different from samples in C. In this case, a model trained on A can have a good performance on 0.5A + 0.5B, but a poor performance on 0.5A + 0.5C. However, for the ϵ-KL-divergence, we have Div(A ∥ 0.5A + 0.5B) = Div(A ∥ 0.5A + 0.5C). We can see that the ϵ-KL-divergence cannot measure the similarity between A and B or between A and C. Another limitation is that the choice of ϵ is arbitrary and can affect the function value. Nevertheless, even with these two limitations, we still believe that the ϵ-KL-divergence is suitable for the OCL-PDS problem.
Finally, keep in mind that no divergence function can cover every facet of real problems in practice, and that's why we have three important checking steps in our benchmarking procedure described in Section 3.1.

C BENCHMARK DETAILS

We build 4 new benchmarks for OCL-PDS following the guidelines listed below:
• We only use public datasets in this work.
• The benchmarks should cover a wide variety of tasks.
• The benchmarks should be realistic and reflect the real PDS in industry.
• Naïve methods should perform poorly on these benchmarks, so that special methods are necessary.
In this section, we will first demonstrate in detail how to use the 3-step procedure described in Section 3.1, with the CivilComments-WPDS benchmark as an example. Then, we will present the details for all other datasets, but without the detailed benchmarking procedure.

C.1 CIVILCOMMENTS-WPDS

Here we present how we construct CivilComments-WPDS from the CivilComments-Wilds dataset with the 3-step procedure, including the 3 important checking steps.

Step 1: Separate the data into groups. First, we investigate the metadata we have in this dataset. In CivilComments-Wilds, apart from the target label, each sample also has the following annotations: whether it contains a certain topic (male, female, LGBTQ, christian, muslim, other religions, black, white), and whether it contains a certain type of toxicity (identity attack, explicit sexual, etc.). Based on this metadata, we can model the distribution shift with the shift in hot topics. We separate the samples into four groups according to their topics, as shown in the following table. We can observe that: (i) there is less toxicity in normal comments than in comments about a specific topic; (ii) the proportion of toxic comments about race is much higher than that of the other topics.

Then, we do the shift continuity check, whose purpose is to make sure that the distribution shift in this schedule is continuous. This check goes as follows: for each t, we sample a set from D_t and split it into a training set and a validation set, and then sample a test set from D_{t+1}. We want to make sure that the gap between the validation and the test performances is not too large for each t. The results are the following: we can see that the OOD gaps are kept under 4%, as opposed to the 14.04% gap we got in the OOD check. And we can observe the same phenomenon again: for some t, the OOD accuracy is higher than the ID accuracy. Then, based on these results, we design the sample allocation schedule listed in Table 8. The first batch contains 50000 samples, while all the other batches contain 10000 samples each.

Step 3: Design a separate regression set. We notice that in the CivilComments-Wilds dataset, each toxic comment is annotated with the types of toxicity it contains.
Thus, we design the regression set as the set of comments with two types of severe toxicity: identity attack and explicit sexual. Such comments are critical toxic comments that a good detector should be able to detect with a high success rate. Then, we do the regression check, where we verify that naïve methods suffer regression on the regression set we constructed. Here we use the NBO baseline described in Section 4.1 as the "naïve method". As shown in Figure 6, the regression set performance of NBO quickly declines to under 50%, the performance of random guessing. Thus, the benchmark passes this check.

C.2 FMOW-WPDS

Then, we allocate the groups with the schedule listed in Table 9. We put 120000 samples from Group 0 into batch 0 for pretraining, and all other batches contain 10000 samples each. The regression set is defined as the set of samples in two highly populated regions: Americas and Asia.
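The shift continuity check used throughout these benchmarks can be sketched as below. This is only a schematic of the loop, not the paper's code; `train` and the `sample_split`/`sample`/`accuracy` helpers are hypothetical stand-ins for the actual training and evaluation pipeline.

```python
def shift_continuity_check(D, train, max_gap=0.04):
    """For each t: train on a draw from D_t, then compare in-distribution
    validation accuracy against accuracy on a test set drawn from D_{t+1}.
    Passes if every ID -> OOD gap stays within max_gap (the 4% tolerance)."""
    gaps = []
    for t in range(len(D) - 1):
        train_set, val_set = D[t].sample_split()  # one draw from D_t, split in two
        test_set = D[t + 1].sample()              # a fresh draw from D_{t+1}
        model = train(train_set)
        gaps.append(model.accuracy(val_set) - model.accuracy(test_set))
    return all(g <= max_gap for g in gaps), gaps
```

If any gap exceeds the tolerance, the group-weight schedule shifts too fast and should be smoothed before being used in the benchmark.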

C.3 AMAZON-WPDS

This benchmark is built from the Amazon-Wilds dataset. In the original paper (Koh et al., 2021), the ID/OOD sets are divided by the reviewers, and the dataset also contains other metadata such as year and product categories. However, we find in our experiments that these groups cannot create a sufficiently large distribution shift. Thus, we use the following class split method, which has been widely used in the continual learning literature: we divide the 5 classes of this dataset (corresponding to the 5-star ratings) into 2 groups and 2 classes, positive and negative reviews, as shown in the following table:

Table 10: In Amazon-WPDS, the 5-star ratings are divided into 2 groups and 2 classes.

          Group 0      Group 1          # Samples   Group 0      Group 1
  y = 0   1, 2 stars   3 stars          y = 0       193,900      377,039
  y = 1   4 stars      5 stars          y = 1       1,087,385    2,343,846

Then, we allocate the groups to the batches with the schedule listed in Table 12. This schedule is based on the weight schedule we obtained in Table 2. The first batch has 50000 samples from Group 1, and then we model a group shift from Group 1 to Group 0. Each subsequent batch has 5000 samples. Finally, the regression set is defined as the set of reviews from 10 popular product categories, including books, fashion, etc.
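The class-split rule in Table 10 can be written out directly (a one-to-one transcription of the table; the function name is ours):

```python
def amazon_wpds_label(stars):
    """Map a 1-5 star rating to (group, binary label) per Table 10:
    negative reviews (y=0): 1-2 stars -> Group 0, 3 stars -> Group 1;
    positive reviews (y=1): 4 stars -> Group 0, 5 stars -> Group 1."""
    if stars in (1, 2):
        return 0, 0
    if stars == 3:
        return 1, 0
    if stars == 4:
        return 0, 1
    if stars == 5:
        return 1, 1
    raise ValueError(f"invalid star rating: {stars}")
```

Note that the split deliberately places the "harder" ratings (3 and 5 stars vs. 1-2 and 4 stars) in Group 1, so the group shift changes which fine-grained ratings each binary class covers.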

C.4 POVERTY-WPDS

This benchmark is built from the PovertyMap-Wilds dataset, which is an image regression task. First, we cluster the data into 4 groups by the year. Then we allocate the groups with the schedule listed in Table 13. We put all samples from Group 0 (except test samples) into the first batch for pretraining, and each subsequent batch contains 800 samples. Finally, the regression set is defined as the set of samples from urban areas, which can better reflect the overall wealth index of each country. Note that this is a regression task, and we evaluate the model performance with the Pearson correlation, following Koh et al. (2021).
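Performance on this benchmark is the Pearson correlation between predicted and true wealth indices. For reference, a minimal self-contained implementation (equivalent to the correlation value returned by `scipy.stats.pearsonr`):

```python
import math

def pearson_r(pred, true):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mp = sum(pred) / n
    mt = sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in true))
    return cov / (sp * st)
```

Unlike accuracy, this metric is computed over a whole evaluation set at once, so per-batch values are correlations, not averages of per-sample scores.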

D ALGORITHM DETAILS

First of all, at t = 0, all OCL algorithms train the initial model on the first labeled batch and the training regression set with empirical risk minimization (ERM). In particular, following Koh et al. (2021), we use the cross-entropy loss for classification tasks and the mean squared error for regression tasks. The differences among the methods only appear for t > 0.

Baselines. First, note that the upper-bound baseline i.i.d. offline is not an OCL algorithm, because it assumes access to D_t at time t. For each t, it trains a model on a set sampled from D_t with a sufficiently large size and no overlap with S_t. The size of the training set differs across benchmarks, but we always make sure that it is at least as large as the union of all batches before time t. i.i.d. offline is only an approximate upper bound of the online performance.

ER-FIFO-RW. A replay strategy that ignores the sizes of the batches and the regression set can be problematic. For example, if the regression set is much larger than the batches, then it will be hard for the model to learn the new knowledge. Thus, ER-FIFO-RW alters the weights of the three data sets: the new data, the recent data, and the regression data. Specifically, it uses uniform sampling over these three sets, so that each set has the same probability of being selected.

MIR and MaxLoss. When there are too many previous samples in the buffer, we cannot replay all of them, for efficiency reasons. ER-FIFO-RW solves this problem by randomly sampling previous samples to replay. However, this might not be the most efficient method, as some previous samples might be more useful than others for preventing catastrophic forgetting. MIR and MaxLoss are two strategies for selecting replay samples, which operate as follows. For each iteration:
1. Sample n new samples and n_kr previous samples that require knowledge retention (KR).
2. Virtual update: fine-tune f_{t-1} → f'_{t-1} with ERM on the n new samples.
3. Select n samples from the n_kr previous samples for replay.
Specifically, MIR selects the samples whose loss increases the most between before and after the virtual update, while MaxLoss selects the samples with the highest loss after the virtual update.
4. Recover the model to f_{t-1}, and fine-tune f_{t-1} → f_t with ERM on the n new samples and the n selected previous samples.

The ratio n_kr/n is named kr size in our code, and should be greater than 1. The larger n_kr is, the more likely we are to select "useful" replay samples, but the slower the algorithm.

GEM-PDS. In this method, for each iteration, we first sample n new data, n_kr recent data and n_kr regression data, on which we estimate the three gradients of the loss function g_0, g_1 and g_2, respectively. A larger n_kr allows the learner to estimate g_1 and g_2 more accurately. The ratio n_kr/n is still named kr size. Then, the pseudo-gradient g* is the optimal solution of the convex optimization problem Eqn. (1), which can be solved with the following procedure:

1. a = ⟨g_0, g_1⟩, b = ⟨g_1, g_1⟩, c = ⟨g_1, g_2⟩, d = ⟨g_0, g_2⟩, e = ⟨g_2, g_2⟩.
2. If a ≥ 0 and d ≥ 0, then return g_0.
3. p = cd − ae, q = ac − bd, r = be − c².
4. ĝ_1 = g_0 − (a/b) g_1. If a ≤ 0 and q ≤ 0, then return ĝ_1.
5. ĝ_2 = g_0 − (d/e) g_2. If d ≤ 0 and p ≤ 0, then return ĝ_2.
6. Return ĝ_3 = g_0 + (p/r) g_1 + (q/r) g_2.    (3)

We can verify with the KKT conditions that this procedure returns the correct solution of Eqn. (1) (e.g. see Section 5.5.3 of Boyd et al. (2004)). The model is then updated with gradient descent along g*.

Online L2 Regularization and EWC. Let ℓ_t(θ_t) be the loss of model f_t, parameterized by θ_t (the vectorized model weights), on the labeled batch of S_t. In online L2 regularization, we minimize the following objective function with a fixed number of epochs of ERM:

min_{θ_t} ℓ_t(θ_t) + (λ/2) ∥θ_t − θ_0∥²_2    (4)

where we add an L_2 penalty between the new model weights θ_t and the initial model weights θ_0.
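Procedure (3) above can be transcribed directly into NumPy (our own transcription; `g0`, `g1` and `g2` are the three gradient estimates as flat arrays):

```python
import numpy as np

def gem_pds_gradient(g0, g1, g2):
    """Closed-form pseudo-gradient for GEM-PDS: the vector closest to g0
    whose inner products with the recent-data gradient g1 and the
    regression-data gradient g2 are both non-negative."""
    a, b, c = g0 @ g1, g1 @ g1, g1 @ g2
    d, e = g0 @ g2, g2 @ g2
    if a >= 0 and d >= 0:               # both constraints already satisfied
        return g0
    p = c * d - a * e
    q = a * c - b * d
    r = b * e - c * c
    g1_hat = g0 - (a / b) * g1          # project onto {<g, g1> = 0}
    if a <= 0 and q <= 0:
        return g1_hat
    g2_hat = g0 - (d / e) * g2          # project onto {<g, g2> = 0}
    if d <= 0 and p <= 0:
        return g2_hat
    return g0 + (p / r) * g1 + (q / r) * g2   # both constraints active
```

One can check numerically that the returned vector always satisfies ⟨g*, g_1⟩ ≥ 0 and ⟨g*, g_2⟩ ≥ 0, matching the KKT argument above.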
The reasons why we use ∥θ_t − θ_0∥²_2 instead of ∥θ_t − θ_{t-1}∥²_2 are:
1. θ_0 has a very good regression set performance, which is not guaranteed for θ_{t-1}. Using ∥θ_t − θ_0∥²_2 thus ensures a higher regression set performance.
2. If T is very large, then the model weights can still change a lot if we use ∥θ_t − θ_{t-1}∥²_2 (as the change accumulates with t), so knowledge retention cannot be guaranteed.

In EWC, the objective function is the following:

min_{θ_t} ℓ_t(θ_t) + (λ/2) ∥diag(F_t)(θ_t − θ_0)∥²_2    (5)

where F_t is the Fisher information matrix (FIM), and diag(F_t) contains only the elements on the diagonal of F_t. Following Kirkpatrick et al. (2017), we estimate diag(F_t) from the first-order derivatives of the loss function on n_kr samples for knowledge retention. The ratio n_kr/n is still named kr size. With a larger n_kr, we can estimate F_t more accurately. λ is named lbd.

Pseudo Labeling (PL) and FixMatch (FM). In PL and FM, there are two hyperparameters: the number of epochs of the virtual update step, epochs v, and the number of epochs of the real fine-tuning step, epochs r. The virtual update is done with ERM on the labeled samples only, while the real fine-tuning is done with a supervised OCL algorithm on the union of labeled and pseudo-labeled samples. Thus, PL and FM can be combined with any supervised OCL algorithm, which we denote by adding the suffix "PL" or "FM" to the algorithm name, such as ER-FIFO-PL and ER-FIFO-FM.

The training hyperparameters generally follow Koh et al. (2021), with one exception: we use a multi-step learning rate decay scheduler for the two vision benchmarks, as we find that it produces better performance than the old scheduler. For each experiment we report the following 6 metrics:
• For online performance, we report the average online performance (avg online) within a finite T.
• For knowledge retention, we report the average recent performance (avg recent) and the worst recent performance (worst recent), which are the average and the minimum of the recent performance over t = w, …, T. We also report the worst performance because knowledge retention is required for every t. Likewise, we also report the average reg performance (avg reg) (regression set performance) and the worst reg performance (worst reg). As mentioned in Section 3.1, we use separate test batches to evaluate the recent and reg performances.
• For training efficiency, we report the average runtime (avg time) of the method over t = 1, …, T. Note that the time used to train the initial model (t = 0) is not included in the average runtime. Each experiment is run on a single NVIDIA V100 GPU.

Each experiment is run 5 times with different random seeds, and the mean and the standard deviation of the results are reported. In particular, for each benchmark we use 5 fixed initial models: we train 5 initial models on the first batch and the training regression set with ERM for a fixed number of epochs with different random seeds, and then use these 5 models as initial models for all methods. This both alleviates the effect of randomness in the initial models and saves time. Under this setting, FBO can also serve as an upper bound of the worst regression set performance, which equals the regression set performance at t = 1 for every method.
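Assuming per-step performances are stored in 0-indexed lists whose entry t corresponds to time t (our convention for illustration; the paper's code may index differently), the five performance metrics can be computed as follows, with avg time tracked separately:

```python
def ocl_metrics(online, recent, reg, w):
    """Aggregate OCL metrics from per-step performance sequences.
    online/recent/reg have length T+1 (entries for t = 0..T); entry 0
    covers the pretraining step and is excluded where appropriate."""
    T = len(online) - 1
    recent_tail = recent[w:]  # recent performance is defined for t = w..T
    return {
        "avg online": sum(online[1:]) / T,
        "avg recent": sum(recent_tail) / len(recent_tail),
        "worst recent": min(recent_tail),
        "avg reg": sum(reg[1:]) / T,
        "worst reg": min(reg[1:]),
    }
```

Reporting both the average and the worst values matches the requirement that knowledge retention must hold at every step, not just on average.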

E.2 RESULTS

In Table 14 we list the notations we use in the results. The results are reported in Tables 15-18. 



This is not equivalent to group shift or subpopulation shift as studied in fair machine learning and long-tailed learning (learning with imbalanced classes). The group weights here are used to keep the scale of the divergence from being too big. We simulate a gradual shift by continuously shifting the group weights.

Samples in FMoW-WPDS are clustered into 5 groups by their year:

  Group   Years        # Samples
  0       2002-2013    132,948
  1       2014         54,575
  2       2015         87,358
  3       2016         140,459
  4       2017         54,746



(a) FMoW-WPDS benchmark. (b) CivilComments-WPDS benchmark.

Figure 2: Results of supervised OCL algorithms on CivilComments-WPDS (α = 0.5%) and FMoW-WPDS (α = 50%). Each point corresponds to one pair of algorithm and hyperparameters.

Figure 3: Poverty-WPDS.

Figure 4: Performances of ER-FIFO-PL/FM and ER-FIFO-RW-PL/FM (in red). FM is only used for FMoW.

A LITERATURE REVIEW

A.1 DISTRIBUTION SHIFT

Distribution shift in machine learning refers to the scenario where the model is tested on a distribution Q different from the distribution P on which it was trained, in contrast with the conventional machine learning setting where the training and testing sets are i.i.d. sampled from the same distribution. Distribution shift is studied in a number of areas in machine learning, including domain adaptation, continual learning, transfer learning, fair machine learning, long-tailed learning, etc. See Table 2 of Gulrajani & Lopez-Paz (2021) and Table 2 of Wang et al. (2022b) for a comparison among these areas.

r steps later. Other recent papers that study distribution shift with time series analysis include Duan et al. (2022); Gagnon-Audet et al. (2022); Kim et al. (2022).

Comparison with a previous work. One previous work, Cai et al. (2021), proposed an OCL framework similar to ours. In both settings, the model is evaluated on new online batches before being fine-tuned on them. The differences between our work and Cai et al. (2021) are the following: 1. Cai et al. (2021) considered a fully supervised setting:

Figure 6: Regression check.

Following Koh et al. (2021), we use a DistilBert-base-uncased model for CivilComments-WPDS and Amazon-WPDS, a DenseNet-121 for FMoW-WPDS, and a ResNet-18 for Poverty-WPDS. For the training hyperparameters, we generally use the same ones as in Koh et al. (2021).

  …               …, kr size             L2Reg-epochs-lbd-kr size           EWC-5-0.1-4
  ER-FIFO-PL      epochs v, epochs r     ER-FIFO-PL-epochs v-epochs r       ER-FIFO-PL-10-5
  ER-FIFO-RW-PL   epochs v, epochs r     ER-FIFO-RW-PL-epochs v-epochs r    ER-FIFO-RW-PL-10-5
  ER-FIFO-FM      epochs v, epochs r     ER-FIFO-FM-epochs v-epochs r       ER-FIFO-FM-10-5
  ER-FIFO-RW-FM   epochs v, epochs r     ER-FIFO-RW-FM-epochs v-epochs r    ER-FIFO-RW-FM-10-5

Benchmarks we build. T + 1 = total number of batches. w = recent time window.

Sample group weight allocation schedule.

Samples in CivilComments-Wilds are separated into 4 groups by their topics.

Shift continuity check results of CivilComments-WPDS (%).

Train/test split. For each t, we split S_t into a training set and a test set, and the regression set is also split into a training set and a test set. These test sets are used to evaluate the recent and regression set performances. We never evaluate the learner on data it has seen, because the learner has an infinitely large buffer where it can store the data. Instead, we evaluate the learner on separate i.i.d. test sets. Note that the numbers listed in Table 8 are all sizes of the training sets.

Samples in FMoW-WPDS are clustered into 5 groups by their year.

Sample allocation schedule for CivilComments-WPDS.

Sample allocation schedule for FMoW-WPDS.

Samples in Poverty-WPDS are clustered into 4 groups by their year.

Notations of algorithms.

Results on CivilComments-WPDS (α = 0.5%). Accuracies in %.

Results on FMoW-WPDS (α = 50%). Accuracies in %.


Then, we do the OOD check, where we verify that there is a distribution shift across the groups. We perform this check in the following way: for each Group k, we train a model on a training set sampled from all groups except Group k, and then test this model on a validation set sampled from all groups except Group k, and on a test set sampled from Group k. The results are the following: from this table, we can see that Group 1 (race) has the largest OOD performance gap, much larger than the other groups. Thus, when allocating the groups in Step 2, we will make sure that the shift to Group 1 is slower. Moreover, we observe that on Group 0, the gap is negative, which means that the OOD performance is better than the ID performance (i.e. a model trained on Groups 1-3 has a higher accuracy on Group 0 than on Groups 1-3). This is a counter-intuitive phenomenon, as it is usually taken for granted that OOD performance should be lower than ID performance. The cause might be that Group 0 is much easier to learn than the other groups (for example, the data is more concentrated and depends on fewer features).

Step 2: Assign shifting group weights to the batches. We want to model the distribution shift with the shift in hot topics on social media. At each time t, D_t contains normal comments as well as comments about the current hot topic, i.e. samples from Group 0 exist in every batch. We design the schedule listed in the following table:

Then, for FBO, the algorithm does not do anything for t > 0, and the initial model is used till the end. For NBO, for each new batch S_t, the model is trained on the labeled portion of S_t only, with a fixed number of epochs of ERM, and no previous sample is replayed.

ER-FIFO and ER-FIFO-RW. In ER-FIFO, for each t > 0, the model is fine-tuned on the union of the new labeled batch, the recent labeled batches, and the training regression set, for a fixed number of epochs (named epochs) of ERM. Specifically, the recent labeled batches and the training regression set are stored in the memory buffer. One issue of this approach is that it does not take into account the sizes of the batches and the regression set.

