ONLINE CONTINUAL LEARNING FOR PROGRESSIVE DISTRIBUTION SHIFT (OCL-PDS): A PRACTITIONER'S PERSPECTIVE

Abstract

We introduce the novel OCL-PDS problem: Online Continual Learning for Progressive Distribution Shift. PDS refers to the subtle, gradual, and continuous distribution shift that widely exists in modern deep learning applications. It is widely observed in industry that PDS can cause significant performance drops. While previous work in continual learning and domain adaptation addresses this problem to some extent, our investigations from the practitioner's perspective reveal flawed assumptions that limit their applicability to the daily challenges faced in real-world scenarios, and this work aims to close the gap between academic research and industry. For this new problem, we build 4 new benchmarks from the Wilds dataset (Koh et al., 2021), and implement 12 algorithms and baselines including both supervised and semi-supervised methods, which we test extensively on the new benchmarks. We hope that this work can provide practitioners with tools to better handle realistic PDS, and help scientists design better OCL algorithms.

1. INTRODUCTION

In most modern deep learning applications, the input data undergoes a continual distribution shift over time. For example, consider a satellite image classification task as illustrated in Figure 1a. In this task, the input data distribution changes over time due to changes in the landscape, as well as camera updates that can lead to higher image resolutions and wider color bands. Similarly, in a toxic language detection task on social media, illustrated in Figure 1b, the distribution shift can be caused by a shift in trends and hot topics (many people post about hot topics like BLM (Wikipedia contributors, 2022a) and Roe v. Wade (Wikipedia contributors, 2022b) on social media), or a shift in language use (Röttger & Pierrehumbert, 2021; Luu et al., 2022). Such distribution shifts can cause significant performance drops in deep models, a widely observed phenomenon known as model drift. A critical problem for practitioners, therefore, is how to deal with what we term progressive distribution shift (PDS), defined as the subtle, gradual, and continuous distribution shift that widely exists in modern deep learning applications. In this work, we explore handling PDS with online continual learning (OCL), where the learner collects, learns from, and is evaluated on online samples from a continually changing data distribution. In Section 2, we formulate the OCL-PDS problem.

The OCL-PDS problem is closely related to two research areas with rich bodies of academic work: domain adaptation (DA) and continual learning (CL). However, through a literature review and our conversations with practitioners, we find that a gap remains between the settings widely used in academic work and those in real industrial applications. To close this gap, we commit ourselves to thinking from a practitioner's perspective, which is the core spirit of this work. Our primary goal is to build tools for investigating the real issues practitioners face in their day-to-day work.
To achieve this goal, we challenge the prevailing assumptions in previous work, and propose three important modifications to the conventional DA and CL problem settings:

1. Task-free: One point that conventional DA and CL settings have in common is the assumption of clear boundaries between distinct domains (or tasks), yet practitioners rarely apply the same model to very different domains in industry. In contrast, OCL studies the task-free CL setting (Aljundi et al., 2019b), where there is no clear boundary and the distribution shift is continuous. Moreover, in OCL both training and evaluation are online, unlike previous task-free CL settings with offline evaluation, which is less realistic in a "lifelong" setting.

2. Forgetting is allowed: Avoiding catastrophic forgetting is a major topic in CL, where methods usually require no forgetting on any task. However, remembering everything is impractical, infeasible, and potentially harmful, so OCL-PDS only requires remembering recent knowledge and important knowledge, the latter described by a regression set (Sec. 2.2).

3. Infinite storage: Previous work in CL usually assumes a limited storage (buffer) size. However, storage is not the most pressing bottleneck in most industrial applications. Thus, in OCL-PDS, we assume infinitely large storage where all historical samples can be kept. The learner still cannot replay all samples, because doing so would be too inefficient.

To demonstrate the novelty and practicality of the OCL-PDS problem, in Section 2.3 we discuss related work, compare OCL-PDS with common and similar settings used in previous work, and elaborate on why we believe these three key modifications align OCL-PDS more closely with industrial applications and practitioners' pain points. A more thorough literature review can be found in Appendix A. Then, in Section 3, we build 4 new benchmarks for OCL-PDS, including both vision and language tasks.
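The "infinite storage, bounded replay" assumption in modification 3 can be made concrete with a small sketch. The class name, uniform sampling strategy, and replay size below are our illustrative assumptions, not the paper's released implementation:

```python
import random

class ReplayBuffer:
    """Unbounded storage of all historical samples (the OCL-PDS assumption),
    while each fine-tuning step replays only a small, bounded subset."""

    def __init__(self, replay_size=64, seed=0):
        self.storage = []              # grows without limit over the stream
        self.replay_size = replay_size
        self.rng = random.Random(seed)

    def add_batch(self, batch):
        # Keep every historical sample; storage is assumed to be cheap.
        self.storage.extend(batch)

    def sample_replay(self):
        # Replay cost stays O(replay_size) no matter how large storage grows.
        k = min(self.replay_size, len(self.storage))
        return self.rng.sample(self.storage, k)

buffer = ReplayBuffer(replay_size=4)
buffer.add_batch([("x1", 0), ("x2", 1), ("x3", 0)])
buffer.add_batch([("x4", 1), ("x5", 0)])
replayed = buffer.sample_replay()   # 4 of the 5 stored samples
```

Real OCL algorithms typically replace the uniform `sample_replay` with a smarter selection rule (e.g. favoring recent samples or a regression set), which is exactly the design space this setting opens up.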
When building these benchmarks, we make every effort to ensure that they reflect the real PDS scenarios practitioners need to deal with. In Section 4, we explore OCL algorithms and how to combine them with semi-supervised learning (SSL), since unlabeled data is very common in practice. In total, we implement 12 supervised and semi-supervised OCL algorithms and baselines adapted to OCL-PDS, which we test extensively on our benchmarks in Section 5. Our key observations in these experiments include: (i) a task-dependent relationship between learning and remembering; (ii) some existing methods perform poorly on regression tasks; (iii) SSL helps improve online performance, but it requires a critical virtual update step. Finally, in Section 6 we discuss remaining problems and limitations.

Contributions. Our contributions in this work include: (i) introducing the novel OCL-PDS problem, which more closely aligns with practitioners' needs; (ii) releasing 4 new benchmarks for this novel setting; (iii) adapting and implementing 12 OCL algorithms and baselines, both supervised and semi-supervised, for OCL-PDS; (iv) comparing these algorithms and baselines in extensive experiments, which leads to a number of key observations. Overall, we believe that this work is an important step toward closing the gap between academic research and industry, and we hope that it inspires more practitioners and researchers to investigate and dive deep into real-world PDS. To this end, we release our benchmarks and algorithms, which are easy to use and which we hope can help boost the development of OCL algorithms for handling PDS.

2. THE OCL-PDS PROBLEM

For t = 1, 2, ...:

1. Data collection: Collect a new batch of samples S_t ∼ D_t.
2. Evaluation: Predict on S_t with the current model f_{t-1}, and get some feedback.
3. Fine-tuning: Update the model f_{t-1} → f_t with all previous information.

Evaluation metrics. An OCL algorithm is used for fine-tuning and is evaluated by three metrics:

1. Online performance: Denote the performance of f_s on S_t by A^t_s. The online performance at time t (as computed in Step 2, Evaluation) is A^t_{t-1}, and the average online performance up to horizon T is defined as (A^1_0 + ... + A^T_{T-1}) / T.
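The average online performance is simply the mean of the per-step accuracies A^t_{t-1}. A minimal sketch of this bookkeeping (the function name is ours, not from the paper's code):

```python
def average_online_performance(per_step_accuracy):
    """per_step_accuracy[t] holds A^{t+1}_t, the performance of model f_t
    evaluated on the next incoming batch S_{t+1}, for t = 0, ..., T-1."""
    T = len(per_step_accuracy)
    if T == 0:
        raise ValueError("need at least one online evaluation step")
    return sum(per_step_accuracy) / T

# e.g. three online evaluation steps with accuracies 0.8, 0.6, 0.7
avg = average_online_performance([0.8, 0.6, 0.7])   # mean = 0.7
```

Note that each model f_{t-1} is scored on data it has never trained on (S_t), so this metric rewards tracking the shifting distribution rather than memorizing past batches.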



Figure 1: The (a) FMoW-WPDS and (b) CivilComments-WPDS benchmarks, which we build in this work.

2.1 PROBLEM FORMULATION

We have a stream of online data S_0, S_1, ..., where each S_t is a batch of i.i.d. samples from a distribution D_t that changes continuously with time t. We assume that Div(D_t ∥ D_{t+1}) < ρ for all t, for some divergence function Div. Online Continual Learning (OCL) goes as follows:

• At t = 0, receive a labeled training set S_0, on which we train the initial model f_0.
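The collect/evaluate/fine-tune loop of this formulation can be sketched end to end. The driver function, its arguments, and the toy constant-predictor "model" below are our hypothetical stand-ins, not the paper's released code:

```python
def run_ocl(initial_model, stream, fine_tune, evaluate):
    """stream yields the labeled batches S_1, S_2, ...; evaluate(f, S) returns
    the online performance A^t_{t-1}; fine_tune(f, history) returns f_t."""
    f = initial_model                  # f_0, trained offline on S_0
    history, online_scores = [], []
    for S_t in stream:                 # Step 1: a new batch S_t ~ D_t arrives
        online_scores.append(evaluate(f, S_t))   # Step 2: score f_{t-1} on S_t
        history.append(S_t)            # infinite storage: keep every batch
        f = fine_tune(f, history)      # Step 3: update f_{t-1} -> f_t
    avg = sum(online_scores) / len(online_scores)
    return f, avg

# Toy usage: the "model" is a constant label prediction, and accuracy is
# the fraction of matching labels in the batch.
stream = [[("a", 1), ("b", 1)], [("c", 0), ("d", 1)]]
evaluate = lambda f, S: sum(f == y for _, y in S) / len(S)
fine_tune = lambda f, hist: f          # no-op fine-tuning for the sketch
final_f, avg = run_ocl(1, stream, fine_tune, evaluate)
```

The key structural point the sketch captures is that evaluation happens before fine-tuning at every step, so each batch serves as a held-out test set exactly once before it joins the training history.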

