SCHEDULE-ROBUST ONLINE CONTINUAL LEARNING

Abstract

A continual learning (CL) algorithm learns from a non-stationary data stream, whose non-stationarity is modeled by some schedule that determines how data is presented over time. Most current methods make strong assumptions on the schedule, and their performance is unpredictable when those assumptions are not met. A key challenge in CL is thus to design methods that are robust against arbitrary schedules over the same underlying data, since in real-world scenarios schedules are often unknown and dynamic. In this work, we introduce the notion of schedule-robustness for CL and a novel approach satisfying this desirable property in the challenging online class-incremental setting. We also present a new perspective on CL as a two-stage process: learning a schedule-robust predictor, followed by adapting that predictor using only replay data. Empirically, we demonstrate that our approach outperforms existing methods on CL benchmarks for image classification by a large margin.

1. INTRODUCTION

A hallmark of natural intelligence is its ability to continually absorb new knowledge while retaining and updating existing knowledge. Achieving this objective in machines is the goal of continual learning (CL). Ideally, CL algorithms learn online from a never-ending and non-stationary stream of data, without catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990; French, 1999). The non-stationarity of the data stream is modeled by some schedule that defines what data arrives and how its distribution evolves over time. Two families of schedules are commonly investigated: task-based (De Lange et al., 2021) and task-free (Aljundi et al., 2019b). The task-based setting assumes that new data arrives one task at a time and that the data distribution is stationary within each task. Many CL algorithms (e.g., Buzzega et al., 2020; Kirkpatrick et al., 2017; Hou et al., 2019) thus train offline, with multiple passes and shuffles over task data. The task-free setting does not assume the existence of separate tasks but instead expects CL algorithms to learn online from streaming data with an evolving sample distribution (Caccia et al., 2022; Shanahan et al., 2021).

In this work, we tackle the task-free setting with a focus on class-incremental learning, where novel classes are observed incrementally and a single predictor is trained to discriminate all of them (Rebuffi et al., 2017). Existing works are typically designed for specific schedules, since explicitly modeling and evaluating across all possible data schedules is intractable. Consequently, these methods often have unpredictable performance when scheduling assumptions fail to hold (Farquhar & Gal, 2018; Mundt et al., 2022; Yoon et al., 2020). This is a considerable issue for practical applications, where the actual schedule is either unknown or may differ from what these methods were designed for.
This challenge calls for an ideal notion of schedule-robustness: CL methods should behave consistently when trained on different schedules over the same underlying data. To achieve schedule-robustness, we introduce a new strategy based on a two-stage approach: 1) learning online a schedule-robust predictor, followed by 2) adapting the predictor using only data from experience replay (ER) (Chaudhry et al., 2019b). We will show that both stages are robust to diverse data schedules, making the whole algorithm schedule-robust. We refer to it as SChedule-Robust Online continuaL Learning (SCROLL). Specifically, we propose two online predictors that by design are robust against arbitrary data schedules and catastrophic forgetting. To learn appropriate priors for these predictors, we present a meta-learning perspective (Finn et al., 2017; Wang et al., 2021) and connect it to pre-training strategies in CL (Mehta et al., 2021). We show that pre-training offers an alternative and efficient procedure for learning predictor priors, instead of directly solving the meta-learning formulation. This makes our method computationally competitive and at the same time offers a clear justification for adopting pre-training in CL. Finally, we present effective routines for adapting the predictors from the first stage. We show that using only ER data for this step is key to preserving schedule-robustness, and discuss how to mitigate overfitting when ER data is limited.

Contributions. 1) We introduce the novel concept of schedule-robustness for CL, an important property that existing methods lack. 2) We propose a novel online strategy that satisfies schedule-robustness, along with practical algorithms. 3) Theoretically, we connect CL to standard meta-learning methods via schedule-robustness. This justifies the use of pre-trained models as a knowledge prior for CL. We further show that multi-class classification is an efficient and principled procedure for learning such priors in CL.
4) Empirically, we show that SCROLL outperforms state-of-the-art methods by a large margin (over a 20% improvement in accuracy in many settings). This further supports the focus of this work on schedule-robustness and our strategy to achieve it in practice.
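The intuition behind the first stage can be illustrated with a toy example. The sketch below is not the paper's algorithm: it uses a nearest-class-mean classifier whose state (per-class feature sums and counts) is invariant to sample order and batching, hence schedule-robust by construction; the feature extractor ψ is omitted and raw inputs stand in for features.

```python
import numpy as np

class OnlineClassMeans:
    """Toy online predictor whose state (per-class feature sums and counts)
    is invariant to how the stream is ordered and batched."""

    def __init__(self, feature_dim):
        self.sums, self.counts = {}, {}
        self.d = feature_dim

    def update(self, features, labels):
        # One streaming batch B_t; accumulation is order-independent.
        for phi, y in zip(features, labels):
            self.sums[y] = self.sums.get(y, np.zeros(self.d)) + phi
            self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, features):
        # Assign each sample to the nearest class mean.
        classes = sorted(self.sums)
        means = np.stack([self.sums[c] / self.counts[c] for c in classes])
        dists = ((features[:, None, :] - means[None]) ** 2).sum(-1)
        return [classes[i] for i in dists.argmin(1)]

# Same underlying data, two different schedules.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(c, 0.1, size=(10, 4)) for c in range(3)])
y = [c for c in range(3) for _ in range(10)]

m1, m2 = OnlineClassMeans(4), OnlineClassMeans(4)
m1.update(X, y)                      # schedule S1: one large batch
for i in rng.permutation(30):        # schedule S2: shuffled, one sample at a time
    m2.update(X[i:i + 1], [y[i]])
# Both schedules yield the same predictor on the same underlying data.
```

Any predictor of this form behaves identically under every schedule over the same data, which is exactly the property the second stage must then preserve by adapting with replay data only.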

2. PRELIMINARIES AND RELATED WORKS

We formalize CL as learning from non-stationary data sequences. A data sequence consists of a dataset D = {(x_i, y_i)}_{i=1}^N regulated by a schedule S = (σ, β). Applying the schedule S to D is denoted by S(D) ≜ β(σ(D)), where σ(D) is a specific ordering of D, and β(σ(D)) = {B_t}_{t=1}^T splits the sequence σ(D) into T batches of samples B_t = {(x_{σ(i)}, y_{σ(i)})}_{i=k_t}^{k_{t+1}}, with k_t the batch boundaries. Intuitively, σ determines the order in which the samples (x, y) ∈ D are observed, while β determines how many samples are observed at a time. Fig. 1 (Left) illustrates how the same dataset D can be streamed according to different schedules. For example, S_1(D) in Fig. 1 (Left) depicts the standard schedule that splits and streams D in batches of C classes at a time (C = 2).
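The schedule formalism can be sketched in code: σ as a permutation of the dataset and β as a split of the resulting sequence at boundaries k_t. The function and parameter names below (`apply_schedule`, `batch_sizes`) are illustrative assumptions, not part of the paper.

```python
import random

def apply_schedule(dataset, seed=0, batch_sizes=None):
    """Stream a dataset under a schedule S = (sigma, beta):
    sigma orders the samples, beta splits the ordering into batches."""
    sigma = list(dataset)
    random.Random(seed).shuffle(sigma)        # sigma: one possible ordering of D
    if batch_sizes is None:
        batch_sizes = [2] * (len(sigma) // 2)  # beta: default to batches of 2
    batches, start = [], 0
    for size in batch_sizes:                   # beta: cut at boundaries k_t
        batches.append(sigma[start:start + size])
        start += size
    return batches

# The same dataset D streamed under two different schedules.
D = [(x, x % 3) for x in range(6)]             # toy (x, y) pairs, 3 classes
S1 = apply_schedule(D, seed=0)                 # three batches of 2
S2 = apply_schedule(D, seed=1, batch_sizes=[3, 3])  # two batches of 3
```

Both streams contain exactly the samples of D; only the order and batching differ, which is what a schedule-robust method must be insensitive to.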

2.1. CONTINUAL LEARNING

A CL algorithm learns from S(D) one batch B_t at a time, iteratively training a predictor f_t : X → Y to fit the observed data. Some formulations assume access to a fixed-size replay buffer M, which mitigates forgetting by storing and reusing samples for future training. Given an initial predictor f_0 and an initial buffer M_0, we define the update rule of a CL algorithm Alg(·) at step t as (f_t, M_t) = Alg(B_t, f_{t-1}, M_{t-1}), where the algorithm learns from the current batch B_t and updates both the replay buffer M_{t-1} and the predictor f_{t-1} from the previous iteration. At test time, the performance of the algorithm is evaluated on a distribution π_D that samples pairs (x, y) sharing the same labels as the samples in D. The generalization error is denoted by

L(S(D), f_0, Alg) = E_{(x,y)∼π_D} [ℓ(f_T(x), y)].   (2)
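The update rule above can be sketched as a generic replay-based CL loop. The reservoir-sampling buffer and the placeholder `alg_step` below are common illustrative choices, not the paper's method.

```python
import random

def reservoir_update(buffer, batch, capacity, n_seen, rng):
    """Maintain a fixed-size replay buffer M with reservoir sampling,
    so every sample seen so far is kept with equal probability."""
    for sample in batch:
        if len(buffer) < capacity:
            buffer.append(sample)
        else:
            j = rng.randint(0, n_seen)  # uniform over [0, n_seen]
            if j < capacity:
                buffer[j] = sample
        n_seen += 1
    return buffer, n_seen

def continual_learn(stream, alg_step, f0, capacity=10):
    """Run (f_t, M_t) = Alg(B_t, f_{t-1}, M_{t-1}) over a stream of batches."""
    f, M, n_seen = f0, [], 0
    rng = random.Random(0)
    for B in stream:
        f = alg_step(B, f, M)  # learn from current batch + replay buffer M_{t-1}
        M, n_seen = reservoir_update(M, B, capacity, n_seen, rng)
    return f, M

# Usage with a trivial placeholder predictor that just counts samples seen.
stream = [[(i, i % 2) for i in range(t * 4, t * 4 + 4)] for t in range(5)]
def alg_step(B, f, M):
    return f + len(B)
f, M = continual_learn(stream, alg_step, f0=0, capacity=6)
```

Any real `alg_step` (e.g., a gradient step on B_t mixed with samples from M) plugs into the same loop; the buffer logic is independent of the predictor.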



Figure 1: Left. Illustration of a classification dataset D streamed according to different schedules (dashed vertical lines identify separate batches). Right. Pre-training followed by the two stages of SCROLL: 1) learning online and storing replay samples from the data stream; 2) adapting the predictor using the replay buffer (green indicates whether the representation ψ is being updated).

