ONLINE BOUNDARY-FREE CONTINUAL LEARNING BY SCHEDULED DATA PRIOR

Abstract

Typical continual learning setups assume that the dataset is split into multiple discrete tasks. We argue that this is unrealistic, as streamed real-world data has no notion of task boundaries. Here, we take a step forward to investigate a more realistic online continual learning setup: learning a continuously changing data distribution without explicit task boundaries, which we call the boundary-free setup. Due to the lack of boundaries, it is not obvious when and which past information should be preserved to better address the stability-plasticity dilemma. To this end, we propose a scheduled transfer of previously learned knowledge. In addition, we propose a data-driven balancing between past and present knowledge in the learning objective. Moreover, since previously proposed forgetting measures are not straightforward to use without task boundaries, we further propose novel forgetting and knowledge gain measures based on information theory. We empirically evaluate our method on a Gaussian data stream and its periodic extension, which is frequently observed in real-life data, as well as on the conventional disjoint task split. Our method outperforms prior art by large margins in various setups on four benchmark datasets from the continual learning literature: CIFAR-10, CIFAR-100, TinyImageNet, and ImageNet. Code is available at https://github.com/yonseivnl/sdp.

1. INTRODUCTION

In real-world continual learning (CL) scenarios (He et al., 2020), data arrive in a streamed manner (Aljundi et al., 2019a; Cai et al., 2021), whereas typical continual learning setups split the data into multiple discrete tasks whose data distributions differ from each other. Moreover, most CL algorithms are studied in an offline CL setup (Kirkpatrick et al., 2017; Rebuffi et al., 2017; Saha et al., 2021), where the model can access the data multiple times. While prevalent in the literature, this setup has a number of issues that keep it far from realistic scenarios. Although the task setup has been partly addressed by (Prabhu et al., 2020; Koh et al., 2021; Kim et al., 2021b; Bang et al., 2022), the revised setups still retain the notion of a task boundary, whereas real-world data may not have explicit task boundaries since the data distribution changes continuously. Although many methods update the model in a boundary-agnostic manner, called task-free CL (Aljundi et al., 2019b; Koh et al., 2021), they still leverage the notion of a task boundary for knowledge transfer and evaluation, e.g., exploiting the fact that distribution shifts in the data stream occur only at task boundaries. In addition, the definition of forgetting depends on the notion of 'old' and 'new' tasks, which are defined by the task boundary. We therefore address an online CL setup where data are learned online (allowing only a single access to each datum) under continuous distribution shift without explicit task boundaries. We refer to this setup as online boundary-free continual learning. In this setup, small sets of data are streamed to the model one by one, and the model has access only to the current data batch (Aljundi et al., 2019c; a), without any notion of a task boundary.
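As a concrete illustration of this protocol, the minimal loop below streams samples in small batches, gives the learner exactly one update per batch, and passes no boundary signal. The function name, the `batch_size` default, and the `learner` callback are hypothetical names for this sketch, not part of the proposed method.

```python
def run_online_stream(stream, learner, batch_size=16):
    """Online boundary-free CL loop: each small batch is seen exactly once,
    and the learner receives no task-boundary information."""
    seen = 0
    batch = []
    for x, y in stream:            # samples arrive one by one
        batch.append((x, y))
        if len(batch) == batch_size:
            learner(batch)         # a single update on the current batch only
            seen += len(batch)
            batch = []
    if batch:                      # flush the final partial batch
        learner(batch)
        seen += len(batch)
    return seen
```

Note that, unlike offline CL, no batch is ever revisited: the loop discards each batch after one update, matching the single-access constraint above.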
For the distribution of a continuous data stream, following (Shanahan et al., 2021; Wang et al., 2022), we consider the Gaussian distribution as an instance of data streaming distributions. The Gaussian online data stream models the frequency of each class as a Gaussian distribution over time. Note that classes do not recur after their initial Gaussian mode in this setup. However, in real-world data, the frequency of a class may have multiple recurring modes over time, rather than a single mode, as depicted in Fig. 1. To address such a scenario, we also investigate a periodic-Gaussian online stream, in which each class recurs periodically. To the best of our knowledge, this is the first work to study CL on continuous data distributions with or without periodicity.

The boundary-free setup poses several challenges. In CL setups with explicit task boundaries, methods using both episodic memory and distillation (i.e., using a data prior (Buzzega et al., 2020; Wu et al., 2019; Hou et al., 2019)) show compelling performance (Masana et al., 2020). They store the model weights at each task boundary and use the stored model as a distillation teacher to mitigate catastrophic forgetting. In a continuous data stream, however, it is challenging to determine which past models should be stored for use as a data prior. To determine which data prior to transfer knowledge from, we propose to combine different exponential moving average (EMA) models to realize a particular schedule for transferring past knowledge. In addition, as the past knowledge now comes from diverse contexts, it is not trivial to balance the supervisory signals from the past and the present. Instead of using a fixed balancing hyperparameter, we propose to learn to balance them for better generalization across multiple scenarios, i.e., datasets.
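To make the EMA idea concrete, the sketch below keeps several EMA copies of the model parameters at different decay rates and blends them into a single teacher. This is a simplified illustration under our own assumptions (parameters as plain dictionaries, two fixed decay rates, a hand-picked mixing weight); the actual scheduled combination and the learned balancing in the proposed method are more involved.

```python
def ema_update(ema, params, decay):
    """In-place EMA of model parameters: ema <- decay * ema + (1 - decay) * params.
    A decay close to 1 keeps long-term past knowledge; a small decay tracks the present."""
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]

def combined_teacher(emas, weights):
    """Blend several EMA copies (e.g., a fast and a slow one) into one teacher,
    giving a scheduled mix of short- and long-term past knowledge."""
    assert abs(sum(weights) - 1.0) < 1e-8, "mixing weights should sum to 1"
    return {k: sum(w * e[k] for w, e in zip(weights, emas)) for k in emas[0]}
```

In a distillation setup, the blended dictionary would parameterize the teacher whose predictions regularize the online learner, in place of the per-boundary model snapshots used in task-split CL.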
In our empirical studies, we observe that our method outperforms comparable prior art on Gaussian, periodic-Gaussian, and disjoint task-split data streams on four popular benchmarks in the CL literature. Moreover, conventional performance metrics for CL methods, including forgetting, are not trivially applicable to our setup, as they are defined with respect to task boundaries. Here, we propose a new metric for measuring forgetting based on information theory. In contrast to the conventional forgetting metric, it captures the loss and gain of intra-class knowledge, which is appropriate for periodic data distributions where the model has to accumulate different knowledge about the same class over multiple periods. We summarize our contributions as follows:
• Extensively studying online CL in a continuous data stream setup without explicit task boundaries, including the newly proposed periodic CL setup.
• Proposing an online boundary-free CL method that uses a scheduled transfer of past knowledge.
• Proposing to learn to balance the use of past and present knowledge.
• Proposing new metrics that measure the loss of past knowledge (i.e., forgetting) and the gain of new knowledge (i.e., the opposite of intransigence) based on information theory.
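To give intuition for a boundary-free, information-theoretic view of forgetting, the toy proxy below scores per-class knowledge at a time point as the average log-likelihood the model assigns to held-out samples of that class, and reports the signed change between two time points. This is purely illustrative and is not the measure proposed in this work; both function names are hypothetical.

```python
import math

def class_knowledge(probs_for_class):
    """Average log-likelihood (negative cross-entropy) the model assigns to the
    true label on held-out samples of one class; higher means more retained knowledge."""
    return sum(math.log(p) for p in probs_for_class) / len(probs_for_class)

def knowledge_delta(probs_then, probs_now):
    """Signed change in per-class knowledge between two time points:
    a negative value indicates forgetting, a positive value indicates knowledge gain."""
    return class_knowledge(probs_now) - class_knowledge(probs_then)
```

Unlike boundary-based forgetting, such a quantity can be evaluated at arbitrary points of a continuous stream and can increase again when a class recurs in a periodic stream.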

2. RELATED WORK

Setups for Continual Learning. With the increasing popularity of CL, there have been several proposals for more realistic learning configurations. As the first task setup to mimic a real-world data stream that continuously changes over time, prior work employed the notion of a task split, where the entire dataset is split into multiple subsets serving as successive tasks (Rebuffi et al., 2017; Castro et al., 2018; Wu et al., 2019). When each subset arrives, it is stored and provided as a set of examples for learning a model. Methods that use multiple epochs to learn the model (Kirkpatrick et al., 2017; Rebuffi et al., 2017; Saha et al., 2021) are referred to as the offline setup, while in the online setup a small batch of samples from the subset is streamed and used only once (Rolnick et al., 2019; Chaudhry et al., 2019; Aljundi et al., 2019a). In the task-split setup, task boundaries are available and are well exploited by various offline/online CL methods (Lopez-Paz & Ranzato, 2017; Rebuffi et al., 2017). In recent literature, however, there have been efforts to question whether the task-split setup is realistic. To enforce a different class distribution in each split, the disjoint task split confines each class to only a single task (Castro et al., 2018). Since the disjoint setup is rather artificial, as real data streams arrive in a class-agnostic manner, the blurry task split allows every task to share all classes but with different dominance (Aljundi et al., 2019c; Bang et al., 2021). The i-Blurry task split further guarantees that some classes are added incrementally on top of the blurry task split (Koh et al., 2021). However, these task configurations still have explicit task boundaries, which remain artificial. For a more realistic scenario, task-free CL (Aljundi et al., 2019b) has been studied, where models are not allowed to use task-boundary information during training. However, they still train and evaluate methods on task-

