MEMORY-EFFICIENT SEMI-SUPERVISED CONTINUAL LEARNING: THE WORLD IS ITS OWN REPLAY BUFFER Anonymous

Abstract

Rehearsal is a critical component of class-incremental continual learning, yet it requires a substantial memory budget. Our work investigates whether this memory budget can be significantly reduced by leveraging unlabeled data from an agent's environment in a realistic and challenging continual learning paradigm. Specifically, we explore and formalize a novel semi-supervised continual learning (SSCL) setting, where labeled data is scarce yet non-i.i.d. unlabeled data from the agent's environment is plentiful. Importantly, data distributions in the SSCL setting are realistic and therefore reflect object-class correlations between, and among, the labeled and unlabeled data distributions. We show that a strategy built on pseudo-labeling, consistency regularization, out-of-distribution (OoD) detection, and knowledge distillation reduces forgetting in this setting. Our approach, DistillMatch, increases average task accuracy over the state of the art by no less than 8.7% and by as much as 54.5% in SSCL CIFAR-100 experiments. Moreover, we demonstrate that DistillMatch saves up to 0.23 stored images per processed unlabeled image, compared to only 0.08 for the next best method. Our results suggest that focusing on realistic correlated distributions offers a significantly new perspective, one that accentuates the importance of leveraging the world's structure as a continual learning strategy.

1. INTRODUCTION

Computer vision models in the real world are often frozen and not updated after deployment, yet they may encounter novel data in the environment. Unlike the typical supervised learning setting, class-incremental continual learning challenges the learner to incorporate new information as it sequentially encounters new object classes without forgetting previously-acquired knowledge (catastrophic forgetting). Research has shown that rehearsal of prior classes is a critical component of class-incremental continual learning (Hsu et al., 2018; van de Ven & Tolias, 2019). Unfortunately, rehearsal requires a substantial memory budget, either in the form of a coreset of stored experiences or a separate learned model that generates samples from past experiences. This is not acceptable for memory-constrained applications, which cannot afford to grow their memory as they encounter new classes. Instead, we consider a novel real-world setting where an incremental learner's labeled task data is a product of its environment and the learner encounters a vast stream of unlabeled data in addition to the labeled task data. In such a setting (visualized in Figure 1), the unlabeled datastream is intrinsically correlated with each learning task due to the underlying structure of the environment. We explore many ways in which this correlation may exist. For example, when an incremental learner is tasked to learn samples of the previously-unseen class c_i at time i in the real world, examples of c_i may be encountered in the environment (in unlabeled form) during some future task. In such a setting, an incremental learner could use the unlabeled data in its environment as a source of memory-free rehearsal, though it would need a method to determine which unlabeled data is relevant to the incremental task (i.e., detecting in-distribution data).
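The in-distribution filtering step described above can be illustrated with a minimal sketch. Here we use a maximum-softmax-probability threshold as a simple stand-in for an OoD detector; the function name `select_in_distribution` and the threshold value are our own illustrative assumptions, not the paper's method.

```python
import numpy as np

def select_in_distribution(unlabeled_logits, threshold=0.8):
    """Keep only unlabeled samples whose maximum softmax probability
    exceeds a confidence threshold -- a simple proxy for detecting
    in-distribution data in the unlabeled stream (illustrative only)."""
    # Numerically-stable softmax over class logits for each sample.
    z = unlabeled_logits - unlabeled_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    confidence = probs.max(axis=1)
    return confidence >= threshold

# A confident (in-distribution) sample vs. a near-uniform (OoD) sample.
logits = np.array([[6.0, 0.0, 0.0],   # confident -> kept for rehearsal
                   [0.3, 0.2, 0.1]])  # uncertain -> rejected
mask = select_in_distribution(logits, threshold=0.8)
print(mask.tolist())  # [True, False]
```

Only the samples passing the mask would then be eligible to serve as memory-free rehearsal data for past classes.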
We formalize this realistic paradigm as the semi-supervised continual learning (SSCL) setting, wherein unlabeled and labeled data are not i.i.d. because they are correlated through the underlying structure of the environment. We propose and conduct experiments over a realistic setting in which this correlation may exist, in the form of label super-class structure (e.g., unlabeled examples of household furniture such as chairs, couches, and tables will appear while learning the labeled examples of household electrical devices such as lamps, keyboards, and televisions (Krizhevsky et al., 2009; Zhu & Bain, 2017)). We measure the final-task accuracy A, the accuracy over all tasks Ω, and the coreset memory required to attain a specific level of Ω accuracy over several realistic SSCL settings.

Figure 1: Unlike standard replay (scheme A), which requires a substantial memory budget, we explore the potential of an unlabeled datastream to serve as replay and significantly reduce the required memory budget. Unlike previous work, which requires access to an external datastream uncorrelated with the environment (scheme B), we consider the datastream to be a product of the continual learning agent's environment (scheme C).

Our experiments demonstrate that state-of-the-art continual learning methods (Lee et al., 2019) perform inconsistently in the novel SSCL paradigm, with no prior method performing "best" across all settings. This leads us to ask: "How can an approach to catastrophic forgetting be robust to several realistic, memory-constrained continual learning scenarios?" To answer this question, we propose DistillMatch, a novel learning approach that works well in both the simple (i.e., no correlations) and realistic SSCL settings. We leverage unlabeled data not only for knowledge distillation (in which the distilling model is fixed), but also for a semi-supervised loss (in which the supervisory signal can adapt during training on the new task).
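For concreteness, the accuracy-over-all-tasks metric Ω can be read as an average of per-task accuracies. The sketch below assumes Ω is the unweighted mean of per-task accuracies measured after training on the final task; the paper may define or normalize it differently, and the function name `omega` is ours.

```python
def omega(per_task_accuracies):
    """Average task accuracy: mean of the per-task accuracies measured
    after the final task (a simplified reading of the Omega metric)."""
    return sum(per_task_accuracies) / len(per_task_accuracies)

# Three tasks evaluated after all training is complete.
print(round(omega([0.9, 0.7, 0.5]), 3))  # 0.7
```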
Key to our approach is that we address the distribution mismatch between the labeled and unlabeled data (Oliver et al., 2018) with out-of-distribution (OoD) detection, and we are the first to do so in the continual learning setting. Compared to the nearest prior state of the art (all methods from Lee et al. (2019)), configured to work as well as possible in the novel SSCL setting, we outperform the state of the art in all of our experiment scenarios, by as much as a 54.5% increase in Ω and no less than an 8.7% increase. Furthermore, we find that our method saves up to 0.23 stored images per processed unlabeled image over naive rehearsal, compared to Lee et al. (2019), which saves only 0.08. In summary, we make the following contributions:

1. We propose the realistic semi-supervised continual learning (SSCL) setting, where object-object correlations between labeled and unlabeled sets are maintained through a label super-class structure. We show that state-of-the-art continual learning methods perform inconsistently in the SSCL setting (i.e., no baseline method is "best" across all settings).

2. We propose DistillMatch, a novel continual learning method for the SSCL setting that leverages pseudo-labeling, strong data augmentations, and out-of-distribution detection. Compared to the baselines, DistillMatch achieves superior performance on a majority of metrics in 8/8 experiments and results in substantial memory budget savings.
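The ingredients named above (supervised learning on scarce labels, pseudo-label consistency, and distillation against a frozen previous-task model) can be sketched as a single combined training loss. This is an illustrative numpy sketch under our own assumptions: the confidence threshold `tau`, the weights `lam_u` and `lam_d`, and the exact combination are not the paper's values, and the function name `distillmatch_style_loss` is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillmatch_style_loss(labeled_logits, labels,
                            weak_logits, strong_logits,
                            prev_model_logits, current_unlabeled_logits,
                            tau=0.95, lam_u=1.0, lam_d=1.0):
    """Illustrative combination of the three loss terms named in the text."""
    # 1. Supervised cross-entropy on the scarce labeled data.
    probs = softmax(labeled_logits)
    l_sup = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

    # 2. Consistency: pseudo-label from a weakly-augmented view, applied
    #    to the strongly-augmented view only when confident enough.
    weak_probs = softmax(weak_logits)
    pseudo = weak_probs.argmax(axis=1)
    mask = weak_probs.max(axis=1) >= tau
    strong_probs = softmax(strong_logits)
    if mask.any():
        l_cons = -np.log(strong_probs[mask, pseudo[mask]] + 1e-12).mean()
    else:
        l_cons = 0.0

    # 3. Distillation: match the frozen previous-task model's soft
    #    predictions on unlabeled data (cross-entropy with soft targets).
    prev_probs = softmax(prev_model_logits)
    cur_probs = softmax(current_unlabeled_logits)
    l_dist = -(prev_probs * np.log(cur_probs + 1e-12)).sum(axis=1).mean()

    return l_sup + lam_u * l_cons + lam_d * l_dist
```

In practice the OoD detector would gate which unlabeled samples contribute to the consistency and distillation terms; that gating is omitted here for brevity.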

2. BACKGROUND AND RELATED WORK

Knowledge Distillation in Continual Learning: Several related methods leverage distillation losses on past tasks to mitigate catastrophic forgetting, using soft labels from a frozen copy of the previous task's model (Castro et al., 2018; Hou et al., 2018; Li & Hoiem, 2017; Rebuffi et al., 2017). For example, learning with two teachers, one distilling knowledge from previous tasks and the other distilling knowledge from the current task, has been found to increase adaptability to a new task while preserving knowledge of the previous tasks (Hou et al., 2018; Lee et al., 2019). Class-balancing and fine-tuning have been used to encourage the model's final predicted class distribution

