GRADIENT-BASED MEMORY EDITING FOR TASK-FREE CONTINUAL LEARNING

Anonymous

Abstract

Continual learning often assumes knowledge of (strict) task boundaries and identities for the instances in a data stream, i.e., a "task-aware" setting. In practice, however, practitioners can rarely expose task information to the model, which motivates "task-free" continual learning methods. Recent attempts at task-free continual learning focus on developing memory-construction and replay strategies so that model performance over previously seen instances is best retained. In this paper, looking from a complementary angle, we propose to "edit" memory examples so that the model better retains past performance when the memory is replayed. Such memory editing is achieved by making gradient updates to memory examples so that they are more likely to be "forgotten" by the model when it views new instances in the future. Experiments on five benchmark datasets show that our proposed method can be seamlessly combined with baselines to significantly improve performance and achieve state-of-the-art results.1

1. INTRODUCTION

Accumulating past knowledge and adapting to evolving environments are key traits of human intelligence (McClelland et al., 1995). While contemporary deep neural networks have achieved impressive results in a range of machine learning tasks (Goodfellow et al., 2015), they have not yet manifested the ability to learn continually over evolving data streams (Ratcliff, 1990). These models suffer from catastrophic forgetting (McCloskey & Cohen, 1989; Robins, 1995) when trained in an online fashion, i.e., performance drops over previously seen examples during the sequential learning process. To this end, continual learning (CL) methods have been developed to alleviate catastrophic forgetting when models are trained on non-stationary data streams (Goodfellow et al., 2013). Most existing work on continual learning assumes that, when models are trained on a stream of tasks sequentially, task specifications such as task boundaries or identities are exposed to the models. These task-aware CL methods make explicit use of task specifications to avoid catastrophic forgetting, e.g., by consolidating parameters important to previous tasks (Kirkpatrick et al., 2017; Zenke et al., 2017; Nguyen et al., 2018), distilling knowledge from previous tasks (Li & Hoiem, 2017; Rannen et al., 2017), or separating task-specific model parameters (Rusu et al., 2016; Serrà et al., 2018). However, in practice, it is more likely that data instances arrive in a sequential, non-stationary fashion without task identities or boundaries, a setting commonly termed task-free continual learning (Aljundi et al., 2018). Recent attempts have been made to tackle this setting (Aljundi et al., 2018; Zeno et al., 2018; Lee et al., 2020).
These efforts revolve around regularization- and model-expansion-based approaches, which rely on inferring task boundaries or identities (Aljundi et al., 2018; Lee et al., 2020) or on performing online parameter importance estimation (Zeno et al., 2018) to consolidate or separate model parameters. In another line of work, memory-based CL methods have achieved strong results in the task-free setting (Aljundi et al., 2019b). These methods store a small set of previously seen instances in a fixed-size memory and utilize them for replay (Robins, 1995; Rolnick et al., 2019) or regularization (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a). The core problem in memory-based CL methods is how to manage the memory instances (e.g., which to replace with new instances) and replay them given a restricted computation budget, so that model performance can be maximally preserved.

In this paper, we provide a new approach to the memory management problem in task-free continual learning by studying how to make gradient updates on stored memory examples. We develop a novel memory editing algorithm that complements existing memory-replay methods and data-sampling strategies for memory management (updates). The challenge is to propose a plausible and sound optimization objective for editing. We employ the same intuition as previous studies (Toneva et al., 2019; Chaudhry et al., 2020; Aljundi et al., 2019a): examples that are likely to be forgotten should be prioritized. Our proposed method, named Gradient-based Memory EDiting (GMED), edits examples stored in the memory with gradient-based updates so that they are more likely to be forgotten. Specifically, we estimate the "forgetting" of a stored example by its loss increase over the upcoming online model update, and then perform gradient ascent on the stored example so as to increase this estimated forgetting.
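To make the editing step concrete, the following is a minimal NumPy sketch of this idea for a single-example logistic-regression model. All names (gmed_edit, the look-ahead learning rate lr, the editing stride alpha) and the choice of model are our own illustrative assumptions, not the paper's implementation: we take one "virtual" SGD step on an incoming example, define the forgetting of a stored example as its loss increase under that virtual update, and ascend the input-gradient of that loss increase.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y, w):
    # logistic loss for one example with label y in {-1, +1}
    return np.log1p(np.exp(-y * (w @ x)))

def grad_x(x, y, w):
    # gradient of the logistic loss w.r.t. the input x
    return -y * sigmoid(-y * (w @ x)) * w

def grad_w(x, y, w):
    # gradient of the logistic loss w.r.t. the weights w
    return -y * sigmoid(-y * (w @ x)) * x

def gmed_edit(x_mem, y_mem, w, x_new, y_new, lr=0.1, alpha=0.05):
    """One GMED-style edit of a stored example.

    Estimates forgetting d(x) = loss(x; w_virtual) - loss(x; w),
    where w_virtual is a look-ahead SGD step on the incoming data,
    then moves x_mem one gradient-ascent step along d's input-gradient.
    """
    # look-ahead ("virtual") model update on the new example
    w_virtual = w - lr * grad_w(x_new, y_new, w)
    # input-gradient of the estimated forgetting
    d_grad = grad_x(x_mem, y_mem, w_virtual) - grad_x(x_mem, y_mem, w)
    # gradient ascent: make the stored example more likely to be forgotten
    return x_mem + alpha * d_grad
```

After one edit, the stored example's estimated forgetting under the same look-ahead update is (to first order) larger than before, which is exactly the quantity the editing objective targets.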
Experiments show that our algorithm consistently outperforms baselines on five benchmark datasets under various memory sizes. An ablation study shows that the proposed editing mechanism outperforms alternative editing strategies such as random editing. We further demonstrate that the proposed algorithm is general enough to be combined with other strong (more recent) memory-based CL methods to further enhance performance, yielding improvements on many benchmark datasets.

2. RELATED WORK

Task-aware Continual Learning. Most continual learning algorithms are studied under "task-aware" settings, where the model visits a sequence of clearly separated "tasks". A large portion of these algorithms make explicit use of task boundaries (Kirkpatrick et al., 2017; Rusu et al., 2016; Lopez-Paz & Ranzato, 2017), either by learning separate parameters for each task or by discouraging changes to parameters that are important for old tasks. Existing continual learning algorithms can be summarized into three categories: regularization-based, architecture-based, and data-based approaches. Regularization-based approaches (Kirkpatrick et al., 2017; Zenke et al., 2017; Nguyen et al., 2018; Adel et al., 2020) discourage changes to parameters that are important for previous data. Model-expansion-based approaches (Rusu et al., 2016; Serrà et al., 2018; Li et al., 2019) expand the model architecture to separate parameters for previous and current data. Data-based approaches (Robins, 1995; Shin et al., 2017; Lopez-Paz & Ranzato, 2017) replay or constrain model updates with real or synthetic examples.

Task-free Continual Learning. Recently, task-free continual learning (Aljundi et al., 2018), where no knowledge about task boundaries is assumed, has drawn increasing interest. To the best of our knowledge, only a handful of regularization-based (Zeno et al., 2018; Aljundi et al., 2018), model-expansion-based (Lee et al., 2020), generative-replay-based (Rao et al., 2019), and continual meta-learning and meta-continual learning (He et al., 2019; Caccia et al., 2020; Harrison et al., 2020) approaches are applicable in the task-free CL setting. Meanwhile, most memory-based continual



1 Code is included in the supplementary materials and will be published.



Figure 1: Categorization of memory-based methods for task-free continual learning. Reservoir sampling and sampling from memory are the strategies that Experience Replay (ER) uses to construct and replay the memory, respectively. Among recent works, Gradient-based Sample Selection (GSS) and Hindsight Anchor Learning (HAL) explore ways to construct the memory, while Maximally Interfered Retrieval (MIR) focuses on the replay strategy. Our method, Gradient-based Memory Editing (GMED), falls under the former category but provides a new angle on memory construction.
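For reference, the reservoir-sampling memory construction that ER relies on can be sketched in a few lines. This is a generic textbook sketch, not code from any of the cited works; the function name reservoir_update and its signature are our own assumptions. It guarantees that each of the n_seen stream examples occupies a memory slot with equal probability capacity / n_seen.

```python
import random

def reservoir_update(memory, capacity, example, n_seen):
    """Insert one stream example into a fixed-size memory buffer.

    memory   : list holding at most `capacity` stored examples
    example  : the incoming stream example
    n_seen   : 1-indexed count of stream examples seen so far,
               including this one
    """
    if len(memory) < capacity:
        # buffer not yet full: always store
        memory.append(example)
    else:
        # replace a random slot with probability capacity / n_seen
        j = random.randint(0, n_seen - 1)  # uniform over [0, n_seen - 1]
        if j < capacity:
            memory[j] = example
```

GMED is complementary to this construction rule: reservoir sampling decides *which* examples stay in memory, while GMED decides *what values* those stored examples take by editing them in place.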

