NONCONVEX CONTINUAL LEARNING WITH EPISODIC MEMORY

Abstract

Continual learning aims to prevent catastrophic forgetting while learning a new task without accessing data of previously learned tasks. In such scenarios, an episodic memory stores a small subset of the data from previous tasks and is used in various ways, such as quadratic programming and sample selection. Current memory-based continual learning algorithms are formulated as a constrained optimization problem and recast the constraints in a gradient-based form. However, previous works have not provided a theoretical proof of convergence on previously learned tasks. In this paper, we propose a theoretical convergence analysis of continual learning based on the stochastic gradient descent method. Our method, nonconvex continual learning (NCCL), achieves the same convergence rate as vanilla SGD when the proposed catastrophic forgetting term is suppressed at each iteration. We also show that memory-based approaches have an inherent problem of overfitting to memory, which degrades the performance on previously learned tasks in the same way as catastrophic forgetting. We empirically demonstrate that NCCL successfully performs continual learning with episodic memory by scaling learning rates adaptively per mini-batch on several image classification tasks.

1. INTRODUCTION

Learning new tasks without forgetting previously learned tasks is a key aspect of artificial intelligence that is as versatile as humans. Unlike conventional deep learning, which observes tasks drawn from an i.i.d. distribution, continual learning trains a model sequentially on a non-stationary stream of data (Ring, 1995; Thrun, 1994). Continual learning systems struggle with catastrophic forgetting when access to the data of previously learned tasks is restricted (French & Chater, 2002). To overcome catastrophic forgetting, continual learning algorithms introduce a memory to store and replay previously learned examples (Lopez-Paz & Ranzato, 2017; Aljundi et al., 2019b; Chaudhry et al., 2019a), penalize neural networks with regularization methods (Kirkpatrick et al., 2017; Zenke et al., 2017), use Bayesian approaches (Nguyen et al., 2018; Ebrahimi et al., 2020), and employ other novel methods (Yoon et al., 2018; Lee et al., 2019). Although Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) first formulated continual learning as a constrained optimization problem, a theoretical convergence analysis of the performance on previously learned tasks, which serves as a measure of catastrophic forgetting, has not yet been investigated. Continual learning with episodic memory utilizes a small subset of the data from previous tasks to keep the model in a feasible region corresponding to a moderately suboptimal region. GEM-based approaches rephrase the constraints as inequalities on the inner product between the loss gradient vectors of previous tasks and that of the current task. This intuitive reformulation of constrained optimization does not provide a theoretical guarantee against catastrophic forgetting. In addition, memory-based approaches have the critical limitation of overfitting to memory. Choosing the perfect memory for continual learning is an NP-hard problem (Knoblauch et al., 2020), so the inductive bias introduced by episodic memory is inevitable.
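To make the inner-product constraint concrete, the following is a minimal NumPy sketch of a GEM-style gradient correction (in the single-constraint form used by A-GEM, not the authors' NCCL method): when the current-task gradient conflicts with a reference gradient computed on the episodic memory, the conflicting component is projected out so the inequality constraint holds. The function name and the flattened-gradient representation are illustrative assumptions.

```python
import numpy as np

def project_gradient(g, g_ref):
    """GEM-style single-constraint projection (as in A-GEM).

    g     : flattened loss gradient on the current task's mini-batch
    g_ref : flattened loss gradient on the episodic-memory batch

    If <g, g_ref> < 0, updating along g would increase the memory loss,
    so g is projected onto the half-space <g~, g_ref> >= 0.
    Otherwise g is returned unchanged.
    """
    dot = np.dot(g, g_ref)
    if dot >= 0.0:
        return g  # no interference with previously learned tasks
    # remove the component of g that conflicts with g_ref
    return g - (dot / np.dot(g_ref, g_ref)) * g_ref

# usage: a conflicting pair of gradients
g = np.array([1.0, -1.0])
g_ref = np.array([0.0, 1.0])
g_tilde = project_gradient(g, g_ref)
# after projection the constraint <g_tilde, g_ref> >= 0 is satisfied
```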
This problem also degrades the performance on previously learned tasks, like catastrophic forgetting, but it has not been discussed quantitatively in analyses of backward transfer (BWT). In this paper, we address continual learning with episodic memory as a smooth nonconvex finite-sum optimization problem. This generic form is well studied for characterizing the convergence and complexity of stochastic gradient methods in the nonconvex setting (Zhou & Gu, 2019; Lei et al.,

