NONCONVEX CONTINUAL LEARNING WITH EPISODIC MEMORY

Abstract

Continual learning aims to prevent catastrophic forgetting while learning a new task without access to the data of previously learned tasks. The memory in such learning scenarios stores a small subset of the data from previous tasks and is used in various ways, such as quadratic programming and sample selection. Current memory-based continual learning algorithms are formulated as constrained optimization problems that rephrase the constraints as a gradient-based approach. However, previous works have not provided a theoretical proof of convergence on previously learned tasks. In this paper, we propose a theoretical convergence analysis of continual learning based on the stochastic gradient descent method. Our method, nonconvex continual learning (NCCL), achieves the same convergence rate as SGD when the proposed catastrophic forgetting term is suppressed at each iteration. We also show that memory-based approaches suffer from an inherent problem of overfitting to memory, which degrades the performance on previously learned tasks, namely catastrophic forgetting. We empirically demonstrate that NCCL successfully performs continual learning with episodic memory by scaling learning rates adaptively to the sampled mini-batches on several image classification tasks.

1. INTRODUCTION

Learning new tasks without forgetting previously learned ones is a key capability for artificial intelligence to be as versatile as humans. Unlike conventional deep learning, which observes tasks drawn from an i.i.d. distribution, continual learning trains a model sequentially on a non-stationary stream of data (Ring, 1995; Thrun, 1994). Continual learning systems struggle with catastrophic forgetting when access to the data of previously learned tasks is restricted (French & Chater, 2002). To overcome catastrophic forgetting, continual learning algorithms introduce a memory to store and replay previously learned examples (Lopez-Paz & Ranzato, 2017; Aljundi et al., 2019b; Chaudhry et al., 2019a), penalize neural networks with regularization methods (Kirkpatrick et al., 2017; Zenke et al., 2017), use Bayesian approaches (Nguyen et al., 2018; Ebrahimi et al., 2020), or apply other novel methods (Yoon et al., 2018; Lee et al., 2019). Although Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) first formulated continual learning as a constrained optimization problem, a theoretical convergence analysis of the performance on previously learned tasks, which would provide a measure of catastrophic forgetting, has not yet been investigated. Continual learning with episodic memory utilizes a small subset of the data from previous tasks to keep the model in a feasible region corresponding to a moderately suboptimal region. GEM-based approaches use rephrased constraints: inequalities on the inner product between the loss gradients of previous tasks and the current task. This intuitive reformulation of the constrained optimization does not provide a theoretical guarantee against catastrophic forgetting. In addition, memory-based approaches have the critical limitation of overfitting to memory. Choosing a perfect memory for continual learning is an NP-hard problem (Knoblauch et al., 2020), so an inductive bias induced by the episodic memory is inevitable.
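The GEM-style inner-product constraint can be made concrete with a short sketch. This is a minimal single-memory-task illustration, not the NCCL method proposed in this paper; the function name `gem_project` and the toy vectors are our own. When the current-task gradient conflicts with the memory gradient (negative inner product), it is projected onto the half-space where the inner product is nonnegative:

```python
import numpy as np

def gem_project(g_cur, g_mem):
    """Project the current-task gradient g_cur so it does not conflict
    with the memory gradient g_mem (single-constraint case).

    If <g_cur, g_mem> >= 0, the update does not increase the memory
    loss to first order, so g_cur is returned unchanged. Otherwise,
    g_cur is projected onto the half-space {g : <g, g_mem> >= 0},
    in the spirit of GEM (Lopez-Paz & Ranzato, 2017)."""
    dot = g_cur @ g_mem
    if dot >= 0.0:
        return g_cur
    return g_cur - (dot / (g_mem @ g_mem)) * g_mem

# Toy check: a conflicting gradient is projected so the constraint holds.
g_cur = np.array([1.0, -2.0])
g_mem = np.array([0.0, 1.0])
g_proj = gem_project(g_cur, g_mem)  # <g_proj, g_mem> becomes 0 here
```

With a single memory constraint the projection has this closed form; GEM with multiple previous tasks instead solves a small quadratic program over all memory gradients.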
This problem also degrades the performance on previously learned tasks, like catastrophic forgetting, but has not been discussed quantitatively in terms of backward transfer (BWT). In this paper, we address continual learning with episodic memory as a smooth nonconvex finite-sum optimization problem. This generic form is well studied for demonstrating the convergence and complexity of stochastic gradient methods in the nonconvex setting (Zhou & Gu, 2019; Lei et al., 2017; Reddi et al., 2016; Zaheer et al., 2018). Unlike the convex case, convergence is generally measured by the expectation of the squared gradient norm $\mathbb{E}\|\nabla f(x)\|^2$. The theoretical complexity is derived from the $\epsilon$-accurate solution, also known as a stationary point, with $\mathbb{E}\|\nabla f(x)\|^2 \leq \epsilon$. We formulate the proposed continual learning algorithm as a stochastic gradient descent (SGD) based method that simultaneously updates on both the previously learned tasks, via episodic memory, and the current task. By leveraging this update rule, we can introduce a theoretical analysis of continual learning problems. We highlight our main contributions as follows.
• We develop a convergence analysis for continual learning with episodic memory.
• We show the degradation of backward transfer, theoretically and experimentally, as a consequence of catastrophic forgetting and overfitting to memory.
• We propose a nonconvex continual learning algorithm that scales learning rates based on the sampled mini-batch.
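The $\epsilon$-accurate stationarity criterion used throughout can be estimated numerically by averaging squared stochastic-gradient norms over sampled mini-batches. A minimal sketch under our own helper names (not part of NCCL):

```python
import numpy as np

def expected_sq_grad_norm(grad_fn, x, batches):
    """Monte-Carlo estimate of E||grad f(x)||^2, where grad_fn(x, batch)
    returns the stochastic gradient of f at x for one mini-batch."""
    return float(np.mean([np.sum(grad_fn(x, b) ** 2) for b in batches]))

def is_eps_accurate(grad_fn, x, batches, eps):
    """x is treated as an epsilon-accurate stationary point when the
    estimated E||grad f(x)||^2 does not exceed eps."""
    return expected_sq_grad_norm(grad_fn, x, batches) <= eps

# Toy example: f_i(x) = 0.5 * (x - b_i)^2, so grad f_i(x) = x - mean(b_i).
grad_fn = lambda x, b: x - np.mean(b)
batches = [np.array([1.0, 3.0]), np.array([2.0, 2.0])]  # both have mean 2.0
# At x = 2.0 every stochastic gradient vanishes, so the criterion holds.
```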

1.1. RELATED WORK

The literature on continual learning can be divided into episodic learning and task-free learning. The smooth nonconvex finite-sum optimization problem has been widely employed to derive the theoretical computational complexity of stochastic gradient methods (Ghadimi & Lan, 2013; 2016; Lei et al., 2017; Zaheer et al., 2018; Reddi et al., 2016). Unlike convex optimization, gradient-based algorithms are not expected to converge to the global minimum; in the nonconvex case, they are instead evaluated by their convergence rate to stationary points. The complexity to reach a stationary point is a key aspect of building a new stochastic gradient method for nonconvex optimization. In contrast to general optimization, memory-based continual learning methods have a limited data pool for previously learned tasks, which causes an overfitting problem with respect to the memory. Knoblauch et al. (2020) found that solving continual learning optimally and building a perfect memory are equivalent problems. Furthermore, the authors proved that both problems are NP-hard. This theoretical result shows that overfitting to memory is inevitable.

2. PRELIMINARIES

We consider a continual learning problem with episodic memory, where the learner can access the boundary between the previous task and the current task. The continuum of data in (Lopez-Paz & Ranzato, 2017) is adopted as our task description for continual learning. First, we formulate our goal as a smooth nonconvex finite-sum optimization problem with two objectives,

$$\min_{x \in \mathbb{R}^d} F(x) = f(x) + g(x) = \frac{1}{n_f} \sum_{i=1}^{n_f} f_i(x) + \frac{1}{n_g} \sum_{j=1}^{n_g} g_j(x),$$

where $x \in \mathbb{R}^d$ is the model parameter, each objective component $f_i(x)$, $g_j(x)$ is differentiable and nonconvex, and $n_f$, $n_g$ are the numbers of components. We define the two components of the finite-sum objective as the loss $f_i(x)$ on a sample $i$ of the previously learned tasks and the loss $g_j(x)$ on a sample $j$ of the current task. Unlike the general stochastic optimization problem, we assume that the initial point $x_0$ in continual learning is an $\epsilon$-accurate solution of $f(x)$ with $\mathbb{E}\|\nabla f(x_0)\|^2 \leq \epsilon$ for some $\epsilon \ll 1$. By the properties of nonconvex optimization, we know that there may exist multiple local optimal points that satisfy moderate performance on the previously learned task (Garipov et al., 2018). This implies that the
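The simultaneous update on the two objectives above can be sketched as plain SGD on $F(x) = f(x) + g(x)$, with one mini-batch drawn from the episodic memory and one from the current task. This is a minimal illustration under our own naming (`nccl_step`, separate rates `alpha` and `beta`); the actual NCCL learning-rate scaling is developed later in the paper:

```python
import numpy as np

def nccl_step(x, grad_f, grad_g, mem_batch, cur_batch, alpha, beta):
    """One simultaneous SGD step on F(x) = f(x) + g(x):
    grad_f uses a mini-batch from episodic memory (previous tasks),
    grad_g uses a mini-batch from the current task, each with its
    own learning rate (alpha for memory, beta for the current task)."""
    return x - alpha * grad_f(x, mem_batch) - beta * grad_g(x, cur_batch)

# Toy quadratics: f_i(x) = 0.5*(x - a_i)^2 and g_j(x) = 0.5*(x - b_j)^2,
# so each stochastic gradient is x minus the mini-batch mean.
grad_f = lambda x, batch: x - np.mean(batch)
grad_g = lambda x, batch: x - np.mean(batch)
x_new = nccl_step(0.0, grad_f, grad_g,
                  mem_batch=np.array([2.0, 2.0]),
                  cur_batch=np.array([4.0, 4.0]),
                  alpha=0.1, beta=0.1)
```

Setting `beta = 0` recovers pure replay of the memory, while `alpha = 0` recovers plain SGD on the current task; the analysis in the paper concerns how these two gradient contributions interact.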

