NONCONVEX CONTINUAL LEARNING WITH EPISODIC MEMORY

Abstract

Continual learning aims to prevent catastrophic forgetting while learning a new task without accessing data of previously learned tasks. The memory for such learning scenarios builds a small subset of the data for previous tasks and is used in various ways, such as quadratic programming and sample selection. Current memory-based continual learning algorithms are formulated as a constrained optimization problem and rephrase the constraints as a gradient-based approach. However, previous works have not provided a theoretical proof of convergence on previously learned tasks. In this paper, we propose a theoretical convergence analysis of continual learning based on the stochastic gradient descent method. Our method, nonconvex continual learning (NCCL), achieves the same convergence rate as SGD when the proposed catastrophic forgetting term is suppressed at each iteration. We also show that memory-based approaches have an inherent problem of overfitting to memory, which degrades the performance on previously learned tasks much like catastrophic forgetting. We empirically demonstrate that NCCL successfully performs continual learning with episodic memory by scaling learning rates adaptively to mini-batches on several image classification tasks.

1. INTRODUCTION

Learning new tasks without forgetting previously learned ones is a key requirement for artificial intelligence to be as versatile as humans. Unlike conventional deep learning, which observes tasks from an i.i.d. distribution, continual learning trains a model sequentially on a non-stationary stream of data (Ring, 1995; Thrun, 1994). Continual learning AI systems struggle with catastrophic forgetting when data access to previously learned tasks is restricted (French & Chater, 2002). To overcome catastrophic forgetting, continual learning algorithms introduce a memory to store and replay previously learned examples (Lopez-Paz & Ranzato, 2017; Aljundi et al., 2019b; Chaudhry et al., 2019a), penalize neural networks with regularization methods (Kirkpatrick et al., 2017; Zenke et al., 2017), use Bayesian approaches (Nguyen et al., 2018; Ebrahimi et al., 2020), or adopt other novel methods (Yoon et al., 2018; Lee et al., 2019). Although Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) first formulated continual learning as a constrained optimization problem, the theoretical convergence analysis of the performance on previously learned tasks, which provides a measure of catastrophic forgetting, has not been investigated yet. Continual learning with episodic memory utilizes a small subset of the data of previous tasks to keep the model in a feasible region corresponding to a moderately suboptimal region. GEM-based approaches use rephrased constraints, which are inequalities based on the inner product of the loss gradient vectors of previous tasks and the current task. This intuitive reformulation of constrained optimization does not provide a theoretical guarantee against catastrophic forgetting. In addition, memory-based approaches have the critical limitation of overfitting to memory. Choosing the perfect memory for continual learning is an NP-hard problem (Knoblauch et al., 2020), so an inductive bias from the episodic memory is inevitable.
This problem also degrades the performance on previously learned tasks, like catastrophic forgetting, but has not been discussed quantitatively in the analysis of backward transfer (BWT). In this paper, we address continual learning with episodic memory as a smooth nonconvex finite-sum optimization problem. This generic form is well studied to demonstrate the convergence and complexity of stochastic gradient methods in the nonconvex setting (Zhou & Gu, 2019; Lei et al., 2017; Reddi et al., 2016; Zaheer et al., 2018). Unlike the convex case, convergence is generally measured by the expectation of the squared norm of the gradient, E‖∇f(x)‖². The theoretical complexity is derived from the ε-accurate solution, also known as a stationary point with E‖∇f(x)‖² ≤ ε. We formulate the proposed continual learning algorithm as a stochastic gradient descent (SGD) based method that updates on both the previously learned tasks from episodic memory and the current task simultaneously. By leveraging this update method, we can introduce a theoretical analysis of continual learning problems. We highlight our main contributions as follows.

• We develop a convergence analysis for continual learning with episodic memory.
• We show the degradation of backward transfer, theoretically and experimentally, as a consequence of catastrophic forgetting and overfitting to memory.
• We propose a nonconvex continual learning algorithm that scales learning rates based on the sampled mini-batch.

1.1. RELATED WORK

The literature in continual learning can be divided into episodic learning and task-free learning. Episodic learning methods assume that the training model can access clear task boundaries and store observed examples in task-wise episodic memory (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a). In the real world, on the other hand, an AI system experiences arbitrarily shifting data streams with no accessible task boundaries. Task-free continual learning studies this general scenario without the task-boundary assumption. Aljundi et al. (2019a) introduce Memory-aware Synapses (MAS) and apply a learning protocol that does not wait until a task is finished. The follow-up work (Aljundi et al., 2019b) adopts the memory system of GEM and selects which observed examples to store to prevent catastrophic forgetting. The smooth nonconvex finite-sum optimization problem has been widely employed to derive the theoretical computational complexity of stochastic gradient methods (Ghadimi & Lan, 2013; 2016; Lei et al., 2017; Zaheer et al., 2018; Reddi et al., 2016). Unlike in convex optimization, gradient-based algorithms in the nonconvex case are not expected to converge to the global minimum but are evaluated by their convergence rate to stationary points. The complexity of reaching a stationary point is a key aspect of building a new stochastic gradient method for nonconvex optimization. In contrast with general optimization, memory-based continual learning methods have a limited data pool for previously learned tasks, which causes overfitting to the memory. Knoblauch et al. (2020) found that solving optimal continual learning and building a perfect memory are equivalent problems. Furthermore, the authors proved that both are NP-hard. This theoretical result shows that overfitting to memory is inevitable.

2. PRELIMINARIES

We consider a continual learning problem with episodic memory where the learner can access the boundary between the previous task and the current task. The continuum of data in (Lopez-Paz & Ranzato, 2017) is adopted as our task description of continual learning. First, we formulate our goal as a smooth nonconvex finite-sum optimization problem with two objectives,

min_{x∈R^d} F(x) = f(x) + g(x) = (1/n_f) Σ_{i=1}^{n_f} f_i(x) + (1/n_g) Σ_{j=1}^{n_g} g_j(x),

where x ∈ R^d is the model parameter, each objective component f_i(x), g_j(x) is differentiable and nonconvex, and n_f, n_g are the numbers of components. We define the two kinds of components of the finite-sum objective as f_i(x) for a sample i of the previously learned tasks and g_j(x) for a sample j of the current task. Unlike the general stochastic optimization problem, we assume that the initial point x_0 in continual learning is an ε-accurate solution of f(x) with E‖∇f(x_0)‖² ≤ ε for some ε ≪ 1. By the properties of nonconvex optimization, there may exist multiple local optimal points that all achieve moderate performance on the previously learned tasks (Garipov et al., 2018). This implies that, in a successful continual learning scenario over T iterations, the model parameter x_t either stays in the neighborhood of x_0 or moves from the initial local optimal point x_0 to another local optimal point. A continual learning algorithm with an episodic memory of size m cannot access the whole dataset of the previously learned tasks with n_f samples, but uses the limited samples in the memory while the learner trains on the current task. This limited access partially prevents catastrophic forgetting. However, the fixed samples from memory cause a biased gradient and an overfitting problem. In Section 3, we provide the convergence analysis of the previously learned tasks f(x), which are vulnerable to catastrophic forgetting.
We denote by f_i(x) the component loss of sample i from the previously learned tasks at model parameter x, and by ∇f_i(x) its gradient. We use I_t, J_t for the mini-batches of samples at iteration t and write b_f, b_g for the mini-batch sizes |I_t|, |J_t| for brevity throughout the paper. We also note that g_j from the current task satisfies the assumptions above and below. To formulate the convergence over iterations, we introduce the Incremental First-order Oracle (IFO) framework (Ghadimi & Lan, 2013), which defines a unit of cost as sampling the pair (∇f_i(x), f_i(x)). For example, a stochastic gradient descent algorithm incurs a cost equal to the batch size b_t at each step, and the total cost is the sum of batch sizes Σ_{t=1}^{T} b_t. Let T(ε) be the minimum number of iterations to guarantee an ε-accurate solution. Then the average IFO complexity is bounded by Σ_{t=1}^{T(ε)} b_t. To analyze the convergence and compute the IFO complexity, we define the loss gap between two local optimal points as

∆_f = f(x_0) − inf_{0≤t≤T} f(x_t),

which may be much smaller than the loss gap of SGD. Suppose that the losses of all optimal points have the same value, i.e., f(x*) = f(x_0); then ∆_f ≤ 0. This implies that ∆_f is not a cause of moving away from a stationary point of f, which we explain in detail in Section 3. We also define σ_f, σ_g for f, g, respectively, as the upper bounds on the variance of the stochastic gradients of a given mini-batch. For brevity, we write only one of them:

σ²_f = sup_x (1/b_f) Σ_{i=1}^{b_f} ‖∇f_i(x) − ∇f(x)‖². (3)

Throughout the paper, we assume L-smoothness.

Assumption 1 Each f_i is L-smooth: there exists a constant L > 0 such that for any x, y ∈ R^d,

‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖, (4)

where ‖·‖ denotes the Euclidean norm. The following inequality then directly holds:

−(L/2)‖x − y‖² ≤ f_i(x) − f_i(y) − ⟨∇f_i(y), x − y⟩ ≤ (L/2)‖x − y‖². (5)
In this paper, we consider the framework of continual learning with episodic memory. Following the assumption of GEM, we assign the same memory budget m to each task, whose samples are drawn i.i.d. within its episode. In the learning phase of task k ∈ {1, 2, …, K}, we sample a batch of size b_f from the memories of all previous tasks, of total size m·(k−1).
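The memory protocol above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function names `build_memory` and `sample_replay_batch` are our own, and we assume each task's memory is a plain list of examples.

```python
import random

def build_memory(task_data, m, rng=random):
    # Keep m samples drawn uniformly from the task's episode,
    # matching the equal budget m assigned to every task.
    return rng.sample(task_data, min(m, len(task_data)))

def sample_replay_batch(memories, k, b_f, rng=random):
    # Draw the replay mini-batch I_t for episode k from the union of the
    # memories of tasks 1..k-1, whose total size is m*(k-1).
    pool = [x for mem in memories[:k - 1] for x in mem]
    return rng.sample(pool, min(b_f, len(pool)))
```

With two stored tasks of budget m = 10, episode k = 3 replays a batch of size b_f drawn from all 20 stored examples.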

3. NONCONVEX CONTINUAL LEARNING

In this section, we present the convergence analysis of continual learning in the nonconvex setting. The theoretical result shows why catastrophic forgetting occurs from the viewpoint of nonconvex optimization. Based on this result, we propose the Non-Convex Continual Learning (NCCL) algorithm in Section 3.3, where the learning rates for the previously learned tasks and the current task are scaled by the inner product of their gradients with respect to the parameter.

3.1. ONE EPISODE ANALYSIS

The key element in preventing catastrophic forgetting is to apply a gradient compensation to the training step of the current task. It can be viewed as an additive gradient applied to the gradient of the current task; in contrast, GEM (Lopez-Paz & Ranzato, 2017) uses quadratic programming and EWC (Kirkpatrick et al., 2017) introduces an auxiliary loss function. First, we present the proposed gradient compensation, which uses samples from the episodic memory during a single new-task episode. We define the gradient update

x_{t+1} = x_t − α_Ht ∇f_It(x_t) − β_Ht ∇g_Jt(x_t), (6)

where α_Ht, β_Ht are learning rates scaled by the sampled mini-batches for H_t = I_t ∪ J_t, and ∇f_It(x_t), ∇g_Jt(x_t) are estimates of the gradients ∇f(x_t), ∇g(x_t), respectively. Equation 6 implies that the parameter is updated on the current task g with a gradient compensation α_Ht ∇f_It(x_t) for the previously learned tasks f. Our goal is to explain the effect of the gradient update β_Ht ∇g_Jt(x_t) on the convergence to stationary points of f(x) and to observe the properties of the expectation of each term over I_t. For iterations t ∈ [1, T] and a constant L, we define the catastrophic forgetting term C_t to be the expectation in terms of ∇g_Jt(x_t):

C_t = E[(β²_Ht L/2)‖∇g_Jt(x_t)‖² − β_Ht⟨∇f(x_t), ∇g_Jt(x_t)⟩], (7)

which we derive in Appendix A. We temporarily adopt the following assumption to show the convergence analysis of continual learning.

Assumption 2 Suppose that the episodic memory M contains the entire data of the previously learned tasks [k−1] in the k-th episode and replays mini-batches I_t ⊂ M. Then ∇f_It(x_t) is an unbiased estimate, i.e., E[e_t] = 0 for e_t = ∇f_It(x_t) − ∇f(x_t).

In the next section, we drop Assumption 2 and investigate the biasedness of the episodic memory M that causes overfitting to the memory. Our first main result is the following theorem, which characterizes the stepwise convergence behavior of our algorithm.
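The update of Equation 6 and the forgetting term of Equation 7 translate directly into code. The sketch below, with hypothetical function names, uses a single-sample estimate in place of the expectation in Equation 7:

```python
import numpy as np

def nccl_update(x, grad_f_It, grad_g_Jt, alpha, beta):
    # Equation 6: x_{t+1} = x_t - alpha_Ht * grad f_It(x_t) - beta_Ht * grad g_Jt(x_t)
    return x - alpha * grad_f_It - beta * grad_g_Jt

def forgetting_term(grad_f, grad_g_Jt, beta, L):
    # Single-sample estimate of Equation 7:
    # C_t = (beta^2 L / 2)||grad g_Jt||^2 - beta <grad f, grad g_Jt>
    return 0.5 * L * beta**2 * (grad_g_Jt @ grad_g_Jt) - beta * (grad_f @ grad_g_Jt)
```

For a small β, aligned gradients make the term negative (forgetting helps), while opposed gradients make it positive, matching the case analysis of Section 3.3.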
Theorem 1 Suppose that Lα²_Ht − α_Ht ≤ γ for some γ > 0 and α_Ht ≤ 2/L. Under Assumptions 1 and 2, we have

E‖∇f(x_t)‖² ≤ 1/(1 − (L/2)α_Ht) [ (1/α_Ht) E[f(x_t) − f(x_{t+1})] + C_t + (α_Ht L)/(2b_f) σ²_f ]. (8)

We present the proof in Appendix A. Note that the catastrophic forgetting term C_t has no counterpart in the analysis of plain SGD, and this term increases the IFO complexity. Fortunately, we can tighten the upper bound of Equation (8) by minimizing C_t. We now telescope over a single episode of the current task and obtain the following theorem.

Theorem 2 Let α_Ht = α = c/√T for some c > 0 and all t ∈ [T], and let 1 − (L/2)α = 1/A > 0 for some A. Under the conditions of Theorem 1, we have

min_t E‖∇f(x_t)‖² ≤ (A/√T) [ (1/c)∆_f + Σ_{t=0}^{T−1} C_t + (Lc)/(2b_f) σ²_f ]. (9)

This theorem explains the theoretical background of catastrophic forgetting. The cumulative sum of the catastrophic forgetting terms C_t can increase drastically over iterations, which implies that the iterate can diverge from the stationary point x_0. An immediate consequence of Equation 9 is that we can treat the amount of catastrophic forgetting as a factor in the optimization. Without the additive catastrophic forgetting term, Theorem 2 reduces to the result for SGD with a fixed learning rate (Ghadimi & Lan, 2013). Similar to SGD, the upper bound in Equation 9 becomes O((A/√T)(∆_f + Σ_t C_t)) when we assume that (Lc/(2b_f))σ²_f = O(1). Conversely, we can analyze the convergence of g(x) by exchanging the roles of f and g in Theorem 2. At the very beginning of training, ∆_g is dominant in Equation 9, and the corresponding catastrophic forgetting term C_{t,g}, defined with respect to ∇f_It(x_t), is relatively small because x_t is in the neighborhood of a stationary point of f. Under Assumption 2, while samples from previously learned tasks are constantly provided, the norm of the gradients ∇f_It(x_t) remains bounded. Therefore, g(x) can reach a stationary point at the same rate as SGD.
However, we cannot access the full dataset of previously learned tasks in the continual learning setting. There exists an extra term that interrupts the convergence of g(x), which we call overfitting. We ignore this extra term for now and assume that ‖∇g_Jt(x)‖ is at least bounded. Then we have the following corollary.

Corollary 1 Let the expected stationarity of g(x) be O(δ/√T) for a constant δ > 0, and let β > 0 be an upper bound on the learning rate for g(x). The cumulative sum of the catastrophic forgetting terms C = Σ_{t=0}^{T−1} C_t is O(β²δ√T). Nonconvex continual learning by Equation (6) does not converge as the algorithm iterates in the worst case, where min_t E‖∇f(x_t)‖² is O(β²δ) for 1 ≪ β²δ√T. When β²δ ≤ 1/√T, we have

min_t E‖∇f(x_t)‖² = O(1/√T), (10)

and the IFO complexity for achieving an ε-accurate solution of f(x) is O(1/ε²).

We emphasize that catastrophic forgetting is inevitable in the worst case because the stationarity of f(x) is not decreasing, and the convergence on f(x) cannot be recovered no matter how long we train. Building a tight bound on C is therefore the key to preventing catastrophic forgetting. Note that the naive way to minimize C is to scale down the learning rate β until β²δ ≤ 1/√T, which gives a decreasing C = O(1/√T). However, this slows down the convergence of the current task g(x) and is not an appropriate solution. The other option is to minimize each C_t itself rather than tightening the loose upper bound O(β²δ√T). We discuss how to minimize this term by scaling the two learning rates in Section 3.3. The constrained optimization formulation of GEM provides a useful rephrased constraint but cannot explain or bound catastrophic forgetting in the nonconvex setting. Our convergence analysis of continual learning is the first quantitative result on catastrophic forgetting from the viewpoint of nonconvex optimization.

3.2. OVERFITTING TO EPISODIC MEMORY

In Section 3.1, we discussed the theoretical convergence analysis of continual learning as a smooth nonconvex finite-sum optimization problem. Practical continual learning restricts full access to the entire data of previously learned tasks, which differs from Assumption 2. An episodic memory M of limited size incurs a bias on ∇f(x_t). Suppose that we sample a mini-batch of previously learned tasks from the episodic memory M. Then we can write this bias E[e_M] as

E[e_M] = E[∇f_It(x_t)] − ∇f(x_t) = ∇f_M(x_t) − ∇f(x_t). (11)

This equation shows that the bias depends on the choice of M. During optimization, the bias drags the convergence of f(x) toward f_M(x), which we regard as overfitting to the memory M. Knoblauch et al. (2020) prove that selecting a perfect memory is NP-hard, so we must conclude that E[e_M] ≠ 0 in general. We now extract the overfitting bias on M from the term ignored in Equation 21 of Appendix A and from the catastrophic forgetting term in Equation 7. The bias-related term B_t^M is added to the upper bound of Equation 9 and turns the catastrophic forgetting term into a practical catastrophic forgetting term Ĉ_t:

B_t^M = γ⟨∇f(x_t), ∇f_M(x_t) − ∇f(x_t)⟩ + β_Ht⟨∇f_M(x_t) − ∇f(x_t), ∇g_Jt(x_t)⟩, (12)

Ĉ_t = E[(β²_Ht L/2)‖∇g_Jt(x_t)‖² − β_Ht⟨∇f_It(x_t), ∇g_Jt(x_t)⟩]. (13)

Note that the upper bound of Ĉ_t is the same as that of C_t even in this limited-memory scenario. The cumulative sum of B_t^M over iterations is the amount of disturbance caused by overfitting to memory. This inherent defect of memory-based continual learning frameworks can be seen as a generalization-gap phenomenon (Keskar et al., 2016), and a small mini-batch size can alleviate it. In Section 4, we demonstrate the effect of different mini-batch sizes on alleviating the overfitting problem on the memory M.

Figure 1: Geometric illustration of Non-Convex Continual Learning (NCCL).
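The bias of Equation 11 can be probed numerically. The sketch below, with a hypothetical `memory_bias` helper and a toy quadratic loss, compares the average gradient over the stored memory M against the average over the full (inaccessible) previous-task data:

```python
import numpy as np

def memory_bias(grad_fn, x, memory, full_data):
    # E[e_M] = grad f_M(x) - grad f(x): the gap between the mean gradient
    # over the stored memory M and over the full previous-task data.
    g_mem = np.mean([grad_fn(x, s) for s in memory], axis=0)
    g_full = np.mean([grad_fn(x, s) for s in full_data], axis=0)
    return g_mem - g_full
```

For f_i(x) = (x − s_i)²/2 with gradient x − s_i, the bias is the difference of the sample means, so it vanishes only when the memory perfectly matches the statistics of the full data.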
In the continual learning setting, the model parameter starts from a moderate local optimal point x*_P for the previously learned tasks. Over T iterations, we expect to reach a new optimal point x*_{P∪C} that performs well on both the previously learned and the current tasks. At iteration t, the model parameter x_t encounters either ∇g_{Jt,pos}(x_t) or ∇g_{Jt,neg}(x_t); the two cases indicate whether ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ is positive or not. To prevent x_t from escaping the feasible region, i.e., catastrophic forgetting, we propose a theoretical condition on the learning rates for f and g.

3.3. SCALING LEARNING RATES

The convergence analysis yields a simple continual learning framework that only scales the two learning rates in the gradient update of Equation 6. As proved above, we should tighten the upper bound of Ĉ_t to prevent catastrophic forgetting. We propose an adaptive scaling method for the learning rates that minimizes or reduces Ĉ_t in both cases, ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ ≤ 0 and ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ > 0. Note that Equation 13 is a quadratic polynomial in β_Ht with β_Ht > 0. First, when ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ > 0, we can solve for the minimum of the polynomial in β_Ht. Differentiating with respect to β_Ht, we easily find the minimum Ĉ*_t and the optimal learning rate β*_Ht:

β*_Ht = ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ / (L‖∇g_Jt(x_t)‖²),  Ĉ*_t = −⟨∇f_It(x_t), ∇g_Jt(x_t)⟩² / (2L‖∇g_Jt(x_t)‖²). (14)

The direct consequence Ĉ*_t < 0 implies that, surprisingly, the optimal catastrophic forgetting term helps f(x) by decreasing the upper bound on its stationarity. For ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ ≤ 0, however, β_Ht would have to be negative to achieve the global minimum of Ĉ_t, which violates our assumption. Instead, we propose a surrogate for ∇g_Jt(x_t):

∇ĝ_Jt(x_t) = ∇g_Jt(x_t) − ⟨∇f_It(x_t)/‖∇f_It(x_t)‖, ∇g_Jt(x_t)⟩ ∇f_It(x_t)/‖∇f_It(x_t)‖. (15)

The surrogate borrows the gradient ∇f_It(x_t) to cancel out the component of ∇g_Jt(x_t) that is negatively aligned with ∇f_It(x_t). We can then reduce the catastrophic forgetting term drastically by boosting the learning rate α_Ht without correcting ∇g_Jt(x_t) directly. The remaining non-negative part of Ĉ_t is caused by the magnitude of ∇g_Jt(x_t) itself; this part cannot be avoided as long as we must learn the current task, in any continual learning framework. We summarize our results as follows.
α_Ht = α(1 − ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩/‖∇f_It(x_t)‖²) if ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ ≤ 0, and α_Ht = α if ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ > 0. (16)

β_Ht = α if ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ ≤ 0, and β_Ht = ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩/(L‖∇g_Jt(x_t)‖²) if ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ > 0. (17)

We derive the details of these results in Appendix B. Existing GEM-based algorithms focus only on canceling out the negative direction of ∇f_M(x_t) from ∇g_Jt(x_t), at high computational cost, and only in the case ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ ≤ 0.

Algorithm 1 Nonconvex Continual Learning (NCCL)
Input: K task data streams {D_1, …, D_K}, initial model x_0, memories {M_k}, each of size m
for k = 1 to K do
  for t = 0 to T − 1 do
    Uniformly sample a mini-batch I_t from the memories of tasks [k − 1] with |I_t| = b_f
    Uniformly sample a mini-batch J_t ⊂ D_k with |J_t| = b_g and store J_t into M_k
    Compute the learning rates α_Ht, β_Ht from ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩
    x_{t+1} ← x_t − α_Ht ∇f_It(x_t) − β_Ht ∇g_Jt(x_t)
  end for
  x_0 ← x_{T−1}
end for

The proposed method has the advantage of both leveraging Ĉ_t to achieve better convergence when ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ > 0 and reducing the effect of catastrophic forgetting via the term (β²_Ht L/2)‖∇g_Jt(x_t)‖² when ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ ≤ 0. Figure 1 intuitively illustrates how scaling the learning rates achieves convergence to a mutual stationary point x*_{P∪C}, with the theoretical complexity proved in Corollary 1.
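The case split of Equations 16 and 17 is a small amount of code. The following is a minimal sketch (our own function name; `alpha` is the base step size and `L` the smoothness constant, both assumed given):

```python
import numpy as np

def scale_learning_rates(grad_f_It, grad_g_Jt, alpha, L):
    # Equations 16-17: scale (alpha_Ht, beta_Ht) by the sign of <grad f, grad g>.
    ip = float(grad_f_It @ grad_g_Jt)
    if ip <= 0:
        # Conflicting gradients: boost alpha to compensate, keep beta at base rate.
        alpha_t = alpha * (1.0 - ip / float(grad_f_It @ grad_f_It))
        beta_t = alpha
    else:
        # Aligned gradients: optimal beta* = <f,g> / (L ||g||^2) minimizing C_t.
        alpha_t = alpha
        beta_t = ip / (L * float(grad_g_Jt @ grad_g_Jt))
    return alpha_t, beta_t
```

Combined with the update of Equation 6, this is the whole inner loop of Algorithm 1: sample I_t and J_t, compute the two gradients, scale the rates, and step.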

4. EXPERIMENTS

Based on our theoretical analysis of continual learning, we evaluate the proposed NCCL model in episodic continual learning on three benchmark datasets. We run our experiments on a GPU server with an Intel i9-9900K, 64 GB RAM, and two NVIDIA GeForce RTX 2080 Ti GPUs.

4.1. EXPERIMENTAL SETUP

Baselines. We compare NCCL to the following continual learning algorithms. Fine-tune is a basic baseline in which the model trains on the data naively without any support such as memory. Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) uses a loss regularized by the Fisher information. Reservoir sampling (Chaudhry et al., 2019b) shows that simple experience replay can be a powerful continual learning algorithm; it randomly selects a fixed number of examples from the stream of task data, similarly to GEM and A-GEM. GEM and A-GEM (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a) are the original and a variant of Gradient Episodic Memory.

Datasets. We use the following datasets. 1) Kirkpatrick et al. (2017) design Permuted-MNIST, an MNIST-based (LeCun et al., 1998) dataset in which a fixed permutation of pixels transforms each data point so that the input distributions of different tasks are unrelated. 2) Zenke et al. (2017) introduce the Split-MNIST dataset, which splits MNIST into five tasks. Each task consists of two classes, for example (1, 7), and has approximately 12K images. 3) Split-CIFAR10 is one of the most commonly used continual learning datasets, based on the CIFAR10 dataset (Krizhevsky et al., 2009) (Lee et al., 2020; Rebuffi et al., 2017; Zenke et al., 2017; Lopez-Paz & Ranzato, 2017; Aljundi et al., 2019b).

Training details. We use fully connected neural networks with two hidden layers of [100, 100] with ReLU activations. For the CIFAR10 dataset, we use a smaller version of ResNet18 following the setting in GEM. To match our theoretical analysis, we train all networks with vanilla SGD.

Performance measurement.

We conduct our experiments on K tasks and evaluate them with two measures, ACC and BWT. ACC is the average test accuracy over all tasks after the whole learning is finished. Backward transfer (BWT) is a measure of forgetting that shows how much learning new tasks has affected the previously learned tasks; BWT < 0 implies that catastrophic forgetting happens. Formally, we define ACC and BWT as

ACC = (1/K) Σ_{k=1}^{K} ACC_{k,K},  BWT = (1/K) Σ_{k=1}^{K} (ACC_{k,K} − ACC_{k,k}),

where ACC_{i,j} is the accuracy of task i at the end of episode j.

Table 1: Comparison on ACC and BWT on Permuted-MNIST, Split-MNIST, and Split-CIFAR10 over 5 epochs per task and 5 runs. For A-GEM, we report the result in (Chaudhry et al., 2019a).

Tables 1 and 2 show our main experimental results. We first explain a property of the Split datasets. A Split dataset divides the whole dataset by the number of tasks, so each task sees only a fraction of the data. For example, in 5-Split MNIST the data points per task correspond to 0.2 epoch of MNIST, so a single epoch of 5-Split MNIST can be regarded as 5 repeated passes over a task's data points. We conduct experiments on 20 Permuted-MNIST, 5-Split MNIST, and 5-Split CIFAR10. We notice that NCCL does not outperform GEM and A-GEM. We conjecture that the reason for the lower performance is the difference in optimization techniques for the new task: GEM-based methods apply a quadratic programming algorithm, which spends more iterations finding a better surrogate for the negative direction between the previous task and the current task, but this procedure requires considerably longer computation time and is not efficient. We expect that a theoretical convergence analysis for GEM surrogates can be achieved in future work. Compared to the other reported methods, NCCL performs reasonably. From these observations, we conclude the following.
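The two metrics above reduce to a few lines over the task-accuracy matrix. A minimal sketch (our own function name; the matrix is 0-indexed here, whereas the paper's notation is 1-indexed):

```python
def acc_bwt(acc):
    # acc[k][j]: test accuracy of task k measured after episode j.
    # ACC = (1/K) sum_k acc[k][K-1]
    # BWT = (1/K) sum_k (acc[k][K-1] - acc[k][k]); the k = K-1 term is zero.
    K = len(acc)
    ACC = sum(acc[k][K - 1] for k in range(K)) / K
    BWT = sum(acc[k][K - 1] - acc[k][k] for k in range(K)) / K
    return ACC, BWT
```

For example, with two tasks where task 1 drops from 0.90 to 0.80 after task 2 is learned, ACC averages the final row and BWT is negative, signaling forgetting.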
• Our theoretical convergence analysis is reasonable for explaining catastrophic forgetting.
• NCCL has both theoretical and empirical support.
• We observe that a small mini-batch size from memory is more effective.

5. CONCLUSION

In this paper, we have presented the first generic theoretical convergence analysis of continual learning. Our proof shows that a training model can circumvent catastrophic forgetting by suppressing the disturbance term on the convergence of previously learned tasks. We also demonstrate, theoretically and empirically, that the performance on past tasks under nonconvex continual learning with episodic memory is degraded for two separate reasons: catastrophic forgetting and overfitting to memory. To tackle this problem, nonconvex continual learning applies two methods: scaling learning rates adaptively to mini-batches and sampling mini-batches from the episodic memory. Compared to other constrained optimization methods, the mechanism of NCCL utilizes both positive and negative directions between the two stochastic gradients from the memory and the current task to keep a stable performance on previous tasks. Finally, we expect that the proposed nonconvex framework will be helpful for analyzing the convergence rates of other continual learning algorithms.

APPENDIX A THEORETICAL ANALYSIS

Proof of Theorem 1. We analyze the convergence of nonconvex continual learning with episodic memory. Recall the gradient update

x_{t+1} = x_t − α_Ht ∇f_It(x_t) − β_Ht ∇g_Jt(x_t)

for all t ∈ {1, 2, …, T}. Since f and g are L-smooth, applying Equation 5 gives

f(x_{t+1}) ≤ f(x_t) + ⟨∇f(x_t), x_{t+1} − x_t⟩ + (L/2)‖x_{t+1} − x_t‖²
= f(x_t) − ⟨∇f(x_t), α_Ht ∇f_It(x_t) + β_Ht ∇g_Jt(x_t)⟩ + (L/2)‖α_Ht ∇f_It(x_t) + β_Ht ∇g_Jt(x_t)‖²
≤ f(x_t) − α_Ht⟨∇f(x_t), ∇f_It(x_t)⟩ − β_Ht⟨∇f(x_t), ∇g_Jt(x_t)⟩ + (L/2)α²_Ht‖∇f_It(x_t)‖² + (L/2)β²_Ht‖∇g_Jt(x_t)‖².

Let e_t = ∇f_It(x_t) − ∇f(x_t) and define

C̃_t = (L/2)β²_Ht‖∇g_Jt(x_t)‖² − β_Ht⟨∇f(x_t), ∇g_Jt(x_t)⟩

for t ≥ 1. We have

f(x_{t+1}) ≤ f(x_t) − α_Ht⟨∇f(x_t), ∇f_It(x_t)⟩ + (L/2)α²_Ht‖∇f_It(x_t)‖² + C̃_t
≤ f(x_t) − (α_Ht − (L/2)α²_Ht)‖∇f(x_t)‖² − (α_Ht − Lα²_Ht)⟨∇f(x_t), e_t⟩ + (L/2)α²_Ht‖e_t‖² + C̃_t.

Taking expectations with respect to I_t on both sides and noting that C_t = E[C̃_t], we obtain

(α_Ht − (L/2)α²_Ht) E‖∇f(x_t)‖² ≤ E[f(x_t) − f(x_{t+1})] + C_t + (L/2)α²_Ht E‖e_t‖² + (Lα²_Ht − α_Ht) E[⟨∇f(x_t), e_t⟩].

Rearranging the terms and assuming Lα²_Ht − α_Ht ≤ γ and 1 − (L/2)α_Ht > 0, we have

E‖∇f(x_t)‖² ≤ 1/(α_Ht(1 − (L/2)α_Ht)) (E[f(x_t) − f(x_{t+1})] + C_t) + (γ E[⟨∇f(x_t), e_t⟩] + (L/2)α_Ht E‖e_t‖²)/(1 − (L/2)α_Ht).

Under Assumption 2, E[⟨∇f(x_t), e_t⟩] = 0 and E‖e_t‖² ≤ σ²_f/b_f, so we conclude

E‖∇f(x_t)‖² ≤ 1/(1 − (L/2)α_Ht) [ (1/α_Ht) E[f(x_t) − f(x_{t+1})] + C_t + (α_Ht L)/(2b_f) σ²_f ]. (21)

Summing Equation 21 from t = 0 to T − 1 with α_Ht = α = c/√T and telescoping,

min_t E‖∇f(x_t)‖² ≤ (1/T) Σ_{t=0}^{T−1} E‖∇f(x_t)‖² ≤ 1/(1 − (L/2)α) [ (1/(αT)) E[f(x_0) − f(x_T)] + Σ_{t=0}^{T−1} C_t + (L/(2b_f)) α σ²_f ]
= 1/(1 − (L/2)α) [ (1/(c√T)) ∆_f + Σ_{t=0}^{T−1} C_t + (Lc/(2b_f√T)) σ²_f ]
= (A/√T) [ (1/c) ∆_f + Σ_{t=0}^{T−1} C_t + (Lc/(2b_f)) σ²_f ].

Lemma 1 Let δ > 0 be a constant and β > β_Ht > 0 an upper bound on the learning rate, and suppose the expected stationarity of g(x) is O(δ/√T),
where σ_g is defined analogously to Equation 3 and b_g is the mini-batch size for g. Then we have C_t = O(E‖∇g(x_t)‖²) = O(β²δ/√T) for t ∈ [T] and some δ > 0. Summing over t, we have

C = Σ_{t=0}^{T−1} C_t = T · O(β²δ/√T) = O(β²δ√T).

Therefore, we obtain C = O(1) when β²δ√T ≤ 1.

Proof of Corollary 1. To formulate the IFO calls, let

T(ε) = min{T : min_t E‖∇f(x_t)‖² ≤ ε}.

Recall that min_t E‖∇f(x_t)‖² = O(C/√T) by Theorem 2. Then by Lemma 1, we have

min_t E‖∇f(x_t)‖² = O(β²δ√T/√T) = O(β²δ).

This implies that min_t E‖∇f(x_t)‖² does not decrease when 1 ≪ β²δ√T; then x_t cannot reach a stationary point. On the other hand, f(x) converges to a stationary point when β²δ ≤ 1/√T, in which case

min_t E‖∇f(x_t)‖² = O(β²δ) = O(1/√T).

To derive a bound on T(ε), we note that O(1/√T) ≤ ε, which gives T(ε) = O(1/ε²). The IFO call is defined as the sum of mini-batch sizes over iterations,



Proof of Theorem 2. Suppose the learning rate α_Ht is a constant α = c/√T for some c > 0, with 1 − (L/2)α = 1/A > 0. Then, summing Equation 21 from t = 0 to T − 1 and dividing by T yields the bound in Theorem 2.


[Figure 1 depicts the gradient vectors ∇f_It(x_t), ∇g_{Jt,pos}(x_t), ∇g_{Jt,neg}(x_t) and the combined updates α_Ht∇f_It(x_t) + β_Ht∇g_{Jt,pos}(x_t) and α_Ht∇f_It(x_t) + β_Ht∇g_{Jt,neg}(x_t).]

Table 2: Comparison on ACC of Permuted-MNIST with 5 permutation tasks on a single epoch per task, with multiple choices of hyperparameters over 5 runs. We define m as the memory budget for each task, b_g as the batch size for the current task, and b_f as the batch size for a single previous task.

Lemma 1 bounds the sum of the catastrophic forgetting terms over the T iterations.

Proof of Lemma 1. The upper bound of the catastrophic forgetting term is

C_t = E[(β²_Ht L/2)‖∇g_Jt(x_t)‖² − β_Ht⟨∇f(x_t), ∇g_Jt(x_t)⟩]
≤ E[(β² L/2)‖∇g_Jt(x_t)‖² + β‖∇f(x_t)‖‖∇g_Jt(x_t)‖]
= O(E‖∇g_Jt(x_t)‖²).

Since E‖∇g_Jt(x_t)‖² ≤ ‖∇g(x_t)‖² + E‖∇g_Jt(x_t) − ∇g(x_t)‖² ≤ ‖∇g(x_t)‖² + σ²_g/b_g and E‖∇g(x_t)‖² = O(δ/√T) by assumption, we obtain C_t = O(β²δ/√T), and the claim follows by summing over t.

Σ_{t=1}^{T(ε)} b_{f,t}. Therefore, the IFO complexity is O(1/ε²).

Proof of Equation 15. Define the surrogate ∇ĝ_Jt(x_t) as

∇ĝ_Jt(x_t) = ∇g_Jt(x_t) − ⟨∇f_It(x_t)/‖∇f_It(x_t)‖, ∇g_Jt(x_t)⟩ ∇f_It(x_t)/‖∇f_It(x_t)‖.

Substituting ∇ĝ_Jt(x_t) for ∇g_Jt(x_t) in Ĉ_t gives

Ĉ_t = E[(β²_Ht L/2)‖∇ĝ_Jt(x_t)‖² − β_Ht⟨∇f_It(x_t), ∇ĝ_Jt(x_t)⟩]
= E[(β²_Ht L/2)(‖∇g_Jt(x_t)‖² − 2⟨∇f_It(x_t), ∇g_Jt(x_t)⟩²/‖∇f_It(x_t)‖² + ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩²/‖∇f_It(x_t)‖²) − β_Ht(⟨∇f_It(x_t), ∇g_Jt(x_t)⟩ − ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩)]
= E[(β²_Ht L/2)(‖∇g_Jt(x_t)‖² − ⟨∇f_It(x_t), ∇g_Jt(x_t)⟩²/‖∇f_It(x_t)‖²)]. (24)

