GENERALISATION GUARANTEES FOR CONTINUAL LEARNING WITH ORTHOGONAL GRADIENT DESCENT

Abstract

In Continual Learning settings, deep neural networks are prone to Catastrophic Forgetting. Orthogonal Gradient Descent (OGD) was proposed to tackle this challenge; however, no theoretical guarantees have been proven for it yet. We present a theoretical framework to study Continual Learning algorithms in the Neural Tangent Kernel (NTK) regime. This framework comprises a closed-form expression of the model through tasks, as well as proxies for Transfer Learning, generalisation and task similarity. In this framework, we prove that OGD is robust to Catastrophic Forgetting, then derive the first generalisation bounds for SGD and OGD in Continual Learning. Finally, we study the limits of this framework in practice for OGD and highlight the importance of the variation of the Neural Tangent Kernel for Continual Learning with OGD.

1. INTRODUCTION

Continual Learning is a setting in which an agent is exposed to multiple tasks sequentially (Kirkpatrick et al., 2016). The core challenge lies in the ability of the agent to learn new tasks while retaining the knowledge acquired from previous tasks. Too much plasticity (Nguyen et al., 2018) leads to catastrophic forgetting, that is, the degradation of the agent's ability to perform past tasks (McCloskey & Cohen 1989, Ratcliff 1990, Goodfellow et al. 2014). On the other hand, too much stability hinders the agent from adapting to new tasks. While there is a large literature on Continual Learning (Parisi et al., 2019), few works have addressed the problem from a theoretical perspective. Recently, Jacot et al. (2018) established the connection between overparameterized neural networks and kernel methods by introducing the Neural Tangent Kernel (NTK). They showed that at the infinite width limit, the kernel remains constant throughout training. Lee et al. (2019) also showed that, in the infinite width limit or NTK regime, a network evolves as a linear model when trained on certain losses under gradient descent. In addition to these findings, recent works on the convergence of Stochastic Gradient Descent for overparameterized neural networks (Arora et al., 2019) have unlocked multiple mathematical tools to study the training dynamics of over-parameterized neural networks. We leverage these theoretical findings to propose a theoretical framework for Continual Learning in the NTK regime, then prove convergence and generalisation properties for the Orthogonal Gradient Descent (OGD) algorithm for Continual Learning (Farajtabar et al., 2019). Our contributions are summarized as follows: 1. We present a theoretical framework to study Continual Learning algorithms in the Neural Tangent Kernel (NTK) regime.
This framework frames Continual Learning as a recursive kernel regression and comprises proxies for Transfer Learning, generalisation, task similarity and Curriculum Learning (Thm. 1, Lem. 1 and Thm. 3). 2. In this framework, we prove that OGD is robust to forgetting with respect to an arbitrary number of tasks, under an infinite memory (Sec. 5, Thm. 2). 3. We prove the first generalisation bound for Continual Learning with SGD and OGD. We find that generalisation through tasks depends on a task similarity measure with respect to the NTK (Sec. 5, Thm. 3). 4. We study the limits of this framework in practical settings, in which the Neural Tangent Kernel may vary. We find that the variation of the Neural Tangent Kernel negatively impacts the robustness of OGD to Catastrophic Forgetting on non-overparameterized benchmarks (Sec. 6).

2. RELATED WORKS

Continual Learning addresses the Catastrophic Forgetting problem, which refers to the tendency of agents to "forget" the previous tasks they were trained on over the course of training. It is an active area of research, and several heuristics were developed in order to characterise it (Ans & Rousset 1997, Ans & Rousset 2000, Goodfellow et al. 2014, French 1999, McCloskey & Cohen 1989, Robins 1995, Nguyen et al. 2019). Approaches to Continual Learning can be categorised into regularization methods, memory-based methods and dynamic architectural methods. We refer the reader to the survey by Parisi et al. (2019) for an extensive overview of existing methods. The idea behind memory-based methods is to store data from previous tasks in a buffer of fixed size, which can then be reused during training on the current task (Chaudhry et al. 2019, Van de Ven & Tolias 2018). Dynamic architectural methods rely on growing architectures which keep the past knowledge fixed and store new knowledge in new components, such as new nodes or layers (Lee et al. 2018, Schwarz et al. 2018). Finally, regularization methods regularize the objective in order to preserve the knowledge acquired from previous tasks (Kirkpatrick et al. 2016, Aljundi et al. 2018, Farajtabar et al. 2019, Zenke et al. 2017). While there is a large literature on the field, there is a limited number of theoretical works on Continual Learning. Alquier et al. (2017) define a compound regret for lifelong learning, as the regret with respect to the oracle who would have known the best common representation for all tasks in advance. Knoblauch et al. (2020) show that optimal Continual Learning algorithms generally solve an NP-hard problem and require perfect memory not to suffer from catastrophic forgetting. Benzing (2020) presents mathematical and empirical evidence that two methods, Synaptic Intelligence and Memory Aware Synapses, approximate a rescaled version of the Fisher Information.
Continual Learning is not limited to Catastrophic Forgetting; it is also closely related to Transfer Learning. A desirable property of a Continual Learning algorithm is to enable the agent to carry the acquired knowledge through its lifetime, and transfer it to solve new tasks. A theoretical study of this phenomenon was presented by Liu et al. (2019). They prove how task similarity contributes to generalisation, when training with Stochastic Gradient Descent, in a two-task setting and for over-parameterised two-layer ReLU neural networks. The recent findings on the Neural Tangent Kernel (Jacot et al., 2018) and on the properties of overparameterized neural networks (Du et al. 2018, Arora et al. 2019) provide powerful tools to analyze their training dynamics. We build on these advances to construct a theoretical framework for Continual Learning and study the generalisation properties of Orthogonal Gradient Descent.

3. PRELIMINARIES

Notation We use bold-faced characters for vectors and matrices. We use $\|\cdot\|$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. We use $\langle \cdot, \cdot \rangle$ for the Euclidean dot product, and $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ for the dot product in the Hilbert space $\mathcal{H}$. We index the task ID by $\tau$. The $\leq$ operator, if used with matrices, corresponds to the partial ordering over symmetric matrices. We denote by $\mathbb{N}$ the set of natural numbers, by $\mathbb{R}$ the set of real numbers, and by $\mathbb{N}^\star$ the set $\mathbb{N} \setminus \{0\}$. We use $\oplus$ to refer to the direct sum over Euclidean spaces.

3.1. CONTINUAL LEARNING

Continual Learning considers a series of tasks $\{\mathcal{T}_1, \mathcal{T}_2, \dots\}$, where each task can be viewed as a separate supervised learning problem. Similarly to online learning, data from each task is revealed only once. The goal of Continual Learning is to model each task accurately with a single model. The challenge is to achieve a good performance on new tasks, while retaining knowledge from previous tasks (Nguyen et al., 2018). We assume the data from each task $\mathcal{T}_\tau$, $\tau \in \mathbb{N}^\star$, is drawn from a distribution $\mathcal{D}_\tau$. Individual samples are denoted $(x_{\tau,i}, y_{\tau,i})$, where $i \in [n_\tau]$. For a given task $\mathcal{T}_\tau$, the model is denoted $f_\tau$; we use the superscript $(t)$ to indicate the training iteration $t \in \mathbb{N}$, and the superscript $\star$ to indicate the asymptotic limit at convergence. For the regression case, given a ridge regularisation coefficient $\lambda \in \mathbb{R}_+$, for all $t \in \mathbb{N}$, we write the train loss for a task $\mathcal{T}_\tau$ as: $\mathcal{L}_\tau(w_\tau(t)) = \sum_{i=1}^{n_\tau} \big(f^{(t)}_\tau(x_{\tau,i}) - y_{\tau,i}\big)^2 + \lambda \|w_\tau(t) - w^\star_{\tau-1}\|^2$.

3.2. OGD FOR CONTINUAL LEARNING

Let $\mathcal{T}_T$ be the current task, where $T \in \mathbb{N}^\star$. For all $i \in [n_{T-1}]$, let $v_{T,i} = \nabla_w f^\star_{T-1}(x_{T-1,i})$, the Jacobian of the converged model $f^\star_{T-1}$ on the samples of task $\mathcal{T}_{T-1}$. We define $E_\tau = \mathrm{span}(\{v_{\tau,i}, i \in [n_\tau]\})$, the subspace induced by the Jacobians. The idea behind OGD (Farajtabar et al., 2019) is to update the weights along the projection of the gradient on the orthogonal complement of the space induced by the Jacobians over the previous tasks, $E_1 \oplus \dots \oplus E_{T-1}$. The update rule at an iteration $t \in \mathbb{N}$ for the task $\mathcal{T}_T$ is as follows: $w_T(t+1) = w_T(t) - \eta \, \Pi_{(E_1 \oplus \dots \oplus E_{T-1})^\perp} \nabla_w \mathcal{L}_T(w_T(t))$. The intuition behind OGD is to "preserve the previously acquired knowledge by maintaining a space consisting of the gradient directions of the neural networks predictions on previous tasks" (Farajtabar et al., 2019). Throughout the paper, we only consider the OGD-GTL variant, which stores the gradient with respect to the ground truth logit.
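In a finite-dimensional setting, the projected update and the incremental construction of the orthonormal basis can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; `ogd_step` and `extend_basis` are illustrative names:

```python
import numpy as np

def ogd_step(w, grad, basis, lr=0.1):
    """One OGD update: project the loss gradient onto the orthogonal
    complement of the span of the stored Jacobians, then descend.

    basis: (d, m) matrix whose columns form an orthonormal basis of
    E_1 + ... + E_{T-1} (the memorised gradient directions)."""
    if basis.shape[1] > 0:
        grad = grad - basis @ (basis.T @ grad)  # remove components in the span
    return w - lr * grad

def extend_basis(basis, new_grads):
    """Gram-Schmidt: orthonormalise new Jacobian vectors against the
    current basis and append them (the end-of-task memory update)."""
    for v in new_grads:
        v = v.astype(float)
        if basis.shape[1]:
            v = v - basis @ (basis.T @ v)
        n = np.linalg.norm(v)
        if n > 1e-10:  # skip directions already spanned
            basis = np.column_stack([basis, v / n])
    return basis
```

By construction, the weight displacement produced by `ogd_step` is orthogonal to every stored direction, which is exactly the property the no-forgetting argument of Sec. 5 relies on.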

3.3. NEURAL TANGENT KERNEL

In their seminal paper, Jacot et al. (2018) established the connection between deep networks and kernel methods by introducing the Neural Tangent Kernel (NTK). They showed that at the infinite width limit, the kernel remains constant throughout training. Lee et al. (2019) also showed that a network evolves as a linear model in the infinite width limit when trained on certain losses under gradient descent. Throughout our analysis, we make the assumption that the neural network is overparameterized, and consider the linear approximation of the neural network around its initialisation: $f^{(t)}(x) \approx f^{(0)}(x) + \nabla_w f^{(0)}(x)^\top (w(t) - w(0))$.
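As an illustration of this linear approximation, the following minimal sketch compares a tiny width-$m$ network with its linearisation around the initialisation. The network architecture, the finite-difference gradient, and the perturbation scale are all illustrative choices, not from the original works:

```python
import numpy as np

def f(w, x, m):
    """Tiny width-m two-layer tanh network in NTK-style parameterization."""
    a, b = w[:m], w[m:]
    return b @ np.tanh(a * x) / np.sqrt(m)

def grad_f(w, x, m, eps=1e-5):
    """Finite-difference gradient of the scalar output w.r.t. the weights."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (f(w + e, x, m) - f(w - e, x, m)) / (2 * eps)
    return g

def f_lin(w, w0, x, m):
    """First-order (NTK-regime) expansion of f around the initialisation w0."""
    return f(w0, x, m) + grad_f(w0, x, m) @ (w - w0)

rng = np.random.default_rng(0)
m = 512
w0 = rng.normal(size=2 * m)
w = w0 + 1e-3 * rng.normal(size=2 * m)  # a small training step away from init
x = 0.7
err = abs(f(w, x, m) - f_lin(w, w0, x, m))
print(err)  # second-order remainder, small for small weight displacements
```

The remainder shrinks both with the size of the weight displacement and with the width $m$, which is the sense in which wide networks behave linearly near initialisation.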

4. CONVERGENCE -CONTINUAL LEARNING AS A RECURSIVE KERNEL REGRESSION

In this section, we derive a closed form expression for the Continual Learning models through tasks. We find that Continual Learning models can be expressed as a recursive kernel ridge regression across tasks. We also find that the NTK of OGD is recursive with respect to the projection of its feature map on the tasks' spaces. The result is presented in Theorem 1, a stepping stone towards proving the generalisation bound for OGD in Sec. 5.

4.1. CONVERGENCE THEOREM

Theorem 1 (Continual Learning as a recursive Kernel Regression) Let $\mathcal{T}_1, \dots, \mathcal{T}_T$ be a sequence of tasks. Fix a learning rate sequence $(\eta_\tau)_{\tau \in [T]}$ and a ridge regularisation coefficient $\lambda \in \mathbb{R}_+$. If, for all $\tau$, the learning rate satisfies $\eta_\tau < \frac{1}{\|\kappa_\tau(X_\tau, X_\tau) + \lambda I\|}$, then for all $\tau$, $w_\tau(t)$ converges linearly to a limit solution $w^\star_\tau$ such that

$f^\star_\tau(x) = f^\star_{\tau-1}(x) + \kappa_\tau(x, X_\tau)^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau$,

where $\kappa_\tau(x, x') = \phi_\tau(x)^\top \phi_\tau(x')$, $\tilde y_\tau = y_\tau - y_{\tau-1 \to \tau}$, $y_{\tau-1 \to \tau} = f^\star_{\tau-1}(X_\tau)$, and

$\phi_\tau(x) = \nabla_w f_0(x) \in \mathbb{R}^d$ for SGD, $\phi_\tau(x) = T_\tau \nabla_w f_0(x) \in \mathbb{R}^{d - M_\tau}$ for OGD,

where $\{T_\tau \in \mathbb{R}^{(d - M_\tau) \times d}, \tau \in [T]\}$ are projection matrices from $\mathbb{R}^d$ onto $(\oplus_{k=1}^{\tau} E_k)^\perp$ and $M_\tau = \dim(\oplus_{k=1}^{\tau} E_k)$.

The theorem describes how the model $f^\star_\tau$ evolves across tasks. It is recursive because the learning is incremental. For a given task $\mathcal{T}_\tau$, $f^\star_{\tau-1}(x)$ is the knowledge acquired by the agent up to the task $\mathcal{T}_{\tau-1}$. At this stage, the model only fits the residual $\tilde y_\tau = y_\tau - y_{\tau-1 \to \tau}$, which complements the knowledge acquired through previous tasks. This residual is also a proxy for task similarity: if the tasks are identical, the residual is equal to zero. The knowledge increment is captured by the term $\kappa_\tau(x, X_\tau)^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau$. Finally, the task similarity is computed with respect to the most recent feature map $\phi_\tau$, and $\kappa_\tau$ is the NTK with respect to the feature map $\phi_\tau$.

Remark 1 The recursive relation from Theorem 1 can also be written as a linear combination of kernel regressors as follows: $f^\star_\tau(x) = \sum_{k=1}^{\tau} f_k(x)$, where $f_k(x) = \kappa_k(x, X_k)^\top (\kappa_k(X_k, X_k) + \lambda I)^{-1} \tilde y_k$.

4.2. DISTANCE FROM INITIALISATION THROUGH TASKS

As described in Sec. 4.1, $\tilde y_\tau$ is a residual. It is equal to zero if the model $f^\star_{\tau-1}$ makes perfect predictions on the next task $\mathcal{T}_\tau$. The more different the next task $\mathcal{T}_\tau$ is, the further the neural network needs to move from its previous state in order to fit it. Corollary 1 tracks the distance from initialisation as a function of task similarity.

Corollary 1 For SGD, and for OGD under the additional assumption that $\{T_\tau, \tau \in [T]\}$ are orthonormal,

$\|w^\star_{\tau+1} - w^\star_\tau\|^2 = \tilde y_{\tau+1}^\top (\kappa(X_{\tau+1}, X_{\tau+1}) + \lambda I)^{-1} \kappa(X_{\tau+1}, X_{\tau+1}) (\kappa(X_{\tau+1}, X_{\tau+1}) + \lambda I)^{-1} \tilde y_{\tau+1}$.

Remark 2 Corollary 1 can be applied to obtain a result similar to Theorem 3 by Liu et al. (2019). In this remark, we mostly use their notations. Their theorem states that under some conditions, for 2-layer neural networks with a ReLU activation function, with probability no less than $1 - \delta$ over the random initialisation, $\|W(P) - W(Q)\|_F \le \sqrt{\tilde y_{P \to Q}^\top (H^\infty_P)^{-1} \tilde y_{P \to Q}}$ plus a lower-order term, where, in their work, $y_{P \to Q} = (H^{\infty}_{PQ})^\top (H^\infty_P)^{-1} y_P$ and $\tilde y_{P \to Q} = y_Q - y_{P \to Q}$. Note that $H^\infty_P$ is a Gram matrix, which also corresponds to the NTK of the neural network they consider. We see an analogy with our result, where we work directly with the NTK, with no assumptions on the neural network. One important observation is that, since there are no guarantees for the invertibility of our Gram matrix, we add a ridge regularisation to work with a regularised matrix, which is then invertible. In our setting, by considering $\lambda \to 0$, and with the additional assumption of invertibility of $H_{\tau,0}$, which is valid for the two-layer overparameterized ReLU neural network considered in the setting of Liu et al. (2019), we can recover a similar approximation.
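Corollary 1 can be checked numerically with an explicit feature matrix; all names and sizes below are illustrative:

```python
import numpy as np

# The weight displacement of ridge regression on the residual satisfies
# ||delta_w||^2 = y~^T (K + lam I)^{-1} K (K + lam I)^{-1} y~.
rng = np.random.default_rng(1)
P = rng.normal(size=(15, 50))      # rows are features phi(x_i): n=15, d=50
y_res = rng.normal(size=15)        # residual y_tilde_{tau+1}
lam = 0.1

K = P @ P.T                        # Gram matrix kappa(X_{tau+1}, X_{tau+1})
alpha = np.linalg.solve(K + lam * np.eye(15), y_res)
delta_w = P.T @ alpha              # delta_w = phi^T (K + lam I)^{-1} y_tilde
lhs = delta_w @ delta_w            # squared weight displacement
rhs = alpha @ K @ alpha            # kernel-side expression from Corollary 1
print(lhs, rhs)                    # the two sides coincide
```

The identity is purely algebraic: $\|\Delta w\|^2 = \alpha^\top P P^\top \alpha = \alpha^\top K \alpha$, so it holds for any feature matrix.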

5. OGD : LEARNING WITHOUT FORGETTING, PROVABLY

In this section, we study the generalisation properties of OGD, building on Thm. 1. First, we prove that OGD is robust to catastrophic forgetting with respect to all previous tasks (Thm. 2). Then, we present the main generalisation theorem for OGD (Thm. 3). The theorem provides several insights on the relation between task similarity and generalisation. Finally, we present how the Rademacher complexity relates to task similarity across a large number of tasks (Lemma 1). The lemma states that the more dissimilar the tasks are, the larger the class of functions explored by the neural network, with high probability. This result highlights the importance of the curriculum for Continual Learning.

5.1. MEMORISATION PROPERTY OF OGD

The key to obtaining tight generalisation bounds for OGD is Theorem 2.

Theorem 2 (No-forgetting Continual Learning with OGD) Given a task $\mathcal{T}_\tau$, for all $x_{k,i} \in \mathcal{D}_k$, a sample from the training data of a previous task $\mathcal{T}_k$, given that the Jacobian of $x_{k,i}$ belongs to OGD's memory, it holds that $f^\star_\tau(x_{k,i}) = f^\star_k(x_{k,i})$.

As motivated by Farajtabar et al. (2019), the orthogonality of the gradient updates aims to preserve the acquired knowledge, by not altering the weights along relevant dimensions when learning new tasks. Theorem 2 implies that, given an infinite memory, the training error on all samples from the previous tasks is unchanged when training with OGD.

5.2. GENERALISATION PROPERTIES OF SGD AND OGD

Now, we state the main generalisation theorem, which provides generalisation bounds on the data from all the tasks, for SGD and OGD.

Theorem 3 (Generalisation of SGD and OGD for Continual Learning) Let $\{\mathcal{T}_1, \dots, \mathcal{T}_T\}$ be a sequence of tasks. Let $\{\mathcal{D}_1, \dots, \mathcal{D}_T\}$ be the respective distributions over $\mathbb{R}^d \times \{-1, 1\}$. Let $\{(x_{\tau,i}, y_{\tau,i}), i \in [n_\tau], \tau \in [T]\}$ be i.i.d. samples from $\mathcal{D}_\tau$, $\tau \in [T]$. Denote $X_\tau = (x_{\tau,1}, \dots, x_{\tau,n_\tau})$ and $y_\tau = (y_{\tau,1}, \dots, y_{\tau,n_\tau})$. Consider the kernel ridge regression solution $f^\star_T$. Then, for any loss function $\ell : \mathbb{R} \times \mathbb{R} \to [0, c]$ that is $c$-Lipschitz in the first argument, with probability at least $1 - \delta$,

$L_{\mathcal{D}_\tau}(f^\star_T) \le \frac{\lambda^2}{n_\tau} \tilde y_\tau^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau + R_T + 3c \sqrt{\frac{\log(2/\delta)}{2 n_T}}$, for OGD, $\tau \in [1, T]$;

$L_{\mathcal{D}_\tau}(f^\star_T) \le \frac{\lambda^2}{n_\tau} \tilde y_\tau^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau + R_T + 3c \sqrt{\frac{\log(2/\delta)}{2 n_T}}$, for SGD, $\tau = T$;

$L_{\mathcal{D}_\tau}(f^\star_T) \le \frac{\lambda^2}{n_\tau} \tilde y_\tau^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau + \frac{1}{n_T} \sum_{k=\tau+1}^{T} H_{k,\tau} + R_T + 3c \sqrt{\frac{\log(2/\delta)}{2 n_T}}$, for SGD, $\tau < T$;

where $\tilde y_\tau = y_\tau - f^\star_{\tau-1}(X_\tau)$,

$R_T = \sum_{\tau=1}^{T} \sqrt{\frac{\mathrm{tr}(\kappa_\tau(X_\tau, X_\tau))}{n_\tau^2} \, \tilde y_\tau^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau}$,

$H_{k,\tau} = \tilde y_k^\top (\kappa_k(X_k, X_k) + \lambda I)^{-1} \kappa_k(X_k, X_\tau) \kappa_k(X_\tau, X_k) (\kappa_k(X_k, X_k) + \lambda I)^{-1} \tilde y_k$.

Theorem 3 shows that the generalisation bound for OGD is tighter than that of SGD. This results from Thm. 2, which states that OGD is robust to Catastrophic Forgetting; the bound is looser for SGD over the past tasks through the forgetting term that appears. The bound of Theorem 3 comprises the following main terms:

• $\frac{\lambda^2}{n_\tau} \tilde y_\tau^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau$: this term is due to the regularisation and reflects that the empirical loss is not equal to zero because of the regularisation term. It vanishes when there is no regularisation; for interpretation purposes, in App. D.2.7, we show that the empirical loss tends to zero in the no-regularisation case.

• $R_T$ captures the impact of task similarity on generalisation, which we discuss in Sec. 5.3.
• $\frac{1}{n_T} \sum_{k=\tau+1}^{T} H_{k,\tau}$ is a residual term that appears for SGD only. It is due to the catastrophic forgetting that occurs with SGD, and it also depends on task similarity.

These bounds share some similarities with the bounds derived by Arora et al. (2019), Liu et al. (2019) and Hu et al. (2019), where the bounds were derived for supervised learning settings, and in some cases for two-layer ReLU neural networks. Similarly, the bounds depend on the Gram matrix of the data, with the feature map corresponding to the NTK.

5.3. THE IMPACT OF TASK SIMILARITY ON GENERALISATION

Now, we state Lemma 1, which tracks the Rademacher complexity through tasks.

Lemma 1 Keeping the same notations and setting as Theorem 3, the Rademacher complexity can be bounded as follows:

$\mathcal{R}(\mathcal{F}_T) \le \sum_{\tau=1}^{T} O\!\left( \sqrt{\frac{\mathrm{tr}(\kappa_\tau(X_\tau, X_\tau))}{n_\tau^2} \, \tilde y_\tau^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau} \right)$,

where $\mathcal{F}_T$ is the function class covered by the model up to the task $\mathcal{T}_T$.

The term $R_T$ in the upper bound of the generalisation theorem follows directly from Lemma 1, and it draws a complexity measure for Continual Learning. It states that the upper bound on the Rademacher complexity increases when the tasks are dissimilar. We define the NTK task dissimilarity between two subsequent tasks $\mathcal{T}_{\tau-1}$ and $\mathcal{T}_\tau$ as $S_{\tau-1 \to \tau} = \tilde y_\tau^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau$. This dissimilarity is a generalisation of the term that appears in the upper bound of Thm. 2 by Liu et al. (2019). The knowledge from the previous tasks is encoded in the kernel $\kappa_\tau$, through the feature map $\phi_\tau$. As an edge case, if two successive tasks are identical, $S_{\tau-1 \to \tau} = 0$ and the upper bound does not increase.
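The NTK task dissimilarity $S_{\tau-1 \to \tau}$ is straightforward to compute once the Gram matrix is available. A minimal sketch, in which the feature map and the two "previous models" are illustrative stand-ins:

```python
import numpy as np

def ntk_dissimilarity(f_prev, phi, X, y, lam=0.1):
    """S_{tau-1 -> tau} = y~^T (kappa(X, X) + lam I)^{-1} y~ :
    the residual of the previous model on the new task, weighted by the
    regularised inverse Gram matrix of the feature map phi."""
    K = phi(X) @ phi(X).T
    resid = y - f_prev(X)
    return float(resid @ np.linalg.solve(K + lam * np.eye(len(y)), resid))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 100))
phi = lambda X: np.tanh(X @ W)
X, y = rng.normal(size=(10, 4)), rng.normal(size=10)

# If f_{tau-1} already predicts the new task perfectly, the residual is zero:
same_task = ntk_dissimilarity(lambda Z: y, phi, X, y)
# A genuinely new task leaves a non-zero residual:
new_task = ntk_dissimilarity(lambda Z: np.zeros(len(Z)), phi, X, y)
print(same_task, new_task)  # 0.0 vs strictly positive
```

This mirrors the edge case discussed above: identical successive tasks contribute nothing to the complexity bound, while dissimilar tasks contribute a strictly positive term.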

Implications for Curriculum Learning

We also observe that the upper bound depends on the task ordering, which may provide a theoretical explanation of the importance of learning with a curriculum (Bengio et al., 2009). In the following, we present an edge case which provides intuition on how the bound captures the importance of the ordering. Consider two dissimilar tasks $\mathcal{T}_1$ and $\mathcal{T}_2$. A sequence of tasks alternating between $\mathcal{T}_1$ and $\mathcal{T}_2$ leads to a large upper bound, as explained in the first paragraph, while a sequence concatenating a run of $\mathcal{T}_1$ followed by a run of $\mathcal{T}_2$ leads to a lower upper bound.

6. THE IMPACT OF THE NTK VARIATION ON OGD

In the previous section, we demonstrated that under the NTK regime and with an infinite memory, OGD is provably robust to catastrophic forgetting. In practical settings, these two assumptions do not hold; therefore, in this section, we study the limits of the NTK regime. We present OGD+, a variant of OGD we use to study the importance of the NTK variation for Continual Learning in practice. In order to isolate the NTK variation phenomenon in our experiments, we propose and study the OGD+ algorithm, which is designed to be more robust to the NTK variation: at the end of each task, it updates its orthonormal basis with respect to all the tasks, as opposed to OGD. Algorithm 1 presents the OGD+ algorithm. The main difference is that OGD+ stores the feature maps with respect to the samples from previous tasks, in addition to the feature maps with respect to the samples from the current task. This small change is motivated by the variation of the NTK in practice. In order to compute the feature maps with respect to the previous samples, OGD+ saves these samples in a dedicated memory, which we call the samples memory. This memory comes in addition to the orthonormal feature maps memory. The only role of the samples memory is to compute the updated feature maps at the beginning of each task.

Algorithm 1: OGD+ for Continual Learning
Input: a task sequence $\mathcal{T}_1, \mathcal{T}_2, \dots$; learning rate $\eta$
1. Initialize $S_J \leftarrow \{\}$; $S_D \leftarrow \{\}$; $w \leftarrow w_0$
2. for task ID $\tau = 1, 2, 3, \dots$ do
     repeat
       $g \leftarrow$ stochastic batch gradient for $\mathcal{T}_\tau$ at $w$
       $\tilde g \leftarrow g - \sum_{v \in S_J} \mathrm{proj}_v(g)$
       $w \leftarrow w - \eta \tilde g$
     until convergence
     Sample $S \subset S_D$
     for $(x, y) \in \mathcal{D}_\tau \cup S$ and $k \in [1, c]$ s.t. $y_k = 1$ do
       $u \leftarrow \nabla f_\tau(x; w) - \sum_{v \in S_J} \mathrm{proj}_v(\nabla f_\tau(x; w))$
       $S_J \leftarrow S_J \cup \{u\}$
     end
     Sample $D \subset \mathcal{D}_\tau$; update $S_D \leftarrow S_D \cup D$
   end
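The end-of-task refresh that distinguishes OGD+ from OGD can be sketched as follows. This is a minimal sketch, not the authors' implementation; `jacobian` stands for the per-sample gradient $\nabla f(x; w)$ and is an assumed callable:

```python
import numpy as np

def refresh_basis(jacobian, w, stored_samples):
    """OGD+-style end-of-task step: recompute the Jacobians of ALL memorised
    samples at the CURRENT weights w, then rebuild an orthonormal basis.
    OGD, in contrast, keeps the stale Jacobians computed at past weights."""
    if not stored_samples:
        return np.zeros((len(w), 0))
    J = np.column_stack([jacobian(w, x) for x in stored_samples])
    Q, R = np.linalg.qr(J)             # orthonormalisation (Gram-Schmidt via QR)
    keep = np.abs(np.diag(R)) > 1e-10  # drop linearly dependent directions
    return Q[:, keep]
```

During training on task $\tau$, the projection uses this basis exactly as in OGD; after convergence on task $\tau$, a sample of $\mathcal{D}_\tau$ is appended to the samples memory and the basis is rebuilt with the up-to-date feature maps.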

7. EXPERIMENTS

We study the validity of Theorem 2, which states that, under the given conditions, OGD is perfectly robust to Catastrophic Forgetting (Sec. 7.1). We find that the Catastrophic Forgetting of OGD decreases with over-parameterization. Then, we perform an ablation study using OGD+ in order to study the applicability and limits of the constant Jacobian assumption in practice (Sec. 7.2). We find that the assumption holds for some benchmarks, and that updating the Jacobian (OGD+) is critical for the robustness of OGD on the non-overparameterized benchmarks. Finally, we present a broader picture of the performance of OGD+ against standard Continual Learning baselines (Sec. 7.3).

Benchmarks We use the standard benchmarks, similarly to Goodfellow et al. (2014) and Chaudhry et al. (2019). Permuted MNIST (Goodfellow et al., 2014) consists of a series of MNIST supervised learning tasks, where the pixels of each task are permuted with respect to a fixed permutation. Rotated MNIST (Farajtabar et al., 2019) consists of a series of MNIST classification tasks, where the images are rotated by a monotonically increasing angle; we increment the rotation angle by 5 degrees at each new task. Split CIFAR-100 (Chaudhry et al., 2019) is constructed by splitting the original CIFAR-100 dataset (Krizhevsky, 2009) into 20 disjoint subsets, where each subset is formed by sampling without replacement 5 classes out of 100. In order to assess the robustness to catastrophic forgetting over long task sequences, we increase the length of the task streams from 5 to 15 for the MNIST benchmarks, and consider all 20 tasks for the Split CIFAR-100 benchmark. We also use the CUB200 benchmark (Jung et al., 2020), which contains 200 classes, split into 10 tasks. This benchmark is more overparameterized than the other benchmarks.

Evaluation

AVERAGE ACCURACY ($A_T$) is the average accuracy after the model has been trained on the task $\mathcal{T}_T$: $A_T = \frac{1}{T} \sum_{\tau=1}^{T} a_{T,\tau}$.

AVERAGE FORGETTING ($F_T$) is the average forgetting after the model has been trained on the task $\mathcal{T}_T$: $F_T = \frac{1}{T-1} \sum_{\tau=1}^{T-1} \max_{t \in \{1, \dots, T-1\}} (a_{t,\tau} - a_{T,\tau})$.

OVER-PARAMETERIZATION We track the over-parameterization of a given benchmark as the number of trainable parameters over the average number of samples per task (Arora et al., 2019): $O = \frac{p}{\sum_{\tau=1}^{T} n_\tau / T}$.

Architectures For the MNIST benchmarks, we use the same neural network architectures as Farajtabar et al. (2019). However, since the OGD algorithm does not scale to large neural networks due to memory limits, we consider smaller-scale neural networks for the CIFAR100 and CUB200 benchmarks. For CIFAR100, we use the LeNet architecture (Lecun et al., 1998). For CUB200, we keep the pretrained AlexNet, similarly to Jung et al. (2020), then freeze the feature layers and replace the default classifier with a smaller neural network (App. F.1.1).

7.1. ABLATION STUDY: FOR OGD, CATASTROPHIC FORGETTING DECREASES WITH OVERPARAMETERIZATION (THM. 2)

Theorem 2 states that in the NTK regime, given an infinite memory, the train error of OGD is unchanged. We study the impact of overparameterization on the Catastrophic Forgetting of OGD through an ablation study on the number of parameters of the model.

Experiment We train a model on the full task sequence on the MNIST and CIFAR100 benchmarks, then measure the variation of the train accuracy on the memorised samples from the first task. The reason we consider the samples in the memory is that Thm. 2 applies to these samples only. Fixing the dataset sizes, our proxy for overparameterization is the hidden size, which we vary.

Results Figures 1 and 2 show that the train error variation decreases with overparameterization for OGD, on the MNIST and CIFAR100 benchmarks. This result concurs with Thm. 2.

Figure 1: The variation of the train accuracy on the memorised samples from the first task as a function of overparameterization (higher is better).
The forgetting decreases with overparameterization, as stated in Theorem 2.

Figure 2: The variation of the train accuracy on the memorised samples from each task, after the model was trained on all tasks in sequence (higher is better). We vary the hidden size as a proxy for overparameterization.

7.2. ABLATION STUDY: THE IMPACT OF THE JACOBIAN'S VARIATION ON OGD

Our analysis relies on the over-parameterization assumption, which implies that the Jacobian is constant through tasks; Theorem 2 follows from this property. However, in practice, this assumption may not hold and forgetting is observed for OGD. We study the impact of the variation of the Jacobian on OGD in practice. In order to measure this impact, we perform an ablation study on OGD, taking the Jacobian's variation into account (OGD+) or not (OGD). OGD+ accounts for the Jacobian's variation by updating all the stored Jacobians at the end of each task.

Experiment We measure the average forgetting (Chaudhry et al., 2019) without (OGD) or with (OGD+) accounting for the Jacobian's variation, on the MNIST, CIFAR100 and CUB200 benchmarks.

Results Table 1 shows that on the Rotated MNIST and Permuted MNIST benchmarks, which are the least overparameterized, OGD+ is more robust to Catastrophic Forgetting than OGD. This improvement follows from the variation of the Jacobian in these benchmarks, which OGD+ accounts for. On CIFAR100 and CUB200, the most over-parameterized benchmarks, the robustness of OGD and OGD+ is equivalent: since the variation of the Jacobian is smaller, due to overparameterization, OGD+ is equivalent to OGD. This result confirms our initial hypothesis that the Jacobian's variation in non-overparameterized settings such as MNIST is an important reason the OGD algorithm is prone to Catastrophic Forgetting. It also highlights the importance of developing a theoretical framework for the non-overparameterized setting, in order to capture the properties of OGD outside the NTK regime.
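The evaluation metrics defined in Sec. 7 can be computed from the matrix of accuracies $a_{t,\tau}$ (accuracy on task $\tau$ after training on task $t$); a minimal sketch:

```python
import numpy as np

def average_accuracy(acc):
    """A_T: mean accuracy over all tasks after training on the last task
    (acc[t, tau] = accuracy on task tau after training on task t, 0-indexed)."""
    return float(acc[-1].mean())

def average_forgetting(acc):
    """F_T: for each past task, the drop from its best accuracy over the
    checkpoints before the last task to its final accuracy, averaged."""
    T = acc.shape[0]
    drops = [acc[:T - 1, tau].max() - acc[-1, tau] for tau in range(T - 1)]
    return float(np.mean(drops))

acc = np.array([[0.95, 0.10, 0.10],
                [0.80, 0.93, 0.10],
                [0.70, 0.85, 0.94]])
print(average_accuracy(acc))    # (0.70 + 0.85 + 0.94) / 3
print(average_forgetting(acc))  # ((0.95 - 0.70) + (0.93 - 0.85)) / 2
```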

Table 1 (columns: Permuted MNIST, Rotated MNIST, Split CIFAR100, CUB200): The average forgetting with (OGD+) and without (OGD) accounting for the Jacobian's variation, on the MNIST, CIFAR and CUB200 datasets (lower is better). A higher over-parameterization coefficient implies that the constant Jacobian assumption is more likely to hold.

7.3. BENCHMARKING OGD+ AGAINST OTHER CONTINUAL LEARNING BASELINES

In order to provide a broader picture on the robustness of OGD+ to Catastrophic Forgetting, we benchmark it against standard Continual Learning baselines.

Results

Table 2 (columns: Permuted MNIST, Rotated MNIST, Split CIFAR100, CUB200) shows that, on the Permuted MNIST and Rotated MNIST benchmarks, which are the least overparameterized, OGD+ draws an improvement over OGD and is competitive with other Continual Learning methods. On the most overparameterized benchmarks, CIFAR100 and CUB200, OGD+ is not competitive.

8. CONCLUSION

We presented a theoretical framework for Continual Learning algorithms in the NTK regime, then leveraged the framework to study the convergence and generalisation properties of the SGD and OGD algorithms. We also assessed our theoretical results through experiments and highlighted the applicability of the framework in practice. Our analysis highlights multiple connections to neighbouring fields such as Transfer Learning and Curriculum Learning. Extending the analysis to other Continual Learning algorithms is a promising future direction, which would provide a better understanding of the various properties of these algorithms and how they could be improved. Additionally, our experiments highlight the limits of the applicability of the framework in non-overparameterized settings. In order to gain a better understanding of the properties of the OGD algorithm in this setting, extending the framework beyond the over-parameterization assumption is an important direction. We hope this work provides new keys to investigate these directions.

A COMPLEMENTARY DISCUSSION

A.1 NTK VARIATION -THE IMPORTANCE OF ORTHOGONALITY FOR THE OGD, OGD+ AND A-GEM ALGORITHMS

In the main article, we focused only on the SGD, OGD and OGD+ algorithms. In this section, we highlight a connection between these algorithms and the A-GEM algorithm. All these algorithms perform orthogonal projections during the weight update step; however, these projections differ in the following ways: • the span of the space the updates are orthogonal to; • the rate at which the projection space is updated. Additionally, the A-GEM algorithm presents the additional property of allowing Positive Backward Transfer by design. Positive Backward Transfer is a desirable property in the sense that the generalisation on a past task increases while training on the current task.

Similarities among the algorithms Table 3 presents an overview of the connections between the algorithms. In practice, the models are not trained in the NTK regime. Since the feature map changes through training, the orthogonality constraint of OGD becomes less relevant and may harm the learning. On the other hand, while A-GEM performs a projection on a smaller subspace, which may not protect all the subspaces, the subspace is updated as it is computed at each training step, so the updates are expected to be more relevant. Finally, OGD+ lies at the intersection of both: while it does not update the feature maps at each gradient step, the projection constraint is with respect to a larger space.

On the theoretical connection between OGD and A-GEM The A-GEM algorithm (Chaudhry et al., 2019) is a state-of-the-art Continual Learning algorithm on the standard benchmarks. The idea behind the algorithm is to perform a plain gradient descent step if the update does not increase an estimate of the loss over the previous tasks; otherwise, the gradient is projected orthogonally to the gradient of the loss estimate over the previous tasks. We find that OGD draws an upper bound in comparison to A-GEM-NT in terms of generalisation error, as stated in Proposition 1.
Proposition 1 In the NTK regime, OGD implies A-GEM with no Positive Backward Transfer. The proof is presented in App. E.1.

Experiments

The extended experiments on the importance of the NTK variation for Continual Learning (Sec. F.3) show that A-GEM outperforms OGD and OGD+ on most benchmarks. A probable reason behind this difference is the higher rate of update of the feature map for the A-GEM algorithm, even though it spans a smaller space than OGD and OGD+.

A.2 EXPRESSIONS OVERVIEW

For convenience, we present in Table 4 an overview of the notions that were mentioned in the main article, with their respective mathematical expressions.

Table 4: Overview of the notions presented in the main manuscript, with their respective expressions.

Residual: $\tilde y_\tau = y_\tau - y_{\tau-1 \to \tau}$
Knowledge increment: $f_\tau(x) = \kappa_\tau(x, X_\tau)^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau$
Projection matrix: $T_\tau$
Feature map: $\phi(x)$
Effective feature map: $\phi_\tau(x)$
Effective kernel: $\kappa_\tau(x, x')$
Complexity measure: $\sum_{\tau=1}^{T} \sqrt{\frac{\mathrm{tr}(\kappa_\tau(X_\tau, X_\tau))}{n_\tau^2} \, \tilde y_\tau^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau}$
NTK task dissimilarity: $S_{\tau-1 \to \tau} = \tilde y_\tau^\top (\kappa_\tau(X_\tau, X_\tau) + \lambda I)^{-1} \tilde y_\tau$

C MISSING PROOFS OF SECTION 4 - CONVERGENCE

C.1 PROOF OF THEOREM 1

Stochastic Gradient Descent We start by proving the Stochastic Gradient Descent (SGD) case of Theorem 1.

Proof

We prove Theorem 1 by induction. Our induction hypothesis H_τ is the following:

H_τ : for all k ≤ τ, Theorem 1 holds.

First, we prove that H_1 holds. The proof is straightforward: for the first task, since there are no previous tasks, OGD coincides with SGD. Training is therefore equivalent to minimising the objective:

argmin_{w ∈ R^d} ‖f_0(X_1) + φ(X_1)^T (w − w_0) − y_1‖²₂ + λ‖w − w_0‖²   (1)

where φ(x) = ∇_{w_0} f_0(x). Substituting the residual ỹ_1 = y_1 − f_0(X_1) into the objective:

argmin_{w ∈ R^d} ‖φ(X_1)^T (w − w_0) − ỹ_1‖²₂ + λ‖w − w_0‖²   (2)

The objective is quadratic and the Hessian is positive definite, therefore the minimum exists and is unique:

w_1 − w_0 = φ(X_1)(φ(X_1)^T φ(X_1) + λI)^{-1} ỹ_1

Under the NTK regime assumption:

f_1(x) = f_0(x) + ∇_w f_0(x)^T (w_1 − w_0)

Then, replacing w_1 − w_0:

f_1(x) = f_0(x) + ∇_w f_0(x)^T φ(X_1)(φ(X_1)^T φ(X_1) + λI)^{-1} ỹ_1   (5)
f_1(x) = f_0(x) + κ_1(x, X_1)(κ_1(X_1, X_1) + λI)^{-1} ỹ_1

Finally:

f_1(x) − f_0(x) = κ_1(x, X_1)(κ_1(X_1, X_1) + λI)^{-1} ỹ_1   (7)

which completes the proof of H_1.

Let τ ∈ N. We assume that H_τ is true, then show that H_{τ+1} is true. On the task T_{τ+1}, we can write the loss L_{τ+1} as:

L_{τ+1}(w) = ‖φ_τ(X_{τ+1})^T (w − w_τ) − ỹ_{τ+1}‖²₂ + λ‖w − w_τ‖²   (8)

We recall that the optimisation problem on the task T_{τ+1} is:

argmin_{w ∈ R^d} ‖φ_τ(X_{τ+1})^T (w − w_τ) − ỹ_{τ+1}‖²₂ + λ‖w − w_τ‖²   (9)

The optimisation objective is quadratic, unconstrained, with a positive definite Hessian.
Therefore, an optimum exists and is unique:

w_{τ+1} − w_τ = φ_τ(X_{τ+1})(φ_τ(X_{τ+1})^T φ_τ(X_{τ+1}) + λI)^{-1} ỹ_{τ+1}

We define the kernel κ_{τ+1} : R^d × R^d → R as:

κ_{τ+1}(x, x′) = φ_τ(x)^T φ_τ(x′) for all x, x′ ∈ R^d   (11)

Finally, we recover a closed form expression for f_{τ+1}, using the induction hypothesis H_τ:

f_{τ+1}(x) = f_τ(x) + ⟨∇_w f_τ(x), w_{τ+1} − w_τ⟩   (12)
= f_τ(x) + φ_τ(x)^T φ_τ(X_{τ+1})(κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} ỹ_{τ+1}   (13)
= f_τ(x) + κ_{τ+1}(x, X_{τ+1})(κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} ỹ_{τ+1}

At this stage, we have proven H_{τ+1}.

Orthogonal Gradient Descent  Now, we prove the Orthogonal Gradient Descent (OGD) case of Theorem 1. The key difference in the proof is that the OGD optimisation objective is constrained: the constraints correspond to the orthogonality to the subspace spanned by the memorised feature maps of the previous tasks. Another key difference occurs in the regularisation: as opposed to the SGD case, where the regularisation of the weights spans the whole space, the regularisation of OGD only applies to the "learnable" space. This property follows from the orthogonality constraint, which enforces that the space spanned by the previous tasks is unchanged.

We prove Theorem 1 by induction. Our induction hypothesis H_τ is the following:

H_τ : for all k ≤ τ, Theorem 1 holds.

First, we prove that H_1 holds. For the first task, since there are no previous tasks, OGD coincides with SGD. Training is therefore equivalent to minimising the objective:

argmin_{w ∈ R^d} ‖f_0(X_1) + φ(X_1)^T (w − w_0) − y_1‖²₂ + λ‖w − w_0‖²   (16)

where φ(x) = ∇_{w_0} f_0(x). Substituting the residual ỹ_1 = y_1 − f_0(X_1):

argmin_{w ∈ R^d} ‖φ(X_1)^T (w − w_0) − ỹ_1‖²₂ + λ‖w − w_0‖²
The objective is quadratic and its Hessian is positive definite, therefore the minimum exists and is unique:

w_1 − w_0 = φ(X_1)(φ(X_1)^T φ(X_1) + λI)^{-1} ỹ_1

Under the NTK regime assumption:

f_1(x) = f_0(x) + ∇_w f_0(x)^T (w_1 − w_0)

Then, replacing w_1 − w_0:

f_1(x) = f_0(x) + ∇_w f_0(x)^T φ(X_1)(φ(X_1)^T φ(X_1) + λI)^{-1} ỹ_1   (20)
f_1(x) = f_0(x) + κ_1(x, X_1)(κ_1(X_1, X_1) + λI)^{-1} ỹ_1

Finally:

f_1(x) − f_0(x) = κ_1(x, X_1)(κ_1(X_1, X_1) + λI)^{-1} ỹ_1   (22)

which completes the proof of H_1.

Let τ ∈ N. We assume that H_τ is true, then show that H_{τ+1} is true. On the task T_{τ+1}, we can write the loss L_{τ+1} as:

L_{τ+1}(w) = ‖φ_τ(X_{τ+1})^T (w − w_τ) − ỹ_{τ+1}‖²₂ + λ‖w − w_τ‖²   (23)

We recall that the optimisation problem on the task T_{τ+1} is:

argmin_{w ∈ R^d} ‖φ_τ(X_{τ+1})^T (w − w_τ) − ỹ_{τ+1}‖²₂ + λ‖w − w_τ‖²   (24)
s.t. V_{τ+1}(w − w_τ) = 0   (25)

where V_{τ+1} is the projection matrix onto ⊕_{k=1}^τ E_k, the Euclidean space induced by the feature maps of the tasks {T_k, k ∈ [τ]}. Let K_{τ+1} = dim(⊕_{k=1}^τ E_k), and let T_{τ+1} ∈ R^{d×(d−K_{τ+1})} and w̃ ∈ R^{d−K_{τ+1}} be such that:

w − w_τ = T_{τ+1} w̃

We rewrite the objective by plugging in the variables we just defined; the two objectives are equivalent:

argmin_{w̃ ∈ R^{d−K_{τ+1}}} ‖φ_τ(X_{τ+1})^T T_{τ+1} w̃ − ỹ_{τ+1}‖²₂ + λ‖T_{τ+1} w̃‖²

Since the columns of T_{τ+1} are orthonormal, ‖T_{τ+1} w̃‖ = ‖w̃‖, and this objective is equivalent to:

argmin_{w̃ ∈ R^{d−K_{τ+1}}} ‖φ_τ(X_{τ+1})^T T_{τ+1} w̃ − ỹ_{τ+1}‖²₂ + λ‖w̃‖²

For clarity, we define Z_{τ+1} ∈ R^{n_{τ+1}×(d−K_{τ+1})} as:

Z_{τ+1} = φ_τ(X_{τ+1})^T T_{τ+1}   (30)

By plugging in Z_{τ+1}, we rewrite the objective as:

argmin_{w̃ ∈ R^{d−K_{τ+1}}} ‖Z_{τ+1} w̃ − ỹ_{τ+1}‖²₂ + λ‖w̃‖²   (31)

The optimisation objective is quadratic, unconstrained, with a positive definite Hessian.
Therefore, an optimum exists and is unique:

w̃_{τ+1} = Z_{τ+1}^T (Z_{τ+1} Z_{τ+1}^T + λI)^{-1} ỹ_{τ+1}   (32)

We recover the expression of the optimum in the original space:

w_{τ+1} − w_τ = T_{τ+1} Z_{τ+1}^T (Z_{τ+1} Z_{τ+1}^T + λI)^{-1} ỹ_{τ+1}   (33)

We define the kernel κ_{τ+1} : R^d × R^d → R as:

κ_{τ+1}(x, x′) = φ_τ(x)^T T_{τ+1} T_{τ+1}^T φ_τ(x′) for all x, x′ ∈ R^d   (34)

so that Z_{τ+1} Z_{τ+1}^T = κ_{τ+1}(X_{τ+1}, X_{τ+1}). Now we rewrite w_{τ+1} − w_τ:

w_{τ+1} − w_τ = T_{τ+1} Z_{τ+1}^T (κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} ỹ_{τ+1}   (35)

Finally, we recover a closed form expression for f_{τ+1}, using the induction hypothesis H_τ:

f_{τ+1}(x) = f_τ(x) + ⟨∇_w f_τ(x), w_{τ+1} − w_τ⟩   (36)
= f_τ(x) + φ_τ(x)^T T_{τ+1} Z_{τ+1}^T (κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} ỹ_{τ+1}   (37)
= f_τ(x) + κ_{τ+1}(x, X_{τ+1})(κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} ỹ_{τ+1}   (38)

At this stage, we have proven H_{τ+1}, which concludes the proof of Thm. 1.
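As a numeric sanity check on the closed form above (our own sketch, not from the paper): for a linearised model, plain gradient descent on the regularised objective ‖Φ^T v − ỹ‖² + λ‖v‖² does converge to the stated minimiser Φ(Φ^T Φ + λI)^{-1} ỹ.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 6
Phi = rng.normal(size=(d, n))        # columns: feature maps phi(x_i)
y = rng.normal(size=n)               # stand-in for the residual y~
lam = 0.5

# closed form minimiser of ||Phi^T v - y||^2 + lam ||v||^2
w_star = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), y)

# gradient descent on the same objective, started at v = 0 (i.e. w = w_0),
# with a step size below the inverse Lipschitz constant of the gradient
L = 2 * (np.linalg.eigvalsh(Phi @ Phi.T).max() + lam)
eta = 1.0 / L
v = np.zeros(d)
for _ in range(20000):
    grad = 2 * Phi @ (Phi.T @ v - y) + 2 * lam * v
    v -= eta * grad
```

The matrix identity (ΦΦ^T + λI)^{-1}Φ = Φ(Φ^TΦ + λI)^{-1} is what lets the d-dimensional solution be written through the n×n kernel matrix.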

C.2 PROOF OF COROLLARY 1

Stochastic Gradient Descent The proof follows immediately from Thm. 1.

Proof

In the proof of Theorem 1 (App. C.1), for the SGD case, we showed that:

w_{τ+1} − w_τ = φ(X_{τ+1})(φ(X_{τ+1})^T φ(X_{τ+1}) + λI)^{-1} ỹ_{τ+1}   (40)

This result implies that:

‖w_{τ+1} − w_τ‖² = ỹ_{τ+1}^T (φ(X_{τ+1})^T φ(X_{τ+1}) + λI)^{-1} φ(X_{τ+1})^T φ(X_{τ+1}) (φ(X_{τ+1})^T φ(X_{τ+1}) + λI)^{-1} ỹ_{τ+1}   (41)
= ỹ_{τ+1}^T (κ(X_{τ+1}, X_{τ+1}) + λI)^{-1} κ(X_{τ+1}, X_{τ+1}) (κ(X_{τ+1}, X_{τ+1}) + λI)^{-1} ỹ_{τ+1}   (42)

Orthogonal Gradient Descent  The proof is exactly the same as above; the difference lies in the kernel definition, which is implicit. In the proof of Theorem 1 (App. C.1), for the OGD case, we showed that:

w_{τ+1} − w_τ = T_{τ+1} Z_{τ+1}^T (κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} ỹ_{τ+1}

This result implies that:

‖w_{τ+1} − w_τ‖² = ỹ_{τ+1}^T (κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} Z_{τ+1} T_{τ+1}^T T_{τ+1} Z_{τ+1}^T (κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} ỹ_{τ+1}   (45)
= ỹ_{τ+1}^T (κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} κ_{τ+1}(X_{τ+1}, X_{τ+1}) (κ_{τ+1}(X_{τ+1}, X_{τ+1}) + λI)^{-1} ỹ_{τ+1}   (46)

D.2.2 BOUNDING ‖f̃_τ‖_{H_τ}

Lemma 2  Let H_τ be the Hilbert space associated to the kernel κ_τ. We recall that:

f̃_τ(x) = κ_τ(x, X_τ)^T α_τ   (60)
α_τ = (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ   (61)

Then:

‖f̃_τ‖²_{H_τ} ≤ ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ   (62)

We start from the definition of the RKHS norm of f̃_τ:

‖f̃_τ‖²_{H_τ} = α_τ^T κ_τ(X_τ, X_τ) α_τ   (64)
= ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} κ_τ(X_τ, X_τ) (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ   (65)

Since (κ_τ(X_τ, X_τ) + λI)^{-1} ⪯ κ_τ(X_τ, X_τ)^{-1}:

‖f̃_τ‖²_{H_τ} ≤ ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} κ_τ(X_τ, X_τ) κ_τ(X_τ, X_τ)^{-1} ỹ_τ   (66)
≤ ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ
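Both quantities above are easy to check numerically. The following sketch (our own, with a random linear feature map) verifies the SGD-case identity for ‖w_{τ+1} − w_τ‖² and the Lemma 2 bound on the RKHS norm of the increment.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 12, 5
Phi = rng.normal(size=(d, n))        # columns: feature maps
resid = rng.normal(size=n)           # residual y~_{tau+1}
lam = 0.3

K = Phi.T @ Phi                      # kernel matrix kappa(X, X)
Kinv = np.linalg.solve(K + lam * np.eye(n), np.eye(n))

delta_w = Phi @ Kinv @ resid         # closed form of w_{tau+1} - w_tau
dist_sq = resid @ Kinv @ K @ Kinv @ resid   # y~^T (K+lam I)^{-1} K (K+lam I)^{-1} y~

alpha = Kinv @ resid                 # ridge coefficients
rkhs_sq = alpha @ K @ alpha          # RKHS norm squared of the increment
```

The bound in Lemma 2 follows because K(K + λI)^{-1} has all eigenvalues at most 1.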

D.2.3 BOUNDING THE RADEMACHER COMPLEXITY

The goal of this section is to upper bound the Rademacher Complexity. First, we derive a general upper bound for the Rademacher Complexity of a linear combination of kernels in Lemma 3. Then, we apply this bound to the linear combination of kernels obtained through Thm. 1, which describes the model through Continual Learning.

A general bound for the Rademacher Complexity

Lemma 3 (Rademacher Complexity of a linear combination of kernels)  Let κ_t : X × X → R, t ∈ [T], be kernels such that sup_{x∈X} κ_t(x, x) < ∞. To every kernel κ_t, we associate a feature map φ_t : X → H_t, where H_t is a Hilbert space with inner product ⟨·, ·⟩_{H_t} and, for all x, x′ ∈ X, κ_t(x, x′) = ⟨φ_t(x), φ_t(x′)⟩_{H_t}. For all T ∈ N, given a sequence {B_t, t ∈ [T]}, we define F_T as:

F_T = {x ↦ Σ_{t=1}^T f_t(x) : f_t(x) = α_t^T κ_t(x, X_t), ‖f_t‖_{H_t} ≤ B_t, ∀t ∈ [T]}   (68)

Let X_1, …, X_n be random elements of X. Then, for the class F_T, we have:

R(F_T) ≤ Σ_{t=1}^T (2 B_t / n_t) (Tr(κ_t(X_t, X_t)))^{1/2}

Proof  Let f ∈ F_T, and let x ∈ X:

f(x) = Σ_{τ=1}^T Σ_{i=1}^{n_τ} α_{τ,i} κ_τ(x, x_{τ,i})   (69)

For all τ ∈ [T], we associate a feature map φ_τ : X → H_τ with κ_τ(x, x′) = ⟨φ_τ(x), φ_τ(x′)⟩_{H_τ} for all x, x′ ∈ X. Therefore:

f(x) = Σ_{τ=1}^T Σ_{i=1}^{n_τ} α_{τ,i} ⟨φ_τ(x_{τ,i}), φ_τ(x)⟩_{H_τ}   (71)
= Σ_{τ=1}^T ⟨Σ_{i=1}^{n_τ} α_{τ,i} φ_τ(x_{τ,i}), φ_τ(x)⟩_{H_τ}   (72)

On the other hand, the following holds for all τ ∈ [T]:

‖Σ_{i=1}^{n_τ} α_{τ,i} φ_τ(x_{τ,i})‖²_{H_τ} = Σ_{i,j} α_{τ,i} α_{τ,j} κ_τ(x_{τ,i}, x_{τ,j}) ≤ B_τ²

Therefore:

F_T ⊂ {x ↦ Σ_{τ=1}^T ⟨w_τ, φ_τ(x)⟩_{H_τ} : ‖w_τ‖ ≤ B_τ, ∀τ ∈ [T]} := F̃_T

Now, we derive an upper bound of the Rademacher complexity of F̃_T:

R(F_T) ≤ R(F̃_T)   (75)
= E[ sup_{‖w_τ‖ ≤ B_τ, τ ∈ [T]} Σ_{τ=1}^T ⟨w_τ, (2/n_τ) Σ_{i=1}^{n_τ} ε_i φ_τ(x_{τ,i})⟩_{H_τ} | (X_τ) ]   (76)
= Σ_{τ=1}^T E[ sup_{‖w_τ‖ ≤ B_τ} ⟨w_τ, (2/n_τ) Σ_{i=1}^{n_τ} ε_i φ_τ(x_{τ,i})⟩_{H_τ} | (X_τ) ]
≤ Σ_{τ=1}^T (2 B_τ / n_τ) (Tr(κ_τ(X_τ, X_τ)))^{1/2}   (78)

Bounding the Rademacher Complexity for Continual Learning

Lemma 4  Keeping the same notations and setting as Theorem 3, the Rademacher Complexity can be bounded as follows:

R(F_T) ≤ Σ_{τ=1}^T O( √( ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ / n_τ ) )

where F_T is the function class spanned by the model up to the task T_T.
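The key step of Lemma 3 — bounding the expected supremum by the trace of the Gram matrix — can be illustrated with a small Monte Carlo estimate (our own sketch with synthetic features). For the class {x ↦ ⟨w, φ(x)⟩ : ‖w‖ ≤ B}, the supremum has the closed form B‖Σ_i ε_i φ(x_i)‖.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 8
phi = rng.normal(size=(n, p))        # rows: feature maps phi(x_i)
B = 2.0

# empirical Rademacher complexity of {x -> <w, phi(x)>, ||w|| <= B}:
# (2/n) E_eps sup_{||w|| <= B} <w, sum_i eps_i phi(x_i)>
#   = (2B/n) E_eps || sum_i eps_i phi(x_i) ||
samples = [np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ phi)
           for _ in range(2000)]
rademacher_mc = 2 * B / n * np.mean(samples)

trace_bound = 2 * B / n * np.sqrt(np.trace(phi @ phi.T))  # (2B/n) Tr(K)^{1/2}
```

By Jensen's inequality, E‖Σ_i ε_i φ(x_i)‖ ≤ (E‖Σ_i ε_i φ(x_i)‖²)^{1/2} = Tr(K)^{1/2}, so the Monte Carlo estimate always sits below the trace bound.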

Proof

For all T ∈ N, given a sequence {B_τ, τ ∈ [T]}, we define F_T as:

F_T = {x ↦ Σ_{τ=1}^T f_τ(x) : f_τ(x) = α_τ^T κ_τ(x, X_τ), ‖f_τ‖_{H_τ} ≤ B_τ, ∀τ ∈ [T]}

where we set B_τ as:

B_τ = √( (y_τ − y_{τ−1→τ})^T (κ_τ(X_τ, X_τ) + λI)^{-1} (y_τ − y_{τ−1→τ}) )

Following Lemma 2, for all τ ∈ [T], f̃_τ ∈ F_T, which means that the function class F_T contains the Continual Learning model up to task T_T. We apply Lemma 3 in order to upper bound the Rademacher complexity:

R(F_T) ≤ Σ_{τ=1}^T (2 B_τ / n_τ) (Tr(κ_τ(X_τ, X_τ)))^{1/2}   (85)

Under the assumption that, for all τ ∈ [T], tr(κ_τ(X_τ, X_τ)) = O(n_τ):

R(F_T) ≤ Σ_{τ=1}^T (2 B_τ / n_τ) O(√n_τ) ≤ Σ_{τ=1}^T O(B_τ / √n_τ) ≤ Σ_{τ=1}^T O( √( (y_τ − y_{τ−1→τ})^T (κ_τ(X_τ, X_τ) + λI)^{-1} (y_τ − y_{τ−1→τ}) / n_τ ) )

D.2.4 BOUNDING THE EMPIRICAL LOSS FOR OGD

Lemma 5  Given a current task T_T, the empirical losses on the data from all previous tasks (T_τ, τ ≤ T) can be bounded as follows. Let T ∈ N be fixed. Then, for all τ ∈ [T]:

L_{S_τ}(f_T) ≤ (λ²/n_τ) ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ   (90)

Case 1 - Task T_T

Proof  We start from the definition of the empirical loss:

L_{S_T}(f_T) = (1/n_T) Σ_{i=1}^{n_T} (f_T(x_{T,i}) − y_{T,i})²   (92)
= (1/n_T) ‖f_T(X_T) − y_T‖²₂

We replace f_T by its expression from Theorem 1:

= (1/n_T) ‖f̃_T(X_T) + f_{T−1}(X_T) − y_T‖²₂   (94)
= (1/n_T) ‖f̃_T(X_T) − ỹ_T‖²₂   (95)
= (1/n_T) ‖κ_T(X_T, X_T)^T (κ_T(X_T, X_T) + λI)^{-1} ỹ_T − ỹ_T‖²₂   (96)
= (1/n_T) ‖(κ_T(X_T, X_T) + λI − λI)(κ_T(X_T, X_T) + λI)^{-1} ỹ_T − ỹ_T‖²₂   (97)
= (1/n_T) ‖ỹ_T − λ(κ_T(X_T, X_T) + λI)^{-1} ỹ_T − ỹ_T‖²₂   (98)
= (λ²/n_T) ‖(κ_T(X_T, X_T) + λI)^{-1} ỹ_T‖²₂

Now, we apply Lemma 2 in order to upper bound the right-hand side norm:

L_{S_T}(f_T) ≤ (λ²/n_T) ỹ_T^T (κ_T(X_T, X_T) + λI)^{-1} ỹ_T   (100)

Case 2 - Tasks {T_τ, τ ∈ [1, T−1]}  The proof is very similar to Case 1; we apply Theorem 2, which is the key no-forgetting property of OGD.

Proof

We start from the definition of the empirical loss:

L_{S_τ}(f_T) = (1/n_τ) Σ_{i=1}^{n_τ} (f_T(x_{τ,i}) − y_{τ,i})²   (102)
= (1/n_τ) ‖f_T(X_τ) − y_τ‖²₂   (103)

Now, applying Theorem 2, which implies that f_τ(X_τ) = f_T(X_τ):

L_{S_τ}(f_T) = (1/n_τ) ‖f_τ(X_τ) − y_τ‖²₂

Then, with a similar analysis as Case 1, we get:

L_{S_τ}(f_T) ≤ (λ²/n_τ) ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ
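The chain of equalities above rests on a standard kernel-ridge identity that can be verified numerically. In this sketch (our own), a random PSD Gram matrix stands in for κ_T(X_T, X_T).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 15
A = rng.normal(size=(n, n))
K = A @ A.T                          # PSD stand-in for kappa(X, X)
resid = rng.normal(size=n)           # residual y~
lam = 0.2

fit = K @ np.linalg.solve(K + lam * np.eye(n), resid)    # kernel ridge fit on X
train_loss = np.sum((fit - resid) ** 2) / n
# identity: the train loss equals (lam^2 / n) ||(K + lam I)^{-1} y~||^2,
# because fit - resid = -lam (K + lam I)^{-1} resid
closed_form = lam ** 2 / n * np.sum(np.linalg.solve(K + lam * np.eye(n), resid) ** 2)
```

The λ² factor in the lemma is exactly the square of the −λ(K + λI)^{-1}ỹ residual left by the ridge fit.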

D.2.5 BOUNDING THE EMPIRICAL LOSS FOR SGD

As opposed to the analysis for OGD, forgetting can occur for SGD, and Theorem 2 is no longer valid. Similarly to the OGD empirical loss analysis, we study the case of the data from the last task and the case of the data from all the other tasks.

Lemma 6  The empirical losses on the source and target tasks can be bounded as follows. Let T ∈ N be fixed. Then:

L_{S_T}(f_T) ≤ (λ²/n_T) ỹ_T^T (κ_T(X_T, X_T) + λI)^{-1} ỹ_T   (107)

For all τ ∈ [T−1]:

L_{S_τ}(f_T) ≤ (1/n_τ) ( λ² ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ + Σ_{k=τ+1}^T ỹ_k^T (κ_k(X_k, X_k) + λI)^{-1} κ_k(X_τ, X_k)^T κ_k(X_τ, X_k) (κ_k(X_k, X_k) + λI)^{-1} ỹ_k )   (108)

Case 1 - Task T_T  For this case, the analysis is the same as for OGD, as no forgetting occurs with respect to the data the model is currently trained on.

Case 2 - Tasks {T_τ, τ ∈ [1, T−1]}  Since SGD is not guaranteed to be robust to Catastrophic Forgetting, this case comprises additional residual terms that correspond to the forgetting.

Proof  Let τ ∈ [T−1]. From Corr. 1, we recall that:

f_T(x) = f_τ(x) + Σ_{k=τ+1}^T f̃_k(x)

Then:

‖f_T(X_τ) − y_τ‖²₂ = ‖f_τ(X_τ) + Σ_{k=τ+1}^T f̃_k(X_τ) − y_τ‖²₂   (111)
≤ ‖f_τ(X_τ) − y_τ‖²₂ (=: A) + Σ_{k=τ+1}^T ‖f̃_k(X_τ)‖²₂ (=: B)   (112)

We can upper bound A similarly to the previous paragraphs:

A = ‖f_τ(X_τ) − y_τ‖²₂ ≤ λ² ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ

Now, we upper bound B.
Let k ∈ [τ+1, T]:

‖f̃_k(X_τ)‖²₂ = ỹ_k^T (κ_k(X_k, X_k) + λI)^{-1} κ_k(X_τ, X_k)^T κ_k(X_τ, X_k) (κ_k(X_k, X_k) + λI)^{-1} ỹ_k

The middle term κ_k(X_τ, X_k)^T κ_k(X_τ, X_k) measures the similarity between the tasks T_τ and T_k. We conclude by plugging back the upper bounds of A and B:

‖f_T(X_τ) − y_τ‖²₂ ≤ λ² ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ + Σ_{k=τ+1}^T ỹ_k^T (κ_k(X_k, X_k) + λI)^{-1} κ_k(X_τ, X_k)^T κ_k(X_τ, X_k) (κ_k(X_k, X_k) + λI)^{-1} ỹ_k   (116)

Therefore:

L_{S_τ}(f_T) ≤ (1/n_τ) ( λ² ỹ_τ^T (κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ + Σ_{k=τ+1}^T ỹ_k^T (κ_k(X_k, X_k) + λI)^{-1} κ_k(X_τ, X_k)^T κ_k(X_τ, X_k) (κ_k(X_k, X_k) + λI)^{-1} ỹ_k )   (118)

D.2.6 PROOF OF THE GENERALISATION THEOREM (THM. 3)

Now, we prove Theorem 3 by applying the lemmas developed above.

Proof

With probability at least 1 − δ, we have:

sup_{f ∈ F_T} {L_D(f) − L_S(f)} ≤ 2ρ R(F_T) + 3c √( log(2/δ) / (2 n_T) )   (120)

L_{D_τ}(f_T) ≤ L_{S_τ}(f_T) + 2ρ R(F_T) + 3c √( log(2/δ) / (2 n_T) )   (121)

L_{D_τ}(f_T) ≤ L_{S_τ}(f_T) + Σ_{k=1}^T O( √( ỹ_k^T (κ_k(X_k, X_k) + λI)^{-1} ỹ_k / n_k ) ) + 3c √( log(2/δ) / (2 n_T) )

Then, replacing L_{S_τ}(f_T) with the bounds of Lemmas 5 and 6, we get the upper bounds of the theorem for the various cases of SGD and OGD.

D.2.7 ALTERNATIVE PROOF FOR THE EMPIRICAL LOSS BOUNDS -NO REGULARISATION CASE

This section is complementary and aims to provide a better interpretation of the bounds in Theorem 3. We prove that, in the case where there is no regularisation, the training error converges to zero. This result better illustrates the intuition behind the λ term that appears in the upper bound of Theorem 3, which holds under the regularisation assumption. First, we derive the equation of the model's output dynamics in Lemma 7. Then, we apply this lemma to prove the convergence to zero in Lemma 8. Finally, in Lemma 9, we show that, for the past tasks, the training error equals zero at all times.

The training dynamics of OGD

Lemma 7 (Equation of the model's output dynamics)  Let T ∈ N be the index of the task currently being trained on with OGD. For all τ < T, defining u_τ as:

u_τ(t) = f_τ^{(t)}(X_τ) − Σ_{k<τ} f̃_k(X_τ) = φ_τ(X_τ)^T (w_τ(t) − w_{τ−1})

the model's output dynamics while training on the task T_τ is:

u_τ(t+1) − u_τ(t) = −η κ_τ(X_τ, X_τ)(u_τ(t) − ỹ_τ)   (125)

Proof

Let τ ∈ N. We derive the dynamics while training on the task T_τ. First, we define:

ȳ_τ = ỹ_τ + φ_τ(X_τ)^T w_{τ−1}

We also define the projection matrix P_τ ∈ R^{p×p} as:

P_τ = T_τ T_τ^T

The matrix P_τ performs the projection from the original weight space R^p onto the trainable weight space during training on task T_τ, which corresponds to (⊕_{k=1}^{τ−1} E_k)^⊥. Our starting point is the projected gradient descent update rule:

w_τ(t+1) = w_τ(t) − η P_τ ∇_w L_τ(w_τ(t))

where:

L_τ(w_τ(t)) = ‖φ_τ(X_τ)^T (w_τ(t) − w_{τ−1}) − ỹ_τ‖²₂   (130)
= ‖φ_τ(X_τ)^T w_τ(t) − (ỹ_τ + φ_τ(X_τ)^T w_{τ−1})‖²₂   (131)
= ‖φ_τ(X_τ)^T w_τ(t) − ȳ_τ‖²₂   (132)

Therefore, we get the following expression for the gradient of the loss:

∇_w L_τ(w_τ(t)) = φ_τ(X_τ)(φ_τ(X_τ)^T w_τ(t) − ȳ_τ)

Then, the following holds:

u_τ(t+1) − u_τ(t) = φ_τ(X_τ)^T (w_τ(t+1) − w_τ(t))   (135)
= φ_τ(X_τ)^T (−η P_τ ∇_w L_τ(w_τ(t)))   (136)
= −η φ_τ(X_τ)^T P_τ φ_τ(X_τ)(φ_τ(X_τ)^T w_τ(t) − ȳ_τ)   (137)
= −η (T_τ^T φ_τ(X_τ))^T (T_τ^T φ_τ(X_τ))(φ_τ(X_τ)^T w_τ(t) − ȳ_τ)   (138)
= −η κ_τ(X_τ, X_τ)(φ_τ(X_τ)^T w_τ(t) − ȳ_τ)   (139)
= −η κ_τ(X_τ, X_τ)(u_τ(t) − ỹ_τ)   (140)

It follows that:

u_τ(t+1) − ỹ_τ = (I − η κ_τ(X_τ, X_τ))(u_τ(t) − ỹ_τ)   (142)
u_τ(t) − ỹ_τ = (I − η κ_τ(X_τ, X_τ))^t (u_τ(0) − ỹ_τ)   (143)

The training error converges to zero

Lemma 8  For all tasks T_τ, τ ∈ [T], the training error L_{S_τ}(f_τ^{(t)}) converges to 0 as t tends to infinity.

Proof  We start from the definition of the training error on task T_τ:

L_{S_τ}(f_τ^{(t)}) = ‖f_τ^{(t)}(X_τ) − y_τ‖²   (145)
= ‖u_τ(t) − ỹ_τ‖²   (146)
= ‖(I − η κ_τ(X_τ, X_τ))^t (u_τ(0) − ỹ_τ)‖²

Under the assumption that κ_τ(X_τ, X_τ) is positive definite, for η ≤ 1/λ_max(κ_τ(X_τ, X_τ)) the eigenvalues of I − η κ_τ(X_τ, X_τ) are strictly less than 1, therefore:

lim_{t→∞} L_{S_τ}(f_τ^{(t)}) = 0   (148)

The training error on the past tasks is zero

Lemma 9  For all tasks T_τ, τ ∈ [T], the training error L_{S_τ}(f_T) = 0.
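The linear recursion above can be simulated directly. In this sketch (our own, with a synthetic well-conditioned positive definite kernel), the iterates contract toward the residual at the rate set by the spectrum of I − ηκ.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8
A = rng.normal(size=(n, n))
K = A @ A.T + np.eye(n)              # positive definite kernel matrix
resid = rng.normal(size=n)           # target residual y~
eta = 0.9 / np.linalg.eigvalsh(K).max()   # step size below 1 / lambda_max

u = np.zeros(n)                      # u(0): output at the start of the task
for _ in range(5000):
    u = u - eta * K @ (u - resid)    # u(t+1) - u(t) = -eta K (u(t) - y~)
```

With this step size every eigenvalue of I − ηK lies in [0.1, 1), so ‖u(t) − ỹ‖ decays geometrically, matching Lemma 8.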
Proof  This lemma follows immediately from Lemma 8, which states that the training error of OGD converges to zero on all tasks, and Theorem 2, which states that the training error on the past tasks is unchanged.

E MISSING PROOFS OF APP. A - THE IMPORTANCE OF THE NTK VARIATION FOR CONTINUAL LEARNING

E.1 PROOF OF PROPOSITION 1 - OGD IMPLIES A-GEM-NT

We present the proof of Proposition 1, which relies mainly on Theorem 2 stating the robustness of OGD to Catastrophic Forgetting.

Proof

Let T_τ be a fixed task. We recall that, in the NTK regime, the model can be expressed as:

f_w(x) = f_{τ−1}(x) + ∇f_0(x)^T (w − w_{τ−1})

Given a task T_k and its associated memory M_k, we recall that the loss can be expressed as:

l(f_w, M_k) = Σ_{(x,y)∈M_k} (f_{τ−1}(x) + ∇f_0(x)^T (w − w_{τ−1}) − y)²

Similarly to the A-GEM paper, we define the gradient vectors g and {g_k, k ∈ [τ−1]} as:

g = ∇_w l(f_w, D_τ)   (151)
g_k = ∇_w l(f_w, M_k)   (152)

Now, we expand the expression of g_k:

g_k = ∇_w l(f_w, M_k)   (153)
= ∇_w Σ_{(x,y)∈M_k} (f_{τ−1}(x) + ∇f_0(x)^T (w − w_{τ−1}) − y)²   (154)
= Σ_{(x,y)∈M_k} (f_{τ−1}(x) + ∇f_0(x)^T (w − w_{τ−1}) − y) ∇f_0(x)

For OGD, following the finite-memory variant of Thm. 2, f_{τ−1}(x) = y for all (x, y) ∈ M_k:

g_k = Σ_{(x,y)∈M_k} (∇f_0(x)^T (w − w_{τ−1})) ∇f_0(x)

Finally, since the gradient updates are orthogonal to the feature maps of M_k, ∇f_0(x)^T (w − w_{τ−1}) = 0 for all x ∈ M_k, hence:

g_k = 0

Therefore, for all k ∈ [τ−1]:

g_k · g = 0   (158)

We conclude that OGD implies learning with GEM and A-GEM with no Positive Backward Transfer.
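The vanishing of g_k can be checked numerically. This sketch (our own) builds a random memory of linearised feature maps, projects a weight update orthogonally to their span (the OGD constraint), and confirms that the linearised memory gradient G^T(Gδ) is identically zero.

```python
import numpy as np

rng = np.random.default_rng(7)
p, m = 12, 4
G = rng.normal(size=(m, p))          # rows: feature maps grad f_0(x), x in M_k

# OGD constraint: the weight update delta is orthogonal to every row of G
Q, _ = np.linalg.qr(G.T)             # orthonormal basis of the memory span
delta = rng.normal(size=p)
delta -= Q @ (Q.T @ delta)           # remove the components inside the span

# linearised memory gradient: g_k = sum_x (grad f_0(x) . delta) grad f_0(x)
g_k = G.T @ (G @ delta)
```

Since Gδ = 0 by construction, g_k = 0 regardless of the data, which is exactly why the A-GEM condition g·g_k ≥ 0 is trivially satisfied under OGD.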

F EXPERIMENTS

F.1 REPRODUCIBILITY

F.1.1 ARCHITECTURES

Due to the memory limitations encountered while running OGD and OGD+ on the CIFAR100 and CUB200 datasets, we use smaller architectures than Jung et al. (2020). For the MNIST benchmarks, we keep the same architecture as Farajtabar et al. (2019).

MNIST: Similarly to Farajtabar et al. (2019), the neural network is a three-layer MLP with 100 hidden units in each of the two hidden layers and ReLU activations. The model has 10 output logits, which do not use any activation function.

CIFAR100 :

The neural network is a multi-head LeNet (Lecun et al., 1998) with Batch Normalisation (Ioffe & Szegedy, 2015) and 200 hidden units in the penultimate layer.

CUB200: Similarly to Jung et al. (2020), our base architecture is AlexNet (Krizhevsky et al., 2012). In order to scale to the OGD-type algorithms, we replaced the classifier module with a smaller three-layer ReLU neural network with dropout (Table 5).

Layer
Linear (4096, 256)
RELU
Dropout (0.5)
Linear (256, 128)
RELU
Linear (128, ...)

Table 5: Classifier module of the architecture used for the CUB200 benchmark.

F.1.2 EXPERIMENT SETUP

We run each experiment 5 times; the seed set is the same across all experiment sets. We report the mean and standard deviation of the measurements.

F.1.3 OGD+ IMPLEMENTATION DETAILS

Memory : During the OGD+ memory update step, for each task, the new associated feature maps replace the memory slots of the previous feature maps for the same task. The goal is to ensure a balance of the feature maps from all tasks in the memory. Multi-head models : For the dataset streams Split MNIST and Split CIFAR-100, we consider multi-headed neural networks. We only store the feature maps with respect to the shared weights, the projection step is not performed for the heads' weights.
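A minimal sketch of the balanced memory update described above (the class and method names are ours): each task owns its own block of slots, and re-storing a task overwrites only that block, so no task's share of the memory grows at another task's expense.

```python
import numpy as np

class BalancedMemory:
    """Per-task feature-map storage kept balanced across tasks."""

    def __init__(self):
        self.slots = {}                      # task id -> (k, p) feature maps

    def update(self, task_id, feats):
        # New feature maps for a task replace that task's old slots only.
        self.slots[task_id] = np.asarray(feats)

    def basis(self):
        # Orthonormal basis of the span of all stored feature maps,
        # used for the orthogonal projection of the gradients.
        stacked = np.concatenate(list(self.slots.values()), axis=0)
        q, _ = np.linalg.qr(stacked.T)
        return q

mem = BalancedMemory()
mem.update(0, np.eye(4)[:2])                 # task 0 stores two feature maps
mem.update(1, np.eye(4)[2:3])                # task 1 stores one
mem.update(0, np.eye(4)[1:3])                # task 0 revisited: its slots replaced
```

For multi-head models, only feature maps with respect to the shared weights would be stored, as described above.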

F.1.4 HYPERPARAMETERS

We use the same hyperparameters as Farajtabar et al. (2019) for the algorithms SGD, OGD and OGD+ on the MNIST benchmarks. We also keep a small learning rate, in order to preserve the locality assumption of OGD and to verify the conditions of the theorems. We report the shared hyperparameters across the benchmarks and algorithms in Table 6. For the other benchmarks and algorithms, we report the grid search space in Sec. F.1.5.

F.1.5 HYPERPARAMETER SEARCH

We present our hyperparameter search ranges for each Continual Learning method. We selected the hyperparameter sets that maximise the average accuracy. The results and scripts of the hyperparameter search, as well as the best hyperparameters, are provided in the corresponding GitHub repository.

Thm. 2 states that in the NTK regime, given an infinite memory, the train error of OGD is unchanged.

Experiments  We track the variation of the train accuracy of the samples in the memory through tasks, after training on all subsequent tasks. We consider the MLP hidden layer size as a proxy for overparameterization. We run the experiments on the Permuted MNIST and Rotated MNIST benchmarks with the OGD algorithm, varying the hidden size from 100 to 400.

Results  Figure 3 shows that the variation of the train accuracy of OGD decreases uniformly with the hidden size. This result indicates that the forgetting of OGD decreases with overparameterization.

Figure 3: The variation of the train accuracy on the memorised samples from each task, after the model was trained on all tasks in sequence (higher is better). We vary the hidden size as a proxy for overparameterization.

F.2.2 THE IMPORTANCE OF THE MEMORY SIZE ASSUMPTION

Thm. 2 states that in the NTK regime, given an infinite memory, the train error of OGD is unchanged.

Experiments  We track the variation of the train accuracy of the samples in the memory through tasks, after training on all subsequent tasks, as a function of the memory size per task.

Results  Figure 4 shows that the mean train accuracy variation decreases uniformly with the memory size.

Figure 4: The variation of the train accuracy on the memorised samples from the first task, after the model was trained on all tasks in sequence (higher is better). We vary the memory size per task from 100 to 300.

F.3 THE IMPORTANCE OF THE NTK VARIATION FOR CONTINUAL LEARNING (SEC. 6)

In this section, we present extended results on the importance of the NTK variation for Continual Learning. For completeness, we present the variation of the test accuracy of the model at the end of training for all tasks. Also, we run the benchmarks on the A-GEM algorithm as supporting material for the discussion in App. A. Stable SGD (2020) was shown to outperform OGD on all benchmarks; in order for the evaluation to be as fair and comprehensive as possible, we also include this method in our benchmarks.

F.5.1 COMPLEMENTARY RESULTS -CONTINUAL LEARNING METRICS

Permuted MNIST  Table 9 shows that OGD+ outperforms the baselines on AAC, while it remains competitive on AFM and BWT.

CUB200  Table 12 shows that OGD+ is not competitive on the CUB200 benchmark on any metric. Also, it draws the same performance as OGD on all metrics; one probable reason is the overparameterization of the setting, as shown in Table 1.



Denoting by a_{T,τ} the accuracy of the model on the task T_τ after being trained on the task T_T, we track two metrics introduced by Chaudhry et al. (2019) and Lopez-Paz & Ranzato (2017):
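Assuming the two metrics are the Average Accuracy and Backward Transfer of Lopez-Paz & Ranzato (2017) (a standard reading of the a_{T,τ} notation; the helper names are ours), they can be computed from the accuracy matrix as follows:

```python
import numpy as np

def average_accuracy(acc):
    # acc[T, tau]: accuracy on task tau after training on task T (lower triangle)
    T = acc.shape[0] - 1
    return acc[T, :T + 1].mean()             # mean final accuracy over all tasks

def backward_transfer(acc):
    # mean change on past tasks between "just trained" and "end of stream";
    # negative values indicate Catastrophic Forgetting
    T = acc.shape[0] - 1
    return float(np.mean([acc[T, t] - acc[t, t] for t in range(T)]))

acc = np.array([[0.90, 0.00, 0.00],
                [0.80, 0.95, 0.00],
                [0.70, 0.90, 0.97]])
```

A positive backward transfer would mean that later training improved past tasks, which is the property A-GEM permits by design.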

7.2 ABLATION STUDY : FOR NON OVER-PARAMETERIZED BENCHMARKS, UPDATING THE JACOBIAN IS CRITICAL TO OGD'S ROBUSTNESS (THM. 2)

Now we apply to each function f_τ the upper bound from Lemma 22 of Bartlett & Mendelson (2003).

F.2 EXPERIMENTS: MEMORISATION PROPERTY OF OGD (THM. 2)

F.2.1 THE IMPORTANCE OF THE OVERPARAMETERIZATION ASSUMPTION

Figure 5: Test accuracy on the first 10 tasks of Rotated MNIST, for SGD, OGD, OGD+ and A-GEM. The y-axis is truncated for clarity. We report the mean and standard deviation over 5 independent runs. The test error is measured every 250 mini-batches.

Figure 6: Test accuracy through tasks for different Continual Learning methods on the MNIST and CIFAR100 and CUB200 benchmarks. The y-axis is truncated for clarity. We report the mean and standard deviation over 5 independent runs. The test error is measured at the end of each task.



The average accuracy of several methods on the MNIST, CIFAR and CUB200 datasets.

Overview of the properties of the SGD, OGD, OGD+, A-GEM and A-GEM-NT algorithms.

Permuted MNIST : The test accuracy of models from the indicated task after being trained on all tasks in sequence. The best Continual Learning results are highlighted in bold.

Rotated MNIST : The test accuracy of models from the indicated task after being trained on all tasks in sequence. The best Continual Learning results are highlighted in bold.

Stable SGD | 78.17 (±0.76) | −11.63 (±1.03) | 1.02 (±0.15) | −11.63 (±1.03)

Comparison of the average accuracy, average forgetting, forward transfer and backward transfer of several methods on the Permuted MNIST benchmark.

Rotated MNIST  Table 10 shows that OGD+ is competitive with the baselines on AAC and AFM, while it underperforms on the other metrics. Also, it draws an improvement over OGD on AAC, BWT and AFM. One probable reason is the relatively low overparameterization of the setting, as shown in Table 1.

Comparison of the average accuracy, average forgetting, forward transfer and backward transfer of several methods on the Rotated MNIST benchmark.

Split CIFAR100  Table 11 shows that OGD+ is not competitive on Split CIFAR100 on any metric. Also, it draws the same performance as OGD on all metrics; one probable reason is the overparameterization of the setting, as shown in Table 1.

Comparison of the average accuracy, average forgetting, forward transfer and backward transfer of several methods on the Split CIFAR100 benchmark.

Comparison of the average accuracy, average forgetting, forward transfer and backward transfer of several methods on the CUB200 benchmark.

B PROOFS OVERVIEW

B.1 CONVERGENCE

Theorem 1: Convergence of SGD and OGD for Continual Learning  We prove Theorem 1 by induction. We rewrite the loss function as a regression on the residual ỹ_τ instead of y_τ. Then, we rewrite the optimisation objective as an unconstrained strongly convex optimisation problem. Finally, we compute the unique solution in closed form. The full proof is presented in App. C.1.

Remark 1: Continual Learning as a recursive Kernel Regression  The remark follows directly from the recursive form of Theorem 1.

Corollary 1: Distance from initialisation through tasks  The proof follows immediately from the proof of Theorem 1, in which we compute a closed form of the weights variation. It is presented in App. C.2.

B.2 GENERALISATION

Theorem 2: No-forgetting Continual Learning with OGD  The proof relies on the orthogonality property of the OGD update, which implies that the subspaces spanned by the samples in the memory do not incur any changes. The proof is presented in App. D.1.

Theorem 3: Generalisation of SGD and OGD for Continual Learning  The proof of Theorem 3 is presented in App. D.2. It follows this structure:
• Bounding the Rademacher complexity (App. D.2.3): First, we state the technical Lemma 3, which upper bounds the Rademacher complexity of the function class corresponding to the set of linear combinations of kernel regressors. Then, we apply the lemma to the function class obtained through Theorem 1.
• Bounding the empirical loss for SGD (App. D.2.5): the error is the same as OGD on the last task, while on the previous tasks Catastrophic Forgetting can be incurred, which implies the appearance of a residual term corresponding to the forgetting.
• Bounding the empirical loss for OGD (App. D.2.4): The proof techniques are very similar to the SGD case; however, we leverage Theorem 2 in order to derive tighter bounds, owing to the absence of Catastrophic Forgetting.
• Wrap-up (App. D.2.6): Finally, we combine all the lemmas to derive the bound of Theorem 3.

Lemma 1 - Implications for Curriculum Learning  This result follows from the technical Lemma 3, which upper bounds the Rademacher complexity of the function class corresponding to the set of linear combinations of kernel regressors.

B.3 THE IMPORTANCE OF THE NTK VARIATION FOR CONTINUAL LEARNING

Proposition 1 - OGD implies A-GEM-NT  The proof relies mainly on Theorem 2, stating the robustness of OGD to Catastrophic Forgetting. We derive the A-GEM update, then apply Theorem 2, which leads to the implication. The full proof is presented in App. E.1.

D MISSING PROOFS OF SECTION 5 - GENERALISATION

D.1 PROOF OF THEOREM 2

The intuition behind the proof is the following: since the gradient updates are performed orthogonally to the feature maps of the training data of the source task, the parameters within the space spanned by these feature maps are unchanged, while the remaining space, which does change, is orthogonal to these feature maps. Therefore, the inference on the source task is unchanged, and the training error remains the same as at the end of training on the source task.

Proof

In the proof of Theorem 1, App. C.1, we showed that, for T_τ a fixed task:

f_τ(x) = f_{τ−1}(x) + κ_τ(x, X_τ)(κ_τ(X_τ, X_τ) + λI)^{-1} ỹ_τ

We rewrite the recursive relation into a sum:

f_T(x) = f_0(x) + Σ_{k=1}^T f̃_k(x)

We observe that, for all k ∈ [T]:

κ_k(x, X_k) = φ_{k−1}(x)^T T_k T_k^T φ_{k−1}(X_k)

On the other hand, for OGD+, given a sample x from D_τ, for all k ∈ [τ+1, T], the feature map of x belongs to the memorised span, so T_k^T φ_{k−1}(x) = 0 and therefore:

f̃_k(x) = 0

Therefore:

f_T(X_τ) = f_τ(X_τ)

We conclude.

D.2 PROOF OF THEOREM 3

Reminder on RKHS norm

Let κ be a kernel, and let H be the reproducing kernel Hilbert space (RKHS) corresponding to κ. Recall that the RKHS norm of a function f(x) = α^T κ(x, X) is:

‖f‖²_H = α^T κ(X, X) α

Reminder on Generalisation and Rademacher Complexity  Consider a loss function l : R × R → R. The population loss over the distribution D, and the empirical loss over n samples S = {(x_i, y_i), i ∈ [n]} from the same distribution D, are defined as:

L_D(f) = E_{(x,y)∼D}[l(f(x), y)]
L_S(f) = (1/n) Σ_{i=1}^n l(f(x_i), y_i)

Theorem 4  Suppose the loss function is bounded in [0, c] and is ρ-Lipschitz in its first argument. Then, with probability at least 1 − δ over a sample S of size n:

sup_{f ∈ F} {L_D(f) − L_S(f)} ≤ 2ρ R(F) + 3c √( log(2/δ) / (2n) )
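The RKHS norm above is simply a quadratic form in the kernel Gram matrix. This small numeric sketch (our own) checks that α^T κ(X, X) α agrees with the explicit double sum Σ_{i,j} α_i α_j κ(x_i, x_j).

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 10, 4
X = rng.normal(size=(n, d))
alpha = rng.normal(size=n)

K = X @ X.T                          # Gram matrix of a linear kernel
rkhs_sq = alpha @ K @ alpha          # ||f||_H^2 for f(x) = alpha^T kappa(x, X)

# same quantity written out as the explicit double sum
double_sum = sum(alpha[i] * alpha[j] * (X[i] @ X[j])
                 for i in range(n) for j in range(n))
```

Positive semi-definiteness of the Gram matrix is what makes the norm well defined (non-negative) for any coefficient vector α.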

MNIST Benchmarks

• EWC, SI and MAS: we fixed the seed to 0, then performed a grid search over the regularisation parameter in [0.1, 1, 10, 100, 1000, 10000].
• SGD, OGD and OGD+: for the MNIST benchmarks, we used the same hyperparameters as Farajtabar et al. (2019), while for the CIFAR100 and CUB200 benchmarks we ran the following grid search: learning rate: [0.00001, 0.001, 0.01, 0.1]; batch size: [32, 64, 256]; epochs: [1, 20, 50].
• Stable SGD: we fixed the seed to 0, then performed a grid search over all combinations of: gamma: [0.5, 0.6, 0.7, 0.8, 0.9]; learning rate: [0.001, 0.01, 0.1]; batch size: [10, 64]; dropout: [0, 0.1, 0.2, 0.3, 0.4, 0.5]; epochs: [1, 10, 50].

