GENERALISATION GUARANTEES FOR CONTINUAL LEARNING WITH ORTHOGONAL GRADIENT DESCENT

Abstract

In Continual Learning settings, deep neural networks are prone to Catastrophic Forgetting. Orthogonal Gradient Descent (OGD) was proposed to tackle this challenge; however, no theoretical guarantees have been proven so far. We present a theoretical framework to study Continual Learning algorithms in the Neural Tangent Kernel regime. This framework comprises a closed-form expression of the model through tasks, as well as proxies for Transfer Learning, generalisation and task similarity. In this framework, we prove that OGD is robust to Catastrophic Forgetting, then derive the first generalisation bound for SGD and OGD in Continual Learning. Finally, we study the limits of this framework in practice for OGD and highlight the importance of the variation of the Neural Tangent Kernel for Continual Learning with OGD.

1. INTRODUCTION

Continual Learning is a setting in which an agent is exposed to multiple tasks sequentially (Kirkpatrick et al., 2016). The core challenge lies in the ability of the agent to learn new tasks while retaining the knowledge acquired from previous tasks. Too much plasticity (Nguyen et al., 2018) leads to catastrophic forgetting, that is, the degradation of the agent's ability to perform past tasks (McCloskey & Cohen 1989, Ratcliff 1990, Goodfellow et al. 2014). On the other hand, too much stability hinders the agent from adapting to new tasks. While there is a large literature on Continual Learning (Parisi et al., 2019), few works have addressed the problem from a theoretical perspective. Recently, Jacot et al. (2018) established the connection between overparameterized neural networks and kernel methods by introducing the Neural Tangent Kernel (NTK). They showed that in the infinite width limit, the kernel remains constant throughout training. Lee et al. (2019) also showed that, in the infinite width limit, or NTK regime, a network evolves as a linear model when trained on certain losses under gradient descent. In addition to these findings, recent works on the convergence of Stochastic Gradient Descent for overparameterized neural networks (Arora et al., 2019) have unlocked multiple mathematical tools for studying the training dynamics of over-parameterized neural networks. We leverage these theoretical findings to propose a theoretical framework for Continual Learning in the NTK regime, then prove convergence and generalisation properties for the Orthogonal Gradient Descent (OGD) algorithm for Continual Learning (Farajtabar et al., 2019). Our contributions are summarized as follows: 1. We present a theoretical framework to study Continual Learning algorithms in the Neural Tangent Kernel (NTK) regime.
This framework frames Continual Learning as a recursive kernel regression and comprises proxies for Transfer Learning, generalisation, task similarity and Curriculum Learning (Thm. 1, Lem. 1 and Thm. 3). 2. In this framework, we prove that OGD is robust to forgetting with respect to an arbitrary number of tasks under an infinite memory assumption (Sec. 5, Thm. 2). 3. We prove the first generalisation bound for Continual Learning with SGD and OGD. We find that generalisation through tasks depends on a task similarity measure with respect to the NTK (Sec. 5, Thm. 3). 4. We study the limits of this framework in practical settings, in which the Neural Tangent Kernel may vary. We find that the variation of the Neural Tangent Kernel negatively impacts the robustness of OGD to Catastrophic Forgetting in non-overparameterized benchmarks (Sec. 6).

2. RELATED WORKS

Continual Learning addresses the Catastrophic Forgetting problem, which refers to the tendency of agents to "forget" the tasks they were previously trained on over the course of training. It is an active area of research, and several heuristics have been developed to characterise the phenomenon (Ans & Rousset 1997, Ans & Rousset 2000, Goodfellow et al. 2014, French 1999, McCloskey & Cohen 1989, Robins 1995, Nguyen et al. 2019). Approaches to Continual Learning can be categorised into regularisation methods, memory-based methods and dynamic architectural methods. We refer the reader to the survey by Parisi et al. (2019) for an extensive overview of existing methods. Memory-based methods store data from previous tasks in a buffer of fixed size, which can then be reused during training on the current task (Chaudhry et al. 2019, Van de Ven & Tolias 2018). Dynamic architectural methods rely on growing architectures which keep past knowledge fixed and store new knowledge in new components, such as new nodes or layers (Lee et al. 2018, Schwarz et al. 2018). Finally, regularisation methods regularise the objective in order to preserve the knowledge acquired from previous tasks (Kirkpatrick et al. 2016, Aljundi et al. 2018, Farajtabar et al. 2019, Zenke et al. 2017).

While there is a large literature in the field, the number of theoretical works on Continual Learning is limited. Alquier et al. (2017) define a compound regret for lifelong learning, as the regret with respect to the oracle who would have known the best common representation for all tasks in advance. Knoblauch et al. (2020) show that optimal Continual Learning algorithms generally solve an NP-HARD problem and require perfect memory in order not to suffer from catastrophic forgetting. Benzing (2020) presents mathematical and empirical evidence that two methods, Synaptic Intelligence and Memory Aware Synapses, approximate a rescaled version of the Fisher Information. Continual Learning is not limited to Catastrophic Forgetting; it is also closely related to Transfer Learning. A desirable property of a Continual Learning algorithm is to enable the agent to carry the acquired knowledge through its lifetime, and to transfer it to solve new tasks. A theoretical study of this phenomenon was presented by Liu et al. (2019), who prove how task similarity contributes to generalisation when training with Stochastic Gradient Descent, in a two-task setting and for over-parameterised two-layer ReLU neural networks. The recent findings on the Neural Tangent Kernel (Jacot et al., 2018) and on the properties of overparameterized neural networks (Du et al. 2018, Arora et al. 2019) provide powerful tools to analyse their training dynamics. We build on these advances to construct a theoretical framework for Continual Learning and study the generalisation properties of Orthogonal Gradient Descent.

3. PRELIMINARIES

Notation. We use bold-faced characters for vectors and matrices. We use ‖·‖ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and ‖·‖_F to denote the Frobenius norm of a matrix. We use ⟨·, ·⟩ for the Euclidean dot product, and ⟨·, ·⟩_H for the dot product in the Hilbert space H. We index the task ID by τ. The ≤ operator, when used with matrices, corresponds to the partial ordering over symmetric matrices. We denote by N the set of natural numbers, by R the set of real numbers, and by N* the set N \ {0}. We use ⊕ to refer to the direct sum over Euclidean spaces.

3.1. CONTINUAL LEARNING

Continual Learning considers a series of tasks {T 1 , T 2 , . . .}, where each task can be viewed as a separate supervised learning problem. Similarly to online learning, the data from each task is revealed only once. The goal of Continual Learning is to model each task accurately with a single model. The challenge is to achieve good performance on new tasks, while retaining knowledge from previous tasks (Nguyen et al., 2018). We assume the data from each task T τ , τ ∈ N*, is drawn from a distribution D τ . Individual samples are denoted (x τ,i , y τ,i ), where i ∈ [n τ ]. For a given task T τ , the model is denoted f τ , we use the


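Since the analysis above centres on Orthogonal Gradient Descent (Farajtabar et al., 2019), its update rule can be sketched as follows: gradient directions computed on previous tasks are stored (and orthonormalised), and every new gradient step is projected onto the orthogonal complement of their span before being applied. The following is a minimal NumPy sketch under these assumptions, not the authors' implementation; function names and toy dimensions are illustrative.

```python
import numpy as np

def orthonormalise(vec, basis, eps=1e-10):
    """Gram-Schmidt step: strip from `vec` its components along `basis`
    (a list of orthonormal vectors) and return the residual, normalised.
    Returns None if `vec` already lies in the span of `basis`."""
    v = vec.astype(float).copy()
    for b in basis:
        v -= np.dot(v, b) * b
    n = np.linalg.norm(v)
    return v / n if n > eps else None

def ogd_project(grad, basis):
    """OGD update direction: the component of `grad` orthogonal to the
    span of the stored directions in `basis`."""
    g = grad.astype(float).copy()
    for b in basis:
        g -= np.dot(g, b) * b
    return g

# Toy illustration: after finishing a task, store orthonormalised gradient
# directions from that task; while training on the next task, project each
# new gradient before taking the step.
rng = np.random.default_rng(0)
basis = []
for _ in range(3):  # stand-ins for gradients on past-task samples
    b = orthonormalise(rng.normal(size=8), basis)
    if b is not None:
        basis.append(b)

new_grad = rng.normal(size=8)
update = ogd_project(new_grad, basis)
# The projected update no longer moves the model along stored directions,
# i.e. its dot product with every basis vector is ~0 up to float error.
```

In the infinite-memory setting analysed in this paper, `basis` would contain directions from all previous tasks; the practical algorithm keeps a fixed-size buffer of stored directions.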