GENERALISATION GUARANTEES FOR CONTINUAL LEARNING WITH ORTHOGONAL GRADIENT DESCENT

Abstract

In Continual Learning settings, deep neural networks are prone to Catastrophic Forgetting. Orthogonal Gradient Descent (OGD) was proposed to tackle this challenge; however, no theoretical guarantees have been proven for it yet. We present a theoretical framework for studying Continual Learning algorithms in the Neural Tangent Kernel (NTK) regime. The framework comprises closed-form expressions for the model across tasks, along with proxies for Transfer Learning, generalisation and task similarity. Within this framework, we prove that OGD is robust to Catastrophic Forgetting, and we derive the first generalisation bounds for SGD and OGD in Continual Learning. Finally, we study the limits of this framework in practice for OGD and highlight the importance of the variation of the Neural Tangent Kernel for Continual Learning with OGD.

1. INTRODUCTION

Continual Learning is a setting in which an agent is exposed to multiple tasks sequentially (Kirkpatrick et al., 2016). The core challenge lies in the ability of the agent to learn new tasks while retaining the knowledge acquired from previous ones. Too much plasticity (Nguyen et al., 2018) leads to catastrophic forgetting, i.e. the degradation of the agent's ability to perform past tasks (McCloskey & Cohen, 1989; Ratcliff, 1990; Goodfellow et al., 2014). On the other hand, too much stability hinders the agent from adapting to new tasks. While there is a large literature on Continual Learning (Parisi et al., 2019), few works have addressed the problem from a theoretical perspective. Recently, Jacot et al. (2018) established a connection between overparameterised neural networks and kernel methods by introducing the Neural Tangent Kernel (NTK), showing that in the infinite-width limit the kernel remains constant throughout training. Lee et al. (2019) further showed that, in this infinite-width or NTK regime, a network trained with gradient descent on certain losses evolves as a linear model. Together with recent results on the convergence of Stochastic Gradient Descent for overparameterised neural networks (Arora et al., 2019), these findings provide mathematical tools to study the training dynamics of such networks. We leverage them to propose a theoretical framework for Continual Learning in the NTK regime, and to prove convergence and generalisation properties for the Orthogonal Gradient Descent (OGD) algorithm for Continual Learning (Farajtabar et al., 2019). Our contributions are summarized as follows:

1. We present a theoretical framework to study Continual Learning algorithms in the Neural Tangent Kernel (NTK) regime. This framework frames Continual Learning as a recursive kernel regression and comprises proxies for Transfer Learning, generalisation, task similarity and Curriculum Learning (Thm. 1, Lem. 1 and Thm. 3).

2. Within this framework, we prove that OGD is robust to forgetting with respect to an arbitrary number of tasks under an infinite memory (Sec. 5, Thm. 2).

3. We prove the first generalisation bound for Continual Learning with SGD and OGD. We find that generalisation through tasks depends on a notion of task similarity with respect to the NTK (Sec. 5, Thm. 3).

4. We study the limits of this framework in practical settings, in which the Neural Tangent Kernel may vary. We find that the variation of the Neural Tangent Kernel negatively impacts the robustness of OGD to Catastrophic Forgetting.
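To make the object of study concrete, the core update of OGD (Farajtabar et al., 2019) can be sketched as follows: gradients of the model's predictions on past tasks are stored, orthonormalised, and every new loss gradient is projected onto the orthogonal complement of their span before the weight update. This is a minimal illustrative sketch on raw numpy vectors, not the authors' implementation; the function names (`orthonormalize`, `ogd_step`) and the toy dimensions are our own.

```python
import numpy as np

def orthonormalize(vectors, eps=1e-10):
    """Gram-Schmidt orthonormalisation of stored past-task gradient vectors."""
    basis = []
    for v in vectors:
        v = v.astype(float).copy()
        for b in basis:
            v = v - (v @ b) * b  # remove the component along each basis vector
        norm = np.linalg.norm(v)
        if norm > eps:           # skip vectors already in the span
            basis.append(v / norm)
    return basis

def ogd_step(w, grad, basis, lr=0.1):
    """One OGD update: project the loss gradient orthogonally to the basis,
    so the step does not move predictions along past-task gradient directions."""
    g = grad.astype(float).copy()
    for b in basis:
        g = g - (g @ b) * b
    return w - lr * g

# Toy usage in 2D: the stored past-task gradient spans e1, so an update
# along e1 + e2 only moves the weights along e2.
basis = orthonormalize([np.array([1.0, 0.0])])
w = np.zeros(2)
w_new = ogd_step(w, np.array([1.0, 1.0]), basis, lr=1.0)
```

In the NTK regime the model is (approximately) linear in its weights, so prediction gradients on past tasks are constant through training and this projection exactly preserves past-task outputs, which is the mechanism behind the robustness result of Thm. 2.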

