CHALLENGING COMMON ASSUMPTIONS ABOUT CATASTROPHIC FORGETTING

Abstract

Standard gradient descent algorithms applied to sequences of tasks are known to induce catastrophic forgetting in deep neural networks: when trained on a new task, the model's parameters are updated in a way that degrades performance on past tasks. This article explores continual learning (CL) on long sequences of tasks sampled from a finite environment. We show that in this setting, learning with stochastic gradient descent (SGD) results in knowledge retention and accumulation without specific memorization mechanisms. This contrasts with the prevailing notion of forgetting in the CL literature, which holds that training on a new task with such an approach causes forgetting of previous tasks, especially in class-incremental settings. To study this phenomenon, we propose an experimental framework, SCoLe (Scaling Continual Learning), which allows generating arbitrarily long task sequences. Our experiments show that results obtained on relatively short task sequences may not reveal certain phenomena that emerge in longer ones.

1. INTRODUCTION

Continual learning (CL) aims to design algorithms that learn from non-stationary sequences of tasks. Classically, the main challenge of CL is catastrophic forgetting (CF): fast performance degradation on previous tasks when learning from new data. CF is usually evaluated on scenarios with sequences of disjoint tasks (Lesort et al., 2020; Lange et al., 2019; Belouadah et al., 2021; Hadsell et al., 2020). In these scenarios, fine-tuning a model with a plain empirical risk minimization objective and stochastic gradient descent (SGD) results in CF (Rebuffi et al., 2017; Lesort et al., 2019a; Kang et al., 2022).

The main motivation for this work is to step back from classical continual learning and investigate whether fine-tuning with gradient descent alone truly leads to catastrophic forgetting, or whether it can instead result in knowledge accumulation and decaying forgetting. For example, Evron et al. (2022) showed theoretically that knowledge accumulation exists for linear regression trained with SGD, with CF decreasing uniformly when tasks reoccur randomly or cyclically. This indicates that CF might not be as catastrophic as initially assumed. Perhaps the problem is that CF is often studied in a setup where it is particularly severe: short task sequences with non-reoccurring data.

In this work, we empirically show that deep neural networks (DNNs) may consistently learn more than they forget when trained with SGD (Fig. 1). We investigate how DNNs trained continually for single-head classification on long sequences of tasks with data reoccurrence forget and accumulate knowledge. To this end, we propose SCoLe (Scaling Continual Learning), an evaluation framework for CL algorithms that generates task sequences of arbitrary length. As visualized in Fig. 2, SCoLe creates each new task from a randomly selected subset of all classes. A model is trained for a few epochs on data from these classes until the task switches. The SCoLe framework creates tasks online.
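The task-generation scheme described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the function name and parameters are hypothetical, and each "task" is represented simply as the set of class labels sampled for it. Because classes are drawn repeatedly from a finite pool, every class reoccurs across a long enough sequence, which is the property the framework relies on.

```python
import random

def scole_task_stream(num_classes, classes_per_task, num_tasks, seed=0):
    """Yield a SCoLe-style sequence of tasks.

    Each task is a random subset of `classes_per_task` distinct classes
    drawn from a finite pool of `num_classes` classes; classes therefore
    reoccur across the (arbitrarily long) sequence.
    """
    rng = random.Random(seed)
    for _ in range(num_tasks):
        yield sorted(rng.sample(range(num_classes), classes_per_task))

# Example: an MNIST-like setting with 10 classes and 2 classes per task.
# In a full training loop, the model would be fine-tuned with SGD for a
# few epochs on data from each task's classes before the task switches.
tasks = list(scole_task_stream(num_classes=10, classes_per_task=2,
                               num_tasks=1000))
```

With 1000 tasks over only 10 classes, every class appears many times, in contrast to standard CL benchmarks where each class is seen in exactly one task.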



Figure 1: With SGD, knowledge accumulation is not observable on standard CL benchmarks (top). However, when the sequence of tasks is repeated (bottom), knowledge accumulation is apparent and accuracy rises (MNIST, 2 classes per task, averaged over 3 learning rates and 5 seeds; each task trained until convergence).

