CHALLENGING COMMON ASSUMPTIONS ABOUT CATASTROPHIC FORGETTING

Abstract

Standard gradient descent algorithms applied to sequences of tasks are known to induce catastrophic forgetting in deep neural networks. When trained on a new task, the model's parameters are updated in a way that degrades performance on past tasks. This article explores continual learning (CL) on long sequences of tasks sampled from a finite environment. We show that in this setting, learning with stochastic gradient descent (SGD) results in knowledge retention and accumulation without specific memorization mechanisms. This is in contrast to the prevailing notion of forgetting in the CL literature, which holds that training on a new task with such an approach results in forgetting previous tasks, especially in class-incremental settings. To study this phenomenon, we propose an experimental framework, SCoLe (Scaling Continual Learning), which allows generating arbitrarily long task sequences. Our experiments show that previous results obtained on relatively short task sequences may not reveal certain phenomena that emerge in longer ones.

1. INTRODUCTION

Continual learning (CL) aims to design algorithms that learn from non-stationary sequences of tasks. Classically, the main challenge of CL is catastrophic forgetting (CF), i.e., fast performance degradation on previous tasks when learning from new data. CF is usually evaluated on scenarios with sequences of disjoint tasks (Lesort et al., 2020; Lange et al., 2019; Belouadah et al., 2021; Hadsell et al., 2020). In these scenarios, fine-tuning a model with a plain empirical risk minimization objective and stochastic gradient descent (SGD) results in CF (Rebuffi et al., 2017; Lesort et al., 2019a; Kang et al., 2022). In this work, we empirically show that deep neural networks (DNNs) may consistently learn more than they forget when trained with SGD (Fig. 1). We investigate how DNNs trained continually for single-head classification on long sequences of tasks with data recurrence forget and accumulate knowledge. To this end, we propose SCoLe (Scaling Continual Learning), an evaluation framework for CL algorithms that generates task sequences of arbitrary length. As visualized in Fig. 2, SCoLe creates each new task from a randomly selected subset of all classes. A model is trained for some epochs on data from these classes until the task switches. The SCoLe framework creates tasks online by sampling an existing source dataset such as CIFAR (Krizhevsky et al.). By modifying the sampling strategy of the source dataset, we can control data recurrence frequencies, the complexity of the scenario, and the type of distribution shift (random, long-term, cyclic, etc.). As in classical continual learning scenarios, SCoLe scenarios have regular distribution shifts that should lead to CF. However, in SCoLe, tasks and data can sparsely reoccur. This recurrence makes it possible, when training with naive SGD, to study whether knowledge can accumulate through time.
The idea is the following: if the model forgets progressively rather than catastrophically, some knowledge can be retained until the next occurrence of the data, leading to potential knowledge accumulation. We study the impact of such recurrence in a series of SCoLe experiments and confirm that DNNs trained with only gradient descent accumulate knowledge without any supplementary CL mechanism.

2. SCoLe: A FRAMEWORK FOR CL WITH LONG TASK SEQUENCES

General Idea. We propose a framework that allows the creation of an arbitrarily long sequence of tasks. The setting is based on the finite-world assumption (Mundt et al., 2020), which hypothesizes that the world has a finite set of states and that, in a finite period of time, the agent will see all of them. Later data will therefore necessarily be repetitions of previous ones. As in classical CL, each task in SCoLe comprises a subset of the world's data, while the evaluation is done on the whole world. In such a world, a learning system must accumulate knowledge about the world by experiencing parts of it in isolation. The difference from classical CL scenarios is that data sparsely reoccur. An agent can succeed in this setup only by accumulating knowledge faster than forgetting it. Measuring performance on the whole world allows us to estimate whether, overall, the agent accumulates knowledge faster than it forgets. SCoLe can thus reveal learning dynamics of DNNs under distribution shift that are not observable on short sequences of non-overlapping tasks, as we show next.

Framework. We instantiate this idea in a classification setting (Fig. 2). For each task, a subset of classes is randomly selected from the total of N available classes (N is dataset dependent). The agent is a DNN that learns to classify on this subset only and is tested on the full test set with all classes. The framework considers scenarios with varying numbers of tasks T and classes per task C.
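To make the framework concrete, here is a minimal sketch of a SCoLe task stream in Python. The function name scole_task_stream and its parameters are our own illustrative choices, not part of the paper's released code; each task is simply a random subset of C class labels drawn from the N available classes.

```python
import random

def scole_task_stream(num_classes, classes_per_task, num_tasks, seed=0):
    """Yield SCoLe tasks: each task is a set of `classes_per_task` labels
    drawn uniformly without replacement from `num_classes` classes,
    so that classes sparsely reoccur along the sequence."""
    rng = random.Random(seed)
    for _ in range(num_tasks):
        yield sorted(rng.sample(range(num_classes), classes_per_task))

# A long stream over N = 10 classes with C = 2 classes per task.
tasks = list(scole_task_stream(num_classes=10, classes_per_task=2, num_tasks=500))
```

In a sufficiently long stream, every class reoccurs many times; this recurrence is precisely what allows knowledge to accumulate between occurrences even without an explicit memorization mechanism.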
Formally, the training set D_t for a task t consists of (x, y) pairs sampled from the distribution p(X, Y | S_t) = p(X | Y) p(Y | S_t), where S_t = {c_i}_{i=0}^{C-1} is the set of classes in task t. In the default SCoLe scenario, the elements of S_t are sampled without replacement from the uniform distribution over all N classes, c_i ~ U(0, N-1). We also consider cases where p(S_t; C) is non-uniform (Sec. 4.2) or evolves over time (Sec. 5). The label y is sampled uniformly over S_t, and x is obtained as x ~ p(X | Y = y). The test set D_test contains all classes in the scenario; following the data generation process, it is generated with C = N.
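The per-task sampling above can be sketched as follows. The function sample_task_dataset and the data_by_class mapping are hypothetical names introduced here for illustration; a per-class pool of examples stands in for draws from p(X | Y = y).

```python
import random

def sample_task_dataset(task_classes, data_by_class, n_samples, seed=0):
    """Build D_t for one task: y is sampled uniformly over S_t, then x is
    drawn from that class's pool, mimicking x ~ p(X | Y = y)."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_samples):
        y = rng.choice(task_classes)       # y ~ Uniform(S_t)
        x = rng.choice(data_by_class[y])   # x ~ p(X | Y = y)
        dataset.append((x, y))
    return dataset

# Toy example with 5 classes in total and S_t = {1, 3}.
data_by_class = {c: [f"img_{c}_{i}" for i in range(4)] for c in range(5)}
D_t = sample_task_dataset([1, 3], data_by_class, n_samples=20)
```

Setting task_classes to all N classes reproduces how the test set D_test is generated with C = N.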



Figure 2: Illustration of a SCoLe scenario with 5 classes in total (one per color) and 2 classes per task. Data are selected randomly based on their labels to build the scenario dynamically into a potentially infinite sequence. The evaluation is performed on the test set containing all possible classes.

Our contributions are as follows: (1) We propose an experimentation framework, SCoLe, with a potentially infinitely long sequence of tasks. SCoLe scenarios are built to study knowledge retention and accumulation in DNNs under non-stationary training distributions. (2) We show that in such scenarios, standard SGD retains and accumulates knowledge without any CL algorithm, i.e., without a supplementary memorization mechanism. This result is counterintuitive given the well-known phenomenon of catastrophic forgetting in DNNs. (3) We study the capabilities and limitations of such training on scenarios based on a variety of datasets (MNIST, Fashion-MNIST, KMNIST, CIFAR10, CIFAR100, miniImageNet).

