CONTINUAL EVALUATION FOR LIFELONG LEARNING: IDENTIFYING THE STABILITY GAP

Abstract

Time-dependent data-generating distributions have proven difficult for gradient-based training of neural networks, as the greedy updates result in catastrophic forgetting of previously learned knowledge. Despite the progress in the field of continual learning to overcome this forgetting, we show that a set of common state-of-the-art methods still suffers from substantial forgetting upon starting to learn new tasks, except that this forgetting is temporary and followed by a phase of performance recovery. We refer to this intriguing but potentially problematic phenomenon as the stability gap. The stability gap had likely remained under the radar due to the standard practice in the field of evaluating continual learning models only after each task. Instead, we establish a framework for continual evaluation that uses per-iteration evaluation, and we define a new set of metrics to quantify worst-case performance. Empirically, we show that experience replay, constraint-based replay, knowledge distillation, and parameter regularization methods are all prone to the stability gap, and that the stability gap can be observed in class-, task-, and domain-incremental learning benchmarks. Additionally, a controlled experiment shows that the stability gap increases when tasks are more dissimilar. Finally, by disentangling gradients into plasticity and stability components, we propose a conceptual explanation for the stability gap.

1. INTRODUCTION

The fast convergence of gradient-based optimization has resulted in many successes with highly overparameterized neural networks (Krizhevsky et al., 2012; Mnih et al., 2013; Devlin et al., 2018). In the standard training paradigm, these results are conditional on having a static data-generating distribution. However, when non-stationarity is introduced by a time-varying data-generating distribution, the gradient-based updates greedily overwrite the parameters of the previous solution. This results in catastrophic forgetting (French, 1999) and is one of the main hurdles in continual or lifelong learning. Continual learning is often presented as aspiring to learn the way humans learn, accumulating instead of substituting knowledge. To this end, many works have focused on alleviating catastrophic forgetting, with promising results indicating that such learning behavior might be tractable for artificial neural networks (De Lange et al., 2021; Parisi et al., 2019). In contrast, this work surprisingly identifies that significant forgetting is still present at task transitions for standard state-of-the-art methods based on experience replay, constraint-based replay, knowledge distillation, and parameter regularization, although the observed forgetting is transient and followed by a recovery phase. We refer to this phenomenon as the stability gap.

Contributions in this work are along three main lines, with code publicly available.¹ First, we define a framework for continual evaluation that evaluates the learner after each update. This framework is designed to enable monitoring of the worst-case performance of continual learners, from the perspective of agents that acquire knowledge over their lifetime. For this we propose novel principled metrics, such as the minimum and worst-case accuracy (min-ACC and WC-ACC).
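To make the idea of per-iteration evaluation concrete, the sketch below tracks a per-iteration accuracy curve for each task and reports the minimum accuracy a task reaches after it was first learned. This is a simplified illustration of the minimum-accuracy idea, not the paper's exact metric definitions; the function name `min_acc` and the toy numbers are hypothetical.

```python
import numpy as np

def min_acc(acc_curves):
    """Worst-case accuracy per task: the minimum accuracy a task reaches
    at any evaluation step after it finished training.

    acc_curves: dict mapping task id -> 1-D array of per-iteration
    accuracies, recorded from the iteration the task finished training.
    """
    return {task: float(np.min(curve)) for task, curve in acc_curves.items()}

# Toy accuracy curves: task 0 drops sharply when task 1 starts, then recovers.
curves = {
    0: np.array([0.90, 0.45, 0.60, 0.75, 0.85]),  # transient drop to 0.45
    1: np.array([0.88, 0.86, 0.87]),
}
print(min_acc(curves))
```

Note that a task-oriented evaluation that only looks at the first and last points of task 0's curve (0.90 and 0.85) would conclude almost nothing was forgotten, while the per-iteration minimum (0.45) exposes the transient drop.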
Second, we conduct an empirical study with the continual evaluation framework, which leads to identifying the stability gap, as illustrated in Figure 1, in a variety of methods and settings. An ablation study on evaluation frequency indicates that continual evaluation is a necessary means to surface the stability gap, explaining why this phenomenon had remained unidentified so far. Additionally, we find that the stability gap is significantly influenced by the degree of similarity of consecutive tasks in the data stream. Third, we propose a conceptual analysis to help explain the stability gap, by disentangling the gradients into plasticity and stability components. We do this for several methods: Experience Replay (Chaudhry et al., 2019b), GEM (Lopez-Paz & Ranzato, 2017), EWC (Kirkpatrick et al., 2017), SI (Zenke et al., 2017), and LwF (Li & Hoiem, 2017). Additional experiments with gradient analysis provide supporting evidence for the hypothesis.

Implications of the stability gap. (i) Continual evaluation is important, especially for safety-critical applications, as representative continual learning methods falter in maintaining robust performance during the learning process. (ii) There is a risk that sudden distribution shifts may be exploited by adversaries that can control the data stream to momentarily but substantially decrease performance. (iii) Besides these practical implications, the stability gap itself is a scientifically intriguing phenomenon that inspires further research. For example, the stability gap suggests current continual learning methods might exhibit fundamentally different learning dynamics from the human brain.

Figure 1: The stability gap: substantial forgetting followed by recovery upon learning new tasks in state-of-the-art continual learning methods. Continual evaluation at every iteration (orange curve, ρ_eval = 1) reveals the stability gap, which remains unidentified with standard task-oriented evaluation (red diamonds, ρ_eval = T_i). Shown is the accuracy on the first task, while a network using Experience Replay sequentially learns the first five tasks of class-incremental Split-MiniImagenet. More details in Figure 2.
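The gradient disentanglement mentioned in the third contribution can be illustrated with a toy model. In the sketch below, two hypothetical quadratic losses stand in for the actual network losses: a plasticity term pulling the parameters toward the new task's optimum, and a stability term (e.g. a replay or regularization loss) pulling them toward the old one. All names and numbers are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical quadratic losses: L_new pulls parameters toward the new
# task's optimum, L_old toward the previous tasks' optimum.
theta_old_opt = np.array([0.0, 0.0])   # optimum for previous tasks
theta_new_opt = np.array([3.0, -1.0])  # optimum for the new task

def grad_plasticity(theta):
    # gradient of 0.5 * ||theta - theta_new_opt||^2
    return theta - theta_new_opt

def grad_stability(theta):
    # gradient of 0.5 * ||theta - theta_old_opt||^2
    return theta - theta_old_opt

theta = theta_old_opt.copy()  # start at the previous tasks' solution
g_p, g_s = grad_plasticity(theta), grad_stability(theta)
print(np.linalg.norm(g_p), np.linalg.norm(g_s))
```

At the task transition the parameters sit at (or near) the old tasks' minimum, so the stability gradient is close to zero and the first updates are dominated by the plasticity term; under this toy view, stability only pushes back once the old-task loss has already increased, which is consistent with a transient drop followed by recovery.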

2. PRELIMINARIES ON CONTINUAL LEARNING

The continual or lifelong learning classification objective is to learn a function f : X → Y with parameters θ, mapping the input space X to the output space Y, from a non-stationary data stream S = {(x, y)_0, (x, y)_1, ..., (x, y)_n}, where data tuple (x ∈ X, y ∈ Y)_t is sampled from a data-generating distribution D which depends on time t. While standard machine learning assumes a static data-generating distribution, continual learning introduces the dependency on the time variable t. This time-dependency introduces a trade-off between adaptation to the current data-generating distribution and retention of the knowledge acquired from previous ones, also referred to as the stability-plasticity trade-off (Grossberg, 1982).

Tasks. The data stream is often assumed to be divided into locally stationary distributions, called tasks. We introduce a discrete task identifier k to indicate the k-th task T_k with locally stationary data-generating distribution D_k. Additionally, the time variable t is assumed discrete, indicating the overall iteration number in the stream, and t_{|T_k|} indicates the overall iteration number at the end of task T_k.

Learning continually. During the training phase, a learner continuously updates f based on new tuples (x, y)_t from the data stream (De Lange & Tuytelaars, 2021). Optimization follows empirical risk minimization over the observed training sets D_{≤k}, as the learner has no direct access to the data-generating distributions D_{≤k}. The negative log-likelihood objective we would ideally optimize while learning task T_k is:

$\min_\theta \; \mathcal{L}_k = -\sum_{n=1}^{k} \mathbb{E}_{(x,y)\sim D_n}\left[\, y^\top \log f(x;\theta) \,\right]$

A key challenge for continual learning is to estimate this objective with only the current task's training data D_k available, while using limited additional compute and memory resources.
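In practice, replay-based methods approximate the expectations over the inaccessible previous distributions D_n (n &lt; k) with samples kept in a small memory buffer. A minimal sketch of one common buffer-maintenance scheme, reservoir sampling over the stream, is shown below; the `ReplayBuffer` class and its interface are hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

class ReplayBuffer:
    """Fixed-size memory filled by reservoir sampling over the stream,
    so each example seen so far is kept with equal probability."""
    def __init__(self, capacity):
        self.capacity, self.seen, self.data = capacity, 0, []

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # keep the new example with probability capacity / seen
            j = rng.integers(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, n):
        idx = rng.integers(len(self.data), size=n)
        return [self.data[i] for i in idx]

buf = ReplayBuffer(capacity=10)
for t in range(1000):  # stream of 1000 examples (here, just their indices)
    buf.add(t)
print(len(buf.data))   # the memory never exceeds its capacity
```

During training on task T_k, the loss on a mini-batch from D_k would then be combined with the loss on a mini-batch drawn via `buf.sample(...)`, replacing the sum over previous distributions in the objective above with a cheap Monte Carlo estimate from the buffer.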

