EVALUATING ONLINE CONTINUAL LEARNING WITH CALM

Abstract

Online Continual Learning (OCL) studies learning over a continuous data stream without observing any single example more than once, a setting that is closer to the experience of humans and of systems that must learn "in the wild". Yet, commonly available benchmarks are far from these real-world conditions, because they explicitly signal different tasks, lack latent similarity structure, or assume temporal independence between examples. Here, we propose a new benchmark for OCL based on language modelling, in which the input alternates between different languages and domains without any explicit delimitation. Additionally, we propose new metrics to study catastrophic forgetting in this setting and evaluate multiple baseline models based on compositions of experts. Finally, we introduce a simple gating technique that learns the latent similarities between different inputs, improving the performance of a Products of Experts model.

1. INTRODUCTION

Machines, like humans, can learn to perform multiple different tasks from feedback alone (Caruana, 1997). Humans, however, benefit from settings in which tasks are presented repeatedly for multiple trials before switching to the next one (Flesch et al., 2018), whereas machines require examples to be presented in a shuffled (i.i.d.) order to learn effectively; otherwise, they suffer from an effect known as "catastrophic forgetting" or "catastrophic interference" (McCloskey & Cohen, 1989; Ratcliff, 1990). While there has been a considerable amount of work focused on solving this problem, an endeavour that also goes by the names of 'Continual', 'Incremental' or 'Life-long' Learning, a large part of it is evaluated in settings where an explicit delimitation signal accompanies every new task presented to the model (Kirkpatrick et al., 2017; Zenke et al., 2017; Sodhani et al., 2018; Serra et al., 2018; Lopez-Paz & Ranzato, 2017; Fernando et al., 2017; Lee et al., 2017; Rusu et al., 2016; Li & Hoiem, 2018; Aljundi et al., 2017; Adel et al., 2020; Titsias et al., 2020; Ebrahimi et al., 2020; von Oswald et al., 2020; Li et al., 2020; Yoon et al., 2020). Humans, however, need no such signalling at all. Consider, for example, a child growing up in a multilingual environment. Even though it is not entirely clear whether the child relies on environmental cues (for instance, the identity of the speaker) to distinguish the input languages (De Houwer, 2017), any such mechanism must necessarily be inferred from context. Moreover, even the concept of a "task" may be ill-defined, since what appears as a task may simply be a shifting data distribution (Lesort et al., 2020).
Even though the emerging field of Online Continual Learning (Parisi & Lomonaco, 2020; Aljundi et al., 2019a), or Task-Free Continual Learning (Aljundi et al., 2019b; Lee et al., 2020), has started to propose solutions to these problems, commonly available benchmarks make assumptions that are far from real-world conditions, such as lacking latent similarity structure in the data stream (e.g. orthogonal permutations of an image's pixels) or assuming temporal independence between examples (e.g. an image of a chair can be classified as "chair" independently of any previous example). Consider, instead, the challenge of natural language learning, which requires making sense of a highly correlated and temporally interdependent data stream. We argue that the notable scarcity of benchmarks featuring temporally correlated sequences of examples, with short- and long-term dependencies, latent similarities between different classes of examples, and no explicit delimitation when transitioning between classes, has left a blind spot in the Online Continual Learning community, which we address here. Moreover, almost none of the commonly used benchmarks deals with language, further limiting the amount of research that extends to this modality. Here, we make a two-fold contribution towards studying online continual learning in neural networks in a linguistic setting. First, we introduce CALM (Class-Agnostic Continual Language Modelling), a continual language modelling evaluation framework containing text that alternates between different classes of input (e.g. different languages or domains) with latent similarities to which models can adapt. We provide two variants. The first is a character-based language modelling benchmark featuring five different languages that randomly switch between one another; the second is a word-based language modelling task in which the text alternates between four different domains.
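Purely as an illustration of this setting (CALM's actual data pipeline is described later), a boundary-free stream of the kind just described can be simulated by concatenating variable-length segments drawn from different sources; the function and toy corpora below are hypothetical and not part of the released library:

```python
import random

def make_stream(corpora, mean_len=50, total=500, seed=0):
    """Toy boundary-free stream of tokens from alternating sources.

    `corpora` maps a hidden class label (e.g. a language) to a list of
    tokens. The returned stream carries no switch markers; the parallel
    `hidden` list of labels exists only so that an evaluator can locate
    the switches post hoc.
    """
    rng = random.Random(seed)
    labels = list(corpora)
    stream, hidden = [], []
    while len(stream) < total:
        label = rng.choice(labels)                       # next (hidden) source
        seg_len = max(1, int(rng.expovariate(1.0 / mean_len)))
        tokens = corpora[label]
        start = rng.randrange(len(tokens))
        for i in range(seg_len):                         # emit one segment
            stream.append(tokens[(start + i) % len(tokens)])
            hidden.append(label)
    return stream[:total], hidden[:total]
```

A model trained on such a stream only ever sees the token sequence; the hidden labels are reserved for evaluation.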
No segmentation signal is given when a switch happens, thus requiring models to learn to adapt to these changes on their own. We also propose novel metrics that capture the impact of catastrophic forgetting in an online learning setting by measuring how efficiently models can adapt to class switches. In line with Aljundi et al. (2019b), we note that when a distribution shift occurs, a neural network that suffers from catastrophic forgetting will display a spike in the loss signal, even when the distribution has been observed in the past (see Figure 1a). Thus, we propose catastrophic forgetting metrics based on characterizing the size of these peaks. The benchmark is provided as a Python library that can be easily imported into a PyTorch project.1 Second, we evaluate multiple baselines based on expert architectures and propose a novel albeit simple mechanism that we call plastic gates, which we show to improve the performance of Products of Experts. Our post-hoc analysis shows that this mechanism produces a gating strategy that helps to circumvent catastrophic interference while also uncovering latent similarities between input classes.
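As a concrete illustration of spike-based forgetting metrics of this kind (the paper's exact definitions follow later; this is a minimal sketch under our own simplifying assumptions), one can score each switch by the jump in mean loss just after it relative to just before it:

```python
def spike_scores(losses, switches, window=10):
    """Characterize loss spikes caused by distribution switches.

    For each switch position, compare the mean loss over the `window`
    steps just after the switch with the mean loss over the `window`
    steps just before it. A positive score indicates a recovery cost,
    i.e. interference from the previously observed distribution.
    """
    scores = []
    for s in switches:
        before = losses[max(0, s - window):s]
        after = losses[s:s + window]
        if before and after:  # skip switches too close to the stream edges
            scores.append(sum(after) / len(after) - sum(before) / len(before))
    return scores
```

For example, a loss curve that sits at 1.0, jumps to 3.0 for ten steps after a switch at step 50, and then recovers yields a spike score of 2.0 for that switch. A model immune to forgetting would produce scores near zero at every revisit of a previously seen distribution.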

2. RELATED WORK

The field of Continual Learning, Incremental Learning or Lifelong Learning has grown to encompass a large body of work, which is best summarized in dedicated reviews (Parisi et al., 2019; Lesort et al., 2020). The overwhelming majority of this work concerns image classification or object recognition. Some evaluation datasets are derived from traditional machine learning datasets by manipulating the input examples in more or less artificial ways, such as Permuted MNIST (Kirkpatrick et al., 2017) or Rotated MNIST (Lopez-Paz & Ranzato, 2017), while others keep examples unchanged but present them in a specific non-i.i.d. order, for instance iCIFAR-100 (Rebuffi et al., 2017) or split-MNIST (Zenke et al., 2017). All of these datasets comprise single-input classification problems in which there are no temporal dependencies or correlations between consecutive examples. To better approximate the conditions of real-world experience, Fanello et al. (2013), Lomonaco & Maltoni (2017) and Roady et al. (2020) introduced iCubWorld, CORe50 and Stream-51, respectively, which comprise short videos of objects filmed from different angles (further including naturalistic scenes in the latter case). These datasets address the problem of correlated examples, but not of temporal dependencies, which we do address in this work. Li et al. (2020) and de Masson d'Autume et al. (2019) proposed the only benchmarks dealing with language that we know of: the former adopts a sequence-to-sequence paradigm to study incremental learning of new vocabulary items on simplified or artificial datasets, while the latter adapts existing text classification and QA benchmarks analogously to the above-mentioned work in image classification. Our work instead uses naturalistic textual data containing natural latent similarities between distributions that can drive information transfer or forgetting.

By and large, work directed at addressing catastrophic forgetting in neural networks presumes the existence of a task identifier that signals different learning units. However, recent work has aimed at tackling catastrophic forgetting even when no task boundaries are provided (Aljundi et al., 2019b; Lee et al., 2020), going under the name of "Task-Free Continual Learning" or "Online Continual Learning" (Parisi & Lomonaco, 2020; Aljundi et al., 2019a). Of these works, only Aljundi et al. (2019b) uses naturalistic data, classifying actors appearing in soap-opera episodes (Aljundi et al., 2016), while the others resort to artificially modified datasets such as split or permuted MNIST. Here, we complement this resource with a text-based benchmark for Task-Free Continual Learning, while arguing for more work on naturalistic non-i.i.d. datasets.

Another aspect of Continual Learning concerns how models are evaluated. Most often, this is done by measuring accuracy on a dedicated test set (Lopez-Paz & Ranzato, 2017; Díaz-Rodríguez et al., 2018; Hayes et al., 2018; Chaudhry et al., 2018; de Masson d'Autume et al., 2019). However, this protocol decouples evaluation from the learning process itself, whereas in an online setting one would rather measure how well a model performs while it is learning from the stream.

1 Code and materials included in the supplementary materials will be made publicly available upon acceptance.

