EVALUATING ONLINE CONTINUAL LEARNING WITH CALM

Abstract

Online Continual Learning (OCL) studies learning over a continuous data stream without observing any single example more than once, a setting that is closer to the experience of humans and of systems that must learn "in the wild". Yet, commonly available benchmarks are far from these real-world conditions, because they explicitly signal different tasks, lack latent similarity structure, or assume temporal independence between different examples. Here, we propose a new benchmark for OCL based on language modelling, in which the input alternates between different languages and domains without any explicit delimitation. Additionally, we propose new metrics to study catastrophic forgetting in this setting and evaluate multiple baseline models based on compositions of experts. Finally, we introduce a simple gating technique that learns the latent similarities between different inputs, improving the performance of a Products of Experts model.

1. INTRODUCTION

Machines, like humans, can learn to perform multiple different tasks from feedback alone (Caruana, 1997). Unlike machines, however, humans can benefit from settings in which tasks are presented repeatedly for multiple trials before switching to the next one (Flesch et al., 2018); machines instead require examples to be presented in a shuffled (i.i.d.) order to learn effectively, and otherwise suffer from an effect known as "catastrophic forgetting" or "catastrophic interference" (McCloskey & Cohen, 1989; Ratcliff, 1990). While there has been a considerable amount of work focused on solving this problem, an endeavour that also goes by the names of 'Continual', 'Incremental', or 'Life-long' Learning, a large part of it is evaluated in settings in which there is an explicit delimitation signal for every new task presented to the model (Kirkpatrick et al., 2017; Zenke et al., 2017; Sodhani et al., 2018; Serra et al., 2018; Lopez-Paz & Ranzato, 2017; Fernando et al., 2017; Lee et al., 2017; Rusu et al., 2016; Li & Hoiem, 2018; Aljundi et al., 2017; Adel et al., 2020; Titsias et al., 2020; Ebrahimi et al., 2020; von Oswald et al., 2020; Li et al., 2020; Yoon et al., 2020). However, humans do not need any such signalling at all. Consider, for example, the case of a child growing up in a multi-lingual environment. Even though it is not entirely clear whether the child relies on environmental cues (for instance, the identity of the speaker) to distinguish different input languages (De Houwer, 2017), any such mechanism must necessarily be inferred from context. Moreover, even the concept of a "task" may be vacuous, since what looks like a task can instead correspond to a shifting data distribution (Lesort et al., 2020).
Even though the emerging field of Online Continual Learning (Parisi & Lomonaco, 2020; Aljundi et al., 2019a), or Task-Free Continual Learning (Aljundi et al., 2019b; Lee et al., 2020), has started to propose solutions to these problems, commonly available benchmarks make assumptions that are far from real-world conditions, such as lacking latent similarity structure in the data stream (e.g. orthogonal permutations of an image's pixels) or assuming temporal independence between different examples (e.g. an image of a chair can be classified as "chair" independently of any previous examples). Consider, instead, the challenge of natural language learning, which requires making sense of a highly correlated and temporally interdependent data stream. We argue that the scarcity of benchmarks featuring temporally correlated sequences of examples with short- and long-term dependencies, latent similarities between different classes of examples, and no explicit delimitation when transitioning between classes has left a blind spot in the Online Continual Learning community, which we address here. Moreover, almost none of the commonly used benchmarks deals with language, further limiting the amount of research that extends to this modality.
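As a minimal illustration of the expert-composition baselines mentioned above, a Products of Experts language model combines the per-expert next-token distributions multiplicatively: a gate-weighted geometric mean of the experts' probabilities, renormalised over the vocabulary. The sketch below is a toy NumPy version with fixed gates (the function name and toy numbers are ours for illustration; in our setting the gates would be produced by a learned gating mechanism rather than hand-set):

```python
import numpy as np

def product_of_experts(expert_logprobs, gates):
    """Combine per-expert next-token log-probabilities into one distribution.

    A gate-weighted sum in log-space is a weighted geometric mean in
    probability space; we then renormalise over the vocabulary.
    expert_logprobs: array of shape (n_experts, vocab_size)
    gates: non-negative weights of shape (n_experts,), summing to 1
    """
    combined = gates @ expert_logprobs           # weighted sum of log-probs
    combined -= np.logaddexp.reduce(combined)    # log-normalise over vocab
    return combined

# Two toy "experts" over a 3-token vocabulary, gates favouring expert 0.
lp = np.log(np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6]]))
gates = np.array([0.8, 0.2])
probs = np.exp(product_of_experts(lp, gates))    # valid distribution
```

With these gates the combined distribution follows the dominant expert: the first token receives the highest probability, even though the second expert prefers the last token.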

