GRAPH-BASED CONTINUAL LEARNING

Abstract

Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.

1. INTRODUCTION

Recent breakthroughs of deep neural networks often hinge on the ability to repeatedly iterate over stationary batches of training data. When exposed to incrementally available data from non-stationary distributions, such networks often fail to learn new information without forgetting much of their previously acquired knowledge, a phenomenon commonly known as catastrophic forgetting (Ratcliff, 1990; McCloskey & Cohen, 1989; French, 1999). Despite significant advances, this limitation has remained a long-standing challenge for computational systems that aim to continually learn from dynamic data distributions (Parisi et al., 2019). Among the various proposed solutions, rehearsal approaches, which store samples from previous tasks in an episodic memory and regularly replay them, are one of the earliest and most successful strategies against catastrophic forgetting (Lin, 1992; Rolnick et al., 2019). An episodic memory is typically implemented as an array of independent slots; each slot holds one example coupled with its label. During training, these samples are interleaved with those from the new task, allowing for simultaneous multi-task learning as if the resulting data were independently and identically distributed. While such approaches are effective in simple settings, they require sizable memory and, under tight memory budgets, perform rather poorly on complex datasets. A possible explanation is that slot-based memories fail to exploit the relational structure between samples: semantically similar items are treated independently both during training and at test time. In marked contrast, relational memory is a prominent feature of biological systems that has been strongly linked to successful memory retrieval and generalization (Prince et al., 2005). Humans, for example, encode event features into cortical representations and bind them together in the medial temporal lobe, resulting in a durable yet flexible form of memory (Shimamura, 2011).
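To make the slot-based rehearsal scheme described above concrete, the following is a minimal sketch of a fixed-size episodic memory with reservoir sampling for slot replacement and random sampling for replay. The class name, the reservoir-sampling policy, and all method signatures are our own illustrative assumptions, not part of the paper.

```python
import random

class EpisodicMemory:
    """A fixed-size array of independent slots, one (example, label) pair per slot.

    Illustrative sketch: slots are filled by reservoir sampling so that each
    example seen so far has an equal chance of residing in the memory.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.slots = []            # each slot holds one (example, label) pair
        self.n_seen = 0            # total examples observed in the stream
        self.rng = random.Random(seed)

    def add(self, example, label):
        self.n_seen += 1
        if len(self.slots) < self.capacity:
            self.slots.append((example, label))
        else:
            # overwrite a random slot with probability capacity / n_seen
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.slots[j] = (example, label)

    def sample(self, k):
        # replayed samples are interleaved with the current task's minibatch
        return self.rng.sample(self.slots, min(k, len(self.slots)))
```

During training, each incoming example is offered to `add`, and every optimization step draws a small replay batch via `sample`.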
In this paper, we introduce a novel Graph-based Continual Learning model (GCL) that exhibits some characteristics of relational memory. More specifically, we explicitly model pairwise similarities between samples, including both those in the episodic memory and those in the current task. These similarities allow for representation transfer between samples and provide a resilient means to guard against catastrophic forgetting. Our contributions are twofold: (1) We propose the use of random graphs to represent relational structure between samples. While similar notions of dependencies have been proposed in the literature (Louizos et al., 2019; Yao et al., 2020), the application of random graphs to task-free continual learning is, to the best of our knowledge, novel. (2) We introduce a new regularization objective that leverages such random graphs to alleviate catastrophic forgetting. In contrast to previous work (Rebuffi et al., 2017; Li & Hoiem, 2017) based on knowledge distillation (Hinton et al., 2015), the objective penalizes the model for forgetting learned edges between samples rather than their output predictions. Our approach performs competitively on four commonly used datasets, improving accuracy by up to 19.7% and reducing forgetting by almost 37% in the best case when benchmarked against competitive baselines in task-free continual learning.
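The edge-based regularization idea, penalizing drift in learned pairwise edges rather than in output predictions, can be sketched as follows. Everything here is an assumption for illustration: we posit that edge probabilities come from a sigmoid of dot-product similarities between embeddings, and that the recorded probabilities from an earlier training stage serve as soft targets in a cross-entropy penalty. The paper's actual objective may take a different form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def edge_consistency_loss(embeddings, old_edges, temperature=1.0, eps=1e-7):
    """Penalize forgetting of previously learned edges (illustrative sketch).

    embeddings : current features of the memory samples, shape (n, d).
    old_edges  : edge probabilities recorded when these samples were last
                 trained on, shape (n, n).
    """
    # recompute edge probabilities from the current embeddings
    sim = embeddings @ embeddings.T / temperature
    new_edges = np.clip(sigmoid(sim), eps, 1 - eps)
    # cross-entropy between old (target) and new (predicted) edge probabilities
    return -np.mean(old_edges * np.log(new_edges)
                    + (1 - old_edges) * np.log(1 - new_edges))
```

The loss is minimized when the recomputed edges match the stored ones, so the model is free to move its representations as long as the relational structure among memory samples is preserved.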

2. PROBLEM FORMULATION

In this work, we follow the learning protocol for image classification from Lopez-Paz & Ranzato (2017). More specifically, we consider a training set D = {D_1, ..., D_T} consisting of T tasks, where the dataset for the t-th task, D_t = {(x_i^t, y_i^t)}_{i=1}^{n_t}, contains n_t input-target pairs (x_i^t, y_i^t) ∈ X × Y. While the tasks arrive sequentially and exclusively, we assume the input-target pairs (x_i^t, y_i^t) within each task are independent and identically distributed (i.i.d.). The goal is to learn a supervised model f_θ : X → Y, parametrized by θ, that outputs a class label y ∈ Y given an unseen image x ∈ X. Following prior work (Lopez-Paz & Ranzato, 2017; Riemer et al., 2018; Chaudhry et al., 2019), we consider online streams of tasks in which samples from different tasks arrive at different times. As an additional constraint, we insist that the model can only revisit a small amount of data chosen to be stored in a fixed-size episodic memory M. For clarity, we refer to the data in such an episodic memory as context images and context labels, denoted by X_C = {x_i}_{i∈C} and Y_C = {y_i}_{i∈C}, respectively. These are to be distinguished from the data in the current task, which we refer to as target images and target labels and denote by X_T = {x_j}_{j∈T} and Y_T = {y_j}_{j∈T}, respectively. While the model is allowed to update the context samples during training, the episodic memory is necessarily frozen at test time.
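A single online training step under this protocol can be sketched as follows: draw context samples (X_C, Y_C) from the episodic memory M and interleave them with the target minibatch (X_T, Y_T) from the current task. The function name and the batch-assembly details are illustrative assumptions; the protocol only dictates that replayed and current samples are trained on jointly.

```python
import random

def training_step(memory, task_batch, replay_size, rng=random):
    """Assemble one joint minibatch from the episodic memory M and the
    current task (illustrative sketch of the online protocol)."""
    # context samples (X_C, Y_C) come from the fixed-size memory M
    context = rng.sample(memory, min(replay_size, len(memory)))
    X_C, Y_C = zip(*context) if context else ((), ())
    # target samples (X_T, Y_T) come from the current task's stream
    X_T, Y_T = zip(*task_batch)
    # the model f_theta would then take a gradient step on the joint batch
    return list(X_C) + list(X_T), list(Y_C) + list(Y_T)
```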

3. GRAPH-BASED CONTINUAL LEARNING

In this section, we propose a Graph-based Continual Learning (GCL) algorithm. While most rehearsal approaches ignore the correlations between images and independently pass them through a network to compute predictions (Rebuffi et al., 2017; Chaudhry et al., 2019; Aljundi et al., 2019c), we model pairwise similarities between the images with learnable edges in random graphs (see Figure 1). Intuitively, although the model might easily forget any particular sample, the multiple connections that sample forms with similar neighbors are harder to forget altogether. If trained well, the random graphs can therefore equip the model with a plastic yet durable means to fight against catastrophic forgetting.

Graph Construction. Given a minibatch of target images X_T from the current task, our model makes predictions based on the context images X_C and context labels Y_C that span several previously seen tasks, up to and including the current one. In particular, we explicitly build two random graphs



Figure 1: Illustration of Experience Replay (ER) (Chaudhry et al., 2019) on the left and our model (GCL) on the right. While ER independently processes context images from the episodic memory and target images from the current task, GCL models pairwise similarities between the images via the random graphs G and A.
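One plausible reading of the graph construction, consistent with Figure 1, is sketched below: G collects edges among context images and A collects edges between target and context images, where each edge is an independent Bernoulli variable whose probability comes from a sigmoid of feature similarities. The roles assigned to G and A, and the dot-product similarity and Bernoulli sampling scheme, are our assumptions for illustration rather than the paper's exact construction.

```python
import numpy as np

def build_random_graphs(ctx_feats, tgt_feats, temperature=1.0, rng=None):
    """Sample the two random graphs from pairwise similarities (sketch).

    ctx_feats : features of the context images X_C, shape (n_c, d).
    tgt_feats : features of the target images X_T, shape (n_t, d).
    """
    rng = rng or np.random.default_rng(0)

    def edge_probs(a, b):
        # sigmoid of (temperature-scaled) dot-product similarities
        return 1.0 / (1.0 + np.exp(-(a @ b.T) / temperature))

    p_G = edge_probs(ctx_feats, ctx_feats)  # context-context probabilities
    p_A = edge_probs(tgt_feats, ctx_feats)  # target-context probabilities
    # a random graph realization: independent Bernoulli edges
    G = (rng.random(p_G.shape) < p_G).astype(float)
    A = (rng.random(p_A.shape) < p_A).astype(float)
    return G, A
```

In a trainable model the hard Bernoulli sampling would typically be replaced by a differentiable relaxation so that gradients can flow into the edge probabilities.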

