CONTINUAL LEARNING IN RECURRENT NEURAL NETWORKS

Abstract

While a diverse collection of continual learning (CL) methods has been proposed to prevent catastrophic forgetting, a thorough investigation of their effectiveness for processing sequential data with recurrent neural networks (RNNs) is lacking. Here, we provide the first comprehensive evaluation of established CL methods on a variety of sequential data benchmarks. Specifically, we shed light on the particularities that arise when applying weight-importance methods, such as elastic weight consolidation, to RNNs. In contrast to feedforward networks, RNNs iteratively reuse a shared set of weights and require working memory to process input samples. We show that the performance of weight-importance methods is not directly affected by the length of the processed sequences, but rather by high working memory requirements, which lead to an increased need for stability at the cost of decreased plasticity for learning subsequent tasks. We additionally provide theoretical arguments supporting this interpretation by studying linear RNNs. Our study shows that established CL methods can be successfully ported to the recurrent case, and that a recent regularization approach based on hypernetworks outperforms weight-importance methods, thus emerging as a promising candidate for CL in RNNs. Overall, we provide insights into the differences between CL in feedforward networks and RNNs, while guiding towards effective solutions for tackling CL on sequential data.

1. INTRODUCTION

The ability to continually learn from a non-stationary data distribution while transferring and protecting past knowledge is known as continual learning (CL). This ability requires neural networks to be stable to prevent forgetting, but also plastic to learn novel information, a trade-off referred to as the stability-plasticity dilemma (Grossberg, 2007; Mermillod et al., 2013). To address this dilemma, a variety of methods that tackle CL for static data with feedforward networks have been proposed (for reviews, refer to Parisi et al. (2019) and van de Ven and Tolias (2019)). However, CL for sequential data has received little attention, despite recent work confirming that recurrent neural networks (RNNs) also suffer from catastrophic forgetting (Schak and Gepperth, 2019). A set of methods that holds great promise to address this problem are regularization methods, which work by constraining the update of certain parameters. These methods can be considered more versatile than competing approaches, since they require neither rehearsal of past data nor an increase in model capacity, but can benefit from either of the two (e.g., Nguyen et al., 2018; Yoon et al., 2018). This makes regularization methods applicable to a broader variety of situations, e.g., when concerns about data privacy, storage, or limited computational resources during inference arise. The most well-known regularization methods are weight-importance methods, such as elastic weight consolidation (EWC; Kirkpatrick et al., 2017a) and synaptic intelligence (SI; Zenke et al., 2017), which are based on assigning importance values to weights. Some of these have a direct probabilistic interpretation as prior-focused CL methods (Farquhar and Gal, 2018), for which solutions of upcoming tasks must lie in the posterior parameter distribution of the current task (cf. Fig. 1b), highlighting the stability-plasticity dilemma.
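To make the weight-importance mechanism concrete, the following is a minimal sketch of the quadratic penalty used by (Online) EWC together with its running importance update; function and variable names are ours, and plain Python scalars stand in for the per-parameter tensors of an actual network:

```python
def ewc_penalty(params, anchor_params, importances, lam=1.0):
    """Quadratic penalty anchoring each weight to its value after the
    previous task, scaled by its estimated importance (in EWC, a diagonal
    Fisher-information estimate). This term is added to the loss of the
    current task."""
    penalty = 0.0
    for theta, theta_star, omega in zip(params, anchor_params, importances):
        penalty += omega * (theta - theta_star) ** 2
    return 0.5 * lam * penalty


def update_importances(importances, fisher_estimates, gamma=1.0):
    """Online EWC keeps a single running importance per weight and
    accumulates new Fisher estimates after each task. When successive
    tasks all rely heavily on the same shared (e.g., recurrent) weights,
    these values keep growing, making further changes to those weights
    increasingly costly."""
    return [gamma * omega + f
            for omega, f in zip(importances, fisher_estimates)]
```

A weight sitting at its anchor (`theta == theta_star`) contributes nothing to the penalty, while any deviation is penalized in proportion to its accumulated importance, which illustrates how saturating importance values trade plasticity for stability.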
Whether this dilemma affects feedforward networks and RNNs differently, and whether weight-importance methods can be used off the shelf for sequential data, has remained unclear. Here, we contribute to the development of CL approaches for sequential data in several ways.

• We provide a first comprehensive comparison of CL methods applied to sequential data. For this, we port a set of established CL methods for feedforward networks to RNNs and assess their performance thoroughly and fairly in a variety of settings.

• We identify elements that critically affect the stability-plasticity dilemma of weight-importance methods in RNNs. We empirically show that high working memory requirements, i.e. the need to store and manipulate information when processing individual samples, lead to a saturation of weight importance values, making the RNN rigid and hindering its potential to learn new tasks. In contrast, this trade-off is not directly affected by the sheer recurrent reuse of the weights, related to the length of the processed sequences. We complement these observations with a theoretical analysis of linear RNNs.

• We show that existing CL approaches can constitute strong baselines when compared in a standardized setting and when equivalent hyperparameter-optimization resources are granted. Moreover, we show that a CL regularization approach based on hypernetworks (von Oswald et al., 2020) mitigates the limitations of weight-importance methods in RNNs.

• We provide a code base¹ comprising all assessed methods, as well as variants of four well-known sequential datasets adapted to CL: the Copy Task (Graves et al., 2014), Sequential Stroke MNIST (Gulcehre et al., 2017), AudioSet (Gemmeke et al., 2017), and multilingual Part-of-Speech tagging (Nivre et al., 2016).

Taken together, our experimental and theoretical results facilitate the development of CL methods that are suited for sequential data.

¹ Source code for all experiments (including all baselines) is available at https://github.com/mariacer/cl_in_rnns.

2. RELATED WORK

Continual learning with sequential data. As in Parisi et al. (2019), we categorize CL methods for RNNs into regularization approaches, dynamic architectures, and complementary memory systems. Regularization approaches set optimization constraints on the update of certain network parameters without requiring a model of past input data. EWC, for example, uses weight importance values to limit further updates of weights that are considered essential for solving previous tasks (Kirkpatrick et al., 2017b). Throughout this work, we utilize a more mathematically sound and less memory-intensive version of this algorithm, called Online EWC (Huszár, 2018; Schwarz et al., 2018). Although highly popular for feedforward networks, it has remained unclear how suitable EWC is in the context of sequential processing. Indeed, some studies report promising results in the context of natural language processing (NLP) (Madasu and Rao, 2020; Thompson et al., 2019), while others find that it performs poorly (Asghar et al., 2020; Cossu et al., 2020a; Li et al., 2020). Here, we conduct the first thorough investigation of EWC's performance on RNNs, and find that it can often be a suitable choice. A related CL approach that also relies on weight importance values is SI (Zenke et al., 2017). Variants of SI have been used for different sequential datasets, but have not been systematically compared against other established methods (Yang et al., 2019; Masse et al., 2018; Lee, 2017). Fixed expansion layers (Coop and Arel, 2012) are another method to limit the plasticity of weights and prevent forgetting; in RNNs, they take the form of a sparsely activated layer between consecutive hidden states (Coop and Arel, 2013). Lastly, some regularization approaches rely on the use of non-overlapping and orthogonal representations to overcome catastrophic forgetting (French, 1992; 1994; 1970). Masse et al. (2018), for example, proposed the use of context-dependent random subnetworks, where weight changes are regularized by limiting plasticity to task-specific subnetworks. This eliminates forgetting for disjoint networks but leads to a reduction of available capacity per task. In concurrent work, Duncker et al. (2020) introduced a learning rule which aims to optimize the use of the activity-defined subspace in RNNs learning multiple tasks. When tasks are different,

