CAN BERT REFRAIN FROM FORGETTING ON SEQUENTIAL TASKS? A PROBING STUDY

Abstract

Large pre-trained language models help achieve state-of-the-art results on a variety of natural language processing (NLP) tasks; nevertheless, they still suffer from forgetting when incrementally learning a sequence of tasks. To alleviate this problem, recent works enhance existing models with sparse experience replay and local adaptation, which yield satisfactory performance. However, in this paper we find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay. To verify BERT's ability to maintain old knowledge, we adopt and re-finetune single-layer probe networks with the parameters of BERT fixed. We investigate the models on two types of NLP tasks, text classification and extractive question answering. Our experiments reveal that BERT can actually generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay. We further introduce a series of novel methods to interpret the mechanism of forgetting and how memory rehearsal plays a significant role in task-incremental learning, bridging the gap between our new discovery and previous studies of catastrophic forgetting.

1. INTRODUCTION

Continual learning aims to obtain knowledge from a stream of data across time (Ring, 1994; Thrun, 1998; Chen & Liu, 2018). As a booming area within continual learning, task-incremental learning requires a model to learn a sequence of tasks without forgetting previously learned knowledge. Training models on a stream of tasks sequentially is a practical scenario, as it avoids exhaustively re-training on all existing data whenever a new task arrives. In natural language processing, although many large-scale pre-trained language models (PLMs) have ceaselessly set new records on various benchmarks, they cannot be directly deployed in a task-incremental setting: these models tend to perform poorly on previously seen tasks when learning new ones. For instance, a BERT BASE model trained sequentially on text classification tasks may not be able to make any correct predictions for the first task after learning new ones, with almost-zero accuracy scores (d'Autume et al., 2019). This phenomenon is known as catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999; Rosenstein et al., 2005). Many existing works design novel architectures or components to alleviate forgetting during incremental learning (Kirkpatrick et al., 2017; Zenke et al., 2017; Rebuffi et al., 2017; Mallya & Lazebnik, 2018; d'Autume et al., 2019; Pfeiffer et al., 2020; Sun et al., 2020; Geng et al., 2021; Jin et al., 2022; Qin et al., 2022). Among them, d'Autume et al. (2019) find that an NLP model augmented with sparse memory replay can refrain from forgetting to a great extent. Their method randomly samples 100 instances from old tasks for replay after learning every 10,000 unseen instances. Considering that their method can regain the ability to process previous tasks via merely 100 instances in 4 steps, a question comes to mind: do pre-trained language models like BERT really suffer from forgetting when learning a sequence of tasks? Mehta et al.
(2021) reveal that, under proper hyper-parameters, models with pre-training can suffer less catastrophic forgetting than models without pre-training. In this work, however, we specifically focus on the frequency of memory replay: we ask whether the BERT encoder can still maintain knowledge learned from previous tasks, as it does in d'Autume et al. (2019), with an extremely sparse replay frequency or even without replay. Probing has become a popular tool for investigating model interpretability (Tenney et al., 2019; Jawahar et al., 2019). For instance, Wu et al. (2022) probe the continual learning ability of a model by comparing the performance of different PLMs trained with different continual learning strategies. In this paper, our main concern is to examine whether PLMs have an intrinsic ability to maintain previously learned knowledge over the long term. We track BERT's encoding ability for specific tasks before, during, and after it learns the corresponding tasks. Comparing the probing results of models trained under different replay frequencies and without memory replay, we find that BERT itself can refrain from forgetting when learning a sequence of tasks. This is somewhat contrary to existing studies of catastrophic forgetting, which further motivates us to investigate how the representations of examples from different tasks are organized in the parameter space. Inspired by prior works (Gao et al., 2019; Wang et al., 2020a), we define the representation sub-space of a class as a convex cone, and provide an algorithm to acquire the narrowest solution. With this toolkit in hand, we find that after learning several tasks without memory replay, the representation sub-spaces of classes from different tasks overlap with each other; however, the sub-spaces of classes from the same task never overlap throughout training.
The former explains catastrophic forgetting in task-incremental learning from a novel viewpoint of representations, while the latter explains why BERT retains the potential to encode prior tasks even without replay. Our main contributions in this work are: (1) we conduct a thorough study to quantitatively characterize how the representation ability of a PLM like BERT changes as it continuously learns a sequence of tasks; to our knowledge, we are the first to track BERT's encoding ability for previously learned tasks while it learns new tasks continuously. (2) Our findings reveal that BERT can actually maintain its encoding ability for already learned tasks, and has a strong potential to produce high-quality representations for previous tasks over the long term, under extremely sparse replay or even without memory replay, which is contrary to previous studies. (3) We further investigate the topological structure of the learned representation sub-spaces within a task and among different tasks, and find that the forgetting phenomenon can be decomposed into two aspects, intra-task forgetting and inter-task forgetting (Section 4), enabling us to explain the contradiction between our findings and previous studies.

2. BACKGROUND

Following prior work (Biesialska et al., 2020), we consider the task-incremental language learning setting, in which a model learns from a sequence of tasks: samples of former tasks are not accessible during the training steps of later tasks, but samples of all classes in the current task can be acquired simultaneously. Formally, the input training stream consists of $K$ ordered tasks $T_1, T_2, \dots, T_K$, where we observe $n_k$ samples $\{(x_i^k, y_i^k)\}_{i=1}^{n_k}$ drawn from the distribution $P_k(\mathcal{X}, \mathcal{Y})$ of task $T_k$. Our training objective is a general model $f_\theta: \mathcal{X} \to \mathcal{Y}$ which handles all tasks with a limited number of parameters $\theta$, by minimizing the negative log-likelihood averaged over all examples: $\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \ln P(y_i \mid x_i; \theta)$, where $N = \sum_{t=1}^{K} n_t$ is the number of all training examples.
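The objective above can be sketched with a toy model trained on a strictly sequential stream; the data, dimensions, and plain gradient-descent optimizer below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(W, X, y):
    """L(theta) = -(1/N) * sum_i ln P(y_i | x_i; theta) for a linear model."""
    p = softmax(X @ W)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

d, n_classes = 8, 4
W = np.zeros((d, n_classes))

# Two toy "tasks" with disjoint class pairs and a crude distribution shift;
# labels depend on the first feature so each task is learnable.
tasks = []
for k, classes in enumerate([(0, 1), (2, 3)]):
    X = rng.normal(size=(64, d)) + k
    y = np.where(X[:, 0] > k, classes[0], classes[1])
    tasks.append((X, y))

# Sequential training: earlier tasks' samples are never revisited.
for X, y in tasks:
    for _ in range(200):
        p = softmax(X @ W)
        p[np.arange(len(y)), y] -= 1.0        # gradient of the NLL
        W -= 0.1 * (X.T @ p) / len(y)

final_loss = nll(W, *tasks[-1])               # loss on the last-seen task
```

With a zero weight matrix the NLL equals $\ln 4$ (uniform prediction over four classes); after training on the second task it drops well below that, while nothing constrains the loss on the first task.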

2.1. INVESTIGATED MODEL

In natural language processing, a model can be divided into two parts, a text encoder and a task decoder, with parameters $\theta_{enc}$ and $\theta_{dec}$, respectively.

Text Encoder. Similar to MbPA++ (d'Autume et al., 2019) and Meta-MbPA (Wang et al., 2020b), we use BERT BASE (Devlin et al., 2019) as our text encoder, which produces vector representations for given tokens. In text classification, we take the representation of the [CLS] token, prepended to the sequence, to aggregate the information of all tokens. For a sequence of input tokens $x_i$, where $x_{i,0}$ is [CLS], BERT BASE generates corresponding vectors $\{v_{i,j}\}_{j=0}^{L}$ with $L = |x_i|$. We therefore formulate the output of the encoder as $f_{\theta_{enc}}(x_i) = v_{i,0}$. For extractive question answering, we adopt the task setting of SQuAD 1.1 (Rajpurkar et al., 2016), as in previous work (d'Autume et al., 2019). The input tokens $x_i$ are the concatenation of a context $x_i^{ctx}$ and a query $x_i^{que}$, separated by a special token [SEP].

Task Decoder. For text classification, we add a linear transformation and a soft-max layer after the BERT BASE encoder. Following d'Autume et al. (2019), we adopt a unified decoder for all classes of different tasks, where $\theta_{dec}$ is the collection $\{W_y\}_{y \in \mathcal{Y}}$:

$$P(\hat{y} = \alpha \mid x_i) = \frac{\exp\left(W_\alpha^\top f_{\theta_{enc}}(x_i)\right)}{\sum_{y \in \mathcal{Y}} \exp\left(W_y^\top f_{\theta_{enc}}(x_i)\right)} = \frac{\exp\left(W_\alpha^\top v_{i,0}\right)}{\sum_{y \in \mathcal{Y}} \exp\left(W_y^\top v_{i,0}\right)}.$$

For question answering, the model extracts a span from the original context, i.e., it determines the start and end boundaries of the span. Our QA decoder has two linear layers, $W_{start}$ and $W_{end}$, for the start and the end, respectively. The probability that the $t$-th token in the context is the start of the answer span is:

$$P\left(start = x_{i,t}^{ctx} \mid x_i^{ctx}; x_i^{que}\right) = \frac{\exp\left(W_{start}^\top v_{i,t}^{ctx}\right)}{\sum_{j=1}^{L_{ctx}} \exp\left(W_{start}^\top v_{i,j}^{ctx}\right)},$$

where $L_{ctx}$ is the length of the context; the probability of the end boundary has a similar form. When predicting, we treat the probability distributions of the two boundaries as independent.
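A minimal sketch of this unified decoder; the hidden size, class count, and random inputs are toy assumptions standing in for BERT's [CLS] and token vectors:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(v_cls, W):
    """P(y_hat = alpha | x_i): one weight column per class pooled over ALL tasks,
    applied to the [CLS] vector v_{i,0}."""
    return softmax(v_cls @ W)              # W: (hidden, total classes)

def start_probs(V_ctx, w_start):
    """P(start = t-th context token): soft-max of W_start scores over L_ctx tokens."""
    return softmax(V_ctx @ w_start)        # V_ctx: (L_ctx, hidden)

rng = np.random.default_rng(0)
hidden, n_all_classes, L_ctx = 16, 33, 20  # toy sizes; classes pooled over tasks
probs = classify(rng.normal(size=hidden),
                 rng.normal(size=(hidden, n_all_classes)))
p_start = start_probs(rng.normal(size=(L_ctx, hidden)),
                      rng.normal(size=hidden))
```

Both outputs are proper distributions: `probs` over the pooled label set, `p_start` over the context positions; end-boundary scoring is identical with a second weight vector.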

2.2. SPARSE EXPERIENCE REPLAY

In reality, humans rely on review to keep long-term knowledge, based on episodic memories that store past experiences. Inspired by this, Gradient Episodic Memory (Lopez-Paz & Ranzato, 2017) and other methods introduce a memory module $\mathcal{M}$ into the learning process. Training examples can then be stored in the memory for rehearsal at a predetermined frequency. For question answering, our benchmark contains 3 datasets: SQuAD 1.1 (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), and QuAC (Choi et al., 2018). Since TriviaQA has two sections, Web and Wikipedia, considered as two different tasks, the benchmark consists of 4 QA tasks in total.
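The sparse-replay schedule (write roughly 1% of seen examples into $\mathcal{M}$, replay 100 of them after every 10,000 new examples) can be sketched as below; the random-write buffer policy is a simplifying assumption, not the exact MbPA++ mechanism:

```python
import random

class SparseReplay:
    """Toy episodic memory: probabilistic writes plus periodic rehearsal."""

    def __init__(self, store_rate=0.01, replay_interval=10_000, replay_size=100):
        self.memory = []
        self.store_rate = store_rate          # fraction of stream stored
        self.replay_interval = replay_interval
        self.replay_size = replay_size
        self.seen = 0

    def observe(self, example, rng):
        """Process one stream example; return examples to rehearse (maybe none)."""
        self.seen += 1
        if rng.random() < self.store_rate:    # ~1% storage rate
            self.memory.append(example)
        if self.seen % self.replay_interval == 0 and self.memory:
            k = min(self.replay_size, len(self.memory))
            return rng.sample(self.memory, k) # rehearse 100 stored examples
        return []

rng = random.Random(0)
buf = SparseReplay()
replayed = []
for i in range(30_000):                       # a stream of 30K training examples
    replayed.extend(buf.observe(("x", i), rng))
```

Over 30K examples this triggers three replay events; a larger `replay_interval` (or skipping `observe`'s replay branch entirely) gives the sparser REPLAY variants and the SEQ setting studied later.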

3. PROBING FOR INTRINSIC ABILITY AGAINST FORGETTING IN BERT

As mentioned in Section 1, a model can rapidly recover its performance on previously learned tasks by memory replay on merely 100 instances (d'Autume et al., 2019). If the model completely lost the ability to encode prior tasks, it would be counter-intuitive for it to regain prior knowledge within 4 update steps. We conjecture that BERT actually retains old knowledge when learning new tasks rather than forgetting catastrophically. To verify this hypothesis, we first conduct a pilot study. We implement our pilot experiments on the text classification benchmark, employing BERT BASE with a simple linear decoder as our model and training it under 4 different orders (detailed in Appendix A). Following previous probing studies (Tenney et al., 2019; Jawahar et al., 2019), to examine BERT's encoding ability for specific tasks, we freeze the encoder parameters after sequential finetuning, re-initialize five new linear probing decoders, and re-train them on the five tasks separately. We find that, evaluated on the corresponding tasks, every fixed BERT encoder combined with its new decoder achieves superior performance. Surprisingly, the macro-averaged accuracy scores over all tasks for the 4 orders are 75.87±0.73%, 76.76±0.64%, 75.19±0.43%, and 76.76±0.71%, which are close to the performance of a multi-task learning model (78.89±0.18%). However, previous works (Biesialska et al., 2020) show that sequentially trained models suffer from catastrophic forgetting and sacrifice their performance on previous tasks when adjusting to a new one. In contrast to previous works, our pilot experiments indicate that BERT may have the ability to maintain the knowledge learned from previous tasks over the long term.

3.1. PROBING METHOD

To verify whether BERT can refrain from forgetting without the help of memory replay, we need a tool to systematically measure a model's encoding ability for previous tasks while it incrementally learns a sequence of tasks. One way is to compare the encoding ability of models at different learning stages under two settings, REPLAY and SEQ. For each setting, the performance measured before learning the corresponding task can be regarded as a baseline, indicating BERT's inherent knowledge acquired from pre-training. We can then examine to what extent BERT forgets old knowledge by comparing the results during and after learning the corresponding tasks. It is therefore essential to track the change of BERT's task-specific encoding ability across time. We extract the parameters of the encoder and save them as checkpoints at an assigned frequency during training. In both REPLAY and SEQ, we record checkpoints every 5,000 training examples, without regard to the retrieved memory subset. For every checkpoint, we probe its encoding ability for every task $T_k$ by the following steps:
1. Add a re-initialized probing decoder to the BERT BASE parameters in this checkpoint.
2. Train the recombined model on all data in $T_k$'s training set $\mathcal{D}_k^{tr}$ with $\theta_{enc}$ fixed, i.e., adjust the parameters of the probing decoder only.
3. Evaluate the scores of the re-trained model on the test set of $T_k$.
In other words, we re-train a compatible probing decoder on a specific task without touching the encoder before evaluation. We use a linear decoder as the probing network for text classification, and two linear boundary decoders for question answering, the same setting as MbPA++ (d'Autume et al., 2019) and Meta-MbPA (Wang et al., 2020b). We note that there remains some controversy on whether one should use a simpler probing decoder or a more complex one (Belinkov, 2022). Here, we adopt simple one-layer probing networks for two reasons.
Firstly, a simpler probe exerts less influence on the performance of the re-trained models (Liu et al., 2019a; Hewitt & Liang, 2019), which enables us to focus on the encoding ability of BERT alone. Secondly, our purpose in this paper is not to compare BERT's encoding ability across different tasks, but to examine whether it forgets the knowledge of a specific task. It is therefore preferable to use the same single-layer decoder as d'Autume et al. (2019) and Wang et al. (2020b), which yields results comparable with theirs.

3.2. RETHINKING CATASTROPHIC FORGETTING

We are now able to quantitatively measure whether a BERT model maintains its encoding ability for previous tasks during task-incremental learning, by tracking the probing scores across checkpoints. It is also important to investigate whether replay intervals influence BERT's encoding ability. We first set up a series of experiments on text classification, described below. To compare with prior works (d'Autume et al., 2019; Wang et al., 2020b), we keep our experimental setup consistent with theirs, where the maximum token length and batch size are set to 128 and 32, respectively. We use the REPLAY training setting of d'Autume et al. (2019) as the baseline, which samples 100 examples from $\mathcal{M}$ for replay after every 10,000 new examples from the data stream. As mentioned in Section 2.2, we keep both the storage rate $\gamma$ and the replay frequency $r$ at 1%. To explore the impact of memory replay, we compare models trained under different replay intervals: we randomly select a subset $\mathcal{S}$ with $\lfloor 0.01 N_{tr} \rfloor$ samples from $\mathcal{M}$ after learning every $N_{tr}$ examples, where $N_{tr} \in \{10\text{K}, 30\text{K}, 60\text{K}, 115\text{K}\}$; furthermore, $N_{tr}$ can be regarded as $+\infty$ when training models purely sequentially. We employ Adam (Kingma & Ba, 2015) as the optimizer. We use the method in Section 3.1 to evaluate the quality of the representations generated by BERT at every checkpoint: if a set of BERT parameters has a stronger ability to encode a specific task, we observe a better probing performance. For text classification, we depict the changes of accuracy scores in separate figures according to task and training order. The results of Order 1 (detailed in Appendix A) are shown in Figure 1 and the rest are illustrated in Appendix B.
Comparing the scores before and after the model learns specific tasks, we obtain a new understanding of task-incremental language learning: in spite of the data distribution shift among tasks, BERT retains most of its ability to classify previously seen tasks, instead of forgetting catastrophically. This conclusion also applies to SEQ, whose replay interval can be regarded as $+\infty$. Although BERT's representation ability degrades slightly under larger replay intervals (such as 60K, 115K, $+\infty$), it still maintains previous knowledge and can recover rapidly through sparse replay. We also provide experimental results on question answering, which is more complex than text classification. To examine whether BERT can still retain old knowledge on QA tasks, we adopt a stricter experimental setting than d'Autume et al. (2019): we train the model sequentially with the 4 different orders in Appendix A, under the SEQ setting without any memory replay. On each task, the model is finetuned for 15K steps, more than twice as many as in d'Autume et al. (2019). We then evaluate the encoding ability of every BERT checkpoint with our probing method. The results of Order 1 are illustrated in Figure 2, and the others in Appendix C. Under these settings, the model is finetuned for enough steps to overfit on every task. However, the probing results (blue lines) are still much higher than the original scores measured before re-training the decoders (red dashed lines). Given the obvious gap between them, we find that BERT still keeps most of the knowledge of previous tasks when learning new ones. Additionally, we investigate the ability of other pre-trained language models to retain old-task knowledge, detailed in Appendix D. In general, all of the investigated pre-trained language models have an intrinsic ability to refrain from forgetting when learning a sequence of tasks, despite their various attention mechanisms and scales.
Among different training orders, the models still maintain the ability to encode the first learned task, even after learning 5 tasks. From the experiments in Section 3.2, we observe that BERT has the potential to keep a long-term ability to provide high-quality representations for a task once the model has learned it. Thus, it seems that we only need to finetune the decoder to recover the model's ability on a previous task. On the other hand, the SEQ models suffer a serious performance degradation on learned tasks, which is known as catastrophic forgetting. To reconcile this contradiction, we employ the t-SNE toolkit (van der Maaten & Hinton, 2008) and visualize the representations after training on all tasks with SEQ or REPLAY (Figure 3). When learning sequentially, the model produces representations of different tasks in overlapping regions of the space. In this circumstance, the task decoder identifies all vectors as instances of the new task, which leads to confusion but can be averted effectively by sparse replay. All these observations lead us to the hypothesis that forgetting in task-incremental learning can be decomposed into two parts, intra-task forgetting and inter-task forgetting. Intra-task forgetting describes whether a model can still generate meaningful representations for prior tasks after learning new ones, while inter-task forgetting refers to whether the representations produced for different tasks remain distinguishable from each other. In this section, we first propose a toolkit to describe the representations (Section 4.1). Then, we exhibit the changes of a model learning continuously under REPLAY settings, and provide a novel understanding of catastrophic forgetting in NLP models. Admittedly, question answering models usually involve interactions among representations of different granularities (from token-level up to document-level) (Wang et al., 2018), and are thus more challenging to analyze.
Therefore, we put more emphasis on analysing the results of text classification.

As claimed in Gao et al. (2019) and Wang et al. (2020a), when trained with single-layer linear decoders, pre-trained language models produce token-level embedding vectors in a narrow cone. We observe that this conclusion applies not only to token-level representations but also to sentence-level representations (more details in Appendix E). Representation vectors of the same class are aggregated together, which enables us to cover these vectors with a convex cone whose vertex is the origin. To describe the vectors precisely, the cone should cover all vectors and be as narrow as possible. Formally, we denote the surrounding cone as:

$$\left\{ x \in \mathbb{R}^d \;\middle|\; \frac{x^\top c}{\|x\|_2 \cdot \|c\|_2} \ge \delta \right\} \quad (1)$$

where $c \in \mathbb{R}^d$ is the central axis of the cone, and $\delta$ controls the field angle. To acquire the narrowest cone containing all vectors output by BERT, supposing the vector set is $V = \{v_i\}_{i=1}^{n}$, we solve the following optimization objective:

$$\min_{c,\,\delta}\; -\delta \quad \text{s.t.} \quad \forall v_i \in V,\; \frac{v_i^\top c}{\|v_i\|_2} \ge \delta, \quad \|c\|_2 = 1, \quad (2)$$

where $\|\cdot\|_2$ denotes the L2-norm. To obtain a definite solution, we add the restriction $\|c\|_2 = 1$; otherwise the formulation would determine only the direction of $c$, not its length. The representation vectors are clustered, so we can obtain a cone with a tiny field angle ($\delta \gg 0$). Therefore, Eq. (2) is a convex optimization objective, which can be solved by Sequential Least Squares Programming (Kraft, 1988; Boggs & Tolle, 1995). In each iteration, we acquire the optimization gradient by the following expression:

$$f_\delta(c, \{v_i\}_{i=1}^{n}) = \min_i \frac{v_i^\top c}{\|v_i\|_2}, \qquad \nabla f_\delta(c, \{v_i\}_{i=1}^{n}) = \frac{v^*}{\|v^*\|_2}, \quad v^* = \arg\min_{v_i} \frac{v_i^\top c}{\|v_i\|_2}. \quad (3)$$

Furthermore, to reduce interference from outliers caused by noisy annotations, we relax the constraint so that the cone only needs to cover no less than 95% of the training examples. Since this violates the convexity of the original objective, we employ an iterative method to obtain an approximate solution, keeping every calculating step convexity-preserving. Since the cone axis should lie at the center of the vectors, we initialize $c_0 = \sum_i v_i / \|\sum_i v_i\|_2$. Algorithm 1 outlines the detailed solving procedure:

Algorithm 1: Searching the narrowest covering cone
  Initialize $c_0 \leftarrow \sum_i v_i / \|\sum_i v_i\|_2$
  while $|V| > \lceil 0.95 n \rceil$ do
    $c \leftarrow c_0$
    repeat
      Obtain the gradient $\nabla f_\delta(c, V)$
      $c \leftarrow c + \alpha \nabla f_\delta(c, V)$; $c \leftarrow c / \|c\|_2$
      Adjust $\alpha$ by line search
    until $\forall c_j$ in $c$, $\Delta c_j < \varepsilon$
    Calculate the cosine of each $v_i$ and $c$, denoted $\{s_i\}_{i=1}^{|V|}$
    Sort $\{s_i\}_{i=1}^{|V|}$
    $m \leftarrow \lceil (|V| - \lceil 0.95 n \rceil)/2 \rceil$
    Remove the $m$ lowest $s_i$ and their vectors $V_{del}$: $V \leftarrow V - V_{del}$
    $c_0 \leftarrow c$
  end while
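A rough NumPy rendering of the narrowest-cone search, reconstructed from the description above; a fixed-step max-min ascent replaces the SLSQP inner solver, and the step size, iteration counts, and pruning schedule are assumptions:

```python
import numpy as np

def narrowest_cone(V, coverage=0.95, lr=0.05, inner_steps=500):
    """Return (c, delta): a unit axis and the minimum cosine of a cone
    covering roughly `coverage` of the rows of V (outliers pruned iteratively)."""
    V = np.asarray(V, dtype=float)
    n_keep = int(np.ceil(coverage * len(V)))
    U = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit directions
    while True:
        c = U.sum(axis=0)
        c /= np.linalg.norm(c)                         # init at mean direction
        for _ in range(inner_steps):
            worst = U[np.argmin(U @ c)]                # tightest constraint
            c = c + lr * worst                         # push axis toward it
            c /= np.linalg.norm(c)
        cosines = U @ c
        if len(U) <= n_keep:
            return c, float(cosines.min())
        # prune some of the worst-covered vectors (outliers), then re-solve
        m = max(1, (len(U) - n_keep + 1) // 2)
        U = U[np.argsort(cosines)[m:]]

rng = np.random.default_rng(0)
axis = np.ones(8) / np.sqrt(8)
V = axis + 0.1 * rng.normal(size=(300, 8))             # clustered toy vectors
c, delta = narrowest_cone(V)
```

On a tight synthetic cluster the recovered axis aligns closely with the true center and `delta` stays near 1, matching the "narrow cone" picture ($\delta \gg 0$).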

4.2. INTRA-TASK FORGETTING

From the results in Section 3.2, we find that BERT can maintain previously learned knowledge over the long term. When paired with a re-trained new decoder, BERT still performs well on prior tasks, indicating that BERT rarely suffers from intra-task forgetting. To investigate the mechanism preventing intra-task forgetting, we train a BERT model on AGNews and Amazon as an example to analyse the changes within BERT's representation space. We first train the model on all instances of AGNews, and then sample 30K instances from Amazon as the second task for task-incremental learning.

Published as a conference paper at ICLR 2023

Similar to Figure 1, BERT can still generate high-quality representations for AGNews after learning Amazon without Episodic Replay. We conjecture that after learning a new task, the representation sub-space of old tasks is still topologically ordered (see the definition in Section 4.3). As shown in Figure 3(a), without Episodic Replay the representation vectors of old-task instances rotate into the sub-space overlapping the new task, so the decoder cannot distinguish which task an input instance belongs to. On the other hand, a task-specific decoder (e.g., the probing decoder) can effectively determine the class of a given instance. This may imply that vectors of the same old-task class are still close to each other, while remaining far from the vectors of other classes of the same old task. We therefore conjecture that if two representation vectors are trained to adjacent positions, they will still be neighbors after learning a new task. To examine whether the rotating process of old-task representation vectors is topologically ordered, we first need a metric to define the relative positions among the representations of instances of the same class. Following our method in Section 4.1, we describe the representation sub-space of a class $y$ as a convex cone, whose axis is $c_y$.
For instance $i$ of class $y$, we define the relative position of its representation vector $v_{y,i}$ as the cosine between $v_{y,i}$ and $c_y$. Since we need to compare the relative position of every instance at two checkpoints (before and after learning the second task), we distinguish vectors at different checkpoints by superscripts. Formally, we denote the cone axis and the representation vectors before learning Amazon as $c_y^{(0)}$ and $v_{y,i}^{(0)}$, and those after as $c_y^{(1)}$ and $v_{y,i}^{(1)}$. For each instance, we then compute the Pearson correlation between the relative positions of its $n$ nearest neighbors before and after learning the new task. We list the results for all classes in AGNews with different scales of $n$ in Table 1 (where $y \in$ {Class-1, Class-2, Class-3, Class-4} and $n \in$ {5, 10, 25, 50, 100}). Comparing different $n$, a medium-sized neighbor set yields a better correlation, restraining both the randomness of a tiny set and the uncorrelated bias of a huge set. Altogether, the influence of $n$ is inessential, and we can conclude that the positions of $v_{y,i}^{(0)}$ and its neighbors remain close after learning the new task, since the Pearson coefficients are no less than 0.483 (and partly higher than 0.723). In other words, if two examples are mapped to nearby positions before learning new tasks, they remain close to each other afterwards. Once BERT has learned a task, it tends to generate representations of the same class at close positions, while generating representations of different classes in non-adjacent regions. Therefore, as long as the rotating process of old-task representations stays topologically ordered, the representation vectors of a class remain separate from the vectors of other classes. This is why BERT exhibits an aptitude to alleviate intra-task forgetting in our study.
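The relative-position check can be sketched as below; the synthetic orthogonal transform stands in for the effect of learning a new task, and a global Pearson coefficient over all instances is a simplification of the $n$-nearest-neighbor version described above:

```python
import numpy as np

def cone_axis(V):
    """Mean direction of the (normalized) class vectors, as in Section 4.1."""
    c = (V / np.linalg.norm(V, axis=1, keepdims=True)).sum(axis=0)
    return c / np.linalg.norm(c)

def relative_positions(V, c):
    """Relative position of each instance: cos(v_{y,i}, c_y)."""
    return (V @ c) / np.linalg.norm(V, axis=1)

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
V0 = rng.normal(size=(200, 8)) + 4.0            # one class's vectors, before
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # random rotation of the space
V1 = V0 @ Q + 0.1 * rng.normal(size=(200, 8))   # "after": rotated + perturbed

r = pearson(relative_positions(V0, cone_axis(V0)),
            relative_positions(V1, cone_axis(V1)))
```

Under a near-rigid rotation the cosine to the (rotated) cone axis is preserved, so the correlation is close to 1; a disordered rearrangement of the class would drive it toward 0.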

4.3. INTER-TASK FORGETTING

Neural network models always suffer from catastrophic forgetting when trained on a succession of different tasks, which we attribute to inter-task forgetting in this work. Similar to the prior evaluation, we continue to use covering cones to investigate the role of memory replay in resisting inter-task forgetting.

(Footnote 10, definition of topological order: Given a non-empty vector set $V$, we can cluster it into disjoint sub-sets $V_1, \dots, V_K$ by the distances between vectors. After learning a new task, the representation vectors of previous tasks rotate to new directions. For any sub-set $V_p$ and any new vector $v_x^p$ within $V_p$, if every new vector $v_y^p \in V_p$ is closer to $v_x^p$ than any vector $v_z^q \in V_q$ ($q \ne p$) in other sub-sets, we say the rotating process of the representation vectors is perfectly topologically ordered.)

When a model decodes a representation vector $v$ via a linear layer followed by soft-max, the decoder can be regarded as a set of column-vectors (i.e., $\{W_y\}_{y \in \mathcal{Y}}$ in Section 2.1), and the predicting process amounts to selecting the one with the largest inner product with $v$. Therefore, it is necessary to check whether the cones of previous tasks rotate toward their corresponding column-vectors in the decoder. In this section, we again examine the model trained first on AGNews and then continuously on Amazon, with a replay interval of 30K, repeated three times. We observe no significant change of the column-vectors in the decoder before and after memory replay, since their rotation angles are less than $1 \times 10^{-3}$, which is negligible. For each replay $t$, we denote the cone axis of class $k$ before and after replay as $c_{t,k}^{-}$ and $c_{t,k}^{+}$, respectively, and its corresponding column-vector in the decoder as $W_k$. The rotation angle of the $k$-th cone can then be estimated as: $\Delta\zeta_{t,k} = \cos(c_{t,k}^{+}, W_k) - \cos(c_{t,k}^{-}, W_k)$. If $\Delta\zeta_{t,k} > 0$, the cone rotated closer to the direction of $W_k$ during replay.
The results illustrated in Figure 4 reveal that memory replay efficiently drives the vectors of previous tasks to rotate toward their corresponding column-vectors in the decoder, while dragging those of the current task away from their optimal positions. Furthermore, this dual process weakens as the number of replays increases. Since the representation space of BERT is high-dimensional while our tasks are finite, alternately learning on the memory and the current task can separate the encoding vectors by mapping them into different sub-spaces. In Appendix F, we provide more visualization results on how memory replay reduces inter-task forgetting, i.e., catastrophic forgetting in the traditional sense.
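The rotation-angle estimate can be computed as below on toy vectors; the sign convention (positive means the axis moved toward $W_k$) follows the interpretation above, and the simulated replay nudge is an illustrative assumption:

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rotation_angle(c_before, c_after, w):
    """delta_zeta = cos(c_after, w) - cos(c_before, w): > 0 means the cone
    axis rotated toward the decoder column-vector w during replay."""
    return cos_sim(c_after, w) - cos_sim(c_before, w)

rng = np.random.default_rng(0)
w = rng.normal(size=8)                        # decoder column-vector W_k
c_before = rng.normal(size=8)                 # cone axis before replay
# Simulate replay pulling the cone axis toward w:
c_after = c_before + 0.3 * w / np.linalg.norm(w)

dz = rotation_angle(c_before, c_after, w)
```

Because the simulated update adds a positive multiple of the unit vector along $w$, the cosine to $w$ strictly increases, so `dz` comes out positive, which is the pattern Figure 4 reports for previous-task cones.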

5. CONCLUSION

In this work, we conduct a probing study to quantitatively measure a PLM's encoding ability for previously learned tasks in a task-incremental learning scenario and find that, differently from previous studies, when learning a sequence of tasks, BERT can retain its encoding ability for previously learned tasks over the long term, even without experience replay. We further examine the topological structures of the representation sub-spaces of the different classes in each task produced by BERT during task-incremental learning. We find that without memory replay, the representation sub-spaces of previous tasks tend to overlap with the current one, but the sub-spaces of different classes within one task remain distinguishable from each other, showing topological invariance to some extent. Our findings help better understand the connections between our new discovery and previous studies of catastrophic forgetting. Limited by the number of tasks, we have not discussed the capacity of BERT when continuously learning more tasks. As far as we know, there is no existing method to measure whether a model has reached its learning capacity and cannot memorize any more knowledge. In the future, we will extend our probing method to longer sequences and different types of tasks, and explore how much knowledge a large pre-trained language model can maintain.

Here, the Web part and the Wikipedia part of TriviaQA (Joshi et al., 2017) are treated as two separate datasets in the orders.

B PROBING ACCURACY SCORES OF ALL ORDERS FOR TEXT CLASSIFICATION

In this section, we illustrate the probing results of all four orders in Figure 5. Following the main body, the background is colored yellow during and after training on the corresponding task. Specially, since Amazon and Yelp share the same labels, we color the background of one light-yellow once the model has been trained on the other.

C ANALYSIS FOR QUESTION ANSWERING TASKS

Similar to the analysis of text classification, we also train models on 4 question answering (QA) tasks in designated orders. To verify whether BERT has the potential to keep knowledge over the long term on QA tasks, we randomly sample 240K examples from each task (with repeated sampling), so that the sizes are two to three times those of the original datasets. We set the batch size to 16 and the learning rate to $3 \times 10^{-5}$ without decay. Additionally, we do NOT use any memory module, which means the models are trained sequentially without memory replay. We save checkpoints every 1,250 steps, and then re-finetune the decoders on the 4 tasks respectively, with the parameters of the BERT encoders fixed. Since QA is here formulated as a sequence-to-sequence task, there may be more than one golden answer span for a question; therefore, we use the F1 score to evaluate the performance of models. All results are illustrated in Figure 6. The results imply that BERT BASE still has the durability to keep previously learned knowledge over the long term on more complex tasks like question answering. In QA, the model employs unified span-position decoders for all 4 tasks; therefore, the original F1 scores for previous tasks (before re-finetuning, red dashed lines) do not decrease to zero, which differs from text classification. Although the catastrophic forgetting problem is not too severe in QA, the models still achieve much better F1 scores after re-finetuning their decoders, considering the gaps between the blue lines and the red dashed lines. In the meantime, we find there is only a limited drop of the blue lines after the models finish learning the corresponding tasks. This means that BERT has a satisfactory potential to keep previous knowledge, even without any memory replay. Our conclusions in Section 5 also apply to question answering tasks.

In each row of Figure 5, we illustrate the results for the 5 tasks separately, where the leftmost is AGNews, followed by Amazon, DBPedia, Yahoo, and Yelp.
To reduce redundant calculations and to provide a concise quantitative analysis, we no longer track the encoding ability of a PLM at every checkpoint. Here, we only measure the encoding ability of a PLM after it has learned all tasks sequentially without any memory replay. All models employ a single-layer network as the decoder, the same as in Section 3.2, and we train models with various PLMs under settings identical to the former experiments. After sequentially training on the five text classification tasks, we save the parameter weights of the PLM encoder and evaluate it with the probe-based method proposed in Section 3.1. We emphasize that different PLMs should perform differently on a task, even when trained under the single-task supervised paradigm. Therefore, we provide results of control tasks (Hewitt & Liang, 2019) as a comparison. Specifically, we train every PLM on every dataset separately, where all parameters of the encoder and decoder can be updated. These fully supervised single-task results can be considered the upper bounds. To check whether a PLM itself can handle these text classification tasks well without downstream fine-tuning, we also present zero-shot probing results as the lower bounds. We download the weights of various PLMs without any fine-tuning from open-source platforms.
Then, we train decoders for every task separately while keeping the original PLM weights fixed (in effect, a probing study under the zero-shot scenario). Comparing with the results of the control tasks, we can examine whether other PLMs retain knowledge of previous tasks as BERT does after learning a sequence of tasks. We list all results (including the upper and lower bounds) in Table 2. From them, we find that although these PLMs have various attention mechanisms and scales, they share a similar intrinsic ability to keep previously learned knowledge. Although trained without Episodic Replay, all these PLMs achieve much better probing results than the lower bounds, regardless of training order. Comparing BERT at different scales, we find that, without Episodic Replay, encoders with more parameters (e.g., BERT-base and BERT-large) maintain old-task knowledge slightly better than those with fewer parameters (e.g., BERT-tiny and BERT-mini). However, encoders of similar scale but different architectures, including BERT-base (Devlin et al., 2019), GPT-2 (base) (Radford et al., 2019), BART (Lewis et al., 2020), and XLNet-base (Yang et al., 2019), maintain old-task knowledge to a similar extent. Therefore, we conjecture that this intrinsic ability to refrain from forgetting comes partly from model scale, while differences in model architecture (e.g., Transformer encoder vs. Transformer decoder) make no obvious contribution.
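The probing protocol trains only a single-layer decoder on top of frozen encoder features. A minimal numpy sketch of such a linear probe (synthetic features stand in for the frozen PLM representations; hyperparameters are illustrative, not the paper's):

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, lr=0.1, epochs=200, seed=0):
    """Fit a single-layer softmax classifier on frozen features.
    feats: (N, d) array of encoder representations (the PLM stays fixed);
    labels: (N,) int array. Returns the probe weight matrix and bias."""
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    W = rng.normal(scale=0.01, size=(d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # cross-entropy gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(feats, labels, W, b):
    return float((np.argmax(feats @ W + b, axis=1) == labels).mean())
```

If the frozen encoder still produces linearly separable features for an old task, the probe recovers high accuracy, which is exactly the signal the probing study measures.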

E STRUCTURE OF REPRESENTATION SPACE

As Gao et al. (2019) and Wang et al. (2020a) note in their work, a large pre-trained language model will embed all words in a narrow cone when trained with a decoder like the one in Section 2.1. Following their observations, we conjecture that the pre-trained language model also generates sentence-level representation vectors of the same label within a narrow cone. To verify this conjecture, we check the cosine similarity of arbitrary pairs of vectors produced by BERT. We select AGNews (Zhang et al., 2015), which has four classes, for investigation. We train a model with BERT and a linear decoder on AGNews for one pass, and then store the representation vectors of the training set by class. For the i-th and the j-th class (1 ≤ i ≤ j ≤ 4), we randomly sample one vector from each of them, repeating 1M times. We can then approximate the cosine distribution of two vectors drawn from the two classes, which is illustrated in Figure 7. From the results, it is obvious that two vectors sampled from the same class have nearly identical directions (cosine close to 1), while two vectors sampled from different classes point in clearly distinct directions. This implies the representation sub-spaces are anisotropic; therefore, we can describe them using convex cones.
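The Monte-Carlo procedure above can be sketched as follows (a minimal illustration over precomputed per-class representation matrices; the paper samples 1M pairs, reduced here for brevity):

```python
import numpy as np

def cosine_samples(class_i, class_j, n_pairs=10000, seed=0):
    """Approximate the cosine distribution between two classes by repeatedly
    drawing one representation vector from each class and recording their
    cosine similarity. class_i, class_j: (N, d) arrays of vectors."""
    rng = np.random.default_rng(seed)
    a = class_i[rng.integers(len(class_i), size=n_pairs)]
    b = class_j[rng.integers(len(class_j), size=n_pairs)]
    cos = np.einsum('ij,ij->i', a, b) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return cos
```

Calling it with `i == j` estimates the intra-class distribution (expected to concentrate near 1), and with `i != j` the inter-class distribution.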

F ADDITIONAL VISUALIZATION RESULTS

Figure 8: Additional visualization results of the representation space during lifelong learning, with points of AGNews and Amazon colored yellow and blue respectively. Specifically, we color the mixed area green; its size should shrink as the model becomes better at distinguishing the two tasks. From left to right, the columns correspond to the moment of just finishing learning AGNews and to the first, second, and third replay. The top row shows results before replay, the bottom row after replay.

In this section, we visualize the change of the representation space before and after memory replay during lifelong learning. Following the experimental setting in Section 4, we first train the model on AGNews, and then on Amazon with three replays. We save all representation vectors after learning AGNews, as well as before and after every replay. Then we adopt t-SNE (van der Maaten & Hinton, 2008) to plot all vectors in the plane. Concerning the mixed areas of the two classes, we conclude that memory replay plays a significant role in mitigating inter-task forgetting. After each replay, the model has a stronger ability to distinguish instances from different tasks, which is characterized by the decrease of the green area in Figure 8. Also, comparing the results across columns, we confirm that although continuously learning one task without a break introduces a little confusion among tasks, sparse memory replay eliminates this confusion effectively. Therefore, a BERT model enhanced by memory replay can resist not only intra-task but also inter-task forgetting.
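The paper does not specify how the green "mixed" area is quantified; one simple proxy (our own construction, not the authors') is the fraction of occupied 2-D grid cells in the t-SNE plane that contain points from both tasks:

```python
import numpy as np

def mixed_area_fraction(pts_a, pts_b, bins=20):
    """Fraction of occupied 2-D grid cells (e.g. t-SNE coordinates) that
    contain points from BOTH tasks -- a rough proxy for the green 'mixed'
    region in Figure 8. pts_a, pts_b: (N, 2) arrays of 2-D coordinates."""
    all_pts = np.vstack([pts_a, pts_b])
    lo, hi = all_pts.min(axis=0), all_pts.max(axis=0)

    def occupancy(pts):
        # Map each point to a grid cell index in [0, bins - 1] per axis.
        idx = np.floor((pts - lo) / (hi - lo + 1e-9) * bins).astype(int)
        grid = np.zeros((bins, bins), dtype=bool)
        grid[idx[:, 0], idx[:, 1]] = True
        return grid

    ga, gb = occupancy(pts_a), occupancy(pts_b)
    return float((ga & gb).sum() / (ga | gb).sum())
```

Under this proxy, a successful replay should drive the fraction down: well-separated task clusters score near 0, heavily overlapping ones near 1.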



Footnotes:
Code will be released at https://github.com/kobayashikanna01/plms_are_lifelong_learners
With a training batch of size 32, sampling 100 instances means it takes only 4 steps to recover the forgotten knowledge.
Since every batch has 32 instances, which does not divide 5,000 evenly, we save parameters at the batches closest to the scheduled points in order to avoid unnecessary disturbance.
We use accuracy scores for text classification, and F1 scores for extractive question answering.
The leftmost sub-figure depicts how a model's probing accuracy on the training set of AGNews changes over the training procedure; the following four sub-figures are for Amazon, DBPedia, Yahoo, and Yelp. We color the background yellow while the model is trained on the corresponding task. Specifically, Amazon and Yelp share the same labels; therefore, we color their backgrounds light-yellow once the model is trained on the other task.
The leftmost is TriviaQA (Wiki), followed by TriviaQA (Web), QuAC, and SQuAD. The F1 scores after re-training the probing decoders are represented by blue lines. As a comparison, we draw the F1 scores of models with their original decoders as red dashed lines from the point the models begin to learn new tasks. We color the background yellow while the model is trained on the corresponding task. Specifically, TriviaQA (Wiki) and TriviaQA (Web) are actually subsets of one task; therefore, we color their backgrounds light-yellow when the model learns the other one.
In QA, the F1 scores on previous tasks do not decrease to zero when learning new tasks, since all QA tasks share the same answer-boundary decoder. Different text classification tasks, by contrast, utilize different dimensions in the decoder, which leads to a more drastic deterioration in old-task scores.
Chosen by dictionary order.



Figure 1: Probing results on five text classification tasks trained by Order 1, illustrated separately by dataset 6 . The abscissas indicate the training order of tasks.

Figure 2: Probing F1 scores on four tasks trained by Order 1, illustrated separately 7 .

Figure 3: Visualization results of representation space after training on tasks by Order 1. Points of AGNews, Amazon & Yelp, DBPedia, Yahoo are colored by blue, orange, green, pink, respectively, while intersection areas of multiple tasks are grey.

DEFINITION OF REPRESENTATION SUBSPACE

Algorithm 1: Calculating the Representation Cone
Input: vector set V, input size n = |V|, initial central axis c_0, learning rate α, termination condition ε
Output: central axis of the cone c
while |V| > ⌈0.95n⌉ do
    Initialize c = c_0
    repeat
        Compute the optimization objective by Eq. 3.
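Eq. 3 and the remainder of Algorithm 1 are not reproduced in this excerpt, so the following is only one plausible reading (our assumption, not the authors' implementation): estimate the central axis as the normalized mean direction and iteratively discard the worst-aligned vectors until 95% remain.

```python
import numpy as np

def cone_axis(V, keep=0.95):
    """Hypothetical sketch of Algorithm 1: fit the central axis of the
    representation cone of a vector set V of shape (n, d), trimming the
    vectors with the largest angle to the axis until `keep` of them remain."""
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    target = int(np.ceil(keep * len(V)))
    while len(V) > target:
        c = V.mean(axis=0)
        c /= np.linalg.norm(c)
        cos = V @ c
        V = np.delete(V, np.argmin(cos), axis=0)   # drop the worst outlier
    c = V.mean(axis=0)
    return c / np.linalg.norm(c)
```

The 95% trimming mirrors the ⌈0.95n⌉ loop condition in the algorithm; the mean-direction objective is our stand-in for Eq. 3.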

Denoting the representation set of class y before learning Amazon as the universal representation set, with the representations after learning Amazon as c_y, for each vector v^(0)_{y,i} we select its n nearest neighbors from V by Euclidean distance, and record their indicator set as N_{y,i}. It is reasonable to believe that these n neighbors carry the most similar semantic information to v^(0)_{y,i}. Then, we can verify whether the representation sub-space of class y is topologically ordered. Here, we compute the correlation between the relative positions of v^(0)_{y,i} and its neighbors before and after learning Amazon.
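A minimal sketch of this neighbor-based check (the paper correlates angles to the cone axis; for simplicity this sketch uses anchor-to-neighbor distances as a stand-in, and all names are ours):

```python
import numpy as np

def neighbor_distance_correlation(before, after, i, n=25):
    """Take the n nearest neighbors of anchor i in the 'before' representation
    space (Euclidean distance), then compute the Pearson correlation between
    the anchor-to-neighbor distances before and after further training.
    A high correlation suggests the class sub-space stays topologically
    ordered. before, after: (N, d) arrays of paired representations."""
    d_before = np.linalg.norm(before - before[i], axis=1)
    neigh = np.argsort(d_before)[1:n + 1]          # skip the anchor itself
    d_after = np.linalg.norm(after - after[i], axis=1)
    return float(np.corrcoef(d_before[neigh], d_after[neigh])[0, 1])
```

If the space is merely rescaled or rigidly moved by continued training, the correlation stays near 1; a shuffled, disordered space drives it toward 0.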

Figure 4: Bar chart for rotation angles during replay, clusters by task label and colored according to replay time.

(a) Results of Order 1. (b) Results of Order 2. (c) Results of Order 3. (d) Results of Order 4.

Figure 5: Probing results of five text classification tasks trained by each order. In each row, we illustrate the results for 5 tasks separately, where the leftmost is AGNews, followed by Amazon, DBPedia, Yahoo, and Yelp.

Figure 6: F1 scores on four tasks trained by 4 different orders. In each row, we plot the results for four tasks separately, where the leftmost is TriviaQA (Wiki), followed by TriviaQA (Web), QuAC, and SQuAD. The F1 scores after re-finetuning the decoders are represented by blue lines; as a comparison, we draw the F1 scores before re-finetuning the decoders as red dashed lines. We color the background yellow while the model is trained on the corresponding task. Specifically, TriviaQA (Wiki) and TriviaQA (Web) are actually subsets of one task; therefore, we color their backgrounds light-yellow once the model is trained on the other one.

Figure 7: Cosine distribution of vector pairs from classes of AGNews, with axes aligned.

Every seen example is added to the memory with a fixed rate γ during training. If we sample n_k examples of the k-th task, in expectation there will be γ·n_k additional instances in M after learning from T_k.
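This write rule can be sketched in a few lines (a minimal illustration; the function name is ours):

```python
import random

def update_memory(memory, new_examples, gamma, rng=random):
    """Add each seen example to the episodic memory independently with
    probability gamma, so after n_k examples of task k the memory grows by
    roughly gamma * n_k entries in expectation."""
    for ex in new_examples:
        if rng.random() < gamma:
            memory.append(ex)
    return memory
```

For instance, streaming 10,000 examples with γ = 0.01 leaves about 100 of them in the memory, matching the γ·n_k expectation above.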

Pearson correlation coefficient (×100) of the angles of v_{1,i} and its n neighbors to the cone axis. The highest scores are in bold, with the second-best underlined. (The first row's n value and first entry are not recoverable from this extraction.)

n       —               —               —              —
?       ? ±3.26         50.44 ±10.29    83.90 ±3.52    73.80 ±3.22
25      81.10 ±3.19     51.46 ±10.27    83.76 ±3.58    73.98 ±3.11
50      80.03 ±3.30     51.06 ±10.56    83.25 ±3.65    73.39 ±3.12
100     78.51 ±3.49     50.16 ±10.58    83.27 ±3.84    72.35 ±3.12

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3987-3995, 2017.

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

D PROBING RESULTS OF VARIOUS PLMS

Discussions in the main body are conducted almost exclusively on the ability of BERT BASE to keep knowledge. BERT (Devlin et al., 2019) is a representative of the PLM family and is widely used in various NLP tasks. We choose BERT in our study since its Transformer-based architecture influences many other PLMs. However, this does not mean BERT is the only PLM with the intrinsic ability to generate high-quality representations for previous tasks over the long term. In this section, we further investigate various other PLMs with different model scales, pre-training procedures, or attention mechanisms. For the pre-trained language models with different attention mechanisms or pre-training strategies, we investigate RoBERTa-base (Liu et al., 2019b), BART-base (Lewis et al., 2020), ELECTRA-base (Clark et al., 2020), XLNet-base (Yang et al., 2019), and GPT-2 (Radford et al., 2019). For the pre-trained language models with different scales, we investigate BERT-tiny, BERT-mini, BERT-small, and BERT-medium, which are distilled versions from Turc et al. (2019), and BERT-large from Devlin et al. (2019). Our probing experiments are detailed below.

ACKNOWLEDGMENT

This work is supported by the National Key R&D Program of China (No. 2020AAA0106600) and the NSFC Grants (No. 62161160339).

