CAN BERT REFRAIN FROM FORGETTING ON SEQUENTIAL TASKS? A PROBING STUDY

Abstract

Large pre-trained language models help achieve state-of-the-art results on a variety of natural language processing (NLP) tasks; nevertheless, they still suffer from forgetting when incrementally learning a sequence of tasks. To alleviate this problem, recent works enhance existing models with sparse experience replay and local adaptation, which yield satisfactory performance. However, in this paper we find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay. To verify the ability of BERT to maintain old knowledge, we adopt and re-finetune single-layer probe networks with the parameters of BERT fixed. We investigate the models on two types of NLP tasks, text classification and extractive question answering. Our experiments reveal that BERT can actually generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay. We further introduce a series of novel methods to interpret the mechanism of forgetting and how memory rehearsal plays a significant role in task-incremental learning, bridging the gap between our new discovery and previous studies on catastrophic forgetting¹.

1. INTRODUCTION

Continual learning aims to obtain knowledge from a stream of data across time (Ring, 1994; Thrun, 1998; Chen & Liu, 2018). As a booming area within continual learning, task-incremental learning requires a model to learn a sequence of tasks without forgetting previously learned knowledge. Training models on a stream of tasks sequentially is a practical scenario, since it avoids exhaustively re-training on all existing data every time a new task arrives. In natural language processing, although many large-scale pre-trained language models (PLMs) have continually set new records on various benchmarks, they cannot be directly deployed in a task-incremental setting: these models tend to perform poorly on previously seen tasks when learning new ones. For instance, a BERT_BASE model trained sequentially on text classification tasks may not be able to make any correct predictions for the first task after learning new ones, with almost-zero accuracy scores (d'Autume et al., 2019). This phenomenon is known as catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999; Rosenstein et al., 2005). Many existing works design novel architectures or components to alleviate forgetting during incremental learning (Kirkpatrick et al., 2017; Zenke et al., 2017; Rebuffi et al., 2017; Mallya & Lazebnik, 2018; d'Autume et al., 2019; Pfeiffer et al., 2020; Sun et al., 2020; Geng et al., 2021; Jin et al., 2022; Qin et al., 2022). Among them, d'Autume et al. (2019) find that an NLP model augmented with sparse memory replay can refrain from forgetting to a great extent. Their method randomly samples 100 instances from old tasks for replay after learning every 10,000 unseen instances. Considering that their method can regain the ability to process previous tasks via merely 100 instances in 4 steps², a question arises: do pre-trained language models like BERT really suffer from forgetting when learning a sequence of tasks? Mehta et al.
(2021) reveal that, under proper hyper-parameters, models with pre-training suffer less catastrophic forgetting than models without pre-training. However, in this work, we specifically focus on the frequency of memory replay. We wonder whether the BERT encoder can still maintain knowledge learned from previous tasks, as it does in d'Autume et al. (2019), with an extremely sparse replay frequency or even without replay.

Probing has become a popular tool to investigate model interpretability (Tenney et al., 2019; Jawahar et al., 2019). For instance, Wu et al. (2022) probe the continual learning ability of a model by comparing the performance of different PLMs trained with different continual learning strategies. In this paper, our main concern is to examine whether PLMs have an intrinsic ability to maintain previously learned knowledge over the long term. We track BERT's encoding ability for specific tasks before, during, and after it learns the corresponding tasks. Comparing the probing results of models trained under different replay frequencies and without memory replay, we find that BERT itself can refrain from forgetting when learning a sequence of tasks. This is somewhat contrary to existing studies on catastrophic forgetting, which further motivates us to investigate how the representations of examples from different tasks are organized in the parameter space. Inspired by prior works (Gao et al., 2019; Wang et al., 2020a), we define the representation sub-space of a class as a convex cone, and provide an algorithm to acquire the narrowest solution. With this toolkit in hand, we find that after learning several tasks without memory replay, the representation sub-spaces of classes from different tasks overlap with each other, whereas the sub-spaces of classes from the same task never overlap throughout training.
The former explains catastrophic forgetting in task-incremental learning from a novel viewpoint of representations, while the latter explains why BERT retains the potential to encode prior tasks even without replay. Our main contributions in this work are: (1) we conduct a thorough study to quantitatively characterize how the representation ability of a PLM like BERT changes as it continuously learns a sequence of tasks; we are the first to track how well BERT encodes previously learned tasks while it continuously learns new ones. (2) Our findings reveal that BERT can actually maintain its encoding ability for already-learned tasks, and has a strong potential to produce high-quality representations for previous tasks over the long term, under extremely sparse replay or even without memory replay, which is contrary to previous studies. (3) We further investigate the topological structure of the learned representation sub-spaces within a task and across different tasks, and find that the forgetting phenomenon can be decomposed into two aspects, intra-task forgetting and inter-task forgetting (Section 4), enabling us to reconcile our findings with previous studies.
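The notion of a class sub-space as a convex cone can be made concrete with a small membership test. The sketch below is a simplification under our own assumptions (it does not reproduce the paper's narrowest-cone algorithm, and `in_conic_hull` and its tolerance are illustrative choices): it checks whether a representation lies in the conic hull of a set of class representations by solving a nonnegative least-squares fit with projected gradient descent.

```python
import numpy as np

def in_conic_hull(V, x, tol=1e-4, iters=5000):
    """Test whether x lies in the convex cone spanned by the rows of V,
    i.e. whether x = V^T c for some coefficient vector c >= 0.

    Solves min_{c >= 0} ||V^T c - x||^2 by projected gradient descent
    and checks whether the residual is (numerically) zero."""
    A = V.T                                    # columns are cone generators
    lr = 1.0 / (np.linalg.norm(A, 2) ** 2)     # 1/L step size for convergence
    c = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ c - x)               # gradient of 0.5*||A c - x||^2
        c = np.maximum(c - lr * grad, 0.0)     # project onto c >= 0
    return np.linalg.norm(A @ c - x) <= tol

# Toy 2-D "class sub-space": the cone spanned by directions (1, 0) and (1, 0.2).
cone = np.array([[1.0, 0.0], [1.0, 0.2]])
```

Under this view, two class sub-spaces overlap exactly when some nonzero direction passes the membership test for both cones, which is the condition studied in Section 4.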

2. BACKGROUND

Following prior work (Biesialska et al., 2020), we consider the task-incremental language learning setting, in which a model learns from a sequence of tasks: samples of former tasks are not accessible during the training steps of later tasks, but samples of all classes in the current task can be acquired simultaneously.
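As a minimal sketch of this protocol (the function and variable names are ours, purely illustrative), the loop below visits tasks strictly in order and exposes only the current task's samples to the update step:

```python
def train_task_incrementally(update, tasks):
    """Sketch of the task-incremental protocol: tasks arrive in a fixed
    order; while training on task T_k, only samples of T_k are accessible
    (all of its classes at once), and earlier tasks' data cannot be
    revisited."""
    visited = []
    for k, task_data in enumerate(tasks):
        for sample in task_data:   # only the current task's samples
            update(sample)         # no access here to tasks T_1 .. T_{k-1}
        visited.append(k)
    return visited
```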

2.1. INVESTIGATED MODEL

In natural language processing, a model can be divided into two parts: a text encoder and a task decoder, with parameters θ_enc and θ_dec, respectively.
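The probing study in this paper re-finetunes only the decoder side of this split. The sketch below is a toy stand-in, assuming a frozen random-projection "encoder" in place of BERT (θ_enc) and a softmax linear classifier as the single-layer probe (θ_dec); all names and dimensions are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_enc):
    """Frozen encoder stand-in (theta_enc): a fixed random projection
    plus tanh; in the paper this role is played by a frozen BERT."""
    return np.tanh(x @ W_enc)

def train_probe(H, y, n_classes, lr=0.5, epochs=300):
    """Single-layer probe (theta_dec): a softmax classifier trained on
    frozen representations H; the encoder parameters are never updated."""
    W = np.zeros((H.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = H @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * H.T @ (p - Y) / len(H)   # cross-entropy gradient step
    return W

# Toy data: two well-separated 2-D classes.
x = np.vstack([rng.normal([2.0, 0.0], 0.3, (20, 2)),
               rng.normal([-2.0, 0.0], 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
W_enc = rng.normal(size=(2, 8))            # "pre-trained" weights, kept frozen
H = encode(x, W_enc)
W_probe = train_probe(H, y, n_classes=2)
accuracy = (np.argmax(H @ W_probe, axis=1) == y).mean()
```

If the frozen encoder still produces linearly separable representations for a task, the probe recovers high accuracy; this is the signal the paper tracks to measure how much encoding ability survives sequential training.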



¹ Code will be released at https://github.com/kobayashikanna01/plms_are_lifelong_learners
² With a training batch of size 32, sampling 100 instances means it takes only 4 steps to recover the forgotten knowledge.
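The sparse-replay arithmetic above (100 stored instances replayed with batch size 32, i.e. ceil(100/32) = 4 steps) can be sketched as a simple schedule. This is a hypothetical implementation with names of our own choosing, not d'Autume et al.'s code:

```python
import random

class SparseReplayScheduler:
    """Sketch of a sparse experience-replay schedule: after every
    `replay_interval` newly seen examples, sample `replay_size` stored
    examples from memory for a few replay steps."""

    def __init__(self, replay_interval=10_000, replay_size=100, seed=0):
        self.replay_interval = replay_interval
        self.replay_size = replay_size
        self.memory = []      # examples stored as training proceeds
        self.seen = 0         # new examples observed since the last replay
        self.rng = random.Random(seed)

    def observe(self, batch):
        """Record a batch of new examples; return examples to replay, if due."""
        self.memory.extend(batch)
        self.seen += len(batch)
        if self.seen >= self.replay_interval and self.memory:
            self.seen = 0
            k = min(self.replay_size, len(self.memory))
            return self.rng.sample(self.memory, k)
        return []
```

With the default settings, a model sees 10,000 fresh examples between replays, so replay touches roughly 1% of the stream, which is why the schedule is called sparse.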



In this setting, the input training stream consists of $K$ ordered tasks $\mathcal{T}_1, \mathcal{T}_2, \dots, \mathcal{T}_K$, where we observe $n_k$ samples $\{(x_i^k, y_i^k)\}_{i=1}^{n_k}$ drawn from the distribution $P_k(\mathcal{X}, \mathcal{Y})$ of task $\mathcal{T}_k$. Our training objective is a general model $f_\theta : \mathcal{X} \to \mathcal{Y}$ that handles all tasks with a limited number of parameters $\theta$, obtained by minimizing the negative log-likelihood averaged over all examples:
$$\min_\theta \; -\frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \log p\left(y_i^k \mid x_i^k; \theta\right),$$
where $N = \sum_{t=1}^{K} n_t$ is the total number of training examples.
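The averaged objective is a direct computation once the model's probabilities for the gold labels are available. A minimal sketch (the function name is ours):

```python
import numpy as np

def average_nll(task_probs):
    """Negative log-likelihood averaged over the whole stream:
    -(1/N) * sum_k sum_i log p(y_i^k | x_i^k; theta),
    where task_probs[k] holds the probabilities the model assigns to the
    gold labels of task T_k, and N = sum_k n_k."""
    all_p = np.concatenate([np.asarray(p, dtype=float) for p in task_probs])
    return -np.log(all_p).sum() / len(all_p)
```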

