IS FORGETTING LESS A GOOD INDUCTIVE BIAS FOR FORWARD TRANSFER?

Abstract

One of the main motivations of studying continual learning is that the problem setting allows a model to accrue knowledge from past tasks to learn new tasks more efficiently. However, recent studies suggest that the key metric that continual learning algorithms optimize, reduction in catastrophic forgetting, does not correlate well with the forward transfer of knowledge. We believe that the conclusion previous works reached is due to the way they measure forward transfer. We argue that the measure of forward transfer to a task should not be affected by the restrictions placed on the continual learner in order to preserve knowledge of previous tasks. Instead, forward transfer should be measured by how easy it is to learn a new task given a set of representations produced by continual learning on previous tasks. Under this notion of forward transfer, we evaluate different continual learning algorithms on a variety of image classification benchmarks. Our results indicate that less forgetful representations lead to a better forward transfer suggesting a strong correlation between retaining past information and learning efficiency on new tasks. Further, we found less forgetful representations to be more diverse and discriminative compared to their forgetful counterparts.

1. INTRODUCTION

Continual learning aims to improve learned representations over time without having to train from scratch as more data or tasks become available. This objective is especially relevant in the context of large-scale models trained on massive datasets, where training from scratch is prohibitively costly. However, standard stochastic gradient descent (SGD) training, which relies on the IID assumption of data, results in severely degraded performance on old tasks when the model is continually updated on new tasks. This phenomenon is referred to as catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2016) and has been an active area of research (Kirkpatrick et al., 2016; Lopez-Paz & Ranzato, 2017; Mallya & Lazebnik, 2018). Intuitively, the reduction in catastrophic forgetting allows the learner to accrue knowledge from the past and use it to learn new tasks more efficiently: with less training data, less compute, better final performance, or any combination thereof. This phenomenon of efficiently learning new tasks using previous information is referred to as forward transfer. Catastrophic forgetting and forward transfer are often thought of as competing desiderata of continual learning, where one has to strike a balance between the two depending on the application at hand (Hadsell et al., 2020). Specifically, Wolczyk et al. (2021) recently studied the interplay of forgetting and forward transfer in the robotics context, and found that many continual learning approaches alleviate catastrophic forgetting at the expense of forward transfer. This is indeed unavoidable if the capacity of the model is less than the amount of information we intend to store. However, assuming that the model has sufficient capacity to learn all the tasks simultaneously, as in multitask learning, one might think that a less forgetful model could transfer its retained knowledge to future tasks when they are similar to past ones.
In this work, therefore, we argue for looking at the trade-off between forgetting and forward transfer from the right perspective. Typically, forward transfer is measured as the learning accuracy on a task after the continual learner has already made training updates on the task (Wolczyk et al., 2021; Chaudhry et al., 2019a; Lopez-Paz & Ranzato, 2017). However, since such training updates are usually modified to preserve performance on previous tasks (e.g., EWC (Kirkpatrick et al., 2016)), a competition arises between maximizing learning accuracy and mitigating catastrophic forgetting. Therefore, we argue for a measure of forward transfer that is unconstrained by any training modifications made to preserve previous knowledge. We propose to use an auxiliary evaluation of continually trained representations as a measure of forward transfer that is separate from the continual training of the model. Specifically, at the arrival of a new task, we fix the representations learned on the previous task and evaluate them on the new task. This evaluation is done by learning a temporary classifier using a small subset of data from the new task and measuring its performance on the test set of the task. The continual training on the new task then proceeds with the updates to the representations (and the classifier) using the full training dataset of the task. We note that this notion of forward transfer removes the tug of war between forgetting the previous tasks and transferring to the next task, and it is with this notion of transfer that we ask the question: are less forgetful representations more transferable? We analyze the interplay of catastrophic forgetting and forward transfer on several supervised continual learning benchmarks and algorithms.
For this work, we restrict ourselves to the task-based continual learning setting, where task information is assumed at both train and test times, as it makes the aforementioned evaluation based on auxiliary classification at fixed points easily interpretable. Our results demonstrate that a less forgetful model in fact transfers better (cf. Figure 1). We find this observation to hold both for randomly initialized models and for models that are initialized from a pre-trained model. We further analyze the reasons for this better transferability and find that less forgetful models result in more diverse and easily separable representations, making it easier to learn a classifier head on top. With these results, we want to emphasize that the continual learning community should look at the trade-off between forgetting and forward transfer from the right perspective. The learning accuracy based measure of forward transfer is useful for end-to-end learning on a fixed benchmark, and it creates a trade-off between forgetting and forward transfer, as rightly demonstrated by Hadsell et al. (2020) and Wolczyk et al. (2021). However, in the era of foundation models, where pretrain-then-finetune is a dominant paradigm and where one often does not know a priori the tasks on which a foundation model will be finetuned, a measure of forward transfer that looks at the capability of a backbone model to be finetuned on several downstream tasks is perhaps a more apt measure. The rest of the paper is organized as follows. In Section 2, we describe the training and evaluation setups considered in this work. In Section 3, we provide experimental details followed by the main results of the paper. Section 4 discusses the works most relevant to our study. We conclude with Section 5, providing some hints as to how the findings of this study can be useful for future research.

2. PROBLEM SETUP AND METRICS

We consider a supervised continual learning setting consisting of a sequence of tasks $\mathcal{T} = \{T_1, \cdots, T_N\}$. A task $T_j$ is defined by a dataset $D_j = \{(x_i, y_i, t_i)\}_{i=1}^{n_j}$, consisting of $n_j$ triplets, where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $t \in \mathcal{T}$ are the input, label and task id, respectively. Each $D_j = \{D_j^{tr}, D_j^{val}, D_j^{te}\}$ consists of train, validation and test sets. At a given task 'j', the learner may have access to all the previous tasks' datasets $\{D_i\}_{i<j}$, but it will not have access to the future tasks. We define a feed-forward neural network consisting of a feature extractor $\Phi : \mathcal{X} \to \mathbb{R}^D$ and a task-specific classifier $\Theta_j : \mathbb{R}^D \times \mathcal{T} \to \mathcal{Y}_j$, that implements an input to output mapping $f_j = (\Theta_j \circ \Phi) : \mathcal{X} \times \mathcal{T} \to \mathcal{Y}_j$. The neural network is trained by minimizing a loss $\ell_j : f_j(\mathcal{X}, \mathcal{T}) \times \mathcal{Y}_j \to \mathbb{R}^+$ using stochastic gradient descent (SGD) (Bottou, 2010). While we consider image classification tasks and use cross-entropy loss for each task, the approach would be applicable to other tasks and loss functions as well. The learner updates a shared feature extractor ($\Phi$) and task-specific heads ($\Theta_j$) throughout the continual learning experience. After training on each task 'i', we measure the performance of the learner on all the tasks observed so far. Let $\mathrm{Acc}(i, j)$ be the accuracy of the model on $D_j^{te}$ after the feature extractor is updated with $T_i$. We define the average forgetting metric at task 'i' similar to (Lopez-Paz & Ranzato, 2017):

$$\mathrm{Fgt}_i = \frac{1}{i-1} \sum_{j=1}^{i-1} \left( \mathrm{Acc}(i, j) - \mathrm{Acc}(j, j) \right).$$

The average forgetting metric ($\in [-1, 1]$) throughout the continual learning is then defined as,

$$\mathrm{AvgFgt} = \frac{1}{N-1} \sum_{i=2}^{N} \mathrm{Fgt}_i. \tag{1}$$

A negative value of $\mathrm{Fgt}_i$ indicates that the learner has lost performance on the previous tasks, and the more negative $\mathrm{AvgFgt}$ is, the more forgetful the representations are of the previous knowledge.

2). Let $\Theta$ be the temporary (linear) classifier head learned on top of the fixed $\Phi_j$ using $S^k_{j+1}$.
We measure the accuracy of this temporary classifier on the test set of task 'j+1' and denote it as $\mathrm{Fwt}^k_j$. This is called the forward transfer of the learned representations $\Phi_j$ to the next task 'j+1'. The average forward transfer throughout the continual learning is then defined as,

$$\mathrm{AvgFwt}^k = \frac{1}{N-1} \sum_{j=1}^{N-1} \mathrm{Fwt}^k_j. \tag{2}$$

We note that linear probing is an auxiliary evaluation process where model updates during evaluation remain distinct from the updates made by the continual learner while observing a task sequence. Contrary to this, in most prior works (Wolczyk et al., 2021; Lopez-Paz & Ranzato, 2017), forward transfer is measured after the continual learner has made updates on the task. Such updates typically restrict the learning on the current task to alleviate catastrophic forgetting on the previous tasks. This causes the learner to perform worse on the current task compared to a learner that is not trying to mitigate catastrophic forgetting. We sidestep this dilemma by separating the updates made by the continual learner on a new task from the temporary updates made during auxiliary evaluation on a copy of the model. We also note that, similar to linear probing, one could finetune the whole model, including the representations, during the auxiliary evaluation. The main argument is to decouple the notion of forward transfer from modifications made by the continual learning algorithm to preserve knowledge of the previous tasks.

Feature Diversity. In addition to $\mathrm{AvgFgt}$ (Equation 1) and $\mathrm{AvgFwt}^k$ (Equation 2), we also measure how diverse and easily separable the features of our trained models are, in order to analyze the transferability of the representations. Specifically, let $\Psi_j \in \mathbb{R}^{m \times D}$ be the feature matrix computed using the feature extractor $\Phi_j$ (obtained after training on task 'j') on the 'm' test examples of task 'j+1'. Let $\Psi_j^c$ be a sub-matrix constructed by collecting the rows of $\Psi_j$ that belong to class 'c'.
Similar to (Wu et al., 2021; Yu et al., 2020), we define the feature diversity score of $\Phi_j$ as

$$\mathrm{FDiv}_j = \log \left| \alpha \Psi_j^\top \Psi_j + I \right| - \sum_{c=1}^{C_j} \log \left| \alpha_j {\Psi_j^c}^\top \Psi_j^c + I \right|,$$

where $|\cdot|$ is the matrix determinant operator, $\alpha = D/(m\epsilon^2)$, $\alpha_j = D/(m_j\epsilon^2)$, $\epsilon = 0.5$, and $C_j$ denotes the number of classes for task 'j'. The average feature diversity score throughout the continual learning experience is then defined as,

$$\mathrm{AvgFDiv} = \frac{1}{N-1} \sum_{j=1}^{N-1} \mathrm{FDiv}_j. \tag{3}$$

The intuition behind using this score is that features that enforce high inter-class separation and low intra-class variability should make it easier to learn a classifier head on top, leading to a better transfer to the next tasks.
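The feature diversity score $\mathrm{FDiv}_j$ defined above can be illustrated with a short NumPy sketch. This is only a sketch under stated assumptions, not the authors' implementation: the function names `log_det_term` and `feature_diversity` are hypothetical, and the per-class coefficient $\alpha_j$ is assumed to use the number of rows of the class sub-matrix $\Psi_j^c$ (our reading of $m_j$).

```python
import numpy as np


def log_det_term(feats, dim, eps):
    """Return log|alpha * F^T F + I| with alpha = dim / (n_rows * eps^2)."""
    n_rows = feats.shape[0]
    alpha = dim / (n_rows * eps ** 2)
    gram = alpha * feats.T @ feats + np.eye(dim)
    # slogdet avoids overflow/underflow that log(det(.)) can suffer from.
    _, logdet = np.linalg.slogdet(gram)
    return logdet


def feature_diversity(features, labels, eps=0.5):
    """Feature diversity of an (m x D) feature matrix with integer class labels.

    Assumption: each per-class term uses the number of examples in that
    class in place of m (our interpretation of m_j in the text).
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    dim = features.shape[1]
    total = log_det_term(features, dim, eps)
    per_class = sum(
        log_det_term(features[labels == c], dim, eps) for c in np.unique(labels)
    )
    return total - per_class


# Toy usage: two classes on orthogonal axes versus two collapsed classes.
separated = feature_diversity([[1, 0], [1, 0], [0, 1], [0, 1]], [0, 0, 1, 1])
collapsed = feature_diversity([[1, 0], [1, 0], [1, 0], [1, 0]], [0, 0, 1, 1])
```

In this toy case the separated features receive the higher score, which matches the intuition stated above; AvgFDiv would then be the mean of such scores over the task sequence.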

3.1. SETUP

We now briefly describe the experimental setup including the benchmarks, approaches and training details. More details can be found in Appendix A. After the experimental details, we provide the main results of the paper.

Benchmarks

• Split CIFAR-10: We split the CIFAR-10 dataset (Krizhevsky et al., 2009) into 5 disjoint subsets corresponding to 5 tasks. Each task has 2 classes.
• Split CIFAR-100: We split the CIFAR-100 dataset (Krizhevsky et al., 2009) into 20 disjoint subsets corresponding to 20 tasks. Each task has 5 classes.
• CIFAR-100 Superclasses: We split the CIFAR-100 dataset into 5 disjoint subsets corresponding to 5 tasks. Each task has 20 classes, one from each of the 20 superclasses in CIFAR-100.
• CLEAR: This is a continual image classification benchmark by Lin et al. (2021), built from YFCC100M (Thomee et al., 2016) images, containing the evolution of object categories from the years 2005-2014. There are 10 tasks, each containing images in chronological order from the years 2005-2014. We consider both the CLEAR10 (consisting of 10 object classes) and CLEAR100 (consisting of 100 object classes) variants of the benchmark.
• Split ImageNet: We split the ImageNet (Russakovsky et al., 2015) dataset into 100 disjoint subsets corresponding to 100 tasks. Each task has 10 classes.
For all the benchmarks, except Split ImageNet, we considered continual learning from a randomly initialized model as well as from a pre-trained ImageNet model. For Split ImageNet, we only considered continual learning from a randomly initialized model.

Approaches

Below we describe the approaches considered in this work. Except for the independent baseline, all other baselines reuse the model, i.e., continue training the same model used for the previous tasks.
• Independent (IND): Trains a model from an initial model (either randomly initialized or pre-trained) on each task independently.
• Finetuning (FT): Trains a single model on all the tasks in a sequence, one task at a time.
• Linear-Probing-Finetuning (LP-FT): LP-FT (Kumar et al., 2021) is the same as FT except that before each task training, we first learn a task-specific classifier $\Theta$ for the task via linear probing and then train both the feature extractor $\Phi$ and the classifier $\Theta$ on the task to reduce the feature drift.
• Multitask (MT): Trains the model on the data from both the current and previous tasks using the multitask training objective (Equation 6 in the Appendix). The data of previous tasks is used as an auxiliary loss while learning on the current task.
• Experience Replay (ER): Uses a replay buffer $M = \cup_{i=1}^{N-1} M_i$ when learning on the task sequence, where $M_i$ stores $m$ examples per class from the task $T_i$. It trains the model on the data from both the current task and the replay buffer when learning on the current task using an ER training objective (Equation 7 in the Appendix) (Chaudhry et al., 2019b). There are two main differences between MT and ER: (1) MT uses all the data from the previous tasks while ER only uses limited data from the previous tasks; (2) MT chooses the coefficient for the auxiliary loss via cross-validation while ER always sets it to 1.
• AGEM: Projects the gradient when doing the SGD updates so that the average loss on the data from the episodic memory does not increase (Chaudhry et al., 2019a). The episodic memory stores $m$ examples per class from each task.
• FOMAML: First-order MAML (FOMAML) is a meta-learning approach proposed by Finn et al. (2017). We modify FOMAML such that it can be used in the continual learning setting. Similar to MT, FOMAML uses all the data from the previous tasks when learning the current task. The training objective of FOMAML aims to enable knowledge transfer between different batches from the same task. The learning algorithm for FOMAML is provided in Appendix A.2.

Architecture and Training details. We use the ResNet50 (He et al., 2016) architecture as the feature extractor $\Phi$ on all benchmarks.
On CLEAR10 and CLEAR100, we use a single classification head $\Theta$ that is shared by all the tasks (single-head architecture), while on other benchmarks, we use a separate classification head $\Theta_i$ for each task $T_i$ (multi-head architecture). We use SGD to update the model parameters and use cosine learning rate scheduling (Loshchilov & Hutter, 2016) baselines, we use a base learning rate of 0.01. When using a random initialization as the initial model $f_0$, on Split CIFAR-10, Split CIFAR-100 and CIFAR-100 Superclasses, we train the model for 50 epochs per task, while on CLEAR10, CLEAR100 and Split ImageNet, we train the model for 100 epochs per task. When using a pre-trained model as the initial model $f_0$, on all benchmarks, we train the model for 20 epochs per task as we found it sufficient for training convergence. For CLEAR10 and CLEAR100, the results are averaged over 5 different runs with different random seeds, each corresponding to a different network initialization, where the task order is fixed. For other benchmarks, the results are averaged over 5 different runs, where each run corresponds to a different random ordering of tasks. The results are reported as averages and 95% confidence interval estimates of these 5 runs. For k-shot linear probing, we use SGD with a fixed learning rate of 0.01. We train the classifier head $\Theta$ for 100 epochs on the k-shot dataset $S^k_{j+1}$ as we found it sufficient for training convergence. On CLEAR100, we consider $k \in \{5, 10, 20, 40\}$ while on other benchmarks, we consider $k \in \{5, 10, 20, 100\}$.
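The k-shot linear probing evaluation described above can be sketched as follows. This is an illustrative NumPy sketch rather than the authors' code: it assumes features have already been extracted by the frozen feature extractor, that "k-shot" means k examples per class, and that labels are integers in 0..C-1. The function names and the batch size are hypothetical; the learning rate (0.01) and number of epochs (100) follow the text.

```python
import numpy as np


def sample_k_shot(features, labels, k, rng):
    """Pick k examples per class (assumed meaning of 'k-shot')."""
    idx = []
    for c in np.unique(labels):
        cls_idx = np.flatnonzero(labels == c)
        idx.extend(rng.choice(cls_idx, size=k, replace=False).tolist())
    return features[idx], labels[idx]


def train_linear_probe(feats, labels, num_classes, lr=0.01, epochs=100,
                       batch_size=8, rng=None):
    """Train a softmax linear head on frozen features with plain SGD."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = feats.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = perm[start:start + batch_size]
            logits = feats[batch] @ W + b
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            probs = np.exp(logits)
            probs /= probs.sum(axis=1, keepdims=True)
            # Gradient of the mean cross-entropy loss w.r.t. the logits.
            grad = (probs - onehot[batch]) / len(batch)
            W -= lr * feats[batch].T @ grad
            b -= lr * grad.sum(axis=0)
    return W, b


def probe_accuracy(W, b, feats, labels):
    """Accuracy of the linear head; on test features this plays the role of Fwt."""
    return float(np.mean((feats @ W + b).argmax(axis=1) == labels))
```

Only the head parameters `W` and `b` are updated here, which mirrors the point that the auxiliary evaluation leaves the continually trained representations untouched.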

LESS FORGETFUL REPRESENTATIONS TRANSFER BETTER

We assess the compatibility between forgetting and transferability through the AvgFgt and $\mathrm{AvgFwt}^k$ metrics described in Section 2. Figures 1 and 3 show these two metrics for Split CIFAR-100, Split CIFAR-10 and CLEAR100, respectively, when the continual learning experience begins from a randomly initialized model (the comparison on the other benchmarks is provided in Appendix B.1). It can be seen from the figures that if a model has less average forgetting, the corresponding model representations have a better k-shot forward transfer. For example, on all three benchmarks visualized in the figures, FOMAML and MT tend to have the least amount of average forgetting. Consequently, the $\mathrm{AvgFwt}^k$ of these two baselines is higher compared to all the other baselines, for all the values of k considered in this work. Note that the ranking of the other methods in terms of correspondence between forgetting and forward transfer is roughly maintained as well. This shows that when the continual learning experience begins from a randomly initialized model, retaining the knowledge of the past tasks, or forgetting less on those tasks, is a good inductive bias for forward transfer. Recently, Mehta et al. (2021) showed that pre-trained models tend to forget less, compared to randomly initialized models, when trained on a sequence of tasks. We build upon this observation and ask whether forgetting less on both the upstream (pre-trained) task and the downstream tasks improves the transferability of the representations. Figure 4 shows the comparison between forgetting (left) and forward transfer (middle) on Split CIFAR-10 and Split CIFAR-100 when the continual learning experience begins from a pre-trained model (the comparison on the other benchmarks is provided in Appendix B.1). It can be seen from the figure that, except for LP-FT, less forgetting is a good indicator of a better forward transfer.
To understand why LP-FT has a better forward transfer than FOMAML and MT, despite having higher forgetting on the continual learning benchmark at hand, we evaluate the continually updated representations on the upstream data (the test set of ImageNet). The evaluation results are given in the right plot of Figure 4. From the plot, it can be seen that LP-FT has retained better upstream performance (relatively speaking) compared to the other baselines. This follows our general thesis that retaining 'previous knowledge', evidenced here by the past performance on both the upstream and downstream tasks, is a good inductive bias for forward transfer. If, instead of freezing the representations and just updating the classifier, we finetune the whole model in the auxiliary evaluation, the less forgetful representations still transfer better (refer to Appendix B.5). In order to aggregate the metrics across different methods and to see a global trend between forgetting and forward transfer, we compute the Spearman rank correlation between AvgFgt and $\mathrm{AvgFwt}^k$. Table 1 shows the correlation values for different values of 'k' for both randomly initialized and pre-trained models. It can be seen from the table that most of the entries are above 0.5 and statistically significant (p < 0.01), showing that reducing forgetting improves the forward transfer across the board.
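The aggregation step described above, computing AvgFgt per method and then a Spearman rank correlation against the forward transfer, can be sketched in plain Python. This is a hedged illustration: the function names are hypothetical, the accuracy matrix is 0-indexed (`acc[i][j]` is the accuracy on task j after training on task i), the numbers below are made-up toy values rather than results from the paper, and no p-value is computed.

```python
def average_forgetting(acc):
    """AvgFgt from a lower-triangular accuracy matrix acc[i][j], i >= j."""
    n = len(acc)
    fgt = []
    for i in range(1, n):
        # Mean accuracy change on all earlier tasks after training on task i.
        fgt.append(sum(acc[i][j] - acc[j][j] for j in range(i)) / i)
    return sum(fgt) / (n - 1)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            result[order[t]] = avg
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical toy accuracy matrix for a 3-task sequence (not from the paper).
toy_acc = [[0.9, 0.0, 0.0],
           [0.7, 0.8, 0.0],
           [0.6, 0.7, 0.9]]
toy_avg_fgt = average_forgetting(toy_acc)
```

Because AvgFgt is more negative for more forgetful methods, a positive correlation between AvgFgt and the forward transfer score corresponds to less forgetful methods transferring better.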

LESS FORGETFUL REPRESENTATIONS ARE MORE DIVERSE

We now look at what makes the less forgetful representations amenable to better forward transfer. We hypothesize that less forgetful representations maintain more diversity and discrimination in the features, making it easy to learn a classifier head on top and leading to better forward transfer. To measure this diversity of representations, we look at the feature diversity score AvgFDiv, as defined in Equation 3, and compare it with the average forgetting score AvgFgt. representations that have higher AvgFDiv score. Similarly, on pre-trained models, methods with lower overall forgetting between the upstream and downstream tasks, such as LP-FT, lead to the highest AvgFDiv score. These results suggest that less forgetful representations tend to be more diverse and discriminative.

4. RELATED WORKS

Continual Learning (also known as Life-long Learning) (Ring, 1995; Thrun, 1995) aims to learn a model on a sequence of tasks that has good performance on all the tasks observed so far. However, SGD training, which relies on the IID assumption of data, tends to result in degraded performance on older tasks when the model is updated on new tasks. This phenomenon is known as catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2014) and it has been a main focus of continual learning research. Several methods have been proposed to alleviate catastrophic forgetting, ranging from regularization-based approaches (Kirkpatrick et al., 2016; Aljundi et al., 2018; Nguyen et al., 2018; Zenke et al., 2017), to methods based on episodic memory (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a; Aljundi et al., 2019; Hayes et al., 2018; Riemer et al., 2019; Rolnick et al., 2018; Prabhu et al., 2020), to algorithms based on parameter isolation (Yoon et al., 2018; Mallya & Lazebnik, 2018; Wortsman et al., 2020; Mirzadeh et al., 2021b; Farajtabar et al., 2020). Besides the algorithmic innovations to reduce catastrophic forgetting, some recent works looked at the role of training regimes (Mirzadeh et al., 2020) and network architectures (Mirzadeh et al., 2021a; 2022) in understanding the catastrophic forgetting phenomenon. While a learner that reduces catastrophic forgetting tries to preserve the knowledge of the past tasks, often what is more important is to utilize the accrued knowledge to learn new tasks more efficiently, a phenomenon known as forward transfer (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a). In most existing works, the forward transfer to a task is measured as the learning accuracy of the task after training on the task is finished. Hadsell et al. (2020) and Wolczyk et al. (2021) argued that continual learning methods that avoid catastrophic forgetting do not improve the forward transfer; in fact, sometimes the catastrophic forgetting is reduced at the expense of the forward transfer (as measured by the learning accuracy). This raises the question of whether reducing catastrophic forgetting is a good objective for continual learning research, or whether the community should shift its focus to forward transfer, as there seems to be a tug of war between the two. Contrary to previous work, here we take an auxiliary evaluation perspective on forward transfer: instead of asking whether reducing forgetting on previous tasks, during training on the current task, improves the current task learning, we ask whether a learner that has less forgetting on previous tasks results in network representations that can quickly be adapted to new tasks. We argue that this mode of measuring forward transfer decouples the notion of transfer from the restricted updates on the current task employed by a continual learner to avoid forgetting on previous tasks. To the best of our knowledge, most similar to our work are Javed & White (2019) and Beaulieu et al. (2020), who also looked at the network representations in the context of continual learning. But they took a converse perspective, arguing that learning transferable representations via meta-learning alleviates catastrophic forgetting.

5. CONCLUSION

We are interested in understanding how to continuously accrue knowledge for sample efficient learning of downstream tasks. Similar to some previous works, here we question what effect alleviating catastrophic forgetting has on the efficiency of learning new tasks. However, by contrast, we study forward transfer by the auxiliary evaluation of continually trained representations learned through the course of training on a sequence of tasks. To this end, we evaluated several training algorithms on a sequence of tasks and find that our forward transfer metric is highly correlated with the amount of knowledge retention (i.e. less negative forgetting score), indicating that forgetting less may serve as a good inductive bias for forward transfer. The question of how to accrue knowledge from the past tasks to learn new tasks more efficiently is ever more relevant with the recent advancements in the large scale models trained using internet scale data, aka foundation models, where we would want to avoid initialization from scratch to save computation time. Our suggested measure of forward transfer, that evaluates continually trained representations, also fits nicely in the context of comparing generalization of different large scale models, where a model that can transfer to multiple downstream tasks is preferred. We are in the era of discovering new capabilities of models, as new capabilities emerge with larger scale. The extrapolation of our findings could mean that a less forgetful foundation model -where forgetting is evaluated on the upstream data -should be preferred over a forgetful model, as the former could transfer better to downstream tasks. This serves as a useful model selection mechanism which can be further explored in future research.

Supplementary Material

A EXPERIMENTAL DETAILS

A.1 DATASETS

We describe the details of the datasets used in this paper below:

CIFAR-10.

The CIFAR-10 dataset (Krizhevsky et al., 2009) consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. We reserve 10,000 training images as the validation data, so the training set used has 40,000 images. We split the CIFAR-10 dataset into 5 disjoint subsets to create the Split CIFAR-10 benchmark. Split CIFAR-10 has 5 tasks corresponding to the 5 disjoint subsets and each task has 2 classes. During training, we apply random cropping and random horizontal flipping to the training images.

CIFAR-100.

The CIFAR-100 dataset (Krizhevsky et al., 2009) is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). We reserve 10,000 training images as the validation data, so the training set used has 40,000 images. We use the CIFAR-100 dataset to create two benchmarks: Split CIFAR-100 and CIFAR-100 Superclasses. The Split CIFAR-100 benchmark is created by splitting the CIFAR-100 dataset into 20 disjoint subsets corresponding to 20 tasks. Each task in Split CIFAR-100 has 5 classes. The CIFAR-100 Superclasses benchmark is created by splitting the CIFAR-100 dataset into 5 disjoint subsets corresponding to 5 tasks. Each task in CIFAR-100 Superclasses has 20 classes, one from each of the 20 superclasses. For both the Split CIFAR-100 and CIFAR-100 Superclasses benchmarks, during training, we apply random cropping and random horizontal flipping data augmentations to the training images.

CLEAR.
CLEAR (Lin et al., 2021) is the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts in the real world that spans a decade (2005-2014). It contains two continual learning benchmarks, CLEAR10 and CLEAR100. The original CLEAR10 has 33,000 training images and 5,500 test images with 10 tasks and 11 classes (including a BACKGROUND class). We remove the BACKGROUND class and reserve 5,000 training images as the validation data, so the training set used has 25,000 images and the test set used has 5,000 images. The original CLEAR100 has 99,963 training images and 50,000 test images with 10 tasks and 100 classes. We reserve 19,991 training images as the validation data, so the training set used has 79,972 images. Each task in CLEAR10 (or CLEAR100) contains images from a certain year (2005-2014). For both the CLEAR10 and CLEAR100 benchmarks, we resize the images to 224 × 224 and use random cropping and random horizontal flipping data augmentations during training.

ImageNet.

ILSVRC 2012, commonly known as "ImageNet" (Russakovsky et al., 2015) is a large scale image dataset with 1,000 classes organized according to the WordNet hierarchy. It has 1,281,167 training images and 50,000 validation images with labels. It also has 100,000 test images but without labels. We use the validation images as the test set and reserve 300,000 training images (300 images per class) as the validation set. So the training set used contains 981,167 images. We split the ImageNet dataset into 100 disjoint subsets corresponding to 100 tasks. Each task has 10 classes. During training, we use random cropping and random horizontal flipping data augmentations. To reduce computational cost, we resize the images to 64 × 64.
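The class-based splits described in this appendix (for example, CIFAR-100 into 20 tasks of 5 classes, or ImageNet into 100 tasks of 10 classes) can be sketched as below. The function names are hypothetical, and the random assignment of classes to tasks is an assumption; the text only states that the subsets are disjoint. The sketch does not cover CIFAR-100 Superclasses, where each task takes one class per superclass, or CLEAR, where tasks are defined by year.

```python
import random


def split_classes_into_tasks(num_classes, num_tasks, seed=0):
    """Partition class ids 0..num_classes-1 into equally sized disjoint tasks."""
    if num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by num_tasks")
    classes = list(range(num_classes))
    # Assumption: classes are assigned to tasks at random.
    random.Random(seed).shuffle(classes)
    per_task = num_classes // num_tasks
    return [sorted(classes[t * per_task:(t + 1) * per_task])
            for t in range(num_tasks)]


def task_indices(labels, task_classes):
    """Indices of the examples whose label belongs to the given task."""
    wanted = set(task_classes)
    return [i for i, y in enumerate(labels) if y in wanted]


# Split CIFAR-100 style: 100 classes into 20 tasks with 5 classes each.
cifar100_tasks = split_classes_into_tasks(num_classes=100, num_tasks=20)
```

Applying `task_indices` to the train, validation and test label lists separately would give the per-task subsets of each split.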

A.2 BASELINES

We consider the following approaches for learning on a sequence of tasks $T_1, \ldots, T_N$. Except for the independent baseline, all other baselines reuse the model (i.e., continue training the same model used for the previous tasks). Specifically, when learning on the task $T_i$ ($i > 1$), the continual learning method uses the model $f_{i-1}$ learned on the previous task $T_{i-1}$ as an initialization for the model $f_i$, which we call model reusing. For training on the first task $T_1$, the continual learning method initializes the model with an initial model $f_0$ (either a random initialization or a pre-trained model).

Independent (IND).

The IND baseline learns on each task independently. That is, when learning on the task $T_i$, it trains the model from an initial model (either a random initialization or a pre-trained model) using the Empirical Risk Minimization (ERM) objective:

$$\min_{f_i} \; \mathbb{E}_{(x,y,t) \sim D_i} \, \ell_i(f_i(x, t), y) \tag{4}$$

We use the IND baseline as a reference to see how well we can learn on a task without learning on other tasks in the task sequence.

Finetuning (FT).

Finetuning is a simple baseline for continual learning. It trains a single model on all the tasks in a sequence. When training the model $f_i$ on the current task $T_i$, it doesn't use the data from the previous tasks. It uses the ERM objective (equation 4) to train the model $f_i$ on the current task $T_i$ only.

Linear-Probing-Finetuning (LP-FT).

Linear-Probing-Finetuning (Kumar et al., 2021) is the same as the Finetuning baseline except that before each task training, we first learn a task-specific classifier $\Theta$ for the task via linear probing and then train both the feature extractor $\Phi$ and the classifier $\Theta$ on the task. That is, when learning on the current task $T_i$, we have a two-stage training process. In the first stage, we train the classifier $\Theta_i$ while fixing the feature extractor $\Phi_i$ via the ERM training objective:

$$\min_{\Theta_i} \; \mathbb{E}_{(x,y,t) \sim D_i} \, \ell_i(f_i(x, t), y; \Theta_i, \Phi_i) \tag{5}$$

In the second stage, we train both the classifier $\Theta_i$ and the feature extractor $\Phi_i$ via the ERM training objective (equation 4). We only use LP-FT in the setting where the initial model $f_0$ is a pre-trained model.

Multitask (MT).

The Multitask baseline trains the model on the data from both the current task and the previous tasks when learning on the current task $T_i$. It uses the following multitask training objective:

$$\min_{\Phi_i, \{\Theta_j\}_{j=1}^{i}} \; \mathbb{E}_{(x,y,t)\sim D_i}\, \ell_i(f_i(x,t), y; \Theta_i, \Phi_i) + \lambda \cdot \mathbb{E}_{1 \le j < i}\, \mathbb{E}_{(x,y,t)\sim D_j}\, \ell_j(f_j(x,t), y; \Theta_j, \Phi_i) \tag{6}$$

It trains a shared feature extractor $\Phi_i$ and task-specific classifiers $\{\Theta_j\}_{j=1}^{i}$ for the tasks $T_1, \dots, T_i$. The hyperparameter $\lambda$ is chosen from the set {1.0, 0.1, 0.01} based on the average learning accuracy across tasks on the validation set.

Experience Replay (ER). The ER baseline uses a replay buffer $M = \cup_{i=1}^{N-1} M_i$ when learning on the sequence of tasks, where $M_i$ stores examples from the task $T_i$. In our work, we restrict the replay buffer $M$ to store only $m$ examples per class from each task. The training objective used by ER is:

$$\min_{\Phi_i, \{\Theta_j\}_{j=1}^{i}} \; \mathbb{E}_{(x,y,t)\sim D_i}\, \ell_i(f_i(x,t), y; \Theta_i, \Phi_i) + \mathbb{E}_{1 \le j < i}\, \mathbb{E}_{(x,y,t)\sim M_j}\, \ell_j(f_j(x,t), y; \Theta_j, \Phi_i) \tag{7}$$

AGEM. AGEM is a continual learning method proposed by Chaudhry et al. Similar to ER, AGEM also uses a replay buffer (or an episodic memory) $M = \cup_{i=1}^{N-1} M_i$, where $M_i$ stores only $m$ examples per class from the task $T_i$. While learning the task $T_i$, the training objective of AGEM is:

$$\min_{f_i} \; \mathbb{E}_{(x,y,t)\sim D_i}\, \ell_i(f_i(x,t), y) \tag{8}$$

$$\text{s.t.} \quad \mathbb{E}_{(x,y,t)\sim M_1^{i-1}}\, \ell_t(f_i(x,t), y) \le \mathbb{E}_{(x,y,t)\sim M_1^{i-1}}\, \ell_t(f_{i-1}(x,t), y)$$

where $M_1^{i-1} = \cup_{j=1}^{i-1} M_j$. The corresponding optimization problem is:

$$\min_{\tilde{g}} \; \frac{1}{2} \| g - \tilde{g} \|_2^2 \quad \text{s.t.} \quad \tilde{g}^\top g_{ref} \ge 0$$

where $g$ is a gradient computed using a batch randomly sampled from the current task to solve the objective (equation 8), $g_{ref}$ is a gradient computed using a batch randomly sampled from the episodic memory $M_1^{i-1}$, and $\tilde{g}$ is a projected gradient that we will use to update the model. When the gradient $g$ violates the constraint (equation 9), it is projected via:

$$\tilde{g} = g - \frac{g^\top g_{ref}}{g_{ref}^\top g_{ref}}\, g_{ref} \tag{11}$$

FOMAML. MAML is a meta-learning approach proposed by Finn et al. (2017). Since there are some differences between the meta-learning setting and the continual learning setting (e.g., meta-learning algorithms assume a task distribution from which tasks can be sampled, while in the continual learning setting we don't have such a task distribution), we cannot directly use the MAML algorithm proposed in Finn et al. (2017). We therefore modify MAML so that it can be used in the continual learning setting. While learning on the task $T_i$, the training objective we want to solve is:

$$\min_{f_i} \; \mathbb{E}_{B \sim D_i}\, \ell_i(B; f_i) + \lambda \cdot \mathbb{E}_{j \in [i]}\, \mathbb{E}_{B^{in}_{j,1}, \dots, B^{in}_{j,b}, B^{out}_j \sim D_j}\, \ell_j(B^{out}_j; f^{(b)}_{i,j}) \tag{12}$$

where $[i] = \{1, 2, \dots, i\}$, $B \sim D_i$ means sampling a batch $B$ from $D_i$, and $f^{(b)}_{i,j} = U_b(B^{in}_{j,1}, \dots, B^{in}_{j,b}; f_i)$ is the model after $b$ inner gradient update steps:

$$f^{(0)}_{i,j} = f_i, \qquad f^{(q)}_{i,j} = f^{(q-1)}_{i,j} - \alpha \cdot \nabla_{f^{(q-1)}_{i,j}} \ell_j(B^{in}_{j,q}; f^{(q-1)}_{i,j}), \quad q = 1, \dots, b$$

The training objective aims to find a model that achieves small error on the current task $T_i$ and that, after several gradient update steps on some batches from a seen task $T_j$ ($j \in [i]$), achieves small error on other batches from the task $T_j$. In other words, we want to find a model that enables knowledge transfer between different batches from the same task. Solving the objective (equation 12) requires computing second-order gradients, which might be expensive. However, we can use the idea of first-order MAML (FOMAML) proposed in Finn et al. (2017), which ignores the second-derivative terms, to solve the objective. The FOMAML algorithm is presented in Algorithm 1. In our experiments, we simply set $\alpha = \beta$ and $c = 1$. On Split ImageNet, we set $b = 1$ while on other benchmarks, we set $b = 2$.
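To make the first-order update concrete, here is a minimal sketch on a toy problem. It is an illustration under stated assumptions rather than the code used in the experiments: the model is a single scalar `w`, each task `j` has the hypothetical loss 0.5 * (w - centers[j])**2, there is no batch sampling (every "batch" is the full toy loss), the weight λ is taken to be 1, and the current-task gradient is recomputed at every outer step.

```python
import random


def grad(w, center):
    """Gradient of the toy loss 0.5 * (w - center)**2 with respect to w."""
    return w - center


def fomaml_step(w, centers, i, alpha, beta, b, c, rng):
    """One outer update for the current task index i (0-based) on scalar w."""
    G = grad(w, centers[i])                       # gradient on the current task
    previous = rng.sample(range(i), min(c, i))    # c previously seen tasks
    for j in previous + [i]:
        w_j = w                                   # start the inner loop from w
        for _ in range(b):                        # b inner SGD steps on task j
            w_j -= alpha * grad(w_j, centers[j])
        # First-order approximation: use the gradient at the adapted
        # parameters directly and ignore second-derivative terms.
        G += grad(w_j, centers[j])
    return w - beta * G                           # outer update


rng = random.Random(0)
centers = [0.0, 1.0]   # hypothetical optima of task 0 and task 1
w = 0.0                # model after training on task 0
for _ in range(200):
    w = fomaml_step(w, centers, i=1, alpha=0.1, beta=0.1, b=2, c=1, rng=rng)
```

In this toy setting the iterate settles between the two task optima, closer to the current task, because the current task contributes both its plain gradient and its post-adaptation gradient.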

A.3 ARCHITECTURE AND TRAINING DETAILS

Architecture. We use the ResNet50 (He et al., 2016) architecture as the feature extractor $\Phi$ on all benchmarks by default. On CLEAR10 and CLEAR100, we use a single classification head $\Theta$ that is shared by all the tasks (single-head architecture) while on other benchmarks, we use a separate classification head $\Theta_i$ for each task $T_i$ (multi-head architecture).

Algorithm 1 FIRST-ORDER MAML (FOMAML)

Require: A model $f_{i-1}$ after training on the previous task $T_{i-1}$, a learning rate $\alpha$ for the inner update, a learning rate $\beta$ for the outer update, the number of previous tasks $c$ used for each training step, the number of gradient update steps $b$ for the inner update and the number of training steps $n$ for the outer update.
1: $f_i \leftarrow f_{i-1}$
2: Randomly sample a batch $B$ from the current task $T_i$, i.e., $B \sim D_i$
3: $G \leftarrow \nabla_{f_i} \ell_i(B; f_i)$
4: for $p = 1, 2, \dots, n$ do
5:     Randomly select $c$ indices from the set $\{1, 2, \dots, i-1\}$ without replacement as a set $I_p$.
6:     $I \leftarrow I_p \cup \{i\}$
7:     for $j \in I$ do
8:         $f^{(0)}_{i,j} \leftarrow f_i$
9:         for $q = 1, 2, \dots, b$ do
10:            Randomly sample a batch $B^{in}_{j,q}$ from $D_j$.
11:            Apply an inner-update step: $f^{(q)}_{i,j} \leftarrow f^{(q-1)}_{i,j} - \alpha \cdot \nabla_{f^{(q-1)}_{i,j}} \ell_j(B^{in}_{j,q}; f^{(q-1)}_{i,j})$.
12:        end for
13:        Randomly sample a batch $B^{out}_j$ from $D_j$.
14:        $G \leftarrow G + \nabla_{f^{(q)}_{i,j}} \ell_j(B^{out}_j; f^{(q)}_{i,j})$
15:    end for
16:    Apply an outer-update step: $f_i \leftarrow f_i - \beta \cdot G$
17: end for
18: return $f_i$.

Continual Learning Training Details. We use Stochastic Gradient Descent (SGD) for training models. We use cosine learning rate scheduling (Loshchilov & Hutter, 2016) to adjust the learning rate during training. Suppose the base learning rate is $r$ and the number of training steps for each task is $n$. Then, for each task, at training step $t$ the learning rate for the SGD update is $r \cdot \cos\left(\frac{t\pi}{2n}\right)$. For LP-FT, we use a base learning rate of 0.001 while for other baselines, we use a base learning rate of 0.01. On Split CIFAR-10, Split CIFAR-100 and CIFAR-100 Superclasses, we use a batch size of 64. On CLEAR10 and CLEAR100, we use a batch size of 128. On Split ImageNet, we use a batch size of 256. When using a random initialization as the initial model $f_0$, we train the model for 50 epochs per task on Split CIFAR-10, Split CIFAR-100 and CIFAR-100 Superclasses, and for 100 epochs per task on CLEAR10, CLEAR100 and Split ImageNet. When using a pre-trained model as the initial model $f_0$, we train the model for 20 epochs per task on all benchmarks, as we found this sufficient for training convergence. For LP-FT, we perform the linear probing for 10 epochs per task. These training hyper-parameters are chosen based on the average learning accuracy across tasks on the validation set.
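The learning rate schedule described above, r · cos(tπ / (2n)), can be written as a small helper. This is a sketch only: the function name `cosine_lr` is hypothetical, and it assumes that `t` counts optimizer steps starting from 0 within each task.

```python
import math


def cosine_lr(step, total_steps, base_lr):
    """Learning rate r * cos(t * pi / (2 * n)) at training step t."""
    return base_lr * math.cos(step * math.pi / (2 * total_steps))


# Example: base learning rate 0.01 over n = 100 steps of one task.
schedule = [cosine_lr(t, total_steps=100, base_lr=0.01) for t in range(101)]
```

Under this form the rate starts at the base value and decays monotonically towards zero at the end of each task, after which the schedule restarts for the next task.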

K-Shot Linear Probing Training Details.

We use Stochastic Gradient Descent (SGD) with a fixed learning rate of 0.01 for linear probing. We train the classifier head $\Theta$ for 100 epochs on the k-shot dataset $S^k_{j+1}$, as we found this sufficient for training convergence. On CLEAR100, the set of values we consider for $k$ is {5, 10, 20, 40} while on other benchmarks, the set of values we consider for $k$ is {5, 10, 20, 100}. For each $k$, we use a batch size of $\min(k \cdot c, 50)$, where $c$ is the number of classes in the task. We don't apply any data augmentations to the training images during the k-shot linear probing.
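As an illustration of the k-shot setup, the sketch below draws k examples per class and computes the batch size min(k · c, 50). It is a hypothetical helper, not the code used in the experiments; `sample_k_shot` and `probe_batch_size` are names introduced here, and the sampling is assumed to be uniform without replacement within each class.

```python
import random
from collections import defaultdict


def sample_k_shot(labels, k, rng):
    """Return indices of k randomly chosen examples for every class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        # rng.sample raises ValueError if a class has fewer than k examples.
        chosen.extend(rng.sample(by_class[label], k))
    return chosen


def probe_batch_size(k, num_classes, cap=50):
    """Batch size min(k * c, 50) used for k-shot linear probing."""
    return min(k * num_classes, cap)


rng = random.Random(0)
labels = [i % 2 for i in range(40)]   # hypothetical 2-class task
subset = sample_k_shot(labels, k=5, rng=rng)
```

For small k the whole k-shot set fits in one batch, so each epoch is a single full-batch step; the cap of 50 only matters for larger k or tasks with many classes.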

A.4 HYPER-PARAMETERS SELECTION

In this section, we discuss how we select the hyper-parameters.

Continual Learning Training. For different baselines, the shared hyper-parameters are the batch size, the learning rate and the number of training epochs. We do not tune the batch size, but set it to a fixed number for each benchmark. For the learning rate and the number of training epochs, we choose them based on the average learning accuracy across tasks on the validation data. The set of learning rates that we consider is {0.1, 0.01, 0.001, 0.0001}. We found that setting the learning rate to 0.01 leads to the best average learning accuracy for all the baselines except LP-FT. For LP-FT, we found that setting the learning rate to 0.001 leads to better average learning accuracy. We set the number of training epochs to a sufficiently large number such that the average learning accuracy doesn't improve as we increase the number of epochs further. For all the baselines, we pick a fixed number of training epochs such that all methods converge for each benchmark setting.

K-shot Linear Probing Training. The hyper-parameters are the batch size, the learning rate and the number of training epochs. For each $k$, we just set the batch size to $\min(k \cdot c, 50)$ and do not tune it. Note that k-shot linear probing is a convex optimization problem. Thus, the number of training epochs will not affect the results across baselines as long as we train for a sufficient number of epochs. We found that 100 epochs were more than sufficient for all the baselines to converge for k-shot linear probing. Also, the learning rate will not affect the results much as long as we pick a reasonable one. Therefore, we simply fix the learning rate to 0.01.

B.1 COMPARING FORGETTING AND FORWARD TRANSFER

In the main paper, we give results for comparing forgetting and forward transfer on some benchmarks. In this section, we provide additional results for comparing forgetting and forward transfer on other benchmarks. Figure 5 shows the results where we use a random initialization and Figure 6 shows the results where we use a pre-trained model. We can see that the claims made in Section 3.2 still hold here.

B.2 EVALUATING AVERAGE ACCURACY AND AVERAGE LEARNING ACCURACY

In this section, we report results for the traditional continual learning metrics Average Accuracy and Average Learning Accuracy. The Average Accuracy is defined as

$$\mathrm{AvgAcc} = \frac{1}{N} \sum_{j=1}^{N} \mathrm{Acc}(N, j),$$

while the Average Learning Accuracy is defined as

$$\mathrm{AvgLAcc} = \frac{1}{N} \sum_{j=1}^{N} \mathrm{Acc}(j, j).$$

The results are reported in Table 3.
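Both metrics can be read off a matrix of accuracies. The sketch below assumes, as is common in continual learning, that Acc(i, j) denotes the accuracy on task j after training through task i, stored 0-indexed as `acc[i-1][j-1]`; the matrix values are made up for illustration and the function names are hypothetical.

```python
def average_accuracy(acc):
    """AvgAcc: mean accuracy on all tasks after training on the last task."""
    n = len(acc)
    return sum(acc[n - 1][j] for j in range(n)) / n


def average_learning_accuracy(acc):
    """AvgLAcc: mean accuracy on each task right after it was learned."""
    n = len(acc)
    return sum(acc[j][j] for j in range(n)) / n


# Hypothetical 3-task accuracy matrix (percentages). Entries above the
# diagonal are not used by either metric.
acc = [
    [90.0, 0.0, 0.0],
    [80.0, 88.0, 0.0],
    [70.0, 75.0, 92.0],
]
avg_acc = average_accuracy(acc)
avg_lacc = average_learning_accuracy(acc)
```

AvgAcc uses only the last row of the matrix and AvgLAcc only the diagonal, so the gap between the two reflects how much accuracy was lost on earlier tasks by the end of training.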

B.3 ABLATION STUDY ON THE MODEL ARCHITECTURE

We want to see whether our claims about forgetting and forward transfer also hold when we use a different architecture. Thus, on the Split CIFAR-10, Split CIFAR-100, and CIFAR-100 Superclasses benchmarks, we also report results in Figure 7 for using ResNet18 as the model architecture. We only show results for random initialization since we don't have a ResNet18 model pre-trained on ImageNet. From the results, we can see that our claim that less forgetting is a good inductive bias for forward transfer still holds.

B.4 CORRELATION BETWEEN AVERAGE FORGETTING AND AVERAGE FEATURE DIVERSITY SCORE

In order to aggregate the metrics across different approaches and to see a global trend between forgetting and feature diversity, we compute the Spearman rank correlation between the AvgFgt and AvgFDiv. Table 4 shows the correlation values for randomly initialized models. From the results, we can see that for randomly initialized models, AvgFgt and AvgFDiv generally have positive correlations. 
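For readers who want to reproduce this kind of aggregate, the Spearman rank correlation is the Pearson correlation of the ranks. The following is a minimal pure-Python sketch, assuming tied values receive their average rank; it does not compute p-values, and the function names are hypothetical rather than taken from the paper's code.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) if a or b is constant."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(x, y):
    """Spearman rank correlation between two equally long sequences."""
    return pearson(average_ranks(x), average_ranks(y))
```

In the setting above, `x` and `y` would hold one AvgFgt and one AvgFDiv value per (training method, random run) pair; only the ordering of the values matters, not their scale.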

B.5 FORWARD TRANSFER THROUGH K-SHOT FINE-TUNING

We also evaluate forward transfer through k-shot fine-tuning (i.e., we fine-tune the entire model, including the feature extractor $\Phi_j$ and the classifier $\Theta$, on the k-shot samples $S^k_{j+1}$). The training hyper-parameters are the same as those of k-shot linear probing, except that, to avoid overfitting while fine-tuning the whole network, we perform cross-validation for the learning rate and the number of training epochs using the validation set. The learning rate is chosen from the set {0.01, 0.001} while the number of training epochs is chosen from the set {10, 50, 100}. When using random initialization, the results on the Split CIFAR-10, Split CIFAR-100, and CIFAR-100 Superclasses benchmarks are shown in Figure 8 while the results on the CLEAR10, CLEAR100 and Split ImageNet benchmarks are shown in Figure 9. When using a pre-trained model as initialization, the results on the Split CIFAR-10, Split CIFAR-100, CIFAR-100 Superclasses, CLEAR10 and CLEAR100 benchmarks are shown in Figure 10. In order to aggregate the metrics across different approaches and to see a global trend between forgetting and forward transfer, we compute the Spearman rank correlation between AvgFgt and $\mathrm{AvgFwt}^k$ for the k-shot fine-tuning evaluation. Table 5 shows the correlation values for different values of $k$ for both randomly initialized and pre-trained models. It can be seen from the table that most of the entries are above 0.5 and statistically significant (p < 0.01), showing that reducing forgetting improves forward transfer across the board. From the results, we can see that less forgetting generally leads to better forward transfer. Thus, our claim that less forgetting is a good inductive bias for forward transfer still holds.

B.6 ABLATION STUDY ON THE REPLAY BUFFER SIZE

For the Experience Replay (ER) baseline, we perform experiments on the Split CIFAR-100 and CIFAR-100 Superclasses benchmarks to study the effect of the replay buffer size on the AvgFgt and $\mathrm{AvgFwt}^k$ metrics. The results when using random initialization are shown in Figure 11 while the results when using a pre-trained model as initialization are shown in Figure 12. From the results, we can see that increasing $m$ usually leads to less forgetting and thus more forward transfer. Therefore, our claim that less forgetting is a good inductive bias for forward transfer still holds.

B.7 RESULTS FOR EWC AND VANILLA L2 REGULARIZATION

In this section, we provide some results for the EWC method (Kirkpatrick et al., 2016) and vanilla L2 regularization (a variant of EWC where the Fisher information matrix is replaced with an identity matrix) using ResNet18 as the model architecture with random initialization on the Split CIFAR-10 benchmark. For $\lambda$ in EWC, we consider the range {10, 50, 100, 200} and select the best one based on the performance on the validation data. For $\lambda$ in vanilla L2 regularization, we consider the range {10, 1, 0.1, 0.01} and select the best one based on the performance on the validation data. Experimenting with EWC on larger models and longer benchmarks is computationally very expensive. The comparison of EWC with FT and vanilla L2 regularization is given in Table 6. It can be seen from the table that less forgetting leads to better forward transfer. Thus, our claim that less forgetting is a good inductive bias for forward transfer still holds.

We note here that task relatedness has a significant effect on the relationship between forgetting and forward transfer. The benchmarks that we considered in this work either have very similar tasks (CLEAR10/100), where the same classes are observed over a 10-year period, or unrelated tasks (Split CIFAR-10/100, ImageNet), where disjoint classes are observed in each task. In both cases, less forgetting improved forward transfer, although for more similar tasks the improvement is more significant, as intuitively expected. We did not observe that "unrelatedness" of tasks leads to negative transfer. However, if tasks were negatively related to begin with, then less forgetting would intuitively lead to negative transfer. In our experience, negatively related tasks are very rare, and in practical machine learning systems many tasks can learn from each other (which is the basis of transfer learning, multitask learning, etc.). We would like to emphasize that the point of the paper is precisely to show that when tasks are somewhat related, and observed in a continual setting, less forgetting improves forward transfer. It is in this setting that some of the previous works (Hadsell et al., 2020; Wolczyk et al., 2021) concluded that less forgetting does not improve end-to-end forward transfer. We show here that even on such tasks less forgetting improves the representational measure of forward transfer.

The EWC (Kirkpatrick et al., 2016) results are in Appendix Table 6.

Figure 1: Comparing average forgetting with average forward transfer for different continual learning methods using random initialization on the Split CIFAR-100 benchmark. FOMAML has less forgetting and thus better forward transfer.

Figure 3: Comparing average forgetting with average forward transfer for different continual learning methods using random initialization on the Split CIFAR-10 and CLEAR100 benchmarks.

Figure 4: Comparing average forgetting with average forward transfer for different continual learning methods that train the model from a pre-trained ImageNet model on the Split CIFAR-10 and Split CIFAR-100 benchmarks. We also show the accuracy of the models on the upstream ImageNet data. Since CIFAR-10 and CIFAR-100 images have a different resolution from ImageNet images, we need to resize the ImageNet test images from 224 × 224 to 32 × 32 in order to get meaningful accuracy of the models trained on the Split CIFAR-10 and Split CIFAR-100 benchmarks on the upstream ImageNet data (although the accuracy of the pre-trained model on the resized ImageNet test images is significantly reduced).

Here, $U_b(B_1, \dots, B_b; f)$ is a model obtained by applying $b$ gradient update steps on the model $f$ using $b$ batches $B_1, \dots, B_b$. If we use standard SGD for $U_b$ and the learning rate is $\alpha$, then we have $f^{(0)}_{i,j} = f_i$ and $f^{(q)}_{i,j} = f^{(q-1)}_{i,j} - \alpha \cdot \nabla_{f^{(q-1)}_{i,j}} \ell_j(B^{in}_{j,q}; f^{(q-1)}_{i,j})$ for $q = 1, \dots, b$.

Figure 5: Comparing average forgetting with average forward transfer for different continual learning methods using random initialization on the CIFAR-100 Superclasses, CLEAR10 and Split ImageNet benchmarks.

Figure 6: Comparing average forgetting with average forward transfer for different continual learning methods that train the model from a pre-trained ImageNet model on the CIFAR-100 Superclasses, CLEAR10 and CLEAR100 benchmarks. We also show the accuracy of the models on the upstream ImageNet data. Since CIFAR-100 images have a different resolution from ImageNet images, we need to resize the ImageNet test images from 224 × 224 to 32 × 32 in order to get meaningful accuracy of the models trained on the CIFAR-100 Superclasses benchmark on the upstream ImageNet data. On the CLEAR10 and CLEAR100 benchmarks, since their images have the same resolution as the ImageNet images, we don't need to resize the ImageNet test images when evaluating the upstream accuracy.

Figure 8: Comparing average forgetting with average forward transfer for different continual learning methods using random initialization on the Split CIFAR-10, Split CIFAR-100 and CIFAR-100 Superclasses benchmarks. Here, we evaluate the forward transfer through k-shot fine-tuning.

Figure 10: Comparing average forgetting with average forward transfer for different continual learning methods that train the model from a pre-trained ImageNet model on the Split CIFAR-10, Split CIFAR-100, CIFAR-100 Superclasses, CLEAR10 and CLEAR100 benchmarks. We also show the accuracy of the models on the upstream ImageNet data. Here, we evaluate the forward transfer through k-shot fine-tuning.

Figure 11: Comparing average forgetting with average forward transfer for the ER method with different replay buffer sizes using random initialization on the Split CIFAR-100 and CIFAR-100 Superclasses benchmarks.

tr j+1 denote a sample consisting of 'k' examples per class from D tr j+1 , and let Φ j be the representations obtained after training on task 'j' (see the bottom blob of Figure

Spearman correlation between AvgFgt and $\mathrm{AvgFwt}^k$ for different $k$, which computes the correlation over different settings (different training methods and random runs). p-values are shown in parentheses if greater than or equal to 0.01.

Comparing AvgFgt with AvgFDiv. The numbers for AvgFgt are percentages. Bold numbers are superior results.

Results for average accuracy AvgAcc and average learning accuracy AvgLAcc. All numbers are percentages.

Spearman correlation between AvgFgt and AvgFDiv, which computes the correlation over different settings (different training methods and random runs). Here, we use random initialization as the initial model. p-values are shown in parentheses if greater than or equal to 0.01.

Spearman correlation between AvgFgt and $\mathrm{AvgFwt}^{k*}$ for different $k$, which computes the correlation over different settings (different training methods and random runs). Here, $\mathrm{AvgFwt}^{k*}$ is defined like $\mathrm{AvgFwt}^k$, but instead of using k-shot linear probing for evaluation, we use k-shot fine-tuning evaluation (i.e., fine-tuning the entire model). p-values are shown in parentheses if greater than or equal to 0.01.

Results for EWC and Vanilla L2 Regularization using ResNet18 as the model architecture with random initialization on the Split CIFAR-10 benchmark. $\lambda$ is a hyper-parameter that controls the regularization strength of EWC and Vanilla L2 Regularization. The numbers are percentages.

