IS FORGETTING LESS A GOOD INDUCTIVE BIAS FOR FORWARD TRANSFER?

Abstract

One of the main motivations for studying continual learning is that the problem setting allows a model to accrue knowledge from past tasks in order to learn new tasks more efficiently. However, recent studies suggest that the key metric that continual learning algorithms optimize, reduction in catastrophic forgetting, does not correlate well with the forward transfer of knowledge. We believe that the conclusion reached by previous works is due to the way they measure forward transfer. We argue that the measure of forward transfer to a task should not be affected by the restrictions placed on the continual learner in order to preserve knowledge of previous tasks. Instead, forward transfer should be measured by how easy it is to learn a new task given a set of representations produced by continual learning on previous tasks. Under this notion of forward transfer, we evaluate different continual learning algorithms on a variety of image classification benchmarks. Our results indicate that less forgetful representations lead to better forward transfer, suggesting a strong correlation between retaining past information and learning efficiency on new tasks. Further, we find less forgetful representations to be more diverse and discriminative than their forgetful counterparts.

1. INTRODUCTION

Continual learning aims to improve learned representations over time without having to train from scratch as more data or tasks become available. This objective is especially relevant in the context of large-scale models trained on massive data, where training from scratch is prohibitively costly. However, standard stochastic gradient descent (SGD) training, which relies on the IID assumption on the data, results in severely degraded performance on old tasks when the model is continually updated on new tasks. This phenomenon is referred to as catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2016) and has been an active area of research (Kirkpatrick et al., 2016; Lopez-Paz & Ranzato, 2017; Mallya & Lazebnik, 2018). Intuitively, the reduction in catastrophic forgetting allows the learner to accrue knowledge from the past and use it to learn new tasks more efficiently, whether through less training data, less compute, better final performance, or any combination thereof. This phenomenon of efficiently learning new tasks using previous information is referred to as forward transfer. Reducing catastrophic forgetting and improving forward transfer are often thought of as competing desiderata of continual learning, where one has to strike a balance between the two depending on the application at hand (Hadsell et al., 2020). Specifically, Wolczyk et al. (2021) recently studied the interplay of forgetting and forward transfer in the robotics context, and found that many continual learning approaches alleviate catastrophic forgetting at the expense of forward transfer. This is indeed unavoidable if the capacity of the model is less than the amount of information we intend to store. However, assuming that the model has sufficient capacity to learn all the tasks simultaneously, as in multitask learning, one might think that a less forgetful model could transfer its retained knowledge to future tasks when they are similar to past ones.
Therefore, we argue for a measure of forward transfer that is unconstrained by any training modifications made to preserve previous knowledge. We propose to use an auxiliary evaluation of continually trained representations as a measure of forward transfer, kept separate from the continual training of the model. Specifically, at the arrival of a new task, we fix the representations learned on the previous task and evaluate them on the new task. This evaluation is done by learning a temporary classifier using a small subset of data from the new task and measuring performance on the test set of that task. The continual training on the new task then proceeds with updates to the representations (and the classifier) using the full training dataset of the task. We note that this notion of forward transfer removes the tug of war between forgetting the previous tasks and transferring to the next task, and it is with this notion of transfer that we ask the question: are less forgetful representations more transferable? We analyze the interplay of catastrophic forgetting and forward transfer on several supervised continual learning benchmarks and algorithms. For this work, we restrict ourselves to the task-based continual learning setting, where task information is assumed at both train and test times, as it makes the aforementioned evaluation based on auxiliary classification at fixed points easily interpretable. Our results demonstrate that a less forgetful model in fact transfers better (cf. Figure 1). We find this observation to hold both for randomly initialized models and for models initialized from a pre-trained model. We further analyze the reasons for this better transferability and find that less forgetful models result in more diverse and easily separable representations, making it easier to learn a classifier head on top.
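To make the auxiliary evaluation concrete, the following is a minimal sketch of one way such a probe could be set up; it is not the exact protocol used in our experiments. It assumes a softmax classifier trained by plain gradient descent, and the names `sample_k_shot`, `linear_probe_accuracy`, and `frozen_backbone`, as well as the toy Gaussian task, are hypothetical stand-ins introduced only for illustration.

```python
import numpy as np

def sample_k_shot(x, y, k, rng):
    """Draw k examples per class: the 'small subset' of the new task's data."""
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=k, replace=False)
        for c in np.unique(y)
    ])
    return x[idx], y[idx]

def linear_probe_accuracy(features_fn, x_train, y_train, x_test, y_test,
                          num_classes, steps=200, lr=0.1):
    """Fit a temporary softmax classifier on frozen features; return test accuracy."""
    f_train = features_fn(x_train)  # representations are fixed, never updated
    f_test = features_fn(x_test)
    w = np.zeros((f_train.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y_train]
    for _ in range(steps):
        logits = f_train @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(y_train)       # cross-entropy gradient
        w -= lr * (f_train.T @ grad)
        b -= lr * grad.sum(axis=0)
    preds = (f_test @ w + b).argmax(axis=1)
    return float((preds == y_test).mean())

# Toy stand-in for a "new task": two well-separated Gaussian classes.
rng = np.random.default_rng(0)

def make_task(n):
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, 5)) + 3.0 * (2 * y[:, None] - 1)
    return x, y

def frozen_backbone(x):
    # Hypothetical placeholder for the representation learned on previous tasks.
    return x

x_tr, y_tr = make_task(200)
x_te, y_te = make_task(200)
x_few, y_few = sample_k_shot(x_tr, y_tr, k=10, rng=rng)
acc = linear_probe_accuracy(frozen_backbone, x_few, y_few, x_te, y_te, num_classes=2)
print(f"probe accuracy on the new task: {acc:.3f}")
```

In this sketch the probe is discarded after evaluation, so it leaves the continual learner untouched; the resulting accuracy depends only on the frozen representation and the few labeled examples from the new task.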
With these results, we want to emphasize that the continual learning community should view the trade-off between forgetting and forward transfer from the right perspective. The learning-accuracy-based measure of forward transfer is useful for end-to-end learning on a fixed benchmark, and it creates a trade-off between forgetting and forward transfer, as rightly demonstrated by Hadsell et al. (2020) and Wolczyk et al. (2021). However, in the era of foundation models, where pretrain-then-finetune is a dominant paradigm and where one often does not know a priori the tasks on which a foundation model will be finetuned, a measure of forward transfer that looks at the capability of a backbone model to be finetuned on several downstream tasks is perhaps more apt. The rest of the paper is organized as follows. In Section 2, we describe the training and evaluation setups considered in this work. In Section 3, we provide experimental details followed by the main results of the paper. Section 4 discusses the works most relevant to our study. We conclude in Section 5 with some hints on how the findings of this study can be useful for future research.



Figure 1: Comparing average forgetting with average forward transfer for different continual learning methods using random initialization on the Split CIFAR-100 benchmark. FOMAML has less forgetting and thus better forward transfer.

