ARCHITECTURE MATTERS IN CONTINUAL LEARNING

Anonymous authors
Paper under double-blind review

Abstract

A large body of research in continual learning is devoted to overcoming catastrophic forgetting in neural networks by designing new algorithms that are robust to distribution shifts. However, the majority of these works are strictly focused on the algorithmic part of continual learning for a fixed neural network architecture, and the implications of using different architectures are not clearly understood. The few existing continual learning methods that expand the model also assume a fixed architecture and develop algorithms that can efficiently use the model throughout the learning experience. In contrast, in this work we build on existing works that study continual learning from a neural network architecture perspective and provide new insights into how the architecture choice, for the same learning algorithm, can impact the stability-plasticity trade-off, resulting in markedly different continual learning performance. We empirically analyze the impact of various architectural components, providing best practices and recommendations that can improve continual learning performance irrespective of the learning algorithm.

1. INTRODUCTION

Continual learning (CL) (Ring, 1995; Thrun, 1995) is a branch of machine learning in which a model is exposed to a sequence of tasks, with the hope of exploiting existing knowledge to adapt quickly to new tasks. Research in continual learning has seen a surge in the past few years, with an explicit focus on developing algorithms that alleviate catastrophic forgetting (McCloskey & Cohen, 1989), whereby a model abruptly forgets past information when trained on new tasks. While most research in continual learning is focused on developing learning algorithms that can perform better than naive fine-tuning on a stream of data, the role of model architecture, to the best of our knowledge, is not explicitly studied in any of the existing works. Even the class of parameter-isolation or expansion-based methods, for example (Rusu et al., 2016; Yoon et al., 2018), has only a cursory focus on model architecture, insofar as these methods assume a specific architecture and develop an algorithm operating on it. Orthogonal to this algorithm-design direction, our motivation is that the inductive biases induced by different architectural components could be important for continual learning irrespective of the learning algorithm. Therefore, we seek to characterize the implications of different architectural choices in continual learning.

To motivate our study, consider a ResNet-18 model (He et al., 2016) on Split CIFAR-100, where the CIFAR-100 dataset (Krizhevsky et al., 2009) is split into 20 disjoint sets, a prevalent architecture and benchmark in existing continual learning works. Fig. 1a shows that explicitly designed CL algorithms, EWC (Kirkpatrick et al., 2017), a parameter regularization-based method, and experience replay (Riemer et al., 2018), a memory-based CL algorithm, indeed improve upon naive fine-tuning.
However, similar or better performance can be obtained on this benchmark by simply removing the global average pooling (GAP) layer from ResNet-18 and performing naive fine-tuning. This clearly demonstrates the need for a better understanding of network architectures in the context of continual learning, where architectural choices are not based solely on single-task performance but on a trade-off between learning new tasks and retaining previous ones. Similar observations, though in more limited scenarios, have been made before: for example, Mirzadeh et al. (2022) look at the role of layer width, while Ramasesh et al. (2022) focus on the scale of the model. We build on these works, extending the analysis to architecture choices as well as to commonly used components such as batch normalization. It is also useful to note that these observations do not imply that algorithmic improvements are unimportant. In fact, we show in Appendix B that one can achieve even better performance by combining our architectural findings with specially designed continual learning algorithms.

To understand the implications of architectural decisions in continual learning, we thoroughly study different architectures, including MLPs, CNNs, ResNets, Wide-ResNets, and Vision Transformers. Our experiments suggest that different components of these architectures can have different effects on the relevant continual learning metrics, namely average accuracy, forgetting, and learning accuracy (cf. Sec. 2.1), to the extent that vanilla fine-tuning with modified components can achieve similar or better performance than specifically designed CL methods on a given base architecture, without significantly increasing the parameter count.

Contributions. We summarize our main contributions as follows:
• We compare both the learning and retention capabilities of popular architectures.
• We study the role of individual architectural decisions (e.g., width and depth, batch normalization, skip connections, and pooling layers) and how they can impact continual learning performance.
• We show that, in some cases, simply modifying the architecture can achieve similar or better performance compared to specifically designed CL algorithms (on top of a base architecture).
• In addition to the standard CL benchmarks, Rotated MNIST and Split CIFAR-100, we report results on the large-scale Split ImageNet-1K benchmark, which is rarely used in the CL literature, to verify that our findings hold in more complex settings.
• Inspired by our findings, we provide practical suggestions that are computationally cheap and can improve the performance of various architectures in continual learning.

Limitations. We emphasize that our main focus is to illustrate the significance of architectural decisions in continual learning. We do not claim to cover all possible permutations of architectural components and continual learning scenarios. Consequently, the majority of our experiments focus on the task-incremental setup with popular architectures. However, our class-incremental results in Appendix B.5 are consistent with our task-incremental findings. Moreover, a secondary aim of this work is to serve as a stepping stone that encourages further research on the architectural side of continual learning, which is why we favor breadth over depth on some topics. Finally, while a limited number of works in the literature study the role of architecture in continual learning, in Sec. 5 we discuss how those works focus on specific topics, whereas this work draws a more comprehensive and general picture. We believe our work provides many interesting directions that require deeper analysis beyond the scope of this paper but can significantly improve our understanding of continual learning.
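The three metrics named above are defined formally in Sec. 2.1; as a reference point, the sketch below computes them under their common usage in the CL literature, from an accuracy matrix acc[t][i] holding test accuracy on task i after training on task t. The exact definitions in the paper may differ in detail, and the toy matrix is purely illustrative.

```python
# Hedged sketch: standard continual-learning metrics from an accuracy
# matrix acc[t][i] = test accuracy on task i after training on task t.

def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task."""
    T = len(acc)
    return sum(acc[T - 1]) / T

def forgetting(acc):
    """Mean drop from each task's best past accuracy to its final accuracy."""
    T = len(acc)
    drops = []
    for i in range(T - 1):  # the last task has not yet had a chance to be forgotten
        best = max(acc[t][i] for t in range(i, T))
        drops.append(best - acc[T - 1][i])
    return sum(drops) / (T - 1)

def learning_accuracy(acc):
    """Mean accuracy on each task measured right after it is learned."""
    T = len(acc)
    return sum(acc[i][i] for i in range(T)) / T

# Toy 3-task example (rows: after training task t; cols: evaluated task i).
acc = [
    [0.90, 0.00, 0.00],
    [0.70, 0.88, 0.00],
    [0.60, 0.75, 0.85],
]
print(average_accuracy(acc))   # (0.60 + 0.75 + 0.85) / 3
print(forgetting(acc))         # ((0.90 - 0.60) + (0.88 - 0.75)) / 2
print(learning_accuracy(acc))  # (0.90 + 0.88 + 0.85) / 3
```

Higher average and learning accuracy indicate better plasticity, while lower forgetting indicates better stability; the trade-off between the two is what different architectures shift.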

2.1. EXPERIMENTAL SETUP

Here, for brevity, we explain our experimental setup but postpone more detailed information (e.g., hyper-parameters and architecture details) to Appendix A. Benchmarks. We use three continual learning benchmarks for our experiments. Split CIFAR-100 includes 20 tasks, where each task has the data of 5 disjoint classes, and we train on each task for 10 epochs. The Split ImageNet-1K includes 10 tasks where each task includes 100 classes of



Figure 1: Split CIFAR-100: (a) While continual learning algorithms such as EWC and ER improve performance compared to naive fine-tuning, a simple modification to the architecture (removing the global average pooling (GAP) layer) can match the performance of ER with a replay buffer of 1000 examples. (b) and (c) Different architectures lead to very different continual learning performance in terms of accuracy and forgetting. This work investigates the reasons behind these gaps and provides insights into improving architectures.
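To see why removing GAP is a cheap modification, consider the only part of the network it affects: the classifier input size. The sketch below is back-of-the-envelope arithmetic, not the paper's implementation; it assumes a CIFAR-style ResNet-18 whose final feature map is 512x4x4 (32x32 inputs with /8 downsampling; other stems give different spatial sizes) and ignores bias terms.

```python
# Hedged sketch: classifier weight counts with and without global average
# pooling (GAP), assuming a 512x4x4 final feature map (CIFAR-style stem).
channels, h, w = 512, 4, 4
classes_per_task = 5  # Split CIFAR-100: 20 tasks of 5 classes each

with_gap = channels * classes_per_task              # GAP pools to a 512-d vector
without_gap = channels * h * w * classes_per_task   # flattening keeps all 8192 features

print(with_gap)     # 2560
print(without_gap)  # 40960
```

Even the larger flattened classifier adds only tens of thousands of weights against ResNet-18's roughly 11M parameters, consistent with the claim that the modification does not significantly increase the parameter count.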

