ARCHITECTURE MATTERS IN CONTINUAL LEARNING

Anonymous authors
Paper under double-blind review

Abstract

A large body of research in continual learning is devoted to overcoming the catastrophic forgetting of neural networks by designing new algorithms that are robust to distribution shifts. However, the majority of these works focus strictly on the algorithmic part of continual learning for a fixed neural network architecture, and the implications of using different architectures are not clearly understood. The few existing continual learning methods that expand the model also assume a fixed architecture and develop algorithms that can efficiently use the model throughout the learning experience. In contrast, in this work, we build on existing works that study continual learning from a neural network architecture perspective and provide new insights into how the architecture choice, for the same learning algorithm, can impact the stability-plasticity trade-off, resulting in markedly different continual learning performance. We empirically analyze the impact of various architectural components, providing best practices and recommendations that can improve continual learning performance irrespective of the learning algorithm.

1. INTRODUCTION

Continual learning (CL) (Ring, 1995; Thrun, 1995) is a branch of machine learning where the model is exposed to a sequence of tasks, with the hope of exploiting existing knowledge to adapt quickly to new tasks. Research in continual learning has seen a surge in the past few years, with an explicit focus on developing algorithms that can alleviate catastrophic forgetting (McCloskey & Cohen, 1989), whereby the model abruptly forgets information about past tasks when trained on new ones. While most of this research focuses on developing learning algorithms that can perform better than naive fine-tuning on a stream of data, the role of model architecture has, to the best of our knowledge, received comparatively little explicit attention. Even the class of parameter-isolation or expansion-based methods, for example (Rusu et al., 2016; Yoon et al., 2018), has only a cursory focus on the model architecture, insofar as these methods assume a specific architecture and develop an algorithm operating on it. Orthogonal to this direction of designing algorithms, our motivation is that the inductive biases induced by different architectural components could be important for continual learning irrespective of the learning algorithm. We therefore seek to characterize the implications of different architectural choices in continual learning. To motivate our study, consider a ResNet-18 model (He et al., 2016) on Split CIFAR-100, where the CIFAR-100 dataset (Krizhevsky et al., 2009) is split into 20 disjoint sets, a prevalent architecture and benchmark in existing continual learning works. Fig. 1a shows that explicitly designed CL algorithms, EWC (Kirkpatrick et al., 2017) (a parameter regularization-based method) and experience replay (Riemer et al., 2018) (a memory-based CL algorithm), indeed improve upon naive fine-tuning.
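The Split CIFAR-100 benchmark above can be sketched in a few lines of plain Python. The contiguous class-to-task assignment below is an illustrative assumption, not necessarily the paper's exact split; what matters is that the 100 classes are partitioned into 20 disjoint sets of 5 classes, and each task sees only the examples whose labels fall in its set.

```python
# Illustrative sketch of a Split CIFAR-100 task partition (assumed
# contiguous class blocks; not taken from the paper's code).

NUM_CLASSES, NUM_TASKS = 100, 20
classes_per_task = NUM_CLASSES // NUM_TASKS  # 5 classes per task

# Task t owns classes [5t, 5t + 5); the tasks are disjoint and
# together cover all 100 CIFAR-100 classes.
task_classes = [list(range(t * classes_per_task, (t + 1) * classes_per_task))
                for t in range(NUM_TASKS)]

def task_subset(labels, task_id):
    """Return indices of the examples that belong to the given task."""
    wanted = set(task_classes[task_id])
    return [i for i, y in enumerate(labels) if y in wanted]

# Toy example: one example per class, labels 0..99.
labels = list(range(100))
print(task_classes[3])         # [15, 16, 17, 18, 19]
print(task_subset(labels, 3))  # [15, 16, 17, 18, 19]
```

During training, the model is then presented with `task_subset(labels, 0)`, `task_subset(labels, 1)`, and so on in sequence, which induces the distribution shift that causes catastrophic forgetting.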
However, similar or better performance can be obtained on this benchmark by simply removing the global average pooling layer from ResNet-18 and performing naive fine-tuning. This clearly demonstrates the need for a better understanding of network architectures in the context of continual learning, where architectural choices are not based solely on single-task performance but on a trade-off between learning new tasks and retaining previous ones. Similar observations, though in more limited scenarios, have been made previously: for example, Mirzadeh et al. (2022) looks at the role of layer width, while Ramasesh et al. (2022) focuses on the scale of the model. We build on these works, extending the analysis to architecture choices as well as to understanding particular components in common use, such as batch normalization. It is also useful to note that these observations do not imply that algorithmic improvements are unimportant; in fact, we show in Appendix B that one can achieve even better performance by combining our architectural findings with specially designed continual learning algorithms.
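To make the global-average-pooling manipulation concrete, the framework-free sketch below shows its effect on the classifier's input size. The shapes are the standard ResNet-18 values for 32x32 CIFAR inputs (a 512x4x4 final feature map) and are an illustrative assumption, not a claim about the paper's exact implementation: with pooling the classifier sees 512 features, and with pooling removed (and the feature map flattened) it sees 512*4*4 = 8192, a much wider classifier input.

```python
# Schematic illustration (plain Python, no deep learning framework) of
# how removing global average pooling (GAP) changes the classifier's
# input size in a ResNet-18-like model. Assumed final feature map for
# 32x32 CIFAR inputs: 512 channels x 4 height x 4 width.

C, H, W = 512, 4, 4

def gap(feature_map):
    """Global average pooling: average each channel's HxW activations."""
    return [sum(sum(row) for row in channel) / (H * W)
            for channel in feature_map]

def flatten(feature_map):
    """No pooling: pass every activation to the classifier."""
    return [v for channel in feature_map for row in channel for v in row]

# A dummy feature map filled with a constant value.
feature_map = [[[1.0] * W for _ in range(H)] for _ in range(C)]

with_gap = gap(feature_map)        # classifier input: 512 features
without_gap = flatten(feature_map) # classifier input: 8192 features

print(len(with_gap), len(without_gap))  # 512 8192
```

Removing the pooling layer therefore changes only how the final feature map reaches the classifier, yet, as discussed above, this single architectural choice can rival explicitly designed CL algorithms on this benchmark.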

