REGULARIZATION SHORTCOMINGS FOR CONTINUAL LEARNING

Abstract

In most machine learning algorithms, training data are assumed to be independent and identically distributed (iid). When this is not the case, the performance of the algorithms suffers, leading to the well-known phenomenon of catastrophic forgetting. Algorithms dealing with this problem are gathered in the "Continual Learning" research field. In this paper, we study regularization-based approaches to continual learning and show that these approaches cannot learn to discriminate classes from different tasks in an elementary continual learning benchmark: the class-incremental setting. We give a theoretical argument for this shortcoming and illustrate it with experiments. Moreover, we show that it can have important consequences for multi-task reinforcement learning or for pre-trained models used in continual learning. We believe this paper to be the first to propose a theoretical description of regularization shortcomings for continual learning.

1. INTRODUCTION

Continual Learning is a sub-field of machine learning dealing with non-iid (independent and identically distributed) data French (1999); Lesort et al. (2019c). Its goal is to learn the global optimum of an optimization problem where the data distribution changes through time. This is typically the case in databases that are regularly augmented with new data, or when data is streamed to the algorithm with limited storage possibilities. Continual learning (CL) looks for alternatives to iid training that avoid complete retraining on all data each time new data becomes available. CL algorithms propose different memory storage approaches to collect information from past learning experiences, together with learning algorithms that continue to learn from this memory and the new data. In this paper, we propose to study the class-incremental setting with regularization-based methods. The class-incremental setting consists of learning sets of classes incrementally: each task is composed of new classes, and once training ends, the model should classify data from all classes correctly. Without task labels at inference time, the model needs to learn both the discrimination of intra-task classes and the trans-task class discrimination (i.e., distinctions between classes from different tasks). On the contrary, if the task label is available at inference time, only the discrimination of intra-task classes needs to be learned; the discrimination between different tasks is given by the task label. Learning without access to task labels at test time is thus much harder, since it requires discriminating data that are never available at the same time in the data stream. In this setting, we demonstrate that regularization does not help to learn the discrimination between tasks.
For example, if a first task is to discriminate white cats vs. black cats and the second is the same with dogs, a regularization-based method does not provide the learning criteria needed to learn features that distinguish white dogs from white cats. We consider as regularization methods those that aim at protecting important weights learned on past tasks without using a buffer of old data or any replay process. These methods are widely used for continual learning (Kirkpatrick et al., 2017; Zenke et al., 2017; Ritter et al., 2018; Schwarz et al., 2018). In this paper, we show that in the classical setting of class-incremental tasks, this approach has theoretical limitations and cannot be used alone. Indeed, it does not provide any learning criterion to distinguish classes from different tasks. Therefore, in practice, regularization algorithms need external information to make an inference in class-incremental settings. It is provided by the task
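To make the shortcoming concrete, here is a minimal sketch of an EWC-style quadratic penalty in the spirit of Kirkpatrick et al. (2017). This is our own illustrative implementation, not the authors' code, and the variable names are ours. The penalty anchors each weight near its value after the previous task, scaled by an importance estimate (e.g. the diagonal Fisher information); note that it involves only old-task statistics, so it contributes no gradient signal for discriminating classes of different tasks.

```python
import numpy as np

# EWC-style penalty: lam/2 * sum_i F_i * (w_i - w*_i)^2, where w* are the
# weights after the previous task and F is a per-weight importance estimate
# (e.g. the diagonal Fisher information). Illustrative sketch, not the
# original implementation.

def ewc_penalty(w, w_star, fisher, lam=1.0):
    """Quadratic penalty anchoring w to w_star, weighted by fisher."""
    return 0.5 * lam * np.sum(fisher * (w - w_star) ** 2)

w_star = np.array([1.0, -2.0, 0.5])   # weights learned on task 1
fisher = np.array([10.0, 0.1, 1.0])   # per-weight importance from task 1
w      = np.array([1.1, -1.0, 0.5])   # current weights while on task 2

penalty = ewc_penalty(w, w_star, fisher)   # ~0.1

# The total loss on task 2 would be: task-2 classification loss + penalty.
# The penalty depends only on (w, w_star, fisher) -- it never compares
# task-1 classes with task-2 classes, so trans-task discrimination is
# never part of the learning criterion.
```

Weights that were important for task 1 (large Fisher values) are strongly pulled back toward their old values, which protects intra-task-1 discrimination but, as argued above, adds nothing toward separating task-1 classes from task-2 classes.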

