REGULARIZATION SHORTCOMINGS FOR CONTINUAL LEARNING

Abstract

In most machine learning algorithms, training data are assumed to be independent and identically distributed (iid). When this assumption does not hold, the performance of these algorithms degrades, leading to the well-known phenomenon of catastrophic forgetting. Algorithms dealing with this problem are gathered in the "Continual Learning" research field. In this paper, we study regularization-based approaches to continual learning and show that these approaches cannot learn to discriminate classes from different tasks in an elemental continual benchmark, the class-incremental setting. We give a theoretical argument for this shortcoming and illustrate it with experiments. Moreover, we show that it can have important consequences for multi-task reinforcement learning and for pre-trained models used in continual learning. We believe this paper is the first to propose a theoretical description of regularization shortcomings for continual learning.

1. INTRODUCTION

Continual Learning is a sub-field of machine learning dealing with non-iid (independent and identically distributed) data French (1999); Lesort et al. (2019c). Its goal is to learn the global optimum of an optimization problem where the data distribution changes through time. This is typically the case for databases that are regularly augmented with new data, or when data is streamed to the algorithm with limited storage possibilities. Continual learning (CL) looks for alternatives to iid training that avoid complete retraining on all data each time new data becomes available. CL algorithms propose memory mechanisms that collect information from past learning experiences, together with learning procedures that continue learning from this memory and new data. In this paper, we propose to study the class-incremental setting with regularization-based methods. The class-incremental setting consists of learning sets of classes incrementally: each task is composed of new classes, and once training ends, the model should classify data from all classes correctly. Without task labels at inference time, the model needs to learn both intra-task class discrimination and inter-task class discrimination (i.e. distinctions between classes from different tasks). Conversely, if the task label is available at inference time, only intra-task class discrimination needs to be learned; the discrimination between different tasks is given by the task label. Learning without access to task labels at test time is therefore much harder, since the model must discriminate between data that are never available at the same time in the data stream. In such a setting, we demonstrate that regularization does not help to learn the discrimination between tasks.
For example, if a first task is to discriminate white cats from black cats and the second is the same with dogs, a regularization-based method does not provide the learning criteria needed to learn features that distinguish white dogs from white cats. We consider as regularization methods those that aim at protecting important weights learned on past tasks without using a buffer of old data or any replay process. Those methods are widely used for continual learning Kirkpatrick et al. (2017); Zenke et al. (2017); Ritter et al. (2018); Schwarz et al. (2018). In this paper, we show that in the classical setting of class-incremental tasks, this approach has theoretical limitations and cannot be used alone. Indeed, it does not provide any learning criterion to distinguish classes from different tasks. Therefore, in practice, regularization algorithms need external information, provided by the task label at test time, to make an inference in class-incremental settings. However, relying on the task label for inferences is an important limitation on an algorithm's autonomy, i.e. its capacity to run without external information, in most application scenarios. We believe this paper presents important results for a better understanding of CL, which will help practitioners choose the appropriate approach for practical settings.
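The argument can be made concrete with a toy sketch (the stream layout and class indices below are our own hypothetical choices): the data term of the loss can only oppose classes present in the current task, while a weight penalty is a pure function of the parameters and opposes no classes at all, so no term in the training objective ever compares a class of the first task against a class of the second.

```python
# Toy sketch of a two-task class-incremental stream (hypothetical class
# indices of our own choosing): task 0 holds classes {0, 1}, task 1 holds
# classes {2, 3}, and classes never reappear.
stream = [
    {"task": 0, "classes": [0, 1]},
    {"task": 1, "classes": [2, 3]},
]

def pairs_compared(task):
    """Pairs of classes that some loss term can oppose while training `task`.
    The data term only sees current-task classes; a weight penalty is a pure
    function of the parameters and opposes no classes at all."""
    cs = task["classes"]
    data_pairs = {(a, b) for a in cs for b in cs if a < b}
    reg_pairs = set()  # regularization contributes no class comparisons
    return data_pairs | reg_pairs

seen = set()
for task in stream:
    seen |= pairs_compared(task)
# Inter-task pairs such as (0, 2) never appear in any loss term: the stream
# provides no learning criterion to discriminate classes across tasks.
```

Over the whole stream, the intra-task pairs (0, 1) and (2, 3) are each covered by some data term, but an inter-task pair such as (0, 2) never is; this missing comparison is exactly the learning criterion that regularization alone cannot supply.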

2. RELATED WORKS

In continual learning, algorithms protect knowledge from catastrophic forgetting French (1999) by saving it into a memory. The memory should be able to incorporate new knowledge while protecting existing knowledge from modification. We distinguish four categories of memorization: dynamic architectures Rusu et al. (2016); Li & Hoiem (2017), rehearsal Chaudhry et al. (2019); Aljundi et al. (2019); Belouadah & Popescu (2018); Wu et al. (2019); Hou et al. (2019); Caccia et al. (2019), generative replay Shin et al. (2017); Lesort et al. (2019a); Wu et al. (2018), and regularization Kirkpatrick et al. (2017); Zenke et al. (2017); Ritter et al. (2018); Schwarz et al. (2018). In this paper, we are interested in the capacity to make inferences without task labels at test time (test task labels). The task label t (typically a simple integer) is an abstract representation built to help continual algorithms learn. It is designed to index the current task and to signal when the task changes Lesort et al. (2019c). Dynamic architecture is a well-known approach that needs the task label at test time for inference: since the inference path differs between tasks, the test task label is needed to select the right path through the neural network Rusu et al. (2016); Li & Hoiem (2017). Rehearsal and generative replay methods generally need the task label at training time but not for inference Lesort et al. (2019a; b). Finally, regularization methods are often assumed to need task labels only at training time; in this article, we show that in class-incremental settings, they also need them at test time. Test task labels have been used in many continual learning approaches, in particular those referred to as "multi-headed" Lange et al. (2019). However, the need for task labels at inference time makes algorithms unable to make autonomous predictions, and we believe this requirement is not in the spirit of continual learning, which is about creating autonomous algorithms that can learn in dynamic environments Lesort et al. (2019c).

3. REGULARIZATION APPROACH

In this section, we introduce our formalism and present the class-incremental learning problem with a regularization-based approach.

3.1. FORMALISM

In this paper, we assume that the data stream is composed of N disjoint tasks learned sequentially one by one (with N >= 2). Task t is noted T_t and D_t is the associated dataset. The task label t is a simple integer indicating the task index. We refer to the full sequence of tasks as the continuum, noted C_N, and the dataset combining all data up to task t is noted C_t. While learning task T_t, the algorithm has access to data from D_t only. We study a disjoint set of classification tasks, where the classes of each task appear only in this task and never again. We assume at least two classes per task (otherwise a classifier cannot learn). Let f be a function parametrized by θ that implements the neural network's model. At each task t, the model learns an optimal set of parameters θ*_t minimizing the task loss ℓ_{D_t}(·). Since we are in a continual learning setting, θ*_t should also be an optimum for all previous tasks T_t', ∀t' ∈ [0, t]. We consider the class-incremental setting with no test task label. It means that an optimum θ*_1 for T_1 is a set of parameters which, at test time, will classify any data point x from D_0 ∪ D_1 correctly without knowing whether x comes from T_0 or T_1. Therefore, in our continual learning setting, the loss to optimize when learning a given task t is augmented with a remembering loss:

L_t(θ) = ℓ_{D_t}(θ) + λ Ω_t(θ),

where λ balances the two terms and Ω_t(θ) is a remembering penalty that discourages changes to the parameters deemed important for tasks 0 to t-1. For instance, in EWC Kirkpatrick et al. (2017), Ω_t(θ) = Σ_i F_i (θ_i - θ*_{t-1,i})², with F_i the diagonal Fisher information of parameter i.
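The remembering loss described above can be sketched in a few lines of numpy (a minimal sketch under our own naming; the quadratic, diagonal EWC-style penalty is one possible instantiation of the remembering term, not the only one):

```python
import numpy as np

def remembering_penalty(theta, theta_star, importance):
    """Quadratic penalty protecting parameters important for past tasks
    (a diagonal, EWC-style choice; the importance weights are assumed to
    have been estimated at the end of the previous task).
    Note: this is a pure function of the parameters. It contains no data
    from past tasks, hence no criterion to discriminate their classes."""
    return float(np.sum(importance * (theta - theta_star) ** 2))

def augmented_loss(task_loss, theta, theta_star, importance, lam=1.0):
    """Augmented objective: current-task loss plus lambda times the
    remembering penalty."""
    return task_loss(theta) + lam * remembering_penalty(theta, theta_star, importance)

# Toy check: the penalty vanishes at theta_star and grows as theta drifts.
theta_star = np.array([1.0, -2.0, 0.5])   # optimum after the previous task
importance = np.array([10.0, 0.1, 1.0])   # per-parameter importance weights
quad = lambda th: float(np.sum(th ** 2))  # stand-in for the current-task loss
loss_at_star = augmented_loss(quad, theta_star, theta_star, importance)
# At theta_star the penalty is zero, so only the task loss remains.
```

The gradient of the penalty depends only on θ, θ*_{t-1} and the importance weights, never on inputs, which is the crux of the shortcoming studied in this paper.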

