OPTIMIZING SPCA-BASED CONTINUAL LEARNING: A THEORETICAL APPROACH

Abstract

Catastrophic forgetting and the stability-plasticity dilemma are two major obstacles to continual learning. In this paper, we first propose a theoretical analysis of an SPCA-based continual learning algorithm using high-dimensional statistics. Second, we design OSCL (Optimized SPCA-based Continual Learning), which builds on a flexible task-optimization scheme derived from the theory. When optimizing a single task, catastrophic forgetting can be provably prevented. When optimizing multiple tasks, the trade-off between integrating knowledge from the new task and retaining knowledge of the old tasks can be controlled by assigning appropriate weights to the corresponding tasks in compliance with the objectives. Experimental results confirm that the various theoretical conclusions are robust to a wide range of data distributions. In addition, several applications on synthetic and real data show that the proposed method, while being computationally efficient, achieves results comparable with the state of the art.

1. INTRODUCTION

Continual learning paradigm. Machine learning methods generally learn from samples drawn at random from a stationary distribution. However, this scenario is rare in reality. Continual learning (CL) is a machine learning paradigm in which data arrive continuously, possibly in a non-i.i.d. way, and knowledge is accumulated over time (Schlimmer & Fisher, 1986; Ebrahimi et al., 2019; Lee et al., 2020; De Lange et al., 2021). Continual learning is essential for designing real-world machine learning systems that mimic humans. On the one hand, humans continue to acquire knowledge and solve new problems throughout their lifetimes; the goal of continual learning is to mimic this capacity to learn from a non-stationary data stream without catastrophically forgetting the knowledge already learned (Titsias et al., 2019; Lee et al., 2020). On the other hand, once a trained model is deployed in a real application, the data distribution will drift over time, so the machine learning algorithm must be able to adapt continuously to these changes (Kirkpatrick et al., 2017; Lesort et al., 2020).

Challenges in continual learning. One of the major challenges of continual learning is avoiding catastrophic forgetting (McCloskey & Cohen, 1989; Chen & Liu, 2018; Aljundi, 2019), which occurs when performance on previous tasks degrades severely during the learning process. To account for both the current task and the previous tasks, the stability-plasticity dilemma was introduced (Nguyen et al., 2017; Rajasegaran et al., 2019). More specifically, plasticity refers to the ability to integrate new knowledge, and stability to the capacity to retain previous knowledge (which is related to catastrophic forgetting).
Note that although the term catastrophic forgetting is strongly associated in the literature with deep neural network models, it is a fairly general phenomenon that can occur in any machine learning algorithm: it has been observed in shallow single-layer models such as self-organizing feature maps (Richardson & Thomas, 2008; Chen & Liu, 2018).

State of the art. Most works on CL have focused on purely supervised tasks, which is also the focus of the present work (Kirkpatrick et al., 2017; Nguyen et al., 2017; Lee et al., 2020). Supervised CL methods can be classified into three main categories based on how knowledge and data are updated and stored: replay-based, regularization-based, and dynamic-architecture-based methods (De Lange et al., 2021). Replay-based approaches address catastrophic forgetting by saving previously seen data and reusing it while learning a new task (Lopez-Paz & Ranzato, 2017; Isele & Cosgun, 2018; Titsias et al., 2019). Regularization-based methods alleviate forgetting by adding an extra regularization term to the loss function that penalizes updates of crucial weights (Kirkpatrick et al., 2017; Ebrahimi et al., 2019). Dynamic-architecture-based methods flexibly update the learning model as new tasks arrive, based on the task complexity and the relations between the tasks (Rajasegaran et al., 2019; Lee et al., 2020). Although successful cases of supervised continual learning have been reported, the algorithms tend to exhibit unpredictable behavior, generally require a host of additional, inconvenient hyperparameters, and incur a high computational cost. More importantly, supervised continual learning methods lack theoretical guarantees against catastrophic forgetting.

Contributions of the paper. In this paper, we introduce a novel CL method based on Supervised Principal Component Analysis (SPCA), and we provide a theoretical analysis of the proposed method using high-dimensional statistics.
This analysis allows us to predict the performance of the algorithm in advance. Furthermore, we develop a label optimization scheme, based on the theory, that provably avoids catastrophic forgetting. As a result, we obtain OSCL, a simple and efficient continual learning algorithm that is free of hyperparameters and has a low computational cost. Moreover, the theory allows the tasks to be weighted during the learning process according to the user's preferences and priorities, which is in line with resolving the stability-plasticity dilemma. As such, the main contributions of the paper can be summarized as follows. We propose a simple, computationally inexpensive continual learning algorithm based on SPCA and provide a theoretical analysis (yielding exact classification error rather than bounds) using high-dimensional statistics. Using the theoretical results, we develop a label optimization scheme that provably prevents catastrophic forgetting and also allows for a weighting of the tasks during learning. Several applications are presented to corroborate the practical usefulness of the approach in terms of efficiency and flexibility.

Outline. The remainder of the paper is organized as follows. Section 2 discusses several works in the continual learning literature and highlights the differences and contributions of the present paper. Section 3 introduces and formalizes the continual learning framework and proposes a simple continual learning algorithm based on SPCA. Section 4 develops a theoretical analysis of this algorithm as well as flexible optimization tools to benefit from all tasks. Section 5 provides several applications to corroborate the different conclusions of the paper.

Notations. Matrices are represented in bold capital letters (e.g., matrix A), vectors in bold lowercase letters (e.g., vector v), and scalars without bold (e.g., variable a).
The canonical vector of size n is denoted by e_m^[n] ∈ R^n, 1 ≤ m ≤ n, whose i-th element is 1 if i = m and 0 otherwise. The diagonal matrix with the entries of x on its diagonal and 0 elsewhere is denoted by D_x. Generally, the subscript t refers to the task number and the superscript j to the class index; for example, x_{t,ℓ}^j denotes the ℓ-th sample of class j of task t. 1_n ∈ R^n is the vector of all ones, and the matrix Σ_v ∈ R^{n×n} denotes the covariance matrix of the random vector v. [n] denotes the set {1, . . . , n}. n_t denotes the number of samples in task t, and n_{tj} the number of samples in class j of task t.
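Since the method outlined above builds on Supervised PCA, the following sketch illustrates, as background only, one common HSIC-based formulation of supervised PCA (projecting onto the top eigenvectors of X H L H X^T, with L a label kernel and H the centering matrix). This is not the paper's exact algorithm, and all function and variable names are ours:

```python
import numpy as np

def spca_fit(X, Y, k):
    """Background sketch of HSIC-based supervised PCA.
    X: (p, n) data matrix (columns are samples).
    Y: (n, c) one-hot label matrix.
    Returns U (p, k): top-k eigenvectors of X H L H X^T,
    where L = Y Y^T is a linear kernel on the labels."""
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    L = Y @ Y.T                                 # label kernel
    M = X @ H @ L @ H @ X.T                     # (p, p) target matrix
    vals, vecs = np.linalg.eigh(M)              # eigenvalues in ascending order
    U = vecs[:, np.argsort(vals)[::-1][:k]]     # keep top-k eigenvectors
    return U                                    # supervised projection: U.T @ X

# Toy usage: two Gaussian classes in dimension p = 10, means -1 and +1.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(-1.0, 1.0, (10, 20)),
               rng.normal(+1.0, 1.0, (10, 20))])
Y = np.vstack([np.tile([1.0, 0.0], (20, 1)),
               np.tile([0.0, 1.0], (20, 1))])
U = spca_fit(X, Y, k=1)
Z = U.T @ X   # 1-D label-aware projection separating the two classes
```

The projection direction here is chosen using the labels, unlike plain PCA, which is the property the paper's continual learning scheme relies on.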

2. RELATED WORK
2.1. CONTINUAL LEARNING

As mentioned in the introduction, we focus on the state of the art of supervised CL, which is divided into the three main categories that we develop in detail in this section. The appendix provides a more detailed picture for interested readers.

Replay-based methods. Replay-based approaches retain a certain amount of historical examples, extracted features, or generated examples to reduce forgetting when training the model on new data. Three challenges need to be solved. The first is selecting appropriate previous samples. In this context, Isele & Cosgun (2018) proposed four strategies for choosing which data to store, and Aljundi et al. (2019) formulated sample selection as a constraint reduction problem.
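The sample-selection problem above can be made concrete with a minimal replay buffer based on reservoir sampling, one standard selection strategy (illustrative only; this is not the specific scheme of Isele & Cosgun (2018) or Aljundi et al. (2019), and the class and method names are ours):

```python
import random

class ReservoirBuffer:
    """Minimal replay buffer using reservoir sampling: keeps a uniform
    random subset of the stream seen so far, in O(capacity) memory."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []        # stored samples
        self.n_seen = 0       # total samples observed in the stream
        self.rng = random.Random(seed)

    def add(self, sample):
        """Observe one sample; keep it with probability capacity / n_seen."""
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = sample   # evict a uniformly chosen old sample

    def replay(self, k):
        """Draw k stored samples to mix into the current task's batch."""
        return self.rng.sample(self.data, min(k, len(self.data)))

# Usage: stream 1000 samples through a buffer of capacity 50.
buf = ReservoirBuffer(capacity=50)
for t in range(1000):
    buf.add(t)
print(len(buf.data))   # 50
```

Reservoir sampling guarantees that, at any point in the stream, every sample seen so far has equal probability of being in the buffer, which makes it a common baseline against which the cited selection strategies are compared.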

