SPURIOUS FEATURES IN CONTINUAL LEARNING

Abstract

Continual Learning (CL) is the research field addressing learning without forgetting when the data distribution is not static. This paper studies the influence of spurious features on continual learning algorithms. We show that continual learning algorithms solve tasks by selecting features that are not generalizable. Our experiments highlight that continual learning algorithms face two related problems: (1) spurious features (SFs) and (2) local spurious features (LSFs). The first is caused by a covariate shift between training and testing data, while the second is caused by the limited access to data at each training step. We study (1) through a consistent set of continual learning experiments varying the amount of spurious correlation and the support of the data distribution. We show that (2) is a major cause of performance decrease in continual learning, along with catastrophic forgetting (CF). This paper presents a different way of understanding performance decrease in continual learning by highlighting the influence of (local) spurious features on algorithm performance.

1. INTRODUCTION

Feature selection is a standard machine learning problem. Its objectives are to improve prediction performance, provide faster and more effective predictors, and give a better understanding of the underlying process that generated the data (Guyon & Elisseeff, 2003). In this paper, we are interested in improving prediction performance in the presence of spurious features. Spurious features arise when features correlate well with labels in training data but not in test data. Learning algorithms that rely on spurious features will generalize poorly to test data. In continual learning (CL), the training data distribution changes through time. Hence, we can expect that spurious features (SFs) present in one time-step of the data distribution will not last. A continual learning algorithm relying on a spurious feature to solve a task can then be resilient and learn better features later, given more data. Algorithms can also aim to detect and ignore spurious features learned in the past (Javed et al., 2020). Consider, for example, a classification task between cars and bikes where, in the training data, all cars are red and all bikes are white, while in the test data both appear in a unique blue not present in the training data. A model could easily overfit to color to solve the task, even though color is not discriminative in the test data. This problem is caused by a covariate shift between train and test data. In CL, we would expect future tasks to bring pictures of cars and bikes in other colors, allowing better features to be learned. On the other hand, CL exhibits a second type of spurious feature: local spurious features, i.e., features that correlate well with labels within a task (a state of the data distribution) but not over the full scenario.
In contrast to the usual spurious features, this problem is caused by the unavailability of all data at once: for example, only red cars and white bikes are currently available in the training data, but the test data also contains cars and bikes in colors not yet seen or seen only in the past. Overall, there is no significant covariate shift between the full training data and the test data. It is, therefore, a problem specific to continual learning, illustrated in Fig. 1. We expect this paper to improve the understanding of the continual learning problem.
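To make the distinction concrete, the following toy sketch (our own illustration; the feature names and the 0.1 noise scale are arbitrary choices, not taken from the paper) builds a two-task stream in which a color feature predicts the label perfectly within each task but carries no information over the pooled stream:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n, color_for_square):
    """Toy task: label 0 = square, 1 = circle, encoded as (shape, color) features.

    Within each task, color correlates perfectly with the label; the correlation
    is reversed in the other task, making color a *local* spurious feature.
    """
    labels = rng.integers(0, 2, size=n)
    shape = labels + 0.1 * rng.standard_normal(n)            # generalizable feature
    color = np.where(labels == 0, color_for_square, 1 - color_for_square)
    return np.stack([shape, color], axis=1), labels

# Task 1: squares are "red" (0); Task 2: squares are "blue" (1).
x1, y1 = make_task(1000, color_for_square=0)
x2, y2 = make_task(1000, color_for_square=1)

# Within task 1, color predicts the label perfectly ...
acc_t1 = np.mean((x1[:, 1] > 0.5) == (y1 == 1))
# ... but over the pooled stream it is uninformative (50% accuracy).
x_all, y_all = np.concatenate([x1, x2]), np.concatenate([y1, y2])
acc_all = np.mean((x_all[:, 1] > 0.5) == (y_all == 1))
```

A model trained only on task 1 has no way to tell, from that task alone, whether the shape feature or the color feature is the one that will survive the distribution change.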

2. RELATED WORK

In a large part of the continual learning literature, algorithms assume that, to avoid catastrophic forgetting (CF), they should not increase the loss on past tasks (Kirkpatrick et al., 2017; Ritter et al., 2018). This leads to the definition of interference/forgetting of Riemer et al. (2019): ⟨∂L(x_i, y_i)/∂θ, ∂L(x_j, y_j)/∂θ⟩ < 0, for all (x_i, y_i) ∈ T_i and (x_j, y_j) ∈ T_j with j > i, where ⟨·,·⟩ is the dot product operator. Following this definition, increasing the loss on past tasks necessarily leads to a performance decrease. However, the algorithm might have learned spurious/local features that need to be forgotten to improve performance. Hence, the loss may need to be temporarily increased to reach a more general solution, and optimizing the interference equation could be counterproductive. Along the same line, the presence of spurious features can be adversarial to most continual regularization strategies. For example, if weight importance is measured with Fisher information, high importance will be assigned to weights exploiting spurious features, and regularization will penalize their modification. Regularization could thus protect features that generalize poorly to the test set. Vanilla rehearsal or generative replay is a good solution to avoid forgetting meaningful information and to deal with spurious features: by replaying old data, algorithms simulate an independent and identically distributed (iid) setting and avoid the local spurious feature problem. Replay methods have been shown to be efficient and versatile even in their most straightforward form (Prabhu et al., 2020). Notably, the CL state of the art on ImageNet uses replay (Douillard et al., 2020; Zhao et al., 2020). The research field that usually deals with spurious correlations is out-of-distribution (OOD) generalization. This field has received a lot of attention in recent years, especially since the Invariant Risk Minimization (IRM) paper (Arjovsky et al., 2019).
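The interference criterion is simple to check numerically. The sketch below does so for a linear model under a squared loss, an illustrative assumption on our part rather than the paper's setup: two examples interfere when their per-example gradients point in opposing directions.

```python
import numpy as np

def grad_sq_loss(theta, x, y):
    """Gradient of the squared loss L = 0.5 * (theta.x - y)^2 w.r.t. theta."""
    return (theta @ x - y) * x

def interferes(theta, example_i, example_j):
    """Riemer et al. (2019) criterion: a negative gradient dot product
    means a step on one example increases the loss on the other."""
    gi = grad_sq_loss(theta, *example_i)
    gj = grad_sq_loss(theta, *example_j)
    return float(gi @ gj) < 0.0

theta = np.array([1.0, 0.0])
# Two examples whose gradients point in opposite directions along the first axis.
ex_i = (np.array([1.0, 0.0]), 0.0)    # residual +1 -> gradient +x
ex_j = (np.array([1.0, 0.0]), 2.0)    # residual -1 -> gradient -x
print(interferes(theta, ex_i, ex_j))  # True
```

Note that the criterion is agnostic to *why* the gradients oppose each other: forgetting a spurious feature and forgetting a genuinely useful one both register as interference, which is the ambiguity discussed above.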
OOD approaches target training scenarios with several training environments, within which different spurious features correlate with labels. The goal is then to learn features that are invariant across all environments, in order to build a predictor valid in all training environments and, potentially, in any other (Arjovsky et al., 2019; Ahuja et al., 2021; Sagawa et al., 2019; Pezeshki et al., 2020). This paper adapts some of those approaches to continual learning to evaluate how they deal with sequences of tasks.
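As one concrete instance of such an approach, the IRMv1 objective of Arjovsky et al. (2019) penalizes the squared gradient of each environment's risk with respect to a scalar dummy classifier multiplying the logits. A minimal numpy sketch, under our assumption of a logistic loss with labels in {-1, +1}, could look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irm_penalty(logits, labels):
    """IRMv1-style penalty: squared gradient of the environment risk with
    respect to a scalar dummy classifier w (evaluated at w = 1).

    labels in {-1, +1}; risk is the mean logistic loss log(1 + exp(-y*w*z)).
    """
    # d/dw mean log(1 + exp(-y*w*z)) at w = 1  ->  mean(-y*z * sigmoid(-y*z))
    grad_w = np.mean(-labels * logits * sigmoid(-labels * logits))
    return grad_w ** 2

# An environment the classifier fits well yields a small penalty ...
good = irm_penalty(np.array([4.0, -4.0]), np.array([1.0, -1.0]))
# ... while systematically wrong logits yield a large gradient at w = 1.
bad = irm_penalty(np.array([-4.0, 4.0]), np.array([1.0, -1.0]))
```

Summing this penalty over environments pushes the features toward a representation whose optimal classifier is the same everywhere, which is the invariance property the text describes.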

3. PROBLEM FORMULATION

This section introduces the spurious feature problems in a sequence of tasks. The goal is to present the key types of features, namely: general, local, and spurious features. General Formalism: We consider a continual scenario of classification tasks. We study a function f_θ(·), implemented as a neural network parameterized by a vector θ ∈ R^p (where p is the number of parameters) representing the set of weight matrices and bias vectors of a deep network. In continual learning, the goal is to find a solution θ* by minimizing a loss L on a stream of data formalized as a sequence of tasks [T_0, T_1, ..., T_{T-1}], such that for all (x_t, y_t) ∼ T_t with t ∈ [0, T-1], f_{θ*}(x_t) = y_t. We do not use the task index for inference (i.e., the single-head setting).
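A minimal sketch of this sequential setting, using logistic regression and full-batch gradient descent as stand-ins for the deep network f_θ (the learning rate and epoch count are arbitrary illustrative values, not the paper's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sequentially(tasks, lr=0.1, epochs=20):
    """Minimize the loss over a stream of tasks, one task at a time.

    tasks: list of (x, y) arrays with y in {0, 1}. theta is shared across
    tasks and never reset (single head, no task index at inference), so
    later tasks can overwrite earlier solutions.
    """
    dim = tasks[0][0].shape[1]
    theta = np.zeros(dim)
    for x, y in tasks:                      # the stream [T_0, T_1, ..., T_{T-1}]
        for _ in range(epochs):
            p = sigmoid(x @ theta)          # predictions, no task conditioning
            theta -= lr * x.T @ (p - y) / len(y)
    return theta

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 3))
y = (x[:, 0] > 0).astype(float)             # ground truth shared by both tasks
theta = train_sequentially([(x[:100], y[:100]), (x[100:], y[100:])])
```

The key point for what follows is that, at each step, the learner only sees the current T_t, so any feature that is discriminative within T_t is indistinguishable from a general one.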



This paper investigates both the problem of spurious features (with covariate shift) and local spurious features (without covariate shift) in CL, as shown in Fig. 1. Our contributions are: (1) we propose a methodology to highlight the problems of spurious features and local spurious features in continual learning; (2) we create SpuriousCIFAR2, a binary CIFAR10 scenario inspired by colored MNIST, to experiment with spurious correlations; (3) we propose modified versions of out-of-distribution (OOD) generalization methods for continual learning and evaluate them on SpuriousCIFAR2; (4) we identify local spurious features, along with CF, as a core challenge for continual learning algorithms.
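The construction of SpuriousCIFAR2 is not detailed in this section, but a colored-MNIST-style corruption of a binary image task might be sketched as follows (hypothetical: the function name, the two-channel encoding, and the 0.9 correlation level are our own illustrative choices and may differ from the paper's actual benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

def colorize(images, labels, correlation=0.9):
    """Inject a class-correlated color channel into a binary image task.

    With probability `correlation`, a sample's image is placed in the color
    channel matching its label, so color predicts the label on roughly
    `correlation` of the training samples.
    """
    n, h, w = images.shape
    agree = rng.random(n) < correlation
    channel = np.where(agree, labels, 1 - labels)   # spurious color assignment
    out = np.zeros((n, 2, h, w), dtype=images.dtype)
    out[np.arange(n), channel] = images             # image goes in "its" color
    return out, labels

# Stand-in data: 32x32 grayscale "images" for a binary task.
images = rng.random((100, 32, 32))
labels = rng.integers(0, 2, size=100)
colored, _ = colorize(images, labels, correlation=0.9)
```

Varying `correlation` across tasks would make the color feature locally spurious rather than globally spurious, matching the two regimes studied in the paper.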

Figure 1: Spurious features and local spurious features. The task is to distinguish squares from circles. In Fig. 1a and 1b, the color is a spurious feature because there is a covariate shift between train and test data. In Fig. 1c and 1d, we observe two tasks of a domain-incremental scenario; the colors are locally spurious in tasks 1 and 2. Even though there is no significant covariate shift between the full train and test data distributions, colors appear discriminative when looking at the data within a task.

