SPURIOUS FEATURES IN CONTINUAL LEARNING

Abstract

Continual Learning (CL) is the research field that addresses learning without forgetting when the data distribution is not static. This paper studies the influence of spurious features on continual learning algorithms. We show that continual learning algorithms solve tasks by selecting features that are not generalizable. Our experiments highlight that continual learning algorithms face two related problems: (1) spurious features (SFs) and (2) local spurious features (LSFs). The first is caused by a covariate shift between training and testing data, while the second is caused by the limited access to data at each training step. We study (1) through a consistent set of continual learning experiments that vary the amount of spurious correlation and the support of the data distribution. We show that (2) is a major cause of performance decrease in continual learning, along with catastrophic forgetting (CF). This paper presents a different way of understanding performance decrease in continual learning by highlighting the influence of (local) spurious features on algorithm performance.

1. INTRODUCTION

Feature selection is a standard machine learning problem. Its objectives are to improve prediction performance, to provide faster and more effective predictors, and to provide a better understanding of the underlying process that generated the data (Guyon & Elisseeff, 2003). In this paper, we are interested in improving prediction performance in the presence of spurious features. Spurious features arise when features correlate well with labels in the training data but not in the test data. Learning algorithms that rely on spurious features generalize badly to test data. In continual learning (CL), the training data distribution changes through time. Hence, we can expect that spurious features (SFs) present at one time-step of the data distribution will not last. A continual learning algorithm that relies on a spurious feature to solve a task can thus be resilient and learn better features later, given more data. Algorithms can also aim to detect and discard spurious features learned in the past (Javed et al., 2020).

An example of a task with spurious features is a classification task between cars and bikes where, in the training data, all cars are red and all bikes are white, but in the test data both come in a unique blue not present in the training data. A model could easily overfit to color to solve the task, even though color is not discriminative in the test data. This problem is notably caused by a covariate shift between train and test data. In CL, we would expect future tasks to bring pictures of cars and bikes in other colors, allowing better features to be learned. On the other hand, CL exhibits a second type of spurious feature: local spurious features. Local spurious features denote features that correlate well with labels within a task (a state of the data distribution) but not over the full scenario.
In contrast to the usual spurious features, this problem is caused by the unavailability of all data at once: for example, only red cars and white bikes are currently available in the training data, while the test data also contain cars and bikes in colors not yet seen or seen only in the past. Overall, there is no significant covariate shift between the full training data and the test data. It is, therefore, a problem specific to continual learning.
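The distinction between the two problems can be made concrete with a toy numerical sketch. The feature values, sample sizes, and the `make_split` helper below are illustrative assumptions, not the paper's actual setup: each sample has one genuinely predictive "shape" feature and one "color" feature drawn from a class-specific palette.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n, colors_per_class):
    """Toy 'car vs. bike' data: each sample is (shape, color).
    The shape feature is genuinely predictive; the color is drawn from a
    class-specific palette, which may or may not correlate with the label."""
    labels = rng.integers(0, 2, size=n)
    shape = labels + 0.3 * rng.standard_normal(n)  # always predictive
    color = np.array([rng.choice(colors_per_class[y]) for y in labels], float)
    return np.stack([shape, color], axis=1), labels

# Spurious feature (covariate shift): color is perfectly predictive in
# train (red cars=0.0, white bikes=1.0) but uninformative in test (blue=0.5).
X_train, y_train = make_split(1000, {0: [0.0], 1: [1.0]})
X_test,  y_test  = make_split(1000, {0: [0.5], 1: [0.5]})

# Local spurious feature (no global covariate shift): each task sees only
# one palette; pooled over both tasks, color is no longer predictive.
X_task1, y_task1 = make_split(500, {0: [0.0], 1: [1.0]})  # red cars, white bikes
X_task2, y_task2 = make_split(500, {0: [1.0], 1: [0.0]})  # white cars, red bikes

# A classifier relying only on color "solves" the training set (accuracy 1.0)
# and task 1, but fails on the shifted test set and collapses on task 2.
color_rule = lambda X: (X[:, 1] > 0.5).astype(int)
print((color_rule(X_train) == y_train).mean())  # 1.0
print((color_rule(X_test) == y_test).mean())    # ~0.5
print((color_rule(X_task1) == y_task1).mean())  # 1.0
print((color_rule(X_task2) == y_task2).mean())  # 0.0
```

Note that pooling the two tasks removes the color-label correlation entirely, so the local spurious feature arises purely from seeing the tasks one at a time.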



This paper investigates both the problem of spurious features (with covariate shift) and the problem of local spurious features (without covariate shift) in CL, as shown in Fig. 1. Our contributions are: (1) we propose a methodology to highlight the problems of spurious features and local spurious features in continual learning; (2) we create a binary CIFAR10 scenario, SpuriousCIFAR2, inspired by colored MNIST, to experiment with spurious correlations; (3) we propose a modified version of out-of-distribution (OOD) generalization methods for continual learning and evaluate them on SpuriousCIFAR2; (4) we identify local spurious features as a core challenge for continual learning algorithms, along with catastrophic forgetting (CF).
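The exact construction of SpuriousCIFAR2 is detailed later in the paper; as a rough illustration of the colored-MNIST idea it builds on, the sketch below injects a tunable class-color correlation into a binary image dataset. The tint values, the `correlation` parameter, and the dummy data standing in for two CIFAR-10 classes are all assumptions for illustration, not the paper's actual protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

def colorize(images, labels, correlation):
    """Tint each 32x32x3 image with its class's color with probability
    `correlation`, otherwise with the other class's color.
    correlation=1.0 makes color perfectly predictive; 0.5 makes it useless."""
    tints = np.array([[1.0, 0.2, 0.2],   # class 0: reddish
                      [0.2, 0.2, 1.0]])  # class 1: bluish
    out = images.copy()
    for i, y in enumerate(labels):
        use_own = rng.random() < correlation
        tint = tints[y if use_own else 1 - y]
        out[i] = np.clip(out[i] * tint, 0.0, 1.0)
    return out

# Dummy binary subset standing in for two CIFAR-10 classes.
images = rng.random((100, 32, 32, 3)).astype(np.float32)
labels = rng.integers(0, 2, size=100)

train_x = colorize(images, labels, correlation=0.95)  # color mostly predictive
test_x  = colorize(images, labels, correlation=0.5)   # color uninformative
```

Varying `correlation` between the train and test splits (or between tasks) is one simple way to control the amount of spurious correlation a learner is exposed to.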

