LEARNING INVARIANT FEATURES FOR ONLINE CONTINUAL LEARNING

Abstract

It has been shown recently that learning only discriminative features sufficient to separate the classes has a major shortcoming for continual learning (CL). This is because many features that are not learned may be necessary for distinguishing the classes of some future tasks. When such a future task arrives, these features have to be learned by updating the network, which causes catastrophic forgetting (CF). A recent online CL work showed that if the learning method can learn as many features as possible from each class, called holistic representations, CF can be significantly reduced, achieving a large performance gain. This paper argues that learning only holistic representations is still insufficient. The learned representations should also be invariant, and features that are present in the data but irrelevant to the class (e.g., background information) should be ignored for better generalization across tasks. This new condition further boosts the performance significantly. This paper proposes several strategies and a loss to learn holistic and invariant representations and evaluates their effectiveness in online CL.¹

1. INTRODUCTION

A major challenge of continual learning (CL) is catastrophic forgetting (CF) (McCloskey & Cohen, 1989), which is caused by updating network parameters learned from previous tasks when learning a new task. Although many empirical approaches have been proposed to deal with CF, limited theoretical work has studied the conditions necessary for CL to overcome CF. Recently, Guo et al. (2022) argued that it is necessary to learn holistic representations of the data. This work proposes another condition, invariance, and argues that the learning of each task itself needs to be improved so that future tasks do not need to make major changes to old parameters.

It is well known that supervised learning losses (e.g., cross-entropy) learn only discriminative features that are sufficient to separate the classes in a task. This is problematic for CL for two main reasons. (1) Many features that are not learned may be necessary to distinguish the classes of some future tasks. When such a future task arrives, these features have to be learned, which may make significant changes to the existing parameters and cause CF. (2) Even if the previous parameters are completely protected, the classes in the new task still make classification difficult: because each task learns discriminative features only for its own classes, confusion arises when all classes learned so far must be classified together. We call this biased representation learning.

For example, task-1 learns to classify black pigs and white doves. The learner may learn only the color features (e.g., black and white), as they are sufficient to separate the two classes. However, task-2 learns rabbits and cows, which can be black or white. The color features learned in task-1 are no longer sufficient. Shape-based features need to be learned, which can make major changes to the existing parameters and cause CF.
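The color-vs-shape example above can be sketched as a toy experiment (the data and decision rule below are hypothetical illustrations, not the paper's setup): a classifier that relies only on the color feature is perfect on task-1 but collapses to chance on task-2 classes whose color is uninformative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # samples per class

# Each sample is [color, shape]; color: 0 = black, 1 = white.
pig  = np.column_stack([np.zeros(n), rng.normal(0.0, 0.3, n)])  # black, pig-like shape
dove = np.column_stack([np.ones(n),  rng.normal(2.0, 0.3, n)])  # white, dove-like shape

# A "discriminative" rule learned on task-1: predict class 1 (dove) if color > 0.5.
def color_rule(x):
    return (x[:, 0] > 0.5).astype(int)

acc_task1 = np.mean(np.concatenate([color_rule(pig) == 0,
                                    color_rule(dove) == 1]))

# Task-2 classes (rabbit vs. cow) can be black OR white: color carries no signal.
rabbit = np.column_stack([rng.integers(0, 2, n), rng.normal(0.0, 0.3, n)])
cow    = np.column_stack([rng.integers(0, 2, n), rng.normal(2.0, 0.3, n)])

acc_task2 = np.mean(np.concatenate([color_rule(rabbit) == 0,
                                    color_rule(cow) == 1]))

print(f"color rule on task-1: {acc_task1:.2f}")  # perfect
print(f"color rule on task-2: {acc_task2:.2f}")  # near chance level
```

A rule based on the shape feature would have transferred to task-2, which is the point of the holistic-representation argument: features not needed to separate the current classes may be exactly the ones future tasks require.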
If the shape-based features had been learned in task-1, learning task-2 would not need to update the parameters linked to the representations of pigs and doves as much, resulting in less CF (mitigating (1) above). Since there are now 4 classes to classify, confusion can still arise: some cows or rabbits may be classified as pigs or doves due to the same color, and some pigs may be classified as cows or rabbits due to the shape-based features learned in task-2. However, if all features had been learned in task-1 and task-2, such misclassifications would be reduced (dealing with (2)). Recently, Guo et al. (2022) proposed to learn holistic representations from the input data to cover as many characteristics of the input as possible. Their system OCM

¹ The code has been submitted in the supplemental material.

