LEARNING INVARIANT FEATURES FOR ONLINE CONTINUAL LEARNING

Abstract

It has been shown recently that learning only the discriminative features sufficient to separate the classes of the current task has a major shortcoming for continual learning (CL). This is because many features that are not learned may be necessary for distinguishing classes of some future tasks. When such a future task arrives, these features have to be learned by updating the network, which causes catastrophic forgetting (CF). A recent online CL work showed that if the learning method can learn as many features as possible from each class, called holistic representations, CF can be significantly reduced, yielding a large performance gain. This paper argues that learning only holistic representations is still insufficient. The learned representations should also be invariant: features that are present in the data but are irrelevant to the class (e.g., background information) should be ignored for better generalization across tasks. This new condition further boosts the performance significantly. This paper proposes several strategies and a loss to learn holistic and invariant representations and evaluates their effectiveness in online CL.[1]

1. INTRODUCTION

A major challenge of continual learning (CL) is catastrophic forgetting (CF) (McCloskey & Cohen, 1989), which is caused by updating network parameters learned from previous tasks while learning a new task. Although many empirical approaches have been proposed to deal with CF, limited theoretical work has been done to study the necessary conditions for CL to overcome CF. Recently, Guo et al. (2022) argued that it is necessary to learn holistic representations of the data. This work proposes another condition, invariance, and argues that the learning of each task itself needs to be improved so that future tasks would not need to make major changes to old parameters. It is well known that supervised learning losses (e.g., cross-entropy) learn only discriminative features that are sufficient to separate the classes in a task. This is problematic for CL for two main reasons. (1) Many features that are not learned may be necessary to distinguish classes of some future tasks. When such a future task comes, these features have to be learned, which may make significant changes to the existing parameters and cause CF. (2) Even if the previous parameters are completely protected, classification over all classes learned so far remains challenging because each task learns only features that discriminate among its own classes, causing confusion when all classes must be classified together. We call this biased representation learning. For example, task-1 learns to classify black pigs and white doves. The learner may learn only the color features (e.g., black and white), as they are sufficient to separate the two classes. However, task-2 learns rabbits and cows, which can be black or white. The color features learned from task-1 are no longer sufficient. Shape-based features need to be learned, which can make major changes to the existing parameters and cause CF.
If the shape-based features had been learned in task-1, learning task-2 would not need to update the parameters linked to the representations of pigs and doves as much, which gives less CF (mitigating (1) above). Since there are now 4 classes to classify, confusion can still arise: some cows or rabbits may be classified as pigs or doves due to the same color, and some pigs may be classified as cows or rabbits due to the shape-based features learned in task-2. However, if all features had been learned in task-1 and task-2, such misclassifications would be reduced (dealing with (2)). Recently, Guo et al. (2022) proposed to learn holistic representations that cover as many characteristics of the input as possible. Their system OCM learns holistic representations in online CL by maximizing the mutual information (MI) between the input data and the learned feature representations, ensuring that as much information in the input as possible is reflected in the learned features. This results in a major performance gain in online CL. This paper argues that learning holistic representations is still sub-optimal. It is also necessary for CL to learn features that are invariant for each class. Features that are present in the input but are irrelevant to the class should be ignored for better generalization to future tasks. For example, to classify images of apple and fish, the green background and red color of apples may be learned, but these features are not invariant to apple. When the new classes cow and ladybird need to be learned, the feature green (or red) is shared (see Figure 1(a)) and may cause high logit outputs for both apple and cow (or ladybird). The learner then has to modify the representation of apple to reduce its logit values, which causes CF. That is, variant features are unsuitable for establishing decision boundaries.
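The MI-maximization idea behind OCM is commonly realized with a contrastive lower bound on mutual information such as InfoNCE. The sketch below is not OCM's actual objective, only a minimal NumPy illustration of how maximizing such a bound ties representations of two augmented views of the same input together, which encourages the encoder to retain more information about the input:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: a contrastive lower bound on the mutual information
    between two views of the same inputs.

    z1, z2: (batch, dim) embeddings of two augmented views; matching rows
    are positive pairs, all other rows in the batch serve as negatives.
    Minimizing this loss maximizes the MI lower bound.
    """
    # L2-normalize so the dot product is cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positive pairs lie on the diagonal; cross-entropy toward them
    return -np.mean(np.diag(log_probs))
```

When the two views are embedded identically the loss is near zero; when positives are mismatched it grows toward log(batch), which is what drives the encoder to preserve input information shared across views.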
If the learner has learned shape and other invariant features for apple, the input of cow (or ladybird) will not activate the parameters linked to the apple representation. Then, in learning cow and ladybird, changes to the parameters that are important for apple will be limited, resulting in less CF. This paper aims to learn invariant and holistic representations for each class. This additional invariance condition gives another boost to online CL performance. Note that invariance is not critical for traditional supervised learning due to the i.i.d. assumption,[2] but for CL it is very important because each new task introduces new distributions and we want each class to be distinguishable from any past and future classes, which may have similar variant features that can confuse the model. This paper works in the online class-incremental learning (CIL) setting.[3] It proposes a replay-based method called IFO (Invariant Feature learning for Online CL), which adds the invariance condition to holistic representation learning. We propose two new methods and one new optimization objective to achieve invariance. The first method constructs a diverse set of environments and forces the model to learn features that are invariant across the environments. The second method is a novel use of the replay data to learn invariant features and to deal with a local sampling bias issue. Finally, we combine the two methods and propose a new optimization objective to learn invariant features. Theoretical justifications are also given. We verify the effectiveness of IFO in three online CL scenarios: the traditional disjoint-task scenario, the blurry task boundary scenario, and the data shift scenario. The results show that the proposed IFO outperforms strong baselines by a large margin.
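The "invariant across constructed environments" idea can be made concrete with a risk-variance penalty in the spirit of V-REx (Krueger et al., 2021). This is only an illustrative sketch under that assumption, not IFO's actual objective (which is introduced later in the paper): if each augmentation-defined environment yields its own empirical risk, penalizing the variance of those risks pushes the model toward features whose predictive power is stable, i.e., invariant, across environments:

```python
import numpy as np

def invariance_objective(env_losses, beta=10.0):
    """Risk-variance penalty across environments (V-REx-style sketch).

    env_losses: one empirical risk per environment, computed with the
    same model on differently augmented versions of the data. The mean
    term fits the data; the variance term punishes features that help
    in some environments but not others (variant features).
    """
    risks = np.asarray(env_losses, dtype=float)
    return risks.mean() + beta * risks.var()
```

A model relying on, say, background color will have low loss in environments where the background survives augmentation and high loss where it does not, so its risk variance (and hence this objective) is large, while a shape-based model scores nearly the same risk everywhere.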

2. RELATED WORK

Although many CL approaches have been proposed, little work has been done to study the necessary conditions for CL. The replay approach saves a small amount of past data and uses it to protect/adjust the previous knowledge in learning a new task (Rebuffi et al., 2017; Wu et al., 2019; Hou et al., 2019; Chaudhry et al., 2020; Zhao et al., 2021; Korycki & Krawczyk, 2021; Sokar et al., 2021; Yan et al., 2021; Wang et al., 2022a). Pseudo-replay generates replay samples (Shin et al., 2017; Hu et al., 2019; Sokar et al., 2021). Using regularizations to penalize changes to important parameters of previous tasks is another approach (Kirkpatrick et al., 2017; Ritter et al., 2018; Ahn et al., 2019; Yu et al., 2020; Zhang et al., 2020). Parameter-isolation approaches protect models of old tasks using masks and/or network expansion (Ostapenko et al., 2019; von Oswald et al., 2020; Li et al., 2019; Hung et al., 2019; Rajasegaran et al., 2020; Abati et al., 2020; Wortsman et al., 2020; Saha et al., 2021). Zhu et al. (2021) found that data augmentations help learn more transferable features. Online CL methods are mainly based on replay: ER randomly samples the replay data (Chaudhry et al., 2020), MIR chooses replay samples whose losses increase most (Aljundi et al., 2019a), ASER uses Shapley value theory (Shim et al., 2021), and GDumb produces class-balanced replay data (Prabhu et al., 2020). GSS diversifies the gradients of the replay data (Aljundi et al., 2019b). DER++ uses knowledge distillation (Buzzega et al., 2020), SCR uses a contrastive loss (Mai et al., 2021), and NCCL calibrates the network (Yin et al., 2021). Applications of online CL are also reported (Yan et al., 2021; Wang et al., 2021). Bang et al. (2021) and Bang et al. (2022) proposed two blurry online CL settings. IFO is also a replay method but focuses on learning invariant features. Domain generalization (DG) is also related.
DG learns a model from multiple given source domains that share the same class labels and tests it on inputs from unseen target domains. Ex-



[1] The code has been submitted in the supplemental material.
[2] When out-of-distribution data or data shift is involved, invariance is also important (Arjovsky et al., 2019).
[3] In the CIL setting, no task-related information (e.g., task-id) is provided in testing. The other popular CL setting is task-incremental learning (TIL), which requires the task-id to be provided for each test instance.

