WARPING THE SPACE: WEIGHT SPACE ROTATION FOR CLASS-INCREMENTAL FEW-SHOT LEARNING

Abstract

Class-incremental few-shot learning, where new sets of classes are provided sequentially with only a few training samples each, presents a great challenge due to catastrophic forgetting of old knowledge and overfitting caused by the lack of data. During finetuning on new classes, the performance on previous classes deteriorates quickly even when only a small fraction of parameters is updated, since the previous knowledge is broadly associated with most of the model parameters in the original parameter space. In this paper, we introduce WaRP, the weight space rotation process, which transforms the original parameter space into a new space so that most of the previous knowledge can be pushed compactly into only a few important parameters. By properly identifying and freezing these key parameters in the new weight space, we can finetune the remaining parameters without affecting the knowledge of previous classes. As a result, WaRP provides additional room for the model to effectively learn new classes in future incremental sessions. Experimental results confirm the effectiveness of our solution and show improved performance over state-of-the-art methods.

1. INTRODUCTION

Humans can easily acquire new concepts while continually preserving old experiences over the course of their life span. With a growing desire to imitate this ability, incremental or continual learning has recently been brought into the spotlight in the AI community (Hung et al., 2019; Wortsman et al., 2020; Saha et al., 2021). Here, due to storage and privacy constraints, it is impractical to save all the training samples of previous tasks during the training process (Desai et al., 2021). The most challenging issue in this setup is to preserve the knowledge of previous tasks against catastrophic forgetting (Serra et al., 2018). More recently, an increasing need for such learning capability when dealing with rare data (e.g., military images, medical data, photos of rare animals) has encouraged many researchers to focus on a more challenging setup, known as class-incremental few-shot learning (CIFSL). In CIFSL, each task consists of only a few training samples, making the problem much harder, as we must additionally handle the severe overfitting caused by the lack of training data. Prior works on CIFSL typically take the following two steps: (i) pretraining the model on the first task (base classes) and (ii) adapting the pretrained model to new classes (novel classes) in each training session (e.g., via finetuning), assuming that the first task contains a sufficiently large number of training samples. One recent work, F2M (Shi et al., 2021), finds flat local minima during the pretraining stage and then finetunes the model within this flat region when learning novel classes, thereby mitigating forgetting. Another line of work, FSLL (Mazumder et al., 2021), mitigates forgetting by keeping some important parameters frozen and finetuning only the remaining trainable parameters during the incremental learning sessions.
Several other works adopt different strategies to learn new classes well during incremental sessions without forgetting (Tao et al., 2020; Cheraghian et al., 2021a; Zhang et al., 2021; Kukleva et al., 2021; Chen & Lee, 2021; Akyürek et al., 2022) .

Motivation.

One key characteristic shared among these prior works is that the model update is performed in the original parameter space, which is defined here with the standard basis. To illustrate, finetuning only a small fraction of parameters in this original space, e.g., 3% of the model parameters (with the remaining 97% frozen), throughout the incremental sessions still significantly degrades the performance on base classes. This implies that the previous knowledge is, to a certain extent, broadly associated with most of the model parameters in the original space. Motivated by this, we pose the following question: Can we find another basis, i.e., a new weight parameter space, such that we can push most of the previous knowledge compactly into only a few key parameters?

Contributions.

In this paper, we introduce the concept of the Weight space Rotation Process (WaRP) that provides a solution to the above question. By viewing the weight matrix of a neural network from a different perspective, WaRP transforms the original parameter space into a new space. For any given orthonormal set of matrices, we can reparameterize the weight in (1) as

W = \underbrace{w_{11}}_{\text{update}} K_{11} + \underbrace{w_{12}}_{\text{freeze}} K_{12} + \underbrace{w_{21}}_{\text{update}} K_{21} + \underbrace{w_{22}}_{\text{update}} K_{22},   (2)

where B = {K_{11}, K_{12}, K_{21}, K_{22}} is a newly constructed basis consisting of orthonormal matrices and w_{ij} is the weight parameter reparameterized according to K_{ij}. As shown in Figure 1, we can always rotate the axes by which the weight is represented for an arbitrary basis B. Here, if w_{12} in (2) plays a key role in determining the output of the layer, then {K_{11}, K_{21}, K_{22}} can be viewed as flat directions. Thus, finetuning can be performed along the flat directions by freezing the important parameter w_{12} in the new space, which effectively preserves the previous knowledge (Figure 2). Figure 1 provides a high-level description of WaRP and our finetuning process along the flat directions.
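The reparameterization can be sketched numerically. Any pair of orthonormal bases U and V induces an orthonormal set of rank-one matrices K_ij = u_i v_j^T spanning the space of weight matrices, and the rotated coordinates w_ij = u_i^T W v_j reconstruct W exactly. The dimensions and the random choice of basis below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.standard_normal((m, n))  # original weight matrix

# Any orthonormal bases U (m x m) and V (n x n) induce an orthonormal
# set of rank-one matrices K_ij = u_i v_j^T spanning R^{m x n}.
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Rotated coordinates w_ij = u_i^T W v_j (the weights in the new space).
w = U.T @ W @ V

# Reconstruct W from the new basis: W = sum_ij w_ij * K_ij.
W_rec = sum(w[i, j] * np.outer(U[:, i], V[:, j])
            for i in range(m) for j in range(n))
assert np.allclose(W, W_rec)  # the rotation is lossless
```

Since the rotation is invertible, no expressiveness is lost; the point is only that a well-chosen (U, V) can concentrate the important coordinates w_ij into a few entries.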
By introducing the concept of WaRP, we propose a new strategy that constructs an appropriate basis B using the low-rankness property of activations, and redefines the model parameters in this new space so that we can push most of the knowledge of the base classes compactly into only a few parameters. Once we construct the new weight space, we identify and freeze the parameters important for keeping the knowledge of previous classes at the end of each training session, and finetune only the remaining parameters.
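The strategy above can be illustrated with a toy numpy sketch (not the paper's exact algorithm): when base-session activations are approximately low-rank, an SVD of the activation matrix yields an input-side basis whose leading directions carry most of the old knowledge; rotating the weight into that basis and freezing the corresponding columns lets a finetuning update leave old outputs nearly untouched. The name U_act, the SVD-based basis choice, and the 95% energy threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n_samples = 6, 4, 200
W = rng.standard_normal((d_out, d_in))   # layer weight, output y = W @ x

# Hypothetical base-session activations that are approximately low-rank:
# inputs lie mostly in a 2-dimensional subspace plus small noise.
subspace = rng.standard_normal((d_in, 2))
X = subspace @ rng.standard_normal((2, n_samples)) \
    + 0.01 * rng.standard_normal((d_in, n_samples))

# Left singular vectors of the activation matrix give an orthonormal
# input-side basis; leading directions carry most of the old knowledge.
U_act, S, _ = np.linalg.svd(X, full_matrices=False)

# Rotate the weight into the new space: y = W x = (W U_act)(U_act^T x).
W_rot = W @ U_act

# Mark as important the rotated columns covering 95% of the activation
# energy (the 0.95 threshold is an illustrative choice).
energy = np.cumsum(S**2) / np.sum(S**2)
k = int(np.searchsorted(energy, 0.95)) + 1
frozen = np.arange(d_in) < k             # boolean mask over columns

# A finetuning step updates only the unfrozen columns of W_rot.
grad = rng.standard_normal(W_rot.shape)  # stand-in gradient from a new task
W_rot_new = W_rot - 0.1 * grad * (~frozen)

# Outputs on the old activations barely change, because the updated
# columns align with directions the old inputs hardly excite.
y_old = W @ X
y_new = (W_rot_new @ U_act.T) @ X
drift = np.max(np.abs(y_old - y_new))    # small: old knowledge preserved
```

In the original (standard-basis) space, by contrast, no such small set of frozen coordinates exists, since the old knowledge is spread across most entries of W.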



Figure 1: Visualization of the loss landscape for base classes after pretraining. Left: red dashed arrows indicate the directions of the standard basis. Finetuning along either of these directions at each incremental session can adversely affect the loss; in this original space, it is generally challenging to capture the flat directions directly. Right: weight space rotation with a desired basis (blue dashed lines) provides us with the flat direction K11; following this direction, we can finetune the model without performance loss on the base classes. The performance on the novel classes is also preserved by accumulating and freezing the important parameters obtained at each session.

