INCREMENTAL FEW-SHOT LEARNING VIA VECTOR QUANTIZATION IN DEEP EMBEDDED SPACE

Abstract

The capability of incrementally learning new tasks without forgetting old ones is challenging due to catastrophic forgetting. This challenge becomes greater when novel tasks contain very few labelled training samples. Currently, most methods are dedicated to class-incremental learning and rely on sufficient training data to learn additional weights for newly added classes. Those methods cannot be easily extended to incremental regression tasks and could suffer from severe overfitting when learning few-shot novel tasks. In this study, we propose a nonparametric method in deep embedded space to tackle incremental few-shot learning problems. The knowledge about the learned tasks is compressed into a small number of quantized reference vectors. The proposed method learns new tasks sequentially by adding more reference vectors to the model using the few-shot samples in each novel task. For classification problems, we employ the nearest-neighbor scheme to classify sparsely available data, and incorporate intra-class variation, less-forgetting regularization and calibration of reference vectors to mitigate catastrophic forgetting. In addition, the proposed learning vector quantization (LVQ) in deep embedded space can be customized as a kernel smoother to handle incremental few-shot regression tasks. Experimental results demonstrate that the proposed method outperforms other state-of-the-art methods in incremental learning.

1. INTRODUCTION

Incremental learning is a learning paradigm that allows the model to continually learn new tasks on novel data, without forgetting how to perform previously learned tasks (Cauwenberghs & Poggio, 2001; Kuzborskij et al., 2013; Mensink et al., 2013). The capability of incremental learning is increasingly important in real-world applications, in which deployed models are exposed to possible out-of-sample data. Typically, hundreds of thousands of labelled samples in new tasks are required to re-train or fine-tune the model (Rebuffi et al., 2017). Unfortunately, it is impractical to gather sufficient samples of new tasks in real applications. In contrast, humans can learn new concepts from just one or a few examples, without losing old knowledge. Therefore, it is desirable to develop algorithms that support incremental learning from very few samples. While a natural approach for incremental few-shot learning is to fine-tune part of the base model using novel training data (Donahue et al., 2014; Girshick et al., 2014), the model could suffer from severe over-fitting on new tasks due to the limited number of training samples. Moreover, simple fine-tuning also leads to a significant performance drop on previously learned tasks, termed catastrophic forgetting (Goodfellow et al., 2014). Recent attempts to mitigate catastrophic forgetting generally fall into two streams: memory replay of old training samples (Rebuffi et al., 2017; Shin et al., 2017; Kemker & Kanan, 2018) and regularization on important model parameters (Kirkpatrick et al., 2017; Zenke et al., 2017). However, those incremental learning approaches are developed and tested on unrealistic scenarios where sufficient training samples are available in novel tasks. They may not work well when the training samples in novel tasks are few (Tao et al., 2020b).
To the best of our knowledge, the majority of incremental learning methodologies focus on classification problems and cannot be extended to regression problems easily. In class-incremental learning, the model has to expand its output dimensions to learn novel classes while keeping the knowledge of existing classes. Parametric models estimate additional classification weights for novel classes, while nonparametric methods compute class centroids for novel classes. In comparison, the output dimensions in regression problems do not change during incremental learning, as neither additional weights nor class centroids are applicable. Besides, we find that catastrophic forgetting in incremental few-shot classification can be attributed to three reasons. First, the model is biased towards new classes and forgets old classes because the model is fine-tuned on new data only (Hou et al., 2019; Zhao et al., 2020). Meanwhile, the prediction accuracy on novel classes is not good due to over-fitting on few-shot training samples. Second, features of novel samples could overlap with those of old classes, leading to ambiguity among classes in the feature space. Finally, features of old classes and classification weights are no longer compatible after the model is fine-tuned with new data. In this paper, we investigate the problem of incremental few-shot learning, where only a few training samples are available in new tasks. A unified model is learned sequentially to jointly recognize all classes or regression targets that have been encountered in previous tasks (Rebuffi et al., 2017; Wu et al., 2019). To tackle the aforementioned problems, we propose a nonparametric method to handle incremental few-shot learning based on learning vector quantization (LVQ) (Sato & Yamada, 1996) in deep embedded space.
As such, the adverse effects of imbalanced weights in a parametric classifier can be completely avoided (Mensink et al., 2013; Snell et al., 2017; Yu et al., 2020). Our contributions are threefold. First, a unified framework is developed, termed incremental deep learning vector quantization (IDLVQ), to handle both incremental classification (IDLVQ-C) and regression (IDLVQ-R) problems. Second, we develop intra-class variance regularization, less-forgetting constraints and calibration factors to mitigate catastrophic forgetting in class-incremental learning. Finally, the proposed methods achieve state-of-the-art performance on incremental few-shot classification and regression datasets.

2. RELATED WORK

Incremental learning: Some incremental learning approaches rely on memory replay of old exemplars to prevent forgetting previously learned knowledge. Old exemplars can be saved in memory (Rebuffi et al., 2017; Castro et al., 2018; Prabhu et al., 2020) or sampled from generative models (Shin et al., 2017; Kemker & Kanan, 2018; van de Ven et al., 2020). However, explicit storage of training samples is not scalable if the number of classes is large. Furthermore, it is difficult to train a reliable generative model for all classes from very few training samples. In parallel, regularization approaches do not require old exemplars and impose regularization on network weights or outputs to minimize the change of parameters that are important to old tasks (Kirkpatrick et al., 2017; Zenke et al., 2017). To avoid quick performance deterioration after learning a sequence of novel tasks with regularization approaches, semantic drift compensation (SDC) learns an embedding network via a triplet loss (Schroff et al., 2015) and compensates for the drift of class centroids using novel data only (Yu et al., 2020). In comparison, IDLVQ-C saves only one exemplar per class and uses the saved exemplars to regularize the change in the feature extractor and calibrate the change in the reference vectors. Few-shot learning: Few-shot learning attempts to obtain models for classification or regression tasks with only a few labelled samples. Few-shot models are trained on widely-varying episodes of fake few-shot tasks with labelled samples drawn from a large-scale meta-training dataset (Vinyals et al., 2016; Finn et al., 2017; Ravi & Larochelle, 2017; Snell et al., 2017; Sung et al., 2018). Meanwhile, recent works attempt to handle novel few-shot tasks while retaining the knowledge of the base task. These methods are referred to as dynamic few-shot learning (Gidaris & Komodakis, 2018; Ren et al., 2019a; Gidaris & Komodakis, 2019).
However, dynamic few-shot learning is different from incremental few-shot learning, because such methods rely on the entire base training dataset and an extra meta-training dataset during meta-training. In addition, dynamic few-shot learning does not accumulate knowledge over multiple novel tasks sequentially. Incremental few-shot learning: Prior works on incremental few-shot learning focus on classification problems by computing the weights for novel classes in parametric classifiers, without iterative gradient descent. For instance, the weights of novel classes can be imprinted by the normalized prototypes of novel classes, while keeping the feature extractor fixed (Qi et al., 2018). Since novel weights are computed only with the samples of novel classes, the fixed feature extractor may not be compatible with the novel classification weights. More recently, a neural gas network is employed to construct an undirected graph representing the knowledge of old classes (Tao et al., 2020b;a). The vertices in the graph are constructed in an unsupervised manner using competitive Hebbian learning (Fritzke, 1995), while the feature embedding is fixed. In contrast, IDLVQ learns both the feature extractor and the reference vectors concurrently in a supervised manner.

3.1. INCREMENTAL FEW-SHOT LEARNING

In this paper, incremental few-shot learning is studied for both classification and regression tasks. For classification tasks, we consider the standard class-incremental setup in the literature. After the model is trained on a base task ($t = 1$) with sufficient data, the model learns novel tasks sequentially. Each novel task contains a number of novel classes with only a few training samples per class. Learning a novel task ($t > 1$) is referred to as an incremental learning session. In task $t$, we have access only to the training data $D^t$ of the current task and previously saved exemplars (one exemplar per class in this study). Each task has a set of classes $C^t = \{c^t_1, \dots, c^t_{n_t}\}$, where $n_t$ is the number of classes in task $t$. In addition, it is assumed that there is no overlap between classes in different tasks, i.e., $C^t \cap C^s = \emptyset$ for $t \neq s$. After an incremental learning session, the performance of the model is evaluated on a test set that contains all previously seen classes $C = \cup_i C^i$. Note that our focus is not the multi-task scenario, where a task ID is exposed to the model during the test phase and the model is only required to perform one given task at a time (van de Ven & Tolias, 2019). Our model is evaluated in a task-agnostic setting, where the task ID is not exposed to the model at test time. For regression tasks, we follow a similar setting with the notable difference that the target is real-valued, $y \in \mathbb{R}$. In addition, the target values in different tasks do not have to be mutually exclusive, unlike the class-incremental setup.
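The disjoint class-split protocol above can be sketched as follows. This is a minimal illustration with made-up helper and argument names, not the paper's actual data pipeline.

```python
import numpy as np

def make_class_incremental_splits(all_classes, n_base, n_way, seed=0):
    """Partition a label set into a base task plus disjoint n_way novel tasks.

    A sketch of the class-incremental protocol: classes are shuffled once and
    assigned to exactly one task, so C^t and C^s never overlap for t != s.
    """
    rng = np.random.default_rng(seed)
    classes = rng.permutation(all_classes)
    tasks = [list(classes[:n_base])]                  # base task (t = 1)
    for start in range(n_base, len(classes), n_way):  # novel tasks (t > 1)
        tasks.append(list(classes[start:start + n_way]))
    return tasks

# e.g. 100 classes: 60 base classes, then four 10-way novel tasks
tasks = make_class_incremental_splits(list(range(100)), n_base=60, n_way=10)
```

At evaluation time, the test set after session $t$ would pool the classes of `tasks[0]` through `tasks[t-1]`.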

3.2. LEARNING VECTOR QUANTIZATION

Traditional nonparametric methods, such as nearest neighbors, represent knowledge and make predictions by storing the entire training set. Despite their simplicity and effectiveness, they are not scalable to a large-scale base dataset. Typically, incremental learning methods are only allowed to store a small number of exemplars to preserve the knowledge of previously learned tasks. However, randomly selected exemplars may not represent the knowledge in old tasks well. LVQ is a classical data compression method that represents knowledge through a few learned reference vectors (Sato & Yamada, 1996; Seo & Obermayer, 2003; Biehl et al., 2007). A new sample is classified with the same label as the nearest reference vector in the input space. LVQ has been combined with deep feature extractors as an alternative to standard neural networks for better interpretability (De Vries et al., 2016; Villmann et al., 2017; Saralajew et al., 2018). The combinations of LVQ and deep feature extractors have been applied to natural language processing (NLP), facial recognition and biometrics (Variani et al., 2015; Wang et al., 2016; Ren et al., 2019b; Leng et al., 2015). LVQ is a nonparametric method that is well suited for incremental few-shot learning because the model capacity grows by incorporating more reference vectors to learn new knowledge. For example, incremental learning vector quantization (ILVQ) has been developed to learn classification models adaptively from raw features (Xu et al., 2012). In this study, we represent knowledge by learning reference vectors in the feature space through LVQ and adapt them in incremental few-shot learning. Compared with ILVQ by Xu et al. (2012), our method does not rely on predefined rules to update reference vectors and can be learned along with deep neural networks in an end-to-end fashion.
Besides, our method uses a single reference vector for each class, while ILVQ automatically assigns different numbers of prototypes for different classes.
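For intuition, a classic LVQ1-style nearest-reference classifier and update rule (operating directly in raw feature space) might look like the sketch below. The function names are illustrative, and this is not the paper's IDLVQ procedure, which instead trains reference vectors jointly with a deep network by gradient descent.

```python
import numpy as np

def lvq_predict(x, refs, labels):
    """Classify x with the label of the nearest reference vector."""
    d = np.linalg.norm(refs - x, axis=1)
    return labels[int(np.argmin(d))]

def lvq1_update(x, y, refs, labels, lr=0.1):
    """One classic LVQ1 step: attract the winning reference if its label
    is correct, repel it otherwise (a predefined update rule, unlike IDLVQ)."""
    i = int(np.argmin(np.linalg.norm(refs - x, axis=1)))
    sign = 1.0 if labels[i] == y else -1.0
    refs[i] = refs[i] + sign * lr * (x - refs[i])
    return refs

refs = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = [0, 1]
pred = lvq_predict(np.array([0.1, 0.0]), refs, labels)
```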

4.1. INCREMENTAL DEEP LEARNING VECTOR QUANTIZATION

The general framework of IDLVQ for both classification and regression can be derived from a Gaussian mixture perspective (Ghahramani & Jordan, 1994), with a simplified covariance structure and supervised deep representation learning. In the base task ($t = 1$), a raw input $x$ is projected into a feature space $F^1$ by a deep neural network $f_{\theta^1}$, where $\theta^1$ denotes the parameters of the network. In addition, $N^1$ reference vectors $M^1 = \{m^1_1, \dots, m^1_{N^1}\}$ are placed in the feature space $F^1$ and learned to capture the representation of the base dataset. More reference vectors will be added incrementally while learning novel tasks. The marginal distribution of the feature vector can be described by a Gaussian mixture model of $N^1$ components,

$$p(f_{\theta^1}(x)) = \sum_{i=1}^{N^1} p(i)\, p(f_{\theta^1}(x) \mid i),$$

where the prior is $p(i) = 1/N^1$ and each component distribution $p(f_{\theta^1}(x) \mid i)$ is Gaussian. By assuming that each component is an isotropic Gaussian centered at $m^1_i$ with the same covariance, the posterior distribution of a component given the input is

$$p^1(i \mid x) = \frac{\kappa(f_{\theta^1}(x), m^1_i)}{\sum_{j=1}^{N^1} \kappa(f_{\theta^1}(x), m^1_j)},$$

where $\kappa(f_{\theta^1}(x), m^1_i) = \exp(-\|f_{\theta^1}(x) - m^1_i\|^2 / \gamma)$ is a Gaussian kernel and $\gamma$ is a scale factor. The conditional expectation of the output from the Gaussian mixture is

$$\hat{y} = \sum_{i=1}^{N^1} p^1(i \mid x)\, q^1_i,$$

where $q^1_i$ is the reference target associated with reference vector $m^1_i$. In classification problems, $q^1_i$ is either 0 or 1, indicating whether $m^1_i$ and $x$ have the same label. Since each reference vector is assigned to a class at initialization, $q^1_i$ is fixed and does not require learning. In regression problems, $q^1_i$ is real-valued and has to be learned. The network weights $\theta^1$, reference vectors $M^1$, reference targets $q^1_i$ (in regression problems only) and the scale factor $\gamma$ are learned concurrently by minimizing a loss function between the true label $y$ and the predicted label $\hat{y}$.
The proposed IDLVQ is a nonparametric method as it makes prediction based on similarity to reference vectors, instead of using any regression or classification weights. The capacity of the model grows naturally by adding more reference vectors to learn novel tasks, while the old knowledge is preserved in existing reference vectors.
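The prediction rule above (a softmax posterior over Gaussian kernels centred at the reference vectors, followed by an expectation over the reference targets) can be sketched in NumPy as follows, assuming the input has already been embedded; the function name is illustrative.

```python
import numpy as np

def idlvq_predict(z, refs, q, gamma=1.0):
    """Mixture-posterior prediction over reference vectors.

    z:    embedded input f_theta(x), shape (d,)
    refs: reference vectors m_i, shape (N, d)
    q:    reference targets q_i, shape (N,) or (N, C)
    """
    logits = -np.sum((refs - z) ** 2, axis=1) / gamma  # log Gaussian kernels
    w = np.exp(logits - logits.max())                   # numerically stable softmax
    p = w / w.sum()                                     # posterior p(i | x)
    return p @ q                                        # E[y | x] = sum_i p(i|x) q_i

refs = np.array([[0.0, 0.0], [10.0, 10.0]])
q = np.eye(2)                                           # one-hot targets per reference
probs = idlvq_predict(np.array([0.0, 0.0]), refs, q, gamma=1.0)
```

With one-hot targets this reduces to a softmax over negative squared distances, so classification by the nearest reference and by the highest posterior coincide.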

4.2. INCREMENTAL DEEP LEARNING VECTOR QUANTIZATION FOR CLASSIFICATION

For classification problems, one reference vector is assigned to each class in our study. Thus, $\hat{y}$ represents the predicted probability that an input belongs to a class. The model can be trained to classify data correctly by minimizing the cross-entropy loss $L_{CE}$ between the predicted probability $\hat{y}$ and the true label $y$. Although the cross-entropy loss encourages separability of features in base classes, it does not guarantee compact intra-class variation in the feature space. Specifically, in an incremental learning session, features of novel classes could overlap with those of previously learned classes. As a result, the overall classification accuracy could deteriorate after incremental learning sessions. A desirable feature embedding leaves a large margin between classes to mitigate overlap in features across old and new classes. Inspired by the center loss (Wen et al., 2016) used to enhance discriminative capability in facial recognition, a regularization term on the intra-class distance to reference vectors is added to obtain compact intra-class variation:

$$L_{\text{intra}} = \sum_{(x, y)} \|f_{\theta^1}(x) - m^1_y\|^2,$$

where $m^1_y$ is the reference vector with the same label as $x$. As such, $f_{\theta^1}(x)$ is forced to stay close to the reference vector with the same label and naturally moves away from other reference vectors. Consequently, features of new classes are more likely to lie in the margin between old classes, mitigating ambiguity in features across different classes. The total loss in training the base task is $L = L_{CE} + \lambda_{\text{intra}} L_{\text{intra}}$, where $\lambda_{\text{intra}}$ is a hyper-parameter controlling the weight of the intra-class variation loss. The total loss is differentiable w.r.t. the neural network parameters $\theta^1$, reference vectors $M^1 = \{m^1_1, \dots, m^1_{n_1}\}$ and scaling factor $\gamma$. All parameters in the model can be trained jointly in an end-to-end fashion. In an incremental session ($t > 1$), a novel dataset $D^t$ contains $n_t$ classes and $K_t$ samples per class ($n_t$-way $K_t$-shot).
$n_t$ new reference vectors are added, and each is initialized as the centroid of the features of its class, $m^t_i = \frac{1}{K_t} \sum_{k=1}^{K_t} f_{\theta^t}(x_k)$. The new reference vectors, along with the neural network parameters, are fine-tuned on $D^t$ to learn new knowledge in task $t$. To preserve the knowledge from old tasks during incremental learning, the model should be updated only when necessary. Therefore, the cross-entropy loss is not used in incremental learning sessions, because it always updates model parameters even if a sample is correctly classified. Let $m^t_+$ be the reference vector with the correct label and $m^t_-$ be the nearest reference vector with a wrong label. A training sample $(x, y)$ in $D^t$ is classified correctly if $\|f_{\theta^t}(x) - m^t_+\|^2 < \|f_{\theta^t}(x) - m^t_-\|^2$; in this case, the loss should be 0. When $\|f_{\theta^t}(x) - m^t_+\|^2 > \|f_{\theta^t}(x) - m^t_-\|^2$, the sample is misclassified. We adapt the margin-based loss function $L_M$ from De Vries et al. (2016) with a minor modification:

$$L_M = \mathrm{ReLU}\left(\frac{\|f_{\theta^t}(x) - m^t_+\|^2 - \|f_{\theta^t}(x) - m^t_-\|^2}{\|f_{\theta^t}(x) - m^t_+\|^2 + \|f_{\theta^t}(x) - m^t_-\|^2}\right),$$

where $\mathrm{ReLU}(\cdot)$ stands for the rectified linear unit function. The margin-based loss leads to slow training convergence because it only updates two reference vectors at a time. However, the adapted margin-based loss is well suited to learning from few-shot samples while avoiding unnecessary parameter updates. Features of an old class could drift away from the corresponding reference vector due to changes in $\theta^t$ during incremental learning, leading to catastrophic forgetting. A forgetting loss $L_F$ is developed to regularize the drift in the feature space:

$$L_F = \sum_{i=1}^{N^{t-1}} \|f_{\theta^t}(x_i) - f_{\theta^{t-1}}(x_i)\|^2,$$

where $x_i$ is the selected exemplar for class $i$ and $N^{t-1}$ denotes the total number of classes in the base task and all previous novel tasks. Note that the exemplar $x_i$ for each new class $i \in (N^{t-1}, N^t]$ is picked from $D^t$ as the sample whose feature is nearest to $m^t_i$ at the end of each learning session.
The total loss in incremental learning session $t$ is $L = L_M + \lambda_F L_F + \lambda_{\text{intra}} L_{\text{intra}}$, where $\lambda_F$ and $\lambda_{\text{intra}}$ are the weights of the forgetting loss and intra-class variation loss, respectively. The total loss is optimized w.r.t. the neural network parameters $\theta^t$ and the new reference vectors $\{m^t_{N^{t-1}+1}, \dots, m^t_{N^t}\}$. The reference vectors for previously learned tasks are not updated by novel data, to prevent catastrophic forgetting. However, they may no longer be well suited to represent knowledge and make classifications in the new feature space $F^t$, as the feature embedding changes with the updated $\theta^t$. Although the true optimal locations of those reference vectors are difficult to estimate without using the entire data from all tasks, they can be approximated using the shift in the features of the exemplars. Considering that the features of an exemplar $x_i$ are close to $m_i$ in the feature space, the shift of a reference vector in the new feature space can be approximated by the shift of the exemplar's features, $\delta^t_i = f_{\theta^t}(x_i) - f_{\theta^{t-1}}(x_i)$. Therefore, the reference vectors for previously learned tasks are calibrated as $m^t_i = m^{t-1}_i + \delta^t_i$, where $m^{t-1}_i$ is the uncalibrated reference vector for class $i \in [1, N^{t-1}]$. A test sample, which could be from any seen class, is classified according to its distance to the reference vectors $\{m^t_1, \dots, m^t_{N^t}\}$. The pseudo code for IDLVQ-C is presented in the appendix.
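A minimal sketch of two ingredients of an incremental session, the margin-based loss and the exemplar-based calibration of old reference vectors, assuming features are already extracted. Names are illustrative, and gradient-based fine-tuning is omitted.

```python
import numpy as np

def margin_loss(z, m_pos, m_neg):
    """Relative margin loss: zero when the embedded sample z is closer to the
    correct reference m_pos than to the nearest wrong reference m_neg."""
    d_pos = np.sum((z - m_pos) ** 2)
    d_neg = np.sum((z - m_neg) ** 2)
    return max(0.0, (d_pos - d_neg) / (d_pos + d_neg))

def calibrate_refs(old_refs, exemplar_feats_old, exemplar_feats_new):
    """Shift each old reference vector by the feature drift of its exemplar:
    m_i^t = m_i^{t-1} + (f_{theta^t}(x_i) - f_{theta^{t-1}}(x_i))."""
    return old_refs + (exemplar_feats_new - exemplar_feats_old)

z = np.array([0.0, 0.0])
correct = margin_loss(z, np.array([0.1, 0.0]), np.array([1.0, 0.0]))  # well classified
wrong = margin_loss(z, np.array([1.0, 0.0]), np.array([0.1, 0.0]))    # misclassified
new_refs = calibrate_refs(np.array([[1.0, 1.0]]),
                          np.array([[0.0, 0.0]]),
                          np.array([[0.5, 0.0]]))
```

Because the loss is normalized by the sum of the two distances, it stays in $[0, 1)$ and vanishes exactly on correctly classified samples, which is what keeps updates to a minimum.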

4.3. INCREMENTAL DEEP LEARNING VECTOR QUANTIZATION FOR REGRESSION

For regression problems, the model is trained to predict regression targets by minimizing the mean squared error (MSE) loss $L_{MSE} = (y - \hat{y})^2$, where $y$ is the real-valued target in the training dataset. The MSE loss is differentiable w.r.t. the neural network weights, reference vectors, reference targets and scale factor. Therefore, all parameters can be trained jointly in an end-to-end manner. The proposed IDLVQ-R can also be interpreted as a kernel smoother in deep embedded space. Compared with a traditional kernel smoother, such as the Nadaraya-Watson estimator (Nadaraya, 1964), IDLVQ-R is sparse and hence more scalable, as it only relies on a few reference vectors and targets. In an incremental learning session ($t > 1$), we have access to data $D^t$ that contains $K_t$ pairs of training samples $(x^t_i, y^t_i)$. $n_t$ new reference vectors ($n_t \leq K_t$), along with their corresponding targets, are added to the model to learn new knowledge in the novel task $t$. We randomly select $n_t$ samples from $D^t$ to initialize the reference vectors and targets as follows: $m_{i+N^{t-1}} = f_\theta(x^t_i)$, $q_{i+N^{t-1}} = y^t_i$, where $N^{t-1}$ is the total number of reference vectors in all previous tasks. The new reference vectors and targets are fine-tuned by minimizing the MSE on $D^t$ while keeping the other parameters frozen. After the new reference vectors and targets are fine-tuned with novel data $D^t$, the model makes predictions by smoothing the targets of all reference vectors:

$$\hat{y} = \frac{\sum_{i=1}^{N^t} \kappa(f_\theta(x), m_i)\, q_i}{\sum_{i=1}^{N^t} \kappa(f_\theta(x), m_i)},$$

where $N^t$ is the current total number of reference vectors.
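The regression variant thus reduces to initializing new reference vectors/targets from novel samples and smoothing over all reference targets at prediction time. A hedged NumPy sketch (illustrative names; the fixed seed is our choice and the fine-tuning step is not shown):

```python
import numpy as np

def add_novel_refs(refs, targets, feats_novel, y_novel, n_new):
    """Initialize n_new reference vectors/targets from novel samples
    (m_{i+N} = f_theta(x_i), q_{i+N} = y_i), keeping old ones frozen.
    A fixed seed keeps this sketch reproducible."""
    idx = np.random.default_rng(0).choice(len(feats_novel), n_new, replace=False)
    return (np.vstack([refs, feats_novel[idx]]),
            np.concatenate([targets, y_novel[idx]]))

def smooth_predict(z, refs, targets, gamma=1.0):
    """Kernel-smoother prediction over all reference targets."""
    k = np.exp(-np.sum((refs - z) ** 2, axis=1) / gamma)
    return (k @ targets) / k.sum()

old_refs = np.array([[0.0, 0.0]])
old_targets = np.array([1.0])
feats = np.array([[10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 5.0], [7.0, 7.0]])
ys = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
refs, targets = add_novel_refs(old_refs, old_targets, feats, ys, n_new=3)
pred_old = smooth_predict(np.array([0.0, 0.0]), refs, targets, gamma=1.0)
```

Since the new references are far from the origin in this toy example, a query near the old reference is still dominated by the old target, which is the locality that protects old knowledge.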

5. EXPERIMENTS

We first describe the overall protocols, then we present the results on incremental few-shot classification and regression problems.

5.1. INCREMENTAL FEW-SHOT CLASSIFICATION

We empirically evaluate the performance of IDLVQ-C on incremental few-shot classification on the CUB200-2011 (Welinder et al., 2010) and miniImageNet (Vinyals et al., 2016) datasets. Each dataset is split into base classes and multiple groups of novel classes. We apply standard data augmentation, including random crop, horizontal flip and color jitter, to all training images. After each training session, the model performance is evaluated on a test set containing all classes that the model has been trained on. ResNet18 (He et al., 2016) is used as the feature extractor for incremental classification problems. The learning process for each dataset is repeated 10 times and the average test accuracy is reported. The proposed method is compared with the following methods for few-shot class-incremental learning: fine-tuning using $D^t$, joint training using the entire training set from all encountered classes, iCaRL (Rebuffi et al., 2017), Rebalancing (Hou et al., 2019), ProtoNet (Snell et al., 2017), incremental learning vector quantization (ILVQ) (Xu et al., 2012), SDC (Yu et al., 2020), and Imprint (Qi et al., 2018). Note that ILVQ is applied to the features extracted by neural networks in our experiment. The incremental few-shot learning results on CUB and miniImageNet are shown in Tables 1 and 2, respectively. Our method outperforms fine-tuning, iCaRL (Rebuffi et al., 2017) and ProtoNet (Snell et al., 2017) by a large margin. Simply fine-tuning the classifier weights with few-shot training samples for novel classes significantly deteriorates the prediction accuracy. Although iCaRL alleviates catastrophic forgetting by tuning the model with a mix of old exemplars and novel few-shot data, the prediction accuracy still drops quickly because iCaRL requires sufficient samples per class to achieve satisfactory performance.
The ProtoNet relies on distance to prototypes (the mean of features within a class) to make classification but the fixed feature extractor may not be able to well separate novel classes. ILVQ is slightly better than ProtoNet because prototypes can be learned adaptively when more classes are available in incremental learning sessions. Some prototypes in ILVQ are close to the border of a class, which are more effective than class centroids in ProtoNet. However, ILVQ does not achieve the best performance because the feature extractor is fixed and cannot be learned along with the prototypes. IDLVQ-C has a small gain in the first couple of incremental few-shot learning sessions compared with SDC (Yu et al., 2020) and Imprint (Qi et al., 2018) . Similar to ProtoNet, SDC also relies on prototypes to make classification. The performance of SDC is better than that of ProtoNet because SDC fine-tunes the feature extractor with novel dataset and compensates the drift in prototypes. However, the compensation for the drift of old-class prototypes can be less accurate in SDC because it is approximated by samples in novel classes. In parallel, the imprint method directly computes the normalized classification weights from the average of normalized features within a novel class. The imprint method avoids imbalanced classification weights and circumvents the overfitting in few-shot class-incremental learning through weight normalization. Nevertheless, the fixed feature extractor in the imprint method may not be well suited for novel classes. In contrast, IDLVQ-C updates the feature extractor only when necessary and compensates the shift of old reference vectors more accurately using exemplars from old classes. That is why the gain of IDLVQ-C increases with more incremental few-shot learning sessions. The performance of SDC, Imprint and IDLVQ-C is better than offline joint training in early sessions of incremental few-shot learning. 
Offline joint training may not result in oracle performance due to extremely imbalanced samples between base classes and novel classes.

Table 1: Prediction accuracy on all CUB classes using the 10-way 5-shot incremental setting.

We further evaluate the proposed method with different numbers of training samples in novel classes, including 5-shot, 10-shot and 20-shot settings. One reference vector is assigned to each class in all few-shot settings. As shown in Fig. 2 in the appendix, the performance of incremental learning improves as the number of samples per class increases. When training samples are scarce, they may not represent the generative distribution of the training data well. Therefore, the learned reference vectors could be biased and classification accuracy is low. With more training samples, the learned reference vectors can better represent the center of the distribution and classification accuracy improves. The gap in performance becomes more obvious as the number of incremental learning sessions grows. The detailed results are reported in Tables 7 and 8.

5.2. INCREMENTAL FEW-SHOT REGRESSION

IDLVQ-R is tested on two regression datasets: a sinusoidal wave and 3D spatial data. Considering that there is no state-of-the-art method for incremental few-shot regression, we compare IDLVQ-R against three alternative methods: fine-tuning using novel task data only, fine-tuning using novel task data along with exemplars, and offline training using the entire training dataset from all tasks. The sinusoidal wave is defined by the function $y = \sin(3\pi x) + 0.3\cos(9\pi x) + 0.5\sin(7\pi x) + \epsilon$, where $\epsilon$ is white noise with a standard deviation of 0.1. 1000 training samples in the first task (base task) are generated by sampling $x \in [-1.0, 1.0]$ uniformly. 5-shot training samples in two novel tasks are generated by sampling $x \in [1.0, 1.5]$ and $x \in [1.5, 2.0]$, respectively. As shown in Fig. 1(a), IDLVQ-R achieves comparable performance to the offline neural networks in Fig. 1(d), which are trained using the entire training set from all tasks. In comparison, neural networks trained sequentially with few-shot training samples show catastrophic forgetting on old tasks in Fig. 1(b). With the addition of exemplars during training, the networks perform better but still suffer from catastrophic forgetting on the base task in Fig. 1(c). In conclusion, IDLVQ-R preserves old knowledge, adapts to new knowledge quickly using a few reference vectors, and achieves satisfactory performance on incremental few-shot regression tasks. The experiment details can be found in the appendix.

Table 4: Normalized RMSE of incremental few-shot regression on 3D spatial data

Method                    Session 1        Session 2        Session 3
Joint train offline       0.02174 (2e-4)   0.02232 (2e-4)   0.02296 (2e-4)
Fine-tune w. novel data   0.02174 (2e-4)   0.08462 (4e-4)   0.11870 (6e-4)
Fine-tune w. exemplars    0.02174 (2e-4)   0.02988 (2e-4)   0.03128 (2e-4)
IDLVQ-R                   0.02181 (2e-4)   0.02641 (2e-4)   0.02817 (2e-4)

The normalized root mean squared errors (RMSE) between the actual and predicted altitude in the test set are listed in Table 4.
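The sinusoidal-wave tasks described above can be generated with a short script like the following; the helper name and seeds are our own choices, not from the paper.

```python
import numpy as np

def sinusoid_task(lo, hi, n, noise=0.1, seed=0):
    """Sample the sinusoidal wave y = sin(3*pi*x) + 0.3*cos(9*pi*x)
    + 0.5*sin(7*pi*x) + eps, with x uniform on [lo, hi] and Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, n)
    y = (np.sin(3 * np.pi * x) + 0.3 * np.cos(9 * np.pi * x)
         + 0.5 * np.sin(7 * np.pi * x) + rng.normal(0.0, noise, n))
    return x, y

x_base, y_base = sinusoid_task(-1.0, 1.0, 1000)       # base task (t = 1)
x_t2, y_t2 = sinusoid_task(1.0, 1.5, 5, seed=1)       # 5-shot novel task (t = 2)
x_t3, y_t3 = sinusoid_task(1.5, 2.0, 5, seed=2)       # 5-shot novel task (t = 3)
```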
The prediction accuracy drops significantly when the model is finetuned with novel data only. Catastrophic forgetting can be alleviated using exemplars from previous tasks. IDLVQ-R achieves better results than fine tuning with exemplars. The good performance of IDLVQ-R can be attributed to two reasons. First, IDLVQ-R learns a number of reference vectors and targets to preserve the knowledge in encountered tasks. Compared with a linear layer on top of neural networks, a number of reference vectors represent richer information about the training data. Second, IDLVQ-R is nonparametric and can represent local and nonlinear relationship without learning any regression coefficient from few-shot data.

A.3 EXPERIMENT DETAILS FOR INCREMENTAL FEW-SHOT CLASSIFICATION

The base model is trained by the SGD optimizer (momentum of 0.9 and weight decay of 1e-4) with a mini-batch size of 64. For the CUB dataset, the initial learning rate is 0.01 and is decayed by 0.1 after 60 and 120 epochs (200 epochs in total). For miniImageNet, the learning rate also starts from 0.01 and is decayed by 0.1 every 200 epochs (600 epochs in total). In an incremental learning session ($t > 1$), the model is fine-tuned with $D^t$ with a learning rate of 0.01 for 100 epochs. Since the novel data $D^t$ ($t > 1$) contains very few training samples, all training samples in $D^t$ are included in one mini-batch. In addition, we use $\lambda_{\text{intra}} = 1.0$ and $\lambda_F = 0.5$ for both datasets. Empirically, a larger $\lambda_{\text{intra}}$ leads to more compact intra-class variation. However, convergence could be slow if $\lambda_{\text{intra}}$ is too large. In addition, a larger $\lambda_F$ results in less forgetting of old classes but makes learning novel classes more difficult.
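The step-decay schedule described above for CUB (initial rate 0.01, decayed by 0.1 after 60 and 120 epochs) can be expressed as a small helper; the function name is illustrative, and a framework scheduler would normally be used instead.

```python
def step_decay_lr(epoch, base_lr=0.01, milestones=(60, 120), gamma=0.1):
    """Learning rate under step decay: multiplied by gamma at each milestone
    epoch that has been reached (values follow the CUB schedule above)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

For miniImageNet the same helper would be called with `milestones=(200, 400)` over 600 epochs.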

A.4 ADDITIONAL RESULTS FOR INCREMENTAL FEW-SHOT CLASSIFICATION

The accuracies for base and novel classes are reported separately in Tables 5 and 6 for CUB and miniImageNet, respectively. The prediction accuracy on novel classes is calculated over all novel classes that the model has been trained on. Note that the accuracy in Tables 1 and 2 is calculated over all classes (including base and novel classes) that the model has been trained on. The proposed IDLVQ-C demonstrates a strong capability of preserving old knowledge by achieving the best performance on old classes across all learning sessions. In parallel, the Imprint method performs slightly better on novel classes than IDLVQ-C in early incremental learning sessions, while IDLVQ-C outperforms the Imprint method over longer incremental learning sequences. The advantage of IDLVQ-C can be attributed to the adaptive feature extractor, which is tuned in each learning session. The test accuracies on the CUB dataset using the 10-way 10-shot and 10-way 20-shot incremental settings are reported in Tables 7 and 8, respectively. The prediction accuracy improves for all methods with more training samples per class. The iCaRL and Rebalancing methods show the most significant improvement when the number of training samples increases. The proposed IDLVQ-C is effective in different incremental few-shot scenarios, as it achieves the best performance in the 5-shot, 10-shot and 20-shot settings.

A.5 EXPERIMENT DETAILS FOR INCREMENTAL FEW-SHOT REGRESSION

Sinusoidal wave: A six-layer feedforward neural network with ReLU activations is used as the feature extractor. IDLVQ-R learns 10 reference vectors and targets from the base task. In each incremental learning session, 5 pairs of reference vectors and targets are added. After the new reference vectors and targets are fine-tuned, the model is capable of making predictions for all seen tasks. 10 exemplars are selected uniformly from the training set of the base task. In the incremental learning session of the 2nd task, the model is fine-tuned on the 10 exemplars and 5 novel training samples. After training converges, the 5 novel training samples of the current task are added to the exemplar set. In the incremental learning session of the 3rd task, the model is fine-tuned with the 15 exemplars from old tasks and 5 novel training samples.

3D spatial data: We follow the same training and test protocols as for the sinusoidal wave dataset. We choose 40, 15 and 15 reference vectors and targets for the 1st, 2nd and 3rd tasks, respectively. Adding more reference vectors does not yield an obvious improvement in accuracy in our experiments.

We show the visualization of standard neural networks and IDLVQ-C with/without the intra-class variation loss in Fig. 3. The MNIST dataset is used as a toy example for visualization. Classes 0-7 are old classes with sufficient training samples, and classes 8 and 9 are novel classes with few-shot training samples. It can be observed in Fig. 3(a) and 3(b) that standard neural networks and IDLVQ-C without the intra-class variation loss, both trained with the cross-entropy loss, do not have compact intra-class variation. Consequently, features of novel classes are more likely to overlap with those of old classes. In this case, the performance of class-incremental learning degrades very quickly because the classifier cannot distinguish between features from different classes.
In comparison, the proposed IDLVQ-C makes intra-class variation compact and leaves a large margin between classes, as shown in Fig. 3(c). As a result, features of novel classes are less likely to overlap with those of existing classes. The compact intra-class variation and large inter-class margin keep the features of novel classes distinguishable, so learning novel classes becomes easier. In addition, the margin-based loss only updates the model parameters when necessary and thereby mitigates catastrophic forgetting of old classes.
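To make the kernel-smoother view of IDLVQ-R concrete, the following is a minimal sketch of prediction as a weighted average of the stored targets, weighted by a kernel on distances in the embedded space. The Gaussian kernel, the bandwidth parameter `gamma`, and all names here are our assumptions for illustration, not the released implementation:

```python
import math

def idlvq_r_predict(z, refs, targets, gamma=1.0):
    """Kernel-smoother prediction: weighted average of stored targets,
    with weights from a Gaussian kernel on squared Euclidean distances
    between the embedded input z and the learned reference vectors."""
    weights = [math.exp(-gamma * sum((zi - mi) ** 2 for zi, mi in zip(z, m)))
               for m in refs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, targets)) / total
```

Because new tasks only append reference vectors and targets, predictions over all seen tasks come from the same formula without re-training a parametric output layer.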



1 https://archive.ics.uci.edu/ml/machine-learning-databases/00246/

CONCLUSIONS

A new incremental few-shot learning approach is developed to harmonize preserving old knowledge and adapting to new knowledge through quantized reference vectors in a deep embedded space. Prediction is made in a nonparametric way using similarity to the learned reference vectors, which circumvents the biased weights of a parametric classification layer during incremental few-shot learning. For classification problems, additional mechanisms are developed to mitigate forgetting of old classes and to improve representation learning for few-shot novel classes. For regression problems, the proposed approach is reinterpreted as a kernel smoother that predicts real-valued targets over novel domains.



The CUB dataset is composed of 200 fine-grained bird species with 11,788 images. We split the dataset into 5,894 training images, 2,947 validation images and 2,947 test images. All images are resized to 224 × 224. The first 100 classes are chosen as base classes, and all training samples of the base classes are used to train the base model. The remaining 100 classes are treated as novel categories and split into 10 incremental learning sessions. Each incremental learning session contains 10 novel classes with 5 randomly selected training samples per class (10-way 5-shot).

The miniImageNet dataset is a 100-class subset of the original ImageNet dataset (Deng et al., 2009). Each class contains 500 training images, 50 validation images, and 50 test images. The images are in RGB format of size 84 × 84. We choose 60 and 40 classes as base and novel classes, respectively. The 40 novel classes are divided into 8 sessions, and each session contains 5 novel classes with 5 randomly selected training samples per class (5-way 5-shot).
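The session splits described above can be sketched as follows (a hedged illustration; the contiguous class ordering and the function name are our assumptions):

```python
def make_sessions(novel_classes, ways):
    """Partition the novel class IDs into incremental sessions of `ways` classes each."""
    return [novel_classes[i:i + ways] for i in range(0, len(novel_classes), ways)]

# CUB: classes 100-199 -> 10 sessions of 10 classes (10-way).
cub_sessions = make_sessions(list(range(100, 200)), ways=10)
# miniImageNet: classes 60-99 -> 8 sessions of 5 classes (5-way).
mini_sessions = make_sessions(list(range(60, 100)), ways=5)
```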

Figure 1: Comparison of performance for incremental few-shot regression. Red dots denote test samples of the base task, purple dots denote test samples of novel tasks, grey lines denote model predictions, and black crosses denote few-shot training samples in novel tasks. (a) IDLVQ-R; (b) neural networks incrementally fine-tuned with novel data only in each session; (c) neural networks incrementally fine-tuned with exemplars and novel training samples; (d) offline neural networks trained with training samples from all tasks.

3D spatial data 1 is collected in North Jutland, Denmark. The inputs are longitude x_1 and latitude x_2, and the output is altitude y. 2,482 training samples of the 1st task are collected in the area where

Figure 2: Comparison results of different few-shot settings, evaluated with ResNet18 on CUB dataset

Prediction accuracy on miniImageNet over all classes using the 5-way 5-shot incremental setting.

Ablation studies are conducted to analyze how individual components affect the performance of incremental few-shot learning. We study five variants of our method: (a) new reference vectors are initialized as class centroids, and neither the feature extractor nor the old reference vectors are tuned; (b) L_intra is not used in incremental learning sessions; (c) L_F is not used in incremental learning sessions; (d) the shift in old reference vectors is not compensated; (e) the margin-based loss L_M is replaced with the cross-entropy loss L_CE. Table 3 shows the results of our ablation studies on the CUB dataset. Without any fine-tuning, the initial reference vectors for novel classes already yield decent accuracy in incremental few-shot classification, demonstrating the robustness of the nonparametric classifier. L_intra leads to a 0.57% gain due to tighter intra-class variation. The less forgetting regularization L

Ablation study on CUB using the 10-way 5-shot incremental setting.
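Regarding the margin-based loss L_M studied in the ablation, a common form of such a loss in LVQ models is the GLVQ relative-distance criterion; the sketch below is our assumption of such a loss for illustration, not necessarily the paper's exact L_M:

```python
def margin_loss(d_pos, d_neg, margin=0.1):
    """GLVQ-style relative-distance loss. d_pos is the distance from an
    embedded sample to the closest reference vector of its own class,
    d_neg the distance to the closest reference vector of any other class.
    The loss is zero (no parameter update) once the sample is classified
    correctly with at least the given relative margin."""
    mu = (d_pos - d_neg) / (d_pos + d_neg)  # in (-1, 1); negative = correct
    return max(0.0, mu + margin)
```

A hinge of this form only produces gradients for samples near or across the decision boundary, which matches the observation that the margin-based loss updates parameters "only when necessary" and so disturbs old classes less than cross-entropy.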

Prediction accuracy on CUB base and novel classes using the 10-way 5-shot incremental setting.

Prediction accuracy on miniImageNet base and novel classes using the 5-way 5-shot incremental setting.

Prediction accuracy on CUB using the 10-way 10-shot incremental setting.

Prediction accuracy on CUB using the 10-way 20-shot incremental setting.

A APPENDIX

A.1 PSEUDO CODE FOR IDLVQ-C

Algorithm 1 IDLVQ-C
  In the base task (t = 1):
    Initialize θ^1, {m_1^1, ..., m_{N_1}^1} and γ
    Minimize L = L_CE + λ_intra L_intra w.r.t. θ^1, {m_1^1, ..., m_{N_1}^1} and γ
    Pick exemplars from D^1 for classes in the base task:
  In each incremental learning session (t > 1):
    Pick exemplars from D^t for classes in the novel task t:
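To make the nonparametric classification rule of IDLVQ-C concrete, here is a minimal sketch (the embedding is assumed to be precomputed, and all names are ours, not from the released code):

```python
def idlvq_c_classify(z, refs):
    """Assign the embedded sample z to the class of its nearest reference
    vector under squared Euclidean distance. `refs` maps each class label
    to the list of learned reference vectors for that class."""
    best_label, best_dist = None, float("inf")
    for label, vectors in refs.items():
        for m in vectors:
            d = sum((zi - mi) ** 2 for zi, mi in zip(z, m))
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label
```

Adding a novel class only requires inserting its reference vectors into `refs`; no classification-layer weights are re-trained, which is what lets the same rule cover base and novel classes across sessions.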

