TOWARDS LEARNING TO REMEMBER IN META LEARNING OF SEQUENTIAL DOMAINS Anonymous

Abstract

Meta-learning has made rapid progress in recent years, with extensions made to avoid catastrophic forgetting during learning, namely continual meta-learning. It is desirable to generalize the meta learner's ability to continually learn across sequential domains, which is largely unexplored to date. Through extensive empirical verification, we found that significant improvement is needed for current continual-learning techniques to be applied in the sequential-domain meta-learning setting. To tackle the problem, we adapt existing dynamic learning-rate adaptation techniques to meta-learn both model parameters and learning rates. Adaptation of parameters ensures good generalization performance, while adaptation of learning rates avoids catastrophic forgetting of past domains. Extensive experiments on a sequence of commonly used real-domain datasets demonstrate the effectiveness of our proposed method, which outperforms strong continual-learning baselines. Our code is made publicly available online (anonymous) at https://github.com/ICLR20210927/Sequential-domain-meta-learning.git.

1. INTRODUCTION

Humans have the ability to quickly learn new skills from a few examples without erasing old skills. It is desirable for machine-learning models to adopt this capability when learning under changing contexts/domains, which are common in real-world problems. These tasks are easy for humans, yet they pose challenges for current deep-learning models, mainly for two reasons: 1) catastrophic forgetting is a well-known problem for neural networks, which are prone to drastically losing knowledge of old tasks when the domain shifts (McCloskey & Cohen, 1989); 2) it has been a long-standing challenge to make neural networks generalize quickly from a limited amount of training data (Wang et al., 2020a). For example, a dialogue system may be trained on a sequence of domains (hotel booking, insurance, restaurants, car services, etc.) due to the sequential availability of datasets (Mi et al., 2020). In each domain, a task is defined as learning one customer-specific model (Lin et al., 2019). After meta training finishes, the model can be deployed to previously trained domains: new (unseen) customers from previous domains may arrive later, each with their own small training data (support set) used for adapting the sequentially meta-learned model. After adaptation, the newly adapted model can be deployed to respond to these customers. We formulate the above problem as sequential-domain few-shot learning, where a model is required to make proper decisions based on only a few training examples while undergoing constantly changing contexts/domains. Adjustments to a new context/domain are expected not to erase knowledge already learned from old ones.
The problem consists of two key components that have been considered separately in previous research: the ability to learn from a limited amount of data, referred to as few-shot learning; and the ability to learn new tasks without forgetting old knowledge, known as continual learning. Both aspects have proved particularly challenging for deep-learning models and have been explored independently by extensive previous work (Finn et al., 2017; Snell et al., 2017; Kirkpatrick et al., 2017; Lopez-Paz & Ranzato, 2017). However, the more challenging yet useful perspective of jointly integrating the two aspects remains less explored. Generally speaking, meta-learning targets learning from a large number of similar tasks with a limited number of training examples per class. Most existing works focus on developing the generalization ability under a single context/domain (Santoro et al., 2016; Finn et al., 2017; 2018; Snell et al., 2017; Ravi & Beatson, 2019). Recently, it has been shown that catastrophic forgetting often occurs when transferring a meta-learning model to a new context (Ren et al., 2019; Yoon et al., 2020). Continual learning aims to mitigate negative backward-transfer effects on learned tasks when input distribution shift occurs during sequential context changes; related techniques are currently applied mostly to standard classification problems (Serrà et al., 2018; Ebrahimi et al., 2020b). In this paper, we generalize them to the sequential-domain meta-learning setting, which seeks good generalization on unseen tasks from all domains with only limited training resources from previous domains. We term the problem sequential-domain meta learning. Note this setting is different from continual few-shot learning, which focuses on remembering previously learned low-resource tasks in a single domain.
Our setting does not aim to remember a specific task, but rather to maintain good generalization to a large number of unseen few-shot tasks from previous domains without catastrophic forgetting. This setting is common and fits well in dynamic real-world scenarios such as recommendation systems and dialogue training systems. The domain shift arising in this setting during meta learning poses new challenges to existing continual-learning techniques, mainly due to the high variability underlying a large number of dynamically formed few-shot tasks, which makes it infeasible for a model to explicitly remember each task. In our setting, a model is expected to remember patterns generic to a domain while neglecting the noise and variance of a specific few-shot task. This ability, termed remember to generalize, allows a model to capture general patterns of a domain that repeatedly occur in batches of tasks while avoiding being overly sensitive to a specific few-shot task. In this paper, we propose to address the aforementioned challenges by designing a dynamic learning-rate adaptation scheme for learning to remember previous domains. This technique jointly considers gradients from multiple few-shot tasks to filter out task variance and only remember patterns that are generic to each domain. Our main idea is to meta-learn both the model parameters and the learning rates by backpropagating a domain loss and a memory loss to adaptively update the model parameters and the learning rates, respectively. Specifically, our mechanism keeps a small memory of tasks from previous domains, which is then used to guide dynamic and adaptive learning behaviors on different portions of the network parameters. The proposed mechanism is versatile and applicable to both the metric-based prototypical network (Snell et al., 2017) and the gradient-based ANIL (Raghu et al., 2020) meta-learning models.
Our contributions are summarized as follows:
• We propose a challenging benchmark that requires a meta-learning model to sequentially learn on a sequence of domains, enduring domain shift without much forgetting of previous domains.
• We extend meta-learning models with existing dynamic learning-rate modeling techniques. This mitigates catastrophic forgetting by meta-learning both model parameters and learning rates to dynamically control the network update process, and can be seamlessly integrated into both metric-based and gradient-based meta-learning approaches.
• We conduct extensive experiments on multiple public datasets under different sequential-domain few-shot learning scenarios. We further test the functionality of the dynamic learning-rate update mechanism for both metric-based and gradient-based meta-learning approaches. Comparisons are made against a wide range of representative continual-learning techniques and models. Results demonstrate that our method outperforms strong baselines by a large margin.

2. RELATED WORKS

2.1. META LEARNING

Meta learning (Schmidhuber, 1993), a.k.a. learning to learn, aims to rapidly adapt to a new task by reusing previous experience gained from training on a large number of tasks. Meta-learning methods can be roughly classified into the following categories: 1) metric/embedding-based approaches (Vinyals et al., 2016; Snell et al., 2017; Edwards & Storkey, 2017), which map input data into embedding (feature) spaces and make decisions based on some distance metric in the feature space; 2) black-box learning methods (Andrychowicz et al., 2016; Graves et al., 2014; Mishra et al., 2018); 3) optimization-based methods (Finn et al., 2017; Ravi & Larochelle, 2017; Li et al., 2017; Antoniou & Storkey, 2019), which improve gradient-based optimization algorithms or learn to initialize network parameters; 4) Bayesian meta-learning methods (Ravi & Beatson, 2019; Finn et al., 2018; Yoon et al., 2018b; Grant et al., 2018; Wang et al., 2020b), which are used either to interpret and understand MAML (Grant et al., 2018) or to model the uncertainty of meta-learning models (Yoon et al., 2018b; Finn et al., 2018; Wang et al., 2020b); and 5) memory-based meta learning (Santoro et al., 2016; Munkhdalai & Yu, 2017; Mikulik et al., 2020), which applies an additional memory component to meta learning. Online meta learning (Finn et al., 2019) is also related to our work; it focuses on forward transfer, i.e., achieving better performance on future tasks, and uses all the data from previous tasks for meta learning, whereas our setting is significantly different: we focus on mitigating catastrophic forgetting with only very limited access to previous domains. Dynamically updating the learning rates of networks is not new and has been explored in several contexts. Meta-SGD (Li et al., 2017) learns per-parameter learning rates for meta learning to improve flexibility and performance.
(Gupta et al., 2020) uses dynamic learning rates to mitigate forgetting in online continual learning. T-net (Lee & Choi, 2018) learns a metric in activation space, which informs the update direction and step size for task-specific learning. Flennerhag et al. (2020) propose warped gradient descent, which meta-learns an efficiently parameterized preconditioning matrix to dynamically update the network. Our work extends dynamic learning-rate techniques to the sequential-domain meta-learning setting to mitigate catastrophic forgetting.

2.2. CONTINUAL LEARNING

Continual learning tackles the problem of maintaining knowledge when input distribution shift happens across sequentially arriving tasks. There are different methods to address this problem, including 1) retaining memory for future replay (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a; Riemer et al., 2019; Chaudhry et al., 2019b); 2) designing tailored network architectures (Rusu et al., 2016; Fernando et al., 2017; Yoon et al., 2018a); 3) performing proper regularization during parameter updates (Kirkpatrick et al., 2017; Zenke et al., 2017; von Oswald et al., 2019); and 4) introducing Bayesian methods for model-parameter inference (Nguyen et al., 2018; Ebrahimi et al., 2020a). Specifically, methods based on memory replay store representative samples from old tasks and perform rehearsal during training (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a; Riemer et al., 2019). Recent research also utilizes generative models to memorize previously seen data (Lesort et al., 2019). Representatives of architecture-based methods include Progressive Neural Networks (Rusu et al., 2016), PathNet (Fernando et al., 2017), Dynamically Expandable Networks (Yoon et al., 2018a), Hard Attention Mask (HAT) (Serrà et al., 2018) and PackNet (Mallya & Lazebnik, 2017). These models explicitly modify network topology to preserve previous knowledge. The classic architecture-based approaches of (Serrà et al., 2018) and (Mallya & Lazebnik, 2017) do not fit our setting, as they attempt to fully remember each historical task. Progressive Neural Networks (Rusu et al., 2016) guarantee zero forgetting, but at the cost of growing network architectures and rapidly increasing parameters, which is unaffordable in memory-constrained cases. Regularization-based methods constrain parameter updates to avoid drastic changes to previously learned tasks (Kirkpatrick et al., 2017; Zenke et al., 2017; von Oswald et al., 2019; Ebrahimi et al., 2020c).
However, they can restrict the capacity for meta-learning new domains and could thus hurt performance on a new domain. Bayesian methods model parameters probabilistically and update them based either on their posterior distributions (Nguyen et al., 2018) or on their uncertainty (Ebrahimi et al., 2020a). However, in the context of meta-learning, the uncertainty or posterior estimation could be highly inaccurate due to the small-data setting of each task, thus hindering performance. Recently, meta learning has also been used to improve continual learning; for example, (Javed & White, 2019) proposes to learn versatile representations by explicitly training towards minimizing forgetting.

2.3. MULTI-DOMAIN META LEARNING

As a new research direction, multi-domain meta learning aims to achieve good generalization across multiple domains. (Triantafillou et al., 2020) releases a dataset containing few-shot tasks from multiple domains. (Tseng et al., 2020) proposes to use transformation layers to associate feature distributions across different domains. (Vuorio et al., 2019) proposes a tailored initialization process of model parameters based on task embedding to enable proper functionality in multiple domains. It is worth noting that the aforementioned multi-domain few-shot learning methods assume data from all domains are jointly available during training. We consider a more challenging setting where domain shift comes sequentially, thus a model needs to remember knowledge from previous domains.

2.4. INCREMENTAL FEW-SHOT LEARNING

Incremental few-shot learning (Gidaris & Komodakis, 2018; Ren et al., 2019; Yoon et al., 2020) aims to handle new categories with limited resources while preserving knowledge of old categories. The task requires building a generalized model while preventing catastrophic forgetting, with an implicit assumption of unlimited access to the base categories. This paper, by contrast, focuses on the case where only very limited access to previous domains is available, and generalization to unseen categories in previous domains is required. (Gidaris & Komodakis, 2018) introduces a novel cosine similarity function to relate incremental classes with base classes. (Ren et al., 2019) proposes an attention attractor network to regularize the learning of new categories. (Yoon et al., 2020) proposes to integrate two sets of features, one from a pretrained base module and the other from a meta-trained module, for smooth adaptation to novel categories. In incremental few-shot learning, the new classes are assumed to be in the same domain as the base classes, which differs from the changing domains in our setting.

2.5. CONTINUAL FEW-SHOT LEARNING

Continual few-shot learning is a relatively new research topic. (Antoniou et al., 2020) proposes a general framework for various continual few-shot learning settings, along with benchmarks for proper performance evaluation. (Caccia et al., 2020) introduces Continual-MAML for online fast adaptation to new tasks while accumulating knowledge of old tasks, with experiments on three domains; our setting differs because they assume previous tasks can be revisited, whereas we assume only very limited access to previous domains. (Jerfel et al., 2019) considers task-distribution shift within a single domain and proposes a Dirichlet process mixture of hierarchical Bayesian models (DPM) to tackle the problem. Their ideas are interesting, but they assume a different set of parameters for each cluster, which is inefficient in memory-limited settings.

3. THE PROPOSED METHOD

Problem setup We consider a sequence of domains $D_1, \dots, D_N$ arriving one by one. Each domain is split into a meta-training set and a meta-test set; the latter is used for meta testing. Each task $\mathcal{T} \in D_i$ consists of $K$ data examples, $\{(x_k, y_k)\}_{k=1}^{K}$, and is divided into $\mathcal{T}^{tr}$ and $\mathcal{T}^{test}$. Our goal is to sequentially meta-learn a model $f_\theta$ that maps inputs $x$ to $y$ for each arriving domain while not forgetting too much of all previous domains. To this end, our framework maintains a very limited memory $M_d$ that stores a small number of training tasks from each previous domain $D_d$. In domain $D_q$, the parameters are initialized from those of the last domain, $\theta_{q-1}$. The data available for training are thus $D_q \cup (\bigcup_{d=1}^{q-1} M_d)$. Our goal is to update the parameters to $\theta_q$ such that the model achieves good performance on domain $D_q$ by transferring useful knowledge from previous domains without degrading performance on those domains. The final performance is evaluated on the meta-test sets of the current and all previous domains.

Basic idea and model architecture For effective learning without catastrophic forgetting of previous domains, our basic ideas are 1) minimizing a loss on the current domain to achieve fast adaptation and good generalization; and 2) adjusting the adaptation by optimizing the learning rates to force the model to remember previous domains. In our implementation, we consider a convolutional neural network (CNN) classifier, which evolves over the sequence of domains via meta-learning adaptation. We assume all domains share the same CNN structure for feature extraction, while the model also has the flexibility to define domain-specific components. As a result, we divide the CNN classifier into two parts, corresponding to bottom convolutional layers for feature extraction and top convolutional layers for classification. The bottom layers of the CNN are defined as domain-shared layers, serving as a common feature extractor for all domains.
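The per-domain memory $M_d$ described above can be sketched as a small task buffer. Below is a minimal, hypothetical Python implementation; the class and method names are ours, not from the paper, and a "task" is treated as an opaque object (e.g., a support/query split):

```python
import random

class DomainTaskMemory:
    """Tiny buffer storing a few meta-training tasks per finished domain,
    an illustrative sketch of the memory M_d described above."""

    def __init__(self, tasks_per_domain=6):
        self.tasks_per_domain = tasks_per_domain
        self.buffer = {}  # domain id -> list of stored tasks

    def store(self, domain_id, tasks):
        # Keep only a small batch of tasks from the finished domain D_d.
        self.buffer[domain_id] = list(tasks)[: self.tasks_per_domain]

    def sample(self, batch_size):
        # Draw memory tasks uniformly from all previously seen domains.
        pool = [t for tasks in self.buffer.values() for t in tasks]
        return random.sample(pool, min(batch_size, len(pool)))
```

During meta training on domain $D_q$, samples from this buffer play the role of $\bigcup_{d<q} M_d$ when computing the memory loss.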
In addition, the top layers of the CNN are defined as domain-specific layers, implemented as $N$ parallel sub-CNNs, each accessible by only one domain. The network structure is illustrated in Figure 1. For more flexible adaptation, we group the parameters (the convolutional filters) of each domain-shared layer into several blocks and associate each block with one learning rate. In other words, different filters are updated with different learning rates, which are themselves adaptively optimized to enforce memorization of previous domains (specified in the following sections). The CNN structure is also reflected in Figure 1. Note the learning rates of the domain-specific layers are not optimized, as these layers in different domains are independent of each other, making the adaptive learning-rate scheme used in the domain-shared part inapplicable.

Problem formulation Similar to (Javed & White, 2019), we treat the domain-shared and domain-specific parameters separately. The meta parameters $\theta^S$ (parameters of the bottom convolutional layers) and their corresponding block-wise learning rates $\lambda^S$ are shared across all domains. We use $\theta^D_{1:T}$ to denote the domain-specific parameters and $\lambda^D_{1:T}$ the domain-specific learning rates for domains $1, \dots, T$, where each element $\theta^D_q$ and $\lambda^D_q$ represents the parameters specific to domain $D_q$ and their corresponding learning rates. All learnable parameters for domain $D_q$ are defined as $\theta_q = \{\theta^S, \theta^D_q\}$ and the learning rates as $\lambda_q = \{\lambda^S, \lambda^D_q\}$. $\lambda^S$ is dynamically updated while $\lambda^D_{1:T}$ is fixed during meta training. During meta testing, $\lambda^S$ and $\lambda^D_{1:T}$ are neither used nor updated. We use $\theta_{q,j}$ to denote all learnable parameters at the $j$-th iteration and $\lambda^S_{q,j}$ to denote all learnable learning rates at the $j$-th iteration of meta training in domain $D_q$. Each element of $\lambda^S_{q,j}$ is the learning rate for the parameters of one block, each of which consists of several filters.
Note that $\lambda^S$ and $\lambda^S_{q,j}$ denote the same set of learning rates; the latter refers to their values at a specific iteration. $\mathcal{M} = \bigcup_{d=1}^{q-1} M_d$ denotes all memory tasks from the domains before $D_q$. $\mathcal{L}(\mathcal{T}, \theta(\lambda))$ is the task-specific loss function with model parameters $\theta$, which depend on the learning rates $\lambda$ and the data of task $\mathcal{T}$. $J_{D_q^{tr}}(\theta_{q,j+k}(\lambda^S_{q,j}))$ (Equation 5) is the expected loss over the training-task distribution of domain $D_q$ and is used for updating model parameters. $J_{\mathcal{M}}(\theta_{q,j+K}(\lambda^S_{q,j}))$ (Equation 2) is the expected loss over the memory-task distribution and is used for adjusting the learning rates. The detailed definitions are as follows:

$$\min_{\lambda^S_{q,j}} J_{\mathcal{M}}(\theta_{q,j+K}(\lambda^S_{q,j})) \quad (1)$$
$$J_{\mathcal{M}}(\theta_{q,j+K}(\lambda^S_{q,j})) = \mathbb{E}_{\mathcal{T}_i \in \mathcal{M}} \, \mathcal{L}(\mathcal{T}_i, \theta_{q,j+K}(\lambda^S_{q,j})) \quad (2)$$
$$\theta_{q,j+k+1}(\lambda^S_{q,j}) = \boldsymbol{\phi}(\theta_{q,j+k}, \lambda_{q,j}; D_q^{tr}), \quad k = 0, 1, \dots, K-1 \quad (3)$$
$$\boldsymbol{\phi}(\theta_{q,j+k}, \lambda_{q,j}; D_q^{tr}) = \theta_{q,j+k} - \lambda_{q,j} \nabla_{\theta_q} J_{D_q^{tr}}(\theta_{q,j+k}(\lambda_{q,j})) \quad (4)$$
$$J_{D_q^{tr}}(\theta_{q,j+k}(\lambda^S_{q,j})) = \mathbb{E}_{\mathcal{T}_i \in D_q^{tr}} \, \mathcal{L}(\mathcal{T}_i, \theta_{q,j+k}(\lambda^S_{q,j})) \quad (5)$$

$\theta_{q,j+k+1}(\lambda^S_{q,j})$ is a function of $\lambda^S_{q,j}$ rather than $\lambda_{q,j}$ because $\lambda^D_q$ is fixed while training on $D_q$, for the reasons stated above. Here $K$ is the number of iterations in domain $D_q$ after the current iteration; we use $K = 1$ for simplicity, and $\boldsymbol{\phi}(\cdot, \cdot)$ denotes a gradient adaptation function. Similarly, for the learning-rate adaptation in domain $D_q$, based on objective (2), an adaptation process is defined as

$$\lambda^S_{q,j+1} = \lambda^S_{q,j} - \eta \nabla_{\lambda^S} J_{\mathcal{M}}(\theta_{q,j+K}(\lambda^S_{q,j})), \quad (6)$$

where $\eta$ is the hyper learning rate of the learning-rate update rule. To sum up, our method is a bi-level adaptation process: the lower level optimizes on the current domain, while the upper level optimizes on previous domains to update the block-wise learning rates. Detailed algorithms describing the meta-training and testing phases are given in Algorithms 1 and 2 (Appendix A), respectively. Algorithm 1 Meta training.
1: Initialize $\theta^S$; set $\lambda^S = \lambda_0$; $\mathcal{M} = \emptyset$
2: for each domain $D_q$, $q = 1, \dots, N$ do
3:   Randomly initialize $\theta^D_q$
4:   for $j = 1$ to $J$ do
5:     Sample a batch of tasks $\mathcal{T}$ from $P(D_q)$, the distribution over tasks in domain $D_q$
6:     Evaluate $J_{D_q^{tr}}(\theta_{q,j+1}(\lambda^S_{q,j})) = \mathbb{E}_{\mathcal{T} \in D_q^{tr}} \mathcal{L}(\mathcal{T}, \theta_{q,j+1}(\lambda^S_{q,j}))$
7:     Perform adaptation: $\theta_{q,j+1} = \theta_{q,j} - \lambda_{q,j} \nabla_{\theta_q} J_{D_q^{tr}}(\theta_{q,j}, \lambda_{q,j})$
8:     if $q = 1$ then
9:       $\lambda^S_{q,j} = \lambda_0$
10:    else
11:      $\lambda^S_{q,j+1} = \lambda^S_{q,j} - \eta \nabla_{\lambda^S} J_{\mathcal{M}}(\theta_{q,j}, \lambda^S_{q,j})$
12:    end if
13:   end for
14:   $M_q$ = {a small batch of tasks sampled from $D_q$}
15:   $\mathcal{M} = \mathcal{M} \cup M_q$
16: end for

Domain-level meta loss computation Our model contains a component that trains on previous tasks stored in the memory to optimize the learning rates. In practice, only a minibatch of data is used at each iteration. Note that each task is a training-validation pair from one domain. If one simply uses one task from one domain at each iteration, the model could overfit to that domain and may not generalize well to others. To alleviate this issue, we propose to compute the meta loss at two different levels, called the intra-domain and inter-domain meta losses. The intra-domain meta loss is defined over tasks in the same domain to ensure good generalization within a domain. The inter-domain meta loss, by contrast, is defined over tasks sampled across all available domains in the memory to encourage model adaptation across domains. When calculating the inter-domain loss with one task from the memory, at each iteration we augment its data with data from tasks of other domains. In this way, the model is trained to generalize better to other domains. Figure 2 illustrates our proposed method. More details can be found in Appendix A.
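With $K = 1$, the bi-level adaptation above reduces to one parameter step on the current domain followed by one learning-rate step through that update. The following is a minimal numerical sketch; the function and argument names are ours, gradients are supplied as flat per-block arrays, and in practice the learning-rate gradient is obtained via second-order automatic differentiation rather than the hand-derived chain rule used here:

```python
import numpy as np

def bilevel_step(theta, lam, grad_domain, grad_memory_fn, eta):
    """One iteration of the K = 1 bi-level adaptation (a sketch).

    theta          : current parameters (one entry per block, flattened)
    lam            : per-block learning rates lambda^S
    grad_domain    : gradient of the domain loss J_{D_q^tr} at theta
    grad_memory_fn : callable returning the gradient of the memory loss J_M
    eta            : hyper learning rate of the learning-rate update
    """
    # Lower level, Eqs. (3)-(4): parameter update on the current domain.
    theta_new = theta - lam * grad_domain
    # Upper level, Eq. (6): since d(theta_new)/d(lam) = -grad_domain,
    # the chain rule gives grad_lam J_M = -grad_domain * grad_theta J_M.
    lam_new = lam + eta * grad_domain * grad_memory_fn(theta_new)
    return theta_new, lam_new
```

Blocks whose domain gradient aligns with the memory gradient get their learning rate increased; blocks whose update would hurt the memory loss get it decreased, which is the mechanism that protects previous domains.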

4. EXPERIMENT

We compare our method against several related strong baselines on a sequence of five domain datasets that exhibit large domain shift, posing new challenges for existing methods. We evaluate the effectiveness of our method on both gradient-based and metric-based meta-learning frameworks. We also conduct ablation studies to verify the effectiveness of each component of our model. Our implementation is based on Torchmeta (Deleu et al., 2019). Results are reported as mean and standard deviation over three independent runs.

Implementation details

For prototypical-network-based (Protonet) (Snell et al., 2017) baselines, we use a four-layer CNN with 64 filters of kernel size 3 as the domain-shared feature extractor for all domains. The last layer is domain-specific with 64 filters. Following existing works, no fully connected layers are used. For gradient-based meta-learning algorithms (Raghu et al., 2020), following (Antoniou et al., 2019), we use a three-layer CNN with 48 filters of kernel size 3 as the domain-shared feature extractor, plus one convolutional layer with 48 filters and one fully connected layer for domain-specific learning. Such architectures for the Prototypical Network and ANIL are widely adopted in existing continual-learning works to mitigate catastrophic forgetting (Serrà et al., 2018; Ebrahimi et al., 2020b). 750 evaluation tasks from each domain are used for meta testing. The hyper learning rate η is set to 1e-4 for Protonet and 5e-5 for ANIL. The gradients with respect to all learnable learning rates are clipped to [-10, 10] for both Protonet and ANIL. Unless otherwise specified, the number of learning blocks per CNN layer is 4, the number of memory tasks per previous domain is 6, and the arrival order of the domain sequence is MiniImagenet, CIFARFS, Omniglot, Aircraft and CUB. More implementation details are given in Appendix A.

Datasets The five datasets are MiniImagenet (Vinyals et al., 2016), CIFARFS (Bertinetto et al., 2019), Omniglot (Lake et al., 2011), CUB (Welinder et al., 2010) and AIRCRAFT (Maji et al., 2013). All images are resized to 84 × 84. We follow the splits in Torchmeta for the MiniImagenet, CIFARFS, Omniglot and CUB datasets. For AIRCRAFT, we follow the split in (Vuorio et al., 2019). We compare all baselines in the 5-way-1-shot and 5-way-5-shot learning settings.
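As a sanity check on the size of the Protonet backbone described above, the parameter counts of the shared and domain-specific parts can be computed directly. This is a sketch under assumptions not specified in the text: 3-channel input, bias terms, and no batch-norm parameters.

```python
def conv_params(in_ch, out_ch, kernel=3):
    # Weights plus biases of a single 2-D convolutional layer.
    return in_ch * out_ch * kernel * kernel + out_ch

def protonet_param_counts(n_domains, filters=64):
    """Counts for a 4-layer shared extractor (3 -> 64, then 64 -> 64 x 3)
    plus one domain-specific 64-filter conv layer per domain."""
    shared = conv_params(3, filters) + 3 * conv_params(filters, filters)
    specific = n_domains * conv_params(filters, filters)
    return shared, specific
```

Only the (small) domain-specific part grows with the number of domains, in contrast to architecture-growing baselines such as Progressive Neural Networks.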

Evaluation metrics

We adopt the evaluation metrics ACC and BWT from existing work on continual learning (Ebrahimi et al., 2020b). ACC is defined as the average meta-test classification accuracy across all domains (the higher the better), and BWT measures the average forgetting on all previous domains, evaluated at the end of the sequential meta-learning task (less forgetting is better). Formally, ACC and BWT are defined as

$$\text{ACC} = \frac{1}{N} \sum_{j=1}^{N} a_{N,j}, \qquad \text{BWT} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left( a_{N,i} - a_{i,i} \right),$$

where $a_{n,i}$ is the meta-test accuracy on domain $i$ after meta training on domain $n$, and $a_{N,i} - a_{i,i}$ measures the forgetting on domain $i$ after meta training on domain $N$.

Baselines For meta-learning methods, we compare our method with ANIL (Raghu et al., 2020) and the Prototypical Network (Snell et al., 2017); the former is a simplified version of MAML (Finn et al., 2017) with the inner loop applied only to the final layer. For continual-learning methods, we compare against related strong baselines: a regularization-based method, Elastic Weight Consolidation (Kirkpatrick et al., 2017); an architecture-based method, Hard Attention Mask (HAT) (Serrà et al., 2018), using the authors' publicly available code; a Bayesian method (Ebrahimi et al., 2020b), using the authors' implementation; and memory-based methods, including A-GEM (Chaudhry et al., 2019a), ER-Ringbuffer (Chaudhry et al., 2019b) and Meta Experience Replay (MER) (Riemer et al., 2019). We also learn all domains jointly in a multi-domain meta-learning setting, whose performance can be considered an upper bound for our sequential-learning setting. Finally, we provide a simple baseline that trains sequentially without any external memory, denoted "Sequential", whose performance serves as a reference for the catastrophic forgetting in our setting. It is worth noting that the above baselines have only been used for traditional continual learning.
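The ACC and BWT definitions above can be computed directly from the matrix of per-domain accuracies; a small sketch (the function name is ours):

```python
import numpy as np

def acc_bwt(a):
    """Compute ACC and BWT from an N x N accuracy matrix, where a[n, i]
    is the meta-test accuracy on domain i after meta training on domain n
    (entries with i > n are unused)."""
    a = np.asarray(a, dtype=float)
    acc = a[-1].mean()                           # mean of a_{N,j} over j
    bwt = (a[-1, :-1] - np.diag(a)[:-1]).mean()  # mean of a_{N,i} - a_{i,i}
    return acc, bwt
```

A negative BWT indicates forgetting: accuracy on a domain dropped between finishing that domain and finishing the whole sequence.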
We adapt these methods to our setting to demonstrate the ineffectiveness of existing techniques when generalized to a meta-learning setting. Table 1 shows the comparisons to the baselines we constructed for this problem setting in terms of 5-way 5-shot accuracy; results for 5-way 1-shot classification are given in Appendix A. In the table, 'N/A' means BWT is not available because the method does not learn sequentially. Our method significantly outperforms the baselines. In particular, for the Protonet-based model, the performance of our model almost matches that of the joint-training model, indicating excellent memorization of past domains. From the results of the "Sequential" baseline, we see a significant performance drop if no external mechanism is introduced to prevent forgetting. Among the baselines, ER-Ringbuffer performs worst overall. We believe this is because the repeatedly trained memory leads to overfitting, making recovery of previous knowledge difficult. Memory data in our method are not directly fed into the network but are only used to guide training on current-domain data, thus avoiding overfitting. The inferior performance of UCB might be because uncertainty estimation with limited data is inaccurate in the meta-learning setting. HAT is also worse because it simply remembers historical tasks rather than domain-general information. Furthermore, A-GEM restricts gradient updates on new examples to the same direction as the gradients on memory tasks, potentially leading to wrong update directions. In MER, the fast weights for memory tasks across different domains can vary significantly, making the Reptile updates oscillatory and unstable, and thus hindering its performance.

Sensitivity to domain ordering

We study the sensitivity of all methods to the order of domain arrivals. We use a different domain-sequence order: CIFARFS, MiniImagenet, Aircraft, CUB and Omniglot. Results for the 5-way-5-shot and 5-way-1-shot settings show that, although there are some performance differences compared with the previous order, our method still outperforms the baselines. We also use a third domain-sequence order: Omniglot, Aircraft, CUB, CIFARFS and MiniImagenet. Since Omniglot is the first domain, the model cannot learn good representations for the following domains, which poses more challenges than the other two domain sequences. Table 3 shows the results, where the baseline Protonet-fixfirst freezes the model parameters after finishing training on the first domain. This baseline shows a small BWT, owing to random test-task variance at different times. In this case, the performance of all baselines and of our method drops significantly compared to the sequences with CIFARFS or MiniImagenet as the first domain, indicating that this domain sequence is significantly more challenging than the other two.

Effect of domain-level meta loss This set of experiments verifies the effectiveness of the domain-level meta loss proposed in Section 3. We compare it with the standard intra-domain meta loss, which only considers task data from one domain. The results are shown in Table 4: the proposed domain-level meta loss consistently achieves better accuracy than the intra-domain meta loss.

Effect of the number of learning blocks Our method assigns a different learning rate to each block of the network parameters. A finer-grained division into learning blocks allows more flexible control but incurs higher computation cost. In this experiment, we investigate the impact of the number of blocks on model performance. Specifically, we implement each layer with 2, 4 and 8 blocks. Results are shown in Table 5. When the block number is relatively large (4 in our case), the accuracies are relatively stable; thus we simply set the block number to 4 in practice.

Effect of a small memory size We compare performance when the number of tasks stored in memory per domain is small. Table 6 shows results for 1 and 6 tasks per domain. Interestingly, with a memory storing only one task per domain, the performance of all memory-based baselines drops significantly, while our method remains relatively stable. This might be because, with a smaller memory, the memory-based baselines either overfit to the small memory or are affected by unstable gradient update directions, neither of which happens in our method.

5. CONCLUSION

In this paper, we propose a challenging benchmark that requires a meta-learning model to sequentially meta-learn on a sequence of domains with domain shift, without much forgetting of previous domains. We then extend existing dynamic learning-rate techniques to meta-learn the learning rates of meta-parameter blocks, which can be seamlessly integrated into both metric-based and gradient-based meta-learning approaches to mitigate catastrophic forgetting. The adaptation of parameters maintains generalization performance on the current domain, while the adaptation of learning rates is made to remember knowledge of past domains. Our proposed method significantly outperforms existing continual-learning techniques adapted to our problem. Several directions remain for future work, such as handling imbalanced classes in each task, scaling to longer domain sequences, and learning with fewer training tasks in each arriving domain.

For ANIL: The meta batch size for each domain is set to 3. For 1-shot and 5-shot learning, the number of training data for each task is set to 1 and 5 respectively, and the number of query data is 15 in both cases. The number of inner-loop update steps is 5 with a learning rate of 1e-2. The outer-loop learning rate is 1e-3, and the number of meta-training iterations is 20000 with the Adam optimizer (Kingma & Ba, 2014). For Prototypical Network: The meta batch size for each domain is 3, and the number of meta-training iterations is 40000. For 1-shot learning, the number of support data for each task is 1 and the number of query data is 10; for 5-shot learning, the number of support data is 5 and the number of query data is 10. The meta-training loss is optimized by Adam (Kingma & Ba, 2014) with a learning rate of 1e-3.

Sequential domain training details

For ANIL and Prototypical Network, we adopt the following hyperparameters.

• ANIL (Raghu et al., 2020): The meta batch size for each domain is 3. The total number of meta-training iterations is 40000, with 8000 iterations for each domain. For 1-shot and 5-shot learning, the number of training data for each task is set to 1 and 5 respectively, and the number of query data is 15 in both cases. The number of inner-loop update steps is 5, the inner-loop learning rate is 1e-2, and the outer-loop learning rate is 1e-3; the outer-loop meta-training loss is optimized by Adam (Kingma & Ba, 2014).

• Prototypical Network (Snell et al., 2017): The meta batch size for each domain is 3, and the number of meta-training iterations is 40000, with 8000 iterations for each domain. For 1-shot and

Figure 2: The meta memory loss mechanism. For one task in domain q from the memory pool, we randomly sample data from tasks of other domains and combine them with the task's own data to calculate the task-specific loss.

We use S^u_p to denote the examples labeled with class u from domain p, and S^v_q to denote the examples labeled with class v from domain q. Suppose P-way K-shot learning is performed in both domain p and domain q. At each iteration, the two tasks from the two domains are combined, and the problem becomes 2P-way K-shot learning. That is, the network performs 2P-class classification for images from both domain p and domain q. Define the prototype for class u as in Eq. (9). The memory loss is defined as the negative log probability J(θ) = −log p_θ(y = u | x), where d in the class posterior is a distance function. For MAML (ANIL), let D^val_q and D^val_p denote the validation data from domains q and p; the loss is defined similarly on the combined validation data D^val_q ∪ D^val_p. Both mechanisms, for the prototypical network and for MAML (ANIL), help avoid overfitting and generalize across different domains.
On learning-rate optimization The learning rate is one of the most important and effective hyperparameters in standard neural-network optimization (Baydin et al., 2018; Franceschi et al., 2017; Donini et al., 2020) and in continual learning (Ebrahimi et al., 2020b). Different from standard supervised learning, whose goal is to minimize the generalization error, continual learning aims to avoid catastrophic forgetting while maintaining good performance on the current domain. Inspired by standard gradient-based hyperparameter-optimization algorithms such as (Baydin et al., 2018; Franceschi et al., 2017; Donini et al., 2020), we develop a novel automatic and adaptive learning-rate update scheme for newly arriving domains. Our goal is achieved by updating the learning rate of each parameter block, which adaptively updates parts of the network to fit future domains while remembering previous domains.
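The family of gradient-based learning-rate adaptation methods cited above can be illustrated with the hypergradient-descent rule of Baydin et al. (2018), where the learning rate itself is updated using the dot product of successive gradients. This is a self-contained sketch on a toy quadratic, not the paper's method, which applies such adaptation per parameter block and drives it with the memory loss:

```python
import numpy as np

def hypergradient_sgd(grad_fn, theta, alpha=0.1, beta=0.01, steps=100):
    """SGD whose learning rate alpha is itself adapted online, following
    the hypergradient rule of Baydin et al. (2018):
        alpha <- alpha + beta * (g_t . g_{t-1})
    """
    g_prev = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        alpha = alpha + beta * float(np.dot(g, g_prev))  # hypergradient step
        theta = theta - alpha * g                        # ordinary SGD step
        g_prev = g
    return theta, alpha

# Toy example: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
theta, alpha = hypergradient_sgd(lambda t: t, np.array([1.0, -2.0]))
```

On this convex toy problem the rate grows while successive gradients agree and the iterate converges to the minimizer at the origin.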

More Results

Compare to Baselines 5-way 1-shot learning Table 7 shows the comparison to the baselines we constructed for this problem setting in terms of 5-way 1-shot accuracy. In the table, 'N/A' means the BWT is not available since the method does not learn sequentially. Our method significantly outperforms the baselines. In particular, similar to the performance in 5-way 5-shot learning in



* https://github.com/tristandeleu/pytorch-meta



Figure 1: Meta-learning over sequential domains. Data in each domain arrive sequentially. Our model consists of a domain-shared part and a domain-specific part, each consisting of a few convolutional layers (and possibly fully connected layers). The domain-shared part is shared by all domains, and each domain owns one sub-network in the domain-specific part. Parameters (e.g., convolutional filters) in each domain-shared convolutional layer i (blue) are divided into n blocks, denoted B_{i0}, B_{i1}, ..., B_{in}. Each block is associated with one learnable learning rate when meta-training each domain on the network. The learning rates are updated by a loss defined on the memory tasks to enforce the memorization of previous domains.

Require: A sequence of training domain data D_1, D_2, ..., D_N; initial learning rates λ^S and λ^D_{1:T} set to λ_0 and model parameters θ^S = θ_0; λ^D_{1:T} is fixed during the training process.
Require: Maximum number of iterations M for each domain.
1: M = {}
2: for q = 1 to N do
3:
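One plausible reading of the (truncated) algorithm listing above is the following Python sketch: meta-train on each domain in turn, keep a few tasks per domain in the memory pool M, and use sampled memory tasks to adapt the block-wise learning rates. The function names and loop structure are our illustrative assumptions:

```python
import random

def sequential_meta_train(domains, meta_train_step, lr_update_step,
                          max_iters, tasks_per_domain=6, seed=0):
    """Sketch of the outer loop: one pass over the domain sequence."""
    rng = random.Random(seed)
    memory = []                                     # 1: M = {}
    for domain in domains:                          # 2: for q = 1 to N do
        for _ in range(max_iters):
            task = rng.choice(domain)               # sample a task from the current domain
            meta_train_step(task)                   # update model parameters on the task
            if memory:                              # once the pool is non-empty,
                lr_update_step(rng.choice(memory))  # adapt learning rates on a memory task
        memory.extend(rng.sample(domain, min(tasks_per_domain, len(domain))))
    return memory

# Toy run with counters standing in for the two update procedures.
calls = {"train": 0, "lr": 0}
domains = [list(range(10)), list(range(10, 20))]
mem = sequential_meta_train(
    domains,
    lambda t: calls.__setitem__("train", calls["train"] + 1),
    lambda t: calls.__setitem__("lr", calls["lr"] + 1),
    max_iters=5,
)
```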

$c_u = \frac{1}{K} \sum_{(x_p, y_p) \in S_u} f_\theta(x_p), \quad u \in \{0, 1, \cdots, 2P-1\}. \quad (9)$

This will encourage the network to differentiate the image classes across the two different domains and help the network generalize across different domains. For a given query data $x$, the goal is to model

$p_\theta(y = u \mid x) = \frac{\exp(-d(f_\theta(x), c_u))}{\sum_{u'=0}^{2P-1} \exp(-d(f_\theta(x), c_{u'}))}.$
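The prototype of Eq. (9) and the distance-based class posterior above can be computed directly. A minimal NumPy illustration for a merged 2P-way task, using squared Euclidean distance as the (assumed) choice of d:

```python
import numpy as np

def prototypes(support_emb, support_labels, n_classes):
    """c_u = mean of the embedded support examples with label u (Eq. 9)."""
    return np.stack([support_emb[support_labels == u].mean(axis=0)
                     for u in range(n_classes)])

def class_probs(query_emb, protos):
    """p(y = u | x): softmax over negative squared Euclidean distances."""
    d = ((protos - query_emb) ** 2).sum(axis=1)
    logits = -d
    e = np.exp(logits - logits.max())   # stable softmax
    return e / e.sum()

# Toy 2P-way example with P = 2: two classes from each of two domains.
emb = np.array([[0., 0.], [0.1, 0.], [1., 1.], [1., 0.9],
                [3., 0.], [3.1, 0.], [0., 3.], [0., 3.1]])
lab = np.array([0, 0, 1, 1, 2, 2, 3, 3])
protos = prototypes(emb, lab, 4)
p = class_probs(np.array([0.05, 0.0]), protos)
```

The memory loss J(θ) = −log p_θ(y = u | x) is then just the negative log of the entry of `p` at the query's true label.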

$\mathcal{L}_{\text{task}}(\theta, \boldsymbol{\phi}, D^{tr}_q)$

Appendix A details the

Comparison to baselines

Comparison to baselines with a different domain ordering

Comparison to baselines with Omniglot as the first training domain

Effect of the domain-level meta loss

Comparison of our method with different numbers of learning blocks

Effect of the number of memory tasks. Top: memory with one task for each domain; Bottom: memory with six tasks for each domain

Table 1, in the Protonet-based model, the performance of our model almost matches that of the joint-training model, indicating excellent memorization of past domains.


5-shot learning, the number of training data for each task is set to 1 and 5 respectively, and the number of query data is 10 in both cases. The meta-training loss is optimized by Adam (Kingma & Ba, 2014) with a learning rate of 1e-3.

For the continual-learning methods, we adopt the following hyperparameters.

• HAT details (Serrà et al., 2018): This baseline is adapted from the authors' implementation†. The output h_l of the units in layer l is element-wise multiplied by a gate: h'_l = a^t_l ⊙ h_l, where a^t_l = σ(s e^t_l) is the annealed version of the single-layer gated domain embedding e^t_l. Other hyperparameters follow (Serrà et al., 2018); the stability (scaling) parameter s is annealed according to s = 1/s_max + (s_max − 1/s_max)(b − 1)/(B − 1), where b is the batch index and B is the total number of batches.

• EWC details (Kirkpatrick et al., 2017): We vary the weight penalty λ for avoiding drastic changes in parameters with λ = 5, 1×10^1, 5×10^1, 1×10^2, 5×10^2, 1×10^3, 5×10^3 and select the best as our baseline.

• A-GEM details (Chaudhry et al., 2019a): The implementation is based on ‡.

• MER details (Riemer et al., 2019): The implementation is based on §. The within-batch meta-learning rate β is set to 0.03, the across-batch meta-learning rate γ to 1.0, and the number of batches per example to 5.

• UCB details (Ebrahimi et al., 2020a): Following (Ebrahimi et al., 2020a), we scale the learning rates of μ and ρ for each parameter proportionally to its importance Ω, measured by its variance, to reduce changes in important parameters, adapting the authors' implementation¶. We use 10 Monte Carlo samples at each iteration as in (Ebrahimi et al., 2020a).

• ER-Ringbuffer details (Chaudhry et al., 2019b): We use a fixed amount of memory from each previous domain to jointly train with the current domain.

Other meta-learning-related settings are the same as those used in sequential-domain meta-training described above.
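For concreteness, HAT's annealed gating described above can be sketched as follows (a minimal NumPy illustration of a^t_l = σ(s e^t_l) together with the annealing schedule for s; s_max and the toy embedding values are our assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hat_mask(embedding, s):
    """HAT's annealed attention mask a^t_l = sigmoid(s * e^t_l)."""
    return sigmoid(s * embedding)

def anneal_s(batch_idx, num_batches, s_max=400.0):
    """Scaling schedule s = 1/s_max + (s_max - 1/s_max)(b - 1)/(B - 1),
    so gates are soft early in the epoch and near-binary at its end."""
    return 1.0 / s_max + (s_max - 1.0 / s_max) * (batch_idx - 1) / (num_batches - 1)

e = np.array([-0.05, 0.0, 0.05])          # toy single-layer domain embedding
soft = hat_mask(e, anneal_s(1, 100))      # first batch: gates near 0.5
hard = hat_mask(e, anneal_s(100, 100))    # last batch: gates almost 0/1
```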

Domain-level meta loss

In this part, we elaborate on the computation of the proposed domain-level meta loss. To alleviate overfitting to a specific domain during the learning process, we propose to compute the meta loss at two separate levels: the intra-domain and inter-domain meta losses. The intra-domain meta loss is defined over tasks in the same domain to ensure good generalization within one domain. The inter-domain meta loss, by contrast, is defined over tasks sampled across all available domains in the memory to encourage model adaptation across different domains. When calculating the inter-domain loss with one task from the memory, at each iteration we augment its data with data from tasks of other domains. In this way, the model can be trained to better generalize to other domains. † https://github.com/joansj/hat ‡ https://github.com/GMvandeVen/continual-learning § https://github.com/mattriemer/MER ¶ https://github.com/SaynaEbrahimi/UCB
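The augmentation step above, where a memory task and a task from another domain are combined into one 2P-way task, can be sketched as follows (the helper name and the relabeling-by-offset convention are illustrative assumptions, not the paper's exact code):

```python
import numpy as np

def merge_tasks(task_a, task_b, n_way):
    """Combine a P-way task from the memory with a P-way task sampled from
    another domain into one 2P-way task: labels of the second task are
    shifted by P so all 2P classes must be distinguished jointly."""
    xa, ya = task_a
    xb, yb = task_b
    x = np.concatenate([xa, xb])
    y = np.concatenate([ya, yb + n_way])  # relabel second task to P..2P-1
    return x, y

# Toy example: two 2-way tasks become one 4-way task.
ta = (np.zeros((4, 3)), np.array([0, 0, 1, 1]))
tb = (np.ones((4, 3)), np.array([0, 0, 1, 1]))
x, y = merge_tasks(ta, tb, n_way=2)
```

The inter-domain meta loss is then an ordinary episode loss computed on the merged 2P-way task.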

