TOWARDS LEARNING TO REMEMBER IN META LEARNING OF SEQUENTIAL DOMAINS

Anonymous

Abstract

Meta-learning has made rapid progress in recent years, with recent extensions made to avoid catastrophic forgetting in the learning process, namely continual meta learning. It is desirable to generalize the meta learner's ability to continuously learn in sequential domains, which is largely unexplored to date. Through extensive empirical verification, we found that current continual-learning techniques need significant improvement to be applicable in the sequential-domain meta-learning setting. To tackle the problem, we adapt existing dynamic learning-rate adaptation techniques to meta learn both model parameters and learning rates. Adaptation of the parameters ensures good generalization performance, while adaptation of the learning rates avoids catastrophic forgetting of past domains. Extensive experiments on a sequence of commonly used real-domain datasets demonstrate the effectiveness of our proposed method, which outperforms current strong baselines in continual learning. Our code is made publicly available online (anonymously): https://github.com/ICLR20210927/Sequential-domain-meta-learning.git.

1. INTRODUCTION

Humans have the ability to quickly learn new skills from a few examples without erasing old skills. It is desirable for machine-learning models to acquire this capability when learning under changing contexts/domains, which are common in real-world problems. Such tasks are easy for humans, yet pose challenges for current deep-learning models, mainly for two reasons: 1) catastrophic forgetting is a well-known problem for neural networks, which are prone to drastically losing knowledge of old tasks when the domain shifts (McCloskey & Cohen, 1989); 2) it has been a long-standing challenge to make neural networks generalize quickly from a limited amount of training data (Wang et al., 2020a). For example, a dialogue system may be trained on a sequence of domains (hotel booking, insurance, restaurants, car services, etc.) due to the sequential availability of datasets (Mi et al., 2020). In each domain, a task is defined as learning one customer-specific model (Lin et al., 2019). After meta training, the model can be deployed to previously trained domains: new (unseen) customers from those domains may arrive later, each with their own small training set (support set) used to adapt the sequentially meta-learned model. After adaptation, the customer-specific model can be deployed to respond to that customer. We formulate the above problem as sequential domain few-shot learning, where a model is required to make proper decisions based on only a few training examples while undergoing constantly changing contexts/domains. Adjusting to a new context/domain is expected not to erase knowledge already learned from old ones.
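The train-then-adapt pipeline above can be summarized in a minimal sketch. All names here (`Task`, `Domain`, `meta_update`, `adapt`) are illustrative placeholders standing in for a real meta-learner, not an actual implementation:

```python
# Illustrative sketch of sequential-domain meta training followed by
# per-customer adaptation. The update rules are trivial placeholders.
from dataclasses import dataclass
from typing import List

@dataclass
class Task:
    support: list   # a customer's few labelled examples, used for adaptation
    query: list     # examples used to evaluate the adapted model

@dataclass
class Domain:
    name: str
    tasks: List[Task]

def meta_update(theta, task):
    # Placeholder meta-update; in practice a MAML/ANIL or prototypical-network
    # step computed on the task's support/query split.
    return theta + 1  # stand-in for one gradient step

def adapt(theta, support):
    # Placeholder adaptation of the meta-learned parameters to a new
    # customer's small support set.
    return theta

def sequential_meta_train(domains, theta=0):
    seen = []
    for domain in domains:          # domains arrive one after another
        for task in domain.tasks:   # few-shot tasks sampled within the domain
            theta = meta_update(theta, task)
        seen.append(domain.name)
    return theta, seen
```

The point of the sketch is the control flow: meta training sees domains strictly in sequence, while adaptation may later be requested for any previously seen domain, which is why forgetting matters.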
The problem consists of two key components that have been considered separately in previous research: the ability to learn from a limited amount of data, referred to as few-shot learning; and the ability to learn new tasks without forgetting old knowledge, known as continual learning. Both aspects have proved particularly challenging for deep-learning models and have been explored independently by extensive previous work (Finn et al., 2017; Snell et al., 2017; Kirkpatrick et al., 2017; Lopez-Paz & Ranzato, 2017). However, a more challenging yet useful perspective that jointly integrates the two aspects remains less explored.

Generally speaking, meta-learning targets learning from a large number of similar tasks with a limited number of training examples per class. Most existing works focus on developing the generalization ability under a single context/domain (Santoro et al., 2016; Finn et al., 2017; 2018; Snell et al., 2017; Ravi & Beatson, 2019). Recently, it has been shown that catastrophic forgetting often occurs when transferring a meta-learning model to a new context (Ren et al., 2019; Yoon et al., 2020). Continual learning aims to mitigate negative backward-transfer effects on learned tasks when input distribution shifts occur during sequential context changes; the related techniques are currently applied mostly to standard classification problems (Serrà et al., 2018; Ebrahimi et al., 2020b). In this paper, we generalize them to a setting that seeks good generalization on unseen tasks from all domains with only limited training resources from previous domains, which we term sequential domain meta learning. Note that this setting is different from continual few-shot learning, which focuses on remembering previously learned low-resource tasks in a single domain. Our setting does not aim to remember a specific task, but rather to maintain good generalization to a large number of unseen few-shot tasks from previous domains without catastrophic forgetting. This setting is common and fits well in dynamic real-world scenarios such as recommendation systems and dialogue training systems.

The domain shift arising from this setting during meta learning poses new challenges to existing continual-learning techniques, mainly due to the high variability underlying a large number of dynamically formed few-shot tasks, which makes it infeasible for a model to explicitly remember each task. In our setting, a model is expected to remember patterns generic to a domain while neglecting the noise and variance of any specific few-shot task. This ability, which we term remember to generalize, allows a model to capture general patterns of a domain that occur repeatedly across batches of tasks while avoiding over-sensitivity to any single few-shot task.

In this paper, we propose to address the aforementioned challenges by designing a dynamic learning-rate adaptation scheme for learning to remember previous domains. The scheme jointly considers gradients from multiple few-shot tasks to filter out task variance and only remember patterns that are generic to each domain. Our main idea is to meta learn both the model parameters and the learning rates, backpropagating a domain loss and a memory loss to adaptively update the model parameters and the learning rates, respectively. Specifically, our mechanism keeps a small memory of tasks from previous domains, which is then used to guide the dynamic and adaptive learning behaviors of different portions of the network parameters. The proposed mechanism is versatile and applicable to both the metric-based prototypical network (Snell et al., 2017) and the gradient-based ANIL (Raghu et al., 2020) meta-learning model. Our contributions are summarized as follows:

Figure 1: Meta-learning over sequential domains. Data in each domain arrive sequentially. Our model consists of a domain-shared part and a domain-specific part, both consisting of a few convolutional layers (and possibly fully connected layers). The domain-shared part is shared by all domains, and each domain owns one sub-network in the domain-specific part. Parameters (e.g., convolutional filters) in each domain-shared convolutional layer i (blue) are divided into n blocks, denoted B_i0, B_i1, . . . , B_in. Each block is associated with one learnable learning rate when meta training each domain on the network. The learning rates are updated by a loss defined on the memory tasks to enforce memorization of previous domains.
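The block-wise update in Figure 1 can be sketched numerically. The quadratic domain and memory losses below are toy stand-ins (assumed only for illustration), but the two-step structure is the one described above: parameters move under the domain loss using per-block learning rates, and the learning rates themselves descend the memory loss via the chain rule through the parameter update:

```python
import numpy as np

def domain_grad(theta):
    # gradient of a toy domain loss ||theta - 1||^2 on the current domain
    return 2.0 * (theta - 1.0)

def memory_grad(theta):
    # gradient of a toy memory loss ||theta||^2 on the stored memory tasks
    return 2.0 * theta

def step(theta, lrs, blocks, meta_lr=0.01):
    g = domain_grad(theta)
    # 1) parameter update: each block B_b uses its own learning rate lrs[b]
    new_theta = theta.copy()
    for b, idx in enumerate(blocks):
        new_theta[idx] -= lrs[b] * g[idx]
    # 2) learning-rate update: d(memory loss)/d(lr_b) = grad_mem(new_theta)[idx] . (-g[idx]),
    #    so blocks whose updates hurt the memory loss get their rates shrunk
    gm = memory_grad(new_theta)
    new_lrs = lrs.copy()
    for b, idx in enumerate(blocks):
        new_lrs[b] -= meta_lr * float(np.dot(gm[idx], -g[idx]))
        new_lrs[b] = max(new_lrs[b], 0.0)  # keep learning rates non-negative
    return new_theta, new_lrs
```

In this toy run the domain loss pulls the parameters away from the memory optimum, so the per-block rates decrease, which is the intended "slow down where forgetting would occur" behavior.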


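For context on the metric-based meta-learner mentioned above, a prototypical network (Snell et al., 2017) reduces each few-shot task to nearest-prototype classification. A minimal NumPy sketch, with the learned embedding replaced by an identity map purely for brevity:

```python
import numpy as np

def prototypes(support_x, support_y, n_classes):
    # class prototype = mean embedding of that class's support examples
    return np.stack([support_x[support_y == c].mean(axis=0)
                     for c in range(n_classes)])

def classify(query_x, protos):
    # predict the class whose prototype is nearest in squared Euclidean distance
    d = ((query_x[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)
```

In the full model the inputs would first pass through the convolutional feature extractor of Figure 1, and it is that extractor's blocks that receive the learnable per-block learning rates.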