SCALABLE MULTI-MODAL CONTINUAL META-LEARNING

Abstract

This paper focuses on continual meta-learning, where few-shot tasks drawn from a non-stationary distribution arrive sequentially. Recent works maintain a mixture distribution over meta-knowledge to cope with this heterogeneity, dynamically changing the number of mixture components to capture incremental information. However, the underlying assumption that mixture components are mutually exclusive hinders the sharing of meta-knowledge across different tasks. Another issue is that these methods rely only on a prior to decide whether to add meta-knowledge components, leading to parameter inefficiency. In this paper, we propose a Scalable Multi-Modal Continual Meta-Learning (SMM-CML) algorithm, which employs a multi-modal premise to encourage different clusters of tasks to share meta-knowledge. Specifically, every task cluster is associated with a subset of mixture components, which is achieved through an Indian Buffet Process prior. Moreover, to avoid the parameter inefficiency caused by an unbounded increase in components, we propose a component sparsification method based on evidential theory that learns the posterior number of components, filtering out meta-knowledge components that do not receive direct support from tasks. Experiments show that SMM-CML outperforms strong baselines, which illustrates the effectiveness of our multi-modal meta-knowledge and confirms that our algorithm learns parameter-efficient meta-knowledge.

1. INTRODUCTION

Meta-learning (Vanschoren, 2018; Hospedales et al., 2020) is widely used in low-resource settings. The key idea is to transfer meta-knowledge (i.e., the experience of how to learn) to improve data efficiency and enhance model generalization. In contrast to the conventional assumption that data are homogeneous and available all at once (Finn et al., 2017; 2018), continual meta-learning addresses a more practical setting where data are heterogeneous and arrive sequentially (Finn et al., 2019; Denevi et al., 2019). That is, tasks from non-stationary distributions arrive one after another. Two challenges arise in this setting: (1) avoiding the forgetting of learned meta-knowledge when training on tasks sampled from a heterogeneous distribution, also known as catastrophic forgetting (Kirkpatrick et al., 2017); and (2) capturing incremental meta-knowledge when encountering newer tasks (Lee et al., 2017). For the first challenge, existing works (Jerfel et al., 2019; Yao et al., 2019; Zhang et al., 2021) use a mixture model, associating each cluster of similar tasks with a single component. One major concern is that they implicitly assume different meta-knowledge components are mutually exclusive. This assumption impedes the sharing of meta-knowledge among clusters of tasks, which can lead to suboptimal performance and a bias toward one type of meta-knowledge. For example, in user profiling, a user (i.e., a task) can belong to multiple preference groups (i.e., components); if the user is modeled by a single meta-knowledge component, the algorithm might focus on one preference and produce a biased profile. For the second challenge, these works (Jerfel et al., 2019; Yao et al., 2020; Zhang et al., 2021) incrementally update meta-knowledge, adding a new meta-knowledge component to the mixture model for new tasks.
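The contrast between the mutually exclusive single-component assumption and a multi-modal subset view can be sketched in a few lines. The responsibilities, threshold, and component count below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical responsibilities of 4 meta-knowledge components for one task.
resp = np.array([0.46, 0.41, 0.08, 0.05])

# Mutually exclusive view: the task is forced onto a single component,
# even though two components explain it almost equally well.
hard_assignment = int(np.argmax(resp))      # component 0 only

# Multi-modal view: the task keeps a *subset* of components (a binary mask);
# two task clusters whose masks overlap share those components.
mask = resp > 0.2                           # illustrative threshold
shared_subset = np.flatnonzero(mask)        # components 0 and 1

print(hard_assignment, shared_subset)
```

Under the hard assignment, the nearly equal evidence for the second component is discarded; the mask keeps both, which is what allows overlap between clusters.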
However, all of them merely leverage a prior (Jerfel et al., 2019; Zhang et al., 2021) or make a simple judgment before the update of meta-knowledge (Yao et al., 2019; 2020) on whether to add new meta-knowledge components, and cannot make a posterior decision from task data. Moreover, they can only increase, but never decrease, the number of mixture components as needed, leading to parameter inefficiency when facing a large number of tasks (see Fig. 1). To solve these problems, we propose a Scalable Multi-Modal Continual Meta-Learning algorithm, abbreviated as SMM-CML. SMM-CML associates each task cluster with a subset of components of the meta-knowledge mixture model, so that the provided meta-knowledge is multi-modal (i.e., a statistical distribution with multiple peaks), with each mode being a related meta-knowledge component. The multi-modal meta-knowledge relaxes the single-component constraint, allowing different clusters of tasks to share meta-knowledge via overlapping components. This is achieved by placing an Indian Buffet Process (IBP) prior on the number of components when new tasks arrive. To correct the prior after the meta-knowledge has been updated on new tasks, we propose an evidential sparsification method that decides the posterior number of components, filtering out meta-knowledge that does not receive direct support from task data. Our contributions are summarized as follows: • We propose multi-modal meta-knowledge, where a task is associated with a subset of components of the meta-knowledge mixture model instead of a single one.
Our multi-modal premise allows sharing meta-knowledge via overlapping components among different clusters of tasks, avoiding bias toward one type of meta-knowledge. • We employ the IBP prior to allow the number of mixture components to grow as newer tasks arrive, and propose an evidential sparsification method that learns the posterior number of components from tasks, filtering out meta-knowledge that does not receive direct support from the observed tasks. The combination of the IBP prior and evidential sparsification maintains scalable meta-knowledge for the online non-stationary setting. • We conduct extensive experiments, and the results show that SMM-CML outperforms state-of-the-art baselines under the online non-stationary setting. They also confirm the effectiveness of multi-modal meta-knowledge and that our algorithm learns parameter-efficient meta-knowledge from tasks.
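As a rough illustration of how an IBP prior lets the component pool grow as tasks arrive while encouraging overlap between their subsets, the following sketch samples the classic IBP generative process. The function name, `alpha`, and the task count are illustrative; this is a sketch of the prior only, not the paper's inference procedure:

```python
import numpy as np

def sample_ibp(num_tasks, alpha, rng):
    """Sample a task-by-component binary matrix from an Indian Buffet Process.

    Row t is the subset of meta-knowledge components used by task t; the
    number of columns (components) grows as new tasks arrive.
    """
    counts = []   # counts[k] = number of earlier tasks using component k
    rows = []
    for t in range(1, num_tasks + 1):
        # Reuse an existing component k with probability counts[k] / t ...
        row = [rng.random() < c / t for c in counts]
        for k, used in enumerate(row):
            counts[k] += int(used)
        # ... then create Poisson(alpha / t) brand-new components.
        new = rng.poisson(alpha / t)
        row.extend([True] * new)
        counts.extend([1] * new)
        rows.append(row)
    width = len(counts)
    return np.array([r + [False] * (width - len(r)) for r in rows])

Z = sample_ibp(num_tasks=6, alpha=2.0, rng=np.random.default_rng(0))
print(Z.astype(int))   # overlapping rows = task clusters sharing components
```

Popular components are reused with probability proportional to how many tasks already use them, which is exactly the rich-get-richer sharing the multi-modal premise relies on, while the Poisson term keeps admitting new components for genuinely novel tasks.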

2. RELATED WORK

Meta-Learning. Meta-learning (Vanschoren, 2018; Hospedales et al., 2020) focuses on the few-shot setting. It assumes that source tasks can help with learning on target tasks. Recent works include metric-based (Snell et al., 2017; Oreshkin et al., 2018), model-based (Ha et al., 2016; Munkhdalai & Yu, 2017), and optimization-based methods (Finn et al., 2017; 2018), as well as their Bayesian variants (Ravi & Beatson, 2018; Gordon et al., 2019; Iakovleva et al., 2020). However, most of them construct a globally shared meta-knowledge, which cannot fit the heterogeneous data distributions of the real world (Jerfel et al., 2019). To address this, some works (Jerfel et al., 2019; Zhang et al., 2021) maintain a mixture of meta-knowledge, where a cluster of similar tasks is associated with a single component of the meta-knowledge. This impedes the sharing of meta-knowledge between different clusters of tasks. Different from existing works, we account for both the sharing and the diversity of meta-knowledge simultaneously. Continual Learning. Conventional continual learning (Delange et al., 2021) concentrates on the large-scale data setting. Existing models prevent the catastrophic forgetting issue via replay (Hu et al.,



Figure 1: The difference in incremental meta-knowledge between existing works and ours. Previous methods only make a prior decision on whether to add new meta-knowledge (the red dashed grid), which might produce redundant components. Our algorithm makes a posterior decision from tasks after meta-training to filter out meta-knowledge that does not receive support.
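The posterior filtering step described in the caption can be caricatured as evidence-based pruning: keep only components whose accumulated support from tasks clears a threshold. The evidence vector, threshold, and variable names below are purely illustrative placeholders, not the paper's evidential-theory quantities:

```python
import numpy as np

# Hypothetical per-component "evidence": how much direct support each
# meta-knowledge component received from the observed tasks (e.g., summed
# posterior usage over all tasks seen so far).
evidence = np.array([7.3, 0.02, 4.1, 0.0, 1.8])

support_threshold = 0.1                  # illustrative cutoff
keep = evidence > support_threshold      # posterior decision per component

kept = np.flatnonzero(keep)              # components retained after pruning
pruned = np.flatnonzero(~keep)           # components filtered out
print(kept, pruned)
```

Unlike a prior-only rule, this decision is made after training on the tasks, so components that were added speculatively but never supported can be removed, keeping the mixture parameter-efficient.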

