SCALABLE MULTI-MODAL CONTINUAL META-LEARNING

Abstract

This paper focuses on continual meta-learning, where few-shot tasks drawn from a non-stationary distribution arrive sequentially. Recent works maintain a mixture distribution of meta-knowledge to cope with the heterogeneity, and dynamically change the number of mixture components to capture incremental information. However, the underlying assumption that mixture components are mutually exclusive hinders sharing meta-knowledge across different tasks. Another issue is that they rely solely on a prior to decide whether to add meta-knowledge components, leading to parameter inefficiency. In this paper, we propose a Scalable Multi-Modal Continual Meta-Learning (SMM-CML) algorithm, which employs a multi-modal premise to encourage different clusters of tasks to share meta-knowledge. Specifically, every task cluster is associated with a subset of mixture components via an Indian Buffet Process prior. Moreover, to avoid the parameter inefficiency caused by unbounded growth, we propose a component sparsification method based on evidential theory that learns the posterior number of components, filtering out meta-knowledge components that receive no direct support from tasks. Experiments show that SMM-CML outperforms strong baselines, which illustrates the effectiveness of our multi-modal meta-knowledge and confirms that our algorithm learns parameter-efficient meta-knowledge.

1. INTRODUCTION

Meta-learning (Vanschoren, 2018; Hospedales et al., 2020) is widely used in low-resource settings. The key idea is to transfer meta-knowledge (i.e., experience about how to learn) to improve data efficiency and enhance model generalization. In contrast to the conventional assumption that data are homogeneous and available all at once (Finn et al., 2017; 2018), continual meta-learning faces a more practical setting where data are heterogeneous and sequentially available (Finn et al., 2019; Denevi et al., 2019). That is, tasks from non-stationary distributions arrive sequentially. This setting poses two challenges: (1) avoiding forgetting the learned meta-knowledge when training on tasks sampled from a heterogeneous distribution, also known as catastrophic forgetting (Kirkpatrick et al., 2017); and (2) capturing incremental meta-knowledge when encountering newer tasks (Lee et al., 2017). For the first challenge, existing works (Jerfel et al., 2019; Yao et al., 2019; Zhang et al., 2021) use a mixture model, associating a cluster of similar tasks with a single component. One major concern is that they implicitly assume different meta-knowledge components are mutually exclusive. This assumption impedes the sharing of meta-knowledge among clusters of tasks, which can lead to suboptimal performance and bias toward one type of meta-knowledge. For example, in user profiling, a user (i.e., a task) can belong to multiple preference groups (i.e., components), so modeling it with a single meta-knowledge component may cause the algorithm to focus on one preference and produce a biased profile. For the second challenge, these works (Jerfel et al., 2019; Yao et al., 2020; Zhang et al., 2021) incrementally update meta-knowledge, adding a new meta-knowledge component to the mixture model for new tasks.
However, all of them merely leverage a prior (Jerfel et al., 2019; Zhang et al., 2021) or make a simple judgment before the update of meta-knowledge (Yao et al., 2019; 2020) on whether to add new meta-knowledge components, but cannot make a posterior decision from tasks.

2. RELATED WORK

Continual Meta-Learning  Existing continual learning methods fall into several families, including […] (2019; Titsias et al., 2019), regularization (Benjamin et al., 2018; Pan et al., 2020), and incremental model selection (Kumar et al., 2021; Kessler et al., 2021). Recently, many meta-learning-based works (Finn et al., 2019; Zhuang et al., 2020) have focused on the low-resource setting. Inspired by incremental model selection, some existing works extend meta-knowledge when encountering new tasks, either by increasing the number of mixture components (Yao et al., 2019) or by adding a novel block to construct the meta-path (Yao et al., 2020). Moreover, the Chinese Restaurant Process (CRP) has been used to determine the prior number of meta-knowledge components (Jerfel et al., 2019; Zhang et al., 2021). However, these methods only consider how to construct the prior number of components and do not make a posterior decision from tasks. Such a prior-only determination allows the number of meta-knowledge components to grow without bound, which leads to parameter inefficiency and large computational cost. In our work, we learn the posterior number of meta-knowledge components from tasks by combining an IBP prior with an evidential sparsification method.

Sparsification Methods  In recent years, a number of methods have been proposed to sparsify multi-modal spaces. Most of them (Martins & Astudillo, 2016; Laha et al., 2018) propose softmax alternatives that sparsify large output spaces. Itkina et al. (2020) pointed out that the above methods are aggressive, and proposed a post hoc evidential sparsification for the conditional variational auto-encoder, based on the conclusion of Denoeux (2019) that most existing classifiers can be seen as converting features into mass functions and merging them into a final result.
Following Itkina et al. (2020), Chen et al. (2021) presented an evidential softmax method. However, these methods assume mutual exclusiveness, which conflicts with our multi-modal premise. Moreover, our evidential sparsification method provides a novel view of how to apply evidential theory to continual learning.

3. BACKGROUND

3.1 BAYESIAN ONLINE META-LEARNING

Suppose tasks \tau_t with datasets D_t arrive sequentially from a non-stationary distribution p(\tau). Each dataset D_t is split into two sub-datasets: a support set D^S_t = \{x_i, y_i\}_{i=1}^{N_t} for training and a query set D^Q_t = \{x_i, y_i\}_{i=1}^{M_t} for validation. Catastrophic forgetting (Lee et al., 2017) is a key issue in continual learning. To overcome this issue in the non-stationary task flow, variational methods (Yap et al., 2021; Zhang et al., 2021) have been developed. They update the meta-knowledge in an online way following Variational Continual Learning (VCL) (Nguyen et al., 2018):

p(\theta_t \mid D_{1:t}) \propto p(D_t \mid \theta_t)\, p(\theta_t \mid D_{1:t-1}),

where \theta_t is the meta-knowledge, used as the initialization following MAML (Finn et al., 2017). Note that the datasets D_{1:t} are assumed independent given \theta_t, so the meta-knowledge can be updated recursively. The meta-learning framework can then be reformulated variationally (Gordon et al., 2019; Iakovleva et al., 2020):

p(D_t \mid \theta_t) = \int p(D_t \mid \phi_t)\, p(\phi_t \mid \theta_t)\, d\phi_t,

where \phi_t is the task-specific parameter. To learn such intractable posteriors, inference methods (e.g., variational inference (Kingma & Welling, 2013)) are applied to infer approximate distributions. More details of inference are in Appendix A.
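As a concrete illustration of the recursive update p(\theta_t \mid D_{1:t}) \propto p(D_t \mid \theta_t)\, p(\theta_t \mid D_{1:t-1}), the following self-contained sketch runs the conjugate case of a Gaussian mean with known noise variance. This toy model is an illustrative stand-in for the meta-knowledge posterior, not the paper's actual model:

```python
def gaussian_posterior_update(prior_mu, prior_var, data, noise_var=1.0):
    """Conjugate update for a Gaussian mean with known noise variance:
    p(theta | D_t) proportional to p(D_t | theta) * p(theta | D_{1:t-1})."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (prior_mu / prior_var + sum(data) / noise_var)
    return post_mu, post_var

# Sequential (VCL-style) updating: yesterday's posterior is today's prior.
mu, var = 0.0, 10.0          # broad initial prior over the meta-knowledge theta
for batch in [[1.2, 0.8, 1.0], [0.9, 1.1]]:
    mu, var = gaussian_posterior_update(mu, var, batch)

# Processing both batches at once gives the same posterior (recursivity),
# because the datasets are independent given theta.
mu_all, var_all = gaussian_posterior_update(0.0, 10.0, [1.2, 0.8, 1.0, 0.9, 1.1])
assert abs(mu - mu_all) < 1e-9 and abs(var - var_all) < 1e-9
```

The final assertion checks the property the VCL recursion relies on: streaming the data through repeated Bayes updates matches a single batch update.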

3.2. EVIDENTIAL THEORY

Evidential theory (Denoeux, 2019) works on a discrete set of hypotheses (equivalently, components of meta-knowledge in this paper). Let Z = \{z_1, z_2, \ldots, z_K\} be a finite set whose element z_k is a binary variable indicating whether the current task is associated with the k-th component, and let 2^Z denote the power set of Z. A mass function on Z is a mapping m : 2^Z \to [0, 1] satisfying the constraints

m(\emptyset) = 0, \qquad \sum_{A \subseteq Z} m(A) = 1.


Figure 2: The framework of SMM-CML. The top shows the updating of meta-knowledge, where the IBP prior determines whether to add new meta-knowledge components (the red box), and the evidential sparsity method filters out components that receive no direct support from tasks, based on the posterior of the beta distribution from the current time (the solid line) and the previous time (the dashed line). The bottom shows the task-specific adaptation, where the multi-modal meta-knowledge is selected and adapted into the task-specific parameter based on the support set D^S_t (the dashed line), and the task-specific parameter is then used to make predictions on the data (the solid line).

The mass function m(\cdot) represents the support that a piece of evidence provides to each potential subset of components, and any subset A with m(A) > 0 is called a focal set. As a particular case, the vacuous mass function (i.e., m(Z) = 1) indicates that the evidence provides no information. A mass function is said to be simple when

m(A) = s, \qquad m(Z) = 1 - s, \qquad w = -\ln(1 - s),

where A \subset Z is a single strict subset, s \in [0, 1] represents the degree of support for A, and w denotes the evidential weight of A. Given a mass function, two corresponding functions, called the belief and plausibility functions, are defined as

Bel(A) = \sum_{B \subseteq A} m(B), \qquad Pl(A) = \sum_{B \cap A \neq \emptyset} m(B) = 1 - Bel(\bar{A}).

Bel(A) can be interpreted as the total degree of support for A, while 1 - Pl(A) can be interpreted as the total degree of doubt about A. When the plausibility function is restricted to singletons, it is called the contour function pl : z_k \to [0, 1]. Given two mass functions provided by different pieces of evidence, their fusion follows Dempster's rule (Dempster, 2008). More details about the computing rules are in Appendix B.
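The belief and plausibility functions can be computed directly from a mass function stored over its focal sets. The sketch below, using an illustrative three-element frame and hand-picked masses (not values from the paper), also checks the identity Pl(A) = 1 - Bel(\bar{A}):

```python
def bel(m, A):
    """Belief: total mass committed to subsets of A."""
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    """Plausibility: total mass of focal sets that intersect A."""
    return sum(v for B, v in m.items() if B & A)

Z = frozenset({1, 2, 3})
# A mass function: some support for {1}, some for {1,2}, rest vacuous on Z.
m = {frozenset({1}): 0.4, frozenset({1, 2}): 0.3, Z: 0.3}

A = frozenset({1, 2})
assert abs(bel(m, A) - 0.7) < 1e-12                   # 0.4 + 0.3
assert abs(pl(m, A) - 1.0) < 1e-12                    # every focal set meets A
assert abs(pl(m, A) - (1 - bel(m, Z - A))) < 1e-12    # Pl(A) = 1 - Bel(complement)
```

Storing focal sets as `frozenset` keys keeps subset and intersection tests one-liners, which is convenient for the small frames used here.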

4. SCALABLE MULTI-MODAL CONTINUAL META-LEARNING

In this section, we present our Scalable Multi-Modal Continual Meta-Learning algorithm (SMM-CML). The overall framework of SMM-CML is shown in Fig. 2.

4.1. MULTI-MODAL CONTINUAL META-LEARNING

SMM-CML relaxes the constraint that a task is associated with only a single component of meta-knowledge; this one-to-one restriction prevents the sharing of meta-knowledge among different clusters of tasks. We employ multi-modal meta-knowledge, where multiple meta-knowledge components are maintained, under the assumption that a cluster of similar tasks is associated with a subset of the components:

p(D_t \mid \theta_t) = \int p(D_t \mid \theta_t, z_t)\, p(z_t)\, dz_t = \int \left[ \int p(D_t \mid \phi_t)\, p(\phi_t \mid \theta_t, z_t)\, d\phi_t \right] p(z_t)\, dz_t,

where z_t is an indicator vector of binary elements, each indicating whether the current task is relevant to the corresponding component of meta-knowledge. The multi-modal premise enables the sharing of meta-knowledge among different clusters of tasks via overlapping subsets of components, relaxing the restriction of mutual exclusiveness.

4.2. INDIAN BUFFET PROCESS PRIOR

In the non-stationary regime, one important requirement is to capture incremental information when a newer task is encountered, so fixed meta-knowledge is not appropriate. To capture incremental meta-knowledge and fit the multi-modal premise, we employ the Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2011) to make a prior decision on the number of components, z_t \sim IBP(\alpha), where the number of components added at each time is

K_{t,new} \sim Poisson(\alpha / t),

where \alpha is the hyperparameter controlling the rate of increase. The IBP prior for z_t is formulated via the stick-breaking process:

v_k \sim Beta(\alpha, 1), \quad \pi_k = \prod_{i=1}^{k} v_i, \quad z_{t,k} \sim Bern(\pi_k), \quad \text{for } k = 1, \ldots, \infty,

where Beta(\cdot) and Bern(\cdot) denote the Beta and Bernoulli distributions, respectively. Based on the IBP prior, the generative process of SMM-CML is:

\theta_{t,k} \sim N(\mu_{t,k}, \sigma_{t,k}), \qquad \phi_t \mid \theta_t, z_t \sim p(\phi_t \mid \theta_t, z_t),

where \theta_{t,k} is the meta-knowledge of the k-th component, and the task-specific parameters \phi_t are associated with a subset of meta-knowledge components determined by z_t. The inference is described in Sec. 4.4, and the probabilistic graphical model is shown in Fig. 7 in Appendix D. The IBP thus provides a prior on the number of components when encountering new tasks, so that it can capture incremental knowledge in the online non-stationary setting. However, the IBP only provides a prior; it cannot make a posterior decision after the meta-knowledge has been updated. When meeting a large number of task distributions, the unlimited increase in the number of components would incur large computational cost and lead to parameter inefficiency.
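The stick-breaking construction above is straightforward to simulate. The sketch below draws a truncated IBP feature vector z_t; the truncation level K, the seed, and \alpha = 2 are illustrative choices, not values from the paper:

```python
import random

def ibp_stick_breaking(alpha, K, rng):
    """Truncated stick-breaking construction of the IBP prior:
    v_k ~ Beta(alpha, 1), pi_k = prod_{i<=k} v_i, z_k ~ Bern(pi_k)."""
    pi, z = [], []
    stick = 1.0
    for _ in range(K):
        v = rng.betavariate(alpha, 1.0)
        stick *= v                       # pi_k is a product of v's, so it shrinks
        pi.append(stick)
        z.append(1 if rng.random() < stick else 0)
    return pi, z

rng = random.Random(0)
pi, z = ibp_stick_breaking(alpha=2.0, K=10, rng=rng)
# Inclusion probabilities are monotonically non-increasing, so later
# components are activated more rarely -- the "rich get richer" behavior.
assert all(a >= b for a, b in zip(pi, pi[1:]))
assert set(z) <= {0, 1}
```

In practice a finite truncation like this is also what variational schemes over the IBP work with, since only finitely many components are ever active.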

4.3. EVIDENTIAL SPARSIFICATION FOR MULTI-MODAL META-KNOWLEDGE

To learn the posterior number of components from tasks, we propose an evidential sparsification method for multi-modal meta-knowledge, a post hoc method applied after the update of meta-knowledge. Since the components of our multi-modal meta-knowledge are mutually independent, there may be redundancy across time, and how to merge information about different components from both previous and current times remains an issue. Evidential theory provides a principled way to merge independent pieces of evidence and make a decision (Dempster, 2008). After updating the meta-knowledge at a given time, the relationship between the current clusters of tasks and each meta-knowledge component is established. The relationship between tasks and one particular component is an independent piece of evidence containing support and doubt information. Such information from the current and previous times can be merged to describe the unified relationship between the observed tasks and the meta-knowledge components; components that receive no support are cast as redundant and removed. Fig. 6 in Appendix C gives an intuitive explanation of the combination of evidential theory and multi-modal meta-knowledge. In our IBP-based meta-knowledge, the relationship between tasks and components at each time is determined by K beta distributions over v_{t,k}. Following evidential theory (Denoeux, 2019), we treat each beta distribution at either the current time or a previous time as a piece of evidence, so that there are t \cdot K pieces of evidence. Intuitively, since all beta distributions of v_{t,k} (i.e., the pieces of evidence) are independent, each of them only provides support or doubt for its corresponding component. That is, it supports either the singleton \{z_k\} of the corresponding component or its complement \overline{\{z_k\}}. A single piece of evidence does not by itself provide 100% certainty, which in evidential theory means that the remaining probability mass is committed to the universal set Z.
In this way, each piece of evidence (i.e., each beta distribution of v_{t,k}) provides an evidential weight w_{t,k}, yielding two simple mass functions with focal sets \{z_k\} and \overline{\{z_k\}}, respectively. The evidential weight is defined as

w_{t,k} = \exp(\alpha_{t,k}) - \gamma \exp(\beta_{t,k}),

which increases with larger \alpha_{t,k} and decreases with larger \beta_{t,k}. The hyperparameter \gamma effectively adjusts the sparsity of the meta-knowledge. The weight w_{t,k} induces two further evidential weights w^+_{t,k} and w^-_{t,k}, supporting the singleton \{z_k\} and its complement \overline{\{z_k\}}, respectively. A derivation similar to Itkina et al. (2020) gives

w^+_{t,k} = \max(0, w_{t,k}) \geq 0, \qquad w^-_{t,k} = \max(0, -w_{t,k}) \geq 0. (11)

For each piece of evidence v_{t,k}, there then exist two mass functions, supporting \{z_k\} and \overline{\{z_k\}} respectively:

m^+_{t,k}(\{z_k\}) = 1 - \exp(-w^+_{t,k}), \quad m^+_{t,k}(Z) = \exp(-w^+_{t,k}); (12)

m^-_{t,k}(\overline{\{z_k\}}) = 1 - \exp(-w^-_{t,k}), \quad m^-_{t,k}(Z) = \exp(-w^-_{t,k}). (13)

These mass functions, provided by different pieces of evidence, can be fused with Dempster's rule, giving the final result

m(\{z_k\}) = C C^+ C^- \exp(-w^-_k) \left[ \exp(w^+_k) - 1 + \prod_{l \neq k} \left(1 - \exp(-w^-_l)\right) \right], (14)

where C, C^+ and C^- are normalization terms that can be omitted in computation, and w^+_k and w^-_k are the merged evidential weights of each component. Computational details are given in Appendix C. A component k receives no support (i.e., m(\{z_k\}) = 0) exactly when no evidence directly supports it (i.e., w^+_k = 0) and at least one other component receives no direct doubt from the evidence (i.e., w^-_l = 0 for some l \neq k). For the sparsification of the multi-modal meta-knowledge, we apply the mass function developed above to filter out components that receive no direct support.
That is, components with zero singleton mass (i.e., m(\{z_k\}) = 0) are removed, and the construction of meta-knowledge for a specific task in Eq. 6 is modified as

p(\phi_t \mid \theta_t, z_t) = \prod_{k=1}^{K} \mathbb{1}\{m(\{z_k\}) \neq 0\}\, \mathbb{1}\{z_{t,k} \neq 0\}\, p(\phi_{t,k} \mid \theta_{t,k}; \lambda_{t,k}). (15)
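The sparsification rule can be sketched numerically: given merged weights w^+_k and w^-_k, compute the unnormalized singleton masses m(\{z_k\}) \propto \exp(-w^-_k)[\exp(w^+_k) - 1 + \prod_{l \neq k}(1 - \exp(-w^-_l))] and drop components whose mass is zero. The weights below are illustrative, not learned values:

```python
import math

def singleton_masses(w_pos, w_neg):
    """Unnormalized fused singleton masses:
    m({z_k}) proportional to exp(-w-_k) *
        (exp(w+_k) - 1 + prod_{l != k} (1 - exp(-w-_l)))."""
    K = len(w_pos)
    masses = []
    for k in range(K):
        prod = 1.0
        for l in range(K):
            if l != k:
                prod *= 1.0 - math.exp(-w_neg[l])
        masses.append(math.exp(-w_neg[k]) * (math.exp(w_pos[k]) - 1.0 + prod))
    return masses

# Merged evidential weights for K = 3 components (illustrative numbers):
# component 2 receives no direct support (w+ = 0) and component 0 receives
# no direct doubt (w- = 0), so component 2's singleton mass vanishes.
w_pos = [1.5, 0.7, 0.0]
w_neg = [0.0, 0.2, 1.1]

m = singleton_masses(w_pos, w_neg)
keep = [k for k, mk in enumerate(m) if mk > 0.0]
assert keep == [0, 1]   # component 2 is filtered out
```

As the text notes, the normalization constants C, C^+, C^- are shared across components and do not affect which masses are exactly zero, so they are omitted here.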

4.4. STRUCTURED VARIATIONAL INFERENCE

Exact inference is intractable because of non-conjugacy, so approximation is required. In our work, we employ variational inference (Blei et al., 2017) to approximate the posterior. The negative evidence lower bound (ELBO), used as the loss for the observation at the current time t, can be derived as

L(\psi_t, D_t) = -E_{q(v_t, z_t, \theta_t, \phi_t)}[\log p(D_t \mid \phi_t)] + \sum_{k=1}^{K} D_{KL}(q(v_{t,k}) \| p(v_{t,k})) + \sum_{k=1}^{K} D_{KL}(q(z_{t,k} \mid v_{t,k}) \| p(z_{t,k} \mid v_{t,k})) + \sum_{k=1}^{K} D_{KL}(q(\theta_{t,k}) \| p(\theta_{t,k})) + D_{KL}(q(\phi_t \mid \theta_t, z_t) \| p(\phi_t \mid \theta_t, z_t)), (16)

where D_{KL} is the Kullback-Leibler divergence and q(\cdot) is the variational distribution of each latent variable. The expectation of the likelihood in Eq. 16 can be computed with Monte Carlo sampling, while all the KL terms can be computed directly, as they have closed-form expressions via implicit reparameterization gradients (Figurnov et al., 2018). Details of the variational distributions, the sampled gradient computation for the likelihood term, and the closed-form expressions for the KL terms are given in Appendix D.
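Among the closed-form KL terms, the Gaussian one is the simplest to verify. A minimal univariate sketch (dimension 1 for clarity; the paper's diagonal-Gaussian case sums this expression over dimensions) is:

```python
import math

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between two univariate Gaussians, the form
    taken by the D_KL(q(theta) || p(theta)) terms of the objective."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
            - 0.5)

# KL is zero iff the two distributions coincide, and positive otherwise.
assert kl_gaussian(0.0, 1.0, 0.0, 1.0) == 0.0
assert kl_gaussian(1.0, 0.5, 0.0, 1.0) > 0.0
```

Having such closed forms means only the likelihood term of the loss needs Monte Carlo samples; the KL terms contribute exact, low-variance gradients.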

4.5. DISCUSSION

In contrast to recent works (Jerfel et al., 2019; Zhang et al., 2021), our work has two major differences that enhance performance and underpin our contributions. (1) Instead of a one-to-one matching between task clusters and meta-knowledge components, our algorithm constructs a many-to-many matching, where multiple task clusters can share one meta-knowledge component and one task cluster can draw on multiple meta-knowledge components. This avoids bias toward a single meta-knowledge component and improves performance on heterogeneous tasks. (2) Our algorithm combines the IBP prior with evidential sparsification to learn the posterior number of meta-knowledge components. Compared to existing works that only use the CRP as a prior, our algorithm makes a posterior decision, which achieves parameter efficiency and reduces computational cost. An analysis of complexity is given in Appendix E.

5. EXPERIMENTS

To examine the effectiveness of our SMM-CML, we design experiments, make comparisons, and analyze the results. Specifically, the research questions that guide the remainder of the paper are: (RQ1) Can our proposed SMM-CML achieve better performance than state-of-the-art baselines under the online non-stationary setting? (RQ2) Can the increasing number of components capture the incremental information? (RQ3) What is the impact of evidential sparsification on performance? Our experiments are conducted under the online non-stationary setting. We compare our algorithm to the following baselines: (1) Train-On-Everything (TOE): an intuitive method that re-initializes the meta-knowledge at each time t and trains on all the arriving data D_{1:t}; (2) Train-From-Scratch (TFS): another intuitive method that also re-initializes the meta-knowledge at each time t but trains only on the current data D_t; (3) Follow the Meta Leader (FTML) (Finn et al., 2019): a method utilizing the Follow the Leader algorithm (Kalai & Vempala, 2005) to minimize the regret of the meta-learner; (4) Online Structured Meta-Learning (OSML): a method that constructs a pathway to extract meta-knowledge from a meta-hierarchical graph; (5) Dirichlet Process Mixture Model (DPMM): an algorithm that employs the CRP to build mixture meta-knowledge using point estimation; (6) Bayesian Online Meta-Learning with Variational Inference (BOMVI): a method that uses Bayesian meta-learning to address the catastrophic forgetting issue; (7) Variational Continual Bayesian Meta-Learning (VC-BML): a state-of-the-art method that builds mixture meta-knowledge via a Bayesian method. Following existing works (Yap et al., 2021; Zhang et al., 2021), we conduct experiments on four datasets: VGG-Flowers (Nilsback & Zisserman, 2008), miniImagenet (Ravi & Larochelle, 2017), CIFAR-FS (Bertinetto et al., 2018), and Omniglot (Lake et al., 2011).
Tasks sampled from different datasets correspond to different task distributions, so the online non-stationary environment can be created by chronologically sampling tasks from different datasets. Specifically, each sampled task is a 5-way 5-shot task, with 5 classes sampled randomly from one dataset. In our experiments, we sequentially meta-train the model on tasks sampled from the meta-training splits of these four datasets: the model is first trained on tasks sampled from the VGG-Flowers dataset and then proceeds to the next dataset. The reported performance is evaluated on the test set after tuning hyperparameters on the validation set. More details about the experiments are in Appendix F.

5.1. RQ1: PERFORMANCE UNDER ONLINE NON-STATIONARY SETTING

To examine the effectiveness of our algorithm, we present the mean meta-test accuracy on all learned datasets at each meta-training stage in Tab. 1. The results show that SMM-CML can not only maintain the meta-knowledge learned at previous times but also capture incremental meta-knowledge from the current tasks. Moreover, the comparison between SMM-CML and the baselines that maintain mutually exclusive meta-knowledge components (i.e., DPMM and VC-BML) confirms that sharing helps to improve performance. To further illustrate the association between tasks and meta-knowledge, we show the posterior Bernoulli probability of each component on each dataset. As shown in Fig. 3, the Bernoulli probabilities of the components are distinct. For example, the VGG-Flowers dataset has a strong association with the last three components, while the Omniglot dataset is closely relevant to all components except the third. This confirms that in our learned meta-knowledge, different clusters of tasks share multiple components while maintaining their diversity through the remaining ones. Besides, we consider another, more challenging setting in which tasks from different datasets are mixed and arrive randomly one by one. Because the task stream is more non-stationary, catastrophic forgetting is more serious. Tab. 2 shows the average accuracy over all times. SMM-CML achieves the best performance on all four datasets even in this challenging setting, which further confirms that our algorithm can cope with online non-stationary task streams.

5.2. RQ2: THE IMPACT OF INCREASING NUMBER OF COMPONENTS

To capture incremental meta-knowledge in continual meta-learning, we employ the Indian Buffet Process to allow the number of components to grow. We conduct experiments with different numbers of meta-knowledge components to test its effectiveness. The evolution of meta-test accuracy when training on different datasets is shown in Fig. 4. TOE has the best performance at most stages because it can replay all available data. As the number of components increases, our proposed algorithm performs better on both the learned and the new datasets. This further demonstrates that more components can capture incremental meta-knowledge and alleviate the forgetting issue.

5.3. RQ3: THE EFFECTIVENESS OF EVIDENTIAL SPARSIFICATION

To reduce computational cost, we propose an evidential sparsification method. To examine its impact, we compare performance before and after sparsification. The mean meta-test accuracy at each meta-training stage is shown in Tab. 3, and more results are given in Appendix F.4.2. Compared to the original meta-knowledge, the sparsified meta-knowledge achieves comparable performance. This confirms that our method reduces redundancy and computational cost at an acceptable loss in accuracy. Moreover, we conduct experiments with different numbers of components under an appropriate γ. The results in Fig. 5 show that our model can outperform the state of the art even with fewer components. This confirms that our algorithm filters out redundant meta-knowledge components and is more parameter-efficient.

6. CONCLUSION

This paper focuses on a challenging setting in meta-learning, where tasks from a non-stationary distribution arrive sequentially. We propose SMM-CML, a Scalable Multi-Modal Continual Meta-Learning algorithm in which a cluster of similar tasks is associated with multiple components, allowing tasks to share meta-knowledge while maintaining their diversity. Moreover, an IBP prior is employed to decide whether to increase the number of components, and an evidential sparsity method is proposed to filter out components that have not received support information from tasks. This yields a posterior number of meta-knowledge components and thus avoids parameter inefficiency. The conducted experiments show the effectiveness of multi-modal meta-knowledge and confirm that our algorithm learns the needed meta-knowledge from tasks. One limitation concerns space complexity, since our model still needs to increase the number of mixture components to cover more meta-knowledge; the proposed evidential sparsity method helps alleviate this requirement.

A VARIATIONAL INFERENCE FOR META-LEARNING

Following MAML (Finn et al., 2017), many Bayesian variants (Ravi & Beatson, 2018; Gordon et al., 2019; Iakovleva et al., 2020) have been proposed. To fit the bi-level optimization architecture, most of them consider hierarchical Bayesian inference (Amit & Meir, 2018), where the evidence lower bound (ELBO) of the likelihood can be derived as follows:

\log p(D_{1:T}) = \log \int p(\theta) \prod_{i=1}^{T} \int p(D_i \mid \phi_i)\, p(\phi_i \mid \theta)\, d\phi_i\, d\theta
\geq E_{q(\theta;\psi)} \left[ \log \prod_{i=1}^{T} \int p(D_i \mid \phi_i)\, p(\phi_i \mid \theta)\, d\phi_i \right] - D_{KL}(q(\theta;\psi) \| p(\theta))
\geq E_{q(\theta;\psi)} \left[ \sum_{i=1}^{T} E_{q(\phi_i;\lambda_i)}[\log p(D_i \mid \phi_i)] - D_{KL}(q(\phi_i;\lambda_i) \| p(\phi_i \mid \theta)) \right] - D_{KL}(q(\theta;\psi) \| p(\theta)),

where \theta and \phi are the global and task-specific parameters, respectively. The lower bound follows from Jensen's inequality, and variational distributions over \theta and \phi are introduced to approximate the intractable posteriors. The bi-level optimization then becomes

\psi^*, \lambda^* = \arg\max_{\psi,\lambda} E_{q(\theta;\psi)} \left[ \sum_{i=1}^{T} E_{q(\phi_i;\lambda_i)}[\log p(D_i \mid \phi_i)] - D_{KL}(q(\phi_i;\lambda_i) \| p(\phi_i \mid \theta)) \right] - D_{KL}(q(\theta;\psi) \| p(\theta)),

so the goal of the optimization is to find the optimal variational distributions of \theta and \phi, parameterized by \psi and \lambda, respectively.

B COMPUTING RULES IN EVIDENTIAL THEORY

Some computing rules are introduced by Dempster-Shafer theory (Denoeux, 2019). Given two mass functions m_1 and m_2, their combination is defined by Dempster's rule:

(m_1 \oplus m_2)(A) = \frac{1}{1-\kappa} \sum_{B \cap C = A} m_1(B)\, m_2(C),

where \kappa is the degree of conflict between the two pieces of evidence:

\kappa = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C).

Note that Dempster's rule for the combination of mass functions is commutative and associative. Based on it, the combination of the two corresponding contour functions pl_1 and pl_2 can be computed as

(pl_1 \oplus pl_2)(z_k) = \frac{pl_1(z_k)\, pl_2(z_k)}{1 - \kappa}.

If both mass functions are simple with the same strict subset, their fusion is

A^{w_1} \oplus A^{w_2} = A^{w_1 + w_2}, (22)

where A^{w_1} denotes the simple mass function with single strict focal subset A and evidential weight w_1.
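Dempster's rule is mechanical to implement over mass functions stored as dictionaries keyed by focal sets. The sketch below (with an illustrative two-element frame) also reproduces the simple-mass fusion rule A^{w_1} \oplus A^{w_2} = A^{w_1+w_2}: combining two simple mass functions with s = 0.5 (i.e., weight w = ln 2 each) yields s = 0.75, i.e., the doubled weight ln 4:

```python
def dempster_combine(m1, m2):
    """Dempster's rule: (m1 (+) m2)(A) = (1/(1-kappa)) * sum_{B cap C = A} m1(B) m2(C),
    with conflict kappa = sum_{B cap C = empty} m1(B) m2(C)."""
    raw, kappa = {}, 0.0
    for B, b in m1.items():
        for C, c in m2.items():
            inter = B & C
            if inter:
                raw[inter] = raw.get(inter, 0.0) + b * c
            else:
                kappa += b * c
    return {A: v / (1.0 - kappa) for A, v in raw.items()}

Z = frozenset({1, 2})
# Two simple mass functions with the same focal set {1}, each with s = 0.5.
m1 = {frozenset({1}): 0.5, Z: 0.5}
m2 = {frozenset({1}): 0.5, Z: 0.5}
m = dempster_combine(m1, m2)
# No conflict here, and support for {1} accumulates: 1 - 0.5 * 0.5 = 0.75.
assert abs(m[frozenset({1})] - 0.75) < 1e-12
assert abs(m[Z] - 0.25) < 1e-12
```

The resulting s = 0.75 corresponds to w = -ln(1 - 0.75) = ln 4 = 2 ln 2, matching the additivity of evidential weights for simple mass functions.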

C THE COMPUTATIONAL DETAILS OF FUSING MASS FUNCTION

We combine all the positive mass functions and all the negative mass functions separately, and then fuse the two results to produce the final output.


Figure 6: An intuitive explanation of our proposed evidential sparsity method. The relationship between tasks and meta-knowledge components at each time provides K pieces of evidence, containing support and doubt information with uncertainty. Such information can be merged by Dempster's rule to provide a unified relationship between the occurring tasks and the components. The components not receiving support are removed (i.e., component K - 1 in the figure).

C.1 THE FUSION ACROSS TIME

Before the positive and negative fusions, we need to merge the evidence supporting the same focal elements at different times. Since these simple mass functions share the same focal set, their fusion can be calculated following Eq. 22, with weights

w^+_k = \sum_{i=0}^{t} w^+_{i,k}, \qquad w^-_k = \sum_{i=0}^{t} w^-_{i,k}, (23)

where w^+_{i,k} and w^-_{i,k} are the evidential weights of the positive and negative mass functions at time i, respectively. In this way, the evidence supporting the same focal element at different times is merged first:

m^+_k(\{z_k\}) = 1 - \exp(-w^+_k), \quad m^+_k(Z) = \exp(-w^+_k); (24)

m^-_k(\overline{\{z_k\}}) = 1 - \exp(-w^-_k), \quad m^-_k(Z) = \exp(-w^-_k). (25)

C.2 THE FUSION OF m^+

As defined above, all the positive mass functions have only two focal elements, \{z_k\} and Z. Their combination can be computed according to Dempster's rule:

m^+(\{z_k\}) \propto \left[1 - \exp(-w^+_k)\right] \prod_{l \neq k} \exp(-w^+_l) = \left[\exp(w^+_k) - 1\right] \prod_{l=1}^{K} \exp(-w^+_l), (26)

m^+(Z) \propto \prod_{k=1}^{K} \exp(-w^+_k). (27)

Since the fused mass function must sum to one, the results are obtained by normalization. The sum of all terms is

m^+(Z) + \sum_{k=1}^{K} m^+(\{z_k\}) \propto \prod_{k=1}^{K} \exp(-w^+_k) + \sum_{k=1}^{K} \left[\exp(w^+_k) - 1\right] \prod_{l=1}^{K} \exp(-w^+_l) (28)

= \prod_{k=1}^{K} \exp(-w^+_k) \cdot \left( \sum_{k=1}^{K} \exp(w^+_k) - K + 1 \right). (29)

Normalizing gives

m^+(\{z_k\}) = \frac{\left[\exp(w^+_k) - 1\right] \prod_{l=1}^{K} \exp(-w^+_l)}{\prod_{l=1}^{K} \exp(-w^+_l) \cdot \left( \sum_{l=1}^{K} \exp(w^+_l) - K + 1 \right)} = \frac{\exp(w^+_k) - 1}{\sum_{l=1}^{K} \exp(w^+_l) - K + 1}, (30)

m^+(Z) = \frac{\prod_{l=1}^{K} \exp(-w^+_l)}{\prod_{l=1}^{K} \exp(-w^+_l) \cdot \left( \sum_{l=1}^{K} \exp(w^+_l) - K + 1 \right)} = \frac{1}{\sum_{l=1}^{K} \exp(w^+_l) - K + 1}. (31)

C.3 THE FUSION OF m^-

Unlike the positive mass functions, the negative mass functions have only the two focal elements \overline{\{z_k\}} and Z. To compute the combination of all negative mass functions, we first compute the conflict:

\kappa^- = \prod_{k=1}^{K} \left( 1 - \exp(-w^-_k) \right). (32)

Thus, for any strict subset A of Z, its mass is

m^-(A) = \frac{\prod_{z_k \notin A} \left( 1 - \exp(-w^-_k) \right) \cdot \prod_{z_k \in A} \exp(-w^-_k)}{1 - \prod_{k=1}^{K} \left( 1 - \exp(-w^-_k) \right)}, (33)

and the mass of the complete set Z is

m^-(Z) = \frac{\prod_{k=1}^{K} \exp(-w^-_k)}{1 - \prod_{k=1}^{K} \left( 1 - \exp(-w^-_k) \right)}. (34)

For the further fusion of the positive and negative mass functions, we need to compute pl^-(\{z_k\}), defined as

pl^-(\{z_k\}) = \frac{\prod_{l=1}^{K} pl^-_l(\{z_k\})}{1 - \prod_{k=1}^{K} \left( 1 - \exp(-w^-_k) \right)}, (35)

where the plausibility of each negative mass function is

pl^-_l(\{z_k\}) = \begin{cases} \exp(-w^-_l) & \text{if } k = l \\ 1 & \text{otherwise} \end{cases}. (36)

Thus, the fused plausibility is

pl^-(\{z_k\}) = \frac{\exp(-w^-_k)}{1 - \prod_{l=1}^{K} \left( 1 - \exp(-w^-_l) \right)}. (37)

C.4 THE FINAL FUSION

To clarify the following derivation, let

C^+ = \frac{1}{\sum_{k=1}^{K} \exp(w^+_k) - K + 1}, \qquad C^- = \frac{1}{1 - \prod_{k=1}^{K} \left( 1 - \exp(-w^-_k) \right)}. (38)

To combine the positive and negative mass functions, we first compute the conflict between them:

\kappa = \sum_{k=1}^{K} m^+(\{z_k\}) \cdot \left( \sum_{A \subseteq Z:\, z_k \notin A} m^-(A) \right) = \sum_{k=1}^{K} m^+(\{z_k\}) \cdot \left( 1 - pl^-(\{z_k\}) \right) = \sum_{k=1}^{K} C^+ \left[ \exp(w^+_k) - 1 \right] \cdot \left[ 1 - C^- \exp(-w^-_k) \right]. (39)

To keep the derivation clear, let

C = \frac{1}{1 - \kappa} = \frac{1}{1 - \sum_{k=1}^{K} C^+ \left[ \exp(w^+_k) - 1 \right] \left[ 1 - C^- \exp(-w^-_k) \right]}. (40)

Then, for any k \in \{1, 2, \ldots, K\}, the mass of each singleton can be computed as

m(\{z_k\}) = C \left[ m^+(\{z_k\}) \cdot \sum_{A \subseteq Z:\, z_k \in A} m^-(A) + m^+(Z) \cdot m^-(\{z_k\}) \right] = C \left[ m^+(\{z_k\}) \cdot pl^-(\{z_k\}) + m^+(Z) \cdot m^-(\{z_k\}) \right]. (41)

Combining Eq. 30, Eq. 31, Eq. 33 and Eq. 37, the final singleton mass is

m(\{z_k\}) = C \left\{ C^+ \left[ \exp(w^+_k) - 1 \right] \cdot C^- \exp(-w^-_k) + C^+ C^- \exp(-w^-_k) \prod_{l \neq k} \left( 1 - \exp(-w^-_l) \right) \right\} = C C^+ C^- \exp(-w^-_k) \left[ \exp(w^+_k) - 1 + \prod_{l \neq k} \left( 1 - \exp(-w^-_l) \right) \right]. (42)

D DETAILS OF INFERENCE

In this section, we present the details of the structured variational inference for our proposed SMM-CML. The pseudo-code and the probability model are shown in Alg. 1 and Fig. 7, respectively.

D.1 VARIATIONAL DISTRIBUTION

Because of the intractability of the posterior, we introduce a variational distribution to approximate the true posterior. To capture the dependencies among the approximate posterior distributions, we use the structured mean-field approximation (Hoffman & Blei, 2015) instead of the traditional mean-field approximation. Specifically, the joint variational distribution can be decomposed as

q(v_t, z_t, \theta_t, \phi_t \mid D_t) = q(\phi_t \mid \theta_t, z_t, D_t) \prod_{k=1}^{K} q(\theta_{t,k}) \, q(z_{t,k} \mid v_{t,k}) \, q(v_{t,k}),

and the component variational distributions are parameterized as

q(v_{t,k}) = \mathrm{Beta}(\alpha_{t,k}, \beta_{t,k}), \qquad (45)

q(z_{t,k} \mid \pi_{t,k}) = \mathrm{Bern}(\pi_{t,k}), \quad \text{where } \pi_{t,k} = v_{t,k},

q(\theta_{t,k}) = \mathcal{N}(\mu_{t,k}, \sigma^2_{t,k} \mathbf{1}),

q(\phi_t \mid \theta_t, z_t, D_t) = \prod_{k=1}^{K} \mathbb{1}\{z_{t,k} \neq 0\} \, q(\phi_{t,k} \mid \theta_{t,k}; \lambda_{t,k}), \quad \text{where } \lambda_t = \mathrm{SGD}_J(\theta^*_t, D^S_t, \epsilon),

and \mathrm{SGD}_J(\cdot) represents stochastic gradient descent with J steps. That is, the required variational parameters are \psi_t = \{\alpha_{t,k}, \beta_{t,k}, \mu_{t,k}, \sigma_{t,k}, \lambda_{t,k}\} for all k = 1, \ldots, K. Note that we replace \prod_{i=1}^{k} v_{t,i} with v_{t,k} in the posterior to remove the implicit order constraint in the prior, so that the optimization searches for the optimal variational parameters that maximize the ELBO in Eq. 16.
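A minimal sketch of drawing one joint sample from these factors, under simplifying assumptions: we use a hard Bernoulli draw for illustration (training instead uses the Concrete relaxation of Sec. D.2.3), and we omit \phi, which is obtained by J inner SGD steps rather than by sampling. The function name and parameter values are illustrative:

```python
import random

random.seed(0)  # illustrative determinism

def sample_variational(alpha, beta, mu, sigma):
    """Draw one joint sample from the factors in Eq. 45.

    alpha, beta: per-component Beta parameters.
    mu, sigma:   per-component lists of Gaussian means / standard deviations.
    """
    K = len(alpha)
    v = [random.betavariate(alpha[k], beta[k]) for k in range(K)]   # q(v_{t,k})
    z = [1 if random.random() < v[k] else 0 for k in range(K)]      # pi_{t,k} = v_{t,k}
    theta = [[random.gauss(m, s) for m, s in zip(mu[k], sigma[k])]  # q(theta_{t,k})
             for k in range(K)]
    return v, z, theta
```

The indicator 1{z_{t,k} != 0} in the last factor then selects which components' meta-knowledge participates in the inner-loop adaptation.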

D.2 REPARAMETERIZATION

The variational posterior is obtained by optimizing the ELBO using structured variational inference. To make the inference tractable, we utilize three reparameterizations to infer the Gaussian, Beta, and Bernoulli distributions, respectively.

D.2.1 THE VARIATIONAL GAUSSIAN DISTRIBUTION REPARAMETERIZATION

As mentioned above, the variational distribution of the meta-knowledge for each cluster is a diagonal Gaussian, \theta_{t,k} \sim \mathcal{N}(\mu_{t,k}, \sigma^2_{t,k}). We employ the reparameterization trick, which represents the meta-knowledge via a deterministic function \theta_{t,k} = g(\varepsilon; \mu_{t,k}, \sigma_{t,k}), where \varepsilon \sim \mathcal{N}(0, I). To apply the reparameterization, we define the standardization function and its inverse as

S_\psi(\theta) = \frac{\theta - \mu}{\sigma} = \varepsilon \sim q(\varepsilon), \quad \text{where } q(\varepsilon) = \mathcal{N}(0, I), \qquad \theta = S^{-1}_\psi(\varepsilon) = \varepsilon \cdot \sigma + \mu.

Note that we omit the subscripts here, and in the remainder of this section, for clarity. We can then rewrite the objective in the ELBO w.r.t. q(\theta) as

\mathbb{E}_{q_\psi(\theta)}[f(\theta)] = \mathbb{E}_{q(\varepsilon)}[f(S^{-1}_\psi(\varepsilon))].

This allows us to compute the gradient of the expectation as

\nabla_\psi \mathbb{E}_{q_\psi(\theta)}[f(\theta)] = \mathbb{E}_{q(\varepsilon)}[\nabla_\psi f(S^{-1}_\psi(\varepsilon))] = \mathbb{E}_{q(\varepsilon)}[\nabla_\theta f(S^{-1}_\psi(\varepsilon)) \, \nabla_\psi S^{-1}_\psi(\varepsilon)].
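A small numerical illustration of the pathwise gradient above: for f(\theta) = \theta^2 with q(\theta) = \mathcal{N}(\mu, \sigma^2), the true gradient is \nabla_\mu \mathbb{E}[f(\theta)] = 2\mu, and the reparameterized Monte Carlo estimator recovers it. The function name and test values are illustrative:

```python
import random

random.seed(0)  # illustrative determinism

def grad_mu_mc(mu, sigma, f_grad, n=200_000):
    """Pathwise (reparameterization) Monte Carlo estimate of
    d/dmu E_{q(theta)}[f(theta)] with theta = mu + sigma * eps, eps ~ N(0, 1).
    The chain rule gives E[f'(theta) * d theta/d mu] = E[f'(mu + sigma*eps)]."""
    total = 0.0
    for _ in range(n):
        eps = random.gauss(0.0, 1.0)
        total += f_grad(mu + sigma * eps)  # d theta / d mu = 1
    return total / n
```

For example, grad_mu_mc(0.5, 1.0, lambda t: 2.0 * t) is close to the analytic value 2 * 0.5 = 1.0.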

D.2.2 THE VARIATIONAL BETA DISTRIBUTION REPARAMETERIZATION

There is no simple inverse of the standardization function for the Beta distribution, which makes it impossible to apply explicit reparameterization directly. Instead, the problem can be tackled in two ways, which we detail in the annex.

Algorithm 1: The meta-training process of SMM-CML.
Input: Task distribution p(\tau), data distribution p(D \mid \tau), the initial number of components K_0, concentration parameter \alpha, the number of inner update steps J, the inner learning rate \epsilon, and the outer learning rate \zeta
1: for t = 1, 2, \ldots do
2: Determine the number of added components: K_{t,new} \sim \mathrm{Poisson}(\alpha / t)
3: Determine the number of components: K_t = K_{t-1} + K_{t,new}
4: Initialize the variational Beta distributions: \alpha_{t,k}, \beta_{t,k}, \forall k = 1, \ldots, K_t
5: Initialize the variational distributions of the meta-knowledge: \mu_k, \sigma_k, \forall k = 1, \ldots, K_t

The KL divergence between the Kumaraswamy distribution and the Beta distribution in the ELBO can be written as

D_{KL}\big( q(v_k; \alpha_k, \beta_k) \,\|\, p(v; \alpha, \beta) \big) = \frac{\alpha_k - \alpha}{\alpha_k} \left( -\gamma - \Psi(\beta_k) - \frac{1}{\beta_k} \right) + \log(\alpha_k \beta_k) + \log B(\alpha, \beta) - \frac{\beta_k - 1}{\beta_k} + (\beta - 1) \beta_k \sum_{m=1}^{\infty} \frac{1}{m + \alpha_k \beta_k} B\!\left( \frac{m}{\alpha_k}, \beta_k \right),

where \gamma is the Euler constant, \Psi(\cdot) is the digamma function, and B(\cdot, \cdot) is the Beta function. Following the existing work (Nalisnick & Smyth, 2017), the infinite series in the formula can be approximated by a finite sum of its first 11 terms.
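The truncated KL above can be sketched as follows. The digamma implementation is a standard shift-plus-asymptotic-series approximation (an assumption of this sketch, not from the paper); note that Kumaraswamy(a, 1) coincides exactly with Beta(a, 1), so the KL vanishes in that case:

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma(x):
    """Digamma via the recurrence to x >= 6, then the asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f * (1/252)))

def log_beta(a, b):
    """log B(a, b) via log-gamma, for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_kumaraswamy_beta(a_k, b_k, a, b, terms=11):
    """KL(Kumaraswamy(a_k, b_k) || Beta(a, b)) with the infinite series
    truncated to its first `terms` terms, as in Nalisnick & Smyth (2017)."""
    kl = (a_k - a) / a_k * (-EULER_GAMMA - digamma(b_k) - 1.0 / b_k)
    kl += math.log(a_k * b_k) + log_beta(a, b) - (b_k - 1.0) / b_k
    series = sum(math.exp(log_beta(m / a_k, b_k)) / (m + a_k * b_k)
                 for m in range(1, terms + 1))
    return kl + (b - 1.0) * b_k * series
```

When \beta = 1 the truncated series drops out entirely, so those cases are exact up to floating-point error.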

D.2.3 THE VARIATIONAL BERNOULLI DISTRIBUTION REPARAMETERIZATION

As the Bernoulli distribution is a classic discrete distribution, sampling from it requires an argmax operation, which is not differentiable. We employ the Concrete distribution (Maddison et al., 2017), also known as the Gumbel-Softmax distribution (Jang et al., 2017), to address this issue. We can then sample a relaxed random variable as

x_k = \sigma\!\left( \frac{\log \pi_k + \log \frac{u_k}{1 - u_k}}{\lambda} \right), \qquad u_k \sim U(0, 1),

where \lambda \in (0, \infty) is a temperature hyper-parameter, \sigma(\cdot) is the sigmoid function, \pi_k is the parameter of the Bernoulli distribution, and u_k is sampled from the uniform distribution U. To guarantee a lower bound on the ELBO, both the posterior and the prior Bernoulli distributions need to be replaced with Concrete distributions:

D_{KL}\big[ q(z_t \mid \pi_{k,t}) \,\|\, p(z_t \mid \pi_{k,t}) \big] \geq D_{KL}\big[ q(z_t \mid \pi_{k,t}, \lambda) \,\|\, p(z_t \mid \pi_{k,t}, \lambda) \big].

E THE ANALYSIS OF COMPLEXITY

We discuss the computational cost of the proposed SMM-CML in terms of time complexity and space complexity. For time complexity, the de facto bi-level optimization mechanism in meta-learning requires O(n^2) to update one meta-knowledge component, where an algorithm with time complexity O(n) is a linear-time algorithm. Without any sparsification or constraint, the number of components grows without bound, so the time complexity can rise to O(n^3). With our evidential sparsification, the number of components is limited to a small constant C given an appropriate hyperparameter \gamma in Eq. 10, so the time complexity drops to O(C \cdot n^2) \approx O(n^2). Similarly, as each meta-knowledge component contains the parameters of the model, its space complexity is O(n), and the total space complexity without sparsification can rise to O(n^2) because of the unlimited number of meta-knowledge components accumulated over many tasks. Our algorithm alleviates this issue: the evidential sparsification reduces the space complexity back to O(n) with an appropriate hyperparameter \gamma.

F DETAILS OF EXPERIMENT

F.1 THE DETAILS OF BASELINES

For a fair comparison, we use the widely applied network architecture following (Yap et al., 2021; Zhang et al., 2021). In what follows, we describe the details of the baselines:

TOE: Training-On-Everything (TOE) is an intuitive method that re-initializes the meta-knowledge at each time step and trains it on all the datasets that have arrived so far. It uses the same Bayesian meta-learning architecture as our algorithm. The difference is that SMM-CML is trained only on the current dataset at each time step, rather than on all the arrived datasets, and SMM-CML does not re-initialize the meta-knowledge at each time step as TOE does.

TFS: Train-From-Scratch (TFS) is another intuitive method, which also re-initializes the meta-knowledge but trains it only on the current dataset. It likewise uses the same Bayesian meta-learning architecture as our algorithm. The difference is that our algorithm maintains the posterior meta-knowledge from the previous time step as the prior at the current time step instead of re-initializing it as TFS does.

FTML: Follow the Meta Leader (FTML), proposed by (Finn et al., 2019), uses the Follow-the-Leader algorithm to bridge the gap between meta-learning and online learning. However, it assumes that all the arrived datasets remain available, which is memory-consuming and conflicts with the continual meta-learning setting. For a fair comparison, we train FTML only on the current dataset, the same as our algorithm.

OSML: Online Structured Meta-Learning (OSML) (Yao et al., 2020) maintains a meta-hierarchical graph with different knowledge blocks and constructs a meta-knowledge pathway for each newly encountered task. However, the original paper initializes the model with a well-pretrained convolutional network. As SMM-CML and the other baselines are randomly initialized, it would be unfair to use the original initialization, so we also randomly initialize the OSML model.
DPMM: The Dirichlet Process Mixture Model (DPMM) (Jerfel et al., 2019) employs a Chinese Restaurant Process to construct the mixture of meta-knowledge with a dynamic number of components. Note that it is not a Bayesian method and uses point estimation to update the meta-knowledge.

BOMVI: Bayesian Online Meta-Learning with Variational Inference (BOMVI) (Yap et al., 2021) is a state-of-the-art algorithm that constructs a meta-knowledge distribution to address the catastrophic forgetting issue in continual meta-learning. It also employs variational inference to update the meta-knowledge.

VC-BML: Variational Continual Bayesian Meta-Learning (VC-BML) (Zhang et al., 2021) is another state-of-the-art algorithm, which employs a truncated Chinese Restaurant Process to construct the mixture of meta-knowledge. Different from DPMM, it uses Bayesian inference to construct the mixture distribution of meta-knowledge and places an upper bound on the number of components to reduce the computational cost.

All the baselines and our proposed SMM-CML follow the experimental setting described in Sec. F.3.

F.2 THE DATASETS

VGG-Flowers: VGG-Flowers (Nilsback & Zisserman, 2008) consists of 102 flower categories. We randomly choose 66 categories for meta-training, 16 categories for validation, and the remaining 20 categories for meta-test.

miniImagenet: miniImagenet (Ravi & Larochelle, 2017) is designed for few-shot learning and consists of 100 different classes. We also split the dataset into three subsets (i.e., 64 classes for meta-training, 16 classes for validation, and 20 classes for meta-test) following existing works.

CIFAR-FS: The CIFAR-FS (Bertinetto et al., 2018) dataset used in our experiment is adapted from the CIFAR-100 dataset (Krizhevsky et al., 2009) for few-shot learning and consists of 100 classes. Following existing works (Yap et al., 2021; Zhang et al., 2021), we also randomly split the dataset, with 64 classes for meta-training, 16 classes for validation, and the remaining 20 classes for meta-test.

Omniglot: Omniglot (Lake et al., 2011) is a widely used dataset that contains 1,623 different handwritten characters from 50 different alphabets. Following the previous works (Yap et al., 2021; Zhang et al., 2021), we randomly split the dataset into three subsets: 1,100 characters for meta-training, 100 characters for validation, and the remaining 423 characters for meta-test.

To create the online non-stationary setting, we assume the above datasets arrive and become available sequentially. Moreover, we focus on the 5-way 5-shot setting, which constitutes a low-resource environment. For each dataset, we form the streaming tasks by randomly sampling 5 classes with replacement as a task. In our experiment, we also consider another, more challenging setting, where tasks from different datasets are mixed and arrive one by one. In this setting, we construct task streams of length 100, train the model on each task one by one, and evaluate the performance on each dataset.

F.3 THE DETAILS OF EXPERIMENT SETTING

For each task, we employ the same convolutional network as our base network following the previous works (Yap et al., 2021; Zhang et al., 2021), which is shown in Tab. 4. We use the Adam optimizer as the outer optimizer and the SGD optimizer as the inner optimizer. For the Monte Carlo sampling used in our algorithm, we set the number of samples to 5, and we set the initial number of components in the multi-modal meta-knowledge to 4. All the important hyper-parameters are listed in Tab. 5. We ran our algorithm on an NVIDIA Tesla V100 32GB GPU; training took about 54 hours.

F.4 ADDITIONAL EXPERIMENTAL RESULT

In what follows, we present the full results on the streaming datasets (i.e., VGG-Flowers, miniImagenet, CIFAR-FS and Omniglot) and change the order of the datasets to verify the generality of our algorithm.

F.4.1 META-TEST ACCURACIES ON EACH DATASET AT DIFFERENT META-TRAINING STAGE

In the main text, we only show the average result at each meta-training stage and the performance on each dataset at the last meta-training stage. We additionally show the full results in Tab. 6. Although SMM-CML does not achieve the best performance on all the arrived datasets at some meta-training stages (i.e., CIFAR-FS and miniImagenet), it outperforms all the baselines on the average results, which confirms the effectiveness of SMM-CML. Additionally, SMM-CML can not only maintain the performance on the old datasets but also achieve better results on the new datasets, which illustrates that it alleviates catastrophic forgetting better than the other baselines. Note that SMM-CML achieves the best performance on the current dataset at each stage (especially compared with VC-BML, which assumes that the components are mutually exclusive), which shows that our proposed model can resolve the conflict between the learned meta-knowledge and the incremental meta-knowledge, and that the multi-modal meta-knowledge can exploit the shared meta-knowledge to improve performance. Tab. 6 also shows the detailed results of SMM-CML before and after evidential sparsification; SMM-CML still achieves comparable performance on most datasets at each meta-training stage.

To further confirm the generality of our model, we change the order of the datasets in the streaming tasks. We conduct the experiments on a new order, where the model is trained chronologically on Omniglot, CIFAR-FS, miniImagenet and VGG-Flowers. The results, shown in Tab. 7, demonstrate that SMM-CML still outperforms the other baselines, which further confirms its generality.



Figure 3: Each column represents the posterior probability π of the Bernoulli distribution of a different component in the multi-modal meta-knowledge on different datasets.

Figure 4: The evolution of meta-test accuracies (%) of SMM-CML with different numbers of components when training on different datasets. TOE and TFS are two baselines for comparison.

Figure 5: The comparison between SMM-CML and VC-BML with different numbers of components on each training stage.

Figure 7: The probability model of SMM-CML. The solid line denotes the generative process, and the white circle and the grey circle denote the latent variable and the observed variable, respectively.

Mean meta-test accuracy (%) of the learned dataset at each meta-training stage. The best performance is marked with boldface.

Mean meta-test accuracy (%) under the sequential task setting, where the performance represents the average accuracy across the whole training tasks sequence. The best performance is marked with boldface.

The meta-test accuracy (%) before and after sparsification on each dataset.

The convolution neural network architecture in SMM-CML and baselines.

Some important hyper-parameters used in our experiments.

Performance of our SMM-CML and the baselines on each dataset at each meta-training stage. The best performance on each dataset is marked with boldface and the second best is marked with underline.

SMM-CML achieves comparable performance at each meta-training stage even after evidential sparsification, which further confirms the effectiveness of our proposed evidential sparsification.

F.4.2 ADDITIONAL EXPERIMENTS IN A DIFFERENT ORDER

Performance of our SMM-CML and the baselines on each dataset at each meta-training stage. The best performance (without 'original') on each dataset is marked with boldface and the second best (without 'original') is marked with underline.

annex

6: while not converged do
7: Sample v_{t,k} \sim q(v_{t,k}; \alpha_{t,k}, \beta_{t,k}), \forall k = 1, \ldots, K_t
8: Compute the ELBO according to Eq. 16
9: Compute the gradients \nabla \mu_{t,k}, \nabla \sigma_{t,k}, \forall k = 1, \ldots, K_t via explicit reparameterization according to Eq. 51
10: Compute the gradients \nabla \alpha_{t,k}, \nabla \beta_{t,k}, \forall k = 1, \ldots, K_t via implicit reparameterization according to Eq. 55
11: Update the variational parameters: \mu_{t,k} \leftarrow \mu_{t,k} - \zeta \nabla \mu_{t,k}, \forall k = 1, \ldots, K_t


17: Compute the evidential weights w^+_{t,k}, w^-_{t,k}, \forall k = 1, \ldots, K_t according to Eq. 11
18: Compute the mass function according to Eq. 14
19: Remove the components without supporting information according to Eq. 15
20: end for

There are two ways to tackle the problem: implicit reparameterization and the Kumaraswamy reparameterization.

Implicit reparameterization. This approach also utilizes reparameterization to tackle the intractable gradient of the Beta distribution: without the inverse of the standardization function, the term \nabla_\gamma v_k is difficult to compute. Inspired by (Figurnov et al., 2018), we employ implicit reparameterization to compute the gradient; the idea is to differentiate the standardization function S_\gamma(v_k) = \varepsilon using the chain rule instead of searching for its inverse. Note that the standardization function can be the CDF of the Beta distribution, with \varepsilon \sim \mathrm{Uniform}[0, 1]. The implicit gradient is then

\nabla_\gamma v_k = - \frac{\nabla_\gamma S_\gamma(v_k)}{p(v_k; \gamma)},

where p(v_k; \gamma) is the PDF of the Beta distribution.

Kumaraswamy distribution. The Beta distribution of v_k can also be reparameterized using a Kumaraswamy distribution (Nalisnick & Smyth, 2017), which can be defined as

q(v; \alpha_k, \beta_k) = \alpha_k \beta_k v^{\alpha_k - 1} (1 - v^{\alpha_k})^{\beta_k - 1},

and then the inverse of the standardization function can be computed as

v = S^{-1}(\varepsilon) = \left( 1 - (1 - \varepsilon)^{1/\beta_k} \right)^{1/\alpha_k}, \qquad \varepsilon \sim \mathrm{Uniform}[0, 1].
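A minimal sketch of the Kumaraswamy reparameterization (explicit inverse-CDF sampling), with the analytic mean \beta_k B(1 + 1/\alpha_k, \beta_k) as a sanity check; the helper names are illustrative:

```python
import math
import random

random.seed(0)  # illustrative determinism

def sample_kumaraswamy(a_k, b_k):
    """Explicit reparameterization via the inverse CDF:
    v = (1 - (1 - eps)^(1/b_k))^(1/a_k), eps ~ Uniform(0, 1)."""
    eps = random.random()
    return (1.0 - (1.0 - eps) ** (1.0 / b_k)) ** (1.0 / a_k)

def kumaraswamy_mean(a_k, b_k):
    """Analytic mean b_k * B(1 + 1/a_k, b_k), computed via log-gamma."""
    return b_k * math.exp(math.lgamma(1.0 + 1.0 / a_k) + math.lgamma(b_k)
                          - math.lgamma(1.0 + 1.0 / a_k + b_k))
```

Because the sample is a differentiable function of (\alpha_k, \beta_k), gradients of the ELBO flow through it directly, which is the point of replacing the Beta distribution with the Kumaraswamy distribution.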

