ENHANCING META LEARNING VIA MULTI-OBJECTIVE SOFT IMPROVEMENT FUNCTIONS

Abstract

Meta-learning tries to leverage information from similar learning tasks. In the commonly-used bilevel optimization formulation, the shared parameter is learned in the outer loop by minimizing the average loss over all tasks. However, the converged solution may be compromised in that it only focuses on optimizing a small subset of tasks. To alleviate this problem, we consider meta-learning as a multi-objective optimization (MOO) problem, in which each task is an objective. However, existing MOO solvers need to access all the objectives' gradients in each iteration, and cannot scale to the huge number of tasks in typical meta-learning settings. To overcome this scalability issue, we propose a scalable gradient-based solver that uses mini-batches of tasks. We provide theoretical guarantees on the Pareto optimality or Pareto stationarity of the converged solution. Empirical studies on various machine learning settings demonstrate that the proposed method is efficient, and achieves better performance than the baselines, particularly on improving the performance of the poorly-performing tasks and thus alleviating the compromising phenomenon.

1. INTRODUCTION

Meta-learning, also known as "learning to learn", aims to enable models to learn more effectively by leveraging information from many similar learning tasks (Hospedales et al., 2020). In recent years, meta-learning has received much attention for its fast adaptation to new learning scenarios with limited data (Kao et al., 2021; Finn et al., 2017; Snell et al., 2017; Lee et al., 2019; Nichol et al., 2018; Deleu et al., 2022; Rajeswaran et al., 2019; Vilalta & Drissi, 2002). It is usually formulated as a bi-level optimization problem (Franceschi et al., 2018; Hong et al., 2020), which finds task-specific parameters in the inner level and minimizes the average loss over tasks in the outer level. Recently, Wang et al. (2021) reformulated meta-learning as a multi-task learning problem. From this perspective, minimizing the average loss in the outer level using (stochastic) gradient descent may not always be desirable. Specifically, it may suffer from the compromising (or conflicting) phenomenon, in which the converged solution only focuses on minimizing the losses of a small subset of tasks while ignoring the others (Yu et al., 2020; Liu et al., 2021a; Sener & Koltun, 2018). This compromised solution may thus lead to poor performance. To alleviate this problem, we propose reformulating meta-learning as a multi-objective optimization (MOO) problem, in which each task is an objective. The performance of all tasks (objectives) is then considered during optimization (Emmerich & Deutz, 2018). A popular class of MOO solvers is the gradient-based approach (Liu et al., 2021a; Yu et al., 2020; Sener & Koltun, 2018; Navon et al., 2022; Liu et al., 2021b), with prominent examples such as the multiple-gradient descent algorithm (MGDA) (Désidéri, 2012; Sener & Koltun, 2018), PCGrad (Yu et al., 2020), and CAGrad (Liu et al., 2021a).
In each iteration, they find a common descent direction among all objective gradients, instead of simply optimizing the average performance over all objectives. Existing gradient-based MOO methods require the gradients of all the objectives. However, when meta-learning is formulated as a MOO problem with each task being an objective, computing all these gradients in each iteration can become very expensive, as the number of objectives (i.e., tasks) can be huge. For example, in 5-way 1-shot classification on the miniImageNet data, the total number of meta-training tasks is $\binom{64}{5} \approx 7.6 \times 10^6$. To address this challenge, we propose a scalable MOO solver that combines the improvement function (Miettinen & Mäkelä, 1995; Mäkelä et al., 2016; Montonen et al., 2018) with mini-batches of tasks. In contrast, we show that a trivial mini-batch extension of existing gradient-based MOO methods does not guarantee Pareto optimality and performs poorly in practice. Our main contributions are as follows: (i) To alleviate the compromising phenomenon, we reformulate meta-learning as a multi-objective optimization problem in which each task is an objective; (ii) To handle the possibly huge number of tasks, we propose a scalable gradient-based solver; (iii) We provide theoretical guarantees on the Pareto optimality or Pareto stationarity of the converged solution; (iv) Empirical studies on few-shot regression, few-shot classification, and reinforcement learning demonstrate that the proposed method achieves better performance, particularly in improving the performance of the poorly-performing tasks and thus alleviating the compromising phenomenon.
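The task count quoted above can be checked with a few lines of arithmetic. The sketch below assumes that a 5-way task is identified by an unordered choice of 5 classes out of the 64 meta-training classes; the helper name `num_n_way_tasks` is hypothetical.

```python
import math


def num_n_way_tasks(num_classes: int, n_way: int = 5) -> int:
    """Count unordered choices of n_way classes out of num_classes.

    Assumption: a task is identified by its set of classes only.
    """
    return math.comb(num_classes, n_way)


# 64 meta-training classes in miniImageNet, 5-way tasks.
mini_tasks = num_n_way_tasks(64)
print(f"{mini_tasks:,}")  # 7,624,512, i.e. about 7.6e6
```

Even under this coarse counting, evaluating one gradient per task in every iteration is clearly out of reach.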

2. BACKGROUND

Multi-Objective Optimization (MOO). In MOO (Marler & Arora, 2004), one aims to minimize¹ $m \ge 2$ objectives $f_1(x), \dots, f_m(x)$:

$$\min_x \; [f_1(x), \dots, f_m(x)]. \tag{1}$$

Definition 2.1. (Global Pareto optimality) (Miettinen, 2012; Mäkelä et al., 2016) $x^*$ is global Pareto optimal if there does not exist another $x$ such that $f_\tau(x^*) \ge f_\tau(x)$ for all $\tau \in \{1, \dots, m\}$, and $f_{\tau'}(x^*) > f_{\tau'}(x)$ for at least one $\tau' \in \{1, \dots, m\}$.

The Pareto front (PF) is the set of multi-objective values of all global Pareto-optimal solutions.

Definition 2.2. (Pareto stationarity) (Miettinen, 2012; Désidéri, 2012) $x^*$ is Pareto-stationary if there exist $\{u_\tau\}_{\tau=1}^m$ such that $\|\sum_{\tau=1}^m u_\tau \nabla_x f_\tau(x^*)\| = 0$, $u_\tau \ge 0 \; \forall \tau$, and $\sum_{\tau=1}^m u_\tau = 1$.

Note that global Pareto optimal solutions are also Pareto stationary (Désidéri, 2012). Analogous to the extension from a stationary point to an ϵ-stationary point (Lin et al., 2020), we extend Pareto stationarity to ϵ-Pareto stationarity. Obviously, 0-Pareto stationarity reduces to Pareto stationarity.

Definition 2.3. (ϵ-Pareto stationarity) For a given $\epsilon$, $x$ is ϵ-Pareto-stationary iff there exist $\{u_\tau\}_{\tau=1}^m$ such that $\|\sum_{\tau=1}^m u_\tau \nabla_x f_\tau(x)\| \le \epsilon$, $u_\tau \ge 0 \; \forall \tau$, and $\sum_{\tau=1}^m u_\tau = 1$.

Definition 2.4. (Improvement function) (Montonen et al., 2018) The improvement function of problem (1) is:

$$H(x, x') = \max_{\tau = 1, \dots, m} \{ f_\tau(x) - f_\tau(x') \}.$$

Note that $x^*$ satisfying $x^* = \arg\min_x H(x, x^*)$ (intuitively, $x^*$ cannot be further improved) is Pareto stationary (Montonen et al., 2018). To find $x^*$, one can perform steepest descent on $H$:

$$x_{s+1} = x_s + \beta d^*, \qquad d^* = \arg\min_d \; H(x_s + d, x_s) + \frac{\lambda'}{2}\|d\|^2, \tag{2}$$

where $x_s$ is the iterate at iteration $s$, $\beta$ is the learning rate satisfying $H(x_s + \beta d^*, x_s) < H(x_s, x_s)$, and $\lambda'$ is a hyper-parameter. It can be shown that when $s \to \infty$, $x_s$ is Pareto stationary (Montonen et al., 2018).

In this paper, we focus on gradient-based MOO methods, including MGDA (Désidéri, 2012; Sener & Koltun, 2018), PCGrad (Yu et al., 2020), and CAGrad (Liu et al., 2021a). They assign weights to each objective's gradient and find a common descent direction that decreases the losses of all objectives. For example, MGDA finds the direction $g^*(x) = \sum_{\tau=1}^m \gamma^*_\tau \nabla_x f_\tau(x)$ in each iteration, where

$$\{\gamma^*_\tau\} = \arg\min_{\{\gamma_\tau\}} \Big\| \sum_{\tau=1}^m \gamma_\tau \nabla_x f_\tau(x) \Big\|^2 \quad \text{s.t.} \quad \sum_{\tau=1}^m \gamma_\tau = 1, \; \gamma_\tau \ge 0, \; \forall \tau. \tag{3}$$

Meta-Learning.
Meta-learning aims to achieve good performance with limited data and computation (Hospedales et al., 2020). Most meta-learning methods are gradient-based (Nichol et al., 2018; Deleu et al., 2022; Rajeswaran et al., 2019; Zhou et al., 2019; Shu et al., 2019) or metric-based (Snell et al., 2017; Lee et al., 2019; Vinyals et al., 2016). Let $\mathcal{T}$ be the set of all $m$ tasks, and $w$ be the shared model parameter. For a task $\tau \in \mathcal{T}$, let $D_\tau$ be its dataset and $L_\tau$ the corresponding loss function. The task-specific parameter $w_\tau$ is obtained from the shared $w$ as $w^*_\tau(w)$. Meta-learning is usually formulated as the following bilevel optimization problem (Ji et al., 2021):

$$\min_w \sum_{\tau=1}^m L_\tau(w_\tau) \quad \text{s.t.} \quad w_\tau = w^*_\tau(w). \tag{4}$$

The inner subproblem learns the task-specific parameter $w_\tau$ for each $\tau$, while the outer subproblem learns $w$ by minimizing the average loss over tasks in $\mathcal{T}$. As $m$ can be very large, usually a mini-batch $B$ of tasks is uniformly sampled from $\mathcal{T}$, and $w$ is then updated as $w_{s+1} = w_s - \beta \frac{1}{|B|} \sum_{\tau \in B} \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s}$ (Finn et al., 2017). By taking each $L_\tau(w^*_\tau(w))$ in (4) as an objective, this can be regarded as minimizing a weighted sum in multi-task learning (Wang et al., 2021). As observed in (Sener & Koltun, 2018; Yu et al., 2020), gradient descent on this weighted sum can suffer from the compromising phenomenon, in which the loss obtained on some task $\tau'$ can be much larger than the losses on the other tasks.
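For intuition on the MGDA subproblem in (3), the case of two objectives admits a closed-form solution: the min-norm point on the segment between the two gradients. The sketch below is only an illustration of that special case and is not the solver proposed in this paper; the function name `mgda_two_objectives` and the list-based gradient representation are assumptions.

```python
def mgda_two_objectives(g1, g2):
    """Min-norm convex combination gamma*g1 + (1-gamma)*g2 (MGDA with m = 2).

    g1, g2: gradients of the two objectives, given as equal-length lists.
    Returns (gamma, direction); a descent step would move along -direction.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        gamma = 0.5  # identical gradients: any weight gives the same direction
    else:
        # Unconstrained minimizer of ||gamma*g1 + (1-gamma)*g2||^2 ...
        gamma = sum(b * (b - a) for a, b in zip(g1, g2)) / denom
        # ... projected onto the simplex constraint 0 <= gamma <= 1.
        gamma = min(1.0, max(0.0, gamma))
    direction = [gamma * a + (1.0 - gamma) * b for a, b in zip(g1, g2)]
    return gamma, direction


gamma, direction = mgda_two_objectives([1.0, 0.0], [0.0, 1.0])
print(gamma, direction)  # 0.5 [0.5, 0.5]
```

For orthogonal gradients of equal norm the two objectives receive equal weight, whereas for aligned gradients the weight collapses onto the shorter one. With $m$ objectives the same problem becomes a quadratic program over all $m$ gradients, which is the cost that motivates the mini-batch approach of Section 3.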

3. SOFT IMPROVEMENT MULTI-OBJECTIVE META-LEARNING (SIMOL)

We take the view of meta-learning as multi-task learning in (Wang et al., 2021) one step further and consider the meta-learning problem as the following multi-objective optimization (MOO) problem:

$$\min_w \; \big( L_1(w^*_1(w)), \dots, L_m(w^*_m(w)) \big), \tag{5}$$

in which each task corresponds to an objective. This considers all the individual tasks instead of simply considering the total loss over all tasks (Liu et al., 2021a). Recently, Ye et al. (2021) also applied multi-objective optimization to meta-learning. However, their focus is not on addressing the compromising phenomenon and they do not treat each task as an objective. Instead, besides minimizing the average task loss in (4), they consider adding some other objectives such as robustness to adversarial attacks. Moreover, MGDA is still used to find the Pareto optimal solution. However, as in other gradient-based MOO methods (Yu et al., 2020; Liu et al., 2021a), MGDA requires collecting gradients from all $m$ objectives in each iteration (as can be seen from its optimization problem (3)). This is computationally feasible only when there are a small number of objectives.² When each task is treated as an objective, the number of objectives can easily be in the millions (as in performing 5-way 1-shot classification on miniImageNet). Another widely adopted class of MOO methods is the Chebyshev methods (Miettinen, 2012; Mao et al., 2020; Momma et al., 2022), which leverage the weighted Chebyshev problem to find the Pareto front. However, these methods also cannot handle a huge number of tasks, as their computational complexity per epoch is $O(m^2)$, where $m$ is the number of tasks. To alleviate this problem, one solution is to use only a mini-batch of objectives in each iteration. For example, when a subset $B$ of objectives is used, MGDA's optimization problem in (3) becomes:

$$\min_{\{\gamma_\tau\}} \Big\| \sum_{\tau \in B} \gamma_\tau \nabla_w L_\tau(w^*_\tau(w)) \Big\|^2 \quad \text{s.t.} \quad \sum_{\tau \in B} \gamma_\tau = 1, \; \gamma_\tau \ge 0, \; \forall \tau.$$
However, the descent direction then only considers objectives in $B$, and the original normalization constraint $\sum_{\tau=1}^m \gamma_\tau = 1$ in (3) is also changed to $\sum_{\tau \in B} \gamma_\tau = 1$. The obtained solution may no longer be Pareto optimal. In the following, we demonstrate this by using a simple toy example with two objectives ($f_1(x)$ and $f_2(x)$, where $x \in \mathbb{R}^2$) from (Liu et al., 2021a; Navon et al., 2022).³ As can be seen from Figure 1, mini-batch MGDA (with a mini-batch of size 1) cannot converge to the Pareto front.

3.1. SOFT IMPROVEMENT FUNCTION

In this section, we propose a scalable MOO solver based on the improvement function (Montonen et al., 2018). The proposed solver is agnostic to the number of tasks, while still theoretically guaranteeing that the solution is Pareto optimal. Using Definition 2.4, the improvement function for problem (5) is:

$$H(w, w') = \max_{\tau = 1, \dots, m} \{ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) \}. \tag{6}$$

Consider the optimization problem

$$\max_\pi \; \mathbb{E}_{\tau \sim \pi} [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ], \tag{7}$$

where $\pi$ is a probability density function on $\tau$. The following Lemma shows that (6) and (7) are equivalent when $\pi$ is the Dirac delta distribution concentrated on the task corresponding to the maximum in (6). All the proofs are in Appendix C.

Lemma 3.1. $H(w, w') = \max_\pi \mathbb{E}_{\tau \sim \pi} [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ]$.

Using (2) and Lemma 3.1, $w$ can be updated as

$$w_{s+1} = w_s + \beta d^*, \tag{8}$$

$$d^* = \arg\min_d \Big( \max_\pi \mathbb{E}_{\tau \sim \pi} [ L_\tau(w^*_\tau(w_s + d)) - L_\tau(w^*_\tau(w_s)) ] \Big) + \frac{\lambda'}{2}\|d\|^2. \tag{9}$$

Taking the first-order approximation

$$L_\tau(w^*_\tau(w_s + d)) \simeq L_\tau(w^*_\tau(w_s)) + \nabla_w L_\tau(w^*_\tau(w_s))^\top d, \tag{10}$$

the minimax theorem (Simons, 1995) can be used to swap the min and max operators in (9), as

$$(\pi^*, d^*) = \arg\max_\pi \min_d \; \mathbb{E}_{\tau \sim \pi} [ L_\tau(w^*_\tau(w_s + d)) - L_\tau(w^*_\tau(w_s)) ] + \frac{\lambda'}{2}\|d\|^2. \tag{11}$$

The following Proposition shows that the inner minimization problem has a closed-form solution.

Proposition 3.2. $\min_d \mathbb{E}_{\tau \sim \pi} [ L_\tau(w^*_\tau(w_s + d)) - L_\tau(w^*_\tau(w_s)) ] + \frac{\lambda'}{2}\|d\|^2 = -\frac{1}{2\lambda'} \| \mathbb{E}_{\tau \sim \pi} \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} \|^2$, and the optimal $d$ is $d^* = -\frac{1}{\lambda'} \mathbb{E}_{\tau \sim \pi} [ \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} ]$.

The expectation in Proposition 3.2 requires sampling tasks from $\pi$. An easier alternative is to sample tasks from the uniform distribution $U(\cdot)$ over the set $\mathcal{T}$ of all tasks, and then weight each sampled task $\tau$ with $r(\tau) \equiv \pi(\tau)/U(\tau)$. Note that

$$\mathbb{E}_{\tau \sim U}\, r(\tau) = \sum_\tau U(\tau)\, \pi(\tau)/U(\tau) = \sum_\tau \pi(\tau) = 1. \tag{12}$$

We further parameterize $r$ as a neural network $r_\theta$ with parameter $\theta$.
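The closed form in Proposition 3.2 and the normalization in (12) can be checked numerically on a small example. The sketch below works with the linearized objective under the first-order approximation, uses plain Python lists for the task gradients, and treats `weights` as the task distribution $\pi$; all function names and numbers are illustrative assumptions.

```python
def closed_form_direction(grads, weights, lam):
    """d* = -(1/lam) * E_pi[grad] and the minimum value -(1/(2*lam)) * ||E_pi[grad]||^2.

    grads: one gradient (list of floats) per task; weights: pi(tau), summing to 1.
    """
    dim = len(grads[0])
    mean_grad = [sum(w * g[i] for w, g in zip(weights, grads)) for i in range(dim)]
    d_star = [-x / lam for x in mean_grad]
    min_value = -sum(x * x for x in mean_grad) / (2.0 * lam)
    return d_star, min_value


def linearized_objective(d, grads, weights, lam):
    """E_pi[grad^T d] + (lam/2) * ||d||^2, the objective after linearization."""
    linear = sum(w * sum(gi * di for gi, di in zip(g, d))
                 for w, g in zip(weights, grads))
    return linear + 0.5 * lam * sum(di * di for di in d)


grads = [[1.0, 2.0], [3.0, -4.0]]   # two tasks, two parameters
pi = [0.25, 0.75]                   # task distribution pi
d_star, min_value = closed_form_direction(grads, pi, lam=2.0)
print(d_star, min_value)            # [-1.25, 1.25] -3.125

# Importance weights r = pi / U for the uniform U(tau) = 1/m average to one.
m = len(pi)
r = [p * m for p in pi]
print(sum(r) / m)                   # 1.0
```

Evaluating `linearized_objective` at `d_star` reproduces `min_value`, and any perturbation of `d_star` gives a larger value, as expected for a strongly convex quadratic.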
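Using Proposition 3.2, we can then rewrite (11) as

$$\theta^* = \arg\max_\theta \; -\frac{1}{2\lambda'} \big\| \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} \big\|^2 - \frac{\lambda''}{2} \big( \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) - 1 \big)^2, \tag{13}$$

where the last term (with another hyper-parameter $\lambda''$) is a penalty for enforcing the constraint in (12). For notational simplicity, we denote the objective in (13) by $K(\theta)$. In principle, $\theta^*$ can be obtained from (13) by gradient ascent. However, problem (13) involves an expectation over tasks. Recall that we have a total of $m$ tasks, and $m$ can be huge. Hence, using all of them to compute this expectation may not be feasible. Instead, let $B$ be a mini-batch of $k$ tasks, and denote the mini-batched version of the objective in (13) as:

$$\tilde{K}_B(\theta) \equiv -\frac{1}{2\tilde{\lambda}'} \Big\| \frac{1}{|B|} \sum_{\tau \in B} r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} \Big\|^2 - \frac{\tilde{\lambda}''}{2} \Big( \frac{1}{|B|} \sum_{\tau \in B} r_\theta(\tau) - 1 \Big)^2,$$

where $\tilde{\lambda}'$, $\tilde{\lambda}''$ are another set of hyper-parameters (which will be set in Proposition 3.3) corresponding to $\lambda'$, $\lambda''$ in (13). Note that $\tilde{K}_{\mathcal{T}}(\theta) = K(\theta)$. Let $\mathcal{B}$ be the set of all size-$k$ mini-batches (with $k > 1$). The following Proposition bounds the difference between $K(\theta)$, the original objective in (13), and the version $\frac{1}{|\mathcal{B}|} \sum_{B \in \mathcal{B}} \tilde{K}_B(\theta)$ based on mini-batches.

Proposition 3.3. Set $\tilde{\lambda}' = \frac{\lambda'}{C_1 |\mathcal{B}|}$ and $\tilde{\lambda}'' = \lambda'' C_1 |\mathcal{B}| k^2$. We have:

$$\Big( K(\theta) - \frac{1}{|\mathcal{B}|} \sum_{B \in \mathcal{B}} \tilde{K}_B(\theta) \Big)^2 \le \frac{C_2 k |\mathcal{B}| G_1}{\lambda'} + C_2 k |\mathcal{B}| \lambda'' G_2,$$

where $C_1 \equiv \frac{k^2}{\binom{m-2}{k-2} m^2}$, $C_2 \equiv \frac{\binom{m-1}{k-1} - \frac{1}{2}\binom{m-2}{k-2}}{\binom{m-1}{k-1}\, m^2 \binom{m-2}{k-2}}$, $G_1 \equiv \frac{1}{k|\mathcal{B}|} \sum_{B \in \mathcal{B}} \sum_{\tau \in B} \| r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} \|^2$, and $G_2 \equiv \frac{1}{k|\mathcal{B}|} \sum_{B \in \mathcal{B}} \sum_{\tau \in B} [r_\theta(\tau) - 1]^2$.

Corollary 3.3.1. When $k \ll m$, $\big( K(\theta) - \frac{1}{|\mathcal{B}|} \sum_{B \in \mathcal{B}} \tilde{K}_B(\theta) \big)^2 \le \frac{G_1}{k \lambda'} + \frac{G_2}{k} \lambda''$.

In the experiments, $m \ge 10^6$, $k \approx 10^2$, and $G_1, G_2 \le 10^4$. When $\lambda' \ge 10^2$ and $\lambda'' \le 10^{-2}$, we have $\frac{1}{|\mathcal{B}|} \sum_{B \in \mathcal{B}} \tilde{K}_B(\theta) \in [-10^3, -1]$ during training, and $\big( K(\theta) - \frac{1}{|\mathcal{B}|} \sum_{B \in \mathcal{B}} \tilde{K}_B(\theta) \big)^2 \le 0.2$ is small. Thus, Proposition 3.3 shows that $K(\theta)$ can be decomposed into mini-batches as $\frac{1}{|\mathcal{B}|} \sum_{B \in \mathcal{B}} \tilde{K}_B(\theta)$.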
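To make the mini-batch objective concrete, the sketch below evaluates it for one mini-batch, given the re-weighting outputs $r_\theta(\tau)$ and the task gradients as plain lists. In practice $r_\theta$ is a neural network and the gradient with respect to $\theta$ would come from automatic differentiation; here the function name `minibatch_objective` and the arguments `lam1`, `lam2` (standing in for the two mini-batch hyper-parameters) are assumptions made for illustration.

```python
def minibatch_objective(r, grads, lam1, lam2):
    """Mini-batch objective for one batch of k tasks.

    r: re-weighting outputs r_theta(tau), one per task in the batch.
    grads: task gradients w.r.t. the shared parameter, one list per task.
    lam1, lam2: hypothetical stand-ins for the two mini-batch hyper-parameters.
    """
    k = len(r)
    dim = len(grads[0])
    # (1/k) * sum_tau r(tau) * grad_tau
    weighted_mean = [sum(r[t] * grads[t][i] for t in range(k)) / k
                     for i in range(dim)]
    grad_term = -sum(x * x for x in weighted_mean) / (2.0 * lam1)
    # Penalty keeping the average weight close to one, cf. (12).
    penalty = -0.5 * lam2 * (sum(r) / k - 1.0) ** 2
    return grad_term + penalty


grads = [[1.0, 0.0], [-1.0, 0.0]]   # two conflicting task gradients
print(minibatch_objective([1.0, 1.0], grads, 1.0, 0.5))  # weights cancel the gradients
print(minibatch_objective([2.0, 0.0], grads, 1.0, 0.5))  # all weight on task 1
```

The objective is never positive, and in this example it attains its maximum when the weights balance the two conflicting gradients while averaging to one, which is the behaviour the ascent step on $\theta$ seeks.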
This allows us to update $\theta$ by SGD over the task mini-batches as:

$$\theta_{s+1} = \theta_s + \beta' \nabla_\theta \tilde{K}_B(\theta_s), \tag{14}$$

where $\beta'$ is the learning rate. Similarly, we approximate $d^*$ by its mini-batch approximation $d^* = -\frac{1}{\lambda' |B|} \big[ \sum_{\tau \in B} \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} \big]$ and update $w$ as $w_{s+1} = w_s + \beta d^*$. The whole procedure, which will be called Soft Improvement Multi-Objective Meta Learning (SIMOL), is shown in Algorithm 1. Step 4 trains the base learner. In the experiments, we use two popular meta-learning algorithms: MAML (Finn et al., 2017) and prototypical network (PN) (Snell et al., 2017). For MAML, the base learner is updated as

$$w^*_\tau(w) = w - \alpha \nabla_w L_\tau(w). \tag{15}$$

For the PN,

$$w^*_\tau(w) = \frac{1}{|Q_\tau| N_C} \sum_{x \in Q_\tau} \frac{\exp(-\|f_w(x) - c_k\|^2)}{\sum_{k'} \exp(-\|f_w(x) - c_{k'}\|^2)},$$

where $Q_\tau$ is the set of query examples for task $\tau$, $N_C$ is the number of classes per epoch, $f_w$ is the model with parameter $w$, and $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_w(x_i)$.

Algorithm 1 (SIMOL), update steps:
    for $\tau = 1, 2, \dots, k$ do
        obtain $r_{\theta_s}(\tau) \nabla_{w_s} L_\tau(w^*_\tau(w_s))$ for task $\tau$;
        $d^* = d^* - r_{\theta_s}(\tau) \nabla_{w_s} L_\tau(w^*_\tau(w_s))$;
    $w_{s+1} = w_s + \beta \frac{1}{k} d^*$;
    $\theta_{s+1} = \theta_s + \beta' \Big[ -\frac{1}{2\tilde{\lambda}'} \nabla_{\theta_s} \big\| \frac{1}{k} \sum_{\tau \in B} r_{\theta_s}(\tau) \nabla_w L_\tau(w^*_\tau(w))|_{w = w_{s+1}} \big\|^2 - \nabla_{\theta_s} \frac{\tilde{\lambda}''}{2} \big( \frac{1}{k} \sum_{\tau \in B} r_{\theta_s}(\tau) - 1 \big)^2 \Big]$;

4. CONVERGENCE ANALYSIS

Lemma 4.1. Define

$$R(\theta, w) \equiv \mathbb{E}_{\tau \sim U} \big[ \perp(r_\theta(\tau))\, L_\tau(w^*_\tau(w)) \big] + \Delta(\theta) - \frac{\lambda''}{2} \big( \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) - 1 \big)^2, \tag{16}$$

where $\Delta(\theta) \equiv -\frac{1}{\lambda'} \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) \perp\!\big[ L_\tau(w^*_\tau(w) + d^*) - L_\tau(w^*_\tau(w)) \big] \cdot \perp\!\big[ \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) [ L_\tau(w^*_\tau(w) + d^*) - L_\tau(w^*_\tau(w)) ] \big]$, and $\perp$ is the stop gradient operator.⁴ Then, $\nabla_w R(\theta, w)|_{w = w_s} = -d^*$, and $\nabla_\theta R(\theta, w_s) = \nabla_\theta K(\theta)$.

This allows interpreting the updates in (14) and (8) as performing Gradient Descent Ascent (GDA) (Singh et al., 2000) on (16). Thus, we can leverage game theoretical tools (Lin et al., 2020) in the analysis.
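The GDA interpretation above can be illustrated on a toy saddle problem. The sketch below is not the function $R$ in (16); it uses the hypothetical objective $\frac{1}{2}w^2 + w\theta - \frac{1}{2}\theta^2$, which is strongly convex in $w$ and strongly concave in $\theta$, and simply alternates a descent step on $w$ with an ascent step on $\theta$, mirroring the roles of $\beta$ and $\beta'$.

```python
def gda(grad_w, grad_theta, w, theta, beta, beta_prime, steps):
    """Simultaneous gradient descent (on w) and ascent (on theta)."""
    for _ in range(steps):
        gw, gt = grad_w(w, theta), grad_theta(w, theta)
        w, theta = w - beta * gw, theta + beta_prime * gt
    return w, theta


# Toy objective (an assumption for illustration):
#   R(theta, w) = 0.5 * w**2 + w * theta - 0.5 * theta**2
def toy_grad_w(w, theta):
    return w + theta


def toy_grad_theta(w, theta):
    return w - theta


w_final, theta_final = gda(toy_grad_w, toy_grad_theta,
                           w=1.0, theta=1.0,
                           beta=0.1, beta_prime=0.1, steps=300)
print(w_final, theta_final)  # both close to the saddle point (0, 0)
```

On this toy problem the iterates spiral into the unique saddle point. The analysis in Theorem 4.2 concerns the much harder stochastic setting in which only mini-batch gradients of $R$ are available.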
Let

$$R(\theta, w; B) \equiv \frac{1}{|B|} \sum_{\tau \in B} \perp\![r_\theta(\tau)]\, L_\tau(w^*_\tau(w)) - \frac{1}{|B| \lambda'} \sum_{\tau \in B} r_\theta(\tau) \perp\!\big[ L_\tau(w^*_\tau(w) + d^*) - L_\tau(w^*_\tau(w)) \big] \cdot \perp\!\Big[ \frac{1}{|B|} \sum_{\tau \in B} r_\theta(\tau) [ L_\tau(w^*_\tau(w) + d^*) - L_\tau(w^*_\tau(w)) ] \Big] - \frac{\lambda''}{2} \Big( \frac{1}{|B|} \sum_{\tau \in B} r_\theta(\tau) - 1 \Big)^2$$

be the mini-batch version of $R$, and $U(\mathcal{B})$ be the uniform distribution over task mini-batches. The following Theorem shows that Algorithm 1 converges to an ϵ-Pareto stationary point of (5).

Theorem 4.2. Assume that (i) $L_\tau(w^*_\tau(\cdot))$ is $L$-smooth and $r_\theta(\cdot)$ is $\mu$-strongly concave; (ii) the domain of $\theta$ is a convex and bounded set with diameter $D > 0$; (iii) $\mathbb{E}_{B \sim U(\mathcal{B})} [\nabla_\theta R(\theta, w; B) - \nabla_\theta R(\theta, w)] = 0$, and $\mathbb{E}_{B \sim U(\mathcal{B})} \|\nabla_\theta R(\theta, w; B) - \nabla_\theta R(\theta, w)\|^2 \le \sigma^2$. Assume the first-order approximation in (10), and take $\beta = \Theta\big(1/D^2\sigma^2(L^2 + \sigma^2)\big)$, $\beta' = \Theta(1/L\sigma^2)$. Algorithm 1 converges to an ϵ-Pareto stationary point of (5) with a rate of $O(1/\epsilon^8)$. If $L_\tau(w^*_\tau(w))$ is also $\mu'$-convex w.r.t. $w$ and $\epsilon = 0$, the 0-Pareto stationary point is also global Pareto optimal.

Assumption (i) is commonly used in the literature (Collins et al., 2020; Zhou et al., 2021; Finn et al., 2019), while (ii) and (iii) are from (Lin et al., 2020). Theorem 4.2 shows that the proposed method can obtain an ϵ-Pareto stationary point (or a global Pareto optimal point for convex objectives) regardless of $m$, the number of tasks/objectives.

Corollary 4.2.1. Consider the MAML base learner update in (15). Assume that $\nabla_w L_\tau(w)$ is Hessian-Lipschitz continuous, bounded, and Lipschitz-continuous, and that $w$ is bounded. Then, Algorithm 1 converges to an ϵ-Pareto stationary point of (5) with a rate of $O(1/\epsilon^8)$.

Corollary 4.2.1 is an application of Theorem 4.2, revealing that SIMOL with MAML can also converge to a Pareto-stationary point. Convergence of the outer loop is slower than the $O(1/\epsilon^2)$ rate of standard MAML (Fallah et al., 2020). However, standard MAML only guarantees convergence to stationary points of $w$ but not to Pareto-stationary points.
Moreover, as will be seen in Section 5.1, the proposed method empirically converges at a comparable or even slightly faster speed than MAML and other meta-learning baselines. Besides, most gradient-based MOO approaches (except CAGrad (Liu et al., 2021a)) do not provide a convergence rate analysis, while CAGrad requires that all task gradients are available in each epoch, which is very expensive (as will be demonstrated in Section 5.2).

5. EXPERIMENTS

In this section, we perform experiments on few-shot regression (Section 5.1), few-shot classification (Section 5.2), and reinforcement learning (Section 5.3). All experiments are run on a GeForce RTX 2080 Ti GPU and Intel(R) Xeon(R) CPU E5-2680. Our implementations are based on the popular open-source meta-learning library Learn2Learn (Arnold et al., 2020).

5.1. FEW-SHOT REGRESSION

Setup. We follow the setup in (Finn et al., 2017; Li et al., 2017). The target function for task $\tau$ is $y = a_\tau \sin(x + b_\tau)$, where $a_\tau$ and $b_\tau$ are sampled uniformly from $[0.1, 5.0]$ and $[0, \pi]$, respectively. We generate 160,000 meta-training tasks and 1,000 meta-testing tasks. A multilayer perceptron with 2 fully-connected (FC) layers (each of size 32) and ReLU activation is used as both the meta-learner and the re-weighting network. The re-weighting network uses all mini-batch instances as input. To ensure that the re-weighting network output is positive, we take the square of its last layer's output as output. The backbone meta-learning algorithm is MAML (Finn et al., 2017). We use Adam (Kingma & Ba, 2014), with an initial learning rate of 0.01, to update the base learners for 5 steps in the inner loop. For the outer loop, we compare (i) minimizing the (single-objective) overall task loss as in MAML, versus performing MOO with (ii) mini-batch MGDA (Désidéri, 2012; Ye et al., 2021), (iii) mini-batch CAGrad, using the same hyper-parameters as in (Liu et al., 2021a), (iv) mini-batch PCGrad, using the same hyper-parameters as in (Yu et al., 2020), (v) the proposed SIMOL, and (vi) updating with a mini-batch version of the improvement function in (6). Hyperparameters for MAML, mini-batch MGDA, and mini-batch CAGrad follow (Finn et al., 2017), while those for the proposed SIMOL are in Appendix D. The mini-batch size is 16. We do not compare with the batch versions of MGDA/CAGrad, as computing all task gradients takes very large memory and time. The initial learning rate for the outer loop is 0.001. The experiment is repeated three times with different random seeds. For performance evaluation, we use the mean-squared-error (MSE) over all meta-testing tasks. We also report the worst-10% MSE, which is the average MSE for the 10% worst-performing meta-testing tasks.

Results.
Table 1 shows the MSE and its 95% confidence interval (computed as in (Finn et al., 2017; Li et al., 2017)). As can be seen, SIMOL consistently outperforms MAML and the mini-batch versions of MGDA, CAGrad, PCGrad and the improvement function in terms of both the overall and worst-10% MSEs. Indeed, the mini-batch versions of MGDA, CAGrad, PCGrad and the improvement function are even worse than the original MAML. Figure 2 shows the convergence of MSE with the number of training epochs. As can be seen, SIMOL converges slightly faster than the other baselines.
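The task distribution and the worst-10% metric used in this experiment can be sketched in a few lines. The snippet below samples sinusoid tasks with the amplitude and phase ranges stated in the setup, and averages the largest 10% of per-task MSEs; the function names, the use of Python's `random.Random`, and the rounding rule for the number of worst tasks are assumptions, not details from the paper.

```python
import math
import random


def sample_sine_task(rng):
    """Sample a target function y = a * sin(x + b), a ~ U[0.1, 5.0], b ~ U[0, pi]."""
    a = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, math.pi)
    return lambda x: a * math.sin(x + b)


def worst_fraction_mean(per_task_mse, fraction=0.1):
    """Average MSE over the worst-performing `fraction` of tasks (at least one task)."""
    n_worst = max(1, round(fraction * len(per_task_mse)))
    worst = sorted(per_task_mse, reverse=True)[:n_worst]
    return sum(worst) / n_worst


rng = random.Random(0)
task = sample_sine_task(rng)
print(abs(task(0.3)) <= 5.0)                                   # True: amplitude is at most 5

print(worst_fraction_mean([float(i) for i in range(1, 21)]))   # 19.5 (mean of 20 and 19)
```

With 1,000 meta-testing tasks, the metric averages the 100 largest per-task MSEs, so it is sensitive to exactly the tasks that an average-loss objective tends to neglect.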

5.2. FEW-SHOT IMAGE CLASSIFICATION

Setup. In this section, we perform 5-way-1-shot and 5-way-5-shot classification on the miniImageNet (Ravi & Larochelle, 2016) and tieredImageNet (Ren et al., 2018) datasets. Following (Finn et al., 2017), we split the miniImageNet dataset into a meta-training set with 64 classes, a meta-validation set with 16 classes, and a meta-testing set with 20 classes. The total number of meta-training tasks is $\binom{64}{5} \approx 7.6 \times 10^6$. Similarly, as in (Zhou et al., 2019), we split the tieredImageNet dataset into a meta-training set with 351 classes, a meta-validation set with 97 classes, and a meta-testing set with 160 classes. The total number of meta-training tasks is $\binom{351}{5} \approx 4.3 \times 10^{10}$. For both datasets, we randomly select 1,000 meta-testing tasks for evaluation. We use two backbone meta-learning algorithms, MAML and prototypical network (PN) (Snell et al., 2017), with hyper-parameters following the original papers. Following (Finn et al., 2017; Li et al., 2017; Zintgraf et al., 2019), we use the CNN4⁵ (LeCun et al., 2015) as the meta-learner, and a CNN4 with a 3-layer FC as the reweighting network. The optimizer is Adam. The learning rates for the meta networks are 0.003 for MAML-based methods and 0.001 for PN-based methods. The learning rate of the reweighting network is 0.08 for SIMOL and 0.0008 for SIMOL+PN. The mini-batch size is 32. More details on the hyperparameters are in Appendix D. The proposed SIMOL is compared with (i) minimizing the overall task loss as in standard MAML, (ii) mini-batch MGDA, and (iii) mini-batch CAGrad. We also compare with MAML variants including (iv) Reptile (Nichol & Schulman, 2018), (v) FOMAML (Finn et al., 2017), (vi) Meta-MinibatchProx (Zhou et al., 2019), (vii) TSA-MAML (Zhou et al., 2021), (viii) IMAML (Rajeswaran et al., 2019), and (ix) MTL (Wang et al., 2021), a multi-task learning based MAML approach, as well as (x) the standard prototypical network (Snell et al., 2017). The evaluation metrics are similar to those in Section 5.1, but with accuracy instead of MSE. Experiments are repeated three times with different random seeds.

Results. Tables 2 and 3 show the meta-testing accuracies and 95% confidence intervals on miniImageNet and tieredImageNet, respectively. For MAML and its variants, SIMOL consistently outperforms all the other baselines in terms of both the overall and worst-10% accuracies.
The same is also observed with the prototypical network. This demonstrates that SIMOL is useful for both gradient-based and metric-based meta-learning approaches. Note that mini-batch MGDA and CAGrad do not perform well in terms of either the overall or the worst-10% accuracies, showing that they cannot be straightforwardly extended to the mini-batch setting. On the other hand, the batch versions of MGDA and CAGrad are computationally impractical. Table 4 compares the per-epoch training time of standard MAML, SIMOL, and batch MGDA and CAGrad. The experiment is performed on 5-way-5-shot classification with the MAML algorithm on miniImageNet. As can be seen, while SIMOL has a per-epoch running time comparable to that of MAML, batch MGDA and CAGrad are much more computationally expensive (around 432,000 times slower).

Table 2: 5-way classification accuracies on miniImageNet (with 95% confidence interval). Results of Reptile, FOMAML, and Meta-MinibatchProx are from (Zhou et al., 2019), IMAML from (Deleu et al., 2022), and TSA-MAML from (Zhou et al., 2021). Results not reported in the original papers are denoted "-". The best results are in bold.

5.3. REINFORCEMENT LEARNING

… (Todorov et al., 2012). In both environments, each task corresponds to a random direction in the XY-plane, and the agent (Walker/HalfCheetah) learns to run in that direction as far as possible. The reward is the average velocity minus control costs. We again use MAML as the meta-learning algorithm, and the base reinforcement learning algorithm is vanilla policy gradient (VPG) (Sutton et al., 1999). Following (Rothfuss et al., 2018), the policy network has two 64 × 64 FC layers with tanh activation, while the critic is a linear state-value function whose parameters are obtained by least squares. The re-weighting network has two 64 × 64 FC layers with ReLU activation. Following (Zintgraf et al., 2019), we use MAML as the baseline.
Figure 3 shows the convergence of the accumulated reward with the number of training iterations. As can be seen, SIMOL consistently outperforms MAML. Moreover, the convergence of MAML is less stable, as also reported by (Rothfuss et al., 2018) . On the other hand, the convergence of SIMOL is smoother and more stable. 

5.4. ABLATION STUDY

In this experiment, we use the setup in Section 5.1, and vary the number of FC layers in SIMOL's meta-learner. The number of training epochs is always fixed to 10,000. Table 5 shows the MSEs on 2-shot regression. As can be seen, the use of 2 FC layers has the best overall MSE and worst-10% MSE. The deeper networks may not be sufficiently trained with the fixed number of training epochs, leading to worse performance.

6. CONCLUSION

In this paper, we propose to avoid the compromising phenomenon in meta-learning by reformulating it as a multi-objective optimization (MOO) problem, in which each task is an objective. However, current gradient-based MOO solvers cannot scale to a large number of objectives. Using the improvement function and mini-batches of tasks, we propose a scalable gradient-based solver with theoretical guarantees on Pareto optimality. Empirical studies on few-shot regression, few-shot classification, and reinforcement learning demonstrate that the proposed method is efficient, and has good generalization in terms of both overall performance and performance on the poorly-performing tasks.

A TOY EXAMPLE

The definitions of $f_1$ and $f_2$ are:

$$f_1(x) = c_1(x)\, l_1(x) + c_2(x)\, g_1(x), \qquad f_2(x) = c_1(x)\, l_2(x) + c_2(x)\, g_2(x),$$

where

$$l_1(x) = \log\big(\max(|0.5(-x_1 - 7) - \tanh(-x_2)|,\; 0.000005)\big) + 6,$$
$$l_2(x) = \log\big(\max(|0.5(-x_1 + 3) - \tanh(-x_2) + 2|,\; 0.000005)\big) + 6,$$
$$g_1(x) = \big((-x_1 + 7)^2 + 0.1\,(-x_2 - 8)^2\big)/10 - 20,$$
$$g_2(x) = \big((-x_1 - 7)^2 + 0.1\,(-x_2 - 8)^2\big)/10 - 20,$$
$$c_1(x) = \max(\tanh(0.5 x_2), 0), \qquad c_2(x) = \max(\tanh(-0.5 x_2), 0).$$

For mini-batch MGDA and SIMOL, task 1 is sampled with probability 2/3 and task 2 with probability 1/3. The learning rates for MGDA and mini-batch MGDA are both 0.01. The learning rate for SIMOL's meta-learner is 0.01, while that for its re-weighting network is 0.1. Note that the meta-learner and the re-weighting network are represented by learnable vectors.
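The two toy objectives can be transcribed directly into code. The sketch below follows the formulas of this appendix as written, with the assumption that in $g_1$ and $g_2$ the division by 10 applies to the whole sum of squares; the function name `toy_objectives` is hypothetical.

```python
import math


def toy_objectives(x1, x2):
    """Return (f1, f2) of the two-objective toy problem at x = (x1, x2)."""
    l1 = math.log(max(abs(0.5 * (-x1 - 7) - math.tanh(-x2)), 0.000005)) + 6
    l2 = math.log(max(abs(0.5 * (-x1 + 3) - math.tanh(-x2) + 2), 0.000005)) + 6
    # Assumed grouping: the whole sum of squares is divided by 10.
    g1 = ((-x1 + 7) ** 2 + 0.1 * (-x2 - 8) ** 2) / 10 - 20
    g2 = ((-x1 - 7) ** 2 + 0.1 * (-x2 - 8) ** 2) / 10 - 20
    # c1 switches on the log terms for x2 > 0, c2 the quadratic terms for x2 < 0.
    c1 = max(math.tanh(0.5 * x2), 0.0)
    c2 = max(math.tanh(-0.5 * x2), 0.0)
    return c1 * l1 + c2 * g1, c1 * l2 + c2 * g2


f1, f2 = toy_objectives(0.0, -2.0)
print(f1, f2)  # equal values: on the line x1 = 0 with x2 < 0 the two quadratics coincide
```

The gating terms make the landscape piecewise: for $x_2 < 0$ the objectives are two shifted quadratics, and for $x_2 > 0$ they are the logarithmic terms.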

B PSEUDO-CODES

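Algorithms 2 and 3 show the pseudo-codes for SIMOL-based MAML and PN, respectively. The key differences between SIMOL and MAML/PN are highlighted in blue.

Algorithm 2 (SIMOL-based MAML), update steps:
    Compute adapted parameters with gradient descent: $w^*_\tau(w) = w_s - \alpha \nabla_w L_\tau(w_s)$;
    $w_{s+1} = w_s - \beta_s \frac{1}{B} \sum_{\tau=1}^{B} r_{\theta_s}(\tau) \nabla_w L_\tau(w^*_\tau(w))$;
    $\theta_{s+1} = \theta_s + \beta'_s \nabla_\theta K(\theta, B)$;

Algorithm 3 (SIMOL-based PN), update steps:
    … $\frac{1}{N_C} \sum_{(x_i, y_i) \in S^c_\tau} f_\theta(x_i)$;
    Calculate $w^*_\tau(w) = \frac{1}{|Q_\tau| N_C} \sum_{x \in Q_\tau} \frac{\exp(-\|f_w(x) - c_k\|^2)}{\sum_{k'} \exp(-\|f_w(x) - c_{k'}\|^2)}$;
    Receive $r_{\theta_s}(\tau)$;
    $w_{s+1} = w_s - \beta_s \frac{1}{B} \sum_{\tau=1}^{B} r_{\theta_s}(\tau) \nabla_w L_\tau(w^*_\tau(w))$;
    $\theta_{s+1} = \theta_s + \beta'_s \nabla_\theta K(\theta, B)$;

C PROOFS

C.1 PROOF OF LEMMA 3.1

Proof. Note that

$$\max_{\tau = 1, \dots, m} \{ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) \} = \sum_\tau p(\tau) [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ],$$

where $p(\tau) = 1$ if $\tau = \arg\max_\tau L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w'))$, and $p(\tau) = 0$ otherwise. First, we have

$$\max_{\pi(\tau)} \mathbb{E}_{\pi(\tau)} [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ] - \sum_\tau p(\tau) [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ] \ge 0.$$

This can be shown by setting $\pi(\tau) = p(\tau)$. Next, we show that

$$\max_{\pi(\tau)} \mathbb{E}_{\pi(\tau)} [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ] - \sum_\tau p(\tau) [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ] \le 0.$$

This is established since

$$\begin{aligned}
& \max_{\pi(\tau)} \mathbb{E}_{\tau \sim \pi(\tau)} [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ] - \sum_\tau p(\tau) [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ] \\
& \le \sum_\tau [\pi(\tau) - p(\tau)] [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ] \\
& = [\pi(\tau') - 1] [ L_{\tau'}(w^*_{\tau'}(w)) - L_{\tau'}(w^*_{\tau'}(w')) ] + \sum_{\tau \in \{1, \dots, m\} \setminus \tau'} \pi(\tau) [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ] \\
& \overset{(a)}{\le} [\pi(\tau') - 1] [ L_{\tau'}(w^*_{\tau'}(w)) - L_{\tau'}(w^*_{\tau'}(w')) ] + \sum_{\tau \in \{1, \dots, m\} \setminus \tau'} \pi(\tau) [ L_{\tau'}(w^*_{\tau'}(w)) - L_{\tau'}(w^*_{\tau'}(w')) ] \\
& \le \pi(\tau') [ L_{\tau'}(w^*_{\tau'}(w)) - L_{\tau'}(w^*_{\tau'}(w')) ] + \sum_{\tau \in \{1, \dots, m\} \setminus \tau'} \pi(\tau) [ L_{\tau'}(w^*_{\tau'}(w)) - L_{\tau'}(w^*_{\tau'}(w')) ] - [ L_{\tau'}(w^*_{\tau'}(w)) - L_{\tau'}(w^*_{\tau'}(w')) ] \\
& = [ L_{\tau'}(w^*_{\tau'}(w)) - L_{\tau'}(w^*_{\tau'}(w')) ] - [ L_{\tau'}(w^*_{\tau'}(w)) - L_{\tau'}(w^*_{\tau'}(w')) ] = 0,
\end{aligned}$$

where $\tau' = \arg\max_{\tau = 1, \dots, m} \{ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) \}$.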
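(a) is due to the fact that $L_{\tau'}(w^*_{\tau'}(w)) - L_{\tau'}(w^*_{\tau'}(w')) \ge L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w'))$ for all $\tau$, based on the property of $\tau'$. Therefore, the difference between $\max_{\pi(\tau)} \mathbb{E}_{\tau \sim \pi(\tau)} [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ]$ and $\max_{\tau = 1, \dots, m} \{ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) \}$ is both non-negative and non-positive. Thus,

$$\max_{\tau = 1, \dots, m} \{ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) \} = \max_{\pi(\tau)} \mathbb{E}_{\tau \sim \pi(\tau)} [ L_\tau(w^*_\tau(w)) - L_\tau(w^*_\tau(w')) ].$$

C.2 PROOF OF PROPOSITION 3.2

Proof. Using the first-order Taylor expansion,

$$\arg\min_d \; \mathbb{E}_{\tau \sim U} \big[ r_\theta(\tau) [ L_\tau(w^*_\tau(w_s + d)) - L_\tau(w^*_\tau(w_s)) ] \big] + \frac{\lambda'}{2}\|d\|^2 = \arg\min_d \; \mathbb{E}_{\tau \sim U} \big[ r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w_s))^\top d \big] + \frac{\lambda'}{2}\|d\|^2. \tag{17}$$

Taking the derivative w.r.t. $d$,

$$\nabla_d \Big( \mathbb{E}_{\tau \sim U} \big[ r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w_s))^\top d \big] + \frac{\lambda'}{2}\|d\|^2 \Big) = \mathbb{E}_{\tau \sim U} [ r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w_s)) ] + \lambda' d.$$

Setting the above to zero, we have $\mathbb{E}_{\tau \sim U} [ r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w_s)) ] + \lambda' d = 0$, and $d^* = -\frac{1}{\lambda'} \mathbb{E}_{\tau \sim U} [ r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w_s)) ]$. Since $r_\theta(\tau) = \pi(\tau)/U(\tau)$, we have $\mathbb{E}_{\tau \sim U} [ r_\theta(\tau) \nabla_w L_\tau ] = \mathbb{E}_{\tau \sim \pi} [ \nabla_w L_\tau ]$. Putting $d^*$ back into the objective in (17), we have:

$$\begin{aligned}
& \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) [ L_\tau(w^*_\tau(w_s + d)) - L_\tau(w^*_\tau(w_s)) ] + \frac{\lambda'}{2}\|d\|^2 \\
& = \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w))|^\top_{w = w_s} d + \frac{\lambda'}{2} \Big\| -\frac{1}{\lambda'} \mathbb{E}_{\tau \sim \pi} [ \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} ] \Big\|^2 \\
& = \big\langle \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s},\; d \big\rangle + \frac{1}{2\lambda'} \big\| \mathbb{E}_{\tau \sim \pi} [ \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} ] \big\|^2 \\
& = -\frac{1}{\lambda'} \big\langle \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s},\; \mathbb{E}_{\tau \sim U}\, r_\theta(\tau) \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} \big\rangle + \frac{1}{2\lambda'} \big\| \mathbb{E}_{\tau \sim \pi} [ \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} ] \big\|^2 \\
& = -\frac{1}{2\lambda'} \big\| \mathbb{E}_{\tau \sim \pi} [ \nabla_w L_\tau(w^*_\tau(w))|_{w = w_s} ] \big\|^2.
\end{aligned}$$

Next, we have the following two Lemmas.

Lemma C.1. When $L_\tau(w^*_\tau(w_s + d)) \approx L_\tau(w^*_\tau(w_s)) + \nabla_w L_\tau(w^*_\tau(w_s))^\top d$, Eq. (9) is convex w.r.t. $d$ and concave w.r.t. $r_\theta(\tau)$.

Proof. Putting $L_\tau(w^*_\tau(w_s + d)) \approx L_\tau(w^*_\tau(w_s)) + \nabla_w L_\tau(w^*_\tau(w_s))^\top d$ into Eq. (9), we have:

$$\begin{aligned}
& \min_d \max_{r(\tau)} \mathbb{E}_{\tau \sim U} \Big[ r(\tau) [ L_\tau(w^*_\tau(w_s + d)) - L_\tau(w^*_\tau(w_s)) ] + \frac{\lambda'}{2}\|d\|^2 - \frac{\lambda''}{2} ( \mathbb{E}_{\tau \sim U} [r(\tau)] - 1 )^2 \Big] \\
& = \min_d \max_{r(\tau)} \mathbb{E}_{\tau \sim U} \Big[ r(\tau) [ L_\tau(w^*_\tau(w_s)) + \nabla_w L_\tau(w^*_\tau(w_s))^\top d - L_\tau(w^*_\tau(w_s)) ] + \frac{\lambda'}{2}\|d\|^2 - \frac{\lambda''}{2} ( \mathbb{E}_{\tau \sim U} [r(\tau)] - 1 )^2 \Big].
\end{aligned}$$

Consider $\nabla_w L_\tau(w^*_\tau(w_s))^\top d + \frac{\lambda'}{2}\|d\|^2$. Since $\nabla_w L_\tau(w^*_\tau(w_s))^\top d$ is linear (and hence both convex and concave) w.r.t. $d$, and $\frac{\lambda'}{2}\|d\|^2$ is convex w.r.t. $d$, their sum is convex w.r.t. $d$. Thus, Eq. (9) is convex w.r.t. $d$. Similarly, since $-\frac{\lambda''}{2} ( \mathbb{E}_{\tau \sim U} [r_\theta(\tau)] - 1 )^2$ is concave w.r.t. $r_\theta(\tau)$, Eq. (9) is also concave w.r.t. $r_\theta(\tau)$.

Lemma C.2. When $L_\tau(w^*_\tau(w_s + d)) \approx L_\tau(w^*_\tau(w_s)) + \nabla_w L_\tau(w^*_\tau(w_s))^\top d$,

$$\begin{aligned}
& \min_d \max_{r_\theta(\tau)} \mathbb{E}_{\tau \sim U} \Big[ r_\theta(\tau) [ L_\tau(w^*_\tau(w_s + d)) - L_\tau(w^*_\tau(w_s)) ] + \frac{\lambda'}{2}\|d\|^2 - \frac{\lambda''}{2} ( \mathbb{E}_{\tau \sim U} [r_\theta(\tau)] - 1 )^2 \Big] \\
& = \max_{r_\theta(\tau)} \min_d \mathbb{E}_{\tau \sim U} \Big[ r_\theta(\tau) [ L_\tau(w^*_\tau(w_s + d)) - L_\tau(w^*_\tau(w_s)) ] + \frac{\lambda'}{2}\|d\|^2 - \frac{\lambda''}{2} ( \mathbb{E}_{\tau \sim U} [r_\theta(\tau)] - 1 )^2 \Big].
\end{aligned}$$

Proof. The above equation is established by the minimax theorem (Simons, 1995) when Eq. (9) is convex w.r.t. $d$ and concave w.r.t. $r_\theta(\tau)$. This holds by Lemma C.1.

Lemma C.3. Suppose every $B \in \mathcal{B}$ is a set consisting of a unique selection of $k$ ($m > k > 1$) tasks out of the $m$ tasks, without replacement. For any continuous function $f(\cdot)$, we have:

$$\Big( \frac{1}{m} \sum_{\tau=1}^m f(\tau) \Big)^2 = C_1 \sum_{B \in \mathcal{B}} \Big( \sum_{\tau \in B} f(\tau) \Big)^2 - C_2 \sum_{B \in \mathcal{B}} \sum_{\tau \in B} [f(\tau)]^2,$$

where $C_1 = \frac{k^2}{\binom{m-2}{k-2} m^2}$ and $C_2 = \frac{\binom{m-1}{k-1} - \frac{1}{2}\binom{m-2}{k-2}}{\binom{m-1}{k-1}\, m^2 \binom{m-2}{k-2}}$.

Proof.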
Note that ( m-2 k-2 )m 2 , and 1 m m τ =1 f (τ ) 2 (a) = τ ′ τ 1(τ, τ ′ )f (τ )f (τ ′ ) (b) = 1 n-2 B-2 B∈B τ ′ ∈B τ ∈B 1(τ, τ ′ )f (τ )f (τ ′ ) - n -1 B -1 - 1 2 n -2 B -2 τ ∈T f (τ ) 2 = 1 n-2 B-2   B∈B τ ′ ∈B τ ∈B 1(τ, τ ′ )f (τ )f (τ ′ ) - n-1 B-1 -1 2 n-2 B-2 n-1 B-1 B∈B τ ∈B f (τ ) 2   = 1 n-2 B-2 m 2   B∈B [ τ ∈B f (τ )] 2 - n-1 B-1 -1 2 n-2 B-2 n-1 B-1 m 2 B∈B τ ∈B f (τ ) 2   = C 1 [ B∈B τ ∈B f (τ ) 2 -C 2 B∈B τ ∈B [f (τ )] 2 . C 2 = m-1 k-1 -1 2 m-2 k-2 / m-1 k-1 m 2 m-2 k-2 . Proof. Observe that ∥g(τ )∥ 2 = i g i (τ ) 2 . The remaining follows from Lemma C.3.  +C 2 k|B|λ ′′ 1 k|B| B∈B τ ∈B [r θ (τ ) -1] 2 2 (20) ≤ C 2 k|B|G 1 λ ′ + C 2 k|B|λ ′′ G 2 . By noting that KT (θ) is exactly K(θ), we obtain the first claim. Regarding the second claim, note that when m is large, C 2 k|B| = k m-1 k-1 -1 2 m-2 k-2 m k m-1 k-1 m 2 m-2 k-2 ≈ k m-1 k-1 m k m-1 k-1 m 2 m-2 k-2 = k m k m 2 m-2 k-2 = k m-1 k-1 m k m 2 m-2 k-2 = k m-2 k-2 m k m-1 k-1 m 2 m-2 k-2 = m -1 (k -1)m ≈ 1 k -1 . The first approximation is due to the fact that m-1 k-1 ≫ 1 = ∇ θ K(θ).
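The large-$m$ behaviour of $C_2 k |\mathcal{B}|$ discussed above can be checked numerically. The following is a minimal sketch, assuming $|\mathcal{B}| = \binom{m}{k}$ and taking $C_2$ exactly as it is written in Lemma C.3; the function names are hypothetical and exact rational arithmetic is used only to avoid rounding.

```python
from fractions import Fraction
from math import comb


def c2_times_k_batches(m: int, k: int) -> Fraction:
    """Exact C_2 * k * |B|, with |B| = C(m, k) and C_2 as written in Lemma C.3."""
    per_task = comb(m - 1, k - 1)  # number of batches containing a fixed task
    per_pair = comb(m - 2, k - 2)  # number of batches containing a fixed pair
    c2 = (per_task - Fraction(per_pair, 2)) / (per_task * m**2 * per_pair)
    return c2 * k * comb(m, k)


def first_approximation(m: int, k: int) -> Fraction:
    """Value after dropping the (1/2) * C(m-2, k-2) term: (m - 1) / ((k - 1) * m)."""
    return Fraction(m - 1, (k - 1) * m)


if __name__ == "__main__":
    k = 5
    for m in (20, 200, 2000):
        exact = c2_times_k_batches(m, k)
        approx = first_approximation(m, k)
        print(m, float(exact), float(approx), 1 / (k - 1))
```

Under these assumptions the exact value differs from $(m-1)/((k-1)m)$ by exactly $1/(2m)$, so both quantities approach $1/(k-1)$ as $m$ grows.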



Without loss of generality, we consider minimization in this paper.
In the meta-learning experiments of (Ye et al., 2021), they only consider two objectives.
Definitions for f1, f2 and the environment setup are in Appendix A.
The stop gradient operator satisfies $\perp(h(x)) = h(x)$, $\nabla_x \perp(h(x)) = 0$, where $h(\cdot)$ is any differentiable function.
$\approx 4.3 \times 10^{10}$.
For both datasets, we randomly select 1,000 meta-testing tasks for evaluation. We use two backbone meta-learning algorithms, MAML and prototypical network (PN) (Snell et al., 2017), with hyper-parameters following the original papers. Following (Finn et al., 2017; Li et al., 2017; Zintgraf et al., 2019), we use the CNN4 (LeCun et al., 2015) as the meta-learner, and a CNN4
The CNN4 consists of four 3 × 3 convolution networks with batch normalization, 2 × 2 max-pooling and a ReLU activation layer.
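The two defining properties of the stop gradient operator stated above (the value passes through unchanged, the derivative is zero) can be illustrated with a tiny forward-mode autodiff. This is only a sketch under stated assumptions: the `Dual` class, `stop_gradient`, `derivative` and the test function `h` are hypothetical helpers written for this illustration, not part of any implementation described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Dual number: a value together with its derivative (tangent)."""
    val: float
    tan: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule for the tangent part.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__


def _lift(x):
    """Promote plain numbers to constants with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x))


def stop_gradient(x: Dual) -> Dual:
    """Keep the value, discard the derivative."""
    return Dual(x.val, 0.0)


def derivative(fn, x: float) -> float:
    """Derivative of fn at x via forward-mode differentiation."""
    return fn(Dual(x, 1.0)).tan


def h(x):
    return x * x * x + 2.0 * x  # h(x) = x^3 + 2x, so h'(x) = 3x^2 + 2


if __name__ == "__main__":
    print(derivative(h, 3.0))                                    # ordinary derivative
    print(derivative(lambda x: stop_gradient(h(x)), 3.0))        # blocked derivative
    print(derivative(lambda x: stop_gradient(h(x)) * x, 3.0))    # h(x) acts as a constant
```

In the last call the stopped factor behaves like a constant coefficient, which is the role the operator plays in products of the form $\perp[f(x)]\, g(x)$.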



, we extend Pareto stationarity to $\epsilon$-Pareto stationarity. Obviously, 0-Pareto stationarity reduces to Pareto stationarity.

Definition 2.3. ($\epsilon$-Pareto stationarity). For a given $\epsilon$, $x$ is $\epsilon$-Pareto-stationary iff there exist $\{u_\tau\}_{\tau=1}^m$ such that $\big\|\sum_{\tau=1}^m u_\tau \nabla_x f_\tau(x)\big\| \leq \epsilon$, $u_\tau \geq 0 \ \forall \tau$, and $\sum_{\tau=1}^m u_\tau = 1$.

Definition 2.4. (Improvement function) (Montonen et al., 2018) The improvement function of problem (1) is: H
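Definition 2.3 can be made concrete in the special case of two objectives, where the best convex combination of the two gradients has a closed form. The sketch below assumes $m = 2$ and plain Python lists for gradients; the function names are hypothetical, and for general $m$ a quadratic program over the simplex would be needed instead.

```python
import math


def min_norm_weights(g1, g2):
    """Weights (u1, u2) on the simplex minimizing ||u1 * g1 + u2 * g2||."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5, 0.5  # identical gradients: every weighting gives the same vector
    # Minimize ||g2 + u * (g1 - g2)||^2 over u, then clip to [0, 1].
    u1 = -sum(b * d for b, d in zip(g2, diff)) / denom
    u1 = min(1.0, max(0.0, u1))
    return u1, 1.0 - u1


def is_eps_pareto_stationary(g1, g2, eps):
    """Check Definition 2.3 for two objectives with gradients g1 and g2."""
    u1, u2 = min_norm_weights(g1, g2)
    combo = [u1 * a + u2 * b for a, b in zip(g1, g2)]
    return math.sqrt(sum(c * c for c in combo)) <= eps


if __name__ == "__main__":
    # Opposing gradients cancel, so the point is 0-Pareto-stationary.
    print(is_eps_pareto_stationary([1.0, 0.0], [-1.0, 0.0], 0.0))
    # Orthogonal unit gradients: the smallest achievable norm is sqrt(0.5).
    print(is_eps_pareto_stationary([1.0, 0.0], [0.0, 1.0], 0.5))
```

Because the objective is a one-dimensional convex quadratic in the weight, clipping the unconstrained minimizer to $[0, 1]$ gives the constrained optimum.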

Figure 1: Convergence on a two-objective toy dataset with mini-batch size 1. The Pareto front is shown in black.

and $S_k$ is the set of examples belonging to class $k$. Pseudocode for SIMOL-based MAML and PN is shown in Algorithms 2 and 3 of Appendix B, respectively.

Algorithm 1: Soft Improvement Multi-Objective Meta Learning (SIMOL)
Input: T, batch size $k$, learning rates $\beta$ and $\beta'$ for $w$ and $\theta$, respectively; $d^* = 0$.
for s = 1, 2, ..., S do
    Reset $d^* = 0$;

Figure 2: Convergence of MSE with the number of training epochs.

Figure 3: Returns for SIMOL-VPG and MAML-VPG. Results are averaged over 3 trials.

SIMOL for MAML.
Input: T and the total number of epochs S. B is the batch size, $\beta$ and $\beta'$ are the learning rates for $w$ and $\theta$, and $\alpha$ is the learning rate for the inner loop.
for epoch s = 1, 2, 3, ..., S do
    Sample tasks 1, 2, ..., B;
    for every task $\tau$ do
        Receive $r_{\theta^s}(\tau)$;

SIMOL for PN.
for epoch s = 1, 2, 3, ..., S do
    Sample tasks 1, 2, ..., B;
    for every task $\tau$ do
        for every class c do
            Select the support $S^c_\tau$ and query set $Q_\tau$;
            Compute prototype c τ = 1

(a) is due to the multinomial theorem (Bolton, 1968) and

$$\mathbb{1}(\tau, \tau') = \begin{cases} 2 & \tau \neq \tau', \\ 1 & \text{otherwise.} \end{cases}$$

(b) is due to the fact that every task $\tau$ occurs exactly $\binom{n-1}{B-1}$ times, and every tuple $(\tau, \tau')$ occurs exactly $\binom{n-2}{B-2}$ times.

Lemma C.4. Let every $B \in \mathcal{B}$ be a set consisting of a unique selection of $k$ ($m > k > 1$) tasks out of the $m$ tasks in total, drawn without replacement. Let $g(\tau) = [g_1(\tau), g_2(\tau), \dots, g_m(\tau)]$, where the $g_i(\cdot)$'s are continuous functions. We have

τ )∇ w L τ (w * τ (w))| w=w s || 2 -λ′′ [r θ (τ ) -1] 2 θ (τ )∇ w L τ (w * τ (w))| w=w s ∥ 2 -C 2 λ ′′ [r θ (τ ) -1] 2 , where d * := 1 |B| B∈B r θ (τ )∇ w L τ (w * τ (w))| w=w s .Recall λ′ and λ′′ are defined in Sec. 3. Then, ||r θ (τ )∇ w L τ (w * τ (w))| w=w s || 2 (19)

We have:∇ θ R(θ, w) = ∇ θ ∆(θ) -∇ θ λ ′′ 2 (E τ ∼U r θ (τ ) -1) 2 = -1 λ ′ E τ ∼U ∇ θ r θ (τ ) ⊥ [(L τ (w * τ (w) + d * ) -L τ (w * τ (w))] ⊥ [E τ ∼U r θ (τ )[L τ (w * τ (w) + d * ) -L τ (w * τ (w))]] = ∇ θ ∥E τ ∼U ∇ θ r θ (τ )∇ w L τ (w * τ (w)| w=w s || 2 -∇ θ λ ′′ 2 (E τ ∼U r θ (τ ) -1) 2
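The direction $d^*$ appearing in these expressions comes from the linearized objective in the proof of Proposition 3.2 (Appendix C.2), whose minimizer is $d^* = -\frac{1}{\lambda'} \mathbb{E}_{\tau \sim U}[r_\theta(\tau) \nabla_w L_\tau]$. The sketch below checks this closed form numerically. It assumes a uniform distribution over a small list of tasks, with per-task gradients and weights given as plain lists; all names and numbers are hypothetical and chosen only for illustration.

```python
def weighted_mean_gradient(weights, grads):
    """E_{tau ~ U}[ r(tau) * grad_tau ] for a uniform U over the listed tasks."""
    m = len(grads)
    dim = len(grads[0])
    return [sum(r * g[i] for r, g in zip(weights, grads)) / m for i in range(dim)]


def closed_form_direction(weights, grads, lam):
    """d* = -(1 / lambda') * E_U[ r * grad ], obtained by setting the gradient to zero."""
    return [-gi / lam for gi in weighted_mean_gradient(weights, grads)]


def linearized_objective(d, weights, grads, lam):
    """E_U[ r * grad^T d ] + (lambda' / 2) * ||d||^2."""
    g = weighted_mean_gradient(weights, grads)
    linear = sum(gi * di for gi, di in zip(g, d))
    return linear + 0.5 * lam * sum(di * di for di in d)


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, -2.0]]   # toy per-task gradients
    weights = [0.5, 1.5]                # toy task weights r(tau)
    lam = 2.0                           # toy value for lambda'
    d_star = closed_form_direction(weights, grads, lam)
    print(d_star, linearized_objective(d_star, weights, grads, lam))
```

Since the linearized objective is a strictly convex quadratic in $d$, any perturbation of `d_star` increases its value, and the value at `d_star` equals $-\frac{1}{2\lambda'}\|\mathbb{E}_U[r\,\nabla_w L]\|^2$.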

MSE (with 95% confidence interval) for few-shot regression. The best results are in bold. The * denotes that the improvement over the second-best is statistically significant (at a significance level of 0.1 using the paired t-test).

5-way classification accuracies on tieredImageNet (with 95% confidence interval). ± 0.96 71.51* ± 0.79 15.10* ± 1.20 49.85* ± 0.79

Per-epoch running time for 5-way 5-shot classification on miniImageNet.

Performance for SIMOL with different numbers of layers in 2-shot regression.


Also notice that: 5) is weakly global Pareto optimal if there does not exist another $x$ such that $f_\tau(x^*) > f_\tau(x)$ for all $\tau \in \{1, \dots, m\}$.

Theorem C.6. (Theorem 5 in (Miettinen & Mäkelä, 1995)) Consider a multi-objective optimization problem $\min_w [f_1(w), f_2(w), \dots, f_m(w)]$ and its corresponding improvement function $H(w, w^*)$. A necessary condition for $w^* \in \mathbb{R}^n$ to be weakly global Pareto optimal is that $w^* = \arg\min_w H(w, w^*)$. Moreover, if $f_i(w)$ is convex $\forall i$, then it is also a sufficient condition.

We observe the following.

1) By replacing

, $\forall \tau$, the above theorem can be directly applied to the MAML setting.

2) $f$ in our setting can be a neural network. Thus, the convexity condition for $f_\tau(w)$ can be hard to satisfy. Here, we give a more relaxed version of the above theorem:

Theorem C.7. $w^* := \arg\min_w H(w, w^*)$ is Pareto stationary.

Proof. Using Theorem 2 in (Miettinen & Mäkelä, 1995), we have, where $\partial$ is overloaded to denote the sub-gradient, and conv is the convex set.

Note that $0 \in \partial_w H(w, w^*)$ due to the fact that an element in the convex set of the union of sub-gradients is still a sub-gradient, and for $w \in \{w \mid 0 \in \partial_w H(w, w^*)\}$, the sub-gradient $\partial_w H(w, w^*)$ is zero.

Then, we always have $\sum_i w_i \partial_w (f_i(w) - f_i(w^*)) = 0$, $\sum_i w_i = 1$, by the definition of a convex set, where $w_i \in [0, 1]$, $\forall i$, is a real number. By simplifying the above term, we have which is exactly the definition of a Pareto stationary point.

We now show convergence of Algorithm 1.

, we have: When equality holds, $w^s$ is a stationary point.

Proof. By Lemma C.2, the solutions of due to the property of the stop gradient operator. Thus, For convexity, by using the property of 0-smoothness, we have ) ] is convex due to the definition of convexity. Therefore, $\perp[f(x)] \perp[g(x)]$ is also 0-smooth and convex w.r.t. $x$.

As a direct application of the above, we obtain that $\perp[f(x)]$ is 0-smooth and convex w.r.t. $x$ by setting $g(x) \equiv 1$.

Using the above, we have ] are also 0-smooth and convex. Therefore ] is $L$-smooth, and taking the average does not affect the results. Thus, $R(\theta, w)$ is $C_{\max}\mu'$-convex and $L$-smooth.

Proof. (of Theorem 4.2) Recall Lemma C.10 on the convexity and smoothness properties of $R$. Combining it with the assumptions in Theorem 4.2 and using Theorem 4.9 in (Lin et al., 2020), we obtain the bound of $O(1/\epsilon^8)$ when $B = 1$. When $B > 1$, note that using Lemma A.2 in (Lin et al., 2020), we have: Therefore, letting $\sigma'^2 = \sigma^2 / B$ and using Theorem 4.9 in (Lin et al., 2020) shows the bound of $O(1/\epsilon^8)$.
The remaining part, that the fixed point $w$ of $\max_\theta R(\theta, w, B)$ is global Pareto optimal (resp. Pareto stationary), can be obtained by using Lemma C.9, which says that the fixed point of Algorithm 1 is global Pareto optimal (resp. Pareto stationary).

Finally, we show that using SIMOL on MAML can guarantee convergence.

Proof. (of Corollary 4.2.1) First, we show that if $\nabla_w L_\tau(w)$ is Hessian-Lipschitz continuous, bounded and Lipschitz-continuous, and $\|w\|$ is bounded, then $w^*_\tau(w)$ is also Hessian-Lipschitz continuous, bounded and Lipschitz-continuous. Note that

It is easy to see that $w^*_\tau(w)$ is also Hessian-Lipschitz continuous, bounded and Lipschitz-continuous. Applying Lemma 3 in (Collins et al., 2020), we obtain that $L_\tau(w^*_\tau(w))$ is $C$-smooth, where $C$ is positive. Then, setting $C = L$ and using Theorem 4.2, we get the desired result.

D HYPER-PARAMETER SELECTION OF SIMOL

For the few-shot regression experiment (Section 5.1), the regularization parameters λ′, λ′′ are selected from {0.001, 0.01, 0.1, 1}, and the learning rate β′ is selected from {0.01, 0.03, 0.1, 0.3, 1}, based on the validation set.

For the few-shot classification experiments (Section 5.2), we use the (λ′, λ′′) combination selected from few-shot regression, while the learning rate β′ is selected from {0.0003, 0.0008, 0.01, 0.03, 0.08, 0.1} for the 1-shot miniImageNet task based on the validation set. This value is then also used in the other few-shot classification experiments.
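The selection procedure for the regression experiment can be sketched as a search over the candidate values listed above. This is a minimal illustration that assumes an exhaustive joint grid search (the text only states that values are chosen on the validation set); `grid_search` and `toy_validation_loss` are hypothetical names, and the toy loss merely stands in for training a model and measuring its validation error.

```python
import math
from itertools import product

# Candidate values taken from the text for the few-shot regression experiment.
LAMBDA_GRID = (0.001, 0.01, 0.1, 1)
BETA_GRID_REGRESSION = (0.01, 0.03, 0.1, 0.3, 1)


def grid_search(validation_loss, lambdas=LAMBDA_GRID, betas=BETA_GRID_REGRESSION):
    """Return (best_loss, best_config) over all (lambda', lambda'', beta') combinations."""
    best = None
    for lam1, lam2, beta in product(lambdas, lambdas, betas):
        loss = validation_loss(lam1, lam2, beta)
        if best is None or loss < best[0]:
            best = (loss, {"lambda_prime": lam1,
                           "lambda_double_prime": lam2,
                           "beta_prime": beta})
    return best


def toy_validation_loss(lam1, lam2, beta):
    """Hypothetical stand-in for a real validation run; smallest at (0.01, 0.1, 0.1)."""
    return (math.log10(lam1) + 2) ** 2 + (math.log10(lam2) + 1) ** 2 + (beta - 0.1) ** 2


if __name__ == "__main__":
    loss, config = grid_search(toy_validation_loss)
    print(config, loss)
```

With these grids the search evaluates 4 × 4 × 5 = 80 configurations; for the classification experiments only the β′ grid would change, since the (λ′, λ′′) pair is reused.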

