M-L2O: TOWARDS GENERALIZABLE LEARNING-TO-OPTIMIZE BY TEST-TIME FAST SELF-ADAPTATION

Abstract

Learning to Optimize (L2O) has drawn increasing attention, as it often remarkably accelerates the optimization of complex tasks by "overfitting" to specific task types, leading to enhanced performance compared to analytical optimizers. Generally, L2O develops a parameterized optimization method (i.e., an "optimizer") by learning from solving sample problems. This data-driven procedure yields an L2O optimizer that can efficiently solve problems similar to those seen in training, that is, drawn from the same "task distribution". However, such learned optimizers often struggle when new test problems deviate substantially from the training task distribution. This paper investigates a potential solution to this open challenge: meta-training an L2O optimizer that can perform fast test-time self-adaptation to an out-of-distribution task, in only a few steps. We theoretically characterize the generalization of L2O, and further show that our proposed framework (termed M-L2O) provably facilitates rapid task adaptation by locating well-adapted initial points for the optimizer weights. Empirical observations on several classic tasks, such as LASSO, Quadratic, and Rosenbrock, demonstrate that M-L2O converges significantly faster than vanilla L2O with only 5 steps of adaptation, echoing our theoretical results. Codes are available in

1. INTRODUCTION

Deep neural networks have shown overwhelming performance on various tasks, and their tremendous success lies partly in the development of analytical gradient-based optimizers. Such optimizers achieve satisfactory convergence on general tasks with manually crafted rules: for example, SGD (Ruder, 2016) updates along the direction of the gradient, and Momentum (Qian, 1999) follows smoothed gradient directions. However, the reliance on such fixed rules can limit the ability of analytical optimizers to leverage task-specific information and hinder their effectiveness. Learning to Optimize (L2O), an alternative paradigm that has emerged recently, aims to learn optimization algorithms (usually parameterized by deep neural networks) in a data-driven way, to achieve faster convergence on a specific optimization task, or optimizee. Various fields have witnessed the superior performance of these learned optimizers over analytical ones (Cao et al., 2019; Lv et al., 2017; Wichrowska et al., 2017; Chen et al., 2021a; Zheng et al., 2022). Classic L2O follows a two-stage pipeline: at the meta-training stage, an L2O optimizer is trained to predict updates for the parameters of optimizees, by learning from their performance on sample tasks; at the meta-testing stage, the L2O optimizer freezes its parameters and is used to solve new optimizees. In general, L2O optimizers can efficiently solve optimizees that are similar to those seen during meta-training, i.e., drawn from the same "task distribution". However, new unseen optimizees may deviate substantially from the training task distribution. Since L2O optimizers predict variable updates based on the dynamics of the optimization tasks, such as gradients, different task distributions can lead to significant dissimilarity in task dynamics. Therefore, L2O optimizers often incur inferior performance when faced with such distinct unseen optimizees.
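The contrast between fixed, hand-crafted rules and a learned update can be made concrete in a few lines. Below is a minimal sketch: the SGD and Momentum rules follow their standard formulas, while `learned_update` is a stand-in for an L2O model, with an illustrative linear parameterization and feature choice that are not from the paper.

```python
import numpy as np

# Hand-crafted rules: the update is a fixed function of the gradient.
def sgd_update(theta, grad, lr=0.1):
    return theta - lr * grad

def momentum_update(theta, grad, state, lr=0.1, beta=0.9):
    state = beta * state + grad          # follow the smoothed gradient direction
    return theta - lr * state, state

# L2O replaces the fixed rule with a learned model m(z; phi); here a tiny
# linear map over gradient features stands in for the usual LSTM optimizer.
def learned_update(theta, grad, phi):
    z = np.array([grad, np.sign(grad)])  # features of the task "dynamics"
    return theta + phi @ z               # phi is trained from sample tasks

theta, state = 2.0, 0.0
grad = 2 * theta                          # gradient of f(theta) = theta^2
assert sgd_update(theta, grad) < theta    # both hand-crafted rules move downhill
theta_m, state = momentum_update(theta, grad, state)
assert theta_m < theta
```

Training ϕ so that `learned_update` drives down the loss after a fixed number of steps is exactly the meta-training problem formalized in Section 3.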
Such challenges have been widely observed and studied in related fields. For example, in meta-learning (Finn et al., 2017; Nichol & Schulman, 2018), the goal is to enable neural networks to adapt quickly to new tasks with limited samples. Among these techniques, Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is one of the most widely adopted algorithms. Specifically, during meta-training, MAML performs inner updates for individual tasks and subsequently back-propagates to aggregate the per-task gradients, which are used to update the meta parameters. This design makes the learned initialization (meta parameters) sensitive to each task and well-adapted after a few fine-tuning steps. Motivated by this, we propose a novel algorithm, named M-L2O, that incorporates this meta-adaptation design into the meta-training stage of L2O. In detail, rather than updating the L2O optimizer directly based on optimizee gradients, M-L2O introduces a nested structure that calculates optimizer updates by evaluating gradients at the meta-updated optimizer weights. By adopting such an approach, M-L2O is able to identify a well-adapted region, where only a few adaptation steps are sufficient for the optimizer to generalize well on unseen tasks. In summary, the contributions of this paper are outlined below:
• To address the unsatisfactory generalization of L2O on out-of-distribution tasks, we propose to incorporate a meta-adaptation design into L2O training. It enables the learned optimizer to land at well-adapted initial points, which can be adapted to new unseen optimizees in only a few steps.
• We theoretically demonstrate that our meta-adaptation design grants the M-L2O optimizer faster adaptation on out-of-distribution tasks, as shown by better generalization errors.
• Our analysis further suggests that training-like adaptation tasks can yield better generalization performance, in contrast to the common practice of using testing-like tasks. These theoretical findings are further substantiated by the experimental results.
• Extensive experiments consistently demonstrate that the proposed M-L2O outperforms various baselines, including vanilla L2O and transfer learning, in terms of testing performance within a small number of steps, showing the ability of M-L2O to adapt promptly in practical applications.

2. RELATED WORK

2.1. LEARNING TO OPTIMIZE

Learning to Optimize (L2O) captures optimization rules in a data-driven way, and the learned optimizers have demonstrated success on various tasks, including but not limited to black-box optimization (Chen et al., 2017), Bayesian optimization (Cao et al., 2019), minimax optimization (Shen et al., 2021), domain adaptation (Chen et al., 2020b; Li et al., 2020), and adversarial training (Jiang et al., 2018; Xiong & Hsieh, 2020). The success of L2O rests on parameterized optimization rules, usually modeled by a long short-term memory (LSTM) network (Andrychowicz et al., 2016) and occasionally by multi-layer perceptrons (Vicol et al., 2021). Although this parameterization is practically successful, it comes with the "curse" of generalization issues.

2.2. FAST ADAPTATION

Fast adaptation is one of the major goals in meta-learning (Finn et al., 2017; Nichol & Schulman, 2018; Lee & Choi, 2018), which often focuses on generalizing to new tasks with limited samples. MAML (Finn et al., 2017), a well-known and effective meta-learning algorithm, utilizes a nested loop for meta-adaptation. Following this trend, numerous meta-learning algorithms compute meta updates more efficiently. For example, FOMAML (Finn et al., 2017) updates networks with first-order information only; Reptile (Nichol & Schulman, 2018) introduces an extra intermediate variable to avoid Hessian computation; HF-MAML (Fallah et al., 2020) approximates the one-step meta update by Hessian-vector products; and Ji et al. (2022) adopt a multi-step approximation in updates. Meanwhile, many researchers have designed algorithms to compute meta updates more wisely. For example, ANIL (Raghu et al., 2020) only updates the head of the network in the inner loop; HSML (Yao et al., 2019) tailors transferable knowledge to different tasks; and MT-net (Lee & Choi, 2018) enables the meta-learner to learn on each layer's activation space. In terms of theory, Fallah et al. (2021) analyzed the generalization of MAML.
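To make the first-order approximation above concrete, here is a one-dimensional sketch (a hypothetical quadratic task of our own choosing, not from the paper) of the difference between the full MAML meta-gradient and its FOMAML approximation, which drops the Hessian factor:

```python
# Task loss g(w) = 0.5 * (w - a)^2, so g'(w) = w - a and g''(w) = 1.
alpha, a, w = 0.1, 3.0, 0.0

def grad(w):
    return w - a

adapted = w - alpha * grad(w)                   # one inner (adaptation) step
maml_grad = (1 - alpha * 1.0) * grad(adapted)   # full meta-gradient: (1 - alpha*g'') * g'(adapted)
fomaml_grad = grad(adapted)                     # FOMAML: gradient at the adapted point only

# The two differ exactly by the Hessian correction factor (1 - alpha*g'').
assert abs(maml_grad - (1 - alpha) * fomaml_grad) < 1e-12
```

For quadratics the correction is a constant scaling, so FOMAML points in the same direction; for general losses the dropped factor varies with w, which is the cost of the cheaper update.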

3. PROBLEM FORMULATION AND ALGORITHM

In this section, we first introduce the formulation of L2O and subsequently propose M-L2O for generalizable self-adaptation.

3.1. L2O PROBLEM DEFINITION

Figure 1: The pipeline of L2O. Most machine learning algorithms adopt an analytical optimizer, e.g., SGD, to compute parameter updates for general loss functions (which we call the optimizee, or task). Instead, L2O estimates such updates with a model (usually a neural network), which we call the optimizer.

Specifically, the L2O optimizer takes optimizee information (such as loss values and gradients) as input and generates updates for the optimizee. In this work, our objective is to learn the initialization of the L2O optimizer on training optimizees and subsequently fine-tune it on adaptation optimizees. Finally, we apply the adapted optimizer to the testing optimizees and evaluate their performance. We define l(θ; ξ) as the loss function, where θ is the optimizee's parameter and ξ = {ξ_j (j = 1, 2, ..., N)} denotes the data sample; the optimizee empirical and population risks are then defined as

    l̂(θ) = (1/N) Σ_{j=1}^{N} l(θ; ξ_j),    l(θ) = E_ξ l(θ; ξ).

In L2O, the optimizee's parameter θ is updated by the optimizer, an update rule parameterized by ϕ, which we formulate as m(z_t(θ_t; ζ_t), ϕ). Specifically, z_t = z_t(θ_t; ζ_t) denotes the optimizer model input; it captures the optimizee information at the t-th iteration and is parameterized by θ_t with the data batch ζ_t. The update rule of θ at the t-th iteration is then

    θ_{t+1}(ϕ) = θ_t(ϕ) + m(z_t(θ_t; ζ_t), ϕ) = θ_t(ϕ) + m(z_t, ϕ).    (1)

The above pipeline is also summarized in Figure 1, where k denotes the update epoch of the optimizer. Note that ζ_t refers to the data batch of size N used at the t-th iteration, while ξ_j refers to the j-th single sample in data batch ξ. For theoretical analysis, we only consider taking the optimizee's gradients as the optimizer input, i.e., z_t(θ_t; ζ_t) = ∇_θ l̂(θ_t(ϕ)).
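The unrolled update rule above can be sketched as follows, with a quadratic optimizee and a single scalar ϕ (a learned step size) standing in for the LSTM optimizer's weights; this toy parameterization is our own simplification, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy optimizee: l(theta) = 0.5 * ||A @ theta - b||^2.
A = rng.normal(size=(5, 5))
b = rng.normal(size=5)

def grad_l(theta):
    return A.T @ (A @ theta - b)

def m(z, phi):
    """Optimizer m(z_t, phi): maps the optimizee gradient z_t to an update."""
    return -phi * z

def rollout(phi, T=20):
    """Unroll theta_{t+1} = theta_t + m(z_t, phi) for T steps and
    return the final optimizee loss, i.e. g_T(phi) = l(theta_T(phi))."""
    theta = np.zeros(5)
    for _ in range(T):
        theta = theta + m(grad_l(theta), phi)
    r = A @ theta - b
    return 0.5 * r @ r

# A better "optimizer weight" phi yields a lower loss after the same T steps;
# meta-training searches for such phi across sampled tasks.
assert rollout(0.02) < rollout(0.001)
```

The quantity `rollout(phi)` is exactly the optimizer risk ĝ_T(ϕ) defined below; L2O trains ϕ by descending it.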
Therefore, the gradient of the optimizee's parameter at the T-th iteration with respect to the optimizer, ∇_ϕ θ_T(ϕ), takes the form

    ∇_ϕ θ_T(ϕ) = Σ_{i=0}^{T-1} [ Π_{j=i+1}^{T-1} (I + ∇_1 m(∇_θ l̂(θ_{T+i-j}), ϕ) ∇²_θ l̂(θ_{T+i-j})) ] ∇_2 m(∇_θ l̂(θ_i), ϕ),    (2)

where we assume that ϕ is independent of the optimizee's initial parameter θ_0 and that all samples are independent. The detailed derivation is given in Lemma 1 in the Appendix. Next, we consider a common initial parameter θ_0 for all optimizees and define the optimizer empirical and population risks w.r.t. ϕ as

    ĝ_t(ϕ) = l̂(θ_t(ϕ)),    g_t(ϕ) = E_ξ l(θ_t(ϕ); ξ) = l(θ_t(ϕ)),    (3)

where θ_t(ϕ) updates as θ_{t+1}(ϕ) = θ_t(ϕ) + m(z_t(θ_t; ζ_t), ϕ). Typically, the L2O optimizer is evaluated and updated after updating the optimizees for T iterations. Therefore, the optimal points of the optimizer risks in Equation (3) are defined as

    ϕ̂_* = arg min_ϕ ĝ_T(ϕ),    ϕ_* = arg min_ϕ g_T(ϕ).    (4)
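The product form of ∇_ϕ θ_T(ϕ) can be checked numerically. The sketch below uses a one-dimensional quadratic optimizee and the scalar rule m(z, ϕ) = -ϕz (our own toy instantiation): the recursive accumulation implements the chain rule ∇_ϕ θ_{t+1} = (I + ∇_1 m · ∇²_θ l) ∇_ϕ θ_t + ∇_2 m, whose unrolled form is Equation (2), and is compared against a finite difference.

```python
# Optimizee l(theta) = 0.5*c*theta^2, so grad = c*theta and Hessian = c;
# optimizer m(z, phi) = -phi*z, so d1_m = -phi and d2_m = -z.
c, phi, theta0, T = 0.7, 0.3, 2.0, 10

def theta_T(p):
    theta = theta0
    for _ in range(T):
        theta = theta + (-p) * (c * theta)       # theta + m(grad, phi)
    return theta

# Recursive accumulation of d theta_t / d phi (the product form, unrolled).
theta, dtheta = theta0, 0.0
for _ in range(T):
    z = c * theta                                 # gradient of the optimizee
    dtheta = (1.0 + (-phi) * c) * dtheta + (-z)   # (I + d1_m * Hessian) * dtheta + d2_m
    theta = theta + (-phi) * z

# Check against a central finite difference in phi.
eps = 1e-6
fd = (theta_T(phi + eps) - theta_T(phi - eps)) / (2 * eps)
assert abs(dtheta - fd) < 1e-5
```

In this linear case the recursion also matches the closed form dθ_T/dϕ = -cTθ_0(1 - ϕc)^{T-1}, the scalar instance of Equation (2).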

3.2. M-L2O ALGORITHM

To improve the generalization ability of L2O, we aim to learn a well-adapted optimizer initialization by introducing a nested meta-update architecture. Instead of directly updating ϕ, we adapt it for one step (namely, an inner update) and define a new empirical risk for the optimizer as

    Ĝ_T(ϕ) = ĝ_T(ϕ - α∇_ϕ ĝ_T(ϕ)),    (5)

where α is the step size for inner updates. Consequently, the optimal point of the corresponding updated optimizer is

    ϕ̂_{M*} = arg min_ϕ Ĝ_T(ϕ).    (6)

Based on this optimizer loss, we introduce M-L2O in Algorithm 1. Such a nested update design has proven effective in the field of meta-learning, particularly in MAML (Finn et al., 2017). Note that Algorithm 1 only returns the well-adapted optimizer initial point, which requires further adaptation in practice. We first denote the optimizees for training, adaptation, and testing by g^1(ϕ), g^2(ϕ), and g^3(ϕ), respectively, to distinguish tasks seen in different stages. Next, we obtain the result of meta-training, denoted by ϕ̂^1_{MK}, via Algorithm 1, and further adapt it based on ĝ^2_T(ϕ). The testing loss of M-L2O can then be expressed as

    g^3_T(ϕ̂^1_{MK} - α∇_ϕ ĝ^2_T(ϕ̂^1_{MK})),    (7)

where g^3_T(ϕ) denotes the meta-testing loss. Note that ĝ refers to empirical risk and g refers to population risk.

Algorithm 1 Our Proposed M-L2O
1: Input: initial optimizer weight ϕ̂^1_{M0}, inner step size α, meta step sizes {β_k}, epoch number K, period S
2: for k = 0, 1, ..., K - 1 do
3:   if mod(k, S) = 0: θ_0(ϕ̂^1_{Mk}) = θ_0 (random initialization); else θ_0(ϕ̂^1_{Mk}) = θ_T(ϕ̂^1_{M(k-1)})
4:   for t = 0, 1, ..., T - 1 do
5:     θ_{t+1}(ϕ̂^1_{Mk}) = θ_t(ϕ̂^1_{Mk}) + m(∇_θ l̂^1(θ_t(ϕ̂^1_{Mk})), ϕ̂^1_{Mk})   ▷ Note: l̂^1(θ_t(ϕ̂^1_{Mk})) = ĝ^1_t(ϕ̂^1_{Mk})
6:   end for
7:   Compute Ĝ^1_T(ϕ̂^1_{Mk}) = ĝ^1_T(ϕ̂^1_{Mk} - α∇_ϕ ĝ^1_T(ϕ̂^1_{Mk}))
8:   Update ϕ̂^1_{M(k+1)} = ϕ̂^1_{Mk} - β_k ∇_ϕ Ĝ^1_T(ϕ̂^1_{Mk})
9: end for
10: Output: ϕ̂^1_{MK}; testing loss: g^3_T(ϕ̂^1_{MK} - α∇_ϕ ĝ^2_T(ϕ̂^1_{MK}))
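A minimal numeric sketch of the nested meta-update in Algorithm 1 (single training task, scalar optimizer m(z, ϕ) = -ϕz, and finite differences standing in for back-propagation through the unrolled loop; all of these are simplifications of ours, not the paper's setup):

```python
# 1-D quadratic optimizee l1(theta) = 0.5*c*theta^2.
c, theta0, T = 0.5, 2.0, 10
alpha, beta, K = 0.05, 0.02, 30

def g_T(phi):
    """Empirical optimizer risk: optimizee loss after T unrolled steps."""
    theta = theta0
    for _ in range(T):
        theta = theta - phi * (c * theta)      # theta + m(grad, phi)
    return 0.5 * c * theta ** 2

def num_grad(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

phi = 0.001                                     # optimizer initialization
for k in range(K):
    # Line 7 of Algorithm 1: risk after one inner (adaptation) step.
    G = lambda p: g_T(p - alpha * num_grad(g_T, p))
    # Line 8 of Algorithm 1: meta-update through that inner step.
    phi = phi - beta * num_grad(G, phi)

# Test time: a single self-adaptation step from the meta-learned initialization.
adapted = phi - alpha * num_grad(g_T, phi)
assert g_T(adapted) < g_T(0.001)                # far better than the naive start
```

Note that the returned ϕ itself need not minimize g_T; it is the point from which one adaptation step lands well, matching the non-trivial optimal point discussed in Assumption 2.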

4. GENERALIZATION THEOREM OF M-L2O

In this section, we introduce several assumptions and characterize M-L2O's generalization ability.

4.1. TECHNICAL ASSUMPTIONS AND DEFINITIONS

To characterize the generalization of meta adaptation, it is necessary to make strong-convexity assumptions, which have been widely observed under overparameterization (Li & Yuan, 2017; Du et al., 2019) and adopted to describe the geometry of functions (Fallah et al., 2021; Finn et al., 2019; Ji et al., 2021).

Assumption 1. We assume that the function g^1_T(ϕ) is µ-strongly convex. This assumption also holds for the stochastic ĝ^1_T(ϕ).

To capture the relationship between the L2O objective ĝ_T(ϕ) and the M-L2O objective Ĝ_T(ϕ), we make the following optimal-point assumption.

Assumption 2. We assume there exists a non-trivial optimizer optimal point ϕ̂^1_{M*}, defined in Equation (6). Non-trivial means that ∇_ϕ ĝ^1_T(ϕ̂^1_{M*}) ≠ 0 and ϕ̂^1_{M*} ≠ ϕ̂^1_*. Then, based on the definition of ϕ̂^1_{M*} and the existence of trivial solutions, for any ϕ̂^1_{M*} we have

    ϕ̂^1_* = ϕ̂^1_{M*} - α∇_ϕ ĝ^1_T(ϕ̂^1_{M*}),

where α is the step size for the inner update of the optimizer. Note that the strong convexity of the landscape in Assumption 1 guarantees the uniqueness of ϕ̂^1_*. The assumption above states that there exist meta optimal points that differ from the original task optimal point. Such an assumption is natural from an experimental view: in MAML experiments where a single training task is considered, it is reasonable to expect the solution to converge towards a well-adapted point rather than the task optimal point. Otherwise, MAML would be equivalent to simple transfer learning, where the learned point may not generalize well to new tasks.

Assumption 3. We assume that l^i(θ) is M-Lipschitz, ∇l^i(θ) is L-Lipschitz, and ∇²l^i(θ) is ρ-Lipschitz for each loss function l^i(θ) (i = 1, 2, 3). This assumption also holds for the stochastic l̂^i(θ), ∇l̂^i(θ), and ∇²_θ l̂^i(θ) (i = 1, 2, 3). We further assume that m(z, ϕ) is M_{m1}-Lipschitz w.r.t. z, M_{m2}-Lipschitz w.r.t. ϕ, and that ∇²_ϕ θ_T(ϕ) is ρ_θ-Lipschitz.
The above Lipschitz assumptions are widely adopted in previous optimization works (Fallah et al., 2021; 2020; Ji et al., 2022; Zhao et al., 2022). To characterize the difference between the tasks used for meta-training and meta-adaptation, we define Δ_{12} and Δ̃_{12} as follows:

Assumption 4. We assume there exist uniform bounds Δ_{12} and Δ̃_{12} that capture, respectively, the gradient and Hessian differences between the meta-training task and the adaptation task:

    Δ_{12} = max_θ ‖∇_θ l̂^1(θ) - ∇_θ l̂^2(θ)‖,    Δ̃_{12} = max_θ ‖∇²_θ l̂^1(θ) - ∇²_θ l̂^2(θ)‖.

A similar assumption has been made in work on MAML generalization (Fallah et al., 2021).

4.2. MAIN THEOREM

In this section, we theoretically analyze the generalization error of M-L2O and compare it with the vanilla L2O approach (Chen et al., 2017). First, we characterize the difference between the optimizee gradients of any two tasks (∇_ϕ θ^1_T(ϕ) and ∇_ϕ θ^2_T(ϕ)):

Proposition 1. Based on Equation (2) and Assumptions 3 and 4, we obtain

    ‖∇_ϕ θ^1_T(ϕ) - ∇_ϕ θ^2_T(ϕ)‖ ≤ Σ_{i=0}^{T-1} ( Q^{T-i-1} Δ_{C_i} + M_{m2} Q^{T-i-2} Σ_{j=i+1}^{T-1} Δ_{D_j} ),

where Δ_{C_i} = O(Q^i Δ_{12}), Δ_{D_j} = O(Q^j Δ_{12} + Δ̃_{12}), and Q = 1 + M_{m1}L. Furthermore, the task difference of the optimizer gradient ∇_ϕ ĝ_T(ϕ) satisfies

    ‖∇_ϕ ĝ^1_T(ϕ) - ∇_ϕ ĝ^2_T(ϕ)‖ = O(T Q^{T-1} Δ̃_{12} + Q^{2T-1} Δ_{12}),

where Δ_{12} and Δ̃_{12} are defined in Assumption 4.

Proposition 1 shows that the difference in the optimizer gradient landscape scales exponentially with the optimizee iteration number T. Specifically, it involves the Q^{2T-1} term with the gradient difference Δ_{12} and the T Q^{T-1} term with the Hessian difference Δ̃_{12}. Clearly, Q^{2T-1} Δ_{12} dominates as T increases, which implies that the gradient gap between optimizees is the key component of the difference in optimizer gradients.

Theorem 1. Suppose that Assumptions 1, 2, 3 and 4 hold. Considering Algorithm 1 and Equation (7), define δ_{13} = ‖ϕ^1_* - ϕ^3_*‖, and set α ≤ min{1/(2L), µ/(8ρ_{g_T} M_{g_T})} and β_k = min(β, 8/(µ(k+1))) for β ≤ 8/µ. Then, with probability at least 1 - δ, we obtain

    E[g^3_T(ϕ̂^1_{MK} - α∇_ϕ ĝ^2_T(ϕ̂^1_{MK})) - g^3_T(ϕ^3_*)]
    ≤ (M_{g_T}(1 + L_{g_T}α))^{O(1)} [ (M_{g_T}/µ²) · (L_{g_T} + ρ_{g_T} α M_{g_T})/(βK) + M_{g_T} β/√K + M_{g_T} · 2√2 M_{g_T}/(µ√(δN)) + M_{g_T} δ_{13} + M_{g_T} α O(T Q^{T-1} Δ̃_{12} + Q^{2T-1} Δ_{12}) ],

where Q = 1 + M_{m1}L, M_{g_T} = O(Q^{T-1}), L_{g_T} = O(T Q^{T-2} + Q^{2T-2}), ρ_{g_T} = O(T Q^{2T-3} + Q^{3T-3}), K is the total epoch number for meta-training, and N is the batch size for optimizer training. To provide further understanding of the generalization error, we first make the following remark:

Remark 1 (The choice of α).
In Theorem 1, we set α ≤ µ/(8ρ_{g_T}M_{g_T}) = O(1/(T Q^{3T-4} + Q^{4T-4})); thus the error term M_{g_T} α O(T Q^{T-1} Δ̃_{12} + Q^{2T-1} Δ_{12}) vanishes with larger T. If we fix the iteration number T, then this error term is determined by the gradient bound Δ_{12} and the Hessian bound Δ̃_{12}. The key components that lead to the dependency on Q are the Lipschitz properties used to characterize the L2O loss landscape, e.g., the Lipschitz constant L_{g_T} = O(T Q^{T-2} + Q^{2T-2}) defined for ∇_ϕ ĝ_T(ϕ). The reason is our nested update procedure θ_{t+1}(ϕ) = θ_t(ϕ) + m(∇_θ l̂(θ_t(ϕ)), ϕ): if we take the gradient of the final loss ĝ_T(ϕ) = l̂(θ_T(ϕ)) over ϕ, we must compute gradients iteratively for all t = 0, 1, ..., T - 1, which leads to the exponential term.

Consequently, the generalization error of M-L2O can be decomposed into three components: (i) the first term, determined by (L_{g_T} + ρ_{g_T}αM_{g_T})/(βK) + M_{g_T}β/√K, is dominated by the training epoch number K; it characterizes how meta-training influences generalization. (ii) The second term, 2√2 M_{g_T}/(µ√(δN)), reflects the empirical error introduced by limited samples and is hence controlled by the sample size N. (iii) The last two error terms capture task differences: δ_{13} measures the gap between the training and testing optimal points, while Δ_{12} and Δ̃_{12}, which dominate the last error term and represent the gradient and Hessian uniform bounds, respectively, reflect the geometric difference between the training and adaptation tasks. For better comparison with L2O, we make the following remark on the generalization of M-L2O and transfer learning.

Remark 2 (Comparison with Transfer Learning). We can rewrite the generalization error of M-L2O in Theorem 1 in the following form:

    g^3_T(ϕ̂^1_{MK} - α∇_ϕ ĝ^2_T(ϕ̂^1_{MK})) - g^3_T(ϕ^3_*)
    ≤ M_{g_T}‖ϕ̂^1_{MK} - ϕ̂^1_{M*}‖ + M_{g_T}‖ϕ̂^1_* - ϕ^1_*‖ + M_{g_T}α‖∇_ϕ ĝ^2_T(ϕ̂^1_{MK}) - ∇_ϕ ĝ^1_T(ϕ̂^1_{M*})‖ + M_{g_T}δ_{13}.
For L2O transfer learning, the generalization error is bounded as

    g^3_T(ϕ̂^1_K - α∇_ϕ ĝ^2_T(ϕ̂^1_K)) - g^3_T(ϕ^3_*)
    ≤ M_{g_T}‖ϕ̂^1_K - ϕ̂^1_{M*}‖ + M_{g_T}‖ϕ̂^1_* - ϕ^1_*‖ + M_{g_T}α‖∇_ϕ ĝ^2_T(ϕ̂^1_K) - ∇_ϕ ĝ^1_T(ϕ̂^1_{M*})‖ + M_{g_T}δ_{13},

where δ_{13} = ‖ϕ^1_* - ϕ^3_*‖ and ϕ̂^1_K represents the point learned by transfer-learning L2O after K epochs. The generalization error gap between M-L2O and transfer learning can be decomposed into two parts: (i) the difference between ‖ϕ̂^1_{MK} - ϕ̂^1_{M*}‖ and ‖ϕ̂^1_K - ϕ̂^1_{M*}‖; (ii) the difference between ‖∇_ϕ ĝ^2_T(ϕ̂^1_{MK}) - ∇_ϕ ĝ^1_T(ϕ̂^1_{M*})‖ and ‖∇_ϕ ĝ^2_T(ϕ̂^1_K) - ∇_ϕ ĝ^1_T(ϕ̂^1_{M*})‖. If we assume ‖∇_ϕ ĝ^2_T(ϕ)‖ ≈ ‖∇_ϕ ĝ^1_T(ϕ)‖, then both differences can be characterized by the gap between ‖ϕ̂^1_{MK} - ϕ̂^1_{M*}‖ and ‖ϕ̂^1_K - ϕ̂^1_{M*}‖. Since ϕ̂^1_{MK} is trained to converge to ϕ̂^1_{M*} as K increases, it is natural that M-L2O (with error ‖ϕ̂^1_{MK} - ϕ̂^1_{M*}‖) enjoys a smaller generalization error than transfer learning (with error ‖ϕ̂^1_K - ϕ̂^1_{M*}‖).

We further distinguish our theory from previous theoretical works. In the L2O area, Heaton et al. (2020) analyzed the convergence of their proposed Safe-L2O but not its generalization, while Chen et al. (2020c) analyzed the generalization of quadratic-based L2O; in contrast, we develop generalization results for a general class of L2O problems. In the meta-learning area, previous works have established the convergence and generalization of MAML (Ji et al., 2022; Fallah et al., 2021); we instead leverage MAML results in the L2O domain to measure the distance between the learned point and the training optimal point, and our L2O theory further characterizes the transferability of the learned point to meta-testing tasks. Overall, our theorem builds on both L2O and meta-learning results. In conclusion, our theoretical novelty lies in three aspects: ① Rigorously characterizing the generalization of a generic class of L2O. ② Incorporating the MAML results in our meta-learning analysis.
③ Theoretically proving that both training-like and testing-like adaptation contribute to better generalization in L2O.

5. EXPERIMENTS

In this section, we provide a comprehensive description of the experimental settings and present the results we obtained. Our findings demonstrate a high degree of consistency between the empirical observations and the theoretical outcomes.

5.1. EXPERIMENTAL CONFIGURATIONS

Backbones and observations. For all experiments, we use a single-layer LSTM network with 20 hidden units as the backbone. Following Lv et al. (2017) and Chen et al. (2020a), we use the parameters' gradients and their corresponding normalized momentum to construct the observation vectors.

Optimizees. We conduct experiments on three distinct optimizees, namely LASSO, Quadratic, and Rosenbrock (Rosenbrock, 1960). The Quadratic problem is min_x (1/2)‖Ax - b‖², and the LASSO problem is min_x (1/2)‖Ax - b‖² + λ‖x‖₁, where A ∈ R^{d×d} and b ∈ R^d; we set λ = 0.005. The precise formulation of the Rosenbrock problem is given in Section A.6. During the meta-training and testing stages, the optimizees ξ_train and ξ_test are drawn from the pre-specified distributions D_train and D_test, respectively. Similarly, the optimizees ξ_adapt used during adaptation are sampled from the distribution D_adapt.

Baselines and training settings. We compare M-L2O against three baselines: (1) Vanilla L2O, where we train a randomly initialized L2O optimizer on ξ_adapt for only 5 steps; (2) Transfer Learning (TL), where we first meta-train a randomly initialized L2O optimizer on ξ_train and then fine-tune it on ξ_adapt for 5 steps; (3) Direct Transfer (DT), where we meta-train a randomly initialized L2O optimizer on ξ_train only. M-L2O adopts a fair experimental setting: we meta-train on ξ_train and adapt on the same ξ_adapt. We evaluate all methods on the same set of testing optimizees (i.e., ξ_test) and report the minimum logarithmic value of the objective function achieved. For all experiments, we set the number of optimizee iterations T to 20, both when meta-training the L2O optimizers and when adapting to optimizees. Notably, in large-scale experiments involving neural networks as tasks, a common choice for T is 5 (Zheng et al., 2022; Shen et al., 2021).
However, in our experiments, we set T to 20 to achieve better performance. The total number of epochs K is set to 5000, and we adopt the curriculum learning technique (Chen et al., 2020a) to dynamically adjust the number of epochs per task, denoted by S. To update the optimizer weights ϕ, we use Adam (Kingma & Ba, 2014) with a fixed learning rate of 1 × 10^{-4}.
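The task distributions used here can be sketched as follows. The mixture of uniforms for training, the normal distribution with standard deviation σ for testing/adaptation, and λ = 0.005 follow the paper; the problem dimension d, the distribution of b, and the helper names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 5, 0.005          # d is illustrative; lam follows the paper

def lasso_loss(x, A, b):
    r = A @ x - b
    return 0.5 * r @ r + lam * np.abs(x).sum()

def sample_train_task():
    """Training optimizees: entries of A from a mixture of uniforms
    {U(0, 0.1), U(0, 0.5), U(0, 1)}."""
    high = rng.choice([0.1, 0.5, 1.0])
    A = rng.uniform(0.0, high, size=(d, d))
    return A, rng.normal(size=d)

def sample_test_task(sigma=100.0):
    """Testing/adaptation optimizees: entries of A from N(0, sigma^2)."""
    A = rng.normal(scale=sigma, size=(d, d))
    return A, rng.normal(size=d)

A_tr, b_tr = sample_train_task()
A_te, b_te = sample_test_task()
# The coefficient scales differ sharply: the test tasks are out-of-distribution.
assert np.abs(A_te).mean() > 10 * np.abs(A_tr).mean()
assert lasso_loss(np.zeros(d), A_tr, b_tr) >= 0.0
```

The large gap in coefficient scale is what makes the gradient dynamics seen at test time so different from those seen in training, which is the regime where adaptation matters.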

5.2. FAST ADAPTATION RESULTS OF M-L2O

Experiments on LASSO optimizees. We begin with experiments on LASSO optimizees. Specifically, for ξ_train, the coefficient matrix A is generated by sampling from a mixture of uniform distributions comprising {U(0, 0.1), U(0, 0.5), U(0, 1)}. In contrast, for ξ_test and ξ_adapt, the coefficient matrices A are sampled from a normal distribution with standard deviation σ. We conduct experiments with σ = 50 and σ = 100 and report the results in Figures 2a and 2b. Our findings demonstrate that: ① Vanilla L2O, which relies on only five steps of adaptation from initialization on ξ_adapt, exhibits the weakest performance, as evidenced by the largest objective values. ② Although Direct Transfer (DT) is capable of learning optimization rules from the training optimizees, the larger variance in coefficients among the testing optimizees renders the learned rules inadequate for generalization. ③ The superiority of Transfer Learning (TL) over DT highlights the value of adaptation when the testing optimizees deviate significantly from those seen in training, as the optimizer is presumably able to acquire new knowledge during adaptation. ④ Finally, M-L2O exhibits consistently and notably faster convergence than the other baselines. Moreover, it demonstrates the best overall performance, reducing the logarithm of the objective value by approximately 0.2 and 1 when σ = 50 and σ = 100, respectively. M-L2O's superior performance can be attributed to its ability to learn well-adapted initial weights for optimizers, which enables rapid self-adaptation and thus better performance than the baseline methods.

Experiments on Quadratic optimizees. We next assess our approach on a different optimizee, the Quadratic problem. The coefficient matrices A of the optimizees are also randomly sampled from a normal distribution.
We conduct two evaluations, with σ values of 50 and 100, respectively, and present the outcomes in Figure 3. Notably, the results show a similar trend to the previous experiments.

More LASSO experiments. We proceed to investigate the impact of varying the standard deviation σ of the distributions used to sample the coefficient matrices A for ξ_adapt and ξ_test. The minimum logarithm of the objective value for each method is reported in Table 1. Our findings reveal that: ① At lower levels of σ, adaptation is not always necessary for normally trained L2O, and may even be unintentionally harmful. Although M-L2O produces satisfactory results, it performs slightly worse than TL, which could be due to the high similarity between the training and testing tasks. Since M-L2O's objective is to identify generally well-adapted initial points, L2O optimizers trained directly on related and similar tasks may generalize effectively; however, after undergoing adaptation on a single task, L2O optimizers may discard knowledge acquired during meta-training that could be useful for novel but similar tasks. ② Nevertheless, as the similarity between training and testing tasks declines, as characterized by an increasing value of σ, M-L2O begins to demonstrate a considerable advantage. For values of σ greater than 50, M-L2O exhibits consistent performance advantages exceeding 0.1 in terms of logarithmic loss. This observation empirically supports that the learned initial weights facilitate rapid adaptation to new "out-of-distribution" tasks that manifest large deviations.

5.3. ADAPTATION WITH SAMPLES FROM DIFFERENT TASK DISTRIBUTION

In Section 5.2, we imposed a constraint that the standard deviation of the distribution used to sample A be identical for the adaptation and testing optimizees. However, this constraint is not mandatory, given that our theory can accommodate adaptation and testing optimizees with different distributions. Consequently, we conduct an experiment on LASSO optimizees with varying standard deviations σ of the distribution from which the matrices A for the adaptation optimizees ξ_adapt are drawn. Specifically, we sample σ with smaller values that more resemble the training tasks, as well as larger values that are more similar to the testing task (σ = 100). In Theorem 1, we characterized the generalization of M-L2O with flexibly distributed adaptation tasks. The theoretical analysis suggests that a similar geometric landscape (smaller Δ_{12}, Δ̃_{12}) between the training and adaptation tasks can reduce the generalization loss defined in Equation (7). This claim is corroborated by the experimental results presented in Figure 4: when σ is similar to that of the training tasks (e.g., 10), implying smaller Δ_{12} and Δ̃_{12}, M-L2O demonstrates superior testing performance. In conclusion, incorporating training-like adaptation tasks can lead to better generalization performance. Meanwhile, it is reasonable to suggest that the task differences between adaptation and testing, denoted by (Δ_{23}, Δ̃_{23}), may also impact M-L2O's generalization ability. Intuitively, if the optimizer is adapted towards the testing optimizees, the adapted optimizer should demonstrate strong generalization on other similar optimizees.
To better understand the relationship between generalization ability and the difference between adaptation and testing tasks, we rewrite the M-L2O generalization error in Theorem 1 in the following form, with ϕ̂^3_* = ϕ̂^3_{M*} - α∇_ϕ ĝ^3_T(ϕ̂^3_{M*}) and δ^M_{13} = ‖ϕ̂^1_{M*} - ϕ̂^3_{M*}‖:

    g^3_T(ϕ̂^1_{MK} - α∇_ϕ ĝ^2_T(ϕ̂^1_{MK})) - g^3_T(ϕ^3_*)
    ≤ M_{g_T}(‖ϕ̂^1_{MK} - ϕ̂^1_{M*}‖ + ‖ϕ̂^3_* - ϕ^3_*‖ + α‖∇_ϕ ĝ^2_T(ϕ̂^1_{MK}) - ∇_ϕ ĝ^3_T(ϕ̂^3_{M*})‖ + δ^M_{13}).    (8)

In Equation (8), the M-L2O generalization error is partly captured by ‖∇_ϕ ĝ^2_T(ϕ̂^1_{MK}) - ∇_ϕ ĝ^3_T(ϕ̂^3_{M*})‖, which is controlled by the difference between the optimizer objectives (i.e., ‖∇_ϕ ĝ^2_T(ϕ) - ∇_ϕ ĝ^3_T(ϕ)‖). From Proposition 1, we know that this term is determined by the difference between the optimizees, denoted by Δ_{23} and Δ̃_{23}. Similar to the results established in Theorem 1, we can deduce that superior testing performance is connected with a smaller difference between the testing and adaptation optimizees. This is demonstrated in Figure 4, where TL generalizes well with larger σ (more testing-like). Moreover, M-L2O also benefits from larger σ values (e.g., σ = 100) in certain scenarios. To summarize, both training-like and testing-like adaptation tasks can lead to improved testing performance. As shown in Figure 4, training-like adaptation results in better generalization in L2O. One possible explanation is that when the testing task deviates significantly from the training tasks, it becomes highly challenging for the optimizer to generalize well within limited adaptation steps; in such scenarios, training-like adaptation provides a more practical solution.

6. CONCLUSION AND DISCUSSION

In this paper, we propose a self-adaptive L2O algorithm (M-L2O) that incorporates meta adaptation. This design enables the optimizer to reach a well-adapted initial point, facilitating its adaptation with only a few updates. Its superior generalization performance on out-of-distribution tasks has been theoretically characterized and empirically validated across various scenarios. Furthermore, comprehensive empirical results demonstrate that training-like adaptation tasks can contribute to better testing generalization, which is consistent with our theoretical analysis. One potential future direction is to develop a convergence analysis for L2O; it would be interesting to consider meta adaptation when analyzing L2O convergence from a theoretical view.

A APPENDIX

A.1 RESTATEMENT OF ASSUMPTION 3

Assumption 5 (Restatement of Assumption 3). We assume Lipschitz properties for all functions l^i(θ) (i = 1, 2, 3) as follows:
a) l^i(θ) is M-Lipschitz, i.e., for any θ_1 and θ_2, ‖l^i(θ_1) - l^i(θ_2)‖ ≤ M‖θ_1 - θ_2‖ (i = 1, 2, 3).
b) ∇l^i(θ) is L-Lipschitz, i.e., for any θ_1 and θ_2, ‖∇l^i(θ_1) - ∇l^i(θ_2)‖ ≤ L‖θ_1 - θ_2‖ (i = 1, 2, 3).
c) ∇²l^i(θ) is ρ-Lipschitz, i.e., for any θ_1 and θ_2, ‖∇²l^i(θ_1) - ∇²l^i(θ_2)‖ ≤ ρ‖θ_1 - θ_2‖ (i = 1, 2, 3).
d) m(z, ϕ) is M_{m1}-Lipschitz w.r.t. z and M_{m2}-Lipschitz w.r.t. ϕ, i.e., ‖m(z_1, ϕ) - m(z_2, ϕ)‖ ≤ M_{m1}‖z_1 - z_2‖ for any z_1 and z_2, and ‖m(z, ϕ_1) - m(z, ϕ_2)‖ ≤ M_{m2}‖ϕ_1 - ϕ_2‖ for any ϕ_1 and ϕ_2.
e) ∇²_ϕ θ_T(ϕ) is ρ_θ-Lipschitz, i.e., for any ϕ_1 and ϕ_2, ‖∇²_ϕ θ_T(ϕ_1) - ∇²_ϕ θ_T(ϕ_2)‖ ≤ ρ_θ‖ϕ_1 - ϕ_2‖.
The above assumptions (a)(b)(c) also hold for the stochastic l̂^i(θ), ∇l̂^i(θ), and ∇²_θ l̂^i(θ) (i = 1, 2, 3).

A.2 PROOF OF SUPPORTING LEMMAS (LEMMA 12 CORRESPONDS TO PROPOSITION 1)

Lemma 1. Based on the update procedure of θ_t(ϕ), we obtain

    ∇_ϕ θ_T(ϕ) = Σ_{i=0}^{T-1} [ Π_{j=i+1}^{T-1} (I + ∇_1 m(∇_θ l(θ_{T+i-j}), ϕ) ∇²_θ l(θ_{T+i-j})) ] ∇_2 m(∇_θ l(θ_i), ϕ).

Proof. The θ_t(ϕ) update process is

    θ_{t+1}(ϕ) = θ_t(ϕ) + m(z_t, ϕ).

If we only consider z_t(θ_t; ζ_t) = ∇_θ l(θ_t(ϕ); ζ_t) = ∇_θ l(θ_t), then we obtain

    ∇_ϕ θ_{t+1}(ϕ) = ∇_ϕ θ_t(ϕ) + ∇_ϕ m(∇_θ l(θ_t), ϕ)
                   = ∇_ϕ θ_t(ϕ) + ∇_1 m(∇_θ l(θ_t), ϕ) ∇²_θ l(θ_t) ∇_ϕ θ_t(ϕ) + ∇_2 m(∇_θ l(θ_t), ϕ)
                   = (I + ∇_1 m(∇_θ l(θ_t), ϕ) ∇²_θ l(θ_t)) ∇_ϕ θ_t(ϕ) + ∇_2 m(∇_θ l(θ_t), ϕ).

Iterating the above equation from t = 0 to T yields

    ∇_ϕ θ_T(ϕ) = Σ_{i=0}^{T-1} [ Π_{j=i+1}^{T-1} (I + ∇_1 m(∇_θ l(θ_{T+i-j}), ϕ) ∇²_θ l(θ_{T+i-j})) ] ∇_2 m(∇_θ l(θ_i), ϕ)
                 + Π_{i=1}^{T} (I + ∇_1 m(∇_θ l(θ_{T-i}), ϕ) ∇²_θ l(θ_{T-i})) ∇_ϕ θ_0.

Since θ_0 is randomly sampled and independent of ϕ, the second term vanishes and we obtain

    ∇_ϕ θ_T(ϕ) = Σ_{i=0}^{T-1} [ Π_{j=i+1}^{T-1} (I + ∇_1 m(∇_θ l(θ_{T+i-j}), ϕ) ∇²_θ l(θ_{T+i-j})) ] ∇_2 m(∇_θ l(θ_i), ϕ). ∎
Lemma 2. If we assume that $\theta_0(\phi_1) = \theta_0(\phi_2)$, then based on Assumption 3 we obtain
$$\|\theta_T(\phi_1) - \theta_T(\phi_2)\| \le \frac{\big( (M_{m1} L + 1)^{T-1} - 1 \big) M_{m2}}{M_{m1} L} \|\phi_1 - \phi_2\| = M_{\theta T} \|\phi_1 - \phi_2\|. \qquad (9)$$
Proof. Based on the iterative procedure of $\theta_T(\phi)$, we obtain
$$\|\theta_T(\phi_1) - \theta_T(\phi_2)\| \overset{(i)}{=} \Big\| \sum_{t=1}^{T-1} \big( m(\nabla_\theta l(\theta_t(\phi_1)), \phi_1) - m(\nabla_\theta l(\theta_t(\phi_2)), \phi_2) \big) \Big\| \le \sum_{t=1}^{T-1} \big\| m(\nabla_\theta l(\theta_t(\phi_1)), \phi_1) - m(\nabla_\theta l(\theta_t(\phi_1)), \phi_2) \big\| + \sum_{t=1}^{T-1} \big\| m(\nabla_\theta l(\theta_t(\phi_1)), \phi_2) - m(\nabla_\theta l(\theta_t(\phi_2)), \phi_2) \big\| \overset{(ii)}{\le} (T-1) M_{m2} \|\phi_1 - \phi_2\| + M_{m1} \sum_{t=1}^{T-1} \|\nabla_\theta l(\theta_t(\phi_1)) - \nabla_\theta l(\theta_t(\phi_2))\| \overset{(iii)}{\le} (T-1) M_{m2} \|\phi_1 - \phi_2\| + M_{m1} L \sum_{t=1}^{T-1} \|\theta_t(\phi_1) - \theta_t(\phi_2)\|,$$
where (i) follows from Equation (1), and (ii) and (iii) follow from Assumption 3. Further iterating from $t = 0$ to $T$, we obtain Equation (9).

Lemma 3. Define $A_i(\phi) = \nabla_2 m(\nabla_\theta l(\theta_i(\phi)), \phi)$. Based on Assumption 3 and Lemma 2, we obtain $\|A_i(\phi_1) - A_i(\phi_2)\| \le M_{Ai} \|\phi_1 - \phi_2\|$, where $M_{Ai} = L_{m2} + L_{m1} L M_{\theta i}$.
Proof. Based on the definition of $A_i(\phi)$, we have
$$\|A_i(\phi_1) - A_i(\phi_2)\| = \|\nabla_2 m(\nabla_\theta l(\theta_i(\phi_1)), \phi_1) - \nabla_2 m(\nabla_\theta l(\theta_i(\phi_2)), \phi_2)\| \overset{(i)}{\le} L_{m2} \|\phi_1 - \phi_2\| + L_{m1} \|\nabla_\theta l(\theta_i(\phi_1)) - \nabla_\theta l(\theta_i(\phi_2))\| \le L_{m2} \|\phi_1 - \phi_2\| + L_{m1} L \|\theta_i(\phi_1) - \theta_i(\phi_2)\| \overset{(ii)}{\le} (L_{m2} + L_{m1} L M_{\theta i}) \|\phi_1 - \phi_2\| = M_{Ai} \|\phi_1 - \phi_2\|,$$
where (i) follows from Assumption 3 after adding and subtracting $\nabla_2 m(\nabla_\theta l(\theta_i(\phi_1)), \phi_2)$, and (ii) follows from Lemma 2.

Lemma 4. Define $B_i(\phi) = \nabla_1 m(\nabla_\theta l(\theta_i(\phi)), \phi)\, \nabla^2_\theta l(\theta_i(\phi))$. Based on Lemma 2 and Assumption 3, we obtain $\|B_i(\phi_1) - B_i(\phi_2)\| \le M_{Bi} \|\phi_1 - \phi_2\|$, where $M_{Bi} = M_{m1} \rho M_{\theta i} + L L_{m2} + L^2 L_{m1} M_{\theta i}$.
Proof.
Based on the definition of $B_i(\phi)$, we have
$$\|B_i(\phi_1) - B_i(\phi_2)\| = \|\nabla_1 m(\nabla_\theta l(\theta_i(\phi_1)), \phi_1)\, \nabla^2_\theta l(\theta_i(\phi_1)) - \nabla_1 m(\nabla_\theta l(\theta_i(\phi_2)), \phi_2)\, \nabla^2_\theta l(\theta_i(\phi_2))\| \overset{(i)}{\le} M_{m1} \|\nabla^2_\theta l(\theta_i(\phi_1)) - \nabla^2_\theta l(\theta_i(\phi_2))\| + L \|\nabla_1 m(\nabla_\theta l(\theta_i(\phi_1)), \phi_1) - \nabla_1 m(\nabla_\theta l(\theta_i(\phi_2)), \phi_2)\| \overset{(ii)}{\le} M_{m1} \rho M_{\theta i} \|\phi_1 - \phi_2\| + L L_{m2} \|\phi_1 - \phi_2\| + L L_{m1} \|\nabla_\theta l(\theta_i(\phi_1)) - \nabla_\theta l(\theta_i(\phi_2))\| \overset{(iii)}{\le} (M_{m1} \rho M_{\theta i} + L L_{m2} + L^2 L_{m1} M_{\theta i}) \|\phi_1 - \phi_2\| = M_{Bi} \|\phi_1 - \phi_2\|,$$
where (i) and (iii) follow from Assumption 3, and (ii) follows from Lemma 2.

Lemma 5. Based on Assumption 3 and Lemmas 1, 3 and 4, we obtain $\|\nabla_\phi \theta_T(\phi_1) - \nabla_\phi \theta_T(\phi_2)\| \le L_{\theta T} \|\phi_1 - \phi_2\|$, where
$$L_{\theta T} = \sum_{i=0}^{T-1} (1 + M_{m1} L)^{T-i-1} M_{Ai} + \sum_{i=0}^{T-1} M_{m2} (1 + M_{m1} L)^{T-i-2} \sum_{j=i+1}^{T-1} M_{B(T+i-j)}.$$
Proof.
Based on the expression for $\nabla_\phi \theta_T(\phi)$ in Lemma 1, we obtain
$$\|\nabla_\phi \theta_T(\phi_1) - \nabla_\phi \theta_T(\phi_2)\| \overset{(i)}{\le} \sum_{i=0}^{T-1} \Big\| \prod_{j=i+1}^{T-1} \big( I + \nabla_1 m(\nabla_\theta l(\theta_{T+i-j}(\phi_1)), \phi_1)\, \nabla^2_\theta l(\theta_{T+i-j}(\phi_1)) \big) \nabla_2 m(\nabla_\theta l(\theta_i(\phi_1)), \phi_1) - \prod_{j=i+1}^{T-1} \big( I + \nabla_1 m(\nabla_\theta l(\theta_{T+i-j}(\phi_2)), \phi_2)\, \nabla^2_\theta l(\theta_{T+i-j}(\phi_2)) \big) \nabla_2 m(\nabla_\theta l(\theta_i(\phi_2)), \phi_2) \Big\| \overset{(ii)}{=} \sum_{i=0}^{T-1} \Big\| \prod_{j=i+1}^{T-1} \big( I + B_{T+i-j}(\phi_1) \big) A_i(\phi_1) - \prod_{j=i+1}^{T-1} \big( I + B_{T+i-j}(\phi_2) \big) A_i(\phi_2) \Big\| \overset{(iii)}{\le} \sum_{i=0}^{T-1} \Big[ (1 + M_{m1} L)^{T-i-1} \|A_i(\phi_1) - A_i(\phi_2)\| + M_{m2} \Big\| \prod_{j=i+1}^{T-1} \big( I + B_{T+i-j}(\phi_1) \big) - \prod_{j=i+1}^{T-1} \big( I + B_{T+i-j}(\phi_2) \big) \Big\| \Big] \le \sum_{i=0}^{T-1} \Big[ (1 + M_{m1} L)^{T-i-1} \|A_i(\phi_1) - A_i(\phi_2)\| + M_{m2} (1 + M_{m1} L)^{T-i-2} \sum_{j=i+1}^{T-1} \|B_{T+i-j}(\phi_1) - B_{T+i-j}(\phi_2)\| \Big] \overset{(iv)}{\le} \Big[ \sum_{i=0}^{T-1} (1 + M_{m1} L)^{T-i-1} M_{Ai} + \sum_{i=0}^{T-1} M_{m2} (1 + M_{m1} L)^{T-i-2} \sum_{j=i+1}^{T-1} M_{B(T+i-j)} \Big] \|\phi_1 - \phi_2\| = L_{\theta T} \|\phi_1 - \phi_2\|,$$
where (i) is based on Lemma 1, (ii) uses the definitions $A_i(\phi) = \nabla_2 m(\nabla_\theta l(\theta_i(\phi)), \phi)$ and $B_{T+i-j}(\phi) = \nabla_1 m(\nabla_\theta l(\theta_{T+i-j}(\phi)), \phi)\, \nabla^2_\theta l(\theta_{T+i-j}(\phi))$, (iii) follows from Assumption 3, and (iv) follows from Lemmas 3 and 4.

Lemma 6. Based on Lemmas 2 and 5 and Assumption 3, we obtain $\|\nabla_\phi \hat{g}_T(\phi_1) - \nabla_\phi \hat{g}_T(\phi_2)\| \le L_{g_T} \|\phi_1 - \phi_2\|$, where $L_{g_T} = M L_{\theta T} + L M^2_{\theta T}$, $L_{\theta T}$ is defined in Lemma 5, and $M_{\theta T}$ is defined in Lemma 2.
Proof.
We assume all functions share the same starting point $\theta_0$. Then we have
$$\|\nabla_\phi \hat{g}_T(\phi_1) - \nabla_\phi \hat{g}_T(\phi_2)\| = \|\nabla_\phi l(\theta_T(\phi_1)) - \nabla_\phi l(\theta_T(\phi_2))\| = \|\nabla_\theta l(\theta_T(\phi_1))\, \nabla_\phi \theta_T(\phi_1) - \nabla_\theta l(\theta_T(\phi_2))\, \nabla_\phi \theta_T(\phi_2)\| \le \|\nabla_\theta l(\theta_T(\phi_1))\| \|\nabla_\phi \theta_T(\phi_1) - \nabla_\phi \theta_T(\phi_2)\| + \|\nabla_\theta l(\theta_T(\phi_1)) - \nabla_\theta l(\theta_T(\phi_2))\| \|\nabla_\phi \theta_T(\phi_2)\| \overset{(i)}{\le} M \|\nabla_\phi \theta_T(\phi_1) - \nabla_\phi \theta_T(\phi_2)\| + L \|\theta_T(\phi_1) - \theta_T(\phi_2)\| \|\nabla_\phi \theta_T(\phi_2)\| \overset{(ii)}{\le} M L_{\theta T} \|\phi_1 - \phi_2\| + L M^2_{\theta T} \|\phi_1 - \phi_2\| = (M L_{\theta T} + L M^2_{\theta T}) \|\phi_1 - \phi_2\| = L_{g_T} \|\phi_1 - \phi_2\|,$$
where (i) follows from Assumption 3, and (ii) follows from Lemmas 2 and 5.

Lemma 7. Based on Lemmas 2 and 5 and Assumption 3, we obtain $\|\nabla^2_\phi \hat{g}_T(\phi_1) - \nabla^2_\phi \hat{g}_T(\phi_2)\| \le \rho_{g_T} \|\phi_1 - \phi_2\|$, where $\rho_{g_T} = 3 L M_{\theta T} L_{\theta T} + M \rho_\theta + M^3_{\theta T} \rho$.
Proof. We first compute the Lipschitz property of $\nabla_\phi \nabla_\theta l(\theta_T(\phi))$ as follows:
$$\|\nabla_\phi \nabla_\theta l(\theta_T(\phi_1)) - \nabla_\phi \nabla_\theta l(\theta_T(\phi_2))\| = \|[\nabla_\phi \theta_T(\phi_1)]^\top \nabla^2_\theta l(\theta_T(\phi_1)) - [\nabla_\phi \theta_T(\phi_2)]^\top \nabla^2_\theta l(\theta_T(\phi_2))\| \le \|[\nabla_\phi \theta_T(\phi_1)]^\top\| \|\nabla^2_\theta l(\theta_T(\phi_1)) - \nabla^2_\theta l(\theta_T(\phi_2))\| + \|[\nabla_\phi \theta_T(\phi_1)]^\top - [\nabla_\phi \theta_T(\phi_2)]^\top\| \|\nabla^2_\theta l(\theta_T(\phi_2))\| \overset{(i)}{\le} M^2_{\theta T} \rho \|\phi_1 - \phi_2\| + L_{\theta T} L \|\phi_1 - \phi_2\| = (M^2_{\theta T} \rho + L_{\theta T} L) \|\phi_1 - \phi_2\|,$$
where (i) follows from Lemmas 2 and 5 and Assumption 3.
Then, based on the definition of $\nabla^2_\phi \hat{g}_T(\phi)$, we have
$$\|\nabla^2_\phi \hat{g}_T(\phi_1) - \nabla^2_\phi \hat{g}_T(\phi_2)\| = \|\nabla^2_\phi l(\theta_T(\phi_1)) - \nabla^2_\phi l(\theta_T(\phi_2))\| \le \|\nabla^2_\phi \theta_T(\phi_1)\, \nabla_\theta l(\theta_T(\phi_1)) - \nabla^2_\phi \theta_T(\phi_2)\, \nabla_\theta l(\theta_T(\phi_2))\| + \|[\nabla_\phi \theta_T(\phi_1)]^\top \nabla_\phi \nabla_\theta l(\theta_T(\phi_1)) - [\nabla_\phi \theta_T(\phi_2)]^\top \nabla_\phi \nabla_\theta l(\theta_T(\phi_2))\| \le \|\nabla^2_\phi \theta_T(\phi_1)\| \|\nabla_\theta l(\theta_T(\phi_1)) - \nabla_\theta l(\theta_T(\phi_2))\| + \|\nabla^2_\phi \theta_T(\phi_1) - \nabla^2_\phi \theta_T(\phi_2)\| \|\nabla_\theta l(\theta_T(\phi_2))\| + \|[\nabla_\phi \theta_T(\phi_1)]^\top\| \|\nabla_\phi \nabla_\theta l(\theta_T(\phi_1)) - \nabla_\phi \nabla_\theta l(\theta_T(\phi_2))\| + \|[\nabla_\phi \theta_T(\phi_1)]^\top - [\nabla_\phi \theta_T(\phi_2)]^\top\| \|\nabla_\phi \nabla_\theta l(\theta_T(\phi_2))\| \overset{(i)}{\le} L_{\theta T} L \|\theta_T(\phi_1) - \theta_T(\phi_2)\| + M \|\nabla^2_\phi \theta_T(\phi_1) - \nabla^2_\phi \theta_T(\phi_2)\| + M_{\theta T} \|\nabla_\phi \nabla_\theta l(\theta_T(\phi_1)) - \nabla_\phi \nabla_\theta l(\theta_T(\phi_2))\| + L_{\theta T} \|\phi_1 - \phi_2\| M_{\theta T} L \le L M_{\theta T} L_{\theta T} \|\phi_1 - \phi_2\| + M \rho_\theta \|\phi_1 - \phi_2\| + (M^2_{\theta T} \rho + L_{\theta T} L) M_{\theta T} \|\phi_1 - \phi_2\| + M_{\theta T} L L_{\theta T} \|\phi_1 - \phi_2\| = (3 L M_{\theta T} L_{\theta T} + M \rho_\theta + M^3_{\theta T} \rho) \|\phi_1 - \phi_2\| = \rho_{g_T} \|\phi_1 - \phi_2\|,$$
where (i) follows from Lemmas 2 and 5, and the second term is bounded using Assumption 3(e).

Lemma 8. If we assume $\theta^1_0(\phi) = \theta^2_0(\phi)$, then based on Assumptions 3 and 4 we obtain $\|\theta^1_T(\phi) - \theta^2_T(\phi)\| \le \sigma_{\theta T}$, where $T$ is the iteration number and $\sigma_{\theta T} = (1 + M_{m1} L)^T \frac{\Delta_{12}}{L} - \frac{\Delta_{12}}{L}$.
Proof. Based on the iterative process of $\theta_t(\phi)$, we obtain
$$\|\theta^1_T(\phi) - \theta^2_T(\phi)\| \overset{(i)}{\le} \|\theta^1_{T-1}(\phi) + m(\nabla_\theta \hat{l}_1(\theta^1_{T-1}), \phi) - \theta^2_{T-1}(\phi) - m(\nabla_\theta \hat{l}_2(\theta^2_{T-1}), \phi)\| \overset{(ii)}{\le} \|\theta^1_{T-1}(\phi) - \theta^2_{T-1}(\phi)\| + M_{m1} \|\nabla_\theta \hat{l}_1(\theta^1_{T-1}) - \nabla_\theta \hat{l}_2(\theta^2_{T-1})\| \le \|\theta^1_{T-1}(\phi) - \theta^2_{T-1}(\phi)\| + M_{m1} \|\nabla_\theta \hat{l}_1(\theta^1_{T-1}) - \nabla_\theta \hat{l}_2(\theta^1_{T-1})\| + M_{m1} \|\nabla_\theta \hat{l}_2(\theta^1_{T-1}) - \nabla_\theta \hat{l}_2(\theta^2_{T-1})\| \overset{(iii)}{\le} (1 + M_{m1} L) \|\theta^1_{T-1}(\phi) - \theta^2_{T-1}(\phi)\| + M_{m1} \Delta_{12},$$
where (i) follows from Equation (1), (ii) follows from Assumption 3, and (iii) follows from Assumption 4. Iterating the above inequality from $t = 0$ to $T - 1$, we obtain
$$\|\theta^1_T(\phi) - \theta^2_T(\phi)\| \le (1 + M_{m1} L)^T \frac{\Delta_{12}}{L} - \frac{\Delta_{12}}{L} = \sigma_{\theta T}.$$
Lemma 9.
Based on Assumptions 3 and 4 and Lemma 8, we have the following inequality: $\|C^1_i - C^2_i\| \le \Delta_{Ci}$, where $C^j_i = \nabla_2 m(\nabla_\theta \hat{l}_j(\theta^j_i), \phi)$ ($i = 0, \dots, T$; $j \in \{1, 2\}$) and $\Delta_{Ci} = L_{m1} (1 + M_{m1} L)^i \Delta_{12}$.
Proof. Based on the definition of $C^j_i$, we obtain
$$\|C^1_i - C^2_i\| = \|\nabla_2 m(\nabla_\theta \hat{l}_1(\theta^1_i), \phi) - \nabla_2 m(\nabla_\theta \hat{l}_2(\theta^2_i), \phi)\| \le L_{m1} \|\nabla_\theta \hat{l}_1(\theta^1_i) - \nabla_\theta \hat{l}_2(\theta^1_i) + \nabla_\theta \hat{l}_2(\theta^1_i) - \nabla_\theta \hat{l}_2(\theta^2_i)\| \overset{(i)}{\le} L_{m1} \Delta_{12} + L_{m1} L \|\theta^1_i - \theta^2_i\| \overset{(ii)}{\le} L_{m1} (1 + M_{m1} L)^i \Delta_{12} = \Delta_{Ci},$$
where (i) follows from Assumptions 3 and 4, and (ii) follows from Lemma 8.

Lemma 10. Define $D^j_i = \nabla_1 m(\nabla_\theta \hat{l}_j(\theta^j_i), \phi)\, \nabla^2_\theta \hat{l}_j(\theta^j_i)$ ($i = 0, \dots, T$; $j \in \{1, 2\}$). Based on Assumptions 3 and 4 and Lemma 8, we have $\|D^1_i - D^2_i\| \le \Delta_{Di}$, where $\Delta_{Di} = M_{m1} (\rho \sigma_{\theta i} + \Delta_{12}) + L_{m1} L (1 + M_{m1} L)^i \Delta_{12}$.
Proof. Splitting $\|D^1_i - D^2_i\|$ as in the proof of Lemma 4, we obtain
$$\|D^1_i - D^2_i\| \le M_{m1} \|\nabla^2_\theta \hat{l}_1(\theta^1_i) - \nabla^2_\theta \hat{l}_2(\theta^2_i)\| + L L_{m1} \|\nabla_\theta \hat{l}_1(\theta^1_i) - \nabla_\theta \hat{l}_1(\theta^2_i) + \nabla_\theta \hat{l}_1(\theta^2_i) - \nabla_\theta \hat{l}_2(\theta^2_i)\| \le M_{m1} (\rho \|\theta^1_i - \theta^2_i\| + \Delta_{12}) + L_{m1} L (L \sigma_{\theta i} + \Delta_{12}) \overset{(i)}{\le} M_{m1} (\rho \sigma_{\theta i} + \Delta_{12}) + L_{m1} L (1 + M_{m1} L)^i \Delta_{12} = \Delta_{Di},$$
where (i) follows from Lemma 8.

Lemma 11. Based on Assumptions 3 and 4 and Lemma 1, we obtain
$$\|\nabla_\phi \theta^1_T(\phi) - \nabla_\phi \theta^2_T(\phi)\| \le \sum_{i=0}^{T-1} \Big[ (1 + M_{m1} L)^{T-i-1} \Delta_{Ci} + M_{m2} (1 + M_{m1} L)^{T-i-2} \sum_{j=i+1}^{T-1} \Delta_{Dj} \Big],$$
where $\Delta_{Ci}$ and $\Delta_{Dj}$ are defined in Lemmas 9 and 10.
Proof. Based on Lemma 1, we obtain
$$\|\nabla_\phi \theta^1_T(\phi) - \nabla_\phi \theta^2_T(\phi)\| \overset{(i)}{\le} \sum_{i=0}^{T-1} \Big\| \prod_{j=i+1}^{T-1} (I + D^1_{T+i-j}) C^1_i - \prod_{j=i+1}^{T-1} (I + D^2_{T+i-j}) C^2_i \Big\| \le \sum_{i=0}^{T-1} \Big[ \Big\| \prod_{j=i+1}^{T-1} (I + D^1_{T+i-j}) \Big\| \|C^1_i - C^2_i\| + \Big\| \prod_{j=i+1}^{T-1} (I + D^1_{T+i-j}) - \prod_{j=i+1}^{T-1} (I + D^2_{T+i-j}) \Big\| \|C^2_i\| \Big] \overset{(ii)}{\le} \sum_{i=0}^{T-1} \Big[ (1 + M_{m1} L)^{T-i-1} \|C^1_i - C^2_i\| + M_{m2} (1 + M_{m1} L)^{T-i-2} \sum_{j=i+1}^{T-1} \|D^1_{T+i-j} - D^2_{T+i-j}\| \Big] \overset{(iii)}{\le} \sum_{i=0}^{T-1} \Big[ (1 + M_{m1} L)^{T-i-1} \Delta_{Ci} + M_{m2} (1 + M_{m1} L)^{T-i-2} \sum_{j=i+1}^{T-1} \Delta_{Dj} \Big],$$
where (i) follows from the definitions of $C^j_i$ and $D^j_i$.
Proof of Lemma 12. Recall the definitions $D^j_i = \nabla_1 m(\nabla_\theta \hat{l}_j(\theta^j_i), \phi)\, \nabla^2_\theta \hat{l}_j(\theta^j_i)$ and $C^j_i = \nabla_2 m(\nabla_\theta \hat{l}_j(\theta^j_i), \phi)$. We first consider $\Delta_{Ci}$ and $\Delta_{Di}$:
$$\Delta_{Ci} = L_{m1} (1 + M_{m1} L)^i \Delta_{12} = O(Q^i \Delta_{12}), \qquad (12)$$
$$\Delta_{Di} = O\big( M_{m1} (\rho \sigma_{\theta i} + \Delta_{12}) + L_{m1} L (1 + M_{m1} L)^i \Delta_{12} \big) \overset{(i)}{=} O(Q^i \Delta_{12} + \Delta_{12}), \qquad (13)$$
where (i) follows because $\sigma_{\theta i} = (1 + M_{m1} L)^i \frac{\Delta_{12}}{L} - \frac{\Delta_{12}}{L} = O(Q^i \Delta_{12})$. Furthermore, we consider the uniform bound for $\|\nabla_\phi \theta^1_T(\phi) - \nabla_\phi \theta^2_T(\phi)\|$:
$$\|\nabla_\phi \theta^1_T(\phi) - \nabla_\phi \theta^2_T(\phi)\| \overset{(i)}{=} O\Big( \sum_{i=0}^{T-1} \Big[ Q^{T-i-1} \Delta_{Ci} + Q^{T-i-2} \sum_{j=i+1}^{T-1} \Delta_{D(T+i-j)} \Big] \Big) \overset{(ii)}{=} O\Big( \sum_{i=0}^{T-1} \Big[ Q^{T-i-1} Q^i \Delta_{12} + Q^{T-i-2} \sum_{j=i+1}^{T-1} \big( Q^{T+i-j} \Delta_{12} + \Delta_{12} \big) \Big] \Big) = O\Big( \sum_{i=0}^{T-1} \big[ Q^{T-1} \Delta_{12} + (T-i-1) Q^{T-i-2} \Delta_{12} + (Q^{2T-i-2} - Q^{T-1}) \Delta_{12} \big] \Big) = O\Big( \sum_{i=0}^{T-1} \big( (T-i-1) Q^{T-i-2} \Delta_{12} + Q^{2T-i-2} \Delta_{12} \big) \Big) \overset{(iii)}{=} O\Big( \sum_{j=0}^{T-1} \big( j Q^{j-1} \Delta_{12} + Q^{T+j-1} \Delta_{12} \big) \Big) = O\big( T Q^{T-1} \Delta_{12} + Q^{2T-1} \Delta_{12} \big),$$
where (i) follows from Lemma 11, (ii) follows from Equations (12) and (13), and (iii) follows by substituting $j = T - i - 1$. Based on the expression for $\nabla_\phi \hat{g}_T(\phi)$ in Lemma 6, we have
$$\|\nabla_\phi \hat{g}^1_T(\phi) - \nabla_\phi \hat{g}^2_T(\phi)\| \le M \|\nabla_\phi \theta^1_T(\phi) - \nabla_\phi \theta^2_T(\phi)\| + M_{\theta T} Q^T \Delta_{12} \overset{(i)}{=} O(T Q^{T-1} \Delta_{12} + Q^{2T-1} \Delta_{12}),$$
where (i) follows because $M_{\theta T}$ defined in Lemma 2 satisfies $M_{\theta T} = O(Q^{T-1})$.

Lemma 13. Based on Assumption 3 and Lemma 2, we obtain $\|g_T(\phi_1) - g_T(\phi_2)\| \le M_{g_T} \|\phi_1 - \phi_2\|$, where $M_{g_T} = M M_{\theta T}$ and $M_{\theta T}$ is defined in Lemma 2.
Proof.
Based on the definition of $g_T(\phi)$, we have
$$\|g_T(\phi_1) - g_T(\phi_2)\| = \|l(\theta_T(\phi_1)) - l(\theta_T(\phi_2))\| \overset{(i)}{\le} M \|\theta_T(\phi_1) - \theta_T(\phi_2)\| \le M M_{\theta T} \|\phi_1 - \phi_2\| = M_{g_T} \|\phi_1 - \phi_2\|,$$
where (i) follows from Assumption 2 and the last inequality follows from Lemma 2.

In terms of the M-L2O generalization error, we have
$$g^3_T\big( \tilde{\phi}^1_{MK} - \alpha \nabla_\phi \hat{g}^2_T(\tilde{\phi}^1_{MK}) \big) - g^3_T(\phi^3_*) \overset{(ii)}{\le} M_{g_T} \big\| \tilde{\phi}^1_{MK} - \tilde{\phi}^1_{M*} + \hat{\phi}^1_* - \phi^1_* + \phi^1_* - \phi^3_* + \alpha \nabla_\phi \hat{g}^1_T(\tilde{\phi}^1_{M*}) - \alpha \nabla_\phi \hat{g}^2_T(\tilde{\phi}^1_{MK}) \big\| \qquad (14)$$
$$\le M_{g_T} \|\tilde{\phi}^1_{MK} - \tilde{\phi}^1_{M*}\| + M_{g_T} \|\hat{\phi}^1_* - \phi^1_*\| + M_{g_T} \|\phi^1_* - \phi^3_*\| + M_{g_T} \alpha \|\nabla_\phi \hat{g}^1_T(\tilde{\phi}^1_{M*}) - \nabla_\phi \hat{g}^1_T(\tilde{\phi}^1_{MK})\| + M_{g_T} \alpha \|\nabla_\phi \hat{g}^1_T(\tilde{\phi}^1_{MK}) - \nabla_\phi \hat{g}^2_T(\tilde{\phi}^1_{MK})\| \le (M_{g_T} + M_{g_T} L_{g_T} \alpha) \|\tilde{\phi}^1_{MK} - \tilde{\phi}^1_{M*}\| + M_{g_T} \|\hat{\phi}^1_* - \phi^1_*\| + M_{g_T} \|\phi^1_* - \phi^3_*\| + M_{g_T} \alpha \|\nabla_\phi \hat{g}^1_T(\tilde{\phi}^1_{MK}) - \nabla_\phi \hat{g}^2_T(\tilde{\phi}^1_{MK})\|,$$
where (ii) follows from Lemma 13. Furthermore, considering Algorithm 1, if we set $\alpha \le \min\{\frac{1}{2L}, \frac{\mu}{8 \rho_{g_T} M_{g_T}}\}$ and $\beta_k = \min(\beta, \frac{8}{\mu(k+1)})$ for $\beta \le \frac{8}{\mu}$, then based on Lemmas 12, 14 and 15, with probability at least $1 - \delta$ we obtain
$$\mathbb{E}\big[ g^3_T\big( \tilde{\phi}^1_{MK} - \alpha \nabla_\phi \hat{g}^2_T(\tilde{\phi}^1_{MK}) \big) - g^3_T(\phi^3_*) \big] \le (M_{g_T} + M_{g_T} L_{g_T} \alpha) \Big( O(1) \frac{M_{g_T}}{\mu^2} \cdot \frac{L_{g_T} + \rho_{g_T} \alpha M_{g_T}}{\beta K} + \frac{M_{g_T} \beta}{\sqrt{K}} \Big) + M_{g_T} \frac{2\sqrt{2} M_{g_T}}{\mu \sqrt{\delta N}} + M_{g_T} \delta_{13} + M_{g_T} \alpha\, O(T Q^{T-1} \Delta_{12} + Q^{2T-1} \Delta_{12}),$$
where $\delta_{13} = \|\phi^1_* - \phi^3_*\|$, $Q = 1 + M_{m1} L$, $K$ is the number of update steps, and $N$ is the sample size for training.

Then, for the Lipschitz constant $M_{g_T}$ defined in Lemma 13, $M_{g_T} = M M_{\theta T} = O(Q^{T-1})$, since $M_{\theta T}$ defined in Lemma 2 satisfies $M_{\theta T} = O(Q^{T-1})$. For the Lipschitz constant $L_{g_T}$ defined in Lemma 6, we first compute the order of $L_{\theta T}$ defined in Lemma 5:
$$L_{\theta T} = O\Big( \sum_{i=0}^{T-1} Q^{T-i-1} M_{Ai} + \sum_{i=0}^{T-1} Q^{T-i-2} \sum_{j=i+1}^{T-1} M_{B(T+i-j)} \Big) = O\Big( \sum_{i=0}^{T-1} Q^{T-i-1} Q^{i-1} + \sum_{i=0}^{T-1} Q^{T-i-2} \sum_{j=i+1}^{T-1} Q^{T+i-j-1} \Big) \overset{(i)}{=} O\Big( \sum_{i=0}^{T-1} Q^{T-2} + \sum_{i=0}^{T-1} Q^{T-i-2} Q^{T-1} \Big) = O(T Q^{T-2} + Q^{2T-2}),$$
where (i) follows from Lemmas 3 and 4. Then we obtain
$$L_{g_T} = M L_{\theta T} + L M^2_{\theta T} = O(T Q^{T-2} + Q^{2T-2} + Q^{2T-2}) = O(T Q^{T-2} + Q^{2T-2}).$$
For the Lipschitz constant $\rho_{g_T}$ defined in Lemma 7, we have
$$\rho_{g_T} = 3 L M_{\theta T} L_{\theta T} + M \rho_\theta + M^3_{\theta T} \rho = O(T Q^{2T-3} + Q^{3T-3}).$$
Then, the proof is complete.
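Two elementary facts used repeatedly above can be checked numerically: the geometric recursion behind $\sigma_{\theta T}$ in Lemma 8, and the claim at the end of Lemma 12's proof that $\sum_{j=0}^{T-1} (j Q^{j-1} + Q^{T+j-1})$ is of order $T Q^{T-1} + Q^{2T-1}$. The constants below are illustrative stand-ins, not values from the paper:

```python
# 1) Lemma 8's worst-case recursion d_T <= (1 + c) d_{T-1} + b with d_0 = 0
#    unrolls exactly to the closed form ((1 + c)^T - 1) * b / c, which equals
#    sigma_{theta T} when c = M_m1 * L and b = M_m1 * Delta_12.
c, b, T = 0.3, 0.7, 12
d = 0.0
for _ in range(T):
    d = (1 + c) * d + b
closed = ((1 + c) ** T - 1) * b / c
assert abs(d - closed) < 1e-9

# 2) The order claim in Lemma 12's proof: the sum is dominated by
#    T Q^{T-1} + Q^{2T-1} up to a constant factor (4 suffices for Q = 1.5).
Q = 1.5
for T in range(2, 30):
    s = sum(j * Q ** (j - 1) + Q ** (T + j - 1) for j in range(T))
    assert s <= 4 * (T * Q ** (T - 1) + Q ** (2 * T - 1))
```

Both checks pass for any $c > 0$, $b > 0$, and $Q > 1$, with only the constant factor in the second check depending on $Q$.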

A.4 PROOF OF REMARK 2

In terms of the M-L2O generalization error, based on Equation (14) in Appendix A.3, we have
$$g^3_T\big( \tilde{\phi}^1_{MK} - \alpha \nabla_\phi \hat{g}^2_T(\tilde{\phi}^1_{MK}) \big) - g^3_T(\phi^3_*) \le M_{g_T} \big\| \tilde{\phi}^1_{MK} - \tilde{\phi}^1_{M*} + \hat{\phi}^1_* - \phi^1_* + \phi^1_* - \phi^3_* + \alpha \nabla_\phi \hat{g}^1_T(\tilde{\phi}^1_{M*}) - \alpha \nabla_\phi \hat{g}^2_T(\tilde{\phi}^1_{MK}) \big\| \le M_{g_T} \|\tilde{\phi}^1_{MK} - \tilde{\phi}^1_{M*}\| + M_{g_T} \|\hat{\phi}^1_* - \phi^1_*\| + M_{g_T} \alpha \|\nabla_\phi g^2_T(\tilde{\phi}^1_{MK}) - \nabla_\phi g^1_T(\tilde{\phi}^1_{M*})\| + M_{g_T} \delta_{13}.$$



Algorithm 1
1: Input: Inner step size $\alpha$, outer learning step size $\beta_k$, total epochs $K$, epoch number per task $S$, optimizer initial point $\phi^1_{M0}$, training task $\hat{g}^1$, adaptation task $\hat{g}^2$, testing task $g^3$
2: for $k = 0, 1, \ldots, K - 1$ do
3: …
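Only the header and loop opening of Algorithm 1 survive in this excerpt. The sketch below is therefore a minimal MAML-style reading of that header, under stated assumptions: one inner adaptation step of size `alpha` followed by an outer meta-update of size `beta_k`; `meta_train`, `grad_g1`, and the toy objective are hypothetical stand-ins, not the paper's actual procedure:

```python
import numpy as np

# A minimal MAML-style sketch suggested by Algorithm 1's recovered header.
# The loop body is not recoverable from the text, so the inner/outer updates
# below (and grad_g1 itself) are hypothetical stand-ins.
def meta_train(phi0, grad_g1, alpha, betas, K):
    phi = phi0.copy()
    for k in range(K):
        adapted = phi - alpha * grad_g1(phi)     # inner adaptation step
        phi = phi - betas[k] * grad_g1(adapted)  # outer meta-update
    return phi

# Toy check on the quadratic g1(phi) = ||phi||^2 / 2, whose gradient is phi:
phi_final = meta_train(np.ones(3), lambda p: p, alpha=0.1, betas=[0.5] * 20, K=20)
assert np.linalg.norm(phi_final) < np.linalg.norm(np.ones(3))
```

On the toy quadratic, each outer iteration contracts the meta-parameters toward the minimizer, mirroring the role the theory assigns to the inner step size $\alpha$ and outer step sizes $\beta_k$.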

Figure 2: Comparison of convergence speeds on the target distribution of LASSO optimizees. We repeat the experiments 10 times and show the 95% confidence intervals in the figures.

Figure 3: Comparison of convergence speeds on the target distribution of Quadratic optimizees. We repeat the experiments 10 times and show the 95% confidence intervals in the figures.

Figure 4: Performance on LASSO optimizees. We vary the standard deviation of the distribution used for sampling the weight matrix A for adaptation optimizees. We visualize both the mean and the confidence interval in the figure.

where (ii) follows from Assumption 3, and (iii) follows from Lemmas 9 and 10.

Lemma 12 (Corresponds to Proposition 1). Based on Assumptions 3 and 4 and Lemmas 8 and 11, we obtain $\|\nabla_\phi \hat{g}^1_T(\phi) - \nabla_\phi \hat{g}^2_T(\phi)\| = O(T Q^{T-1} \Delta_{12} + Q^{2T-1} \Delta_{12})$, where $Q = 1 + M_{m1} L$.

measured the generalization error of MAML; Fallah et al. (2020) characterized the convergence rate of single-inner-step MAML via Hessian-vector approximation. Furthermore, Ji et al. (2022) characterized the convergence rate of multi-step MAML. Recently, LFT (Zhao et al., 2022) combined the meta-learning design with Learning to Optimize and demonstrated better performance in adversarial attack applications.

Minimum logarithmic loss of different methods on LASSO at different levels of σ. We report the 95% confidence interval from 10 repeated runs.

ACKNOWLEDGEMENTS

The work of Y. Liang was supported in part by the U.S. National Science Foundation under the grants ECCS-2113860 and DMS-2134145. The work of Z. Wang was supported in part by the U.S. National Science Foundation under the grant ECCS-2145346 (CAREER Award).

CODE AVAILABILITY

https://github.com/VITA-Group/


where (i) is based on Assumption 3 and (ii) is based on Lemma 2.

Lemma 14. Based on Proposition 1 in Fallah et al. (2021) and Assumptions 1 and 3, if we set $\alpha \le \min\{\frac{1}{2L}, \frac{\mu}{8 \rho_{g_T} M_{g_T}}\}$ and $\beta_k = \min(\beta, \frac{8}{\mu(k+1)})$ for $\beta \le \frac{8}{\mu}$ in Algorithm 1, then we have
$$\mathbb{E}\|\tilde{\phi}^1_{MK} - \tilde{\phi}^1_{M*}\| \le O(1) \frac{M_{g_T}}{\mu^2} \cdot \frac{L_{g_T} + \rho_{g_T} \alpha M_{g_T}}{\beta K} + \frac{M_{g_T} \beta}{\sqrt{K}},$$
where $M_{g_T}$ is defined in Lemma 13, $L_{g_T}$ is defined in Lemma 6, and $\rho_{g_T}$ is defined in Lemma 7.
Proof. Based on Proposition 1 in Fallah et al. (2021), we obtain a convergence bound for the meta-iterates, where $\hat{G}_T(\phi)$ is defined in Equation (5). The claim then follows from Assumption 1 and the Lipschitz properties established above.

Lemma 15. Based on Assumption 1 and Lemma 13, with probability at least $1 - \delta$ we have
$$\|\hat{\phi}^1_* - \phi^1_*\| \le \frac{2\sqrt{2} M_{g_T}}{\mu \sqrt{\delta N}},$$
where $N$ is the sample size.
Proof. Based on Assumption 1 and Lemma 13, from Theorem 2 in Shalev-Shwartz et al. (2010), with probability at least $1 - \delta$, we obtain a bound on the excess empirical risk. Furthermore, based on Assumption 1 and the fact that $\phi^1_* = \arg\min g^1_T(\phi)$, we obtain a bound on $\|\hat{\phi}^1_* - \phi^1_*\|^2$. Taking the square root of both sides completes the proof, with probability at least $1 - \delta$.
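The step-size schedule $\beta_k = \min(\beta, 8/(\mu(k+1)))$ assumed in Lemma 14 can be illustrated with plain SGD on a toy $\mu$-strongly-convex objective; all constants below are illustrative stand-ins, and the objective is our own choice rather than the paper's:

```python
import numpy as np

# SGD with the schedule beta_k = min(beta, 8 / (mu * (k + 1))) from Lemma 14,
# on the toy mu-strongly-convex quadratic g(phi) = mu * phi^2 / 2 with
# additive gradient noise. The condition beta <= 8 / mu is respected.
mu, beta, K = 1.0, 0.5, 2000
rng = np.random.default_rng(0)
phi = np.array([5.0])
for k in range(K):
    beta_k = min(beta, 8.0 / (mu * (k + 1)))          # Lemma 14's schedule
    noisy_grad = mu * phi + 0.1 * rng.standard_normal(1)
    phi = phi - beta_k * noisy_grad
assert abs(phi[0]) < 0.5   # iterate settles near the minimizer at 0
```

The schedule takes large exploratory steps early (while $8/(\mu(k+1)) > \beta$) and then decays like $1/k$, the classic choice for strongly convex stochastic optimization.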

A.3 PROOF OF THEOREM 1

Based on our definition of the generalization error for the algorithm, we obtain the bound stated in Theorem 1. In terms of the Transfer Learning L2O generalization error with learned initial point $\phi_K$, we obtain the corresponding bound by an analogous decomposition. Then, the proof is complete.

A.5 PROOF OF EQ. 8 IN SUBSECTION 5.3

We assume that $\phi^3_*$ is as defined in Subsection 5.3; then we obtain Eq. (8). Then, the proof is complete.

A.6 ADDITIONAL EXPERIMENTS

New Optimizees: Rosenbrock. We conduct additional experiments with substantially different optimizees, i.e., Rosenbrock (Rosenbrock, 1960). In this case, the optimizers are required to minimize a two-dimensional non-convex function of the standard Rosenbrock form
$$f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2,$$
which is challenging for algorithms to converge to the global minimum (Tani et al., 2021). We specify $D_{adapt}$ and $D_{test}$ to be the family of Rosenbrock optimizees with initial points randomly sampled from the standard normal distribution. In contrast, the training optimizees are still LASSO with a mixture of uniform distributions from which the coefficient matrices are sampled. The experiments are repeated 10 times, with all algorithms receiving identical adaptation and testing samples in each run. Figure A5a shows the curves of the logarithm of the objective values generated by different methods, where our proposed M-L2O outperforms the other baselines significantly. At the 500-th step, the (mean, standard deviation) of the logarithmic objective values for {Vanilla L2O, TL, DT, M-L2O} are {(0.977, 0.225), (-2.170, 1.312), (-4.864, 0.395), (-6.832, 0.445)}, which provides numerical support for the advantage of our method.
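The Rosenbrock optimizee family can be sketched as follows; the coefficients $(a, b) = (1, 100)$ are the classic choice and are an assumption here, since this excerpt does not state the paper's exact values:

```python
import numpy as np

# The two-dimensional Rosenbrock function in its classic form, with the
# analytic gradient an L2O optimizer would consume at each step.
def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    dfdx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dfdy = 2.0 * b * (y - x ** 2)
    return np.array([dfdx, dfdy])

# Each optimizee fixes a random initial point, as specified for D_adapt and D_test:
rng = np.random.default_rng(0)
x0, y0 = rng.standard_normal(2)
assert rosenbrock(1.0, 1.0) == 0.0                 # global minimum at (1, 1)
assert np.allclose(rosenbrock_grad(1.0, 1.0), 0.0)
```

The narrow curved valley around the minimum at $(1, 1)$ is what makes this family a hard out-of-distribution test for an optimizer meta-trained on LASSO.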

New Evaluation: Interpolation

To obtain new optimizer weights, we employ a linear interpolation strategy between two adapted optimizers. The first is optimized on optimizees similar to those used in training, and the second on optimizees similar to those used in testing. We introduce a factor $\alpha$ to control the interpolation between the two weights, denoted by $w_1$ and $w_2$ respectively, and calculate the new weights as $w = \alpha w_1 + (1 - \alpha) w_2$. In Figure A5b, we present the mean values of the logarithmic loss, as well as the 95% confidence interval. The results of TL and M-L2O validate our claim that adapting to training-like optimizees tends to yield better performance than adapting to optimizees that more resemble the testing optimizees.

Published as a conference paper at ICLR 2023
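The interpolation scheme is a simple convex combination of the two weight vectors. The convention below, where $\alpha = 1$ recovers the training-side weights $w_1$, is our assumption; the text only states that $\alpha$ controls the mixture:

```python
import numpy as np

# Linear interpolation between two adapted optimizer weight vectors:
# w(alpha) = alpha * w1 + (1 - alpha) * w2. Here w1 is adapted on
# training-like optimizees and w2 on testing-like ones (illustrative values).
def interpolate(w1, w2, alpha):
    return alpha * w1 + (1.0 - alpha) * w2

w1 = np.array([0.0, 2.0])
w2 = np.array([4.0, 0.0])
assert np.allclose(interpolate(w1, w2, 1.0), w1)          # alpha = 1 -> w1
assert np.allclose(interpolate(w1, w2, 0.5), [2.0, 1.0])  # midpoint
```

Sweeping $\alpha$ over $[0, 1]$ and evaluating each interpolated optimizer on the test optimizees reproduces the kind of curve reported in Figure A5b.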

