META-LEARNING WITH NEURAL TANGENT KERNELS

Abstract

Model-Agnostic Meta-Learning (MAML) has emerged as a standard framework for meta-learning, where a meta-model is learned with the ability to adapt quickly to new tasks. However, as a double-loop optimization problem, MAML needs to differentiate through the whole inner-loop optimization path at every outer-loop training step, which may lead to both computational inefficiency and sub-optimal solutions. In this paper, we generalize MAML so that meta-learning can be defined in function spaces, and propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK). Within this paradigm, we introduce two meta-learning algorithms in the RKHS that no longer need the sub-optimal iterative inner-loop adaptation of the MAML framework. We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS, and 2) solving the adaptation analytically based on NTK theory. Extensive experimental studies demonstrate the advantages of our paradigm in both efficiency and quality of solutions compared to related meta-learning algorithms. Our experiments also show that the proposed methods are more robust to adversarial attacks and out-of-distribution adaptation than popular baselines.

1. INTRODUCTION

Meta-learning (Schmidhuber, 1987) has made tremendous progress in the last few years. It aims to learn abstract knowledge from many related tasks so that fast adaptation to new and unseen tasks becomes possible. For example, in few-shot learning, meta-learning corresponds to learning a meta-model or meta-parameters that can quickly adapt to new tasks with a limited number of data samples. Among existing meta-learning methods, Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is perhaps one of the most popular and flexible, with a number of follow-up works such as (Nichol et al., 2018; Finn et al., 2018; Yao et al., 2019; Khodak et al., 2019a;b; Denevi et al., 2019; Fallah et al., 2020; Lee et al., 2020; Tripuraneni et al., 2020). MAML adopts a double-loop optimization framework, where adaptation is achieved by one or several gradient-descent steps in the inner-loop optimization. Such a framework can cause undesirable issues related to computational inefficiency and sub-optimal solutions. The main reasons are that 1) it is computationally expensive to back-propagate through a stochastic-gradient-descent chain, and 2) it is hard to tune the number of adaptation steps in the inner loop, as it can differ between training and testing. Several previous works tried to address these issues, but they only alleviate them to a certain extent. For example, first-order MAML (FOMAML) (Finn et al., 2017) ignores the high-order terms of standard MAML, which speeds up training but may deteriorate performance; MAML with Implicit Gradient (iMAML) (Rajeswaran et al., 2019) directly minimizes the outer-loop objective without performing the inner-loop optimization, but it still needs an iterative solver to estimate the meta-gradient.
To better address these issues, we propose two algorithms that generalize meta-learning to the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK) (Jacot et al., 2018). In this RKHS, instead of using parameter adaptation, we propose to perform an implicit function adaptation. To this end, we introduce two algorithms that avoid explicit inner-loop adaptation.

2.2. GRADIENT FLOW

Our proposed method relies on the concept of gradient flow. Generally speaking, gradient flow is a continuous-time version of gradient descent. In the finite-dimensional parameter space, a gradient flow is defined by an ordinary differential equation (ODE), dθ_t/dt = −∇_{θ_t} F(θ_t), with a starting point θ_0 and a function F : R^d → R. Gradient flow is also known as the steepest-descent curve. One can generalize gradient flows to infinite-dimensional function spaces. Specifically, given a function space H, a functional F : H → R, and a starting point f_0 ∈ H, a gradient flow is similarly defined as the solution of df_t/dt = −∇_{f_t} F(f_t). This is a curve in the function space H. In this paper, we use the notation ∇_{f_t} F(f_t), instead of ∇_H F(f_t), to denote the general functional derivative of the energy functional F with respect to the function f_t (Villani, 2008).
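As a concrete illustration of the gradient-flow idea above, the sketch below (our own toy example, not from the paper) integrates the ODE dθ_t/dt = −∇F(θ_t) with explicit Euler steps, which is exactly gradient descent with step size dt, and compares against the known closed-form flow for a simple quadratic F.

```python
import numpy as np

# Gradient *descent* as an Euler discretization of the gradient flow
# dtheta/dt = -grad F(theta).  For F(theta) = ||theta||^2 the flow has the
# closed form theta_t = theta_0 * exp(-2t), so we can check the discretization.

def grad_F(theta):
    # F(theta) = ||theta||^2, so grad F(theta) = 2 * theta
    return 2.0 * theta

def euler_gradient_flow(theta0, dt=1e-4, T=1.0):
    """Integrate dtheta/dt = -grad F(theta) with explicit Euler steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(int(T / dt)):
        theta = theta - dt * grad_F(theta)   # one gradient-descent step
    return theta

theta_T = euler_gradient_flow(np.array([1.0, -2.0]), dt=1e-4, T=1.0)
exact = np.array([1.0, -2.0]) * np.exp(-2.0)  # closed-form flow at t = 1
print(np.max(np.abs(theta_T - exact)))        # small discretization error
```

Shrinking dt makes the discrete iterates track the continuous flow arbitrarily closely, which is the sense in which gradient flow is the continuous-time limit of gradient descent.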

2.3. THE NEURAL TANGENT KERNEL

The Neural Tangent Kernel (NTK) is a recently proposed tool for characterizing the dynamics of a neural network under gradient descent (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019). The NTK allows one to analyze deep neural networks (DNNs) in the RKHS it induces. One immediate benefit is that the loss functional in the function space is often convex, even when it is highly non-convex in the parameter space (Jacot et al., 2018)*. This property allows one to better understand the behavior of DNNs. Specifically, let f_θ be a DNN parameterized by θ. The corresponding NTK Θ is defined as Θ(x_1, x_2) = ⟨∂f_θ(x_1)/∂θ, ∂f_θ(x_2)/∂θ⟩, where x_1, x_2 are two data points. In our paper, we will define meta-learning on the RKHS induced by such a kernel.
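The definition Θ(x_1, x_2) = ⟨∂f_θ(x_1)/∂θ, ∂f_θ(x_2)/∂θ⟩ can be made concrete with a small numeric sketch. The toy network, shapes, and finite-difference Jacobian below are our own choices for illustration; a practical implementation would use automatic differentiation instead.

```python
import numpy as np

# Empirical NTK of a tiny two-layer ReLU network with scalar output:
# Theta(x1, x2) = <df/dtheta at x1, df/dtheta at x2>.

rng = np.random.default_rng(0)
W1, w2 = rng.normal(size=(3, 8)), rng.normal(size=8)   # toy parameters
theta = np.concatenate([W1.ravel(), w2])               # flattened, size 32

def f(theta, x):
    W1 = theta[:24].reshape(3, 8)
    w2 = theta[24:]
    return np.maximum(x @ W1, 0.0) @ w2                # scalar-output ReLU net

def jac(theta, x, eps=1e-6):
    # central finite-difference gradient of f(theta, x) w.r.t. theta
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

def ntk(theta, x1, x2):
    return jac(theta, x1) @ jac(theta, x2)

x1, x2 = rng.normal(size=3), rng.normal(size=3)
K = np.array([[ntk(theta, a, b) for b in (x1, x2)] for a in (x1, x2)])
print(K)  # 2x2 kernel matrix; symmetric by construction
```

Since each entry is an inner product of parameter-Jacobians, any such kernel matrix is a Gram matrix and hence symmetric positive semi-definite.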

3. META-LEARNING IN RKHS

We first define the meta-learning problem in a general function space, and then restrict the function space to be an RKHS, where two frameworks will be proposed to make meta-learning feasible, along with some theoretical analysis. For simplicity, in the following we hide the superscript for time t unless necessary, e.g., when the analysis involves the time evolution.

3.1. META-LEARNING IN FUNCTION SPACE

Given a function space H, a distribution of tasks P(T), and a loss function L, the goal of meta-learning is to find a meta function f* ∈ H that performs well after simple adaptation on a specific task. Let D^tr_m and D^test_m be the training and testing sets, respectively, sampled from the data distribution of task T_m. The meta-learning problem on the function space H is defined as:

f* = argmin_{f ∈ H} E(f), with E(f) = E_{T_m}[ L(Adapt(f, D^tr_m), D^test_m) ],   (1)

where Adapt denotes some adaptation algorithm, e.g., several steps of gradient descent, and E : H → R is called the energy functional, which evaluates the model represented by the function f. In theory, solving equation 1 is equivalent to solving the gradient flow equation df_t/dt = −∇_{f_t} E(f_t). However, solving the gradient flow equation is generally infeasible, since i) it is hard to directly apply optimization methods in function space, and ii) the energy functional E contains an adaptation algorithm Adapt, making the functional gradient intractable. Thus, a better way is to design a special energy functional that can be directly optimized without running the specific adaptation algorithm. In the following, we first specify the functional meta-learning problem in an RKHS, and then propose two methods to derive efficient solutions.
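The bilevel structure of equation 1 can be sketched numerically. Below is a minimal toy instance (our own example: linear regression tasks, gradient-descent Adapt, and a Monte-Carlo estimate of the expectation over tasks), showing how the energy E(f) wraps the adaptation algorithm inside the outer loss.

```python
import numpy as np

# Toy instance of equation 1: Adapt(f, D_tr) = k steps of gradient descent on
# the train split; the meta-objective is the post-adaptation test-split loss,
# averaged over sampled tasks.

rng = np.random.default_rng(1)

def sample_task():
    w_true = rng.normal(size=2)
    X = rng.normal(size=(20, 2))
    y = X @ w_true
    return (X[:10], y[:10]), (X[10:], y[10:])     # (train, test) splits

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

def adapt(w, X, y, alpha=0.1, k=5):
    for _ in range(k):                             # inner-loop gradient descent
        w = w - alpha * (X.T @ (X @ w - y)) / len(y)
    return w

def meta_objective(w, n_tasks=50):
    total = 0.0
    for _ in range(n_tasks):
        (Xtr, ytr), (Xte, yte) = sample_task()
        total += loss(adapt(w, Xtr, ytr), Xte, yte)
    return total / n_tasks

w0 = np.zeros(2)
m = meta_objective(w0)    # Monte-Carlo estimate of the energy E(f) at w0
print(m)
```

Differentiating `meta_objective` with respect to `w0` would require back-propagating through the `adapt` loop; this is exactly the inner-loop differentiation that the two methods proposed below are designed to avoid.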

3.2. META-LEARNING IN RKHS

We consider a function f that is parameterized by θ ∈ R^P, denoted f_θ, with P the number of parameters. Define a realization function F : R^P → H that maps parameters to a function. With these, we can define an energy function in the parameter space as the composition E ∘ F : R^P → R, which we also denote by E(θ). Consequently, with an initialization θ_0, the gradient flow of E(θ_t) in parameter space is dθ_t/dt = −∇_{θ_t} E(θ_t). In the following, we first establish an equivalence between the gradient flow in the RKHS and the gradient flow in the parameter space. We then propose two algorithms for meta-learning in the RKHS induced by the NTK.

Theorem 1 Let H be the RKHS induced by the NTK Θ of f_θ. With f_0 = f_{θ_0}, the gradient flow of E(f_t) coincides with the function evolution of f_{θ_t} driven by the gradient flow of E(θ_t).

The proof of Theorem 1 relies on the properties of the NTK (Jacot et al., 2018) and is provided in the Appendix. Theorem 1 serves as a foundation of our proposed methods, as it indicates that the meta-learning problem in the RKHS can be solved through appropriate manipulations in the parameter space. In the following, we describe two different approaches, termed Meta-RKHS-I and Meta-RKHS-II, respectively.

3.3. META-RKHS-I: META-LEARNING IN RKHS WITHOUT ADAPTATION

Our goal is to design an energy functional that has no adaptation component but is still capable of fast adaptation. For this purpose, we first introduce two definitions: the empirical loss function L(f_θ, D_m) and the expected loss function L_m(f_θ). Let D_m = {(x_{m,i}, y_{m,i})}_{i=1}^n be a set containing the data of a regression task T_m. The empirical and expected loss functions are defined as:

L(f_θ, D_m) = (1/2n) Σ_{i=1}^n ‖f_θ(x_{m,i}) − y_{m,i}‖²,   L_m(f_θ) = E_{x_m,y_m}[ (1/2) ‖f_θ(x_m) − y_m‖² ].

Our idea is to define a regularized functional that endows the model with the ability of fast adaptation in the RKHS. Our solution is based on a property of standard MAML. We start by analyzing the meta-objective of MAML with a k-step gradient-descent adaptation, i.e., applying k gradient-descent steps in the inner loop. The objective can be formulated as

θ* = argmin_θ E_{T_m}[ L(f_φ, D^test_m) ],  with φ = θ − α Σ_{i=0}^{k−1} ∇_{θ_i} L(f_{θ_i}, D^tr_m),

where α is the learning rate of the inner loop, θ_0 = θ, and θ_{i+1} = θ_i − α ∇_{θ_i} L(f_{θ_i}, D^tr_m)†. By Taylor expansion, we have

E_{T_m}[ L(f_φ, D^test_m) ] ≈ E_{T_m}[ L(f_θ, D^test_m) − α Σ_{i=0}^{k−1} ∇_{θ_i} L(f_{θ_i}, D^tr_m)^⊤ ∇_θ L(f_θ, D^test_m) ].   (2)

Since D^tr_m and D^test_m come from the same distribution, equation 2 is an unbiased estimator of M_k = E_{T_m}[ L_m(f_θ) − Σ_{i=0}^{k−1} β_i ], where β_i = α ∇_{θ_i} L_m(f_{θ_i})^⊤ ∇_θ L_m(f_θ). We focus on the case k = 1, which gives

M_1 = E_{T_m}[ L_m(f_θ) ] − α E_{T_m}[ ‖∇_θ L_m(f_θ)‖² ].

The first term on the RHS is the traditional multi-task loss evaluated at θ over all tasks. The second term is a negative gradient norm; minimizing it means choosing a θ with maximum gradient norm. Intuitively, when θ is not a stationary point of a task, one should choose the steepest-descent direction to reduce the loss of that task as much as possible, thus leading to fast adaptation.
The above understanding suggests the following regularized energy functional E(α, f_θ) for meta-learning in the RKHS induced by the NTK:

E(α, f_θ) = E_{T_m}[ L_m(f_θ) − α ‖∇_{f_θ} L_m(f_θ)‖²_H ],   (4)

where ‖·‖_H denotes the norm in H and α is a hyper-parameter. The above objective is inspired by the Taylor expansion of the MAML objective, but is defined in the RKHS induced by the NTK. Its connection with MAML and some function-space properties will be discussed later.
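The one-step Taylor view that motivates this regularizer can be checked numerically. The sketch below (a toy quadratic loss of our own choosing) verifies that for small α the post-adaptation loss L(θ − αg) agrees with the gradient-norm-regularized objective L(θ) − α‖g‖² up to an O(α²) remainder.

```python
import numpy as np

# Numeric check of the first-order expansion behind M_1: for g = grad L(theta),
#   L(theta - alpha * g) = L(theta) - alpha * ||g||^2 + O(alpha^2).
# The quadratic loss below makes the remainder exactly (alpha^2 / 2) g^T A g.

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4)); A = A @ A.T / 4 + np.eye(4)   # SPD Hessian
b = rng.normal(size=4)

def L(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def grad_L(theta):
    return A @ theta - b

theta = rng.normal(size=4)
alpha = 1e-3
g = grad_L(theta)
post_adapt = L(theta - alpha * g)          # one-step MAML inner-loop loss
surrogate = L(theta) - alpha * (g @ g)     # gradient-norm-regularized objective
print(abs(post_adapt - surrogate))         # O(alpha^2) gap
```

Optimizing the surrogate directly is what lets Meta-RKHS-I drop the inner loop: the gradient-norm term rewards parameters from which a single gradient step makes maximal progress.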

Solving the Function Optimization Problem

To minimize equation 4, we first derive Theorem 2, which reduces the function optimization problem to a parameter optimization problem.

Theorem 2 Let f_θ be a neural network with parameters θ and H be the RKHS induced by the NTK Θ of f_θ. Then E(α, f_θ) = M_1 and ‖∇_{f_θ} L_m(f_θ)‖²_H = ‖∇_θ L_m(f_θ)‖².

Theorem 2 is crucial to our approach, as it indicates that solving equation 4 is no more difficult than the original parameter-based MAML, although it only considers the one-step adaptation case. Next, we show that multi-step adaptation in the parameter space can also be well approximated by our objective equation 4, but with a scaled regularization parameter α. In the following, we consider the squared loss L; the case of the cross-entropy loss is discussed in the Appendix. We assume that f_θ is parameterized by either a fully-connected or a convolutional neural network, and only consider the impact of the number of hidden layers L in our theoretical results.

Theorem 3 Let f_θ be a fully-connected neural network with L hidden layers and ReLU activations, let s_1, …, s_{L+1} be the spectral norms of the weight matrices, s = max_h s_h, and let α be the learning rate of gradient descent. If α ≤ O(qr) with q = min(1/(Ls^L), L^{−1/(L+1)}) and r = min(s^{−L}, s), then |M_k − E(kα, f_θ)| ≤ O(1/L).

Theorem 4 Let f_θ be a convolutional neural network with L − l convolutional layers and l fully-connected layers, with ReLU activations, and let d_x be the input dimension. Denote by W_h the parameter vector of the h-th convolutional layer for h ≤ L − l, and the weight matrix of the corresponding fully-connected layer for L − l + 1 ≤ h ≤ L + 1. Here ‖·‖₂ denotes both the spectral norm of a matrix and the Euclidean norm of a vector. Define s_h = √d_x ‖W_h‖₂ for h = 1, …, L − l, and s_h = ‖W_h‖₂ for L − l + 1 ≤ h ≤ L + 1. Let s = max_h s_h and α be the learning rate of gradient descent.
If α ≤ O(qr) with q = min(1/(Ls^L), L^{−1/(L+1)}) and r = min(s^{−L}, s), then |M_k − E(kα, f_θ)| ≤ O(1/L).

The above theorems indicate that, for a meta-model with fully-connected or convolutional layers, the proposed Meta-RKHS-I is an efficient approximation of MAML with a bounded error.

Comparisons with Reptile and MAML Similar to Reptile and MAML, the testing stage of Meta-RKHS-I also requires gradient-based adaptation on meta-test tasks. By Theorem 1, we know that the gradient flow of an energy functional can be approximated by gradient descent in a parameter space. Reptile with 1-step adaptation (Nichol et al., 2018) is equivalent to approximating the gradient flow of E(α, f_θ) with α = 0, which lacks the fast-adaptation regularization of our method. For a fairer comparison of efficiency, we discuss computational complexity later. From the equivalent parameter-optimization form indicated in Theorem 2, we see that our energy functional E(α, f_θ) is closely related to MAML. However, with this form, our method does not need explicit adaptation steps during training (i.e., the inner loop of MAML), leading to a simpler optimization problem. We will show that our proposed method leads to better results.

3.4. META-RKHS-II: META-LEARNING IN RKHS WITH A CLOSED-FORM ADAPTATION

In this section, we present our second solution for meta-learning in an RKHS by deriving a closed-form adaptation function, i.e., we focus on a case where Adapt(f, D^tr_m) is analytically solvable using NTK theory. Specifically, we are given a loss function L, and tasks T_m with randomly split training sets D^tr_m = {(x^tr_{m,i}, y^tr_{m,i})}_{i=1}^n and testing sets D^test_m. Let θ^t_m and f^t_{m,θ} denote the parameters and the corresponding function at time t, adapted to task T_m from the meta parameters θ and meta function f_θ, respectively. By NTK theory (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019), adaptation by gradient flow follows df^t_{m,θ}/dt = −∇_{f^t_{m,θ}} L(f^t_{m,θ}, D^tr_m). This differential equation corresponds to the adaptation step, i.e., how to adapt the meta parameters/function to task m. By NTK theory, it admits a closed-form solution; in our meta-learning setting, this means that no explicit adaptation steps are necessary. To see why, we first investigate the regression case, where the loss function L is the squared loss. Let x ∈ D^test_m be a test data point. As shown in Arora et al. (2019); Lee et al. (2019), with a large enough neural network we can safely assume that the NTK does not change much during training. In this case, f^t_{m,θ} has the closed form

f^t_{m,θ}(x) = f_θ(x) + H(x, X^tr_m) H^{−1}(X^tr_m, X^tr_m) ( e^{−t H(X^tr_m, X^tr_m)} − I ) ( f_θ(X^tr_m) − Y^tr ),   (5)

where e^{·} is the matrix exponential, which can be approximated by the Padé approximation (M. Arioli et al., 1996); H(X^tr_m, X^tr_m) is the n × n kernel matrix with (i, j)-th element Θ(x_{m,i}, x_{m,j}); H(x, X^tr_m) is the 1 × n vector with i-th element Θ(x, x_{m,i}); f_θ(X^tr_m) ∈ R^n is the vector of predictions for all training data at initialization; and Y^tr ∈ R^n is the vector of training targets. In particular, at time t = ∞, we have

f^∞_{m,θ}(x) = f_θ(x) + H(x, X^tr_m) H^{−1}(X^tr_m, X^tr_m) ( Y^tr − f_θ(X^tr_m) ).
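The closed form in equation 5 can be verified numerically in a setting where it is exact: a linear model f(x) = w·x, whose NTK H(x_1, x_2) = x_1·x_2 is constant throughout training. The sketch below (data and dimensions are our own toy choices) evaluates the closed form with SciPy's matrix exponential and cross-checks it against an explicit Euler integration of the parameter gradient flow.

```python
import numpy as np
from scipy.linalg import expm

# Closed-form adaptation for squared loss with a constant kernel H:
#   f_t(x) = f0(x) + H(x, Xtr) H^{-1} (exp(-t H) - I) (f0(Xtr) - ytr).

rng = np.random.default_rng(3)
Xtr = rng.normal(size=(2, 3))        # 2 training points in R^3 -> H is 2x2
ytr = rng.normal(size=2)
w0 = rng.normal(size=3)              # "meta" initialization of the linear model

H = Xtr @ Xtr.T                      # constant NTK of the linear model
f0_tr = Xtr @ w0                     # predictions on the train set at t = 0

def f_closed(x, t):
    h = Xtr @ x                      # the row vector H(x, Xtr)
    return x @ w0 + h @ np.linalg.solve(H, (expm(-t * H) - np.eye(2)) @ (f0_tr - ytr))

# Cross-check: integrate the gradient flow dw/dt = -Xtr^T (Xtr w - ytr)
w, dt, T = w0.copy(), 1e-4, 1.0
for _ in range(int(T / dt)):
    w -= dt * Xtr.T @ (Xtr @ w - ytr)

x = rng.normal(size=3)
print(abs(f_closed(x, T) - x @ w))   # closed form matches the integrated flow
```

At t = 0 the matrix-exponential factor vanishes and f_closed returns the unadapted prediction; as t grows, it interpolates continuously toward the kernel-regression limit f^∞ given above.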
The above results allow us to directly define an energy functional by substituting Adapt(f, D^tr_m) in equation 1 with its closed-form solution f^t_{m,θ}. In other words, our new energy functional is E(t, f_θ) = E_{T_m}[ L_m(f^t_{m,θ}) ], where f^t_{m,θ} is defined in equation 5, and L_m(f^t_{m,θ}) is the expectation of L(f^t_{m,θ}, D^test_m). For classification problems, we follow the same strategy as Arora et al. (2019) to extend regression to classification. More details can be found in the Appendix, including the algorithm in Appendix A.

[Table 1: time-complexity comparison of the methods; see Section 3.5.]

On Potential Robustness of Meta-RKHS-II Our extensive empirical studies show that Meta-RKHS-II is more robust than related baselines. We provide an intuitive explanation of this potential robustness, as we find current theories of both robust machine learning and the NTK insufficient for a formal explanation. Our explanation is based on properties of both the meta-learning framework and the NTK: 1) Strong initialization (meta model): For the NTK to generalize well, we argue that it is necessary to start the model from a good initialization. This is automatically achieved in our meta-learning setting, where the meta model serves as the initialization for NTK predictions. This is supported by recent research (Fort et al., 2020), which shows that there is a chaotic stage in NTK predictions with finite neural networks, and that the NTK regime is reachable with a good initialization. 2) Low-complexity classification boundary: It is known that the NTK is a linear model in the NTK regime. Intuitively, generating adversarial samples for a lower-complexity model should be relatively harder, because there is less data in the vicinity of the decision boundary compared to a more complex model, making the probability of a successful attack smaller. Thus we argue that our model can be more robust than standard meta-learning models.
3) Our NTK-based model is robust enough to adapt with different time steps, and these finite time steps can be more robust to adversarial attacks than the infinite-time limit, partly due to the complexity of back-propagating gradients. We note that each individual factor might not be enough to ensure robustness; rather, we argue that it is the combined effect of these factors that leads to the robustness of our model. A formal analysis is out of the scope of this paper and left for future work.

Connection with Meta-RKHS-I The two proposed methods choose different strategies to avoid explicit adaptation in meta-learning, and may appear to be very different algorithms. We prove the theorem below, which indicates that the difference between the underlying gradient flows of the two algorithms indeed increases w.r.t. both T and the depth L of a DNN (we only consider the impact of T and L).

Theorem 5 Let f_θ be a neural network with L hidden layers, each being either fully-connected or convolutional. Assume that ‖L‖_∞ < ∞. Then error(T) = |E_I(T, f_θ) − E_II(T, f_θ)|, the difference between the energy functionals of Meta-RKHS-I and Meta-RKHS-II, is a non-decreasing function of T. Furthermore, for arbitrary T > 0 we have error(T) ≤ O(T^{2L+3}).

In fact, Meta-RKHS-II implicitly contains a functional-gradient-norm term, because E(T, f_θ) = E_{T_m}[ L_m(f_θ) − ∫_0^T ‖∇_{θ^t} L_m(f^t_{m,θ})‖² dt ]. The difference from Meta-RKHS-I mainly comes from the fact that Meta-RKHS-I can be regarded as an approximation of time-discrete adaptation, while Meta-RKHS-II is based on time-continuous adaptation. In our experiments, we observe that Meta-RKHS-I is as fast as FOMAML, which makes it more computationally efficient than standard MAML, while Meta-RKHS-II is the more robust model in adversarial-attack and out-of-distribution-adaptation tasks.

Connection with iMAML Our proposed method is similar to the iMAML algorithm (Rajeswaran et al., 2019) in the sense that both methods try to solve meta-learning without differentiating through the optimization path.
Different from iMAML, which still relies on an iterative solver, our method only needs to solve a simpler optimization problem due to the closed-form adaptation.

3.5. TIME COMPLEXITY ANALYSIS

We compare the time complexity of our proposed methods with other first-order meta-learning methods. Without loss of generality, we analyze the complexity for an L-layer MLP or an L-layer convolutional neural network. Recall that d_x is the input dimension. Assume each layer has width (or filter number) O(p). Let n be the data batch size and k the number of inner-loop adaptation steps. We summarize the time complexities in Table 1, where we assume the complexity of multiplying matrices of sizes a × b and b × c to be O(abc). Note that in the meta-learning setting, n is typically small, indicating the efficiency of our proposed methods.

4. EXPERIMENTS

We conduct a set of experiments to evaluate the effectiveness of our proposed methods, including a sine-wave regression toy experiment, few-shot classification, robustness to adversarial attacks, out-of-distribution generalization, and an ablation study. Due to the space limit, more results are provided in the Appendix. We compare our models with related baselines including MAML (Finn et al., 2017), first-order MAML (FOMAML) (Finn et al., 2017), Reptile (Nichol et al., 2018), and iMAML (Rajeswaran et al., 2019). Results are reported as means and variances over three independent runs.

4.2. FEW-SHOT IMAGE CLASSIFICATION

For this experiment, we choose two popular meta-learning datasets: Mini-ImageNet and FC-100 (Oreshkin et al., 2018). The cross-entropy loss is adopted for Meta-RKHS-I, while the squared loss is used for Meta-RKHS-II, following Arora et al. (2019); Novak et al. (2019). As in Finn et al. (2017), the model architecture is a four-layer convolutional neural network with ReLU activations and 32 filters per layer. The Adam optimizer (Kingma & Ba, 2015) is used to minimize the energy functional. The meta batch size is set to 16, and the learning rate is set to 0.01 for Meta-RKHS-II. We conjecture the reason is that Meta-RKHS-I restricts the function to lie in an RKHS, making the function space smaller and thus easier to optimize than the unrestricted FOMAML. Between our two algorithms, there is no consistent winner across all tasks. We note that Meta-RKHS-I is more efficient in training; however, we show below that Meta-RKHS-II is better in terms of robustness to adversarial attacks and out-of-distribution generalization.

4.3. ROBUSTNESS TO ADVERSARIAL ATTACKS

We now compare the adversarial robustness of our methods with other popular baselines, using both white-box and black-box attacks. For the white-box setting, we adopt strong attacks including the PGD attack (Madry et al., 2017), the BPDA attack (Athalye et al., 2018), and the SPSA attack (Uesato et al., 2018). For the PGD attack, we use the ℓ∞ norm and compare results on Mini-ImageNet and FC-100, reporting robust accuracy under different attack magnitudes with a 20-step attack and a step size of 2/255. For the BPDA attack, we apply median smoothing, JPEG filtering, and bit squeezing as input-transformation defenses, adapted from Guo et al. (2018). For the SPSA attack, we follow Uesato et al. (2018) and set the Adam learning rate to 0.01 and the perturbation size to δ = 0.01. For the black-box setting, we adopt the strong query-efficient attack of Guo et al. (2019); following their setting, we use a fixed step size of 0.2. We consider both finite-time and infinite-time adaptation in this experiment. For finite-time adaptation, we consider the Padé approximations of the matrix exponential with P = Q = 1 and P = Q = 2 (Butcher & Chipman, 1992), and use Meta-RKHS-II_t100_PQ1 and Meta-RKHS-II_t100_PQ2 to denote the methods using finite time t = 100 with P = Q = 1 or P = Q = 2, respectively. We observe that other finite times t lead to similar predictions, so we only consider t = 100. The black-box results in Figure 2 indicate the robustness of our Meta-RKHS-II; in fact, the gaps are significantly large, making it the only usefully robust model in this adversarial-attack setting. Our Meta-RKHS-I is not as robust as Meta-RKHS-II, but still slightly outperforms the other baselines. Regarding the white-box attacks, the results in Figures 3, 4, and 5 again show that our proposed Meta-RKHS-II is significantly more robust than the baselines under the three strong attacks.
It is also interesting to see that our Meta-RKHS-I performs slightly better than Meta-RKHS-II in some rare cases, e.g., on Mini-ImageNet 5-way 1-shot when the attack magnitude is not too small. More results are presented in the Appendix.
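For reference, the ℓ∞ PGD attack used above (20 steps, step size 2/255) can be sketched in a few lines. The model, data, and ε below are our own toy stand-ins for illustration; only the attack loop itself follows the standard PGD recipe of Madry et al. (2017).

```python
import numpy as np

# l_inf PGD: iterated signed gradient ascent on the loss, projected back onto
# the eps-ball around the clean input and the valid pixel range [0, 1].

def pgd_linf(x, y, grad_fn, eps=8/255, step=2/255, n_steps=20):
    """Maximize the loss within an l_inf ball of radius eps around x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv, y))   # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)            # project onto the ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                    # stay a valid input
    return x_adv

# Toy binary logistic model: loss = log(1 + exp(-y * w.x)).
w = np.array([1.0, -2.0, 0.5])

def grad_loss(x, y):
    s = 1.0 / (1.0 + np.exp(y * (w @ x)))   # sigmoid(-y * w.x)
    return -y * s * w                        # d loss / d x

x = np.array([0.5, 0.5, 0.5])
x_adv = pgd_linf(x, y=1.0, grad_fn=grad_loss)
print(np.max(np.abs(x_adv - x)))   # perturbation stays within eps = 8/255
```

Because each step moves against the margin and the projection caps the perturbation at ε, the adversarial input degrades the model's score while remaining visually close to the clean input.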

4.4. OUT-OF-DISTRIBUTION GENERALIZATION

We adopt a strategy similar to Lee et al. (2020) to test a model's ability to generalize to out-of-distribution datasets. In this setting, the state of the art is achieved by Bayesian TAML (Lee et al., 2020). Different from their setting, which considers any-shot learning with up to 50 examples per class, we only focus on standard 1-shot or 5-shot learning, and we modify their code to accommodate this standard setting. The CUB (Wah et al., 2011) and VGG Flower (Nilsback & Zisserman, 2008) fine-grained datasets are used in this experiment, with all images resized to 84 × 84. We follow Lee et al. (2020) to split these datasets into meta-training/validation/testing sets. We first train all methods on the Mini-ImageNet or FC-100 datasets, then conduct meta-testing on the CUB and VGG Flower datasets. The results are shown in Table 3. Again, our methods achieve the best results, with Meta-RKHS-II outperforming the state-of-the-art method, indicating the robustness of our proposed methods. More results are presented in the Appendix.

B PROOF OF THEOREM 1

Theorem 1 If f_θ is a neural network with parameters θ ∈ R^P and H is the Reproducing Kernel Hilbert Space (RKHS) induced by Θ, the Neural Tangent Kernel (NTK) of f_θ, then with the initialization f_0 = f_{θ_0}, the gradient flow of E(f_t) coincides with the function evolution of f_{θ_t} induced by the gradient flow of E(θ_t).

Proof Without loss of generality, we can rewrite E(f) = E_{T_m}{ E_{(x_m,y_m)}[ C(f(x_m), y_m) ] } for some function C(·, ·). For a neural network f_θ with parameters θ ∈ R^P, the gradient flow of E in R^P is dθ_t/dt = −∇_{θ_t} E(θ_t). We have

dθ_t/dt = −∇_{θ_t}(E ∘ F)(θ_t)
        = −E_{T_m}{ E_{(x_m,y_m)}[ ∇_{θ_t} C(f_{θ_t}(x_m), y_m) ] }
        = −E_{T_m}{ E_{(x_m,y_m)}[ (∂C(f_{θ_t}(x_m), y_m)/∂f_{θ_t}(x_m)) (∂f_{θ_t}(x_m)/∂θ_t) ] }.

The dynamics of f_{θ_t} are therefore

df_{θ_t}/dt = (dθ_t/dt) (∂f_{θ_t}/∂θ_t)^⊤
            = −E_{T_m}{ E_{(x_m,y_m)}[ (∂C(f_{θ_t}(x_m), y_m)/∂f_{θ_t}(x_m)) (∂f_{θ_t}(x_m)/∂θ_t) (∂f_{θ_t}/∂θ_t)^⊤ ] }
            = −E_{T_m}{ E_{(x_m,y_m)}[ (∂C(f_{θ_t}(x_m), y_m)/∂f_{θ_t}(x_m)) Θ_t(x_m, ·) ] },   (8)

where Θ_t is the Neural Tangent Kernel of the neural network f_{θ_t} (Jacot et al., 2018).

Let H_t be the RKHS induced by the kernel Θ_t, and let V_{x_m} : H → R be the evaluation functional at x_m, defined as V_{x_m}(f) = f(x_m). For an arbitrary function g and a small perturbation ε, we have

⟨∇_f V_{x_m}(f), g⟩ = lim_{ε→0} ( V_{x_m}(f + εg) − V_{x_m}(f) ) / ε
                    = lim_{ε→0} ( f(x_m) + εg(x_m) − f(x_m) ) / ε
                    = g(x_m)
                    = ⟨Θ_t(x_m, ·), g⟩,

so ∇_f V_{x_m}(f) = Θ_t(x_m, ·), i.e., ∇_f f(x_m) = Θ_t(x_m, ·).

With the initial function f_0 = f_{θ_0} ∈ H, the gradient flow of E in H is df_t/dt = −∇_{f_t} E(f_t). We have

df_t/dt = −E_{T_m}{ E_{(x_m,y_m)}[ ∇_{f_t} C(f_t(x_m), y_m) ] }
        = −E_{T_m}{ E_{(x_m,y_m)}[ (∂C(f_t(x_m), y_m)/∂f_t(x_m)) ∇_{f_t} f_t(x_m) ] }
        = −E_{T_m}{ E_{(x_m,y_m)}[ (∂C(f_t(x_m), y_m)/∂f_t(x_m)) Θ_t(x_m, ·) ] }.   (9)

We complete the proof by comparing equation 8 and equation 9.

C PROOF OF THEOREM 2

Theorem 2 If f_θ is a neural network with parameters θ and H is the RKHS induced by Θ, the NTK of f_θ, then M_1 = E(α, f_θ), and β_0 = α‖∇_θ L_m(f_θ)‖² = α‖∇_{f_θ} L_m(f_θ)‖²_H.

Proof Without loss of generality, we rewrite L_m(f_θ) = E_{x_m,y_m}[ C(f_θ(x_m), y_m) ]. For a regression task, C(f_θ(x_m), y_m) = (1/2)‖f_θ(x_m) − y_m‖²; for a classification task, C(f_θ(x_m), y_m) = −y_m^⊤ log(f_θ(x_m)), where log is the element-wise logarithm. Letting (x'_m, y'_m) denote an independent copy of (x_m, y_m), we have

‖∇_θ L_m(f_θ)‖²
= ∇_θ L_m(f_θ) ∇_θ L_m(f_θ)^⊤
= ∇_θ E_{x_m,y_m}[ C(f_θ(x_m), y_m) ] ∇_θ E_{x'_m,y'_m}[ C(f_θ(x'_m), y'_m) ]^⊤
= E_{x_m,y_m}[ (∂C(f_θ(x_m), y_m)/∂f_θ(x_m)) (∂f_θ(x_m)/∂θ) ] E_{x'_m,y'_m}[ (∂f_θ(x'_m)/∂θ)^⊤ (∂C(f_θ(x'_m), y'_m)/∂f_θ(x'_m)) ]
= E_{x_m,y_m} E_{x'_m,y'_m}[ (∂C(f_θ(x_m), y_m)/∂f_θ(x_m)) (∂f_θ(x_m)/∂θ)(∂f_θ(x'_m)/∂θ)^⊤ (∂C(f_θ(x'_m), y'_m)/∂f_θ(x'_m)) ]
= E_{x_m,y_m} E_{x'_m,y'_m}[ (∂C(f_θ(x_m), y_m)/∂f_θ(x_m)) Θ(x_m, x'_m) (∂C(f_θ(x'_m), y'_m)/∂f_θ(x'_m)) ]
= ⟨ E_{x_m,y_m}[ (∂C(f_θ(x_m), y_m)/∂f_θ(x_m)) Θ(x_m, ·) ], E_{x'_m,y'_m}[ (∂C(f_θ(x'_m), y'_m)/∂f_θ(x'_m)) Θ(x'_m, ·) ] ⟩_H
= ⟨ E_{x_m,y_m}[ (∂C(f_θ(x_m), y_m)/∂f_θ(x_m)) ∇_{f_θ} f_θ(x_m) ], E_{x'_m,y'_m}[ (∂C(f_θ(x'_m), y'_m)/∂f_θ(x'_m)) ∇_{f_θ} f_θ(x'_m) ] ⟩_H
= ⟨ ∇_{f_θ} L_m(f_θ), ∇_{f_θ} L_m(f_θ) ⟩_H
= ‖∇_{f_θ} L_m(f_θ)‖²_H,

where ⟨·, ·⟩_H is the inner product of the RKHS H. In the above, we used the definition of the NTK, the properties of the RKHS inner product, and the definition and gradient of the evaluation functional in the RKHS. Recall that E(α, f_θ) = E_{T_m}[ L_m(f_θ) − α‖∇_{f_θ} L_m(f_θ)‖²_H ] and M_k = E_{T_m}[ L_m(f_θ) − Σ_{i=0}^{k−1} β_i ], where β_i = α∇_{θ_i} L_m(f_{θ_i})^⊤ ∇_θ L_m(f_θ) with θ_0 = θ and θ_{i+1} = θ_i − α∇_{θ_i} L(f_{θ_i}, D^tr_m). The result now follows directly.

D PROOF OF THEOREM 3

The proof techniques we use are similar to some previous works such as (Arora et al., 2019; Allen-Zhu et al., 2019) . We summaries some of the differences. Different from previous works that typically assume a neural network is Gaussian initialized, we do not have such an assumption as we are trying to learn a good meta-initialization in the meta-learning setting. Previous works try to investigate the behavior of models during training, while we focus on revealing the connection between different meta-learning algorithms. Previous work focuses on single-task regression/classification problems, while we focus on meta-learning problem. Theorem 3 Let f θ be a fully-connected neural network with L hidden layers and ReLU activation function, s 1 , ..., s L+1 be the spectral norm of the weight matrices, s = max h s h , and α be the learning rate of gradient descent. If α ≤ O(qr) with q = min(1/(Ls L ), L -1/(L+1) ) and r = min(s -L , s), then the following holds | E(kα, f θ ) -M k | ≤ O 1 L . Proof We first prove the case of k = 2, i.e. applying a two-step gradient descent adaptation in MAML. We need to prove the following theorem first. Theorem 6 Let f θ be a fully-connected neural network with L hidden layers, and x be a data sample. Represent the neural network by f θ (x) = σ(σ(...σ(x W 1 )...W L-1 )W L )W L+1 , where W 1 , ..., W L+1 denote the weight matrices, and σ is the ReLU activation function. Let s 1 , ..., s L+1 be the spectral norm of weight matrices, and s = max h s h . Let α be the learning rate of gradient descent, and f θ (x) be the resulting value after one step of gradient descent, and • F be the Frobenius norm. If α ≤ O(qs -L ), where q = min(1/(Ls L ), L -1/(L+1) ), then ∂f θ (x) ∂ θ - ∂f θ (x) ∂θ F ≤ O( 1 s √ L + 1 ). 
Remark 1 Theorem 6 states that for a neural network with L hidden layers, if the learning rate of gradient descent is bounded, then the norm of derivative w.r.t all the parameters will not change too much, although there are O(Lm 2 ) parameters, where m denotes the maximum width of hidden layers. We use row vector instead of column vector for consistency, while it does not affect our results. For simplicity, we will write g h (x) as g h . The bias terms in the neural network are introduced by adding an additional coordinate thus omitted in Theorem 6. Without loss of generality, we can assume x ≤ 1, which can be done by data normalization in pre-processing. Let g h (x) = σ(σ(...σ(x W 1 )...W h-1 )W h ) be the activation at h th hidden layer and g 0 (x) = x, g L+1 = f θ (x). Define diagonal matrices D h , where D h (i,i) = 1{g h-1 W h ≥ 0} and b h = I dy , if h = L + 1 b h+1 (W h+1 ) D h , otherwise where I dy is a d y × d y identity matrix. We first prove the following Lemma. Lemma 7 Given a neural network as stated in Theorem 6, let • 2 denote the spectral norm, W h = W h -W h denote some perturbation on weight matrices, gh (x) denote the resulting value after perturbation, and g h (x) = gh (x) -g h (x). If s ≥ 1 and W h 2 ≤ O(s -L /L) for all h, then g h ≤ O( 1 Ls L-h+1 ); If s < 1 and W h 2 ≤ O(q) for all h, where q = min(1/(Ls L ), L -1/(L+1) ) and r = max(q, s), then g h ≤ O(r h-1 q) = O( 1 Ls L-h+1 ), if 1/(Ls L ) ≤ L -1/(L+1) O(L -h/(L+1) ), if 1/(Ls L ) > L -1/(L+1) . Proof Proof of Lemma 7 is based on induction. We first prove the case of s ≥ 1. Note that g 0 = x, thus g 0 = 0 ≤ O( 1 Ls L-0+1 ) always holds. For g 1 , we have g 1 = σ(x W 1 ) -σ(x W 1 ) ≤ x W 1 -x W 1 , due to the property of ReLU activation ≤ x W 1 2 ≤ O( 1 Ls L ). Thus, the hypothesis holds for g 1 . 
Now, assume that the hypothesis holds for $\Delta g^h$. Then we have
$$\begin{aligned}
\|\Delta g^{h+1}\| &= \|\sigma(\tilde g^h \tilde W^{h+1}) - \sigma(g^h W^{h+1})\| \\
&\le \|\tilde g^h \tilde W^{h+1} - g^h W^{h+1}\| \quad \text{(1-Lipschitzness of ReLU)} \\
&\le \|\tilde g^h W^{h+1} - g^h W^{h+1}\| + \|\tilde g^h \Delta W^{h+1}\| \\
&\le \|\Delta g^h\|\,\|W^{h+1}\|_2 + \|\tilde g^h\|\,\|\Delta W^{h+1}\|_2 \\
&\le O(s)\,\|\Delta g^h\| + \big(\|g^h\| + \|\Delta g^h\|\big)\,\|\Delta W^{h+1}\|_2 \\
&\le O(s)\,O\!\left(\tfrac{1}{Ls^{L-h+1}}\right) + O(s^h)\,O\!\left(\tfrac{1}{Ls^L}\right) + O\!\left(\tfrac{1}{Ls^{L-h+1}}\right) O\!\left(\tfrac{1}{Ls^L}\right) \\
&\le O\!\left(\tfrac{1}{Ls^{L-h}}\right).
\end{aligned}$$
The last three inequalities come from the fact that $\|g^h\| = \|\sigma(\sigma(\cdots\sigma(x W^1)\cdots W^{h-1})W^h)\| \le O(s^h)$ and $s \ge 1$. Thus, we have proved the lemma in the case $s \ge 1$.

Now, we prove the first part of the case of $s < 1$, i.e. $\|\Delta g^h\| \le O(r^{h-1} q)$. Because $\|\Delta g^0\| = 0$, the hypothesis for $\Delta g^0$ always holds. For $\Delta g^1$, we have
$$\|\Delta g^1\| = \|\sigma(x \tilde W^1) - \sigma(x W^1)\| \le \|x \tilde W^1 - x W^1\| \le \|x\|\,\|\Delta W^1\|_2 \le O(q).$$
Thus, the hypothesis holds for $\Delta g^1$. Now, assume that the hypothesis holds for $\Delta g^h$. Then, as above,
$$\|\Delta g^{h+1}\| \le O(s)\,\|\Delta g^h\| + \big(\|g^h\| + \|\Delta g^h\|\big)\,\|\Delta W^{h+1}\|_2 \le O(s)\,O(r^{h-1} q) + O(s^h)\,q + O(r^{h-1} q)\,q \le O(r^h q).$$
The last inequality comes from the fact that $r = \max(q, s)$ and $s^h < s < 1$.

Next, we consider the second part of the case of $s < 1$. If $1/(Ls^L) \le L^{-1/(L+1)}$, we know that $q = 1/(Ls^L)$, and
$$1/(Ls^L) \le L^{-1/(L+1)} \;\Rightarrow\; L^{1/(L+1)} \le Ls^L \;\Rightarrow\; L^{-L/(L+1)} \le s^L \;\Rightarrow\; L^{-1} \le s^{L+1} \;\Rightarrow\; L^{-1} s^{-L} \le s,$$
which means $q \le s$, and thus $r = s$. Then, we have
$$\|\Delta g^h\| = O(r^{h-1} q) = O(s^{h-1} q) = O(s^{h-1} L^{-1} s^{-L}) = O\!\left(\frac{1}{Ls^{L-h+1}}\right).$$
If $1/(Ls^L) > L^{-1/(L+1)}$, we know that $q = L^{-1/(L+1)}$ and $q > s$; then $r = q$ and
$$\|\Delta g^h\| = O(r^{h-1} q) = O(q^{h-1} q) = O(q^h) = O\!\left(L^{-h/(L+1)}\right).$$
Thus, we can conclude that Lemma 7 also holds for the case of $s < 1$, which completes the proof.

We now prove a similar lemma for $b^h$.

Lemma 8 Given a neural network as stated in Theorem 6, let $\|\cdot\|_2$ denote the spectral norm, $\Delta W^h = \tilde W^h - W^h$ denote a perturbation of the weight matrices, $\tilde b^h$ denote the resulting value after perturbation, and $\Delta b^h = \tilde b^h - b^h$.
If $s \ge 1$ and $\|\Delta W^h\|_2 \le O(s^{-L}/L)$ for all $h$, then $\|\Delta b^h\| \le O\!\left(\frac{1}{Ls^h}\right)$; if $s < 1$ and $\|\Delta W^h\|_2 \le O(q)$ for all $h$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$, then
$$\|\Delta b^h\| \le \begin{cases} O(L^{-1} s^{-h}), & \text{if } 1/(Ls^L) \le L^{-1/(L+1)} \\ O(L^{(h-L-1)/(L+1)}), & \text{if } 1/(Ls^L) > L^{-1/(L+1)}. \end{cases}$$

Proof Recall that
$$b^h = \begin{cases} I_{d_y}, & \text{if } h = L+1 \\ b^{h+1}(W^{h+1})^\top D^h, & \text{otherwise,} \end{cases}$$
where $I_{d_y}$ is a $d_y \times d_y$ identity matrix and $D^h_{(i,i)} = \mathbb{1}\{(g^{h-1} W^h)_i \ge 0\}$. It is easy to see that $\|b^h\| \le O(s^{L-h+1})$, because $\|D^h\|_2 \le 1$ and $\|W^h\|_2 \le s$.

We first prove the case of $s \ge 1$. We know that $\|\Delta b^{L+1}\| = 0 \le O(s^{-L-1}/L)$ always holds. For $h \le L$, we can re-write $b^h$ as
$$b^h = I_{d_y}(W^{L+1})^\top D^L (W^L)^\top D^{L-1} \cdots (W^{h+1})^\top D^h.$$
Then, we have
$$b^h (g^h)^\top = I_{d_y}(W^{L+1})^\top D^L (W^L)^\top D^{L-1} \cdots (W^{h+1})^\top D^h (g^h)^\top. \tag{10}$$
Because of the fact that $f_\theta = g^{L+1} = x W^1 D^1 W^2 D^2 \cdots D^L W^{L+1} = g^h W^{h+1} D^{h+1} \cdots D^L W^{L+1}$, and $g^h = g^h D^h$ with $D^h = (D^h)^\top$, we can re-write equation (10) as $b^h (g^h)^\top = (f_\theta)^\top$. Thus,
$$\|\tilde b^h (\tilde g^h)^\top - b^h (g^h)^\top\| = \|(f_{\tilde\theta} - f_\theta)^\top\| = \|\Delta g^{L+1}\| \le O\!\left(\frac{1}{L}\right)$$
by Lemma 7. Consequently, expanding the left-hand side, we have
$$\|\Delta b^h (g^h)^\top + b^h (\Delta g^h)^\top + \Delta b^h (\Delta g^h)^\top\| \le O\!\left(\frac{1}{L}\right).$$
Since $\|g^h\| \le O(s^h)$, we know that $\|\Delta b^h\| \le O(\frac{1}{Ls^h})$ and $\|b^h\| \le O(s^{L-h+1})$ always hold. Since $L \ge 1$ and $s \ge 1$, we simply have $\|\Delta b^h\| \le O\!\left(\frac{1}{Ls^h}\right)$.

Now, we prove the case of $s < 1$. Similarly, we have
$$\|\tilde b^h (\tilde g^h)^\top - b^h (g^h)^\top\| = \|(f_{\tilde\theta} - f_\theta)^\top\| = \|\Delta g^{L+1}\| \le O\!\left(\frac{1}{L}\right),$$
and similarly we must have $\|\Delta b^h\| \le O(\frac{1}{Ls^h})$ and $\|\Delta b^h\| \le O(\frac{1}{L r^{h-1} q})$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$ and $r = \max(q, s)$, by Lemma 7. If $1/(Ls^L) \le L^{-1/(L+1)}$, then $s^{L+1} \ge 1/L$. We thus have
$$O\!\left(\frac{1}{L r^{h-1} q}\right) = O\!\left(\frac{Ls^{L-h+1}}{L}\right) = O\!\left(\frac{s^{L+1}}{s^h}\right) \ge O\!\left(\frac{1}{Ls^h}\right).$$
Hence, we get $\|\Delta b^h\| \le O\!\left(\frac{1}{Ls^h}\right)$. If $1/(Ls^L) > L^{-1/(L+1)}$, then $s^{L+1} < 1/L$. We have
$$O\!\left(\frac{1}{L r^{h-1} q}\right) = O\!\left(L^{-1} \cdot L^{h/(L+1)}\right) \le O\!\left(L^{-1} \cdot s^{-h}\right) = O\!\left(\frac{1}{Ls^h}\right).$$

Thus, we get
$$\|\Delta b^h\| \le O\!\left(L^{(h-L-1)/(L+1)}\right),$$
which completes the proof.

Lemma 9 Given a neural network as stated in Theorem 6, let $\|\cdot\|_F$ be the Frobenius norm, $W^1, \dots, W^{L+1}$ be the weight matrices of the neural network, $\Delta W^h = \tilde W^h - W^h$ be the perturbation of the weight matrices, $\theta_h$ be the parameter vector containing all the elements of $W^h$, $\Delta\theta_h = \tilde\theta_h - \theta_h$ be the perturbation of the parameter vectors, and $f_{\tilde\theta}(x)$ be the resulting value after perturbation. If $s \ge 1$ and $\|\Delta W^h\|_2 \le O(s^{-L}/L)$ for all $h$, then for any weight matrix the following holds:
$$\left\|\frac{\partial f_{\tilde\theta}(x)}{\partial \tilde\theta_h} - \frac{\partial f_\theta(x)}{\partial \theta_h}\right\|_F \le O\!\left(\frac{1}{sL}\right);$$
if $s < 1$ and $\|\Delta W^h\|_2 \le O(q)$ for all $h$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$, then for any weight matrix the same bound holds.

Proof We first prove the case of $d_y = 1$, i.e. the output of the neural network is 1-dimensional. In this case, we know that
$$\left\|\frac{\partial f_{\tilde\theta}(x)}{\partial \tilde\theta_h} - \frac{\partial f_\theta(x)}{\partial \theta_h}\right\|_F = \left\|\frac{\partial f_{\tilde\theta}(x)}{\partial \tilde W^h} - \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F =: \left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F,$$
and the derivative w.r.t. $W^h$ is $\frac{\partial f_\theta(x)}{\partial W^h} = (b^h)^\top g^{h-1}$. Then, we have
$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F = \|(\tilde b^h)^\top \tilde g^{h-1} - (b^h)^\top g^{h-1}\|_F \le \|(\Delta b^h)^\top g^{h-1}\|_F + \|(b^h + \Delta b^h)^\top \Delta g^{h-1}\|_F.$$
Recall the facts that $\|g^h\| \le O(s^h)$ and $\|b^h\| \le O(s^{L+1-h})$. When $s \ge 1$, from Lemma 7 and Lemma 8 we know that $\|\Delta g^h\| \le O(\frac{1}{Ls^{L-h+1}})$ and $\|\Delta b^h\| \le O(\frac{1}{Ls^h})$. Then, we have
$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \le O(s^{h-1})\,O\!\left(\tfrac{1}{Ls^h}\right) + O(s^{L+1-h})\,O\!\left(\tfrac{1}{Ls^{L-h+2}}\right) + O\!\left(\tfrac{1}{Ls^{L-h+2}}\right) O\!\left(\tfrac{1}{Ls^h}\right) \le O\!\left(\frac{1}{sL}\right).$$
When $s < 1$, from Lemma 7 and Lemma 8 we know that
$$\|\Delta g^h\| \le \begin{cases} O\!\left(\frac{1}{Ls^{L-h+1}}\right), & \text{if } 1/(Ls^L) \le L^{-1/(L+1)} \\ O\!\left(L^{-h/(L+1)}\right), & \text{if } 1/(Ls^L) > L^{-1/(L+1)} \end{cases}$$
and
$$\|\Delta b^h\| \le \begin{cases} O(L^{-1} s^{-h}), & \text{if } 1/(Ls^L) \le L^{-1/(L+1)} \\ O(L^{(h-L-1)/(L+1)}), & \text{if } 1/(Ls^L) > L^{-1/(L+1)}. \end{cases}$$
If $1/(Ls^L) \le L^{-1/(L+1)}$, we have
$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \le O(s^{h-1})\,O\!\left(\tfrac{1}{Ls^h}\right) + O(s^{L-h+1})\,O\!\left(\tfrac{1}{Ls^{L-h+2}}\right) + O\!\left(\tfrac{1}{Ls^{L-h+2}}\right) O\!\left(\tfrac{1}{Ls^h}\right).$$
Since $1/(Ls^L) \le L^{-1/(L+1)}$ implies $L^{-1} \le s^{L+1}$ (from the proof of Lemma 7), we have $\frac{1}{Ls^h} \le s^{L-h+1}$. Then we can conclude that
$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \le O\!\left(\frac{1}{sL}\right).$$
If $1/(Ls^L) > L^{-1/(L+1)}$, we have
$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \le O(s^{h-1})\,O\!\left(L^{(h-L-1)/(L+1)}\right) + O(s^{L+1-h})\,O\!\left(L^{-(h-1)/(L+1)}\right) + O\!\left(L^{-(h-1)/(L+1)}\right) O\!\left(L^{(h-L-1)/(L+1)}\right).$$
Since $1/(Ls^L) > L^{-1/(L+1)}$ implies $L^{-1} > s^{L+1}$ (from the proof of Lemma 7), we have $L^{(h-L-1)/(L+1)} > s^{L-h+1}$ and $L^{-(h-1)/(L+1)} > s^{h-1}$. Then we have
$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \le O\!\left(\frac{1}{L}\right) \le O\!\left(\frac{1}{sL}\right),$$
because $s < 1$. We have thus proved the lemma for the case of $d_y = 1$. For the case of $d_y > 1$, we know that
$$\left\|\frac{\partial f_{\tilde\theta}(x)}{\partial \tilde\theta_h} - \frac{\partial f_\theta(x)}{\partial \theta_h}\right\|_F^2 = \sum_{i=1}^{d_y} \left\|\frac{\partial f_{\tilde\theta,i}(x)}{\partial \tilde\theta_h} - \frac{\partial f_{\theta,i}(x)}{\partial \theta_h}\right\|_F^2 \le O\!\left(\frac{d_y}{s^2 L^2}\right),$$
where $f_{\theta,i}(x)$ is the $i$-th dimension of $f_\theta(x)$. The last inequality comes directly from the 1-dimensional case. Since $d_y$ is a constant, we ignore it. Then, we have
$$\left\|\frac{\partial f_{\tilde\theta}(x)}{\partial \tilde\theta_h} - \frac{\partial f_\theta(x)}{\partial \theta_h}\right\|_F \le O\!\left(\frac{1}{sL}\right),$$
which completes the proof.

Now we can prove Theorem 6. Suppose $\tilde W^h$ is obtained by one step of gradient descent starting from $W^h$, $\tilde\theta$ is obtained by one step of gradient descent starting from $\theta$, and the learning rate is $\alpha$. Then, for any weight matrix we have
$$\|\Delta W^h\|_2 = \alpha \|\nabla_{W^h} L(\theta)\|_2 \le \alpha \|\nabla_{W^h} L(\theta)\|_F = \alpha \|\nabla_{\theta_h} L(\theta)\| = \alpha \left\|\frac{1}{n}\sum_{i=1}^n [f_\theta(x_i) - y_i]\,\frac{\partial f_\theta(x_i)}{\partial \theta_h}\right\| \le \frac{\alpha}{n}\sum_{i=1}^n c_i \left(\sum_{j=1}^{d_y} \left\|\frac{\partial f_{\theta,j}(x_i)}{\partial W^h}\right\|_F^2\right)^{1/2} \le \frac{\alpha}{n}\sum_{i=1}^n c_i \sqrt{d_y}\, O(s^{L-h+1})\,O(s^{h-1}) \le \alpha\, O(s^L),$$
where the $c_i = \|f_\theta(x_i) - y_i\|$ are constants. If $\alpha \le O(s^{-2L}/L)$ when $s \ge 1$, then for any weight matrix we have $\|\Delta W^h\|_2 \le \alpha O(s^L) \le O(s^{-L}/L)$. If $\alpha \le O(qs^{-L})$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$, when $s < 1$, then for any weight matrix we have $\|\Delta W^h\|_2 \le \alpha O(s^L) \le O(q)$. By Lemma 9, we can conclude that
$$\left\|\frac{\partial f_{\tilde\theta}(x)}{\partial \tilde W^h} - \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \le O\!\left(\frac{1}{sL}\right).$$
Then, we have
$$\left\|\frac{\partial f_{\tilde\theta}(x)}{\partial \tilde\theta} - \frac{\partial f_\theta(x)}{\partial \theta}\right\|_F = \left(\sum_{h=1}^{L+1} \left\|\frac{\partial f_{\tilde\theta}(x)}{\partial \tilde\theta_h} - \frac{\partial f_\theta(x)}{\partial \theta_h}\right\|_F^2\right)^{1/2} \le O\!\left(\frac{1}{s\sqrt{L+1}}\right).$$
When $s \ge 1$, we know that $s^{-L} \le 1 \le L^{L/(L+1)}$. Then, we have
$$\frac{1}{Ls^L} \le \frac{1}{L} \le L^{-1/(L+1)}.$$
Thus, we know that $1/(Ls^L) = \min(1/(Ls^L), L^{-1/(L+1)})$ when $s \ge 1$. For the case of $s \ge 1$, we can therefore rewrite $\alpha \le O(s^{-2L}/L) = O(qs^{-L})$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$, which completes the proof of Theorem 6.
Now, we prove Theorem 3 with $k = 2$, i.e. a two-step gradient-descent adaptation. We know that
$$\beta_1 = \alpha \nabla_{\theta_1} L_m(f_{\theta_1}) \nabla_\theta L_m(f_\theta)^\top, \qquad \|\nabla_{f_\theta} L_m(f_\theta)\|_{\mathcal H}^2 = \|\nabla_\theta L_m(f_\theta)\|^2,$$
where $\theta_1 = \tilde\theta$ is the parameter after one gradient step. Thus, we have
$$\begin{aligned}
\left|\beta_1 - \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|_{\mathcal H}^2\right| &= \left|\alpha \nabla_{\tilde\theta} L_m(f_{\tilde\theta}) \nabla_\theta L_m(f_\theta)^\top - \alpha \nabla_\theta L_m(f_\theta) \nabla_\theta L_m(f_\theta)^\top\right| \\
&= \alpha \left|\big(\nabla_{\tilde\theta} L_m(f_{\tilde\theta}) - \nabla_\theta L_m(f_\theta)\big) \nabla_\theta L_m(f_\theta)^\top\right| \\
&= \alpha \left|\mathbb{E}_{(x_m, y_m)}\!\left[ [f_{\tilde\theta}(x_m) - y_m]\,\frac{\partial f_{\tilde\theta}(x_m)}{\partial \tilde\theta} - [f_\theta(x_m) - y_m]\,\frac{\partial f_\theta(x_m)}{\partial \theta}\right] \nabla_\theta L_m(f_\theta)^\top\right| \\
&= \alpha \left|\mathbb{E}_{(x_m, y_m)}\!\left[ \Delta f_\theta(x_m)\left(\frac{\partial f_\theta(x_m)}{\partial \theta} + \Delta\frac{\partial f_\theta(x_m)}{\partial \theta}\right) + [f_\theta(x_m) - y_m]\,\Delta\frac{\partial f_\theta(x_m)}{\partial \theta}\right] \nabla_\theta L_m(f_\theta)^\top\right| \\
&\le \alpha \left(O\!\left(\tfrac{1}{L}\right) O\!\left(s^L \sqrt{L}\right) + O\!\left(\tfrac{1}{L}\right) O\!\left(\tfrac{1}{s\sqrt{L}}\right) + O\!\left(\tfrac{1}{s\sqrt{L}}\right)\right) \|\nabla_\theta L_m(f_\theta)\| \\
&\le \alpha \left(O\!\left(\tfrac{s^L}{\sqrt{L}}\right) + O\!\left(\tfrac{1}{s\sqrt{L}}\right)\right) \|\nabla_\theta L_m(f_\theta)\|, \quad \text{because } L \ge 1, \\
&\le \left(O\!\left(\tfrac{qr s^L}{\sqrt{L}}\right) + O\!\left(\tfrac{qr}{s\sqrt{L}}\right)\right) \|\nabla_\theta L_m(f_\theta)\|, \quad \text{where } q = \min(1/(Ls^L), L^{-1/(L+1)}),\ r = \min(s^{-L}, s), \\
&\le O\!\left(\tfrac{q}{\sqrt{L}}\right) \|\nabla_\theta L_m(f_\theta)\|,
\end{aligned}$$
where $\Delta f_\theta = f_{\tilde\theta} - f_\theta$ and $\Delta\frac{\partial f_\theta}{\partial\theta} = \frac{\partial f_{\tilde\theta}}{\partial\tilde\theta} - \frac{\partial f_\theta}{\partial\theta}$ are bounded by Lemma 7 and Theorem 6, respectively. In the case of $d_y = 1$, we have
$$\left\|\frac{\partial f_\theta(x)}{\partial W^h}\right\|_F = \|(b^h)^\top g^{h-1}\|_F \le O(s^L),$$
which has already been shown in the proof of Lemma 9. Then, we have
$$\|\nabla_\theta L_m(f_\theta)\| = O\!\left(\sqrt{\sum_{h=1}^{L+1} \left\|\frac{\partial f_\theta(x)}{\partial \theta_h}\right\|^2}\right) = O\!\left(\sqrt{\sum_{h=1}^{L+1} \left\|\frac{\partial f_\theta(x)}{\partial W^h}\right\|_F^2}\right) \le O\!\left(s^L \sqrt{L+1}\right).$$
In the case of $d_y \ge 1$, the bound is simply scaled by a constant factor of $d_y$. Thus, we have
$$\left|\beta_1 - \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|_{\mathcal H}^2\right| \le O\!\left(\frac{q}{\sqrt{L}}\right) \|\nabla_\theta L_m(f_\theta)\| \le O(q s^L) \le O\!\left(\frac{1}{L}\right),$$
because $q = \min(1/(Ls^L), L^{-1/(L+1)})$, which completes the proof for the case of $k = 2$.

For the case of $k > 2$, we only need to make sure that the bound on the learning rate always holds. Fortunately, since $k$ is a finite constant, according to what we have already shown in the proofs of the previous lemmas, every step of gradient descent does not change the spectral norms of the weight matrices too much: $\|\Delta W^h\|_2 \le O(s^{-L}/L)$ for all $h$ if $s \ge 1$, and $\|\Delta W^h\|_2 \le O(q)$ for all $h$ if $s < 1$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$. Thus, we may assume that the bound on the learning rate always holds during the adaptation.
Using the triangle inequality to generalize the result from $k = 2$ to $k > 2$, we have, for all $1 \le i \le k-1$,
$$\left|\beta_i - \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|_{\mathcal H}^2\right| \le O\!\left(\frac{1}{L}\right).$$
Recall that
$$\tilde E(\alpha, f_\theta) = \mathbb{E}_{T_m}\!\left[ L_m(f_\theta) - \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|_{\mathcal H}^2 \right] \quad \text{and} \quad M_k = \mathbb{E}_{T_m}\!\left[ L_m(f_\theta) - \sum_{i=0}^{k-1} \beta_i \right],$$
where $\beta_i = \alpha \nabla_{\theta_i} L_m(f_{\theta_i}) \nabla_\theta L_m(f_\theta)^\top$, $\theta_0 = \theta$, and $\theta_{i+1} = \theta_i - \alpha \nabla_{\theta_i} L(f_{\theta_i}, D^{tr}_m)$. Since $k$ is a finite constant, summing the $k$ bounds above gives $|\tilde E(k\alpha, f_\theta) - M_k| \le O(1/L)$, and the result is straightforward now.
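The relation underlying Theorem 3 — that the $k$-step MAML objective stays close to the linearized energy $L_m(f_\theta) - k\alpha\,\|\nabla_\theta L_m(f_\theta)\|^2$ for a small learning rate — can be checked numerically on a toy problem. The sketch below is our own illustration with a linear least-squares model standing in for the neural network (sizes, seed, learning rate, and threshold are arbitrary choices, not the paper's setup).

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))       # one toy "task": least-squares data
b = rng.standard_normal(6)
theta = rng.standard_normal(3)

def L(th):                            # task loss L_m
    return 0.5 * np.mean((A @ th - b) ** 2)

def gradL(th):                        # gradient of the task loss
    return A.T @ (A @ th - b) / A.shape[0]

alpha, k = 1e-3, 3
th = theta.copy()
for _ in range(k):                    # k-step inner-loop adaptation (MAML style)
    th = th - alpha * gradL(th)
maml_obj = L(th)                      # loss after adaptation, one summand of M_k

# Linearized energy: first-order expansion of the adapted loss.
linearized = L(theta) - k * alpha * np.linalg.norm(gradL(theta)) ** 2
gap = abs(maml_obj - linearized)      # small when alpha is small
```

The gap scales with $k^2\alpha^2$ here, consistent with the first-order nature of the approximation.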

E PROOF OF THEOREM 4

Theorem 4 Let $f_\theta$ be a convolutional neural network with $L - l$ convolutional layers and $l$ fully-connected layers, with the ReLU activation function, and let $d_x$ be the input dimension. Denote by $W^h$ the parameter vector of the convolutional layer for $h \le L - l$, and the weight matrix of the fully-connected layer for $L - l < h \le L + 1$. Here $\|\cdot\|_2$ denotes both the spectral norm of a matrix and the Euclidean norm of a vector. Define
$$s_h = \begin{cases} \sqrt{d_x}\,\|W^h\|_2, & \text{if } h = 1, \dots, L-l \\ \|W^h\|_2, & \text{if } L-l < h \le L+1, \end{cases}$$
let $s = \max_h s_h$, and let $\alpha$ be the learning rate of gradient descent. If $\alpha \le O(qr)$ with $q = \min(1/(Ls^L), L^{-1/(L+1)})$ and $r = \min(s^{-L}, s)$, the following holds:
$$\left|M_k - \tilde E(k\alpha, f_\theta)\right| \le O\!\left(\frac{1}{L}\right).$$

Proof We prove Theorem 4 by first transforming the convolutional neural network into an equivalent fully-connected neural network and then applying Theorem 3. First of all, we assume that there are $c_h$ channels in the $h$-th convolutional layer's output $g^h(x)$, where $h = 0, \dots, L-l$. For the fully-connected layers, define $c_{L-l} = \dots = c_{L+1} = 1$. We represent the input data by $x \in \mathbb{R}^{d_x c_0}$. Instead of using matrices, we represent the output of every convolutional layer by a vector of length $d_x c_h$,
$$g^h = \big(g^h_1, g^h_2, \dots, g^h_{d_x}\big),$$
where every $g^h_i = (g^h_{i,1}, g^h_{i,2}, \dots, g^h_{i,c_h})$ is a vector of length $c_h$ containing the values of the different channels at the same position. We assume that for every element $g^h_{i,j}$ of $g^h_i$, its value is completely determined by the elements of a set $Q^{h-1}_i$, where $Q^{h-1}_i$ contains $kc_{h-1}$ elements at fixed positions of $g^{h-1}$ for a given $i$. In other words, every element of the output of a convolutional layer is determined by elements at fixed positions of the output of the previous layer. This is exactly how a convolutional layer works in deep learning.

If we use $g^{h-1}_{Q^{h-1}_i}$ to represent the concatenation of the $g^{h-1}_{a,b} \in Q^{h-1}_i$, then $g^{h-1}_{Q^{h-1}_i}$ is a vector of length $kc_{h-1}$, where $k$ is the kernel size. Then we have
$$g^h_i = \sigma\!\left(g^{h-1}_{Q^{h-1}_i} U^h_i\right),$$
where $U^h_i \in \mathbb{R}^{kc_{h-1} \times c_h}$ is a $kc_{h-1} \times c_h$ matrix. For notational simplicity, one can define a matrix $U^h \in \mathbb{R}^{d_x c_{h-1} \times d_x c_h}$, where every column of $U^h$ has only $kc_{h-1}$ non-zero elements, satisfying
$$g^h = \sigma(g^{h-1} U^h).$$
By the properties of the convolutional layer, we know the following facts:
• One can represent $U^h$ by $U^h = (V^h_1, V^h_2, \dots, V^h_{d_x})$, where each $V^h_i \in \mathbb{R}^{d_x c_{h-1} \times c_h}$ is a sub-matrix of $U^h$;
• Every $V^h_i$ contains the same set of elements as $W^h$, while these elements are located at different positions;
• Every $V^h_i$ can be obtained from any other $V^h_j$ by swapping rows.
Let us define $U^{L-l} = W^{L-l}, \dots, U^{L+1} = W^{L+1}$ for the fully-connected layers and the output layer. Then we can represent the neural network just as in Theorem 3 by
$$f_\theta(x) = \sigma(\sigma(\cdots\sigma(x U^1)\cdots U^{L-1}) U^L) U^{L+1}, \qquad x \in \mathbb{R}^{d_x c_0}.$$
Now let $t_h$ be the spectral norm of $U^h$, and $t = \max_h t_h$. By Theorem 3, we want $\alpha \le O(qr)$, where $q = \min(1/(Lt^L), L^{-1/(L+1)})$ and $r = \min(t^{-L}, t)$ are defined in terms of $t$. Because every $V^h_i$ contains the same set of elements, every $V^h_i$ has the same Frobenius norm. Because every $V^h_i$ can be obtained from any other $V^h_j$ by swapping rows, every $V^h_i$ has the same rank $\rho$. We know that
$$\frac{1}{\sqrt{\rho}}\|V^h_1\|_F \le \|V^h_1\|_2 \le \|U^h\|_2 \le \|U^h\|_F = \sqrt{d_x}\,\|V^h_1\|_F = \sqrt{d_x}\,\|W^h\|_2,$$
where $\|\cdot\|_F$ denotes the Frobenius norm and $\rho$ denotes the rank of $V^h_1$. The last equality holds because the matrix $V^h_1$ and the vector $W^h$ have the same set of elements. Let us define
$$s_h = \begin{cases} \sqrt{d_x}\,\|W^h\|_2, & \text{if } h = 1, \dots, L-l \\ \|W^h\|_2, & \text{if } L-l < h \le L+1 \end{cases}$$
and $s = \max_h s_h$. From the above we know that $t_h = \Theta(s_h)$, because $s_h / \sqrt{d_x \rho} \le t_h \le s_h$. Hence we also have $t = \Theta(s)$, and the conclusion follows directly from Theorem 3.
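The reduction of a convolution to a structured matrix product can be illustrated on the simplest case. The sketch below is our own toy example (a single-channel 1-D convolution with circular padding; kernel size and input length are arbitrary choices): it builds a matrix U whose columns each hold the shifted kernel, verifies that the matrix product reproduces the convolution, and checks that the spectral norm of U is controlled by the quantity used to define $s_h$.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, k = 8, 3                      # input length and kernel size (toy values)
w = rng.standard_normal(k)         # convolution kernel (the parameter vector W^h)
x = rng.standard_normal(d_x)

# Column i of U holds the kernel placed at position i (circular layout),
# so the 1-D circular convolution equals the matrix product x @ U.
U = np.zeros((d_x, d_x))
for i in range(d_x):
    for j in range(k):
        U[(i + j) % d_x, i] = w[j]

conv = np.array([sum(w[j] * x[(i + j) % d_x] for j in range(k))
                 for i in range(d_x)])
assert np.allclose(x @ U, conv)    # convolution as a matrix product

# Each column of U has Euclidean norm ||w||, so ||U||_F = sqrt(d_x) ||w||,
# and the spectral norm is bounded by sqrt(d_x) * ||w||_2, matching s_h.
t = np.linalg.norm(U, 2)
assert t <= np.sqrt(d_x) * np.linalg.norm(w) + 1e-9
```

Real convolutional layers add channels and non-circular padding, but the block structure of U (repeated, row-permuted copies of the kernel) is the same as in the proof.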

F REVISION OF THEOREMS 3 AND 4 IN THE CLASSIFICATION CASE

We now show how to obtain results similar to Theorem 3 and Theorem 4 for the classification problem, where the cross-entropy loss is used instead of the squared loss. We need two more restrictions in the classification case:
1. There exist matrices $A$ and $B$ such that $g^L A \le \mathrm{softmax}(g^L W^{L+1}) \le g^L B$ element-wise for all data points, where $\mathrm{softmax}$ is the softmax operation at the last layer.
2. For any data point $x$ belonging to the $c$-th class, there exists a constant $\epsilon > 0$ such that $f_{\theta,c}(x) \ge \epsilon$, i.e. the output of the neural network has a lower bound at the true-class position.
The proof is similar to the proof in the regression case; we briefly discuss the differences here. Firstly, in the classification case, the softmax function is used at the last layer. By the first restriction, we can eliminate the softmax function by introducing the new matrices $A$ and $B$, which further leads to a bound on the learning rate as in the regression case. Secondly, if the loss function is the cross-entropy loss, we have
$$\nabla_\theta L_m(f_\theta) = \mathbb{E}_{(x_m, y_m)}\!\left[\frac{1}{f_{\theta,c_m}(x_m)}\,\frac{\partial f_{\theta,c_m}(x_m)}{\partial \theta}\right],$$
where $c_m$ denotes the class of $x_m$ (e.g. if $x_m$ belongs to the third class, then $c_m = 3$), and $f_{\theta,c_m}(x_m)$ denotes the $c_m$-th dimension of $f_\theta(x_m)$. We need the lower bound on $f_{\theta,c}(x)$ from the second restriction so that the gradient $\nabla_\theta L_m(f_\theta)$ can be further bounded. We can then prove similar theorems by following the steps of the regression case.

G PROOF OF THEOREM 5

Theorem 5 Let $f_\theta$ be a neural network with $L$ hidden layers, with each layer being either fully-connected or convolutional. Assume that $L_\infty < \infty$. Then, $\mathrm{error}(T) = |E(T, f_\theta) - \tilde E(T, f_\theta)|$ is a non-decreasing function of $T$. Furthermore, for arbitrary $T > 0$ we have:
$$\mathrm{error}(T) \le O\!\left(T^{2L+3}\right).$$

Proof Recall that $E(t, f_\theta)$ is defined based on $f^t_{m,\theta}$, the resulting function whose parameters evolve according to the gradient flow
$$\frac{d\theta^t_m}{dt} = -\nabla_{\theta^t_m} L(f^t_{m,\theta}, D^{tr}_m).$$
We actually have the following (Santambrogio, 2016):
$$\|\Delta\theta\| = \|\theta^0 - \theta^t\| \le O(\sqrt{t}).$$
For simplicity and clarity, we use $\Delta$ to denote the change of any vector or matrix. Thus, we know that $\|\Delta W^h\|_2 \le \|\Delta W^h\|_F \le \|\Delta\theta\| \le O(\sqrt{t})$. Just like the proofs of Lemma 7, Lemma 8 and Lemma 9, one can show by mathematical induction that
$$\|\Delta g^h\| \le O\!\left(t^{h/2}\right), \qquad \|\Delta b^h\| \le O\!\left(t^{(L-h+1)/2}\right), \qquad \left\|\Delta \frac{\partial f_\theta(x)}{\partial \theta}\right\|_F \le O\!\left(t^{(L+1)/2}\sqrt{L+1}\right);$$
we skip the details here. Note that, different from the previous theorems, here we focus on the time $t$ and thus hide the effect of the spectral norms by treating them as constants. Then, we have
$$\begin{aligned}
\|\Delta \nabla_\theta L_m(f_\theta)\| &= \|\nabla_{\theta^t} L_m(f^t_{m,\theta}) - \nabla_\theta L_m(f_\theta)\| \\
&= \left\|\mathbb{E}_{(x_m, y_m)}\!\left[ [f^t_{m,\theta}(x_m) - y_m]\,\frac{\partial f^t_{m,\theta}(x_m)}{\partial \theta^t} - [f_\theta(x_m) - y_m]\,\frac{\partial f_\theta(x_m)}{\partial \theta}\right]\right\| \\
&= \left\|\mathbb{E}_{(x_m, y_m)}\!\left[ \Delta f_\theta(x_m)\left(\frac{\partial f_\theta(x_m)}{\partial \theta} + \Delta\frac{\partial f_\theta(x_m)}{\partial \theta}\right) + [f_\theta(x_m) - y_m]\,\Delta\frac{\partial f_\theta(x_m)}{\partial \theta}\right]\right\| \\
&\le O\!\left(t^{L+1}\sqrt{L+1}\right).
\end{aligned}$$
Recall that
$$E(T, f_\theta) = \mathbb{E}_{T_m}\!\left[ L_m(f^T_{m,\theta}) \right] = \mathbb{E}_{T_m}\!\left[ L_m(f_\theta) + \int_0^T \frac{d}{dt} L_m(f^t_{m,\theta})\, dt \right] = \mathbb{E}_{T_m}\!\left[ L_m(f_\theta) + \int_0^T \frac{d\theta^t}{dt}\, \nabla_{\theta^t} L_m(f^t_{m,\theta})^\top\, dt \right] = \mathbb{E}_{T_m}\!\left[ L_m(f_\theta) - \int_0^T \|\nabla_{\theta^t} L_m(f^t_{m,\theta})\|^2\, dt \right]$$
and
$$\tilde E(T, f_\theta) = \mathbb{E}_{T_m}\!\left[ L_m(f_\theta) - T\,\|\nabla_{f_\theta} L_m(f_\theta)\|_{\mathcal H}^2 \right] = \mathbb{E}_{T_m}\!\left[ L_m(f_\theta) - T\,\|\nabla_\theta L_m(f_\theta)\|^2 \right].$$
Because
$$\begin{aligned}
\tilde E(T, f_\theta) - E(T, f_\theta) &= \mathbb{E}_{T_m}\!\left[\int_0^T \|\nabla_{\theta^t} L_m(f^t_{m,\theta})\|^2\, dt - T\,\|\nabla_\theta L_m(f_\theta)\|^2\right] \\
&= \mathbb{E}_{T_m}\!\left[\int_0^T \|\nabla_\theta L_m(f_\theta) + \Delta\nabla_\theta L_m(f_\theta)\|^2\, dt - T\,\|\nabla_\theta L_m(f_\theta)\|^2\right] \\
&= \mathbb{E}_{T_m}\!\left[\int_0^T \left( 2\,\nabla_\theta L_m(f_\theta)\, \Delta\nabla_\theta L_m(f_\theta)^\top + \|\Delta\nabla_\theta L_m(f_\theta)\|^2 \right) dt\right],
\end{aligned}$$
we have
$$\mathrm{error}(T) = |E(T, f_\theta) - \tilde E(T, f_\theta)| \le O\!\left(\frac{L+1}{2L+3}\, T^{2L+3}\right) = O\!\left(T^{2L+3}\right)$$
by a simple calculation.
On the other hand, observe that
$$\bar E(T, f_\theta) = \mathbb{E}_{T_m}\!\left[ L_m(f_\theta) - \int_0^T \|\nabla_{\theta^t} L_m(f^t_{m,\theta})\|_{\mathcal H}^2\, dt \right], \qquad \tilde E(T, f_\theta) = \mathbb{E}_{T_m}\!\left[ L_m(f_\theta) - T\,\|\nabla_\theta L_m(f_\theta)\|_{\mathcal H}^2 \right].$$
We let $G(\tau) = \int_0^\tau \|\nabla_{\theta^t} L_m(f^t_{m,\theta})\|^2\, dt$, and assume that $\nabla_{\theta^t} L_m(f^t_{m,\theta})$ is continuous at $t = 0$. Then, $G'(\tau) = \|\nabla_{\theta^\tau} L_m(f^\tau_{m,\theta})\|^2$, and
$$\tilde E(T, f_\theta) - \bar E(T, f_\theta) = \mathbb{E}_{T_m}\!\left[ \int_0^T \|\nabla_{\theta^t} L_m(f^t_{m,\theta})\|^2\, dt - T\,\|\nabla_\theta L_m(f_\theta)\|_{\mathcal H}^2 \right] = \mathbb{E}_{T_m}\!\left[ G(T) - T\,G'(0) \right],$$
where $T G'(0) = G(0) + T G'(0)$ (note that $G(0) = 0$) is the first-order approximation of $G(T)$ at $\tau = 0$. When $T = 1$, $G(T) - T G'(0)$ can be taken as a local truncation error (i.e., the error that occurs in one step of a numerical approximation). When $T$ increases, the difference is no better than the global truncation error (in $T$ steps):
$$G(T) - \sum_{i=0}^{T} \big(i - (i-1)\big)\, G'(i) = \sum_{i=0}^{T} \int_i^{i+1} \left( \|\nabla_{\theta^t} L_m(f^t_{m,\theta})\|^2 - \|\nabla_{\theta^i} L_m(f^{t=i}_{m,\theta})\|^2 \right) dt \approx \sum_{i=0}^{T} \int_i^{i+1} 2\, \Delta_i^t \nabla L_m(f_{m,\theta})\, \nabla L_m(f^t_{m,\theta})^\top\, dt,$$
where $\Delta_i^t \nabla L_m(f_{m,\theta}) = \nabla_{\theta^t} L_m(f^t_{m,\theta}) - \nabla_{\theta^i} L_m(f^{t=i}_{m,\theta})$, $i$ is the $i$-th time step, and $G'(i)$ is the derivative of $G$ at time step $i$. We can now see that $\tilde E(T, f_\theta) - \bar E(T, f_\theta)$ depends mainly on the difference between $\nabla_{\theta^t} L_m(f^t_{m,\theta})$ at different time steps (i.e. $\Delta_i^t \nabla L_m(f_{m,\theta})$), on $\nabla_{\theta^t} L_m(f^t_{m,\theta})$ itself, and on $T$. The first two quantities relate to how flat or sharp the landscape of $L_m(f_{m,\theta})$ is near $t = 0$; we can wrap them into a constant $C_0(L, t=0)$. Then, the error is at least $C_0(L, t=0) \cdot O(T)$. For a landscape that is smooth enough, we can further take a first-order approximation of $\Delta_i^t \nabla L_m(f_{m,\theta})$ and obtain $C(L, t=0)\, O(T^2)$, where $C(L, t=0)$ plays the role of a second-order derivative of $L$.
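The bound $\|\theta^0 - \theta^t\| \le O(\sqrt{t})$ used above follows from Cauchy–Schwarz along the gradient flow: $\|\theta^t - \theta^0\| \le \sqrt{t}\,(L(\theta^0) - L(\theta^t))^{1/2} \le \sqrt{t\,L(\theta^0)}$, since $dL/dt = -\|d\theta/dt\|^2$. The sketch below is our own numerical check on a toy quadratic loss (a stand-in for $L_m$, not the paper's model), using an Euler discretization of the flow with arbitrary step size and horizon.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy loss L(theta) = 0.5 * ||A theta - b||^2; the sqrt(t) bound only uses
# the gradient-flow structure, not the particular model.
A = rng.standard_normal((5, 4))
b = rng.standard_normal(5)
theta0 = rng.standard_normal(4)

def loss(th):
    return 0.5 * np.sum((A @ th - b) ** 2)

def grad(th):
    return A.T @ (A @ th - b)

dt, T = 1e-4, 2.0
theta, t = theta0.copy(), 0.0
while t < T:
    theta -= dt * grad(theta)   # Euler step of d(theta)/dt = -grad L(theta)
    t += dt
    # Cauchy-Schwarz along the flow: ||theta_t - theta_0|| <= sqrt(t L(theta_0)).
    # The factor 1.01 absorbs the small discretization error of the Euler scheme.
    assert np.linalg.norm(theta - theta0) <= np.sqrt(t * loss(theta0)) * 1.01
```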

H SOME EXPERIMENTAL DETAILS H.1 IMPLEMENTATION OF CLASSIFICATION FOR META-RKHS-II

As we mentioned earlier, our proposed energy functional with closed-form adaptation cannot be directly applied to classification problems. We handle this challenge following Arora et al. (2019). For a $C$-class classification problem (with $C = d_y$), every data point $x$ is associated with a $C$-dimensional label vector $y$ obtained by shifting the one-hot encoding: the entry of the correct class is set to $(C-1)/C$ and the incorrect entries to $-1/C$. In the prediction, $Y^{tr}$ is replaced by the encodings of the training data, and $f_\theta(x)$ is replaced by $f_\theta(x)\,[1, \dots, 1] \in \mathbb{R}^{n \times d_y}$ for dimension consistency. At testing time, we compute the predicted encoding of the test data point and choose the position with the largest value as its predicted class.
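A minimal sketch of this label encoding and the argmax decoding (our own illustrative helper functions, written for the common "one-hot minus 1/C" convention of Arora et al. (2019); the function names are not from the paper):

```python
import numpy as np

def encode(labels, C):
    # Shifted one-hot encoding: (C-1)/C at the true class, -1/C elsewhere,
    # so every row sums to zero.
    Y = -np.ones((len(labels), C)) / C
    Y[np.arange(len(labels)), labels] = (C - 1) / C
    return Y

def decode(scores):
    # Predicted class = position with the largest value.
    return np.argmax(scores, axis=1)

Y = encode([0, 2, 1], C=3)
assert np.allclose(Y.sum(axis=1), 0.0)          # zero-mean rows
assert (decode(Y) == np.array([0, 2, 1])).all() # round-trip recovers the labels
```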

I EXTRA EXPERIMENTAL RESULTS

I.1 COMPARISON WITH RBF KERNEL

An interesting question is how other kernels perform without introducing extra model components or networks. We provide the results of using the RBF (Gaussian) kernel here: 42.1 ± 1.9 (5-way 1-shot) and 54.9 ± 1.1 (5-way 5-shot) on Mini-ImageNet, and 32.4 ± 2.0 (5-way 1-shot) and 38.2 ± 0.9 (5-way 5-shot) on FC-100. These results are worse than those of the NTK-based Meta-RKHS-II, demonstrating the advantage of using the NTK.

I.2 MORE RESULTS ON OUT-OF-DISTRIBUTION GENERALIZATION

We provide more results on the out-of-distribution generalization experiments here. From the results, we find that the proposed methods are more robust and generalize better to different datasets. We also show additional results on adversarial attacks in the following figures. Consistent with the results in the main text, our proposed methods are more robust to adversarial attacks. In a further experiment, we compare our proposed Meta-RKHS-I with Reptile, evaluating the trained models with different numbers of adaptation steps at testing time. The comparison is shown in Figure 7. As we can see, Meta-RKHS-I consistently obtains better results than Reptile, which supports our idea that the learned function should be close to the task-specific optimum and have a large functional gradient norm; these two conditions together enable fast adaptation.



† For ease of later notation, we write the gradient $\nabla_{\theta_i} L$ (and thus the parameters as well) as a row vector.



Figure 1: Performance of random initialized network and our methods. The models before/after adaptation are shown in dotted/dashed lines, samples used for adaptation are also shown in the figure.

REGRESSION Following Finn et al. (2017); Nichol et al. (2018), we first test our proposed methods on the 1-dimensional sine-wave regression problem. This problem is instructive: a model is trained on many different sine waves with different amplitudes and phases, and tested by adapting the trained model to new sine waves with only a few data points, using a fixed number of gradient-descent steps. Following Finn et al. (2017); Nichol et al. (2018), we use a fully-connected neural network with 2 hidden layers and the ReLU activation function. The results are shown in Figure 1.

Figure 2: Black-box attack on Mini-ImageNet and FC-100. Mini-ImageNet 5-way 1-shot (left), FC-100 5-way 1-shot (middle) and Mini-ImageNet 5-way 5-shot (right).

Figure 3: BPDA attack on Mini-ImageNet 5-way 5-shot (left) and FC-100 5-way 5-shot (right).

Figure 4: SPSA attack on Mini-ImageNet 5-way 5-shot (left) and FC-100 5-way 5-shot (right).

Figure 5: ∞ norm PGD attack on Mini-ImageNet and FC-100. Mini-ImageNet 5-way 5-shot (left), Mini-ImageNet 5-way 1-shot (middle) and FC-100 5-way 5-shot (right).

Figure 6: FC-100 5-way 5-shot Black-box attacks (left) and 5-way 1-shot PGD ∞ norm attack (right).

Figure 7: Reptile (dashed) vs. Meta-RKHS-I (solid) with different testing adaptation steps (x-axis).

Running-time comparison per iteration, with $C_1 = d_x p + Lp^2$ and $C_2 = d_x p + L d_x p^2$.

Few-shot classification results on Mini-ImageNet and FC-100.

Meta-RKHS-II with different time $t$ (columns correspond to different values of $t$, left to right):
Mini-ImageNet, 5-way 1-shot: 49.67 ± 2.23%, 48.27 ± 2.23%, 50.53 ± 2.09%, 49.13 ± 2.19%, 48.70 ± 2.28%
Mini-ImageNet, 5-way 5-shot: 64.51 ± 0.93%, 64.28 ± 0.98%, 65.40 ± 0.91%, 64.24 ± 1.06%, 64.95 ± 0.96%
FC-100, 5-way 1-shot: 36.50 ± 2.10%, 38.80 ± 2.32%, 41.20 ± 2.17%, 38.80 ± 2.21%, 37.60 ± 2.13%
FC-100, 5-way 5-shot: 48.35 ± 1.02%, 49.79 ± 1.04%, 51.36 ± 0.96%, 48.59 ± 1.09%, 49.48 ± 0.98%

Meta testing on different out-of-distribution datasets with model trained on Mini-ImageNet.

We also study the effect of the adaptation time $t$ (results shown in Table 4) and the impact of the network architecture with different numbers of CNN feature channels (results shown in the Appendix). It is interesting to see that a finite time (around $t = 10$) achieves the best accuracy, although the infinite-time case guarantees a stationary point. This indicates that a stationary point achieved with limited training data in the adaptation step is not always the best choice, because the limited training data might easily overfit the model, leading to worse test results.

A ALGORITHMS

Our proposed algorithms for meta-learning in the RKHS are summarized in Algorithm 1.

Algorithm 1 Meta-Learning in RKHS
Require: $p(T)$: distribution over tasks; randomly initialized neural network parameters $\theta$.
while not done do
  Sample a batch of tasks $\{T_m\}_{m=1}^B \sim p(T)$
  for all $T_m$ do
    Sample a batch of data points $D_m$, or sample two batches of data points $D^{tr}_m, D^{test}_m$.
  end for
  Evaluate the energy functional by equation 4 with $\{D_m\}_{m=1}^B$, or evaluate the energy functional by equation 7 with $\{D^{tr}_m, D^{test}_m\}_{m=1}^B$.
  Minimize the energy functional w.r.t. $\theta$.
end while
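The single-loop structure of Algorithm 1 can be sketched in a few lines. The example below is our own toy illustration, not the paper's implementation: 1-D linear-regression tasks stand in for the task distribution $p(T)$, a simple penalized task loss stands in for the energy functionals of equations 4 and 7, and all names, sizes, and constants are arbitrary choices. The point is the control flow — there is no inner-loop adaptation to differentiate through.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_task():
    # Toy task distribution p(T): 1-D linear regression with a random slope
    # (a stand-in for the sine-wave tasks); each task is a dataset (X, y).
    w = rng.uniform(-2, 2)
    X = rng.standard_normal((10, 1))
    return X, w * X[:, 0]

def energy(theta, tasks, lam=0.1):
    # Stand-in energy functional: average task loss plus a simple norm
    # penalty (NOT the RKHS fast-adaptation regularizer of the paper).
    total = 0.0
    for X, y in tasks:
        pred = X[:, 0] * theta[0] + theta[1]
        total += np.mean((pred - y) ** 2)
    return total / len(tasks) + lam * np.sum(theta ** 2)

def num_grad(f, theta, eps=1e-6):
    # Finite-difference gradient, to keep the sketch dependency-free.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps; tm[i] -= eps
        g[i] = (f(tp) - f(tm)) / (2 * eps)
    return g

theta = rng.standard_normal(2)
for step in range(200):                           # outer loop of Algorithm 1
    tasks = [sample_task() for _ in range(4)]     # sample a batch of tasks
    theta -= 0.05 * num_grad(lambda t: energy(t, tasks), theta)

final = float(np.mean([energy(theta, [sample_task()]) for _ in range(50)]))
```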

Meta testing on different out-of-distribution datasets with model trained on FC-100.

Few-shot classification results on Mini-ImageNet with different number of feature channels of 4 convolution layers.

Few-shot classification results on Mini-ImageNet with different numbers of feature channels of 5 convolution layers.
META-RKHS-II: 50.92 ± 2.16%, 66.45 ± 0.91%, 50.43 ± 2.42%, 64.17 ± 1.06%

5. CONCLUSION

We develop meta-learning in the RKHS and propose two practical algorithms that allow efficient adaptation in the function space, avoiding the complicated inner-loop adaptation of traditional methods. We show connections between our proposed methods and existing ones. Extensive experiments suggest that, compared to popular strong baselines, our methods are more effective, achieve better generalization, and are more robust to adversarial attacks and out-of-distribution shifts.

FUNDING

The research of the first and fifth authors was supported in part by NSF through grants CCF-1716400 and IIS-1910492. 

