META-LEARNING WITH NEURAL TANGENT KERNELS

Abstract

Model Agnostic Meta-Learning (MAML) has emerged as a standard framework for meta-learning, where a meta-model is learned with the ability to adapt quickly to new tasks. However, as a double-loop optimization problem, MAML needs to differentiate through the whole inner-loop optimization path at every outer-loop training step, which may lead to both computational inefficiency and sub-optimal solutions. In this paper, we generalize MAML to allow meta-learning to be defined in function spaces, and propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK). Within this paradigm, we introduce two meta-learning algorithms in the RKHS that no longer need the sub-optimal iterative inner-loop adaptation of the MAML framework. We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on NTK theory. Extensive experimental studies demonstrate the advantages of our paradigm over related meta-learning algorithms in both efficiency and quality of solutions. Another interesting feature of our proposed methods is that, as demonstrated in our experiments, they are more robust to adversarial attacks and out-of-distribution adaptation than popular baselines.

1. INTRODUCTION

Meta-learning (Schmidhuber, 1987) has made tremendous progress in the last few years. It aims to learn abstract knowledge from many related tasks so that fast adaptation to new and unseen tasks becomes possible. For example, in few-shot learning, meta-learning corresponds to learning a meta-model or meta-parameters that can quickly adapt to new tasks with a limited number of data samples. Among existing meta-learning methods, Model Agnostic Meta-Learning (MAML) (Finn et al., 2017) is perhaps one of the most popular and flexible ones, with a number of follow-up works such as (Nichol et al., 2018; Finn et al., 2018; Yao et al., 2019; Khodak et al., 2019a;b; Denevi et al., 2019; Fallah et al., 2020; Lee et al., 2020; Tripuraneni et al., 2020). MAML adopts a double-loop optimization framework, where adaptation is achieved by one or several gradient-descent steps in the inner-loop optimization. Such a framework can lead to undesirable issues related to computational inefficiency and sub-optimal solutions. The main reasons are that 1) it is computationally expensive to back-propagate through a stochastic-gradient-descent chain, and 2) it is hard to tune the number of inner-loop adaptation steps, as it may differ between training and testing. Several previous works tried to address these issues, but they can only alleviate them to a certain extent. For example, first-order MAML (FOMAML) (Finn et al., 2017) ignores the high-order terms of standard MAML, which speeds up training but may deteriorate performance; MAML with Implicit Gradient (iMAML) (Rajeswaran et al., 2019) directly minimizes the outer-loop objective without performing the inner-loop optimization, but it still needs an iterative solver to estimate the meta-gradient.
To better address these issues, we propose two algorithms that generalize meta-learning to the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK) (Jacot et al., 2018). In this RKHS, instead of using parameter adaptation, we propose to perform an implicit function adaptation. To this end, we introduce two algorithms that avoid explicit function adaptation: one replaces the function adaptation step in the inner loop with a new meta-objective with a fast-adaptive regularizer inspired by MAML; the other solves the adaptation problem analytically based on tools from the NTK so that the meta-objective can be directly evaluated on samples in closed form. When restricting the function space to be an RKHS, the solutions to the proposed two algorithms become conveniently solvable. In addition, we provide theoretical analysis of our proposed algorithms in the cases of using fully-connected neural networks and convolutional neural networks as the meta-model. Our analysis shows close connections between our methods and existing ones. In particular, we prove that one of our algorithms is closely related to MAML with some high-order terms ignored in the meta-objective function, thus enabling effective optimization. In summary, our main contributions are:
• We re-analyze the meta-learning problem and introduce two new algorithms for meta-learning in the RKHS. Different from all existing meta-learning algorithms, our proposed methods can be solved efficiently without cumbersome chain-based adaptations.
• We conduct theoretical analysis of the proposed algorithms, which suggests that they are closely related to existing MAML methods when fully-connected neural networks and convolutional neural networks are used as the meta-model.
• We conduct extensive experiments to validate our algorithms. Experimental results demonstrate the effectiveness of our proposed methods on standard few-shot learning, robustness to adversarial attacks, and out-of-distribution adaptation.
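To make the second ingredient above concrete: under the NTK linearization of a wide network trained with squared loss (Jacot et al., 2018; Lee et al., 2019), adaptation on a task's support set admits the standard kernel-regression closed form. The sketch below illustrates that generic formula only; the kernel function, the initialization function `f0`, and the ridge term `reg` are illustrative placeholders, not this paper's exact construction.

```python
import numpy as np

def ntk_adapt(kernel, X_tr, y_tr, f0, X_te, reg=1e-3):
    """Closed-form adaptation in the RKHS of `kernel`:
    f(x) = f0(x) + Theta(x, X_tr) [Theta(X_tr, X_tr) + reg*I]^{-1} (y_tr - f0(X_tr)).
    No iterative inner loop is needed: one linear solve replaces it."""
    K_tt = kernel(X_tr, X_tr)                 # Gram matrix on the support set
    K_et = kernel(X_te, X_tr)                 # cross-kernel, queries vs. support
    resid = y_tr - f0(X_tr)                   # what the initial function gets wrong
    coef = np.linalg.solve(K_tt + reg * np.eye(len(X_tr)), resid)
    return f0(X_te) + K_et @ coef
```

With a positive-definite kernel and a small ridge term, evaluating the adapted function on the support points nearly interpolates the support labels, which is the function-space analogue of running the inner loop to convergence.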

2. PRELIMINARIES

2.1. META-LEARNING

Meta-learning can be roughly categorized into black-box adaptation methods (Andrychowicz et al., 2016; Graves et al., 2014; Mishra et al., 2018), optimization-based methods (Finn et al., 2017), non-parametric methods (Vinyals et al., 2016; Snell et al., 2017; Triantafillou et al., 2020), and Bayesian meta-learning methods (Finn et al., 2018; Yoon et al., 2018; Ravi & Beatson, 2019). In this paper, we focus on the framework of Model Agnostic Meta-Learning (MAML) (Finn et al., 2017), which has two key components: meta initialization and fast adaptation. Specifically, MAML solves the meta-learning problem through a double-loop optimization procedure. In the inner loop, MAML runs a task-specific adaptation procedure to transform the meta-parameter θ into task-specific parameters {φ_m}_{m=1}^B for a total of B different tasks. In the outer loop, MAML minimizes the total loss Σ_{m=1}^B L(f_{φ_m}) with respect to the meta-parameter θ, where f_{φ_m} is the model adapted to task m, typically represented by a deep neural network. It is worth noting that one potential problem in MAML is computing the meta-gradient ∇_θ Σ_{m=1}^B L(f_{φ_m}): it requires differentiating through the whole inner-loop optimization path, which can be very inefficient.
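The double-loop structure and the cost of the meta-gradient can be made explicit in a toy setting. The sketch below is a minimal numpy illustration assuming a linear model, squared loss, and a single inner gradient step; the function names and the quadratic task losses are illustrative, not MAML's general form. For one inner step, the chain rule through the update φ_m = θ − α∇L_m(θ) introduces the factor (I − αH_m), where H_m is the Hessian of the task loss; with k inner steps, k such factors must be multiplied, which is exactly the expensive back-propagation chain discussed above (FOMAML corresponds to dropping these factors).

```python
import numpy as np

def loss(phi, X, y):
    # Squared error of a linear model f_phi(x) = x @ phi.
    return 0.5 * np.mean((X @ phi - y) ** 2)

def grad_loss(phi, X, y):
    return X.T @ (X @ phi - y) / len(y)

def maml_meta_gradient(theta, tasks, alpha=0.1):
    """One-step MAML meta-gradient, differentiating through the
    inner-loop update phi_m = theta - alpha * grad L_m(theta)."""
    meta_grad = np.zeros_like(theta)
    for (X_tr, y_tr, X_val, y_val) in tasks:
        phi = theta - alpha * grad_loss(theta, X_tr, y_tr)   # inner loop (1 step)
        g_outer = grad_loss(phi, X_val, y_val)               # dL_val/dphi
        # Chain rule through the update: dphi/dtheta = I - alpha * H_m(theta),
        # where H_m = X_tr^T X_tr / n is the (constant) Hessian of the quadratic loss.
        H = X_tr.T @ X_tr / len(y_tr)
        meta_grad += (np.eye(len(theta)) - alpha * H) @ g_outer
    return meta_grad / len(tasks)
```

Dropping the `(I - alpha * H)` factor and summing `g_outer` directly recovers the first-order approximation (FOMAML) in this toy setting.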

2.2. GRADIENT FLOW

Our proposed method relies on the concept of gradient flow. Generally speaking, a gradient flow is a continuous-time version of gradient descent. In the finite-dimensional parameter space, a gradient flow is defined by an ordinary differential equation (ODE), dθ_t/dt = −∇_{θ_t} F(θ_t), with a starting point θ_0 and a function F : R^d → R. The gradient flow is also known as the steepest-descent curve. One can generalize gradient flows to infinite-dimensional function spaces. Specifically, given a function space H, a functional F : H → R, and a starting point f_0 ∈ H, a gradient flow is similarly defined as the solution of df_t/dt = −∇_{f_t} F(f_t). This is a curve in the function space H. In this paper, we use the notation ∇_{f_t} F(f_t), instead of ∇_H F(f_t), to denote the general function derivative of the energy functional F with respect to the function f_t (Villani, 2008).
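The relationship between the flow and gradient descent can be seen by discretizing the ODE: an explicit Euler step of size η on dθ_t/dt = −∇F(θ_t) is exactly one gradient-descent update θ ← θ − η∇F(θ). A minimal sketch (the function names here are illustrative):

```python
import numpy as np

def gradient_flow_euler(grad_F, theta0, step=1e-2, n_steps=1000):
    """Explicit-Euler discretization of dtheta_t/dt = -grad_F(theta_t).
    Each Euler step is exactly one gradient-descent update of size `step`."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - step * grad_F(theta)
    return theta

# For F(theta) = 0.5 * ||theta||^2 the exact flow is theta_t = exp(-t) * theta_0,
# so with step * n_steps = 5 the iterate should land near exp(-5) * theta_0.
theta_T = gradient_flow_euler(lambda th: th, theta0=[2.0, -1.0],
                              step=1e-3, n_steps=5000)
```

As the step size shrinks (with step * n_steps fixed), the discrete iterates converge to the continuous-time curve, which is why gradient flow is the natural analysis tool for gradient-trained networks.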

2.3. THE NEURAL TANGENT KERNEL

Neural Tangent Kernel (NTK) is a recently proposed technique for characterizing the dynamics of a neural network under gradient descent (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019). NTK allows one to analyze deep neural networks (DNNs) in the RKHS induced by the NTK. One immediate benefit of this is that the loss functional in the function space is often convex, even when it is highly non-convex in the parameter space (Jacot et al., 2018)*. This property allows one to better understand the behavior of DNNs. Specifically, let f_θ be a DNN parameterized by θ. The corresponding NTK Θ is defined as Θ(x, x') = ⟨∇_θ f_θ(x), ∇_θ f_θ(x')⟩.

* Let H be the function space and F be the realization function for neural networks defined in Section 3.2. Note that even if a functional loss (e.g., the L2 loss) E : H → R is convex on H, the composition E ◦ F is in general not.
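Since the NTK is the Gram matrix of parameter gradients, Θ(x, x') = ⟨∇_θ f_θ(x), ∇_θ f_θ(x')⟩, it can be computed directly for a small network by stacking Jacobians. The sketch below does this for a toy one-hidden-layer ReLU network with scalar output; the architecture and function names are illustrative, not the networks analyzed in this paper.

```python
import numpy as np

def param_gradient(x, W, a):
    """Gradient of a one-hidden-layer network f(x) = a^T relu(W x)
    with respect to all parameters (W, a), flattened into one vector."""
    h = W @ x
    r = np.maximum(h, 0.0)
    # df/da = relu(W x);  df/dW_ij = a_i * 1[h_i > 0] * x_j
    dW = (a * (h > 0))[:, None] * x[None, :]
    return np.concatenate([dW.ravel(), r])

def empirical_ntk(X, W, a):
    """Empirical NTK: Theta(x, x') = <grad_theta f(x), grad_theta f(x')>."""
    J = np.stack([param_gradient(x, W, a) for x in X])  # one row per input
    return J @ J.T
```

Because it is a Gram matrix, the empirical NTK is always symmetric and positive semi-definite, which is what makes the induced RKHS well defined.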



* The first two authors contributed equally. Correspondence to Changyou Chen (changyou@buffalo.edu). † The research of the first and fifth authors was supported in part by NSF through grants CCF-1716400 and IIS-1910492.

