A LAZY APPROACH TO LONG-HORIZON GRADIENT-BASED META-LEARNING

Abstract

Gradient-based meta-learning relates task-specific models to a meta-model by gradients. By this design, an algorithm first optimizes the task-specific models in an inner loop and then backpropagates meta-gradients through the loop to update the meta-model. The number of inner-loop optimization steps has to be small (e.g., one step) to avoid high-order derivatives, large memory footprints, and the risk of vanishing or exploding meta-gradients. We propose an intuitive teacher-student scheme that enables gradient-based meta-learning algorithms to explore long horizons in the inner loop. The key idea is to employ a student network to adequately explore the search space of task-specific models (e.g., by more than ten steps); a teacher then takes a "leap" toward the regions probed by the student. The teacher not only arrives at a high-quality model but also defines a lightweight computation graph for meta-gradients. Our approach is generic; it performs well when applied to four meta-learning algorithms over three tasks: few-shot learning, long-tailed classification, and meta-attack.

1. INTRODUCTION

Humans can quickly learn the skills needed for new tasks by drawing on a fund of prior knowledge and experience. To grant machine learners this level of intelligence, meta-learning studies how to leverage past learning experiences to learn more efficiently on a new task (Vilalta & Drissi, 2002). A hallmark experimental design presents a meta-learner with a variety of few-shot learning tasks (meta-training) and then expects it to solve previously unseen yet related few-shot learning tasks (meta-test). This design enforces "learning to learn" because the few-shot training examples are insufficient for a learner to achieve high accuracy on any task in isolation.

Recent meta-learning methods hinge on deep neural networks. Some work learns a recurrent neural network as an update rule for a model (Ravi & Larochelle, 2016; Andrychowicz et al., 2016). Another line of methods transfers an attention scheme across tasks (Mishra et al., 2017; Vinyals et al., 2016a). Gradient-based meta-learning has gained momentum recently following the seminal work of Finn et al. (2017) on model-agnostic meta-learning (MAML), which learns a global model initialization from which a meta-learner can quickly derive task-specific models using a few training examples. At its core, MAML is a bilevel optimization problem (Colson et al., 2007). The upper level searches for the best global initialization, and the lower level optimizes individual models, which all share the common initialization, for particular tasks sampled from a task distribution. This problem is hard to solve. Finn et al. (2017) instead propose a "greedy" algorithm comprising two loops. The inner loop samples tasks and updates the task-specific models by k steps using the tasks' training examples; the k-step updates form a differentiable computation graph. The outer loop updates the common initialization by backpropagating meta-gradients through the computation graph.
This method is "greedy" in that the number of inner steps is often small (e.g., k = 1): the outer loop takes actions before the inner loop has sufficiently explored its search space. The greediness stems from practical constraints, as backpropagating meta-gradients through the inner loop incurs high-order derivatives, large memory footprints, and the risk of vanishing or exploding gradients. For the same reason, some related work also resorts to greedy strategies, such as meta-attack (Du et al., 2019) and learning to reweight examples (Ren et al., 2018b).

Given these constraints, it is natural to pose at least two questions. Would a less greedy gradient-based meta-learner (say, k > 10 inner-loop updates) achieve better performance? How can we make it less greedy? To answer these questions, we provide some preliminary results by introducing a lookahead optimizer (Zhang et al., 2019) into the inner loop. It is intuitive to describe it as a teacher-student scheme. We use a student neural network to adequately explore the search space for a given task (by a large number k of updates), and a teacher network then takes a "leap" toward the regions visited by the student. As a result, the teacher network not only arrives at a high-performing model but also defines a very lightweight computation graph for the outer loop. In contrast to the traditionally "greedy" meta-learning framework used in MAML (Finn et al., 2017), meta-attack (Du et al., 2019), learning to reweight examples (Ren et al., 2018b), etc., our approach has a "lazy" teacher: it sends a student to optimize for a task up to many steps and moves only once after that. Prior work (2019) proposed a less "greedy" MAML, with which this work shares a similar goal, but our approach improves the gradient-based meta-learning framework rather than a particular algorithm.
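The teacher-student scheme can be sketched on a toy quadratic task. This is a minimal NumPy illustration, not the paper's implementation; the function names (`student_explore`, `teacher_leap`) and the leap coefficient are assumptions for exposition only.

```python
import numpy as np

def student_explore(phi, grad_fn, k=20, alpha=0.1):
    """Student: adequately explore the task's loss surface by k gradient steps."""
    for _ in range(k):
        phi = phi - alpha * grad_fn(phi)
    return phi

def teacher_leap(theta, phi_student, leap=0.5):
    """Teacher: move only once, a 'leap' toward the region the student probed."""
    return theta + leap * (phi_student - theta)

# Toy task: L(phi) = 0.5 * ||phi - target||^2, so grad(phi) = phi - target.
target = np.array([1.0, -2.0])
grad_fn = lambda phi: phi - target

theta = np.zeros(2)                       # teacher / meta-model parameters
phi = student_explore(theta, grad_fn)     # long-horizon inner exploration (k = 20)
theta_new = teacher_leap(theta, phi)      # single "lazy" teacher update
```

Because the teacher moves only once per task, the outer loop only needs to differentiate through this single leap rather than through all k student steps.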
Hence, we evaluate it on different methods and tasks, including MAML and Reptile (Nichol et al., 2018) for few-shot learning, a two-component weighting algorithm (Jamal et al., 2020) for long-tailed classification, and meta-attack (Du et al., 2019). Extensive results provide an affirmative answer to the first question above: long-horizon exploration in the inner loop improves a meta-learner's performance. We expect our approach, along with the compelling experimental results, to facilitate future work addressing the second question.

2. "GREEDY" GRADIENT-BASED META-LEARNING

We first review gradient-based meta-learning from the perspective of "search space carving".

Notations. Let P_T denote a task distribution. For each task drawn from the distribution, T ∼ P_T, we have a training set D^tr and a validation set D^val, both in the form of {(x_1, y_1), (x_2, y_2), …}, where x_m and y_m are respectively an input and a label. We learn a predictive model for the task by minimizing an empirical loss L^T_{D^tr}(φ) (e.g., cross-entropy) over the training set while using the validation set to choose hyper-parameters (e.g., early stopping), where φ collects all trainable parameters of the model. Similarly, we denote by L^T_{D^val}(φ) the loss calculated over the validation set.

Meta-learning as "space carving". Instead of focusing on an isolated task, meta-learning takes a global view and introduces a meta-model, parameterized by θ, that can improve the learning efficiency for all individual tasks drawn from the task distribution P_T. The underlying idea is to derive a task-specific model φ from not only the training set D^tr but also the meta-model θ, i.e., φ ∈ M(θ, D^tr). We refer to M(θ, D^tr) as the "carved" search space for the task-specific model φ, where the "carving" function is realized as an attention module in (Vinyals et al., 2016a; Mishra et al., 2017), as a conditional neural process in (Garnelo et al., 2018; Gordon et al., 2020), as a gradient-based update rule in (Finn et al., 2017; Park & Oliva, 2019; Li et al., 2017; Nichol et al., 2018), and as a regularized optimization problem in (Rajeswaran et al., 2019; Zhou et al., 2019). An optimal meta-model θ* is supposed to yield the best task-specific models in expectation:

θ* ← arg min_θ E_{T∼P_T, D^val∼T} L^T_{D^val}(φ*(θ))   subject to   φ*(θ) ← arg min_{φ ∈ M(θ, D^tr)} L^T_{D^tr}(φ).   (1)

One can estimate the optimal meta-model θ* from some tasks and then use it to "carve" the search space, M(θ*, D^tr), for novel tasks' models.

Gradient-based meta-learning.
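The bilevel structure of equation (1) can be made concrete with a toy NumPy sketch, assuming quadratic task losses L(φ) = 0.5||φ − t||^2 and a "carving" function M that returns the models reachable by a few gradient steps from θ. All names (`carve`, `meta_objective`) and the choice of loss are illustrative, not from the paper.

```python
import numpy as np

def carve(theta, t_tr, alpha=0.1, k=3):
    """M(theta, D^tr): candidate models reachable by k gradient steps from theta."""
    phis, phi = [theta], theta
    for _ in range(k):
        phi = phi - alpha * (phi - t_tr)   # gradient of 0.5 * ||phi - t_tr||^2
        phis.append(phi)
    return phis

def meta_objective(theta, tasks):
    """Upper level of eq. (1): expected validation loss of the lower-level solution."""
    losses = []
    for t_tr, t_val in tasks:              # each task: (train target, val target)
        # Lower level: best candidate in the carved space on the training loss.
        phi_star = min(carve(theta, t_tr),
                       key=lambda p: 0.5 * np.sum((p - t_tr) ** 2))
        # Upper level: score the task-specific model on the validation loss.
        losses.append(0.5 * np.sum((phi_star - t_val) ** 2))
    return float(np.mean(losses))
```

A meta-model θ that already sits near the tasks' optima carves a space whose best candidate achieves low validation loss, which is exactly what the upper-level minimization over θ seeks.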
One of the notable meta-learning methods is MAML (Finn et al., 2017), which uses a gradient-based update rule to "carve" the search space for a task-specific model,

M_MAML(θ, D^tr) := {φ_0 ← θ} ∪ {φ_j | φ_j ← φ_{j−1} − α∇_φ L^T_{D^tr}(φ_{j−1}), j = 1, 2, …, k},   (2)

where the meta-model θ becomes the initialization φ_0 of the task-specific model, the other candidate models φ_1, …, φ_k are obtained by gradient descent, and α > 0 is a learning rate. Substituting equation (2) into equation (1), φ_k ∈ M_MAML(θ, D^tr) is naturally a solution to the lower-level optimization problem, and MAML solves the upper-level optimization problem by gradient descent,

θ ← θ − β E_{T∼P_T, D^val∼T} ∇_θ L^T_{D^val}(φ_k(θ)),
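For the quadratic toy loss L(φ) = 0.5||φ − t||^2, the meta-gradient through a single inner step (k = 1) can be written in closed form, which gives a minimal sketch of MAML's outer update. This is an illustrative NumPy example under that assumed loss, not the paper's or MAML's actual implementation; in practice the meta-gradient is computed by automatic differentiation through the inner loop.

```python
import numpy as np

# With L(phi) = 0.5 * ||phi - t||^2 the inner step (eq. 2, k = 1) is
#   phi_1 = theta - alpha * (theta - t_tr),
# and the chain rule through that step gives the meta-gradient
#   d L_val(phi_1) / d theta = (1 - alpha) * (phi_1 - t_val).
def maml_outer_step(theta, tasks, alpha=0.1, beta=0.5):
    """One outer-loop update of theta, averaging meta-gradients over tasks."""
    meta_grad = np.zeros_like(theta)
    for t_tr, t_val in tasks:                     # task = (train target, val target)
        phi = theta - alpha * (theta - t_tr)      # inner step on D^tr
        meta_grad += (1 - alpha) * (phi - t_val)  # backprop through the inner step
    return theta - beta * meta_grad / len(tasks)

theta = np.zeros(2)
t = np.array([1.0, 1.0])
theta_new = maml_outer_step(theta, [(t, t)])
```

With k inner steps the closed form would instead involve the product of k Jacobians, which is precisely the source of the high-order derivatives and memory cost discussed above.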

