A LAZY APPROACH TO LONG-HORIZON GRADIENT-BASED META-LEARNING

Abstract

Gradient-based meta-learning relates task-specific models to a meta-model by gradients. By this design, an algorithm first optimizes the task-specific models in an inner loop and then backpropagates meta-gradients through the loop to update the meta-model. The number of inner-loop optimization steps has to be small (e.g., one step) to avoid high-order derivatives, large memory footprints, and the risk of vanishing or exploding meta-gradients. We propose an intuitive teacher-student scheme that enables gradient-based meta-learning algorithms to explore long horizons in the inner loop. The key idea is to employ a student network to adequately explore the search space of task-specific models (e.g., by more than ten steps); a teacher then takes a "leap" toward the regions probed by the student. The teacher not only arrives at a high-quality model but also defines a lightweight computation graph for meta-gradients. Our approach is generic; it performs well when applied to four meta-learning algorithms over three tasks: few-shot learning, long-tailed classification, and meta-attack.

1. INTRODUCTION

Humans can quickly learn the skills needed for new tasks by drawing from a fund of prior knowledge and experience. To grant machine learners this level of intelligence, meta-learning studies how to leverage past learning experiences to learn a new task more efficiently (Vilalta & Drissi, 2002). A hallmark experiment design provides a meta-learner with a variety of few-shot learning tasks (meta-training) and then asks it to solve previously unseen yet related few-shot learning tasks (meta-test). This design enforces "learning to learn" because the few-shot training examples are insufficient for a learner to achieve high accuracy on any task in isolation. Recent meta-learning methods hinge on deep neural networks. Some work learns a recurrent neural network as an update rule for a model (Ravi & Larochelle, 2016; Andrychowicz et al., 2016). Another line of methods transfers an attention scheme across tasks (Mishra et al., 2017; Vinyals et al., 2016a). Gradient-based meta-learning has gained momentum recently, following the seminal work of Finn et al. (2017) on model-agnostic meta-learning (MAML), which learns a global model initialization from which a meta-learner can quickly derive task-specific models using a few training examples. At its core, MAML is a bilevel optimization problem (Colson et al., 2007). The upper level searches for the best global initialization, and the lower level optimizes individual models, all sharing the common initialization, for particular tasks sampled from a task distribution. This problem is hard to solve exactly. Finn et al. (2017) instead propose a "greedy" algorithm, which comprises two loops. The inner loop samples tasks and updates the task-specific models by k steps using the tasks' training examples. The k-step updates form a differentiable computation graph. The outer loop updates the common initialization by backpropagating meta-gradients through the computation graph.
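The two-loop structure can be made concrete with a toy example. The sketch below is a minimal, hypothetical illustration (not the paper's method) using a scalar linear model and squared-error loss: the inner loop takes k gradient steps on a task's training split, and the meta-gradient with respect to the initialization is propagated through those steps via the chain rule in closed form. A real implementation would rely on automatic differentiation; all function names here are illustrative.

```python
import numpy as np

def loss(w, x, y):
    # squared error of a scalar linear model y_hat = w * x
    return np.mean((w * x - y) ** 2)

def grad(w, x, y):
    # dL/dw for the squared-error loss above
    return np.mean(2.0 * (w * x - y) * x)

def hess(x):
    # d2L/dw2 (constant for this quadratic loss)
    return np.mean(2.0 * x * x)

def maml_meta_gradient(w0, task, alpha=0.1, k=3):
    """Meta-gradient through k inner steps for a scalar model.

    task = (x_tr, y_tr, x_val, y_val). The inner loop adapts w on the
    training split; `jac` accumulates dw_k/dw_0, i.e., the derivative
    of the adapted weight w.r.t. the initialization.
    """
    x_tr, y_tr, x_val, y_val = task
    w_i, jac = w0, 1.0
    for _ in range(k):
        # each inner step multiplies the Jacobian by (1 - alpha * Hessian)
        jac *= 1.0 - alpha * hess(x_tr)
        w_i = w_i - alpha * grad(w_i, x_tr, y_tr)
    # chain rule: dL_val(w_k)/dw_0 = L_val'(w_k) * dw_k/dw_0
    return grad(w_i, x_val, y_val) * jac, w_i

# outer loop: sample tasks, backprop through the inner loop, update w0
rng = np.random.default_rng(0)
w0 = 0.0
for _ in range(100):
    slope = rng.normal(loc=2.0)          # each task is a random slope
    xs = rng.normal(size=(2, 5))
    task = (xs[0], slope * xs[0], xs[1], slope * xs[1])
    g, _ = maml_meta_gradient(w0, task)
    w0 -= 0.1 * g                        # meta-update of the initialization
```

Even in this toy setting, each extra inner step multiplies the meta-gradient by another (1 - alpha * Hessian) factor, which hints at why long horizons cause vanishing or exploding meta-gradients in practice.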
This method is "greedy" in that the number of inner steps is often small (e.g., k = 1): the outer loop takes actions before the inner loop has sufficiently explored its search space. The greediness stems from practical constraints: backpropagating meta-gradients through the inner loop incurs high-order derivatives, large memory footprints, and the risk of vanishing or exploding gradients. For the same reason, some related work also resorts to greedy strategies, such as meta-attack (Du et al., 2019) and learning to reweight examples (Ren et al., 2018b).

