MODEL AGNOSTIC META-LEARNING ON TREES

Abstract

In meta-learning, the knowledge learned from previous tasks is transferred to new ones, but this transfer only works if tasks are related, and sharing information between unrelated tasks might hurt performance. A fruitful approach is to share gradients across similar tasks during training, and recent work suggests that the gradients themselves can be used as a measure of task similarity. We study the case in which the datasets associated with different tasks have a hierarchical, tree structure. While a few methods for hierarchical meta-learning have been proposed in the past, ours is the first algorithm that is model-agnostic: a simple extension of MAML. As in MAML, our algorithm adapts the model to each task with a few gradient steps, but the adaptation follows the tree structure: at each step, gradients are pooled across task clusters, and subsequent steps follow down the tree. We test the algorithm on linear and non-linear regression on synthetic data, and show that it significantly improves over MAML. Interestingly, the algorithm performs best when it does not know the tree structure of the data in advance.

1. INTRODUCTION

Deep learning models require a large amount of data in order to perform well when trained from scratch. When data is scarce for a given task, we can transfer the knowledge gained on a source task to quickly learn a related target task. The field of Multi-task Learning studies how to learn multiple tasks simultaneously, with a single model, by taking advantage of task relationships (Ruder, 2017; Zhang & Yang, 2018). However, in Multi-task Learning the set of tasks is fixed in advance, and the models do not generalize to new tasks. The field of Meta-learning is inspired by the ability of humans to quickly learn new tasks by using the knowledge of previously learned ones. Meta-learning has seen widespread use in multiple domains, especially in recent years after the advent of Deep Learning (Hospedales et al., 2020). However, there is still a lack of methods for sharing information across tasks in meta-learning models, and the goal of our work is to fill this gap. In particular, a successful model for meta-learning, MAML (Finn et al., 2017), does not diversify task relationships according to their similarity, and it is unclear how to modify it for that purpose. In this work, we contribute the following:

• We propose a novel modification of MAML to account for a hierarchy of tasks. The algorithm uses the tree structure of the data during adaptation, by pooling gradients across tasks at each adaptation step; subsequent steps follow down the tree (see Figure 1a).

• We introduce new benchmarks for testing a hierarchy of tasks in meta-learning, on a variety of synthetic non-linear (sinusoidal) and multidimensional linear regression tasks.

• We compare our algorithm to MAML and to a baseline model trained on all tasks without any meta-learning algorithm applied.
We show that our algorithm outperforms both of these models on the sinusoidal regression task and on the newly introduced synthetic task, because it exploits the hierarchical structure of the data.
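The tree-structured adaptation described in the contributions above can be illustrated with a minimal sketch. Here `task_grad` and the per-level cluster assignments are hypothetical placeholders, assumed given for illustration; they are not the paper's implementation:

```python
import numpy as np

def tree_adapt(omega, task_grad, clusters_per_step, alpha=0.1):
    """Tree-structured adaptation: at step k, every task takes a gradient
    step using the gradient pooled (averaged) over its cluster at depth k.
    clusters_per_step[k][i] is the cluster id of task i at depth k; the
    assignments refine from the root (one cluster) down to the leaves.
    Returns one adapted parameter vector per task."""
    n_tasks = len(clusters_per_step[0])
    params = [np.asarray(omega, dtype=float).copy() for _ in range(n_tasks)]
    for assignment in clusters_per_step:            # one step per tree level
        grads = [task_grad(i, params[i]) for i in range(n_tasks)]
        for cid in set(assignment):
            members = [i for i, c in enumerate(assignment) if c == cid]
            pooled = np.mean([grads[i] for i in members], axis=0)  # pool gradients
            for i in members:
                params[i] = params[i] - alpha * pooled
    return params

# Toy usage: two tasks, one shared root step, then per-task leaf steps.
targets = [np.array([0.0]), np.array([2.0])]
grad = lambda i, p: p - targets[i]   # hypothetical per-task gradient
adapted = tree_adapt(np.array([0.0]), grad, [[0, 0], [0, 1]])
```

Tasks in the same cluster receive the same pooled update, so their parameters only diverge once the assignment splits them at a deeper level of the tree.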

3. THE META-LEARNING PROBLEM

We follow the notation of Hospedales et al. (2020). We assume the existence of a distribution over tasks τ and, for each task, a distribution over data points D and a loss function L. The loss function of the meta-learning problem, L_meta, is defined as an average across both distributions of tasks and data points:

$$\mathcal{L}_{meta}(\omega) = \mathbb{E}_{\tau}\,\mathbb{E}_{D|\tau}\,\mathcal{L}_{\tau}\big(\theta_{\tau}(\omega); D\big)$$

The goal of meta-learning is to minimize the loss function with respect to a vector of meta-parameters ω. The vector of parameters θ is task-specific and depends on the meta-parameters ω. Different meta-learning algorithms correspond to a different choice of θ_τ(ω); we describe below the choice made by TreeMAML, the algorithm proposed in this study. During meta-training, the loss is evaluated on a sample of m tasks and a sample of n_v validation data points for each task:

$$\mathcal{L}_{meta}(\omega) = \frac{1}{m\,n_v} \sum_{i=1}^{m} \sum_{j=1}^{n_v} \mathcal{L}_{\tau_i}\big(\theta_{\tau_i}(\omega); D_{ij}\big)$$

For each task i, the parameters θ_τi are learned from a set of n_t training data points, distinct from the validation data. During meta-testing, a new (target) task is given and the parameters θ are learned from a set of n_r target data points. In this work, we also use a batch of training data points to adapt θ at test time. No training data is used to compute the final performance of the model, which is computed on separate test data of the target task.
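The meta-loss above can be computed directly once a task-adaptation rule θ_τ(ω) is chosen. Below is a minimal sketch on hypothetical linear-regression tasks; the helper names (`task_loss`, `meta_loss`, `make_task`) are illustrative assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def task_loss(theta, X, y):
    """Mean squared error of a linear model with parameters theta."""
    return float(np.mean((X @ theta - y) ** 2))

def meta_loss(adapt, omega, tasks):
    """L_meta(omega): validation loss averaged over tasks, after adapting
    the meta-parameters omega to each task via `adapt` (i.e. theta_tau(omega))."""
    losses = []
    for X_tr, y_tr, X_val, y_val in tasks:
        theta = adapt(omega, X_tr, y_tr)      # task-specific parameters
        losses.append(task_loss(theta, X_val, y_val))
    return float(np.mean(losses))

def make_task(w):
    """Toy task: y = X w, with separate training and validation samples."""
    X_tr, X_val = rng.normal(size=(10, 2)), rng.normal(size=(10, 2))
    return X_tr, X_tr @ w, X_val, X_val @ w

tasks = [make_task(np.array([1.0, 2.0])), make_task(np.array([1.5, 1.5]))]
identity_adapt = lambda omega, X, y: omega    # no adaptation: theta = omega
print(meta_loss(identity_adapt, np.zeros(2), tasks))
```

Swapping `identity_adapt` for the K-step adaptation of MAML or TreeMAML yields the corresponding meta-objective.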

3.1. MAML

MAML aims at finding the optimal initial condition ω from which a good parameter set can be found, separately for each task, after K gradient steps (Finn et al. (2017)). For task i, we define the single gradient step with learning rate α as

$$U_i(\omega) = \omega - \frac{\alpha}{n_t} \sum_{j=1}^{n_t} \nabla \mathcal{L}(\omega; D_{ij})$$
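For concreteness, the inner update U_i(ω) can be sketched for mean-squared-error linear regression; the function names here are illustrative assumptions:

```python
import numpy as np

def grad_mse(theta, X, y):
    """Gradient of the training loss (1/n_t) * sum_j (x_j . theta - y_j)^2."""
    return (2.0 / len(y)) * X.T @ (X @ theta - y)

def inner_step(omega, X, y, alpha=0.01):
    """U_i(omega): one gradient step from the shared initialization omega,
    using the gradient averaged over the n_t training points of task i."""
    return omega - alpha * grad_mse(omega, X, y)

def adapt(omega, X, y, alpha=0.01, K=5):
    """MAML's K-step task adaptation: apply the inner step K times."""
    theta = omega
    for _ in range(K):
        theta = inner_step(theta, X, y, alpha)
    return theta
```

In MAML the outer loop then differentiates the validation loss at the adapted parameters back through these K steps with respect to ω.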



2. RELATED WORK

Multi-task Learning is fundamentally different from Meta-learning, as it does not consider the problem of generalizing to new tasks (Hospedales et al. (2020)). Recent work includes Zamir et al. (2018), who study a large number of computer vision tasks and quantify the transfer between all pairs of tasks. Achille et al. (2019) propose a novel measure of task representation, assigning an importance score to each model parameter in each task. The score is based on the gradients of each task's loss function with respect to each model parameter. This work suggests that gradients can be used as a measure of task similarity, and we use this insight in our proposed algorithm.

In the context of Meta-learning, a few papers on the problem of learning and using task relationships have been published in the past months. The model of Yao et al. (2019) applies hierarchical clustering to task representations learned by an autoencoder, and uses those clusters to adapt the parameters to each task. The model of Liu et al. (2019) maps the classes of each task into the edges of a graph; it meta-learns relationships between classes and how to allocate new classes by using a graph neural network with attention. However, these algorithms are not model-agnostic: they have a fixed backbone and loss function, and are thus difficult to apply to new problems. Instead, we design our algorithm as a simple generalization of Model-Agnostic Meta-Learning (MAML, Finn et al. (2017)), and it can be applied to any loss function and backbone.

A couple of studies looked into modifying MAML to account for task similarities. The work of Jerfel et al. (2019) finds a different initial condition for each cluster of tasks, and applies the algorithm to the problem of continual learning. The work of Katoch et al. (2020) defines parameter updates for a task by aggregating gradients from other tasks according to their similarity. However, in contrast with our algorithm, neither of these models is hierarchical: tasks are clustered at one level only and cannot be represented by a tree structure. As far as we know, ours is the first model-agnostic meta-learning algorithm that can be applied to a tree structure of tasks.

