OPTIMAL ALLOCATION OF DATA ACROSS TRAINING TASKS IN META-LEARNING

Abstract

Meta-learning models transfer the knowledge acquired from previous tasks to quickly learn new ones. They are tested on benchmarks with a fixed number of data points for each training task, and this number is usually arbitrary, for example, 5 instances per class in few-shot classification. It is unknown how the performance of meta-learning is affected by the distribution of data across training tasks. Since labeling data is expensive, finding the optimal allocation of labels across training tasks may reduce costs. Given a fixed budget b of labels to distribute across tasks, should we use a small number of highly labeled tasks, or many tasks with few labels each? For MAML applied to mixed linear regression, we prove that the optimal number of tasks follows the scaling law √b. We develop an online algorithm for data allocation across tasks, and show that the same scaling law applies to nonlinear regression. We also show preliminary experiments on few-shot image classification. Our work provides a theoretical guide for allocating labels across tasks in meta-learning, which we believe will prove useful in a large number of applications.

1. INTRODUCTION

Deep learning (DL) models require a large amount of data to perform well when trained from scratch, but labeling data is expensive and time-consuming. An effective approach to avoid the costs of collecting and labeling large amounts of data is transfer learning: train a model on one big dataset, or a few related datasets that are already available, and then fine-tune the model on the target dataset, which can be of much smaller size (Donahue et al. (2014)). In this context, there has been a recent surge of interest in the field of meta-learning, which is inspired by the ability of humans to learn how to learn (Hospedales et al. (2020)). A model is meta-trained on a large number of tasks, each characterized by a small dataset, and meta-tested on the target dataset.

The number of data points per task is usually set to an arbitrary number in standard meta-learning benchmarks. For example, in few-shot image classification benchmarks, such as mini-ImageNet (Vinyals et al. (2017), Ravi & Larochelle (2017)) and CIFAR-FS (Bertinetto et al. (2019)), this number is usually set to 1 or 5. So far, there has not been any reason to optimize this number, as in most circumstances the performance of a model improves with the number of data points (see Nakkiran et al. (2019) for exceptions). However, if the total number of labels across training tasks is limited, is it better to have a large number of tasks with very little data in each, or a relatively smaller number of highly labeled tasks? Since data labeling is costly, the answer to this question may inform the design of new meta-learning datasets and benchmarks.

In this work, to our knowledge, we answer this question for the first time, for a specific meta-learning algorithm: MAML (Finn et al. (2017)). We study the problem of optimizing the number of meta-training tasks, with a fixed budget b of total data points to distribute across tasks.
We study the application of MAML to three datasets: mixed linear regression, sinusoid regression, and CIFAR. In the case of mixed linear regression, we derive an approximation for the meta-test loss, according to which the optimal number of tasks follows the scaling rule √b. In order to optimize the number of tasks empirically, we design an algorithm for online allocation of data across training tasks, and we validate the algorithm by performing a grid search over a large set of possible allocations. In summary, our contributions are:

• We introduce and formalize the problem of optimizing data allocation with a fixed budget b in meta-learning.
• We prove that the optimal scaling of the number of tasks is √b in mixed linear regression, and confirm this scaling empirically in nonlinear regression.
• We introduce an algorithm for online allocation of data across tasks, to find the optimal number of tasks during meta-training, and validate the algorithm by grid search.
• We perform preliminary experiments on few-shot image classification.
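The √b scaling law above translates into a simple allocation rule: given a budget b, use roughly √b tasks with roughly √b labels each. The sketch below is a minimal illustration; the rounding and remainder handling are our own choices, not part of the theoretical analysis:

```python
import math

def allocate_budget(b):
    """Split a label budget b into (num_tasks, labels_per_task)
    following the sqrt(b) scaling law for the number of tasks.

    Illustrative sketch only: rounding and discarding the remainder
    are arbitrary choices made here for simplicity.
    """
    m = max(1, round(math.sqrt(b)))  # optimal number of tasks scales as sqrt(b)
    n = b // m                       # labels per task; remainder is discarded
    return m, n
```

For example, a budget of 10,000 labels would be split into 100 tasks of 100 labels each, rather than, say, 2,000 five-shot tasks.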

2. RELATED WORK

A couple of recent papers investigated a problem similar to ours. In the context of meta-learning and mixed linear regression, Kong et al. (2020) ask whether many tasks with small data can compensate for a lack of tasks with big data. However, they do not address the problem of finding the optimal number of tasks within a fixed budget. The work of Shekhar et al. (2020) studies exactly the problem of allocating a fixed budget of data points, but for estimating a finite set of discrete distributions; therefore they do not study the meta-learning problem, and their data has no labels.

An alternative approach to avoid labeling a large amount of data is active learning, where a model learns with fewer labels by accurately selecting which data to learn from (Settles (2010)). In the context of meta-learning, the option of implementing active learning has been considered in a few recent studies (Bachman et al. (2017), Garcia & Bruna (2018), Kim et al. (2018), Finn et al. (2019), Requeima et al. (2020)). However, they considered the active labeling of data within a given task, for the purpose of improving performance in that task only. Instead, we ask how data should be distributed across tasks.

In the context of recommender systems and text classification, a few studies considered whether labeling a data point, within a given task, may increase performance not only in that task but also in all other tasks. This problem has been referred to as multi-task active learning (Reichart et al. (2008), Zhang (2010), Saha et al. (2011), Harpale (2012), Fang et al. (2017)), or multi-domain active learning (Li et al. (2012), Zhang et al.). However, none of these studies consider the problem of meta-learning with a fixed budget. A few studies have looked into actively choosing the next task in a sequence of tasks (Ruvolo & Eaton (2013), Pentina et al. (2015), Pentina & Lampert (2017), Sun et al. (2018)), but they do not look at how to distribute data across tasks.

3. THE PROBLEM OF DATA ALLOCATION FOR META-LEARNING

In the cross-task setting, we are presented with a hierarchically structured dataset, with task parameters $(\tau^{(i)})_{i=1}^{m}$ sampled from $T \sim p(\tau)$ and data $(x_j^{\tau})_{j=1}^{n_\tau}$ sampled from $D_\tau := (D \mid T) \sim p(x \mid T = \tau)$. Our problem is minimizing the following loss function with respect to a parameter $\omega$:

$$\mathcal{L}(\omega) = \mathbb{E}_T\, \mathbb{E}_{D_\tau}\!\left[ L(\omega; x^\tau) \right] \qquad (1)$$

The empirical risk minimization principle (see Vapnik (1998)) ensures that the optimum of the empirical risk converges to that of the true risk as the number of samples from the joint distribution of $(D, T)$ increases.
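As a concrete illustration, the snippet below estimates this loss by Monte Carlo for a toy mixed linear regression setup. The specific distributions (standard Gaussian task slopes and inputs, small Gaussian noise) and the squared loss are assumptions chosen here for illustration, not the paper's exact specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(omega, m=50, n=20):
    """Monte Carlo estimate of L(omega) = E_T E_{D_tau}[ L(omega; x_tau) ]
    for a toy mixed linear regression: y = tau * x + noise.

    Illustrative sketch: tau ~ N(0,1) and x ~ N(0,1) are assumptions,
    not the paper's exact setup.
    """
    losses = []
    for _ in range(m):                      # tasks tau^(i) ~ p(tau)
        tau = rng.normal()                  # task parameter (slope)
        x = rng.normal(size=n)              # data x_j^tau ~ p(x | T = tau)
        y = tau * x + 0.1 * rng.normal(size=n)
        losses.append(np.mean((omega * x - y) ** 2))  # per-task squared loss
    return float(np.mean(losses))           # average over tasks and data
```

Note that a single shared $\omega$ cannot fit all tasks at once here, which is exactly what motivates the task-specific adaptation step of the next section.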

3.1. META-LEARNING ACROSS TASKS

In the meta-learning problem, we are given the opportunity to adjust the objective function to each task. This adjustment is given by the adaptation step of meta-learning (Hospedales et al. (2020)), which represents a task-dependent transformation of the parameters $\omega$, which we refer to as $\theta_\tau(\omega)$. The loss function $\mathcal{L}_{\text{meta}}$ is defined as an average over both the distribution of tasks and that of data points. The goal of meta-learning is to minimize the loss function with respect to a vector of meta-parameters $\omega$:

$$\mathcal{L}_{\text{meta}}(\omega) = \mathbb{E}_T\, \mathbb{E}_{D_\tau}\!\left[ L_\tau(\theta_\tau(\omega); x^\tau) \right] \qquad (2)$$
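In MAML, the adaptation map $\theta_\tau(\omega)$ is one (or a few) gradient steps on the task's support data, and the meta-loss is evaluated on held-out query data after adaptation. The one-dimensional sketch below illustrates this; the linear model, step size, and support/query split are assumptions for illustration, not the paper's experimental setup:

```python
import numpy as np

def adapt(omega, x_support, y_support, alpha=0.1):
    """One inner gradient step: theta_tau(omega) = omega - alpha * dL/domega,
    where L(omega) = mean((omega * x - y)^2). Illustrative 1-D sketch."""
    grad = np.mean(2.0 * (omega * x_support - y_support) * x_support)
    return omega - alpha * grad

def meta_loss(omega, tasks, alpha=0.1):
    """Estimate L_meta(omega): average query loss after per-task adaptation.

    `tasks` is a list of (x_support, y_support, x_query, y_query) tuples;
    the support/query split is an assumption made here for illustration.
    """
    losses = []
    for x_s, y_s, x_q, y_q in tasks:
        theta = adapt(omega, x_s, y_s, alpha)          # task-specific parameters
        losses.append(np.mean((theta * x_q - y_q) ** 2))  # loss after adaptation
    return float(np.mean(losses))
```

Setting `alpha=0` disables adaptation and recovers the non-adaptive loss of the previous section, so comparing the two directly shows the benefit of the inner step.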




