THEORETICAL BOUNDS ON ESTIMATION ERROR FOR META-LEARNING

Abstract

Machine learning models have traditionally been developed under the assumption that the training and test distributions match exactly. However, recent successes in few-shot learning and related problems are encouraging signs that these models can be adapted to more realistic settings where the training and test distributions differ. Unfortunately, theoretical support for these algorithms is severely limited, and little is known about the difficulty of these problems. In this work, we provide novel information-theoretic lower bounds on minimax rates of convergence for algorithms that are trained on data from multiple sources and tested on novel data. Our bounds depend intuitively on the information shared between sources of data, and characterize the difficulty of learning in this setting for arbitrary algorithms. We demonstrate these bounds on a hierarchical Bayesian model of meta-learning, computing both upper and lower bounds on parameter estimation via maximum a posteriori inference.

1. INTRODUCTION

Many practical machine learning applications deal with distributional shift from training to testing. One example is few-shot classification (Ravi & Larochelle, 2016; Vinyals et al., 2016), where new classes need to be learned at test time based on only a few examples of each novel class. Recently, few-shot classification has seen increased success; however, the theoretical properties of this problem remain poorly understood. In this paper we analyze the meta-learning setting, where the learner is given access to samples from a set of meta-training distributions, or tasks. At test time, the learner is exposed to only a small number of samples from some novel task. The meta-learner aims to uncover a useful inductive bias from the original samples, which allows it to learn the new task more efficiently (note that this definition encompasses few-shot learning). While some progress has been made towards understanding the generalization performance of specific meta-learning algorithms (Amit & Meir, 2017; Khodak et al., 2019; Bullins et al., 2019; Denevi et al., 2019; Cao et al., 2019), little is known about the difficulty of the meta-learning problem in general. Existing work has studied generalization upper bounds for novel data distributions (Ben-David et al., 2010; Amit & Meir, 2017), yet to our knowledge, the inherent difficulty of these tasks relative to the i.i.d. case has not been characterized.

In this work, we derive novel bounds for meta-learners. We first present a general information-theoretic lower bound, Theorem 1, which we use to derive bounds in particular settings. Using this result, we derive lower bounds in terms of the number of training tasks, the data per training task, and the data available from a novel target task. Additionally, we provide a specialized analysis for the case where the space of learning tasks is only partially observed, proving that infinitely many training tasks or infinite data per training task are insufficient to achieve zero minimax risk (Corollary 2).
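The two-stage setting above can be made concrete with a toy scalar example. The sketch below is purely illustrative and not the paper's model: tasks are Gaussian mean-estimation problems whose means are drawn around a shared center, stage one estimates that center from the meta-training data, and stage two shrinks the scarce novel-task average toward it. All variable names and the known-variance assumption are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, m = 20, 50, 5          # training tasks, samples per task, novel-task samples
tau, sigma = 1.0, 2.0        # task spread and observation noise (assumed known)
center = 3.0                 # shared environment parameter (unknown to the learner)

tasks = center + tau * rng.normal(size=t)                # per-task means
train = tasks[:, None] + sigma * rng.normal(size=(t, n))
novel_mean = center + tau * rng.normal()
novel = novel_mean + sigma * rng.normal(size=m)

# Stage 1: extract an inductive bias (here, an estimate of the shared center).
bias = train.mean()

# Stage 2: adapt to the novel task, shrinking the m-sample average
# toward the learned bias with the Bayes-optimal weight.
w = tau**2 / (tau**2 + sigma**2 / m)
estimate = w * novel.mean() + (1 - w) * bias

print(abs(estimate - novel_mean), abs(novel.mean() - novel_mean))
```

The interplay between t, n, and m in this toy problem mirrors the quantities our lower bounds are stated in.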
We then derive upper and lower bounds for a particular meta-learning setting. In recent work, Grant et al. (2018) recast the popular meta-learning algorithm MAML (Finn et al., 2017) in terms of inference in a hierarchical Bayesian model. Following this, we provide a theoretical analysis of a hierarchical Bayesian model for meta-linear-regression. We compute sample complexity bounds for posterior inference under empirical Bayes (Robbins, 1956) in this model and compare them to our predicted lower bounds in the minimax framework. Furthermore, through an asymptotic analysis of the error rate of the MAP estimator, we identify crucial features of the meta-learning environment that are necessary for novel-task generalization.

Published as a conference paper at ICLR 2021.

Our primary contributions can be summarized as follows:
• We introduce novel lower bounds on the minimax risk of parameter estimation in meta-learning.
• Through these bounds, we compare the relative utility of samples from meta-training tasks and the novel task, and emphasize the importance of the relationship between the tasks.
• We provide novel upper bounds on the error rate for estimation in a hierarchical meta-linear-regression problem, which we verify through an empirical evaluation.
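As a rough sketch of the meta-linear-regression pipeline (under assumptions we choose for illustration: task weights drawn as N(mu, tau^2 I), Gaussian observation noise, and known variances), empirical Bayes first estimates the shared prior mean from the meta-training tasks, and the novel-task MAP estimate is then a ridge-style solution shrunk toward that mean. This is not the paper's exact estimator, only a minimal version of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tasks, n_per_task, n_novel = 5, 50, 40, 8
sigma, tau = 0.5, 1.0          # noise std and prior std (assumed known)
mu_true = rng.normal(size=d)   # shared prior mean over task parameters

def sample_task(n):
    theta = mu_true + tau * rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ theta + sigma * rng.normal(size=n)
    return X, y, theta

# Stage 1 (empirical Bayes): estimate mu by averaging per-task OLS solutions.
ols = []
for _ in range(n_tasks):
    X, y, _ = sample_task(n_per_task)
    ols.append(np.linalg.lstsq(X, y, rcond=None)[0])
mu_hat = np.mean(ols, axis=0)

# Stage 2: MAP estimate on a novel few-shot task; the Gaussian prior makes
# this a ridge regression shrunk toward mu_hat rather than toward zero.
X, y, theta_star = sample_task(n_novel)
A = X.T @ X / sigma**2 + np.eye(d) / tau**2
b = X.T @ y / sigma**2 + mu_hat / tau**2
theta_map = np.linalg.solve(A, b)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.linalg.norm(theta_map - theta_star),
      np.linalg.norm(theta_ols - theta_star))
```

With only n_novel = 8 samples in d = 5 dimensions, the MAP estimate typically benefits from the learned prior mean, which is the qualitative behavior our upper bounds quantify.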

2. RELATED WORK

An early version of this work (Lucas et al., 2019) presented a restricted version of Theorem 1. The current version includes significantly more content, including more general lower bounds and corresponding upper bounds in a hierarchical Bayesian model of meta-learning (Section 5).

Baxter (2000) introduced a formulation for inductive bias learning in which the learner is embedded in an environment of multiple tasks. The learner must find a hypothesis space that enables good generalization, on average, over tasks within the environment, using finite samples. In our setting, the learner is not explicitly tasked with finding a reduced hypothesis space but instead learns using a general two-stage approach, which matches the standard meta-learning paradigm (Vilalta & Drissi, 2002). In the first stage an inductive bias is extracted from the data, and in the second stage the learner forms an estimate using data from a novel task distribution.

Further, we focus on bounding the minimax risk of meta-learners. Under minimax risk, an optimal learner achieves minimum error on the hardest learning problem in the environment. While the average-case risk of meta-learners is more commonly studied, recent work has turned attention towards the minimax setting (Kpotufe & Martinet, 2018; Hanneke & Kpotufe, 2019; 2020; Mousavi Kalan et al., 2020; Mehta et al., 2012). The worst-case error in meta-learning is particularly important in safety-critical systems, for example in medical diagnosis. Mousavi Kalan et al. (2020) study the minimax risk of transfer learning. In their setting, the learner is provided with a large amount of data from a single source task and is tasked with generalizing to a target task with a limited amount of data. They assume relatedness between tasks by imposing closeness in parameter space (whereas in our setting, we assume closeness in distribution via KL divergence). They prove only lower bounds, but notably generalize beyond the linear setting to single-layer neural networks.
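The flavor of minimax argument used throughout can be illustrated with the standard local Fano method for a 1-D Gaussian location family (a textbook construction, not this paper's theorem): given M parameters that are pairwise 2*delta-separated with pairwise KL divergence between the induced sample distributions at most beta, the minimax risk is at least delta * (1 - (beta + log 2) / log M). The function and constants below are our own illustrative choices.

```python
import numpy as np

def fano_lower_bound(thetas, n, sigma):
    """Local Fano bound for n i.i.d. samples from N(theta, sigma^2), theta in a 1-D packing."""
    thetas = np.asarray(thetas, dtype=float)
    M = len(thetas)
    gaps = np.abs(thetas[:, None] - thetas[None, :])
    delta = gaps[gaps > 0].min() / 2.0                 # points are 2*delta-separated
    # KL between the n-sample product measures: n * (theta - theta')^2 / (2 sigma^2)
    beta = n * gaps.max() ** 2 / (2 * sigma**2)
    return delta * (1.0 - (beta + np.log(2)) / np.log(M))

# Example: 8 means spaced eps apart; shrinking eps ~ 1/sqrt(n) keeps the
# hypotheses close in KL while remaining separated in parameter space.
n, sigma = 100, 1.0
eps = 0.2 * sigma / np.sqrt(n)
print(fano_lower_bound([j * eps for j in range(8)], n, sigma))
```

The tension visible here, separation in parameter space versus closeness in KL, is exactly the form of task-relatedness our lower bounds exploit.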
There is a large volume of prior work studying upper bounds on generalization error in multi-task environments (Ben-David & Borbely, 2008; Ben-David et al., 2010; Pentina & Lampert, 2014; Amit & Meir, 2017; Mehta et al., 2012). While the approaches in these works vary, one common factor is the need to characterize task-relatedness. Broadly, these approaches either assume a shared distribution for sampling tasks (Baxter, 2000; Pentina & Lampert, 2014; Amit & Meir, 2017), or a measure of distance between distributions (Ben-David & Borbely, 2008; Ben-David et al., 2010; Mohri & Medina, 2012). Our lower bounds utilize a weak form of task-relatedness, assuming that the environment contains a finite set of tasks that is suitably separated in parameter space but close in KL divergence; this set of assumptions also arises often when computing i.i.d. minimax lower bounds (Loh, 2017).

One practical approach to meta-learning is learning a linear mapping on top of a learned feature space. Prototypical Networks (Snell et al., 2017) effectively learn a discriminative embedding function and perform linear classification on top of it using the novel task data. Analyzing these approaches is challenging due to their metric-learning-inspired objectives (which require non-i.i.d. sampling) and the simultaneous learning of feature mappings and top-level linear functions, though some progress has been made (Jin et al., 2009; Saunshi et al., 2019; Wang et al., 2019; Du et al., 2020). Maurer (2009), for example, explores linear models fitted over a shared linear feature map in a Hilbert space. Our results can be applied in these settings if a suitable packing of the representation space is defined.

Other approaches to meta-learning aim to parameterize learning algorithms themselves.
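The prototype-based classification rule mentioned above is simple enough to sketch. The embedding below is a fixed random feature map standing in for the learned network (a real Prototypical Network trains it end-to-end on meta-training episodes); the data, dimensions, and `embed` function are all our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_way, k_shot, dim, feat = 3, 5, 2, 8

# Stand-in for a learned embedding: fixed random ReLU features.
W = rng.normal(size=(dim, feat)) / np.sqrt(dim)

def embed(x):
    return np.maximum(x @ W, 0.0)

# Few-shot support set: n_way classes with k_shot examples each.
centers = rng.normal(scale=3.0, size=(n_way, dim))
support = centers[:, None, :] + rng.normal(scale=0.3, size=(n_way, k_shot, dim))

# Prototype = mean embedded support point per class; queries are assigned to
# the nearest prototype under squared Euclidean distance in embedding space.
prototypes = embed(support).mean(axis=1)                      # (n_way, feat)

def classify(queries):
    z = embed(queries)                                        # (m, feat)
    d2 = ((z[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

queries = centers + rng.normal(scale=0.3, size=(n_way, dim))  # one query per class
print(classify(queries))
```

Nearest-prototype assignment with squared Euclidean distance is equivalent to a linear classifier over the embedding, which is why a packing of the representation space suffices to apply our bounds here.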






Traditionally, this has been achieved by hyper-parameter tuning (Rasmussen & Nickisch, 2010; MacKay et al., 2019), but recent fully parameterized optimizers also show promising performance in deep neural network optimization (Andrychowicz et al., 2016), few-shot learning (Ravi & Larochelle, 2016), unsupervised learning (Metz et al., 2019), and reinforcement learning (Duan et al., 2016). Yet another

