THEORETICAL CHARACTERIZATION OF THE GENERALIZATION PERFORMANCE OF OVERFITTED META-LEARNING

Abstract

Meta-learning has arisen as a successful method for improving training performance by training over many similar tasks, especially with deep neural networks (DNNs). However, the theoretical understanding of when and why overparameterized models such as DNNs can generalize well in meta-learning is still limited. As an initial step towards addressing this challenge, this paper studies the generalization performance of overfitted meta-learning under a linear regression model with Gaussian features. In contrast to a few recent studies along the same line, our framework allows the number of model parameters to be arbitrarily larger than the number of features in the ground-truth signal, and hence naturally captures the overparameterized regime of practical deep meta-learning. We show that the overfitted min ℓ2-norm solution of model-agnostic meta-learning (MAML) can be beneficial, which is similar to the recent remarkable findings on the "benign overfitting" and "double descent" phenomena in classical (single-task) linear regression. However, due to the unique aspects of meta-learning, such as the task-specific gradient-descent inner training and the diversity/fluctuation of the ground-truth signals among training tasks, we find new and interesting properties that do not exist in single-task linear regression. We first provide a high-probability upper bound (of reasonable tightness) on the generalization error, where certain terms decrease as the number of features increases. Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large. Under these circumstances, we show that the overfitted min ℓ2-norm solution can achieve an even lower generalization error than the underparameterized solution.

1. INTRODUCTION

Meta-learning is designed to learn a task by utilizing the training samples of many similar tasks, i.e., learning to learn (Thrun & Pratt, 1998). With deep neural networks (DNNs), the success of meta-learning has been demonstrated empirically by many works, e.g., (Antoniou et al., 2018; Finn et al., 2017). However, theoretical results on why DNNs generalize well in meta-learning are still limited. Although DNNs have so many parameters that they can completely fit all training samples from all tasks, it is unclear why such an overfitted solution can still generalize well, which seems to defy the classical understanding of the bias-variance tradeoff (Bishop, 2006; Hastie et al., 2009; Stein, 1956; James & Stein, 1992; LeCun et al., 1991; Tikhonov, 1943). Recent studies on the "benign overfitting" and "double descent" phenomena in classical (single-task) linear regression have brought new insights on the generalization performance of overfitted solutions. Specifically, "benign overfitting" and "double descent" describe the phenomenon that the test error descends again in the overparameterized regime of the linear regression setup (Belkin et al., 2018; 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019; Ju et al., 2020; Mei & Montanari, 2019). Depending on the setting, the shape and properties of the descent curve of the test error can differ dramatically. For example, Ju et al. (2020) showed that the min ℓ1-norm overfitted solution has a very different descent curve from the min ℓ2-norm overfitted solution. A more detailed review of this line of work can be found in Appendix A. Compared to classical (single-task) linear regression, model-agnostic meta-learning (MAML) (Finn et al., 2017; Finn, 2018), a popular algorithm for meta-learning, differs in many aspects. First, the training process of MAML involves task-specific gradient-descent inner training and outer training over all tasks.
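As background for the single-task discussion above, the min ℓ2-norm solution that interpolates the training data can be computed with the Moore-Penrose pseudoinverse. The following is a minimal numerical sketch of that interpolator (not the meta-learning estimator analyzed in this paper); the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 100  # n samples, p features (overparameterized: p > n)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Min ℓ2-norm interpolator: w = X^+ y, where X^+ is the
# Moore-Penrose pseudoinverse of the feature matrix.
w = np.linalg.pinv(X) @ y

# The solution fits the training data exactly (zero training error);
# among all interpolators, it has the smallest ℓ2 norm.
assert np.allclose(X @ w, y)
```

The question raised by "benign overfitting" is when such a zero-training-error solution nevertheless achieves low test error.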
Second, there are new parameters to consider in meta-training, such as the number of tasks and the diversity/fluctuation of the ground truth of each training task. These distinctions imply that we cannot directly apply the existing analyses of "benign overfitting" and "double descent" in single-task linear regression to meta-learning. Thus, it is still unclear whether meta-learning exhibits a similar "double descent" phenomenon, and if so, how the shape of the descent curve of overfitted meta-learning is affected by the system parameters. A few recent works have studied the generalization performance of meta-learning. In Bernacchia (2021), the expected value of the test error of the overfitted min ℓ2-norm solution of MAML is provided for the asymptotic regime where the number of features and the number of training samples go to infinity. In Chen et al. (2022), a high-probability upper bound on the test error of a similar overfitted min ℓ2-norm solution is given in the non-asymptotic regime. However, the eigenvalues of the weight matrices that appear in the bound of Chen et al. (2022) are coupled with system parameters such as the number of tasks, samples, and features, so their bound is not fully expressed in terms of the scaling orders of those parameters. This makes it hard to explicitly characterize the shape of the double descent or to analyze the tightness of the bound. In Huang et al. (2022), the authors focus on the generalization error during the SGD process when the training error is not zero, which differs from our focus on overfitted solutions that drive the training error to zero (i.e., interpolators). (A more comprehensive introduction to related works can be found in Appendix A.)
All of these works set the number of model features in meta-learning equal to the number of true features, and hence cannot be used to analyze the shape of the double-descent curve, which requires the number of features used in the learning model to change freely without affecting the ground truth (just like the setup used in many works on single-task linear regression, e.g., Belkin et al. (2020); Ju et al. (2020)). To fill this gap, we study the generalization performance of overfitted meta-learning, especially in quantifying how the test error changes with the number of features. As an initial step towards the DNN setup, we consider the overfitted min ℓ2-norm solution of MAML using a linear model with Gaussian features. We first quantify the error caused by the one-step gradient adaptation for the test task, with which we provide useful insights on 1) practically choosing the step size for the test task and quantifying the gap with the optimal (but impractical) choice, and 2) how overparameterization affects the noise error and the task-diversity error in the test task. We then provide an explicit high-probability upper bound (of reasonable tightness) on the error caused by meta-training over the training tasks (which we call "model error") in the non-asymptotic regime where all parameters are finite. With this upper bound and simulation results, we confirm benign overfitting in meta-learning by comparing the model error of the overfitted solution with that of the underfitted solution. We further characterize some interesting properties of the descent curve. For example, we show that the descent is easier to observe when the noise and the task diversity are large, and that the curve sometimes has a descent floor. In contrast to classical (single-task) linear regression, where the double-descent phenomenon critically depends on non-zero noise, we show that meta-learning can still exhibit double descent even under zero noise, as long as the task diversity is non-zero.
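To make the role of the task-specific inner training concrete, the sketch below implements one inner gradient step on the squared loss for a linear model, as in MAML's adaptation stage. The function name, step size, and dimensions are our own illustrative choices under these assumptions, not the paper's notation.

```python
import numpy as np

def inner_adapt(w, X_task, y_task, alpha):
    """One inner (task-specific) gradient step on the squared loss
    L(w) = (1/2m) * ||X_task @ w - y_task||^2 for a linear model."""
    m = X_task.shape[0]
    grad = X_task.T @ (X_task @ w - y_task) / m
    return w - alpha * grad

rng = np.random.default_rng(1)
p, m = 5, 10
w_meta = np.zeros(p)                    # meta-initialization (illustrative)
X = rng.standard_normal((m, p))         # Gaussian features of one task
y = X @ rng.standard_normal(p)          # responses of that task
w_adapted = inner_adapt(w_meta, X, y, alpha=0.1)

# One inner step reduces the task's training loss relative to
# the meta-initialization (for a suitably small step size).
assert np.sum((X @ w_adapted - y) ** 2) < np.sum((X @ w_meta - y) ** 2)
```

MAML's outer training then updates the meta-initialization `w_meta` based on the post-adaptation losses across all training tasks; at test time, the same one-step adaptation is applied to the new task's samples.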

2. SYSTEM MODEL

In this section, we introduce the system model of meta-learning along with the related notation. For ease of reading, we also summarize our notation in Table 2 in Appendix B.

2.1. DATA GENERATION MODEL

We adopt a meta-learning setup with linear tasks first studied in Bernacchia (2021), as well as in a few recent follow-up works (Chen et al., 2022; Huang et al., 2022). However, in their formulation, the number of features and the number of true model parameters are the same. Such a setup fails to capture the prominent effect of overparameterized models in practice, where the number of model parameters can be far larger than the number of true features.
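To illustrate the kind of data generation described here, the following sketch draws linear tasks with i.i.d. Gaussian features, where each task's ground truth fluctuates around a common signal and responses are corrupted by noise. All variable names and values (e.g., the fluctuation level `nu` and the dimensions) are illustrative assumptions, not the paper's formal model.

```python
import numpy as np

rng = np.random.default_rng(2)

s = 5        # number of true features in the ground-truth signal
m = 10       # training samples per task
T = 4        # number of training tasks
sigma = 0.1  # noise level
nu = 0.5     # task diversity: fluctuation of each task's ground truth

w_bar = rng.standard_normal(s)  # common ground-truth signal shared by tasks

tasks = []
for _ in range(T):
    w_t = w_bar + nu * rng.standard_normal(s)         # task-specific signal
    X_t = rng.standard_normal((m, s))                 # i.i.d. Gaussian features
    y_t = X_t @ w_t + sigma * rng.standard_normal(m)  # noisy linear responses
    tasks.append((X_t, y_t))
```

The learning model may use a number of features larger than `s`; decoupling the two is what allows the double-descent analysis in this paper.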

