META-LEARNING WITH NEGATIVE LEARNING RATES

Abstract

Deep learning models require a large amount of data to perform well. When data is scarce for a target task, we can transfer the knowledge gained by training on similar tasks to quickly learn the target. A successful approach is meta-learning, or learning to learn a distribution of tasks, where learning to learn is represented by an outer loop and learning by an inner loop of gradient descent. However, a number of recent empirical studies argue that the inner loop is unnecessary and that simpler models work equally well or even better. We study the performance of MAML as a function of the learning rate of the inner loop, where a zero learning rate implies that there is no inner loop. Using random matrix theory and exact solutions of linear models, we calculate an algebraic expression for the test loss of MAML applied to mixed linear regression and nonlinear regression with overparameterized models. Surprisingly, while the optimal learning rate for adaptation is positive, we find that the optimal learning rate for training is always negative, a setting that has never been considered before. Therefore, not only does performance increase when the learning rate is reduced to zero, as suggested by recent work, but it increases even further when the learning rate becomes negative. These results help clarify under what circumstances meta-learning performs best.

1. INTRODUCTION

Deep learning models represent the state-of-the-art in several machine learning benchmarks (LeCun et al. (2015)), and their performance does not seem to stop improving when adding more data and computing resources (Rosenfeld et al. (2020), Kaplan et al. (2020)). However, they require a large amount of data and compute to start with, which are often not available to practitioners. The approach of fine-tuning has proved very effective to address this limitation: pre-train a model on a source task, for which a large dataset is available, and use this model as the starting point for a quick additional training (fine-tuning) on the small dataset of the target task (Pan & Yang (2010), Donahue et al. (2014), Yosinski et al. (2014)). This approach is popular because pre-trained models are often made available by institutions that have the resources to train them. In some circumstances, multiple source tasks are available, all of which have scarce data, as opposed to a single source task with abundant data. This case is addressed by meta-learning, in which a model gains experience over multiple source tasks and uses it to improve its learning of future target tasks. The idea of meta-learning is inspired by the ability of humans to generalize across tasks, without having to train on any single task for a long time. A meta-learning problem is solved by a bi-level optimization procedure: an outer loop optimizes meta-parameters across tasks, while an inner loop optimizes parameters within each task (Hospedales et al. (2020)). The idea of meta-learning has gained some popularity, but a few recent papers argue that a simple alternative to meta-learning, in which the inner loop is removed entirely, is just good enough (Chen et al. (2020a), Tian et al. (2020), Dhillon et al. (2020), Chen et al. (2020b), Raghu et al. (2020)). Other studies find the opposite (Goldblum et al. (2020), Collins et al. (2020), Gao & Sener (2020)).
It is hard to resolve the debate because there is little theory available to explain these findings. In this work, using random matrix theory and exact solutions of linear models, we derive an algebraic expression for the average test loss of MAML, a simple and successful meta-learning algorithm (Finn et al. (2017)), as a function of its hyperparameters. In particular, we study its performance as a function of the inner-loop learning rate during meta-training. Setting this learning rate to zero is equivalent to removing the inner loop, as advocated by recent work (Chen et al. (2020a), Tian et al. (2020), Dhillon et al. (2020), Chen et al. (2020b), Raghu et al. (2020)). Surprisingly, we find that the optimal learning rate is negative, thus performance can be increased by reducing the learning rate below zero. In particular, we find the following:

• In the problem of mixed linear regression, we prove that the optimal learning rate is always negative in overparameterized models. The same result holds in underparameterized models provided that the optimal learning rate is small in absolute value. We validate the theory by running extensive experiments.

• We extend these results to the case of nonlinear regression and wide neural networks, in which the output can be approximated by a linear function of the parameters (Jacot et al. (2018), Lee et al. (2019)). While in this case we cannot prove that the optimal learning rate is always negative, preliminary experiments suggest that the result holds in this case as well.

2. RELATED WORK

The field of meta-learning includes a broad range of problems and solutions; see Hospedales et al. (2020) for a recent review focusing on neural networks and deep learning. In this context, meta-learning received increased attention in the past few years: several new benchmarks have been introduced, and a large number of algorithms and models have been proposed to solve them (Vinyals et al. (2017), Bertinetto et al. (2019), Triantafillou et al. (2020)). Despite the surge in empirical work, theoretical work is still lagging behind. Similar to our work, a few other studies used random matrix theory and exact solutions to calculate the average test loss for the problem of linear regression (Advani & Saxe (2017), Hastie et al. (2019), Nakkiran (2019)). To our knowledge, our study is the first to apply this technique to the problem of meta-learning with multiple tasks. Our results reduce to those of linear regression in the case of one single task. Furthermore, we are among the first to apply the framework of the Neural Tangent Kernel (Jacot et al. (2018), Lee et al. (2019)) to the problem of meta-learning (a few papers appeared after our submission: Yang & Hu (2020), Wang et al. (2020a), Zhou et al. (2021)). Similar to us, a few theoretical studies looked at the problem of mixed linear regression in the context of meta-learning. In Denevi et al. (2018) and Bai et al. (2021), a meta-parameter is used to bias the task-specific parameters through a regularization term. Kong et al. (2020) look at whether many tasks with small data can compensate for a lack of tasks with big data. Tripuraneni et al. (2020) and Du et al. (2020) study the sample complexity of representation learning. However, none of these studies look into the effect of the learning rate on performance, which is our main focus. In this work, we focus on MAML, a simple and successful meta-learning algorithm (Finn et al. (2017)).
A few theoretical studies have investigated MAML, looking at: universality of the optimization algorithm (Finn & Levine (2018)), Bayesian inference interpretation (Grant et al. (2018)), proof of convergence (Ji et al. (2020)), difference between convex and non-convex losses (Saunshi et al. (2020)), global optimality (Wang et al. (2020b)), and the effect of the inner loop (Collins et al. (2020)).

3. META-LEARNING AND MAML

In this work, we follow the notation of Hospedales et al. (2020) and we use MAML (Finn et al. (2017)) as the meta-learning algorithm. We assume the existence of a distribution of tasks τ and, for each task, a loss function L_τ and a distribution of data points D_τ = {x_τ, y_τ}, with input x_τ and label y_τ. We assume that the loss function is the same for all tasks, L_τ = L, but each task is characterized by a different distribution of the data. The empirical meta-learning loss is evaluated on a sample of m tasks, and a sample of n_v validation data points for each task:

L_meta(ω; D^t, D^v) = (1 / m n_v) Σ_{i=1}^{m} Σ_{j=1}^{n_v} L( θ(ω; D^{t(i)}); x_j^{v(i)}, y_j^{v(i)} )   (1)

The training set D^{t(i)} = {x_j^{t(i)}, y_j^{t(i)}}_{j=1:n_t} and validation set D^{v(i)} = {x_j^{v(i)}, y_j^{v(i)}}_{j=1:n_v} are drawn independently from the same distribution in each task i. The function θ represents the adaptation of the meta-parameter ω, evaluated on the training set. Different meta-learning algorithms correspond to different choices of θ; below we describe the choice made by MAML (Eq.3), the subject of this study. During meta-training, the loss of Eq.1 is optimized with respect to the meta-parameter ω, usually by stochastic gradient descent, starting from an initial point ω_0. The optimum is denoted by ω*(D^t, D^v). This optimization is referred to as the outer loop, while the computation of θ is referred to as the inner loop of meta-learning. During meta-testing, a new (target) task is given and θ adapts on a set D^r of n_r target data points. The final performance of the model is computed on test data D^s of the target task. Therefore, the test loss is equal to

L_test = L_meta( ω*(D^t, D^v); D^r, D^s )   (2)

In MAML, the inner loop corresponds to a few steps of gradient descent, with a given learning rate α_t. In this work we consider the simple case of a single gradient step:

θ(ω; D^{t(i)}) = ω − (α_t / n_t) Σ_{j=1}^{n_t} ∂L/∂θ ( ω; x_j^{t(i)}, y_j^{t(i)} )   (3)

If the learning rate α_t is zero, then parameters are not adapted during meta-training and θ(ω) = ω.
In that case, a single set of parameters is learned across all data and there is no inner loop. However, it is important to note that a distinct learning rate α_r is used during meta-testing. A setting similar to this has been advocated in a few recent studies (Chen et al. (2020a), Tian et al. (2020), Dhillon et al. (2020), Chen et al. (2020b), Raghu et al. (2020)). We show that, intuitively, the optimal learning rate at meta-testing (adaptation) time α_r is always positive. Surprisingly, in the family of problems considered in this study, we find that the optimal learning rate during meta-training α_t is instead negative. We note that the setting α_t = 0 effectively does not use the n_t training data points, so we could in principle add this data to the validation set; we do not consider this option here, since we are interested in a wide range of possible values of α_t rather than the specific case α_t = 0.
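For concreteness, the one-step MAML objective of Eqs.1 and 3 can be sketched in a few lines of numpy for a squared loss. This is an illustrative sketch, not the authors' code; the helper names (`inner_step`, `meta_loss`) and the task representation are our own choices.

```python
import numpy as np

def inner_step(omega, X_t, y_t, alpha_t):
    """One inner-loop gradient step (Eq.3) on the single-task squared loss
    L = |y - X theta|^2 / 2 n_t, starting from the meta-parameter omega."""
    n_t = len(y_t)
    grad = X_t.T @ (X_t @ omega - y_t) / n_t  # gradient of the training loss at omega
    return omega - alpha_t * grad

def meta_loss(omega, tasks, alpha_t):
    """Empirical meta-learning loss (Eq.1): for each task, adapt on the training
    set, then evaluate the adapted parameters on the validation set."""
    total, count = 0.0, 0
    for X_t, y_t, X_v, y_v in tasks:
        theta = inner_step(omega, X_t, y_t, alpha_t)
        total += np.sum((y_v - X_v @ theta) ** 2) / 2
        count += len(y_v)
    return total / count
```

With alpha_t = 0 the inner step is the identity and `meta_loss` reduces to the plain validation loss at omega, which is the "no inner loop" setting discussed above.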

4. MIXED LINEAR REGRESSION

We study MAML applied to the problem of mixed linear regression. Note that the goal here is not to solve the problem of mixed linear regression, but to probe the performance of MAML as a function of its hyperparameters. In mixed linear regression, each task is characterized by a different linear function, and a model is evaluated by the mean squared error loss function. We assume a generative model in the form of y = xᵀw + z, where x is the input vector (of dimension p), y is the output (scalar), z is noise (scalar), and w is a vector of generating parameters (of dimension p); therefore p represents both the number of parameters and the input dimension. All distributions are assumed Gaussian:

w ∼ N(w_0, (ν²/p) I_p),   x ∼ N(0, I_p),   y|x, w ∼ N(xᵀw, σ²)   (4)

where I_p is the p × p identity matrix, σ is the label noise, w_0 is the task mean and ν represents the task variability. Different meta-training tasks i correspond to different draws of the generating parameters w^{(i)}, while the parameters for the meta-testing task are denoted by w*. We denote by superscripts t, v, r, s the training, validation, target and test data, respectively. A graphical model of data generation is shown in Figure 1. Using random matrix theory and exact solutions of linear models, we calculate the test loss as a function of the following hyperparameters: the number of training tasks m, the number of data points per task for training (n_t), validation (n_v) and target (n_r), and the learning rates for training α_t and for adaptation to the target α_r. Furthermore, we have the hyperparameters specific to the mixed linear regression problem: p, ν, σ, w_0. Since we use exact solutions to the linear problem, our approach is equivalent to running the outer loop optimization until convergence (see section 7.1 in the Appendix for details). We derive results in two cases: overparameterized, p > n_v m, and underparameterized, p < n_v m.
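The generative model above is straightforward to simulate. A minimal sketch (the function name `sample_task` and the seeding are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(p, n, w0, nu, sigma):
    """Draw one mixed-linear-regression task (section 4):
    w ~ N(w0, nu^2/p I_p),  x ~ N(0, I_p),  y = x^T w + z,  z ~ N(0, sigma^2)."""
    w = w0 + rng.normal(0.0, nu / np.sqrt(p), size=p)  # task-specific parameters
    X = rng.normal(size=(n, p))                        # each row is one input vector
    y = X @ w + rng.normal(0.0, sigma, size=n)         # noisy linear labels
    return X, y, w
```

Drawing w once per task and resampling (X, z) per dataset mirrors the data-generation process used for the training, validation, target and test sets.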

5. RESULTS

5.1. OVERPARAMETERIZED CASE

In the overparameterized case, the number of parameters p is larger than the total number of validation data across tasks, n_v m. In this case, since the data does not fully constrain the parameters, the optimal value of ω found during meta-training depends on the initial condition used for optimization, which we call ω_0.

Theorem 1. Consider the algorithm of section 3 (MAML one-step) and the data generating model of section 4 (mixed linear regression). Let p > n_v m. Let p(ξ) and n_t(ξ) be any functions of order O(ξ) as ξ → ∞. Let |ω_0 − w_0| be of order O(ξ^{−1/4}). Then the test loss of Eq.2, averaged over the entire data distribution (see Eq.27 in the Appendix), is equal to

L_test = (σ²/2)(1 + α_r² p/n_r) + h_r [ (ν²/2)(1 + n_v m/p) + (1/2)(1 − n_v m/p)|ω_0 − w_0|² + (σ² n_v m / 2p)(1 + α_t² p/n_t)/h_t ] + O(ξ^{−3/2})   (5)

where we define the following expressions:

h_t = (1 − α_t)² + α_t² (p + 1)/n_t   (6)
h_r = (1 − α_r)² + α_r² (p + 1)/n_r   (7)

Proof. The proof of this Theorem can be found in the Appendix, sections 7.3 and 7.3.1.

The loss always increases with the output noise σ and the task variability ν. Overfitting is expressed in Eq.5 by the term |ω_0 − w_0|, the distance between the initial condition ω_0 of the optimization and the ground-truth mean w_0 of the generating model. Adding more validation data n_v and tasks m may increase or decrease the loss, depending on the size of this term relative to the noise (Nakkiran (2019)), as does reducing the number of parameters p. However, the loss always decreases with the number of data points n_r for the target task, since that data only affects the adaptation step. Our main focus is how the loss is affected by the learning rates, α_t during training and α_r during adaptation. The loss is a quadratic and convex function of α_r, therefore it has a unique minimum.
While it is possible to compute the optimal value of α_r from Eq.5, here we just note that the loss is a sum of two quadratic functions of α_r, one with a minimum at α_r = 0 and the other with a minimum at α_r = 1/(1 + (p + 1)/n_r); therefore the optimal learning rate lies between these two values and is always positive. This is intuitive, since a positive learning rate for adaptation moves the parameters closer to the optimum for the target task. An example of the loss as a function of the adaptation learning rate α_r is shown in Figure 2a, where we also show the results of experiments in which we run MAML empirically. The good agreement between theory and experiment suggests that Eq.5 is accurate. However, the training learning rate α_t shows the opposite behaviour: by taking the derivative of Eq.5 with respect to α_t, it is possible to show that the loss has a unique absolute minimum at a negative value of α_t. This can be proved by noting that, as a function of α_t, the loss tends to the same finite value for large positive or negative α_t, its derivative is always positive at α_t = 0, and it has one minimum (−) and one maximum (+) at the values

α_t^± = −(n_t + 1)/(2p) ± sqrt( ((n_t + 1)/(2p))² + n_t/p )   (8)

Note that the argmax α_t^+ is always positive, while the argmin α_t^− is always negative. This result is counter-intuitive, since a negative learning rate pushes parameters towards higher values of the loss. However, learning of the meta-parameter ω is performed by the outer loop (minimizing Eq.1), for which there is no learning rate, since we use the exact solution to the linear problem and thus effectively train to convergence. Therefore, it is not obvious a priori whether the inner loop (Eq.3) should push parameters towards higher or lower values of the loss. An example of the loss as a function of the training learning rate α_t is shown in Figure 2b, where we also show the results of experiments in which we run MAML empirically.
Here the theory slightly underestimates the experimental loss, but the overall shapes of the curves are in good agreement, again suggesting that Eq.5 is accurate. Additional experiments are shown in the Appendix, Figure 6.

Figure 2: Average test loss as a function of the learning rates. Hyperparameters: n_t = 30, n_v = 2, n_r = 20, m = 3, p = 60, σ = 1, ν = 0.5, ω_0 = 0, w_0 = 0. In panel a) we set α_t = 0.2; in panel b) we set α_r = 0.2. In the experiments, each run is evaluated on 100 test tasks of 50 data points each, and each point is an average over 100 runs (a) or 1000 runs (b).

5.2. UNDERPARAMETERIZED CASE

In the underparameterized case, the number of parameters p is smaller than the total number of validation data across tasks, n_v m. In this case, since the data fully constrains the parameters, the optimal value of ω found during meta-training is unique. We prove the following result.

Theorem 2. Consider the algorithm of section 3 (MAML one-step) and the data generating model of section 4 (mixed linear regression). Let p < n_v m. Let n_v(ξ) and n_t(ξ) be any functions of order O(ξ). For ξ, m → ∞, the test loss of Eq.2, averaged over the entire data distribution (see Eq.27 in the Appendix), is equal to

L_test = (σ²/2)(1 + α_r² p/n_r) + h_r ν²/2 + (h_r / 2h_t²)(p / n_v m) { σ² ( h_t + (α_t²/n_t)[(n_v + 1) g_1 + p g_2] ) + (ν²/p)[(n_v + 1) g_3 + p g_4] } + O((mξ)^{−3/2})   (9)

where h_r, h_t are defined as in the previous section (Eqs.6, 7), and the g_i are order-O(1) polynomials in α_t (see Eqs.98-101 in the Appendix).

Proof. The proof of this Theorem can be found in the Appendix, sections 7.3 and 7.3.2.

Again, the loss always increases with the output noise σ and task variability ν. Furthermore, in this case the loss always decreases with the number of data points n_v, n_r and tasks m. Note that, for a very large number of tasks m, the loss does not depend on the meta-training hyperparameters α_t, n_v, n_t: when the number of tasks is infinite, it does not matter whether we run the inner loop, nor how much data we have for each task. As in the overparameterized case, the loss is a quadratic and convex function of the adaptation learning rate α_r, and there is a unique minimum. While the value of the argmin is different, in this case as well the loss is a sum of two quadratic functions, one with a minimum at α_r = 0 and the other with a minimum at α_r = 1/(1 + (p + 1)/n_r); therefore the optimal learning rate again lies between the same two values and is always positive.
Similar comments apply in this case: a positive learning rate for adaptation moves the parameters closer to the optimum for the target task. An example of the loss as a function of the adaptation learning rate α_r is shown in Figure 3a, where we also show the results of experiments in which we run MAML empirically. The good agreement between theory and experiment suggests that Eq.9 is accurate. As a function of the training learning rate α_t, the loss of Eq.9 is the ratio of two fourth-order polynomials, therefore it is not straightforward to determine its behaviour. However, it is possible to show that

∂L_test/∂α_t |_{α_t = 0} = σ² p / (n_v m) ≥ 0   (10)

suggesting that performance improves for small negative values of α_t. Even if counter-intuitive, this finding aligns with that of the previous section, and similar comments apply. An example of the loss as a function of the training learning rate α_t is shown in Figure 3b, where we also show the results of experiments in which we run MAML empirically. A good agreement is observed between theory and experiment, again suggesting that Eq.9 is accurate. Additional experiments are shown in the Appendix, Figure 6.

5.3. NON-GAUSSIAN THEORY IN OVERPARAMETERIZED MODELS

In the previous sections we studied the performance of MAML applied to the problem of mixed linear regression. It remains unclear whether the results in the linear case are relevant for the more interesting case of nonlinear problems. Inspired by recent theoretical work, we consider the case of nonlinear regression with squared loss

L(ω) = E_x E_{y|x} (1/2)[y − f(x, ω)]²   (11)

where y is a target output and f(x, ω) is the output of a neural network with input x and parameters ω. The introduction of the Neural Tangent Kernel showed that, in the limit of infinitely wide neural networks, the output is a linear function of the parameters during the entire course of training (Jacot et al. (2018), Lee et al. (2019)), and therefore the output can be linearized around ω_0. This is expressed by a first-order Taylor expansion:

f(x, ω) ≈ f(x, ω_0) + k(x, ω_0)ᵀ (ω − ω_0)   (12)
k(x, ω_0) = ∇_ω f(x, ω)|_{x, ω_0}   (13)

Intuitively, in a model that is heavily overparameterized, the data does not constrain the parameters, and a parameter that minimizes the loss in Eq.11 can be found in the vicinity of any initial condition ω_0. Note that, while the output of the neural network is linear in the parameters, it remains a nonlinear function of its input, through the vector of nonlinear functions k in Eq.13. By substituting Eq.12 into Eq.11, the nonlinear regression becomes effectively linear, in the sense that the loss is a quadratic function of the parameters ω, and all nonlinearities are contained in the functions k of Eq.13, which are fixed by the initial condition ω_0. This suggests that we can carry over the theory developed in the previous sections to this problem. However, in this case the input to the linear regression problem is effectively k(x), and some of the assumptions made in the previous sections no longer hold. In particular, even if we assume that x is Gaussian, k(x) is a nonlinear function of x and cannot be assumed Gaussian.
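The linearization of Eqs.12-13 can be checked numerically on any small network. The sketch below computes k(x, ω_0) by finite differences and verifies that the first-order expansion is accurate for nearby parameters; the tiny tanh network, its size and all names are illustrative stand-ins, and the infinite-width limit of the theory is not reproduced here.

```python
import numpy as np

def net(x, w):
    """A tiny two-layer tanh network; w packs both layers (4x2 then 4)."""
    W1 = w[:8].reshape(4, 2)
    w2 = w[8:12]
    return w2 @ np.tanh(W1 @ x)

def tangent_features(x, w0, eps=1e-6):
    """k(x, w0) = grad_w f(x, w) at w0 (Eq.13), by central finite differences."""
    k = np.zeros_like(w0)
    for i in range(len(w0)):
        d = np.zeros_like(w0); d[i] = eps
        k[i] = (net(x, w0 + d) - net(x, w0 - d)) / (2 * eps)
    return k

def linearized(x, w, w0):
    """First-order Taylor expansion of the output around w0 (Eq.12)."""
    return net(x, w0) + tangent_features(x, w0) @ (w - w0)
```

For parameters close to ω_0 the linearization error is second order in the perturbation, which is the regime the NTK argument exploits.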
We prove the following result, which generalizes the result of section 5.1 to non-Gaussian inputs and weights.

Theorem 3. Consider the algorithm of section 3 (MAML one-step), with ω_0 = 0, and the data generating model of section 4, where the input x and the weights w are not necessarily Gaussian, have zero mean, and have covariances Σ = E xxᵀ and Σ_w = E wwᵀ, respectively. Let F be the matrix of fourth-order moments F = E[(xᵀΣx) xxᵀ]. Let p > n_v m. Let p(ξ) and n_t(ξ) be any functions of order O(ξ) as ξ → ∞. Let Tr Σ_w² be of order O(ξ^{−1}), and let the variances of matrix products of the rescaled inputs x/√p, up to sixth order, be of order O(ξ^{−1}) (see Eqs.134-136 in the Appendix). Then the test loss of Eq.2, averaged over the entire data distribution (see Eq.27 in the Appendix), is equal to

L_test = (1/2) Tr(Σ_w H_r) + (σ²/2)(1 + (α_r²/n_r) Tr Σ²) + (n_v m / 2) Tr(H_r H_t) [ Tr(Σ_w H_t) + σ² (1 + (α_t²/n_t) Tr Σ²) ] / Tr(H_t)² + O(ξ^{−3/2})   (14)

where we define the following matrices:

H_t = Σ (I − α_t Σ)² + (α_t²/n_t)(F − Σ³)   (15)
H_r = Σ (I − α_r Σ)² + (α_r²/n_r)(F − Σ³)   (16)

Proof. The proof of this Theorem can be found in the Appendix, section 7.4.

Note that this result reduces to Eqs.5, 6, 7 when Σ = I, Σ_w = I ν²/p, F = I(p + 2), ω_0 = 0, w_0 = 0. This expression for the loss is more difficult to analyze than those given in the previous sections, because it involves traces of nonlinear functions of matrices, all elements of which are free hyperparameters. Nevertheless, it is possible to show that, as a function of the adaptation learning rate α_r, the loss in Eq.14 is still a quadratic function. As a function of the training learning rate α_t, the loss in Eq.14 is the ratio of two fourth-order polynomials, and it is difficult to draw conclusions since their coefficients do not appear to have simple relationships.
Even if the influence of the hyperparameters is not easy to predict, the expression in Eq.14 can still be used to quickly probe the behaviour of the loss empirically, by plugging in example values of Σ, Σ_w, F, since computing the expression is very fast. Here we choose the values of Σ, Σ_w by a single random draw from a Wishart distribution:

Σ ∼ W(I, p),   Σ_w ∼ (ν²/p) W(I, p)   (17)

Note that the number of degrees of freedom of the distribution is equal to the size of the matrices, p, therefore these covariances display significant correlations. Furthermore, we choose F = 2Σ³ + Σ Tr(Σ²), which is the value taken when x follows a Gaussian distribution. Therefore, we effectively test the loss in Eq.14 for a Gaussian distribution, as in the previous sections, but we stress that the expression is valid for any distribution of x within the assumptions of Theorem 3. We also run experiments with MAML, applied again to mixed linear regression, but now using the covariance matrices drawn in Eq.17. Figure 4 shows the loss of Eq.14 as a function of the learning rates, during adaptation (panel a) and training (panel b). Qualitatively, we observe a behaviour similar to that of section 5.1: the loss has a unique minimum at a positive value of the adaptation learning rate α_r, while performance is better at negative values of the training learning rate α_t. Again, there is a good agreement between theory and experiment, suggesting that Eq.14 is a good approximation.
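The matrices of Eqs.15-17 are straightforward to compute. The sketch below draws Σ from a Wishart distribution, builds the Gaussian fourth-moment matrix F = 2Σ³ + Σ Tr(Σ²), and constructs H as in Eqs.15-16; the function names and the seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def wishart(p):
    """One draw from W(I, p) with p degrees of freedom, as in Eq.17."""
    G = rng.normal(size=(p, p))
    return G @ G.T

def F_gauss(Sigma):
    """Fourth-moment matrix E[(x^T Sigma x) x x^T] for Gaussian x ~ N(0, Sigma)."""
    return 2 * np.linalg.matrix_power(Sigma, 3) + Sigma * np.trace(Sigma @ Sigma)

def H(alpha, n, Sigma, F):
    """Eqs.15-16: H = Sigma (I - alpha Sigma)^2 + (alpha^2 / n)(F - Sigma^3)."""
    p = Sigma.shape[0]
    A = np.eye(p) - alpha * Sigma
    return Sigma @ A @ A + (alpha ** 2 / n) * (F - np.linalg.matrix_power(Sigma, 3))
```

A quick consistency check: for Σ = I and F = (p + 2)I, H reduces to h·I with h as in Eqs.6-7, matching the Gaussian reduction stated after Theorem 3.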

5.4. NONLINEAR REGRESSION

To investigate whether negative learning rates improve performance on nonlinear regression in practice, we studied the simple case of MAML with a neural network applied to a quadratic function. Specifically, the target output is generated according to y = (wᵀx + b)² + z, where b is a bias term. The data x, z and generating parameters w are sampled as described in section 4; in addition, the bias b is drawn from a Gaussian distribution of zero mean and unit variance. We use a 2-layer feed-forward neural network with ReLU activation functions. Weights are initialized following a Gaussian distribution of zero mean and variance equal to the inverse number of inputs. We report results with a network width of 400 in both layers; results were similar with larger network widths. We use the squared loss function and we train the neural network in the outer loop with stochastic gradient descent, with a learning rate of 0.001 for 5000 epochs (until convergence). Most hyperparameters were identical to those of section 5.1: n_t = 30, n_v = 2, n_r = 20, m = 3, p = 60, σ = 1, ν = 0.5, w_0 = 0. The learning rate for adaptation was set to α_r = 0.01. Note that in section 5.1 the model was initialized at the ground truth of the generative model (ω_0 = w_0), while here the neural network parameters are initialized at random. Figure 5 shows the test loss as a function of the learning rate α_t. The best performance is obtained for a negative learning rate of α_t = −0.0075.
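The setup just described can be sketched as follows: data generation for the quadratic targets and the 2-layer ReLU network with fan-in initialization. This covers only the ingredients named in the text (the training loop is omitted), and the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_quadratic_task(p, n, nu, sigma):
    """Targets y = (w^T x + b)^2 + z, as in section 5.4; b ~ N(0, 1)."""
    w = rng.normal(0.0, nu / np.sqrt(p), size=p)
    b = rng.normal()
    X = rng.normal(size=(n, p))
    y = (X @ w + b) ** 2 + rng.normal(0.0, sigma, size=n)
    return X, y

def init_relu_net(p, width):
    """2-layer ReLU network; weights ~ N(0, 1/fan_in), as in the text."""
    return {"W1": rng.normal(0, 1 / np.sqrt(p), (width, p)),
            "W2": rng.normal(0, 1 / np.sqrt(width), (1, width))}

def forward(params, X):
    """Scalar network output for each row of X."""
    return (params["W2"] @ np.maximum(params["W1"] @ X.T, 0.0)).ravel()
```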

6. DISCUSSION

Figure 5: Average test loss of MAML as a function of the learning rate, on nonlinear (quadratic) regression using a 2-layer feedforward neural network. The optimal learning rate is negative, consistent with the results on the linear case. Each run is evaluated on 1000 test tasks, and each point is an average over 10 runs. Error bars show standard errors. Note the qualitative similarity with Figures 2b and 4b.

We calculated algebraic expressions for the average test loss of MAML applied to a simple family of linear models, as a function of the hyperparameters. Surprisingly, we showed that the optimal value of the learning rate of the inner loop during training is negative. This finding seems to carry over to more interesting nonlinear models in the overparameterized case. However, additional work is necessary to establish the conditions under which the optimal learning rate may be positive, for example by probing Eq.14 more extensively. A negative optimal learning rate is surprising and counter-intuitive, since negative learning rates push parameters towards higher values of the loss. However, the meta-training loss is minimized by the outer loop, therefore it is not immediately obvious whether the learning rate of the inner loop should be positive, and we show that in some circumstances it should not. We also show, perhaps obviously, that the learning rate during adaptation at test time should always be positive, otherwise the target task cannot be learned. In this work, we considered nonlinear models in the overparameterized case. However, typical applications of MAML (and meta-learning in general) implement relatively small models, due to the heavy computational load of running bi-level optimization with both an outer and an inner loop.
Our theory applies to regression problems, and assumes a limited number of tasks where data is independently drawn in each task, while some applications use a large number of tasks with correlated draws (for example, images may be shared across tasks in few-shot image classification, see Bertinetto et al. (2019)). Our theory is valid at the exact optimum of the outer loop, which is equivalent to training the outer loop to convergence; therefore overfitting may occur in the outer loop of our model. Another limitation of our theory is represented by the assumptions on the input and task covariance, which have no correlations in Theorems 1 and 2, and are subject to some technical assumptions in Theorem 3. To the best of our knowledge, nobody has considered training meta-learning models with negative learning rates in the inner loop. Given that some studies advocate removing the inner loop altogether, which is similar to setting the learning rate to zero, it would be interesting to try a negative one. On the other hand, it is possible that a negative learning rate does not work in classification problems, in nonlinear models, or with inputs or tasks that have a complex structure, settings that are outside the theory presented in this work.

Figure 6 caption (fragment): note that overfitting occurs since ω_0 ≠ w_0. In the experiments, each run is evaluated on 100 test tasks of 50 data points each, and each point is an average over 100 runs.

7.1. DEFINITION OF THE LOSS FUNCTION

We consider the problem of mixed linear regression y = Xw + z with squared loss, where X is an n × p matrix of input data (each row is one of n data vectors of dimension p), z is an n × 1 noise vector, w is a p × 1 vector of generating parameters, and y is an n × 1 output vector. Data is collected for m tasks, each with a different value of the parameters w and a different realization of the input X and noise z. We denote by w^{(i)} the parameters for task i, for i = 1, ..., m. For a given task i, we denote by X^{t(i)}, X^{v(i)} the input data for the training and validation sets, respectively, by z^{t(i)}, z^{v(i)} the corresponding noise vectors, and by y^{t(i)}, y^{v(i)} the output vectors. We denote by n_t, n_v the data sample sizes for the training and validation sets, respectively. For a given task i, the training output is equal to

y^{t(i)} = X^{t(i)} w^{(i)} + z^{t(i)}   (18)

Similarly, the validation output is equal to

y^{v(i)} = X^{v(i)} w^{(i)} + z^{v(i)}   (19)

We consider MAML as a model for meta-learning (Finn et al. (2017)). The meta-training loss is equal to

L_meta = (1 / 2 n_v m) Σ_{i=1}^{m} |y^{v(i)} − X^{v(i)} θ^{(i)}(ω)|²   (20)

where vertical brackets denote the Euclidean norm, and the estimated parameters θ^{(i)}(ω) are equal to the one-step gradient update on the single-task training loss L^{(i)} = |y^{t(i)} − X^{t(i)} θ^{(i)}|² / 2n_t, with initial condition given by the meta-parameter ω. The single gradient update is equal to

θ^{(i)}(ω) = ( I_p − (α_t / n_t) X^{t(i)T} X^{t(i)} ) ω + (α_t / n_t) X^{t(i)T} y^{t(i)}   (21)

where I_p is the p × p identity matrix and α_t is the learning rate.
We seek to minimize the meta-training loss with respect to the meta-parameter ω, namely

ω* = argmin_ω L_meta   (22)

We evaluate the solution ω* by calculating the meta-test loss

L_test = (1 / 2 n_s) |y^s − X^s θ*|²   (23)

Note that the test loss is calculated over test data X^s, z^s and test parameters w*, namely

y^s = X^s w* + z^s   (24)

Furthermore, the estimated parameters θ* are calculated on a separate set of target data X^r, z^r, namely

θ* = ( I_p − (α_r / n_r) X^{rT} X^r ) ω* + (α_r / n_r) X^{rT} y^r   (25)
y^r = X^r w* + z^r   (26)

Note that the learning rate and sample sizes can be different at testing, denoted by α_r, n_r, n_s. We are interested in calculating the average test loss, that is, the test loss of Eq.23 averaged over the entire data distribution:

⟨L_test⟩ = E_{w} E_{z^t} E_{X^t} E_{z^v} E_{X^v} E_{w*} E_{z^s} E_{X^s} E_{z^r} E_{X^r} (1 / 2 n_s) |y^s − X^s θ*|²   (27)
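Because θ^{(i)}(ω) in Eq.21 is affine in ω, the outer-loop minimization of Eq.22 is itself a least-squares problem and can be solved exactly, which is what "training to convergence" amounts to here. A sketch (the function name and the use of `pinv`, which in the overparameterized case selects the solution closest to ω_0, are our own choices):

```python
import numpy as np

def maml_exact(tasks, alpha_t, omega0):
    """Exact minimizer of the meta-training loss (Eq.22) for one-step MAML on
    linear regression. Writing theta_i(omega) = A_i omega + b_i (Eq.21), the
    meta-loss (Eq.20) is a least-squares problem in omega with stacked design
    matrix [X_v^(i) A_i] and targets [y_v^(i) - X_v^(i) b_i]."""
    p = omega0.shape[0]
    Ms, cs = [], []
    for X_t, y_t, X_v, y_v in tasks:
        n_t = len(y_t)
        A = np.eye(p) - (alpha_t / n_t) * X_t.T @ X_t  # Eq.21, linear part
        b = (alpha_t / n_t) * X_t.T @ y_t              # Eq.21, affine part
        Ms.append(X_v @ A)
        cs.append(y_v - X_v @ b)
    M, c = np.vstack(Ms), np.concatenate(cs)
    # shift by omega0 so that, when underdetermined, we return the solution
    # closest to the initial condition omega0
    return omega0 + np.linalg.pinv(M) @ (c - M @ omega0)
```

With α_t = 0 this reduces to ordinary least squares on the pooled validation data, the "no inner loop" baseline.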

7.2. DEFINITION OF PROBABILITY DISTRIBUTIONS

We assume that all random variables are Gaussian. In particular, we assume that the rows of the matrix $X$ are independent, and each row, denoted by $x$, is distributed according to a multivariate Gaussian with zero mean and unit covariance
$$x \sim \mathcal{N}(0, I_p) \quad (28)$$
where $I_p$ is the $p \times p$ identity matrix. Similarly, the noise is distributed according to a multivariate Gaussian with zero mean and variance $\sigma^2$, namely
$$z \sim \mathcal{N}\left(0, \sigma^2 I_n\right) \quad (29)$$
Finally, the generating parameters are also distributed according to a multivariate Gaussian with variance $\nu^2/p$, namely
$$w \sim \mathcal{N}\left(w_0, \frac{\nu^2}{p} I_p\right) \quad (30)$$
The generating parameter $w$ is drawn once and kept fixed within a task, and drawn independently for different tasks. The values of $x$ and $z$ are drawn independently in all tasks and datasets (training, validation, target, test). In order to perform the calculations in the next section, we need the following results.

Lemma 1. Let $X$ be a Gaussian $n \times p$ random matrix with independent rows, each row having covariance $I_p$, the $p \times p$ identity matrix. Then:
$$E\left[X^T X\right] = n I_p \quad (31)$$
$$E\left[\left(X^T X\right)^2\right] = n(n+p+1)\, I_p = n^2 \mu_2 I_p \quad (32)$$
$$E\left[\left(X^T X\right)^3\right] = n\left(n^2 + p^2 + 3np + 3n + 3p + 4\right) I_p = n^3 \mu_3 I_p \quad (33)$$
$$E\left[\left(X^T X\right)^4\right] = n\left(n^3 + p^3 + 6n^2 p + 6np^2 + 6n^2 + 6p^2 + 17np + 21n + 21p + 20\right) I_p = n^4 \mu_4 I_p \quad (34, 35)$$
$$E\left[X^T X\, \mathrm{Tr}\left(X^T X\right)\right] = \left(n^2 p + 2n\right) I_p = p n^2 \mu_{1,1} I_p \quad (36)$$
$$E\left[\left(X^T X\right)^2 \mathrm{Tr}\left(X^T X\right)\right] = n\left(n^2 p + np^2 + np + 4n + 4p + 4\right) I_p = p n^3 \mu_{2,1} I_p \quad (37)$$
$$E\left[X^T X\, \mathrm{Tr}\left(\left(X^T X\right)^2\right)\right] = n\left(n^2 p + np^2 + np + 4n + 4p + 4\right) I_p = p n^3 \mu_{1,2} I_p \quad (38)$$
$$E\left[\left(X^T X\right)^2 \mathrm{Tr}\left(\left(X^T X\right)^2\right)\right] = n\left(n^3 p + np^3 + 2n^2 p^2 + 2n^2 p + 2np^2 + 8n^2 + 8p^2 + 21np + 20n + 20p + 20\right) I_p = p n^4 \mu_{2,2} I_p \quad (39, 40)$$
where the last equality in each of these expressions defines the variables $\mu$. Furthermore, for any $n \times n$ symmetric matrix $C$ and any $p \times p$ symmetric matrix $D$, independent of $X$:
$$E\left[X^T C X\right] = \mathrm{Tr}(C)\, I_p \quad (41)$$
$$E\left[X^T X D X^T X\right] = n(n+1)\, D + n\, \mathrm{Tr}(D)\, I_p \quad (42)$$
Proof. The Lemma follows by direct computation of the above expectations, using Isserlis' theorem.
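The moment formulas of Lemma 1 can be spot-checked numerically. The sketch below (ours, not from the paper) estimates $E[(X^T X)^2]$ by Monte Carlo and compares it against $n(n+p+1) I_p$ (Eq. 32).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 8, 4, 40000

# Monte Carlo estimate of E[(X^T X)^2] for X with i.i.d. N(0,1) entries
acc = np.zeros((p, p))
for _ in range(trials):
    X = rng.normal(size=(n, p))
    M = X.T @ X
    acc += M @ M
estimate = acc / trials

# Eq. 32: E[(X^T X)^2] = n(n+p+1) I_p  (= 104 I_4 here)
target = n * (n + p + 1) * np.eye(p)
```

The same loop, with the appropriate matrix power inserted, checks Eqs. 33-40; the sampling error shrinks as $1/\sqrt{\text{trials}}$.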
In particular, for higher-order exponents, combinatorics plays a crucial role in counting products of different Gaussian variables efficiently.

Lemma 2. Let $X^{v(i)}, X^{t(i)}$ be Gaussian random matrices of size $n_v \times p$ and $n_t \times p$ respectively, with independent rows, each row having covariance $I_p$, the $p \times p$ identity matrix. Let $p(\xi)$ and $n_t(\xi)$ be any functions of order $O(\xi)$ as $\xi \to \infty$. Then:
$$X^{v(i)} X^{v(i)T} = p\, I_{n_v} + O\left(\xi^{1/2}\right) \quad (43)$$
$$X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} = p n_t\, I_{n_v} + O\left(\xi^{3/2}\right) \quad (44)$$
$$X^{v(i)} X^{t(i)T} X^{t(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} = p n_t (n_t + p + 1)\, I_{n_v} + O\left(\xi^{5/2}\right) \quad (45)$$
Note that the order $O(\xi)$ applies to each element of the matrix in each expression. For $i \neq j$:
$$X^{v(i)} X^{v(j)T} = O\left(\xi^{1/2}\right) \quad (46)$$
$$X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(j)T} = O\left(\xi^{3/2}\right) \quad (47)$$
$$X^{v(i)} X^{t(i)T} X^{t(i)} X^{t(j)T} X^{t(j)} X^{v(j)T} = O\left(\xi^{5/2}\right) \quad (48)$$
Furthermore, for any positive real number $\delta$ and for any $p \times p$ symmetric matrix $D$ independent of $X$, where $\mathrm{Tr}(D)$ and $\mathrm{Tr}(D^2)$ are both of order $O(\xi^\delta)$:
$$X^{v(i)} D X^{v(i)T} = \mathrm{Tr}(D)\, I_{n_v} + O\left(\xi^{\delta/2}\right) \quad (49)$$
$$X^{v(i)} X^{t(i)T} X^{t(i)} D X^{v(i)T} = \mathrm{Tr}(D)\, n_t\, I_{n_v} + O\left(\xi^{1+\delta/2}\right) \quad (50)$$
$$X^{v(i)} X^{t(i)T} X^{t(i)} D X^{t(i)T} X^{t(i)} X^{v(i)T} = \mathrm{Tr}(D)\, n_t (n_t + p + 1)\, I_{n_v} + O\left(\xi^{2+\delta/2}\right) \quad (51)$$
$$X^{v(i)} D X^{v(j)T} = O\left(\xi^{\delta/2}\right) \quad (52)$$
$$X^{v(i)} X^{t(i)T} X^{t(i)} D X^{v(j)T} = O\left(\xi^{1+\delta/2}\right) \quad (53)$$
$$X^{v(i)} X^{t(i)T} X^{t(i)} D X^{t(j)T} X^{t(j)} X^{v(j)T} = O\left(\xi^{2+\delta/2}\right) \quad (54)$$
Proof. The Lemma follows by direct computation of the expectations and variances of each term.

Lemma 3. Let $X^v, X^t$ be Gaussian random matrices of size $n_v \times p$ and $n_t \times p$ respectively, with independent rows, each row having covariance $I_p$, the $p \times p$ identity matrix. Let $n_v(\xi)$ and $n_t(\xi)$ be any functions of order $O(\xi)$ as $\xi \to \infty$. Then:
$$X^{vT} X^v = n_v I_p + O\left(\xi^{1/2}\right) \quad (55)$$
$$X^{tT} X^t X^{vT} X^v = n_t n_v I_p + O\left(\xi^{3/2}\right) \quad (56)$$
$$X^{tT} X^t X^{vT} X^v X^{tT} X^t = n_v n_t (n_t + p + 1)\, I_p + O\left(\xi^{5/2}\right) \quad (57)$$
Note that the order $O(\xi)$ applies to each element of the matrix in each expression.

Proof. The Lemma follows by direct computation of the expectations and variances of each term.
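The concentration statements of Lemmas 2 and 3 are easy to see numerically: the entrywise deviation of $X^v X^{vT}/p$ from $I_{n_v}$ shrinks as $O(p^{-1/2})$ (Eq. 43). A small sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def max_dev(p, n_v=5, reps=20):
    """Average (over reps) of the largest entrywise deviation of
    X_v X_v^T / p from the identity I_{n_v}; by Eq. 43 this is O(p^{-1/2})."""
    devs = []
    for _ in range(reps):
        X_v = rng.normal(size=(n_v, p))
        devs.append(np.abs(X_v @ X_v.T / p - np.eye(n_v)).max())
    return float(np.mean(devs))

# increasing p by a factor of 100 should shrink the deviation by about 10x
dev_small_p, dev_large_p = max_dev(100), max_dev(10000)
```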

7.3. PROOF OF THEOREMS 1 AND 2

We calculate the average test loss as a function of the hyperparameters $n_t, n_v, n_r, p, m, \alpha_t, \alpha_r, \sigma, \nu, w_0$. Using the expression in Eq. 24 for the test output, we rewrite the test loss in Eq. 27 as
$$\bar{L}^{test} = E\; \frac{1}{2 n_s} \left| X^s \left( w^\star - \theta^\star \right) + z^s \right|^2 \quad (58)$$
We start by averaging this expression with respect to $X^s, z^s$, noting that $\theta^\star$ does not depend on the test data. We further average with respect to $w^\star$, but note that $\theta^\star$ depends on the test parameters, so we average only the terms that do not depend on $\theta^\star$. Using Eq. 31, the result is
$$\bar{L}^{test} = \frac{\sigma^2}{2} + \frac{\nu^2}{2} + \frac{|w_0|^2}{2} + E\left[ \frac{|\theta^\star|^2}{2} - \left( w_0 + \delta w^\star \right)^T \theta^\star \right] \quad (59)$$
where we define $\delta w^\star = w^\star - w_0$. The second term in the expectation is linear in $\theta^\star$ and can be averaged over $X^r, z^r$, using Eq. 25 and noting that $\omega^\star$ does not depend on the target data. The result is
$$E_{X^r} E_{z^r}\, \theta^\star = (1 - \alpha_r)\, \omega^\star + \alpha_r \left( w_0 + \delta w^\star \right) \quad (60)$$
Using Eq. 60, we average over $w^\star$ the second term in the expectation of Eq. 59 and find
$$\bar{L}^{test} = \frac{\sigma^2}{2} + \left( \frac{1}{2} - \alpha_r \right) \left( \nu^2 + |w_0|^2 \right) - (1 - \alpha_r)\, w_0^T\, E\, \omega^\star + E\, \frac{|\theta^\star|^2}{2} \quad (61)$$
We average the last term of this expression over $z^r, w^\star$, using Eq. 25 and noting that $\omega^\star$ does not depend on the target data or the test parameters. The result is
$$E_{w^\star} E_{z^r}\, |\theta^\star|^2 = |\omega^\star|^2 + \frac{\alpha_r^2}{n_r^2} \left( \omega^\star - w_0 \right)^T \left( X^{rT} X^r \right)^2 \left( \omega^\star - w_0 \right) - \frac{2 \alpha_r}{n_r}\, \omega^{\star T} X^{rT} X^r \left( \omega^\star - w_0 \right) + \frac{\alpha_r^2 \sigma^2}{n_r^2}\, \mathrm{Tr}\left( X^r X^{rT} \right) + \frac{\alpha_r^2 \nu^2}{n_r^2 p}\, \mathrm{Tr}\left( \left( X^r X^{rT} \right)^2 \right) \quad (62, 63)$$
We now average over $X^r$, again noting that $\omega^\star$ does not depend on the target data. Using Eqs. 31, 32, we find
$$E_{X^r} E_{w^\star} E_{z^r}\, |\theta^\star|^2 = |\omega^\star|^2 + \alpha_r^2 \left( 1 + \frac{p+1}{n_r} \right) \left( \nu^2 + |\omega^\star - w_0|^2 \right) - 2 \alpha_r\, \omega^{\star T} \left( \omega^\star - w_0 \right) + \frac{\alpha_r^2 \sigma^2 p}{n_r} \quad (64)$$
We can now rewrite the average test loss of Eq. 61 as
$$\bar{L}^{test} = \frac{\sigma^2}{2} \left( 1 + \frac{\alpha_r^2 p}{n_r} \right) + \frac{1}{2} \left[ (1 - \alpha_r)^2 + \alpha_r^2\, \frac{p+1}{n_r} \right] \left( \nu^2 + E\, |\omega^\star - w_0|^2 \right) \quad (65)$$
In order to average the last term, we need an expression for $\omega^\star$. We note that the loss in Eq. 20 is quadratic in $\omega$, therefore the solution of Eq. 22 can be found using standard linear algebra.
In particular, the loss in Eq. 20 can be rewritten as
$$L^{meta} = \frac{1}{2 n_v m} \left| \gamma - B \omega \right|^2 \quad (66)$$
where $\gamma$ is a vector of shape $n_v m \times 1$, and $B$ is a matrix of shape $n_v m \times p$. The vector $\gamma$ is a stack of $m$ vectors
$$\gamma = \begin{pmatrix} X^{v(1)} \left( I_p - \frac{\alpha_t}{n_t} X^{t(1)T} X^{t(1)} \right) w^{(1)} - \frac{\alpha_t}{n_t} X^{v(1)} X^{t(1)T} z^{t(1)} + z^{v(1)} \\ \vdots \\ X^{v(m)} \left( I_p - \frac{\alpha_t}{n_t} X^{t(m)T} X^{t(m)} \right) w^{(m)} - \frac{\alpha_t}{n_t} X^{v(m)} X^{t(m)T} z^{t(m)} + z^{v(m)} \end{pmatrix} \quad (67)$$
Similarly, the matrix $B$ is a stack of $m$ matrices
$$B = \begin{pmatrix} X^{v(1)} \left( I_p - \frac{\alpha_t}{n_t} X^{t(1)T} X^{t(1)} \right) \\ \vdots \\ X^{v(m)} \left( I_p - \frac{\alpha_t}{n_t} X^{t(m)T} X^{t(m)} \right) \end{pmatrix} \quad (68)$$
We denote by $I_p$ the $p \times p$ identity matrix. The expression for $\omega^\star$ that minimizes Eq. 66 depends on whether the problem is overparameterized ($p > n_v m$) or underparameterized ($p < n_v m$); we therefore distinguish these two cases in the following sections.
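The reduction of the MAML meta-loss to the linear least-squares form of Eq. 66 can be checked directly. The sketch below is ours (hypothetical names): it stacks the blocks of Eqs. 67-68 and verifies that the stacked quadratic form reproduces the task-by-task loss of Eq. 20.

```python
import numpy as np

rng = np.random.default_rng(3)
m, p, n_t, n_v, alpha_t, sigma, nu = 3, 6, 8, 4, 0.2, 0.1, 0.5

blocks_B, blocks_g, tasks = [], [], []
for _ in range(m):
    w = rng.normal(scale=nu / np.sqrt(p), size=p)
    X_t = rng.normal(size=(n_t, p)); z_t = sigma * rng.normal(size=n_t)
    X_v = rng.normal(size=(n_v, p)); z_v = sigma * rng.normal(size=n_v)
    y_t, y_v = X_t @ w + z_t, X_v @ w + z_v
    A = np.eye(p) - alpha_t / n_t * X_t.T @ X_t        # inner-loop matrix, Eq. 21
    blocks_B.append(X_v @ A)                           # block of B, Eq. 68
    blocks_g.append(X_v @ A @ w
                    - alpha_t / n_t * X_v @ X_t.T @ z_t + z_v)  # block of gamma, Eq. 67
    tasks.append((X_t, y_t, X_v, y_v))

B = np.vstack(blocks_B)            # shape (n_v * m, p)
gamma = np.concatenate(blocks_g)   # shape (n_v * m,)

def meta_loss(omega):
    """MAML meta-loss of Eq. 20, computed task by task."""
    total = 0.0
    for X_t, y_t, X_v, y_v in tasks:
        theta = omega - alpha_t / n_t * X_t.T @ (X_t @ omega - y_t)
        total += np.sum((y_v - X_v @ theta) ** 2)
    return total / (2 * n_v * m)

omega = rng.normal(size=p)
stacked = np.sum((gamma - B @ omega) ** 2) / (2 * n_v * m)   # Eq. 66
```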

7.3.1. OVERPARAMETERIZED CASE (THEOREM 1)

In the overparameterized case ($p > n_v m$), under the assumption that the inverse of $B B^T$ exists, the value of $\omega$ that minimizes Eq. 66 is equal to
$$\omega^\star = B^T \left( B B^T \right)^{-1} \gamma + \left[ I_p - B^T \left( B B^T \right)^{-1} B \right] \omega_0 \quad (69)$$
The vector $\omega_0$ is interpreted as the initial condition of the outer-loop parameter optimization, when optimized by gradient descent. Note that the matrix $B$ does not depend on $w, z^t, z^v$, and $E_w E_{z^t} E_{z^v}\, \gamma = B w_0$. We denote by $\delta\gamma$ the deviation from this average, and we have
$$\omega^\star - w_0 = B^T \left( B B^T \right)^{-1} \delta\gamma + \left[ I_p - B^T \left( B B^T \right)^{-1} B \right] \left( \omega_0 - w_0 \right) \quad (70)$$
We square this expression and average over $w, z^t, z^v$. We use the cyclic property of the trace and the fact that $B^T (B B^T)^{-1} B$ is a projection. The result is
$$E_w E_{z^t} E_{z^v}\, |\omega^\star - w_0|^2 = \mathrm{Tr}\left[ \Gamma \left( B B^T \right)^{-1} \right] + \left( \omega_0 - w_0 \right)^T \left[ I_p - B^T \left( B B^T \right)^{-1} B \right] \left( \omega_0 - w_0 \right) \quad (71)$$
The matrix $\Gamma$ is defined as
$$\Gamma = E_w E_{z^t} E_{z^v}\, \delta\gamma\, \delta\gamma^T = \begin{pmatrix} \Gamma^{(1)} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \Gamma^{(m)} \end{pmatrix} \quad (72)$$
where the matrix blocks are given by
$$\Gamma^{(i)} = \frac{\nu^2}{p}\, X^{v(i)} \left( I_p - \frac{\alpha_t}{n_t} X^{t(i)T} X^{t(i)} \right)^2 X^{v(i)T} + \sigma^2 \left( I_{n_v} + \frac{\alpha_t^2}{n_t^2}\, X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} \right) \quad (73)$$
It is convenient to rewrite the scalar product of Eq. 71 in terms of traces of outer products
$$E_w E_{z^t} E_{z^v}\, |\omega^\star - w_0|^2 = \mathrm{Tr}\left[ \left( B B^T \right)^{-1} \left( \Gamma - B \left( \omega_0 - w_0 \right)\left( \omega_0 - w_0 \right)^T B^T \right) \right] + |\omega_0 - w_0|^2 \quad (74)$$
In order to calculate $E\, |\omega^\star - w_0|^2$ in Eq. 65, we need to average this expression over the training and validation data. These averages are hard to compute since they involve nonlinear functions of the data. However, we can approximate these terms by assuming that $p$ and $n_t$ are both large, of order $O(\xi)$, where $\xi$ is a large number. Furthermore, we assume that $|\omega_0 - w_0|$ is of order $O(\xi^{-1/4})$.
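The interpolating solution of Eq. 69 can be implemented with a pseudoinverse. The sketch below (ours) checks that $\omega^\star$ interpolates the data ($B\omega^\star = \gamma$) and that the second term of Eq. 69 keeps only the component of $\omega_0$ orthogonal to the row space of $B$.

```python
import numpy as np

rng = np.random.default_rng(4)
p, rows = 30, 8                       # overparameterized: p > n_v * m = rows
B = rng.normal(size=(rows, p))
gamma = rng.normal(size=rows)
omega0 = rng.normal(size=p)

BBT_inv = np.linalg.inv(B @ B.T)
P = B.T @ BBT_inv @ B                 # orthogonal projection onto the row space of B
omega_star = B.T @ BBT_inv @ gamma + (np.eye(p) - P) @ omega0   # Eq. 69
```

Since $B(I_p - P) = 0$, the initial condition $\omega_0$ does not affect the fit on the meta-training data; it only determines the solution's component outside the row space of $B$, which is why $\omega_0$ survives in the test loss of Theorem 1.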
Using Lemma 2, together with the expressions of $B$ (Eq. 68) and $\Gamma$ (Eqs. 72, 73), we can prove that
$$\frac{1}{p}\, B B^T = \left[ (1 - \alpha_t)^2 + \alpha_t^2\, \frac{p+1}{n_t} \right] I_{n_v m} + O\left(\xi^{-1/2}\right) \quad (75)$$
$$\Gamma = \left[ \nu^2 \left( (1 - \alpha_t)^2 + \alpha_t^2\, \frac{p+1}{n_t} \right) + \sigma^2 \left( 1 + \alpha_t^2\, \frac{p}{n_t} \right) \right] I_{n_v m} + O\left(\xi^{-1/2}\right) \quad (76)$$
$$B \left( \omega_0 - w_0 \right)\left( \omega_0 - w_0 \right)^T B^T = |\omega_0 - w_0|^2 \left[ (1 - \alpha_t)^2 + \alpha_t^2\, \frac{p+1}{n_t} \right] I_{n_v m} + O\left(\xi^{-1/2}\right) \quad (77)$$
Using Eq. 75 and a Taylor expansion, the inverse $(B B^T)^{-1}$ is equal to
$$\left( B B^T \right)^{-1} = \frac{1}{p} \left[ (1 - \alpha_t)^2 + \alpha_t^2\, \frac{p+1}{n_t} \right]^{-1} I_{n_v m} + O\left(\xi^{-3/2}\right) \quad (78)$$
Substituting the three expressions above into Eq. 74, and ignoring lower-order terms, we find
$$E\, |\omega^\star - w_0|^2 = \left( 1 - \frac{n_v m}{p} \right) |\omega_0 - w_0|^2 + \frac{n_v m}{p} \left[ \nu^2 + \frac{\sigma^2 \left( 1 + \alpha_t^2 \frac{p}{n_t} \right)}{(1 - \alpha_t)^2 + \alpha_t^2 \frac{p+1}{n_t}} \right] + O\left(\xi^{-3/2}\right) \quad (79)$$
Substituting this expression into Eq. 65, we find the value of the average test loss
$$\bar{L}^{test} = \frac{\sigma^2}{2} \left( 1 + \frac{\alpha_r^2 p}{n_r} \right) + h_r \left[ \frac{\nu^2}{2} \left( 1 + \frac{n_v m}{p} \right) + \frac{1}{2} \left( 1 - \frac{n_v m}{p} \right) |\omega_0 - w_0|^2 + \frac{\sigma^2 n_v m}{2p}\, \frac{1 + \alpha_t^2 \frac{p}{n_t}}{h_t} \right] + O\left(\xi^{-3/2}\right) \quad (80, 81)$$
where we define the following expressions
$$h_t = (1 - \alpha_t)^2 + \alpha_t^2\, \frac{p+1}{n_t} \qquad \text{and} \qquad h_r = (1 - \alpha_r)^2 + \alpha_r^2\, \frac{p+1}{n_r} \quad (82)$$

7.3.2. UNDERPARAMETERIZED CASE (THEOREM 2)

In the underparameterized case ($p < n_v m$), under the assumption that the inverse of $B^T B$ exists, the value of $\omega$ that minimizes Eq. 66 is equal to
$$\omega^\star = \left( B^T B \right)^{-1} B^T \gamma \quad (83)$$
Note that the matrix $B$ does not depend on $w, z^t, z^v$, and $E_w E_{z^t} E_{z^v}\, \gamma = B w_0$. We denote by $\delta\gamma$ the deviation from this average, and we have
$$|\omega^\star - w_0|^2 = \mathrm{Tr}\left[ \left( B^T B \right)^{-1} B^T \delta\gamma\, \delta\gamma^T B \left( B^T B \right)^{-1} \right] \quad (84)$$
We need to average this expression in order to calculate $E\, |\omega^\star - w_0|^2$ in Eq. 65. We start by averaging $\delta\gamma\, \delta\gamma^T$ over $w, z^t, z^v$, since $B$ does not depend on those variables. Note that $w, z^t, z^v$ are independent of each other and across tasks. As in the previous section, we denote by $\Gamma$ the result of this operation, given by Eqs. 72, 73. Finally, we need to average over the training and validation data
$$E\, |\omega^\star - w_0|^2 = E_{X^t} E_{X^v}\, \mathrm{Tr}\left[ \left( B^T B \right)^{-1} B^T \Gamma B \left( B^T B \right)^{-1} \right] \quad (85)$$
This expression is hard to average because it includes nonlinear functions of the data.
However, we can approximate these terms by assuming that either $m$ or $\xi$ (or both) is a large number, where $\xi$ is defined by assuming that both $n_t$ and $n_v$ are of order $O(\xi)$. Using Lemma 3, together with the expression of $B$ (Eq. 68), and noting that each factor in Eq. 85 contains a sum over $m$ independent terms, we can prove that
$$\frac{1}{n_v m}\, B^T B = \left( 1 - 2\alpha_t + \alpha_t^2 \mu_2 \right) I_p + O\left( (m\xi)^{-1/2} \right) \quad (86)$$
The expression for $\mu_2$ is given in Eq. 32. Using this result and a Taylor expansion, the inverse is equal to
$$n_v m \left( B^T B \right)^{-1} = \left( 1 - 2\alpha_t + \alpha_t^2 \mu_2 \right)^{-1} I_p + O\left( (m\xi)^{-1/2} \right) \quad (87)$$
Similarly, the term $B^T \Gamma B$ is equal to its average plus a term of smaller order
$$\frac{1}{n_v m}\, B^T \Gamma B = \frac{1}{n_v m}\, E\left[ B^T \Gamma B \right] + O\left( (m\xi)^{-1/2} \right) \quad (88)$$
We substitute these expressions into Eq. 85 and neglect lower orders. Here we show how to calculate explicitly the expectation of $B^T \Gamma B$. For ease of notation, we define the matrix $A^{t(i)} = I_p - \frac{\alpha_t}{n_t} X^{t(i)T} X^{t(i)}$. Using the expressions of $B$ (Eq. 68) and $\Gamma$ (Eqs. 72, 73), the expression for $B^T \Gamma B$ is given by
$$B^T \Gamma B = \sigma^2 \sum_{i=1}^{m} A^{t(i)T} X^{v(i)T} X^{v(i)} A^{t(i)} + \frac{\nu^2}{p} \sum_{i=1}^{m} \left( A^{t(i)T} X^{v(i)T} X^{v(i)} A^{t(i)} \right)^2 + \frac{\alpha_t^2 \sigma^2}{n_t^2} \sum_{i=1}^{m} A^{t(i)T} X^{v(i)T} X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} X^{v(i)} A^{t(i)} \quad (89)$$
We use Eqs. 31, 32 to calculate the average of the first term in Eq. 89
$$E_{X^t} E_{X^v} \sum_{i=1}^{m} A^{t(i)T} X^{v(i)T} X^{v(i)} A^{t(i)} = n_v m \left( 1 - 2\alpha_t + \alpha_t^2 \mu_2 \right) I_p \quad (90)$$
We use Eqs. 31, 32, 33, 41, 36, 37, 38, 39 to calculate the average of the second term
$$E_{X^t} E_{X^v} \sum_{i=1}^{m} \left( A^{t(i)T} X^{v(i)T} X^{v(i)} A^{t(i)} \right)^2 = E_{X^t} \sum_{i=1}^{m} \left[ n_v (n_v + 1)\, A^{t(i)4} + n_v\, A^{t(i)2}\, \mathrm{Tr}\left( A^{t(i)2} \right) \right] = m n_v (n_v + 1) \left( 1 - 4\alpha_t + 6\alpha_t^2 \mu_2 - 4\alpha_t^3 \mu_3 + \alpha_t^4 \mu_4 \right) I_p + m n_v p \left( 1 - 4\alpha_t + 2\alpha_t^2 \mu_2 + 4\alpha_t^2 \mu_{1,1} - 4\alpha_t^3 \mu_{2,1} + \alpha_t^4 \mu_{2,2} \right) I_p \quad (91, 92)$$
Finally, we compute the average of the third term, using Eqs. 31, 32, 33, 34, 41, 36, 37
$$E_{X^t} E_{X^v} \sum_{i=1}^{m} A^{t(i)T} X^{v(i)T} X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} X^{v(i)} A^{t(i)} = E_{X^t} \sum_{i=1}^{m} \left[ n_v (n_v + 1)\, A^{t(i)T} X^{t(i)T} X^{t(i)} A^{t(i)} + n_v\, A^{t(i)T} A^{t(i)}\, \mathrm{Tr}\left( X^{t(i)T} X^{t(i)} \right) \right] = m n_v (n_v + 1)\, n_t \left( 1 - 2\alpha_t \mu_2 + \alpha_t^2 \mu_3 \right) I_p + m n_v n_t p \left( 1 - 2\alpha_t \mu_{1,1} + \alpha_t^2 \mu_{2,1} \right) I_p \quad (93\text{-}95)$$

For the proof of Theorem 3 (Section 7.4), we do not assume a specific distribution for the input data vectors $x$ or the generating parameter vector $w$; we assume only that different data vectors are independent, and so are data and parameters for different tasks. We further assume that those vectors have zero mean, and denote their covariances as
$$\Sigma = E\, x x^T \quad (109)$$
$$\Sigma_w = E\, w w^T \quad (110)$$
We will also use the following matrix, which includes fourth-order moments
$$F = E\left[ \left( x^T \Sigma x \right) x x^T \right] \quad (111)$$
We make no further assumption about the distribution of $x$, but we note that if $x$ is Gaussian, then $F = 2\Sigma^3 + \Sigma\, \mathrm{Tr}\left( \Sigma^2 \right)$. We keep the assumption that the output noise is Gaussian and independent across data points and tasks, with variance $\sigma^2$. Using the same notation as in previous sections, we will also use the following expressions (for any $p \times p$ matrix $A$)
$$E\left[ X^T X \right] = n \Sigma \quad (112)$$
$$E\left[ \mathrm{Tr}\left( \Sigma X^T X A X^T X \right) \right] = \mathrm{Tr}\left[ A \left( n^2 \Sigma^3 + n \left( F - \Sigma^3 \right) \right) \right] \quad (113)$$
We proceed to derive the same formula under these less restrictive assumptions, in the overparameterized case only, following the same derivation as Section 7.3. We further assume $\omega_0 = 0$, $w_0 = 0$. Again we start from the expression in Eq. 24 for the test output, and rewrite the test loss in Eq. 27 as
$$\bar{L}^{test} = E\; \frac{1}{2 n_s} \left| X^s \left( w^\star - \theta^\star \right) + z^s \right|^2 \quad (114)$$
We average this expression with respect to $X^s, z^s$, noting that $\theta^\star$ does not depend on the test data. We further average with respect to $w^\star$, but note that $\theta^\star$ depends on the test parameters, so we average only the terms that do not depend on $\theta^\star$. Using Eq. 112, the result is
$$\bar{L}^{test} = \frac{\sigma^2}{2} + \frac{1}{2}\, \mathrm{Tr}\left( \Sigma \Sigma_w \right) + E\left[ \frac{1}{2}\, \theta^{\star T} \Sigma\, \theta^\star - w^{\star T} \Sigma\, \theta^\star \right] \quad (115)$$
The second term in the expectation is linear in $\theta^\star$ and can be averaged over $X^r, z^r$, using Eq. 25 and noting that $\omega^\star$ does not depend on the target data.
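The Gaussian special case $F = 2\Sigma^3 + \Sigma\,\mathrm{Tr}(\Sigma^2)$ follows from Isserlis' theorem and can be verified by simulation. The sketch below is ours, with an arbitrary (hypothetical) covariance built from a random matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
p, trials = 3, 400000

# a generic, well-conditioned covariance matrix (our arbitrary choice)
A = rng.normal(size=(p, p))
Sigma = np.eye(p) + 0.2 * (A @ A.T) / p
L = np.linalg.cholesky(Sigma)

# Monte Carlo estimate of F = E[(x^T Sigma x) x x^T] for x ~ N(0, Sigma)
X = rng.normal(size=(trials, p)) @ L.T          # rows are samples of x
q = np.einsum('ti,ij,tj->t', X, Sigma, X)       # x^T Sigma x, per sample
F_mc = (X * q[:, None]).T @ X / trials

# Isserlis / Wick: F = 2 Sigma^3 + Sigma * Tr(Sigma^2)
F_th = 2 * np.linalg.matrix_power(Sigma, 3) + Sigma * np.trace(Sigma @ Sigma)
```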
The result is
$$E_{X^r} E_{z^r}\, \theta^\star = \left( I - \alpha_r \Sigma \right) \omega^\star + \alpha_r \Sigma w^\star \quad (116)$$
Furthermore, we show below (Eq. 128) that the following average holds
$$E_w E_{z^t} E_{z^v}\, \omega^\star = 0 \quad (117)$$
Combining Eqs. 116, 117, we can calculate the second term in the expectation of Eq. 115 and find
$$\bar{L}^{test} = \frac{\sigma^2}{2} + \frac{1}{2}\, \mathrm{Tr}\left( \Sigma \Sigma_w \right) - \alpha_r\, \mathrm{Tr}\left( \Sigma^2 \Sigma_w \right) + E\; \frac{1}{2}\, \theta^{\star T} \Sigma\, \theta^\star \quad (118)$$
We start by averaging the third term of this expression over $z^r, w^\star$, using Eq. 25 and noting that $\omega^\star$ does not depend on the target data or the test parameters. The result is
$$E_{w^\star} E_{z^r}\, \theta^{\star T} \Sigma\, \theta^\star = \mathrm{Tr}\left[ \Sigma \left( I - \frac{\alpha_r}{n_r} X^{rT} X^r \right) \omega^\star \omega^{\star T} \left( I - \frac{\alpha_r}{n_r} X^{rT} X^r \right) \right] + \frac{\alpha_r^2 \sigma^2}{n_r^2}\, \mathrm{Tr}\left( X^r \Sigma X^{rT} \right) + \frac{\alpha_r^2}{n_r^2}\, \mathrm{Tr}\left( \Sigma X^{rT} X^r \Sigma_w X^{rT} X^r \right) \quad (119, 120)$$
We now average over $X^r$, again noting that $\omega^\star$ does not depend on the target data. Using Eqs. 112, 113, we find
$$E_{X^r} E_{w^\star} E_{z^r}\, \theta^{\star T} \Sigma\, \theta^\star = \mathrm{Tr}\left[ \omega^\star \omega^{\star T} \left( \Sigma \left( I - \alpha_r \Sigma \right)^2 + \frac{\alpha_r^2}{n_r} \left( F - \Sigma^3 \right) \right) \right] + \frac{\alpha_r^2 \sigma^2}{n_r}\, \mathrm{Tr}\left( \Sigma^2 \right) + \alpha_r^2\, \mathrm{Tr}\left[ \Sigma_w \left( \Sigma^3 + \frac{1}{n_r} \left( F - \Sigma^3 \right) \right) \right] \quad (121, 122)$$
We can now rewrite the average test loss in Eq. 118 as
$$\bar{L}^{test} = \frac{\sigma^2}{2} \left( 1 + \frac{\alpha_r^2}{n_r}\, \mathrm{Tr}\left( \Sigma^2 \right) \right) + \frac{1}{2}\, \mathrm{Tr}\left[ \left( \Sigma_w + E\, \omega^\star \omega^{\star T} \right) H_r \right] \quad (123)$$
where we define the following matrix
$$H_r = \Sigma \left( I - \alpha_r \Sigma \right)^2 + \frac{\alpha_r^2}{n_r} \left( F - \Sigma^3 \right) \quad (124)$$
In order to average the last term, we need an expression for $\omega^\star$. We note that the loss in Eq. 20 is quadratic in $\omega$, therefore the solution of Eq. 22 can be found using standard linear algebra. In particular, the loss in Eq. 20 can be rewritten as
$$L^{meta} = \frac{1}{2 n_v m} \left| \gamma - B \omega \right|^2 \quad (125)$$
where $\gamma$ is a vector of shape $n_v m \times 1$, and $B$ is a matrix of shape $n_v m \times p$. The vector $\gamma$ is a stack of $m$ vectors
$$\gamma = \begin{pmatrix} X^{v(1)} \left( I - \frac{\alpha_t}{n_t} X^{t(1)T} X^{t(1)} \right) w^{(1)} - \frac{\alpha_t}{n_t} X^{v(1)} X^{t(1)T} z^{t(1)} + z^{v(1)} \\ \vdots \\ X^{v(m)} \left( I - \frac{\alpha_t}{n_t} X^{t(m)T} X^{t(m)} \right) w^{(m)} - \frac{\alpha_t}{n_t} X^{v(m)} X^{t(m)T} z^{t(m)} + z^{v(m)} \end{pmatrix} \quad (126)$$
Similarly, the matrix $B$ is a stack of $m$ matrices
$$B = \begin{pmatrix} X^{v(1)} \left( I - \frac{\alpha_t}{n_t} X^{t(1)T} X^{t(1)} \right) \\ \vdots \\ X^{v(m)} \left( I - \frac{\alpha_t}{n_t} X^{t(m)T} X^{t(m)} \right) \end{pmatrix} \quad (127)$$
In the overparameterized case ($p > n_v m$), under the assumption that the inverse of $B B^T$ exists, the value of $\omega$ that minimizes Eq. 125, and that also has minimum norm, is equal to
$$\omega^\star = B^T \left( B B^T \right)^{-1} \gamma \quad (128)$$
Note that the matrix $B$ does not depend on $w, z^t, z^v$, and $E_w E_{z^t} E_{z^v}\, \gamma = 0$, therefore Eq. 117 holds. In order to finish calculating Eq. 123, we need to average the following term
$$\mathrm{Tr}\left( H_r\, \omega^\star \omega^{\star T} \right) = \mathrm{Tr}\left[ \left( B B^T \right)^{-1} \gamma \gamma^T \left( B B^T \right)^{-1} B H_r B^T \right] \quad (129)$$
where we used the cyclic property of the trace. We start by averaging $\gamma \gamma^T$ over $w, z^t, z^v$, since $B$ does not depend on those variables. Note that $w, z^t, z^v$ are independent of each other and across tasks. We denote by $\Gamma$ the result of this operation, which is a block-diagonal matrix
$$\Gamma = E_w E_{z^t} E_{z^v}\, \gamma \gamma^T = \begin{pmatrix} \Gamma^{(1)} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \Gamma^{(m)} \end{pmatrix} \quad (130)$$
where the matrix blocks are given by
$$\Gamma^{(i)} = X^{v(i)} \left( I - \frac{\alpha_t}{n_t} X^{t(i)T} X^{t(i)} \right) \Sigma_w \left( I - \frac{\alpha_t}{n_t} X^{t(i)T} X^{t(i)} \right) X^{v(i)T} + \sigma^2 \left( I_{n_v} + \frac{\alpha_t^2}{n_t^2}\, X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} \right) \quad (131, 132)$$
Finally, we need to average over the training and validation data
$$E\, \mathrm{Tr}\left( H_r\, \omega^\star \omega^{\star T} \right) = E_{X^t} E_{X^v}\, \mathrm{Tr}\left[ \left( B B^T \right)^{-1} \Gamma \left( B B^T \right)^{-1} B H_r B^T \right] \quad (133)$$
These averages are hard to compute since they involve nonlinear functions of the data. However, we can approximate these terms by assuming that $p$ and $n_t$ are both large, of order $O(\xi)$, where $\xi$ is a large number. Furthermore, we assume that $\mathrm{Tr}\left( \Sigma_w^2 \right)$ is of order $O\left(\xi^{-1}\right)$, and that the variances of matrix products of the rescaled inputs $x/\sqrt{p}$, up to sixth order, are all of order $O\left(\xi^{-1}\right)$; in particular
$$\mathrm{Var}\left[ \frac{1}{p}\, X^{v(i)} X^{v(j)T} \right] = O\left(\xi^{-1}\right) \quad (134)$$
$$\mathrm{Var}\left[ \frac{1}{p^2}\, X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(j)T} \right] = O\left(\xi^{-1}\right) \quad (135)$$
$$\mathrm{Var}\left[ \frac{1}{p^3}\, X^{v(i)} X^{t(i)T} X^{t(i)} X^{t(j)T} X^{t(j)} X^{v(j)T} \right] = O\left(\xi^{-1}\right) \quad (136)$$
where, similar to Eq. 124, we define
$$H_t = \Sigma \left( I - \alpha_t \Sigma \right)^2 + \frac{\alpha_t^2}{n_t} \left( F - \Sigma^3 \right)$$
Note that all these terms are of order $O(\xi)$. The inverse of $B B^T$ can be found by a Taylor expansion
$$\left( B B^T \right)^{-1} = \mathrm{Tr}\left( H_t \right)^{-1} I_{n_v m} + O\left(\xi^{-3/2}\right)$$



(Gao & Sener (2020)). Again, none of these studies examine the effect of the learning rate, the main subject of our work. The theoretical work of Khodak et al. (2019) connects the learning rate to task similarity, while the work of Li et al. (2017) meta-learns the learning rate.




Figure 1: Graphical model of data generation in mixed linear regression

Figure 2: Average test loss of MAML as a function of the learning rate, on overparameterized mixed linear regression, as predicted by our theory and confirmed in experiments. a) Effect of learning rate α r during adaptation. b) Effect of learning rate α t during training. The optimal learning rate during adaptation is positive, while that during training is negative. Values of parameters: n t = 30, n v = 2, n r = 20, m = 3, p = 60, σ = 1., ν = 0.5, ω 0 = 0, w 0 = 0. In panel a) we set α t = 0.2, in panel b) we set α r = 0.2. In the experiments, each run is evaluated on 100 test tasks of 50 data points each, and each point is an average over 100 runs (a) or 1000 runs (b).

Figure 3: Average test loss as a function of the learning rate, on underparameterized mixed linear regression, as predicted by our theory and confirmed in experiments. a) Effect of learning rate α r during testing. b) Effect of learning rate α t during training. The optimal learning rate during testing is always positive, while that during training is negative. Values of parameters: n t = 5, n v = 25, n r = 10, m = 40, p = 30, σ = 0.2, ν = 0.2. In panel a) we set α t = 0.2, in panel b) we set α r = 0.2. In the experiments, the model is evaluated on 100 tasks of 50 data points each, and each point is an average over 100 (a) or 1000 (b) runs.

Figure 4: Average test loss of MAML as a function of the learning rate, on overparameterized mixed linear regression with Wishart covariances, as predicted by our theory and confirmed in experiments. a) Effect of learning rate α r during adaptation. b) Effect of learning rate α t during training. The optimal learning rate during adaptation is positive, while that during training appears to be negative. Values of parameters: n t = 30, n v = 2, n r = 20, m = 3, p = 60, σ = 1., ν = 0.5, ω 0 = 0, w 0 = 0. In panel a) we set α t = 0.2, in panel b) we set α r = 0.2. In the experiments, each run is evaluated on 100 tasks of 50 data points each, and each point is an average over 100 runs (a) or 500 runs (b).

Figure 6: Average test loss of MAML as a function of the learning rate α t (training) on mixed linear regression, showing the transition from strongly overparameterized (a), to weakly overparameterized (b), weakly underparameterized (c) and strongly underparameterized (d). As expected, the predictions of the theory are accurate only in panels (a) and (d). The amount of validation data increases from panel (a) to (d), with the following values: m = 1, n v = 2 (a), m = 5, n v = 5 (b), m = 10, n v = 10 (c), m = 10, n v = 40 (d). Other parameters are equal to: n t = 40, n r = 40, p = 50, σ = 0.5, ν = 0.5, α r = 0.2, ω 0 = 0, w 0 = (0.1, 0.1, . . . , 0.1) (note that overfitting occurs since ω 0 ≠ w 0 ). In the experiments, each run is evaluated on 100 test tasks of 50 data points each, and each point is an average over 100 runs.

Then, using Eqs. 112, 113 and the expressions of $B$ (Eq. 127) and $\Gamma$ (Eqs. 130, 131), we can prove that
$$B B^T = \mathrm{Tr}\left( H_t \right) I_{n_v m} + O\left(\xi^{1/2}\right) \quad (137)$$
$$\Gamma = \left[ \mathrm{Tr}\left( \Sigma_w H_t \right) + \sigma^2 \left( 1 + \frac{\alpha_t^2}{n_t}\, \mathrm{Tr}\left( \Sigma^2 \right) \right) \right] I_{n_v m} + O\left(\xi^{1/2}\right) \quad (138)$$
$$B H_r B^T = \mathrm{Tr}\left( H_r H_t \right) I_{n_v m} + O\left(\xi^{1/2}\right) \quad (139)$$

Substituting these expressions into Eq. 133, we find
$$E\, \mathrm{Tr}\left( H_r\, \omega^\star \omega^{\star T} \right) = n_v m\, \frac{\mathrm{Tr}\left( H_r H_t \right)}{\left[ \mathrm{Tr}\left( H_t \right) \right]^2} \left[ \mathrm{Tr}\left( \Sigma_w H_t \right) + \sigma^2 \left( 1 + \frac{\alpha_t^2}{n_t}\, \mathrm{Tr}\left( \Sigma^2 \right) \right) \right]$$
Substituting this expression into Eq. 123, we find the value of the average test loss
$$\bar{L}^{test} = \frac{\sigma^2}{2} \left( 1 + \frac{\alpha_r^2}{n_r}\, \mathrm{Tr}\left( \Sigma^2 \right) \right) + \frac{1}{2}\, \mathrm{Tr}\left( \Sigma_w H_r \right) + \frac{n_v m}{2}\, \frac{\mathrm{Tr}\left( H_r H_t \right)}{\left[ \mathrm{Tr}\left( H_t \right) \right]^2} \left[ \mathrm{Tr}\left( \Sigma_w H_t \right) + \sigma^2 \left( 1 + \frac{\alpha_t^2}{n_t}\, \mathrm{Tr}\left( \Sigma^2 \right) \right) \right]$$

ACKNOWLEDGEMENTS

We would like to thank Paolo Grazieschi for helping with formalizing the theorems, and Ritwik Niyogi for helping with nonlinear regression experiments.

ANNEX

Putting everything together in Eq. 85, and applying the trace operator, we find the expression for the meta-parameter variance $E\, |\omega^\star - w_0|^2$ given in Theorem 2. The result is written in terms of coefficients $g_i$, which are polynomial functions of the moments $\mu_i$ defined in Lemma 1. Substituting this expression back into Eq. 65 returns the final expression for the average test loss (Eq. 108).

7.4. PROOF OF THEOREM 3

In this section, we relax some of the assumptions on the distributions of data and parameters. In particular, we do not assume a specific distribution for the input data vectors $x$ and the generating parameter vector $w$.

