HOW IMPORTANT IS THE TRAIN-VALIDATION SPLIT IN META-LEARNING?

Abstract

Meta-learning aims to perform fast adaptation on a new task through learning a "prior" from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct non-splitting method, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful on the linear centroid meta-learning problem, in the asymptotic setting where the number of tasks goes to infinity. We show that the splitting method converges to the optimal prior as expected, whereas the non-splitting method does not in general without structural assumptions on the data. In contrast, if the data are generated from linear models (the realizable regime), we show that both the splitting and non-splitting methods converge to the optimal prior. Further, perhaps surprisingly, our main result shows that the non-splitting method achieves a strictly better asymptotic excess risk under this data distribution, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that data splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the non-splitting method can indeed outperform the splitting method, on both simulations and real meta-learning tasks.

1. INTRODUCTION

Meta-learning, also known as "learning to learn", has recently emerged as a powerful paradigm for learning to adapt to unseen tasks (Schmidhuber, 1987). The high-level methodology in meta-learning is akin to how human beings learn new skills, which is typically done by relating to certain prior experience that makes the learning process easier. More concretely, meta-learning does not train one model for each individual task, but rather learns a "prior" model from multiple existing tasks so that it is able to quickly adapt to unseen new tasks. Meta-learning has been successfully applied to many real problems, including few-shot image classification (Finn et al., 2017; Snell et al., 2017), hyper-parameter optimization (Franceschi et al., 2018), low-resource machine translation (Gu et al., 2018), and short event sequence modeling (Xie et al., 2019). A common practice in meta-learning algorithms is to perform a sample split, where the data within each task is divided into a training split which the prior uses to adapt to a task-specific predictor, and a validation split on which we evaluate the performance of the task-specific predictor (Nichol et al., 2018; Rajeswaran et al., 2019; Fallah et al., 2020; Wang et al., 2020a). For example, in a 5-way k-shot image classification task, standard meta-learning algorithms such as MAML (Finn et al., 2017) use 5k examples within each task as training data, and use additional examples (e.g. k images, one for each class) as validation data. This sample splitting is believed to be crucial as it matches the evaluation criterion at meta-test time, where we perform adaptation on training data from a new task but evaluate its performance on unseen data from the same task. Despite this aforementioned importance, performing the train-validation split has a potential drawback from the data-efficiency perspective: because of the split, neither the training nor the evaluation stage is able to use all the available per-task data.
In the few-shot image classification example, each task has a total of 6k examples available, but the train-validation split forces us to use these data separately in the two stages. Meanwhile, performing the train-validation split is also not the only option in practice: there exist algorithms such as Reptile (Nichol & Schulman, 2018) and Meta-MinibatchProx (Zhou et al., 2019) that instead use all the per-task data for training the task-specific predictor and also perform well empirically on benchmark tasks. These algorithms modify the loss function in the outer loop so that the training loss no longer matches the meta-test loss, but they may have an advantage in data efficiency for the overall problem of learning the best prior. So far it is theoretically unclear how these two approaches (with/without the train-validation split) compare with each other, which motivates us to ask the following question:

Is the train-validation split necessary and optimal in meta-learning?

In this paper, we perform a detailed theoretical study on the importance of the train-validation split. We consider the linear centroid meta-learning problem (Denevi et al., 2018b), where for each task we learn a linear predictor that is close to a common centroid in the inner loop, and find the best centroid in the outer loop (see Section 2 for the detailed problem setup). This problem captures the essence of meta-learning with non-linear models (such as neural networks) in practice, yet is sufficiently simple to allow a precise theoretical characterization. We use a biased ridge solver as the inner loop with a (tunable) regularization parameter, and compare two outer-loop algorithms that either perform the train-validation split (the train-val method) or use all the per-task data for both training and evaluation (the train-train method).
Specifically, we compare the two methods when the number of tasks T is large, and examine if and how fast they converge to the (properly defined) best centroid at meta-test time. We summarize our contributions as follows:

• On the linear centroid meta-learning problem, we show that the train-validation split is necessary in the general agnostic setting: as T → ∞, the train-val method converges to the optimal centroid for test-time adaptation, whereas the train-train method does not without further assumptions on the tasks (Section 3). The convergence of the train-val method is expected since its (population) training loss is equivalent to the meta-test loss, whereas the non-convergence of the train-train method stems from the fact that these two losses are not equivalent in general.

• Our main theoretical contribution is to show that the train-validation split is not necessary and even non-optimal in the perhaps more interesting regime where there are structural assumptions on the tasks: when the data are generated from noiseless linear models, both the train-val and train-train methods converge to the common best centroid, and the train-train method achieves a strictly better (asymptotic) estimation error and test loss than the train-val method (Section 4). This is in stark contrast with the agnostic case, and suggests that data efficiency may indeed be more important when the tasks have a nice structure. Our results build on tools from random matrix theory in the proportional regime, which may be of broader technical interest.

• We perform meta-learning experiments on simulations and benchmark few-shot image classification tasks, showing that the train-train method consistently outperforms the train-val method (Section 5 & Appendix D). This validates our theories and presents empirical evidence that sample splitting may not be crucial; methods that utilize the per-task data more efficiently may be preferred.

1.1. RELATED WORK

Meta-learning and representation learning theory Baxter (2000) provided the first theoretical analysis of meta-learning via covering numbers, and Maurer et al. (2016) improved the analysis via Gaussian complexity techniques. Another recent line of theoretical work analyzed gradient-based meta-learning methods (Denevi et al., 2018a; Finn et al., 2019; Khodak et al., 2019; Ji et al., 2020) and showed guarantees for convex losses using tools from online convex optimization. Saunshi et al. (2020) proved the success of Reptile in a one-dimensional subspace setting. Wang et al. (2020b) compared the performance of the train-train and train-val methods for learning the learning rate. Denevi et al. (2018b) proposed the linear centroid model studied in this paper and provided generalization error bounds for the train-val method; the bounds they prove also hold for the train-train method, and so are not sharp enough to compare the two algorithms. Wang et al. (2020a) studied the convergence of gradient-based meta-learning by relating it to a kernelized approximation. On the representation learning end, Du et al. (2020) and Tripuraneni et al. (2020a;b) showed that ERM can successfully pool data across tasks to learn the representation. Yet the focus there is on the accurate estimation of the common representation, not on the fast adaptation of the learned prior. Lastly, we remark that there are analyses for other representation learning schemes (McNamara & Balcan, 2017; Galanti et al., 2016; Alquier et al., 2016).

Empirical understandings of meta-learning Raghu et al. (2020) investigated the representation learning perspective of meta-learning and showed that MAML with a full finetuning inner loop mostly learns the top-layer linear classifier and does not change the representation layers much. This result partly justifies the validity of our linear centroid meta-learning problem, in which the features (representations) are fixed and only a linear classifier is learned. Goldblum et al. (2020) investigated the difference between the neural representations learned by classical training (supervised learning) and meta-learning, and showed that the meta-learned representation is both better for downstream adaptation and makes classes more separated than the classically trained one. Although the classical training method in (Goldblum et al., 2020) does not perform a train-validation split, it is not exactly the same as the train-train method considered in this work, as it effectively performs supervised learning on all tasks combined and does not do a per-task adaptation.

Multi-task learning Multi-task learning also exploits structures and similarities across multiple tasks. The earliest ideas date back to Caruana (1997); Thrun & Pratt (1998); Baxter (2000), initially in connection to neural network models. Later works developed approaches using kernel methods (Evgeniou et al., 2005; Argyriou et al., 2007) and multivariate linear regression models with structured sparsity (Liu et al., 2009; 2015). More recent advances in deep multi-task learning focus on learning shared intermediate representations across tasks (Ruder, 2017). These multi-task learning approaches usually minimize the joint empirical risk over all tasks, and the models for different tasks are enforced to share a large number of parameters. In contrast, meta-learning only requires the models to share the same "prior", which is more flexible than multi-task learning.

2. PRELIMINARIES

In this paper, we consider the standard meta-learning setting, in which we observe data from T ≥ 1 supervised learning tasks, and the goal is to find a prior (or "initialization") using the combined data, such that the (T + 1)-th new task may be solved sample-efficiently using the prior.

Linear centroid meta-learning We instantiate our study on the linear centroid meta-learning problem (also known as learning to learn around a common mean, Denevi et al. (2018b)), where we wish to learn a task-specific linear predictor $w_t \in \mathbb{R}^d$ in the inner loop for each task $t$, and learn a "centroid" $w_0$ in the outer loop that enables fast adaptation to $w_t$ within each task:

Find the best centroid $w_0 \in \mathbb{R}^d$ for adapting to a linear predictor $w_t$ on each task $t$.

Formally, we assume that we observe training data from T ≥ 1 tasks, where for each task index $t$ we sample a task $p_t$ (a distribution over $\mathbb{R}^d \times \mathbb{R}$) from some distribution of tasks $\Pi$, and observe $n$ examples $(X_t, y_t) \in \mathbb{R}^{n \times d} \times \mathbb{R}^n$ drawn i.i.d. from $p_t$:

$p_t \sim \Pi, \quad (X_t, y_t) = \{(x_{t,i}, y_{t,i})\}_{i=1}^n \quad \text{where } (x_{t,i}, y_{t,i}) \overset{iid}{\sim} p_t. \quad (1)$

We do not make further assumptions on $(n, d)$; in particular, we allow the underdetermined setting $n \le d$, in which there exist (one or many) interpolators $w_t$ that perfectly fit the data: $X_t w_t = y_t$.

Inner loop: Ridge solver with biased regularization towards the centroid Our goal in the inner loop is to find a linear predictor $w_t$ that fits the data in task $t$ while being close to the given "centroid" $w_0 \in \mathbb{R}^d$. We instantiate this through ridge regression (i.e. linear regression with $L_2$ regularization) where the regularization biases $w_t$ towards the centroid. Formally, for any $w_0 \in \mathbb{R}^d$ and any dataset $(X, y)$, we consider the algorithm

$A_\lambda(w_0; X, y) := \arg\min_w \frac{1}{n}\|Xw - y\|^2 + \lambda \|w - w_0\|^2 = w_0 + (X^\top X + n\lambda I_d)^{-1} X^\top (y - Xw_0),$

where $\lambda > 0$ is the regularization strength (typically a tunable hyper-parameter).
As we regularize by $\|w - w_0\|^2$, this inner solver encourages the solution to be close to $w_0$, as we desire. Such a regularizer is widely used in practical meta-learning algorithms such as MetaOptNet (Lee et al., 2019) and Meta-MinibatchProx (Zhou et al., 2019). In addition, as $\lambda \to 0$, this solver recovers gradient descent fine-tuning: we have

$A_0(w_0; X, y) := \lim_{\lambda \to 0} A_\lambda(w_0; X, y) = w_0 + X^\dagger (y - Xw_0) = \arg\min_{Xw = y} \|w - w_0\|^2$

(where $X^\dagger \in \mathbb{R}^{d \times n}$ denotes the pseudo-inverse of $X$). This is the minimum-distance interpolator of $(X, y)$ and also the solution found by gradient descent on $\|Xw - y\|^2$ initialized at $w_0$. Therefore our ridge solver with $\lambda > 0$ can be seen as a generalized version of the gradient descent solver used in MAML (Finn et al., 2017).

Outer loop: Finding the best centroid In the outer loop, our goal is to find the best centroid $w_0$. The standard approach in meta-learning is to perform a train-validation split, that is, (1) execute the inner solver on a first split of the task-specific data, and (2) evaluate the loss on a second split, yielding a function of $w_0$ that we can optimize. This two-stage procedure computes $\hat{w}_t(w_0) = A_\lambda(w_0; X^{\text{train}}_t, y^{\text{train}}_t)$ and evaluates $\|y^{\text{val}}_t - X^{\text{val}}_t \hat{w}_t(w_0)\|^2$, where $(X^{\text{train}}_t, y^{\text{train}}_t) = \{(x_{t,i}, y_{t,i})\}_{i=1}^{n_1}$ and $(X^{\text{val}}_t, y^{\text{val}}_t) = \{(x_{t,i}, y_{t,i})\}_{i=n_1+1}^{n}$ are two disjoint splits of the per-task data $(X_t, y_t)$ of sizes $(n_1, n_2)$, with $n_1 + n_2 = n$. Written concisely, this is to consider the "split loss"

$\ell^{\text{tr-val}}_t(w_0) := \frac{1}{2n_2}\big\|y^{\text{val}}_t - X^{\text{val}}_t A_\lambda(w_0; X^{\text{train}}_t, y^{\text{train}}_t)\big\|^2. \quad (2)$

In this paper, we will also consider an alternative version, where we do not perform the train-validation split, but instead use all the per-task data for both training and evaluation. Mathematically, this is to look at the "non-split loss"

$\ell^{\text{tr-tr}}_t(w_0) := \frac{1}{2n}\big\|y_t - X_t A_\lambda(w_0; X_t, y_t)\big\|^2. \quad (3)$
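To make the closed form of the inner solver concrete, here is a minimal NumPy sketch of $A_\lambda$ and its $\lambda \to 0$ limit. The function names are ours (not from the paper); this is an illustration of the formulas above, not the authors' code.

```python
import numpy as np

def ridge_solver(w0, X, y, lam):
    """Biased ridge solver A_lambda(w0; X, y):
    argmin_w (1/n)||Xw - y||^2 + lam * ||w - w0||^2,
    computed via the closed form w0 + (X'X + n*lam*I)^{-1} X'(y - X w0)."""
    n, d = X.shape
    return w0 + np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ (y - X @ w0))

def min_norm_solver(w0, X, y):
    """lambda -> 0 limit A_0(w0; X, y) = w0 + X^+(y - X w0):
    the interpolator of (X, y) closest to w0."""
    return w0 + np.linalg.pinv(X) @ (y - X @ w0)
```

In the underdetermined regime ($n \le d$), `min_norm_solver` returns an exact interpolator, and `ridge_solver` approaches it as `lam` shrinks, matching the limit stated in the text.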
Our overall algorithm is to solve the empirical risk minimization (ERM) problem on the T observed tasks, using either one of the two losses above:

$\hat{w}^{\text{tr-val}}_{0,T} := \arg\min_{w_0} \hat{L}^{\text{tr-val}}_T(w_0) := \frac{1}{T}\sum_{t=1}^T \ell^{\text{tr-val}}_t(w_0), \qquad \hat{w}^{\text{tr-tr}}_{0,T} := \arg\min_{w_0} \hat{L}^{\text{tr-tr}}_T(w_0) := \frac{1}{T}\sum_{t=1}^T \ell^{\text{tr-tr}}_t(w_0). \quad (4)$

At meta-test time, a centroid $w_0$ is evaluated by drawing a new task $p_{T+1}$ along with per-task data $(X_{T+1}, y_{T+1})$, adapting with an inner algorithm Alg, and measuring the prediction risk on a fresh example $(x', y')$:

$L^{\text{test}}(w_0; \text{Alg}) := \mathbb{E}_{p_{T+1} \sim \Pi}\, \mathbb{E}_{(X_{T+1}, y_{T+1}), (x', y') \overset{iid}{\sim} p_{T+1}} \frac{1}{2}\big(x'^\top \text{Alg}(w_0; X_{T+1}, y_{T+1}) - y'\big)^2.$

Additionally, for both the train-val and train-train methods, we need to ensure that the inner loop used at meta-test time is exactly the same as that used in meta-training. Therefore, the meta-test performance of the train-val and train-train methods above should be evaluated as

$L^{\text{test}}_{\lambda,n_1}(\hat{w}^{\text{tr-val}}_{0,T}) := L^{\text{test}}(\hat{w}^{\text{tr-val}}_{0,T}; A_{\lambda,n_1}), \qquad L^{\text{test}}_{\lambda,n}(\hat{w}^{\text{tr-tr}}_{0,T}) := L^{\text{test}}(\hat{w}^{\text{tr-tr}}_{0,T}; A_{\lambda,n}),$

where $A_{\lambda,m}$ denotes the ridge solver with regularization strength $\lambda > 0$ on $m \le n$ data points. Finally, we let

$w_{0,\star}(\lambda; n) = \arg\min_{w_0} L^{\text{test}}_{\lambda,n}(w_0) \quad (5)$

denote the best centroid if the inner loop uses $A_{\lambda,n}$. The performance of the train-val algorithm $\hat{w}^{\text{tr-val}}_{0,T}$ should be compared against $w_{0,\star}(\lambda, n_1)$, whereas the train-train algorithm $\hat{w}^{\text{tr-tr}}_{0,T}$ should be compared against $w_{0,\star}(\lambda, n)$.
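As a self-contained sketch (helper names are ours, not the paper's code), the split and non-split per-task losses (2)–(3) can be written directly from their definitions:

```python
import numpy as np

def ridge(w0, X, y, lam):
    """Inner solver A_lambda(w0; X, y) in closed form."""
    n, d = X.shape
    return w0 + np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ (y - X @ w0))

def loss_tr_val(w0, X, y, n1, lam):
    """Split loss (2): adapt on the first n1 examples, evaluate on the remaining n2."""
    Xtr, ytr, Xva, yva = X[:n1], y[:n1], X[n1:], y[n1:]
    w = ridge(w0, Xtr, ytr, lam)
    return np.sum((yva - Xva @ w) ** 2) / (2 * len(yva))

def loss_tr_tr(w0, X, y, lam):
    """Non-split loss (3): adapt and evaluate on all n examples."""
    w = ridge(w0, X, y, lam)
    return np.sum((y - X @ w) ** 2) / (2 * len(y))
```

With a small `lam` in the underdetermined regime the inner solver (nearly) interpolates, so the non-split loss is close to zero while the split loss generally is not, which is exactly the in-sample vs. out-of-sample distinction between the two objectives.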

2.1. TASK-ABUNDANT SETTING THROUGH ASYMPTOTIC ANALYSIS

In this paper we are interested in the task-abundant setting where we fix some finite $(d, n)$ and let $T$ be very large. We analyze such a task-abundant setting through the asymptotic analysis framework, that is, we examine the limiting properties of the estimators (e.g. $\hat{w}^{\{\text{tr-val},\text{tr-tr}\}}_{0,T}$) as $T \to \infty$. Here we set up the basic notation of asymptotic analysis required in this paper. We emphasize that our large-$T$ setting captures practical meta-learning scenarios; for example, 5-way image classification on miniImageNet (Ravi & Larochelle, 2017) contains $\binom{64}{5}$ diverse tasks (at train time).

Asymptotic rate of estimation & excess risk Let $L$ be any population risk with minimizer $w_{0,\star}$ (which we assume is unique), $\hat{L}_T$ be the empirical risk on the observed data from $T$ tasks, and $\hat{w}_{0,T}$ be the minimizer of $\hat{L}_T$ (i.e. the ERM). We say that $\hat{w}_{0,T}$ is consistent if $\hat{w}_{0,T} \to w_{0,\star}$ in probability as $T \to \infty$. For consistent ERMs, we define the asymptotic parameter estimation error (in MSE loss) and the asymptotic excess risk as follows:

$\text{AsymMSE}(\hat{w}_{0,T}) := \lim_{T\to\infty} T \cdot \mathbb{E}\|\hat{w}_{0,T} - w_{0,\star}\|^2, \qquad \text{AsymExcessRisk}(\hat{w}_{0,T}) := \lim_{T\to\infty} T \cdot \mathbb{E}[L(\hat{w}_{0,T}) - L(w_{0,\star})].$

We emphasize that asymptotic statements are more refined than non-asymptotic $O(\cdot)$-style upper bounds in the $T \to \infty$ limit: they not only imply that the MSE and excess risk have order $O(1/T)$, but also specify the leading constants.
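To build intuition for these definitions, one can check them on the simplest possible ERM, the sample mean, for which $T \cdot \mathbb{E}\|\hat{\theta}_T - \theta\|^2 \to \mathrm{tr}(\mathrm{Cov})$. This is our own self-contained illustration (not part of the paper's setup):

```python
import numpy as np

# Monte Carlo check that T * E||theta_hat - theta||^2 approaches tr(Cov)
# for the sample mean of T i.i.d. draws (here theta = 0, tr(Cov) = 3.5).
rng = np.random.default_rng(0)
d, T, reps = 3, 2000, 1000
cov = np.diag([1.0, 2.0, 0.5])
L_chol = np.linalg.cholesky(cov)

mse = 0.0
for _ in range(reps):
    x = rng.standard_normal((T, d)) @ L_chol.T   # rows ~ N(0, cov)
    mse += np.sum(x.mean(axis=0) ** 2)
scaled = T * mse / reps
print(scaled)   # close to tr(cov) = 3.5
```

The asymptotic statement thus pins down not just the $O(1/T)$ rate but the exact leading constant, here $\mathrm{tr}(\mathrm{Cov})$.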

3. THE IMPORTANCE OF SAMPLE SPLITTING

We begin by analyzing whether the algorithms $\hat{w}^{\{\text{tr-val},\text{tr-tr}\}}_{0,T}$ defined in (4) converge to the best test-time centroids $w_{0,\star}(\lambda; n_1)$ and $w_{0,\star}(\lambda; n)$ (defined in (5)) respectively as $T \to \infty$, in the general situation where we do not make structural assumptions on the data distribution $p_t$.

Proposition 1 (Consistency and asymptotics of the train-val method). Suppose $\mathbb{E}_{x \sim p_t}[xx^\top] \succ 0$, $\mathbb{E}_{x \sim p_t}[\|x\|^4] < \infty$, and $\mathbb{E}_{(x,y) \sim p_t}[\|xy\|] < \infty$ for almost surely all $p_t \sim \Pi$. Then for any $\lambda > 0$ and any $(n_1, n_2)$ such that $n_1 + n_2 = n$, the train-val method $\hat{w}^{\text{tr-val}}_{0,T}$ converges to the best test-time centroid: $\hat{w}^{\text{tr-val}}_{0,T} \to w_{0,\star}(\lambda, n_1)$ almost surely as $T \to \infty$. Further, we have

$\text{AsymMSE}(\hat{w}^{\text{tr-val}}_{0,T}) = \mathrm{tr}\Big[\nabla^{-2} L^{\text{test}}_{\lambda,n_1}(w_{0,\star}(\lambda,n_1)) \cdot \mathrm{Cov}\big(\nabla \ell^{\text{tr-val}}_t(w_{0,\star}(\lambda,n_1))\big) \cdot \nabla^{-2} L^{\text{test}}_{\lambda,n_1}(w_{0,\star}(\lambda,n_1))\Big],$

$\text{AsymExcessRisk}_{L^{\text{test}}_{\lambda,n_1}}(\hat{w}^{\text{tr-val}}_{0,T}) = \frac{1}{2}\mathrm{tr}\Big[\nabla^{-2} L^{\text{test}}_{\lambda,n_1}(w_{0,\star}(\lambda,n_1)) \cdot \mathrm{Cov}\big(\nabla \ell^{\text{tr-val}}_t(w_{0,\star}(\lambda,n_1))\big)\Big].$

Proposition 2 (Inconsistency of the train-train method). There exists a distribution of tasks $\Pi$ in $d = 1$ satisfying the conditions in Proposition 1 on which the train-train method does not converge to the best test-time centroid: for any $n \ge 1$ and any $\lambda > 0$, the estimation error $\|\hat{w}^{\text{tr-tr}}_{0,T} - w_{0,\star}(\lambda, n)\|$ and the excess risk $L^{\text{test}}_{\lambda,n}(\hat{w}^{\text{tr-tr}}_{0,T}) - L^{\text{test}}_{\lambda,n}(w_{0,\star}(\lambda, n))$ are both bounded away from 0 almost surely as $T \to \infty$.

Propositions 1 and 2 justify the importance of sample splitting: the train-val method converges to the best test-time centroid, whereas the train-train method does not converge to the best centroid in general. The reason behind Proposition 1 is simple: the population loss $L^{\text{tr-val}}$ of the train-val method is exactly equal to the test-time loss $L^{\text{test}}_{\lambda,n_1}$, making the train-val method a proper ERM for the meta-test objective we care about; the consistency and asymptotic normality thus follow from classical results for the ERM (e.g. Van der Vaart (2000); Liang (2016)).
In contrast, for the train-train method, its expected loss $L^{\text{tr-tr}}$ is in general not equivalent to $L^{\text{test}}_{\lambda,n}$: $L^{\text{tr-tr}}$ measures the in-sample prediction error of the per-task predictor, whereas $L^{\text{test}}_{\lambda,n}$ measures the out-of-sample prediction error. Consequently, the population minimizers of $L^{\text{test}}_{\lambda,n}$ and $L^{\text{tr-tr}}$ are not equal in general, and $\hat{w}^{\text{tr-tr}}_{0,T}$ converges to the minimizer of $L^{\text{tr-tr}}$, not of $L^{\text{test}}_{\lambda,n}$. The proof of Proposition 2 constructs a simple counter-example in $d = 1$, but we expect such a mismatch to hold generally in any dimension. Appendix A gives the proofs of Propositions 1 and 2.
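The in-sample/out-of-sample mismatch can be seen numerically in $d = 1$ with a hypothetical task distribution of our own (two task types whose input scales differ; an illustration of the phenomenon, not the specific counter-example from Appendix A). For noiseless $d=1$ tasks, the per-task non-split loss reduces to $b_t^2 S_t (w_t - w_0)^2 / (2n)$ with $S_t = \sum_i x_{t,i}^2$ and $b_t = n\lambda/(S_t + n\lambda)$, while the test-time risk of the adapted predictor is proportional to $s_t^2 b_t^2 (w_0 - w_t)^2$, where $s_t$ is the task's input scale. Both are weighted quadratics in $w_0$, but with different weights, so their minimizers differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, T = 2, 1.0, 200_000

# Hypothetical task distribution: half the tasks have (input scale 0.3, w_t = 2),
# the other half (input scale 3.0, w_t = 0). Data are noiseless: y = w_t * x.
scale = np.where(rng.random(T) < 0.5, 0.3, 3.0)
w_t = np.where(scale < 1.0, 2.0, 0.0)
S = ((scale[:, None] * rng.standard_normal((T, n))) ** 2).sum(axis=1)  # sum of x^2
b = n * lam / (S + n * lam)   # shrinkage factor of the ridge inner solver

# Limit of the train-train ERM: weighted mean of w_t with weights b^2 * S.
w0_trtr = np.sum(b**2 * S * w_t) / np.sum(b**2 * S)
# Best test-time centroid: weighted mean of w_t with weights scale^2 * b^2.
w0_star = np.sum(scale**2 * b**2 * w_t) / np.sum(scale**2 * b**2)

print(w0_trtr, w0_star)   # two different limits
```

The two weighted means place very different weight on the small-scale tasks, so the train-train limit stays bounded away from the test-time optimum, mirroring Proposition 2.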

4. IS SAMPLE SPLITTING ALWAYS OPTIMAL?

Proposition 2 states a negative result for the train-train method, showing that it does not converge to the best test-time centroid without further assumptions on the data distribution. However, such a negative result is inherently worst-case, and does not preclude the possibility that there exists a data distribution on which the train-train method can also work well. In this section, we construct a simple data distribution on which we can analyze the performance of both the train-val and the train-train methods more explicitly, showing that sample splitting is indeed not optimal, and the train-train method can work better.

Realizable linear model We consider the following instantiation of the (generic) data distribution assumption in (1): we assume that each task $p_t$ is specified by a $w_t \in \mathbb{R}^d$ sampled from some distribution $\Pi$ (overloading notation), and the observed data follow the noiseless linear model with ground-truth parameter $w_t$:

$y_t = X_t w_t, \quad \text{where } x_{t,i} \overset{iid}{\sim} \mathcal{N}(0, I_d) \text{ independently of } w_t. \quad (6)$

We assume that $\Pi$ has a finite second moment (i.e. $\mathbb{E}_{w_t \sim \Pi}[\|w_t\|^2] < \infty$). Note that when $n \ge d$, we are able to perfectly recover $w_t$ for all $t$ (by solving linear equations), so the inner-loop problem is in a sense "easy"; when $n < d$, we cannot hope for such perfect recovery. Our goal in the outer loop is to find the best $w_0$, measured by the test loss $L^{\text{test}}_{\lambda,n}$ for the train-train method and $L^{\text{test}}_{\lambda,n_1}$ for the train-val method.

4.1. COMPARISON OF TRAIN-TRAIN AND TRAIN-VAL ON THE REALIZABLE MODEL

We begin by showing that on this task and data distribution, the population best centroid $w_{0,\star}(\lambda, n) = \arg\min_{w_0} L^{\text{test}}_{\lambda,n}(w_0)$ is the same for any $(\lambda, n)$, and both the train-val and train-train methods are asymptotically consistent and converge to this common best centroid.

Theorem 3 (Consistency of both the train-val and train-train methods). On the realizable linear model (6), the test-time meta loss for all $\lambda > 0$ and all $n$ is minimized at the same point, namely the mean of the ground-truth parameters:

$w_{0,\star}(\lambda, n) = \arg\min_{w_0} L^{\text{test}}_{\lambda,n}(w_0) = w_{0,\star} := \mathbb{E}_{w_t \sim \Pi}[w_t] \quad \text{for all } \lambda > 0 \text{ and } n.$

Further, both the train-val and the train-train methods are asymptotically consistent: for any $\lambda > 0$, $n$, and $(n_1, n_2)$, we have $\hat{w}^{\text{tr-val}}_{0,T}(n_1, n_2; \lambda) \to w_{0,\star}$ and $\hat{w}^{\text{tr-tr}}_{0,T}(n; \lambda) \to w_{0,\star}$ almost surely as $T \to \infty$.

See its proof in Appendix B.1. Theorem 3 shows that both the train-val and train-train methods are consistent, and they converge to the same optimal parameter $w_{0,\star}$, the mean of $w_t$. This is a consequence of the good structure in our realizable linear model (6): at a high level, $w_{0,\star}$ is indeed the best centroid since it has (on average) the closest distance to a randomly sampled $w_t$. Theorem 3 implies that we can now compare the performance of the two methods through their asymptotic parameter estimation error (for estimating $w_{0,\star}$). Throughout the rest of this section, let $R^2 := \mathbb{E}\|w_t - w_{0,\star}\|^2$ denote the variance of $w_t$. We are now ready to state our main result.

Theorem 4 (Comparison of asymptotic MSE of the train-val and train-train methods).
In the high-dimensional limiting regime $d, n \to \infty$, $d/n \to \gamma \in (0, \infty)$, the optimal rate of the train-train method obtained by tuning the regularization $\lambda \in (0, \infty)$ satisfies

$\inf_{\lambda>0}\ \lim_{d,n\to\infty,\, d/n=\gamma} \mathrm{AsymMSE}\big(\hat{w}^{\text{tr-tr}}_{0,T}(n; \lambda)\big) = \inf_{\lambda>0} \rho_{\lambda,\gamma} R^2 \overset{(\star)}{\le} \max\Big\{1 + \frac{5}{27}\gamma,\ \frac{5}{27} + \gamma\Big\} \cdot R^2,$

where

$\rho_{\lambda,\gamma} = \frac{4\gamma^2\big[(\gamma-1)^2 + (\gamma+1)\lambda\big]}{\big(\lambda+\gamma+1 - \sqrt{(\lambda+\gamma+1)^2 - 4\gamma}\big)^2 \cdot \big((\lambda+\gamma+1)^2 - 4\gamma\big)^{3/2}},$

and the inequality $(\star)$ becomes an equality at $\gamma = 1$. In contrast, the optimal rate of the train-val method obtained by tuning the regularization $\lambda \in (0, \infty)$ and the split ratio $s \in (0, 1)$ is

$\inf_{\lambda>0,\, s\in(0,1)}\ \lim_{d,n\to\infty,\, d/n=\gamma} \mathrm{AsymMSE}\big(\hat{w}^{\text{tr-val}}_{0,T}(ns, n(1-s); \lambda)\big) = (1+\gamma) R^2.$

As $\max\{1 + 5\gamma/27,\ 5/27 + \gamma\} < 1 + \gamma$ for all $\gamma > 0$, the train-train method achieves a strictly better asymptotic rate than the train-val method when $\lambda$ and $s$ are optimally tuned for both methods.

Implications The comparison between the analytical upper bound $\max\{1 + 5\gamma/27,\ 5/27 + \gamma\} R^2$ for the train-train method and $(1+\gamma)R^2$ for the train-val method in Theorem 4 shows that the train-train method achieves a strictly better asymptotic MSE than the train-val method, for any $\gamma > 0$. (See Figure 1(a) for a visualization of the exact optimal rates and the upper bound $(\star)$.) Perhaps surprisingly, this suggests that the train-train method is not only "correct" (converging to the best centroid), but can even be better than the train-val method when the data are structured. While the "correctness" of the train-train method is a consequence of the realizable linear model, we believe its superior asymptotic MSE comes from the fact that the train-train method uses the data more efficiently than the train-val method. Also, while we reached this conclusion on the particular problem of linear centroid meta-learning, we suspect this phenomenon is not restricted to this problem, and may hold more generally when the data are structured or when the signal-to-noise ratio is high.
As we are going to see, our real-data experiments in Section 5.2 indeed suggest that the superiority of the train-train method may also hold on real meta-learning tasks with neural networks.
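The rate comparison in Theorem 4 can be sanity-checked numerically by scanning $\lambda$ on a grid (taking $R^2 = 1$; the function below is our transcription of $\rho_{\lambda,\gamma}$ from the theorem statement, not code from the paper):

```python
import numpy as np

def rho(lam, gamma):
    """Asymptotic rate rho_{lambda,gamma} of the train-train method
    (as stated in Theorem 4)."""
    s = lam + gamma + 1.0
    disc = s**2 - 4.0 * gamma          # (lambda+gamma+1)^2 - 4*gamma > 0
    num = 4.0 * gamma**2 * ((gamma - 1.0)**2 + (gamma + 1.0) * lam)
    den = (s - np.sqrt(disc))**2 * disc**1.5
    return num / den

lams = np.logspace(-2, 2, 4000)
for gamma in [0.3, 0.5, 1.0, 2.0, 4.0]:
    best = rho(lams, gamma).min()                     # optimally tuned train-train rate
    bound = max(1 + 5 * gamma / 27, 5 / 27 + gamma)   # analytic upper bound
    print(f"gamma={gamma}: train-train {best:.4f} <= bound {bound:.4f} < train-val {1 + gamma:.4f}")
```

At $\gamma = 1$ the infimum is attained at $\lambda = 1/2$, where $\rho_{1/2,1} = 32/27$ exactly matches the bound, consistent with the equality case of $(\star)$.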

4.2. PROOF HIGHLIGHTS OF THEOREM 4

Here we sketch the technical highlights in proving Theorem 4. We defer the full proof to Appendix B.5.

Exact asymptotic MSE for both methods

The proof begins by calculating the exact asymptotic MSE for both methods, which we provide in the following lemma.

Lemma 5 (Exact asymptotic rates of the train-val and train-train methods). Define

$\rho^{\text{tr-tr}} := \frac{\mathbb{E}\big[\sum_{i=1}^d (\sigma^{(n)}_i)^2 / (\sigma^{(n)}_i + \lambda)^4\big]}{\Big(\mathbb{E}\big[\sum_{i=1}^d \sigma^{(n)}_i / (\sigma^{(n)}_i + \lambda)^2\big]\Big)^2}, \qquad \rho^{\text{tr-val}} := \frac{\mathbb{E}\Big[\big(\sum_{i=1}^d \lambda^2/(\sigma^{(n_1)}_i + \lambda)^2\big)^2 + (n_2+1)\sum_{i=1}^d \lambda^4/(\sigma^{(n_1)}_i + \lambda)^4\Big]}{\Big(\mathbb{E}\big[\sum_{i=1}^d \lambda^2/(\sigma^{(n_1)}_i + \lambda)^2\big]\Big)^2},$

where for any $m$, $\sigma^{(m)}_1 \ge \cdots \ge \sigma^{(m)}_d$ denote the eigenvalues of the matrix $\frac{1}{m} X_t^\top X_t \in \mathbb{R}^{d \times d}$, with $X_t \in \mathbb{R}^{m \times d}$ a random matrix with i.i.d. standard Gaussian entries. For any $(n, d)$, we have on the realizable linear model (6) that

$\text{AsymMSE}\big(\hat{w}^{\text{tr-tr}}_{0,T}(n; \lambda)\big) = d R^2 \rho^{\text{tr-tr}}, \qquad \text{AsymMSE}\big(\hat{w}^{\text{tr-val}}_{0,T}(n_1, n_2; \lambda)\big) = \frac{d R^2 \rho^{\text{tr-val}}}{n_2}.$

See its proof in Appendix B.2. Lemma 5 follows straightforwardly from the classical asymptotic theory for empirical risk minimization (Van der Vaart, 2000) and simplifications of certain matrix traces in terms of the spectrum of the per-task empirical covariance matrix $\frac{1}{n} X_t^\top X_t$.

Simplifying and optimizing the asymptotic MSEs The asymptotic MSEs of the train-train and train-val methods in Lemma 5 are not yet directly comparable, as the quantities $\rho^{\text{tr-tr}}$ and $\rho^{\text{tr-val}}$ depend on the spectrum of the empirical covariance matrix as well as on the tunable parameters $\lambda$ and $(n_1, n_2)$ (for the train-val method). Towards proving Theorem 4, we further simplify the rates and analyze the optimal tunable parameters, using separate strategies for the two methods:

• For the train-val method, we show that the optimal tunable parameters for any $(n, d)$ are attained in the special case $\lambda = \infty$ and $(n_1, n_2) = (0, n)$, at which the rate depends on $\frac{1}{n_1} (X^{\text{train}}_t)^\top X^{\text{train}}_t$ only through its rank (and thus has a simple closed form). We state this result in Corollary 8. The proof builds on algebraic manipulations of the quantity $\rho^{\text{tr-val}}$, and can be found in Appendix B.3.
• For the train-train method, we apply random matrix theory to characterize the spectrum of $\frac{1}{n} X_t^\top X_t$ in the proportional limit where $d, n \to \infty$ and $d/n$ stays constant (Bai & Silverstein, 2010; Anderson et al., 2010), and obtain a closed-form expression of the asymptotic MSE for any $\lambda > 0$, which we can analytically optimize over $\lambda$. We state this result in Theorem 9. The proof builds on the Stieltjes transform and its "derivative trick" (Dobriban et al., 2018), and is deferred to Appendix B.4.
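As a consistency check between Lemma 5 and the proportional limit (our own sanity experiment, not part of the paper's proof), one can estimate $d\,\rho^{\text{tr-tr}}$ by Monte Carlo over the eigenvalues of $\frac{1}{n}X^\top X$ at moderate $(d, n)$ and compare it to the limiting rate; at $\gamma = 1$ and $\lambda = 1/2$ the limit equals $32/27 \approx 1.185$:

```python
import numpy as np

def d_rho_trtr_mc(d, n, lam, trials=50, seed=0):
    """Monte Carlo estimate of d * rho^{tr-tr} from Lemma 5, using the
    eigenvalues of (1/n) X^T X with X an n x d standard Gaussian matrix."""
    rng = np.random.default_rng(seed)
    num = den = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        sig = np.linalg.eigvalsh(X.T @ X / n)
        num += np.sum(sig**2 / (sig + lam) ** 4)
        den += np.sum(sig / (sig + lam) ** 2)
    return d * (num / trials) / (den / trials) ** 2

est = d_rho_trtr_mc(d=400, n=400, lam=0.5)
print(est)   # approaches 32/27 as d, n grow with d/n = 1
```

The finite-size estimate is already within a few percent of the limit at $d = n = 400$, which is one way to see why the proportional-regime calculation is informative even at moderate dimensions.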

5.1. SIMULATIONS

We experiment on the realizable linear model studied in Section 4. Recall that the observed data of the $t$-th task are generated as $y_t = X_t w_t$ with $x_{t,i} \overset{iid}{\sim} \mathcal{N}(0, I_d)$. We independently generate $w_t \overset{iid}{\sim} \mathcal{N}(w_{0,\star}, I_d/d)$, where $w_{0,\star}$ is the linear centroid and the corresponding $R^2 = 1$. The goal is to learn the linear centroid $w_{0,\star}$ using the train-train and train-val methods, i.e., by minimizing $\hat{L}^{\text{tr-tr}}_T$ and $\hat{L}^{\text{tr-val}}_T$, respectively. Note that both $\hat{L}^{\text{tr-tr}}_T$ and $\hat{L}^{\text{tr-val}}_T$ are quadratic in $w_0$; therefore, we can find the closed-form solutions $\hat{w}^{\{\text{tr-tr},\text{tr-val}\}}_{0,T}$. We measure the performance of the two methods by the $\ell_2$-error $\|w_{0,\star} - \hat{w}^{\{\text{tr-tr},\text{tr-val}\}}_{0,T}\|_2$. We present the comparison between the train-train and train-val methods in Figure 1, with scatter plots representing the simulation outputs under different settings. Across all the simulations, we carefully tune the regularization coefficient $\lambda$ in the train-train method, and use a sufficiently large $\lambda = 10000$ in the train-val method, following Corollary 8. The simulated results concentrate around the reference curves corresponding to our theoretical findings. This corroborates our analyses and demonstrates the better performance of the train-train method on the realizable linear model. We additionally investigate the effect of averaging the loss over multiple splits in the train-val method (a "cross-validation" type loss) rather than using the vanilla single split. We show that such cross-validation can indeed improve over the vanilla single-split train-val method. We also experiment with the stronger "leave-one-out" style cross-validation and show that it achieves better MSEs than the constant-fold cross-validation (Appendix E).
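Because both empirical losses are quadratic in $w_0$, each ERM reduces to a per-task least-squares accumulation. The following is a minimal sketch of one simulation run (our own code with illustrative dimensions and regularization values, not the authors' exact experimental configuration):

```python
import numpy as np

def run_simulation(T=1000, d=10, n=10, n1=5, lam_trtr=0.5, lam_trval=1e4, seed=0):
    """One run on the realizable linear model: returns the squared l2-errors of
    the closed-form train-train and train-val ERMs for estimating w_{0,*}."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)                        # linear centroid w_{0,*}
    G_tt = np.zeros((d, d)); r_tt = np.zeros(d)
    G_tv = np.zeros((d, d)); r_tv = np.zeros(d)
    for _ in range(T):
        w_t = w_star + rng.standard_normal(d) / np.sqrt(d)  # so that R^2 = 1
        X = rng.standard_normal((n, d))
        y = X @ w_t                                         # noiseless tasks
        # Non-split loss is ||M (y - X w0)||^2 / (2n) with
        # M = I - X (X'X + n*lam*I)^{-1} X': least squares in w0.
        M = np.eye(n) - X @ np.linalg.solve(X.T @ X + n * lam_trtr * np.eye(d), X.T)
        A, b = M @ X, M @ y
        G_tt += A.T @ A; r_tt += A.T @ b
        # Split loss: adapt on the first n1 rows, evaluate on the rest.
        Xtr, ytr, Xva, yva = X[:n1], y[:n1], X[n1:], y[n1:]
        K = np.linalg.solve(Xtr.T @ Xtr + n1 * lam_trval * np.eye(d), Xtr.T)
        A, b = Xva @ (np.eye(d) - K @ Xtr), yva - Xva @ (K @ ytr)
        G_tv += A.T @ A; r_tv += A.T @ b
    w_tt = np.linalg.solve(G_tt, r_tt)
    w_tv = np.linalg.solve(G_tv, r_tv)
    return np.sum((w_tt - w_star) ** 2), np.sum((w_tv - w_star) ** 2)
```

Averaged over a few seeds, both errors shrink at the $1/T$ rate and the train-train error comes out below the train-val error, in line with Theorem 4 and Figure 1.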
[Figure 1: (a) Optimal asymptotic MSE of $\hat{w}^{\{\text{tr-val},\text{tr-tr}\}}_{0,T}$; (c) $\ell_2$-error of $\hat{w}^{\{\text{tr-val},\text{tr-tr}\}}_{0,T}$ vs. the $d/n$ ratio.]
g / Z M 8 C k b u / r 7 + h N 0 6 9 9 V q u 9 S a R 2 o a O Z P n p z Z s P P 8 e l V p b C 8 H s r e P F y b f 3 V x u v 2 5 p u t 7 Z 3 O 7 t s L W 1 R G 4 l A W u j B X M V j U K s c h K d J 4 V R q E L N Z 4 G d 9 + a f K X d 2 i s K v J z m p U 4 y m C S q 1 R J I E + N O y A 0 p i S + i g x o G q f u v h 6 7 c J 8 L S 2 B q f s D F v U p w C u T + C B a K 8 / r G C c I H M p k T j s w B m X 3 u r z v Q o q 5 r Y d R k 6 v v e H I 4 7 3 b A X z o M / B 9 E S d N k y T s e d H y I p Z J V h T l K D t d d R W N L I g S E l N d Z t U V k s Q d 7 C B K 8 9 z C F D O 3 J z K 2 q + 5 5 m E p 4 X x J y c + Z / + u c J B Z O 8 t i r 2 w e Z F d z D f n f X G K b h i v T K e 2 P n M r L i j C X i + F p p T k V v P G a J 8 q g J D 3 z A K R R f n 8 u p 2 B A k v + R t j c m W r X h O b g 4 7 E X H v f D s q D v o L y 3 a Y B / Y R / a J R e w z G 7 A T d s q G T L J v 7 C f 7 x Z 5 a j 8 F 6 s B X s L K R B a 1 n z j v 0 T w f v f W i

5.2. FEW-SHOT IMAGE CLASSIFICATION

We further investigate the comparison between the train-train and train-val type methods in few-shot image classification on miniImageNet (Ravi & Larochelle, 2017) and tieredImageNet (Ren et al., 2018).

Methods We instantiate the train-train and train-val methods in the centroid meta-learning setting with a ridge solver. The methods are almost exactly the same as in our theoretical setting in (2) and (3), with the only differences being that the parameters $w_t$ (and hence $w_0$) parametrize a deep neural network instead of a linear classifier, and the loss function is the cross-entropy instead of the squared loss. Mathematically, we minimize the following two loss functions:
$$L^{\text{tr-val}}_{\lambda,n_1}(w_0) := \frac{1}{T}\sum_{t=1}^T \ell^{\text{tr-val}}_t(w_0) = \frac{1}{T}\sum_{t=1}^T \ell\Big(\arg\min_{w_t}\big[\ell(w_t;\, X^{\text{train}}_t, y^{\text{train}}_t) + \lambda\|w_t - w_0\|^2\big];\; X^{\text{val}}_t, y^{\text{val}}_t\Big),$$
$$L^{\text{tr-tr}}_{\lambda}(w_0) := \frac{1}{T}\sum_{t=1}^T \ell^{\text{tr-tr}}_t(w_0) = \frac{1}{T}\sum_{t=1}^T \ell\Big(\arg\min_{w_t}\big[\ell(w_t;\, X_t, y_t) + \lambda\|w_t - w_0\|^2\big];\; X_t, y_t\Big),$$
where $(X_t, y_t)$ is the data for task $t$ of size $n$, and $(X^{\text{train}}_t, y^{\text{train}}_t)$ and $(X^{\text{val}}_t, y^{\text{val}}_t)$ are a split of the data of sizes $(n_1, n_2)$. We note that both loss functions above have been considered in prior work ($L^{\text{tr-val}}$ in iMAML (Rajeswaran et al., 2019), and $L^{\text{tr-tr}}$ in Meta-MinibatchProx (Zhou et al., 2019)), though we use slightly different implementation details from these prior works to make sure that the two methods here are exactly the same except for whether the split is used. Additional details about the implementation can be found in Appendix D.

Experimental settings We adopt the episodic training procedure (Finn et al., 2017; Zhou et al., 2019; Rajeswaran et al., 2019). In meta-test, we sample a set of $N$-way $(K+1)$-shot test tasks. The first $K$ instances of each class are used for training and the remaining one for testing. In meta-training, we use the "higher way" training strategy. We set the default train-validation split ratio to the even split $n_1 = n_2 = n/2$, following Zhou et al. (2019); Rajeswaran et al. (2019).
For example, in the 5-way 5-shot classification setting, each task contains 5 × (5 + 1) = 30 total images, and we set n1 = n2 = 15. (We additionally investigate the optimality of this split ratio in Appendix D.1.) We evaluate both methods under the transduction setting, where information is shared between the test data via batch normalization. We report the average accuracy over 2,000 random test episodes with 95% confidence intervals.

Results Table 1 presents the classification accuracy (%) on miniImageNet and tieredImageNet. We find that the train-train method consistently outperforms the train-val method. Specifically, on miniImageNet, the train-train method outperforms train-val by 2.01% and 3.87% on the 1-shot 5-way and 5-shot 5-way tasks respectively; on tieredImageNet, train-train improves by about 6.40% on average over the four testing cases. These results demonstrate the advantage of the train-train method over train-val and support our theoretical findings in Theorem 4.
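In the linear-centroid case with the squared loss, both objectives admit a closed-form inner solver, and the contrast between them is easy to state in code. The following numpy sketch is our own minimal illustration (the function names are ours; it uses the ridge inner solver and squared loss of the theoretical setting (2)-(3), not the deep-network instantiation):

```python
import numpy as np

def ridge_adapt(w0, X, y, lam):
    """Inner solver A_lambda: ridge regression centered at the prior w0.
    Solves argmin_w ||y - X w||^2 / (2n) + (lam/2) ||w - w0||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d),
                           X.T @ y / n + lam * w0)

def loss_tr_val(w0, tasks, lam, n1):
    """Train-val objective: adapt on the first n1 samples, evaluate on the rest."""
    total = 0.0
    for X, y in tasks:
        w = ridge_adapt(w0, X[:n1], y[:n1], lam)
        r = y[n1:] - X[n1:] @ w
        total += 0.5 * np.mean(r ** 2)
    return total / len(tasks)

def loss_tr_tr(w0, tasks, lam):
    """Train-train objective: adapt and evaluate on all n samples of each task."""
    total = 0.0
    for X, y in tasks:
        w = ridge_adapt(w0, X, y, lam)
        r = y - X @ w
        total += 0.5 * np.mean(r ** 2)
    return total / len(tasks)
```

The only difference between the two objectives is which rows of $(X_t, y_t)$ the inner solver sees versus which rows the outer loss is evaluated on.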

6. CONCLUSION

We study the importance of the train-validation split on the linear-centroid meta-learning problem, and show that the necessity and optimality of the train-validation split depend greatly on whether the tasks are structured: the sample splitting is necessary in general situations, but unnecessary and non-optimal when the tasks are nicely structured. It would be of interest to study whether a similar conclusion holds on other meta-learning problems such as learning a representation, or whether our insights can be used towards designing meta-learning algorithms with better empirical performance.

A.1 DETAILS OF THE ASYMPTOTIC ANALYSIS FRAMEWORK

Here we provide more details on the asymptotic analysis framework sketched in Section 2.1. In typical scenarios, for consistent ERMs, the limiting distribution of $\widehat{w}_{0,T}$ is asymptotically normal with a known covariance matrix, as characterized in the following classical result (see, e.g., Van der Vaart (2000, Theorem 5.21) and also Liang (2016)).

Proposition 6 (Asymptotic normality and excess risk of ERMs). Assume the population minimizer $w_{0,\star}$ is unique and the ERM $\widehat{w}_{0,T}$ is consistent (i.e. it converges to $w_{0,\star}$ in probability as $T \to \infty$). Further assume the following regularity conditions: (a) there exists some random variable $A_t = A(p_t, X_t, y_t)$ such that $\mathbb{E}[A_t^2] < \infty$ and $\|\nabla \ell_t(w_1) - \nabla \ell_t(w_2)\| \le A_t \|w_1 - w_2\|$ for all $w_1, w_2 \in \mathbb{R}^d$; (b) $\mathbb{E}[\|\nabla \ell_t(w_{0,\star})\|^2] < \infty$; (c) $L$ is twice-differentiable with $\nabla^2 L(w_{0,\star}) \succ 0$. Then the ERM $\widehat{w}_{0,T}$ is asymptotically normally distributed, with
$$\sqrt{T}\,\big(\widehat{w}_{0,T} - w_{0,\star}\big) \xrightarrow{d} N\Big(0,\; \nabla^2 L(w_{0,\star})^{-1}\,\mathrm{Cov}\big(\nabla \ell_t(w_{0,\star})\big)\,\nabla^2 L(w_{0,\star})^{-1}\Big) =: P_w,$$
$$T \cdot \big(L(\widehat{w}_{0,T}) - L(w_{0,\star})\big) \xrightarrow{d} \Delta^\top \nabla^2 L(w_{0,\star})\, \Delta \quad \text{where } \Delta \sim P_w,$$
where $\xrightarrow{d}$ denotes convergence in distribution and $\ell_t: \mathbb{R}^d \to \mathbb{R}$ is the loss function on a single task.
When this happens, we define the asymptotic rate of estimation (in MSE loss) and the asymptotic excess risk of $\widehat{w}_{0,T}$ as those of its limiting distribution:
$$\mathrm{AsymMSE}(\widehat{w}_{0,T}) := \mathbb{E}_{\Delta \sim P_w}\big[\|\Delta\|^2\big] = \mathrm{tr}\Big(\nabla^2 L(w_{0,\star})^{-1}\,\mathrm{Cov}\big(\nabla \ell_t(w_{0,\star})\big)\,\nabla^2 L(w_{0,\star})^{-1}\Big),$$
$$\mathrm{AsymExcessRisk}(\widehat{w}_{0,T}) := \mathbb{E}_{\Delta \sim P_w}\big[\Delta^\top \nabla^2 L(w_{0,\star})\,\Delta\big] = \mathrm{tr}\Big(\nabla^2 L(w_{0,\star})^{-1}\,\mathrm{Cov}\big(\nabla \ell_t(w_{0,\star})\big)\Big).$$
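As a numerical companion (our own sketch; the function names are ours), both quantities can be estimated by plugging a Hessian and a collection of per-task gradient samples at $w_{0,\star}$ into the sandwich formula above:

```python
import numpy as np

def asym_mse(hessian, grad_samples):
    """tr(H^{-1} Cov(grad) H^{-1}): asymptotic MSE of the ERM (Proposition 6)."""
    H_inv = np.linalg.inv(hessian)
    cov = np.cov(np.asarray(grad_samples).T, bias=False)
    return float(np.trace(H_inv @ cov @ H_inv))

def asym_excess_risk(hessian, grad_samples):
    """tr(H^{-1} Cov(grad)): asymptotic excess risk of the ERM."""
    cov = np.cov(np.asarray(grad_samples).T, bias=False)
    return float(np.trace(np.linalg.solve(hessian, cov)))
```

With $H = I_d$ the two quantities coincide with the trace of the gradient covariance, as the formulas predict.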

A.2 PROOF OF PROPOSITION 1

Equivalence of test-time risk and training loss for the train-val method. We first show that $L^{\text{tr-val}}(w_0) = \mathbb{E}[\ell^{\text{tr-val}}_t(w_0)] = L^{\text{test}}_{\lambda,n_1}(w_0)$ for all $w_0$; that is, the population meta-test loss is exactly the same as the population risk of the train-val method. This is straightforward: as the tasks are i.i.d. and $A_\lambda(w_0; X^{\text{train}}_t, y^{\text{train}}_t)$ is independent of the test points $(X^{\text{val}}_t, y^{\text{val}}_t)$, we have for any $w_0$ that
$$\mathbb{E}[\ell^{\text{tr-val}}_t(w_0)] = \mathbb{E}_{p_t \sim \Pi,\,(X_t,y_t)\sim p_t}\Big[\frac{1}{2n_2}\big\|y^{\text{val}}_t - X^{\text{val}}_t A_\lambda(w_0; X^{\text{train}}_t, y^{\text{train}}_t)\big\|^2\Big] = \mathbb{E}_{p_t \sim \Pi,\,(X_t,y_t)\sim p_t}\Big[\frac{1}{2}\big(y^{\text{val}}_{t,1} - \langle x^{\text{val}}_{t,1},\, A_\lambda(w_0; X^{\text{train}}_t, y^{\text{train}}_t)\rangle\big)^2\Big] = \mathbb{E}_{p_{T+1}\sim\Pi,\;(X_{T+1},y_{T+1}),\,(x',y')\stackrel{iid}{\sim}p_{T+1}}\Big[\frac{1}{2}\big(y' - \langle x',\, A_{\lambda,n_1}(w_0; X_{T+1}, y_{T+1})\rangle\big)^2\Big] = L^{\text{test}}_{\lambda,n_1}(w_0).$$
Therefore the train-val method is actually a valid ERM for the test loss $L^{\text{test}}_{\lambda,n_1}$, and it remains to show that the train-val method is (itself) consistent.

Consistency. We expand the empirical risk of the train-val method as
$$\widehat{L}^{\text{tr-val}}_T(w_0) = \frac{1}{T}\sum_{t=1}^T \frac{1}{2n_2}\big\|y^{\text{val}}_t - X^{\text{val}}_t A_\lambda(w_0; X^{\text{train}}_t, y^{\text{train}}_t)\big\|^2 = \frac{1}{T}\sum_{t=1}^T \frac{1}{2n_2}\Big\|y^{\text{val}}_t - X^{\text{val}}_t\big[w_0 + \big((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d\big)^{-1}(X^{\text{train}}_t)^\top\big(y^{\text{train}}_t - X^{\text{train}}_t w_0\big)\big]\Big\|^2 = \frac{1}{T}\sum_{t=1}^T \frac{1}{2n_2}\Big\|y^{\text{val}}_t - X^{\text{val}}_t\big((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d\big)^{-1}(X^{\text{train}}_t)^\top y^{\text{train}}_t - n_1\lambda\, X^{\text{val}}_t\big((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d\big)^{-1} w_0\Big\|^2 = \frac{1}{2} w_0^\top M_T w_0 - w_0^\top b_T + \text{const},$$
where
$$M_T := \frac{1}{T}\sum_{t=1}^T \lambda^2\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1}\,\frac{(X^{\text{val}}_t)^\top X^{\text{val}}_t}{n_2}\,\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1},$$
$$b_T := \frac{1}{T}\sum_{t=1}^T \lambda\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1}\cdot\frac{1}{n_2}(X^{\text{val}}_t)^\top\Big(y^{\text{val}}_t - X^{\text{val}}_t\big((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d\big)^{-1}(X^{\text{train}}_t)^\top y^{\text{train}}_t\Big).$$
Noticing that $\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1} \preceq \lambda^{-1} I_d$, and by the assumption that $\mathbb{E}_{(x,y)\sim p_t}[xx^\top] \prec \infty$ and $\|\mathbb{E}_{(x,y)\sim p_t}[xy]\| < \infty$, we have $\mathbb{E}[\|M_T\|] < \infty$ and $\mathbb{E}[\|b_T\|] < \infty$.
Since the tasks $p_t$ are i.i.d., by the law of large numbers we have with probability one that
$$M_T \to \mathbb{E}[M_T] = \mathbb{E}_{p_t,(X_t,y_t)}\Big[\lambda^2\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1}\,\frac{(X^{\text{val}}_t)^\top X^{\text{val}}_t}{n_2}\,\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1}\Big] = \mathbb{E}_{p_t,(X_t,y_t)}\Big[\lambda^2\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1}\Sigma_t\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1}\Big] \succ 0 \quad (8)$$
(where $\Sigma_t = \mathbb{E}_{x\sim p_t}[xx^\top] \succ 0$), and
$$b_T \to \mathbb{E}[b_T] = \mathbb{E}_{p_t,(X_t,y_t)}\Big[\lambda\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1}\cdot\frac{1}{n_2}(X^{\text{val}}_t)^\top\Big(y^{\text{val}}_t - X^{\text{val}}_t\big((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d\big)^{-1}(X^{\text{train}}_t)^\top y^{\text{train}}_t\Big)\Big], \quad \|\mathbb{E}[b_T]\| < \infty \quad (9)$$
as $T\to\infty$. Therefore, by Slutsky's theorem, we have
$$\widehat{w}_{0,T} = M_T^{-1} b_T \to \mathbb{E}[M_T]^{-1}\mathbb{E}[b_T] = \arg\min_{w_0} L^{\text{tr-val}}(w_0) = \arg\min_{w_0} L^{\text{test}}_{\lambda,n_1}(w_0) = w_{0,\star}(\lambda, n_1)$$
as $T \to \infty$. This proves the consistency of the train-val method.

Asymptotic normality. Similarly to the above, we can write the per-task loss as $\ell_t(w_0) = \frac{1}{2}\|A_t w_0 - c_t\|^2$, where
$$A_t = \frac{\lambda}{\sqrt{n_2}}\, X^{\text{val}}_t\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1}, \qquad c_t = \frac{1}{\sqrt{n_2}}\Big(y^{\text{val}}_t - X^{\text{val}}_t\big((X^{\text{train}}_t)^\top X^{\text{train}}_t/n_1 + \lambda I_d\big)^{-1}\frac{1}{n_1}(X^{\text{train}}_t)^\top y^{\text{train}}_t\Big).$$
In order to show the desired asymptotic normality result, it suffices to check the conditions in Proposition 6. First, we have $\nabla \ell_t(w_0) = A_t^\top(A_t w_0 - c_t)$. This is Lipschitz in $w_0$ with Lipschitz constant $\|A_t^\top A_t\|_{\mathrm{op}} \le \|A_t\|_{\mathrm{Fr}}^2 < \frac{1}{n_2}\|X^{\text{val}}_t\|_{\mathrm{Fr}}^2$. As $\mathbb{E}_{x\sim p_t}[\|x\|^4] < \infty$, the above quantity is clearly square-integrable, verifying (a). As $w_{0,\star} = w_{0,\star}(\lambda, n_1)$ is finite, we can use similar arguments as above to show that (b) holds. Finally, we have already seen that $L$ is twice-differentiable (since it is quadratic in $w_0$) and $\nabla^2 L(w_{0,\star}) \succ 0$, which verifies (c). Therefore the conditions of Proposition 6 hold, which yields the desired asymptotic normality result.

A.3 PROOF OF PROPOSITION 2

High-level idea. At a high level, the proof proceeds by showing that the train-train method is also consistent, converging to the (population) minimizer of $L^{\text{tr-tr}}$, and then constructing a simple counter-example on which the minimizer of $L^{\text{tr-tr}}$ is not equal to that of $L^{\text{test}}_{\lambda,n}$.

Population minimizers of $L^{\text{tr-tr}}$ and $L^{\text{tr-val}}$. We begin by simplifying the non-splitting risk. We have
$$\ell^{\text{tr-tr}}_t(w_0) = \frac{1}{2n}\big\|y_t - X_t A_\lambda(w_0; X_t, y_t)\big\|^2 = \frac{1}{2n}\Big\|y_t - X_t\big[w_0 + (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top(y_t - X_t w_0)\big]\Big\|^2 = \frac{1}{2}\|A_t w_0 - c_t\|^2,$$
where $A_t = \frac{1}{\sqrt{n}}\, n\lambda\, X_t(X_t^\top X_t + n\lambda I_d)^{-1}$ and $c_t = \frac{1}{\sqrt{n}}\big(I_n - X_t(X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top\big)y_t$. Using similar arguments as in the proof of Proposition 1 (Appendix A.2), we see that the train-train estimator $\widehat{w}^{\text{tr-tr}}_{0,T}$ converges with probability one to the minimizer of the population risk $L^{\text{tr-tr}}$, which is
$$w^{\text{tr-tr}}_{0,\star} = \arg\min_{w_0} L^{\text{tr-tr}}(w_0) = \mathbb{E}[A_t^\top A_t]^{-1}\mathbb{E}[A_t^\top c_t] = \mathbb{E}\Big[\lambda^2\big(X_t^\top X_t/n + \lambda I_d\big)^{-2}\frac{X_t^\top X_t}{n}\Big]^{-1}\cdot\mathbb{E}\Big[\frac{1}{n}\lambda\big(X_t^\top X_t/n + \lambda I_d\big)^{-1}X_t^\top\big(I_n - X_t(X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top\big)y_t\Big] = \mathbb{E}\Big[\lambda^2\big(X_t^\top X_t/n + \lambda I_d\big)^{-2}\frac{X_t^\top X_t}{n}\Big]^{-1}\cdot\mathbb{E}\Big[\lambda^2\big(X_t^\top X_t/n + \lambda I_d\big)^{-2}\frac{1}{n}X_t^\top y_t\Big]. \quad (10)$$
On the other hand, recall from Proposition 1 ((8) and (9)) that the population minimizer of $L^{\text{test}}_{\lambda,n}$ is
$$w_{0,\star}(\lambda, n) = \arg\min_{w_0} L^{\text{test}}_{\lambda,n}(w_0) = \mathbb{E}\Big[\lambda^2\big(X_t^\top X_t/n + \lambda I_d\big)^{-1}\Sigma_t\big(X_t^\top X_t/n + \lambda I_d\big)^{-1}\Big]^{-1}\cdot\mathbb{E}_{p_t,(X_t,y_t)}\Big[\lambda\big(X_t^\top X_t/n + \lambda I_d\big)^{-1}\mathbb{E}_{(x',y')\sim p_t}[x'y'] - \lambda\big(X_t^\top X_t/n + \lambda I_d\big)^{-1}\Sigma_t\big(X_t^\top X_t/n + \lambda I_d\big)^{-1}\frac{1}{n}X_t^\top y_t\Big]. \quad (11)$$

Construction of the counter-example. We now construct a distribution for which (10) is not equal to (11). Let $d = 1$ and let all $p_t$ be the following distribution:
$$p_t:\quad (x_{t,i}, y_{t,i}) = \begin{cases} (1, 3) & \text{with probability } 1/2; \\ (3, -1) & \text{with probability } 1/2.\end{cases}$$
Clearly, we have $\Sigma_t = 5$, $s_t := X_t^\top X_t/n \in [1, 9]$, and $\mathbb{E}_{(x',y')\sim p_t}[x'y'] = 0$.
Therefore we have
$$w^{\text{tr-tr}}_{0,\star} = \mathbb{E}\big[(s_t+\lambda)^{-2}s_t\big]^{-1}\cdot\mathbb{E}\Big[(s_t+\lambda)^{-2}\frac{1}{n}\sum_{i=1}^n x_{t,i}y_{t,i}\Big],$$
and
$$w_{0,\star}(\lambda, n) = -\mathbb{E}\big[5\lambda^2(s_t+\lambda)^{-2}\big]^{-1}\cdot\mathbb{E}\Big[5\lambda(s_t+\lambda)^{-2}\frac{1}{n}\sum_{i=1}^n x_{t,i}y_{t,i}\Big] = -\mathbb{E}\big[\lambda(s_t+\lambda)^{-2}\big]^{-1}\cdot\mathbb{E}\Big[(s_t+\lambda)^{-2}\frac{1}{n}\sum_{i=1}^n x_{t,i}y_{t,i}\Big].$$
We now show that $w^{\text{tr-tr}}_{0,\star} \neq w_{0,\star}(\lambda, n)$ by showing that
$$\mathbb{E}\Big[(s_t+\lambda)^{-2}\frac{1}{n}\sum_{i=1}^n x_{t,i}y_{t,i}\Big] = \mathbb{E}\Big[\frac{x_{t,1}y_{t,1}}{(s_t+\lambda)^2}\Big] \neq 0$$
for any $\lambda > 0$. Indeed, conditioning on $(x_{t,1}, y_{t,1}) = (1, 3)$, the sum of squares in $s_t$ has one term equal to $1$, with the other terms i.i.d. equal to $1$ or $9$ with probability one half each. On the other hand, conditioning on $(x_{t,1}, y_{t,1}) = (3, -1)$, the sum has one term equal to $9$ and the other terms i.i.d. as before. Hence $s_t$ is stochastically larger under the second conditioning, so the negative contribution to the expectation is smaller in magnitude than the positive contribution; in other words,
$$\mathbb{E}\Big[\frac{x_{t,1}y_{t,1}}{(s_t+\lambda)^2}\Big] = \frac{1}{2}\cdot 3\,\mathbb{E}\Big[\frac{1}{(s_t+\lambda)^2}\,\Big|\,(x_{t,1},y_{t,1})=(1,3)\Big] + \frac{1}{2}\cdot(-3)\,\mathbb{E}\Big[\frac{1}{(s_t+\lambda)^2}\,\Big|\,(x_{t,1},y_{t,1})=(3,-1)\Big] > 0.$$
This shows $w^{\text{tr-tr}}_{0,\star} \neq w_{0,\star}(\lambda, n)$, and consequently $\widehat{w}^{\text{tr-tr}}_{0,T}$ does not converge to $w_{0,\star}(\lambda, n)$: the difference is bounded away from zero as $T\to\infty$. Finally, for this distribution the risk $L^{\text{test}}_{\lambda,n}(w_0)$ is strongly convex (it has a positive second derivative), which further implies that $L^{\text{test}}_{\lambda,n}(\widehat{w}^{\text{tr-tr}}_{0,T}) - L^{\text{test}}_{\lambda,n}(w_{0,\star}(\lambda, n))$ is bounded away from zero almost surely as $T\to\infty$.
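The strict positivity established above is also easy to confirm numerically. A small Monte Carlo sketch of ours (with arbitrary choices n = 10, λ = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, trials = 10, 1.0, 200_000

# Sample n points per task from the two-point distribution of the counter-example:
# (x, y) = (1, 3) or (3, -1), each with probability 1/2.
picks = rng.integers(0, 2, size=(trials, n))
x = np.where(picks == 0, 1.0, 3.0)
y = np.where(picks == 0, 3.0, -1.0)

s = (x ** 2).mean(axis=1)                           # s_t = X_t' X_t / n
est = np.mean(x[:, 0] * y[:, 0] / (s + lam) ** 2)   # E[x_{t,1} y_{t,1} / (s_t + lam)^2]
print(est)
```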

B PROOFS FOR SECTION 4

B.1 PROOF OF THEOREM 3

We first show that $w_{0,\star} = \mathbb{E}_{w_t\sim\Pi}[w_t]$ is a global optimizer of $L^{\text{tr-tr}}$ and $L^{\text{tr-val}}$ for any regularization coefficient $\lambda > 0$, any $n$, and any split $(n_1, n_2)$. To do this, it suffices to check that the gradient at $w_{0,\star}$ is zero and that the Hessian is positive definite (PD).

Optimality of $w_{0,\star}$ in both $L^{\text{tr-tr}}$ and $L^{\text{tr-val}}$. We first look at $L^{\text{tr-tr}}$. Writing $y_t = X_t w_t + z_t$ and $P_t := I_n - X_t(X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top$, for any $w_0\in\mathbb{R}^d$ we have
$$L^{\text{tr-tr}}(w_0) = \mathbb{E}[\ell^{\text{tr-tr}}_t(w_0)] = \frac{1}{2n}\mathbb{E}\Big\|y_t - X_t\big[(X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top(y_t - X_t w_0) + w_0\big]\Big\|^2 = \frac{1}{2n}\mathbb{E}\Big\|X_t\big(I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t\big)(w_t - w_0) + P_t z_t\Big\|^2. \quad (12)$$
Similarly, $L^{\text{tr-val}}$ (defined in (13)) can be written as
$$L^{\text{tr-val}}(w_0) = \mathbb{E}[\ell^{\text{tr-val}}_t(w_0)] = \frac{1}{2n_2}\mathbb{E}\Big\|X^{\text{val}}_t\big(I_d - ((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d)^{-1}(X^{\text{train}}_t)^\top X^{\text{train}}_t\big)(w_t - w_0) + z^{\text{val}}_t - X^{\text{val}}_t\big((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d\big)^{-1}(X^{\text{train}}_t)^\top z^{\text{train}}_t\Big\|^2. \quad (14)$$
We denote
$$M^{\text{tr-tr}}_t := X_t\big(I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t\big), \qquad M^{\text{tr-val}}_t := X^{\text{val}}_t\big(I_d - ((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d)^{-1}(X^{\text{train}}_t)^\top X^{\text{train}}_t\big) \quad (15)$$
to simplify the notation in (12) and (14). Taking gradients of $L^{\text{tr-tr}}$ and $L^{\text{tr-val}}$ with respect to $w_0$:
$$\nabla_{w_0} L^{\text{tr-tr}}(w_0) = -\frac{1}{n}\mathbb{E}\Big[(M^{\text{tr-tr}}_t)^\top\Big(M^{\text{tr-tr}}_t(w_t - w_0) + P_t z_t\Big)\Big], \quad (17)$$
$$\nabla_{w_0} L^{\text{tr-val}}(w_0) = -\frac{1}{n_2}\mathbb{E}\Big[(M^{\text{tr-val}}_t)^\top\Big(M^{\text{tr-val}}_t(w_t - w_0) + z^{\text{val}}_t - X^{\text{val}}_t\big((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d\big)^{-1}(X^{\text{train}}_t)^\top z^{\text{train}}_t\Big)\Big]. \quad (18)$$
Substituting $w_{0,\star}$ into (17), we deduce
$$\nabla_{w_0} L^{\text{tr-tr}}(w_{0,\star}) = -\frac{1}{n}\mathbb{E}\big[(M^{\text{tr-tr}}_t)^\top M^{\text{tr-tr}}_t(w_t - w_{0,\star})\big] - \frac{1}{n}\mathbb{E}\big[(M^{\text{tr-tr}}_t)^\top P_t z_t\big] = 0. \quad (19)$$
To see this, observe that by definition $\mathbb{E}[w_t - w_{0,\star}] = 0$; combining this with the fact that $w_t$ is generated independently of $X_t$, the first term on the RHS of (19) vanishes. In addition, $z_t$ is independent white noise, therefore the second term on the RHS of (19) also vanishes. Following the same argument, we can show $\nabla_{w_0} L^{\text{tr-val}}(w_{0,\star}) = 0$, since $X_t$ is also independent of $w_t$. The above reasoning indicates that $w_{0,\star}$ is a stationary point of both $L^{\text{tr-tr}}$ and $L^{\text{tr-val}}$. The remaining step is to check that $\nabla^2_{w_0} L^{\text{tr-tr}}(w_{0,\star})$ and $\nabla^2_{w_0} L^{\text{tr-val}}(w_{0,\star})$ are PD.
From (17) and (18), we derive respectively the Hessians of $L^{\text{tr-tr}}$ and $L^{\text{tr-val}}$ as
$$\nabla^2_{w_0} L^{\text{tr-tr}}(w_{0,\star}) = \frac{1}{n}\mathbb{E}\big[(M^{\text{tr-tr}}_t)^\top M^{\text{tr-tr}}_t\big], \qquad \nabla^2_{w_0} L^{\text{tr-val}}(w_{0,\star}) = \frac{1}{n_2}\mathbb{E}\big[(M^{\text{tr-val}}_t)^\top M^{\text{tr-val}}_t\big]. \quad (20)$$
Let $v\in\mathbb{R}^d$ be any nonzero vector; we check that $v^\top \nabla^2_{w_0} L^{\text{tr-tr}}(w_{0,\star})\, v > 0$. A key observation is that $I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t$ is positive definite for any $\lambda > 0$. To see this, let $\sigma_1 \ge \cdots \ge \sigma_d$ be the eigenvalues of $\frac{1}{n}X_t^\top X_t$; some algebra yields that the eigenvalues of $I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t$ are $\frac{\lambda}{\lambda+\sigma_i} > 0$ for $\lambda > 0$ and $i = 1,\dots,d$. Hence, we deduce
$$v^\top \nabla^2_{w_0} L^{\text{tr-tr}}(w_{0,\star})\, v = \frac{1}{n}\mathbb{E}\Big[\big\|X_t\big(I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t\big)v\big\|^2\Big] > 0,$$
since $X_t$ is isotropic (an explicit computation of the Hessian matrix can be found in Appendix B.2). As a consequence, we have shown that $w_{0,\star}$ is a global optimum of $L^{\text{tr-tr}}$. The same argument applies to $L^{\text{tr-val}}$, and the proof is complete.
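The eigenvalue identity used in this step — the shrinkage matrix $I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t$ has eigenvalues $\lambda/(\lambda+\sigma_i) > 0$ — can be checked numerically; a quick sketch of ours (the dimensions and λ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 12, 5, 0.7
X = rng.standard_normal((n, d))

# Shrinkage matrix appearing in M_t: I - (X'X + n*lam*I)^{-1} X'X.
M = np.eye(d) - np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ X)

# Its eigenvalues should be lam / (lam + sigma_i), where sigma_i are the
# eigenvalues of X'X / n -- all strictly positive for lam > 0.
sig = np.linalg.eigvalsh(X.T @ X / n)
eigs = np.sort(np.linalg.eigvalsh(M))
expected = np.sort(lam / (lam + sig))
print(np.max(np.abs(eigs - expected)))
```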

Consistency of $\widehat{w}^{\{\text{tr-tr},\,\text{tr-val}\}}_{0,T}$. To check consistency, we need to verify conditions (a)-(c) in Proposition 6. For condition (a), we derive from (17) and (18) that the gradients are Lipschitz with square-integrable constants: for the non-splitting method, $0 \preceq I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t \preceq I_d$ and $\mathbb{E}[\|X_t^\top X_t\|_{\mathrm{op}}^2] < \infty$, which implies $\mathbb{E}\big[\|(M^{\text{tr-tr}}_t)^\top M^{\text{tr-tr}}_t\|_{\mathrm{op}}^2\big] < \infty$. For the split method, we also have $0 \preceq I_d - \big((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d\big)^{-1}(X^{\text{train}}_t)^\top X^{\text{train}}_t \preceq I_d$ and $\mathbb{E}[\|(X^{\text{val}}_t)^\top X^{\text{val}}_t\|_{\mathrm{op}}^2] < \infty$, which implies $\mathbb{E}\big[\|(M^{\text{tr-val}}_t)^\top M^{\text{tr-val}}_t\|_{\mathrm{op}}^2\big] < \infty$. For condition (b), using a similar argument as in condition (a) and combining with $R^2 = \mathbb{E}[\|w_{0,\star} - w_t\|^2]$, we have $\mathbb{E}[\|\nabla \ell^{\{\text{tr-tr},\,\text{tr-val}\}}_t(w_{0,\star})\|^2] < \infty$. For condition (c), using (20), we directly verify that $L^{\{\text{tr-tr},\,\text{tr-val}\}}$ is twice-differentiable and $\nabla^2 L^{\{\text{tr-tr},\,\text{tr-val}\}} \succ 0$.

B.2 PROOF OF LEMMA 5

Proof. We prove Lemma 5 using the asymptotic normality in Proposition 6: the asymptotic covariance is $(\nabla^2 L^{\{\text{tr-tr},\,\text{tr-val}\}})^{-1}\,\mathrm{Cov}[\nabla\ell^{\{\text{tr-tr},\,\text{tr-val}\}}_t]\,(\nabla^2 L^{\{\text{tr-tr},\,\text{tr-val}\}})^{-1}$. Therefore, in the following we only need to find $\nabla^2 L^{\{\text{tr-tr},\,\text{tr-val}\}}$ and $\mathrm{Cov}[\nabla\ell^{\{\text{tr-tr},\,\text{tr-val}\}}_t]$.

• Asymptotic variance of $\widehat{w}^{\text{tr-tr}}_{0,T}$. We begin with the computation of the expected Hessian $\frac{1}{n}\mathbb{E}[(M^{\text{tr-tr}}_t)^\top M^{\text{tr-tr}}_t]$:
$$\mathbb{E}\big[(M^{\text{tr-tr}}_t)^\top M^{\text{tr-tr}}_t\big] = \mathbb{E}\Big[\big(I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t\big)\,X_t^\top X_t\,\big(I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t\big)\Big] \overset{(i)}{=} \mathbb{E}\Big[V_t\big(I_d - (D_t^\top D_t + n\lambda I_d)^{-1}D_t^\top D_t\big)\,D_t^\top D_t\,\big(I_d - (D_t^\top D_t + n\lambda I_d)^{-1}D_t^\top D_t\big)V_t^\top\Big], \quad (21)$$
where equality (i) is obtained by plugging in the SVD $X_t = U_tD_tV_t^\top$ with $U_t\in\mathbb{R}^{n\times n}$, $D_t\in\mathbb{R}^{n\times d}$, and $V_t\in\mathbb{R}^{d\times d}$. A key observation is that $U_t$ and $V_t$ are independent, since $X_t$ is isotropic, i.e., homogeneous in each orthogonal direction. To see this, note that for any orthogonal matrices $Q\in\mathbb{R}^{n\times n}$ and $P\in\mathbb{R}^{d\times d}$, $X_t$ and $QX_tP$ share the same distribution, and $QX_tP = (QU_t)D_t(P^\top V_t)^\top$ is the SVD of the latter. This shows that the left and right singular matrices are independent and both uniformly distributed over the orthogonal matrices of the corresponding dimensions ($\mathbb{R}^{n\times n}$ and $\mathbb{R}^{d\times d}$, respectively).

Recall that we denote by $\sigma^{(n)}_1 \ge \cdots \ge \sigma^{(n)}_d$ the eigenvalues of $\frac{1}{n}X_t^\top X_t$, so that $D_t^\top D_t = \mathrm{Diag}(n\sigma^{(n)}_1,\dots,n\sigma^{(n)}_d)$. We can further simplify (21) as
$$\mathbb{E}\Big[V_t\,\mathrm{Diag}\Big(\frac{n\lambda^2\sigma^{(n)}_1}{(\sigma^{(n)}_1+\lambda)^2},\dots,\frac{n\lambda^2\sigma^{(n)}_d}{(\sigma^{(n)}_d+\lambda)^2}\Big)V_t^\top\Big] = \mathbb{E}\Big[\sum_{i=1}^d\frac{n\lambda^2\sigma^{(n)}_i}{(\sigma^{(n)}_i+\lambda)^2}\,v_{t,i}v_{t,i}^\top\Big]. \quad (22)$$
We utilize the isotropy of $X_t$ to evaluate (22). We have shown that $V_t$ is uniform over the orthogonal matrices; hence for any permutation matrix $P\in\mathbb{R}^{d\times d}$, $V_tP$ has the same distribution as $V_t$, and for the permuted matrix $V_tP$, (22) becomes $\mathbb{E}\big[\sum_{i=1}^d\frac{n\lambda^2\sigma^{(n)}_i}{(\sigma^{(n)}_i+\lambda)^2}v_{t,\tau_P(i)}v_{t,\tau_P(i)}^\top\big]$, where $\tau_P(i)$ denotes the permuted position of the $i$-th element under $P$. Summing over all $d!$ permutations $P$, we deduce
$$d!\,\mathbb{E}\big[(M^{\text{tr-tr}}_t)^\top M^{\text{tr-tr}}_t\big] = \sum_{\tau_P}\mathbb{E}\Big[\sum_{i=1}^d\frac{n\lambda^2\sigma^{(n)}_i}{(\sigma^{(n)}_i+\lambda)^2}v_{t,\tau_P(i)}v_{t,\tau_P(i)}^\top\Big] = (d-1)!\,\mathbb{E}\Big[\sum_{j=1}^d\sum_{i=1}^d\frac{n\lambda^2\sigma^{(n)}_i}{(\sigma^{(n)}_i+\lambda)^2}v_{t,j}v_{t,j}^\top\Big] = (d-1)!\,\mathbb{E}\Big[\sum_{i=1}^d\frac{n\lambda^2\sigma^{(n)}_i}{(\lambda+\sigma^{(n)}_i)^2}\Big]\,\mathbb{E}\big[V_tI_dV_t^\top\big]. \quad (23)$$
Dividing both sides of (23) by $d!$ yields
$$\mathbb{E}\big[(M^{\text{tr-tr}}_t)^\top M^{\text{tr-tr}}_t\big] = \frac{n}{d}\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^2\sigma^{(n)}_i}{(\lambda+\sigma^{(n)}_i)^2}\Big]\,I_d. \quad (24)$$
Next, we find the expected gradient covariance $\frac{1}{n^2}\mathbb{E}[\nabla\ell^{\text{tr-tr}}_t(w_{0,\star})\nabla\ell^{\text{tr-tr}}_t(w_{0,\star})^\top]$, where
$$\mathbb{E}\big[\nabla\ell^{\text{tr-tr}}_t(w_{0,\star})\nabla\ell^{\text{tr-tr}}_t(w_{0,\star})^\top\big] = \mathbb{E}\big[(M^{\text{tr-tr}}_t)^\top M^{\text{tr-tr}}_t(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top(M^{\text{tr-tr}}_t)^\top M^{\text{tr-tr}}_t\big] \overset{(i)}{=} \mathbb{E}\Big[V_t\,\mathrm{Diag}\Big(\frac{n\lambda^2\sigma^{(n)}_i}{(\sigma^{(n)}_i+\lambda)^2}\Big)_{i=1}^dV_t^\top(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top V_t\,\mathrm{Diag}\Big(\frac{n\lambda^2\sigma^{(n)}_i}{(\sigma^{(n)}_i+\lambda)^2}\Big)_{i=1}^dV_t^\top\Big], \quad (25)$$
where step (i) uses the SVD of $X_t$ and the computation in (22). Combining (24) and (25), we derive the asymptotic covariance matrix when using $L^{\text{tr-tr}}$ as
$$\mathrm{AsymCov}(\widehat{w}^{\text{tr-tr}}_{0,T}) = \big(\nabla^2L^{\text{tr-tr}}\big)^{-1}\mathrm{Cov}\big[\nabla\ell^{\text{tr-tr}}_t(w_{0,\star})\big]\big(\nabla^2L^{\text{tr-tr}}\big)^{-1} = d^2\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^2\sigma^{(n)}_i}{(\lambda+\sigma^{(n)}_i)^2}\Big]^{-2}\cdot\mathbb{E}\Big[V_t\,\mathrm{Diag}\Big(\frac{\lambda^2\sigma^{(n)}_i}{(\sigma^{(n)}_i+\lambda)^2}\Big)_{i=1}^dV_t^\top(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top V_t\,\mathrm{Diag}\Big(\frac{\lambda^2\sigma^{(n)}_i}{(\sigma^{(n)}_i+\lambda)^2}\Big)_{i=1}^dV_t^\top\Big]. \quad (26)$$
Taking the trace in (26) and using the cyclic property of the trace, we deduce
$$\mathrm{AsymMSE}(\widehat{w}^{\text{tr-tr}}_{0,T}) = \mathrm{tr}\big(\mathrm{AsymCov}(\widehat{w}^{\text{tr-tr}}_{0,T})\big) = d^2\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^2\sigma^{(n)}_i}{(\lambda+\sigma^{(n)}_i)^2}\Big]^{-2}\,\mathrm{tr}\,\mathbb{E}\Big[V_t\,\mathrm{Diag}\Big(\frac{\lambda^4(\sigma^{(n)}_i)^2}{(\sigma^{(n)}_i+\lambda)^4}\Big)_{i=1}^dV_t^\top(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top\Big] \overset{(i)}{=} d^2\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^2\sigma^{(n)}_i}{(\lambda+\sigma^{(n)}_i)^2}\Big]^{-2}\cdot\frac{1}{d}\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^4(\sigma^{(n)}_i)^2}{(\lambda+\sigma^{(n)}_i)^4}\Big]\,\mathrm{tr}\big(\mathbb{E}\big[(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top\big]\big) = dR^2\,\frac{\mathbb{E}\big[\sum_{i=1}^d(\sigma^{(n)}_i)^2/(\lambda+\sigma^{(n)}_i)^4\big]}{\mathbb{E}\big[\sum_{i=1}^d\sigma^{(n)}_i/(\lambda+\sigma^{(n)}_i)^2\big]^2},$$
where step (i) uses the independence between $w_t$ and $X_t$ and applies the permutation trick in (23) to evaluate $\mathbb{E}\big[V_t\,\mathrm{Diag}\big(\lambda^4(\sigma^{(n)}_i)^2/(\sigma^{(n)}_i+\lambda)^4\big)_{i=1}^dV_t^\top\big] = \frac{1}{d}\mathbb{E}\big[\sum_{i=1}^d\lambda^4(\sigma^{(n)}_i)^2/(\lambda+\sigma^{(n)}_i)^4\big]\,I_d$.

• Asymptotic variance of $\widehat{w}^{\text{tr-val}}_{0,T}$. Similarly to the non-splitting case, we first compute the Hessian $\frac{1}{n_2}\mathbb{E}[(M^{\text{tr-val}}_t)^\top M^{\text{tr-val}}_t]$:
$$\mathbb{E}\big[(M^{\text{tr-val}}_t)^\top M^{\text{tr-val}}_t\big] = \mathbb{E}\Big[\big(I_d - ((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d)^{-1}(X^{\text{train}}_t)^\top X^{\text{train}}_t\big)\,(X^{\text{val}}_t)^\top X^{\text{val}}_t\,\big(I_d - ((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d)^{-1}(X^{\text{train}}_t)^\top X^{\text{train}}_t\big)\Big] \overset{(i)}{=} n_2\,\mathbb{E}\Big[\big(I_d - ((X^{\text{train}}_t)^\top X^{\text{train}}_t + n_1\lambda I_d)^{-1}(X^{\text{train}}_t)^\top X^{\text{train}}_t\big)^2\Big] \overset{(ii)}{=} n_2\,\mathbb{E}\Big[V^{\text{train}}_t\big(I_d - ((D^{\text{train}}_t)^\top D^{\text{train}}_t + n_1\lambda I_d)^{-1}(D^{\text{train}}_t)^\top D^{\text{train}}_t\big)^2(V^{\text{train}}_t)^\top\Big], \quad (28)$$
where (i) uses the data-generating assumption $\mathbb{E}[(X^{\text{val}}_t)^\top X^{\text{val}}_t] = n_2I_d$ together with the independence between $X^{\text{train}}_t$ and $X^{\text{val}}_t$, and (ii) follows from the SVD $X^{\text{train}}_t = U^{\text{train}}_tD^{\text{train}}_t(V^{\text{train}}_t)^\top$. Here we denote by $\sigma^{(n_1)}_1 \ge \cdots \ge \sigma^{(n_1)}_d$ the eigenvalues of $\frac{1}{n_1}(X^{\text{train}}_t)^\top X^{\text{train}}_t$, so that $(D^{\text{train}}_t)^\top D^{\text{train}}_t = \mathrm{Diag}(n_1\sigma^{(n_1)}_1,\dots,n_1\sigma^{(n_1)}_d)$. To lighten the notation, write
$$D_\lambda := \mathrm{Diag}\Big(\frac{\lambda}{\sigma^{(n_1)}_1+\lambda},\dots,\frac{\lambda}{\sigma^{(n_1)}_d+\lambda}\Big), \qquad D_\lambda^2 = \mathrm{Diag}\Big(\frac{\lambda^2}{(\sigma^{(n_1)}_1+\lambda)^2},\dots,\frac{\lambda^2}{(\sigma^{(n_1)}_d+\lambda)^2}\Big),$$
and also $V_t := V^{\text{train}}_t$ and $S_t := (X^{\text{val}}_t)^\top X^{\text{val}}_t$. We can now further simplify (28) as
$$n_2\,\mathbb{E}\big[V_tD_\lambda^2V_t^\top\big] = \frac{n_2}{d}\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^2}{(\lambda+\sigma^{(n_1)}_i)^2}\Big]\,I_d, \quad (29)$$
where the first step follows from the same computation as in (22) and the second uses the permutation trick in (23).

Next, we find the expected gradient covariance $\frac{1}{n_2^2}\mathbb{E}[\nabla\ell^{\text{tr-val}}_t(w_{0,\star})\nabla\ell^{\text{tr-val}}_t(w_{0,\star})^\top]$:
$$\mathbb{E}\big[\nabla\ell^{\text{tr-val}}_t(w_{0,\star})\nabla\ell^{\text{tr-val}}_t(w_{0,\star})^\top\big] = \mathbb{E}\big[(M^{\text{tr-val}}_t)^\top M^{\text{tr-val}}_t(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top(M^{\text{tr-val}}_t)^\top M^{\text{tr-val}}_t\big] = \mathbb{E}\Big[V_tD_\lambda V_t^\top S_tV_tD_\lambda V_t^\top(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top V_tD_\lambda V_t^\top S_tV_tD_\lambda V_t^\top\Big]. \quad (30)$$
Combining (29) and (30), we derive the asymptotic covariance matrix when using $L^{\text{tr-val}}$ as
$$\mathrm{AsymCov}(\widehat{w}^{\text{tr-val}}_{0,T}) = \frac{d^2}{n_2^2}\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^2}{(\lambda+\sigma^{(n_1)}_i)^2}\Big]^{-2}\cdot\mathbb{E}\Big[V_tD_\lambda V_t^\top S_tV_tD_\lambda V_t^\top(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top V_tD_\lambda V_t^\top S_tV_tD_\lambda V_t^\top\Big]. \quad (31)$$
Taking the trace in (31) and using the cyclic property of the trace, we deduce
$$\mathrm{AsymMSE}(\widehat{w}^{\text{tr-val}}_{0,T}) = \frac{d^2}{n_2^2}\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^2}{(\lambda+\sigma^{(n_1)}_i)^2}\Big]^{-2}\cdot\mathrm{tr}\,\mathbb{E}\Big[V_tD_\lambda V_t^\top S_tV_tD_\lambda^2V_t^\top S_tV_tD_\lambda V_t^\top(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top\Big]. \quad (32)$$
Due to the isotropy of $X^{\text{train}}_t$ and $X^{\text{val}}_t$, we claim that
$$\mathbb{E}\big[V_tD_\lambda V_t^\top S_tV_tD_\lambda^2V_t^\top S_tV_tD_\lambda V_t^\top\big] = cI_d \quad (33)$$
is a diagonal matrix with all diagonal elements identical. We show the claim by taking the expectation with respect to $X^{\text{val}}_t$ first. Since $V_t$ is an orthogonal matrix independent of $X^{\text{val}}_t$, the matrix $X^{\text{val}}_tV_t$ has the same distribution as $X^{\text{val}}_t$ and is independent of $X^{\text{train}}_t$. We verify that every off-diagonal element of the matrix
$$A := \mathbb{E}_{X^{\text{val}}_t}\big[V_t^\top(X^{\text{val}}_t)^\top X^{\text{val}}_tV_t\,D_\lambda^2\,V_t^\top(X^{\text{val}}_t)^\top X^{\text{val}}_tV_t\big]$$
is zero. Denote $X^{\text{val}}_tV_t = [\tilde{x}_1,\dots,\tilde{x}_{n_2}]^\top\in\mathbb{R}^{n_2\times d}$ with $\tilde{x}_i\stackrel{iid}{\sim}N(0, I_d)$, and write $\tilde{x}_{i,j}$ for the $j$-th element of $\tilde{x}_i$. For $k\neq\ell$, the $(k,\ell)$-th entry of $A$ is
$$A_{k,\ell} = \mathbb{E}\Big[\sum_{j}\frac{\lambda^2}{(\sigma^{(n_1)}_j+\lambda)^2}\sum_{m,n}\tilde{x}_{m,k}\tilde{x}_{m,j}\tilde{x}_{n,j}\tilde{x}_{n,\ell}\Big] \overset{(i)}{=} 0,$$
where equality (i) holds since either $\tilde{x}_{m,k}$ or $\tilde{x}_{n,\ell}$ appears only to the first power in each summand. Therefore, we can write $A = \mathrm{Diag}(A_{1,1},\dots,A_{d,d})$ with
$$A_{k,k} = \mathbb{E}\Big[\sum_{j}\frac{\lambda^2}{(\sigma^{(n_1)}_j+\lambda)^2}\Big(\sum_{m}\tilde{x}_{m,k}\tilde{x}_{m,j}\Big)^2\Big].$$
Observe that, as a function of the index $k$, $A_{k,k}$ depends on $k$ only through $\sigma^{(n_1)}_k$. Plugging back into (33), we have
$$\mathbb{E}\big[V_tD_\lambda AD_\lambda V_t^\top\big] = \mathbb{E}\Big[V_t\,\mathrm{Diag}\Big(\frac{\lambda^2A_{1,1}}{(\sigma^{(n_1)}_1+\lambda)^2},\dots,\frac{\lambda^2A_{d,d}}{(\sigma^{(n_1)}_d+\lambda)^2}\Big)V_t^\top\Big] \overset{(i)}{=} cI_d,$$
where equality (i) again utilizes the permutation trick in (23). To this end, it suffices to find $c$:
$$c = \frac{1}{d}\,\mathrm{tr}\,\mathbb{E}\big[V_tD_\lambda V_t^\top S_tV_tD_\lambda^2V_t^\top S_tV_tD_\lambda V_t^\top\big] = \frac{1}{d}\,\mathrm{tr}\,\mathbb{E}\big[X^{\text{val}}_tV_tD_\lambda^2V_t^\top(X^{\text{val}}_t)^\top X^{\text{val}}_tV_tD_\lambda^2V_t^\top(X^{\text{val}}_t)^\top\big]. \quad (34)$$
Observing again that $X^{\text{val}}_tV_t\in\mathbb{R}^{n_2\times d}$ is a standard Gaussian random matrix, we rewrite (34) as
$$c = \frac{1}{d}\,\mathbb{E}\Big[\sum_{i,j=1}^{n_2}\big(v_i^\top D_\lambda^2v_j\big)^2\Big], \quad (35)$$
where $v_i\stackrel{iid}{\sim}N(0, I_d)$, $i=1,\dots,n_2$, are i.i.d. Gaussian random vectors. To compute (35), we need the following result.

Claim 7. Given any symmetric matrix $A\in\mathbb{R}^{d\times d}$ and i.i.d. standard Gaussian random vectors $v, u\stackrel{iid}{\sim}N(0, I_d)$, we have
$$\mathbb{E}\big[(v^\top Av)^2\big] = 2\|A\|^2_{\mathrm{Fr}} + \mathrm{tr}^2(A), \quad (36)$$
$$\mathbb{E}\big[(v^\top Au)^2\big] = \|A\|^2_{\mathrm{Fr}}. \quad (37)$$

Proof of Claim 7. We show (36) first. Denote by $A_{i,j}$ the $(i,j)$-th element of $A$ and by $v_i$ the $i$-th element of $v$. Expanding the quadratic form, we have
$$\mathbb{E}\big[(v^\top Av)^2\big] = \mathbb{E}\Big[\sum_{i,j,k,\ell\le d}v_iv_jv_kv_\ell A_{i,j}A_{k,\ell}\Big] = \mathbb{E}\Big[\sum_{i\le d}v_i^4A_{i,i}^2\Big] + \mathbb{E}\Big[\sum_{i\neq j}v_i^2v_j^2\big(A_{i,j}^2 + A_{i,i}A_{j,j} + A_{i,j}A_{j,i}\big)\Big] = 3\sum_{i\le d}A_{i,i}^2 + \sum_{i\neq j}\big(A_{i,j}^2 + A_{i,i}A_{j,j} + A_{i,j}A_{j,i}\big) = \mathrm{tr}^2(A) + 2\sum_{i\le d}A_{i,i}^2 + \sum_{i\neq j}\big(A_{i,j}^2 + A_{i,j}A_{j,i}\big) = \mathrm{tr}^2(A) + 2\|A\|^2_{\mathrm{Fr}}.$$
Next, we show (37) by the cyclic property of the trace:
$$\mathbb{E}\big[(v^\top Au)^2\big] = \mathrm{tr}\,\mathbb{E}\big[uu^\top Avv^\top A\big] = \mathrm{tr}(A^2) = \|A\|^2_{\mathrm{Fr}}.$$

We now return to the computation of (35) using Claim 7:
$$c = \frac{1}{d}\,\mathbb{E}\Big[\sum_{i=1}^{n_2}\big(v_i^\top D_\lambda^2v_i\big)^2\Big] + \frac{1}{d}\,\mathbb{E}\Big[\sum_{i\neq j}\big(v_i^\top D_\lambda^2v_j\big)^2\Big] = \frac{n_2}{d}\,\mathbb{E}\big[\mathrm{tr}^2(D_\lambda^2)\big] + \frac{2n_2}{d}\,\mathbb{E}\big[\|D_\lambda^2\|^2_{\mathrm{Fr}}\big] + \frac{n_2(n_2-1)}{d}\,\mathbb{E}\big[\|D_\lambda^2\|^2_{\mathrm{Fr}}\big] = \frac{n_2}{d}\Big(\mathbb{E}\Big[\Big(\sum_{i=1}^d\frac{\lambda^2}{(\sigma^{(n_1)}_i+\lambda)^2}\Big)^2\Big] + (n_2+1)\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^4}{(\sigma^{(n_1)}_i+\lambda)^4}\Big]\Big). \quad (38)$$
Combining (38) and (33), and using the independence between $w_t$ and $(X^{\text{train}}_t, X^{\text{val}}_t)$, we compute (32) as
$$\mathrm{AsymMSE}(\widehat{w}^{\text{tr-val}}_{0,T}) = \frac{d^2}{n_2^2}\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^2}{(\lambda+\sigma^{(n_1)}_i)^2}\Big]^{-2}\cdot\frac{n_2}{d}\Big(\mathbb{E}\Big[\Big(\sum_{i=1}^d\frac{\lambda^2}{(\sigma^{(n_1)}_i+\lambda)^2}\Big)^2\Big] + (n_2+1)\,\mathbb{E}\Big[\sum_{i=1}^d\frac{\lambda^4}{(\sigma^{(n_1)}_i+\lambda)^4}\Big]\Big)\cdot\mathrm{tr}\big(\mathbb{E}\big[(w_{0,\star}-w_t)(w_{0,\star}-w_t)^\top\big]\big) = \frac{dR^2}{n_2}\cdot\frac{\mathbb{E}\big[\big(\sum_{i=1}^d\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\big)^2\big] + (n_2+1)\,\mathbb{E}\big[\sum_{i=1}^d\lambda^4/(\sigma^{(n_1)}_i+\lambda)^4\big]}{\mathbb{E}\big[\sum_{i=1}^d\lambda^2/(\lambda+\sigma^{(n_1)}_i)^2\big]^2}.$$
The proof is complete.
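Both identities in Claim 7 are straightforward to sanity-check by Monte Carlo; a small sketch of ours (the matrix A and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 400_000
A = np.diag([0.5, 0.2, 0.1])     # any symmetric matrix works; diagonal for simplicity
v = rng.standard_normal((m, d))
u = rng.standard_normal((m, d))

quad_vv = np.einsum('mi,ij,mj->m', v, A, v)   # m independent draws of v' A v
quad_vu = np.einsum('mi,ij,mj->m', v, A, u)   # v' A u with independent u

fro2 = np.sum(A ** 2)            # ||A||_Fr^2
tr2 = np.trace(A) ** 2           # tr(A)^2
print((quad_vv ** 2).mean(), 2 * fro2 + tr2)   # Claim 7, identity (36)
print((quad_vu ** 2).mean(), fro2)             # Claim 7, identity (37)
```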

B.3 OPTIMAL RATE OF THE TRAIN-VAL METHOD AT FINITE (n, d)

Corollary 8 (Optimal rate of the train-val method at finite (n, d)). For any $(n, d)$ and any split $(n_1, n_2) = (n_1, n - n_1)$, the optimal rate (over the tuning of the regularization $\lambda > 0$) of the train-val method is achieved as $\lambda\to\infty$:
$$\inf_{\lambda > 0}\mathrm{AsymMSE}\big(\widehat{w}^{\text{tr-val}}_{0,T}(n_1, n_2; \lambda)\big) = \lim_{\lambda\to\infty}\mathrm{AsymMSE}\big(\widehat{w}^{\text{tr-val}}_{0,T}(n_1, n_2; \lambda)\big) = \frac{(d + n_2 + 1)R^2}{n_2}.$$
Further optimizing the rate over $n_2$, the best rate is attained at $(n_1, n_2) = (0, n)$, in which case
$$\inf_{\lambda > 0,\; n_2\in[n]}\mathrm{AsymMSE}\big(\widehat{w}^{\text{tr-val}}_{0,T}(n_1, n_2; \lambda)\big) = \frac{(d + n + 1)R^2}{n}.$$

Discussion: using all data as validation. Corollary 8 suggests that the optimal asymptotic rate of the train-val method is obtained at $\lambda = \infty$ and $(n_1, n_2) = (0, n)$. In other words, the optimal choice for the train-val method is to use all the data as validation. In this case, since there is no training data, the inner solver reduces to the identity map, $A_{\infty,0}(w_0; X_t, y_t) = w_0$, and the outer loop reduces to learning a single linear model $w_0$ on all the tasks combined. We remark that while the optimality of such a split ratio is likely an artifact of the data distribution we assumed (noiseless realizable linear model) and may not generalize to other meta-learning problems, we do find experimentally that using more data as validation (than training) can also improve the performance on real meta-learning tasks (see Table 2).
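Corollary 8 can be verified numerically by evaluating the Lemma 5 expression over random Gaussian spectra; a sketch of ours (the dimensions and split are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def tr_val_asym_mse(n1, n2, d, lam, R2=1.0, trials=2000):
    """Monte Carlo evaluation of the train-val AsymMSE from Lemma 5, averaging
    over the spectrum of (1/n1) X'X for an n1 x d standard Gaussian X."""
    s2 = s2_sq = s4 = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n1, d))
        sig = np.linalg.eigvalsh(X.T @ X / n1)
        t2 = np.sum(lam ** 2 / (sig + lam) ** 2)
        s2 += t2
        s2_sq += t2 ** 2                            # E[(sum ...)^2] term
        s4 += np.sum(lam ** 4 / (sig + lam) ** 4)
    E2, E2_sq, E4 = s2 / trials, s2_sq / trials, s4 / trials
    return d * R2 / n2 * (E2_sq + (n2 + 1) * E4) / E2 ** 2

d, n1, n2 = 20, 15, 15
limit = (d + n2 + 1) / n2   # Corollary 8 with R^2 = 1
print(tr_val_asym_mse(n1, n2, d, lam=1e6), limit)
```

As λ grows the Monte Carlo value approaches the limit $(d+n_2+1)R^2/n_2$, and at finite λ it stays above that limit, matching the lower bound in the proof below.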

Proof of Corollary 8

Fix $n_1\in[n]$ and $n_2 = n-n_1$. Recall from Lemma 5 that
$$\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}val}_{0,T}(n_1,n_2;\lambda)\big) = \frac{dR^2}{n_2}\cdot\frac{\mathbb{E}\big(\sum_{i=1}^d\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\big)^2 + (n_2+1)\,\mathbb{E}\sum_{i=1}^d\lambda^4/(\sigma^{(n_1)}_i+\lambda)^4}{\big[\mathbb{E}\sum_{i=1}^d\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\big]^2}.$$
Clearly, as $\lambda\to\infty$ each $\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\to1$, so
$$\lim_{\lambda\to\infty}\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}val}_{0,T}(n_1,n_2;\lambda)\big) = \frac{dR^2}{n_2}\cdot\frac{d^2+(n_2+1)d}{d^2} = \frac{(d+n_2+1)R^2}{n_2}.$$
It remains to show that this quantity is a lower bound on $\mathrm{AsymMSE}(\widehat w^{tr\text{-}val}_{0,T}(n_1,n_2;\lambda))$ for every $\lambda>0$, which is equivalent to
$$\frac{\mathbb{E}\big(\sum_{i=1}^d\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\big)^2 + (n_2+1)\,\mathbb{E}\sum_{i=1}^d\lambda^4/(\sigma^{(n_1)}_i+\lambda)^4}{\big[\mathbb{E}\sum_{i=1}^d\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\big]^2} \ge \frac{d+n_2+1}{d} \quad\text{for all }\lambda>0. \quad (39)$$
We now prove (39). Note that $\sigma^{(n_1)}_i = 0$ for $i>n_1$, so the corresponding summands equal 1. For $i\in[n_1]$, define random variables
$$X_i := \frac{\lambda^2}{(\sigma^{(n_1)}_i+\lambda)^2}\in[0,1] \quad\text{and}\quad Y_i := 1-X_i\in[0,1].$$
Then the left-hand side of (39) can be rewritten (using $X_i = 1-Y_i$) as
$$\frac{\mathbb{E}\big[(d-n_1+\sum_{i=1}^{n_1}X_i)^2\big] + (n_2+1)\,\mathbb{E}\big[d-n_1+\sum_{i=1}^{n_1}X_i^2\big]}{\big(\mathbb{E}\big[d-n_1+\sum_{i=1}^{n_1}X_i\big]\big)^2} = \frac{\mathbb{E}\big[(d-\sum_i Y_i)^2\big] + (n_2+1)\,\mathbb{E}\big[d-2\sum_i Y_i+\sum_i Y_i^2\big]}{\big(\mathbb{E}\big[d-\sum_i Y_i\big]\big)^2}$$
$$= \frac{d^2+(n_2+1)d - 2(d+n_2+1)\,\mathbb{E}\big[\sum_i Y_i\big] + \mathbb{E}\big[(\sum_i Y_i)^2\big] + (n_2+1)\,\mathbb{E}\big[\sum_i Y_i^2\big]}{d^2 - 2d\,\mathbb{E}\big[\sum_i Y_i\big] + \big(\mathbb{E}\big[\sum_i Y_i\big]\big)^2}.$$
By algebraic manipulation, inequality (39) is therefore equivalent to showing
$$\frac{\mathbb{E}\big[(\sum_i Y_i)^2\big] + (n_2+1)\,\mathbb{E}\big[\sum_i Y_i^2\big]}{\big(\mathbb{E}\big[\sum_i Y_i\big]\big)^2} \ge \frac{d+n_2+1}{d}. \quad (40)$$
Clearly $\mathbb{E}[(\sum_i Y_i)^2]\ge(\mathbb{E}[\sum_i Y_i])^2$. By Cauchy-Schwarz we also have $\sum_i Y_i^2\ge\frac{1}{n_1}(\sum_i Y_i)^2$, and hence
$$\mathbb{E}\Big[\sum_i Y_i^2\Big]\ \ge\ \frac{1}{n_1}\,\mathbb{E}\Big[\Big(\sum_i Y_i\Big)^2\Big]\ \ge\ \frac{1}{n_1}\Big(\mathbb{E}\Big[\sum_i Y_i\Big]\Big)^2.$$
Therefore the left-hand side of (40) is at least
$$1 + \frac{n_2+1}{n_1}\ \ge\ 1 + \frac{n_2+1}{d} = \frac{d+n_2+1}{d},$$
where we have used $n_1\le n\le d$. This shows (40) and consequently (39).
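The key step (40) holds for any random variables $Y_i\in[0,1]$ with $n_1\le d$, and both the Jensen and Cauchy-Schwarz steps survive replacing expectations with empirical averages, so the inequality can be probed directly. A small sketch of ours, with an arbitrary Beta distribution standing in for the $Y_i$:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, d = 6, 4, 12                        # the proof needs n1 <= d
Y = rng.beta(0.3, 0.7, size=(100_000, n1))  # any distribution on [0, 1] works

S = Y.sum(axis=1)
lhs = (np.mean(S**2) + (n2 + 1) * np.mean((Y**2).sum(axis=1))) / np.mean(S)**2
# lhs >= 1 + (n2+1)/n1 >= (d + n2 + 1)/d, deterministically
print(lhs, 1 + (n2 + 1) / n1, (d + n2 + 1) / d)
```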

B.4 RATE OF THE TRAIN-TRAIN METHOD IN THE PROPORTIONAL LIMIT

Theorem 9 (Exact rates of the train-train method in the proportional limit). In the high-dimensional limiting regime $d,n\to\infty$, $d/n\to\gamma$, where $\gamma\in(0,\infty)$ is a fixed shape parameter, for any $\lambda>0$,
$$\lim_{d,n\to\infty,\ d/n\to\gamma}\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}tr}_{0,T}(n;\lambda)\big) = \rho_{\lambda,\gamma}R^2, \quad\text{where}\quad \rho_{\lambda,\gamma} = \frac{4\gamma^2\big[(\gamma-1)^2+(\gamma+1)\lambda\big]}{\big(\lambda+1+\gamma-\sqrt{(\lambda+\gamma+1)^2-4\gamma}\big)^2\,\big((\lambda+\gamma+1)^2-4\gamma\big)^{3/2}}.$$

Proof of Theorem 9

Let $\widehat\Sigma_n := \frac{1}{n}X_t^\top X_t$ denote the sample covariance matrix of the inputs in a single task $t$. By Lemma 5, we have
$$\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}tr}_{0,T}(n;\lambda)\big) = R^2\cdot\frac{\frac{1}{d}\,\mathbb{E}\sum_{i=1}^d\sigma_i(\widehat\Sigma_n)^2/(\sigma_i(\widehat\Sigma_n)+\lambda)^4}{\big[\frac{1}{d}\,\mathbb{E}\sum_{i=1}^d\sigma_i(\widehat\Sigma_n)/(\sigma_i(\widehat\Sigma_n)+\lambda)^2\big]^2} = R^2\cdot\frac{\overbrace{\frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\widehat\Sigma_n+\lambda I_d)^{-4}\widehat\Sigma_n^2\big]}^{=:\,\mathrm{I}_{n,d}}}{\big[\underbrace{\frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\widehat\Sigma_n+\lambda I_d)^{-2}\widehat\Sigma_n\big]}_{=:\,\mathrm{II}_{n,d}}\big]^2}. \quad (41)$$
We now evaluate the quantities $\mathrm{I}_{n,d}$ and $\mathrm{II}_{n,d}$ in the high-dimensional limit $d,n\to\infty$, $d/n\to\gamma\in(0,\infty)$. Consider the (slightly generalized) Stieltjes transform of $\widehat\Sigma_n$, defined for all $\lambda_1,\lambda_2>0$:
$$s(\lambda_1,\lambda_2) := \lim_{d,n\to\infty,\ d/n\to\gamma}\frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma_n)^{-1}\big].$$
As the entries of $X_t$ are i.i.d. $N(0,1)$, this limiting Stieltjes transform is the Stieltjes transform of the Marchenko-Pastur law, which has the closed form (see, e.g., Dobriban et al. (2018), Equation (7))
$$s(\lambda_1,\lambda_2) = \lambda_2^{-1}\,s(\lambda_1/\lambda_2, 1) = \frac{1}{\lambda_2}\cdot\frac{\gamma-1-\lambda_1/\lambda_2+\sqrt{(\lambda_1/\lambda_2+1+\gamma)^2-4\gamma}}{2\gamma\lambda_1/\lambda_2} = \frac{\gamma-1-\lambda_1/\lambda_2+\sqrt{(\lambda_1/\lambda_2+1+\gamma)^2-4\gamma}}{2\gamma\lambda_1}. \quad (42)$$
Now observe that differentiating $s(\lambda_1,\lambda_2)$ yields the quantity $\mathrm{II}$ (known as the derivative trick for Stieltjes transforms). Indeed,
$$-\frac{d}{d\lambda_2}s(\lambda_1,\lambda_2) = -\frac{d}{d\lambda_2}\lim_{d,n\to\infty,\ d/n\to\gamma}\frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma_n)^{-1}\big] = \lim_{d,n\to\infty,\ d/n\to\gamma}\frac{1}{d}\,\mathbb{E}\Big[-\frac{d}{d\lambda_2}\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma_n)^{-1}\big]\Big] = \lim_{d,n\to\infty,\ d/n\to\gamma}\frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma_n)^{-2}\widehat\Sigma_n\big]. \quad (43)$$
(Above, the exchange of differentiation and limit is due to the uniform convergence of the derivatives, which holds at any $\lambda_1,\lambda_2>0$; see Appendix B.4.1 for a detailed justification.) Taking $\lambda_1=\lambda$ and $\lambda_2=1$, we get
$$\lim_{d,n\to\infty,\ d/n\to\gamma}\mathrm{II}_{n,d} = \lim_{d,n\to\infty,\ d/n\to\gamma}\frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\lambda I_d+\widehat\Sigma_n)^{-2}\widehat\Sigma_n\big] = -\frac{d}{d\lambda_2}s(\lambda_1,\lambda_2)\Big|_{\lambda_1=\lambda,\lambda_2=1}.$$
Similarly,
$$\lim_{d,n\to\infty,\ d/n\to\gamma}\mathrm{I}_{n,d} = \lim_{d,n\to\infty,\ d/n\to\gamma}\frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\lambda I_d+\widehat\Sigma_n)^{-4}\widehat\Sigma_n^2\big] = -\frac{1}{6}\frac{d}{d\lambda_1}\frac{d^2}{d\lambda_2^2}s(\lambda_1,\lambda_2)\Big|_{\lambda_1=\lambda,\lambda_2=1}.$$
Evaluating the right-hand sides by differentiating the closed-form expression (42), we get
$$\lim_{d,n\to\infty,\ d/n\to\gamma}\mathrm{I}_{n,d} = \frac{(\gamma-1)^2+(\gamma+1)\lambda}{\big((\lambda+1+\gamma)^2-4\gamma\big)^{5/2}}, \qquad \lim_{d,n\to\infty,\ d/n\to\gamma}\mathrm{II}_{n,d} = \frac{1}{2\gamma}\Bigg(\frac{\lambda+1+\gamma}{\sqrt{(\lambda+1+\gamma)^2-4\gamma}}-1\Bigg).$$
Substituting back into (41) yields
$$\lim_{d,n\to\infty,\ d/n\to\gamma}\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}tr}_{0,T}(n;\lambda)\big) = R^2\cdot\frac{\lim\mathrm{I}_{n,d}}{\big(\lim\mathrm{II}_{n,d}\big)^2} = R^2\cdot\frac{4\gamma^2\big[(\gamma-1)^2+(\gamma+1)\lambda\big]}{\big((\lambda+1+\gamma)^2-4\gamma\big)^{3/2}\,\big(\lambda+1+\gamma-\sqrt{(\lambda+1+\gamma)^2-4\gamma}\big)^2}.$$
This proves the desired result.
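As a sanity check of ours (not part of the proof), the limit formula $\rho_{\lambda,\gamma}$ can be compared against the finite-$(d,n)$ quantities $\mathrm{I}$ and $\mathrm{II}$ computed from sampled Wishart spectra; at moderate dimensions the two should already be close:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, gamma = 1.0, 1.0
d = n = 300
trials = 40

# Monte Carlo estimates of I and II at finite (d, n).
I = II = 0.0
for _ in range(trials):
    X = rng.standard_normal((n, d))
    sig = np.linalg.eigvalsh(X.T @ X / n)   # spectrum of the sample covariance
    I += np.mean(sig**2 / (sig + lam)**4)
    II += np.mean(sig / (sig + lam)**2)
I, II = I / trials, II / trials

# Closed-form rho_{lambda, gamma} from Theorem 9.
D = (lam + 1 + gamma)**2 - 4 * gamma
rho = 4 * gamma**2 * ((gamma - 1)**2 + (gamma + 1) * lam) \
    / (D**1.5 * (lam + 1 + gamma - np.sqrt(D))**2)
print(I / II**2, rho)   # close to each other
```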

B.4.1 INTERCHANGING DERIVATIVE AND EXPECTATION / LIMIT

Here we rigorously establish the interchange of the derivative and the expectation/limit used in (43). For notational convenience, let $\widehat\Sigma = \widehat\Sigma_n = X_t^\top X_t/n$ denote the empirical covariance matrix of $X_t$. We wish to show that
$$\frac{d}{d\lambda_2}\lim_{d,n\to\infty,\ d/n\to\gamma}\frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma)^{-1}\big] = \lim_{d,n\to\infty,\ d/n\to\gamma}\frac{1}{d}\,\mathbb{E}\,\frac{d}{d\lambda_2}\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma)^{-1}\big].$$
This involves first the interchange of derivative and expectation, and then the interchange of derivative and limit.

Interchange of derivative and expectation. First, we show that for any fixed $(d,n)$,
$$\frac{d}{d\lambda_2}\mathbb{E}\,\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma)^{-1}\big] = \mathbb{E}\,\frac{d}{d\lambda_2}\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma)^{-1}\big].$$
By the definition of the derivative,
$$\frac{d}{d\lambda_2}\mathbb{E}\,\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma)^{-1}\big] = \lim_{t\to0}\mathbb{E}\Bigg[\frac{\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma+t\widehat\Sigma)^{-1}\big] - \mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma)^{-1}\big]}{t}\Bigg].$$
For any $A\succ0$, the function $t\mapsto\mathrm{tr}((A+tB)^{-1})$ is continuously differentiable at $t=0$ with derivative $-\mathrm{tr}(A^{-2}B)$, and is thus locally Lipschitz around $t=0$ with Lipschitz constant $|\mathrm{tr}(A^{-2}B)|+1$. Applying this inside the expectation with $A = \lambda_1 I_d+\lambda_2\widehat\Sigma\succeq\lambda_1 I_d$ and $B = \widehat\Sigma$, we get that for sufficiently small $|t|$, the difference quotient inside the expectation is bounded by $|\mathrm{tr}(\lambda_1^{-2}\widehat\Sigma)|+1<\infty$ uniformly over $t$. This provides an integrable envelope, so by the Dominated Convergence Theorem the limit can be passed inside the expectation, which yields the expectation of the derivative.

Interchange of derivative and limit

Define $f_{n,d}(\lambda_2) := \frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma)^{-1}\big]$. It suffices to show that
$$\frac{d}{d\lambda_2}\lim_{d,n\to\infty,\ d/n\to\gamma}f_{n,d}(\lambda_2) = \lim_{d,n\to\infty,\ d/n\to\gamma}f'_{n,d}(\lambda_2),$$
where, by the result of the preceding part,
$$f'_{n,d}(\lambda_2) = \mathbb{E}\,\frac{d}{d\lambda_2}\frac{1}{d}\,\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma)^{-1}\big] = -\frac{1}{d}\,\mathbb{E}\,\mathrm{tr}\big[(\lambda_1 I_d+\lambda_2\widehat\Sigma)^{-2}\widehat\Sigma\big].$$
As $f_{n,d}(\lambda_2)\to s(\lambda_1,\lambda_2)$ pointwise over $\lambda_2$ by properties of Wishart matrices (Bai & Silverstein, 2010), and each $f_{n,d}$ is differentiable, it suffices to show that the derivatives $f'_{n,d}(\widetilde\lambda_2)$ converge uniformly for $\widetilde\lambda_2$ in a neighborhood of $\lambda_2$. Observe that we can rewrite
$$f'_{n,d}(\widetilde\lambda_2) = -\mathbb{E}_{\widehat\mu_{n,d}}\big[\mathbb{E}_{\lambda\sim\widehat\mu_{n,d}}\,g_{\widetilde\lambda_2}(\lambda)\big],$$
where $\widehat\mu_{n,d}$ is the empirical distribution of the eigenvalues of $\widehat\Sigma$, and
$$g_{\widetilde\lambda_2}(\lambda) := \frac{\lambda}{(\lambda_1+\widetilde\lambda_2\lambda)^2} \le \frac{1}{\lambda_1\widetilde\lambda_2} \quad\text{for all }\lambda\ge0.$$
Therefore, since $\widehat\mu_{n,d}$ converges weakly to the Marchenko-Pastur distribution with probability one and $g_{\widetilde\lambda_2}$ is uniformly bounded for $\widetilde\lambda_2$ in a small neighborhood of $\lambda_2$, the derivatives $f'_{n,d}(\widetilde\lambda_2)$ do converge uniformly to the expectation of $g_{\widetilde\lambda_2}(\lambda)$ under the Marchenko-Pastur distribution. This shows the desired interchange of derivative and limit.
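The convergence underlying this argument is easy to illustrate numerically; the following sketch of ours compares the empirical trace functional $f_{n,d}$ with the closed form (42) at moderate dimensions (all sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, l1, l2 = 0.5, 1.0, 1.0
d, n, trials = 400, 800, 20              # d / n = gamma

# Empirical f_{n,d}(lambda_2): (1/d) E tr[(l1 I + l2 Sigma)^{-1}]
emp = 0.0
for _ in range(trials):
    X = rng.standard_normal((n, d))
    Sigma = X.T @ X / n                  # sample covariance matrix
    emp += np.trace(np.linalg.inv(l1 * np.eye(d) + l2 * Sigma)) / d
emp /= trials

# Closed form (42): Stieltjes transform of the Marchenko-Pastur law.
r = l1 / l2
s = (gamma - 1 - r + np.sqrt((r + 1 + gamma)**2 - 4 * gamma)) / (2 * gamma * l1)
print(emp, s)   # close for moderately large d, n
```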

B.5 PROOF OF THEOREM 4

Throughout this proof we assume $R^2 = 1$ without loss of generality (as all the rates are constant multiples of $R^2$).

Part I: optimal rate for $L^{tr\text{-}tr}$. By Theorem 9, we have
$$\inf_{\lambda>0}\ \lim_{d,n\to\infty,\ d/n=\gamma}\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}tr}_{0,T}(n;\lambda)\big) = \inf_{\lambda>0}\ \underbrace{\frac{4\gamma^2\big[(\gamma-1)^2+(\gamma+1)\lambda\big]}{\big(\lambda+1+\gamma-\sqrt{(\lambda+\gamma+1)^2-4\gamma}\big)^2\,\big((\lambda+\gamma+1)^2-4\gamma\big)^{3/2}}}_{=:\,f(\lambda,\gamma)}.$$
To bound $\inf_{\lambda>0}f(\lambda,\gamma)$, picking any $\lambda=\lambda(\gamma)$ gives $f(\lambda(\gamma),\gamma)$ as a valid upper bound, and our goal is to choose $\lambda$ that yields a bound as tight as possible. Here we consider the choice
$$\lambda = \lambda(\gamma) = \max\{1-\gamma/2,\ \gamma-1/2\} = (1-\gamma/2)\,1\{\gamma\le1\} + (\gamma-1/2)\,1\{\gamma>1\},$$
which we now show yields the claimed upper bound.

Case 1: $\gamma\le1$. Substituting $\lambda = 1-\gamma/2$ into $f(\lambda,\gamma)$ and simplifying, we get
$$f(1-\gamma/2,\ \gamma) = \frac{2(\gamma^2-3\gamma+4)}{(2-\gamma/2)^3} =: g_1(\gamma).$$
Clearly $g_1(0)=1$ and $g_1(1)=32/27$. Further differentiating $g_1$ twice gives
$$g_1''(\gamma) = \frac{\gamma^2+7\gamma+4}{(2-\gamma/2)^5} > 0 \quad\text{for all }\gamma\in[0,1].$$
Thus $g_1$ is convex on $[0,1]$, from which we conclude that
$$g_1(\gamma)\ \le\ (1-\gamma)\cdot g_1(0) + \gamma\cdot g_1(1) = 1 + \tfrac{5}{27}\gamma.$$

Case 2: $\gamma>1$. Substituting $\lambda=\gamma-1/2$ into $f(\lambda,\gamma)$ and simplifying, we get
$$f(\gamma-1/2,\ \gamma) = \frac{2\gamma^2(4\gamma^2-3\gamma+1)}{(2\gamma-1/2)^3} =: g_2(\gamma).$$
We have $g_2(1)=g_1(1)=32/27$. Further differentiating $g_2$ gives
$$g_2'(\gamma) = 1 - \frac{1}{(4\gamma-1)^2} - \frac{6}{(4\gamma-1)^3} - \frac{6}{(4\gamma-1)^4} < 1 \quad\text{for all }\gamma>1.$$
Therefore, for all $\gamma>1$,
$$g_2(\gamma) = g_2(1) + \int_1^\gamma g_2'(t)\,dt\ \le\ g_2(1) + (\gamma-1) = \gamma + \tfrac{5}{27}.$$
Combining Cases 1 and 2, we get
$$\inf_{\lambda>0}f(\lambda,\gamma)\ \le\ g_1(\gamma)\,1\{\gamma\le1\} + g_2(\gamma)\,1\{\gamma>1\}\ \le\ \big(1+\tfrac{5}{27}\gamma\big)1\{\gamma\le1\} + \big(\tfrac{5}{27}+\gamma\big)1\{\gamma>1\} = \max\big\{1+\tfrac{5}{27}\gamma,\ \tfrac{5}{27}+\gamma\big\}.$$
This is the desired upper bound for $L^{tr\text{-}tr}$.

Equality at $\gamma=1$. We finally show that the above upper bound becomes an equality when $\gamma=1$. At $\gamma=1$, we have
$$f(\lambda,1) = \frac{8\lambda}{\big(\lambda+2-\sqrt{\lambda^2+4\lambda}\big)^2\,\big(\lambda^2+4\lambda\big)^{3/2}} = \frac{8\lambda^{-4}}{\big(1+2/\lambda-\sqrt{1+4/\lambda}\big)^2\,\big(1+4/\lambda\big)^{3/2}}.$$
Making the change of variable $t = \sqrt{1+4/\lambda}$, so that $\lambda^{-1} = (t^2-1)/4$, minimizing the above expression is equivalent to minimizing
$$\frac{(t^2-1)^4/32}{\big(t^2/2-t+1/2\big)^2\,t^3} = \frac{(t+1)^4}{8t^3}$$
over $t>1$. It is straightforward to check (by computing the first and second derivatives) that the above quantity is minimized at $t=3$, with value $32/27$. In other words, we have shown
$$\inf_{\lambda>0}f(\lambda,1) = \frac{32}{27} = \max\Big\{1+\tfrac{5}{27}\gamma,\ \tfrac{5}{27}+\gamma\Big\}\Big|_{\gamma=1},$$
that is, the equality holds at $\gamma=1$.

Part II: optimal rate for $L^{tr\text{-}val}$. We now prove the result on $L^{tr\text{-}val}$, that is,
$$\inf_{\lambda>0,\ s\in(0,1)}\ \lim_{d,n\to\infty,\ d/n=\gamma}\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}val}_{0,T}(ns, n(1-s);\lambda)\big) \overset{(i)}{=} \lim_{d,n\to\infty,\ d/n=\gamma}\ \inf_{\lambda>0,\ n_1+n_2=n}\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}val}_{0,T}(n_1,n_2;\lambda)\big) = \lim_{d,n\to\infty,\ d/n=\gamma}\frac{d+n+1}{n} \overset{(ii)}{=} 1+\gamma.$$
First, equality (ii) follows from Corollary 8 and the fact that $(d+n+1)/n\to1+\gamma$. Second, the "$\ge$" direction of equality (i) is trivial (since we always have "$\inf\lim\ \ge\ \lim\inf$"). Therefore we get the "$\ge$" direction of the overall equality, and it remains to prove the "$\le$" direction.

For the "$\le$" direction, we fix any $\lambda>0$ and bound $\mathrm{AsymMSE}(\widehat w^{tr\text{-}val}_{0,T}(n_1,n_2;\lambda))$ (and consequently its limit as $d,n\to\infty$). We have from Lemma 5 that
$$\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}val}_{0,T}(n_1,n_2;\lambda)\big) = \frac{d}{n_2}\cdot\frac{\mathbb{E}\big(\sum_{i=1}^d\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\big)^2 + (n_2+1)\,\mathbb{E}\sum_{i=1}^d\lambda^4/(\sigma^{(n_1)}_i+\lambda)^4}{\big[\mathbb{E}\sum_{i=1}^d\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\big]^2} \le \frac{d}{n_2}\cdot\frac{d^2+(n_2+1)d}{\big[\mathbb{E}\sum_{i=1}^d\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\big]^2} = \frac{d+n_2+1}{n_2}\cdot\frac{1}{\big[\mathbb{E}\frac{1}{d}\sum_{i=1}^d\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\big]^2},$$
where the inequality uses $\lambda^2/(\sigma^{(n_1)}_i+\lambda)^2\le1$. Observe that
$$\mathbb{E}\Bigg[\frac{1}{d}\sum_{i=1}^d\frac{\lambda^2}{(\sigma^{(n_1)}_i+\lambda)^2}\Bigg] \overset{(i)}{\ge} \mathbb{E}\Bigg[\frac{\lambda^2}{\big(\sum_{i=1}^d\sigma^{(n_1)}_i/d+\lambda\big)^2}\Bigg] \overset{(ii)}{\ge} \frac{\lambda^2}{\big(\mathbb{E}\sum_{i=1}^d\sigma^{(n_1)}_i/d+\lambda\big)^2} \overset{(iii)}{=} \frac{\lambda^2}{(1+\lambda)^2},$$
where (i) follows from the convexity of $t\mapsto\lambda^2/(t+\lambda)^2$ on $t\ge0$; (ii) follows from the same convexity and Jensen's inequality; and (iii) holds since $\mathbb{E}\sum_{i=1}^d\sigma^{(n_1)}_i = \mathbb{E}\,\mathrm{tr}(\frac{1}{n_1}X_t^{train\top}X_t^{train}) = \mathbb{E}\|X_t^{train}\|_{\mathrm{Fr}}^2/n_1 = d$. Applying this in the preceding bound yields
$$\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}val}_{0,T}(n_1,n_2;\lambda)\big) \le \frac{d+n_2+1}{n_2}\cdot\frac{(1+\lambda)^2}{\lambda^2}.$$
Further plugging in $n_1 = ns$ and $n_2 = n(1-s)$ for any $s\in(0,1)$ yields
$$\lim_{d,n\to\infty,\ d/n\to\gamma}\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}val}_{0,T}(ns, n(1-s);\lambda)\big) \le \frac{\gamma+1-s}{1-s}\cdot\frac{(1+\lambda)^2}{\lambda^2}.$$
Finally, the right-hand side approaches its infimum as $\lambda\to\infty$ and $s\to0$, from which we conclude that
$$\inf_{\lambda>0,\ s\in(0,1)}\ \lim_{d,n\to\infty,\ d/n\to\gamma}\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}val}_{0,T}(ns, n(1-s);\lambda)\big) \le 1+\gamma,$$
which is the desired "$\le$" direction.
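Both parts of the proof lend themselves to a direct numerical check. The sketch below (ours, not the paper's) verifies that the certificate $\lambda(\gamma)=\max\{1-\gamma/2,\ \gamma-1/2\}$ achieves the claimed bound on a grid of $\gamma$, and that the bound is tight at $\gamma=1$:

```python
import numpy as np

def f(lam, gamma):
    """Asymptotic MSE ratio of the train-train method (Theorem 9)."""
    D = (lam + 1 + gamma)**2 - 4 * gamma
    return 4 * gamma**2 * ((gamma - 1)**2 + (gamma + 1) * lam) \
        / (D**1.5 * (lam + 1 + gamma - np.sqrt(D))**2)

# The choice lambda(gamma) certifies f <= max(1 + 5g/27, 5/27 + g).
for g in np.linspace(0.05, 3.0, 60):
    lam = max(1 - g / 2, g - 1 / 2)
    assert f(lam, g) <= max(1 + 5 * g / 27, 5 / 27 + g) + 1e-9

# At gamma = 1 the bound is tight: inf over lambda is 32/27, at lambda = 1/2.
grid_min = min(f(l, 1.0) for l in np.linspace(0.01, 20, 20000))
print(f(0.5, 1.0), grid_min, 32 / 27)
```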

C CONNECTIONS TO BAYESIAN ESTIMATOR

Here we discuss the relationship between our train-train meta-learning estimator with ridge regression inner solvers and the Bayesian estimator under a somewhat natural hierarchical generative model for the realizable setting in Section 4. We show that these two estimators are not equal in general, although they have some similarities in their expressions. We consider the following hierarchical probabilistic model:
$$w_{0,\star}\sim N\Big(0, \frac{\sigma_w^2}{d}I_d\Big), \qquad w_t\,|\,w_{0,\star}\overset{iid}{\sim} N\Big(w_{0,\star}, \frac{R^2}{d}I_d\Big), \qquad y_t = X_tw_t + \sigma z_t, \quad\text{where } z_t\overset{iid}{\sim} N(0, I_n).$$
This model is similar to our realizable linear model (6), except that $w_{0,\star}$ has a prior and that there is observation noise in the data (so that data likelihoods and posteriors are well defined). We also note that the $R^2/d$ variance for $w_t$ guarantees that $\mathbb{E}\|w_t-w_{0,\star}\|^2 = R^2$, consistent with our definition (7).

Bayesian estimator. We now derive the Bayesian posterior mean estimator of $w_{0,\star}$, which requires us to compute the posterior distribution of $w_{0,\star}$ given the data $\{(X_t, y_t)\}_{t=1}^T$.


We begin by computing the likelihood of one task by marginalizing over $w_t$:
$$p(X_t, y_t\,|\,w_{0,\star}) \propto \int p(w_t\,|\,w_{0,\star})\,p(y_t\,|\,X_t, w_t)\,dw_t \propto \int \exp\Big(-\frac{\|w_t-w_{0,\star}\|^2}{2R^2/d}\Big)\exp\Big(-\frac{\|y_t-X_tw_t\|^2}{2\sigma^2}\Big)dw_t$$
$$\overset{(i)}{\propto} \exp\Bigg(-\frac{\|w_{0,\star}\|^2}{2R^2/d} + \frac{1}{2}\Big(\frac{w_{0,\star}}{R^2/d}+\frac{X_t^\top y_t}{\sigma^2}\Big)^\top\Big(\frac{X_t^\top X_t}{\sigma^2}+\frac{I_d}{R^2/d}\Big)^{-1}\Big(\frac{w_{0,\star}}{R^2/d}+\frac{X_t^\top y_t}{\sigma^2}\Big)\Bigg)$$
$$\propto \exp\Bigg(-\frac{1}{2}\,w_{0,\star}^\top\Big(X_t^\top X_t+\frac{d\sigma^2}{R^2}I_d\Big)^{-1}\frac{X_t^\top X_t}{R^2/d}\,w_{0,\star} + w_{0,\star}^\top\Big(X_t^\top X_t+\frac{d\sigma^2}{R^2}I_d\Big)^{-1}\frac{X_t^\top y_t}{R^2/d}\Bigg),$$
where (i) is obtained by integrating a multivariate Gaussian density over $w_t$, and "$\propto$" drops all the terms that do not depend on $w_{0,\star}$. Therefore, by Bayes' rule, the overall posterior distribution of $w_{0,\star}$ is given by
$$p\big(w_{0,\star}\,|\,\{(X_t,y_t)\}_{t=1}^T\big) \propto p(w_{0,\star})\cdot\prod_{t=1}^T p(X_t,y_t\,|\,w_{0,\star}) \propto \exp\Big(-\frac{\|w_{0,\star}\|^2}{2\sigma_w^2/d}\Big)\cdot\prod_{t=1}^T\exp\Big(-\frac{1}{2}\,w_{0,\star}^\top\big(X_t^\top X_t+\tfrac{d\sigma^2}{R^2}I_d\big)^{-1}\tfrac{X_t^\top X_t}{R^2/d}\,w_{0,\star} + w_{0,\star}^\top\big(X_t^\top X_t+\tfrac{d\sigma^2}{R^2}I_d\big)^{-1}\tfrac{X_t^\top y_t}{R^2/d}\Big).$$
This means that the posterior distribution of $w_{0,\star}$ is Gaussian, with mean (i.e., the Bayesian estimator) equal to
$$\widehat w^{Bayes}_{0,T} := \mathbb{E}\big[w_{0,\star}\,|\,\{(X_t,y_t)\}_{t=1}^T\big] = \big(A^{Bayes}_T\big)^{-1}c^{Bayes}_T,$$
where
$$A^{Bayes}_T := \frac{d}{\sigma_w^2}I_d + \sum_{t=1}^T\Big(X_t^\top X_t+\frac{d\sigma^2}{R^2}I_d\Big)^{-1}\frac{X_t^\top X_t}{R^2/d}, \qquad c^{Bayes}_T := \sum_{t=1}^T\Big(X_t^\top X_t+\frac{d\sigma^2}{R^2}I_d\Big)^{-1}\frac{X_t^\top y_t}{R^2/d}.$$
We note that $\widehat w^{Bayes}_{0,T}$ has a similar form to our train-train estimator, but is not exactly the same. Indeed, recall the closed form of our train-train estimator (cf. (10)):
$$\widehat w^{tr\text{-}tr}_{0,T} = \big(A^{tr\text{-}tr}_T\big)^{-1}c^{tr\text{-}tr}_T, \qquad A^{tr\text{-}tr}_T = \sum_{t=1}^T\big(X_t^\top X_t+n\lambda I_d\big)^{-2}X_t^\top X_t, \qquad c^{tr\text{-}tr}_T = \sum_{t=1}^T\big(X_t^\top X_t+n\lambda I_d\big)^{-2}X_t^\top y_t.$$
As $\widehat w^{Bayes}_{0,T}$ uses the inverse while $\widehat w^{tr\text{-}tr}_{0,T}$ uses the squared inverse, these two estimators are not the same in general, no matter how we tune the $\lambda$ in the train-train estimator. This is true even if we set $\sigma_w = \infty$, so that the prior on $w_{0,\star}$ becomes flat (and the Bayesian estimator reduces to the MLE).
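To make the structural difference (inverse vs. squared inverse) concrete, here is a small simulation of ours comparing the two closed-form estimators on data drawn from the hierarchical model; the sizes and the values of $\sigma$, $\sigma_w$, $\lambda$ are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T = 5, 20, 200
R2, sig2, sigw2, lam = 1.0, 0.1, 10.0, 1.0

w0 = rng.standard_normal(d) * np.sqrt(sigw2 / d)   # w_{0,*} from its prior
A_bayes = (d / sigw2) * np.eye(d); c_bayes = np.zeros(d)
A_trtr = np.zeros((d, d)); c_trtr = np.zeros(d)

for _ in range(T):
    X = rng.standard_normal((n, d))
    wt = w0 + rng.standard_normal(d) * np.sqrt(R2 / d)
    y = X @ wt + np.sqrt(sig2) * rng.standard_normal(n)
    G = X.T @ X
    M1 = np.linalg.inv(G + (d * sig2 / R2) * np.eye(d)) * (d / R2)  # single inverse
    Minv = np.linalg.inv(G + n * lam * np.eye(d))
    M2 = Minv @ Minv                                                # squared inverse
    A_bayes += M1 @ G; c_bayes += M1 @ (X.T @ y)
    A_trtr += M2 @ G; c_trtr += M2 @ (X.T @ y)

w_bayes = np.linalg.solve(A_bayes, c_bayes)
w_trtr = np.linalg.solve(A_trtr, c_trtr)
print(np.linalg.norm(w_bayes - w0), np.linalg.norm(w_trtr - w0))
print(np.linalg.norm(w_bayes - w_trtr))   # nonzero: the two estimators differ
```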

D DETAILS ON THE FEW-SHOT IMAGE CLASSIFICATION EXPERIMENT

Here we provide additional details of the few-shot image classification experiment in Section 5.2.

Optimization and architecture. For both methods, we run a few gradient steps on the inner argmin problem to obtain (an approximation of) $\widehat w_t$, and plug $\widehat w_t$ into $\nabla_{w_0}\ell^{\{tr\text{-}val,\ tr\text{-}tr\}}_t(w_0)$ (which involves $\widehat w_t$ through implicit differentiation) to optimize $w_0$ in the outer loop. For both the train-train and train-val methods, we use the standard 4-layer convolutional network of (Finn et al., 2017; Zhou et al., 2019) as the backbone (i.e., the architecture for $\widehat w_t$). We further tune their hyper-parameters, such as the regularization constant $\lambda$, the learning rate (initial value and decay strategy), and the gradient clipping threshold.

Datasets. We experiment on miniImageNet (Ravi & Larochelle, 2017) and tieredImageNet (Ren et al., 2018). MiniImageNet consists of 100 classes of images from ImageNet (Krizhevsky et al., 2012), each with 600 images of resolution 84 × 84 × 3. We use 64 classes for training, 16 classes for validation, and the remaining 20 classes for testing (Ravi & Larochelle, 2017). TieredImageNet consists of 608 classes from the ILSVRC-12 dataset (Russakovsky et al., 2015), with images also of resolution 84 × 84 × 3. TieredImageNet groups classes into broader categories corresponding to higher-level nodes in the ImageNet hierarchy. Specifically, its top hierarchy has 20 training categories (351 classes), 6 validation categories (97 classes), and 8 test categories (160 classes). This structure ensures that all training classes are distinct from the testing classes, providing a more realistic few-shot learning scenario.

D.1 EFFECT OF THE SPLIT RATIO FOR THE TRAIN-VAL METHOD

We further tune the split $(n_1, n_2)$ in the train-val method and report the results in Table 2. As can be seen, as the number of validation samples $n_2$ increases, the classification accuracy on both the miniImageNet and tieredImageNet datasets increases. This corroborates our theoretical finding in Corollary 8. However, note that even with the best split $(n_1, n_2) = (5, 25)$ (comparing again with Table 1), the train-val method still performs worse than the train-train method. We remark that our theoretical results on train-train outperforming train-val (in Section 4) rely on the assumption that the data can be exactly realized by the representation and contain no label noise. Our experimental results here may thus suggest that the miniImageNet and tieredImageNet few-shot tasks have a similar structure (there exists a NN representation that almost perfectly realizes the labels with no noise) that allows the train-train method to perform better than the train-val method.

E EFFECT OF CROSS-VALIDATION FOR THE TRAIN-VAL METHOD

We test the effect of using cross-validation for the train-val method on the same synthetic data (realizable linear centroid meta-learning) as in Section 5.1.

Method

We fix the number of per-task samples $n = 20$, and use 4-fold cross-validation in the following two settings: $(n_1, n_2) = (5, 15)$ and $(n_1, n_2) = (15, 5)$. In both cases, we partition the data into 4 folds of 5 data points each, and rotate over the 4 possible assignments of which folds serve as train and which as validation. The estimate $\widehat w^{cv}_0$ is obtained by minimizing the train-val loss averaged over the 4 rotations:
$$\widehat w^{cv}_0 := \arg\min_{w_0}\ \sum_{t=1}^T\frac{1}{4}\sum_{j=1}^4 \ell^{tr\text{-}val}_t\big(w_0;\ X^{train,j}_t, y^{train,j}_t,\ X^{val,j}_t, y^{val,j}_t\big),$$
We vary the ratio $d/n$ from 0 to 3 ($n = 20$ and $T = 1000$ are fixed), and tune the regularization coefficient of the cross-validation method to $\lambda = 0.5$.

Result. As shown in Figure 2, for both $(n_1, n_2) = (15, 5)$ and $(n_1, n_2) = (5, 15)$, using cross-validation consistently improves over the vanilla train-val method. This demonstrates the variance-reduction effect of cross-validation. Note that the best performance among the cross-validation methods is still achieved at $n_1 = 5$, as for the vanilla train-val method. However, numerically, the best cross-validation performance is still not as good as the train-train method.
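For concreteness, here is a minimal sketch (our reconstruction, not the paper's code) of the cross-validated train-val objective for a single task in the $(n_1, n_2) = (15, 5)$ setting, where each rotation trains on three folds and validates on the held-out fold; the $(5, 15)$ variant simply swaps the roles. On noiseless realizable data, the loss vanishes at the task vector itself:

```python
import numpy as np

d, n, k, lam = 8, 20, 4, 0.5
folds = np.array_split(np.arange(n), k)          # four folds of five samples

def inner_ridge(w0, X, y):
    """A_lambda(w0; X, y): ridge regression biased toward the prior w0."""
    m = X.shape[0]
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d),
                           X.T @ y / m + lam * w0)

def cv_loss(w0, X, y):
    """Average train-val loss over the k rotations of the partition."""
    total = 0.0
    for j in range(k):
        val = folds[j]
        trn = np.setdiff1d(np.arange(n), val)    # the other three folds
        wt = inner_ridge(w0, X[trn], y[trn])
        total += np.mean((y[val] - X[val] @ wt) ** 2)
    return total / k

rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
wt_true = rng.standard_normal(d)
y = X @ wt_true                                  # noiseless realizable task
print(cv_loss(wt_true, X, y), cv_loss(wt_true + 1.0, X, y))
```

In the full method this loss is summed over all $T$ tasks and minimized over $w_0$.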
Leave-one-out cross-validation. Figure 3 (left) further tests an increased per-task sample size $n = 40$, and incorporates leave-one-out cross-validation into the train-val method, i.e., $(n_1, n_2) = (39, 1)$ and $(n_1, n_2) = (1, 39)$. We repeat the experiment 10 times to plot error bars (shaded areas). We see that the train-train method still outperforms the train-val method with leave-one-out validation. We further increase the per-task sample size to $n = 200$, and test the leave-one-out method with the split $(n_1, n_2) = (1, 199)$. We adopt a matrix-inverse trick to mitigate the computational overhead of computing $\mathcal{A}_\lambda(w_0; X^{train,j}_t, y^{train,j}_t)$. To ease the computation, we also vary $d$ from 0 to 400 on a coarse grid (with an increment of 80). From Figure 3 (right), we see that the leave-one-out method can slightly beat the train-train method for some $d/n$ values. Compared with the $n = 20$ and $n = 40$ experiments, this is the first case in which the leave-one-out method outperforms the train-train method. We suspect that the per-task sample size $n$ plays a vital role in the power of the leave-one-out method: a large $n$ tends to yield a strong variance-reduction effect, improving performance. Yet using the leave-one-out method with a large $n$ incurs a high computational burden.

F NONASYMPTOTIC ANALYSIS UNDER GENERAL SCALINGS OF d, n, T

We sketch a nonasymptotic analysis of the train-train method in the realizable linear model of Section 4. We assume in addition that $w_t\overset{iid}{\sim} N(w_{0,\star}, \frac{R^2}{d}I_d)$. We begin by recalling from Appendix B.1 and B.2 that the train-train and train-val estimators have the closed forms
$$\widehat w^{tr\text{-}tr}_{0,T} = \Big(\sum_{t=1}^T A_t\Big)^{-1}\sum_{t=1}^T A_tw_t, \qquad \widehat w^{tr\text{-}val}_{0,T} = \Big(\sum_{t=1}^T B_t\Big)^{-1}\sum_{t=1}^T B_tw_t, \quad (44)$$
where
$$A_t := \lambda^2\Big(\frac{X_t^\top X_t}{n}+\lambda I_d\Big)^{-2}\frac{X_t^\top X_t}{n}, \qquad B_t := \lambda^2\Big(\frac{X_t^{train\top}X_t^{train}}{n_1}+\lambda I_d\Big)^{-1}\frac{X_t^{val\top}X_t^{val}}{n_2}\Big(\frac{X_t^{train\top}X_t^{train}}{n_1}+\lambda I_d\Big)^{-1}.$$

Nonasymptotic result for $\widehat w^{tr\text{-}tr}_{0,T}$. For simplicity, we restrict attention to the analysis of the train-train estimator $\widehat w^{tr\text{-}tr}_{0,T}$, whose closed-form expression is given in (44); the train-val estimator can be analyzed in a similar fashion. We sketch a proof of the following

Result: Let $T = \Omega(d)$, and suppose $(d,n)$ are such that
$$f^{tr\text{-}tr}(d,n) := \frac{\mathbb{E}\big[\frac{1}{d}\sum_{i=1}^d(\sigma^{(n)}_i)^2/(\sigma^{(n)}_i+\lambda)^4\big]}{\big[\frac{1}{d}\,\mathbb{E}\sum_{i=1}^d\sigma^{(n)}_i/(\sigma^{(n)}_i+\lambda)^2\big]^2} = \Theta(1)$$
(which for example holds if $(d,n)$ are in the proportional limit). Then with high probability,
$$\mathrm{MSE}\big(\widehat w^{tr\text{-}tr}_{0,T}\big) = \frac{R^2}{d}\Big[\Big(1\pm O\big(\tfrac{1}{\sqrt d}\big)\Big)\cdot\frac{1}{T}\cdot d\cdot f^{tr\text{-}tr}(d,n)\ \pm\ O\Big(\frac{d\log(T/\delta)}{T}\Big)\Big] \approx \frac{R^2}{T}\,f^{tr\text{-}tr}(d,n) = \frac{1}{T}\,\mathrm{AsymMSE}\big(\widehat w^{tr\text{-}tr}_{0,T}\big),$$
that is, the MSE of the train-train method concentrates around the asymptotic MSE.

Proof sketch. We first define the matrix
$$\Sigma_T := \Big(\sum_{t=1}^T A_t\Big)^{-2}\sum_{t=1}^T A_t^2 = \frac{1}{T}\cdot\Big(\sum_{t=1}^T\frac{A_t}{T}\Big)^{-2}\Big(\sum_{t=1}^T\frac{A_t^2}{T}\Big),$$
which will be key to our analysis. Observe that conditioned on the $A_t$ (and only looking at the randomness of the $w_t$), we have
$$\widehat w^{tr\text{-}tr}_{0,T} - w_{0,\star} = \Big(\sum_{t=1}^T A_t\Big)^{-1}\sum_{t=1}^T A_t(w_t-w_{0,\star}) \sim N\Big(0, \frac{R^2}{d}\Sigma_T\Big).$$
Therefore, applying the Hanson-Wright inequality (Vershynin, 2018, Theorem 6.2.1) yields that
$$\mathrm{MSE}\big(\widehat w^{tr\text{-}tr}_{0,T}\big) = \big\|\widehat w^{tr\text{-}tr}_{0,T}-w_{0,\star}\big\|^2 \in \frac{R^2}{d}\Big[\mathrm{tr}(\Sigma_T)\ \pm\ \max\Big\{\|\Sigma_T\|_{\mathrm{Fr}}\sqrt{\log\tfrac{1}{\delta}},\ \|\Sigma_T\|_{\mathrm{op}}\log\tfrac{1}{\delta}\Big\}\Big]$$
with probability at least $1-\delta$. In the following, we argue that $\mathrm{tr}(\Sigma_T)$ is the dominating term in the above bound and concentrates around the asymptotic MSE in Lemma 5, whereas the error terms inside the max are lower-order compared with $\mathrm{tr}(\Sigma_T)$.

Concentration of $\mathrm{tr}(\Sigma_T)$. Recall that the $A_t$ are i.i.d. PSD matrices in $\mathbb{R}^{d\times d}$.
We have
$$\mathrm{tr}(\Sigma_T) = \frac{1}{T}\Big\langle\Big(\sum_t\frac{A_t}{T}\Big)^{-2},\ \sum_t\frac{A_t^2}{T}\Big\rangle = \frac{1}{T}\Big[\underbrace{\big\langle\mathbb{E}[A_1]^{-2},\ \mathbb{E}[A_1^2]\big\rangle}_{\mathrm{I}} + \underbrace{\Big\langle\Big(\sum_t\tfrac{A_t}{T}\Big)^{-2}-\mathbb{E}[A_1]^{-2},\ \mathbb{E}[A_1^2]\Big\rangle}_{\mathrm{II}} + \underbrace{\Big\langle\Big(\sum_t\tfrac{A_t}{T}\Big)^{-2},\ \sum_t\tfrac{A_t^2}{T}-\mathbb{E}[A_1^2]\Big\rangle}_{\mathrm{III}}\Big].$$
We argue that I is the main term: it depends on $(d,n)$ and is independent of $T$. Further, $A_1$ has eigenvalues $\lambda^2\sigma_i/(\lambda+\sigma_i)^2$, where $\{\sigma_i\}_{i=1}^d$ are the eigenvalues of $X_1^\top X_1/n$, and $A_1$ has uniformly distributed eigenvectors. Following the same analysis as in Lemma 5, we see that
$$\mathrm{I} = d\cdot\underbrace{\frac{\mathbb{E}\big[\frac{1}{d}\sum_{i=1}^d(\sigma^{(n)}_i)^2/(\sigma^{(n)}_i+\lambda)^4\big]}{\big[\frac{1}{d}\,\mathbb{E}\sum_{i=1}^d\sigma^{(n)}_i/(\sigma^{(n)}_i+\lambda)^2\big]^2}}_{=\,f^{tr\text{-}tr}(d,n)}.$$
This is exactly the quantity that appears in the asymptotic MSE of the train-train method. In the following, we assume $(d,n)$ are such that $f^{tr\text{-}tr}(d,n) = \Theta(1)$, so that term I is $O(d)$.

We further argue that terms II and III are lower-order compared with term I. Applying matrix concentration inequalities (e.g., the matrix Bernstein inequality), we get that terms II and III are of order $\max\big\{d\sqrt{\log(T/\delta)/T},\ d\log(T/\delta)/T\big\}$ with probability at least $1-\delta$. Combining the terms yields that
$$\mathrm{tr}(\Sigma_T) \in \frac{1}{T}\Big[d\cdot f^{tr\text{-}tr}(d,n)\ \pm\ O\big(d\sqrt{\log(T/\delta)/T}\big)\Big]$$
with high probability.

Controlling the error terms in the MSE. We now control $\|\Sigma_T\|_{\mathrm{Fr}}$ and (consequently) $\|\Sigma_T\|_{\mathrm{op}}$. We wish to show that $\|\Sigma_T\|_{\mathrm{Fr}} \le O\big(\frac{1}{\sqrt d}\big)\cdot\mathrm{tr}(\Sigma_T)$ with high probability. This follows from the combination of two results: (1) $\mathrm{tr}(\Sigma_T) = \Omega(d/T)$ by the preceding part, and (2)
$$\|\Sigma_T\|_{\mathrm{Fr}} \le \frac{1}{T}\cdot\frac{1}{\lambda_{\min}\big(\sum_{t\le T}A_t/T\big)^2}\,\Big\|\sum_{t\le T}\frac{A_t^2}{T}\Big\|_{\mathrm{Fr}} \le O\big(\sqrt d/T\big).$$
The latter requires (a) $\|A_t^2\|_{\mathrm{Fr}} \le O(\sqrt d)$, which is true since $0\preceq A_t\preceq\lambda I_d$ and thus $\|A_t^2\|_{\mathrm{Fr}}\le\lambda^2\sqrt d$, and (b) $\lambda_{\min}\big(\sum_{t\le T}A_t/T\big) = \Omega(1)$ with high probability, which holds whenever $T = \Omega(d)$ by matrix concentration and the fact that $\lambda_{\min}(\mathbb{E}[A_1]) = \Omega(1)$.

Putting it together, we have with high probability that
$$\mathrm{MSE}\big(\widehat w^{tr\text{-}tr}_{0,T}\big) = \frac{R^2}{d}\Big[\Big(1\pm O\big(\tfrac{1}{\sqrt d}\big)\Big)\cdot\frac{1}{T}\cdot d\cdot f^{tr\text{-}tr}(d,n)\ \pm\ O\Big(\frac{d\log(T/\delta)}{T}\Big)\Big],$$
as long as $T = \Omega(d)$ and $(d,n)$ are such that $f^{tr\text{-}tr}(d,n) = \Theta(1)$.
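The $1/T$ behavior claimed above can be probed by simulation, using the closed form of the train-train estimator in the noiseless realizable model. A rough sketch with arbitrary sizes ($d$, $n$, $\lambda$ of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, R2, T = 20, 40, 1.0, 1.0, 2000
w0 = rng.standard_normal(d)
w0 /= np.linalg.norm(w0)

# Accumulate A_T^{tr-tr} and c_T^{tr-tr} over T tasks (closed form (10)).
A = np.zeros((d, d)); c = np.zeros(d)
for _ in range(T):
    X = rng.standard_normal((n, d))
    wt = w0 + rng.standard_normal(d) * np.sqrt(R2 / d)   # E||wt - w0||^2 = R^2
    y = X @ wt                                           # noiseless realizable task
    M = np.linalg.inv(X.T @ X + n * lam * np.eye(d))
    M2 = M @ M
    A += M2 @ (X.T @ X); c += M2 @ (X.T @ y)

mse = np.linalg.norm(np.linalg.solve(A, c) - w0) ** 2
print(mse)   # roughly AsymMSE / T, i.e. O(1/T)
```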



Footnotes (collected):
- ... with a small step-size, or gradient flow.
- These definitions assume that the expectation exists for finite $T$; the more general definition can be found in Appendix A.1.
- The same conclusion also holds for the asymptotic excess risk, as the Hessian of the excess risk is a rescaled identity; see Appendix B.2.
- Hereafter we treat $X_t$ as fixed, as the density of $X_t$ does not affect the Bayesian calculation.
- Any density $p(w)\propto\exp(-w^\top Aw/2 + w^\top c)$ specifies a Gaussian distribution $N(\mu,\Sigma)$, where $A = \Sigma^{-1}$ and $c = \Sigma^{-1}\mu$, so that $\mu = A^{-1}c$.
- This Gaussian assumption on $w_t$ is mainly for technical convenience, and can be relaxed to $w_{t,i}-w_{0,\star,i}$ being i.i.d. with variance $R^2/d$ and sub-Gaussian with parameter $O(R^2/d)$, without changing the proof.



Let $L^{\{tr\text{-}val,\ tr\text{-}tr\}}(w_0) := \mathbb{E}_{p_t\sim\Pi,\ (X_t,y_t)\sim p_t}\big[\ell^{\{tr\text{-}val,\ tr\text{-}tr\}}_t(w_0)\big]$ be the population risks.

(Meta-)Test time. The meta-test time performance of any meta-learning algorithm is a joint function of the (learned) centroid $w_0$ and the inner algorithm Alg. Upon receiving a new task $p_{T+1}\sim\Pi$ and training data $(X_{T+1}, y_{T+1})\in\mathbb{R}^{n\times d}\times\mathbb{R}^n$, we run the inner loop Alg with prior $w_0$ on the training data, and evaluate it on an (unseen) test example $(x_\star, y_\star)\sim p_{T+1}$:

[Figure 1 plots; panel (b): $\ell_2$ error of $\widehat w^{\{tr\text{-}tr,\ tr\text{-}val\}}_{0,T}$ vs. $T$.]
Figure 1: Panel (a) presents the optimal $\mathrm{AsymMSE}(\widehat w^{tr\text{-}tr}_{0,T})$ (blue) in Theorem 9 via grid search, and the optimal $\mathrm{AsymMSE}(\widehat w^{tr\text{-}val}_{0,T})$ in Corollary 8 with $n_1 = 0$ (orange) and $n_1 = 5$ (green), as well as the upper bound on $\mathrm{AsymMSE}(\widehat w^{tr\text{-}tr}_{0,T})$ (magenta) in Corollary 4. The optimal $\mathrm{AsymMSE}(\widehat w^{\{tr\text{-}tr,\ tr\text{-}val\}}_{0,T})$ are used as


where the superscript $j$ denotes the index of the cross-validation fold. The performance is depicted in Figure 2.


Figure 2: The scaled (by $T$) $\ell_2$-error of $\widehat w^{\{tr\text{-}tr,\ tr\text{-}val,\ cv\}}_{0,T}$ as the ratio $d/n$ varies ($n = 20$ and $T = 1000$ are fixed).

Figure 3: The scaled (by $T$) $\ell_2$-error of $\widehat w^{\{tr\text{-}tr,\ cv\}}_{0,T}$ as the ratio $d/n$ varies from 0 to 3 ($n\in\{40, 200\}$ and $T = 1000$ are fixed). For the cross-validation method, the regularization coefficient is $\lambda = 0.5$. Left: $n = 40$; leave-one-out CV performs worse than the train-train method. Right: $n = 200$; leave-one-out CV appears better than the train-train method for $d/n\in\{1.2, 1.6\}$.



Figure 4: The scaled (by $T$) $\ell_2$-error of $\widehat w^{\{tr\text{-}tr,\ tr\text{-}val\}}_{0,T}$ as the ratio $d/n$ varies from 0 to 3 in the general scaling setting ($n = 100$ and $T = 300$ are fixed). The regularization coefficient $\lambda$ is fine-tuned for the train-train method and set to $\lambda = 2000$ for the train-val method.

Table 1: Few-shot classification accuracy (%) on the miniImageNet and tieredImageNet datasets.

Under review as a conference paper at ICLR 2021

Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.

Haoxiang Wang, Ruoyu Sun, and Bo Li. Global convergence and induced kernels of gradient-based meta-learning with neural nets. arXiv preprint arXiv:2006.14606, 2020a.

In particular, we know $0 \preceq I_d - (X_t^\top X_t + n\lambda I_d)^{-1}X_t^\top X_t \preceq I_d$ and $\mathbb{E}\big[\|X_t^\top X_t\|_{\mathrm{op}}\big] < \infty$ since $X_t$ is Gaussian. As a consequence, for the no-split method,

Table 2: Investigation of the effect of the training/validation split ratio in the train-val method (iMAML) on the few-shot classification accuracy (%) on miniImageNet and tieredImageNet.

