GUARANTEES FOR TUNING THE STEP SIZE USING A LEARNING-TO-LEARN APPROACH

Anonymous

Abstract

Learning-to-learn, i.e., using optimization algorithms to learn a new optimizer, has successfully trained efficient optimizers in practice. This approach relies on meta-gradient descent on a meta-objective based on the trajectory that the optimizer generates. However, there are few theoretical guarantees on how to avoid meta-gradient explosion/vanishing, or on how to train an optimizer with good generalization performance. In this paper we study the learning-to-learn approach on a simple problem of tuning the step size for quadratic loss. Our results show that although there is a way to design the meta-objective so that the meta-gradient remains polynomially bounded, computing the meta-gradient directly using backpropagation leads to numerical issues that look similar to gradient explosion/vanishing. We also characterize when it is necessary to compute the meta-objective on a separate validation set instead of the original training set. Finally, we verify our results empirically and show that similar phenomena arise even for more complicated learned optimizers parametrized by neural networks.

1. INTRODUCTION

Choosing the right optimization algorithm and hyper-parameters is important for training a deep neural network. Recently, a series of works (e.g., Andrychowicz et al. (2016); Wichrowska et al. (2017)) proposed to use learning algorithms to find a better optimizer. These papers use a learning-to-learn approach: they design a class of possible optimizers (often parametrized by a neural network), and then optimize the parameters of the optimizer (later referred to as meta-parameters) to achieve better performance. We refer to the optimization of the optimizer as the meta optimization problem, and the application of the learned optimizer as the inner optimization problem. The learning-to-learn approach solves the meta optimization problem by defining a meta-objective function based on the trajectory that the inner optimizer generates, and then using backpropagation to compute the meta-gradient (Franceschi et al., 2017).

Although the learning-to-learn approach has shown empirical success, there are very few theoretical guarantees for learned optimizers. In particular, since the optimization of meta-parameters is usually a nonconvex problem, does it have bad local optimal solutions? Current ways of optimizing meta-parameters rely on unrolling the trajectory of the inner optimizer, which is very expensive and often leads to exploding/vanishing gradient problems. Is there a way to alleviate these problems? Can we design the meta-objective in a provable way so that the inner optimizers achieve good generalization performance?

In this paper we answer some of these questions in a simple setting, where the learning-to-learn approach is used to tune the step size of the standard gradient descent/stochastic gradient descent algorithm. We will see that even in this simple setting many of the challenges remain, and that we can get better learned optimizers by choosing the right meta-objective function.
Though our results are proved only in this simple setting, we empirically verify them for more complicated learned optimizers with neural network parametrizations.

Theorem 1 (Informal). For tuning the step size of gradient descent on a quadratic objective, if the meta-objective is the loss of the last iteration, then the meta-gradient can explode/vanish. If the meta-objective is the log of the loss of the last iteration, then the meta-gradient is polynomially bounded. Further, meta-gradient descent with a meta step size of 1/√k (where k is the number of meta-gradient steps) provably converges to the optimal step size for the inner optimizer.

Surprisingly, even though taking the log of the objective solves the gradient explosion/vanishing problem, one cannot simply implement such an algorithm using auto-differentiation tools such as those in TensorFlow (Abadi et al., 2016). The reason is that even though the meta-gradient is polynomially bounded, if we compute it using the standard back-propagation algorithm, the meta-gradient is the ratio of two exponentially large/small numbers, which causes numerical issues. Detailed discussion of the first result appears in Section 3 (Theorem 3 and Theorem 4).

The generalization performance of the learned optimizer is another challenge. If one just optimizes the performance of the learned optimizer on the training set (we refer to this as the train-by-train approach), then the learned optimizer might overfit. Metz et al. (2019) proposed to use a train-by-validation approach instead, where the meta-objective is defined to be the performance of the learned optimizer on a separate validation set. Our second result considers a simple least-squares setting where y = ⟨w*, x⟩ + ξ and ξ ∼ N(0, σ^2).
We show that when the number of samples is small and the noise is large, it is important to use train-by-validation, while when the number of samples is much larger, train-by-train can also learn a good optimizer.

Theorem 2 (Informal). For a simple least-squares problem in d dimensions, if the number of samples n is a constant fraction of d (e.g., d/2) and the samples have large noise, then the train-by-train approach performs much worse than train-by-validation. On the other hand, when the number of samples n is large, train-by-train can get close to error dσ^2/n, which is optimal.

We discuss the details in Section 4 (Theorem 5 and Theorem 6). In Section 5 we show that these observations also hold empirically for a more complicated learned optimizer parametrized by a neural network.

1.2. RELATED WORK

Learning-to-learn for supervised learning Hochreiter et al. (2001) introduced the application of gradient descent to meta-learning. The idea of using a neural network to parametrize an optimizer started with Andrychowicz et al. (2016), who used an LSTM to directly learn the update rule. Before that, the idea of using optimization to tune parameters for optimizers also appeared in Maclaurin et al. (2015). Later, Li & Malik (2016); Bello et al. (2017) applied techniques from reinforcement learning to learn an optimizer. Wichrowska et al. (2017) used a hierarchical RNN as the optimizer. Metz et al. (2019) adopted a small MLP as the optimizer and used dynamic weighting of two gradient estimators to stabilize and speed up the meta-training process.

Learning-to-learn in other settings Ravi & Larochelle (2016) used an LSTM as a meta-learner to learn the update rule for training neural networks in the few-shot learning setting, Wang et al. (2016) learned an RL algorithm by another meta-learning RL algorithm, and Duan et al. (2016) learned a general-purpose RNN that can adapt to different RL tasks.

Gradient-based meta-learning Finn et al. (2017) proposed Model-Agnostic Meta-Learning (MAML), which parametrizes the update rule for network parameters and learns a shared initialization for the optimizer using tasks sampled from some distribution. Subsequent works generalized or improved MAML, e.g., Rusu et al. (2018) learned a low-dimensional latent representation for gradient-based meta-learning, and Li et al. (2017) enabled the concurrent learning of learning rate and update direction. Chen et al. (2020) studied a model with an optimization solver stacked on another neural component. They computed the Rademacher complexity of the model, but did not give any optimization guarantee or study train-by-train versus train-by-validation.
Learning-assisted algorithm design Similar ideas can also be extended to develop a meta-algorithm that selects an algorithm from a family of parametrized algorithms. Gupta & Roughgarden (2017) first modeled the algorithm-selection process as a statistical learning problem and bounded the number of tasks it takes to tune a step size for gradient descent. However, they did not consider the meta-optimization problem. Based on Gupta & Roughgarden (2017), meta-algorithms have been developed and analyzed for many problems (Balcan et al., 2016; 2018a; c; b; Denevi et al., 2018; Alabi et al., 2019; Denevi et al., 2019).

Tuning step size/step size schedule for SGD Shamir & Zhang (2013) showed that SGD with a polynomial step size schedule can almost match the minimax rate in convex non-smooth settings, which was later tightened by Harvey et al. (2018) for the standard step size schedule. Assuming that the horizon T is known to the algorithm, the information-theoretically optimal bound in the convex non-smooth setting was later achieved by Jain et al. (2019) using another step size schedule, and Ge et al. (2019) showed that an exponentially decaying step size schedule can achieve a near-optimal rate for least-squares regression. There is also a line of work investigating methods that adapt a vector of step sizes (Sutton, 1992; Schraudolph, 1999; Kearney et al., 2018; Günther et al., 2019; Jacobsen et al., 2019).

2. PRELIMINARIES

In this section, we first introduce some notations, then formulate the learning-to-learn framework.

2.1. NOTATIONS

For any integer n, we use [n] to denote {1, 2, ..., n}. We use ∥·∥ to denote the ℓ2 norm of a vector and the spectral norm of a matrix. We use ⟨·, ·⟩ to denote the inner product of two vectors. For a symmetric matrix A ∈ R^{d×d}, we denote its eigenvalues as λ_1(A) ≥ ... ≥ λ_d(A). We denote the d-dimensional identity matrix as I_d, or simply as I when the dimension is clear from the context. We use O(·), Ω(·), Θ(·) to hide constant factor dependencies. We use poly(·) to represent a polynomial of constant degree in the relevant parameters. We say an event happens with high probability if it happens with probability 1 - c for a small constant c.

2.2. LEARNING-TO-LEARN FRAMEWORK

We consider the learning-to-learn approach applied to a distribution of learning tasks. Each task is specified by a tuple (D, S_train, S_valid, ℓ). Here D is a distribution over samples in X × Y, where X is the domain of the sample and Y is the domain of the label/value. The sets S_train and S_valid are samples generated independently from D, which serve as the training and validation set (the validation set is optional). The learning task seeks a parameter w ∈ W that minimizes the loss function ℓ(w, x, y) : W × X × Y → R, which gives the loss of the parameter w on sample (x, y). The training loss for this task is f̂(w) := (1/|S_train|) Σ_{(x,y)∈S_train} ℓ(w, x, y), while the population loss is f(w) := E_{(x,y)∼D}[ℓ(w, x, y)]. The goal of the inner optimization is to minimize the population loss f(w).

We view the learned optimizer as an update rule u(·) on the weight w. The update rule is a parametrized function that maps the weight at step τ and its history to the weight at step τ+1: w_{τ+1} = u(w_τ, ∇f̂(w_τ), ∇f̂(w_{τ-1}), ...; θ). In most parts of this paper, we consider the update rule u to be the gradient descent mapping with the step size as the trainable parameter (here θ = η, the step size for gradient descent). That is, u_η(w) = w - η∇f̂(w) for gradient descent, and u_η(w) = w - η∇_w ℓ(w, x, y) for stochastic gradient descent, where (x, y) is a sample chosen uniformly at random from the training set S_train.

At the outer (meta) level, we consider a distribution T of tasks. For each task P ∼ T, we define a meta-loss function ∆(θ, P) that measures the performance of the optimizer on this learning task. The meta-objective can, for example, be chosen as the training loss f̂ at the last iteration (train-by-train), or the loss on the validation set (train-by-validation).
The training loss at the meta level is the average of the meta-loss across m specific tasks P_1, P_2, ..., P_m, that is, F̂(θ) = (1/m) Σ_{i=1}^m ∆(θ, P_i). The population loss at the meta level is the expectation over all possible tasks, F(θ) = E_{P∼T}[∆(θ, P)]. In order to train an optimizer by gradient descent, we need to compute the gradient of the meta-objective F̂ with respect to the meta-parameters θ. The meta-parameter is updated once after applying the optimizer to the inner objective t times to generate the trajectory w_0, w_1, ..., w_t. The meta-gradient is then computed by unrolling the optimization process and back-propagating through the t applications of the optimizer. As we will see later, this unrolling procedure is costly and can introduce meta-gradient explosion/vanishing problems.
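The unrolled meta-gradient computation described above can be made concrete for the simplest case in this paper, a single step-size parameter on a quadratic inner loss. The following is a minimal numpy sketch of our own (the function names and the small test problem are not from the paper): it runs t inner GD steps forward, then back-propagates through the t applications of the update rule to get dF/dη.

```python
import numpy as np

def meta_grad_unrolled(eta, H, w0, t):
    """dF/d(eta) for F(eta) = f(w_t), f(w) = 0.5 * w^T H w, where w_t is produced
    by t steps of gradient descent with step size eta starting from w0.
    Computed by unrolling the trajectory and back-propagating through it."""
    traj = [w0]
    w = w0
    for _ in range(t):                     # forward pass: w_{tau+1} = (I - eta*H) w_tau
        w = w - eta * (H @ w)
        traj.append(w)
    g = H @ traj[-1]                       # dF/dw_t
    d_eta = 0.0
    for tau in range(t - 1, -1, -1):       # backward pass through the t unrolled steps
        d_eta += g @ (-(H @ traj[tau]))    # direct dependence of step tau+1 on eta
        g = g - eta * (H @ g)              # chain through (I - eta*H)^T (H symmetric)
    return d_eta

H = np.diag([2.0, 0.5])                    # toy spectrum: L = 2, alpha = 0.5
w0 = np.array([0.8, 0.6])                  # unit-norm starting point
t, eta = 10, 0.3

g_bp = meta_grad_unrolled(eta, H, w0, t)

# sanity check against a central finite difference of F(eta)
def meta_loss(e):
    wt = np.linalg.matrix_power(np.eye(2) - e * H, t) @ w0
    return 0.5 * wt @ H @ wt

g_fd = (meta_loss(eta + 1e-6) - meta_loss(eta - 1e-6)) / 2e-6
```

For small t this matches finite differences; the backward pass, however, multiplies by (I - ηH) a total of t times, which is exactly why the intermediate quantities become exponential in t, as analyzed in Section 3.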

3. ALLEVIATING GRADIENT EXPLOSION/VANISHING PROBLEMS

First we consider the meta-gradient explosion/vanishing problem. More precisely, we say the meta-gradient explodes/vanishes if it is exponentially large/small in the number of steps t of the inner optimizer. In this section, we consider a very simple instance of the learning-to-learn approach, where the distribution T contains a single task P, and the task defines a single loss function f. Therefore, in this section F̂(η) = F(η) = ∆(η, P); we simplify notation and only use F(η).

The inner task P is a simple quadratic problem, where the starting point is fixed at w_0 and the loss function is f(w) = (1/2) wᵀHw for some fixed positive definite matrix H. Without loss of generality, assume w_0 has unit ℓ2 norm. Suppose the eigendecomposition of H is H = Σ_{i=1}^d λ_i u_i u_iᵀ. Throughout this section we let L = λ_1(H) and α = λ_d(H) be the largest and smallest eigenvalues of H, with L > α. For each i ∈ [d], let c_i = ⟨w_0, u_i⟩ and let c_min = min(|c_1|, |c_d|). We assume c_min > 0 for simplicity. Note that if w_0 is uniformly sampled from the unit sphere, with high probability c_min is at least Ω(1/√d); if H = XXᵀ with X ∈ R^{d×2d} a random Gaussian matrix, then with constant probability both α and L - α are at least Ω(d).

Let {w_{τ,η}} be the GD sequence running on f(w) starting from w_0 with step size η. We consider several ways of defining the meta-objective, including using the loss of the last point directly, or using the log of this value. We first show that although choosing F(η) = f(w_{t,η}) does not have any bad local optimal solution, it suffers from the gradient explosion/vanishing problem. We use F'(η) to denote the derivative of F with respect to η.

Theorem 3. Let the meta-objective be F(η) = f(w_{t,η}) = (1/2) w_{t,η}ᵀ H w_{t,η}. Then F(η) is a strictly convex function of η with a unique minimizer.
However, for any step size η < 2/L, |F'(η)| ≤ t Σ_{i=1}^d c_i^2 λ_i^2 |1 - ηλ_i|^{2t-1}; for any step size η > 2/L, |F'(η)| ≥ c_1^2 L^2 t (ηL - 1)^{2t-1} - L^2 t.

Note that in Theorem 3, when η < 2/L, |F'(η)| is exponentially small because |1 - ηλ_i| < 1 for all i ∈ [d]; when η > 2/L, |F'(η)| is exponentially large because ηL - 1 > 1. Intuitively, gradient explosion/vanishing happens because the meta-loss itself becomes too small or too large. A natural idea to fix the problem is to take the log of the meta-loss to reduce its range. We show that this indeed works. More precisely, if we choose F(η) = (1/t) log f(w_{t,η}), then we have:

Theorem 4. Let the meta-objective be F(η) = (1/t) log f(w_{t,η}). Then F(η) has a unique minimizer η* and |F'(η)| = O(L^3 / (c_min^2 α(L - α))) for all η ≥ 0. Let {η_k} be the GD sequence running on F with meta step size μ_k = 1/√k. Suppose the starting step size satisfies η_0 ≤ M. Given any ε with 1/L > ε > 0, there exists k' = (M^6/ε^2)·poly(1/c_min, L, 1/α, 1/(L-α)) such that for all k ≥ k', |η_k - η*| ≤ ε.

For convenience, in the above algorithmic result we reset η to zero whenever it goes negative. Note that although we show the gradient is bounded and there is a unique minimizer, the problem of optimizing η is still nonconvex because the meta-gradient is not monotone. We use ideas from quasi-convex optimization to show that meta-gradient descent can find the unique optimal step size for this problem.

Surprisingly, even though the meta-gradient is bounded, it cannot be effectively computed by back-propagation due to numerical issues. More precisely:

Corollary 1. If we choose the meta-objective F(η) = (1/t) log f(w_{t,η}), then when computing the meta-gradient using back-propagation, there are intermediate results that are exponentially large/small in the number of inner steps t.

Indeed, in Section 5 we empirically verify that standard auto-differentiation tools can fail in this setting.
This suggests that one should be careful about using standard back-propagation in the learning-to-learn approach. The proofs of the results in this section are deferred to Appendix A.
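To make the numerical issue behind Corollary 1 concrete, the following sketch (our own, working directly in the eigenbasis of H; the specific eigenvalues and coefficients are arbitrary test values) first evaluates f(w_{t,η}) the naive way, which underflows for large t, and then evaluates F'(η) in a normalized form where every power is at most 1 in magnitude:

```python
import numpy as np

# Inner problem in the eigenbasis of H: lam_i are eigenvalues, c_i = <w_0, u_i>.
lam = np.array([2.0, 1.0, 0.5])
c = np.array([0.48, 0.6, 0.64])          # unit-norm starting point
t, eta = 2000, 0.4                        # eta < 2/L, so f(w_t) shrinks exponentially in t

# Naive evaluation: f(w_t) = 0.5 * sum_i c_i^2 lam_i (1 - eta*lam_i)^(2t) underflows to 0,
# so log f(w_t) (and its back-propagated gradient) is not computable this way.
f_naive = 0.5 * np.sum(c**2 * lam * (1 - eta * lam) ** (2 * t))

# Stable evaluation of F'(eta) for F = (1/t) log f(w_t): factor out the dominant
# contraction s = max_i |1 - eta*lam_i|, so all remaining ratios r_i have |r_i| <= 1.
s = np.max(np.abs(1 - eta * lam))
r = (1 - eta * lam) / s
num = -2.0 * np.sum(c**2 * lam**2 * r ** (2 * t - 1))
den = s * np.sum(c**2 * lam * r ** (2 * t))
grad_stable = num / den
```

The naive value f_naive is exactly 0.0 in double precision, while grad_stable is a well-scaled finite number: the exponentially small common factor cancels between numerator and denominator, which is precisely why the meta-gradient is bounded even though every intermediate quantity of naive back-propagation is not.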

4. TRAIN-BY-TRAIN VS. TRAIN-BY-VALIDATION

Next we consider the generalization ability of simple optimizers. In this section we consider a simple family of least-squares problems. Let T be a distribution of tasks where every task (D(w*), S_train, S_valid, ℓ) is determined by a parameter w* ∈ R^d chosen uniformly at random on the unit sphere. For each individual task, (x, y) ∼ D(w*) is generated by first choosing x ∼ N(0, I_d) and then computing y = ⟨w*, x⟩ + ξ, where ξ ∼ N(0, σ^2) with σ ≥ 1. The loss function is the squared loss ℓ(w, x, y) = (1/2)(y - ⟨w, x⟩)^2. That is, the tasks are standard least-squares problems with ground truth w* and noise level σ^2.

For the meta-loss function, we consider two different settings. In the train-by-train setting, the training set S_train contains n independent samples, and the meta-loss is chosen to be the training loss. That is, in each task P, we first choose w* uniformly at random, then generate (x_1, y_1), ..., (x_n, y_n) as the training set S_train. The meta-loss function is defined as ∆_TbT(n)(η, P) = (1/2n) Σ_{i=1}^n (y_i - ⟨w_{t,η}, x_i⟩)^2. Here w_{t,η} is the result of running t iterations of gradient descent starting from the point 0 with step size η.

Note that we truncate a sequence and declare the meta-loss to be high once the weight norm exceeds a certain threshold. We can safely do this because we assume the ground-truth weight w* has unit norm, so if the weight norm of our model is too high, the inner training has diverged and the step size is too large. Specifically, if at the τ-th step ∥w_{τ,η}∥ ≥ 40σ, we freeze the training on this task and set w_{τ',η} = 40σ·u for all τ ≤ τ' ≤ t, for some arbitrary unit-norm vector u. Setting the weight to a large vector is just one way to declare the loss high; we choose this particular way for proof convenience.
As before, the empirical meta-objective in the train-by-train setting is the average of the meta-loss across m specific tasks P_1, P_2, ..., P_m, that is, F̂_TbT(n)(η) = (1/m) Σ_{k=1}^m ∆_TbT(n)(η, P_k).

In the train-by-validation setting, each task is generated by sampling n_1 training samples and n_2 validation samples, and the meta-loss is chosen to be the validation loss. That is, in each task P, we first choose w* uniformly at random, then generate (x_1, y_1), ..., (x_{n_1}, y_{n_1}) as the training set S_train and (x'_1, y'_1), ..., (x'_{n_2}, y'_{n_2}) as the validation set S_valid. The meta-loss function is defined as ∆_TbV(n_1,n_2)(η, P) = (1/2n_2) Σ_{i=1}^{n_2} (y'_i - ⟨w_{t,η}, x'_i⟩)^2. Here again w_{t,η} is the result of running t iterations of gradient descent on the training set starting from the point 0, and we use the same truncation as before. The empirical meta-objective is F̂_TbV(n_1,n_2)(η) = (1/m) Σ_{k=1}^m ∆_TbV(n_1,n_2)(η, P_k), where each P_k is independently sampled according to the described procedure.

We first show that when the number of samples is small (in particular n < d) and the noise is a large enough constant, train-by-train can be much worse than train-by-validation, even when n_1 + n_2 = n (so the total number of samples used in train-by-validation is the same as in train-by-train).

Theorem 5 (Informal). Suppose n is a constant fraction of d, n_1 + n_2 = n, and the noise level σ is a large enough constant. Then η*_train = Θ(1) and E∥w_{t,η*_train} - w*∥^2 = Ω(1)·σ^2 for all η*_train ∈ argmin_{η≥0} F̂_TbT(n)(η), while η*_valid = Θ(1/t) and E∥w_{t,η*_valid} - w*∥^2 = ∥w*∥^2 - Ω(1) for all η*_valid ∈ argmin_{η≥0} F̂_TbV(n_1,n_2)(η). In both equations the expectation is taken over new tasks.

Note that in this case the number of samples n is smaller than d, so the least-squares problem is underdetermined and the optimal training loss goes to 0 (there is always a way to simultaneously satisfy all n equations).
This is exactly what train-by-train does: it chooses a large constant learning rate, which makes the optimizer converge exponentially fast to the empirical risk minimizer (ERM). However, when the noise is large, making the training loss go to 0 overfits the noise and hurts generalization. Train-by-validation, on the other hand, chooses a smaller learning rate, which allows it to leverage the information in the training samples without overfitting to noise. Theorem 5 is proved in Appendix B. We also prove similar results for SGD in Appendix D.

We emphasize that neural networks are often over-parametrized, which corresponds to the case d > n. Indeed, Liu & Belkin (2018) showed that variants of stochastic gradient descent can converge to the empirical risk minimizer at an exponential rate in this case. Therefore, in order to train neural networks, it is better to use train-by-validation. On the other hand, we show that when the number of samples is large (n ≫ d), train-by-train can also perform well.

Theorem 6. Let F̂_TbT(n)(η) be as defined in Equation 1. Assume the noise level σ is a constant. Then for any ε > 0, when n is large enough, E∥w_{t,η*_train} - w*∥^2 ≤ (1 + ε)·dσ^2/n for all η*_train ∈ argmin_{η≥0} F̂_TbT(n)(η), where the expectation is taken over new tasks.

Therefore, if the learning-to-learn approach is applied to a traditional optimization problem that is not over-parametrized, train-by-train can work well. In this case, the empirical risk minimizer already has good generalization performance, and train-by-train optimizes the convergence towards the ERM. We defer the proof of Theorem 6 to Appendix C.

5. EXPERIMENTS

Optimizing step size for quadratic objective We first validate the results in Section 3. We fix a 20-dimensional quadratic objective as the inner problem and vary the number of inner steps t and the initial value η_0. We compute the meta-gradient directly using a formula derived in Appendix A; in this way, we avoid computing exponentially small/large intermediate terms. We use the algorithm suggested in Theorem 4, except that we choose the meta step size to be 1/(100√k), as the constants in the theorem were not optimized. An example training curve of η for t = 80 and η_0 = 0.1 is shown in Figure 1: η converges quickly within 300 steps. Similar convergence also holds for larger t or a much larger initial η_0. In contrast, we also implemented the meta-training in TensorFlow, with code adapted from Wichrowska et al. (2017). Experiments show that in many settings (especially with large t and large η_0) this implementation does not converge. In Figure 1, under the TensorFlow implementation, the step size is stuck at its initial value throughout meta-training because the meta-gradient explodes and gives NaN values. In Figure 2, we verify the observation from Metz et al. (2019) that the optimal step size depends on the inner training length.

Train-by-train vs. train-by-validation, synthetic data Here we validate the theoretical results of Section 4 using the least-squares model defined there. We fix the input dimension d to be 1000. In the first experiment, we fix the size of the data (n = 500 for train-by-train, n_1 = n_2 = 250 for train-by-validation). Under different noise levels, we find the optimal η* by a grid search on the meta-objective for the train-by-train and train-by-validation settings respectively. We then use the optimal η* found in each setting to test on 10 new least-squares problems. The mean RMSE, as well as its range over the 10 test cases, is shown in Figure 3.
We can see that in all of these cases the train-by-train model overfits easily, while the train-by-validation model performs much better and does not overfit. Moreover, as the noise becomes larger, the difference between the two settings becomes more significant. In the next experiment, we fix σ = 1 and vary the sample size. For train-by-validation, we always split the samples evenly into a training and a validation set. From Figure 4, we see that the gap between the two settings decreases as we use more data, as expected from Theorem 6.

Train-by-train vs. train-by-validation, MLP optimizer on MNIST Finally we consider a more complicated multi-layer perceptron (MLP) optimizer on the MNIST data set. We use the same MLP optimizer as Metz et al. (2019); details of this optimizer are discussed in Appendix F. As the inner problem, we use a two-layer fully-connected network with 100 and 20 hidden units and ReLU activations. The inner objective is the standard 10-class cross-entropy loss, and we use mini-batches of 32 samples during inner training. In all the following experiments, we use SGD as a baseline, with its step size tuned by grid search against validation loss. To see whether the comparison between train-by-train and train-by-validation behaves similarly to our theoretical results, we consider different numbers of samples and different levels of label noise. For each optimizer, we run 5 independent tests and collect training accuracy and test accuracy for evaluation. The plots show the mean of the 5 tests; we omit spread measures because the results of the 5 tests are so close to each other that range or standard deviation marks would not be readable in the plots.

First, consider optimizing on MNIST with a small number of samples. In this case, the train-by-train setting uses 1,000 samples (denoted "TbT1000"), and we use another 1,000 samples as the validation set for the train-by-validation case (denoted "TbV1000+1000").
To be fair to train-by-train, we also consider TbT2000, where the train-by-train algorithm has access to 2,000 data points. Figure 5 shows the results: all models reach training accuracy close to 1, but both TbT1000 and TbT2000 overfit the data significantly, whereas TbV1000+1000 performs well. To show that the advantage of train-by-validation increases with noise, we keep the same sample size and consider a noisier version of MNIST, where we randomly change the label of a sample with probability 0.2 (the new label is chosen uniformly at random, including the original label). The results are shown in Figure 6. Both train-by-train models, as well as SGD, overfit easily, with training accuracy close to 1 and low test performance, while the train-by-validation model performs much better. Finally, we run experiments on the complete MNIST data set (without label noise). For the train-by-validation setting, we split the data set into 50,000 training samples and 10,000 validation samples. As shown in Figure 7, in this case train-by-train and train-by-validation perform similarly (in fact both are slightly weaker than the tuned SGD baseline). This shows that when the sample size is sufficiently large, train-by-train can get results comparable to train-by-validation.

6. CONCLUSION AND FUTURE WORKS

In this paper, we have proved optimization and generalization guarantees for tuning the step size for quadratic loss. From the optimization perspective, we considered a simple task whose objective is a quadratic function. We proved that the meta-gradient can explode/vanish if the meta-objective is simply the loss of the last iteration; we then showed that the log-transformed meta-objective has a polynomially bounded meta-gradient and can be successfully optimized. To study generalization, we considered the least-squares problem. We showed that when the number of samples is small and the noise is large, the train-by-validation approach generalizes better than train-by-train, while when the number of samples is large, train-by-train can also work well. Although our theoretical results are proved for quadratic loss, this simple setting already yields interesting phenomena and requires non-trivial techniques to analyze. We have also verified our theoretical results with an optimizer parametrized by neural networks on the MNIST dataset.

Since this is among the first works studying the learning-to-learn approach theoretically, there are many potential directions for future work. One immediate direction is to extend the result for least squares to the log-transformed meta-objective (as in Section 3). This is probably doable because composing the log function with the current meta-objective should not change its minimizer. For the least-squares problem, we only studied the generalization properties of the optimal step size under the meta-objective; it is unclear whether meta-gradient descent converges to this optimal step size. We believe our techniques for the meta-optimization of the simple quadratic objective (Section 3) can also be useful in this analysis. More broadly, we are also interested in analyzing more complicated optimizers on more complicated tasks.

In the appendix, we first give the missing proofs for the theorems in the main paper.
Later, in Appendix F, we give details of the experiments.

Notations: Besides the notation defined in Section 2, we define additional notation used in the proofs. For a matrix X ∈ R^{n×d} with n ≤ d, we denote its singular values as σ_1(X) ≥ ... ≥ σ_n(X). For a positive semi-definite matrix A ∈ R^{d×d}, we write uᵀAu as ∥u∥^2_A. For a matrix X ∈ R^{d×n}, let Proj_X ∈ R^{d×d} be the projection matrix onto the column span of X; that is, Proj_X = SSᵀ, where the columns of S form an orthonormal basis for the column span of X. For any event E, we use 1{E} to denote its indicator function: 1{E} equals 1 when E holds and 0 otherwise. We use Ē to denote the complement of E.

A PROOFS FOR SECTION 3 - ALLEVIATING GRADIENT EXPLOSION/VANISHING PROBLEM FOR QUADRATIC OBJECTIVE

In this section, we prove the results in Section 3. Recall the meta-learning problem: the inner task is a fixed quadratic problem, where the starting point is fixed at w_0 and the loss function is f(w) = (1/2) wᵀHw for some fixed positive definite matrix H ∈ R^{d×d}. Suppose the eigendecomposition of H is H = Σ_{i=1}^d λ_i u_i u_iᵀ. In this section, we let L = λ_1(H) and α = λ_d(H) be the largest and smallest eigenvalues of H, with L > α. We assume the starting point w_0 has unit ℓ2 norm. For each i ∈ [d], let c_i = ⟨w_0, u_i⟩ and let c_min = min(|c_1|, |c_d|). We assume c_min > 0 for simplicity, which is satisfied if w_0 is chosen randomly from the unit sphere. Let {w_{τ,η}} be the GD sequence running on f(w) starting from w_0 with step size η. For the meta-objective, we consider using the loss of the last point directly, or using the log of this value.

In Section A.1, we first show that although choosing F(η) = f(w_{t,η}) does not have any bad local optimal solution, it suffers from the gradient explosion/vanishing problem (Theorem 3).
Then, in Section A.2, we show that choosing F(η) = (1/t) log f(w_{t,η}) leads to a polynomially bounded meta-gradient, and further that meta-gradient descent converges to the optimal step size (Theorem 4). Although the meta-gradient is polynomially bounded, if we simply use back-propagation to compute it, the intermediate results can still be exponentially large/small (Corollary 1). This is also proved in Section A.2.

A.1 META-GRADIENT VANISHING/EXPLOSION

In this section, we show that although choosing F(η) = f(w_{t,η}) does not have any bad local optimal solution, it suffers from the meta-gradient explosion/vanishing problem. Recall Theorem 3:

Theorem 3. Let the meta-objective be F(η) = f(w_{t,η}) = (1/2) w_{t,η}ᵀ H w_{t,η}. Then F(η) is a strictly convex function of η with a unique minimizer. However, for any step size η < 2/L, |F'(η)| ≤ t Σ_{i=1}^d c_i^2 λ_i^2 |1 - ηλ_i|^{2t-1}; for any step size η > 2/L, |F'(η)| ≥ c_1^2 L^2 t (ηL - 1)^{2t-1} - L^2 t.

Intuitively, if we write w_{t,η} in the eigenbasis of H, each coordinate evolves exponentially in t, so the gradient of the plain objective is also exponential in t.

Proof of Theorem 3. By the gradient descent iteration,

w_{t,η} = w_{t-1,η} - η∇f(w_{t-1,η}) = w_{t-1,η} - ηH w_{t-1,η} = (I - ηH) w_{t-1,η} = (I - ηH)^t w_0.

Therefore, F(η) := f(w_{t,η}) = (1/2) w_0ᵀ (I - ηH)^{2t} H w_0. Taking the derivative of F(η),

F'(η) = -t w_0ᵀ (I - ηH)^{2t-1} H^2 w_0 = -t Σ_{i=1}^d c_i^2 λ_i^2 (1 - ηλ_i)^{2t-1},

where c_i = ⟨w_0, u_i⟩. Taking the second derivative of F(η),

F''(η) = t(2t - 1) w_0ᵀ (I - ηH)^{2t-2} H^3 w_0 = t(2t - 1) Σ_{i=1}^d c_i^2 λ_i^3 (1 - ηλ_i)^{2t-2}.

Since L > α, we have F''(η) > 0 for any η. That means F(η) is a strictly convex function of η with a unique minimizer.

For any fixed η ∈ (0, 2/L) we have |1 - ηλ_i| < 1 for all i ∈ [d], hence |F'(η)| ≤ t Σ_{i=1}^d c_i^2 λ_i^2 |1 - ηλ_i|^{2t-1}. For any fixed η ∈ (2/L, ∞), we have ηL - 1 > 1, and

F'(η) = -t c_1^2 L^2 (1 - ηL)^{2t-1} - t Σ_{i≠1: (1-ηλ_i)≤0} c_i^2 λ_i^2 (1 - ηλ_i)^{2t-1} - t Σ_{i≠1: (1-ηλ_i)>0} c_i^2 λ_i^2 (1 - ηλ_i)^{2t-1} ≥ t c_1^2 L^2 (ηL - 1)^{2t-1} - t Σ_{i=1}^d c_i^2 λ_i^2 ≥ t c_1^2 L^2 (ηL - 1)^{2t-1} - L^2 t,

where the last inequality uses Σ_{i=1}^d c_i^2 = 1.

A.2 ALLEVIATING META-GRADIENT VANISHING/EXPLOSION

We prove that when the meta-objective is chosen as (1/t) log f(w_{t,η}), the meta-gradient is polynomially bounded. Furthermore, we show that meta-gradient descent converges to the optimal step size within polynomially many iterations.
Recall Theorem 4 as follows.

Theorem 4. Let the meta objective be $F(\eta) = \frac{1}{t}\log f(w_{t,\eta})$. Then $F(\eta)$ has a unique minimizer $\eta^*$ and $|F'(\eta)| = O\big(\frac{L^3}{c_{\min}^2 \alpha(L-\alpha)}\big)$ for all $\eta \ge 0$. Let $\{\eta_k\}$ be the GD sequence running on $F$ with meta step size $\mu_k = 1/\sqrt{k}$. Suppose the starting step size satisfies $\eta_0 \le M$. Given any $1/L > \epsilon > 0$, there exists $k_\epsilon = \frac{M^6}{\epsilon^2}\,\mathrm{poly}(\frac{1}{c_{\min}}, L, \frac{1}{\alpha}, \frac{1}{L-\alpha})$ such that for all $k \ge k_\epsilon$, $|\eta_k - \eta^*| \le \epsilon$.

When we take the log of the function value, the derivative with respect to $\eta$ becomes much more stable. We first show some structural results on $F(\eta)$: it has a unique minimizer, its gradient is polynomially bounded, and the gradient is only close to 0 when $\eta$ is close to the unique minimizer. Using these structural results, we then prove that meta-gradient descent converges.

Proof of Theorem 4. The proof consists of three claims. In the first claim, we show that $F$ has a unique minimizer and that the negative meta-derivative always points toward the minimizer. In the second claim, we show that $F'$ is bounded. In the last claim, we show that $|F'(\eta)|$ is lower bounded for any $\eta$ outside the $\epsilon$-neighborhood of $\eta^*$. Finally, we combine these three claims to finish the proof.

Claim 1. The meta objective $F$ has only one stationary point, which is also its unique minimizer $\eta^*$. For any $\eta \in [0, \eta^*)$, $F'(\eta) < 0$, and for any $\eta \in (\eta^*, \infty)$, $F'(\eta) > 0$. Furthermore, $\eta^* \in [1/L, 1/\alpha]$.

We can compute the derivative of $F$ in $\eta$ as follows:
$$F'(\eta) = \frac{-2\, w_0^\top (I-\eta H)^{2t-1} H^2 w_0}{w_0^\top (I-\eta H)^{2t} H w_0} = \frac{-2\sum_{i=1}^d c_i^2 \lambda_i^2 (1-\eta\lambda_i)^{2t-1}}{\sum_{i=1}^d c_i^2 \lambda_i (1-\eta\lambda_i)^{2t}}. \qquad (3)$$
It is not hard to verify that the denominator $\sum_{i=1}^d c_i^2 \lambda_i (1-\eta\lambda_i)^{2t}$ is always positive. Denote the numerator $-2\sum_{i=1}^d c_i^2 \lambda_i^2 (1-\eta\lambda_i)^{2t-1}$ by $g(\eta)$. Since $g'(\eta) > 0$ for any $\eta \in [0, \infty)$, we know $g(\eta)$ is strictly increasing in $\eta$.
Combining this with the fact that $g(0) < 0$ and $g(\infty) > 0$, we know there is a unique point (denoted $\eta^*$) where $g(\eta^*) = 0$, with $g(\eta) < 0$ for all $\eta \in [0, \eta^*)$ and $g(\eta) > 0$ for all $\eta \in (\eta^*, \infty)$. Since the denominator of $F'(\eta)$ is always positive and the numerator equals $g(\eta)$, there is a unique point $\eta^*$ where $F'(\eta^*) = 0$, with $F'(\eta) < 0$ for all $\eta \in [0, \eta^*)$ and $F'(\eta) > 0$ for all $\eta \in (\eta^*, \infty)$. Clearly $\eta^*$ is the minimizer of $F$. Also, it is not hard to verify that for any $\eta \in [0, 1/L)$, $F'(\eta) < 0$, and for any $\eta \in (1/\alpha, \infty)$, $F'(\eta) > 0$. This implies $\eta^* \in [1/L, 1/\alpha]$.

Claim 2. For any $\eta \in [0, \infty)$, we have $|F'(\eta)| \le \frac{4L^3}{c_{\min}^2 \alpha(L-\alpha)} := D_{\max}$.

For any $\eta \in [0, \frac{2}{\alpha+L}]$, we have $|1-\eta\lambda_i| \le 1-\eta\alpha$ for all $i$. Dividing the numerator and denominator of $F'(\eta)$ by $(1-\eta\alpha)^{2t}$, we have
$$|F'(\eta)| = \frac{2\sum_{i=1}^d \frac{c_i^2\lambda_i^2}{1-\eta\alpha}\big(\frac{1-\eta\lambda_i}{1-\eta\alpha}\big)^{2t-1}}{c_d^2\alpha + \sum_{i=1}^{d-1} c_i^2\lambda_i \big(\frac{1-\eta\lambda_i}{1-\eta\alpha}\big)^{2t}} \le \frac{2\sum_{i=1}^d c_i^2\lambda_i^2}{c_d^2\,\alpha(1-\eta\alpha)} \le \frac{2(\alpha+L)\sum_{i=1}^d c_i^2\lambda_i^2}{c_d^2\,\alpha(L-\alpha)} \le \frac{4L^3}{c_d^2\,\alpha(L-\alpha)},$$
where the second-to-last inequality uses $\eta \le \frac{2}{\alpha+L}$. Similarly, for any $\eta \in (\frac{2}{\alpha+L}, \infty)$, we have $|1-\eta\lambda_i| \le \eta L - 1$ for all $i$. Dividing the numerator and denominator of $F'(\eta)$ by $(\eta L - 1)^{2t}$, we have
$$|F'(\eta)| = \frac{2\sum_{i=1}^d \frac{c_i^2\lambda_i^2}{\eta L-1}\big(\frac{1-\eta\lambda_i}{\eta L-1}\big)^{2t-1}}{c_1^2 L + \sum_{i=2}^{d} c_i^2\lambda_i \big(\frac{1-\eta\lambda_i}{\eta L-1}\big)^{2t}} \le \frac{2\sum_{i=1}^d c_i^2\lambda_i^2}{c_1^2\, L(\eta L - 1)} \le \frac{2(\alpha+L)\sum_{i=1}^d c_i^2\lambda_i^2}{c_1^2\, L(L-\alpha)} \le \frac{4L^3}{c_1^2\, L(L-\alpha)},$$
where the second-to-last inequality uses $\eta \ge \frac{2}{\alpha+L}$. Overall, for any $\eta \ge 0$,
$$|F'(\eta)| \le \frac{4L^3}{L-\alpha}\max\left(\frac{1}{c_d^2\alpha}, \frac{1}{c_1^2 L}\right) \le \frac{4L^3}{c_{\min}^2\,\alpha(L-\alpha)}.$$

Claim 3. Given $M \ge 2/\alpha$ and $1/L > \epsilon > 0$, for any $\eta \in [0, \eta^*-\epsilon] \cup [\eta^*+\epsilon, M]$, we have
$$|F'(\eta)| \ge \min\left(\frac{2\epsilon\, c_d^2\alpha^3}{L},\ \frac{2\epsilon\, c_1^2 L^2}{(ML-1)^2}\right) \ge 2\epsilon\, c_{\min}^2 \min\left(\frac{\alpha^3}{L}, \frac{1}{M^2}\right) := D_{\min}(\epsilon, M).$$
If $\eta \in [0, \eta^*-\epsilon]$ and $\eta \le \frac{2}{\alpha+L}$, we have
$$F'(\eta) = \frac{-2\sum_{i=1}^d c_i^2\lambda_i^2(1-\eta\lambda_i)^{2t-1}}{\sum_{i=1}^d c_i^2\lambda_i(1-\eta\lambda_i)^{2t}} = \frac{-2\big(\sum_{i=1}^d c_i^2\lambda_i^2(1-\eta\lambda_i)^{2t-1} - \sum_{i=1}^d c_i^2\lambda_i^2(1-\eta^*\lambda_i)^{2t-1}\big)}{\sum_{i=1}^d c_i^2\lambda_i(1-\eta\lambda_i)^{2t}},$$
where the second equality holds because $\sum_{i=1}^d c_i^2\lambda_i^2(1-\eta^*\lambda_i)^{2t-1} = 0$. For the numerator, we have
$$\sum_{i=1}^d c_i^2\lambda_i^2(1-\eta\lambda_i)^{2t-1} - \sum_{i=1}^d c_i^2\lambda_i^2(1-\eta^*\lambda_i)^{2t-1} \ge c_d^2\alpha^2\big((1-\eta\alpha)^{2t-1} - (1-\eta^*\alpha)^{2t-1}\big) \ge c_d^2\alpha^2\big((1-\eta\alpha)^{2t-1} - (1-\eta\alpha-\epsilon\alpha)^{2t-1}\big);$$
for the denominator, we have $\sum_{i=1}^d c_i^2\lambda_i(1-\eta\lambda_i)^{2t} \le \sum_{i=1}^d c_i^2\lambda_i(1-\eta\alpha)^{2t}$, which holds because $|1-\eta\lambda_i| \le 1-\eta\alpha$ for all $i$. Overall, when $\eta \in [0, \eta^*-\epsilon]$ and $\eta \le \frac{2}{\alpha+L}$,
$$|F'(\eta)| \ge 2\,\frac{c_d^2\alpha^2\big((1-\eta\alpha)^{2t-1} - (1-\eta\alpha-\epsilon\alpha)^{2t-1}\big)}{\sum_{i=1}^d c_i^2\lambda_i(1-\eta\alpha)^{2t}} \ge \frac{2\epsilon\, c_d^2\alpha^3}{\sum_{i=1}^d c_i^2\lambda_i(1-\eta\alpha)} \ge \frac{2\epsilon\, c_d^2\alpha^3}{L},$$
where the last inequality holds because $1-\eta\alpha \le 1$ and $\sum_i c_i^2\lambda_i \le L$. Similarly, if $\eta \in [0, \eta^*-\epsilon]$ and $\eta \ge \frac{2}{\alpha+L}$, we have
$$|F'(\eta)| \ge 2\,\frac{c_1^2 L^2\big((\eta L + \epsilon L - 1)^{2t-1} - (\eta L - 1)^{2t-1}\big)}{\sum_{i=1}^d c_i^2\lambda_i(\eta L - 1)^{2t}} \ge \frac{2\epsilon\, c_1^2 L^3}{\sum_{i=1}^d c_i^2\lambda_i(\eta L - 1)^2} \ge \frac{2\epsilon\, c_1^2 \alpha^2 L^2}{(L-\alpha)^2},$$
where the last inequality holds because $\eta \le \eta^* - \epsilon \le 1/\alpha$ and $\sum_i c_i^2\lambda_i \le L$. If $\eta \in [\eta^*+\epsilon, \infty)$ and $\eta \le \frac{2}{\alpha+L}$, we have
$$|F'(\eta)| \ge 2\,\frac{c_d^2\alpha^2\big((1-\eta\alpha+\epsilon\alpha)^{2t-1} - (1-\eta\alpha)^{2t-1}\big)}{\sum_{i=1}^d c_i^2\lambda_i(1-\eta\alpha)^{2t}} \ge \frac{2\epsilon\, c_d^2\alpha^3}{L}.$$
If $\eta \in [\eta^*+\epsilon, M]$ and $\eta \ge \frac{2}{\alpha+L}$, we have
$$|F'(\eta)| \ge 2\,\frac{c_1^2 L^2\big((\eta L - 1)^{2t-1} - (\eta L - \epsilon L - 1)^{2t-1}\big)}{\sum_{i=1}^d c_i^2\lambda_i(\eta L - 1)^{2t}} \ge \frac{2\epsilon\, c_1^2 L^3}{\sum_{i=1}^d c_i^2\lambda_i(\eta L - 1)^2} \ge \frac{2\epsilon\, c_1^2 L^2}{(ML-1)^2},$$
where the last inequality uses the assumption that $\eta \le M$.

With the above three claims, we are ready to prove the optimization result. By Claim 1, we know $F'(\eta) < 0$ for any $\eta \in [0, \eta^*)$ and $F'(\eta) > 0$ for any $\eta \in (\eta^*, \infty)$, so the negative gradient always points toward the minimizer. Since $\mu_k = 1/\sqrt{k}$, when $k \ge k_1 := \frac{D_{\max}^2}{\epsilon^2}$ we know $\mu_k \le \frac{\epsilon}{D_{\max}}$.
By Claim 2, we know $|F'(\eta)| \le D_{\max}$ for all $\eta \ge 0$, which implies $|\mu_k F'(\eta)| \le \epsilon$ for all $k \ge k_1$. That means meta-gradient descent never overshoots the minimizer by more than $\epsilon$ once $k \ge k_1$. In other words, after $k_1$ meta iterations, once $\eta$ enters the $\epsilon$-neighborhood of $\eta^*$, it never leaves this neighborhood. We also know that at meta iteration $k_1$, we have $\eta_{k_1} \le \max(1/\alpha + D_{\max}, M) := M'$. Here, $1/\alpha + D_{\max}$ covers the case where $\eta$ starts from the left of $\eta^*$ and overshoots to the right of $\eta^*$ by at most $D_{\max}$. Since $\eta^* \in [1/L, 1/\alpha]$, we have $|\eta_{k_1} - \eta^*| \le \max(1/\alpha,\ 1/\alpha + D_{\max} - 1/L,\ M - 1/L) := R$. By Claim 3, we know $|F'(\eta)| \ge D_{\min}(\epsilon, M')$ for any $\eta \in [0, \eta^*-\epsilon] \cup [\eta^*+\epsilon, M']$. Choosing some $k_2$ satisfying $\sum_{k=k_1}^{k_2} 1/\sqrt{k} \ge \frac{R}{D_{\min}(\epsilon, M')}$, we know that for any $k \ge k_2$, $|\eta_k - \eta^*| \le \epsilon$. Plugging in the bounds on $D_{\min}$ and $D_{\max}$ from Claims 3 and 2, there exist $k_1 = \frac{1}{\epsilon^2}\mathrm{poly}(\frac{1}{c_{\min}}, L, \frac{1}{\alpha}, \frac{1}{L-\alpha})$ and $k_2 = \frac{M^6}{\epsilon^2}\mathrm{poly}(\frac{1}{c_{\min}}, L, \frac{1}{\alpha}, \frac{1}{L-\alpha})$ satisfying these conditions.

Next, we show that although the meta-gradient is polynomially bounded, the intermediate results can still vanish or explode if we use back-propagation to compute the meta-gradient.

Corollary 1. If we choose the meta-objective as $F(\eta) = \frac{1}{t}\log f(w_{t,\eta})$, then when computing the meta-gradient using back-propagation, there are intermediate results that are exponentially large/small in the number of inner steps $t$.

Proof of Corollary 1. This follows by direct calculation. If we use back-propagation to compute the derivative of $\frac{1}{t}\log(f(w_{t,\eta}))$, we first need to compute $\frac{\partial}{\partial f(w_{t,\eta})}\big[\frac{1}{t}\log(f(w_{t,\eta}))\big]$, which equals $\frac{1}{t f(w_{t,\eta})}$. By the same analysis as in Theorem 3, we can show that $\frac{1}{t f(w_{t,\eta})}$ is exponentially large when $\eta < 2/L$ and exponentially small when $\eta > 2/L$.

B PROOFS OF TRAIN-BY-TRAIN V.S.
TRAIN-BY-VALIDATION (GD)

In this section, we show that when the number of samples is small and the noise level is a large constant, train-by-train overfits to the noise in the training tasks while train-by-validation generalizes well. We separately prove the results for train-by-train and train-by-validation in Theorem 7 and Theorem 8, respectively. Theorem 5 is then simply a combination of Theorem 7 and Theorem 8.

Recall that in the train-by-train setting, each task $P$ contains a training set $S_{train}$ with $n$ samples. The inner objective is defined as $f(w) = \frac{1}{2n}\sum_{(x,y)\in S_{train}} (\langle w, x\rangle - y)^2$. Let $\{w_{\tau,\eta}\}$ be the GD sequence running on $f(w)$ from initialization $0$ (with truncation). The meta-loss on task $P$ is defined as the inner objective at the last iterate, $\Delta_{TbT(n)}(\eta, P) = f(w_{t,\eta}) = \frac{1}{2n}\sum_{(x,y)\in S_{train}} (\langle w_{t,\eta}, x\rangle - y)^2$. The empirical meta objective $\hat F_{TbT(n)}(\eta)$ is the average of the meta-loss across $m$ different tasks. We show that under $\hat F_{TbT(n)}(\eta)$, the optimal step size is a constant and the learned weight is far from the ground truth $w^*$ on new tasks. We prove Theorem 7 in Section B.2: $\eta^*_{train} = \Theta(1)$ and $E\|w_{t,\eta^*_{train}} - w^*\|^2 = \Omega(1)\sigma^2$ for all $\eta^*_{train} \in \arg\min_{\eta\ge 0}\hat F_{TbT(n)}(\eta)$, where the expectation is taken over new tasks. In Theorem 7, $\Omega(1)$ is an absolute constant independent of $\sigma$. Intuitively, the reason train-by-train performs badly in this setting is that there is a way to set the step size to a constant such that gradient descent converges very quickly to the empirical risk minimizer, making the train-by-train objective very small. However, when the noise is large and the number of samples is smaller than the dimension, the empirical risk minimizer (ERM) overfits to the noise and is not the best solution.

In the train-by-validation setting, each task $P$ contains a training set $S_{train}$ with $n_1$ samples and a validation set $S_{valid}$ with $n_2$ samples. The inner objective is defined as $f(w) = \frac{1}{2n_1}\sum_{(x,y)\in S_{train}} (\langle w, x\rangle - y)^2$.
Let $\{w_{\tau,\eta}\}$ be the GD sequence running on $f(w)$ from initialization $0$ (with truncation). For each task $P$, the meta-loss $\Delta_{TbV(n_1,n_2)}(\eta, P)$ is defined as the loss of the last iterate $w_{t,\eta}$ evaluated on the validation set $S_{valid}$. That is, $\Delta_{TbV(n_1,n_2)}(\eta, P) = \frac{1}{2n_2}\sum_{(x,y)\in S_{valid}}(\langle w_{t,\eta}, x\rangle - y)^2$. The empirical meta objective $\hat F_{TbV(n_1,n_2)}(\eta)$ is the average of the meta-loss across $m$ different tasks $P_1, P_2, \ldots, P_m$. We show that under $\hat F_{TbV(n_1,n_2)}(\eta)$, the optimal step size is $\Theta(1/t)$ and the learned weight improves over the initialization $0$ by a constant on new tasks. Theorem 8 is proved in Section B.3: $\eta^*_{valid} = \Theta(1/t)$ and $E\|w_{t,\eta^*_{valid}} - w^*\|^2 = \|w^*\|^2 - \Omega(1)$ for all $\eta^*_{valid} \in \arg\min_{\eta\ge0}\hat F_{TbV(n_1,n_2)}(\eta)$, where the expectation is taken over new tasks. Intuitively, train-by-validation optimizes the right objective. As long as the meta-training problem generalizes well (that is, good performance on a few tasks implies good performance on the distribution of tasks), train-by-validation should be able to choose the optimal learning rate. The step size of $\Theta(1/t)$ here serves as regularization similar to early stopping, which allows gradient descent to achieve better error on test data.

Notations. We define more quantities that are useful in the analysis. In the train-by-train setting, given a task $P_k := (D(w_k^*), S^{(k)}_{train})$, the training set $S^{(k)}_{train}$ contains $n$ samples $\{x_i^{(k)}, y_i^{(k)}\}_{i=1}^n$ with $y_i^{(k)} = \langle w_k^*, x_i^{(k)}\rangle + \xi_i^{(k)}$. Let $X^{(k)}_{train}$ be the $n\times d$ matrix whose $i$-th row is $(x_i^{(k)})^\top$. Let $H^{(k)}_{train} := \frac{1}{n}(X^{(k)}_{train})^\top X^{(k)}_{train}$ be the covariance matrix of the inputs in $S^{(k)}_{train}$. Let $\xi^{(k)}_{train}$ be the $n$-dimensional column vector with $i$-th entry $\xi_i^{(k)}$. Since $n \le d$, with probability 1 the matrix $X^{(k)}_{train}$ is full row rank. Therefore, $X^{(k)}_{train}$ has a pseudo-inverse $(X^{(k)}_{train})^\dagger$ such that $X^{(k)}_{train}(X^{(k)}_{train})^\dagger = I_n$.
It is not hard to verify that $w^{(k)}_{train} = \mathrm{Proj}_{(X^{(k)}_{train})^\top} w_k^* + (X^{(k)}_{train})^\dagger \xi^{(k)}_{train}$ satisfies $y_i^{(k)} = \langle w^{(k)}_{train}, x_i^{(k)}\rangle$ for every $(x_i^{(k)}, y_i^{(k)}) \in S^{(k)}_{train}$. Here, $\mathrm{Proj}_{(X^{(k)}_{train})^\top}$ is the projection matrix onto the column span of $(X^{(k)}_{train})^\top$. We also denote $\mathrm{Proj}_{(X^{(k)}_{train})^\top} w_k^*$ by $(w^{(k)}_{train})^*$. We use $B^{(k)}_{t,\eta}$ to denote $I - (I - \eta H^{(k)}_{train})^t$. Let $w^{(k)}_{t,\eta}$ be the weight obtained by running GD on $S^{(k)}_{train}$ with step size $\eta$ (with truncation). With the above notations, it is not hard to verify that for task $P_k$ the inner objective is $f(w) = \frac12\|w - w^{(k)}_{train}\|^2_{H^{(k)}_{train}}$, and the meta-loss on task $P_k$ is simply $\Delta_{TbT(n)}(\eta, P_k) = \frac12\|w_{t,\eta} - w^{(k)}_{train}\|^2_{H^{(k)}_{train}}$. In the train-by-validation setting, each task $P_k$ contains a training set $S^{(k)}_{train}$ and a validation set $S^{(k)}_{valid}$. For the training set we define $\xi^{(k)}_{train}, X^{(k)}_{train}, H^{(k)}_{train}, w^{(k)}_{train}, B^{(k)}_{t,\eta}, w^{(k)}_{t,\eta}$ as above; for the validation set $S^{(k)}_{valid}$, we analogously define $\xi^{(k)}_{valid}, X^{(k)}_{valid}, H^{(k)}_{valid}, w^{(k)}_{valid}$. With these notations, the inner objective is $f(w) = \frac12\|w - w^{(k)}_{train}\|^2_{H^{(k)}_{train}}$ and the meta-loss is $\Delta_{TbV(n_1,n_2)}(\eta, P_k) = \frac12\|w_{t,\eta} - w^{(k)}_{valid}\|^2_{H^{(k)}_{valid}}$. We also use these notations without the index $k$ to refer to the quantities defined on a task $P$. In the proofs, we omit the subscripts $n, n_1, n_2$ and simply write $\Delta_{TbT}(\eta, P_k)$, $\Delta_{TbV}(\eta, P_k)$, $\hat F_{TbT}$, $\hat F_{TbV}$, $F_{TbT}$, $F_{TbV}$.
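The contrast between the two settings can be sketched numerically. The following is a minimal illustration, not the exact setting of Theorems 7 and 8: the sizes, the noise level $\sigma = 2$, the candidate step-size grid, and the use of a large validation set are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n1, n2, t, sigma = 40, 20, 1000, 500, 2.0   # n1 < d, large noise; n2 is large
                                               # only to make the comparison stable
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)               # ||w*|| = 1
X_train = rng.standard_normal((n1, d))
y_train = X_train @ w_star + sigma * rng.standard_normal(n1)
X_valid = rng.standard_normal((n2, d))
y_valid = X_valid @ w_star + sigma * rng.standard_normal(n2)

def run_gd(eta):
    # GD on the inner objective f(w) = (1/2n1)||X_train w - y_train||^2 from 0
    w = np.zeros(d)
    for _ in range(t):
        w -= eta * X_train.T @ (X_train @ w - y_train) / n1
    return w

def loss(w, X, y, n):
    return np.sum((X @ w - y) ** 2) / (2 * n)

# Train-by-train: a constant step size drives the training loss to ~0 (the ERM
# interpolates since n1 < d), but the iterate ends up far from w*.
w_tbt = run_gd(0.2)
train_loss = loss(w_tbt, X_train, y_train, n1)

# Train-by-validation: choose eta from a grid by validation loss; a small
# Theta(1/t)-scale step wins and acts like early-stopping regularization.
grid = [0.5 / t, 1 / t, 2 / t, 5 / t, 10 / t, 0.2]
iterates = {eta: run_gd(eta) for eta in grid}
eta_valid = min(grid, key=lambda e: loss(iterates[e], X_valid, y_valid, n2))

dist_tbt = np.linalg.norm(w_tbt - w_star)
dist_tbv = np.linalg.norm(iterates[eta_valid] - w_star)
print(train_loss, eta_valid, dist_tbt, dist_tbv)
```

On typical draws the constant step size reaches a near-zero training loss yet lands farther from $w^*$ than the validation-selected small step size, mirroring the statements of Theorems 7 and 8.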

B.1 OVERALL PROOF STRATEGY

In this section (and the next), we follow a similar proof strategy that consists of three steps.

Step 1: First, we show for both train-by-train and train-by-validation that there is a good step size achieving a small empirical meta-objective (though the step sizes and the empirical meta-objectives they achieve differ between the two settings). This does not necessarily mean that the actual optimal step size is exactly the good step size that we propose, but it gives an upper bound on the empirical meta-objective at the optimal step size.

Step 2: Second, we define a threshold step size such that for any step size above it, the empirical meta-objective must be higher than what was achieved at the good step size in Step 1. This immediately implies that the optimal step size cannot exceed this threshold.

Step 3: Third, we show the meta-learning problem has good generalization performance; that is, if a learning rate $\eta$ performs well on the training tasks, it must also perform well on the task distribution, and vice versa. Thanks to Steps 1 and 2, we know the optimal step size cannot exceed a certain threshold, so we only need to prove the generalization result within this range. The generalization result is not surprising since we only have a single trainable parameter $\eta$; however, we emphasize that it is non-trivial because we do not restrict the step size $\eta$ to be small enough that the algorithm never diverges. Instead, we use truncation to alleviate the divergence problem (this allows us to run the algorithm on distributions of data for which the largest usable learning rate is unknown).

Combining Steps 1, 2, and 3, we know the population meta-objective has to be small at the optimal step size. Finally, we show that as long as the population meta-objective is small, the performance of the algorithm satisfies what we stated in Theorem 5.
The last step is easier in the train-by-validation setting, because its meta-objective is exactly the measure we care about; in the train-by-train setting we instead look at properties of the empirical risk minimizer (ERM), and show that anything close to the ERM behaves similarly.

B.2 TRAIN-BY-TRAIN (GD)

Recall Theorem 7 as follows.

Theorem 7. Let the meta objective $\hat F_{TbT(n)}(\eta)$ be as defined in Equation 1 with $n \in [d/4, 3d/4]$. Assume the noise level $\sigma$ is a large constant $c_1$. Assume the unroll length $t \ge c_2$, the number of training tasks $m \ge c_3\log(mt)$, and the dimension $d \ge c_4\log(m)$ for certain constants $c_2, c_3, c_4$. With probability at least 0.99 over the sampling of the training tasks, we have $\eta^*_{train} = \Theta(1)$ and $E\|w_{t,\eta^*_{train}} - w^*\|^2 = \Omega(1)\sigma^2$ for all $\eta^*_{train} \in \arg\min_{\eta\ge0}\hat F_{TbT(n)}(\eta)$, where the expectation is taken over new tasks.

According to the data distribution, $X_{train}$ is an $n\times d$ random matrix with each entry i.i.d. sampled from the standard Gaussian distribution. In the following lemma, we show that the covariance matrix $H_{train}$ is approximately isotropic when $d/4 \le n \le 3d/4$. Specifically, we show $\sqrt{d}/\sqrt{L} \le \sigma_i(X_{train}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{train}) \le L$ for all $i \in [n]$ with $L = 100$. We use the letter $L$ for the upper bound on $H_{train}$ to emphasize that it bounds the smoothness of the inner objective. Throughout this section, $L$ denotes the constant 100. The proof of Lemma 1 follows from random matrix theory; we defer it to Section B.2.4.

Lemma 1. Let $X \in \mathbb{R}^{n\times d}$ be a random matrix with each entry i.i.d. sampled from the standard Gaussian distribution. Let $H := \frac1n X^\top X$. Assume $n = cd$ with $c \in [\frac14, \frac34]$. Then, with probability at least $1 - \exp(-\Omega(d))$, with the constant $L = 100$, $\sqrt{d}/\sqrt{L} \le \sigma_i(X) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H) \le L$ for all $i \in [n]$.

In this section, we always assume the size of each training set is within $[d/4, 3d/4]$ so that Lemma 1 holds. Since $H_{train}$ is upper bounded by $L$ with high probability, the GD sequence converges to $w_{train}$ for $\eta \in [0, 1/L]$. In Lemma 2, we prove that the empirical meta objective $\hat F_{TbT}$ monotonically decreases as $\eta$ increases up to $1/L$. We also show $\hat F_{TbT}$ is exponentially small in $t$ at step size $1/L$. This serves as Step 1 in Section B.1. The proof is deferred to Section B.2.1.
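Lemma 1's bounds hold with large margin already at moderate dimension, which is easy to check directly; the sketch below uses an illustrative $d = 400$ and $n = d/2$.

```python
import numpy as np

# Empirical check of Lemma 1 (illustrative d; L = 100 as in the text).
rng = np.random.default_rng(0)
d = 400
n = d // 2                 # n = cd with c = 1/2 in [1/4, 3/4]
L = 100.0
X = rng.standard_normal((n, d))
H = X.T @ X / n

s = np.linalg.svd(X, compute_uv=False)        # the n nonzero singular values of X
lam = np.sort(np.linalg.eigvalsh(H))[-n:]     # the n nonzero eigenvalues of H

print(np.sqrt(d / L) <= s.min(), s.max() <= np.sqrt(L * d))
print(1 / L <= lam.min(), lam.max() <= L)
```

In fact the singular values concentrate in roughly $[\sqrt{d}-\sqrt{n}, \sqrt{d}+\sqrt{n}]$, so the constant $L = 100$ is far from tight; it is chosen so that the bounds hold with probability $1-\exp(-\Omega(d))$.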
Lemma 2. With probability at least $1 - m\exp(-\Omega(d))$, $\hat F_{TbT}(\eta)$ is monotonically decreasing on $[0, 1/L]$ and $\hat F_{TbT}(1/L) \le 2L^2\sigma^2(1-\frac{1}{L^2})^t$.

When the step size is larger than $1/L$, the GD sequence can diverge, which incurs a high meta-objective loss. Later, in Definition 1, we define a step size $\bar\eta$ such that the GD sequence gets truncated with decent probability at any step size larger than $\bar\eta$. In Lemma 3, we show that with high probability the empirical meta objective is high for all $\eta > \bar\eta$. This serves as Step 2 of the proof strategy described in Section B.1. The proof is deferred to Section B.2.2.

Lemma 3. With probability at least $1 - \exp(-\Omega(m))$, $\hat F_{TbT}(\eta) \ge \frac{\sigma^2}{10L^8}$ for all $\eta > \bar\eta$.

By Lemma 2 and Lemma 3, we know the optimal step size must lie in $[1/L, \bar\eta]$. We can also show $1/L < \bar\eta < 3L$, so $\eta^*_{train}$ is a constant. To relate the empirical loss at $\eta^*_{train}$ to the population loss, we prove a generalization result for step sizes within $[1/L, \bar\eta]$. This serves as Step 3 in Section B.1. The proof is deferred to Section B.2.3.

Lemma 4. Suppose $\sigma$ is a large constant $c_1$. Assume $t \ge c_2$ and $d \ge c_4$ for certain constants $c_2, c_4$. With probability at least $1 - m\exp(-\Omega(d)) - O(t+m)\exp(-\Omega(m))$, $|F_{TbT}(\eta) - \hat F_{TbT}(\eta)| \le \frac{\sigma^2}{L^3}$ for all $\eta \in [1/L, \bar\eta]$.

Combining the above lemmas, we know the population meta objective $F_{TbT}$ is small at $\eta^*_{train}$, which means $w_{t,\eta^*_{train}}$ is close to the ERM solution. Since the ERM solution overfits to the noise in the training tasks, $\|w_{t,\eta^*_{train}} - w^*\|$ has to be large. We present the proof of Theorem 7 as follows.

Proof of Theorem 7. We assume $\sigma$ is a large constant throughout this proof. According to Lemma 2, with probability at least $1 - m\exp(-\Omega(d))$, $\hat F_{TbT}(\eta)$ is monotonically decreasing on $[0, 1/L]$ and $\hat F_{TbT}(1/L) \le 2L^2\sigma^2(1-1/L^2)^t$. This implies that the optimal step size satisfies $\eta^*_{train} \ge 1/L$ and $\hat F_{TbT}(\eta^*_{train}) \le 2L^2\sigma^2(1-1/L^2)^t$.
By Lemma 3, we know that with probability at least $1-\exp(-\Omega(m))$, $\hat F_{TbT}(\eta) \ge \frac{\sigma^2}{10L^8}$ for all $\eta > \bar\eta$, where $\bar\eta$ is defined in Definition 1. As long as $t \ge c_2$ for a certain constant $c_2$, we have $\frac{\sigma^2}{10L^8} > 2L^2\sigma^2(1-1/L^2)^t$, which then implies that the optimal step size $\eta^*_{train}$ lies in $[1/L, \bar\eta]$. According to Lemma 6, we know $\bar\eta \in (1/L, 3L)$. Therefore $\eta^*_{train}$ is a constant. According to Lemma 4, with probability at least $1 - m\exp(-\Omega(d)) - O(t+m)\exp(-\Omega(m))$, $|F_{TbT}(\eta) - \hat F_{TbT}(\eta)| \le \frac{\sigma^2}{L^3}$ for all $\eta \in [1/L, \bar\eta]$. As long as $t$ is larger than some constant, we have $\hat F_{TbT}(\eta^*_{train}) \le \frac{\sigma^2}{L^3}$. Combining this with the generalization result, we have $F_{TbT}(\eta^*_{train}) \le \frac{2\sigma^2}{L^3}$.

Next, we show that under a small population loss, $E\|w_{t,\eta^*_{train}} - w^*\|^2$ has to be large. Let $E_1$ be the event that $\sqrt{d}/\sqrt{L} \le \sigma_i(X_{train}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{train}) \le L$ for all $i \in [n]$, and $\sqrt{d}\sigma/4 \le \|\xi_{train}\| \le \sqrt{d}\sigma$. We have
$$E\|w_{t,\eta^*_{train}} - w_{train}\|^2_{H_{train}} \ge \frac1L E\big[\|w_{t,\eta^*_{train}} - w_{train}\|^2 1\{E_1\}\big] \ge \frac1L \big(E\big[\|w_{t,\eta^*_{train}} - w^*_{train} - (X_{train})^\dagger\xi_{train}\|\, 1\{E_1\}\big]\big)^2 \ge \frac1L \big(E\big[\|(X_{train})^\dagger\xi_{train}\|\,1\{E_1\}\big] - E\big[\|w_{t,\eta^*_{train}} - w^*_{train}\|\,1\{E_1\}\big]\big)^2.$$
Since $E\|w_{t,\eta^*_{train}} - w_{train}\|^2_{H_{train}} \le \frac{4\sigma^2}{L^3}$, this implies
$$E\big[\|(X_{train})^\dagger\xi_{train}\|\,1\{E_1\}\big] - E\big[\|w_{t,\eta^*_{train}} - w^*_{train}\|\,1\{E_1\}\big] \le \sqrt{L\cdot\frac{4\sigma^2}{L^3}} = \frac{2\sigma}{L}.$$
Conditioning on $E_1$, we can lower bound $\|(X_{train})^\dagger\xi_{train}\|$ by $\frac{\sigma}{4\sqrt{L}}$. According to Lemma 1 and Lemma 45, we know $\Pr[E_1] \ge 1 - \exp(-\Omega(d))$. As long as $d$ is at least a certain constant, we have $\Pr[E_1] \ge 0.9$. This then implies $E\big[\|(X_{train})^\dagger\xi_{train}\|\,1\{E_1\}\big] \ge \frac{9\sigma}{40\sqrt{L}}$. Therefore, we have
$$E\big[\|w_{t,\eta^*_{train}} - w^*_{train}\|\,1\{E_1\}\big] \ge \frac{9\sigma}{40\sqrt{L}} - \frac{2\sigma}{L} = \frac{9\sigma}{4L} - \frac{2\sigma}{L} = \frac{\sigma}{4L},$$
where the first equality uses $L = 100$.
Then, we have
$$E\|w_{t,\eta^*_{train}} - w^*\|^2 \ge E\big[\|w_{t,\eta^*_{train}} - w^*_{train}\|^2 1\{E_1\}\big] \ge \big(E\big[\|w_{t,\eta^*_{train}} - w^*_{train}\|\,1\{E_1\}\big]\big)^2 \ge \frac{\sigma^2}{16L^2},$$
where the first inequality holds because for any $S_{train}$, $w^*_{train}$ is the projection of $w^*$ onto the row span of $X_{train}$, and $w_{t,\eta^*_{train}}$ also lies in this subspace. Taking a union bound over all the bad events, this result holds with probability at least 0.99 as long as $\sigma$ is a large constant $c_1$, $t \ge c_2$, $m \ge c_3\log(mt)$, and $d \ge c_4\log(m)$ for certain constants $c_2, c_3, c_4$.

B.2.1 BEHAVIOR OF $\hat F_{TbT}$ FOR $\eta \in [0, 1/L]$

In this section, we prove that the empirical meta objective $\hat F_{TbT}$ is monotonically decreasing on $[0, 1/L]$. Furthermore, we show $\hat F_{TbT}(1/L)$ is exponentially small in $t$.
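Both facts are easy to observe on a single synthetic task. The sketch below is illustrative: it evaluates the closed form $\Delta_{TbT}(\eta) = \frac12 w_{train}^\top H (I-\eta H)^{2t} w_{train}$ with the task's actual largest eigenvalue playing the role of the constant $L = 100$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t, sigma = 20, 40, 500, 2.0        # illustrative sizes and noise level
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + sigma * rng.standard_normal(n)
H = X.T @ X / n
w_train = np.linalg.pinv(X) @ y           # interpolating solution
L_max = np.linalg.eigvalsh(H).max()       # actual smoothness of this task

def delta(eta):
    # Delta_TbT(eta) = (1/2) w_train^T H (I - eta*H)^(2t) w_train
    v = np.linalg.matrix_power(np.eye(d) - eta * H, t) @ w_train
    return 0.5 * v @ H @ v

etas = np.linspace(0.0, 1.0 / L_max, 50)
vals = np.array([delta(e) for e in etas])
# Monotone decrease up to 1/L, and an exponentially small value at 1/L.
print(np.all(np.diff(vals) < 0), vals[-1] / vals[0])
```

The meta-loss decreases strictly across the whole grid and its value at the right endpoint is a vanishing fraction of its value at $\eta = 0$, matching the exponential decay in $t$ stated above.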

Lemma 2. With probability at least $1 - m\exp(-\Omega(d))$, $\hat F_{TbT}(\eta)$ is monotonically decreasing on $[0, 1/L]$ and $\hat F_{TbT}(1/L) \le 2L^2\sigma^2(1-\frac{1}{L^2})^t$.

Proof of Lemma 2. For each $k \in [m]$, let $E_k$ be the event that $\sqrt{d}/\sqrt{L} \le \sigma_i(X^{(k)}_{train}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H^{(k)}_{train}) \le L$ for all $i \in [n]$, and $\sqrt{d}\sigma/4 \le \|\xi^{(k)}_{train}\| \le \sqrt{d}\sigma$. Here, $L$ is the constant 100 from Lemma 1. According to Lemma 1 and Lemma 45, for each $k \in [m]$ the event $E_k$ happens with probability at least $1 - \exp(-\Omega(d))$. Taking a union bound over all $k \in [m]$, we know $\cap_{k\in[m]} E_k$ holds with probability at least $1 - m\exp(-\Omega(d))$. From now on, we assume $\cap_{k\in[m]} E_k$ holds.

Let us first consider each individual loss function $\Delta_{TbT}(\eta, P_k)$. Let $\{\hat w^{(k)}_{\tau,\eta}\}$ be the GD sequence without truncation. We have
$$\hat w^{(k)}_{\tau,\eta} - w^{(k)}_{train} = \hat w^{(k)}_{\tau-1,\eta} - w^{(k)}_{train} - \eta H^{(k)}_{train}\big(\hat w^{(k)}_{\tau-1,\eta} - w^{(k)}_{train}\big) = \big(I - \eta H^{(k)}_{train}\big)\big(\hat w^{(k)}_{\tau-1,\eta} - w^{(k)}_{train}\big) = -\big(I - \eta H^{(k)}_{train}\big)^\tau w^{(k)}_{train}.$$
For any $\eta \in [0, 1/L]$, we have $\|\hat w^{(k)}_{\tau,\eta}\| \le \|w^{(k)}_{train}\| = \|(w^{(k)}_{train})^* + (X^{(k)}_{train})^\dagger\xi^{(k)}_{train}\| \le 2\sqrt{L}\sigma$ for any $\tau$. Therefore, $w_{t,\eta}$ never exceeds the norm threshold and never gets truncated. Noticing that $\Delta_{TbT}(\eta, P_k) = \frac12(w^{(k)}_{t,\eta} - w^{(k)}_{train})^\top H^{(k)}_{train}(w^{(k)}_{t,\eta} - w^{(k)}_{train})$, we have
$$\Delta_{TbT}(\eta, P_k) = \frac12 (w^{(k)}_{train})^\top H^{(k)}_{train}\big(I - \eta H^{(k)}_{train}\big)^{2t} w^{(k)}_{train}.$$
Taking the derivative of $\Delta_{TbT}(\eta, P_k)$ in $\eta$, we have
$$\frac{\partial}{\partial\eta}\Delta_{TbT}(\eta, P_k) = -t\,(w^{(k)}_{train})^\top (H^{(k)}_{train})^2 \big(I - \eta H^{(k)}_{train}\big)^{2t-1} w^{(k)}_{train}.$$
Conditioning on $E_k$, we know $1/L \le \lambda_i(H^{(k)}_{train}) \le L$ for all $i \in [n]$ and $H^{(k)}_{train}$ is full rank on the row span of $X^{(k)}_{train}$. Therefore, $\frac{\partial}{\partial\eta}\Delta_{TbT}(\eta, P_k) < 0$ for all $\eta \in [0, 1/L)$. Here, we assume $\|w^{(k)}_{train}\| > 0$, which happens with probability 1.
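The closed form derived above is easy to verify numerically. The sketch below uses hypothetical sizes and checks the untruncated GD iterate against $\hat w_{t,\eta} = w_{train} - (I - \eta H_{train})^t w_{train}$.

```python
import numpy as np

# Numerical check of the closed form used above: running GD on
# f(w) = (1/2n)||Xw - y||^2 from w_0 = 0 gives
# w_t - w_train = -(I - eta*H)^t w_train, where w_train interpolates the data.
rng = np.random.default_rng(0)
n, d, t, eta = 10, 20, 30, 0.05          # illustrative sizes and step size
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
H = X.T @ X / n
w_train = np.linalg.pinv(X) @ y          # X @ w_train == y (n <= d, full row rank)

w = np.zeros(d)
for _ in range(t):
    w -= eta * X.T @ (X @ w - y) / n

closed_form = w_train - np.linalg.matrix_power(np.eye(d) - eta * H, t) @ w_train
print(np.allclose(w, closed_form))       # the two computations agree
```

The identity holds for any step size; stability of GD only matters for whether the iterates stay bounded, not for the algebra.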

Overall, we know that conditioning on $\cap_{k\in[m]}E_k$, every $\Delta_{TbT}(\eta, P_k)$ is strictly decreasing for $\eta \in [0, 1/L]$. Since $\hat F_{TbT}(\eta) := \frac1m\sum_{k=1}^m \Delta_{TbT}(\eta, P_k)$, we know $\hat F_{TbT}(\eta)$ is strictly decreasing on $[0, 1/L]$. At step size $\eta = 1/L$, we have
$$\Delta_{TbT}(\eta, P_k) = \frac12 (w^{(k)}_{train})^\top H^{(k)}_{train}\big(I - \eta H^{(k)}_{train}\big)^{2t} w^{(k)}_{train} \le \frac12 L\Big(1 - \frac{1}{L^2}\Big)^{2t}\|w^{(k)}_{train}\|^2 \le 2L^2\sigma^2\Big(1 - \frac{1}{L^2}\Big)^t,$$
where we upper bound $\|w^{(k)}_{train}\|^2$ by $4L\sigma^2$ in the last step. Therefore, we have $\hat F_{TbT}(1/L) \le 2L^2\sigma^2(1 - \frac{1}{L^2})^t$.

B.2.2 LOWER BOUNDING $\hat F_{TbT}$ FOR $\eta \in (\bar\eta, \infty)$

In this section, we prove that the empirical meta objective is lower bounded by $\Omega(\sigma^2)$ with high probability for $\eta \in (\bar\eta, \infty)$. The step size $\bar\eta$ is defined such that there is a decent probability of diverging at any step size larger than $\bar\eta$. We then show the contribution from these truncated sequences is enough to provide an $\Omega(\sigma^2)$ lower bound for $\hat F_{TbT}$. The proof of Lemma 3 is given at the end of this section.

Lemma 3. With probability at least $1 - \exp(-\Omega(m))$, $\hat F_{TbT}(\eta) \ge \frac{\sigma^2}{10L^8}$ for all $\eta > \bar\eta$.

We define $\bar\eta$ as the smallest step size such that the contribution from the truncated sequences to the population meta objective exceeds a certain threshold. The precise definition is as follows.

Definition 1. Given a training task $P$, let $E_1$ be the event that $\sqrt{d}/\sqrt{L} \le \sigma_i(X_{train}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{train}) \le L$ for all $i \in [n]$, and $\sqrt{d}\sigma/4 \le \|\xi_{train}\| \le \sqrt{d}\sigma$. Let $\bar E_2(\eta)$ be the event that the GD sequence is truncated with step size $\eta$. Define $\bar\eta$ as
$$\bar\eta = \inf\Big\{\eta \ge 0 \;\Big|\; E\Big[\tfrac12\|w_{t,\eta} - w_{train}\|^2_{H_{train}} 1\{E_1\cap\bar E_2(\eta)\}\Big] \ge \frac{\sigma^2}{L^6}\Big\}.$$

In the next lemma, we prove that for any fixed training set, $1\{E_1\cap\bar E_2(\eta')\} \ge 1\{E_1\cap\bar E_2(\eta)\}$ for any $\eta' \ge \eta$. This immediately implies that $\Pr[E_1\cap\bar E_2(\eta)]$ and $E[\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap\bar E_2(\eta)\}]$ are non-decreasing in $\eta$. Basically, we need to show that conditioning on $E_1$, if a GD sequence gets truncated at step size $\eta$, it must also be truncated at any larger step size. Let $\{\hat w_{\tau,\eta}\}$ be the GD sequence without truncation.
We only need to show that for any $\tau$, if $\|\hat w_{\tau,\eta}\|$ exceeds the norm threshold, then $\|\hat w_{\tau,\eta'}\|$ must also exceed the norm threshold for any $\eta' \ge \eta$. This is easy to prove when $\tau$ is odd, because in that case $\|\hat w_{\tau,\eta}\|$ is non-decreasing in $\eta$. The case when $\tau$ is even is trickier, because there indeed exists a range of $\eta$ on which $\|\hat w_{\tau,\eta}\|$ is decreasing in $\eta$. We manage to prove that this problematic case cannot happen when $\|\hat w_{\tau,\eta}\|$ is at least $4\sqrt{L}\sigma$. The full proof of Lemma 5 is deferred to Section B.2.4.

Lemma 5. Fixing a task $P$, let $E_1$ and $\bar E_2(\eta)$ be as defined in Definition 1. We have $1\{E_1\cap\bar E_2(\eta')\} \ge 1\{E_1\cap\bar E_2(\eta)\}$ for any $\eta' \ge \eta$.

In the next lemma, we prove that $\bar\eta$ must lie in $(1/L, 3L)$. We prove this by showing that the GD sequence never gets truncated for $\eta \in [0, 2/L]$ and almost always gets truncated for $\eta \in [2.5L, \infty)$. The proof is deferred to Section B.2.4.

Lemma 6. Let $\bar\eta$ be as defined in Definition 1. Suppose $\sigma$ is a large constant $c_1$. Assume $t \ge c_2$ and $d \ge c_4$ for some constants $c_2, c_4$. We have $1/L < \bar\eta < 3L$.

Now we are ready to give the proof of Lemma 3.

Proof of Lemma 3. Let $E_1$ and $\bar E_2(\eta)$ be as defined in Definition 1. For simplicity, we assume $E[\frac12\|w_{t,\bar\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap\bar E_2(\bar\eta)\}] \ge \frac{\sigma^2}{L^6}$; we discuss the other, very similar, case at the end. Conditioning on $E_1$, we know $\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}} \le 18L^2\sigma^2$. Therefore, we know $\Pr[E_1\cap\bar E_2(\bar\eta)] \ge \frac{1}{18L^8}$. For each task $P_k$, define $E_1^{(k)}$ and $\bar E_2^{(k)}(\eta)$ as the corresponding events on training set $S^{(k)}_{train}$. By Hoeffding's inequality, we know that with probability at least $1 - \exp(-\Omega(m))$, $\frac1m\sum_{k=1}^m 1\{E_1^{(k)}\cap\bar E_2^{(k)}(\bar\eta)\} \ge \frac{1}{20L^8}$. By Lemma 5, we know $1\{E_1^{(k)}\cap\bar E_2^{(k)}(\eta)\} \ge 1\{E_1^{(k)}\cap\bar E_2^{(k)}(\bar\eta)\}$ for any $\eta \ge \bar\eta$.
Then, we can lower bound $\hat F_{TbT}$ for any $\eta > \bar\eta$ as follows,
$$\hat F_{TbT}(\eta) = \frac1m\sum_{k=1}^m \frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{train}\|^2_{H^{(k)}_{train}} \ge \frac1m\sum_{k=1}^m \frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{train}\|^2_{H^{(k)}_{train}}\, 1\{E_1^{(k)}\cap\bar E_2^{(k)}(\eta)\} \ge 2\sigma^2\,\frac1m\sum_{k=1}^m 1\{E_1^{(k)}\cap\bar E_2^{(k)}(\eta)\} \ge 2\sigma^2\,\frac1m\sum_{k=1}^m 1\{E_1^{(k)}\cap\bar E_2^{(k)}(\bar\eta)\} \ge \frac{\sigma^2}{10L^8},$$
where the second inequality lower bounds the loss on a task by $2\sigma^2$ when its sequence gets truncated. We have assumed $E[\frac12\|w_{t,\bar\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap\bar E_2(\bar\eta)\}] \ge \frac{\sigma^2}{L^6}$ in the proof. The proof also works, with slight changes, when this expectation is smaller than $\frac{\sigma^2}{L^6}$. According to the definition and Lemma 5, we know $E[\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap\bar E_2(\eta)\}] > \frac{\sigma^2}{L^6}$ for all $\eta > \bar\eta$. For each training set $S_{train}$, we can define $1\{E_1\cap\bar E_2(\bar\eta^+)\}$ as $\lim_{\eta\to\bar\eta^+} 1\{E_1\cap\bar E_2(\eta)\}$. We also have $\Pr[E_1\cap\bar E_2(\bar\eta^+)] \ge \frac{1}{18L^8}$. The remaining proof is the same as before once we substitute $1\{E_1\cap\bar E_2(\bar\eta)\}$ with $1\{E_1\cap\bar E_2(\bar\eta^+)\}$.

B.2.3 GENERALIZATION FOR $\eta \in [1/L, \bar\eta]$

In this section, we show the empirical meta objective $\hat F_{TbT}$ is pointwise close to the population meta objective $F_{TbT}$ for all $\eta \in [1/L, \bar\eta]$.

Lemma 4. Suppose $\sigma$ is a large constant $c_1$. Assume $t \ge c_2$ and $d \ge c_4$ for certain constants $c_2, c_4$. With probability at least $1 - m\exp(-\Omega(d)) - O(t+m)\exp(-\Omega(m))$, $|F_{TbT}(\eta) - \hat F_{TbT}(\eta)| \le \frac{\sigma^2}{L^3}$ for all $\eta \in [1/L, \bar\eta]$.

In this section, we first show that $\hat F_{TbT}$ concentrates around $F_{TbT}$ for any fixed $\eta$, and then construct $\epsilon$-nets for $\hat F_{TbT}$ and $F_{TbT}$ on $[1/L, \bar\eta]$. We give the proof of Lemma 4 at the end. We first show that for a fixed $\eta$, $\hat F_{TbT}(\eta)$ is close to $F_{TbT}(\eta)$ with high probability: we prove that the meta-loss on each task $\Delta_{TbT}(\eta, P_k)$ is $O(1)$-subexponential, and then apply Bernstein's inequality. The proof is deferred to Section B.2.4. We assume $\sigma$ is a large constant and $t \ge c_2$, $d \ge c_4$ for some constants $c_2, c_4$ so that Lemma 6 holds and $\bar\eta$ is a constant.

Lemma 7. Suppose $\sigma$ is a constant.
For any fixed $\eta$ and any $1 > \epsilon > 0$, with probability at least $1 - \exp(-\Omega(\epsilon^2 m))$, $|\hat F_{TbT}(\eta) - F_{TbT}(\eta)| \le \epsilon$.

Next, we construct an $\epsilon$-net for $F_{TbT}$. By the definition of $\bar\eta$, for any $\eta \le \bar\eta$ the contribution from truncated sequences to $F_{TbT}(\eta)$ is small. We can show the contribution from the un-truncated sequences is $O(t)$-Lipschitz.

Lemma 8. Suppose $\sigma$ is a large constant $c_1$. Assume $t \ge c_2$ and $d \ge c_4$ for some constants $c_2, c_4$. There exists an $\frac{11\sigma^2}{L^4}$-net $N \subset [1/L, \bar\eta]$ for $F_{TbT}$ with $|N| = O(t)$. That means, for any $\eta \in [1/L, \bar\eta]$, $|F_{TbT}(\eta) - F_{TbT}(\eta')| \le \frac{11\sigma^2}{L^4}$ for $\eta' = \arg\min_{\eta''\in N,\,\eta''\le\eta}(\eta - \eta'')$.

Proof of Lemma 8. Let $E_1$ and $\bar E_2(\eta)$ be as defined in Definition 1, and let $E_2(\eta)$ denote the complement of $\bar E_2(\eta)$ (the sequence is not truncated at step size $\eta$). For simplicity, we assume $E[\frac12\|w_{t,\bar\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap\bar E_2(\bar\eta)\}] \le \frac{\sigma^2}{L^6}$; we discuss the other, very similar, case at the end. We can decompose $E[\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}]$ as
$$E\Big[\tfrac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}\Big] = E\Big[\tfrac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap E_2(\eta)\}\Big] + E\Big[\tfrac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap\bar E_2(\eta)\}\Big] + E\Big[\tfrac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{\bar E_1\}\Big].$$
We will construct an $\epsilon$-net for the first term and show the other two terms are small. Consider the third term first. Since $\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}$ is $O(1)$-subexponential and $\Pr[\bar E_1] \le \exp(-\Omega(d))$, we have $E[\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{\bar E_1\}] = O(1)\exp(-\Omega(d))$. Choosing $d$ to be at least a certain constant, this term is at most $\frac{\sigma^2}{L^4}$. Next we upper bound the second term. Since $E[\frac12\|w_{t,\bar\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap\bar E_2(\bar\eta)\}] \le \frac{\sigma^2}{L^6}$ and $\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}} \ge 2\sigma^2$ when $w_{t,\eta}$ diverges, we know $\Pr[E_1\cap\bar E_2(\eta)] \le \frac{1}{2L^6}$. Then, we can upper bound the second term as
$$E\Big[\tfrac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap\bar E_2(\eta)\}\Big] \le 18L^2\sigma^2\cdot\frac{1}{2L^6} = \frac{9\sigma^2}{L^4}.$$
Next, we show the first term $E[\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap E_2(\eta)\}]$ has the desired Lipschitz property. According to Lemma 5, we know $1\{E_1\cap E_2(\eta')\} \ge 1\{E_1\cap E_2(\eta)\}$ for any $\eta' \le \eta$.
Therefore, conditioning on $E_1\cap E_2(\eta)$, we know $w_{t,\eta'}$ never gets truncated for any $\eta' \le \eta$. This means $w_{t,\eta} = B_{t,\eta}w_{train}$ with $B_{t,\eta} = I - (I-\eta H_{train})^t$. We can compute the derivative of $\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}$ as
$$\frac{\partial}{\partial\eta}\,\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}} = t\left\langle H_{train}(I-\eta H_{train})^{t-1}w_{train},\ H_{train}(w_{t,\eta}-w_{train})\right\rangle.$$
Since $\|w_{t,\eta}\| = \|(I-(I-\eta H_{train})^t)w_{train}\| \le 4\sqrt{L}\sigma$ and $\|w_{train}\| \le 2\sqrt{L}\sigma$, we have $\|(I-\eta H_{train})^t w_{train}\| \le 6\sqrt{L}\sigma$. We can bound $\|(I-\eta H_{train})^{t-1}w_{train}\|$ by $\|(I-\eta H_{train})^t w_{train}\| + \|w_{train}\|$, bounding the expanding directions using $(I-\eta H_{train})^t w_{train}$ and the shrinking directions using $w_{train}$. Therefore, we can bound the derivative as
$$\left|\frac{\partial}{\partial\eta}\,\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}\right| \le tL\times 8\sqrt{L}\sigma\times 6L\sqrt{L}\sigma = 48L^3\sigma^2 t.$$
Assuming $\sigma$ is a constant, we know $E[\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap E_2(\eta)\}]$ is $O(t)$-Lipschitz. Therefore, there exists an $\frac{\sigma^2}{L^4}$-net $N$ for $E[\frac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap E_2(\eta)\}]$ of size $O(t)$. That means, for any $\eta \in [1/L, \bar\eta]$,
$$\left|E\Big[\tfrac12\|w_{t,\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap E_2(\eta)\}\Big] - E\Big[\tfrac12\|w_{t,\eta'}-w_{train}\|^2_{H_{train}}1\{E_1\cap E_2(\eta')\}\Big]\right| \le \frac{\sigma^2}{L^4}$$
for $\eta' = \arg\min_{\eta''\in N,\,\eta''\le\eta}(\eta - \eta'')$. Note that we construct the $\epsilon$-net in a particular way such that $\eta'$ is chosen as the largest step size in $N$ that is at most $\eta$. Combining this with the upper bounds on the second and third terms, we have for any $\eta \in [1/L, \bar\eta]$, $|F_{TbT}(\eta) - F_{TbT}(\eta')| \le \frac{11\sigma^2}{L^4}$ for $\eta'$ as above. In the above analysis, we have assumed $E[\frac12\|w_{t,\bar\eta}-w_{train}\|^2_{H_{train}}1\{E_1\cap\bar E_2(\bar\eta)\}] \le \frac{\sigma^2}{L^6}$. The proof easily generalizes to the other case: we can define $1\{E_1\cap\bar E_2(\bar\eta^-)\}$ as $\lim_{\eta\to\bar\eta^-}1\{E_1\cap\bar E_2(\eta)\}$, and the proof goes through once we substitute $1\{E_1\cap\bar E_2(\bar\eta)\}$ with $1\{E_1\cap\bar E_2(\bar\eta^-)\}$; we also add $\bar\eta$ into the $\epsilon$-net.

In order to prove that $F_{TbT}$ is pointwise close to $\hat F_{TbT}$ on $[1/L, \bar\eta]$, we still need to construct an $\epsilon$-net for the empirical meta objective $\hat F_{TbT}$.

Lemma 9. Suppose $\sigma$ is a large constant $c_1$.
Assume $t \ge c_2$, $d \ge c_4$ for certain constants $c_2, c_4$. With probability at least $1 - m\exp(-\Omega(d))$, there exists a $\frac{\sigma^2}{L^4}$-net $\tilde{N} \subset [1/L, \bar{\eta}]$ for $\hat{F}_{TbT}$ with $|\tilde{N}| = O(t+m)$. That means, for any $\eta \in [1/L, \bar{\eta}]$,
\[ |\hat{F}_{TbT}(\eta) - \hat{F}_{TbT}(\eta')| \le \frac{\sigma^2}{L^4}, \quad \text{for } \eta' = \arg\min_{\eta'' \in \tilde{N},\, \eta'' \le \eta}(\eta - \eta''). \]

Proof of Lemma 9. For each $k \in [m]$, let $E_{1,k}$ be the event that $\sqrt{d}/\sqrt{L} \le \sigma_i(X^{(k)}_{train}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H^{(k)}_{train}) \le L$ for all $i \in [n]$, and $\sqrt{d}\sigma/4 \le \|\xi^{(k)}_{train}\| \le \sqrt{d}\sigma$. According to Lemma 1 and Lemma 45, we know that with probability at least $1 - m\exp(-\Omega(d))$, the events $E_{1,k}$ hold for all $k \in [m]$. From now on, we assume all these events hold.

Recall the empirical meta-objective: $\hat{F}_{TbT}(\eta) := \frac{1}{m}\sum_{k=1}^m \Delta_{TbT}(\eta, P_k)$. For any $k \in [m]$, let $\eta_{c,k}$ be the smallest step size such that $w^{(k)}_{t,\eta}$ gets truncated. If $\eta_{c,k} > \bar{\eta}$, by a similar argument as in Lemma 8, we know $\Delta_{TbT}(\eta, P_k)$ is $O(t)$-Lipschitz in $[1/L, \bar{\eta}]$ as long as $\sigma$ is a constant. If $\eta_{c,k} \le \bar{\eta}$, by Lemma 5 we know $w^{(k)}_{t,\eta}$ gets truncated for any $\eta \ge \eta_{c,k}$. This then implies that $\Delta_{TbT}(\eta, P_k)$ is a constant function for $\eta \in [\eta_{c,k}, \bar{\eta}]$. We can also show that $\Delta_{TbT}(\eta, P_k)$ is $O(t)$-Lipschitz in $[1/L, \eta_{c,k})$. There might be a discontinuity in the function value at $\eta_{c,k}$, so we need to add $\eta_{c,k}$ into the $\epsilon$-net. Overall, we know there exists a $\frac{\sigma^2}{L^4}$-net $\tilde{N}$ with $|\tilde{N}| = O(t+m)$ for $\hat{F}_{TbT}$. That means, for any $\eta \in [1/L, \bar{\eta}]$, $|\hat{F}_{TbT}(\eta) - \hat{F}_{TbT}(\eta')| \le \frac{\sigma^2}{L^4}$ for $\eta' = \arg\min_{\eta'' \in \tilde{N},\, \eta'' \le \eta}(\eta - \eta'')$.

Finally, we combine Lemma 7, Lemma 8 and Lemma 9 to prove that $\hat{F}_{TbT}$ is point-wise close to $F_{TbT}$ for $\eta \in [1/L, \bar{\eta}]$.

Proof of Lemma 4. We assume $\sigma$ is a constant in this proof. By Lemma 7, we know that for any fixed $\eta$, with probability at least $1 - \exp(-\Omega(\epsilon^2 m))$, $|\hat{F}_{TbT}(\eta) - F_{TbT}(\eta)| \le \epsilon$. By Lemma 8, we know there exists an $\frac{11\sigma^2}{L^4}$-net $N$ for $F_{TbT}$ with size $O(t)$. By Lemma 9, we know that with probability at least $1 - m\exp(-\Omega(d))$, there exists a $\frac{\sigma^2}{L^4}$-net $\tilde{N}$ for $\hat{F}_{TbT}$ with size $O(t+m)$.
According to the proofs of Lemma 8 and Lemma 9, it's not hard to verify that $N \cup \tilde{N}$ is still an $\frac{11\sigma^2}{L^4}$-net for both $\hat{F}_{TbT}$ and $F_{TbT}$. That means, for any $\eta \in [1/L, \bar{\eta}]$, we have
\[ |F_{TbT}(\eta) - F_{TbT}(\eta')| \le \frac{11\sigma^2}{L^4} \quad \text{and} \quad |\hat{F}_{TbT}(\eta) - \hat{F}_{TbT}(\eta')| \le \frac{11\sigma^2}{L^4}, \quad \text{for } \eta' = \arg\min_{\eta'' \in N \cup \tilde{N},\, \eta'' \le \eta}(\eta - \eta''). \]
Taking a union bound over $N \cup \tilde{N}$, we have with probability at least $1 - O(t+m)\exp(-\Omega(m))$, $|\hat{F}_{TbT}(\eta) - F_{TbT}(\eta)| \le \frac{\sigma^2}{L^4}$ for all $\eta \in N \cup \tilde{N}$. Overall, we know that with probability at least $1 - m\exp(-\Omega(d)) - O(t+m)\exp(-\Omega(m))$, for all $\eta \in [1/L, \bar{\eta}]$,
\[ |F_{TbT}(\eta) - \hat{F}_{TbT}(\eta)| \le |F_{TbT}(\eta) - F_{TbT}(\eta')| + |\hat{F}_{TbT}(\eta) - \hat{F}_{TbT}(\eta')| + |\hat{F}_{TbT}(\eta') - F_{TbT}(\eta')| \le \frac{23\sigma^2}{L^4} \le \frac{\sigma^2}{L^3}, \]
where $\eta' = \arg\min_{\eta'' \in N \cup \tilde{N},\, \eta'' \le \eta}(\eta - \eta'')$. We use the fact that $L = 100$ in the last inequality.
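The arguments above repeatedly use the closed form $w_{t,\eta} = (I - (I-\eta H_{train})^t)\,w_{train}$ for the un-truncated GD iterates started from zero, where $w_{train}$ is the min-norm interpolant. A minimal NumPy sketch of this identity, with toy dimensions and variable names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t, eta = 20, 50, 40, 0.1

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
H = X.T @ X / n                    # empirical covariance H_train
w_bar = np.linalg.pinv(X) @ y      # min-norm interpolant w_train

# gradient descent on (1/2n)||Xw - y||^2 starting from w_0 = 0
w = np.zeros(d)
for _ in range(t):
    w = w - eta * (H @ w - X.T @ y / n)

# closed form: w_t = (I - (I - eta*H)^t) w_train
B = np.eye(d) - np.linalg.matrix_power(np.eye(d) - eta * H, t)
closed = B @ w_bar
assert np.allclose(w, closed, atol=1e-6)
```

The identity holds because the gradient equals $H(w - w_{train})$ on the row span of $X$ and vanishes off it.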

B.2.4 PROOFS OF TECHNICAL LEMMAS

Proof of Lemma 1. Recall that $X_{train}$ is an $n \times d$ matrix with $n = cd$, where $c \in [1/4, 3/4]$. According to Lemma 48, with probability at least $1 - 2\exp(-t^2/2)$, we have
\[ \sqrt{d} - \sqrt{cd} - t \le \sigma_i(X_{train}) \le \sqrt{d} + \sqrt{cd} + t, \quad \text{for all } i \in [n]. \]
Since $H_{train} = \frac{1}{n} X_{train}^\top X_{train}$, we know $\lambda_i(H_{train}) = \frac{1}{n}\sigma_i^2(X_{train})$. Since $c \in [\frac{1}{4}, \frac{3}{4}]$, we have $\frac{1}{cd}(\sqrt{d}+\sqrt{cd})^2 \le 100 - c'$ and $\frac{1}{cd}(\sqrt{d}-\sqrt{cd})^2 \ge \frac{1}{100} + c'$ for some constant $c'$. Therefore, we know that with probability at least $1 - \exp(-\Omega(d))$,
\[ \frac{1}{100} \le \lambda_i(H_{train}) \le 100, \quad \text{for all } i \in [n]. \]
Similarly, since there exists a constant $c'$ such that $\sqrt{d} + \sqrt{cd} \le (10 - c')\sqrt{d}$ and $\sqrt{d} - \sqrt{cd} \ge (1/10 + c')\sqrt{d}$, we know that with probability at least $1 - \exp(-\Omega(d))$,
\[ \frac{1}{10}\sqrt{d} \le \sigma_i(X_{train}) \le 10\sqrt{d}, \quad \text{for all } i \in [n]. \]
Choosing $L = 100$ finishes the proof.

Proof of Lemma 5. We prove that for any training set $S_{train}$, $\mathbb{1}\{E_1 \cap \bar{E}_2(\eta')\} \ge \mathbb{1}\{E_1 \cap \bar{E}_2(\eta)\}$ for any $\eta' > \eta$. This is trivially true if $E_1$ is false on $S_{train}$. Therefore, we focus on the case when $E_1$ holds for $S_{train}$. Suppose $\eta_c$ is the smallest step size such that the GD sequence gets truncated. Let $\{w_{\tau,\eta_c}\}$ be the GD sequence without truncation. There must exist $\tau \le t$ such that $\|w_{\tau,\eta_c}\| \ge 4\sqrt{L}\sigma$. We only need to prove that $\|w_{\tau,\eta}\| \ge 4\sqrt{L}\sigma$ for any $\eta \ge \eta_c$. We prove this by showing that the derivative of $\|w_{\tau,\eta}\|^2$ in $\eta$ is non-negative whenever $\|w_{\tau,\eta}\|^2 \ge 16L\sigma^2$. Recall the closed form $w_{\tau,\eta} = w_{train} - (I - \eta H_{train})^\tau w_{train}$. If $\tau$ is an odd number, it's clear that $\frac{\partial}{\partial\eta}\|w_{\tau,\eta}\|^2$ is non-negative at any $\eta \ge 0$. From now on, we assume $\tau$ is an even number. In this case, $\frac{\partial}{\partial\eta}\|w_{\tau,\eta}\|^2$ can actually be negative for some $\eta$; however, we can prove the derivative must be non-negative assuming $\|w_{\tau,\eta}\|^2 \ge 16L\sigma^2$.

Suppose the eigenvalue decomposition of $H_{train}$ is $\sum_{i=1}^n \lambda_i u_i u_i^\top$ with $\lambda_1 \ge \cdots \ge \lambda_n$. Denote $c_i := \langle w_{train}, u_i \rangle$. Let $\lambda_j$ be the smallest eigenvalue such that $(1 - \eta\lambda_j) \le -1$. This implies $\lambda_i \le 2/\eta$ for any $i \ge j+1$. We can write $\|w_{\tau,\eta}\|^2$ as
\[ \|w_{\tau,\eta}\|^2 = \sum_{i=1}^{j} \big(1 - (1-\eta\lambda_i)^\tau\big)^2 c_i^2 + \sum_{i=j+1}^{n} \big(1 - (1-\eta\lambda_i)^\tau\big)^2 c_i^2 \le \sum_{i=1}^{j} \big(1 - (1-\eta\lambda_i)^\tau\big)^2 c_i^2 + \|w_{train}\|^2. \]
Since $E_1$ holds, we know $\|w_{train}\|^2 \le 4L\sigma^2$. Combining with $\|w_{\tau,\eta}\|^2 \ge 16L\sigma^2$, we have $\sum_{i=1}^{j}(1-(1-\eta\lambda_i)^\tau)^2 c_i^2 \ge 12L\sigma^2$. We can lower bound the derivative as follows:
\[ \frac{\partial}{\partial\eta}\|w_{\tau,\eta}\|^2 = \sum_{i=1}^{j} 2\tau\lambda_i(1-\eta\lambda_i)^{\tau-1}\big(1-(1-\eta\lambda_i)^\tau\big)c_i^2 + \sum_{i=j+1}^{n} 2\tau\lambda_i(1-\eta\lambda_i)^{\tau-1}\big(1-(1-\eta\lambda_i)^\tau\big)c_i^2 \ge 2\tau\sum_{i=1}^{j}\lambda_i(1-\eta\lambda_i)^{\tau-1}\big(1-(1-\eta\lambda_i)^\tau\big)c_i^2 - 2\tau\cdot\frac{2}{\eta}\sum_{i=j+1}^{n}c_i^2 \ge 2\tau\sum_{i=1}^{j}\lambda_i(1-\eta\lambda_i)^{\tau-1}\big(1-(1-\eta\lambda_i)^\tau\big)c_i^2 - 2\tau\cdot\frac{8L\sigma^2}{\eta}. \]
Then, we only need to show that $\sum_{i=1}^{j}\lambda_i(1-\eta\lambda_i)^{\tau-1}(1-(1-\eta\lambda_i)^\tau)c_i^2$ is larger than $8L\sigma^2/\eta$. We have
\[ \sum_{i=1}^{j}\lambda_i(1-\eta\lambda_i)^{\tau-1}\big(1-(1-\eta\lambda_i)^\tau\big)c_i^2 = \sum_{i=1}^{j}\frac{\lambda_i(\eta\lambda_i-1)^{\tau-1}}{(\eta\lambda_i-1)^{\tau}-1}\big(1-(1-\eta\lambda_i)^{\tau}\big)^2 c_i^2 = \sum_{i=1}^{j}\frac{(\eta\lambda_i-1)^{\tau}}{(\eta\lambda_i-1)^{\tau}-1}\cdot\frac{\lambda_i}{\eta\lambda_i-1}\big(1-(1-\eta\lambda_i)^{\tau}\big)^2 c_i^2 \ge \sum_{i=1}^{j}\frac{1}{\eta}\big(1-(1-\eta\lambda_i)^{\tau}\big)^2 c_i^2 \ge \frac{12L\sigma^2}{\eta} > \frac{8L\sigma^2}{\eta}. \]

Proof of Lemma 6. Similarly to the analysis in Lemma 2, conditioning on $E_1$, we know the GD sequence never exceeds the norm threshold for any $\eta \in [0, 2/L]$.
This then implies
\[ \mathbb{E}\Big[\tfrac{1}{2}\|w_{t,\eta} - w_{train}\|^2_{H_{train}} \mathbb{1}\{E_1 \cap \bar{E}_2(\eta)\}\Big] = 0, \quad \text{for all } \eta \in [0, 2/L]. \]
Let $\{w_{\tau,\eta}\}$ be the GD sequence without truncation. For any step size $\eta \in [2.5L, \infty)$, conditioning on $E_1$, we have
\[ \|w_{t,\eta}\| \ge \big((\eta/L - 1)^t - 1\big)\|w_{train}\| \ge (1.5^t - 1)\Big(\frac{\sigma}{4\sqrt{L}} - 1\Big) \ge 4\sqrt{L}\sigma, \]
where the last inequality holds as long as $\sigma \ge 5\sqrt{L}$ and $t \ge c_2$ for some constant $c_2$. Therefore, we know that when $\eta \in [2.5L, \infty)$, $\mathbb{1}\{E_1 \cap \bar{E}_2(\eta)\} = \mathbb{1}\{E_1\}$. Then, for any $\eta \ge 2.5L$,
\[ \mathbb{E}\Big[\tfrac{1}{2}\|w_{t,\eta} - w_{train}\|^2_{H_{train}} \mathbb{1}\{E_1 \cap \bar{E}_2(\eta)\}\Big] \ge \frac{1}{2L}\big(4\sqrt{L}\sigma - 2\sqrt{L}\sigma\big)^2 \Pr[E_1] \ge 2\sigma^2 \Pr[E_1] \ge \frac{\sigma^2}{L^3}, \]
where the last inequality uses $\Pr[E_1] \ge 1 - \exp(-\Omega(d))$ and assumes $d \ge c_4$ for some constant $c_4$. Overall, we know $\mathbb{E}\big[\frac{1}{2}\|w_{t,\eta} - w_{train}\|^2_{H_{train}} \mathbb{1}\{E_1 \cap \bar{E}_2(\eta)\}\big]$ equals zero for all $\eta \in [0, 2/L]$ and is at least $\frac{\sigma^2}{L^3}$ for all $\eta \in [2.5L, \infty)$. By definition, we know $\bar{\eta} \in (1/L, 3L)$.

Proof of Lemma 7. Recall that $\hat{F}_{TbT}(\eta) := \frac{1}{m}\sum_{k=1}^m \Delta_{TbT}(\eta, P_k)$. We prove that each $\Delta_{TbT}(\eta, P_k)$ is $O(1)$-subexponential. We can further bound $\Delta_{TbT}(\eta, P_k)$ as follows:
\[ \Delta_{TbT}(\eta, P_k) = \frac{1}{2}\big\| w^{(k)}_{t,\eta} - w^*_k - (X^{(k)}_{train})^\dagger \xi^{(k)}_{train} \big\|^2_{H^{(k)}_{train}} \le \frac{1}{2}\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{train}} + \frac{1}{2n}\|\xi^{(k)}_{train}\|^2 + \|w^{(k)}_{t,\eta} - w^*_k\| \cdot \Big\|\frac{1}{\sqrt{n}}\xi^{(k)}_{train}\Big\| \cdot \Big\|\frac{1}{\sqrt{n}}X^{(k)}_{train}\Big\|. \]
Note that $\|H^{(k)}_{train}\| = \sigma^2_{\max}(\frac{1}{\sqrt{n}}X^{(k)}_{train})$. According to Lemma 47, we know $\sigma_{\max}(X^{(k)}_{train}) - \mathbb{E}\sigma_{\max}(X^{(k)}_{train})$ is $O(1)$-subgaussian, which implies that $\sigma_{\max}(\frac{1}{\sqrt{n}}X^{(k)}_{train}) - \mathbb{E}\sigma_{\max}(\frac{1}{\sqrt{n}}X^{(k)}_{train})$ is $O(1/\sqrt{d})$-subgaussian. Since $\mathbb{E}\sigma_{\max}(\frac{1}{\sqrt{n}}X^{(k)}_{train})$ is a constant, we know $\sigma_{\max}(\frac{1}{\sqrt{n}}X^{(k)}_{train})$ is $O(1)$-subgaussian and $\sigma^2_{\max}(\frac{1}{\sqrt{n}}X^{(k)}_{train})$ is $O(1)$-subexponential. Similarly, we know both $\frac{1}{2n}\|\xi^{(k)}_{train}\|^2$ and $\|\frac{1}{\sqrt{n}}X^{(k)}_{train}\|\cdot\|\frac{1}{\sqrt{n}}\xi^{(k)}_{train}\|$ are $O(1)$-subexponential. Since $\sigma$ is a constant and $\|w^{(k)}_{t,\eta}\|$ is always bounded by $4\sqrt{L}\sigma$, it follows that $\Delta_{TbT}(\eta, P_k)$ is $O(1)$-subexponential, so $\hat{F}_{TbT}(\eta)$ is the average of $m$ i.i.d. $O(1)$-subexponential random variables. By a standard concentration inequality, for any $1 > \epsilon > 0$, with probability at least $1-\exp(-\Omega(\epsilon^2 m))$, $|\hat{F}_{TbT}(\eta) - F_{TbT}(\eta)| \le \epsilon$.
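The divergence bound used in the proof of Lemma 6 — $\|w_{t,\eta}\| \ge ((\eta/L - 1)^t - 1)\|w_{train}\|$ once every eigenvalue of $H_{train}$ lies in $[1/L, L]$ and $\eta \ge 2.5L$ — can be checked coordinatewise in the eigenbasis. A small sketch with illustrative constants ($L = 2$ here, not the paper's $L = 100$):

```python
import numpy as np

rng = np.random.default_rng(1)
L, t = 2.0, 10
eta = 2.5 * L                               # step size in the divergent regime

lam = rng.uniform(1.0 / L, L, size=30)      # eigenvalues of H_train in [1/L, L]
c = rng.standard_normal(30)                 # coordinates of w_train in the eigenbasis

# untruncated GD iterate in the eigenbasis: w_t = (1 - (1 - eta*lam)^t) * c
factors = 1.0 - (1.0 - eta * lam) ** t
w_t_norm = np.linalg.norm(factors * c)

# every direction expands: |1 - eta*lam| >= eta/L - 1 = 1.5, so each
# coordinate factor satisfies |1 - (1 - eta*lam)^t| >= 1.5^t - 1
lower = ((eta / L - 1.0) ** t - 1.0) * np.linalg.norm(c)
assert w_t_norm >= lower
```

The per-coordinate bound holds for both even and odd $t$, since an even power of a number $\le -1.5$ is $\ge 1.5^t$ and an odd power is $\le -1.5^t$.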

B.3 TRAIN-BY-VALIDATION (GD)

In this section, we show that the optimal step size under $\hat{F}_{TbV}$ is $\Theta(1/t)$. Furthermore, we show that under this optimal step size, the GD sequence makes constant progress towards the ground truth. Precisely, we prove the following theorem.

Theorem 8. Suppose the noise level $\sigma$ is a large enough constant. Assume $t \ge c_2$, $d \ge c_4\log(t)$ and $m \ge c_5$ for some constants $c_2, c_4, c_5$. Then with probability at least $0.99$,
\[ \eta^*_{valid} = \Theta(1/t) \quad \text{and} \quad \mathbb{E}\|w_{t,\eta^*_{valid}} - w^*\|^2 = \|w^*\|^2 - \Omega(1) \quad \text{for all } \eta^*_{valid} \in \arg\min_{\eta \ge 0} \hat{F}^{(n_1,n_2)}_{TbV}(\eta), \]
where the expectation is taken over new tasks.

In this section, we still use $L$ to denote the constant $100$. We start by analyzing the behavior of the population meta-objective $F_{TbV}$ for step sizes within $[0, 1/L]$. We show that the optimal step size within this range is $\Theta(1/t)$ and that the GD sequence moves towards $w^*$ under the optimal step size. This serves as step 1 in Section B.1. We defer the proof of Lemma 10 to Section B.3.1.

Lemma 10. Suppose the noise level $\sigma$ is a large enough constant $c_1$. Assume the unroll length $t \ge c_2$ and the dimension $d \ge c_4$ for some constants $c_2, c_4$. There exist $\eta_1, \eta_2, \eta_3 = \Theta(1/t)$ with $\eta_1 < \eta_2 < \eta_3$ such that
\[ F_{TbV}(\eta_2) \le \frac{1}{2}\|w^*\|^2 - \frac{9}{10}C + \frac{\sigma^2}{2}, \qquad F_{TbV}(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{6}{10}C + \frac{\sigma^2}{2}, \quad \forall \eta \in [0,\eta_1]\cup[\eta_3, 1/L], \]
where $C$ is a positive constant.

To relate the behavior of $F_{TbV}$ to the behavior of $\hat{F}_{TbV}$, we prove the following generalization result for step sizes in $[0, 1/L]$. This serves as step 3 in Section B.1. The proof is deferred to Section B.3.2.

Lemma 11. For any $1 > \epsilon > 0$, assume $d \ge c_4\log(1/\epsilon)$ for some constant $c_4$. With probability at least $1 - O(1/\epsilon)\exp(-\Omega(\epsilon^2 m))$, $|\hat{F}_{TbV}(\eta) - F_{TbV}(\eta)| \le \epsilon$, for all $\eta \in [0, 1/L]$.

In Lemma 12, we show the empirical meta-objective $\hat{F}_{TbV}$ is high for all step sizes larger than $1/L$, which then implies $\eta^*_{valid} \in [0, 1/L]$. This serves as step 2 in Section B.1. We prove this lemma in Section B.3.3.

Lemma 12. Suppose $\sigma$ is a large constant. Assume $t \ge c_2$, $d \ge c_4\log(t)$ for some constants $c_2, c_4$. With probability at least $1-\exp(-\Omega(m))$,
\[ \hat{F}_{TbV}(\eta) \ge C'\sigma^2 + \frac{1}{2}\sigma^2, \quad \text{for all } \eta \ge 1/L, \]
where $C'$ is a positive constant independent of $\sigma$.
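The $\Theta(1/t)$ scaling of the optimal step size can be illustrated with a one-dimensional bias–variance surrogate for $F_{TbV}$: after $t$ GD steps with step size $\eta$, the residual bias shrinks like $(1-\eta)^t$ while the fitted noise grows like $(1-(1-\eta)^t)^2$. This toy curve is our own construction, not the paper's objective; its exact minimizer is $\eta^* = 1 - (\nu^2/(1+\nu^2))^{1/t} = \Theta(1/t)$ for a constant noise level $\nu$:

```python
import numpy as np

def surrogate(eta, t, nu):
    # bias^2 + variance after t GD steps on one direction with eigenvalue 1
    b = 1.0 - (1.0 - eta) ** t          # fraction of the direction that has been fit
    return 0.5 * ((1.0 - b) ** 2 + (nu ** 2) * b ** 2)

nu = 5.0
etas = np.linspace(0.0, 0.05, 20001)
argmins = {}
for t in (50, 200):
    vals = surrogate(etas, t, nu)
    argmins[t] = etas[np.argmin(vals)]

# the minimizer scales like 1/t: eta*(t) * t stays (almost) constant
assert 0 < argmins[50] < 0.01
assert abs(argmins[50] * 50 - argmins[200] * 200) < 0.005
```

Doubling the unroll length $t$ halves the surrogate-optimal step size, which is the same trade-off Lemma 10 formalizes for the population objective.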
Combining Lemma 10, Lemma 11 and Lemma 12, we give the proof of Theorem 8.

Proof of Theorem 8. According to Lemma 10, we know that as long as $d$ and $t$ are larger than certain constants, there exist $\eta_1, \eta_2, \eta_3 = \Theta(1/t)$ with $\eta_1 < \eta_2 < \eta_3$ such that
\[ F_{TbV}(\eta_2) \le \frac{1}{2}\|w^*\|^2 - \frac{9}{10}C + \sigma^2/2, \qquad F_{TbV}(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{6}{10}C + \sigma^2/2, \quad \forall \eta \in [0,\eta_1]\cup[\eta_3,1/L], \]
for some positive constant $C$. Choosing $\epsilon = \min(1, C/10)$ in Lemma 11, we know that as long as $d$ is larger than a certain constant, with probability at least $1-\exp(-\Omega(m))$, $|\hat{F}_{TbV}(\eta) - F_{TbV}(\eta)| \le C/10$ for all $\eta \in [0, 1/L]$. Therefore,
\[ \hat{F}_{TbV}(\eta_2) \le \frac{1}{2}\|w^*\|^2 - \frac{8}{10}C + \sigma^2/2, \qquad \hat{F}_{TbV}(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{7}{10}C + \sigma^2/2, \quad \forall \eta \in [0,\eta_1]\cup[\eta_3,1/L]. \]
By Lemma 12, we know that as long as $t \ge c_2$, $d \ge c_4\log(t)$ for some constants $c_2, c_4$, with probability at least $1-\exp(-\Omega(m))$, $\hat{F}_{TbV}(\eta) \ge C'\sigma^2 + \frac{1}{2}\sigma^2$ for all $\eta \ge 1/L$. As long as $\sigma \ge 1/\sqrt{C'}$, we have $\hat{F}_{TbV}(\eta) \ge 1 + \frac{1}{2}\sigma^2$ for all $\eta \ge 1/L$. Combining with $\hat{F}_{TbV}(\eta_2) \le \frac{1}{2}\|w^*\|^2 - \frac{8}{10}C + \sigma^2/2$, we know $\eta^*_{valid} \in [0, 1/L]$. Furthermore, since $\hat{F}_{TbV}(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{7}{10}C + \sigma^2/2$ for all $\eta \in [0,\eta_1]\cup[\eta_3,1/L]$, we have $\eta_1 \le \eta^*_{valid} \le \eta_3$. Recalling that $\eta_1, \eta_3 = \Theta(1/t)$, we conclude $\eta^*_{valid} = \Theta(1/t)$.

At the optimal step size, we have
\[ F_{TbV}(\eta^*_{valid}) \le \hat{F}_{TbV}(\eta^*_{valid}) + C/10 \le \hat{F}_{TbV}(\eta_2) + C/10 \le \frac{1}{2}\|w^*\|^2 - \frac{7}{10}C + \sigma^2/2. \]
Since $F_{TbV}(\eta^*_{valid}) = \mathbb{E}\frac{1}{2}\|w_{t,\eta^*_{valid}} - w^*\|^2 + \sigma^2/2$, we have $\mathbb{E}\|w_{t,\eta^*_{valid}} - w^*\|^2 \le \|w^*\|^2 - \frac{7}{5}C$. Choosing $m$ to be at least a certain constant, this holds with probability at least $0.99$.

B.3.1 BEHAVIOR OF F_TbV FOR η ∈ [0, 1/L]

In this section, we study the behavior of $F_{TbV}$ when $\eta \in [0, 1/L]$. We prove the following lemma.

Lemma 10. Suppose the noise level $\sigma$ is a large enough constant $c_1$. Assume the unroll length $t \ge c_2$ and the dimension $d \ge c_4$ for some constants $c_2, c_4$.
There exist $\eta_1, \eta_2, \eta_3 = \Theta(1/t)$ with $\eta_1 < \eta_2 < \eta_3$ such that
\[ F_{TbV}(\eta_2) \le \frac{1}{2}\|w^*\|^2 - \frac{9}{10}C + \frac{\sigma^2}{2}, \qquad F_{TbV}(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{6}{10}C + \frac{\sigma^2}{2}, \quad \forall \eta \in [0,\eta_1]\cup[\eta_3,1/L], \]
where $C$ is a positive constant.

It's not hard to verify that $F_{TbV}(\eta) = \mathbb{E}\frac{1}{2}\|w_{t,\eta} - w^*\|^2 + \sigma^2/2$. For convenience, denote $Q(\eta) := \frac{1}{2}\|w_{t,\eta} - w^*\|^2$. In order to prove Lemma 10, we only need to show that $\mathbb{E}Q(\eta_2) \le \frac{1}{2}\|w^*\|^2 - \frac{9}{10}C$ and $\mathbb{E}Q(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{6}{10}C$ for all $\eta \in [0,\eta_1]\cup[\eta_3,1/L]$. In Lemma 13, we first show that this happens with high probability over the sampling of tasks.

Lemma 13. Suppose the noise level $\sigma$ is a large enough constant $c_1$. Assume the unroll length $t \ge c_2$ for a certain constant $c_2$. Then, with probability at least $1-\exp(-\Omega(d))$ over the sampling of tasks, there exist $\eta_1, \eta_2, \eta_3 = \Theta(1/t)$ with $\eta_1 < \eta_2 < \eta_3$ such that
\[ Q(\eta_2) := \frac{1}{2}\|w_{t,\eta_2} - w^*\|^2 \le \frac{1}{2}\|w^*\|^2 - C, \qquad Q(\eta) := \frac{1}{2}\|w_{t,\eta} - w^*\|^2 \ge \frac{1}{2}\|w^*\|^2 - \frac{C}{2}, \quad \forall \eta \in [0,\eta_1]\cup[\eta_3,1/L], \]
where $C$ is a positive constant.

Since we are in the small step size regime, we know the GD sequence converges with high probability and will not be truncated. For now, let's assume
\[ w_{t,\eta} = B_{t,\eta} w^*_{train} + B_{t,\eta}(X_{train})^\dagger \xi_{train}, \quad \text{where } B_{t,\eta} = I - (I - \eta H_{train})^t. \]
We have
\[ Q(\eta) = \frac{1}{2}\|B_{t,\eta} w^*_{train} + B_{t,\eta}(X_{train})^\dagger \xi_{train} - w^*\|^2 = \frac{1}{2}\|B_{t,\eta} w^*_{train} - w^*\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 + \langle B_{t,\eta} w^*_{train} - w^*,\; B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle \]
\[ = \frac{1}{2}\|w^*\|^2 + \frac{1}{2}\|B_{t,\eta} w^*_{train}\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 - \langle B_{t,\eta} w^*_{train}, w^*\rangle + \langle B_{t,\eta} w^*_{train} - w^*,\; B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle. \]
In Lemma 14, we show that with high probability the cross term $\langle B_{t,\eta} w^*_{train} - w^*, B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle$ is negligible for all $\eta \in [0, 1/L]$. By Hoeffding's inequality, we know the cross term is small for any fixed $\eta$. Constructing an $\epsilon$-net for the cross term in $\eta$, we can take a union bound and show it's small for all $\eta \in [0, 1/L]$. We defer the proof of Lemma 14 to Section B.3.4.

Lemma 14. Assume $\sigma$ is a constant.
For any $1 > \epsilon > 0$, we know that with probability at least $1 - O(1/\epsilon)\exp(-\Omega(\epsilon^2 d))$,
\[ \big|\langle B_{t,\eta} w^*_{train} - w^*,\; B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle\big| \le \epsilon, \quad \text{for all } \eta \in [0, 1/L]. \]
Denote
\[ G(\eta) := \frac{1}{2}\|w^*\|^2 + \frac{1}{2}\|B_{t,\eta} w^*_{train}\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 - \langle B_{t,\eta} w^*_{train}, w^*\rangle. \]
Choosing $\epsilon = C/4$ in Lemma 14, we only need to show $G(\eta_2) \le \frac{1}{2}\|w^*\|^2 - 5C/4$ and $G(\eta) \ge \frac{1}{2}\|w^*\|^2 - C/4$ for all $\eta \in [0,\eta_1]\cup[\eta_3,1/L]$.

We first show that there exists $\eta_2 = \Theta(1/t)$ such that $G(\eta_2) \le \frac{1}{2}\|w^*\|^2 - 5C/4$ for some constant $C$. It's not hard to show that $\frac{1}{2}\|B_{t,\eta} w^*_{train}\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 = O(\eta^2 t^2 \sigma^2)$. In Lemma 15, we show that the improvement $\langle B_{t,\eta} w^*_{train}, w^*\rangle = \Omega(\eta t)$ is linear in $\eta$. Therefore, there exists $\eta_2 = \Theta(1/t)$ such that $G(\eta_2) \le \frac{1}{2}\|w^*\|^2 - 5C/4$ for some constant $C$. We defer the proof of Lemma 15 to Section B.3.4.

Lemma 15. For any fixed $\eta \in [0, L/t]$, with probability at least $1-\exp(-\Omega(d))$,
\[ \langle B_{t,\eta} w^*_{train}, w^*\rangle \ge \frac{\eta t}{16L}. \]
To lower bound $G(\eta)$ for small $\eta$, we notice $G(\eta) \ge \frac{1}{2}\|w^*\|^2 - \langle B_{t,\eta} w^*_{train}, w^*\rangle$. We can show that $\langle B_{t,\eta} w^*_{train}, w^*\rangle = O(\eta t)$. Therefore, there exists $\eta_1 = \Theta(1/t)$ such that $\langle B_{t,\eta} w^*_{train}, w^*\rangle \le C/4$ for all $\eta \in [0, \eta_1]$.

To lower bound $G(\eta)$ for large $\eta$, we lower bound $G(\eta)$ using the noise square term, $G(\eta) \ge \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2$. We show that with high probability $\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 = \Omega(\sigma^2)$ for all $\eta \in [\log(2)L/t, 1/L]$. Therefore, as long as $\sigma$ is larger than some constant, there exists $\eta_3 = \Theta(1/t)$ such that $G(\eta) \ge \frac{1}{2}\|w^*\|^2$ for all $\eta \in [\eta_3, 1/L]$.

Combining Lemma 14 and Lemma 15, we give a complete proof for Lemma 13.

Proof of Lemma 13.
Recall that
\[ Q(\eta) = \frac{1}{2}\|B_{t,\eta} w^*_{train} - w^*\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 + \langle B_{t,\eta} w^*_{train} - w^*,\; B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle = G(\eta) + \langle B_{t,\eta} w^*_{train} - w^*,\; B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle. \]
We first show that with probability at least $1-\exp(-\Omega(d))$, there exist $\eta_1, \eta_2, \eta_3 = \Theta(1/t)$ with $\eta_1 < \eta_2 < \eta_3$ such that $G(\eta_2) \le \frac{1}{2}\|w^*\|^2 - 5C/4$ and $G(\eta) \ge \frac{1}{2}\|w^*\|^2 - C/4$ for all $\eta \in [0,\eta_1]\cup[\eta_3,1/L]$. According to Lemma 1, we know that with probability at least $1-\exp(-\Omega(d))$, $\sqrt{d}/\sqrt{L} \le \sigma_i(X_{train}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{train}) \le L$ for all $i \in [n]$, with $L = 100$.

Upper bounding $G(\eta_2)$: We can expand $G(\eta)$ as follows:
\[ G(\eta) := \frac{1}{2}\|B_{t,\eta} w^*_{train} - w^*\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 = \frac{1}{2}\|w^*\|^2 + \frac{1}{2}\|B_{t,\eta} w^*_{train}\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 - \langle B_{t,\eta} w^*_{train}, w^*\rangle. \]
Recall that $B_{t,\eta} = I - (I - \eta H_{train})^t$; for any vector $w$ in the span of $H_{train}$, $\|B_{t,\eta} w\| = \|(I - (I-\eta H_{train})^t)w\| \le L\eta t\|w\|$. According to Lemma 45, we know that with probability at least $1-\exp(-\Omega(d))$, $\|\xi_{train}\| \le \sqrt{d}\sigma$. Therefore, we have
\[ \frac{1}{2}\|B_{t,\eta} w^*_{train}\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 \le L^2\eta^2 t^2/2 + L^3\eta^2 t^2\sigma^2/2 \le L^3\eta^2 t^2\sigma^2, \]
where the second inequality uses $\sigma, L \ge 1$. According to Lemma 15, for any fixed $\eta \in [0, L/t]$, with probability at least $1-\exp(-\Omega(d))$, $\langle B_{t,\eta} w^*_{train}, w^*\rangle \ge \frac{\eta t}{16L}$. Therefore,
\[ G(\eta) \le \frac{1}{2}\|w^*\|^2 + L^3\eta^2 t^2\sigma^2 - \frac{\eta t}{16L} \le \frac{1}{2}\|w^*\|^2 - \frac{\eta t}{32L}, \]
where the second inequality holds as long as $\eta \le \frac{1}{32L^4\sigma^2 t}$. Choosing $\eta_2 := \frac{1}{32L^4\sigma^2 t}$, we have
\[ G(\eta_2) \le \frac{1}{2}\|w^*\|^2 - \frac{1}{1024L^5\sigma^2} = \frac{1}{2}\|w^*\|^2 - \frac{5C}{4}, \]
where $C = \frac{1}{819.2 L^5\sigma^2}$. Note $C$ is a constant, as $\sigma, L$ are constants.

Lower bounding $G(\eta)$ for $\eta \in [0,\eta_1]$: Now, we prove that there exists $\eta_1 = \Theta(1/t)$ with $\eta_1 < \eta_2$ such that for any $\eta \in [0, \eta_1]$, $G(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{C}{4}$. Recall that
\[ G(\eta) = \frac{1}{2}\|w^*\|^2 + \frac{1}{2}\|B_{t,\eta} w^*_{train}\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 - \langle B_{t,\eta} w^*_{train}, w^*\rangle \ge \frac{1}{2}\|w^*\|^2 - \langle B_{t,\eta} w^*_{train}, w^*\rangle. \]
Since $|\langle B_{t,\eta} w^*_{train}, w^*\rangle| \le L\eta t$, we know that for any $\eta \in [0, \eta_1]$, $G(\eta) \ge \frac{1}{2}\|w^*\|^2 - L\eta_1 t$. Choosing $\eta_1 = \frac{C}{4Lt}$, we have for any $\eta \in [0,\eta_1]$, $G(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{C}{4}$.

Lower bounding $G(\eta)$ for $\eta \in [\eta_3, 1/L]$: Now, we prove that there exists $\eta_3 = \Theta(1/t)$ with $\eta_3 > \eta_2$ such that for all $\eta \in [\eta_3, 1/L]$, $G(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{C}{4}$. Recall that
\[ G(\eta) = \frac{1}{2}\|B_{t,\eta} w^*_{train} - w^*\|^2 + \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 \ge \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2. \]
According to Lemma 45, we know that with probability at least $1-\exp(-\Omega(d))$, $\|\xi_{train}\| \ge \frac{\sqrt{d}\sigma}{2\sqrt{2}}$. Therefore,
\[ \|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 \ge \big(1 - e^{-\eta t/L}\big)^2 \frac{\sigma^2}{8L} \ge \frac{\sigma^2}{32L}, \]
where the last inequality assumes $\eta \ge \log(2)L/t$. As long as $t \ge \log(2)L^2$, we have $\log(2)L/t \le 1/L$. Choosing $\eta_3 = \log(2)L/t$, we know that for all $\eta \in [\eta_3, 1/L]$,
\[ G(\eta) \ge \frac{1}{2}\|B_{t,\eta}(X_{train})^\dagger \xi_{train}\|^2 \ge \frac{\sigma^2}{64L}. \]
Note that $\frac{1}{2}\|w^*\|^2 = 1/2$. Therefore, as long as $\sigma \ge 8\sqrt{L}$, we have $G(\eta) \ge \frac{1}{2}\|w^*\|^2$ for all $\eta \in [\eta_3, 1/L]$.

Overall, we have shown that there exist $\eta_1, \eta_2, \eta_3 = \Theta(1/t)$ with $\eta_1 < \eta_2 < \eta_3$ such that $G(\eta_2) \le \frac{1}{2}\|w^*\|^2 - 5C/4$ and $G(\eta) \ge \frac{1}{2}\|w^*\|^2 - C/4$ for all $\eta \in [0,\eta_1]\cup[\eta_3, 1/L]$. Recall that $Q(\eta) = G(\eta) + \langle B_{t,\eta} w^*_{train} - w^*, B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle$. Choosing $\epsilon = C/4$ in Lemma 14, we know that with probability at least $1-\exp(-\Omega(d))$, $|\langle B_{t,\eta} w^*_{train} - w^*, B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle| \le C/4$ for all $\eta \in [0, 1/L]$. Therefore, we know $Q(\eta_2) \le \frac{1}{2}\|w^*\|^2 - C$ and $Q(\eta) \ge \frac{1}{2}\|w^*\|^2 - C/2$ for all $\eta \in [0,\eta_1]\cup[\eta_3,1/L]$.

Next, we give the proof of Lemma 10.

Proof of Lemma 10. Recall that $F_{TbV}(\eta) = \mathbb{E}\frac{1}{2}\|w_{t,\eta} - w^*\|^2 + \frac{\sigma^2}{2}$. For convenience, denote $Q(\eta) := \frac{1}{2}\|w_{t,\eta} - w^*\|^2$. In order to prove Lemma 10, we only need to show that $\mathbb{E}Q(\eta_2) \le \frac{1}{2}\|w^*\|^2 - \frac{9}{10}C$ and $\mathbb{E}Q(\eta) \ge \frac{1}{2}\|w^*\|^2 - \frac{6}{10}C$ for all $\eta \in [0,\eta_1]\cup[\eta_3,1/L]$.
According to Lemma 13, as long as $\sigma$ is a large enough constant $c_1$ and $t$ is at least a certain constant $c_2$, with probability at least $1-\exp(-\Omega(d))$ over the sampling of $S_{train}$, there exist $\eta_1, \eta_2, \eta_3 = \Theta(1/t)$ with $\eta_1 < \eta_2 < \eta_3$ such that
\[ Q(\eta_2) := \frac{1}{2}\|w_{t,\eta_2} - w^*\|^2 \le \frac{1}{2}\|w^*\|^2 - C, \qquad Q(\eta) := \frac{1}{2}\|w_{t,\eta} - w^*\|^2 \ge \frac{1}{2}\|w^*\|^2 - \frac{C}{2}, \quad \forall \eta \in [0,\eta_1]\cup[\eta_3,1/L], \]
where $C$ is a positive constant. Call this event $E$. Suppose the probability that $E$ happens is $1-\delta$. We can write $\mathbb{E}Q(\eta)$ as follows:
\[ \mathbb{E}Q(\eta) = \mathbb{E}[Q(\eta)\,|\,E]\Pr[E] + \mathbb{E}[Q(\eta)\,|\,\bar{E}]\Pr[\bar{E}]. \]
According to the algorithm, we know $\|w_{t,\eta}\|$ is always bounded by $4\sqrt{L}\sigma$. Therefore, $Q(\eta) := \frac{1}{2}\|w_{t,\eta} - w^*\|^2 \le 13L\sigma^2$. When $\eta = \eta_2$, we have
\[ \mathbb{E}Q(\eta_2) \le \Big(\frac{1}{2}\|w^*\|^2 - C\Big)(1-\delta) + 13L\sigma^2\delta = \frac{1}{2}\|w^*\|^2 - \frac{\delta}{2} - C + (C + 13L\sigma^2)\delta \le \frac{1}{2}\|w^*\|^2 - \frac{9C}{10}, \]
where the last inequality assumes $\delta \le \frac{C}{10C + 130L\sigma^2}$. When $\eta \in [0,\eta_1]\cup[\eta_3,1/L]$, we have
\[ \mathbb{E}Q(\eta) \ge \Big(\frac{1}{2}\|w^*\|^2 - \frac{C}{2}\Big)(1-\delta) - 13L\sigma^2\delta = \frac{1}{2}\|w^*\|^2 - \frac{\delta}{2} - (1-\delta)\frac{C}{2} - 13L\sigma^2\delta \ge \frac{1}{2}\|w^*\|^2 - \frac{C}{2} - (1/2 + 13L\sigma^2)\delta \ge \frac{1}{2}\|w^*\|^2 - \frac{6C}{10}, \]
where the last inequality holds as long as $\delta \le \frac{C}{5C + 130L\sigma^2}$. According to Lemma 13, we know $\delta \le \exp(-\Omega(d))$. Therefore, the conditions on $\delta$ can be satisfied as long as $d$ is larger than a certain constant.

B.3.2 GENERALIZATION FOR η ∈ [0, 1/L]

In this section, we show $\hat{F}_{TbV}$ is point-wise close to $F_{TbV}$ for all $\eta \in [0, 1/L]$. Recall Lemma 11 as follows.

Lemma 11. For any $1 > \epsilon > 0$, assume $d \ge c_4\log(1/\epsilon)$ for some constant $c_4$. With probability at least $1 - O(1/\epsilon)\exp(-\Omega(\epsilon^2 m))$, $|\hat{F}_{TbV}(\eta) - F_{TbV}(\eta)| \le \epsilon$, for all $\eta \in [0, 1/L]$.

In order to prove Lemma 11, let's first show that for a fixed $\eta$, with high probability $\hat{F}_{TbV}(\eta)$ is close to $F_{TbV}(\eta)$. Similarly to Lemma 7, we show each $\Delta_{TbV}(\eta, P_k)$ is $O(1)$-subexponential. We defer its proof to Section B.3.4.

Lemma 16. Suppose $\sigma$ is a constant. For any fixed $\eta \in [0, 1/L]$ and any $1 > \epsilon > 0$, with probability at least $1-\exp(-\Omega(\epsilon^2 m))$, $|\hat{F}_{TbV}(\eta) - F_{TbV}(\eta)| \le \epsilon$.
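Lemmas 7 and 16 both reduce pointwise generalization to the same mechanism: each per-task loss is an $O(1)$-subexponential random variable, so the average over $m$ i.i.d. tasks concentrates at rate $\exp(-\Omega(\epsilon^2 m))$ for $\epsilon < 1$. A quick Monte Carlo sketch with squared Gaussians, a standard $O(1)$-subexponential family (the margin below is many standard deviations of the sample mean, so the check is robust):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200_000
samples = rng.standard_normal(m) ** 2   # chi-squared(1): subexponential, mean 1

# std of the sample mean is sqrt(2/m) ~ 0.003, so a 0.05 deviation is ~15 sigma
assert abs(samples.mean() - 1.0) < 0.05
```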
Next, we show that there exists an $\epsilon$-net for $F_{TbV}$ with size $O(1/\epsilon)$. By $\epsilon$-net, we mean there exists a finite set $N$ of step sizes such that $|F_{TbV}(\eta) - F_{TbV}(\eta')| \le \epsilon$ for any $\eta \in [0, 1/L]$ and $\eta' \in \arg\min_{\eta'' \in N}|\eta - \eta''|$. We defer the proof of Lemma 17 to Section B.3.4.

Lemma 17. Suppose $\sigma$ is a constant. For any $1 > \epsilon > 0$, assume $d \ge c_4\log(1/\epsilon)$ for a constant $c_4$. There exists an $\epsilon$-net $N$ for $F_{TbV}$ with $|N| = O(1/\epsilon)$. That means, for any $\eta \in [0, 1/L]$, $|F_{TbV}(\eta) - F_{TbV}(\eta')| \le \epsilon$ for $\eta' \in \arg\min_{\eta'' \in N}|\eta - \eta''|$.

Next, we show that with high probability, there also exists an $\epsilon$-net for $\hat{F}_{TbV}$ with size $O(1/\epsilon)$.

Lemma 18. Suppose $\sigma$ is a constant. For any $1 > \epsilon > 0$, assume $d \ge c_4\log(1/\epsilon)$ for a constant $c_4$. With probability at least $1-\exp(-\Omega(\epsilon^2 m))$, there exists an $\epsilon$-net $\tilde{N}$ for $\hat{F}_{TbV}$ with $|\tilde{N}| = O(1/\epsilon)$. That means, for any $\eta \in [0, 1/L]$, $|\hat{F}_{TbV}(\eta) - \hat{F}_{TbV}(\eta')| \le \epsilon$ for $\eta' \in \arg\min_{\eta'' \in \tilde{N}}|\eta - \eta''|$.

Combining Lemma 16, Lemma 17 and Lemma 18, we now give the proof of Lemma 11.

Proof of Lemma 11. The proof is very similar to that of Lemma 4. By Lemma 16, we know that for any fixed $\eta$, with probability at least $1-\exp(-\Omega(\epsilon^2 m))$, $|\hat{F}_{TbV}(\eta) - F_{TbV}(\eta)| \le \epsilon$. By Lemma 17 and Lemma 18, we know that as long as $d = \Omega(\log(1/\epsilon))$, with probability at least $1-\exp(-\Omega(\epsilon^2 m))$, there exist $\epsilon$-nets $N$ and $\tilde{N}$ for $F_{TbV}$ and $\hat{F}_{TbV}$, respectively. Here, both $N$ and $\tilde{N}$ have size $O(1/\epsilon)$. According to the proofs of Lemma 17 and Lemma 18, it's not hard to verify that $N \cup \tilde{N}$ is still an $\epsilon$-net for both $\hat{F}_{TbV}$ and $F_{TbV}$. That means, for any $\eta \in [0, 1/L]$, we have $|F_{TbV}(\eta) - F_{TbV}(\eta')| \le \epsilon$ and $|\hat{F}_{TbV}(\eta) - \hat{F}_{TbV}(\eta')| \le \epsilon$ for $\eta' \in \arg\min_{\eta'' \in N\cup\tilde{N}}|\eta - \eta''|$. Taking a union bound over $N \cup \tilde{N}$, we have with probability at least $1 - O(1/\epsilon)\exp(-\Omega(\epsilon^2 m))$, $|\hat{F}_{TbV}(\eta) - F_{TbV}(\eta)| \le \epsilon$ for any $\eta \in N\cup\tilde{N}$. Overall, we know that with probability at least $1 - O(1/\epsilon)\exp(-\Omega(\epsilon^2 m))$, for all $\eta \in [0, 1/L]$,
\[ |F_{TbV}(\eta) - \hat{F}_{TbV}(\eta)| \le |F_{TbV}(\eta) - F_{TbV}(\eta')| + |\hat{F}_{TbV}(\eta) - \hat{F}_{TbV}(\eta')| + |\hat{F}_{TbV}(\eta') - F_{TbV}(\eta')| \le 3\epsilon, \]
where $\eta' \in \arg\min_{\eta'' \in N\cup\tilde{N}}|\eta - \eta''|$. Changing $\epsilon$ to $\epsilon/3$ finishes the proof.
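The $\epsilon$-net constructions in Lemmas 17 and 18 are instances of one elementary fact: a uniform grid of spacing $\epsilon/\ell$ is an $\epsilon$-net for any $\ell$-Lipschitz function on an interval, and its size is $O(\ell/\epsilon)$. A small sketch on an (un-truncated) inner loss $\eta \mapsto \frac{1}{2}\|w_{t,\eta} - w^*\|^2$, with the Lipschitz constant estimated numerically; the instance and constants are our own toy choices:

```python
import numpy as np

rng = np.random.default_rng(3)
t, L = 30, 2.0
lam = rng.uniform(1.0 / L, L, size=10)   # eigenvalues of H_train
wbar = rng.standard_normal(10)           # w_train in the eigenbasis
wstar = rng.standard_normal(10)

def loss(eta):
    w = (1.0 - (1.0 - eta * lam) ** t) * wbar
    return 0.5 * np.sum((w - wstar) ** 2)

a, b, eps = 0.0, 1.0 / L, 1e-2
dense = np.linspace(a, b, 100_001)
vals = np.array([loss(x) for x in dense])
lip = 1.1 * np.max(np.abs(np.diff(vals))) / (dense[1] - dense[0])

# uniform eps-net: spacing eps/lip guarantees |f(eta) - f(nearest net point)| <= eps
net = np.linspace(a, b, int(np.ceil((b - a) * lip / eps)) + 1)
idx = np.clip(np.round((dense - a) / (net[1] - net[0])).astype(int), 0, len(net) - 1)
max_dev = np.max(np.abs(vals - np.array([loss(x) for x in net])[idx]))
assert max_dev <= eps
```

The paper's nets refine this idea by using a larger spacing where the derivative bound $O(t)(1-\eta/L)^{t-1}$ has decayed, which is what brings the size down to $O(1/\epsilon)$ independently of $t$.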
B.3.3 LOWER BOUNDING F̂_TbV FOR η ∈ [1/L, ∞)

In this section, we prove $\hat{F}_{TbV}$ is large for any step size $\eta \ge 1/L$. Therefore, the optimal step size $\eta^*_{valid}$ must be smaller than $1/L$.

Lemma 12. Suppose $\sigma$ is a large constant. Assume $t \ge c_2$, $d \ge c_4\log(t)$ for some constants $c_2, c_4$. With probability at least $1-\exp(-\Omega(m))$,
\[ \hat{F}_{TbV}(\eta) \ge C'\sigma^2 + \frac{1}{2}\sigma^2, \quad \text{for all } \eta \ge 1/L, \]
where $C'$ is a positive constant independent of $\sigma$.

When the step size is very large (larger than $3L$), we know the GD sequence gets truncated with high probability, which immediately implies the loss is high.

Lemma 19. Suppose $\sigma$ is a large constant. Assume $t \ge c_2$, $d \ge c_4$ for some constants $c_2, c_4$. With probability at least $1-\exp(-\Omega(m))$, $\hat{F}_{TbV}(\eta) \ge \sigma^2$, for all $\eta \in [3L, \infty)$.

The case of step sizes within $[1/L, 3L]$ requires more effort. We give the proof of Lemma 20 later in this section.

Lemma 20. Suppose $\sigma$ is a large constant. Assume $t \ge c_2$, $d \ge c_4\log(t)$ for some constants $c_2, c_4$. With probability at least $1-\exp(-\Omega(m))$,
\[ \hat{F}_{TbV}(\eta) \ge C_4\sigma^2 + \frac{1}{2}\sigma^2, \quad \text{for all } \eta \in [1/L, 3L], \]
where $C_4$ is a positive constant independent of $\sigma$.

With the above two lemmas, Lemma 12 is just a combination of them.

Proof of Lemma 12. The result follows by taking a union bound and choosing $C' = \min(C_4, 1/2)$.

In the remainder of this section, we give the proof of Lemma 20. When the step size is between $1/L$ and $3L$, if the GD sequence has a reasonable probability of diverging, we can still show the loss is high, similarly as before. If not, we need to show the GD sequence overfits the noise in the training set, which incurs a high loss. Recall that the noise term is roughly $\frac{1}{2}\|(I - (I-\eta H_{train})^t)(X_{train})^\dagger \xi_{train}\|^2$. When $\eta \in [1/L, 3L]$, the eigenvalues of $I - \eta H_{train}$ on the $S_{train}$ subspace can be negative. If all $n$ non-zero eigenvalues of $H_{train}$ had the same value, there would exist a step size such that the eigenvalues of $I - \eta H_{train}$ on the subspace $S_{train}$ are all $-1$. If $t$ is even, the eigenvalues of $I - (I-\eta H_{train})^t$ on the $S_{train}$ subspace are then zero, which means the GD sequence does not catch any noise in $S_{train}$.
Notice that the above problematic case cannot happen when the eigenvalues of $H_{train}$ are spread out. Basically, when there are two different eigenvalues, there is no large $\eta$ that can cancel both directions at the same time.

Lemma 22. Suppose $\sigma$ is a large constant. There exist positive constants $C_1, C_2$ such that with probability at least $C_1$,
\[ \|B_{t,\eta} w_{train} - w^*\|^2_{H_{train}} \ge C_2\sigma^2, \quad \text{for all } \eta \in [1/L, 3L]. \]

Proof of Lemma 22. Let $E_1$ be the event that $\sqrt{d}/\sqrt{L} \le \sigma_i(X_{train}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{train}) \le L$ for all $i \in [n]$, and $\sqrt{d}\sigma/4 \le \|\xi_{train}\| \le \sqrt{d}\sigma$. Let $E_3$ be the event that $\sqrt{d}/\sqrt{L} \le \sigma_i(X_{valid}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{valid}) \le L$ for all $i \in [n]$, and $\sqrt{d}\sigma/4 \le \|\xi_{valid}\| \le \sqrt{d}\sigma$. According to Lemma 1 and Lemma 45, we know both $E_1$ and $E_3$ hold with probability at least $1-\exp(-\Omega(d))$.

Let the top $n$ eigenvalues of $H_{train}$ be $\lambda_1 \ge \cdots \ge \lambda_n$. According to Lemma 21, assuming $d$ is larger than a certain constant, we know there exist positive constants $\mu_1, \mu_2, \mu_3$ such that with probability at least $\mu_1$, $\lambda_{\mu_2 n} - \lambda_{n-\mu_2 n+1} \ge \mu_3$. Call this event $E_2$. Let $S_1$ and $S_2$ be the spans of the bottom and top $\mu_2 n$ eigenvectors of $H_{train}$, respectively.

According to Lemma 45, we know $\|\xi_{train}\| \ge \frac{\sqrt{d}}{4}\sigma$ with probability at least $1-\exp(-\Omega(d))$. Let $P_1 \in \mathbb{R}^{n\times n}$ be a rank-$\mu_2 n$ projection matrix such that the column span of $(X_{train})^\dagger P_1$ is $S_1$. By the Johnson-Lindenstrauss lemma, we know that with probability at least $1-\exp(-\Omega(d))$, $\|\mathrm{Proj}_{P_1}\xi_{train}\| \ge \frac{\sqrt{\mu_2}}{2}\|\xi_{train}\|$. Taking a union bound, with probability at least $1-\exp(-\Omega(d))$, $\|\mathrm{Proj}_{P_1}\xi_{train}\| \ge \frac{\sqrt{\mu_2 d}\,\sigma}{8}$. Similarly, we can define $P_2$ for the $S_2$ subspace and show that with probability at least $1-\exp(-\Omega(d))$, $\|\mathrm{Proj}_{P_2}\xi_{train}\| \ge \frac{\sqrt{\mu_2 d}\,\sigma}{8}$. Call the intersection of both events $E_4$, which happens with probability at least $1-\exp(-\Omega(d))$. Taking a union bound, we know $E_1 \cap E_2 \cap E_3 \cap E_4$ holds with probability at least $\mu_1/2$ as long as $d$ is larger than a certain constant. Throughout the proof, we assume $E_1 \cap E_2 \cap E_3 \cap E_4$ holds.
Let's first lower bound $\|B_{t,\eta} w_{train} - w^*_{train}\|$ as follows:
\[ \|B_{t,\eta} w_{train} - w^*_{train}\| = \big\|B_{t,\eta}\big(w^*_{train} + (X_{train})^\dagger \xi_{train}\big) - w^*_{train}\big\| \ge \big\|B_{t,\eta}\big(w^*_{train} + (X_{train})^\dagger \xi_{train}\big)\big\| - 1. \]
Recall that we define $S_1$ and $S_2$ as the spans of the bottom and top $\mu_2 n$ eigenvectors of $H_{train}$, respectively. We rely on $S_1$ to lower bound $\|w_{t,\eta} - w^*\|$ when $\eta$ is small and on $S_2$ when $\eta$ is large.

Case 1: Let $\sigma^{S_1}_{\min}(B_{t,\eta})$ be the smallest singular value of $B_{t,\eta}$ within the $S_1$ subspace. If $\eta\lambda_{n-\mu_2 n+1} \le 2 - \mu_3/(2L)$, we have
\[ \sigma^{S_1}_{\min}(B_{t,\eta}) \ge \min\Big(1 - \Big(1 - \frac{1}{L^2}\Big)^t,\; 1 - \Big(1 - \frac{\mu_3}{2L}\Big)^t\Big) \ge \frac{1}{2}, \]
where the second inequality assumes $t \ge \max(L^2, 2L/\mu_3)\log 2$. Then, we have
\[ \|w_{t,\eta} - w^*\| \ge \sigma^{S_1}_{\min}(B_{t,\eta})\,\|\mathrm{Proj}_{S_1}(X_{train})^\dagger \xi_{train}\| - 1 - 1 \ge \frac{1}{2}\cdot\frac{\sqrt{\mu_2}\,\sigma}{8\sqrt{L}} - 1 - 1 \ge \frac{\sqrt{\mu_2}\,\sigma}{32\sqrt{L}}, \]
where the second inequality uses $\|\mathrm{Proj}_{P_1}\xi_{train}\| \ge \frac{\sqrt{\mu_2 d}\,\sigma}{8}$ and the last inequality assumes $\sigma \ge \frac{48\sqrt{L}}{\sqrt{\mu_2}}$.

Case 2: If $\eta\lambda_{n-\mu_2 n+1} > 2 - \mu_3/(2L)$, we have $\eta\lambda_{\mu_2 n} \ge 2 + \mu_3/(2L)$, since $\lambda_{\mu_2 n} - \lambda_{n-\mu_2 n+1} \ge \mu_3$ and $\eta \ge 1/L$. Let $\sigma^{S_2}_{\min}(B_{t,\eta})$ be the smallest singular value of $B_{t,\eta}$ within the $S_2$ subspace. We have
\[ \sigma^{S_2}_{\min}(B_{t,\eta}) \ge \Big(1 + \frac{\mu_3}{2L}\Big)^t - 1 \ge \frac{1}{2}, \]
where the last inequality assumes $t \ge 4L/\mu_3$. Then, similarly as in Case 1, we can also prove $\|w_{t,\eta} - w^*\| \ge \frac{\sqrt{\mu_2}\,\sigma}{32\sqrt{L}}$.

Therefore, we have
\[ \|B_{t,\eta} w_{train} - w^*\|^2_{H_{train}} = \|B_{t,\eta} w_{train} - w^*_{train}\|^2_{H_{train}} \ge \frac{1}{L}\|B_{t,\eta} w_{train} - w^*_{train}\|^2 \ge \frac{\mu_2\sigma^2}{1024L^2}, \quad \text{for all } \eta \in [1/L, 3L]. \]
We denote $C_1 := \mu_1/2$ and $C_2 := \frac{\mu_2}{1024L^2}$.

Before we present the proof of Lemma 20, we still need a technical lemma showing that the noise in $S_{valid}$ concentrates around its mean. The proof of Lemma 23 is deferred to Section B.3.4.

Lemma 23. Suppose $\sigma$ is a constant. For any $1 > \epsilon > 0$, with probability at least $1 - O(t/\epsilon)\exp(-\Omega(\epsilon^2 d))$, $\lambda_n(H_{valid}) \ge 1/L$ and
\[ \|w_{t,\eta} - w_{valid}\|^2_{H_{valid}} \ge \|w_{t,\eta} - w^*\|^2_{H_{valid}} + (1-\epsilon)\sigma^2, \quad \text{for all } \eta \in [1/L, 3L]. \]
Combining the above lemmas, we give the proof of Lemma 20.

Proof of Lemma 20.
According to Lemma 23, given $1 > \epsilon > 0$, with probability at least $1 - O(t/\epsilon)\exp(-\Omega(\epsilon^2 d))$, $\lambda_n(H_{valid}) \ge 1/L$ and $\|w_{t,\eta} - w_{valid}\|^2_{H_{valid}} \ge \|w_{t,\eta} - w^*\|^2_{H_{valid}} + (1-\epsilon)\sigma^2$ for all $\eta \in [1/L, 3L]$. Call this event $E_1$. Suppose $\Pr[E_1] \ge 1 - \delta/2$, where $\delta$ will be specified later. For each training set $S^{(k)}_{train}$, we also define $E^{(k)}_1$. By concentration, we know that with probability at least $1-\exp(-\Omega(\delta^2 m))$, $\frac{1}{m}\sum_{k=1}^m \mathbb{1}\{E^{(k)}_1\} \ge 1-\delta$.

According to Lemma 22, we know there exist constants $C_1, C_2$ such that with probability at least $C_1$, $\|B_{t,\eta} w_{train} - w^*\|^2_{H_{train}} \ge C_2\sigma^2$ for all $\eta \in [1/L, 3L]$. Call this event $E_2$. For each training set $S^{(k)}_{train}$, we also define $E^{(k)}_2$. By concentration, we know that with probability at least $1-\exp(-\Omega(m))$, $\frac{1}{m}\sum_{k=1}^m \mathbb{1}\{E^{(k)}_2\} \ge C_1/2$.

For any step size $\eta \in [1/L, 3L]$, we can lower bound $\hat{F}_{TbV}(\eta)$ as follows:
\[ \hat{F}_{TbV}(\eta) = \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^{(k)}_{valid}\|^2_{H^{(k)}_{valid}} \ge \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^{(k)}_{valid}\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1\} \ge \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1\} + \frac{1}{2}(1-\epsilon)(1-\delta)\sigma^2 \ge \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2\} + \frac{1}{2}(1-\epsilon)(1-\delta)\sigma^2. \]
As long as $\delta \le C_1/4$, we know $\frac{1}{m}\sum_{k=1}^m \mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2\} \ge C_1/4$. Let $\bar{E}_3(\eta)$ be the event that $w^{(k)}_{t,\eta}$ gets truncated with step size $\eta$. We have
\[ \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2\} = \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2 \cap E^{(k)}_3\} + \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2 \cap \bar{E}^{(k)}_3\}. \]
If $\frac{1}{m}\sum_{k=1}^m \mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2 \cap \bar{E}^{(k)}_3\} \ge C_1/8$, we have
\[ \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2\} \ge \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2 \cap \bar{E}^{(k)}_3\} \ge \frac{C_1}{8}\times\frac{9\sigma^2}{2} = \frac{9C_1\sigma^2}{16}. \]
Here, we lower bound $\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{valid}}$ by $9\sigma^2$ when the sequence gets truncated. If $\frac{1}{m}\sum_{k=1}^m \mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2 \cap \bar{E}^{(k)}_3\} < C_1/8$, we know $\frac{1}{m}\sum_{k=1}^m \mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2 \cap E^{(k)}_3\} \ge C_1/8$.
Then, we have
\[ \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|w^{(k)}_{t,\eta} - w^*_k\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2\} \ge \frac{1}{m}\sum_{k=1}^m \frac{1}{2}\|B^{(k)}_{t,\eta} w^{(k)}_{train} - w^*_k\|^2_{H^{(k)}_{valid}}\mathbb{1}\{E^{(k)}_1 \cap E^{(k)}_2 \cap E^{(k)}_3\} \ge \frac{C_1}{8}\times\frac{C_2\sigma^2}{2} = \frac{C_1 C_2\sigma^2}{16}. \]
Letting $C_3 = \min(\frac{9C_1}{16}, \frac{C_1 C_2}{16})$, we then have
\[ \hat{F}_{TbV}(\eta) \ge C_3\sigma^2 + \frac{1}{2}(1-\epsilon)(1-\delta)\sigma^2 \ge \frac{C_3\sigma^2}{2} + \frac{1}{2}\sigma^2, \]
where the last inequality chooses $\delta = \epsilon = C_3/2$. In order for $\Pr[E_1] \ge 1 - \delta/2$, we only need $d \ge c_4\log(t)$ for some constant $c_4$. Replacing $C_3/2$ by $C_4$ finishes the proof.
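The key observation behind Lemma 22 — with two separated eigenvalues there is no single large step size for which $I - (I-\eta H)^t$ annihilates both directions — can be seen numerically: with even $t$, the factor $1 - (1-\eta\lambda)^t$ is small only when $\eta\lambda \approx 2$, and that cannot hold for two separated $\lambda$'s at once. A sketch with toy constants ($L = 2$ here, eigenvalues $1$ and $2$):

```python
import numpy as np

t = 20                                  # even unroll length
lams = (1.0, 2.0)                       # two separated eigenvalues of H_train
etas = np.linspace(0.5, 6.0, 50_001)    # the large-step-size range [1/L, 3L] with L = 2

# factor of B_{t,eta} in each eigendirection: |1 - (1 - eta*lam)^t|
f0 = np.abs(1.0 - (1.0 - etas * lams[0]) ** t)
f1 = np.abs(1.0 - (1.0 - etas * lams[1]) ** t)

# one direction alone can be nearly cancelled (near eta*lam = 2) ...
assert f0.min() < 0.1
# ... but at every step size, at least one direction keeps a large factor
assert np.maximum(f0, f1).min() >= 0.4
```

This is exactly the Case 1 / Case 2 split in the proof of Lemma 22: whichever direction is not cancelled retains a constant fraction of the projected noise.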

B.3.4 PROOFS OF TECHNICAL LEMMAS

Proof of Lemma 14. We first show that for a fixed $\eta \in [0, 1/L]$, the cross term $\langle B_{t,\eta} w^*_{train} - w^*, B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle$ is small with high probability. We can write down the cross term as follows:
\[ \langle B_{t,\eta} w^*_{train} - w^*,\; B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle = \big\langle [(X_{train})^\dagger]^\top B_{t,\eta}(B_{t,\eta} w^*_{train} - w^*),\; \xi_{train}\big\rangle. \]
Noticing that $\xi_{train}$ is independent of $[(X_{train})^\dagger]^\top B_{t,\eta}(B_{t,\eta} w^*_{train} - w^*)$, we will use Hoeffding's inequality to bound the cross term. According to Lemma 1, we know that with probability at least $1-\exp(-\Omega(d))$, $\sqrt{d}/\sqrt{L} \le \sigma_i(X_{train}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{train}) \le L$ for all $i \in [n]$, with $L = 100$. Since $\eta \le 1/L$, we know $\|B_{t,\eta}\| = \|I - (I-\eta H_{train})^t\| \le 1$. Therefore, we have
\[ \big\|[(X_{train})^\dagger]^\top B_{t,\eta}(B_{t,\eta} w^*_{train} - w^*)\big\| \le \frac{2\sqrt{L}}{\sqrt{d}}, \quad \text{for any } \eta \in [0, 1/L]. \]
Then, for any $\epsilon > 0$, by Hoeffding's inequality, with probability at least $1-\exp(-\Omega(\epsilon^2 d))$, $|\langle B_{t,\eta} w^*_{train} - w^*, B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle| \le \epsilon$.

Next, we construct an $\epsilon$-net over $\eta$ and show the cross term is small for all $\eta \in [0, 1/L]$. Let $g(\eta) := \langle B_{t,\eta} w^*_{train} - w^*, B_{t,\eta}(X_{train})^\dagger \xi_{train}\rangle$. We compute the derivative of $g(\eta)$ as follows:
\[ g'(\eta) = \big\langle tH_{train}(I-\eta H_{train})^{t-1} w^*_{train},\; B_{t,\eta}(X_{train})^\dagger \xi_{train}\big\rangle + \big\langle B_{t,\eta} w^*_{train} - w^*,\; tH_{train}(I-\eta H_{train})^{t-1}(X_{train})^\dagger \xi_{train}\big\rangle. \]
By Lemma 45, we know that with probability at least $1-\exp(-\Omega(d))$, $\|\xi_{train}\| \le \sqrt{d}\sigma$. Therefore,
\[ |g'(\eta)| \le L^{1.5}t\Big(1-\frac{\eta}{L}\Big)^{t-1}\sigma + 2L^{1.5}t\Big(1-\frac{\eta}{L}\Big)^{t-1}\sigma = 3L^{1.5}t\Big(1-\frac{\eta}{L}\Big)^{t-1}\sigma. \]
We can control $|g'(\eta)|$ in different regimes:
• For $\eta \in [0, \frac{L}{t-1}]$, we have $|g'(\eta)| \le 3L^{1.5}t\sigma$.
• Given any $1 \le i \le \log t - 1$, for any $\eta \in (\frac{iL}{t-1}, \frac{(i+1)L}{t-1}]$, we have $|g'(\eta)| \le \frac{3L^{1.5}t\sigma}{e^i}$.
• For any $\eta \in (\frac{L\log t}{t-1}, 1/L]$, we have $|g'(\eta)| \le 3L^{1.5}\sigma$.
Fix any $\epsilon > 0$; we know there exists an $\epsilon$-net $N$ with size
\[ |N| = \frac{1}{\epsilon}\cdot\frac{L}{t-1}\sum_{i=0}^{\log t - 1}\frac{3L^{1.5}t\sigma}{e^i} + \frac{1}{\epsilon}\Big(\frac{1}{L} - \frac{L\log t}{t-1}\Big)\cdot 3L^{1.5}\sigma \le \frac{1}{\epsilon}\cdot\frac{3eL^{2.5}t\sigma}{t-1} + \frac{3\sqrt{L}\sigma}{\epsilon} = O\Big(\frac{1}{\epsilon}\Big) \]
such that for any $\eta \in [0, 1/L]$, there exists $\eta' \in N$ with $|g(\eta) - g(\eta')| \le \epsilon$.
Note that $L = 100$ and $\sigma$ is a constant. Taking a union bound over $N$ and all the other bad events, with probability at least $1 - \exp(-\Omega(d)) - O(1/\epsilon)\exp(-\Omega(\epsilon^2 d))$, for all $\eta \in [0, 1/L]$,
$$|\langle B_{t,\eta} w^*_{\text{train}} - w^*,\; B_{t,\eta}(X_{\text{train}})^\dagger \xi_{\text{train}}\rangle| \le \epsilon + \epsilon = 2\epsilon.$$
As long as $1 > \epsilon > 0$, this happens with probability at least $1 - O(1/\epsilon)\exp(-\Omega(\epsilon^2 d))$.

By the Johnson–Lindenstrauss lemma (Lemma 49), with probability at least $1 - 2\exp(-c\epsilon^2 d/4)$, $\|w^*_{\text{train}}\| \ge \frac12(1-\epsilon)\|w^*\| = \frac12(1-\epsilon)$. Then, with probability at least $1 - 2\exp(-c\epsilon^2 d/4) - \exp(-\Omega(d))$,
$$\langle B_{t,\eta} w^*_{\text{train}}, w^*\rangle \ge \Big(1 - \exp\Big(-\frac{\eta t}{L}\Big)\Big)\|w^*_{\text{train}}\|^2 \ge \Big(1 - \exp\Big(-\frac{\eta t}{L}\Big)\Big)\cdot\frac14(1-\epsilon)^2 = \frac{(1-\epsilon)^2}{4}\Big(1 - \exp\Big(-\frac{\eta t}{L}\Big)\Big).$$
Since $e^x \le 1 + x + x^2/2$ for any $x \le 0$, we have $\exp(-\eta t/L) \le 1 - \eta t/L + \eta^2 t^2/(2L^2)$; hence for any $\eta \le L/t$, $\exp(-\eta t/L) \le 1 - \eta t/(2L)$. Then with probability at least $1 - 2\exp(-c\epsilon^2 d/4) - \exp(-\Omega(d))$,
$$\langle B_{t,\eta} w^*_{\text{train}}, w^*\rangle \ge \frac{(1-\epsilon)^2}{4}\cdot\frac{\eta t}{2L} \ge \frac{\eta t}{16L},$$
where the second inequality holds by choosing $\epsilon = 1/4$.

Proof of Lemma 16. Recall that $\hat F_{\text{TbV}}(\eta) := \frac1m\sum_{k=1}^m \Delta_{\text{TbV}}(\eta, P_k)$. For each individual loss function $\Delta_{\text{TbV}}(\eta, P_k)$, we have
$$\Delta_{\text{TbV}}(\eta, P_k) = \frac12\big\|w^{(k)}_{t,\eta} - w^* - (X^{(k)}_{\text{valid}})^\dagger \xi^{(k)}_{\text{valid}}\big\|^2_{H^{(k)}_{\text{valid}}} = \frac12\|w^{(k)}_{t,\eta} - w^*\|^2_{H^{(k)}_{\text{valid}}} + \frac{1}{2n}\|\xi^{(k)}_{\text{valid}}\|^2 + \Big\langle w^{(k)}_{t,\eta} - w^*,\; \frac1n (X^{(k)}_{\text{valid}})^\top \xi^{(k)}_{\text{valid}}\Big\rangle \le \frac{25L\sigma^2}{2}\|H^{(k)}_{\text{valid}}\| + \frac{1}{2n}\|\xi^{(k)}_{\text{valid}}\|^2 + 5\sqrt L\,\sigma\Big\|\frac{1}{\sqrt n}X^{(k)}_{\text{valid}}\Big\|\cdot\Big\|\frac{1}{\sqrt n}\xi^{(k)}_{\text{valid}}\Big\|.$$
Note that $\|H^{(k)}_{\text{valid}}\| = \sigma^2_{\max}(\frac{1}{\sqrt n}X^{(k)}_{\text{valid}})$. By Lemma 47, $\sigma_{\max}(X^{(k)}_{\text{valid}}) - \mathbb E\,\sigma_{\max}(X^{(k)}_{\text{valid}})$ is $O(1)$-subgaussian, which implies that $\sigma_{\max}(\frac{1}{\sqrt n}X^{(k)}_{\text{valid}}) - \mathbb E\,\sigma_{\max}(\frac{1}{\sqrt n}X^{(k)}_{\text{valid}})$ is $O(1/\sqrt d)$-subgaussian. Since $\mathbb E\,\sigma_{\max}(\frac{1}{\sqrt n}X^{(k)}_{\text{valid}})$ is a constant, $\sigma_{\max}(\frac{1}{\sqrt n}X^{(k)}_{\text{valid}})$ is $O(1)$-subgaussian and $\sigma^2_{\max}(\frac{1}{\sqrt n}X^{(k)}_{\text{valid}})$ is $O(1)$-subexponential. Similarly, both $\frac{1}{2n}\|\xi^{(k)}_{\text{valid}}\|^2$ and $\|\frac{1}{\sqrt n}X^{(k)}_{\text{valid}}\|\cdot\|\frac{1}{\sqrt n}\xi^{(k)}_{\text{valid}}\|$ are $O(1)$-subexponential. This further implies that $\Delta_{\text{TbV}}(\eta, P_k)$ is $O(1)$-subexponential. Therefore, $\hat F_{\text{TbV}}$ is the average of $m$ i.i.d.
$O(1)$-subexponential random variables. By a standard concentration inequality, for any $1 > \epsilon > 0$, with probability at least $1 - \exp(-\Omega(\epsilon^2 m))$, $|\hat F_{\text{TbV}}(\eta) - F_{\text{TbV}}(\eta)| \le \epsilon$.

Proof of Lemma 17. Recall that $F_{\text{TbV}}(\eta) = \mathbb E\,\frac12\|w_{t,\eta} - w^*\|^2 + \sigma^2/2$. We only need to construct an $\epsilon$-net for $\mathbb E\,\frac12\|w_{t,\eta} - w^*\|^2$. Let $E$ be the event that $\sqrt d/\sqrt L \le \sigma_i(X_{\text{train}}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{\text{train}}) \le L$ for all $i \in [n]$, and $\|\xi_{\text{train}}\| \le \sqrt d\,\sigma$. We have
$$\mathbb E\,\tfrac12\|w_{t,\eta} - w^*\|^2 = \mathbb E\big[\tfrac12\|w_{t,\eta} - w^*\|^2 \,\big|\, E\big]\Pr[E] + \mathbb E\big[\tfrac12\|w_{t,\eta} - w^*\|^2 \,\big|\, \bar E\big]\Pr[\bar E].$$
We first construct an $\epsilon$-net for $\mathbb E[\frac12\|w_{t,\eta} - w^*\|^2 \mid E]\Pr[E]$. Let $Q(\eta) := \frac12\|w_{t,\eta} - w^*\|^2$ and fix a training set $S_{\text{train}}$ on which event $E$ holds. We show that $Q(\eta)$ has the desired Lipschitz property. Its derivative is
$$Q'(\eta) = \big\langle tH_{\text{train}}(I - \eta H_{\text{train}})^{t-1} w_{\text{train}},\; w_{t,\eta} - w^*\big\rangle.$$
Conditioned on $E$, $|Q'(\eta)| = O(1)\,t(1 - \frac\eta L)^{t-1}$, and therefore
$$\Big|\frac{\partial}{\partial\eta}\,\mathbb E\big[\tfrac12\|w_{t,\eta} - w^*\|^2 \,\big|\, E\big]\Pr[E]\Big| = O(1)\,t\Big(1 - \frac\eta L\Big)^{t-1}.$$
As in Lemma 14, for any $\epsilon > 0$ there exists an $\epsilon$-net $N$ of size $O(1/\epsilon)$ such that for any $\eta \in [0, 1/L]$,
$$\big|\mathbb E[\tfrac12\|w_{t,\eta} - w^*\|^2 \mid E]\Pr[E] - \mathbb E[\tfrac12\|w_{t,\eta'} - w^*\|^2 \mid E]\Pr[E]\big| \le \epsilon \quad \text{for } \eta' \in \arg\min_{\eta'' \in N}|\eta - \eta''|.$$
Let $\delta := \Pr[\bar E]$. We have
$$\mathbb E\big[\tfrac12\|w_{t,\eta} - w^*\|^2 \,\big|\, \bar E\big]\Pr[\bar E] \le \frac{25L\sigma^2}{2}\,\delta \le \epsilon,$$
where the last inequality assumes $\delta \le \frac{2\epsilon}{25L\sigma^2}$. By Lemma 1 and Lemma 45, $\delta \le \exp(-\Omega(d))$; hence for any $\epsilon > 0$ there exists a constant $c_4$ such that $\delta \le \frac{2\epsilon}{25L\sigma^2}$ as long as $d \ge c_4\log(1/\epsilon)$. Overall, for any $\epsilon > 0$, as long as $d = \Omega(\log(1/\epsilon))$, there exists $N$ of size $O(1/\epsilon)$ such that for any $\eta \in [0, 1/L]$, $|F_{\text{TbV}}(\eta) - F_{\text{TbV}}(\eta')| \le 3\epsilon$ for $\eta' \in \arg\min_{\eta'' \in N}|\eta - \eta''|$. Replacing $\epsilon$ by $\epsilon/3$ finishes the proof.

Proof of Lemma 18. For each $k \in [m]$, let $E_k$ be the event that $\sqrt d/\sqrt L \le \sigma_i(X^{(k)}_{\text{train}}) \le \sqrt{Ld}$ for all $i \in [n]$ and $\|\xi^{(k)}_{\text{train}}\| \le \sqrt d\,\sigma$. Then we can write the empirical meta-objective as
$$\hat F_{\text{TbV}}(\eta) := \frac1m\sum_{k=1}^m \Delta_{\text{TbV}}(\eta, P_k)\mathbf 1_{E_k} + \frac1m\sum_{k=1}^m \Delta_{\text{TbV}}(\eta, P_k)\mathbf 1_{\bar E_k}.$$
Similarly to Lemma 17, we show that the first term has the desired Lipschitz property and that the second term is small. Let us focus on the first term $\frac1m\sum_{k=1}^m \Delta_{\text{TbV}}(\eta, P_k)\mathbf 1_{E_k}$. Recall that
$$\Delta_{\text{TbV}}(\eta, P_k) = \frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{valid}}\|^2_{H^{(k)}_{\text{valid}}} = \frac12\big\|B^{(k)}_{t,\eta} w^{(k)}_{\text{train}} - w^* - (X^{(k)}_{\text{valid}})^\dagger\xi^{(k)}_{\text{valid}}\big\|^2_{H^{(k)}_{\text{valid}}}.$$
Computing the derivative of $\Delta_{\text{TbV}}(\eta, P_k)$ with respect to $\eta$, we have
$$\frac{\partial}{\partial\eta}\Delta_{\text{TbV}}(\eta, P_k) = \Big\langle tH^{(k)}_{\text{train}}(I - \eta H^{(k)}_{\text{train}})^{t-1} w^{(k)}_{\text{train}},\; H^{(k)}_{\text{valid}}\big(w^{(k)}_{t,\eta} - w^* - (X^{(k)}_{\text{valid}})^\dagger\xi^{(k)}_{\text{valid}}\big)\Big\rangle.$$
Conditioned on $E_k$, we can bound the derivative:
$$\Big|\frac{\partial}{\partial\eta}\Delta_{\text{TbV}}(\eta, P_k)\Big| = O(1)\,t\Big(1-\frac\eta L\Big)^{t-1}\Big(\|H^{(k)}_{\text{valid}}\| + \Big\|\frac{1}{\sqrt d}X^{(k)}_{\text{valid}}\Big\|\cdot\Big\|\frac{1}{\sqrt d}\xi^{(k)}_{\text{valid}}\Big\|\Big).$$
Therefore,
$$\Big|\frac1m\sum_{k=1}^m \frac{\partial}{\partial\eta}\Delta_{\text{TbV}}(\eta, P_k)\mathbf 1_{E_k}\Big| = O(1)\,t\Big(1-\frac\eta L\Big)^{t-1}\cdot\frac1m\sum_{k=1}^m\Big(\|H^{(k)}_{\text{valid}}\| + \Big\|\frac{1}{\sqrt d}X^{(k)}_{\text{valid}}\Big\|\cdot\Big\|\frac{1}{\sqrt d}\xi^{(k)}_{\text{valid}}\Big\|\Big).$$
As in Lemma 16, both $\|H^{(k)}_{\text{valid}}\|$ and $\|\frac{1}{\sqrt d}X^{(k)}_{\text{valid}}\|\cdot\|\frac{1}{\sqrt d}\xi^{(k)}_{\text{valid}}\|$ are $O(1)$-subexponential. Therefore, with probability at least $1 - \exp(-\Omega(m))$,
$$\frac1m\sum_{k=1}^m\Big(\|H^{(k)}_{\text{valid}}\| + \Big\|\frac{1}{\sqrt d}X^{(k)}_{\text{valid}}\Big\|\cdot\Big\|\frac{1}{\sqrt d}\xi^{(k)}_{\text{valid}}\Big\|\Big) = O(1).$$
This further shows that with probability at least $1 - \exp(-\Omega(m))$, $|\frac1m\sum_{k=1}^m \frac{\partial}{\partial\eta}\Delta_{\text{TbV}}(\eta, P_k)\mathbf 1_{E_k}| = O(1)\,t(1-\frac\eta L)^{t-1}$. As in Lemma 14, it follows that for any $\epsilon > 0$ there exists an $\epsilon$-net of size $O(1/\epsilon)$ for $\frac1m\sum_{k=1}^m \Delta_{\text{TbV}}(\eta, P_k)\mathbf 1_{E_k}$.

Next, we show that the second term $\frac1m\sum_{k=1}^m \Delta_{\text{TbV}}(\eta, P_k)\mathbf 1_{\bar E_k}$ is small with high probability. By the proof of Lemma 16,
$$\Delta_{\text{TbV}}(\eta, P_k) = O(1)\Big(\|H^{(k)}_{\text{valid}}\| + \frac1d\|\xi^{(k)}_{\text{valid}}\|^2 + \Big\|\frac{1}{\sqrt d}X^{(k)}_{\text{valid}}\Big\|\cdot\Big\|\frac{1}{\sqrt d}\xi^{(k)}_{\text{valid}}\Big\|\Big).$$
Therefore, there exists a constant $C$ such that
$$\frac1m\sum_{k=1}^m \Delta_{\text{TbV}}(\eta, P_k)\mathbf 1_{\bar E_k} \le C\cdot\frac1m\sum_{k=1}^m\Big(\|H^{(k)}_{\text{valid}}\| + \frac1d\|\xi^{(k)}_{\text{valid}}\|^2 + \Big\|\frac{1}{\sqrt d}X^{(k)}_{\text{valid}}\Big\|\cdot\Big\|\frac{1}{\sqrt d}\xi^{(k)}_{\text{valid}}\Big\|\Big)\mathbf 1_{\bar E_k}.$$
It is not hard to verify that $\big(\|H^{(k)}_{\text{valid}}\| + \frac1d\|\xi^{(k)}_{\text{valid}}\|^2 + \|\frac{1}{\sqrt d}X^{(k)}_{\text{valid}}\|\cdot\|\frac{1}{\sqrt d}\xi^{(k)}_{\text{valid}}\|\big)\mathbf 1_{\bar E_k}$ is $O(1)$-subexponential. Let $\mu$ be the expectation of $\|H^{(k)}_{\text{valid}}\| + \frac1d\|\xi^{(k)}_{\text{valid}}\|^2 + \|\frac{1}{\sqrt d}X^{(k)}_{\text{valid}}\|\cdot\|\frac{1}{\sqrt d}\xi^{(k)}_{\text{valid}}\|$, which is a constant.
Let $\delta$ be the probability of $\bar E_k$. Since $\bar E_k$ depends only on the training set while the quantity above depends only on the validation set, the two are independent, so the expectation of the quantity times $\mathbf 1_{\bar E_k}$ is $\mu\delta$. By a standard concentration inequality, for any $1 > \epsilon > 0$, with probability at least $1 - \exp(-\Omega(\epsilon^2 m))$,
$$C\cdot\frac1m\sum_{k=1}^m\Big(\|H^{(k)}_{\text{valid}}\| + \frac1d\|\xi^{(k)}_{\text{valid}}\|^2 + \Big\|\frac{1}{\sqrt d}X^{(k)}_{\text{valid}}\Big\|\cdot\Big\|\frac{1}{\sqrt d}\xi^{(k)}_{\text{valid}}\Big\|\Big)\mathbf 1_{\bar E_k} \le C\mu\delta + C\epsilon \le (C+1)\epsilon,$$
where the second inequality assumes $\delta \le \epsilon/(C\mu)$. By Lemma 1 and Lemma 45, $\delta \le \exp(-\Omega(d))$; hence $\delta \le \epsilon/(C\mu)$ as long as $d \ge c_4\log(1/\epsilon)$ for some constant $c_4$. Overall, as long as $d \ge c_4\log(1/\epsilon)$, with probability at least $1 - \exp(-\Omega(\epsilon^2 m))$ there exists $N$ with $|N| = O(1/\epsilon)$ such that for any $\eta \in [0, 1/L]$, $|\hat F_{\text{TbV}}(\eta) - \hat F_{\text{TbV}}(\eta')| \le (2C+3)\epsilon$ for $\eta' \in \arg\min_{\eta''\in N}|\eta - \eta''|$. Replacing $\epsilon$ by $\epsilon/(2C+3)$ finishes the proof.

Proof of Lemma 19. Let $E_1$ be the event that $\sqrt d/\sqrt L \le \sigma_i(X_{\text{train}}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{\text{train}}) \le L$ for all $i \in [n]$, and $\sqrt d\,\sigma/4 \le \|\xi_{\text{train}}\| \le \sqrt d\,\sigma$. Let $E_2$ be the event that $\sqrt d/\sqrt L \le \sigma_i(X_{\text{valid}}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{\text{valid}}) \le L$ for all $i \in [n]$, and $\sqrt d\,\sigma/4 \le \|\xi_{\text{valid}}\| \le \sqrt d\,\sigma$. By Lemma 1 and Lemma 45, both $E_1$ and $E_2$ hold with probability at least $1 - \exp(-\Omega(d))$. Assuming $d \ge c_4$ for a certain constant $c_4$, we have $\Pr[E_1 \cap E_2] \ge 2/3$. Define $E^{(k)}_1$ and $E^{(k)}_2$ analogously on each training set $S^{(k)}_{\text{train}}$. By concentration, with probability at least $1 - \exp(-\Omega(m))$,
$$\frac1m\sum_{k=1}^m \mathbf 1\{E^{(k)}_1 \cap E^{(k)}_2\} \ge \frac12.$$
It is easy to verify that conditioned on $E_1$, the GD sequence always exceeds the norm threshold and gets truncated for $\eta \ge 3L$, as long as $t$ is larger than a certain constant. We can then lower bound $\hat F_{\text{TbV}}$ for any $\eta \ge 3L$:
$$\hat F_{\text{TbV}}(\eta) = \frac1m\sum_{k=1}^m\frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{valid}}\|^2_{H^{(k)}_{\text{valid}}} \ge \frac1m\sum_{k=1}^m\frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{valid}}\|^2_{H^{(k)}_{\text{valid}}}\mathbf 1\{E^{(k)}_1 \cap E^{(k)}_2\} \ge 2\sigma^2\cdot\frac12 = \sigma^2,$$
where the last inequality lower bounds $\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{valid}}\|^2_{H^{(k)}_{\text{valid}}}$ by $2\sigma^2$ when $w^{(k)}_{t,\eta}$ gets truncated.

Proof of Lemma 21.
We first show that with constant probability over $X_{\text{train}}$, the variance of the eigenvalues of $H_{\text{train}}$ is lower bounded by a constant. Let $\bar\lambda := \frac1n\sum_{i=1}^n\lambda_i$; specifically, we show that $\frac1n\sum_{i=1}^n\lambda_i^2 - \bar\lambda^2$ is lower bounded by a constant. Let us first compute the variance of the eigenvalues in expectation. Let the $i$-th row of $X_{\text{train}}$ be $x_i$. We have
$$\mathbb E\,\bar\lambda^2 = \frac{1}{n^2}\mathbb E\Big[\mathrm{tr}\Big(\frac1n X_{\text{train}}^\top X_{\text{train}}\Big)\Big]^2 = \frac{1}{n^4}\mathbb E\Big(\sum_{i=1}^n\|x_i\|^2\Big)^2 = \frac{1}{n^4}\sum_{i=1}^n\mathbb E\|x_i\|^4 + \frac{1}{n^4}\sum_{1\le i\ne j\le n}\mathbb E\,\|x_i\|^2\|x_j\|^2 = \frac{1}{n^4}\big(nd(d+2) + n(n-1)d^2\big) = \frac{d^2}{n^2} + \frac{2d}{n^3}.$$
Similarly, we compute $\mathbb E\,\frac1n\sum_{i=1}^n\lambda_i^2$:
$$\mathbb E\,\frac1n\sum_{i=1}^n\lambda_i^2 = \frac{1}{n^3}\mathbb E\,\mathrm{tr}\big(X_{\text{train}}^\top X_{\text{train}} X_{\text{train}}^\top X_{\text{train}}\big) = \frac{1}{n^3}\sum_{i=1}^n\mathbb E\|x_i\|^4 + \frac{1}{n^3}\sum_{1\le i\ne j\le n}\mathbb E\langle x_i, x_j\rangle^2 = \frac{1}{n^3}\big(nd(d+2) + n(n-1)d\big) = \frac{d^2}{n^2} + \frac{d}{n} + \frac{d}{n^2}.$$
Therefore,
$$\mathbb E\Big[\frac1n\sum_{i=1}^n\lambda_i^2 - \bar\lambda^2\Big] = \frac dn + \frac{d}{n^2} - \frac{2d}{n^3} \ge \frac dn \ge \frac43,$$
where the first inequality assumes $n \ge 2$ and the last inequality uses $n \le \frac{3d}4$. Since $n \ge \frac14 d$, we have $n \ge 2$ as long as $d \ge 8$.

Let $E$ be the event that $\sqrt d/\sqrt L \le \sigma_i(X_{\text{train}}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{\text{train}}) \le L$ for all $i \in [n]$, with $L = 100$. By Lemma 1, $E$ happens with probability at least $1 - \exp(-\Omega(d))$. Let $\mathbf 1\{E\}$ be the indicator function of $E$. Next we show that $\mathbb E[\frac1n\sum_{i=1}^n(\lambda_i - \bar\lambda)^2\mathbf 1\{E\}]$ is also lower bounded. Clearly $\mathbb E[\bar\lambda^2\mathbf 1\{E\}]$ is upper bounded by $\mathbb E\,\bar\lambda^2$. In order to lower bound $\mathbb E[\frac1n\sum_{i=1}^n\lambda_i^2\mathbf 1\{E\}]$, we first show that $\mathbb E[\frac1n\sum_{i=1}^n\lambda_i^2\mathbf 1_{\bar E}]$ is small. We decompose it into two parts:
$$\mathbb E\Big[\frac1n\sum_{i=1}^n\lambda_i^2\,\mathbf 1\{\bar E \text{ and } \lambda_1 \le L\}\Big] + \mathbb E\Big[\frac1n\sum_{i=1}^n\lambda_i^2\,\mathbf 1\{\lambda_1 > L\}\Big].$$
The first term can be bounded by $L^2\Pr[\bar E]$. Since $\Pr[\bar E] \le \exp(-\Omega(d))$, the first term is at most $1/6$ as long as $d$ is larger than a certain constant. The second term can be bounded by $\mathbb E[\lambda_1^2\mathbf 1\{\lambda_1 > L\}]$. By Lemma 48, $\Pr[\lambda_1 \ge L + t] \le \exp(-\Omega(dt))$; it is then not hard to verify that $\mathbb E[\lambda_1^2\mathbf 1\{\lambda_1 > L\}] = O(1/d)$, which is bounded by $1/6$ as long as $d$ is larger than a certain constant.
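The two Gaussian moment identities used in the displays above, $\mathbb E\|x\|^4 = d(d+2)$ and $\mathbb E\langle x_i, x_j\rangle^2 = d$ for independent $x \sim \mathcal N(0, I_d)$, can be checked numerically. The following Monte Carlo sketch is not part of the proof; the dimension and sample sizes are illustrative choices.

```python
import numpy as np

# Monte Carlo check of the moment identities for standard Gaussian rows:
#   E||x||^4 = d(d+2)   and   E<x, y>^2 = d   for independent x, y ~ N(0, I_d).
rng = np.random.default_rng(0)
d, trials = 10, 200_000
x = rng.standard_normal((trials, d))
y = rng.standard_normal((trials, d))

fourth_moment = np.mean(np.sum(x**2, axis=1) ** 2)  # estimates E||x||^4
cross_moment = np.mean(np.sum(x * y, axis=1) ** 2)  # estimates E<x, y>^2

print(fourth_moment, d * (d + 2))  # estimate vs. exact value 120
print(cross_moment, d)             # estimate vs. exact value 10
```

With these identities, the closed forms $\mathbb E\,\bar\lambda^2 = d^2/n^2 + 2d/n^3$ and $\mathbb E\,\frac1n\sum_i\lambda_i^2 = d^2/n^2 + d/n + d/n^2$ follow by direct summation over the $n$ rows.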
Overall, $\mathbb E[\frac1n\sum_{i=1}^n\lambda_i^2\mathbf 1\{E\}] \ge \mathbb E[\frac1n\sum_{i=1}^n\lambda_i^2] - 1/3$. Combining with the upper bound on $\mathbb E[\bar\lambda^2\mathbf 1\{E\}]$, we have $\mathbb E[\frac1n\sum_{i=1}^n(\lambda_i - \bar\lambda)^2\mathbf 1\{E\}] \ge 1$. Conditioned on $E$, $\lambda_i$ is bounded by $L$ for all $i \in [n]$, so $\frac1n\sum_{i=1}^n(\lambda_i - \bar\lambda)^2$ is bounded; in order for $\mathbb E[\frac1n\sum_{i=1}^n(\lambda_i - \bar\lambda)^2\mathbf 1\{E\}]$ to be lower bounded by one, there must exist positive constants $\mu_1, \mu_2$ such that with probability at least $\mu_1$, $E$ holds and $\frac1n\sum_{i=1}^n(\lambda_i - \bar\lambda)^2 \ge \mu_2$. When $\frac1n\sum_{i=1}^n(\lambda_i - \bar\lambda)^2 \ge \mu_2$ and $\lambda_i \le L$ for all $i \in [n]$, there exists a subset of eigenvalues $S \subset \{\lambda_i\}_{i=1}^n$ of size $\mu_3 n$ such that $|\lambda_i - \bar\lambda| \ge \mu_4$ for all $\lambda_i \in S$, where $\mu_3, \mu_4$ are both positive constants. If at least half of the eigenvalues in $S$ are larger than $\bar\lambda$, then at least $\frac{\mu_3\mu_4 n}{2L}$ eigenvalues are smaller than $\bar\lambda$; otherwise the average of the eigenvalues would be larger than $\bar\lambda$, contradicting the definition of $\bar\lambda$. Similarly, if at least half of the eigenvalues in $S$ are smaller than $\bar\lambda$, then at least $\frac{\mu_3\mu_4 n}{2L}$ eigenvalues are larger than $\bar\lambda$. Denote $\mu_5 := \frac{\mu_3\mu_4}{2L}$. We conclude that $\lambda_{\mu_5 n} - \lambda_{n - \mu_5 n + 1} \ge \mu_4$.

Proof of Lemma 23. Let $E_1$ be the event that $\sqrt d/\sqrt L \le \sigma_i(X_{\text{train}}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{\text{train}}) \le L$ for all $i \in [n]$, and $\sqrt d\,\sigma/4 \le \|\xi_{\text{train}}\| \le \sqrt d\,\sigma$. Let $E_3$ be the event that $\sqrt d/\sqrt L \le \sigma_i(X_{\text{valid}}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{\text{valid}}) \le L$ for all $i \in [n]$, and $\sqrt d\,\sigma/4 \le \|\xi_{\text{valid}}\| \le \sqrt d\,\sigma$. By Lemma 1 and Lemma 45, both $E_1$ and $E_3$ hold with probability at least $1 - \exp(-\Omega(d))$. In this proof we assume both properties hold and take a union bound at the end. We lower bound $\|w_{t,\eta} - w_{\text{valid}}\|^2_{H_{\text{valid}}}$ as follows:
$$\|w_{t,\eta} - w_{\text{valid}}\|^2_{H_{\text{valid}}} = \|w_{t,\eta} - w^* - (X_{\text{valid}})^\dagger\xi_{\text{valid}}\|^2_{H_{\text{valid}}} \ge \|w_{t,\eta} - w^*\|^2_{H_{\text{valid}}} + \frac1n\|\xi_{\text{valid}}\|^2 - 2\big\langle w_{t,\eta} - w^*,\; H_{\text{valid}}(X_{\text{valid}})^\dagger\xi_{\text{valid}}\big\rangle.$$
For the second term, by Lemma 45, for any $1 > \epsilon > 0$, with probability at least $1 - \exp(-\Omega(\epsilon^2 d))$, $\frac1n\|\xi_{\text{valid}}\|^2 \ge (1-\epsilon)\sigma^2$. The third term can be written as $\langle[(X_{\text{valid}})^\dagger]^\top H_{\text{valid}}(w_{t,\eta} - w^*),\; \xi_{\text{valid}}\rangle$.
Since $\sigma$ is a constant, $\|[(X_{\text{valid}})^\dagger]^\top H_{\text{valid}}(w_{t,\eta} - w^*)\| = O(1/\sqrt d)$. Therefore, for a fixed $\eta \in [1/L, 3L]$, with probability at least $1 - \exp(-\Omega(\epsilon^2 d))$, $|\langle w_{t,\eta} - w^*, H_{\text{valid}}(X_{\text{valid}})^\dagger\xi_{\text{valid}}\rangle| \le \epsilon$.

By Lemma 24 and Lemma 25, when $t$ is reasonably large, $\hat F_{\text{TbT}}(\eta)$ is larger than $\hat F_{\text{TbT}}(2/3)$ for all step sizes $\eta > \bar\eta$. This means the optimal step size must lie in $[0, \bar\eta]$. In Lemma 26, we show a generalization result for $\eta \in [0, \bar\eta]$; this serves as step 3 in Section B.1. We prove this lemma in Section C.3.

Lemma 26. Let $\bar\eta$ be as defined in Definition 2 with $1 > \epsilon > 0$. Suppose $\sigma$ is a constant. Assume $n \ge c\log(\frac nd)d$, $t \ge c_2$, $d \ge c_4$ for some constants $c, c_2, c_4$. With probability at least $1 - m\exp(-\Omega(n)) - O(\frac{tn^2}{\epsilon d} + m)\exp(-\Omega(m\epsilon^4 d^2/n^2))$,
$$|F_{\text{TbT}}(\eta) - \hat F_{\text{TbT}}(\eta)| \le \frac{17\epsilon^2 d\sigma^2}{n} \quad \text{for all } \eta \in [0, \bar\eta].$$

Combining Lemma 24, Lemma 25 and Lemma 26, we present the proof of Theorem 6 as follows.

Proof of Theorem 6. By Lemma 24, assuming $n \ge 40d$, for any $1/2 > \epsilon > 0$, with probability at least $1 - m\exp(-\Omega(n)) - \exp(-\Omega(\epsilon^4 md/n))$,
$$\hat F_{\text{TbT}}(2/3) \le 20\Big(1 - \frac13\Big)^{2t}\sigma^2 + \frac{n-d}{2n}\sigma^2 + \frac{\epsilon^2 d\sigma^2}{20n}.$$
As long as $t \ge c_2\log(\frac nd)$ for a certain constant $c_2$, we have
$$\hat F_{\text{TbT}}(2/3) \le \frac{n-d}{2n}\sigma^2 + \frac{7\epsilon^2 d\sigma^2}{100n}.$$
Let $\bar\eta$ be as defined in Definition 2 with the same $\epsilon$. By Lemma 25, as long as $n \ge cd$, $t \ge c_2$, $d \ge c_4$, with probability at least $1 - \exp(-\Omega(\epsilon^4 md^2/n^2))$,
$$\hat F_{\text{TbT}}(\eta) \ge \frac{\epsilon^2 d\sigma^2}{8n} + \frac{n-d}{2n}\sigma^2 - \frac{\epsilon^2 d\sigma^2}{20n} = \frac{n-d}{2n}\sigma^2 + \frac{7.5\,\epsilon^2 d\sigma^2}{100n}$$
for all $\eta > \bar\eta$. Hence $\hat F_{\text{TbT}}(\eta) > \hat F_{\text{TbT}}(2/3)$ for all $\eta > \bar\eta$. This implies that $\eta^*_{\text{train}}$ lies in $[0, \bar\eta]$ and $\hat F_{\text{TbT}}(\eta^*_{\text{train}}) \le \hat F_{\text{TbT}}(2/3) \le \frac{n-d}{2n}\sigma^2 + \frac{7\epsilon^2 d\sigma^2}{100n}$. By Lemma 26, assuming $\sigma$ is a constant and $n \ge c\log(\frac nd)d$ for some constant $c$, with probability at least $1 - m\exp(-\Omega(n)) - O(\frac{tn^2}{\epsilon d} + m)\exp(-\Omega(m\epsilon^4 d^2/n^2))$,
$$|F_{\text{TbT}}(\eta) - \hat F_{\text{TbT}}(\eta)| \le \frac{17\epsilon^2 d\sigma^2}{n} \quad \text{for all } \eta \in [0, \bar\eta].$$
This then implies
$$F_{\text{TbT}}(\eta^*_{\text{train}}) \le \hat F_{\text{TbT}}(\eta^*_{\text{train}}) + \frac{17\epsilon^2 d\sigma^2}{n} \le \frac{n-d}{2n}\sigma^2 + \frac{24\epsilon^2 d\sigma^2}{n}.$$
By the analysis in Lemma 24,
$$F_{\text{TbT}}(\eta^*_{\text{train}}) = \mathbb E\,\frac12\|w_{t,\eta^*_{\text{train}}} - w_{\text{train}}\|^2_{H_{\text{train}}} + \mathbb E\,\frac{1}{2n}\|(I_n - \mathrm{Proj}_{X_{\text{train}}})\xi_{\text{train}}\|^2 = \mathbb E\,\frac12\|w_{t,\eta^*_{\text{train}}} - w_{\text{train}}\|^2_{H_{\text{train}}} + \frac{n-d}{2n}\sigma^2.$$
Therefore, $\mathbb E\,\frac12\|w_{t,\eta^*_{\text{train}}} - w_{\text{train}}\|^2_{H_{\text{train}}} \le \frac{24\epsilon^2 d\sigma^2}{n}$. Next, we show this implies that $\mathbb E\|w_{t,\eta^*_{\text{train}}} - w^*\|^2$ is small. Let $E$ be the event that $1 - \epsilon \le \lambda_i(H_{\text{train}}) \le 1 + \epsilon$ for all $i \in [d]$. By Lemma 27, $\Pr[E] \ge 1 - \exp(-\Omega(\epsilon^2 n))$ as long as $n \ge 10d/\epsilon^2$. We decompose
$$\mathbb E\|w_{t,\eta^*_{\text{train}}} - w^*\|^2 = \mathbb E\big[\|w_{t,\eta^*_{\text{train}}} - w^*\|^2\mathbf 1\{E\}\big] + \mathbb E\big[\|w_{t,\eta^*_{\text{train}}} - w^*\|^2\mathbf 1_{\bar E}\big].$$
Let us first show the second term is small. Due to the truncation in our algorithm, $\|w_{t,\eta^*_{\text{train}}} - w^*\|^2 \le 41^2\sigma^2$, which implies $\mathbb E[\|w_{t,\eta^*_{\text{train}}} - w^*\|^2\mathbf 1_{\bar E}] \le 41^2\sigma^2\exp(-\Omega(\epsilon^2 n))$. As long as $n \ge \frac{c}{\epsilon^2}\log(\frac nd)$ for some constant $c$, we have $\mathbb E[\|w_{t,\eta^*_{\text{train}}} - w^*\|^2\mathbf 1_{\bar E}] \le \frac{d\sigma^2}{n}$.

Recall that $X_{\text{train}}$ is an $n\times d$ matrix with $i$-th row $x_i^\top$. With probability $1$, $X_{\text{train}}$ has full column rank. Denote the pseudo-inverse of $X_{\text{train}}$ by $X^\dagger_{\text{train}} \in \mathbb R^{d\times n}$, which satisfies $X^\dagger_{\text{train}}X_{\text{train}} = I_d$ and $X_{\text{train}}X^\dagger_{\text{train}} = \mathrm{Proj}_{X_{\text{train}}}$, where $\mathrm{Proj}_{X_{\text{train}}} \in \mathbb R^{n\times n}$ is the projection matrix onto the column span of $X_{\text{train}}$. Let $w_{\text{train}} := w^* + X^\dagger_{\text{train}}\xi_{\text{train}}$, where $\xi_{\text{train}}$ is the $n$-dimensional vector with $i$-th entry $\xi_i$. We have
$$\Delta_{\text{TbT}}(\eta, P) = \frac{1}{2n}\sum_{i=1}^n\big(\langle w_{t,\eta} - w_{\text{train}},\, x_i\rangle - (\xi_i - x_i^\top X^\dagger_{\text{train}}\xi_{\text{train}})\big)^2 = \frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}} + \frac{1}{2n}\|(I_n - \mathrm{Proj}_{X_{\text{train}}})\xi_{\text{train}}\|^2 - \frac1n\sum_{i=1}^n\big\langle w_{t,\eta} - w_{\text{train}},\; x_i\xi_i - x_ix_i^\top X^\dagger_{\text{train}}\xi_{\text{train}}\big\rangle.$$
We first show the crossing term is actually zero.
We have
$$\frac1n\sum_{i=1}^n\big\langle w_{t,\eta} - w_{\text{train}},\; x_i\xi_i - x_ix_i^\top X^\dagger_{\text{train}}\xi_{\text{train}}\big\rangle = \frac1n\Big\langle w_{t,\eta} - w_{\text{train}},\; \sum_{i=1}^n x_i\xi_i - \sum_{i=1}^n x_ix_i^\top X^\dagger_{\text{train}}\xi_{\text{train}}\Big\rangle = \frac1n\big\langle w_{t,\eta} - w_{\text{train}},\; X_{\text{train}}^\top\xi_{\text{train}} - X_{\text{train}}^\top X_{\text{train}}X^\dagger_{\text{train}}\xi_{\text{train}}\big\rangle = \frac1n\big\langle w_{t,\eta} - w_{\text{train}},\; X_{\text{train}}^\top\xi_{\text{train}} - X_{\text{train}}^\top\xi_{\text{train}}\big\rangle = 0,$$
where the second-to-last equality holds because $X_{\text{train}}X^\dagger_{\text{train}} = \mathrm{Proj}_{X_{\text{train}}}$.

We can define $w^{(k)}_{\text{train}} := w^*_k + (X^{(k)}_{\text{train}})^\dagger\xi^{(k)}_{\text{train}}$ for every training set $S^{(k)}_{\text{train}}$. Then
$$\hat F_{\text{TbT}}(\eta) = \frac1m\sum_{k=1}^m\frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2_{H^{(k)}_{\text{train}}} + \frac1m\sum_{k=1}^m\frac{1}{2n}\|(I_n - \mathrm{Proj}_{X^{(k)}_{\text{train}}})\xi^{(k)}_{\text{train}}\|^2.$$
We first prove that the second term concentrates around its mean. Concatenate the $m$ noise vectors $\xi^{(k)}_{\text{train}}$ into a single noise vector $\tilde\xi_{\text{train}}$ of dimension $nm$, and construct a data matrix $\tilde X_{\text{train}} \in \mathbb R^{nm\times dm}$ with the $X^{(k)}_{\text{train}}$ as diagonal blocks. Then the second term can be written as $\frac12\|\frac{1}{\sqrt{nm}}(I_{nm} - \mathrm{Proj}_{\tilde X_{\text{train}}})\tilde\xi_{\text{train}}\|^2$. By Lemma 45, with probability at least $1 - \exp(-\Omega(\epsilon^4 md^2/n))$,
$$\Big(1 - \frac{\epsilon^2 d}{n}\Big)\sigma \le \frac{1}{\sqrt{nm}}\|\tilde\xi_{\text{train}}\| \le \Big(1 + \frac{\epsilon^2 d}{n}\Big)\sigma.$$
By the Johnson–Lindenstrauss lemma (Lemma 49), with probability at least $1 - \exp(-\Omega(\epsilon^4 md))$,
$$\frac{1}{\sqrt{nm}}\|\mathrm{Proj}_{\tilde X_{\text{train}}}\tilde\xi_{\text{train}}\| \ge (1 - \epsilon^2)\frac{\sqrt{md}}{\sqrt{mn}}\cdot\frac{1}{\sqrt{nm}}\|\tilde\xi_{\text{train}}\| \ge (1-\epsilon^2)\sqrt{\frac dn}\Big(1 - \frac{\epsilon^2 d}{n}\Big)\sigma.$$
Therefore, $\frac{1}{nm}\|\tilde\xi_{\text{train}}\|^2 \le (1 + \frac{3\epsilon^2 d}{n})\sigma^2$ and $\frac{1}{nm}\|\mathrm{Proj}_{\tilde X_{\text{train}}}\tilde\xi_{\text{train}}\|^2 \ge (1 - 2\epsilon^2)\frac dn\sigma^2$. Overall, with probability at least $1 - \exp(-\Omega(\epsilon^4 md/n))$,
$$\frac12\Big\|\frac{1}{\sqrt{nm}}(I_{nm} - \mathrm{Proj}_{\tilde X_{\text{train}}})\tilde\xi_{\text{train}}\Big\|^2 \le \frac{n-d}{2n}\sigma^2 + \frac{5\epsilon^2 d\sigma^2}{2n}.$$
Now, we show the first term in the meta-objective is small when we choose the right step size. By Lemma 27, as long as $n \ge 40d$, with probability at least $1 - \exp(-\Omega(n))$, $\sqrt n/2 \le \sigma_i(X^{(k)}_{\text{train}}) \le 3\sqrt n/2$ and $1/2 \le \lambda_i(H^{(k)}_{\text{train}}) \le 3/2$ for all $i \in [d]$. By Lemma 45, with probability at least $1 - \exp(-\Omega(n))$, $\|\xi^{(k)}_{\text{train}}\| \le 2\sqrt n\,\sigma$. Taking a union bound over the $m$ tasks, all these events hold with probability at least $1 - m\exp(-\Omega(n))$.
For each $k \in [m]$, we have $\|w^{(k)}_{\text{train}}\| \le 1 + \frac{2}{\sqrt n}\cdot 2\sqrt n\,\sigma \le 5\sigma$. It is easy to verify that for any step size at most $2/3$, the GD sequence will not be truncated, since we choose the norm threshold as $40\sigma$. Then, for any step size $\eta \le 2/3$,
$$\frac1m\sum_{k=1}^m\frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2_{H^{(k)}_{\text{train}}} = \frac1m\sum_{k=1}^m\frac12\|(I - \eta H^{(k)}_{\text{train}})^t w^{(k)}_{\text{train}}\|^2_{H^{(k)}_{\text{train}}} \le \frac34\Big(1 - \frac\eta2\Big)^{2t}25\sigma^2 \le 20\Big(1 - \frac13\Big)^{2t}\sigma^2,$$
where the last inequality chooses $\eta = 2/3$. Overall, with probability at least $1 - m\exp(-\Omega(n)) - \exp(-\Omega(\epsilon^4 md/n))$,
$$\hat F_{\text{TbT}}(2/3) \le 20\Big(1 - \frac13\Big)^{2t}\sigma^2 + \frac{n-d}{2n}\sigma^2 + \frac{5\epsilon^2 d\sigma^2}{2n}.$$
We finish the proof by rescaling $\epsilon$ so that $\frac{5\epsilon^2}{2}$ becomes $\frac{\epsilon^2}{20}$.

C.2 LOWER BOUNDING $\hat F_{\text{TbT}}$ FOR $\eta \in (\bar\eta, \infty)$

In this section, we show the empirical meta-objective is large when the step size exceeds a certain threshold. Recall Lemma 25:

Lemma 25. Let $\bar\eta$ be as defined in Definition 2 with $1 > \epsilon > 0$. Assume $n \ge cd$, $t \ge c_2$, $d \ge c_4$ for some constants $c, c_2, c_4$. With probability at least $1 - \exp(-\Omega(\epsilon^4 md^2/n^2))$,
$$\hat F_{\text{TbT}}(\eta) \ge \frac{\epsilon^2 d\sigma^2}{8n} + \frac{n-d}{2n}\sigma^2 - \frac{\epsilon^2 d\sigma^2}{20n} \quad \text{for all } \eta > \bar\eta.$$

Roughly speaking, we define $\bar\eta$ such that for any step size larger than $\bar\eta$, the GD sequence has a reasonable probability of being truncated. The definition is very similar to that in Definition 1.

Definition 2. Given a training task $P$, let $E_1$ be the event that $\sqrt n/2 \le \sigma_i(X_{\text{train}}) \le 3\sqrt n/2$ and $1/2 \le \lambda_i(H_{\text{train}}) \le 3/2$ for all $i \in [d]$, and $\sqrt n\,\sigma/2 \le \|\xi_{\text{train}}\| \le 2\sqrt n\,\sigma$. Let $\bar E_2(\eta)$ be the event that the GD sequence is truncated with step size $\eta$. Given $1 > \epsilon > 0$, define
$$\bar\eta := \inf\Big\{\eta \ge 0 \;\Big|\; \mathbb E\Big[\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap \bar E_2(\eta)\}\Big] \ge \frac{\epsilon^2 d\sigma^2}{n}\Big\}.$$
As in Lemma 5, we show that $\mathbf 1\{E_1 \cap \bar E_2(\eta')\} \ge \mathbf 1\{E_1 \cap \bar E_2(\eta)\}$ for any $\eta' \ge \eta$. This means that, conditioned on $E_1$, if a GD sequence gets truncated with step size $\eta$, it must also be truncated with any step size $\eta' \ge \eta$. The proof is deferred to Section C.4.

Lemma 28. Fixing a training set $S_{\text{train}}$, let $E_1$ and $\bar E_2(\eta)$ be as defined in Definition 2.
We have $\mathbf 1\{E_1 \cap \bar E_2(\eta')\} \ge \mathbf 1\{E_1 \cap \bar E_2(\eta)\}$ for any $\eta' \ge \eta$.

Next, we show that $\bar\eta$ exists and is a constant. As in Lemma 6, we show that the GD sequence almost never diverges when $\eta$ is small and diverges with high probability when $\eta$ is large. The proof is deferred to Section C.4.

Lemma 29. Let $\bar\eta$ be as defined in Definition 2. Suppose $\sigma$ is a constant. Assume $n \ge cd$, $t \ge c_2$, $d \ge c_4$ for some constants $c, c_2, c_4$. Then $\frac43 < \bar\eta < 6$.

Next, we show the empirical loss is large for any $\eta$ larger than $\bar\eta$. The proof is very similar to that of Lemma 3.

Proof of Lemma 25. By Lemma 29, $\bar\eta$ is a constant as long as $n \ge cd$, $t \ge c_2$, $d \ge c_4$ for some constants $c, c_2, c_4$. Let $E_1$ and $\bar E_2(\eta)$ be as defined in Definition 2. For simplicity, we assume
$$\mathbb E\Big[\frac12\|w_{t,\bar\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap \bar E_2(\bar\eta)\}\Big] \ge \frac{\epsilon^2 d\sigma^2}{n};$$
the other case can be resolved using the same techniques as in Lemma 3. Conditioned on $E_1$, $\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}} \le \frac34\cdot 45^2\sigma^2$, and therefore $\Pr[E_1 \cap \bar E_2(\bar\eta)] \ge \frac{4\epsilon^2 d}{3\times 45^2\, n}$. For each task $k$, define $E^{(k)}_1$ and $\bar E^{(k)}_2(\eta)$ as the corresponding events on training set $S^{(k)}_{\text{train}}$. By Hoeffding's inequality, with probability at least $1 - \exp(-\Omega(\epsilon^4 md^2/n^2))$,
$$\frac1m\sum_{k=1}^m\mathbf 1\{E^{(k)}_1 \cap \bar E^{(k)}_2(\bar\eta)\} \ge \frac{\epsilon^2 d}{45^2\, n}.$$
By Lemma 28, $\mathbf 1\{E^{(k)}_1 \cap \bar E^{(k)}_2(\eta)\} \ge \mathbf 1\{E^{(k)}_1 \cap \bar E^{(k)}_2(\bar\eta)\}$ for any $\eta \ge \bar\eta$. Recall that
$$\hat F_{\text{TbT}}(\eta) = \frac1m\sum_{k=1}^m\frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2_{H^{(k)}_{\text{train}}} + \frac1m\sum_{k=1}^m\frac{1}{2n}\|(I_n - \mathrm{Proj}_{X^{(k)}_{\text{train}}})\xi^{(k)}_{\text{train}}\|^2.$$
We can lower bound the first term for any $\eta > \bar\eta$:
$$\frac1m\sum_{k=1}^m\frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2_{H^{(k)}_{\text{train}}} \ge \frac1m\sum_{k=1}^m\frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2_{H^{(k)}_{\text{train}}}\mathbf 1\{E^{(k)}_1 \cap \bar E^{(k)}_2(\eta)\} \ge \frac{35^2\sigma^2}{4}\cdot\frac1m\sum_{k=1}^m\mathbf 1\{E^{(k)}_1 \cap \bar E^{(k)}_2(\eta)\} \ge \frac{35^2\sigma^2}{4}\cdot\frac1m\sum_{k=1}^m\mathbf 1\{E^{(k)}_1 \cap \bar E^{(k)}_2(\bar\eta)\} \ge \frac{\epsilon^2 d\sigma^2}{8n},$$
where the second inequality lower bounds $\frac12\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2_{H^{(k)}_{\text{train}}}$ by $\frac{35^2\sigma^2}{4}$ when the sequence gets truncated.
For the second term, by the analysis in Lemma 24, with probability at least $1 - \exp(-\Omega(\epsilon^4 md/n))$,
$$\frac1m\sum_{k=1}^m\frac{1}{2n}\|(I_n - \mathrm{Proj}_{X^{(k)}_{\text{train}}})\xi^{(k)}_{\text{train}}\|^2 \ge \frac{n-d}{2n}\sigma^2 - \frac{\epsilon^2 d\sigma^2}{20n}.$$
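The truncation behavior behind Lemma 25 and the bracketing of $\bar\eta$ in Lemma 29 can be illustrated with a small simulation: on a quadratic with Hessian eigenvalues in $[1/2, 3/2]$, GD contracts for small step sizes and blows past any norm threshold once the step size is large. This sketch is not the paper's exact setup; the spectrum, iterate norm, and threshold below are illustrative choices.

```python
import numpy as np

# GD on f(w) = (1/2) w^T H w with a diagonal H whose eigenvalues lie in
# [1/2, 3/2], starting from norm 5, with a norm-threshold truncation at 40.
rng = np.random.default_rng(1)
d, t, threshold = 20, 100, 40.0
lam = rng.uniform(0.5, 1.5, size=d)        # illustrative spectrum
w0 = rng.standard_normal(d)
w0 *= 5.0 / np.linalg.norm(w0)             # start at norm 5

def truncated_gd_norm(eta):
    w = w0.copy()
    for _ in range(t):
        w = (1 - eta * lam) * w            # GD step toward the minimizer 0
        if np.linalg.norm(w) > threshold:  # truncation, as in the algorithm
            return threshold
    return np.linalg.norm(w)

small = truncated_gd_norm(0.5)  # |1 - eta*lam| <= 3/4: geometric contraction
large = truncated_gd_norm(6.0)  # |1 - eta*lam| >= 2: blows up, gets truncated
print(small, large)
```

With step size $6$, every coordinate is scaled by a factor of magnitude at least $2$ per step, so the norm at least doubles each iteration and the truncation fires within a few steps; with step size $1/2$ the iterate decays geometrically, matching the two regimes separating $\eta \le 4/3$ from $\eta \ge 6$ in Lemma 29.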

C.4 PROOFS OF TECHNICAL LEMMAS

Proof of Lemma 27. By Lemma 48, with probability at least $1 - 2\exp(-t^2/2)$, $\sqrt n - \sqrt d - t \le \sigma_i(X_{\text{train}}) \le \sqrt n + \sqrt d + t$ for all $i \in [d]$. Since $d \le \frac{\epsilon^2 n}{10}$, we have $\sqrt n - \frac{\epsilon\sqrt n}{\sqrt{10}} - t \le \sigma_i(X_{\text{train}}) \le \sqrt n + \frac{\epsilon\sqrt n}{\sqrt{10}} + t$. Choosing $t = (\frac13 - \frac{1}{\sqrt{10}})\epsilon\sqrt n$, we have with probability at least $1 - \exp(-\Omega(\epsilon^2 n))$, $(1 - \frac\epsilon3)\sqrt n \le \sigma_i(X_{\text{train}}) \le (1 + \frac\epsilon3)\sqrt n$. Since $\lambda_i(H_{\text{train}}) = \frac1n\sigma_i^2(X_{\text{train}})$, we have $1 - \epsilon \le \lambda_i(H_{\text{train}}) \le 1 + \epsilon$.

Proof of Lemma 28. The proof is almost the same as that of Lemma 5, so we omit the details. The only property Lemma 5 relies on is that the norm threshold is larger than $2\|w_{\text{train}}\|$ conditioned on $E_1$. Conditioned on $E_1$, $\|w_{\text{train}}\| \le 5\sigma$, and the norm threshold is still set to $40\sigma$, so this property is preserved and the previous proof goes through.

Proof of Lemma 29. The proof is very similar to that of Lemma 6. Conditioned on $E_1$, $\|H_{\text{train}}\| \le 3/2$ and $\|w_{\text{train}}\| \le 5\sigma$, so the GD sequence never exceeds the norm threshold $40\sigma$ for any $\eta \le 4/3$. That means
$$\mathbb E\Big[\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap \bar E_2(\eta)\}\Big] = 0 \quad \text{for all } \eta \le 4/3.$$
To lower bound the loss for large step sizes, we first need to lower bound $\|w_{\text{train}}\|$. Recall that $w_{\text{train}} = w^* + (X_{\text{train}})^\dagger\xi_{\text{train}}$. Conditioned on $E_1$, $\|\xi_{\text{train}}\| \le 2\sqrt n\,\sigma$ and $\sigma_d(X_{\text{train}}) \ge \sqrt n/2$, which implies $\|(X_{\text{train}})^\dagger\| \le 2/\sqrt n$. By the Johnson–Lindenstrauss lemma (Lemma 49), $\|\mathrm{Proj}_{X_{\text{train}}}\xi_{\text{train}}\| \le \frac32\sqrt{d/n}\,\|\xi_{\text{train}}\|$ with probability at least $1 - \exp(-\Omega(d))$; call this event $E_3$. Conditioned on $E_1 \cap E_3$,
$$\|(X_{\text{train}})^\dagger\xi_{\text{train}}\| \le \frac{2}{\sqrt n}\cdot\frac32\sqrt{\frac dn}\cdot 2\sqrt n\,\sigma = 6\sqrt{\frac dn}\,\sigma,$$
which is smaller than $1/2$ as long as $n \ge 12^2 d\sigma^2$ (recall that $\sigma$ is a constant). This then implies $\|w_{\text{train}}\| \ge 1/2$. Let $\{w'_{\tau,\eta}\}$ be the GD sequence without truncation. For any step size $\eta \in [6, \infty)$, conditioned on $E_1 \cap E_3$,
$$\|w'_{t,\eta}\| \ge \big((6\times\tfrac12 - 1)^t - 1\big)\|w_{\text{train}}\| \ge (2^t - 1)\cdot\frac12 \ge 40\sigma,$$
where the last inequality holds as long as $t \ge c_2$ for some constant $c_2$.
Therefore, when $\eta \in [6, \infty)$, $\mathbf 1\{E_1 \cap \bar E_2(\eta)\} \ge \mathbf 1\{E_1 \cap E_3\}$. Assuming $n \ge 40d$, $E_1$ holds with probability at least $1 - \exp(-\Omega(n))$. Then, for any $\eta \ge 6$,
$$\mathbb E\Big[\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap \bar E_2(\eta)\}\Big] \ge \frac14(40\sigma - 5\sigma)^2\Pr[E_1 \cap E_3] \ge \frac{\epsilon^2 d\sigma^2}{n},$$
where the last inequality assumes $n \ge c$, $d \ge c_4$ for some constants $c, c_4$. Overall, $\mathbb E[\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap \bar E_2(\eta)\}]$ equals zero for all $\eta \in [0, 4/3]$ and is at least $\frac{\epsilon^2 d\sigma^2}{n}$ for all $\eta \in [6, \infty)$. By definition, $\bar\eta \in (4/3, 6)$.

Proof of Lemma 31. By Lemma 29, $\bar\eta$ is a constant. The proof is very similar to that of Lemma 8. Let $E_1$ and $\bar E_2(\eta)$ be as defined in Definition 2. For simplicity, we assume
$$\mathbb E\Big[\frac12\|w_{t,\bar\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap \bar E_2(\bar\eta)\}\Big] \le \frac{\epsilon^2 d\sigma^2}{n};$$
the other case can be resolved using the techniques in the proof of Lemma 8. Recall the population meta-objective
$$F_{\text{TbT}}(\eta) = \mathbb E\,\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}} + \frac{n-d}{2n}\sigma^2.$$
Therefore, we only need to construct an $\epsilon$-net for the first term. We decompose it as
$$\mathbb E\,\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}} = \mathbb E\Big[\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap E_2(\eta)\}\Big] + \mathbb E\Big[\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap \bar E_2(\eta)\}\Big] + \mathbb E\Big[\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1_{\bar E_1}\Big].$$
We construct an $\epsilon$-net for the first term and show the other two terms are small. For the first term, there exists an $\epsilon$-net $N$ such that
$$\Big|\mathbb E\Big[\frac12\|w_{t,\eta} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap E_2(\eta)\}\Big] - \mathbb E\Big[\frac12\|w_{t,\eta'} - w_{\text{train}}\|^2_{H_{\text{train}}}\mathbf 1\{E_1 \cap E_2(\eta')\}\Big]\Big| \le \frac{\epsilon^2 d\sigma^2}{n} \quad \text{for } \eta' = \arg\min_{\eta''\in N,\,\eta''\le\eta}(\eta - \eta'').$$
Combining with the upper bounds on the second term and the third term, we have for any $\eta \in [0, \bar\eta]$,
$$|F_{\text{TbT}}(\eta) - F_{\text{TbT}}(\eta')| \le \frac{8\epsilon^2 d\sigma^2}{n} \quad \text{for } \eta' = \arg\min_{\eta''\in N,\,\eta''\le\eta}(\eta - \eta'').$$

Proof of Lemma 32. By Lemma 29, $\bar\eta$ is a constant. For each $k \in [m]$, let $E_{1,k}$ be the event that $\sqrt n/2 \le \sigma_i(X^{(k)}_{\text{train}}) \le 3\sqrt n/2$ and $1/2 \le \lambda_i(H^{(k)}_{\text{train}}) \le 3/2$ for all $i \in [d]$, and $\sqrt n\,\sigma/2 \le \|\xi^{(k)}_{\text{train}}\| \le 2\sqrt n\,\sigma$. Assuming $n \ge 40d$, by Lemma 27, with probability at least $1 - m\exp(-\Omega(n))$, $E_{1,k}$ holds for all $k \in [m]$.
Then, as in Lemma 9, there exists an $\epsilon$-net with the required properties.

For each $k \in [m]$, we have $\Delta_{\text{TbT}}(\eta, P_k) := \frac12\mathbb E_{\text{SGD}}\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2_{H^{(k)}_{\text{train}}}$. Since $1/L \le \lambda_i(H^{(k)}_{\text{train}}) \le L$ and $w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}$ lies in the span of $H^{(k)}_{\text{train}}$,
$$\frac{1}{2L}\mathbb E_{\text{SGD}}\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2 \le \Delta_{\text{TbT}}(\eta, P_k) \le \frac L2\mathbb E_{\text{SGD}}\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2.$$
Recall the stochastic gradient descent update:
$$w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}} = (I - \eta H^{(k)}_{\text{train}})(w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}}) - \eta\, n^{(k)}_{t-1,\eta}.$$
Therefore,
$$\mathbb E_{\text{SGD}}\big[\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2 \,\big|\, w^{(k)}_{t-1,\eta}\big] = \|(I - \eta H^{(k)}_{\text{train}})(w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}})\|^2 + \eta^2\,\mathbb E_{\text{SGD}}\big[\|n^{(k)}_{t-1,\eta}\|^2 \,\big|\, w^{(k)}_{t-1,\eta}\big].$$
We know that for any $\eta \le 1/L$,
$$(1 - 2\eta L)\|w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}}\|^2 \le \|(I - \eta H^{(k)}_{\text{train}})(w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}})\|^2 \le \Big(1 - \frac\eta L\Big)\|w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}}\|^2.$$
The noise can be bounded as follows:
$$\eta^2\,\mathbb E_{\text{SGD}}\big[\|n^{(k)}_{t-1,\eta}\|^2 \,\big|\, w^{(k)}_{t-1,\eta}\big] = \eta^2\,\mathbb E_{\text{SGD}}\big[\|x_{i(t-1)}x_{i(t-1)}^\top(w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}}) - H^{(k)}_{\text{train}}(w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}})\|^2 \,\big|\, w^{(k)}_{t-1,\eta}\big] \le \eta^2\,\mathbb E_{\text{SGD}}\big[\|x_{i(t-1)}x_{i(t-1)}^\top(w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}})\|^2 \,\big|\, w^{(k)}_{t-1,\eta}\big] \le \eta^2\max_{i(t-1)}\|x_{i(t-1)}\|^2\,\|w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}}\|^2_{H^{(k)}_{\text{train}}}.$$
Since $\|X_{\text{train}}\| \le \sqrt L\sqrt d$, we immediately have $\max_{i(t-1)}\|x_{i(t-1)}\| \le \sqrt L\sqrt d$. Therefore,
$$\eta^2\,\mathbb E_{\text{SGD}}\big[\|n^{(k)}_{t-1,\eta}\|^2 \,\big|\, w^{(k)}_{t-1,\eta}\big] \le L^2\eta^2 d\,\|w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}}\|^2.$$
As long as $\eta \le \frac{1}{2L^3 d}$, we have
$$(1 - \eta L)\|w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}}\|^2 \le \mathbb E_{\text{SGD}}\big[\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2 \,\big|\, w^{(k)}_{t-1,\eta}\big] \le \Big(1 - \frac{\eta}{2L}\Big)\|w^{(k)}_{t-1,\eta} - w^{(k)}_{\text{train}}\|^2,$$
which further implies
$$(1 - \eta L)^t\|w_{\text{train}}\|^2 \le \mathbb E_{\text{SGD}}\|w^{(k)}_{t,\eta} - w^{(k)}_{\text{train}}\|^2 \le \Big(1 - \frac{\eta}{2L}\Big)^t\|w_{\text{train}}\|^2.$$
Let $\eta_2 := \frac{1}{2L^3 d}$; we have $\Delta_{\text{TbT}}(\eta_2, P_k) \le \frac L2\big(1 - \frac{1}{4L^4 d}\big)^t\|w_{\text{train}}\|^2$. Let $\eta_1 := \frac{1}{6L^5 d}$; for all $\eta \in [0, \eta_1]$ we have $\Delta_{\text{TbT}}(\eta, P_k) \ge \frac{1}{2L}\big(1 - \frac{1}{6L^4 d}\big)^t\|w_{\text{train}}\|^2$. As long as $t \ge c_2 d$ for a certain constant $c_2$,
$$\frac{1}{2L}\Big(1 - \frac{1}{6L^4 d}\Big)^t\|w_{\text{train}}\|^2 > \frac L2\Big(1 - \frac{1}{4L^4 d}\Big)^t\|w_{\text{train}}\|^2.$$
As this holds for all $k \in [m]$ and $\hat F_{\text{TbT}} = \frac1m\sum_{k=1}^m\Delta_{\text{TbT}}(\eta, P_k)$, the optimal step size $\eta^*_{\text{train}}$ lies in $[\frac{1}{6L^5 d}, \frac{1}{2L^3 d}]$.

We rely on the following technical lemma to prove Lemma 34.

Lemma 35. Suppose $\sigma$ is a constant. Given any $\epsilon > 0$, with probability at least $1 - O(1/\epsilon)\exp(-\Omega(\epsilon^2 d))$,
$$|\langle B_{t,\eta} w^*_{\text{train}} - w^*,\; B_{t,\eta}(X_{\text{train}})^\dagger\xi_{\text{train}}\rangle| \le \epsilon \quad \text{for all } \eta \in \Big[0, \frac{1}{2L^3 d}\Big].$$

Proof of Lemma 35. By Lemma 1, with probability at least $1 - \exp(-\Omega(d))$, $\sqrt d/\sqrt L \le \sigma_i(X_{\text{train}}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{\text{train}}) \le L$ for all $i \in [n]$. Therefore $\|[(X_{\text{train}})^\dagger]^\top B_{t,\eta}(B_{t,\eta}w^*_{\text{train}} - w^*)\| \le 2\sqrt L/\sqrt d$. Note that $\xi_{\text{train}}$ is independent of $[(X_{\text{train}})^\dagger]^\top B_{t,\eta}(B_{t,\eta}w^*_{\text{train}} - w^*)$. By Hoeffding's inequality, with probability at least $1 - \exp(-\Omega(\epsilon^2 d))$, $|\langle[(X_{\text{train}})^\dagger]^\top B_{t,\eta}(B_{t,\eta}w^*_{\text{train}} - w^*),\; \xi_{\text{train}}\rangle| \le \epsilon$.

Next, we construct an $\epsilon$-net over $\eta$ and show the crossing term is small for all $\eta \in [0, \frac{1}{2L^3 d}]$. For simplicity, denote $g(\eta) := \langle B_{t,\eta}w^*_{\text{train}} - w^*, B_{t,\eta}(X_{\text{train}})^\dagger\xi_{\text{train}}\rangle$. Taking the derivative of $g(\eta)$, we have
$$g'(\eta) = t\big\langle H_{\text{train}}(I - \eta H_{\text{train}})^{t-1}w^*_{\text{train}},\; B_{t,\eta}(X_{\text{train}})^\dagger\xi_{\text{train}}\big\rangle + t\big\langle B_{t,\eta}w^*_{\text{train}} - w^*,\; H_{\text{train}}(I - \eta H_{\text{train}})^{t-1}(X_{\text{train}})^\dagger\xi_{\text{train}}\big\rangle.$$
By Lemma 45, with probability at least $1 - \exp(-\Omega(d))$, $\|\xi_{\text{train}}\| \le \sqrt d\,\sigma$, so the derivative can be bounded as $|g'(\eta)| = O(1)\,t(1 - \frac\eta L)^{t-1}$. As in Lemma 14, there exists an $\epsilon$-net $N$ of size $O(1/\epsilon)$ such that for any $\eta \in [0, \frac{1}{2L^3 d}]$ there exists $\eta' \in N$ with $|g(\eta) - g(\eta')| \le \epsilon$. Taking a union bound over $N$, with probability at least $1 - O(1/\epsilon)\exp(-\Omega(\epsilon^2 d))$, $|g(\eta')| \le \epsilon$ for every $\eta' \in N$, which implies $|g(\eta)| \le 2\epsilon$ for every $\eta \in [0, \frac{1}{2L^3 d}]$. Replacing $\epsilon$ by $\epsilon/2$ finishes the proof.

Proof of Lemma 34.
By Lemma 1 and Lemma 45, with probability at least $1 - \exp(-\Omega(d))$, $\sqrt d/\sqrt L \le \sigma_i(X_{\text{train}}) \le \sqrt{Ld}$ and $1/L \le \lambda_i(H_{\text{train}}) \le L$ for all $i \in [n]$, and $\sqrt d\,\sigma/4 \le \|\xi_{\text{train}}\| \le \sqrt d\,\sigma$. We assume these properties hold throughout the proof and take a union bound at the end. We lower bound $\mathbb E_{\text{SGD}}\|w_{t,\eta} - w^*\|^2$ as follows:
$$\mathbb E_{\text{SGD}}\|w_{t,\eta} - w^*\|^2 = \mathbb E_{\text{SGD}}\Big\|B_{t,\eta}\big(w^*_{\text{train}} + (X_{\text{train}})^\dagger\xi_{\text{train}}\big) - \eta\sum_{\tau=0}^{t-1}(I - \eta H_{\text{train}})^{t-1-\tau}n_{\tau,\eta} - w^*\Big\|^2 \ge \big\|B_{t,\eta}\big(w^*_{\text{train}} + (X_{\text{train}})^\dagger\xi_{\text{train}}\big) - w^*\big\|^2 \ge \|B_{t,\eta}(X_{\text{train}})^\dagger\xi_{\text{train}}\|^2 + 2\big\langle B_{t,\eta}w^*_{\text{train}} - w^*,\; B_{t,\eta}(X_{\text{train}})^\dagger\xi_{\text{train}}\big\rangle.$$
For any $\eta \in [\frac{1}{6L^5 d}, \frac{1}{2L^3 d}]$, we can lower bound the first term:
$$\|B_{t,\eta}(X_{\text{train}})^\dagger\xi_{\text{train}}\|^2 \ge \Big(1 - \exp\Big(-\frac{\eta t}{L}\Big)\Big)^2\frac{\sigma^2}{16L} \ge \Big(1 - \exp\Big(-\frac{t}{6L^6 d}\Big)\Big)^2\frac{\sigma^2}{16L} \ge \frac{\sigma^2}{64L},$$
where the last inequality holds as long as $t \ge c_2 d$ for a certain constant $c_2$. Choosing $\epsilon = \frac{\sigma^2}{256L}$ in Lemma 35, with probability at least $1 - \exp(-\Omega(d))$,
$$|\langle B_{t,\eta}w^*_{\text{train}} - w^*,\; B_{t,\eta}(X_{\text{train}})^\dagger\xi_{\text{train}}\rangle| \le \frac{\sigma^2}{256L} \quad \text{for all } \eta \in \Big[0, \frac{1}{2L^3 d}\Big].$$
Overall, $\mathbb E_{\text{SGD}}\|w_{t,\eta} - w^*\|^2 \ge \frac{\sigma^2}{128L}$. Taking a union bound over all the bad events, this happens with probability at least $1 - \exp(-\Omega(d))$.
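The spectral lower bound on $B_{t,\eta} = I - (I - \eta H)^t$ used in the last display reduces to the elementary scalar inequality $1 - e^{-\eta\lambda t} \le 1 - (1-\eta\lambda)^t$ for $0 < \eta\lambda < 1$ (applied with $\lambda \ge 1/L$). The following sketch only verifies that scalar inequality over a grid; it is a sanity check, not part of the proof.

```python
import numpy as np

# Check that 1 - exp(-x*t) <= 1 - (1 - x)^t <= 1 for x in (0, 1) and
# integer t, i.e. (1 - x)^t <= exp(-x*t): this is why each eigenvalue of
# B_{t,eta} = I - (I - eta*H)^t is at least 1 - exp(-eta*t/L) once every
# eigenvalue of H is at least 1/L.
xs = np.linspace(1e-4, 0.99, 50)
ts = [1, 5, 50, 500]
ok = all(
    1 - np.exp(-x * t) <= 1 - (1 - x) ** t <= 1 + 1e-12
    for x in xs for t in ts
)
print(ok)
```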

D.2 TRAIN-BY-VALIDATION (SGD)

Recall Theorem 10, which gives a guarantee for the step size minimizing $\hat F_{\text{TbV}(n_1,n_2)}(\eta)$, with the meta-objective $\hat F_{\text{TbV}(n_1,n_2)}$ as defined in Equation 5; the expectation is taken over the new tasks and the SGD noise.

To prove Theorem 10, we first study the behavior of the population meta-objective $F_{\text{TbV}}$. That is,
$$F_{\text{TbV}}(\eta) := \mathbb E_{P\sim\mathcal T}\,\Delta_{\text{TbV}}(\eta, P) = \mathbb E_{P\sim\mathcal T}\mathbb E_{\text{SGD}}\,\frac12\|w_{t,\eta} - w^* - (X_{\text{valid}})^\dagger\xi_{\text{valid}}\|^2_{H_{\text{valid}}} = \mathbb E_{P\sim\mathcal T}\mathbb E_{\text{SGD}}\,\frac12\|w_{t,\eta} - w^*\|^2 + \frac{\sigma^2}{2}.$$
We show that the optimal step size for the population meta-objective $F_{\text{TbV}}$ is $\Theta(1/t)$ and that $\mathbb E_{P\sim\mathcal T}\mathbb E_{\text{SGD}}\|w_{t,\eta} - w^*\|^2 = \|w^*\|^2 - \Omega(1)$ under the optimal step size.

Lemma 36. Suppose $\sigma$ is a large constant $c_1$. Assume $t \ge c_2 d^2\log^2(d)$, $d \ge c_4$ for some constants $c_2, c_4$. There exist $\eta_1, \eta_2, \eta_3 = \Theta(1/t)$ with $\eta_1 < \eta_2 < \eta_3$ and a constant $c_5$ such that
$$F_{\text{TbV}}(\eta_2) \le \frac12\|w^*\|^2 - \frac9{10}C + \frac{\sigma^2}{2}, \qquad F_{\text{TbV}}(\eta) \ge \frac12\|w^*\|^2 - \frac6{10}C + \frac{\sigma^2}{2} \quad \forall\,\eta \in [0, \eta_1]\cup\Big[\eta_3, \frac{1}{c_5 d^2\log^2(d)}\Big],$$
where $C$ is a positive constant.

In order to relate the behavior of $F_{\text{TbV}}$ to $\hat F_{\text{TbV}}$, we show a generalization result from $\hat F_{\text{TbV}}$ to $F_{\text{TbV}}$ for $\eta \in [0, \frac{1}{c_5 d^2\log^2(d/\epsilon)}]$.

Lemma 37. For any $1 > \epsilon > 0$, assume $\sigma$ is a constant and $d \ge c_4\log(1/\epsilon)$ for some constant $c_4$. There exists a constant $c_5$ such that with probability at least $1 - O(1/\epsilon)\exp(-\Omega(\epsilon^2 m))$,
$$|\hat F_{\text{TbV}}(\eta) - F_{\text{TbV}}(\eta)| \le \epsilon \quad \text{for all } \eta \in \Big[0, \frac{1}{c_5 d^2\log^2(d/\epsilon)}\Big].$$

Combining Lemma 36 and Lemma 37, we give the proof of Theorem 10.

Proof of Theorem 10. The proof is almost the same as in the GD setting (Theorem 8); we omit the details here.

D.2.1 BEHAVIOR OF $F_{\text{TbV}}$ FOR $\eta \in [0, \frac{1}{c_5 d^2\log^2 d}]$

In this section, we give the proof of Lemma 36. Recall the lemma as follows.

Lemma 36. Suppose $\sigma$ is a large constant $c_1$. Assume $t \ge c_2 d^2\log^2(d)$, $d \ge c_4$ for some constants $c_2, c_4$.
There exist $\eta_1,\eta_2,\eta_3=\Theta(1/t)$ with $\eta_1<\eta_2<\eta_3$ and a constant $c_5$ such that
$$F_{TbV}(\eta_2)\le\frac12\|w^*\|^2-\frac9{10}C+\frac{\sigma^2}2,\qquad F_{TbV}(\eta)\ge\frac12\|w^*\|^2-\frac6{10}C+\frac{\sigma^2}2,\quad\forall\eta\in[0,\eta_1]\cup\Big[\eta_3,\frac{1}{c_5d^2\log^2(d)}\Big],$$
where $C$ is a positive constant.

Recall that $F_{TbV}(\eta)=\mathbb{E}_{P\sim\mathcal T}\mathbb{E}_{\mathrm{SGD}}\,\frac12\|w_{t,\eta}-w^*\|^2+\sigma^2/2$. Denote $Q(\eta):=\mathbb{E}_{\mathrm{SGD}}\,\frac12\|w_{t,\eta}-w^*\|^2$. Recall that we truncate the SGD sequence once the weight norm exceeds $4\sqrt L\sigma$; due to the truncation, the expectation of $\frac12\|w_{t,\eta}-w^*\|^2$ over the SGD noise is tricky to analyze directly. Instead, we define an auxiliary sequence $\{w'_{\tau,\eta}\}$ obtained by running SGD on task $P$ without truncation, and we first study $Q'(\eta):=\frac12\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2$. In Lemma 38, we show that with high probability over the sampling of task $P$, the minimizer of $Q'(\eta)$ is $\Theta(1/t)$. The proof is very similar to the proof of Lemma 13, except that we also need to bound the SGD noise at step size $\eta_2$. We defer the proof to Section D.2.3.

Lemma 38. Given a task $P$, let $\{w'_{\tau,\eta}\}$ be the weights obtained by running SGD on task $P$ without truncation. Choose $\sigma$ as a large constant $c_1$. Assume the unroll length $t\ge c_2d$ for some constant $c_2$. With probability at least $1-\exp(-\Omega(d))$ over the sampling of task $P$, we have $\sqrt d/\sqrt L\le\sigma_i(X_{\mathrm{train}})\le\sqrt{Ld}$ and $1/L\le\lambda_i(H_{\mathrm{train}})\le L$ for all $i\in[n]$, $\sqrt d\sigma/4\le\|\xi_{\mathrm{train}}\|\le\sqrt d\sigma$, and there exist $\eta_1,\eta_2,\eta_3=\Theta(1/t)$ with $\eta_1<\eta_2<\eta_3$ such that
$$Q'(\eta_2):=\tfrac12\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta_2}-w^*\|^2\le\tfrac12\|w^*\|^2-C,\qquad Q'(\eta):=\tfrac12\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2\ge\tfrac12\|w^*\|^2-\tfrac C2,\quad\forall\eta\in[0,\eta_1]\cup[\eta_3,1/L],$$
where $C$ is a positive constant.

To relate the behavior of $Q'(\eta)$, defined on $\{w'_{\tau,\eta}\}$, to the behavior of $Q(\eta)$, defined on $\{w_{\tau,\eta}\}$, we show that when the step size is small enough, the SGD sequence gets truncated with very small probability, so that the sequence $\{w_{\tau,\eta}\}$ almost always coincides with $\{w'_{\tau,\eta}\}$. The proof of Lemma 39 is deferred to Section D.2.3.

Lemma 39. Given a task $P$, assume $\sqrt d/\sqrt L\le\sigma_i(X_{\mathrm{train}})\le\sqrt{Ld}$ and $1/L\le\lambda_i(H_{\mathrm{train}})\le L$ for all $i\in[n]$, and $\sqrt d\sigma/4\le\|\xi_{\mathrm{train}}\|\le\sqrt d\sigma$. Given any $\epsilon>0$, suppose $\eta\le\frac{1}{c_5d^2\log^2(d/\epsilon)}$ for some constant $c_5$; then $|Q(\eta)-Q'(\eta)|\le\epsilon$.
Combining Lemma 38 and Lemma 39, we give the proof of Lemma 36.

Proof of Lemma 36. Recall that we define $Q(\eta):=\frac12\mathbb{E}_{\mathrm{SGD}}\|w_{t,\eta}-w^*\|^2$ and $Q'(\eta):=\frac12\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2$, where $\{w'_{\tau,\eta}\}$ is an SGD sequence running on task $P$ without truncation. According to Lemma 38, with probability at least $1-\exp(-\Omega(d))$ over the sampling of task $P$, we have $\sqrt d/\sqrt L\le\sigma_i(X_{\mathrm{train}})\le\sqrt{Ld}$ and $1/L\le\lambda_i(H_{\mathrm{train}})\le L$ for all $i\in[n]$, $\sqrt d\sigma/4\le\|\xi_{\mathrm{train}}\|\le\sqrt d\sigma$, and there exist $\eta_1,\eta_2,\eta_3=\Theta(1/t)$ with $\eta_1<\eta_2<\eta_3$ such that
$$Q'(\eta_2)\le\tfrac12\|w^*\|^2-C,\qquad Q'(\eta)\ge\tfrac12\|w^*\|^2-\tfrac C2,\quad\forall\eta\in[0,\eta_1]\cup[\eta_3,1/L],$$
where $C$ is a positive constant. Call this event $E$, and suppose $E$ happens with probability $1-\delta$. We can write $\mathbb{E}_{P\sim\mathcal T}Q(\eta)$ as follows:
$$\mathbb{E}_{P\sim\mathcal T}Q(\eta)=\mathbb{E}_{P\sim\mathcal T}[Q(\eta)\mid E]\Pr[E]+\mathbb{E}_{P\sim\mathcal T}[Q(\eta)\mid\bar E]\Pr[\bar E].$$
According to the algorithm, $\|w_{t,\eta}\|$ is always bounded by $4\sqrt L\sigma$; therefore $Q(\eta)=\frac12\mathbb{E}_{\mathrm{SGD}}\|w_{t,\eta}-w^*\|^2\le13L\sigma^2$. By Lemma 39, conditioning on $E$, we know $|Q(\eta)-Q'(\eta)|\le\epsilon$ for any $\eta\le\frac{1}{c_5d^2\log^2(d/\epsilon)}$. As long as $t\ge c_2d^2\log^2(d/\epsilon)$ for a certain constant $c_2$, we know $\eta_3\le\frac{1}{c_5d^2\log^2(d/\epsilon)}$. When $\eta=\eta_2$, we have
$$\mathbb{E}_{P\sim\mathcal T}Q(\eta_2)\le\big(Q'(\eta_2)+\epsilon\big)(1-\delta)+13L\sigma^2\delta\le\tfrac12\|w^*\|^2-C+\epsilon(1-\delta)+13L\sigma^2\delta\le\tfrac12\|w^*\|^2-C+13L\sigma^2\delta+\epsilon\le\tfrac12\|w^*\|^2-\tfrac{9C}{10},$$
where the last inequality assumes $\delta\le\frac{C}{260L\sigma^2}$ and $\epsilon\le\frac C{20}$. When $\eta\in[0,\eta_1]\cup[\eta_3,\frac1{c_5d^2\log^2(d/\epsilon)}]$, we have
$$\mathbb{E}_{P\sim\mathcal T}Q(\eta)\ge\big(Q'(\eta)-\epsilon\big)(1-\delta)-13L\sigma^2\delta\ge\Big(\tfrac12\|w^*\|^2-\tfrac C2-\epsilon\Big)(1-\delta)-13L\sigma^2\delta\ge\tfrac12\|w^*\|^2-\tfrac C2-\tfrac\delta2-13L\sigma^2\delta-\epsilon\ge\tfrac12\|w^*\|^2-\tfrac{6C}{10},$$
where the last inequality holds as long as $\delta\le\frac{C}{280L\sigma^2}$ and $\epsilon\le\frac C{20}$. According to Lemma 38, we know $\delta\le\exp(-\Omega(d))$; therefore the conditions on $\delta$ are satisfied as long as $d$ is larger than a certain constant. The condition on $\epsilon$ can be satisfied as long as $\eta\le\frac{1}{c_5d^2\log^2(d)}$ for some constant $c_5$.
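Lemma 36 concerns the population quantity $Q(\eta)=\frac12\mathbb E_{\mathrm{SGD}}\|w_{t,\eta}-w^*\|^2$. A rough Monte Carlo estimator of this quantity can be sketched as follows (our illustrative code; `q_hat` is a hypothetical helper, and the dimensions, noise level, and task count are arbitrary and much smaller than the regimes the lemma requires). Note that $Q(0)=\frac12\|w^*\|^2$ exactly, since at step size zero SGD never leaves $w_0=0$.

```python
import numpy as np

def q_hat(eta, t=40, d=20, n=10, sigma=0.5, tasks=20, seed=1):
    """Monte Carlo estimate of Q(eta) = E 1/2 ||w_t - w*||^2 for one-sample SGD."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(tasks):
        X = rng.standard_normal((n, d))
        w_star = rng.standard_normal(d)
        w_star /= np.linalg.norm(w_star)          # unit-norm target, as in the analysis
        y = X @ w_star + sigma * rng.standard_normal(n)
        w = np.zeros(d)
        for _ in range(t):
            i = rng.integers(n)                   # sample one training point
            w -= eta * X[i] * (X[i] @ w - y[i])   # per-sample least-squares gradient
        vals.append(0.5 * np.linalg.norm(w - w_star) ** 2)
    return float(np.mean(vals))

# With eta = 0, SGD never moves from w_0 = 0, so the estimate is 1/2 ||w*||^2 = 1/2 exactly.
assert abs(q_hat(0.0) - 0.5) < 1e-12
```

Sweeping `eta` over a grid with this estimator gives an empirical picture of the basin around $\eta_2=\Theta(1/t)$ described in the lemma, although the constants there assume much larger $t$ and $d$ scalings.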

D.2.2 GENERALIZATION FOR $\eta\in[0,\frac{1}{c_5d^2\log^2 d}]$

In this section, we prove Lemma 37 by showing that $\hat F_{TbV}(\eta)$ is point-wise close to $F_{TbV}(\eta)$ for all $\eta\in[0,\frac{1}{c_5d^2\log^2(d/\epsilon)}]$. Recall Lemma 37 as follows.

Lemma 37. For any $1>\epsilon>0$, assume $\sigma$ is a constant and $d\ge c_4\log(1/\epsilon)$ for some constant $c_4$. There exists a constant $c_5$ such that with probability at least $1-O(1/\epsilon)\exp(-\Omega(\epsilon^2m))$, $|\hat F_{TbV}(\eta)-F_{TbV}(\eta)|\le\epsilon$ for all $\eta\in[0,\frac{1}{c_5d^2\log^2(d/\epsilon)}]$.

In order to prove Lemma 37, we first show that for a fixed $\eta$, with high probability $\hat F_{TbV}(\eta)$ is close to $F_{TbV}(\eta)$. As in Lemma 16, we can still show that each $\Delta_{TbV}(\eta,P)$ is $O(1)$-subexponential. The proof is deferred to Section D.2.3.

Lemma 40. Suppose $\sigma$ is a constant. Given any $1>\epsilon>0$, for any fixed $\eta$, with probability at least $1-\exp(-\Omega(\epsilon^2m))$, $|\hat F_{TbV}(\eta)-F_{TbV}(\eta)|\le\epsilon$.

Next, we show that there exists an $\epsilon$-net for $F_{TbV}$ with size $O(1/\epsilon)$. By an $\epsilon$-net, we mean a finite set $N$ of step sizes such that $|F_{TbV}(\eta)-F_{TbV}(\eta')|\le\epsilon$ for any $\eta$ and any $\eta'\in\arg\min_{\eta''\in N}|\eta-\eta''|$. The proof is very similar to that of Lemma 17. We defer the proof of Lemma 41 to Section D.2.3.

Lemma 41. Suppose $\sigma$ is a constant. For any $1>\epsilon>0$, assume $d\ge c_4\log(1/\epsilon)$ for some $c_4$. There exist a constant $c_5$ and an $\epsilon$-net $N\subset[0,\frac{1}{c_5d^2\log^2(d/\epsilon)}]$ for $F_{TbV}$ with $|N|=O(1/\epsilon)$. That means, for any $\eta\in[0,\frac{1}{c_5d^2\log^2(d/\epsilon)}]$, $|F_{TbV}(\eta)-F_{TbV}(\eta')|\le\epsilon$ for $\eta'\in\arg\min_{\eta''\in N}|\eta-\eta''|$.

Next, we show that with high probability there also exists an $\epsilon$-net for $\hat F_{TbV}$ with size $O(1/\epsilon)$. The proof is very similar to the proof of Lemma 18. We defer the proof to Section D.2.3.

Lemma 42. Suppose $\sigma$ is a constant. For any $1>\epsilon>0$, assume $d\ge c_4\log(1/\epsilon)$ for some $c_4$. With probability at least $1-\exp(-\Omega(\epsilon^2m))$, there exist a constant $c_5$ and an $\epsilon$-net $N\subset[0,\frac{1}{c_5d^2\log^2(d/\epsilon)}]$ for $\hat F_{TbV}$ with $|N|=O(1/\epsilon)$. That means, for any $\eta\in[0,\frac{1}{c_5d^2\log^2(d/\epsilon)}]$, $|\hat F_{TbV}(\eta)-\hat F_{TbV}(\eta')|\le\epsilon$ for $\eta'\in\arg\min_{\eta''\in N}|\eta-\eta''|$.
Combining Lemma 40, Lemma 41, and Lemma 42, we now give the proof of Lemma 37. Proof of Lemma 37. The proof is almost the same as the proof of Lemma 11; we omit the details here.

D.2.3 PROOFS OF TECHNICAL LEMMAS

In Lemma 43, we show that when the step size is small, the expected squared SGD noise is well bounded. The proof follows from the analysis in Lemma 33.

Lemma 43. Let $\{w'_{\tau,\eta}\}$ be an SGD sequence running on task $P$ without truncation. Let $n_{\tau,\eta}$ be the SGD noise at $w'_{\tau,\eta}$. Assume $\sqrt d/\sqrt L\le\sigma_i(X_{\mathrm{train}})\le\sqrt{Ld}$ for all $i\in[n]$ and $\|\xi_{\mathrm{train}}\|\le\sqrt d\sigma$. Suppose $\eta\in[0,\frac1{2L^3d}]$; then $\mathbb{E}_{\mathrm{SGD}}\|n_{\tau,\eta}\|^2\le4L^3\sigma^2d$ for all $\tau\le t$.

Proof of Lemma 43. As in the analysis of Lemma 33, for $\eta\le\frac1{2L^3d}$ we have
$$\mathbb{E}_{\mathrm{SGD}}\big[\|n_{\tau,\eta}\|^2\,\big|\,w'_{\tau-1,\eta}\big]\le L^2d\,\|w'_{\tau-1,\eta}-w_{\mathrm{train}}\|^2$$
and
$$\mathbb{E}_{\mathrm{SGD}}\|w'_{\tau-1,\eta}-w_{\mathrm{train}}\|^2\le\Big(1-\frac\eta{2L}\Big)^{\tau-1}\|w_{\mathrm{train}}\|^2\le\|w^*_{\mathrm{train}}+(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}\|^2\le4L\sigma^2.$$
Therefore, $\mathbb{E}_{\mathrm{SGD}}\|n_{\tau,\eta}\|^2\le L^2d\,\mathbb{E}_{\mathrm{SGD}}\|w'_{\tau-1,\eta}-w_{\mathrm{train}}\|^2\le4L^3\sigma^2d$.

Proof of Lemma 38. We can expand $Q'(\eta)$ as follows:
$$Q'(\eta):=\frac12\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2=\frac12\mathbb{E}_{\mathrm{SGD}}\Big\|B_{t,\eta}w^*_{\mathrm{train}}+B_{t,\eta}(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}-\eta\sum_{\tau=0}^{t-1}(I-\eta H_{\mathrm{train}})^{t-1-\tau}n_{\tau,\eta}-w^*\Big\|^2$$
$$=\frac12\|B_{t,\eta}w^*_{\mathrm{train}}-w^*\|^2+\frac12\|B_{t,\eta}(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}\|^2+\frac{\eta^2}2\mathbb{E}_{\mathrm{SGD}}\Big\|\sum_{\tau=0}^{t-1}(I-\eta H_{\mathrm{train}})^{t-1-\tau}n_{\tau,\eta}\Big\|^2+\big\langle B_{t,\eta}w^*_{\mathrm{train}}-w^*,\ B_{t,\eta}(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}\big\rangle.$$
Denote
$$G(\eta):=\frac12\|B_{t,\eta}w^*_{\mathrm{train}}-w^*\|^2+\frac12\|B_{t,\eta}(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}\|^2+\frac{\eta^2}2\mathbb{E}_{\mathrm{SGD}}\Big\|\sum_{\tau=0}^{t-1}(I-\eta H_{\mathrm{train}})^{t-1-\tau}n_{\tau,\eta}\Big\|^2.$$
We first show that with probability at least $1-\exp(-\Omega(d))$, there exist $\eta_1,\eta_2,\eta_3=\Theta(1/t)$ with $\eta_1<\eta_2<\eta_3$ such that $G(\eta_2)\le\frac12\|w^*\|^2-\frac{5C}4$ and $G(\eta)\ge\frac12\|w^*\|^2-\frac C4$ for all $\eta\in[0,\eta_1]\cup[\eta_3,1/L]$. According to Lemma 1, with probability at least $1-\exp(-\Omega(d))$, $\sqrt d/\sqrt L\le\sigma_i(X_{\mathrm{train}})\le\sqrt{Ld}$ and $1/L\le\lambda_i(H_{\mathrm{train}})\le L$ for all $i\in[n]$. According to Lemma 45, with probability at least $1-\exp(-\Omega(d))$, $\sqrt d\sigma/4\le\|\xi_{\mathrm{train}}\|\le\sqrt d\sigma$.
Upper bounding $G(\eta_2)$: We can expand $G(\eta)$ as follows:
$$G(\eta)=\frac12\|w^*\|^2+\frac12\|B_{t,\eta}w^*_{\mathrm{train}}\|^2+\frac12\|B_{t,\eta}(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}\|^2+\frac{\eta^2}2\mathbb{E}_{\mathrm{SGD}}\Big\|\sum_{\tau=0}^{t-1}(I-\eta H_{\mathrm{train}})^{t-1-\tau}n_{\tau,\eta}\Big\|^2-\big\langle B_{t,\eta}w^*_{\mathrm{train}},w^*\big\rangle.$$
As in Lemma 13, we know $\frac12\|B_{t,\eta}w^*_{\mathrm{train}}\|^2+\frac12\|B_{t,\eta}(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}\|^2\le L^3\eta^2t^2\sigma^2$. For the SGD noise, by Lemma 43 we know $\mathbb{E}_{\mathrm{SGD}}\|n_{\tau,\eta}\|^2\le4L^3\sigma^2d$ for all $\tau\le t$ as long as $\eta\le\frac1{2L^3d}$. Therefore,
$$\frac{\eta^2}2\mathbb{E}_{\mathrm{SGD}}\Big\|\sum_{\tau=0}^{t-1}(I-\eta H_{\mathrm{train}})^{t-1-\tau}n_{\tau,\eta}\Big\|^2\le\frac{\eta^2}2\sum_{\tau=0}^{t-1}\mathbb{E}_{\mathrm{SGD}}\|n_{\tau,\eta}\|^2\le2L^3\eta^2\sigma^2dt\le2L^3\eta^2\sigma^2t^2,$$
where the last inequality assumes $t\ge d$. According to Lemma 15, for any fixed $\eta\in[0,L/t]$, with probability at least $1-\exp(-\Omega(d))$ over $X_{\mathrm{train}}$, $\langle B_{t,\eta}w^*_{\mathrm{train}},w^*\rangle\ge\frac{\eta t}{16L}$. Therefore, for any step size $\eta\le\frac1{2L^3d}$,
$$G(\eta)\le\frac12\|w^*\|^2+3L^3\eta^2\sigma^2t^2-\frac{\eta t}{16L}\le\frac12\|w^*\|^2-\frac{\eta t}{32L},$$
where the second inequality holds as long as $\eta\le\frac1{96L^4\sigma^2t}$. We choose $\eta_2:=\frac1{96L^4\sigma^2t}$, which is smaller than $\frac1{2L^3d}$ assuming $t\ge d$. Then we have $G(\eta_2)\le\frac12\|w^*\|^2-\frac{5C}4$, where the constant $C=\frac1{3072L^5\sigma^2}$.

Lower bounding $G(\eta)$ for $\eta\in[0,\eta_1]$: We now prove that there exists $\eta_1=\Theta(1/t)$ with $\eta_1<\eta_2$ such that for any $\eta\in[0,\eta_1]$, $G(\eta)\ge\frac12\|w^*\|^2-\frac C4$. From the expansion above,
$$G(\eta)\ge\frac12\|w^*\|^2-\big\langle B_{t,\eta}w^*_{\mathrm{train}},w^*\big\rangle.$$
As in Lemma 13, by choosing $\eta_1=\frac C{4Lt}$, we have $G(\eta)\ge\frac12\|w^*\|^2-\frac C4$ for any $\eta\in[0,\eta_1]$.

Lower bounding $G(\eta)$ for $\eta\in[\eta_3,1/L]$: We now prove that there exists $\eta_3=\Theta(1/t)$ with $\eta_3>\eta_2$ such that for all $\eta\in[\eta_3,1/L]$, $G(\eta)\ge\frac12\|w^*\|^2-\frac C4$. Recall that
$$G(\eta)\ge\frac12\|B_{t,\eta}(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}\|^2.$$
As in Lemma 13, by choosing $\eta_3=\log(2)L/t$, as long as $\sigma\ge8\sqrt L$ we have $G(\eta)\ge\frac12\|w^*\|^2$ for all $\eta\in[\eta_3,1/L]$. Note that $\eta_3\le1/L$ as long as $t\ge\log(2)L^2$. Overall, we have shown that there exist $\eta_1,\eta_2,\eta_3=\Theta(1/t)$ with $\eta_1<\eta_2<\eta_3$ such that $G(\eta_2)\le\frac12\|w^*\|^2-\frac{5C}4$ and $G(\eta)\ge\frac12\|w^*\|^2-\frac C4$ for all $\eta\in[0,\eta_1]\cup[\eta_3,1/L]$. Recall that $Q'(\eta)=G(\eta)+\langle B_{t,\eta}w^*_{\mathrm{train}}-w^*,\ B_{t,\eta}(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}\rangle$. Choosing $\epsilon=C/4$ in Lemma 14, we know that with probability at least $1-\exp(-\Omega(d))$, $|\langle B_{t,\eta}w^*_{\mathrm{train}}-w^*,\ B_{t,\eta}(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}\rangle|\le C/4$ for all $\eta\in[0,1/L]$. Therefore, $Q'(\eta_2)\le\frac12\|w^*\|^2-C$ and $Q'(\eta)\ge\frac12\|w^*\|^2-\frac C2$ for all $\eta\in[0,\eta_1]\cup[\eta_3,1/L]$.

In order to prove Lemma 39, we first construct a super-martingale to show that, as long as task $P$ is well behaved, with high probability over the SGD noise the weight norm along the trajectory never exceeds $4\sqrt L\sigma$.

Lemma 44. Assume $\sqrt d/\sqrt L\le\sigma_i(X_{\mathrm{train}})\le\sqrt{Ld}$ and $1/L\le\lambda_i(H_{\mathrm{train}})\le L$ for all $i\in[n]$, and $\sqrt d\sigma/4\le\|\xi_{\mathrm{train}}\|\le\sqrt d\sigma$. Given any $1>\delta>0$, suppose $\eta\le\frac1{c_5d^2\log^2(d/\delta)}$ for some constant $c_5$; then with probability at least $1-\delta$ over the SGD noise, $\|w'_{\tau,\eta}\|<4\sqrt L\sigma$ for all $\tau\le t$.

Proof of Lemma 44. According to the proof of Lemma 43, as long as $\eta\le\frac1{2L^3d}$, we have
$$\mathbb{E}_{\mathrm{SGD}}\big[\|w'_{t,\eta}-w_{\mathrm{train}}\|^2\,\big|\,w'_{t-1,\eta}\big]\le\Big(1-\frac\eta{2L}\Big)\|w'_{t-1,\eta}-w_{\mathrm{train}}\|^2.$$
Since $\log$ is a concave function, by Jensen's inequality we know
$$\mathbb{E}_{\mathrm{SGD}}\big[\log\|w'_{t,\eta}-w_{\mathrm{train}}\|^2\,\big|\,w'_{t-1,\eta}\big]\le\log\mathbb{E}_{\mathrm{SGD}}\big[\|w'_{t,\eta}-w_{\mathrm{train}}\|^2\,\big|\,w'_{t-1,\eta}\big]\le\log\|w'_{t-1,\eta}-w_{\mathrm{train}}\|^2+\log\Big(1-\frac\eta{2L}\Big).$$
Defining $G_t=\log\|w'_{t,\eta}-w_{\mathrm{train}}\|^2-t\log(1-\frac\eta{2L})$, we know $G_t$ is a super-martingale. Next, we bound the martingale differences.
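Lemma 44 is about the truncation rule: for small step sizes the iterates essentially never reach the $4\sqrt L\sigma$ threshold, while overly large step sizes blow up and get truncated. The following toy simulation (our illustrative code; `truncated_sgd` is a hypothetical helper, and the threshold and step sizes are arbitrary stand-ins for the constants in the lemma) shows the mechanism.

```python
import numpy as np

def truncated_sgd(X, y, eta, t, threshold, seed=0):
    """Run one-sample SGD from w_0 = 0, stopping (truncating) once ||w|| exceeds threshold.

    Returns the final iterate and whether truncation was triggered.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(t):
        i = rng.integers(n)
        w = w - eta * X[i] * (X[i] @ w - y[i])
        if np.linalg.norm(w) >= threshold:
            return w, True
    return w, False

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 20))
y = X @ rng.standard_normal(20) + 0.5 * rng.standard_normal(10)

_, hit_small = truncated_sgd(X, y, eta=1e-3, t=500, threshold=100.0)
_, hit_large = truncated_sgd(X, y, eta=50.0, t=500, threshold=100.0)
assert not hit_small   # tiny step size: each update contracts toward w_train, norm stays small
assert hit_large       # huge step size: the iterates blow up and get truncated
```

With $n<d$ the interpolating solution satisfies $x_i^\top w_{\mathrm{train}}=y_i$ for every sample, so each small-step update is a pure contraction of $w-w_{\mathrm{train}}$ along the sampled direction, which is why the small-step run never hits the threshold.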

We can bound $|G_t-\mathbb{E}_{\mathrm{SGD}}[G_t\mid w'_{t-1,\eta}]|$ as follows:
$$|G_t-\mathbb{E}_{\mathrm{SGD}}[G_t\mid w'_{t-1,\eta}]|\le\max_{n_{t-1,\eta},\,n'_{t-1,\eta}}\log\frac{\|(I-\eta H_{\mathrm{train}})(w'_{t-1,\eta}-w_{\mathrm{train}})-\eta n_{t-1,\eta}\|^2}{\|(I-\eta H_{\mathrm{train}})(w'_{t-1,\eta}-w_{\mathrm{train}})-\eta n'_{t-1,\eta}\|^2}.$$
We can expand the squared norm as
$$\|(I-\eta H_{\mathrm{train}})(w'_{t-1,\eta}-w_{\mathrm{train}})-\eta n_{t-1,\eta}\|^2=\|(I-\eta H_{\mathrm{train}})(w'_{t-1,\eta}-w_{\mathrm{train}})\|^2-2\eta\big\langle n_{t-1,\eta},(I-\eta H_{\mathrm{train}})(w'_{t-1,\eta}-w_{\mathrm{train}})\big\rangle+\eta^2\|n_{t-1,\eta}\|^2.$$
We can bound the norm of the noise as follows:
$$\|n_{t-1,\eta}\|=\big\|x_{i(t-1)}x_{i(t-1)}^\top(w'_{t-1,\eta}-w_{\mathrm{train}})-H_{\mathrm{train}}(w'_{t-1,\eta}-w_{\mathrm{train}})\big\|\le(Ld+L)\|w'_{t-1,\eta}-w_{\mathrm{train}}\|\le2Ld\|w'_{t-1,\eta}-w_{\mathrm{train}}\|,$$
where the first inequality uses the triangle inequality and $\|x_{i(t-1)}\|\le\sqrt{Ld}$. Therefore, we have
$$2\eta\big|\big\langle n_{t-1,\eta},(I-\eta H_{\mathrm{train}})(w'_{t-1,\eta}-w_{\mathrm{train}})\big\rangle\big|\le4L\eta d\|w'_{t-1,\eta}-w_{\mathrm{train}}\|^2,\qquad\eta^2\|n_{t-1,\eta}\|^2\le4L^2\eta^2d^2\|w'_{t-1,\eta}-w_{\mathrm{train}}\|^2.$$
This further implies
$$|G_t-\mathbb{E}_{\mathrm{SGD}}[G_t\mid w'_{t-1,\eta}]|\le\log\frac{\|(I-\eta H_{\mathrm{train}})(w'_{t-1,\eta}-w_{\mathrm{train}})\|^2+(4L\eta d+4L^2\eta^2d^2)\|w'_{t-1,\eta}-w_{\mathrm{train}}\|^2}{\|(I-\eta H_{\mathrm{train}})(w'_{t-1,\eta}-w_{\mathrm{train}})\|^2-4L\eta d\|w'_{t-1,\eta}-w_{\mathrm{train}}\|^2}\le\log\Big(1+\frac{8L\eta d+4L^2\eta^2d^2}{1-2L\eta-4L\eta d}\Big)\le16L\eta d+8L^2\eta^2d^2,$$
where the second inequality uses $\|(I-\eta H_{\mathrm{train}})(w'_{t-1,\eta}-w_{\mathrm{train}})\|^2\ge(1-2L\eta)\|w'_{t-1,\eta}-w_{\mathrm{train}}\|^2$, and the last inequality assumes $\eta\le\frac1{12Ld}$ and uses the numerical inequality $\log(1+x)\le x$. Assuming $\eta\le1/(Ld)$, we further have $|G_t-\mathbb{E}_{\mathrm{SGD}}[G_t\mid w'_{t-1,\eta}]|\le L^2\eta d$. By Azuma's inequality, with probability at least $1-\delta/t$,
$$G_t\le G_0+L^2\eta d\sqrt{2t\log(t/\delta)}.$$
Plugging in $G_t=\log\|w'_{t,\eta}-w_{\mathrm{train}}\|^2-t\log(1-\frac\eta{2L})$ and $G_0=\log\|w_0-w_{\mathrm{train}}\|^2=\log\|w_{\mathrm{train}}\|^2$, we have
$$\log\|w'_{t,\eta}-w_{\mathrm{train}}\|^2\le\log\|w_{\mathrm{train}}\|^2+t\log\Big(1-\frac\eta{2L}\Big)+L^2\eta d\sqrt{2t\log(t/\delta)}\le\log\|w_{\mathrm{train}}\|^2-\frac{\eta t}{2L}+L^2\eta d\sqrt{2t\log(t/\delta)}.$$
This implies
$$\|w'_{t,\eta}-w_{\mathrm{train}}\|^2\le\|w_{\mathrm{train}}\|^2\exp\Big(\eta\Big(-\frac t{2L}+L^2d\sqrt{2t\log(t/\delta)}\Big)\Big)\le\|w_{\mathrm{train}}\|^2\exp\big(O(d^2\log^2(d/\delta))\,\eta\big)\le\|w_{\mathrm{train}}\|^2\exp(2/3),$$
where the last inequality assumes $\eta\le\frac1{c_5d^2\log^2(d/\delta)}$ for some constant $c_5$.
Furthermore, since $\|w_{\mathrm{train}}\|\le(1+\sqrt L)\sigma$, we have $\|w'_{t,\eta}\|\le(1+e^{1/3})\|w_{\mathrm{train}}\|<4\sqrt L\sigma$. Overall, we know that as long as $\eta\le\frac1{c_5d^2\log^2(d/\delta)}$, with probability at least $1-\delta/t$, $\|w'_{t,\eta}\|<4\sqrt L\sigma$. Since this analysis also applies to any $\tau\le t$, we know that for any $\tau$, with probability at least $1-\delta/t$, $\|w'_{\tau,\eta}\|<4\sqrt L\sigma$. Taking a union bound over $\tau\le t$, with probability at least $1-\delta$, $\|w'_{\tau,\eta}\|<4\sqrt L\sigma$ for all $\tau\le t$.

Proof of Lemma 39. Let $E$ be the event that $\|w'_{\tau,\eta}\|<4\sqrt L\sigma$ for all $\tau\le t$. We first show that $\mathbb{E}_{\mathrm{SGD}}\|w_{t,\eta}-w^*\|^2$ is close to $\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2\mathbb 1\{E\}$. It is not hard to verify that
$$\mathbb{E}_{\mathrm{SGD}}\|w_{t,\eta}-w^*\|^2=\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2\mathbb 1\{E\}+\|u-w^*\|^2\Pr[\bar E],$$
where $u$ is a fixed vector with norm $4\sqrt L\sigma$. By Lemma 44, we know $\Pr[\bar E]\le\epsilon/(25L\sigma^2)$ as long as $\eta\le\frac1{c_5d^2\log^2(d/\epsilon)}$ for some constant $c_5$. Therefore, we have
$$\Big|\mathbb{E}_{\mathrm{SGD}}\|w_{t,\eta}-w^*\|^2-\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2\mathbb 1\{E\}\Big|\le\epsilon.$$
Next, we show that $\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2\mathbb 1\{E\}$ is close to $\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2$. For any $1\le\tau\le t$, let $E_\tau$ be the event that $\|w'_{\tau,\eta}\|\ge4\sqrt L\sigma$ and $\|w'_{\tau',\eta}\|<4\sqrt L\sigma$ for all $\tau'<\tau$. Basically, $E_\tau$ means the weight norm exceeds the threshold at step $\tau$ for the first time. It is easy to see that $\cup_{\tau=1}^tE_\tau=\bar E$. Therefore, we have
$$\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2=\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2\mathbb 1\{E\}+\sum_{\tau=1}^t\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2\mathbb 1\{E_\tau\}.$$
Conditioning on $E_\tau$, we know $\|w'_{\tau-1,\eta}\|<4\sqrt L\sigma$. Since we assume $\sqrt d/\sqrt L\le\sigma_i(X_{\mathrm{train}})\le\sqrt{Ld}$ for all $i\in[n]$ and $\|\xi_{\mathrm{train}}\|\le\sqrt d\sigma$, we know $\|w_{\mathrm{train}}\|\le2\sqrt L\sigma$. Therefore, $\|w'_{\tau-1,\eta}-w_{\mathrm{train}}\|\le6\sqrt L\sigma$. Recall the SGD update $w'_{\tau,\eta}-w_{\mathrm{train}}=(I-\eta H_{\mathrm{train}})(w'_{\tau-1,\eta}-w_{\mathrm{train}})-\eta n_{\tau-1,\eta}$. For the noise term, we have $\eta\|n_{\tau-1,\eta}\|\le2\eta Ld\|w'_{\tau-1,\eta}-w_{\mathrm{train}}\|$, which is at most $\|w'_{\tau-1,\eta}-w_{\mathrm{train}}\|$ assuming $\eta\le\frac1{2Ld}$. Therefore, $\|w'_{\tau,\eta}-w_{\mathrm{train}}\|\le2\|w'_{\tau-1,\eta}-w_{\mathrm{train}}\|\le12\sqrt L\sigma$. Note that the event $E_\tau$ is independent of the SGD noise after step $\tau$. Therefore, according to the previous analysis, as long as $\eta\le\frac1{2L^3d}$, each term $\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2\mathbb 1\{E_\tau\}$ is bounded by a constant multiple of $\sigma^2$ times $\Pr[E_\tau]$. Summing over $\tau$ and combining with the bound on $|\mathbb{E}_{\mathrm{SGD}}\|w_{t,\eta}-w^*\|^2-\mathbb{E}_{\mathrm{SGD}}\|w'_{t,\eta}-w^*\|^2\mathbb 1\{E\}|$ above finishes the proof.

F EXPERIMENT DETAILS

We describe the detailed settings of our experiments in Section F.1 and give more experimental results in Section F.2.

F.1 EXPERIMENT SETTINGS

Optimizing step size for quadratic objective In this experiment, we meta-train a learning rate for gradient descent on a fixed quadratic objective. Our goal is to show that the autograd module in popular deep learning software, such as TensorFlow, can run into numerical issues when using the log-transformed meta objective. We therefore first implement the meta-training process with TensorFlow, and then re-implement the meta-training using the hand-derived meta-gradient (see Eqn 3) to compare the results. The general setting for both implementations is as follows. The inner problem is a fixed 20-dimensional quadratic objective as described in Section 3, and we use the log-transformed meta objective for training. The positive semi-definite matrix H is generated by first sampling a 20 × 20 matrix X with all entries drawn from the standard normal distribution and then setting H = X^T X. The initial point w_0 is drawn from a standard normal as well. Note that we use the same quadratic problem (i.e., the same H and w_0) throughout meta-training. We run 1000 meta-training iterations and collect results for different settings of the initial learning rate η_0 and the unroll length t. We first implement the meta-training code with TensorFlow. Our code is adapted from Wichrowska et al. (2017) (see footnote 2). We use their global learning rate optimizer and specify the problem set to contain a single quadratic objective instance. We implemented the quadratic objective class ourselves (the "MyQuadratic" class). We also turned off several advanced features of the original code, such as attention and second derivatives, by setting their flags to false. This ensures that the experiments have exactly the settings we described. The meta-training learning rate is set to 0.001, which is of a similar scale as in our next experiment.
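For intuition, here is a small self-contained sketch (our illustrative code; it reproduces neither the paper's Eq (3) nor the TensorFlow implementation) of computing the meta-gradient of the log-transformed meta objective by unrolling GD on a quadratic $f(w)=\frac12w^\top Hw$ by hand, checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, eta = 20, 80, 0.01
A = rng.standard_normal((d, d))
H = A.T @ A / d                      # PSD quadratic, as in the experiment
w0 = rng.standard_normal(d)

def unroll(eta):
    """Return w_t and dw_t/deta for GD on f(w) = 1/2 w^T H w, unrolled by hand."""
    w, v = w0.copy(), np.zeros(d)    # v carries the derivative dw_k/deta
    for _ in range(t):
        # w_{k+1} = (I - eta H) w_k;  v_{k+1} = (I - eta H) v_k - H w_k
        w, v = w - eta * (H @ w), v - eta * (H @ v) - H @ w  # RHS uses the old w
    return w, v

def meta_obj(eta):
    w, _ = unroll(eta)
    return np.log(0.5 * w @ H @ w)   # log-transformed meta objective

w_t, v_t = unroll(eta)
analytic = (w_t @ H @ v_t) / (0.5 * w_t @ H @ w_t)   # d/deta of log f(w_t)
numeric = (meta_obj(eta + 1e-6) - meta_obj(eta - 1e-6)) / 2e-6
assert abs(analytic - numeric) < 1e-3 * max(1.0, abs(analytic))
```

The hand-unrolled derivative stays a ratio of two quantities that can be scaled jointly, which is the kind of rescaling the paper applies to avoid overflow in the raw unrolled product.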
We also try RMSProp as the meta optimizer, which alleviates some of the numerical issues since it renormalizes the gradient, but our experiments show that even RMSProp is still much worse than our implementation. We then implement the meta-training by hand to show accurate training results that avoid numerical issues. Specifically, we compute the meta-gradient using Eq (3), where we also scale the numerator and denominator as described in Claim 2 to avoid numerical issues. We use the algorithm suggested in Theorem 4, except that we choose the meta-step size to be 1/(100√k), as the constants in Theorem 4 were not optimized.

Train-by-train vs. train-by-validation, synthetic data In this experiment, we find the optimal learning rate η* for least-squares problems under the train-by-train and train-by-validation settings, and then see how each learning rate performs on new tasks. Specifically, we generate 300 different 1000-dimensional least-squares tasks with noise, as defined in Section 4, for inner-training, and then use the meta-objectives defined in Eq (1) and Eq (2) to find the optimal learning rate. The number of inner-training steps t is set to 40. We try different sample sizes and noise levels for comparison. Subsequently, to test how the two η* (for train-by-train and train-by-validation respectively) perform, we use them on 10 test tasks (with the same setting as the inner-training problems) and compute the training and testing root mean squared error (RMSE). Note that since we only need the final optimal η* found under each meta-objective setting (regardless of how it is found), we do not need to actually run the meta-training. Instead, we do a grid search on the interval [10^-6, 1], divided log-linearly into 25 candidate points. For both the train-by-train and train-by-validation settings, we average the meta-objectives over the 300 inner problems and pick the η that minimizes this averaged meta-objective. Train-by-train vs.
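The grid-search procedure just described can be sketched as follows (our illustrative code with far smaller dimensions than the 1000-dimensional tasks in the experiment; `best_eta` is a hypothetical helper name):

```python
import numpy as np

def best_eta(meta="tbv", tasks=50, d=30, n=20, t=40, sigma=1.0, seed=0):
    """Grid-search the step size minimizing the averaged meta-objective.

    meta='tbt' scores GD's final iterate on the training set it was run on;
    meta='tbv' scores it on a held-out validation set of the same size.
    """
    rng = np.random.default_rng(seed)
    grid = np.logspace(-6, 0, 25)              # log-linear grid on [1e-6, 1]
    scores = np.zeros(len(grid))
    for _ in range(tasks):
        w_star = rng.standard_normal(d) / np.sqrt(d)
        Xtr = rng.standard_normal((n, d))
        ytr = Xtr @ w_star + sigma * rng.standard_normal(n)
        Xva = rng.standard_normal((n, d))
        yva = Xva @ w_star + sigma * rng.standard_normal(n)
        for j, eta in enumerate(grid):
            w = np.zeros(d)
            for _ in range(t):                 # inner GD on the training split
                w -= eta * Xtr.T @ (Xtr @ w - ytr) / n
            X, y = (Xtr, ytr) if meta == "tbt" else (Xva, yva)
            scores[j] += 0.5 * np.mean((X @ w - y) ** 2)
    return grid[int(np.argmin(scores))]

eta_tbt = best_eta("tbt")
eta_tbv = best_eta("tbv")
```

With noisy tasks and few samples, the theory predicts the train-by-train objective favors larger step sizes that fit the noise, while train-by-validation prefers a smaller $\Theta(1/t)$ step size; this toy setup is only meant to show the mechanics of the search.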
train-by-validation, MLP optimizer on MNIST To observe the trade-off between train-by-train and train-by-validation in a broader and more realistic setting, we also run experiments that meta-train an MLP optimizer, as in Metz et al. (2019), to solve the MNIST classification problem. We use part of their code (see footnote 3) integrated with our code from the first experiment, and we use exactly the same default settings as theirs, summarized below. The MLP optimizer is a trainable optimizer that works on each parameter separately. During inner-training, for each parameter, we first compute some statistics of that parameter (explained below), which are combined into a feature vector; we then feed that feature vector to a Multi-Layer Perceptron (MLP) with ReLU activations, which outputs two scalars: the update direction and the magnitude. The update is computed as the direction times the exponential of the magnitude. The feature vector is 31-dimensional and includes the gradient, the parameter value, first-order moving averages (5-dim), second-order moving averages (5-dim), the normalized gradient (5-dim), the reciprocal of the square root of the second-order moving averages (5-dim), and a step embedding (9-dim). All moving averages are computed with 5 different decay rates (0.5, 0.9, 0.99, 0.999, 0.9999), and the step embedding is a tanh distortion of the current number of steps divided by 9 different scales (3, 10, 30, 100, 300, 1000, 3000, 10000, 300000). After computing the 31-dimensional feature vector for each parameter, we also normalize the set of vectors dimension-wise across all parameters to have mean 0 and standard deviation 1 (except for the step embedding part). More details can be found in the original paper and implementation. The inner-training problem is to use a two-layer fully connected network (i.e., another "MLP") with ReLU activations to solve the classic MNIST 10-class classification problem.
We use a very small network for computational efficiency; the two layers have 100 and 20 neurons. We fix the cross-entropy loss as the inner-objective and use mini-batches of 32 samples during inner-training. When we meta-train the MLP optimizer, we use exactly the same process as in the experiments of Wichrowska et al. (2017). We use 100 different inner problems obtained by shuffling the 10 classes, and we also sample a new subset of data when we do not use the complete MNIST data set. We run each of the problems with three inner-training trajectories starting from different initializations. Each inner-training trajectory is divided into a number of unrolled segments, and we compute the meta-objective and update the meta-optimizer after each segment. The number of unrolled segments in each trajectory is sampled from 10 + Exp(30), and the length of each segment is sampled from 50 + Exp(100), where Exp(·) denotes the exponential distribution. Note that the meta-objective computed after each segment is defined as the average of all the inner-objectives within that segment (evaluated on the train/validation set for train-by-train/train-by-validation) for better convergence. We also do not need to log-transform the inner-objective this time, because the cross-entropy loss has a log operator itself. The meta-training, i.e., training the parameters of the MLP in the MLP optimizer, is done using a classic RMSProp optimizer with meta learning rate 0.01. For each setting of sample size and noise level, we train two MLP optimizers: one for train-by-train and one for train-by-validation. When we test a learned MLP optimizer, we use similar settings as the inner-training problem, but run the trajectories longer for full convergence (4000 steps for the small data sets; 40000 steps for the complete data set). We run 5 independent tests and collect training and test accuracy for evaluation; the plots show the mean of the 5 tests.
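The per-parameter feature computation described above can be paraphrased as follows (a rough sketch under our own simplifications: scalar parameters, no cross-parameter normalization, and a hypothetical `features` helper; the authoritative feature definitions are in Metz et al.'s code):

```python
import numpy as np

DECAYS = (0.5, 0.9, 0.99, 0.999, 0.9999)
SCALES = (3, 10, 30, 100, 300, 1000, 3000, 10000, 300000)

def features(grad, param, m1, m2, step, eps=1e-8):
    """31-dim per-parameter feature vector, paraphrasing the description above.

    m1, m2 are the 5 first/second-moment moving averages (updated in place).
    """
    m1[:] = [d * a + (1 - d) * grad for d, a in zip(DECAYS, m1)]
    m2[:] = [d * a + (1 - d) * grad ** 2 for d, a in zip(DECAYS, m2)]
    rsqrt = [1.0 / np.sqrt(v + eps) for v in m2]       # reciprocal sqrt of 2nd moments
    normed = [grad * r for r in rsqrt]                 # normalized gradient
    embed = [np.tanh(step / s) for s in SCALES]        # step embedding
    return np.array([grad, param] + m1 + m2 + normed + rsqrt + embed)

f = features(grad=0.1, param=0.5, m1=[0.0] * 5, m2=[0.0] * 5, step=100)
assert f.shape == (31,)   # 2 + 5 + 5 + 5 + 5 + 9 = 31
```

The count 2 + 5 + 5 + 5 + 5 + 9 = 31 matches the dimensions listed in the text; the MLP then maps this vector to a direction and a log-magnitude per parameter.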
We have also tuned an SGD optimizer (with the same mini-batch size) by grid-searching its learning rate, as a baseline.

F.2 ADDITIONAL RESULTS

Optimizing step size for quadratic objective We run experiments with the same settings of the initial η_0 and inner training length t for all three implementations (our hand-derived GD version, the TensorFlow GD version, and the TensorFlow RMSProp version), with 1000 meta-training steps in each case. For both TensorFlow versions, we always see infinite meta-objectives if η_0 is large or t is large; the corresponding meta-gradient is usually treated as zero, so the training gets stuck and never converges. Even when both η_0 and t are small, the meta-objective can still be very large (on the scale of a few hundred), which is why we also try RMSProp, which should be more robust to the gradient scales. Our hand-derived version, however, does not have these numerical issues and always converges to the optimal η*. The detailed convergence results are summarized in Tab 1 and Tab 2. Note that the optimal η* is usually around 0.03 under our settings.

Train-by-train vs. train-by-validation, MLP optimizer on MNIST We also run additional experiments on training an MLP optimizer on the MNIST classification problem. We first try using all samples under the 20%-noise setting. The results are shown in Fig 8. The train-by-train setting can perform well if we have a large data set, but since there is also noise in the data, the train-by-train model still overfits and is slightly worse than the train-by-validation model. We can see that, as the theory predicts, as the amount of data increases (from 1000 samples to 12000 samples and then to 60000 samples), the gap between train-by-train and train-by-validation decreases. Also, conditioning on the same number of samples, additional label noise always makes the train-by-train model much worse compared to train-by-validation.



In the notation of Section 2, one can think of D as containing a single point (0, 0) and the loss function as f(w) = ℓ(w, 0, 0).

Footnote 2: Their open source code is available at https://github.com/tensorflow/models/tree/master/research/learned_optimizer
Footnote 3: Their code is available at https://github.com/google-research/google-research/tree/master/task_specific_learned_opt



Figure 1: Training η (t = 80, η 0 = 0.1)

Figure 3: Training and testing RMSE for different σ values (500 samples)

Figure 4: Training and testing RMSE for different samples sizes (σ = 1)

Figure 5: Training and testing accuracy for different models (1000 samples, no noise)

Let the meta objective FT bT (n) (η) be as defined in Equation 1 with n ∈ [d/4, 3d/4]. Assume noise level σ is a large constant c 1 . Assume unroll length t ≥ c 2 , number of training tasks m ≥ c 3 log(mt) and dimension d ≥ c 4 log(m) for certain constants c 2 , c 3 , c 4 . With probability at least 0.99 in the sampling of the training tasks, we have

train with n 1 samples and a validation set S (k) valid with n 2 samples. Similar as above, for the training set S (k) train , we can define ξ

-w * k is upper bounded by a constant. Then, we know ∆ T bT (η, P k ) is O(1)-subexponential. Therefore, FT bT (η) is the average of m i.i.d. O(1)subexponential random variables. By standard concentration inequality, we know for any 1 > > 0, with probability at least 1 -exp(-Ω( 2 m)),

-net N with |N | = O( nt 2 d + m) for FT bT . That means, for any η ∈ [0, η], FT bT (η) -FT bT (η ) ≤ 2 dσ 2 n for η = arg min η ∈N ,η ≤η (η -η ). √ Ld and 1/L ≤ λ i (H train ) ≤ L for all i ∈ [n] and √ dσ/4 ≤ ξ train ≤ √ dσ.According to Lemma 1 and Lemma 45, we know for each k ∈ [m], E k happens with probability at least 1 -exp(-Ω(d)). Taking a union bound over all k ∈ [m], we know ∩ k∈[m] E k holds with probability at least 1 -m exp(-Ω(d)). From now on, we assume ∩ k∈[m] E k holds.

with n 1 , n 2 ∈ [d/4, 3d/4]. Assume noise level σ is a large constant c 1 . Assume unroll length t ≥ c 2 d 2 log 2 (d), number of training tasks m ≥ c 3 and dimension d ≥ c 4 for certain constants c 2 , c 3 , c 4 . There exists constant c 5 such that with probability at least 0.99 in the sampling of training tasks, we have η * valid = Θ(1/t) and E w t,η * valid -w * 2 = w * 2 -Ω(1) for all η * valid ∈ arg min 0≤η≤ 1 c 5 d 2 log 2 (

SGD w t,η -w * 2 | Ē Pr[ Ē] ≤ 13Lσ 2 Pr[ Ē] ≤ ,where the last inequality assumes Pr[ Ē] ≤ 13Lσ 2 . According to Lemma 1 and Lemma 45, we know Pr[ Ē] ≤ exp(-Ω(d)). Therefore, given any > 0, we have Pr[ Ē] ≤ 13Lσ 2 as long as d ≥ c 4 log(1/ ) for some constant c 4 .Then, we only need to construct an -net forE P ∼T 1 2 E SGD w t,η -w * 2 |E Pr[E]. By the analysis in Lemma 33, it's not hard to prove ∂ ∂ηE P ∼T 1 2 E SGD w t,η -w * 2 |E Pr[E] = O(1)t(1 -2 (d/ )]. Similar as in Lemma 14, for any > 0, we know there exists an -net N with size O(1/ ) such that for any η ∈ [0,1 c5d 2 log 2 (d/ ) ], E P ∼T 1 2 E SGD w t,η -w * 2 |E Pr[E] -E P ∼T1 2 E SGD w t,η -w * 2 |E Pr[E] ≤ for η ∈ arg min η∈N |η -η |.

SGD w t,η -w * 2 | Ē Pr[ Ē], we have for any η ∈ [0, 1 c5d 2 log 2 (d/ ) ], F T bV (η) -F T bV (η ) ≤ 4

Figure 8: Training and testing accuracy for different models (all samples, 20% noise)

We then try an intermediate sample size of 12000. The results are shown in Fig 9 (no noise) and Fig 10 (20% noise).

Figure 9: Training and testing accuracy for different models (12000 samples, no noise)

) and dimension d ≥ c 4 for certain constants c, c 2 , c 3 , c 4 . With high probability in the sampling of training tasks, we have

≥ c 3 and dimension d ≥ c 4 log(t) for certain constants c 2 , c 3 , c 4 . With probability at least 0.99 in the sampling of training tasks, we have

The proof of Lemma 19 is deferred into Section B.3.4. Lemma 19. Assume t ≥ c 2 , d ≥ c 4 for some constants c 2 , c 4 . With probability at least 1

Next, we utilize this variance in eigenvalues to prove that the GD sequence has to learn a constant fraction of the noise in training set. Lemma 22. Suppose noise level σ is a large enough constant c 1 . Assume unroll length t ≥ c 2 and dimension d ≥ c 4 for some constants c 2 , c 4 . Then, with probability at least C 1 B t,η w train -w * 2 Htrain ≥ C 2 σ 2 , for all η ∈ [1/L, 3L], where C 1 , C 2 are positive constants.Proof of Lemma 22. Let E 1 be the event that

2 d)). Replacing by /2 finishes the proof. Proof of Lemma 15. According to Lemma 1, we know with probability at least 1 -exp(-Ω(d)), 1/L ≤ λ i (H train ) ≤ L for all i ∈ [n] with L = 100. We can lower bound B t,η w * train , w * as follows, B t,η w * train , w * = I -(I -ηH train ) t w * train , w * train ≥λ min I -(I -ηH train ) t w *

Assuming n ≥ 40d, we know Pr[E 1 ] ≤ exp(-Ω(n)).

,η -w * 2 . Recall that we truncate the SGD sequence once the weight norm exceeds 4√Lσ. Due to the truncation, the expectation of 1/2 w t,η -w * 2 over SGD noise is very tricky to analyze.

1 2L 3 d , E SGD w t,η -w train 2 |E τ ≤ w τ,η -w train 2 ≤ 2L 2 σ 2 .Then, we can bound E SGD w t,η -w * 2 |E τ as follows,E SGD w t,η -w * 2 |E τ =E SGD w t,η -w train + w train -w * 2 |E τ ≤E SGD w t,η -w train 2 |E τ + 2E SGD w t,η -w train |E τ w train -w * + w train -w * 2 ≤2L 2 σ 2 + 2 • 2Lσ • 3 √ Lσ + 9Lσ 2 ≤ 3L 2 σ 2 . SGD w t,η -w * 2 1 {E τ } = SGD w t,η -w * 2 |E τ Pr[E τ ] Pr[E τ ] = 3L 2 σ 2 Pr[ Ē] ≤ 3L 2 σ 2 .This then implies thatE SGD w t,η -w * 2 -E SGD w t,η -w * 2 1 {E} ≤ 3L 2 σ 2 .Finally, we haveE SGD w t,η -w * 2 -E SGD w t,η -w * 2 Therefore, FT bV (η) is the average of m i.i.d.O(1)-subexponential random variables. By standard concentration inequality, we know for any 1 > > 0, with probability at least 1 -exp(-Ω( 2 m)),FT bV (η) -F T bV (η) ≤ .Proof of Lemma 41. Recall thatF T bV (η) =E P ∼T E SGD 1 2 w t,η -w * 2 + σ 2 /2We only need to construct an -net forE P ∼T E SGD 1 2 w t,η -w * 2 . Let E be the event that √ d/ √ L ≤ σ i (X train ) ≤ √ Ld and 1/L ≤ λ i (H train ) ≤ L for all i ∈ [n] and √ dσ/4 ≤ ξ train ≤ √ dσWe have E P ∼T E SGD Note {w τ,η } is the SGD sequence without truncation. For the second term, we haveE P ∼T

Whether the implementation converges for different t (fixed η 0 = 0.1)

Whether the implementation converges for different η 0 (fixed t = 40)


To prove that this crossing term is small for all $\eta\in[1/L,3L]$, we need to construct an $\epsilon$-net for the crossing term. As in Lemma 9, we can show there exists an $\epsilon$-net for the crossing term with size $O(t/\epsilon)$. Taking a union bound over this $\epsilon$-net, we can show that with probability at least $1-O(t/\epsilon)\exp(-\Omega(\epsilon^2d))$,
$$\big|\big\langle w_{t,\eta}-w^*,\ H_{\mathrm{valid}}(X_{\mathrm{valid}})^\dagger\xi_{\mathrm{valid}}\big\rangle\big|\le\epsilon,\quad\text{for all }\eta\in[1/L,3L].$$
Overall, with probability at least $1-O(t/\epsilon)\exp(-\Omega(\epsilon^2d))$,
$$\|w_{t,\eta}-w_{\mathrm{valid}}\|^2_{H_{\mathrm{valid}}}\ge\|w_{t,\eta}-w^*\|^2_{H_{\mathrm{valid}}}+\frac1n\|\xi_{\mathrm{valid}}\|^2-2\big\langle w_{t,\eta}-w^*,\ H_{\mathrm{valid}}(X_{\mathrm{valid}})^\dagger\xi_{\mathrm{valid}}\big\rangle\ge\|w_{t,\eta}-w^*\|^2_{H_{\mathrm{valid}}}+(1-\epsilon)\sigma^2-2\epsilon\ge(1-3\epsilon)\sigma^2,$$
for all $\eta\in[1/L,3L]$, where the last inequality uses $\sigma\ge1$. The proof finishes by changing $3\epsilon$ to $\epsilon$.

C PROOFS OF TRAIN-BY-TRAIN WITH A LARGE NUMBER OF SAMPLES (GD)

In this section, we give the proof of Theorem 6. We show that when the size $n$ of each training set and the number $m$ of training tasks are large enough, train-by-train also performs well. Recall Theorem 6 as follows. Theorem 6. Let $\hat F_{TbT(n)}(\eta)$ be as defined in Equation 1. Assume the noise level is a constant $c_1$. Given any $1>\epsilon>0$, assume training set size $n\ge cd^2\log(\frac{nm}d)$, unroll length $t\ge c_2\log(\frac nd)$, number of training tasks $m\ge\frac{c_3n^2}{\epsilon^4d^2}\log(\frac{tnm}d)$, and dimension $d\ge c_4$ for certain constants $c,c_2,c_3,c_4$. With high probability in the sampling of training tasks, we have

for all $\eta^*_{\mathrm{train}}\in\arg\min_{\eta\ge0}\hat F_{TbT(n)}(\eta)$, where the expectation is taken over new tasks. In the proof, we use the same notation defined in Section B. On each training task $P$, in Lemma 24 we show the meta-loss $\Delta_{TbT}(\eta,P)$ can be decomposed into two terms:

where $w_{\mathrm{train}}=w^*+(X_{\mathrm{train}})^\dagger\xi_{\mathrm{train}}$. Recall that $X_{\mathrm{train}}$ is an $n\times d$ matrix whose $i$-th row is $x_i^\top$. The pseudo-inverse $(X_{\mathrm{train}})^\dagger$ has dimension $d\times n$ and satisfies $X_{\mathrm{train}}^\dagger X_{\mathrm{train}}=I_d$. Here, $\mathrm{Proj}_{X_{\mathrm{train}}}\in\mathbb R^{n\times n}$ is a projection matrix onto the column span of $X_{\mathrm{train}}$. In Lemma 24, we show that with a constant step size, the first term in $\Delta_{TbT}(\eta,P)$ is exponentially small.
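The two-term decomposition rests on an orthogonality identity: the training residual splits into a component in the span of $X_{\mathrm{train}}$ (which GD can drive to zero) and the projection of the noise onto the orthogonal complement. A quick numerical check (our illustrative code with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 80, 20                          # n >> d, the train-by-train regime of this section
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
xi = rng.standard_normal(n)
y = X @ w_star + xi

w_train = w_star + np.linalg.pinv(X) @ xi       # w* + (X_train)^dagger xi_train
proj = X @ np.linalg.pinv(X)                    # Proj_X: projection onto the column span of X

w = rng.standard_normal(d)                      # any iterate w_t
lhs = np.linalg.norm(X @ w - y) ** 2
# Span term (vanishes as w -> w_train) + orthogonal-noise term (independent of w):
rhs = np.linalg.norm(X @ (w - w_train)) ** 2 + np.linalg.norm((np.eye(n) - proj) @ xi) ** 2
assert np.isclose(lhs, rhs)
```

The identity follows from $Xw - y = X(w - w_{\mathrm{train}}) - (I-\mathrm{Proj}_X)\xi$ and the Pythagorean theorem; the paper's $\Delta_{TbT}$ additionally carries the $1/n$ normalization.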
The second term is essentially the projection of the noise onto the orthogonal complement of the data span. We show this term concentrates well around its mean. This lemma serves as step 1 in Section B.1. The proof of Lemma 24 is deferred to Section C.1.

Lemma 24. Assume $n \ge 40d$. Given any $1 > \epsilon > 0$, with probability at least

In the next lemma, we show the empirical meta objective is large when $\eta$ exceeds a certain threshold $\bar\eta$. We define this threshold $\bar\eta$ such that for any step size larger than $\bar\eta$, the GD sequence has a reasonable probability of being truncated. In the proof, we rely on the truncated sequences to argue that the meta objective must be high.

In this section, we show there exists a step size that achieves a small empirical meta objective. On each training task $P$, we show the meta-loss can be decomposed into two terms as above, where $w_{train} = w^* + (X_{train})^\dagger \xi_{train}$. In Lemma 24, we show that with a constant step size, the first term is exponentially small and the second term concentrates around its mean.

Lemma 24. Assume $n \ge 40d$. Given any $1 > \epsilon > 0$, with probability at least

Before we go to the proof of Lemma 24, let's first show that the covariance matrix $H_{train}$ is very close to the identity when $n$ is much larger than $d$. The proof follows from the concentration of singular values of random Gaussian matrices (Lemma 48); we leave it to Section C.4. Now, we are ready to present the proof of Lemma 24.

Proof of Lemma 24. Let's first look at one training set $S_{train}$, in which $y_i = \langle w^*, x_i\rangle + \xi_i$ for each sample. Recall the form of the meta-loss above. Overall, with probability at least $1 - \exp(-\Omega(\epsilon^4 md^2/n^2))$,

Combining Lemma 24 and Lemma 25, it's not hard to see that the optimal step size $\eta^*_{train}$ lies in $[0, \bar\eta]$. In this section, we show a generalization result for step sizes in $[0, \bar\eta]$. The proof of Lemma 26 is given at the end of this section.

Lemma 26. Let $\bar\eta$ be as defined in Definition 2 with $1 > \epsilon > 0$.

In Lemma 30, we show $\hat{F}_{TbT}$ concentrates on $F_{TbT}$ at any fixed step size. The proof is almost the same as that of Lemma 7.
We omit its proof.

Lemma 30. Suppose $\sigma$ is a constant. For any fixed $\eta$ and any $1 > \epsilon > 0$, with probability at least

Next, we construct an $\epsilon$-net for $F_{TbT}$ in $[0, \bar\eta]$. The proof is very similar to that of Lemma 8; we defer it to Section C.4.

Lemma 31. Let $\bar\eta$ be as defined in Definition 2 with $1 > \epsilon > 0$. Assume the conditions in Lemma 29 hold. Assume $n \ge c\log(\frac{n}{d})d$ for some constant $c$. There exists an

We also construct an $\epsilon$-net for the empirical meta objective. The proof is very similar to that of Lemma 9; we leave it to Section C.4.

Lemma 32. Let $\bar\eta$ be as defined in Definition 2 with $1 > \epsilon > 0$. Assume the conditions in Lemma 29 hold. Assume $n \ge 40d$. With probability at least $1 - m\exp(-\Omega(n))$, there exists an

Combining the above three lemmas, we give the proof of Lemma 26.

Proof of Lemma 26. We assume $\sigma$ is a constant in this proof. By Lemma 30, we know that with probability at least $1 - \exp(-\Omega(\epsilon^4 md^2/n^2))$ the concentration holds for any fixed $\eta$. By Lemma 31, we know that as long as $n \ge c\log(\frac{n}{d})d$ for some constant $c$, there exists an $\frac{8\epsilon^2 d\sigma^2}{n}$-net $N$ for $F_{TbT}$ with size $O(\frac{tn}{\epsilon^2 d})$. By Lemma 32, we know that with probability at least $1 - m\exp(-\Omega(n))$, there exists an $\frac{\epsilon^2 d\sigma^2}{n}$-net $\hat{N}$ for $\hat{F}_{TbT}$ with size $O(\frac{tn}{\epsilon^2 d} + m)$. It's not hard to verify that $N \cup \hat{N}$ is still an $\frac{8\epsilon^2 d\sigma^2}{n}$-net for both $\hat{F}_{TbT}$ and $F_{TbT}$. That means, for any $\eta \in [0, \bar\eta]$, we can round $\eta$ to its closest net point while changing either objective by at most $\frac{8\epsilon^2 d\sigma^2}{n}$. Taking a union bound over $N \cup \hat{N}$, we have with probability at least $1 - O(\frac{tn}{\epsilon^2 d} + m)\exp(-\Omega(\epsilon^4 md^2/n^2))$ that the concentration holds simultaneously over the net. Overall, we know that with high probability the claim of Lemma 26 holds, where $\eta' = \arg\min_{\tilde\eta \in N \cup \hat{N},\, \tilde\eta \le \eta}(\eta - \tilde\eta)$.
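Operationally, the $\epsilon$-net argument is what justifies tuning $\eta$ on a finite grid: both meta objectives are Lipschitz in $\eta$ on the feasible range, so a fine grid of step sizes suffices. A toy illustration with hypothetical sizes (train-by-train meta-loss under GD):

```python
import numpy as np

# Toy illustration of the epsilon-net idea: since the meta objective is
# Lipschitz in eta on the feasible range, evaluating the empirical objective
# on a fine grid of step sizes locates an approximate minimizer.
# All sizes here are hypothetical.
rng = np.random.default_rng(1)
n, d, t, sigma, m = 500, 10, 50, 1.0, 10

def meta_loss(eta):
    # one task: sample data, run GD for t steps, return the final train loss
    X = rng.standard_normal((n, d))
    w_star = rng.standard_normal(d)
    y = X @ w_star + sigma * rng.standard_normal(n)
    w = np.zeros(d)
    for _ in range(t):
        w = w - eta * (X.T @ (X @ w - y)) / n
    return 0.5 * np.mean((X @ w - y) ** 2)

grid = np.linspace(0.0, 0.9, 19)             # the finite "net" over step sizes
F_hat = [np.mean([meta_loss(e) for _ in range(m)]) for e in grid]
eta_hat = float(grid[int(np.argmin(F_hat))])
print(eta_hat)
```

Any nonzero converged step size beats $\eta = 0$ by a wide margin here, so the grid minimizer is strictly positive.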

D PROOFS OF TRAIN-BY-TRAIN VS. TRAIN-BY-VALIDATION (SGD)

Previously, we have shown that train-by-validation generalizes better than train-by-train when the tasks are trained by GD and the number of samples is small. In this section, we show that a similar phenomenon also appears in the SGD setting.

In the train-by-train setting, each task $P$ contains a training set $S_{train} = \{(x_i, y_i)\}_{i=1}^n$. The inner objective is defined as $f(w) = \frac{1}{2n}\sum_{(x,y)\in S_{train}} (\langle w, x\rangle - y)^2$. Let $\{w_{\tau,\eta}\}$ be the SGD sequence running on $f(w)$ from initialization $0$ (without truncation). That means, $w_{\tau,\eta} = w_{\tau-1,\eta} - \eta\hat\nabla f(w_{\tau-1,\eta})$, where $\hat\nabla f(w_{\tau-1,\eta}) = (\langle w_{\tau-1,\eta}, x_{i(\tau-1)}\rangle - y_{i(\tau-1)})\, x_{i(\tau-1)}$. Here the index $i(\tau-1)$ is independently and uniformly sampled from $[n]$. We denote the SGD noise as the difference between the stochastic gradient and the full gradient, $\hat\nabla f(w_{\tau-1,\eta}) - \nabla f(w_{\tau-1,\eta})$. The meta-loss on task $P$ is defined as follows, where the expectation is taken over the SGD noise. Note that $w_{t,\eta}$ depends on the SGD noise along the trajectory. Then, the empirical meta objective $\hat{F}_{TbT(n)}(\eta)$ is the average of the meta-loss across $m$ different tasks.

In order to control the SGD noise in expectation, we restrict the feasible set of step sizes to $O(1/d)$. We show that within this range, the optimal step size under $\hat{F}_{TbT(n)}$ is $\Omega(1/d)$ and the learned weight is far from the ground truth $w^*$ on new tasks. We prove Theorem 9 in Section D.1.

Theorem 9. Let the meta objective $\hat{F}_{TbT(n)}$ be as defined in Equation 4.

In the train-by-validation setting, each task $P$ contains a training set $S_{train}$ with $n_1$ samples and a validation set with $n_2$ samples. The inner objective is defined as $f(w) = \frac{1}{2n_1}\sum_{(x,y)\in S_{train}} (\langle w, x\rangle - y)^2$. Let $\{w_{\tau,\eta}\}$ be the SGD sequence running on $f(w)$ from initialization $0$ (with the same truncation defined in Section 4). For each task $P$, the meta-loss $\Delta_{TbV(n_1,n_2)}(\eta, P)$ is defined accordingly. The empirical meta objective $\hat{F}_{TbV(n_1,n_2)}(\eta)$ is the average of the meta-loss across $m$ different tasks $P_1, P_2, \ldots, P_m$.

In order to bound the SGD noise with high probability, we restrict the feasible set of the step sizes to $O(\frac{1}{d^2\log^2 d})$.
Within this range, we prove that the optimal step size under $\hat{F}_{TbV(n_1,n_2)}$ is $\Theta(1/t)$ and the learned weight is better than the initialization $0$ by a constant on new tasks. Theorem 10 is proved in Section D.2.

Theorem 10. Let the meta objective $\hat{F}_{TbV(n_1,n_2)}$ be as defined in Equation 5, with number of training tasks $m \ge c_3$ and dimension $d \ge c_4$ for certain constants $c_2, c_3, c_4$. There exists a constant $c_5$ such that with probability at least $0.99$ over the sampling of training tasks, we have, where the expectation is taken over the new tasks and the SGD noise.

Notation: In the following proofs, we use the same set of notations defined in Appendix B. We use $\mathbb{E}_{P\sim T}$ to denote the expectation over the sampling of tasks and $\mathbb{E}_{SGD}$ to denote the expectation over the SGD noise. We use $\mathbb{E}$ to denote $\mathbb{E}_{P\sim T}\mathbb{E}_{SGD}$. As in Appendix B, we use the letter $L$ to denote the constant $100$, which upper bounds $\|H_{train}\|$ with high probability.
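The contrast between Theorems 9 and 10 can be illustrated with a small simulation. All sizes below are hypothetical and far smaller than the theorems require; the point is only the qualitative gap: with few samples per task, train-by-train pushes the step size up to fit the noise, while train-by-validation prefers a smaller step size.

```python
import numpy as np

# Hypothetical illustration of Theorems 9 and 10: with few samples per task,
# the train-by-train objective favors large step sizes (fitting the noise),
# while the train-by-validation objective favors smaller ones.
rng = np.random.default_rng(3)
d, n1, n2, t, sigma, m = 20, 15, 15, 400, 1.0, 50

def sgd(X, y, eta, t):
    # single-sample SGD on the inner least-squares objective, from w = 0
    w = np.zeros(X.shape[1])
    for _ in range(t):
        i = rng.integers(len(y))
        w = w - eta * (w @ X[i] - y[i]) * X[i]
    return w

def avg_meta_loss(eta, validate):
    # empirical meta objective: average meta-loss over m sampled tasks
    losses = []
    for _ in range(m):
        w_star = rng.standard_normal(d) / np.sqrt(d)
        X = rng.standard_normal((n1 + n2, d))
        y = X @ w_star + sigma * rng.standard_normal(n1 + n2)
        w = sgd(X[:n1], y[:n1], eta, t)
        Xe, ye = (X[n1:], y[n1:]) if validate else (X[:n1], y[:n1])
        losses.append(0.5 * np.mean((Xe @ w - ye) ** 2))
    return float(np.mean(losses))

grid = np.linspace(0.005, 0.035, 7)
eta_tbt = float(grid[np.argmin([avg_meta_loss(e, False) for e in grid])])
eta_tbv = float(grid[np.argmin([avg_meta_loss(e, True) for e in grid])])
print(eta_tbt, eta_tbv)
```

Since $n_1 < d$, large step sizes drive the train loss toward interpolation, so train-by-train selects them; evaluating on held-out samples penalizes the overfit and selects a smaller step size.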

D.1 TRAIN-BY-TRAIN (SGD)

Recall Theorem 9 as follows.

Theorem 9. Let the meta objective $\hat{F}_{TbT(n)}$ be as defined in Equation 4.

In order to prove Theorem 9, we first show that $\eta^*_{train}$ is $\Omega(1/d)$ in Lemma 33. The proof is similar to the GD setting: as long as $\eta = O(1/d)$, the SGD noise is dominated by the full gradient. Then, we can show that $\Delta_{TbT}(\eta, P)$ is roughly $(1 - \Theta(1)\eta)^t$, which implies that $\eta^*_{train} = \Omega(1/d)$. We leave the proof of Lemma 33 to Section D.1.1.

For any step size $\eta \in [\frac{1}{6L^5 d}, \frac{1}{2L^3 d}]$, let $w_{t,\eta}$ be the weight obtained by running SGD on $f(w)$ for $t$ steps. Next, we show $\mathbb{E}_{SGD}\|w_{t,\eta} - w^*\|^2 = \Omega(\sigma^2)$ with high probability over the sampling of $P$.

Lemma 34. Suppose $\sigma$ is a constant. Assume unroll length $t \ge c_2 d$ for some constant $c_2$. With probability at least $1 - \exp(-\Omega(d))$ in the sampling of the test task $P$, where $w_{t,\eta}$ is obtained by running SGD on task $P$ for $t$ iterations.

With Lemma 33 and Lemma 34, the proof of Theorem 9 is straightforward.

Proof of Theorem 9. Combining Lemma 33 and Lemma 34, we know that as long as $\sigma$ is a constant, $t \ge c_2 d$, and $d \ge c_4\log(m)$, with probability at least $0.99$, $\eta^*_{train} = \Omega(1/d)$ and $\mathbb{E}_{SGD}\|w_{t,\eta^*_{train}} - w^*\|^2 = \Omega(\sigma^2)$, for all $\eta^*_{train} \in \arg\min_{0 \le \eta \le \frac{1}{2L^3 d}} \hat{F}_{TbT}(\eta)$.
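A minimal numerical sketch of the phenomenon behind Theorem 9 (hypothetical sizes): with $\eta = \Theta(1/d)$, single-sample SGD drives the training loss far below its initial value, yet the learned weight stays at distance $\Omega(\sigma)$ from $w^*$ because it fits the noise.

```python
import numpy as np

# Hypothetical sketch of the inner SGD loop in Theorem 9: with step size
# eta = Theta(1/d), SGD fits the training set well below the initial loss,
# but the learned weight remains at distance Omega(sigma) from w*.
rng = np.random.default_rng(2)
n, d, t, sigma = 50, 25, 4000, 1.0
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + sigma * rng.standard_normal(n)

eta = 1.0 / (2 * d)
w = np.zeros(d)
init_loss = 0.5 * np.mean(y ** 2)
for _ in range(t):
    i = rng.integers(n)                       # index sampled uniformly from [n]
    w = w - eta * (w @ X[i] - y[i]) * X[i]    # single-sample SGD step

train_loss = 0.5 * np.mean((X @ w - y) ** 2)
dist_sq = np.sum((w - w_star) ** 2)
print(init_loss, train_loss, dist_sq)
```

The training loss drops by an order of magnitude, while $\|w_{t,\eta} - w^*\|^2$ stays on the order of $\sigma^2$, mirroring the conclusion of Theorem 9.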

D.1.1 DETAILED PROOFS

Proof of Lemma 33. The proof is very similar to the proof of Lemma 2, except that we need to bound the SGD noise term. For each $k \in [m]$, let $E_k$ be the event that

We finish the proof by replacing $4\epsilon$ by $\epsilon$.

Proof of Lemma 42. The proof is very similar to the proof of Lemma 18. The only difference is that we need to first relate the SGD sequence with truncation to the SGD sequence without truncation, and then bound the Lipschitzness of the SGD sequence without truncation (as we did in Lemma 41). We omit the details here.

E TOOLS

E.1 NORM OF RANDOM VECTORS

We use the following lemma to bound the noise in the least squares model.

Lemma 45 (Theorem 3.1.1 in Vershynin (2018)). Let $\xi \in \mathbb{R}^n$ be a random vector with each entry independently sampled from $N(0,1)$. Then
$$\big\|\, \|\xi\|_2 - \sqrt{n} \,\big\|_{\psi_2} \le C,$$
where $C$ is an absolute constant.
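A quick numerical check of this concentration (hypothetical sizes): the norm of an $n$-dimensional standard Gaussian vector stays within $O(1)$ of $\sqrt{n}$, independently of $n$.

```python
import numpy as np

# Numerical check of Lemma 45: the norm of an n-dimensional standard
# Gaussian vector concentrates tightly around sqrt(n), with O(1) fluctuation.
rng = np.random.default_rng(4)
n, trials = 4000, 200
norms = np.linalg.norm(rng.standard_normal((trials, n)), axis=1)
print(norms.mean(), norms.std(), np.sqrt(n))
```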

E.2 SINGULAR VALUES OF GAUSSIAN MATRICES

Given a random Gaussian matrix, in expectation its smallest and largest singular values can be bounded as follows.

Lemma 46 (Theorem 5.32 in Vershynin (2010)). Let $A$ be an $N \times n$ matrix whose entries are independent standard normal random variables. Then
$$\sqrt{N} - \sqrt{n} \le \mathbb{E}\, s_{\min}(A) \le \mathbb{E}\, s_{\max}(A) \le \sqrt{N} + \sqrt{n}.$$

Lemma 47 shows that a Lipschitz function of i.i.d. Gaussian variables concentrates well around its mean. We use this lemma to argue that for any fixed step size, the empirical meta objective concentrates on the population meta objective.

Lemma 47 (Proposition 5.34 in Vershynin (2010)). Let $f$ be a real-valued Lipschitz function on $\mathbb{R}^n$ with Lipschitz constant $K$. Let $X$ be a standard normal random vector in $\mathbb{R}^n$. Then for every $t \ge 0$ one has
$$\Pr\{|f(X) - \mathbb{E} f(X)| > t\} \le 2\exp(-t^2/(2K^2)).$$

The following lemma shows that a tall random Gaussian matrix is well-conditioned with high probability. The proof follows from Lemma 46 and Lemma 47. We use Lemma 48 to show the covariance matrix is well-conditioned in the least squares model.

Lemma 48 (Corollary 5.35 in Vershynin (2010)). Let $A$ be an $N \times n$ matrix whose entries are independent standard normal random variables. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-t^2/2)$ one has
$$\sqrt{N} - \sqrt{n} - t \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + \sqrt{n} + t.$$

We also use the Johnson-Lindenstrauss lemma in some of the lemmas. The Johnson-Lindenstrauss lemma tells us that the projection of a fixed vector onto a random subspace concentrates well as long as the subspace is reasonably large.

Lemma 49 (Johnson & Lindenstrauss (1984)). Let $P$ be a projection in $\mathbb{R}^d$ onto a random $n$-dimensional subspace uniformly distributed in $G_{d,n}$. Let $z \in \mathbb{R}^d$ be a fixed point and $\epsilon > 0$. Then with probability at least $1 - 2\exp(-c\epsilon^2 n)$,
$$(1-\epsilon)\sqrt{\frac{n}{d}}\,\|z\| \le \|Pz\| \le (1+\epsilon)\sqrt{\frac{n}{d}}\,\|z\|.$$
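Both facts are easy to verify numerically (hypothetical sizes): the singular values of a tall Gaussian matrix land inside the Lemma 48 interval, and the norm of a random projection of a fixed vector is close to $\sqrt{n/d}\,\|z\|$.

```python
import numpy as np

# Numerical check of Lemma 48 and Lemma 49: singular values of a tall
# N x n Gaussian matrix lie near [sqrt(N)-sqrt(n), sqrt(N)+sqrt(n)], and the
# projection of a fixed z onto a random k-dimensional subspace has norm
# close to sqrt(k/d) * ||z||.
rng = np.random.default_rng(5)
N, n = 4000, 100
A = rng.standard_normal((N, n))
s = np.linalg.svd(A, compute_uv=False)
print(s.min(), np.sqrt(N) - np.sqrt(n), s.max(), np.sqrt(N) + np.sqrt(n))

d, k = 2000, 400
z = rng.standard_normal(d)
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # basis of a random subspace
proj_norm = np.linalg.norm(Q.T @ z)               # equals ||P z|| for P = Q Q^T
print(proj_norm, np.sqrt(k / d) * np.linalg.norm(z))
```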

