THEORETICAL CHARACTERIZATION OF THE GENERALIZATION PERFORMANCE OF OVERFITTED META-LEARNING

Abstract

Meta-learning has emerged as a successful method for improving training performance by training over many similar tasks, especially with deep neural networks (DNNs). However, the theoretical understanding of when and why overparameterized models such as DNNs can generalize well in meta-learning is still limited. As an initial step towards addressing this challenge, this paper studies the generalization performance of overfitted meta-learning under a linear regression model with Gaussian features. In contrast to a few recent studies along the same line, our framework allows the number of model parameters to be arbitrarily larger than the number of features in the ground-truth signal, and hence naturally captures the overparameterized regime in practical deep meta-learning. We show that the overfitted min ℓ2-norm solution of model-agnostic meta-learning (MAML) can be beneficial, which is similar to the recent remarkable findings on the "benign overfitting" and "double descent" phenomena in classical (single-task) linear regression. However, due to features unique to meta-learning, such as the task-specific gradient-descent inner training and the diversity/fluctuation of the ground-truth signals among training tasks, we find new and interesting properties that do not exist in single-task linear regression. We first provide a high-probability upper bound (under reasonable tightness) on the generalization error, where certain terms decrease when the number of features increases. Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large. Under this circumstance, we show that the overfitted min ℓ2-norm solution can achieve an even lower generalization error than the underparameterized solution.

1. INTRODUCTION

Meta-learning is designed to learn a task by utilizing the training samples of many similar tasks, i.e., learning to learn (Thrun & Pratt, 1998). With deep neural networks (DNNs), the success of meta-learning has been demonstrated experimentally by many works, e.g., (Antoniou et al., 2018; Finn et al., 2017). However, theoretical results on why DNNs generalize well in meta-learning are still limited. Although DNNs have so many parameters that they can completely fit all training samples from all tasks, it is unclear why such an overfitted solution can still generalize well, which seems to defy the classical bias-variance tradeoff (Bishop, 2006; Hastie et al., 2009; Stein, 1956; James & Stein, 1992; LeCun et al., 1991; Tikhonov, 1943). Recent studies on the "benign overfitting" and "double descent" phenomena in classical (single-task) linear regression have brought new insights into the generalization performance of overfitted solutions. Specifically, "benign overfitting" and "double descent" describe the phenomenon that the test error descends again in the overparameterized regime of a linear regression setup (Belkin et al., 2018; 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019; Ju et al., 2020; Mei & Montanari, 2019). Depending on the setting, the shape and the properties of the descent curve of the test error can differ dramatically. For example, Ju et al. (2020) showed that the min ℓ1-norm overfitted solution has a very different descent curve compared with the min ℓ2-norm overfitted solution. A more detailed review of this line of work can be found in Appendix A. Compared to classical (single-task) linear regression, model-agnostic meta-learning (MAML) (Finn et al., 2017; Finn, 2018), a popular algorithm for meta-learning, differs in many aspects. First, the training process of MAML involves task-specific gradient-descent inner training and outer training across all tasks.
Second, there are new parameters to consider in meta-training, such as the number of tasks and the diversity/fluctuation of the ground truth of each training task. These distinctions imply that we cannot directly apply the existing analyses of "benign overfitting" and "double descent" for single-task linear regression to meta-learning. Thus, it is still unclear whether meta-learning also exhibits a similar "double descent" phenomenon, and if so, how the shape of the descent curve of overfitted meta-learning is affected by the system parameters. A few recent works have studied the generalization performance of meta-learning. In Bernacchia (2021), the expected value of the test error for the overfitted min ℓ2-norm solution of MAML is provided in the asymptotic regime where the number of features and the number of training samples go to infinity. In Chen et al. (2022), a high-probability upper bound on the test error of a similar overfitted min ℓ2-norm solution is given in the non-asymptotic regime. However, the eigenvalues of the weight matrices that appear in the bound of Chen et al. (2022) are coupled with system parameters such as the numbers of tasks, samples, and features, so their bound is not fully expressed in terms of the scaling orders of those system parameters. This makes it hard to explicitly characterize the shape of the double descent or to analyze the tightness of the bound. In Huang et al. (2022), the authors focus on the generalization error during the SGD process when the training error is not zero, which differs from our focus on overfitted solutions that make the training error equal to zero (i.e., interpolators). (A more comprehensive introduction to related works can be found in Appendix A.)
All of these works set the number of model features in meta-learning equal to the number of true features, which cannot be used to analyze the shape of the double-descent curve, since that requires the number of features used in the learning model to change freely without affecting the ground truth (as in the setup used in many works on single-task linear regression, e.g., Belkin et al. (2020); Ju et al. (2020)). To fill this gap, we study the generalization performance of overfitted meta-learning, especially quantifying how the test error changes with the number of features. As an initial step towards the DNN setup, we consider the overfitted min ℓ2-norm solution of MAML under a linear model with Gaussian features. We first quantify the error caused by the one-step gradient adaptation for the test task, which provides useful insights on 1) practically choosing the step size for the test task and quantifying the gap from the optimal (but impractical) choice, and 2) how overparameterization affects the noise error and the task diversity error in the test task. We then provide an explicit high-probability upper bound (under reasonable tightness) on the error caused by the meta-training over the training tasks (we call this part the "model error") in the non-asymptotic regime where all parameters are finite. With this upper bound and simulation results, we confirm benign overfitting in meta-learning by comparing the model error of the overfitted solution with that of the underfitted solution. We further characterize some interesting properties of the descent curve. For example, we show that the descent is easier to observe when the noise and task diversity are large, and that the curve sometimes has a descent floor. Compared with classical (single-task) linear regression, where the double-descent phenomenon critically depends on non-zero noise, we show that meta-learning can still exhibit double descent even under zero noise, as long as the task diversity is non-zero.
2. SYSTEM MODEL

2.1. PROBLEM SETUP

In this section, we introduce the system model of meta-learning along with the related notation. For ease of reading, we also summarize our notation in Table 2 in Appendix B. We adopt a meta-learning setup with linear tasks first studied in Bernacchia (2021) as well as a few recent follow-up works (Chen et al., 2022; Huang et al., 2022). However, in their formulation, the number of features and the number of true model parameters are the same. Such a setup fails to capture the prominent effect of overparameterized models in practice, where the number of model parameters is much larger than the number of actual feature parameters. Thus, in our setup, we introduce an additional parameter s to denote the number of true features and allow s to be different from (and much smaller than) the number p of model parameters, so as to capture the effect of overparameterization. In this way, we can fix s and investigate how p affects the generalization performance. We believe this setup is closer to reality because the number of features to learn is controllable, while the number of actual features of the ground truth is fixed and not controllable. We consider m training tasks. For the i-th training task (where $i = 1, 2, \cdots, m$), the ground truth is a linear model represented by $y = x_s^T w_s^{(i)} + \epsilon$, where s denotes the number of true features, $x_s \in \mathbb{R}^s$ is the vector of underlying features, $\epsilon \in \mathbb{R}$ denotes the noise, and $y \in \mathbb{R}$ denotes the output. When we collect the features of the data, since we do not know which features are the true ones, we usually choose more than s features, i.e., we choose p features with $p \ge s$ to make sure that these p features include all s true features. For the analysis, without loss of generality, we let the first s of the p features be the true features. Therefore, although the ground-truth model has only s features, we can alternatively express it with p features as $y = x^T w^{(i)} + \epsilon$, where the first s elements of $x \in \mathbb{R}^p$ equal $x_s$ and $w^{(i)} := \begin{bmatrix} w_s^{(i)} \\ 0 \end{bmatrix} \in \mathbb{R}^p$. The collected data are split into two parts with $n_t$ training data and $n_v$ validation data.
With this notation, we write the data-generation model in matrix form as
$$y^{t(i)} = X^{t(i)T} w^{(i)} + \epsilon^{t(i)}, \quad y^{v(i)} = X^{v(i)T} w^{(i)} + \epsilon^{v(i)}, \tag{1}$$
where each column of $X^{t(i)} \in \mathbb{R}^{p \times n_t}$ corresponds to the input of one training sample, each column of $X^{v(i)} \in \mathbb{R}^{p \times n_v}$ corresponds to the input of one validation sample, $y^{t(i)} \in \mathbb{R}^{n_t}$ denotes the outputs of all training samples, $y^{v(i)} \in \mathbb{R}^{n_v}$ denotes the outputs of all validation samples, $\epsilon^{t(i)} \in \mathbb{R}^{n_t}$ denotes the noise in the training samples, and $\epsilon^{v(i)}$ denotes the noise in the validation samples. Similarly to the training tasks, we denote the ground truth of the test task by $w_s^r \in \mathbb{R}^s$, and thus $y = x_s^T w_s^r$. Let $w^r := \begin{bmatrix} w_s^r \\ 0 \end{bmatrix} \in \mathbb{R}^p$. Let $n_r$ denote the number of training samples for the test task, and let each column of $X^r \in \mathbb{R}^{p \times n_r}$ denote the input of one training sample. Similar to Eq. (1), we then have $y^r = X^{rT} w^r + \epsilon^r$, where $\epsilon^r \in \mathbb{R}^{n_r}$ denotes the noise and each element of $y^r \in \mathbb{R}^{n_r}$ corresponds to the output of one training sample. To simplify the theoretical analysis, we adopt the following two assumptions. Assumption 1 is commonly made in theoretical studies of generalization performance, e.g., Ju et al. (2020); Bernacchia (2021). Assumption 2 is less restrictive (no specific distribution is required), and a similar one is also used in Bernacchia (2021).

Assumption 1 (Gaussian features and noise). We adopt i.i.d. Gaussian features $x \sim \mathcal{N}(0, I_p)$ and assume i.i.d. Gaussian noise. We use $\sigma$ and $\sigma_r$ to denote the standard deviation of the noise for the training tasks and the test task, respectively. In other words, $\epsilon^{t(i)} \sim \mathcal{N}(0, \sigma^2 I_{n_t})$ and $\epsilon^{v(i)} \sim \mathcal{N}(0, \sigma^2 I_{n_v})$ for all $i = 1, \cdots, m$, and $\epsilon^r \sim \mathcal{N}(0, \sigma_r^2 I_{n_r})$.

Assumption 2 (Diversity/fluctuation of unbiased ground truth). The ground truth $w_s^r$ and $w_s^{(i)}$ for all $i = 1, 2, \cdots, m$ share the same mean $w_s^0$, i.e., $\mathbb{E}[w_s^{(i)}] = w_s^0 = \mathbb{E}[w_s^r]$.
For the i-th training task, the elements of the true model parameter $w_s^{(i)} \in \mathbb{R}^s$ are independent, i.e., $\mathbb{E}\left[(w_s^{(i)} - w_s^0)(w_s^{(i)} - w_s^0)^T\right] = \Lambda^{(i)} := \mathrm{diag}\left((\nu^{(i),1})^2, (\nu^{(i),2})^2, \cdots, (\nu^{(i),s})^2\right)$. Let $\nu_{(i)}^2 := \mathrm{Tr}(\Lambda^{(i)})$, $\nu^2 := \frac{1}{m}\sum_{i=1}^m \nu_{(i)}^2$, $\nu_r^2 := \mathbb{E}\|w_s^r - w_s^0\|_2^2$, and $w^0 := \begin{bmatrix} w_s^0 \\ 0 \end{bmatrix} \in \mathbb{R}^p$.
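As a concrete illustration of the data model and Assumptions 1-2, the following sketch generates synthetic tasks in NumPy. All sizes, and the choice of equal per-coordinate variances $\nu^2/s$ on the diagonal of $\Lambda^{(i)}$, are illustrative assumptions of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (hypothetical, not from the paper).
s, p, m, n_t, n_v = 3, 20, 5, 4, 4
sigma, nu = 0.5, 1.0

w0_s = rng.normal(size=s)                         # shared mean of the true parameters (Assumption 2)
w0 = np.concatenate([w0_s, np.zeros(p - s)])      # padded to p dims; first s entries are the true features

tasks = []
for i in range(m):
    # Task-specific ground truth fluctuates around w0_s with Tr(Lambda) = nu^2
    # (here Lambda = (nu^2 / s) * I_s, an illustrative diagonal choice).
    w_i_s = w0_s + rng.normal(scale=nu / np.sqrt(s), size=s)
    w_i = np.concatenate([w_i_s, np.zeros(p - s)])
    X_t = rng.normal(size=(p, n_t))               # i.i.d. N(0, I_p) features (Assumption 1)
    X_v = rng.normal(size=(p, n_v))
    y_t = X_t.T @ w_i + rng.normal(scale=sigma, size=n_t)
    y_v = X_v.T @ w_i + rng.normal(scale=sigma, size=n_v)
    tasks.append((X_t, y_t, X_v, y_v))
```

Each task stores the column-wise data matrices and outputs $(X^{t(i)}, y^{t(i)}, X^{v(i)}, y^{v(i)})$ exactly as dimensioned in Eq. (1).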

2.2. MAML PROCESS

We consider the MAML algorithm (Finn et al., 2017; Finn, 2018), the objective of which is to train a good initial model parameter over many tasks that can adapt quickly to a desirable model parameter for a target task. MAML generally consists of inner-loop training for each individual task and outer-loop training across multiple tasks. To differentiate from the ground-truth parameters $w^{(i)}$, we use $\hat{(\cdot)}$ (e.g., $\hat{w}^{(i)}$) to indicate that a parameter is a training result. In the inner-loop training, the model parameter for every individual task is updated from a common meta parameter $\hat{w}$. Specifically, for the i-th training task ($i = 1, \ldots, m$), its model parameter $\hat{w}^{(i)}$ is updated via a one-step gradient descent of the loss function based on its training data $X^{t(i)}$:
$$\hat{w}^{(i)} := \hat{w} - \frac{\alpha_t}{n_t} \frac{\partial L^{(i)}_{\text{inner}}}{\partial \hat{w}}, \tag{2}$$
$$L^{(i)}_{\text{inner}} := \frac{1}{2}\left\| y^{t(i)} - X^{t(i)T} \hat{w} \right\|_2^2, \tag{3}$$
where $\alpha_t \ge 0$ denotes the step size. In the outer-loop training, the meta loss $L_{\text{meta}}$ is calculated based on the validation samples of all training tasks as follows:
$$L_{\text{meta}} := \frac{1}{m n_v} \sum_{i=1}^m L^{(i)}_{\text{outer}}, \quad \text{where } L^{(i)}_{\text{outer}} := \frac{1}{2}\left\| y^{v(i)} - X^{v(i)T} \hat{w}^{(i)} \right\|_2^2. \tag{4}$$
The common (i.e., meta) parameter $\hat{w}$ is then trained to minimize the meta loss $L_{\text{meta}}$. At the test stage, we use the test loss $L^r_{\text{inner}} := \frac{1}{2}\left\| y^r - X^{rT} \hat{w} \right\|_2^2$ to adapt the trained meta parameter by one gradient step, $\hat{w}^r := \hat{w} - \frac{\alpha_r}{n_r} \frac{\partial L^r_{\text{inner}}}{\partial \hat{w}}$, where $\alpha_r \ge 0$ denotes the step size. The squared test error for any input x is given by
$$L_{\text{test}}(x, w^r; \hat{w}^r) := \left( x^T w^r - x^T \hat{w}^r \right)^2. \tag{5}$$
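The inner-loop update of Eq. (2)-(3) and the meta loss of Eq. (4) can be sketched as follows. This is a minimal NumPy sketch with helper names of our own; the column-wise data layout matches the $X^{t(i)} \in \mathbb{R}^{p \times n_t}$ convention above.

```python
import numpy as np

def inner_update(w, X_t, y_t, alpha_t):
    """One gradient step on L_inner = 0.5 * ||y_t - X_t^T w||^2 with step size alpha_t / n_t."""
    grad = -X_t @ (y_t - X_t.T @ w)               # gradient of L_inner w.r.t. w
    return w - (alpha_t / len(y_t)) * grad

def meta_loss(w, tasks, alpha_t):
    """Average outer loss over the validation samples of all tasks (Eq. (4))."""
    m, n_v = len(tasks), len(tasks[0][3])
    total = 0.0
    for X_t, y_t, X_v, y_v in tasks:
        w_i = inner_update(w, X_t, y_t, alpha_t)  # task-specific adapted parameter
        total += 0.5 * np.sum((y_v - X_v.T @ w_i) ** 2)
    return total / (m * n_v)

# Sanity check: at a parameter that interpolates noiseless data, the inner
# gradient vanishes and the meta loss is zero.
rng = np.random.default_rng(0)
X_t, X_v = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
w_true = rng.normal(size=5)
demo_tasks = [(X_t, X_t.T @ w_true, X_v, X_v.T @ w_true)]
loss_at_truth = meta_loss(w_true, demo_tasks, alpha_t=0.1)
```

The one-step test-stage adaptation has the same form as `inner_update`, applied to $(X^r, y^r)$ with step size $\alpha_r$.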

2.3. SOLUTIONS OF MINIMIZING META LOSS

The meta loss in Eq. (4) depends on the meta parameter $\hat{w}$ via the inner-loop loss in Eq. (3). It can be shown (see the details in Appendix C) that $L_{\text{meta}}$ can be expressed as $L_{\text{meta}} = \frac{1}{2 m n_v} \| \gamma - B \hat{w} \|_2^2$, where $\gamma \in \mathbb{R}^{m n_v}$ and $B \in \mathbb{R}^{(m n_v) \times p}$ are stacks of m vectors and matrices, respectively, given by
$$\gamma := \begin{bmatrix} y^{v(1)} - \frac{\alpha_t}{n_t} X^{v(1)T} X^{t(1)} y^{t(1)} \\ y^{v(2)} - \frac{\alpha_t}{n_t} X^{v(2)T} X^{t(2)} y^{t(2)} \\ \vdots \\ y^{v(m)} - \frac{\alpha_t}{n_t} X^{v(m)T} X^{t(m)} y^{t(m)} \end{bmatrix}, \quad B := \begin{bmatrix} X^{v(1)T} \left( I_p - \frac{\alpha_t}{n_t} X^{t(1)} X^{t(1)T} \right) \\ X^{v(2)T} \left( I_p - \frac{\alpha_t}{n_t} X^{t(2)} X^{t(2)T} \right) \\ \vdots \\ X^{v(m)T} \left( I_p - \frac{\alpha_t}{n_t} X^{t(m)} X^{t(m)T} \right) \end{bmatrix}. \tag{6}$$
By observing Eq. (6) and the structure of B, we know that $\min_{\hat{w}} L_{\text{meta}}$ has a unique solution almost surely when the learning model is underparameterized, i.e., $p \le m n_v$. However, in real-world applications of meta-learning, an overparameterized model is of more interest due to the success of DNNs. Therefore, in the rest of this paper, we mainly focus on the overparameterized situation, i.e., $p > m n_v$ (so the meta training loss can decrease to zero). In this case, there exist (almost surely) infinitely many $\hat{w}$ that make the meta loss zero, i.e., interpolators of the training samples. Among all overfitted solutions, we are particularly interested in the min ℓ2-norm solution, since it corresponds to the solution of gradient descent starting from zero in a linear model. Specifically, the min ℓ2-norm overfitted solution $\hat{w}_{\ell_2}$ is defined as
$$\hat{w}_{\ell_2} := \arg\min_{\hat{w}} \| \hat{w} \|_2 \quad \text{subject to } B \hat{w} = \gamma. \tag{8}$$
In this paper, we focus on quantifying the generalization performance of this min ℓ2-norm solution with the metric in Eq. (5).
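The stacked quantities of Eq. (6) and the min ℓ2-norm interpolator can be sketched numerically as follows (helper names are ours; the closed form $B^T(BB^T)^{-1}\gamma$ is the standard minimum-norm solution of an underdetermined linear system).

```python
import numpy as np

def build_B_gamma(tasks, alpha_t):
    """Stack the per-task blocks of gamma and B from Eq. (6)."""
    B_blocks, g_blocks = [], []
    for X_t, y_t, X_v, y_v in tasks:
        n_t = len(y_t)
        M = np.eye(X_t.shape[0]) - (alpha_t / n_t) * (X_t @ X_t.T)
        B_blocks.append(X_v.T @ M)
        g_blocks.append(y_v - (alpha_t / n_t) * (X_v.T @ (X_t @ y_t)))
    return np.vstack(B_blocks), np.concatenate(g_blocks)

def min_l2_solution(B, gamma):
    """Min l2-norm interpolator B^T (B B^T)^{-1} gamma (B has full row rank a.s. when p > m n_v)."""
    return B.T @ np.linalg.solve(B @ B.T, gamma)

# Overparameterized demo: p = 20 > m * n_v = 6, so an interpolator exists a.s.
rng = np.random.default_rng(0)
p, m, n_t, n_v, alpha_t = 20, 2, 4, 3, 0.01
tasks = [(rng.normal(size=(p, n_t)), rng.normal(size=n_t),
          rng.normal(size=(p, n_v)), rng.normal(size=n_v)) for _ in range(m)]
B, gamma = build_B_gamma(tasks, alpha_t)
w_hat = min_l2_solution(B, gamma)
```

Since `w_hat` satisfies $B\hat{w} = \gamma$ exactly, the meta training loss at `w_hat` is zero, i.e., `w_hat` is an interpolator.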

3. MAIN RESULTS

To analyze the generalization performance, we first decouple the overall test error into two parts: i) the error caused by the one-step gradient adaptation for the test task, and ii) the error caused by the meta training over the training tasks. The following lemma quantifies this decomposition.

Lemma 1. With Assumptions 1 and 2, for any learning result $\hat{w}$ (i.e., regardless of how we train $\hat{w}$), the expected squared test error is
$$\mathbb{E}_{x, X^r, \epsilon^r, w^r} L_{\text{test}}(x, w^r; \hat{w}^r) = f_{\text{test}}\left( \| \hat{w} - w^0 \|_2^2 \right), \ \text{where } f_{\text{test}}(\zeta) := \left( (1 - \alpha_r)^2 + \frac{p+1}{n_r} \alpha_r^2 \right) (\zeta + \nu_r^2) + \alpha_r^2 \frac{p}{n_r} \sigma_r^2.$$
Notice that in the meta-learning phase, the ideal situation is that the learned meta parameter $\hat{w}$ perfectly matches the mean $w^0$ of the true parameters. Thus, the term $\| \hat{w} - w^0 \|_2^2$ characterizes how well the meta-training goes. The remaining terms in Lemma 1 then characterize the effect of the one-step training for the test task. The proof of Lemma 1 is in Appendix H. Note that the expression of $f_{\text{test}}(\zeta)$ coincides with Eq. (65) of Bernacchia (2021). However, Bernacchia (2021) uses a different setup and does not analyze its implications as we do in the rest of this section.

3.1. UNDERSTANDING THE TEST ERROR

Proposition 1. We have
$$\frac{p+1}{n_r + p + 1} \left( \| \hat{w} - w^0 \|_2^2 + \nu_r^2 \right) \le \min_{\alpha_r} \mathbb{E}_{x, X^r, \epsilon^r, w^r}\left[ L_{\text{test}}(x, w^r; \hat{w}^r) \right] \le \| \hat{w} - w^0 \|_2^2 + \nu_r^2.$$
Further, by letting $\alpha_r = \frac{n_r}{n_r + p + 1}$ (which is optimal when $\sigma_r = 0$), we have
$$\mathbb{E}_{x, X^r, \epsilon^r, w^r}\left[ L_{\text{test}}(x, w^r; \hat{w}^r) \right] = \frac{p+1}{n_r + p + 1} \left( \| \hat{w} - w^0 \|_2^2 + \nu_r^2 \right) + \frac{n_r p}{(n_r + p + 1)^2} \sigma_r^2.$$
The derivation of Proposition 1 is in Appendix H.1. Some insights from Proposition 1 are as follows.

1) The optimal $\alpha_r$ does not help much when overparameterized. For meta-learning, the number of training samples $n_r$ for the test task is usually small (otherwise there would be no need for meta-learning). Therefore, in Proposition 1, when overparameterized (i.e., p is relatively large), the coefficient $\frac{p+1}{n_r+p+1}$ of $(\| \hat{w} - w^0 \|_2^2 + \nu_r^2)$ is close to 1. On the other hand, the upper bound $\| \hat{w} - w^0 \|_2^2 + \nu_r^2$ can be achieved by letting $\alpha_r = 0$, which implies that the effect of optimally choosing $\alpha_r$ is limited under this circumstance. Further, calculating the optimal $\alpha_r$ requires the precise values of $\| \hat{w} - w^0 \|_2^2$, $\nu_r^2$, and $\sigma_r^2$. However, those values are usually hard or impossible to obtain beforehand. Hence, we next investigate how to choose an easy-to-obtain $\alpha_r$.

2) Choosing $\alpha_r = n_r/(n_r + p + 1)$ is practical and good enough when overparameterized. This choice is practical since $n_r$ and p are known. By Proposition 1, the gap between choosing $\alpha_r = n_r/(n_r+p+1)$ and choosing the optimal $\alpha_r$ is at most $\frac{n_r p}{(n_r+p+1)^2}\sigma_r^2 \le \frac{n_r}{p}\sigma_r^2$. When p increases, this gap decreases to zero. In other words, when heavily overparameterized, choosing $\alpha_r = n_r/(n_r+p+1)$ is good enough.

3) Overparameterization can reduce the noise error to zero but cannot diminish the task diversity error. In the expression of $f_{\text{test}}(\| \hat{w} - w^0 \|_2^2)$ in Lemma 1, there are two parts related to the test task: the noise error (the term with $\sigma_r$) and the task diversity error (the term with $\nu_r$). By Proposition 1, even if we choose the optimal $\alpha_r$, the term with $\nu_r^2$ does not diminish to zero when p increases. In contrast, by letting $\alpha_r = n_r/(n_r+p+1)$, the noise term $\frac{n_r p}{(n_r+p+1)^2}\sigma_r^2 \le \frac{n_r}{p}\sigma_r^2$ diminishes to zero as p increases to infinity.
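Insights 1) and 2) can be checked numerically from the formula of $f_{\text{test}}$ in Lemma 1. The sketch below uses illustrative parameter values of our choosing and compares the practical step size with a grid search over $\alpha_r$.

```python
import numpy as np

def f_test(zeta, alpha_r, p, n_r, nu_r, sigma_r):
    """Expected test error after one-step adaptation (Lemma 1)."""
    return ((1 - alpha_r) ** 2 + (p + 1) / n_r * alpha_r ** 2) * (zeta + nu_r ** 2) \
        + alpha_r ** 2 * p / n_r * sigma_r ** 2

# Illustrative values (not from the paper): heavily overparameterized, few test samples.
p, n_r = 1000, 10
zeta, nu_r, sigma_r = 0.5, 1.0, 1.0

practical = f_test(zeta, n_r / (n_r + p + 1), p, n_r, nu_r, sigma_r)
grid = np.linspace(0.0, 1.0, 100001)
best = float(f_test(zeta, grid, p, n_r, nu_r, sigma_r).min())   # near-optimal alpha_r
```

With these numbers the gap `practical - best` stays below $n_r \sigma_r^2 / p = 0.01$, matching insight 2), while no choice of $\alpha_r$ pushes the error below the lower bound of Proposition 1.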

3.2. CHARACTERIZATION OF MODEL ERROR

Since we already have Lemma 1, to estimate the generalization error it only remains to estimate $\| \hat{w} - w^0 \|_2^2$, which we refer to as the model error. The following Theorem 1 gives a high-probability upper bound on the model error.

Theorem 1. Under Assumptions 1 and 2, when $\min\{p, n_t\} \ge 256$, we must have
$$\Pr_{X^{t(1:m)}, X^{v(1:m)}}\left[ \mathbb{E}_{w^{(1:m)}, \epsilon^{t(1:m)}, \epsilon^{v(1:m)}} \| \hat{w}_{\ell_2} - w^0 \|_2^2 \le b_w \right] \ge 1 - \eta,$$
where $b_w$ and $\eta$ are completely determined by the finite (i.e., non-asymptotic regime) system parameters $p$, $s$, $m$, $n_t$, $n_v$, $\|w^0\|_2^2$, $\nu^2$, $\sigma^2$, and $\alpha_t$. The precise expressions will be given in Section 4, along with the proof sketch of Theorem 1. Notice that although $b_w$ is only an upper bound, we will show in Section 4 that each component of this upper bound is relatively tight.

Theorem 1 differs from the result in Bernacchia (2021) in two main aspects. First, Theorem 1 works in the non-asymptotic regime where p and $n_t$ are both finite, whereas their result holds only in the asymptotic regime where $p, n_t \to \infty$. In general, a non-asymptotic result is more powerful for understanding and characterizing how the generalization performance changes as the model becomes more overparameterized. Second, our bound holds with high probability with respect to the training and validation data, which is much stronger than their result in expectation. Due to these key differences, our derivation of the bound is very different and much harder. As we will show in Section 4, the detailed expressions of $b_{w_0}$ and $b_w^{\text{ideal}}$ are complicated. In order to derive some useful interpretations, we provide a simpler form by approximation for an overparameterized regime where $\alpha_t p \ll 1$ and $\min\{p, n_t\} \gg m n_v$. In such a regime, we have
$$b_{w_0} \approx \frac{p - m n_v}{p} \|w^0\|_2^2, \quad b_w^{\text{ideal}} \approx \frac{b_\delta}{p - C_4 m n_v}, \quad \text{where } b_\delta \approx m n_v \left( \left(1 + \frac{C_1}{n_t}\right) \sigma^2 + C_2 \left(1 + \frac{C_3}{n_t}\right) \nu^2 \right), \tag{9}$$
and $C_1$ to $C_4$ are constants. A number of interesting insights can be obtained from the simplified bounds above.
1) Heavier overparameterization reduces the negative effects of noise and task diversity. From Eq. (9), we know that when p increases, $b_w^{\text{ideal}}$ decreases to zero. Notice that $b_w^{\text{ideal}}$ is the only term related to $\sigma^2$ and $\nu^2$, i.e., $b_w^{\text{ideal}}$ corresponds to the negative effect of noise and task diversity/fluctuation. Therefore, we can conclude that using more features/parameters can reduce the negative effect of noise and task diversity/fluctuation. Interestingly, $b_w^{\text{ideal}}$ can be interpreted as the model error of the ideal interpolator $\hat{w}_{\text{ideal}}$, defined as
$$\hat{w}_{\text{ideal}} := \arg\min_{\hat{w}} \| \hat{w} - w^0 \|_2^2 \quad \text{subject to } B \hat{w} = \gamma.$$
Unlike the min ℓ2-norm overfitted solution in Eq. (8), which minimizes the norm of $\hat{w}$, the ideal interpolator minimizes the distance between $\hat{w}$ and $w^0$, i.e., the model error (this is why we call it the ideal interpolator). The following proposition states that $b_w^{\text{ideal}}$ bounds the model error of $\hat{w}_{\text{ideal}}$.

Proposition 2. When $\min\{p, n_t\} \ge 256$, we must have
$$\Pr_{X^{v(1:m)}, X^{t(1:m)}}\left[ \mathbb{E}_{w^{(1:m)}, \epsilon^{t(1:m)}, \epsilon^{v(1:m)}} \| \hat{w}_{\text{ideal}} - w^0 \|_2^2 \le b_w^{\text{ideal}} \right] \ge 1 - \frac{26 m^2 n_v^2}{\min\{p, n_t\}^{0.4}}.$$

2) Overfitting is beneficial for reducing the model error of the ideal interpolator. Although calculating $\hat{w}_{\text{ideal}}$ is not practical, since it requires knowing the value of $w^0$, we can still use it as a benchmark describing the best performance among all overfitted solutions. From the previous analysis, we have already shown that $b_w^{\text{ideal}} \to 0$ when $p \to \infty$, i.e., the model error of the ideal interpolator decreases to 0 as the number of features grows. Thus, we conclude that overfitting is beneficial for reducing the model error of the ideal interpolator. This can be viewed as evidence that overfitting itself should not always be viewed negatively, which is consistent with the success of DNNs in meta-learning.

3) The descent curve is easier to observe under large noise and task diversity, and the curve sometimes has a descent floor.
From Eq. (9), when $b_\delta$ is small, the model-error curve has a floor with value approximately $\|w^0\|_2^2 \left(1 - \frac{(1-\sqrt{g})^2}{C_4}\right)$, which is achieved at $p = \frac{C_4 m n_v}{1 - \sqrt{g}}$. This implies that the descent curve of the model error has a floor only when $b_\delta$ is small, i.e., when the noise and task diversity are small. Notice that the threshold $\frac{C_4 m n_v}{1-\sqrt{g}}$ and the floor value $\|w^0\|_2^2\left(1 - \frac{(1-\sqrt{g})^2}{C_4}\right)$ increase as g increases. Therefore, we anticipate that as $\nu$ and $\sigma$ increase, the descent floor value and its location both increase. In Fig. 1, we draw the curve of the model error with respect to p for the min ℓ2-norm solution. In subfigure (a), the blue curve (with the marker "+") and the yellow curve (with the marker "×") have relatively large $\nu$ and/or $\sigma$. These two curves always decrease in the overparameterized region $p > m n_v$ and have no descent floor. In contrast, the remaining three curves (purple, red, green) in subfigure (a) have a descent floor, since they have relatively small $\nu$ and $\sigma$. Subfigure (b) shows the location and the value of the descent floor. As we can see, when $\nu$ and $\sigma$ increase, the descent floor becomes higher and is located at a larger p. These observations are consistent with our theoretical analysis. In Appendix E.2, we provide a further experiment where we train a two-layer fully connected neural network on the MNIST data set, and we observe that a descent floor still occurs; readers can find more details in Appendix E.2.

4) Task diversity yields double descent under zero noise. For a single-task classical linear regression model $y = x^T w^0 + \epsilon$, the authors of Belkin et al. (2020) study the overfitted min ℓ2-norm solution $w_{\ell_2,\text{single}}$ learned by interpolating n training samples with $p \ge n + 2$ i.i.d. Gaussian features. The result in Belkin et al. (2020) shows that its expected model error is
$$\mathbb{E} \| w_{\ell_2,\text{single}} - w^0 \|_2^2 = \frac{p - n}{p} \|w^0\|_2^2 + \frac{n}{p - n - 1} \sigma^2.$$
We find that meta-learning and single-task regression have similar bias terms, $\frac{p - m n_v}{p}\|w^0\|_2^2$ and $\frac{p-n}{p}\|w^0\|_2^2$, respectively. When $p \to \infty$, these bias terms increase to $\|w^0\|_2^2$, which corresponds to the null risk (the error of a model that always predicts zero). For single-task regression, the remaining term $\frac{n}{p-n-1}\sigma^2 \approx \frac{n}{p}\sigma^2$ when $p \gg n$, which contributes to the descent of the generalization error as p increases.
On the other hand, if there is no noise ($\sigma = 0$), then benign overfitting/double descent disappears for single-task regression. In contrast, for meta-learning, the term that contributes to benign overfitting is $b_w^{\text{ideal}}$. As we can see from the expression of $b_w^{\text{ideal}}$ in Eq. (9), even if the noise is zero, as long as there exists task diversity/fluctuation (i.e., $\nu > 0$), the descent of the model error with respect to p still exists. This is also confirmed in Fig. 1(a) by the descending blue curve (with marker "+") with $\nu = 60$ and $\sigma = 0$.

5) The overfitted solution can generalize better than the underfitted solution. Let $\hat{w}_{\ell_2}^{(p=s)}$ denote the solution when $p = s$. We have
$$\left\| \hat{w}_{\ell_2}^{(p=s)} - w^0 \right\|_2^2 \approx \frac{\nu^2}{m} + \frac{\sigma^2 \alpha_t^2}{m} + \frac{\sigma^2 (1 - \alpha_t)^2}{m n_v}.$$
The derivation is in Appendix O. Notice that $p = s$ means that all features are true features, which is ideal for an underparameterized solution. Compared to the model error of the overfitted solution, there is no bias term of $w^0$, but the terms with $\nu^2$ and $\sigma^2$ are only discounted by m and $n_v$. In contrast, for the overfitted solution with $p \gg m n_v$, the terms with $\nu^2$ and $\sigma^2$ are discounted by p. Since the values of m and $n_v$ are usually fixed or limited, while p can be chosen freely and made arbitrarily large, the overfitted solution can do much better at mitigating the negative effect caused by noise and task divergence. This provides a new insight: when $\sigma$ and $\nu$ are large and $\|w^0\|_2$ is small, the overfitted min ℓ2-norm solution can have a much better overall generalization performance than the underparameterized solution. This is further verified in Fig. 1(a) by the descending blue curve (with marker "+"), where the first point of this curve with $p = 5 = s$ (in the underparameterized regime) has a larger test error than its last point with $p = 1000$ (in the overparameterized regime).
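The single-task baseline quoted above can be verified by a quick Monte Carlo sketch (illustrative sizes of our choosing; `w0` is normalized so the null risk equals 1):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, sigma, trials = 40, 10, 0.5, 2000           # illustrative sizes, p >= n + 2
w0 = rng.normal(size=p)
w0 /= np.linalg.norm(w0)                          # ||w0||_2 = 1, so the null risk is 1

errs = []
for _ in range(trials):
    X = rng.normal(size=(p, n))                   # i.i.d. Gaussian features, columns = samples
    y = X.T @ w0 + rng.normal(scale=sigma, size=n)
    w = X @ np.linalg.solve(X.T @ X, y)           # min l2-norm interpolator of the n samples
    errs.append(np.sum((w - w0) ** 2))

predicted = (p - n) / p * 1.0 + n / (p - n - 1) * sigma ** 2   # Belkin et al. (2020) formula
```

The empirical mean of `errs` matches `predicted` (here about 0.836) up to Monte Carlo fluctuation, illustrating both the bias term $\frac{p-n}{p}\|w^0\|_2^2$ and the variance term $\frac{n}{p-n-1}\sigma^2$.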

4. TIGHTNESS OF THE BOUND AND ITS PROOF SKETCH

We now present the main ideas of the proof of Theorem 1 and theoretically explain why the bound in Theorem 1 is reasonably tight by showing that each part of the bound is tight. (Numerical verification of the tightness is provided in Appendix E.1.) The expressions of $b_{w_0}$ and $b_w^{\text{ideal}}$ (along with other related quantities) in Theorem 1 will be defined progressively as we sketch the proof. Readers can also refer to the beginning of Appendix I for a full list of the definitions of these quantities. We first rewrite $\| w^0 - \hat{w}_{\ell_2} \|_2^2$ in terms related to B. When B has full row rank (which holds almost surely when $p > m n_v$), we have
$$\hat{w}_{\ell_2} = B^T (B B^T)^{-1} \gamma. \tag{10}$$
Define
$$\delta\gamma := \gamma - B w^0. \tag{11}$$
After some algebraic transformations (details in Appendix F), we have
$$\| w^0 - \hat{w}_{\ell_2} \|_2^2 = \underbrace{\left\| \left( I_p - B^T (B B^T)^{-1} B \right) w^0 \right\|_2^2}_{\text{Term 1}} + \underbrace{\left\| B^T (B B^T)^{-1} \delta\gamma \right\|_2^2}_{\text{Term 2}}. \tag{12}$$
To estimate the model error, we take a divide-and-conquer strategy and provide a set of propositions to estimate Term 1 and Term 2 in Eq. (12).

Proposition 3. Define
$$b_{w_0} := \frac{(p - m n_v) + 2\sqrt{(p - m n_v) \ln p} + 2 \ln p}{p - 2\sqrt{p \ln p}} \|w^0\|_2^2, \quad \underline{b}_{w_0} := \frac{(p - m n_v) - 2\sqrt{(p - m n_v) \ln p}}{p + 2\sqrt{p \ln p} + 2 \ln p} \|w^0\|_2^2.$$
When $p \ge m n_v$ and $p \ge 16$, we then have
$$\Pr[\text{Term 1 of Eq. (12)} \le b_{w_0}] \ge 1 - 2/p, \quad \Pr[\text{Term 1 of Eq. (12)} \ge \underline{b}_{w_0}] \ge 1 - 2/p,$$
$$\mathbb{E}_{X^{t(1:m)}, X^{v(1:m)}}[\text{Term 1 of Eq. (12)}] = \frac{p - m n_v}{p} \|w^0\|_2^2.$$
Proposition 3 gives three estimates on Term 1 of Eq. (12): the upper bound $b_{w_0}$, the lower bound $\underline{b}_{w_0}$, and the mean value. If we omit all logarithmic terms, these three estimates coincide, which implies that our estimation of Term 1 of Eq. (12) is fairly precise. (The proof of Proposition 3 is in Appendix J.) It then remains to estimate Term 2 in Eq. (12). Indeed, this term is also the model error of the ideal interpolator. To see this, note that since $B \hat{w} = \gamma$, we have $B(\hat{w} - w^0) = \gamma - B w^0 = \delta\gamma$. Thus, to obtain $\min \| \hat{w} - w^0 \|_2^2$ as the ideal interpolator, we have $\hat{w}_{\text{ideal}} - w^0 = B^T (B B^T)^{-1} \delta\gamma$.
Therefore, we have
$$\| \hat{w}_{\text{ideal}} - w^0 \|_2^2 = \left\| B^T (B B^T)^{-1} \delta\gamma \right\|_2^2. \tag{13}$$
We now focus on $\| B^T (B B^T)^{-1} \delta\gamma \|_2^2$. By Lemma 6 in Appendix G.2, we have
$$\frac{\| \delta\gamma \|_2^2}{\lambda_{\max}(B B^T)} \le \left\| B^T (B B^T)^{-1} \delta\gamma \right\|_2^2 \le \frac{\| \delta\gamma \|_2^2}{\lambda_{\min}(B B^T)}. \tag{14}$$
We have the following results about the eigenvalues of $B B^T$.

Published as a conference paper at ICLR 2023

Proposition 4. Define $\alpha_t' := \frac{\alpha_t}{n_t} \left( \sqrt{p} + \sqrt{n_t} + \sqrt{\ln n_t} \right)^2$ and
$$b_{\text{eig,min}} := p + \left( \max\{0, 1 - \alpha_t'\}^2 - 1 \right) n_t - (n_v + 1) \max\{\alpha_t', 1 - \alpha_t'\}^2 - 6 m n_v \sqrt{p \ln p},$$
$$b_{\text{eig,max}} := p + \left( \max\{\alpha_t', 1 - \alpha_t'\}^2 - 1 \right) n_t + (n_v + 1) \max\{\alpha_t', 1 - \alpha_t'\}^2 + 6 m n_v \sqrt{p \ln p}.$$
When $p \ge n_t \ge 256$, we must have
$$\Pr\left[ b_{\text{eig,min}} \le \lambda_{\min}(B B^T) \le \lambda_{\max}(B B^T) \le b_{\text{eig,max}} \right] \ge 1 - 23 m^2 n_v^2 / n_t^{0.4}.$$

Proposition 5. Define
$$c_{\text{eig,min}} := \max\{0, 1 - \alpha_t'\}^2 \, p - 2 m n_v \max\{\alpha_t', 1 - \alpha_t'\}^2 \sqrt{p \ln p}, \quad c_{\text{eig,max}} := \max\{\alpha_t', 1 - \alpha_t'\}^2 \, p + (2 m n_v + 1) \sqrt{p \ln p}.$$
When $n_t \ge p \ge 256$, we have
$$\Pr\left[ c_{\text{eig,min}} \le \lambda_{\min}(B B^T) \le \lambda_{\max}(B B^T) \le c_{\text{eig,max}} \right] \ge 1 - 16 m^2 n_v^2 / p^{0.4}.$$
To see how the upper and lower bounds on the eigenvalues of $B B^T$ match, consider $\alpha_t p \ll 1$, which implies $\alpha_t' \ll 1$, and note that $\sqrt{p}$ and $\ln p$ are lower-order terms than p; then each of $b_{\text{eig,min}}$, $b_{\text{eig,max}}$, $c_{\text{eig,min}}$, and $c_{\text{eig,max}}$ can be approximated by $p \pm C m n_v$ for some constant C. Further, when $p \gg m n_v$, all of them can be approximated by p, i.e., the upper and lower bounds on the eigenvalues of $B B^T$ match. Therefore, our estimates of $\lambda_{\max}(B B^T)$ and $\lambda_{\min}(B B^T)$ in Propositions 4 and 5 are fairly tight. (Proposition 4 is proved in Appendix L, and Proposition 5 is proved in Appendix M.) From Eq. (14), it remains to estimate $\| \delta\gamma \|_2^2$.

Proposition 6. Define
$$D := \max\left\{ \left( 1 - \alpha_t \frac{n_t + 2\sqrt{n_t \ln(s n_t)} + 2 \ln(s n_t)}{n_t} \right)^2, \left( 1 - \alpha_t \frac{n_t - 2\sqrt{n_t \ln(s n_t)}}{n_t} \right)^2 \right\},$$
$$b_\delta := m n_v \sigma^2 \left( 1 + \frac{\alpha_t^2 \, p (\ln n_t)^2 \ln p}{n_t} \right) + m n_v \nu^2 \left( 2 \ln(s n_t) \cdot D + \frac{\alpha_t^2 (p-1)}{n_t} \cdot 6.25 (\ln(s p n_t))^2 \right).$$
When $\min\{p, n_t\} \ge 256$, we must have
$$\Pr_{X^{t(1:m)}, X^{v(1:m)}}\left[ \mathbb{E}_{w^{(1:m)}, \epsilon^{t(1:m)}, \epsilon^{v(1:m)}} \| \delta\gamma \|_2^2 \le b_\delta \right] \ge 1 - \frac{5 m n_v}{n_t^{0.4}} - \frac{2 m n_v}{p^{0.4}}.$$
We also have
$$\mathbb{E}\| \delta\gamma \|_2^2 = m n_v \sigma^2 \left( 1 + \frac{\alpha_t^2 p}{n_t} \right) + \nu^2 m n_v \left( (1 - \alpha_t)^2 + \frac{\alpha_t^2 (p+1)}{n_t} \right),$$
where the expectation is over all random variables. Proposition 6 provides an upper bound $b_\delta$ on $\| \delta\gamma \|_2^2$ and an explicit form for $\mathbb{E}\| \delta\gamma \|_2^2$. Comparing $b_\delta$ with $\mathbb{E}\| \delta\gamma \|_2^2$, the differences are only some coefficients and logarithmic terms. Thus, the estimate of $\| \delta\gamma \|_2^2$ in Proposition 6 is fairly tight. Proposition 6 is proved in Appendix N. Combining Eq. (13), Eq. (14), Proposition 4, Proposition 5, and Proposition 6, we obtain the result of Proposition 2 by the union bound, where
$$b_w^{\text{ideal}} := \frac{b_\delta}{\max\left\{ b_{\text{eig,min}} \mathbb{1}_{\{p > n_t\}} + c_{\text{eig,min}} \mathbb{1}_{\{p \le n_t\}}, \, 0 \right\}}.$$
The detailed proof is in Appendix K. Then, by Eq. (13), Proposition 2, Proposition 3, and Eq. (12), we obtain a high-probability upper bound on $\| w^0 - \hat{w}_{\ell_2} \|_2^2$, i.e., Theorem 1. The detailed proof of Theorem 1 is in Appendix I. Notice that we can easily plug our estimates of the model error (Proposition 2 and Theorem 1) into Lemma 1 to obtain an estimate of the overall test error defined in Eq. (5), which we omit due to space limitations.
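The orthogonal decomposition in Eq. (12), and hence Eq. (13), can be checked numerically. The sketch below substitutes a random full-row-rank matrix for the structured B of Eq. (6), which suffices because the decomposition only uses the full-row-rank property.

```python
import numpy as np

rng = np.random.default_rng(2)
p, rows = 30, 8                                    # "rows" stands in for m * n_v < p
B = rng.normal(size=(rows, p))                     # random surrogate for the B of Eq. (6)
w0 = rng.normal(size=p)
gamma = rng.normal(size=rows)

w_l2 = B.T @ np.linalg.solve(B @ B.T, gamma)       # min l2-norm interpolator, Eq. (10)
dgamma = gamma - B @ w0                            # delta gamma, Eq. (11)
P = B.T @ np.linalg.solve(B @ B.T, B)              # projection onto the row space of B
term1 = np.sum(((np.eye(p) - P) @ w0) ** 2)        # Term 1 of Eq. (12)
term2 = np.sum((B.T @ np.linalg.solve(B @ B.T, dgamma)) ** 2)  # Term 2 of Eq. (12)
total = np.sum((w0 - w_l2) ** 2)                   # model error of the min l2-norm solution
```

Because `(I - P) w0` lies in the null space of B while `B^T (B B^T)^{-1} dgamma` lies in its row space, the two terms are orthogonal and `total` equals `term1 + term2` up to floating-point error.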

5. CONCLUSION

We study the generalization performance of overfitted meta-learning under a linear model with Gaussian features. We characterize the descent curve of the model error for the overfitted min ℓ 2 -norm solution and show the differences compared with the underfitted meta-learning and overfitted classical (single-task) linear regression. Possible future directions include relaxing the assumptions and extending the result to other models related to DNNs (e.g., neural tangent kernel (NTK) models).

A RELATED WORK

Our work is related to recent studies characterizing the double-descent phenomenon for overfitted solutions of single-task linear regression. Some works study the min ℓ2-norm solution for linear regression with simple features such as Gaussian or Fourier features (Belkin et al., 2018; 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019), where they show the existence of the double-descent phenomenon. Others (Mitra, 2019; Ju et al., 2020) study the min ℓ1-norm overfitted solution and show that it also exhibits double descent, but with a descent curve whose shape is very different from that of the min ℓ2-norm solution. Some recent works study the generalization performance under overparameterization in random feature (RF) models (Mei & Montanari, 2019), two-layer neural tangent kernel (NTK) models (Arora et al., 2019; Satpathi & Srikant, 2021; Ju et al., 2021), and three-layer NTK models (Ju et al., 2022). (RF and NTK models are linear approximations of shallow but wide neural networks.) These works show benign overfitting for a certain set of learnable ground-truth functions that depends on the network structure and on which layer is trained. However, all of those works are on single-task learning, and thus cannot be directly used to characterize the generalization performance of meta-learning, due to the differences in many aspects mentioned in the introduction. Our work is also related to a few recent works on the generalization performance of meta-learning. In Bernacchia (2021), the expected value of the test error for the overfitted min ℓ2-norm solution of MAML is provided. However, that result only holds in the asymptotic regime where the number of features and the number of training samples go to infinity, whereas in our setting all quantities are finite. In Chen et al. (2022), a high-probability upper bound on the test error of a similar overfitted min ℓ2-norm solution is given in the non-asymptotic regime.
A significant difference between our work and Chen et al. (2022) is that our theoretical bound describes the shape of double descent in a straightforward manner (i.e., our bound consists of decoupled system parameters). In contrast, the bound in Chen et al. (2022) contains coupled parts (e.g., eigenvalues of weight matrices are affected by the number of tasks, samples, and features), which may not be directly usable for analyzing the shape of double descent. Besides, we also explain and verify the tightness of our bound, while Chen et al. (2022) does not. In Huang et al. (2022), the authors focus on the generalization error during the SGD process when the training error is not zero, which differs from our focus on overfitted solutions that make the training error exactly zero (i.e., the interpolators). Our work also differs from Bernacchia (2021); Chen et al. (2022); Huang et al. (2022) in the data-generation model, for the purpose of quantifying how the number of features affects the test error, as explained in detail at the beginning of Section 2.1.

B NOTATION TABLE

X_t(i) | the matrix formed by the n_t training inputs of the i-th task | matrix | R^{p×n_t}
y_t(i) | the output vector corresponding to X_t(i)                   | vector | R^{n_t}
ε_t(i) | the noise in y_t(i)                                         | vector | R^{n_t}
X_v(i) | the matrix formed by the n_v validation inputs of the i-th task | matrix | R^{p×n_v}
y_v(i) | the output vector corresponding to X_v(i)                   | vector | R^{n_v}
ε_v(i) | the noise in y_v(i)                                         | vector | R^{n_v}

C THE CALCULATION OF META LOSS

By the definition of $L^{(i)}_{\text{inner}}$ in Eq. (3), we have
$$\frac{\partial L^{(i)}_{\text{inner}}}{\partial \hat{w}} = \frac{\partial \left(y_{t(i)} - X_{t(i)}^T \hat{w}\right)}{\partial \hat{w}} \cdot \frac{\partial\, \frac{1}{2}\left\|y_{t(i)} - X_{t(i)}^T \hat{w}\right\|_2^2}{\partial \left(y_{t(i)} - X_{t(i)}^T \hat{w}\right)} = -X_{t(i)}\left(y_{t(i)} - X_{t(i)}^T \hat{w}\right) = X_{t(i)} X_{t(i)}^T \hat{w} - X_{t(i)} y_{t(i)}.$$
Plugging this into Eq. (3), we thus have
$$\hat{w}^{(i)} = \left(I_p - \frac{\alpha_t}{n_t} X_{t(i)} X_{t(i)}^T\right)\hat{w} + \frac{\alpha_t}{n_t} X_{t(i)} y_{t(i)}. \quad (15)$$
Plugging Eq. (15) into Eq. (4), we thus have
$$L^{(i)}_{\text{outer}} = \frac{1}{2}\left\|y_{v(i)} - X_{v(i)}^T \hat{w}^{(i)}\right\|_2^2 = \frac{1}{2}\left\|y_{v(i)} - X_{v(i)}^T\left(\left(I_p - \frac{\alpha_t}{n_t} X_{t(i)} X_{t(i)}^T\right)\hat{w} + \frac{\alpha_t}{n_t} X_{t(i)} y_{t(i)}\right)\right\|_2^2 = \frac{1}{2}\left\|\left(y_{v(i)} - \frac{\alpha_t}{n_t} X_{v(i)}^T X_{t(i)} y_{t(i)}\right) - X_{v(i)}^T\left(I_p - \frac{\alpha_t}{n_t} X_{t(i)} X_{t(i)}^T\right)\hat{w}\right\|_2^2.$$
By the definitions of $B$ and $\gamma$ in Eq. (7), we thus have
$$L_{\text{meta}} = \frac{1}{m n_v} \sum_{i=1}^m L^{(i)}_{\text{outer}} = \frac{1}{2 m n_v}\left\|\gamma - B\hat{w}\right\|_2^2.$$
Eq. (6) thus follows.
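As a quick numerical sanity check of the derivation above (not part of the paper; all dimensions are hypothetical), the sketch below forms the one-step adapted weights of Eq. (15), sums the outer losses directly, and confirms that the total equals $\frac{1}{2}\|\gamma - B\hat{w}\|_2^2$ with $B$ and $\gamma$ built exactly as in the last display (the $\frac{1}{mn_v}$ normalization aside).

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_t, n_v, m, alpha_t = 30, 20, 5, 3, 0.05   # hypothetical values
w_hat = rng.standard_normal(p)

loss_direct = 0.0
B_blocks, gamma_blocks = [], []
for _ in range(m):
    Xt, Xv = rng.standard_normal((p, n_t)), rng.standard_normal((p, n_v))
    yt, yv = rng.standard_normal(n_t), rng.standard_normal(n_v)
    # one-step inner adaptation, Eq. (15):
    w_i = (np.eye(p) - (alpha_t / n_t) * Xt @ Xt.T) @ w_hat \
          + (alpha_t / n_t) * Xt @ yt
    loss_direct += 0.5 * np.sum((yv - Xv.T @ w_i) ** 2)
    # blocks of B and gamma, as in the derivation above:
    B_blocks.append(Xv.T @ (np.eye(p) - (alpha_t / n_t) * Xt @ Xt.T))
    gamma_blocks.append(yv - (alpha_t / n_t) * Xv.T @ (Xt @ yt))
B, gamma = np.vstack(B_blocks), np.concatenate(gamma_blocks)

loss_meta = 0.5 * np.sum((gamma - B @ w_hat) ** 2)
print(np.isclose(loss_direct, loss_meta))   # True
```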

D CALCULATION OF DESCENT FLOOR

Define $h(p) := \frac{p - m n_v}{p}\|w_0\|_2^2 + \frac{b_\delta}{p - C_4 m n_v}$ for $p > C_4 m n_v$. We have
$$\frac{\partial h(p)}{\partial p} = \frac{m n_v}{p^2}\|w_0\|_2^2 - \frac{b_\delta}{(p - C_4 m n_v)^2} = \frac{m n_v \|w_0\|_2^2}{(p - C_4 m n_v)^2}\left(\left(1 - \frac{C_4 m n_v}{p}\right)^2 - \frac{b_\delta}{m n_v \|w_0\|_2^2}\right) = \frac{m n_v \|w_0\|_2^2}{(p - C_4 m n_v)^2}\left(1 - \frac{C_4 m n_v}{p} + \sqrt{g}\right)\left(1 - \frac{C_4 m n_v}{p} - \sqrt{g}\right)$$
(recall that $g := \frac{b_\delta}{m n_v \|w_0\|_2^2}$). Notice that when $p > C_4 m n_v$ and $w_0 \ne 0$, the first and second factors are positive. Thus, we only need to consider the sign of $A := 1 - \frac{C_4 m n_v}{p} - \sqrt{g}$. If $g \ge 1$, then $A < 0$, which implies that $h(p)$ is monotone decreasing. If $g < 1$, then we have
$$A \begin{cases} < 0, & \text{when } p \in \left(C_4 m n_v,\ \frac{C_4 m n_v}{1 - \sqrt{g}}\right), \\ > 0, & \text{when } p > \frac{C_4 m n_v}{1 - \sqrt{g}}, \end{cases}$$
and
$$h(p)\Big|_{p = \frac{C_4 m n_v}{1 - \sqrt{g}}} = \|w_0\|_2^2\left(1 - \frac{1 - \sqrt{g}}{C_4} + \frac{g\, m n_v}{C_4 m n_v\left(\frac{1}{1 - \sqrt{g}} - 1\right)}\right) = \|w_0\|_2^2\left(1 - \frac{1 - \sqrt{g}}{C_4} + \frac{g}{C_4\left(\frac{1}{1 - \sqrt{g}} - 1\right)}\right) = \|w_0\|_2^2\left(1 - \frac{(1 - \sqrt{g})^2}{C_4}\right).$$

In Fig. 2 we plot the theoretical bound in Theorem 1 and the approximation Eq. (9) with constants $C_1 = C_3 = 0.001$ and $C_2 = C_4 = 0.99995$. As we can see in Fig. 2, the experimental value points closely match the theoretical curves, which suggests that Theorem 1 and Eq. (9) are fairly tight.
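The case analysis of $h(p)$ above can be verified numerically. The sketch below (hypothetical values for $b_\delta$ and $\|w_0\|_2^2$, chosen so that $g < 1$) confirms that $h$ attains its minimum at $p = \frac{C_4 m n_v}{1-\sqrt{g}}$ with minimum value $\|w_0\|_2^2\big(1 - \frac{(1-\sqrt{g})^2}{C_4}\big)$.

```python
import numpy as np

m, n_v, C4 = 4, 5, 0.99995          # hypothetical values
w0_norm2 = 1.0                       # stands for ||w_0||_2^2
b_delta = 0.25 * m * n_v             # chosen so that g < 1
g = b_delta / (m * n_v * w0_norm2)   # here g = 0.25

def h(p):
    return (p - m * n_v) / p * w0_norm2 + b_delta / (p - C4 * m * n_v)

# Claimed minimizer and minimum value (case g < 1):
p_star = C4 * m * n_v / (1 - np.sqrt(g))
h_star = w0_norm2 * (1 - (1 - np.sqrt(g)) ** 2 / C4)

ps = np.linspace(C4 * m * n_v + 1e-3, 50 * m * n_v, 200000)
vals = h(ps)
print(abs(ps[vals.argmin()] - p_star), abs(vals.min() - h_star))  # both ~0
```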

E.2 EXPERIMENTS OF META-LEARNING WITH NEURAL NETWORK ON REAL-WORLD DATA

In this section, we further verify our theoretical findings by an experiment with a two-layer fully connected neural network on the MNIST data set. Neural network structure: The input dimension of the neural network is 49 (i.e., a 7 × 7 gray-scale image shrunk from the original 28 × 28 gray-scale image). The input is multiplied by the first-layer weights and fully connected to the hidden layer, which consists of ReLUs (rectified linear units max{0, ·}) with bias. The output of these ReLUs is then multiplied by the second-layer weights and goes to the output layer. The output layer is a sigmoid activation function with bias. Experimental setup: There are 4 training tasks and 1 test task. The objective of each task is to identify whether the input image belongs to a set of 5 different digits. Specifically, the sets for the 4 training tasks are {1, 2, 3, 7, 9}, {0, 2, 3, 6, 9}, {0, 1, 4, 6, 8}, and {1, 2, 3, 5, 7}, respectively. The set for the test task is {0, 2, 3, 6, 8}. For each training task, there are 1000 training samples and 100 validation samples. All samples (except the ones used to calculate the test error) are corrupted by i.i.d. Gaussian noise in each pixel with zero mean and standard deviation 0.3. The number of samples for the one-step gradient is 1000 for the 4 training tasks and 500 for the test task, i.e., n_t = 1000 and n_r = 500. The number of validation samples is n_v = 100 for each of the 4 training tasks. The initial weights are chosen uniformly at random in the range [0, 1]. Training process: We use gradient descent to train the neural network. The step size of the outer-loop training is 0.3 and the step size of the one-step gradient adaptation is α_t = α_r = 0.05. After training for 500 epochs, the meta-training error of each simulation is lower than 0.025 (the range of the meta-training error is [0, 1]), which means that the trained model almost completely fits all validation samples.

Simulation results and interpretation:

We run the simulation 30 times with different random seeds. In Fig. 3, we draw a box plot of the test error for the test task, where the blue curve denotes the average over these 30 runs. We can see that the overall trend of the test curve is decreasing, which suggests that more parameters help to enhance the overfitted generalization performance. Another interesting phenomenon in Fig. 3 is that the curve is not strictly monotonically decreasing, and there exist some descent floors (e.g., when the network width is 80).

F DETAILS OF DERIVING EQ. (12)

Notice that
$$\left\langle \left(I_p - B^T(BB^T)^{-1}B\right)w_0,\ B^T(BB^T)^{-1}\delta\gamma \right\rangle = (\delta\gamma)^T(BB^T)^{-1}B\left(I_p - B^T(BB^T)^{-1}B\right)w_0 = (\delta\gamma)^T\left((BB^T)^{-1}B - (BB^T)^{-1}B\right)w_0 = 0. \quad (16)$$
We thus have
$$\|w_0 - \hat{w}_{\ell_2}\|_2^2 = \left\|w_0 - B^T(BB^T)^{-1}\gamma\right\|_2^2 \quad \text{(by Eq. (10))}$$
$$= \left\|\left(I_p - B^T(BB^T)^{-1}B\right)w_0 - B^T(BB^T)^{-1}\delta\gamma\right\|_2^2 \quad \text{(by Eq. (11))}$$
$$= \underbrace{\left\|\left(I_p - B^T(BB^T)^{-1}B\right)w_0\right\|_2^2}_{\text{Term 1}} + \underbrace{\left\|B^T(BB^T)^{-1}\delta\gamma\right\|_2^2}_{\text{Term 2}} \quad \text{(by Eq. (16))}.$$
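The orthogonality in Eq. (16), which makes the two squared norms add, is easy to verify numerically. In the sketch below (a random wide matrix stands in for $B$ and a random vector for $\delta\gamma$; the dimensions are hypothetical), the two components are orthogonal and the Pythagorean identity of Eq. (12) holds.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 50, 12                        # k plays the role of m*n_v < p
B = rng.standard_normal((k, p))      # full row rank almost surely
w0 = rng.standard_normal(p)
dgamma = rng.standard_normal(k)      # stands in for delta-gamma

Pinv = np.linalg.inv(B @ B.T)
term1 = (np.eye(p) - B.T @ Pinv @ B) @ w0   # component outside the row span of B
term2 = B.T @ Pinv @ dgamma                 # component inside the row span of B

# Eq. (16): the two components are orthogonal, so squared norms add (Eq. (12)).
print(term1 @ term2,
      np.isclose(np.sum((term1 - term2) ** 2),
                 np.sum(term1 ** 2) + np.sum(term2 ** 2)))
```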

G SOME USEFUL LEMMAS G.1 ESTIMATION ON LOGARITHM

Lemma 2. For any $x > 0$, we must have $\ln x \ge 1 - \frac{1}{x}$.

Proof. This can be derived by examining the monotonicity of $\ln x - (1 - \frac{1}{x})$. The complete proof can be found, e.g., in Lemma 33 of Ju et al. (2020).

Lemma 3. When $k \ge 16$, we must have $2\sqrt{\frac{\ln k}{k}} < 1$.

Proof. We only need to prove that $g(k) := 4\ln k - k \le 0$ when $k \ge 16$. To that end, when $k \ge 16$, we have $\frac{\partial g(k)}{\partial k} = \frac{4}{k} - 1 \le 0$. Thus, $g(k)$ is monotone decreasing when $k \ge 16$. Also notice that $g(16) = 4\ln 16 - 16 \approx -4.91 < 0$. The result of this lemma thus follows.
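Both elementary bounds can be spot-checked numerically. The sketch below (reading Lemma 3 as $2\sqrt{(\ln k)/k} < 1$, consistent with its later use in Appendix J; the grids of test points are arbitrary) confirms the two inequalities.

```python
import math

# Lemma 2: ln x >= 1 - 1/x for all x > 0 (spot-checked on a grid).
xs = [1e-3, 0.5, 1.0, 2.0, 10.0, 1e6]
print(all(math.log(x) >= 1 - 1 / x for x in xs))            # True

# Lemma 3: 2*sqrt(ln k / k) < 1 for all k >= 16 (spot-checked).
ks = range(16, 10001)
print(all(2 * math.sqrt(math.log(k) / k) < 1 for k in ks))  # True
```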

G.2 ESTIMATION ON NORM AND EIGENVALUES

Let $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the minimum and the maximum singular values of a matrix, respectively.

Lemma 4. Consider an orthonormal basis matrix $U \in \mathbb{R}^{k\times k}$ and any vector $a \in \mathbb{R}^{k\times 1}$. Then we must have $\|Ua\|_2 = \|a\|_2$.

Proof. We have $Ua = \sum_{i=1}^{k} a_i U_i$, (17) where $U_i$ denotes the $i$-th column of $U$ and $a_i$ the $i$-th element of $a$. Because $U$ is an orthonormal basis matrix, the columns $U_i$ are orthogonal to each other, i.e., $U_i^T U_j = 0$ if $i \ne j$ and $U_i^T U_i = 1$. (18) Thus, we have
$$\|Ua\|_2 = \sqrt{\left(\sum_{i=1}^k a_i U_i\right)^T \left(\sum_{i=1}^k a_i U_i\right)} \text{ (by Eq. (17))} = \sqrt{\sum_{i=1}^k a_i^2} \text{ (by Eq. (18))} = \|a\|_2.$$
The result of this lemma thus follows.

Lemma 5. Consider a diagonal matrix $D = \mathrm{diag}(d_1, d_2, \cdots, d_k) \in \mathbb{R}^{k\times k}$ where $\min_i d_i \ge 0$. For any vector $a$, we must have $\|Da\|_2 \in \left[\min_i d_i \cdot \|a\|_2,\ \max_i d_i \cdot \|a\|_2\right]$.

Proof. We have
$$\|Da\|_2 = \sqrt{\sum_{i=1}^k d_i^2 a_i^2} \in \left[\min_i d_i \sqrt{\textstyle\sum_{j=1}^k a_j^2},\ \max_i d_i \sqrt{\textstyle\sum_{j=1}^k a_j^2}\right] = \left[\min_i d_i \cdot \|a\|_2,\ \max_i d_i \cdot \|a\|_2\right].$$
The result of this lemma thus follows.

Lemma 6. For any $A \in \mathbb{R}^{q\times k}$ and $a \in \mathbb{R}^{k\times 1}$, we must have $\|Aa\|_2^2 \in \left[\lambda_{\min}(A^T A)\|a\|_2^2,\ \lambda_{\max}(A^T A)\|a\|_2^2\right]$.

Proof. Do the singular value decomposition $A = U D V^T$. Here $D \in \mathbb{R}^{q\times k}$ is a diagonal matrix consisting of all singular values, and $U \in \mathbb{R}^{q\times q}$ and $V \in \mathbb{R}^{k\times k}$ (and their transposes) are orthonormal basis matrices. We have
$$\|Aa\|_2^2 = a^T A^T A a = a^T V D^T D V^T a = \left\|\sqrt{D^T D}\, V^T a\right\|_2^2 \in \left[\lambda_{\min}(D^T D)\|V^T a\|_2^2,\ \lambda_{\max}(D^T D)\|V^T a\|_2^2\right] \text{ (by Lemma 5)}$$
$$= \left[\lambda_{\min}(A^T A)\|V^T a\|_2^2,\ \lambda_{\max}(A^T A)\|V^T a\|_2^2\right] \text{ (by } A = UDV^T\text{)} = \left[\lambda_{\min}(A^T A)\|a\|_2^2,\ \lambda_{\max}(A^T A)\|a\|_2^2\right] \text{ (by Lemma 4)}.$$
The result of this lemma thus follows.

The following lemma is useful for estimating the eigenvalues of a matrix whose off-diagonal elements are relatively small.

Lemma 7 (Gershgorin's circle theorem (Marquis et al., 2016)). Let $A$ be an $n\times n$ matrix with complex entries $a_{i,j}$, and let $r_i(A) = \sum_{j\ne i} |a_{i,j}|$ denote the sum of the magnitudes of the non-diagonal entries of the $i$-th row. A Gershgorin disc is the disc $D(a_{i,i}, r_i(A))$ centered at $a_{i,i}$ in the complex plane with radius $r_i(A)$. Theorem: every eigenvalue of $A$ lies within at least one Gershgorin disc.
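Gershgorin's circle theorem is exactly the tool used later to localize the eigenvalues of $BB^T$ from its diagonal and off-diagonal entries. The following sketch (an arbitrary diagonally dominant test matrix) illustrates the theorem.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
# A test matrix with large diagonal and small off-diagonal entries.
A = np.diag(10 + rng.random(n)) + 0.1 * rng.standard_normal((n, n))

radii = np.sum(np.abs(A - np.diag(np.diag(A))), axis=1)  # r_i(A)
eigs = np.linalg.eigvals(A)

# Every eigenvalue lies in at least one disc |z - a_ii| <= r_i(A).
in_some_disc = [any(abs(lam - A[i, i]) <= radii[i] for i in range(n))
                for lam in eigs]
print(all(in_some_disc))   # True
```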

G.3 ESTIMATION ON RANDOM VARIABLES OF CERTAIN DISTRIBUTIONS

Lemma 8 (Corollary 5.35 of Vershynin (2010)). Let $A$ be an $N_1 \times N_2$ matrix ($N_1 \ge N_2$) whose entries are independent standard normal random variables. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-t^2/2)$, one has
$$\sqrt{N_1} - \sqrt{N_2} - t \le \lambda_{\min}(A) \le \lambda_{\max}(A) \le \sqrt{N_1} + \sqrt{N_2} + t.$$

Lemma 9 (stated on pp. 1325 of Laurent & Massart (2000)). Let $U$ follow a $\chi^2$ distribution with $D$ degrees of freedom. For any positive $x$, we have
$$\Pr\left\{U - D \ge 2\sqrt{Dx} + 2x\right\} \le e^{-x}, \qquad \Pr\left\{D - U \ge 2\sqrt{Dx}\right\} \le e^{-x}.$$

Lemma 10 (Lemma 31 of Ju et al. (2020)). Let $u_1, u_2, \cdots, u_k$ and $v_1, v_2, \cdots, v_k$ denote $2k$ random variables that follow the i.i.d. standard normal distribution. For any $a > 0$, we must have
$$\Pr\left\{ \sum_{i=1}^k u_i v_i > \frac{ka}{2} \right\} \le 2\exp\left(-\frac{k}{2}\left(at + \ln\frac{2t}{a}\right)\right), \text{ where } t = \frac{-1 + \sqrt{1+a^2}}{a}.$$

Lemma 11. For any $k \ge 16$, let $u_1, u_2, \cdots, u_k$ and $v_1, v_2, \cdots, v_k$ denote $2k$ random variables that follow the i.i.d. standard normal distribution. For any $q \le k$ and $c \ge 1$, we must have
$$\Pr\left\{ \sum_{i=1}^k u_i v_i > c\sqrt{k \ln q} \right\} \le \frac{2}{e^{0.4 c \ln q}}.$$
Further, by letting $c = 1$ and $q = k \ge 16$, we have $\Pr\left\{\sum_{i=1}^k u_i v_i > \sqrt{k \ln k}\right\} \le \frac{2}{k^{0.4}}$. By letting $q = e$, we have $\Pr\left\{\sum_{i=1}^k u_i v_i > c\sqrt{k}\right\} \le \frac{2}{e^{0.4c}}$.

Proof. Recalling the definition of $t$ in Lemma 10, we first prove that $at + \ln\frac{2t}{a} \ge \frac{a^2}{2(\sqrt{1+a^2}+1)}$. To that end, we have
$$at + \ln\frac{2t}{a} = -1 + \sqrt{1+a^2} + \ln\frac{2(-1+\sqrt{1+a^2})}{a^2} \text{ (by the definition of } t \text{ in Lemma 10)}$$
$$\ge -1 + \sqrt{1+a^2} + 1 - \frac{a^2}{2(-1+\sqrt{1+a^2})} \text{ (by Lemma 2)} = \frac{a^2}{1+\sqrt{1+a^2}} + 1 - \frac{1+\sqrt{1+a^2}}{2} \text{ (since } a^2 = (-1+\sqrt{1+a^2})(1+\sqrt{1+a^2})\text{)} = \frac{a^2}{2(\sqrt{1+a^2}+1)}. \quad (19)$$
Now we let $a = 2c\sqrt{\frac{\ln q}{k}}$. Thus, we have
$$\frac{ka}{2} = c\sqrt{k \ln q}. \quad (20)$$
We have
$$a = 2c\sqrt{\frac{\ln q}{k}} \le 2c\sqrt{\frac{\ln k}{k}} \text{ (since } q \le k\text{)} \le c \text{ (by Lemma 3)}. \quad (21)$$
Because $c \ge 1$, we have
$$1 + \sqrt{1+c^2} \le c + \sqrt{c^2 + c^2} = c(1+\sqrt{2}). \quad (22)$$
We thus have
$$at + \ln\frac{2t}{a} \ge \frac{a^2}{2(\sqrt{1+a^2}+1)} \text{ (by Eq. (19))} \ge \frac{a^2}{2(1+\sqrt{1+c^2})} \text{ (by Eq. (21))} \ge \frac{a^2}{2c(1+\sqrt{2})} \text{ (by Eq. (22))} \ge \frac{4c}{5}\cdot\frac{\ln q}{k} \text{ (since } a = 2c\sqrt{\tfrac{\ln q}{k}} \text{ and } \sqrt{2} \approx 1.414 \le \tfrac{3}{2}\text{)}. \quad (23)$$
Thus, we have
$$2\exp\left(-\frac{k}{2}\left(at + \ln\frac{2t}{a}\right)\right) \le 2\exp\left(-\frac{k}{2}\cdot\frac{4c}{5}\cdot\frac{\ln q}{k}\right) \text{ (by Eq. (23))} = 2\exp(-0.4 c \ln q). \quad (24)$$
Plugging Eq. (20) and Eq. (24) into Lemma 10, the result of this lemma thus follows.

Lemma 12 (Isserlis' theorem (Michalowicz et al., 2009)). If $(x_1, x_2, \cdots, x_n)$ is a zero-mean multivariate normal random vector, then
$$\mathbb{E}[x_1 x_2 \cdots x_n] = \sum_{A \in \mathcal{A}_n^2} \prod_{(i,j)\in A} \mathbb{E}[x_i x_j],$$
where $A$ denotes a partition of $\{1, 2, \cdots, n\}$ into pairs, and $\mathcal{A}_n^2$ denotes the set of all such partitions. For example, $\mathbb{E}[x_1x_2x_3x_4] = \mathbb{E}[x_1x_2]\,\mathbb{E}[x_3x_4] + \mathbb{E}[x_1x_3]\,\mathbb{E}[x_2x_4] + \mathbb{E}[x_1x_4]\,\mathbb{E}[x_2x_3]$.

The following lemma is mentioned in Bernacchia (2021) without a detailed proof. For the ease of readers, we provide a detailed proof of this lemma here.

Lemma 13. Consider a random matrix $X \in \mathbb{R}^{p\times n}$ whose elements follow the i.i.d. standard Gaussian distribution (i.e., i.i.d. $\mathcal{N}(0,1)$). We must have
$$\mathbb{E}[X^T X] = p I_n, \qquad \mathbb{E}[X X^T] = n I_p, \qquad \mathbb{E}[X X^T X X^T] = n(n+p+1) I_p.$$

Proof. Since the rows of $X$ are i.i.d., we immediately have $\mathbb{E}[XX^T] = nI_p$ and $\mathbb{E}[X^TX] = pI_n$. It remains to prove $\mathbb{E}[XX^TXX^T] = n(n+p+1)I_p$. To that end, we have
$$[XX^T]_{i,j} = X_i X_j^T = \sum_{k=1}^n X_{i,k} X_{j,k}, \quad (25)$$
where $X_i$ denotes the $i$-th row of $X$, and $[\cdot]_{i,j}$ denotes the element in the $i$-th row and $j$-th column of the matrix. Thus, we have
$$[XX^TXX^T]_{i,j} = \sum_{k=1}^p [XX^T]_{i,k}[XX^T]_{k,j} = \sum_{k=1}^p \left(\sum_{l=1}^n X_{i,l}X_{k,l}\right)\left(\sum_{l'=1}^n X_{j,l'}X_{k,l'}\right) \text{ (by Eq. (25))}. \quad (26)$$
Now we examine the value of $\mathbb{E}[X_{i,l}X_{k,l}X_{j,l'}X_{k,l'}]$ by Isserlis' theorem (Lemma 12):
$$\mathbb{E}[X_{i,l}X_{k,l}X_{j,l'}X_{k,l'}] = \begin{cases} 0, & \text{when } i \ne j,\\ 0, & \text{when } i = j,\ k \ne i,\ l \ne l',\\ 1, & \text{when } i = j,\ k \ne i,\ l = l' \text{ (there are } n(p-1) \text{ such terms for each } i\text{)},\\ 1, & \text{when } i = j = k,\ l \ne l' \text{ (there are } n(n-1) \text{ such terms for each } i\text{)},\\ 3, & \text{when } i = j = k,\ l = l' \text{ (there are } n \text{ such terms for each } i\text{)}. \end{cases}$$
By Eq. (26), we thus have $\mathbb{E}[XX^TXX^T] = (n(p-1) + n(n-1) + 3n)I_p = n(n+p+1)I_p$. The result of this lemma thus follows.

Lemma 14 (Lemma 24 of Ju et al. (2020)). Considering a standard Gaussian random variable $a \sim \mathcal{N}(0,1)$, when $t \ge 0$ we have
$$\frac{\sqrt{2/\pi}\; e^{-t^2/2}}{t + \sqrt{t^2+4}} \le \Pr\{a \ge t\} \le \frac{\sqrt{2/\pi}\; e^{-t^2/2}}{t + \sqrt{t^2 + \frac{8}{\pi}}}.$$
Notice that $t + \sqrt{t^2 + \frac{8}{\pi}} \ge 2\sqrt{2/\pi}$ when $t \ge 0$. We thus have $\Pr\{a \ge t\} \le \frac{1}{2}e^{-t^2/2}$. By the symmetry of the standard Gaussian distribution, we thus have $\Pr\{|a| \ge t\} \le e^{-t^2/2}$.
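The fourth-moment identity of Lemma 13, which drives the $n(n+p+1)$ factor in Lemma 1, can be checked by Monte Carlo simulation (hypothetical small dimensions and trial count):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, trials = 4, 6, 200000
Xs = rng.standard_normal((trials, p, n))
Ms = Xs @ np.swapaxes(Xs, 1, 2)          # each slice is X X^T
est = (Ms @ Ms).mean(axis=0)             # Monte Carlo estimate of E[X X^T X X^T]

target = n * (n + p + 1) * np.eye(p)     # Lemma 13: n(n+p+1) I_p = 66 I_4 here
print(np.max(np.abs(est - target)))      # small relative to 66
```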

H PROOF OF LEMMA 1

Proof. Similar to Eq. (15), we have
$$\hat{w}_r = \left(I_p - \frac{\alpha_r}{n_r} X_r X_r^T\right)\hat{w} + \frac{\alpha_r}{n_r} X_r y_r. \quad (27)$$
By Eq. (27), we can express the learned result for the test task as
$$\hat{w}_r = \left(I_p - \frac{\alpha_r}{n_r} X_r X_r^T\right)\hat{w} + \frac{\alpha_r}{n_r} X_r \left(X_r^T w_r + \epsilon_r\right) \text{ (by Eq. (2))}. \quad (28)$$
Considering a new input $x$ for the test task, the ground truth is $x^T w_r$, and the distance between the ground truth and the output of the learned model is $|x^T w_r - x^T \hat{w}_r|$. Taking the expectation of the square of this distance over $x$, we have
$$\mathbb{E}_x \left|x^T w_r - x^T \hat{w}_r\right|^2 = \mathbb{E}_x \left|x^T (w_r - \hat{w}_r)\right|^2 = (w_r - \hat{w}_r)^T\, \mathbb{E}_x[xx^T]\,(w_r - \hat{w}_r) = \|w_r - \hat{w}_r\|_2^2 \text{ (since } \mathbb{E}_x[xx^T] = I_p \text{ by Assumption 1)}.$$
Notice that
$$\hat{w}_r - w_r = \left(I_p - \frac{\alpha_r}{n_r} X_r X_r^T\right)\hat{w} + \left(\frac{\alpha_r}{n_r} X_r X_r^T - I_p\right)w_r + \frac{\alpha_r}{n_r} X_r \epsilon_r \text{ (by Eq. (28))} = \left(I_p - \frac{\alpha_r}{n_r} X_r X_r^T\right)(\hat{w} - w_r) + \frac{\alpha_r}{n_r} X_r \epsilon_r.$$
Then, taking the expectation over $X_r$ and $\epsilon_r$, we have
$$\mathbb{E}_{X_r, \epsilon_r} \|w_r - \hat{w}_r\|_2^2 = \mathbb{E}_{X_r} \left\|\left(I_p - \frac{\alpha_r}{n_r} X_r X_r^T\right)(\hat{w} - w_r)\right\|_2^2 + \mathbb{E}_{X_r, \epsilon_r} \left\|\frac{\alpha_r}{n_r} X_r \epsilon_r\right\|_2^2 \text{ (since } \epsilon_r \text{ is independent of the other random variables)}$$
$$= (\hat{w} - w_r)^T\, \mathbb{E}_{X_r}\left[\left(I_p - \frac{\alpha_r}{n_r} X_r X_r^T\right)^T\left(I_p - \frac{\alpha_r}{n_r} X_r X_r^T\right)\right] (\hat{w} - w_r) + \frac{\alpha_r^2}{n_r^2}\, \mathbb{E}_{\epsilon_r}\left[\epsilon_r^T\, \mathbb{E}_{X_r}[X_r^T X_r]\, \epsilon_r\right]$$
$$= \left(1 - 2\alpha_r + \frac{\alpha_r^2}{n_r}(n_r + p + 1)\right)\|\hat{w} - w_r\|_2^2 + \frac{\alpha_r^2 p}{n_r}\sigma_r^2 \text{ (by Lemma 13)}. \quad (29)$$
Notice that
$$\mathbb{E}_{w_r} \|\hat{w} - w_r\|_2^2 = \mathbb{E}_{w_r} \|\hat{w} - w_0 - (w_r - w_0)\|_2^2 = \|\hat{w} - w_0\|_2^2 + \nu_r^2 \text{ (by Assumption 2, since } w_r - w_0 \text{ has zero mean)}. \quad (30)$$
The result of this lemma thus follows by plugging Eq. (30) into Eq. (29).
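The closed form in Eq. (29) can be checked against a Monte Carlo simulation of the one-step adaptation on the test task (hypothetical dimensions, step size, and noise level; fixed $\hat{w}$ and $w_r$):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_r, alpha_r, sigma_r, trials = 8, 12, 0.05, 0.5, 50000
w_hat = rng.standard_normal(p)
w_r = rng.standard_normal(p)

errs = np.empty(trials)
for t in range(trials):
    Xr = rng.standard_normal((p, n_r))
    eps = sigma_r * rng.standard_normal(n_r)
    # one-step adaptation, Eqs. (27)-(28) with y_r = X_r^T w_r + eps:
    w_ad = w_hat - (alpha_r / n_r) * Xr @ (Xr.T @ w_hat - Xr.T @ w_r - eps)
    errs[t] = np.sum((w_ad - w_r) ** 2)

# Eq. (29): predicted expectation of ||w_r - w_hat_r||^2.
coef = 1 - 2 * alpha_r + alpha_r**2 * (n_r + p + 1) / n_r
pred = coef * np.sum((w_hat - w_r) ** 2) + alpha_r**2 * p * sigma_r**2 / n_r
print(abs(errs.mean() - pred) / pred)   # small relative error
```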

H.1 PROOF OF PROPOSITION 1

Proof. For ease of notation, let $K := \|\hat{w} - w_0\|_2^2 + \nu_r^2$. From Lemma 1, we can see that $f_{\text{test}}(\|\hat{w} - w_0\|_2^2)$ is a quadratic function of $\alpha_r$:
$$f_{\text{test}}\left(\|\hat{w} - w_0\|_2^2\right) = \left(\left(\frac{p+1}{n_r} + 1\right)K + \frac{p\sigma_r^2}{n_r}\right)\alpha_r^2 - 2K\alpha_r + K.$$
Thus, to minimize the test loss given by Lemma 1, we can calculate the optimal choice of $\alpha_r$ as
$$[\alpha_r]_{\text{opti}} = \frac{K}{\left(1 + \frac{p+1}{n_r}\right)K + \frac{p\sigma_r^2}{n_r}}.$$
(Thus, we have $[\alpha_r]_{\text{opti}} = \frac{n_r}{n_r + p + 1}$ when $\sigma_r = 0$.) Plugging $[\alpha_r]_{\text{opti}}$ into $f_{\text{test}}(\|\hat{w} - w_0\|_2^2)$, we thus have
$$f_{\text{test}}\left(\|\hat{w} - w_0\|_2^2\right)\Big|_{\alpha_r = [\alpha_r]_{\text{opti}}} = K - \frac{K^2}{\left(1 + \frac{p+1}{n_r}\right)K + \frac{p\sigma_r^2}{n_r}} = K \cdot \frac{p + 1 + \frac{p\sigma_r^2}{K}}{n_r + p + 1 + \frac{p\sigma_r^2}{K}}. \quad (31)$$
The right-hand side of Eq. (31) increases as $\sigma_r^2$ increases. Therefore, by letting $\sigma_r = 0$ and $\sigma_r \to \infty$, we obtain the lower and upper bounds of Eq. (31), i.e.,
$$f_{\text{test}}\left(\|\hat{w} - w_0\|_2^2\right)\Big|_{\alpha_r = [\alpha_r]_{\text{opti}}} \in \left[K \cdot \frac{p+1}{n_r + p + 1},\ K\right].$$
The result of this lemma thus follows.
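Since the objective is a simple quadratic in $\alpha_r$, the claimed minimizer and minimum in Eq. (31) are easy to confirm numerically (hypothetical values of $p$, $n_r$, $\sigma_r$, and $K$):

```python
import numpy as np

p, n_r, sigma_r = 20, 50, 0.3     # hypothetical values
K = 1.7                           # stands for ||w_hat - w_0||^2 + nu_r^2

a2 = (1 + (p + 1) / n_r) * K + p * sigma_r**2 / n_r   # quadratic coefficient
def f(alpha):
    return a2 * alpha**2 - 2 * K * alpha + K

alpha_opt = K / a2                # claimed minimizer [alpha_r]_opti
alphas = np.linspace(0, 1, 1000001)
print(abs(alphas[np.argmin(f(alphas))] - alpha_opt))  # ~0

# Minimum value matches Eq. (31):
fmin = K * (p + 1 + p * sigma_r**2 / K) / (n_r + p + 1 + p * sigma_r**2 / K)
print(abs(f(alpha_opt) - fmin))   # ~0
```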

I PROOF OF THEOREM 1

We summarize the definitions of the quantities related to $b_w$ as follows:
$$b_w := b_{w_0} + b_w^{\text{ideal}}, \quad (32)$$
where
$$\alpha'_t := \frac{\alpha_t}{n_t}\left(\sqrt{p} + \sqrt{n_t} + \ln\sqrt{n_t}\right)^2,$$
$$b_{\text{eig,min}} := p + \left(\max\{0, 1-\alpha'_t\}^2 - 1\right)n_t - \left((n_v+1)\max\{\alpha'_t, 1-\alpha'_t\}^2 + 6mn_v\right)\sqrt{p\ln p},$$
$$c_{\text{eig,min}} := \max\{0, 1-\alpha'_t\}^2\, p - 2mn_v\max\{\alpha'_t, 1-\alpha'_t\}^2\sqrt{p\ln p},$$
$$D := \max\left\{\left(1 - \frac{\alpha_t\left(n_t + 2\sqrt{n_t\ln(sn_t)} + 2\ln(sn_t)\right)}{n_t}\right)^2,\ \left(1 - \frac{\alpha_t\left(n_t - 2\sqrt{n_t\ln(sn_t)}\right)}{n_t}\right)^2\right\},$$
$$b_\delta := mn_v\sigma^2\left(1 + \frac{\alpha_t^2 p(\ln n_t)^2\ln p}{n_t}\right) + mn_v\nu^2\left(2\ln(sn_t)\cdot D + \frac{\alpha_t^2(p-1)}{n_t}\, 6.25(\ln(spn_t))^2\right),$$
$$b_{w_0} := \frac{(p - mn_v) + 2\sqrt{(p - mn_v)\ln p} + 2\ln p}{p - 2\sqrt{p\ln p}}\, \|w_0\|_2^2.$$

Proof. Define the event in Proposition 2 as $A_{\text{target}} := \left\{\mathbb{E}_{w_{(1:m)},\epsilon_{t(1:m)},\epsilon_{v(1:m)}} \|\hat{w}_{\text{ideal}} - w_0\|_2^2 \le b_w^{\text{ideal}}\right\}$. Define the event in Proposition 3 as $A_{\text{target,3}} := \{\text{Term 1 of Eq. (12)} \le b_{w_0}\}$. By the definition of $b_w$ in Eq. (32) and the definitions of $A_{\text{target,3}}$ and $A_{\text{target}}$, we have $A_{\text{target,3}} \cap A_{\text{target}} \implies \|\hat{w}_{\ell_2} - w_0\|_2^2 \le b_w$. Thus, we have
$$\Pr\left\{\|\hat{w}_{\ell_2} - w_0\|_2^2 \le b_w\right\} \ge \Pr\{A_{\text{target,3}} \cap A_{\text{target}}\} = 1 - \Pr\{A_{\text{target,3}}^c \cup A_{\text{target}}^c\} \ge 1 - \Pr[A_{\text{target,3}}^c] - \Pr[A_{\text{target}}^c] \text{ (by the union bound)}$$
$$\ge 1 - \frac{2}{p} - \frac{26m^2n_v^2}{\min\{n_t, p\}^{0.4}} \text{ (by Proposition 3 and Proposition 2)} \ge 1 - \frac{27m^2n_v^2}{\min\{n_t, p\}^{0.4}}. \quad (33)$$
The last inequality is because $\frac{2}{p} \le \frac{1}{p^{0.4}}$, since $p^{0.6} \ge 256^{0.6} \approx 27.86 \ge 2$. The result of this theorem thus follows.

J PROOF OF PROPOSITION 3

Define P := B^T(BB^T)^{-1}B. In this section, we focus on estimating ∥(I_p - P)w_0∥_2^2. Since P^2 = P, we know that P is a projection from R^p onto the subspace spanned by the columns of B^T (i.e., the rows of B). The following Lemma 15 shows that the subspace spanned by the columns of B^T has rotational symmetry. We then use Lemma 16 to show how this rotational symmetry helps in estimating the expected value of the squared norm of the projected vector, and we prove Proposition 3 by combining Lemma 15 and Lemma 16. Lemma 15. The subspace spanned by the columns of B^T has rotational symmetry (with respect to the randomness of X_t(1:m) and X_v(1:m)). Specifically, for any rotation S ∈ SO(p), where SO(p) ⊆ R^{p×p} denotes the set of all rotations in p-dimensional space, the rotated random matrix SB^T shares the same probability distribution as the original B^T.

Proof. Notice that for any $i = 1, 2, \cdots, m$,
$$X_{v(i)}^T\left(I_p - \frac{\alpha_t}{n_t}X_{t(i)}X_{t(i)}^T\right)S^T = X_{v(i)}^T S^T - \frac{\alpha_t}{n_t}X_{v(i)}^T S^T S X_{t(i)} X_{t(i)}^T S^T \text{ (since } S^{-1} = S^T \text{ because } S \text{ is a rotation)} = \left(SX_{v(i)}\right)^T\left(I_p - \frac{\alpha_t}{n_t}\left(SX_{t(i)}\right)\left(SX_{t(i)}\right)^T\right).$$
We thus have
$$SB^T = \begin{bmatrix} X_{v(1)}^T\left(I_p - \frac{\alpha_t}{n_t}X_{t(1)}X_{t(1)}^T\right)S^T \\ \vdots \\ X_{v(m)}^T\left(I_p - \frac{\alpha_t}{n_t}X_{t(m)}X_{t(m)}^T\right)S^T \end{bmatrix}^T = \begin{bmatrix} \left(SX_{v(1)}\right)^T\left(I_p - \frac{\alpha_t}{n_t}\left(SX_{t(1)}\right)\left(SX_{t(1)}\right)^T\right) \\ \vdots \\ \left(SX_{v(m)}\right)^T\left(I_p - \frac{\alpha_t}{n_t}\left(SX_{t(m)}\right)\left(SX_{t(m)}\right)^T\right) \end{bmatrix}^T. \quad (34)$$
Because of the rotational symmetry of the Gaussian distribution, the rotated random matrices $SX_{v(i)}$ and $SX_{t(i)}$ have the same probability distribution as the original random matrices $X_{v(i)}$ and $X_{t(i)}$, respectively. Therefore, by Eq. (34), we conclude that $SB^T$ has the same probability distribution as $B^T$. The result of this lemma thus follows.

The following lemma shows how to use rotational symmetry to calculate the expected value of the squared norm of the projected vector. Such a result has also been used in the literature (e.g., Belkin et al. (2020)).

Lemma 16. Consider any random projection $P_0 \in \mathbb{R}^{p\times p}$ onto a $k$-dimensional subspace whose distribution has rotational symmetry. Then, for any given $v \in \mathbb{R}^{p\times 1}$, we must have $\mathbb{E}_{P_0}\|P_0 v\|_2^2 = \frac{k}{p}\|v\|_2^2$.

Proof. Since the subspace has rotational symmetry, to calculate the expected value we can fix a subspace and integrate over all rotations. Specifically, consider any fixed projection $A$ onto a $k$-dimensional subspace spanned by a set of orthogonal vectors $a_1, a_2, \cdots, a_k \in \mathbb{R}^p$. After projecting $v$ with $A$, the squared norm of the projected vector equals $\|Av\|_2^2 = \sum_{i=1}^k \langle a_i, v\rangle^2$.
Noticing that applying a rotation in the projected space of $A$ is equivalent to applying the rotation to $a_1, a_2, \cdots, a_k$, we then have
$$\mathbb{E}_{P_0}\|P_0 v\|_2^2 = \int_{SO(p)} \sum_{i=1}^k \langle Sa_i, v\rangle^2\, dS = \int_{SO(p)} \sum_{i=1}^k \langle S^{-1}Sa_i, S^{-1}v\rangle^2\, dS \text{ (since applying the rotation } S^{-1} \text{ to both vectors does not change the inner product)}$$
$$= \int_{SO(p)} \sum_{i=1}^k \langle a_i, S^{-1}v\rangle^2\, dS = \int_{SO(p)} \sum_{i=1}^k \langle a_i, Sv\rangle^2\, dS = \frac{\|v\|_2^2}{\left|\mathcal{S}^{p-1}\right|}\int_{\mathcal{S}^{p-1}} \sum_{i=1}^k \langle a_i, \tilde{v}\rangle^2\, d\tilde{v} = \frac{\|v\|_2^2}{\left|\mathcal{S}^{p-1}\right|}\sum_{i=1}^k \int_{\mathcal{S}^{p-1}} \langle a_i, \tilde{v}\rangle^2\, d\tilde{v}, \quad (35)$$
where $\mathcal{S}^{p-1}$ denotes the unit sphere in $\mathbb{R}^p$ and $|\mathcal{S}^{p-1}|$ denotes its area. Since $A$ can be any fixed projection onto a $k$-dimensional subspace, without loss of generality we can let $a_i$ be the $i$-th standard basis vector of $\mathbb{R}^p$ (i.e., only the $i$-th element is nonzero and equals 1). (Note that there are $p$ standard basis vectors, although $A$'s subspace is spanned by only the first $k$ of them.) In this situation, we have
$$\int_{\mathcal{S}^{p-1}} \langle a_i, \tilde{v}\rangle^2\, d\tilde{v} = \int_{\mathcal{S}^{p-1}} \langle a_j, \tilde{v}\rangle^2\, d\tilde{v} \text{ for all } i, j \in \{1, 2, \cdots, p\},$$
and
$$\sum_{i=1}^p \int_{\mathcal{S}^{p-1}} \langle a_i, \tilde{v}\rangle^2\, d\tilde{v} = \int_{\mathcal{S}^{p-1}} \sum_{i=1}^p \langle a_i, \tilde{v}\rangle^2\, d\tilde{v} = \int_{\mathcal{S}^{p-1}} \|\tilde{v}\|_2^2\, d\tilde{v} = \left|\mathcal{S}^{p-1}\right|.$$
Therefore, we have $\int_{\mathcal{S}^{p-1}} \langle a_i, \tilde{v}\rangle^2\, d\tilde{v} = \frac{1}{p}\left|\mathcal{S}^{p-1}\right|$. By Eq. (35), we thus have $\mathbb{E}_{P_0}\|P_0 v\|_2^2 = \frac{k}{p}\|v\|_2^2$. The result of this lemma thus follows.

Now we are ready to prove Proposition 3.

Proof of Proposition 3. Define $\theta := \arccos\frac{\langle w_0, Pw_0\rangle}{\|w_0\|_2 \|Pw_0\|_2} \in [0, \pi/2]$. Thus, we have
$$\|(I - P)w_0\|_2^2 = \sin^2\theta \cdot \|w_0\|_2^2. \quad (36)$$
By Lemma 15, we know that the distribution of the hyper-plane spanned by the rows of $B$ (an $mn_v$-dimensional subspace) has rotational symmetry. Therefore, $\theta$ follows the same distribution as the angle between a uniformly distributed random vector $a \in \mathbb{R}^p$ and a fixed $mn_v$-dimensional hyper-plane. To characterize the distribution of $\theta$ (or, equivalently, of $\sin\theta$), without loss of generality, we let $a \sim \mathcal{N}(0, I_p)$ and let the hyper-plane be spanned by the first $mn_v$ standard basis vectors of $\mathbb{R}^p$. Thus, we have
$$\cos^2\theta \sim \frac{\left\|a_{[1:mn_v]}\right\|_2^2}{\|a\|_2^2}, \qquad \sin^2\theta \sim \frac{\left\|a_{[mn_v+1:p]}\right\|_2^2}{\|a\|_2^2}. \quad (37)$$
Notice that $\|a_{[mn_v+1:p]}\|_2^2$ and $\|a\|_2^2$ follow $\chi^2$ distributions with $p - mn_v$ and $p$ degrees of freedom, respectively. By Lemma 9 with $x = \ln p$, we have
$$\Pr\left\{\|a\|_2^2 \le p - 2\sqrt{p\ln p}\right\} \le \frac{1}{p}, \quad (38)$$
$$\Pr\left\{\|a\|_2^2 \ge p + 2\sqrt{p\ln p} + 2\ln p\right\} \le \frac{1}{p}, \quad (39)$$
$$\Pr\left\{\left\|a_{[mn_v+1:p]}\right\|_2^2 \ge (p - mn_v) + 2\sqrt{(p - mn_v)\ln p} + 2\ln p\right\} \le \frac{1}{p}, \quad (40)$$
$$\Pr\left\{\left\|a_{[mn_v+1:p]}\right\|_2^2 \le (p - mn_v) - 2\sqrt{(p - mn_v)\ln p}\right\} \le \frac{1}{p}. \quad (41)$$
Because $p \ge 16$, by Lemma 3 we have
$$2\sqrt{\frac{\ln p}{p}} < 1 \implies \frac{2\sqrt{p\ln p}}{p} < 1 \implies p - 2\sqrt{p\ln p} > 0. \quad (42)$$
We define $A_{\text{target,3}} := \{\text{Term 1 of Eq. (12)} \le b_{w_0}\}$ and $\tilde{A}_{\text{target,3}} := \{\text{Term 1 of Eq. (12)} \ge \tilde{b}_{w_0}\}$. By Eq. (38), Eq. (40), and the union bound, with probability at least $1 - \frac{2}{p}$ we have
$$\sin^2\theta \le \frac{(p - mn_v) + 2\sqrt{(p - mn_v)\ln p} + 2\ln p}{p - 2\sqrt{p\ln p}}. \quad (43)$$
By Eq. (36) and Eq. (43), we thus have $\Pr[A_{\text{target,3}}] \ge 1 - \frac{2}{p}$. Similarly, using Eq. (39) and Eq. (41) (also by the union bound), we can prove $\Pr[\tilde{A}_{\text{target,3}}] \ge 1 - \frac{2}{p}$. By Lemma 16, we have $\mathbb{E}[\|Pw_0\|_2^2] = \frac{mn_v}{p}\|w_0\|_2^2$. Thus, we have $\mathbb{E}[\text{Term 1 of Eq. (12)}] = \frac{p - mn_v}{p}\|w_0\|_2^2$. The result of this proposition thus follows.
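The projection identity from Lemma 16 used at the end of this proof can be checked by Monte Carlo simulation. The sketch below (hypothetical $p$, $k$, and trial count) draws rotationally symmetric random $k$-dimensional subspaces as the column span of a Gaussian matrix and confirms $\mathbb{E}\|P_0 v\|_2^2 = \frac{k}{p}\|v\|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
p, k, trials = 40, 7, 20000
v = rng.standard_normal(p)

acc = 0.0
for _ in range(trials):
    # Random k-dim subspace with rotational symmetry: span of a Gaussian matrix.
    G = rng.standard_normal((p, k))
    Q, _ = np.linalg.qr(G)            # orthonormal basis of the subspace
    acc += np.sum((Q @ (Q.T @ v)) ** 2)   # ||P_0 v||^2

est = acc / trials
target = k / p * np.sum(v ** 2)       # Lemma 16: (k/p) ||v||^2
print(est, target)                    # the two should be close
```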

K PROOF OF PROPOSITION 2

Proof. Define the event in Proposition 2 as $A_{\text{target}} := \left\{\mathbb{E}_{w_{(1:m)},\epsilon_{t(1:m)},\epsilon_{v(1:m)}} \|\hat{w}_{\text{ideal}} - w_0\|_2^2 \le b_w^{\text{ideal}}\right\}$. Define
$$A_{\text{target,1}} := \left\{\lambda_{\min}(BB^T) \ge b_{\text{eig,min}}\, 1_{\{p > n_t\}} + c_{\text{eig,min}}\, 1_{\{p \le n_t\}}\right\}.$$
Combining Proposition 4 and Proposition 5 with the union bound, we have
$$\Pr[A_{\text{target,1}}] \ge 1 - \frac{23m^2n_v^2}{\min\{p, n_t\}^{0.4}}. \quad (44)$$
We adopt the event in Proposition 6 as $A_{\text{target,2}} := \left\{\mathbb{E}_{w_{(1:m)},\epsilon_{t(1:m)},\epsilon_{v(1:m)}} \|\delta\gamma\|_2^2 \le b_\delta\right\}$. By Eq. (14), we have $\bigcap_{i=1}^2 A_{\text{target,}i} \implies A_{\text{target}}$. Thus, we have
$$\Pr[A_{\text{target}}] \ge \Pr\left[\bigcap_{i=1}^2 A_{\text{target,}i}\right] = 1 - \Pr\left[\bigcup_{i=1}^2 A_{\text{target,}i}^c\right] \ge 1 - \sum_{i=1}^2 \Pr[A_{\text{target,}i}^c] \text{ (by the union bound)}$$
$$\ge 1 - \frac{23m^2n_v^2}{\min\{p, n_t\}^{0.4}} - \frac{5mn_v}{n_t} - \frac{2mn_v}{p^{0.4}} \text{ (by Eq. (44) and Proposition 6)} \ge 1 - \frac{26m^2n_v^2}{\min\{p, n_t\}^{0.4}}.$$
The last inequality is because $m \ge 1$, $n_v \ge 1$, and $n_t \ge 256 \implies n_t^{0.6} \ge 256^{0.6} \approx 27.86 \implies \frac{5}{n_t^{0.6}} \le 1$. The result of this proposition thus follows.

L PROOF OF PROPOSITION 4

We prove Proposition 4 by estimating every element of $BB^T$. We split $BB^T$ into $m \times m$ blocks (each block is of size $n_v \times n_v$). Recalling the definition of $B$ in Eq. (7), we identify three types of elements in $BB^T$: diagonal elements (type 1), off-diagonal elements of diagonal blocks (type 2), and the remaining elements (type 3). Fig. 4 illustrates these three types of elements when $m = 2$ and $n_v = 3$: there are 4 (i.e., $m^2$) blocks (divided by the dashed red lines), and each block is of size $3 \times 3$ (i.e., $n_v \times n_v$). In the rest of this part, we first define and estimate each type of element separately in Appendices L.1, L.2, and L.3. Using these results, we then estimate the eigenvalues of $BB^T$ and finish the proof of Proposition 4.

TYPE 1: DIAGONAL ELEMENTS

Type 1 elements are the diagonal elements of $BB^T$, which can be written as
$$[BB^T]_{(i-1)n_v+j,\ (i-1)n_v+j} = [X_{v(i)}]_j^T \left(I_p - \frac{\alpha_t}{n_t}X_{t(i)}X_{t(i)}^T\right)^T\left(I_p - \frac{\alpha_t}{n_t}X_{t(i)}X_{t(i)}^T\right)[X_{v(i)}]_j, \quad (45)$$
where $i \in \{1, 2, \cdots, m\}$ corresponds to the $i$-th training task, and $j \in \{1, 2, \cdots, n_v\}$ corresponds to the input vector of the $j$-th validation sample (of the $i$-th training task). To estimate Eq. (45), we have the following lemma.

Lemma 17. For any $i \in \{1, 2, \cdots, m\}$ and any $j \in \{1, 2, \cdots, n_v\}$, when $p \ge n_t \ge 256$, we must have
$$\Pr\left\{[BB^T]_{(i-1)n_v+j,\ (i-1)n_v+j} \in \left[\underline{b}_1, \overline{b}_1\right]\right\} \ge 1 - \frac{5}{\sqrt{n_t}},$$
where
$$\underline{b}_1 := \max\{0, 1-\alpha'_t\}^2\left(n_t - \sqrt{2n_t\ln n_t}\right) + p - n_t - \sqrt{2(p-n_t)\ln n_t},$$
$$\overline{b}_1 := \max\{\alpha'_t, 1-\alpha'_t\}^2\left(n_t + \sqrt{2n_t\ln n_t} + \ln n_t\right) + p - n_t + \sqrt{2(p-n_t)\ln n_t} + \ln n_t.$$
Proof. See Appendix L.1.

TYPE 2: OFF-DIAGONAL ELEMENTS OF DIAGONAL BLOCKS

Type 2 elements are the off-diagonal elements of diagonal blocks. Similar to Eq. (45), type 2 elements can be written as
$$[BB^T]_{(i-1)n_v+j,\ (i-1)n_v+k} = [X_{v(i)}]_j^T\left(I_p - \frac{\alpha_t}{n_t}X_{t(i)}X_{t(i)}^T\right)^T\left(I_p - \frac{\alpha_t}{n_t}X_{t(i)}X_{t(i)}^T\right)[X_{v(i)}]_k, \quad (46)$$
where $j \ne k$. We have the following lemma.

Lemma 18. For any $i \in \{1, 2, \cdots, m\}$ and any $j, k \in \{1, 2, \cdots, n_v\}$ with $j \ne k$, when $p \ge n_t \ge 256$, we must have
$$\Pr\left\{\left|[BB^T]_{(i-1)n_v+j,\ (i-1)n_v+k}\right| \le b_2\right\} \ge 1 - \frac{5}{n_t^{0.4}}, \text{ where } b_2 := \max\{\alpha'_t, 1-\alpha'_t\}^2\sqrt{n_t\ln n_t} + \sqrt{p\ln p}.$$
Proof. See Appendix L.2.

TYPE 3: OTHER ELEMENTS (ELEMENTS OF OFF-DIAGONAL BLOCKS)

Type 3 elements are the remaining elements, i.e., those of off-diagonal blocks. Similar to Eq. (45) and Eq. (46), type 3 elements can be written as
$$[BB^T]_{(i-1)n_v+j,\ (l-1)n_v+k} = [X_{v(i)}]_j^T\left(I_p - \frac{\alpha_t}{n_t}X_{t(i)}X_{t(i)}^T\right)^T\left(I_p - \frac{\alpha_t}{n_t}X_{t(l)}X_{t(l)}^T\right)[X_{v(l)}]_k = \left\langle \left(I_p - \frac{\alpha_t}{n_t}X_{t(i)}X_{t(i)}^T\right)[X_{v(i)}]_j,\ \left(I_p - \frac{\alpha_t}{n_t}X_{t(l)}X_{t(l)}^T\right)[X_{v(l)}]_k\right\rangle, \quad (47)$$
where $i \ne l$. We have the following lemma.

Lemma 19. For any $i, l \in \{1, 2, \cdots, m\}$ with $i \ne l$ and any $j, k \in \{1, 2, \cdots, n_v\}$, when $p \ge n_t \ge 256$, we must have
$$\Pr\left\{\left|[BB^T]_{(i-1)n_v+j,\ (l-1)n_v+k}\right| \le b_3\right\} \ge 1 - \frac{13}{n_t^{0.4}}, \text{ where } b_3 := 6\sqrt{p\ln p}.$$
Proof. See Appendix L.3.

Now we are ready to prove Proposition 4.

Proof of Proposition 4. Define the following events:
$$\mathcal{A}_1 := \left\{\text{all type 1 elements of } BB^T \text{ are in } \left[\underline{b}_1, \overline{b}_1\right]\right\},$$
$$\mathcal{A}_2 := \left\{\text{all type 2 elements of } BB^T \text{ are in } [-b_2, b_2]\right\},$$
$$\mathcal{A}_3 := \left\{\text{all type 3 elements of } BB^T \text{ are in } [-b_3, b_3]\right\}.$$
Notice that there are $mn_v$ type 1 elements, $mn_v(n_v-1)$ type 2 elements, and $m(m-1)n_v^2$ type 3 elements. By the union bound and Lemmas 17, 18, and 19, we have
$$\Pr\{\mathcal{A}_1^c\} \le \frac{5}{\sqrt{n_t}}\cdot mn_v \le \frac{5}{n_t^{0.4}}\cdot mn_v, \quad (48)$$
$$\Pr\{\mathcal{A}_2^c\} \le \frac{5}{n_t^{0.4}}\cdot mn_v(n_v-1), \quad (49)$$
$$\Pr\{\mathcal{A}_3^c\} \le \frac{13}{n_t^{0.4}}\cdot m(m-1)n_v^2. \quad (50)$$
It remains to estimate the probability of $A^{(p\ge n_t)}_{\text{target,1}}$. To that end, we have
$$\Pr\left[A^{(p\ge n_t)}_{\text{target,1}}\right] \ge \Pr\left[\bigcap_{i=1}^3 \mathcal{A}_i\right] \text{ (since } \bigcap_{i=1}^3 \mathcal{A}_i \implies A^{(p\ge n_t)}_{\text{target,1}}\text{)} = 1 - \Pr\left[\bigcup_{i=1}^3 \mathcal{A}_i^c\right] \ge 1 - \sum_{i=1}^3 \Pr\{\mathcal{A}_i^c\} \text{ (by the union bound)} \ge 1 - \frac{23m^2n_v^2}{n_t^{0.4}} \text{ (by Eqs. (48)-(50))}.$$
The result of this proposition thus follows.

L.1 PROOF OF LEMMA 17

Since all training inputs are independent of each other, without loss of generality we let $i = 1$ and replace $[X_{v(i)}]_j$ by a random vector $a \sim \mathcal{N}(0, I_p) \in \mathbb{R}^{p\times 1}$ that is independent of $X_{t(1)}$. In other words, estimating Eq. (45) is equivalent to estimating $\left\|\left(I_p - \frac{\alpha_t}{n_t}X_{t(1)}X_{t(1)}^T\right)a\right\|_2^2$. We further introduce some extra notation as follows.
Since $p \ge n_t$, we can denote the singular values of $X_{t(1)} \in \mathbb{R}^{p\times n_t}$ as
$$0 \le \lambda^{t(1)}_1 \le \lambda^{t(1)}_2 \le \cdots \le \lambda^{t(1)}_{n_t}, \quad (53)$$
and define $\Lambda_{t(1)} := \mathrm{diag}\left(\lambda^{t(1)}_1, \lambda^{t(1)}_2, \cdots, \lambda^{t(1)}_{n_t}\right) \in \mathbb{R}^{n_t\times n_t}$. Doing the singular value decomposition of $X_{t(1)}$, we have
$$X_{t(1)} = U_{t(1)} D_{t(1)} V_{t(1)}^T. \quad (54)$$
Notice that $U_{t(1)} \in \mathbb{R}^{p\times p}$ is an orthogonal matrix, $D_{t(1)} = \begin{bmatrix}\Lambda_{t(1)} \\ 0\end{bmatrix} \in \mathbb{R}^{p\times n_t}$, and $V_{t(1)} \in \mathbb{R}^{n_t\times n_t}$ is an orthogonal matrix. Using these notations, we thus have
$$I_p - \frac{\alpha_t}{n_t}X_{t(1)}X_{t(1)}^T = U_{t(1)}\left(I_p - \frac{\alpha_t}{n_t}D_{t(1)}D_{t(1)}^T\right)U_{t(1)}^T \text{ (by Eq. (54))} = U_{t(1)}\begin{bmatrix}I_{n_t} - \frac{\alpha_t}{n_t}\Lambda_{t(1)}^2 & 0\\ 0 & I_{p-n_t}\end{bmatrix}U_{t(1)}^T \text{ (by the definition of } D_{t(1)}\text{)}. \quad (55)$$

The following two lemmas will be useful in the proof of Lemma 17.

Lemma 20. If $x \ge 16$, then
$$\frac{3}{16}x \ge \ln x, \quad (56)$$
$$x + \sqrt{2x\ln x} + \ln x \le 2x, \quad (57)$$
$$x - \sqrt{2x\ln x} \ge \frac{x}{3}, \quad (58)$$
$$x + \ln x \le \frac{6}{7}\sqrt{2}\, x. \quad (59)$$
Further, if $x \ge 256$, then
$$\frac{1}{45}x \ge \ln x. \quad (60)$$

Proof. We prove each equation in turn.

Proof of Eq. (56): When $x \ge 16$, we have $\frac{\partial\left(\frac{3}{16}x - \ln x\right)}{\partial x} = \frac{3}{16} - \frac{1}{x} > 0$. Thus, $\frac{3}{16}x - \ln x$ is monotone increasing when $x \ge 16$, so we only need to prove $\frac{3}{16}\cdot 16 \ge \ln 16$. Notice that $\ln 16 \approx 2.7726 < 3$. Therefore, Eq. (56) holds.

Proof of Eq. (57): Using Eq. (56), we have $\sqrt{2x\ln x} + \ln x \le \left(\sqrt{\frac{3}{8}} + \frac{3}{16}\right)x \approx 0.7999x \le x$. Eq. (57) thus follows.

Proof of Eq. (58): Taking square roots on both sides of $\frac{3}{16}x \ge \ln x$, we have $\frac{\sqrt{3}}{4}\sqrt{x} \ge \sqrt{\ln x} \implies \sqrt{\frac{3}{8}}\, x \ge \sqrt{2x\ln x} \implies x - \sqrt{2x\ln x} \ge \left(1 - \sqrt{\frac{3}{8}}\right)x \ge \frac{1}{3}x$ (since $1 - \sqrt{3/8} \approx 0.3876 \ge \frac{1}{3}$). Eq. (58) thus follows.

Proof of Eq. (59): We have $x + \ln x \le x + \frac{3}{16}x$ (by Eq. (56)) $= \frac{19}{16}x = \frac{19}{16\sqrt{2}}\cdot\sqrt{2}\,x \le \frac{6}{7}\sqrt{2}\,x$ (since $\frac{19}{16\sqrt{2}} \approx 0.8397 \le \frac{6}{7}$). Eq. (59) thus follows.

Proof of Eq. (60): When $x \ge 256$, we have $\frac{\partial\left(\frac{1}{45}x - \ln x\right)}{\partial x} = \frac{1}{45} - \frac{1}{x} > 0$. Thus, $\frac{1}{45}x - \ln x$ is monotone increasing when $x \ge 256$, so we only need to prove $\frac{1}{45}\cdot 256 \ge \ln 256$. Notice that $256/45 - \ln 256 \approx 0.1437 \ge 0$. Therefore, Eq. (60) holds.
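The elementary inequalities of Lemma 20 are easy to spot-check numerically on a grid (the grid ranges below are arbitrary):

```python
import numpy as np

x = np.linspace(16, 5000, 400000)
checks = [
    bool(np.all(3 / 16 * x >= np.log(x))),                            # Eq. (56)
    bool(np.all(x + np.sqrt(2 * x * np.log(x)) + np.log(x) <= 2 * x)),  # Eq. (57)
    bool(np.all(x - np.sqrt(2 * x * np.log(x)) >= x / 3)),            # Eq. (58)
    bool(np.all(x + np.log(x) <= 6 / 7 * np.sqrt(2) * x)),            # Eq. (59)
]
y = np.linspace(256, 5000, 100000)
checks.append(bool(np.all(y / 45 >= np.log(y))))                      # Eq. (60)
print(all(checks))   # True
```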
Now we are ready to prove Lemma 17.

Proof of Lemma 17. Recalling $U_{t(1)}$ in Eq. (54), we define
\[
a' := U_{t(1)}^T a \in \mathbb{R}^{p\times 1}, \qquad \chi^2_{n_t} := \sum_{i=1}^{n_t} (a'_i)^2, \qquad \chi^2_{p-n_t} := \sum_{i=n_t+1}^{p} (a'_i)^2. \quad (61)
\]
We then have
\[
\Big\| \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) a \Big\|_2^2
= a^T U_{t(1)} \begin{bmatrix} \big(I_{n_t} - \frac{\alpha_t}{n_t}\Lambda_{t(1)}^2\big)^2 & 0 \\ 0 & I_{p-n_t} \end{bmatrix} U_{t(1)}^T a \ (\text{by Eq. (55)})
= a'^T \begin{bmatrix} \big(I_{n_t} - \frac{\alpha_t}{n_t}\Lambda_{t(1)}^2\big)^2 & 0 \\ 0 & I_{p-n_t} \end{bmatrix} a' \ (\text{by Eq. (61)})
\]
\[
= \sum_{i=1}^{n_t} \Big(1 - \frac{\alpha_t}{n_t}\big(\lambda^{t(1)}_i\big)^2\Big)^2 (a'_i)^2 + \sum_{i=n_t+1}^{p} (a'_i)^2
\in \Big[ \min_{j\in\{1,\cdots,n_t\}} \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_j)^2\Big)^2 \chi^2_{n_t} + \chi^2_{p-n_t},\ \max_{j\in\{1,\cdots,n_t\}} \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_j)^2\Big)^2 \chi^2_{n_t} + \chi^2_{p-n_t} \Big]
\]
\[
= \Big[ \max\Big\{0,\ 1 - \frac{\alpha_t}{n_t}\big(\lambda^{t(1)}_{n_t}\big)^2\Big\}^2 \chi^2_{n_t} + \chi^2_{p-n_t},\ \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_{n_t})^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_{1})^2\Big)^2\Big\} \chi^2_{n_t} + \chi^2_{p-n_t} \Big] \ (\text{by Eq. (62) and Eq. (53)}). \quad (63)
\]
Because of the rotational symmetry of the normal distribution of $a$, we know that $\chi^2_{n_t}$ and $\chi^2_{p-n_t}$ follow $\chi^2$ distributions with $n_t$ and $p - n_t$ degrees of freedom, respectively (see footnote 4). We define several events as follows:
\[
\mathcal{A}_1 := \big\{ \chi^2_{n_t} > n_t + 2\sqrt{n_t \ln\sqrt{n_t}} + 2\ln\sqrt{n_t} \big\}, \quad
\mathcal{A}_2 := \big\{ \chi^2_{p-n_t} > p - n_t + 2\sqrt{(p-n_t)\ln\sqrt{n_t}} + 2\ln\sqrt{n_t} \big\}, \quad
\mathcal{A}_3 := \big\{ \chi^2_{p-n_t} < p - n_t - 2\sqrt{(p-n_t)\ln\sqrt{n_t}} \big\},
\]
\[
\mathcal{A}_4 := \big\{ \lambda^{t(1)}_{n_t} > \sqrt{p} + \sqrt{n_t} + \ln\sqrt{n_t} \big\}, \quad
\mathcal{A}_5 := \big\{ \lambda^{t(1)}_1 < \sqrt{p} - \sqrt{n_t} - \ln\sqrt{n_t} \big\}, \quad
\mathcal{A}_6 := \big\{ \chi^2_{n_t} < n_t - 2\sqrt{n_t \ln\sqrt{n_t}} \big\}.
\]
We have
\[
\Pr_{X_{t(1)}, a}\Big\{\bigcap_{i=1}^{6}\mathcal{A}_i^c\Big\}
= 1 - \Pr_{X_{t(1)}, a}\Big\{\bigcup_{i=1}^{6}\mathcal{A}_i\Big\}
\geq 1 - \sum_{i=1}^{3}\Pr_a\{\mathcal{A}_i\} - \Pr_{X_{t(1)}}\{\mathcal{A}_4 \cup \mathcal{A}_5\} - \Pr_a\{\mathcal{A}_6\} \ (\text{by the union bound})
\geq 1 - 4e^{-\ln\sqrt{n_t}} - 2e^{-(\ln\sqrt{n_t})^2/2} \ (\text{by Lemma 9 and Lemma 8})
\]
\[
= 1 - \frac{4}{\sqrt{n_t}} - 2\exp\Big(-\frac{1}{2}\ln\sqrt{n_t}\cdot\ln\sqrt{n_t}\Big)
\geq 1 - \frac{4}{\sqrt{n_t}} - 2\exp\big(-\ln\sqrt{n_t}\big) \ (\text{since } \ln\sqrt{n_t} \geq 2 \text{ when } n_t \geq 256)
= 1 - \frac{6}{\sqrt{n_t}}. \quad (64)
\]
Define the target event
\[
\mathcal{A} := \Big\{ \Big\| \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) a \Big\|_2^2 \in [\underline{b}_1, \overline{b}_1] \Big\}.
\]
It remains to prove that $\bigcap_{i=1}^{6}\mathcal{A}_i^c \implies \mathcal{A}$. To that end, when $\bigcap_{i=1}^{6}\mathcal{A}_i^c$ holds, we have
\[
\Big\| \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) a \Big\|_2^2 \geq \max\Big\{0,\ 1 - \frac{\alpha_t}{n_t}\big(\lambda^{t(1)}_{n_t}\big)^2\Big\}^2 \chi^2_{n_t} + \chi^2_{p-n_t} \ (\text{by Eq. (63)})
\geq \max\{0,\ 1 - \alpha'_t\}^2 \big(n_t - \sqrt{2 n_t \ln n_t}\big) + (p - n_t) - \sqrt{2(p-n_t)\ln n_t} \ (\text{by } \mathcal{A}_3^c \text{ and } \mathcal{A}_6^c)
= \underline{b}_1,
\]
and
\[
\Big\| \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) a \Big\|_2^2 \leq \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_{n_t})^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_1)^2\Big)^2\Big\} \chi^2_{n_t} + \chi^2_{p-n_t} \ (\text{by Eq. (63)})
\]
\[
\leq \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}\big(\sqrt{p} - \sqrt{n_t} - \ln\sqrt{n_t}\big)^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}\big(\sqrt{p} + \sqrt{n_t} + \ln\sqrt{n_t}\big)^2\Big)^2\Big\} \cdot \Big( n_t + \sqrt{2 n_t \ln n_t} + \ln n_t + (p - n_t) + \sqrt{2(p-n_t)\ln n_t} + \ln n_t \Big) \ (\text{by } \mathcal{A}_1^c, \mathcal{A}_2^c, \mathcal{A}_4^c, \text{ and } \mathcal{A}_5^c)
\]
\[
\leq \max\{\alpha'_t,\ 1 - \alpha'_t\}^2 \cdot \Big( n_t + \sqrt{2 n_t \ln n_t} + \ln n_t + (p - n_t) + \sqrt{2(p-n_t)\ln n_t} + \ln n_t \Big)
= \overline{b}_1.
\]
Therefore $\bigcap_{i=1}^{6}\mathcal{A}_i^c \implies \mathcal{A}$; combined with Eq. (64), the result of Lemma 17 thus follows.
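The block-diagonal rewriting in Eq. (55), which the proof above repeatedly relies on, is easy to verify numerically. The sketch below (numpy assumed; illustrative, not part of the proofs) checks that $I_p - \frac{\alpha_t}{n_t} X X^T$ equals $U \,\mathrm{diag}\big(I_{n_t} - \frac{\alpha_t}{n_t}\Lambda^2,\ I_{p-n_t}\big)\, U^T$ for a random $X$ with $p \geq n_t$:

```python
import numpy as np

# Numerical check of Eq. (55): with the SVD X = U D V^T (p >= n_t),
# I_p - (alpha_t/n_t) X X^T = U * blockdiag(I - (alpha_t/n_t) Lambda^2, I) * U^T.
rng = np.random.default_rng(0)
p, n_t, alpha_t = 12, 5, 0.3
X = rng.standard_normal((p, n_t))
U, svals, _ = np.linalg.svd(X)            # full U in R^{p x p}
inner = np.eye(p)
inner[:n_t, :n_t] -= (alpha_t / n_t) * np.diag(svals**2)
lhs = np.eye(p) - (alpha_t / n_t) * X @ X.T
print(np.allclose(lhs, U @ inner @ U.T))  # True
```

The last $p - n_t$ columns of $U$ correspond to the zero block of $D$, which is why the lower-right block of the inner matrix stays the identity.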

L.2 PROOF OF LEMMA 18

Since all samples are independent of each other, without loss of generality we let $i = 1$ and replace $[X_{v(i)}]_j, [X_{v(i)}]_k$ by two i.i.d. random vectors $a, b \sim \mathcal{N}(0, I_p) \in \mathbb{R}^{p\times 1}$ that are independent of $X_{t(i)}$. In other words, estimating Eq. (46) is equivalent to estimating
\[
a^T \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big)^T \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) b.
\]
Proof of Lemma 18. Recalling $U_{t(1)}$ in Eq. (54), we define
\[
a' := U_{t(1)}^T a \in \mathbb{R}^{p\times 1}, \quad b' := U_{t(1)}^T b \in \mathbb{R}^{p\times 1}, \quad \phi_{n_t} := \sum_{i=1}^{n_t} a'_i b'_i, \quad \phi_{p-n_t} := \sum_{i=n_t+1}^{p} a'_i b'_i. \quad (65)
\]
We then have
\[
a^T \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big)^2 b
= a^T U_{t(1)} \begin{bmatrix} \big(I_{n_t} - \frac{\alpha_t}{n_t}\Lambda_{t(1)}^2\big)^2 & 0 \\ 0 & I_{p-n_t} \end{bmatrix} U_{t(1)}^T b \ (\text{by Eq. (55)})
= a'^T \begin{bmatrix} \big(I_{n_t} - \frac{\alpha_t}{n_t}\Lambda_{t(1)}^2\big)^2 & 0 \\ 0 & I_{p-n_t} \end{bmatrix} b' \ (\text{by Eq. (65)})
\]
\[
= \sum_{i=1}^{n_t} \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_i)^2\Big)^2 a'_i b'_i + \sum_{i=n_t+1}^{p} a'_i b'_i
\leq \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_1)^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_{n_t})^2\Big)^2\Big\} |\phi_{n_t}| + |\phi_{p-n_t}| \ (\text{by Eq. (65) and Eq. (53)}). \quad (67)
\]
Because of the rotational symmetry of the normal distribution, $\phi_{n_t}$ and $\phi_{p-n_t}$ have the same probability distribution as when $a'$ and $b'$ are i.i.d. $\mathcal{N}(0, I_p)$ (see footnote 5). Define several events as follows:
\[
\mathcal{A}_1 := \big\{ |\phi_{n_t}| > \sqrt{n_t}\ln n_t \big\}, \quad
\mathcal{A}_2 := \big\{ |\phi_{p-n_t}| > \sqrt{p}\ln p \big\}, \quad
\mathcal{A}_3 := \big\{ \lambda^{t(1)}_{n_t} > \sqrt{p} + \sqrt{n_t} + \ln\sqrt{n_t} \big\}, \quad
\mathcal{A}_4 := \big\{ \lambda^{t(1)}_1 < \sqrt{p} - \sqrt{n_t} - \ln\sqrt{n_t} \big\}.
\]
We have
\[
\Pr_{X_{t(1)}, a, b}\Big\{\bigcap_{i=1}^{4}\mathcal{A}_i^c\Big\}
= 1 - \Pr_{X_{t(1)}, a, b}\Big\{\bigcup_{i=1}^{4}\mathcal{A}_i\Big\}
\geq 1 - \Pr_{a,b}\{\mathcal{A}_1\} - \Pr_{a,b}\{\mathcal{A}_2\} - \Pr_{X_{t(1)}}\{\mathcal{A}_3 \cup \mathcal{A}_4\} \ (\text{by the union bound})
\geq 1 - \frac{2}{n_t^{0.4}} - \frac{2}{p} - 2e^{-(\ln\sqrt{n_t})^2/2}. \quad (68)
\]
Define the target event
\[
\mathcal{A} := \Big\{ a^T \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big)^2 b \leq b_2 \Big\}.
\]
It remains to prove that $\bigcap_{i=1}^{4}\mathcal{A}_i^c \implies \mathcal{A}$. To that end, when $\bigcap_{i=1}^{4}\mathcal{A}_i^c$ holds, we have
\[
a^T \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big)^2 b
\leq \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_1)^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_{n_t})^2\Big)^2\Big\} |\phi_{n_t}| + |\phi_{p-n_t}| \ (\text{by Eq. (67)})
\]
\[
\leq \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}\big(\sqrt{p} - \sqrt{n_t} - \ln\sqrt{n_t}\big)^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}\big(\sqrt{p} + \sqrt{n_t} + \ln\sqrt{n_t}\big)^2\Big)^2\Big\} \sqrt{n_t}\ln n_t + \sqrt{p}\ln p \ (\text{by } \mathcal{A}_1^c, \mathcal{A}_2^c, \mathcal{A}_3^c, \text{ and } \mathcal{A}_4^c)
\leq \max\{\alpha'_t,\ 1 - \alpha'_t\}^2 \sqrt{n_t}\ln n_t + \sqrt{p}\ln p. \quad (69)
\]
By Eq. (69) (which implies $\bigcap_{i=1}^{4}\mathcal{A}_i^c \implies \mathcal{A}$) and Eq. (68), the result of this lemma thus follows.

L.3 PROOF OF LEMMA 19

Because all inputs are i.i.d., Eq. (47) is the inner product of two independent vectors, each of which follows the same distribution as the vector
\[
\rho := \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) a \in \mathbb{R}^p, \quad (70)
\]
where $a \sim \mathcal{N}(0, I_p)$ is independent of $X_{t(1)}$. In other words, it is equivalent to estimate $\rho_1^T \rho_2$, where $\rho_1$ and $\rho_2$ are i.i.d. with the distribution shown in Eq. (70). If we can characterize the probability distribution of $\rho$, then we can estimate Eq. (47). The following lemma shows the rotational symmetry of Eq. (70).

Lemma 21. The probability distribution of $\rho$ is rotationally symmetric. In other words, for any rotation $S \in SO(p)$, where $SO(p) \subseteq \mathbb{R}^{p\times p}$ denotes the set of all rotations in $p$ dimensions, the rotated random vector $S\rho$ has the same probability distribution as the original vector $\rho$.

Proof. By Eq. (70), we have
\[
S\rho = Sa - \frac{\alpha_t}{n_t} S X_{t(1)} X_{t(1)}^T a
= Sa - \frac{\alpha_t}{n_t} S X_{t(1)} X_{t(1)}^T S^{-1} S a
= Sa - \frac{\alpha_t}{n_t} S X_{t(1)} X_{t(1)}^T S^T S a \ (\text{because } S^{-1} = S^T, \text{ as a rotation is an orthogonal matrix})
= Sa - \frac{\alpha_t}{n_t} \big(S X_{t(1)}\big) \big(S X_{t(1)}\big)^T (Sa).
\]
Notice that $Sa$ and $SX_{t(1)}$ are the rotated versions of $a$ and $X_{t(1)}$, respectively. Since $a$ and $X_{t(1)}$ are independent Gaussian vector/matrix, by rotational symmetry their distributions and independence are not affected by a common rotation $S$. Thus, $Sa - \frac{\alpha_t}{n_t}(SX_{t(1)})(SX_{t(1)})^T(Sa)$ has the same probability distribution as $\rho$. The result of this lemma thus follows.

The following lemma characterizes the distribution of the angle between two independent random vectors whose distributions are rotationally symmetric.

Lemma 22. Consider two i.i.d. random vectors $c_1, c_2 \in \mathbb{R}^p$ whose distributions are rotationally symmetric. We then have
\[
\Pr\Big\{ \frac{|c_1^T c_2|}{\|c_1\|_2 \|c_2\|_2} \geq \frac{\sqrt{p}\ln p}{p - 2\sqrt{p \ln p}} \Big\} \leq \frac{2}{p} + \frac{2}{p^{0.4}}.
\]

Proof. Notice that $\frac{|c_1^T c_2|}{\|c_1\|_2\|c_2\|_2}$ is the cosine of the angle between $c_1$ and $c_2$. By rotational symmetry, it suffices to characterize the angle between two independent standard Gaussian vectors. To that end, consider two i.i.d. random vectors $x_1, x_2 \sim \mathcal{N}(0, I_p)$. The distribution of the angle between $x_1$ and $x_2$ is the same as that between $c_1$ and $c_2$. In other words, we have
\[
\frac{x_1^T x_2}{\|x_1\|_2 \|x_2\|_2} \sim \frac{c_1^T c_2}{\|c_1\|_2 \|c_2\|_2}. \quad (71)
\]
By Lemma 11, we have
\[
\Pr\big\{ \big|x_1^T x_2\big| > \sqrt{p}\,\ln p \big\} \leq \frac{2}{p^{0.4}}. \quad (72)
\]
Noticing that $\|x_1\|_2^2$ and $\|x_2\|_2^2$ follow the $\chi^2$ distribution with $p$ degrees of freedom, by Lemma 9 we have
\[
\Pr\big\{ \|x_1\|_2^2 \leq p - 2\sqrt{p\ln p} \big\} \leq e^{-\ln p} = \frac{1}{p}, \quad (73)
\qquad
\Pr\big\{ \|x_2\|_2^2 \leq p - 2\sqrt{p\ln p} \big\} \leq e^{-\ln p} = \frac{1}{p}. \quad (74)
\]
By Eqs. (72)-(74) and the union bound, we thus have
\[
\Pr\Big\{ \frac{|x_1^T x_2|}{\|x_1\|_2 \|x_2\|_2} \geq \frac{\sqrt{p}\ln p}{p - 2\sqrt{p\ln p}} \Big\} \leq \frac{2}{p} + \frac{2}{p^{0.4}}.
\]
By Eq. (71), the result of this lemma thus follows.

Now we are ready to prove Lemma 19.

Proof of Lemma 19. Since $p \geq n_t \geq 256$, by Eq. (60) of Lemma 20 we have
\[
p - 2\sqrt{p\ln p} \geq \Big(1 - 2\sqrt{\tfrac{1}{45}}\Big) p \geq \Big(1 - \sqrt{5}\sqrt{\tfrac{1}{45}}\Big) p = \frac{2p}{3}. \quad (75)
\]
We define some events as follows:
\[
\mathcal{A}_1 := \big\{ \|\rho_1\|_2^2 \geq \overline{b}_1 \big\}, \quad
\mathcal{A}_2 := \big\{ \|\rho_2\|_2^2 \geq \overline{b}_1 \big\}, \quad
\mathcal{A}_3 := \Big\{ \frac{|\rho_1^T \rho_2|}{\|\rho_1\|_2 \|\rho_2\|_2} \geq \frac{\sqrt{p}\ln p}{p - 2\sqrt{p\ln p}} \Big\}, \quad
\mathcal{A} := \big\{ |\rho_1^T \rho_2| \leq b_3 \big\}.
\]
By Lemma 17, we have
\[
\Pr\{\mathcal{A}_1\} \leq \frac{5}{\sqrt{n_t}}, \qquad \Pr\{\mathcal{A}_2\} \leq \frac{5}{\sqrt{n_t}}. \quad (76)
\]
By Lemma 22, we have
\[
\Pr\{\mathcal{A}_3\} \leq \frac{2}{p} + \frac{2}{p^{0.4}}. \quad (77)
\]
Since $n_t \geq 256$, by letting $x = n_t$ in Eq. (57) of Lemma 20, we have
\[
n_t + \sqrt{2 n_t \ln n_t} + \ln n_t \leq 2 n_t,
\]
and
\[
\overline{c}_1 + \big( (n_v - 1) c_2 + (m-1) n_v c_3 \big)
= \max\{\alpha'_t, 1-\alpha'_t\}^2 (p + 2\ln p) + \big(n_v + 1 + 2(m-1)n_v\big) \max\{\alpha'_t, 1-\alpha'_t\}^2 \sqrt{p}\ln p
\leq \max\{\alpha'_t, 1-\alpha'_t\}^2 \big( p + (2m n_v + 1)\sqrt{p}\ln p \big) \ (\text{by Eq. (82)})
= c_{\text{eig,max}}.
\]
Define the event
\[
\mathcal{A}^{(p\leq n_t)}_{\text{target},1} := \big\{ c_{\text{eig,min}} \leq \lambda_{\min}(BB^T) \leq \lambda_{\max}(BB^T) \leq c_{\text{eig,max}} \big\}.
\]
Therefore, we have proven $\bigcap_{i=1}^{3}\mathcal{A}_i \implies \mathcal{A}^{(p\leq n_t)}_{\text{target},1}$. Thus, we have
\[
\Pr\big[\mathcal{A}^{(p\leq n_t)}_{\text{target},1}\big] \geq \Pr\Big\{\bigcap_{i=1}^{3}\mathcal{A}_i\Big\}
= 1 - \Pr\Big\{\bigcup_{i=1}^{3}\mathcal{A}_i^c\Big\}
\geq 1 - \sum_{i=1}^{3}\Pr\{\mathcal{A}_i^c\} \ (\text{by the union bound})
\geq 1 - \frac{16 m^2 n_v^2}{p^{0.4}} \ (\text{by Eq. (83)}).
\]
The result of this proposition thus follows.

M.1 PROOF OF LEMMA 23

Proof. When $p < n_t$, we define the singular values of $X_{t(1)} \in \mathbb{R}^{p\times n_t}$ as $0 \leq \lambda^{t(1)}_1 \leq \lambda^{t(1)}_2 \leq \cdots \leq \lambda^{t(1)}_p$.
Define $\Lambda_{t(1)} := \mathrm{diag}\big(\lambda^{t(1)}_1, \lambda^{t(1)}_2, \cdots, \lambda^{t(1)}_p\big) \in \mathbb{R}^{p\times p}$. We can still perform the singular value decomposition as in Eq. (54), but here $D_{t(1)} = \begin{bmatrix} \Lambda_{t(1)} & 0 \end{bmatrix} \in \mathbb{R}^{p\times n_t}$ since $p < n_t$. Using this notation, we thus have
\[
I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T = U_{t(1)} \Big(I_p - \frac{\alpha_t}{n_t} \Lambda_{t(1)}^2\Big) U_{t(1)}^T.
\]
Similar to Eq. (63), we have
\[
\Big\| \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) a \Big\|_2^2 \in \Big[ \max\Big\{0,\ 1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_p)^2\Big\}^2 \chi^2_p,\ \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_p)^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_1)^2\Big)^2\Big\} \chi^2_p \Big], \quad (84)
\]
where $\chi^2_p := \|U_{t(1)}^T a\|_2^2 = \|a\|_2^2$ follows the $\chi^2$ distribution with $p$ degrees of freedom. We define several events as follows:
\[
\mathcal{A}_1 := \big\{ \chi^2_p < p - 2\sqrt{p\ln p} \big\}, \quad
\mathcal{A}_2 := \big\{ \chi^2_p > p + 2\sqrt{p\ln p} + 2\ln p \big\}, \quad
\mathcal{A}_3 := \big\{ \lambda^{t(1)}_p > \sqrt{n_t} + \sqrt{p} + \ln\sqrt{n_t} \big\}, \quad
\mathcal{A}_4 := \big\{ \lambda^{t(1)}_1 < \sqrt{n_t} - \sqrt{p} - \ln\sqrt{n_t} \big\}.
\]
Define the target event
\[
\mathcal{A} := \Big\{ \Big\| \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) a \Big\|_2^2 \in [\underline{c}_1, \overline{c}_1] \Big\}.
\]
We have
\[
\Pr_{X_{t(1)}, a}\Big\{\bigcap_{i=1}^{4}\mathcal{A}_i^c\Big\}
= 1 - \Pr\Big\{\bigcup_{i=1}^{4}\mathcal{A}_i\Big\}
\geq 1 - \sum_{i=1}^{2}\Pr_a\{\mathcal{A}_i\} - \Pr_{X_{t(1)}}\{\mathcal{A}_3 \cup \mathcal{A}_4\} \ (\text{by the union bound}).
\]

It remains to prove that

$\bigcap_{i=1}^{4}\mathcal{A}_i^c \implies \mathcal{A}$. To that end, when $\bigcap_{i=1}^{4}\mathcal{A}_i^c$ holds, we have
\[
\Big\| \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) a \Big\|_2^2 \geq \max\Big\{0,\ 1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_p)^2\Big\}^2 \chi^2_p \ (\text{by Eq. (84)})
\geq \max\Big\{0,\ 1 - \frac{\alpha_t}{n_t}\big(\sqrt{n_t} + \sqrt{p} + \ln\sqrt{n_t}\big)^2\Big\}^2 \big(p - 2\sqrt{p\ln p}\big) \ (\text{by } \mathcal{A}_1^c \text{ and } \mathcal{A}_3^c)
= \max\{0,\ 1 - \alpha'_t\}^2 \big(p - 2\sqrt{p\ln p}\big).
\]
When $\bigcap_{i=1}^{4}\mathcal{A}_i^c$ holds, we also have
\[
\Big\| \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big) a \Big\|_2^2 \leq \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_p)^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_1)^2\Big)^2\Big\} \chi^2_p \ (\text{by Eq. (84)})
\]
\[
\leq \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}\big(\sqrt{n_t} - \sqrt{p} - \ln\sqrt{n_t}\big)^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}\big(\sqrt{n_t} + \sqrt{p} + \ln\sqrt{n_t}\big)^2\Big)^2\Big\} \big(p + 2\sqrt{p\ln p} + 2\ln p\big) \ (\text{by } \mathcal{A}_2^c, \mathcal{A}_3^c, \text{ and } \mathcal{A}_4^c)
\leq \max\{\alpha'_t,\ 1 - \alpha'_t\}^2 \big(p + 2\sqrt{p\ln p} + 2\ln p\big).
\]
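Lemma 22 above formalizes the familiar fact that two independent, rotationally symmetric high-dimensional vectors are nearly orthogonal. A quick Monte Carlo sanity check (numpy assumed; illustrative only, not part of the proofs):

```python
import numpy as np

# Monte Carlo check of Lemma 22: for two independent N(0, I_p) vectors,
# the cosine of their angle rarely exceeds
# sqrt(p) * ln(p) / (p - 2 * sqrt(p * ln p)).
rng = np.random.default_rng(0)
p, trials = 1024, 2000
x1 = rng.standard_normal((trials, p))
x2 = rng.standard_normal((trials, p))
cosines = np.abs(np.sum(x1 * x2, axis=1)) / (
    np.linalg.norm(x1, axis=1) * np.linalg.norm(x2, axis=1))
threshold = np.sqrt(p) * np.log(p) / (p - 2 * np.sqrt(p * np.log(p)))
frac = float(np.mean(cosines >= threshold))
print(frac)  # empirically 0 here; Lemma 22 bounds it by 2/p + 2/p**0.4
```

With $p = 1024$ the threshold is roughly $0.26$, while a typical cosine is on the order of $1/\sqrt{p} \approx 0.03$, so exceedances are far rarer than the $2/p + 2/p^{0.4}$ bound requires.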

M.2 PROOF OF LEMMA 24

Proof. Similar to Eq. (67), we have
\[
a^T \Big(I_p - \frac{\alpha_t}{n_t} X_{t(1)} X_{t(1)}^T\Big)^2 b \leq \max\Big\{\Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_1)^2\Big)^2, \Big(1 - \frac{\alpha_t}{n_t}(\lambda^{t(1)}_p)^2\Big)^2\Big\} |\phi_p|.
\]
Define several events as follows:
\[
\mathcal{A}_1 := \big\{ |\phi_p| > \sqrt{p}\ln p \big\}, \quad
\mathcal{A}_2 := \big\{ \lambda^{t(1)}_p > \sqrt{n_t} + \sqrt{p} + \ln\sqrt{n_t} \big\}, \quad
\mathcal{A}_3 := \big\{ \lambda^{t(1)}_1 < \sqrt{n_t} - \sqrt{p} - \ln\sqrt{n_t} \big\}.
\]
First, we want to show $\bigcap_{i=1}^{3}\mathcal{A}_i^c \implies \mathcal{A}$. To that end, when $\bigcap_{i=1}^{3}\mathcal{A}_i^c$ holds, we have
\[
|\rho_1^T \rho_2| = \|\rho_1\|_2 \|\rho_2\|_2 \cdot \frac{|\rho_1^T \rho_2|}{\|\rho_1\|_2 \|\rho_2\|_2}
\leq \max\{\alpha'_t, 1-\alpha'_t\}^2 \big(p + 2\sqrt{p\ln p} + 2\ln p\big) \cdot \frac{\sqrt{p}\ln p}{p - 2\sqrt{p\ln p}} \ \Big(\text{by } \bigcap_{i=1}^{3}\mathcal{A}_i^c\Big)
\]
\[
\leq \max\{\alpha'_t, 1-\alpha'_t\}^2 \cdot \frac{1 + 2\sqrt{1/45} + 2/45}{1 - 2\sqrt{1/45}}\, \sqrt{p}\ln p \ (\text{by Eq. (88)})
\leq 2 \max\{\alpha'_t, 1-\alpha'_t\}^2 \sqrt{p}\ln p \ \Big(\text{because } \tfrac{1 + 2\sqrt{1/45} + 2/45}{1 - 2\sqrt{1/45}} \approx 1.91 \leq 2\Big).
\]
Thus, we have proven $\bigcap_{i=1}^{3}\mathcal{A}_i^c \implies \mathcal{A}$, which implies
\[
\Pr[\mathcal{A}] \geq \Pr\Big\{\bigcap_{i=1}^{3}\mathcal{A}_i^c\Big\}
= 1 - \Pr\Big\{\bigcup_{i=1}^{3}\mathcal{A}_i\Big\}
\geq 1 - \sum_{i=1}^{3}\Pr[\mathcal{A}_i] \ (\text{by the union bound})
\geq 1 - 6 p - √ n t -2 p 0.4 \ (\text{by Eq. (86) and Eq. (87)}).
\]
The result of this lemma thus follows.

N PROOF OF PROPOSITION 6

Plugging Eq. (1) into Eq. ( 7), we have γ =         X v(1) T I p -αt nt X t(1) X t(1) T w (1) -αt nt X v(1) T X t(1) ϵ t(1) + ϵ v(1) X v(2) T I p -αt nt X t(2) X t(2) T w (2) -αt nt X v(2) T X t(2) ϵ t(2) + ϵ v(2) . . . X v(m) T I p -αt nt X t(m) X t(m) T w (m) -αt nt X v(m) T X t(m) ϵ t(m) + ϵ v(m)         . By Eq. ( 11), we thus have δγ =         X v(1) T I p -αt nt X t(1) X t(1) T (w (1) -w 0 ) -αt nt X v(1) T X t(1) ϵ t(1) + ϵ v(1) X v(2) T I p -αt nt X t(2) X t(2) T (w (2) -w 0 ) -αt nt X v(2) T X t(2) ϵ t(2) + ϵ v(2) . . . X v(m) T I p -αt nt X t(m) X t(m) T (w (m) -w 0 ) -αt nt X v(m) T X t(m) ϵ t(m) + ϵ v(m)         . In Eq. ( 89), since terms ϵ t(1:m) and ϵ v(1:m) have zero mean and are independent of each other, we have E ϵ t(1:m) ,ϵ v(1:m) ∥δγ∥ 2 2 = E ϵ t(1:m) ,ϵ v(1:m) m i=1 X v(i) T I p - α t n t X t(i) X t(i) T (w (i) -w 0 ) 2 2 + α t n t X v(i) T X t(i) ϵ t(i) 2 2 + ϵ v(i) 2 2 =mn v σ 2 + m i=1 X v(i) T I p - α t n t X t(i) X t(i) T (w (i) -w 0 ) 2 2 + m i=1 E ϵ t(i) α t n t X v(i) T X t(i) ϵ t(i) 2 2 . ( ) Notice that E ϵ t(i) α t n t X v(i) T X t(i) ϵ t(i) 2 2 = α 2 t n 2 t E ϵ t(i) X v(i) T X t(i) ϵ t(i) T X v(i) T X t(i) ϵ t(i) = α 2 t n 2 t E ϵ t(i) Tr X v(i) T X t(i) ϵ t(i) ϵ t(i) T X t(i) T X v(i) (by trace trick Tr[W Z] = Tr[ZW ]) = α 2 t σ 2 n 2 t Tr X v(i) T X t(i) X t(i) T X v(i) (since E ϵ t(i) [ϵ t(i) ϵ t(i) T ] = σ 2 I nt ). Plugging Eq. ( 91) into Eq. ( 90), we thus have E ϵ t(1:m) ,ϵ v(1:m) ∥δγ∥ 2 2 =mn v σ 2 + m i=1 X v(i) T I p - α t n t X t(i) X t(i) T (w (i) -w 0 ) 2 2 Term A + m i=1 α 2 t σ 2 n 2 t Tr X v(i) T X t(i) X t(i) T X v(i) Term B (by Eq. ( 91)). ( 92) The following two lemmas estimate Term A and Term B. Lemma 26. When n t ≥ 16, we have Pr X t(1:m) ,X v(1:m) E w (i) [Term A of Eq. (92)] ≤ mn v ν 2 • 2 ln(sn t ) • D(α t , n t , s) + α 2 t (p -1) n t • 6.25(ln(spn t )) 2 ≥ 1 - 5mn v n t , E X t(1:m) ,X v(1:m) ,w (1:m) [Term A of Eq. (92)] = ν 2 mn v (1 -α t ) 2 + α 2 t (p + 1) n t . Proof. 
See Appendix N.1. Lemma 27. We have Pr X v(1:n t ) ,X t(1:n t ) Term B in Eq. (92) ≤ mα 2 t σ 2 n v p n t ln p • (ln n t ) 2 ≥ 1 - 2mn v p 0.4 , E X v(1:n t ) ,X t(1:n t ) [Term B in Eq. (92)] = mα 2 t σ 2 n v p n t . Proof. See Appendix N.2. Now we are ready to prove Proposition 6. Proof of Proposition 6. By Eq. ( 92), Lemma 26, and Lemma 27, the result of Proposition 6 thus follows. Notice that the probability is estimated by the union bound. N.1 PROOF OF LEMMA 26 Proof. We have E w (i) X v(i) T I p - α t n t X t(i) X t(i) T (w (i) -w 0 ) 2 2 = E X v(i) T I p - α t n t X t(i) X t(i) T (w (i) -w 0 ) T X v(i) T I p - α t n t X t(i) X t(i) T (w (i) -w 0 ) = E Tr X v(i) T I p - α t n t X t(i) X t(i) T (w (i) -w 0 ) (w (i) -w 0 ) T I p - α t n t X t(i) X t(i) T X v(i) (by the trace trick) = Tr X v(i) T I p - α t n t X t(i) X t(i) T E (w (i) -w 0 )(w (i) -w 0 ) T I p - α t n t X t(i) X t(i) T X v(i) = Tr X v(i) T I p - α t n t X t(i) X t(i) T Λ 0 0 0 I p - α t n t X t(i) X t(i) T X v(i) (by Assumption 2). Define A (i) := √ Λ 0 0 0 I p - α t n t X t(i) X t(i) T X v(i) ∈ R p×nv . Plugging A (i) into Eq. ( 95), we thus have E w (i) X v(i) T I p - α t n t X t(i) X t(i) T (w (i) -w 0 ) 2 2 = Tr A T (i) A (i) . Here [•] j,k denotes the element at the j-th row, k-th column of the matrix, [•] l,: denotes the l-th row (vector) of the matrix, [•] :,k denotes the k-th column (vector) of the matrix. Notice that only the first s rows of A (i) is non-zero. Define Q i,j,k := X v(i) j,k 1 - α t n t X t(i) j,: 2 2 + l={1,2,••• ,p}\{j} -X v(i) l,k • α t n t ⟨X t(i) j,: , X t(i) l,: ⟩. We thus have [A (i) ] j,k =    ν (i),j I p -αt nt X t(i) X t(i) T j,: , X v(i) :,k , when j = 1, • • • , s, 0, when j = s + 1, • • • , p, = ν (i),j Q i,j,k , when j = 1, • • • , s, 0, when j = s + 1, • • • , p. Therefore, for any k ∈ {1, 2, • • • , n v }, we have [A T (i) A (i) ] k,k = [A (i) ] T :,k • [A (i) ] :,k = p j=1 [A (i) ] 2 j,k = s j=1 ν (i),j 2 Q 2 i,j,k . By Eq. ( 96) and Eq. 
( 97), we thus have Term A in Eq. (92) = m i=1 nv k=1 s j=1 ν (i),j 2 Q 2 i,j,k . Part 1: calculate the expected value of Q 2 i,j,k By Assumption 1 and Lemma 13, we have E X t(t) j,: = n t , and E X t(t) j,: 4 2 = n t (n t + 2). We also have E⟨X t(i) j,: , X t(i) l,: ⟩ 2 = E nt q=1 X t(i) j,q X t(i) l,q 2 = nt q=1 E(X t(i) j,q X t(i) l,q ) 2 (by Assumption 1) = nt q=1 E[X t(i) 2 j,q ] E[X t(i) 2 l,q ] =n t . If we fix X t(i) and only consider the randomness in X v(i) , since each element of X v(i) :,k are i.i.d. standard Gaussian random variables, then we have Q i,j,k ∼ N (0, σ 2 Q i,j,k ), where σ 2 Q i,j,k = 1 -α t n t X t(i) j,: 2 2 2 + l={1,2,••• ,p}\{j} α t n t ⟨X t(i) j,: , X t(i) l,: ⟩ 2 . Thus, we have E X v(i) ,X t(i) Q 2 i,j,k = E X t(i) E X v(i) Q 2 i,j,k = E X t(i) σ 2 Q i,j,k (by Eq. ( 101)) =1 -2 α t n t E X t(i) X t(i) j,:  2 2 + α 2 t n 2 t E X t(i) By Eq. ( 101) and Lemma 14, for any X t(i) , we have Pr X v(i) Q i,j,k σ Q i,j,k ≥ 2 ln(sn t ) ≤ 1 sn t . Thus, we have Pr X v(i) ,X t(i) Q i,j,k σ Q i,j,k ≥ 2 ln(sn t ) A 3,(i,j,k) ≤ 1 sn t . We define A 1,(i,j,l) , A 2,(i,j) , and A 3,(i,j,k) as shown in Eq. ( 103), Eq. ( 104), and Eq. ( 105), respectively. Then, we define the event A as A := A c 1,(i,j,l) , A c 2,(i,j) , A c 3,(i,j,k) hold for all i ∈ {1, 2, • • • , m}, j ∈ {1, 2, • • • , s}, l ∈ {1, 2, • • • , p} \ {j}, and k ∈ {1, 2, • • • , n v } , Applying the union bound, we then have Pr[A] ≥ 1 - 2m n t - 2m n t - mn v n t ≥ 1 - 5mn v n t . When A happens, we have (by Eq. ( 106) and the definition of ν 2 in Assumption 2). The result of this lemma thus follows by combining Part 1 and Part 2. σ 2 Q i,j,k = 1 - α t n t X t(i)

N.2 PROOF OF LEMMA 27

Proof. In order to show Eq. (93), it suffices to show that
\[
\Pr\Big\{ \max_{i\in\{1,\cdots,m\}} \mathrm{Tr}\big(X_{v(i)}^T X_{t(i)} X_{t(i)}^T X_{v(i)}\big) \leq n_v n_t (\ln n_t)^2\, p \ln p \Big\} \geq 1 - \frac{2 m n_v}{p^{0.4}}, \quad (107)
\]
and in order to show Eq. (94), it suffices to show that for any $i \in \{1, 2, \cdots, m\}$,
\[
\mathbb{E}\,\mathrm{Tr}\big(X_{v(i)}^T X_{t(i)} X_{t(i)}^T X_{v(i)}\big) = n_v n_t p. \quad (108)
\]
We first prove Eq. (107). To that end, we notice that $X_{v(i)}^T X_{t(i)}$ is an $n_v \times n_t$ matrix. For any $i = 1, 2, \cdots, m$ and $j = 1, 2, \cdots, n_v$, we have
\[
\big[X_{v(i)}^T X_{t(i)} X_{t(i)}^T X_{v(i)}\big]_{j,j}
= \big\| [X_{v(i)}^T X_{t(i)}]_{j,:} \big\|_2^2
= \sum_{k=1}^{n_t} \big[X_{v(i)}^T X_{t(i)}\big]_{j,k}^2
= \sum_{k=1}^{n_t} \big\langle [X_{v(i)}]_{:,j}, [X_{t(i)}]_{:,k} \big\rangle^2
= \sum_{k=1}^{n_t} \Big( \sum_{l=1}^{p} [X_{v(i)}]_{l,j} [X_{t(i)}]_{l,k} \Big)^2. \quad (109)
\]
Thus, we have
\[
\max_{i\in\{1,\cdots,m\}} \mathrm{Tr}\big(X_{v(i)}^T X_{t(i)} X_{t(i)}^T X_{v(i)}\big)
= \max_{i\in\{1,\cdots,m\}} \sum_{j=1}^{n_v} \big[X_{v(i)}^T X_{t(i)} X_{t(i)}^T X_{v(i)}\big]_{j,j}
= \max_{i\in\{1,\cdots,m\}} \sum_{j=1}^{n_v} \sum_{k=1}^{n_t} \Big(\sum_{l=1}^{p} [X_{v(i)}]_{l,j}[X_{t(i)}]_{l,k}\Big)^2 \ (\text{by Eq. (109)})
\leq n_v n_t \Big( \max_{i,\,j,\,k} \Big| \sum_{l=1}^{p} [X_{v(i)}]_{l,j}[X_{t(i)}]_{l,k} \Big| \Big)^2. \quad (110)
\]
Notice that the training input $X_{t(i)}$ and the validation input $X_{v(i)}$ are independent of each other, and each element follows an i.i.d. standard Gaussian distribution. Therefore, by applying Lemma 11 (where $c = \ln n_t$, $k = p$, and $q = p$), for any given $i$, $j$, and $k$, we have
\[
\Pr\Big\{ \Big|\sum_{l=1}^{p} [X_{v(i)}]_{l,j}[X_{t(i)}]_{l,k}\Big| > \ln n_t \sqrt{p \ln p} \Big\} \leq 2\exp\big(-\ln n_t \cdot 0.4 \ln p\big) \leq \frac{2}{n_t\, p^{0.4}}.
\]
In the last inequality we use the fact that $\ln n_t \cdot 0.4 \ln p \geq \ln n_t + 0.4\ln p$ when $\min\{n_t, p\} \geq 256$ (see footnote 6). By the union bound, we thus have
\[
\Pr\Big\{ \max_{i,\,j,\,k} \Big|\sum_{l=1}^{p} [X_{v(i)}]_{l,j}[X_{t(i)}]_{l,k}\Big| > \ln n_t \sqrt{p\ln p} \Big\} \leq \frac{2 m n_v}{p^{0.4}}. \quad (111)
\]
By Eq. (110) and Eq. (111), we have proven Eq. (107). The result Eq. (93) of this lemma thus follows. It remains to prove Eq. (108). To that end, by Eq. (109), we have
\[
\mathrm{Tr}\big(X_{v(i)}^T X_{t(i)} X_{t(i)}^T X_{v(i)}\big) = \sum_{j=1}^{n_v} \sum_{k=1}^{n_t} \Big(\sum_{l=1}^{p} [X_{v(i)}]_{l,j}[X_{t(i)}]_{l,k}\Big)^2.
\]
Thus, by Assumption 1, we have E nv j=1 nt k=1 p l=1 [X v(i) ] l,j [X t(i) ] l,k 2 = nv j=1 nt k=1 E p l=1 [X v(i) ] l,j [X t(i) ] l,k 2 = nv j=1 nt k=1 p l=1 E [X v(i) ] 2 l,j [X t(i) ] 2 l,k = nv j=1 nt k=1 p l=1 E[X v(i) ] 2 l,j E[X t(i) ] 2 l,k =n v n t p, i.e., we have proven Eq. ( 108) (and therefore Eq. ( 94) holds). The result of this lemma thus follows.
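The expectation computed above is easy to confirm empirically. The sketch below (numpy assumed; a Monte Carlo illustration, not part of the proof) averages the trace over random draws and compares it with $n_v n_t p$:

```python
import numpy as np

# Monte Carlo check of Eq. (94)/Eq. (108): for independent Gaussian
# X_t in R^{p x n_t} and X_v in R^{p x n_v},
# E[Tr(X_v^T X_t X_t^T X_v)] = n_v * n_t * p.
rng = np.random.default_rng(2)
p, n_t, n_v, trials = 20, 6, 3, 5000
total = 0.0
for _ in range(trials):
    Xt = rng.standard_normal((p, n_t))
    Xv = rng.standard_normal((p, n_v))
    total += np.trace(Xv.T @ Xt @ Xt.T @ Xv)
print(total / trials)  # close to n_v * n_t * p = 360
```

Each of the $n_v n_t$ inner products $\langle [X_v]_{:,j}, [X_t]_{:,k}\rangle$ contributes $\mathbb{E}\langle\cdot,\cdot\rangle^2 = p$, which is exactly the per-term computation in the display above.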

O UNDERPARAMETERIZED SITUATION

In this case, we have p ≤ mn v . The solution that minimize the meta loss is We then have Term 1 of Eq. ( 118) ∈   ν 2 1 - α t n t g 4 m i=1   nv j=1 [X v(i) ] 2 j   2 , ν 2 1 - α t n t g 4 m i=1   nv j=1 [X v(i) ] 2 j   2    (by Eq. ( 113) and Eq. ( 114)) ∈   ν 2 1 - α t n t g 4 1 m   m i=1 nv j=1 [X v(i) ] 2 j   2 , ν 2 1 - α t n t g 4 m i=1   nv j=1 [X v(i) ] 2 j   2    (by Eq. ( 119))    (by Eq. ( 115)). ∈   ν 2 1 - α t n t g 4 1 m   m i=1 nv j=1 [X v(i) ] 2 j   2 , Similarly, we have 115)). (B T B) 2 =   m i=1 1 - α t n t X t(i) T 2 2 2 nv j=1 [X v(i) ] 2 Plugging the above equations into Eq. ( 117), the result of this lemma thus follows.



Footnotes:
1. The approximation considers only the dominating terms and treats logarithmic terms as constants (since they change slowly). Notice that our approximation here is different from an asymptotic result, since the precision of such an approximation in the finite regime can be precisely quantified, whereas an asymptotic result can only estimate the precision in the order of magnitude in the infinite regime, as typically denoted by $O(\cdot)$ notation.
2. We use a more specific setup (such as Gaussian features) than that in Chen et al. (2022), which allows us to provide a more specific bound and verify its tightness, e.g., how the generalization performance of overfitted solutions changes with respect to the number of features.
3. These constants are manually calibrated to better fit the experimental values.
4. $X_{v(i)}$ and $X_{t(i)}$ are independent, so $a$ and $U_{t(1)}$ are also independent. The calculation of $\chi^2_{n_t}$ (or $\chi^2_{p-n_t}$) can utilize the rotational symmetry of $a$ (or $a'$) because $\chi^2_{n_t}$ (or $\chi^2_{p-n_t}$) represents the squared norm of the projection of $a'$ onto an $n_t$-dimensional (or $(p-n_t)$-dimensional) subspace.
5. We can utilize the rotational symmetry because $\phi_{n_t}$ (or $\phi_{p-n_t}$) can be viewed as the result of the following steps: 1) project $a'$ and $b'$ onto the fixed subspace spanned by the first $n_t$ (or last $p-n_t$) standard basis vectors in $\mathbb{R}^p$; 2) calculate the inner product of the two projected vectors.
6. Notice that $\ln n_t \geq \ln 256 \approx 5.5 \geq 2$ and $0.4\ln p \geq 0.4\ln 256 \approx 2.2 \geq 2$. Thus $(\ln n_t - 1)(0.4\ln p - 1) \geq (2-1)(2-1) = 1$, which implies $\ln n_t \cdot 0.4\ln p \geq \ln n_t + 0.4\ln p$.



where $b_w := b_{w_0} + b_w^{\text{ideal}}$ and $\eta := \frac{27 m^2 n_v^2}{\min\{p, n_t\}^{0.4}}$. The values of $b_{w_0}$ and $b_w^{\text{ideal}}$

Figure 1: The model error w.r.t. different values of ν and σ, where m = 10, n t = 50, n v = 3, s = 5, ∥w 0 ∥ 2 2 = 100, and α t = 0.02 p . Subfigure (b) is a copy of subfigure (a) that zooms in the descent floor. Every point is the average of 100 random simulations. The markers in subfigure (b) indicate the descent floor for each curve.

Figure 2: Comparison between the experimental values and the theoretical values of the model error.

Figure 3: The test error for a two-layer fully-connected neural network. The x-axis denotes the neural network width (i.e., the number of neurons in the hidden layer).

eig,min 1 {p>nt} + c eig,min 1 {p≤nt} , 0} , b w := b w0 + b ideal w .

Figure 4: Illustration of the three types of elements in BB^T when m = 2 and n_v = 3. In this case BB^T is a 6 × 6 matrix (i.e., in R^{(mn_v)×(mn_v)}). There are 4 (i.e., m^2) blocks (divided by the dashed red lines), and each block is of size 3 × 3 (i.e., n_v × n_v).

ŵℓ2 := arg min ŵ L meta .When B is full column-rank, we have ŵℓ2 = (B T B) -1 B T γ.
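The closed form above is the standard least-squares normal-equations solution. A quick numerical check (numpy assumed; random illustrative data, not from the paper) that it agrees with a generic solver when $B$ has full column rank:

```python
import numpy as np

# When B has full column rank, argmin_w ||B w - gamma||_2 has the
# closed form (B^T B)^{-1} B^T gamma; np.linalg.lstsq returns the same.
rng = np.random.default_rng(3)
m_nv, p = 30, 5                     # underparameterized: p <= m * n_v rows
B = rng.standard_normal((m_nv, p))
gamma = rng.standard_normal(m_nv)
w_closed = np.linalg.solve(B.T @ B, B.T @ gamma)
w_lstsq = np.linalg.lstsq(B, gamma, rcond=None)[0]
print(np.allclose(w_closed, w_lstsq))  # True
```

In the overparameterized regime the same solver instead returns the minimum ℓ2-norm interpolator, which is the solution the paper's main results analyze.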

Positioning of our work within Table 1 of Chen et al. (2022) (reproduced here as the rows above the last row).

Table of notations.

Term A] = ν 2 mn v (1 -α t ) 2 +Notice that we use the definition of ν and ν (i) in Assumption 2 that ν 2 = By Assumption 1 and Lemma 11 (where c = 2.5 ln(spn t ) and q = e), for any giveni ∈ {1, 2, • • • , m}, j ∈ {1, 2, • • • , s}, and l ∈ {1, 2, • • • , p} \ {j}, we must have Pr ⟨X t(i) j,: , X t(i) l,: ⟩ ≥ 2.5 ln(spn t )By Assumption 1 and Lemma 9, for any given i ∈ {1, 2, • • • , m} and j ∈ {1, 2, • • • , s}, we have -2 n t ln(sn t ), n t + 2 n t ln(sn t ) + 2 ln(sn t )

Q i,j,k • 2 ln(sn t ) (by A c 3,(i,j,k) for all i, j, k) ≤mn v ν 2 • 2 ln(sn t ) • D(α t , n t , s) +

ACKNOWLEDGEMENT

The work of P. Ju and N. Shroff has been partly supported by the NSF grants NSF AI Institute (AI-EDGE) CNS-2112471, CNS-2106933, 2007231, CNS-1955535, and CNS-1901057, and in part by Army Research Office under Grant W911NF-21-1-0244. The work of Y. Liang has been partly supported by the NSF grants NSF AI Institute (AI-EDGE) CNS-2112471 and DMS-2134145.


We first prove that $\bigcap_{i=1}^{3}\mathcal{A}_i \implies \mathcal{A}_{\text{target},1}$. To that end, recall the definition of the disc $D(a_{i,i}, r_i(A))$ with radius $r_i(A)$ for a matrix $A$ in Lemma 7. We now apply Lemma 7 to $BB^T$. Because of $\mathcal{A}_2$ and $\mathcal{A}_3$, for any $i \in \{1, 2, \cdots, mn_v\}$, we have

Because of

3 i=1 A i and Lemma 7, we have all eigenvalues of BB T is inSince p ≥ n t ≥ 256, we have 2(p -n t ) ln n t ≤ 2p ln p ≤ 2p ln p, and n t ln n t ≤ p ln p ≤ √ p ln p.(51)Since n t ≥ 256, by Eq. ( 60) of Lemma 20, we have ln n t ≤ 1 45 n t . Thus, we have ln n t = ln n t ln n t ≤ 1 45 n t ln n t .Recalling the values of b 1 , b 1 , b 2 , and b 3 in Lemmas 17, 18, and 19, we have(by Eq. ( 51) and max{0, 1 -and(by Eq. ( 51) and max{0, 1 -Therefore, we have proven thattarget,1 .We also have Therefore, we haveFirst, we want to showTo that end, when(by Eq. ( 75) and Eq. ( 80))=6 p ln p.Thus, we have proven thatPr {A i } (by the union bound) 76) and Eq. ( 77))The result of this lemma thus follows.M PROOF OF PROPOSITION 5wherewhereProof. See Appendix M.3.
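Lemma 7 is not restated in this excerpt; from the way the discs $D(a_{i,i}, r_i(A))$ are used above, the sketch below assumes it is the Gershgorin disc theorem and illustrates it numerically on a random symmetric PSD matrix playing the role of $BB^T$ (numpy assumed; not part of the proofs):

```python
import numpy as np

# Gershgorin discs: every eigenvalue of A lies in some disc centered at
# a diagonal entry a_ii with radius r_i = sum_{j != i} |a_ij|, hence all
# eigenvalues fall in [min_i(a_ii - r_i), max_i(a_ii + r_i)].
rng = np.random.default_rng(4)
M = rng.standard_normal((6, 6))
A = M @ M.T                          # symmetric PSD, like B B^T
eigs = np.linalg.eigvalsh(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
inside = bool(np.all((eigs >= np.min(np.diag(A) - radii))
                     & (eigs <= np.max(np.diag(A) + radii))))
print(inside)  # True
```

This is exactly how the proofs above localize $\lambda_{\min}(BB^T)$ and $\lambda_{\max}(BB^T)$: the diagonal (Type 1) elements give the disc centers, while the Type 2 and Type 3 off-diagonal bounds control the disc radii.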

Now we are ready to prove Proposition 5

Proof of Proposition 5. Define a few events as follows:A 2 := all type 2 elements of BB T are in [-c 2 , c 2 ] ,A 3 := all type 3 elements of BB T are in [-c 3 , c 3 ] .Because n t ≥ p ≥ 256, we haveSince p ≥ 256, by Eq. ( 60) in Lemma 20, we haveNotice that there are mn v type 1 elements, mn v (n v -1) type 2 elements, and m(m -1)n 2 v type 3 elements. By the union bound, Lemmas 17, 18, 19, and Eq. ( 81), we haveByRecalling the values of c 1 , c 2 , and c 3 in Lemmas 23, 24, 25, we have• p ln p (by Eq. ( 85), A c 1 , A c 2 , andThus, we have(by Lemma 11 and Lemma 8)

M.3 PROOF OF LEMMA 25

Proof. Define several events as follows:By Lemma 23, we haveBy Lemma 22, we haveWhen p ≥ 256, by Eq. ( 60) of Lemma 20, we haveLemma 28. Consider the case p = s = 1. If there exist g, g, h ∈ R such thatthen we must havewe then havew (g, g, (r/r) 2 , mr).Proof. Since p = s = 1, B becomes a vector and B T B is a scalar that equals toBy Eq. ( 112), we haveIn this case, B T δγ is also a scalar that equals toBy the independence and zero mean of w (i) -w 0 , ϵ t(i) , and ϵ v(i) , we have w (g, g, (r/r) 2 , mr), g = n t + 2 n t log n t + 2 log n t , g = n t -2 n t log n t , h = mn v + 2 mn v log(mn v ) + 2 log(mn v ),When p = s = 1 and α t is relatively small such that 1 -αt nt g ≥ 0, we must haveProof. Notice that X t(i) 2 2 follows χ 2 distribution with n t degrees of freedom, and m i=1 nv j=1 [X v(i) ] 2 j follows χ 2 distribution with mn v degrees of freedom. Given any fixed i ∈ {1, 2, • • • , m}, by Lemma 9 (letting x = log n t ), we haveBy Lemma 9 (letting x = log(mn v )), we haveBy the union bound, we thus have Pr X t(i) 2 2 ∈ [g, g], for all i ∈ {1, 2, • • • , m} ≤ 2m n t .Similarly, we haveThe result thus follows by Lemma 28.We interpret the meaning of Proposition 7 as follows. We first approximate each part by the highest order term. Then we have g ≈ g ≈ n t , r ≈ r ≈ n v , and h ≈ mn v ≈ mr. Thus, we have b (p=1) w (g, g, 1, h) ≈ b (p=1) w (g, g, (r/r) 2 , mr).Therefore, we can conclude that our estimation on the model error ∥ ŵℓ2 -w 0 ∥ 2 2 in this case is relatively tight. In other words, we know that (with high probability when n v and n t is relatively large, and α t is relatively small)

