WHEN DO MODELS GENERALIZE? A PERSPECTIVE FROM DATA-ALGORITHM COMPATIBILITY

Anonymous authors
Paper under double-blind review

Abstract

One of the major open problems in machine learning is to characterize generalization in the overparameterized regime, where most traditional generalization bounds become inconsistent even for overparameterized linear regression (Nagarajan & Kolter, 2019). In many scenarios, their failure can be attributed to obscuring the crucial interplay between the training algorithm and the underlying data distribution. To address this issue, we propose a concept named compatibility, which quantitatively characterizes generalization in a manner that is both data-relevant and algorithm-relevant. By considering the entire training trajectory and focusing on early-stopping iterates, compatibility exploits both the data information and the algorithm information and is therefore a suitable notion for the generalization of overparameterized models. We validate this by theoretically studying compatibility in the setting of solving overparameterized linear regression with gradient descent. Specifically, we perform a data-dependent trajectory analysis and derive a sufficient condition for compatibility in this setting. Our theoretical results demonstrate that, in the sense of compatibility, generalization holds under significantly weaker restrictions on the problem instance than in the previous at-convergence analysis.

1. INTRODUCTION

Although deep neural networks achieve great success in practice (Silver et al., 2017; Devlin et al., 2019; Brown et al., 2020), their remarkable generalization ability is still among the essential mysteries in the deep learning community. One of the most intriguing features of deep neural networks is overparameterization, which confers a level of tractability on the training problem, but leaves traditional generalization theories failing to work.

In generalization analysis, both the training algorithm and the data distribution play essential roles (Jiang et al., 2020). For instance, a line of work (Zhang et al., 2021; Nagarajan & Kolter, 2019) highlights the role of the algorithm by showing that algorithm-irrelevant uniform convergence bounds can become inconsistent in deep learning regimes. Another line of work on benign overfitting (Bartlett et al., 2019; Tsigler & Bartlett, 2020) emphasizes the role of the data distribution via a profound analysis of specific overparameterized models. Despite the significant roles of data and algorithm in generalization analysis, existing theories usually focus on either the data factor (e.g., uniform convergence (Nagarajan & Kolter, 2019) and last-iterate analysis (Bartlett et al., 2019; Tsigler & Bartlett, 2020)) or the algorithm factor (e.g., stability-based bounds (Hardt et al., 2016)). Combining both the data factor and the algorithm factor in generalization analysis can help derive tighter generalization bounds and explain the generalization ability of overparameterized models observed in practice. In this sense, a natural question arises: how can both the data factor and the algorithm factor be incorporated into generalization analysis?

To gain insight into the interplay between data and algorithms, we provide motivating examples of a synthetic overparameterized linear regression task and a classification task on the corrupted MNIST dataset in Figure 1.
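The linear-regression half of this motivating example is easy to reproduce. Below is a minimal sketch (all sizes, the k^{-2} spectrum, the noise level, and the step size are illustrative choices, not the paper's exact configuration) in which the best iterate along the gradient-descent trajectory has much lower excess risk than the final, near-interpolating iterate.

```python
import numpy as np

# Minimal sketch of the linear-regression motivating example.
# All constants below are illustrative, not the paper's exact setup.
rng = np.random.default_rng(0)
n, p = 50, 400
lam = 1.0 / np.arange(1.0, p + 1) ** 2          # eigenvalues of a diagonal Sigma
X = rng.standard_normal((n, p)) * np.sqrt(lam)  # rows have covariance diag(lam)
theta_star = np.zeros(p)
theta_star[:10] = 1.0 / np.sqrt(10.0)           # signal in the top eigendirections
y = X @ theta_star + rng.standard_normal(n)     # noisy responses

def excess_risk(theta):
    d = theta - theta_star
    return 0.5 * np.sum(lam * d * d)            # (1/2)(theta-theta*)' Sigma (theta-theta*)

eta = 0.5                                        # constant step size
theta = np.zeros(p)
risks = []
for _ in range(6000):
    theta -= (eta / n) * X.T @ (X @ theta - y)  # full-batch GD update
    risks.append(excess_risk(theta))

print(min(risks), risks[-1])                    # best early-stopped vs final iterate
```

On typical draws the risk curve is minimized at an intermediate iteration and then climbs as GD fits the noise in the tail eigendirections, mirroring the blue line in Figure 1.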
In both scenarios, the final iterate, which carries less algorithmic information (such as the algorithm type, e.g., GD or SGD, and hyperparameters, e.g., learning rate and number of epochs), generalizes much worse than the early stopping solutions (see the blue line). In the linear regression case, the generalization error of the final iterate can be more than 100× larger than that of the early stopping solution. In the MNIST case, the final iterate on the SGD trajectory has 19.9% test error, much higher than the 2.88% test error of the best iterate on the GD trajectory. Therefore, the almost ubiquitous strategy of early stopping is a key ingredient in generalization analysis for overparameterized models, whose benefits have been demonstrated both theoretically and empirically (Yao et al., 2007; Ali et al., 2019; Li et al., 2020b; Ji et al., 2021). By focusing on the entire optimization trajectory and performing data-dependent trajectory analysis, both the data information and the dynamics of the training algorithm can be exploited to yield consistent generalization bounds.

To analyze the data-dependent trajectory, we introduce a new concept named data-algorithm compatibility, which jointly characterizes the role of the data and the algorithm in generalization analysis. Informally speaking, an algorithm is compatible with a data distribution if, as the sample size goes to infinity, the minimum excess risk of the iterates on the training trajectory converges to zero. The significance of compatibility is threefold. Firstly, compatibility incorporates both data and algorithm factors into generalization analysis, and brings new messages to generalization in the overparameterized regime (see Definition 3.1). Secondly, compatibility serves as a minimal condition for generalization, without which one cannot expect to find a consistent solution via standard learning procedures.
Consequently, compatibility holds under only mild assumptions and applies to a wide range of problem instances (see Theorem 4.1). Thirdly, compatibility captures the algorithmic significance of early stopping in generalization. By exploiting the algorithm information along the entire trajectory, we arrive at better generalization bounds than the at-convergence analysis (see Tables 1 and 2 for examples).

To theoretically validate compatibility, we study it in the overparameterized linear regression setting. Overparameterized linear regression is a reasonable starting point for studying compatibility in more complex models like deep neural networks, since many phenomena of high-dimensional non-linear models are also observed in the linear regime (e.g., Figure 1). Furthermore, the recent neural tangent kernel (NTK) framework demonstrates that very wide neural networks trained by gradient descent with appropriate random initialization can be approximated by kernel regression in a reproducing kernel Hilbert space, which rigorously establishes a close relationship between overparameterized linear regression and deep neural network training (Jacot et al., 2018; Arora et al., 2019). Specifically, we investigate solving overparameterized linear regression using gradient descent with a constant step size, and prove that under some mild regularity conditions, gradient descent is compatible with overparameterized linear regression if the effective dimensions of the feature covariance matrix are asymptotically bounded by the sample size. In this setting, the assumptions needed for generalization in the sense of compatibility are significantly weaker than those in the at-convergence analysis (Bartlett et al., 2019), which demonstrates the benefits of data-relevant and algorithm-relevant generalization analysis.
We summarize our contributions as follows:

• We formalize the notion of compatibility, which highlights the interaction between data and algorithm and serves as a minimal condition for generalization.

• We derive a sufficient condition for compatibility in solving overparameterized linear regression with gradient descent. Our theory substantiates the meaningfulness of compatibility by showing that generalization in the sense of compatibility typically requires much weaker restrictions on the problem instance.

• Technically, we derive time-variant generalization bounds for overparameterized linear regression via data-dependent trajectory analysis. Empirically, various experimental results verify the motivation of compatibility and demonstrate the benefits of early stopping.

2. RELATED WORKS

Data-Dependent Techniques mainly focus on the data distribution conditions for generalization. One of the most popular families of data-dependent bounds is uniform convergence (Koltchinskii & Panchenko, 2000; Bartlett et al., 2017; Zhou et al., 2020; Zhang et al., 2021). However, recent works (Nagarajan & Kolter, 2019; Negrea et al., 2020) point out that uniform convergence may not be powerful enough to explain generalization, because it may yield inconsistent bounds even in linear regression cases. Another line of works investigates benign overfitting, which mainly concerns generalization at convergence (Bartlett et al., 2019; Zou et al., 2021; Tsigler & Bartlett, 2020; Li et al., 2020c; Wang & Thrampoulidis, 2021; Frei et al., 2022).

Algorithm-Dependent Techniques measure the role of algorithmic information in generalization. A line of works derives generalization bounds via algorithmic stability (Bousquet & Elisseeff, 2002; Hardt et al., 2016; Feldman & Vondrák, 2018; Mou et al., 2018; Feldman & Vondrák, 2019; Bousquet et al., 2020; Li et al., 2020a; Lei & Ying, 2020; Bassily et al., 2020; Teng et al., 2021). A parallel line of works analyzes the implicit bias of training algorithms (Soudry et al., 2018; Shah et al., 2020; Hu et al., 2020; Lyu & Li, 2020; Lyu et al., 2021), which is mainly based on analyzing specific data distributions (e.g., linearly separable ones).

Other Generalization Techniques. Besides the techniques discussed above, there are many other approaches. For example, PAC-Bayes theory performs well both empirically and theoretically (Shawe-Taylor & Williamson, 1997; Seeger, 2002; McAllester, 2003; Parrado-Hernández et al., 2012; McAllester, 2013; Dziugaite & Roy, 2017; Neyshabur et al., 2018) and can even yield non-vacuous bounds in deep learning regimes (Rivasplata et al., 2020; Pérez-Ortiz et al., 2021).
Furthermore, there are other promising techniques, including information-theoretic bounds (Russo & Zou, 2016; Xu & Raginsky, 2017; Banerjee & Montúfar, 2021) and compression-based bounds (Arora et al., 2018).

Early Stopping has the potential to improve generalization for various machine learning problems (Raskutti et al., 2014; Vaskevicius et al., 2020; Zhang et al., 2021; Li et al., 2021; Kuzborskij & Szepesvári, 2021; Bai et al., 2021; Shen et al., 2022). A line of works studies the rates of early stopping in linear regression and kernel regression with different algorithms, e.g., gradient descent (Yao et al., 2007), stochastic gradient descent (Tarres & Yao, 2014; Rosasco & Villa, 2015; Dieuleveut & Bach, 2016; Lin & Rosasco, 2017; Pillaud-Vivien et al., 2018), gradient flow (Ali et al., 2019), conjugate gradient (Blanchard & Krämer, 2016), and spectral algorithms (Gerfo et al., 2008; Lin & Cevher, 2018). Most relevant here is Yao et al. (2007), which proves an optimal excess risk bound for a certain class of kernel regression problems solved by early-stopped gradient descent. Beyond linear models, early stopping is also effective for training deep neural networks (Li et al., 2020b; Ji et al., 2021). Another line of research focuses on the signal for early stopping (Prechelt, 2012; Forouzesh & Thiran, 2021).

3. COMPATIBILITY

In this section, we formally define compatibility between the data distribution and the training algorithm, starting from the basic notations.

3.1. NOTATIONS

Data Distribution. Let D denote the population distribution and z ∼ D a data point sampled from D. Usually, z contains a feature and its corresponding response. We denote a dataset with n samples by Z ≜ {z_i}_{i∈[n]}, where the z_i ∼ D are sampled i.i.d. from D.

Loss and Excess Risk. Let ℓ(θ; z) denote the loss on sample z with parameter θ ∈ R^p. The corresponding population loss is defined as L(θ; D) ≜ E_{z∼D} ℓ(θ; z). When the context is clear, we omit the dependency on D and denote the population loss by L(θ). Our goal is to find the optimal parameter θ* that minimizes the population loss, i.e., L(θ*) = min_θ L(θ). Measuring how a parameter θ approaches θ* relies on the excess risk R(θ), defined as R(θ) ≜ L(θ) - L(θ*).

Algorithm. Let A(·) denote an iterative algorithm that takes the training data Z as input and outputs a sequence of parameters {θ_n^{(t)}}_{t≥0}, where t is the iteration number. The algorithm can be either deterministic or stochastic, e.g., variants of (S)GD.
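For the square loss used later in Section 4, the excess risk has a simple closed form. The sketch below (Gaussian features and a small diagonal covariance are illustrative assumptions, not part of the definition) checks the closed form (1/2)(θ - θ*)^⊤Σ(θ - θ*) against a Monte-Carlo estimate of L(θ) - L(θ*).

```python
import numpy as np

# Sketch: closed-form excess risk vs a Monte-Carlo estimate of
# L(theta) - L(theta*). Gaussian features and the diagonal covariance
# are illustrative assumptions.
rng = np.random.default_rng(1)
lam = np.array([1.0, 0.5, 0.25, 0.125, 0.0625])    # spectrum of a diagonal Sigma
theta_star = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
theta = theta_star + np.array([0.3, 0.0, -0.2, 0.1, 0.0])

closed_form = 0.5 * np.sum(lam * (theta - theta_star) ** 2)

m = 500_000                                        # Monte-Carlo sample size
x = rng.standard_normal((m, lam.size)) * np.sqrt(lam)
y = x @ theta_star + 0.1 * rng.standard_normal(m)  # noise independent of x
pop_loss = lambda th: 0.5 * np.mean((y - x @ th) ** 2)
mc_estimate = pop_loss(theta) - pop_loss(theta_star)
print(closed_form, mc_estimate)
```

The noise contribution cancels in the difference of the two losses, which is why the Monte-Carlo estimate matches the quadratic form.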

3.2. DEFINITIONS OF COMPATIBILITY

Based on the above notations, we introduce the notion of compatibility between a data distribution and an algorithm in Definition 3.1. Informally, compatibility measures whether a consistent excess risk can be reached along the training trajectory. Note that we omit the role of the loss function in the definition, although the algorithm depends on the loss function.

Definition 3.1 (Compatibility). A data distribution D and an algorithm A are compatible if there exists a sequence of time sets {T_n}_{n>0} such that

sup_{t∈T_n} R(θ_n^{(t)}) → 0 in probability as n → ∞.

We call {T_n}_{n>0} the compatibility region of (D, A).

The distribution D is allowed to change with n; in this case, D should be understood as a sequence of distributions {D_n}_{n≥1}. We also allow the dimension of the model parameter θ to be infinite or to grow with n, and omit this dependency on n when the context is clear.

Compatibility serves as a minimal condition for generalization, since if a data distribution is incompatible with the algorithm, one cannot expect to reach a small excess risk even if we allow for arbitrary early stopping. However, we remark that considering only the minimal excess risk is insufficient for practical purposes, as one cannot exactly find the t that minimizes R(θ_n^{(t)}) due to the noise in the validation set. Therefore, it is meaningful to consider a region of time t on which the excess risk is consistent, as in Definition 3.1. The larger the compatibility region is, the more robust the algorithm will be to the noise in its execution.

Comparisons with Other Notions. Compared to classic definitions of learnability, e.g., PAC learning, the definition of compatibility is data-specific and algorithm-specific, and is thus a more fine-grained notion. Compared to the concept of benign proposed in the recent paper (Bartlett et al., 2019), which studies whether the excess risk at t = ∞ converges to zero in probability as the sample size goes to infinity, compatibility only requires that there exists a time region yielding a consistent excess risk.
We will show later in Section 4.2 that in the overparameterized linear regression setting, there exist cases where the problem instance is compatible but not benign.

4. COMPATIBILITY ANALYSIS OF OVERPARAMETERIZED LINEAR REGRESSION WITH GRADIENT DESCENT

This paper mainly analyzes compatibility in the overparameterized linear regression regime. We first introduce the data distribution, loss, and training algorithm, and then present the main theorem, which provides a sufficient condition for compatibility in this setting.

4.1. PROBLEM SETUP

Data Distribution. Denote the feature covariance by Σ ≜ E[xx^⊤], with spectral decomposition Σ = VΛV^⊤ = Σ_{i>0} λ_i v_i v_i^⊤ and decreasing eigenvalues λ_1 ≥ λ_2 ≥ ···. We make the following assumptions on the distribution of the feature vector.

Assumption 1 (Assumptions on feature distribution). We assume that
1. E[x] = 0.
2. λ_1 > 0 and Σ_{i>0} λ_i < C for some absolute constant C.
3. Let x̃ = Λ^{-1/2} V^⊤ x. The random vector x̃ has independent σ_x-subgaussian entries.

Loss and Excess Risk. We choose the square loss as the loss function ℓ, i.e., ℓ(θ, (x, y)) = (1/2)(y - x^⊤θ)². The corresponding population loss is denoted by L(θ) = E ℓ(θ, (x, y)) and the optimal parameter by θ* ≜ argmin_{θ∈R^p} L(θ). We assume that ∥θ*∥ < C for some absolute constant C. If there are multiple such minimizers, we choose an arbitrary one and fix it thereafter. We focus on the excess risk of a parameter θ, defined as

R(θ) = L(θ) - L(θ*) = (1/2) E(y - x^⊤θ)² - (1/2) E(y - x^⊤θ*)² = (1/2)(θ - θ*)^⊤ Σ (θ - θ*).

Let ε = y - x^⊤θ* denote the noise in the data point (x, y). The following assumptions involve the conditional distribution of the noise.

Assumption 2 (Assumptions on noise distribution). We assume that
1. The conditional noise ε|x has zero mean.
2. The conditional noise ε|x is σ_y-subgaussian.

Both Assumption 1 and Assumption 2 are commonly considered in the related literature (Bartlett et al., 2019; Tsigler & Bartlett, 2020; Zou et al., 2021).

Training Set. Given a training set {(x_i, y_i)}_{1≤i≤n} with n pairs independently sampled from the population distribution D, we define X ≜ (x_1, ···, x_n)^⊤ ∈ R^{n×p} as the feature matrix, Y ≜ (y_1, ···, y_n)^⊤ ∈ R^n as the corresponding response vector, and ε ≜ Y - Xθ* as the residual vector.
Let the singular value decomposition (SVD) of X be X = U Λ^{1/2} W^⊤, where Λ = diag{μ_1, ···, μ_n} ∈ R^{n×n} with μ_1 ≥ ··· ≥ μ_n. We consider the overparameterized regime where the feature dimension is larger than the sample size, namely, p > n. In this regime, we assume that rank(X) = n almost surely, as in Bartlett et al. (2019). This assumption is equivalent to the invertibility of XX^⊤.

Assumption 3 (Linearly independent training set). For any n < p, the features {x_1, x_2, ···, x_n} in the training set are linearly independent almost surely.

Algorithm. Given the dataset (X, Y), define the empirical loss function as L̂(θ) ≜ (1/(2n)) ∥Xθ - Y∥². We choose full-batch gradient descent on the empirical risk with a constant learning rate λ as the algorithm A in the previous template. In this case, the update rule for the optimization trajectory {θ_t}_{t≥0} is

θ_{t+1} = θ_t - (λ/n) X^⊤(Xθ_t - Y).

Without loss of generality, we consider zero initialization θ_0 = 0 in this paper. In this case, for a sufficiently small learning rate λ, θ_t converges to the min-norm interpolator θ̂ = X^⊤(XX^⊤)^{-1} Y as t goes to infinity, which was well studied previously (Bartlett et al., 2019). This paper takes one step further and discusses the excess risk along the entire training trajectory {R(θ_t)}_{t≥0}.

Effective Rank and Effective Dimensions. We define the effective rank of the feature covariance Σ as r(Σ) ≜ Σ_{i>0} λ_i / λ_1. Our results on compatibility depend on two notions of effective dimension of the feature covariance Σ, defined as

k_0 ≜ min{ l ≥ 0 : λ_{l+1} ≤ c_0 Σ_{i>l} λ_i / n },    k_1 ≜ min{ l ≥ 0 : λ_{l+1} ≤ c_1 Σ_{i>0} λ_i / n },

where c_0, c_1 are constants independent of the dimension p, sample size n, and time t. We omit the dependency of k_0, k_1 on c_0, c_1, n, Σ when the context is clear.
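The claim that gradient descent from zero initialization converges to the min-norm interpolator can be checked numerically. A small sketch (Gaussian design and sizes are illustrative; the step size is set below the inverse smoothness of the empirical loss):

```python
import numpy as np

# Sketch: GD from zero initialization converges to the min-norm
# interpolator X'(XX')^{-1} Y. Gaussian design and sizes are illustrative.
rng = np.random.default_rng(2)
n, p = 20, 60
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

eta = 1.0 / np.linalg.norm(X.T @ X / n, 2)       # 1 / largest eigenvalue of X'X/n
theta = np.zeros(p)
for _ in range(3000):
    theta -= (eta / n) * X.T @ (X @ theta - Y)   # GD update from the text

theta_mn = X.T @ np.linalg.solve(X @ X.T, Y)     # min-norm interpolator
print(np.linalg.norm(theta - theta_mn))
```

Since θ_0 = 0, every iterate stays in the row space of X, which is why the limit is exactly the minimum-norm solution rather than some other interpolator.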

4.2. COMPATIBILITY FOR OVERPARAMETERIZED LINEAR REGRESSION WITH GRADIENT DESCENT

Next, we present the main result of this section, which provides a clean condition for compatibility between gradient descent and overparameterized linear regression.

Theorem 4.1. Suppose Assumptions 1, 2 and 3 hold and the learning rate satisfies λ = O(1/Σ_{i>0} λ_i). If

k_0 = O(n),    k_1 = o(n),    r(Σ) = o(n),

then gradient descent is compatible with overparameterized linear regression in the region T_n = [ω(1/λ), o(n/λ)], namely,

sup_{t∈T_n} R(θ_t) → 0 in probability as n → ∞.

Furthermore, if the feature dimension p = ∞ and the data distribution does not change with n, then the condition k_0 = O(n) alone suffices for compatibility.

The proof of Theorem 4.1 is given in Appendix A and sketched in Section 5. The theorem shows that gradient descent is compatible with overparameterized linear regression under mild regularity conditions on the learning rate, effective rank, and effective dimensions. The condition on the learning rate is natural for optimizing a smooth objective. We conjecture that the condition k_0 = O(n) cannot be removed in general, since the effective dimension k_0 characterizes the concentration of the singular values of the data matrix X and plays a crucial role in the excess risk of the gradient descent dynamics. We discuss extensions to the kernel regression setting in Appendix B.6.

Comparison with Benign Overfitting. The recent paper Bartlett et al. (2019) studies overparameterized linear regression and gives the condition for the min-norm interpolator to generalize. They prove that the feature covariance Σ is benign if and only if

k_0 = o(n),    R_{k_0}(Σ) ≜ (Σ_{i>k_0} λ_i)² / Σ_{i>k_0} λ_i² = ω(n),    r(Σ) = o(n).    (7)

As discussed in Section 3.2, a benign problem instance also satisfies compatibility, since benign overfitting requires a stronger condition on k_0 and an additional assumption on R_{k_0}(Σ). The following example shows that this inclusion relationship is strict.

Example 4.1. Under the same assumptions as in Theorem 4.1, if the spectrum of Σ satisfies λ_k = k^{-α} for some α > 1, then k_0 = Θ(n).
Therefore, this problem instance satisfies compatibility, but does not satisfy benign overfitting. Example 4.1 shows the existence of a case where the early stopping solution generalizes but the interpolating solution fails to generalize. Therefore, compatibility can characterize generalization for a much wider range of problem instances.

Comparisons with Existing Methods. Theorem 4.1 cannot be directly derived from off-the-shelf stability-based generalization bounds (Hardt et al., 2016; Feldman & Vondrák, 2019) or uniform convergence bounds (Koltchinskii & Panchenko, 2000; Bartlett et al., 2017). The main reason is that both methods rely on a high-probability bound on the parameter norm, which requires a nontrivial data-dependent and algorithm-dependent analysis. Even if this can be done, both methods provably give looser bounds with smaller compatibility regions than that in Theorem 4.1. See Appendix B.3 for a detailed comparison with stability-based bounds and Appendix B.4 for discussions on uniform convergence bounds. We also provide a comparison with previous analyses of early stopping (Yao et al., 2007; Lin & Rosasco, 2017; Pillaud-Vivien et al., 2018) in Appendix B.5.
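Example 4.1 can be checked numerically. In the sketch below (c_0 = 1 is an illustrative choice of constant; a large truncation approximates p = ∞), the ratio k_0/n stays roughly constant as n grows for λ_k = k^{-α}, i.e., k_0 = Θ(n): the condition k_0 = O(n) of Theorem 4.1 holds, while the benign-overfitting requirement k_0 = o(n) fails.

```python
import numpy as np

# Numerical check of Example 4.1 (c0 = 1 is an illustrative constant):
# for lam_k = k^{-alpha} with alpha > 1, k0 grows linearly in n.
def k0(lam, n, c0=1.0):
    tails = np.cumsum(lam[::-1])[::-1]           # tails[l] = sum_{i > l} lam_i
    hit = np.nonzero(lam <= c0 * tails / n)[0]   # lam[l] plays the role of lam_{l+1}
    return int(hit[0]) if hit.size else len(lam)

alpha, p = 2.0, 2_000_000                        # large p approximates p = infinity
lam = 1.0 / np.arange(1.0, p + 1) ** alpha
ratios = [k0(lam, n) / n for n in (100, 200, 400, 800)]
print(ratios)                                    # roughly constant in n
```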

5. PROOF SKETCH AND TECHNIQUES

5.1. A TIME VARIANT BOUND

We further introduce an additional type of effective dimension besides k_0, k_1, which is time-variant and is utilized to track the optimization dynamics.

Definition 5.1 (Effective Dimensions). Given a feature covariance matrix Σ, define the effective dimension k_2 as

k_2 ≜ min{ l ≥ 0 : Σ_{i>l} λ_i + nλ_{l+1} ≤ c_2 c(t, n) Σ_{i>0} λ_i },

where c_2 is a constant independent of the dimension p, sample size n, and time t, and c(t, n) is a function to be discussed later. When the context is clear, we omit the dependencies on c_2, c(t, n), n, Σ and simply write k_2.

Based on the effective rank and effective dimensions defined above, we provide a time-variant bound for overparameterized linear regression in Theorem 5.1, which further leads to the compatibility argument in Theorem 4.1. Compared to the existing bound (Bartlett et al., 2019), Theorem 5.1 focuses on the role of the training epoch t in the excess risk, and is of independent interest.

Theorem 5.1 (Time Variant Bound). Suppose Assumptions 1, 2 and 3 hold. Fix a function c(t, n). Given δ ∈ (0, 1), assume that k_0 ≤ n/c, log(1/δ) ≤ n/c, and 0 < λ ≤ 1/(c Σ_{i>0} λ_i) for a large enough constant c. Then with probability at least 1 - δ, for any t ∈ N,

R(θ_t) ≲ B(θ_t) + V(θ_t),

where

B(θ_t) = ∥θ*∥² [ 1/(λt) + ∥Σ∥ max{ √(r(Σ)/n), r(Σ)/n, √(log(1/δ)/n) } ],
V(θ_t) = σ_y² log(1/δ) [ k_1/n + k_2/(c(t, n) n) + c(t, n) ( (λt/n) Σ_{i>0} λ_i )² ].    (12)

We consider four types of feature covariance with eigenvalues λ_k: Inverse Polynomial (λ_k = k^{-α}, α > 1), Inverse Log Polynomial (λ_k = 1/(k log^β(k + 1)), β > 1), Constant (λ_k = n^{-(1+ε)} for 1 ≤ k ≤ n^{1+ε}, ε > 0), and Piecewise Constant (λ_k = 1/s if 1 ≤ k ≤ s and λ_k = 1/(d - s) if s + 1 ≤ k ≤ d, where s = n^r, d = n^q, 0 < r ≤ 1, q > 1). In light of the resulting bounds, ours outperforms Bartlett et al. (2019) in all the cases, and outperforms Zou et al. (2021) in the Constant / Piecewise Constant cases if ε < 1/2 and q < min{2 - r, 3/2}.
We refer to Appendix B for more details.

Table 1: Excess risk bounds for the example distributions.

DISTRIBUTIONS | OURS | BARTLETT ET AL. (2019) | ZOU ET AL. (2021)
INVERSE POLYNOMIAL | O(n^{-min{(α-1)/α, 1/2}}) | O(1) | O(n^{-(α-1)/α})
INVERSE LOG POLYNOMIAL | O(1/log^β n) | o(1) | O(1/log^β n)
CONSTANT | O(n^{-1/2}) | O(n^{-min{ε, 1/2}}) | O(n^{-min{ε, 1}})
PIECEWISE CONSTANT | O(n^{-min{1-r, 1/2}}) | O(n^{-min{1-r, q-1, 1/2}}) | O(n^{-min{1-r, q-1}})

We provide a high-level intuition behind Theorem 5.1. We decompose R(θ_t) into a bias term and a variance term. The variance term is then split into a leading part and a tail part based on the spectrum of the feature covariance Σ. The eigenvalues in the tail part cause the variance term in the excess risk of the min-norm interpolating solution to be Ω(1) for fast-decaying spectra, as is the case in Bartlett et al. (2019). However, since convergence in the tail eigenspace is slower than in the leading eigenspace, a proper early stopping strategy prevents overfitting in the tail eigenspace while avoiding underfitting in the leading eigenspace.

The c(t, n) Principle. It is worth emphasizing that our bound holds for an arbitrary positive function c(t, n). Therefore, one can fine-tune the generalization bound by choosing a proper c(t, n). In the subsequent sections, we show how to derive consistent risk bounds for different time t based on different choices of c(t, n). We present the case of a constant c(t, n) in the next section, and leave the case of a varying c(t, n) to Appendix B.7.
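The trade-off inside the Theorem 5.1 bound can be visualized with back-of-the-envelope arithmetic. In the sketch below, all absolute constants and log(1/δ) are set to 1, c(t, n) is a constant, and the k_2 term is dropped, so the numbers are purely illustrative: the bias part decays like 1/(λt) while the variance part grows like ((λt/n) Σ_i λ_i)², so the bound is minimized at an intermediate stopping time.

```python
import numpy as np

# Back-of-the-envelope sketch of the Theorem 5.1 bound with a constant
# c(t, n). Constants are set to 1 and the k2 term is dropped, so the
# numbers are illustrative only.
n, lr, trace_sigma, k1 = 1000, 0.1, 1.0, 30
t = np.arange(1, 200_001, dtype=float)
bias = 1.0 / (lr * t) + np.sqrt(1.0 / n)          # 1/(lambda t) + sqrt(r(Sigma)/n)
var = k1 / n + (lr * t / n * trace_sigma) ** 2
bound = bias + var
t_star = t[np.argmin(bound)]
print(t_star, bound.min(), bound[0], bound[-1])   # U-shaped in t
```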

5.2. VARYING t, CONSTANT c(t, n)

Theorem 5.1 provides an excess risk upper bound uniformly over t ∈ N. However, deriving Theorem 4.1 from it is still non-trivial: the remaining question is how to choose the term c(t, n). The following corollary gives the generalization bound when c(t, n) is set to a constant.

Corollary 5.1 (Constant c(t, n)). Let Assumptions 1, 2 and 3 hold, and fix c(t, n) to be a constant. Suppose k_0 = O(n), k_1 = o(n), r(Σ) = o(n), and λ = O(1/Σ_{i>0} λ_i). Then there exists a sequence of positive constants {δ_n}_{n≥0} converging to 0, such that with probability at least 1 - δ_n, the excess risk is consistent for t ∈ [ω(1/λ), o(n/λ)], i.e.,

R(θ_t) = o(1).    (13)

Furthermore, for any positive constant δ, with probability at least 1 - δ, the minimal excess risk on the training trajectory can be bounded as

min_t R(θ_t) ≲ max{√(r(Σ)), 1}/√n + max{k_1, 1}/n.

Lemma 5.1 below shows that k_1 = o(n) always holds for a fixed distribution. Therefore, combining Corollary 5.1 and Lemma 5.1 completes the proof of Theorem 4.1.

Lemma 5.1. For any fixed (i.e., independent of the sample size n) feature covariance Σ satisfying Assumption 1, we have k_1(n) = o(n).

Example Distributions. We apply the bound in Corollary 5.1 to several data distributions that are widely discussed in Bartlett et al. (2019); Zou et al. (2021), and derive the corresponding rates for comparison.
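Lemma 5.1 can be illustrated numerically. In the sketch below (c_1 = 1 is an illustrative constant; a large truncation approximates a fixed infinite-dimensional spectrum), k_1(n)/n decreases toward 0 as the sample size grows for a fixed summable spectrum.

```python
import numpy as np

# Numerical illustration of Lemma 5.1 (c1 = 1 is an illustrative
# constant): for a fixed summable spectrum, k1(n)/n tends to 0.
def k1(lam, n, c1=1.0):
    hit = np.nonzero(lam <= c1 * lam.sum() / n)[0]
    return int(hit[0]) if hit.size else len(lam)

p = 2_000_000                                    # large p approximates p = infinity
lam = 1.0 / np.arange(1.0, p + 1) ** 1.5         # fixed spectrum with finite trace
ratios = [k1(lam, n) / n for n in (100, 1_000, 10_000, 100_000)]
print(ratios)
```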

6. EXPERIMENTS

In this section, we provide numerical studies of overparameterized linear regression problems. We consider instances with input dimension p = 1000 and sample size n = 100. The features are sampled from Gaussian distributions with different covariances. The empirical results (a) demonstrate the benefits of the trajectory analysis underlying the definition of compatibility, since the optimal excess risk along the algorithm trajectory is significantly lower than that of the min-norm interpolator, and (b) validate the statements in Corollary 5.1, since the optimal excess risk is lower when the eigenvalues of the feature covariance decay faster. We refer to Appendix C for detailed setups, additional results, and discussions.

Observation One: the early stopping solution along the training trajectory generalizes significantly better than the min-norm interpolator. We calculate the excess risk of optimal early stopping solutions and min-norm interpolators over 1000 independent trials and list the results in Table 2. The results illustrate that the early stopping solution on the algorithm trajectory enjoys much better generalization properties. This observation corroborates the meaningfulness of compatibility and the importance of the data-dependent training trajectory in generalization analysis.

Observation Two: the faster the covariance spectrum decays, the lower the optimal excess risk is. Table 2 also illustrates a positive correlation between the decay rate of λ_i and the generalization performance of the early stopping solution. This accords with Theorem 5.1, which shows that the excess risk is smaller for a smaller effective dimension k_1, where a small k_1 indicates faster-decaying eigenvalues λ_i. We additionally note that this phenomenon also illustrates the difference between min-norm and early stopping solutions in linear regression, since Bartlett et al. (2019) demonstrate that the min-norm solution is not consistent when the eigenvalues decay too fast.
By comparison, early stopping solutions do not suffer from this restriction.
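Observation Two can be reproduced in miniature. The sketch below is not the paper's exact experimental protocol (dimensions, noise level, step size, trial count, and the two decay exponents are all illustrative choices): it averages, over a few trials, the best excess risk along the GD trajectory for a fast-decaying and a slow-decaying spectrum.

```python
import numpy as np

# Sketch of Observation Two on synthetic Gaussian data. All constants
# are illustrative, not the paper's exact experimental setup.
def min_risk_along_trajectory(alpha, n=100, p=500, sigma=1.0, eta=0.2,
                              steps=1000, trials=5, seed=0):
    rng = np.random.default_rng(seed)
    lam = 1.0 / np.arange(1.0, p + 1) ** alpha
    theta_star = np.zeros(p)
    theta_star[0] = 1.0                          # signal in the top eigendirection
    best = []
    for _ in range(trials):
        X = rng.standard_normal((n, p)) * np.sqrt(lam)
        y = X @ theta_star + sigma * rng.standard_normal(n)
        theta = np.zeros(p)
        risks = []
        for _ in range(steps):
            theta -= (eta / n) * X.T @ (X @ theta - y)
            d = theta - theta_star
            risks.append(0.5 * np.sum(lam * d * d))
        best.append(min(risks))
    return float(np.mean(best))

fast = min_risk_along_trajectory(alpha=4.0)      # fast spectrum decay
slow = min_risk_along_trajectory(alpha=1.05)     # slow spectrum decay
print(fast, slow)
```

The faster-decaying spectrum concentrates the signal-bearing directions, so fewer noisy directions are fitted by the time the bias is gone, matching the k_1 intuition in the text.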

7. CONCLUSION

In this paper, we propose the concept of data-algorithm compatibility and study it for overparameterized linear regression with gradient descent. Our theoretical and empirical results demonstrate that compatibility eases the assumptions and broadens the scope of generalization analysis by fully exploiting both the data information and the algorithm information. Although this paper focuses on linear cases, compatibility can be a much more general concept. Therefore, we believe this paper will motivate more work on data-dependent trajectory analysis.

Lemma A.1 (Lemma 10 in Bartlett et al. (2019)). For any σ_x, there exists a constant c such that for any 0 ≤ k < n, with probability at least 1 - e^{-n/c},

μ_{k+1} ≤ c ( Σ_{i>k} λ_i + λ_{k+1} n ).

This implies that as long as the step size λ is smaller than a threshold independent of the sample size n, gradient descent is stable.

Corollary A.1. There exists a constant c such that with probability at least 1 - e^{-n/c}, for any 0 ≤ λ ≤ 1/(c Σ_{i>0} λ_i), we have

0 ⪯ I - (λ/n) X^⊤X ⪯ I.

Proof. The right-hand side of the inequality is obvious since λ > 0 and X^⊤X is positive semidefinite. For the left-hand side, we show that the eigenvalues of I - (λ/n) X^⊤X are non-negative. Since X^⊤X and XX^⊤ have the same non-zero eigenvalues, with probability at least 1 - e^{-n/c}, the smallest eigenvalue of I - (λ/n) X^⊤X can be lower bounded by

1 - (λ/n) μ_1 ≥ 1 - cλ ( Σ_{i>0} λ_i / n + λ_1 ) ≥ 1 - 2cλ Σ_{i>0} λ_i ≥ 0,

where the first inequality uses Lemma A.1 with k = 0, and the last inequality holds if λ ≤ 1/(2c Σ_{i>0} λ_i).
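Corollary A.1 is easy to sanity-check numerically. In the sketch below (Gaussian design with an illustrative k^{-2} spectrum; the factor 2 stands in for the constant c), the step size is set to 1/(2 Σ_i λ_i), and all eigenvalues of I - (λ/n)X^⊤X land in [0, 1].

```python
import numpy as np

# Numerical sanity check of Corollary A.1. Gaussian design and the
# k^{-2} spectrum are illustrative; 2 stands in for the constant c.
rng = np.random.default_rng(3)
n, p = 50, 200
lam = 1.0 / np.arange(1.0, p + 1) ** 2
X = rng.standard_normal((n, p)) * np.sqrt(lam)

step = 1.0 / (2.0 * lam.sum())                   # lambda <= 1/(c * sum_i lam_i)
eigs = np.linalg.eigvalsh(np.eye(p) - (step / n) * X.T @ X)
print(eigs.min(), eigs.max())
```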

A.3 PROOF FOR THE BIAS-VARIANCE DECOMPOSITION

Let X† denote the Moore-Penrose pseudoinverse of the matrix X. The following lemma gives a closed-form expression for θ_t.

Lemma A.2. The dynamics {θ_t}_{t≥0} satisfy

θ_t = (I - (λ/n) X^⊤X)^t (θ_0 - X†Y) + X†Y.

Proof. We prove the lemma by induction. The equality holds at t = 0, as both sides equal θ_0. Recall that θ_t is updated as

θ_{t+1} = θ_t + (λ/n) X^⊤(Y - Xθ_t).

Suppose that the claimed expression holds up to the t-th step. Plugging the expression for θ_t into the recursion above and noting that X^⊤XX† = X^⊤, we get

θ_{t+1} = (I - (λ/n) X^⊤X) θ_t + (λ/n) X^⊤Y
= (I - (λ/n) X^⊤X)^{t+1} (θ_0 - X†Y) + (I - (λ/n) X^⊤X) X†Y + (λ/n) X^⊤Y
= (I - (λ/n) X^⊤X)^{t+1} (θ_0 - X†Y) + X†Y,

which finishes the proof.

Next we prove two identities which will be used in the subsequent proofs.

Lemma A.3. The following two identities hold for any matrix X and non-negative integer t:

I - X†X + (I - (λ/n) X^⊤X)^t X†X = (I - (λ/n) X^⊤X)^t,
(I - (I - (λ/n) X^⊤X)^t) X†XX^⊤ = X^⊤ (I - (I - (λ/n) XX^⊤)^t).

Proof. Noting that X^⊤XX† = X^⊤ and XX†X = X, we expand the left-hand side of the first identity using the binomial theorem (C(t, k) denotes the binomial coefficient) and eliminate the pseudoinverse X†:

I - X†X + (I - (λ/n) X^⊤X)^t X†X
= I - X†X + Σ_{k=0}^t C(t, k) (-(λ/n) X^⊤X)^k X†X
= I - X†X + X†X + Σ_{k=1}^t C(t, k) (-λ/n)^k (X^⊤X)^{k-1} X^⊤XX†X
= I + Σ_{k=1}^t C(t, k) (-λ/n)^k (X^⊤X)^k
= (I - (λ/n) X^⊤X)^t.

Under review as a conference paper at ICLR 2023

The second identity can be proved in a similar way:

(I - (I - (λ/n) X^⊤X)^t) X†XX^⊤
= -Σ_{k=1}^t C(t, k) (-(λ/n) X^⊤X)^k X†XX^⊤
= -Σ_{k=1}^t C(t, k) (-λ/n)^k (X^⊤X)^{k-1} X^⊤XX†XX^⊤
= -Σ_{k=1}^t C(t, k) (-λ/n)^k (X^⊤X)^{k-1} X^⊤XX^⊤
= -Σ_{k=1}^t C(t, k) (-λ/n)^k X^⊤(XX^⊤)^k
= X^⊤ (I - (I - (λ/n) XX^⊤)^t).

We are now ready to prove the main result of this section.

Lemma A.4. The excess risk at the t-th epoch can be upper bounded as R(θ_t) ≤ θ*^⊤Bθ* + ε^⊤Cε, where

B = (I - (λ/n) X^⊤X)^t Σ (I - (λ/n) X^⊤X)^t,
C = (XX^⊤)^{-1} (I - (I - (λ/n) XX^⊤)^t) XΣX^⊤ (I - (I - (λ/n) XX^⊤)^t) (XX^⊤)^{-1},

which characterize the bias term and the variance term in the excess risk.
Furthermore, there exists a constant $c$ such that with probability at least $1-\delta$ over the randomness of $\varepsilon$, we have
$$\varepsilon^\top C \varepsilon \le c\,\sigma_y^2 \log\tfrac{1}{\delta}\,\mathrm{Tr}[C].$$

Proof. First note that $XX^\top$ is invertible by Assumption 3. Express the excess risk as
$$R(\theta_t) = \frac12 \mathbb{E}\big[(y - x^\top\theta_t)^2 - (y - x^\top\theta^*)^2\big] = \frac12 \mathbb{E}\big[(y - x^\top\theta^* + x^\top\theta^* - x^\top\theta_t)^2 - (y - x^\top\theta^*)^2\big] = \frac12 \mathbb{E}\big[(x^\top(\theta_t - \theta^*))^2 + 2(y - x^\top\theta^*)(x^\top\theta^* - x^\top\theta_t)\big] = \frac12 \mathbb{E}\big[(x^\top(\theta_t - \theta^*))^2\big].$$
Recall that $\theta_0 = 0$ and $Y = X\theta^* + \varepsilon$, so the formula for $\theta_t$ in Lemma A.2 simplifies to
$$\theta_t = \left(I - \frac{\lambda}{n}X^\top X\right)^t(\theta_0 - X^\dagger Y) + X^\dagger Y = \left(I - \left(I - \frac{\lambda}{n}X^\top X\right)^t\right) X^\dagger (X\theta^* + \varepsilon).$$
Plugging this into the expression for $R(\theta_t)$ above, we have
$$R(\theta_t) = \frac12 \mathbb{E}\left[\left(x^\top\left(I - \left(I - \frac{\lambda}{n}X^\top X\right)^t\right)X^\dagger(X\theta^* + \varepsilon) - x^\top\theta^*\right)^2\right] = \frac12 \mathbb{E}\left[\left(x^\top\left(X^\dagger X - \left(I - \frac{\lambda}{n}X^\top X\right)^t X^\dagger X - I\right)\theta^* + x^\top\left(I - \left(I - \frac{\lambda}{n}X^\top X\right)^t\right)X^\dagger \varepsilon\right)^2\right].$$
Applying Lemma A.3, we obtain
$$R(\theta_t) = \frac12 \mathbb{E}\left[\left(-x^\top\left(I - \frac{\lambda}{n}X^\top X\right)^t\theta^* + x^\top X^\top\left(I - \left(I - \frac{\lambda}{n}XX^\top\right)^t\right)(XX^\top)^{-1}\varepsilon\right)^2\right] \le \mathbb{E}\left[\left(x^\top\left(I - \frac{\lambda}{n}X^\top X\right)^t\theta^*\right)^2\right] + \mathbb{E}\left[\left(x^\top X^\top\left(I - \left(I - \frac{\lambda}{n}XX^\top\right)^t\right)(XX^\top)^{-1}\varepsilon\right)^2\right] =: \theta^{*\top}B\theta^* + \varepsilon^\top C\varepsilon,$$
which proves the first claim in the lemma. The second part of the lemma follows directly from Lemma 18 in Bartlett et al. (2019).

A.4 PROOF FOR THE BIAS UPPER BOUND

The next lemma guarantees that the sample covariance matrix $\frac{1}{n}X^\top X$ concentrates well around $\Sigma$.

Lemma A.5 (Lemma 35 in Bartlett et al. (2019)). There exists a constant $c$ such that for any $0 < \delta < 1$, with probability at least $1-\delta$,
$$\left\|\Sigma - \frac{1}{n}X^\top X\right\| \le c\|\Sigma\|\max\left\{\sqrt{\frac{r(\Sigma)}{n}},\ \frac{r(\Sigma)}{n},\ \sqrt{\frac{\log(1/\delta)}{n}},\ \frac{\log(1/\delta)}{n}\right\}.$$

The following inequality will be useful in our proof to characterize the decaying rate of the bias term with $t$.

Lemma A.6. For any positive semidefinite matrix $P$ which satisfies $\|P\| \le 1$, we have $\|P(I-P)^t\| \le \frac{1}{t}$.

Proof. Assume without loss of generality that $P$ is diagonal. Then it suffices to consider each eigenvalue $\sigma$ of $P$ separately and show that $\sigma(1-\sigma)^t \le \frac{1}{t}$. Indeed, by the AM–GM inequality applied to the $t+1$ numbers $t\sigma, (1-\sigma), \dots, (1-\sigma)$,
$$\sigma(1-\sigma)^t = \frac{1}{t}\cdot t\sigma(1-\sigma)^t \le \frac{1}{t}\left(\frac{t\sigma + (1-\sigma)t}{t+1}\right)^{t+1} = \frac{1}{t}\left(\frac{t}{t+1}\right)^{t+1} \le \frac{1}{t},$$
which completes the proof.

Next we prove the main result of this section.

Lemma A.7. There exists a constant $c$ such that if $0 \le \lambda \le \frac{1}{c\sum_{i>0}\lambda_i}$, then for any $0<\delta<1$, with probability at least $1-\delta$ the following bound on the bias term holds for any $t$:
$$\theta^{*\top} B \theta^* \le c\|\theta^*\|^2\left(\frac{1}{\lambda t} + \|\Sigma\|\max\left\{\sqrt{\frac{r(\Sigma)}{n}},\ \frac{r(\Sigma)}{n},\ \sqrt{\frac{\log(1/\delta)}{n}},\ \frac{\log(1/\delta)}{n}\right\}\right).$$

Proof. The bias can be decomposed into the following two terms:
$$\theta^{*\top} B \theta^* = \theta^{*\top}\left(I - \frac{\lambda}{n}X^\top X\right)^t\left(\Sigma - \frac{1}{n} X^\top X\right)\left(I - \frac{\lambda}{n}X^\top X\right)^t\theta^* + \theta^{*\top}\frac{1}{n}X^\top X\left(I - \frac{\lambda}{n}X^\top X\right)^{2t}\theta^*.$$
For a sufficiently small learning rate $\lambda$ as given by Corollary A.1, we know that with high probability $\|I - \frac{\lambda}{n}X^\top X\| \le 1$, which together with Lemma A.5 gives a high-probability bound on the first term:
$$\theta^{*\top}\left(I - \frac{\lambda}{n}X^\top X\right)^t\left(\Sigma - \frac{1}{n} X^\top X\right)\left(I - \frac{\lambda}{n}X^\top X\right)^t\theta^* \le c\|\Sigma\|\|\theta^*\|^2\max\left\{\sqrt{\frac{r(\Sigma)}{n}},\ \frac{r(\Sigma)}{n},\ \sqrt{\frac{\log(1/\delta)}{n}},\ \frac{\log(1/\delta)}{n}\right\}.$$
For the second term, invoking Lemma A.6 with $P = \frac{\lambda}{n}X^\top X$ gives
$$\theta^{*\top}\frac{1}{n} X^\top X\left(I - \frac{\lambda}{n}X^\top X\right)^{2t}\theta^* \le \frac{1}{\lambda}\|\theta^*\|^2\left\|\frac{\lambda}{n}X^\top X\left(I - \frac{\lambda}{n}X^\top X\right)^{2t}\right\| \le \frac{1}{2\lambda t}\|\theta^*\|^2.$$
Putting these two bounds together completes the proof.
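Lemma A.6 is easy to verify numerically. The sketch below (with an arbitrary random PSD matrix; sizes and seed are illustrative choices only) checks the inequality for several values of $t$.

```python
import numpy as np

# For PSD P with ||P|| <= 1, the operator norm of P (I - P)^t is at most 1/t.
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 30))
P = A @ A.T                        # random PSD matrix
P = P / np.linalg.norm(P, 2)       # rescale so that ||P|| = 1
I = np.eye(30)
for t in [1, 5, 50, 500]:
    Mt = P @ np.linalg.matrix_power(I - P, t)
    assert np.linalg.norm(Mt, 2) <= 1 / t + 1e-12
```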

A.5 PROOF FOR THE VARIANCE UPPER BOUND

Recall that $X = U\Lambda^{\frac12}W^\top$ is the singular value decomposition of the data matrix $X$, where $U = (u_1,\cdots,u_n)$, $W = (w_1,\cdots,w_n)$, and $\Lambda = \mathrm{diag}\{\mu_1,\cdots,\mu_n\}$ with $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_n$. Recall that
$$k_0 = \min\Big\{l \ge 0: \lambda_{l+1} \le \frac{c_0\sum_{i>l}\lambda_i}{n}\Big\},\qquad k_1 = \min\Big\{l\ge0: \lambda_{l+1} \le \frac{c_1\sum_{i>0}\lambda_i}{n}\Big\},\qquad k_2 = \min\Big\{l\ge0: \sum_{i>l}\lambda_i + n\lambda_{l+1} \le c_2\,c(t,n)\sum_{i>0}\lambda_i\Big\},$$
for some constants $c_0, c_1, c_2$ and function $c(t,n)$. We further define
$$k_3 = \min\Big\{l\ge0: \mu_{l+1} \le c_3\,c(t,n)\sum_{i>0}\lambda_i\Big\}$$
for some constant $c_3$. The next lemma shows that we can choose the constants appropriately to ensure that $k_3 \le k_2$ holds with high probability, and that in some specific cases $k_2 \le k_1$.

Lemma A.8. For any function $c(t,n)$ and constant $c_2$, there exist constants $c, c_3$ such that $k_3 \le k_2$ with probability at least $1 - e^{-n/c}$. Furthermore, if $c(t,n)$ is a positive constant function, then for any $c_1$ there exists $c_2$ such that $k_2 \le k_1$.

Proof. According to Lemma A.1, there exists a constant $c$ such that with probability at least $1-e^{-n/c}$ we have
$$\mu_{k_2+1} \le c\Big(\sum_{i>k_2}\lambda_i + n\lambda_{k_2+1}\Big) \le c\,c_2\,c(t,n)\sum_{i>0}\lambda_i.$$
Therefore, we know that $k_3 \le k_2$ for $c_3 = c\,c_2$. By the definition of $k_1$, we have
$$\sum_{i>k_1}\lambda_i + n\lambda_{k_1+1} \le (c_1+1)\sum_{i>0}\lambda_i,$$
which implies that $k_2 \le k_1$ for $c_2 = \frac{c_1+1}{c(t,n)}$ if $c(t,n)$ is a positive constant.

Next we bound $\mathrm{Tr}[C]$, which implies an upper bound on the variance term.

Theorem A.1. There exist constants $c, c_0, c_1, c_2$ such that if $k_0 \le \frac{n}{c}$, then with probability at least $1-e^{-n/c}$, the trace of the variance matrix $C$ has the following upper bound for any $t$:
$$\mathrm{Tr}[C] \le c\left(\frac{k_1}{n} + \frac{k_2}{c(t,n)\,n} + c(t,n)\left(\frac{\lambda t}{n}\sum_{i>0}\lambda_i\right)^2\right).$$

Proof. We divide the eigenvalues of $XX^\top$ into two groups based on whether they are greater than $c_3\,c(t,n)\sum_{i>0}\lambda_i$. The first group consists of $\mu_1,\cdots,\mu_{k_3}$, and the second group consists of $\mu_{k_3+1},\cdots,\mu_n$. For $1 \le j \le k_3$, we have $1 - \left(1 - \frac{\lambda}{n}\mu_j\right)^t \le 1$.
Therefore we have the following upper bound on $\big(I - (I - \frac{\lambda}{n}XX^\top)^t\big)^2$:
$$\left(I - \left(I - \frac{\lambda}{n}XX^\top\right)^t\right)^2 = U\,\mathrm{diag}\left\{\left(1 - \left(1-\tfrac{\lambda}{n}\mu_1\right)^t\right)^2, \cdots, \left(1 - \left(1-\tfrac{\lambda}{n}\mu_n\right)^t\right)^2\right\}U^\top \preceq U\,\mathrm{diag}\{\underbrace{1,\cdots,1}_{k_3},\underbrace{0,\cdots,0}_{n-k_3}\}U^\top + U\,\mathrm{diag}\left\{\underbrace{0,\cdots,0}_{k_3}, \left(1-\left(1-\tfrac{\lambda}{n}\mu_{k_3+1}\right)^t\right)^2,\cdots,\left(1-\left(1-\tfrac{\lambda}{n}\mu_n\right)^t\right)^2\right\}U^\top.$$
For positive semidefinite matrices $P, Q, R$ which satisfy $Q \preceq R$, it holds that $\mathrm{Tr}[PQ] \le \mathrm{Tr}[PR]$. This implies the following upper bound on $\mathrm{Tr}[C]$:
$$\mathrm{Tr}[C] = \mathrm{Tr}\left[\left(I - \left(I-\frac{\lambda}{n}XX^\top\right)^t\right)^2(XX^\top)^{-2}X\Sigma X^\top\right] \le \underbrace{\mathrm{Tr}\left[U\,\mathrm{diag}\{\underbrace{1,\cdots,1}_{k_3},0,\cdots,0\}U^\top(XX^\top)^{-2}X\Sigma X^\top\right]}_{①} + \underbrace{\mathrm{Tr}\left[U\,\mathrm{diag}\left\{\underbrace{0,\cdots,0}_{k_3},\left(1-\left(1-\tfrac{\lambda}{n}\mu_{k_3+1}\right)^t\right)^2,\cdots,\left(1-\left(1-\tfrac{\lambda}{n}\mu_n\right)^t\right)^2\right\}U^\top(XX^\top)^{-2}X\Sigma X^\top\right]}_{②}. \quad (48)$$

Bounding ①. Noting $X = U\Lambda^{\frac12}W^\top$ and $\Sigma = \sum_{i\ge1}\lambda_i v_i v_i^\top$, we express the first term as a sum over eigenvector products:
$$① = \mathrm{Tr}\left[U\,\mathrm{diag}\{\underbrace{1,\cdots,1}_{k_3},0,\cdots,0\}U^\top\, U\Lambda^{-2}U^\top\, U\Lambda^{\frac12}W^\top\Sigma W\Lambda^{\frac12}U^\top\right] = \mathrm{Tr}\left[\mathrm{diag}\{\underbrace{1,\cdots,1}_{k_3},0,\cdots,0\}\Lambda^{-1}W^\top\Sigma W\right] = \sum_{i\ge1}\lambda_i\,\mathrm{Tr}\left[\mathrm{diag}\{\underbrace{1,\cdots,1}_{k_3},0,\cdots,0\}\Lambda^{-1}W^\top v_iv_i^\top W\right] = \sum_{i\ge1}\sum_{1\le j\le k_3}\frac{\lambda_i}{\mu_j}\left(v_i^\top w_j\right)^2. \quad (49)$$
Next we divide the above summation into $1\le i\le k_1$ and $i > k_1$. For the first part, notice that
$$\sum_{1\le j\le k_3}\frac{\lambda_i}{\mu_j}(v_i^\top w_j)^2 \le \sum_{1\le j\le n}\frac{\lambda_i}{\mu_j}(v_i^\top w_j)^2 = \lambda_i v_i^\top\Big(\sum_{1\le j\le n}\frac{1}{\mu_j}w_jw_j^\top\Big)v_i = \lambda_i v_i^\top W\Lambda^{-1}W^\top v_i = \lambda_i v_i^\top W\Lambda^{\frac12}U^\top U\Lambda^{-2}U^\top U\Lambda^{\frac12}W^\top v_i = \lambda_i^2\,\tilde{x}_i^\top(XX^\top)^{-2}\tilde{x}_i, \quad (50)$$
where $\tilde{x}_i$ is defined as $\tilde{x}_i = \frac{Xv_i}{\sqrt{\lambda_i}} = \frac{U\Lambda^{\frac12}W^\top v_i}{\sqrt{\lambda_i}}$. From the proof of Lemma 11 in Bartlett et al.
(2019), we know that for any $\sigma_x$ there exist constants $c_0$ and $c$ such that if $k_0 \le \frac{n}{c}$, with probability at least $1-e^{-n/c}$ the first part can be bounded as
$$\sum_{1\le i\le k_1}\sum_{1\le j\le k_3}\frac{\lambda_i}{\mu_j}(v_i^\top w_j)^2 \le \sum_{1\le i\le k_1}\lambda_i^2\,\tilde{x}_i^\top(XX^\top)^{-2}\tilde{x}_i \le c\,\frac{k_1}{n}, \quad (51)$$
which gives a bound for the first part. For the second part we interchange the order of summation and obtain
$$\sum_{i>k_1}\sum_{1\le j\le k_3}\frac{\lambda_i}{\mu_j}(v_i^\top w_j)^2 = \sum_{1\le j\le k_3}\frac{1}{\mu_j}\sum_{i>k_1}\lambda_i(v_i^\top w_j)^2 \le \frac{1}{c_3\,c(t,n)\sum_{i>0}\lambda_i}\sum_{1\le j\le k_3}\sum_{i>k_1}\lambda_i(v_i^\top w_j)^2 \le \frac{\lambda_{k_1+1}}{c_3\,c(t,n)\sum_{i>0}\lambda_i}\sum_{1\le j\le k_3}\sum_{i>k_1}(v_i^\top w_j)^2 \le \frac{\lambda_{k_1+1}}{c_3\,c(t,n)\sum_{i>0}\lambda_i}\sum_{1\le j\le k_3}1 = \frac{\lambda_{k_1+1}\,k_3}{c_3\,c(t,n)\sum_{i>0}\lambda_i} \le c\,\frac{k_3}{c(t,n)\,n} \quad (52)$$
for $c$ large enough, where the second inequality uses $\mu_j > c_3\,c(t,n)\sum_{i>0}\lambda_i$ for $j \le k_3$, and the last inequality uses the definition of $k_1$. Putting (51) and (52) together, and noting that $k_3 \le k_2$ with high probability as given in Lemma A.8, we know there exists a constant $c$ such that with probability at least $1-e^{-n/c}$,
$$① \le c\,\frac{k_1}{n} + c\,\frac{k_2}{c(t,n)\,n}. \quad (53)$$

Bounding ②. Similar to the first step in bounding ①, we note that
$$② = \mathrm{Tr}\left[\mathrm{diag}\left\{\underbrace{0,\cdots,0}_{k_3},\frac{1}{\mu_{k_3+1}}\left(1-\left(1-\tfrac{\lambda}{n}\mu_{k_3+1}\right)^t\right)^2,\cdots,\frac{1}{\mu_n}\left(1-\left(1-\tfrac{\lambda}{n}\mu_n\right)^t\right)^2\right\}W^\top\Sigma W\right]. \quad (54)$$
From Bernoulli's inequality and the definition of $k_3$, for any $k_3+1 \le j \le n$ we have
$$\frac{1}{\mu_j}\left(1-\left(1-\frac{\lambda}{n}\mu_j\right)^t\right)^2 \le \frac{1}{\mu_j}\left(\frac{\lambda}{n}\mu_j t\right)^2 = \left(\frac{\lambda t}{n}\right)^2\mu_j \le c_3\left(\frac{\lambda t}{n}\right)^2 c(t,n)\sum_{i>0}\lambda_i.$$
Hence,
$$② \le c_3\,c(t,n)\left(\frac{\lambda t}{n}\right)^2\sum_{i>0}\lambda_i\,\mathrm{Tr}[W^\top\Sigma W] \le c_3\,c(t,n)\left(\frac{\lambda t}{n}\sum_{i>0}\lambda_i\right)^2. \quad (55)$$

Putting things together. From the bounds for ① and ② given above, we know that there exists a constant $c$ such that with probability at least $1-e^{-n/c}$, the trace of the variance matrix $C$ has the following upper bound:
$$\mathrm{Tr}[C] \le c\left(\frac{k_1}{n} + \frac{k_2}{c(t,n)\,n} + c(t,n)\left(\frac{\lambda t}{n}\sum_{i>0}\lambda_i\right)^2\right).$$

Proof of Theorem 5.1. Lemmas A.4 and A.7 together with Theorem A.1 give the complete proof.
Note that the high-probability events in the proof are independent of the epoch number $t$, which implies that the theorem holds uniformly over all $t \in \mathbb{N}$.
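The variance matrix $C$ from Lemma A.4 can also be evaluated directly on synthetic data. The sketch below (all sizes, the spectrum, and the learning rate are illustrative assumptions) computes $\mathrm{Tr}[C]$ as a function of $t$; it grows with $t$, roughly like $(\lambda t/n)^2$ early on, and saturates at the min-norm interpolator's variance as $t \to \infty$.

```python
import numpy as np

# Tr[C] for C = (XX^T)^{-1} D X Sigma X^T D (XX^T)^{-1}, D = I - (I - (lam/n) XX^T)^t.
rng = np.random.default_rng(2)
n, p, lam = 30, 200, 0.05
spectrum = 1.0 / np.arange(1, p + 1) ** 2          # lambda_k = k^{-2}
X = rng.normal(size=(n, p)) * np.sqrt(spectrum)    # rows ~ N(0, Sigma)
Sigma = np.diag(spectrum)
G = X @ X.T                                        # XX^T, invertible a.s.
G_inv = np.linalg.inv(G)
I = np.eye(n)

def trace_C(t):
    D = I - np.linalg.matrix_power(I - (lam / n) * G, t)
    return np.trace(G_inv @ D @ X @ Sigma @ X.T @ D @ G_inv)

vals = [trace_C(t) for t in (1, 10, 100, 10000)]
assert all(v > 0 for v in vals)                    # C is PSD, so Tr[C] > 0
assert vals[0] < vals[-1]                          # the variance grows with t
```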

A.6 PROOF OF COMPATIBILITY RESULTS

Corollary A.2 (Corollary 5.1 restated). Let Assumptions 1, 2 and 3 hold, and fix a constant $c(t,n)$. Suppose $k_0 = O(n)$, $k_1 = o(n)$, $r(\Sigma) = o(n)$, and $\lambda = O\big(\frac{1}{\sum_{i>0}\lambda_i}\big)$. Then there exists a sequence of positive constants $\{\delta_n\}_{n\ge0}$ which converges to $0$, such that with probability at least $1-\delta_n$, the excess risk is consistent for $t \in \big(\omega(\frac{1}{\lambda}), o(\frac{n}{\lambda})\big)$, i.e., $R(\theta_t) = o(1)$. Furthermore, for any positive constant $\delta$, with probability at least $1-\delta$, the minimal excess risk on the training trajectory can be bounded as
$$\min_t R(\theta_t) \lesssim \frac{\max\{\sqrt{r(\Sigma)},1\}}{\sqrt{n}} + \frac{\max\{k_1,1\}}{n}.$$

Proof. According to Lemma A.7, with probability at least $1-\frac{\delta_n}{2}$, the following inequality holds for all $t$:
$$B(\theta_t) \lesssim \frac{1}{\lambda t} + \max\left\{\sqrt{\frac{r(\Sigma)}{n}},\ \frac{r(\Sigma)}{n},\ \sqrt{\frac{\log(1/\delta_n)}{n}},\ \frac{\log(1/\delta_n)}{n}\right\}. \quad (58)$$
If $\delta_n$ is chosen such that $\log\frac{1}{\delta_n} = o(n)$, then with probability at least $1-\frac{\delta_n}{2}$ we have, for all $t = \omega(\frac{1}{\lambda})$,
$$B(\theta_t) = o(1) \quad (59)$$
in the sample size $n$. When $c(t,n)$ is a constant, we have $k_2 \le k_1$ with high probability as given in Lemma A.8. Therefore, according to Lemma A.4 and Theorem A.1, we know that if $\log\frac{1}{\delta_n} = O(n)$, then with probability at least $1-\frac{\delta_n}{2}$, the following bound holds for all $t$:
$$V(\theta_t) \lesssim \log\frac{1}{\delta_n}\left(\frac{k_1}{n} + \frac{\lambda^2t^2}{n^2}\right). \quad (60)$$
Since $k_1 = o(n)$ and $t = o(\frac{n}{\lambda})$, we have $\frac{k_1}{n} + \frac{\lambda^2t^2}{n^2} = o(1)$. Therefore, there exists a mildly decaying sequence $\delta_n$ with $\log\frac{1}{\delta_n}\big(\frac{k_1}{n}+\frac{\lambda^2t^2}{n^2}\big) = o(1)$, i.e., $V(\theta_t) = o(1)$. (61)

To conclude, $\delta_n$ can be chosen such that
$$\log\frac{1}{\delta_n} = \omega(1),\qquad \log\frac{1}{\delta_n} = O(n),\qquad \log\frac{1}{\delta_n} = O\left(\Big(\frac{k_1}{n}+\frac{\lambda^2t^2}{n^2}\Big)^{-1}\right),$$
and then with probability at least $1-\delta_n$, the excess risk is consistent for all $t\in\big(\omega(\frac{1}{\lambda}), o(\frac{n}{\lambda})\big)$:
$$R(\theta_t) = B(\theta_t) + V(\theta_t) = o(1).$$
This completes the proof of the first claim. The second claim follows from Equations (58) and (60) by setting $t = \Theta\big(\frac{\sqrt n}{\lambda}\big)$.

Lemma A.9 (Lemma 5.1 restated). For any fixed (i.e., independent of the sample size $n$) feature covariance $\Sigma$ satisfying Assumption 1, we have $k_1(n) = o(n)$.

Proof.
Suppose there exists a constant $c$ such that $k_1(n) \ge cn$ for arbitrarily large $n$. By the definition of $k_1$, we know that $\lambda_l \ge \frac{c_1\sum_{i>0}\lambda_i}{n}$ holds for $1 \le l \le k_1(n)$. Hence, applying this at sample size $n2^{i+1}$,
$$\sum_{l=\lfloor cn2^i\rfloor+1}^{\lfloor cn2^{i+1}\rfloor}\lambda_l \gtrsim \frac{c_1\sum_{i>0}\lambda_i}{n2^{i+1}}\cdot cn2^i \gtrsim \sum_{i>0}\lambda_i.$$
Summing over all blocks $i$ leads to a contradiction since $\sum_{i>0}\lambda_i < \infty$, which finishes the proof.

Theorem A.2 (Theorem 4.1 restated). Consider the overparameterized linear regression setting defined in Section 4.1, and let Assumptions 1, 2 and 3 hold. Assume the learning rate satisfies $\lambda = O(\frac{1}{\mathrm{Tr}(\Sigma)})$. Then under the condition that $k_0 = O(n)$, $k_1 = o(n)$ and $r(\Sigma) = o(n)$, gradient descent is compatible with overparameterized linear regression in the region $T_n = \big(\omega(\frac{1}{\lambda}), o(\frac{n}{\lambda})\big)$, namely,
$$\sup_{t\in T_n} R(\theta_t) \xrightarrow{P} 0 \quad \text{as } n\to\infty.$$
Furthermore, if the feature dimension $p = \infty$ and the data distribution does not change with $n$, then the condition $k_0 = O(n)$ alone suffices for compatibility.

Proof. According to Corollary 5.1 and Lemma 5.1, for any $\varepsilon > 0$ there exist $\{\delta_n\}_{n>0}$ and $N$ such that for any sample size $n > N$, we have
$$\Pr\Big[\sup_{t\in T_n} R(\theta_t) > \varepsilon\Big] \le \delta_n.$$
Letting $n\to\infty$ shows that $\sup_{t\in T_n}R(\theta_t)$ converges to $0$ in probability, which completes the proof.

B EXAMPLES AND DISCUSSIONS B.1 EXAMPLE DISTRIBUTIONS

We apply our bound in Corollary 5.1 to several examples. In each example, we give the data distribution, the time interval, and the corresponding generalization bound. These distributions are widely discussed in Bartlett et al. (2019); Zou et al. (2021).

Example B.1. Under the same conditions as Theorem 5.1, let $\Sigma$ denote the feature covariance matrix. We consider the following examples:

1. (Inverse Polynomial). If the spectrum of $\Sigma$ satisfies $\lambda_k = \frac{1}{k^\alpha}$ for some $\alpha > 1$, we derive that $k_0 = \Theta(n)$ and $k_1 = \Theta(n^{\frac{1}{\alpha}})$. Therefore, $\min_t V(\theta_t) = O\big(n^{\frac{1-\alpha}{\alpha}}\big)$ and $\min_t R(\theta_t) = O\big(n^{-\min\{\frac{\alpha-1}{\alpha},\frac12\}}\big)$.

2. (Inverse Log-Polynomial). If the spectrum of $\Sigma$ satisfies $\lambda_k = \frac{1}{k\log^\beta(k+1)}$ for some $\beta > 1$, we derive that $k_0 = \Theta\big(\frac{n}{\log n}\big)$ and $k_1 = \Theta\big(\frac{n}{\log^\beta n}\big)$. Therefore, $\min_t V(\theta_t) = O\big(\frac{1}{\log^\beta n}\big)$ and $\min_t R(\theta_t) = O\big(\frac{1}{\log^\beta n}\big)$.

3. (Constant). If the spectrum of $\Sigma$ satisfies $\lambda_k = \frac{1}{n^{1+\varepsilon}}$ for $1 \le k \le n^{1+\varepsilon}$ and some $\varepsilon > 0$, we derive that $k_0 = 0$ and $k_1 = 0$. Therefore, $\min_t V(\theta_t) = O\big(\frac{1}{n}\big)$ and $\min_t R(\theta_t) = O\big(\frac{1}{\sqrt n}\big)$.

4. (Piecewise Constant). If the spectrum of $\Sigma$ satisfies
$$\lambda_k = \begin{cases} \frac{1}{s} & 1\le k\le s,\\ \frac{1}{d-s} & s+1\le k\le d,\end{cases}$$
where $s = n^r$, $d = n^q$, $0 < r \le 1$, $q \ge 1$, we derive that $k_0 = n^r$ and $k_1 = n^r$. Therefore, $\min_t V(\theta_t) = O(n^{r-1})$ and $\min_t R(\theta_t) = O\big(n^{-\min\{1-r,\frac12\}}\big)$.
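The effective dimensions in these examples can be computed directly from a truncated spectrum. The sketch below (with illustrative constants $c_0 = c_1 = 1$ and a finite truncation $p$, both our own choices) recovers the predicted orders for the inverse polynomial case.

```python
import numpy as np

# Compute k_0 and k_1 from their definitions for a truncated spectrum.
def effective_dims(spectrum, n, c0=1.0, c1=1.0):
    suffix = spectrum[::-1].cumsum()[::-1]         # suffix[l] = sum_{i > l} lambda_i
    k0 = int(np.argmax(spectrum <= c0 * suffix / n))
    k1 = int(np.argmax(spectrum <= c1 * spectrum.sum() / n))
    return k0, k1

n, p, alpha = 1000, 100000, 2.0
spec = 1.0 / np.arange(1, p + 1) ** alpha          # lambda_k = k^{-alpha}
k0, k1 = effective_dims(spec, n)
# Example 1 predicts k_0 = Theta(n) and k_1 = Theta(n^{1/alpha}) = Theta(sqrt(n)):
assert k1 < k0
assert 10 <= k1 <= 100 and 100 <= k0 <= 10000
```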

B.2 COMPARISONS WITH BENIGN OVERFITTING RESULTS

We summarize the results of Bartlett et al. (2019); Zou et al. (2021) and our results in Table 1, and provide a detailed comparison with them below.

Comparison to Bartlett et al. (2019). In this seminal work, the authors study the excess risk of the min-norm interpolator. As discussed before, the min-norm interpolator is the convergence point of gradient descent in the overparameterized linear regression setting. One of the main results in Bartlett et al. (2019) is a tight bound on the variance part of the excess risk:
$$V(\hat\theta) = O\left(\frac{k_0}{n} + \frac{n}{R_{k_0}(\Sigma)}\right), \quad (66)$$
where $\hat\theta = X^\top(XX^\top)^{-1}Y$ denotes the min-norm interpolator, and $R_k(\Sigma) = \big(\sum_{i>k}\lambda_i\big)^2/\sum_{i>k}\lambda_i^2$ denotes another type of effective rank. By introducing the time factor, Theorem 5.1 improves over Equation (66) in at least two aspects. Firstly, Theorem 5.1 guarantees the consistency of the gradient descent dynamics over a broad range of step numbers $t$, while Bartlett et al. (2019) only study the limiting behavior of the dynamics as $t\to\infty$. Secondly, Theorem 5.1 implies that the excess risk of the early-stopped gradient descent solution can be much better than that of the min-norm interpolator. Compared to the bound in Equation (66), the bound in Corollary 5.1 (a) replaces $k_0$ with a much smaller quantity $k_1$, and (b) drops the second term involving $R_{k_0}(\Sigma)$. Therefore, we can derive a consistent bound for overparameterized linear regression with an early-stopped solution, even though the excess risk of the limiting point (the min-norm interpolator) can be $\Omega(1)$ (see the first example in B.1). We can further derive a data-dependent time interval on which the bound is consistent, which cannot be directly obtained from Bartlett et al. (2019).

Comparison to Zou et al. (2021). Zou et al. (2021) study a different setting, which focuses on the one-pass stochastic gradient descent solution of linear regression.
The authors prove a bound on the excess risk of the form
$$R(\bar\theta_t) = O\left(\frac{k_1}{n} + n\,\frac{\sum_{i>k_1}\lambda_i^2}{\big(\sum_{i>0}\lambda_i\big)^2}\right), \quad (67)$$
where $\bar\theta_t$ denotes the parameter obtained using stochastic gradient descent (SGD) with constant step size at epoch $t$. Similar to our bound, Equation (67) also uses the effective dimension $k_1$ to characterize the variance term. However, we emphasize that Zou et al. (2021) derive their bound in a rather different scenario from ours, namely the one-pass SGD scenario. During one-pass SGD training, a fresh data point is used to perform a stochastic gradient step in each epoch, and therefore they set $t = \Theta(n)$ by default. In comparison, we apply standard full-batch gradient descent, and thus the time can be chosen more flexibly. Besides, our results in Corollary 5.1 improve the bound in Equation (67) by dropping the second term. We refer to the third and fourth examples in Example B.1 for more details, where our bound outperforms Zou et al. (2021) when $\varepsilon < 1/2$ or $q < \min\{2-r, 3/2\}$.

B.3 COMPARISONS WITH STABILITY-BASED BOUNDS

In this section, we show that Theorem 5.1 gives provably better upper bounds than the stability-based method. We cite a result from Teng et al. (2021), which uses stability arguments to tackle overparameterized linear regression under similar assumptions.

Theorem B.1 (modified from Theorem 1 in Teng et al. (2021)). Under the overparameterized linear regression setting, assume that $\|x\| \le 1$, $|\varepsilon| \le V$, and $w = \frac{\theta^{*\top}x}{\sqrt{\theta^{*\top}\Sigma\theta^*}}$ is $\sigma_w^2$-subgaussian. Let $B_t = \sup_{\tau\in[t]}\|\theta_\tau\|$. Then the following inequality holds with probability at least $1-\delta$:
$$R(\theta_t) = \tilde O\left(\max\{1,\ \theta^{*\top}\Sigma\theta^*\sigma_w^2,\ (V+B_t)^2\}\sqrt{\frac{\log(4/\delta)}{n}} + \frac{\|\theta^*\|^2}{\lambda t} + \frac{\lambda t (V+B_t)^2}{n}\right). \quad (68)$$

Theorem B.1 applies the general stability-based results (Hardt et al., 2016; Feldman & Vondrák, 2019) to the overparameterized linear regression setting, by replacing the bounded-Lipschitz condition with a bounded-domain condition. A fine-grained analysis (Lei & Ying, 2020) may remove the bounded-Lipschitz condition, but it additionally requires zero noise or a decaying learning rate, which differs from our setting. We omit the excess risk decomposition technique adopted in Teng et al. (2021) for clarity of presentation. Theorem B.1 cannot directly yield the stability counterpart of Theorem 4.1, since obtaining a high-probability bound on $B_t$ requires a delicate trajectory analysis and is a non-trivial task. Therefore, data-irrelevant methods such as stability-based bounds cannot be directly applied to our setting. Even if one replaces $B_t$ in Equation (68) with its expectation, which is easier to handle (this modification requires adding concentration-related terms and makes the bound in Equation (68) looser), we can still demonstrate that Theorem 5.1 is tighter than the corresponding stability-based analysis by providing a lower bound on $\mathbb{E}[B_t^2]$, which in turn implies a lower bound on the right-hand side of Equation (68).

Theorem B.2. Let Assumptions 1, 2 and 3 hold, and suppose $\lambda = O\big(\frac{1}{\sum_{i>0}\lambda_i}\big)$.
Suppose further that the conditional variance of the noise $\varepsilon|x$ is lower bounded by $\sigma_\varepsilon^2$. Then there exists a constant $c$ such that with probability at least $1-ne^{-n/c}$, we have for $t = o(n)$:
$$\mathbb{E}\|\theta_t\|^2 = \Omega\left(\frac{\lambda^2t^2}{n}\sum_{i>k_0}\lambda_i\right).$$
First we prove the following lemma, which bounds the number of large $\mu_i$.

Example B.4 (Inverse Log-Polynomial). Suppose the spectrum of $\Sigma$ satisfies $\lambda_k = \frac{1}{k\log^\beta(k+1)}$ for some $\beta > 1$. For this distribution, we have $\zeta = \frac12$, $\gamma = 1$, and the compatibility region is $(0, n^{\frac34})$, which is smaller than the compatibility region $(0, n)$ given by Corollary 5.1.

B.6 DISCUSSIONS ON EXTENSIONS TO KERNEL REGRESSION

In this section, we discuss extending the analysis of overparameterized linear regression to the kernel regression setting. Let $\mathcal H$ denote an infinite-dimensional Hilbert space equipped with inner product $\langle\cdot,\cdot\rangle_{\mathcal H}$, and let $\phi: \mathbb R^d \to \mathcal H$ denote a feature map. Consider the following class of functions:
$$\mathcal F = \{f: \mathbb R^d\to\mathbb R \mid f(x) = \langle\theta, \phi(x)\rangle_{\mathcal H}\}.$$
Let $(x,y)\in\mathbb R^d\times\mathbb R$ denote the data vector and the response, following a joint distribution $\mathcal D$. The goal of kernel regression is to find a function $f$, parameterized by $\theta$, that minimizes the population loss
$$L(\theta) = \frac12\,\mathbb E\,(y - f(x))^2 = \frac12\,\mathbb E\,\big(y - \langle\theta,\phi(x)\rangle_{\mathcal H}\big)^2. \quad (78)$$
Therefore, kernel regression is equivalent to solving a linear regression problem on the transformed data $(\phi(x), y)$. By replacing $x$ with $\phi(x)$, the notations and results in Section 4 can be naturally extended to the kernel regression case, detailed as follows. Given a dataset $\{(x_i,y_i)\}_{1\le i\le n}$ sampled i.i.d. from $\mathcal D$, consider the following gradient descent dynamics, analogous to Equation (3):
$$\theta_{t+1} = \theta_t - \frac{\lambda}{n}\,\phi(X)^\top\big(\phi(X)\theta_t - Y\big), \quad (79)$$
where $\phi(X) = (\phi(x_1),\cdots,\phi(x_n))^\top$ and $Y = (y_1,\cdots,y_n)^\top$. Note that these dynamics take place in the feature space $\mathcal H$. The following corollary characterizes the compatibility between kernel regression and gradient descent.

Corollary B.1. Assume the distribution of $(\phi(x), y)$ satisfies Assumptions 1, 2, 3, and does not change with the sample size $n$. Let $\Sigma_\phi = \mathbb E[\phi(x)\phi(x)^\top]$ denote the feature covariance. Then under the condition that the effective dimension $k_0(\Sigma_\phi) = O(n)$ and the learning rate $\lambda = O\big(\frac{1}{\mathrm{Tr}(\Sigma_\phi)}\big)$, gradient descent using Equation (79) is compatible with kernel regression.

Proof. The corollary follows from Theorem 4.1, noting that its proof holds for feature vectors in an infinite-dimensional Hilbert space.

We give several remarks on the differences between the generalization analysis of kernel regression and that of overparameterized linear regression.
Firstly, Corollary B.1 assumes that $\phi(x)$ has i.i.d. subgaussian entries after normalization (see Assumption 1.3), which can be hard to satisfy due to the non-linearity of the feature map $\phi(\cdot)$. Secondly, Corollary B.1 focuses on gradient descent in the feature space $\mathcal H$. Another perspective is to apply the representer theorem (Mohri et al., 2012) to express the weight $\theta$ as a linear combination of transformed inputs, $\theta = \sum_{1\le j\le n}\alpha_j\phi(x_j)$, and analyze the gradient descent dynamics of the $n$-dimensional vector $\alpha = (\alpha_1,\cdots,\alpha_n)$. Although both converge to the min-norm solution, these two gradient descent dynamics are different, and Theorem 4.1 is not directly applicable to gradient descent on $\alpha$. Thirdly, previous works (El Karoui, 2010; Liang & Rakhlin, 2018; Mei & Montanari, 2022) prove that the spectrum of the feature covariance $\Sigma_\phi$ has an approximately linear relationship with the spectrum of the data covariance $\Sigma = \mathbb E[xx^\top]$ when the feature map $\phi(\cdot)$ corresponds to an inner-product-type kernel. This can be used to derive a more intrinsic compatibility condition using only the data distribution rather than the feature distribution. We leave a more refined analysis of data-algorithm compatibility in the kernel regression regime to future work.
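The feature-space dynamics of Equation (79) can be simulated with an explicit finite-dimensional feature map standing in for $\phi$. The sketch below uses random Fourier features approximating an RBF kernel; this choice of map, target, and all hyperparameters are our own illustrative assumptions, not choices made in the paper.

```python
import numpy as np

# Gradient descent in an explicit D-dimensional feature space, mirroring
# theta_{t+1} = theta_t - (lam/n) phi(X)^T (phi(X) theta_t - Y).
rng = np.random.default_rng(3)
n, d, D = 40, 5, 500
W = rng.normal(size=(D, d))                        # random frequencies
b = rng.uniform(0, 2 * np.pi, size=D)

def phi(x):                                        # x: (m, d) -> (m, D)
    return np.sqrt(2.0 / D) * np.cos(x @ W.T + b)

X = rng.normal(size=(n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)     # noisy nonlinear target
Phi = phi(X)
lam, theta = 0.5, np.zeros(D)
losses = []
for _ in range(2000):                              # gradient descent in feature space
    theta = theta - (lam / n) * Phi.T @ (Phi @ theta - Y)
    losses.append(np.mean((Phi @ theta - Y) ** 2))
assert losses[-1] < losses[0]                      # training loss decreases
```

Note that this is the dynamics on $\theta$ in feature space, not the equivalent dynamics on the dual vector $\alpha$ discussed above; the two evolve differently even though both converge to the min-norm solution.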

B.7 VARYING t, VARYING c(t, n)

Although setting $c(t,n)$ to a constant as in Corollary 5.1 suffices to prove Theorem 4.1, in this section we show that the choice of $c(t,n)$ can be much more flexible. Specifically, we provide a concrete example demonstrating that a non-constant choice of $c(t,n)$ makes Theorem 5.1 produce strictly larger compatibility regions: for the inverse polynomial spectrum $\lambda_k = \frac{1}{k^\alpha}$ with $\alpha > 1$, the calculation in Section B.9 shows that the variance bound remains consistent for $t$ up to $\Theta\big(n^{\frac{3\alpha+1}{2\alpha+2}}/\lambda\big)$, where $n^{\frac{3\alpha+1}{2\alpha+2}} = \omega(n)$ since $\alpha > 1$. In this example, Theorem 5.1 outperforms all $O(\frac{t}{n})$-type bounds, which become vacuous when $t = \omega(n)$.

B.8 CALCULATIONS IN SECTION B.1

We calculate the quantities r(Σ), k 0 , k 1 , k 2 for the example distributions in B.1. The results validate that k 1 is typically a much smaller quantity than k 0 , and k 1 serves as a proxy for k 2 in the constant c(t, n) setting.

1. Calculations for $\lambda_k = \frac{1}{k^\alpha}$, $\alpha > 1$. Define $r_k(\Sigma) = \frac{\sum_{i>k}\lambda_i}{\lambda_{k+1}}$ as in Bartlett et al. (2019). Since $\sum_{i>k}\frac{1}{i^\alpha} = \Theta\big(\frac{1}{k^{\alpha-1}}\big)$, we have $r_k(\Sigma) = \Theta\big(\frac{1/k^{\alpha-1}}{1/k^\alpha}\big) = \Theta(k)$. Hence $k_0 = \Theta(n)$,⁷ and the conditions of Theorem 5.1 are satisfied. As $\sum_{i>0}\lambda_i < \infty$, by its definition $k_1$ is the smallest $l$ such that $\lambda_{l+1} = O(\frac{1}{n})$. Therefore, $k_1 = \Theta(n^{\frac{1}{\alpha}})$. Similarly, $k_2 = \Theta(n^{\frac{1}{\alpha}})$.

2. Calculations for $\lambda_k = \frac{1}{k\log^\beta(k+1)}$, $\beta > 1$. We have $\sum_{i>k}\frac{1}{i\log^\beta(i+1)} = \Theta\big(\int_k^\infty\frac{dx}{x\log^\beta x}\big) = \Theta\big(\frac{1}{\log^{\beta-1}k}\big)$, which implies $r_k(\Sigma) = \Theta(k\log k)$. Solving $k_0\log k_0 = \Theta(n)$, we have $k_0 = \Theta\big(\frac{n}{\log n}\big)$. By the definition of $k_1$, we know that $k_1$ is the smallest $l$ such that $l\log^\beta(l+1) \ge \Theta(n)$. Therefore, $k_1 = \Theta\big(\frac{n}{\log^\beta n}\big)$, and $k_2 = \Theta\big(\frac{n}{\log^\beta n}\big)$ by similar calculations.

3. Calculations for $\lambda_i = \frac{1}{n^{1+\varepsilon}}$, $1\le i\le n^{1+\varepsilon}$, $\varepsilon > 0$. Since $r_0(\Sigma) = n^{1+\varepsilon}$, we have $k_0 = 0$. By the definitions of $k_1$ and $k_2$, we also have $k_1 = k_2 = 0$.

4. Calculations for the piecewise constant spectrum $\lambda_k = \frac{1}{s}$ for $1\le k\le s$ and $\lambda_k = \frac{1}{d-s}$ for $s+1\le k\le d$, with $s = n^r$, $d = n^q$, $0<r\le1$, $q\ge1$. For $0\le k<s$, $r_k(\Sigma) = \Theta\big(\frac{1}{1/s}\big) = \Theta(n^r) = o(n)$, while $r_s(\Sigma) = \frac{1}{1/(d-s)} = \Theta(n^q) = \omega(n)$. Therefore, $k_0 = s = n^r$. Similarly, noting that $\lambda_k = \frac{1}{n^r} = \omega(\frac{1}{n})$ for $1\le k\le s$ and $\lambda_{s+1} = \Theta(\frac{1}{n^q}) = O(\frac{1}{n})$, we know that $k_1 = s = n^r$, and likewise $k_2 = s = n^r$.

⁷ The calculations for $k_0$, $k_1$ and $k_2$ in this section only apply when $n$ is sufficiently large.

B.9 CALCULATIONS FOR $\lambda_k = \frac{1}{k^\alpha}$, $\alpha > 1$ IN SECTION B.7

Set $c(t,n) = \frac{1}{n^\beta}$, where $\beta > 0$ will be chosen later, and write $t = \frac{n^\tau}{\lambda}$. First we calculate $k_2$ under this choice of $c(t,n)$. Note that $\sum_{i>k}\frac{1}{i^\alpha} = \Theta\big(\frac{1}{k^{\alpha-1}}\big)$. Therefore, $k_2$ is the smallest $k$ such that $\frac{1}{k^{\alpha-1}} + \frac{n}{k^\alpha} = O\big(\frac{1}{n^\beta}\big)$. For the bound on $V(\theta_t)$ to be consistent, we need $k_2 = o(n)$; hence $\frac{1}{k^{\alpha-1}} = O\big(\frac{n}{k^\alpha}\big)$, which implies $k_2 = n^{\frac{\beta+1}{\alpha}}$. Plugging the values of $c(t,n)$ and $k_2$ into our bound, we have
$$V(\theta_t) = O\left(n^{(\frac{1}{\alpha}-1)+(\frac{1}{\alpha}+1)\beta} + n^{2\tau-\beta-2}\right),$$
which attains its minimum $\Theta\big(n^{\frac{2\alpha\tau-3\alpha+2\tau-1}{2\alpha+1}}\big)$ at $\beta = \Theta\big(\frac{2\alpha\tau-\alpha-1}{2\alpha+1}\big)$. For $V(\theta_t) = O(1)$, we need $\tau \le \frac{3\alpha+1}{2\alpha+2}$.
For $\beta \ge 0$, we need $\tau \ge \frac{\alpha+1}{2\alpha}$. Putting these together gives the range of $t$ for which the above calculation applies.

B.10 DISCUSSION ON D n

In this paper, the distribution $\mathcal D$ is regarded as a sequence of distributions $\{\mathcal D_n\}$ which may depend on the sample size $n$. This treatment arises from overparameterization together with the asymptotic nature of compatibility. In the definition of compatibility, we require $n \to \infty$. In this case, overparameterization requires that the dimension $p$ (if finite) cannot be independent of $n$, since letting $n \to \infty$ with fixed $p$ would break overparameterization. Therefore, the covariance $\Sigma$ also has $n$-dependency, since $\Sigma$ is closely related to $p$. Several points are worth mentioning: (1) Similar arguments generally appear in related works; for example, Bartlett et al. (2019) use a similar argument when discussing the definition of benign covariance (page 7 in the arXiv version). (2) One can avoid such issues by letting $p = \infty$. This is why we discuss the special case $p = \infty$ in Theorem 4.1. (3) If $p$ is a fixed finite constant that does not vary with $n$, the problem becomes underparameterized, and a consistent generalization bound is then trivial to obtain.

B.11 COMPARISONS WITH PAC-BAYES BOUNDS

As one of the most exciting techniques in generalization analysis, PAC-Bayes theory works well both theoretically and empirically. Typically, PAC-Bayes bounds rely on the distance between a prior distribution, which is unrelated to the training procedure, and a posterior distribution obtained after training (e.g., an isotropic Gaussian distribution centered at a trained parameter). In this sense, PAC-Bayes considers a different regime from this paper, since the classifier returned in this paper is a single parameter rather than a distribution. Moreover, PAC-Bayes analyses usually do not explicitly focus on early-stopped iterates, while the notion of compatibility relies heavily on early-stopping arguments. By considering early stopping, which encodes algorithm information, we can derive tighter generalization bounds under weaker assumptions. More interestingly, the PAC-Bayes framework is not mutually exclusive with trajectory analysis: one can indeed introduce trajectory analysis into PAC-Bayes techniques and derive a compatibility region based on a trajectory-based PAC-Bayes theory, thereby explicitly incorporating more algorithm information into the PAC-Bayes framework. We leave more detailed discussions to future work. We next describe the differences in more detail: 1. Different settings. In PAC-Bayes analysis, the returned classifier is a distribution instead of a fixed parameter. Forcing PAC-Bayes analysis into the fixed-parameter regime is costly because it is hard to define a distributional distance (e.g., KL divergence) between two single-point distributions. 2. Different classifiers. Although both methods consider "a bag of" classifiers, they are fundamentally different. In the PAC-Bayes framework, the trained random classifiers are regarded as drawn independently from the posterior distribution. In trajectory analysis, however, all classifiers along the training process are dependent, which makes trajectory analysis more challenging in this sense. 3.
Different characterizations. The PAC-Bayes framework characterizes the expectation of the generalization loss over the randomness of the posterior distribution on parameters, which does not explicitly capture the time factor. In comparison, the trajectory analysis in this paper focuses on providing a compatibility region in which the generalization error is uniformly consistent, explicitly taking the time factor into account.

Figure 2: Training plots for the covariance spectra (a) $\lambda_i = \frac{1}{i}$, (b) $\lambda_i = \frac{1}{i^2}$, (c) $\lambda_i = \frac{1}{i^3}$, (d) $\lambda_i = \frac{1}{i\log(i+1)}$, (e) $\lambda_i = \frac{1}{i\log^2(i+1)}$, (f) $\lambda_i = \frac{1}{i\log^3(i+1)}$.

C.1 DETAILS AND DISCUSSIONS FOR LINEAR REGRESSION EXPERIMENTS

In this section, we provide the experimental details for the linear regression experiments and present additional empirical results. In Section 6, we consider six overparameterized linear regression instances with input dimension $p = 1000$ and sample size $n = 100$. The feature vectors are independently sampled from zero-mean Gaussian distributions whose covariances are diagonal with entries $\lambda_i = \frac{1}{i}$, $\frac{1}{i^2}$, $\frac{1}{i^3}$, $\frac{1}{i\log(i+1)}$, $\frac{1}{i\log^2(i+1)}$, $\frac{1}{i\log^3(i+1)}$, respectively. For each feature $x$, we construct the response as $y = x^\top\theta^* + \varepsilon$, where $\theta^*$ is sampled from a $p$-dimensional standard Gaussian distribution for each instance, and $\varepsilon$ is sampled from a standard Gaussian distribution. We run gradient descent with learning rate $\lambda = 0.001$ on the above instances. We also compute the min-norm excess risk via its closed form, with $1\times10^{-4}$ weight decay on the parameter $\theta$ to avoid numerical instability. Note that the regularization only reduces the final excess risk and does not jeopardize the correctness of our conclusions. The linear regression experiment in Figure 1 follows the setting described in Section 6. Although the final iterate does not interpolate the training data, the results suffice to demonstrate the gap between the early-stopping and final-iterate excess risks. The training plots for the different covariances are given in Figure 2. We also provide experimental results for sample sizes $n = 50$ and $n = 200$ and feature dimensions $p = 500$ and $p = 2000$, analogous to those in Section 6. The settings are the same as described in the main text, except for the sample size and feature dimension. The optimal excess risk and min-norm excess risk are provided in Tables 3, 4, 5 and 6. The tables indicate that the three observations given in Section 6 hold across different sample sizes $n$.
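A minimal sketch of this experimental pipeline is given below. The sizes are reduced from the paper's $p = 1000$, $n = 100$ for speed, and we omit the weight decay used for numerical stabilization; the comparison at the end is the typical observation, not a guarantee for every random draw.

```python
import numpy as np

# Early-stopped GD vs. the min-norm interpolator on one synthetic instance.
rng = np.random.default_rng(4)
p, n, lr, steps = 300, 50, 0.001, 5000
spectrum = 1.0 / np.arange(1, p + 1) ** 2          # lambda_i = 1/i^2
X = rng.normal(size=(n, p)) * np.sqrt(spectrum)
theta_star = rng.normal(size=p)
Y = X @ theta_star + rng.normal(size=n)
Sigma = np.diag(spectrum)

def excess_risk(theta):
    diff = theta - theta_star
    return 0.5 * diff @ Sigma @ diff

theta, risks = np.zeros(p), []
for _ in range(steps):
    theta = theta + (lr / n) * X.T @ (Y - X @ theta)
    risks.append(excess_risk(theta))

min_norm = X.T @ np.linalg.solve(X @ X.T, Y)       # min-norm interpolator
assert min(risks) < risks[0]                       # training reduces excess risk
# With this decaying spectrum, min(risks) is typically well below
# excess_risk(min_norm), illustrating the early-stopping gap.
```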

C.2 DETAILS FOR MNIST EXPERIMENTS

In this section, we provide the experimental details and additional results on the MNIST dataset. The MNIST experiment details are described below. We create a noisy version of MNIST with a label noise rate of 20%, i.e., the label of each training example is randomly perturbed with probability 20%, to simulate the label noise that is common in real datasets, e.g., ImageNet (Stock & Cissé, 2018; Shankar et al., 2020; Yun et al., 2021). We do not inject noise into the test data.

Table 3: The effective dimension $k_1$, the optimal early stopping excess risk and the min-norm excess risk for different feature distributions, with sample size $n = 50$, $p = 1000$. We calculate the 95% confidence interval for the excess risk.


We choose a standard four-layer convolutional neural network as the classifier. We use a vanilla SGD optimizer without momentum or weight decay. The initial learning rate is set to 0.5 and is decayed by a factor of 0.98 every epoch. Each model is trained for 300 epochs. The training batch size is set to 1024, and the test batch size is set to 1000. We use the standard cross-entropy loss as the loss function. We provide plots for different levels of label noise in Figure 3, and present the corresponding test errors of the best early stopping iterate and the final iterate in Table 7. Since the theoretical part of this paper focuses on GD, we also provide a corresponding plot of GD training in Figure 4 for completeness.

Table 5: The effective dimension $k_1$, the optimal early stopping excess risk and the min-norm excess risk for different feature distributions, with sample size $n = 100$, $p = 500$. We calculate the 95% confidence interval for the excess risk.



The taxonomy of data-dependent techniques does not mean that they totally ignore the algorithm; rather, it means that they lose important algorithm information, which renders them vacuous in generalization analysis. A similar argument holds for algorithm-dependent techniques. A random variable X is σ-subgaussian if E[e^{λX}] ≤ e^{λ^2 σ^2 / 2} for any λ. Constants may depend on σ_x, and we omit this dependency hereafter for clarity. It is worth mentioning that Bartlett et al. (2019) and this paper study the generalization behavior of different models; the comparisons here and in the following sections aim to validate the meaningfulness of compatibility analysis, rather than to beat their results. We do not compare the bias component in the excess risk bound here, since our bound for the bias component follows Bartlett et al. (2019). Due to the bias term in Theorem 5.1, the overall excess risk bound cannot surpass the order O(1/√n), which leads to cases where Zou et al. (2021) outperforms our bound. However, we note that such differences come from intrinsic properties of GD and SGD, which may be unavoidable in the GD regime.



Figure 1: (a) The training plot for linear regression with spectrum λ_i = 1/i^2 using GD. Note that the axes are in log scale. (b) The training plot of a CNN on corrupted MNIST with 20% label noise using SGD. Both models successfully learn the useful features in the initial phase of training, but it takes a long time for them to fit the noise in the dataset. These observations demonstrate the power of data-dependent trajectory analysis, since the early stopping solution on the trajectory generalizes well while the final iterate fails to generalize. See Appendix C for details.

Definition 3.1 (Compatibility). Given a loss function ℓ(·) with corresponding excess risk R(·), a data distribution D is compatible with an algorithm A if there exist nonempty subsets T_n of ℕ such that sup_{t ∈ T_n} R(θ_n^{(t)}) converges to zero in probability as the sample size n tends to infinity, where {θ_n^{(t)}}_{t≥0} denotes the output of algorithm A, and the randomness comes from the sampling of the training data Z from distribution D and the execution of algorithm A. That is to say, (D, A) is compatible if there exist nonempty sets T_n such that for every ε > 0, lim_{n→∞} P( sup_{t ∈ T_n} R(θ_n^{(t)}) > ε ) = 0.

Theorem 4.1 (Compatibility for Overparameterized Linear Regression with Gradient Descent). Consider the overparameterized linear regression setting defined in Section 4.1. Let Assumptions 1, 2 and 3 hold, and assume the learning rate satisfies λ = O(1/Tr(Σ)). Then under the condition that

Example B.5. Under the same conditions as Theorem 5.1, let Σ denote the feature covariance matrix. If the spectrum of Σ satisfies λ_k = 1/k^α for some α > 1, we set c(t, n) = Θ(n^{(α+1−2ατ)/(2α+1)}) for a given (α+1)/(2α) ≤ τ ≤ (3α+1)/(2α+2). Then for t = Θ(n^τ), we derive that V(θ_t) = O(n^{(2ατ−3α+2τ−1)/(2α+1)}). Example B.5 shows that by choosing c(t, n) as a non-constant, we exploit the full power of Theorem 5.1 and extend the compatibility region to t = o(n^{(3α+1)/(2α+2)}).
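A quick sanity check of the exponent arithmetic in Example B.5, using the expressions above: the variance exponent vanishes exactly at the upper endpoint of the admissible range of τ and is strictly negative at the lower endpoint.

```latex
\left.\frac{2\alpha\tau - 3\alpha + 2\tau - 1}{2\alpha+1}\right|_{\tau = \frac{3\alpha+1}{2\alpha+2}}
  = \frac{(2\alpha+2)\cdot\frac{3\alpha+1}{2\alpha+2} - 3\alpha - 1}{2\alpha+1} = 0,
\qquad
\left.\frac{2\alpha\tau - 3\alpha + 2\tau - 1}{2\alpha+1}\right|_{\tau = \frac{\alpha+1}{2\alpha}}
  = \frac{-(2\alpha+1)(\alpha-1)/\alpha}{2\alpha+1} = -\frac{\alpha-1}{\alpha} < 0.
```

So the variance bound degrades to a constant exactly at t = Θ(n^{(3α+1)/(2α+2)}), matching the stated compatibility region.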



Figure 2: The training plot for overparameterized linear regression with different covariances using GD.

Figure 3: The training plot for corrupted MNIST with different levels of label noise using SGD. Figure (c) is copied from Figure 1.


Figure 4: The training plot for corrupted MNIST with 20% label noise using GD.

4.1 PRELIMINARIES FOR OVERPARAMETERIZED LINEAR REGRESSION

Notations. Let O, o, Ω, ω denote asymptotic notations with their usual meanings. For example, a_n = O(b_n) means that there exists a large enough constant C such that a_n ≤ C b_n. We use ≲ with the same meaning as the asymptotic notation O. Besides, let ∥x∥ denote the ℓ_2 norm of a vector x, and ∥A∥ the operator norm of a matrix A. We allow vectors to belong to a countably infinite-dimensional Hilbert space H, and with a slight abuse of notation, we use R^∞ interchangeably with H. In this case, x^⊤z denotes the inner product and xz^⊤ the tensor product for x, z ∈ H.

Data Distribution. Let (x, y) ∈ R^p × R denote the feature vector and the response, following a joint distribution D. Let Σ ≜ E[xx^⊤] denote the feature covariance matrix, whose eigenvalue decomposition is Σ

Comparisons of excess risk bounds with Bartlett et al. (2019) and Zou et al. (2021).

The effective dimension k_1, the optimal early stopping excess risk, and the min-norm excess risk for different feature distributions, with sample size n = 100, p = 1000. The table shows that early stopping solutions generalize significantly better than min-norm interpolators, and reveals a positive correlation between the effective dimension k_1 and the excess risk of the early stopping solution. We calculate the 95% confidence interval for each excess risk.

The effective dimension k_1, the optimal early stopping excess risk and min-norm excess risk for different feature distributions, with sample size n = 200, p = 1000. We calculate the 95% confidence interval for the excess risk.

The effective dimension k_1, the optimal early stopping excess risk and min-norm excess risk for different feature distributions, with sample size n = 100, p = 2000. We calculate the 95% confidence interval for the excess risk.

The test error of the optimal early stopping iterate and the final iterate on the MNIST dataset with different levels of label noise. The results demonstrate that the early stopping iterate can have significantly better generalization performance than interpolating solutions on real datasets.

A PROOFS FOR THE MAIN RESULTS

We first sketch the proof in Section A.1 and give some preliminary lemmas in Section A.2. Sections A.3, A.4 and A.5 are devoted to the proof of Theorem 5.1. The proof of Theorem 4.1 is given in Section A.6.

A.1 PROOF SKETCH

We start with a standard bias-variance decomposition following Bartlett et al. (2019), which shows that the time-variant excess risk R(θ_t) can be bounded by a bias term and a variance term. We refer to Appendix A.3 for more details.

For the bias part, we first decompose it into an optimization error and an approximation error. For the optimization error, we use spectrum analysis to bound it by O(1/t), where t denotes the time. For the approximation error, we bound it by O(1/√n), where n denotes the sample size, inspired by Bartlett et al. (2019). We refer to Appendix A.4 for more details.

For the variance part, a key step is to bound the term (I − (λ/n) XX^⊤)^t, where X is the feature matrix. The difficulty arises from the different scales of the eigenvalues of XX^⊤: the largest eigenvalue has order Θ(n), while the smallest eigenvalue has order O(1), according to Lemma 10 in Bartlett et al. (2019). To overcome this issue, we split the matrix XX^⊤ based on whether its eigenvalues are larger than c(t, n), a flexible threshold depending on the time t and the sample size n. Accordingly, we split the variance term based on the eigenvalues of the covariance matrix Σ (leading to the k_1-related term) and on the eigenvalues of XX^⊤ (leading to the k_2-related term). We refer to Appendix A.5 for more details.
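To make the scale separation concrete, the following sketch (our own, with an assumed spectrum λ_i = 1/i; not part of the proof) computes the extreme eigenvalues of XX^⊤ for a few sample sizes:

```python
import numpy as np

# The largest eigenvalue of the n x n matrix X X^T grows like Theta(n),
# while the smallest stays O(1), which is the scale gap the threshold
# c(t, n) is designed to handle.
rng = np.random.default_rng(1)
p = 2000
lam = 1.0 / np.arange(1, p + 1)                      # assumed spectrum 1/i
tops, bots = {}, {}
for n in (50, 100, 200):
    X = rng.standard_normal((n, p)) * np.sqrt(lam)   # rows ~ N(0, diag(lam))
    mu = np.linalg.eigvalsh(X @ X.T)                 # ascending eigenvalues
    tops[n], bots[n] = float(mu[-1]), float(mu[0])
    print(n, round(tops[n] / n, 2), round(bots[n], 2))
```

The printed ratio mu_max/n stays roughly constant as n grows, while mu_min remains bounded, illustrating the Θ(n) versus O(1) separation.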

A.2 PRELIMINARIES

The following result comes from Bartlett et al. (2019), which bounds the eigenvalues of XX^⊤.

Lemma B.1. Suppose t = o(n). Let l denote the number of µ_i such that µ_i = Ω(n/t). Then with probability at least 1 − ne^{−n/c}, we have l = O(t).

Proof. According to Lemma A.1, with probability at least 1 − ne^{−n/c}, Equation 15 holds for all 0 ≤ k ≤ n − 1. Conditioned on this event, since t = o(n), we have l = O(t) as claimed.

We also need the following result from Bartlett et al. (2019), which gives a lower bound on µ_n.

Lemma B.2 (Lemma 10 in Bartlett et al. (2019)). For any σ_x, there exists a constant c such that the stated lower bound on µ_n holds with probability at least 1 − e^{−n/c}.

We are now ready to prove Theorem B.2.

Proof. We begin with the calculation of ∥θ_t∥^2. By Lemma A.2, the conditional unbiasedness of the noise in Assumption 2 and the noise variance lower bound, we obtain a lower bound on ∥θ_t∥^2. Plugging it into Equation 72 and then applying Lemmas B.1 and B.2, we know that under the high probability events in Lemmas B.1 and B.2, the stability-based bound, i.e., the right hand side of Equation 68, can be lower bounded in expectation as Ω((λ^3 t^3 / n^2) Σ_{i>k_0} λ_i). This implies that the stability-based bound becomes vacuous beyond a certain time threshold. Thus, stability-based methods will provably yield a smaller compatibility region than (ω(1/λ), o(n/λ)) in Theorem 4.1 whenever Σ_{i>k_0} λ_i is not very small, as demonstrated in the examples below.

Example B.2. Let Assumptions 1, 2 and 3 hold. Assume without loss of generality that λ = Θ(1). We have the following examples:

1. (Inverse Polynomial). If the spectrum of Σ satisfies λ_i = 1/i^α for some α > 1, the stability bound in Theorem B.1 becomes vacuous in a regime that is outperformed by the compatibility region in Theorem 5.1 when α < 2.

2. (Inverse Log-Polynomial). If the spectrum of Σ exhibits an inverse log-polynomial decay with parameter β > 1, we derive that k_0 = Θ(n/log n) and Σ_{i>k_0} λ_i = Θ(1). Therefore, the stability bound in Theorem B.1 becomes vacuous in a regime that is outperformed by the compatibility region in Theorem 5.1.

3. (Constant). If the spectrum of Σ is constant (with parameter ε > 0), we derive that k_0 = 0 and Σ_{i>k_0} λ_i = 1. Therefore, the stability bound in Theorem B.1 becomes vacuous in a regime that is outperformed by the compatibility region in Theorem 5.1.

4. (Piecewise Constant). If the spectrum of Σ is piecewise constant with s = n^r, d = n^q, 0 < r ≤ 1, q ≥ 1, we derive that k_0 = n^r and Σ_{i>k_0} λ_i = 1. Therefore, the stability bound in Theorem B.1 becomes vacuous in a regime that is outperformed by the compatibility region in Theorem 5.1.
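For the inverse-polynomial case above, the tail sum Σ_{i>k} i^{−α} scales as Θ(k^{1−α}); a quick numerical check (our own illustration, with α = 2, so the tail beyond index k is roughly 1/k):

```python
import numpy as np

# Verify that the tail sum of lambda_i = i^{-alpha} scales as k^{1-alpha}:
# for alpha = 2, k * sum_{i>k} i^{-2} should stay close to 1.
alpha, p = 2.0, 10**6
lam = np.arange(1, p + 1, dtype=float) ** (-alpha)
for k in (10, 100, 1000):
    tail = lam[k:].sum()
    print(k, round(tail * k, 3))
```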

B.4 COMPARISONS WITH UNIFORM CONVERGENCE BOUNDS

We give a standard bound on the Rademacher complexity of linear models.

Theorem B.3 (Mohri et al. (2012)). Let S ⊆ {x : ∥x∥_2 ≤ r} be a sample of size n and let H = {x → ⟨w, x⟩ : ∥w∥_2 ≤ Λ}. Then the empirical Rademacher complexity of H can be bounded as R̂_S(H) ≤ √(r^2 Λ^2 / n) = rΛ/√n.

Furthermore, Talagrand's Lemma (see Lemma 5.7 in Mohri et al. (2012)) indicates that the generalization gap is bounded by O(L · R̂_S(H)), where L = Θ(Λ) is the Lipschitz coefficient of the square loss function ℓ in our setting. Therefore, the Rademacher generalization bound is of order Θ(Λ^2 r/√n) and is vacuous when Λ = Ω(n^{1/4}) (treating r as a constant). A comparison similar to Example B.2 demonstrates that uniform convergence arguments will provably yield a smaller compatibility region than that in Theorem 5.1 for the example distributions.
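For a linear class, the supremum inside the Rademacher complexity has a closed form, sup_{∥w∥≤Λ} ⟨w, v⟩ = Λ∥v∥, so the bound of Theorem B.3 can be checked numerically. This is a Monte Carlo sketch under our own choices of n, d and Λ; the bound rΛ/√n is the one stated above.

```python
import numpy as np

# Empirical Rademacher complexity of H = {x -> <w, x> : ||w|| <= Lam}:
# sup_w (1/n) sum_i s_i <w, x_i> = (Lam/n) * || sum_i s_i x_i ||,
# averaged over random sign vectors s, compared against Lam * r / sqrt(n).
rng = np.random.default_rng(0)
n, d, Lam = 200, 5, 3.0
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize rows so r = 1
r = 1.0

vals = []
for _ in range(2000):                           # Monte Carlo over signs
    s = rng.choice([-1.0, 1.0], size=n)
    vals.append(Lam / n * np.linalg.norm(s @ X))
rad = float(np.mean(vals))

bound = Lam * r / np.sqrt(n)
print(rad, bound)
```

By Jensen's inequality, E∥Σ_i s_i x_i∥ ≤ √(Σ_i ∥x_i∥^2) = r√n, so the empirical average sits slightly below the bound.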

B.5 COMPARISON WITH PREVIOUS WORKS ON EARLY STOPPING

A line of works focuses on deriving excess risk guarantees for linear regression or kernel regression with early stopping (stochastic) gradient descent; we refer to Section 2 for a detailed discussion. Here we compare our results with the most relevant works, including Yao et al. (2007), Lin & Rosasco (2017) and Pillaud-Vivien et al. (2018).

Comparison with Yao et al. (2007). Yao et al. (2007) study kernel regression with early stopping gradient descent and share some similarities with our paper. However, their approach does not cover ours and is fundamentally different in the following aspects. Firstly, the assumptions used in the two approaches are different, owing to different goals and techniques. Yao et al. (2007) assume that the input feature and the data noise have bounded norm (see Section 2.1, Definitions and Notations, in Yao et al. (2007)), while we require the input feature to be subgaussian with independent entries; the assumption used in our paper is widely used in benign overfitting analysis, following Bartlett et al. (2019). Furthermore, although Yao et al. (2007) obtain a minimax bound in terms of the convergence rate, it is suboptimal in terms of the compatibility region. Specifically, our results show a compatibility region like (0, n), while the techniques of Yao et al. (2007) can only lead to a compatibility region like (0, √n); see the proof of the Main Theorem in Section 2 of Yao et al. (2007) for details. Such differences come from the different goals of the two approaches: Yao et al. (2007) focus on providing the optimal early-stopping time, while we focus on providing a larger compatibility region.

Comparison with Pillaud-Vivien et al. (2018). Pillaud-Vivien et al. (2018) study kernel regression with multi-pass stochastic gradient descent and derive optimal excess risk guarantees. Different from our approach with full-batch gradient descent, they study averaged stochastic gradient descent with batch size equal to 1.


Comparison with Lin & Rosasco (2017). Lin & Rosasco (2017) study stochastic gradient descent with arbitrary batch size, which reduces to full-batch gradient descent when the batch size is set to the sample size n. However, their results are still fundamentally different from ours, since they require a boundedness assumption and focus more on the optimal early stopping time than on the largest compatibility region, in the same spirit as Yao et al. (2007). Specifically, Lin & Rosasco (2017) derive a compatibility region like (0, n

