EARLY STOPPING IN DEEP NETWORKS: DOUBLE DESCENT AND HOW TO ELIMINATE IT

Abstract

Over-parameterized models, such as large deep networks, often exhibit a double descent phenomenon: as a function of model size, the error first decreases, then increases, and finally decreases again. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because the number of training epochs controls the model complexity. In this paper, we show that such epoch-wise double descent occurs for a different reason: it is caused by a superposition of two or more bias-variance tradeoffs that arise because different parts of the network are learned at different epochs, and mitigating this through proper scaling of the stepsizes can significantly improve the early stopping performance. We show this analytically for i) linear regression, where differently scaled features give rise to a superposition of bias-variance tradeoffs, and ii) a wide two-layer neural network, where the first and second layers govern separate bias-variance tradeoffs. Inspired by this theory, we study two standard convolutional networks empirically and show that eliminating epoch-wise double descent by adjusting the stepsizes of different layers improves the early stopping performance.

1. INTRODUCTION

Most machine learning algorithms learn a function that predicts a label from features. This function lies in a hypothesis class, such as neural networks parameterized by their weights. Learning amounts to fitting the parameters of the function by minimizing an empirical risk over the training examples. The goal is to learn a function that performs well on new examples, which are assumed to come from the same distribution as the training examples. Classical machine learning theory says that the test error or risk as a function of the size of the hypothesis class is U-shaped: a small hypothesis class is not sufficiently expressive to achieve small error, and a large one leads to overfitting to spurious patterns in the data. The superposition of those two sources of error, typically referred to as bias and variance, yields the classical U-shaped curve. However, increasing the model size beyond the number of training examples can decrease the error again. This phenomenon, dubbed "double descent" by Belkin et al. (2019), was observed as early as 1995 by Opper (1995), and is relevant today because most modern machine learning models, in particular deep neural networks, operate in the over-parameterized regime, where the error often decreases again as a function of model size, and where the model is sufficiently expressive to fit any data, even noise. Interestingly, this double descent behavior also occurs as a function of training time, as observed by Nakkiran et al. (2020a) and as illustrated in Figure 1. The left panel of Figure 1 shows that, as a function of training epochs, the test error first decreases, then increases, and then decreases again. Understanding this so-called epoch-wise double descent behavior is important for determining the early stopping time that gives the best performance. Early stopping, or other regularization techniques, are critical for learning from noisy labels (Arpit et al., 2017; Yilmaz & Heckel, 2020). Nakkiran et al.
(2020a) conjectured that epoch-wise double descent occurs because the training time controls the "effective model complexity". This conjecture is intuitive, because the model size, and thus the size of the hypothesis class, can be controlled by regularizing the empirical risk via early stopping the gradient descent iterations, as formalized in the under-parameterized regime by Yao et al. (2007); Raskutti et al. (2014); Bühlmann & Yu (2003). Specifically, limiting the number of gradient descent iterations ensures that the function's parameters lie in a ball around the initial parameters. While this conjecture might be true for certain problem setups, it is not consistent with our empirical observation for the 5-layer CNN studied by Nakkiran et al. (2020a): specifically, the empirically measured overall bias in Figure 1 is increasing for some iterations, whereas an increasing model size would imply that it is decreasing (see Appendix B.2 for details on this experiment).

Figure 1: Left: The test and train error curves of an over-parameterized 5-layer convolutional network trained on the CIFAR-10 training set with 20% random label noise. As observed by Nakkiran et al. (2020a), the performance shows a double descent behavior. Right: As we show here, the risk of a regression problem can be decomposed as the sum of two bias-variance tradeoffs. In both examples, early stopping the training where the test error achieves its minimum is critical for performance.

In this paper, we show empirically and theoretically that epoch-wise double descent, at least in the setups where we observed it, arises for a different reason: it is explained by a superposition of bias-variance tradeoffs, as illustrated for a toy regression example in the right panel of Figure 1. If the risk can be decomposed into two U-shaped bias-variance tradeoffs with minima at different epochs/iterations, then the overall risk/test error has a double descent behavior.
We also note that epoch-wise double descent is not a phenomenon tied to over-parameterization: both under- and over-parameterized models can exhibit epoch-wise double descent, as we show in this paper.

1.1. CONTRIBUTIONS

The goal of this paper is to understand the epoch-wise double descent behavior. Our main finding is that epoch-wise double descent can be explained as a superposition of bias-variance tradeoffs, and arises naturally in some standard neural networks because parts of the network are learned faster than others. Our contributions are as follows: First, we consider a linear regression model and theoretically characterize the risk of early stopped least squares. We show that if the features have different scales, then the risk of the early stopped least squares estimate as a function of the early stopping time is a superposition of bias-variance tradeoffs, which can yield a double-descent-like curve (see Figure 1, right panel). Second, we theoretically characterize the early stopped risk of a two-layer neural network and show that it is upper bounded by a curve consisting of overlapping bias-variance tradeoffs that are governed by the initializations and stepsizes of the two layers. The initialization scales and stepsizes of the weights in the first and second layer determine whether double descent occurs or not. We provide numerical examples showing how epoch-wise double descent occurs when training such a two-layer network on data, and how it can be eliminated by scaling the stepsizes of the layers accordingly. Third, we study a standard 5-layer convolutional network as well as ResNet-18 empirically. For the 5-layer convolutional network we find, similarly as for the two-layer model, that epoch-wise double descent occurs because the convolutional layers (representation layers) are learned slower than the final, fully connected layer. Similarly, for ResNet-18, we find that later layers are learned faster than early layers, which again results in double descent.
In both cases, epoch-wise double descent can be eliminated by adjusting the stepsizes of different coefficients or layers. In summary, we provide new examples of when epoch-wise double descent occurs, as well as analytical results explaining epoch-wise double descent theoretically. Our theory is constructive in that it suggests a simple and effective mitigation strategy: scaling the stepsizes appropriately. We also note that epoch-wise double descent should be eliminated by adjusting the stepsizes and/or the initialization, because this often translates into better overall performance.

1.2. RELATED WORKS

There is a large number of works that have studied early stopping theoretically. Intuitively, each step of an iterative algorithm reduces the bias but increases the variance; this tradeoff has been characterized in a variety of settings, for example by Heckel & Soltanolkotabi (2020b). We use the same proof strategy as those earlier works to characterize the early stopping performance of a simple two-layer neural network, but in contrast to them, we develop early stopping results that optimize over the weights in both the first and second layer, as opposed to only the weights in the first layer. This is important, because we want to demonstrate that the initialization and stepsize choices of different layers lead to different bias-variance tradeoffs. Next, we note that there is an emerging line of works that theoretically establishes double descent behavior as a function of the model complexity. Finally, our suggestion to mitigate epoch-wise double descent with stepsize adaptation and early stopping is a form of regularization. Related work for model-wise double descent shows that it can be mitigated with $\ell_2$ regularization (Nakkiran et al., 2020b), and $\ell_2$ regularization and early stopping are strongly related (Ali et al., 2019).

2. EARLY-STOPPED GRADIENT DESCENT FOR LINEAR LEAST SQUARES

We start by studying the risk of early stopped gradient descent for fitting a linear model to data generated by a Gaussian linear model. Our main finding is that the risk as a function of the early stopping time is characterized by a superposition of U-shaped bias-variance tradeoffs, and if the features of the Gaussian linear model have different scales, those bias-variance tradeoff curves can add up to a double-descent-shaped risk curve. We also show that the performance of the estimator can be improved by eliminating double descent through scaling the stepsizes associated with the features.

2.1. DATA MODEL AND RISK

Consider a regression problem, and suppose data is generated from a Gaussian linear model as $y = \langle x, \theta^* \rangle + z$, where $x \in \mathbb{R}^d$ is a zero-mean Gaussian feature vector with diagonal covariance matrix $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$, and $z$ is independent, zero-mean Gaussian noise with variance $\sigma^2$. We are given a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ consisting of $n$ data points drawn i.i.d. from this Gaussian linear model. We consider the class of linear estimators parameterized by a vector $\theta \in \mathbb{R}^d$, which we estimate based on the training data $D$. The linear estimator predicts the label associated with a feature vector $x$ as $\hat y = x^T\theta$. The (mean-squared) risk of this estimator is $R(\theta) = \mathbb{E}\left[(y - x^T\theta)^2\right]$, where the expectation is over an example $(x, y)$ drawn independently (of the training set) from the underlying linear model.

2.2. EARLY-STOPPED LEAST SQUARES ESTIMATE

We consider the estimate obtained by early stopping gradient descent applied to the empirical risk $\hat R(\theta) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i^T\theta)^2$. We initialize gradient descent with $\theta_0 = 0$ and iterate, for $t = 1, 2, \ldots$, with updates $\theta_{t+1} = \theta_t - \frac{1}{2}\,\mathrm{diag}(\eta)\nabla\hat R(\theta_t)$, where $\mathrm{diag}(\eta)$ is a diagonal matrix containing the stepsizes $\eta_i > 0$ associated with each of the features as entries. Note that we allow a different stepsize for each feature. In the following, we study the properties of the iterates $\theta_t$.
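To make the setup concrete, the following minimal NumPy sketch simulates the Gaussian linear model of Section 2.1 and runs the per-feature-stepsize gradient descent updates above. The dimensions, noise level, coefficients, and stepsizes are hypothetical choices for illustration; they are not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of the Gaussian linear model: two features with
# very different scales (sigma_1 = 1, sigma_2 = 0.15).
d, n, noise_std = 2, 1000, 0.5
theta_star = np.array([1.5, 10.0])
feature_std = np.array([1.0, 0.15])

X = rng.standard_normal((n, d)) * feature_std        # rows are feature vectors x_i
y = X @ theta_star + noise_std * rng.standard_normal(n)

# Early-stopped gradient descent on the empirical risk (1/n) sum_i (y_i - x_i^T theta)^2,
# with updates theta <- theta - (1/2) diag(eta) grad, i.e. one stepsize per feature.
eta = np.array([0.05, 0.05])
theta = np.zeros(d)
for _ in range(500):
    grad = (2.0 / n) * X.T @ (X @ theta - y)         # gradient of the empirical risk
    theta -= 0.5 * eta * grad

# With equal stepsizes, the well-scaled feature is fit quickly, while the
# small-scale feature is still far from its true coefficient after 500 steps.
print(theta)
```

Because the effective contraction factor per feature is $1 - \eta_i\sigma_i^2$, the first coefficient is essentially converged after 500 iterations while the second is not, which is exactly the mechanism behind the superposed bias-variance tradeoffs analyzed next.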

2.3. RISK OF EARLY STOPPED LEAST SQUARES

The main result of this section is that in the underparameterized regime, where $d \ll n$, the risk of gradient descent after $t$ iterations, $R(\theta_t)$, is very close to a risk expression defined as
$$R(\tilde\theta_t) := \sigma^2 + \sum_{i=1}^d \underbrace{\left[\sigma_i^2(\theta_i^*)^2\left(1 - \eta_i\sigma_i^2\right)^{2t} + \frac{\sigma^2}{n}\left(1 - (1 - \eta_i\sigma_i^2)^t\right)^2\right]}_{U_i(t)}, \qquad (1)$$
as formalized by the theorem below. We focus on the underparameterized regime here, because in the over-parameterized regime our estimator cannot achieve small risk in general. In Section 3 we study a more general setting in the overparameterized regime.

Theorem 1. Suppose that the stepsizes obey $\eta_i \le 1/\sigma_i^2$ for all $i = 1, \ldots, d$. With probability at least $1 - 2d^{-5} - 2de^{-n/8} - e^{-d} - 2e^{-32}$ over the random training set generated by a linear Gaussian model with parameters $\theta^*$ and $\Sigma$, the difference of the early stopped risk and the risk expression in (1) at iteration $t$ is at most
$$R(\theta_t) - R(\tilde\theta_t) \le c\,\frac{\max_i \eta_i^2\sigma_i^4}{\min_i \eta_i\sigma_i^4}\left(\sqrt{\frac{d}{n}}\,\|\Sigma\theta^*\|_2^2 + \sqrt{\frac{d}{n}}\,\sigma^2\log(d) + \frac{\sigma^2}{n}\sqrt{d}\right).$$
Here, $c$ is a numerical constant. Theorem 1 guarantees that with high probability the risk $R(\theta_t)$ is well approximated by the risk expression $R(\tilde\theta_t)$, provided the model is sufficiently underparameterized (i.e., $d/n$ is small). As a consequence, the risk of early stopped least squares is a superposition of U-shaped bias-variance tradeoffs, and if the features are differently scaled, this can give rise to epoch-wise double descent. To see this, first note that the terms $U_i(t)$ in the risk expression (1) are U-shaped as a function of the early stopping time $t$, because $\sigma_i^2(\theta_i^*)^2(1-\eta_i\sigma_i^2)^{2t}$ decreases in $t$ and $\frac{\sigma^2}{n}(1-(1-\eta_i\sigma_i^2)^t)^2$ increases in $t$; see Figure 2a for an example. The minima of the individual U-shaped curves $U_i(t)$ depend on the product of the stepsize and the $i$-th feature's variance, $\eta_i\sigma_i^2$; the larger this product, the earlier (as a function of the number of iterations $t$) the respective U-shaped curve reaches its minimum.
Therefore, if we add up two (or more) such U-shaped curves with minima at different iterations, the resulting risk curve can have a double descent shape (again, see Figure 2a). This establishes our claim that differently scaled features can give rise to epoch-wise double descent. Finally, we note that the reason we refer to the U-shaped curves as bias-variance tradeoffs is that the terms $\sum_{i=1}^d \sigma_i^2(\theta_i^*)^2(1-\eta_i\sigma_i^2)^{2t}$ and $\sum_{i=1}^d \frac{\sigma^2}{n}(1-(1-\eta_i\sigma_i^2)^t)^2$ in (1) are the bias and variance of the estimate, respectively (see Appendix A.2).

Figure 2: Early stopped least squares risk for a two-feature Gaussian linear model. a: Two U-shaped bias-variance tradeoffs $U_i(t)$ for the parameters $\theta_1^* = 1.5$, $\sigma_1 = 1$, $\eta_1 = 0.05$ (bias-variance 1) and $\theta_2^* = 10$, $\sigma_2 = 0.15$, $\eta_2 = 0.05$ (bias-variance 2), along with their sum (1+2), which determines the risk. b: Same plot, but this time the bias-variance tradeoff $U_2(t)$ is shifted to the left by increasing the stepsize $\eta_2$ according to Proposition 1 (yielding bias-variance tradeoff 3), so that its minimum overlaps with that of bias-variance tradeoff 1. This eliminates double descent and gives better performance. c: The resulting risk curves before and after elimination, demonstrating that the minimum of the risk after double descent elimination is smaller than before elimination.

Improving performance by eliminating double descent: Epoch-wise double descent can be eliminated by properly scaling the stepsizes associated with each of the features, so that the minima of the individual bias-variance tradeoffs overlap at the same iteration $\hat t$:

Proposition 1. Pick an optimal early stopping time $\hat t \ge 1$. The minimum of the risk expression $\min_{\eta_1, \ldots, \eta_d} \min_t R(\tilde\theta_t)$ is achieved at iteration $\hat t$ by choosing the stepsizes pertaining to the features as
$$\eta_i = \frac{1}{\sigma_i^2}\left(1 - \left(\frac{\sigma^2/n}{\sigma_i^2(\theta_i^*)^2 + \sigma^2/n}\right)^{1/\hat t}\right).$$

Elimination of double descent is illustrated in Figure 2b.
By eliminating double descent optimally, so that all the individual bias-variance tradeoffs $U_i(t)$ achieve their minima at the same early stopping point $\hat t$, we achieve the lowest overall risk at the optimal early stopping point. Thus, eliminating double descent is important for optimal performance. In practice, we typically do not know the variances of the features and therefore may not be able to choose the stepsizes optimally. However, we may be able to mitigate double descent sub-optimally by treating the stepsizes as hyperparameters.
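The risk expression (1) and the stepsize choice of Proposition 1 are easy to evaluate numerically. The following sketch (with hypothetical parameter values; the quantity noise_var/n plays the role of $\sigma^2/n$) sums the two U-shaped curves, which produces a double descent risk curve, and then eliminates the double descent with the stepsizes of Proposition 1:

```python
import numpy as np

def risk_expression(t, theta_star, feat_var, eta, noise_var, n):
    """Risk expression (1): sigma^2 plus the sum of the U-shaped terms U_i(t)."""
    decay = (1.0 - eta * feat_var) ** t
    bias = feat_var * theta_star**2 * decay**2
    variance = (noise_var / n) * (1.0 - decay) ** 2
    return noise_var + np.sum(bias + variance)

def prop1_stepsizes(t_hat, theta_star, feat_var, noise_var, n):
    """Stepsizes of Proposition 1, aligning the minima of all U_i at iteration t_hat."""
    r = (noise_var / n) / (feat_var * theta_star**2 + noise_var / n)
    return (1.0 - r ** (1.0 / t_hat)) / feat_var

# Hypothetical two-feature example with differently scaled features.
theta_star = np.array([1.5, 10.0])
feat_var = np.array([1.0, 0.15**2])
noise_var, n = 2.0, 2                      # so that sigma^2 / n = 1
eta_equal = np.array([0.05, 0.05])

ts = np.arange(1, 20001)
risk_equal = np.array([risk_expression(t, theta_star, feat_var, eta_equal, noise_var, n)
                       for t in ts])

# With equal stepsizes, the risk curve has two local minima: double descent.
# The stepsizes of Proposition 1 align both minima at iteration t_hat instead.
t_hat = 26
eta_star = prop1_stepsizes(t_hat, theta_star, feat_var, noise_var, n)
risk_star = risk_expression(t_hat, theta_star, feat_var, eta_star, noise_var, n)
```

Since the minimum value of each $U_i$ over $t$ does not depend on $\eta_i$, aligning all minima at one iteration attains the global optimum, so the risk at $\hat t$ after elimination is below the best early-stopped risk of the equal-stepsize curve.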

3. EARLY STOPPING IN TWO LAYER NEURAL NETWORKS

In this section, we establish a bound on the risk of a two-layer neural network and show that this bound can be interpreted as a superposition of U-shaped bias-variance tradeoffs, similar to the expression governing the risk of the linear model from the previous section. The risk of the two-layer network is governed by two kernels associated with the first and second layer, and the initialization scales and stepsizes of the weights in the first and second layer determine whether double descent occurs or not. We also show in an experiment that if double descent occurs, it can be eliminated by adapting the stepsizes of the two layers.

Network model: We consider a two-layer neural network with ReLU activation functions and $k$ neurons in the hidden layer: $f_{W,v}(x) = \frac{1}{\sqrt{k}}\,\mathrm{relu}(x^TW)\,v$. Here, $x \in \mathbb{R}^d$ is the input of the network, and $W \in \mathbb{R}^{d\times k}$ and $v \in \mathbb{R}^k$ are the weights of the first and second layer, respectively. Moreover, $\mathrm{relu}(z) = \max(z, 0)$ is the rectified linear unit, applied elementwise.

Data model:

We assume that we are given a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with examples $(x_i, y_i)$ drawn i.i.d. from some joint distribution. For convenience, we assume that the datapoints are normalized, i.e., $\|x_i\|_2 = 1$, and the labels are bounded, i.e., $|y_i| \le 1$.

Training with early stopped gradient descent: We train the network with early stopped and randomly initialized gradient descent on a quadratic loss. We choose the weights at initialization as
$$[W_0]_{i,j} \sim \mathcal{N}(0, \omega^2), \qquad [v_0]_i \sim \mathrm{Uniform}(\{-\nu, \nu\}). \qquad (3)$$
Here, $\omega$ and $\nu$ are parameters that trade off the magnitude of the weights of the first and second layer. Note that with this initialization, for a fixed unit-norm feature vector $x$, we have $f_{W_0,v_0}(x) = O(\nu\omega)$. We apply gradient descent to the mean-squared loss $L(W, v) = \frac{1}{2}\sum_{i=1}^n (y_i - f_{W,v}(x_i))^2$. The gradient descent updates are $v_{t+1} = v_t - \eta\nabla_v L(W_t, v_t)$ and $W_{t+1} = W_t - \eta\nabla_W L(W_t, v_t)$, where $\eta$ is a constant learning rate. We study the risk of the network as a function of the iterations $t$.

Evaluation and performance metric: Our goal is to bound the test error as a function of the iterations of gradient descent. Let $\ell\colon \mathbb{R}\times\mathbb{R} \to [0, 1]$ be a loss function that is 1-Lipschitz in its first argument and obeys $\ell(y, y) = 0$; a concrete example is the loss $\ell(z, y) = |z - y|$ for arguments $z, y \in [0, 1]$. The test error or risk is defined, as before, as $R(f) = \mathbb{E}[\ell(f(x), y)]$, where the expectation is over examples $(x, y)$ drawn from the same (unknown) joint distribution as the training set.
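The network model, the initialization (3), and the gradient computations can be sketched as follows. This is a minimal NumPy sketch: the dimensions, scales $\omega, \nu$, and stepsizes are hypothetical, and we allow separate stepsizes for the two layers (written eta_W and eta_v here), as used later to eliminate double descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_network(d, k, omega, nu):
    """Initialization (3): Gaussian first layer, random-sign second layer."""
    W = omega * rng.standard_normal((d, k))
    v = nu * rng.choice([-1.0, 1.0], size=k)
    return W, v

def forward(W, v, X):
    """Two-layer ReLU network f_{W,v}(x) = relu(x^T W) v / sqrt(k), row-wise on X."""
    return np.maximum(X @ W, 0.0) @ v / np.sqrt(W.shape[1])

def gradient_step(W, v, X, y, eta_W, eta_v):
    """One gradient step on L(W, v) = (1/2) sum_i (y_i - f(x_i))^2,
    with separate stepsizes for the first and second layer."""
    k = W.shape[1]
    pre = X @ W                              # pre-activations, shape (n, k)
    hidden = np.maximum(pre, 0.0)
    residual = y - hidden @ v / np.sqrt(k)
    grad_v = -hidden.T @ residual / np.sqrt(k)
    grad_W = -X.T @ ((residual[:, None] * v[None, :]) * (pre > 0)) / np.sqrt(k)
    return W - eta_W * grad_W, v - eta_v * grad_v

# Hypothetical toy data: unit-norm inputs, bounded labels.
d, k, n = 5, 200, 20
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.uniform(-1.0, 1.0, size=n)

W, v = init_network(d, k, omega=0.5, nu=0.5)
losses = []
for _ in range(50):
    losses.append(0.5 * np.sum((y - forward(W, v, X)) ** 2))
    W, v = gradient_step(W, v, X, y, eta_W=0.01, eta_v=0.01)
losses.append(0.5 * np.sum((y - forward(W, v, X)) ** 2))
```

Setting eta_v much smaller than eta_W is the elimination strategy studied later in this section; the sketch only verifies that the loss decreases under plain gradient descent.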

3.1. RISK OF EARLY STOPPED NEURAL NETWORK TRAINING

Our main result is a bound on the test error of the two-layer neural network trained for $t$ iterations, in the regime where the network is very wide. The result depends on the Gram matrix $\Sigma \in \mathbb{R}^{n\times n}$ determined by two kernels associated with the first and second layer of the network. The $(i,j)$-th entry of the Gram matrix as a function of the training examples is defined as $\Sigma_{ij} = \nu^2 K_1(x_i, x_j) + \omega^2 K_2(x_i, x_j)$, with kernels
$$K_1(x_i, x_j) = \frac{1}{2}\left(1 - \frac{\cos^{-1}(\rho_{ij})}{\pi}\right)\rho_{ij}, \qquad K_2(x_i, x_j) = K_1(x_i, x_j) + \frac{\sqrt{1 - \rho_{ij}^2}}{2\pi},$$
where $\rho_{ij} = \langle x_i, x_j\rangle$ (recall that we assume $\|x_i\|_2 = 1$ for all $i$). Our result depends on the singular values and vectors of this Gram matrix: $\Sigma = \sum_{i=1}^n \sigma_i^2 u_iu_i^T$. We are now ready to state our result.

Theorem 2. Let $\alpha > 0$ be the smallest eigenvalue of the Gram matrix $\Sigma$, suppose that the network is sufficiently wide, i.e., $k \ge \Omega\left(\frac{n^{10}}{\alpha^{15}\min(\nu,\omega)}\right)$, and suppose the initialization scale parameters obey $\nu\omega \le \alpha/\sqrt{32\log(2n/\delta)}$ and $\nu + \omega \le 1$ for some $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, the risk of the network trained with gradient descent for $t$ iterations is at most
$$R(f_{W_t,v_t}) \le \sqrt{\frac{1}{n}\sum_{i=1}^n \langle u_i, y\rangle^2\left(1 - \eta\sigma_i^2\right)^{2t}} + \sqrt{\frac{1}{n}\sum_{i=1}^n \frac{\langle u_i, y\rangle^2\left(1 - (1 - \eta\sigma_i^2)^t\right)^2}{\sigma_i^2}} + O\left(\frac{1}{\sqrt{n}}\right). \qquad (5)$$
Regarding the assumptions of the theorem, we remark that while the exponents of $n$ and $\alpha$ in the width condition can be improved, the width condition ensures that the network is sufficiently wide to operate in the kernel regime, where it behaves similarly to an associated linear model. Regarding the assumption that the smallest eigenvalue of the Gram matrix obeys $\alpha > 0$: Theorem 3.1 by Du et al. (2019) shows that if no two $x_i, x_j$ are parallel, then $\alpha > 0$ for a closely related Gram matrix (specifically, the Gram matrix consisting only of the kernel $K_1$ defined above).
As argued in that work, for most real-world datasets no two inputs are parallel, so this assumption is rather mild. The risk bound established by Theorem 2 can be interpreted as a superposition of $n$ U-shaped bias-variance tradeoffs, similar to the expression (1) governing the risk of early stopped linear least squares. Specifically, the $i$-th "bias" term $\langle u_i, y\rangle^2(1 - \eta\sigma_i^2)^{2t}$ decreases in the number of gradient descent iterations $t$, while the $i$-th "variance" term $\langle u_i, y\rangle^2(1 - (1 - \eta\sigma_i^2)^t)^2/\sigma_i^2$ increases in the number of gradient descent iterations. The speed at which the two terms decrease and increase, respectively, is determined by the singular value $\sigma_i^2$. Those singular values, in turn, are determined by the kernels $K_1$ and $K_2$, the random initialization (in particular the scale parameters $\nu, \omega$), and the distribution of the examples. Whether epoch-wise double descent occurs therefore depends on the kernels, the initialization, and the distribution of the examples, as illustrated by the following numerical example.

Numerical example to illustrate the theorem: We draw data from the linear model specified in Section 2.1 with geometrically decaying diagonal covariance entries and zero additive noise. We then train the network for different initialization scale parameters $\omega, \nu$, once with the same stepsize for both layers ($\eta = 8\text{e-}5$), and once with a smaller stepsize for the second layer, i.e., $\eta_W = 8\text{e-}5$ and $\eta_v = 1\text{e-}6$. The top row of Figure 3 shows that the empirical risk has a double descent behavior if both layers are initialized at the same scale (i.e., $\omega = \nu = 1$). To better understand the relation to the theorem, we also plot in Figure 3 the extent to which the singular values are associated with the parameters in the first and second layer.
To capture this, we first comment on the relation of the parameters of the first and second layer to the singular values $\sigma_i^2$ and vectors $u_i$ of the Gram matrix. In the wide-network regime in which the theorem applies, the network's output is well approximated by its linearization around the initialization. With this, the network's predictions for the training examples are approximately
$$\left[f_{W,v}(x_1), \ldots, f_{W,v}(x_n)\right]^T \approx J\begin{bmatrix}\mathrm{vect}(W)\\ v\end{bmatrix} = \sum_{i=1}^n \sigma_i u_i\left(v_{i,W}^T\,\mathrm{vect}(W) + v_{i,v}^T\,v\right),$$
where $J \in \mathbb{R}^{n\times(dk+k)}$ is (approximately) the Jacobian of the network at initialization and $J = \sum_{i=1}^n \sigma_i u_iv_i^T$ is its singular value decomposition. Here, we denote by $v_{i,W} \in \mathbb{R}^{dk}$ and $v_{i,v} \in \mathbb{R}^k$ the parts of the right-singular vectors of the Jacobian associated with the weights in the first and second layer, respectively. The norms of those vectors measure the extent to which the singular value $\sigma_i$ is associated with the weights in the first and second layer. Returning to the numerical example: as the bottom row of Figure 3 shows, if we initialize both layers at the same scale ($\omega = \nu = 1$), then the large singular values are associated, for the most part, with the weights in the second layer. This leads to double descent, which can be mitigated by choosing a smaller stepsize for the weights in the second layer.

Improving performance by eliminating double descent: Similarly as for the linear least squares problem studied in the previous section, it is possible to shape the bias-variance tradeoffs by adapting the stepsizes (or the initialization of the layers). Figure 3 illustrates this behavior: double descent is eliminated by choosing a smaller stepsize for the second layer, or by choosing a smaller initialization for the first layer, as suggested by our theoretical results, and similar to the linear least squares setup discussed in the previous section.
Note that choosing a smaller stepsize for the second layer not only eliminates double descent, it also gives a better overall risk. To understand the relation to the kernels, suppose we choose the initializations equally, i.e., $\omega = 1$ and $\nu = 1$. If we update the variables of the second layer (i.e., $v$) with a much larger stepsize than those of the first layer (i.e., $W$), then the kernel associated with the second layer dominates and the network behaves like a random feature model (Rahimi & Recht, 2008). Similarly, if we update the variables of the first layer with a much larger stepsize than those of the second layer, then the network behaves like a network with the final layer weights $v$ fixed. Thus, the stepsizes trade off the impact of the two kernels, and this tradeoff can yield a double descent curve.
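For concreteness, the Gram matrix of Theorem 2 is straightforward to compute from the training inputs. The following NumPy sketch (with hypothetical $\nu, \omega$ and random unit-norm inputs) implements the kernels $K_1$ and $K_2$ and the resulting matrix $\Sigma_{ij} = \nu^2 K_1(x_i, x_j) + \omega^2 K_2(x_i, x_j)$:

```python
import numpy as np

def gram_matrix(X, nu, omega):
    """Gram matrix of Theorem 2: Sigma_ij = nu^2 K1(x_i, x_j) + omega^2 K2(x_i, x_j),
    where rho_ij = <x_i, x_j> and the rows of X are assumed unit norm."""
    rho = np.clip(X @ X.T, -1.0, 1.0)                 # clip guards against rounding
    K1 = 0.5 * (1.0 - np.arccos(rho) / np.pi) * rho
    K2 = K1 + np.sqrt(1.0 - rho**2) / (2.0 * np.pi)
    return nu**2 * K1 + omega**2 * K2

# Hypothetical example: random unit-norm inputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = gram_matrix(X, nu=0.5, omega=0.5)
alpha = np.linalg.eigvalsh(S).min()    # smallest eigenvalue, the alpha of Theorem 2
```

Since $\rho_{ii} = 1$, each diagonal entry equals $(\nu^2 + \omega^2)/2$, and for generic (non-parallel) inputs the smallest eigenvalue $\alpha$ is strictly positive, consistent with the assumption of Theorem 2.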

4. EARLY STOPPING IN CONVOLUTIONAL NEURAL NETWORKS

We finally study the training of a standard 5-layer convolutional neural network (CNN) and a standard ResNet-18 model on the (10-class) CIFAR-10 dataset. Both networks were studied by Nakkiran et al. (2020a). As shown in that paper, the risk has a double descent behavior if the network is trained on a dataset with label noise, and we consider the same setup with 20% random label noise. While we have no theoretical results for these two more complicated neural network models, we demonstrate, inspired by our theory, that epoch-wise double descent can be eliminated and the early stopping performance improved by adjusting the stepsizes/learning rates.

Figure 4: Left: The large singular values of the network's Jacobian at initialization are mostly associated with the weights of the fully connected layer. This causes the fully connected layer to be learned faster than the convolutional layers. Middle and Right: Performance when trained with i) the same stepsize for all layers, and ii) a smaller stepsize for the fully connected layer. Decreasing the learning rate of the fully connected layer causes it to be learned at a similar speed as the convolutional layers, thereby eliminating double descent and improving performance (i.e., the minimum of ii is smaller than that of i).

5-layer CNN:

The 5-layer CNN consists of 4 convolutional layers followed by a fully connected layer. Figure 4 shows that, just like for the two-layer network from the previous section, double descent can be eliminated by changing the stepsizes, this time by decreasing the stepsize of the final fully connected layer. The intuition behind this is that the large singular values of the Jacobian of the network at initialization are mostly associated with the last fully connected layer, measured in the same way as in the previous section. This causes the convolutional layers to be learned slower than the fully connected layer, which results in double descent. Analogously as before, decreasing the stepsize pertaining to the fully connected layer eliminates double descent.

ResNet-18:

We next consider the popular ResNet-18 model. ResNet-18 has a double descent behavior when trained on the noisy CIFAR-10 problem (Nakkiran et al., 2020a). Inspired by our theory, we again hypothesize that the double descent behavior occurs because some layer(s) of the ResNet-18 model are fitted at a faster rate than others. If that hypothesis is true, then scaling the learning rates of some layers should eliminate double descent. Indeed, Figure 6 in the appendix shows that scaling the stepsizes of the later half of the layers of the network mitigates double descent.

A.1 INTUITION FOR THEOREM 1

Below, we provide a proof of Theorem 1. Here, we first give intuition for why the risk is governed by the risk expression (1). First, note that the risk of the estimator can be written as a function of the variances of the features, $\sigma_i^2$, and of the coefficients of the underlying true linear model, $\theta^* = [\theta_1^*, \ldots, \theta_d^*]$, as
$$R(\theta) = \sigma^2 + \sum_{i=1}^d \sigma_i^2\left(\theta_i^* - \theta_i\right)^2, \qquad (7)$$
where we used that $z$ and $x$ are drawn independently. Next, recall that we consider the estimate based on early stopping gradient descent applied to the empirical risk $\hat R(\theta) = \|X\theta - y\|_2^2$. Here, the matrix $X \in \mathbb{R}^{n\times d}$ contains the scaled training feature vectors $\frac{1}{\sqrt{n}}x_1, \ldots, \frac{1}{\sqrt{n}}x_n$ as rows, and $y = \frac{1}{\sqrt{n}}[y_1, \ldots, y_n]$ are the correspondingly scaled responses. The gradient descent iterates obey
$$\theta_{t+1} - \theta^* = \left(I - \mathrm{diag}(\eta)X^TX\right)(\theta_t - \theta^*) + \mathrm{diag}(\eta)X^Tz,$$
where $z = \frac{1}{\sqrt{n}}[z_1, \ldots, z_n]$ is the correspondingly scaled noise. As we formalize below, in the under-parameterized regime, where $n \gg d$, we have that $X^TX \approx \Sigma$, the covariance matrix of the features. Therefore, the original iterates are close to the proximal iterates $\tilde\theta_t$ defined by
$$\tilde\theta_{t+1} - \theta^* = \left(I - \mathrm{diag}(\eta)\Sigma\right)(\tilde\theta_t - \theta^*) + \mathrm{diag}(\eta)X^Tz.$$
The proximal iterates are, up to the extra term $\mathrm{diag}(\eta)X^Tz$, equal to the iterates of gradient descent applied to the population risk $R(\theta)$.
Note that, in contrast to the literature, where it is common to bound the deviation of the original iterates from the iterates on the population risk (Raskutti et al., 2014), here we control the deviation of the original iterates from the proximal iterates $\tilde\theta_t$. The iterates $\tilde\theta_t$ can easily be written out in closed form. To do so, first note that the recursion $\theta_{t+1} = \alpha\theta_t + \gamma$ yields $\theta_t = \alpha^t\theta_0 + \gamma\sum_{i=0}^{t-1}\alpha^i = \alpha^t\theta_0 + \gamma\,\frac{1-\alpha^t}{1-\alpha}$, by the formula for a geometric series. Using this relation, and that we start our iterations at $\theta_0 = 0$, we obtain for the $i$-th entry of $\tilde\theta_t$ that
$$[\tilde\theta_t]_i - \theta_i^* = -\left(1 - \eta_i\sigma_i^2\right)^t\theta_i^* + \frac{\bar x_i^Tz}{\sigma_i^2}\left(1 - (1 - \eta_i\sigma_i^2)^t\right),$$
where $\bar x_i$ is the $i$-th column of $X$ (not the $i$-th example/feature vector!). Next note that $\mathbb{E}\left[(\bar x_i^Tz)^2\right] \approx \frac{\sigma^2\sigma_i^2}{n}$, because the entries of the scaled noise $z$ are $\frac{1}{\sqrt{n}}\mathcal{N}(0, \sigma^2)$ distributed, and the entries of $\bar x_i$ are $\frac{1}{\sqrt{n}}\mathcal{N}(0, \sigma_i^2)$ distributed. Using this expectation in the iterates $\tilde\theta_t$, and evaluating the risk of those iterates via the risk formula (7), yields the risk expression (1). The formal proof of Theorem 1 makes this intuition precise by bounding the difference of the proximal iterates to the original iterates.

A.2 MOTIVATION FOR CALLING THE U-SHAPED CURVES BIAS-VARIANCE TRADEOFFS

Let $\hat\theta = \hat\theta(D)$ be the parameter estimate obtained from the training data (for example, by early stopping). The textbook bias-variance decomposition of the risk of $\hat\theta$ is
$$\mathbb{E}_D\left[R(\hat\theta)\right] = \underbrace{\mathbb{E}_x\left[\left(\langle x, \theta^*\rangle - \mathbb{E}_D\langle x, \hat\theta\rangle\right)^2\right]}_{\mathrm{Bias}(\hat\theta)} + \underbrace{\mathbb{E}_{D,x}\left[\left(\langle x, \hat\theta\rangle - \mathbb{E}_D\langle x, \hat\theta\rangle\right)^2\right]}_{\mathrm{Variance}(\hat\theta)} + \sigma^2.$$
The first term is the bias of the hypothesis $\hat h(x) = \langle x, \hat\theta\rangle$. It measures how well the average hypothesis estimates the true underlying function $h(x) = \langle x, \theta^*\rangle$; a low bias means that the hypothesis accurately estimates the true underlying function on average. The second term is the variance of the method; it measures how much the hypothesis varies over draws of the training set.
Recall from the previous paragraph that the estimate $\tilde\theta_t$ approximates the original iterates $\theta_t$ well, provided that the model is sufficiently underparameterized, i.e., $d/n$ is small. It is straightforward to verify that
$$\mathrm{Bias}(\tilde\theta_t) = \sum_{i=1}^d \sigma_i^2(\theta_i^*)^2\left(1 - \eta_i\sigma_i^2\right)^{2t} \quad\text{and}\quad \mathrm{Variance}(\tilde\theta_t) = \sum_{i=1}^d \frac{\sigma^2}{n}\left(1 - (1 - \eta_i\sigma_i^2)^t\right)^2,$$
exactly the bias and variance terms in the risk expression (1). It follows that the bias and variance of the original gradient descent iterates $\theta_t$ are also approximately equal to those terms. The $i$-th U-shaped curve $U_i(t)$ is then the sum of the bias and variance terms pertaining to the $i$-th feature; this formally establishes the U-shaped curves as bias-variance tradeoffs.
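The closed-form bias and variance above can be checked against a Monte Carlo estimate over random training sets. The following sketch (a hypothetical small instance, with the bias and variance measured with respect to the feature covariance, as in the decomposition above) does so for the original gradient descent iterates $\theta_t$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small instance of the Gaussian linear model.
d, n, noise_std, t_stop = 2, 2000, 0.5, 100
theta_star = np.array([1.0, 1.0])
feature_std = np.array([1.0, 0.25])
eta = np.array([0.05, 0.05])

def early_stopped_estimate():
    """theta_t after t_stop gradient descent steps on a fresh training set."""
    X = rng.standard_normal((n, d)) * feature_std
    y = X @ theta_star + noise_std * rng.standard_normal(n)
    theta = np.zeros(d)
    for _ in range(t_stop):
        theta += (eta / n) * (X.T @ (y - X @ theta))
    return theta

thetas = np.array([early_stopped_estimate() for _ in range(1000)])

# Closed forms for Bias(theta_t) and Variance(theta_t) from above.
decay = (1.0 - eta * feature_std**2) ** t_stop
bias_closed = np.sum(feature_std**2 * theta_star**2 * decay**2)
var_closed = np.sum((noise_std**2 / n) * (1.0 - decay) ** 2)

# Monte Carlo estimates, weighting coordinates by the feature variances.
bias_mc = np.sum(feature_std**2 * (theta_star - thetas.mean(axis=0)) ** 2)
var_mc = np.sum(feature_std**2 * thetas.var(axis=0))
```

Since $d/n$ is small here, the Monte Carlo bias closely matches the closed form, while the empirical variance is slightly larger, reflecting the finite-sample fluctuations of $X^TX$ around the covariance that the approximation ignores.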

A.3 NUMERICAL RESULTS FOR LINEAR LEAST SQUARES

In this section we provide further numerical results for linear least squares. We consider a linear model with d = 700 features and n = 5d examples. We let a fraction 6/7 of the features have singular value $\sigma_i = 1$ and associated model coefficient $\theta_i^* = 1$, and the remaining 1/7 of the features have singular value $\sigma_i = 0.1$ and $\theta_i^* = 10$. In Figure 5(a) we show the risk obtained by simulating the model empirically, along with the risk expression $R(\tilde\theta_t)$ given by equation (1). It can be seen that the risk expression $R(\tilde\theta_t)$ slightly under-estimates the true risk. The quality of the approximation improves as we increase n; see Figure 5(b) for n = 10d.

B.1 RESNET-18 TRAINED WITH DIFFERENT STEPSIZES

In Figure 6 we provide test and train error curves for ResNet-18 trained with different stepsizes on noisy CIFAR-10. The results show that, as mentioned in the main body, double descent is eliminated by choosing the stepsizes appropriately. In more detail: ResNet-18 consists of 18 layers in total; between the first standalone convolutional layer and the last fully connected layer there are 4 residual blocks, each featuring 4 convolutional layers with residual connections. We consider standard SGD training of ResNet-18 on noisy CIFAR-10 with an initial learning rate of $\eta = 0.1$ and inverse square-root decay with decay rate T = 512. This is the standard training setup for ResNet-18, and is exactly the setup for which Nakkiran et al. reported double descent behavior. We found that, similar to the 5-layer convolutional network, double descent occurs in ResNet-18 because some of the network's layers are learned at different rates than others.
We found that the weights of the last fully-connected layer as well as those of the last two residual blocks were learned faster than the other layers. Following the method inspired by our theory for the linear case and the two-layer network, and by the empirical observations for the 5-layer convolutional network, we eliminate the double descent by decreasing the stepsizes of these layers from 10^{-1} to 10^{-4} after a few epochs. Note that ResNet-18 has a different architecture than the simple 5-layer convolutional network, and the values chosen differ for the two networks. This is expected, as double descent depends on many factors, such as the underlying data distribution as well as the network architecture and training.

Figure 5: The risk of early-stopped gradient descent for least squares, R(θ_t), based on numerical simulation of the Gaussian model, along with the risk expression R(θ̃_t) given in (1), for (a) n = 5d and (b) n = 10d. We averaged over 100 runs of gradient descent, and the shaded region corresponds to one standard deviation over the runs. It can be seen that the risk expression slightly underestimates the true risk, but otherwise describes the behavior of the risk well.
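The intervention itself amounts to ordinary SGD with per-group stepsizes, where the fast group's stepsize is dropped after a warmup. The sketch below illustrates this on a linear toy problem with two feature groups learned at different speeds; the grouping, learning rates, and switch point are illustrative and are not the values used for ResNet-18.

```python
import numpy as np

# Sketch of the stepsize intervention: full-batch gradient descent where
# different parameter groups get different stepsizes, and the group that is
# learned fast has its stepsize dropped after a warmup, mimicking the
# 1e-1 -> 1e-4 schedule applied to the fast layers. All values illustrative.
rng = np.random.default_rng(1)
d, n = 10, 400
scales = np.r_[np.full(d // 2, 1.0), np.full(d // 2, 0.1)]   # fast / slow groups
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, d)) * scales
theta_star = np.r_[np.full(d // 2, 1.0), np.full(d // 2, 5.0)]
y = X @ theta_star + rng.normal(0.0, 0.01, n)

lr = np.full(d, 1e-1)                      # per-group stepsizes
theta = np.zeros(d)
for t in range(20000):
    if t == 200:
        lr[: d // 2] = 1e-4                # slow down the fast group
    theta -= lr * (X.T @ (X @ theta - y))
```

With the schedule, the slowly learned group catches up while the fast group no longer overfits in the meantime, which is the mechanism the layer-wise stepsize adjustment exploits.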

B.2 NUMERICAL BIAS-VARIANCE DECOMPOSITION FOR THE 5-LAYER CNN

As discussed before, classical machine learning theory for the underparameterized regime establishes the bias-variance tradeoff as a result of the bias decreasing and the variance increasing as a function of the model size (complexity). In the over-parameterized regime, the bias often continues to decrease, while the variance also decreases; this has been established in a number of recent works (Jacot et al.). In this paper, we demonstrated that epoch-wise double descent occurs for a different reason than model-wise double descent: epoch-wise double descent can be explained as a temporal superposition of multiple bias-variance tradeoff curves rather than by a unimodal variance curve. This also means that the overall bias (i.e., the sum of the individual bias terms) might not be decreasing, and the overall variance might not be unimodal as in the model-wise case. To demonstrate that this is in fact not the case here, we plot in Figure 7 the numerically computed bias and variance terms (computed as proposed in Yang et al. (2020)) for the CNN experiment from Section 4, along with the risk; the plot shows that the overall bias is in fact increasing, while the variance has a double-descent-like shape.

Figure 8: Scaling the stepsizes of the different layers/components eliminates the multi-descent, similar to the case of double descent, and improves the optimal early stopping performance for both the linear model and the two-layer neural network, as predicted by our theory.

C MULTI-DESCENT

We note that, in principle, we can also observe multiple descents as a function of training time. Specifically, recall the risk expression for the linear case, equation (1). It consists of d bias-variance tradeoffs, so in principle those curves might give rise not only to epoch-wise double descent, but to multiple descents. See Figure 8, left panel, where we show an example of three bias-variance tradeoffs that add up to a multi-descent curve. Likewise, multi-descent can occur for neural networks; in Figure 8, right panel, we demonstrate this for the two-layer network introduced in Section 3. For multi-descent to occur in a neural network, however, a very particular setup is required. Specifically, for the two-layer neural network we consider, we found that the existence of multiple descents depends heavily on the noise in the data generation process, and that multi-descent occurs only for a particular range of noise levels. In more detail, we draw data from the linear model specified in Section 2.1, in exactly the same way as for the simulations in the main body, but this time with added noise (the variance σ² of the additive noise z is nonzero). Specifically, the 50-dimensional feature vectors were drawn from a Gaussian with diagonal covariance matrix with geometrically decaying singular values starting from σ_1 = 4, and with noise variance σ = 11. We generated n = 100 examples, and the network has width k = 250. Based on our theoretical and experimental findings for the linear case and the two-layer neural network, multiple descents could in principle be observed in other empirical scenarios and for other architectures. However, we did not observe multi-descent in a practical setup (such as training on CIFAR-10 with a convolutional network), as it requires a very particular combination of underlying data distribution, network architecture, and training.
Image classification datasets are considered to be minimally noisy and highly structured, and this particular setup does not seem to occur in practice even with artificially injected label noise: at least, we did not observe it when training standard networks on CIFAR-10, and it has not been reported elsewhere.
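The superposition argument can be sketched directly: three U-shaped tradeoffs with well-separated timescales sum to a curve with several descents. All constants below are hand-picked for illustration and are not fitted to Figure 8.

```python
import numpy as np

# Three bias-variance tradeoff curves, each U-shaped in the iteration t,
# whose sum exhibits more than two descents. Rates and noise levels are
# hand-picked so the three tradeoffs play out at separated timescales.
t = np.arange(1, 200001)

def U(theta2, rate, noise):
    decay = (1.0 - rate) ** t
    return theta2 * decay**2 + noise * (1.0 - decay) ** 2

risk = U(1.0, 1e-1, 0.5) + U(1.0, 1e-3, 0.5) + U(1.0, 1e-5, 0.5)

sign = np.sign(np.diff(risk))
sign = sign[sign != 0]
extrema = int(np.sum(sign[1:] != sign[:-1]))   # local minima and maxima of the sum
```

Each individual curve has a single minimum, but their sum alternates between descents and ascents several times, with the global minimum reached only late in training.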

D PROOF OF THEOREM 1

The difference between the risk and the risk expression can be bounded as
$$|R(\theta_t) - \tilde R(\tilde\theta_t)| \le |R(\theta_t) - R(\tilde\theta_t)| + |R(\tilde\theta_t) - \tilde R(\tilde\theta_t)|, \quad (9)$$
where $\tilde R(\tilde\theta_t)$ denotes the risk expression given in (1). We bound the two terms on the right-hand side separately. We start with bounding the first term by applying the lemma below.

Lemma 1. Define $\tilde X$ so that $X = \tilde X\Sigma$. Suppose that $\|I - \tilde X^T\tilde X\| \le \epsilon$, with $\epsilon \le \frac{\min_i \eta_i\sigma_i^2}{2\max_i \eta_i\sigma_i^2}$. Then
$$|R(\theta_t) - R(\tilde\theta_t)| \le \left(1-(1-\min_i \eta_i\sigma_i^2/2)^t\right)^2 \frac{8\max_i \eta_i^2\sigma_i^4}{\min_i \eta_i^2\sigma_i^4}\,\epsilon^2 \max_{\tau \le t} \|\Sigma\tilde\theta_\tau - \Sigma\theta^*\|_2^2. \quad (10)$$

In order to apply the lemma, we start by verifying its condition. Towards this goal, consider the matrix $\tilde X$ defined by $X = \tilde X\Sigma$ and note that its entries are iid $\mathcal N(0, 1/n)$. A standard concentration inequality from the compressive sensing literature (specifically (Foucart & Rauhut, 2013, Chapter 9)) states that, for any $\beta \in (0,1)$,
$$P\left[\|I - \tilde X^T\tilde X\| \ge \beta\right] \le e^{-n\beta^2/15 + 4d}.$$
With $\beta = \sqrt{75d/n}$ we obtain that, with probability at least $1 - e^{-d}$,
$$\|I - \tilde X^T\tilde X\| \le \sqrt{75d/n}.$$
Next, we bound the maximum on the right-hand side of (10) with the following lemma.

Lemma 2. Provided that $\eta_i\sigma_i^2 \le 1$ for all $i$, with probability at least $1 - 2d(e^{-\beta^2/2} + e^{-n/8})$,
$$\max_\tau \|\Sigma\tilde\theta_\tau - \Sigma\theta^*\|_2^2 \le 2\|\Sigma\theta^*\|_2^2 + 4\frac{d}{n}\sigma^2\beta^2.$$

Applying the lemma with $\beta^2 = 10\log(2d)$, we obtain that, with probability at least $1 - 2d(2d)^{-5} - 2de^{-n/8} - e^{-d}$,
$$|R(\theta_t) - R(\tilde\theta_t)| \le \frac{8\max_i \eta_i^2\sigma_i^4}{\min_i \eta_i^2\sigma_i^4}\,\frac{75d}{n}\left(2\|\Sigma\theta^*\|_2^2 + 4\frac{d}{n}\sigma^2\,10\log(2d)\right). \quad (11)$$
We are now ready to bound the second term in (9).

Lemma 3. With probability at least $1 - 4e^{-\beta^2/8}$, we have
$$|R(\tilde\theta_t) - \tilde R(\tilde\theta_t)| \le \frac{\sigma^2}{n}3\beta\sqrt d, \quad (12)$$
with $\tilde R(\tilde\theta_t)$ as defined in (1).

Applying the two bounds (11) and (12) to the right-hand side of (9) concludes the proof. The remainder of the proof is devoted to proving the three lemmas above.

D.1 PROOF OF LEMMA 1

Recall that the iterates of the original and of the closely related problem are given by
$$\theta_{t+1} - \theta^* = (I - \mathrm{diag}(\eta)X^TX)(\theta_t - \theta^*) + \mathrm{diag}(\eta)X^Tz,$$
$$\tilde\theta_{t+1} - \theta^* = \left(I - \mathrm{diag}(\eta)\Sigma^T\Sigma\right)(\tilde\theta_t - \theta^*) + \mathrm{diag}(\eta)X^Tz.$$
Note that $X = \tilde X\Sigma$, where $\tilde X$ has iid Gaussian entries $\mathcal N(0, 1/n)$. With this notation, and using that $\Sigma$ is diagonal and therefore commutes with diagonal matrices, we obtain the following expressions for the residuals of the two iterates:
$$\Sigma\theta_{t+1} - \Sigma\theta^* = (I - \mathrm{diag}(\eta)\Sigma^2\tilde X^T\tilde X)(\Sigma\theta_t - \Sigma\theta^*) + \mathrm{diag}(\eta)\Sigma^2\tilde X^Tz,$$
$$\Sigma\tilde\theta_{t+1} - \Sigma\theta^* = \left(I - \mathrm{diag}(\eta)\Sigma^2\right)(\Sigma\tilde\theta_t - \Sigma\theta^*) + \mathrm{diag}(\eta)\Sigma^2\tilde X^Tz.$$
The difference between the residuals is
$$\Sigma\theta_{t+1} - \Sigma\tilde\theta_{t+1} = (I - \mathrm{diag}(\eta)\Sigma^2\tilde X^T\tilde X)(\Sigma\theta_t - \Sigma\theta^*) - \left(I - \mathrm{diag}(\eta)\Sigma^2\right)(\Sigma\tilde\theta_t - \Sigma\theta^*)$$
$$= \Sigma\theta_t - \Sigma\tilde\theta_t - \mathrm{diag}(\eta)\Sigma^2\tilde X^T\tilde X(\Sigma\theta_t - \Sigma\theta^*) + \mathrm{diag}(\eta)\Sigma^2(\Sigma\tilde\theta_t - \Sigma\theta^*)$$
$$= (I - \mathrm{diag}(\eta)\Sigma^2\tilde X^T\tilde X)(\Sigma\theta_t - \Sigma\tilde\theta_t) + \mathrm{diag}(\eta)\Sigma^2(I - \tilde X^T\tilde X)(\Sigma\tilde\theta_t - \Sigma\theta^*),$$
where the last equality follows by adding and subtracting $\mathrm{diag}(\eta)\Sigma^2\tilde X^T\tilde X(\Sigma\tilde\theta_t - \Sigma\theta^*)$ and rearranging the terms. It follows that
$$\|\Sigma\theta_{t+1} - \Sigma\tilde\theta_{t+1}\|_2 \le (1 - \min_i \eta_i\sigma_i^2/2)\|\Sigma\theta_t - \Sigma\tilde\theta_t\|_2 + \max_i \eta_i\sigma_i^2\,\epsilon\,\max_\tau\|\Sigma\tilde\theta_\tau - \Sigma\theta^*\|_2. \quad (13)$$
Here, we used the bound
$$\|I - \mathrm{diag}(\eta)\Sigma^2\tilde X^T\tilde X\| \le \|I - \mathrm{diag}(\eta)\Sigma^2\| + \|\mathrm{diag}(\eta)\Sigma^2(I - \tilde X^T\tilde X)\| \le (1 - \min_i\eta_i\sigma_i^2) + \max_i\eta_i\sigma_i^2\,\epsilon \le 1 - \min_i\eta_i\sigma_i^2/2.$$
Here, we used that $\eta_i\sigma_i^2 \le 1$ by assumption, and the last inequality follows from the assumption $\epsilon \le \frac{\min_i\eta_i\sigma_i^2}{2\max_i\eta_i\sigma_i^2}$. Iterating the bound (13) yields
$$\|\Sigma\theta_t - \Sigma\tilde\theta_t\|_2 \le \frac{1 - (1-\min_i\eta_i\sigma_i^2/2)^t}{\min_i\eta_i\sigma_i^2/2}\,\max_i\eta_i\sigma_i^2\,\epsilon\,\max_\tau\|\Sigma\tilde\theta_\tau - \Sigma\theta^*\|_2,$$
which concludes the proof.

D.2 PROOF OF LEMMA 2

Recall that
$$\sigma_i(\tilde\theta_{t,i} - \theta^*_i) = \sigma_i(1-\eta_i\sigma_i^2)^t\theta_i^* + \tilde x_i^Tz\left(1 - (1-\eta_i\sigma_i^2)^t\right).$$
With $\eta_i\sigma_i^2 \le 1$, by assumption, it follows that
$$\sigma_i^2(\tilde\theta_{t,i} - \theta_i^*)^2 \le 2\sigma_i^2(\theta_i^*)^2 + 2(\tilde x_i^Tz)^2. \quad (14)$$
Conditioned on $z$, the random variable $\tilde x_i^Tz$ is zero-mean Gaussian with variance $\|z\|_2^2/n$. Thus,
$$P\left[|\tilde x_i^Tz|^2 \ge \frac{\|z\|_2^2}{n}\beta^2\right] \le 2e^{-\beta^2/2}.$$
Moreover, as used previously in (16), with probability at least $1 - 2e^{-n/8}$, $\|z\|_2^2 \le 2\sigma^2$. Combining the two with the union bound, we obtain
$$P\left[|\tilde x_i^Tz|^2 \ge \frac{2\sigma^2}{n}\beta^2\right] \le 2e^{-\beta^2/2} + 2e^{-n/8}.$$
Using this bound in inequality (14), we have, with probability at least $1 - 2(e^{-\beta^2/2} + e^{-n/8})$, that
$$\sigma_i^2(\tilde\theta_{t,i} - \theta_i^*)^2 \le 2\sigma_i^2(\theta_i^*)^2 + 4\frac{1}{n}\sigma^2\beta^2.$$
By the union bound over all $i$, we therefore get
$$\max_t\|\Sigma\tilde\theta_t - \Sigma\theta^*\|_2^2 \le 2\|\Sigma\theta^*\|_2^2 + 4\frac{d}{n}\sigma^2\beta^2$$
with probability at least $1 - 2d(e^{-\beta^2/2} + e^{-n/8})$.

D.3 PROOF OF LEMMA 3

We have
$$R(\tilde\theta_t) = \sigma^2 + \sum_{i=1}^d \sigma_i^2\left((1-\eta_i\sigma_i^2)^t\theta_i^* + \frac{\tilde x_i^Tz}{\sigma_i}\left(1-(1-\eta_i\sigma_i^2)^t\right)\right)^2 = \sigma^2 + \sum_{i=1}^d \underbrace{\left(\sigma_i(1-\eta_i\sigma_i^2)^t\theta_i^* + \tilde x_i^Tz\left(1-(1-\eta_i\sigma_i^2)^t\right)\right)^2}_{Z_i}.$$
The random variable $Z_i$, conditioned on $z$, is the square of a Gaussian whose variance is upper bounded by $\|z\|_2^2/n$, and has expectation
$$E[Z_i] = \sigma_i^2(1-\eta_i\sigma_i^2)^{2t}(\theta_i^*)^2 + \frac{\|z\|_2^2}{n}\left(1-(1-\eta_i\sigma_i^2)^t\right)^2.$$
By a standard concentration inequality for sub-exponential random variables (see, e.g., (Wainwright, 2019, Chapter 2, Equation 2.21)), we get, for $\beta\in(0,\sqrt d)$ and conditioned on $z$, that the event
$$\mathcal E_1 = \left\{\left|\sum_{i=1}^d (Z_i - E[Z_i])\right| \le \frac{\|z\|_2^2}{n}\sqrt d\,\beta\right\}$$
occurs with probability at least $1 - 2e^{-\beta^2/8}$. With the same standard concentration inequality for sub-exponential random variables, the event
$$\mathcal E_2 = \left\{\left|\|z\|_2^2 - \sigma^2\right| \le \frac{\sigma^2\beta}{\sqrt n}\right\} \quad (16)$$
also occurs with probability at least $1 - 2e^{-\beta^2/8}$. By the union bound, both events hold simultaneously with probability at least $1 - 4e^{-\beta^2/8}$. On both events, we have
$$|R(\tilde\theta_t) - \tilde R(\tilde\theta_t)| = \left|\sum_{i=1}^d (Z_i - E[Z_i]) + \frac{1}{n}\left(\|z\|_2^2 - \sigma^2\right)\sum_{i=1}^d\left(1-(1-\eta_i\sigma_i^2)^t\right)^2\right|$$
$$\le \left|\sum_{i=1}^d (Z_i - E[Z_i])\right| + \frac{d}{n}\left|\|z\|_2^2 - \sigma^2\right| \le \frac{\|z\|_2^2}{n}\sqrt d\,\beta + \frac{d}{n}\frac{1}{\sqrt n}\sigma^2\beta \le \frac{2\sigma^2}{n}\sqrt d\,\beta + \frac{d}{n}\frac{1}{\sqrt n}\sigma^2\beta \le \frac{\sigma^2}{n}3\beta\sqrt d,$$
concluding the proof of the lemma.

E PROOF OF PROPOSITION 1

By equation (1), the risk expression is a sum of U-shaped curves: $\tilde R(\tilde\theta_t) = \sigma^2 + \sum_{i=1}^d U_i(t)$. We start by considering one such U-shaped curve and find its minimum as a function of the number of iterations $t$. Towards this end, we set the derivative of one such U-shaped curve,
$$\frac{\partial}{\partial t}U_i(t) = \sigma_i^2(\theta_i^*)^2\,2\log(1-\eta_i\sigma_i^2)(1-\eta_i\sigma_i^2)^{2t} + \frac{\sigma^2}{n}2\left((1-\eta_i\sigma_i^2)^t - 1\right)\log(1-\eta_i\sigma_i^2)(1-\eta_i\sigma_i^2)^t$$
$$= 2\log(1-\eta_i\sigma_i^2)(1-\eta_i\sigma_i^2)^t\left((1-\eta_i\sigma_i^2)^t\left(\sigma_i^2(\theta_i^*)^2 + \frac{\sigma^2}{n}\right) - \frac{\sigma^2}{n}\right),$$
to zero, which gives that the minimum occurs when
$$\eta_i = \frac{1}{\sigma_i^2}\left(1 - \left(\frac{\sigma^2/n}{\sigma_i^2(\theta_i^*)^2 + \sigma^2/n}\right)^{1/t}\right). \quad (17)$$
For a stepsize $\eta_i$ and iteration $t$ satisfying this equation, we get
$$\min_t U_i(t) = \frac{(\sigma^2/n)\,\sigma_i^2(\theta_i^*)^2}{\sigma^2/n + \sigma_i^2(\theta_i^*)^2};$$
thus the minimum value is independent of the iteration $t$ and of the stepsize, provided they are related as described in (17) above.
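This relation is easy to check numerically: picking the stepsize via (17) to place the minimum at any prescribed iteration leaves the minimum value unchanged. The constants below are arbitrary illustrative choices.

```python
import numpy as np

# Check of Proposition 1: for each target iteration t_star, choose eta_i by
# equation (17); the minimum of U_i then sits at t_star, with value
# (sigma^2/n) sigma_i^2 (theta_i*)^2 / (sigma^2/n + sigma_i^2 (theta_i*)^2),
# independent of t_star and of the stepsize. Constants illustrative.
s2n = 0.1            # sigma^2 / n
svi, thi = 0.7, 2.0  # sigma_i and theta_i^*
a = svi**2 * thi**2
target = s2n * a / (s2n + a)

mins = []
for t_star in (5, 50, 500):
    eta = (1.0 - (s2n / (a + s2n)) ** (1.0 / t_star)) / svi**2   # equation (17)
    t = np.arange(0, 20 * t_star)
    decay = (1.0 - eta * svi**2) ** t
    Ui = a * decay**2 + s2n * (1.0 - decay) ** 2
    mins.append((int(np.argmin(Ui)), float(Ui.min())))
```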

F PROOF AND STATEMENTS FOR NEURAL NETWORKS

In this section, we prove the following result, which is a slightly more formal version of our main result for neural networks, Theorem 2.

Theorem 3. Draw a dataset $D = \{(x_1,y_1),\ldots,(x_n,y_n)\}$ consisting of $n$ examples drawn i.i.d. from a distribution with $\|x_i\|_2 = 1$ and $|y_i| \le 1$. Let $\Sigma\in\mathbb R^{n\times n}$ be the corresponding Gram matrix defined in (4), and suppose its smallest singular value obeys $\alpha > 0$. Pick an error parameter $\xi\in(0,1)$ and a failure probability $\delta\in(0,1)$, and consider the two-layer neural network $f_{W,v}(x) = \frac{1}{\sqrt k}\mathrm{relu}(x^TW)v$, with parameters $W\in\mathbb R^{d\times k}$, $v\in\mathbb R^k$ initialized according to (3) with initialization scale parameters $\nu,\omega$ obeying $\nu\omega \le \xi/\sqrt{32\log(2n/\delta)}$ and $\nu+\omega\le 1$. Suppose that the network is sufficiently overparameterized, i.e.,
$$k \ge \Omega\left(\frac{n^{10}}{\alpha^{11}\min(\nu,\omega)\xi^4}\right). \quad (18)$$
Then the risk of the network trained with gradient descent with constant stepsize $\eta$ for $t$ iterations obeys, with probability at least $1-\delta$,
$$R(f_{W_t,v_t}) \le \sqrt{\frac{1}{n}\sum_{i=1}^n \langle u_i,y\rangle^2(1-\eta\sigma_i^2)^{2t}} + \sqrt{\frac{1}{n}\sum_{i=1}^n \langle u_i,y\rangle^2\frac{\left(1-(1-\eta\sigma_i^2)^t\right)^2}{\sigma_i^2}} + \frac{1}{\sqrt n} + O(\xi/\alpha). \quad (19)$$
Theorem 2 directly follows by choosing the error parameter as $\xi = O(\alpha)$.
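The t-dependence of the bound (19) can be sketched numerically: the first term decays with training while the second grows, so the bound itself traces an early-stopping tradeoff. The spectrum and projections below are synthetic choices, not derived from any dataset.

```python
import numpy as np

# Evaluate the two t-dependent terms of the risk bound (19) for a synthetic
# Gram spectrum: the first (fitting) term decreases with t, the second
# (distance-from-initialization) term increases; their sum suggests an
# early stopping time. sv2 and uy2 are synthetic choices.
n = 100
sv2 = np.geomspace(1.0, 1e-3, n)          # eigenvalues sigma_i^2 of Sigma
uy2 = np.full(n, 1.0 / n)                 # squared projections <u_i, y>^2
eta = 0.9 / sv2.max()

t = np.unique(np.geomspace(1, 1e5, 200).astype(int))[:, None]
decay = (1.0 - eta * sv2) ** t
term1 = np.sqrt((uy2 * decay**2).sum(axis=1) / n)
term2 = np.sqrt((uy2 * (1.0 - decay) ** 2 / sv2).sum(axis=1) / n)
bound = term1 + term2 + 1.0 / np.sqrt(n)  # up to the O(xi/alpha) slack
```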

F.1 PROOF OF THEOREM 3

In this section, we provide a proof of Theorem 3. Our proof relies on the observation that highly overparameterized neural networks behave like associated linear models, as established in a large number of prior works (Arora et al., 2019; Du et al., 2018; Oymak & Soltanolkotabi, 2020; Oymak et al., 2019; Heckel & Soltanolkotabi, 2020b). The proof consists of two parts. First, we control the empirical risk as a function of the number of gradient descent steps, $t$. Second, we control the generalization error, i.e., the gap between the population risk and the empirical risk, by bounding the Rademacher complexity of the function class consisting of two-layer networks trained with $t$ iterations of gradient descent. Recall that our result depends on the singular values and vectors of the Gram matrix of kernels associated with the two-layer network. The Gram matrix is given as the expectation of the outer product of the Jacobian of the network at initialization:
$$\Sigma = E\left[J(W_0,v_0)J^T(W_0,v_0)\right] = \sum_{i=1}^n \sigma_i^2 u_iu_i^T.$$
Here, the expectation is with respect to the random initialization $W_0, v_0$.

Bound on the training error:

We start with a result that controls the training error and ensures that the coefficients of the neural network move little from their initialization.

Theorem 4. Pick an error parameter $\xi\in(0,1)$ and any failure probability $\delta\in(0,1)$, and choose $\nu,\omega$ so that they satisfy $\nu\omega \le \xi/\sqrt{32\log(2n/\delta)}$. Suppose that the network is sufficiently overparameterized, i.e.,
$$k \ge \Omega\left(\frac{n^{10}(\nu+\omega)^9}{\alpha^{11}\min(\nu,\omega)\xi^4}\right). \quad (20)$$
i) Then, with probability at least $1-\delta$, the training error after $t$ iterations of gradient descent obeys
$$\sqrt{\sum_{i=1}^n \left(y_i - f_{W_t,v_t}(x_i)\right)^2} \le \sqrt{\sum_{i=1}^n (1-\eta\sigma_i^2)^{2t}\langle u_i,y\rangle^2} + \xi\|y\|_2. \quad (21)$$
ii) Moreover, the coefficients overall deviate little from their initialization, i.e.,
$$\sqrt{\|W_t - W_0\|_F^2 + \|v_t - v_0\|_2^2} \le \underbrace{\sqrt{\sum_{i=1}^n \left(\langle u_i,y\rangle\frac{1-(1-\eta\sigma_i^2)^t}{\sigma_i}\right)^2} + \frac{\xi}{\alpha}\sqrt n}_{=:Q}. \quad (22)$$
Here, $\|\cdot\|_F$ denotes the Frobenius norm. In addition, each of the coefficients changes only little, i.e., for all iterations $t$,
$$\|w_{t,r} - w_{0,r}\|_2 \le \left(\nu + \frac{4}{\alpha}\sqrt n\right)\frac{n}{\sqrt k}\frac{2}{\alpha^2}, \quad (23)$$
$$|v_{t,r} - v_{0,r}| \le \left(O\left(\omega\sqrt{\log(nk/\delta)}\right) + \frac{4}{\alpha}\sqrt n\right)\frac{n}{\sqrt k}\frac{2}{\alpha^2}. \quad (24)$$
Here, $w_{t,r}$ is the $r$-th row of $W_t$, and $v_{t,r}$ is the $r$-th entry of $v_t$.

Bound on the empirical risk: Because we train with respect to the $\ell_2$-loss but define the risk with respect to a generic 1-Lipschitz loss $\ell$, the empirical risk and the training loss are not the same. Nevertheless, we can upper bound the empirical risk computed over the training set at iteration $t$ with the training loss at iteration $t$:
$$\hat R(f_{W_t,v_t}) = \frac{1}{n}\sum_{i=1}^n \ell\left(f_{W_t,v_t}(x_i), y_i\right) \overset{(i)}{\le} \frac{1}{n}\sum_{i=1}^n \left|f_{W_t,v_t}(x_i) - y_i\right| \le \frac{1}{\sqrt n}\sqrt{\sum_{i=1}^n \left(f_{W_t,v_t}(x_i) - y_i\right)^2} \overset{(ii)}{\le} \sqrt{\frac{1}{n}\sum_{i=1}^n \langle u_i,y\rangle^2(1-\eta\sigma_i^2)^{2t}} + \xi,$$
where (i) follows from $\ell(z,y) = \ell(z,y) - \ell(y,y) \le |z-y|$, because the loss is 1-Lipschitz in its first argument. Inequality (ii) is the most interesting one; it follows from equation (21) of Theorem 4, using $\|y\|_2 \le \sqrt n$, and holds with probability at least $1-\delta$. The bound (21) is proven by showing that, provided the network is sufficiently wide, the training loss behaves as if gradient descent were applied to a linear least-squares problem with dynamics governed by the Gram matrix $\Sigma$.
Bound on the generalization error: Next, we bound the generalization error $R(f) - \hat R(f)$ by bounding the Rademacher complexity of the class of functions that gradient descent can reach within $t$ iterations. Let $\mathcal F$ be a class of functions $f:\mathbb R^d\to\mathbb R$, and let $\varepsilon_1,\ldots,\varepsilon_n$ be iid Rademacher random variables, i.e., random variables chosen uniformly from $\{-1,1\}$. Given the dataset $D$, define the empirical Rademacher complexity of the function class $\mathcal F$ as
$$\mathcal R_D(\mathcal F) = \frac{1}{n}E\left[\sup_{f\in\mathcal F}\sum_{i=1}^n \varepsilon_i f(x_i)\right].$$
Here, $D = \{(x_1,y_1),\ldots,(x_n,y_n)\}$ is the training set, consisting of $n$ points drawn iid from the example-generating distribution. By a standard result from statistical learning theory, a bound on the Rademacher complexity directly gives a bound on the generalization error for each predictor in a class of predictors.

Theorem 5 ((Mohri et al., 2012, Thm. 3.1)). Suppose $\ell(\cdot,\cdot)$ is bounded in $[0,1]$ and 1-Lipschitz in its first argument. With probability at least $1-\delta$ over the random dataset $D$ consisting of $n$ iid examples, we have
$$\sup_{f\in\mathcal F}\left(R(f) - \hat R(f)\right) \le 2\mathcal R_D(\mathcal F) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$

We consider the class of neural networks with weights close to the random initialization $W_0, v_0$, defined as
$$\mathcal F_{Q,M} = \left\{f_{W,v}: W\in\mathcal W, v\in\mathcal V\right\},$$
with
$$\mathcal W = \left\{W: \|W - W_0\|_F \le Q,\ \|w_r - w_{0,r}\|_2 \le \omega M \text{ for all } r\right\},$$
$$\mathcal V = \left\{v: \|v - v_0\|_2 \le Q,\ |v_r - v_{0,r}| \le \nu M \text{ for all } r\right\}.$$
The Rademacher complexity of this class of functions is controlled by the following result.

Lemma 4. Let $W_0$ be drawn from a Gaussian distribution with $\mathcal N(0,\omega^2)$ entries, and suppose the entries of $v_0$ are drawn uniformly from $\{-\nu,\nu\}$. Assume the $(x_i,y_i)$ are drawn iid from some distribution with $\|x_i\|_2 = 1$ and $|y_i|\le 1$. With probability at least $1-\delta$ over the random training set, and provided that $\sqrt{\log(2n/\delta)/(2k)} \le 1/2$, the empirical Rademacher complexity of $\mathcal F_{Q,M}$ is, simultaneously for all $Q$, bounded by
$$\mathcal R_D(\mathcal F_{Q,M}) \le \frac{Q}{\sqrt n}(\nu+\omega) + \nu\omega\left(5M^2\sqrt k + 4M\sqrt{\log(2/\delta)/2}\right).$$
We set $M = O\left(\frac{\xi}{\alpha}k^{-1/4}\right)$.
With this choice, the second term on the right-hand side above is bounded by
$$\nu\omega\left(5M^2\sqrt k + 4M\sqrt{\log(2/\delta)/2}\right) \le O(\xi/\alpha),$$
where we used $\nu\omega\le 1$ and $\sqrt{\log(2/\delta)/2}\,k^{-1/4} \le 1$, by assumption (18). Note that by (23) and by (24), combined with assumption (18), we have $\|w_r - w_{0,r}\|_2 \le \omega M$ and $|v_r - v_{0,r}| \le \nu M$, as desired. Let $Q_i = i$ for $i = 1,2,\ldots$. Simultaneously for all $i$, by the lemma above and for this choice of $M$, the function class $\mathcal F_{Q_i,M}$ has Rademacher complexity bounded by
$$\mathcal R_D(\mathcal F_{Q_i,M}) \le \frac{Q_i}{\sqrt n}(\nu+\omega) + O(\xi/\alpha).$$
We next choose the radius $Q$ as defined in (22). Let $i^*$ be the smallest integer such that $Q \le Q_{i^*}$, so that $Q_{i^*} \le Q+1$. We have $i^* \le O(\sqrt n/\alpha)$ and
$$\mathcal R_D(\mathcal F_{Q_{i^*},M}) \le \frac{Q+1}{\sqrt n}(\nu+\omega) + O(\xi/\alpha) \le \sqrt{\frac{1}{n}\sum_{i=1}^n\left(\langle u_i,y\rangle\frac{1-(1-\eta\sigma_i^2)^t}{\sigma_i}\right)^2} + \frac{1}{\sqrt n} + O(\xi/\alpha),$$
by the assumption of the theorem that $k$ is sufficiently large, and by $\nu+\omega\le 1$. Next, from a union bound over the finite set of integers $i = 1,\ldots,i^*$, we obtain
$$\max_{i=1,\ldots,i^*}\ \sup_{f\in\mathcal F_{Q_i,M}}\left(R(f) - \hat R(f)\right) \le \sqrt{\frac{1}{n}\sum_{i=1}^n\left(\langle u_i,y\rangle\frac{1-(1-\eta\sigma_i^2)^t}{\sigma_i}\right)^2} + \frac{1}{\sqrt n} + O(\xi/\alpha),$$
as desired.

Final bound on the risk: Combining the bound on the training error with the generalization bound yields the upper bound (19) on the risk of the network trained for $t$ iterations of gradient descent. The remainder of the proof is devoted to proving Theorem 4 and Lemma 4.
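For intuition about the quantity $\mathcal R_D$, it can be computed in closed form for a much simpler class: for linear predictors constrained to a norm ball, the supremum over the ball is attained at a multiple of $\sum_i \varepsilon_i x_i$. The sketch below estimates this by Monte Carlo; the toy class is an illustration, not the network class $\mathcal F_{Q,M}$ of the lemma.

```python
import numpy as np

# Monte Carlo sketch of the empirical Rademacher complexity for the toy
# class {x -> <w, x> : ||w||_2 <= Q}: the supremum over the ball is attained
# at w = Q * g / ||g|| with g = sum_i eps_i x_i, so each Rademacher draw
# contributes Q * ||sum_i eps_i x_i||_2 / n. For unit-norm x_i this is at
# most Q / sqrt(n) in expectation. (Illustration only.)
rng = np.random.default_rng(5)
n, d, Q = 200, 20, 3.0
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i||_2 = 1

trials = 2000
vals = np.empty(trials)
for j in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    vals[j] = Q * np.linalg.norm(eps @ X) / n   # closed-form supremum
rad = float(vals.mean())                        # estimate of R_D for this class
```

The estimate sits just below the $Q/\sqrt n$ bound, mirroring the $Q(\nu+\omega)/\sqrt n$ leading term in Lemma 4.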

F.2 PRELIMINARIES

We start by introducing some useful notation. First note that the predictions of the neural network for the $n$ training data points, as a function of the parameters, are
$$f(W,v) = \frac{1}{\sqrt k}\begin{bmatrix}\mathrm{relu}(x_1^TW)v\\ \vdots\\ \mathrm{relu}(x_n^TW)v\end{bmatrix} = \frac{1}{\sqrt k}\mathrm{relu}(XW)v, \quad (30)$$
where $X\in\mathbb R^{n\times d}$ is the feature matrix and $W\in\mathbb R^{d\times k}$ and $v\in\mathbb R^k$ are the trainable weights of the network. The transposed Jacobian of the function $f$ is given by
$$J^T(W,v) = \begin{bmatrix}J_1^T(W,v)\\ J_2^T(W)\end{bmatrix}\in\mathbb R^{(dk+k)\times n}, \quad (31)$$
where we defined the Jacobians corresponding to the weights of the first layer, $W$, and the second layer, $v$, respectively as
$$J_1^T(W,v) = \frac{1}{\sqrt k}\begin{bmatrix}v_1X^T\mathrm{diag}(\mathrm{relu}'(Xw_1))\\ \vdots\\ v_kX^T\mathrm{diag}(\mathrm{relu}'(Xw_k))\end{bmatrix}\in\mathbb R^{dk\times n}, \qquad J_2^T(W) = \frac{1}{\sqrt k}\mathrm{relu}(XW)^T\in\mathbb R^{k\times n}.$$
Here, $\mathrm{relu}'(x) = 1_{\{x\ge 0\}}$ is the derivative of the relu activation function, which is the step function. Our results depend on the singular values and vectors of the expected Jacobian at initialization:
$$E\left[J(W_0,v_0)J^T(W_0,v_0)\right] = \frac{\nu^2}{k}\sum_{\ell=1}^k E\left[\mathrm{relu}'(Xw_{0,\ell})\mathrm{relu}'(Xw_{0,\ell})^T\right]\odot XX^T + \frac{1}{k}E\left[\mathrm{relu}(XW_0)\mathrm{relu}(XW_0)^T\right],$$
where $\odot$ is the Hadamard product, and where we used that the entries of $v_0$ are chosen iid uniformly from $\{-\nu,\nu\}$. The expectation is over the weights $W_0$ at initialization, which are iid $\mathcal N(0,\omega^2)$. This yields
$$E\left[J(W_0,v_0)J^T(W_0,v_0)\right]_{ij} = \nu^2K_1(x_i,x_j) + \omega^2K_2(x_i,x_j),$$
where $K_1$ and $K_2$ are two kernels associated with the first and second layers of the network, given by
$$K_1(x_i,x_j) = E\left[\mathrm{relu}'(Xw)\mathrm{relu}'(Xw)^T\right]_{ij}\langle x_i,x_j\rangle = \frac{1}{2}\left(1 - \cos^{-1}(\rho_{ij})/\pi\right)\langle x_i,x_j\rangle, \qquad \rho_{ij} = \frac{\langle x_i,x_j\rangle}{\|x_i\|_2\|x_j\|_2},$$
and by
$$K_2(x_i,x_j) = \frac{1}{\omega^2}\frac{1}{k}E\left[\mathrm{relu}(XW)\mathrm{relu}(XW)^T\right]_{ij} = \frac{1}{\omega^2}E\left[\mathrm{relu}(Xw)\mathrm{relu}(Xw)^T\right]_{ij} = \frac{1}{2}\left(\frac{\sqrt{1-\rho_{ij}^2}}{\pi} + \left(1 - \cos^{-1}(\rho_{ij})/\pi\right)\rho_{ij}\right)\|x_i\|_2\|x_j\|_2.$$
For both of these expressions, we used the calculations from (Daniely et al., 2016, Sec. 4.2) for the final expressions of the kernels. Also note that, by assumption, $\|x_i\|_2 = 1$.
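The two kernel formulas can be checked by Monte Carlo over the Gaussian weights. This is a sketch: the sample size and the random pair $x_i, x_j$ are arbitrary choices.

```python
import numpy as np

# Monte Carlo check of the two kernels: for w ~ N(0, omega^2 I),
# K1(x_i, x_j) = E[relu'(w^T x_i) relu'(w^T x_j)] <x_i, x_j> and
# K2(x_i, x_j) = (1 / omega^2) E[relu(w^T x_i) relu(w^T x_j)], compared
# against the closed-form arc-cosine expressions (unit-norm inputs).
rng = np.random.default_rng(3)
d, m, omega2 = 5, 500000, 1.0
xi = rng.normal(size=d); xi /= np.linalg.norm(xi)
xj = rng.normal(size=d); xj /= np.linalg.norm(xj)
rho = float(xi @ xj)

W = rng.normal(0.0, np.sqrt(omega2), (m, d))
a, b = W @ xi, W @ xj
k1_mc = np.mean((a > 0) * (b > 0)) * rho
k2_mc = np.mean(np.maximum(a, 0.0) * np.maximum(b, 0.0)) / omega2

k1 = 0.5 * (1.0 - np.arccos(rho) / np.pi) * rho
k2 = 0.5 * (np.sqrt(1.0 - rho**2) / np.pi + (1.0 - np.arccos(rho) / np.pi) * rho)
```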

F.3 PROOF OF THEOREM 4 (BOUND ON THE TRAINING ERROR)

In this subsection, we prove Theorem 4.

F.3.1 THE DYNAMICS OF LINEAR AND NONLINEAR LEAST-SQUARES

Theorem 4 relies on approximating the trajectory of gradient descent applied to the training loss with an associated linear model that approximates the non-linear neural network in the highly-overparameterized regime. This strategy has been used in a number of recent publications (Arora et al., 2019; Du et al., 2018; Oymak & Soltanolkotabi, 2020; Oymak et al., 2019; Heckel & Soltanolkotabi, 2020b); in order to avoid repetition, we rely on a statement from (Heckel & Soltanolkotabi, 2020a, Theorem 4) that bounds the error between the true trajectory of gradient descent and the trajectory of an associated linear problem.

Let $f:\mathbb R^N\to\mathbb R^n$ be a non-linear function with parameters $\theta\in\mathbb R^N$, and consider the non-linear least-squares problem
$$L(\theta) = \frac{1}{2}\|f(\theta) - y\|_2^2.$$
The gradient descent iterations starting from an initial point $\theta_0$ are given by
$$\theta_{t+1} = \theta_t - \eta\nabla L(\theta_t), \quad \text{where}\quad \nabla L(\theta) = J^T(\theta)(f(\theta)-y), \quad (33)$$
and $J(\theta)\in\mathbb R^{n\times N}$ is the Jacobian of $f$ at $\theta$ (i.e., $[J(\theta)]_{i,j} = \partial f_i(\theta)/\partial\theta_j$). The associated linearized least-squares problem is defined as
$$L_{\mathrm{lin}}(\theta) = \frac{1}{2}\|f(\theta_0) + J(\theta - \theta_0) - y\|_2^2. \quad (34)$$
Here, $J\in\mathbb R^{n\times N}$, referred to as the reference Jacobian, is a fixed matrix independent of the parameter $\theta$ that approximates the Jacobian mapping at initialization, $J(\theta_0)$. Starting from the same initial point $\theta_0$, the gradient descent updates of the linearized problem are
$$\tilde\theta_{t+1} = \tilde\theta_t - \eta J^T\left(f(\theta_0) + J(\tilde\theta_t - \theta_0) - y\right). \quad (35)$$
To show that the non-linear updates (33) are close to the linearized iterates (35), we make the following assumptions:
i) We assume that the singular values of the reference Jacobian obey, for some $\alpha,\beta$,
$$\sqrt 2\,\alpha \le \sigma_n \le \sigma_1 \le \beta. \quad (36a)$$
Furthermore, we assume that the norm of the Jacobian associated with the non-linear model $f$ is bounded in a radius $R$ around the random initialization:
$$\|J(\theta)\| \le \beta \quad \text{for all } \theta\in\mathcal B_R(\theta_0). \quad (36b)$$
Here, $\mathcal B_R(\theta_0) := \{\theta: \|\theta - \theta_0\|_2 \le R\}$ is the ball with radius $R$ around $\theta_0$.
ii) We assume the reference Jacobian and the Jacobian of the non-linearity at initialization, $J(\theta_0)$, are $\epsilon_0$-close:
$$\|J(\theta_0) - J\| \le \epsilon_0. \quad (36c)$$
iii) We assume that within a radius $R$ around the initialization, the Jacobian varies by no more than $\epsilon/2$:
$$\|J(\theta) - J(\theta_0)\| \le \frac{\epsilon}{2} \quad \text{for all } \theta\in\mathcal B_R(\theta_0). \quad (36d)$$
Under these assumptions, the non-linear residual $r_t := f(\theta_t) - y$ and the linear residual $\tilde r_t := f(\theta_0) + J(\tilde\theta_t - \theta_0) - y$ are close throughout the entire run of gradient descent.

Theorem 6 ((Heckel & Soltanolkotabi, 2020a, Theorem 4), closeness of linear and non-linear least-squares problems). Assume the Jacobian $J(\theta)\in\mathbb R^{n\times N}$ associated with the function $f(\theta)$ obeys Assumptions (36a)-(36d) around an initial point $\theta_0\in\mathbb R^N$ with respect to a reference Jacobian $J\in\mathbb R^{n\times N}$ and with parameters $\alpha,\beta,\epsilon_0,\epsilon$ obeying $2\beta(\epsilon_0+\epsilon)\le\alpha^2$, and with radius $R$. Furthermore, assume the radius $R$ is given by
$$R := 2\|J^\dagger r_0\|_2 + 5\frac{\beta^2}{\alpha^4}(\epsilon_0+\epsilon)\|r_0\|_2. \quad (37)$$
Here, $J^\dagger$ is the pseudo-inverse of $J$. We run gradient descent with stepsize $\eta\le\frac{1}{\beta^2}$ on the linear and non-linear least-squares problems, starting from the same initialization $\theta_0$. Then, for all iterations $t$:
i) the non-linear residual converges geometrically,
$$\|r_t\|_2 \le \left(1-\eta\alpha^2\right)^t\|r_0\|_2; \quad (38)$$
ii) the residuals of the original and the linearized problems are close,
$$\|r_t - \tilde r_t\|_2 \le \frac{2\beta(\epsilon_0+\epsilon)}{e\ln(2)\,\alpha^2}\|r_0\|_2; \quad (39)$$
iii) the parameters of the original and the linearized problems are close,
$$\|\theta_t - \tilde\theta_t\|_2 \le 2.5\frac{\beta^2}{\alpha^4}(\epsilon_0+\epsilon)\|r_0\|_2; \quad (40)$$
iv) and the parameters are not far from the initialization,
$$\|\theta_t - \theta_0\|_2 \le \frac{R}{2}. \quad (41)$$
Theorem 6 above formalizes that in a (small) radius around the initialization, the non-linear problem behaves very similarly to its associated linear problem. As a consequence, to characterize the dynamics of the non-linear problem, it suffices to characterize the dynamics of the linearized problem.
This is the subject of our next theorem, which is a standard result on the gradient descent iterations of a least-squares problem; see, for example, (Heckel & Soltanolkotabi, 2020b, Thm. 5) for the proof.

Theorem 7 (e.g., Theorem 5 in Heckel & Soltanolkotabi (2020b)). Consider a linear least-squares problem (34) and let $J = \sum_{i=1}^n\sigma_iu_iv_i^T$ be the singular value decomposition of the matrix $J$. Then the linear residual $\tilde r_t$ after $t$ iterations of gradient descent with updates (35) is
$$\tilde r_t = \sum_{i=1}^n\left(1-\eta\sigma_i^2\right)^tu_i\langle u_i,r_0\rangle. \quad (42)$$
Moreover, using a stepsize satisfying $\eta\le\frac{1}{\sigma_1^2}$, the linearized iterates (35) obey
$$\|\tilde\theta_t - \theta_0\|_2^2 = \sum_{i=1}^n\left(\langle u_i,r_0\rangle\frac{1-(1-\eta\sigma_i^2)^t}{\sigma_i}\right)^2. \quad (43)$$

F.3.2 PROVING THEOREM 4 BY APPLYING THEOREM 6

We are now ready to prove Theorem 4. We apply Theorem 6 to the predictions of the network given by $f(W,v)$ defined in (30), with parameter $\theta = (W,v)$. As reference Jacobian we choose a matrix $J\in\mathbb R^{n\times(dk+k)}$ that satisfies $JJ^T = E[J(W_0,v_0)J^T(W_0,v_0)]$ (where the expectation is over the random initialization $(W_0,v_0)$), and at the same time is very close to the Jacobian of $f$ at initialization, i.e., to $J(W_0,v_0)$. Towards this goal, we apply Theorem 6 with the following choices of parameters:
$$\alpha = \sigma_{\min}(\Sigma)/\sqrt 2, \quad \beta = 10\sqrt n(\omega+\nu), \quad \epsilon = \frac{1}{16}\xi\frac{\alpha^3}{\beta^2}, \quad \epsilon_0 = 2\sqrt{(\omega^2+\nu^2)\frac{3n}{\sqrt k}\log(kn/\delta)}. \quad (44)$$
Note that assumption (20) guarantees that $\epsilon_0\le\epsilon$, a fact we use later. We now verify that the conditions of Theorem 6 are satisfied for this choice of parameters with probability at least $1-\delta$. Specifically, we show that each of the conditions holds with probability at least $1-\delta$. By a union bound, the overall success probability is then at least $1-\Omega(\delta)$, and by rescaling $\delta$ by a constant, the conditions are satisfied with probability at least $1-\delta$.
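Theorem 7's closed forms are easy to verify numerically for a random linear least-squares instance; the sketch below uses arbitrary dimensions and a linear model $f(\theta) = J\theta$, so the reference Jacobian is exact.

```python
import numpy as np

# Numerical check of the closed-form linear dynamics (42) and (43): run
# gradient descent on a random linear least-squares problem and compare the
# residual and the distance from initialization against the SVD formulas.
rng = np.random.default_rng(4)
n, N = 10, 30
J = rng.normal(size=(n, N)) / np.sqrt(N)
y = rng.normal(size=n)
theta0 = np.zeros(N)
r0 = J @ theta0 - y                        # residual at initialization

U, s, Vt = np.linalg.svd(J, full_matrices=False)
eta = 0.9 / s.max() ** 2
T = 50
theta = theta0.copy()
for _ in range(T):
    theta -= eta * (J.T @ (J @ theta - y))

r_gd = J @ theta - y
r_formula = U @ ((1.0 - eta * s**2) ** T * (U.T @ r0))                    # eq. (42)
dist2 = np.sum(((U.T @ r0) * (1.0 - (1.0 - eta * s**2) ** T) / s) ** 2)  # eq. (43)
```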

Bound on residual:

We need a bound on the network outputs at initialization as well as on the initial residual in order to verify the conditions of the theorem. We start with the former:
$$\|f(W_0,v_0)\|_2 = \frac{1}{\sqrt k}\|\mathrm{relu}(XW_0)v_0\|_2 \le \nu\omega\sqrt{8\log(2n/\delta)}\,\|X\|_F = \nu\omega\sqrt{8\log(2n/\delta)}\sqrt n, \quad (45)$$
where the inequality holds with probability at least $1-\delta$ by Gaussian concentration (see Lemma 6 in Heckel & Soltanolkotabi (2020b), and recall that $W_0$ has iid $\mathcal N(0,\omega^2)$ entries). The last equality follows from $\|x_i\|_2 = 1$. It follows that, with probability at least $1-\delta$, the initial residual is bounded by
$$\|r_0\|_2 = \left\|\frac{1}{\sqrt k}\mathrm{relu}(XW_0)v_0 - y\right\|_2 \le \nu\omega\sqrt{8\log(2n/\delta)}\sqrt n + \sqrt n \le 2\sqrt n, \quad (46)$$
where the first inequality holds by the triangle inequality and the assumption $|y_i|\le 1$, and the second inequality by $\nu\omega\sqrt{8\log(2n/\delta)}\le 1$, again by assumption.

Radius in the theorem: In order to verify the conditions of the theorem, we need to control the radius $R$ defined in equation (37), which we do next. With our assumptions and the choices of parameters above,
$$R = 2\|J^\dagger r_0\|_2 + 5\frac{\beta^2}{\alpha^4}(\epsilon_0+\epsilon)\|r_0\|_2 \overset{(i)}{\le} \left(\frac{\sqrt 2}{\alpha} + 5\frac{\beta^2}{\alpha^4}(\epsilon_0+\epsilon)\right)\|r_0\|_2 \overset{(ii)}{\le} \left(\frac{\sqrt 2}{\alpha} + \frac{5}{16\alpha}\right)\|r_0\|_2 \overset{(iii)}{\le} \frac{4}{\alpha}\sqrt n \overset{(iv)}{\le} \min(\omega,\nu)\underbrace{\sqrt k\left(\frac{\epsilon}{\|X\|(\nu+3\omega)}\right)^3}_{=:\bar R}. \quad (47)$$
Here, (i) follows from the fact that $\|J^\dagger r_0\|_2 \le \frac{1}{\sqrt 2\alpha}\|r_0\|_2$, (ii) from $\epsilon_0+\epsilon \le 2\epsilon = \frac{1}{8}\xi\frac{\alpha^3}{\beta^2}$ (by the definitions of $\epsilon_0$ and $\epsilon$), and (iii) from the bound (46) on the residual. For (iv), we used assumption (20) of the theorem.

Verifying Assumptions (36a) and (36b): By definition, $JJ^T = \Sigma$; thus the lower bound in assumption (36a) holds by the definition of $\alpha$, since $\sigma_n(\Sigma)\ge\sqrt 2\alpha$. Regarding the upper bound of (36a), note that
$$\|J\|^2 \le \nu^2\|K_1\|_F + \omega^2\|K_2\|_F \le \nu^2n + \omega^2n,$$
where $K_1, K_2\in\mathbb R^{n\times n}$ are the kernel matrices with entries $K_1(x_i,x_j)$ and $K_2(x_i,x_j)$. It follows that $\|J\| \le 2(\omega+\nu)\sqrt n \le \beta$, as desired. This concludes the verification of (36a).
To verify assumption (36b), note that
$$\|J(W,v)\| \le \|J_1(W,v)\| + \|J_2(W)\| \le \frac{1}{\sqrt k}\left(\|X\|\|v\|_2 + \|XW\|_F\right) \le \frac{1}{\sqrt k}\left(\|X\|\left(\|v_0\|_2 + \|v-v_0\|_2 + \|W-W_0\|_F\right) + \|XW_0\|_F\right) \le \sqrt n\,10(\omega+\nu) = \beta.$$
For the last inequality, we used that $\|v_0\|_2 = \nu\sqrt k$; that $\|v-v_0\|_2 + \|W-W_0\|_F \le R \le \sqrt k\min(\omega,\nu)$, by the bound on the radius in (47); and finally that $\|XW_0\|_F \le 6\omega\sqrt k$ with probability at least $1-\delta$ provided that $k\ge\log(n/\delta)$, which holds by assumption. For this last inequality we used that $x_i^TW_0$ is a Gaussian vector with iid $\mathcal N(0,\omega^2)$ entries. It follows that assumption (36b) holds with probability at least $1-\delta$, as desired.

Verifying Assumption (36c): We start by stating a concentration lemma from Heckel & Soltanolkotabi (2020b).

Lemma 5 (Concentration lemma, (Heckel & Soltanolkotabi, 2020b, Lemma 3)). Consider the partial Jacobian $J_1(W,v)$, let $W\in\mathbb R^{d\times k}$ be generated at random with i.i.d. $\mathcal N(0,\omega^2)$ entries, and suppose the $v_\ell$ are drawn from a distribution with $|v_\ell|\le\nu$. Then, with probability at least $1-\delta$,
$$\left\|J_1(W,v)J_1^T(W,v) - E\left[J_1(W,v)J_1^T(W,v)\right]\right\| \le \frac{\nu^2}{\sqrt k}\|X\|^2\sqrt{\log(2n/\delta)}.$$

Lemma 6. Let $J_2(W) = \frac{1}{\sqrt k}\mathrm{relu}(XW)$, with $W$ generated at random with i.i.d. $\mathcal N(0,\omega^2)$ entries. Then, with probability at least $1-\delta$,
$$\left\|J_2(W)J_2^T(W) - E\left[J_2(W)J_2^T(W)\right]\right\| \le 3\frac{\omega^2}{\sqrt k}\|X\|^2\log(kn/\delta).$$

Combining the statements of the two lemmas, it follows that, with probability at least $1-2\delta$,
$$\left\|J(W,v)J^T(W,v) - E\left[J(W,v)J^T(W,v)\right]\right\| \le (\omega^2+\nu^2)\frac{3}{\sqrt k}\|X\|^2\log(kn/\delta) \le (\omega^2+\nu^2)\frac{3n}{\sqrt k}\log(kn/\delta). \quad (49)$$
To show that (49) implies the condition in (36c), we use the following lemma.

Lemma 7 ((Oymak et al., 2019, Lem. 6.4)). Let $J_0\in\mathbb R^{n\times N}$, $N\ge n$, and let $\Sigma$ be an $n\times n$ psd matrix obeying $\|J_0J_0^T - \Sigma\| \le\tilde\epsilon^2$, for a scalar $\tilde\epsilon\ge 0$. Then there exists a matrix $J\in\mathbb R^{n\times N}$ obeying $\Sigma = JJ^T$ such that $\|J - J_0\| \le 2\tilde\epsilon$.

From Lemma 7 combined with equation (49), there exists a matrix $J\in\mathbb R^{n\times N}$ that obeys
$$\|J - J(W_0,v_0)\| \le \epsilon_0, \qquad \epsilon_0 = 2\sqrt{(\omega^2+\nu^2)\frac{3n}{\sqrt k}\log(kn/\delta)}.$$
This part of the proof also specifies our choice of the reference Jacobian $J$ as a matrix that is $\epsilon_0$-close to the Jacobian at initialization, $J(W_0,v_0)$, and that exists by Lemma 7 above.

Verifying Assumption (36d): We control the perturbation of the Jacobian around the random initialization.

Lemma 8. Let $W_0$ have iid $\mathcal N(0,\omega^2)$ entries and let $v_0$ have (arbitrary) entries in $\{-\nu,+\nu\}$. Then, for all $W$ and $v$ within distance $R$ of the initialization, for some $R\le\sqrt k$, the Jacobian in (31) obeys
$$\|J(W,v) - J(W_0,v_0)\| \le \|X\|\frac{1}{\sqrt k}\left(\omega R + \nu R + \nu\sqrt 2(2nR)^{1/3}\right),$$
with probability at least $1 - ne^{-\frac{1}{2}R^{4/3}k^{7/3}}$.

Recall the definition $\bar R = \sqrt k\left(\frac{\epsilon}{\|X\|(\nu+3\omega)}\right)^3$ from (47). Since $R \le (k\bar R)^{1/3}$ for $\bar R\le\sqrt k$, the bound provided by Lemma 8 guarantees
$$\|J(W,v) - J(W_0,v_0)\| \le \|X\|\frac{\omega+3\nu}{\sqrt k}(k\bar R)^{1/3} = \epsilon = \frac{1}{16}\xi\frac{\alpha^3}{\beta^2},$$
where the equality follows from the choice of $\bar R$. This holds with probability at least
$$1 - ne^{-\frac{1}{2}\bar Rk^{7/3}} = 1 - ne^{-2^{-17}\xi^4\frac{\alpha^8}{\beta^8}k^3} \overset{(i)}{\ge} 1-\delta,$$
where in (i) we used (20). Therefore, Assumption (36d) holds with high probability for our choice $\epsilon = \frac{1}{16}\xi\frac{\alpha^3}{\beta^2}$.

Concluding the proof of Theorem 4: By the previous paragraphs, the assumptions of Theorem 6 are satisfied with probability at least $1-O(\delta)$. Therefore, we can bound the training error and the deviation of the coefficients from the initialization as follows.

Training error: We establish the bound (21). The training error at iteration $t$ is bounded by
$$\|f(W_t,v_t) - y\|_2 \le \|\tilde r_t\|_2 + \|\tilde r_t - r_t\|_2 \overset{(i)}{\le} \sqrt{\sum_{i=1}^n(1-\eta\sigma_i^2)^{2t}\langle u_i,r_0\rangle^2} + \frac{2\beta(\epsilon_0+\epsilon)}{e\ln(2)\,\alpha^2}\|r_0\|_2$$
$$\overset{(ii)}{\le} \sqrt{\sum_{i=1}^n(1-\eta\sigma_i^2)^{2t}\langle u_i,y\rangle^2} + \|f(W_0,v_0)\|_2 + \frac{2\beta(\epsilon_0+\epsilon)}{e\ln(2)\,\alpha^2}\|r_0\|_2 \overset{(iii)}{\le} \sqrt{\sum_{i=1}^n(1-\eta\sigma_i^2)^{2t}\langle u_i,y\rangle^2} + \xi\|y\|_2,$$
where inequality (i) follows from bounding the linear residual $\|\tilde r_t\|_2$ with Theorem 7, as well as bounding the distance between the linear residual and the non-linear one with (39).
Inequality (ii) follows from $\mathbf{r}_0 = f(\mathbf{W}_0,\mathbf{v}_0) - \mathbf{y}$, and finally (iii) follows from $\|f(\mathbf{W}_0,\mathbf{v}_0)\|_2 \le \nu\omega\sqrt{8\log(2n/\delta)}\|\mathbf{y}\|_2 \le \frac{\xi}{2}\|\mathbf{y}\|_2$, by (45), and from $\frac{\beta}{\alpha^2}(\epsilon_0+\epsilon) \le \frac{\beta}{\alpha^2}2\epsilon = \frac{1}{8}\xi\frac{\alpha}{\beta} \le \xi$.

Distance from initialization: We next bound the distance from the initialization, i.e., we establish (22). Combining equation (43) in Theorem 7 with equation (39) in Theorem 6, we obtain
$\|\mathbf{W}_t-\mathbf{W}_0\|_F^2 + \|\mathbf{v}_t-\mathbf{v}_0\|_2^2 \le \sum_{i=1}^n \langle\mathbf{u}_i,\mathbf{r}_0\rangle^2\left(\frac{1-(1-\eta\sigma_i^2)^t}{\sigma_i}\right)^2 + 2.5\frac{\beta^2}{\alpha^4}(\epsilon_0+\epsilon)\|\mathbf{r}_0\|_2 \stackrel{(i)}{\le} \sum_{i=1}^n \langle\mathbf{u}_i,\mathbf{y}\rangle^2\left(\frac{1-(1-\eta\sigma_i^2)^t}{\sigma_i}\right)^2 + \frac{1}{\alpha}\|f(\mathbf{W}_0,\mathbf{v}_0)\|_2 + 2.5\frac{\beta^2}{\alpha^4}(\epsilon_0+\epsilon)\|\mathbf{r}_0\|_2 \stackrel{(ii)}{\le} \sum_{i=1}^n \langle\mathbf{u}_i,\mathbf{y}\rangle^2\left(\frac{1-(1-\eta\sigma_i^2)^t}{\sigma_i}\right)^2 + \frac{\xi}{\alpha}\sqrt{n},$
where (i) follows from $\mathbf{r}_0 = f(\mathbf{W}_0,\mathbf{v}_0)-\mathbf{y}$, and (ii) follows from $2.5\frac{\beta^2}{\alpha^4}(\epsilon_0+\epsilon)\|\mathbf{r}_0\|_2 \le \frac{5}{16\alpha}\|\mathbf{r}_0\|_2 \le \frac{5}{16\alpha}\left(\|f(\mathbf{W}_0,\mathbf{v}_0)\|_2 + \|\mathbf{y}\|_2\right)$, where we used $\epsilon_0+\epsilon \le 2\epsilon = \frac{1}{8}\xi\frac{\alpha^3}{\beta^2}$, by the definitions of $\epsilon$ and $\epsilon_0$, combined with $\|f(\mathbf{W}_0,\mathbf{v}_0)\|_2 \le \frac{\xi}{4}\sqrt{n}$.

Bound on the change of the coefficients: Finally, we establish the bounds (23) and (24) on the change of the individual coefficients. We start with the weights of the first layer, $\mathbf{w}_r$. The gradient with respect to $\mathbf{w}_r$ is given by $\nabla_{\mathbf{w}_r}\mathcal{L}(\mathbf{W},\mathbf{v}) = [\mathcal{J}_1^T(\mathbf{W},\mathbf{v})]_r\mathbf{r}$, where $[\mathcal{J}_1^T(\mathbf{W},\mathbf{v})]_r$ is the submatrix of the Jacobian multiplying the weight $\mathbf{w}_r$, and $\mathbf{r}$ is the residual. Therefore, we obtain
$\|\mathbf{w}_{t,r}-\mathbf{w}_{0,r}\|_2 = \left\|\sum_{\tau=0}^{t-1}(\mathbf{w}_{\tau+1,r}-\mathbf{w}_{\tau,r})\right\|_2 \le \sum_{\tau=0}^{t-1}\eta\left\|[\mathcal{J}_1^T(\mathbf{W}_\tau,\mathbf{v}_\tau)]_r\mathbf{r}_\tau\right\|_2 \stackrel{(i)}{\le} \left(\nu+\frac{4}{\alpha}\sqrt{n}\right)\frac{\sqrt{n}}{\sqrt{k}}\,\eta\sum_{\tau=0}^{t-1}\|\mathbf{r}_\tau\|_2 \stackrel{(ii)}{\le} \left(\nu+\frac{4}{\alpha}\sqrt{n}\right)\frac{\sqrt{n}}{\sqrt{k}}\,\eta\sum_{\tau=0}^{t-1}(1-\eta\alpha^2)^\tau\|\mathbf{r}_0\|_2 \stackrel{(iii)}{\le} \left(\nu+\frac{4}{\alpha}\sqrt{n}\right)\frac{\sqrt{n}}{\sqrt{k}}\frac{1}{\alpha^2}\|\mathbf{r}_0\|_2.$
Here, (i) follows from
$\left\|[\mathcal{J}_1^T(\mathbf{W}_\tau,\mathbf{v}_\tau)]_r\right\| = \left\|\frac{v_r}{\sqrt{k}}\mathrm{diag}\left(\mathrm{relu}'(\mathbf{X}\mathbf{w}_r)\right)\mathbf{X}\right\| \le |v_r|\frac{\sqrt{n}}{\sqrt{k}} \le \left(\nu+\frac{4}{\alpha}\sqrt{n}\right)\frac{\sqrt{n}}{\sqrt{k}},$
where the last inequality follows from $|v_{\tau,r}-v_{0,r}| \le \|\mathbf{v}_\tau-\mathbf{v}_0\|_2 \le R \le \frac{4}{\alpha}\sqrt{n}$, by (47). Moreover, for (ii) we used that, by (38), the non-linear residuals converge geometrically, and (iii) follows from the formula for a geometric series. This concludes the proof of the bound (23).
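The proof above repeatedly uses that the residual of the associated linear model contracts at rate $(1-\eta\sigma_i^2)$ along the $i$-th left singular direction. For plain linear least squares this identity is exact and easy to verify numerically; a minimal sketch with illustrative dimensions (not the network model itself):

```python
import numpy as np

# Gradient descent on f(b) = 0.5 ||X b - y||^2. The residual r_t = X b_t - y
# satisfies <u_i, r_t> = (1 - eta * s_i^2)^t <u_i, r_0>, where X = U diag(s) V^T.
rng = np.random.default_rng(1)
n = d = 20
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)
U, s, Vt = np.linalg.svd(X)

eta = 0.5 / s[0] ** 2  # stepsize below 1/sigma_max^2 so every mode contracts
b = np.zeros(d)
t = 50
for _ in range(t):
    b -= eta * X.T @ (X @ b - y)

r_t = X @ b - y
r_0 = -y  # residual at the zero initialization
predicted = (1 - eta * s**2) ** t * (U.T @ r_0)
print(np.max(np.abs(U.T @ r_t - predicted)))  # agreement up to round-off
```

Directions with large $\sigma_i$ contract quickly and directions with small $\sigma_i$ slowly, which is exactly the mechanism behind the $(1-\eta\sigma_i^2)^{2t}$ terms in the training-error bound above.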
Analogously, we obtain
$|v_{t,r}-v_{0,r}| \le \sum_{\tau=0}^{t-1}\eta\left|[\mathcal{J}_2^T(\mathbf{W}_\tau)]_r\mathbf{r}_\tau\right| \le \left(O\left(\omega\sqrt{\log(nk/\delta)}\right)+\frac{4}{\alpha}\sqrt{n}\right)\frac{\sqrt{n}}{\sqrt{k}}\frac{1}{\alpha^2}\|\mathbf{r}_0\|_2,$
where we used that
$\frac{1}{\sqrt{k}}\|\mathrm{relu}(\mathbf{X}\mathbf{w}_r)\|_2 \le \frac{1}{\sqrt{k}}\left(\|\mathbf{X}\mathbf{w}_{0,r}\|_2 + \|\mathbf{X}(\mathbf{w}_{0,r}-\mathbf{w}_r)\|_2\right) \le \frac{1}{\sqrt{k}}\left(O\left(\omega\sqrt{n\log(nk/\delta)}\right) + \sqrt{n}\|\mathbf{w}_{0,r}-\mathbf{w}_r\|_2\right) \le \frac{\sqrt{n}}{\sqrt{k}}\left(O\left(\omega\sqrt{\log(nk/\delta)}\right) + \frac{4}{\alpha}\sqrt{n}\right).$
Here, the last inequality follows by using that the entries of $\mathbf{X}\mathbf{w}_{0,r}$ are not independent but $\mathcal{N}(0,\omega^2)$ distributed, and by taking a union bound over all entries of that vector and over all vectors $\mathbf{X}\mathbf{w}_{0,r}$, $r=1,\dots,k$. This concludes the proof of the bound (24).

For the entries of $\mathbf{W}$ being iid $\mathcal{N}(0,\omega^2)$, we note that, with probability at least $1-ne^{-kq^2/2}$, the $q$-th smallest entry (in absolute value) of $\mathbf{x}_i^T\mathbf{W} \in \mathbb{R}^k$ obeys
$|\mathbf{x}_i^T\mathbf{W}|_{\pi(q)} \ge \frac{q}{2k}\omega\|\mathbf{x}_i\|_2 \quad \text{for all } i=1,\dots,n. \quad (55)$

We are now ready to conclude the proof of the lemma. By equation (54),
$\|\mathcal{J}_1(\mathbf{W},\mathbf{v}') - \mathcal{J}_1(\mathbf{W}',\mathbf{v}')\| \le \frac{1}{\sqrt{k}}\|\mathbf{v}'\|_\infty\|\mathbf{X}\|\max_j\left\|\sigma'(\mathbf{x}_j^T\mathbf{W})-\sigma'(\mathbf{x}_j^T\mathbf{W}')\right\|_2 \le \frac{1}{\sqrt{k}}\|\mathbf{v}'\|_\infty\|\mathbf{X}\|\sqrt{2q},$
provided that $\|\mathbf{W}-\mathbf{W}'\| \le \sqrt{q}\frac{q}{2k}\omega$, with probability at least $1-ne^{-kq^2/2}$.

F.4 PROOF OF LEMMA 4 (BOUND ON THE RADEMACHER COMPLEXITY)

Our proof follows that of a related result, specifically (Arora et al., 2019, Lem. 6.4), which pertains to a two-layer ReLU network where only the first layer is trained and the second layer's coefficients are fixed. Our goal is to bound the empirical Rademacher complexity
$\mathcal{R}_{\mathcal{D}}(\mathcal{F}_{Q,M}) = \frac{1}{n}\mathbb{E}\left[\sup_{\mathbf{W}\in\mathcal{W},\mathbf{v}\in\mathcal{V}}\sum_{i=1}^n\epsilon_i\frac{1}{\sqrt{k}}\sum_{r=1}^k v_r\,\mathrm{relu}(\mathbf{w}_r^T\mathbf{x}_i)\right],$
where the expectation is over the iid Rademacher random variables $\epsilon_i$, and where $\{\mathbf{x}_1,\dots,\mathbf{x}_n\}$ are the training examples. The derivation of the bound on the Rademacher complexity is based on the intuition that if the parameter $M$ in the constraint $\|\mathbf{w}_r-\mathbf{w}_{0,r}\|_2 \le \omega M$ is sufficiently small, then $\mathrm{relu}'(\mathbf{w}_r^T\mathbf{x}_i)$ is constant for most $r$, because $|\mathbf{w}_{0,r}^T\mathbf{x}_i|$ is bounded away from zero by more than $\omega M$ with high probability, by anti-concentration of the Gaussian distribution. For those coefficient vectors $\mathbf{w}_r$ for which $\mathrm{relu}'(\mathbf{w}_r^T\mathbf{x}_i)$ is constant, we have $\mathrm{relu}(\mathbf{w}_r^T\mathbf{x}_i) = \mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)\mathbf{w}_r^T\mathbf{x}_i$.
For the other coefficients, we can bound the difference of those two values as
$\left|\mathrm{relu}(\mathbf{w}_r^T\mathbf{x}_i) - \mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)\mathbf{w}_r^T\mathbf{x}_i\right| = \left|\mathrm{relu}'(\mathbf{w}_r^T\mathbf{x}_i)\mathbf{w}_r^T\mathbf{x}_i - \mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)\mathbf{w}_r^T\mathbf{x}_i\right| = \left|\mathrm{relu}'(\mathbf{w}_r^T\mathbf{x}_i)\mathbf{w}_r^T\mathbf{x}_i - \mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)\mathbf{w}_{0,r}^T\mathbf{x}_i + \mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)\mathbf{w}_{0,r}^T\mathbf{x}_i - \mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)\mathbf{w}_r^T\mathbf{x}_i\right| = \left|\mathrm{relu}(\mathbf{w}_r^T\mathbf{x}_i) - \mathrm{relu}(\mathbf{w}_{0,r}^T\mathbf{x}_i) + \mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)\langle\mathbf{w}_{0,r}-\mathbf{w}_r,\mathbf{x}_i\rangle\right| \le 2\|\mathbf{w}_r-\mathbf{w}_{0,r}\|_2\|\mathbf{x}_i\|_2 \le 2\omega M,$
where the last inequality holds for $\mathbf{W}\in\mathcal{W}$. It follows that $\mathcal{R}_{\mathcal{D}}(\mathcal{F}_{Q,M})$ is bounded as in (56), where $\mathcal{J}_1$ is the Jacobian defined in (31), and where we use $\mathbf{w} = \mathrm{vect}(\mathbf{W}) \in \mathbb{R}^{dk}$ for the vectorized version of the matrix $\mathbf{W}$, with a slight abuse of notation. With this notation, we can bound the first term in (56) by
$\frac{1}{n}\mathbb{E}\left[\sup_{\mathbf{W}\in\mathcal{W},\mathbf{v}\in\mathcal{V}}\boldsymbol{\epsilon}^T\mathcal{J}_1(\mathbf{W}_0,\mathbf{v})\mathbf{w}\right] \stackrel{(i)}{=} \frac{1}{n}\mathbb{E}\left[\sup_{\mathbf{W}\in\mathcal{W},\mathbf{v}\in\mathcal{V}}\boldsymbol{\epsilon}^T\left(\mathcal{J}_1(\mathbf{W}_0,\mathbf{v})\mathbf{w} - \mathcal{J}_1(\mathbf{W}_0,\mathbf{v}_0)\mathbf{w}_0\right)\right] = \frac{1}{n}\mathbb{E}\left[\sup_{\mathbf{W}\in\mathcal{W},\mathbf{v}\in\mathcal{V}}\boldsymbol{\epsilon}^T\left(\mathcal{J}_1(\mathbf{W}_0,\mathbf{v})\mathbf{w} - \mathcal{J}_1(\mathbf{W}_0,\mathbf{v})\mathbf{w}_0 + \mathcal{J}_1(\mathbf{W}_0,\mathbf{v})\mathbf{w}_0 - \mathcal{J}_1(\mathbf{W}_0,\mathbf{v}_0)\mathbf{w}_0\right)\right] = \frac{1}{n}\mathbb{E}\left[\sup_{\mathbf{W}\in\mathcal{W},\mathbf{v}\in\mathcal{V}}\boldsymbol{\epsilon}^T\left(\mathcal{J}_1(\mathbf{W}_0,\mathbf{v})(\mathbf{w}-\mathbf{w}_0) + \mathcal{J}_2(\mathbf{W}_0)(\mathbf{v}-\mathbf{v}_0)\right)\right] = \frac{1}{n}\mathbb{E}\left[\sup_{\mathbf{W}\in\mathcal{W},\mathbf{v}\in\mathcal{V}}\boldsymbol{\epsilon}^T\left(\mathcal{J}_1(\mathbf{W}_0,\mathbf{v}_0)(\mathbf{w}-\mathbf{w}_0) + \mathcal{J}_1(\mathbf{W}_0,\mathbf{v}-\mathbf{v}_0)(\mathbf{w}-\mathbf{w}_0) + \mathcal{J}_2(\mathbf{W}_0)(\mathbf{v}-\mathbf{v}_0)\right)\right] \stackrel{(ii)}{\le} \frac{1}{n}\mathbb{E}\left[\|\boldsymbol{\epsilon}^T\mathcal{J}_1(\mathbf{W}_0,\mathbf{v}_0)\|_2\right]Q + \frac{1}{n}\mathbb{E}\left[\|\boldsymbol{\epsilon}^T\mathcal{J}_2(\mathbf{W}_0)\|_2\right]Q + \sqrt{k}\nu\omega M^2 \stackrel{(iii)}{\le} \frac{1}{n}Q(\nu+\omega)\sqrt{n} + \sqrt{k}\nu\omega M^2. \quad (57)$
Here, equality (i) follows because $\boldsymbol{\epsilon}^T\mathcal{J}_1(\mathbf{W}_0,\mathbf{v}_0)\mathbf{w}_0$ has zero mean; inequality (ii) follows from the Cauchy–Schwarz inequality as well as from
$\|\mathcal{J}_1(\mathbf{W}_0,\mathbf{v}-\mathbf{v}_0)(\mathbf{w}-\mathbf{w}_0)\|_2 \le \frac{1}{\sqrt{k}}\|\mathbf{X}\|\sum_{r=1}^k |v_r-v_{0,r}|\|\mathbf{w}_r-\mathbf{w}_{0,r}\|_2 \le \sqrt{k}\nu\omega M^2\sqrt{n};$
and inequality (iii) follows from $\mathbb{E}[\|\boldsymbol{\epsilon}^T\mathbf{A}\|_2] \le \sqrt{\mathbb{E}[\|\boldsymbol{\epsilon}^T\mathbf{A}\|_2^2]} = \|\mathbf{A}\|_F$, by Jensen's inequality, and from the bounds $\|\mathcal{J}_1(\mathbf{W}_0,\mathbf{v}_0)\|_F \le \nu\sqrt{n}$ and $\|\mathcal{J}_2(\mathbf{W}_0)\|_F \le \omega\sqrt{n}$, which hold with probability at least $1-\delta$ provided that $\sqrt{\log(2n/\delta)/(2k)} \le 1/2$, which in turn holds by assumption. We next upper bound the second term in (56). Following the argument in (Arora et al., 2019), we obtain the bound (58), which holds with probability at least $1-\delta$. Here, we used that $|v_r| \le |v_{0,r}| + |v_r-v_{0,r}| \le 2\nu$.
Putting the bounds on the first and second term in (56) (given by inequality (57) and inequality (58)) together, we get that, with probability at least $1-\delta$, the Rademacher complexity is upper bounded by
$\mathcal{R}_{\mathcal{D}}(\mathcal{F}_{Q,M}) \le \frac{Q}{\sqrt{n}}(\nu+\omega) + \nu\omega\left(5M^2\sqrt{k} + 4M\sqrt{\log(2/\delta)}\right),$
which concludes our proof.
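The anti-concentration step above, namely that only an $O(M)$ fraction of the activation patterns $\mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)$ can flip under a perturbation of norm $\omega M$, is easy to check numerically. A minimal sketch; the parameter values are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from math import erf, sqrt, pi

# A neuron's activation pattern on x can only flip under a perturbation of
# norm at most omega*M if |w_0^T x| <= omega*M*||x||. For w_0 ~ N(0, omega^2 I),
# w_0^T x ~ N(0, omega^2 ||x||^2), so this happens with probability
# P(|Z| <= M) = 2*Phi(M) - 1 = O(M) -- the anti-concentration step above.
rng = np.random.default_rng(2)
d, k, omega, M = 50, 200000, 0.3, 0.05
x = rng.normal(size=d)
W0 = rng.normal(0.0, omega, size=(k, d))
flippable = np.abs(W0 @ x) <= omega * M * np.linalg.norm(x)
frac = flippable.mean()
exact = erf(M / sqrt(2))  # = 2*Phi(M) - 1 for a standard Gaussian Z
print(frac, exact, M * sqrt(2 / pi))  # empirical, exact, small-M approximation
```

For small $M$ the flippable fraction is approximately $M\sqrt{2/\pi}$, which is why the Rademacher bound above degrades only linearly (up to the $M^2$ term) in $M$.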






g., measured by the number of parameters) for linear regression (Hastie et al., 2019; Belkin et al., 2020), for random feature regression (Mei & Montanari, 2019; d'Ascoli et al., 2020), and for binary linear classification (Deng et al., 2020). A number of recent theoretical double-descent works (Jacot et al., 2020; Yang et al., 2020; d'Ascoli et al., 2020) have decomposed the risk into bias and variance terms and studied their behavior. Those works demonstrate that the bias typically decreases as a function of the model size, while the variance first increases and then decreases, which can yield a double-descent behavior. As we demonstrate in Appendix B.2, the epoch-wise double descent phenomenon for the standard CNN studied by Nakkiran et al. (2020a) cannot be explained by this observation: the variance is increasing as a function of training epochs, as opposed to being unimodal.

(Figure 3, panel titles: a) ω = 0.01, ν = 1; b) ω = 1, ν = 1; c) ω = 1, ν = 0; horizontal axes: iterations t, from 10^0 to 10^5.)

Figure 3: Top row: risk of the two-layer neural network trained on data drawn from a linear model with diagonal covariance matrix with geometrically decaying variances. The risk has a double descent curve unless we either i) initialize the first layer with a smaller initialization strength ω than the second one ν, or ii) choose a smaller stepsize for the weights in the second layer. Both improve the risk, as suggested by the theory. Bottom row: the norms $\|\mathbf{v}_{i,\mathbf{W}}\|_2^2$ and $\|\mathbf{v}_{i,\mathbf{v}}\|_2^2$ measure to what extent the singular values $\sigma_i$ are associated with the weights in the first ($\mathbf{W}$) and second ($\mathbf{v}$) layer, respectively. Double descent occurs when the singular values are mostly associated with the second layer, because then those weights are learned fast relative to the first-layer weights.

(b) we show simulations for the same configuration but with n = 10d.

B SUPPORTING MATERIAL FOR: EARLY STOPPING IN CONVOLUTIONAL NEURAL NETWORKS

B.1 RESNET-18 TRAINING: DOUBLE DESCENT ELIMINATION

Figure 6: Left: test error of the ResNet-18 trained with i) the same stepsize for all layers, and ii) a smaller stepsize for the latter half of the layers. Decreasing the learning rate of the last layers causes them to be learned at a speed similar to the first layers and thereby eliminates double descent. Right: the training error curves for i) and ii).

(2020); Yang et al. (2020); d'Ascoli et al. (2020), and provides a bias-variance decomposition of the model-wise double-descent-shaped risk curve.

Figure 7: Bias and variance as a function of training epochs for training a 5-layer CNN until convergence. The overall variance is increasing, and the overall bias has a double-descent-like shape. The training and test error curves show the interpolation of the training set and the double descent behavior of the test error in this interval.

$\mathcal{R}_{\mathcal{D}}(\mathcal{F}_{Q,M}) \le \frac{1}{n}\mathbb{E}\left[\sup_{\mathbf{W}\in\mathcal{W},\mathbf{v}\in\mathcal{V}}\boldsymbol{\epsilon}^T\mathcal{J}_1(\mathbf{W}_0,\mathbf{v})\mathbf{w}\right] + \frac{2\omega M}{n}\,\mathbb{E}\left[\sum_{i=1}^n\sum_{r=1}^k \mathbb{1}_{\{\mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)\neq\mathrm{relu}'(\mathbf{w}_r^T\mathbf{x}_i)\}}\right]. \quad (56)$

$\mathbb{E}\left[\sum_{i=1}^n\sum_{r=1}^k \mathbb{1}_{\{\mathrm{relu}'(\mathbf{w}_{0,r}^T\mathbf{x}_i)\neq\mathrm{relu}'(\mathbf{w}_r^T\mathbf{x}_i)\}}\right] \le 2\nu kn\left(M + \sqrt{\frac{\log(2/\delta)}{2k}}\right). \quad (58)$

Thus early stopping can ensure that neither bias nor variance is too large. A variety of papers (Yao et al., 2007; Raskutti et al., 2014; Bühlmann & Yu, 2003; Wei et al., 2019) formalized this intuition and developed theoretically sound early stopping rules. Those works do not, however, predict when a double descent curve can occur.

A second, more recent line of work studies early stopping from a different perspective, namely that of gradient descent fitting different components of a signal, or different labels, at different speeds. For linear least squares, the data in the direction of singular vectors associated with large singular values is fitted faster than that in the direction of singular vectors associated with small singular values. Advani et al. (2020) have shown this for a linear least squares problem, or, stated differently, a linear neural network with a single layer. Li et al. (2020) and Arora et al. (2019) have shown that this view explains why neural networks often fit clean labels before noisy ones, and Heckel & Soltanolkotabi (2020b) have used this view to prove that convolutional neural networks provably denoise images. Our theoretical results for neural networks build on a line of works that relate the dynamics of gradient descent to those of an associated linear model or a kernel method in the highly overparameterized

the risk expression (1) are approximately equal to the bias and the variance of the model θ_t in the standard textbook bias-variance decomposition of the risk; see Appendix A.2 for a detailed discussion.
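The view discussed above, that gradient descent fits directions associated with large singular values before those associated with small ones, and that early stopping therefore helps under label noise, can be illustrated with a small simulation. This is a sketch under assumed toy parameters, not the paper's experimental setup:

```python
import numpy as np

# Gradient descent on noisy linear regression: directions with large singular
# values are fitted first (mostly signal), while noise in directions with
# small singular values is fitted late. The estimation error therefore dips
# and then rises again, so stopping at the dip beats running to convergence.
rng = np.random.default_rng(3)
n = d = 30
s = 0.9 ** np.arange(n)  # geometrically decaying singular values
Q1 = np.linalg.qr(rng.normal(size=(n, n)))[0]
Q2 = np.linalg.qr(rng.normal(size=(d, d)))[0]
X = (Q1 * s) @ Q2  # matrix with singular values s
beta_star = rng.normal(size=d) / np.sqrt(d)
y = X @ beta_star + 0.1 * rng.normal(size=n)  # noisy labels

eta = 0.5
beta = np.zeros(d)
errors = []
for t in range(2000):
    beta -= eta * X.T @ (X @ beta - y)
    errors.append(float(np.sum((beta - beta_star) ** 2)))

best_t = int(np.argmin(errors))
print(best_t, errors[best_t], errors[-1])  # early-stopped error beats the final one
```

The non-monotone error curve produced here is the one-bias-variance-tradeoff baseline; the paper's point is that several such tradeoffs, unfolding at different speeds, can superpose into an epoch-wise double descent curve.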

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.

Yuting Wei, Fanny Yang, and Martin J. Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. IEEE Transactions on Information Theory, 65(10):6685-6703, 2019.

ACKNOWLEDGEMENTS

F. F. Yilmaz and R. Heckel are (partially) supported by NSF award IIS-1816986. R. Heckel also acknowledges support by the TUM Institute of Advanced Study, and the authors would like to thank Fanny Yang and Alexandru Tifrea for discussions and helpful comments on this manuscript.

CODE

Code to reproduce the experiments is available at https://github.com/MLI-lab/early_stopping_double_descent.

F.3.3 PROOF OF LEMMA 8

First note that the difference of the Jacobians decomposes into the three terms in (50). In the remainder of the proof we bound these three terms. We start by bounding the third term in (50), and then proceed with bounding the first term in (50). Next, we establish that, with probability at least $1-ne^{-kq^2/2}$, the second term in (50) is bounded as in (51), provided that $\|\mathbf{W}-\mathbf{W}'\| \le \sqrt{q}\frac{q}{2k}\omega$, where the last inequality follows from setting $q = (2kR)^{2/3}$ (note that the assumption $R \le \frac{1}{2}\sqrt{k}$ ensures $q \le k$). Putting those three bounds together in (50) establishes the claim.

It remains to prove (51). Toward this goal, first note the decomposition in (53), and consider the second term on the right-hand side of (53). Because $\mathrm{relu}'$ is the step function, we have to bound the number of sign flips between the matrices $\mathbf{X}\mathbf{W}$ and $\mathbf{X}\mathbf{W}'$. For this we use the lemma below.

Lemma 9. Let $|\mathbf{v}|_{\pi(q)}$ denote the $q$-th smallest entry of $\mathbf{v}$ in absolute value. Suppose that, for all $i$ and some $q \le k$, $\|\mathbf{x}_i^T(\mathbf{W}-\mathbf{W}')\|_2 \le \sqrt{q}\,|\mathbf{x}_i^T\mathbf{W}|_{\pi(q)}$. Then
$\max_i \left\|\sigma'(\mathbf{x}_i^T\mathbf{W}) - \sigma'(\mathbf{x}_i^T\mathbf{W}')\right\|_2^2 \le 2q.$
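The counting argument behind Lemma 9, namely that at most $q$ entries lie below the threshold $\tau = |\mathbf{x}_i^T\mathbf{W}|_{\pi(q)}$, and that a perturbation of norm $\sqrt{q}\,\tau$ can push at most $q$ of the remaining entries across zero, can be checked numerically. A sketch under illustrative parameters:

```python
import numpy as np

# Sign-flip counting behind Lemma 9: if the perturbation of z = x^T W has
# norm at most sqrt(q) * tau, where tau is the q-th smallest |z_j|, then at
# most q of the entries with |z_j| >= tau can cross zero (each such flip
# costs at least tau of perturbation norm), and at most q entries sit below
# tau to begin with -- so at most 2q activation patterns flip.
def flips_within_bound(z, z_pert, q):
    tau = np.sort(np.abs(z))[q - 1]  # q-th smallest magnitude
    assert np.linalg.norm(z_pert - z) <= np.sqrt(q) * tau + 1e-12
    flips = int(np.sum((z > 0) != (z_pert > 0)))
    return flips <= 2 * q

rng = np.random.default_rng(4)
k, q = 1000, 25
z = rng.normal(size=k)
tau = np.sort(np.abs(z))[q - 1]
delta = rng.normal(size=k)
delta *= np.sqrt(q) * tau / np.linalg.norm(delta)  # perturbation of norm sqrt(q)*tau
ok = flips_within_bound(z, z + delta, q)
print(ok)
```

The bound is deterministic once the norm condition holds; the probabilistic part of the proof (equation (55)) only ensures that the threshold $\tau$ is not too small for Gaussian weights.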

