IMPLICIT BIAS OF GRADIENT DESCENT FOR MEAN SQUARED ERROR REGRESSION WITH WIDE NEURAL NETWORKS

Abstract

We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For 1D regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{-1/2}$ of the function which fits the training data and whose difference from initialization has the smallest weighted 2-norm of the second derivative with respect to the input. The curvature penalty function $1/\zeta$ is expressed in terms of the probability distribution that is used to initialize the network parameters, and we compute it explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. While similar results have been obtained in previous works, our analysis clarifies important details and allows us to obtain significant generalizations. In particular, the result generalizes to multivariate regression and to different activation functions. Moreover, we show that the training trajectories are captured by trajectories of spatially adaptive smoothing splines with decreasing regularization strength.

1. INTRODUCTION

Understanding why neural networks trained in the overparametrized regime and without explicit regularization generalize well in practice is an important problem (Zhang et al., 2017). Some form of capacity control different from network size must be at play (Neyshabur et al., 2014), and specifically the implicit bias of parameter optimization has been identified to play a key role (Neyshabur et al., 2017). By implicit bias we mean that among the many hypotheses that fit the training data, the algorithm selects one which satisfies additional properties that may be beneficial for its performance on new data. Jacot et al. (2018) and Lee et al. (2019) showed that the training dynamics of shallow and deep wide neural networks is well approximated by that of the linear Taylor approximation of the models at a suitable initialization. Chizat et al. (2019) observe that a model can converge to zero training loss while hardly varying its parameters, a phenomenon that can be attributed to the scaling of the output weights and makes the model behave as its linearization around the initialization. Zhang et al. (2019) consider linearized models for regression problems and show that gradient flow finds the global minimum of the loss function which is closest to the initialization in parameter space. This type of analysis connects with trajectory-based analyses of neural networks (Saxe et al., 2014). Oymak and Soltanolkotabi (2019) studied overparametrized neural networks directly and showed that gradient descent finds a global minimizer of the loss function which is close to the initialization. Towards interpreting parameters in function space, Savarese et al. (2019) and Ongie et al. (2020) studied infinite-width neural networks with parameters of bounded norm, in 1D and multi-dimensional input spaces, respectively.
They showed that, under a standard parametrization, the complexity of the functions represented by the network, as measured by the 1-norm of the second derivative, can be controlled by the 2-norm of the parameters. Using these results, one can show that gradient descent with an $\ell_2$ weight penalty leads to simple functions. Sahs et al. (2020) relate function properties, such as breakpoint and slope distributions, to the distributions of the network parameters. The implicit bias of parameter optimization has also been investigated in terms of the properties of the loss function at the points reached by different optimization methodologies (Keskar et al., 2017; Wu et al., 2017; Dinh et al., 2017). In terms of the solutions, Maennel et al. (2018) show that gradient flow for shallow networks with rectified linear units (ReLU) initialized close to zero quantizes features in a way that depends on the training data but not on the network size. Williams et al. (2019) obtained results for 1D regression contrasting the kernel and adaptive regimes. Soudry et al. (2018) show that in classification problems with separable data, gradient descent with linear networks converges to a max-margin solution. Gunasekar et al. (2018b) present a result on implicit bias for deep linear convolutional networks, and Ji and Telgarsky (2019) study non-separable data. Chizat and Bach (2020) show that gradient flow for logistic regression with infinitely wide two-layer networks yields a max-margin classifier in a certain space. Gunasekar et al. (2018a) analyze the implicit bias of different optimization methods (natural gradient, steepest and mirror descent) for linear regression and separable linear classification problems, and obtain characterizations in terms of minimum-norm or max-margin solutions.

In this work, we study the implicit bias of gradient descent for regression problems. We focus on wide ReLU networks and describe the bias in function space. In Section 2 we provide the setting and notation.
We present our main results in Section 3, and develop the main theory in Sections 4 and 5. In the interest of a concise presentation, technical proofs and extended discussions are deferred to appendices.

2. NOTATION AND PROBLEM SETUP

Consider a fully connected network with $d$ inputs, one hidden layer of width $n$, and a single output. For any given input $x \in \mathbb{R}^d$, the output of the network is
$$f(x,\theta) = \sum_{i=1}^{n} W^{(2)}_i \phi\big(\langle W^{(1)}_i, x\rangle + b^{(1)}_i\big) + b^{(2)}, \qquad (1)$$
where $\phi$ is a point-wise activation function, and $W^{(1)} \in \mathbb{R}^{n\times d}$, $W^{(2)} \in \mathbb{R}^{n}$, $b^{(1)} \in \mathbb{R}^{n}$, $b^{(2)} \in \mathbb{R}$ are the weights and biases of layers $l = 1, 2$. We write $\theta = \mathrm{vec}(\cup_{l=1}^{2}\{W^{(l)}, b^{(l)}\})$ for the vector of all network parameters. These parameters are initialized by independent samples of pre-specified random variables $W$ and $B$ in the following way:
$$W^{(1)}_{i,j} \overset{d}{=} \sqrt{1/d}\, W, \quad b^{(1)}_i \overset{d}{=} \sqrt{1/d}\, B, \quad W^{(2)}_i \overset{d}{=} \sqrt{1/n}\, W, \quad b^{(2)} \overset{d}{=} \sqrt{1/n}\, B. \qquad (2)$$
More generally, we will also allow weight-bias pairs to be sampled from a joint distribution of $(W,B)$, which we only assume to be sub-Gaussian. In the analysis of Jacot et al. (2018) and Lee et al. (2019), $W$ and $B$ are Gaussian $\mathcal{N}(0,\sigma^2)$. In the default initialization of PyTorch, $W$ and $B$ have uniform distribution $\mathcal{U}(-\sigma,\sigma)$. The setting (1) is known as the standard parametrization. Some works (Jacot et al., 2018; Lee et al., 2019) utilize the so-called NTK parametrization, where the factor $\sqrt{1/n}$ is carried outside of the trainable parameter. If we fix the learning rate for all parameters, gradient descent leads to different trajectories under these two parametrizations. Our results are presented for the standard parametrization; details on this are given in Appendix C.3. We consider a regression problem for data $\{(x_j, y_j)\}_{j=1}^M$ with inputs $\mathcal{X} = \{x_j\}_{j=1}^M$ and outputs $\mathcal{Y} = \{y_j\}_{j=1}^M$. For a loss function $\ell\colon \mathbb{R}\times\mathbb{R}\to\mathbb{R}$, the empirical risk of our function is $\mathcal{L}(\theta) = \sum_{j=1}^M \ell(f(x_j,\theta), y_j)$. We use full-batch gradient descent with a fixed learning rate $\eta$ to minimize $\mathcal{L}(\theta)$.
Writing $\theta_t$ for the parameter at time $t$, and $\theta_0$ for the initialization, this defines the iteration
$$\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t) = \theta_t - \eta\, \nabla_\theta f(\mathcal{X},\theta_t)^T\, \nabla_{f(\mathcal{X},\theta_t)} \mathcal{L},$$
where $f(\mathcal{X},\theta_t) = [f(x_1,\theta_t),\ldots,f(x_M,\theta_t)]^T$ is the vector of network outputs for all training inputs, and $\nabla_{f(\mathcal{X},\theta_t)}\mathcal{L}$ is the gradient of the loss with respect to the model outputs. We use subscript $i$ to index neurons and subscript $t$ to index time. Let $\hat\Theta_n$ be the empirical neural tangent kernel (NTK) of the standard parametrization at time 0, which is the matrix $\hat\Theta_n = \frac{1}{n}\nabla_\theta f(\mathcal{X},\theta_0)\, \nabla_\theta f(\mathcal{X},\theta_0)^T$.
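The empirical NTK defined above is straightforward to compute for a small example. The following is a minimal sketch; the width, the data points, and the uniform initialization distributions are arbitrary choices for illustration:

```python
import numpy as np

def f_shallow(x, W1, b1, W2, b2):
    """Width-n shallow ReLU network with 1D input, standard parametrization."""
    return np.maximum(W1 * x + b1, 0.0) @ W2 + b2

def empirical_ntk(X, W1, b1, W2, b2):
    """Theta_n = (1/n) J J^T, where row j of J is the gradient of f(x_j, theta)
    with respect to all parameters (W1, b1, W2, b2)."""
    n = W1.shape[0]
    rows = []
    for x in X:
        pre = W1 * x + b1
        H = (pre > 0).astype(float)   # Heaviside of the preactivation
        rows.append(np.concatenate([
            W2 * H * x,               # d f / d W1_i
            W2 * H,                   # d f / d b1_i
            np.maximum(pre, 0.0),     # d f / d W2_i
            [1.0],                    # d f / d b2
        ]))
    J = np.stack(rows)
    return J @ J.T / n

rng = np.random.default_rng(0)
n = 2000
W1 = rng.uniform(-1, 1, n)
b1 = rng.uniform(-2, 2, n)
W2 = rng.uniform(-1, 1, n) / np.sqrt(n)   # scaling as in initialization (2)
b2 = rng.uniform(-1, 1) / np.sqrt(n)
X = np.array([-1.0, 0.0, 0.5, 1.0])
y0 = f_shallow(0.5, W1, b1, W2, b2)       # network output at a sample input
Theta = empirical_ntk(X, W1, b1, W2, b2)
```

As a Gram matrix, $\hat\Theta_n$ is symmetric positive semi-definite by construction; for large $n$ it concentrates around its infinite-width limit.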

3. MAIN RESULTS AND DISCUSSION

We obtain a description of the implicit bias in function space when applying gradient descent to regression problems with wide ReLU neural networks. We prove the following result in Appendix D. An interpretation of the result and generalizations are given further below.

Theorem 1 (Implicit bias of gradient descent in wide ReLU networks). Consider a feedforward network with a single input unit, a hidden layer of $n$ rectified linear units, and a single linear output unit. Assume standard parametrization (1) and that for each hidden unit the input weight and bias are initialized from a sub-Gaussian $(W,B)$ (2) with joint density $p_{W,B}$. Then, for any finite data set $\{(x_j, y_j)\}_{j=1}^M$ and sufficiently large $n$, there exist constants $u$ and $v$ so that optimization of the mean squared error on the adjusted training data $\{(x_j, y_j - ux_j - v)\}_{j=1}^M$ by full-batch gradient descent with sufficiently small step size converges to a parameter $\theta^*$ for which the output function $f(x,\theta^*)$ attains zero training error. Furthermore, letting $\zeta(x) = \int_{\mathbb{R}} |W|^3\, p_{W,B}(W, -Wx)\, dW$ and $S = \mathrm{supp}(\zeta)\cap[\min_j x_j, \max_j x_j]$, we have $\|f(x,\theta^*) - g^*(x)\|_2 = O(n^{-1/2})$, $x\in S$ (the 2-norm over $S$), with high probability over the random initialization $\theta_0$, where $g^*$ solves the following variational problem:
$$\min_{g\in C^2(S)} \int_S \frac{1}{\zeta(x)}\big(g''(x) - f''(x,\theta_0)\big)^2\, dx \quad \text{subject to } g(x_j) = y_j - ux_j - v, \; j = 1,\ldots,M. \qquad (4)$$

Interpretation. An intuitive interpretation of the theorem is that, in those regions of the input space where $\zeta$ is smaller, we can expect the difference between the functions after and before training to have a smaller curvature. We may call $\rho = 1/\zeta$ a curvature penalty function. The bias induced by the initialization is expressed explicitly. We note that under a suitable asymmetric parameter initialization (see Appendix C.2), it is possible to achieve $f(\cdot,\theta_0)\equiv 0$. Then the regularization acts on the curvature of the output function itself.
In Theorem 9 we obtain the explicit form of $\zeta$ for various common parameter initialization procedures. In particular, when the parameters are initialized independently from a uniform distribution on a finite interval, $\zeta$ is constant and the problem is solved by the natural cubic spline interpolation of the data. The adjustment of the training data simply accounts for the fact that second derivatives determine a function only up to linear terms. In practice we can use the coefficients $a$ and $b$ of the linear regression $y_j = ax_j + b + \epsilon_j$, $j=1,\ldots,M$, and set the adjusted data as $\{(x_j, \epsilon_j)\}_{j=1}^M$. Although Theorem 1 describes gradient descent training with the linearly adjusted data, this result can also approximately describe training with the original training data. Further details are provided in Appendix L. We illustrate Theorem 1 numerically in Figure 1 and more extensively in Appendix A. In close agreement with the theory, the solution to the variational problem captures the solution of gradient descent training uniformly with error of order $n^{-1/2}$. To illustrate the effect of the curvature penalty function, Figure 1 also shows the solutions to the variational problem for different choices of $\zeta$ corresponding to different initialization distributions. We see that at input points where $\zeta$ is small / peaks strongly, the solution function tends to have a lower curvature / be able to use a higher curvature in order to fit the data. With the presented bias description we can formulate heuristics for parameter initialization, either to ease optimization or to induce specific smoothness priors on the solutions. In particular, by Proposition 8 any curvature penalty $1/\zeta$ can be implemented by an appropriate choice of the parameter initialization distribution.
By our analysis, the effective capacity of the model, understood as the set of possible output functions after training, is adapted to the size M of the training dataset and is well captured by a space of cubic splines relative to the initial function. This is a space with dimension of order M independently of the number of parameters of the network.
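In the constant-$\zeta$ case, the predicted solution can be computed directly with an off-the-shelf spline routine. A minimal sketch using SciPy, with made-up data points, following the linear-adjustment trick described above:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([-1.5, -0.7, 0.2, 0.9, 1.8])
y = np.array([0.3, -0.5, 0.1, 0.8, -0.2])

# Linear adjustment: fit y ~ a x + b and spline the residuals.
a, b = np.polyfit(x, y, 1)
resid = y - (a * x + b)

# Natural boundary conditions enforce zero second derivative at the ends,
# matching the vanishing curvature of the solution outside of S.
spline = CubicSpline(x, resid, bc_type='natural')

def g_star(t):
    """Predicted output function after training: spline of residuals plus the line."""
    return spline(t) + a * t + b
```

The function `g_star` interpolates the training data exactly, and `spline(t, 2)` (the second derivative) vanishes at both boundary knots.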

Strategy of the proof

In Section 4, we observe that for a linearized model, gradient descent with sufficiently small step size finds the minimizer of the training objective which is closest to the initial parameter (similar to a result by Zhang et al., 2019). Then Theorem 4 shows that the training dynamics of the linearization of a wide network is well approximated, in parameter and function space, by that of a lower dimensional linear model which trains only the output weights. This property is sometimes taken for granted, and we show that it holds for the standard parametrization, although it does not hold for the NTK parametrization (defined in Appendix C.3), which leads to the adaptive regime. In Section 5, for networks with a single input and a single layer of ReLUs, we relate the implicit bias of gradient descent in parameter space to an alternative optimization problem. In Theorem 5 we show that the solution of this problem has a well defined limit as the width of the network tends to infinity, which allows us to obtain a variational formulation. In Theorem 6 we translate the description of the bias from parameter space to function space. In Theorem 9 we provide explicit descriptions of the weight function for various common initialization procedures. Finally, we utilize recent results bounding the difference in function space between the solutions obtained from training a wide network and its linearization (Lee et al., 2019, Theorem H.1).

Generalizations. Theorem 1 has several generalizations, elaborated in Appendix P. For multivariate regression, we have the following theorem.

Theorem 2 (Multivariate regression). Use the same network setting as in Theorem 1, except that the number of input units changes to $d$. Assume that for each hidden unit the input weight and bias are initialized from a sub-Gaussian $(W,B)$, where $W$ is a $d$-dimensional random vector and $B$ is a random variable.
Then, for any finite data set $\{(x_j, y_j)\}_{j=1}^M$ and sufficiently large $n$, there exist a constant vector $u$ and a constant $v$ so that optimization of the mean squared error on the adjusted training data $\{(x_j, y_j - \langle u, x_j\rangle - v)\}_{j=1}^M$ by full-batch gradient descent with sufficiently small step size converges to a parameter $\theta^*$ for which $f(x,\theta^*)$ attains zero training error. Furthermore, let $U = \|W\|_2$, $V = W/\|W\|_2$, $C = -B/\|W\|_2$, and $\zeta(v,c) = p_{V,C}(v,c)\,\mathbb{E}(U^2 \mid V = v, C = c)$, where $p_{V,C}$ is the joint density of $(V,C)$. Then we have $\|f(x,\theta^*) - g^*(x)\|_2 = O(n^{-1/2})$, $x\in\mathbb{R}^d$ (the 2-norm over $\mathbb{R}^d$), with high probability over the random initialization $\theta_0$, where $g^*$ solves the following variational problem:
$$\min_{g\in C(\mathbb{R}^d)} \int_{\mathrm{supp}(\zeta)} \frac{\big(\mathcal{R}\{(-\Delta)^{(d+1)/2}(g - f(\cdot,\theta_0))\}(v,c)\big)^2}{\zeta(v,c)}\, dv\, dc$$
$$\text{subject to } g(x_j) = y_j,\; j=1,\ldots,M, \quad \text{and} \quad \mathcal{R}\{(-\Delta)^{(d+1)/2}(g - f(\cdot,\theta_0))\}(v,c) = 0 \text{ for } (v,c)\notin \mathrm{supp}(\zeta).$$
Here $\mathcal{R}$ is the Radon transform, defined by $\mathcal{R}\{f\}(\omega, b) := \int_{\langle\omega, x\rangle = b} f(x)\, ds(x)$, and the power of the negative Laplacian $(-\Delta)^{(d+1)/2}$ is the operator defined in the Fourier domain by $\widehat{(-\Delta)^{(d+1)/2} f}(\xi) = \|\xi\|^{d+1}\hat f(\xi)$.

For different activation functions, we have the following corollary.

Corollary 3 (Different activation functions). Use the same setting as in Theorem 1, except that we use the activation function $\phi$ instead of ReLU. Suppose that $\phi$ is a Green's function of a linear operator $L$, i.e. $L\phi = \delta$, where $\delta$ denotes the Dirac delta function. Assume that the activation function $\phi$ is homogeneous of degree $k$, i.e. $\phi(ax) = a^k\phi(x)$ for all $a > 0$. Then we can find a function $p$ satisfying $Lp \equiv 0$ and adjust the training data $\{(x_j, y_j)\}_{j=1}^M$ to $\{(x_j, y_j - p(x_j))\}_{j=1}^M$. After that, the statement of Theorem 1 holds with the variational problem (4) changed to
$$\min_{g\in C^2(S)} \int_S \frac{1}{\zeta(x)}\big[L\big(g(x) - f(x,\theta_0)\big)\big]^2\, dx \quad \text{s.t. } g(x_j) = y_j - p(x_j),\; j=1,\ldots,M,$$
where $\zeta(x) = p_C(x)\,\mathbb{E}(W^{2k}\mid C = x)$ and $S = \mathrm{supp}(\zeta)\cap[\min_j x_j, \max_j x_j]$.
Moreover, our method allows us to describe the optimization trajectory in function space (see Appendix N). If we substitute the constraints $g(x_j) = y_j$ in (4) by a quadratic term $\frac{1}{\lambda}\frac{1}{M}\sum_{j=1}^M (g(x_j) - y_j)^2$ added to the objective, we obtain the variational problem of a so-called spatially adaptive smoothing spline (see Abramovich and Steinberg, 1996; Pintore et al., 2006). This problem can be solved explicitly and can be shown to approximate early stopping. More specifically, the solution to the following optimization problem approximates the output function of the network after gradient descent training for $t$ steps with learning rate $\eta/n$:
$$\min_{g\in C^2(S)} \sum_{j=1}^M \big[g(x_j) - y_j\big]^2 + \frac{1}{\eta t}\int_S \frac{1}{\zeta(x)}\big(g''(x) - f''(x,\theta_0)\big)^2\, dx.$$

Related works. Zhang et al. (2019) described the implicit bias of gradient descent in the kernel regime as minimizing a kernel norm from initialization, subject to fitting the training data. Our result can be regarded as making the kernel norm explicit, thus providing an interpretable description of the bias in function space and further illuminating the role of the parameter initialization procedure. We prove the equivalence in Appendix M. Savarese et al. (2019) showed that infinite-width networks with 2-norm weight regularization represent functions with the smallest 1-norm of the second derivative, an example of which are linear splines. We discuss this in Appendix C.4. A recent preprint further develops this direction for two-layer networks with certain activation functions that interpolate data while minimizing a weight norm (Parhi and Nowak, 2019). In contrast, our result characterizes the solutions of training from a given initialization without explicit regularization, which turn out to minimize a weighted 2-norm of the second derivative and hence correspond to cubic splines.
While finishing this work we became aware of a recent preprint (Heiss et al., 2019) which discusses a ridge weight penalty, adaptive splines, and early stopping for one-input ReLU networks training only the output layer. Williams et al. (2019) showed a similar result in the kernel regime for shallow ReLU networks where they train only the second layer from zero initialization. In contrast, we take the initialization of the second layer into account and show that the difference from the initial output function is implicitly regularized by gradient descent. We show the result of training both layers and prove in Theorem 4 that it can be approximated by training only the second layer. In addition, we give the explicit form of $\zeta$ in Theorem 9, while the $\zeta$ given by Williams et al. (2019) has a minor error due to a typo in their computation. Most importantly, our statement generalizes to multivariate regression, different activation functions, and training trajectories.
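The correspondence between early stopping and a regularization strength decreasing as $1/(\eta t)$ can be illustrated in a finite-dimensional caricature of the function-space dynamics. In the sketch below, a Gaussian kernel matrix stands in for the NTK Gram matrix; the kernel, the data, and all constants are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8
X = np.sort(rng.uniform(-2, 2, M))
y = np.sin(2 * X) + 0.1 * rng.standard_normal(M)

# A positive definite kernel matrix standing in for the NTK on the inputs.
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
eta = 0.1 / np.linalg.eigvalsh(K).max()   # small constant step size

def early_stopped(t):
    """Training predictions after t steps of gradient descent in function space,
    started from the zero function: f_t = (I - (I - eta*K)^t) y."""
    A = np.eye(M) - eta * K
    return (np.eye(M) - np.linalg.matrix_power(A, t)) @ y

def ridge(lam):
    """Kernel predictions with explicit regularization: K (K + lam*I)^{-1} y."""
    return K @ np.linalg.solve(K + lam * np.eye(M), y)
```

In the eigenbasis of `K`, `early_stopped(t)` shrinks the coefficient on an eigenvalue $\lambda$ by $1 - (1-\eta\lambda)^t$, which tracks the explicit-regularization factor $\lambda/(\lambda + 1/(\eta t))$; so stopping at time $t$ behaves like regularization of strength roughly $1/(\eta t)$.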

4.1. IMPLICIT BIAS IN PARAMETER SPACE FOR A LINEARIZED MODEL

In this section we describe how training a linearized network or a wide network by gradient descent leads to solutions that are biased, having parameter values close to the values at initialization. First, we consider the following linearized model:
$$f^{\mathrm{lin}}(x,\omega) = f(x,\theta_0) + \nabla_\theta f(x,\theta_0)\,(\omega - \theta_0).$$
We write $\omega$ for the parameter of the linearized model, in order to distinguish it from the parameter of the nonlinearized model. The empirical loss of the linearized model is defined by $\mathcal{L}^{\mathrm{lin}}(\omega) = \sum_{j=1}^M \ell(f^{\mathrm{lin}}(x_j,\omega), y_j)$. The gradient descent iteration for the linearized model is given by
$$\omega_0 = \theta_0, \qquad \omega_{t+1} = \omega_t - \eta\, \nabla_\theta f(\mathcal{X},\theta_0)^T\, \nabla_{f^{\mathrm{lin}}(\mathcal{X},\omega_t)} \mathcal{L}^{\mathrm{lin}}. \qquad (9)$$
Next, we consider wide neural networks. According to Lee et al. (2019, Theorem H.1), $\sup_t \|f^{\mathrm{lin}}(x,\omega_t) - f(x,\theta_t)\|_2 = O(n^{-1/2})$ with arbitrarily high probability. So gradient descent training of a wide network and of the linearized model give similar trajectories and solutions in function space. Both fit the training data perfectly, meaning $f^{\mathrm{lin}}(\mathcal{X},\omega_\infty) = f(\mathcal{X},\theta_\infty) = \mathcal{Y}$, and they are also approximately equal outside the training data. Under the assumption that $\mathrm{rank}(\nabla_\theta f(\mathcal{X},\theta_0)) = M$, the gradient descent iteration (9) converges to the unique global minimum that is closest to initialization (Gunasekar et al., 2018a; Zhang et al., 2019), which is the solution of the following constrained optimization problem (further details and remarks are provided in Appendix E):
$$\min_\omega \|\omega - \theta_0\|_2 \quad \text{s.t. } f^{\mathrm{lin}}(\mathcal{X},\omega) = \mathcal{Y}.$$
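The closest-to-initialization property of gradient descent on an overparametrized linear model can be checked directly in a toy setting. In the sketch below, the Jacobian, data, and dimensions are arbitrary stand-ins for $\nabla_\theta f(\mathcal{X},\theta_0)$ and the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
M, P = 5, 50                          # M constraints, P >> M parameters
J = rng.standard_normal((M, P))       # stand-in for grad_theta f(X, theta_0)
f0 = rng.standard_normal(M)           # stand-in for f(X, theta_0)
y = rng.standard_normal(M)
theta0 = rng.standard_normal(P)

# Gradient descent on the linearized model's MSE loss, started at theta0.
eta = 1.0 / np.linalg.eigvalsh(J @ J.T).max()
w = theta0.copy()
for _ in range(20_000):
    w -= eta * J.T @ (f0 + J @ (w - theta0) - y)

# Closed form of the interpolating solution closest to theta0.
w_star = theta0 + np.linalg.pinv(J) @ (y - f0)
```

Gradient descent never moves in directions orthogonal to the row space of `J`, so the limit coincides with the minimum-distance interpolant `w_star`.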

4.2. TRAINING ONLY THE OUTPUT LAYER APPROXIMATES TRAINING ALL PARAMETERS

From now on we consider networks with a single hidden layer of $n$ ReLUs and a linear output,
$$f(x,\theta) = \sum_{i=1}^{n} W^{(2)}_i \big[W^{(1)}_i x + b^{(1)}_i\big]_+ + b^{(2)}.$$
We show that the functions and parameter vectors obtained by training the linearized model are close to those obtained by training only the output layer. Hence, by the arguments of the previous section, training all parameters of a wide network or training only the output layer gives similar functions. Let $\theta_0 = \mathrm{vec}(W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$ be the parameter at initialization, so that $f^{\mathrm{lin}}(\cdot,\theta_0) = f(\cdot,\theta_0)$. After training the linearized network, let the parameter be $\omega_\infty = \mathrm{vec}(\widetilde W^{(1)}, \widetilde b^{(1)}, \widetilde W^{(2)}, \widetilde b^{(2)})$. Using initialization (2), with probability arbitrarily close to 1, $W^{(1)}_i, b^{(1)}_i = O(1)$ and $W^{(2)}_i, b^{(2)} = O(n^{-1/2})$.¹ Therefore, writing $H$ for the Heaviside function, we have
$$\nabla_{W^{(1)}_i, b^{(1)}_i} f(x,\theta_0) = W^{(2)}_i H\big(W^{(1)}_i x + b^{(1)}_i\big)\cdot (x, 1) = O(n^{-1/2}),$$
$$\nabla_{W^{(2)}_i, b^{(2)}} f(x,\theta_0) = \big(\big[W^{(1)}_i x + b^{(1)}_i\big]_+,\, 1\big) = O(1).$$
So when $n$ is large, if we use gradient descent with a constant learning rate for all parameters, the changes of $W^{(1)}, b^{(1)}, b^{(2)}$ are negligible compared with the changes of $W^{(2)}$. So, approximately, we can train just the output weights $W^{(2)}_i$, $i=1,\ldots,n$, and fix all other parameters. This corresponds to a smaller linear model. Let $\bar\omega_t = \mathrm{vec}(W^{(1)}, b^{(1)}, \bar W^{(2)}_t, b^{(2)})$ be the parameter at time $t$ under the update rule where $W^{(1)}, b^{(1)}, b^{(2)}$ are kept fixed at their initial values, and
$$\bar W^{(2)}_0 = W^{(2)}, \qquad \bar W^{(2)}_{t+1} = \bar W^{(2)}_t - \eta\, \nabla_{W^{(2)}} \mathcal{L}^{\mathrm{lin}}(\bar\omega_t). \qquad (12)$$
Let $\bar\omega_\infty = \lim_{t\to\infty}\bar\omega_t$. By the above discussion, we expect $f^{\mathrm{lin}}(x,\bar\omega_\infty)$ to be close to $f^{\mathrm{lin}}(x,\omega_\infty)$. In fact, we prove the following for the MSE loss. The proof and further remarks are provided in Appendix F. We relate Theorem 4 to training a wide network in Appendix G.

Theorem 4 (Training only output weights vs linearized network). Consider a finite data set $\{(x_j, y_j)\}_{j=1}^M$.
Assume that (1) we use the MSE loss $\ell(\hat y, y) = \frac{1}{2}\|\hat y - y\|_2^2$; (2) $\inf_n \lambda_{\min}(\hat\Theta_n) > 0$. Let $\omega_t$ denote the parameters of the linearized model at time $t$ when we train all parameters using (9), and let $\bar\omega_t$ denote the parameters at time $t$ when we only train the weights of the output layer using (12). If we use the same learning rate $\eta$ in these two training processes and $\eta < \frac{2}{n\lambda_{\max}(\hat\Theta_n)}$, then for any $x\in\mathbb{R}$, with probability arbitrarily close to 1 over the random initialization (2),
$$\sup_t \big|f^{\mathrm{lin}}(x,\bar\omega_t) - f^{\mathrm{lin}}(x,\omega_t)\big| = O(n^{-1}), \quad \text{as } n\to\infty.$$
Moreover, in terms of the parameter trajectories we have
$$\sup_t \|W^{(1)}_t - \bar W^{(1)}_t\|_2 = O(n^{-1}), \qquad \sup_t \|b^{(1)}_t - \bar b^{(1)}_t\|_2 = O(n^{-1}),$$
$$\sup_t \|W^{(2)}_t - \bar W^{(2)}_t\|_2 = O(n^{-3/2}), \qquad \sup_t |b^{(2)}_t - \bar b^{(2)}_t| = O(n^{-1}).$$
In view of the arguments in this section, in the next sections we will focus on training only the output weights and understanding the corresponding solution functions.

¹ More precisely, for any $\delta > 0$ there exists $C$ such that with probability $1-\delta$, $|W^{(2)}_i|, |b^{(2)}| \le Cn^{-1/2}$ and $|W^{(1)}_i|, |b^{(1)}_i| \le C$.

Let $g(x,\gamma) = \int_{\mathbb{R}^2} \gamma(W^{(1)}, c)\big[W^{(1)}(x - c)\big]_+\, d\nu(W^{(1)}, c)$, which again corresponds to the output function of the network. Then the second derivative of $g$ with respect to $x$ (see Appendix I) satisfies
$$g''(x,\gamma) = p_C(x)\int_{\mathbb{R}} \gamma(W^{(1)}, x)\, W^{(1)}\, d\nu_{W\mid C=x}(W^{(1)}).$$
Thus $\gamma(W^{(1)}, c)$ is closely related to $g''(x,\gamma)$, and we can try to express (16) in terms of $g''(x,\gamma)$. Since $g''(x,\gamma)$ determines $g(x,\gamma)$ only up to linear functions, we consider the following problem:
$$\min_{\gamma\in C(\mathbb{R}^2),\, u\in\mathbb{R},\, v\in\mathbb{R}} \int_{\mathbb{R}^2} \gamma^2(W^{(1)}, c)\, d\nu(W^{(1)}, c)$$
$$\text{subject to } \; ux_j + v + \int_{\mathbb{R}^2} \gamma(W^{(1)}, c)\big[W^{(1)}(x_j - c)\big]_+\, d\nu(W^{(1)}, c) = y_j, \quad j = 1,\ldots,M. \qquad (17)$$
Here $u, v$ are not included in the cost. They add a linear function to the output of the neural network. If $u$ and $v$ in the solution of (17) are small, then the solution is close to the solution of (16). Ongie et al. (2020) also use this trick to simplify the characterization of neural networks in function space.
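The gradient-scale separation behind Theorem 4 (input-layer gradients of order $n^{-1/2}$ per coordinate, output-layer gradients of order 1) can be verified numerically. A small sketch; the uniform initialization distributions, widths, and evaluation point are illustrative choices:

```python
import numpy as np

def grad_scales(n, x=0.7, seed=0):
    """Largest per-coordinate gradient magnitude of f at initialization, for
    input-layer weights (claimed O(n^{-1/2})) vs output weights (claimed O(1))."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-1, 1, n)
    b1 = rng.uniform(-1, 1, n)
    W2 = rng.uniform(-1, 1, n) / np.sqrt(n)   # scaling as in initialization (2)
    pre = W1 * x + b1
    H = (pre > 0).astype(float)               # Heaviside of the preactivation
    g_W1 = W2 * H * x                          # df/dW1_i = W2_i H(.) x
    g_W2 = np.maximum(pre, 0.0)                # df/dW2_i = [W1_i x + b1_i]_+
    return np.abs(g_W1).max(), np.abs(g_W2).max()
```

Increasing the width shrinks the input-layer gradient entries while the output-layer gradient entries stay of constant order, which is why training only the output weights tracks full training.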
Next we study the solution of (17) in function space. This is our main technical result.

Theorem 6 (Implicit bias in function space). Assume $W$ and $B$ are random variables with $P(W = 0) = 0$, and let $C = -B/W$. Let $\nu$ denote the probability distribution of $(W, C)$. Suppose $(\gamma, u, v)$ is the solution of (17), and consider the corresponding output function
$$g(x, (\gamma,u,v)) = ux + v + \int_{\mathbb{R}^2} \gamma(W^{(1)}, c)\big[W^{(1)}(x - c)\big]_+\, d\nu(W^{(1)}, c).$$
Then $g(x,(\gamma,u,v))$ satisfies $g''(x,(\gamma,u,v)) = 0$ for $x \notin S$, and on $S$ it is the solution of the following problem:
$$\min_{h\in C^2(S)} \int_S \frac{(h''(x))^2}{\zeta(x)}\, dx \quad \text{s.t. } h(x_j) = y_j, \; j = 1,\ldots,M. \qquad (19)$$
The proof is provided in Appendix I, where we also present the corresponding statement without ASI. We study the explicit form of $\zeta$ in the next section.

5.3. EXPLICIT FORM OF THE CURVATURE PENALTY FUNCTION

Proposition 7. Let $p_{W,B}$ denote the joint density function of $(W,B)$ and let $C = -B/W$, so that $p_C$ is the breakpoint density. Then
$$\zeta(x) = \mathbb{E}(W^2 \mid C = x)\, p_C(x) = \int_{\mathbb{R}} |W|^3\, p_{W,B}(W, -Wx)\, dW.$$
The proof is presented in Appendix J. If we allow the initial weights and biases to be sampled from a suitable joint distribution, we can make the curvature penalty $\rho = 1/\zeta$ arbitrary.

Proposition 8 (Constructing any curvature penalty). Given any function $\rho\colon\mathbb{R}\to\mathbb{R}_{>0}$ satisfying $Z = \int_{\mathbb{R}} \frac{1}{\rho} < \infty$, if we set the density of $C$ as $p_C(x) = \frac{1}{Z}\frac{1}{\rho(x)}$ and make $W$ independent of $C$ with non-vanishing second moment, then
$$\big(\mathbb{E}(W^2 \mid C = x)\, p_C(x)\big)^{-1} = \big(\mathbb{E}(W^2)\, p_C(x)\big)^{-1} \propto \rho(x), \quad x\in\mathbb{R}.$$
Further remarks on sampling and independent variables are provided in Appendix J. To conclude this section, we compute the explicit form of $\zeta$ for several common initialization procedures.

Theorem 9 (Explicit form of the curvature penalty for common initializations). The proof is provided in Appendix K. Theorem 9 (b) and (c) show that for certain distributions of $(W,B)$, $\zeta$ is constant. In this case, problem (19) is solved by the cubic spline interpolation of the data with natural boundary conditions (Ahlberg et al., 1967). The case of general $\zeta$ is solved by spatially adaptive natural cubic splines, which can be computed numerically by solving a linear system and treated theoretically in an RKHS formalism. We provide details in Appendix O.
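Proposition 7 can be checked numerically against a closed form. For independent Gaussian $W$ and $B$, carrying out the integral by hand gives $\zeta(x) = 2\sigma_w^3\sigma_b^3/\big(\pi(\sigma_b^2 + \sigma_w^2 x^2)^2\big)$, consistent with the $\zeta(x)\propto 1/(1+x^2)^2$ curve reported in Figure 1; the closed-form constant is our own computation, so we verify it by quadrature:

```python
import numpy as np
from scipy import integrate, stats

def zeta_numeric(x, sw=1.0, sb=1.0):
    """Proposition 7: zeta(x) = int |W|^3 p_{W,B}(W, -W x) dW, evaluated by
    quadrature for independent W ~ N(0, sw^2), B ~ N(0, sb^2)."""
    f = lambda w: abs(w) ** 3 * stats.norm.pdf(w, scale=sw) * stats.norm.pdf(-w * x, scale=sb)
    val, _ = integrate.quad(f, -np.inf, np.inf)
    return val

def zeta_closed(x, sw=1.0, sb=1.0):
    """Hand-derived closed form: 2 sw^3 sb^3 / (pi (sb^2 + sw^2 x^2)^2).
    For sw = sb = 1 this is proportional to 1/(1 + x^2)^2."""
    return 2 * sw**3 * sb**3 / (np.pi * (sb**2 + sw**2 * x**2) ** 2)
```

The quadrature and the closed form agree across inputs and scale parameters, illustrating how the initialization distribution shapes the curvature penalty $1/\zeta$.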

6. CONCLUSION AND DISCUSSION

We obtained an explicit description of the implicit bias of gradient descent for mean squared error regression with wide shallow ReLU networks. We presented a result for the univariate case, along with generalizations to multivariate ReLU networks and to networks with different activation functions. Our result can also help characterize the training trajectory of gradient descent in function space. Our main result shows that the trained network outputs a function that interpolates the training data and has the minimum possible weighted 2-norm of the second derivative with respect to the input. This corresponds to a spatially adaptive interpolating spline. The space of interpolating splines is a linear space whose dimension is linear in the number of data points. Hence our result means that, even if the network has many parameters, the complexity of the trained functions is adjusted to the number of data points. Interpolating splines have been studied in great detail in the literature, and our result allows us to directly apply corresponding generalization results to the case of trained networks. This is related to approximation theory and characterizations of the number of samples and their spacing needed in order to approximate functions from a given smoothness class to a desired precision (Rieger and Zwicknagl, 2010; Wendland, 2004). Zhang et al. (2019) described the implicit bias of gradient descent as minimizing an RKHS norm from initialization. Our result can be regarded as making the RKHS norm explicit, thus providing an interpretable description of the bias in function space. Compared with Zhang et al. (2019), our results give a precise description of the role of the parameter initialization scheme, which determines the inverse curvature penalty function $\zeta$. This gives us a rather good picture of how the initialization affects the implicit bias of gradient descent. This could be used in order to select a good initialization scheme.
For instance, one could conduct a pre-assessment of the data to estimate the locations of the input space where the target function has a high curvature, and choose the parameter initialization accordingly. This is an interesting possibility to experiment with, based on our theoretical result. Our result can also be interpreted in combination with early stopping. The training trajectory is approximated by a smoothing spline, meaning that the network will filter out high frequencies which are usually associated to noise in the training data. This behaviour is sometimes referred to as a spectral bias (Rahaman et al., 2019) .



Here we assume that $P(W = 0) = 0$, so that the random variable $C$ is well defined. This is not an important restriction, since neurons with weight $W^{(1)} = 0$ compute constant functions, which can be absorbed into the bias of the output layer.



Figure 1: Illustration of Theorem 1. Left: Uniform error between the solution $g^*$ of the variational problem and the functions $f(\cdot,\theta^*)$ obtained by gradient descent training of a neural network (in this case with uniform initialization $W\sim\mathcal{U}(-1,1)$, $B\sim\mathcal{U}(-2,2)$), against the number of neurons. The inset shows examples of the trained networks (blue) alongside the training data (dots) and the solution of the variational problem (orange). Right: Effect of the curvature penalty function on the shape of the solution function. The bottom shows $g^*$ for various $\zeta$, which are shown at the top. Again, dots are the training data. The green curve is for $\zeta$ constant on $[-2,2]$, derived from the initialization $W\sim\mathcal{U}(-1,1)$, $B\sim\mathcal{U}(-2,2)$; blue is for $\zeta(x) = 1/(1+x^2)^2$, derived from $W\sim\mathcal{N}(0,1)$, $B\sim\mathcal{N}(0,1)$; and orange is for $\zeta(x) = 1/(0.1+x^2)^2$, derived from $W\sim\mathcal{N}(0,1)$, $B\sim\mathcal{N}(0,0.1)$. Theorem 9 shows how to compute $\zeta$ for these distributions.

Let $\nu_C$ denote the marginal distribution of $C$ and assume it has a density function $p_C$. Let $\mathbb{E}(W^2\mid C)$ denote the conditional expectation of $W^2$ given $C$. Consider the function $\zeta(x) = p_C(x)\,\mathbb{E}(W^2\mid C = x)$. Assume that the training data satisfy $x_j \in \mathrm{supp}(\zeta)$, $j = 1,\ldots,M$, and consider the set $S = \mathrm{supp}(\zeta)\cap[\min_j x_j, \max_j x_j]$.

(a) Gaussian initialization. Assume that $W$ and $B$ are independent, $W\sim\mathcal{N}(0,\sigma_w^2)$ and $B\sim\mathcal{N}(0,\sigma_b^2)$. Then $\zeta$ is given by $\zeta(x) = \frac{2\sigma_w^3\sigma_b^3}{\pi(\sigma_b^2 + \sigma_w^2 x^2)^2}$. (b) Binary-uniform initialization. Assume that $W$ and $B$ are independent, $W\in\{-1, 1\}$ and $B\sim\mathcal{U}(-a_b, a_b)$ with $a_b \ge L$. Then $\zeta$ is constant on $[-L, L]$. (c) Uniform initialization. Assume that $W$ and $B$ are independent, $W\sim\mathcal{U}(-a_w, a_w)$ and $B\sim\mathcal{U}(-a_b, a_b)$ with $\frac{a_b}{a_w} \ge L$. Then $\zeta$ is constant on $[-L, L]$.
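Case (c) can likewise be checked by quadrature. Evaluating the integral of Proposition 7 for $W\sim\mathcal{U}(-a_w, a_w)$ and $B\sim\mathcal{U}(-a_b, a_b)$ gives the constant value $a_w^3/(8 a_b)$ on the relevant interval; this constant is our own computation, verified numerically below:

```python
import numpy as np
from scipy import integrate

def zeta_uniform(x, aw=1.0, ab=2.0):
    """zeta(x) = int |w|^3 p_W(w) p_B(-w x) dw for W ~ U(-aw, aw), B ~ U(-ab, ab),
    evaluated numerically (Proposition 7)."""
    def f(w):
        pW = 1.0 / (2 * aw) if abs(w) <= aw else 0.0
        pB = 1.0 / (2 * ab) if abs(w * x) <= ab else 0.0
        return abs(w) ** 3 * pW * pB
    val, _ = integrate.quad(f, -aw, aw)
    return val
```

For $|x| \le a_b/a_w$ the indicator of $p_B$ is identically active on the support of $W$, which is exactly why the result does not depend on $x$ there.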

5. GRADIENT DESCENT LEADS TO SIMPLE FUNCTIONS

In this section we provide a function space characterization of the implicit bias previously described in parameter space. According to (10), gradient descent training of the output weights (12) achieves zero loss, that is, $f^{\mathrm{lin}}(x_j, \bar\omega_\infty) = y_j$ for $j = 1,\ldots,M$. To simplify the presentation, in the following we let $f^{\mathrm{lin}}(x,\theta_0) \equiv 0$ by using the ASI trick (see Appendix C.2). The analysis still goes through without this assumption.
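The ASI trick can be implemented by duplicating each hidden unit and negating the copy's output weight, so the two halves cancel and the initial output function is identically zero. A minimal sketch; the sizes and initialization distributions are illustrative:

```python
import numpy as np

def asi_init(n, rng):
    """Antisymmetric initialization (ASI): each hidden unit is duplicated with
    its output weight negated, so the network output is zero at initialization."""
    W1 = rng.uniform(-1, 1, n)
    b1 = rng.uniform(-2, 2, n)
    W2 = rng.uniform(-1, 1, n) / np.sqrt(n)
    W1 = np.concatenate([W1, W1])
    b1 = np.concatenate([b1, b1])
    W2 = np.concatenate([W2, -W2])     # mirrored copy cancels the original
    return W1, b1, W2

rng = np.random.default_rng(0)
W1, b1, W2 = asi_init(50, rng)
x = np.linspace(-3, 3, 7)
f0 = np.maximum(np.outer(x, W1) + b1, 0.0) @ W2   # output bias taken as 0
```

Since each mirrored pair shares the same hidden activation, the output cancels exactly for every input, while the NTK at initialization is unchanged up to the doubling of the width.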

5.1. INFINITE WIDTH LIMIT

We reformulate problem (14) in a way that allows us to consider the limit of infinitely wide networks, $n\to\infty$, and obtain a deterministic counterpart, analogous to the convergence of the NTK. Let $\mu_n$ denote the empirical distribution of the samples $(W^{(1)}_i, b^{(1)}_i)_{i=1}^n$, given by $\mu_n(A) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}_A(W^{(1)}_i, b^{(1)}_i)$, where $\mathbb{1}_A$ is the indicator function for measurable subsets $A$ in $\mathbb{R}^2$. We further consider a function $\alpha_n\colon\mathbb{R}^2\to\mathbb{R}$ whose value encodes the difference of the output weight from its initialization for a hidden unit with input weight and bias given by the argument, $\alpha_n(W^{(1)}_i, b^{(1)}_i) = n(\bar W^{(2)}_i - W^{(2)}_i)$. Then (14) with ASI can be rewritten as
$$\min_{\alpha_n\in C(\mathbb{R}^2)} \int_{\mathbb{R}^2} \alpha_n^2(W^{(1)}, b)\, d\mu_n(W^{(1)}, b) \quad \text{s.t. } \int_{\mathbb{R}^2} \alpha_n(W^{(1)}, b)\big[W^{(1)} x_j + b\big]_+\, d\mu_n(W^{(1)}, b) = y_j, \qquad (15)$$
where $j$ ranges from 1 to $M$. Here we minimize over functions $\alpha_n$ in $C(\mathbb{R}^2)$, but since only the values at $(W^{(1)}_i, b^{(1)}_i)_{i=1}^n$ are taken into account, we can take any continuous interpolation of the values $\alpha_n(W^{(1)}_i, b^{(1)}_i)$, $i = 1,\ldots,n$. Now we can consider the infinite width limit. Let $\mu$ be the probability measure of $(W,B)$. We obtain a continuous version of problem (15) by substituting $\mu$ for $\mu_n$. Since $\mu_n$ converges weakly to $\mu$, we prove that the solution of problem (15) in fact converges to the solution of the continuous problem, which is formulated in the following theorem; details are given in Appendix H.

Theorem 5 (Infinite width limit). Let $(W^{(1)}_i, b^{(1)}_i)_{i=1}^n$ be i.i.d. samples from a pair $(W,B)$ of random variables with finite fourth moment. Suppose $\mu_n$ is the empirical distribution of these samples and $\alpha_n$ is the solution of (15). Let $\alpha(W^{(1)}, b)$ be the solution of the continuous problem with $\mu$ in place of $\mu_n$. Then, uniformly on any bounded set, the function represented by the width-$n$ network after training (corresponding to $\alpha_n$) converges to the function represented by the infinite-width network (corresponding to $\alpha$).
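At finite width, problem (15) is a least-2-norm linear system in the output-weight adjustments and can be solved with a pseudoinverse. A small sketch; the data and the uniform initialization distribution are illustrative choices:

```python
import numpy as np

def min_norm_solution(n, X, y, xs, seed=0):
    """Width-n analogue of problem (15): minimize the empirical 2-norm of the
    output-weight adjustments subject to fitting the training data exactly."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-1, 1, n)
    b1 = rng.uniform(-2, 2, n)
    # Feature map of the smaller linear model, normalized by the width.
    feats = lambda x: np.maximum(np.outer(x, W1) + b1, 0.0) / n
    alpha = np.linalg.pinv(feats(X)) @ y      # least-2-norm interpolant
    return feats(xs) @ alpha, feats(X) @ alpha

X = np.array([-1.5, -0.5, 0.3, 1.0, 1.8])
y = np.array([0.2, -0.4, 0.1, 0.5, -0.3])
xs = np.linspace(-1.5, 1.8, 5)
g_small, fit_small = min_norm_solution(2000, X, y, xs)
g_big, fit_big = min_norm_solution(8000, X, y, xs)
```

Both widths interpolate the data exactly, and as $n$ grows the resulting function stabilizes, in line with the convergence stated in Theorem 5.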

5.2. FUNCTION SPACE DESCRIPTION OF THE IMPLICIT BIAS

Next we connect the problem from the previous section to second derivatives, by first rewriting it in terms of breakpoints. Consider the breakpoint $c = -b/W^{(1)}$ of a ReLU with weight $W^{(1)}$ and bias $b$. We define a corresponding random variable $C = -B/W$ and let $\nu$ denote the distribution of $(W, C)$.² Then, with $\gamma(W^{(1)}, c) = \alpha(W^{(1)}, -cW^{(1)})$, the continuous version of (15) is equivalently given as
$$\min_{\gamma\in C(\mathbb{R}^2)} \int_{\mathbb{R}^2} \gamma^2(W^{(1)}, c)\, d\nu(W^{(1)}, c) \quad \text{s.t. } \int_{\mathbb{R}^2} \gamma(W^{(1)}, c)\big[W^{(1)}(x_j - c)\big]_+\, d\nu(W^{(1)}, c) = y_j, \; j = 1,\ldots,M. \qquad (16)$$
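The breakpoint variable $C = -B/W$ is easy to examine by simulation. For the uniform initialization used in Figure 1, the change of variables gives the density $p_C(x) = 1/8$ on $[-2, 2]$ and $1/(2x^2)$ outside (our own computation), so exactly half of the breakpoints fall in $[-2, 2]$; a quick Monte Carlo check:

```python
import numpy as np

# Sample breakpoints C = -B/W for W ~ U(-1, 1), B ~ U(-2, 2), as in Figure 1.
rng = np.random.default_rng(0)
N = 200_000
W = rng.uniform(-1, 1, N)
B = rng.uniform(-2, 2, N)
C = -B / W   # W = 0 occurs with probability zero

# Empirical fractions to compare with the hand-computed density p_C:
# P(|C| <= 1) = 2 * (1/8) = 0.25 and P(|C| <= 2) = 0.5.
frac_inner = np.mean(np.abs(C) <= 1.0)
frac_half = np.mean(np.abs(C) <= 2.0)
```

The flat breakpoint density on $[-2, 2]$ is what makes $\zeta$ constant there, and hence the solution a natural cubic spline on that interval.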

