IMPLICIT BIAS OF GRADIENT DESCENT FOR MEAN SQUARED ERROR REGRESSION WITH WIDE NEURAL NETWORKS

Abstract

We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For 1D regression, we show that the solution of training a width-n shallow ReLU network is within n^{-1/2} of the function which fits the training data and whose difference from initialization has the smallest 2-norm of the weighted second derivative with respect to the input. The curvature penalty function 1/ζ is expressed in terms of the probability distribution that is used to initialize the network parameters, and we compute it explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolant of the training data. While similar results have been obtained in previous works, our analysis clarifies important details and allows us to obtain significant generalizations. In particular, the result generalizes to multivariate regression and different activation functions. Moreover, we show that the training trajectories are captured by trajectories of spatially adaptive smoothing splines with decreasing regularization strength.

1. INTRODUCTION

Understanding why neural networks trained in the overparametrized regime and without explicit regularization generalize well in practice is an important problem (Zhang et al., 2017). Some form of capacity control different from network size must be at play (Neyshabur et al., 2014), and specifically the implicit bias of parameter optimization has been identified to play a key role (Neyshabur et al., 2017). By implicit bias we mean that among the many hypotheses that fit the training data, the algorithm selects one which satisfies additional properties that may be beneficial for its performance on new data. Jacot et al. (2018) and Lee et al. (2019) showed that the training dynamics of shallow and deep wide neural networks is well approximated by that of the linear Taylor approximation of the models at a suitable initialization. Chizat et al. (2019) observe that a model can converge to zero training loss while hardly varying its parameters, a phenomenon that can be attributed to scaling of the output weights and makes the model behave as its linearization around the initialization. Zhang et al. (2019) consider linearized models for regression problems and show that gradient flow finds the global minimum of the loss function which is closest to initialization in parameter space. This type of analysis connects with trajectory-based analysis of neural networks (Saxe et al., 2014). Oymak and Soltanolkotabi (2019) studied overparametrized neural networks directly and showed that gradient descent finds a global minimizer of the loss function which is close to the initialization. Towards interpreting parameters in function space, Savarese et al. (2019) and Ongie et al. (2020) studied infinite-width neural networks with parameters having bounded norm, in 1D and multi-dimensional input spaces, respectively.
They showed that, under a standard parametrization, the complexity of the functions represented by the network, as measured by the 1-norm of the second derivative, can be controlled by the 2-norm of the parameters. Using these results, one can show that gradient descent with an ℓ2 weight penalty leads to simple functions. Sahs et al. (2020) relate function properties, such as breakpoint and slope distributions, to the distributions of the network parameters. The implicit bias of parameter optimization has been investigated in terms of the properties of the loss function at the points reached by different optimization methodologies (Keskar et al., 2017; Wu et al., 2017; Dinh et al., 2017). In terms of the solutions, Maennel et al. (2018) show that gradient flow for shallow networks with rectified linear units (ReLU) initialized close to zero quantizes features in a way that depends on the training data but not on the network size. Williams et al. (2019) obtained results for 1D regression contrasting the kernel and adaptive regimes. Soudry et al. (2018) show that in classification problems with separable data, gradient descent with linear networks converges to a max-margin solution. Gunasekar et al. (2018b) present a result on implicit bias for deep linear convolutional networks, and Ji and Telgarsky (2019) study non-separable data. Chizat and Bach (2020) show that gradient flow for logistic regression with infinitely wide two-layer networks yields a max-margin classifier in a certain space. Gunasekar et al. (2018a) analyze the implicit bias of different optimization methods (natural gradient, steepest and mirror descent) for linear regression and separable linear classification problems, and obtain characterizations in terms of minimum-norm or max-margin solutions.

In this work, we study the implicit bias of gradient descent for regression problems. We focus on wide ReLU networks and describe the bias in function space. In Section 2 we provide settings and notation.
We present our main results in Section 3, and develop the main theory in Sections 4 and 5. In the interest of a concise presentation, technical proofs and extended discussions are deferred to appendices.

2. NOTATION AND PROBLEM SETUP

Consider a fully connected network with d inputs, one hidden layer of width n, and a single output. For any given input x ∈ R^d, the output of the network is

    f(x, θ) = Σ_{i=1}^n W^{(2)}_i φ(⟨W^{(1)}_i, x⟩ + b^{(1)}_i) + b^{(2)},

where φ is a point-wise activation function, and W^{(1)} ∈ R^{n×d}, W^{(2)} ∈ R^n, b^{(1)} ∈ R^n, b^{(2)} ∈ R are the weights and biases of layers l = 1, 2. We write θ = vec(∪_{l=1}^{2} {W^{(l)}, b^{(l)}}) for the vector of all network parameters. These parameters are initialized by independent samples of pre-specified random variables W and B in the following way:

    W^{(1)}_{i,j} =_d √(1/d) W,   b^{(1)}_i =_d √(1/d) B,   W^{(2)}_i =_d √(1/n) W,   b^{(2)} =_d √(1/n) B.   (1)

More generally, we will also allow weight–bias pairs to be sampled from a joint distribution of (W, B), which we only assume to be sub-Gaussian. In the analysis of Jacot et al. (2018) and Lee et al. (2019), W and B are Gaussian N(0, σ²). In the default initialization of PyTorch, W and B have uniform distribution U(−σ, σ). The setting (1) is known as the standard parametrization. Some works (Jacot et al., 2018; Lee et al., 2019) utilize the so-called NTK parametrization, where the factor √(1/n) is carried outside of the trainable parameter. If we fix the learning rate for all parameters, gradient descent leads to different trajectories under these two parametrizations. Our results are presented for the standard parametrization; details are given in Appendix C.3.

We consider a regression problem for data {(x_j, y_j)}_{j=1}^M with inputs X = {x_j}_{j=1}^M and outputs Y = {y_j}_{j=1}^M. For a loss function ℓ : R × R → R, the empirical risk is L(θ) = Σ_{j=1}^M ℓ(f(x_j, θ), y_j). We use full-batch gradient descent with a fixed learning rate η to minimize L(θ).
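To make the setup concrete, the following sketch implements the width-n network and the standard-parametrization initialization described above. This is a minimal NumPy illustration, not the authors' code; the function names (`init_params`, `f`) and the `dist` switch between uniform and Gaussian sampling are our own, while the √(1/d) and √(1/n) scale factors and the ReLU activation follow the text.

```python
import numpy as np

def init_params(n, d, rng, dist="uniform", sigma=1.0):
    """Sample θ under the standard parametrization: layer-1 entries are
    scaled by sqrt(1/d) and layer-2 entries by sqrt(1/n)."""
    if dist == "uniform":          # e.g. PyTorch-style U(-σ, σ)
        sample = lambda size: rng.uniform(-sigma, sigma, size)
    else:                          # Gaussian N(0, σ²)
        sample = lambda size: rng.normal(0.0, sigma, size)
    W1 = np.sqrt(1.0 / d) * sample((n, d))
    b1 = np.sqrt(1.0 / d) * sample(n)
    W2 = np.sqrt(1.0 / n) * sample(n)
    b2 = np.sqrt(1.0 / n) * sample(None)   # scalar output bias
    return W1, b1, W2, b2

def f(x, params):
    """Network output f(x, θ) = Σ_i W2_i φ(⟨W1_i, x⟩ + b1_i) + b2, φ = ReLU."""
    W1, b1, W2, b2 = params
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
```

For a joint distribution of (W, B), the per-neuron weight and bias samples would simply be drawn together rather than independently.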
Writing θ_t for the parameters at time t, and θ_0 for the initialization, this defines the iteration

    θ_{t+1} = θ_t − η ∇L(θ_t) = θ_t − η ∇_θ f(X, θ_t)^T ∇_{f(X,θ_t)} L,

where f(X, θ_t) = [f(x_1, θ_t), ..., f(x_M, θ_t)]^T is the vector of network outputs for all training inputs, and ∇_{f(X,θ_t)} L is the gradient of the loss with respect to the model outputs. We will use subscript i to index neurons and subscript t to index time. Let Θ^n be the empirical neural tangent kernel (NTK) of the standard parametrization at time 0, which is the matrix Θ^n = (1/n) ∇_θ f(X, θ_0) ∇_θ f(X, θ_0)^T.
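The gradient descent iteration and the empirical NTK can be sketched as follows. This is our own illustration, not the paper's code: we take ℓ to be the squared error ½(f − y)², so that ∇_{f(X,θ_t)} L is the residual vector, and we use a ReLU network with parameter blocks (W1, b1, W2, b2) as in the setup above; all function names are hypothetical.

```python
import numpy as np

def f(x, params):
    """Shallow ReLU network f(x, θ) = Σ_i W2_i φ(⟨W1_i, x⟩ + b1_i) + b2."""
    W1, b1, W2, b2 = params
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def grad_f(x, params):
    """Gradient of f(x, θ) with respect to each parameter block."""
    W1, b1, W2, b2 = params
    pre = W1 @ x + b1
    dphi = (pre > 0).astype(float)           # φ'(pre) for ReLU
    dW1 = (W2 * dphi)[:, None] * x[None, :]  # ∂f/∂W1_{ij} = W2_i φ'(pre_i) x_j
    db1 = W2 * dphi                          # ∂f/∂b1_i  = W2_i φ'(pre_i)
    dW2 = np.maximum(pre, 0.0)               # ∂f/∂W2_i  = φ(pre_i)
    db2 = 1.0                                # ∂f/∂b2    = 1
    return dW1, db1, dW2, db2

def gd_step(X, Y, params, eta):
    """One full-batch step θ_{t+1} = θ_t − η ∇_θ f(X,θ_t)^T ∇_{f(X,θ_t)} L,
    with L(θ) = ½ Σ_j (f(x_j, θ) − y_j)² so that ∇_f L is the residual."""
    resid = np.array([f(x, params) for x in X]) - Y
    grads = [grad_f(x, params) for x in X]
    return tuple(p - eta * sum(r * g[k] for r, g in zip(resid, grads))
                 for k, p in enumerate(params))

def empirical_ntk(X, params):
    """Θ^n = (1/n) ∇_θ f(X, θ_0) ∇_θ f(X, θ_0)^T, an M × M matrix."""
    n = params[0].shape[0]
    J = np.stack([np.concatenate([np.ravel(g) for g in grad_f(x, params)])
                  for x in X])               # M × |θ| Jacobian
    return J @ J.T / n
```

By construction Θ^n is symmetric positive semidefinite, and for a sufficiently small learning rate each step decreases the loss.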

