GUIDING NEURAL NETWORK INITIALIZATION VIA MARGINAL LIKELIHOOD MAXIMIZATION

Abstract

We propose a simple approach to help guide hyperparameter selection for neural network initialization. We leverage the relationship between neural networks and Gaussian process models with corresponding activation and covariance functions to infer the hyperparameter values desirable for model initialization. Our experiment shows that marginal likelihood maximization provides recommendations that yield near-optimal prediction performance on an MNIST classification task under the experiment's constraints. Furthermore, our empirical results indicate that the proposed technique is consistent, suggesting that its computation cost could be significantly reduced by using smaller training sets.

1. INTRODUCTION

Training deep neural networks successfully can be challenging; with proper initialization, however, trained models can achieve improved prediction performance. Various initialization strategies for neural networks have been discussed extensively in numerous research works. Glorot and Bengio (2010) focused on linear cases and proposed the normalized initialization scheme (also known as Xavier-initialization); their derivation considers the activation variances in the forward path and the gradient variances in back-propagation. He-initialization (He et al., 2015) was developed for very deep networks with rectifier nonlinearities; their approach imposes a condition on the weight variances to control the variation in the input magnitudes. Because of its success, He-initialization has become the de facto choice for deep ReLU networks. While the Glorot- and He-initialization schemes recognize the importance of, and make use of, the hidden layer widths in their formulation, other methods have also been suggested to improve training in deep neural networks. Mishkin and Matas (2016) demonstrated that pre-initialization with orthonormal matrices followed by output variance normalization produces prediction performance comparable to, if not better than, standard techniques. Additionally, Schoenholz et al. (2017) developed a bound on the network depth, based on the principle of 'Edge of Chaos', for a given set of initialization hyperparameters. Furthermore, Hayou et al. (2019) showed, both theoretically and in practice, that proper tuning of the initialization parameters together with an appropriate activation function is important for training models to improved performance. Neal (1996) showed that, as a fully-connected, single-hidden-layer feedforward untrained neural network becomes infinitely wide, its output under Gaussian prior distributions over the hidden-to-output weights and biases converges to a Gaussian process, under the assumption that the parameters are independent.
In other words, the untrained infinite neural network and its induced Gaussian process counterpart are equivalent. Also, as a result of the central limit theorem, the covariance between network outputs evaluated at different inputs can be represented as a function of the hidden node activation function. Intuitively, we could therefore relate the prediction performance of an untrained, finite-width, single-hidden-layer, fully-connected feedforward neural network to a Gaussian process model with a covariance function corresponding to the network's activation function. In this work we propose a simple and efficient method that learns from training data to guide the selection of initialization hyperparameters in neural networks. The marginal likelihood is a popular tool for choosing kernel hyperparameters in model selection; its applications in convolutional Gaussian processes and deep kernel learning are discussed, respectively, in (van der Wilk et al., 2017; Wilson et al., 2016). Our method aims to combine this powerful functionality of the marginal likelihood with the relationship between untrained neural networks and Gaussian process models to make recommendations for neural network initialization. We first derive the covariance function corresponding to the activation function of the network whose prediction performance we wish to evaluate. We then employ marginal likelihood optimization for the Gaussian process model to learn hyperparameters from data. We hypothesize that the optimal set of hyperparameter values could improve the initialization of the neural network.

2. APPROACH

To assess our proposed method, we build a neural network and a Gaussian process model with corresponding activation and covariance functions. With the Gaussian process we estimate the covariance hyperparameters from training data. These hyperparameter values are then applied in the neural network to evaluate and compare its prediction accuracy among various hyperparameter sets. We first describe the structure of the neural network, followed by the Gaussian process model and the underlying reason for employing the marginal likelihood. Then, given the network activation function we proceed to derive a closed form representation of its counterpart covariance function.

2.1. SINGLE-HIDDEN-LAYER NEURAL NETWORKS

Our neural network model is a fully-connected, single-hidden-layer feedforward network with 2000 hidden nodes and the rectified linear unit (ReLU) activation function. Following (Lee et al., 2018), we conduct our empirical study by treating MNIST image classification as regression prediction. Inasmuch as the network is designed for regression, we choose the mean squared error (MSE) loss as its objective function, along with the Adam optimizer, and accuracy as the performance metric. In addition, one-hot encoding is utilized to generate class labels, where an incorrectly labeled class is designated $-0.1$ and a correctly labeled class $0.9$. For example, the one-hot representation of the integer 7 is given by $[-0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, 0.9, -0.1, -0.1]$. The input to each hidden node nonlinearity (the pre-activation) is represented by $z^0_j(x) = b^0_j + \sum_{k=1}^{d_{in}} W^0_{jk} x^0_k$, while the hidden unit output after the nonlinearity (the post-activation) is denoted by $x^1_j(x) = \phi(z^0_j(x))$, $j \in \{1, 2, \dots, N_1\}$. Since we typically apply a linear activation function in the output stage of a regression model, the model output is simply $z^1_i(x) = b^1_i + \sum_{j=1}^{N_1} W^1_{ij} x^1_j(x)$.
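To make the setup concrete, the network's forward pass and initialization can be sketched in a few lines of NumPy. This is our illustrative sketch, not the paper's implementation; the function names `init_params` and `forward` are our own, and the weight/bias sampling follows the scheme described later in Section 3.1:

```python
import numpy as np

def init_params(d_in, n_hidden, n_out, sigma2_w, sigma2_b, rng):
    """Sample initial parameters: weights ~ N(0, sigma2_w / fan_in), biases ~ N(0, sigma2_b)."""
    W0 = rng.normal(0.0, np.sqrt(sigma2_w / d_in), size=(n_hidden, d_in))
    b0 = rng.normal(0.0, np.sqrt(sigma2_b), size=n_hidden)
    W1 = rng.normal(0.0, np.sqrt(sigma2_w / n_hidden), size=(n_out, n_hidden))
    b1 = rng.normal(0.0, np.sqrt(sigma2_b), size=n_out)
    return W0, b0, W1, b1

def forward(x, params):
    """One forward pass: pre-activation z0, ReLU post-activation x1, linear output."""
    W0, b0, W1, b1 = params
    z0 = b0 + W0 @ x           # z^0_j(x) = b^0_j + sum_k W^0_jk x_k
    x1 = np.maximum(z0, 0.0)   # x^1_j(x) = phi(z^0_j(x)), phi = ReLU
    return b1 + W1 @ x1        # z^1_i(x) = b^1_i + sum_j W^1_ij x^1_j(x)
```

For MNIST, `d_in = 784`, `n_hidden = 2000`, and `n_out = 10` (the one-hot target dimension).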

2.2. GAUSSIAN PROCESSES

A Gaussian process (MacKay, 1998; Neal, 1998; Williams and Rasmussen, 2006; Bishop, 2006) is a set of random variables, any finite collection of which follows a multivariate normal distribution. A Gaussian process prediction model exploits this property and offers a Bayesian approach to solving machine learning problems. The model is completely specified by its mean function and covariance function. Choosing a particular covariance function induces a prior distribution over functions which, together with observed inputs and targets, can be used to generate a predictive distribution for making predictions and providing uncertainty measures on unknown test points. These capabilities allow Gaussian processes to be used effectively in many important machine learning applications such as human pose inference (Urtasun and Darrell, 2008) and object classification (Kapoor et al., 2010). Recent research also applies Gaussian processes in deep structures for image classification (van der Wilk et al., 2017) and regression tasks (Wilson et al., 2016). To help achieve optimal performance for Gaussian process prediction, we select a suitable covariance function and tune the model by adjusting the hyperparameters characterizing it. This can be accomplished through the marginal likelihood, a crucial feature that enables Gaussian processes to learn proper hyperparameter values from training data.

2.3. HYPERPARAMETERS AND MARGINAL LIKELIHOOD OPTIMIZATION

We briefly describe the procedure for estimating optimal hyperparameter values by maximizing the Gaussian process marginal likelihood function. Consider a set of $N$ multidimensional inputs $X = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^D$, and a target set $y = \{y_i\}_{i=1}^N$, $y_i \in \mathbb{R}$. For each input $x_i$ we have a corresponding input-output pair $(x_i, y_i)$, where the observed output target is given by $y_i = f(x_i) + \epsilon_i$, with data noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2_n)$. We model the input-output latent function $f$ as a Gaussian process: $f(x_i) \sim \mathcal{GP}\big(\mu(x_i), k(x_i, x_j)\big)$, where we customarily set the mean function $\mu(x_i) := \mathbb{E}[f(x_i)] = 0$ and denote $k(x_i, x_j)$ as the covariance function. The marginal likelihood (or evidence) (Williams and Rasmussen, 2006; Bishop, 2006) measures the probability of the observed targets given the input data and can be expressed as the integral of the product of the likelihood and the prior, marginalized over the latent function $f$:
$$p(y|X) = \int p(y, f|X)\, df = \int p(y|f, X)\, p(f|X)\, df. \quad (1)$$
The marginal likelihood can be obtained either by evaluating the integral (1) or by noticing that $\{y_i\}_{i=1}^N = \{f(x_i) + \epsilon_i\}_{i=1}^N$, which gives us $y|X \sim \mathcal{N}(0, K + \sigma^2_n I)$, where $K = [k(x_i, x_j)]_{i,j=1}^N$ and $I$ are the $N \times N$ covariance matrix and identity matrix, respectively. As a result,
$$p(y|X) = \frac{1}{(2\pi)^{N/2}\, |K + \sigma^2_n I|^{1/2}} \exp\left(-\tfrac{1}{2}\, y^T (K + \sigma^2_n I)^{-1} y\right).$$
To facilitate computation, we evaluate the log marginal likelihood, which is given by
$$\log p(y|X) = -\tfrac{1}{2}\, y^T (K + \sigma^2_n I)^{-1} y - \tfrac{1}{2} \log |K + \sigma^2_n I| - \tfrac{N}{2} \log 2\pi. \quad (2)$$
We are reminded here that the marginal likelihood is applied directly to the entire training dataset, rather than to a validation subset. In addition, the Cholesky decomposition (Neal, 1998) can be employed to calculate the term $(K + \sigma^2_n I)^{-1} y$ in equation (2).
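Equation (2) and the Cholesky-based solve can be sketched in NumPy as follows (the function name is ours; the sketch assumes a precomputed covariance matrix `K`):

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma2_n):
    """Equation (2): log p(y|X) for y|X ~ N(0, K + sigma2_n * I),
    computed via the Cholesky factor L of K + sigma2_n * I."""
    N = len(y)
    L = np.linalg.cholesky(K + sigma2_n * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma2_n I)^{-1} y
    # log|K + sigma2_n I| = 2 * sum(log diag(L))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)
```

The Cholesky route factors the matrix once in O(N^3) and avoids explicitly forming the inverse, which is also numerically safer.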

2.4. RELU COVARIANCE FUNCTION

With the structure of the single-hidden-layer ReLU neural network defined, we proceed to study its corresponding ReLU Gaussian process. The ReLU covariance function is developed to estimate the covariance at the output of the ReLU neural network model. Our alternative derivation was inspired by the work on the arc-cosine family of kernels developed in (Cho and Saul, 2009). In our work we first derive the expectation of the product of post-activations, instead of working with the input to the nonlinearity (Lee et al., 2018). Then, we apply the output layer activation function to the post-activation expected value. It can be shown that the resulting representations are equivalent. The complete derivation of our expression is provided in the Appendix (Section 5). Referring to Figure 1, we consider input vectors $x^0, y^0 \in \mathbb{R}^{d_{in}}$. The initial weight values are drawn randomly from the Gaussian distribution $W^0_{jk} \sim \mathcal{N}(0, \sigma^2_w / d_{in})$ and the bias values from $b^0_j \sim \mathcal{N}(0, \sigma^2_b)$. The expected value of the product of post-activations at the output of the $j$th hidden node is computed as
$$\mathbb{E}[X_j(x^0) X_j(y^0)] = \int_{-\infty}^{\infty} \!\cdots\! \int_{-\infty}^{\infty} (b^0_j + w^0_j \cdot x^0)_+ \,(b^0_j + w^0_j \cdot y^0)_+ \, f_{b^0_j, W^0_j}(b, w)\, dw^0_j\, db^0_j. \quad (3)$$
Suppose we denote the pre-activations as $U = b^0_j + W^0_j \cdot x^0 \sim \mathcal{N}(0, \sigma^2_b + \sigma^2_w \|x\|^2)$ and $V = b^0_j + W^0_j \cdot y^0 \sim \mathcal{N}(0, \sigma^2_b + \sigma^2_w \|y\|^2)$, where for simplicity we let $x = x^0$, $y = y^0$. It can be shown that the random variables $U, V$ have a joint Gaussian distribution, $(U, V)^T \sim \mathcal{N}(0, \Sigma)$, where
$$\Sigma = \begin{pmatrix} \sigma^2_b + \sigma^2_w \|x\|^2 & \sigma^2_b + \sigma^2_w (x \cdot y) \\ \sigma^2_b + \sigma^2_w (x \cdot y) & \sigma^2_b + \sigma^2_w \|y\|^2 \end{pmatrix}.$$
We can therefore write expression (3) as
$$\int_0^\infty \!\!\int_0^\infty uv\, \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(u, v)\, \Sigma^{-1} (u, v)^T\right) du\, dv.$$
Now let $D := |\Sigma| = (\sigma^2_b + \sigma^2_w \|x\|^2)(\sigma^2_b + \sigma^2_w \|y\|^2) - (\sigma^2_b + \sigma^2_w (x \cdot y))^2$, and write $\Sigma^{-1} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$, where $a_{11} = \frac{1}{D}(\sigma^2_b + \sigma^2_w \|y\|^2)$, $a_{22} = \frac{1}{D}(\sigma^2_b + \sigma^2_w \|x\|^2)$, $a_{12} = a_{21} = -\frac{1}{D}(\sigma^2_b + \sigma^2_w (x \cdot y))$. With the polar coordinate transformation $u = \frac{r}{\sqrt{a_{11}}}\cos\alpha$, $v = \frac{r}{\sqrt{a_{22}}}\sin\alpha$, expression (3) can be further reduced to
$$\frac{1}{4\pi D^{1/2} a_{11} a_{22}} \int_{\alpha=0}^{\pi/2} \frac{2 \sin 2\alpha}{(1 - \cos\phi \sin 2\alpha)^2}\, d\alpha = \frac{1}{2\pi D^{1/2} a_{11} a_{22} \sin^3\phi}\left(\sin\phi + (\pi - \phi)\cos\phi\right),$$
where $\phi = \cos^{-1}\frac{-a_{12}}{\sqrt{a_{11} a_{22}}}$. With some algebraic operations, and after computing the entries of $\Sigma^{-1}$, we arrive at
$$\mathbb{E}[X_j(x) X_j(y)] = \frac{1}{2\pi}\, (\sigma^2_b + \sigma^2_w\|x\|^2)^{1/2} (\sigma^2_b + \sigma^2_w\|y\|^2)^{1/2} \left(\sin\phi + (\pi - \phi)\cos\phi\right),$$
where
$$\phi = \cos^{-1}\left[\frac{\sigma^2_b + \sigma^2_w (x \cdot y)}{(\sigma^2_b + \sigma^2_w\|x\|^2)^{1/2} (\sigma^2_b + \sigma^2_w\|y\|^2)^{1/2}}\right].$$
To compute the expected value $\mathbb{E}[X_j(x)] = \int (b + w \cdot x)_+ f_{b^0_j, W^0_{jk}}(b, w)\, dw\, db$, we denote $U = b + w \cdot x \sim \mathcal{N}(0, \sigma^2)$, where $\sigma^2 = \sigma^2_b + \sigma^2_w \|x\|^2$, and apply the change of variable $t = \frac{u^2}{2\sigma^2}$ to obtain
$$\mathbb{E}[X_j(x)] = \int_{-\infty}^{\infty} (u)_+ f_U(u)\, du = \int_0^\infty \frac{u}{\sqrt{2\pi}\,\sigma}\, e^{-u^2/2\sigma^2}\, du = \frac{\sigma}{\sqrt{2\pi}} = \frac{(\sigma^2_b + \sigma^2_w \|x\|^2)^{1/2}}{\sqrt{2\pi}}.$$
The covariance function at the network output is therefore determined to be
$$K(x, y) = \sigma^2_b + \sigma^2_w \left(\mathbb{E}[X_j(x) X_j(y)] - \mathbb{E}[X_j(x)]\, \mathbb{E}[X_j(y)]\right) = \sigma^2_b + \frac{\sigma^2_w}{2\pi}\, (\sigma^2_b + \sigma^2_w\|x\|^2)^{1/2} (\sigma^2_b + \sigma^2_w\|y\|^2)^{1/2} \left(\sin\phi + (\pi - \phi)\cos\phi - 1\right).$$
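The closed-form covariance above translates directly into code. The following is our sketch (the `clip` guards against rounding pushing the cosine argument slightly outside $[-1, 1]$):

```python
import numpy as np

def relu_covariance(x, y, sigma2_w, sigma2_b):
    """Output covariance K(x, y) of the single-hidden-layer ReLU network (Section 2.4)."""
    kxx = sigma2_b + sigma2_w * np.dot(x, x)
    kyy = sigma2_b + sigma2_w * np.dot(y, y)
    kxy = sigma2_b + sigma2_w * np.dot(x, y)
    cos_phi = np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0)
    phi = np.arccos(cos_phi)
    return sigma2_b + sigma2_w / (2 * np.pi) * np.sqrt(kxx * kyy) * (
        np.sin(phi) + (np.pi - phi) * np.cos(phi) - 1.0)
```

At x = y we have phi = 0, so the variance reduces to sigma2_b + (sigma2_w / 2 pi)(sigma2_b + sigma2_w ||x||^2)(pi - 1).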

2.5. GAUSSIAN PROCESS PREDICTION: A SIMULATION

Performing simulations allows us to explore and understand some properties of the models we wish to study. Simulation results also offer the opportunity to evaluate model precision and gain insight into observed events. To demonstrate making predictions with a Gaussian process regression model, we borrow equations from (Williams and Rasmussen, 2006), where the formulation of the Gaussian process predictive distribution is treated in great detail. Given the design matrix $X = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^D$, observed targets $y = \{y_i\}_{i=1}^N$, $y_i \in \mathbb{R}$, unknown test data $X_*$, and their function values $f_* := f(X_*)$, the joint distribution of the targets and function values is
$$\begin{pmatrix} y \\ f_* \end{pmatrix} \sim \mathcal{N}\left(0, \begin{pmatrix} K(X, X) + \sigma^2_n I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix}\right),$$
where $K(X, X)$ represents the covariance matrix over all pairs of training points, $K(X, X_*)$ that over pairs of training and test points, and $K(X_*, X_*)$ the covariance matrix over pairs of test points. The predictive distribution is the conditional distribution $f_* | X, y, X_* \sim \mathcal{N}(\mu_*, \Sigma_*)$, with mean $\mu_* = K(X_*, X)\left(K(X, X) + \sigma^2_n I\right)^{-1} y$ and covariance $\Sigma_* = K(X_*, X_*) - K(X_*, X)\left(K(X, X) + \sigma^2_n I\right)^{-1} K(X, X_*)$. The simulation starts by setting the hyperparameters of the ReLU covariance function to their design values; the setup and results are detailed alongside Figure 2. Our simulation results agree with the principle that, through optimizing the marginal likelihood of the Gaussian process model, we can estimate from training data the hyperparameter values most appropriate for the chosen covariance function.
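The predictive equations can be sketched as follows (assuming precomputed covariance blocks; the function name `gp_predict` is ours):

```python
import numpy as np

def gp_predict(K, K_s, K_ss, y, sigma2_n):
    """Predictive distribution f_* | X, y, X_* ~ N(mu_*, Sigma_*).
    K: N x N train/train, K_s: M x N test/train, K_ss: M x M test/test."""
    N = K.shape[0]
    L = np.linalg.cholesky(K + sigma2_n * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_s @ alpha              # K(X_*, X) (K + s^2 I)^{-1} y
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T @ v          # K(X_*, X_*) - K(X_*, X)(K + s^2 I)^{-1} K(X, X_*)
    return mu, cov
```

As a sanity check, predicting at the training points with a near-zero noise variance returns the training targets with near-zero predictive covariance.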

3. MNIST CLASSIFICATION EXPERIMENT

We conduct a classification experiment on the MNIST handwritten digit dataset (LeCun, 1998) making use of corresponding ReLU neural network and Gaussian process models. As in (Lee et al., 2018), the classification task on the class labels is treated as Gaussian process regression (also known as kriging in spatial statistics (Cressie, 1993)). It is necessary to point out that the goal of this work is to examine the use of the marginal likelihood to estimate the best available initial hyperparameter setting for neural networks, rather than to determine the networks' optimal structure. Our experiment consists of three main steps: (A) searching within a given grid of hyperparameter values for the pair $\{\hat\sigma^2_w, \hat\sigma^2_b\}$ that maximizes the log marginal likelihood function of the Gaussian process model, (B) evaluating the prediction accuracy of the corresponding neural network at each grid point $\{\sigma^2_w, \sigma^2_b\}$, including $\{\hat\sigma^2_w, \hat\sigma^2_b\}$, and (C) assessing neural network performance over all tested hyperparameter pairs.

3.1. PROCEDURE

The workflow for the experiment is as follows. We set up a grid of $\sigma^2_w \in \{0.4, 1.2, 2.0, 2.8, 3.6\}$ and $\sigma^2_b \in \{0.0, 1.0, 2.0\}$. Then, $N$ samples are randomly selected from the MNIST training set to form a training subset, where $N$ is the training size. This is followed by computing the log marginal likelihood (equation 2) at each grid point, which allows us to identify the hyperparameter pair $\{\hat\sigma^2_w, \hat\sigma^2_b\}$ that yields the maximum log marginal likelihood value. On the neural network side, we build a fully-connected feedforward neural network with a single hidden layer of width 2000 nodes, the Adam optimizer, and the MSE loss function. Since the network model is fully connected, the size of the input layer is $d_{in} = 28 \text{ (pixels)} \times 28 \text{ (pixels)} = 784$. Prior to training, the initialization parameters $\{w, b\}$ are set by sampling the distributions $\mathcal{N}(0, \sigma^2_w / d_{in})$ and $\mathcal{N}(0, \sigma^2_b)$ for the weights and biases from the input to the hidden layer, and $\mathcal{N}(0, \sigma^2_w / 2000)$ and $\mathcal{N}(0, \sigma^2_b)$ for the weights and biases from the hidden to the output layer. The neural network is then trained on the training subset generated previously. We compute the model's classification accuracy on the MNIST test set and repeat the procedure over the entire grid of hyperparameter pairs. To investigate the usefulness of our proposed approach for assisting model initialization, we employ He-initialization as a benchmark to measure, numerically and graphically, our neural network's performance over all tested hyperparameter pairs. Additionally, we check for recommendation consistency.
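Step (A) of this workflow can be sketched as a toy, self-contained grid search. The helpers `relu_cov` and `log_ml` re-implement the expressions from Sections 2.3 and 2.4; the noise level `s2n` is our assumption, as the value used in the experiment is not stated:

```python
import numpy as np

def relu_cov(x, y, s2w, s2b):
    # ReLU covariance function from Section 2.4
    kxx, kyy = s2b + s2w * (x @ x), s2b + s2w * (y @ y)
    kxy = s2b + s2w * (x @ y)
    phi = np.arccos(np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0))
    return s2b + s2w / (2 * np.pi) * np.sqrt(kxx * kyy) * (
        np.sin(phi) + (np.pi - phi) * np.cos(phi) - 1.0)

def log_ml(K, y, s2n):
    # Log marginal likelihood, equation (2), via Cholesky
    N = len(y)
    L = np.linalg.cholesky(K + s2n * np.eye(N))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)

def select_init(X, y, s2w_grid, s2b_grid, s2n=0.01):
    """Return the grid point (s2w, s2b) maximizing the log marginal likelihood."""
    scored = []
    for s2w in s2w_grid:
        for s2b in s2b_grid:
            K = np.array([[relu_cov(xi, xj, s2w, s2b) for xj in X] for xi in X])
            scored.append((log_ml(K, y, s2n), (s2w, s2b)))
    return max(scored)[1]
```

The selected pair is then used to scale the weight and bias initialization of the neural network, as described above.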

3.2. RESULTS

Applying the method described in Section 2.3 for estimating the model hyperparameter pair, we obtain a consistent recommendation of $(\hat\sigma^2_w, \hat\sigma^2_b) = (3.6, 0.0)$. After running 250 training epochs, the convergence of the neural network model and its prediction accuracy are studied for different training sizes. We observe that training based on our initialization approach converges to that based on He-initialization as the training size increases, as shown in Figure 3. This seems to suggest that our approach may be used as an efficient tool for recommending initialization in deep learning. It is worth noting that the Gaussian process marginal likelihood consistently suggests the hyperparameter pair $(\sigma^2_w, \sigma^2_b) = (3.6, 0)$. The fact that the bias variance $\sigma^2_b$ is estimated to be 0 coincides with the assumption in (He et al., 2015) that the bias vector is 0. Table 1 lists neural network prediction accuracy based on, respectively, our approach and the He-initialization scheme, against the best and the worst performers. The results indicate that our approach more frequently achieves slightly better accuracy than He-initialization. However, neither approach reliably gives an estimate of the weight variance close to that of the best case.

4. DISCUSSION AND FUTURE WORK

In this work we propose a simple, consistent, and time-efficient method to guide the selection of initial hyperparameters for neural networks. We show that by maximizing the log marginal likelihood we can learn from training data a hyperparameter setting that leads to accurate and efficient initialization in neural networks. We develop an alternative representation of the ReLU covariance function to estimate the covariance at the output of the ReLU neural network model: we first derive the expectation of the product of post-activations, and then apply the output layer activation function to the post-activation expected value to generate the output covariance function. Utilizing marginal likelihood optimization with the derived ReLU covariance function, we perform a simulation to demonstrate the effectiveness of Gaussian process regression. We train a fully-connected single-hidden-layer neural network model to perform classification (treated as regression) on the MNIST data set. The empirical results indicate that, when the recommended hyperparameter setting is applied at initialization, the neural network model performs well, with the He-initialization scheme as the benchmark method. A further examination of the results reveals the consistency of the process. This implies that smaller training subsets could be used to provide reasonable recommendations for neural network initialization on sizable training data sets, reducing the computation time otherwise required for inverting considerably large covariance matrices. The main goal of our future research is to investigate whether our proposed method is adequate for deep neural networks with complicated data sets. We wish to ascertain whether consistent recommendations can be attained by learning from larger data sets of color images via marginal likelihood maximization. We will attempt to derive or approximate multilayer covariance functions corresponding to various activation functions.
Deep fully-connected neural network models will be built to perform classification on the CIFAR-10 data set. Our hypothesis is that learning directly from training data helps to improve the neural network initialization strategy.

5. APPENDIX

Covariance Function at the Output of the ReLU Neural Network. Our derivation follows the work on the arc-cosine family of kernels developed in (Cho and Saul, 2009). However, instead of applying coplanar vector rotation to calculate the kernel integral, we recognize that the integrand can be written in terms of two jointly normal random variables. This helps to facilitate the computation, which becomes more involved when both the weight and bias parameters are included. The derivation is also made to conform to the arc-cosine kernel by utilizing the identities (Cho and Saul, 2009, equations (17), (18)):
$$\int_{\eta=0}^{\pi/2} \frac{d\eta}{1 - \cos\phi \cos\eta} = \frac{\pi - \phi}{\sin\phi}, \qquad \int_{\theta=0}^{\pi/2} \frac{\sin 2\theta}{(1 - \cos\phi \sin 2\theta)^2}\, d\theta = \frac{1}{\sin^3\phi}\left(\sin\phi + (\pi - \phi)\cos\phi\right). \quad (4)$$
Equation (4) is derived, with the substitution $\eta = 2(\theta - \frac{\pi}{4})$, as follows:
$$\int_{\theta=0}^{\pi/2} \frac{\sin 2\theta}{(1 - \cos\phi \sin 2\theta)^2}\, d\theta = \int_{\eta=0}^{\pi/2} \frac{\cos\eta}{(1 - \cos\phi \cos\eta)^2}\, d\eta = \frac{\partial}{\partial \cos\phi} \int_{\eta=0}^{\pi/2} \frac{d\eta}{1 - \cos\phi \cos\eta} = \frac{\partial}{\partial \cos\phi}\, \frac{\pi - \phi}{\sin\phi} = \frac{-1}{\sin\phi}\, \frac{\partial}{\partial \phi}\, \frac{\pi - \phi}{\sin\phi} = \frac{1}{\sin^3\phi}\left(\sin\phi + (\pi - \phi)\cos\phi\right).$$
Denote the input layer (layer 0) weight and bias parameters as $b^0_j \sim \mathcal{N}(0, \sigma^2_b)$ and $W^0_{jk} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2_w / d_{in})$, where $b^0_j \perp W^0_{jk}$ for all $k \in \{1, \dots, d_{in}\}$, $j \in \{1, \dots, N_1\}$. The expected value of the product of post-activations at the output of the $j$th hidden node is computed as
$$\mathbb{E}[X_j(x^0) X_j(y^0)] = \int_{-\infty}^{\infty} \!\cdots\! \int_{-\infty}^{\infty} (b^0_j + w^0_j \cdot x^0)_+ \,(b^0_j + w^0_j \cdot y^0)_+ \, f_{b^0_j, W^0_j}(b, w)\, dw^0_j\, db^0_j. \quad (5)$$
Each pre-activation can be written in terms of a random variable: $U = b^0_j + W^0_j \cdot x^0 \sim \mathcal{N}(0, \sigma^2_b + \sigma^2_w \|x\|^2)$, $V = b^0_j + W^0_j \cdot y^0 \sim \mathcal{N}(0, \sigma^2_b + \sigma^2_w \|y\|^2)$. Since $\mathbb{E}[U] = \mathbb{E}[V] = 0$, their covariance can be expressed as
$$\mathrm{cov}(U, V) = \mathbb{E}\big[(b^0_j + W^0_j \cdot x^0)(b^0_j + W^0_j \cdot y^0)\big] = \mathbb{E}[(b^0_j)^2] + \sum_{k=1}^{d_{in}} \sum_{k'=1}^{d_{in}} \mathbb{E}\big[W^0_{jk} W^0_{jk'}\big]\, x^0_k y^0_{k'} = \sigma^2_b + \sigma^2_w (x \cdot y),$$
where for simplicity we set $x = x^0$, $y = y^0$. This implies that the random variables $U, V$ have a joint Gaussian distribution, $(U, V)^T \sim \mathcal{N}(0, \Sigma)$, where
$$\Sigma = \begin{pmatrix} \sigma^2_b + \sigma^2_w \|x\|^2 & \sigma^2_b + \sigma^2_w (x \cdot y) \\ \sigma^2_b + \sigma^2_w (x \cdot y) & \sigma^2_b + \sigma^2_w \|y\|^2 \end{pmatrix}.$$
We can, therefore, rewrite equation (5) as
$$\int_0^\infty \!\!\int_0^\infty uv\, \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(u, v)\, \Sigma^{-1} (u, v)^T\right) du\, dv. \quad (6)$$
Denote $D := |\Sigma| = (\sigma^2_b + \sigma^2_w \|x\|^2)(\sigma^2_b + \sigma^2_w \|y\|^2) - (\sigma^2_b + \sigma^2_w (x \cdot y))^2$ and $\Sigma^{-1} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$, with $a_{11} = \frac{1}{D}(\sigma^2_b + \sigma^2_w \|y\|^2)$, $a_{22} = \frac{1}{D}(\sigma^2_b + \sigma^2_w \|x\|^2)$, $a_{12} = a_{21} = -\frac{1}{D}(\sigma^2_b + \sigma^2_w (x \cdot y))$. We therefore have, by definition,
$$D(a_{11} a_{22} - a_{12}^2) = \frac{1}{D}\left[(\sigma^2_b + \sigma^2_w \|x\|^2)(\sigma^2_b + \sigma^2_w \|y\|^2) - (\sigma^2_b + \sigma^2_w (x \cdot y))^2\right] = 1.$$
The exponential term in equation (6) then becomes $-\tfrac{1}{2}(u, v)\Sigma^{-1}(u, v)^T = -\tfrac{1}{2}\left(a_{11} u^2 + 2a_{12} uv + a_{22} v^2\right)$. We now make use of the transformation from Cartesian to polar coordinates by setting $u = \frac{r}{\sqrt{a_{11}}}\cos\alpha$, $v = \frac{r}{\sqrt{a_{22}}}\sin\alpha$, so that $a_{11} u^2 = r^2\cos^2\alpha$ and $a_{22} v^2 = r^2\sin^2\alpha$. The Jacobian is $\left|\frac{\partial(u, v)}{\partial(r, \alpha)}\right| = \frac{r}{\sqrt{a_{11} a_{22}}}$. Equation (6) can in turn be expressed as
$$\frac{1}{2\pi D^{1/2}} \int_{\alpha=0}^{\pi/2} \int_{r=0}^{\infty} \frac{r^2 \sin 2\alpha}{2\sqrt{a_{11} a_{22}}} \exp\left(-\frac{r^2}{2}\left[1 + \frac{a_{12}\sin 2\alpha}{\sqrt{a_{11} a_{22}}}\right]\right) \frac{r\, dr\, d\alpha}{\sqrt{a_{11} a_{22}}} = \frac{1}{4\pi D^{1/2} a_{11} a_{22}} \int_{\alpha=0}^{\pi/2} \sin 2\alpha \int_{r=0}^{\infty} r^3 \exp\left(-\frac{r^2}{2}\left[1 + \frac{a_{12}\sin 2\alpha}{\sqrt{a_{11} a_{22}}}\right]\right) dr\, d\alpha. \quad (8)$$
Next, we show that $H := 1 + \frac{a_{12}\sin 2\alpha}{\sqrt{a_{11} a_{22}}} \geq 0$, which ensures that the expression in (8) is bounded. First, since $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2(x \cdot y) \geq 0$ implies $\|x\|^2 + \|y\|^2 \geq 2(x \cdot y)$, and letting the angle between the vectors $x, y$ be $\theta = \cos^{-1}\frac{x \cdot y}{\|x\| \|y\|}$, we have
$$\frac{(\sigma^2_b + \sigma^2_w (x \cdot y))^2}{(\sigma^2_b + \sigma^2_w \|x\|^2)(\sigma^2_b + \sigma^2_w \|y\|^2)} = \frac{\sigma^4_b + (\sigma^2_w)^2 (\|x\| \|y\| \cos\theta)^2 + 2\sigma^2_b \sigma^2_w (x \cdot y)}{\sigma^4_b + (\sigma^2_w)^2 \|x\|^2 \|y\|^2 + \sigma^2_b \sigma^2_w (\|x\|^2 + \|y\|^2)} \leq 1.$$
This means that we can define a quantity $\phi$ as
$$\phi = \cos^{-1}\left[\frac{\sigma^2_b + \sigma^2_w (x \cdot y)}{\left((\sigma^2_b + \sigma^2_w \|x\|^2)(\sigma^2_b + \sigma^2_w \|y\|^2)\right)^{1/2}}\right] = \cos^{-1}\frac{-a_{12}}{\sqrt{a_{11} a_{22}}}, \quad \text{so that} \quad \cos\phi = \frac{-a_{12}}{\sqrt{a_{11} a_{22}}}.$$
This also leads to
$$H = 1 - \frac{(\sigma^2_b + \sigma^2_w (x \cdot y))\sin 2\alpha}{\left((\sigma^2_b + \sigma^2_w \|x\|^2)(\sigma^2_b + \sigma^2_w \|y\|^2)\right)^{1/2}} \geq 1 - \frac{\sigma^2_b + \sigma^2_w (x \cdot y)}{\left((\sigma^2_b + \sigma^2_w \|x\|^2)(\sigma^2_b + \sigma^2_w \|y\|^2)\right)^{1/2}} \geq 0.$$
With a change of variables, we now evaluate the integral involving the parameter $r$ in expression (8): letting $t = \frac{r^2 H}{2}$ gives $\int_{r=0}^{\infty} r^3 e^{-r^2 H/2}\, dr = \frac{2}{H^2}$. Substituting back into (8) with $H = 1 - \cos\phi \sin 2\alpha$ and applying identity (4),
$$\frac{1}{4\pi D^{1/2} a_{11} a_{22}} \int_{\alpha=0}^{\pi/2} \frac{2\sin 2\alpha}{(1 - \cos\phi \sin 2\alpha)^2}\, d\alpha = \frac{1}{2\pi D^{1/2} a_{11} a_{22} \sin^3\phi}\left(\sin\phi + (\pi - \phi)\cos\phi\right).$$
Finally, since $a_{11} a_{22} - a_{12}^2 = \frac{1}{D}$,
$$2\pi D^{1/2} a_{11} a_{22} \sin^3\phi = 2\pi D^{1/2} a_{11} a_{22} (1 - \cos^2\phi)^{3/2} = 2\pi D^{1/2} a_{11} a_{22} \left(\frac{a_{11} a_{22} - a_{12}^2}{a_{11} a_{22}}\right)^{3/2} = \frac{2\pi}{D\, (a_{11} a_{22})^{1/2}} = \frac{2\pi}{\left((\sigma^2_b + \sigma^2_w \|x\|^2)(\sigma^2_b + \sigma^2_w \|y\|^2)\right)^{1/2}},$$
so the expected value of the product of post-activations at the output of the $j$th hidden node in the first hidden layer is therefore determined to be
$$\mathbb{E}[X_j(x) X_j(y)] = \frac{1}{2\pi}\, (\sigma^2_b + \sigma^2_w \|x\|^2)^{1/2} (\sigma^2_b + \sigma^2_w \|y\|^2)^{1/2}\left(\sin\phi + (\pi - \phi)\cos\phi\right),$$
with $\phi$ as defined above. To compute the expected value $\mathbb{E}[X_j(x)] = \int (b + w \cdot x)_+ f_{b^0_j, W^0_{jk}}(b, w)\, dw\, db$, we denote $U = b + w \cdot x \sim \mathcal{N}(0, \sigma^2)$, where $\sigma^2 = \sigma^2_b + \sigma^2_w \|x\|^2$, and apply the change of variables $t = \frac{u^2}{2\sigma^2}$ to obtain
$$\mathbb{E}[X_j(x)] = \int_{-\infty}^{\infty} (u)_+ f_U(u)\, du = \int_0^\infty \frac{u}{\sqrt{2\pi}\,\sigma}\, e^{-u^2/2\sigma^2}\, du = \frac{\sigma}{\sqrt{2\pi}} = \frac{(\sigma^2_b + \sigma^2_w \|x\|^2)^{1/2}}{\sqrt{2\pi}}.$$
The covariance function at the network output is therefore determined to be
$$K(x, y) = \sigma^2_b + \sigma^2_w \left(\mathbb{E}[X_j(x) X_j(y)] - \mathbb{E}[X_j(x)]\, \mathbb{E}[X_j(y)]\right) = \sigma^2_b + \frac{\sigma^2_w}{2\pi}\, (\sigma^2_b + \sigma^2_w \|x\|^2)^{1/2} (\sigma^2_b + \sigma^2_w \|y\|^2)^{1/2}\left(\sin\phi + (\pi - \phi)\cos\phi - 1\right).$$
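As a sanity check on the derivation of $\mathbb{E}[X_j(x) X_j(y)]$, the closed form can be compared against a Monte Carlo estimate. This sketch is ours, not part of the paper; note that it keeps the $1/d_{in}$ weight scaling explicit (using $\sigma^2_b + (\sigma^2_w / d_{in})\|x\|^2$ for the pre-activation variance), which the text's $\|x\|^2$ notation leaves implicit:

```python
import numpy as np

def moment_mc(x, y, s2w, s2b, n=500_000, seed=0):
    """Monte Carlo estimate of E[(b + w.x)_+ (b + w.y)_+],
    with b ~ N(0, s2b) and w_k iid N(0, s2w / d_in)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, np.sqrt(s2b), size=n)
    w = rng.normal(0.0, np.sqrt(s2w / len(x)), size=(n, len(x)))
    return np.mean(np.maximum(b + w @ x, 0.0) * np.maximum(b + w @ y, 0.0))

def moment_exact(x, y, s2w, s2b):
    """Closed form: (1/2pi) sqrt(kxx * kyy) * (sin(phi) + (pi - phi) cos(phi))."""
    d = len(x)
    kxx = s2b + s2w / d * np.dot(x, x)
    kyy = s2b + s2w / d * np.dot(y, y)
    kxy = s2b + s2w / d * np.dot(x, y)
    phi = np.arccos(np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0))
    return np.sqrt(kxx * kyy) / (2 * np.pi) * (np.sin(phi) + (np.pi - phi) * np.cos(phi))
```

The two estimates agree to within Monte Carlo error for generic inputs and hyperparameters.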



Figure 1: A single-hidden-layer, fully-connected feedforward neural network for regression prediction. Left panel: Structural diagram of the neural network. Right panel: ReLU activation function: φ(a) := (a) + = max(0, a) = a for a ≥ 0; φ(a) = 0 otherwise.

In the simulation, the hyperparameters of the ReLU covariance function are set to $(3.6, 0.02)$, chosen from $\sigma^2_w \in \{0.4, 1.2, 2.0, 2.8, 3.6\}$ and $\sigma^2_b \in \{0.0001, 0.01, 0.02\}$. We randomly select 70 training and 30 test location points from 100 values evenly spaced in the interval $[0.0, 1.0]$. Ten sample paths, as shown in the top left panel of Figure 2, are generated from the design Gaussian process model; their sample mean produces the 70 training targets and 30 test values. We then estimate the optimal hyperparameters from the training targets by evaluating the marginal likelihood, equation (2), over the design ranges of $\sigma^2_w$ and $\sigma^2_b$. The maximum marginal likelihood is obtained at $\{\hat\sigma^2_w, \hat\sigma^2_b\} = \{3.6, 0.02\}$, which is the design hyperparameter pair; the minimum is obtained at $\{3.6, 0.0001\}$. A Gaussian process model is then built with the optimal hyperparameter pair to make predictions for the 30 test location points. The model accuracy is assessed with an RMSE of 0.00051. Additionally, we overlay the predicted and true test target values, as shown in the top middle panel of Figure 2, to detect any prediction errors, and we plot the line of equality to further validate the estimated hyperparameters, as depicted in the top right panel of the figure. The evaluation process is repeated applying the minimizing hyperparameter pair $\{3.6, 0.0001\}$, which produces a prediction RMSE of 0.00188, over three times as large as the optimal case. The accuracy plots shown in the bottom panels of Figure 2 indicate some prediction errors.

Figure 2: Gaussian process regression prediction on simulated data. Top left: 10 sample paths generated from a Gaussian process model with hyperparameters (σ 2 w , σ 2 b ) = (3.6, 0.02). Top middle: Point-wise visual comparison between predicted and true target values for the optimal hyperparameter pair {3.6, 0.02}, showing good prediction results. Top right: The line of equality further confirming the prediction accuracy. Bottom left: Point-wise visual comparison for hyperparameter pair {3.6, 0.0001}. Bottom right: Prediction errors revealed with the line of equality.

Figure 3: Comparing MNIST training accuracy over various training sizes. We observe that the convergence rate based on our method approaches that using He-initialization as the training size increases. This suggests that our technique may potentially be efficient for guiding deep neural network initialization. Left: train size=1000. Middle: train size=3000. Right: train size=5000.

Under review as a conference paper at ICLR 2021

Table 1: Single-hidden-layer fully-connected neural network model prediction accuracy on the MNIST test set, and the associated hyperparameter pair.

