GUIDING NEURAL NETWORK INITIALIZATION VIA MARGINAL LIKELIHOOD MAXIMIZATION

Abstract

We propose a simple approach to guide hyperparameter selection for neural network initialization. We leverage the correspondence between neural network and Gaussian process models with matching activation and covariance functions to infer hyperparameter values desirable for model initialization. Our experiments show that marginal likelihood maximization provides recommendations that yield near-optimal prediction performance on the MNIST classification task under our experimental constraints. Furthermore, our empirical results indicate that the proposed technique is consistent across training-set sizes, suggesting that its computational cost could be significantly reduced by using smaller training sets.

1. INTRODUCTION

Training deep neural networks successfully can be challenging; with proper initialization, however, trained models can achieve improved prediction performance. Various initialization strategies for neural networks have been discussed extensively in the research literature. Glorot and Bengio (2010) focused on the linear case and proposed the normalized initialization scheme (also known as Xavier initialization), derived by considering the activation variances in the forward pass and the gradient variances in back-propagation. He initialization (He et al., 2015) was developed for very deep networks with rectifier nonlinearities; it imposes a condition on the weight variances to control the variation of the input magnitudes across layers. Because of its success, He initialization has become the de facto choice for deep ReLU networks. While the Glorot and He initialization schemes recognize the importance of the hidden-layer widths and make use of them in their formulations, other methods have also been suggested to improve training in deep neural networks. Mishkin and Matas (2016) demonstrated that pre-initialization with orthonormal matrices followed by output-variance normalization produces prediction performance comparable to, if not better than, standard techniques. Additionally, Schoenholz et al. (2017) developed a bound on the network depth, based on the 'edge of chaos' principle, for a given set of initialization hyperparameters. Furthermore, Hayou et al. (2019) showed, both theoretically and empirically, that proper tuning of the initialization parameters together with an appropriate activation function is important for training models to improved performance. Neal (1996) showed that, as a fully-connected, single-hidden-layer feedforward untrained neural network becomes infinitely wide, Gaussian prior distributions over the network weights and biases induce a distribution over network outputs that converges to a Gaussian process, under the assumption that the parameters are independent.
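To make the variance conditions concrete, the following sketch implements the two standard schemes mentioned above: Glorot initialization draws weights with variance 2/(fan_in + fan_out), while He initialization uses variance 2/fan_in. The function names and the uniform/Gaussian choices are one common convention, not the paper's own code.

```python
import numpy as np

def glorot_init(fan_in, fan_out, rng):
    """Normalized (Xavier) initialization (Glorot & Bengio, 2010):
    uniform weights with variance 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))  # Var[U(-a, a)] = a^2 / 3
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    """He initialization (He et al., 2015) for ReLU networks:
    zero-mean Gaussian weights with variance 2 / fan_in."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = he_init(784, 256, rng)   # e.g. MNIST input layer to 256 hidden units
print(W.var())               # empirical variance close to 2/784
```

Both rules are functions of the layer widths only; the method proposed in this work instead uses the training data, via the marginal likelihood, to inform these variance choices.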
In other words, the untrained infinite-width neural network and its induced Gaussian process counterpart are equivalent. Moreover, as a result of the central limit theorem, the covariance between network outputs evaluated at different inputs can be expressed as a function of the hidden-node activation function. Intuitively, we can therefore relate the prediction performance of an untrained, finite-width, single-hidden-layer, fully-connected feedforward neural network to that of a Gaussian process model whose covariance function corresponds to the network's activation function. In this work, we propose a simple and efficient method that learns from training data to guide the selection of initialization hyperparameters in neural networks. The marginal likelihood is a popular tool for choosing kernel hyperparameters in model selection; its applications to convolutional Gaussian processes and deep kernel learning are discussed in van der Wilk et al. (2017) and Wilson et al. (2016), respectively. Our method aims to synergize this powerful functionality of marginal likelihood and
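The selection procedure implied by this correspondence can be sketched as follows, under assumptions not fixed by the text above: a ReLU activation (whose infinite-width covariance is the degree-1 arc-cosine kernel of Cho and Saul), a fixed bias variance and noise level, and a simple grid search in place of gradient-based optimization. The toy data and all variable names are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def relu_kernel(X1, X2, sigma_b2, sigma_w2):
    """Covariance of an infinite-width one-hidden-layer ReLU network
    (degree-1 arc-cosine kernel); sigma_b2 and sigma_w2 are the bias and
    weight prior variances, i.e. the initialization hyperparameters."""
    d = X1.shape[1]
    # input-layer (pre-activation) covariances
    k12 = sigma_b2 + sigma_w2 * X1 @ X2.T / d
    k11 = sigma_b2 + sigma_w2 * np.sum(X1**2, axis=1) / d
    k22 = sigma_b2 + sigma_w2 * np.sum(X2**2, axis=1) / d
    norm = np.sqrt(np.outer(k11, k22))
    theta = np.arccos(np.clip(k12 / norm, -1.0, 1.0))
    J = np.sin(theta) + (np.pi - theta) * np.cos(theta)
    return sigma_b2 + sigma_w2 / (2.0 * np.pi) * norm * J

def log_marginal_likelihood(X, y, sigma_b2, sigma_w2, noise=1e-2):
    """GP log marginal likelihood log p(y | X, sigma_b2, sigma_w2)."""
    K = relu_kernel(X, X, sigma_b2, sigma_w2) + noise * np.eye(len(y))
    L, low = cho_factor(K, lower=True)
    alpha = cho_solve((L, low), y)
    # -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2 pi)
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))

# toy data standing in for a (binarized) classification training set
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = np.sin(X[:, 0])

# recommend the weight variance that maximizes the marginal likelihood
grid = [0.5, 1.0, 2.0, 4.0]
best = max((log_marginal_likelihood(X, y, 0.1, s), s) for s in grid)
print("recommended sigma_w^2:", best[1])
```

The recommended variance would then be used to initialize the corresponding finite-width network; the observation in the abstract that the procedure is consistent across training-set sizes suggests the grid search could be run on a small subsample.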

