GUIDING NEURAL NETWORK INITIALIZATION VIA MARGINAL LIKELIHOOD MAXIMIZATION

Abstract

We propose a simple approach to help guide hyperparameter selection for neural network initialization. We leverage the correspondence between neural networks and Gaussian process models with matching activation and covariance functions to infer hyperparameter values desirable for model initialization. Our experiments show that marginal likelihood maximization provides recommendations that yield near-optimal prediction performance on an MNIST classification task under our experimental constraints. Furthermore, our empirical results indicate that the recommendations are consistent across training-set sizes, suggesting that the computational cost of the procedure could be reduced significantly by using smaller training sets.

1. INTRODUCTION

Training deep neural networks successfully can be challenging, but with proper initialization trained models can achieve better prediction performance. Initialization strategies for neural networks have been discussed extensively in the literature. Glorot and Bengio (2010) focused on the linear case and proposed the normalized initialization scheme (also known as Xavier initialization), derived by considering activation variances in the forward pass and gradient variances in back-propagation. He initialization (He et al., 2015) was developed for very deep networks with rectifier nonlinearities; it imposes a condition on the weight variances to control the variation of the input magnitudes across layers. Owing to its success, He initialization has become the de facto choice for deep ReLU networks. While the Glorot and He schemes recognize the importance of the hidden layer widths and make use of them in their formulations, other methods have also been suggested to improve training in deep networks. Mishkin and Matas (2016) demonstrated that pre-initialization with orthonormal matrices followed by output variance normalization produces prediction performance comparable to, if not better than, standard techniques. Additionally, Schoenholz et al. (2017) derived a bound on the trainable network depth based on the 'edge of chaos' principle for a given set of initialization hyperparameters. Furthermore, Hayou et al. (2019) showed, both theoretically and in practice, that proper tuning of the initialization parameters together with an appropriate activation function is important for training models to improved performance.

Neal (1996) showed that, for a fully-connected, single-hidden-layer feedforward network with independent Gaussian priors over its weights and biases, the prior distribution over the network outputs converges to a Gaussian process as the hidden layer becomes infinitely wide. In other words, the untrained infinite network and its induced Gaussian process counterpart are equivalent. Moreover, as a consequence of the central limit theorem, the covariance between the network outputs evaluated at different inputs can be expressed in terms of the hidden-node activation function. Intuitively, we can therefore relate the prediction performance of an untrained, finite-width, single-hidden-layer, fully-connected feedforward neural network to that of a Gaussian process model whose covariance function corresponds to the network's activation function.

In this work we propose a simple and efficient method that learns from training data to guide the selection of initialization hyperparameters in neural networks. The marginal likelihood is a popular tool for choosing kernel hyperparameters in model selection; its applications to convolutional Gaussian processes and deep kernel learning are discussed in van der Wilk et al. (2017) and Wilson et al. (2016), respectively. Our method aims to combine this powerful functionality of the marginal likelihood with the relationship between untrained neural networks and Gaussian process models to make recommendations for neural network initialization. We first derive the covariance function corresponding to the activation function of the network whose prediction performance we wish to evaluate. We then employ marginal likelihood optimization for the Gaussian process model to learn hyperparameters from the data. We hypothesize that the resulting optimal hyperparameter values can improve the initialization of the neural network.
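As a quick illustration of the correspondence underlying our approach, the following sketch (ours, for exposition only; NumPy assumed) draws many random single-hidden-layer ReLU networks and compares the empirical covariance of their outputs at two inputs against the analytic counterpart kernel, taken here in the arc-cosine form of Cho and Saul (2009) with hidden biases omitted and unit output-weight variance for simplicity:

```python
import numpy as np

def relu_kernel(x1, x2, sigma_w2=1.0, sigma_b2=0.0):
    """Analytic covariance induced by a ReLU hidden layer (arc-cosine kernel,
    Cho & Saul, 2009). Weight variance is scaled by the input dimension."""
    d = x1.shape[0]
    k11 = sigma_b2 + sigma_w2 * (x1 @ x1) / d
    k22 = sigma_b2 + sigma_w2 * (x2 @ x2) / d
    k12 = sigma_b2 + sigma_w2 * (x1 @ x2) / d
    theta = np.arccos(np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0))
    return np.sqrt(k11 * k22) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def empirical_cov(x1, x2, width=2000, n_nets=4000, sigma_w2=1.0):
    """Monte Carlo estimate of Cov[f(x1), f(x2)] over random untrained networks."""
    d = x1.shape[0]
    f1, f2 = np.empty(n_nets), np.empty(n_nets)
    for i in range(n_nets):
        W = np.random.randn(width, d) * np.sqrt(sigma_w2 / d)  # input-to-hidden weights
        v = np.random.randn(width) * np.sqrt(1.0 / width)      # hidden-to-output weights
        f1[i] = v @ np.maximum(W @ x1, 0.0)
        f2[i] = v @ np.maximum(W @ x2, 0.0)
    return np.mean(f1 * f2)  # outputs have zero mean under the prior

x1, x2 = np.random.randn(2, 16)
print(empirical_cov(x1, x2), relu_kernel(x1, x2))  # should approximately agree
```

At width 2000 the two quantities typically agree to within Monte Carlo error, consistent with the infinite-width limit.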

2. APPROACH

To assess the proposed method, we build a neural network and a Gaussian process model with corresponding activation and covariance functions. With the Gaussian process we estimate the covariance hyperparameters from the training data. These hyperparameter values are then applied to the neural network, whose prediction accuracy we evaluate and compare across various hyperparameter sets. We first describe the structure of the neural network, followed by the Gaussian process model and the rationale for employing the marginal likelihood. Then, given the network activation function, we derive a closed-form representation of its counterpart covariance function.
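To make the hyperparameter-learning step concrete, the sketch below (a minimal NumPy/SciPy illustration under our assumptions, not the exact code used in our experiments) minimizes the negative Gaussian process log marginal likelihood, -log p(y|X) = (1/2) yᵀK_σ⁻¹y + (1/2) log|K_σ| + (n/2) log 2π with K_σ = K + σ_n²I, over the weight variance, bias variance, and noise variance. The kernel is the ReLU counterpart covariance in arc-cosine form, a single regression output is used, and the data are synthetic stand-ins:

```python
import numpy as np
from scipy.optimize import minimize

def relu_kernel_matrix(X, sigma_w2, sigma_b2):
    """Covariance matrix induced by a single ReLU hidden layer (arc-cosine form)."""
    d = X.shape[1]
    K0 = sigma_b2 + sigma_w2 * (X @ X.T) / d            # pre-activation covariance
    diag = np.sqrt(np.diag(K0))
    cos_t = np.clip(K0 / np.outer(diag, diag), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return np.outer(diag, diag) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def neg_log_marginal_likelihood(log_params, X, y):
    sigma_w2, sigma_b2, noise = np.exp(log_params)      # log-parameterize for positivity
    K = relu_kernel_matrix(X, sigma_w2, sigma_b2) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log|K| = 2 * sum(log(diag(L))), so half of it is sum(log(diag(L)))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

# Toy data standing in for a (subsampled) training set; y is one regression target.
X = np.random.randn(200, 784)
y = np.random.randn(200)
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y),
               method="L-BFGS-B")
sigma_w2, sigma_b2, noise = np.exp(res.x)
print("recommended initialization variances:", sigma_w2, sigma_b2)
```

The optimized weight and bias variances are the quantities we carry over to the network initialization in the next subsection.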

2.1. SINGLE-HIDDEN-LAYER NEURAL NETWORKS

Our neural network model is a fully-connected, single-hidden-layer feedforward network with 2000 hidden nodes and the rectified linear unit (ReLU) activation function. Following Lee et al. (2018), we conduct our empirical study by treating MNIST image classification as a regression problem. Since the network is designed for regression, we choose the mean squared error (MSE) loss as its objective function, along with the Adam optimizer, and use classification accuracy as the performance metric. In addition, one-hot encoding is used to generate class labels, where an incorrect class is designated -0.1 and the correct class 0.9. For example, the one-hot representation of the digit 7 is [-0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, 0.9, -0.1, -0.1].
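For illustration, one way to realize this setup is sketched below (assuming PyTorch; the variance-scaled Gaussian initialization is our assumed parameterization, and the default variance values are hypothetical placeholders to be replaced by the GP-recommended ones):

```python
import torch
import torch.nn as nn

def one_hot_regression_targets(labels, num_classes=10):
    """One-hot targets for regression: 0.9 for the correct class, -0.1 otherwise."""
    t = torch.full((len(labels), num_classes), -0.1)
    t[torch.arange(len(labels)), labels] = 0.9
    return t

class SingleHiddenLayerNet(nn.Module):
    """Fully-connected, single-hidden-layer ReLU network trained as a regressor."""
    def __init__(self, d_in=784, width=2000, d_out=10, sigma_w2=2.0, sigma_b2=0.01):
        super().__init__()
        self.hidden = nn.Linear(d_in, width)
        self.out = nn.Linear(width, d_out)
        # Placeholder variances; in our method these come from the GP step.
        nn.init.normal_(self.hidden.weight, std=(sigma_w2 / d_in) ** 0.5)
        nn.init.normal_(self.hidden.bias, std=sigma_b2 ** 0.5)
        nn.init.normal_(self.out.weight, std=(sigma_w2 / width) ** 0.5)
        nn.init.normal_(self.out.bias, std=sigma_b2 ** 0.5)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

net = SingleHiddenLayerNet()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters())
print(one_hot_regression_targets(torch.tensor([7])))  # 0.9 at index 7, -0.1 elsewhere
```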



Figure 1: A single-hidden-layer, fully-connected feedforward neural network for regression prediction. Left panel: structural diagram of the neural network. Right panel: ReLU activation function φ(a) := (a)_+ = max(0, a), i.e., φ(a) = a for a ≥ 0 and φ(a) = 0 otherwise.
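For reference, the covariance function corresponding to this activation is known in closed form as the arc-cosine kernel (Cho and Saul, 2009; see also Lee et al., 2018); under the pre-activation covariance parameterization we assume here, it reads:

```latex
% Pre-activation covariance under independent Gaussian priors (assumed parameterization):
%   k(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{d}\, x^\top x'
% ReLU counterpart covariance (arc-cosine kernel, Cho & Saul, 2009):
K(x, x') = \frac{\sqrt{k(x, x)\, k(x', x')}}{2\pi}
           \Bigl( \sin\theta + (\pi - \theta)\cos\theta \Bigr),
\qquad
\theta = \arccos\!\Biggl( \frac{k(x, x')}{\sqrt{k(x, x)\, k(x', x')}} \Biggr)
```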

