CRITICAL INITIALIZATION OF WIDE AND DEEP NEURAL NETWORKS THROUGH PARTIAL JACOBIANS: GENERAL THEORY AND APPLICATIONS

Abstract

Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and a quantitatively predictive description is possible. The Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as the variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality. We introduce partial Jacobians of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0 \le l$. We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections. We derive and implement a simple and cheap numerical test that allows one to select the optimal initialization for a broad class of deep neural networks, including fully connected, convolutional and attention layers. Using these tools we show quantitatively that proper stacking of LayerNorm (applied to preactivations) and residual connections leads to an architecture that is critical for any initialization. Finally, we apply our methods to analyze the MLP-Mixer architecture and show that it is everywhere critical.

1. INTRODUCTION

When the number of parameters in each layer becomes large, the functional space description of deep neural networks simplifies dramatically. The network function, f(x), in this limit, is a Gaussian process (Neal, 1996; Lee et al., 2018) with a kernel, sometimes referred to as the neural network Gaussian process (NNGP) kernel (Lee et al., 2018), determined by the network architecture and hyperparameters (e.g., depth, precise choices of layers and activation functions, as well as the distribution of weights and biases). A similar line of reasoning was developed earlier for recurrent neural networks (Molgedey et al., 1992). Furthermore, for special choices of parameterization and the MSE loss function, the training dynamics under gradient descent can be solved exactly in terms of the neural tangent kernel (NTK) (Jacot et al., 2018; Lee et al., 2019). A large body of work was devoted to the calculation of the NNGP kernel and NTK for different architectures, the calculation of finite-width corrections to these quantities, and the empirical investigation of the training dynamics of wide networks (Novak et al., 2018b; Xiao et al., 2018; Hron et al., 2020; Dyer & Gur-Ari, 2019; Andreassen & Dyer, 2020; Lewkowycz & Gur-Ari, 2020; Aitken & Gur-Ari, 2020; Geiger et al., 2020; Hanin, 2021; Roberts et al., 2022; Yaida, 2020; Shankar et al., 2020; Arora et al., 2019b;a; Lee et al., 2020; Yang et al., 2018; Yang & Hu, 2021; Yang, 2019b;a; Matthews et al., 2018; Garriga-Alonso et al., 2018; Allen-Zhu et al., 2019; Tsuchida et al., 2021; Martens et al., 2021). One important result that arose from these works is that the network architecture determines the most appropriate initialization of the weights and biases (Poole et al., 2016; Schoenholz et al., 2016; Lee et al., 2018).
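The infinite-width GP limit described above can be checked empirically. The following sketch (illustrative hyperparameters, not taken from the paper) samples the scalar output of a wide one-hidden-layer tanh network over many random initializations; over such draws the output is approximately a zero-mean Gaussian whose variance is set by the NNGP kernel.

```python
import numpy as np

def random_net_output(x, width, sigma_w=1.0, rng=None):
    # One forward pass of a randomly initialized one-hidden-layer network,
    # f(x) = W2 @ tanh(W1 @ x), with variance-preserving 1/sqrt(fan-in)
    # weight scaling (a standard choice; hyperparameters are illustrative).
    rng = rng or np.random.default_rng()
    W1 = rng.normal(0.0, sigma_w / np.sqrt(len(x)), size=(width, len(x)))
    W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=width)
    return W2 @ np.tanh(W1 @ x)

rng = np.random.default_rng(0)
x = rng.normal(size=16)  # a single fixed input
# Distribution of f(x) over random initializations at large width:
samples = np.array([random_net_output(x, width=2048, rng=rng)
                    for _ in range(1000)])
# Approximately zero mean; variance given by the NNGP kernel evaluated at x.
print(samples.mean(), samples.var())
```

Repeating this with increasing width shows the higher cumulants of the output distribution shrinking, consistent with convergence to a Gaussian process.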
To state this result, we consider networks with/without LayerNorm (Ba et al., 2016) and residual connections (He et al., 2016); the preactivations for such networks can be defined as follows

$$h^{l+1}_i(x) = \sum_{j=1}^{N_l} w^{l+1}_{ij}\,\phi\big(\bar h^l_j(x)\big) + b^{l+1}_i + \mu\, h^l_i(x), \qquad (1)$$

where $\bar h^l_j$ denotes the preactivation after LayerNorm (or simply $h^l_j$ when LayerNorm is absent) and $\mu$ sets the strength of the residual connection ($\mu = 0$ for a plain fully connected network).
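The recurrence in Eq. (1) can be iterated directly. Below is a minimal numpy sketch of this forward pass, assuming the common initialization $w^l_{ij} \sim \mathcal{N}(0, \sigma_w^2/N_l)$ and $b^l_i \sim \mathcal{N}(0, \sigma_b^2)$; the specific values of sigma_w, sigma_b, depth and width are illustrative, not prescribed by the text above.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # LayerNorm over the neuron axis (no learnable scale/shift, for simplicity).
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def forward(x, depth=10, width=512, sigma_w=2.0, sigma_b=0.1, mu=1.0,
            use_layernorm=True, phi=np.tanh, seed=0):
    """Iterate Eq. (1): h^{l+1} = W^{l+1} phi(hbar^l) + b^{l+1} + mu * h^l,
    with weights N(0, sigma_w^2 / N_l) and biases N(0, sigma_b^2)."""
    rng = np.random.default_rng(seed)
    h = x  # shape (batch, N_0)
    for _ in range(depth):
        hbar = layer_norm(h) if use_layernorm else h
        n_in = h.shape[-1]
        W = rng.normal(0.0, np.sqrt(sigma_w**2 / n_in), size=(width, n_in))
        b = rng.normal(0.0, sigma_b, size=width)
        h_new = phi(hbar) @ W.T + b
        if n_in == width:
            h_new = h_new + mu * h  # residual branch needs matching widths
        h = h_new
    return h
```

Setting `mu=0.0` and `use_layernorm=False` recovers a plain fully connected network, so the same routine covers all four combinations of LayerNorm and residual connections discussed in the text.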

