CRITICAL INITIALIZATION OF WIDE AND DEEP NEURAL NETWORKS THROUGH PARTIAL JACOBIANS: GENERAL THEORY AND APPLICATIONS

Abstract

Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and a quantitatively predictive description is possible. The Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as the variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new practical way to diagnose criticality. We introduce partial Jacobians of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0 \le l$. We derive recurrence relations for the norms of partial Jacobians and use these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections. We derive and implement a simple and cheap numerical test that allows one to select the optimal initialization for a broad class of deep neural networks, including those with fully connected, convolutional and attention layers. Using these tools we show quantitatively that proper stacking of LayerNorm (applied to preactivations) and residual connections leads to an architecture that is critical for any initialization. Finally, we apply our methods to analyze the MLP-Mixer architecture and show that it is everywhere critical.

1. INTRODUCTION

When the number of parameters in each layer becomes large, the functional space description of deep neural networks simplifies dramatically. In this limit the network function, $f(x)$, is a Gaussian process (Neal, 1996; Lee et al., 2018) with a kernel, sometimes referred to as the neural network Gaussian process (NNGP) kernel (Lee et al., 2018), that is determined by the network architecture and hyperparameters (e.g. depth, precise choices of layers and activation functions, as well as the distribution of weights and biases). A similar line of reasoning was developed earlier for recurrent neural networks (Molgedey et al., 1992). Furthermore, for special choices of parameterization and MSE loss function, the training dynamics under gradient descent can be solved exactly in terms of the neural tangent kernel (NTK) (Jacot et al., 2018; Lee et al., 2019). A large body of work has been devoted to the calculation of the NNGP kernel and NTK for different architectures, the calculation of finite width corrections to these quantities, and the empirical investigation of the training dynamics of wide networks (Novak et al., 2018b; Xiao et al., 2018; Hron et al., 2020; Dyer & Gur-Ari, 2019; Andreassen & Dyer, 2020; Lewkowycz & Gur-Ari, 2020; Aitken & Gur-Ari, 2020; Geiger et al., 2020; Hanin, 2021; Roberts et al., 2022; Yaida, 2020; Shankar et al., 2020; Arora et al., 2019b;a; Lee et al., 2020; Yang et al., 2018; Yang & Hu, 2021; Yang, 2019b;a; Matthews et al., 2018; Garriga-Alonso et al., 2018; Allen-Zhu et al., 2019; Tsuchida et al., 2021; Martens et al., 2021). One important result that arose from these works is that the network architecture determines the most appropriate initialization of the weights and biases (Poole et al., 2016; Schoenholz et al., 2016; Lee et al., 2018).
To state this result, we consider networks with or without LayerNorm (Ba et al., 2016) and residual connections (He et al., 2016), whose preactivations are defined as

$$h^{l+1}_i(x) = \sum_{j=1}^{N_l} w^{l+1}_{ij}\,\phi\big(\tilde h^l_j(x)\big) + b^{l+1}_i + \mu\, h^l_i(x), \tag{1}$$

where $\tilde h^l_j = \mathrm{LayerNorm}(h^l_j)$ and the parameter $\mu$ controls the strength of residual connections. For the input layer, $h^1_i(x) = \sum_{j=1}^{N_0} w^1_{ij} x_j + b^1_i$. In the $(l+1)$-th layer, weights $w^{l+1} \in \mathbb{R}^{N_{l+1}\times N_l}$ and biases $b^{l+1} \in \mathbb{R}^{N_{l+1}\times 1}$ are drawn from the normal distributions $N(0, \sigma_w^2/N_l)$ and $N(0, \sigma_b^2)$, respectively. The hyperparameters $\sigma_w$ and $\sigma_b$ need to be tuned. Here $\phi(\cdot)$ is the activation function and $x \in \mathbb{R}^{N_0\times 1}$ is the input. For the results discussed in this work, $x$ can be sampled from either a realistic (i.e. highly correlated) dataset or a high entropy distribution. For a network of depth $L$, the network function is given by $f(x) = h^L(x)$.

Different network architectures and activation functions, $\phi$, lead to different "optimal" choices of $(\sigma_w, \sigma_b)$. The optimal choice can be understood, using the language of statistical mechanics, as a critical point (or manifold) in the $\sigma_b$-$\sigma_w$ plane. The notion of criticality becomes sharp as the network depth, $L$, becomes large. Criticality ensures that both the NNGP kernel and the norm of gradients remain $O(L^0)$ as the network gets deeper (Roberts et al., 2022). Very deep networks will not train unless initialized critically, since the gradients otherwise explode or vanish exponentially. Note, however, that high trainability does not imply that the trained model achieves high performance (test accuracy) after training.
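The forward pass defined by (1) can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation; the function names `layer_norm` and `forward`, the default `tanh` activation, and the small `eps` regularizer are assumptions introduced here for illustration only.

```python
import math
import random


def layer_norm(h, eps=1e-12):
    """Shift and rescale a preactivation vector to zero mean and unit variance."""
    n = len(h)
    mean = sum(h) / n
    var = sum((v - mean) ** 2 for v in h) / n
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def forward(x, widths, sigma_w, sigma_b, mu=0.0, use_layernorm=False,
            phi=math.tanh, rng=random):
    """Return the preactivations [h^1, ..., h^L] of Eq. (1) for one random draw.

    widths = [N_1, ..., N_L]; weights ~ N(0, sigma_w^2 / fan_in),
    biases ~ N(0, sigma_b^2). Hypothetical helper, for illustration only.
    """
    def dense(inputs, n_out):
        std_w = sigma_w / math.sqrt(len(inputs))
        return [sum(rng.gauss(0.0, std_w) * v for v in inputs)
                + rng.gauss(0.0, sigma_b) for _ in range(n_out)]

    h = dense(x, widths[0])  # input layer: no activation, no residual term
    hs = [h]
    for n_out in widths[1:]:
        if mu != 0.0 and n_out != len(h):
            raise ValueError("the residual term mu * h^l requires equal widths")
        h_tilde = layer_norm(h) if use_layernorm else h
        new_h = dense([phi(v) for v in h_tilde], n_out)
        if mu != 0.0:
            new_h = [a + mu * b for a, b in zip(new_h, h)]
        h = new_h
        hs.append(h)
    return hs
```

Calling `forward(x, widths=[64] * 5, sigma_w=1.5, sigma_b=0.5, mu=1.0, use_layernorm=True)` on a 32-dimensional input returns five preactivation vectors of width 64. Weights are resampled on every call, so averaging over calls plays the role of $E_\theta$.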

1.1. RESULTS

Here we focus on two main results of this work: (i) an empirical method to check the criticality of a neural network and (ii) an architecture based on layer normalization and residual connections that is critical for any initialization. First we introduce the notion of a partial Jacobian.

Definition 1.1. Let $h^l_i(x)$ be preactivations of a neural network $f(x)$. The partial Jacobian $J^{l_0,l}_{ij}$ is defined as the derivative of preactivations at layer $l$ with respect to preactivations at layer $l_0 \le l$:

$$J^{l_0,l}_{ij}(x) = \frac{\partial h^l_j(x)}{\partial h^{l_0}_i(x)}. \tag{2}$$

The partial Jacobian is a random matrix with vanishing mean at initialization. We introduce a deterministic measure of the magnitude of $J^{l_0,l}_{ij}$: its squared Frobenius norm, averaged over parameter initializations.

Definition 1.2. Let $J^{l_0,l}_{ij}$ be a partial Jacobian of a neural network $f(x)$. The averaged partial Jacobian norm (APJN) is defined as

$$\mathcal{J}^{l_0,l}(x) \equiv E_\theta\left[\frac{1}{N_l}\sum_{j=1}^{N_l}\sum_{i=1}^{N_{l_0}}\left(\frac{\partial h^l_j(x)}{\partial h^{l_0}_i(x)}\right)^2\right], \tag{3}$$

where $E_\theta$ indicates averaging over parameter initializations.

In what follows, we show that criticality, studied previously in the literature, occurs when the APJN either remains finite or varies algebraically as $l$ becomes large. To prove this we derive the recurrence relation for $\mathcal{J}^{l_0,l}(x)$ in the limit $N_l \to \infty$ and analyze it at large depth. Algebraic behaviour of the APJN with depth is characterized by an architecture-dependent critical exponent, $\zeta$, so that $\mathcal{J}^{l_0,l}(x) \approx l^{-\zeta}$. Such behaviour is familiar from statistical mechanics when a system is tuned to a critical point (Cardy, 1996). Away from criticality, there are two phases: ordered and chaotic. In the ordered phase the APJN vanishes exponentially with depth, whereas in the chaotic phase the APJN grows exponentially:

$$\mathcal{J}^{l_0,l} \approx c_{l_0}\, e^{\pm l/\xi}. \tag{4}$$

Here $\xi$ is the correlation length. It characterizes how fast gradients explode or vanish.

Theorem 1.3 (Main result). Let $f(x)$ be a deep MLP network with Lipschitz continuous activation $\phi(\cdot)$. Assume that LayerNorm is applied to preactivations and that there are residual connections with strength $\mu$ acting according to (1). In the limit $N_l \to \infty$ the correlation length is bounded from below for $\sigma_b^2 < \infty$:

$$\xi \ge \frac{1}{\left|\log\left[(1-\mu^2)\frac{A}{B} + \mu^2\right]\right|}, \tag{5}$$

where the non-negative constants $A$ and $B$ are given by

$$A = E_\theta\left[\frac{1}{N_l}\sum_{k=1}^{N_l}\phi'(\tilde h^l_k)^2\right], \qquad B = E_\theta\left[\frac{1}{N_l}\sum_{k=1}^{N_l}\phi(\tilde h^l_k)^2\right], \tag{6}$$

and $\phi'(\cdot)$ is the derivative of $\phi(\cdot)$. When $\mu = 1$, the correlation length diverges and the network is critical for any initialization, with $\zeta = O(1)$.

In practice, Theorem 1.3 means that different choices of initialization have no effect on the trainability of the network, provided that LayerNorm and residual connections are arranged as stated.

1.2. RELATED WORK

Some of our results were either surmised or obtained in a different form in the literature. We find that LayerNorm ensures that the NNGP kernel remains finite at any depth, as suggested in the original work of Ba et al. (2016). LayerNorm also alters the criticality of $\mathcal{J}^{l_0,l}(x)$. It was noted in Xu et al. (2019) that LayerNorm (applied to preactivations) regularizes the backward pass. We formalize this observation by showing that LayerNorm (applied to preactivations) dramatically enhances the correlation length (which is not the case for LayerNorm applied to activations). This can be seen from Theorem 1.3 by setting $\mu = 0$. When residual connections of strength 1 are combined with erf (or any other erf-like activation function, e.g. tanh), the neural network enters a subcritical phase with enhanced correlation length (see Theorem 4.3). A version of this result was discussed in Yang & Schoenholz (2017). When residual connections are introduced on top of LayerNorm, the correlation length $\xi$ is further increased. If the residual connections have strength 1, the network enters a critical phase for any initialization. The importance of the correct ordering of LayerNorm, residual connections and attention layers was discussed in Xiong et al. (2020). Several architectures with the same order of GroupNorm and residual connections were investigated in Yu et al. (2021).

The partial Jacobian has been used to study generalization bounds in Arora et al. (2018). The Jacobian norm (i.e. $\|J^{0,l}_{ij}\|^2$) of trained feed-forward neural networks was studied in Novak et al. (2018a), where it was correlated with generalization. Partial Jacobians with $l_0 = l - 1$ were studied in the context of RNNs (Chen et al., 2018; Can et al., 2020), where they were referred to as state-to-state Jacobians. As the aspect ratio ($L/N$) of the network approaches 1, the finite width corrections to the Jacobian become more prominent. On the other hand, even at small aspect ratio, the effect of the spectral density of the Jacobian becomes important as the depth $L$ becomes very large. Pennington et al. (2018) study the spectrum of the input-output Jacobian for MLPs. Xiao et al. (2018) extend the analysis to CNNs, showing that very deep vanilla CNNs can be trained by achieving "dynamical isometry".

2. RECURRENCE RELATIONS

Here we derive the infinite width recurrence relations for the APJN and the norm of preactivations. We use (1) in its simplest form, which has no LayerNorm and $\mu = 0$.

Definition 2.1. We define the averaged covariance of preactivations as

$$K^l(x, x') = E_\theta\left[\frac{1}{N_l}\sum_{i=1}^{N_l} h^l_i(x)\, h^l_i(x')\right]. \tag{7}$$

Lemma 2.2. When $N_l \to \infty$ for $l = 1, \dots, L-1$, the expectation value over parameter initializations of a general function of preactivations, $O(h^l(x))$, can be expressed as an average over the Gaussian process $h^l(x)$ with covariance $K^l(x, x')$:

$$E_\theta\left[O(h^l_i(x))\right] = \frac{1}{\sqrt{2\pi K^l(x,x)}}\int dh^l_i\; O(h^l_i(x))\, e^{-\frac{(h^l_i(x))^2}{2K^l(x,x)}}. \tag{8}$$

This result has been established in Lee et al. (2018). Note that the density in (8) only depends on the diagonal part of the covariance matrix, $K^l(x, x)$. We will refer to $K^l(x, x)$ as the NNGP kernel.

Remark 2.3. In the infinite width limit the means appearing in (10)-(12) are self-averaging and, therefore, deterministic. They converge in distribution to their averages over parameter initializations:

$$\frac{1}{N_l}\sum_{i=1}^{N_l}\phi(h^l_i)^2 \;\xrightarrow{N_l \to \infty}\; E_\theta\left[\frac{1}{N_l}\sum_{i=1}^{N_l}\phi(h^l_i)^2\right] = E_\theta\left[\phi(h^l_i)^2\right]. \tag{9}$$

When performing analytic calculations we use the infinite width convention, whereas in our finite-width experiments we explicitly average over initializations of $\theta^l$.

Theorem 2.4. With Lemma 2.2, in the infinite width limit, the NNGP kernel $K^{l+1}(x, x)$ is deterministic and can be determined recursively via

$$K^{l+1}(x, x) = \sigma_w^2\, E_\theta\left[\frac{1}{N_l}\sum_{i=1}^{N_l}\phi(h^l_i(x))^2\right] + \sigma_b^2. \tag{10}$$

Theorem 2.5. Let $f(x)$ be an MLP network with a Lipschitz continuous activation function $\phi(x)$. In the infinite width limit, the APJN $\mathcal{J}^{l_0,l+1}(x)$ is deterministic and satisfies the recurrence relation

$$\mathcal{J}^{l_0,l+1}(x) = \chi^l_J\, \mathcal{J}^{l_0,l}(x), \tag{11}$$

where the factor $\chi^l_J$ is given by

$$\chi^l_J = \sigma_w^2\, E_\theta\left[\frac{1}{N_l}\sum_{i=1}^{N_l}\phi'(h^l_i(x))^2\right]. \tag{12}$$

Theorem 2.4 is due to Lee et al. (2018). Theorem 2.5 is new and is valid only in the limit of infinite width. The proof is in Appendix B.
We will drop the explicit dependence on $x$ to improve readability. The expectation values that appear in (10)-(12) are evaluated using (8). When the integrals can be taken analytically, they lead to explicit equations for the critical lines and/or the critical points. Details of these calculations, as well as the derivation of (10)-(12), can be found in the Appendix. A subtlety emerges in (11) when $l_0 = 0$, where a correction of order $O(N_0^{-1})$ arises for non-scale-invariant activation functions. This subtlety is discussed in Appendix B.

When the depth of the network becomes large, the $l$-dependence of the expectation values that appear in (7), (12) saturates to a (possibly infinite) constant value, which means that $K^l$, $\mathcal{J}^{l_0,l}$ and $\chi^l_J$ have reached a fixed point. We denote the corresponding quantities as $K^\star$, $\mathcal{J}^{l_0,\star}$, $\chi^\star_J$. The existence of a fixed point is not obvious and should be checked on a case-by-case basis. Fixed point analysis for $K^l$ was done in Poole et al. (2016) for bounded activation functions and in Roberts et al. (2022) for the general case. The stability is formulated in terms of

$$\chi^\star_K = \left.\frac{\partial K^{l+1}}{\partial K^l}\right|_{K^l = K^\star}. \tag{13}$$

The norm of preactivations remains finite (or behaves algebraically) when $\chi^\star_K = 1$.

Eq. (11) expresses $\mathcal{J}^{l_0,l+1}$ as a linear function of $\mathcal{J}^{l_0,l}$. The behaviour of $\mathcal{J}^{l_0,l+1}$ at large $l$ is determined by $\chi^l_J$. When $\chi^l_J > 1$ partial Jacobians diverge exponentially, while for $\chi^l_J < 1$ partial Jacobians vanish exponentially. Neural networks are trainable only up to a certain depth when initialized $O(1)$ away from criticality, which is determined by the equation

$$\chi^\star_J = 1. \tag{14}$$

Eq. (14) is an implicit equation for $\sigma_b$, $\sigma_w$ and generally yields a critical line in the $\sigma_b$-$\sigma_w$ plane. The parameter $\chi^\star_J$ has to be calculated on a case-by-case basis using either (12) or the method presented in the next section. Everywhere on the critical line, $\mathcal{J}^{l_0,l}$ saturates to a constant or behaves algebraically.

When the condition $\chi^\star_K = 1$ is added, we are left with a critical point. This analysis of criticality at infinite width agrees with Roberts et al. (2022), where $\chi_\perp$ is to be identified with $\chi^\star_J$, and with Schoenholz et al. (2016) and Martens et al. (2021), whose analysis, based on the equivalent $\chi_1$ or $C'(1)$, only works for bounded activation functions. In particular, condition (14) together with $\chi^\star_K = 1$ ensures that the NTK is $O(1)$ at initialization.
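The fixed-point analysis of (10) and (12) can be explored numerically. The sketch below evaluates the Gaussian averages of (8) with a simple trapezoidal rule and iterates the kernel recursion; it is an illustration under stated assumptions, not the authors' code. The names `gauss_expect` and `iterate_kernel`, the grid size, and the cutoff are hypothetical choices made here.

```python
import math


def gauss_expect(f, variance, n=2001, cutoff=8.0):
    """E[f(h)] for h ~ N(0, variance), via the trapezoidal rule in z = h / sqrt(variance)."""
    s = math.sqrt(variance)
    dz = 2.0 * cutoff / (n - 1)
    total = 0.0
    for i in range(n):
        z = -cutoff + i * dz
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * f(s * z) * math.exp(-0.5 * z * z)
    return total * dz / math.sqrt(2.0 * math.pi)


def iterate_kernel(phi, dphi, sigma_w, sigma_b, k0=1.0, depth=200):
    """Apply Eq. (10) `depth` times, then evaluate chi_J of Eq. (12) at the final K."""
    k = k0
    for _ in range(depth):
        k = sigma_w ** 2 * gauss_expect(lambda h: phi(h) ** 2, k) + sigma_b ** 2
    chi = sigma_w ** 2 * gauss_expect(lambda h: dphi(h) ** 2, k)
    return k, chi


def relu(h):
    return h if h > 0.0 else 0.0


def relu_prime(h):
    return 1.0 if h > 0.0 else 0.0


def tanh_prime(h):
    return 1.0 - math.tanh(h) ** 2


# ReLU at (sigma_w, sigma_b) = (sqrt(2), 0): chi_J should be close to 1.
k_relu, chi_relu = iterate_kernel(relu, relu_prime, math.sqrt(2.0), 0.0)
# tanh with sigma_b = 0: small sigma_w gives chi_J < 1, large sigma_w gives chi_J > 1.
_, chi_ordered = iterate_kernel(math.tanh, tanh_prime, 0.5, 0.0)
_, chi_chaotic = iterate_kernel(math.tanh, tanh_prime, 2.0, 0.0)
```

Scanning `sigma_w` and `sigma_b` over a grid and recording where the returned `chi` crosses 1 gives a numerical approximation of the critical line defined by (14). The step-function derivative of ReLU limits the quadrature accuracy to roughly the grid spacing.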

2.1. EMPIRICAL DIAGNOSTIC OF CRITICALITY

The APJN $\mathcal{J}^{l_0,l}$ provides a clear practical way to diagnose whether a network is critical. A proper choice of $l_0$ and $l$ allows us to minimize the non-universal effects and cleanly extract $\chi^\star_J$. The recurrence relation (11), supplemented with the initial condition $\mathcal{J}^{l_0,l_0+1} = \chi^{l_0}_J$, can be formally solved as

$$\mathcal{J}^{l_0,l} = \prod_{\ell=l_0}^{l-1}\chi^\ell_J. \tag{15}$$

We would like to obtain an estimate of $\chi^\star_J$ that is as accurate as possible. To that end, imagine that for some $l' > l_0$ the fixed point has essentially been reached and $\chi^{l'}_J \approx \chi^\star_J$. Then the APJN

$$\mathcal{J}^{l_0,l} = (\chi^\star_J)^{l-l'-1}\cdot\prod_{\ell=l_0}^{l'}\chi^\ell_J \tag{16}$$

depends on the details of how the critical point is approached, which are encoded in the last factor.

Proposition 2.6. Let the network $f(x)$ be homogeneous, i.e., consist of a (possibly complex) block of layers periodically repeated $L$ times. Then the penultimate APJN provides an accurate estimate of $\chi^\star_J$:

$$\mathcal{J}^{L-2,L-1} \overset{L\to\infty}{=} \chi^\star_J. \tag{17}$$

This is a direct consequence of combining (10) and (12) as $L$ goes to infinity. See Figure 4 in Appendix C for numerical justification. Proposition 2.6 is the central result of this section and will be heavily used in the remainder of this work.

Note that for deep networks, away from criticality, the APJN takes the form

$$\mathcal{J}^{l_0,l} \approx c_{l_0}\, e^{\pm l/\xi}, \qquad \xi = \left|\log\chi^\star_J\right|^{-1}, \tag{18}$$

where $c_{l_0}$ is a non-universal constant that depends on $l_0$. If the sign in (18) is positive ($\chi^\star_J > 1$) the network is in the chaotic phase, while if the sign is negative ($\chi^\star_J < 1$) the network is in the ordered phase. $\xi$ has the meaning of a correlation length: on a depth scale of approximately $k\xi$ the gradients remain appreciable, and hence a network of depth $\approx k\xi$ will train.

We used (17) to map out the $\sigma_b$-$\sigma_w$ phase diagrams of various MLP architectures. The partial Jacobians are calculated numerically with $N_l = 500$, $L = 50$ and averaged over initializations. The details are further elaborated in Appendix A. The location of the critical line agrees remarkably well with our infinite width calculations. Results are presented in Fig. 1.

One fortunate outcome of both theory and experiment is that when LayerNorm is applied to preactivations, ReLU networks can still be initialized using He initialization (He et al., 2015) which, in our convention, is $(\sigma_w, \sigma_b) = (\sqrt{2}, 0)$.

At criticality, $\chi^\star_J = 1$ and the correlation length diverges, indicating that gradients can propagate arbitrarily far. A more careful analysis of non-linear corrections shows that the APJN can exhibit algebraic behaviour with depth and can still vanish in the infinite depth limit, but much more slowly than in the ordered phase.
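The diagnostic (17) can be illustrated on a small finite-width network. The sketch below is a plain-Python Monte Carlo estimate for a ReLU MLP without LayerNorm and with $\mu = 0$; it is not the PyTorch-hook implementation described in Appendix A. The function name `penultimate_apjn`, the equal widths (including the input), and the small sizes are assumptions chosen to keep the example fast. It uses the fact that for adjacent layers $J^{l,l+1}_{ij} = w^{l+1}_{ji}\,\phi'(h^l_i)$.

```python
import math
import random


def penultimate_apjn(width, depth, sigma_w, sigma_b, n_inits=20, seed=0):
    """Monte Carlo estimate of the APJN between layers L-2 and L-1 of a ReLU MLP."""
    rng = random.Random(seed)
    std_w = sigma_w / math.sqrt(width)  # every layer (and the input) has the same width

    def dense(inputs):
        return [sum(rng.gauss(0.0, std_w) * a for a in inputs) + rng.gauss(0.0, sigma_b)
                for _ in range(width)]

    total = 0.0
    for _ in range(n_inits):
        v = [rng.gauss(0.0, 1.0) for _ in range(width)]  # random input x
        for layer in range(depth - 2):
            # the input layer acts on x directly; later layers act on phi(h)
            inputs = v if layer == 0 else [max(u, 0.0) for u in v]
            v = dense(inputs)
        # v now holds h^{L-2}; sample the weights of layer L-1 and accumulate
        # sum_{j,i} (w_{ji} * phi'(h_i))^2, the squared Frobenius norm of the Jacobian
        d2 = [1.0 if u > 0.0 else 0.0 for u in v]
        frob = sum(rng.gauss(0.0, std_w) ** 2 * d for _ in range(width) for d in d2)
        total += frob / width
    return total / n_inits


chi_critical = penultimate_apjn(64, 6, math.sqrt(2.0), 0.0)
chi_ordered = penultimate_apjn(64, 6, 1.0, 0.0)
chi_chaotic = penultimate_apjn(64, 6, 2.0, 0.0)
```

For ReLU the infinite width value is $\chi_J = \sigma_w^2/2$, so the three estimates should scatter around 1, 0.5 and 2 respectively, with fluctuations that shrink as `width` and `n_inits` grow.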

2.2. SCALING AT A CRITICAL POINT

At criticality $\chi^l_J$ saturates to a fixed value $\chi^\star_J = 1$. If we are interested in $\mathcal{J}^{l_0,l}$ with $l - l_0 = O(L)$ then it is essential to know how exactly $\chi^l_J$ approaches 1.

Theorem 2.7. Assume that a deep neural network $f(x)$ is initialized critically. Then the $l \to \infty$ asymptotics of the APJN is given by

$$\mathcal{J}^{l_0,l}(x) = O(l^{-\zeta}), \tag{19}$$

where $\zeta$ is the critical exponent (Roberts et al., 2022).

Critical exponents can be determined analytically in the limit of infinite width. (For a detailed discussion, see Appendix C.)

3. LAYER NORMALIZATION

The fact that critical initialization is concentrated on a single point $(\sigma^\star_w, \sigma^\star_b)$ may appear unsettling, because great care must be taken to initialize the network critically. The situation can be substantially improved by using the normalization techniques known as LayerNorm (Ba et al., 2016) and GroupNorm (Wu & He, 2018). Our results apply to GroupNorm verbatim when the number of groups is much smaller than the width. LayerNorm can act either on preactivations or on activations (discussed in Appendix B). Depending on this choice, criticality will occur on different critical lines in the $\sigma_b$-$\sigma_w$ plane. When LayerNorm is applied to preactivations the correlation length is enhanced, allowing one to train much deeper networks even far away from criticality.

LayerNorm applied to preactivations takes the following form.

Definition 3.1 (Normalized preactivations).

$$\tilde h^l_i = \frac{h^l_i - E[h^l]}{\sqrt{E[(h^l)^2] - E[h^l]^2}} \;\xrightarrow{N_l \to \infty}\; \frac{1}{\sqrt{K^l}}\, h^l_i, \tag{20}$$

where we have introduced $E[h^l] = \frac{1}{N_l}\sum_{i=1}^{N_l} h^l_i$. In the limit of infinite width, $E[h^l] = 0$ and $E[(h^l)^2] = K^l$, defined according to (7).

Normalized preactivations, $\tilde h^l_i$, are distributed according to $N(0, 1)$ for all $l$, $\sigma_w$, $\sigma_b$. The norms are, therefore, always finite and the condition $\chi^\star_K = 1$ is trivially satisfied. This results in a critical line rather than a critical point. The recurrence relations (10)-(12) for the NNGP kernel and partial Jacobians are only slightly modified:

$$K^{l+1} = \sigma_w^2\, E_\theta\left[\frac{1}{N_l}\sum_{i=1}^{N_l}\phi(\tilde h^l_i)^2\right] + \sigma_b^2, \qquad \chi^l_J = \frac{\sigma_w^2}{K^l}\, E_\theta\left[\frac{1}{N_l}\sum_{i=1}^{N_l}\phi'(\tilde h^l_i)^2\right]. \tag{21}$$

Assuming that the value of $\chi^l_J$ at the fixed point is $\chi^\star_J$, the network is critical when (14) holds. $\chi^l_J$ in (21) changes very slowly with $l$ and is also bounded from below, as elaborated in the next section. Thus, $\chi^\star_J$ remains close to 1 for a very wide range of hyperparameters. Consequently, the correlation length is large even away from criticality. This leads to much higher trainability of deep networks with LayerNorm on preactivations, even away from criticality.
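As a worked illustration of the LayerNorm recursion for $K^{l+1}$ and $\chi^l_J$, consider the ReLU activation, for which both Gaussian averages over $\tilde h \sim N(0,1)$ equal $1/2$. The following is a sketch under that assumption rather than a result quoted from the text:

```latex
% Assumption: \phi = ReLU, so E[\phi(\tilde h)^2] = E[\phi'(\tilde h)^2] = 1/2 for \tilde h \sim N(0,1).
\begin{aligned}
K^\star &= \frac{\sigma_w^2}{2} + \sigma_b^2, \\
\chi_J^\star &= \frac{\sigma_w^2}{K^\star}\cdot\frac{1}{2}
             = \frac{\sigma_w^2}{\sigma_w^2 + 2\sigma_b^2} \;\le\; 1, \\
\xi &= \left|\log \chi_J^\star\right|^{-1}
     = \left[\log\!\left(1 + \frac{2\sigma_b^2}{\sigma_w^2}\right)\right]^{-1}.
\end{aligned}
```

In this example the condition $\chi^\star_J = 1$ is met on the whole line $\sigma_b = 0$, which contains the He point $(\sqrt{2}, 0)$, and for $\sigma_b \ll \sigma_w$ the correlation length grows as $\xi \approx \sigma_w^2/(2\sigma_b^2)$.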

4. RESIDUAL (SKIP) CONNECTIONS

Adding residual connections between the network layers is a widely used technique to facilitate the training of deep networks. Originally introduced (He et al., 2016) in the context of convolutional neural networks (CNNs) (LeCun et al., 1998) for image recognition, residual connections have since been used in a variety of network architectures and tasks. Consider (1) with non-zero $\mu$ and without LayerNorm layers. Then the recurrence relations (10)-(12) for the NNGP kernel and $\chi^l_J$ are modified as follows:

$$K^{l+1} = \sigma_w^2\, E_\theta\left[\frac{1}{N_l}\sum_{j=1}^{N_l}\phi(h^l_j)^2\right] + \sigma_b^2 + \mu^2 K^l, \qquad \chi^l_J = \sigma_w^2\, E_\theta\left[\frac{1}{N_l}\sum_{k=1}^{N_l}\phi'(h^l_k)^2\right] + \mu^2. \tag{22}$$

Remark 4.1. When $\mu < 1$, the fixed point value of the NNGP kernel is scaled by $(1-\mu^2)^{-1}$. For $\mu = 1$, the critical point is formally at $(0, 0)$.

Remark 4.2. For $\mu = 1$, (22) implies that $\chi^l_J \ge 1$, where the equality holds on the $\sigma_w = 0$ axis. Consequently, the APJN diverges exponentially as a function of depth $l$ for all $\sigma_w > 0$. In this case, $\sigma_w$ needs to be taken sufficiently close to 0 to ensure trainability at large depths.

When $\mu < 1$, residual connections amplify the chaotic phase and decrease the correlation length away from criticality for unbounded activation functions. Solving the recurrence relations (22) for the erf activation, we find an effect observed in Yang & Schoenholz (2017) for the tanh activation. They noted that tanh-like MLP networks with skip connections "hover over the edge of chaos". We quantify their observation as follows.

Theorem 4.3. Let $f(x)$ be a deep MLP network with erf activation function and residual connections of strength $\mu = 1$. Then in the limit $N_l \to \infty$:

• The NNGP kernel $K^l$ diverges linearly with depth $l$.

• $\chi^l_J$ approaches 1 from above (as can be seen from Fig. 1): $\chi^l_J \approx 1 + c/\sqrt{l}$, where $c = 2\sigma_w^2 / \left(\pi\sqrt{\sigma_w^2 + \sigma_b^2}\right)$ is a non-universal constant.

• The APJN diverges as a stretched exponential: $\mathcal{J}^{l_0,l} = O\left(e^{\sqrt{l/\lambda}}\right)$, where $\lambda = 1/(4c^2)$ is the new length scale.

We will refer to this case as subcritical. Although $\chi^\star_J$ reaches 1, the APJN still diverges with depth faster than any power law. The growth is controlled by the new scale $\lambda$. To control the gradient we would like to make $\lambda$ large, which can be accomplished by decreasing $\sigma_w$. In this case the trainability is enhanced (see Fig. 2). Similar results hold for the tanh activation function (Yang & Schoenholz, 2017); however, in that case there is no explicit expression for $c$.

Figure 1: [...] with residual connections of variable strengths $\mu \in \{0.0, 0.9, 1.0\}$. Both cases, without LayerNorm (first three columns) and with LayerNorm (last three columns), are shown. The solid lines indicate the critical lines obtained through infinite width limit calculations, while the stars indicate the critical points. The dotted lines in the rightmost column correspond to the critical lines for the $\mu < 1$ case. For networks with LayerNorm and $\mu = 1$, $\chi^\star_J = 1$ holds on the entire $\sigma_b$-$\sigma_w$ plane, for all activation functions that we considered. We also note that for the erf activation, the case $\mu = 1$ without LayerNorm is subcritical and has a large correlation length.
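The scaling stated in Theorem 4.3 can be checked by iterating the $\mu = 1$ recursion for erf directly. The sketch below is an illustration, not the authors' code; it assumes the standard closed-form Gaussian integrals $E[\mathrm{erf}(h)^2] = \frac{2}{\pi}\arcsin\frac{2K}{1+2K}$ and $E[\mathrm{erf}'(h)^2] = \frac{4}{\pi}(1+4K)^{-1/2}$ for $h \sim N(0, K)$, and the function name and parameter values are hypothetical.

```python
import math


def erf_residual_chi(sigma_w, sigma_b, depth, k0=1.0):
    """Iterate the mu = 1, no-LayerNorm recursion for erf; return (K, chi_J) after `depth` steps."""
    k = k0
    for _ in range(depth):
        # closed-form E[erf(h)^2] for h ~ N(0, k) (assumed standard result)
        e_phi2 = (2.0 / math.pi) * math.asin(2.0 * k / (1.0 + 2.0 * k))
        k = sigma_w ** 2 * e_phi2 + sigma_b ** 2 + k  # residual term mu^2 * K with mu = 1
    # closed-form E[erf'(h)^2] = (4 / pi) / sqrt(1 + 4k), plus mu^2 = 1
    chi = sigma_w ** 2 * (4.0 / math.pi) / math.sqrt(1.0 + 4.0 * k) + 1.0
    return k, chi


sigma_w, sigma_b, depth = 1.0, 0.5, 10_000
k, chi = erf_residual_chi(sigma_w, sigma_b, depth)

# Constant c from Theorem 4.3 and the ratio (chi - 1) / (c / sqrt(l)), expected to approach 1.
c = 2.0 * sigma_w ** 2 / (math.pi * math.sqrt(sigma_w ** 2 + sigma_b ** 2))
ratio = (chi - 1.0) * math.sqrt(depth) / c
# Linear growth of the kernel: K / l should approach sigma_w^2 + sigma_b^2.
growth = k / depth
```

Decreasing `sigma_w` lowers `c` and therefore raises the length scale $\lambda = 1/(4c^2)$, in line with the discussion following the theorem.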

5. RESIDUAL CONNECTIONS + LAYERNORM

In practice, it is common to use a combination of residual connections and LayerNorm. Using (1), the recurrence relations (10)-(12) for the NNGP kernel and partial Jacobians are modified as follows:

$$K^{l+1} = \sigma_w^2\, E_\theta\left[\frac{1}{N_l}\sum_{j=1}^{N_l}\phi(\tilde h^l_j)^2\right] + \sigma_b^2 + \mu^2 K^l, \qquad \chi^l_J = \frac{\sigma_w^2}{K^l}\, E_\theta\left[\frac{1}{N_l}\sum_{k=1}^{N_l}\phi'(\tilde h^l_k)^2\right] + \mu^2. \tag{23}$$

Remark 5.1. For $\mu < 1$, (23) implies that the fixed point value of the NNGP kernel is scaled by $(1-\mu^2)^{-1}$. Moreover, residual connections do not shift the phase boundary. The interference between residual connections and LayerNorm brings $\chi^l_J$ closer to 1 on the entire $\sigma_b$-$\sigma_w$ plane (as can be seen from Fig. 1). Therefore the correlation length $\xi$ is improved in both phases, allowing for the training of deeper networks. At criticality, Jacobians diverge linearly with depth.

As was mentioned before, the combination of LayerNorm and residual connections dramatically enhances the correlation length, leading to a more stable architecture. This observation is formalized by Theorem 1.3. The proof leverages the properties of solutions of (23) close to the fixed point, and is fleshed out in Appendix E.

Remark 5.2. When $\mu = 1$, the correlation length diverges for any initialization.

Remark 5.2 provides an alternative perspective on architecture design. On the one hand, given a neural network architecture one can use (17) to initialize it critically. Alternatively, one can add extra layers, such as a combination of residual connections and LayerNorm, to ensure that the network is always critical and will train well no matter which initialization scheme is used.

Remark 5.3. When $\mu = 1$, the condition $\chi^\star_J = 1$ holds on the entire $\sigma_b$-$\sigma_w$ plane and for any activation function $\phi$ (see Fig. 1). The NNGP kernel diverges linearly, while the APJN diverges algebraically with a critical exponent of $\zeta = O(1)$. The exact value of the critical exponent depends on the activation function and the ratio $\sigma_b/\sigma_w$. The trainability is dramatically enhanced, as can be seen from Fig. 2.

Remark 5.4. Networks with BatchNorm (Ioffe & Szegedy, 2015), used in conjunction with residual connections of strength $\mu = 1$, also enjoy this everywhere criticality and enhanced trainability (Yang et al., 2018; He et al., 2022).
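The effect of combining LayerNorm with residual connections can be made concrete for ReLU, where the Gaussian averages over normalized preactivations both equal $1/2$. The sketch below solves the fixed point of the recursion for $\mu < 1$ and iterates it for $\mu = 1$ to estimate the power-law growth of the APJN. It is an illustration under the ReLU assumption; the function names and the starting value `k1` are hypothetical.

```python
import math

A_RELU = 0.5  # E[phi'(z)^2] for ReLU with z ~ N(0, 1)
B_RELU = 0.5  # E[phi(z)^2]  for ReLU with z ~ N(0, 1)


def chi_star(sigma_w, sigma_b, mu):
    """Fixed-point chi_J for ReLU + LayerNorm on preactivations + residual strength mu < 1."""
    if not 0.0 <= mu < 1.0:
        raise ValueError("a finite fixed point K* requires 0 <= mu < 1")
    k_star = (sigma_w ** 2 * B_RELU + sigma_b ** 2) / (1.0 - mu ** 2)
    return sigma_w ** 2 * A_RELU / k_star + mu ** 2


def correlation_length(chi):
    """xi = 1 / |log chi|, infinite when chi = 1."""
    return math.inf if chi == 1.0 else 1.0 / abs(math.log(chi))


def growth_exponent_mu1(sigma_w, sigma_b, depth=2000, k1=1.0):
    """Log-log slope of the APJN between depth/2 and depth for mu = 1."""
    step = sigma_w ** 2 * B_RELU + sigma_b ** 2
    k, log_j, log_j_half = k1, 0.0, 0.0
    for l in range(1, depth):
        log_j += math.log(1.0 + sigma_w ** 2 * A_RELU / k)  # chi_J^l with mu = 1
        k += step                                           # K^{l+1} = K^l + step
        if l == depth // 2 - 1:
            log_j_half = log_j  # log of the APJN from layer 1 to layer depth/2
    return (log_j - log_j_half) / math.log(2.0)


sigma_w, sigma_b = math.sqrt(2.0), 1.0
xi_values = [correlation_length(chi_star(sigma_w, sigma_b, mu)) for mu in (0.0, 0.5, 0.9)]
slope = growth_exponent_mu1(sigma_w, sigma_b)
```

For these values the correlation length increases monotonically with $\mu$, and at $\mu = 1$ the measured slope is close to $\frac{\sigma_w^2/2}{\sigma_w^2/2 + \sigma_b^2} = 0.5$, i.e. the APJN grows as a power of depth rather than exponentially.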

6. MLP-MIXER

The MLP-Mixer architecture is a recent example of the MLP approach to computer vision (Tolstikhin et al., 2021). Its main ingredients are patches, MLP layers, LayerNorm and residual connections. As such, it can be analyzed using the tools and results presented above. A detailed summary of the MLP-Mixer is presented in Appendix F; here we only state the results. When $\mu = 1$ the MLP-Mixer is everywhere critical due to the interaction between LayerNorm and residual connections. When $\mu < 1$ we can identify a critical line by numerically evaluating $\chi^\star_J$ using (17). The phase diagrams are presented in Fig. 3. To illustrate the importance of critical initialization we trained the MLP-Mixer at $\mu = 1$ for various initializations, including a highly unconventional $\sigma_b^2 = 10$, $\sigma_w^2 = 10$. While the final performance varies by a few percent, the network trains well at a large depth of $L = 100$ mixer blocks. We also trained the MLP-Mixer at $\mu = 0.5$. In this case, the model trains well at $L = 100$ when initialized critically, while far away from the critical line ($\sigma_b^2 = 10$, $\sigma_w^2 = 10$) it only starts training after 30 epochs, and the gap between it and the critical initialization is larger than in the $\mu = 1$ cases. The learning curves are presented in Fig. 3. We emphasize that we were interested in trainability and, consequently, did not tune the hyperparameters to achieve the best generalization.

7. CONCLUSIONS

We have introduced partial Jacobians and their averaged norms as a tool to analyze the propagation of gradients through deep neural networks at initialization. Using the APJN evaluated close to the output, $\mathcal{J}^{L-2,L-1} \approx \chi^\star_J$, we have introduced a very cheap and simple empirical test for criticality. We have also shown that criticality formulated in terms of partial Jacobians is equivalent to criticality studied previously in the literature (Poole et al., 2016; Roberts et al., 2022; Martens et al., 2021). The APJN will play an important role in quantifying the criticality of inhomogeneous networks (i.e. those without periodic stacking of blocks).

We have investigated homogeneous architectures that include fully-connected layers, normalization layers and residual connections. In the limit of infinite width, we showed that (i) in the presence of LayerNorm, the critical point generally becomes a critical line, making the initialization problem much easier; (ii) LayerNorm applied to preactivations enhances the correlation length, leading to improved trainability; (iii) the combination of $\mu = 1$ residual connections and the erf activation function enhances the correlation length, driving the network to a subcritical phase with the APJN growing according to a stretched exponential law; (iv) the combination of residual connections and LayerNorm drastically increases the correlation length, leading to improved trainability; and (v) when $\mu = 1$ and LayerNorm is applied to preactivations, the network is critical on the entire $\sigma_b$-$\sigma_w$ plane.

We have considered the example of a modern high performance architecture, the MLP-Mixer. We showed that it is critical everywhere and is not sensitive to initialization at $\mu = 1$ due to the interaction between LayerNorm and residual connections. We have also studied the MLP-Mixer at $\mu = 0.5$ and showed that it is critical along a line (as expected for an architecture with LayerNorm). We demonstrated empirically that a deep (100 blocks) MLP-Mixer trains for a variety of initializations at $\mu = 1$ but only trains well close to the critical line for $\mu = 0.5$.

Our work shows that an architecture can be designed to have a large correlation length, leading to guaranteed trainability with SGD for any initialization scheme.

8. ETHICS STATEMENT

We have read the Code of Ethics; have adhered to it while writing this paper; and will adhere to it during the paper submission, review, and discussion process.

9. REPRODUCIBILITY STATEMENT

We provide the details of all the experiments in Appendix A; including hyperparameter choices, GPU specifications and GPU hours. We also provide the source code in the Supplementary Material.

A EXPERIMENTAL DETAILS

We implemented our methods using PyTorch (Paszke et al., 2019) hooks and an efficient Jacobian approximation algorithm (Hoffman et al., 2019).

Figure 1: All the phase diagrams were plotted using $\chi^{L-1}_J$ generated from networks with $L = 50$ and $N_l = 500$. We used hooks to obtain the gradients that go into calculating $\chi^{L-1}_J$. The $\chi^{L-1}_J$ data was averaged over 100 different parameter initializations. Inputs were generated from a standard normal distribution and have dimension $28 \times 28$. Generating the data for the figure took approximately 2 days on Google Colab Pro (single Tesla P100 GPU).

Figure 2: In all cases, networks are trained for 10 epochs using stochastic gradient descent with CrossEntropy loss. We used the Fashion MNIST dataset (Xiao et al., 2017). All networks had depth $L = 50$ and width $N_l = 500$. The learning rates were logarithmically sampled within $(10^{-5}, 1)$. Generating the data for the figure took approximately 12 days on Google Colab Pro (single Tesla P100 GPU).

We trained all cases on the CIFAR-10 dataset using vanilla SGD paired with CSE. The batch size was $bs = 256$, the weight decay $\lambda = 10^{-4}$ was selected from $\{10^{-5}, 10^{-4}\}$, and the mixup rate $\alpha = 0.8$ was selected from $\{0.4, 0.8\}$. We also used RandAugment and horizontal flip with default settings in PyTorch. For all cases we searched learning rates within $\{0.005, 0.01, 0.05, 0.1, 0.2, 0.5\}$. We also tried a linear warm-up schedule for the first 3000 iterations, but we did not see any improvement in performance. Generating the data for the figure took approximately 4 days on Google Colab Pro (single Tesla P100 GPU).

B TECHNICAL DETAILS FOR JACOBIANS AND LAYERNORM

We will drop the dependence of $h^l_i(x)$ on $x$ throughout the Appendices. This should not cause any confusion since we always consider a single input.

B.1 NNGP KERNEL

First, we derive the recurrence relation for the NNGP kernel Eq.( 10). As mentioned in main text, weights and biases are initialized (independently) from standard normal distribution N (0, σ 2 w /fan i n). We then have E θ [w l ij w l mn ] = σ 2 w N l-1 δ im δ jn and E θ [b l i b l j ] = σ 2 b δ ij by definition. We would like to prove Theorem 2.4, as a consequence of Lemma 2.2. The proof of Lemma 2.2 can be found in Roberts et al. (2022) . Proof of Theorem 2.4. One can prove this by definition with Lemma 2.2. K l+1 ≡ 1 N l+1 N l+1 i=1 E θ [h l+1 i h l+1 i ] = 1 N l+1 N l+1 i=1 E θ     N l j=1 w l+1 ij ϕ(h l j ) + b l+1 i   N l k=1 w l+1 ik ϕ(h l k ) + b l+1 i   = 1 N l+1 N l+1 i=1 E θ   N l j=1 N l k=1 w l+1 ij w l+1 ik ϕ(h l j )ϕ(h l k ) + b l+1 i b l+1 i   = 1 N l+1 N l+1 i=1 E θ   σ 2 w N l N l j=1 ϕ(h l j )ϕ(h l j ) + σ 2 b   = σ 2 w N l N l j=1 E θ ϕ(h l j )ϕ(h l j ) + σ 2 b . B.2 JACOBIANS Next, we prove Theorem 2.5. Proof of Theorem 2.5. We start from the definition of the partial Jacobian (l > l 0 ) J l0,l+1 ≡ 1 N l+1 E θ   N l+1 i=1 N l 0 j=1 ∂h l+1 i ∂h l0 j ∂h l+1 i ∂h l0 j   = 1 N l+1 E θ   N l+1 i=1 N l 0 j=1 N l k=1 ∂h l+1 i ∂h l k ∂h l k ∂h l0 j N l m=1 ∂h l+1 i ∂h l m ∂h l m ∂h l0 j   = 1 N l+1 E θ   N l+1 i=1 N l 0 j=1 N l k,m=1 w l+1 ik ϕ ′ (h l k ) w l+1 im ϕ ′ (h l m ) ∂h l k ∂h l0 j ∂h l m ∂h l0 j   = 1 N l+1 E θ   N l+1 i=1 N l 0 j=1 N l k,m=1 w l+1 ik w l+1 im ϕ ′ (h l k )ϕ ′ (h l m ) ∂h l k ∂h l0 j ∂h l m ∂h l0 j   = 1 N l+1 N l+1 i=1 N l 0 j=1 N l k=1 σ 2 w N l E θ ϕ ′ (h l k )ϕ ′ (h l k ) ∂h l k ∂h l0 j ∂h l k ∂h l0 j = σ 2 w N l N l k=1 E θ   ϕ ′ (h l k )ϕ ′ (h l k )   N l 0 j=1 ∂h l k ∂h l0 j ∂h l k ∂h l0 j     . ( ) In the infinite width limit, the sum over neurons in a layer self-averages due to the law of large numbers. Also the distribution of h l+1 i is independent of h l i . This allows us to represent the expectation value of a product as product of expectation values. 
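(This holds for $l_0 \neq 0$. We will show momentarily that the $l_0 = 0$ case acquires corrections due to the finite input width $N_0$.) Thus we have

$$
\begin{aligned}
\mathcal{J}^{l_0,l+1} &= \sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l_k)\phi'(h^l_k)\right]\mathbb{E}_\theta\left[\frac{1}{N_l}\sum_{k=1}^{N_l}\sum_{j=1}^{N_{l_0}}\frac{\partial h^{l}_k}{\partial h^{l_0}_j}\frac{\partial h^{l}_k}{\partial h^{l_0}_j}\right] \\
&= \sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l_k)\phi'(h^l_k)\right]\mathcal{J}^{l_0,l} \\
\implies \mathcal{J}^{l_0,l+1} &= \chi^l_{\mathcal{J}}\,\mathcal{J}^{l_0,l} ,
\end{aligned}
$$

where in the first line we used Lemma 2.2. The first integral is taken over the distribution of $h^l_k$, and the second integral is taken over the distribution of $h^{l-1}_k$. $\chi^l_{\mathcal{J}}$ is defined by Eq.(12). The critical line is defined by requiring $\chi^\star_{\mathcal{J}} = 1$, and critical points are reached by further requiring $\chi^\star_K = 1$.

As we mentioned in the main text, $l_0 = 0$ is subtle because the input dimension $N_0$ is fixed and cannot be assumed to be infinite, even though for datasets like MNIST $N_0$ is usually not significantly smaller than the width $N_l$. We show how to take the finite $O(N_0^{-1})$ correction into account using one example.

Lemma B.1. Consider a one hidden layer network with a finite input dimension $N_0$. In the infinite width limit, the Jacobian is still deterministic and the first step of the recurrence relation is modified to

$$
\mathcal{J}^{0,2} = \left(\chi^1_{\mathcal{J}} + \frac{2\sigma_w^2}{N_0}\,\chi^1_\Delta \sum_{k=1}^{N_0}\frac{1}{N_0} h^0_k h^0_k\right)\mathcal{J}^{0,1} ,
$$

where $\mathcal{J}^{0,1} = \sigma_w^2$.

Proof. Superscripts on $w$ and $h$ denote layer indices.

$$
\begin{aligned}
\mathcal{J}^{0,2} &= \frac{1}{N_2}\mathbb{E}_\theta\left[\sum_{i=1}^{N_2}\sum_{j=1}^{N_0}\frac{\partial h^2_i}{\partial h^0_j}\frac{\partial h^2_i}{\partial h^0_j}\right] \\
&= \frac{1}{N_2}\mathbb{E}_\theta\left[\sum_{i=1}^{N_2}\sum_{j=1}^{N_0}\sum_{k,m=1}^{N_1} w^2_{ik} w^2_{im}\phi'(h^1_k)\phi'(h^1_m)\frac{\partial h^1_k}{\partial h^0_j}\frac{\partial h^1_m}{\partial h^0_j}\right] \\
&= \frac{1}{N_2}\sum_{i=1}^{N_2}\sum_{j=1}^{N_0}\sum_{k,m=1}^{N_1}\mathbb{E}_\theta\left[w^2_{ik} w^2_{im} w^1_{kj} w^1_{mj}\phi'(h^1_k)\phi'(h^1_m)\right] \\
&= \sum_{j=1}^{N_0}\sum_{k=1}^{N_1}\frac{\sigma_w^2}{N_1}\mathbb{E}_\theta\left[w^1_{kj} w^1_{kj}\phi'(h^1_k)\phi'(h^1_k)\right] \\
&= \sigma_w^2\left(\chi^1_{\mathcal{J}} + \frac{2\sigma_w^2}{N_0}\,\chi^1_\Delta\sum_{k=1}^{N_0}\frac{1}{N_0}h^0_k h^0_k\right) \\
&= \left(\chi^1_{\mathcal{J}} + \frac{2\sigma_w^2}{N_0}\,\chi^1_\Delta\sum_{k=1}^{N_0}\frac{1}{N_0}h^0_k h^0_k\right)\mathcal{J}^{0,1} ,
\end{aligned}
$$

where to get the result we used integration by parts and then explicitly integrated over $w^1_{ij}$. We defined a new quantity $\chi^l_\Delta$.

Definition B.2 (Coefficient of Finite Width Corrections).

$$
\chi^l_\Delta = \frac{\sigma_w^2}{N_l}\sum_{i=1}^{N_l}\mathbb{E}_\theta\left[\phi''(h^l_i)\phi''(h^l_i) + \phi'''(h^l_i)\phi'(h^l_i)\right] .
$$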
Remark B.3. Notice that the correction to $\mathcal{J}^{0,2}$ is of order $O(N_0^{-1})$. If one calculates the recurrence relation for deeper layers, the correction to $\mathcal{J}^{0,l}$ will be $O\left(\sum_{l'=0}^{l} N_{l'}^{-1}\right)$, which means that the contribution from hidden layers can be ignored in the infinite width limit. The $\mathcal{J}^{0,2}$ example justifies the factorization of the integral when we go from the last line of Eq.(26) to Eq.(27).

Finally, the full Jacobian in the infinite width limit can be written as follows.

Theorem B.4 (Partial Jacobian). The partial Jacobian of a given network can be written as

$$
\mathcal{J}^{0,l} = \sigma_w^2\left(\chi^1_{\mathcal{J}} + \frac{2\sigma_w^2}{N_0}\,\chi^1_\Delta\sum_{k=1}^{N_0}\frac{1}{N_0}h^0_k h^0_k\right)\prod_{l'=2}^{l-1}\chi^{l'}_{\mathcal{J}} ,
$$

where any partial Jacobian with $l_0 > 0$ does not receive an $O(N_0^{-1})$ correction.
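The multiplicative recurrence for partial Jacobians can be checked numerically on a finite-width network. The sketch below is only an illustration under stated assumptions: it uses a ReLU MLP with $\sigma_b = 0$ (for which the scale-invariant results in this appendix give $\chi_{\mathcal{J}} = \sigma_w^2/2$ at every layer), takes $l_0 = 1$, and uses a single random draw instead of an average over initializations. The function name `relu_apjn` and all default sizes are hypothetical choices, not part of the original work.

```python
import numpy as np


def relu_apjn(n_layers, width, sigma_w2, sigma_b2=0.0, n_in=64, seed=0):
    """Single-draw estimate of J^{1,l} for l = 1..n_layers in a ReLU MLP.

    Weights ~ N(0, sigma_w2 / fan_in), biases ~ N(0, sigma_b2).
    Entry n of the returned array estimates J^{1,1+n}.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_in)
    w = rng.normal(0.0, np.sqrt(sigma_w2 / n_in), size=(width, n_in))
    h = w @ x + rng.normal(0.0, np.sqrt(sigma_b2), size=width)  # h^1

    jac = np.eye(width)                 # d h^1 / d h^1
    apjn = [np.sum(jac ** 2) / width]   # J^{1,1} = 1
    for _ in range(n_layers - 1):
        w = rng.normal(0.0, np.sqrt(sigma_w2 / width), size=(width, width))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
        dphi = (h > 0).astype(float)    # phi'(h^l) for ReLU
        # Chain rule: d h^{l+1} / d h^1 = W^{l+1} diag(phi'(h^l)) d h^l / d h^1
        jac = (w * dphi[None, :]) @ jac
        h = w @ np.maximum(h, 0.0) + b
        apjn.append(np.sum(jac ** 2) / width)
    return np.array(apjn)


if __name__ == "__main__":
    for sigma_w2 in (1.0, 2.0):
        measured = relu_apjn(n_layers=4, width=800, sigma_w2=sigma_w2)
        predicted = (sigma_w2 / 2.0) ** np.arange(4)
        print(sigma_w2, measured, predicted)
```

At finite width the measured values fluctuate around the infinite-width prediction, so agreement should only be expected up to corrections that shrink as the width grows.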

B.3 LAYERNORM ON PRE-ACTIVATIONS

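Definition B.5 (Layer Normalization).

$$
\tilde{h}^l_i = \frac{h^l_i - \mathbb{E}[h^l]}{\sqrt{\mathbb{E}[(h^l)^2] - \mathbb{E}[h^l]^2}}\,\gamma^l_i + \beta^l_i ,
$$

where $\gamma^l_i$ and $\beta^l_i$ are learnable parameters.

Remark B.6. With only LayerNorm, the recurrence relation (1) simplifies to

$$
h^{l+1}_i = \sum_{j=1}^{N_l} w^{l+1}_{ij}\phi(\tilde{h}^l_j) + b^{l+1}_i .
$$

Remark B.7. In the limit of infinite width, using the law of large numbers, the average over neurons $\mathbb{E}[\cdots]$ can be replaced by the average over parameter initializations $\mathbb{E}_\theta[\cdots]$. Additionally, in this limit, the preactivations are i.i.d. Gaussian distributed, $h^l \sim \mathcal{N}(0, K^l)$:

$$
\mathbb{E}[h^l] = \mathbb{E}_\theta[h^l] = 0 , \qquad \mathbb{E}[(h^l)^2] = \mathbb{E}_\theta[(h^l)^2] = K^l .
$$

The normalized preactivation then simplifies to the form of Eq.(20).

Remark B.8. At initialization, the parameters $\gamma^l_i$ and $\beta^l_i$ take the values 1 and 0, respectively. This leads to the form in Eq.(20). In the infinite width limit it has the following form:

$$
\tilde{h}^l_i = \frac{h^l_i - \mathbb{E}_\theta[h^l]}{\sqrt{\mathbb{E}_\theta[(h^l)^2] - \mathbb{E}_\theta[h^l]^2}} .
$$

Lemma B.9. With LayerNorm on preactivations, the Gaussian average is modified to

$$
\mathbb{E}_\theta\left[O(\tilde{h}^l_i)\right] = \frac{1}{\sqrt{2\pi}}\int d\tilde{h}^l_i\, O(\tilde{h}^l_i)\, e^{-\frac{(\tilde{h}^l_i)^2}{2}} .
$$

Proof. By definition, $\tilde{h}^l_i$ is sampled from a standard normal distribution $\mathcal{N}(0, 1)$; the final form then follows from Lemma 2.2.

Theorem B.10. In the infinite width limit, the recurrence relation for the NNGP kernel with LayerNorm on preactivations is

$$
K^{l+1} = \frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\phi(\tilde{h}^l_j)\phi(\tilde{h}^l_j)\right] + \sigma_b^2 .
$$

Proof.

$$
\begin{aligned}
K^{l+1} &= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[h^{l+1}_i h^{l+1}_i\right] \\
&= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[\left(\sum_{j=1}^{N_l} w^{l+1}_{ij}\phi(\tilde{h}^l_j) + b^{l+1}_i\right)\left(\sum_{k=1}^{N_l} w^{l+1}_{ik}\phi(\tilde{h}^l_k) + b^{l+1}_i\right)\right] \\
&= \frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\phi(\tilde{h}^l_j)\phi(\tilde{h}^l_j)\right] + \sigma_b^2 .
\end{aligned}
$$

Theorem B.11. In the infinite width limit, the recurrence relation for the partial Jacobian with LayerNorm on preactivations is

$$
\mathcal{J}^{l_0,l+1} = \chi^l_{\mathcal{J}}\,\mathcal{J}^{l_0,l} , \qquad \text{where} \quad \chi^l_{\mathcal{J}} = \frac{\sigma_w^2}{N_l K^l}\sum_{i=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_i)^2\right] .
$$

Proof.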
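$$
\begin{aligned}
\mathcal{J}^{l_0,l+1} &= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\frac{\partial h^{l+1}_i}{\partial h^{l_0}_j}\frac{\partial h^{l+1}_i}{\partial h^{l_0}_j}\right] \\
&= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\left(\sum_{k=1}^{N_l}\frac{\partial h^{l+1}_i}{\partial\tilde{h}^l_k}\frac{\partial\tilde{h}^l_k}{\partial h^l_k}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right)\left(\sum_{m=1}^{N_l}\frac{\partial h^{l+1}_i}{\partial\tilde{h}^l_m}\frac{\partial\tilde{h}^l_m}{\partial h^l_m}\frac{\partial h^l_m}{\partial h^{l_0}_j}\right)\right] \\
&= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\sum_{k,m=1}^{N_l}\left(w^{l+1}_{ik}\phi'(\tilde{h}^l_k)\frac{1}{\sqrt{K^l}}\right)\left(w^{l+1}_{im}\phi'(\tilde{h}^l_m)\frac{1}{\sqrt{K^l}}\right)\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_m}{\partial h^{l_0}_j}\right] \\
&= \frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\left(\sum_{j=1}^{N_{l_0}}\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right)\right] \\
&= \frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right]\mathcal{J}^{l_0,l} \\
&= \chi^l_{\mathcal{J}}\,\mathcal{J}^{l_0,l} .
\end{aligned}
$$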
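Because the normalized preactivation is a standard normal variable, the coefficient $\chi_{\mathcal{J}}$ with LayerNorm on preactivations reduces to two one-dimensional Gaussian averages, which can be evaluated by quadrature for any smooth activation. The sketch below is a minimal illustration, assuming Gauss-Hermite quadrature and the erf activation; the helper names (`gauss_mean`, `chi_j_layernorm`, `erf_dphi`) are hypothetical. It compares the quadrature value with the closed form for erf derived later in this appendix.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss

erf = np.vectorize(math.erf)


def erf_dphi(z):
    """Derivative of erf."""
    return 2.0 / math.sqrt(math.pi) * np.exp(-z ** 2)


def gauss_mean(f, deg=100):
    """E[f(z)] for z ~ N(0, 1), using probabilists' Gauss-Hermite nodes."""
    x, w = hermegauss(deg)  # weight function exp(-x^2 / 2)
    return float(np.sum(w * f(x)) / math.sqrt(2.0 * math.pi))


def chi_j_layernorm(phi, dphi, sigma_w2, sigma_b2):
    """chi_J = sigma_w^2 E[phi'(z)^2] / K, with K = sigma_w^2 E[phi(z)^2] + sigma_b^2."""
    kernel = sigma_w2 * gauss_mean(lambda z: phi(z) ** 2) + sigma_b2
    return sigma_w2 * gauss_mean(lambda z: dphi(z) ** 2) / kernel


def chi_j_layernorm_erf_closed(sigma_w2, sigma_b2):
    """Closed form for erf with LayerNorm on preactivations."""
    return 4.0 * sigma_w2 / (
        math.sqrt(5.0) * (2.0 * sigma_w2 * math.asin(2.0 / 3.0) + math.pi * sigma_b2)
    )


if __name__ == "__main__":
    print(chi_j_layernorm(erf, erf_dphi, 1.7, 0.3))
    print(chi_j_layernorm_erf_closed(1.7, 0.3))
```

The same `chi_j_layernorm` helper can be pointed at other smooth activations by passing a different `phi` and `dphi`; for non-smooth activations such as ReLU, quadrature converges slowly and a direct integral or Monte Carlo estimate is a safer choice.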

B.4 LAYERNORM ON ACTIVATIONS

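The general definition of LayerNorm on activations is given as follows.

Definition B.12 (LayerNorm on Activations).

$$
\widetilde{\phi}(h^l_i) = \frac{\phi(h^l_i) - \mathbb{E}[\phi(h^l)]}{\sqrt{\mathbb{E}[\phi(h^l)^2] - \mathbb{E}[\phi(h^l)]^2}}\,\gamma^l_i + \beta^l_i .
$$

Remark B.13. The recurrence relation for preactivations (Eq.(1)) gets modified to

$$
h^{l+1}_i = \sum_{j=1}^{N_l} w^{l+1}_{ij}\,\widetilde{\phi}(h^l_j) + b^{l+1}_i .
$$

Remark B.14. At initialization, the parameters $\gamma^l_i$ and $\beta^l_i$ take the values 1 and 0, respectively. This leads to the form

$$
\begin{aligned}
\widetilde{\phi}(h^l_i) &= \frac{\phi(h^l_i) - \mathbb{E}[\phi(h^l)]}{\sqrt{\mathbb{E}[\phi(h^l)^2] - \mathbb{E}[\phi(h^l)]^2}} \\
&= \frac{\phi(h^l_i) - \mathbb{E}_\theta[\phi(h^l)]}{\sqrt{\mathbb{E}_\theta[\phi(h^l)^2] - \mathbb{E}_\theta[\phi(h^l)]^2}} ,
\end{aligned}
$$

where the first line follows from the initial values of $\gamma^l_i$ and $\beta^l_i$, and in the second line we have invoked the infinite width limit.

Remark B.15. Evaluating the Gaussian average in this case is similar to the cases in the previous section. The only difference is that the averages are taken over the distribution $h^{l-1} \sim \mathcal{N}(0, K^{l-1} = \sigma_w^2 + \sigma_b^2)$. Again this can be summarized as

$$
\mathbb{E}_\theta\left[O(h^l_i)\right] = \frac{1}{\sqrt{2\pi(\sigma_w^2 + \sigma_b^2)}}\int dh^l_i\, O(h^l_i)\, e^{-\frac{(h^l_i)^2}{2(\sigma_w^2 + \sigma_b^2)}} .
$$

Next, we calculate the modifications to the recurrence relations for the NNGP kernel and Jacobians.

Theorem B.16. In the infinite width limit, the recurrence relation for the NNGP kernel with LayerNorm on activations is

$$
K^{l+1} = \sigma_w^2 + \sigma_b^2 .
$$

Proof.

$$
\begin{aligned}
K^{l+1} &= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[h^{l+1}_i h^{l+1}_i\right] \\
&= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[\left(\sum_{j=1}^{N_l} w^{l+1}_{ij}\widetilde{\phi}(h^l_j) + b^{l+1}_i\right)\left(\sum_{k=1}^{N_l} w^{l+1}_{ik}\widetilde{\phi}(h^l_k) + b^{l+1}_i\right)\right] \\
&= \frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\widetilde{\phi}(h^l_j)^2\right] + \sigma_b^2 \\
&= \frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\left(\frac{\phi(h^l_j) - \mathbb{E}_\theta[\phi(h^l)]}{\sqrt{\mathbb{E}_\theta[\phi(h^l)^2] - \mathbb{E}_\theta[\phi(h^l)]^2}}\right)^2\right] + \sigma_b^2 \\
&= \frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\frac{\mathbb{E}_\theta\left[\left(\phi(h^l_j) - \mathbb{E}_\theta[\phi(h^l)]\right)^2\right]}{\mathbb{E}_\theta[\phi(h^l)^2] - \mathbb{E}_\theta[\phi(h^l)]^2} + \sigma_b^2 \\
&= \sigma_w^2 + \sigma_b^2 .
\end{aligned}
$$

Theorem B.17. In the infinite width limit, the recurrence relation for the partial Jacobian with LayerNorm on activations is

$$
\mathcal{J}^{l_0,l+1} = \chi^l_{\mathcal{J}}\,\mathcal{J}^{l_0,l} , \qquad \text{where} \quad \chi^l_{\mathcal{J}} \equiv \frac{\sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l)^2\right]}{\mathbb{E}_\theta[\phi(h^l)^2] - \mathbb{E}_\theta[\phi(h^l)]^2} .
$$

Proof. Here $\widetilde{\phi}'$ denotes the derivative of the normalized activation.

$$
\begin{aligned}
\mathcal{J}^{l_0,l+1} &= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\frac{\partial h^{l+1}_i}{\partial h^{l_0}_j}\frac{\partial h^{l+1}_i}{\partial h^{l_0}_j}\right] \\
&= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\left(\sum_{k=1}^{N_l}\frac{\partial h^{l+1}_i}{\partial h^l_k}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right)\left(\sum_{m=1}^{N_l}\frac{\partial h^{l+1}_i}{\partial h^l_m}\frac{\partial h^l_m}{\partial h^{l_0}_j}\right)\right] \\
&= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\sum_{k,m=1}^{N_l}\left(w^{l+1}_{ik}\widetilde{\phi}'(h^l_k)\right)\left(w^{l+1}_{im}\widetilde{\phi}'(h^l_m)\right)\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_m}{\partial h^{l_0}_j}\right] \\
&= \frac{\sigma_w^2}{N_l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\widetilde{\phi}'(h^l_k)\,\widetilde{\phi}'(h^l_k)\left(\sum_{j=1}^{N_{l_0}}\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right)\right] \\
&= \frac{\sigma_w^2}{N_l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\widetilde{\phi}'(h^l_k)^2\right]\mathcal{J}^{l_0,l} \\
&= \frac{\sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l)^2\right]}{\mathbb{E}_\theta[\phi(h^l)^2] - \mathbb{E}_\theta[\phi(h^l)]^2}\,\mathcal{J}^{l_0,l} \\
&= \chi^l_{\mathcal{J}}\,\mathcal{J}^{l_0,l} .
\end{aligned}
$$

Figure 4: log-log plot of the partial Jacobian $\mathcal{J}^{0,l}$ vs. $l$ for erf, ReLU and erf together with GELU (with LayerNorm applied to preactivations and residual connections of strength 1) activation functions. The critical exponents predicted from the infinite width analysis are in agreement with the data. The fluctuations get larger towards the output because the aspect ratio (i.e. $L/N_l$) approaches 1/4.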

D RESIDUAL CONNECTIONS

Definition D.1. We define residual connections by the modified recurrence relation for preactivations (Eq.(1))

$$
h^{l+1}_i = \sum_{j=1}^{N_l} w^{l+1}_{ij}\phi(h^l_j) + b^{l+1}_i + \mu h^l_i ,
$$

where the parameter $\mu$ controls the strength of the residual connection.

Remark D.2. Note that this definition requires $N_{l+1} = N_l$. We ensure this by only adding residual connections to the hidden layers, which are of the same width. More generally, one can introduce a tensor parameter $\mu_{ij}$.

Remark D.3. In general, the parameter $\mu$ could be layer-dependent ($\mu^l$), but we suppress this dependence here since we are discussing self-similar networks.

Theorem D.4. In the infinite width limit, the recurrence relation for the NNGP kernel with residual connections is changed by an additional term controlled by $\mu$:

$$
K^{l+1} = \frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\phi(h^l_j)\phi(h^l_j)\right] + \sigma_b^2 + \mu^2 K^l .
$$

Proof.

$$
\begin{aligned}
K^{l+1} &= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[h^{l+1}_i h^{l+1}_i\right] \\
&= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[\left(\sum_{j=1}^{N_l} w^{l+1}_{ij}\phi(h^l_j) + b^{l+1}_i + \mu h^l_i\right)\left(\sum_{k=1}^{N_l} w^{l+1}_{ik}\phi(h^l_k) + b^{l+1}_i + \mu h^l_i\right)\right] \\
&= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[\sum_{j=1}^{N_l}\sum_{k=1}^{N_l} w^{l+1}_{ij} w^{l+1}_{ik}\phi(h^l_j)\phi(h^l_k) + b^{l+1}_i b^{l+1}_i + \mu^2 h^l_i h^l_i\right] \\
&= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[\frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\phi(h^l_j)\phi(h^l_j) + \sigma_b^2\right] + \mu^2\frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[h^l_i h^l_i\right] \\
&= \frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\phi(h^l_j)\phi(h^l_j)\right] + \sigma_b^2 + \mu^2 K^l ,
\end{aligned}
$$

where we used the fact that $N_{l+1} = N_l$ to get the last line.

Theorem D.5. In the infinite width limit, the recurrence relation for partial Jacobians with residual connections has a simple multiplicative form

$$
\mathcal{J}^{l_0,l+1} = \chi^l_{\mathcal{J}}\,\mathcal{J}^{l_0,l} ,
$$

where the recurrence coefficient is shifted to

$$
\chi^l_{\mathcal{J}} = \sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l_k)\phi'(h^l_k)\right] + \mu^2 .
$$

Proof.
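$$
\begin{aligned}
\mathcal{J}^{l_0,l+1} &\equiv \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\frac{\partial h^{l+1}_i}{\partial h^{l_0}_j}\frac{\partial h^{l+1}_i}{\partial h^{l_0}_j}\right] \\
&= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\left(\sum_{k=1}^{N_l}\frac{\partial h^{l+1}_i}{\partial h^l_k}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right)\left(\sum_{m=1}^{N_l}\frac{\partial h^{l+1}_i}{\partial h^l_m}\frac{\partial h^l_m}{\partial h^{l_0}_j}\right)\right] \\
&= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\sum_{k,m=1}^{N_l}\left(w^{l+1}_{ik}\phi'(h^l_k) + \mu\delta_{ik}\right)\left(w^{l+1}_{im}\phi'(h^l_m) + \mu\delta_{im}\right)\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_m}{\partial h^{l_0}_j}\right] \\
&= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\sum_{k,m=1}^{N_l}\left(w^{l+1}_{ik} w^{l+1}_{im}\phi'(h^l_k)\phi'(h^l_m) + \mu^2\delta_{ik}\delta_{im}\right)\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_m}{\partial h^{l_0}_j}\right] \\
&= \frac{\sigma_w^2}{N_l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(h^l_k)\phi'(h^l_k)\left(\sum_{j=1}^{N_{l_0}}\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right)\right] + \frac{1}{N_l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\mu^2\left(\sum_{j=1}^{N_{l_0}}\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right)\right] \\
&= \left(\sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l_k)\phi'(h^l_k)\right] + \mu^2\right)\mathbb{E}_\theta\left[\frac{1}{N_l}\sum_{k=1}^{N_l}\sum_{j=1}^{N_{l_0}}\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right] \\
&= \left(\sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l_k)\phi'(h^l_k)\right] + \mu^2\right)\mathcal{J}^{l_0,l} \\
\implies \mathcal{J}^{l_0,l+1} &= \chi^l_{\mathcal{J}}\,\mathcal{J}^{l_0,l} .
\end{aligned}
$$

The cross terms linear in $w^{l+1}$ vanish under the expectation because the weights have zero mean, and we used $N_{l+1} = N_l$.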

E RESIDUAL CONNECTIONS WITH LAYERNORM ON PREACTIVATIONS (PRE-LN)

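We recall the recurrence relation (1):

$$
h^{l+1}_i = \sum_{j=1}^{N_l} w^{l+1}_{ij}\phi(\tilde{h}^l_j) + b^{l+1}_i + \mu h^l_i .
$$

Theorem E.1. In the infinite width limit, the recurrence relation for the NNGP kernel is then modified to

$$
K^{l+1} = \frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\phi(\tilde{h}^l_j)\phi(\tilde{h}^l_j)\right] + \sigma_b^2 + \mu^2 K^l .
$$

Proof.

$$
\begin{aligned}
K^{l+1} &= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[h^{l+1}_i h^{l+1}_i\right] \\
&= \frac{1}{N_{l+1}}\sum_{i=1}^{N_{l+1}}\mathbb{E}_\theta\left[\left(\sum_{j=1}^{N_l} w^{l+1}_{ij}\phi(\tilde{h}^l_j) + b^{l+1}_i + \mu h^l_i\right)\left(\sum_{k=1}^{N_l} w^{l+1}_{ik}\phi(\tilde{h}^l_k) + b^{l+1}_i + \mu h^l_i\right)\right] \\
&= \frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\phi(\tilde{h}^l_j)\phi(\tilde{h}^l_j)\right] + \sigma_b^2 + \mu^2 K^l .
\end{aligned}
$$

Remark E.2. For $\mu < 1$, the recursion relation has a fixed point

$$
K^\star = \frac{\sigma_w^2}{N_{l^\star}(1 - \mu^2)}\sum_{j=1}^{N_{l^\star}}\mathbb{E}_\theta\left[\phi(\tilde{h}^{l^\star}_j)\phi(\tilde{h}^{l^\star}_j)\right] + \frac{\sigma_b^2}{1 - \mu^2} ,
$$

where the average is exactly the same as in the case of LayerNorm applied to preactivations without residual connections, and $l^\star$ labels some very large depth $l$.

Remark E.3. For $\mu = 1$, the solution of (63) is

$$
K^l = K^0 + \sum_{l'=1}^{l}\left(\frac{\sigma_w^2}{N_l}\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\phi(\tilde{h}^{l'}_j)\phi(\tilde{h}^{l'}_j)\right] + \sigma_b^2\right) ,
$$

which grows linearly, since the expectation does not depend on depth. $K^0$ is the NNGP kernel after the input layer.

Theorem E.4. In the infinite width limit, the recurrence relation for Jacobians changes by a constant shift in the recursion coefficient,

$$
\mathcal{J}^{l_0,l+1} = \chi^l_{\mathcal{J}}\,\mathcal{J}^{l_0,l} , \tag{67}
$$

where for this case

$$
\chi^l_{\mathcal{J}} = \frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right] + \mu^2 .
$$

Proof.

$$
\begin{aligned}
\mathcal{J}^{l_0,l+1} &= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\frac{\partial h^{l+1}_i}{\partial h^{l_0}_j}\frac{\partial h^{l+1}_i}{\partial h^{l_0}_j}\right] \\
&= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\left(\sum_{k=1}^{N_l}\frac{\partial h^{l+1}_i}{\partial\tilde{h}^l_k}\frac{\partial\tilde{h}^l_k}{\partial h^l_k}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right)\left(\sum_{m=1}^{N_l}\frac{\partial h^{l+1}_i}{\partial\tilde{h}^l_m}\frac{\partial\tilde{h}^l_m}{\partial h^l_m}\frac{\partial h^l_m}{\partial h^{l_0}_j}\right)\right] \\
&= \frac{1}{N_{l+1}}\mathbb{E}_\theta\left[\sum_{i=1}^{N_{l+1}}\sum_{j=1}^{N_{l_0}}\sum_{k,m=1}^{N_l}\left(\frac{w^{l+1}_{ik}\phi'(\tilde{h}^l_k)}{\sqrt{K^l}} + \mu\delta_{ik}\right)\left(\frac{w^{l+1}_{im}\phi'(\tilde{h}^l_m)}{\sqrt{K^l}} + \mu\delta_{im}\right)\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_m}{\partial h^{l_0}_j}\right] \\
&= \frac{1}{N_l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\left(\frac{\sigma_w^2}{K^l}\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k) + \mu^2\right)\left(\sum_{j=1}^{N_{l_0}}\frac{\partial h^l_k}{\partial h^{l_0}_j}\frac{\partial h^l_k}{\partial h^{l_0}_j}\right)\right] \\
&= \left(\frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right] + \mu^2\right)\mathcal{J}^{l_0,l} \\
&= \chi^l_{\mathcal{J}}\,\mathcal{J}^{l_0,l} .
\end{aligned}
$$

Remark E.5.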
One can directly use the results from the cases without residual connections. We will see momentarily that the phase boundary does not change with residual connections when $\mu < 1$. However, away from criticality the Jacobian decays much more slowly with depth, i.e. the correlation length is longer.

Remark E.6. As we mentioned above, $\mu = 1$ needs extra care. Plugging the result (66) and $\mu = 1$ into the recurrence coefficient, we find

$$
\chi^l_{\mathcal{J}}\Big|_{\mu=1} = \frac{\sigma_w^2\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right]}{N_l K^0 + \sum_{l'=1}^{l}\left(\sigma_w^2\sum_{j=1}^{N_l}\mathbb{E}_\theta\left[\phi(\tilde{h}^{l'}_j)\phi(\tilde{h}^{l'}_j)\right] + N_l\sigma_b^2\right)} + 1 \sim 1 + O\left(\frac{1}{l}\right) ,
$$

which leads to power-law behaved Jacobians at large depth, where the exponent $\zeta$ is not universal. Recall that $\xi = |\log\chi^\star_{\mathcal{J}}|^{-1}$; Theorem 1.3 is then a summary of (68) and (70) in the $l \to \infty$ limit.
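The two regimes described in Remarks E.2, E.3 and E.6 can be seen by iterating the Pre-LN recurrences directly. The sketch below is a minimal illustration, assuming the erf activation, for which the two Gaussian averages over a standard normal variable are $\mathbb{E}[\phi^2] = \frac{2}{\pi}\arcsin\frac{2}{3}$ and $\mathbb{E}[\phi'^2] = \frac{4}{\pi\sqrt{5}}$. The function name `preln_flow` and the starting value `k0` are hypothetical.

```python
import math

# Gaussian averages for erf evaluated on a standard normal variable.
PHI2 = 2.0 / math.pi * math.asin(2.0 / 3.0)   # E[erf(z)^2]
DPHI2 = 4.0 / (math.pi * math.sqrt(5.0))      # E[erf'(z)^2]


def preln_flow(depth, sigma_w2, sigma_b2, mu, k0=1.0):
    """Iterate K^{l+1} = sigma_w^2 E[phi^2] + sigma_b^2 + mu^2 K^l and
    chi_J^l = sigma_w^2 E[phi'^2] / K^l + mu^2 for a Pre-LN residual network."""
    kernels, chis = [k0], []
    for _ in range(depth):
        k = kernels[-1]
        chis.append(sigma_w2 * DPHI2 / k + mu ** 2)
        kernels.append(sigma_w2 * PHI2 + sigma_b2 + mu ** 2 * k)
    return kernels, chis


if __name__ == "__main__":
    # mu < 1: the kernel converges to a finite fixed point.
    ks, _ = preln_flow(200, 1.5, 0.2, mu=0.5)
    print(ks[-1], (1.5 * PHI2 + 0.2) / (1.0 - 0.5 ** 2))
    # mu = 1: the kernel grows linearly and chi_J approaches 1 like 1/l.
    ks, chis = preln_flow(10_000, 1.5, 0.2, mu=1.0)
    print(ks[-1] - ks[-2], (len(chis) - 1) * (chis[-1] - 1.0))
```

For $\mu = 1$ the product $l\,(\chi^l_{\mathcal{J}} - 1)$ settles to a constant that depends on $\sigma_w$ and $\sigma_b$, consistent with a power law whose exponent is not universal.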

F MLP-MIXER

In this section we analyze an architecture called MLP-Mixer (Tolstikhin et al., 2021), which is based on multi-layer perceptrons (MLPs). An MLP-Mixer (i) chops images into patches and applies an affine transformation per patch, (ii) applies several Mixer Layers, and (iii) applies a pre-head LayerNorm, Global Average Pooling, and an output affine transformation. We explain the architecture by writing out the forward pass equations.

Suppose one has a single input with dimension $(C_{in}, H_{in}, W_{in})$. We label it as $x_{\mu i}$, where the Greek letter labels channels and the Latin letter labels flattened pixels.

Step (i) is realized by a special convolutional layer whose kernel size $f$ is equal to the stride $s$. This first convolutional layer can be written as

$$
h^0_{\mu i} = \sum_{j=1}^{f^2}\sum_{\nu=1}^{C_{in}} W^0_{\mu\nu;j}\, x_{\nu,\, j+(i-1)s^2} + b^0_{\mu i} ,
$$

where $f$ is the size of the filter and $s$ is the stride. In our example $f = s$. Notice that in PyTorch both biases and weights are sampled from a uniform distribution $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = (C_{in} f^2)^{-1}$:

$$
\mathbb{E}_\theta\left[W^0_{\mu\nu;i} W^0_{\rho\sigma;j}\right] = \frac{1}{3 C_{in} f^2}\delta_{\mu\rho}\delta_{\nu\sigma}\delta_{ij} , \qquad \mathbb{E}_\theta\left[b^0_{\mu i} b^0_{\nu j}\right] = \frac{1}{3 C_{in} f^2}\delta_{\mu\nu}\delta_{ij} .
$$

Notice that the output of Conv2d is $h^0_{\mu i} \in \mathbb{R}^{C \times N_p}$, where $C$ stands for channels and $N_p = H_{in} W_{in}/f^2$ stands for patches; both will be mixed later by the Mixer Layers.

Next we stack $l$ Mixer Layers. A Mixer Layer contains LayerNorms and two MLPs, where the first one mixes patches $i, j$ (token mixing) with a hidden dimension $N_{tm}$, and the second one mixes channels $\mu, \nu$ (channel mixing) with a hidden dimension $N_{cm}$. Notice that for Mixer Layers we use the standard parameterization.

• First LayerNorm. It acts on channels $\mu$.

$$
\tilde{h}^{6l}_{\mu i} = \frac{h^{6l}_{\mu i} - \mathbb{E}_C[h^{6l}_{\rho i}]}{\sqrt{\mathrm{Var}_C[h^{6l}_{\rho i}]}} ,
$$

where we defined the channel mean $\mathbb{E}_C[h^{6l}_{\rho i}] \equiv \frac{1}{C}\sum_{\rho=1}^{C} h^{6l}_{\rho i}$ and the channel variance $\mathrm{Var}_C[h^{6l}_{\rho i}] \equiv \mathbb{E}_C\left[(h^{6l}_{\rho i})^2\right] - \mathbb{E}_C[h^{6l}_{\rho i}]^2$.

• First MlpBlock. It mixes patches $i, j$; preactivations from different channels share the same weights and biases.

  - $6l + 1$: Linear Affine Layer.
    $$h^{6l+1}_{\mu j} = \sum_{k=1}^{N_p} w^{6l+1}_{jk}\,\tilde{h}^{6l}_{\mu k} + b^{6l+1}_j .$$
  - $6l + 2$: Affine Layer.
    $$h^{6l+2}_{\mu i} = \sum_{j=1}^{N_{tm}} w^{6l+2}_{ij}\,\phi(h^{6l+1}_{\mu j}) + b^{6l+2}_i ,$$
    where $N_{tm}$ stands for the hidden dimension of "token mixing".
  - $6l + 3$: Residual Connections.
    $$h^{6l+3}_{\mu i} = h^{6l+2}_{\mu i} + \mu h^{6l}_{\mu i} .$$

• Second LayerNorm. It again acts on channels $\mu$.

$$
\tilde{h}^{6l+3}_{\mu i} = \frac{h^{6l+3}_{\mu i} - \mathbb{E}_C[h^{6l+3}_{\rho i}]}{\sqrt{\mathrm{Var}_C[h^{6l+3}_{\rho i}]}} .
$$

• Second MlpBlock. It mixes channels $\mu, \nu$; preactivations from different patches share the same weights and biases.

  - $6l + 4$: Linear Affine Layer.
    $$h^{6l+4}_{\nu i} = \sum_{\rho=1}^{C} w^{6l+4}_{\nu\rho}\,\tilde{h}^{6l+3}_{\rho i} + b^{6l+4}_\nu .$$
  - $6l + 5$: Affine Layer.
    $$h^{6l+5}_{\mu i} = \sum_{\nu=1}^{N_{cm}} w^{6l+5}_{\mu\nu}\,\phi(h^{6l+4}_{\nu i}) + b^{6l+5}_\mu .$$
  - $6l + 6$: Residual Connections.
    $$h^{6l+6}_{\mu i} = h^{6l+5}_{\mu i} + \mu h^{6l+3}_{\mu i} .$$

Suppose the network has $L$ Mixer Layers. After those layers the network has a pre-head LayerNorm layer, a global average pooling layer and an output layer.

The pre-head LayerNorm normalizes over channels $\mu$:

$$
\tilde{h}^{6L}_{\mu i} = \frac{h^{6L}_{\mu i} - \mathbb{E}_C[h^{6L}_{\rho i}]}{\sqrt{\mathrm{Var}_C[h^{6L}_{\rho i}]}} .
$$

Global Average Pool over patches $i$:

$$
h^p_\mu = \frac{1}{N_p}\sum_{i=1}^{N_p}\tilde{h}^{6L}_{\mu i} .
$$

Output Layer:

$$
f_\mu = \sum_{\nu=1}^{C} w_{\mu\nu}\, h^p_\nu + b_\mu .
$$

We plotted the phase diagram using the following quantity, obtained from repeating Mixer Layers:

$$
\chi^\star_{\mathcal{J}} = \lim_{L\to\infty}\left(\frac{1}{N_p C}\sum_{i=1}^{N_p}\sum_{\mu=1}^{C}\mathbb{E}_\theta\left[\sum_{\rho=1}^{C}\sum_{k=1}^{N_p}\frac{\partial h^{6L}_{\mu i}}{\partial h^{6L-6}_{\rho k}}\frac{\partial h^{6L}_{\mu i}}{\partial h^{6L-6}_{\rho k}}\right]\right) .
$$
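The six sub-steps of a Mixer Layer listed above can be sketched as a plain NumPy forward pass. This is only an illustration under stated assumptions: weights are drawn from $\mathcal{N}(0, \sigma_w^2/\mathrm{fan\_in})$ and biases from $\mathcal{N}(0, \sigma_b^2)$ (the exact initializer is an assumption here), erf is used as a stand-in activation, and the helper names (`layer_norm_channels`, `init_mixer_layer`, `mixer_layer`) are hypothetical. Preactivations are stored as arrays of shape $(C, N_p)$.

```python
import math

import numpy as np

erf = np.vectorize(math.erf)


def layer_norm_channels(h):
    """Normalize over the channel axis (axis 0) separately for each patch."""
    mean = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)  # E_C[h^2] - E_C[h]^2
    return (h - mean) / np.sqrt(var)


def init_affine(rng, fan_out, fan_in, sigma_w2, sigma_b2):
    w = rng.normal(0.0, math.sqrt(sigma_w2 / fan_in), size=(fan_out, fan_in))
    b = rng.normal(0.0, math.sqrt(sigma_b2), size=fan_out)
    return w, b


def init_mixer_layer(rng, n_ch, n_patch, n_tm, n_cm, sigma_w2, sigma_b2):
    return (
        init_affine(rng, n_tm, n_patch, sigma_w2, sigma_b2),  # step 6l+1
        init_affine(rng, n_patch, n_tm, sigma_w2, sigma_b2),  # step 6l+2
        init_affine(rng, n_cm, n_ch, sigma_w2, sigma_b2),     # step 6l+4
        init_affine(rng, n_ch, n_cm, sigma_w2, sigma_b2),     # step 6l+5
    )


def mixer_layer(h, params, mu=1.0, phi=erf):
    """One Mixer Layer acting on h of shape (channels, patches)."""
    (w1, b1), (w2, b2), (w4, b4), (w5, b5) = params
    # Token mixing: weights act on the patch index, shared across channels.
    h1 = layer_norm_channels(h) @ w1.T + b1       # (C, N_tm)
    h2 = phi(h1) @ w2.T + b2                      # (C, N_p)
    h3 = h2 + mu * h                              # residual connection
    # Channel mixing: weights act on the channel index, shared across patches.
    h4 = w4 @ layer_norm_channels(h3) + b4[:, None]   # (N_cm, N_p)
    h5 = w5 @ phi(h4) + b5[:, None]                   # (C, N_p)
    return h5 + mu * h3                           # residual connection


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_mixer_layer(rng, 8, 16, 12, 10, sigma_w2=1.0, sigma_b2=0.1)
    out = mixer_layer(rng.standard_normal((8, 16)), params)
    print(out.shape)
```

Stacking calls to `mixer_layer` with independently drawn `params` gives the repeated block whose layer-to-layer Jacobian enters the quantity $\chi^\star_{\mathcal{J}}$ defined above.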

G RESULTS FOR SCALE INVARIANT ACTIVATION FUNCTIONS

Definition G.1 (Scale invariant activation functions).

$$
\phi(x) = a_+\, x\,\Theta(x) + a_-\, x\,\Theta(-x) ,
$$

where $\Theta(x)$ is the Heaviside step function. ReLU is the special case with $a_+ = 1$ and $a_- = 0$.

G.1 NNGP KERNEL

First we evaluate the average using Lemma 2.2:

$$
\mathbb{E}_\theta\left[\phi(h^l_i)\phi(h^l_i)\right] = \frac{1}{\sqrt{2\pi K^l}}\int_0^\infty dh^l_i\,\left(a_+^2 + a_-^2\right)\left(h^l_i\right)^2 e^{-\frac{(h^l_i)^2}{2K^l}} = \frac{a_+^2 + a_-^2}{2}K^l .
$$

Thus we obtain the recurrence relation for the NNGP kernel with a scale invariant activation function:

$$
K^{l+1} = \frac{\sigma_w^2(a_+^2 + a_-^2)}{2}K^l + \sigma_b^2 .
$$

A finite fixed point of the recurrence relation above exists only if

$$
\chi^\star_K = \frac{\sigma_w^2(a_+^2 + a_-^2)}{2} \leq 1 .
$$

As a result,

$$
\sigma_w^2 \leq \frac{2}{a_+^2 + a_-^2} .
$$

For $\sigma_w^2 = \frac{2}{a_+^2 + a_-^2}$, a finite fixed point exists only if $\sigma_b^2 = 0$.

G.2 JACOBIAN(S)

The calculation is straightforward. By definition,

$$
\begin{aligned}
\chi^l_{\mathcal{J}} &= \sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l_i)\phi'(h^l_i)\right] \\
&= \frac{\sigma_w^2}{\sqrt{2\pi K^l}}\int dh^l_i\,\left[a_+\Theta(h^l_i) + a_-\Theta(-h^l_i)\right]^2 e^{-\frac{(h^l_i)^2}{2K^l}} \\
&= \frac{\sigma_w^2(a_+^2 + a_-^2)}{2} ,
\end{aligned}
$$

where we used the property $x\delta(x) = 0$ of Dirac's delta function to simplify $\phi'$. Thus the critical line is defined by

$$
\sigma_w = \sqrt{\frac{2}{a_+^2 + a_-^2}} .
$$

For ReLU with $a_+ = 1$ and $a_- = 0$, the network is on the critical line when $\sigma_w = \sqrt{2}$, and the critical point is located at $(\sigma_w, \sigma_b) = (\sqrt{2}, 0)$.
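The closed form for $\chi_{\mathcal{J}}$ can be cross-checked by sampling. The sketch below is a minimal illustration, assuming a Monte Carlo estimate with a fixed seed; the function names are hypothetical. It uses a leaky-ReLU-like choice $a_+ = 1$, $a_- = 0.1$ as an example and also returns the critical $\sigma_w$.

```python
import numpy as np


def chi_j_scale_invariant_mc(a_plus, a_minus, sigma_w2, kernel=1.0,
                             n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of sigma_w^2 E[phi'(h)^2] for h ~ N(0, kernel)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(0.0, np.sqrt(kernel), size=n_samples)
    dphi = np.where(h > 0, a_plus, a_minus)  # phi'(h) away from h = 0
    return sigma_w2 * float(np.mean(dphi ** 2))


def chi_j_scale_invariant(a_plus, a_minus, sigma_w2):
    """Closed form: sigma_w^2 (a_+^2 + a_-^2) / 2, independent of the kernel."""
    return sigma_w2 * (a_plus ** 2 + a_minus ** 2) / 2.0


def critical_sigma_w(a_plus, a_minus):
    """sigma_w on the critical line chi_J = 1."""
    return float(np.sqrt(2.0 / (a_plus ** 2 + a_minus ** 2)))


if __name__ == "__main__":
    print(chi_j_scale_invariant_mc(1.0, 0.1, 1.5, kernel=3.0))
    print(chi_j_scale_invariant(1.0, 0.1, 1.5))
    print(critical_sigma_w(1.0, 0.0))  # ReLU
```

The estimate does not depend on the value of `kernel`, which reflects the scale invariance of this family of activations.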

G.3 CRITICAL EXPONENTS

Since the recurrence relations for the NNGP kernel and Jacobians are linear, Lemma C.1 and Theorem 2.7 give

$$
\zeta_K = 0 \quad \text{and} \quad \zeta = 0 .
$$

G.4 LAYERNORM ON PRE-ACTIVATIONS

Using Lemma B.9 and combining all known results for scale invariant functions, we get

$$
\chi^l_{\mathcal{J}} = \frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right]\bigg|_{\tilde{K}^{l-1}=1} = \frac{\sigma_w^2(a_+^2 + a_-^2)}{\sigma_w^2(a_+^2 + a_-^2) + 2\sigma_b^2} .
$$

For this case,

$$
\chi^l_{\mathcal{J}} \leq 1 \tag{97}
$$

is always true. The equality holds only on the $\sigma_b = 0$ line.

G.5 LAYERNORM ON ACTIVATIONS

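First we substitute $K^{l-1} = \sigma_w^2 + \sigma_b^2$ into known results:

$$
\mathbb{E}_\theta\left[\phi'(h^l_i)\phi'(h^l_i)\right] = \frac{a_+^2 + a_-^2}{2} , \qquad
\mathbb{E}_\theta\left[\phi(h^l_i)\phi(h^l_i)\right] = \frac{a_+^2 + a_-^2}{2}\left(\sigma_w^2 + \sigma_b^2\right) .
$$

There is a new expectation value that we need to evaluate explicitly:

$$
\begin{aligned}
\mathbb{E}_\theta\left[\phi(h^l_i)\right] &= \frac{1}{\sqrt{2\pi(\sigma_w^2 + \sigma_b^2)}}\int_{-\infty}^{\infty} dh^l_i\,\phi(h^l_i)\, e^{-\frac{1}{2}h^l_i(\sigma_w^2 + \sigma_b^2)^{-1}h^l_i} \\
&= \frac{1}{\sqrt{2\pi(\sigma_w^2 + \sigma_b^2)}}\int_{0}^{\infty} dh^l_i\,(a_+ - a_-)\,h^l_i\, e^{-\frac{(h^l_i)^2}{2(\sigma_w^2 + \sigma_b^2)}} \\
&= (a_+ - a_-)\sqrt{\frac{\sigma_w^2 + \sigma_b^2}{2\pi}} .
\end{aligned}
$$

Thus

$$
\chi^l_{\mathcal{J}} = \frac{\sigma_w^2}{\sigma_w^2 + \sigma_b^2}\cdot\frac{\pi(a_+^2 + a_-^2)}{\pi(a_+^2 + a_-^2) - (a_+ - a_-)^2} .
$$

The critical line is defined by $\chi^\star_{\mathcal{J}} = 1$, which can be solved as

$$
\sigma_b = \sqrt{\frac{(a_+ - a_-)^2}{\pi(a_+^2 + a_-^2) - (a_+ - a_-)^2}}\;\sigma_w .
$$

For ReLU with $a_+ = 1$ and $a_- = 0$,

$$
\sigma_b = \sqrt{\frac{1}{\pi - 1}}\;\sigma_w \approx 0.683\,\sigma_w .
$$

G.6 RESIDUAL CONNECTIONS

The recurrence relation for the NNGP kernel can be evaluated to be

$$
K^{l+1} = \frac{\sigma_w^2(a_+^2 + a_-^2)}{2}K^l + \sigma_b^2 + \mu^2 K^l .
$$

The condition for the existence of a fixed point,

$$
\chi^\star_K = \frac{\sigma_w^2(a_+^2 + a_-^2)}{2} + \mu^2 \leq 1 , \tag{105}
$$

leads to

$$
\sigma_w^2 \leq \frac{2(1 - \mu^2)}{a_+^2 + a_-^2} .
$$

For $\sigma_w^2 = \frac{2(1-\mu^2)}{a_+^2 + a_-^2}$, a finite fixed point exists only if $\sigma_b^2 = 0$ (the kernel diverges linearly otherwise).

The recurrence coefficient for the Jacobian is evaluated to be

$$
\chi^\star_{\mathcal{J}} = \frac{\sigma_w^2(a_+^2 + a_-^2)}{2} + \mu^2 .
$$

The critical line is defined as

$$
\sigma_w = \sqrt{\frac{2(1 - \mu^2)}{a_+^2 + a_-^2}} .
$$

The critical point is located at $\left(\sqrt{\frac{2(1-\mu^2)}{a_+^2 + a_-^2}},\, 0\right)$. For ReLU, the critical point is at $\left(\sqrt{2(1 - \mu^2)},\, 0\right)$.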

G.7 RESIDUAL CONNECTIONS WITH LAYERNORM ON PREACTIVATIONS (PRE-LN)

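Again we use Lemma B.9 and combine all known results for scale invariant functions:

$$
\begin{aligned}
\chi^\star_{\mathcal{J}} &= \lim_{l\to\infty}\left(\frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right]\bigg|_{\tilde{K}^{l-1}=1} + \mu^2\right) \\
&= \frac{\sigma_w^2(a_+^2 + a_-^2)(1 - \mu^2)}{\sigma_w^2(a_+^2 + a_-^2) + 2\sigma_b^2} + \mu^2 \\
&= 1 - \frac{2\sigma_b^2(1 - \mu^2)}{\sigma_w^2(a_+^2 + a_-^2) + 2\sigma_b^2} .
\end{aligned}
$$

Similar to the case without residual connections,

$$
\chi^l_{\mathcal{J}} \leq 1 \tag{110}
$$

is always true. The equality holds only on the $\sigma_b = 0$ line for $\mu < 1$. Notice that there is a very special case, $\mu = 1$, where the whole $\sigma_b$-$\sigma_w$ plane is critical.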
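The closed form above is simple enough to explore directly. The sketch below is a minimal illustration with hypothetical function names; it evaluates $\chi^\star_{\mathcal{J}}$ for scale invariant activations with Pre-LN and residual connections, together with the correlation length $\xi = |\log\chi^\star_{\mathcal{J}}|^{-1}$ recalled in Remark E.6.

```python
import math


def chi_j_preln_scale_invariant(sigma_w2, sigma_b2, mu, a_plus=1.0, a_minus=0.0):
    """chi_J^* for scale invariant activations with Pre-LN and residual strength mu."""
    a2 = a_plus ** 2 + a_minus ** 2
    return sigma_w2 * a2 * (1.0 - mu ** 2) / (sigma_w2 * a2 + 2.0 * sigma_b2) + mu ** 2


def correlation_length(chi):
    """xi = |log chi|^{-1}; infinite exactly at criticality (chi = 1)."""
    if chi == 1.0:
        return math.inf
    return 1.0 / abs(math.log(chi))


if __name__ == "__main__":
    for mu in (0.0, 0.5, 0.9, 1.0):
        chi = chi_j_preln_scale_invariant(2.0, 0.5, mu)
        print(mu, chi, correlation_length(chi))
```

For fixed $(\sigma_w, \sigma_b)$ the correlation length grows as $\mu$ approaches 1 and becomes infinite at $\mu = 1$, matching the statement that the whole plane is critical in that case.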

H RESULTS FOR ERF ACTIVATION FUNCTION

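Definition H.1 (erf activation function).

$$
\phi(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt .
$$

H.1 NNGP KERNEL

To evaluate Lemma 2.2 exactly, we introduce two dummy variables $\lambda_1$ and $\lambda_2$ (Williams, 1997):

$$
\begin{aligned}
\mathbb{E}_\theta\left[\phi(\lambda_1 h^l_i)\phi(\lambda_2 h^l_i)\right] &= \int d\lambda_1\int d\lambda_2\,\frac{d^2}{d\lambda_1 d\lambda_2}\mathbb{E}_\theta\left[\phi(\lambda_1 h^l_i)\phi(\lambda_2 h^l_i)\right] \\
&= \int d\lambda_1\int d\lambda_2\int dh^l_i\,\frac{4}{\sqrt{2\pi^3 K^l}}\left(h^l_i\right)^2 e^{-\left(\lambda_1^2 + \lambda_2^2 + \frac{1}{2K^l}\right)(h^l_i)^2} \\
&= \int d\lambda_1\int d\lambda_2\,\frac{4K^l}{\pi\left(1 + 2K^l(\lambda_1^2 + \lambda_2^2)\right)^{3/2}} \\
&= \frac{2}{\pi}\arcsin\left(\frac{2K^l\lambda_1\lambda_2}{\sqrt{(1 + 2K^l\lambda_1^2)(1 + 2K^l\lambda_2^2)}}\right) .
\end{aligned}
$$

We use the special case $\lambda_1 = \lambda_2 = 1$. Thus the recurrence relation for the NNGP kernel with the erf activation function is

$$
K^{l+1} = \frac{2\sigma_w^2}{\pi}\arcsin\left(\frac{2K^l}{1 + 2K^l}\right) + \sigma_b^2 .
$$

As in the scale invariant case, a finite fixed point exists only when

$$
\chi^\star_K = \frac{4\sigma_w^2}{\pi}\frac{1}{(1 + 2K^\star)\sqrt{1 + 4K^\star}} \leq 1 .
$$

Numerical results show that the condition is satisfied everywhere in the $\sigma_b$-$\sigma_w$ plane, where $\chi^\star_K = 1$ is only possible when $K^\star = 0$.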

H.2 JACOBIANS

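Following the definition,

$$
\chi^l_{\mathcal{J}} = \sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l_i)\phi'(h^l_i)\right] = \frac{4\sigma_w^2}{\sqrt{2\pi^3 K^l}}\int dh^l_i\, e^{-2(h^l_i)^2} e^{-\frac{(h^l_i)^2}{2K^l}} = \frac{4\sigma_w^2}{\pi}\frac{1}{\sqrt{1 + 4K^l}} .
$$

To find the phase boundary $\chi^\star_{\mathcal{J}} = 1$, we need to combine Eq.(113) and Eq.(115) and evaluate them at $K^\star$:

$$
K^\star = \frac{2\sigma_w^2}{\pi}\arcsin\left(\frac{2K^\star}{1 + 2K^\star}\right) + \sigma_b^2 , \qquad
\chi^\star_{\mathcal{J}} = \frac{4\sigma_w^2}{\pi}\frac{1}{\sqrt{1 + 4K^\star}} = 1 .
$$

One can solve the equations above and find the critical line

$$
\sigma_b = \sqrt{\frac{16\sigma_w^4 - \pi^2}{4\pi^2} - \frac{2\sigma_w^2}{\pi}\arcsin\left(\frac{16\sigma_w^4 - \pi^2}{16\sigma_w^4 + \pi^2}\right)} .
$$

The critical point is reached by further requiring $\chi^\star_K = 1$. Since $\chi^\star_K \leq \chi^\star_{\mathcal{J}}$, the only possible case is $K^\star = 0$, which is located at

$$
(\sigma_w, \sigma_b) = \left(\sqrt{\frac{\pi}{4}},\, 0\right) .
$$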
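The critical line for erf can be traced numerically and checked for self-consistency. The sketch below is a minimal illustration with hypothetical function names: given $\sigma_w$ it returns the fixed point $K^\star$ and the $\sigma_b$ on the critical line, and the kernel recurrence and $\chi_{\mathcal{J}}$ are provided so that both conditions can be verified.

```python
import math


def erf_kernel_step(k, sigma_w2, sigma_b2):
    """One step of the NNGP kernel recurrence for erf."""
    return 2.0 * sigma_w2 / math.pi * math.asin(2.0 * k / (1.0 + 2.0 * k)) + sigma_b2


def erf_chi_j(k, sigma_w2):
    """Jacobian recurrence coefficient for erf at kernel value k."""
    return 4.0 * sigma_w2 / (math.pi * math.sqrt(1.0 + 4.0 * k))


def erf_critical_line(sigma_w):
    """Return (K_star, sigma_b) on the critical line chi_J = 1."""
    sw2 = sigma_w ** 2
    if sw2 < math.pi / 4.0 - 1e-12:
        raise ValueError("the critical line requires sigma_w^2 >= pi / 4")
    # chi_J = 1 fixes K_star; the kernel fixed-point equation then fixes sigma_b.
    k_star = max((16.0 * sw2 ** 2 - math.pi ** 2) / (4.0 * math.pi ** 2), 0.0)
    sb2 = k_star - 2.0 * sw2 / math.pi * math.asin(2.0 * k_star / (1.0 + 2.0 * k_star))
    return k_star, math.sqrt(max(sb2, 0.0))


if __name__ == "__main__":
    for sigma_w in (0.9, 1.0, 1.5, 2.0):
        k_star, sigma_b = erf_critical_line(sigma_w)
        print(sigma_w, sigma_b, erf_chi_j(k_star, sigma_w ** 2))
```

Scanning `sigma_w` upward from $\sqrt{\pi/4}$ reproduces the phase boundary; the endpoint with $K^\star = 0$ is the critical point.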

H.3 CRITICAL EXPONENTS

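We show how to extract the critical exponents of the NNGP kernel and Jacobians for the erf activation function. The critical point for erf is at $(\sigma_b, \sigma_w) = \left(0, \sqrt{\frac{\pi}{4}}\right)$, with $K^\star = 0$. Now suppose $l$ is large enough that the deviation of $K^l$ from the fixed point value $K^\star$ is small. Define $\delta K^l \equiv K^l - K^\star$. Eq.(113) can be rewritten as

$$
\delta K^{l+1} = \frac{1}{2}\arcsin\left(\frac{2\delta K^l}{1 + 2\delta K^l}\right) \approx \delta K^l - 2(\delta K^l)^2 . \tag{120}
$$

From Lemma C.1,

$$
A = \frac{1}{2} \quad \text{and} \quad \zeta_K = 1 .
$$

Next we analyze the critical exponent of the Jacobians by expanding (115) around the $K^\star = 0$ critical point $(\sigma_b, \sigma_w) = \left(0, \sqrt{\frac{\pi}{4}}\right)$. To leading order in $l^{-1}$ we have

$$
\chi^l_{\mathcal{J}} \approx 1 - 2\delta K^l \approx 1 - \frac{1}{l} .
$$

Thus the recurrence relation for the partial Jacobian, at large $l$, takes the form

$$
\mathcal{J}^{l_0,l+1} = \left(1 - \frac{1}{l}\right)\mathcal{J}^{l_0,l} .
$$

At large $l$,

$$
\mathcal{J}^{l_0,l} = c_{l_0}\, l^{-1} ,
$$

with a non-universal constant $c_{l_0}$. The critical exponent is

$$
\zeta = 1 ,
$$

which is the same as $\zeta_K$.

H.4 LAYERNORM ON PRE-ACTIVATIONS

Using Lemma B.9, we have

$$
\chi^l_{\mathcal{J}} = \frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right]\bigg|_{\tilde{K}^{l-1}=1} = \frac{4\sigma_w^2}{\sqrt{5}\left[2\sigma_w^2\arcsin\left(\frac{2}{3}\right) + \pi\sigma_b^2\right]} .
$$

The critical line is then defined by

$$
\sigma_b = \sqrt{\frac{2}{\pi}\left[\frac{2}{\sqrt{5}} - \arcsin\left(\frac{2}{3}\right)\right]}\;\sigma_w \approx 0.324\,\sigma_w .
$$

H.5 LAYERNORM ON ACTIVATIONS

Due to the symmetry of the erf activation function, $\mathbb{E}_\theta\left[\phi(h^l_i)\right] = 0$, so we only need to modify our known results:

$$
\mathbb{E}_\theta\left[\phi'(h^l_i)\phi'(h^l_i)\right] = \frac{4}{\pi}\frac{1}{\sqrt{1 + 4(\sigma_w^2 + \sigma_b^2)}} , \qquad
\mathbb{E}_\theta\left[\phi(h^l_i)\phi(h^l_i)\right] = \frac{2}{\pi}\arcsin\left(\frac{2(\sigma_w^2 + \sigma_b^2)}{1 + 2(\sigma_w^2 + \sigma_b^2)}\right) .
$$

Thus

$$
\chi^l_{\mathcal{J}} = \frac{2\sigma_w^2}{\sqrt{1 + 4(\sigma_w^2 + \sigma_b^2)}}\cdot\frac{1}{\arcsin\left(\frac{2(\sigma_w^2 + \sigma_b^2)}{1 + 2(\sigma_w^2 + \sigma_b^2)}\right)} ,
$$

where the phase boundary is defined by the transcendental equation $\chi^l_{\mathcal{J}} = 1$.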
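The scaling analysis of Section H.3 can be checked by iterating the exact recurrences at the erf critical point $\sigma_w^2 = \pi/4$, $\sigma_b = 0$, without LayerNorm or residual connections. The sketch below is a minimal illustration with a hypothetical function name and an arbitrary starting value `k0`: it tracks $K^l$ and $\log\mathcal{J}$, so that $l\,K^l \to A = 1/2$ and the log-log slope of $\mathcal{J}$ approaching $-\zeta = -1$ can be read off.

```python
import math


def erf_critical_flow(depth, k0=1.0):
    """Iterate the erf recurrences at sigma_w^2 = pi / 4, sigma_b = 0.

    Returns (kernels, log_jacobians), where kernels[n] is K after n + 1 layers
    and log_jacobians[n] is the accumulated sum of log chi_J over those layers.
    """
    k, log_j = k0, 0.0
    kernels, log_jacobians = [], []
    for _ in range(depth):
        # At sigma_w^2 = pi / 4: chi_J = (4 sigma_w^2 / pi) / sqrt(1 + 4K) = 1 / sqrt(1 + 4K).
        log_j += -0.5 * math.log(1.0 + 4.0 * k)
        # Kernel recurrence with 2 sigma_w^2 / pi = 1 / 2.
        k = 0.5 * math.asin(2.0 * k / (1.0 + 2.0 * k))
        kernels.append(k)
        log_jacobians.append(log_j)
    return kernels, log_jacobians


if __name__ == "__main__":
    ks, log_js = erf_critical_flow(100_000)
    print("l * K^l        :", len(ks) * ks[-1])
    slope = (log_js[99_999] - log_js[9_999]) / math.log(10.0)
    print("log-log slope  :", slope)
```

The convergence toward the asymptotic values is slow (corrections decay only as a power of $l$ with logarithms), which is why a large depth is used in the usage example.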

H.6 RESIDUAL CONNECTIONS

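The recurrence relation for the NNGP kernel can be evaluated to be

$$
K^{l+1} = \frac{2\sigma_w^2}{\pi}\arcsin\left(\frac{2K^l}{1 + 2K^l}\right) + \sigma_b^2 + \mu^2 K^l .
$$

A finite fixed point exists only when

$$
\chi^\star_K = \frac{4\sigma_w^2}{\pi}\frac{1}{(1 + 2K^\star)\sqrt{1 + 4K^\star}} + \mu^2 \leq 1 .
$$

Notice that $\chi^\star_K \leq \chi^\star_{\mathcal{J}}$ still holds, where the equality holds only when $K^\star = 0$.

The recurrence coefficient for the Jacobian is evaluated to be

$$
\chi^\star_{\mathcal{J}} = \frac{4\sigma_w^2}{\pi}\frac{1}{\sqrt{1 + 4K^\star}} + \mu^2 .
$$

The critical line is defined as

$$
\sigma_b = \sqrt{\frac{16\sigma_w^4 - \pi^2(1 - \mu^2)^2}{4\pi^2(1 - \mu^2)} - \frac{2\sigma_w^2}{\pi}\arcsin\left(\frac{16\sigma_w^4 - \pi^2(1 - \mu^2)^2}{16\sigma_w^4 + \pi^2(1 - \mu^2)^2}\right)} .
$$

The critical point is reached by further requiring $\chi^\star_K = 1$. Since $\chi^\star_K \leq \chi^\star_{\mathcal{J}}$, the only possible case is $K^\star = 0$, which is located at

$$
(\sigma_w, \sigma_b) = \left(\sqrt{\frac{\pi(1 - \mu^2)}{4}},\, 0\right) .
$$

Note that for $\mu = 1$, analyzing the scaling behavior requires extra effort. First we notice that $K^l$ increases monotonically with depth $l$: the recurrence relation for the NNGP kernel at large $l$ (or large $K^l$) is

$$
K^{l+1} \approx \sigma_w^2 + \sigma_b^2 + K^l ,
$$

which regulates the first term in (133). For $\mu = 1$ at large depth,

$$
\chi^l_{\mathcal{J}} \sim 1 + \frac{4\sigma_w^2}{\pi\sqrt{C_0 + 4(\sigma_w^2 + \sigma_b^2)\,l}} .
$$

Here $C_0$ is a constant that depends on the input. We can approximate the asymptotic form of $\log\mathcal{J}^{l_0,l}$ as follows:

$$
\begin{aligned}
\log\mathcal{J}^{l_0,l} &= \log\left(\prod_{l'=l_0}^{l}\chi^{l'}_{\mathcal{J}}\right) \\
&= \sum_{l'=l_0}^{l}\log\left(1 + \frac{4\sigma_w^2}{\pi\sqrt{C_0 + 4(\sigma_w^2 + \sigma_b^2)\,l'}}\right) \\
&\approx \int_{l_0}^{l} dl'\,\log\left(1 + \frac{4\sigma_w^2}{\pi\sqrt{C_0 + 4(\sigma_w^2 + \sigma_b^2)\,l'}}\right) \\
&\sim 2c\sqrt{l} + O(\log l) ,
\end{aligned}
$$

where $c = \frac{2\sigma_w^2}{\pi\sqrt{\sigma_w^2 + \sigma_b^2}}$. We conclude that at large depth, the APJN for erf networks with $\mu = 1$ can be written as

$$
\mathcal{J}^{l_0,l} \sim O\left(e^{2c\sqrt{l} + O(\log l)}\right) .
$$

This result checks out empirically, as shown in Figure 5.

H.7 RESIDUAL CONNECTIONS WITH LAYERNORM ON PREACTIVATIONS (PRE-LN)

We use Lemma B.9 and the results obtained without residual connections for erf with LayerNorm on preactivations:

$$
\chi^\star_{\mathcal{J}} = \lim_{l\to\infty}\left(\frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right]\bigg|_{\tilde{K}^{l-1}=1} + \mu^2\right) = \frac{4\sigma_w^2(1 - \mu^2)}{\sqrt{5}\left[2\sigma_w^2\arcsin\left(\frac{2}{3}\right) + \pi\sigma_b^2\right]} + \mu^2 .
$$

The critical line is then defined by

$$
\sigma_b = \sqrt{\frac{2}{\pi}\left[\frac{2}{\sqrt{5}} - \arcsin\left(\frac{2}{3}\right)\right]}\;\sigma_w \approx 0.324\,\sigma_w .
$$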
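
I RESULTS FOR GELU ACTIVATION FUNCTION

Definition I.1 (GELU activation function).

$$
\phi(x) = \frac{x}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right] = \frac{x}{2}\left[1 + \frac{2}{\sqrt{\pi}}\int_0^{\frac{x}{\sqrt{2}}} e^{-t^2}\,dt\right] .
$$

I.1 NNGP KERNEL

We use Lemma 2.2 for GELU:

$$
\begin{aligned}
\mathbb{E}_\theta\left[\phi(h^l_i)\phi(h^l_i)\right] &= \frac{1}{\sqrt{2\pi K^l}}\int dh^l_i\,\frac{(h^l_i)^2}{4}\left[1 + \mathrm{erf}\left(\frac{h^l_i}{\sqrt{2}}\right)\right]^2 e^{-\frac{(h^l_i)^2}{2K^l}} \\
&= \frac{1}{\sqrt{2\pi K^l}}\int dh^l_i\,\frac{(h^l_i)^2}{4}\left[1 + \mathrm{erf}^2\left(\frac{h^l_i}{\sqrt{2}}\right)\right] e^{-\frac{(h^l_i)^2}{2K^l}} \\
&= \frac{K^l}{4} + \frac{1}{\sqrt{32\pi K^l}}\int dh^l_i\,(h^l_i)^2\,\mathrm{erf}^2\left(\frac{h^l_i}{\sqrt{2}}\right) e^{-\frac{(h^l_i)^2}{2K^l}} \\
&= \frac{K^l}{4} + \frac{K^l}{\sqrt{32\pi K^l}}\int dh^l_i\,\mathrm{erf}^2\left(\frac{h^l_i}{\sqrt{2}}\right) e^{-\frac{(h^l_i)^2}{2K^l}} \\
&\quad + \frac{(K^l)^2}{\sqrt{32\pi K^l}}\int dh^l_i\left[\mathrm{erf}'\left(\frac{h^l_i}{\sqrt{2}}\right)\mathrm{erf}'\left(\frac{h^l_i}{\sqrt{2}}\right) + \mathrm{erf}\left(\frac{h^l_i}{\sqrt{2}}\right)\mathrm{erf}''\left(\frac{h^l_i}{\sqrt{2}}\right)\right] e^{-\frac{(h^l_i)^2}{2K^l}} \\
&= \frac{K^l}{4} + \frac{K^l}{2\pi}\left[\arcsin\left(\frac{K^l}{1 + K^l}\right) + \frac{2K^l}{(1 + K^l)\sqrt{1 + 2K^l}}\right] ,
\end{aligned}
$$

where the cross term, which is odd in $h^l_i$, was dropped in the second line; from the third to the fourth line we integrated by parts twice; and to get the last line we used results from the erf activation.

Thus the recurrence relation for the NNGP kernel is

$$
K^{l+1} = \left[\frac{K^l}{4} + \frac{K^l}{2\pi}\arcsin\left(\frac{K^l}{1 + K^l}\right) + \frac{(K^l)^2}{\pi(1 + K^l)\sqrt{1 + 2K^l}}\right]\sigma_w^2 + \sigma_b^2 .
$$

As a result,

$$
\chi^\star_K = \frac{\sigma_w^2}{4} + \frac{\sigma_w^2}{2\pi}\left[\arcsin\left(\frac{K^\star}{1 + K^\star}\right) + \frac{4(K^\star)^3 + 11(K^\star)^2 + 5K^\star}{(1 + K^\star)^2(1 + 2K^\star)^{3/2}}\right] .
$$

I.2 JACOBIANS

Following the definition,

$$
\begin{aligned}
\chi^l_{\mathcal{J}} &= \sigma_w^2\,\mathbb{E}_\theta\left[\phi'(h^l_i)\phi'(h^l_i)\right] \\
&= \frac{\sigma_w^2}{\sqrt{2\pi K^l}}\int dh^l_i\left[\frac{1}{2} + \frac{1}{2}\mathrm{erf}\left(\frac{h^l_i}{\sqrt{2}}\right) + \frac{e^{-\frac{(h^l_i)^2}{2}}h^l_i}{\sqrt{2\pi}}\right]^2 e^{-\frac{(h^l_i)^2}{2K^l}} \\
&= \frac{\sigma_w^2}{\sqrt{2\pi K^l}}\int dh^l_i\left[\frac{1}{4} + \frac{1}{4}\mathrm{erf}^2\left(\frac{h^l_i}{\sqrt{2}}\right) + \frac{h^l_i\,\mathrm{erf}\left(\frac{h^l_i}{\sqrt{2}}\right)e^{-\frac{(h^l_i)^2}{2}}}{\sqrt{2\pi}} + \frac{e^{-(h^l_i)^2}(h^l_i)^2}{2\pi}\right] e^{-\frac{(h^l_i)^2}{2K^l}} \\
&= \frac{\sigma_w^2}{4} + \frac{\sigma_w^2}{2\pi}\left[\arcsin\left(\frac{K^l}{1 + K^l}\right) + \frac{K^l(3 + 5K^l)}{(1 + K^l)(1 + 2K^l)^{3/2}}\right] ,
\end{aligned}
$$

where we dropped the odd terms to get the third line, and to get the last line we used the known result for erf in the second term and integration by parts in the third term.

Finding the critical line is harder here. One can use the recurrence relation for the NNGP kernel at the fixed point $K^\star$ together with $\chi^\star_{\mathcal{J}} = 1$:

$$
\begin{aligned}
K^\star &= \frac{\sigma_w^2}{4}K^\star + \frac{\sigma_w^2}{2\pi}\left[\arcsin\left(\frac{K^\star}{1 + K^\star}\right) + \frac{2K^\star}{(1 + K^\star)\sqrt{1 + 2K^\star}}\right]K^\star + \sigma_b^2 , \\
\chi^\star_{\mathcal{J}} &= \frac{\sigma_w^2}{4} + \frac{\sigma_w^2}{2\pi}\left[\arcsin\left(\frac{K^\star}{1 + K^\star}\right) + \frac{K^\star(3 + 5K^\star)}{(1 + K^\star)(1 + 2K^\star)^{3/2}}\right] = 1 .
\end{aligned}
$$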
After cancelling the arcsin term, $\sigma_w$ and $\sigma_b$ can be written as functions of $K^\star$:

$$
\sigma_w = 2\left[1 + \frac{2K^\star(3 + 5K^\star)}{\pi(1 + K^\star)(1 + 2K^\star)^{3/2}} + \frac{2}{\pi}\arcsin\left(\frac{K^\star}{1 + K^\star}\right)\right]^{-\frac{1}{2}} , \qquad
\sigma_b = \frac{K^\star}{\sqrt{2\pi}\,(1 + 2K^\star)^{3/4}}\,\sigma_w .
$$

One can then scan $K^\star$ to draw the critical line.

To locate the critical point, we further require $\chi^\star_K = 1$. In practice we solve $\chi^\star_{\mathcal{J}} - \chi^\star_K = 0$ instead. We have

$$
\frac{\sigma_w^2\left[(K^\star)^3 - 3(K^\star)^2 - 2K^\star\right]}{2\pi(1 + K^\star)^2(1 + 2K^\star)^{3/2}} = 0 ,
$$

which has three solutions, two of which are non-negative:

$$
K^\star = 0 \quad \text{and} \quad K^\star = \frac{3 + \sqrt{17}}{2} .
$$

One can then solve for $\sigma_b$ and $\sigma_w$ by plugging in the corresponding $K^\star$ values:

$$
\begin{aligned}
(\sigma_w, \sigma_b) &= (2, 0) , && \text{for } K^\star = 0 , \\
(\sigma_w, \sigma_b) &\approx (1.408, 0.416) , && \text{for } K^\star = \frac{3 + \sqrt{17}}{2} .
\end{aligned}
$$

I.3 CRITICAL EXPONENTS

GELU behaves differently compared to erf. First we discuss the $K^\star = 0$ critical point, which is located at $(\sigma_b, \sigma_w) = (0, 2)$. We expand Eq.(144) and keep the next-to-leading order in $\delta K^l = K^l - K^\star$:

$$
\delta K^{l+1} \approx \delta K^l + \frac{6}{\pi}(\delta K^l)^2 .
$$

From Lemma C.1,

$$
A = -\frac{\pi}{6} \quad \text{and} \quad \zeta_K = 1 ,
$$

which is not possible since $\delta K^l \geq 0$ for this case. This result means that the scaling analysis does not work here.

Next, we consider the other fixed point with $K^\star = \frac{3 + \sqrt{17}}{2}$ at $(\sigma_b, \sigma_w) = (0.416, 1.408)$. Expanding the NNGP kernel recurrence relation again gives

$$
\delta K^{l+1} \approx \delta K^l + 0.00014\,(\delta K^l)^2 . \tag{157}
$$

Following the same analysis, we find

$$
\delta K^l \approx -7142.9\; l^{-1} .
$$

The scaling analysis appears to work for this case, since $K^\star > 0$. The solution shows that the critical point is half-stable (Roberts et al., 2022). According to (157), if $K^l < K^\star$ the fixed point is attractive, while if $K^l > K^\star$ the fixed point is repelling. However, the extremely large coefficient in the scaling behavior of $\delta K^l$ undermines the analysis, since for any network with a reasonable depth the deviation $\delta K^l$ is not small.

Now we can expand $\chi^l_{\mathcal{J}}$ at some large depth, up to leading order in $l^{-1}$:

$$
\chi^l_{\mathcal{J}} \approx 1 - \frac{66.668}{l} .
$$

Then

$$
\delta\mathcal{J}^{l_0,l} \approx c_{l_0}\, l^{-66.668} , \tag{160}
$$

where $c_{l_0}$ is a positive non-universal constant.

The critical exponent is

$$
\zeta = 66.668 , \tag{161}
$$

which in practice is not traceable.
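The two GELU critical points quoted in Section I.2 can be reproduced from the parametrization of the critical line by $K^\star$. The sketch below is a minimal illustration with hypothetical function names; it evaluates $\sigma_w(K^\star)$ and $\sigma_b(K^\star)$ and exposes $\chi_K$, $\chi_{\mathcal{J}}$ and the kernel recurrence so that the fixed-point conditions can be verified.

```python
import math


def gelu_kernel_step(k, sigma_w2, sigma_b2):
    """One step of the NNGP kernel recurrence for GELU."""
    f = (k / 4.0
         + k / (2.0 * math.pi) * math.asin(k / (1.0 + k))
         + k ** 2 / (math.pi * (1.0 + k) * math.sqrt(1.0 + 2.0 * k)))
    return sigma_w2 * f + sigma_b2


def gelu_chi_k(k, sigma_w2):
    """Derivative of the kernel recurrence with respect to K."""
    rational = (4.0 * k ** 3 + 11.0 * k ** 2 + 5.0 * k) / (
        (1.0 + k) ** 2 * (1.0 + 2.0 * k) ** 1.5)
    return sigma_w2 / 4.0 + sigma_w2 / (2.0 * math.pi) * (math.asin(k / (1.0 + k)) + rational)


def gelu_chi_j(k, sigma_w2):
    """Jacobian recurrence coefficient for GELU."""
    rational = k * (3.0 + 5.0 * k) / ((1.0 + k) * (1.0 + 2.0 * k) ** 1.5)
    return sigma_w2 / 4.0 + sigma_w2 / (2.0 * math.pi) * (math.asin(k / (1.0 + k)) + rational)


def gelu_critical_line(k_star):
    """(sigma_w, sigma_b) on the critical line chi_J = 1, parametrized by K_star."""
    bracket = (1.0
               + 2.0 * k_star * (3.0 + 5.0 * k_star)
               / (math.pi * (1.0 + k_star) * (1.0 + 2.0 * k_star) ** 1.5)
               + 2.0 / math.pi * math.asin(k_star / (1.0 + k_star)))
    sigma_w = 2.0 / math.sqrt(bracket)
    sigma_b = k_star * sigma_w / (math.sqrt(2.0 * math.pi) * (1.0 + 2.0 * k_star) ** 0.75)
    return sigma_w, sigma_b


if __name__ == "__main__":
    for k_star in (0.0, (3.0 + math.sqrt(17.0)) / 2.0):
        sigma_w, sigma_b = gelu_critical_line(k_star)
        print(k_star, sigma_w, sigma_b,
              gelu_chi_j(k_star, sigma_w ** 2), gelu_chi_k(k_star, sigma_w ** 2))
```

Scanning `k_star` over a grid of non-negative values with `gelu_critical_line` draws the full critical line; the two values used above are the ones where $\chi_K$ also equals 1.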

I.4 LAYERNORM ON PRE-ACTIVATIONS

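Using Lemma B.9, we have

$$
\chi^l_{\mathcal{J}} = \frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right]\bigg|_{\tilde{K}^{l-1}=1} = \frac{\sigma_w^2(6\pi + 4\sqrt{3})}{\sigma_w^2(6\pi + 3\sqrt{3}) + 18\pi\sigma_b^2} .
$$

The critical line is then at

$$
\sigma_b = \left(6\sqrt{3}\,\pi\right)^{-\frac{1}{2}}\sigma_w \approx 0.175\,\sigma_w .
$$

I.5 LAYERNORM ON ACTIVATIONS

First we need to evaluate a new expectation value:

$$
\mathbb{E}_\theta\left[\phi(h^l_i)\right] = \frac{1}{\sqrt{2\pi(\sigma_w^2 + \sigma_b^2)}}\int dh^l_i\,\frac{h^l_i}{2}\left[1 + \mathrm{erf}\left(\frac{h^l_i}{\sqrt{2}}\right)\right] e^{-\frac{(h^l_i)^2}{2(\sigma_w^2 + \sigma_b^2)}} = \frac{\sigma_w^2 + \sigma_b^2}{\sqrt{2\pi(1 + \sigma_w^2 + \sigma_b^2)}} ,
$$

where we used integration by parts to get the result. The other integrals are obtained from the results of Sections I.1 and I.2 by substituting $K^l = \sigma_w^2 + \sigma_b^2$. Writing $K \equiv \sigma_w^2 + \sigma_b^2$ for brevity and inserting these averages into the coefficient of Theorem B.17 gives

$$
\chi^l_{\mathcal{J}} = \sigma_w^2\,\frac{\pi(1 + K) + 2(1 + K)\arcsin\left(\frac{K}{1 + K}\right) + \frac{2K(3 + 5K)}{(1 + 2K)^{3/2}}}{\pi K(1 + K) - 2K^2 + \frac{4K^2}{\sqrt{1 + 2K}} + 2K(1 + K)\arcsin\left(\frac{K}{1 + K}\right)} .
$$

The critical line is defined by $\chi^l_{\mathcal{J}} = 1$; one can solve it numerically by scanning over $\sigma_b$ and $\sigma_w$.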

I.6 RESIDUAL CONNECTIONS

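The recurrence relation for the NNGP kernel is

$$
K^{l+1} = \left[\frac{K^l}{4} + \frac{K^l}{2\pi}\arcsin\left(\frac{K^l}{1 + K^l}\right) + \frac{(K^l)^2}{\pi(1 + K^l)\sqrt{1 + 2K^l}}\right]\sigma_w^2 + \sigma_b^2 + \mu^2 K^l .
$$

A fixed point exists if

$$
\chi^\star_K = \frac{\sigma_w^2}{4} + \frac{\sigma_w^2}{2\pi}\left[\arcsin\left(\frac{K^\star}{1 + K^\star}\right) + \frac{4(K^\star)^3 + 11(K^\star)^2 + 5K^\star}{(1 + K^\star)^2(1 + 2K^\star)^{3/2}}\right] + \mu^2 \leq 1 .
$$

The recurrence coefficient for the Jacobian is

$$
\chi^\star_{\mathcal{J}} = \frac{\sigma_w^2}{4} + \frac{\sigma_w^2}{2\pi}\left[\arcsin\left(\frac{K^\star}{1 + K^\star}\right) + \frac{K^\star(3 + 5K^\star)}{(1 + K^\star)(1 + 2K^\star)^{3/2}}\right] + \mu^2 .
$$

The phase boundary is shifted to

$$
\sigma_w = 2\sqrt{1 - \mu^2}\left[1 + \frac{2K^\star(3 + 5K^\star)}{\pi(1 + K^\star)(1 + 2K^\star)^{3/2}} + \frac{2}{\pi}\arcsin\left(\frac{K^\star}{1 + K^\star}\right)\right]^{-\frac{1}{2}} , \qquad
\sigma_b = \frac{K^\star}{\sqrt{2\pi}\,(1 + 2K^\star)^{3/4}}\,\sigma_w .
$$

One can again scan over $K^\star$ to draw the critical line.

To locate the critical point, we further require $\chi^\star_K = 1$. In practice we solve $\chi^\star_{\mathcal{J}} - \chi^\star_K = 0$ instead. We have

$$
\frac{\sigma_w^2\left[(K^\star)^3 - 3(K^\star)^2 - 2K^\star\right]}{2\pi(1 + K^\star)^2(1 + 2K^\star)^{3/2}} = 0 ,
$$

which has three solutions, two of which are non-negative:

$$
K^\star = 0 \quad \text{and} \quad K^\star = \frac{3 + \sqrt{17}}{2} .
$$

One can then solve for $\sigma_b$ and $\sigma_w$ by plugging in the corresponding $K^\star$ values:

$$
\begin{aligned}
(\sigma_w, \sigma_b) &= \left(2\sqrt{1 - \mu^2},\, 0\right) , && \text{for } K^\star = 0 , \\
(\sigma_w, \sigma_b) &\approx \left(1.408\sqrt{1 - \mu^2},\, 0.416\sqrt{1 - \mu^2}\right) , && \text{for } K^\star = \frac{3 + \sqrt{17}}{2} .
\end{aligned}
$$

I.7 RESIDUAL CONNECTIONS WITH LAYERNORM ON PREACTIVATIONS (PRE-LN)

We use Lemma B.9 and the results obtained without residual connections for GELU:

$$
\begin{aligned}
\chi^\star_{\mathcal{J}} &= \lim_{l\to\infty}\left(\frac{\sigma_w^2}{N_l K^l}\sum_{k=1}^{N_l}\mathbb{E}_\theta\left[\phi'(\tilde{h}^l_k)\phi'(\tilde{h}^l_k)\right]\bigg|_{\tilde{K}^{l-1}=1} + \mu^2\right) \\
&= \frac{\sigma_w^2(6\pi + 4\sqrt{3})(1 - \mu^2)}{\sigma_w^2(6\pi + 3\sqrt{3}) + 18\pi\sigma_b^2} + \mu^2 \\
&= 1 + \frac{\left(\sqrt{3}\,\sigma_w^2 - 18\pi\sigma_b^2\right)(1 - \mu^2)}{\sigma_w^2(6\pi + 3\sqrt{3}) + 18\pi\sigma_b^2} .
\end{aligned}
$$

The critical line is then at

$$
\sigma_b = \left(6\sqrt{3}\,\pi\right)^{-\frac{1}{2}}\sigma_w \approx 0.175\,\sigma_w ,
$$

just as without residual connections.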

J ADDITIONAL EXPERIMENTAL RESULTS

In the following training results, we used NTK parameterization for the linear layers in the MLP. We emphasize that, compared to standard initialization, this choice has little effect on training and convergence in this case. In Figure 6, we compare the performance of deep MLP networks with and without LayerNorm. We note that the case with LayerNorm applied to preactivations continues to train at very large values of $\sigma_w^2$. In all cases, networks are trained using stochastic gradient descent with MSE. We used the Fashion MNIST dataset (Xiao et al., 2017). All networks had depth $L = 50$ and width $N_l = 500$. The learning rates were logarithmically sampled

• within $(10^{-8}, 10^{6})$ for ReLU, $(10^{-5}, 10)$ for LN-ReLU and ReLU-LN;
• within $(10^{-5}, 1)$ for erf, LN-erf and erf-LN;
• within $(10^{-8}, 10)$ for GELU, $(10^{-3}, 10)$ for LN-GELU and GELU-LN,

where $\lambda_{\max}$ is the largest eigenvalue of the NTK for each $\sigma_w$.

In Figure 7, we show empirically that the critical exponent of the partial Jacobians vanishes for erf with LayerNorm. Test accuracy from kernel regression reflects the trainability (training accuracy) with SGD in the ordered phase. We found that, with LayerNorm applied to preactivations, the trainable depth is predicted by the correlation length as $c\xi$, where the prefactor is $c = 28$. This prefactor agrees with the vanilla cases in Xiao et al. (2020); the apparent difference comes from the fact that they used $\log_{10}$ while we used $\log_e$.

In Figure 9, we explore a broad range of $\sigma_w^2$ for the performance of an MLP network with erf activation function and LayerNorm on preactivations. The network has depth $L = 50$ and width $N_l = 500$, and is trained using SGD on Fashion MNIST. The learning rates are chosen based on a logarithmic scan with a short training time.
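A logarithmic scan over learning rates of the kind described above can be sketched as follows. This is only an illustration: the number of sampled points is not stated in the text, so `num_points` is a hypothetical parameter, and the example interval $(10^{-5}, 1)$ is the one quoted for the erf networks.

```python
import numpy as np


def log_sampled_learning_rates(low_exponent, high_exponent, num_points):
    """Learning rates evenly spaced on a log10 scale between 10**low and 10**high.

    num_points is a hypothetical choice; the text does not specify it.
    """
    return np.logspace(low_exponent, high_exponent, num=num_points)


if __name__ == "__main__":
    # Interval quoted for erf, LN-erf and erf-LN: (1e-5, 1).
    for lr in log_sampled_learning_rates(-5, 0, num_points=11):
        print(f"{lr:.2e}")
```

Each candidate would then be trained for a short time and the best-performing rate kept, as the text describes.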



Scale-invariant activation functions are more forgiving: away from the critical point, $K^l$ scales algebraically with $l$. We note that for this particular experiment, we used NTK parameterization for the MLP. However, we emphasize that this does not affect the results.



Figure 1: $\chi^\star_J$ phase diagrams for ReLU (first row), erf (second row) and GELU (third row), with residual connections of variable strengths $\mu = \{0.0, 0.9, 1.0\}$. Both cases are shown: without LayerNorm (first three columns) and with LayerNorm (last three columns). The solid lines indicate the critical lines obtained through infinite width limit calculations, while the stars indicate the critical points. The dotted lines in the rightmost column correspond to the critical lines for the $\mu < 1$ case. For networks with LayerNorm and $\mu = 1$, $\chi^\star_J = 1$ holds on the entire $\sigma_b$-$\sigma_w$ plane, for all activation functions that we considered. We also note that for erf activation, the case $\mu = 1$ without LayerNorm is subcritical and has a large correlation length.

Figure 2: Trainability (training accuracy) of deep MLP networks ($N_l = 500$, $L = 50$) featuring ReLU (first row), erf (second row) and GELU (third row), with residual connections of variable strengths $\mu = \{0.0, 0.9, 1.0\}$. Both cases are shown: without LayerNorm (first three columns) and with LayerNorm (last three columns). Cases with LayerNorm and $\mu = 1$ (last column) train at all values of $\sigma_w^2$ and $\sigma_b^2$ we considered, in agreement with theory.

Figure 3: From left to right: (1)(2) $\mu = 0.5$ and $\mu = 1.0$ phase diagrams plotted using $\chi^\star_J$ from repeating Mixer Layers of the GELU MLP-Mixer. The black line indicates the empirical phase boundary. Stars indicate the points we selected to train on CIFAR-10. The ($\sigma_w^2 = 10$, $\sigma_b^2 = 10$) point is outside the phase diagrams. (3)(4) $\mu = 0.5$ and $\mu = 1$ MLP-Mixer training curves. Solid and dashed lines indicate training and validation accuracies, respectively. All the networks are $L = 100$ blocks deep, except for one, which is $L = 32$ blocks deep; all networks have 10 million parameters.

Figure 3: (a)(b) We made the phase diagram for an MLP-Mixer with 30 blocks, averaged over 100 different parameter initializations. (c)(d) We used a network with $L = 100$, patch size $4 \times 4$, hidden size $C = 128$, and two MLP dimensions $N_{tm} = N_{cm} = 256$. The $L = 32$ point has doubled widths. All networks have 10 million parameters. Note that we used NTK initialization for all Mixer Layers. We trained all cases on the CIFAR-10 dataset using vanilla SGD paired with CSE. Batch size $bs = 256$; weight decay $\lambda = 10^{-4}$ was selected from $\{10^{-5}, 10^{-4}\}$; mixup rate $\alpha = 0.8$ was selected from $\{0.4, 0.8\}$. We also used RandAugment and horizontal flip with default settings in PyTorch. For all cases we searched learning rates within $\{0.005, 0.01, 0.05, 0.1, 0.2, 0.5\}$. We also tried a linear warm-up schedule for the first 3000 iterations, but we did not see any improvement in performance. Generating the data for the figure took approximately 4 days on Google Colab Pro (single Tesla P100 GPU).


Figure 5: $\log(\mathcal{J}^{l_0,l})$ vs. $\sqrt{l}$ for $\mu = 1$, $\sigma_b^2 = 0$, erf.

Figure 6: Performance of deep MLP networks at and away from criticality, with and without LayerNorm. The blue plateau, corresponding to LayerNorm applied to preactivations, continues to train at very large values of $\sigma_w^2$ without the need to tune the learning rate.

Figure 7: log-log plot of the partial Jacobian $\mathcal{J}^{0,l}$ vs. $l$ for (A) LN-erf and (B) erf-LN.

C CRITICAL EXPONENTS

To prove Theorem 2.7, we first need to find the critical exponent of the NNGP kernel (Roberts et al., 2022).

Lemma C.1. In the infinite width limit, consider a critically initialized network with an activation function $\phi$. The scaling behavior of the fluctuation $\delta K^l \equiv K^l - K^\star$ is non-exponential. If the recurrence relation can be expanded to leading order in $\delta K^l$ as $\delta K^{l+1} \approx \delta K^l - c_n (\delta K^l)^n$ for $n \ge 2$, then the solution for $\delta K^l$ is [...], where [...]. The constant $c_n$ and the order $n$ of the first non-zero term are determined by the choice of activation function.

Proof. We can expand the recurrence relation for the NNGP kernel (10) to second order in $\delta K^l = K^l - K^\star$ on both sides: [...] Using a power-law ansatz, [...] Multiplying both sides by $l^{\zeta_K}$ and then Taylor expanding, [...] For arbitrary $l$, the only non-trivial solution of the equation above is [...]

Proof of Theorem 2.7. We will assume $c_2 \neq 0$. Then, using Lemma C.1, we can expand $\chi^l_J$ in terms of $\delta K^l$. To leading order in $l^{-1}$, [...] Consider a sufficiently large $l$. In this case the $O(l^{-1})$ approximation is valid. We write the recurrence relations of the Jacobians as [...] When $c_n = 0$ for all $n \ge 2$, Lemma C.1 gives $\delta K^l = 0$. Thus the Jacobian saturates to some constant.

We checked the scaling empirically by plotting $\mathcal{J}^{0,l}$ vs. $l$ in a log-log plot and fitting the slope. These results are presented in Fig. 4. The agreement with the infinite width calculation (following sections) is excellent.
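The power-law scaling in Lemma C.1 can be illustrated numerically by iterating the truncated recurrence $\delta K^{l+1} = \delta K^l - c_n (\delta K^l)^n$ and fitting the slope on a log-log scale, in the same spirit as the empirical check described above. Substituting a power-law ansatz $\delta K^l = A\, l^{-\zeta}$ into this recurrence and matching the $1/l$ terms gives $\zeta = 1/(n-1)$ and $A = [c_n (n-1)]^{-1/(n-1)}$. The symbol $A$, the function names, and the numerical values of $c_n$, the initial fluctuation and the depth below are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np


def iterate_fluctuation(c_n, n, delta0, depth):
    """Iterate delta^{l+1} = delta^l - c_n * (delta^l)**n for `depth` layers."""
    delta = np.empty(depth)
    delta[0] = delta0
    for layer in range(1, depth):
        delta[layer] = delta[layer - 1] - c_n * delta[layer - 1] ** n
    return delta


def fitted_exponent(delta, l_min):
    """Estimate zeta from the log-log slope of delta vs. l, using layers beyond l_min."""
    layers = np.arange(1, len(delta) + 1)
    slope, _intercept = np.polyfit(np.log(layers[l_min:]), np.log(delta[l_min:]), 1)
    return -slope


if __name__ == "__main__":
    for n in (2, 3):
        delta = iterate_fluctuation(c_n=0.5, n=n, delta0=0.1, depth=100_000)
        print(f"n = {n}: fitted zeta = {fitted_exponent(delta, 10_000):.4f}, "
              f"ansatz 1/(n-1) = {1.0 / (n - 1):.4f}")
```

The early layers are excluded from the fit because the power law only holds asymptotically; the offset set by the initial fluctuation decays away at large $l$.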

