EXPECTED GRADIENTS OF MAXOUT NETWORKS AND CONSEQUENCES TO PARAMETER INITIALIZATION

Abstract

We study the gradients of a maxout network with respect to inputs and parameters and obtain bounds for the moments depending on the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully-connected and convolutional networks show that this strategy improves SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the NTK.

1. INTRODUCTION

We study the gradients of maxout networks and derive several implications for training stability, parameter initialization, and expressivity. Concretely, we compute stochastic order bounds and bounds on the moments depending on the parameter distribution and the network architecture. The analysis is based on the input-output Jacobian of maxout networks. We discover that, in contrast to ReLU networks, when initialized with a zero-mean Gaussian distribution, the distribution of the input-output Jacobian of a maxout network depends on the network input, which may lead to unstable gradients and training difficulties. Nonetheless, we can obtain a rigorous parameter initialization recommendation for wide networks. The analysis of gradients also allows us to refine previous bounds on the expected number of linear regions of maxout networks at initialization and derive new results on the length distortion and the NTK.

Maxout networks

A rank-$K$ maxout unit, introduced by Goodfellow et al. (2013), computes the maximum of $K$ real-valued parametric affine functions. Concretely, a rank-$K$ maxout unit with $n$ inputs implements a function $\mathbb{R}^n \to \mathbb{R}$; $x \mapsto \max_{k \in [K]} \{\langle W_k, x \rangle + b_k\}$, where $W_k \in \mathbb{R}^n$ and $b_k \in \mathbb{R}$, $k \in [K] := \{1, \ldots, K\}$, are trainable weights and biases. The $K$ arguments of the maximum are called the pre-activation features of the maxout unit. This may be regarded as a multi-argument generalization of a ReLU, which computes the maximum of a real-valued affine function and zero. Goodfellow et al. (2013) demonstrated that maxout networks can perform better than ReLU networks under similar circumstances. Additionally, maxout networks have been shown to be useful for combating catastrophic forgetting in neural networks (Goodfellow et al., 2015). On the other hand, Castaneda et al.
(2019) evaluated the performance of maxout networks in a big data setting and observed that increasing the width of ReLU networks is more effective in improving performance than replacing ReLUs with maxout units and that ReLU networks converge faster than maxout networks. We observe that proper initialization strategies for maxout networks have not been studied in the same level of detail as for ReLU networks and that this might resolve some of the problems encountered in previous maxout network applications.
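The rank-$K$ maxout unit recalled above is simple to implement. A minimal NumPy sketch of a maxout layer (our own illustrative code, not the paper's implementation; shapes and names are our own choices):

```python
import numpy as np

def maxout_layer(x, W, b):
    """Rank-K maxout layer: each unit outputs the max of K affine maps.

    x: input, shape (n_in,)
    W: weights, shape (n_units, K, n_in)
    b: biases, shape (n_units, K)
    """
    pre = np.einsum("ukn,n->uk", W, x) + b  # pre-activation features, (n_units, K)
    return pre.max(axis=1)

# A rank-3 maxout layer with 2 inputs and 2 units (random parameters).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3, 2))
b = rng.normal(size=(2, 3))
y = maxout_layer(np.array([1.0, -1.0]), W, b)
```

Each unit evaluates its $K$ pre-activation features and returns the largest; with $K = 1$ the layer reduces to an affine map, and a ReLU corresponds to the maximum of one affine feature and the constant zero feature.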

Parameter initialization

The vanishing and exploding gradient problem has been known since the work of Hochreiter (1991). It makes choosing an appropriate learning rate harder and slows training (Sun, 2019). Common approaches to address this difficulty include the choice of specific architectures, e.g. LSTMs (Hochreiter, 1991) or ResNets (He et al., 2016), and normalization methods such as batch normalization (Ioffe & Szegedy, 2015) or explicit control of the gradient magnitude with gradient clipping (Pascanu et al., 2013). We will focus on approaches based on parameter initialization that control the activation length and parameter gradients (LeCun et al., 2012; Glorot & Bengio, 2010; He et al., 2015; Gurbuzbalaban & Hu, 2021; Zhang et al., 2019; Bachlechner et al., 2021). He et al. (2015) studied forward and backward passes to obtain initialization recommendations for ReLU networks. A more rigorous analysis of the gradients was performed by Hanin & Rolnick (2018) and Hanin (2018), who also considered higher-order moments and derived recommendations on the network architecture. Sun et al. (2018) derived a corresponding strategy for rank K = 2 maxout networks. For higher maxout ranks, Tseran & Montúfar (2021) considered balancing the forward pass, assuming a Gaussian or uniform distribution on the pre-activation features of each layer. However, this assumption is not fully justified. We will analyze maxout network gradients, including the higher-order moments, and give a rigorous justification for the initialization suggested by Tseran & Montúfar (2021).

Expected number of linear regions

Neural networks with piecewise linear activation functions subdivide their input space into linear regions, i.e., regions over which the computed function is (affine) linear. The number of linear regions serves as a complexity measure to differentiate network architectures (Pascanu et al., 2014; Montufar et al., 2014; Telgarsky, 2015; 2016).
The first results on the expected number of linear regions were obtained by Hanin & Rolnick (2019a;b) for ReLU networks, showing that it can be much smaller than the maximum possible number. Tseran & Montúfar (2021) obtained corresponding results for maxout networks. An important factor controlling the bounds in these works is a constant depending on the gradient of the neuron activations with respect to the network input. By studying the input-output Jacobian of maxout networks, we obtain a refined bound for this constant and, consequently, for the expected number of linear regions.

Expected curve distortion

Another complexity measure is the distortion of the length of an input curve as it passes through a network. Poole et al. (2016) studied the propagation of Riemannian curvature through wide neural networks using a mean-field approach, and later, a related notion of "trajectory length" was considered by Raghu et al. (2017). It was demonstrated that these measures can grow exponentially with the network depth, which was linked to the ability of deep networks to "disentangle" complex representations. Based on these notions, Murray et al. (2022) study how to avoid rapid convergence of pairwise input correlations, as well as vanishing and exploding gradients. However, Hanin et al. (2021) proved that for a ReLU network with He initialization, the length of the curve does not grow with the depth and even shrinks slightly. We establish similar results for maxout networks.

NTK

It is known that the Neural Tangent Kernel (NTK) of a finite network can be approximated by its expectation (Jacot et al., 2018). However, for ReLU networks, Hanin & Nica (2020a) showed that if both the depth and width tend to infinity, the NTK does not converge to a constant in probability. By studying the expectation of the gradients, we show that, similarly to ReLU networks, the NTK of maxout networks does not converge to a constant when both width and depth are sent to infinity.
Contributions

Our contributions can be summarized as follows.

• For expected gradients, we derive stochastic order bounds for the directional derivative of the input-output map of a deep fully-connected maxout network (Theorem 1) as well as bounds for the moments (Corollary 2). Additionally, we derive an equality in distribution for the directional derivatives (Theorem 3), based on which we also discuss the moments (Remark 4) in wide networks. We further derive the moments of the activation length of a fully-connected maxout network (Corollary 5).

• We rigorously derive parameter initialization guidelines for wide maxout networks preventing vanishing and exploding gradients and formulate architecture recommendations. We experimentally demonstrate that they make it possible to train standard-width deep fully-connected and convolutional maxout networks using simple procedures (such as SGD with momentum and Adam), yielding higher accuracy than other initializations or ReLU networks on image classification tasks.

• We derive several implications, refining previous bounds on the expected number of linear regions (Corollary 6) and giving new results on length distortion (Corollary 7) and the NTK (Corollary 9).

2. PRELIMINARIES

Architecture

We consider feedforward fully-connected maxout neural networks with $n_0$ inputs, $L-1$ hidden layers of widths $n_1, \ldots, n_{L-1}$, and a linear output layer, which implement functions of the form $N = \psi \circ \phi_{L-1} \circ \cdots \circ \phi_1$. The $l$th hidden layer is a function $\phi_l : \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}$ with components $i \in [n_l] := \{1, \ldots, n_l\}$ given by the maximum of $K \geq 2$ trainable affine functions, $\phi_{l,i} : \mathbb{R}^{n_{l-1}} \to \mathbb{R}$; $x^{(l-1)} \mapsto \max_{k \in [K]} \{W^{(l)}_{i,k} x^{(l-1)} + b^{(l)}_{i,k}\}$, where $W^{(l)}_{i,k} \in \mathbb{R}^{n_{l-1}}$ and $b^{(l)}_{i,k} \in \mathbb{R}$. Here $x^{(l-1)} \in \mathbb{R}^{n_{l-1}}$ denotes the output of the $(l-1)$th layer and $x^{(0)} := x$. We will write $x^{(l)}_{i,k} = W^{(l)}_{i,k} x^{(l-1)} + b^{(l)}_{i,k}$ to denote the $k$th pre-activation of the $i$th neuron in the $l$th layer. Finally, $\psi : \mathbb{R}^{n_{L-1}} \to \mathbb{R}^{n_L}$ is a linear output layer. We will write $\Theta = \{W, b\}$ for the parameters. Unless stated otherwise, we assume that for each layer, the weights and biases are initialized as i.i.d. samples from a Gaussian distribution with mean $0$ and variance $c/n_{l-1}$, where $c$ is a positive constant. For the linear output layer, the variance is set to $1/n_{L-1}$. We shall study appropriate choices of $c$. We will use $\|\cdot\|$ to denote the $\ell_2$ vector norm. We recall that a real-valued random variable $X$ is said to be smaller than $Y$ in the stochastic order, denoted by $X \leq_{st} Y$, if $\Pr(X > x) \leq \Pr(Y > x)$ for all $x \in \mathbb{R}$. In Appendix A we review basic notions about maxout networks and random variables that we will use in our results.

Input-output Jacobian and activation length

We are concerned with the gradients of the outputs with respect to the inputs, $\nabla N_i(x) = \nabla_x N_i$, and with respect to the parameters, $\nabla N_i(\Theta) = \nabla_\Theta N_i$. In our notation, the argument indicates the variables with respect to which we are taking the derivatives. To study these gradients, we consider the input-output Jacobian $J_N(x) = [\nabla N_1(x), \ldots, \nabla N_{n_L}(x)]^T$. To see the connection to the gradient with respect to the network parameters, consider any loss function $\mathcal{L} : \mathbb{R}^{n_L} \to \mathbb{R}$.
A short calculation shows that, for a fixed input $x \in \mathbb{R}^{n_0}$, the derivative of the loss with respect to one of the weights $W^{(l)}_{i,k',j}$ of a maxout unit is $\langle \nabla \mathcal{L}(N(x)), J_N(x^{(l)}_i) \rangle\, x^{(l-1)}_j$ if $k' = \operatorname{argmax}_k \{x^{(l)}_{i,k}\}$ and zero otherwise, i.e.,

$$\frac{\partial \mathcal{L}(x)}{\partial W^{(l)}_{i,k',j}} = C(x, W)\, \|J_N(x^{(l)}) u\|\, x^{(l-1)}_j, \qquad (1)$$

where $C(x, W) := \|J_N(x^{(l)}_i)\|^{-1} \langle \nabla \mathcal{L}(N(x)), J_N(x^{(l)}_i) \rangle$ and $u = e_i \in \mathbb{R}^{n_l}$. A similar decomposition of the derivative was used by Hanin (2018) and Hanin & Rolnick (2018) for ReLU networks. By (1), the fluctuation of the gradient norm around its mean is captured by the joint distribution of the squared norm of the directional derivative $\|J_N(x)u\|^2$ and the normalized activation length $A^{(l)} = \|x^{(l)}\|^2 / n_l$. We also observe that $\|J_N(x) u\|^2$ is related to the singular values of the input-output Jacobian, which is of interest since a spectrum concentrated around one at initialization can speed up training (Saxe et al., 2014; Pennington et al., 2017; 2018): First, the sum of squared singular values is $\operatorname{tr}(J_N(x)^T J_N(x)) = \sum_{i=1}^{n_L} \langle J_N(x)^T J_N(x) u_i, u_i \rangle = \sum_{i=1}^{n_L} \|J_N(x) u_i\|^2$, where the vectors $u_i$ form an orthonormal basis. Second, using the Stieltjes transform, one can show that the singular values of the Jacobian depend on the even moments of the entries of $J_N$ (Hanin, 2018, Section 3.1).
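Since $N$ is piecewise linear, the directional derivative $J_N(x)u$ appearing in these quantities can be computed for a generic input with a single forward difference, which is exact (up to rounding) once $x$ and $x + \varepsilon u$ lie in the same linear region. A small NumPy sketch (our own illustrative code, not the paper's):

```python
import numpy as np

def maxout_net(x, params):
    """Maxout network: hidden layers (W, b) with W of shape (n_out, K, n_in)
    and b of shape (n_out, K); the last entry of `params` is a linear layer."""
    for W, b in params[:-1]:
        x = (np.einsum("okn,n->ok", W, x) + b).max(axis=1)
    W, b = params[-1]
    return W @ x + b

def jvp(x, u, params, eps=1e-6):
    """J_N(x) u via a one-sided difference; exact for a piecewise linear map
    once x and x + eps*u fall in the same linear region (generic x)."""
    return (maxout_net(x + eps * u, params) - maxout_net(x, params)) / eps
```

The quantity studied in Section 3 is then `np.sum(jvp(x, u, params) ** 2)`, i.e. $\|J_N(x)u\|^2$ for a unit vector $u$.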

3.1. BOUNDS ON THE INPUT-OUTPUT JACOBIAN

Theorem 1 (Bounds on $\|J_N(x)u\|^2$). Consider a maxout network with the settings of Section 2. Assume that the biases are independent of the weights but otherwise initialized using any approach. Let $u \in \mathbb{R}^{n_0}$ be a fixed unit vector. Then, almost surely with respect to the parameter initialization, for any input into the network $x \in \mathbb{R}^{n_0}$, the following stochastic order bounds hold:

$$\frac{1}{n_0} \chi^2_{n_L} \prod_{l=1}^{L-1} \frac{c}{n_l} \sum_{i=1}^{n_l} \xi_{l,i}(\chi^2_1, K) \;\leq_{st}\; \|J_N(x)u\|^2 \;\leq_{st}\; \frac{1}{n_0} \chi^2_{n_L} \prod_{l=1}^{L-1} \frac{c}{n_l} \sum_{i=1}^{n_l} \Xi_{l,i}(\chi^2_1, K),$$

where $\xi_{l,i}(\chi^2_1, K)$ and $\Xi_{l,i}(\chi^2_1, K)$ are respectively the smallest and largest order statistic in a sample of size $K$ of chi-squared random variables with 1 degree of freedom, independent of each other and of the vectors $u$ and $x$.

The proof is in Appendix B. It is based on appropriate modifications to the ReLU discussion of Hanin & Nica (2020b) and Hanin et al. (2021) and proceeds by writing the Jacobian norm as a product over the layers and bounding the layer contributions with $\min_{k \in [K]} \{\langle W^{(l)}_{i,k}, u^{(l-1)} \rangle^2\}$ and $\max_{k \in [K]} \{\langle W^{(l)}_{i,k}, u^{(l-1)} \rangle^2\}$. Since the product of a Gaussian vector with a unit vector is always Gaussian, the lower and upper bounds are distributed as the smallest and largest order statistics in a sample of size $K$ of chi-squared random variables with 1 degree of freedom.

Figure 1: Expectation of the directional derivative of the input-output map $E[\|J_N(x)u\|^2]$ for width-2 fully-connected networks with inputs in $\mathbb{R}^2$, comparing maxout networks of rank $K = 5$ with 1, 3, 5, and 10 hidden layers against a ReLU network with 5 hidden layers. For maxout networks, this expectation depends on the input, while for ReLU networks, it does not. Input points $x$ were generated as a grid of $100 \times 100$ points in $[-10^3, 10^3]^2$, and $u$ was a fixed vector sampled from the unit sphere. The expectation was estimated based on 10,000 initializations with weights and biases sampled from $N(0, 1)$.
In contrast to ReLU networks, we found that for maxout networks it is impossible to write an equality in distribution involving only independent random variables, because of the dependence of the distribution of $\|J_N(x)u\|^2$ on the network input $x$ and the direction vector $u$ (see Figure 1). We discuss this in more detail in Section 3.2.

Corollary 2 (Bounds on the moments of $\|J_N(x)u\|^2$). Consider a maxout network with the settings of Section 2. Assume that the biases are independent of the weights but otherwise initialized using any approach. Let $u \in \mathbb{R}^{n_0}$ be a fixed unit vector and $x \in \mathbb{R}^{n_0}$ be any input into the network. Then

(i) $\frac{n_L}{n_0} (cS)^{L-1} \leq E[\|J_N(x)u\|^2] \leq \frac{n_L}{n_0} (cL)^{L-1}$,

(ii) $\operatorname{Var}[\|J_N(x)u\|^2] \leq \left(\frac{n_L}{n_0}\right)^2 c^{2(L-1)} \left[ K^{2(L-1)} \exp\left\{4\left(\sum_{l=1}^{L-1} \frac{1}{n_l K} + \frac{1}{n_L}\right)\right\} - S^{2(L-1)} \right]$,

(iii) $E[\|J_N(x)u\|^{2t}] \leq \left(\frac{n_L}{n_0}\right)^t (cK)^{t(L-1)} \exp\left\{t^2 \left(\sum_{l=1}^{L-1} \frac{1}{n_l K} + \frac{1}{n_L}\right)\right\}, \quad t \in \mathbb{N},$

where the expectation is taken with respect to the distribution of the network weights. The constants $S$ and $L$ depend on $K$ and denote the means of the smallest and the largest order statistic in a sample of $K$ chi-squared random variables. For $K = 2, \ldots, 10$, $S \in [0.02, 0.4]$ and $L \in [1.6, 4]$. See Table 4 in Appendix C for the exact values.

Notice that for $t \geq 2$, the $t$th moments of the input-output Jacobian depend on the architecture of the network, but the mean does not (Corollary 2), similarly to their behavior in ReLU networks (Hanin, 2018). We also observe that the upper bound on the $t$th moments can grow exponentially with the network depth, depending on the maxout rank. However, the upper bound on the moments can be tightened given corresponding bounds for the largest order statistic of the chi-squared distribution.
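The constants $S$ and $L$ have no simple closed form for general $K$, but they are easy to estimate by Monte Carlo (a sketch with our own naming; the paper's Table 4 gives the exact values). For $K = 2$ one can check in closed form that $S = 1 - 2/\pi \approx 0.363$ and $L = 1 + 2/\pi \approx 1.637$, consistent with the stated ranges:

```python
import numpy as np

def smallest_largest_means(K, n_samples=1_000_000, seed=0):
    """Monte Carlo estimates of S and L: the means of the smallest and
    largest order statistic of K i.i.d. chi-squared(1) variables."""
    rng = np.random.default_rng(seed)
    z2 = rng.standard_normal((n_samples, K)) ** 2  # chi-squared(1) samples
    return z2.min(axis=1).mean(), z2.max(axis=1).mean()

S2, L2 = smallest_largest_means(2)  # close to 1 - 2/pi and 1 + 2/pi
```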

3.2. DISTRIBUTION OF THE INPUT-OUTPUT JACOBIAN

Here we present an equality in distribution for the input-output Jacobian. It contains dependent variables for the individual layers and thus cannot be readily used to obtain bounds on the moments, but it is particularly helpful for studying the behavior of wide maxout networks.

Theorem 3 (Equality in distribution for $\|J_N(x)u\|^2$). Consider a maxout network with the settings of Section 2. Let $u \in \mathbb{R}^{n_0}$ be a fixed unit vector and $x \in \mathbb{R}^{n_0}$, $x \neq 0$, be any input into the network. Then, almost surely with respect to the parameter initialization, $\|J_N(x)u\|^2$ equals in distribution

$$\frac{1}{n_0} \chi^2_{n_L} \prod_{l=1}^{L-1} \frac{c}{n_l} \sum_{i=1}^{n_l} \left( v_i \sqrt{1 - \cos^2 \gamma_{x^{(l-1)}, u^{(l-1)}}} + \Xi_{l,i}(N(0,1), K) \cos \gamma_{x^{(l-1)}, u^{(l-1)}} \right)^2,$$

where $v_i$ and $\Xi_{l,i}(N(0,1), K)$ are independent, $v_i \sim N(0,1)$, and $\Xi_{l,i}(N(0,1), K)$ is the largest order statistic in a sample of $K$ standard Gaussian random variables. Here $\gamma_{x^{(l)}, u^{(l)}}$ denotes the angle between $(x^{(l)}_1, \ldots, x^{(l)}_{n_l}, 1)$ and $(u^{(l)}_1, \ldots, u^{(l)}_{n_l}, 0)$ in $\mathbb{R}^{n_l + 1}$, where $u^{(l)} = W^{(l)} u^{(l-1)} / \|W^{(l)} u^{(l-1)}\|$ when $W^{(l)} u^{(l-1)} \neq 0$ and $0$ otherwise, and $u^{(0)} = u$. The matrices $W^{(l)}$ consist of rows $W^{(l)}_i = W^{(l)}_{i,k'} \in \mathbb{R}^{n_{l-1}}$, where $k' = \operatorname{argmax}_{k \in [K]} \{W^{(l)}_{i,k} x^{(l-1)} + b^{(l)}_{i,k}\}$.

This statement is proved in Appendix D. The main strategy is to construct an orthonormal basis $B = (b_1, \ldots, b_{n_l})$ with $b_1 := x^{(l)} / \|x^{(l)}\|$, which allows us to express the layer gradient in terms of the angle between $x^{(l)}$ and $u^{(l)}$.

Remark 4 (Wide networks). By Theorem 3, in a maxout network the distribution of $\|J_N(x)u\|^2$ depends on $\cos \gamma_{x^{(l-1)}, u^{(l-1)}}$, which changes as the network gets wider or deeper. Since independent and isotropic random vectors in high-dimensional spaces tend to be almost orthogonal, we expect that the cosine will be close to zero for the earlier layers of wide networks, and individual units will behave similarly to squared standard Gaussians. In wide and deep networks, if the network parameters are sampled from $N(0, c/n_{l-1})$ with $c = 1/M$ and $K \geq 3$, we expect that $|\cos \gamma_{x^{(l)}, u^{(l)}}| \approx 1$ for the later layers, and individual units will behave more like the squared largest order statistics. Here $M$ is the second moment of the largest order statistic in a sample of size $K$ of standard Gaussian random variables. Based on this, for deep and wide networks, we can expect that

$$E[\|J_N(x)u\|^2] \approx \frac{n_L}{n_0} (cM)^{L-1} = \frac{n_L}{n_0}. \qquad (2)$$

This intuition is discussed in more detail in Appendix D. According to (2), we expect that the expected gradient magnitude will be stable with depth when an appropriate initialization is used.
See Figure 2 for a numerical evaluation of the effects of the width and depth on the gradients.
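The constant $M$ in Remark 4 and the resulting recommendation $c = 1/M$ can likewise be estimated numerically (an illustrative sketch with our own naming; the paper's Table 1 lists the exact constants):

```python
import numpy as np

def maxout_c(K, n_samples=2_000_000, seed=0):
    """Estimate M = E[(largest of K standard Gaussians)^2] by Monte Carlo
    and return the recommended initialization constant c = 1/M."""
    rng = np.random.default_rng(seed)
    m = rng.standard_normal((n_samples, K)).max(axis=1)
    return 1.0 / np.mean(m ** 2)
```

For $K = 2$, $M = 1$ exactly (writing $\max(Z_1, Z_2) = (Z_1 + Z_2)/2 + |Z_1 - Z_2|/2$ with the two terms independent gives $E[\max^2] = 1$), so $c = 1$, matching the rank-2 recommendation of Sun et al. (2018).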

3.3. ACTIVATION LENGTH

To have a full picture of the derivatives in (1), we consider the activation length. The full version and proof of Corollary 5 are in Appendix E. The proof is based on Theorem 3, replacing $u$ with $x/\|x\|$.

Corollary 5 (Moments of the normalized activation length). Consider a maxout network with the settings of Section 2. Let $x \in \mathbb{R}^{n_0}$ be any input into the network. Then, for the moments of the normalized activation length $A^{(l')}$ of the $l'$th layer we have

Mean: $E[A^{(l')}] = \|x^{(0)}\|^2 \frac{1}{n_0} (cM)^{l'} + \sum_{j=2}^{l'} \frac{1}{n_{j-1}} (cM)^{l'-j+1}$,

Moments of order $t \geq 2$: $G_1 (cM)^{t l'} \leq E\left[\left(A^{(l')}\right)^t\right] \leq G_2 (cK)^{t l'} \exp\left\{\sum_{l=1}^{l'} \frac{t^2}{n_l K}\right\}$.

The expectation is taken with respect to the distribution of the network weights and biases, and $M$ is a constant depending on $K$ that can be computed approximately; see Table 4 for the values for $K = 2, \ldots, 10$. See Appendix E for the variance bounds and details on the functions $G_1$, $G_2$.

We could obtain an exact expression for the mean activation length of a finitely wide maxout network since its distribution depends only on the norm of the input, while this is not the case for the input-output Jacobian (Sections 3.1 and 3.2). We observe that the variance and the $t$th moments, $t \geq 2$, have an exponential dependence on the network architecture, including the maxout rank, whereas the mean does not, similarly to the input-output Jacobian (Corollary 2). Such behavior also occurs for ReLU networks (Hanin & Rolnick, 2018). See Figure 8 in Appendix E for an evaluation of the result.
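The mean behavior in Corollary 5 is easy to check numerically: heuristically, with zero biases the bias terms in the mean formula drop out, so $E[A^{(l')}] = \|x^{(0)}\|^2 (cM)^{l'}/n_0$, and $c = 1/M$ keeps the normalized activation length flat in expectation. A sketch (our own code and parameter choices):

```python
import numpy as np

def normalized_activation_lengths(x, depth, K, c, seed=0):
    """Propagate x through `depth` equal-width maxout layers with weights
    from N(0, c/fan_in) and zero biases; return ||x^(l)||^2 / n_l per layer."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    out = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(c / n), size=(n, K, n))
        x = np.einsum("okn,n->ok", W, x).max(axis=1)
        out.append(np.mean(x ** 2))
    return out
```

Averaging the last entry over many random networks with $c \approx 1/M$ should stay close to the input's normalized length, while a noticeably larger or smaller $c$ makes it grow or shrink geometrically with depth.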

4. IMPLICATIONS TO INITIALIZATION AND NETWORK ARCHITECTURE

We now aim to find initialization approaches and architectures that can avoid exploding and vanishing gradients. We take the annealed exploding and vanishing gradients definition from Hanin (2018) as a starting point for such an investigation for maxout networks. Formally, we require

$$E\left[\left(\frac{\partial \mathcal{L}(x)}{\partial W^{(l)}_{i,k',j}}\right)^2\right] = \Theta(1), \quad \operatorname{Var}\left[\left(\frac{\partial \mathcal{L}(x)}{\partial W^{(l)}_{i,k',j}}\right)^2\right] = \Theta(1), \quad \sup_{l \geq 1} E\left[\left(\frac{\partial \mathcal{L}(x)}{\partial W^{(l)}_{i,k',j}}\right)^{2t}\right] < \infty, \ \forall t \geq 3,$$

where the expectation is with respect to the weights and biases. Based on (1), these conditions can be attained by ensuring that similar conditions hold for $\|J_N(x)u\|^2$ and $A^{(l)}$.

Initialization recommendations

Based on Corollary 2, the mean of $\|J_N(x)u\|^2$ can be stabilized for some $c \in [1/L, 1/S]$. However, Theorem 3 shows that $\|J_N(x)u\|^2$ depends on the input into the network. Hence, we expect that there is no value of $c$ stabilizing the input-output Jacobian for every input simultaneously. Nevertheless, based on Remark 4, for wide and deep maxout networks, $E[\|J_N(x)u\|^2] \approx n_L/n_0$ if $c = 1/M$, and the mean becomes stable. While Remark 4 does not cover maxout rank $K = 2$, the same recommendation can be obtained for it using the approach from He et al. (2015); see Sun et al. (2018). Moreover, according to Corollary 5, the mean of the normalized activation length remains stable for different network depths if $c = 1/M$. Hence, we recommend $c = 1/M$ as an appropriate value for initialization. See Table 1 for the numerical value of $c$ for $K = 2, \ldots, 10$. We call this type of initialization, when the parameters are sampled from $N(0, c/\text{fan-in})$ with $c = 1/M$, "maxout initialization". We note that this matches the previous recommendation of Tseran & Montúfar (2021), which we have now derived rigorously.

Architecture recommendations

In Corollaries 2 and 5, the upper bounds on the moments of order $t \geq 2$ of $\|J_N(x)u\|^2$ and $A^{(l)} = \|x^{(l)}\|^2 / n_l$ can grow exponentially with the depth, depending on the values of $(cK)^L$ and $\sum_{l=1}^{L-1} 1/(n_l K)$.
Hence, we recommend choosing the widths such that $\sum_{l=1}^{L-1} 1/(n_l K) \leq 1$, which holds, e.g., if $n_l \geq L/K$ for all $l = 1, \ldots, L-1$, and choosing a moderate value of the maxout rank $K$. However, the upper bound can still tend to infinity for the high-order moments. From Remark 4, it follows that for $K \geq 3$, to have a stable initialization independent of the network input, a maxout network has to be deep and wide. Experimentally, we observe that for 100-neuron-wide networks with $K = 3$, the absolute value of the cosine that determines the initialization stability converges to 1 at around 60 layers, and for $K = 4, 5$, at around 30 layers. See Figure 5 in Appendix D. To sum up, we recommend working with deep and wide maxout networks with widths satisfying $\sum_{l=1}^{L-1} 1/(n_l K) \leq 1$, and choosing a maxout rank that is neither too small nor too large, e.g., $K = 5$.
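Putting the recommendation together, a sketch of "maxout initialization" for a fully-connected maxout network (our own helper; the paper's Table 1 provides exact $c$ values, here $M$ is estimated by Monte Carlo):

```python
import numpy as np

def maxout_initialization(widths, K, seed=0):
    """Sample maxout-network parameters: weights ~ N(0, c/fan_in) with
    c = 1/M (M estimated by Monte Carlo), biases zero. `widths` lists the
    layer widths n_0, n_1, ..., so consecutive pairs define maxout layers."""
    rng = np.random.default_rng(seed)
    M = np.mean(rng.standard_normal((1_000_000, K)).max(axis=1) ** 2)
    c = 1.0 / M
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, np.sqrt(c / n_in), size=(n_out, K, n_in))
        params.append((W, np.zeros((n_out, K))))
    return params, c
```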

5. IMPLICATIONS TO EXPRESSIVITY AND NTK

With Theorems 1 and 3 in place, we can now obtain maxout versions of several types of results that have previously been derived only for ReLU networks.

5.1. EXPECTED NUMBER OF LINEAR REGIONS OF MAXOUT NETWORKS

For a piecewise linear function $f : \mathbb{R}^{n_0} \to \mathbb{R}$, a linear region is defined as a maximal connected subset of $\mathbb{R}^{n_0}$ on which $f$ has a constant gradient. Tseran & Montúfar (2021) and Hanin & Rolnick (2019b) established upper bounds on the expected number of linear regions of maxout and ReLU networks, respectively. One of the key factors controlling these bounds is $C_{grad}$, which is any upper bound on $(\sup_{x \in \mathbb{R}^{n_0}} E[\|\nabla \zeta_{z,k}(x)\|^t])^{1/t}$, for any $t \in \mathbb{N}$ and $z = 1, \ldots, N$. Here $\zeta_{z,k}$ is the $k$th pre-activation feature of the $z$th unit in the network, $N$ is the total number of units, and the gradient is with respect to the network input. Using Corollary 2, we obtain a value for $C_{grad}$ for maxout networks, which remained an open problem in the work of Tseran & Montúfar (2021). The proof of Corollary 6 and the resulting refined bound on the expected number of linear regions are in Appendix F.

Corollary 6 (Value for $C_{grad}$). Consider a maxout network with the settings of Section 2. Assume that the biases are independent of the weights but otherwise initialized using any approach. Consider the pre-activation feature $\zeta_{z,k}$ of a unit $z = 1, \ldots, N$. Then, for any $t \in \mathbb{N}$,

$$\sup_{x \in \mathbb{R}^{n_0}} E\left[\|\nabla \zeta_{z,k}(x)\|^t\right]^{\frac{1}{t}} \leq n_0^{-\frac{1}{2}} \max\left\{1, (cK)^{\frac{L-1}{2}}\right\} \exp\left\{\frac{t}{2}\left(\sum_{l=1}^{L-1} \frac{1}{n_l K} + 1\right)\right\}.$$

The value of $C_{grad}$ given in Corollary 6 grows as $O((cK)^{L-1} \exp\{t \sum_{l=1}^{L-1} 1/(n_l K)\})$. The first factor grows exponentially with the network depth if $cK > 1$, which is the case when the network is initialized as in Section 4. However, since $K$ is usually a small constant and $c \leq 1$, $cK \geq 1$ is still a small constant. The second factor grows exponentially with the depth if $\sum_{l=1}^{L-1} 1/(n_l K) \geq 1$. Hence, the exponential growth can be avoided if $n_l \geq (L-1)/K$ for all $l = 1, \ldots, L-1$.

5.2. EXPECTED CURVE LENGTH DISTORTION

Let $M$ be a smooth 1-dimensional curve in $\mathbb{R}^{n_0}$ of length $\operatorname{len}(M)$ and $N(M) \subseteq \mathbb{R}^{n_L}$ the image of $M$ under the map $x \mapsto N(x)$. We are interested in the length distortion of $M$, defined as $\operatorname{len}(N(M)) / \operatorname{len}(M)$. Using the results from Section 3.1, observing that the input-output Jacobian of maxout networks is well defined almost everywhere, and following Hanin et al. (2021), we obtain the following corollary. The proof is in Appendix G.

Corollary 7 (Expected curve length distortion). Consider a maxout network with the settings of Section 2. Assume that the biases are independent of the weights but otherwise initialized using any approach. Let $M$ be a smooth 1-dimensional curve of unit length in $\mathbb{R}^{n_0}$. Then, the following upper bounds on the moments of $\operatorname{len}(N(M))$ hold:

$$E[\operatorname{len}(N(M))] \leq \left(\frac{n_L}{n_0}\right)^{\frac{1}{2}} (cL)^{\frac{L-1}{2}}, \qquad \operatorname{Var}[\operatorname{len}(N(M))] \leq \frac{n_L}{n_0} (cL)^{L-1},$$

$$E[\operatorname{len}(N(M))^t] \leq \left(\frac{n_L}{n_0}\right)^{\frac{t}{2}} (cK)^{\frac{t(L-1)}{2}} \exp\left\{\frac{t^2}{2}\left(\sum_{l=1}^{L-1} \frac{1}{n_l K} + \frac{1}{n_L}\right)\right\},$$

where $L$ is a constant depending on $K$; see Table 4 in Appendix C for values for $K = 2, \ldots, 10$.

Remark 8 (Expected curve length distortion in wide maxout networks). If the network is initialized according to Section 4, using Remark 4 and repeating the steps of the proof of Corollary 7, we get $E[\operatorname{len}(N(M))] \lesssim (n_L/n_0)^{1/2}$ and $\operatorname{Var}[\operatorname{len}(N(M))] \approx n_L/n_0$. Hence, similarly to ReLU networks, wide maxout networks, if initialized to keep the gradients stable, have low expected curve length distortion at initialization. However, we cannot conclude whether the curve length shrinks. For narrow networks, the upper bound does not exclude the possibility that the expected distortion grows exponentially with the network depth, depending on the initialization.
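The distortion can be estimated empirically by pushing a fine polygonal discretization of a curve through the network (an illustrative sketch with our own function names; here we take $M$ to be a straight unit-length segment):

```python
import numpy as np

def net_batch(X, params):
    """Batched maxout network forward pass; X has shape (n_points, n_0).
    Hidden layers: W of shape (n_out, K, n_in), b of shape (n_out, K);
    the last entry of `params` is a linear layer."""
    for W, b in params[:-1]:
        X = (np.einsum("okn,pn->pok", W, X) + b).max(axis=2)
    W, b = params[-1]
    return X @ W.T + b

def length_distortion(params, n0, n_points=2000, seed=0):
    """len(N(M))/len(M) for M a random unit-length straight segment in R^{n0},
    estimated with a fine polygonal discretization."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n0)                      # random start point
    u = rng.normal(size=n0)
    u = u / np.linalg.norm(u)                    # random direction, len(M) = 1
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    Y = net_batch(a + t * u, params)             # image of the segment
    return np.linalg.norm(np.diff(Y, axis=0), axis=1).sum()
```

Since the network is piecewise linear, the polygonal estimate converges quickly as the discretization is refined; for a purely linear network it is exact.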

5.3. ON-DIAGONAL NTK

We denote the on-diagonal NTK by $K_N(x, x) = \sum_i (\partial N(x)/\partial \theta_i)^2$. In Appendix H we show:

Corollary 9 (On-diagonal NTK). Consider a maxout network with the settings of Section 2. Assume that $n_L = 1$ and that the biases are initialized to zero and are not trained. Assume that $S \leq c \leq L$, where the constants $S$, $L$ are as specified in Table 4. Then,

$$\frac{\|x^{(0)}\|^2 (cS)^{L-2}}{n_0} P \;\leq\; E[K_N(x, x)] \;\leq\; \frac{\|x^{(0)}\|^2 (cL)^{L-2} M^{L-1}}{n_0} P,$$

$$E[K_N(x, x)^2] \;\leq\; 2 P P_W (cK)^{2(L-2)} \frac{\|x^{(0)}\|^4}{n_0^2} \exp\left\{\sum_{j=1}^{L-1} \frac{4}{n_j K} + 4\right\},$$

where $P = \sum_{l=0}^{L-1} n_l$, $P_W = \sum_{l=1}^{L} n_l n_{l-1}$, and $M$ is as specified in Table 4.

By Corollary 9, $E[K_N(x, x)^2] / (E[K_N(x, x)])^2$ is in $O((P_W / P)\, C^L \exp\{\sum_{l=1}^{L} 1/(n_l K)\})$, where $C$ depends on $L$, $M$ and $K$. Hence, if the widths $n_1, \ldots, n_{L-1}$ and the depth $L$ tend to infinity, this upper bound does not converge to a constant, suggesting that the NTK might not converge to a constant in probability. This is in line with previous results for ReLU networks by Hanin & Nica (2020a).
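For a small network, the on-diagonal NTK can be computed directly from its definition $K_N(x,x) = \sum_i (\partial N(x)/\partial \theta_i)^2$, e.g. by finite differences over the weights (biases zero and untrained, as in Corollary 9). An illustrative sketch, not meant to be efficient:

```python
import numpy as np

def forward(x, params):
    """Scalar-output maxout network (n_L = 1)."""
    for W, b in params[:-1]:
        x = (np.einsum("okn,n->ok", W, x) + b).max(axis=1)
    W, b = params[-1]
    return (W @ x + b)[0]

def ntk_diag(x, params, eps=1e-5):
    """Empirical on-diagonal NTK: sum over weight entries of the squared
    derivative of the output, via central finite differences."""
    total = 0.0
    for W, _ in params:  # biases are held fixed at zero
        for idx in np.ndindex(W.shape):
            old = W[idx]
            W[idx] = old + eps
            fp = forward(x, params)
            W[idx] = old - eps
            fm = forward(x, params)
            W[idx] = old
            total += ((fp - fm) / (2 * eps)) ** 2
    return total
```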

6. EXPERIMENTS

We check how the initialization proposed in Section 4 affects network training. This initialization was first proposed heuristically by Tseran & Montúfar (2021), where it was tested on 10-layer fully-connected networks in an MNIST experiment. We consider both fully-connected and convolutional neural networks and run experiments on MNIST (LeCun & Cortes, 2010), Iris (Fisher, 1936), Fashion MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). Fully-connected networks have 21 layers, and CNNs have a VGG-19-like architecture (Simonyan & Zisserman, 2015) with 20 or 16 layers depending on the input size, all with maxout rank 5. Weights are sampled from $N(0, c/\text{fan-in})$ in fully-connected networks and $N(0, c/(k^2 \cdot \text{fan-in}))$ in CNNs with kernel size $k$. The biases are initialized to zero. We report the mean and std of 4 runs. We use plain deep networks without any kind of modifications or pre-training. We do not use normalization techniques, such as batch normalization (Ioffe & Szegedy, 2015), since this would obscure the effects of the initialization. Because of this, our results are not necessarily state-of-the-art. More details on the experiments are given in Appendix I, and the implementation is made available at https://anonymous.4open.science/r/maxout_expected_gradient-68BD.

Max-pooling initialization

To account for the maximum in max-pooling layers, a maxout layer appearing after a max-pooling layer is initialized as if its maxout rank were $K \times m^2$, where $m^2$ is the max-pooling window size. For example, we used $K = 5$ and $m^2 = 4$, resulting in $c = 0.26573$ for such maxout layers. All other layers are initialized according to Section 4. We observe that max-pooling initialization often leads to slightly higher accuracy.

Results for SGD with momentum

Table 2 reports test accuracy for networks trained using SGD with Nesterov momentum. We compare ReLU and maxout networks with different initializations: maxout, max-pooling, a small value $c = 0.1$, and He $c = 2$. We observe that maxout and max-pooling initializations allow training deep maxout networks and obtaining better accuracy than ReLU networks, whereas performance is significantly worse or training does not progress for maxout networks with other initializations.

Results for Adam

Table 3: Accuracy on the test set for the networks trained with Adam. Observe that maxout networks initialized with the maxout or max-pooling initialization perform better than or comparably to ReLU networks, while maxout networks initialized with ReLU-He converge slower and perform worse.

Table 3 reports test accuracy for networks trained using Adam (Kingma & Ba, 2015). We compare ReLU and maxout networks with the following initializations: maxout, max-pooling, and He $c = 2$. We observe that, compared to He initialization, maxout and max-pooling initializations lead to faster convergence and better test accuracy. Compared to ReLU networks, maxout networks have better or comparable accuracy if maxout or max-pooling initialization is used.
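The max-pooling initialization constant can be reproduced with the same Monte Carlo estimate of $M$, applied to the effective rank $K \cdot m^2$ (a sketch with our own naming; the value $c = 0.26573$ for $K = 5$, $m^2 = 4$ is the one quoted above):

```python
import numpy as np

def maxpool_init_c(K, pool_size, n_samples=2_000_000, seed=0):
    """c = 1/M for a maxout layer that follows max-pooling, treating the
    effective rank as K * pool_size (maxout rank times pooling window)."""
    rng = np.random.default_rng(seed)
    m = rng.standard_normal((n_samples, K * pool_size)).max(axis=1)
    return 1.0 / np.mean(m ** 2)

# K = 5 with a 2x2 pooling window (pool_size = 4) gives c close to 0.26573.
```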

7. DISCUSSION

We study the gradients of maxout networks with respect to the parameters and the inputs by analyzing a directional derivative of the input-output map. We observe that the distribution of the input-output Jacobian of maxout networks depends on the network input (in contrast to ReLU networks), which can complicate the stable initialization of maxout networks. Based on bounds on the moments, we derive an initialization that provably avoids vanishing and exploding gradients in wide networks. Experimentally, we show that, compared to other initializations, the suggested approach leads to better performance for fully-connected and convolutional deep networks of standard width trained with SGD or Adam, and better or similar performance compared to ReLU networks. Additionally, we refine previous upper bounds on the expected number of linear regions. We also derive results for the expected curve length distortion, observing that it does not grow exponentially with the depth in wide networks. Furthermore, we obtain bounds on the maxout NTK, suggesting that it might not converge to a constant when both the width and depth are large. These contributions enhance the applicability of maxout networks and add to the theoretical exploration of activation functions beyond ReLU.

Limitations

Even though our proposed initialization is in a sense optimal, our results are applicable only when the weights are sampled from $N(0, c/\text{fan-in})$ for some $c$. Further, we derived theoretical results only for fully-connected networks. Our experiments indicate that they also hold for CNNs: Figure 2 demonstrates that gradients behave according to the theory for fully-connected and convolutional networks, and Tables 2 and 3 show improvement in CNN performance under the initialization suggested in Section 4. However, we have yet to conduct a theoretical analysis of CNNs.

Future work

In future work, we would like to obtain more general results in settings involving multi-argument functions, such as aggregation functions in graph neural networks, and investigate the effects that initialization strategies stabilizing the initial gradients have at later stages of training.

Reproducibility Statement

To ensure reproducibility, we make the code public on GitHub at https://anonymous.4open.science/r/maxout_expected_gradient-68BD, and Section 6 and Appendix I provide a detailed description of the experimental settings. Proofs are in Appendices B-H.

APPENDIX

The appendix is organized as follows.

• Appendix A Basics
• Appendix B Bounds for the input-output Jacobian norm ∥J_N(x)u∥²
• Appendix C Moments of the input-output Jacobian norm ∥J_N(x)u∥²
• Appendix D Equality in distribution for the input-output Jacobian norm and wide network results
• Appendix E Activation length
• Appendix F Expected number of linear regions
• Appendix G Expected curve length distortion
• Appendix H NTK
• Appendix I Experiments

A BASICS

A.1 BASICS ON MAXOUT NETWORKS

As mentioned in the introduction, a rank-K maxout unit computes the maximum of K real-valued affine functions. Concretely, a rank-K maxout unit with n inputs implements a function

R^n → R; x ↦ max_{k∈[K]} {⟨W_k, x⟩ + b_k},

where W_k ∈ R^n and b_k ∈ R, k ∈ [K] := {1, ..., K}, are trainable weights and biases. The K arguments of the maximum are called the pre-activation features of the maxout unit. A rank-K maxout unit can be regarded as a composition of an affine map with K outputs and a maximum gate. A layer corresponds to the parallel computation of several such units. For instance, a layer with n inputs and m maxout units computes a function of the form

R^n → R^m; x ↦ ( max_{k∈[K]} {⟨W^(1)_{1,k}, x⟩ + b^(1)_{1,k}}, ..., max_{k∈[K]} {⟨W^(1)_{m,k}, x⟩ + b^(1)_{m,k}} ),

where W^(1)_{i,k} and b^(1)_{i,k} are the weights and biases of the kth pre-activation feature of the ith maxout unit in the first layer. The situation is illustrated in Figure 3 for the case of a network with two inputs, one layer with two maxout units of rank three, and one output layer with a single output unit.

Figure 3: Illustration of a simple maxout network with two input units, one hidden layer consisting of two maxout units of rank 3, and an affine output layer with a single output unit.
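The layer computation above can be sketched in a few lines of NumPy; this is an illustrative implementation (names and array shapes are my choice, not from the paper):

```python
import numpy as np

def maxout_layer(x, W, b):
    """Forward pass of one maxout layer.

    x: input vector, shape (n,)
    W: weights, shape (m, K, n) -- m maxout units, K pre-activation features each
    b: biases, shape (m, K)
    Returns: output vector, shape (m,), the unit-wise maximum of the K
    pre-activation features <W[i, k], x> + b[i, k].
    """
    pre = np.einsum("mkn,n->mk", W, x) + b  # pre-activation features, shape (m, K)
    return pre.max(axis=1)

# Example matching Figure 3: 2 inputs, 2 maxout units of rank 3.
rng = np.random.default_rng(0)
x = rng.standard_normal(2)
W = rng.standard_normal((2, 3, 2))
b = rng.standard_normal((2, 3))
y = maxout_layer(x, W, b)
assert y.shape == (2,)
```

An affine output layer applied to y would complete the small network of Figure 3.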

A.2 BASIC NOTIONS OF PROBABILITY

We recall several notions from probability theory that we use to state our results. Firstly, recall that if v_1, ..., v_k are independent, univariate standard normal random variables, then the sum of their squares, ∑_{i=1}^k v_i², is distributed according to the chi-squared distribution with k degrees of freedom. We denote such a random variable by χ²_k. Secondly, the largest order statistic is a random variable defined as the maximum of a random sample, and the smallest order statistic is the minimum of a sample. Finally, a real-valued random variable X is said to be smaller than Y in the stochastic order, denoted by X ≤st Y, if Pr(X > x) ≤ Pr(Y > x) for all x ∈ R. We also denote by d= equality in distribution (meaning that the cdfs are the same). With this, we start with the results for the squared norm of the input-output Jacobian ∥J_N(x)u∥².
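These notions are easy to illustrate empirically; a small sketch (sample sizes and the degrees of freedom are my choice). Note that because the minimum and maximum of a sample bound any single entry of it pointwise, the empirical survival functions obey the stochastic order exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
k, K, n = 5, 3, 200_000

# Sum of k squared standard normals is chi-squared with k degrees of freedom:
# E[chi^2_k] = k and Var[chi^2_k] = 2k.
chi2_k = (rng.standard_normal((n, k)) ** 2).sum(axis=1)
assert abs(chi2_k.mean() - k) < 0.05
assert abs(chi2_k.var() - 2 * k) < 0.3

# Smallest/largest order statistics of K chi-squared(1) samples, and the
# stochastic order min <=_st single entry <=_st max, checked via empirical
# survival functions Pr(X > x) on a grid.
s = rng.standard_normal((n, K)) ** 2
lo, hi, single = s.min(axis=1), s.max(axis=1), s[:, 0]
for x in np.linspace(0.0, 5.0, 11):
    assert (lo > x).mean() <= (single > x).mean() <= (hi > x).mean()
```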

A.3 DETAILS ON EQUATION (1)

In (1) we investigate the magnitude of ∂L(x)/∂W^(l)_{i,k′,j}. The reason we focus on the Jacobian norm rather than on C is as follows. We have

∂L(x)/∂W^(l)_{i,k′,j} = ⟨∇_N L(N(x)), J_N(W^(l)_{i,k′,j})⟩ = ⟨∇_N L(N(x)), J_N(x^(l)_i)⟩ x^(l-1)_j = ⟨∇_N L(N(x)), J_N(x^(l)) u⟩ x^(l-1)_j, with u = e_i,
= C(x, W) ∥J_N(x^(l)) u∥ x^(l-1)_j.

Note that C(x, W) = ⟨∇_N L(N(x)), v⟩ with v = J_N(x^(l)) u / ∥J_N(x^(l)) u∥, ∥v∥ = 1. Hence C(x, W) ≤ ∥∇_N L(N(x))∥∥v∥ = ∥∇_N L(N(x))∥. The latter term does not directly depend on the specific parametrization or architecture of the network, but only on the loss function and the prediction. In view of this expression for ∂L(x)/∂W^(l)_{i,k′,j}, the variance depends on the square of x^(l-1)_j. Similarly, the variance of the gradient ∇_{W^(l)} L(x) = (∂L(x)/∂W^(l)_{i,k′,j})_j depends on x^(l-1) = (x^(l-1)_j)_j and thus on ∥x^(l-1)∥². This is how the activation length appears in (1).

B BOUNDS FOR THE INPUT-OUTPUT JACOBIAN NORM ∥J_N(x)u∥²

B.1 PRELIMINARIES

We start by presenting several well-known results that we will need for the further discussion.

Product of a Gaussian matrix and a unit vector

Lemma 10. Suppose W is an n × n′ matrix with i.i.d. Gaussian entries and u is a random unit vector in R^{n′} that is independent of W but otherwise has any distribution. Then
1. Wu is independent of u and is equal in distribution to Wv, where v is any fixed unit vector in R^{n′}.
2. If the entries of W are sampled i.i.d. from N(µ, σ²), then for all i = 1, ..., n, W_i u ∼ N(µ, σ²) and is independent of u.
3. If the entries of W are sampled i.i.d. from N(0, σ²), then the squared ℓ₂ norm ∥Wu∥² d= σ² χ²_n, where χ²_n is a chi-squared random variable with n degrees of freedom that is independent of u.

Proof. Statement 1 was proved in, e.g., Hanin et al. (2021, Lemma C.3) by considering directly the joint distribution of Wu and u. Statement 2 follows from Statement 1 if we pick v = e_1. To prove Statement 3, recall that by definition of the ℓ₂ norm, ∥Wu∥² = ∑_{i=1}^n (W_i u)². By Statement 2, for all i = 1, ..., n, the W_i u are Gaussian random variables independent of u with mean zero and variance σ². Since any Gaussian random variable sampled from N(µ, σ²) can be written as µ + σv_i, where v_i ∼ N(0, 1), here with µ = 0, we can write ∑_{i=1}^n (W_i u)² = σ² ∑_{i=1}^n v_i². By definition of the chi-squared distribution, ∑_{i=1}^n v_i² is a chi-squared random variable with n degrees of freedom, denoted by χ²_n, which leads to the desired result.
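Statement 3 of Lemma 10 can be checked numerically; a Monte Carlo sketch (dimensions and the deliberately non-uniform distribution of u are my choice) compares the mean and variance of ∥Wu∥² to those of σ²χ²_n:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_prime, sigma2, trials = 7, 4, 0.5, 200_000

# Random unit vectors u with a non-uniform distribution (normalized exponentials);
# Lemma 10 only requires u to be independent of W.
u = rng.exponential(size=(trials, n_prime))
u /= np.linalg.norm(u, axis=1, keepdims=True)

W = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n, n_prime))
sq_norms = (np.einsum("tij,tj->ti", W, u) ** 2).sum(axis=1)

# ||Wu||^2 =_d sigma^2 * chi^2_n: mean sigma^2 * n, variance 2 * sigma^4 * n.
assert abs(sq_norms.mean() - sigma2 * n) < 0.05
assert abs(sq_norms.var() - 2 * sigma2**2 * n) < 0.15
```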

Stochastic order

We recall the definition of the stochastic order. A real-valued random variable X is said to be smaller than Y in the stochastic order, denoted by X ≤st Y, if Pr(X > x) ≤ Pr(Y > x) for all x ∈ R.

Remark 11 (Stochastic ordering for functions). Consider two functions f : X → R and g : X → R that satisfy f(x) ≤ g(x) for all x ∈ X. Then, for a random variable X, f(X) ≤st g(X). To see this, observe that for any y ∈ R, Pr(f(X) > y) = Pr(X ∈ {x : f(x) > y}) and Pr(g(X) > y) = Pr(X ∈ {x : g(x) > y}). Since f(x) ≤ g(x) for all x ∈ X, {x : f(x) > y} ⊆ {x : g(x) > y}. Hence, Pr(f(X) > y) ≤ Pr(g(X) > y).

Remark 12 (Stochastic ordering and equality in distribution). If X ≤st Y and Y d= Ŷ, then X ≤st Ŷ, since for any y ∈ R, Pr(X > y) ≤ Pr(Y > y) = Pr(Ŷ > y).

B.2 EXPRESSION FOR ∥J_N(x)u∥²

Before proceeding to the proof of the main statement, given in Theorem 1, we present Proposition 13, in which we prove an equality that holds almost surely for the input-output Jacobian under our assumptions. The reasoning closely follows Hanin et al. (2021, Proposition C.2). The modifications are due to the fact that a maxout network Jacobian is a product of matrices consisting of the rows of weights that are selected based on which pre-activation feature attains the maximum, whereas in a ReLU network the rows in these matrices are either the neuron weights or zeros.

Proposition 13 (Equality for ∥J_N(x)u∥²). Let N be a fully-connected feed-forward neural network with maxout units of rank K and a linear last layer. Let the network have L layers of widths n_1, ..., n_L and n_0 inputs. Assume that the weights are continuous random variables (that have a density) and that the biases are independent of the weights but otherwise initialized using any approach. Let u ∈ R^{n_0} be a fixed unit vector.
Then, for any input into the network x ∈ R^{n_0}, almost surely with respect to the parameter initialization, the Jacobian with respect to the input satisfies

∥J_N(x)u∥² = ∥W^(L) u^(L-1)∥² ∏_{l=1}^{L-1} ∑_{i=1}^{n_l} ⟨W^(l)_i, u^(l-1)⟩²,

where the vectors u^(l), l = 1, ..., L-1, are defined recursively as u^(l) = W^(l) u^(l-1) / ∥W^(l) u^(l-1)∥ when W^(l) u^(l-1) ≠ 0 and 0 otherwise, and u^(0) = u. The matrices W^(l) consist of rows W^(l)_i = W^(l)_{i,k′} ∈ R^{n_{l-1}}, i = 1, ..., n_l, where k′ = argmax_{k∈[K]} {W^(l)_{i,k} x^(l-1) + b^(l)_{i,k}}, x^(l) is the output of the lth layer, and x^(0) = x.

Proof. The Jacobian J_N(x) of a network N : R^{n_0} → R^{n_L} can be written as a product of matrices W^(l), l = 1, ..., L, depending on the activation region of the input x. The matrix W^(l) consists of rows W^(l)_i = W^(l)_{i,k′} ∈ R^{n_{l-1}}, where k′ = argmax_{k∈[K]} {W^(l)_{i,k} x^(l-1) + b^(l)_{i,k}} for i = 1, ..., n_l, and x^(l-1) is the lth layer's input. For the last layer, which is linear, we have W^(L) = W^(L). Thus,

∥J_N(x)u∥² = ∥W^(L) W^(L-1) ··· W^(1) u∥². (4)

Further, we denote u by u^(0) and assume ∥W^(1) u^(0)∥ ≠ 0. To see that this holds almost surely, note that for a fixed unit vector u^(0), the probability of W^(1) being such that ∥W^(1) u^(0)∥ = 0 is 0. This is indeed the case since, to satisfy ∥W^(1) u^(0)∥ = 0, the weights must be a solution to a system of n_1 linear equations, and this system is regular when u ≠ 0, so the solution set has positive co-dimension and hence zero measure. Multiplying and dividing (4) by ∥W^(1) u^(0)∥²,

∥J_N(x)u∥² = ∥W^(L) W^(L-1) ··· W^(2) (W^(1) u^(0) / ∥W^(1) u^(0)∥)∥² ∥W^(1) u^(0)∥² = ∥W^(L) W^(L-1) ··· W^(2) u^(1)∥² ∥W^(1) u^(0)∥²,

where u^(1) = W^(1) u^(0) / ∥W^(1) u^(0)∥. Repeating this procedure layer-by-layer, we get

∥J_N(x)u∥² = ∥W^(L) u^(L-1)∥² ∥W^(L-1) u^(L-2)∥² ··· ∥W^(1) u^(0)∥². (5)

By definition of the ℓ₂ norm, for any layer l, ∥W^(l) u^(l-1)∥² = ∑_{i=1}^{n_l} ⟨W^(l)_i, u^(l-1)⟩².
Substituting this into (5), we get the desired statement.

B.3 STOCHASTIC ORDERING FOR ∥J_N(x)u∥²

Now we prove the result on the stochastic ordering of the input-output Jacobian in a finite-width maxout network.

Theorem 1 (Bounds on ∥J_N(x)u∥²). Consider a maxout network with the settings of Section 2. Assume that the biases are independent of the weights but otherwise initialized using any approach. Let u ∈ R^{n_0} be a fixed unit vector. Then, almost surely with respect to the parameter initialization, for any input into the network x ∈ R^{n_0}, the following stochastic order bounds hold:

(1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} ξ_{l,i}(χ²_1, K) ≤st ∥J_N(x)u∥² ≤st (1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} Ξ_{l,i}(χ²_1, K),

where ξ_{l,i}(χ²_1, K) and Ξ_{l,i}(χ²_1, K) are respectively the smallest and largest order statistic in a sample of size K of chi-squared random variables with 1 degree of freedom, independent of each other and of the vectors u and x.

Proof. From Proposition 13, we have the equality

∥J_N(x)u∥² = ∥W^(L) u^(L-1)∥² ∏_{l=1}^{L-1} ∑_{i=1}^{n_l} ⟨W^(l)_i, u^(l-1)⟩², (6)

where the vectors u^(l), l = 0, ..., L-1, are defined recursively as u^(l) = W^(l) u^(l-1) / ∥W^(l) u^(l-1)∥ and u^(0) = u. The matrices W^(l) consist of rows W^(l)_i = W^(l)_{i,k′} ∈ R^{n_{l-1}}, i = 1, ..., n_l, where k′ = argmax_{k∈[K]} {W^(l)_{i,k} x^(l-1) + b^(l)_{i,k}}, and x^(l-1) is the lth layer's input, x^(0) = x. We assumed that the weights in the last layer are sampled from a Gaussian distribution with mean zero and variance 1/n_{L-1}. Then, by Lemma 10, item 3, ∥W^(L) u^(L-1)∥² d= (1/n_{L-1}) χ²_{n_L}, independent of u^(L-1). Using this observation in equation (6), and then multiplying and dividing the summands by c/n_{l-1} and rearranging, we obtain

∥J_N(x)u∥² d= (1/n_{L-1}) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_{l-1}) ∑_{i=1}^{n_l} (n_{l-1}/c) ⟨W^(l)_i, u^(l-1)⟩² = (1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} (n_{l-1}/c) ⟨W^(l)_i, u^(l-1)⟩².

Now we focus on √(n_{l-1}/c) ⟨W^(l)_i, u^(l-1)⟩. Since we have assumed that the weights are sampled from a Gaussian distribution with zero mean and variance c/n_{l-1}, any weight W^(l)_{i,k,j}, j = 1, ..., n_{l-1}, can be written as √(c/n_{l-1}) v^(l)_{i,k,j}, where v^(l)_{i,k,j} is a standard Gaussian random variable. We also write W^(l)_{i,k} = √(c/n_{l-1}) V^(l)_{i,k}, where V^(l)_{i,k} is an n_{l-1}-dimensional standard Gaussian random vector. Observe that for any k′ ∈ [K],

⟨W^(l)_{i,k′}, u^(l-1)⟩² ≤ max_{k∈[K]} {⟨W^(l)_{i,k}, u^(l-1)⟩²} and ⟨W^(l)_{i,k′}, u^(l-1)⟩² ≥ min_{k∈[K]} {⟨W^(l)_{i,k}, u^(l-1)⟩²}.

Therefore,

(c/n_{l-1}) min_{k∈[K]} ⟨V^(l)_{i,k}, u^(l-1)⟩² ≤ ⟨W^(l)_i, u^(l-1)⟩² ≤ (c/n_{l-1}) max_{k∈[K]} ⟨V^(l)_{i,k}, u^(l-1)⟩².

Notice that the vectors u^(l-1) are unit vectors by their definition. By Lemma 10, the inner product of a standard Gaussian vector and a unit vector is a standard Gaussian random variable independent of the given unit vector. By definition, a squared standard Gaussian random variable is distributed as χ²_1, a chi-squared random variable with 1 degree of freedom. Hence, max_{k∈[K]} {⟨V^(l)_{i,k}, u^(l-1)⟩²} is distributed as the largest order statistic in a sample of size K of chi-squared random variables with 1 degree of freedom. We denote such a random variable by Ξ_{l,i}(χ²_1, K). Likewise, min_{k∈[K]} {⟨V^(l)_{i,k}, u^(l-1)⟩²} is distributed as the smallest order statistic in such a sample, denoted by ξ_{l,i}(χ²_1, K). Combining the results for each layer, we obtain the bounds

∥J_N(x)u∥² ≤ (1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} max_{k∈[K]} ⟨V^(l)_{i,k}, u^(l-1)⟩² d= (1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} Ξ_{l,i}(χ²_1, K),

∥J_N(x)u∥² ≥ (1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} min_{k∈[K]} ⟨V^(l)_{i,k}, u^(l-1)⟩² d= (1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} ξ_{l,i}(χ²_1, K).
Then, by Remarks 11 and 12, the following stochastic ordering holds:

(1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} ξ_{l,i}(χ²_1, K) ≤st ∥J_N(x)u∥² ≤st (1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} Ξ_{l,i}(χ²_1, K),

which concludes the proof.

C MOMENTS OF THE INPUT-OUTPUT JACOBIAN NORM ∥J_N(x)u∥²

In the proof of the bounds on the moments, we use an approach similar to Hanin et al. (2021) for upper bounding the moments of the chi-squared distribution.

Corollary 2 (Bounds on the moments of ∥J_N(x)u∥²). Consider a maxout network with the settings of Section 2. Assume that the biases are independent of the weights but otherwise initialized using any approach. Let u ∈ R^{n_0} be a fixed unit vector and x ∈ R^{n_0} be any input into the network. Then

(i) (n_L/n_0)(cS)^{L-1} ≤ E[∥J_N(x)u∥²] ≤ (n_L/n_0)(cL)^{L-1},

(ii) Var[∥J_N(x)u∥²] ≤ (n_L/n_0)² c^{2(L-1)} ( K^{2(L-1)} exp{4(∑_{l=1}^{L-1} 1/(n_l K) + 1/n_L)} − S^{2(L-1)} ),

(iii) E[∥J_N(x)u∥^{2t}] ≤ (n_L/n_0)^t (cK)^{t(L-1)} exp{t²(∑_{l=1}^{L-1} 1/(n_l K) + 1/n_L)}, t ∈ N,

where the expectation is taken with respect to the distribution of the network weights. The constants S and L depend on K and denote the means of the smallest and the largest order statistic in a sample of K chi-squared random variables. For K = 2, ..., 10, S ∈ [0.02, 0.4] and L ∈ [1.6, 4]. See Table 4 in Appendix C for the exact values.

Proof. We first prove the results for the mean, then for the moments of order t > 1, and finish with the proof for the variance.

Mean Using the mutual independence of the variables in the bounds in Theorem 1, and the fact that if two non-negative univariate random variables X and Y satisfy X ≤st Y, then E[X^n] ≤ E[Y^n] for all n ≥ 1 (Müller & Stoyan, 2002, Theorem 1.2.12),

(1/n_0) E[χ²_{n_L}] ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} E[ξ_{l,i}] ≤ E[∥J_N(x)u∥²] ≤ (1/n_0) E[χ²_{n_L}] ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} E[Ξ_{l,i}],

where we use ξ_{l,i} and Ξ_{l,i} as shorthands for ξ_{l,i}(χ²_1, K) and Ξ_{l,i}(χ²_1, K).
Using the formulas for the pdfs of the largest and the smallest order statistics from Remark 14, the mean of the largest order statistic equals

E[Ξ_{l,i}] = (K/√(2π)) ∫₀^∞ erf(√(x/2))^{K-1} x^{1/2} e^{-x/2} dx = L,

and the mean of the smallest order statistic equals

E[ξ_{l,i}] = (K/√(2π)) ∫₀^∞ (1 − erf(√(x/2)))^{K-1} x^{1/2} e^{-x/2} dx = S.

Here we denoted the right-hand sides by L and S; these are constants depending on K that can be computed exactly for K = 2 and K = 3, and approximately for higher K; see Table 4. It is known that E[χ²_{n_L}] = n_L. Combining, we get

(n_L/n_0)(cS)^{L-1} ≤ E[∥J_N(x)u∥²] ≤ (n_L/n_0)(cL)^{L-1}.

Moments of order t > 1 As above, using the mutual independence of the variables in the bounds in Theorem 1, and that if two non-negative univariate random variables X and Y satisfy X ≤st Y, then E[X^n] ≤ E[Y^n] for all n ≥ 1 (Müller & Stoyan, 2002, Theorem 1.2.12),

E[∥J_N(x)u∥^{2t}] ≤ E[((1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} Ξ_{l,i})^t] = (n_L/n_0)^t (1/n_L)^t E[(χ²_{n_L})^t] ∏_{l=1}^{L-1} (c/n_l)^t E[(∑_{i=1}^{n_l} Ξ_{l,i})^t]. (7)

Upper-bounding the maximum of chi-squared variables by their sum,

(c/n_l)^t E[(∑_{i=1}^{n_l} Ξ_{l,i})^t] ≤ (c/n_l)^t E[(∑_{i=1}^{n_l} ∑_{k=1}^{K} (χ²_1)_{l,i,k})^t] = (c/n_l)^t E[(χ²_{n_l K})^t],

where we used that a sum of n_l K chi-squared variables with one degree of freedom is a chi-squared variable with n_l K degrees of freedom. Using the formula for the noncentral moments of the chi-squared distribution and the inequality 1 + x ≤ e^x,

(c/n_l)^t E[(χ²_{n_l K})^t] = (c/n_l)^t (n_l K)(n_l K + 2) ··· (n_l K + 2t − 2) = c^t K^t · 1 · (1 + 2/(n_l K)) ··· (1 + (2t−2)/(n_l K)) ≤ c^t K^t exp{∑_{i=0}^{t-1} 2i/(n_l K)} ≤ c^t K^t exp{t²/(n_l K)},

where we used the formula for the sum of consecutive integers, ∑_{i=1}^{t-1} i = t(t−1)/2. Similarly,

(1/n_L)^t E[(χ²_{n_L})^t] ≤ exp{t²/n_L}.

Combining, we upper bound (7) by (n_L/n_0)^t (cK)^{t(L-1)} exp{t²(∑_{l=1}^{L-1} 1/(n_l K) + 1/n_L)}.

Variance Combining the upper bound on the second moment and the lower bound on the mean, we get the following upper bound on the variance:
Var[∥J_N(x)u∥²] ≤ (n_L/n_0)² c^{2(L-1)} ( K^{2(L-1)} exp{4(∑_{l=1}^{L-1} 1/(n_l K) + 1/n_L)} − S^{2(L-1)} ),

which concludes the proof.

Remark 14 (Computing the constants). Here we provide the derivations necessary to compute the constants equal to the moments of the largest and the smallest order statistics appearing in the results. Firstly, the cdf of the largest order statistic of independent univariate random variables y_1, ..., y_K with cdf F(x) and pdf f(x) is

Pr(max_{k∈[K]} {y_k} < x) = Pr(⋂_{k=1}^{K} {y_k < x}) = ∏_{k=1}^{K} Pr(y_k < x) = (F(x))^K.

Hence, the pdf is K (F(x))^{K-1} f(x). For the smallest order statistic, the cdf is

Pr(min_{k∈[K]} {y_k} < x) = 1 − ∏_{k=1}^{K} Pr(y_k ≥ x) = 1 − (1 − F(x))^K.

Thus, the pdf is K (1 − F(x))^{K-1} f(x). Now we obtain the pdfs for the distributions that are used in the results.

Chi-squared distribution

The cdf of a chi-squared random variable χ²_k with k = 1 degree of freedom is F(x) = (Γ(k/2))^{-1} γ(k/2, x/2) = erf(√(x/2)), and the pdf is f(x) = (2^{k/2} Γ(k/2))^{-1} x^{k/2-1} e^{-x/2} = (2π)^{-1/2} x^{-1/2} e^{-x/2}. Here we used that Γ(1/2) = √π and γ(1/2, x/2) = √π erf(√(x/2)). Therefore, the pdf of the largest order statistic in a sample of K chi-squared random variables with 1 degree of freedom, Ξ_{l,i}(χ²_1, K), is

K erf(√(x/2))^{K-1} (2π)^{-1/2} x^{-1/2} e^{-x/2},

and the pdf of the smallest order statistic in a sample of K chi-squared random variables with 1 degree of freedom, ξ_{l,i}(χ²_1, K), is

K (1 − erf(√(x/2)))^{K-1} (2π)^{-1/2} x^{-1/2} e^{-x/2}.

Standard Gaussian distribution Recall that the cdf of a standard Gaussian random variable is F(x) = (1/2)(1 + erf(x/√2)), and the pdf is f(x) = (2π)^{-1/2} e^{-x²/2}. Then, for the pdf of the largest order statistic in a sample of K standard Gaussian random variables, Ξ_{l,i}(N(0,1), K), we get

(K/(2^{K-1} √(2π))) (1 + erf(x/√2))^{K-1} e^{-x²/2}.

Constants Now we obtain formulas for the constants. For the mean of the smallest order statistic in a sample of K chi-squared random variables with 1 degree of freedom, ξ_{l,i}(χ²_1, K), we get

S = (K/√(2π)) ∫₀^∞ x^{1/2} (1 − erf(√(x/2)))^{K-1} e^{-x/2} dx.

The mean of the largest order statistic in a sample of K chi-squared random variables with 1 degree of freedom, Ξ_{l,i}(χ²_1, K), is

L = (K/√(2π)) ∫₀^∞ x^{1/2} erf(√(x/2))^{K-1} e^{-x/2} dx.

The second moment of the largest order statistic in a sample of K standard Gaussian random variables, Ξ_{l,i}(N(0,1), K), equals

M = (K/(2^{K-1} √(2π))) ∫_{-∞}^{∞} x² (1 + erf(x/√2))^{K-1} e^{-x²/2} dx.

These constants can be evaluated using numerical computation software. The values estimated for K = 2, ..., 10 using Mathematica (Wolfram Research, Inc, 2022) are given in Table 4.
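The integrals for S and L can be evaluated with elementary numerical quadrature; a sketch using only the Python standard library (step size and integration limit are my choice). For K = 2 one can check by direct calculation that L = 1 + 2/π and S = 1 − 2/π, since max + min = sum and E|v₁² − v₂²| = 4/π for independent standard Gaussians v₁, v₂; the quadrature reproduces these values:

```python
import math

def order_stat_means(K, x_max=80.0, n_steps=200_000):
    """Trapezoidal quadrature for the means of the largest (L) and smallest (S)
    order statistic in a sample of K chi-squared(1) random variables.

    Integrands (pdfs from Remark 14, multiplied by x):
      L: K/sqrt(2*pi) * x^{1/2} * erf(sqrt(x/2))^{K-1}       * exp(-x/2)
      S: K/sqrt(2*pi) * x^{1/2} * (1 - erf(sqrt(x/2)))^{K-1} * exp(-x/2)
    """
    h = x_max / n_steps
    const = K / math.sqrt(2.0 * math.pi)
    big = small = 0.0
    for i in range(n_steps + 1):
        x = i * h
        w = h if 0 < i < n_steps else h / 2.0  # trapezoid weights
        base = const * math.sqrt(x) * math.exp(-x / 2.0)
        e = math.erf(math.sqrt(x / 2.0))
        big += w * base * e ** (K - 1)
        small += w * base * (1.0 - e) ** (K - 1)
    return small, big

S2, L2 = order_stat_means(2)
print(S2, L2)  # close to 1 - 2/pi = 0.3634 and 1 + 2/pi = 1.6366
```

Since the K order statistics sum to the whole sample, for K = 2 one also has S + L = 2, a useful consistency check on the quadrature.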

D EQUALITY IN DISTRIBUTION FOR THE INPUT-OUTPUT JACOBIAN NORM AND WIDE NETWORK RESULTS

Here we prove the results from Section 3.2. We will use the following theorem from Anderson (2003). We reference it here without proof, but remark that it is based on the well-known result that uncorrelated jointly Gaussian random variables are independent.

Theorem 15 (Anderson 2003, Theorem 3.3.1). Suppose X_1, ..., X_N are independent, where X_α is distributed according to N(µ_α, Σ). Let C = (c_{αβ}) be an N × N orthogonal matrix. Then Y_α = ∑_{β=1}^{N} c_{αβ} X_β is distributed according to N(ν_α, Σ), where ν_α = ∑_{β=1}^{N} c_{αβ} µ_β, α = 1, ..., N, and Y_1, ..., Y_N are independent.

Remark 16. We will use Theorem 15 in the following way. Notice that it is possible to consider a vector v with entries sampled i.i.d. from N(0, σ²) in Theorem 15 and treat the entries of v as a set of 1-dimensional vectors X_1, ..., X_N. Then we obtain that the products of the columns of the orthogonal matrix C with the vector v, Y_β = ∑_{α=1}^{N} c_{αβ} v_α, are distributed according to N(0, σ²) and are mutually independent.

Theorem 3 (Equality in distribution for ∥J_N(x)u∥²). Consider a maxout network with the settings of Section 2. Let u ∈ R^{n_0} be a fixed unit vector and x ∈ R^{n_0}, x ≠ 0, be any input into the network. Then, almost surely with respect to the parameter initialization, ∥J_N(x)u∥² equals in distribution

(1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} ( v_i √(1 − cos² γ_{x̄^(l-1), ū^(l-1)}) + Ξ_{l,i}(N(0,1), K) cos γ_{x̄^(l-1), ū^(l-1)} )²,

where v_i and Ξ_{l,i}(N(0,1), K) are independent, v_i ∼ N(0,1), and Ξ_{l,i}(N(0,1), K) is the largest order statistic in a sample of K standard Gaussian random variables. Here γ_{x̄^(l), ū^(l)} denotes the angle between x̄^(l) := (x^(l)_1, ..., x^(l)_{n_l}, 1) and ū^(l) := (u^(l)_1, ..., u^(l)_{n_l}, 0) in R^{n_l + 1}, where u^(l) = W^(l) u^(l-1) / ∥W^(l) u^(l-1)∥ when W^(l) u^(l-1) ≠ 0 and 0 otherwise, and u^(0) = u. The matrices W^(l) consist of rows W^(l)_i = W^(l)_{i,k′} ∈ R^{n_{l-1}}, where k′ = argmax_{k∈[K]} {W^(l)_{i,k} x^(l-1) + b^(l)_{i,k}}.

Proof. As in the proof of Theorem 1,

∥J_N(x)u∥² d= (1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} (n_{l-1}/c) ⟨W^(l)_i, u^(l-1)⟩².

Now we focus on √(n_{l-1}/c) ⟨W^(l)_i, u^(l-1)⟩. Since we have assumed that the weights and biases are sampled from a Gaussian distribution with zero mean and variance c/n_{l-1}, any weight W^(l)_{i,k,j}, j = 1, ..., n_{l-1} (or bias), can be written as √(c/n_{l-1}) v^(l)_{i,k,j}, where v^(l)_{i,k,j} is standard Gaussian. Denoting by W̄^(l)_i the row with its bias appended, and noting that ⟨W^(l)_i, u^(l-1)⟩ = ⟨W̄^(l)_i, ū^(l-1)⟩ since the last coordinate of ū^(l-1) is zero, we therefore have

(n_{l-1}/c) ⟨W̄^(l)_i, ū^(l-1)⟩² = ⟨V^(l)_i, ū^(l-1)⟩²,

where V^(l)_i = V^(l)_{i,k′} ∈ R^{n_{l-1}+1}, k′ = argmax_{k∈[K]} {⟨V^(l)_{i,k}, x̄^(l-1)⟩}, and the V^(l)_{i,k} are (n_{l-1}+1)-dimensional standard Gaussian random vectors.

We construct an orthonormal basis B = (b_1, ..., b_{n_{l-1}+1}) of R^{n_{l-1}+1}, where we set b_1 = x̄^(l-1)/∥x̄^(l-1)∥ and choose the other vectors to be unit vectors orthogonal to b_1. The change of basis matrix from the standard basis I to the basis B is given by B^T; see, e.g., Anton & Rorres (2013, Theorem 6.6.4). Then any row V^(l)_{i,k} can be expressed as V^(l)_{i,k} = c_{k,1} b_1 + ··· + c_{k,n_{l-1}+1} b_{n_{l-1}+1}, where c_{k,j} = ⟨V^(l)_{i,k}, b_j⟩, j = 1, ..., n_{l-1}+1. The coordinate vector of x̄^(l-1) relative to B is (∥x̄^(l-1)∥, 0, ..., 0). The vector ū^(l-1) has the coordinate vector (⟨ū^(l-1), b_1⟩, ..., ⟨ū^(l-1), b_{n_{l-1}+1}⟩) in B. This coordinate vector has norm 1, since a change of basis between two orthonormal bases does not change the ℓ₂ norm; see, e.g., Anton & Rorres (2013, Theorem 6.3.2). For the maximum, using the representation of the vectors in the basis B, we get

⟨V^(l)_i, x̄^(l-1)⟩ = max_{k∈[K]} ⟨V^(l)_{i,k}, x̄^(l-1)⟩ = max_{k∈[K]} {c_{k,1} ∥x̄^(l-1)∥} = ∥x̄^(l-1)∥ max_{k∈[K]} {c_{k,1}}. (11)

Therefore, in the basis B, V^(l)_i has components (max_{k∈[K]} {c_{k,1}}, c_{k′,2}, ..., c_{k′,n_{l-1}+1}). By Theorem 15, for all k = 1, ..., K, j = 1, ..., n_{l-1}+1, the coefficients c_{k,j} are mutually independent standard Gaussian random variables that are also independent of the vectors b_j, j = 1, ..., n_{l-1}+1, by Lemma 10, and of ū^(l-1).

Hence,

⟨V^(l)_i, ū^(l-1)⟩ = max_{k∈[K]} {c_{k,1}} ⟨ū^(l-1), b_1⟩ + ∑_{j=2}^{n_{l-1}+1} c_{k′,j} ⟨ū^(l-1), b_j⟩ d= Ξ_{l,i}(N(0,1), K) ⟨ū^(l-1), b_1⟩ + ∑_{j=2}^{n_{l-1}+1} v_j ⟨ū^(l-1), b_j⟩, (12)

where Ξ_{l,i}(N(0,1), K) is the largest order statistic in a sample of K standard Gaussian random variables and v_j ∼ N(0,1). Since we have simply written equality in distribution for max_{k∈[K]} {c_{k,1}} and c_{k′,j}, the variables Ξ_{l,i}(N(0,1), K) and v_j, j = 2, ..., n_{l-1}+1, are also mutually independent, and independent of the vectors b_j and of ū^(l-1). In the following we use Ξ_{l,i} as a shorthand for Ξ_{l,i}(N(0,1), K). A linear combination ∑_{i=1}^{n} a_i v_i of Gaussian random variables v_1, ..., v_n, v_j ∼ N(µ_j, σ_j²), j = 1, ..., n, with coefficients a_1, ..., a_n is distributed according to N(∑_{i=1}^{n} a_i µ_i, ∑_{i=1}^{n} a_i² σ_i²). Hence,

∑_{j=2}^{n_{l-1}+1} v_j ⟨ū^(l-1), b_j⟩ ∼ N(0, ∑_{j=2}^{n_{l-1}+1} ⟨ū^(l-1), b_j⟩²).

Since ∑_{j=2}^{n_{l-1}+1} ⟨ū^(l-1), b_j⟩² = 1 − ⟨ū^(l-1), b_1⟩² = 1 − cos² γ_{x̄^(l-1), ū^(l-1)}, we get

∑_{j=2}^{n_{l-1}+1} v_j ⟨ū^(l-1), b_j⟩ + Ξ_{l,i} ⟨ū^(l-1), b_1⟩ d= v_i √(1 − cos² γ_{x̄^(l-1), ū^(l-1)}) + Ξ_{l,i} cos γ_{x̄^(l-1), ū^(l-1)}, (13)

where v_i ∼ N(0,1). Notice that v_i √(1 − cos² γ_{x̄^(l-1), ū^(l-1)}) and Ξ_{l,i} cos γ_{x̄^(l-1), ū^(l-1)} are stochastically independent, because v_i and Ξ_{l,i} are independent and multiplying random variables by constants does not affect stochastic independence.

Remark 17. The result in Theorem 3 also holds when the biases are initialized to zero. The proof is simplified in this case: there is no need to define the additional vectors x̄^(l-1) and ū^(l-1), and when constructing the basis, the first vector is defined as b_1 := x^(l-1)/∥x^(l-1)∥. The rest of the proof remains the same.

Remark 18 (Effects of the width and depth on a maxout network). According to Theorem 3, the behavior of ∥J_N(x)u∥² in a maxout network depends on cos γ_{x̄^(l-1), ū^(l-1)}, which changes as the network gets wider or deeper.
Figure 7 demonstrates how the width and depth affect ∥J_N(x)u∥².

(Displaced figure caption: Inputs are standard Gaussian vectors. Vector u is a one-hot vector with 1 at a random position, and it is the same for one setup. We sampled 1000 inputs and 1000 initializations for each input. The left end corresponds to the second moment of the Gaussian distribution and the right end to the second moment of the largest order statistic. Observe that for wide and deep networks the mean is closer to the second moment of the largest order statistic.)

Wide shallow networks Since independent and isotropic random vectors in high-dimensional spaces tend to be almost orthogonal (Vershynin, 2018, Remark 2.3.5), cos γ_{x̄^(0), ū^(0)} will be close to 0 with high probability for wide networks if the entries of the vectors x and u are i.i.d. standard Gaussian (or i.i.d. from an isotropic distribution). Hence, we expect the cosine to be around zero for the earlier layers of wide networks, and individual units will behave more as squared standard Gaussians.

Wide deep networks Consider wide and deep networks, where the layers l = 0, ..., L−1 are approximately of the same width, n_{l₁} ≈ n_{l₂}, l₁, l₂ = 0, ..., L−1. Assume that c = 1/M = 1/E[(Ξ(N(0,1), K))²]. We will demonstrate that, under these conditions, |cos γ_{x̄^(l), ū^(l)}| ≈ 1 for the later layers for 2 < K < 100. Thus, individual units behave as squared largest order statistics. To see this, we need to estimate cos γ_{x̄^(l), ū^(l)} from Theorem 3, which is defined as

cos γ_{x̄^(l), ū^(l)} = ρ^(l)_{xu} = ⟨x̄^(l), ū^(l)⟩ / (∥x̄^(l)∥ ∥ū^(l)∥) = ⟨x̄^(l), ũ^(l)⟩ / (∥x̄^(l)∥ ∥ũ^(l)∥) = ((n_{l-1}/n_l) ⟨x̄^(l), ũ^(l)⟩) / ( (√(n_{l-1}/n_l) ∥x̄^(l)∥) (√(n_{l-1}/n_l) ∥ũ^(l)∥) ),

where we denote cos γ_{x̄^(l), ū^(l)} by ρ^(l)_{xu}, and by ũ^(l) the vector u^(l) before normalization. Firstly, for x̄^(l) we get

(n_{l-1}/n_l) ∥x̄^(l)∥² = (n_{l-1}/n_l) ( ∑_{i=1}^{n_l} max_{k∈[K]} {W̄^(l)_{i,k} x̄^(l-1)}² + 1 ) = c ∥x̄^(l-1)∥² ( (1/n_l) ∑_{i=1}^{n_l} max_{k∈[K]} {V^(l)_{i,k} x̄^(l-1)/∥x̄^(l-1)∥}² + n_{l-1} / (c ∥x̄^(l-1)∥² n_l) ) d= c ∥x̄^(l-1)∥² ( (1/n_l) ∑_{i=1}^{n_l} Ξ²_{l,i} + n_{l-1} / (c ∥x̄^(l-1)∥² n_l) ), (14)

where in the second step we used that W̄^(l)_{i,k} = √(c/n_{l-1}) V^(l)_{i,k} with V^(l)_{i,k,j} ∼ N(0,1), j = 1, ..., n_{l-1}, and in the third step Ξ_{l,i} d= Ξ(N(0,1), K) is the largest order statistic in a sample of K standard Gaussians, since by Lemma 10 the V^(l)_{i,k} x̄^(l-1)/∥x̄^(l-1)∥ are mutually independent standard Gaussian random variables. When the network width is large, (1/n_l) ∑_{i=1}^{n_l} Ξ²_{l,i} approximates the second moment of the largest order statistic, and n_{l-1}/n_l ≈ 1 when the layer widths are approximately the same. Then

(n_{l-1}/n_l) ∥x̄^(l)∥² ≈ c ∥x̄^(l-1)∥² ( E[Ξ²] + 1/(c ∥x̄^(l-1)∥²) ).

Now we show that 1/∥x̄^(l-1)∥² ≈ 0. Firstly, by the same reasoning as above,

∥x̄^(l-1)∥² = ∑_{i=1}^{n_{l-1}} max_{k∈[K]} {W̄^(l-1)_{i,k} x̄^(l-2)}² + 1 d= ∥x̄^(0)∥² c^{l-1} (n_{l-1}/n_0) ∏_{j=1}^{l-1} (1/n_j) ∑_{i=1}^{n_j} Ξ²_{j,i} + ··· + c² (n_{l-1}/n_{l-3}) ∏_{j=l-2}^{l-1} (1/n_j) ∑_{i=1}^{n_j} Ξ²_{j,i} + c (n_{l-1}/n_{l-2}) (1/n_{l-1}) ∑_{i=1}^{n_{l-1}} Ξ²_{l-1,i} + 1.

Since we assumed that the layer widths are large and approximately the same,

∥x̄^(l-1)∥² ≈ ∥x̄^(0)∥² (cE[Ξ²])^{l-1} + ··· + cE[Ξ²] + 1 = ∥x̄^(0)∥² (cE[Ξ²])^{l-1} + ∑_{j=0}^{l-2} (cE[Ξ²])^j.

Using the assumption that c = 1/E[Ξ²], we obtain that ∥x̄^(l-1)∥² ≈ ∥x̄^(0)∥² + (l−1), which goes to infinity with the network depth. Hence, 1/∥x̄^(l-1)∥² ≈ 0 and

(n_{l-1}/n_l) ∥x̄^(l)∥² ≈ c ∥x̄^(l-1)∥² E[Ξ²].

Now consider ũ^(l). Using the reasoning from Theorem 3, see equations (12) and (13),

ũ^(l)_i d= √(c/n_{l-1}) ( Ξ_{l,i} ρ^(l-1)_{xu} + v_i √(1 − (ρ^(l-1)_{xu})²) ), i = 1, ..., n_l, v_i ∼ N(0,1). (15)

Then, in a wide network,

(n_{l-1}/n_l) ∥ũ^(l)∥² ≈ c E[( Ξ ρ^(l-1)_{xu} + v √(1 − (ρ^(l-1)_{xu})²) )²].

Note that the random variable Ξ in equations (14) and (15) is the same, based on the derivations in Theorem 3; to see this, compare equations (11) and (12). Similarly, for the dot product ⟨x̄^(l), ũ^(l)⟩ in a wide network, we obtain

⟨x̄^(l), ũ^(l)⟩ ≈ ∥x̄^(l-1)∥ (c n_l/n_{l-1}) E[Ξ ( Ξ ρ^(l-1)_{xu} + v √(1 − (ρ^(l-1)_{xu})²) )].

Hence, we have the following recursive map for ρ^(l)_{xu}:

ρ^(l)_{xu} = E[Ξ ( Ξ ρ^(l-1)_{xu} + v √(1 − (ρ^(l-1)_{xu})²) )] / ( √(E[Ξ²]) √(E[( Ξ ρ^(l-1)_{xu} + v √(1 − (ρ^(l-1)_{xu})²) )²]) ) = (1/√(E[Ξ²])) · ρ^(l-1)_{xu} E[Ξ²] / √((ρ^(l-1)_{xu})² (E[Ξ²] − 1) + 1),

where we used the independence of v and Ξ, see Theorem 3, and that E[v] = 0 and E[v²] = 1. This map has fixed points ρ* = ±1, which can be confirmed by direct calculation. To check whether these fixed points are stable, we need to consider the values of the derivative ∂ρ^(l)_{xu}/∂ρ^(l-1)_{xu} at them. We obtain

∂ρ^(l)_{xu}/∂ρ^(l-1)_{xu} = E[Ξ²]^{1/2} ( (ρ^(l-1)_{xu})² (E[Ξ²] − 1) + 1 )^{-3/2}.

When ρ^(l-1)_{xu} = ±1, this partial derivative equals 1/E[Ξ²] < 1 for K > 2, since E[Ξ²] > 1; see Table 4 for K = 2, ..., 10 and Figure 6 for K = 2, ..., 100. Hence, the fixed points are stable (Strogatz, 2018, Chapter 10.1). Note that for K = 2, 1/E[Ξ²] = 1, and this analysis is inconclusive. Therefore, if the network parameters are sampled from N(0, c/n_{l-1}), c = 1/M = 1/E[Ξ(N(0,1), K)²], we expect that |cos γ_{x̄^(l), ū^(l)}| ≈ 1 for the later layers of deep networks, and individual units will behave more as squared largest order statistics. Figure 4 demonstrates the convergence of |cos γ_{x̄^(l), ū^(l)}| to 1 with the depth for wide networks, and Figure 5 shows that there is no convergence for c < 1/E[Ξ²] and that the cosine still converges for c > 1/E[Ξ²].

Remark 19 (Expectation of ∥J_N(x)u∥² in a wide and deep network). According to Remark 18, for deep and wide networks, we can expect that |cos γ_{x̄^(l-1), ū^(l-1)}| ≈ 1 if c = 1/M, which allows obtaining an approximate equality for the expectation of ∥J_N(x)u∥². Hence, using Theorem 3,

∥J_N(x)u∥² ≈ (1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} (Ξ_{l,i}(N(0,1), K))².
(16) Then, using the mutual independence of the variables in equation (16),

E[∥J_N(x)u∥²] ≈ (1/n_0) E[χ²_{n_L}] ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} E[(Ξ_{l,i}(N(0,1), K))²].

Since M = E[(Ξ_{l,i}(N(0,1), K))²], see Table 4, and c = 1/M, we get

E[∥J_N(x)u∥²] ≈ (n_L/n_0) (cM)^{L-1} = n_L/n_0.

Remark 20 (Lower bound on the moments in a wide and deep network). Using (16) and taking into account the mutual independence of the variables,

E[∥J_N(x)u∥^{2t}] ≈ E[((1/n_0) χ²_{n_L} ∏_{l=1}^{L-1} (c/n_l) ∑_{i=1}^{n_l} (Ξ_{l,i}(N(0,1), K))²)^t] = (n_L/n_0)^t (1/n_L)^t E[(χ²_{n_L})^t] ∏_{l=1}^{L-1} (c/n_l)^t E[(∑_{i=1}^{n_l} (Ξ_{l,i}(N(0,1), K))²)^t] ≥ (n_L/n_0)^t (1/n_L)^t E[(χ²_{n_L})^t] ∏_{l=1}^{L-1} (c/n_l)^t ( ∑_{i=1}^{n_l} E[(Ξ_{l,i}(N(0,1), K))²] )^t, (17)

where in the last inequality we used linearity of expectation and Jensen's inequality, since taking the tth power for t ≥ 1 is a convex function on non-negative arguments. Using the formula for the noncentral moments of the chi-squared distribution and the inequality ln x ≥ 1 − 1/x for all x > 0, meaning that x = exp{ln x} ≥ exp{1 − 1/x}, we get

(1/n_L)^t E[(χ²_{n_L})^t] = (1/n_L)^t (n_L)(n_L + 2) ··· (n_L + 2t − 2) = ∏_{i=0}^{t-1} (1 + 2i/n_L) ≥ exp{∑_{i=1}^{t-1} 2i/(n_L + 2i)} ≥ exp{(t−1)/(2n_L)}, (18)

where in the last inequality we used that 2i/(n_L + 2i) ≥ 2/(n_L + 2) ≥ 1/(2n_L) for all i, n_L ≥ 1. Using that E[(Ξ_{l,i}(N(0,1), K))²] = M, see Table 4, and combining this with (17) and (18),

E[∥J_N(x)u∥^{2t}] ⪆ (n_L/n_0)^t exp{(t−1)/(2n_L)} (cM)^{t(L-1)}. (19)

The bound in (19) can be tightened if a tighter lower bound on the moments of the sum of the squared largest order statistics in a sample of K standard Gaussians is known. To derive a lower bound on the moments t ≥ 2 for the general case in Corollary 2, it is necessary to obtain a non-trivial lower bound on the moments of the sum of the smallest order statistics in a sample of K chi-squared random variables with 1 degree of freedom.
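The stability claim in Remark 18 can be checked by iterating the recursive map for ρ = cos γ numerically. This sketch assumes the map ρ ↦ ρ√M / √(ρ²(M−1)+1) with M = E[Ξ²] (our reading of the derivation above), estimating M by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
# M = E[Xi^2], the second moment of the largest of K standard Gaussians.
M = np.mean(np.max(rng.standard_normal((1_000_000, K)), axis=1) ** 2)

def rho_map(rho, M):
    """One layer of the recursion for cos gamma (Remark 18)."""
    return rho * np.sqrt(M) / np.sqrt(rho**2 * (M - 1.0) + 1.0)

rho = 0.05  # x and u nearly orthogonal, as expected in wide early layers
for _ in range(50):
    rho = rho_map(rho, M)
assert abs(rho) > 0.99  # the iteration approaches the stable fixed point |rho| = 1
```

Consistent with the text, the multiplier √M > 1 near ρ = 0 pushes the cosine away from orthogonality, while the derivative 1/M < 1 at ρ = ±1 (for K > 2) makes those fixed points attracting.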

E ACTIVATION LENGTH

Here we prove the results from Subsection 3.3. Figure 8 demonstrates a close match between the estimated normalized activation length and the behavior predicted in Corollaries 21 and 5.

Corollary 21 (Distribution of the normalized activation length). Consider a maxout network with the settings of Section 2. Then, almost surely with respect to the parameter initialization, for any input into the network $x \in \mathbb{R}^{n_0}$ and $l' = 1,\dots,L-1$, the normalized activation length $A^{(l')}$ is equal in distribution to

$$\|\bar{x}^{(0)}\|^2\,\frac{1}{n_0}\prod_{l=1}^{l'}\frac{c}{n_l}\sum_{i=1}^{n_l}\Xi_{l,i}(N(0,1),K)^2 \;+\; \sum_{j=2}^{l'}\left(\frac{1}{n_{j-1}}\prod_{l=j}^{l'}\frac{c}{n_l}\sum_{i=1}^{n_l}\Xi_{l,i}(N(0,1),K)^2\right),$$

where $\bar{x}^{(0)} := (x_1,\dots,x_{n_0},1) \in \mathbb{R}^{n_0+1}$, $\Xi_{l,i}(N(0,1),K)$ is the largest order statistic in a sample of $K$ standard Gaussian random variables, and the $\Xi_{l,i}(N(0,1),K)$ are stochastically independent. Notice that variables $\Xi_{l,i}(N(0,1),K)$ with the same indices are the same random variables.

Proof. Define $\bar{x}^{(l)} = (x_1,\dots,x_{n_l},1) \in \mathbb{R}^{n_l+1}$. Append the bias columns to the weight matrices and denote the obtained matrices by $\bar{W}^{(l)} \in \mathbb{R}^{n_l\times(n_{l-1}+1)}$. Denote by $\hat{W}^{(l)}$ the matrix with rows $\bar{W}^{(l)}_{i,k'} \in \mathbb{R}^{n_{l-1}+1}$, $k' = \operatorname{argmax}_{k\in[K]}\{\langle\bar{W}^{(l)}_{i,k},\bar{x}^{(l-1)}\rangle\}$. Under this notation, $\|\bar{x}^{(l)}\|^2 = \|x^{(l)}\|^2 + 1$ and $\|x^{(l)}\| = \|\hat{W}^{(l)}\bar{x}^{(l-1)}\|$. Then $\|x^{(l')}\|^2$ equals

$$\|x^{(l')}\|^2 = \big\|\hat{W}^{(l')}\bar{x}^{(l'-1)}\big\|^2 = \left\|\hat{W}^{(l')}\frac{\bar{x}^{(l'-1)}}{\|\bar{x}^{(l'-1)}\|}\right\|^2\|\bar{x}^{(l'-1)}\|^2 = \left\|\hat{W}^{(l')}\frac{\bar{x}^{(l'-1)}}{\|\bar{x}^{(l'-1)}\|}\right\|^2\left(\left\|\hat{W}^{(l'-1)}\frac{\bar{x}^{(l'-2)}}{\|\bar{x}^{(l'-2)}\|}\right\|^2\|\bar{x}^{(l'-2)}\|^2 + 1\right)$$
$$= \cdots = \|\bar{x}^{(0)}\|^2\prod_{l=1}^{l'}\left\|\hat{W}^{(l)}\frac{\bar{x}^{(l-1)}}{\|\bar{x}^{(l-1)}\|}\right\|^2 + \sum_{j=2}^{l'}\left(\prod_{l=j}^{l'}\left\|\hat{W}^{(l)}\frac{\bar{x}^{(l-1)}}{\|\bar{x}^{(l-1)}\|}\right\|^2\right),$$

where we multiplied and divided $\|\hat{W}^{(l)}\bar{x}^{(l-1)}\|^2$ by $\|\bar{x}^{(l-1)}\|^2$ at each step.
Using the approach from Theorem 3, more specifically equations (10), (12) and (13), with $u^{(l)} = \bar{x}^{(l)}/\|\bar{x}^{(l)}\|$, implying that $\cos\gamma_{x^{(l)},u^{(l)}} = 1$,

$$A^{(l')} = \frac{1}{n_{l'}}\|x^{(l')}\|^2 \overset{d}{=} \|\bar{x}^{(0)}\|^2\,\frac{1}{n_{l'}}\prod_{l=1}^{l'}\frac{c}{n_{l-1}}\sum_{i=1}^{n_l}\Xi^2_{l,i} + \frac{1}{n_{l'}}\sum_{j=2}^{l'}\left(\prod_{l=j}^{l'}\frac{c}{n_{l-1}}\sum_{i=1}^{n_l}\Xi^2_{l,i}\right)$$
$$= \|\bar{x}^{(0)}\|^2\,\frac{1}{n_0}\prod_{l=1}^{l'}\frac{c}{n_l}\sum_{i=1}^{n_l}\Xi^2_{l,i} + \sum_{j=2}^{l'}\left(\frac{1}{n_{j-1}}\prod_{l=j}^{l'}\frac{c}{n_l}\sum_{i=1}^{n_l}\Xi^2_{l,i}\right),$$

where $\Xi_{l,i} = \Xi_{l,i}(N(0,1),K)$ is the largest order statistic in a sample of $K$ standard Gaussian random variables, and the stochastic independence of the variables $\Xi_{l,i}$ follows from Theorem 3. $\square$

Corollary 5 (Moments of the activation length). Consider a maxout network with the settings of Section 2. Let $x \in \mathbb{R}^{n_0}$ be any input into the network. Then, for the moments of the normalized activation length, the following results hold.

Mean:
$$\mathbb{E}\big[A^{(l')}\big] = \|\bar{x}^{(0)}\|^2\left(\frac{1}{n_0}(cM)^{l'} + \sum_{j=2}^{l'}\frac{1}{n_{j-1}}(cM)^{l'-j+1}\right).$$

Variance:
$$\operatorname{Var}\big[A^{(l')}\big] \leq 2\,\frac{\|\bar{x}^{(0)}\|^4}{n_0^2}\,c^{2l'}K^{2l'}\exp\left\{\sum_{l=1}^{l'}\frac{4}{n_lK}\right\} + 2(l'-1)\sum_{j=2}^{l'}\frac{(cK)^{2(l'-j+1)}}{n_{j-1}^2}\exp\left\{\sum_{l=j}^{l'}\frac{4}{n_lK}\right\}.$$

Moments of the order $t \geq 2$:
$$\mathbb{E}\big[(A^{(l')})^t\big] \leq 2^{t-1}\frac{\|\bar{x}^{(0)}\|^{2t}}{n_0^t}\,c^{tl'}K^{tl'}\exp\left\{\sum_{l=1}^{l'}\frac{t^2}{n_lK}\right\} + (2(l'-1))^{t-1}\sum_{j=2}^{l'}\left(\frac{(cK)^{t(l'-j+1)}}{n_{j-1}^t}\exp\left\{\sum_{l=j}^{l'}\frac{t^2}{n_lK}\right\}\right),$$
$$\mathbb{E}\big[(A^{(l')})^t\big] \geq \|\bar{x}^{(0)}\|^{2t}\left(\frac{(cM)^{tl'}}{n_0^t} + \sum_{j=2}^{l'}\frac{(cM)^{t(l'-j+1)}}{n_{j-1}^t}\right),$$

where the expectation is taken with respect to the distribution of the network weights and biases, and $M$ is a constant depending on $K$ that can be computed approximately; see Table 4 for the values for $K = 2,\dots,10$.

Proof.
Mean. Taking the expectation in Corollary 21 and using the independence of the $\Xi_{l,i}(N(0,1),K)$,

$$\mathbb{E}\big[A^{(l')}\big] = \|\bar{x}^{(0)}\|^2\left(\frac{1}{n_0}(cM)^{l'} + \sum_{j=2}^{l'}\frac{1}{n_{j-1}}(cM)^{l'-j+1}\right), \qquad (20)$$

where $M$ is the second moment of $\Xi_{l,i}(N(0,1),K)$; see Table 4 for its values for $K = 2,\dots,10$.

Moments of the order $t \geq 2$. Using Corollary 21, we get

$$\mathbb{E}\big[(A^{(l')})^t\big] = \mathbb{E}\left[\left(\|\bar{x}^{(0)}\|^2\,\frac{1}{n_0}\prod_{l=1}^{l'}\frac{c}{n_l}\sum_{i=1}^{n_l}\Xi^2_{l,i} + \sum_{j=2}^{l'}\left(\frac{1}{n_{j-1}}\prod_{l=j}^{l'}\frac{c}{n_l}\sum_{i=1}^{n_l}\Xi^2_{l,i}\right)\right)^t\right]. \qquad (21)$$

Upper bound. First, we derive an upper bound on (21). Notice that all arguments in (21) are positive except on a zero-measure set of $(\Xi_{l,i}) \in \mathbb{R}^{\sum_{l=1}^{l'}n_l}$. According to the power mean inequality, for any $x_1,\dots,x_n > 0$ and any $t > 1$, $(x_1+\cdots+x_n)^t \leq n^{t-1}(x_1^t+\cdots+x_n^t)$. Using the power mean inequality first on the whole expression and then on the second summand,

$$\mathbb{E}\big[(A^{(l')})^t\big] \leq 2^{t-1}\,\mathbb{E}\left[\left(\|\bar{x}^{(0)}\|^2\,\frac{1}{n_0}\prod_{l=1}^{l'}\frac{c}{n_l}\sum_{i=1}^{n_l}\Xi^2_{l,i}\right)^t\right] + (2(l'-1))^{t-1}\sum_{j=2}^{l'}\mathbb{E}\left[\left(\frac{1}{n_{j-1}}\prod_{l=j}^{l'}\frac{c}{n_l}\sum_{i=1}^{n_l}\Xi^2_{l,i}\right)^t\right]. \qquad (22)$$

Using the independence of the $\Xi_{l,i}$, (22) equals

$$2^{t-1}\frac{\|\bar{x}^{(0)}\|^{2t}}{n_0^t}\prod_{l=1}^{l'}\left(\frac{c^t}{n_l^t}\,\mathbb{E}\left[\left(\sum_{i=1}^{n_l}\Xi^2_{l,i}\right)^t\right]\right) + (2(l'-1))^{t-1}\sum_{j=2}^{l'}\left(\frac{1}{n_{j-1}^t}\prod_{l=j}^{l'}\left(\frac{c^t}{n_l^t}\,\mathbb{E}\left[\left(\sum_{i=1}^{n_l}\Xi^2_{l,i}\right)^t\right]\right)\right).$$

Upper bounding the sum of the squared largest order statistics by the sum of all $n_lK$ squared standard Gaussian random variables, we get $\sum_{i=1}^{n_l}\Xi^2_{l,i} \leq \chi^2_{n_lK}$. Hence,

$$\mathbb{E}\big[(A^{(l')})^t\big] \leq 2^{t-1}\frac{\|\bar{x}^{(0)}\|^{2t}}{n_0^t}\prod_{l=1}^{l'}\frac{c^t}{n_l^t}\,\mathbb{E}\big[\big(\chi^2_{n_lK}\big)^t\big] + (2(l'-1))^{t-1}\sum_{j=2}^{l'}\left(\frac{1}{n_{j-1}^t}\prod_{l=j}^{l'}\frac{c^t}{n_l^t}\,\mathbb{E}\big[\big(\chi^2_{n_lK}\big)^t\big]\right). \qquad (23)$$

Using the formula for the noncentral moments of the chi-squared distribution and $1+x \leq e^x$, $\forall x \in \mathbb{R}$,

$$\frac{c^t}{n_l^t}\,\mathbb{E}\big[\big(\chi^2_{n_lK}\big)^t\big] = \frac{c^t}{n_l^t}(n_lK)(n_lK+2)\cdots(n_lK+2t-2) = c^tK^t\cdot 1\cdot\left(1+\frac{2}{n_lK}\right)\cdots\left(1+\frac{2t-2}{n_lK}\right) \leq c^tK^t\exp\left\{\sum_{i=0}^{t-1}\frac{2i}{n_lK}\right\} \leq c^tK^t\exp\left\{\frac{t^2}{n_lK}\right\},$$

where we used the formula for the sum of consecutive integers, $\sum_{i=1}^{t-1}i = t(t-1)/2$.
Using this result in (23), we get the final upper bound

$$\mathbb{E}\big[(A^{(l')})^t\big] \leq 2^{t-1}\frac{\|\bar{x}^{(0)}\|^{2t}}{n_0^t}\,c^{tl'}K^{tl'}\exp\left\{\sum_{l=1}^{l'}\frac{t^2}{n_lK}\right\} + (2(l'-1))^{t-1}\sum_{j=2}^{l'}\left(\frac{(cK)^{t(l'-j+1)}}{n_{j-1}^t}\exp\left\{\sum_{l=j}^{l'}\frac{t^2}{n_lK}\right\}\right).$$

Lower bound. Using that the arguments in (21) are non-negative and $t \geq 1$, we can lower bound the power of the sum by the sum of the powers and get

$$\mathbb{E}\big[(A^{(l')})^t\big] \geq \frac{\|\bar{x}^{(0)}\|^{2t}}{n_0^t}\prod_{l=1}^{l'}\left(\frac{c^t}{n_l^t}\,\mathbb{E}\left[\left(\sum_{i=1}^{n_l}\Xi^2_{l,i}\right)^t\right]\right) + \sum_{j=2}^{l'}\left(\frac{1}{n_{j-1}^t}\prod_{l=j}^{l'}\left(\frac{c^t}{n_l^t}\,\mathbb{E}\left[\left(\sum_{i=1}^{n_l}\Xi^2_{l,i}\right)^t\right]\right)\right)$$
$$\geq \frac{\|\bar{x}^{(0)}\|^{2t}}{n_0^t}\prod_{l=1}^{l'}\left(\frac{c^t}{n_l^t}\left(\sum_{i=1}^{n_l}\mathbb{E}\big[\Xi^2_{l,i}\big]\right)^t\right) + \sum_{j=2}^{l'}\left(\frac{1}{n_{j-1}^t}\prod_{l=j}^{l'}\left(\frac{c^t}{n_l^t}\left(\sum_{i=1}^{n_l}\mathbb{E}\big[\Xi^2_{l,i}\big]\right)^t\right)\right),$$

where we used the linearity of expectation in both expressions and Jensen's inequality in the last line. Using that $\mathbb{E}[(\Xi_{l,i}(N(0,1),K))^2] = M$ (see Table 4), we get

$$\mathbb{E}\big[(A^{(l')})^t\big] \geq \|\bar{x}^{(0)}\|^{2t}\left(\frac{(cM)^{tl'}}{n_0^t} + \sum_{j=2}^{l'}\frac{(cM)^{t(l'-j+1)}}{n_{j-1}^t}\right). \qquad (24)$$

Variance. We can use an upper bound on the second moment as an upper bound on the variance. $\square$

Remark 22 (Zero bias). Similar results can be obtained for the zero-bias case and would result in the same bounds without the second summand. For the proof, one would work directly with the vectors $x^{(l)}$, without defining the vectors $\bar{x}^{(l)}$, and to obtain the equality in distribution one would use Remark 17.
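As a quick sanity check of the zero-bias case (Remark 22), one can estimate the mean normalized activation length for a small network by simulation: with $c = 1/M$ the prediction is $\mathbb{E}[A^{(l')}] = \|x^{(0)}\|^2(cM)^{l'}/n_0 = \|x^{(0)}\|^2/n_0$. The snippet below is our own illustration (architecture and sample sizes arbitrary), not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, depth = 5, 10, 3
c = 0.55555                      # recommended c = 1/M for K = 5

x0 = rng.standard_normal(n)
x0 /= np.linalg.norm(x0)         # unit input, so predicted E[A] = (cM)^depth / n ~ 1/n

def normalized_activation_length(x):
    # zero-bias maxout layers of constant width n and rank K
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(c / n), size=(n, K, n))
        x = np.max(W @ x, axis=1)
    return np.dot(x, x) / n

est = float(np.mean([normalized_activation_length(x0) for _ in range(10_000)]))
```

The estimate comes out close to $1/n = 0.1$, matching the mean formula of Corollary 5 without its bias summand.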

F EXPECTED NUMBER OF LINEAR REGIONS

Here we prove the result from Section 5.1.

Corollary 6 (Value for $C_{\mathrm{grad}}$). Consider a maxout network with the settings of Section 2. Assume that the biases are independent of the weights but otherwise initialized using any approach. Consider the pre-activation feature $\zeta_{z,k}$ of a unit $z = 1,\dots,N$. Then, for any $t \in \mathbb{N}$,

$$\sup_{x\in\mathbb{R}^{n_0}}\mathbb{E}\big[\|\nabla\zeta_{z,k}(x)\|^t\big]^{\frac{1}{t}} \leq n_0^{-\frac{1}{2}}\max\left\{1,(cK)^{\frac{L-1}{2}}\right\}\exp\left\{\frac{t}{2}\left(\sum_{l=1}^{L-1}\frac{1}{n_lK}+1\right)\right\}.$$

Proof. The distribution of $\nabla\zeta_{z,k}$ is the same as the distribution of the gradient with respect to the network input, $\nabla\mathcal{N}_1(x)$, in a maxout network that has a single linear output unit and $L' = l(z)$ layers, where $l(z)$ is the depth of the unit $z$. Therefore, we consider $(\sup_{x\in\mathbb{R}^{n_0}}\mathbb{E}[\|\nabla\mathcal{N}_1(x)\|^{2t}])^{1/2t}$. Notice that since $n_{L'} = 1$, $\nabla\mathcal{N}_1(x) = J_{\mathcal{N}}(x)^T = J_{\mathcal{N}}(x)^Tu$ for the 1-dimensional vector $u = (1)$. Hence,

$$\|\nabla\mathcal{N}_1(x)\| = \sup_{\|u\|=1,\,u\in\mathbb{R}^{n_{L'}}}\|J_{\mathcal{N}}(x)^Tu\| = \|J_{\mathcal{N}}(x)^T\|, \qquad (25)$$

where the matrix norm is the spectral norm. Using that a matrix and its transpose have the same spectral norm, (25) equals $\|J_{\mathcal{N}}(x)\| = \sup_{\|u\|=1,\,u\in\mathbb{R}^{n_0}}\|J_{\mathcal{N}}(x)u\|$. Therefore, we need to upper bound

$$\left(\sup_{x\in\mathbb{R}^{n_0}}\mathbb{E}\left[\sup_{\|u\|=1,\,u\in\mathbb{R}^{n_0}}\|J_{\mathcal{N}}(x)u\|^t\right]\right)^{\frac{1}{t}} \leq \left(\sup_{x\in\mathbb{R}^{n_0}}\mathbb{E}\left[\sup_{\|u\|=1,\,u\in\mathbb{R}^{n_0}}\|J_{\mathcal{N}}(x)u\|^{2t}\right]\right)^{\frac{1}{2t}},$$

where we used Jensen's inequality. Now we can use the upper bound on $\mathbb{E}[\|J_{\mathcal{N}}(x)u\|^{2t}]$ from Corollary 2, which holds for any $x, u \in \mathbb{R}^{n_0}$, $\|u\| = 1$, and thus holds for the suprema. Recalling that $n_{L'} = 1$, we get

$$\mathbb{E}\big[\|J_{\mathcal{N}}(x)u\|^{2t}\big] \leq \left(\frac{1}{n_0}\right)^t(cK)^{t(L'-1)}\exp\left\{t^2\left(\sum_{l=1}^{L'-1}\frac{1}{n_lK}+1\right)\right\}.$$

Hence,

$$\mathbb{E}\big[\|J_{\mathcal{N}}(x)u\|^{2t}\big]^{\frac{1}{2t}} \leq n_0^{-\frac{1}{2}}(cK)^{\frac{L'-1}{2}}\exp\left\{\frac{t}{2}\left(\sum_{l=1}^{L'-1}\frac{1}{n_lK}+1\right)\right\}.$$

Taking the maximum over $L' \in \{1,\dots,L\}$, the final upper bound is

$$n_0^{-\frac{1}{2}}\max\left\{1,(cK)^{\frac{L-1}{2}}\right\}\exp\left\{\frac{t}{2}\left(\sum_{l=1}^{L-1}\frac{1}{n_lK}+1\right)\right\}. \qquad\square$$

Now we provide an updated upper bound on the number of $r$-partial activation regions from Tseran & Montúfar (2021, Theorem 9). In this bound, the case $r = 0$ corresponds to the number of linear regions.
For a detailed discussion of the activation regions of maxout networks and their differences from linear regions, see Tseran & Montúfar (2021). Since the proof of Tseran & Montúfar (2021, Theorem 9) only uses $C_{\mathrm{grad}}$ for $t \leq n_0$, we obtain the following statement.

Theorem 23 (Upper bound on the expected number of partial activation regions). Consider a maxout network with the settings of Section 2 with $N$ maxout units. Assume that the biases are independent of the weights and initialized so that:

1. Every collection of biases has a conditional density with respect to Lebesgue measure given the values of all other weights and biases.

2. There exists $C_{\mathrm{bias}} > 0$ so that for any pre-activation features $\zeta_1,\dots,\zeta_t$ from any neurons, the conditional density of their biases $\rho_{b_1,\dots,b_t}$ given all the other weights and biases satisfies $\sup_{b_1,\dots,b_t\in\mathbb{R}}\rho_{b_1,\dots,b_t}(b_1,\dots,b_t) \leq C_{\mathrm{bias}}^t$.

Fix $r \in \{0,\dots,n_0\}$. Let $C_{\mathrm{grad}} = n_0^{-1/2}\max\{1,(cK)^{(L-1)/2}\}\exp\{(n_0/2)(\sum_{l=1}^{L-1}1/(n_lK)+1)\}$ and $T = 2^5\,C_{\mathrm{grad}}C_{\mathrm{bias}}$. Then, there exists $\delta_0 \leq 1/(2C_{\mathrm{grad}}C_{\mathrm{bias}})$ such that for all cubes $C \subseteq \mathbb{R}^{n_0}$ with side length $\delta > \delta_0$ we have

$$\frac{\mathbb{E}[\#\,r\text{-partial activation regions of }\mathcal{N}\text{ in }C]}{\operatorname{vol}(C)} \leq \begin{cases}\binom{rK}{2r}\binom{N}{r}K^{N-r}, & N \leq n_0,\\[6pt] \dfrac{(TKN)^{n_0}\binom{n_0K}{2n_0}(2K)^r}{n_0!}, & N \geq n_0.\end{cases}$$

Here the expectation is taken with respect to the distribution of weights and biases in $\mathcal{N}$. Of particular interest is the case $r = 0$, which corresponds to the number of linear regions.
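For a concrete architecture, the constant $C_{\mathrm{grad}}$ (with $t = n_0$, as used in Theorem 23) is a one-line computation. The helper below is our own transcription of the formula, with a hypothetical toy architecture as the usage example:

```python
import math

def c_grad(n0, widths, K, c):
    """C_grad = n0^{-1/2} * max{1, (cK)^{(L-1)/2}} * exp{(n0/2)(sum_l 1/(n_l K) + 1)},
    where widths = [n_1, ..., n_{L-1}] are the hidden-layer widths."""
    L = len(widths) + 1
    s = sum(1.0 / (n * K) for n in widths) + 1.0
    return n0 ** -0.5 * max(1.0, (c * K) ** ((L - 1) / 2)) * math.exp(n0 / 2 * s)

# toy example: n_0 = 2, two hidden layers of width 3, K = 5, c = 1/M ~ 1/1.8
val = c_grad(2, [3, 3], 5, 1 / 1.8)
```

Note how the factor $\exp\{(n_0/2)(\cdot)\}$ makes the constant grow quickly with the input dimension, which is why the bound of Theorem 23 is most informative for small $n_0$.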

G EXPECTED CURVE LENGTH DISTORTION

In this section we prove the result from Section 5.2. Let $M$ be a smooth 1-dimensional curve in $\mathbb{R}^{n_0}$. Fix a smooth unit-speed parameterization of $M = \gamma([0,1])$ with $\gamma: \mathbb{R}\to\mathbb{R}^{n_0}$, $\gamma(\tau) = (\gamma_1(\tau),\dots,\gamma_{n_0}(\tau))$. Then a parametrization of the curve $\mathcal{N}(M)$ is given by the mapping $\Gamma := \mathcal{N}\circ\gamma$, $\Gamma: \mathbb{R}\to\mathbb{R}^{n_L}$. Thus, the length of $\mathcal{N}(M)$ is

$$\operatorname{len}(\mathcal{N}(M)) = \int_0^1\|\Gamma'(\tau)\|\,d\tau.$$

Notice that the input-output Jacobian of a maxout network is well defined almost everywhere, because for any unit $i$ and any two distinct indices $k', k'' \in [K]$, using that the biases are independent from the weights and the weights are initialized from a continuous distribution, the probability that the pre-activation features $k'$ and $k''$ both attain the maximum $\max_{k\in[K]}\{\langle W^{(l)}_{i,k}, x^{(l-1)}\rangle + b^{(l)}_{i,k}\}$ is zero. Hence, $\Gamma'(\tau) = J_{\mathcal{N}}(\gamma(\tau))\gamma'(\tau)$, where we used the chain rule, and we can employ the following lemma from Hanin et al. (2021). We state it here without proof; the proof uses Tonelli's theorem, the power mean inequality, and the chain rule.

Lemma 24 (Connection between the length of the curve and $\|J_{\mathcal{N}}(x)u\|$; Hanin et al. 2021, Lemma C.1). For any integer $t \geq 0$,

$$\mathbb{E}\big[\operatorname{len}(\mathcal{N}(M))^t\big] \leq \int_0^1\mathbb{E}\big[\|J_{\mathcal{N}}(\gamma(\tau))\gamma'(\tau)\|^t\big]\,d\tau = \mathbb{E}\big[\|J_{\mathcal{N}}(x)u\|^t\big],$$

where $u \in \mathbb{R}^{n_0}$ is a unit vector.

Now we are ready to prove Corollary 7.

Corollary 7 (Expected curve length distortion). Consider a maxout network with the settings of Section 2. Assume that the biases are independent of the weights but otherwise initialized using any approach. Let $M$ be a smooth 1-dimensional curve of unit length in $\mathbb{R}^{n_0}$. Then, the following upper bounds on the moments of $\operatorname{len}(\mathcal{N}(M))$ hold:

$$\mathbb{E}[\operatorname{len}(\mathcal{N}(M))] \leq \left(\frac{n_L}{n_0}\right)^{\frac{1}{2}}(c\mathcal{L})^{\frac{L-1}{2}}, \qquad \operatorname{Var}[\operatorname{len}(\mathcal{N}(M))] \leq \frac{n_L}{n_0}(c\mathcal{L})^{L-1},$$
$$\mathbb{E}\big[\operatorname{len}(\mathcal{N}(M))^t\big] \leq \left(\frac{n_L}{n_0}\right)^{\frac{t}{2}}(cK)^{\frac{t(L-1)}{2}}\exp\left\{\frac{t^2}{2}\left(\sum_{l=1}^{L-1}\frac{1}{n_lK}+\frac{1}{n_L}\right)\right\},$$

where $\mathcal{L}$ is a constant depending on $K$; see Table 4 in Appendix C for the values for $K = 2,\dots,10$.

Proof. By Lemma 24,

$$\mathbb{E}\big[\operatorname{len}(\mathcal{N}(M))^t\big] \leq \mathbb{E}\big[\|J_{\mathcal{N}}(x)u\|^t\big] \leq \mathbb{E}\big[\|J_{\mathcal{N}}(x)u\|^{2t}\big]^{\frac{1}{2}},$$

where we used Jensen's inequality to obtain the last upper bound.
Hence, using Corollary 2, we get the following upper bounds on the moments of the length of the curve.

Mean:
$$\mathbb{E}[\operatorname{len}(\mathcal{N}(M))] \leq \mathbb{E}\big[\|J_{\mathcal{N}}(x)u\|^2\big]^{\frac{1}{2}} \leq \left(\frac{n_L}{n_0}\right)^{\frac{1}{2}}(c\mathcal{L})^{\frac{L-1}{2}}.$$

Variance:
$$\operatorname{Var}[\operatorname{len}(\mathcal{N}(M))] \leq \mathbb{E}\big[\operatorname{len}(\mathcal{N}(M))^2\big] \leq \frac{n_L}{n_0}(c\mathcal{L})^{L-1}.$$

Moments of the order $t \geq 3$:
$$\mathbb{E}\big[\operatorname{len}(\mathcal{N}(M))^t\big] \leq \mathbb{E}\big[\|J_{\mathcal{N}}(x)u\|^{2t}\big]^{\frac{1}{2}} \leq \left(\frac{n_L}{n_0}\right)^{\frac{t}{2}}(cK)^{\frac{t(L-1)}{2}}\exp\left\{\frac{t^2}{2}\left(\sum_{l=1}^{L-1}\frac{1}{n_lK}+\frac{1}{n_L}\right)\right\}. \qquad\square$$
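The mean bound can be observed empirically by pushing a discretized unit segment through small random maxout networks and comparing the average image length with $(n_L/n_0)^{1/2}(c\mathcal{L})^{(L-1)/2}$; for $K = 5$, $\mathcal{L} \approx 2.78$ (Table 4). This is our own illustration with an arbitrary small architecture (two zero-bias maxout layers plus a linear output, so $L = 3$), not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
K, n, c = 5, 10, 0.55555         # rank, width (n_0 = n_L = n), recommended c for K = 5

def forward(x, Ws, W_out):
    for W in Ws:                 # zero-bias maxout layers
        x = np.max(W @ x, axis=1)
    return W_out @ x             # linear output layer

def image_length(Ws, W_out, a, u, steps=100):
    # polygonal (under-)approximation of the image of the unit segment a + t*u
    pts = np.stack([forward(a + t * u, Ws, W_out) for t in np.linspace(0, 1, steps + 1)])
    return np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

a = rng.standard_normal(n)
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
lens = []
for _ in range(200):
    Ws = [rng.normal(0, np.sqrt(c / n), size=(n, K, n)) for _ in range(2)]
    W_out = rng.normal(0, np.sqrt(1.0 / n), size=(n, n))
    lens.append(image_length(Ws, W_out, a, u))
est = float(np.mean(lens))
bound = (c * 2.78) ** ((3 - 1) / 2)   # (n_L/n_0)^{1/2} = 1 here
```

With the recommended initialization, the average distortion stays of order one and comfortably below the bound, consistent with Corollary 7.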

H NTK

Here we prove the results from Section 5.3.

Corollary 9 (On-diagonal NTK). Consider a maxout network with the settings of Section 2. Assume that $n_L = 1$ and that the biases are initialized to zero and are not trained. Assume that $\mathcal{S} \leq c \leq \mathcal{L}$, where the constants $\mathcal{S}, \mathcal{L}$ are as specified in Table 4. Then,

$$\frac{\|x^{(0)}\|^2(c\mathcal{S})^{L-2}}{n_0}P \;\leq\; \mathbb{E}[K_{\mathcal{N}}(x,x)] \;\leq\; \frac{\|x^{(0)}\|^2(c\mathcal{L})^{L-2}M^{L-1}}{n_0}P,$$
$$\mathbb{E}\big[K_{\mathcal{N}}(x,x)^2\big] \leq 2PP_W(cK)^{2(L-2)}\frac{\|x^{(0)}\|^4}{n_0^2}\exp\left\{\sum_{j=1}^{L-1}\frac{4}{n_jK}+4\right\},$$

where $P = \sum_{l=0}^{L-1}n_l$, $P_W = \sum_{l=1}^{L}n_ln_{l-1}$, and $M$ is as specified in Table 4.

Proof. Under the assumption that the biases are not trained, the on-diagonal NTK of a maxout network is

$$K_{\mathcal{N}}(x,x) = \sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{k=1}^{K}\sum_{j=1}^{n_{l-1}}\left(\frac{\partial\mathcal{N}}{\partial W^{(l)}_{i,k,j}}(x)\right)^2.$$

Since in a maxout network the derivatives with respect to the weights and biases are zero for all $k$ except $k = k' = \operatorname{argmax}_{k\in[K]}\{\langle W^{(l)}_{i,k}, x^{(l-1)}\rangle\}$, the on-diagonal NTK equals

$$K_{\mathcal{N}}(x,x) = \sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l-1}}\left(\frac{\partial\mathcal{N}}{\partial W^{(l)}_{i,k',j}}(x)\right)^2.$$

Notice that, since we assumed a continuous distribution over the network weights and the biases are zero, the partial derivatives are defined everywhere except for a set of measure zero.

Part I. Kernel mean $\mathbb{E}[K_{\mathcal{N}}(x,x)]$. Firstly, using the chain rule, a partial derivative with respect to a network weight is

$$\frac{\partial\mathcal{N}}{\partial W^{(l)}_{i,k',j}}(x) = \frac{\partial\mathcal{N}}{\partial x^{(l)}_i}(x)\,x^{(l-1)}_j = J_{\mathcal{N}}(x^{(l)})e_i\,x^{(l-1)}_j. \qquad (26)$$

Recall that we assumed $n_L = 1$. Therefore, we need to consider

$$\left(\frac{\partial\mathcal{N}}{\partial W^{(l)}_{i,k',j}}(x)\right)^2 = \|J_{\mathcal{N}}(x^{(l)})u\|^2\big(x^{(l-1)}_j\big)^2, \quad u = e_i.$$

Combining Theorem 1 and Corollary 21 (with Remark 22 for the zero-bias case) and using the independence of the random variables in the expressions,

$$\mathbb{E}\left[\|J_{\mathcal{N}}(x^{(l)})u\|^2\big(x^{(l-1)}_j\big)^2\right] \leq \mathbb{E}\left[\frac{1}{n_l}\chi^2_{n_L}\prod_{j=l}^{L-1}\frac{c}{n_j}\sum_{i=1}^{n_j}\Xi_{j,i}(\chi^2_1,K)\right]\cdot\mathbb{E}\left[\|x^{(0)}\|^2\frac{1}{n_0}\prod_{j=1}^{l-1}\frac{c}{n_j}\sum_{i=1}^{n_j}\Xi_{j,i}(N(0,1),K)^2\right],$$

where we treat the $(l-1)$th layer as if it has one unit when we use the normalized activation length result.
Then, using Corollaries 2 and 5,

$$\mathbb{E}\left[\|J_{\mathcal{N}}(x^{(l)})u\|^2\big(x^{(l-1)}_j\big)^2\right] \leq \frac{\|x^{(0)}\|^2c^{L-2}}{n_0n_l}\,\mathcal{L}^{L-l-1}M^{l-1}.$$

Taking the sum, we get

$$\mathbb{E}[K_{\mathcal{N}}(x,x)] = \mathbb{E}[K_W] = \mathbb{E}\left[\sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l-1}}\left(\frac{\partial\mathcal{N}}{\partial W^{(l)}_{i,k',j}}(x)\right)^2\right] \leq \sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l-1}}\frac{\|x^{(0)}\|^2c^{L-2}}{n_0n_l}\,\mathcal{L}^{L-l-1}M^{l-1} \leq \frac{\|x^{(0)}\|^2(c\mathcal{L})^{L-2}M^{L-1}}{n_0}P,$$

where $P = \sum_{l=0}^{L-1}n_l$ denotes the number of neurons in the network up to the last layer, including the input neurons. Here we used that for $K \geq 2$, both $\mathcal{L}, M \geq 1$; see Table 4. Similarly,

$$\mathbb{E}[K_W] \geq \sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l-1}}\frac{\|x^{(0)}\|^2c^{L-2}}{n_0n_l}\,\mathcal{S}^{L-l-1}M^{l-1} \geq \frac{\|x^{(0)}\|^2(c\mathcal{S})^{L-2}}{n_0}P.$$

Here we used that for $K \geq 2$, $\mathcal{S} \leq 1$ and $M \geq 1$; see Table 4.

Part II. Second moment $\mathbb{E}[K_{\mathcal{N}}(x,x)^2]$. Using equation (26) with Corollaries 2 and 5,

$$\mathbb{E}\left[\left(\frac{\partial\mathcal{N}}{\partial W^{(l)}_{i,k',j}}(x)\right)^4\right] = \mathbb{E}\left[\|J_{\mathcal{N}}(x^{(l)})u\|^4\big(x^{(l-1)}_j\big)^4\right] \leq \mathbb{E}\left[\left(\frac{1}{n_l}\chi^2_{n_L}\prod_{j=l}^{L-1}\frac{c}{n_j}\sum_{i=1}^{n_j}\Xi_{j,i}(\chi^2_1,K)\right)^2\right]\mathbb{E}\left[\left(\|x^{(0)}\|^2\frac{1}{n_0}\prod_{j=1}^{l-1}\frac{c}{n_j}\sum_{i=1}^{n_j}\Xi_{j,i}(N(0,1),K)^2\right)^2\right]$$
$$\leq 2(cK)^{2(L-2)}\frac{\|x^{(0)}\|^4}{n_l^2n_0^2}\exp\left\{4\left(\sum_{j=1}^{L-1}\frac{1}{n_jK}-\frac{1}{n_lK}+1\right)\right\}.$$

Notice that all summands are non-negative. Then, using the AM-QM inequality,

$$\mathbb{E}\big[K_{\mathcal{N}}(x,x)^2\big] = \mathbb{E}\left[\left(\sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l-1}}\left(\frac{\partial\mathcal{N}}{\partial W^{(l)}_{i,k',j}}(x)\right)^2\right)^2\right] \leq P_W\sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l-1}}\mathbb{E}\left[\left(\frac{\partial\mathcal{N}}{\partial W^{(l)}_{i,k',j}}(x)\right)^4\right]$$
$$\leq P_W\sum_{l=1}^{L}\sum_{i=1}^{n_l}\sum_{j=1}^{n_{l-1}}2(cK)^{2(L-2)}\frac{\|x^{(0)}\|^4}{n_l^2n_0^2}\exp\left\{4\left(\sum_{j=1}^{L-1}\frac{1}{n_jK}-\frac{1}{n_lK}+1\right)\right\} \leq 2PP_W(cK)^{2(L-2)}\frac{\|x^{(0)}\|^4}{n_0^2}\exp\left\{\sum_{j=1}^{L-1}\frac{4}{n_jK}+4\right\},$$

where $P_W = \sum_{l=1}^{L}n_ln_{l-1}$ denotes the number of weights in the network. $\square$
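Since, by (26), each squared partial derivative factors as $(\delta^{(l)}_ix^{(l-1)}_j)^2$, the on-diagonal kernel collapses to $K_{\mathcal{N}}(x,x) = \sum_l\|\delta^{(l)}\|^2\|x^{(l-1)}\|^2$, where $\delta^{(l)}$ is the gradient of the scalar output with respect to $x^{(l)}$. The sketch below (our own, with an arbitrary tiny architecture and the approximate constants $\mathcal{S}\approx 0.083$, $\mathcal{L}\approx 2.78$, $M\approx 1.8$ for $K = 5$) checks that a Monte Carlo estimate falls between the bounds of Corollary 9:

```python
import numpy as np

rng = np.random.default_rng(3)
K, c = 5, 0.55555
n0 = n1 = n2 = 4                          # widths; scalar linear output, so L = 3

def ntk_diag(x):
    xs, act = [x], []
    for n_in, n_out in [(n0, n1), (n1, n2)]:
        W = rng.normal(0, np.sqrt(c / n_in), size=(n_out, K, n_in))
        pre = W @ xs[-1]                  # (n_out, K) pre-activations
        k = pre.argmax(axis=1)
        idx = np.arange(n_out)
        xs.append(pre[idx, k])            # maxout outputs (zero untrained biases)
        act.append(W[idx, k])             # active affine pieces, (n_out, n_in)
    w_out = rng.normal(0, np.sqrt(1.0 / n2), size=n2)
    deltas = [act[1].T @ w_out, w_out, np.array([1.0])]  # delta^(1), delta^(2), delta^(3)
    # K_N(x,x) = sum_l ||delta^(l)||^2 * ||x^(l-1)||^2
    return sum(np.dot(d, d) * np.dot(v, v) for d, v in zip(deltas, xs))

x = rng.standard_normal(n0)
x /= np.linalg.norm(x)
est = float(np.mean([ntk_diag(x) for _ in range(3000)]))

P = n0 + n1 + n2                          # neurons up to the last layer
lower = (c * 0.083) ** (3 - 2) * P / n0                   # ||x||^2 (c*S)^{L-2} P / n0
upper = (c * 2.78) ** (3 - 2) * 1.8 ** (3 - 1) * P / n0   # ||x||^2 (c*L)^{L-2} M^{L-1} P / n0
```

The bounds are loose for such a tiny network, but the estimate reliably lands between them.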

I EXPERIMENTS I.1 EXPERIMENTS WITH SGD AND ADAM FROM SECTION 6

In this subsection we provide more details on the experiments presented in Section 6. The implementation of the key routines is available at https://anonymous.4open.science/r/maxout_expected_gradient-68BD. Experiments were implemented in Python using TensorFlow (Martín Abadi et al., 2015), numpy (Harris et al., 2020) and mpi4py (Dalcin et al., 2011). The plots were created using matplotlib (Hunter, 2007). We conducted all training experiments from Section 6 on a GPU cluster with nodes having 4 Nvidia A100 GPUs with 40 GB of memory. The most extensive experiments ran for one day on one GPU. The experiment in Figure 2 was run on a CPU cluster that uses Intel Xeon IceLakeSP processors (Platinum 8360Y) with 72 cores per node and 256 GB RAM. All other experiments were executed on a ThinkPad T470 laptop with an Intel Core i5-7200U CPU and 16 GB RAM.

Training experiments. Now we discuss the training experiments. We use the MNIST (LeCun & Cortes, 2010), Iris (Fisher, 1936), Fashion MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) datasets. All maxout networks have maxout rank K = 5. Weights are sampled from N(0, c/fan-in) in fully-connected networks and from N(0, c/(k² · fan-in)), where k is the kernel size, in CNNs. The biases are initialized to zero. ReLU networks are initialized using the He approach (He et al., 2015), meaning that c = 2. All results are averaged over 4 runs. We do not use any weight normalization techniques, such as batch normalization (Ioffe & Szegedy, 2015). We split each dataset into training, validation and test sets and report the accuracy on the test set; the validation set was used only for picking the hyper-parameters and was not used in training. The mini-batch size in all experiments is 32. The number of training epochs was picked by observing the training loss and choosing the number of epochs by which it had converged.
The exception is the SVHN dataset, for which we observe the double descent phenomenon and stop training after 150 epochs.

Network architecture. Fully-connected networks have 21 layers. Specifically, their architecture is [5×fc64, 5×fc32, 5×fc16, 5×fc8, out], where "5×fc64" means that there are 5 fully-connected layers with 64 neurons, and "out" stands for the output layer, which has the number of neurons equal to the number of classes in a dataset. CNNs have a VGG-19-like architecture (Simonyan & Zisserman, 2015) with 20 or 16 layers, depending on the input size. The 20-layer architecture is [2×conv64, mp, 2×conv128, mp, 4×conv256, mp, 4×conv512, mp, 4×conv512, mp, 2×fc4096, fc1000, out], where "conv64" stands for a convolutional layer with 64 neurons and "mp" for a max-pooling layer. The kernel size in all convolutional layers is 3 × 3. Max-pooling uses 2 × 2 pooling windows with stride 2. This architecture is used for datasets with images whose side length is greater than or equal to 32: CIFAR-10, CIFAR-100 and SVHN. The 16-layer architecture is used for the smaller images of MNIST and Fashion MNIST. This architecture does not have the last convolutional block of the 20-layer version. Concretely, it has the following layers: [2×conv64, mp, 2×conv128, mp, 4×conv256, mp, 4×conv512, mp, 2×fc4096, fc1000, out].

Max-pooling initialization. To account for the maximum in max-pooling layers, a maxout layer appearing after a max-pooling layer is initialized as if its maxout rank were K × m², where m² is the max-pooling window size. The reason for this is that the outputs of a computational block consisting of a max-pooling window and a maxout layer are maxima over K × m² linear functions:

$$\max\{W_1\max\{x_1,\dots,x_{m^2}\}+b_1,\;\dots,\;W_K\max\{x_1,\dots,x_{m^2}\}+b_K\} = \max\{f_1,\dots,f_{Km^2}\},$$

where the $f_i$ are $Km^2$ affine functions.
Therefore, we initialize the layers that follow max-pooling layers using the criterion for maxout rank m² × K instead of K. In our experiments, K = 5, m = 2, and m² × K = 20. Hence, for such layers, we use the constant c = 1/M = 0.26573, where M is computed for K = 20 using the formula from Remark 14 in Appendix C. All other layers, which do not follow max-pooling layers, are initialized as suggested in Section 4. We observe that max-pooling initialization often leads to slightly higher accuracy.

Data augmentation. There is no data augmentation for fully-connected networks. For convolutional networks, for the MNIST, Fashion MNIST and SVHN datasets we perform random translation, rotation and zoom of the input images. For CIFAR-10 and CIFAR-100, we additionally apply a random horizontal flip.

Learning rate decay. In all experiments, we use learning rate decay and choose the optimal initial learning rate for each network and initialization type based on the accuracy on the validation dataset. The learning rate is halved every nth epoch. For SVHN, n = 10, and for all other datasets, n = 100.

SGD with momentum. We use SGD with Nesterov momentum, with a momentum value of 0.9. Specific dataset settings are the following.

• MNIST (fully-connected networks). Networks are trained for 600 epochs. The learning rate is halved every 100 epochs. Learning rates: maxout networks with maxout initialization: 0.002, maxout networks with c = 0.1: 0.002, maxout networks with c = 2: 2 × 10⁻⁷, ReLU networks: 0.002.

• Iris. Networks are trained for 500 epochs. The learning rate is halved every 100 epochs. Learning rates: maxout networks with maxout initialization: 0.01, maxout networks with c = 0.1: 0.01, maxout networks with c = 2: 4 × 10⁻⁸, ReLU networks: 0.005.

• MNIST (convolutional networks). Networks are trained for 800 epochs. The learning rate is halved every 100 epochs.
Learning rates: maxout networks with maxout initialization: 0.009, maxout networks with max-pooling initialization: 0.009, maxout networks with c = 0.1: 0.009, maxout networks with c = 2: 8 × 10⁻⁶, ReLU networks: 0.01.

• Fashion MNIST. Networks are trained for 800 epochs. The learning rate is halved every 100 epochs. Learning rates: maxout networks with maxout initialization: 0.004, maxout networks with max-pooling initialization: 0.006, maxout networks with c = 0.1: 0.4, maxout networks with c = 2: 5 × 10⁻⁶, ReLU networks: 0.01.

• CIFAR-10. Networks are trained for 1000 epochs. The learning rate is halved every 100 epochs. Learning rates: maxout networks with maxout initialization: 0.004, maxout networks with max-pooling initialization: 0.005, maxout networks with c = 0.1: 0.5, maxout networks with c = 2: 8 × 10⁻⁸, ReLU networks: 0.009.

• CIFAR-100. Networks are trained for 1000 epochs. The learning rate is halved every 100 epochs. Learning rates: maxout networks with maxout initialization: 0.002, maxout networks with max-pooling initialization: 0.002, maxout networks with c = 0.1: 0.002, maxout networks with c = 2: 8 × 10⁻⁵, ReLU networks: 0.006.

[Table rows of test accuracies for fully-connected networks (MNIST, Iris) and convolutional networks (MNIST, CIFAR-10) under different values of c; the column headers were lost in extraction.]

We should point out that what constitutes a fair comparison is not as straightforward as matching the parameter count. In particular, wider networks have the advantage of a higher-dimensional representation.
A fully-connected network will not necessarily perform as well as a convolutional network with the same number of parameters, and a deep and narrow network will not necessarily perform as well as a wider and shallower network with the same number of parameters. Nevertheless, to add more detail to the results, we perform experiments using ReLU networks that have as many parameters as the maxout networks. See Tables 9 and 10 for the results. We modify the network architectures described in Section I.1 for these experiments in the following way. In the first experiment, we use fully-connected ReLU networks 5 times wider than the maxout networks. For convolutional networks, however, the resulting ReLU CNNs would be extremely wide, so we only made them 2 times wider. In our setup, a 5 times wider CNN would need to be trained for longer than 24 hours, which is difficult in our experiment environment. Maxout networks required a much shorter time of around 10 hours, which indicates possible benefits in some cases. In the second experiment, we consider ReLU networks that are 5 times deeper than the maxout networks. More specifically, they have the following architecture: [25×fc64, 25×fc32, 25×fc16, 25×fc8, out]. As expected, wider networks do better. On the other hand, deeper ReLU networks of the same width do much worse than maxout networks.
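The weight sampling used throughout these experiments — N(0, c/fan-in) for fully-connected layers and N(0, c/(k²·fan-in)) for convolutions — amounts to a one-line sampler. The following is a minimal sketch (function name and shape convention are ours, not from the released code):

```python
import numpy as np

def maxout_init(shape, c=0.55555, kernel_size=None, rng=np.random.default_rng()):
    """Sample weights from N(0, c/fan-in). For a fully-connected maxout layer use
    shape = (n_out, K, fan_in); for a conv layer pass the spatial kernel_size so
    that fan-in = kernel_size^2 * in_channels, as described above."""
    fan_in = shape[-1] if kernel_size is None else kernel_size**2 * shape[-1]
    return rng.normal(0.0, np.sqrt(c / fan_in), size=shape)

# one maxout layer: 64 units of rank 5 with 100 inputs, recommended c for K = 5
W = maxout_init((64, 5, 100))
```

For ReLU layers the same sampler with c = 2 reproduces the He initialization used for the baselines.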



Figure 2: Expected value and interquartile range of the squared gradients $n_0(\partial\mathcal{N}/\partial W_{i,k',j})^2$ as a function of depth. Weights are sampled from N(0, c/fan-in) in fully-connected networks and N(0, c/(k² · fan-in)), where k is the kernel size, in CNNs. Biases are zero, and the maxout rank K is 5. The gradient is stable in wide fully-connected and convolutional networks with c = 0.55555 (red line), the value suggested in Section 4. The dark and light blue lines represent the bounds from Corollary 2 and equal $1/\mathcal{L} = 0.36$ and $1/\mathcal{S} = 12$. The yellow line corresponds to the ReLU-He initialization. We compute the mean and quartiles from 100 network initializations and a fixed input. Same-color lines that are close to each other correspond to 3 different unit-norm network inputs.

Figure 4: The plots show that $|\cos\gamma_{x^{(l)},u^{(l)}}|$ grows with the network depth and eventually converges to 1 for wide networks and maxout rank K > 2. The results were averaged over 1000 parameter initializations, and both weights and biases were sampled from N(0, c/fan-in), $c = 1/\mathbb{E}[(\Xi(N(0,1),K))^2]$, as discussed in Section 4. Vectors x and u were sampled from N(0, I).

Figure 5: The plots show that | cos γ x (l) ,u (l) | does not converge to 1 for c < 1/E[Ξ 2 ] and converges for c ≥ 1/E[Ξ 2 ]. The network had 100 neurons at each layer, and both weights and biases were sampled from N (0, c/fan-in). The results were averaged over 1000 parameter initializations. Vectors x and u were sampled from N (0, I).

Figure 6: Second moment of $\Xi(N(0,1),K)$ for different sample sizes K. It increases with K for $2 \leq K \leq 100$, and $\mathbb{E}[(\Xi(N(0,1),K))^2] > 1$ for K > 2.

Figure 7: Shown is the expected value, with respect to the weights, of the squared norm of the directional derivative of the input-output map of a maxout network in a fixed random direction, plotted as a function of the input. Weights are sampled from N(0, 1/fan-in) and biases are zero. Inputs are standard Gaussian vectors. The vector u is a one-hot vector with 1 at a random position, kept fixed within one setup. We sampled 1000 inputs and 1000 initializations for each input. The left end corresponds to the second moment of the Gaussian distribution and the right end to the second moment of the largest order statistic. Observe that for wide and deep networks the mean is closer to the second moment of the largest order statistic.

Figure 8: Comparison of the activation length with the equality-in-distribution result from Corollary 21 and the formula for the mean from Corollary 5. Plotted are means and stds estimated with respect to the distribution of parameters / random variables $\Xi_{l,i}(N(0,1),K)$, averaged over 100,000 initializations, and a numerically evaluated formula for the mean from Corollary 5. All layers had 10 neurons. The lines for the mean and the areas for the std overlap. Note that there is no std for the formula in the plot.

Recommended values for the constant c for different maxout ranks K based on Section 4.
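The entries of this table can be reproduced with a short Monte Carlo routine: for each K, c = 1/E[Ξ(N(0,1),K)²]. A sketch of ours (the closed-form route via Remark 14 would be more precise):

```python
import numpy as np

def recommended_c(K, samples=500_000, rng=np.random.default_rng(5)):
    # c = 1/M, with M the second moment of the largest of K standard Gaussians
    xi = rng.standard_normal((samples, K)).max(axis=1)
    return 1.0 / float(np.mean(xi**2))

# ~0.5556 for K = 5 (Section 4) and ~0.2657 for K x m^2 = 20 (max-pooling init)
c5, c20 = recommended_c(5), recommended_c(20)
```

This matches the constants used in the experiments, c = 0.55555 for maxout rank 5 and c = 0.26573 for maxout layers following 2 × 2 max-pooling.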

Accuracy on the test set for networks trained using SGD with Nesterov momentum. Observe that maxout networks initialized with the maxout or max-pooling initialization perform significantly better than the ones initialized with other initializations and better or comparably to ReLU networks.

Constants L and S denote the means of the largest and the smallest order statistics in a sample of size K of chi-squared random variables with 1 degree of freedom. Constant M denotes the second moment of the largest order statistic in a sample of size K of standard Gaussian random variables. See Remark 14 for the explanation of how these constants are computed.

Accuracy on the test set for maxout networks trained using SGD with Nesterov momentum for values of c greater than or equal to c = 0.55555, the value suggested in Section 4.

Accuracy on the test set for maxout networks with batch normalization trained using SGD with Nesterov momentum for values of c less than or equal to c = 0.55555, the value suggested in Section 4. Observe that the recommended value of c from our theory closely matches the empirically optimal value of c. [Table rows for MNIST and CIFAR-10 across values of c; the column headers were lost in extraction.]

Accuracy on the test set for maxout networks with batch normalization trained using SGD with Nesterov momentum for values of c greater than or equal to c = 0.55555, the value suggested in Section 4. Observe that the recommended value of c from our theory closely matches the empirically optimal value of c. [Table rows for MNIST and CIFAR-10 across values of c; the column headers were lost in extraction.]

I.4 COMPARISON OF MAXOUT AND RELU NETWORKS IN TERMS OF THE NUMBER OF PARAMETERS

Accuracy on the test set for networks trained using SGD with Nesterov momentum. Fully-connected ReLU networks are 5 times wider than fully-connected maxout networks. Convolutional ReLU networks are 2 times wider than convolutional maxout networks. All networks have the same number of layers.

Accuracy on the test set for networks trained using SGD with Nesterov momentum. Fully-connected ReLU networks are 5 times deeper than fully-connected maxout networks but have the same width.

annex

Proof. By Proposition 13, almost surely with respect to the parameter initialization, […] where the vectors $u^{(l)}$, $l = 0,\dots,L-1$, are defined recursively as $u^{(l)} = W^{(l)}u^{(l-1)}/\|W^{(l)}u^{(l-1)}\|$ and $u^{(0)} = u$. The matrices $W^{(l)}$ consist of the rows $W_{i,k'}$, and $x^{(l-1)}$ is the $l$th layer's input, $x^{(0)} = x$. We assumed that the weights in the last layer are sampled from a Gaussian distribution with mean zero and variance $1/n_{L-1}$. Then, by Lemma 10, […]. We use this observation in equation (6), multiply and divide the summands in the expression by $c/n_{l-1}$, and rearrange to obtain […]. We define $\bar{x}^{(l-1)} := (x_1,\dots,x_{n_{l-1}},1)$, append the vectors of biases to the weight matrices, and denote the obtained matrices by $\bar{W}^{(l)} \in \mathbb{R}^{n_l\times(n_{l-1}+1)}$. Then (9) equals […].

• SVHN. Networks are trained for 150 epochs. The learning rate is halved every 10 epochs. Learning rates: maxout networks with maxout initialization: 0.005, maxout networks with max-pooling initialization: 0.005, maxout networks with c = 0.1: 0.005, maxout networks with c = 2: 7 × 10⁻⁵, ReLU networks: 0.005.

Adam. We use the Adam optimizer (Kingma & Ba, 2015) with the default TensorFlow parameters β₁ = 0.9, β₂ = 0.999. Specific dataset settings are the following.

• MNIST (fully-connected networks). Networks are trained for 600 epochs. The learning rate is halved every 100 epochs. Learning rates: maxout networks with maxout initialization: 0.0008, maxout networks with c = 2: 0.0007, ReLU networks: 0.0008.

• MNIST (convolutional networks). Networks are trained for 800 epochs. The learning rate is halved every 100 epochs. Learning rates: maxout networks with maxout initialization: 0.0001, maxout networks with max-pooling initialization: 0.00006, maxout networks with c = 2: 0.00004, ReLU networks: 0.00009.

• Fashion MNIST. Networks are trained for 1000 epochs. The learning rate is halved every 100 epochs.
Learning rates: maxout networks with maxout initialization: 0.00007, maxout networks with max-pooling initialization: 0.00008, maxout networks with c = 2: 0.00005, ReLU networks: 0.0002. • CIFAR-10. Networks are trained for 1000 epochs. The learning rate is halved every 100 epochs. Learning rates: maxout networks with maxout initialization: 0.00009, maxout networks with max-pooling initialization: 0.00009, maxout networks with c = 2: 0.00005, ReLU networks: 0.0001. • CIFAR-100. Networks are trained for 1000 epochs. The learning rate is halved every 100 epochs. Learning rates: maxout networks with maxout initialization: 0.00008, maxout networks with max-pooling initialization: 0.00009, maxout networks with c = 2: 0.00005, ReLU networks: 0.00009.
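All of the schedules above share the same shape: a fixed base learning rate that is halved every fixed number of epochs. The following minimal sketch illustrates this schedule in plain Python; the function name is illustrative and not from the paper's code, which uses TensorFlow.

```python
def stepwise_halving_lr(base_lr, epoch, halving_period):
    """Learning rate halved every `halving_period` epochs.

    Illustrative sketch of the schedules described above, e.g. for
    Adam on Fashion MNIST with maxout initialization: base rate
    0.00007, halved every 100 epochs over 1000 epochs of training.
    """
    return base_lr * 0.5 ** (epoch // halving_period)

# Per-epoch rates at the start of each 100-epoch segment for that setting.
schedule = [stepwise_halving_lr(7e-5, e, 100) for e in range(0, 1000, 100)]
```

In a framework such as TensorFlow, the same effect is typically achieved with a learning-rate-schedule callback rather than a hand-written loop.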

I.2 ABLATION ANALYSIS

Tables 5 and 6 show the results of additional experiments that use SGD with Nesterov momentum for more values of c and for K = 5. We see that the recommended value of c from Section 4 closely matches the empirically optimal value of c. Note that the learning rate is fixed across choices of c. Specifically, the following learning rates were used for the different datasets. MNIST with fully-connected networks: 0.002; Iris: 0.01; MNIST with convolutional networks: 0.009; CIFAR-10: 0.004.

I.3 BATCH NORMALIZATION

Tables 7 and 8 report test accuracy for maxout networks with batch normalization trained using SGD with Nesterov momentum for various values of c. The implementation of these experiments is similar to that described in Section I.1, with the following differences: the networks use batch normalization after each layer with activations; the width of the last fully-connected layer is 100, and all other layers of the convolutional networks are 8 times narrower; and the learning rate is fixed at 0.01 for all experiments. We use the default batch normalization parameters from TensorFlow, namely momentum 0.99 and ϵ = 0.001. We observe that our initialization strategy remains beneficial when training with batch normalization.

