EXPECTED GRADIENTS OF MAXOUT NETWORKS AND CONSEQUENCES TO PARAMETER INITIALIZATION

Abstract

We study the gradients of a maxout network with respect to inputs and parameters and obtain bounds for the moments depending on the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates a stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully-connected and convolutional networks show that this strategy improves SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the NTK.

1. INTRODUCTION

We study the gradients of maxout networks and derive several implications for training stability, parameter initialization, and expressivity. Concretely, we compute stochastic order bounds and bounds on the moments depending on the parameter distribution and the network architecture. The analysis is based on the input-output Jacobian of maxout networks. We discover that, in contrast to ReLU networks, when initialized with a zero-mean Gaussian distribution, the distribution of the input-output Jacobian of a maxout network depends on the network input, which may lead to unstable gradients and training difficulties. Nonetheless, we obtain a rigorous parameter initialization recommendation for wide networks. The analysis of gradients also allows us to refine previous bounds on the expected number of linear regions of maxout networks at initialization and to derive new results on the length distortion and the NTK.

Maxout networks

A rank-$K$ maxout unit, introduced by Goodfellow et al. (2013), computes the maximum of $K$ real-valued parametric affine functions. Concretely, a rank-$K$ maxout unit with $n$ inputs implements a function $\mathbb{R}^n \to \mathbb{R}$, $x \mapsto \max_{k \in [K]} \{\langle W_k, x \rangle + b_k\}$, where $W_k \in \mathbb{R}^n$ and $b_k \in \mathbb{R}$, $k \in [K] := \{1, \ldots, K\}$, are trainable weights and biases. The $K$ arguments of the maximum are called the pre-activation features of the maxout unit. A maxout unit may be regarded as a multi-argument generalization of a ReLU, which computes the maximum of a real-valued affine function and zero. Goodfellow et al. (2013) demonstrated that maxout networks can perform better than ReLU networks under similar circumstances. Additionally, maxout networks have been shown to be useful for combating catastrophic forgetting in neural networks (Goodfellow et al., 2015). On the other hand, Castaneda et al. (2019) evaluated the performance of maxout networks in a big-data setting and observed that increasing the width of ReLU networks is more effective for improving performance than replacing ReLUs with maxout units, and that ReLU networks converge faster than maxout networks. We observe that proper initialization strategies for maxout networks have not been studied in the same level of detail as for ReLU networks, and that addressing this might resolve some of the problems encountered in previous maxout network applications.
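To make the definition above concrete, the following minimal NumPy sketch (ours, not from the paper) implements a single rank-$K$ maxout unit and its gradient with respect to the input. Since the gradient is the weight vector of whichever pre-activation feature attains the maximum, it generally differs between inputs, illustrating the input dependence discussed above.

```python
import numpy as np

def maxout_unit(x, W, b):
    """Rank-K maxout unit: max over K affine pre-activation features.

    x has shape (n,), W has shape (K, n), b has shape (K,).
    """
    return np.max(W @ x + b)

def maxout_unit_grad(x, W, b):
    """Gradient w.r.t. x: the weight row of the active (argmax) feature."""
    k = np.argmax(W @ x + b)
    return W[k]

rng = np.random.default_rng(0)
K, n = 3, 4
W = rng.normal(size=(K, n))
b = rng.normal(size=K)

# The gradient depends on which pre-activation feature attains the maximum,
# so it is generally different at different inputs.
x1, x2 = rng.normal(size=n), rng.normal(size=n)
g1, g2 = maxout_unit_grad(x1, W, b), maxout_unit_grad(x2, W, b)
```

Because the unit is piecewise linear, away from ties the gradient is exact: a finite-difference directional derivative reproduces $\langle g, v \rangle$ up to floating-point error.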

Parameter initialization

The vanishing and exploding gradient problem has been known since the work of Hochreiter (1991). It makes choosing an appropriate learning rate harder and slows training (Sun, 2019). Common approaches to address this difficulty include the choice of specific architectures, e.g., LSTMs (Hochreiter, 1991) or ResNets (He et al., 2016), normalization methods such as batch normalization (Ioffe & Szegedy, 2015), and explicit control of the gradient magnitude via gradient clipping (Pascanu et al., 2013). We focus on approaches based on parameter initialization that control the activation length and parameter gradients (LeCun et al., 2012; Glorot & Bengio, 2010; He et al., 2015; Gurbuzbalaban & Hu, 2021; Zhang et al., 2019; Bachlechner et al., 2021). He et al. (2015) studied the forward and backward passes to obtain initialization recommendations for ReLU networks. A more rigorous analysis of the gradients was performed by Hanin & Rolnick (2018) and Hanin (2018), who also considered higher-order moments and derived recommendations on the network architecture. Sun et al. (2018) derived a corresponding strategy for rank-$K=2$ maxout networks. For higher maxout ranks, Tseran & Montúfar (2021) considered balancing the forward pass, assuming a Gaussian or uniform distribution on the pre-activation features of each layer; however, this assumption is not fully justified. We analyze maxout network gradients, including the higher-order moments, and give a rigorous justification for the initialization suggested by Tseran & Montúfar (2021).

Expected number of linear regions

Neural networks with piecewise linear activation functions subdivide their input space into linear regions, i.e., regions over which the computed function is (affine) linear. The number of linear regions serves as a complexity measure to differentiate network architectures (Pascanu et al., 2014; Montufar et al., 2014; Telgarsky, 2015; 2016).
The first results on the expected number of linear regions were obtained by Hanin & Rolnick (2019a;b) for ReLU networks, showing that it can be much smaller than the maximum possible number. Tseran & Montúfar (2021) obtained corresponding results for maxout networks. An important factor controlling the bounds in these works is a constant depending on the gradient of the neuron activations with respect to the network input. By studying the input-output Jacobian of maxout networks, we obtain a refined bound for this constant and, consequently, for the expected number of linear regions.

Expected curve distortion

Another complexity measure is the distortion of the length of an input curve as it passes through a network. Poole et al. (2016) studied the propagation of Riemannian curvature through wide neural networks using a mean-field approach, and later a related notion of "trajectory length" was considered by Raghu et al. (2017). It was demonstrated that these measures can grow exponentially with the network depth, which was linked to the ability of deep networks to "disentangle" complex representations. Based on these notions, Murray et al. (2022) study how to avoid rapid convergence of pairwise input correlations as well as vanishing and exploding gradients. However, Hanin et al. (2021) proved that for a ReLU network with He initialization the length of the curve does not grow with depth and even shrinks slightly. We establish similar results for maxout networks.

NTK

It is known that the Neural Tangent Kernel (NTK) of a finite network can be approximated by its expectation (Jacot et al., 2018). However, for ReLU networks Hanin & Nica (2020a) showed that if both the depth and width tend to infinity, the NTK does not converge to a constant in probability. By studying the expectation of the gradients, we show that, similarly to ReLU networks, the NTK of maxout networks does not converge to a constant when both width and depth are sent to infinity.
Contributions

Our contributions can be summarized as follows.

• For expected gradients, we derive stochastic order bounds for the directional derivative of the input-output map of a deep fully-connected maxout network (Theorem 1), as well as bounds for the moments (Corollary 2). Additionally, we derive an equality in distribution for the directional derivatives (Theorem 3), based on which we also discuss the moments in wide networks (Remark 4). We further derive the moments of the activation length of a fully-connected maxout network (Corollary 5).

• We rigorously derive parameter initialization guidelines for wide maxout networks that prevent vanishing and exploding gradients, and we formulate architecture recommendations. We experimentally demonstrate that these make it possible to train standard-width deep fully-connected and convolutional maxout networks using simple procedures (such as SGD with momentum and Adam), yielding higher accuracy than other initializations or ReLU networks on image classification tasks.

• We derive several implications: refined bounds on the expected number of linear regions (Corollary 6), and new results on length distortion (Corollary 7) and on the NTK (Corollary 9).
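The initialization guidelines referred to above amount to sampling weights from a zero-mean Gaussian whose variance scales as a constant over the fan-in, where the constant depends on the maxout rank $K$; its exact value is derived later in the paper. The following NumPy sketch (ours) shows only the general form, with the maxout-specific constant left as a free parameter `c` (the value `c=0.5` below is an arbitrary placeholder, not the paper's recommendation).

```python
import numpy as np

def init_maxout_net(widths, K, c, rng):
    """Initialize each layer with W ~ N(0, c / fan_in) and zero biases.

    widths = [n_0, n_1, ..., n_{L-1}]; each unit has K pre-activation
    features, so layer l holds a weight tensor of shape (n_l, K, n_{l-1}).
    c is the maxout-specific scaling constant (a placeholder here).
    """
    layers = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(scale=np.sqrt(c / n_in), size=(n_out, K, n_in))
        b = np.zeros((n_out, K))
        layers.append((W, b))
    return layers

def forward(x, layers):
    """Forward pass: affine pre-activations of shape (n_out, K), maxed over K."""
    for W, b in layers:
        x = np.max(np.einsum('okn,n->ok', W, x) + b, axis=1)
    return x

rng = np.random.default_rng(1)
widths = [64] * 11  # n_0 = 64 and ten hidden layers of width 64
net = init_maxout_net(widths, K=5, c=0.5, rng=rng)  # c=0.5 is a placeholder
out = forward(rng.normal(size=widths[0]), net)
```

Monitoring the norm of `out` (or of input-output Jacobian-vector products) across depths for different choices of `c` is a simple way to observe the vanishing/exploding behavior that the derived constant is designed to avoid.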

2. PRELIMINARIES

Architecture

We consider feedforward fully-connected maxout neural networks with $n_0$ inputs, $L-1$ hidden layers of widths $n_1, \ldots, n_{L-1}$, and a linear output layer, which implement functions of the form $\mathcal{N} = \psi \circ \phi_{L-1} \circ \cdots \circ \phi_1$. The $l$-th hidden layer is a function $\phi_l : \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}$ with components

