BIDIRECTIONALLY SELF-NORMALIZING NEURAL NETWORKS

Abstract

The problem of exploding and vanishing gradients has been a long-standing obstacle to the effective training of neural networks. Despite various tricks and techniques employed to alleviate the problem in practice, satisfactory theories and provable solutions are still lacking. In this paper, we address the problem from the perspective of high-dimensional probability theory. We provide a rigorous result showing, under mild conditions, that the exploding/vanishing gradient problem disappears with high probability if the neural network has sufficient width. Our main idea is to constrain both forward and backward signal propagation in a nonlinear neural network through a new class of activation functions, namely Gaussian-Poincaré normalized functions, and orthogonal weight matrices. Experiments on both synthetic and real-world data validate our theory and confirm its effectiveness on very deep neural networks when applied in practice.

1. INTRODUCTION

Neural networks have brought unprecedented performance to various artificial intelligence tasks (Graves et al., 2013; Krizhevsky et al., 2012; Silver et al., 2017). However, despite decades of research, training neural networks is still mostly guided by empirical observations, and successful training often requires various heuristics and extensive hyperparameter tuning. It is therefore desirable to understand the cause of the difficulty in neural network training and to propose theoretically sound solutions. A major difficulty is the gradient exploding/vanishing problem (Glorot & Bengio, 2010; Hochreiter, 1991; Pascanu et al., 2013; Philipp et al., 2018): the norm of the gradient in each layer grows or shrinks at an exponential rate as the gradient signal propagates from the top layer to the bottom layer. For deep neural networks, this problem can cause numerical overflow and make the optimization problem intrinsically difficult, since gradients in different layers have vastly different magnitudes and the optimization landscape therefore becomes pathological. One might attempt to solve the problem by simply normalizing the gradient in each layer. Indeed, adaptive gradient optimization methods (Duchi et al., 2011; Kingma & Ba, 2015; Tieleman & Hinton, 2012) implement this idea and have been widely used in practice. However, one might also wonder whether there is a solution more intrinsic to deep neural networks, whose internal structure, if well exploited, would lead to further advances.

To enable the trainability of deep neural networks, batch normalization (Ioffe & Szegedy, 2015) was proposed in recent years and has achieved widespread empirical success. Batch normalization is a differentiable operation that normalizes its inputs based on mini-batch statistics and is inserted between the linear and nonlinear layers. It is reported that batch normalization can accelerate neural network training significantly (Ioffe & Szegedy, 2015).
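The exponential growth or decay of the gradient norm described above is easy to reproduce numerically. The following sketch (an illustration, not code from the paper) propagates an error signal backward through a deep tanh network whose Gaussian weights use a gain below the critical value, so the gradient norm shrinks geometrically with depth; the width `n`, depth `L`, and `gain` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, gain = 64, 50, 0.5  # gain below the critical value -> vanishing gradients
W = [gain * rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(L)]

# Forward pass, storing the pre-activations h^(l).
x, hs = rng.standard_normal(n), []
for Wl in W:
    h = Wl @ x
    hs.append(h)
    x = np.tanh(h)

# Backward pass: d^(L) = D^(L) e, then d^(l) = D^(l) (W^(l+1))^T d^(l+1),
# where D^(l)_ii = tanh'(h^(l)_i) and e stands in for the top-layer error.
D = [1.0 - np.tanh(h) ** 2 for h in hs]
d = D[-1] * rng.standard_normal(n)
norms = [np.linalg.norm(d)]
for l in range(L - 2, -1, -1):
    d = D[l] * (W[l + 1].T @ d)
    norms.append(np.linalg.norm(d))

print(norms[0], norms[-1])  # the gradient norm shrinks geometrically with depth
```

Raising the gain above the critical value produces the opposite failure mode, with the gradient norm exploding at a similar exponential rate.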
However, batch normalization does not solve the gradient exploding/vanishing problem (Philipp et al., 2018). Indeed, it has been proved that batch normalization can actually worsen the problem (Yang et al., 2019). Besides, batch normalization requires separate training and testing phases and can be ineffective when the mini-batch size is small (Ioffe, 2017). The shortcomings of batch normalization motivate us to search for a more principled and generic approach to the gradient exploding/vanishing problem. Alternatively, self-normalizing neural networks (Klambauer et al., 2017) and dynamical isometry theory (Pennington et al., 2017) were proposed to combat the gradient exploding/vanishing problem. In self-normalizing neural networks, the output of each network unit is constrained to have zero mean and unit variance. Based on this motivation, a new activation function, the scaled exponential linear unit (SELU), was devised. In dynamical isometry theory, all singular values of the input-output Jacobian matrix are constrained to be close to one at initialization, which amounts to initializing the function computed by the network to be close to an orthogonal matrix. While both theories dispense with batch normalization, it has been shown that SELU still suffers from the exploding/vanishing gradient problem and that dynamical isometry restricts the function computed by the network to be close to linear (pseudo-linearity) (Philipp et al., 2018).

In this paper, we follow the above line of research to investigate neural network trainability. Our contributions are three-fold. First, we introduce bidirectionally self-normalizing neural networks (BSNNs), which consist of orthogonal weight matrices and a new class of activation functions that we call Gaussian-Poincaré normalized (GPN) functions. We show that many common activation functions can be easily transformed into their respective GPN versions.
Second, we rigorously prove that the gradient exploding/vanishing problem disappears with high probability in BSNNs if the width of each layer is sufficiently large. Third, with experiments on synthetic and real-world data, we confirm that BSNNs solve the gradient vanishing/exploding problem to a large extent while maintaining nonlinear functionality.
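The transformation of a common activation into its GPN version can be sketched numerically. The exact GPN condition is defined in the paper's later sections; the sketch below *assumes* it is E[ψ(Z)²] = E[ψ′(Z)²] = 1 for Z ~ N(0, 1) and fits an affine transform ψ(x) = aφ(x) + b by Monte Carlo moment estimation. The function `gpn_affine` and this recipe are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gpn_affine(phi, dphi, num_samples=10**6, seed=0):
    """Fit a, b so that psi(x) = a*phi(x) + b satisfies the assumed GPN
    condition E[psi(Z)^2] = 1 and E[psi'(Z)^2] = 1 for Z ~ N(0, 1)."""
    z = np.random.default_rng(seed).standard_normal(num_samples)
    m1, m2 = phi(z).mean(), (phi(z) ** 2).mean()
    a = 1.0 / np.sqrt((dphi(z) ** 2).mean())  # enforces E[psi'(Z)^2] = 1
    # Solve b^2 + 2*a*m1*b + (a^2*m2 - 1) = 0 to enforce E[psi(Z)^2] = 1.
    # The Gaussian Poincare inequality Var[phi(Z)] <= E[phi'(Z)^2] keeps the
    # discriminant non-negative, so a real root always exists.
    disc = (a * m1) ** 2 - (a ** 2 * m2 - 1.0)
    b = -a * m1 + np.sqrt(max(disc, 0.0))
    return a, b

# Example: a GPN version of tanh.
a, b = gpn_affine(np.tanh, lambda z: 1.0 - np.tanh(z) ** 2)
```

Under this assumed condition, the Gaussian Poincaré inequality is exactly what guarantees the affine normalization is possible for any differentiable activation, which would explain the name of the function class.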

2. THEORY

In this section, we formally introduce the bidirectionally self-normalizing neural network (BSNN) and analyze its properties. All proofs of our results are left to the Appendix. To simplify the analysis, we define a neural network in a restricted sense as follows:

Definition 1 (Neural Network). A neural network is a function from R^n to R^n composed of layer-wise operations for l = 1, ..., L:

    h^{(l)} = W^{(l)} x^{(l)},   x^{(l+1)} = φ(h^{(l)}),    (1)

where W^{(l)} ∈ R^{n×n}, φ : R → R is a differentiable function applied element-wise to a vector, x^{(1)} is the input and x^{(L+1)} is the output.

Under this definition, φ is called the activation function, {W^{(l)}}_{l=1}^{L} are called the parameters, n is called the width and L is called the depth. The superscript (l) denotes the l-th layer of a neural network. The above formulation is similar to that of Pennington et al. (2017), but we omit the bias term in (1) for simplicity, as it plays no role in our analysis.

Let E be the objective function of {W^{(l)}}_{l=1}^{L} and let D^{(l)} be the diagonal matrix with diagonal elements

    D^{(l)}_{ii} = φ′(h^{(l)}_i),    (2)

where φ′ denotes the derivative of φ. The error signal is back-propagated via

    d^{(L)} = D^{(L)} ∂E/∂x^{(L+1)},   d^{(l)} = D^{(l)} (W^{(l+1)})^T d^{(l+1)},    (3)

and the gradient of the weight matrix for layer l can be computed as ∂E/∂W^{(l)} = d^{(l)} (x^{(l)})^T. To solve the gradient exploding/vanishing problem, we constrain the forward signals x^{(l)} and the backward signals d^{(l)} in order to constrain the norm of the gradient. This leads to the following definition and proposition.

Definition 2 (Bidirectional Self-Normalization). A neural network is bidirectionally self-normalizing if

    ‖x^{(1)}‖_2 = ‖x^{(2)}‖_2 = ... = ‖x^{(L)}‖_2,    (4)
    ‖d^{(1)}‖_2 = ‖d^{(2)}‖_2 = ... = ‖d^{(L)}‖_2.    (5)

Proposition 1. If a neural network is bidirectionally self-normalizing, then

    ‖∂E/∂W^{(1)}‖_F = ‖∂E/∂W^{(2)}‖_F = ... = ‖∂E/∂W^{(L)}‖_F.    (6)
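The step behind Proposition 1 is that the gradient ∂E/∂W^{(l)} = d^{(l)} (x^{(l)})^T is a rank-one outer product, whose Frobenius norm factorizes as ‖d^{(l)}‖_2 ‖x^{(l)}‖_2; equal signal norms across layers then force equal gradient norms. A quick numerical check of this identity (an illustrative sketch with arbitrary random vectors, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.standard_normal(128)  # backward signal d^(l)
x = rng.standard_normal(128)  # forward signal x^(l)

# dE/dW^(l) = d^(l) (x^(l))^T is rank one, so its Frobenius norm
# factorizes into the product of the two signal norms.
grad = np.outer(d, x)
lhs = np.linalg.norm(grad, "fro")
rhs = np.linalg.norm(d) * np.linalg.norm(x)
print(lhs, rhs)  # equal up to floating-point error
```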

