BIDIRECTIONALLY SELF-NORMALIZING NEURAL NETWORKS

Abstract

The problem of exploding and vanishing gradients has been a long-standing obstacle to the effective training of neural networks. Despite the various tricks and techniques employed to alleviate the problem in practice, satisfactory theories and provable solutions are still lacking. In this paper, we address the problem from the perspective of high-dimensional probability theory. We provide a rigorous result showing, under mild conditions, how the exploding/vanishing gradient problem disappears with high probability if the neural networks have sufficient width. Our main idea is to constrain both forward and backward signal propagation in a nonlinear neural network through a new class of activation functions, namely Gaussian-Poincaré normalized functions, and orthogonal weight matrices. Experiments on both synthetic and real-world data validate our theory and confirm its effectiveness on very deep neural networks when applied in practice.

1. INTRODUCTION

Neural networks have brought unprecedented performance in various artificial intelligence tasks (Graves et al., 2013; Krizhevsky et al., 2012; Silver et al., 2017). However, despite decades of research, training neural networks is still mostly guided by empirical observations, and successful training often requires various heuristics and extensive hyperparameter tuning. It is therefore desirable to understand the cause of the difficulty in neural network training and to propose theoretically sound solutions. A major difficulty is the gradient exploding/vanishing problem (Glorot & Bengio, 2010; Hochreiter, 1991; Pascanu et al., 2013; Philipp et al., 2018): the norm of the gradient in each layer either grows or shrinks at an exponential rate as the gradient signal is propagated from the top layer to the bottom layer. For deep neural networks, this problem may cause numerical overflow and makes the optimization problem intrinsically difficult, as the gradient in each layer has a vastly different magnitude and the optimization landscape therefore becomes pathological. One might attempt to solve the problem by simply normalizing the gradient in each layer. Indeed, adaptive gradient optimization methods (Duchi et al., 2011; Kingma & Ba, 2015; Tieleman & Hinton, 2012) implement this idea and are widely used in practice. However, one might also wonder whether there is a solution more intrinsic to deep neural networks, whose internal structure, if well exploited, would lead to further advances. To enable the trainability of deep neural networks, batch normalization (Ioffe & Szegedy, 2015) was proposed in recent years and has achieved widespread empirical success. Batch normalization is a differentiable operation that normalizes its inputs based on mini-batch statistics and is inserted between the linear and nonlinear layers. It has been reported that batch normalization can accelerate neural network training significantly (Ioffe & Szegedy, 2015).
However, batch normalization does not solve the gradient exploding/vanishing problem (Philipp et al., 2018). Indeed, it has been proved that batch normalization can actually worsen the problem (Yang et al., 2019). Besides, batch normalization requires separate training and testing phases and can be ineffective when the mini-batch size is small (Ioffe, 2017). The shortcomings of batch normalization motivate us to search for a more principled and generic approach to the gradient exploding/vanishing problem. Alternatively, self-normalizing neural networks (Klambauer et al., 2017) and dynamical isometry theory (Pennington et al., 2017) were proposed to combat the problem. In self-normalizing neural networks, the output of each network unit is constrained to have zero mean and unit variance; based on this motivation, a new activation function, the scaled exponential linear unit (SELU), was devised. In dynamical isometry theory, all singular values of the input-output Jacobian matrix are constrained to be close to one at initialization, which amounts to initializing the functionality of the network to be close to an orthogonal matrix. While the two theories dispense with batch normalization, it has been shown that SELU still suffers from the exploding/vanishing gradient problem and that dynamical isometry restricts the functionality of the network to be close to linear (pseudo-linearity) (Philipp et al., 2018). In this paper, we follow the above line of research to investigate neural network trainability. Our contributions are three-fold. First, we introduce bidirectionally self-normalizing neural networks (BSNNs), which consist of orthogonal weight matrices and a new class of activation functions that we call Gaussian-Poincaré normalized (GPN) functions. We show that many common activation functions can be easily transformed into their respective GPN versions.
Second, we rigorously prove that the gradient exploding/vanishing problem disappears with high probability in BSNNs if the width of each layer is sufficiently large. Third, with experiments on synthetic and real-world data, we confirm that BSNNs solve the gradient vanishing/exploding problem to a large extent while maintaining nonlinear functionality.

2. THEORY

In this section, we formally introduce the bidirectionally self-normalizing neural network (BSNN) and analyze its properties. All proofs of our results are deferred to the Appendix. To simplify the analysis, we define a neural network in a restricted sense as follows:

Definition 1 (Neural Network). A neural network is a function from $\mathbb{R}^n$ to $\mathbb{R}^n$ composed of layer-wise operations for $l = 1, \dots, L$:

$$h^{(l)} = W^{(l)} x^{(l)}, \qquad x^{(l+1)} = \phi(h^{(l)}), \qquad (1)$$

where $W^{(l)} \in \mathbb{R}^{n \times n}$, $\phi : \mathbb{R} \to \mathbb{R}$ is a differentiable function applied element-wise to a vector, $x^{(1)}$ is the input and $x^{(L+1)}$ is the output.

Under this definition, $\phi$ is called the activation function, $\{W^{(l)}\}_{l=1}^{L}$ are called the parameters, $n$ is called the width and $L$ is called the depth. The superscript $(l)$ denotes the $l$-th layer of the neural network. The formulation is similar to (Pennington et al., 2017), but we omit the bias term in (1) for simplicity as it plays no role in our analysis. Let $E$ be the objective function of $\{W^{(l)}\}_{l=1}^{L}$ and let $D^{(l)}$ be the diagonal matrix with diagonal elements $D^{(l)}_{ii} = \phi'(h^{(l)}_i)$, where $\phi'$ denotes the derivative of $\phi$. The error signal is back-propagated via

$$d^{(L)} = D^{(L)} \frac{\partial E}{\partial x^{(L+1)}}, \qquad d^{(l)} = D^{(l)} (W^{(l+1)})^T d^{(l+1)}, \qquad (2)$$

and the gradient of the weight matrix for layer $l$ can be computed as

$$\frac{\partial E}{\partial W^{(l)}} = d^{(l)} (x^{(l)})^T.$$

To solve the gradient exploding/vanishing problem, we constrain the forward signals $x^{(l)}$ and the backward signals $d^{(l)}$ in order to constrain the norm of the gradient. This leads to the following definition and proposition.

Definition 2 (Bidirectional Self-Normalization). A neural network is bidirectionally self-normalizing if

$$\|x^{(1)}\|_2 = \|x^{(2)}\|_2 = \dots = \|x^{(L)}\|_2, \qquad \|d^{(1)}\|_2 = \|d^{(2)}\|_2 = \dots = \|d^{(L)}\|_2. \qquad (5)$$

Proposition 1. If a neural network is bidirectionally self-normalizing, then

$$\left\|\frac{\partial E}{\partial W^{(1)}}\right\|_F = \left\|\frac{\partial E}{\partial W^{(2)}}\right\|_F = \dots = \left\|\frac{\partial E}{\partial W^{(L)}}\right\|_F.$$

In the rest of this section, we derive the conditions under which bidirectional self-normalization is achievable for a neural network.
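As a concrete reference for the recursions above, the forward pass (1), the backward pass (2) and the weight gradients can be sketched in a few lines of NumPy. This is our own minimal illustration, not the authors' code; `phi` and `phi_prime` stand for an arbitrary differentiable activation and its derivative:

```python
import numpy as np

def forward(x1, Ws, phi):
    """Forward pass: h^(l) = W^(l) x^(l), x^(l+1) = phi(h^(l))."""
    hs, xs = [], [x1]
    for W in Ws:
        h = W @ xs[-1]
        hs.append(h)
        xs.append(phi(h))
    return hs, xs  # hs[l-1] = h^(l), xs[l-1] = x^(l)

def backward(dE_dxtop, Ws, hs, phi_prime):
    """Backward pass: d^(L) = D^(L) dE/dx^(L+1),
    d^(l) = D^(l) (W^(l+1))^T d^(l+1)."""
    L = len(Ws)
    ds = [None] * L
    ds[L - 1] = phi_prime(hs[L - 1]) * dE_dxtop
    for l in range(L - 2, -1, -1):
        ds[l] = phi_prime(hs[l]) * (Ws[l + 1].T @ ds[l + 1])
    return ds

def weight_grads(ds, xs):
    """Gradients dE/dW^(l) = d^(l) (x^(l))^T for l = 1, ..., L."""
    return [np.outer(d, x) for d, x in zip(ds, xs)]
```

With orthogonal weight matrices and the identity activation, all layer gradients then have equal Frobenius norm, which is exactly the situation described by Proposition 1.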

2.1. CONSTRAINTS ON WEIGHT MATRICES

We constrain the weight matrices to be orthogonal, since multiplication by an orthogonal matrix preserves the norm of a vector. For linear networks, this guarantees bidirectional self-normalization, and its further benefits are discussed in (Saxe et al., 2014). Even for nonlinear neural networks, orthogonal constraints have been shown to improve trainability with proper scaling (Mishkin & Matas, 2016; Pennington et al., 2017).
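For reference, a uniformly (Haar-) distributed orthogonal matrix can be sampled with NumPy via the QR decomposition of a Gaussian matrix with the standard sign correction. This is a sketch of one common sampler, not a procedure prescribed by the paper:

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample W uniformly from O(n) (Haar measure): QR-decompose
    a Gaussian matrix, then fix signs so Q is Haar-distributed."""
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))  # multiply column j by sign(R_jj)
```

Multiplication by such a matrix preserves the Euclidean norm exactly, which is the reason for the orthogonality constraint.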

2.2. CONSTRAINTS ON ACTIVATION FUNCTIONS

To achieve bidirectional self-normalization for a nonlinear network, it is not enough to constrain only the weight matrices. We also need to constrain the activation function in such a way that both forward and backward signals are normalized. To this end, we propose the following constraint, which captures the relationship between a function and its derivative.

Definition 3 (Gaussian-Poincaré Normalization). A function $\phi : \mathbb{R} \to \mathbb{R}$ is Gaussian-Poincaré normalized if it is differentiable and

$$\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi(x)^2] = \mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi'(x)^2] = 1.$$

The definition is inspired by the following theorem, which shows the fundamental relationship between a function and its derivative under the Gaussian measure.

Theorem 1 (Gaussian-Poincaré Inequality (Bogachev, 1998)). If a function $\phi : \mathbb{R} \to \mathbb{R}$ is differentiable with bounded $\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi(x)^2]$ and $\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi'(x)^2]$, then

$$\mathrm{Var}_{x \sim \mathcal{N}(0,1)}[\phi(x)] \le \mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi'(x)^2].$$

Note that a Gaussian-Poincaré normalized (GPN) function carries the implicit assumption that its input is approximately Gaussian. Even though this assumption is standard in the literature (Klambauer et al., 2017; Pennington et al., 2017; Schoenholz et al., 2017), we rigorously prove that it is valid when orthogonal weight matrices are used in equation (1). Next, we state a property of GPN functions.

Proposition 2. A function $\phi : \mathbb{R} \to \mathbb{R}$ is Gaussian-Poincaré normalized with $\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi(x)] = 0$ if and only if $\phi(x) = x$ or $\phi(x) = -x$.

This result indicates that any nonlinear function with zero mean under the Gaussian distribution (e.g., Tanh and SELU) is not GPN. Now we show that a large class of activation functions can be converted into their respective GPN versions by an affine transformation.

Proposition 3. For any differentiable function $\phi : \mathbb{R} \to \mathbb{R}$ with non-zero and bounded $\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi(x)^2]$ and $\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi'(x)^2]$, there exist two constants $a$ and $b$ such that $a\phi(x) + b$ is Gaussian-Poincaré normalized.
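The two moments in Definition 3 can be evaluated numerically, e.g. with Gauss-Hermite quadrature. The sketch below (our illustration, not part of the paper) checks the GPN condition for a given activation and its derivative:

```python
import numpy as np

def gauss_expect(f, deg=100):
    """E_{x~N(0,1)}[f(x)] via Gauss-Hermite quadrature
    (change of variables x = sqrt(2) * t)."""
    t, w = np.polynomial.hermite.hermgauss(deg)
    return float((w * f(np.sqrt(2.0) * t)).sum() / np.sqrt(np.pi))

def is_gpn(phi, phi_prime, tol=1e-6):
    """Check E[phi(x)^2] = E[phi'(x)^2] = 1 under x ~ N(0,1)."""
    return (abs(gauss_expect(lambda x: phi(x) ** 2) - 1.0) < tol and
            abs(gauss_expect(lambda x: phi_prime(x) ** 2) - 1.0) < tol)
```

For example, the identity function is GPN, while plain Tanh is not (its second moment under the standard Gaussian is well below one), consistent with Proposition 2.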
To obtain $a$ and $b$, one can numerically compute the values of $\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi'(x)^2]$, $\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi(x)^2]$ and $\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi(x)]$ and then solve the quadratic equations

$$\mathbb{E}_{x \sim \mathcal{N}(0,1)}[a^2 \phi'(x)^2] = 1, \qquad \mathbb{E}_{x \sim \mathcal{N}(0,1)}[(a\phi(x) + b)^2] = 1.$$

We computed $a$ and $b$ (not unique) for several common activation functions with their default hyperparameters, and the results are listed in Table 1. Note that ReLU, LeakyReLU and SELU are not differentiable at $x = 0$, but they can be regarded as approximations of their smooth counterparts. We ignore this point and evaluate the integrals over $x \in (-\infty, 0) \cup (0, \infty)$. With the orthogonal constraint on the weight matrices and Gaussian-Poincaré normalization of the activation function, we prove in the next subsection that bidirectional self-normalization is achievable with high probability under mild conditions.
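A small sketch of this procedure (ours, not the authors' code): estimate the three moments by quadrature, set $a$ from the first equation, then solve the quadratic in $b$, whose discriminant is nonnegative by Theorem 1:

```python
import numpy as np

def gpn_constants(phi, phi_prime, deg=120):
    """Find (a, b) with E[(a*phi')^2] = 1 and E[(a*phi + b)^2] = 1
    under x ~ N(0,1), using Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(deg)
    x = np.sqrt(2.0) * t
    E = lambda v: float((w * v).sum() / np.sqrt(np.pi))
    m1, m2 = E(phi(x)), E(phi(x) ** 2)
    a = 1.0 / np.sqrt(E(phi_prime(x) ** 2))
    # b^2 + 2*a*m1*b + (a^2*m2 - 1) = 0; the discriminant equals
    # 1 - a^2 * Var[phi] >= 0 by the Gaussian-Poincare inequality
    b = -a * m1 + np.sqrt((a * m1) ** 2 - (a * a * m2 - 1.0))
    return a, b
```

The pair $(a, b)$ is not unique: the other root of the quadratic, as well as the sign of $a$, gives further solutions, matching the remark above.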

2.3. NORM-PRESERVATION THEOREMS

Bidirectional self-normalization may not be achievable precisely in general unless the neural network is linear. Therefore, we investigate the properties of neural networks in a probabilistic framework. Random matrix theory and high-dimensional probability theory allow us to characterize the behavior of a large class of neural networks by their mean behavior, which is significantly simpler to analyze. We therefore study neural networks with random weights, whose properties may shed light on the trainability of neural networks in practice. First, we need a probabilistic version of the vector norm constraint.

Definition 4 (Thin-Shell Concentration). A random vector $x \in \mathbb{R}^n$ is thin-shell concentrated if for any $\epsilon > 0$

$$P\left( \left| \frac{1}{n} \|x\|_2^2 - 1 \right| \ge \epsilon \right) \to 0 \qquad (11)$$

as $n \to \infty$.

The definition is modified from the one in (Bobkov, 2003). Examples of thin-shell concentrated distributions include the standard multivariate Gaussian and any distribution on the $n$-dimensional sphere of radius $\sqrt{n}$.

Assumptions. To prove the main results, i.e., the norm-preservation theorems, we require the following assumptions:

1. The random vector $x \in \mathbb{R}^n$ is thin-shell concentrated.
2. The random orthogonal matrix $W = (w_1, w_2, \dots, w_n)^T$ is uniformly distributed.
3. The function $\phi : \mathbb{R} \to \mathbb{R}$ is Gaussian-Poincaré normalized.
4. The function $\phi$ and its derivative are Lipschitz continuous.

The above assumptions are not restrictive. For Assumption 1, one can always normalize the input vectors of a neural network. For Assumption 2, the orthogonal constraint or its relaxation has already been employed in neural network training (Brock et al., 2017). Note that, in Assumption 2, uniformly distributed means that $W$ is distributed according to the Haar measure, the unique rotation-invariant probability measure on the orthogonal matrix group; we refer the reader to (Meckes, 2019) for details. Furthermore, all the activation functions or their smooth counterparts listed in Table 1 satisfy Assumptions 3 and 4.
With the above assumptions, we can prove the following norm-preservation theorems.

Theorem 2 (Forward Norm-Preservation). The random vector $(\phi(w_1^T x), \phi(w_2^T x), \dots, \phi(w_n^T x))$ is thin-shell concentrated.

This result shows that the transformation (an orthogonal matrix followed by the GPN activation function) preserves the norm of its input with high probability. Since the output is thin-shell concentrated, it serves as the input for the next layer, and so on. Hence, when $n$ is sufficiently large, the forward pass preserves the norm of the input in each layer along the forward path.

Theorem 3 (Backward Norm-Preservation). Let $D$ be the diagonal matrix with diagonal elements $D_{ii} = \phi'(w_i^T x)$ and let $y \in \mathbb{R}^n$ be a fixed vector with bounded $\|y\|_\infty$. Then for any $\epsilon > 0$

$$P\left( \frac{1}{n} \left| \|Dy\|_2^2 - \|y\|_2^2 \right| \ge \epsilon \right) \to 0 \qquad (13)$$

as $n \to \infty$.

This result shows that multiplication by $D$ preserves the norm of its input with high probability. Since the orthogonal matrix $W$ also preserves the norm of its input, when the gradient error signal is propagated backwards as in (2), the norm is preserved in each layer along the backward path when $n$ is sufficiently large. Hence, combining Theorems 2 and 3, we have proved that bidirectional self-normalization is achievable with high probability if the neural network is wide enough and the conditions in the Assumptions are satisfied. Then, by Proposition 1, the gradient exploding/vanishing problem disappears with high probability.

Sketch of the proofs. The proofs of Theorems 2 and 3 rest on a phenomenon in high-dimensional probability theory, the concentration of measure; we refer the reader to (Vershynin, 2018) for an introduction to the subject. Briefly, it can be shown that for some high-dimensional probability distributions, most of the mass is concentrated in a certain range.
For example, while most of the mass of a low-dimensional standard multivariate Gaussian distribution is concentrated around the center, most of the mass of a high-dimensional standard multivariate Gaussian distribution is concentrated in a thin shell. Furthermore, random variables transformed by Lipschitz functions are also concentrated around certain values. Using this phenomenon, it can be shown that the rows $\{w_i\}$ of a high-dimensional random orthogonal matrix $W$ are approximately independent random unit vectors, and that the inner product $w_i^T x$ for a thin-shell concentrated vector $x$ is approximately Gaussian. The proofs then follow from the assumptions that $\phi$ is GPN and that $\phi$ and $\phi'$ are Lipschitz continuous. Each of these steps is rigorously proved in the Appendix.
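The concentration claimed by Theorems 2 and 3 can be observed numerically. The sketch below is our own illustration under specific choices (a GPN-normalized Tanh obtained via Proposition 3, an input uniform on the sphere of radius √n, moderate n): both (1/n)Σᵢφ(wᵢᵀx)² and (1/n)Σᵢφ'(wᵢᵀx)² come out close to one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# GPN-normalize Tanh numerically (Proposition 3): phi = a*tanh + b
t, w = np.polynomial.hermite.hermgauss(120)
z = np.sqrt(2.0) * t
E = lambda v: float((w * v).sum() / np.sqrt(np.pi))
a = 1.0 / np.sqrt(E((1 - np.tanh(z) ** 2) ** 2))
b = np.sqrt(1.0 - a * a * E(np.tanh(z) ** 2))  # E[tanh] = 0 (odd function)
phi = lambda x: a * np.tanh(x) + b
phi_prime = lambda x: a * (1 - np.tanh(x) ** 2)

# thin-shell concentrated input: uniform on the sphere of radius sqrt(n)
x = rng.standard_normal(n)
x *= np.sqrt(n) / np.linalg.norm(x)

# Haar-uniform orthogonal matrix via QR with sign correction
Q, R = np.linalg.qr(rng.standard_normal((n, n)))
W = Q * np.sign(np.diag(R))

h = W @ x
fwd = float((phi(h) ** 2).mean())        # Theorem 2: close to 1
bwd = float((phi_prime(h) ** 2).mean())  # Theorem 3 with y = (1,...,1): close to 1
```

The deviations from one shrink as n grows, in line with the thin-shell definition.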

3. EXPERIMENTS

We verify our theory on both synthetic and real-world data. More experimental results can be found in Appendix. In short, while very deep neural networks with non-GPN activations show vanishing/exploding gradients, GPN versions show stable gradients and improved trainability in both synthetic and real data. Furthermore, compared to dynamical isometry theory, BSNNs do not exhibit pseudo-linearity and maintain nonlinear functionality.

3.1. SYNTHETIC DATA

We create synthetic data to test the norm-preservation properties of the neural networks. The input x^(1) consists of 500 data points, each a random standard Gaussian vector of dimension 500. The gradient error ∂E/∂x^(L+1) is also a random standard Gaussian vector of dimension 500. All the neural networks have depth 200. All the weight matrices are uniformly generated random orthogonal matrices. No training is performed. In Figure 1, we show the norms of the inputs and gradients for neural networks of width 500. From the results, we can see that with GPN, the gradient exploding/vanishing problem is eliminated to a large extent. The neural network with the Tanh activation function does not show the gradient exploding/vanishing problem either. However, x^(l) is close to zero for large l and each layer is close to a linear one, since Tanh(x) ≈ x when x ≈ 0 (pseudo-linearity), for which dynamical isometry is achieved. One might wonder if bidirectional self-normalization has the same effect as dynamical isometry in solving the gradient exploding/vanishing problem, that is, making the neural network close to an orthogonal matrix. To answer this question, we show the histogram of φ'(h^(l)_i) in Figure 2. This shows that BSNNs do not suffer from the gradient vanishing/explosion problem while exhibiting nonlinear functionality. In Figure 7 in the Appendix, we show the gradient norm of BSNNs with varying width. There we note that as the width increases, the norm of the gradient in each layer becomes more equalized, as predicted by our theory.
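A scaled-down variant of this experiment (our sketch, with width 400 rather than the paper's 500) reproduces the qualitative effect: with plain Tanh the forward norm decays toward zero over depth (pseudo-linearity), while with GPN-normalized Tanh it stays near its initial value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 400, 200  # width, depth (the paper uses width 500, depth 200)

def haar(n):
    """Haar-uniform orthogonal matrix via QR with sign correction."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

# GPN constants for Tanh (Proposition 3), via Gauss-Hermite quadrature
t, w = np.polynomial.hermite.hermgauss(120)
z = np.sqrt(2.0) * t
E = lambda v: float((w * v).sum() / np.sqrt(np.pi))
a = 1.0 / np.sqrt(E((1 - np.tanh(z) ** 2) ** 2))
b = np.sqrt(1.0 - a * a * E(np.tanh(z) ** 2))  # E[tanh] = 0

x_plain = rng.standard_normal(n)
x_gpn = x_plain.copy()
for _ in range(L):
    x_plain = np.tanh(haar(n) @ x_plain)
    x_gpn = a * np.tanh(haar(n) @ x_gpn) + b

ratio_plain = np.linalg.norm(x_plain) / np.sqrt(n)  # decays toward 0
ratio_gpn = np.linalg.norm(x_gpn) / np.sqrt(n)      # stays near 1
```

The normalized norm after 200 layers is far below one for plain Tanh but close to one for its GPN version.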

3.2. REAL-WORLD DATA

We run experiments on the real-world image datasets MNIST and CIFAR-10. The neural networks have width 500 and depth 200 (plus one unconstrained layer at the bottom and one at the top to fit the dimensionality of the input and output). We use stochastic gradient descent with momentum 0.5, mini-batch size 64 and learning rate 0.0001. The training is run for 50 epochs on MNIST and 100 epochs on CIFAR-10. We do not use data augmentation. Since it is computationally expensive to enforce the orthogonality constraint, we simply constrain each row of the weight matrix to have unit l2 norm as a relaxation of orthogonality, via the parametrization W = (v_1/‖v_1‖_2, v_2/‖v_2‖_2, ..., v_n/‖v_n‖_2)^T, and optimize V = (v_1, v_2, ..., v_n)^T as an unconstrained problem. We summarize the results in Table 2. We can see that, with the activation functions ReLU, LeakyReLU and GELU, the neural networks are not trainable, but once these functions are Gaussian-Poincaré normalized, the networks can be trained. GPN activation functions consistently outperform their unnormalized counterparts in terms of trainability, as the training accuracy is increased, but not necessarily in generalization ability. We show the test accuracy during training in Figure 3, from which we can see that training is accelerated when SELU is Gaussian-Poincaré normalized. ReLU, LeakyReLU and GELU, if not normalized, are completely untrainable due to vanished gradients (see Appendix). We observe that batch normalization leads to gradient explosion when combined with any of the activation functions. This confirms the claim of (Philipp et al., 2018) and (Yang et al., 2019) that batch normalization does not solve the gradient exploding/vanishing problem. On the other hand, without batch normalization, the neural network with any GPN activation function has a stable gradient magnitude throughout training (see Appendix). This indicates that BSNNs can dispense with batch normalization and therefore avoid its shortcomings.
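The relaxed orthogonality used here is a simple row normalization; a sketch in NumPy (the training itself would use a deep-learning framework, which we do not reproduce):

```python
import numpy as np

def row_normalize(V):
    """W = (v_1/||v_1||_2, ..., v_n/||v_n||_2)^T: each row of W has
    unit l2 norm, while V is optimized as an unconstrained parameter."""
    return V / np.linalg.norm(V, axis=1, keepdims=True)
```

In frameworks with automatic differentiation this parametrization is differentiable in V, so standard SGD applies; it is similar in spirit to weight normalization.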

4. RELATED WORK

We compare our theory to the most relevant theories in the literature. A key distinguishing feature of our theory is that we provide rigorous proofs of the conditions under which the gradient exploding/vanishing problem disappears. To the best of our knowledge, this is the first time that the problem has been provably solved for nonlinear neural networks.

4.1. SELF-NORMALIZING NEURAL NETWORKS

Self-normalizing neural networks enforce zero mean and unit variance for the output of each unit with the SELU activation function (Klambauer et al., 2017) . However, as pointed out in (Philipp et al., 2018) and confirmed in our experiments, only constraining forward signal propagation does not solve the gradient exploding/vanishing problem since the norm of the backward signal can grow or shrink. The signal propagation in both directions needs to be constrained as in our theory.

4.2. DEEP SIGNAL PROPAGATION

Our theory is developed from the deep signal propagation theory (Poole et al., 2016; Schoenholz et al., 2017). Both theories require E_{x∼N(0,1)}[φ'(x)^2] = 1. However, ours also requires the quantity E_{x∼N(0,1)}[φ(x)^2] to be one, while in (Poole et al., 2016; Schoenholz et al., 2017) it can be an arbitrary positive number. We emphasize that it is desirable to enforce E_{x∼N(0,1)}[φ(x)^2] = 1 to avoid trivial solutions. For example, if φ(x) = Tanh(εx) with ε ≈ 0, then φ(x) ≈ εx and the neural network becomes essentially a linear one, for which depth is unnecessary (pseudo-linearity (Philipp et al., 2018)). This is observed in Figure 1(a). Moreover, in (Poole et al., 2016; Schoenholz et al., 2017) the signal propagation analysis is based on random weights with i.i.d. Gaussian entries, whereas we prove how one can solve the gradient vanishing/exploding problem assuming the weight matrices are orthogonal and uniformly distributed under the Haar measure.

4.3. DYNAMICAL ISOMETRY

Dynamical isometry theory (Pennington et al., 2017) enforces the Jacobian matrix of the input-output function of a neural network to have all singular values close to one. Since the weight matrices are constrained to be orthogonal, this is equivalent to enforcing each D^(l) in (2) to be close to the identity matrix, which implies that the functionality of the neural network at initialization is close to an orthogonal matrix (pseudo-linearity). This indeed enables trainability, since linear neural networks with orthogonal weight matrices do not suffer from the gradient exploding/vanishing problem. However, as neural networks need to learn a nonlinear input-output functionality to solve certain tasks, the weights are unconstrained during training, so the neural network may move to a nonlinear region where the gradient exploding/vanishing problem returns. In our theory, although the orthogonality of the weight matrices is also required, we approach the problem from a different perspective: we do not encourage linearity at initialization. The neural network can be initialized to be nonlinear and stay nonlinear during training, even when the weights are constrained.

5. CONCLUSION

In this paper, we have introduced bidirectionally self-normalizing neural networks (BSNNs), which constrain both forward and backward signal propagation using a new class of Gaussian-Poincaré normalized activation functions and orthogonal weight matrices. BSNNs are not restrictive in the sense that many commonly used activation functions can be Gaussian-Poincaré normalized. We have rigorously proved that the gradient vanishing/exploding problem disappears in BSNNs with high probability under mild conditions. Experiments on synthetic and real-world data confirm the validity of our theory and demonstrate that BSNNs have excellent trainability without batch normalization. Currently, the theoretical analysis is limited to fully-connected neural networks of equal width. Future work includes extending our theory to more sophisticated networks such as convolutional architectures, as well as investigating the generalization capabilities of BSNNs.

APPENDIX A PROOFS

Proposition 1. If a neural network is bidirectionally self-normalizing, then

$$\left\|\frac{\partial E}{\partial W^{(1)}}\right\|_F = \left\|\frac{\partial E}{\partial W^{(2)}}\right\|_F = \dots = \left\|\frac{\partial E}{\partial W^{(L)}}\right\|_F.$$

Proof. For each $l$, we have

$$\left\|\frac{\partial E}{\partial W^{(l)}}\right\|_F^2 = \mathrm{trace}\!\left(\frac{\partial E}{\partial W^{(l)}} \left(\frac{\partial E}{\partial W^{(l)}}\right)^{T}\right) = \mathrm{trace}\!\left(d^{(l)} (x^{(l)})^T x^{(l)} (d^{(l)})^T\right) = (x^{(l)})^T x^{(l)} \, (d^{(l)})^T d^{(l)} = \|x^{(l)}\|_2^2 \, \|d^{(l)}\|_2^2.$$

By the definition of bidirectional self-normalization, we have $\|\partial E/\partial W^{(1)}\|_F = \dots = \|\partial E/\partial W^{(L)}\|_F$. ∎

Proposition 2. A function $\phi : \mathbb{R} \to \mathbb{R}$ is Gaussian-Poincaré normalized with $\mathbb{E}_{x\sim\mathcal{N}(0,1)}[\phi(x)] = 0$ if and only if $\phi(x) = x$ or $\phi(x) = -x$.

Proof. Since $\mathbb{E}_{x\sim\mathcal{N}(0,1)}[\phi(x)^2] < \infty$ and $\mathbb{E}_{x\sim\mathcal{N}(0,1)}[\phi'(x)^2] < \infty$, $\phi$ and $\phi'$ can be expanded in terms of Hermite polynomials. Let the (normalized) Hermite polynomial of degree $k$ be

$$H_k(x) = \frac{(-1)^k}{\sqrt{k!}} \exp\!\left(\frac{x^2}{2}\right) \frac{d^k}{dx^k} \exp\!\left(-\frac{x^2}{2}\right),$$

and, due to $H_k'(x) = \sqrt{k}\, H_{k-1}(x)$, we have

$$\phi(x) = \sum_{k=0}^{\infty} a_k H_k(x), \qquad \phi'(x) = \sum_{k=1}^{\infty} \sqrt{k}\, a_k H_{k-1}(x).$$

Since $\mathbb{E}_{x\sim\mathcal{N}(0,1)}[\phi(x)] = 0$, we have $a_0 = \mathbb{E}[H_0(x)\phi(x)] = \mathbb{E}[\phi(x)] = 0$. Since $\mathbb{E}[\phi(x)^2] = \mathbb{E}[\phi'(x)^2] = 1$ and the Hermite polynomials are orthonormal, we have

$$\mathbb{E}[\phi(x)^2] = \sum_{k=1}^{\infty} a_k^2 = \mathbb{E}[\phi'(x)^2] = \sum_{k=1}^{\infty} k\, a_k^2 = 1.$$

Therefore $\sum_{k=1}^{\infty} k a_k^2 - \sum_{k=1}^{\infty} a_k^2 = 0$, that is, $\sum_{k=2}^{\infty} (k-1) a_k^2 = 0$. Since each term in this sum is nonnegative, the only solution is $a_k = 0$ for $k \ge 2$. And since $\mathbb{E}[\phi(x)^2] = a_1^2 = 1$, we have $a_1 = \pm 1$. Hence $\phi(x) = \pm H_1(x) = \pm x$. ∎

Proposition 3. For any differentiable function $\phi : \mathbb{R} \to \mathbb{R}$ with non-zero and bounded $\mathbb{E}_{x\sim\mathcal{N}(0,1)}[\phi(x)^2]$ and $\mathbb{E}_{x\sim\mathcal{N}(0,1)}[\phi'(x)^2]$, there exist two constants $a$ and $b$ such that $a\phi(x) + b$ is Gaussian-Poincaré normalized.

Proof. Let $\varphi(x) = \phi(x) + c$ and define

$$\psi(c) = \mathbb{E}[\varphi(x)^2] - \mathbb{E}[\phi'(x)^2] = \mathrm{Var}[\varphi(x)] + (\mathbb{E}[\varphi(x)])^2 - \mathbb{E}[\phi'(x)^2] = \mathrm{Var}[\phi(x)] + (\mathbb{E}[\phi(x)] + c)^2 - \mathbb{E}[\phi'(x)^2].$$
Therefore, $\psi(c)$ is a quadratic function of $c$. We also have $\psi(c) > 0$ as $c \to \infty$ and $\psi(-\mathbb{E}_{x\sim\mathcal{N}(0,1)}[\phi(x)]) \le 0$ by the Gaussian-Poincaré inequality. Hence, there exists $c$ for which $\psi(c) = 0$, i.e., $\mathbb{E}[(\phi(x)+c)^2] = \mathbb{E}[\phi'(x)^2]$. Letting $a = (\mathbb{E}[\phi'(x)^2])^{-1/2}$ and $b = ac$, we have $\mathbb{E}[(a\phi(x) + b)^2] = \mathbb{E}[(a\phi'(x))^2] = 1$. The proof is largely due to (Eldredge, 2020) with minor modifications. ∎

Assumptions. 1. The random vector $x \in \mathbb{R}^n$ is thin-shell concentrated. 2. The random orthogonal matrix $W = (w_1, w_2, \dots, w_n)^T$ is uniformly distributed. 3. The function $\phi : \mathbb{R} \to \mathbb{R}$ is Gaussian-Poincaré normalized. 4. The function $\phi$ and its derivative are Lipschitz continuous.

Theorem 2 (Forward Norm-Preservation). The random vector $(\phi(w_1^T x), \phi(w_2^T x), \dots, \phi(w_n^T x))$ is thin-shell concentrated.

Theorem 3 (Backward Norm-Preservation). Let $D$ be the diagonal matrix with diagonal elements $D_{ii} = \phi'(w_i^T x)$ and let $y \in \mathbb{R}^n$ be a fixed vector with bounded $\|y\|_\infty$. Then for any $\epsilon > 0$, $P\big( \frac{1}{n} \big| \|Dy\|_2^2 - \|y\|_2^2 \big| \ge \epsilon \big) \to 0$ as $n \to \infty$.

Notations. $S^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$. $O(n)$ is the orthogonal matrix group of size $n$. $1_{\{\cdot\}}$ denotes the indicator function. $0_n$ denotes the $n$-dimensional vector with all elements equal to zero. $I_n$ denotes the identity matrix of size $n \times n$.

Lemma 1. If the random variable $x \sim \mathcal{N}(0,1)$ and the function $f : \mathbb{R} \to \mathbb{R}$ is Lipschitz continuous, then the random variable $f(x)$ is sub-gaussian.

Proof. Due to the Gaussian concentration theorem (Theorem 5.2.2 in (Vershynin, 2018)), we have $\|f(x) - \mathbb{E}[f(x)]\|_{\psi_2} \le CK$, where $\|\cdot\|_{\psi_2}$ denotes the sub-gaussian norm, $C$ is a constant and $K$ is the Lipschitz constant of $f$. This implies that $f(x) - \mathbb{E}[f(x)]$ is sub-gaussian (Proposition 2.5.2 in (Vershynin, 2018)), and therefore $f(x)$ is sub-gaussian (Lemma 2.6.8 in (Vershynin, 2018)). ∎

Lemma 2. Let $x = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$ be a random vector whose coordinates $x_i$ are independent and sub-gaussian with $\mathbb{E}[x_i^2] = 1$.
Let $y = (y_1, y_2, \dots, y_n) \in \mathbb{R}^n$ be a fixed vector with bounded $\|y\|_\infty$. Then for any $\epsilon > 0$

$$P\left( \frac{1}{n} \left| \sum_i x_i^2 y_i^2 - \sum_i y_i^2 \right| \ge \epsilon \right) \to 0$$

as $n \to \infty$.

Proof. Since $y_i x_i$ is sub-gaussian, $y_i^2 x_i^2$ is sub-exponential (Lemma 2.7.6 in (Vershynin, 2018)). Since $\mathbb{E}[y_i^2 x_i^2] = y_i^2 \mathbb{E}[x_i^2] = y_i^2$, the variable $y_i^2 x_i^2 - y_i^2$ is sub-exponential with zero mean (Exercise 2.7.10 in (Vershynin, 2018)). Applying Bernstein's inequality (Corollary 2.8.3 in (Vershynin, 2018)) proves the lemma. ∎

Lemma 3. Let $z \sim \mathcal{N}(0_n, I_n)$. Then for any $0 < \delta < 1$

$$P\{z \in \mathbb{R}^n : (1-\delta)\sqrt{n} \le \|z\|_2 \le (1+\delta)\sqrt{n}\} \ge 1 - 2\exp(-n\delta^2).$$

See (Alberts & Khoshnevisan, 2018) (Theorem 1.2) for a proof.

Lemma 4. Let $z \sim \mathcal{N}(0_n, I_n)$. Then $z / \|z\|_2$ is uniformly distributed on $S^{n-1}$. See (Dawkins, 2016) for a proof.

Lemma 5. Let $z = (z_1, z_2, \dots, z_n) \sim \mathcal{N}(0_n, I_n)$, let $a = (a_1, a_2, \dots, a_n)$ be a fixed vector with bounded $\|a\|_\infty$, and let $f : \mathbb{R} \to \mathbb{R}$ be a continuous function. Then for any $\epsilon > 0$

$$P\left( \frac{1}{n} \left| \sum_i a_i f\big(\sqrt{n}\, z_i / \|z\|_2\big) - \sum_i a_i f(z_i) \right| > \epsilon \right) \to 0$$

as $n \to \infty$.

Proof. Since

$$\frac{1}{n} \left| \sum_i a_i f\big(\sqrt{n}\, z_i / \|z\|_2\big) - \sum_i a_i f(z_i) \right| \le \frac{1}{n} \sum_i |a_i| \cdot \big| f\big(\sqrt{n}\, z_i / \|z\|_2\big) - f(z_i) \big|,$$

it suffices to show that, as $n \to \infty$,

$$P\left( \frac{1}{n} \sum_i |a_i| \cdot \big| f\big(\sqrt{n}\, z_i / \|z\|_2\big) - f(z_i) \big| > \epsilon \right) \to 0.$$

For $0 < \delta < 1$, let

$$A = \left\{ z \in \mathbb{R}^n : \frac{1}{n} \sum_i |a_i| \cdot \big| f\big(\sqrt{n}\, z_i / \|z\|_2\big) - f(z_i) \big| > \epsilon \right\}, \qquad U_\delta = \left\{ z \in \mathbb{R}^n : (1-\delta)\sqrt{n} \le \|z\|_2 \le (1+\delta)\sqrt{n} \right\}.$$

Then

$$P\left( \frac{1}{n} \sum_i |a_i| \cdot \big| f\big(\sqrt{n}\, z_i/\|z\|_2\big) - f(z_i) \big| > \epsilon \right) = \int_{\mathbb{R}^n} 1_{\{z \in A\}}\, dz = \int_{\mathbb{R}^n \setminus U_\delta} 1_{\{z \in A\}}\, dz + \int_{U_\delta} 1_{\{z \in A\}}\, dz.$$

Let $\delta = n^{-1/4}$. From Lemma 3 we have, as $n \to \infty$,

$$\int_{\mathbb{R}^n \setminus U_\delta} 1_{\{z \in A\}}\, dz \le \int_{\mathbb{R}^n \setminus U_\delta} dz = 1 - P\{z \in U_\delta\} \le 2\exp(-n\delta^2) \to 0.$$

For $z \in U_\delta$ with $\delta = n^{-1/4}$, we have $\|z\|_2 \to \sqrt{n}$, hence $\sqrt{n}\, z_i / \|z\|_2 \to z_i$ and therefore $f(\sqrt{n}\, z_i/\|z\|_2) \to f(z_i)$ as $n \to \infty$. Hence $\int_{U_\delta} 1_{\{z \in A\}}\, dz \to 0$ as $n \to \infty$. ∎

Lemma 6. Let the random matrix $W$ be uniformly distributed on $O(n)$, let the random vector $\theta$ be uniformly distributed on $S^{n-1}$, and let the random vector $x \in \mathbb{R}^n$ be thin-shell concentrated. Then $Wx \to \sqrt{n}\,\theta$ as $n \to \infty$.

Proof.
Let $y \in \mathbb{R}^n$ be any vector with $\|y\|_2 = \sqrt{n}$ and let $e = (\sqrt{n}, 0, \dots, 0) \in \mathbb{R}^n$. Since $W$ is uniformly distributed, $Wy$ has the same distribution as $We$. $We$ is the first column of $\sqrt{n}\,W$, which is equivalent to the random vector $\sqrt{n}\,\theta$. Since $x$ is thin-shell concentrated, $x \to \frac{\sqrt{n}}{\|x\|_2}\, x = y$, and therefore $Wx \to \sqrt{n}\,\theta$ as $n \to \infty$. ∎

Proof of Theorem 2. Let $z = (z_1, z_2, \dots, z_n) \sim \mathcal{N}(0_n, I_n)$. Due to Lemma 1, the random variable $\phi(z_i)$ is sub-gaussian. Since $\phi$ is Gaussian-Poincaré normalized, $\mathbb{E}_{z_i \sim \mathcal{N}(0,1)}[\phi(z_i)^2] = 1$. Applying Lemma 2 with each $y_i = 1$, we have for $\epsilon > 0$

$$P\left( \left| \frac{1}{n} \sum_i \phi(z_i)^2 - 1 \right| \ge \epsilon \right) \to 0$$

as $n \to \infty$. Due to Lemmas 4 and 5 (with each $a_i = 1$), for a random vector $\theta = (\theta_1, \theta_2, \dots, \theta_n)$ uniformly distributed on $S^{n-1}$, we have

$$P\left( \left| \frac{1}{n} \sum_i \phi(\sqrt{n}\,\theta_i)^2 - \frac{1}{n} \sum_i \phi(z_i)^2 \right| \ge \epsilon \right) \to 0$$

and therefore $P\big( \big| \frac{1}{n} \sum_i \phi(\sqrt{n}\,\theta_i)^2 - 1 \big| \ge \epsilon \big) \to 0$ as $n \to \infty$. Then from Lemma 6 we have $Wx \to \sqrt{n}\,\theta$ and therefore

$$P\left( \left| \frac{1}{n} \sum_i \phi(w_i^T x)^2 - 1 \right| \ge \epsilon \right) \to 0$$

as $n \to \infty$. ∎

Proof of Theorem 3. Let $z = (z_1, z_2, \dots, z_n)$ be a standard multivariate Gaussian random vector. Due to Lemma 1, the random variable $\phi'(z_i)$ is sub-gaussian. Since $\phi$ is Gaussian-Poincaré normalized, $\mathbb{E}_{z_i \sim \mathcal{N}(0,1)}[\phi'(z_i)^2] = 1$. Applying Lemma 2, we have

$$P\left( \frac{1}{n} \left| \sum_i y_i^2 \phi'(z_i)^2 - \sum_i y_i^2 \right| \ge \epsilon \right) \to 0$$

as $n \to \infty$. Due to Lemmas 4 and 5 (with each $a_i = y_i^2$), for a random vector $\theta = (\theta_1, \theta_2, \dots, \theta_n)$ uniformly distributed on $S^{n-1}$, we have

$$P\left( \frac{1}{n} \left| \sum_i y_i^2 \phi'(\sqrt{n}\,\theta_i)^2 - \sum_i y_i^2 \phi'(z_i)^2 \right| \ge \epsilon \right) \to 0$$

and therefore $P\big( \frac{1}{n} \big| \sum_i y_i^2 \phi'(\sqrt{n}\,\theta_i)^2 - \sum_i y_i^2 \big| \ge \epsilon \big) \to 0$ as $n \to \infty$. Then from Lemma 6 we have $Wx \to \sqrt{n}\,\theta$ and therefore

$$P\left( \frac{1}{n} \left| \sum_i y_i^2 \phi'(w_i^T x)^2 - \sum_i y_i^2 \right| \ge \epsilon \right) \to 0$$

as $n \to \infty$. ∎

APPENDIX B ADDITIONAL EXPERIMENTS

Due to space limitations, we only showed the experiments with the Tanh and SELU activation functions in the main text. In this section, we show the experiments with ReLU, LeakyReLU, ELU and SELU. Additionally, we also measure gradient exploding/vanishing during training on the real-world data.

B.1 SYNTHETIC DATA

We show the figures of the experimental results in addition to the ones in the main text. 



We use α = 0.01 for LeakyReLU, α = 1 for ELU and φ(x) = x/(1 + exp(-1.702x)) for GELU.



Figure 2 illustrates this point: if the functionality of a neural network were close to an orthogonal matrix, then, since the weight matrices are orthogonal, the values of φ'(h^(l)_i) would concentrate around one (Figure 2(a)); this is not the case for BSNNs (Figure 2(b)).


Figure 1: Results on synthetic data with different activation functions. "-GPN" denotes that the function is Gaussian-Poincaré normalized. ‖x^(l)‖_2 denotes the l2 norm of the outputs of the l-th layer. n denotes the width. ‖∂E/∂W^(l)‖_F is the Frobenius norm of the gradient of the weight matrix in the l-th layer.

Figure 2: Histogram of φ'(h^(l)_i). The values of φ'(h^(l)_i) are accumulated over all units, all layers and all samples.

Figure 3: Test accuracy (percentage) during training on CIFAR-10. "-BN" denotes that batch normalization is applied before the activation function.

Figure 4: Results on synthetic data with different activation functions. "-GPN" denotes that the function is Gaussian-Poincaré normalized. ‖x^(l)‖_2 denotes the l2 norm of the outputs of the l-th layer. n denotes the width. ‖∂E/∂W^(l)‖_F is the Frobenius norm of the gradient of the weight matrix in the l-th layer.

Figure 5: Results on synthetic data with different activation functions. "-GPN" denotes that the function is Gaussian-Poincaré normalized. ‖x^(l)‖_2 denotes the l2 norm of the outputs of the l-th layer. n denotes the width. ‖∂E/∂W^(l)‖_F is the Frobenius norm of the gradient of the weight matrix in the l-th layer.

Figure 8: Test accuracy (percentage) during training on MNIST.

Figure 9: Test accuracy (percentage) during training on CIFAR-10.


Figure 10: Gradient norm ratio during training on MNIST. The horizontal axis denotes mini-batch updates. The vertical axis denotes the gradient norm ratio max_l ‖∂E/∂V^(l)‖_F / min_l ‖∂E/∂V^(l)‖_F. The gradient vanishes (‖∂E/∂V^(l)‖_F ≈ 0) for ReLU and LeakyReLU during training and hence the plots are empty.


Figure 12: Gradient norm ratio during training on CIFAR-10. The horizontal axis denotes mini-batch updates. The vertical axis denotes the gradient norm ratio max_l ‖∂E/∂V^(l)‖_F / min_l ‖∂E/∂V^(l)‖_F. The gradient vanishes (‖∂E/∂V^(l)‖_F ≈ 0) for ReLU and LeakyReLU during training and hence the plots are empty.

Table 1: Constants for Gaussian-Poincaré normalization of activation functions.

Table 2: Accuracy (percentage) of neural networks of depth 200 with different activation functions on real-world data. The numbers in parentheses denote the results when batch normalization is applied before the activation function.


In Figures 10, 11, 12 and 13, we show a measure of gradient exploding/vanishing during training for the different activation functions. The measure is defined as the ratio of the maximum gradient norm to the minimum gradient norm across layers. Since we use the row-normalized parametrization, the gradient norm ratio is defined on the unconstrained weights V, that is, max_l ‖∂E/∂V^(l)‖_F / min_l ‖∂E/∂V^(l)‖_F. Note that for ReLU, LeakyReLU and GELU, the gradient vanishes during training in some experiments and therefore the plots are empty. From the figures, we can see that batch normalization leads to gradient explosion, especially at the early stage of training. On the other hand, without batch normalization, the gradient is stable throughout training for GPN activation functions.

