REDEFINING THE SELF-NORMALIZATION PROPERTY

Abstract

The approaches that prevent gradient explosion and vanishing have boosted the performance of deep neural networks in recent years. A unique one among them is the self-normalizing neural network (SNN), which is generally more stable than initialization techniques without explicit normalization. In previous studies, the self-normalization property of SNN comes from the Scaled Exponential Linear Unit (SELU) activation function. However, it has been shown that in deeper neural networks, SELU either leads to gradient explosion or loses its self-normalization property. Besides, its accuracy on large-scale benchmarks like ImageNet is less satisfying. In this paper, we analyze the forward and backward passes of SNN with mean-field theory and block dynamical isometry. A new definition of the self-normalization property is proposed that is easier to use both analytically and numerically. We also propose a proposition that enables us to compare the strength of the self-normalization property between different activation functions. We further develop two new activation functions, leaky SELU (lSELU) and scaled SELU (sSELU), that have a stronger self-normalization property. Their optimal parameters can be solved with a constrained optimization program. Besides, analysis of the activation's mean in the forward pass reveals that the self-normalization property on the mean gets weaker with larger fan-in, which explains the performance degradation on ImageNet. This can be solved with weight centralization, mixup data augmentation, and a centralized activation function. On the moderate-scale datasets CIFAR-10, CIFAR-100, and Tiny ImageNet, the direct application of lSELU and sSELU achieves up to 2.13% higher accuracy. On Conv MobileNet V1 - ImageNet, sSELU with Mixup, trainable λ, and a centralized activation function reaches 71.95% accuracy, which is even better than Batch Normalization.

1. INTRODUCTION

In recent years, deep neural networks (DNNs) have achieved state-of-the-art performance on different tasks like image classification (He et al., 2015; Zheng et al., 2019). This rapid development can be largely attributed to the initialization and normalization techniques that prevent gradient explosion and vanishing. Initialization techniques (He et al., 2015; Xiao et al., 2018) initialize the parameters in networks to have good statistical properties at the beginning, and assume that these properties can be more or less maintained throughout the training process. However, this assumption is likely to be violated when the network gets deeper or is trained with a higher learning rate. Hence, normalization techniques are proposed to explicitly normalize the network parameters (Salimans & Kingma, 2016; Arpit et al., 2016) or the activations (Ioffe & Szegedy, 2015b; Ulyanov et al., 2016) during training. In particular, Batch Normalization (BN) (Ioffe & Szegedy, 2015a) has become a standard component in DNNs, as it not only effectively improves the convergence rate and training stability, but also regularizes the model to improve its generalization ability. However, BN still has several drawbacks. First, when calculating the mean and variance, the accumulation must be done under FP32 to avoid underflow (Micikevicius et al., 2018). This brings challenges when training neural networks in low bit width. Second, the performance degradation under micro batch sizes also makes it more difficult to design training accelerators, as a large batch size increases the memory required to store the intermediate results for the backward pass (Deng et al., 2020). Besides, Chen et al. (2020a); Wu et al. (2018) show that BN introduces considerable overhead. The self-normalizing neural network (SNN) provides a promising way to address this challenge. SNN initializes the neural network to have a good statistical property at the beginning, just like initialization techniques.
However, the statistics' deviation in the forward and backward passes can be gradually fixed during propagation, so SNN is more robust to deviations from the initial properties (Chen et al., 2020b). For instance, the mean and variance of output activations with SELU in Klambauer et al. (2017) automatically converge to the fixed point (0, 1). Chen et al. (2020b) analyze the Frobenius norm of the backward gradient in SNN activated with SELU. They reveal a trade-off between the self-normalization property and the speed of gradient explosion in the backward pass, and the hyper-parameters need to be configured according to the depth of the network. The resulting activation function, depth-aware SELU (dSELU), has achieved even higher accuracy than the original configuration on moderate-scale datasets like CIFAR-10, and makes the SNN trainable on ImageNet. However, in deeper neural networks, dSELU gradually degenerates to ReLU and loses its self-normalization property. Moreover, even with dSELU, the test accuracy on ImageNet with Conv MobileNet V1 (Howard et al., 2017) is still 1.79% lower than BN (Chen et al., 2020b). Therefore, we aim to answer the following three questions in this paper: 1) Is SELU the only activation function that has the self-normalization property? 2) If it is not, is there a better choice? And how do we compare the strength of the self-normalization property between different activation functions? 3) Why is the performance of SNN on ImageNet less satisfying? Is there any way to alleviate that? In this paper, we analyze the signal propagation in both forward and backward passes in serial deep neural networks with mean-field theory (Poole et al., 2016) and block dynamical isometry (Chen et al., 2020b).
Our main theoretical results are summarized as follows:

• We illustrate that an activation function demonstrates the self-normalization property if the second moment of its Jacobian matrix's singular values φ(q) is inversely proportional to the second moment of its input pre-activations q, and the property gets stronger when φ(q) gets closer to 1/q. A new definition of the self-normalization property is proposed that can be easily used both analytically and numerically.

• We propose leaky SELU (lSELU) and scaled SELU (sSELU). Both have an additional parameter, β, that can be configured to achieve a stronger self-normalization property. The hyper-parameters can be solved with a constrained optimization program, so no additional hyper-parameter relative to dSELU is introduced.

• We reveal that models with larger fan-in have weaker normalization effectiveness on the mean of the forward-pass signal. This can be solved with explicit weight centralization, mixup data augmentation (Zhang et al., 2018), and a centralized activation function. On CIFAR-10, CIFAR-100, and Tiny ImageNet, lSELU and sSELU achieve up to 2.13% higher test accuracy than previous studies. On ImageNet - Conv MobileNet V1, sSELU with Mixup, trainable λ, and a centralized activation function achieves test accuracy (71.95%) comparable with BN. Besides, we provide a CUDA kernel design for lSELU and sSELU that has only 2% overhead relative to SELU.

2. RELATED WORK

In this section, we present an overview of existing studies on self-normalizing neural networks (SNN) as well as statistical studies on forward and backward signals in deep neural networks.

Self-normalizing Neural Network. The Scaled Exponential Linear Unit (SELU) (Klambauer et al., 2017) scales the Exponential Linear Unit (ELU) by a constant scalar λ. The λ and the original parameter α in ELU are configured such that the mean and variance of the output activation have a fixed point (0, 1). The authors further prove that this fixed point is still stable and attractive even when the input activations and the weights are unnormalized. Chen et al. (2020b) investigate the fixed point of the backward gradient. They reveal that the gradient of SNN explodes at a rate of (1 + ε) per layer, where ε is a small positive value. The self-normalizing property gets stronger when ε is larger, whereas the gradient will explode at a higher rate. Therefore, they propose the depth-aware SELU, in which ε ≈ 1/L is used to derive the optimal α and λ in SELU for a network with depth L. Burkholz & Dubatovka (2018) investigate the forward activations in the limit of large layer width with mean-field theory. They identify an Order-to-Chaos phase transition characterized by the second moment of the singular values of the network's input-output Jacobian matrix. The neural network has good performance when it is on the border of the order and chaos phases. On the other hand, Chen et al. (2020b) develop a very handy framework for analyzing the Frobenius norm of the gradient. They illustrate that gradient norm equality is a universal philosophy behind various initialization and normalization techniques, and even some neural network structures. Gradient norm equality means that the Frobenius norm of the gradient is more or less equal in different layers, so that the information flow in the backward pass can be well preserved.
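As a quick numerical illustration of this fixed-point behavior (our own toy sketch, not code from the cited works), the following propagates a deliberately unnormalized input through a deep random SELU network and watches the activation statistics drift toward (0, 1):

```python
import numpy as np

# Toy illustration (our own sketch): push an unnormalized input through a deep
# fully-connected SELU network with N(0, 1/fan_in) weights and watch the
# post-activation mean and second moment drift toward the fixed point (0, 1).
LAM, ALPHA = 1.0507009873554805, 1.6732632423543772  # standard SELU constants

def selu(x):
    return LAM * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
n, depth = 1024, 30
x = rng.normal(loc=0.5, scale=2.0, size=n)   # deliberately unnormalized input
for _ in range(depth):
    w = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    x = selu(w @ x)

print(float(np.mean(x)), float(np.mean(x ** 2)))  # both drift toward (0, 1)
```

No explicit normalization is applied anywhere in the loop; the activation function alone pulls the statistics back.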
(See also Arpit & Bengio, 2020.)

3. SELF-NORMALIZATION PROPERTY

In this section, we formally define the self-normalization property under the following problem formulation, notations, and assumptions.

Problem Formulation. Consider a DNN with L layers. Each layer performs a linear transform followed by a non-linear element-wise activation function f, i.e.,

$$x_l = f(h_l), \quad h_l = W_l x_{l-1} + b_l, \quad l = 1, ..., L, \quad (1)$$

where x_l ∈ R^{N_l} is the output feature vector of layer l, h_l is the pre-activation vector, W_l is the weight of the fully-connected layer or the expanded doubly block circulant matrix (Sedghi et al., 2019) of a 2D convolution, b_l is the vector of biases, and we denote the loss as L. Besides, without loss of generality, for f and x ∼ N(0, q), we have

$$(1 + \delta_q)\,\mathbb{E}\left[f^2(x)\right] = \mathbb{E}\left[\left(\frac{df(x)}{dx}\right)^2\right]\mathbb{E}\left[x^2\right], \quad (2)$$

where δ_q is a function of q. Following previous studies (Poole et al., 2016; Chen et al., 2020b), for ∀l, we make the following assumptions:

Assumption 1 The means of the entries in W_l and b_l are zero.

Assumption 2 By the central limit theorem, the entries in h_l follow i.i.d. N(0, q_l), q_l = (1/N_l) h_l^T h_l.

Assumption 3 The eigenvalues of W_l^T W_l are independent of the entries in h_{l-1}.

Klambauer et al. (2017) first define the self-normalization property of a neural network as follows.

Definition 1 (Self-normalizing Neural Network) A neural network is self-normalizing if it possesses a mapping g : Ω → Ω for each activation y that maps the mean and variance from one layer to the next and has a stable and attracting fixed point depending on (ω, τ) in Ω. Furthermore, the mean and the variance remain in the domain Ω, that is, g(Ω) ⊆ Ω, where Ω = {(µ, ν) | µ ∈ [µ_min, µ_max], ν ∈ [ν_min, ν_max]}. When iteratively applying g, each point within Ω converges to this fixed point.

This definition imitates explicit normalization techniques like BN, which ensure that the feed-forward signal is normalized. Based on Definition 1, Klambauer et al.
(2017) propose the SELU:

$$f(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \le 0 \end{cases}. \quad (3)$$

Besides, Klambauer et al. (2017) initialize the entries in W_l with N(0, 1/N_{l-1}), so that the output pre-activation will have the same second moment as the input activation. With the stable fixed points of the mean and variance around (0, 1), the optimal choice for λ and α can be derived from

$$\int_{-\infty}^{\infty} f(z) \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz = 0, \quad \int_{-\infty}^{\infty} f^2(z) \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz = 1. \quad (4)$$

Furthermore, the authors prove that the fixed points of the mean and variance are still attractive even when the statistical properties of the parameters in the neural network deviate from the initial setup. However, the statistical fixed point in the forward pass doesn't necessarily lead to good dynamics of the gradient. Chen et al. (2020b) analyze the Frobenius norm of the gradient in neural networks activated by SELU. With the same activation function shown in equation 3, their analysis shows that the optimal λ and α can be configured by preserving the Frobenius norm of the backward gradient and the second moment of the forward activations with the following equations:

$$\int_{-\infty}^{\infty} \left(\frac{df(z)}{dz}\right)^2 \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz = 1 + \epsilon, \quad \int_{-\infty}^{\infty} f^2(z) \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz = 1, \quad (5)$$

where ε is a small positive constant, without which the only solution of equation 5 would be λ = √2 and α = 0, so that the activation function degenerates back to ReLU with the initialization technique proposed in He et al. (2015) and loses the self-normalization property. Conversely, a relatively large ε brings a stronger self-normalization property, but meanwhile makes the Frobenius norm of the gradient explode at a rate of (1 + ε) per layer. Notably, the original configuration of SELU can be obtained by setting ε = 0.0716. Therefore, Chen et al. (2020b) assert that ε ≈ 1/L could bring a good trade-off between gradient norm stability and the self-normalization property. Experiments on CIFAR-10 and ImageNet show that the new configuration results in higher accuracy. Inspired by Chen et al.
(2020b), we formally redefine the self-normalization property as follows:

Definition 2 (Self-normalization Property) Given an activation function f, we define the operator φ as

$$\phi(q) = \int_{-\infty}^{\infty} \left(\frac{df(\sqrt{q}\,z)}{d(\sqrt{q}\,z)}\right)^2 \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz. \quad (6)$$

If f satisfies

$$\phi(1) = 1 + \epsilon, \quad \int_{-\infty}^{\infty} f^2(z) \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz = 1, \quad \min\left(1, \frac{1}{q}\right) < \phi(q) < \max\left(1, \frac{1}{q}\right), \quad (7)$$

then we say f has the self-normalization property.

While the first two equations in equation 7 are identical to equation 5, which constructs fixed points for both the second moment of the activations and the Frobenius norm of the gradient, the third one makes these fixed points attractive, as shown by the following proposition.

Proposition 3.1 (Strength of Self-normalization Property) Under all three Assumptions and Definition 2, we represent φ(q) as a linear interpolation between 1 and 1/q as follows:

$$\phi(q) = \begin{cases} 1 + (1 - \gamma_{q<1})(1/q - 1) & q < 1 \\ 1/q + \gamma_{q>1}(1 - 1/q) & q > 1 \end{cases}, \quad (8)$$

where γ_q ∈ (0, 1) is a function of q. Then the following conclusions hold (Proof: Appendix A.2):

• The self-normalization property gets stronger when γ_{q<1} and γ_{q>1} get closer to 0. In particular, we have $|\gamma_{q<1}| \approx |\gamma_{q>1}| \approx \left|\frac{d\phi(q)}{dq}\big|_{q=1} + 1\right|$ when q is around 1.

• For layer l, the gradient explodes at the rate (1 + δ_{q_l}), i.e., $\prod_{i=1}^{l}(1 + \delta_{q_{i-1}})\,\mathbb{E}\left[\|\frac{\partial \mathcal{L}}{\partial h_l}\|_2^2\right] = q_0\,\mathbb{E}\left[\|\frac{\partial \mathcal{L}}{\partial h_0}\|_2^2\right]$.

Proposition 3.1 is derived based on Assumption 1, whereas the mean of the weight matrices may shift during training. Fortunately, Proposition 3.2 shows that the deviation of the mean of the forward activations can also be normalized simply by multiplication with the weight matrix.

Proposition 3.2 (Normalization of Mean) Assume that the entries w_ij in the weight matrix are independent of the input activations, and that their expectations have an upper bound µ, i.e., ∀i, j, |E[w_ij]| ≤ µ. Then multiplication with the weight matrix normalizes the mean if µ < 1/N_{l-1} holds, where N_{l-1} is the fan-in of the current layer l.
Moreover, the mean is scaled down by a ratio smaller than µN_{l-1}. (Proof: Appendix A.3)

4. NOVEL SELF-NORMALIZING ACTIVATION FUNCTIONS

Proposition 3.1 reveals that an f with φ(q) closer to 1/q may have a stronger self-normalization property. Therefore, on the basis of SELU, we propose to add an additional hyper-parameter β that can be configured to bring φ(q) closer to 1/q and to encode other interesting properties. As demonstrations, we find the following two activation functions quite promising.

Scaled Scaled Exponential Linear Unit (sSELU). The sSELU is defined as

$$f(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{\beta x} - \alpha & \text{if } x \le 0 \end{cases}. \quad (9)$$

The negative pre-activations are scaled by β before being fed into the exponential. This design is motivated by the observation that, without the curvature provided by the exponential term αe^x, the φ(q) of SELU would be a constant value without the self-normalization property.

Leaky Scaled Exponential Linear Unit (lSELU). The lSELU is defined as

$$f(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} + \beta x - \alpha & \text{if } x \le 0 \end{cases}, \quad (10)$$

which has an additional negative slope βx. This is inspired by the observation that leaky ReLU helps to avoid the saturation of negative activations. Besides, Chen et al. (2020b) show that leaky ReLU alone also improves the stability of the training process.

Figure 1: The φ(q) ∼ q and f(x) ∼ x curves of sSELU ((a) & (b), with β = 0.51, 1.02, 1.20) and lSELU ((c) & (d), with β = 0.17, 1.00, 1.40), compared against ReLU and dSELU.

Determine the optimal λ, α, and β. Figure 1 shows that, given proper parameters λ, α, and β, our sSELU and lSELU can be configured to bring φ(q) closer to 1/q, which indicates a stronger self-normalization property.
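The φ(q) of Definition 2 is straightforward to evaluate numerically. The sketch below is our own illustration: it estimates φ(q) for sSELU with a dense quadrature over z ∼ N(0, 1), using the standard SELU constants with β = 1 (which recovers plain SELU) rather than solved sSELU optima:

```python
import numpy as np

# Numerical evaluation of Definition 2's phi(q) for sSELU (a sketch; beta = 1
# recovers plain SELU, whose published constants we use instead of solved optima).
Z = np.linspace(-8.0, 8.0, 80001)
DZ = Z[1] - Z[0]
PDF = np.exp(-Z ** 2 / 2) / np.sqrt(2.0 * np.pi)

def phi(q, lam, alpha, beta):
    x = np.sqrt(q) * Z
    # sSELU derivative: lam for x > 0, lam * alpha * beta * e^{beta x} for x <= 0
    df = np.where(x > 0, lam, lam * alpha * beta * np.exp(beta * x))
    return float(np.sum(df ** 2 * PDF) * DZ)

lam, alpha, beta = 1.0507009873554805, 1.6732632423543772, 1.0
for q in (0.5, 1.0, 2.0):
    print(q, phi(q, lam, alpha, beta))   # phi(1) ≈ 1.0716 for these constants
```

With these constants φ(1) ≈ 1.0716 = 1 + ε, matching the ε reported for the original SELU configuration, and φ(q) stays sandwiched between 1 and 1/q on both sides of q = 1.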
With the first conclusion in Proposition 3.1 and equation 7, the λ, α, and β can be obtained by solving the following optimization problem once ε is provided:

$$\min_{\lambda, \alpha, \beta}\ \left|\frac{d\phi(q)}{dq}\Big|_{q=1} + 1\right|, \quad \text{s.t.} \quad \phi(1) = 1 + \epsilon, \quad \int_{-\infty}^{\infty} f^2(z) \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz = 1, \quad \lambda \ge 1. \quad (11)$$

In particular, the constraint λ ≥ 1 is inspired by the argument in Klambauer et al. (2017) that "a slope larger than one can increase the variance if it is too small in the lower layer". In this paper, we find that constraining λ ≥ 1 provides two other benefits. First, having λ ≈ 1 helps to maintain the mean of the output activations around 0. Second, a larger λ slows down the gradient explosion in the backward pass. The detailed discussion can be found in Appendix A.4.

Figure 2: (1 + δ_q) ∼ q of sSELU & lSELU under different ε (ε = 0.01, 0.03, 0.05, 0.07), compared against dSELU.

Determine the ε. While Chen et al. (2020b) propose to have ε < 1/L to avoid gradient explosion, where L is the depth of the network, their derivation is based on the assumption that δ_q ≈ 0 in equation 2. However, after taking the nonzero δ_q into consideration, our Proposition 3.1 shows that the rate is actually (1 + δ_q) rather than (1 + ε). We plot the relationship between (1 + δ_q) and q under different ε in Figure 2. First of all, because of the first term in equation 7, we have δ_q = ε when q = 1, which illustrates the intuition behind using (1 + ε) to characterize the rate of gradient explosion. Therefore, ε ≈ 1/L is still a good choice when ε has to be determined a priori, especially for relatively shallow networks. As lSELU and sSELU have a relatively higher δ_{q>1} in Figure 2, an ε relatively smaller than that of dSELU may yield better performance. Last but not least, in deeper neural networks, q has more chances to deviate from the fixed point q = 1, and δ_q gets larger when q gets larger.
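The constrained program can be sketched with an off-the-shelf solver. In the illustration below, the SLSQP method, the quadrature grid, ε = 0.03, and the starting point are all our own assumptions (not the paper's reported setup), and a finite difference stands in for an analytic dφ/dq:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the constrained program for sSELU's (lam, alpha, beta). The SLSQP
# solver, the quadrature grid, eps = 0.03, and the starting point are our own
# illustrative choices, not the paper's reported setup.
Z = np.linspace(-8.0, 8.0, 40001)
DZ = Z[1] - Z[0]
PDF = np.exp(-Z ** 2 / 2) / np.sqrt(2.0 * np.pi)
EPS = 0.03

def f(x, lam, alpha, beta):                      # sSELU
    return np.where(x > 0, lam * x, lam * alpha * (np.exp(beta * x) - 1.0))

def phi(q, lam, alpha, beta):                    # Definition 2
    x = np.sqrt(q) * Z
    df = np.where(x > 0, lam, lam * alpha * beta * np.exp(beta * x))
    return float(np.sum(df ** 2 * PDF) * DZ)

def objective(p):                                # (dphi/dq|_{q=1} + 1)^2
    dphi = (phi(1.001, *p) - phi(0.999, *p)) / 0.002
    return (dphi + 1.0) ** 2

cons = [
    {"type": "eq", "fun": lambda p: phi(1.0, *p) - (1.0 + EPS)},
    {"type": "eq", "fun": lambda p: float(np.sum(f(Z, *p) ** 2 * PDF) * DZ) - 1.0},
]
res = minimize(objective, x0=[1.05, 1.6, 1.0], method="SLSQP",
               bounds=[(1.0, 2.0), (0.1, 3.0), (0.1, 3.0)], constraints=cons)
print(res.x)   # feasible (lam, alpha, beta) with small |dphi/dq + 1| at q = 1
```

The λ ≥ 1 constraint is imposed through the bounds; squaring the objective keeps it smooth for the gradient-based solver.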
Therefore, the trade-off between the strength of the self-normalization property and the speed of gradient explosion may become too complex to be captured by ε ≈ 1/L, and it might be more promising to determine the proper ε on the validation set.

5. LARGE-SCALE SELF-NORMALIZING NEURAL NETWORK

Chen et al. (2020b) evaluate SELU and dSELU with Conv MobileNet V1 (Howard et al., 2017) on ImageNet. While SELU suffers from gradient explosion, the accuracy of dSELU is 1.79% lower than the BN baseline. We observe that there are two major reasons behind this performance degradation and propose several tricks that improve the performance of large-scale SNNs.

Nonzero Mean in the Forward Pass. Proposition 3.2 reveals that the nonzero mean can be diminished by multiplication with the weight matrices when µ < 1/N_{l-1}. In small-scale SNNs, as N_{l-1} is relatively small, this condition is easy to satisfy, and we don't have to worry about the deviation of the mean from 0. However, in large-scale SNNs for large datasets like ImageNet, a larger fan-in is required to ensure the network has enough parameters to model the more complex problem. In Appendix A.5, we empirically show that models with larger fan-in tend to have larger µN_{l-1}, which implies a weaker self-normalization property on the mean. As our Proposition 3.1 is based on Assumption 1, a greatly biased mean may violate the assumption. As a result, for large-scale SNNs, we have to consider the influence of the nonzero mean. While the influence of the weight matrix on the mean is well captured by Proposition 3.2, the influence of the activation function is more complex. In particular, for layer l, we assume the pre-activations follow i.i.d. N(E[h_l], σ²), and the output mean can be computed with

$$\mathbb{E}[x_l] = \int_{-\infty}^{\infty} f(x) \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mathbb{E}[h_l])^2}{2\sigma^2}} dx. \quad (12)$$

We plot the relationship in Figure 3, in which the solid lines represent the theoretical values and the dashed lines are the values measured via numerical experiments.
When the variance σ² is large, there will be a positive bias on the mean of the output. The explanation is quite intuitive: the saturated region on the negative axis has an asymmetric growth rate compared with the positive axis. Hence, when the variance is large, the positive part contributes more than the negative part, which increases the mean.

Lack of Regularization during Training. Luo et al. (2018) show that batch normalization also regularizes the training process. In particular, using the statistics of the minibatch µ_B and σ_B for explicit normalization introduces additional Gaussian noise that regularizes the training process; it also discourages the reliance on a single neuron and penalizes correlations among neurons. However, activation functions with the self-normalization property don't have these features, as they do not rely on statistics from the minibatch.

Figure 3: The output mean E[x_l] versus the input mean E[h_l].

Based on the analysis above, we find that three techniques can be used to improve the performance of large-scale SNNs: mixup data augmentation (Zhang et al., 2018), weight centralization, and centralized activation functions.

Mixup Data Augmentation. Mixup is a simple data augmentation routine that constructs virtual training examples via linear interpolation: x̃ = γx_i + (1 − γ)x_j, ỹ = γy_i + (1 − γ)y_j, where (x_i, y_i) and (x_j, y_j) are two training samples randomly drawn from the training set, and γ ∈ (0, 1). In particular, we find that using Mixup with SNN brings two benefits. First, Mixup reduces the variance / second moment of the inputs. Under the assumption that the corresponding entries x_i and x_j in the two samples are independent and E[x_i²] = E[x_j²] := E[x²], E[x_i] = E[x_j] = 0, we have

$$\mathbb{E}\left[(\gamma x_i + (1 - \gamma)x_j)^2\right] = \left(\gamma^2 + (1 - \gamma)^2\right)\mathbb{E}[x^2].$$

For instance, when γ = 0.7, the second moment of the sample entries is 0.58 times that of the original training samples; hence the variance of the input samples is implicitly decreased.
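This second-moment identity can be checked numerically; the i.i.d. zero-mean, unit-variance toy inputs below are our own stand-in for real training data:

```python
import numpy as np

# Monte-Carlo check of the mixup second-moment identity above; the i.i.d.
# zero-mean, unit-variance inputs are a toy stand-in for real training data.
rng = np.random.default_rng(0)
gamma = 0.7
xi = rng.normal(0.0, 1.0, size=1_000_000)
xj = rng.normal(0.0, 1.0, size=1_000_000)
mixed = gamma * xi + (1.0 - gamma) * xj

factor = gamma ** 2 + (1.0 - gamma) ** 2     # 0.58 for gamma = 0.7
print(float(np.mean(mixed ** 2)), factor)    # the two values agree closely
```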
With a smaller q in the first few layers, on the one hand, as shown in Figure 2, a smaller second moment q leads to a smaller δ_q, which reduces the gradient explosion rate in the backward pass. On the other hand, as shown in Figure 3, a smaller variance also reduces the shift of the output mean caused by the activation function. Second, Mixup creates additional training samples from the dataset, which provides additional regularization that can further boost the accuracy. The same property is also used in Zhang et al. (2019). Besides, we empirically find that making λ trainable is also helpful when applying lSELU and sSELU to large datasets like ImageNet. The trainable λ can be viewed as the scalar multiplier initialized at 1 used in Zhang et al. (2019). Together with the bias of each layer, they serve as the affine transform applied in batch normalization (Ioffe & Szegedy, 2015a), which increases the representational power of the network (Michalski et al., 2019).

Weight Centralization. When µ < 1/N_{l-1}, multiplication with the weight can effectively normalize the mean of the activations. Therefore, we can explicitly centralize the weights, i.e., Ŵ = W − mean(W). As the weights are usually much smaller than the feature maps, the overhead of Weight Centralization is usually quite small. Moreover, as it doesn't rely on the batch, Weight Centralization can still be utilized under micro-batch scenarios.

Centralized Activation Function. When a network with a large fan-in is relatively shallow, we can trade the strength of the self-normalization property against the deviation of the mean caused by the activation function.
While φ(1) = 1 + ε, E[f(x)] = 0, and E[f²(x)] = 1 cannot simultaneously hold in SELU and dSELU, as they only have two parameters, the λ, α, and β in our sSELU and lSELU can be solved with

$$\phi(1) = 1 + \epsilon, \quad \mathbb{E}[f(x)] = \int_{-\infty}^{\infty} f(z) \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz = 0, \quad \mathbb{E}[f^2(x)] = \int_{-\infty}^{\infty} f^2(z) \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz = 1, \quad (13)$$

which ensures that the output activations still have zero mean when the input is at the fixed point.
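This three-condition system can be handed to a standard root finder; in the sketch below, the quadrature grid, ε = 0.03, and the initial guess for lSELU's parameters are our own illustrative assumptions, not the paper's reported values:

```python
import numpy as np
from scipy.optimize import root

# Sketch of the three-condition system for a centralized lSELU. The quadrature
# grid, eps = 0.03, and the initial guess are our own illustrative assumptions.
Z = np.linspace(-8.0, 8.0, 40001)
DZ = Z[1] - Z[0]
PDF = np.exp(-Z ** 2 / 2) / np.sqrt(2.0 * np.pi)
EPS = 0.03

def lselu(x, lam, alpha, beta):
    return np.where(x > 0, lam * x, lam * (alpha * np.exp(x) + beta * x - alpha))

def dlselu(x, lam, alpha, beta):
    return np.where(x > 0, lam, lam * (alpha * np.exp(x) + beta))

def system(p):
    fz, dfz = lselu(Z, *p), dlselu(Z, *p)
    return [float(np.sum(dfz ** 2 * PDF) * DZ) - (1.0 + EPS),  # phi(1) = 1 + eps
            float(np.sum(fz * PDF) * DZ),                       # E[f] = 0
            float(np.sum(fz ** 2 * PDF) * DZ) - 1.0]            # E[f^2] = 1

sol = root(system, x0=[1.1, 1.0, 0.3])
print(sol.x)   # (lam, alpha, beta) satisfying all three conditions
```

Unlike the two-parameter SELU/dSELU, the extra β gives the system as many unknowns as conditions, which is exactly the point made above.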

6. EXPERIMENTS

In this section, we validate our activation functions on multiple image classification benchmarks. In Appendix B, we present an efficient CUDA kernel design, under which the overhead of lSELU and sSELU is only 2% higher than that of SELU. The experiment setup is in Appendix C, and the values of the parameters λ, α, β and the resulting γ_{q=1} are summarized in Appendix D.

6.1. NORMALIZATION EFFECTIVENESS

We empirically show that our new activation functions have better normalization effectiveness than existing studies, as demonstrated by the second moment of the output pre-activation of each convolutional layer (E[h²]) and the Frobenius norm of the gradient of the weight (||∂L/∂W||_F). As shown in Figure 4, in the forward pass, sSELU and lSELU normalize the second moment better than dSELU. In the backward pass, compared with SELU and dSELU, both sSELU and lSELU have a much flatter and more concentrated distribution of the Frobenius norm. Notably, SELU has ε ≈ 0.0716, and a higher ε leads to a stronger self-normalization property. This explains why SELU also has good dynamics in the forward pass. However, ε = 0.0716 also increases the speed of gradient explosion, which explains why SELU has worse backward dynamics. Last but not least, the further E[h²] deviates from its fixed point in the forward pass, the faster ||∂L/∂W||_F increases in the backward pass. As a larger q = E[h²] leads to a larger δ_q, this observation justifies the second conclusion in Proposition 3.1.

6.2. MODERATE-SCALE DATASETS

We summarize the results on CIFAR-10, CIFAR-100, and Tiny ImageNet in Table 1. First of all, under most values of ε, our lSELU and sSELU are comparable to or even better than dSELU. In particular, sSELU achieves consistent accuracy improvements on CIFAR-10 and CIFAR-100, while lSELU has better performance on Tiny ImageNet. Second, the results show that ε ≈ 1/L ≈ 0.017 is not always the best choice for dSELU, lSELU, and sSELU, but the best accuracies achieved by our sSELU and lSELU are under a relatively smaller ε than dSELU's (Chen et al., 2020b). These two observations accord with our arguments in Section 4 on the selection of a proper ε.

In this part, we evaluate our conclusions in Section 5. First, adding Weight Centralization or Mixup successfully solves the gradient explosion problem in SELU, lSELU, and sSELU. Second, for dSELU, lSELU, and sSELU, making λ trainable brings additional performance improvement over using Mixup alone. Moreover, after relaxing the constraint λ ≥ 1 to λ ≥ 0.5, the test accuracy of "lSELU (λ ≥ 0.5) + Mixup" drops by 8.25%, which demonstrates the importance of constraining λ to be no less than 1. Last but not least, by combining centralized lSELU and sSELU with Mixup and trainable λ, we achieve 71.82% and 71.95% top-1 accuracy, respectively.

7. CONCLUSION

In this paper, we analyze the forward- and backward-pass signals in SNNs and redefine the self-normalization property. Two novel activation functions, lSELU and sSELU, are developed under this definition. A constrained optimization program is proposed to solve for their optimal configurations. Moreover, we reveal the reason behind the performance degradation of SNNs under large fan-in, and several solutions are proposed. With our novel methods, advanced results are achieved on multiple benchmarks. Our study demonstrates a new research direction for the design of activation functions.

A. PROOFS

A.1. SIGNAL PROPAGATION IN DEEP NEURAL NETWORKS

For convenience, we denote the Jacobian matrix ∂f(h_{l-1})/∂h_{l-1} as D_{l-1}, and tr(W_l W_l^T) tr(D_{l-1} D_{l-1}^T) as χ_l, where tr(·) is the normalized trace.

Proposition A.1 (Forward Signal under Mean-Field Theory) Under the formulation, notations, and assumptions above, the evolution of the second moment of the pre-activations q_{l∈[1,L]} in the forward pass can be described with

$$q_l = q_0 \prod_{i=1}^{l} \frac{\chi_i}{1 + \delta_{q_{i-1}}}, \quad l = 1, ..., L. \quad (14)$$

Proof. Under Assumptions 1 & 2, the pre-activation vector of layer l can be characterized with a Gaussian random variable x = √(q_l) z, where z is a random variable following N(0, 1). With these definitions, we can investigate how q evolves between layer l − 1 and layer l:

$$q_l = \frac{1}{N_l}\left(W_l f(h_{l-1}) + b_l\right)^T \left(W_l f(h_{l-1}) + b_l\right) = \sigma_b^2 + \frac{1}{N_l} f(h_{l-1})^T U^T \Lambda U f(h_{l-1}), \quad (15)$$

where σ_b² = (1/N_l) b_l^T b_l, U is an orthogonal matrix, and Λ is the diagonal matrix of eigenvalues of W_l^T W_l. We characterize the diagonal entries in Λ with a random variable λ whose probability density function is p(λ). With Assumption 3, we have

$$q_l = \sigma_b^2 + \frac{N_{l-1}}{N_l} \int f^2(\sqrt{q_{l-1}}\,z) \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz \int \lambda\, p(\lambda)\, d\lambda = \sigma_b^2 + \mathrm{tr}(W_l W_l^T) \int f^2(\sqrt{q_{l-1}}\,z) \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz. \quad (16)$$

Then, we substitute equation 2 into equation 16, which yields

$$q_l = \sigma_b^2 + \frac{q_{l-1}}{1 + \delta_{q_{l-1}}} \mathrm{tr}(W_l W_l^T) \int \left(\frac{df(\sqrt{q_{l-1}}\,z)}{d(\sqrt{q_{l-1}}\,z)}\right)^2 \frac{e^{-z^2/2}}{\sqrt{2\pi}} dz = \sigma_b^2 + \frac{q_{l-1}}{1 + \delta_{q_{l-1}}} \mathrm{tr}(W_l W_l^T)\,\mathrm{tr}(D_{l-1} D_{l-1}^T). \quad (17)$$

As the bias vector is usually initialized with zeros and shared among multiple feature entries, σ_b² has a lower impact than the second term. Therefore, if we neglect σ_b², with the notation χ_l = tr(W_l W_l^T) tr(D_{l-1} D_{l-1}^T), we have

$$q_l = q_0 \prod_{i=1}^{l} \frac{\chi_i}{1 + \delta_{q_{i-1}}}, \quad l = 1, ..., L. \quad (18)$$

Proposition A.2 (Backward Gradient under Block Dynamical Isometry) Under the formulation, notations, and assumptions above, the evolution of the Frobenius norm of the gradient in the backward pass can be described with

$$\mathbb{E}\left[\left\|\frac{\partial \mathcal{L}}{\partial h_l}\right\|_2^2\right] \Big/ \mathbb{E}\left[\left\|\frac{\partial \mathcal{L}}{\partial h_0}\right\|_2^2\right] \approx \prod_{i=1}^{l} \frac{1}{\chi_i}. \quad (19)$$

Proof.
Given the gradient ∂L/∂h_l, with the chain rule, we have

$$\frac{\partial \mathcal{L}}{\partial h_{l-1}} = D_{l-1}^T W_l^T \frac{\partial \mathcal{L}}{\partial h_l} \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial h_0} = \left(\prod_{i=1}^{l} D_{i-1}^T W_i^T\right) \frac{\partial \mathcal{L}}{\partial h_l}. \quad (20)$$

In particular, we are interested in the Frobenius norm of ∂L/∂h_l, represented as ||∂L/∂h_l||₂². According to Chen et al. (2020b), its expectation can be computed with

$$\mathbb{E}\left[\left\|\frac{\partial \mathcal{L}}{\partial h_0}\right\|_2^2\right] \approx \mathrm{tr}\left[\left(\prod_{i=1}^{l} D_{i-1}^T W_i^T\right)^T \left(\prod_{i=1}^{l} D_{i-1}^T W_i^T\right)\right] \mathbb{E}\left[\left\|\frac{\partial \mathcal{L}}{\partial h_l}\right\|_2^2\right]. \quad (21)$$

Chen et al. (2020b) prove the following theorem.

Definition 3 (Chen et al. (2020b), k-th Moment Unitarily Invariant) Let {A_i} := {A_1, A_2, ..., A_L} be a series of independent random matrices. Let {U_i} := {U_1, U_2, ..., U_L} be a series of independent Haar unitary matrices independent of {A_1, A_2, ..., A_L}. We say that (Π_i A_i)(Π_i A_i)^T is the k-th moment unitarily invariant if ∀ 0 < p ≤ k, we have

$$\mathrm{tr}\left[\left(\left(\prod_i A_i\right)\left(\prod_i A_i\right)^T\right)^p\right] = \mathrm{tr}\left[\left(\left(\prod_i U_i A_i\right)\left(\prod_i U_i A_i\right)^T\right)^p\right]. \quad (22)$$

Theorem A.1 (Chen et al. (2020b), Multiplication) Given J := Π_{i=L}^{1} J_i, where {J_i ∈ R^{m_i × m_{i-1}}} is a series of independent random matrices, if (Π_{i=L}^{1} J_i)(Π_{i=L}^{1} J_i)^T is at least the 1st moment unitarily invariant (Definition 3), we have

$$\mathrm{tr}\left[\left(\prod_{i=L}^{1} J_i\right)\left(\prod_{i=L}^{1} J_i\right)^T\right] = \prod_{i=L}^{1} \mathrm{tr}\left(J_i J_i^T\right). \quad (23)$$

Therefore, equation 21 can be further simplified with Theorem A.1 as follows:

$$\mathbb{E}\left[\left\|\frac{\partial \mathcal{L}}{\partial h_l}\right\|_2^2\right] \Big/ \mathbb{E}\left[\left\|\frac{\partial \mathcal{L}}{\partial h_0}\right\|_2^2\right] \approx \prod_{i=1}^{l} \frac{1}{\mathrm{tr}(W_i W_i^T)\,\mathrm{tr}(D_{i-1} D_{i-1}^T)} = \prod_{i=1}^{l} \frac{1}{\chi_i}. \quad (24)$$

A.2. PROOF OF PROPOSITION 3.1

Proposition 3.1 (Strength of Self-normalization Property) Under Definition 2, we represent φ(q) as a linear interpolation between 1 and 1/q as follows:

$$\phi(q) = \begin{cases} 1 + (1 - \gamma_{q<1})(1/q - 1) & \text{if } q < 1 \\ 1/q + \gamma_{q>1}(1 - 1/q) & \text{if } q > 1 \end{cases}, \quad (25)$$

where γ_q ∈ (0, 1) is a function of q. Then the following conclusions hold:

• The self-normalization property gets stronger when γ_{q<1} and γ_{q>1} get closer to 0. In particular, |γ_{q<1}| ≈ |γ_{q>1}| ≈ |dφ(q)/dq|_{q=1} + 1| when q is around 1.

• For layer l, the gradient explodes at the rate (1 + δ_{q_l}), i.e.,
$$\prod_{i=1}^{l}(1+\delta_{q_{i-1}})\,\mathbb{E}\left[\left\|\frac{\partial L}{\partial h_l}\right\|_2^2\right] = q_0\,\mathbb{E}\left[\left\|\frac{\partial L}{\partial h_0}\right\|_2^2\right].$$

Proof. When $\gamma_{q<1}$ and $\gamma_{q>1}$ approach $0$, $\varphi(q)$ gets closer to $1/q$. With equation 6, we have $\varphi(q) = \mathrm{tr}(D_l D_l^T)$, where $D_l$ is the Jacobian matrix of the activation function in layer $l$. As the weights are initialized with $\mathcal{N}(0, \frac{1}{N_l})$, we have $\mathrm{tr}(W_l W_l^T) = 1$. In the forward pass, with equation 2, we have $q_{l+1} = \frac{1}{1+\delta_{q_l}}\varphi(q_l)\,q_l$. Substituting equation 8 gives
$$\begin{cases} 1 - q_{l+1} = \frac{\gamma_{q_l<1}}{1+\delta_{q_l}}(1-q_l) + \left(1 - \frac{1}{1+\delta_{q_l}}\right), & q_l < 1 \\ q_{l+1} - 1 = \frac{\gamma_{q_l>1}}{1+\delta_{q_l}}(q_l-1) - \left(1 - \frac{1}{1+\delta_{q_l}}\right), & q_l > 1 \end{cases} \tag{26}$$
In the backward pass, with equations 14 and 19, we have
$$\frac{\mathbb{E}\left[\|\partial L/\partial h_l\|_2^2\right]}{q_0\,\mathbb{E}\left[\|\partial L/\partial h_0\|_2^2\right]} = \frac{1/q_l}{\prod_{i=1}^{l}(1+\delta_{q_{i-1}})}.$$
Because $\mathbb{E}\left[\|\partial L/\partial h_{l+1}\|_2^2\right] = \mathbb{E}\left[\|\partial L/\partial h_l\|_2^2\right]\big/\left(\mathrm{tr}(D_l D_l^T)\,\mathrm{tr}(W_{l+1}W_{l+1}^T)\right) = \mathbb{E}\left[\|\partial L/\partial h_l\|_2^2\right]/\varphi(q_l)$, we have
$$\begin{cases} \frac{\mathbb{E}[\|\partial L/\partial h_{l+1}\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} = \frac{1}{\prod_{i=1}^{l}(1+\delta_{q_{i-1}})}\cdot\frac{1}{q_l + (1-\gamma_{q_l<1})(1-q_l)} = \frac{1 + \bar\gamma_{q_l<1}(1/q_l - 1)}{\prod_{i=1}^{l}(1+\delta_{q_{i-1}})}, & q_l < 1 \\ \frac{\mathbb{E}[\|\partial L/\partial h_{l+1}\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} = \frac{1}{\prod_{i=1}^{l}(1+\delta_{q_{i-1}})}\cdot\frac{1}{1 + \gamma_{q_l>1}(q_l-1)} = \frac{1/q_l + (1-\bar\gamma_{q_l>1})(1-1/q_l)}{\prod_{i=1}^{l}(1+\delta_{q_{i-1}})}, & q_l > 1 \end{cases}$$
where
$$\bar\gamma_{q_l<1} = \frac{\gamma_{q_l<1}\,q_l}{q_l + (1-\gamma_{q_l<1})(1-q_l)} \in (0,1), \qquad \bar\gamma_{q_l>1} = \frac{\gamma_{q_l>1}\,q_l}{1 + \gamma_{q_l>1}(q_l-1)} \in (0,1)$$
are monotonically increasing functions of $\gamma_{q_l<1}$ and $\gamma_{q_l>1}$, respectively. Similarly, we can derive how the deviation from the fixed point evolves during back-propagation:
$$\begin{cases} \frac{\mathbb{E}[\|\partial L/\partial h_{l+1}\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} - 1 = \bar\gamma_{q_l<1}\left(\frac{\mathbb{E}[\|\partial L/\partial h_l\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} - 1\right) + (1-\bar\gamma_{q_l<1})\left(\frac{1}{\prod_{i=1}^{l}(1+\delta_{q_{i-1}})} - 1\right), & q_l < 1 \\ 1 - \frac{\mathbb{E}[\|\partial L/\partial h_{l+1}\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} = \bar\gamma_{q_l>1}\left(1 - \frac{\mathbb{E}[\|\partial L/\partial h_l\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]}\right) + (1-\bar\gamma_{q_l>1})\left(1 - \frac{1}{\prod_{i=1}^{l}(1+\delta_{q_{i-1}})}\right), & q_l > 1 \end{cases} \tag{31}$$
First of all, when the $\delta_{q_i}$ are negligible, equations 26 and 31 simplify to
$$\begin{cases} 1 - q_{l+1} = \gamma_{q_l<1}(1-q_l), \quad \frac{\mathbb{E}[\|\partial L/\partial h_{l+1}\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} - 1 = \bar\gamma_{q_l<1}\left(\frac{\mathbb{E}[\|\partial L/\partial h_l\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} - 1\right), & q_l < 1 \\ q_{l+1} - 1 = \gamma_{q_l>1}(q_l-1), \quad 1 - \frac{\mathbb{E}[\|\partial L/\partial h_{l+1}\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} = \bar\gamma_{q_l>1}\left(1 - \frac{\mathbb{E}[\|\partial L/\partial h_l\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]}\right), & q_l > 1 \end{cases} \tag{32}$$
As $\bar\gamma_{q_l<1}$ and $\bar\gamma_{q_l>1}$ are monotonically increasing functions of $\gamma_{q_l<1}$ and $\gamma_{q_l>1}$, it is clear that with smaller $\gamma_{q_l<1}$ and $\gamma_{q_l>1}$, the deviation from the fixed point in both the forward and backward passes shrinks faster. In particular, when $q$ is around the fixed point $q = 1$, as ensured by the second term in equation 7, we can approximate $\varphi(q)$ and $1/q$ with their first-order Taylor expansions around $q = 1$. With the definitions of $\gamma_{q<1}$ and $\gamma_{q>1}$ in equation 8, we have
$$\begin{cases} \bar\gamma_{q<1} \approx \gamma_{q<1} = \frac{1/(1+\Delta q) - \varphi(1+\Delta q)}{1/(1+\Delta q) - 1} \approx 1 + \frac{d\varphi(q)}{dq}\Big|_{q=1}, & \Delta q < 0 \\ \bar\gamma_{q>1} \approx \gamma_{q>1} = \frac{\varphi(1+\Delta q) - 1/(1+\Delta q)}{1 - 1/(1+\Delta q)} \approx 1 + \frac{d\varphi(q)}{dq}\Big|_{q=1}, & \Delta q > 0 \end{cases}$$
As a result, we can reduce the number of layers required to diminish the deviation by minimizing $\left|\frac{d\varphi(q)}{dq}\big|_{q=1} + 1\right|$.

Then, we discuss the influence of $\delta_q$. The fixed points of the two recursions in equations 26 and 31 can be computed as
$$\begin{cases} 1 - q = \frac{\delta_q}{1+\delta_q-\gamma_{q<1}}, \quad \frac{\mathbb{E}[\|\partial L/\partial h_l\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} - 1 = \frac{1}{\prod_{i=1}^{l}(1+\delta_{q_{i-1}})} - 1, & q < 1 \\ q - 1 = \frac{-\delta_q}{1+\delta_q-\gamma_{q>1}}, \quad 1 - \frac{\mathbb{E}[\|\partial L/\partial h_l\|_2^2]}{q_0\,\mathbb{E}[\|\partial L/\partial h_0\|_2^2]} = 1 - \frac{1}{\prod_{i=1}^{l}(1+\delta_{q_{i-1}})}, & q > 1 \end{cases}$$
While the fixed point of the deviation slightly deviates from $0$, in the backward pass we have $\prod_{i=1}^{l}(1+\delta_{q_{i-1}})\,\mathbb{E}\left[\|\partial L/\partial h_l\|_2^2\right] = q_0\,\mathbb{E}\left[\|\partial L/\partial h_0\|_2^2\right]$, which suggests that the gradient explodes at rate $(1+\delta_{q_l})$ at layer $l$.
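The role of $\gamma$ in the first conclusion can be illustrated numerically by iterating the forward recursion $q_{l+1} = \varphi(q_l)\,q_l$ with the interpolated $\varphi$ of equation 8, neglecting $\delta_q$ and, for simplicity, holding $\gamma$ constant (in general it depends on $q$):

```python
# Iterate q_{l+1} = phi(q_l) * q_l with the interpolated phi of equation 8,
# neglecting delta_q.  gamma is held constant here purely for illustration.
def phi(q, gamma):
    if q < 1.0:
        return 1.0 + (1.0 - gamma) * (1.0 / q - 1.0)
    return 1.0 / q + gamma * (1.0 - 1.0 / q)

def layers_to_converge(q0, gamma, tol=1e-3, max_layers=10_000):
    """Number of layers until |q_l - 1| drops below tol."""
    q, l = q0, 0
    while abs(q - 1.0) > tol and l < max_layers:
        q = phi(q, gamma) * q
        l += 1
    return l

# Smaller gamma shrinks the deviation |q_l - 1| far faster.
print(layers_to_converge(4.0, 0.9), layers_to_converge(4.0, 0.5))
```

With this constant-$\gamma$ simplification the deviation contracts exactly by a factor $\gamma$ per layer, matching equation 32.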
A.3 PROOF OF PROPOSITION 3.2

Proposition 3.2 (Normalization of Mean). Assume that the entries $w_{ij}$ of the weight matrix are independent of the input activations and that their expectations have an upper bound $\mu$, i.e., $\forall i,j,\ \mathbb{E}[w_{ij}] \le \mu$. Then multiplication with the weight matrix normalizes the mean if $\mu < \frac{1}{N_{l-1}}$ holds, where $N_{l-1}$ is the fan-in of the current layer $l$. Moreover, the mean is scaled down by a ratio smaller than $\mu N_{l-1}$.

Proof. With equation 1, the $j$-th entry of the output pre-activation $h_l$ can be computed as $h_{l,j} = \sum_{i=1}^{N_{l-1}} w_{j,i}\,x_{l-1,i}$. Therefore, with the assumed independence between the weights and the input activations, we have
$$\mathbb{E}[h_{l,j}] = \sum_{i=1}^{N_{l-1}} \mathbb{E}[w_{j,i}\,x_{l-1,i}] \le N_{l-1}\,\mu\,\mathbb{E}\left[\frac{1}{N_{l-1}}\mathbf{1}^T x_{l-1}\right]. \tag{36}$$
As the term $\mathbb{E}\left[\frac{1}{N_{l-1}}\mathbf{1}^T x_{l-1}\right]$ can be viewed as the mean of the input activations, when $\mu < \frac{1}{N_{l-1}}$, equation 36 reveals that the mean is reduced after multiplication with the weight matrix.
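Proposition 3.2 can be illustrated numerically; the sketch below assumes, purely for illustration, that every weight entry has expectation $\mu = 0.5/N$ (so $\mu N = 0.5 < 1$) and that the input activations have mean around 2:

```python
import numpy as np

# Numerical illustration of Proposition 3.2: if every weight entry has
# expectation mu with mu < 1/N, multiplying by the weight matrix scales the
# activation mean down by a ratio below mu*N.  The choices E[w_ij] = 0.5/N
# and an input mean of ~2 are arbitrary, for illustration only.
rng = np.random.default_rng(0)
N = 1024
mu = 0.5 / N
W = rng.normal(mu, 1.0 / np.sqrt(N), size=(N, N))  # E[w_ij] = mu < 1/N
x = rng.normal(2.0, 1.0, size=N)                   # input activations, mean ~ 2

h = W @ x
print(np.mean(x), np.mean(h))  # |mean| shrinks by roughly mu*N = 0.5
```

Running this shows the output mean close to $\mu N$ times the input mean, i.e. the mean is roughly halved by one multiplication.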

A.4 BENEFITS OF HAVING λ ≥ 1

First of all, we show that having λ ≈ 1 helps to maintain the mean of the output activations around 0. As we normalize the mean by multiplying with the weights, we do not require $\mathbb{E}[f(x)] = 0$ for $x \sim \mathcal{N}(0,1)$ as Klambauer et al. (2017) do. However, according to Proposition 3.2, the mean converges to 0 more slowly when the expectation of the entries in the weight matrix deviates from 0 and when the fan-in gets larger. Therefore, it is still desirable to avoid shifting the mean too much when the activations flow through the activation functions. As shown in Figure 5, we simulate the forward pass of a 64-layer fully-connected neural network and plot the distribution of the output activations in layers 1, 4, 16, and 64. When λ < 1, a spike around 0 is observed for both sSELU and lSELU, and this leads to a large negative mean of the output activations. On the other hand, for instance, solving
$$\varphi(1) = 1 + \epsilon, \qquad \int_{-\infty}^{\infty} f^2(z)\,\frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz = 1, \qquad \int_{-\infty}^{\infty} f(z)\,\frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz = 0$$
under $\epsilon = 0.03$ gives λ ≈ 1.0360 and 1.0362 for sSELU and lSELU, respectively. Therefore, λ = 1 is a good starting point for the optimization.

Second, we show that a larger λ slows down the gradient explosion in the backward pass. According to the second conclusion of Proposition 3.1, the backward gradient explodes at rate $(1+\delta_q)$, so keeping $\delta_q$ low is critical for avoiding gradient explosion. According to the definition in equation 2, $(1+\delta_q)$ can be computed with
$$1 + \delta_q = \frac{q\,\varphi(q)}{\int_{-\infty}^{\infty} f^2(\sqrt{q}\,z)\,\frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz}.$$
We plot the relationship between the maximum of $(1+\delta_q)$ over $q \in (0, 2]$ and λ in Figure 6. The maximum $(1+\delta_q)$ clearly decreases as λ gets larger. This observation is quite intuitive: $1+\delta_q$ characterizes the relative deviation between $\mathbb{E}[f^2(x)]$ and $\mathbb{E}[(df(x)/dx)^2]\,\mathbb{E}[x^2]$. For the positive pre-activations, we have
$$\mathbb{E}[f^2(x_+)] = \int_0^{\infty} (\lambda\sqrt{q}\,z)^2\,\frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz = \lambda^2 \int_0^{\infty} (\sqrt{q}\,z)^2\,\frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz = \mathbb{E}\left[\left(\frac{df(x_+)}{dx_+}\right)^2\right]\mathbb{E}[x_+^2].$$
Hence, the deviation is contributed only by the negative part. With a larger λ, the positive activations are scaled up, so the negative activations must be scaled down to preserve the overall second moment. Therefore, the negative part contributes less to the overall second moment, and the relative deviation between $\mathbb{E}[f^2(x)]$ and $\mathbb{E}[(df(x)/dx)^2]\,\mathbb{E}[x^2]$ gets smaller. All in all, a larger λ leads to a smaller $\delta_q$, and a smaller $\delta_q$ reduces the gradient explosion rate $(1+\delta_q)$.
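The trend in Figure 6 can be reproduced with a short numerical sketch. We assume a SELU-style form $f(x) = \lambda x$ for $x > 0$ and $\lambda\alpha(e^x - 1)$ for $x \le 0$ (the extra term of lSELU is omitted for brevity), and pick $\alpha$ from the second-moment constraint $\int f^2(z)\,e^{-z^2/2}/\sqrt{2\pi}\,dz = 1$, so a larger $\lambda$ forces a smaller $\alpha$:

```python
import numpy as np

# Monte-Carlo sketch of why larger lambda lowers (1 + delta_q).  For a
# SELU-style f(x) = lam*x (x > 0), lam*alpha*(e^x - 1) (x <= 0), alpha is
# fixed by the constraint E[f(z)^2] = 1 for z ~ N(0, 1); then
# 1 + delta_q = q * E[f'(sqrt(q) z)^2] / E[f(sqrt(q) z)^2], and its maximum
# over q in (0, 2] shrinks as lam grows (valid for lam < sqrt(2)).
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

def alpha_for(lam):
    # E[f^2] = lam^2 * (1/2 + alpha^2 * E[(e^z - 1)^2 ; z < 0]) = 1
    c = np.mean(np.where(z < 0.0, (np.exp(z) - 1.0) ** 2, 0.0))
    return np.sqrt((1.0 / lam**2 - 0.5) / c)

def one_plus_delta(q, lam, alpha):
    x = np.sqrt(q) * z
    f = lam * np.where(x > 0.0, x, alpha * (np.exp(x) - 1.0))
    df = lam * np.where(x > 0.0, 1.0, alpha * np.exp(x))
    return q * np.mean(df**2) / np.mean(f**2)

qs = np.linspace(0.2, 2.0, 10)
maxima = {}
for lam in (1.0, 1.2, 1.4):
    a = alpha_for(lam)
    maxima[lam] = max(one_plus_delta(q, lam, a) for q in qs)
print(maxima)  # the maximum of (1 + delta_q) decreases with lambda
```

As the sketch shows, scaling $\lambda$ alone would cancel out of the ratio; it is the constraint-induced shrinking of $\alpha$ (the negative part) that drives $\delta_q$ down, exactly as argued above.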

A.5 THE µN l-1 WITH INCREASING FAN-IN

Here we empirically illustrate that µN_{l-1} increases when the fan-in N_{l-1} gets larger. The experiment is performed on a 32-layer CNN activated with dSELU. We collect the µN_{l-1} of each convolutional layer and the final fully-connected layer after each epoch over a total of 10 epochs. The learning rate is set to 0.005. Let the number of input channels of layer l be c_l; the k in each title dSELU×k indicates that the number of input channels is scaled to k × c_l. As shown in Figure 7, layers with larger fan-in tend to have larger µN_{l-1}. According to Proposition 3.2, a larger µN_{l-1} leads to a weaker self-normalization property on the mean. Notably, if we further scale the number of input channels with k greater than 4, gradient explosion happens. These two observations justify our conclusion that the shift of the mean is more influential in networks with larger fan-in.
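For reference, the quantity tracked in Figure 7 can be computed from a convolutional weight tensor as follows; `mu_times_fanin` is a hypothetical helper name, and the random tensor below is only a stand-in for a trained layer:

```python
import numpy as np

# Hypothetical helper mirroring the measurement in this section: for a conv
# weight of shape (c_out, c_in, k_h, k_w), the fan-in is c_in * k_h * k_w and
# we track mu*N as |mean weight| times fan-in.
def mu_times_fanin(weight):
    c_out, c_in, k_h, k_w = weight.shape
    fan_in = c_in * k_h * k_w
    return abs(weight.mean()) * fan_in

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0 / np.sqrt(32 * 9), size=(64, 32, 3, 3))
print(mu_times_fanin(w))
```

With larger `c_in`, the same per-entry mean shift translates into a proportionally larger µN_{l-1}, which is why wide layers are more affected.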

B EFFICIENT IMPLEMENTATION OF LSELU AND SSELU ON GPU

In this section, we present an efficient implementation of lSELU and sSELU on GPU. In particular, we take lSELU as an example, as the same strategy can be directly applied to sSELU. We evaluate our new CUDA kernels on an NVIDIA V100 GPU. We randomly generate an input feature map of size [512, 64, 56, 56] as the input of the activation function, and we compare the kernel latency of the forward and backward passes with the native SELU in PyTorch. The results are summarized in the table below.

C EXPERIMENT DETAILS

The weights are initialized with $\mathcal{N}(0, \frac{1}{k_h k_w c_{in}})$, where $k_h$ and $k_w$ are the height and width of the filters and $c_{in}$ is the number of input channels. The models are optimized with SGD with momentum 0.9 and weight decay 0.0005. The architecture of the CNN is as follows:

        Output size   Structure
        W × H         3 × 3, 16, s1
block1  W × H         [3 × 3, 16, s1; 3 × 3, 16, s1] × 8
ds1     W/2 × H/2     [3 × 3, 32, s2; 3 × 3, 32, s1] × 1
block2  W/2 × H/2     [3 × 3, 32, s1; 3 × 3, 32, s1] × 8
ds2     W/4 × H/4     [3 × 3, 64, s2; 3 × 3, 64, s1] × 1
block3  W/4 × H/4     [3 × 3, 64, s1; 3 × 3, 64, s1] × 8
        1 × 1         average pool, fc

For Section 6.1, we train the model from scratch for 3190 iterations (10 epochs) on CIFAR-10 under learning rate 0.015. This learning rate is chosen because it is large enough to simulate fierce parameter updates but small enough to avoid gradient explosion. Following Chen et al. (2020b), we set $\epsilon = 0.017$ for dSELU, lSELU, and sSELU. We collect the second moment of the output pre-activations of each convolutional layer as well as the Frobenius norm of the backward gradient on the convolving kernels in each iteration. For Section 6.2, as the model has a relatively small fan-in, we directly apply sSELU and lSELU without the techniques mentioned in Section 5. Besides, we clip the gradient to [-2, 2] in all experiments to increase stability. All results are averaged over 4 independent runs to reduce fluctuation.
For CIFAR-10 and CIFAR-100, the models are trained with batch size 128 for 130 epochs, with the initial learning rate set to 0.01 and decayed to 0.001 at epoch 80. For Tiny ImageNet, the models are trained with batch size 64 for 200 epochs; the initial learning rate is set to 0.001 and decayed by 10× at epochs 130 and 180. For Section 6.3, following Chen et al. (2020b), we choose the Conv MobileNet V1 (Howard et al., 2017). The "Conv" indicates that traditional convolutions rather than depthwise separable convolutions are used, as the latter require more epochs to converge (Zhou et al., 2020). The model is trained for 90 epochs with batch size 512 under learning rate 0.02 (decayed by 10× at epochs 60 and 75). Following Zhang et al. (2018), the γ for interpolation is drawn from the Beta distribution Beta(0.7, 0.7). For all the experiments with dSELU, lSELU, and sSELU, we follow Chen et al. (2020b) and set $\epsilon = 0.06$.
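The mixup interpolation referenced above (Zhang et al., 2018) can be sketched as follows; this is a generic illustration on scalar inputs rather than the exact training pipeline:

```python
import random

# Mixup with interpolation coefficient drawn from Beta(0.7, 0.7): each pair of
# examples (and their labels) is blended with the same random coefficient.
# Scalars stand in for image tensors and one-hot labels here.
def mixup_pair(x1, y1, x2, y2, alpha=0.7, rng=random):
    lam = rng.betavariate(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2  # labels are mixed with the same coefficient
    return x, y, lam

random.seed(0)
x, y, lam = mixup_pair(1.0, 0.0, 0.0, 1.0)
print(x, y, lam)  # x == lam, y == 1 - lam
```

Beta(0.7, 0.7) is bimodal around 0 and 1, so most mixed samples stay close to one of the two originals.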

D COMPARISON OF PARAMETERS BETWEEN DIFFERENT ACTIVATION FUNCTIONS

We summarize the values of the parameters λ, α, β, and the corresponding γ_q under different configurations in Table 5. According to Proposition 3.1, a smaller γ_{q=1} leads to a stronger self-normalization property. As shown in Table 5, under the same $\epsilon$, our sSELU and lSELU have smaller γ_{q=1} than dSELU, which justifies our intuition that lSELU and sSELU can be configured to have a stronger self-normalization property. Second, the results show that for each activation function, γ_{q=1} increases when $\epsilon$ gets larger. However, as shown in Figure 2, a larger $\epsilon$ also leads to a larger δ_q, which increases the speed of gradient explosion in the backward pass. Last but not least, for the experiments on MobileNet V1, our lSELU and sSELU under $\epsilon = 0.06$ achieve approximately the same γ_{q=1} as SELU with $\epsilon \approx 0.0716$, whereas the latter has a higher gradient explosion rate.

The performance of SELU proposed in Klambauer et al. (2017) was first demonstrated on fully-connected neural networks. In this section, we compare our sSELU and lSELU against the original SELU in a 64-layer fully-connected neural network on three typical datasets: UCI miniboone, UCI adult, and HTRU2. The results are summarized in Table 6. As the neural network has 64 layers, we only evaluate sSELU and lSELU at $\epsilon \in \{0.01, 0.017, 0.03\}$. The results show that for all three $\epsilon$, our sSELU and lSELU achieve consistent improvements over SELU, which further justifies the effectiveness of our activation functions.



Figure 3: The absolute value of the output mean under different input mean and variance.

Figure 4: The distribution of the second-moment of forward pre-activation and the Frobenius norm of backward gradient on weights.

Figure 5: The distribution of output activations of sSELU and lSELU in different layers.

Figure 6: Influence of λ on 1 + δ q

Figure 7: The µN l-1 with Increasing Fan-in.

Test accuracy on CIFAR-10, CIFAR-100, and Tiny ImageNet (CI=95%).

Test accuracy under different configurations on ImageNet (CI=95%).

Forward and Backward pass Latency of SELU and lSELU.

Compared with the original SELU, our new implementation with the trainable λ only increases the latency by around 2%, which is negligible. The reason is that the latency of activation functions is bounded by the DRAM bandwidth of the GPU (Chen et al., 2020b), and the computation units are underutilized. As our CUDA kernels do not introduce additional DRAM accesses, they have little impact on latency.



λ, α, β, and the corresponding γ q under different configurations.

Test accuracy on UCI miniboone, UCI adult, and HTRU2 (CI=95%).


Algorithm 1: Forward Kernel of lSELU. Data: input feature X ∈ R^N; output feature Y ∈ R^N; λ, α, β ∈ R; thread index t; block index b; thread block size T; number of thread blocks B.

Forward Pass. The forward pass kernel for lSELU is shown in Algorithm 1. In the forward pass, the output activations are computed by applying lSELU elementwise to the input feature map. The implementation is straightforward: all threads stride across the feature map and compute the output activations elementwise. The input and output feature maps are treated as 1D arrays, so the kernel achieves good coalescing in both reads and writes. We take T = 1024 and compute the number of thread blocks as B = (N + T − 1)/T, so the number of thread blocks is large enough to achieve a high utilization rate.

Algorithm 2: Backward Kernel of lSELU.

Backward Pass. The backward pass kernel is shown in Algorithm 2. When λ is trainable, the backward pass of lSELU computes both ∂L/∂x and ∂L/∂λ from the incoming gradient ∂L/∂y. As ∂L/∂y is used to compute both ∂L/∂x and ∂L/∂λ, it can be cached in registers for data reuse (line 4). When the threads stride through the whole feature map, each thread holds a partial sum in a private register p (line 2). To limit the accumulated round-off error of the floating-point accumulation, the Kahan summation algorithm (Higham, 1993) is applied (line 9). At last, we use the block reduction in the CUB library to obtain the partial sum of the whole thread block, and the final result is atomically added to ∂L/∂λ. Different from the forward pass, we choose B to be a few thousand (usually much smaller than (N + T − 1)/T). The motivation is that this B is large enough to keep all the streaming multiprocessors busy, yet small enough to keep most of the reduction on chip and reduce the atomic transactions.
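The accumulation strategy of the backward kernel (grid-stride loops with private partial sums under Kahan compensation, followed by a reduction) can be sketched in plain Python; this illustrates the reduction pattern only, not the CUDA kernel itself:

```python
# Each simulated "thread" strides across the values keeping a private partial
# sum, accumulated with Kahan (compensated) summation to limit round-off; the
# partials are then combined, mimicking CUB's block reduction plus atomicAdd.
def strided_kahan_sum(values, num_threads):
    partials = []
    for t in range(num_threads):
        s, c = 0.0, 0.0                               # running sum, compensation
        for i in range(t, len(values), num_threads):  # grid-stride loop
            y = values[i] - c
            tmp = s + y
            c = (tmp - s) - y
            s = tmp
        partials.append(s)
    return sum(partials)                              # final reduction

vals = [1e-8] * 1_000_000
print(strided_kahan_sum(vals, 256))  # ≈ 0.01
```

The compensation term `c` recovers the low-order bits lost when a small value is added to a large running sum, which matters when ∂L/∂λ is reduced over millions of activations.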

