REDEFINING THE SELF-NORMALIZATION PROPERTY

Abstract

The approaches that prevent gradient explosion and vanishing have boosted the performance of deep neural networks in recent years. A unique one among them is the self-normalizing neural network (SNN), which is generally more stable than networks that rely on initialization techniques alone, without explicit normalization. In previous studies, the self-normalization property of SNNs comes from the Scaled Exponential Linear Unit (SELU) activation function. However, it has been shown that in deeper neural networks, SELU either leads to gradient explosion or loses its self-normalization property. Besides, its accuracy on large-scale benchmarks like ImageNet is less satisfying. In this paper, we analyze the forward and backward passes of SNNs with mean-field theory and block dynamical isometry. We propose a new definition of the self-normalization property that is easier to use both analytically and numerically, along with a proposition that enables us to compare the strength of the self-normalization property between different activation functions. We further develop two new activation functions, leaky SELU (lSELU) and scaled SELU (sSELU), that have a stronger self-normalization property; their optimal parameters can be solved with a constrained optimization program. Besides, an analysis of the activation's mean in the forward pass reveals that the self-normalization property on the mean gets weaker with larger fan-in, which explains the performance degradation on ImageNet. This can be solved with weight centralization, mixup data augmentation, and a centralized activation function. On the moderate-scale datasets CIFAR-10, CIFAR-100, and Tiny ImageNet, the direct application of lSELU and sSELU achieves up to 2.13% higher accuracy. On Conv MobileNet V1 on ImageNet, sSELU with Mixup, trainable λ, and a centralized activation function reaches 71.95% accuracy, which is even better than Batch Normalization.

1. INTRODUCTION

In recent years, deep neural networks (DNNs) have achieved state-of-the-art performance on tasks like image classification (He et al., 2015; Zheng et al., 2019). This rapid development can be largely attributed to the initialization and normalization techniques that prevent gradient explosion and vanishing. Initialization techniques (He et al., 2015; Xiao et al., 2018) initialize the parameters of a network to have good statistical properties at the beginning, and assume that these properties can be more or less maintained throughout the training process. However, this assumption is likely to be violated when the network gets deeper or is trained with a higher learning rate. Hence, normalization techniques have been proposed to explicitly normalize the network parameters (Salimans & Kingma, 2016; Arpit et al., 2016) or the activations (Ioffe & Szegedy, 2015b; Ulyanov et al., 2016) during training. In particular, Batch Normalization (BN) (Ioffe & Szegedy, 2015a) has become a standard component in DNNs, as it not only effectively improves convergence rate and training stability, but also regularizes the model to improve its generalization ability. However, BN still has several drawbacks. First, when calculating the mean and variance, the accumulation must be done in FP32 to avoid underflow (Micikevicius et al., 2018), which brings challenges when training neural networks at low bit width. Second, the performance degradation under micro batch sizes makes it more difficult to design training accelerators, as a large batch size increases the memory required to store the intermediate results for the backward pass (Deng et al., 2020). Besides, Chen et al. (2020a) and Wu et al. (2018) show that BN introduces considerable overhead. The self-normalizing neural network (SNN) provides a promising way to address these challenges. Like initialization techniques, an SNN is initialized to have good statistical properties at the beginning.
However, unlike with pure initialization, deviations of the statistics in the forward and backward passes are gradually corrected during propagation, so an SNN is more robust to departures from the initial properties (Chen et al., 2020b). For instance, the mean and variance of the output activations with SELU in Klambauer et al. (2017) automatically converge to the fixed point (0, 1). Chen et al. (2020b) analyze the Frobenius norm of the backward gradient in an SNN activated with SELU. They reveal a trade-off between the self-normalization property and the speed of gradient explosion in the backward pass, so the hyper-parameters need to be configured according to the depth of the network. The resulting activation function, depth-aware SELU (dSELU), achieves even higher accuracy than the original configuration on moderate-scale datasets like CIFAR-10, and makes the SNN trainable on ImageNet. However, in deeper neural networks, dSELU gradually degenerates to ReLU and loses its self-normalization property. Moreover, even with dSELU, the test accuracy on ImageNet with Conv MobileNet V1 (Howard et al., 2017) is still 1.79% lower than with BN (Chen et al., 2020b). Therefore, we aim to answer the following three questions in this paper: 1) Is SELU the only activation function that has the self-normalization property? 2) If not, is there a better choice, and how do we compare the strength of the self-normalization property between different activation functions? 3) Why is the performance of SNNs on ImageNet less satisfying, and is there any way to alleviate that? In this paper, we analyze signal propagation in both the forward and backward passes of serial deep neural networks with mean-field theory (Poole et al., 2016) and block dynamical isometry (Chen et al., 2020b).
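As a concrete illustration of the attractive fixed point described above, the following sketch (an illustrative Monte Carlo check, not code from the paper) propagates a deliberately unnormalized input through a stack of LeCun-initialized linear layers with SELU and observes the activation statistics drift toward (0, 1):

```python
import numpy as np

# Standard SELU constants from Klambauer et al. (2017).
ALPHA, LAM = 1.6732632423543772, 1.0507009873554805

def selu(x):
    return LAM * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
n, width, depth = 4096, 1024, 30
# Deliberately unnormalized input: mean 0.5, variance 4.
x = rng.normal(0.5, 2.0, size=(n, width))
for _ in range(depth):
    # LeCun normal initialization: weight variance = 1 / fan_in.
    w = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
    x = selu(x @ w)

# Both statistics drift toward the fixed point (0, 1).
print(f"mean ~ {x.mean():.2f}, var ~ {x.var():.2f}")
```

The layer count and widths here are arbitrary choices for the demonstration; the convergence itself is what the self-normalization property predicts.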
Our main theoretical results are summarized as follows:

• We illustrate that an activation function demonstrates the self-normalization property if the second moment of its Jacobian matrix's singular values, φ(q), is inversely proportional to the second moment of its input pre-activations, q, and that the property gets stronger as φ(q) gets closer to 1/q. A new definition of the self-normalization property is proposed that can easily be used both analytically and numerically.

• We propose leaky SELU (lSELU) and scaled SELU (sSELU). Both have an additional parameter, β, that can be configured to achieve a stronger self-normalization property. The hyper-parameters can be solved with a constrained optimization program, so no additional hyper-parameter is introduced relative to dSELU.

• We reveal that models with larger fan-in have weaker normalization effectiveness on the mean of the forward-pass signal. This can be solved with explicit weight centralization, mixup data augmentation (Zhang et al., 2018), and a centralized activation function. On CIFAR-10, CIFAR-100, and Tiny ImageNet, lSELU and sSELU achieve up to 2.13% higher test accuracy than previous studies. On ImageNet with Conv MobileNet V1, sSELU with Mixup, trainable λ, and a centralized activation function achieves test accuracy (71.95%) comparable with BN. Besides, we provide a CUDA kernel design for lSELU and sSELU that has only 2% overhead relative to SELU.
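The quantity φ(q) in the first bullet is easy to estimate numerically: for an elementwise activation f, the layer Jacobian is diagonal with entries f'(z), so with LeCun-scaled weights the second moment of its singular values reduces to E[f'(z)²] for z ~ N(0, q). The sketch below (our own illustration, under that reduction) estimates φ(q) for SELU and shows that it decreases as q grows, qualitatively tracking the ideal 1/q behavior:

```python
import numpy as np

# Standard SELU constants from Klambauer et al. (2017).
ALPHA, LAM = 1.6732632423543772, 1.0507009873554805

def selu_grad(z):
    # Derivative of SELU: lambda for z > 0, lambda * alpha * exp(z) for z <= 0.
    return LAM * np.where(z > 0, 1.0, ALPHA * np.exp(z))

def phi(q, n=2_000_000, seed=0):
    # Monte Carlo estimate of E[f'(z)^2] with z ~ N(0, q).
    z = np.random.default_rng(seed).normal(0.0, np.sqrt(q), size=n)
    return float(np.mean(selu_grad(z) ** 2))

for q in (0.5, 1.0, 2.0, 4.0):
    print(f"q={q:<4} phi(q)={phi(q):.3f}  1/q={1/q:.3f}")
```

SELU only approximates the ideal φ(q) = 1/q; the compensating direction (larger q, smaller φ) is what the definition above captures, and lSELU/sSELU are constructed to track 1/q more closely.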

2. RELATED WORK

In this section, we present an overview of existing studies on self-normalizing neural networks (SNNs) as well as statistical studies on forward and backward signals in deep neural networks.

Self-normalizing Neural Network. The Scaled Exponential Linear Unit (SELU) (Klambauer et al., 2017) scales the Exponential Linear Unit (ELU) by a constant scalar λ. The λ and the original ELU parameter α are configured such that the mean and variance of the output activations have a fixed point (0, 1). The authors further prove that this fixed point remains stable and attractive even when the input activations and the weights are unnormalized. Chen et al. (2020b) investigate the fixed point of the backward gradient. They reveal that the gradient of an SNN explodes at a rate of (1 + ε) per layer, where ε is a small positive value. The self-normalization property gets stronger when ε is larger, but the gradient then explodes at a higher rate. Therefore, they propose the depth-aware SELU, in which ε ≈ 1/L is used to derive the optimal α and λ in SELU for a network of depth L.
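The fixed-point configuration of λ and α can be verified directly: with the published constants, applying SELU to a standard normal input should return an output whose mean and second moment are again approximately (0, 1). A minimal check (our own sketch, not code from the cited works):

```python
import numpy as np

# Published SELU constants; chosen so that for z ~ N(0, 1) the output has
# (mean, second moment) ~ (0, 1), i.e. the normalization is preserved.
ALPHA, LAM = 1.6732632423543772, 1.0507009873554805

def selu(x):
    return LAM * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

z = np.random.default_rng(0).normal(size=5_000_000)
out = selu(z)
print(f"mean ~ {out.mean():.3f}, second moment ~ {np.mean(out**2):.3f}")
```

dSELU, lSELU, and sSELU re-derive these constants under different constraints (depth awareness, the extra parameter β), but the same fixed-point check applies to each.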

Deep Neural Networks. Schoenholz et al. (2016), Poole et al. (2016), and Burkholz & Dubatovka (2018) investigate the forward activations in the limit of large layer width with mean-field theory. They identify an Order-to-Chaos phase transition characterized by the second moment of the singular values of the network's input-output Jacobian matrix. A neural network performs well when it sits on the border between the order and chaos phases. On the other

