TRAINING INSTABILITY AND DISHARMONY BETWEEN RELU AND BATCH NORMALIZATION

Abstract

Deep neural networks that combine batch normalization and ReLU-like activation functions suffer instability during the early stages of training owing to the large gradients induced by a temporary gradient explosion: ReLU reduces the variance by more than the expected amount, and batch normalization amplifies the gradient while recovering it. In this paper, we mathematically explain how the gradient explodes while forward propagation remains stable, and why the problem is alleviated as training progresses. Based on this analysis, we propose a Layer-wise Asymmetric Learning rate Clipping (LALC) algorithm, which outperforms existing learning rate scaling methods in large-batch training and can also replace WarmUp in small-batch training.

1. INTRODUCTION

The success of deep neural networks is based on their great expressive power, which increases exponentially with depth (Chatziafratis et al., 2019). However, the deep hierarchical architecture of such networks may induce the exploding/vanishing gradient problem (Pascanu et al., 2013), which degrades performance and may even make training impossible. Several techniques have been suggested to address this problem, including better weight initialization (Glorot & Bengio, 2010; He et al., 2015), activation functions (Nair & Hinton, 2010; Klambauer et al., 2017), residual connections (He et al., 2016), and normalization methods (Ioffe & Szegedy, 2015; Ulyanov et al., 2016; Nam & Kim, 2018; Ba et al., 2016; Wu & He, 2018; Qiao et al., 2019). However, contrary to popular belief, gradient explosion is present in various modern deep learning architectures that use batch normalization and ReLU-like activation functions (Figure 1). A stable flow of forward activations does not guarantee a stable flow of backward gradients when an entropy difference exists between the layers (Philipp et al., 2017). Gradient explosion in modern deep learning architectures has been reported in a number of papers (Philipp et al., 2017; Frankle et al., 2020; You et al., 2017), but its cause and solution have not been sufficiently researched. Philipp et al. (2017) discussed the context and presented an intuitive understanding of the problem, considering the modality of gradient explosion while activations remain stable, the increase in entropy induced by the activation function, and the alleviation of the problem through residual learning. However, although the authors observed that gradient explosion occurs only in the case of batch normalization, they did not explain this phenomenon.
They speculated that the sampling error of the normalization procedure amplifies the problem; however, no clear correlation has been reported between the gradient explosion rate and the expected sampling error (batch size, different normalization schemes, etc.). To the best of our knowledge, this is the first work to mathematically demonstrate how the disharmony between the activation function and batch normalization causes gradient explosion and training instability during the early stages of neural network training. The alleviation of the problem during training is also discussed. The exploding/vanishing gradient is a well-known problem in deep learning, and it may seem unnatural that it still exists yet is not widely known. One reason is that the problem is not as severe as it once was. The exploding rate is approximately √(π/(π−1)) ≈ 1.21 per effective depth, which is tolerable in networks with tens of layers or in networks with hundreds of layers and dense residual connections. Moreover, the exploding gradient at the initialization state is rapidly relieved during training, and even a vanishing gradient state can be approached. Thus, this problem has been referred to by different names in the literature.
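The rate above can be sanity-checked numerically. The short NumPy script below (an illustrative sketch of ours, not code from the paper) verifies that ReLU reduces the variance of a unit Gaussian to (π − 1)/(2π) ≈ 0.34 rather than to the commonly assumed 1/2, which is exactly what implies a per-layer gradient growth of √(π/(π−1)) ≈ 1.21 once batch normalization restores unit forward variance:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000_000)   # batch-normalized pre-activations ~ N(0, 1)

relu_var = np.maximum(z, 0.0).var()   # empirical Var(ReLU(z))
analytic = 0.5 - 1.0 / (2.0 * np.pi)  # = (pi - 1) / (2*pi) ≈ 0.341, not 1/2

# Backward pass: the ReLU mask halves the gradient variance (factor 1/2),
# while batch normalization rescales by 1/Var(ReLU(z)) to restore unit
# forward variance. The net per-layer growth of the gradient std is:
rate = np.sqrt(0.5 / relu_var)        # ≈ sqrt(pi / (pi - 1)) ≈ 1.21
print(f"Var(ReLU(z)) = {relu_var:.4f} (analytic {analytic:.4f}), rate = {rate:.3f}")
```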

2. HOW GRADIENT EXPLOSION OCCURS

As weights are repeatedly multiplied during forward/backward propagation, the exploding or vanishing gradient problem is commonly believed to be caused by excessively large or small parameters (Bengio et al., 1994; Pascanu et al., 2013). Thus, it has largely been treated as a problem of initializing and maintaining optimal weight scales. In that case, the problem would have been 'solved' with the advent of (batch) normalization (Ioffe & Szegedy, 2015), which automatically corrects suboptimal choices of weight scales. However, maintaining the norm of the weights, and thereby the norm of the forward propagation, does not automatically maintain the norm of the backward propagation. The Rectified Linear Unit (ReLU) (Nair & Hinton, 2010), as well as its smoother variants (Maas et al., 2013; Klambauer et al., 2017; Clevert et al., 2015; Ramachandran et al., 2017; Hendrycks & Gimpel, 2016), blocks approximately half of the activations at each instance. On this basis, He et al. (2015) assumed that it halves the output variance under certain zero-centered assumptions, and concluded that initializing the weights with N(0, 2/n_out) maintains similar variances in both forward and backward propagation. However, the relationship of the activation function with its input should also be considered. As depicted in Figure 2, passing only the positive part of the activations is different from blocking randomly selected activations. In short, He et al. (2015) described the situation assuming DropOut (Srivastava et al., 2014) to be the activation function instead of ReLU. In that case, both the forward and backward signals are roughly halved, and neither an exploding nor a vanishing gradient occurs. This is verified in
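The contrast between ReLU and random blocking can also be simulated directly. The NumPy sketch below is ours, not the paper's code, and it simplifies the batch-normalization backward pass by treating the batch statistics as constants, which preserves the dominant 1/σ rescaling; it stacks normalize-then-mask layers and measures the total growth of the gradient variance under the two masking schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, width, depth = 256, 512, 30

def gradient_growth(block_randomly):
    """Var(g_input) / Var(g_output) through a normalized, masked stack."""
    h = rng.standard_normal((batch, width))
    caches = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        z = h @ W
        std = z.std(axis=0)                      # per-feature batch statistics
        z = (z - z.mean(axis=0)) / std           # batch normalization (forward)
        if block_randomly:
            mask = rng.random(z.shape) < 0.5     # DropOut-like: random half blocked
        else:
            mask = z > 0                         # ReLU: only the positive half kept
        h = z * mask
        caches.append((W, std, mask))
    g = rng.standard_normal(h.shape)             # gradient arriving at the top
    g_out_var = g.var()
    for W, std, mask in reversed(caches):        # simplified backward pass
        g = (g * mask) / std                     # through the mask, then BN rescaling
        g = g @ W.T                              # through the linear layer
    return g.var() / g_out_var

print(gradient_growth(block_randomly=False))     # ReLU: grows like (pi/(pi-1))**depth
print(gradient_growth(block_randomly=True))      # random blocking: stays O(1)
```

With ReLU, the post-mask variance is (π − 1)/(2π), so the normalization step divides the gradient by that amount while the mask only halves it, compounding to roughly (π/(π−1))^depth; with random blocking, the two factors cancel exactly and the gradient variance stays flat.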



Figure 1: Gradient explosion rate (√(Var(g_n)/Var(g_N))) of deep neural network models at the initialization state corresponding to (a) different architectures and (b) activation functions. The explosion rate is approximately √(π/(π−1)) per effective depth in vanilla networks with batch normalization, but it is lower in architectures with residual connections (He et al., 2016), which reduce the effective depth. Moreover, gradient explosion does not occur in architectures with layer normalization (Ba et al., 2016), including transformer-based architectures (Dosovitskiy et al., 2020; Liu et al., 2021). Figure (b) is plotted using a 51-layer VGG-like (Simonyan & Zisserman, 2014) architecture. Smoother variants of ReLU (Maas et al., 2013; Clevert et al., 2015; Ramachandran et al., 2017; Hendrycks & Gimpel, 2016) exhibit lower exploding rates, as they are flatter near zero and block less of the signal at the initialization state. In the extreme case, gradient explosion does not occur if no activation function is used. Note that it also does not occur with DropOut (Srivastava et al., 2014), which can be regarded as a ReLU that blocks signals randomly.

