TRAINING INSTABILITY AND DISHARMONY BETWEEN RELU AND BATCH NORMALIZATION

Abstract

Deep neural networks that combine batch normalization and ReLU-like activation functions suffer instability during the early stages of training owing to the large gradients induced by a temporary gradient explosion: ReLU reduces the variance by more than the expected amount, and batch normalization amplifies the gradient while recovering it. In this paper, we mathematically explain why the gradient explodes while forward propagation remains stable, and why the problem is alleviated as training proceeds. Based on this analysis, we propose a Layer-wise Asymmetric Learning rate Clipping (LALC) algorithm, which outperforms existing learning rate scaling methods in large-batch training and can also replace WarmUp in small-batch training.

1. INTRODUCTION

The success of deep neural networks is based on their great expressive power, which increases exponentially with depth (Chatziafratis et al., 2019). However, the deep hierarchical architecture of a deep neural network may induce the exploding/vanishing gradient problem (Pascanu et al., 2013), which degrades performance and may even make training impossible. Several techniques have been suggested to address this problem, including better weight initialization (Glorot & Bengio, 2010; He et al., 2015), activation functions (Nair & Hinton, 2010; Klambauer et al., 2017), residual connections (He et al., 2016), normalization methods (Ioffe & Szegedy, 2015; Ulyanov et al., 2016; Nam & Kim, 2018; Ba et al., 2016; Wu & He, 2018; Qiao et al., 2019), etc. However, contrary to popular belief, gradient explosion is present in various modern deep learning architectures that use batch normalization and ReLU-like activation functions (Figure 1). A stable flow of forward activations does not guarantee a stable flow of backward gradients when an entropy difference exists between the layers (Philipp et al., 2017). Gradient explosion in modern deep learning architectures has been reported in a number of papers (Philipp et al., 2017; Frankle et al., 2020; You et al., 2017), but its cause and solution have not been sufficiently researched. Philipp et al. (2017) discussed the context and presented an intuitive understanding of the problem, considering the modality of gradient explosion while activations remain stable, the increase in entropy induced by the activation function, the alleviation of the problem by residual learning, etc. However, although the authors observed that gradient explosion occurs only in the case of batch normalization, they did not explain this phenomenon.
They speculated that the sampling error of the normalization procedure amplifies the problem; however, no clear correlation has been reported between the gradient explosion rate and the expected sampling error (batch size, different normalization schemes, etc.). To the best of our knowledge, this is the first attempt to mathematically demonstrate how the disharmony between the activation function and batch normalization causes gradient explosion and training instability during the early stages of neural network training. The alleviation of the problem during training is also discussed.

The exploding/vanishing gradient is a well-known problem in deep learning, so it may seem surprising that it still exists yet is not widely recognized. One reason is that the problem is not as severe as it once was. The exploding rate is approximately √(π/(π−1)) ≈ 1.21 per effective depth. This is tolerable in networks with tens of layers, or in networks with hundreds of layers and dense residual connections. Moreover, the exploding gradient at the initialization state is rapidly relieved during training, and even a vanishing-gradient state can be approached. Thus, this problem has been referred to by different names in
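The phenomenon described above can be reproduced numerically. The following NumPy sketch (our illustration, not code from the paper; all sizes and names are arbitrary choices) builds a deep batch-normalized ReLU MLP at initialization, checks that the forward activation scale stays constant across layers, and backpropagates a random cotangent to measure the per-layer gradient growth, which should come out near √(π/(π−1)) ≈ 1.21:

```python
import numpy as np

# Toy demonstration (assumed setup, not from the paper): at initialization,
# a deep BN+ReLU MLP has stable forward activations but a backward gradient
# that grows by roughly sqrt(pi/(pi-1)) ~ 1.21 per layer.
rng = np.random.default_rng(0)
N, D, L = 512, 256, 20                             # batch size, width, depth

x = rng.standard_normal((N, D))
cache, h = [], x
for _ in range(L):
    W = rng.standard_normal((D, D)) / np.sqrt(D)   # variance-preserving init
    z = h @ W
    mu, sd = z.mean(0), z.std(0) + 1e-8
    y = (z - mu) / sd                              # batch norm (gamma=1, beta=0)
    h = np.maximum(y, 0.0)                         # ReLU shrinks variance below 1
    cache.append((W, y, sd))

# Forward scale is pinned by BN+ReLU: std stays near sqrt(1/2 - 1/(2*pi)) ~ 0.58.
fwd_std = h.std()

# Backpropagate a random cotangent and measure total gradient growth.
g = rng.standard_normal((N, D))
g0 = np.linalg.norm(g)
for W, y, sd in reversed(cache):
    g = g * (y > 0)                                   # ReLU backward
    g = (g - g.mean(0) - y * (g * y).mean(0)) / sd    # batch-norm backward
    g = g @ W.T                                       # linear backward
rate = (np.linalg.norm(g) / g0) ** (1.0 / L)          # per-layer growth factor
print(f"forward std {fwd_std:.3f}, per-layer gradient growth {rate:.3f}")
```

Intuitively, ReLU shrinks the per-feature standard deviation to about 0.58, batch normalization divides by that factor in both the forward and backward pass, and the ReLU mask only shrinks the backward gradient by 1/√2, leaving a net backward growth of √(2π/(π−1))·(1/√2) = √(π/(π−1)) per layer even though the forward scale is fixed.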

