BIDIRECTIONALLY SELF-NORMALIZING NEURAL NETWORKS

Abstract

The problem of exploding and vanishing gradients has been a long-standing obstacle to the effective training of neural networks. Despite the various tricks and techniques that have been employed to alleviate the problem in practice, satisfactory theories and provable solutions are still lacking. In this paper, we address the problem from the perspective of high-dimensional probability theory. We provide a rigorous result showing that, under mild conditions, the exploding/vanishing gradient problem disappears with high probability if the neural network is sufficiently wide. Our main idea is to constrain both forward and backward signal propagation in a nonlinear neural network through a new class of activation functions, namely Gaussian-Poincaré normalized functions, together with orthogonal weight matrices. Experiments on both synthetic and real-world data validate our theory and confirm its effectiveness on very deep neural networks in practice.
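As a small illustration of the role of orthogonal weight matrices (the NumPy sketch below is ours, not code from this paper): an orthogonal matrix preserves the Euclidean norm of a vector, so linear forward propagation through W and gradient propagation through its transpose neither amplify nor shrink the signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a random orthogonal matrix via QR decomposition of a Gaussian matrix.
a = rng.normal(size=(256, 256))
q, _ = np.linalg.qr(a)

x = rng.normal(size=256)
forward = q @ x    # forward signal propagation through the weight matrix
backward = q.T @ x  # backward (gradient) propagation uses the transpose

# Orthogonality implies ||Qx|| = ||x|| in both directions.
print(np.allclose(np.linalg.norm(forward), np.linalg.norm(x)))   # True
print(np.allclose(np.linalg.norm(backward), np.linalg.norm(x)))  # True
```

With nonlinear activations between layers this exact norm preservation no longer holds, which is what the Gaussian-Poincaré normalization of the activation functions is meant to address.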

1. INTRODUCTION

Neural networks have achieved unprecedented performance in a variety of artificial intelligence tasks (Graves et al., 2013; Krizhevsky et al., 2012; Silver et al., 2017). However, despite decades of research, training neural networks is still guided mostly by empirical observations, and successful training often requires various heuristics and extensive hyperparameter tuning. It is therefore desirable to understand the cause of the difficulty of neural network training and to propose theoretically sound solutions. A major difficulty is the gradient exploding/vanishing problem (Glorot & Bengio, 2010; Hochreiter, 1991; Pascanu et al., 2013; Philipp et al., 2018): the norm of the gradient in each layer grows or shrinks at an exponential rate as the gradient signal is propagated from the top layer to the bottom layer. In deep neural networks, this problem can cause numerical overflow and make the optimization problem intrinsically difficult, since the gradients in different layers have vastly different magnitudes and the optimization landscape therefore becomes pathological. One might attempt to solve the problem by simply normalizing the gradient in each layer. Indeed, adaptive gradient optimization methods (Duchi et al., 2011; Kingma & Ba, 2015; Tieleman & Hinton, 2012) implement this idea and are widely used in practice. However, one might also wonder whether there is a solution more intrinsic to deep neural networks, whose internal structure, if well exploited, could lead to further advances.

To enable the trainability of deep neural networks, batch normalization (Ioffe & Szegedy, 2015) was proposed in recent years and has achieved widespread empirical success. Batch normalization is a differentiable operation that normalizes its inputs based on mini-batch statistics and is inserted between the linear and nonlinear layers. It has been reported that batch normalization can accelerate neural network training significantly (Ioffe & Szegedy, 2015).
However, batch normalization does not solve the gradient exploding/vanishing problem (Philipp et al., 2018); indeed, it has been proved that batch normalization can actually worsen the problem (Yang et al., 2019). Moreover, batch normalization requires separate training and testing phases and can be ineffective when the mini-batch size is small (Ioffe, 2017). These shortcomings of batch normalization motivate us to search for a more principled and generic approach to the gradient exploding/vanishing problem.
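For concreteness, the mini-batch normalization operation discussed above can be sketched as follows. This is an illustrative NumPy implementation of the standard training-mode computation of Ioffe & Szegedy (2015); the function name and interface are ours, not from that work, and running statistics for test time are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization over a mini-batch.

    x: array of shape (batch_size, features), e.g. linear-layer outputs.
    gamma, beta: learnable per-feature scale and shift, shape (features,).
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # ~zero mean, unit variance
    return gamma * x_hat + beta            # learnable affine transform

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # shifted, scaled inputs
y = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))  # True: means normalized
```

Note that the statistics depend on the mini-batch, which is why the operation behaves differently at test time and degrades with small batches.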

