

Abstract

ReLU is one of the most popular activations in deep learning, especially thanks to its stabilizing effect on training. However, because it is nondifferentiable at the origin, it complicates the use of analysis methods that examine derivatives, such as the Neural Tangent Kernel (NTK). Many smooth relaxations aim to retain the practical benefits of ReLU while increasing network regularity. Although their success has varied, some notable architectures (e.g., the BERT family) do employ them. We present a theoretical characterization of smooth ReLU relaxations within fully connected feedforward neural networks. In addition to the well-known SWISH and GeLU, we introduce GumbelLU, AlgebraicLU, and GudermanLU as new relaxations. All these activations are characterized by a positive temperature parameter, which can be lowered to continuously improve the approximation. By studying the interplay of initialization schemes with temperature, we confirm that as these relaxations converge to ReLU, the statistical properties of the corresponding neural networks at initialization also converge to those of ReLU networks. Moreover, we derive temperature-dependent critical initialization schemes under which networks based on these activations exhibit stable, ReLU-like behavior at any temperature. Finally, we empirically study both classes of networks on MNIST and CIFAR-10 in the full-batch training regime. We observe faster training dynamics for smooth-ReLU networks with our proposed initialization than with the standard one. While all networks exhibit very similar train-loss trajectories at criticality, smooth-ReLU networks feature differentiable NTKs throughout training, whereas ReLU networks exhibit stochastic NTK fluctuations. Our results clarify how smooth ReLU relaxations reproduce the practical benefits of ReLU in everywhere-smooth neural networks.

1. Introduction

In recent decades, deep learning has shown tremendous success in, e.g., computer vision, natural language processing, and drug discovery (LeCun et al., 2015). For instance, EfficientNet (Tan and Le, 2019) achieved state-of-the-art performance in image classification on CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009). In natural language processing, GPT (Radford et al., 2018) and its newer versions (Radford et al., 2019; Brown et al., 2020) were capable of producing human-like responses to various reading comprehension tasks. As the name suggests, deep learning architectures are composed of many sequentially applied layers. The most basic architecture, the fully connected feedforward network (FFN), consists of an alternating sequence of linear layers and non-linear layers called activations. One of the most popular activations is ReLU (Jarrett et al., 2009; Nair and Hinton, 2010), owing to its computational simplicity, gradient stability, and expressive power (Nair and Hinton, 2010; Raghu et al., 2017). Beyond ReLU, EfficientNet and others use SWISH (Ramachandran et al., 2017; Elfwing et al., 2018; Alcaide, 2018; Chieng et al., 2018; Howard et al., 2019). GPT-2 and ALBERT (Lan et al., 2020) employ the GeLU activation (Hendrycks and Gimpel, 2016). Both are smooth approximations to ReLU, with the advantage that higher-order derivatives exist, which makes it possible to study their properties (Hanin and Nica, 2020b; Li et al., 2021). Activation smoothness is required to ensure smoothness of the network's input-output mapping; this is in turn necessary for certain applications (e.g., physics-informed neural networks and neural network methods for solving differential
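To make the preceding discussion concrete, the two named relaxations admit simple closed forms: SWISH is x·σ(x) and GeLU is x·Φ(x), with σ the logistic sigmoid and Φ the standard normal CDF. The sketch below adds a temperature parameter τ by rescaling the pre-activation, so that both recover ReLU as τ → 0⁺; this particular parameterization is an illustrative convention of ours and need not match the one developed later in the paper.

```python
import math

def relu(x: float) -> float:
    """ReLU: max(x, 0); nondifferentiable at the origin."""
    return max(x, 0.0)

def swish(x: float, tau: float = 1.0) -> float:
    """SWISH/SiLU with temperature: x * sigmoid(x / tau).

    As tau -> 0+, sigmoid(x / tau) tends to the step function,
    so swish converges pointwise to ReLU.
    """
    return x / (1.0 + math.exp(-x / tau))

def gelu(x: float, tau: float = 1.0) -> float:
    """GeLU with temperature: x * Phi(x / tau), Phi = standard normal CDF.

    Phi(z) = 0.5 * (1 + erf(z / sqrt(2))); again ReLU is recovered
    in the tau -> 0+ limit.
    """
    return x * 0.5 * (1.0 + math.erf(x / (tau * math.sqrt(2.0))))
```

At moderate temperatures both functions are smooth everywhere, while at small τ they are already numerically indistinguishable from ReLU away from the origin (e.g., swish(2.0, tau=0.1) ≈ relu(2.0)).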

