

Abstract

ReLU is one of the most popular activations in deep learning, thanks in particular to its stabilizing effect on training. However, because it is non-differentiable at the origin, it complicates analysis methods that examine derivatives, such as the Neural Tangent Kernel (NTK). Many smooth relaxations aim to retain the practical benefits of ReLU while increasing network regularity. Although their adoption has varied widely, some notable architectures (e.g., the BERT family) do utilize them. We present a theoretical characterization of smooth-ReLUs within fully-connected feedforward neural networks. In addition to the well-known SWISH and GeLU, we introduce GumbelLU, AlgebraicLU, and GudermanLU as new relaxations. All these activations are characterized by a positive temperature parameter, which we can lower to continuously improve the approximation. By studying the interplay of initialization schemes with temperature, we confirm that as these relaxations converge to ReLU, the statistical properties of the corresponding neural networks at initialization also converge to those of ReLU networks. Moreover, we derive temperature-dependent critical initialization schemes under which networks based on these activations exhibit stable ReLU-like behavior at any temperature. Finally, we empirically study both classes of networks on MNIST and CIFAR-10 in the full-batch training regime. We observe faster training dynamics for smooth-ReLU networks under our proposed initialization than under the standard one. While all networks exhibit very similar train-loss trajectories at criticality, smooth-ReLU networks feature differentiable NTKs throughout training, whereas ReLU networks exhibit stochastic NTK fluctuations. Our results clarify how smooth-ReLU relaxations reproduce the practical benefits of ReLU in everywhere-smooth neural networks.

1. Introduction

In recent decades, deep learning has shown tremendous success in, e.g., computer vision, natural language processing, and drug discovery (LeCun et al., 2015). For instance, EfficientNet (Tan and Le, 2019) achieved state-of-the-art performance in image classification on CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009). In natural language processing, GPT (Radford et al., 2018) and its newer versions (Radford et al., 2019; Brown et al., 2020) were capable of producing human-like responses to various reading comprehension tasks. As the name suggests, deep learning architectures are composed of many sequentially applied layers. The most basic architecture, the fully-connected feedforward network (FFN), consists of an alternating sequence of linear layers and non-linear layers called activations. One of the most popular activations is ReLU (Jarrett et al., 2009; Nair and Hinton, 2010), owing to its computational simplicity, gradient stability, and expressive power (Nair and Hinton, 2010; Raghu et al., 2017). Beyond ReLU, EfficientNet and others use SWISH (Ramachandran et al., 2017; Elfwing et al., 2018; Alcaide, 2018; Chieng et al., 2018; Howard et al., 2019), while GPT-2 and ALBERT (Lan et al., 2020) employ the GeLU activation (Hendrycks and Gimpel, 2016). Both are smooth approximations to ReLU, with the advantage that higher-order derivatives exist with which to study their properties (Hanin and Nica, 2020b; Li et al., 2021).
Activation smoothness is required to ensure smoothness of the network's input-output mapping; this is in turn necessary for certain applications (e.g., physics-informed neural networks and neural-network methods for solving differential equations require the network output to be a smooth function of its input (Raissi et al., 2019)), certain architectures (e.g., the existence of neural ODEs is typically proven assuming smooth activations (Chen et al., 2018)), and certain theoretical analysis techniques (e.g., differential Neural Tangent Kernel (NTK) approaches (Roberts et al., 2022)). Our focus is to understand the properties of smooth ReLU relaxations, which we call smooth-ReLUs, of the type σ_T(z) = z a(z/T), with T a positive temperature parameter and a(•) any sigmoid function ranging from 0 to 1. As the temperature is lowered, the smooth function a(z/T) converges pointwise to the Heaviside function H(z). Different choices of a(•) correspond to different smooth approximations to ReLU. The well-known SWISH and GeLU can be represented this way, and we introduce new activations with this property. Inspired by recent works based on the renormalization group and field theory (Roberts et al., 2022), we study the stability of information propagation through the layers of FFNs with smooth-ReLU variants. In particular, we study statistical properties over random network initializations, compare initialization and training dynamics of various smooth-ReLUs, and analyze to what extent they echo ReLU. Initializing weights and biases by sampling a standard normal distribution results in an exponential explosion of the variance of the representations with depth. Weight initializations whose variance is proportional to the inverse of the width allow deeper linear networks to train more reliably and rapidly, by keeping the variance of representations and gradients constant with depth.
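To make the family σ_T(z) = z a(z/T) concrete, the snippet below is an illustrative numpy sketch (not the paper's released code). It instantiates the family with the logistic sigmoid, which recovers SWISH, and with the Gaussian CDF Φ, which recovers GeLU at T = 1, and checks numerically that the approximation error to ReLU shrinks as T is lowered.

```python
import numpy as np
from scipy.stats import norm

def relu(z):
    return np.maximum(z, 0.0)

def smooth_relu(z, a, T=1.0):
    """Smooth-ReLU family: sigma_T(z) = z * a(z / T) for a sigmoid a."""
    return z * a(z / T)

# Two well-known members of the family:
logistic = lambda u: 1.0 / (1.0 + np.exp(-u))   # a = logistic sigmoid -> SWISH
gauss_cdf = norm.cdf                            # a = Phi (Gaussian CDF) -> GeLU

z = np.linspace(-5.0, 5.0, 1001)
for T in (1.0, 0.1, 0.01):
    # Pointwise convergence: the gap to ReLU vanishes as T -> 0.
    err = np.max(np.abs(smooth_relu(z, logistic, T) - relu(z)))
    print(f"T={T:5.2f}  max |SWISH_T - ReLU| = {err:.4f}")
```

For the logistic choice the maximal gap to ReLU scales linearly with T, so lowering the temperature tightens the approximation uniformly on any bounded interval.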
However, for non-linear activations, an additional activation-dependent proportionality factor is necessary to obtain depth stability (He et al., 2015; Roberts et al., 2022). Based on this observation, we quantify how similar different smooth-ReLU functions are to ReLU by comparing the variance of network representations at initialization. We also compare the behaviour of the training loss during full-batch training. To gain further insight into training dynamics, we also use NTK theory (Jacot et al., 2018). In recent years, kernel methods have found wide use in theoretically understanding neural networks' performance (Allen-Zhu et al., 2019; Zou et al., 2020; Liu et al., 2022; 2020; Yang, 2020). A flurry of works have extended the NTK analysis to all kinds of neural networks, with Arora et al. (2019) extending the NTK to convolutional networks, and more recently Feng and Kolter (2020) investigating the NTK analysis of large-depth limits in Deep Equilibrium Models (Bai et al., 2019).

Our work aims to improve our understanding of the expressivity and training of fully-connected deep smooth-ReLU networks and to devise simple prescriptions to improve them. Specifically, our contributions are as follows:

• Derive conditions for the existence of stable initialization schemes for smooth-ReLUs at any temperature, and provide code for computing the necessary hyper-parameters.

• Introduce GumbelLU, AlgebraicLU, and GudermanLU. We note that, in certain aspects of the analysis, GudermanLU is closer to ReLU than the other smooth-ReLUs analysed, including SWISH and GeLU.

• Demonstrate experimentally that the representation variance and NTK of smooth-ReLUs resemble those of ReLU under the stable initialization scheme we provide, for any temperature; the same holds at very low temperatures with standard initialization.

• Show that smooth-ReLUs typically have faster training dynamics with our proposed initialization, together with stable and continuous NTK updates, whereas ReLU's NTK shows stochasticity instead. We observe that the smooth-ReLU NTK updates under our stable initialization are similar across temperatures.
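The role of the activation-dependent factor in the initialization variance can be illustrated with a small numerical diagnostic. The sketch below (hypothetical width and depth; the gain factor 2 for ReLU follows He et al., 2015) propagates a random input through a randomly initialized FFN and records the preactivation variance at each layer, contrasting the plain 1/n scaling with the ReLU-critical 2/n scaling.

```python
import numpy as np

def variance_profile(depth, width, weight_var, act, seed=0):
    """Preactivation variance at each layer of one random FFN (zero biases)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(width)  # treat the input as the first preactivation
    profile = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(weight_var)
        z = W @ act(z)
        profile.append(z.var())
    return np.array(profile)

relu = lambda z: np.maximum(z, 0.0)
n = 500
naive = variance_profile(depth=20, width=n, weight_var=1.0 / n, act=relu)
critical = variance_profile(depth=20, width=n, weight_var=2.0 / n, act=relu)
print(f"final-layer variance, Var(W_ij)=1/n: {naive[-1]:.2e}")
print(f"final-layer variance, Var(W_ij)=2/n: {critical[-1]:.2e}")
```

With Var(W_ij) = 1/n, ReLU zeroes out roughly half of each representation and the variance decays exponentially with depth; with the critical 2/n scaling it stays of order one, which is the depth-stability property the contributions above generalize to smooth-ReLUs at any temperature.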

2.1. Initialization Covariance

To better understand the trainability of an FFN, we pay attention to how the magnitude of the input propagates through the layers. Let us denote the layer preactivations as

$$z^{(l)}_{i;\alpha} = \sum_{j=1}^{n} W^{(l)}_{ij}\, \sigma\big(z^{(l-1)}_{j;\alpha}\big) + b^{(l)}_{i},$$
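In code, this recursion can be sketched as follows (an illustrative numpy snippet; the layer widths, weights, and the convention that the first preactivation is z^(1) = W^(1) x + b^(1) are assumptions for the example, not the paper's implementation):

```python
import numpy as np

def forward_preactivations(x, weights, biases, act):
    """Return the list of preactivations z^(l), following
    z^(l) = W^(l) @ act(z^(l-1)) + b^(l), with z^(1) = W^(1) @ x + b^(1)."""
    zs = []
    h = x  # post-activation of the previous layer (the input for l = 1)
    for W, b in zip(weights, biases):
        z = W @ h + b
        zs.append(z)
        h = act(z)
    return zs

# Tiny worked example: a 3 -> 2 -> 1 network with ReLU.
relu = lambda z: np.maximum(z, 0.0)
weights = [np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]]),
           np.array([[1.0, 1.0]])]
biases = [np.zeros(2), np.array([0.5])]
x = np.array([1.0, -2.0, 3.0])
zs = forward_preactivations(x, weights, biases, relu)
print(zs)
```

Tracking the variance of these preactivations over random draws of the weights and biases is the basic diagnostic used throughout this section.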

