ARELU: ATTENTION-BASED RECTIFIED LINEAR UNIT

Abstract

Element-wise activation functions play a critical role in deep neural networks by affecting the expressive power and the learning dynamics. Learning-based activation functions have recently gained increasing attention and success. We propose a new perspective on learnable activation functions by formulating them with an element-wise attention mechanism. In each network layer, we devise an attention module which learns an element-wise, sign-based attention map for the pre-activation feature map. The attention map scales an element based on its sign. Adding the attention module to a rectified linear unit (ReLU) results in an amplification of positive elements and a suppression of negative ones, both with learned, data-adaptive parameters. We coin the resulting activation function Attention-based Rectified Linear Unit (AReLU). Since ReLU can be viewed as an identity transformation over its activated part, the attention module essentially learns an element-wise residue of the activated part of the input. This makes network training more resistant to gradient vanishing. The learned attentive activation leads to well-focused activation of relevant regions of a feature map. Through extensive evaluations, we show that AReLU significantly boosts the performance of most mainstream network architectures with only two extra learnable parameters introduced per layer. Notably, AReLU facilitates fast network training under small learning rates, which makes it especially suited to transfer learning and meta learning.

1. INTRODUCTION

Activation functions, which introduce nonlinearities into artificial neural networks, are essential to a network's expressive power and learning dynamics. Designing activation functions that facilitate fast training of accurate deep neural networks is an active area of research (Maas et al., 2013; Goodfellow et al., 2013; Xu et al., 2015a; Clevert et al., 2015; Hendrycks & Gimpel, 2016; Klambauer et al., 2017; Barron, 2017; Ramachandran et al., 2017). Aside from the large body of hand-designed functions, learning-based approaches have recently gained more attention and success (Agostinelli et al., 2014; He et al., 2015; Manessi & Rozza, 2018; Molina et al., 2019; Goyal et al., 2019). Existing learnable activation functions are motivated either by relaxing/parameterizing a non-learnable activation function (e.g., Rectified Linear Units (ReLU) (Nair & Hinton, 2010)) with learnable parameters (He et al., 2015), or by seeking a data-driven combination of a pool of pre-defined activation functions (Manessi & Rozza, 2018). In both cases, activation functions are made data-adaptive by introducing degrees of freedom and/or enlarging the hypothesis space explored. In this work, we propose a new perspective on learnable activation functions by formulating them with an element-wise attention mechanism. A straightforward motivation is the observation that both activation functions and element-wise attention functions are applied as network modules of element-wise multiplication. More intriguingly, learning element-wise activation functions in a neural network can be viewed as a task-oriented attention mechanism (Chorowski et al., 2015; Xu et al., 2015b), i.e., learning where (which element of the input feature map) to attend (activate) given an end task to fulfill. This motivates an arguably more interpretable formulation of attentive activation functions. Attention mechanisms have been a cornerstone of deep learning.
An attention mechanism directs the network to learn which part of the input is more relevant or contributes more to the output. There have been many variants of attention modules with plentiful successful applications. In natural language processing, vector-wise attention is developed to model the long-range dependencies in a sequence of word vectors (Luong et al., 2015; Vaswani et al., 2017). Many computer vision tasks utilize pixel-wise or channel-wise attention modules for more expressive and invariant representation learning (Xu et al., 2015b; Chen et al., 2017). Element-wise attention (Bochkovskiy et al., 2020) is the most fine-grained: each element of a feature volume can receive a different amount of attention. Consequently, it attains high expressivity with neuron-level degrees of freedom. Inspired by this, we devise for each layer of a network an element-wise attention module which learns a sign-based attention map for the pre-activation feature map. The attention map scales an element based on its sign. By adding the attention module and a ReLU module, we obtain the Attention-based Rectified Linear Unit (AReLU), which amplifies positive elements and suppresses negative ones, both with learned, data-adaptive parameters. The attention module essentially learns an element-wise residue for the activated elements with respect to ReLU, since the latter can be viewed as an identity transformation; this helps ameliorate the gradient vanishing issue effectively. Through extensive experiments on several public benchmarks, we show that AReLU significantly boosts the performance of most mainstream network architectures with only two extra learnable parameters introduced per layer. Moreover, AReLU enables fast learning under small learning rates, making it especially suited to transfer learning. We also demonstrate with feature map visualization that the learned attentive activation achieves well-focused, task-oriented activation of relevant regions.

2. RELATED WORK

Non-learnable activation functions Sigmoid is a non-linear, saturating activation function used mostly in the output layers of deep learning models. However, it suffers from the exploding/vanishing gradient problem. As a remedy, the rectified linear unit (ReLU) (Nair & Hinton, 2010) has become the most widely used activation function for deep learning models, with state-of-the-art performance in many applications. Many variants of ReLU have been proposed to further improve its performance on different tasks, e.g., LReLU (Maas et al., 2013), ReLU6 (Krizhevsky & Hinton, 2010), and RReLU (Xu et al., 2015a). Besides these, several specialized activation functions have been designed for different purposes, such as CELU (Barron, 2017), ELU (Clevert et al., 2015), GELU (Hendrycks & Gimpel, 2016), Maxout (Goodfellow et al., 2013), SELU (Klambauer et al., 2017), Softplus (Glorot et al., 2011), and Swish (Ramachandran et al., 2017). Learnable activation functions Recently, learnable activation functions have drawn more attention. PReLU (He et al., 2015), a variant of ReLU, improves model fitting with little extra computational cost and overfitting risk. More recently, PAU (Molina et al., 2019) was proposed to not only approximate common activation functions but also learn new ones, while providing compact representations with few learnable parameters. Several other learnable activation functions, such as APL (Agostinelli et al., 2014), Comb (Manessi & Rozza, 2018), and SLAF (Goyal et al., 2019), also achieve promising performance on different tasks. Attention Mechanism Vector-Wise Attention Mechanism (VWAM) has been widely applied in Natural Language Processing (NLP) tasks (Xu et al., 2015c; Luong et al., 2015; Bahdanau et al., 2014; Vaswani et al., 2017; Ahmed et al., 2017). VWAM learns which vector among a sequence of word vectors is the most relevant to the task at hand.
Channel-Wise Attention Mechanism (CWAM) can be regarded as an extension of VWAM from NLP to vision tasks (Tang et al., 2019b; 2020; Kim et al., 2019); it learns to assign each channel an attention value. Pixel-Wise Attention Mechanism (PWAM) is also widely used in vision (Tang et al., 2019c; a). Element-Wise Attention Mechanism (EWAM) assigns a different value to each element without any spatial/channel constraint. The recently proposed YOLOv4 (Bochkovskiy et al., 2020) is the first work to introduce an EWAM, implemented with a convolutional layer and a sigmoid function; it achieves state-of-the-art performance on object detection. We introduce a new kind of EWAM for defining a learnable activation function.

3. METHOD

We start by reviewing attention mechanisms and then introduce the element-wise sign-based attention mechanism on which AReLU is built. The optimization of AReLU then follows.

3.1. ATTENTION MECHANISMS

Let us denote

V = {v_i} ∈ R^{D_v^1 × D_v^2 × ···} a tensor representing input data or a feature volume. A function Φ, parameterized by Θ = {θ_i}, computes an attention map S = {s_i} ∈ R^{D_v^{θ(1)} × D_v^{θ(2)} × ···} over a subspace of V (let θ(·) denote a correspondence function for the indices of dimensions):

s_i = Φ(v_i, Θ).    (1)

Φ can be implemented as a neural network with Θ being its learnable parameters. We can modulate the input V with the attention map S using a function Ψ, obtaining the output U = {u_i} ∈ R^{D_v^1 × D_v^2 × ···}:

u_i = Ψ(v_i, s_i),    (2)

where Ψ is an element-wise multiplication. To perform the element-wise multiplication, one first extends S to the full dimension of V. We next review various attention mechanisms with attention maps at different granularities; Figure 1 (left) gives an illustration.

Vector-wise Attention Mechanism In NLP, attention maps are usually computed over different word vectors. In this case, V = {v_i} ∈ R^{N×D} represents a sequence of N feature vectors of dimension D, and S = {s_i} ∈ R^N is a sequence of attention values for the corresponding vectors.

Channel-wise Attention Mechanism

In computer vision, a feature volume V = {v_i} ∈ R^{W×H×C} has a spatial dimension of W × H and a channel dimension of C. S = {s_i} ∈ R^C is an attention map over the C channels. All elements in each channel share the same attention value.

Spatial-wise Attention Mechanism

Consider again V = {v_i} ∈ R^{W×H×C} with a spatial dimension of W × H. S = {s_i} ∈ R^{W×H} is an attention map over the spatial dimensions. All channels of a given spatial location share the same attention value.

Element-wise Attention Mechanism

Given a feature volume V = {v_i} ∈ R^{W×H×C} containing W × H × C elements, we compute an attention map over the whole volume (all elements), i.e., S = {s_i} ∈ R^{W×H×C}, so that each element has an independent attention value.
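The granularities above differ only in the shape of the attention map S and in how it broadcasts over the feature volume during the modulation of Eq. (2). The following NumPy sketch (our own illustration, with arbitrary shapes not taken from the paper) makes the shapes concrete:

```python
import numpy as np

W, H, C = 4, 4, 3
V = np.random.randn(W, H, C)  # feature volume

# Channel-wise: one attention value per channel, shared by all W*H locations.
S_channel = np.random.rand(C)            # shape (C,)
U_channel = V * S_channel                # broadcasts over the spatial grid

# Spatial-wise: one attention value per location, shared by all C channels.
S_spatial = np.random.rand(W, H, 1)      # shape (W, H, 1)
U_spatial = V * S_spatial

# Element-wise: an independent attention value for every element.
S_element = np.random.rand(W, H, C)      # shape (W, H, C)
U_element = V * S_element

# All three modulations return a volume of the original shape.
assert U_channel.shape == U_spatial.shape == U_element.shape == (W, H, C)
```

The extension of S "to the full dimension of V" mentioned above is exactly what NumPy broadcasting performs implicitly here.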

3.2. ELEMENT-WISE SIGN-BASED ATTENTION (ELSA)

We propose ELSA, a new kind of element-wise attention mechanism, which we use to define our attention-based activation. Given a feature volume V = {v_i} ∈ R^{W×H×C}, we compute an element-wise attention map S = {s_i} ∈ R^{W×H×C}:

s_i = Φ(v_i, Θ) = C(α) if v_i < 0;  σ(β) if v_i ≥ 0,    (3)

where Θ = {α, β} ∈ R^2 are the learnable parameters, C(·) clamps the input variable into [0.01, 0.99], and σ is the sigmoid function. The modulation function Ψ is defined as:

u_i = Ψ(v_i, s_i) = s_i v_i.    (4)

In ELSA, positive and negative elements receive different amounts of attention, determined by the two parameters β and α, respectively. Therefore, it can also be regarded as a sign-wise attention mechanism. With only two learnable parameters, ELSA is light-weight and easy to learn.
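As a concrete illustration, ELSA's forward pass (Eqs. 3 and 4) takes only a few lines of NumPy; the function names here are ours, and the constants follow the definition above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elsa(V, alpha, beta):
    """Element-wise sign-based attention: Eq. (3) builds the attention
    map S, and Eq. (4) modulates the input with it."""
    S = np.where(V < 0, np.clip(alpha, 0.01, 0.99), sigmoid(beta))
    return S * V

V = np.array([-2.0, -0.5, 0.0, 1.5])
U = elsa(V, alpha=0.9, beta=2.0)
# Negative elements are scaled by clamp(alpha) = 0.9;
# non-negative ones by sigmoid(beta), roughly 0.88 here.
```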

3.3. ARELU: ATTENTION-BASED RECTIFIED LINEAR UNITS

We represent the function Φ in ELSA as a network layer with learnable parameters α and β:

L(x_i, α, β) = C(α)·x_i if x_i < 0;  σ(β)·x_i if x_i ≥ 0,    (5)

where X = {x_i} is the input to the current layer. To construct an activation function with ELSA, we combine it with the standard rectified linear unit:

R(x_i) = 0 if x_i < 0;  x_i if x_i ≥ 0.    (6)

Adding them together leads to a learnable activation function:

F(x_i, α, β) = R(x_i) + L(x_i, α, β) = C(α)·x_i if x_i < 0;  (1 + σ(β))·x_i if x_i ≥ 0.    (7)

This combination amplifies positive elements and suppresses negative ones based on the learned scaling parameters β and α, respectively. Thus, ELSA learns an element-wise residue for the activated elements with respect to ReLU, which acts as an identity transformation over them; this helps ameliorate gradient vanishing.
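For reference, Eq. (7) is only a few lines of code. The NumPy sketch below is framework-agnostic; in an actual network layer, α and β would be registered as trainable parameters of the layer (e.g., as `nn.Parameter` in a PyTorch module):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def arelu(x, alpha, beta):
    """AReLU, Eq. (7): ReLU plus the ELSA residue.
    Negative inputs are scaled by clamp(alpha) in [0.01, 0.99];
    non-negative inputs are amplified by 1 + sigmoid(beta) > 1."""
    a = np.clip(alpha, 0.01, 0.99)
    return np.where(x < 0, a * x, (1.0 + sigmoid(beta)) * x)

x = np.array([-1.0, 0.0, 2.0])
y = arelu(x, alpha=0.9, beta=2.0)
```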

3.4. THE OPTIMIZATION OF ARELU

AReLU can be trained with back-propagation jointly with all other network layers. The update formulas for α and β can be derived with the chain rule. Specifically, the gradient of α is:

∂E/∂α = (∂E/∂F(x_i, α, β)) · (∂F(x_i, α, β)/∂α),    (8)

where E is the error function to be minimized. The term ∂E/∂F(x_i, α, β) is the gradient propagated from the deeper layer. The gradient of the activation of X with respect to α is given by:

∂F(X, α, β)/∂α = Σ_{x_i < 0} x_i.    (9)

Here, the derivative of the clamp function C(·) is handled simply by detaching the gradient back-propagation when α < 0.01 or α > 0.99. The gradient of the activation of X with respect to β is:

∂F(X, α, β)/∂β = Σ_{x_i ≥ 0} σ(β)(1 − σ(β)) x_i.    (10)

The gradient of the activation with respect to the input x_i is given by:

∂F(x_i, α, β)/∂x_i = C(α) if x_i < 0;  1 + σ(β) if x_i ≥ 0.    (11)

It can be seen that AReLU amplifies the gradients propagated from downstream when the input is activated, since 1 + σ(β) > 1; it suppresses the gradients otherwise. In contrast, there is no such amplification effect in the standard ReLU and its variants (e.g., PReLU (He et al., 2015)): only suppression is available. The ability to amplify the gradients over the activated input helps avoid gradient vanishing and thus speeds up the training convergence of the model (see Figure 3). Moreover, the amplification factor is learned to dynamically adapt to the input and is confined by the sigmoid function. This makes the activation more data-adaptive and stable (see Figure 1 (right) for a visual comparison of post-activation feature maps by AReLU and ReLU). The suppression part is similar to PReLU, which learns the suppression factor to ameliorate zero gradients. AReLU introduces a very small number of extra parameters, 2L for an L-layer network, and its computational overhead is negligible for both forward and backward propagation. Note that the gradients of α and β depend on the entire feature volume X.
This means that ELSA can be regarded as a global attention mechanism: Although the attention map is computed in an element-wise manner, the parameters are learned globally accounting for the impact of the full feature volume. This makes our AReLU more data-adaptive and hence the whole network more expressive. 
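The closed-form gradients in Eqs. (9) and (10) are easy to verify numerically. The sketch below is our own check, taking α strictly inside the clamp range so that C(α) = α, and compares the analytic gradients of a scalar sum-loss against central finite differences:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def arelu_sum(x, alpha, beta):
    # Scalar "loss": the sum of AReLU outputs, Eq. (7),
    # assuming alpha lies inside [0.01, 0.99].
    return np.sum(np.where(x < 0, alpha * x, (1.0 + sigmoid(beta)) * x))

x = np.array([-2.0, -0.3, 0.5, 1.2])
alpha, beta, eps = 0.5, 1.0, 1e-6

# Analytic gradients, Eqs. (9) and (10).
g_alpha = np.sum(x[x < 0])
g_beta = sigmoid(beta) * (1.0 - sigmoid(beta)) * np.sum(x[x >= 0])

# Central finite differences.
fd_alpha = (arelu_sum(x, alpha + eps, beta) - arelu_sum(x, alpha - eps, beta)) / (2 * eps)
fd_beta = (arelu_sum(x, alpha, beta + eps) - arelu_sum(x, alpha, beta - eps)) / (2 * eps)

assert abs(g_alpha - fd_alpha) < 1e-4
assert abs(g_beta - fd_beta) < 1e-4
```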

4. EXPERIMENTS

We first study the robustness of AReLU in terms of parameter initialization. We then evaluate the convergence of network training with different activation functions on two standard classification benchmarks, MNIST (LeCun et al., 1998) and CIFAR100 (Krizhevsky et al., 2009). We compare AReLU with 18 different activation functions, including 13 non-learnable ones and 5 learnable ones; see the list in Table 1. The number of learnable parameters for each learnable activation function is also given in the table. Finally, we demonstrate the advantages of AReLU in transfer learning. Please refer to the supplemental material for more results and experiment details.

4.1. INITIALIZATION OF LEARNABLE PARAMETER α AND β

For evaluation purposes, we design a neural network (MNIST-Conv) with three convolutional layers, each followed by a max-pooling layer and an AReLU, and finally a fully connected layer followed by a softmax layer. Details of this network can be found in the supplemental material. The experiment on parameter initialization is conducted with MNIST-Conv on the MNIST dataset. As shown in Figure 2 (a), AReLU is insensitive to the initialization of α and β: different initial values result in similar convergence rates and classification accuracies. Generally, a larger initial value of β speeds up convergence. Figure 2 (b) shows the learning process of the two parameters, and (c) plots the final learned AReLUs for the three convolutional layers. In the following experiments, we initialize α = 0.9 and β = 2.0 by default.

4.2. CONVERGENCE ON MNIST

On the MNIST dataset, we evaluate MNIST-Conv implemented with different activation functions and trained with the ADAM or SGD optimizer. The activation function is placed after each max-pooling layer. We compare AReLU with both learnable and non-learnable activation functions under learning rates of 1 × 10^-2, 1 × 10^-3, 1 × 10^-4, and 1 × 10^-5. To compare the convergence speed of different activation functions, we report the accuracy after the first epoch, taking the mean over five training runs; see Table 1. In the table, we report the improvement of AReLU over the best among the other non-learnable and learnable methods. In Figure 3, we plot the mean accuracy over an increasing number of training epochs. As shown in Table 1, AReLU outperforms most existing non-learnable and learnable activation functions in terms of convergence speed and final classification accuracy on MNIST. A noteworthy phenomenon is that AReLU achieves more effective training with a small learning rate than the alternatives (see the significant improvement when the learning rate is 1 × 10^-4 or 1 × 10^-5). This can also be observed in Figure 3. Generally, smaller learning rates lower learning efficiency since the vanishing gradient issue is intensified in this case. AReLU overcomes this difficulty thanks to its gradient amplification effect. Efficient learning with a small learning rate is very useful in transfer learning, where a pre-trained model is usually fine-tuned on a new domain/dataset with a small learning rate, which is difficult for most existing deep networks. Section 4.4 demonstrates this application of AReLU.

4.3. CONVERGENCE ON CIFAR100

To better demonstrate the effect of ELSA, we regard ReLU, i.e., AReLU without ELSA, as our baseline. For plot clarity, we compare only with the most representative competitive activation functions, including PAU, SELU, ReLU, LReLU, and PReLU; more results can be found in the supplemental material. We evaluate the performance of AReLU with five mainstream network architectures on CIFAR100. We use the SGD optimizer and follow the training configuration in (Pereyra et al., 2017): the learning rate is 0.1, the batch size is 64, the weight decay is 5 × 10^-4, and the momentum is 0.9. The results are plotted in Figure 4. Learnable activation functions generally converge faster than non-learnable ones, and AReLU achieves the fastest convergence for all five network architectures. It is worth noting that although PAU can converge faster at the beginning for some networks such as SeResNet-18, it tends to overfit later with a fast saturation of accuracy. AReLU avoids such overfitting with a smaller number of parameters than PAU (2 vs. 10). We also conduct a qualitative analysis of AReLU by visualizing the learned feature maps with Grad-CAM (Selvaraju et al., 2017) on testing images of CIFAR100. Grad-CAM is a recently proposed network visualization method which utilizes gradients to depict the importance of the spatial locations in a feature map. Since the gradients are computed with respect to a specific image class, a Grad-CAM visualization can be regarded as a task-oriented attention map. In Figure 5, we visualize the first-layer feature map of ResNet-18. As shown in the figure, the feature maps learned with AReLU lead to semantically more meaningful activation of regions relevant to the target class. This is due to the data-adaptive, attentive ability of AReLU.

4.4. PERFORMANCE IN TRANSFER LEARNING

We evaluate transfer learning of MNIST-Conv with different activation functions between two datasets: MNIST and SVHN. The data preprocessing for adapting the two datasets follows (Shin et al., 2017). We train three models and test them on SVHN: 1) a model trained directly on SVHN without any pretraining, 2) a model trained on MNIST but not finetuned on SVHN, and 3) a model pretrained on MNIST and finetuned on SVHN. In pretraining, we train MNIST-Conv using SGD with a learning rate of 0.01 for 20 epochs, which is sufficient for all model variants to converge. In finetuning, we train the model on SVHN with a learning rate of 1 × 10^-5 using the SGD optimizer for 100 epochs. The testing results on SVHN are reported in Table 2, where we compare AReLU with several competitive alternatives. Without pretraining, it is hard to obtain good accuracy on the difficult SVHN task. Nevertheless, MNIST-Conv with AReLU performs the best among all alternatives; some activation functions even fail to learn. In the transfer learning setting (pretrain + finetune), AReLU outperforms all other activation functions for different amounts of pretraining, thanks to its high learning efficiency with small learning rates.

4.5. PERFORMANCE IN META LEARNING

We evaluate the meta learning performance of MNIST-Conv with the various activation functions based on the MAML framework (Finn et al., 2017). MAML is a fairly general optimization-based algorithm compatible with any model that learns through gradient descent. It aims to obtain meta-learned parameters from similar tasks and adapt them to novel tasks of the same distribution using a few gradient updates. In MAML, model parameters are explicitly trained such that a small number of gradient updates on a small amount of training data from the novel task leads to good generalization performance on that task. We expect that the fast convergence of AReLU helps MAML adapt a model to a novel task more efficiently and with better generalization. We set the number of fast adaptation steps to 5 and use 32 tasks for each step. We train the model for 100 iterations with a learning rate of 0.005. We report in Table 3 the final test accuracy for different activation functions on a 5-way 1-shot task and a 5-way 5-shot task, respectively. The results show that AReLU has a clear advantage over the alternative activation functions. One noteworthy phenomenon is the performance of PAU (Molina et al., 2019): it performs well in the other evaluations but not on meta learning, probably due to its overfitting-prone nature.

4.6. THE GENERALIZED EFFECT OF ELSA

In this experiment, we show that ELSA (Element-wise Sign-based Attention) can serve as a general module which can be plugged into any existing activation function and obtains a performance boost in most cases. We define a new activation F the same as Eq. (7), but replace the ReLU function R with a specified activation function. We keep the same experiment settings as in Sec. 4.2. As shown in Table 4, after plugging in an ELSA module, we obtain a performance boost in most cases compared with Table 1, indicating that ELSA generalizes well.
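This generalization amounts to adding the ELSA residue of Eq. (5) on top of an arbitrary base activation. A minimal NumPy sketch (the wrapper name is ours; with ReLU as the base activation it reduces exactly to AReLU):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elsa_residue(x, alpha, beta):
    # The ELSA layer of Eq. (5).
    return np.where(x < 0, np.clip(alpha, 0.01, 0.99) * x, sigmoid(beta) * x)

def with_elsa(act, x, alpha, beta):
    """Generalized Eq. (7): base activation `act` plus the ELSA residue."""
    return act(x) + elsa_residue(x, alpha, beta)

# With ReLU as the base, with_elsa reproduces AReLU;
# any other element-wise activation could be passed instead.
relu = lambda v: np.maximum(v, 0.0)
x = np.array([-1.0, 2.0])
y = with_elsa(relu, x, alpha=0.9, beta=2.0)
```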

5. CONCLUSION

We have presented AReLU, a new learnable activation function formulated with an element-wise sign-based attention mechanism. Networks implemented with AReLU better mitigate the gradient vanishing issue and converge faster under small learning rates. This makes it especially useful in transfer learning, where a pretrained model needs to be finetuned in the target domain with a small learning rate. AReLU can significantly boost the performance of most mainstream network architectures with only two extra learnable parameters introduced per layer. In the future, we would like to investigate the application/extension of AReLU to more diverse tasks such as object detection, language translation, and even structural feature learning with graph neural networks.



SVHN dataset: http://ufldl.stanford.edu/housenumbers/



Figure 1: Left: An illustration of attention mechanisms with attention maps at different granularities. Right: Visualization of pre-activation and post-activation feature maps obtained with ReLU and AReLU on a testing image of the handwritten digit dataset MNIST (LeCun et al., 1998).

Figure 2: (a): Plot of accuracy over epochs for networks trained with different initializations of α and β. A larger initial β leads to faster convergence, and higher accuracy is obtained when α is initialized to 0.25 or 0.75. (b): The learning process of α and β, initialized to 0.25 and 1.0, respectively. (c): The final learned AReLUs for the three convolutional layers of the MNIST-Conv network. The shaded region gives the range of AReLU curves.

Figure 3: Plots of mean testing accuracy (%) on MNIST for five training runs of MNIST-Conv over increasing training epochs. The training is conducted using SGD with small learning rates (left: 1 × 10^-4, right: 1 × 10^-5).

Figure 4: Plots of mean testing accuracy (%) on CIFAR100 over increasing training epochs, using different network architectures. The training is conducted using SGD with a learning rate of 0.1.

Figure 5: Grad-CAM visualization of feature maps extracted by ResNet-18 with AReLU and ReLU (columns: input, ReLU, AReLU). The first row shows the testing images from CIFAR100.

Table 1: Mean testing accuracy (%) on MNIST for five training runs of MNIST-Conv after the first epoch with different optimizers and learning rates. We compare AReLU with 13 non-learnable and 5 learnable activation functions. The number of parameters per activation unit is listed beside the name of each learnable activation function. The best numbers are shown in bold, in blue for non-learnable methods (the upper part of the table) and red for learnable ones (the lower part). At the bottom of the table, we report the improvement of AReLU over the best among the other non-learnable and learnable methods, in blue and red respectively.

Table 2: Test accuracy (%) on SVHN by MNIST-Conv models (implemented with different activation functions) trained directly on SVHN (no pretrain), trained on MNIST but not finetuned (no finetune), and pretrained on MNIST then finetuned on SVHN for 5, 10, and 20 epochs. The left part of the table lists non-learnable activation functions and the right part learnable ones.

Table 3: Test accuracy (%) on MNIST by MAML with MNIST-Conv models implemented with different activation functions. The performance is compared on a 5-way 1-shot task and a 5-way 5-shot task, respectively.

Table 4: Mean testing accuracy (%) on MNIST for five training runs of MNIST-Conv after the first epoch with different optimizers and learning rates. For each activation function and each learning rate, we show results of training with the ELSA module. Numbers where the ELSA module improves over the original activation function (shown in Table 1) are underlined.

