ARELU: ATTENTION-BASED RECTIFIED LINEAR UNIT

Abstract

Element-wise activation functions play a critical role in deep neural networks by affecting expressive power and learning dynamics. Learning-based activation functions have recently gained increasing attention and success. We propose a new perspective on learnable activation functions by formulating them with an element-wise attention mechanism. In each network layer, we devise an attention module which learns an element-wise, sign-based attention map for the pre-activation feature map; the attention map scales each element based on its sign. Combining the attention module with a rectified linear unit (ReLU) results in an amplification of positive elements and a suppression of negative ones, both with learned, data-adaptive parameters. We coin the resulting activation function Attention-based Rectified Linear Unit (AReLU). Since ReLU can be viewed as an identity transformation over its activated part, the attention module essentially learns an element-wise residue of the activated part of the input. This makes network training more resistant to gradient vanishing. The learned attentive activation leads to well-focused activation of relevant regions of a feature map. Through extensive evaluations, we show that AReLU significantly boosts the performance of most mainstream network architectures while introducing only two extra learnable parameters per layer. Notably, AReLU facilitates fast network training under small learning rates, which makes it especially suited to transfer learning and meta learning.

1. INTRODUCTION

Activation functions, which introduce nonlinearities into artificial neural networks, are essential to a network's expressive power and learning dynamics. Designing activation functions that facilitate fast training of accurate deep neural networks is an active area of research (Maas et al., 2013; Goodfellow et al., 2013; Xu et al., 2015a; Clevert et al., 2015; Hendrycks & Gimpel, 2016; Klambauer et al., 2017; Barron, 2017; Ramachandran et al., 2017). Aside from the large body of hand-designed functions, learning-based approaches have recently gained more attention and success (Agostinelli et al., 2014; He et al., 2015; Manessi & Rozza, 2018; Molina et al., 2019; Goyal et al., 2019). Existing learnable activation functions are motivated either by relaxing/parameterizing a non-learnable activation function (e.g., the Rectified Linear Unit (ReLU) (Nair & Hinton, 2010)) with learnable parameters (He et al., 2015), or by seeking a data-driven combination of a pool of pre-defined activation functions (Manessi & Rozza, 2018). In both cases, activation functions are made data-adaptive by introducing degrees of freedom and/or enlarging the explored hypothesis space. In this work, we propose a new perspective on learnable activation functions by formulating them with an element-wise attention mechanism. A straightforward motivation is the observation that both activation functions and element-wise attention functions are applied as network modules of element-wise multiplication. More intriguingly, learning element-wise activation functions in a neural network can be viewed as a task-oriented attention mechanism (Chorowski et al., 2015; Xu et al., 2015b), i.e., learning where (which element in the input feature map) to attend (activate) given an end task to fulfill. This motivates an arguably more interpretable formulation of attentive activation functions. The attention mechanism has been a cornerstone of deep learning.
It directs the network to learn which parts of the input are more relevant or contribute more to the output. There are many variants of attention modules, with numerous successful applications. In natural language processing, vector-wise attention is used to model long-range dependencies in a sequence of word vectors (Luong et al., 2015; Vaswani et al., 2017). Many computer vision tasks utilize pixel-wise or channel-wise attention modules for more expressive and invariant representation learning (Xu et al., 2015b; Chen et al., 2017). Element-wise attention (Bochkovskiy et al., 2020) is the most fine-grained: each element of a feature volume can receive a different amount of attention. Consequently, it attains high expressivity with neuron-level degrees of freedom. Inspired by this, we devise for each layer of a network an element-wise attention module which learns a sign-based attention map for the pre-activation feature map. The attention map scales each element based on its sign. By adding the attention module and a ReLU, we obtain the Attention-based Rectified Linear Unit (AReLU), which amplifies positive elements and suppresses negative ones, both with learned, data-adaptive parameters. The attention module essentially learns an element-wise residue for the activated elements with respect to ReLU, since the latter can be viewed as an identity transformation over its activated part. This effectively ameliorates the gradient-vanishing issue. Through extensive experiments on several public benchmarks, we show that AReLU significantly boosts the performance of most mainstream network architectures while introducing only two extra learnable parameters per layer. Moreover, AReLU enables fast learning under small learning rates, making it especially suited to transfer learning. We also demonstrate with feature map visualization that the learned attentive activation achieves well-focused, task-oriented activation of relevant regions.
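To make the sign-based amplification/suppression concrete, the following is a minimal pure-Python sketch of such an activation applied to a single pre-activation value. The parameter names alpha and beta, their default values, and the exact clamping/sigmoid choices are illustrative assumptions; in a network, both parameters would be learned per layer and the function applied element-wise to the feature map:

```python
import math

def arelu(x, alpha=0.9, beta=2.0):
    """Sketch of an attention-based rectified unit on one scalar.

    alpha scales (suppresses) negative inputs; beta gates the
    amplification of positive inputs. Both would be learned per layer.
    """
    a = min(max(alpha, 0.01), 0.99)        # keep the negative-side scale inside (0, 1)
    gate = 1.0 / (1.0 + math.exp(-beta))   # sigmoid keeps the positive-side gain bounded
    if x >= 0:
        return (1.0 + gate) * x            # ReLU identity plus a learned residue
    return a * x                           # learned suppression instead of hard zeroing
```

Because the positive branch is the identity plus a bounded learned term, gradients for activated elements never vanish, matching the residue interpretation above.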

2. RELATED WORK

Non-learnable activation functions Sigmoid is a non-linear, saturating activation function used mostly in the output layers of deep learning models. However, it suffers from the exploding/vanishing gradient problem. As a remedy, the rectified linear unit (ReLU) (Nair & Hinton, 2010) has become the most widely used activation function for deep learning models, with state-of-the-art performance in many applications. Many variants of ReLU have been proposed to further improve its performance on different tasks: LReLU (Maas et al., 2013), ReLU6 (Krizhevsky & Hinton, 2010), and RReLU (Xu et al., 2015a). Besides these, several specialized activation functions have been designed for different purposes, such as CELU (Barron, 2017), ELU (Clevert et al., 2015), GELU (Hendrycks & Gimpel, 2016), Maxout (Goodfellow et al., 2013), SELU (Klambauer et al., 2017), Softplus (Glorot et al., 2011), and Swish (Ramachandran et al., 2017).

Learnable activation functions Recently, learnable activation functions have drawn increasing attention. PReLU (He et al., 2015), a variant of ReLU, improves model fitting with little extra computational cost and little risk of overfitting. More recently, PAU (Molina et al., 2019) was proposed to not only approximate common activation functions but also learn new ones, while providing compact representations with few learnable parameters. Several other learnable activation functions, such as APL (Agostinelli et al., 2014), Comb (Manessi & Rozza, 2018), and SLAF (Goyal et al., 2019), also achieve promising performance on different tasks.

Attention Mechanism The Vector-Wise Attention Mechanism (VWAM) has been widely applied in Natural Language Processing (NLP) tasks (Xu et al., 2015c; Luong et al., 2015; Bahdanau et al., 2014; Vaswani et al., 2017; Ahmed et al., 2017). VWAM learns which vector among a sequence of word vectors is the most relevant to the task at hand.
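PReLU's relaxation of ReLU can be written down directly. Below is a minimal element-wise sketch; the slope `a` is the single learned parameter, and 0.25 is its common initialization:

```python
def prelu(x, a=0.25):
    """PReLU (He et al., 2015): identity for positive inputs,
    a learned linear slope `a` for negative inputs."""
    return x if x >= 0 else a * x
```

Setting `a = 0` recovers plain ReLU, which is why PReLU is viewed as a parameterized relaxation of it.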
The Channel-Wise Attention Mechanism (CWAM) can be regarded as an extension of VWAM from NLP to vision tasks (Tang et al., 2019b; 2020; Kim et al., 2019); it learns to assign each channel an attentional value. The Pixel-Wise Attention Mechanism (PWAM) is also widely used in vision (Tang et al., 2019c; a). The Element-Wise Attention Mechanism (EWAM) assigns a different value to each element without any spatial/channel constraint. The recently proposed YOLOv4 (Bochkovskiy et al., 2020) is the first work to introduce EWAM, implemented with a convolutional layer followed by a sigmoid, and achieves state-of-the-art performance on object detection. We introduce a new kind of EWAM for learnable activation functions.
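To make the conv-plus-sigmoid construction of EWAM concrete, here is a minimal pure-Python sketch of an element-wise attention gate on a 2D feature map. The scalar weight `w` and bias `b` stand in for a learned convolution and are purely illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def elementwise_attention(feature, w=1.0, b=0.0):
    """Sketch of element-wise attention: every element gets its own
    gate in (0, 1), with no spatial or channel tying."""
    gates = [[sigmoid(w * v + b) for v in row] for row in feature]
    # Modulate the feature map by the attention map, element by element.
    return [[g * v for g, v in zip(grow, frow)]
            for grow, frow in zip(gates, feature)]
```

Because every element receives its own gate, this is the most fine-grained of the attention variants listed above, at the cost of computing a full-resolution attention map.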

3. METHOD

We start by describing the attention mechanism in general, then introduce the element-wise, sign-based attention mechanism on which AReLU is defined. The optimization of AReLU then follows.

3.1 ATTENTION MECHANISM

Let us denote by $V = \{v_i\} \in \mathbb{R}^{D_v^1 \times D_v^2 \times \cdots}$ a tensor representing input data or a feature volume. A function $\Phi$, parameterized by $\Theta = \{\theta_i\}$, is used to compute an attention map $S = \{s_i\} \in \mathbb{R}^{D_v^{\theta(1)} \times D_v^{\theta(2)} \times \cdots}$

