ARELU: ATTENTION-BASED RECTIFIED LINEAR UNIT

Abstract

Element-wise activation functions play a critical role in deep neural networks by affecting both expressive power and learning dynamics. Learning-based activation functions have recently gained increasing attention and success. We propose a new perspective on learnable activation functions by formulating them with an element-wise attention mechanism. In each network layer, we devise an attention module that learns an element-wise, sign-based attention map for the pre-activation feature map. The attention map scales each element based on its sign. Combining the attention module with a rectified linear unit (ReLU) results in an amplification of positive elements and a suppression of negative ones, both with learned, data-adaptive parameters. We coin the resulting activation function Attention-based Rectified Linear Unit (AReLU). Since ReLU can be viewed as an identity transformation on the activated part of the input, the attention module essentially learns an element-wise residue for that part, which makes network training more resistant to gradient vanishing. The learned attentive activation leads to well-focused activation of relevant regions of a feature map. Through extensive evaluations, we show that AReLU significantly boosts the performance of most mainstream network architectures while introducing only two extra learnable parameters per layer. Notably, AReLU facilitates fast network training under small learning rates, which makes it especially suited to transfer learning and meta-learning.
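The mechanism described above, amplifying positive elements and suppressing negative ones with two learnable scalars per layer, can be sketched as follows. This is a minimal NumPy illustration of the forward pass only; the particular parameterization (clamping the suppression factor, passing the amplification parameter through a sigmoid) is an assumption for illustration, not the paper's exact formulation:

```python
import numpy as np

def arelu(x, alpha=0.9, beta=2.0):
    """Sketch of an Attention-based Rectified Linear Unit (AReLU) forward pass.

    alpha and beta stand in for the layer's two learnable parameters:
    alpha suppresses negative elements, beta amplifies positive ones.
    The clamp/sigmoid parameterization here is an illustrative assumption.
    """
    # Keep the suppression factor for negative elements inside (0, 1).
    a = np.clip(alpha, 0.01, 0.99)
    # Sigmoid maps beta to (0, 1); 1 + sigmoid(beta) > 1 amplifies positives.
    b = 1.0 / (1.0 + np.exp(-beta))
    pos = np.maximum(x, 0.0)  # ReLU part: identity on positive elements
    neg = np.minimum(x, 0.0)  # negative part, scaled down by a
    return (1.0 + b) * pos + a * neg
```

Because the positive branch is ReLU (identity) plus a learned multiple of the input, the attention term acts as the element-wise residue mentioned in the abstract.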

1. INTRODUCTION

Activation functions, which introduce nonlinearities into artificial neural networks, are essential to a network's expressive power and learning dynamics. Designing activation functions that facilitate fast training of accurate deep neural networks is an active area of research (Maas et al., 2013; Goodfellow et al., 2013; Xu et al., 2015a; Clevert et al., 2015; Hendrycks & Gimpel, 2016; Klambauer et al., 2017; Barron, 2017; Ramachandran et al., 2017). Aside from the large body of hand-designed functions, learning-based approaches have recently gained more attention and success (Agostinelli et al., 2014; He et al., 2015; Manessi & Rozza, 2018; Molina et al., 2019; Goyal et al., 2019). Existing learnable activation functions are motivated either by relaxing/parameterizing a non-learnable activation function (e.g., the Rectified Linear Unit (ReLU) (Nair & Hinton, 2010)) with learnable parameters (He et al., 2015), or by seeking a data-driven combination of a pool of pre-defined activation functions (Manessi & Rozza, 2018). Existing learning-based methods make activation functions data-adaptive by introducing degrees of freedom and/or enlarging the explored hypothesis space.

In this work, we propose a new perspective on learnable activation functions by formulating them with an element-wise attention mechanism. An immediate motivation is the observation that both activation functions and element-wise attention functions are applied as network modules performing element-wise multiplication. More intriguingly, learning element-wise activation functions in a neural network can be viewed as a task-oriented attention mechanism (Chorowski et al., 2015; Xu et al., 2015b), i.e., learning where (which elements in the input feature map) to attend (activate) given an end task to fulfill. This motivates an arguably more interpretable formulation of attentive activation functions.

The attention mechanism has been a cornerstone of deep learning. It directs the network to learn which parts of the input are more relevant or contribute more to the output. There are many variants of attention modules with plentiful successful applications. In natural language processing, vector-wise attention is developed to model long-range dependencies in a sequence of word vectors (Luong et al., 2015; Vaswani et al., 2017). Many computer vision tasks utilize pixel-wise or channel-wise attention modules for more expressive and invariant representation learning (Xu et al., 2015b; Chen

