LAU: A NOVEL TWO-PARAMETER LEARNABLE LOGMOID ACTIVATION UNIT

Anonymous

Abstract

In this work, we propose a novel learnable Logmoid Activation Unit (LAU), f(x) = x ln(1 + α Sigmoid(βx)), obtained by parameterizing Logmoid with two parameters α and β that are optimized via the back-propagation algorithm. We design quasi-interpolation neural network operators with Logmoid-1 for approximating any continuous function on closed spaces. Our simulations show that end-to-end training of deep neural networks with learnable Logmoids improves predictive performance beyond well-known activation functions on a range of tasks.

1. INTRODUCTION

In recent years, deep learning has achieved remarkable success in various classification problems (LeCun et al., 2015; Sriperumbudur et al., 2010). The main reason is the powerful ability of deep neural networks (DNNs) to represent and learn unknown structures. One of the most important components of a DNN is its activation function: a well-designed activation function can greatly improve predictive performance (Krizhevsky et al., 2017). This has sparked growing interest in exploring activation functions (Nwankpa et al., 2018; Liang & Srikant, 2016; Shen et al., 2019). Activation functions are generally classified into linear, nonlinear monotonic, and nonlinear non-monotonic functions. Although linear functions, including the Step Function (Klein et al., 2009), the Sign Function (Huang & Babri, 1998), and the Identity Function, were widely used in early work, they are of little use in practical applications because of their discontinuous derivatives or their lack of biological motivation and classification ability. These problems were addressed by nonlinear monotonic functions such as the Sigmoid, Tanh, and ReLU families. However, the small derivatives of Sigmoid (Hassell et al., 1977) and Tanh (Kalman & Kwasny, 1992) may cause gradients to vanish (He & Xu, 2010; Klambauer et al., 2017); the Softplus and ReLU functions were designed to solve this problem. ReLU offers further advantages such as reduced saturation, sparsity, efficiency, and ease of use, but a network may lose valid information because ReLU maps all negative values to zero. This has motivated new activation functions such as Leaky ReLU (Maas et al., 2013), RReLU (Xu et al., 2015), ELU (Clevert et al., 2015), and Swish (Ramachandran et al., 2017). In general, designing a good activation function remains an open question. One approach is to build new activation functions by combining different units, as in Mish (Misra, 2019) and TanhExp (Liu & Di, 2020).
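The vanishing-gradient argument above is easy to check numerically: the derivative of the Sigmoid, σ'(x) = σ(x)(1 − σ(x)), is at most 0.25, so gradients shrink geometrically when multiplied through many saturating layers, while ReLU's derivative is 1 on positive inputs but exactly 0 on negative ones. A minimal plain-Python sketch (illustrative only):

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Sigmoid derivative sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """ReLU derivative: 1 for positive inputs, 0 for negative ones."""
    return 1.0 if x > 0 else 0.0

# A product of ten sigmoid-derivative factors is at most 0.25**10 ~ 1e-6,
# so the gradient effectively vanishes in deep saturating stacks.
print(sigmoid_grad(0.0))                 # 0.25, the maximum
print(sigmoid_grad(10.0))                # ~4.5e-5 in the saturated region
print(relu_grad(3.0), relu_grad(-3.0))   # 1.0 for x > 0, but 0.0 for x < 0
```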
A second approach is to parameterize well-known activation functions with learnable hyper-parameter(s) trained by the back-propagation algorithm (LeCun et al., 1989; Biswas et al., 2020; Zhou et al., 2020), which may outperform parameter-free functions. One example is Logish (Zhu et al., 2021), which exhibits better performance and Top-1 accuracy; another is the Padé Activation Unit (PAU) of Molina et al. In this work, we propose a new family of activation functions by parameterizing Logmoid with few trainable parameters per network layer. It is given by f(x; α, β) = x ln(1 + α Sigmoid(βx)), where the logarithmic operation reduces the range of the Sigmoid, and α and β are trainable parameters. The main contribution is summarized as follows: 1. A new family of activation functions, called the Logmoid family, is proposed. The rest of the paper is organized as follows: Section 2 briefly reviews related activation functions. Section 3 proposes new activation functions by parameterizing Logmoid. Section 4 analyzes the performance of the sub-family Logmoid-1. Section 5 introduces the learnable activation units (LAUs) and presents several simulations verifying their effectiveness, and the last section concludes.
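For concreteness, the proposed LAU formula can be sketched in a few lines of plain Python. This is an illustrative version only: in an actual network, α and β would be registered as trainable per-layer parameters (e.g. as nn.Parameter objects in PyTorch), a detail the formula itself does not fix.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lau(x, alpha=1.0, beta=1.0):
    """Logmoid Activation Unit: f(x; alpha, beta) = x * ln(1 + alpha * sigmoid(beta * x)).

    alpha and beta are the two trainable parameters; the defaults of 1.0
    are illustrative starting values, not learned settings.
    """
    return x * math.log(1.0 + alpha * sigmoid(beta * x))

# f(0) = 0, and for large positive x the function grows like x * ln(1 + alpha),
# i.e. approximately linearly, similar in spirit to Swish.
print(lau(0.0))      # 0.0
print(lau(100.0))    # ~100 * ln(2) ~ 69.3 for alpha = beta = 1
```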

2. RELATED WORKS

This section reviews related activation functions, including Swish, TanhSoft, and PAU.

Swish. Swish (Ramachandran et al., 2017) is defined by

f(x) = x Sigmoid(x). (2)

This function is unsaturated, smooth, and non-monotonic. Unlike ReLU, which can present problems in gradient-based optimization because it is not continuously differentiable, Swish is smooth everywhere.

TanhSoft. The TanhSoft family is indexed by four hyper-parameters as

f(x; α, β, γ, δ) = tanh(αx + βe^(γx)) ln(δ + e^x). (3)

Based on simulations on the CIFAR-10 dataset, two sub-families, TanhSoft-1 and TanhSoft-2, are defined as

TanhSoft-1: f(x; α, 0, γ, 1) = tanh(αx) ln(1 + e^x), (4)
TanhSoft-2: f(x; 0, β, γ, 0) = x tanh(βe^(γx)). (5)

TanhSoft-1 (α = 0.87) and TanhSoft-2 (β = 0.6, γ = 1) may show higher Top-1 performance than other activation functions (Biswas et al., 2020).

PAU. The Padé Activation Unit (PAU) is a learnable activation function based on a rational function. The Padé approximant is a rational function of the form

F(x) = (a_0 + a_1 x + a_2 x^2 + ... + a_m x^m) / (1 + b_1 x + b_2 x^2 + ... + b_n x^n).

PAU has free parameters which can be optimized end-to-end in neural networks.
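The PAU's rational form is straightforward to state in code. The sketch below evaluates the Padé ratio for given coefficient lists; the coefficient values in the usage lines are hypothetical, chosen only to illustrate the evaluation, not learned values from any paper.

```python
def pau(x, a, b):
    """Evaluate a Pade rational function:
    (a[0] + a[1]*x + ... + a[m]*x^m) / (1 + b[0]*x + ... + b[n-1]*x^n).

    a holds the numerator coefficients a_0..a_m; b holds the denominator
    coefficients b_1..b_n (the leading 1 is fixed, as in the PAU formula).
    In a PAU layer, both lists would be optimized end-to-end.
    """
    numerator = sum(ai * x**i for i, ai in enumerate(a))
    denominator = 1.0 + sum(bj * x**(j + 1) for j, bj in enumerate(b))
    return numerator / denominator

# Illustrative coefficients only:
print(pau(2.0, [1.0, 1.0], []))       # (1 + 2) / 1 = 3.0
print(pau(1.0, [0.0, 1.0], [1.0]))    # 1 / (1 + 1) = 0.5
```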

3. LOGMOID ACTIVATION FUNCTION FAMILY

In this section we present a new family of activation functions by parameterizing Logmoid with two hyper-parameters. Specifically, we propose the two-parameter family

f(x; α, β) = x ln(1 + ασ(βx)),

where σ refers to the Sigmoid function σ(x) = 1/(1 + e^(-x)). Table 1 shows its Top-1 accuracy. For α = 1, the accuracy decreases faster for -1 < β < 1, and reaches 78.2% for β = 5. For β = 1, it decreases faster as α gradually increases. This suggests that Logmoid with α = 1 has better performance.
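The qualitative shape of the function can be checked numerically. The sketch below evaluates Logmoid at a few points with α = β = 1 (assumed here to be the Logmoid-1 setting, which this excerpt does not state explicitly):

```python
import math

def logmoid(x, alpha, beta):
    """f(x; alpha, beta) = x * ln(1 + alpha * sigmoid(beta * x))."""
    return x * math.log(1.0 + alpha / (1.0 + math.exp(-beta * x)))

# With alpha = beta = 1 the function is ReLU-like but smooth and
# non-monotonic: near-linear growth (slope ~ ln 2) for large positive x,
# a small negative dip for negative x, and f(0) = 0.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"{x:5.1f} -> {logmoid(x, 1.0, 1.0):8.4f}")
```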




Table 1: Top-1 accuracy of Logmoid with different parameters on CIFAR-10.

