LAU: A NOVEL TWO-PARAMETER LEARNABLE LOGMOID ACTIVATION UNIT

Anonymous

Abstract

In this work, we propose a novel learnable Logmoid Activation Unit (LAU), f(x) = x ln(1 + α sigmoid(βx)), obtained by parameterizing Logmoid with two hyper-parameters α and β that are optimized via the back-propagation algorithm. We design quasi-interpolation neural network operators with Logmoid-1 for approximating any continuous function on compact sets. Our simulations show that end-to-end training of deep neural networks with learnable Logmoids improves predictive performance beyond well-known activation functions on different tasks.

1. INTRODUCTION

In recent years, deep learning has achieved remarkable success in various classification problems (LeCun et al., 2015; Sriperumbudur et al., 2010). The main reason is the powerful ability of deep neural networks (DNNs) to represent and learn unknown structures. One of the most important components of a DNN is the activation function. A well-designed activation function can greatly improve predictive performance (Krizhevsky et al., 2017). This has spurred growing interest in exploring activation functions (Nwankpa et al., 2018; Liang & Srikant, 2016; Shen et al., 2019).

Activation functions are generally classified into linear, nonlinear monotonic and nonlinear nonmonotonic functions. Although linear functions including the Step Function (Klein et al., 2009), Sign Function (Huang & Babri, 1998) and Identity Function were widely used in early work, they are of limited practical use because of discontinuous derivatives, or a lack of biological motivation and classification ability. These problems were addressed by nonlinear monotonic functions such as the Sigmoid, Tanh and ReLU families. However, the small derivatives of Sigmoid (Hassell et al., 1977) and Tanh (Kalman & Kwasny, 1992) may cause gradients to vanish (He & Xu, 2010; Klambauer et al., 2017). Softplus and ReLU were designed to solve this problem. ReLU further offers reduced saturation, sparsity, efficiency and ease of use, but a network may lose valid information because all negative inputs of ReLU are mapped to zero. This motivated new activation functions such as Leaky ReLU (Maas et al., 2013), RReLU (Xu et al., 2015), ELU (Clevert et al., 2015) and Swish (Ramachandran et al., 2017). In general, designing a good activation function remains an open problem. One approach is to find new activation functions by combining different units, as in Mish (Misra, 2019) and TanhExp (Liu & Di, 2020).
The second approach is to parameterize well-known activation functions (Biswas et al., 2020; Zhou et al., 2020), which may perform better than their parameter-free counterparts. One example is Logish (Zhu et al., 2021), which exhibits better performance and Top-1 accuracy. Another is the Padé Activation Unit (PAU) (Molina et al., 2019), which uses learnable hyper-parameters trained with the back-propagation algorithm (LeCun et al., 1989). In this work, we propose a new family of activation functions by parameterizing Logmoid with a few trainable parameters per network layer. It is given by f(x; α, β) = x ln(1 + α Sigmoid(βx)), where the logarithmic operation reduces the range of the Sigmoid, and α and β are trainable parameters. Our main contribution is a new family of activation functions, called the Logmoid family. The rest of the paper is organized as follows: Section 2 briefly describes related activation functions. Section 3 proposes new activation functions by parameterizing Logmoid. Section 4 analyzes the performance of the sub-family Logmoid-1. Section 5 introduces the learnable activation unit LAU and presents several simulations to verify its effectiveness, while the last section concludes.

2. RELATED WORKS

This section reviews related activation functions including Swish, TanhSoft and PAU.

Swish. Swish (Ramachandran et al., 2017) is defined by f(x) = x Sigmoid(x). (2) This function is unsaturating, smooth and nonmonotonic. Unlike ReLU, it is continuously differentiable, which avoids the problems that non-differentiable activations can present in gradient-based optimization.

TanhSoft. The TanhSoft family is indexed by four hyper-parameters as f(x; α, β, γ, δ) = tanh(αx + βe^{γx}) ln(δ + e^x). (3) Based on simulations on the CIFAR-10 dataset, two sub-families TanhSoft-1 and TanhSoft-2 are defined as

TanhSoft-1: f(x; α, 0, γ, 1) = tanh(αx) ln(1 + e^x), (4)
TanhSoft-2: f(x; 0, β, γ, 0) = x tanh(βe^{γx}). (5)

TanhSoft-1 (α = 0.87) and TanhSoft-2 (β = 0.6, γ = 1) may achieve higher Top-1 performance than other activations (Biswas et al., 2020).

PAU. The Padé Activation Unit (PAU) is a learnable activation function based on a rational function. The Padé approximant is a rational function of the form

F(x) = (a_0 + a_1 x + a_2 x^2 + ... + a_m x^m) / (1 + b_1 x + b_2 x^2 + ... + b_n x^n).

PAU has free parameters that can be optimized end-to-end in neural networks.
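For reference, these related activations can be written down in a few lines of Python. This is a minimal sketch of the formulas above; the function names, default values, and the list-based encoding of the PAU coefficients are our own, not from any of the cited papers:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    # Eq. (2): f(x) = x * Sigmoid(x)
    return x * sigmoid(x)

def tanhsoft1(x, alpha=0.87):
    # Eq. (4): f(x; alpha, 0, gamma, 1) = tanh(alpha x) * ln(1 + e^x)
    return math.tanh(alpha * x) * math.log(1.0 + math.exp(x))

def tanhsoft2(x, beta=0.6, gamma=1.0):
    # Eq. (5): f(x; 0, beta, gamma, 0) = x * tanh(beta * e^(gamma x))
    return x * math.tanh(beta * math.exp(gamma * x))

def pau(x, a, b):
    # Pade approximant: (a0 + a1 x + ... + am x^m) / (1 + b1 x + ... + bn x^n)
    num = sum(ai * x**i for i, ai in enumerate(a))
    den = 1.0 + sum(bj * x**(j + 1) for j, bj in enumerate(b))
    return num / den
```

For example, `pau(x, [0.0, 1.0], [])` reduces to the identity, while the learnable coefficients `a` and `b` are the free parameters that PAU optimizes end-to-end.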

3. LOGMOID ACTIVATION FUNCTION FAMILY

In this section we present a new family of activation functions by parameterizing Logmoid with two hyper-parameters. Specifically, we propose the family

f(x; α, β) = x ln(1 + ασ(βx)), (7)

where σ denotes the Sigmoid σ(x) = 1/(1 + e^{-x}). Table 1 shows its Top-1 accuracy. For α = 1, the accuracy decreases faster for -1 < β < 1 and reaches 78.2% for β = 5. For β = 1, it decreases faster as α gradually increases. This means that Logmoid with α = 1 performs better. For practical applications, one may choose hyper-parameters with 1 ≤ α ≤ 5 and 1 ≤ β ≤ 5. We refer to the sub-family f(x; 1, 1) as Logmoid-1.
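The definition in Eq. (7) can be sanity-checked numerically. A minimal Python sketch (the names are ours): f(0; α, β) = 0 for any α and β, and for β > 0 the function grows asymptotically like x ln(1 + α), since σ(βx) → 1 as x → ∞:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logmoid(x, alpha=1.0, beta=1.0):
    # Eq. (7): f(x; alpha, beta) = x * ln(1 + alpha * sigmoid(beta * x))
    return x * math.log(1.0 + alpha * sigmoid(beta * x))
```

With the Logmoid-1 defaults (α = β = 1), the slope for large positive x approaches ln 2 ≈ 0.693, so the function behaves like a smooth, self-gated variant of ReLU scaled by ln 2.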

4. CONTINUOUS FUNCTION APPROXIMATIONS BY LOGMOID-1

In this section the sub-family Logmoid-1 is used to approximate any continuous function on a compact set. Logmoid-1 is a smooth, non-monotonic activation function, unlike ReLU. Meanwhile, it inherits the self-gated property of Swish, where self-gating means multiplying the input by a function of the input, i.e., f(x) = x g(x). The output landscapes of a five-layer fully connected network are shown in Fig. 2. While ReLU produces many sharp transitions, Logmoid-1 yields a continuous, smooth transition shape, which is useful for optimization and generalization (Li et al., 2018). Next, we construct a bell-shaped function from Logmoid-1 and present the main approximation theorems.

Bell-shaped functions. From Eq. (7), Logmoid-1 with α = β = 1 is defined as

g(x) = x ln(1 + σ(x)), x ∈ ℝ. (8)

Following (Cardaliaguet & Euvrard, 1992), define a new bell-shaped function as

Ψ(x) = ψ(x) + 2|Y_1|, if T_0 ≤ x < T_1; λ(ψ(x) + 2|Y_2|), if T_2 < x < T_0; -ψ(x), if x ≥ T_1; -λψ(x), if x ≤ T_2, (9)

where ψ(x) = ϕ(x + 1/2) - ϕ(x - 1/2) with the bounded function ϕ(x) = g(x + 1/2) - g(x - 1/2); λ = 0.934; T_0 = -0.2536, T_1 = 3.5025 and T_2 = -3.7326 are the extreme points; and Y_1 = -0.0181 and Y_2 = -0.0307 are the values of ψ(x) at T_1 and T_2, respectively. Here λ ensures the continuity of Ψ(x) at T_0. We then define another bell-shaped function as

Φ(x) = Ψ(x)/ξ, (10)

where ξ is defined by Σ_{k=-∞}^{∞} Ψ(x - k) = ξ, which implies Σ_{k=-∞}^{∞} Φ(x - k) = 1. We further define

Γ(x) = e^{-x/2 - 1/2}, if x > 0; e^{x/2 - 1/2}, if x ≤ 0.
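The construction of ψ and Ψ can be verified numerically. The sketch below is our own code using the constants quoted above; the tolerances allow for their four-digit rounding. It checks the extreme values Y_1 and Y_2 and the continuity of Ψ at T_0, T_1 and T_2:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(x):
    # Logmoid-1, Eq. (8)
    return x * math.log(1.0 + sigmoid(x))

def psi(x):
    # psi(x) = phi(x + 1/2) - phi(x - 1/2) with phi(x) = g(x + 1/2) - g(x - 1/2),
    # which telescopes to the second central difference of g
    return g(x + 1.0) - 2.0 * g(x) + g(x - 1.0)

LAM, T0, T1, T2 = 0.934, -0.2536, 3.5025, -3.7326
Y1, Y2 = -0.0181, -0.0307

def Psi(x):
    # bell-shaped function of Eq. (9)
    if T0 <= x < T1:
        return psi(x) + 2.0 * abs(Y1)
    if T2 < x < T0:
        return LAM * (psi(x) + 2.0 * abs(Y2))
    if x >= T1:
        return -psi(x)
    return -LAM * psi(x)
```

Evaluating ψ at the quoted extreme points reproduces Y_1 and Y_2, and Ψ is continuous at all three break points, confirming the role of λ.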
Approximation errors. As universal approximators, feed-forward neural networks (FNNs) (Hornik et al., 1989; Cybenko, 1989) with one hidden layer and n + 1 neurons are represented by

N_n(x) = Σ_{j=0}^{n} c_j ς(a_j · x + b_j) + d_j, x ∈ ℝ^s, s ∈ ℕ,

where ς is an activation function, b_j ∈ ℝ are the thresholds of the hidden layer, a_j ∈ ℝ^s are the connection weights between the input and hidden layers, c_j ∈ ℝ are the connection weights between the hidden and output layers, d_j ∈ ℝ are the thresholds of the output layer, and · denotes the inner product. In general, any continuous function on a compact set can theoretically be approximated to any precision by increasing the number of hidden neurons (Cybenko, 1989; Hornik et al., 1990; Hornik, 1991; Funahashi, 1989; Leshno et al., 1993). We present a tighter estimate of the approximation error with Logmoid-1.

Let C[-1, 1] be the space of continuous functions on [-1, 1]. For any function f ∈ C[a, b], define the modulus of continuity

w(f, δ) := max_{a ≤ x, y ≤ b, |x - y| ≤ δ} |f(x) - f(y)|.

f is (L, α)-Lipschitz continuous with α ∈ (0, 1] (i.e., f ∈ Lip(L, α)) if there is a constant L > 0 such that w(f, δ) ≤ Lδ^α. For any function f ∈ C[-1, 1], we construct an FNN operator from Φ(x) as

G_{Logmoid-1}(f, x) := Σ_{k=-n}^{n} f(k/n) Φ(nx - k). (13)

Theorem 1. For the operator G_{Logmoid-1}(f, x) we have

|f(x) - G_{Logmoid-1}(f, x)| ≤ w(f, 1/n^α) + (8e^{-n^{1-α}/2} + 4/e) ||f||_∞,

where ||·||_∞ denotes the uniform norm. This means the approximation error of f(x) by G_{Logmoid-1}(f, x) is controlled by n and α. The proof is given in Appendix A. Moreover, denote by C_B(ℝ) the set of continuous and bounded functions on ℝ. For any function f ∈ C_B(ℝ), define an FNN operator as

G_{Logmoid-1}(f, x) := Σ_{k=-∞}^{∞} f(k/n) Φ(nx - k). (14)

Theorem 2. For the FNN operator G_{Logmoid-1}(f, x) we have

|f(x) - G_{Logmoid-1}(f, x)| ≤ w(f, 1/n^α) + 8e^{-n^{1-α}/2} ||f||_∞.

The proof of Theorem 2 is given in Appendix B.

Figure: test accuracy vs. batch size for LAUs, Logmoid-1, ReLU, Swish and Mish.
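The operator in Eq. (13) can also be simulated directly. The sketch below is our own illustration, not the paper's code: instead of the global constant ξ, it normalizes the weights Ψ(nx - k) numerically at each evaluation point, which enforces Σ_k Φ(nx - k) = 1 up to truncation of the sum (the definitions of g, ψ and Ψ are repeated so the block is self-contained):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(x):
    return x * math.log(1.0 + sigmoid(x))

def psi(x):
    return g(x + 1.0) - 2.0 * g(x) + g(x - 1.0)

LAM, T0, T1, T2 = 0.934, -0.2536, 3.5025, -3.7326
Y1, Y2 = -0.0181, -0.0307

def Psi(x):
    if T0 <= x < T1:
        return psi(x) + 2.0 * abs(Y1)
    if T2 < x < T0:
        return LAM * (psi(x) + 2.0 * abs(Y2))
    if x >= T1:
        return -psi(x)
    return -LAM * psi(x)

def G(f, x, n, spread=30):
    # quasi-interpolation operator of Eq. (13): sum_k f(k/n) Phi(nx - k),
    # with Phi obtained by normalizing Psi over a window of k values
    k0 = int(round(n * x))
    ks = range(max(-n, k0 - spread), min(n, k0 + spread) + 1)
    w = [Psi(n * x - k) for k in ks]
    total = sum(w)
    return sum(f(k / n) * wk for k, wk in zip(ks, w)) / total

def max_error(f, n):
    grid = [i / 10.0 for i in range(-8, 9)]  # interior of [-1, 1]
    return max(abs(f(x) - G(f, x, n)) for x in grid)
```

On a smooth test function such as f(x) = sin(πx), the maximum interior error shrinks as n grows, consistent with the w(f, 1/n^α) term in Theorem 1; constants are reproduced exactly because the normalized weights sum to one.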
Similar to the bell-shaped function (10), a new bell-shaped function based on Swish and Eq. (9) is defined as

Φ_Swish(x) = λ(φ(x) + 2|Y_1|), if -T_1 ≤ x ≤ T_1; -λφ(x), otherwise,

where λ = 1.758, T_1 is the extreme point, Y_1 = φ(T_1) = -0.034, and φ(x) is defined by

φ(x) = Swish(x + 1) - 2 Swish(x) + Swish(x - 1).

Similar to the FNN operator G_{Logmoid-1}(f, x), we may define an FNN operator G_Swish(f, x) from Φ_Swish(x) in order to further benchmark the performance of Logmoid-1. Two examples are given in Appendix C.

5. EXPERIMENTS WITH LOGMOID ACTIVATION UNITS (LAU)

In this section we evaluate the new activation functions on different tasks. We define the Logmoid Activation Unit (LAU) with parameters as f(x; α, β) = x ln(1 + ασ(βx)), where α and β are trainable hyper-parameters optimized end-to-end so that a good activation is found automatically at each layer. Logmoid networks are feed-forward networks, including convolutional and residual architectures with pooling layers. Inspired by PAUs, we learn one LAU per layer, which gives 2L parameters for a network with L layers.
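Training α and β by back-propagation requires the partial derivatives of the LAU with respect to its input and both parameters. The closed forms below are our own derivation from f(x; α, β) = x ln(1 + ασ(βx)) (in practice a framework's autodiff computes them), checked against central finite differences:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lau(x, alpha, beta):
    return x * math.log(1.0 + alpha * sigmoid(beta * x))

def lau_grads(x, alpha, beta):
    # analytic partial derivatives used in back-propagation;
    # uses sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(beta * x)
    denom = 1.0 + alpha * s
    df_dx = math.log(denom) + alpha * beta * x * s * (1.0 - s) / denom
    df_dalpha = x * s / denom
    df_dbeta = alpha * x * x * s * (1.0 - s) / denom
    return df_dx, df_dalpha, df_dbeta
```

In a deep-learning framework one would register α and β as per-layer learnable parameters (e.g., `nn.Parameter` in PyTorch) and let the optimizer update them alongside the weights.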

5.1. INITIALIZING EXPERIMENTS WITH LOGMOID-1

This subsection explores good initializations for the LAUs. We show that initializing with Logmoid-1 is better than random initialization. Several ablation experiments evaluate Logmoid-1 on MNIST (LeCun et al., 2010) and KMNIST (Clanuwat et al., 2018), each with 10 classes. All simulations use a basic 15-layer network (Liu & Di, 2020). Each of the last 12 layers contains batch normalization (Ioffe & Szegedy, 2015), dropout (Srivastava et al., 2014) with rate 0.25, an activation function, and a dense layer with 500 neurons.

Learning speed. We trained the basic 15-layer network for 15 epochs on MNIST and 30 epochs on KMNIST. As shown in Figs. 3 and 5, Logmoid-1 outperforms Swish and ReLU in both convergence speed and final accuracy, and performs similarly to TanhExp. Logmoid-1 updates the parameters rapidly and drives the network to fit the dataset efficiently.

Over-fitting. Experiments are performed with Logmoid-1, ReLU, Swish, and TanhExp on both MNIST and KMNIST using the basic network with 15 to 25 blocks. From Fig. 4(a), the network suffers from over-fitting with ReLU and Swish, shown by a sharp decrease in accuracy. Logmoid-1 and TanhExp maintain high accuracy in large models and thus prevent over-fitting.

5.2. EXPERIMENTS WITH LAUS

This subsection presents simulation experiments with LAUs on different tasks. We compare LAUs against baseline functions including ReLU, Swish, TanhExp, ACONC (Ma et al., 2021) and Logmoid-1 using standard deep neural networks. We benchmark on the following datasets: MNIST (LeCun et al., 2010), Kuzushiji-MNIST (KMNIST) (Clanuwat et al., 2018), Fashion-MNIST (FMNIST) (Xiao et al., 2017), CIFAR (Krizhevsky et al., 2009), ImageNet (Russakovsky et al., 2015) and COCO (Lin et al., 2014).

Ablation experiments with LAUs. We compare the image classification performance of LAUs against the other activations, using ShuffleNet-V2 as the backbone. All experiments run on the CIFAR-10 dataset.

Batch size. We test LAUs in the same network with batch sizes of 16, 32, 64, 128, 256 and 512. As shown in Fig. 4(b), LAUs achieve the best performance.

Learning rate. Table 2 shows the Top-1 accuracy with ShuffleNet-V2. For learning rate 10^-1, LAUs achieve the highest test accuracy of 88.97%, while the three nonmonotonic functions perform similarly at 10^-2. LAUs are also best at learning rate 10^-4. This means LAUs perform well at almost all comparable learning rates, and their expressive ability is good at smaller learning rates.

Empirical results on Fashion-MNIST. We use two basic architectures, LeNet (Lecun et al., 1998) and VGG-8 (Simonyan & Zisserman, 2014), to evaluate Logmoid-1 on FMNIST. LeNet contains 61,706 trainable parameters, to which LAUs add 8 parameters. We also investigate whether smaller sub-networks achieve similar performance: some blocks of VGG-8 are replaced by dilated convolutions, as in Fig. 6, giving the VGG-8-Dilated network.

Empirical results on CIFAR. We explore the performance of LAUs on the more difficult CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009). These 3-channel images require a neural network with strong learning ability.
We simulate LAUs as well as the other baseline functions on ResNet-20 (He et al., 2016), MobileNet (Howard et al., 2017), MobileNet-V2 (Sandler et al., 2018), ShuffleNet (Zhang et al., 2018), ShuffleNet-V2 (Zhang et al., 2018), SqueezeNet (Iandola et al., 2016), SeNet-18 (Hu et al., 2018) and EfficientNet-B0 (Tan & Le, 2019). From Table 4, LAUs outperform the baseline activation functions in most cases. LAUs improve Top-1 accuracy by around 5% on EfficientNet-B0, and their predictive performance on ResNet-20 and SqueezeNet remains competitive. LAUs obtain a 2.49% improvement over ReLU on MobileNet, compared with the 2.33% improvement of SRS (Zhou et al., 2020).

Object detection on COCO. Object detection is a fundamental branch of computer vision. We run experiments on COCO 2017 (Lin et al., 2014), which contains 118K training, 5K validation and 20K test-dev images. We choose Mask R-CNN (He et al., 2017) as the detector and Swin Transformer (Liu et al., 2021) with different activations as the backbone, using a batch size of 2 and the AdamW optimizer (initial learning rate 0.0001, weight decay 0.05, and a 3x schedule). From Table 7, LAUs obtain 2.9% AP^box and 2.3% AP^mask improvements over ReLU, while PWLU (Zhou et al., 2021) obtains 1.42% AP^box and 1.83% AP^mask improvements.

6. CONCLUSIONS

In this work, we presented a novel family of nonmonotonic activation functions, named the Logmoid family. We constructed a class of FNN operators with Logmoid-1 to approximate any continuous function. We proposed a learnable Logmoid Activation Unit (LAU), initialized with Logmoid-1 and trainable in an end-to-end fashion. LAUs can replace standard activation functions in any neural network. Simulations show that LAUs achieve the best performance across all tested activation functions and architectures. Future work will apply LAUs to other related tasks.



Fig. 1: (a) Numerical Logmoid (solid line) and its first derivative (dashed line) as a function of β with α = 1. (b) Numerical Logmoid (solid line) and its first derivative (dashed line) as a function of α with β = 1.

Fig. 3: Test of Logmoid-1, TanhExp, Swish, and ReLU with a 15-layer basic network on MNIST. (a) Test accuracy. (b) Test loss.

Fig. 5: Test of Logmoid-1, TanhExp, Swish, and ReLU with a 15-layer basic network on KMNIST. (a) Test accuracy. (b) Test loss.

Fig. 6: Schematic of the VGG-8-Dilated network. DConv denotes a dilated convolution.

Fig. 7: The estimated activation functions after training LeNet with LAUs on FMNIST.

Table 1: Top-1 accuracy of Logmoid with different parameters on CIFAR-10.



Table 3: Performance comparison of activation functions on FMNIST.

LAUs improve Top-1 accuracy by 1.1% over the runner-up on VGG-8. Comparing the baseline functions on VGG-8-Dilated and LeNet, LAUs always match or outperform the best performance. This shows that LAU is a learnable activation function that requires no additional, possibly suboptimal, hand-tuning experiments. From Fig. 7, some learned activations are smoothed versions of Logmoid-1 obtained by adjusting the hyper-parameters α and β; we therefore initialize LAUs with Logmoid-1.

Table 4: Performances of activation functions on CIFAR-10.

Table 5: Performances of activation functions on CIFAR-100.

CIFAR-100 has 100 classes with 500 training images and 100 test images per class. From Table 5, LAUs outperform ReLU on all networks. On EfficientNet-B0 they improve Top-1 accuracy by 4.6%, 3.3%, 4.4%, 2.1% and 2.9% compared with ReLU, Logmoid-1, Swish, ACONC and TanhExp, respectively. LAUs obtain a 2.65% improvement over ReLU on MobileNet, which is smaller than the improvement of SRS. LAUs show 1.4% lower classification accuracy than the best result on ShuffleNet-V2, which may be improved by training for more epochs.

Table 6: Comparison of different activations on the ImageNet-200 dataset. We report the Top-1 and Top-5 accuracies (in %) on ShuffleNet-V2 and MobileNet-V3 (Howard et al., 2019).

Empirical results on ImageNet-200. We compared LAUs with the other baseline activation functions on ImageNet 2012 (Russakovsky et al., 2015). For a quick comparison, we randomly chose 200 classes and extracted 500 training images and 50 test images per class from ImageNet, forming ImageNet-200. From Table 6, LAUs lead in both Top-1 and Top-5 accuracy. LAUs obtain a 4.6% Top-1 improvement over ReLU on MobileNet-V3, while PWLU (Zhou et al., 2021) obtains a 1.91% Top-1 improvement.

Table 7: Performances of activation functions on the COCO object detection task. We report results for Mask R-CNN with a Swin Transformer backbone.

