Activation Function

Abstract

Inspired by the natural world, an activation function is proposed: the absolute function, y = |x|. Through tests on the MNIST dataset with a fully-connected neural network and a convolutional neural network, several conclusions are drawn. The accuracy curve of the absolute function oscillates slightly, unlike the accuracy curves of ReLU and leaky ReLU. The absolute function keeps the negative part equal in magnitude to the positive part, so its individualization is more active than that of the ReLU and leaky ReLU functions. With respect to generalization, this individualization is the reason for the oscillation: the accuracy may be better on some sets and worse on others, and the absolute function is less likely to over-fit. When the batch size is small the individualization is pronounced, and vice versa. Tests on MNIST with an autoencoder suggest that the leaky ReLU function does classification tasks well, while the absolute function does generation tasks well, because classification needs more universality and generation needs more individualization. Pleasurable and painful stimuli differ not only in magnitude but also in sign, so the negative part should be kept. Stimuli that occur frequently have low values and appear near zero in figure 1; stimuli that occur occasionally have high values and appear far from zero in figure 1. Thus a high value is a strong stimulus, which is individualization.

1. Preface

There are many activation functions, such as sigmoid, tanh, ReLU, leaky ReLU, and ELU. The most frequently used is ReLU [1]. The ReLU function has the dying-ReLU problem: some neurons are never activated, so their related parameters are never updated. The absolute function does not have the dying-ReLU problem, nor gradient vanishing or gradient explosion, and it has some interesting characteristics. The image of the absolute function is shown in figure 1. The formula is y = |x|.
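As a minimal sketch, the absolute activation and its subgradient can be written in a few lines of NumPy (the function names are mine, introduced for illustration):

```python
import numpy as np

def absolute_activation(x):
    """Absolute-value activation: y = |x|."""
    return np.abs(x)

def absolute_grad(x):
    """Subgradient of |x|: -1 for x < 0, +1 for x >= 0.
    Unlike ReLU, whose gradient is 0 for all negative inputs,
    the gradient is never stuck at zero, so no unit can 'die'."""
    return np.where(x < 0, -1.0, 1.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(absolute_activation(x))  # negative inputs keep their magnitude
print(absolute_grad(x))        # the sign information survives in the gradient
```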

2.1. The test of the absolute function on a fully-connected neural network

First, I test the activation functions, including ReLU, leaky ReLU, and absolute, on the MNIST dataset with a fully-connected neural network. The network has 5 layers. The first 5000 samples form the validation set and the rest form the training set [7]. The optimizer is Adam [6], the loss is SparseCategoricalCrossentropy, and the metric is accuracy [2]. The test result is shown in figure 2. From figure 2, the validation accuracy of the absolute function is lower than that of ReLU and leaky ReLU when the number of epochs is small, but as the epochs grow, it becomes almost equal to that of ReLU and leaky ReLU. The accuracy curve of the absolute function oscillates slightly, unlike the curves of ReLU and leaky ReLU, while the loss of the absolute function is almost equal to the loss of ReLU and leaky ReLU. Because the absolute function keeps the negative part equal in magnitude to the positive part, its individualization is more active than that of the ReLU and leaky ReLU functions. Why? Let's change the batch size; the results are in figure 3. In figure 3 the batch size is 64, and the curve no longer oscillates as before: it behaves like the curves of ReLU and leaky ReLU. The ReLU and leaky ReLU curves are more universal and stable than the curve of the absolute function, because those functions filter out the negative part and therefore partly lose the individualization. With respect to generalization, the individualization is the reason for the oscillation: the accuracy may be better on some sets and worse on others, and the absolute function is less likely to over-fit.
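The setup above can be sketched in Keras as follows. The paper specifies only a 5-layer fully-connected network, the Adam optimizer, and the SparseCategoricalCrossentropy loss; the layer widths here are my own assumption:

```python
import tensorflow as tf
from tensorflow import keras

def build_mlp(activation="relu"):
    """5-layer fully-connected network for 28x28 MNIST images.
    Pass activation=tf.abs to test the absolute function,
    or keras.layers.LeakyReLU() for leaky ReLU."""
    return keras.Sequential([
        keras.Input(shape=(28, 28)),
        keras.layers.Flatten(),
        keras.layers.Dense(256, activation=activation),
        keras.layers.Dense(128, activation=activation),
        keras.layers.Dense(64, activation=activation),
        keras.layers.Dense(32, activation=activation),
        keras.layers.Dense(10, activation="softmax"),
    ])

model = build_mlp(activation=tf.abs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training as described in the text: first 5000 samples for validation.
# (x, y), _ = keras.datasets.mnist.load_data()
# model.fit(x[5000:] / 255.0, y[5000:], batch_size=32, epochs=20,
#           validation_data=(x[:5000] / 255.0, y[:5000]))
```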

2.2. The test of the absolute function on a convolutional neural network

Let's change the neural network; the result with a convolutional neural network on the MNIST dataset is shown in figure 4. The network has 3 convolutional layers and 2 fully-connected layers. The oscillation is clearer than on the fully-connected network when the batch size is 32; as the batch size increases, the oscillation weakens. The absolute function can resist over-fitting when the batch size is small. The loss, which first decreases and then increases, is odd when the batch size is 32, yet the loss is stable like that of ReLU and leaky ReLU when the batch size is 128. Correspondingly, the accuracy increases step by step. To increase the accuracy, the model has to learn the individualization, which increases the loss. The ReLU and leaky ReLU functions cannot learn the individualization the way the absolute function does, so their loss is stable. When the batch size is small the individualization is clear, and vice versa. If you want to change the degree of individualization of the absolute function, just change the batch size.
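A plausible Keras sketch of this network is shown below. The paper gives only the layer counts (3 convolutional, 2 fully-connected); the filter counts and kernel sizes are my assumptions:

```python
import tensorflow as tf
from tensorflow import keras

def build_cnn(activation=tf.abs):
    """3 convolutional layers + 2 fully-connected layers for MNIST."""
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        keras.layers.Conv2D(32, 3, activation=activation),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(64, 3, activation=activation),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(64, 3, activation=activation),
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation=activation),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
# The batch_size argument of model.fit (32 vs. 128 in the text) is the
# knob that controls how visible the oscillation ("individualization") is.
```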

3. The test of the absolute function on autoencoder generation

Because the absolute function has more individualization, I choose it for the autoencoder's generation [3] task. The abstract network (that is, a prediction network, usually used for classification and regression) is common now, but the concrete network (that is, a generation network), which generates concrete information from a concept or label, is rare. Its principle is shown in figure 5. The test uses the MNIST dataset and a convolutional neural network [5]. The convolutional neural network (the abstract network) is LeNet-like, with 4 convolutional layers and 2 fully-connected layers. The concrete network is the inverse of the abstract network and has 6 layers. The optimizer is Adam and the loss is MSE. See my paper [3] for details. The test result is shown in figure 6: the left image is the input, the label is the argmax of the abstract network's predicted output, and the right image is the output of the concrete network.
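Under the same caveat that the paper specifies only the layer counts, the abstract/concrete pair might be sketched like this: the abstract network maps an image to a 10-dimensional code, and the concrete network maps that code back to an image; all widths, strides, and the sigmoid output are my assumptions:

```python
import tensorflow as tf
from tensorflow import keras

act = tf.abs  # the absolute activation, chosen here for the generation task

# Abstract (prediction) network: LeNet-like, 4 convolutional layers
# plus 2 fully-connected layers, image -> 10-dim code.
abstract_net = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(16, 3, padding="same", activation=act),
    keras.layers.Conv2D(16, 3, strides=2, padding="same", activation=act),  # 14x14
    keras.layers.Conv2D(32, 3, padding="same", activation=act),
    keras.layers.Conv2D(32, 3, strides=2, padding="same", activation=act),  # 7x7
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation=act),
    keras.layers.Dense(10, activation="softmax"),
])

# Concrete (generation) network: roughly the inverse of the abstract
# network, 10-dim code -> image.
concrete_net = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(64, activation=act),
    keras.layers.Dense(7 * 7 * 32, activation=act),
    keras.layers.Reshape((7, 7, 32)),
    keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation=act),  # 14x14
    keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation=act),  # 28x28
    keras.layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid"),
])

autoencoder = keras.Sequential([abstract_net, concrete_net])
autoencoder.compile(optimizer="adam", loss="mse")  # MSE loss, Adam, as in the text
```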

4. Some Thinking

Should we keep the negative part of the inputs? I think we can learn from other fields. Color has white and black, temperature has hot and cold, smell has fragrant and foul, taste has bitter and sweet: all the information our bodies receive comes in opposite pairs. Pleasurable and painful stimuli differ not only in magnitude but also in sign, so the negative part should be kept. According to probability theory, the natural world follows a normal distribution [4]. Stimuli that occur frequently have low values and appear near zero in figure 1; stimuli that occur occasionally have high values and appear far from zero in figure 1. Thus a high value is a strong stimulus, which is individualization.



Fig 1: The Image of the Absolute Function

Fig 3: Training and Validation Accuracy, Batch Size is 64

Fig 4.1: Training and Validation Accuracy, Batch Size is 32

Fig 4.7.1: Visualized Intermediate Activations with ReLU Activation (First Convolutional Layer)

Fig 5: The Functionally Separate Auto-encoder

Fig 6.2: Absolute Function, Epoch = 1


