DISCOVERING PARAMETRIC ACTIVATION FUNCTIONS

Abstract

Recent studies have shown that the choice of activation function can significantly affect the performance of deep learning networks. However, the benefits of novel activation functions have been inconsistent and task dependent, and therefore the rectified linear unit (ReLU) is still the most commonly used. This paper proposes a technique for customizing activation functions automatically, resulting in reliable improvements in performance. Evolutionary search is used to discover the general form of the function, and gradient descent to optimize its parameters for different parts of the network and over the learning process. Experiments with four different neural network architectures on the CIFAR-10 and CIFAR-100 image classification datasets show that this approach is effective. It discovers both general activation functions and specialized functions for different architectures, consistently improving accuracy over ReLU and other recently proposed activation functions by significant margins. The approach can therefore be used as an automated optimization step in applying deep learning to new tasks.

1. INTRODUCTION

The rectified linear unit (ReLU(x) = max{x, 0}) is the most commonly used activation function in modern deep learning architectures (Nair & Hinton, 2010). When introduced, it offered substantial improvements over the previously popular tanh and sigmoid activation functions. Because ReLU is unbounded as x → ∞, it is less susceptible to vanishing gradients than tanh and sigmoid are. It is also simple to compute, which leads to faster training times.

Activation function design continues to be an active area of research, and a number of novel activation functions have been introduced since ReLU, each with different properties (Nwankpa et al., 2018). In certain settings, these novel activation functions lead to substantial improvements in accuracy over ReLU, but the gains are often inconsistent across tasks. Because of this inconsistency, ReLU remains the most commonly used activation function: it is reliable, even though it may be suboptimal.

The improvements and inconsistencies reflect a gradually evolving understanding of what makes an activation function effective. For example, Leaky ReLU (Maas et al., 2013) allows a small amount of gradient information to flow when the input is negative. It was introduced to prevent ReLU from creating dead neurons, i.e. neurons that are stuck at always outputting zero. On the other hand, the ELU activation function (Clevert et al., 2015) contains a negative saturation regime to control the forward-propagated variance. These two very different activation functions have seemingly contradictory properties, yet each has proven more effective than ReLU in various tasks.

There are also often complex interactions between an activation function and other neural network design choices, adding to the difficulty of selecting an appropriate activation function for a given task. For example, Ramachandran et al.
(2018) warned that the scale parameter in batch normalization (Ioffe & Szegedy, 2015) should be set when training with the Swish activation function; Hendrycks & Gimpel (2016) suggested using an optimizer with momentum when using GELU; and Klambauer et al. (2017) introduced a modification of dropout (Hinton et al., 2012), called alpha dropout, to be used with SELU. These results suggest that significant gains are possible by designing the activation function properly for a network and task, but that it is difficult to do so manually.

Table 1: The operator search space consists of basic unary and binary functions as well as existing activation functions (Appendix D). Here σ(x) = (1 + e^(-x))^(-1); the unary operators bessel_i0e and bessel_i1e are the exponentially scaled modified Bessel functions of order 0 and 1, respectively.

  Unary:  0, 1, x, -x, x^2, x^(-1), |x|, e^x, e^x - 1, erf(x), erfc(x),
          sinh(x), cosh(x), tanh(x), arcsinh(x), arctanh(x), σ(x), log(σ(x)),
          bessel_i0e(x), bessel_i1e(x), ReLU(x), ELU(x), SELU(x), Swish(x),
          Softplus(x), Softsign(x), HardSigmoid(x)

  Binary: x1 + x2, x1 - x2, x1 · x2, x1/x2, max{x1, x2}, min{x1, x2}

This paper presents an approach to automatic activation function design. The approach is inspired by genetic programming (Koza, 1992), which describes techniques for evolving computer programs to solve a particular task. In contrast with previous studies (Bingham et al., 2020; Ramachandran et al., 2018; Liu et al., 2020; Basirat & Roth, 2018), this paper focuses on automatically discovering activation functions that are parametric. Evolution discovers the general form of the function, while gradient descent optimizes the parameters of the function during training. The approach, called PANGAEA (Parametric ActivatioN functions Generated Automatically by an Evolutionary Algorithm), discovers general activation functions that improve performance overall over previously proposed functions.
It also produces specialized functions for different architectures, such as Wide ResNet, ResNet, and Preactivation ResNet, that perform even better than the general functions, demonstrating its ability to customize activation functions to architectures.
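The division of labor described above — evolution fixes the functional form, gradient descent tunes its parameters during training — can be sketched in a few lines of NumPy. The specific parametric form f(x) = α · Swish(βx), the toy objective (matching ReLU on sampled inputs), and the learning-rate settings below are illustrative assumptions for this sketch, not functions discovered by PANGAEA:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    # Swish(z) = z * sigmoid(z), one of the unary operators in the search space
    return z * sigmoid(z)

def swish_grad(z):
    # d/dz [z * sigmoid(z)] = sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

def f(x, alpha, beta):
    # Hypothetical parametric form: evolution would fix this graph,
    # while gradient descent adjusts alpha and beta during training.
    return alpha * swish(beta * x)

# Toy objective: tune (alpha, beta) so that f mimics ReLU on sampled inputs.
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=256)
target = np.maximum(x, 0.0)

alpha, beta, lr = 1.0, 0.5, 0.05
losses = []
for _ in range(200):
    y = f(x, alpha, beta)
    err = y - target
    losses.append(np.mean(err ** 2))
    # Analytic gradients of the mean-squared error w.r.t. the parameters
    d_alpha = np.mean(2.0 * err * swish(beta * x))
    d_beta = np.mean(2.0 * err * alpha * x * swish_grad(beta * x))
    alpha -= lr * d_alpha
    beta -= lr * d_beta

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}, alpha={alpha:.2f}, beta={beta:.2f}")
```

As β grows, α · Swish(βx) approaches ReLU, so the loss decreases steadily; in PANGAEA these parameter updates happen alongside the network's weight updates rather than in a separate loop.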

2. RELATED WORK

Prior work in automatic activation function discovery includes that of Ramachandran et al. (2018), who used reinforcement learning to design novel activation functions. They discovered multiple functions, but analyzed just one in depth: Swish(x) = x · σ(x). Of the top eight functions discovered, only Swish and max{x, σ(x)} consistently outperformed ReLU across multiple tasks, suggesting that improvements are possible but often task specific.

Bingham et al. (2020) used evolution to discover novel activation functions. Whereas their functions had a fixed graph structure, PANGAEA utilizes a flexible search space that implements activation functions as arbitrary computation graphs. PANGAEA also includes more powerful mutation operations, and a function parameterization approach that makes it possible to further refine functions through gradient descent.

Liu et al. (2020) evolved normalization-activation layers. They searched for a computation graph that replaced both batch normalization and ReLU in multiple neural networks. They argued that the inherent nonlinearity of the discovered layers precluded the need for any explicit activation function. However, experiments in this paper show that carefully designed parametric activation functions can in fact be a powerful augmentation to existing deep learning models.

The activation functions are implemented in TensorFlow (Abadi et al., 2016), and safe operator implementations are chosen when possible (e.g. the binary operator x1/x2 is implemented as tf.math.divide_no_nan, which returns 0 if x2 = 0). The operators in Table 1 were chosen to create a large and expressive search space that contains activation functions unlikely to be discovered by hand. Operators that are periodic (e.g. sin(x)) and operators that contain repeated asymptotes were not included; in
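The "safe operator" idea mentioned above — division that yields 0 instead of inf or nan when the denominator is 0, as tf.math.divide_no_nan does — can be mimicked in plain NumPy. This is an illustrative stand-in, not the paper's TensorFlow implementation:

```python
import numpy as np

def divide_no_nan(x1, x2):
    """Elementwise x1 / x2 that returns 0 wherever x2 == 0.

    A NumPy sketch of the behavior of tf.math.divide_no_nan: entries with
    a zero denominator are left at 0, so an evolved computation graph
    never propagates non-finite values.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    out = np.zeros(np.broadcast_shapes(x1.shape, x2.shape))
    # Only divide where the denominator is nonzero; other entries stay 0.
    np.divide(x1, x2, out=out, where=(x2 != 0))
    return out

print(divide_no_nan([1.0, 2.0, 3.0], [2.0, 0.0, -1.5]))
```

Here the middle entry has a zero denominator and comes back as 0 rather than nan, so candidate functions containing division remain numerically well-defined everywhere.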

