DISCOVERING PARAMETRIC ACTIVATION FUNCTIONS

Abstract

Recent studies have shown that the choice of activation function can significantly affect the performance of deep learning networks. However, the benefits of novel activation functions have been inconsistent and task dependent, and therefore the rectified linear unit (ReLU) is still the most commonly used. This paper proposes a technique for customizing activation functions automatically, resulting in reliable improvements in performance. Evolutionary search is used to discover the general form of the function, and gradient descent to optimize its parameters for different parts of the network and over the learning process. Experiments with four different neural network architectures on the CIFAR-10 and CIFAR-100 image classification datasets show that this approach is effective. It discovers both general activation functions and specialized functions for different architectures, consistently improving accuracy over ReLU and other recently proposed activation functions by significant margins. The approach can therefore be used as an automated optimization step in applying deep learning to new tasks.

1. INTRODUCTION

The rectified linear unit (ReLU(x) = max{x, 0}) is the most commonly used activation function in modern deep learning architectures (Nair & Hinton, 2010). When introduced, it offered substantial improvements over the previously popular tanh and sigmoid activation functions. Because ReLU is unbounded as x → ∞, it is less susceptible to vanishing gradients than tanh and sigmoid are. It is also simple to compute, which leads to faster training.

Activation function design continues to be an active area of research, and a number of novel activation functions have been introduced since ReLU, each with different properties (Nwankpa et al., 2018). In certain settings, these novel activation functions lead to substantial improvements in accuracy over ReLU, but the gains are often inconsistent across tasks. Because of this inconsistency, ReLU remains the most commonly used activation function: it is reliable, even though it may be suboptimal.

The improvements and inconsistencies reflect a gradually evolving understanding of what makes an activation function effective. For example, Leaky ReLU (Maas et al., 2013) allows a small amount of gradient information to flow when the input is negative; it was introduced to prevent ReLU from creating dead neurons, i.e., neurons that are stuck at always outputting zero. On the other hand, the ELU activation function (Clevert et al., 2015) contains a negative saturation regime to control the forward-propagated variance. These two very different activation functions have seemingly contradictory properties, yet each has proven more effective than ReLU in various tasks.

There are also often complex interactions between an activation function and other neural network design choices, which adds to the difficulty of selecting an appropriate activation function for a given task. For example, Ramachandran et al. (2018) warned that the scale parameter in batch normalization (Ioffe & Szegedy, 2015) should be set when training with the Swish activation function; Hendrycks & Gimpel (2016) suggested using an optimizer with momentum when using GELU; and Klambauer et al. (2017) introduced a modification of dropout (Hinton et al., 2012), called alpha dropout, to be used with SELU. These results suggest that significant gains are possible by designing the activation function properly for a given network and task, but that it is difficult to do so manually.

This paper presents an approach to automatic activation function design. The approach is inspired by genetic programming (Koza, 1992), which encompasses techniques for evolving computer programs to solve a particular task. In contrast with previous studies (Bingham et al., 2020; Ramachandran et al., 2018; Liu et al., 2020; Basirat & Roth, 2018), this paper focuses on automatically discovering activation functions that are parametric. Evolution discovers the general form of the function, while gradient descent optimizes its parameters during training. The approach,
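The core idea of a parametric activation function can be illustrated with a small sketch. This toy example is an assumption for illustration, not the paper's actual search procedure: it fixes a Swish-like form αx·σ(βx) (with α and β as trainable shape parameters) and tunes α and β by gradient descent, using central-difference gradient estimates on a one-dimensional regression toward a ReLU target. In a real network, the same parameters would instead be updated by backpropagation alongside the weights, and could differ across layers and over training.

```python
# Toy sketch (not the paper's method): gradient descent on the shape
# parameters of a fixed parametric activation form.
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def act(x, alpha, beta):
    # Parametric Swish-like form: alpha * x * sigmoid(beta * x).
    # beta = 1, alpha = 1 gives Swish; as beta grows it approaches ReLU.
    return alpha * x * sigmoid(beta * x)

def loss(params, x, y):
    alpha, beta = params
    return np.mean((act(x, alpha, beta) - y) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=256)
y = np.maximum(x, 0.0)           # toy target: mimic ReLU

params = np.array([0.5, 0.5])    # initial alpha, beta
lr, eps = 0.5, 1e-5
loss_before = loss(params, x, y)
for _ in range(500):
    grad = np.zeros(2)
    for i in range(2):           # central-difference gradient estimate
        hi, lo = params.copy(), params.copy()
        hi[i] += eps
        lo[i] -= eps
        grad[i] = (loss(hi, x, y) - loss(lo, x, y)) / (2 * eps)
    params -= lr * grad
loss_after = loss(params, x, y)

print("alpha, beta:", params, "loss:", loss_before, "->", loss_after)
```

Because the functional form is fixed while its parameters move freely, this separation mirrors the paper's division of labor: evolution proposes the form (here hand-picked), and gradient descent adapts its parameters to the task.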

