ACTIVATION NOISE FOR EBM CLASSIFICATION

Abstract

We study activation noise in a generative energy-based modeling (EBM) setting, where it is injected during training for the purpose of regularization. We prove that activation noise is a general form of dropout. We then analyze the role of activation noise at inference time and show that it amounts to sampling the network. With activation noise, we observe about a 200% improvement in performance (classification accuracy). We further discover, and prove, that the best performance is achieved when the activation noise follows the same distribution during both training and inference. To explicate this phenomenon, we provide theoretical results that illuminate the roles of activation noise during training and inference and their mutual influence on performance. To further confirm our theoretical results, we conduct experiments on five datasets with seven distributions of activation noise.

1. INTRODUCTION

Whether it is for performing regularization (Moradi et al., 2020) to mitigate overfitting (Kukacka et al., 2017) or for ameliorating the saturation behavior of activation functions, thereby aiding the optimization procedure (Gulcehre et al., 2016), injecting noise into the activation functions of neural networks has been shown effective (Xu et al., 2012). Such activation noise, denoted by z, is added to the output of each neuron of the network (Tian & Zhang, 2022) as follows:

t = s + αz = f(Σ_i w_i x_i + b) + αz,    (1)

where w_i, x_i, b, s, f(·), α, z, and t stand for the ith element of the weights, the ith element of the input signal, the bias, the raw (un-noisy) activation signal, the activation function, the noise scalar, the normalized noise (divorced from its scalar α, originating from any distribution), and the noisy output, respectively. Studying this setting with noisy units is of significance because it resembles how neurons of the brain learn and perform inference in the presence of noise (Wu et al., 2001). In the literature, training with input/activation noise has been shown to be equivalent to loss regularization: a well-studied regularization scheme in which an extra penalty term is appended to the loss function. Injecting noise has also been shown to keep the weights of the neural network small, which is reminiscent of other regularization practices that directly limit the range of the weights (Bishop, 1995). Furthermore, injecting noise into input samples (or activation functions) is an instance of data augmentation (Goodfellow et al., 2016). Injecting noise practically expands the size of the training dataset, because each time a training sample is exposed to the model, fresh random noise is added to the input/latent variables, rendering it different on every pass.
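Eq. (1) can be sketched directly in code. The following is a minimal illustration, not the paper's implementation; the Gaussian draw stands in for "noise from any distribution," and tanh is an arbitrary choice of f(·):

```python
import numpy as np

def noisy_activation(x, w, b, alpha, rng, f=np.tanh):
    """Eq. (1): t = s + alpha*z, with s = f(w.x + b).

    z is normalized noise; a standard Gaussian is used here purely
    for illustration, since the text allows any distribution.
    """
    s = f(np.dot(w, x) + b)       # raw (un-noisy) activation signal
    z = rng.standard_normal()     # normalized noise z
    return s + alpha * z          # noisy output t

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
t1 = noisy_activation(x, w, b=0.05, alpha=0.1, rng=rng)
t2 = noisy_activation(x, w, b=0.05, alpha=0.1, rng=rng)
# Same input, two different outputs: the noise makes each forward
# pass distinct, which is the data-augmentation effect noted above.
```

Two calls on the identical input yield different outputs, mirroring the observation that biological neurons never produce the same response twice to the same stimulus.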
Noisy samples can therefore be regarded as new samples drawn from the domain in the vicinity of the known samples: they smooth the structure of the input space, thereby mitigating the curse of dimensionality and the consequent patchiness/sparsity of datasets. This smoothing makes it easier for the neural network to learn the mapping function (Vincent et al., 2010). In the existing works, however, the impact of activation noise has been neither fully understood during training nor broached at inference time, not to mention the lack of study on the relationship between activation noise at training and at inference, especially for generative energy-based modeling (EBM). In this paper, we address these issues: for the EBM setting, for the first time, we study the empirical and theoretical aspects of activation noise not only during training but also at inference, and we discuss how these two roles relate to each other. We prove that, during training, activation noise (Gulcehre et al., 2016) is a general form of dropout (Srivastava et al., 2014). This is interesting because dropout has been widely adopted as a regularization scheme. We also prove that, during inference, adopting activation noise can be interpreted as sampling the neural network. Accordingly, with activation noise during inference, we estimate the energy of the EBM. Surprisingly, we discover a very strong interrelation between the distributions of activation noise during training and inference: the performance is optimized when the two follow the same distribution. We also prove how to choose the distribution of the noise during inference so as to minimize the inference error, thereby improving the performance by as much as 200%. Overall, our main contributions in this paper are as follows: • We prove that, during training, activation noise is a general form of dropout.
Afterward, we establish the connections between activation noise and loss regularization/data augmentation. • With activation noise during inference as well as training, we observe about a 200% improvement in performance (classification accuracy), which is unprecedented. We also discover and prove that the performance is maximized when the activation noise follows the same distribution during both training and inference. • To explain this phenomenon, we provide theoretical results that illuminate the two strikingly distinct roles of activation noise during training and inference, and we discuss their mutual influence on the performance. To examine our theoretical results, we provide extensive experiments on five datasets, with many noise distributions, various values for the noise scalar α, and different numbers of samples.
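The inference-time role of activation noise described above — repeated noisy forward passes as sampling, averaged to estimate the energy — can be sketched as follows. The toy architecture, all names, and the scalar "energy" here are illustrative assumptions, not the paper's actual EBM:

```python
import numpy as np

def energy(x, w, b, alpha, rng):
    """Toy energy: one hidden layer with activation noise (Eq. 1)
    summed into a scalar. Purely illustrative, not the paper's model."""
    h = np.tanh(w @ x + b) + alpha * rng.standard_normal(b.shape)
    return float(np.sum(h))

def estimate_energy(x, w, b, alpha, rng, n_samples=200):
    """Monte Carlo estimate: average over repeated noisy forward
    passes, i.e. 'sampling the network' at inference time."""
    return float(np.mean([energy(x, w, b, alpha, rng)
                          for _ in range(n_samples)]))

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 3))
b = np.zeros(4)
x = np.array([1.0, -0.5, 0.2])

e_hat = estimate_energy(x, w, b, alpha=0.1, rng=rng)
e_clean = energy(x, w, b, alpha=0.0, rng=rng)  # noise-free reference
# Because the injected noise is zero-mean, the noisy estimate
# concentrates around the noise-free energy as n_samples grows.
```

In a classification setting, such an energy estimate would be computed per class and the lowest-energy class selected; the number of samples n_samples corresponds to the sample counts varied in the experiments.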

2. RELATED WORKS

Our study touches upon multiple domains: (i) neuroscience, (ii) regularization in machine learning, (iii) generative energy-based modeling, and (iv) anomaly detection and one-class classification. (i) Studying the impact of noise in artificial neural networks (ANNs) can aid neuroscience in understanding the brain's operation (Lindsay, 2020; Richards et al., 2019). From neuroscience, we know that neurons of the brain (as formulated by Eq. 1) never produce the same output twice, even when the same stimuli are presented, because of their internal noisy biological processes (Ruda et al., 2020; Wu et al., 2001; Romo et al., 2003). A noisy population of neurons seems, if anything, like a disadvantage (Averbeck et al., 2006; Abbott & Dayan, 1999); how, then, does the brain thwart the inevitable and omnipresent noise (Dan et al., 1998)? We provide new results on top of current evidence that noise can indeed enhance both training (via regularization) and inference (via error minimization) (Zylberberg et al., 2016; Zohary et al., 1994). (ii) Injecting noise into neural networks is known to be a regularization scheme: regularization is broadly defined as any modification made to a learning algorithm that is intended to mitigate overfitting, i.e., to reduce the generalization error but not the training error (Kukacka et al., 2017). Regularization schemes often seek to reduce overfitting by keeping the weights of neural networks small (Xu et al., 2012). Hence, the simplest and most common regularization is to append a penalty to the loss function that grows in proportion to the size of the weights of the model. However, regularization schemes are diverse (Moradi et al., 2020); in the following, we review the popular ones: weight regularization (weight decay) (Gitman & Ginsburg, 2017) penalizes the model during training based on the magnitude of the weights (Van Laarhoven, 2017).
This encourages the model to map the inputs to the outputs of the training dataset while keeping the weights of the model small (Salimans & Kingma, 2016). Batch normalization regularizes the network by reducing internal covariate shift: it scales the output of a layer by standardizing the activations of each input variable per mini-batch (Ioffe & Szegedy, 2015). Ensemble learning (Zhou, 2021) trains multiple models (with heterogeneous architectures) and averages their predictions (Breiman, 1996). Activity regularization (Kilinc & Uysal, 2018b) penalizes the model during training based on the magnitude of the activations (Deng et al., 2019; Kilinc & Uysal, 2018a). Weight constraints limit the magnitude of the weights to a fixed range (Srebro & Shraibman, 2005). Dropout (Srivastava et al., 2014) probabilistically removes inputs during training: dropout relies on the rationale of ensemble learning that trains multiple models. However, training and maintaining multiple models in parallel inflicts heavy computational/memory expenses. Alternatively, dropout proposes that a single model can be leveraged to simulate training an expo-

