ACTIVATION NOISE FOR EBM CLASSIFICATION

Abstract

We study activation noise in a generative energy-based modeling (EBM) setting, where it serves as a regularizer during training. We prove that activation noise is a general form of dropout. We then analyze the role of activation noise at inference time and show that it effectively performs sampling. Owing to activation noise, we observe roughly a 200% improvement in performance (classification accuracy). Furthermore, we show, both empirically and theoretically, that the best performance is achieved when the activation noise follows the same distribution during training and inference. To explicate this phenomenon, we provide theoretical results that illuminate the roles of activation noise during training and inference, as well as their mutual influence on performance. To corroborate the theory, we conduct experiments on five datasets with seven distributions of activation noise.

1. INTRODUCTION

Whether it is for performing regularization (Moradi et al., 2020) to mitigate overfitting (Kukacka et al., 2017) or for ameliorating the saturation behavior of activation functions, thereby aiding the optimization procedure (Gulcehre et al., 2016), injecting noise into the activation functions of neural networks has been shown to be effective (Xu et al., 2012). Such activation noise, denoted by z, is added to the output of each neuron of the network (Tian & Zhang, 2022) as follows:

t = s + αz = f(∑_i w_i x_i + b) + αz,    (1)

where w_i, x_i, b, s, f(·), α, z, and t stand for the ith element of the weights, the ith element of the input signal, the bias, the raw (un-noisy) activation signal, the activation function, the noise scale, the normalized noise (divorced from its scale α, drawn from any distribution), and the noisy output, respectively. Studying this setting with noisy units is of significance because it resembles how neurons of the brain learn and perform inference in the presence of noise (Wu et al., 2001).

In the literature, training with input/activation noise has been shown to be equivalent to loss regularization: a well-studied regularization scheme in which an extra penalty term is appended to the loss function. Injecting noise has also been shown to keep the weights of the neural network small, which is reminiscent of other regularization practices that directly limit the range of the weights (Bishop, 1995). Furthermore, injecting noise into input samples (or activations) is an instance of data augmentation (Goodfellow et al., 2016). Injecting noise effectively expands the size of the training dataset, because each time a training sample is presented to the model, fresh random noise is added to the input/latent variables, rendering the sample different on every pass.
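The forward pass of Eq. (1) can be sketched as follows; this is a minimal NumPy illustration, where the Gaussian choice for z, the tanh activation, and all numeric values are assumptions for the example rather than choices prescribed here:

```python
import numpy as np

def noisy_neuron(x, w, b, alpha, rng, f=np.tanh):
    """Single-neuron forward pass with additive activation noise,
    following Eq. (1): t = f(sum_i w_i x_i + b) + alpha * z."""
    s = f(np.dot(w, x) + b)      # raw (un-noisy) activation signal s
    z = rng.standard_normal()    # normalized noise z (here: Gaussian)
    return s + alpha * z         # noisy output t

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 0.3])   # input signal
w = np.array([0.8, 0.1, -0.4])   # weights
t = noisy_neuron(x, w, b=0.1, alpha=0.05, rng=rng)
```

Setting alpha = 0 recovers the deterministic neuron, while any other distribution for z (uniform, Laplace, etc.) can be substituted by replacing the sampling line.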
Noisy samples can therefore be regarded as new samples drawn from the domain in the vicinity of the known ones: they smooth the structure of the input space, thereby mitigating the curse of dimensionality and the consequent patchiness/sparsity of the dataset. This smoothing makes it easier for the neural network to learn the mapping function (Vincent et al., 2010). In existing work, however, the impact of activation noise has been neither fully understood during training nor examined at inference time, let alone the relationship between activation noise at training and at inference, especially for generative energy-based modeling (EBM). In this paper, we study these issues: for the EBM setting, we investigate, for the first time, the empirical and theoretical aspects of activation noise not only during training but also at inference, and we discuss how these two roles relate to each other. We prove that, during training, activation noise (Gulcehre et al., 2016) is a general form of dropout (Srivastava et al., 2014). This is notable because dropout has been widely adopted as a regularization scheme. We then

