FOURIER STOCHASTIC BACKPROPAGATION

Abstract

Backpropagating gradients through random variables is at the heart of numerous machine learning applications. In this paper, we present a general framework for deriving stochastic backpropagation rules for any distribution, discrete or continuous. Our approach exploits the link between the characteristic function and the Fourier transform to transport derivatives from the parameters of the distribution to the random variable. Our method generalizes previously known estimators and yields new estimators for the gamma, beta, Dirichlet, and Laplace distributions. Furthermore, we show that the classical deterministic backpropagation rule and the discrete random variable case can also be interpreted through stochastic backpropagation.

1. INTRODUCTION

Deep neural networks with stochastic hidden layers have become crucial in multiple domains, such as generative modeling (Kingma & Welling, 2013; Rezende et al., 2014; Mnih & Gregor, 2014), deep reinforcement learning (Sutton et al., 2000), and attention mechanisms (Mnih et al., 2014). The difficulty in training such models arises in computing gradients of functions of the form L(θ) := E_{z∼p_θ}[f(z)] with respect to the parameters θ, which requires backpropagating the gradient through the random variable z. One of the first and most widely used methods is the score function or REINFORCE method (Glynn, 1989; Williams, 1992), which requires computing and estimating the derivative of the log-probability function. For high-dimensional applications, however, it has been noted that REINFORCE gradients have high variance, making the training process unstable (Rezende et al., 2014). Recently, significant progress has been made in tackling the variance problem. The first class of approaches, dealing with continuous random variables, consists of reparameterization tricks. A standardization function is introduced that separates the stochasticity from the dependency on the parameters θ; the derivative can then be transported inside the expectation and samples drawn from a fixed distribution, resulting in low-variance gradients (Kingma & Welling, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014; Ruiz et al., 2016; Naesseth et al., 2017; Figurnov et al., 2018). The second class of approaches concerns discrete random variables, for which no direct reparameterization is known. The first solution uses the score function gradient with control variate methods to reduce its variance (Mnih & Gregor, 2014; Gu et al., 2016).
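The contrast between the two continuous-variable estimators can be made concrete with a minimal numerical sketch. For a Gaussian z ∼ N(μ, σ²) and the illustrative test function f(z) = z² (a choice of ours, not from the paper), E[f(z)] = μ² + σ², so the true gradient with respect to μ is 2μ. Both the score function estimator E[f(z)∇_μ log p_θ(z)] and the reparameterized estimator E[∇_μ f(μ + σε)] are unbiased, but the latter typically has far lower variance:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 1.0, 200_000

f = lambda z: z ** 2  # illustrative test function; true gradient d/dmu E[f] = 2*mu

eps = rng.standard_normal(n)
z = mu + sigma * eps  # reparameterized sample: z ~ N(mu, sigma^2)

# Score function (REINFORCE) estimator: E[f(z) * d/dmu log p(z)],
# where d/dmu log N(z; mu, sigma^2) = (z - mu) / sigma^2.
score_terms = f(z) * (z - mu) / sigma ** 2
score_grad = score_terms.mean()

# Reparameterization estimator: E[f'(mu + sigma*eps)], with f'(z) = 2z.
reparam_terms = 2 * z
reparam_grad = reparam_terms.mean()

print(score_grad, reparam_grad)              # both estimate 2*mu = 3.0
print(score_terms.var(), reparam_terms.var())  # per-sample variance gap
```

Both estimates concentrate around 3.0, but the per-sample variance of the score function terms is roughly an order of magnitude larger here, which is the instability the control variate and reparameterization literatures address.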
The second consists in introducing a continuous relaxation of the discrete random variable that admits a reparameterization trick, making it possible to backpropagate low-variance reparameterized gradients by sampling from the concrete distribution (Jang et al., 2016; Maddison et al., 2016; Tucker et al., 2017; Grathwohl et al., 2018). Although recent developments have advanced the state of the art in terms of variance reduction and performance, stochastic backpropagation (i.e., computing gradients through random variables) still lacks theoretical foundations. In particular, the following questions remain open: How can stochastic backpropagation rules, in which the derivative is transferred explicitly to the function f, be developed for a broader range of distributions? And can the discrete and deterministic cases be interpreted in the sense of stochastic backpropagation? In this paper, we provide a new method to address these questions, and our main contributions are the following:

• We present a theoretical framework, based on the link between the multivariate Fourier transform and the characteristic function, that provides a standard method for deriving stochastic backpropagation rules for any distribution, discrete or continuous.
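As an illustration of the continuous-relaxation approach mentioned above, a Gumbel-softmax (concrete) sample replaces a one-hot draw from a categorical distribution with a differentiable point on the simplex. The logits and temperature below are illustrative values of ours, not taken from any of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -1.0])  # unnormalized log-probabilities (illustrative)
tau = 0.5  # temperature; as tau -> 0, samples approach one-hot vectors

def concrete_sample(logits, tau, rng):
    """Draw one Gumbel-softmax (concrete) sample: softmax((logits + Gumbel noise) / tau)."""
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))        # Gumbel(0, 1) noise via inverse transform
    y = (logits + g) / tau
    y -= y.max()                    # subtract max for a numerically stable softmax
    e = np.exp(y)
    return e / e.sum()

y = concrete_sample(logits, tau, rng)
print(y)  # a point on the probability simplex, concentrated near one vertex
```

Because the sample is a deterministic, differentiable function of the logits given the Gumbel noise, reparameterized gradients with respect to the logits can flow through it, at the cost of a temperature-controlled bias relative to the exact discrete distribution.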

