FOURIER STOCHASTIC BACKPROPAGATION

Abstract

Backpropagating gradients through random variables is at the heart of numerous machine learning applications. In this paper, we present a general framework for deriving stochastic backpropagation rules for any distribution, discrete or continuous. Our approach exploits the link between the characteristic function and the Fourier transform to transport the derivatives from the parameters of the distribution to the random variable. Our method generalizes previously known estimators and results in new estimators for the gamma, beta, Dirichlet, and Laplace distributions. Furthermore, we show that the classical deterministic backpropagation rule and the discrete random variable case can also be interpreted through stochastic backpropagation.

1. INTRODUCTION

Deep neural networks with stochastic hidden layers have become crucial in multiple domains, such as generative modeling (Kingma & Welling, 2013; Rezende et al., 2014; Mnih & Gregor, 2014), deep reinforcement learning (Sutton et al., 2000), and attention mechanisms (Mnih et al., 2014). The difficulty encountered in training such models arises in the computation of gradients for functions of the form $L(\theta) := \mathbb{E}_{z \sim p_\theta}[f(z)]$ with respect to the parameters $\theta$, which requires backpropagating the gradient through the random variable $z$. One of the first and most widely used methods is the score function or REINFORCE method (Glynn, 1989; Williams, 1992), which requires the computation and estimation of the derivative of the log probability function. For high-dimensional applications, however, it has been noted that REINFORCE gradients have high variance, making the training process unstable (Rezende et al., 2014). Recently, significant progress has been made in tackling the variance problem. The first class of approaches, dealing with continuous random variables, comprises reparameterization tricks. In that case a standardization function is introduced that separates the stochasticity from the dependency on the parameters $\theta$, making it possible to transport the derivative inside the expectation and to sample from a fixed distribution, resulting in low-variance gradients (Kingma & Welling, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014; Ruiz et al., 2016; Naesseth et al., 2017; Figurnov et al., 2018). The second class of approaches concerns discrete random variables, for which a direct reparameterization is not known. The first solution uses the score function gradient with control variate methods to reduce its variance (Mnih & Gregor, 2014; Gu et al., 2016).
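To make the contrast concrete, the following sketch compares the two estimators discussed above on a toy problem: the gradient of $\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2]$ with respect to $\mu$, whose exact value is $2\mu$. The choice of test function $f(z) = z^2$ and the specific parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 1.0, 100_000

f = lambda z: z ** 2  # illustrative test function; exact gradient d/dmu E[f(z)] = 2*mu = 3.0

# Score-function (REINFORCE) estimator: E[f(z) * d/dmu log p_theta(z)]
z = rng.normal(mu, sigma, n)
g_sf = f(z) * (z - mu) / sigma ** 2

# Reparameterization estimator: write z = mu + sigma*eps with eps ~ N(0, 1),
# so the derivative moves inside the expectation: E[f'(mu + sigma*eps)]
eps = rng.normal(0.0, 1.0, n)
g_rp = 2.0 * (mu + sigma * eps)

print(g_sf.mean(), g_sf.var())  # unbiased, but high variance
print(g_rp.mean(), g_rp.var())  # unbiased, much lower variance
```

Both estimators are unbiased, but the reparameterized one has variance $4\sigma^2 = 4$ here, an order of magnitude below the score-function estimator's, which is the instability phenomenon noted in the text.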
The second consists in introducing a continuous relaxation of the discrete random variable that admits a reparameterization trick, thus making it possible to backpropagate low-variance reparameterized gradients by sampling from the Concrete distribution (Jang et al., 2016; Maddison et al., 2016; Tucker et al., 2017; Grathwohl et al., 2018). Although recent developments have advanced the state of the art in terms of variance reduction and performance, stochastic backpropagation (i.e., computing gradients through random variables) still lacks theoretical foundation. In particular, the following questions remain open: How can stochastic backpropagation rules, where the derivative is transferred explicitly to the function $f$, be developed for a broader range of distributions? And can the discrete and deterministic cases be interpreted in the sense of stochastic backpropagation? In this paper, we provide a new method to address these questions, and our main contributions are the following:

• We present a theoretical framework based on the link between the multivariate Fourier transform and the characteristic function, which provides a standard method for deriving stochastic backpropagation rules for any distribution, discrete or continuous.

• We show that deterministic backpropagation can be interpreted as a special case of stochastic backpropagation, where the probability distribution $p_\theta$ is a Dirac delta distribution, and that the discrete case can also be interpreted as backpropagating a discrete derivative.

• We generalize previously known estimators, and provide new stochastic backpropagation rules for the special cases of the Laplace, gamma, beta, and Dirichlet distributions.

• We demonstrate experimentally that the resulting new estimators are competitive with state-of-the-art methods on simple tasks.
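The continuous-relaxation approach mentioned above can be sketched in a few lines: a hard categorical sample $\arg\max_k(\text{logits}_k + g_k)$ with Gumbel noise $g_k$ is replaced by a temperature-controlled softmax, which is differentiable in the logits. This is a minimal illustration of the Concrete / Gumbel-Softmax idea, with all function and parameter names being our own.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Sample a Concrete (Gumbel-Softmax) relaxation of a categorical variable.

    The non-differentiable argmax(logits + g) is replaced by
    softmax((logits + g) / tau), a differentiable function of the logits,
    so reparameterized gradients can flow through the sample.
    """
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
soft = gumbel_softmax(np.array([1.0, 0.5, -1.0]), tau=0.5, rng=rng)
print(soft)  # a point on the simplex; nearly one-hot for small tau
```

As the temperature `tau` decreases, the samples concentrate near the simplex vertices and approach true one-hot categorical samples, at the cost of higher gradient variance.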

2. BACKGROUND & PRELIMINARIES

Let $(E, \lambda)$ be a $d$-dimensional measure space equipped with the standard inner product, and let $f$ be a square-summable positive real-valued function on $E$, that is, $f : E \to \mathbb{R}^+$ with $\int_E |f(z)|^2 \, \lambda(dz) < \infty$. Let $p_\theta$ be an arbitrary parameterized probability density on the space $E$. We denote by $\varphi_\theta$ its characteristic function, defined as $\varphi_\theta(\omega) := \mathbb{E}_{z \sim p_\theta}[e^{i\omega^T z}]$. We denote by $\hat{f}$ the Fourier transform of the function $f$, defined as $\hat{f}(\omega) := \mathcal{F}\{f\}(\omega) = \int_E f(z) e^{-i\omega^T z} \, \lambda(dz)$. The inverse Fourier transform is given in this case by $f(z) := \mathcal{F}^{-1}\{\hat{f}\}(z) = \int_{\mathbb{R}^d} \hat{f}(\omega) e^{i\omega^T z} \, \mu(d\omega)$, where $\mu(d\omega)$ represents the measure in the Fourier domain. In this paper we treat the case where $E = \mathbb{R}^d$, for which $\mu(d\omega) = \frac{d\omega}{(2\pi)^d}$, and the case where $E$ is a discrete set, for which the measure $\mu$ is defined as $\mu(d\omega) = \mathbb{1}[\omega \in [-\pi, \pi]^d] \, \frac{d\omega}{(2\pi)^d}$. Throughout the paper, we reserve the letter $i$ to denote the imaginary unit: $i^2 = -1$. To denote higher-order derivatives of the function $f$, we use the multi-index notation (Saint Raymond, 2018). For a multi-index $n = (n_1, \ldots, n_d) \in \mathbb{N}^d$, we define $|n| := n_1 + \cdots + n_d$, $\partial^n_z := \frac{\partial^{|n|}}{\partial z_1^{n_1} \cdots \partial z_d^{n_d}}$, and $\omega^n := \omega_1^{n_1} \cdots \omega_d^{n_d}$. To clarify the multi-index notation, consider the example where $d = 3$ and $n = (1, 0, 2)$; in this case $\partial^n_z = \frac{\partial^3}{\partial z_1 \, \partial z_3^2}$ and $\omega^n = \omega_1 \omega_3^2$. The objective is to derive stochastic backpropagation rules, similar to those of Rezende et al. (2014), for functions of the form

$$L(\theta) := \mathbb{E}_{z \sim p_\theta}[f(z)], \quad (4)$$

for any arbitrary distribution $p_\theta$, discrete or continuous.
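The characteristic function defined above can be checked numerically: for a univariate Gaussian $\mathcal{N}(\mu, \sigma^2)$ it has the well-known closed form $\varphi(\omega) = e^{i\omega\mu - \omega^2\sigma^2/2}$, which a Monte Carlo estimate of $\mathbb{E}[e^{i\omega z}]$ should reproduce. This is a small sanity-check sketch; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.7, 1.3
z = rng.normal(mu, sigma, 200_000)

omega = 0.9
# Monte Carlo estimate of the characteristic function E[exp(i*omega*z)]
phi_mc = np.mean(np.exp(1j * omega * z))
# Closed form for the Gaussian: exp(i*omega*mu - omega^2 * sigma^2 / 2)
phi_cf = np.exp(1j * omega * mu - 0.5 * omega ** 2 * sigma ** 2)
print(abs(phi_mc - phi_cf))  # small Monte Carlo error
```

The same check works for any distribution one can sample from, which is what makes the characteristic function a convenient bridge between the distribution $p_\theta$ and the Fourier domain.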



3. FOURIER STOCHASTIC BACKPROPAGATION
Stochastic backpropagation rules similar to those of Rezende et al. (2014) can in fact be derived for any continuous distribution, under certain conditions on the characteristic function. In the following theorem we present the main result of our paper concerning the derivation of Fourier stochastic backpropagation rules.

Theorem 1 (Continuous Stochastic Backpropagation). Let $f \in C^\infty(\mathbb{R}^d, \mathbb{R}^+)$. Under the condition that $\nabla_\theta \log \varphi_\theta$ is a holomorphic function of $i\omega$, there exists a unique family $\{a_n(\theta)\}_{n \in \mathbb{N}^d} \subset \mathbb{R}$ such that:

$$\nabla_\theta L = \sum_{|n| \geq 0} a_n(\theta) \, \mathbb{E}_{z \sim p_\theta}[\partial^n_z f(z)], \quad (3)$$

where $\{a_n(\theta)\}_{n \in \mathbb{N}^d}$ are the Taylor expansion coefficients of $\nabla_\theta \log \varphi_\theta(\omega)$:

$$\nabla_\theta \log \varphi_\theta(\omega) = \sum_{|n| \geq 0} a_n(\theta) (i\omega)^n.$$
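As a sanity check of Theorem 1, consider the Gaussian mean: since $\log \varphi_\theta(\omega) = i\omega\mu - \omega^2\sigma^2/2$, we have $\nabla_\mu \log \varphi_\theta(\omega) = i\omega$, so the only nonzero Taylor coefficient is $a_1(\theta) = 1$ and the theorem reduces to the known rule $\nabla_\mu L = \mathbb{E}_{z \sim p_\theta}[f'(z)]$. The sketch below verifies this numerically for the illustrative choice $f(z) = \sin(z)$, for which the exact gradient is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.4, 0.8, 400_000
z = rng.normal(mu, sigma, n)

# Theorem 1 for the Gaussian mean: grad_mu log phi(omega) = i*omega, so the
# only nonzero coefficient is at |n| = 1, giving grad_mu L = E[f'(z)].
grad_fourier = np.mean(np.cos(z))  # f(z) = sin(z), so f'(z) = cos(z)

# Closed form for comparison: d/dmu E[sin(z)] = cos(mu) * exp(-sigma^2 / 2)
grad_exact = np.cos(mu) * np.exp(-sigma ** 2 / 2)
print(grad_fourier, grad_exact)  # both ≈ 0.67
```

For distributions whose $\nabla_\theta \log \varphi_\theta$ has higher-order Taylor terms, the same expansion yields the new estimators involving higher derivatives $\partial^n_z f$ developed later in the paper.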

