FOURIER STOCHASTIC BACKPROPAGATION

Abstract

Backpropagating gradients through random variables is at the heart of numerous machine learning applications. In this paper, we present a general framework for deriving stochastic backpropagation rules for any distribution, discrete or continuous. Our approach exploits the link between the characteristic function and the Fourier transform to transport the derivatives from the parameters of the distribution to the random variable. Our method generalizes previously known estimators, and results in new estimators for the gamma, beta, Dirichlet, and Laplace distributions. Furthermore, we show that the classical deterministic backpropagation rule, as well as the discrete random variable case, can also be interpreted through stochastic backpropagation.

1. INTRODUCTION

Deep neural networks with stochastic hidden layers have become crucial in multiple domains, such as generative modeling (Kingma & Welling, 2013; Rezende et al., 2014; Mnih & Gregor, 2014), deep reinforcement learning (Sutton et al., 2000), and attention mechanisms (Mnih et al., 2014). The difficulty encountered in training such models arises in the computation of gradients of functions of the form L(θ) := E_{z∼p_θ}[f(z)] with respect to the parameters θ, which requires backpropagating the gradient through the random variable z. One of the first and most widely used methods is the score function or REINFORCE method (Glynn, 1989; Williams, 1992), which requires the computation and estimation of the derivative of the log probability function. For high-dimensional applications, however, it has been noted that REINFORCE gradients have high variance, making the training process unstable (Rezende et al., 2014). Recently, significant progress has been made in tackling the variance problem. The first class of approaches, dealing with continuous random variables, consists of reparameterization tricks. In that case a standardization function is introduced that separates the stochasticity from the dependency on the parameters θ. The derivative can then be transported inside the expectation and samples drawn from a fixed distribution, resulting in low-variance gradients (Kingma & Welling, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014; Ruiz et al., 2016; Naesseth et al., 2017; Figurnov et al., 2018). The second class of approaches concerns discrete random variables, for which a direct reparameterization is not known. The first solution uses the score function gradient with control variate methods to reduce its variance (Mnih & Gregor, 2014; Gu et al., 2016).
The second consists in introducing a continuous relaxation of the discrete random variable that admits a reparameterization trick, making it possible to backpropagate low-variance reparameterized gradients by sampling from the concrete distribution (Jang et al., 2016; Maddison et al., 2016; Tucker et al., 2017; Grathwohl et al., 2018). Although recent developments have advanced the state-of-the-art in terms of variance reduction and performance, stochastic backpropagation (i.e., computing gradients through random variables) still lacks theoretical foundation. In particular, the following questions remain open: How can stochastic backpropagation rules, where the derivative is transferred explicitly to the function f, be developed for a broader range of distributions? And can the discrete and deterministic cases be interpreted in the sense of stochastic backpropagation? In this paper, we provide a new method to address these questions, and our main contributions are the following:
• We present a theoretical framework, based on the link between the multivariate Fourier transform and the characteristic function, that provides a standard method for deriving stochastic backpropagation rules for any distribution, discrete or continuous.
• We show that deterministic backpropagation can be interpreted as a special case of stochastic backpropagation where the probability distribution p_θ is a Dirac delta distribution, and that the discrete case can be interpreted as backpropagating a discrete derivative.
• We generalize previously known estimators, and provide new stochastic backpropagation rules for the special cases of the Laplace, gamma, beta, and Dirichlet distributions.
• We demonstrate experimentally that the resulting new estimators are competitive with state-of-the-art methods on simple tasks.

2. BACKGROUND & PRELIMINARIES

Let (E, λ) be a d-dimensional measure space equipped with the standard inner product, and let f be a square-integrable positive real-valued function on E, that is, f : E → R_+ with ∫_E |f(z)|² λ(dz) < ∞. Let p_θ be an arbitrary parameterized probability density on the space E. We denote by ϕ_θ its characteristic function, defined as ϕ_θ(ω) := E_{z∼p_θ}[e^{iω^T z}]. We denote by \hat{f} the Fourier transform of the function f, defined as \hat{f}(ω) := F{f}(ω) = ∫_E f(z) e^{−iω^T z} λ(dz). The inverse Fourier transform is given in this case by f(z) := F^{−1}{\hat{f}}(z) = ∫_{R^d} \hat{f}(ω) e^{iω^T z} µ(dω), where µ(dω) represents the measure in the Fourier domain. In this paper we treat the case E = R^d, for which µ(dω) = dω/(2π)^d, and the case where E is a discrete set, for which the measure µ is defined as µ(dω) = 1[ω ∈ [−π, π]^d] dω/(2π)^d. Throughout the paper, we reserve the letter i to denote the imaginary unit: i² = −1. To denote higher-order derivatives of the function f, we use the multi-index notation (Saint Raymond, 2018). For a multi-index n = (n_1, ..., n_d) ∈ N^d, we define |n| := n_1 + ... + n_d, ∂_z^n := ∂^{|n|} / (∂z_1^{n_1} ... ∂z_d^{n_d}), and ω^n := ω_1^{n_1} ... ω_d^{n_d}. To clarify the multi-index notation, consider the example where d = 3 and n = (1, 0, 2); in this case ∂_z^n = ∂³ / (∂z_1 ∂z_3²) and ω^n = ω_1 ω_3². The objective is to derive stochastic backpropagation rules, similar to that of (Rezende et al., 2014), for functions of the form L(θ) := E_{z∼p_θ}[f(z)], for any arbitrary distribution p_θ, discrete or continuous.
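As an illustration of the characteristic function ϕ_θ introduced above, the following sketch (not part of the paper; the Gaussian example, parameter values, and sample size are arbitrary choices) checks ϕ(ω) = E[e^{iωz}] by Monte Carlo for a one-dimensional Gaussian against its standard closed form:

```python
import numpy as np

# Monte Carlo illustration of the characteristic function phi(omega) = E[e^{i omega z}]
# for a 1-D Gaussian, against the closed form exp(i*mu*omega - sigma^2*omega^2/2).
rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.7
z = rng.normal(mu, sigma, size=1_000_000)

omegas = np.array([0.5, 1.0, 2.0])
phi_mc = np.array([np.exp(1j * w * z).mean() for w in omegas])
phi_exact = np.exp(1j * mu * omegas - 0.5 * sigma**2 * omegas**2)
err = np.abs(phi_mc - phi_exact).max()
print(err)   # on the order of the Monte Carlo error for 10^6 samples
```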

3. FOURIER STOCHASTIC BACKPROPAGATION

Stochastic backpropagation rules similar to that of (Rezende et al., 2014) can in fact be derived for any continuous distribution, under certain conditions on the characteristic function. In the following theorem we present the main result of our paper concerning the derivation of Fourier stochastic backpropagation rules.

Theorem 1 (Continuous Stochastic Backpropagation). Let f ∈ C^∞(R^d, R_+). Under the condition that ∇_θ log ϕ_θ is a holomorphic function of iω, there exists a unique family {a_n(θ)}_{n∈N^d} ⊂ R such that:

∇_θ L = Σ_{|n|≥0} a_n(θ) E_{z∼p_θ}[∂_z^n f(z)],

where the {a_n(θ)}_{n∈N^d} are the Taylor expansion coefficients of ∇_θ log ϕ_θ(ω):

∇_θ log ϕ_θ(ω) = Σ_{|n|≥0} a_n(θ) (iω)^n.

Proof. Let us rewrite L in terms of \hat{f}:

L(θ) = ∫_E p_θ(z) f(z) λ(dz) = ∫_E p_θ(z) F^{−1}[\hat{f}](z) λ(dz) = ∫_{R^d} \hat{f}(ω) ∫_E p_θ(z) e^{iω^T z} λ(dz) µ(dω)   (Fubini's theorem) = ∫_{R^d} \hat{f}(ω) ϕ_θ(ω) µ(dω).   (5)

By differentiating under the integral sign and applying the REINFORCE trick (Williams, 1992) to ϕ_θ, where ∇_θ ϕ_θ(ω) = ϕ_θ(ω) ∇_θ log ϕ_θ(ω), equation 5 becomes:

∇_θ L = ∫_{R^d} \hat{f}(ω) ϕ_θ(ω) ∇_θ log ϕ_θ(ω) µ(dω).   (6)

Under analyticity conditions on the gradient of the log characteristic function, we can expand the gradient term ∇_θ log ϕ_θ(ω) as a Taylor series around zero:

∇_θ log ϕ_θ(ω) = Σ_{|n|≥0} a_n(θ) (iω)^n.   (7)

Putting everything together, and replacing the characteristic function by its expression, the gradient of L becomes:

∇_θ L = ∫_{R^d} \hat{f}(ω) ∫_E p_θ(z) e^{iω^T z} Σ_{|n|≥0} a_n(θ) (iω)^n λ(dz) µ(dω).

By rearranging the sums using Fubini's theorem a second time, we obtain the following expression for the gradient:

∇_θ L = E_{z∼p_θ}[F^{−1}{ω ↦ Σ_{|n|≥0} a_n(θ) (iω)^n \hat{f}(ω)}(z)] = Σ_{|n|≥0} a_n(θ) E_{z∼p_θ}[F^{−1}{ω ↦ (iω)^n \hat{f}(ω)}(z)] = Σ_{|n|≥0} a_n(θ) E_{z∼p_θ}[∂_z^n f(z)]. Q.E.D.

Identically, we can follow the same procedure for discrete random variables.
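Theorem 1 can be sanity-checked numerically in the simplest setting. The sketch below (not from the paper; f(z) = sin(z) and all parameter values are toy choices) uses a 1-D Gaussian, for which ∇_µ log ϕ_θ(ω) = iω, so a_1 = 1, all other a_n vanish, and the rule collapses to d/dµ E[f(z)] = E[f'(z)]:

```python
import numpy as np

# Sanity check of Theorem 1 for a 1-D Gaussian N(mu, sigma^2):
# log phi(omega) = i*mu*omega - sigma^2*omega^2/2, so grad_mu log phi = i*omega,
# i.e. a_1 = 1 and all other a_n = 0; the rule reads d/dmu E[f(z)] = E[f'(z)].
rng = np.random.default_rng(0)
mu, sigma = 0.7, 0.5
f = lambda z: np.sin(z)          # toy test function
fp = lambda z: np.cos(z)         # its derivative

z = rng.normal(mu, sigma, size=1_000_000)
sbp_grad = fp(z).mean()          # stochastic backprop estimate of d/dmu E[f]

# Closed form: E[sin(z)] = sin(mu) * exp(-sigma^2/2), hence the true gradient:
exact_grad = np.cos(mu) * np.exp(-0.5 * sigma**2)
print(sbp_grad, exact_grad)
```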
We suppose that p_θ factorizes over disjoint cliques of the dependency graph, where each dimension z_j takes values in a discrete space Val(z_j). In theorem 2 we derive the result concerning the discrete case.

Theorem 2 (Discrete Stochastic Backpropagation). Let E be a discrete space, E = ∏_{j=1}^d Val(z_j), and C the set of disjoint cliques of the dependency graph over z, that is, p_θ(z) = ∏_{c∈C} p_θ(z_c). Then:

∇_θ L = Σ_{c∈C} Σ_{z_c ≠ z*_c} ∇_θ p_θ(z_c) E_{z_{−c}∼p_θ}[Df(z_{−c}, z_c)],   (10)

where:
• z*_c represents the normalizing assignment, p_θ(z*_c) = 1 − Σ_{z_c ≠ z*_c} p_θ(z_c);
• Df(z_{−c}, z_c) := f(z_{−c}, z_c) − f(z_{−c}, z*_c) is the discrete derivative of f at the clique c.   (11)

Proof sketch. Following the same steps as in theorem 1, with the discrete Fourier measure µ(dω) = 1[ω ∈ [−π, π]^d] dω/(2π)^d, we have ∇_θ L = ∫ \hat{f}(ω) ∇_θ ϕ_θ(ω) µ(dω), and expanding ∇_θ ϕ_θ over the clique factorization yields equation 10. Q.E.D.

The estimator of equation 10 has been derived in the literature through Rao-Blackwellization of the score function gradient, and it has been known under different names (Titsias & Lázaro-Gredilla, 2015; Asadi et al., 2017; Cong et al., 2019). Theorem 2 shows that the discrete case can also be seen as backpropagating a derivative of the function f, in this case the discrete derivative of equation 11.
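Theorem 2 can be verified exactly in the smallest possible case: a single Bernoulli variable, i.e. one clique of size one. The sketch below (toy f values; not from the paper) checks that the gradient of the expectation equals the discrete derivative Df:

```python
# Exact check of Theorem 2 for a single Bernoulli variable (one clique, one
# dimension).  With z* = 0 as the normalizing assignment, p(1) = pi and
# p(0) = 1 - pi, so the rule reads d/dpi E[f(z)] = Df(1) = f(1) - f(0).
f = {0: 1.3, 1: -0.4}               # toy values of f on the discrete space
pi = 0.3

def expected_f(p):
    return p * f[1] + (1 - p) * f[0]

discrete_grad = f[1] - f[0]          # the discrete derivative Df
eps = 1e-6
fd_grad = (expected_f(pi + eps) - expected_f(pi - eps)) / (2 * eps)
print(discrete_grad, fd_grad)
```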

4. APPLICATIONS OF FOURIER STOCHASTIC BACKPROPAGATION

Following from the previous section, we derive the stochastic backpropagation estimators for certain commonly used distributions.

The multivariate Gaussian distribution: In this case p_θ(z) = N(z; µ_θ, Σ_θ). The log characteristic function is given by log ϕ_θ(ω) = iµ_θ^T ω + ½ Tr(Σ_θ i² ωω^T). Thus by applying theorem 1, we recover the stochastic backpropagation rule of (Rezende et al., 2014):

∇_θ L = E_{z∼p_θ}[(∂µ_θ/∂θ)^T ∇_z f(z) + ½ Tr((∂Σ_θ/∂θ) ∇²_z f(z))],

where ∇_z and ∇²_z represent the gradient and Hessian operators.

The multivariate Dirac distribution: p_θ(z) = δ_{a_θ}(z). The log characteristic function of the Dirac distribution is given by log ϕ_θ(ω) = iω^T a_θ. Thus the stochastic backpropagation rule of the Dirac is given by:

∇_θ L = (∂a_θ/∂θ)^T E_{z∼δ_{a_θ}}[∇_z f(z)] = (∂a_θ/∂θ)^T ∇_z f(a_θ),

resulting in the classical backpropagation rule. In other words, the deterministic backpropagation rule is a special case of stochastic backpropagation where the distribution is a Dirac delta distribution. This result provides a link between probabilistic graphical models and classical neural networks. We investigate this link further in appendix A.

The multivariate Bernoulli: p_θ(z) = ∏_{j=1}^d B(z_j; π_θ^{(j)}), where π_θ^{(j)} = P[z_j = 1]. By applying theorem 2, we obtain the local expectation gradient of (Titsias & Lázaro-Gredilla, 2015):

∇_θ L = Σ_{j=1}^d (∂π_θ^{(j)}/∂θ) E_{z_{−j}∼p_θ}[f(z_{−j}, 1) − f(z_{−j}, 0)].   (17)

The multivariate categorical: p_θ(z) = ∏_{j=1}^d Cat(z_j; π_θ^{(j)}), where the dimensions are independent and take values in the set {1, ..., K}. Similarly to the Bernoulli case, we obtain the following stochastic backpropagation rule:

∇_θ L = Σ_{j=1}^d Σ_{k=1}^{K−1} (∂π_{θ,k}^{(j)}/∂θ) E_{z_{−j}∼p_θ}[Df(z_{−j}, k)].   (18)
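The categorical rule can be checked exactly for a single variable, since the expectation is a finite sum. The sketch below (toy f values; 0-based indexing, so the paper's categories 1..K become 0..K-1 with K-1 playing the role of the normalizing assignment z*) compares the discrete derivatives Df(k) against finite differences of the exact expectation:

```python
import numpy as np

# Exact check of the categorical rule for one variable with K outcomes, taking
# the last category as the normalizing assignment z*, so that
# pi_K = 1 - sum_{k<K} pi_k and d/dpi_k E[f(z)] = Df(k) = f(k) - f(z*).
K = 4
f = np.array([0.2, -1.0, 0.5, 2.0])        # toy values f(1), ..., f(K)
pi_free = np.array([0.1, 0.2, 0.3])        # free parameters pi_1, ..., pi_{K-1}

def expected_f(p_free):
    p = np.append(p_free, 1.0 - p_free.sum())
    return float(p @ f)

sbp_grad = f[:K - 1] - f[K - 1]            # Df(k) = f(k) - f(z*)

eps = 1e-6
fd_grad = np.empty(K - 1)
for k in range(K - 1):
    e = np.zeros(K - 1)
    e[k] = eps
    fd_grad[k] = (expected_f(pi_free + e) - expected_f(pi_free - e)) / (2 * eps)
print(sbp_grad, fd_grad)
```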
The Laplace distribution: p_θ(z) = L(z; µ_θ, b_θ). In this case the log characteristic function is log ϕ_θ(ω) = iµ_θ ω − log(1 + b²_θ ω²). Using the Taylor series expansion of the function x ↦ 1/(1−x), we get the following stochastic backpropagation rule for the Laplace distribution:

∇_θ L = (∂µ_θ/∂θ) E_z[df/dz(z)] + (1/b²_θ)(∂b²_θ/∂θ) Σ_{n=1}^∞ b_θ^{2n} E_z[d^{2n}f/dz^{2n}(z)].   (19)

The gamma distribution: p_θ(z) = Γ(z; k_θ, µ_θ). The log characteristic function of the gamma distribution is given by log ϕ_θ(ω) = −k_θ log(1 − iµ_θ ω). By expanding it using the Taylor series of the logarithm function, we obtain the following stochastic backpropagation rule:

∇_θ L = Σ_{n=1}^∞ ((1/n)(∂k_θ/∂θ) + (k_θ/µ_θ)(∂µ_θ/∂θ)) µ_θ^n E_{z∼p_θ}[d^n f/dz^n(z)].   (20)

The estimator of equation 20 gives a stochastic backpropagation rule for the gamma distribution and hence also applies, by extension, to the special cases of the exponential, Erlang, and chi-squared distributions.

The beta distribution: p_θ(z) = Beta(z; α_θ, β_θ). In this case the characteristic function is the confluent hypergeometric function ϕ_θ(ω) = ₁F₁(α_θ; α_θ + β_θ; iω). A series expansion of the gradient of the log of this function is not trivial to derive. However, we can use the parameterization linking the gamma and beta distributions to derive a stochastic backpropagation rule. Indeed, if ζ_1 ∼ Γ(α_θ, 1) and ζ_2 ∼ Γ(β_θ, 1), then z = g(ζ_1, ζ_2) = ζ_1/(ζ_1 + ζ_2) ∼ Beta(α_θ, β_θ). By substituting in the gamma stochastic backpropagation rule, we obtain:

∇_θ L = Σ_{n=1}^∞ (1/n) [(∂α_θ/∂θ) E_{ζ_1,ζ_2}[∂^n/∂ζ_1^n f(ζ_1/(ζ_1 + ζ_2))] + (∂β_θ/∂θ) E_{ζ_1,ζ_2}[∂^n/∂ζ_2^n f(ζ_1/(ζ_1 + ζ_2))]].

The Dirichlet distribution: p_θ(z) = Dir(z; K, α_θ). Following the same procedure as for the beta distribution, and using the parameterization z_k = ζ_k / Σ_{j=1}^K ζ_j with ζ_k ∼ Γ(α_θ^{(k)}, 1), we obtain:

∇_θ L = Σ_{n=1}^∞ (1/n) Σ_{k=1}^K (∂α_θ^{(k)}/∂θ) E_{ζ_j ∀j}[∂^n/∂ζ_k^n f(ζ_1/Σ_j ζ_j, ..., ζ_K/Σ_j ζ_j)].
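The gamma rule can be tested numerically with a function whose derivatives are available in closed form. The sketch below (not from the paper; the test function f(z) = exp(-z), the parameter values, and the truncation level are arbitrary choices) checks the shape-parameter part of the series against the analytic gradient:

```python
import numpy as np

# Check of the gamma rule for f(z) = exp(-z), whose derivatives are
# d^n f/dz^n = (-1)^n f(z), so only the shape k varies and the rule reads
# d/dk L = sum_{n>=1} (mu^n / n) E[(-1)^n exp(-z)].  Closed forms:
# E[exp(-z)] = (1+mu)^(-k) and d/dk E[exp(-z)] = -log(1+mu) * (1+mu)^(-k).
rng = np.random.default_rng(1)
k, mu, N = 2.0, 0.5, 30                 # truncation level N; series needs mu < 1
z = rng.gamma(shape=k, scale=mu, size=1_000_000)

m = np.exp(-z).mean()                   # Monte Carlo estimate of E[f(z)]
grad_k = sum(((-mu)**n / n) * m for n in range(1, N + 1))

exact = -np.log1p(mu) * (1.0 + mu)**(-k)
print(grad_k, exact)
```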

5. TRACTABLE CASES & APPROXIMATIONS OF FOURIER STOCHASTIC BACKPROPAGATION

The Fourier stochastic backpropagation gradient as presented in the previous sections has two major computational bottlenecks for non-trivial distributions. The first is the computation of infinite series, and the second is the evaluation of higher-order derivatives of the function f. Depending on the application, the function f can be chosen so as to bypass these bottlenecks. A trivial example is when the higher-order derivatives of f vanish beyond a certain order: ∂_z^n f = 0. Another example is the exponential function f(z) = exp(λ^T z). From the fact that it obeys the partial differential equations ∂f/∂z_j(z) = λ_j f(z), one can deduce that the stochastic backpropagation rule reduces in this case to:

∇_θ L = ∇_θ log ϕ_θ(λ/i) E_{z∼p_θ}[f(z)].

In most real-world applications, however, the infinite sum will not reduce to a tractable expression such as that of the exponential. An example of this case is the evidence lower bound of a generative model with Bernoulli observations. In this case, the natural solution is to truncate the sum at a finite order. The assumption (although it might be wrong) is that the components associated with higher frequencies of the spectrum of the gradient of the log characteristic function do not contribute as much; by analogy with the signal processing field, we apply a low-pass filter to eliminate them. In this case the gradient of the log characteristic function of equation 7 becomes:

∇_θ log ϕ_θ(ω) = Σ_{|n|≤N} a_n(θ)(iω)^n + o((iω)^N).

6. EXPERIMENTS

In our experimental evaluations, we test the stochastic backpropagation estimators of equations 19 and 20 for the gamma and Laplace distributions. In the case of the gamma estimator, we use toy examples where we can derive exact stochastic backpropagation rules without truncating the infinite sum.
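One such exactly summable case is the exponential collapse described in section 5. The sketch below (not from the paper; parameter values and sample size are arbitrary) checks it for a gamma distribution with f(z) = exp(-z), i.e. λ = -1, where the collapsed rule gives d/dk E[f(z)] = -log(1 + µ) E[f(z)] with no truncation needed:

```python
import numpy as np

# Collapsed exponential rule for a gamma distribution (shape k, scale mu) with
# f(z) = exp(-z), i.e. lambda = -1: grad_k log phi(lambda/i) = -log(1 + mu),
# so d/dk E[f(z)] = -log(1 + mu) * E[f(z)].
rng = np.random.default_rng(4)
k, mu = 2.0, 0.5
z = rng.gamma(shape=k, scale=mu, size=1_000_000)

collapsed = -np.log1p(mu) * np.exp(-z).mean()   # grad log phi * E[f], Monte Carlo
exact = -np.log1p(mu) * (1.0 + mu)**(-k)        # analytic d/dk E[f(z)]
print(collapsed, exact)
```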
As for the Laplace stochastic backpropagation rule, we test the estimator in the case of Bayesian logistic regression with Laplacian priors and variational posteriors on the weights. We compare our estimators with the pathwise (Jankowiak & Karaletsos, 2019; Jankowiak & Obermeyer, 2018) and score function estimators, in addition to the weak reparameterization estimator in the gamma case (Ruiz et al., 2016). We do not use control variates in our setup; the goal is to verify the exactness of the proposed infinite-series estimators and how they compare to current state-of-the-art methods in simple settings. In all our experiments, we use the Adam optimizer to update the weights (Kingma & Ba, 2014), with a standard learning rate of 10^{-3}. In all the curves, we report the mean and standard deviation of all the metrics considered over 5 runs.

6.1. TOY PROBLEMS

In the toy problem setting, we test the gamma stochastic backpropagation rule following the same procedure as (Mohamed et al., 2019). We consider the following cases:

Toy problem 1: L(θ) = E_{z∼p_θ}[||z − ε||²], where p_θ(z) = ∏_{j=1}^d Γ(z_j; k_j, µ_j), θ = {k, µ}, and ε = 0.49. In this case, we only need to compute the first- and second-order derivatives of the function f.

Toy problem 2: L(θ) = E_{z∼p_θ}[∏_{j=1}^d exp(−z_j)]. In this case the infinite sum collapses, which results in the following estimator: ∇_θ L = ∇_θ log ϕ_θ(λ/i) E_{z∼p_θ}[f(z)], with λ = −(1, ..., 1).

In figures 1 and 2 we report the training loss and the log variance of the gradient across iterations of gradient descent for different values of the dimension d ∈ {1, 10, 100}. The stochastic backpropagation estimator converges to the minimal value faster than the other estimators in all cases, and the variance of the gradient is competitive with the pathwise gradient.

6.2. BAYESIAN LOGISTIC REGRESSION

We evaluate the Laplace stochastic backpropagation estimator using a Bayesian logistic regression model (Jaakkola & Jordan, 1997), similarly to (Mohamed et al., 2019). In our case, we substitute the normal prior and posterior on the weights with Laplace priors and posteriors. We adopt the notation of (Murphy, 2012), where the data, target, and weight variables are respectively x_n ∈ R^d, y_n ∈ {−1, 1}, and w. The probabilistic model in our case is the following:

p(w) = ∏_{j=1}^d L(w_j; 0, 1),  p(y|x, w) = σ(y x^T w),

where σ represents the sigmoid function. We consider Laplacian variational posteriors of the form p_θ(w) = ∏_{j=1}^d L(w_j; µ_j, b_j), with θ = {µ, b}. The evidence lower bound for a single sample is given by:

L(x_n, y_n; θ) = E_{w∼p_θ}[log σ(y_n x_n^T w)] − D_KL[p_θ || p],

where the Kullback-Leibler divergence between the two Laplace distributions is the following:

D_KL[p_θ || p] = Σ_{j=1}^d (|µ_j| + b_j e^{−|µ_j|/b_j} − log b_j − 1).
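The closed-form Laplace KL term above can be checked numerically. The sketch below (not from the paper; the particular µ, b, the integration grid, and the simple trapezoid quadrature are arbitrary choices) verifies it in one dimension against direct numerical integration:

```python
import numpy as np

# Numeric check (one dimension) of the closed-form KL divergence between the
# Laplace posterior L(mu, b) and the standard Laplace prior L(0, 1).
def log_laplace(w, mu, b):
    return -np.abs(w - mu) / b - np.log(2.0 * b)

def kl_closed_form(mu, b):
    return abs(mu) + b * np.exp(-abs(mu) / b) - np.log(b) - 1.0

mu, b = 0.8, 0.5
w = np.linspace(-30.0, 30.0, 600_001)
q = np.exp(log_laplace(w, mu, b))
integrand = q * (log_laplace(w, mu, b) - log_laplace(w, 0.0, 1.0))
dw = w[1] - w[0]
kl_numeric = float(np.sum(integrand[1:] + integrand[:-1]) * dw / 2.0)  # trapezoid
print(kl_numeric, kl_closed_form(mu, b))
```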
We test the model on the UCI women's breast cancer dataset (Dua & Graff, 2017), with a batch size of 64 and 50 samples from the posterior to evaluate the expectation. In the case of the stochastic backpropagation estimator, we truncate the infinite series for the scale parameter b of equation 19 at N = 4 and N = 8. In figure 3, we report the training evidence lower bound, the log variance of the gradient, and the accuracy computed on the entire dataset for the different estimators. The stochastic backpropagation estimator converges faster than the considered estimators and its variance is significantly lower. We also notice that the truncation level of the infinite series for the scale parameter has little effect on the outcome. In figure 4, we report the bias and variance of the estimator for different values of the truncation level, at a fixed parameter value during the training phase (epoch 100). The bias and variance do not vary much with the truncation level in this case; this result confirms the intuition of neglecting higher frequencies presented in section 5. In addition, we compare the mean squared error between the Laplace stochastic backpropagation estimator and the score function and reparameterization estimators. As shown in figure 4, the mean squared error is small, thus the values of the gradients across iterations are close. However, the reparameterization gradient is closer to our estimator than the score function gradient, probably due to the fact that the reparameterization gradient is more stable and has lower variance.
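The weak sensitivity to the truncation level can be reproduced in closed form. The sketch below (not from the paper; it uses a hypothetical exponential test function f(z) = exp(λz) and arbitrary parameter values, assuming |bλ| < 1) truncates the scale-parameter series of the Laplace rule at different N and compares against the exact derivative:

```python
import numpy as np

# Effect of the truncation level N on the scale-parameter series of the
# Laplace rule, for f(z) = exp(lam*z) where d^{2n}f/dz^{2n} = lam^{2n} f and
# everything is available in closed form (valid for |b*lam| < 1):
#   E[f] = exp(mu*lam) / (1 - b^2 lam^2),
#   d/d(b^2) E[f] = lam^2 / (1 - b^2 lam^2)^2 * exp(mu*lam).
mu, b, lam = 0.0, 0.6, 0.9
Ef = np.exp(mu * lam) / (1.0 - b**2 * lam**2)
exact = lam**2 / (1.0 - b**2 * lam**2)**2 * np.exp(mu * lam)

def truncated(N):
    # (1/b^2) * sum_{n=1}^N b^{2n} E[d^{2n} f], with E[d^{2n} f] = lam^{2n} E[f]
    return sum(b**(2 * (n - 1)) * lam**(2 * n) * Ef for n in range(1, N + 1))

for N in (4, 8, 16):
    print(N, truncated(N), exact)   # truncation error decays like (b*lam)^{2N}
```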

7. RELATED WORK & DISCUSSION

Computing gradients through stochastic computation graphs has received considerable attention from the community, due to its application in many fields. The first general approach that provides a closed-form solution for any probability distribution is the score function method (Glynn, 1989; Williams, 1992; Sutton et al., 2000; Schulman et al., 2015). The main inconvenience of this approach is that it results in high-variance gradients when the dimension of the random variable becomes high. In order to bypass this issue, a second line of work designs control variates to reduce the variance of the score function estimator (Paisley et al., 2012; Weaver & Tao, 2013; Mnih & Gregor, 2014; Ranganath et al., 2014; Tokui & Sato, 2017). In addition to the score function gradient, it was proposed to use an importance-weighted estimator instead of the classical score function with a multi-sample objective (Mnih & Rezende, 2016; Burda et al., 2015). The second class of approaches concerns reparameterization tricks (Kingma & Welling, 2013; Rezende et al., 2014). By decoupling the computation of the gradient from the expectation, reparameterization tricks have been shown to provide low-variance gradients, often using a single sample. The issue with these methods is the necessity of finding a reparameterization for each probability distribution. Certain distributions such as the Gaussian are easy to reparameterize, but others, like the gamma, are not. In addition, discrete random variables do not admit an easy reparameterization either. Recently, these issues have been partially solved through implicit reparameterization, the generalized reparameterization gradient, and the pathwise gradient (Ruiz et al., 2016; Figurnov et al., 2018; Jankowiak & Obermeyer, 2018).
For the discrete case, continuous relaxations that are reparameterizable have been proposed and combined with control variate methods (Gu et al., 2016; Maddison et al., 2016; Jang et al., 2016; Tucker et al., 2017; Grathwohl et al., 2018). Our approach, in contrast, provides a new broad family of stochastic backpropagation rules derived using the Fourier transform. One interesting aspect of our approach is that the weighting a_n(θ) is separated from the expectation of the higher-order derivatives of the function f. Thus the sampled variable does not intervene in the weighting, in contrast to other methods such as reparameterization and pathwise gradients. In addition, if the function f contains weakly correlated terms, applying the derivative eliminates some random variables. The estimators of the higher-order derivatives are then sampled with respect to the derivation variable (and other correlated variables), which results in lower variance. It is worth noting that deriving stochastic backpropagation rules using the Fourier transform has been proposed for the Gaussian case (Fellows et al., 2018). In our work, we extend it to non-Gaussian distributions by way of the characteristic function, exploiting the invariance of the functional inner product under the Fourier transform (Parseval's theorem).

8. CONCLUSION

In conclusion, in this paper we presented a new method to compute gradients through random variables for any probability distribution, by explicitly transferring the derivative to the random variable using the Fourier transform. Our approach gives a framework that applies to any distribution for which the gradient of the log characteristic function is analytic, resulting in a new broad family of stochastic backpropagation rules that are unique to each distribution.

A THE LINK BETWEEN NEURAL NETWORKS AND PROBABILISTIC GRAPHICAL MODELS

In this section, we explore the connection between neural networks and probabilistic graphical models following from the stochastic backpropagation rule of the Dirac delta distribution. To this end, let us consider the probabilistic graphical model of figure 5, with L hidden stochastic layers h^{(1:L)}. The observed random variables in this model are denoted x and y, representing the data and target variables. We place the analysis in a supervised learning context, but the argument is valid for unsupervised models as well. As usual, the goal is to maximize the log likelihood of the data samples (x, y), which is intractable, given that we need to integrate over the hidden variables. However, using variational inference, we can maximize an evidence lower bound of the form:

L(θ; x, y) = E_{h^{(1:L)}∼q_θ(•|x)}[log p(y, h^{(1:L)}, x)] + H[q_θ(•|x)].

As suggested by the Dirac stochastic backpropagation rule, let us assume that the variational posteriors and priors are Dirac delta distributions of the form:

q_θ(h^{(l+1)}|h^{(l)}) = p(h^{(l+1)}|h^{(l)}) = δ_{a^{(l+1)}(W^{(l+1)T} h^{(l)} + b^{(l)})}(h^{(l+1)})  ∀ 0 ≤ l ≤ L−1,   (29)

where a^{(l)}, W^{(l)}, and b^{(l)} represent respectively the activation function, the weights, and the biases of layer l, with the convention x := h^{(0)}. Under these assumptions, the Kullback-Leibler divergence term is equal to zero, and the evidence lower bound reduces to the log-likelihood of a classical neural network:
L(θ; x, y) = log p(y | g_θ(x)), with g_θ(x) = a^{(L)}(W^{(L)T}(... a^{(1)}(W^{(1)T} x + b^{(0)}) ...)).

Thus, when using neural networks we are indirectly using a probabilistic graphical model and making the strong assumption that the hidden layers follow a parameterized Dirac distribution given the previous layer.
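The collapse of the stochastic rule onto the deterministic chain rule can also be seen as a limit. The sketch below (not from the paper; a, f, and all values are toy choices) takes z ∼ N(a(θ), σ²) and shrinks σ, so the Gaussian stochastic backpropagation gradient approaches the classical backpropagation value, consistent with the Dirac special case:

```python
import numpy as np

# The Dirac rule as a limit: for z ~ N(a(theta), sigma^2) with sigma -> 0, the
# Gaussian stochastic backprop gradient E[a'(theta) f'(z)] collapses onto the
# deterministic chain rule a'(theta) f'(a(theta)).
rng = np.random.default_rng(2)
a = lambda t: t**2                  # toy stand-in for a layer a_theta
ap = lambda t: 2.0 * t
f = lambda z: np.sin(z)
fp = lambda z: np.cos(z)

theta = 0.9
chain_rule = ap(theta) * fp(a(theta))        # classical backpropagation

estimates = {}
for sigma in (0.3, 0.03, 1e-4):
    z = rng.normal(a(theta), sigma, size=200_000)
    estimates[sigma] = ap(theta) * fp(z).mean()
print(chain_rule, estimates)                 # estimates approach chain_rule
```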

B EXPERIMENTS USING DISCRETE STOCHASTIC BACKPROPAGATION

We evaluate the Bernoulli and categorical stochastic backpropagation estimators (BSB and CSB) of equations 17 and 18 on standard generative modeling benchmark tasks, using the MNIST and Omniglot datasets (LeCun & Cortes, 2010; Lake et al., 2013). We use the REBAR, RELAX, and Gumbel-softmax (or concrete) estimators as baselines for our comparison (Jang et al., 2016; Maddison et al., 2016; Tucker et al., 2017; Grathwohl et al., 2018). The Bernoulli stochastic backpropagation estimator is compared to the REBAR and RELAX estimators on three models: the sigmoid belief network with one and two stochastic hidden layers (Neal, 1992) and the variational autoencoder. In this case, we adopt the same architectures as (Grathwohl et al., 2018). The categorical stochastic backpropagation estimator is compared to the Gumbel-softmax estimator (Maddison et al., 2016; Jang et al., 2016) using two models: a variational autoencoder and a single-layer belief network with categorical priors. In this case, we set the dimension of the hidden layer to d = 20 and the number of modalities for each dimension to K = 10. All models are trained using the Adam optimizer (Kingma & Ba, 2014) with a standard learning rate α = 10^{-4} and a batch size of 100. We train the models for 500 epochs on the MNIST dataset and 100 epochs on the Omniglot dataset; longer training leads to overfitting and lower performance on the test sets for all estimators and models. We perform 5 runs of training in all experiments and report the mean and standard deviation of each performance metric considered. For all models and estimators, we report the mean marginal test likelihood in tables 1 and 2 for both datasets. The test likelihood is estimated via importance sampling using 200 samples from the variational posterior. In all cases the stochastic backpropagation estimator, a control-variate-free method, outperforms the baselines.
In the case of the one-layer sigmoid belief network, the BSB estimator exhibits an increase in performance of about 4 nats on the MNIST dataset and 10 nats on the Omniglot dataset. We also vary the number of samples used to estimate the expectation in the stochastic backpropagation rule, S ∈ {1, 5, 10}. We notice that using a single-sample estimate does not hurt performance and leads to a faster training process. We estimate the mean variance of the gradients w.r.t. the parameters of the models using exponential moving averages of the first and second moments computed by the Adam optimizer. The BSB estimator significantly outperforms the REBAR and RELAX estimators in terms of variance reduction, with a difference of about 2 nats in the case of sigmoid belief networks and 1 nat in the case of the categorical variational autoencoder on the MNIST dataset, leading to a more stable training process as shown in figures 6 and 7. Finally, we evaluate the computational overhead of the categorical stochastic backpropagation estimator compared to the Gumbel-softmax estimator. We compare the two estimators in terms of the execution time of one epoch of training, using GPU implementations. As shown in table 3, the Gumbel-softmax method is faster than stochastic backpropagation (S = 1) by about 3 seconds per training epoch. This is due to the forward passes performed to compute each of the terms in equation 18. The variance reduction and the increase in performance, however, outweigh the computational cost.
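The extra forward passes mentioned above come from evaluating f on each modified sample. A minimal sketch of how such an estimator can be vectorized (not the paper's implementation; the toy f, the dimensions, and the parameter values are arbitrary choices, with categories coded 0..K-1 and K-1 as the normalizing assignment):

```python
import numpy as np

# Sketch of the vectorized categorical estimator of equation 18: for each
# coordinate j and category k, f is evaluated on the sample with z_j set to k,
# which is why the method costs extra forward passes compared to Gumbel-softmax.
rng = np.random.default_rng(3)
d, K, S = 5, 10, 8                           # dims, categories, samples
pi = rng.dirichlet(np.ones(K), size=d)       # factorized categorical parameters
f = lambda z: np.sin(z).sum(axis=-1)         # toy function of the sample

z = np.stack([rng.choice(K, size=S, p=pi[j]) for j in range(d)], axis=-1)

grad_pi = np.zeros((d, K))                   # gradient w.r.t. each pi_{j,k}
for j in range(d):
    vals = np.empty((K, S))
    for k in range(K):
        z_jk = z.copy()
        z_jk[:, j] = k
        vals[k] = f(z_jk)                    # one "forward pass" per (j, k)
    # Df(z_{-j}, k) = f(z_{-j}, k) - f(z_{-j}, K-1)
    grad_pi[j, :K - 1] = (vals[:K - 1] - vals[K - 1]).mean(axis=1)
print(grad_pi[:, :K - 1])
```

For this separable toy f, the off-coordinate terms cancel per sample, so every row of the gradient equals sin(k) - sin(K-1) exactly; for a non-separable f (e.g. a decoder network), each (j, k) pair genuinely costs a forward pass.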



Figure 1: Training loss and log variance of the gradients for the different estimators for f(z) = Σ_{j=1}^d (z_j − ε)², for d ∈ {1, 10, 100}.

Figure 2: Training loss and log variance of the gradients for the different estimators for f(z) = ∏_{j=1}^d exp(−z_j), for d ∈ {1, 10, 100}.

Figure 4: (Left) Bias and variance of the gradient for different values of the truncation level at a fixed parameter value. (Right, top) Mean squared error between the Laplace gradient estimator and the score function and reparameterization estimators across iterations. (Right, bottom) Norm of the gradient estimators.

Figure 6: The training evidence lower bound on the MNIST training set (top) and the log variance of the gradient (bottom) over 5 runs. Comparison with the REBAR and RELAX estimators.

Figure 7: Training evidence lower bound and the log variance of the gradient for the categorical VAE on the MNIST dataset.

Table 1: Test likelihood for the Bernoulli stochastic backpropagation (BSB) estimator compared to the REBAR and RELAX estimators. We report the mean and standard deviation over 5 runs.

                    REBAR            RELAX            BSB (S=1)        BSB (S=5)        BSB (S=10)
MNIST
one layer SBN    -114.14 ± 0.44   -114.55 ± 0.48   -110.87 ± 0.2    -110.70 ± 0.11   -110.59 ± 0.08
two layer SBN    -101.33 ± 0.04   -101.09 ± 0.07   -99.74 ± 0.3     -100.44 ± 0.28   -100.66 ± 0.21
Bern. VAE        -127.76 ± 0.84   -128.06 ± 2.66   -107.4 ± 1.47    -108.46 ± 0.37   -109.19 ± 1.31
Omniglot
one layer SBN    -123.66 ± 0.05   -123.82 ± 0.17   -113.53 ± 0.21   -114.34 ± 0.16   -114.37 ± 0.19

Figure 5: A hidden variable probabilistic model, where the observed variables are the data x and target y, with L hidden stochastic layers h^{(1:L)}.

Table 2: Test likelihood for the categorical stochastic backpropagation (CSB) estimator, compared to the Gumbel-softmax estimator. We report the mean and standard deviation over 5 runs.

Table 3: Execution time of one epoch of training on the MNIST dataset per estimator, per model. Experiments are run on an NVIDIA GeForce RTX 2080 Ti GPU, where the stochastic backpropagation rule of equation 18 is vectorized, leveraging the parallel batch processing of the GPU.

