GENERALIZED GUMBEL-SOFTMAX GRADIENT ESTIMATOR FOR GENERIC DISCRETE RANDOM VARIABLES

Anonymous

Abstract

Estimating the gradients of stochastic nodes, which enables gradient-based optimization of neural network parameters, is one of the crucial research questions in the deep generative modeling community. For discrete distributions, the Gumbel-Softmax trick reparameterizes Bernoulli and categorical random variables through continuous relaxation. However, gradient estimators for discrete distributions other than the Bernoulli and the categorical have not been explored, and the Gumbel-Softmax trick is not directly applicable to them. This paper proposes a general version of the Gumbel-Softmax estimator with a theoretical basis; the proposed estimator can reparameterize generic discrete distributions, broader than the Bernoulli and the categorical. In detail, we utilize the truncation of discrete random variables and the Gumbel-Softmax trick with a linear transformation for the relaxed reparameterization. The proposed approach enables the relaxed discrete random variable to be reparameterized within a large-scale stochastic computational graph. Our experiments consist of (1) synthetic data analyses and applications to VAEs, which show the efficacy of our method; and (2) topic models, which demonstrate the value of the proposed estimator in practice.

1. INTRODUCTION

Stochastic computational graphs, including deep generative models such as variational autoencoders, are widely used for representation learning. Optimizing the network parameters through gradient methods requires an estimation of the gradient values, but the stochasticity requires computing an expectation, which differentiates this problem from the deterministic gradient computation of ordinary neural networks. There are two common ways of obtaining the gradients: score function-based methods and reparameterization methods. Score function-based estimators tend to yield unbiased gradients with high variance, while reparameterization estimators tend to yield biased gradients with low variance (Xu et al., 2019). Hence, the core technique of score function-based estimators is reducing the variance of the gradients to achieve stable and fast optimization. Meanwhile, utilizing reparameterization estimators requires a differentiable non-centered parameterization (Kingma & Welling, 2014a) of random variables. Among the reparameterization estimators, one of the most popular examples is the reparameterization in the Gaussian variational autoencoder (VAE) (Kingma & Welling, 2014b), which has an exact reparameterization form. Other VAEs with explicit priors suggest reparameterization tricks with approximations (Nalisnick & Smyth, 2017; Joo et al., 2020). For continuous random variables, it is feasible to estimate gradients with automatic differentiation by utilizing a transport equation (Jankowiak & Obermeyer, 2018) or an implicit reparameterization (Figurnov et al., 2018). However, these methods are not applicable to discrete random variables due to their non-differentiability.
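As an illustration of this bias-variance trade-off, the following sketch (a hypothetical NumPy example, not from the paper) compares a score-function estimate and a pathwise (reparameterized) estimate of ∇_μ E_{z∼N(μ,1)}[z²]. Both estimators happen to be unbiased for this target, but their variances differ sharply:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 10_000
eps = rng.standard_normal(n)
z = mu + eps                      # z ~ N(mu, 1), reparameterized

# Score-function estimator: f(z) * d/dmu log p(z|mu) = z^2 * (z - mu)
score_grads = z**2 * (z - mu)

# Pathwise estimator: d/dmu f(mu + eps) = 2 * (mu + eps)
pathwise_grads = 2.0 * z

# Both target the true gradient 2*mu = 2, with very different variances.
mean_score, var_score = score_grads.mean(), score_grads.var()
mean_path, var_path = pathwise_grads.mean(), pathwise_grads.var()
```

For μ = 1 the analytic variances are 30 for the score-function estimator versus 4 for the pathwise estimator, which is why variance reduction is the central concern for the former.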
Recently, some discrete random variables, such as Bernoulli or categorical random variables, have been well-explored in terms of the reparameterization method by overcoming the non-differentiability through a continuous relaxation (Jang et al., 2017; Maddison et al., 2017). However, other discrete distributions have not been explored from the learning perspective in the deep generative modeling community: for example, the Poisson, binomial, multinomial, geometric, and negative binomial distributions. Prior works on graphical models, such as Ranganath et al. (2015; 2016), utilized Poisson latent variables for latent counting. Another line of work (Wu et al., 2020) utilized a Gaussian approximation of the Poisson latent variable to count the number of words, which can be a poor approximation if the rate parameter is small. In this sense, a study on stochastic gradient estimators for generic discrete distributions is required in the deep generative modeling community, which would broaden the choice of prior assumptions and the utilization of various distributions. This paper proposes a generalized version of the Gumbel-Softmax reparameterization trick, which can be applied to generic discrete random variables through continuous relaxation, namely the Generalized Gumbel-Softmax (GENGS). The key ideas of GENGS are (1) a conversion of the sampling process into a one-hot categorical selection process; (2) a reversion of the selected category in one-hot form to the original sample value; and (3) a relaxation of the categorical selection process into a continuous form. To implement these steps, GENGS first truncates the discrete random variable to approximate the distribution with a finite number of possible outcomes. Afterward, GENGS utilizes the Gumbel-Softmax trick together with a special form of linear transformation. Our main theorem shows that the proposed GENGS is applicable to general discrete random variables beyond the Bernoulli and the categorical.
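The three steps above can be sketched as follows. This is a minimal NumPy illustration under the stated truncation idea, not the paper's exact implementation: a Poisson(λ) is truncated to the support {0, …, K}, a relaxed one-hot vector is drawn with the Gumbel-Softmax trick, and the one-hot form is reverted to a scalar sample through an inner product with the support values:

```python
import math
import numpy as np

def truncated_poisson_logits(lam, K):
    # Unnormalized log-pmf of Poisson(lam) on the truncated support {0, ..., K}.
    ks = np.arange(K + 1)
    return ks * math.log(lam) - lam - np.array([math.lgamma(k + 1) for k in ks])

def gumbel_softmax(logits, tau, rng):
    # Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau).
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    x = (logits + g) / tau
    y = np.exp(x - x.max())
    return y / y.sum()

rng = np.random.default_rng(0)
lam, K, tau = 3.0, 20, 0.1
logits = truncated_poisson_logits(lam, K)
values = np.arange(K + 1, dtype=float)

# Step (3): relaxed sample = inner product of the soft one-hot with support values.
relaxed = np.array([gumbel_softmax(logits, tau, rng) @ values for _ in range(2000)])

# tau -> 0 limit: argmax recovers exact hard sampling (the Gumbel-Max trick).
g = -np.log(-np.log(rng.uniform(size=(20_000, K + 1))))
hard = np.argmax(logits + g, axis=1)
```

As the temperature τ decreases, the relaxed sample concentrates on the hard (exact) truncated-Poisson sample, whose empirical mean approaches λ.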
Our experiments demonstrate the efficacy of GENGS on synthetic examples and VAEs, as well as its usability in a topic model application.

2. PRELIMINARY: REPARAMETERIZATION TRICK & GUMBEL-SOFTMAX

2.1 BACKPROPAGATION THROUGH STOCHASTIC NODES WITH REPARAMETERIZATION TRICK

Suppose we have a stochastic node, or a latent variable, z ∼ p(z|θ), where the distribution depends on a parameter θ. The goal is optimizing the loss function L(θ, η) = E_{z∼p(z|θ)}[f_η(z)], where f_η is a continuous and differentiable function with respect to η, e.g., a neural network. To optimize the loss function in terms of θ through gradient methods, we need ∇_θ L(θ, η) = ∇_θ E_{z∼p(z|θ)}[f_η(z)], which cannot be computed directly in its original form. To compute ∇_θ L(θ, η), the reparameterization trick alternatively introduces an auxiliary variable ε ∼ p(ε), which takes over all randomness of the latent variable z, so the sampled value z can be re-written as z = g(θ, ε), with a deterministic and differentiable function g in terms of θ. Figure 1(a) illustrates the reparameterization trick: the shaded nodes indicate random nodes, and the dotted lines denote sampling processes. Here, the gradient of the loss function with respect to θ is derived as

∇_θ L = ∇_θ E_{z∼p(z|θ)}[f_η(z)] = E_{ε∼p(ε)}[∇_θ f_η(g(θ, ε))] = E_{ε∼p(ε)}[∇_g f_η(g(θ, ε)) ∇_θ g(θ, ε)]   (1)

where the last term of Equation 1 is now computable. A condition enabling the reparameterization trick is the continuity of the random variable z, so the distribution of z is limited to the class of continuous distributions. To utilize the differentiable reparameterization trick on discrete random variables, continuous relaxation can be applied: for example, the relaxation from the categorical distribution to the Gumbel-Softmax distribution, described in the next subsection.
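As a concrete instance of Equation 1, consider z ∼ Exponential(λ) reparameterized through its inverse CDF, z = g(λ, ε) = −log ε / λ with ε ∼ Uniform(0, 1), and f(z) = z (a minimal sketch, not tied to the paper's models). The pathwise gradient ∇_λ g = log ε / λ² then yields a Monte Carlo estimate of ∇_λ E[z] = ∇_λ (1/λ) = −1/λ²:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 2.0, 50_000
eps = rng.uniform(size=n)

# Reparameterization: z = g(lam, eps) = -log(eps) / lam  =>  z ~ Exponential(lam)
z = -np.log(eps) / lam

# Pathwise gradient of f(z) = z w.r.t. lam: d/dlam g(lam, eps) = log(eps) / lam^2
grad_est = (np.log(eps) / lam**2).mean()

true_grad = -1.0 / lam**2        # analytic gradient of E[z] = 1/lam
```

Because g is deterministic and differentiable in λ, the gradient passes through the sampling step exactly as in Equation 1; no such g exists for a discrete z, which motivates the relaxation below.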

2.2. REPARAMETERIZATION TRICK ON CATEGORICAL RANDOM VARIABLE

The Gumbel-Max trick (Gumbel, 1948) is a procedure for sampling a one-hot categorical value using the Gumbel distribution, instead of sampling directly from a categorical distribution. This implies that a categorical random variable X ∼ Categorical(π), where π lies on the (n−1)-dimensional simplex Δ^{n−1}, can be reparameterized by the Gumbel-Max trick: (1) sample u_j ∼ Uniform(0, 1)



Figure 1: Reparameterization in stochastic computational graphs.

