GENERALIZED GUMBEL-SOFTMAX GRADIENT ESTIMATOR FOR GENERIC DISCRETE RANDOM VARIABLES

Anonymous authors

Abstract

Estimating the gradients of stochastic nodes, which enables gradient-based optimization of neural network parameters, is one of the crucial research questions in the deep generative modeling community. When it comes to discrete distributions, the Gumbel-Softmax trick reparameterizes Bernoulli and categorical random variables by continuous relaxation. However, gradient estimators for discrete distributions other than the Bernoulli and the categorical have not been explored, and the Gumbel-Softmax trick is not directly applicable to other discrete distributions. This paper proposes a general version of the Gumbel-Softmax estimator with a theoretical basis; the proposed estimator is able to reparameterize generic discrete distributions, broader than the Bernoulli and the categorical. In detail, we utilize the truncation of discrete random variables and the Gumbel-Softmax trick with a linear transformation for the relaxed reparameterization. The proposed approach enables the relaxed discrete random variable to be reparameterized through a large-scale stochastic computational graph. Our experiments consist of (1) synthetic data analyses and applications to VAEs, which show the efficacy of our method; and (2) topic models, which demonstrate the value of the proposed estimation in practice.

1. INTRODUCTION

Stochastic computational graphs, including deep generative models such as variational autoencoders, are widely used for representation learning. Optimizing the network parameters through gradient methods requires an estimation of the gradient values, but the stochasticity requires the computation of an expectation, which differentiates this problem from the deterministic gradient of ordinary neural networks. There are two common ways of obtaining the gradients: score function-based methods and reparameterization methods. The score function-based estimators tend to produce unbiased gradients with high variance, while the reparameterization estimators tend to produce biased gradients with low variance (Xu et al., 2019). Hence, the core technique of the score function-based estimators becomes reducing the variance of the gradients to achieve stable and fast optimization. Meanwhile, utilizing the reparameterization estimators requires a differentiable non-centered parameterization (Kingma & Welling, 2014a) of random variables. Among the reparameterization estimators, one of the most popular examples is the reparameterization in the Gaussian variational autoencoder (VAE) (Kingma & Welling, 2014b), which has an exact reparameterization form. Other VAEs with explicit priors suggest reparameterization tricks with approximations (Nalisnick & Smyth, 2017; Joo et al., 2020). For continuous random variables, it is feasible to estimate gradients with automatic differentiation by utilizing a transport equation (Jankowiak & Obermeyer, 2018) or an implicit reparameterization (Figurnov et al., 2018). However, these methods are not applicable to discrete random variables due to the non-differentiability.
Recently, some discrete random variables, such as Bernoulli or categorical random variables, have been well-explored in terms of the reparameterization method by overcoming such difficulty through a continuous relaxation (Jang et al., 2017; Maddison et al., 2017). However, other discrete distributions, for example, the Poisson, binomial, multinomial, geometric, and negative binomial distributions, have not been explored from the learning perspective in the deep generative modeling community. Prior works on graphical models, such as Ranganath et al. (2015; 2016), utilized Poisson latent variables for latent counting. Another line of work (Wu et al., 2020) utilized a Gaussian approximation of the Poisson latent variable to count the number of words, which can be a poor approximation if the rate parameter is small. In this sense, stochastic gradient estimators for such discrete distributions need to be studied in the deep generative modeling community, which would broaden the choice of prior assumptions and the utilization of various distributions. This paper proposes a generalized version of the Gumbel-Softmax reparameterization trick, which can be applied to generic discrete random variables through continuous relaxation, namely Generalized Gumbel-Softmax (GENGS). The key ideas of GENGS are (1) a conversion of the sampling process into a one-hot categorical selection process; (2) a reversion of the selected category in its one-hot form to the original sample value; and (3) a relaxation of the categorical selection process into a continuous form. To implement these steps, GENGS first truncates the discrete random variable to approximate the distribution with a finite number of possible outcomes. Afterward, GENGS utilizes the Gumbel-Softmax trick together with a special form of linear transformation. Our main theorem shows that the proposed GENGS is applicable to general discrete random variables beyond the Bernoulli and the categorical.
Our experiments show the efficacy of GENGS on synthetic examples and VAEs, as well as its usability in a topic model application.

2. PRELIMINARY: REPARAMETERIZATION TRICK & GUMBEL-SOFTMAX

2.1 BACKPROPAGATION THROUGH STOCHASTIC NODES WITH REPARAMETERIZATION TRICK

Suppose we have a stochastic node, or a latent variable, z ∼ p(z|θ), where the distribution depends on a parameter θ. The goal is optimizing the loss function L(θ, η) = E_{z∼p(z|θ)}[f_η(z)], where f_η is a continuous and differentiable function with respect to η, e.g., a neural network. To optimize the loss function in terms of θ through gradient methods, we need ∇_θ L(θ, η) = ∇_θ E_{z∼p(z|θ)}[f_η(z)], which cannot be directly computed in its original form. To compute ∇_θ L(θ, η), the reparameterization trick alternatively introduces an auxiliary variable ε ∼ p(ε), which takes over all randomness of the latent variable z, so the sampled value z can be re-written as z = g(θ, ε), with a deterministic function g that is differentiable in θ. Figure 1(a) illustrates the reparameterization trick: the shaded nodes indicate random nodes, and the dotted lines denote sampling processes. Here, the gradient of the loss function with respect to θ is derived as

∇_θ L = ∇_θ E_{z∼p(z|θ)}[f_η(z)] = E_{ε∼p(ε)}[∇_θ f_η(g(θ, ε))] = E_{ε∼p(ε)}[∇_g f_η(g(θ, ε)) ∇_θ g(θ, ε)]   (1)

where the last term of Equation 1 is now computable. A condition enabling the reparameterization trick is the continuity of the random variable z, so the distribution of z is limited to a class of continuous distributions. To utilize the differentiable reparameterization trick on discrete random variables, continuous relaxation can be applied: for example, a relaxation from the categorical distribution to the Gumbel-Softmax distribution, described in the next subsection.
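As a concrete illustration, here is a minimal NumPy sketch of Equation 1 for a Gaussian latent z ∼ N(θ, 1) with f(z) = z², where g(θ, ε) = θ + ε; the function name and hyper-parameters are ours, not from the paper.

```python
import numpy as np

def reparam_grad(theta, f_grad, n_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dtheta E_{z ~ N(theta, 1)}[f(z)]
    via the reparameterization z = g(theta, eps) = theta + eps.
    Since dz/dtheta = 1, the estimator averages f'(theta + eps)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)  # eps ~ p(eps) carries all randomness
    z = theta + eps                       # deterministic, differentiable in theta
    return float(f_grad(z).mean())

# f(z) = z^2, so f'(z) = 2z; the exact gradient of E[z^2] = theta^2 + 1 is 2*theta.
grad = reparam_grad(theta=1.5, f_grad=lambda z: 2.0 * z)
```

With enough samples, `grad` concentrates around the exact value 2θ = 3, illustrating why the reparameterized estimator has low variance here.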

2.2. REPARAMETERIZATION TRICK ON CATEGORICAL RANDOM VARIABLE

The Gumbel-Max trick (Gumbel, 1948) is a procedure for sampling a one-hot categorical value using the Gumbel distribution, instead of sampling directly from a categorical distribution. This implies that a categorical random variable X ∼ Categorical(π), where π lies on the (n−1)-dimensional simplex ∆^{n−1}, can be reparameterized by the Gumbel-Max trick: (1) sample u_j ∼ Uniform(0, 1) to generate a Gumbel sample g_j = −log(−log u_j) for each j = 1, …, n; and (2) compute k = argmax_{j=1,…,n} [log π_j + g_j], where π is the categorical parameter. This procedure generates a one-hot sample x such that x_j = 0 for j ≠ k and x_k = 1 with P(X_k = 1) = π_k. We denote GM(π) to be the distribution whose samples are generated by the Gumbel-Max trick. The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) is an alternative to the Gumbel-Max trick that continuously relaxes a categorical random variable. The Gumbel-Softmax utilizes the softmax with a temperature τ > 0, instead of the argmax, in the sampling process, which enables (1) relaxing the discreteness of the categorical random variable to the one-hot-like form x(τ) = softmax((log π + g)/τ) in the continuous domain; and (2) approximating the Gumbel-Max by taking τ small enough. Lately, the Gumbel-Softmax estimator has been widely used to reparameterize categorical random variables, e.g., RelaxedOneHotCategorical in TensorFlow (Abadi et al., 2016). We denote GS(π, τ) to be the distribution generated by the Gumbel-Softmax trick.
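The two tricks above can be sketched in a few lines of NumPy; the function names are ours, and the implementation only mirrors the sampling equations of this subsection.

```python
import numpy as np

def gumbel_max(log_pi, rng):
    """One-hot categorical sample via the Gumbel-Max trick."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))  # Gumbel(0, 1) noise
    x = np.zeros_like(log_pi)
    x[np.argmax(log_pi + g)] = 1.0
    return x

def gumbel_softmax(log_pi, tau, rng):
    """Relaxed one-hot-like sample: softmax((log pi + g) / tau)."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))
    y = (log_pi + g) / tau
    e = np.exp(y - y.max())  # numerically stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.2, 0.5, 0.3]))
hard = gumbel_max(log_pi, rng)
soft = gumbel_softmax(log_pi, tau=0.1, rng=rng)  # small tau -> near one-hot
```

The `hard` sample is an exact one-hot vector, while `soft` lies on the simplex and approaches a one-hot vector as τ shrinks.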

3. PROCESS OF GENGS REPARAMETERIZATION

This section discusses the process of GENGS to convey the concept with minimal theoretical details, and Section 4 provides the theoretical background of GENGS. The three steps of GENGS are the following: (1) approximate a discrete distribution by truncating the distribution; (2) reparameterize the truncated distribution with the Gumbel-Max trick and the linear transformation T, which will be introduced below; and (3) relax the discreteness by replacing the Gumbel-Max trick in Step 2 with the Gumbel-Softmax trick. Figure 1(b) illustrates the full steps of the GENGS trick.

Step 1. Truncate the discrete distribution to finitize the number of possible outcomes. Suppose X ∼ Poisson(100), which has a mode near x = 100 and near-zero probabilities at x < 50 and x > 150. The key idea of the first step is ignoring the outcomes with near-zero probabilities (e.g., x < 50 and x > 150) and focusing only on the probable samples with meaningful probabilities (e.g., 50 ≤ x ≤ 150), i.e., truncating the distribution, which finitizes the support of the distribution. Now, suppose we have a discrete random variable X ∼ D(λ) and its truncated random variable Z ∼ TD(λ, R), where R denotes the truncation range that needs to be pre-defined. Proposition 3 in Section 4 provides the theoretical justification that Z approximates X. Since we finitized the support by the truncation, we may assume Z has a support C = {c_0, …, c_{n−1}} of n possible outcomes and its corresponding constant outcome vector c = (c_0, …, c_{n−1}). Note that the ordering of c_k is not significant, and Appendix E provides examples of the setting of c.

Step 2. Divide the sampling process of Z into two steps: select a one-hot category of Z, and revert the selected one-hot category to the original value. For example, if the sampled value of Z is c_2 ∈ C, we first focus on the one-hot category class vector one_hot(c_2) = (0, 0, 1, 0, …, 0), rather than the sampled value c_2.
Such a one-hot categorical selection process is possible by utilizing the categorical selection w ∼ Categorical(π) or its reparameterized version, the Gumbel-Max trick GM(π). Here, the categorical parameter π = (π_0, …, π_{n−1}) can be directly calculated from the explicit probability mass function (PMF) of the distribution, i.e., π_k = P(Z = c_k). However, the PMF of the truncated distribution requires a modification of the PMF of the original distribution, which is determined by how we define Z from X. See Definitions 1 and 2 and Appendix A for the detailed configuration of π. Suppose we now have a one-hot categorical sample w drawn with the categorical parameter π. Afterward, we revert the one-hot selected categorical vector w = (w_0, …, w_{n−1}) to the original sample value with a linear transformation T(w) = Σ_k w_k c_k = w · c. Proposition 4 in Section 4 shows the validity of this alternative sampling process.

Step 3. Relax the one-hot categorical selection into a continuous form by utilizing the Gumbel-Softmax trick. Up to now, the only missing piece of the reparameterization trick is differentiability, due to the one-hot categorical selection of the Gumbel-Max process. As in Section 2.2, the process can be continuously relaxed with the Gumbel-Softmax trick GS(π, τ) for some temperature τ. Theorem 5 in Section 4 shows that the alternative sampling process still holds under the continuous relaxation.
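The three steps can be put together in a short sketch, here for a truncated Poisson; the function name and constants are illustrative, assuming the truncated PMF construction of Section 4.1.

```python
import numpy as np
from scipy.stats import poisson

def gengs_sample(rate, n, tau, rng):
    """GenGS sketch for Poisson(rate): truncate to support {0, ..., n-1},
    select a relaxed one-hot category with Gumbel-Softmax, then revert
    with the linear transformation T(w) = w . c."""
    c = np.arange(n)                           # Step 1: finitized outcomes
    pi = poisson.pmf(c, rate)
    pi[-1] += 1.0 - pi.sum()                   # fold the tail mass into the last bin
    g = -np.log(-np.log(rng.uniform(size=n)))  # Gumbel noise
    y = (np.log(pi) + g) / tau
    w = np.exp(y - y.max())
    w /= w.sum()                               # Step 3: relaxed one-hot selection
    return float(w @ c)                        # Step 2: revert via T(w) = w . c

rng = np.random.default_rng(0)
samples = [gengs_sample(rate=4.0, n=15, tau=0.05, rng=rng) for _ in range(5000)]
mean = float(np.mean(samples))  # close to the Poisson mean 4.0 for small tau
```

Each sample is a continuous value in [0, n−1] that concentrates near integer outcomes as τ decreases, so the empirical mean approaches the mean of the original Poisson.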

4. THEORETICAL BACKGROUND OF GENGS

To support our reparameterization methodology, this section provides the main theorem on the reparameterization. The first proposition approximates an original discrete distribution with its truncated version. Next, the second proposition enables the truncated distribution to be reparameterized by the Gumbel-Max trick. Finally, the main theorem shows that the Gumbel-Softmax trick converges to the Gumbel-Max trick under the linear transformation. Through these steps, we note that our proposed reparameterization trick is general and theoretically grounded.

4.1. FINITIZING THE SUPPORT BY TRUNCATING DISCRETE RANDOM VARIABLES

Definition 1 specifies a truncated discrete random variable for truncating the right-hand side. Note that Definition 1 can be easily extended to truncate the left-hand side or both sides of a distribution. Definition 2 is the both-side truncation version, and Appendix A discusses its variation. However, we focus on non-negative distributions in the remainder of this subsection, since most of the popularly used discrete random variables have the support N_{≥0}.

Definition 1. (A special case of right-hand-side truncation for non-negative discrete random variables) A truncated discrete random variable Z_n of a non-negative discrete random variable X ∼ D(λ) is a discrete random variable such that Z_n = X if X ≤ n − 1, and Z_n = n − 1 if X ≥ n. The random variable Z_n is said to follow a truncated discrete distribution TD(λ, R) with a parameter λ and a truncation range R = [0, n). Alternatively, we write the truncation level as R = n if the left truncation is at zero in the non-negative case.

Definition 2. (Both-side truncation for general discrete distributions) A truncated discrete random variable Z_{m,n} of a discrete random variable X ∼ D(λ) is a discrete random variable such that (1) Z_{m,n} = X if m < X < n; (2) Z_{m,n} = n − 1 if X ≥ n; and (3) Z_{m,n} = m + 1 if X ≤ m. The random variable Z_{m,n} is said to follow a truncated discrete distribution TD(λ, R = (m, n)) with a parameter λ and a truncation range R = (m, n).

As we discussed in Section 3, truncating the distribution intends to finitize the number of possible outcomes so as to utilize the categorical selection. From the definition, the samples of the finitized support can simply be taken as c_k = k in this special non-negative case. Furthermore, due to the definition, the modification of the PMF only affects the last category c_{n−1} = n − 1, and the modified PMF can be computed by injecting the near-zero remaining probability mass into the last category, right before the truncation level.
In other words, π_k = P(Z_n = c_k) = P(Z_n = k) = P(X = k) for k = 0, …, n − 2, and π_{n−1} = 1 − Σ_{k=0}^{n−2} π_k; hence, the sum-to-one property remains satisfied. Here, the idea which leads to Proposition 3 is that if we take the truncation level far enough from zero, we can cover most of the possible outcomes that can be sampled from the original distribution. Note that the truncation step can be omitted if the original distribution already has a finite support, but one can still utilize the truncation to ignore the unlikely samples.

Proposition 3. With Definition 1, Z_n converges to X almost surely as n → ∞. Also, with Definition 2, Z_{m,n} converges to X almost surely as m → −∞ and n → ∞.

The almost sure convergence property of Proposition 3 provides the theoretical basis for approximating a discrete random variable D(λ) with the truncated random variable TD(λ, n), and Appendix A gives the detailed proof. Through the truncation, the discrete distribution is approximated with a finitized support, and the Gumbel tricks are ready to be utilized.
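A minimal sketch of this π construction for a Poisson, assuming SciPy for the PMF (the function name is ours):

```python
import numpy as np
from scipy.stats import poisson

def truncated_pmf(rate, n):
    """Categorical parameter pi of TD(rate, n) per Definition 1:
    pi_k = P(X = k) for k < n - 1, with the right tail folded into pi_{n-1}."""
    pi = poisson.pmf(np.arange(n), rate)
    pi[-1] = 1.0 - pi[:-1].sum()  # pi_{n-1} = 1 - sum_{k=0}^{n-2} pi_k
    return pi

pi = truncated_pmf(rate=100.0, n=151)  # support {0, ..., 150}
```

The vector sums to one by construction, matches the original PMF on all but the last bin, and the last bin absorbs the remaining right-tail mass.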

4.2. REPARAMETERIZATION BY GENERALIZED GUMBEL-SOFTMAX TRICK

Next, we select a one-hot categorical sample from the finitized categories and revert the one-hot selection to the original sample value. The widely utilized discrete distributions have explicit forms of the PMF, so we can directly compute the PMF values on the truncated support with a pre-defined truncation range. Once the distribution and the truncation range are fixed as D(λ) and R, respectively, we have the corresponding constant outcome vector c = (c_0, …, c_{n−1}) and the computed PMF value vector π = (π_0, …, π_{n−1}) of the truncated distribution TD(λ, R), where π_k = TD(c_k; λ, R). Afterward, we define a transformation T(w) = Σ_k w_k c_k = w · c. Additionally, we denote the distributions generated by applying T to the samples of GM and GS as T(GM) and T(GS), respectively. Then, we can generate a sampled value of TD(λ, R) with the linear transformation and a Gumbel-Max sample, as stated in Proposition 4, proved in Appendix B.

Proposition 4. For any truncated discrete random variable Z ∼ TD(λ, R) of a discrete distribution D(λ) and the transformation T, Z can be reparameterized by T(GM(π)) if we set π_k = P(Z = c_k).

Due to the reparameterization, the randomness of TD(λ, R) with respect to the parameter λ moves into the uniform random variable in the Gumbel-Max, since T is a continuous and deterministic function. Then, TD(λ, R) can be approximated by replacing the Gumbel-Max with the Gumbel-Softmax in T, as stated in Theorem 5, proved in Appendix C. We define T(GS(π, τ)) as GENGS(π, τ).

Theorem 5. For the transformation T and a given categorical parameter π ∈ ∆^{n−1}, the convergence property of the Gumbel-Softmax to the Gumbel-Max still holds under the linear transformation T, i.e., GS(π, τ) → GM(π) as τ → 0 implies GENGS(π, τ) → T(GM(π)) as τ → 0.

Theorem 5 implies that we can relax and approximate the truncated discrete random variable TD(λ, R) by the Gumbel-Softmax and the linear transformation.
The assumption of the theorem that GS(π, τ) → GM(π) as τ → 0 has not been mathematically proven in the original literature (Jang et al., 2017; Maddison et al., 2017). Instead, the authors have empirically shown that GS(π, τ) eventually becomes GM(π) as τ → 0. Figure 2 shows how GENGS gets closer to the original distribution by adjusting the truncation range and the temperature.

Truncation Range. We can observe that the approximation becomes closer to the original distribution as we widen the truncation range R. However, the increment of R is technically limited due to the finite neural network output for the inference. Note that the choice of the truncation range is crucial in terms of covering most probable samples. Therefore, we set the truncation range to cover all but less than 1e-5% of the probability with respect to the prior distribution, or arbitrarily large.

Temperature. Decreasing τ in softmax((log π + g)/τ) brings the distribution closer to the original distribution. However, an initially small τ leads to high bias and variance of the gradients, which becomes problematic at the learning stage of π. Therefore, annealing τ from a large value to a small value is necessary to give π a chance to be learned.

Note that the theoretical analysis places no condition on the distribution to be reparameterized with GENGS. Hence, once a discrete distribution has an explicit PMF, GENGS can be easily utilized for the reparameterization. Appendix E suggests examples of GENGS utilization, including one that shows the regular Gumbel-Softmax is a special case of GENGS. Appendix F provides the algorithm of GENGS, and Appendix G lists discrete distributions in TensorFlow that can utilize GENGS.
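As one possible temperature schedule, the following sketch uses exponential decay clipped at a floor, a common choice in the Gumbel-Softmax literature; the hyper-parameter values are illustrative, not from the paper.

```python
import numpy as np

def anneal_tau(step, tau0=1.0, tau_min=0.1, rate=1e-4):
    """Exponentially decay the temperature, clipped at a floor tau_min.
    Hyper-parameter values are illustrative."""
    return float(max(tau_min, tau0 * np.exp(-rate * step)))

taus = [anneal_tau(s) for s in (0, 10_000, 50_000)]
```

Training starts with a large τ, where gradients with respect to π are well-behaved, and ends near the floor, where the relaxed samples are close to discrete.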

5. EXTENSION OF GENGS: IMPLICIT INFERENCE & DISCRETIZATION

Implicit Inference. Unlike continuous random variables, discrete random variables have countable outcomes. Instead of inferring the distribution parameter λ and then computing the PMF values through the fixed PMF and λ, we can directly infer the PMF values π of the possible outcomes with a softmax, which becomes the input of the Gumbel tricks, by loosening the assumption on the shape of the approximate posterior PMF. This implicit inference of the PMF values becomes possible because truncating the distribution finitizes the possible outcomes. However, this inference approach needs a regularizer, such as the KL divergence term in the objective function of VAEs, which encourages the distribution shape to be similar to a prior distribution with a pre-defined parameter. We found that loosening the approximate posterior assumption leads to a significant performance gain in our VAE experiments. See Appendix F for the algorithm of the implicit inference.

Discretization of the Continuously Relaxed Sample. GENGS outputs a continuously reparameterized sample value, since we relax the discrete random variable into a continuous form. Utilizing the Straight-Through (ST) Gumbel-Softmax estimator (Bengio et al., 2013; Jang et al., 2017), instead of the naive Gumbel-Softmax, we can obtain a discrete sample as well. Since the ST Gumbel-Softmax discretizes the relaxed Gumbel-Softmax output with argmax in the forward pass while using the gradients obtained from the relaxed sample in the backward pass, this mismatch could result in significant performance degradation.
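The ST discretization can be sketched as follows (forward pass only, in NumPy); in an autodiff framework the last line would be written as `soft + stop_gradient(hard - soft)` so that gradients flow through the relaxed sample. The function name is ours.

```python
import numpy as np

def st_gumbel_softmax(log_pi, tau, rng):
    """Straight-Through Gumbel-Softmax, forward pass only (NumPy sketch).
    In an autodiff framework the return would be
    soft + stop_gradient(hard - soft): the value is the discrete one-hot
    `hard`, while gradients flow through the relaxed `soft`."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))
    y = (log_pi + g) / tau
    soft = np.exp(y - y.max())
    soft /= soft.sum()            # relaxed (Gumbel-Softmax) sample
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0   # discretize with argmax
    return soft + (hard - soft)   # equals `hard` in value

rng = np.random.default_rng(0)
out = st_gumbel_softmax(np.log(np.array([0.2, 0.5, 0.3])), tau=0.5, rng=rng)
```

The forward output is exactly one-hot, which is what makes the downstream computation see a discrete sample.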

6. RELATED WORK

GENGS is basically a single-sample gradient estimator, like other reparameterization gradient estimators. Though GENGS could use multiple samples to obtain more stable gradients, we compare GENGS with the other estimators using a single sample to test the fundamental performance of the gradient estimators. RF denotes the basic REINFORCE (Williams, 1992). NVIL (Mnih & Gregor, 2014) utilizes a neural network to introduce the optimal control variate. MUPROP (Gu et al., 2016) utilizes the first-order Taylor expansion of the loss term as a control variate. VIMCO(k) (Mnih & Rezende, 2016) is designed as a k-sample gradient estimator. REBAR (Tucker et al., 2017) and RELAX (Grathwohl et al., 2017) utilize the reparameterization trick for constructing the control variate. Deterministic RaoBlack (DETRB) (Liu et al., 2019) uses the weighted value of the fixed gradients from m selected categories and the estimated gradients from the remaining categories with respect to their odds to reduce the variance. The idea of Stochastic RaoBlack (STORB) (Kool et al., 2020) is essentially the same as that of DETRB, but STORB randomly chooses the categories at each step, instead of using fixed categories. Kool et al. (2020) also suggested an unordered set gradient estimator (UNORD), which also uses multiple sampled gradients, utilizing sampling without replacement. For DETRB, STORB, and UNORD, we use m = 1 category that utilizes the fixed gradient for a fair comparison. Note that if there are K > 1 dimensions to be inferred, these models require computing m^K gradient combinations, which has higher complexity than GENGS. The * symbol denotes a variation that utilizes a built-in control variate introduced in the work of Kool et al. (2020).

7. EXPERIMENT

7.1 SYNTHETIC EXAMPLE

Experimental Setting. In this experiment, we expand the toy experiments from Tucker et al. (2017); Grathwohl et al. (2017) to various discrete distributions. We first fix constants t_1, …, t_k, and then optimize the loss function E_{z∼p(z|λ)}[Σ_{i=1}^{k} (z_i − t_i)²] with respect to λ. Here, we set p(z|λ) as Poisson(20), Binomial(20, .3), Multinomial(3, [.7, .2, .1]), and NegativeBinomial(3, .4). We also adapt the Rao-Blackwellization (RB) idea to GENGS by utilizing m = 1 in calculating the selected gradient, so this adaptation results in GENGS-RB, which estimates the remaining gradients by GENGS. See Appendix J for the detailed experimental settings.

Experimental Result. Figure 3 compares the log-loss and the log-variance of the estimated gradients from the various estimators. In the figure, the log-loss needs to be minimized to correctly estimate the backpropagated gradient value in the learning process. Additionally, the log-variance needs to be minimized to maintain the consistency of the gradients, so the gradient descent can be efficient. GENGS shows the best log-loss and the best log-variance when GENGS keeps the continuous relaxation of the modeled discrete random variable. For the Poisson case, the exact gradient has a closed-form solution, as in Appendix J, and GENGS shows the lowest bias among all gradient estimators. See Appendix J for the curves with confidence intervals and the curves without smoothing.
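For the Poisson case, the closed form of the exact gradient can be checked against a score-function estimate in a few lines; this sketch is ours, uses a one-dimensional version of the toy loss above, and the constants are illustrative.

```python
import numpy as np

def reinforce_grad(lam, t, n_samples=200_000, seed=0):
    """Score-function (REINFORCE) estimate of
    d/dlam E_{z ~ Poisson(lam)}[(z - t)^2],
    using the Poisson score d log p(z)/dlam = z/lam - 1."""
    rng = np.random.default_rng(seed)
    z = rng.poisson(lam, size=n_samples)
    return float(np.mean((z - t) ** 2 * (z / lam - 1.0)))

# E[(z - t)^2] = lam + (lam - t)^2, so the exact gradient is 1 + 2*(lam - t).
est = reinforce_grad(lam=20.0, t=15.0)
exact = 1.0 + 2.0 * (20.0 - 15.0)
```

The Monte Carlo estimate concentrates around the closed-form value, though its per-sample variance is what motivates the lower-variance relaxed estimators compared in Figure 3.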

7.2. VAE: SYNTHETIC EXPERIMENT ON DEEP GENERATIVE MODELS

Experimental Setting. We follow the VAE experiments of Figurnov et al. (2018) with discrete priors, which diversifies the choice of prior assumption as the latent factor count in the discrete case. This experiment utilizes the Poisson, the geometric, and the negative binomial distributions. The evidence lower bound (ELBO) L(x) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p_θ(z)), which consists of the reconstruction part and the KL divergence part, is maximized during the training period. Optimizing the ELBO of VAEs requires computing the KL divergence between the approximate posterior and the prior distributions. In GENGS, by truncating the original distribution, the KL divergence reduces to a KL divergence between categorical distributions. See Appendix H for the detailed statement and proof. Note that the purpose of the VAE experiments is not to compare the performance across various prior distributions. The VAE is considered a more challenging task than the synthetic example in the last subsection, since (1) this task requires computing the gradients of the encoder network parameters through the latent distribution parameter λ; and (2) each stochastic gradient of the latent dimension affects every encoder parameter, since we utilize fully-connected layers. Hence, a single poorly estimated gradient with respect to the latent distribution parameter λ could negatively affect the learning of the encoder parameters. See Appendix K for the detailed experimental settings.

Experimental Result. Table 1 shows the negative ELBO results of the VAE experiments. We found that some baselines failed to reach the optimal point, so we excluded those estimators in such suboptimal cases.
The variants of GENGS showed the lowest negative ELBO in general, and loosening the PMF condition (i.e., the implicit inference) reached better optima. Figure 4 shows the images reconstructed by the VAEs with various gradient estimators on MNIST and OMNIGLOT. GENGS draws the clearest images and the best reconstructions, which aligns with the quantitative results in Table 1. See Appendix K for the full table and additional discussion.
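For reference, after truncation the KL term of the ELBO is just the KL divergence between two categorical parameter vectors; a small sketch, assuming the truncated-Poisson π construction of Section 4.1 and illustrative rates:

```python
import numpy as np
from scipy.stats import poisson

def truncated_poisson_pi(rate, n):
    """Categorical parameter of TD(rate, n): tail mass folds into the last bin."""
    pi = poisson.pmf(np.arange(n), rate)
    pi[-1] = 1.0 - pi[:-1].sum()
    return pi

def categorical_kl(q, p, eps=1e-12):
    """KL(q || p) between two categorical parameter vectors."""
    q, p = np.asarray(q), np.asarray(p)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

# e.g., KL between a truncated-Poisson posterior (rate 5) and prior (rate 4):
kl = categorical_kl(truncated_poisson_pi(5.0, 20), truncated_poisson_pi(4.0, 20))
```

This closed-form categorical KL is what makes the ELBO of the truncated model straightforward to optimize.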

8. CONCLUSION

This paper suggests a new gradient estimator for general discrete random variables, a generalized version of the Gumbel-Softmax estimator. To strengthen the practical usage of reparameterization tricks with the Gumbel-Softmax function, we provide a theoretical background for our reparameterization trick. Our results show that a discrete random variable can always be reparameterized through the proposed GENGS algorithm. The limitation of GENGS is the setting of the truncation level and the Gumbel-Softmax temperature, which becomes a trade-off between the gradient estimation accuracy and the time budget. Subsequently, we show the synthetic analysis and the VAE experiment, as well as a topic model application of GENGS. With this generalization, we expect that GENGS can diversify the options of distributions in the deep generative modeling community.

A PROOF OF PROPOSITION 3 & TRUNCATING BOTH SIDES

A.1 PROOF OF PROPOSITION 3 FOR TRUNCATING RIGHT-HAND SIDE

Before restating Definition 1 and Proposition 3 of the main paper, we first introduce the definition of almost sure convergence of a sequence of random variables.

Definition. (Almost Sure Convergence) For a sequence of random variables {X_n}_{n=1}^∞, X_n converges almost surely to a random variable X if

P(lim_{n→∞} X_n = X) = P({ω ∈ Ω | lim_{n→∞} X_n(ω) = X(ω)}) = 1   (2)

for a sample space Ω.

Definition. (A special case of right-hand-side truncation for non-negative discrete random variables) A truncated discrete random variable Z_n of a non-negative discrete random variable X ∼ D(λ) is a discrete random variable such that Z_n = X if X ≤ n − 1, and Z_n = n − 1 if X ≥ n. The random variable Z_n is said to follow a truncated discrete distribution TD(λ, R) with a parameter λ and a truncation range R = [0, n). Alternatively, we write the truncation level as R = n if the left truncation is at zero in the non-negative case.

With this definition, as we discussed in the main text, the constant outcome vector can be defined as c = (0, 1, 2, …, n − 1). Note that for Z_n = 0, 1, …, n − 2, the events of Z_n and X coincide, hence P(Z_n = k) = P(X = k) in such cases. This implies that the modified PMF of Z_n = k can be computed by using the PMF of X = k in such cases, i.e., P(Z_n = k) = P(X = k) for k = 0, 1, …, n − 2. Consequently, P(Z_n = n − 1) = P(X ≥ n − 1) = 1 − Σ_{k=0}^{n−2} P(Z_n = k) due to the sum-to-one property of probability. Hence, the corresponding categorical parameter π of the constant outcome vector c can be computed using the PMF of the original discrete random variable X: π_k = P(Z_n = k) = P(X = k) for k = 0, 1, …, n − 2, and π_{n−1} = P(Z_n = n − 1) = P(X ≥ n − 1) = 1 − Σ_{k=0}^{n−2} π_k.

Proposition. With Definition 1, Z_n converges to X almost surely as n → ∞.

Proof. Note that {ω ∈ Ω | Z_n(ω) = X(ω)} = {ω ∈ Ω | X(ω) < n}.
Then, we have the following:

P(lim_{n→∞} Z_n = X) = P({ω ∈ Ω | lim_{n→∞} Z_n(ω) = X(ω)})   (3)
= P({ω ∈ Ω | X(ω) < lim_{n→∞} n = ∞})   (4)
= P({ω ∈ Ω | X(ω) < ∞})   (5)
= P(X < ∞)   (6)
= 1   (7)

since X is a non-negative discrete random variable, and its cumulative distribution function F_X(x) has the following properties: it is non-decreasing, satisfies 0 ≤ F_X ≤ 1, and F_X(x) → 1 as x → ∞.
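The proposition can also be illustrated numerically: P(Z_n ≠ X) = P(X ≥ n) vanishes as the truncation level n grows. A small check with SciPy, using illustrative values:

```python
import numpy as np
from scipy.stats import poisson

# For Definition 1, Z_n differs from X exactly on the event {X >= n}, so
# P(Z_n != X) = P(X >= n) -> 0 as the truncation level n grows.
rate = 100.0
levels = (120, 150, 200)
mismatch = [float(poisson.sf(n - 1, rate)) for n in levels]  # P(X >= n)
```

The mismatch probability decays rapidly once the level passes the bulk of the distribution, which is why a moderate truncation range suffices in practice.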

A.2 PROPOSITION 3 FOR TRUNCATING BOTH SIDES

The below is the both-side truncation version of Definition 2 and Proposition 3 in the main paper. For simplicity, we assume the distribution has support Z.

Definition. (Both-side truncation for general discrete distributions) A truncated discrete random variable Z_{m,n} of a discrete random variable X ∼ D(λ) is a discrete random variable such that Z_{m,n} = X if m < X < n; Z_{m,n} = n − 1 if X ≥ n; and Z_{m,n} = m + 1 if X ≤ m. The random variable Z_{m,n} is said to follow a truncated discrete distribution TD(λ, R = (m, n)) with a parameter λ and a truncation range R = (m, n).

With this definition, the constant outcome vector can be defined as c = (m + 1, …, n − 1). Note that for Z_{m,n} = m + 2, …, n − 2, the events of Z_{m,n} and X coincide, hence P(Z_{m,n} = k) = P(X = k) in such cases. This implies that the modified PMF of Z_{m,n} = k can be computed by using the PMF of X = k in such cases, i.e., P(Z_{m,n} = k) = P(X = k) for k = m + 2, …, n − 2. Then, by the definition, P(Z_{m,n} = m + 1) = P(X ≤ m + 1) and P(Z_{m,n} = n − 1) = P(X ≥ n − 1). However, during the implementation, this definition can be a problem, since the configuration implies that there might be a case where we have to compute two infinite sums, namely P(Z_{m,n} = m + 1) = P(X ≤ m + 1) = Σ_{k≤m+1} P(X = k) and P(Z_{m,n} = n − 1) = P(X ≥ n − 1) = Σ_{k≥n−1} P(X = k). Hence, we also provide an alternative configuration of Z_{m,n} later in this section.

Proposition. With the definition above for truncating both sides, Z_{m,n} converges to X almost surely as m → −∞ and n → ∞.

Proof. Note that {ω ∈ Ω | Z_{m,n}(ω) = X(ω)} = {ω ∈ Ω | m < X(ω) < n}.
Then, we have the following:

P(lim_{m→−∞, n→∞} Z_{m,n} = X) = P({ω ∈ Ω | lim_{m→−∞, n→∞} Z_{m,n}(ω) = X(ω)})   (8)
= P({ω ∈ Ω | −∞ = lim_{m→−∞} m < X(ω) < lim_{n→∞} n = ∞})   (9)
= P({ω ∈ Ω | −∞ < X(ω) < ∞})   (10)
= P(−∞ < X < ∞)   (11)
= P(X < ∞) − P(X ≤ −∞)   (12)
= 1   (13)

since X is a discrete random variable, and its cumulative distribution function F_X(x) has the following properties: 0 ≤ F_X ≤ 1, F_X(x) → 1 as x → ∞, and F_X(x) → 0 as x → −∞.

As we discussed above, during the implementation, computing the small probabilities of both the left and right tails can cause a problem: it can have high complexity or even be impossible. Hence, we can consider the alternative definition Z_{m,n} = X if m < X < n, and Z_{m,n} = n − 1 if X ≥ n or X ≤ m, which has a simpler PMF computation. In this case, we can simply add the remaining probability of the unlikely samples to the right-most value, and moreover, the alternative configuration does not harm the proof of the proposition. Hence, with the alternative definition, the corresponding categorical parameter π = (π_{m+1}, …, π_{n−1}) of the constant outcome vector c = (m + 1, …, n − 1) can be computed using the PMF of the original discrete random variable X: π_k = P(Z_{m,n} = k) = P(X = k) for k = m + 1, …, n − 2, and π_{n−1} = P(Z_{m,n} = n − 1) = P(X ≥ n − 1) + P(X ≤ m) = 1 − Σ_{k=m+1}^{n−2} π_k.

B PROOF OF PROPOSITION 4

Proposition. For any truncated discrete random variable Z ∼ TD(λ, R) of a discrete distribution D(λ) and a transformation T, Z can be reparameterized by T(GM(π)) if we set π_k = P(Z = c_k).

Proof. Note that Z has two parameters: the distribution parameter λ and the truncation range R. Assume that truncating the distribution with the truncation range R yields the possible outcome set C = {c_0, …, c_{n-1}} of n possible outcomes. Note that the transformation is defined as T(w) = ∑_{k=0}^{n-1} w_k c_k. By pre-defining the truncation range R as a hyper-parameter, the randomness of Z depends solely on the distribution parameter λ. Now, we introduce the Gumbel random variable G = -log(-log U), where U ∼ Uniform(0, 1), as an auxiliary random variable. Then, given a categorical parameter π ∈ Δ^{n-1}, any n-dimensional one-hot vector e_j = (0, …, 0, 1, 0, …, 0), which has 1 in the j-th entry and 0 in all other entries, can be reparameterized by the Gumbel-Max trick. Suppose we have a sample c_m from Z, and note that the PMF values π = (π_0, …, π_{n-1}) of Z are known by the definition of Z. Then, with the transformation T and the constant outcome vector c = (c_0, …, c_{n-1}), the following holds: c_m = ∑_{k=0}^{n-1} c_k [e_m]_k = T(e_m). Since the transformation T is also a deterministic function, by introducing the Gumbel random variable as an auxiliary random variable, we can transfer the randomness of Z from λ (or π in the implicit inference case) to the uniform random variable composing the Gumbel random variable. Hence, the truncated discrete random variable Z can be reparameterized by the Gumbel-Max trick and the transformation T.
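The construction in the proof can be sketched directly; the helper name below is hypothetical, and NumPy is used for the Gumbel draws:

```python
import numpy as np

def gumbel_max_reparam(pi, c, rng):
    """T(GM(pi)): perturb log pi with Gumbel noise, take the argmax one-hot
    e_m, then apply the linear transformation T(w) = sum_k w_k c_k."""
    g = -np.log(-np.log(rng.uniform(size=len(pi))))   # G = -log(-log U)
    e = np.zeros(len(pi))
    e[np.argmax(np.log(pi) + g)] = 1.0                # Gumbel-Max one-hot e_m
    return e @ np.asarray(c, dtype=float)             # T(e_m) = c_m

rng = np.random.default_rng(0)
c = np.arange(5, 17)                    # outcome vector of a truncated variable
pi = np.ones(12) / 12                   # any categorical parameter on c
z = gumbel_max_reparam(pi, c, rng)      # a sample of Z; always some c_k
assert z in c
```

The only source of randomness is the uniform draw inside the Gumbel sample, so the output is deterministic given (π, c, g), which is exactly the reparameterization property the proposition claims.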

C PROOF OF THEOREM 5

Theorem. For a transformation T and a given categorical parameter π ∈ Δ^{n-1}, the convergence property of Gumbel-Softmax to Gumbel-Max still holds under the linear transformation T, i.e., GS(π, τ) → GM(π) as τ → 0 implies GENGS(π, τ) → T(GM(π)) as τ → 0.

Proof. Suppose we are given a categorical parameter π ∈ Δ^{n-1}. Let f_M be the Gumbel-Max trick function and f_S^τ be the Gumbel-Softmax trick function with a temperature τ > 0, where both functions take the categorical parameter and a Gumbel sample as inputs. Note that f_M returns a one-hot vector which has 1 in the argmax entry after the Gumbel perturbation, and f_S^τ returns a one-hot-like softmax activation with temperature τ under the Gumbel perturbation. Draw a random sample u ∼ Uniform(0, 1)^n, which defines the Gumbel sample g for the Gumbel perturbation. Assume that log π_m - log(-log u_m) > log π_j - log(-log u_j) for all j ≠ m, i.e., the selected category is c_m out of the possible outcome set {c_0, …, c_{n-1}}. Therefore, for the Gumbel-Max trick, it is clear that T(f_M(π, g)) = ∑_{k=0}^{n-1} [e_m]_k · c_k = c_m for c = (c_0, …, c_{n-1}), where e_m = (0, …, 0, 1, 0, …, 0) is the n-dimensional one-hot vector which has 1 in the m-th entry and 0 in all other entries. Then, the statement that GS(π, τ) → GM(π) as τ → 0 implies f_S^τ(π, g) → f_M(π, g), i.e., the following:

f_S^τ(π, g)_j = exp((log π_j - log(-log u_j)) / τ) / ∑_{k=0}^{n-1} exp((log π_k - log(-log u_k)) / τ) (15)
→ 1 if j = m, 0 if j ≠ m, as τ → 0. (16)

Then, f_S^τ(π, g) = ẽ_m for some relaxed one-hot vector of e_m obtained by the softmax relaxation. As a consequence,

f_S^τ(π, g)_j × c_j = [exp((log π_j - log(-log u_j)) / τ) / ∑_{k=0}^{n-1} exp((log π_k - log(-log u_k)) / τ)] × c_j (17)
→ c_m if j = m, 0 if j ≠ m, as τ → 0, (18)

since the constant multiplication does no harm to the approximation.
Hence, by taking the summation, which also does no harm to the approximation, T(f_S^τ(π, g)) = ∑_{k=0}^{n-1} [ẽ_m]_k · c_k → c_m = T(f_M(π, g)).
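This convergence is easy to observe numerically; the sketch below (hypothetical helper name) shares one Gumbel draw g across temperatures, so only τ varies:

```python
import numpy as np

def gengs(pi, c, tau, g):
    """GenGS sample T(f_S^tau(pi, g)): tempered softmax of the Gumbel-
    perturbed log-probabilities, followed by the linear transformation T."""
    a = (np.log(pi) + g) / tau
    w = np.exp(a - a.max())
    w /= w.sum()                 # relaxed one-hot vector, approx. e_m
    return w @ c

pi = np.array([0.2, 0.5, 0.3])
c = np.array([3.0, 4.0, 5.0])
g = -np.log(-np.log(np.random.default_rng(1).uniform(size=3)))
hard = c[np.argmax(np.log(pi) + g)]      # T(f_M(pi, g)), the Gumbel-Max value
gaps = [abs(gengs(pi, c, tau, g) - hard) for tau in (1.0, 0.5, 0.1, 0.01)]
assert gaps[-1] < gaps[0]                # the relaxation tightens as tau -> 0
```

At high τ the output is a probability-weighted average of the outcomes; as τ → 0 it snaps to the Gumbel-Max value, matching the theorem.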

E EXAMPLES OF GENGS

E.1 TRUNCATING RIGHT-HAND SIDE FOR POISSON DISTRIBUTION

For distributions whose left-side truncation needs to be at zero, such as the Poisson with a small rate parameter, the geometric, and the negative binomial, as mentioned in the main text, we can simply set the constant outcome vector c = (0, 1, …, n-1). Note that the ordering of c is not crucial; for example, we can also set c = (2, 1, 0, 3, 4, …, n-1). Then, the corresponding PMF values must be computed in the same order, as π = (P(Z = 2), P(Z = 1), P(Z = 0), P(Z = 3), P(Z = 4), …, P(Z = n-1)), where Z is the truncated discrete random variable.
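A small numeric check of this order invariance (a sketch, assuming SciPy's Poisson PMF as the underlying distribution):

```python
import numpy as np
from scipy.stats import poisson

n = 15
c = np.arange(n)                       # right-truncated support {0, ..., n-1}
pi = poisson.pmf(c, 2.0)
pi[-1] = 1.0 - pi[:-1].sum()           # lump the right tail into the last entry

# Permuting c is harmless as long as pi is permuted identically:
perm = np.random.default_rng(0).permutation(n)
assert np.isclose(c @ pi, c[perm] @ pi[perm])    # identical mean either way
```

Any joint permutation of (c, π) describes the same truncated distribution, which is why the ordering of c is a free choice.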

E.2 TRUNCATING BOTH SIDES FOR POISSON DISTRIBUTION

For the case when the distribution also requires left-side truncation, for example Poisson(100),[foot_4] we can set the constant outcome vector as c = (50, 51, …, 149, 150). Again, the ordering of c is not crucial; hence, we can also set c = (148, 150, 149, 147, …, 53, 50, 52, 51), for example. Afterward, the PMF values π are computed in the same order as the constant outcome vector c.

E.3 GUMBEL-SOFTMAX IS A SPECIAL CASE OF GENGS.

The Gumbel-Softmax estimator of categorical random variables is a trivial case of GENGS. Assume the number of dimensions n = 3 and a categorical parameter π = (0.5, 0.3, 0.2) in this example. Then, the possible outcomes of Categorical(π) are c_0 = [1, 0, 0]^T, c_1 = [0, 1, 0]^T, and c_2 = [0, 0, 1]^T. Afterward, draw a sample w from GS(π, τ) for some temperature τ > 0; the value will take a relaxed one-hot form, for example, (0.95, 0.04, 0.01). If we construct

c = [c_0, c_1, c_2] = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] = I_3, (19)

then T(w) = ∑_k w_k c_k ≈ w. Hence, the Gumbel-Softmax trick can be written in the form of GENGS with the identity matrix as the linear transformation. Note that we abuse the Hadamard product symbol (⊙) in terms of dimension: the last term of T(w) = ∑_k w_k c_k is actually the multiplication of a scalar w_k ∈ [0, 1] and a vector c_k ∈ R^3 in this case.
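In code (a NumPy sketch), the identity outcome matrix makes T the identity map, so the GenGS sample is the Gumbel-Softmax sample itself:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])
tau = 0.5
g = -np.log(-np.log(rng.uniform(size=3)))        # Gumbel perturbation
a = (np.log(pi) + g) / tau
w = np.exp(a - a.max()); w /= w.sum()            # Gumbel-Softmax sample

C = np.eye(3)                                    # [c_0 c_1 c_2] = I_3
assert np.allclose(C @ w, w)                     # T(w) = w
```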

E.4 GENGS CAN BE APPLIED TO MULTINOMIAL DISTRIBUTION.

Going one step further from the example above, we can also reparameterize the multinomial distribution with the GENGS trick. For example, assume the number of trials m = 3 and the probability vector p = [p_1, p_2, p_3] = [.7, .2, .1]. Then, the possible outcomes are c_0 = [3, 0, 0]^T, c_1 = [0, 3, 0]^T, c_2 = [0, 0, 3]^T, c_3 = [2, 1, 0]^T, c_4 = [2, 0, 1]^T, c_5 = [1, 2, 0]^T, c_6 = [1, 0, 2]^T, c_7 = [0, 2, 1]^T, c_8 = [0, 1, 2]^T, and c_9 = [1, 1, 1]^T, where the probability of outcome [n_1, n_2, n_3] is ((n_1+n_2+n_3)! / (n_1! n_2! n_3!)) p_1^{n_1} p_2^{n_2} p_3^{n_3}. Construct the linear transformation constant c as follows:

c = [c_0 c_1 c_2 c_3 c_4 c_5 c_6 c_7 c_8 c_9] (21)
  = [ 3 0 0 2 2 1 1 0 0 1
      0 3 0 1 0 2 0 2 1 1
      0 0 3 0 1 0 2 1 2 1 ], (22)

where the categorical parameter is π = ( ((n_1+n_2+n_3)! / (n_1! n_2! n_3!)) p_1^{n_1} p_2^{n_2} p_3^{n_3} )_{[n_1, n_2, n_3]}. If we draw a Gumbel-Softmax sample w from GS(π, τ) and compute T(w) = ∑_k w_k c_k, the result will be a relaxed form of the selected sample c_m. This example shows how the linear transformation constant c generalizes to matrix form. However, recalling that the equation n_0 + ⋯ + n_{k-1} = n has C(n+k-1, k-1) non-negative integer solutions (n_0, …, n_{k-1}), where C(n+k-1, k-1) ≤ O(n^{min{k, n-1}}), the relaxed categorical selection through the Gumbel-Softmax can become problematic due to the high complexity when n or k is large. Hence, in this situation, reducing the possible outcomes by disregarding unlikely samples under user guidance can be a remedy, but this treatment is not a fundamental solution, and handling such cases remains an open research question.
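The enumeration above can be automated for any small multinomial; the helper below is a hypothetical sketch that builds the outcome matrix c and the categorical parameter π:

```python
import itertools, math
import numpy as np

def multinomial_gengs_constants(m, p):
    """Enumerate all outcomes of Multinomial(m, p) and their probabilities,
    giving the outcome matrix c (outcomes as columns) and the categorical
    parameter pi for GenGS. Feasible only for small m and len(p)."""
    k = len(p)
    outcomes, probs = [], []
    for ns in itertools.product(range(m + 1), repeat=k):
        if sum(ns) != m:
            continue
        coef = math.factorial(m) / np.prod([math.factorial(n) for n in ns])
        outcomes.append(ns)
        probs.append(coef * np.prod([q ** n for q, n in zip(p, ns)]))
    return np.array(outcomes).T, np.array(probs)

c, pi = multinomial_gengs_constants(3, [0.7, 0.2, 0.1])
assert c.shape == (3, 10) and np.isclose(pi.sum(), 1.0)
```

The number of columns grows combinatorially, which makes the complexity concern above concrete: C(n+k-1, k-1) columns are needed in general.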

F INFERENCE STEP, ALGORITHM & COMPLEXITY OF GENGS

F.1 VISUALIZATION OF GENGS REPARAMETERIZATION STEPS

Figure 7 and Figure 8 represent the reparameterization steps of the explicit inference and the implicit inference of GENGS, respectively. For both figures, the shaded nodes indicate the auxiliary Gumbel samples, composed of uniform samples, which enable the reparameterized variable to be deterministic with respect to the parameter of the target distribution.

F.2 ALGORITHM & ADDITIONAL COMPUTATIONAL COMPLEXITY OF GENGS

Algorithm 1 and Algorithm 2 provide the explicit and implicit inference steps of GENGS, respectively. Compared to the original Gumbel-Softmax reparameterizer of the categorical distribution, the explicit inference has additional computation for the PMF value calculation in Lines 3-4 of Algorithm 1 and for the linear transformation in Line 8 of Algorithm 1. Note that this additional computational complexity may not be O(n), even if we assume that truncating the distribution yields n possible outcomes, since the cost of computing PMF values from the inferred distribution parameter λ depends on the PMF of the distribution. Meanwhile, the implicit inference only has the extra linear transformation computation in Line 5 of Algorithm 2, since it infers the logits of the PMF values directly in Line 3 of Algorithm 2. Hence, in this case, the additional computational complexity is only the O(n) cost of the linear transformation.
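The two inference paths differ only in where the logits come from; a compact sketch (hypothetical names, SciPy's Poisson PMF standing in for the explicit path):

```python
import numpy as np
from scipy.stats import poisson

def gengs_forward(logits, c, tau, rng):
    """Shared relaxed sampling step: Gumbel-Softmax over the logits,
    followed by the linear transformation T(w) = w @ c."""
    g = -np.log(-np.log(rng.uniform(size=len(c))))
    a = (logits + g) / tau
    w = np.exp(a - a.max()); w /= w.sum()
    return w @ c

rng = np.random.default_rng(0)
c = np.arange(20, dtype=float)        # truncation level n = 20

# Explicit inference: a network outputs lambda; PMF values are then computed.
lam = 4.2                             # pretend this came from an encoder
pi = poisson.pmf(c, lam); pi[-1] = 1.0 - pi[:-1].sum()
z_explicit = gengs_forward(np.log(pi), c, tau=0.5, rng=rng)

# Implicit inference: a network outputs the logits of pi directly.
logits = rng.normal(size=20)          # pretend these came from an encoder
z_implicit = gengs_forward(logits, c, tau=0.5, rng=rng)
```

Both paths share the Gumbel-Softmax and the transformation; only the explicit path pays the extra, distribution-dependent PMF evaluation cost.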

G DISTRIBUTIONS IN TENSORFLOW PROBABILITY

We provide a table of the discrete distributions to which GENGS can be applied in Table 3. We list the discrete distributions available in TensorFlow Probability 0.8.0,[foot_6] in lexicographical order. Note that Bernoulli and Categorical are different from OneHotCategorical and RelaxedOneHotCategorical. The original Gumbel-Softmax is available only for (1) OneHotCategorical, which is relaxed as RelaxedOneHotCategorical; and (2) Bernoulli, which is relaxed as RelaxedBernoulli; it cannot be applied to Categorical. By generalizing the Gumbel-Softmax, however, GENGS can be applied to the other distributions listed, including Categorical. Also, Empirical, a user-defined distribution in TensorFlow Probability which is not in the list, can utilize GENGS. Note that for Binomial, if n is large or p has an extreme value close to 0 or 1, one can truncate the left, the right, or both sides. Also, for Multinomial, as an extension of the Binomial case, one can disregard unlikely samples.
Table 3. Distributions with their PMF P(X = k), distribution parameter, support, and truncation side:

Bernoulli(p): P(X = k) = p^k (1-p)^{1-k}; parameter p; truncation: none; p ∈ [0, 1], k ∈ {0, 1}.
Binomial(n, p): P(X = k) = C(n, k) p^k (1-p)^{n-k}; parameter p; truncation: none*; n ∈ N_{≥0}, p ∈ [0, 1], k ∈ {0, …, n}.
Categorical(p = [p_1, …, p_K]): P(X = k) = ∏_{j=1}^K p_j^{k_j}; parameter p; truncation: none; p_i ∈ [0, 1], ∑ p_i = 1, k_i ∈ {0, 1}, ∑ k_i = 1.
DirichletMultinomial(n, α = [α_1, …, α_K]): P(X = k) = (n! Γ(∑ α_j) / Γ(n + ∑ α_j)) ∏_{j=1}^K Γ(k_j + α_j) / (k_j! Γ(α_j)); parameter α; truncation: none; n ∈ N, α_i > 0, k_i ∈ {0, …, n}, ∑ k_i = n.
Geometric(p): P(X = k) = (1-p)^k p; parameter p; truncation: right; p ∈ (0, 1), k ∈ N_{≥0}.
Multinomial(n, p = [p_1, …, p_K]): P(X = k) = (n choose k_1 ⋯ k_K) ∏_{j=1}^K p_j^{k_j}; parameter p; truncation: none*; n ∈ N, p_i ∈ [0, 1], ∑ p_i = 1, k_i ∈ {0, …, n}, ∑ k_i = n.
NegativeBinomial(n, p): P(X = k) = C(n+k-1, k) (1-p)^n p^k; parameter p; truncation: right; n ∈ N, k ∈ N_{≥0}.
OneHotCategorical(p = [p_1, …, p_K]): P(X = k) = ∏_{j=1}^K p_j^{k_j}; parameter p; truncation: none; p_i ∈ [0, 1], ∑ p_i = 1, k_i ∈ {0, 1}, ∑ k_i = 1.
Poisson(λ): P(X = k) = λ^k e^{-λ} / k!; parameter λ; truncation: right or both; λ ∈ R_{>0}, k ∈ N_{≥0}.
Zipf(n, s): P(X = k) = k^{-s} / ∑_{j=1}^n j^{-s}; parameter s; truncation: none; s ≥ 0, n ∈ N, k ∈ {1, …, n}.

Proof. Note that the Poisson distribution with rate parameter λ has mean λ and variance λ. Hence, the Poisson distribution has first moment μ_1 = λ and second moment μ_2 = λ² + λ. Then,

∂L/∂λ = ∂/∂λ E_{z∼p(z|λ)}[(z - t)²] (26)
= ∂/∂λ ∑_{z≥0} p(z|λ)(z - t)² (27)
= ∑_{z≥0} ∂/∂λ [p(z|λ)] (z - t)² (28)
= ∑_{z≥0} ∂/∂λ [λ^z e^{-λ} / z!] (z - t)² (29)
= ∑_{z≥0} (z - t)² ∂/∂λ [λ^z e^{-λ} / z!] (30)
= t² ∂/∂λ e^{-λ} + ∑_{z≥1} (z² - 2tz + t²) (z λ^{z-1} e^{-λ} - λ^z e^{-λ}) / z! (31)
= -t² e^{-λ} + ∑_{z≥0} ((z+1)² - 2t(z+1) + t²) λ^z e^{-λ} / z! - ∑_{z≥1} (z² - 2tz + t²) λ^z e^{-λ} / z! (32)
= -t² e^{-λ} + ∑_{z≥0} (z² - 2(t-1)z + (t-1)²) p(z|λ) - ∑_{z≥1} (z² - 2tz + t²) p(z|λ) (33)
= ∑_{z≥0} (2z - 2t + 1) p(z|λ) = 2μ_1 - 2t + 1 = 2(λ - t) + 1,

where the last line follows because the z = 0 term of the second sum, t² e^{-λ}, cancels -t² e^{-λ}.

super-topics and sub-topics, including the vocabularies. Here, we utilize the idea of Miao et al. (2016; 2017) and Srivastava & Sutton (2017), the neural variational architecture, to extract the latent document representation as (relaxed) counts of the super-most topic, and consequently capture sub-topic counts. To ensure positive linked weights between super-topics and sub-topics, we utilize the absolute value function.
The generative process of NVPDEF is

z_1 ∼ Poisson(λ_0), z_2 ∼ Poisson(λ_1), …, z_K ∼ Poisson(λ_{K-1}), (36)
x ∼ MultinomialLogisticRegression(λ_K),

where we adopt the multinomial logistic regression from NVDM (Miao et al., 2016). The inference process of NVPDEF is

λ̂_0 = MLP(x), λ̂_1 = W_1 λ̂_0, …, λ̂_K = W_K λ̂_{K-1},

so that the approximate Poisson posterior q(z_k | z_{k-1}) has λ̂_{k-1} as its distribution parameter. Here, each z_k ∼ Poisson(λ_{k-1}) represents the count distribution of topics from the super-topic. Each component w_{k,i,j} of W_k is positive and captures the positive weight of the relationship between super-topic i of the k-th layer and sub-topic j of the (k+1)-th layer. The objective function (ELBO) of NVPDEF is

L = E_{q(z_K, …, z_1)}[log p(x | z_K, …, z_1)] - ∑_{k=1}^{K} KL(q(z_k | z_{k-1}) || p(z_k)),

where z_0 = x to simplify the equation. The 20Newsgroups[foot_8] and RCV1-V2[foot_9] datasets are used in this experiment. 20Newsgroups has a {train : test} = {11,258 : 7,487} split with a vocabulary size of 2,000, and RCV1-V2 has a {train : test} = {794,414 : 10,000} split with a vocabulary size of 10,000. For data pre-processing, stopwords are removed and the most frequent vocabulary terms are chosen. For 20Newsgroups in particular, we use the vocabulary from Srivastava & Sutton (2017). For the single-stacked version of NVPDEF, we do not anneal the temperature; instead, we set the temperature τ = .5. For the multi-stacked version of NVPDEF, i.e., MULTI-STACKED NVPDEF, we utilize 10 samples on the latent layers for the stable optimization of the consecutive sampling. Also, to provide better chances of learning, we utilize linear temperature annealing from τ = 3. to τ = .5 during the training period. For all neural network models, we utilize two 500-dimensional hidden layers for the encoders. We use 50 and 20-50 stacked layers for 20Newsgroups, and 200 and 50-200 stacked layers for the RCV1-V2 dataset.
We set λ_1 = .75 with truncation level 15 for the single-stacked case, and λ_1 = 1.1, λ_2 = 1. with truncation level 15 for the multi-stacked case. We train NVPDEF for 100 epochs with batch size 256 and learning rate 1e-3. We also iteratively update the encoder parameters and each linked weight parameter of the latent variables. As a performance measure, we utilize the perplexity perp = exp(-(1/D) ∑_d (1/N_d) log p(x_d)),
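This perplexity measure is a simple per-word quantity; a small sketch with hypothetical inputs (log p(x_d) per document and document lengths N_d):

```python
import numpy as np

def perplexity(log_probs, doc_lengths):
    """perp = exp(-(1/D) * sum_d (1/N_d) * log p(x_d)): the per-word
    perplexity averaged over D documents, N_d words in document d."""
    log_probs = np.asarray(log_probs, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    return np.exp(-np.mean(log_probs / doc_lengths))

# Hypothetical held-out log-likelihoods for three documents:
assert np.isclose(perplexity([-100.0, -200.0, -50.0], [20, 40, 10]),
                  np.exp(5.0))
```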

L.2 EXPERIMENTAL RESULT

We provide the super-topic, sub-topic, and word relationships obtained from the two-layer-stacked NVPDEF on the 20Newsgroups dataset by listing the top-weighted sub-topics and words in Table 5.

M OPEN RESEARCH QUESTION: NON-PARAMETRIC REPARAMETERIZATION TRICK

In GENGS, to reparameterize discrete distributions, we convert the sampling process into a categorical selection process by finitizing the support of the distribution with truncation. Here, truncating the distribution converts a categorical selection over a countably infinite number of categories into a categorical selection over a finite number of categories, i.e., it turns a non-parametric problem into a parametric one. The categorical selection over a finite number of categories, which disregards samples of extremely small probability, might cause a problem if we need to utilize the full range of possible outcomes. For example, in the multinomial case in Appendix E.4, as n and k grow, the proposed GENGS must ignore numerous probable samples due to the high complexity in the number of possible outcomes. A* sampling (Maddison et al., 2014) is a non-parametric version of the Gumbel-Max trick, which we also utilize in GENGS; A* sampling searches for the maximum Gumbel sample among countably infinite Gumbel samples via the A* algorithm. Utilizing the A* sampling concept to reparameterize distributions with countably infinite support could lead to better reparameterization, in the sense of reparameterizing the exact distribution instead of an approximate one. However, we utilize the truncated distribution in the proposed GENGS to convert the countably infinite categorical selection into a finite one, for the following reasons. First, when adapting a non-parametric methodology to a neural network, which has a fixed number of parameters, one usually imposes a limit at a certain point by utilizing a truncation level. An example of such a case is the Stick-Breaking VAE (Nalisnick & Smyth, 2017), which utilizes the Dirichlet process in the latent variable of a VAE, where the authors finitized the number of sticks by human guidance.
Second, while there is no previous work on a reparameterization trick for fully non-parametric categorical selection, if we finitize the number of categories with the truncated distributions suggested in the paper, we can utilize the Gumbel-Softmax reparameterizer, which has already been verified in the deep generative modeling community and is widely used through implementations in deep learning frameworks such as TensorFlow, i.e., RelaxedOneHotCategorical. Finally, if we have to choose between the parametric model and the non-parametric model, the choice depends on the situation at hand. For example, we can compare the Gaussian mixture model (GMM) and the Dirichlet process Gaussian mixture model (DPGMM). If we have a clue about the number of clusters, we can directly apply GMM instead of DPGMM. However, if we know nothing about the data, utilizing DPGMM can be a good choice. Also, we cannot directly compare GMM and DPGMM along the same line, since the experimental results differ from data to data. In summary, as other non-parametric models do, we turn the non-parametric problem into a parametric one, particularly by utilizing the truncated distribution, and this kind of treatment is a natural way of resolving such a difficulty. However, we believe that investigating a non-parametric reparameterizer, particularly one utilizing A* sampling, which is theoretically solid, is a crucial open research question in the deep generative modeling community.



Footnotes.
1. Note that we are using distributions with explicitly known PMFs for the discrete distribution X.
2. There might also be a case where one of P(Z_{m,n} = m+1) = P(X ≤ m+1) and P(Z_{m,n} = n-1) = P(X ≥ n-1) requires a summation of high complexity, even though it is a finite summation. For example, if we truncate Poisson(1000) with truncation range R = [900, 1100], we have to compute the PMF values for 0 ≤ x < 900 to sum up the negligible left-hand probability, which causes high complexity.
3. For example, the case when the support is infinite on both the left (-∞) and right (∞) sides.
4. Note that Poisson(100) has improbable samples at x < 50 and x > 150.
6. https://www.tensorflow.org/probability/api_docs/python/tfp/distributions
7. https://github.com/yburda/iwae/tree/master/datasets/OMNIGLOT
- For GENGS ST, the temperature annealing is unnecessary, as with the ST Gumbel-Softmax estimator (Jang et al., 2017).
8. http://qwone.com/~jason/20Newsgroups/
9. https://trec.nist.gov/data/reuters/reuters.html



Figure 1: Reparameterization in stochastic computational graphs.

Figure 2: (a) Approximation of GENGS in terms of the choices of the truncation level n and the temperature τ in Poisson(7). As the sub-figures go from left to right, the truncation level grows; hence, the popped-out sticks, which represent the remaining probability of the right side, disappear once the truncation level is large enough. As the sub-figures go from top to bottom, the temperature decreases, and the PMFs of the truncated distributions and the original distributions become similar. Appendix D provides the fine-grained PMF of GENGS. (b) On the x-axis, TD(λ, n) → D(λ) as the truncation level n ↑ ∞, according to Proposition 3. Then, TD(λ, n) can be reparameterized by the Gumbel-Max trick with a linear transformation T as in Proposition 4. On the y-axis, as the temperature τ ↓ 0, GENGS(π, τ) → TD(λ, n), where π is the computed PMF value of TD(λ, n), according to Theorem 5.

Figure 3: Synthetic example performance curves in log scale: (Top Row) losses, variances, and biases of gradients for Poisson; (Middle Row) losses for Binomial, Multinomial, and NegativeBinomial;(Bottom Row) variances of gradients for Binomial, Multinomial, and NegativeBinomial. We utilize the cumulative average for smoothing the curves, and the curves with confidence intervals and the curves without smoothing are in Appendix J.

Figure 4: Reconstructed images by VAEs with various gradient estimators. GENGS shows the clearest images among other gradient estimators with better reconstruction.

Figure 5: (Left) A graphical notation of NVPDEF with the generative process (θ) and inference network (φ). The multi-stacked latent layers have λ_i as the prior distribution parameter. (Right) A neural network diagram of NVPDEF: diamond nodes indicate the auxiliary random variable for the reparameterization trick.

Experimental Setting. This experiment shows another application of GENGS, in topic modeling. The Poisson distribution is one of the most important distributions for counting the number of outcomes among all discrete distributions. The authors of Deep Exponential Families (DEFs) (Ranganath et al., 2015) utilized the exponential family, including the Poisson distribution, in the stacked latent layers. Therefore, we focus on the Poisson DEF, which assumes Poisson latent layers to capture the counts of latent super-topics and sub-topics; and we convert the Poisson DEF into a neural variational form, which resembles NVDM (Miao et al., 2016). Figure 5 shows the neural network and its corresponding probabilistic modeling structure. We utilize GENGS on the Poisson DEF to sample the values of the latent variables, yielding the neural variational Poisson DEF (NVPDEF). See Appendix L for a further description of DEFs, NVPDEF, and the detailed experimental settings.

Note that Figure 2(b) in the main paper is drawn by rounding up continuous values into integers. Since the PMF for discrete distributions (Poisson(7) in Figure 2(b)) and the PDF for continuous distributions (GENGS for Poisson(7) in Figure 2(b)) cannot be directly compared within the same figure due to their scale, we provide Figure 6 with a fine-grained PMF obtained by rounding up at small decimals.

Figure 6: Fine-grained PMF of GenGS for Poisson(7).

Figure 7: Visualization of the GENGS reparameterization, explicit inference version. Note that we compute the PMF values from the inferred parameter λ.

Figure 8: Visualization of the GENGS reparameterization, implicit inference version. Note that we do not infer the distribution parameter λ; instead, the PMF values are computed directly.

KL DIVERGENCE BETWEEN TWO TRUNCATED DISTRIBUTIONS

Theorem. Assume two truncated distributions X ∼ TD(λ, n) and Y ∼ TD(λ̃, n), where π_k = P(X = k) and π̃_k = P(Y = k). Then, the KL divergence between X and Y can be represented as the KL divergence between the corresponding categorical distributions: KL(Y || X) = KL(Categorical(π̃) || Categorical(π)).
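The theorem makes the KL term of an ELBO a finite sum; a sketch (hypothetical helper names, SciPy Poisson PMFs, right-truncated with the lumped tail):

```python
import numpy as np
from scipy.stats import poisson

def truncated_poisson_pi(rate, n):
    """Categorical parameter of TD(rate, n): PMF on {0, ..., n-1}
    with the right tail lumped into the last outcome."""
    pi = poisson.pmf(np.arange(n), rate)
    pi[-1] = 1.0 - pi[:-1].sum()
    return pi

def categorical_kl(p, q):
    """KL(Categorical(p) || Categorical(q)) as a finite sum."""
    return np.sum(p * (np.log(p) - np.log(q)))

pi = truncated_poisson_pi(4.0, 20)      # X ~ TD(4, 20)
pi_t = truncated_poisson_pi(5.0, 20)    # Y ~ TD(5, 20)
kl = categorical_kl(pi_t, pi)           # KL(Y || X), via the theorem
assert kl > 0.0
```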

Figure 9: Synthetic example performance curves in log scale: (Top Row) Losses, variances, and biases of gradients for Poisson; (Middle Row) Losses for Binomial, Multinomial, and NegativeBinomial;(Bottom Row) Variances of gradients for Binomial, Multinomial, and NegativeBinomial. We utilize the cumulative average for smoothing the curves, and we also provide confidence intervals together.

where N d is the number of words in document d, and D is the total number of documents.

Test negative ELBO on the MNIST and OMNIGLOT datasets. Lower is better for the negative ELBO. We provide a full table, including baselines with insignificant results and variations of GENGS, in Appendix K. The symbol "-" indicates no convergence.

Test perplexity on the 20Newsgroups and RCV1-V2 datasets.

A list of distributions which can be reparameterized by GENGS with their distribution parameters in TensorFlow Probability 0.8.0.

Super-topic, sub-topic, and word relationships obtained from the two-layer-stacked NVPDEF on the 20Newsgroups dataset.


J SYNTHETIC EXAMPLE

J.1 EXPERIMENTAL SETTING

In this experiment, we first sample t_1, …, t_k i.i.d. from a discrete distribution D(θ) for a fixed θ > 0, and optimize the loss function E_{z∼p(z|λ)}[∑_{i=1}^k (z_i - t_i)²] with respect to λ, where p(z|λ) is D(λ). We use Poisson(20), Binomial(20, .3), Multinomial(3, [.7, .2, .1]), and NegativeBinomial(3, .4) in this experiment, and the distribution parameter we want to infer in each case is λ in Poisson(λ), Binomial(20, λ), Multinomial(3, λ), and NegativeBinomial(3, λ). For GENGS, we use the truncation levels (7, 36) and 12 for the Poisson and the negative binomial, respectively. Note that the binomial case does not require truncation of the distribution. We use k = 5 sampled targets for the Poisson and binomial cases, and k = 1 for the negative binomial case. In this experiment, we separately utilize the temperature τ = 1. for the high-temperature case and τ = .25 for the low-temperature case. To compute the variance of the gradients, we sample 100 gradients for the Poisson and binomial, and 500 gradients for the negative binomial. For fair comparisons, we use m = 1 fixed category for the gradients of the RBs. While it is possible to use more than one fixed gradient in the synthetic example, if there is more than one latent dimension K, it requires computing m^K gradient combinations, which has high complexity. We also adapt the Rao-Blackwellization idea in GENGS, namely GENGS-RB, which utilizes m = 1 fixed gradient and applies GENGS to the remaining categories. We exclude UNORD since UNORD fails to converge to the optimal parameter because of its approximation accuracy problem with a single gradient sample.

Closed-form True Gradient Derivation for the Poisson Synthetic Example. Throughout the synthetic example, we compare the quality of the gradient estimators by the convergence of losses, the variances of estimated gradients, and the biases between the true gradient and the estimated gradient.
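Since the first two Poisson moments give E[(z - t)²] = λ + (λ - t)², the closed-form true gradient works out to ∂L/∂λ = 2(λ - t) + 1; a quick numeric sanity check of this (a sketch, SciPy Poisson PMF assumed):

```python
import numpy as np
from scipy.stats import poisson

def loss(lam, t, zmax=200):
    """E_{z~Poisson(lam)}[(z - t)^2], computed by (numerically exact)
    finite summation; the tail beyond zmax is negligible for small lam."""
    z = np.arange(zmax)
    return np.sum(poisson.pmf(z, lam) * (z - t) ** 2)

lam, t, eps = 5.0, 3.0, 1e-5
numeric = (loss(lam + eps, t) - loss(lam - eps, t)) / (2 * eps)
closed = 2 * (lam - t) + 1                 # closed-form gradient
assert abs(numeric - closed) < 1e-3
```

This closed form is what makes the bias of each estimator measurable in the Poisson case.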
To compute the bias between the true gradient and the estimated gradient, we need the closed-form solution of the true gradient. We find that the Poisson case admits a closed-form true gradient, and the derivation is as follows.

Proposition. If p(z|λ) is a Poisson distribution with rate parameter λ, the true gradient of the loss E_{z∼p(z|λ)}[(z - t)²] with respect to λ has a closed form.

We compare the log-loss and the log-variance of the estimated gradients from the various estimators in this experiment. We also compare the log-bias in the Poisson case. We additionally provide Figure 9 to report the confidence intervals, and Figure 10 to show the convergence of the losses, which may not be visible in Figure 3 of the main paper.

K.1 EXPERIMENTAL SETTING

We utilize the (truncated) Poisson, the (truncated) geometric, and the (truncated) negative binomial distributions in this experiment. Both MNIST and OMNIGLOT[foot_7] are hand-written gray-scale datasets of size 28 × 28. We split the MNIST dataset into {train : validation : test} = {45,000 : 5,000 : 10,000}, and the OMNIGLOT dataset into {22,095 : 2,250 : 8,070}. We construct two fully-connected hidden layers of dimension 500 for the encoder and the decoder, and we set the latent dimension K = 50 for both the MNIST and OMNIGLOT datasets.

K.2 EXPERIMENTAL RESULT

Table 4 shows the negative ELBO results of the VAE experiments with a full range of gradient estimators. The variants of GENGS show the lowest negative ELBO in general. We empirically find that the extreme probability imbalance, due to the explicit PMF restriction, induces unstable learning, which leads to the performance degradation of RELAX and REBAR. We find that utilizing Straight-Through for the discretization degrades the performance. Meanwhile, the implicit inference methodology leads to better optima, enabled by loosening the PMF condition. The empirical reason why the implicit version outperforms the explicit version is that the inferred PMF shape is thinner in the implicit case. Hence, the implicit distribution has lower variance than the explicit one and consequently samples consistent values that lead to better-trained neural network parameters, although it loses the original PMF shape. Deep Exponential Families (DEFs) (Ranganath et al., 2015) are probabilistic graphical models which utilize stacks of exponential family distributions. If we assume the Poisson distribution, which is included in the exponential family, each k-th Poisson latent variable counts the number of sub-topic occurrences. The relationship between the super-topic and the sub-topic is modeled with the linked weights, which have positive values. Hence, with the Poisson DEF, we can model hierarchical relations between

