PARTIAL REJECTION CONTROL FOR ROBUST VARIATIONAL INFERENCE IN SEQUENTIAL LATENT VARIABLE MODELS

Abstract

Effective variational inference crucially depends on a flexible variational family of distributions. Recent work has explored sequential Monte-Carlo (SMC) methods to construct variational distributions, which can, in principle, approximate the target posterior arbitrarily well; this is especially appealing for models with inherent sequential structure. However, SMC, which represents the posterior using a weighted set of particles, often suffers from particle weight degeneracy, leading to a large variance of the resulting estimators. To address this issue, we present a novel approach that leverages the idea of partial rejection control (PRC) for developing a robust variational inference (VI) framework. In addition to developing a superior VI bound, we propose a novel marginal likelihood estimator constructed via a dice-enterprise, a generalization of the Bernoulli factory, to construct unbiased estimators for SMC-PRC. The resulting variational lower bound can be optimized efficiently with respect to the variational parameters and generalizes several existing approaches in the VI literature into a single framework. We show theoretical properties of the lower bound and report experiments on various sequential models, such as the Gaussian state-space model and variational RNN, on which our approach outperforms existing methods.

1. INTRODUCTION

Exact inference in latent variable models is usually intractable. Markov Chain Monte-Carlo (MCMC) (Andrieu et al., 2003) and variational inference (VI) methods (Blei et al., 2017) are commonly employed in such models to make inference tractable. While MCMC has been the traditional method of choice, often with provable guarantees, optimization-based VI methods have also enjoyed considerable recent interest due to their excellent scalability on large-scale datasets. VI is based on maximizing a lower bound constructed through a marginal likelihood estimator. For latent variable models with sequential structure, sequential Monte-Carlo (SMC) (Doucet & Johansen, 2009) returns a much lower-variance estimator of the log marginal likelihood than importance sampling (Bérard et al., 2014; Cérou et al., 2011). In this work, we focus our attention on designing a low-variance, unbiased, and computationally efficient estimator of the marginal likelihood.

The performance of SMC-based methods depends strongly on the choice of the proposal distribution. Inadequate proposal distributions propose values in low-probability areas under the target, leading to particle depletion (Doucet & Johansen, 2009). An effective solution is rejection control (Liu et al., 1998; Peters et al., 2012), which performs an approximate rejection sampling step within SMC to reject samples with low importance weights. In this work, we leverage the idea of partial rejection control (PRC) within the framework of SMC-based VI for sequential latent variable models. To this end, we construct a novel lower bound, VSMC-PRC, and propose an efficient optimization strategy for selecting the variational parameters. Compared to other recent SMC-based VI approaches (Naesseth et al., 2017; Maddison et al., 2017; Le et al., 2017), our approach has an inbuilt accept-reject mechanism within SMC to prevent particle depletion.
The use of accept-reject within SMC makes the particle weights intractable; therefore, we use a generalization of the Bernoulli factory (Asmussen et al., 1992) to construct unbiased estimators of the marginal likelihood for SMC-PRC. Although the idea of combining VI with an inbuilt accept-reject mechanism is not new (Salimans et al., 2015; Ruiz & Titsias, 2019; Grover et al., 2018; Gummadi, 2014), a key distinction of our approach is that it incorporates an accept-reject mechanism along with a resampling framework. In contrast to standard sampling algorithms that may reject the entire stream of particles, we use a partial accept-reject on the most recent update, increasing sampling efficiency. Further, the variational framework of SMC-PRC is interesting in itself, as it combines accept-reject with particle filter methods. Our proposed bound VSMC-PRC therefore generalizes several existing approaches, for example Variational Rejection Sampling (VRS) (Grover et al., 2018), FIVO (Maddison et al., 2017), IWAE (Burda et al., 2015), and standard variational Bayes (Blei et al., 2017). Another key distinction is that, while existing approaches using Bernoulli factories are limited to niche one-dimensional toy examples, our proposed approach is scalable. To the best of our knowledge, there is no prior work that has used Bernoulli factories in a setting as general as variational recurrent neural networks (VRNN); we therefore believe this aspect to be a significant contribution as well. The rest of the paper is organized as follows: In Section 2, we briefly review SMC, partial rejection control, and the dice-enterprise. In Section 3, we introduce our VSMC-PRC bound, provide new theoretical insights into the Monte-Carlo estimator, and design efficient ways to optimize it. Finally, we discuss related work and present experiments on the Gaussian state-space model (SSM) and VRNN.

2. BACKGROUND

We denote a sequence of T real-valued observations as x_{1:T} = (x_1, x_2, ..., x_T), and assume that there is an associated sequence of latent variables z_{1:T} = (z_1, z_2, ..., z_T). We are interested in inferring the posterior distribution of the latent variables, i.e., p(z_{1:T} | x_{1:T}); this task is, in general, intractable. For the rest of the paper, we use common notation from the SMC and VI literature: z_t^i denotes the i-th particle at time t; A_{t-1}^i denotes the ancestor variable of the i-th particle at time t; θ and φ denote the model and variational parameters, respectively.

2.1. SEQUENTIAL MONTE CARLO WITH PARTIAL REJECTION CONTROL

An SMC sampler approximates a sequence of densities {p_θ(z_{1:t} | x_{1:t})}_{t=1}^T through a set of N weighted samples generated from a proposal distribution. Let the proposal density be

q_φ(z_{1:T} | x_{1:T}) = ∏_{t=1}^T q_φ(z_t | x_{1:t}, z_{1:t-1}).   (1)

Consider time t-1, at which we have uniformly weighted samples {N^{-1}, z_{1:t-1}^i, A_{t-1}^i}_{i=1}^N estimating p_θ(z_{1:t-1} | x_{1:t-1}). We want to estimate p_θ(z_{1:t} | x_{1:t}) such that particles with a low importance weight are automatically rejected. PRC achieves this by using an approximate rejection sampling step (Liu et al., 1998; Peters et al., 2012). The overall procedure is as follows:

1. Generate z_t^i ∼ q_φ(z_t | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) for i = 1, 2, ..., N.

2. Accept z_t^i with probability

a_{θ,φ}(z_t^i | z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) = [1 + M(i, t-1) q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) / p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i})]^{-1},   (2)

where M(i, t-1) is a hyperparameter controlling the acceptance rate (see Proposition 3 and Section 3.3 for more details). Note that PRC applies accept-reject only on z_t^i, not on the entire trajectory.

3. If z_t^i is rejected, go to step 1.

4. The new incremental importance weight of the accepted sample is

α_t(z_{1:t}^i) = c_t^i Z(z_{1:t-1}^{A_{t-1}^i}, x_{1:t}),   (3)

where c_t^i is

c_t^i = p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) / [q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | z_{1:t-1}^{A_{t-1}^i}, x_{1:t})],   (4)

and the intractable normalization constant Z(.) is (for simplicity of notation, we suppress the dependence of Z(.) on M(i, t-1))

Z(z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) = ∫ a_{θ,φ}(z_t | z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) q_φ(z_t | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) dz_t.   (5)

5. Compute the Monte-Carlo estimator of the unnormalized weight

w_t^i = p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) (1/K) Σ_{k=1}^K a_{θ,φ}(δ_t^{i,k} | z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) / [q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | z_{1:t-1}^{A_{t-1}^i}, x_{1:t})],   (6)

where δ_t^{i,k} ∼ q_φ(z_t | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) for k = 1, 2, ..., K. Note that w_t^i is essential for constructing an unbiased estimator of p_θ(x_{1:T}).

6. Generate ancestor variables A_t^i through the dice-enterprise and set the new weights to w_t^i = N^{-1} for i = 1, 2, ..., N:

A_t^i ∼ Categorical(α_t(z_{1:t}^1) / Σ_{j=1}^N α_t(z_{1:t}^j), α_t(z_{1:t}^2) / Σ_{j=1}^N α_t(z_{1:t}^j), ..., α_t(z_{1:t}^N) / Σ_{j=1}^N α_t(z_{1:t}^j)).   (7)
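To make the accept-reject step concrete, here is a minimal NumPy sketch of PRC for a single particle. The callables `sample_q`, `log_q`, and `log_p` are hypothetical stand-ins for draws and log-densities of q_φ(z_t | x_{1:t}, z_{1:t-1}) and p_θ(x_t, z_t | x_{1:t-1}, z_{1:t-1}), and `log_M` plays the role of log M(i, t-1); this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def prc_step(sample_q, log_q, log_p, log_M, K=3):
    """One PRC accept-reject step for a single particle (illustrative sketch).

    Returns the accepted particle z, the constant c (Eq. 4), and the
    Monte-Carlo weight w (step 5), the latter estimated with K extra
    proposal draws.
    """
    while True:
        z = sample_q()
        # Acceptance probability a = (1 + M * q(z) / p(z))^{-1}  (Eq. 2)
        a = 1.0 / (1.0 + np.exp(log_M + log_q(z) - log_p(z)))
        if rng.uniform() < a:
            break
    c = np.exp(log_p(z) - log_q(z)) / a                 # constant c (Eq. 4)
    # K fresh proposal draws give a Monte-Carlo estimate of Z (step 5)
    Z_hat = np.mean([1.0 / (1.0 + np.exp(log_M + log_q(d) - log_p(d)))
                     for d in (sample_q() for _ in range(K))])
    w = c * Z_hat                                       # unnormalized weight
    return z, c, w
```

As a sanity check, when the proposal equals the target and log M = 0, the acceptance probability is 0.5 for every draw, so c = 2 and the weight w is exactly 1.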

2.2. DICE ENTERPRISE

Simulation of ancestor variables in Eq. 7 is non-trivial due to the intractable normalization constants in the incremental importance weight (see Eq. 3). Vanilla Monte-Carlo estimation of α_t(.) yields biased samples of ancestor variables from Eq. 7. To address this issue, we leverage a generalization of the Bernoulli factory, called the dice-enterprise (Morina et al., 2019). Note that multinoulli extensions of the Bernoulli factory (Dughmi et al., 2017) have been used before for resampling within intractable SMC (Schmon et al., 2019); a key distinction of our approach is to design a scalable Bernoulli factory methodology especially useful for VI applications. Suppose we can simulate Bernoulli(p_t^i) outcomes where p_t^i is intractable. The Bernoulli factory problem is to simulate an event with probability f(p_t^i), where f(.) is some desired function. In our case, the intractable coin probability p_t^i is the intractable normalization constant,

p_t^i = Z(z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) = ∫ a_{θ,φ}(z_t | z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) q_φ(z_t | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) dz_t.   (8)

Since p_t^i ∈ [0, 1] and we can easily simulate this coin, we obtain the dice-enterprise algorithm below.

1. Required: constants {c_t^i}_{i=1}^N (see Eq. 4).
2. Sample C ∼ Categorical(c_t^1 / Σ_{j=1}^N c_t^j, c_t^2 / Σ_{j=1}^N c_t^j, ..., c_t^N / Σ_{j=1}^N c_t^j).
3. If C = i, generate U_i ∼ U[0, 1] and z_t ∼ q_φ(z_t | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}).
   - If U_i < a_{θ,φ}(z_t | z_{1:t-1}^{A_{t-1}^i}, x_{1:t}), output i.
   - Else, go to step 2.

The dice-enterprise produces unbiased ancestor variables. Note that we can easily control the efficiency of the proposed dice-enterprise through the hyperparameter M (cf. Eq. 2), in contrast to existing Bernoulli factory algorithms (Schmon et al., 2019). For details on efficiency and correctness, please refer to Sections 3.1 and 3.3. Our proposed VSMC-PRC bound is constructed through a marginal likelihood estimator obtained by combining the SMC sampler with a PRC step and the dice-enterprise.
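The dice-enterprise loop above can be sketched in a few lines of NumPy. Here `coin_flippers[i]()` is a hypothetical callable that returns a_i(z) for a fresh z ∼ q, so accepting against a uniform realizes a coin with heads-probability Z_i; this is an illustrative sketch under that assumption, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def dice_enterprise(c, coin_flippers):
    """Draw one ancestor index i with probability proportional to c_i * Z_i,
    where Z_i = E_q[a_i(z)] is intractable but a Z_i-coin can be flipped by
    drawing z ~ q and comparing a_i(z) against a uniform."""
    c = np.asarray(c, dtype=float)
    probs = c / c.sum()
    while True:
        i = int(rng.choice(len(c), p=probs))   # step 2: propose index prop. to c_i
        if rng.uniform() < coin_flippers[i]():  # step 3: flip the Z_i coin
            return i                            # accepted: unbiased ancestor index
        # rejected: restart from step 2
```

For instance, with two indices, equal constants c, and constant acceptance probabilities Z = (0.2, 0.8), index 1 should be returned with frequency close to 0.8.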
The variance of estimators obtained through the SMC-PRC particle filter is usually low (Peters et al., 2012). Therefore, we expect VSMC-PRC to be a tighter bound compared to the standard SMC based bounds used in recent works (Maddison et al., 2017; Naesseth et al., 2017; Le et al., 2017). Algorithm 1 summarizes the generative process to simulate the VSMC-PRC bound. Please see Figure 1 to visualize the generative process for VSMC-PRC.

Algorithm 1 Estimating the VSMC-PRC lower bound
1: Required: N, K, and M
2: for t ∈ {1, 2, ..., T} do
3:   for i ∈ {1, 2, ..., N} do
4:     (z_t^i, c_t^i, w_t^i) = PRC(q, p, M(i, t-1))
5:     z_{1:t}^i = (z_{1:t-1}^{A_{t-1}^i}, z_t^i)
6:   end for
7:   for i ∈ {1, 2, ..., N} do
8:     A_t^i = DICE-ENT({c_t^j, z_{1:t}^j}_{j=1}^N)
9:   end for
10: end for
11: return log ∏_{t=1}^T (1/N) Σ_{i=1}^N w_t^i

PRC(q, p, M(i, t-1)):
12: while sample not accepted do
13:   Generate z_t^i ∼ q_φ(z_t | x_{1:t}, z_{1:t-1}^{A_{t-1}^i})
14:   Accept z_t^i with probability a_{θ,φ}(z_t^i | z_{1:t-1}^{A_{t-1}^i}, x_{1:t})
15: end while
16: Sample {δ_t^{i,k}}_{k=1}^K ∼ q_φ(z_t | x_{1:t}, z_{1:t-1}^{A_{t-1}^i})
17: Compute c_t^i via Eq. 4 and w_t^i via Eq. 6
18: return (z_t^i, c_t^i, w_t^i)

DICE-ENT({c_t^j, z_{1:t}^j}_{j=1}^N):
19: Sample C ∼ Multinoulli({c_t^j / Σ_{m=1}^N c_t^m}_{j=1}^N) and set i = C
20: Sample U_i ∼ U[0, 1] and z_t ∼ q_φ(z_t | x_{1:t}, z_{1:t-1}^{A_{t-1}^i})
21: if U_i < a_{θ,φ}(z_t | z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) then
22:   return i
23: else
24:   return DICE-ENT({c_t^j, z_{1:t}^j}_{j=1}^N)
25: end if

3. PARTIAL REJECTION CONTROL BASED VI FOR SEQUENTIAL LATENT VARIABLE MODELS

We now show how to leverage PRC to develop a robust VI framework for sequential latent variable models. Our framework is based on the VSMC-PRC bound presented below. The complete sampling distribution of Algorithm 1 is as follows.
Q_VSMC-PRC(z_{1:T}^{1:N}, A_{1:T-1}^{1:N}, δ_{1:T}^{1:N,1:K}) = [∏_{i=1}^N ∏_{k=1}^K q_φ(δ_1^{i,k} | x_1)] [∏_{t=2}^T ∏_{i=1}^N ∏_{k=1}^K q_φ(δ_t^{i,k} | x_{1:t}, z_{1:t-1}^{A_{t-1}^i})]
× [∏_{i=1}^N q_φ(z_1^i | x_1) a_{θ,φ}(z_1^i | x_1) / Z(x_1)] [∏_{t=1}^{T-1} ∏_{i=1}^N Discrete(A_t^i | α_t) q_φ(z_{t+1}^i | x_{1:t+1}, z_{1:t}^{A_t^i}) a_{θ,φ}(z_{t+1}^i | z_{1:t}^{A_t^i}, x_{1:t+1}) / Z(z_{1:t}^{A_t^i}, x_{1:t+1})].   (9)

The normalization constants Z(.) in Eq. 9 are intractable and have to be estimated while calculating the weights. Therefore, we introduce an extra parameter K, denoting the number of Monte-Carlo samples used to estimate Z(.). The Monte-Carlo estimator of the VSMC-PRC bound is

L̂_VSMC-PRC(θ, φ; x_{1:T}, K) = Σ_{t=1}^T log (1/N) Σ_{i=1}^N w_t^i.   (10)

We maximize the VSMC-PRC bound with respect to the model parameters θ and variational parameters φ. This requires estimating the gradient, the details of which are provided in Section 3.2.
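In code, the estimator L̂_VSMC-PRC above is simply a sum over time of the log of the mean unnormalized weight, matching the return value on line 11 of Algorithm 1 (a sketch; `w` is assumed to hold the weights w_t^i produced by Algorithm 1):

```python
import numpy as np

def vsmc_prc_bound(w):
    """VSMC-PRC lower-bound estimate from the T x N array of unnormalized
    weights w_t^i: sum over t of log((1/N) * sum_i w_t^i)."""
    w = np.asarray(w, dtype=float)
    return float(np.log(w.mean(axis=1)).sum())
```

For example, a weight array of all ones yields a bound of exactly 0.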

3.1. THEORETICAL PROPERTIES

We now present properties of the Monte-Carlo estimator L̂_VSMC-PRC. The key variables that affect this bound are N (the number of particles), the hyperparameter M, and the number K of Monte-Carlo samples used to estimate the normalization constant Z(.). As discussed by Bérard et al. (2014) and Naesseth et al. (2017), we expect the VSMC-PRC bound to get tighter as N increases. Hence, we focus our attention on M and K. All proofs can be found in the appendix.

Proposition 1. The dice-enterprise produces unbiased ancestor variables. Further, let Λ_t be the number of iterations required for generating one ancestor variable; then Λ_t ∼ Geom(E[Λ_t]^{-1}), where

E[Λ_t] = Σ_{i=1}^N c_t^i / Σ_{i=1}^N c_t^i Z(z_{1:t-1}^{A_{t-1}^i}, x_{1:t}).

As evident from Proposition 1, the computational efficiency of the dice-enterprise clearly relies on the normalization constant Z(.). Note that the value of Z(.) can be interpreted as the average acceptance rate of PRC, which depends on the hyperparameter M(i, t-1). If the average PRC acceptance rate for all particles is γ, then the expected number of iterations is E[Λ_t] = γ^{-1}. Therefore, the computational efficiency of the dice-enterprise is similar to that of the PRC step and depends crucially on the hyperparameter M.

Proposition 2. For all K, exp(L̂_VSMC-PRC) is unbiased, i.e., E[exp(L̂_VSMC-PRC)] = p_θ(x_{1:T}). Further, E[L̂_VSMC-PRC] is non-decreasing in K.

The use of a Monte-Carlo estimator in place of the true value of Z(.) creates an inefficiency, as quantified by Proposition 2: the bound monotonically increases with K despite the use of the resampling operation. It is important to note that Algorithm 1 produces an unbiased estimator of the marginal likelihood for all values of K.

Proposition 3.
Let the sampling distribution of the i-th particle (generated via PRC) at time t be r_{θ,φ}(z_t | z_{1:t-1}^{A_{t-1}^i}, x_{1:t}). Then

KL(r_{θ,φ}(z_t | z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) || p_θ(z_t | z_{1:t-1}^{A_{t-1}^i}, x_{1:t})) ≤ KL(q_φ(z_t | z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) || p_θ(z_t | z_{1:t-1}^{A_{t-1}^i}, x_{1:t})).

Proposition 3 implies that the use of the accept-reject mechanism within SMC refines the sampling distribution. Instead of accepting all samples, the PRC step ensures that only high-quality samples are accepted, leading to a tighter bound than VSMC in general (though not always). We show in the appendix that as M(i, t-1) → ∞, the PRC step reduces to pure rejection sampling (Robert & Casella, 2013). On the other hand, M(i, t-1) → 0 implies that all samples are accepted from the proposal. Recall that M(i, t-1) is a hyperparameter that can be tuned to control the acceptance rate. For more details on tuning M, see Section 3.3.

3.2. GRADIENT ESTIMATION

For tuning the variational parameters, we use stochastic optimization. Algorithm 1 produces the marginal likelihood estimator by sequentially sampling the particles, ancestor variables, and particles for the normalization constant (z_{1:T}^{1:N}, A_{1:T-1}^{1:N}, δ_{1:T}^{1:N,1:K}). When the variational distribution q_φ(.) is reparameterizable, we can make the sampling of δ_t^{i,k} independent of the model and variational parameters. However, the generated particles z_t^i are not reparameterizable due to the PRC step. Finally, the ancestor variables are discrete and therefore cannot be reparameterized. The complete gradient can be divided into three core components (assuming q_φ(.) is reparameterizable):

∇_{θ,φ} E[L̂_VSMC-PRC] = E_{Q_VSMC-PRC}[∇_{θ,φ} L̂_VSMC-PRC(θ, φ; x_{1:T}, K)] + g_PRC + g_RSAMP   (11)
≈ E_{Q_VSMC-PRC}[∇_{θ,φ} L̂_VSMC-PRC(θ, φ; x_{1:T}, K)].   (12)

Here g_PRC and g_RSAMP denote the score gradients of the PRC and resampling steps, respectively. Due to their high variance, we ignore these terms during optimization. We derive the full gradient and explore the gradient variance issues in the appendix. Please see Figure 2 (left) comparing the convergence of the biased vs. unbiased gradients on a toy Gaussian SSM.

3.3. LEARNING THE M MATRIX

We use M as a hyperparameter for the PRC step which controls the acceptance rate of the sampler. The basic scheme for tuning M is as follows:

- Define a new random variable F(z_{t+1} | z_{1:t}^{A_t^i}, x_{1:t+1}) = log [q_φ(z_{t+1} | x_{1:t+1}, z_{1:t}^{A_t^i}) / p_θ(x_{t+1}, z_{t+1} | x_{1:t}, z_{1:t}^{A_t^i})].
- Draw z_{t+1}^j ∼ q_φ(z_{t+1} | x_{1:t+1}, z_{1:t}^{A_t^i}) for j = 1, 2, ..., J.
- Evaluate the γ ∈ [0, 1] quantile of {F(z_{t+1}^j | z_{1:t}^{A_t^i}, x_{1:t+1})}_{j=1}^J. In general, the acceptance rate would then be around γ for all particles:

log M(i, t) = -Q_{F(z_{t+1} | z_{1:t}^{A_t^i}, x_{1:t+1})}(γ).   (13)

- If the M matrix is very large, use a common value {M(., t)}_{t=1}^T for every time-step. In general, for this configuration, the acceptance rate would be greater than or equal to γ for all particles:

log M(., t) = min{-Q_{F(z_{t+1} | z_{1:t}^{A_t^i}, x_{1:t+1})}(γ)}_{i=1}^N.   (14)

Through the user parameter γ, we can directly control the acceptance rate. Therefore, both the dice-enterprise and PRC take around (at most) γ^{-1} iterations to produce a sample for an M value learned from Eq. 13 (respectively Eq. 14). For implementation details, please refer to the experiments. Note that a similar scheme was also employed in Grover et al. (2018). We update {{M(i, t-1)}_{i=1}^N}_{t=1}^T dynamically once every F epochs to save time.
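The quantile scheme above amounts to a one-liner per (i, t) slot. The sketch below assumes the J log-density evaluations have already been collected (names and signatures are ours, for illustration): the per-particle rule implements Eq. 13 and the common per-time-step rule implements Eq. 14.

```python
import numpy as np

def tune_log_M(log_q_vals, log_p_vals, gamma=0.8):
    """Eq. 13 for one (i, t) slot: F_j = log q(z_j) - log p(x, z_j) over
    J proposal draws; return log M(i, t) = -Quantile_gamma(F)."""
    F = np.asarray(log_q_vals, dtype=float) - np.asarray(log_p_vals, dtype=float)
    return -np.quantile(F, gamma)

def tune_log_M_common(log_q_mat, log_p_mat, gamma=0.8):
    """Eq. 14: a common per-time-step value, the minimum of the per-particle
    values, so every particle's acceptance rate is at least around gamma."""
    return min(tune_log_M(q, p, gamma)
               for q, p in zip(log_q_mat, log_p_mat))
```

For example, with F values (0, 1, 2, 3, 4) the median-based setting γ = 0.5 gives log M = -2.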

4. RELATED WORK AND SPECIAL CASES

There is significant recent interest in developing more expressive variational posteriors for latent variable models. There are two basic schemes for constructing tighter bounds on the log marginal likelihood: sampling-based methods (MCMC, rejection sampling) (Salimans et al., 2015; Ruiz & Titsias, 2019; Hoffman, 2017; Grover et al., 2018), or drawing multiple samples from VI distributions to increase flexibility (IS, SMC) (Burda et al., 2015; Maddison et al., 2017; Lawson et al., 2018; Naesseth et al., 2015). In this work, we present a unified framework combining these two approaches, utilizing the best of both worlds. Although applying sampling-based methods to VI is useful, the density ratio between the true posterior and the improved density is often intractable. Therefore, we cannot take advantage of variance-reducing schemes like resampling, which is crucial for sequential models. We solve this issue through the dice-enterprise, an extension of the Bernoulli factory. Recently, the Bernoulli factory has attracted great interest in the area of Bayesian inference (Gonçalves et al., 2017a;b; Vats et al., 2020). Although the Bernoulli factory is theoretically valuable, its applicability is severely limited by high rejection rates. In this paper, we have presented an approach that combines SMC with the dice-enterprise for an efficient implementation. A closely related work from the SMC literature is Schmon et al. (2019), which also utilizes Bernoulli factories to implement unbiased resampling; however, their method is not scalable and is designed particularly for partially observed diffusions. Another relevant work for unbiased estimation of the marginal likelihood is that of Kudlicka et al. (2020). In contrast to our approach, this method samples one additional particle and keeps track of the number of steps required by PRC at every time-step to obtain an unbiased estimator. The weights are tractable for Kudlicka et al.
(2020) as they do not take into account the effect of the normalization constant Z(.). On the other hand, we consider the effect of Z(.) on the particle's weight, which makes the resampling operation infeasible; we fix this intractability through the dice-enterprise. To provide more clarity, we now consider some special cases of the VSMC-PRC bound and relate them to existing work. For N = 1, our method reduces to a special case of Gummadi (2014), which uses a constraint function C_t for every time-step and restarts the particle trajectory from ∆_t (if C_t is violated); with the setting C_t(z_{1:t}) = a(z_t | z_{1:t-1}, x_{1:t}) and ∆_t = t-1, our method reduces to a specific case of Gummadi (2014). For the special case of N = 1 and T = 1, our method reduces to VRS (Grover et al., 2018). For N, T > 1, if we remove the PRC step, our bound reduces to FIVO (Maddison et al., 2017). Finally, if we remove both the PRC step and resampling, our method effectively reduces to IWAE (Burda et al., 2015). Please refer to Figure 1 for more details.

5. EXPERIMENTS

In this section, we evaluate our proposed algorithm on synthetic as well as real-world datasets and compare it with relevant baselines. For the synthetic data experiment, we implement our method on a Gaussian SSM and compare our approach with VSMC (Naesseth et al., 2017). For the real data experiment, we train a VRNN (Chung et al., 2015) on the polyphonic music dataset.

5.1. GAUSSIAN STATE SPACE MODEL

In this experiment, we study the linear Gaussian state-space model

z_t = A z_{t-1} + e_z,  x_t = C z_t + e_x,

where e_z, e_x ∼ N(0, I) and z_0 = 0. We are interested in learning a good proposal for the above model. The latent variable is denoted by z_t and the observed data by x_t; let their dimensions be d_z and d_x, respectively. The matrix A has elements (A)_{i,j} = α^{|i-j|+1}, with α = 0.42. We explore different settings of d_z, d_x, and the matrix C. A sparse version of C measures the first d_x components of z_t; a dense version of C is normally distributed, i.e., C_{i,j} ∼ N(0, 1). We consider four different configurations for the experiment; for more details, please refer to Figure 2. The variational distribution is a multivariate Gaussian with unknown mean vector μ = {μ_d}_{d=1}^{d_z} and diagonal covariance matrix parameterized by {log σ_d^2}_{d=1}^{d_z}:

q(z_t | z_{t-1}) = N(z_t | A z_{t-1} + μ, diag(σ^2)).

We set N = 4 and T = 10 for all cases. The matrix {{M(i, t-1)}_{i=1}^N}_{t=1}^T (see Eq. 13) for approximate rejection sampling is updated once every 10 epochs with acceptance rate γ ∈ {0.8, 0.4}. For estimating the intractable normalization constants, we generate K = 3 samples. Figure 2 (left) compares the convergence of the biased vs. unbiased gradients. Note that we get a much tighter bound compared to VSMC (Naesseth et al., 2017).
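The data-generating process above can be simulated in a few lines of NumPy (a sketch of the experimental model; the function name and defaults are ours):

```python
import numpy as np

def simulate_lgssm(T=10, dz=2, dx=2, alpha=0.42, sparse_C=True, seed=0):
    """Simulate z_t = A z_{t-1} + e_z, x_t = C z_t + e_x with e_z, e_x ~ N(0, I)
    and z_0 = 0, where (A)_{ij} = alpha^{|i-j|+1}. A sparse C measures the
    first dx components of z_t; a dense C has N(0, 1) entries."""
    rng = np.random.default_rng(seed)
    i, j = np.indices((dz, dz))
    A = alpha ** (np.abs(i - j) + 1.0)
    C = np.eye(dx, dz) if sparse_C else rng.normal(size=(dx, dz))
    z = np.zeros(dz)
    zs, xs = [], []
    for _ in range(T):
        z = A @ z + rng.normal(size=dz)
        x = C @ z + rng.normal(size=dx)
        zs.append(z)
        xs.append(x)
    return np.stack(zs), np.stack(xs)
```

The returned arrays have shapes (T, d_z) and (T, d_x), giving one trajectory of latents and observations per call.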

5.2. VARIATIONAL RNN

VRNN (Chung et al., 2015) comprises three core components: the observation x_t, the stochastic latent state z_t, and a deterministic hidden state h_t(z_{t-1}, x_{t-1}, h_{t-1}), which is modeled through an RNN. For the experiments, we use a single-layer LSTM for modeling the hidden state. The conditional distributions p_t(z_t | .) and q_t(z_t | .) are assumed to be factorized Gaussians, parametrized by a single-layer neural net. The output distribution g_t(x_t | .) depends on the dataset. For a fair comparison, we use the same model setting as employed in FIVO (Maddison et al., 2017). We evaluate our model on four polyphonic music datasets: Nottingham, JSB chorales, Musedata, and Piano-midi.de. Each observation x_t is represented as a binary vector of 88 dimensions; we therefore model the observation distribution g_t(x_t | .) by a set of 88 factorized Bernoulli variables. We split all four datasets into the standard train, validation, and test sets. For tuning the learning rate, we use the validation set. For a fair comparison, we use the same learning rate and number of iterations for all models. Let the dimension of the hidden state (learned by the single-layer LSTM) be d_h and the dimension of the latent variable be d_z. We choose d_z = d_h = 64 for all datasets except JSB, for which we use d_z = d_h = 32. For VSMC-PRC, we consider N ∈ {4, 6}. Further, for each N, we consider four settings (K, γ) ∈ {(1, 0.9), (1, 0.8), (3, 0.9), (3, 0.8)}. The M hyperparameter for the PRC step is learned from Eq. 14 due to its large size, and is updated once every 50 epochs. Note that in this scenario, the acceptance rate for all particles is greater than or equal to γ. For more details on the experiments, please refer to the appendix. As discussed in Section 3.1, the PRC step and dice-enterprise have time complexity O(N/γ) for producing N samples (assuming average acceptance rate γ).
Therefore, we consider Nγ^{-1} particles for IWAE and FIVO to ensure effectively the same number of particles, where N ∈ {4, 6} and γ = 0.8. Note, however, that the acceptance rate is ≥ γ, so this adjustment actually favors the other approaches. For FIVO, we perform resampling when the ESS falls below N/2. Table 1 summarizes the results, which address whether rejecting samples provides any benefit; as the results show, our approach, even with the aforementioned adjustment, outperforms the other approaches in terms of test log-likelihood, while having a similar computational cost.

Table 1: Test log-likelihood for models trained with FIVO, IWAE, ELBO, and VSMC-PRC. For VSMC-PRC, N ∈ {4, 6} and (K, γ) ∈ {(1, 0.9), (1, 0.8), (3, 0.9), (3, 0.8)} (results are in this order). The results for the pianoroll datasets are in nats per timestep.

In Section 3.1, we discussed the effect of K and the PRC rejection rate on the VSMC-PRC bound; we expect a performance improvement when K and the rejection rate are increased. Although the results for VSMC-PRC's different configurations are almost the same, we still get the best average ranking for (K = 3, γ = 0.8). Overall, for most cases, the VSMC-PRC bound performs better than FIVO (Maddison et al., 2017) and IWAE (Burda et al., 2015) across a variety of configurations.


In VSMC-PRC, the improvement in the bound value comes at the cost of estimating the normalization constant Z(.), i.e., the choice of K. On closer inspection, we see that increasing K does not provide any substantial benefit despite the increase in computational cost. Therefore, to maintain the computational trade-off, (K = 1, γ > 0.8) seems a reasonable choice for VI practitioners. Table 1 indicates that rejecting samples with low importance weight is better than keeping a large number of particles (at least for a reasonably high acceptance rate γ). The proposed bound consumes more particles (in the PRC step and dice-enterprise) than existing approaches like FIVO and IWAE due to the intractability. Future work aims at designing a scalable implementation of the VSMC-PRC bound that consumes fewer particles.

6. CONCLUSION

We introduced VSMC-PRC, a novel bound that combines SMC and partial rejection control with VI in a synergistic manner, resulting in a robust VI procedure for sequential latent variable models. Instead of using standard sampling algorithms, we employ a partial sampling scheme suitable for high-dimensional sequences. Our experimental results clearly demonstrate that VSMC-PRC outperforms existing bounds such as IWAE (Burda et al., 2015) and standard particle filter bounds (Maddison et al., 2017; Naesseth et al., 2017; Le et al., 2017). Future work will explore partial versions of powerful sampling algorithms such as Hamiltonian Monte Carlo (Neal et al., 2011) in place of rejection sampling.

A PROOF OF THEORETICAL RESULTS

Proof of Proposition 1: The dice-enterprise produces unbiased ancestor variables. Let us evaluate the probability that the dice-enterprise outputs i as the ancestor index. Assume that the algorithm terminates after r steps, where r ∈ {1, 2, ..., ∞}; then the probability of output i is:

Pr(output = i) = Σ_{r=1}^∞ Pr(output = i | after r steps)
= [c_t^i Z(z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) / Σ_{m=1}^N c_t^m] Σ_{r=1}^∞ [Σ_{j=1}^N c_t^j (1 - Z(z_{1:t-1}^{A_{t-1}^j}, x_{1:t})) / Σ_{m=1}^N c_t^m]^{r-1}
= c_t^i Z(z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) / Σ_{j=1}^N c_t^j Z(z_{1:t-1}^{A_{t-1}^j}, x_{1:t}).

It is easy to see that Λ_t is geometrically distributed with success probability

Pr(output in a given loop) = Σ_{i=1}^N c_t^i Z(z_{1:t-1}^{A_{t-1}^i}, x_{1:t}) / Σ_{m=1}^N c_t^m.

Proof of Proposition 2: Before the proof, we first introduce Ẑ, the Monte-Carlo estimator of the unknown normalization constant Z(.). Since we use K samples,

Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; K) = (1/K) Σ_{k=1}^K a_{θ,φ}(δ_t^{i,k} | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}).   (15)

We first show that Algorithm 1 produces an unbiased estimator of the marginal likelihood, by integrating out δ_{1:T}^{1:N,1:K} from the marginal likelihood estimator.
E_{Q_VSMC-PRC}[exp(L̂_VSMC-PRC)]
= Σ_{A_{1:T-1}^{1:N}} ∫ [∏_{t=1}^T (1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; K) / (q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}))] Q_VSMC-PRC(z_{1:T}^{1:N}, A_{1:T-1}^{1:N}, δ_{1:T}^{1:N,1:K}) dz_{1:T}^{1:N} dδ_{1:T}^{1:N,1:K}
= Σ_{A_{1:T-1}^{1:N}} ∫ [∏_{t=1}^T (1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) ∫ Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; K) ∏_{k=1}^K q_φ(δ_t^{i,k} | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) dδ_t^{i,k} / (q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}))] Q_VSMC-PRC(z_{1:T}^{1:N}, A_{1:T-1}^{1:N}) dz_{1:T}^{1:N}
= Σ_{A_{1:T-1}^{1:N}} ∫ [∏_{t=1}^T (1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) Z(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) / (q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}))] Q_VSMC-PRC(z_{1:T}^{1:N}, A_{1:T-1}^{1:N}) dz_{1:T}^{1:N}
= Σ_{A_{1:T-1}^{1:N}} ∫ [∏_{t=1}^T (1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) / r_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i})] Q_VSMC-PRC(z_{1:T}^{1:N}, A_{1:T-1}^{1:N}) dz_{1:T}^{1:N}
= Σ_{A_{1:T-1}^{1:N}} ∫ [∏_{t=1}^T (1/N) Σ_{i=1}^N α_t(z_{1:t}^i)] [∏_{i=1}^N r_{θ,φ}(z_1^i | x_1)] [∏_{t=1}^{T-1} ∏_{i=1}^N Discrete(A_t^i | α_t) r_{θ,φ}(z_{t+1}^i | x_{1:t+1}, z_{1:t}^{A_t^i})] dz_{1:T}^{1:N}
= E[∏_{t=1}^T (1/N) Σ_{i=1}^N α_t(z_{1:t}^i)].   (16)

Note that r_{θ,φ}(z_t^i | .) is the sampling density of PRC, and the incremental weight satisfies α_t(z_{1:t}^i) = p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) / r_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}). It is easy to see that Eq. 16 is a standard SMC estimator of the marginal likelihood p_θ(x_{1:T}); the proof is standard in the SMC literature and can be found in Naesseth et al. (2017; 2019). The key factor that makes our bound unbiased is the ability to produce unbiased ancestor samples despite the presence of the intractable normalization constant. Using Jensen's inequality, we can easily show that the VSMC-PRC bound is smaller than the log marginal likelihood:

E[L̂_VSMC-PRC; K] = ∫ Σ_{t=1}^T log[(1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; K) / (q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}))] Q_VSMC-PRC(z_{1:T}^{1:N}, A_{1:T-1}^{1:N}, δ_{1:T}^{1:N,1:K}) dz_{1:T}^{1:N} dδ_{1:T}^{1:N,1:K} dA_{1:T-1}^{1:N} ≤ log p_θ(x_{1:T}).   (17)
We now show that E[L̂_VSMC-PRC; K] is non-decreasing in K. Define a collection of subsets {{I_{i,t}}_{i=1}^N}_{t=1}^T ⊂ {1, 2, ..., K}, each with elements {i_1, i_2, ..., i_m} drawn uniformly at random from {1, 2, ..., K} and of size m ≤ K. We can easily show the following expectation:

E_{I_{i,t} = {i_1, i_2, ..., i_m}}[Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; m)] = (1/K) Σ_{k=1}^K a_{θ,φ}(δ_t^{i,k} | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) = Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; K).

Substituting this expectation into Eq. 17 and using Jensen's inequality, E[log X] ≤ log E[X], completes the proof:

E[L̂_VSMC-PRC; K] = E[Σ_{t=1}^T log E_{{{I_{i,t}}_{i=1}^N}_{t=1}^T}[(1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; m) / (q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}))]]
≥ E[E_{{{I_{i,t}}_{i=1}^N}_{t=1}^T}[Σ_{t=1}^T log((1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; m) / (q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i})))]]
= E[L̂_VSMC-PRC; m].

Now consider the limit K → ∞. Using the dominated convergence theorem, we can write the estimator as:

lim_{K→∞} E[L̂_VSMC-PRC] = lim_{K→∞} E[Σ_{t=1}^T log((1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; K) / (q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i})))]
= E[Σ_{t=1}^T log((1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) lim_{K→∞} Ẑ(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}; K) / (q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i})))]
= E[Σ_{t=1}^T log((1/N) Σ_{i=1}^N p_θ(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^{A_{t-1}^i}) Z(x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) / (q_φ(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i}) a_{θ,φ}(z_t^i | x_{1:t}, z_{1:t-1}^{A_{t-1}^i})))].

C EXPERIMENTAL SETUP

For the real-data experiment, we train a VRNN (Chung et al., 2015) on the polyphonic music benchmark, which comprises four datasets: Nottingham, JSB chorales, MuseData, and Piano-midi.de. Each dataset is divided into standard train, validation, and test splits. The validation data was used to tune the learning rate, chosen from the set $\{3 \times 10^{-4}, 1 \times 10^{-4}, 3 \times 10^{-5}, 1 \times 10^{-5}\}$; instead of optimizing it for each method, we picked the learning rate at which the validation performance of FIVO (Maddison et al., 2017) is best. Once the learning rate was fixed, we ran every method for the same number of iterations to ensure uniformity. We use a single-layer LSTM with hidden-state dimension $d_h$. For a length-$T$ sequence, the variational distribution is

\begin{align*}
r(z_{1:T} \mid x_{1:T}) = \frac{\prod_{t=1}^{T} q_t\big(z_t \mid h_t(z_{t-1}, x_{t-1}, h_{t-1}), x_t\big)\, a_t\big(z_t \mid h_t(z_{t-1}, x_{t-1}, h_{t-1}), x_t\big)}{\prod_{t=1}^{T} Z_t\big(h_t(z_{t-1}, x_{t-1}, h_{t-1}), x_t\big)}.
\end{align*}

Note that $a_t(z_t \mid \cdot)$ is the acceptance probability of the PRC step and $Z_t(\cdot)$ is the intractable normalization constant. Similarly, we can write down the joint data likelihood as

\begin{align*}
p(z_{1:T}, x_{1:T}) = \prod_{t=1}^{T} p_t\big(z_t \mid h_t(z_{t-1}, x_{t-1}, h_{t-1}), x_t\big)\, g_t\big(x_t \mid h_t(z_{t-1}, x_{t-1}, h_{t-1}), z_t\big).
\end{align*}

The conditional distributions $p_t(z_t \mid \cdot)$ and $q_t(z_t \mid \cdot)$ are assumed to be factorized Gaussians, where the latent variable $z_t$ has dimension $d_z$. The conditional densities are parametrized by fully connected neural networks with a single layer of size $d_h$. The output distribution $g_t(x_t \mid \cdot)$ is modeled by a set of 88 i.i.d. Bernoulli variables. See Table 2 for further implementation details. The unknown weights and biases are initialized using Xavier initialization. For optimization we use the Adam optimizer with a batch size of 4. To save time, the unknown hyperparameter $M$ is updated once every 50 epochs through Eq. 14.
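As a concrete picture of the per-timestep parametrization described above, the following is a minimal single-step sketch: a factorized-Gaussian proposal over $z_t$ conditioned on $(h_t, x_t)$, a factorized-Gaussian prior conditioned on $h_t$, and an 88-dimensional i.i.d.-Bernoulli emission. The layer sizes, helper names, and initialization limits are illustrative assumptions (the paper's values are in its Table 2); the LSTM recurrence for $h_t$, the PRC accept/reject step, and training are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_z = 88, 32, 16  # illustrative sizes, not the paper's

def dense(d_in, d_out):
    # Xavier-style uniform initialization, as described in the setup
    limit = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-limit, limit, (d_out, d_in)), np.zeros(d_out)

Wq, bq = dense(d_h + d_x, 2 * d_z)  # proposal q_t(z_t | h_t, x_t): mean, log-variance
Wp, bp = dense(d_h, 2 * d_z)        # prior p_t(z_t | h_t): mean, log-variance
Wg, bg = dense(d_h + d_z, d_x)      # emission g_t(x_t | h_t, z_t): 88 Bernoulli logits

def vrnn_step(h, x):
    mu_q, logvar_q = np.split(Wq @ np.concatenate([h, x]) + bq, 2)
    mu_p, logvar_p = np.split(Wp @ h + bp, 2)  # prior params (unused further here)
    z = mu_q + np.exp(0.5 * logvar_q) * rng.normal(size=d_z)  # reparameterized sample
    logits = Wg @ np.concatenate([h, z]) + bg
    probs = 1.0 / (1.0 + np.exp(-logits))       # 88 i.i.d. Bernoulli probabilities
    return z, probs

h0 = np.zeros(d_h)
x0 = rng.integers(0, 2, d_x).astype(float)      # one binary piano-roll frame
z, probs = vrnn_step(h0, x0)
```

In the full method, each proposed $z_t$ would additionally pass through the PRC accept/reject test before being kept as a particle.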



Figure 1: Comparison of VSMC-PRC with IWAE (Burda et al., 2015) and FIVO (Maddison et al., 2017). (a) In VSMC-PRC, the blue arrows represent the resampling step; we then generate multiple samples from the parametrized proposal $z^i_t \mid z^i_{1:t-1}$, of which one is accepted via PRC (green arrows). (b) In IWAE, there is neither a resampling step nor a PRC step. (c) In FIVO, there is a resampling step (blue arrows) but no PRC step.

Figure 2: (Left) The figure compares the bound value for VSMC-PRC with the full gradient and the biased gradient (equation 12) as a function of iterations. (Middle) The table compares the bound value for VSMC (Naesseth et al., 2017) and VSMC-PRC at 80% and 40% acceptance rates. (Right) We compare VSMC, VSMC-PRC (40% acceptance rate), and $\log p_\theta(x_{1:T})$ as a function of iterations.

Figure 3: Convergence rate of the biased vs. unbiased gradient on the toy Gaussian SSM. We compare $g_{\mathrm{rep}}$ (blue), $g_{\mathrm{rep}} + g_{\mathrm{PRC}}$ (orange), and the unbiased gradient (green) for optimization.

For more on setting the hyperparameter $M$, see Liu et al. (1998); Peters et al. (2012).


Proof of Proposition 3: We will show that the PRC step refines the learned distribution, i.e.,

\begin{align*}
\mathrm{KL}\Big(r\big(z_t \mid z_{1:t-1}, x_{1:t}\big)\,\Big\|\, p\big(z_t \mid z_{1:t-1}, x_{1:t}\big)\Big) \;\le\; \mathrm{KL}\Big(q\big(z_t \mid x_{1:t}, z_{1:t-1}\big)\,\Big\|\, p\big(z_t \mid z_{1:t-1}, x_{1:t}\big)\Big).
\end{align*}

First, we use the property of negatively correlated random variables. Note that the two random variables $X = \log\big(q(z_t \mid x_{1:t}, z_{1:t-1}) / p(z_t \mid z_{1:t-1}, x_{1:t})\big)$ and $Y = a\big(z_t \mid z_{1:t-1}, x_{1:t}\big)$ are negatively correlated, since the acceptance probability is small exactly where the proposal overweights the posterior. We know that for negatively correlated variables the identity $\mathbb{E}[XY] \le \mathbb{E}[X]\,\mathbb{E}[Y]$ holds. Further, we use Jensen's inequality to show that $\mathbb{E}\big[\log a(z_t \mid z_{1:t-1}, x_{1:t})\big] \le \log \mathbb{E}\big[a(z_t \mid z_{1:t-1}, x_{1:t})\big]$. The sampling density of the PRC step is

\begin{align*}
r\big(z_t \mid z_{1:t-1}, x_{1:t}\big) = \frac{q\big(z_t \mid x_{1:t}, z_{1:t-1}\big)\, \min\left\{1,\; \dfrac{p_\theta\big(x_t, z_t \mid x_{1:t-1}, z_{1:t-1}\big)}{M(i, t-1)\, q\big(z_t \mid x_{1:t}, z_{1:t-1}\big)}\right\}}{Z\big(x_{1:t}, z_{1:t-1}\big)}.
\end{align*}

Case 1: $M(i, t-1) \to 0$ implies $r(z_t \mid z_{1:t-1}, x_{1:t}) = q(z_t \mid x_{1:t}, z_{1:t-1})$. In this situation all samples would be accepted, since the acceptance probability tends to one; hence the sampling distribution collapses to the proposal.

Case 2: $M(i, t-1) \ge \sup_{z_t} p_\theta\big(x_t, z_t \mid x_{1:t-1}, z_{1:t-1}\big) / q\big(z_t \mid x_{1:t}, z_{1:t-1}\big)$. In this case, the acceptance probability reduces to that of standard rejection sampling. Therefore, the sampling distribution becomes equal to the true posterior $p(z_t \mid z_{1:t-1}, x_{1:t})$.
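The two limiting cases of the PRC acceptance probability $a(z) = \min\{1, w(z)/M\}$ can be checked on a toy one-dimensional example. The target and proposal below are illustrative densities, not the paper's model: with a vanishing $M$ essentially every proposal draw is accepted (so the sampler returns $q$), while $M \ge \sup_z w(z)$ performs exact rejection sampling from the target (here $p = \mathcal{N}(1, 1)$, $q = \mathcal{N}(0, 2^2)$, so $\sup_z w(z) = 2e^{1/6} < 3$).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_w(z):
    # Importance weight w(z) = p(z) / q(z) with p = N(1, 1), q = N(0, 4): illustrative
    log_p = -0.5 * (z - 1.0) ** 2 - 0.5 * np.log(2 * np.pi)
    log_q = -0.5 * (z / 2.0) ** 2 - 0.5 * np.log(2 * np.pi * 4.0)
    return log_p - log_q

def prc_accept_rate(M, n=20_000):
    """Fraction of proposal draws accepted with a(z) = min(1, w(z)/M)."""
    z = rng.normal(0.0, 2.0, n)
    a = np.minimum(1.0, np.exp(log_w(z)) / M)
    return (rng.uniform(size=n) < a).mean()

def prc_sample(M, n=20_000):
    """Accept-reject with a(z) = min(1, w(z)/M); returns n accepted draws."""
    out = []
    while len(out) < n:
        z = rng.normal(0.0, 2.0, 10_000)
        a = np.minimum(1.0, np.exp(log_w(z)) / M)
        out.extend(z[rng.uniform(size=z.size) < a])
    return np.array(out[:n])

rate_small_M = prc_accept_rate(M=1e-9)  # Case 1: M -> 0, essentially all accepted
samples = prc_sample(M=3.0)             # Case 2: M >= sup_z w(z), exact rejection sampling
target_mean = samples.mean()            # should approach E_p[z] = 1
```

Between these extremes, $M$ trades off acceptance rate against how closely the PRC density $r$ tracks the target, which is the refinement that Proposition 3 formalizes.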

B GRADIENT ESTIMATION

In this section, we derive the unbiased gradients for the Monte-Carlo estimate $\mathbb{E}[\widehat{\mathcal{L}}_{\text{VSMC-PRC}}]$. Note that we can express the complete gradient in terms of three core components (assuming $q(\cdot)$ is reparametrizable).

