MIND THE GAP WHEN CONDITIONING AMORTISED INFERENCE IN SEQUENTIAL LATENT-VARIABLE MODELS

Abstract

Amortised inference enables scalable learning of sequential latent-variable models (LVMs) with the evidence lower bound (ELBO). In this setting, variational posteriors are often only partially conditioned. While the true posteriors depend, e. g., on the entire sequence of observations, approximate posteriors are only informed by past observations. This mimics the Bayesian filter, which is a mixture of smoothing posteriors. Yet, we show that the ELBO objective forces partially-conditioned amortised posteriors to approximate products of smoothing posteriors instead. Consequently, the learned generative model is compromised. We demonstrate these theoretical findings in three scenarios: traffic flow, handwritten digits, and aerial vehicle dynamics. Using fully-conditioned approximate posteriors, performance improves in terms of generative modelling and multi-step prediction.

1. INTRODUCTION

Variational inference has paved the way towards learning deep latent-variable models (LVMs): maximising the evidence lower bound (ELBO) approximates maximum likelihood learning (Jordan et al., 1999; Beal, 2003; MacKay, 2003). An efficient variant is amortised variational inference where the approximate posteriors are represented by a deep neural network, the inference network (Hinton et al., 1995). It produces the parameters of the variational distribution for each observation by a single forward pass, in contrast to classical variational inference with a full optimisation process per sample. The framework of variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014) adds the reparameterisation trick for low-variance gradient estimates of the inference network. Learning of deep generative models is hence both tractable and flexible, as all its parts can be represented by neural networks and fit by stochastic gradient descent. The quality of the solution is largely determined by how well the true posterior is approximated: the gap between the ELBO and the log marginal likelihood is the KL divergence from the approximate to the true posterior. Recent works have proposed ways of closing this gap, suggesting tighter alternatives to the ELBO (Burda et al., 2016; Mnih & Rezende, 2016) or more expressive variational posteriors (Rezende & Mohamed, 2015; Mescheder et al., 2017). Cremer et al. (2018) provide an analysis of this gap, splitting it in two. The approximation gap is caused by restricting the approximate posterior to a specific parametric form, the variational family. The amortisation gap comes from the inference network failing to produce the optimal parameters within the family. In this work, we address a previously unstudied aspect of amortised posterior approximations: successful approximate posterior design is not merely about picking a parametric form of a probability density.
It is also about carefully deciding which conditions to feed into the inference network. This is a particular problem in sequential LVMs, where it is common practice to feed only a restricted set of inputs into the inference network: not conditioning on latent variables from other time steps can vastly improve efficiency (Bayer & Osendorfer, 2014; Lai et al., 2019).

Figure 1: Illustration of the effect of partial conditioning. Consider a latent-variable model $p(C, \bar C \mid z)\, p(z)$, with binary $\bar C$ and arbitrary $C$. We omit $\bar C$ from the amortised approximate posterior $q(z \mid C)$. Shown are two cases of separated (left) and overlapping (right) Gaussian true posteriors. Top row: the full posteriors $p(z \mid C, \bar C = 0)$ and $p(z \mid C, \bar C = 1)$ as well as their average, the marginal posterior $p(z \mid C)$. Middle row: variational Gaussian approximation $\nu(z \mid C) \approx p(z \mid C)$ to the marginal posterior, which was obtained by stochastic gradient descent on the reverse, mode-seeking KL divergence (Hoffman et al., 2013). Bottom row: the optimal $\omega(z \mid C)$ obtained by optimising the ELBO with a partially-conditioned amortised approximate posterior w. r. t. $q$. It is located far away from the modes, sharply peaked and shares little mass with the true full posteriors, the marginal posterior as well as the approximate marginal posterior.

Further, leaving out observations from the future structurally mimics the Bayesian filter, useful for state estimation in the control of partially observable systems (Karl et al., 2017b; Hafner et al., 2019; Lee et al., 2020). We analyse the emerging conditioning gap and the resulting suboptimality for sequential amortised inference models. If the variational posterior is only partially conditioned, all those posteriors equal w. r. t. included conditions but different on the excluded ones will need to share an approximate posterior.
The result is a suboptimal compromise over many different true posteriors which cannot be mitigated by improving variational family or inference network capacity. We empirically show its effects in an extensive study on three real-world data sets on the common use case of variational state-space models.

2.1. VARIATIONAL AUTO-ENCODERS

Consider the task of fitting the parameters of a latent-variable model $p_\theta(x, z) = p_\theta(x \mid z)\, p_\theta(z)$ to a target distribution $\hat p(x)$ by maximum likelihood:
$$\arg\max_\theta \mathbb{E}_{x \sim \hat p(x)}\left[\log p_\theta(x)\right]. \tag{1}$$
In the case of variational auto-encoders (VAEs), the prior distribution $p_\theta(z)$ is often simple, such as a standard Gaussian. The likelihood $p_\theta(x \mid z)$ on the other hand is mostly represented by a deep neural network producing the likelihood parameters as a function of $z$. The log marginal likelihood $\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$ contains a challenging integral within a logarithm. The maximisation of eq. (1) is thus generally intractable. Yet, for a single sample $x$ the log marginal likelihood can be bounded from below by the evidence lower bound (ELBO):
$$\begin{aligned} \log p_\theta(x) &\ge \log p_\theta(x) - \mathrm{KL}(q(z) \,\|\, p_\theta(z \mid x)) \qquad (2) \\ &= \mathbb{E}_{z \sim q}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}(q(z) \,\|\, p_\theta(z)) =: -\ell(\theta, q, x), \end{aligned}$$
where a surrogate distribution $q$, the variational posterior, is introduced. The bound is tighter the better $q(z)$ approximates $p_\theta(z \mid x)$. The gap vanishes with the posterior KL divergence in eq. (2). Typically, the target distribution is given by a finite data set $D = \{x^{(n)}\}_{n=1}^N$. In classical variational inference, equally many variational posteriors $\{q^{(n)}(z)\}_{n=1}^N$ are determined by direct optimisation. For instance, each Gaussian $q^{(n)}(z)$ is represented by its mean $\mu^{(n)}$ and standard deviation $\sigma^{(n)}$. The set of all possible distributions of that parametric form is called the variational family $Q \ni q(z)$. Learning the generative model parameters $\theta$ can then be performed by an empirical risk minimisation approach, where the average negative ELBO over a data set is minimised:
$$\arg\min_{\theta, \{q^{(n)}\}_{n=1}^N} \sum_{n=1}^N \ell\left(\theta, q^{(n)}(z), x^{(n)}\right).$$
The situation is different for VAEs. The parameters of the approximate posterior for a specific sample $x$ are produced by another deep neural network, the inference network $q_\phi(z \mid x)$.
Instead of directly optimising for sample-wise parameters, a global set of weights $\phi$ for the inference network is found by stochastic gradient descent on the expected negative ELBO:
$$\arg\min_{\theta, \phi} \mathbb{E}_{x \sim \hat p(x)}\left[\ell(\theta, q_\phi(z \mid x), x)\right].$$
This approach tends to scale more gracefully to large data sets.

The Sequential Case

The posterior of a sequential latent-variable model
$$p(x_{1:T}, z_{1:T}) = \prod_{t=1}^T p(x_t \mid x_{1:t-1}, z_{1:t})\, p(z_t \mid z_{1:t-1}, x_{1:t-1})$$
of observations $(x_1, \dots, x_T) = x_{1:T}$ and latents $(z_1, \dots, z_T) = z_{1:T}$, with $z_{1:0} = \emptyset$, factorises as
$$p(z_{1:T} \mid x_{1:T}) = \prod_{t=1}^T p(z_t \mid z_{1:t-1}, x_{1:T}).$$
The posterior KL divergence is a sum of step-wise divergences (Cover & Thomas, 2006):
$$\mathrm{KL}(q(z_{1:T} \mid \cdot) \,\|\, p(z_{1:T} \mid x_{1:T})) = \sum_{t=1}^T \mathbb{E}_{z_{1:t-1} \sim q}\left[\mathrm{KL}(q(z_t \mid \cdot) \,\|\, p(z_t \mid z_{1:t-1}, x_{1:T}))\right] \tag{4}$$
where we have left out the conditions of $q(z_{1:T} \mid \cdot) = \prod_t q(z_t \mid \cdot)$ for now. This decomposition allows us to focus on the non-sequential case in our theoretical analysis, even though the phenomenon most commonly occurs with sequential LVMs.
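The relation between the ELBO, the log marginal likelihood and the posterior KL divergence in eq. (2) can be checked numerically. The following is a minimal sketch for a scalar linear Gaussian model in which every term has a closed form. The model choice $p(z) = N(0, 1)$, $p(x \mid z) = N(az, \sigma^2)$, the function names and the test values are assumptions made for illustration only and are not part of the experiments.

```python
import math


def gaussian_kl(m_q, v_q, m_p, v_p):
    """KL(N(m_q, v_q) || N(m_p, v_p)) for scalar Gaussians given by mean and variance."""
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)


def log_marginal(x, a, noise_var):
    """log p(x) for p(z) = N(0, 1), p(x | z) = N(a z, noise_var): x ~ N(0, a^2 + noise_var)."""
    var = a ** 2 + noise_var
    return -0.5 * (math.log(2.0 * math.pi * var) + x ** 2 / var)


def exact_posterior(x, a, noise_var):
    """Mean and variance of the true posterior p(z | x) of the toy model."""
    var = 1.0 / (1.0 + a ** 2 / noise_var)
    return var * a * x / noise_var, var


def elbo(x, m_q, v_q, a, noise_var):
    """E_q[log p(x | z)] - KL(q || p(z)) for a Gaussian q = N(m_q, v_q)."""
    # E_q[(x - a z)^2] = (x - a m_q)^2 + a^2 v_q
    expected_sq_err = (x - a * m_q) ** 2 + a ** 2 * v_q
    expected_loglik = -0.5 * (math.log(2.0 * math.pi * noise_var) + expected_sq_err / noise_var)
    return expected_loglik - gaussian_kl(m_q, v_q, 0.0, 1.0)
```

For any Gaussian `q`, `log_marginal(...) - elbo(...)` should equal `gaussian_kl` between `q` and `exact_posterior(...)`, and the gap should vanish when `q` is the exact posterior.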

2.2. THE APPROXIMATION AND AMORTISATION GAPS

The ELBO and the log marginal likelihood are separated by the posterior Kullback-Leibler divergence $\mathrm{KL}(q(z) \,\|\, p_\theta(z \mid x))$, cf. eq. (2). If the optimal member of the variational family $q^\star \in Q$ is not equal to the true posterior, or $p_\theta(z \mid x) \notin Q$, then $\mathrm{KL}(q^\star(z) \,\|\, p_\theta(z \mid x)) > 0$, so that the ELBO is strictly lower than the log marginal likelihood. This is the approximation gap. Using an inference network, we must expect that even the optimal variational posterior within the variational family is not found in general. The gap widens: $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) \ge \mathrm{KL}(q^\star(z) \,\|\, p_\theta(z \mid x))$. This phenomenon was first discussed by Cremer et al. (2018). The additional gap $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) - \mathrm{KL}(q^\star(z) \,\|\, p_\theta(z \mid x)) \ge 0$ is called the amortisation gap. The two gaps represent two distinct suboptimality cases when training VAEs. The former is caused by the choice of variational family, the latter by the success of the amortised search for a good candidate $q \in Q$.

Table 1:

Model | $C_t$ | $\bar C_t$
--- | --- | ---
(Chung et al., 2015) | $x_{1:t}, z_{1:t-1}$ | $x_{t+1:T}$
DKF (Krishnan et al., 2015) | $x_{1:T}$ | $z_{t-1}$ (exper.)
DKS (Krishnan et al., 2017) | $z_{t-1}, x_{t:T}$ | -
DVBF (Karl et al., 2017a) | $x_t, z_{t-1}$ | $x_{t+1:T}$
Planet (Hafner et al., 2019) | $x_t, z_{t-1}$ | $x_{t+1:T}$
SLAC (Lee et al., 2020) | $x_t, z_{t-1}$ | $x_{t+1:T}$

3. PARTIAL CONDITIONING OF INFERENCE NETWORKS

VAEs add an additional consideration to approximate posterior design compared to classical variational inference: the choice of inputs to the inference networks. Instead of feeding the entire observation $x$, one can choose to feed a strict subset $C \subset x$. This is common practice in many sequential variants of the VAE, cf. table 1, typically motivated by efficiency or downstream applications. The discrepancy between the inputs of the inference model and the conditions of the true posterior has been acknowledged before, e. g. by Krishnan et al. (2017), who investigate different variants. Their analysis does not go beyond quantitative empirical comparisons. In the following, we show that such design choices lead to a distinct third cause of inference suboptimality, the conditioning gap.

3.1. MINIMISING EXPECTED KL DIVERGENCES YIELDS PRODUCTS OF POSTERIORS

The root cause is that all observations $x$ that share the same included conditions $C$ but differ in the excluded conditions must now share the same approximate posterior by design. This shared approximate posterior optimizes the expected KL divergence
$$\omega := \arg\min_q \mathbb{E}_{\bar C \mid C}\left[\mathrm{KL}(q_\phi(z \mid C) \,\|\, p(z \mid C, \bar C))\right]$$
w. r. t. the missing conditions $\bar C = x \setminus C$. By rearranging (cf. appendix A.1 for details)
$$\mathbb{E}_{\bar C \mid C}\left[\mathrm{KL}(q_\phi(z \mid C) \,\|\, p(z \mid C, \bar C))\right] = \mathrm{KL}\left(q_\phi(z \mid C) \,\Big\|\, \exp\left(\mathbb{E}_{\bar C \mid C}\left[\log p(z \mid C, \bar C)\right]\right) / Z\right) - \log Z,$$
where $Z$ denotes a normalising constant, we see that the shared optimal approximate posterior is
$$\omega(z) \propto \exp\left(\mathbb{E}_{\bar C \mid C}\left[\log p(z \mid C, \bar C)\right]\right).$$
Interestingly, this expression bears superficial similarity with the true partially-conditioned posterior
$$p(z \mid C) = \mathbb{E}_{\bar C \mid C}\left[p(z \mid C, \bar C)\right] = \exp\left(\log \mathbb{E}_{\bar C \mid C}\left[p(z \mid C, \bar C)\right]\right).$$
To understand the difference, consider a uniform discrete $\bar C \mid C$ for the sake of the argument. In this case, the expectation is an average over all missing conditions. The true posterior $p(z \mid C)$ is then a mixture distribution of all plausible full posteriors $p(z \mid C, \bar C)$. The optimal approximate posterior, because of the logarithm inside the sum, is a product of plausible full posteriors. Such distributions occur e. g. in products of experts (Hinton, 2002) or Gaussian sensor fusion (Murphy, 2012). Products behave differently from mixtures: an intuition due to Welling (2007) is that a factor in a product can single-handedly "veto" a sample, while each term in a mixture can only "pass" it. This intuition is highlighted in fig. 1. We can see that $\omega$ is located between the modes and sharply peaked. It shares almost no mass with either the full posteriors or the marginal posterior. The approximate marginal posterior $\nu$ on the other hand either covers one or two modes, depending on the width of the two full posteriors, which is a much more reasonable approximation. In fact, $\omega(z)$ will only coincide with either $p(z \mid C)$ or $p(z \mid C, \bar C)$ in the extreme case when $p(z \mid C) = p(z \mid C, \bar C) \Leftrightarrow p(z \mid C)\, p(\bar C \mid C) = p(z, \bar C \mid C)$, cf.
appendix A.2, which implies statistical independence of $\bar C$ and $z$ given $C$. Notice how the suboptimality of $\omega$ assumed neither a particular variational family, nor imperfect amortisation: $\omega$ is the analytically optimal shared posterior. Both approximation and amortisation gap would only additionally affect the KL divergence inside the expectation in eq. (5). As such, the inference suboptimality described here can occur even if those vanish. We call $\mathbb{E}_C\left[\mathbb{E}_{\bar C \mid C}\left[\mathrm{KL}(\omega(z \mid C) \,\|\, p(z \mid C, \bar C))\right]\right]$ the conditioning gap, which in contrast to approximation and amortisation gap can only be defined as an expectation w. r. t. $p(x) = p(C, \bar C)$. Notice further that the effects described here assumed the KL divergence, a natural choice due to its duality with the ELBO (cf. eq. (2)). To what extent alternative divergences (Ranganath et al., 2014; Li & Turner, 2016) exhibit similar or more favourable behaviour is left to future research.

Learning Generative Models with Partially-Conditioned Posteriors

We find that the optimal partially-conditioned posterior will not correspond to desirable posteriors such as $p(z \mid C)$ or $p(z \mid C, \bar C)$, even assuming that the variational family could represent them. This also affects learning the generative model. A simple variational calculus argument (appendix A.3) reveals that, when trained with partially-conditioned approximate posteriors, maximum-likelihood models and ELBO-optimal models are generally not the same. In fact, they coincide only in the restricted cases where $p(z \mid C, \bar C) = q(z \mid C)$. As seen before this is the case if and only if $\bar C \perp z \mid C$.
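The contrast in section 3.1 between the mixture $p(z \mid C)$ and the optimum $\omega(z) \propto \exp(\mathbb{E}[\log p(z \mid C, \bar C)])$ can be reproduced on a grid. The sketch below assumes a binary, uniformly distributed missing condition and two unit-variance Gaussian full posteriors with modes at -3 and 3. These numbers, the grid and all names are illustrative assumptions and are not the settings used for fig. 1.

```python
import math


def log_normal_pdf(z, mean, std):
    """Log density of a scalar Gaussian with the given mean and standard deviation."""
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def normalise(density, dz):
    """Rescale grid values so that they integrate to one under a Riemann sum."""
    mass = sum(density) * dz
    return [d / mass for d in density]


def mean_and_variance(grid, density, dz):
    """First two moments of a normalised density evaluated on a grid."""
    mean = sum(z * d for z, d in zip(grid, density)) * dz
    var = sum((z - mean) ** 2 * d for z, d in zip(grid, density)) * dz
    return mean, var


DZ = 0.01
GRID = [-12.0 + i * DZ for i in range(2401)]
MODES, STD = (-3.0, 3.0), 1.0  # assumed full posteriors p(z | C, C_bar = 0) and p(z | C, C_bar = 1)

log_full = [[log_normal_pdf(z, m, STD) for z in GRID] for m in MODES]

# Marginal posterior p(z | C): average of the densities (a mixture).
mixture = normalise([0.5 * math.exp(a) + 0.5 * math.exp(b) for a, b in zip(*log_full)], DZ)

# Shared optimum omega(z): exponentiated average of the log densities (a product).
omega = normalise([math.exp(0.5 * a + 0.5 * b) for a, b in zip(*log_full)], DZ)

mixture_mean, mixture_var = mean_and_variance(GRID, mixture, DZ)
omega_mean, omega_var = mean_and_variance(GRID, omega, DZ)
```

Under these assumptions `omega` is centred between the two modes and is far narrower than the mixture, so it places very little mass on either mode, which matches the "veto" intuition.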

3.2. LEARNING A UNIVARIATE GAUSSIAN

It is worth highlighting the results of this section in a minimal scalar linear Gaussian example. The target distribution is $\hat p(x) = N(0, 1)$. We assume the latent variable model $p_a(x, z) = N(x \mid az, 0.1) \cdot N(z \mid 0, 1)$ with free parameter $a \ge 0$. This implies
$$p_a(x) = N\left(x \mid 0,\, 0.1 + a^2\right), \qquad p_a(z \mid x) = N\left(z \mid a\left(0.1 + a^2\right)^{-1} x,\, \left(1 + 10a^2\right)^{-1}\right).$$
With $a^* = \sqrt{0.9}$, we recover the target distribution with posterior $p_{a^*}(z \mid x) = N\left(z \mid \sqrt{0.9}\,x,\, 0.1\right)$. Next, we introduce the variational approximation $q(z) = N(z \mid \mu_z, \sigma_z^2)$. With the only condition $x$ missing, this is a deliberately extreme case of partial conditioning where $C = \emptyset$ and $\bar C = \{x\}$. Notice that the true posterior is a member of the variational family, i. e., no approximation gap. We maximise the expected ELBO
$$\omega_a(z) = \arg\max_q \mathbb{E}_{x \sim \hat p}\left[\mathbb{E}_{z \sim q}\left[\log \frac{p_a(x, z)}{q(z)}\right]\right],$$
i. e., all observations from $\hat p$ share the same approximation $q$. One can show that $\omega_a(z) = N\left(z \mid 0,\, \left(100a^2 + 1\right)^{-1}\right)$. We immediately see that $\omega_a(z)$ is neither equal to $p_{a^*}(z \mid x) \equiv p(z \mid C, \bar C)$ nor to $p(z) \equiv p(z \mid C)$. Less obvious is that the variance of the optimal solution is much smaller than that of both $p(z \mid C, \bar C)$ and $p(z \mid C)$. This is a consequence of the product-of-experts compromise discussed earlier: the (renormalised) product of Gaussian densities will always have lower variance than either of the factors. This simple example highlights how poor shared posterior approximations can become. Further, inserting $\omega_a$ back into the expected ELBO and optimising for $a$ reveals that the maximum likelihood model parameter $a^*$ is not optimal. In other words, $p_{a^*}(x, z)$ does not optimise the expected ELBO under $\hat p$, despite being the maximum likelihood model.
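The stated marginal and posterior of the example can be cross-checked by conditioning the joint Gaussian of $(x, z)$ on $x$. The sketch below covers only these two quantities and assumes that the second argument of $N(x \mid az, 0.1)$ is a variance, as the stated marginal variance $0.1 + a^2$ suggests. The function names are hypothetical.

```python
import math

NOISE_VAR = 0.1  # assumption: the 0.1 in N(x | a z, 0.1) denotes a variance


def marginal_variance(a):
    """Variance of p_a(x) when z ~ N(0, 1) and x = a z + noise."""
    return a ** 2 + NOISE_VAR


def posterior_by_conditioning(a, x):
    """Mean and variance of p_a(z | x), obtained from the joint Gaussian of (x, z)."""
    cov_xz = a  # Cov(x, z) = a * Var(z) = a
    var_x = marginal_variance(a)
    mean = cov_xz / var_x * x
    var = 1.0 - cov_xz ** 2 / var_x
    return mean, var


def posterior_closed_form(a, x):
    """Mean and variance of p_a(z | x) as written in the text."""
    return a / (NOISE_VAR + a ** 2) * x, 1.0 / (1.0 + 10.0 * a ** 2)


A_STAR = math.sqrt(0.9)  # maximum likelihood parameter: marginal variance equals one
```

Both routes should agree for any `a`, and at `A_STAR` the posterior for an observation `x` should have mean `sqrt(0.9) * x` and variance `0.1`.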

3.3. EXTENSION TO THE SEQUENTIAL CASE

The conditioning gap may arise in each of the terms of eq. (4) if $q$ is partially conditioned. Yet, it is common to leave out future observations, e. g. to use $q(z_t \mid z_{1:t-1}, x_{1:t})$ with $x_{t+1:T}$ dropped from the conditions. To the best of our knowledge, in the literature only sequential applications of VAEs are potentially affected by the conditioning gap. Still, sequential latent-variable models with amortised under-conditioned posteriors have been applied successfully to, e. g., density estimation, anomaly detection, and sequential decision making (Bayer & Osendorfer, 2014; Soelch et al., 2016; Hafner et al., 2019). This may seem at odds with the previous results. The gap is not severe in all cases. Let us emphasize two "safe" cases where the gap vanishes because $\bar C_t \perp z_t \mid C_t$ is approximately true. First, where the partially- and the fully-conditioned posterior correspond to the prior transition, i. e. $p(z_t \mid z_{t-1}) \approx p(z_t \mid C_t) \approx p(z_t \mid C_t, \bar C_t)$. This is for example the case for deterministic dynamics where the transition is a single point mass. Second, the case where the observations are sufficient to explain the latent state, i. e. $p(z_t \mid x_t) \approx p(z_t \mid C_t) \approx p(z_t \mid C_t, \bar C_t)$. A common example is a system with perfect state information. Several studies have shown that performance gains of fully-conditioned over partially-conditioned approaches are negligible (Fraccaro et al., 2016; Maddison et al., 2017; Buesing et al., 2018). We conjecture that the mentioned "safe" cases are overrepresented in the studied data sets. For example, environments for reinforcement learning such as the gym or MuJoCo environments (Todorov et al., 2012; Brockman et al., 2016) feature deterministic dynamics. We will address a broader variety of cases in section 5.

4. RELATED WORK

The bias of the ELBO has been studied numerous times (Nowozin, 2018; Huang & Courville, 2019). A remarkable study by Turner & Sahani (2011) contains a series of carefully designed experiments showing how different forms of assumptions on the approximate posterior let different qualities of the solutions emerge. This study predates the introduction of amortised inference (Kingma & Welling, 2014; Rezende et al., 2014) however. Cremer et al. (2018) identify two gaps arising in the context of VAEs: the approximation gap, i. e. inference error resulting from the true posterior not being a member of the variational family; and the amortisation gap, i. e. imperfect inference due to an amortised inference procedure, e. g. a single neural network call. Both gaps occur per sample and are independent of the problems in section 3. Applying stochastic gradient variational Bayes (Kingma & Welling, 2014; Rezende et al., 2014) to sequential models was pioneered by Bayer & Osendorfer (2014); Fabius et al. (2015); Chung et al. (2015). The inclusion of state-space assumptions was henceforth demonstrated by Archer et al. (2015); Krishnan et al. (2015; 2017); Karl et al. (2017a); Fraccaro et al. (2016; 2017); Becker-Ehmck et al. (2019). Specific applications to various domains followed, such as video prediction (Denton & Fergus, 2018), tracking (Kosiorek et al., 2018; Hsieh et al., 2018; Akhundov et al., 2019) or simultaneous localisation and mapping (Mirchev et al., 2019). Notable performance in model-based sequential decision making has also been achieved by Gregor et al. (2019); Hafner et al. (2019); Lee et al. (2020); Karl et al. (2017b); Becker-Ehmck et al. (2020). The overall benefit of latent sequential variables was however questioned by Lai et al. (2019). The performance of sequential latent-variable models in comparison to auto-regressive models was studied, arriving at the conclusion that the added stochasticity does not increase, but even decreases performance.
We want to point out that their empirical study is restricted to the setting where $z_t \perp z_{1:t-1} \mid x_{1:T}$, see the appendix of their work. We conjecture that such assumptions lead to a collapse of the model where the latent variables merely help to explain the data local in time, i. e. intra-step correlations. A related field is that of missing-value imputation. Here, tasks have increasingly been solved with probabilistic models (Ledig et al., 2017; Ivanov et al., 2019). The difference to our work is the explicit nature of the missing conditions. Missing conditions are typically directly considered by learning $p(\bar C \mid C)$ instead of $p(x)$ and appropriate changes to the loss functions.

5. STUDY: VARIATIONAL STATE-SPACE MODELS

We study the implications of section 3 for the case of variational state-space models (VSSMs). We deliberately avoid the cases of deterministic dynamics and systems with perfect state information, because the gap is absent in these cases, see section 3.3. Instead, we consider unmanned aerial vehicle (UAV) trajectories with imperfect state information in section 5.2, a sequential version of the MNIST data set in section 5.3 and traffic flow in section 5.4. In all cases, we not only report the ELBO, but conduct qualitative evaluations as well. We start out with describing the employed model in section 5.1.

5.1. VARIATIONAL STATE-SPACE MODELS

The two Markov assumptions that every observation only depends on the current state and every state only depends on the previous state and condition lead to the following generative model:
$$p(x_{1:T}, z_{1:T} \mid u_{1:T}) = \prod_{t=1}^T p(x_t \mid z_t)\, p(z_t \mid z_{t-1}, u_{t-1}),$$
where $u_{1:T}$ are additional conditions from the data set (such as control signals) and $z_0 = u_0 := \emptyset$. We consider three different conditionings of the inference model
$$q(z_{1:T} \mid x_{1:T}, u_{1:T}) = q(z_1 \mid x_{1:k}, u_{1:k}) \prod_{t=2}^T q(z_t \mid z_{t-1}, x_{1:m}, u_{1:m}),$$
partial with $k = 1$, $m = t$, semi with $k > 1$, $m = t$, and full with $k = m = T$. We call $k$ the sneak peek, inspired by Karl et al. (2017a). See appendix B for more details.
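The three conditionings differ only in which observations and conditions each factor of the inference model may see. The helper below is a small sketch that spells out these index ranges for a given time step. The function names and the 1-based index convention are assumptions made for illustration.

```python
def visible_range(t, T, mode, k=1):
    """Return the 1-based, inclusive range (first, last) of x and u fed to q(z_t | ...).

    mode is "partial" (k = 1, m = t), "semi" (k > 1, m = t) or "full" (k = m = T).
    """
    if not 1 <= t <= T:
        raise ValueError("t must lie in 1..T")
    if mode == "full":
        return (1, T)
    if mode == "partial":
        return (1, t)
    if mode == "semi":
        if k <= 1:
            raise ValueError("the semi-conditioned model needs a sneak peek k > 1")
        # Only the first step peeks into the future; later steps see x_{1:t}.
        return (1, k) if t == 1 else (1, t)
    raise ValueError("unknown mode: " + repr(mode))


def missing_indices(t, T, mode, k=1):
    """Indices of the observations that are excluded from the conditions at step t."""
    _, last = visible_range(t, T, mode, k)
    return list(range(last + 1, T + 1))
```

For example, with `T = 10` the semi-conditioned model with `k = 7` lets the first step see `x_{1:7}`, while the fully-conditioned model never has missing indices.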

5.2. UAV TRAJECTORIES FROM THE BLACKBIRD DATA SET

We apply semi- and fully-conditioned VSSMs to UAV modelling (Antonini et al., 2018). By discarding the rotational information, we create a system with imperfect state information, cf. section 3.3. Each observation $x_t \in \mathbb{R}^3$ is the location of an unmanned aerial vehicle (UAV) in a fixed global frame. The conditions $u_t \in \mathbb{R}^{14}$ consist of IMU readings, rotor speeds, and pulse-width modulation. The emission model was implemented as a Gaussian with fixed, hand-picked standard deviations, where the mean corresponds to the first three state dimensions (cf. Akhundov et al., 2019): $p(x_t \mid z_t) = N\left(\mu = z_{t,1:3},\, \sigma^2 = [0.15, 0.15, 0.075]\right)$. We leave out the partially-conditioned case, as it cannot infer the higher-order derivatives necessary for rigid-body dynamics. A sneak peek of $k = 7$ for the semi-conditioned model is theoretically sufficient to infer those moments. See appendix C for details. Fully-conditioned models outperform semi-conditioned ones on the test set ELBO, as can be seen in table 2a. We evaluated the models on prefix-sampling, i. e. the predictive performance of $p(x_{t+1:T} \mid x_{1:t}, u_{1:T})$. To restrict the analysis to the found parameters of the generative model only, we inferred the filter distribution $p(z_t \mid x_{1:t}, u_{1:t})$ using a bootstrap particle filter (Sarkka, 2013). By not using the respective approximate posteriors, we ensure fairness between the different models. Samples from the predictive distribution were obtained via ancestral sampling of the generative model. Representative samples are shown in fig. 4. We performed a posterior-predictive check for both models, where we compare the densities of the final observations $x_T$ obtained from prefix sampling in fig. 2. Both evaluations qualitatively illustrate that the predictions of the fully-conditioned approach concentrate more around the true values. In particular the semi-conditioned model struggles more with long-term prediction.
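The bootstrap particle filter used for prefix inference proposes particles from the generative transition and weights them by the emission likelihood. The following is a generic sketch of that scheme on a scalar toy model. The toy transition, the noise levels, the multinomial resampling step and all names are assumptions for illustration and do not reproduce the learned UAV model or the settings of the experiments.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles, init, transition, log_likelihood, rng):
    """Estimate the filtering means E[z_t | x_{1:t}] with a bootstrap particle filter."""
    particles = [init(rng) for _ in range(n_particles)]
    means = []
    for t, x in enumerate(observations):
        if t > 0:
            # Propose from the generative transition (the "bootstrap" proposal).
            particles = [transition(z, rng) for z in particles]
        # Weight by the emission likelihood; subtract the maximum for numerical stability.
        log_w = [log_likelihood(x, z) for z in particles]
        max_log_w = max(log_w)
        weights = [math.exp(lw - max_log_w) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        means.append(sum(w * z for w, z in zip(weights, particles)))
        # Multinomial resampling to counter weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means


# Hypothetical toy state-space model standing in for a learned generative model.
def toy_init(rng):
    return rng.gauss(0.0, 1.0)


def toy_transition(z, rng):
    return 0.9 * z + rng.gauss(0.0, 0.5)


def toy_log_likelihood(x, z):
    # Gaussian emission with standard deviation 0.1; constants cancel after normalisation.
    return -0.5 * ((x - z) / 0.1) ** 2
```

Because the toy emission noise is small, the filtering means should stay close to the observations. Prediction of a suffix would then continue from the resampled particles by ancestral sampling through `toy_transition`.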

5.3. ROW-WISE MNIST

We transformed the MNIST data set into a sequential data set by considering one row in the image plane per time step, from top to bottom. This results in stochastic dynamics: similar initial rows can result in a 3, 8, 9 or 0, so future rows are very informative (cf. section 3.3). Before all experiments, each pixel was binarised by sampling from a Bernoulli with a rate in $[0, 1]$ proportional to the corresponding pixel intensity. The setup was identical to that of section 5.2, except that a Bernoulli likelihood parameterised by a feed-forward neural network was used: $p(x_t \mid z_t) = \mathcal{B}(\mathrm{FNN}_{\theta_E}(z_t))$. No conditions $u_{1:T}$ and a short sneak peek ($k = 1$) were used. The fully-conditioned model outperforms the partially-conditioned one by a large margin, placing it significantly closer to state-of-the-art performance, see table 2b. This is supported by samples from the model, see fig. 3. Cf. appendix D for details. For qualitative evaluation, we used a state-of-the-art classifier (see footnote) to categorise 10,000 samples from each of the models. If the data distribution is learned well, we expect the classifier to behave similarly on both data and generative samples, i. e. yield uniformly distributed class predictions. We report KL-divergences from the uniform distribution of the class prediction distributions in table 2b. A bar plot of the induced class distributions can be found in fig. 3. Only the fully-conditioned model is able to nearly capture a uniform distribution.
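The class-balance score described above can be computed from the predicted labels alone. The sketch below assumes the direction KL(predicted class distribution || uniform) and natural logarithms. The text does not state either choice, so both are assumptions, as is the function name.

```python
import math
from collections import Counter


def kl_from_uniform(predicted_labels, n_classes=10):
    """KL(p_hat || uniform), where p_hat is the empirical distribution of predicted labels.

    Assumption: the divergence is taken from the empirical class distribution to the
    uniform distribution, measured in nats.
    """
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    kl = 0.0
    for c in range(n_classes):
        p = counts.get(c, 0) / n
        if p > 0.0:  # terms with p = 0 contribute nothing
            kl += p * math.log(p * n_classes)
    return kl
```

A sampler that produces every digit equally often scores zero, while a sampler that only ever produces one digit scores `log(10)`.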

5.4. TRAFFIC FLOW

We consider the Seattle loop data set (Cui et al., 2019; 2020) of average speed measurements of cars at 323 locations on motorways near Seattle, from which we selected a single one (42). The dynamics of this system are highly stochastic, which is of special interest for our study, cf. section 3.3. Even though all days start out very similar, traffic jams can emerge suddenly. In this scenario the emission model was a feed-forward network conditioned on the whole latent state, i. e. $p(x_t \mid z_t) = N\left(\mathrm{FNN}_{\theta_E}(z_t), \sigma_x^2\right)$. We compare partially-, semi- ($k = 7$) and fully-conditioned models. See appendix E for details. The results are shown in table 2a. While the fully-conditioned posterior emerges as the best choice on the validation set, the semi-conditioned and the fully-conditioned one are on par on the test set. We suspect that the sneak peek is sufficient to fully capture a sensible initial state approximation. We performed a qualitative evaluation of this scenario as well in the form of prefix sampling. Given the first $t = 12$ observations, the task is to predict the remaining ones for the day, compare section 5.2. We show the results in fig. 4. The fully-conditioned model clearly shows more concentrated yet multi-modal predictive distributions. The partially-conditioned model concentrates too much, and the semi-conditioned one too little.

6. DISCUSSION & CONCLUSION

We studied the effect of leaving out conditions of amortised posteriors, presenting strong theoretical findings and empirical evidence that it can impair maximum likelihood solutions and inference. Our work helps with approximate posterior design and provides intuitions for when full conditioning is necessary. We advocate that partially-conditioned approximate inference models should be used with caution in downstream tasks as they may not be adequate for the task, e. g., replacing a Bayesian filter. In general, we recommend conditioning inference models conforming to the true posterior.

A DETAILED DERIVATIONS

A.1 DERIVATION OF THE SOLUTION TO AN EXPECTED KL-DIVERGENCE

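We want to show that
$$\mathbb{E}_{\bar C \mid C}\left[\mathrm{KL}(q_\phi(z \mid C) \,\|\, p(z \mid C, \bar C))\right] = \mathrm{KL}\left(q_\phi(z \mid C) \,\Big\|\, \exp\left(\mathbb{E}_{\bar C \mid C}\left[\log p(z \mid C, \bar C)\right]\right) / Z\right) - \log Z.$$
This can be seen from
$$\begin{aligned} \mathbb{E}_{\bar C \mid C}\left[\mathrm{KL}(q_\phi(z \mid C) \,\|\, p(z \mid C, \bar C))\right] &= \mathbb{E}_{\bar C \mid C}\left[\mathbb{E}_{q_\phi(z \mid C)}\left[\log \frac{q_\phi(z \mid C)}{p(z \mid C, \bar C)}\right]\right] \\ &= \mathbb{E}_{q_\phi(z \mid C)}\left[\log q_\phi(z \mid C)\right] - \mathbb{E}_{q_\phi(z \mid C)}\left[\log \exp\left(\mathbb{E}_{\bar C \mid C}\left[\log p(z \mid C, \bar C)\right]\right)\right] \\ &= \mathrm{KL}\left(q_\phi(z \mid C) \,\Big\|\, \exp\left(\mathbb{E}_{\bar C \mid C}\left[\log p(z \mid C, \bar C)\right]\right) / Z\right) - \log Z, \end{aligned}$$
where $Z$ is the normalizing constant of $\exp\left(\mathbb{E}_{\bar C \mid C}\left[\log p(z \mid C, \bar C)\right]\right)$.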

A.2 ANALYZING THE OPTIMAL PARTIALLY-CONDITIONED POSTERIOR

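In this section, we show that assuming either $\omega(z) = p(z \mid C)$ or $\omega(z) = p(z \mid C, \bar C)$ implies $p(\bar C \mid z, C) = p(\bar C \mid C)$. We begin by observing
$$\omega(z) \propto \exp\left(\mathbb{E}_{\bar C \mid C}\left[\log p(z \mid C, \bar C)\right]\right) = \exp\left(\mathbb{E}_{\bar C \mid C}\left[\log \frac{p(z \mid C, \bar C)}{p(z \mid C)}\, p(z \mid C)\right]\right) = p(z \mid C) \exp\left(\mathbb{E}_{\bar C \mid C}\left[\log \frac{p(z \mid C, \bar C)}{p(z \mid C)}\right]\right).$$
Now, firstly,
$$\omega(z) = p(z \mid C) \implies \exp\left(\mathbb{E}_{p(\bar C \mid C)}\left[\log \frac{p(z \mid C, \bar C)}{p(z \mid C)}\right]\right) = 1 \implies \frac{p(z \mid C, \bar C)}{p(z \mid C)} = \frac{p(\bar C \mid z, C)}{p(\bar C \mid C)} = 1 \implies p(\bar C \mid z, C) = p(\bar C \mid C).$$
Secondly,
$$\omega(z) = p(z \mid C, \bar C) \implies \exp\left(\mathbb{E}_{p(\bar C \mid C)}\left[\log \frac{p(z \mid C, \bar C)}{p(z \mid C)}\right]\right) \propto \frac{p(z \mid C, \bar C)}{p(z \mid C)} \implies \frac{p(z \mid C, \bar C)}{p(z \mid C)} = \frac{p(\bar C \mid z, C)}{p(\bar C \mid C)} \text{ is constant w. r. t. } \bar C \implies p(\bar C \mid z, C) = p(\bar C \mid C).$$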

A.3 PROOF OF SUBOPTIMAL GENERATIVE MODEL

We investigate whether a maximum likelihood solution $p^\star = \arg\min_p \mathbb{E}_{x_{1:T} \sim \hat p}\left[-\log p(x_{1:T})\right]$ is a minimum of the expected negative ELBO. From calculus of variations (Gelfand & Fomin, 2003), we derive necessary optimality conditions for maximum likelihood and the expected negative ELBO as
$$0 \overset{!}{=} \frac{d\, \mathbb{E}_{x_{1:T} \sim \hat p}\left[-\log p(x_{1:T})\right]}{dp} + \alpha \frac{dG}{dp}, \quad \text{(max. likelihood)} \qquad (7)$$
$$0 \overset{!}{=} \frac{d\, \mathbb{E}_{x_{1:T} \sim \hat p}\left[-\log p(x_{1:T})\right]}{dp} + \frac{d\, \mathbb{E}_{x_{1:T} \sim \hat p}\left[\mathrm{KL}\right]}{dp} + \lambda \frac{dG}{dp}, \quad \text{(ELBO)} \qquad (8)$$
respectively. $G$ is a constraint ensuring that $p$ is a valid density, $\lambda$ and $\alpha$ are Lagrange multipliers. KL refers to the posterior divergence in eq. (2). Equating (7) and (8) and rearranging gives
$$\frac{d\, \mathbb{E}_{x_{1:T} \sim \hat p}\left[\mathrm{KL}\right]}{dp} + (\lambda - \alpha) \frac{dG}{dp} = 0. \qquad (9)$$
Equation (9) is a necessary and sufficient condition (van Erven & Harremoës, 2014) that the Kullback-Leibler divergence is minimised as a function of $p$, which happens when $p(z_t \mid C_t, \bar C_t) = q(z_t \mid C_t)$ for all $t$.

B MODEL DETAILS

We restrict our study to a class of current state-of-the-art variational sequence models: that of residual state-space models. The fact that inference is not performed on the full latent state, but merely the residual, allows more efficient learning through better gradient flow (Karl et al., 2017b; Fraccaro et al., 2016) .

Generative Model

The two Markov assumptions that every observation only depends on the current state and every state only depends on the previous state and condition lead to the following model:
$$p(x_{1:T}, z_{1:T} \mid u_{1:T}) = \prod_{t=1}^T p(x_t \mid z_t)\, p(z_t \mid z_{t-1}, u_{t-1}),$$
where $u_{1:T}$ are additional conditions from the data set (such as control signals) and $z_0 = u_0 := \emptyset$. Such models can be represented efficiently by choosing a convenient parametric form (e. g. neural networks) which map the conditions to the parameters of an appropriate distribution $D$. In this work, we let the emission model $D_x$ be a feed-forward neural network parameterised by $\theta_E$: $p(x_t \mid z_t) = D_x(\mathrm{FNN}_{\theta_E}(z_t))$. For example, we might use a Gaussian distribution that is parameterised by its mean and variance, which are outputs of a feed-forward neural network (FNN). Further, we represent the transition model as a deterministic step plus a component-wise scaled residual:
$$z_t = \mathrm{FNN}_{\theta_T}(z_{t-1}, u_{t-1}) + \mathrm{FNN}_{\theta_G}(z_{t-1}) \odot \epsilon_t, \qquad \epsilon_t \sim D_\epsilon.$$
Here $\mathrm{FNN}_{\theta_G}(z_{t-1})$ produces a gain with which the residual is multiplied element-wise. The residual distribution $D_\epsilon$ is assumed to be zero-centred and independent across time steps. Gaussian noise is hence a convenient choice. The initial state $z_1$ needs to have a separate implementation, as it has no predecessor $z_0$ and is hence structurally different. In this work, we consider a base distribution that is transformed into the distribution of interest by a bijective map, ensuring tractability. We use an inverse auto-regressive flow (Kingma et al., 2016) and write $p_{\theta_{z_1}}(z_1)$ to indicate the dependence on trainable parameters, but do not specify the base distribution further for clarity.

Amortised Inference Models

This formulation allows for an efficient implementation of inference models. Given the residual, the state is a deterministic function of the preceding state, and hence it is only necessary to infer the residual.
Consequently we implement the inference model as such: $q(\epsilon_t \mid z_{1:t-1}, x_{1:T}, u_{1:T}) = D_{\epsilon_t}(\mathrm{FNN}_\phi(\tilde z_t, f_t))$. Note that the state-space assumptions imply $p(\epsilon_t \mid z_{1:t-1}, x_{1:T}) = p(\epsilon_t \mid z_{t-1}, x_{t+1:T})$ (Sarkka, 2013). We reuse the deterministic part of the transition $\tilde z_t = \mathrm{FNN}_{\theta_T}(z_{t-1}, u_{t-1})$. Features $f_{1:T} = (f_1, f_2, \dots, f_T)$ define whether the inference is partially-conditioned and are explained below. Due to the absence of a predecessor, the initial latent state $z_1$ needs special treatment as in the generative case. We employ a separate feed-forward neural network to yield the parameters of the variational posterior at the first time step: $q(z_1 \mid x_{1:T}, u_{1:T}) = D_{z_1}(\mathrm{FNN}_{\phi_{z_1}}(f_1))$.

Under-, Semi-, and Fully-Conditioned Posteriors

We control whether a variational posterior is partially-, semi- or fully-conditioned by the design of the features $f_{1:T}$. In the case of a fully-conditioned approximate posterior, we employ a bidirectional RNN (Schuster & Paliwal, 1997): $f_{1:T} = \mathrm{BiRNN}_{\phi_{f_t}}(x_{1:T}, u_{1:T})$. For the case of a partially-conditioned model, a unidirectional RNN is an option. Since it is common practice (Karl et al., 2017b) to condition the initial inference model on parts of the future as well, we let the feature at the first time step sneak peek into a chunk of length $k$ of the future: $f_1 = \mathrm{FNN}_{\phi_{f_1}}(x_{1:k}, u_{1:k})$, $f_{2:T} = \mathrm{RNN}_{\phi_{f_t}}(x_{1:T}, u_{1:T})$, where the recurrent network is unidirectional. We call this approach a semi-conditioned model.
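The residual parameterisation makes the state a deterministic function of the previous state, the condition and the residual, which is why inferring the residual suffices. The sketch below illustrates this with hand-written stand-ins for the transition and gain networks. These stand-in functions, their constants and all names are hypothetical and only mimic the structure of the model, not its learned form.

```python
import math


def transition_mean(z_prev, u_prev):
    """Hypothetical stand-in for the deterministic step FNN_{theta_T}(z_{t-1}, u_{t-1})."""
    return [0.9 * z + 0.1 * u for z, u in zip(z_prev, u_prev)]


def gain(z_prev):
    """Hypothetical stand-in for FNN_{theta_G}(z_{t-1}); kept strictly positive."""
    return [0.1 + 0.05 * math.tanh(z) ** 2 for z in z_prev]


def step(z_prev, u_prev, eps):
    """z_t = deterministic step + gain * residual, applied element-wise."""
    mean = transition_mean(z_prev, u_prev)
    return [m + g * e for m, g, e in zip(mean, gain(z_prev), eps)]


def residual(z_prev, u_prev, z_next):
    """Recover the residual eps_t from consecutive states by inverting the step."""
    mean = transition_mean(z_prev, u_prev)
    return [(zn - m) / g for zn, m, g in zip(z_next, mean, gain(z_prev))]
```

Since the gain is non-zero, `residual` inverts `step` exactly, so a posterior over the residual determines a posterior over the next state.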



https://github.com/keras-team/keras/blob/2.4.0/examples/mnist_cnn.py



Figure 2: Posterior-predictive check of prefix-sampling on the Blackbird data set. Possible futures are sampled from the model after having observed a prefix $x_{1:t}$. The state at the end of the prefix is inferred with a bootstrap particle filter. Each plot shows a kernel density estimate of the distribution over the final location $x_T$, once for the semi-conditioned model in green and once for the fully-conditioned model in blue. The true value is marked as a vertical, black line. The fully-conditioned model assigns higher likelihood in almost all cases and is more concentrated around the truth.

Figure 3: Left: Class distributions of the respective image distributions induced by a state-of-the-art classifier. The data distribution is close to uniform, except for 5. The fully-conditioned model yields too few 5's and is close to uniform for the other digits. The partially-conditioned model only captures the right frequencies of 1 and 3. Right: Comparison of generative sampling on row-wise MNIST. Samples from the data distribution are shown on the left. The middle and right show samples from models with a partially- and a fully-conditioned approximate posterior, respectively.

Figure 4: Comparison of prefix-sampling. Possible futures $x^{(i)}_{k+1:T}$ (coloured lines) are sampled from the model after having observed a prefix $x_{1:k}$ (solid black line) and then compared to the true suffix $x_{k+1:T}$ (dashed line). The state at the end of the prefix is inferred with a bootstrap particle filter. Left: UAV data, top-down view. Right: Traffic flow data, time-series view.

Overview of partial conditions for sequential inference networks q(z t | C t ) in the literature. C t denotes missing conditions vs. the true posterior according to the respective graphical model. DKF acknowledges the true factorization, but does not use z t-1 in any experiments.

Results on UAV and row-wise MNIST modelling.

C.2 HYPER-PARAMETERS OF SEMI-CONDITIONED MODEL

D DETAILS ON THE EXPERIMENTAL SETUP OF MNIST MODELLING

We conducted a hyper-parameter search of 64 experiments for 15,000 iterations. The 5 best experiments (according to the ELBO at the last iteration) were continued for 85,000 further iterations.

E DETAILS ON THE EXPERIMENTAL SETUP OF TRAFFIC FLOW MODELLING

The data is down-sampled to contain average speeds over 30-minute windows from 6:30 to 19:00, resulting in 26 time steps per day. The data was split into training, validation and testing data by month: January up to July for training, July to September for validation and the remainder for testing. The standard deviation $\sigma_x$ was determined during hyper-parameter optimisation along with the architectural and optimisation parameters. A hyper-parameter search of 128 configurations was conducted. After 150,000 weight updates, the model with the lowest negative ELBO on the validation set was selected for each of the partially-, semi- and fully-conditioned variants.
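The count of 26 time steps per day can be checked with a short calculation. The sketch below assumes that each window is labelled by its start time and that both 6:30 and 19:00 are included as labels. This is one reading consistent with the stated number, not a confirmed detail of the preprocessing, and the calendar date used is arbitrary.

```python
from datetime import datetime, timedelta

# Arbitrary date; only the time of day matters here.
start = datetime(2000, 1, 1, 6, 30)
end = datetime(2000, 1, 1, 19, 0)
step = timedelta(minutes=30)

# Collect window labels, including both endpoints (an assumption).
window_starts = []
t = start
while t <= end:
    window_starts.append(t.time())
    t += step

print(len(window_starts))  # 26
```

Counting half-open windows between 6:30 and 19:00 instead would give 25 steps, so the inclusive labelling is the reading that matches the reported sequence length.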

