MIND THE GAP WHEN CONDITIONING AMORTISED INFERENCE IN SEQUENTIAL LATENT-VARIABLE MODELS

Abstract

Amortised inference enables scalable learning of sequential latent-variable models (LVMs) with the evidence lower bound (ELBO). In this setting, variational posteriors are often only partially conditioned: while the true posteriors depend, e.g., on the entire sequence of observations, approximate posteriors are only informed by past observations. This mimics the Bayesian filter, a mixture of smoothing posteriors. Yet, we show that the ELBO objective forces partially-conditioned amortised posteriors to approximate products of smoothing posteriors instead. Consequently, the learned generative model is compromised. We demonstrate these theoretical findings in three scenarios: traffic flow, handwritten digits, and aerial vehicle dynamics. Using fully-conditioned approximate posteriors, performance improves in terms of generative modelling and multi-step prediction.

1. INTRODUCTION

Variational inference has paved the way towards learning deep latent-variable models (LVMs): maximising the evidence lower bound (ELBO) approximates maximum likelihood learning (Jordan et al., 1999; Beal, 2003; MacKay, 2003). An efficient variant is amortised variational inference, where the approximate posteriors are represented by a deep neural network, the inference network (Hinton et al., 1995). It produces the parameters of the variational distribution for each observation in a single forward pass, in contrast to classical variational inference with a full optimisation process per sample. The framework of variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014) adds the reparameterisation trick for low-variance gradient estimates of the inference network. Learning of deep generative models is hence both tractable and flexible, as all its parts can be represented by neural networks and fitted by stochastic gradient descent. The quality of the solution is largely determined by how well the true posterior is approximated: the gap between the ELBO and the log marginal likelihood is the KL divergence from the approximate to the true posterior. Recent works have proposed ways of closing this gap, suggesting tighter alternatives to the ELBO (Burda et al., 2016; Mnih & Rezende, 2016) or more expressive variational posteriors (Rezende & Mohamed, 2015; Mescheder et al., 2017). Cremer et al. (2018) provide an analysis of this gap, splitting it into two. The approximation gap is caused by restricting the approximate posterior to a specific parametric form, the variational family. The amortisation gap comes from the inference network failing to produce the optimal parameters within the family. In this work, we address a previously unstudied aspect of amortised posterior approximations: successful approximate posterior design is not merely about picking a parametric form of a probability density.
It is also about carefully deciding which conditions to feed into the inference network. This is a particular problem in sequential LVMs, where it is common practice to feed only a restricted set of inputs into the inference network: not conditioning on latent variables from other time steps can vastly improve efficiency (Bayer & Osendorfer, 2014; Lai et al., 2019). Further, leaving out observations from the future structurally mimics the Bayesian filter, useful for state estimation in the control of partially observable systems (Karl et al., 2017b; Hafner et al., 2019; Lee et al., 2020). We analyse the emerging conditioning gap and the resulting suboptimality for sequential amortised inference models. If the variational posterior is only partially conditioned, all those true posteriors that agree on the included conditions but differ on the excluded ones must share a single approximate posterior. The result is a suboptimal compromise over many different true posteriors, one that cannot be mitigated by a more expressive variational family or a higher-capacity inference network. We empirically show its effects in an extensive study on three real-world data sets for the common use case of variational state-space models.
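As a structural sketch (not an architecture from the paper), the difference between a partially-conditioned and a fully-conditioned inference network can be made concrete with two toy recurrent encoders. All names and parameters below are hypothetical; a filtering-style encoder produces the parameters of q(z_t) from x_{1:t} only, while a smoothing-style encoder adds a backward pass over the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
T, dx, dh = 5, 3, 8                    # sequence length, obs dim, state dim
X = rng.standard_normal((T, dx))       # an observation sequence x_1 .. x_T

# Hypothetical encoder parameters (randomly initialised for the sketch).
Wf = rng.standard_normal((dh, dh)) * 0.5   # forward recurrence
Wb = rng.standard_normal((dh, dh)) * 0.5   # backward recurrence
U = rng.standard_normal((dh, dx)) * 0.5    # input projection
Wout = rng.standard_normal((2, dh)) * 0.5  # -> (mean, log-variance) of q(z_t)

def rnn(X, W):
    """Simple tanh RNN; returns the hidden state at every step."""
    h, hs = np.zeros(dh), []
    for x in X:
        h = np.tanh(W @ h + U @ x)
        hs.append(h)
    return np.stack(hs)

def q_filtering(X):
    # Partially conditioned: parameters of q(z_t) depend on x_{1:t} only.
    return rnn(X, Wf) @ Wout.T

def q_smoothing(X):
    # Fully conditioned: a backward pass adds the future, q(z_t | x_{1:T}).
    h_fwd = rnn(X, Wf)
    h_bwd = rnn(X[::-1], Wb)[::-1]
    return (h_fwd + h_bwd) @ Wout.T

X2 = X.copy()
X2[-1] += 10.0                         # perturb only the *last* observation

# The filtering posterior at t = 0 ignores the future; the smoothing
# posterior at t = 0 reacts to it.
unchanged = np.allclose(q_filtering(X2)[0], q_filtering(X)[0])
changed = not np.allclose(q_smoothing(X2)[0], q_smoothing(X)[0])
```

The filtering encoder can only react to an observation once it has arrived; the smoothing encoder can revise every q(z_t) in light of the full sequence. The point of the analysis that follows is that this restriction interacts badly with the ELBO itself, beyond the obvious loss of information.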

2.1. VARIATIONAL AUTO-ENCODERS

Consider the task of fitting the parameters of a latent-variable model p_θ(x, z) = p_θ(x | z) p_θ(z) to a target distribution p(x) by maximum likelihood:

    arg max_θ E_{x∼p(x)}[log p_θ(x)].    (1)

In the case of variational auto-encoders (VAEs), the prior distribution p_θ(z) is often simple, such as a standard Gaussian. The likelihood p_θ(x | z), on the other hand, is typically represented by a deep neural network producing the likelihood parameters as a function of z. The log marginal likelihood log p_θ(x) = log ∫ p_θ(x, z) dz contains a challenging integral inside a logarithm; the maximisation of eq. (1) is thus generally intractable. Yet, for a single sample x, the log marginal likelihood can be bounded from below by the evidence lower bound (ELBO):

    log p_θ(x) ≥ log p_θ(x) − KL(q(z) || p_θ(z | x))    (2)
               = E_{z∼q}[log p_θ(x | z)] − KL(q(z) || p_θ(z)) =: −ℒ(θ, q, x),

where a surrogate distribution q, the variational posterior, is introduced. The bound is tighter the better q(z) approximates p_θ(z | x); the gap is exactly the posterior KL divergence in eq. (2) and vanishes when the two coincide.
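The bound in eq. (2) can be checked numerically in a toy linear-Gaussian model, where log p_θ(x) is tractable. The following minimal sketch (model and numbers are illustrative, not from the paper) estimates the ELBO by Monte Carlo with the reparameterisation trick and a closed-form Gaussian KL:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: p(z) = N(0, 1), p(x | z) = N(z, sx^2). Here log p(x) is
# tractable, so the tightness of the bound can be checked directly.
sx = 0.5

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo(x, q_mean, q_std, n_samples=10_000):
    """E_{z~q}[log p(x | z)] - KL(q(z) || p(z)).

    Expectation by Monte Carlo with reparameterised samples; the KL
    between the Gaussian q and the standard normal prior is closed form."""
    eps = rng.standard_normal(n_samples)
    z = q_mean + q_std * eps                   # reparameterisation trick
    rec = log_normal(x, z, sx ** 2).mean()     # expected log-likelihood
    kl = 0.5 * (q_std ** 2 + q_mean ** 2 - 1.0) - np.log(q_std)
    return rec - kl

x = 1.3
log_px = log_normal(x, 0.0, 1.0 + sx ** 2)     # exact log evidence

# The exact posterior p(z | x) is Gaussian; using it as q makes the
# bound tight (up to Monte Carlo noise).
post_var = sx ** 2 / (1.0 + sx ** 2)
post_mean = x / (1.0 + sx ** 2)

tight = elbo(x, post_mean, np.sqrt(post_var))
loose = elbo(x, 0.0, 1.0)                      # mismatched q -> visible gap
```

With q equal to the true posterior, the estimate matches log p_θ(x) up to sampling noise; with a mismatched q, the gap in eq. (2) appears as a strictly smaller value.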



Figure 1: Illustration of the effect of partial conditioning. Consider a latent-variable model p(C, C̄ | z) p(z), with binary C̄ and arbitrary C. We omit C̄ from the amortised approximate posterior q(z | C). Shown are two cases of separated (left) and overlapping (right) Gaussian true posteriors. Top row: the full posteriors p(z | C, C̄ = 0) and p(z | C, C̄ = 1) as well as their average, the marginal posterior p(z | C). Middle row: the variational Gaussian approximation ν(z | C) ≈ p(z | C) to the marginal posterior, obtained by stochastic gradient descent on the reverse, mode-seeking KL divergence (Hoffman et al., 2013). Bottom row: the optimal ω(z | C) obtained by optimising the ELBO with a partially-conditioned amortised approximate posterior w. r. t. q. It is located far away from the modes, is sharply peaked, and shares little mass with the true full posteriors, the marginal posterior, and the approximate marginal posterior.
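The compromise in the bottom row can be reproduced numerically under simple Gaussian assumptions. In the sketch below (the means and variances are illustrative, not taken from the paper), two well-separated true posteriors must share one partially-conditioned Gaussian q(z | C); since each per-condition ELBO equals the log evidence minus KL(q || true posterior), maximising the average ELBO amounts to minimising the average reverse KL:

```python
import numpy as np

# Two true posteriors that must share one partially-conditioned q(z | C):
# p(z | C, Cbar = 0) = N(-2, 0.5^2), p(z | C, Cbar = 1) = N(+2, 0.5^2).
mu = np.array([-2.0, 2.0])
sigma = 0.5

def kl_gauss(m, s, mu_c, sigma_c):
    """KL( N(m, s^2) || N(mu_c, sigma_c^2) ), closed form."""
    return (np.log(sigma_c / s)
            + (s ** 2 + (m - mu_c) ** 2) / (2 * sigma_c ** 2) - 0.5)

def npdf(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Grid search over the shared Gaussian q(z | C) = N(m, s^2), minimising
# the reverse KL averaged over the two values of Cbar.
ms = np.linspace(-3.0, 3.0, 601)
ss = np.linspace(0.1, 2.0, 191)
M, S = np.meshgrid(ms, ss, indexing="ij")
avg_kl = 0.5 * (kl_gauss(M, S, mu[0], sigma) + kl_gauss(M, S, mu[1], sigma))

i, j = np.unravel_index(avg_kl.argmin(), avg_kl.shape)
m_opt, s_opt = ms[i], ss[j]

# The optimum sits between the modes rather than on either of them, so
# the shared q places almost no mass where the true posteriors do.
mass_ratio = npdf(mu[1], m_opt, s_opt) / npdf(mu[1], mu[1], sigma)
```

Here the optimal compromise lands at m ≈ 0 with a width far narrower than the bimodal marginal posterior, and its density at either true mode is a tiny fraction of the corresponding posterior's peak, mirroring the behaviour of ω(z | C) in the figure.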

