MIND THE GAP WHEN CONDITIONING AMORTISED INFERENCE IN SEQUENTIAL LATENT-VARIABLE MODELS

Abstract

Amortised inference enables scalable learning of sequential latent-variable models (LVMs) with the evidence lower bound (ELBO). In this setting, variational posteriors are often only partially conditioned. While the true posteriors depend, e.g., on the entire sequence of observations, approximate posteriors are only informed by past observations. This mimics the Bayesian filter, a mixture of smoothing posteriors. Yet, we show that the ELBO objective forces partially-conditioned amortised posteriors to approximate products of smoothing posteriors instead. Consequently, the learned generative model is compromised. We demonstrate these theoretical findings in three scenarios: traffic flow, handwritten digits, and aerial vehicle dynamics. Using fully-conditioned approximate posteriors, performance improves in terms of generative modelling and multi-step prediction.

1. INTRODUCTION

Variational inference has paved the way towards learning deep latent-variable models (LVMs): maximising the evidence lower bound (ELBO) approximates maximum likelihood learning (Jordan et al., 1999; Beal, 2003; MacKay, 2003). An efficient variant is amortised variational inference, where the approximate posteriors are represented by a deep neural network, the inference network (Hinton et al., 1995). It produces the parameters of the variational distribution for each observation in a single forward pass, in contrast to classical variational inference, which runs a full optimisation process per sample. The framework of variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014) adds the reparameterisation trick for low-variance gradient estimates of the inference network. Learning deep generative models is hence both tractable and flexible, as all parts can be represented by neural networks and fit by stochastic gradient descent.

The quality of the solution is largely determined by how well the true posterior is approximated: the gap between the ELBO and the log marginal likelihood is the KL divergence from the approximate to the true posterior. Recent works have proposed ways of closing this gap, suggesting tighter alternatives to the ELBO (Burda et al., 2016; Mnih & Rezende, 2016) or more expressive variational posteriors (Rezende & Mohamed, 2015; Mescheder et al., 2017). Cremer et al. (2018) provide an analysis of this gap, splitting it into two parts. The approximation gap is caused by restricting the approximate posterior to a specific parametric form, the variational family. The amortisation gap comes from the inference network failing to produce the optimal parameters within that family.

In this work, we address a previously unstudied aspect of amortised posterior approximations: successful approximate posterior design is not merely about picking a parametric form of a probability density.
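The gap decomposition discussed above can be written out explicitly. The notation below is a sketch in standard form, not taken from this paper: $x$ is an observation, $z$ the latent variable, $q_\phi(z \mid x)$ the amortised posterior, and $q^*$ the best posterior within the chosen variational family.

```latex
% Exact identity: the ELBO is tight up to the KL divergence
% from the approximate to the true posterior.
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[
      \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\mathrm{ELBO}(q_\phi)}
  + \mathrm{KL}\!\left(q_\phi(z \mid x) \,\big\|\, p_\theta(z \mid x)\right)

% Split of the gap in the sense of Cremer et al. (2018),
% with q^* the optimal member of the variational family:
\log p_\theta(x) - \mathrm{ELBO}(q_\phi)
  = \underbrace{\log p_\theta(x) - \mathrm{ELBO}(q^*)}_{\text{approximation gap}}
  \;+\; \underbrace{\mathrm{ELBO}(q^*) - \mathrm{ELBO}(q_\phi)}_{\text{amortisation gap}}
```

Both gap terms are non-negative: the first because $q^*$ still lies in a restricted family, the second because the inference network need not output the parameters of $q^*$.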
It is also about carefully deciding which conditions to feed into the inference network. This is a particular problem in sequential LVMs, where it is common practice to feed only a restricted set of inputs into the inference network: not conditioning on latent variables from other time steps can vastly improve efficiency (Bayer & Osendorfer, 2014; Lai et al., 2019) . Further, leaving out

