DETERMINISTIC TRAINING OF GENERATIVE AUTOENCODERS USING INVERTIBLE LAYERS

Abstract

In this work, we provide a deterministic alternative to the stochastic variational training of generative autoencoders. We refer to these new generative autoencoders as AutoEncoders within Flows (AEF), since the encoder and decoder are defined as affine layers of an overall invertible architecture. This results in a deterministic encoding of the data, as opposed to the stochastic encoding of VAEs. The paper introduces two related families of AEFs. The first family relies on a partition of the ambient space and is trained by exact maximum-likelihood. The second family exploits a deterministic expansion of the ambient space and is trained by maximizing the log-probability in this extended space. This latter case leaves complete freedom in the choice of encoder, decoder and prior architectures, making it a drop-in replacement for the training of existing VAEs and VAE-style models. We show that these AEFs can have strikingly higher performance than architecturally identical VAEs in terms of log-likelihood and sample quality, especially for low dimensional latent spaces. Importantly, we show that AEF samples are substantially sharper than VAE samples.

1. INTRODUCTION

Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) have maintained an enduring popularity in the machine learning community in spite of the impressive performance of other generative models (Goodfellow et al., 2014; Karras et al., 2020; Van Oord et al., 2016; Van den Oord et al., 2016; Salimans et al., 2017; Dinh et al., 2014; Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma & Dhariwal, 2018; Sohl-Dickstein et al., 2015; Nichol & Dhariwal, 2021). One key feature of VAEs is their ability to project complex data into a semantically meaningful set of latent variables. This feature is considered particularly useful in fields such as model-based reinforcement learning, where temporally linked VAE architectures form the backbone of most state-of-the-art world models (Ha & Schmidhuber, 2018b; a; Hafner et al., 2020; Gregor et al., 2019; Zintgraf et al., 2020; Hafner et al., 2021). Another attractive feature of VAEs is that they leave ample architectural freedom when compared with other likelihood-based generative models, with their signature encoder-decoder architectures being popular in many areas of ML besides generative modeling (Ronneberger et al., 2015; Vaswani et al., 2017; Radford et al., 2021; Ramesh et al., 2021; 2022). However, VAE training is complicated by the lack of a closed-form expression for the log-likelihood, with the variational gap between the surrogate loss (i.e. the ELBO) and the true log-likelihood being responsible for unstable training and, at least in non-hierarchical models, sub-optimal encoding and sample quality (Hoffman & Johnson, 2016; Zhao et al., 2017; Alemi et al., 2018; Cremer et al., 2018; Mattei & Frellsen, 2018). Consequently, a large fraction of VAE research is devoted to tightening the gap between the ELBO and the true likelihood of the model.
Gap reduction can be achieved either by devising alternative lower bounds (Burda et al., 2015; Bamler et al., 2017) or by using more flexible parameterized posterior distributions (Rezende & Mohamed, 2015; Kingma et al., 2016). Normalizing flows (NF) are deep generative models comprised of tractably invertible layers, whose log-likelihood can be computed in closed form using the change of variables formula (Kobyzev et al., 2020; Papamakarios et al., 2021). However, this constraint appears to be at odds with autoencoder architectures, which map all relevant information into a latent space of different (often smaller) dimensionality. This is potentially problematic since naturalistic data such as images and speech waveforms are thought to live, at least approximately, in a lower dimensional manifold of the ambient space (Bengio et al., 2013; F. et al., 2016; Pope et al., 2021). It is therefore common to use hybrid VAE-flow models that deploy NFs for modeling the VAE prior and/or the variational posterior (Rezende & Mohamed, 2015; Kingma et al., 2016). However, training these models is often a delicate business, as changes in the encoder and the prior cause a misalignment from the posterior, increasing the gap and causing a shifting-target dynamic that introduces instabilities and decreases performance. For this reason, the complex autoregressive or flow priors common in modern applications are often trained ex post, after VAE training (Van Den Oord et al., 2017; Razavi et al., 2019).

In this paper we introduce a new approach for training VAE-style architectures with deterministically encoded latents. The key insight is that we can formulate an autoencoder within a conventional invertible architecture by using invertible affine layers and by keeping track of the deviations between data and predictions.
Importantly, this can be done while leaving complete freedom in the design of the encoder, decoder and prior, which makes our approach a drop-in replacement for the training of existing VAE and VAE-style models. The resulting models can be trained by maximum-likelihood using the change of variables formula. We denote these new generative autoencoders as autoencoders within flows (AEF), since the autoencoder architecture is constructed inside a NF architecture.
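To make the change of variables formula mentioned above concrete, the following is a minimal NumPy sketch of exact maximum-likelihood evaluation under a single elementwise affine flow with a standard Gaussian base distribution. The function names and the diagonal parameterization are our own illustrative assumptions, not the architecture proposed in the paper:

```python
import numpy as np

def affine_forward(z, log_scale, shift):
    """Invertible elementwise affine map x = exp(log_scale) * z + shift."""
    x = np.exp(log_scale) * z + shift
    log_det = np.sum(log_scale)  # log |det dx/dz| for a diagonal affine map
    return x, log_det

def log_prob_standard_normal(z):
    """Log-density of a standard Gaussian base distribution."""
    return -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1)

def flow_log_likelihood(x, log_scale, shift):
    """Change of variables: log p(x) = log p_base(f^{-1}(x)) + log |det df^{-1}/dx|."""
    z = (x - shift) * np.exp(-log_scale)  # invert the affine map
    log_det = -np.sum(log_scale)          # Jacobian log-determinant of the inverse
    return log_prob_standard_normal(z) + log_det
```

Stacking many such invertible layers, each with a tractable Jacobian log-determinant, yields the standard NF training objective; AEFs reuse this machinery while embedding an encoder-decoder structure inside the invertible architecture.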

2. PRELIMINARIES

In this section, we will outline the standard theory behind probabilistic generative modeling and non-linear dimensionality reduction. Consider a dataset comprised of data-points x ∈ R^N. We refer to R^N as the ambient space. The dataset is assumed to be sampled from a D-dimensional curved manifold M embedded in the ambient space. We refer to M as the signal space. The dimensionality of the signal space reflects the true dimensionality of the signal, while the dimensionality of the ambient space depends on the particularities of the measurement device (e.g. the nominal resolution of the camera).

Variational autoencoders: VAEs are deep generative models in which the density of each data-point depends on a D-dimensional stochastic latent variable z ∈ R^D, which parameterizes the signal space. The emission model is often assumed to be a diagonal Gaussian with parameters determined by deep architectures: p(x | z; θ) = N(x; f(z; θ), f_s(z; θ)), where θ denotes the model parameters. In this formula, the parameterized functions f(z; θ) and f_s(z; θ) are the outputs of a decoder architecture. The emission model is paired with a prior p_0(z; θ) over the latents. While the marginal likelihood is intractable, it is possible to derive a lower bound (the ELBO) by introducing a parameterized approximate posterior defined by the encoder architectures g_m(x; ψ) and g_s(x; ψ), which respectively return the posterior mean and scale over the latent variables. An additional normalizing flow transformation n_post^{-1}(·; ψ) is often included in order to account for the non-Gaussianity of the posterior (Kingma et al., 2016). Stochastic estimates of the gradient of the ELBO can be computed by expressing samples from the posterior as a differentiable deterministic function of the random samples (Kingma & Welling, 2014; Rezende et al., 2014).
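The ELBO construction above can be sketched numerically. The following is a minimal NumPy toy example of a single-sample ELBO estimate with a diagonal Gaussian posterior, Gaussian emission model with fixed residual variance, and a standard Gaussian prior; the linear "encoder" and "decoder" weights and the fixed posterior scale are illustrative assumptions, not the deep architectures used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma2 = 8, 2, 0.1  # ambient dim, latent dim, residual variance (toy values)

# Toy linear "encoder" and "decoder" weights; real models use deep architectures.
W_enc = rng.normal(size=(D, N)) * 0.1
W_dec = rng.normal(size=(N, D)) * 0.1

def elbo_single_sample(x):
    # Encoder: posterior mean g_m(x) and scale g_s(x) (scale held fixed here).
    g_m, g_s = W_enc @ x, np.full(D, 0.5)
    eps = rng.normal(size=D)
    z = g_m + g_s * eps                          # Gaussian reparameterization
    x_hat = W_dec @ z                            # decoder mean f(z)
    # log p(x | z): isotropic Gaussian emission with variance sigma2
    log_lik = (-0.5 * np.sum((x - x_hat)**2) / sigma2
               - 0.5 * N * np.log(2 * np.pi * sigma2))
    # log p_0(z): standard Gaussian prior over the latents
    log_prior = -0.5 * np.sum(z**2 + np.log(2 * np.pi))
    # log q(z | x): diagonal Gaussian posterior, evaluated at the sampled z
    log_post = -0.5 * np.sum(eps**2 + np.log(2 * np.pi)) - np.sum(np.log(g_s))
    return log_lik + log_prior - log_post        # single-sample ELBO estimate

print(elbo_single_sample(rng.normal(size=N)))
```

Averaging this estimate over many noise samples ε approximates the ELBO, whose gradient with respect to the encoder and decoder weights can then be taken directly thanks to the reparameterization.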
For our purposes, it is important to notice that the Gaussian reparameterization formula

z(x, ϵ; ψ) = g_m(x; ψ) + g_s(x; ψ) ⊙ ϵ ,    (1)

defines an affine invertible layer, formally analogous to those used in RealNVPs and related NFs (Dinh et al., 2017; Papamakarios et al., 2017). In the simplified case of a Gaussian residual model with variance σ^2, the reparameterization of the ELBO leads to the following surrogate objective function
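The invertibility of Eq. (1) is easy to verify directly: for a fixed input x, the map from ϵ to z is elementwise affine, so both its inverse and its Jacobian log-determinant are available in closed form, exactly as in a RealNVP coupling layer. A minimal NumPy sketch (function names are our own):

```python
import numpy as np

def reparam_forward(eps, g_m, g_s):
    """Eq. (1): z = g_m(x) + g_s(x) ⊙ ε, an affine map in ε for fixed x."""
    return g_m + g_s * eps

def reparam_inverse(z, g_m, g_s):
    """Exact inverse ε = (z - g_m) / g_s, with log |det dε/dz| = -Σ log g_s."""
    eps = (z - g_m) / g_s
    log_det = -np.sum(np.log(g_s))
    return eps, log_det
```

This closed-form inverse and log-determinant are what allow the reparameterization layer to be treated as one affine layer of an overall invertible (flow) architecture.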

