DETERMINISTIC TRAINING OF GENERATIVE AUTOENCODERS USING INVERTIBLE LAYERS

Abstract

In this work, we provide a deterministic alternative to the stochastic variational training of generative autoencoders. We refer to these new generative autoencoders as AutoEncoders within Flows (AEF), since the encoder and decoder are defined as affine layers of an overall invertible architecture. This results in a deterministic encoding of the data, as opposed to the stochastic encoding of VAEs. The paper introduces two related families of AEFs. The first family relies on a partition of the ambient space and is trained by exact maximum-likelihood. The second family exploits a deterministic expansion of the ambient space and is trained by maximizing the log-probability in this extended space. This latter case leaves complete freedom in the choice of encoder, decoder and prior architectures, making it a drop-in replacement for the training of existing VAEs and VAE-style models. We show that these AEFs can have strikingly higher performance than architecturally identical VAEs in terms of log-likelihood and sample quality, especially for low dimensional latent spaces. Importantly, we show that AEF samples are substantially sharper than VAE samples.

1. INTRODUCTION

Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) have maintained an enduring popularity in the machine learning community in spite of the impressive performance of other generative models (Goodfellow et al., 2014; Karras et al., 2020; Van Oord et al., 2016; Van den Oord et al., 2016; Salimans et al., 2017; Dinh et al., 2014; Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma & Dhariwal, 2018; Sohl-Dickstein et al., 2015; Nichol & Dhariwal, 2021). One key feature of VAEs is their ability to project complex data into a semantically meaningful set of latent variables. This feature is considered particularly useful in fields such as model-based reinforcement learning, where temporally linked VAE architectures form the backbone of most state-of-the-art world models (Ha & Schmidhuber, 2018a;b; Hafner et al., 2020; Gregor et al., 2019; Zintgraf et al., 2020; Hafner et al., 2021). Another attractive feature of VAEs is that they leave ample architectural freedom when compared with other likelihood-based generative models, with their signature encoder-decoder architectures being popular in many areas of ML besides generative modeling (Ronneberger et al., 2015; Vaswani et al., 2017; Radford et al., 2021; Ramesh et al., 2021; 2022). However, VAE training is complicated by the lack of a closed-form expression for the log-likelihood, with the variational gap between the surrogate loss (i.e. the ELBO) and the true log-likelihood being responsible for unstable training and, at least in non-hierarchical models, sub-optimal encoding and sample quality (Hoffman & Johnson, 2016; Zhao et al., 2017; Alemi et al., 2018; Cremer et al., 2018; Mattei & Frellsen, 2018). Consequently, a large fraction of VAE research is devoted to tightening the gap between the ELBO and the true likelihood of the model. 
Gap reduction can be achieved either by devising alternative lower bounds (Burda et al., 2015; Bamler et al., 2017) or by using more flexible parameterized posterior distributions (Rezende & Mohamed, 2015; Kingma et al., 2016).

Normalizing flows (NFs) are deep generative models composed of tractably invertible layers, whose log-likelihood can be computed in closed form using the change of variables formula (Kobyzev et al., 2020; Papamakarios et al., 2021). However, this invertibility constraint appears to be at odds with autoencoder architectures, which map all relevant information into a latent space of different (often smaller) dimensionality. This is potentially problematic since naturalistic data such as images and speech waveforms are thought to live, at least approximately, on a lower-dimensional manifold of the ambient space (Bengio et al., 2013; F. et al., 2016; Pope et al., 2021). It is therefore common to use hybrid VAE-flow models that deploy NFs for modeling the VAE prior and/or the variational posterior (Rezende & Mohamed, 2015; Kingma et al., 2016). However, training these models is often delicate, as changes in the encoder and the prior cause a misalignment from the posterior, increasing the gap and causing a shifting-target dynamic that introduces instabilities and decreases performance. For this reason, the complex autoregressive or flow priors common in modern applications are often trained ex-post after VAE training (Van Den Oord et al., 2017; Razavi et al., 2019).

In this paper we introduce a new approach for training VAE-style architectures with deterministically encoded latents. The key insight is that we can formulate an autoencoder within a conventional invertible architecture by using invertible affine layers and by keeping track of the deviations between data and predictions. 
Importantly, this can be done while leaving complete freedom in the design of the encoder, decoder and prior, which makes our approach a drop-in replacement for the training of existing VAE and VAE-style models. The resulting models can be trained by maximum-likelihood using the change of variables formula. We denote these new generative autoencoders as autoencoders within flows (AEF), since the autoencoder architecture is constructed inside a NF architecture.

2. PRELIMINARIES

In this section, we outline the standard theory behind probabilistic generative modeling and non-linear dimensionality reduction. Consider a dataset comprised of data points $x \in \mathbb{R}^N$. We refer to $\mathbb{R}^N$ as the ambient space. The dataset is assumed to be sampled from a $D$-dimensional curved manifold $\mathcal{M}$ embedded in the ambient space. We refer to $\mathcal{M}$ as the signal space. The dimensionality of the signal space reflects the true dimensionality of the signal, while the dimensionality of the ambient space depends on the particularities of the measurement device (e.g. the nominal resolution of the camera).

Variational autoencoders:

VAEs are deep generative models in which the density of each data point depends on a $D$-dimensional stochastic latent variable $z \in \mathbb{R}^D$, which parameterizes the signal space. The emission model is often assumed to be a diagonal Gaussian with parameters determined by deep architectures:
$$p(x_j \mid z; \theta) = \mathcal{N}\big(x_j;\, f_j(z; \theta),\, f_{s,j}(z; \theta)\big),$$
where $\theta$ denotes the model parameters. In this formula, the parameterized functions $f(z; \theta)$ and $f_s(z; \theta)$ are the outputs of a decoder architecture. The emission model is paired with a prior $p_0(z; \theta)$ over the latents. While the marginal likelihood is intractable, it is possible to derive a lower bound (the ELBO) by introducing a parameterized approximate posterior defined by the encoder architectures $g_m(x; \psi)$ and $g_s(x; \psi)$, which respectively return the posterior mean and scale over the latent variables. An additional normalizing flow transformation $n^{-1}_{\text{post}}(\cdot; \psi)$ is often included in order to account for the non-Gaussianity of the posterior (Kingma et al., 2016). Stochastic estimates of the gradient of the ELBO can be computed by expressing samples from the posterior as a differentiable deterministic function of the random samples (Kingma & Welling, 2014; Rezende et al., 2014). 
For our purposes, it is important to notice that the Gaussian reparameterization formula
$$z(x, \epsilon; \psi) = g_m(x; \psi) + g_s(x; \psi) \odot \epsilon, \tag{1}$$
defines an affine invertible layer, formally analogous to those used in RealNVPs and related NFs (Dinh et al., 2017; Papamakarios et al., 2017). In the simplified case of a Gaussian residual model with variance $\sigma^2$, the reparameterization of the ELBO leads to the following surrogate objective function for one data point $x$:
$$\mathcal{L}_{\text{VAE}}(\theta, \sigma^2, \psi) = \mathbb{E}_{\epsilon}\Big[\underbrace{\frac{1}{2\sigma^2}\big\|x - f\big(z(x, \epsilon; \psi); \theta\big)\big\|_2^2}_{\text{Reconstruction error}} \underbrace{-\,\log p_0\big(z(x, \epsilon; \psi); \theta\big)}_{\text{Prior loss}}\Big] + \frac{N}{2}\log 2\pi\sigma^2 \underbrace{-\,H[q]}_{\text{Posterior entropy}}. \tag{2}$$
While this loss has an interpretable appeal, it is important to keep in mind that it is just a tractable surrogate for the log-likelihood. The gap between the ELBO and the log-likelihood can be decomposed into an inference gap and an amortization gap (Cremer et al., 2018). Large gaps lead to highly sub-optimal training since the network is trained on a loose approximation of the true objective.

Normalizing flows and affine layers:

The unavailability of a closed-form log-likelihood introduces a sub-optimality in VAE training that can only be remedied using complex posterior models or high-variance lower bounds. NFs are alternative methods that assume the latent manifold to have the same dimensionality as the ambient space. Consider $\mathcal{M} = \mathbb{R}^N$ and let $\phi(y; \theta)$ be an invertible differentiable mapping. Under these assumptions, the log-likelihood of the model can be expressed in closed form using the change of variables formula
$$\log p(x) = \log p_0\big(\phi^{-1}(x)\big) + \log\big|\det D\phi^{-1}(x)\big|, \tag{3}$$
where $D\phi^{-1}(x)$ is the Jacobian matrix of the inverse mapping. If the base distribution $p_0$ is standard normal, the negative log-likelihood loss has, up to an additive constant, the form
$$\mathcal{L}_{\text{NF}}(\theta) = \frac{1}{2}\big\|\phi^{-1}(x; \theta)\big\|_2^2 - \log\big|\det D\phi^{-1}(x; \theta)\big|.$$
NF architectures are designed by composing $K$ invertible layers $\phi_k$. Affine layers divide the variables into two blocks $y^1_k$ and $y^2_k$, one of which remains unchanged while the other is scaled and translated based on the first:
$$\phi_k(y^2_k) = y^2_{k+1} = s(y^1_k) \odot y^2_k + m(y^1_k),$$
where $s$ is an arbitrary positive-valued function and $m$ is an unconstrained function. When applied in the inverse direction, this layer becomes $\big(y^2_{k+1} - m(y^1_{k+1})\big)/s(y^1_{k+1})$ with $y^1_{k+1} = y^1_k$, which "predicts away" the mean and variance of one block given the other. Using several of these layers with randomized partitions gradually removes statistical dependencies until, at the last layer, the resulting $y$ variable approximately matches the uncorrelated Gaussian target $p_0$.
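As a minimal sketch of the affine layer and the change of variables formula described in this section, the following pure-Python example builds a single coupling layer and evaluates the resulting log-likelihood under a standard normal base distribution. The functions `s` and `m` are hypothetical stand-ins for the scale and shift networks, chosen only so that the example runs; they are not the architectures used in the paper.

```python
import math

def s(y1):
    # Hypothetical scale function: exp(...) keeps every scale strictly positive.
    return [math.exp(0.5 * math.tanh(v)) for v in y1]

def m(y1):
    # Hypothetical unconstrained shift function.
    return [2.0 * v - 1.0 for v in y1]

def coupling_forward(y1, y2):
    """Generative direction: y1 is left unchanged, y2 is scaled and shifted."""
    out = [si * b + mi for si, mi, b in zip(s(y1), m(y1), y2)]
    return y1, out

def coupling_inverse(y1, y2_next):
    """Inverse direction; also returns log|det| of the inverse Jacobian."""
    scale = s(y1)
    y2 = [(b - mi) / si for si, mi, b in zip(scale, m(y1), y2_next)]
    log_det_inv = -sum(math.log(si) for si in scale)
    return y1, y2, log_det_inv

def standard_normal_logpdf(v):
    return sum(-0.5 * vi * vi - 0.5 * math.log(2.0 * math.pi) for vi in v)

def log_likelihood(x1, x2):
    """log p(x) = log p0(phi^{-1}(x)) + log|det D phi^{-1}(x)| for one layer."""
    y1, y2, log_det_inv = coupling_inverse(x1, x2)
    return standard_normal_logpdf(y1 + y2) + log_det_inv

y1, y2 = [0.3, -1.2], [0.7, 0.1]
x1, x2 = coupling_forward(y1, y2)
r1, r2, _ = coupling_inverse(x1, x2)  # recovers (y1, y2) up to rounding
```

Because the unchanged block feeds both `s` and `m`, the Jacobian is triangular and its log-determinant reduces to a sum of log-scales, which is what makes the likelihood cheap to evaluate.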

3.1. AUTOENCODING WITH NORMALIZING FLOWS

In this section we show how to define a deterministic generative autoencoder within a NF architecture. Given its similarity with autoencoders, we name this flow architecture autoencoder within flows (AEF). For now, we assume $D \ll N$, namely that the manifold dimensionality of the signal is much lower than the ambient dimensionality. We will remove this assumption in the next sections. Let us begin by partitioning the input signal $x \in \mathbb{R}^N$ into two subsets $x_{1:D}$ and $x_{D+1:N}$, as commonly done for coupling layers in NFs. In this paper, we refer to $x_{1:D}$ and $x_{D+1:N}$ as core and shell variables respectively. As we shall see, the core variables are in one-to-one relation with the latents, while the shell variables are (approximately) "predicted away" by the latents. We define a mean (shell) encoder $g_m(\cdot; \theta): \mathbb{R}^{N-D} \to \mathbb{R}^D$ and a scale (shell) encoder $g_s(\cdot; \theta): \mathbb{R}^{N-D} \to \mathbb{R}^D$. These two architectures are analogous to the mean and scale encoders of a VAE, except for the fact that they only take the shell variables as input. We also define an invertible core encoder $n^{-1}(\cdot; \theta): \mathbb{R}^D \to \mathbb{R}^D$ that takes care of encoding the core variables. Using these architectures and the core/shell partition, we encode the data (both core and shell) into the latent variables via an invertible affine transformation:
$$z = g_m(x_{D+1:N}; \theta) + g_s(x_{D+1:N}; \theta) \odot n^{-1}(x_{1:D}; \theta). \tag{4}$$
In order to recover the data from the latent, we define a decoder $f_{D+1:N}(\cdot; \theta): \mathbb{R}^D \to \mathbb{R}^{N-D}$. This decoder only needs to reconstruct the shell variables, since the core variables can be recovered by inverting the core encoder. As long as the scale encoding is not zero, the encoding formula defines a surjective transformation between the ambient space $\mathbb{R}^N$ and the latent space $\mathbb{R}^D$. Such a transformation is not invertible since information concerning the shell variables can be lost and, consequently, in general $f_{D+1:N}(z(x); \theta) \neq x_{D+1:N}$. 
We can circumvent this problem by retaining the residuals of the autoencoder as additional variables:
$$\delta_{D+1:N} = x_{D+1:N} - f_{D+1:N}(z; \theta).$$
These variables retain the information lost during encoding, thereby completing the surjective encoding formula into an invertible transformation $\Phi^{-1}: \mathbb{R}^N \to \mathbb{R}^N$ defined as $\Phi^{-1}: (x_{1:D}, x_{D+1:N}) \mapsto (z, \delta_{D+1:N})$. Formally, the variables $\delta_{D+1:N}$ are part of the 'latents' of the NF. However, conceptually they are not true latent variables as they can only account for additive white noise. We can now define a base distribution $p_0(z; \theta)$ over the latents. This distribution defines a generative model in the latent space $\mathbb{R}^D$ and it is analogous to the prior in VAEs. For this reason, we will often (improperly) refer to $p_0(z; \theta)$ as the 'prior'. However, it is important to keep in mind that this distribution is not a Bayesian prior, since AEF latents do not have a straightforward Bayesian interpretation. Moreover, we define an error distribution $r(\delta; \theta)$ over the residuals. It is important for this distribution to have a learnable scale parameter that can be scaled down during training as the residuals are "predicted away". In summary, the invertible architecture $\Phi^{-1}$ is trained to map the distribution of the data into a factorized distribution of latent codes and residuals:
$$\Phi^{-1}(x_{1:D}, x_{D+1:N}; \theta) \sim p_0(z; \theta)\, r(\delta; \theta).$$
The exact negative log-likelihood loss can be obtained by using the change of variables formula of normalizing flows (Eq. 3):
$$\mathcal{L}_{\text{AEF}}(\theta, \sigma^2) = \underbrace{\frac{1}{2\sigma^2}\big\|x_{D+1:N} - f_{D+1:N}\big(z(x; \theta); \theta\big)\big\|_2^2 + \frac{N-D}{2}\log 2\pi\sigma^2}_{-\log r(\delta; \theta)} - \log p_0\big(z(x; \theta); \theta\big) - \log\big|\det D\Phi^{-1}(x; \theta)\big|, \tag{7}$$
where for the sake of simplicity we assumed (centered) Gaussian residual noise with (trainable) variance $\sigma^2$. It is now possible to note some striking similarities between the flow just described and a VAE. Eq. 4 resembles the Gaussian reparameterization trick for variational autoencoders, with the reparameterized posterior noise replaced by the core variables. 
Furthermore, Eq. 7 closely resembles the ELBO in Eq. 2 but without noise injection, with the reconstruction error applied only to the shell variables and the Jacobian term replacing the entropy of the posterior distribution. As usual in flows, the AEF architecture can be used as a generative model simply by sampling the latent code $z$ from the 'prior' $p_0(z; \theta)$ and inverting the AEF transformation $\Phi^{-1}(\cdot)$. Since it is usually not useful to add residual noise to the generated data, we can set $\delta_{D+1:N} = 0$ instead of sampling it from the error distribution. This results in the following procedure:
$$z \sim p_0(z; \theta) \;\to\; x_{D+1:N} = f_{D+1:N}(z; \theta) \;\to\; x_{1:D} = n\!\left(\frac{z - g_m(x_{D+1:N}; \theta)}{g_s(x_{D+1:N}; \theta)}\right).$$
This is visualized in Figure 3(b) in Appendix A, and implemented in Algorithm 2.
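The core/shell construction of this section can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the networks `g_m`, `g_s`, `f` and the core encoder `n_inv` (with inverse `n`) are hypothetical toy functions with $D = 1$ and $N = 3$, the 'prior' is taken to be standard normal, and the residual noise is Gaussian with a fixed variance `sigma2` rather than a trained one.

```python
import math

D = 1  # number of core variables (latent dimensionality); x has N = 3 entries

def g_m(shell):            # hypothetical mean (shell) encoder, R^{N-D} -> R^D
    return [0.5 * shell[0] - 0.25 * shell[1]]

def g_s(shell):            # hypothetical scale (shell) encoder, strictly positive
    return [math.exp(0.1 * math.tanh(shell[0] + shell[1]))]

def n_inv(core):           # hypothetical invertible core encoder n^{-1}
    return [math.sinh(c) for c in core]

def n(u):                  # inverse of the core encoder
    return [math.asinh(v) for v in u]

def log_det_n_inv(core):   # log|det| of the Jacobian of n^{-1} (d sinh = cosh)
    return sum(math.log(math.cosh(c)) for c in core)

def f(z):                  # hypothetical shell decoder, R^D -> R^{N-D}
    return [math.tanh(z[0]), 0.3 * z[0]]

def encode(x):
    """Phi^{-1}: (core, shell) -> (z, delta), plus log|det D Phi^{-1}|."""
    core, shell = x[:D], x[D:]
    scale = g_s(shell)
    z = [mu + sc * u for mu, sc, u in zip(g_m(shell), scale, n_inv(core))]
    delta = [xs - fs for xs, fs in zip(shell, f(z))]
    log_det = log_det_n_inv(core) + sum(math.log(sc) for sc in scale)
    return z, delta, log_det

def decode(z, delta=None):
    """Phi: (z, delta) -> x. With delta=None the residuals are set to zero."""
    recon = f(z)
    shell = recon if delta is None else [a + b for a, b in zip(recon, delta)]
    u = [(zi - mu) / sc for zi, mu, sc in zip(z, g_m(shell), g_s(shell))]
    return n(u) + shell

def gaussian_logpdf(v, var):
    return sum(-0.5 * vi * vi / var - 0.5 * math.log(2 * math.pi * var) for vi in v)

def nll(x, sigma2=0.1):
    """Negative log-likelihood: -(log p0(z) + log r(delta) + log|det|)."""
    z, delta, log_det = encode(x)
    return -(gaussian_logpdf(z, 1.0) + gaussian_logpdf(delta, sigma2) + log_det)

x = [0.4, -0.7, 1.1]
z, delta, _ = encode(x)
x_back = decode(z, delta)  # exact reconstruction when the residuals are kept
```

Calling `decode(z)` without residuals corresponds to the sampling procedure, where the shell is produced by the decoder and the core is recovered by inverting the affine layer and the core encoder.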

3.2. PARTITION STRATEGIES AND AMBIENT SPACE EXPANSION

So far, we did not specify how to select core and shell variables. The most straightforward approach is to use an arbitrarily chosen partition of the original variables. For example, we could use a random partition or, in an image, we could extract a central sub-image of "core pixels" and have the image with cropped center as shell. An AEF with this kind of partition interpolates between a regular normalizing flow (for $D \approx N$) and an autoencoder (for $D \ll N$). However, this approach limits the number of latents to be smaller than $N$ and reduces compatibility with the VAE literature. Furthermore, in the case of noise-corrupted data this partitioning strategy does not allow for denoising of the core variables, since they do not have corresponding deviations and error distributions. All these issues can be circumvented if we appropriately expand the ambient space. In fact, the dimensionality of the ambient space is usually largely arbitrary, depending on factors such as the sampling resolution of cameras and digital microphones instead of the physical features of their respective measured signals. Consider a parameterized injective function $\Psi: \mathbb{R}^N \to \mathbb{R}^{N+D}$, defined as follows:
$$\Psi(x; \gamma) = \big(x,\, w = h(x; \gamma)\big),$$
where $\gamma$ are the transformation parameters. This function expands the ambient dimensionality but, since the transformation is injective and deterministic, it leaves the dimensionality of the signal manifold unchanged. Conceptually, this should be seen as a form of feature expansion and not an architectural component of a flow. In this expanded space, we can define deviations for all the original variables and, consequently, use a loss with VAE-style reconstruction error on all variables. This can be done simply by using a regular AEF architecture with the original variables $x$ as shell and $w = h(x; \gamma)$ as core. This results in the following encoding layer:
$$z = g_m(x; \theta) + g_s(x; \theta) \odot n^{-1}\big(h(x; \gamma); \theta\big). \tag{9}$$
Note that this differs from the well-known VAE reparameterized encoding formula (Eq. 1) just by the fact that the input white noise is replaced by a deterministic function of the data. This is the only architectural difference between VAEs and AEFs in the extended space. It is not immediately obvious that the feature expansion parameters $\gamma$ can be trained by maximizing the log-likelihood. However, this joint training can be fully justified as the minimization of (the limit of) KL divergences (see Section 3.3), which results in the following objective function:
$$\mathcal{L}_{\text{AEF}}(\theta, \sigma^2, \gamma) = \frac{1}{2\sigma^2}\big\|x - f\big(z(x, h(x; \gamma); \theta); \theta\big)\big\|_2^2 + \frac{N}{2}\log 2\pi\sigma^2 - \log p_0\big(z(x, h(x; \gamma); \theta); \theta\big) - \log\big|\det D\Phi^{-1}(x, h(x; \gamma); \theta)\big|, \tag{10}$$
which is just the negative log-likelihood in the expanded ambient space. We include the pseudocode for computing the objective function as negative log-likelihood in Algorithm 1.

Algorithm 1 Negative log-likelihood of an AEF on the expanded ambient space
g: encoder; f: decoder; h: feature expansion map; n: core encoder; $p_0$: base distribution ('prior'); r: error distribution; $\theta$: model parameters; $\gamma$: feature expansion parameters; x: input image
procedure NEGATIVELOGLIKELIHOOD(x)
    $\log|\det J| \leftarrow 0$
    $w \leftarrow h(x; \gamma)$
    $z \leftarrow g_m(x; \theta) + g_s(x; \theta) \odot n^{-1}(w; \theta)$
    $\log|\det J| \leftarrow \log|\det J| + \log|\det J(n^{-1}(w; \theta))| + \sum_i \log g_{s,i}(x; \theta)$
    $\delta \leftarrow x - f(z; \theta)$
    return $-\big(\log p_0(z; \theta) + \log r(\delta; \theta) + \log|\det J|\big)$

During generative sampling, the additional variables $w$ are redundant and can be discarded. This results in a generative sampling formula that is identical to the one used in VAEs (see also Algorithm 3):
$$x = f(z; \theta), \qquad z \sim p_0(z; \theta).$$
Consequently, the core flow $n$ no longer participates in the generative sampling and instead has an auxiliary role. This is a sign of its strong relation with the posterior flow in a VAE, which is already implicit in Eq. 9. The marginal likelihood in the original ambient space, $p(x) = \int p(x, w)\, dw$, cannot be obtained in closed form from the joint $p(x, w)$. 
Note that, at least from the point of view of manifold learning and dimensionality reduction, the original likelihood $p(x)$ itself does not have a privileged interpretation, as the ambient dimensionality is usually arbitrary and the 'true' likelihood of the data is degenerate and lives on a lower-dimensional manifold. Nevertheless, $p(x)$ is often useful for evaluation purposes (model comparison). Therefore, we provide an efficient importance sampling scheme in Appendix C. All likelihood results reported in this paper have been corrected using this method so as to be comparable with the VAE baselines.

Algorithm 2 Sampling AEF with core-shell partition
g: encoder; f: decoder; n: core encoder; $p_0$: base distribution ('prior'); $\theta$: model parameters
procedure SAMPLE
    $z \sim p_0(z; \theta)$
    $x_{D+1:N} \leftarrow f(z; \theta)$
    $x_{1:D} \leftarrow n\big((z - g_m(x_{D+1:N}; \theta)) / g_s(x_{D+1:N}; \theta)\big)$
    $x \leftarrow \mathrm{cat}(x_{1:D}, x_{D+1:N})$
    return x

Algorithm 3 Sampling AEF with expanded space
f: decoder; $p_0$: base distribution ('prior'); $\theta$: model parameters
procedure SAMPLE
    $z \sim p_0(z; \theta)$
    $x \leftarrow f(z; \theta)$
    return x
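The expanded-space objective (Algorithm 1) and the VAE-style sampler (Algorithm 3) can be sketched as follows. This is a hedged illustration rather than the reference implementation: `h`, `g_m`, `g_s` and `f` are hypothetical toy functions with $N = 2$ and $D = 1$, the core encoder is assumed to be the identity (so it contributes no log-determinant), the 'prior' is standard normal, and the residual variance `sigma2` is a fixed number.

```python
import math
import random

def h(x, gamma):           # hypothetical linear feature expansion map, R^N -> R^D
    return [sum(g * xi for g, xi in zip(gamma, x))]

def g_m(x):                # hypothetical mean encoder, takes the full input x
    return [0.5 * (x[0] + x[1])]

def g_s(x):                # hypothetical scale encoder, strictly positive
    return [math.exp(-0.5 + 0.1 * math.tanh(x[0] - x[1]))]

def f(z):                  # hypothetical decoder, R^D -> R^N
    return [math.tanh(z[0]), 0.5 * z[0]]

def gaussian_logpdf(v, var):
    return sum(-0.5 * vi * vi / var - 0.5 * math.log(2 * math.pi * var) for vi in v)

def negative_log_likelihood(x, gamma, sigma2=0.1):
    """Joint negative log-likelihood of (x, w = h(x; gamma))."""
    log_det = 0.0
    w = h(x, gamma)
    scale = g_s(x)
    # Identity core encoder (assumption): n^{-1}(w) = w, zero log-determinant.
    z = [mu + sc * wi for mu, sc, wi in zip(g_m(x), scale, w)]
    log_det += sum(math.log(sc) for sc in scale)
    delta = [xi - fi for xi, fi in zip(x, f(z))]
    return -(gaussian_logpdf(z, 1.0) + gaussian_logpdf(delta, sigma2) + log_det)

def sample(rng):
    """Draw z from the 'prior' and decode; w is never needed."""
    z = [rng.gauss(0.0, 1.0)]
    return f(z)

x, gamma = [0.2, -0.4], [1.0, -1.0]
loss = negative_log_likelihood(x, gamma)
x_new = sample(random.Random(0))
```

Note that the loss depends on `gamma` only through `z`, which is why the expansion parameters can be updated with the same gradient step as the encoder and decoder parameters.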

3.3. TRAINING THE EXPANSION PARAMETERS γ

The feature expansion map $h(x, \gamma)$ is, strictly speaking, not a component of the flow architecture. Therefore, in spite of its intuitive appeal, it is not obvious that we can train $\gamma$ by minimizing the joint negative log-likelihood. However, this can be fully justified as minimization of the KL divergence between the empirical distribution of the data with expanded dimensionality and the generative model. The idea is to define a loss functional that can be used to make two parameterized distributions approach each other, instead of just having a parameterized distribution approaching a fixed target distribution. We will start by considering a stochastic dimensionality expansion and then show that the deterministic case can be obtained as a limit. We will denote the empirical sampling distribution of the dataset as $d(x)$. Now consider the joint distribution of the empirical data and the stochastically expanded latent dimensions:
$$q_\sigma(x, w) = d(x)\, \mathcal{N}\big(w;\, h(x, \gamma),\, \sigma^2 I\big),$$
where the expansion is performed using a Gaussian conditional density with fixed variance. The KL divergence between the expanded empirical distribution $q_\sigma(x, w)$ and the joint distribution of the AEF model $p(x, w)$ is given by:
$$D_{KL}\big(q_\sigma(x, w), p(x, w)\big) = -\mathbb{E}_{x, \epsilon \sim q}\big[\log p(x, w)\big] + c,$$
where the constant $c$ is the negative differential entropy of $q_\sigma(x, w)$, which is independent of both $\theta$ and $\gamma$. This happens because the differential entropy of a Gaussian variable does not depend on its mean. We can now compute the gradient with respect to $\gamma$:
$$\nabla_\gamma D_{KL}\big(q_\sigma(x, w), p(x, w)\big) = -\mathbb{E}_{x, \epsilon}\big[\nabla_\gamma \log p(x, w)\big],$$
which is just the gradient of the stochastically augmented joint negative log-likelihood. We can now notice that, while the KL divergence is not well-defined in the limit $\sigma \to 0$, the limiting gradient is well-defined since the diverging entropy term does not depend on $\gamma$. 
This leaves us with the following limiting gradient:
$$\lim_{\sigma \to 0} \nabla_\gamma D_{KL}\big(q_\sigma(x, w), p(x, w)\big) \tag{14}$$
$$= -\lim_{\sigma \to 0} \mathbb{E}_{x, \epsilon}\big[\nabla_\gamma \log p\big(x, h(x, \gamma) + \sigma \odot \epsilon\big)\big] \tag{15}$$
$$= -\mathbb{E}_{x}\big[\nabla_\gamma \log p\big(x, h(x, \gamma)\big)\big],$$
where $h(x, \gamma) + \sigma \odot \epsilon$, in the second line, is the reparameterization formula for the conditional Gaussian variable $w$. We can now see that this is, up to a proportionality constant, the gradient of the negative log-likelihood used to train the flow.
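The reason why the entropy term can be dropped is perhaps easiest to see by writing the constant out explicitly. The following worked step is a sketch under two assumptions that are not stated in this form in the text: the conditional Gaussian has standard deviation $\sigma$ in each of its $D$ dimensions (as the reparameterization $h(x, \gamma) + \sigma \odot \epsilon$ suggests), and $\mathbb{E}_{d}[\log d(x)]$ is treated formally as a fixed quantity determined by the data alone.

```latex
\begin{align*}
D_{KL}\big(q_\sigma, p\big)
  &= \mathbb{E}_{q_\sigma}\big[\log q_\sigma(x, w)\big]
   - \mathbb{E}_{q_\sigma}\big[\log p(x, w)\big], \\
\mathbb{E}_{q_\sigma}\big[\log q_\sigma(x, w)\big]
  &= \mathbb{E}_{d}\big[\log d(x)\big]
   + \mathbb{E}_{d}\,\mathbb{E}_{w \mid x}\big[\log \mathcal{N}\big(w;\, h(x, \gamma),\, \sigma^2 I\big)\big] \\
  &= \mathbb{E}_{d}\big[\log d(x)\big] - \frac{D}{2}\log\big(2\pi e\, \sigma^2\big) \;=\; c .
\end{align*}
```

Under these assumptions, $c$ grows without bound as $\sigma \to 0$, which is why the divergence itself has no finite limit, but neither of its two terms involves $\gamma$ or $\theta$, so the gradient is unaffected.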

4. RELATED WORK

Improvements of VAE training:

In recent years, there have been many works analyzing and improving the training procedure of VAEs. Works such as (Hoffman & Johnson, 2016), (Zhao et al., 2017) and (Alemi et al., 2018) diagnose some issues with ELBO training and propose a series of methods to improve results. (Rezende & Viola, 2018) and (Dai & Wipf, 2019) propose an augmented objective to tackle specific optimization issues. In the same vein, (Cremer et al., 2018) and (Mattei & Frellsen, 2018) Maaløe et al., 2019). More recent works use hierarchical VAEs with many layers, achieving state-of-the-art log-likelihood and generating images with impressive sample quality (Vahdat & Kautz, 2020; Child, 2021; Hazami et al., 2022). These works use a latent dimensionality that is greater, often by orders of magnitude, than the dimensionality of the ambient space. This dimensionality expansion greatly ameliorates the problem of posterior separations diagnosed in (Dai & Wipf, 2019), at the price of higher computational and memory costs and lower interpretability. Interestingly, a recent post-training analysis showed that just a few percent of the hierarchical VAE's latent dimensions are needed to encode the data (Hazami et al., 2022).

Stochastic auxiliary variables and augmented normalizing flows:

The fact that VAE encoding can be seen as a form of affine coupling was first noticed in (Dinh et al., 2014) (Appendix C). This allows one to conceptualize the one-sample reparameterization estimate of the ELBO as a likelihood in a stochastically augmented space. However, this is just an equivalent reformulation and does not solve the inference sub-optimality problems of VAEs, which in this setting can be explained by the fact that sharp posteriors correspond to singular points of non-invertibility. This approach was recently generalized in (Huang et al., 2020), where several VAE-style affine layers are stacked in a flow architecture that takes noise-augmented input. The stacking procedure removes the signal/noise separation and autoencoder-like loss of VAEs, since the final prediction is not compared with the original image but only with the previous layer. The resulting model is very similar to other auxiliary-augmented flows (Cornish et al., 2020; Weilbach et al., 2020; Caterini et al., 2021a).

Two-steps training procedures:

A recent trend in generative modeling is to disentangle the tasks of learning a low-dimensional latent representation and maximum-likelihood density estimation. (Dai & Wipf, 2019) propose a two-stage procedure for training VAEs, and show its effectiveness both theoretically and empirically. (Ghosh et al., 2020) propose to instead use an explicit regularization scheme for the decoder, and employ an ex-post density estimation on the latent space to allow for sampling and ensure that the latent space is distributed according to a simple distribution. Similarly, (Xiao et al., 2019) and (Böhm & Seljak, 2020) use a deterministic autoencoder to learn the latent representation of the data, and a normalizing flow to model the distribution of such latents, leading to better density estimation while avoiding over-regularization of the latent variables. More recently, (Loaiza-Ganem et al., 2022) discuss the problem of manifold overfitting, which arises when the manifold is learned but not the distribution on it, and propose a two-step training procedure applicable to all likelihood-based models.

NFs on manifolds:

Several authors propose variations of NFs that can model data on a manifold. Works such as (Gemici et al., 2016), (Mathieu & Nickel, 2020) and (Rezende et al., 2020) assume that the dimensionality-reducing map is already known and available. (Kim et al., 2020; Horvat & Pfister, 2021) propose methods based on adding noise to the data, which are capable of learning unknown lower-dimensional manifolds. Injective flows strive for the same goal by using injective deterministic transformations to map the data to a lower-dimensional base density. (Brehmer & Cranmer, 2020), (Kothari et al., 2021) and (Cramer et al., 2022) train injective flows with a two-step training procedure in which they alternate manifold and density training, while (Kumar et al., 2020) introduce lower bounds on the injective change of variables. 
(Cunningham et al., 2020) combines injective flows with an additive noise model to account for deviations from the learned manifold using stochastic inversions trained on a variational bound. In (Caterini et al., 2021b) the authors train injective flows by evaluating the in-manifold likelihood using a modification of the change of variable formulas for rectangular Jacobi matrices, which is then supplemented by an additional ad-hoc reconstruction loss. (Ross & Cresswell, 2021) uses conformal embeddings to train injective flows with exact on-manifold log-likelihood plus an additional ad hoc reconstruction loss.

SurVAE flows:

A work that is in spirit similar to ours is SurVAE (Nielsen et al., 2020) , in the sense that it aims at bridging the gap between Normalizing Flows and VAEs by combining their strengths. SurVAEs extend flow models by incorporating stochastic and surjective transformations in both the generative and the inference direction. The (generative direction) surjective transformations give rise to stochastic estimates of the likelihood contribution and introduce lower bound likelihood estimates. On the other hand, surjections in the inference direction allow exact probability computations for the latent variables. However, these inference surjections themselves cannot be straightforwardly used to perform generative dimensionality reduction since the stochastic forward transformation p(x|z) is intractable and depends on the data probability itself via Bayes theorem (which is exactly the quantity we want to estimate in generative modeling). In general, the contributions of our work and SurVAE are almost orthogonal and could be combined to obtain the benefits of both techniques, for example by incorporating surjections in our decoder architecture so as to enforce symmetries in the data.

5. EXPERIMENTS

We now show empirically that, at least in complex naturalistic datasets such as CelebA-HQ and ImageNet (Karras et al., 2018; Deng et al., 2009), the exact maximum-likelihood training leads to drastically improved results compared to architecturally equivalent VAEs. Our aim is to show the difference in performance between exact log-likelihood and ELBO objective functions when the architecture is kept constant. Therefore, in our VAE baselines we do not use the many regularization, annealing and KL scaling terms that are somewhat common in VAE applications (Vahdat & Kautz, 2020; Child, 2021). However, we make sure to compare architecturally identical models with strong prior and variational posteriors, which in themselves ameliorate many of the pathologies of the ELBO (Kingma et al., 2016; Tomczak & Welling, 2018). We focus on models with significantly fewer latent variables than observable dimensions, since this is the regime that is the most problematic for traditional VAE training. We use a complex encoder-decoder architecture with residual blocks, similar to the one used in (Child, 2021), and compare the performance of the AEF (with linear core variables) and its architecturally identical VAE for different latent dimensions. Table 1 shows that AEFs significantly outperform their architecturally equivalent VAEs both in terms of bits per dimension and FID score, generating significantly sharper and more detailed samples (Figs. 1 and 2). To compute bits per dimension, we use importance sampling, as described in Appendix D.6. Additionally, we look at smaller-scale datasets: MNIST, FashionMNIST and KMNIST (Deng, 2012; Xiao et al., 2017; Clanuwat et al., 2018). Here we use small conventional encoder and decoder architectures (see Appendix D.1). Results on MNIST are shown in Table 1 for a latent dimensionality of 2 and 32. In Appendix E we present these results for all latent dimensions and datasets, as well as the results for AEF (center) and AEF (corner). 
Performance for all three versions of AEF is comparable, with the linear AEF performing slightly better. Overall, we observe an increase in performance for AEFs compared with VAEs. On the other hand, VAE outperforms AEF for a high number of latent dimensions. In Appendix F we investigate the importance of the posterior and prior flows on generative performance for both AEFs and VAEs. Here we observe that incorporating a prior flow is very important to sample quality for both models. Additionally, we observed that, in this setting, even the AEF without flows performs better than a VAE with both posterior and prior flow in terms of BPD and FID.

[Panel titles: CelebA-HQ (64x64); ImageNet (32x32)]

6. DISCUSSION AND CONCLUSIONS

In this work we showed that autoencoders, if properly constructed out of invertible layers, can be trained by maximum likelihood either in the original ambient space or in an appropriately expanded space. This latter approach results in an objective function that can be directly used in the training of any pre-existing VAE and VAE-like model that uses Gaussian residual noise. In our experiments, we showed that in many datasets AEFs perform strikingly better than VAEs. Importantly, the AEF images were not affected by the blurriness typical of low-dimensional VAEs, which resulted in a remarkable difference in the quality of samples and reconstructions. This is an interesting and perhaps surprising result, as the AEF and VAE models were architecturally identical. Given the arguments presented in Alemi et al. (2018), we conjecture that this difference is due to the failure of the VAEs to converge to a sufficiently sharp posterior, which has been proven to result in poor separation between the encodings of the training samples. We further conjecture that this failure is due to the fact that the optimum is very close to a singular point of non-invertibility of the VAE architecture, which results in numerical instabilities due to the diverging KL term. This problem does not affect AEFs since the encoding is always deterministic. In our experiments, the main exception to this trend was MNIST with 32 latent dimensions, where the VAE performed consistently better. We conjecture that this is due to the relatively large ratio between the latent dimensionality and the true signal dimensionality, which results in less concentrated (and less unstable) variational posterior distributions. In this regime, the extra statistical variability provided by the posterior samples of the VAE can loosen the topological constraints of the architecture, potentially leading to higher performance (Cornish et al., 2020; Caterini et al., 2021a). 
When compared to VAEs, the main limitation of AEFs is that they cannot straightforwardly use discrete emission models (decoders) since their flow architecture is assumed to work on continuous data. Another limitation is that, since we are not using stochastic auxiliary variables, AEFs have the same topological constraints as regular NFs (Cornish et al., 2020; Caterini et al., 2021a). However, auxiliary variables can be straightforwardly added to AEFs using standard methods (Caterini et al., 2021a). Finally, from the point of view of the NF literature, our work opens the door to hybrid autoencoder-flow models that can learn how to reduce the latent dimensionality by "predicting away" some dimensions, possibly at several different stages of processing. Adding manifold learning capabilities to NFs has great potential since it can avoid some of the pathologies of invertible models when used on lower dimensional data, while at the same time increasing training efficiency.

7. REPRODUCIBILITY

We provide access to all the code used in our experiments here: https://github.com/gisilvs/AEF. The readme contains instructions on how to reproduce the experiments presented in the paper from the command line, as well as a Jupyter notebook example on training an AEF from scratch using the provided code. To further increase the clarity of our proposed method, we have added diagrams that visually explain the sampling and inference procedure for our model in Appendix A, as well as corresponding pseudocode (Algorithms 1, 2, 3). The reader can find additional explanations of the methods in Sections 3.3 and C. Additional details on the performed experiments, such as architectural details and the hyperparameters used in this work, can be found in Section D. Ablation experiments are found in Section F in the Supplementary material.

Table 2: Mean squared error between inception feature activations of original inputs and reconstructions of noisy inputs. The number between parentheses denotes the latent dimensionality of the models. The second row gives the standard deviation of the noise distribution. For CelebA-HQ and MNIST we report the mean over two and five runs respectively.

To do so, we first compute w = h(x; γ), then take K samples from a normal distribution centered around w: w_1, ..., w_K ∼ q(w) = N(w, ϵ), where ϵ is the approximate posterior scale and needs to be tuned on the validation set for each model. We perform the rest of the computations using w_1, ..., w_K instead of w, and finally compute the probability in Eq. 17. To reduce the variance of the estimator, we additionally use importance weighted sampling:

$$\log p(x) \approx \log \mathbb{E}_{w_1, \dots, w_K \sim q(w)} \left[ \frac{1}{K} \sum_{k} \frac{p(x, w_k)}{q(w_k)} \right]$$

D EXPERIMENTS' DETAILS

D.1 ENCODER AND DECODER

For the MNIST datasets, all the encoders and decoders consist of a two-layer convolutional neural network. The encoder uses 3 × 3 kernels, 64 for the first layer and 128 for the second. Each layer is followed by a ReLU activation. Finally, two linear layers map the feature maps to the mean and standard deviation of the latent space respectively. Similarly, the decoder first uses a linear layer to increase the dimensionality of the latent samples, and then two transposed convolutional layers with respectively 128 and 64 kernels of size 4. The decoder outputs two values: the mean and the standard deviation, as a set of trainable parameters constrained with a softplus activation. For the larger CelebA and ImageNet datasets we use an encoder and decoder with many more layers and residual blocks, an architecture similar to that of Child (2021). In particular, we use the same residual bottleneck block, but the encoder does not output activations at intermediate layers, and the decoder processes only the input coming from the previous layer, without stochastic sampling and prior computation. In other words, we reuse the residual bottleneck blocks from Child (2021) to build a "traditional" encoder-decoder architecture, and the outputs of the encoder and decoder are the same as for the manifold learning and denoising experiments. In all the experiments, we use four residual bottleneck blocks for each feature map size.

D.2 MASKED AUTOREGRESSIVE FLOW

For all of our NFs, we use and adapt implementations from Durkan et al. (2020). Our MAF models stack K MADE autoregressive layers (Germain et al., 2015), each with 2 residual blocks with N hidden units. We add ActNorm (Kingma & Dhariwal, 2018) between each autoregressive layer. IAF is simply the inverse of MAF. When MAF or IAF are used within an autoencoder architecture, whether as prior, posterior or encoder flow, we use K = 4 and N = 256.
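To illustrate the autoregressive masking that each MADE layer relies on, the following is a minimal sketch for a single hidden layer, written in plain Python. It is not the residual MADE implementation of Durkan et al. (2020) used in the experiments; the function names `made_masks` and `connectivity` and the cyclic assignment of hidden degrees are assumptions made for illustration only.

```python
def made_masks(num_inputs, num_hidden):
    """Build input->hidden and hidden->output masks for a one-hidden-layer MADE.

    Simplified illustration: degrees are assigned deterministically.
    """
    in_deg = list(range(1, num_inputs + 1))
    # Hidden degrees cycle through 1..D-1 (assumed assignment scheme).
    hid_deg = [(k % max(1, num_inputs - 1)) + 1 for k in range(num_hidden)]
    # Hidden unit k may read input d only if deg(k) >= deg(d).
    mask_in = [[1 if hid_deg[k] >= in_deg[d] else 0 for d in range(num_inputs)]
               for k in range(num_hidden)]
    # Output d may read hidden unit k only if deg(d) > deg(k) (strict inequality).
    mask_out = [[1 if in_deg[d] > hid_deg[k] else 0 for k in range(num_hidden)]
                for d in range(num_inputs)]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Count the paths from each input i to each output o through the hidden layer."""
    num_outputs, num_inputs = len(mask_out), len(mask_in[0])
    return [[sum(mask_out[o][k] * mask_in[k][i] for k in range(len(mask_in)))
             for i in range(num_inputs)]
            for o in range(num_outputs)]
```

With these masks, output d depends only on inputs with a smaller index, which is what makes the Jacobian of the layer triangular and its log-determinant cheap to evaluate.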

D.3 PREPROCESSING LAYER

For all models, we apply a preprocessing bijective transformation like the one used in Papamakarios et al. (2017):

$$x = \operatorname{logit}(\lambda + (1 - 2\lambda) z)$$

where z is the input image and λ is a parameter. We use λ = 1e-6 for the MNIST-like datasets and λ = 0.05 for CelebA-HQ and ImageNet. This transformation is then followed by an ActNorm layer.
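As a minimal sketch of this preprocessing step, the snippet below applies the logit transformation to a single pixel intensity and returns the log-determinant of its Jacobian, which a flow needs for the change of variables. It assumes pixel values z in [0, 1]; the function names `logit_preprocess` and `logit_postprocess` are hypothetical and not taken from the released code.

```python
import math


def logit_preprocess(z, lam):
    """Map a pixel intensity z in [0, 1] to x = logit(lam + (1 - 2*lam) * z).

    Returns (x, log_det), where log_det = log |dx/dz|.
    """
    s = lam + (1.0 - 2.0 * lam) * z        # squeezed into [lam, 1 - lam]
    x = math.log(s) - math.log1p(-s)       # logit(s)
    # dx/dz = (1 - 2*lam) / (s * (1 - s))
    log_det = math.log(1.0 - 2.0 * lam) - math.log(s) - math.log1p(-s)
    return x, log_det


def logit_postprocess(x, lam):
    """Inverse map: recover z from x."""
    s = 1.0 / (1.0 + math.exp(-x))         # sigmoid(x)
    return (s - lam) / (1.0 - 2.0 * lam)
```

The offset λ keeps the argument of the logit away from 0 and 1, so boundary pixels map to finite values; for an image, the per-pixel log-determinants are summed.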

D.4 DATASETS AND DATA PRE-PROCESSING

In all our experiments, we use a dequantized version of the data, in which we first add uniform noise u ∼ U(0, 1) to the image and then divide by 256. Division by 256 requires an adjustment in the log likelihood, which we take into account when computing the bits per dimension. For the denoising experiments we add Gaussian white noise N(0, σ) to the samples, with different levels of σ depending on the dataset: 0.25, 0.5, 0.75 and 1.0 for the MNIST datasets, and 0.05, 0.1 and 0.2 for CelebA-HQ. By varying the standard deviation of the noise distribution we can increase the intensity of the noise. After adding the noise to the images we clip them so that the pixel values stay in their original range. For CelebA-HQ we apply random left-right flipping whenever an image is loaded into a batch. For all the models we use 10% of the training dataset as validation set, apart from CelebA-HQ, for which we use the predefined train-val-test split.
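The following sketch illustrates the dequantization, the corresponding adjustment of the bits per dimension, and the noise corruption used for denoising. It assumes 8-bit integer pixels in {0, ..., 255}, a log-density reported in nats, and a pixel range of [0, 1] for clipping; the function names are hypothetical and the code operates on flat Python lists for simplicity.

```python
import math
import random


def dequantize(pixels, rng):
    """Add u ~ U(0, 1) to each integer pixel in {0, ..., 255} and divide by 256."""
    return [(p + rng.random()) / 256.0 for p in pixels]


def bits_per_dimension(log_density_nats, num_dims):
    """Convert a log-density (nats) of a dequantized image in [0, 1)^D to BPD.

    Rescaling by 1/256 multiplies the density by 256 per dimension, so
    log(256) nats per dimension are subtracted before converting to bits.
    """
    log_prob_discrete = log_density_nats - num_dims * math.log(256.0)
    return -log_prob_discrete / (num_dims * math.log(2.0))


def add_noise_and_clip(values, sigma, rng, low=0.0, high=1.0):
    """Add Gaussian noise with standard deviation sigma, then clip to [low, high]."""
    return [min(high, max(low, v + rng.gauss(0.0, sigma))) for v in values]
```

As a sanity check on the adjustment, a model that assigns a uniform density on [0, 1)^D (log-density 0) is charged exactly 8 bits per dimension, the cost of storing raw 8-bit pixels.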

D.5 TRAINING PROCEDURE

On the MNIST datasets we train all models for 100K iterations, and we evaluate the test metrics on the iteration that achieved the best validation loss. 
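The bookkeeping behind best-validation checkpoint selection, patience-based early stopping and gradient clipping can be sketched as below. This is a framework-agnostic illustration in plain Python rather than the training code of the repository: the names `EarlyStopper` and `clip_by_global_norm` are hypothetical, and interpreting the gradient "magnitude" as the global L2 norm is an assumption.

```python
import math


class EarlyStopper:
    """Track the best validation loss and signal when patience is exhausted."""

    def __init__(self, patience_iters):
        self.patience_iters = patience_iters
        self.best_loss = math.inf
        self.best_iter = 0

    def update(self, iteration, val_loss):
        """Record a validation result; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_iter = iteration   # iteration whose checkpoint is evaluated on the test set
        return iteration - self.best_iter > self.patience_iters


def clip_by_global_norm(grads, max_norm=200.0):
    """Rescale a flat list of gradients so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

For example, `EarlyStopper(20_000)` would stop a run once more than 20K iterations pass without an improvement in the validation loss, while `best_iter` identifies the iteration whose weights are used for the test metrics.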

E.1 GENERATIVE MODELING AND MANIFOLD LEARNING

In this section we provide additional results. Table 6 shows the results of all runs on CelebA-HQ and ImageNet. We show results comparing the linear and partitioned versions of AEF with VAEs on the MNIST datasets in Figure 5 and Figure 6. Figure 8 shows samples of VAEs and AEFs trained on MNIST, FashionMNIST and KMNIST with a latent dimensionality of 32, and Figure 9 shows the same for CelebA-HQ resized to 64 × 64 for various latent dimensionalities.

E.2 DENOISING

This section presents additional results and figures comparing the denoising performance of AEFs to baseline models, specifically to a VAE with equivalent architecture and a least squares denoising autoencoder (AE) with equivalent encoder and decoder. Figure 10 presents the mean squared error between inception feature activations (IFE) for increasing levels of noise on the MNIST, FashionMNIST and KMNIST datasets.

F ABLATIONS

In this section we investigate the importance of the prior and posterior flow on the performance of both VAEs and AEFs. We train a VAE and AEF with and without prior and posterior flow on CelebA-HQ (32 × 32), and report the results in BPD and FID in Table 7. Qualitatively and by FID score, we observe that for both the VAE and the AEF, samples generated without a prior flow are of significantly worse quality than samples generated with a prior flow. Additionally, having only a posterior flow decreases sample quality in both the VAE and the AEF. It is interesting to note that an AEF without core encoder and prior flow still performs better than a VAE with both posterior flow and prior flow in terms of BPD and FID score. Examples of samples for each ablation are presented in Figure 11 and Figure 12.



offer a theoretical analysis of the factors affecting the quality of approximate inference in VAEs and introduce strategies to alleviate them. None of these works introduce an exact maximum likelihood training. A popular way to improve VAEs is to use a more flexible prior or posterior distribution. Rezende & Mohamed (2015) and Kingma et al. (2016) use NFs as variational posteriors, showing how the aggregate posterior matches the prior more closely, while Tomczak & Welling (2018) use a variational mixture of posteriors as prior. Morrow & Chiu (2020) use a two-stage training procedure, in which first they train a VAE with a NF prior, and then combine the decoder with Glow (Kingma & Dhariwal, 2018) for improved sample quality. An alternative line of research is the use of hierarchical priors with several stochastic layers (Sønderby et al., 2016;

Figure 1: Reconstructions from an AEF (top row) and VAE (bottom row) with equivalent architectures trained on rescaled CelebA-HQ and rescaled ImageNet with 256 latent dimensions.

Figure 4: Examples of reconstruction performance of AEF, VAE, and deterministic autoencoder (AE) trained on CelebA samples with a noise level (σ) of 0.1, and a latent dimensionality of 128. Top: image with added noise; Middle: reconstructed image; Bottom: original image.

Figure 5: Bits per dimension achieved by AEFs and equivalent VAEs on MNIST, FashionMNIST and KMNIST. We show the mean and 95% confidence interval over 5 runs.

Figure 7: Visualization of a learned two-dimensional manifold by AEF trained on MNIST with a two-dimensional latent space.

Figure 10: Mean squared error between inception feature activations of the original inputs, and the reconstructions based on a noisy input for increasing levels of noise. Averaged over five runs. Bars indicate 95% confidence intervals. The number of latent dimensions was set to 2 for the upper row, and 32 for the lower row.

Figure 11: Samples and reconstructions of VAEs and AEFs trained on CelebA-HQ resized to 32 × 32 with: 1) no posterior or prior flow; 2) only a posterior flow; 3) only a prior flow; 4) both posterior and prior flows. All models had a latent dimensionality of 128. Samples were generated with a temperature of 0.85.

As main baseline, we use the original VAE algorithm as introduced in Kingma & Welling (2014) with an Inverse Autoregressive Flow (IAF) posterior and a Masked Autoregressive Flow (MAF) prior. Apart from the feature expansion layer, this baseline is architecturally identical to our AEF (linear) model, which uses learnable linear features as core variables. We also test two AEF models that do not expand the ambient space, one with the center pixels and the other with the corner pixels as core variables, referred to as AEF (center) and AEF (corner) respectively. Additional experiments on denoising are reported in Appendix B, while more details about the experiments and results can be found in Appendix D and E. The code for all the experiments is available at: https://github.com/gisilvs/AEF.

Generative modeling and manifold learning: To compare the generative performance of AEFs with VAEs we test on CelebA-HQ resized to 64 × 64 and 32 × 32, and ImageNet resized to 32 × 32.

Bits per dimension (BPD) and Frechet Inception Distance (FID) for AEF and VAE models

For CelebA-HQ (32 × 32 and 64 × 64), we train instead for 1M iterations and do early stopping if the validation loss does not improve for more than 20K iterations. ImageNet models are trained for 2M iterations with early stopping set to 100K iterations with no improvement. For generative modeling on CelebA-HQ, ImageNet, and all the denoising models we use gradient clipping if the magnitude of the gradients is bigger than 200. As optimizer we choose ADAM (Kingma & Ba, 2015) with a learning rate of 1e-3 for the MNIST-like datasets, and 1e-4 for denoising experiments, CelebA-HQ and ImageNet. We use a batch size of 128 for all the MNIST experiments, a batch size of 64 for CelebA-HQ resized to 32 × 32 and a batch size of 16 for CelebA-HQ resized to 64 × 64 and ImageNet.

.8 MODELS' PARAMETERS

We report the number of parameters used in each model for the different datasets: Table 3 for the models trained on MNIST-like datasets, Table 4 for CelebA-HQ resized to 32 × 32, and Table.

Approximate number of model parameters for all models trained on MNIST-like datasets.

Approximate number of model parameters for all models trained on CelebA-HQ and ImageNet rescaled to 32 × 32.

Approximate number of model parameters for all models trained on CelebA-HQ rescaled to 64 × 64.

Bits per dimension (BPD) and Frechet Inception Distance (FID) for both runs of AEF and VAE models trained on CelebA and ImageNet. The second row denotes the latent dimensionality used. Lower is better for both metrics.

and Figure 12. Ablation results of VAEs and AEFs with and without a posterior flow/core encoder and prior flow.

ACKNOWLEDGEMENTS

OnePlanet Research Center acknowledges the support of the Province of Gelderland.

SUPPLEMENTARY MATERIALS

A ADDITIONAL DIAGRAMS

In Figure 3, we show diagrams for the inference and sampling procedures, for AEFs with partitioning and expanded ambient space.

B DENOISING EXPERIMENTS

Denoising is one of the main applications of the classical autoencoder literature. The basic idea is that the signal can be compressed in a low dimensional manifold while the (white) noise, being incompressible, is filtered out. The denoising problem tests the capacity of our training scheme to separate the signal and the noise spaces, thereby performing an adaptive non-linear filtering.

We test performance on CelebA-HQ (32 × 32), and the MNIST, FashionMNIST and KMNIST datasets with different noise levels. We compare against architecturally equivalent VAEs and least squares denoising autoencoders (AE). All models were trained exclusively on noisy data. To compare performance we look at the mean squared error between the Inception feature activations (the same ones as used for the FID score) for an original, noiseless sample and the reconstruction based on its noisy version. Results for CelebA-HQ and MNIST are shown in Table 2 and Fig. 4 for various noise regimes, while results for FashionMNIST and KMNIST and the other noise regimes can be found in Appendix E.2. For CelebA-HQ we see that the gap in performance between AEF and VAE for denoising is similar to the one in generative modeling. On the other hand, for the less complex MNIST datasets we observe that the VAE often performs better than both the AEF and the denoising AE, which suggests that the posterior uncertainty of the stochastic latent variables of the VAE may play a beneficial role in the denoising performance.

C IMPORTANCE SAMPLING FOR LINEAR AEF

After training, if we wish to compute the probability p(x) for model comparison purposes, we would need to solve the integral p(x) = ∫ p(x, w) dw, which cannot be obtained in closed form. However, we can use an importance sampling scheme:

$$p(x) = \int p(x, w)\, dw = \int q(w) \frac{p(x, w)}{q(w)}\, dw \approx \frac{1}{K} \sum_{k} \frac{p(x, w_k)}{q(w_k)} \qquad (17)$$
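In practice an estimator of this form is usually evaluated in log space to avoid numerical underflow. The sketch below does this with a log-mean-exp reduction for a Gaussian proposal q(w) = N(q_mean, q_std²), where q_mean plays the role of h(x; γ) and q_std the role of the scale ϵ. The joint density used here is a toy one-dimensional linear-Gaussian model chosen because its marginal is known in closed form; it is a stand-in for illustration, not the AEF joint, and all function names are hypothetical.

```python
import math
import random


def log_normal(v, mean, std):
    """Log-density of a univariate normal N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def importance_log_prob(x, log_joint, q_mean, q_std, num_samples, rng):
    """Estimate log p(x) from K samples w_k ~ q(w) = N(q_mean, q_std^2)."""
    log_weights = []
    for _ in range(num_samples):
        w_k = rng.gauss(q_mean, q_std)
        # log of the importance weight p(x, w_k) / q(w_k)
        log_weights.append(log_joint(x, w_k) - log_normal(w_k, q_mean, q_std))
    return log_mean_exp(log_weights)


def toy_log_joint(x, w):
    """Toy stand-in model: w ~ N(0, 1), x | w ~ N(w, 1), so that p(x) = N(0, 2)."""
    return log_normal(w, 0.0, 1.0) + log_normal(x, w, 1.0)
```

In the toy model the exact posterior is N(x/2, 1/2); when q equals it, every importance weight equals p(x) and the estimate is exact for any K. The further q is from the posterior, the higher the variance of the weights, which is consistent with the need to tune the proposal scale on the validation set.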

