DETERMINISTIC TRAINING OF GENERATIVE AUTOENCODERS USING INVERTIBLE LAYERS

Abstract

In this work, we provide a deterministic alternative to the stochastic variational training of generative autoencoders. We refer to these new generative autoencoders as AutoEncoders within Flows (AEF), since the encoder and decoder are defined as affine layers of an overall invertible architecture. This results in a deterministic encoding of the data, as opposed to the stochastic encoding of VAEs. We introduce two related families of AEFs. The first family relies on a partition of the ambient space and is trained by exact maximum likelihood. The second family exploits a deterministic expansion of the ambient space and is trained by maximizing the log-probability in this extended space. This latter case leaves complete freedom in the choice of encoder, decoder, and prior architectures, making it a drop-in replacement for the training of existing VAEs and VAE-style models. We show that these AEFs can have strikingly higher performance than architecturally identical VAEs in terms of log-likelihood and sample quality, especially for low-dimensional latent spaces. Importantly, we show that AEF samples are substantially sharper than VAE samples.
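The core contrast with the ELBO can be made concrete with a toy example of training an invertible map by exact maximum likelihood via the change-of-variables formula. This is a minimal sketch, not the paper's AEF architecture: the one-dimensional data, the single affine map z = a·x + b, and the standard-normal prior are all assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)  # toy 1-D "data"

def log_likelihood(a, b, x):
    """Exact log p(x) = log p_z(a*x + b) + log|a| (change of variables)."""
    z = a * x + b                                   # deterministic encoding
    log_prior = -0.5 * (z**2 + np.log(2 * np.pi))   # standard-normal prior on z
    return np.mean(log_prior + np.log(np.abs(a)))   # + log|dz/dx|

# Gradient ascent on the exact log-likelihood. Note there is no sampling of
# latents and no variational gap, unlike the reparameterized ELBO of a VAE.
a, b = 1.0, 0.0
eps, lr = 1e-5, 0.01
for _ in range(2000):
    # finite-difference gradients keep the sketch dependency-free
    ga = (log_likelihood(a + eps, b, x) - log_likelihood(a - eps, b, x)) / (2 * eps)
    gb = (log_likelihood(a, b + eps, x) - log_likelihood(a, b - eps, x)) / (2 * eps)
    a += lr * ga
    b += lr * gb
```

At the optimum the affine layer whitens the data toward the prior (a ≈ 1/std(x), b ≈ -a·mean(x)); in an AEF the same principle is applied with a full invertible encoder–decoder architecture rather than a scalar affine map.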

1. INTRODUCTION

Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) have maintained an enduring popularity in the machine learning community in spite of the impressive performance of other generative models (Goodfellow et al., 2014; Karras et al., 2020; Van Oord et al., 2016; Van den Oord et al., 2016; Salimans et al., 2017; Dinh et al., 2014; Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma & Dhariwal, 2018; Sohl-Dickstein et al., 2015; Nichol & Dhariwal, 2021). One key feature of VAEs is their ability to project complex data into a semantically meaningful set of latent variables. This feature is considered particularly useful in fields such as model-based reinforcement learning, where temporally linked VAE architectures form the backbone of most state-of-the-art world models (Ha & Schmidhuber, 2018a;b; Hafner et al., 2020; Gregor et al., 2019; Zintgraf et al., 2020; Hafner et al., 2021). Another attractive feature of VAEs is that they leave ample architectural freedom compared with other likelihood-based generative models, with their signature encoder-decoder architectures being popular in many areas of ML besides generative modeling (Ronneberger et al., 2015; Vaswani et al., 2017; Radford et al., 2021; Ramesh et al., 2021; 2022). However, VAE training is complicated by the lack of a closed-form expression for the log-likelihood: the variational gap between the surrogate loss (i.e., the ELBO) and the true log-likelihood is responsible for unstable training and, at least in non-hierarchical models, sub-optimal encoding and sample quality (Hoffman & Johnson, 2016; Zhao et al., 2017; Alemi et al., 2018; Cremer et al., 2018; Mattei & Frellsen, 2018). Consequently, a large fraction of VAE research is devoted to tightening the gap between

