EXPLICITLY MINIMIZING THE BLUR ERROR OF VARIATIONAL AUTOENCODERS

Abstract

Variational autoencoders (VAEs) are powerful generative modelling methods; however, they suffer from blurry generated samples and reconstructions compared to the images they have been trained on. Significant research effort has been spent on increasing generative capabilities by creating more flexible models, but this flexibility often comes at the cost of higher complexity and computational cost. Several works have focused on altering the reconstruction term of the evidence lower bound (ELBO), however, often at the expense of losing the mathematical link to maximizing the likelihood of the samples under the modeled distribution. Here we propose a new formulation of the reconstruction term for the VAE that specifically penalizes the generation of blurry images while still maximizing the ELBO under the modeled distribution. We show the potential of the proposed loss on three different data sets, where it outperforms several recently proposed reconstruction losses for VAEs.

1. INTRODUCTION

Generative modelling aims to learn a data distribution P_D from samples, such that new samples can be generated from the learned distribution, x ∼ P_D. The learned distribution can be used for a variety of tasks, ranging from out-of-distribution detection (Asim et al. (2020)) to serving as a prior for reconstruction tasks (Tezcan et al. (2019)). One generative modelling approach of particular interest is the variational autoencoder (VAE) introduced by Kingma & Welling (2013). This approach is particularly interesting because it yields a lower-dimensional latent model, which allows generating lower-dimensional representations of samples, and because the model parameters are determined by directly maximizing the ELBO of the training samples. The combination of these two points makes VAEs unique compared to other generative modelling approaches (Goodfellow et al. (2014), Rezende & Mohamed (2015), Ho et al. (2020), Brehmer & Cranmer (2020), Caterini et al. (2021)).

One major drawback of VAEs is that they often produce blurry generated samples even though the images in the training distribution were sharp. This is due to the formulation of the optimization. Variational autoencoders are optimized by maximizing the evidence lower bound (ELBO), E_{z∼q_ϕ(z|x)}[log p_θ(x|z)] − D_KL[q_ϕ(z|x) || p(z)], with respect to the network parameters θ and ϕ. The first term is often referred to as the reconstruction loss and effectively ensures that an observed sample can be mapped to the latent space through the posterior model and reconstructed back from its latent representation. The second term matches the learned approximate posterior distribution, q_ϕ(z|x), with a prior distribution p(z). The ELBO can also be formulated as minimizing the Kullback-Leibler divergence in the augmented space, where the data distribution is expanded with the auxiliary variable z, D_KL[q_{D,ϕ}(x, z) || p_θ(x, z)], as described by Kingma et al. (2019).
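To make the two ELBO terms concrete, the sketch below gives a Monte Carlo estimate of the ELBO for a single data point, assuming a standard normal prior, a diagonal Gaussian posterior, and a unit-variance Gaussian likelihood. The toy linear decoder and all variable names are illustrative stand-ins, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo(x, mu_z, logvar_z, decode, n_samples=32):
    """Monte Carlo ELBO estimate for one data point x.

    Assumes q(z|x) = N(mu_z, diag(exp(logvar_z))), p(z) = N(0, I),
    and p(x|z) = N(decode(z), I), i.e. a unit-variance Gaussian likelihood.
    """
    d = mu_z.shape[0]
    # Closed-form KL[q(z|x) || p(z)] between a diagonal Gaussian and N(0, I).
    kl = 0.5 * np.sum(np.exp(logvar_z) + mu_z**2 - 1.0 - logvar_z)
    # Reparameterized samples z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.standard_normal((n_samples, d))
    z = mu_z + np.exp(0.5 * logvar_z) * eps
    # log p(x|z) for the unit-variance Gaussian decoder, constants included.
    recon = decode(z)                                  # shape (n_samples, dim_x)
    log_px_z = (-0.5 * np.sum((x - recon) ** 2, axis=1)
                - 0.5 * x.shape[0] * np.log(2 * np.pi))
    return log_px_z.mean() - kl

# Toy linear "decoder" standing in for a neural network.
W = rng.standard_normal((2, 4))
x = np.array([0.5, -1.0, 0.3, 0.0])
val = elbo(x, mu_z=np.zeros(2), logvar_z=np.zeros(2), decode=lambda z: z @ W)
```

Maximizing this quantity over the encoder outputs (mu_z, logvar_z) and the decoder parameters is exactly the trade-off described above: the first term rewards accurate reconstruction, the second keeps the posterior close to the prior.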
It is clear that the model is penalized heavily if samples are likely under q_{D,ϕ}(x, z) but not under p_θ(x, z), however not the other way around, due to the asymmetry of the Kullback-Leibler divergence. Therefore, p_θ(x, z), and thus the marginal p_θ(x), will have a larger variance than the original data distribution, leading to generated samples that are more diverse than the original data, in practice also including blurry images. To alleviate this issue, both q_ϕ(z|x) and p_θ(x|z) should be flexible enough, as has been the objective of other works (Berg et al. (2018), Kingma et al. (2016), Vahdat & Kautz (2020)).
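This asymmetry can be made concrete with one-dimensional Gaussians. The closed-form KL used below is standard; the specific variances are chosen purely for illustration.

```python
import math

def kl_gauss(mu_q, s_q, mu_p, s_p):
    """Closed-form KL[N(mu_q, s_q^2) || N(mu_p, s_p^2)]."""
    return (math.log(s_p / s_q)
            + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2)
            - 0.5)

# q plays the role of the data distribution, N(0, 1); compare a too-wide
# and a too-narrow model p in the direction KL[q || p] used by the ELBO.
kl_wide   = kl_gauss(0.0, 1.0, 0.0, 2.0)   # p over-dispersed  (~0.318)
kl_narrow = kl_gauss(0.0, 1.0, 0.0, 0.5)   # p under-dispersed (~0.807)
```

Since kl_wide < kl_narrow, a model broader than the data is penalized much less than a narrower one, so the fitted p tends toward over-dispersion, which is the mechanism behind the blurry samples discussed above.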




