EXPLICITLY MINIMIZING THE BLUR ERROR OF VARIATIONAL AUTOENCODERS

Abstract

Variational autoencoders (VAEs) are powerful generative modelling methods; however, they suffer from blurry generated samples and reconstructions compared to the images they have been trained on. Significant research effort has been spent on increasing the generative capabilities by creating more flexible models, but flexibility often comes at the cost of higher complexity and computational cost. Several works have focused on altering the reconstruction term of the evidence lower bound (ELBO), however often at the expense of losing the mathematical link to maximizing the likelihood of the samples under the modeled distribution. Here we propose a new formulation of the reconstruction term for the VAE that specifically penalizes the generation of blurry images while at the same time still maximizing the ELBO under the modeled distribution. We show the potential of the proposed loss on three different datasets, where it outperforms several recently proposed reconstruction losses for VAEs.

1. INTRODUCTION

Generative modelling aims to learn a data distribution p_D from samples, such that new samples can be generated from the learned distribution, x ∼ p_D. The learned distribution can be used for a variety of tasks ranging from out-of-distribution detection (Asim et al. (2020)) to serving as a prior for reconstruction tasks (Tezcan et al. (2019)). One generative modelling approach of particular interest is the variational autoencoder (VAE) introduced by Kingma & Welling (2013). This approach is particularly interesting because it yields a lower-dimensional latent model, which allows generating lower-dimensional representations of samples, and because the model parameters are determined by directly maximizing the ELBO of the training samples. The combination of these two points makes VAEs unique compared to other generative modelling approaches (Goodfellow et al. (2014), Rezende & Mohamed (2015), Ho et al. (2020), Brehmer & Cranmer (2020), Caterini et al. (2021)). One major drawback of VAEs is that they often produce blurry generated samples even though the images in the training distribution were sharp. This is due to the formulation of the optimization. Variational autoencoders are optimized by maximizing the evidence lower bound (ELBO), E_{z∼q_φ(z|x)}[log p_θ(x|z)] − D_KL[q_φ(z|x) ‖ p(z)], with respect to the network parameters θ and φ. The first term is often referred to as the reconstruction loss and effectively ensures that an observed sample can be mapped to the latent space through the posterior model and reconstructed back from its latent representation. The second term matches the learned approximate posterior distribution, q_φ(z|x), to a prior distribution p(z). The ELBO can also be formulated as minimizing the Kullback-Leibler divergence in the augmented space, where the data distribution is expanded with the auxiliary variable z, D_KL[q_{D,φ}(x, z) ‖ p_θ(x, z)], as described by Kingma et al. (2019).
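As a concrete illustration, the ELBO above can be sketched for the common case of a Gaussian approximate posterior q_φ(z|x) = N(µ, diag(exp(logvar))) with a standard normal prior, where the KL term has a closed form and, under a unit-variance Gaussian p_θ(x|z), the reconstruction term reduces to a negative squared error up to an additive constant. This is a minimal numpy sketch under those assumptions, not the paper's implementation; the function names are illustrative.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    # Closed-form KL[N(mu, diag(exp(logvar))) || N(0, I)] per sample.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def elbo(x, x_recon, mu, logvar):
    # With p(x|z) = N(x_recon, I), the log-likelihood term reduces to a
    # negative squared error, up to an additive constant that does not
    # affect the optimization.
    recon = -0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    return recon - gaussian_kl(mu, logvar)
```

Maximizing this quantity trades off reconstruction fidelity (first term) against keeping the approximate posterior close to the prior (second term).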
It is clear that the model is penalized heavily if samples are likely under q_{D,φ}(x, z) but not under p_θ(x, z), but not the other way around, due to the asymmetry of the Kullback-Leibler divergence. Therefore, p_θ(x, z), and thus p_θ(x), will have a larger variance than the original data distribution q_{D,φ}(x, z), leading to generated samples being more diverse than the original data, in practice also including blurry images. To alleviate this issue, both q_φ(z|x) and p_θ(x|z) should be flexible enough, as has been the objective of other works (Berg et al. (2018), Kingma et al. (2016), Vahdat & Kautz (2020)). In addition to this line of work on increasing the flexibility of the posterior q_φ(z|x), which often comes at the cost of being more difficult to optimize and computationally expensive, there is a line of work focusing on improving the formulation of the reconstruction loss. To keep the mathematical link to maximizing p_θ(x) for the observed samples through the ELBO formulation, a distribution p_θ(x|z) has to be assumed for the reconstruction loss. Popular choices for this distribution are Gaussian or Bernoulli, as introduced by Kingma & Welling (2013). The Gaussian distribution, p_θ(x|z) = N(µ_θ(z), Σ), has seen widespread adoption (Castrejon et al. (2019), Lee et al. (2020)) and leads to a simplification of the reconstruction loss to the mean squared error (MSE) if Σ is assumed to be the identity. Since the latent space z has a lower dimensionality than x, a perfect reconstruction will not be possible and the reconstruction loss determines which features of x are weighted the most. The MSE places no weight on specific features; all features are reconstructed with a similar weighting. Since most of the power in natural images is located in the lower frequencies of their spectrum (Van der Schaaf & van Hateren (1996)), low-frequency features, in other words blurry features, will dominate the reconstruction loss.
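The dominance of low-frequency errors under the MSE can be made explicit with Parseval's theorem: the pixel-wise squared error equals the squared error of the orthonormal Fourier coefficients, so the MSE weights all frequencies equally; because natural images concentrate their power, and hence their typical residuals, at low frequencies, those frequencies dominate the loss. A small numpy check of this identity (an illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
residual = rng.normal(size=(64, 64))  # stand-in for x - reconstruction

# Parseval's theorem with the orthonormal ("ortho") FFT scaling:
# the squared error in pixel space equals the squared error of the
# Fourier coefficients, i.e. the MSE is a uniform sum over frequencies.
pixel_sse = np.sum(residual**2)
freq_sse = np.sum(np.abs(np.fft.fft2(residual, norm="ortho")) ** 2)
assert np.allclose(pixel_sse, freq_sse)
```

Since no frequency is favored by the loss itself, whichever frequencies carry the largest residual energy, in natural images the low ones, dominate the gradient.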
Consequently, reconstruction fidelity in higher-frequency features becomes less important. Several previous works have recognized the potential of augmenting the reconstruction loss to address the blur, but they approached the problem implicitly, aiming to produce more visually pleasing generations. One initial approach by Hou et al. (2017) was to replace the pixel-wise loss with a feature-based loss calculated with a pre-trained convolutional neural network (CNN). However, this loss is limited to the domain of images on which the CNN was pre-trained. It has also been suggested in the literature to combine generative adversarial networks with VAEs by replacing the reconstruction loss of the VAE with an adversarial loss (Larsen et al. (2016)). This approach comes at the expense of architecture changes and optimization challenges. Another approach by Barron (2019) optimizes the shape of the loss function during training to increase robustness. While their loss formulation has shown improved results, it does not specifically focus on sharpening VAE generations. By learning the parameters of a Watson perceptual model and using it as a reconstruction loss, Czolbe et al. (2020) have shown that generative examples of VAEs can be improved. Since human perception also values sharp examples, the proposed loss improved the sharpness of samples, but optimization with this loss is less stable than with other methods. Recently, Jiang et al. (2021) tackled the blur problem directly in the frequency domain. They introduced the focal frequency loss, which applies a per-frequency weighting to the error term in the frequency domain, where the weighting depends on the magnitude of the error at that frequency.
The above-mentioned methods do not explicitly focus on reducing blur errors, and, except for the work by Barron (2019), the proposed reconstruction terms lose their mathematical link to maximizing the ELBO, since the distribution p_θ(x|z) is not defined. In this paper we aim to explicitly minimize the blur error of VAEs through the reconstruction term, while still performing ELBO maximization for the observed samples. We derive a loss function that explicitly weights errors due to blur more than other errors. In our experiments, we show that the new loss function produces sharper images than other state-of-the-art reconstruction loss functions for VAEs on three different datasets.

2. BACKGROUND ON BLURRING AND SHARPENING

In this section we review some background on blurring and image sharpening. The blurring degradation of a sharp image x can be modeled by convolving x with a blur kernel k, x̃ = x * k, where x̃ denotes the degraded image. Assuming no additive noise and a blurring kernel that does not suppress any frequencies, the blurring operation can be inverted in the frequency domain using the convolution theorem as F(x̃)/F(k) = F(x), where F is the Fourier transform. The sharp image x can then be obtained by applying the inverse Fourier transform to this ratio. The assumptions of knowing the blur kernel k, of no noise being present during the blur generation, and of k not suppressing any frequencies are strong. If k is not well known and F(k) has several terms close to zero, division by F(k) can cause extreme values. To provide a more stable inverse operation, one can use the Wiener deconvolution (Wiener (1949)),

(F*(k) / (|F(k)|² + C)) F(x̃) ≈ F(x),

where F*(k) denotes the complex conjugate of F(k) and C is a regularization constant.
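The Wiener deconvolution above can be sketched in a few lines of numpy. This is a hedged illustration under simplifying assumptions (circular convolution, a scalar constant C, kernel zero-padded to the image size), not the paper's implementation:

```python
import numpy as np

def wiener_deconvolve(x_blur, k, C=1e-2):
    # Regularized inverse filter: F*(k) / (|F(k)|^2 + C) applied to F(x_blur).
    # Assumes circular (periodic) convolution; k is zero-padded to the
    # image size by fft2's `s` argument.
    K = np.fft.fft2(k, s=x_blur.shape)
    X = np.fft.fft2(x_blur)
    X_hat = np.conj(K) / (np.abs(K) ** 2 + C) * X
    return np.real(np.fft.ifft2(X_hat))
```

For C → 0 this reduces to the plain inverse filter F(x̃)/F(k); a positive C damps frequencies where |F(k)| is small, avoiding the extreme values mentioned above at the cost of a slightly biased estimate.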

