THE TILTED VARIATIONAL AUTOENCODER: IMPROVING OUT-OF-DISTRIBUTION DETECTION

Abstract

A problem with using the Gaussian distribution as a prior for a variational autoencoder (VAE) is that the set on which Gaussians have high probability density becomes vanishingly small as the latent dimension increases. This is an issue because VAEs aim both for high likelihood with respect to the prior distribution and for separation between latent points, which is needed for good reconstruction. A small volume in the high-density region of the prior is therefore problematic because it restricts how far apart latent points can be placed. To address this, we propose a simple generalization of the Gaussian distribution, the tilted Gaussian, whose maximum probability density occurs on a sphere instead of at a single point. The tilted Gaussian has exponentially more volume in high-density regions than the standard Gaussian as a function of the distribution dimension. We empirically demonstrate that this simple change of prior improves VAE performance on the task of unsupervised detection of out-of-distribution (OOD) samples. We also introduce a new OOD testing procedure, called the Will-It-Move test, with which the tilted Gaussian achieves remarkable OOD performance.

1. INTRODUCTION

Due to its simplicity, the Gaussian distribution is a common prior for the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014). One drawback is that its region of high probability density becomes relatively smaller as the latent dimension increases. To see why this is an issue, consider the objective of the VAE: it must encode points so that they lie close to the prior while also reconstructing them into their original form. Given an encoder/decoder model of limited capacity, points in the latent space must be well separated for their reconstructions to differ significantly. For a sufficiently complex data set, the high-density region of the prior must therefore have enough volume to accommodate every latent point while allowing for sufficient separation. We argue that the volume of the Gaussian distribution's high-probability-density region is not large enough to accommodate real data sets. To this end, we show that many of the points encoded by Gaussian-prior VAEs lie in low-density regions, while the high-density region remains relatively empty. In support, Nalisnick et al. (2019a) report that, for a Gaussian VAE trained on MNIST, the latent point of highest density decoded to an all-black image.

To deal with these issues, we propose a simple generalization of the Gaussian distribution called the tilted Gaussian distribution. We create this distribution by "exponentially tilting" the ordinary multivariate Gaussian distribution by its norm. Exponential tilting is a common procedure in fields as diverse as statistical mechanics, large deviations, and importance sampling, but we believe its use for VAEs, as done here, is a novel contribution. The tilted Gaussian's maximum probability density lies on the surface of a sphere rather than at a single point.
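Concretely, exponentially tilting the standard multivariate normal by its norm corresponds (up to normalization) to a density of the form ρ_τ(z) ∝ exp(τ∥z∥) · N(z; 0, I). The sketch below is a minimal numerical illustration of this construction, not an implementation from the paper; the function name and the assumed closed form of the radial profile are ours. It checks that the radial log-density g(r) = τr − r²/2 peaks at r = τ, so the maximum lies on a sphere of radius τ rather than at a single point.

```python
import numpy as np

def tilted_log_density(z, tau):
    """Unnormalized log-density of the tilted Gaussian (assumed form):
    the standard-normal log-density plus a tilt term tau * ||z||,
    i.e. log rho(z) = tau * ||z|| - ||z||**2 / 2, up to a constant."""
    r = np.linalg.norm(z, axis=-1)
    return tau * r - 0.5 * r**2

tau = 25.0  # matches the tau = 25 model shown in Figure 1

# Radially, the log-density is g(r) = tau*r - r^2/2, maximized at r = tau:
# the mode is the entire sphere ||z|| = tau, not a single point.
r = np.linspace(0.0, 50.0, 5001)
g = tau * r - 0.5 * r**2
print("radius of maximum density:", r[np.argmax(g)])
```

Any point on the sphere of radius τ (e.g. z = (τ/√d, …, τ/√d) in d dimensions) attains this maximum, which is what gives the distribution its large high-density volume.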
A single parameter corresponds to the sphere's radius, allowing control over the volume of the distribution's high-density region. We show that the tilted Gaussian has exponentially more volume than the standard Gaussian as a function of the latent dimension, allowing a far greater proportion of a dataset's points to lie in regions of high probability density. We also investigated a simpler way of increasing the prior volume, using a Gaussian with large variance; however, its effect on performance was minimal (see Appendix D.2). To demonstrate the benefits of the tilted Gaussian as a prior for the VAE, we focus on the task of OOD detection. It has been noted that, somewhat surprisingly, VAEs assign high likelihood to OOD points despite being optimized on a lower bound of the log-likelihood (Nalisnick et al., 2019a; Choi et al., 2019). A possible contributing factor to this poor performance is that the high-density region of the latent space is not densely populated by in-distribution (ID) points, due to the volume considerations detailed above. We show that VAEs using the tilted Gaussian as a prior (which we call tilted VAEs) place a far greater percentage of points in high-density regions (see Figure 1 for an illustration and Section 3.5 for detailed numbers) and perform significantly better on the OOD task (see Table 2). While this improvement is a step towards robust OOD detection with VAEs, we show that a prior distribution alone cannot achieve the desired level of performance on this task. We therefore propose a new test for the OOD problem, called the Will-It-Move test. Combined with the tilted Gaussian prior, it consistently improves on current methods for OOD detection with VAEs (see Section 4.2 for a description and Table 2 for results).

2. RELATED WORK

2.1. EXTENSIONS OF VARIATIONAL AUTOENCODERS

In this section, we give a non-exhaustive list of extensions proposed for VAEs. The majority of approaches aim to increase the flexibility of the prior. For example, a mixture of Gaussians can be used as an alternative to the standard Gaussian prior (Dilokthanakul et al., 2016). The VampPrior attempts to improve upon the mixture of Gaussians by using a mixture of variational posteriors (Tomczak & Welling, 2018). Another proposal for the prior distribution is the hyperspherical VAE (Davidson et al., 2018), which uses a von Mises-Fisher (vMF) distribution. In contrast, our tilted prior concentrates around the hypersphere as a "soft" constraint, which allows the ordinary normal distribution to be used instead of the vMF. Other proposed methods use the Dirichlet process (Nalisnick & Smyth, 2017), the Chinese restaurant process (Goyal et al., 2017), and the Gaussian



Figure 1: A 2D representation of the 10D latent space of the Gaussian VAE vs. the tilted VAE with τ = 25, both trained on the Fashion-MNIST dataset and plotted with an isoradial projection that preserves the radius (i.e., r_2D = ∥z_10D∥; see Appendix F.1 for details on this projection). Shaded regions indicate where the latent probability density is at least 25%, 50%, or 75% of its maximum, R_c := {z | ρ(z) ≥ c · max_w ρ(w)}. The encoded points of the tilted VAE lie almost entirely in the region of high probability density, while the ordinary Gaussian VAE places points outside these regions. See also Section 3.5 for more comparisons.
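The emptiness of the standard Gaussian's high-density regions R_c can be checked directly by Monte Carlo. The sketch below is an illustrative calculation, not code from the paper: since the N(0, I_d) density is proportional to exp(−∥z∥²/2) with its maximum at the origin, the region R_0.5 is exactly {z : ∥z∥² ≤ 2 ln 2}, and the fraction of samples falling inside it collapses as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_in_half_max_region(d, n=100_000):
    """Monte Carlo estimate of the N(0, I_d) mass inside
    R_0.5 = {z : rho(z) >= 0.5 * max rho} = {z : ||z||^2 <= 2 ln 2}."""
    z = rng.standard_normal((n, d))
    return np.mean(np.sum(z**2, axis=1) <= 2.0 * np.log(2.0))

for d in (2, 10, 50):
    print(f"d = {d:2d}: fraction in R_0.5 = {frac_in_half_max_region(d):.4f}")
```

In 2 dimensions roughly half the samples lie in R_0.5 (since P(χ²_2 ≤ 2 ln 2) = 1 − e^(−ln 2) = 1/2), but by d = 10 almost none do, consistent with the nearly empty high-density regions visible for the Gaussian VAE in Figure 1.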

