AUTO-ENCODING GOODNESS OF FIT

Abstract

We develop a new type of generative autoencoder called the Goodness-of-Fit Autoencoder (GoFAE), which incorporates GoF tests at two levels. At the minibatch level, it uses GoF test statistics as regularization objectives. At a more global level, it selects a regularization coefficient based on higher criticism, i.e., a test of the uniformity of the local GoF p-values. We justify the use of GoF tests by providing a relaxed L2-Wasserstein bound on the distance between the latent distribution and a distribution class. We prove that optimization based on these tests can be done with stochastic gradient descent on a compact Riemannian manifold. Empirically, we show that our higher criticism parameter selection procedure balances reconstruction and generation using mutual information and uniformity of p-values, respectively. Finally, we show that GoFAE achieves FID scores and mean squared errors comparable to those of competing deep generative models while retaining statistical indistinguishability from Gaussian in the latent space under a variety of hypothesis tests.

1. INTRODUCTION

Generative autoencoders (GAEs) aim to achieve unsupervised, implicit generative modeling by learning a latent representation of the data (Bousquet et al., 2017). A generative model, known as the decoder, maps elements of a latent space, called codes, to the data space. These codes are sampled from a pre-specified distribution, or prior. GAEs also learn an encoder that maps the data space into the latent space by controlling the probability distribution of the transformed data, or posterior. As an important type of GAE, variational autoencoders (VAEs) maximize a lower bound on the data log-likelihood, which consists of a reconstruction term and a Kullback-Leibler (KL) divergence between the approximate posterior and the prior (Kingma & Welling, 2013; Rezende et al., 2014). Another class of GAEs seeks to minimize the optimal transport cost (Villani, 2008) between the true data distribution and the generative model. This objective can be simplified into minimizing a reconstruction error subject to matching the aggregated posterior to the prior distribution (Bousquet et al., 2017). The Wasserstein autoencoder (WAE) (Tolstikhin et al., 2017) relaxes this constraint via a penalty on the divergence between the aggregated posterior and the prior, allowing for a variety of discrepancies (Patrini et al., 2018; Kolouri et al., 2018). Regardless of the criterion, training a GAE requires balancing low reconstruction error against a regularization loss that encourages the latent representation to be meaningful for data generation (Hinton & Salakhutdinov, 2006; Ruthotto & Haber, 2021). Overly emphasizing minimization of the divergence between the data-derived posterior and the prior in GAEs is problematic. In VAEs, this can manifest as posterior collapse (Higgins et al., 2017; Alemi et al., 2018; Takida et al., 2021), leaving the latent space with little information about the data.
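For reference, the KL term in the VAE objective has a well-known closed form when the approximate posterior is a diagonal Gaussian and the prior is N(0, I). A minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.

    Closed form: 0.5 * sum_j (exp(log_var_j) + mu_j^2 - 1 - log_var_j).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# The KL is zero exactly when the posterior equals the prior ...
print(gaussian_kl(np.zeros(2), np.zeros(2)))  # -> 0.0
# ... and grows as the posterior mean drifts from the origin.
print(gaussian_kl(np.array([2.0, 0.0]), np.zeros(2)))  # -> 2.0
```

Driving this term toward zero for every example is precisely what produces posterior collapse: the encoder output becomes independent of the input.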
Meanwhile, the WAE can suffer from over-regularization when the prior distribution is too simple, e.g., an isotropic Gaussian (Rubenstein et al., 2018; Dai & Wipf, 2019). In general, it is difficult to decide when the posterior is close enough to the prior, but not so close as to be problematic. The difficulty is rooted in several issues: (a) the absence of tight constraints on the statistical distances; (b) the variability of distributions across the minibatches used in training; and (c) the difference in scale between the reconstruction and regularization objectives. Unlike statistical distances, goodness-of-fit (GoF) tests are statistical hypothesis tests that assess the indistinguishability between a given (empirical) distribution and a distribution class (Stephens, 2017). In recent years, GAEs based on GoF tests have been proposed to address some of the aforementioned issues of VAEs and WAEs (Ridgeway & Mozer, 2018; Palmer et al., 2018; Ding et al., 2019). However, this emerging GAE approach still has some outstanding issues. In GAEs, GoF test statistics are optimized locally in minibatches. The problem of balancing reconstruction error against a meaningful latent representation manifests as the calibration of GoF test p-values. If GoF test p-values are too small (i.e., minibatches are distinguishable from the prior), then sampling quality is poor; conversely, an abundance of large GoF p-values may result in poor reconstruction, as the posterior matches the prior too closely at the minibatch level. In addition, there currently does not exist a stochastic gradient descent (SGD) algorithm applicable to GoF tests, due to identifiability issues, unbounded domains, and gradient singularities. Our Contributions: We study the GoFAE, a framework for parametric test statistic optimization, resulting in a novel GAE that optimizes GoF tests for normality, together with an algorithm for regularization coefficient selection.
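To make the minibatch-level p-value calibration concrete: each minibatch of codes yields one GoF p-value against the prior. An illustrative sketch using SciPy's Shapiro-Wilk normality test, with simulated one-dimensional codes standing in for encoder output (not the paper's implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-ins for minibatches of 1-D latent codes from an encoder.
codes_gauss = rng.normal(size=64)      # compatible with a Gaussian prior
codes_skew = rng.exponential(size=64)  # clearly non-Gaussian

# Shapiro-Wilk: a small p-value means the minibatch is statistically
# distinguishable from Gaussian (poor sampling quality); an abundance of
# large p-values signals over-matching the prior (poor reconstruction).
p_gauss = stats.shapiro(codes_gauss).pvalue
p_skew = stats.shapiro(codes_skew).pvalue
print(f"Gaussian batch p = {p_gauss:.3f}, skewed batch p = {p_skew:.2e}")
```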
Note that the GoF tests are not only for Gaussians with nonsingular covariance matrices, but for all Gaussians, so they can handle situations where the data distribution is concentrated on, or close to, a manifold of dimension smaller than the ambient space. The framework uses Gaussian priors because they are a standard option (Doersch, 2016; Bousquet et al., 2017; Kingma et al., 2019), and because normality tests are better understood than GoF tests for other distributions, with more tests and more accurate p-value computation available. The framework can be modified to use other priors in a straightforward way, provided the same level of understanding can be gained for GoF tests of those priors as for normality tests. See Fig. 1 for the latent-space behavior of the VAE, WAE, and our GoFAE. Proofs are deferred to the appendix. Our contributions are summarized as follows.

• We propose a framework (Sec. 2) for bounding the statistical distance between the posterior and a prior distribution class G in GAEs, which forms the theoretical foundation for a deterministic GAE, the Goodness-of-Fit Autoencoder (GoFAE), that directly optimizes GoF hypothesis test statistics.

• We examine four GoF tests of normality based on correlation and empirical distribution functions (Sec. 3). Each GoF test focuses on a different aspect of Gaussianity, e.g., moments or quantiles.

• We propose a model selection method using higher criticism of p-values (Sec. 3), which enables global normality testing and is test-based rather than performance-based. This method helps determine the range of the regularization coefficient that balances reconstruction and generation via the uniformity of p-values (Fig. 3b).
• We show that gradient-based optimization of test statistics for normality can be complicated by identifiability issues, unbounded domains, and gradient singularities; we propose an SGD that optimizes over a compact Riemannian manifold (the Stiefel manifold in our case) and effectively solves our GAE formulation, with convergence analysis (Sec. 4).

• We show that GoFAE achieves comparable FID scores and mean squared errors on three datasets while retaining statistical indistinguishability from Gaussian in the latent space (Sec. 5).
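A hedged sketch of the global, higher-criticism-style check described above: under the null that the posterior matches the prior, per-minibatch GoF p-values are (approximately) Uniform(0,1), so their uniformity can itself be tested. Here a Kolmogorov-Smirnov uniformity test stands in for the paper's higher criticism procedure, with simulated codes in place of encoder output:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def collect_pvalues(sampler, n_batches=200, batch_size=64):
    """One Shapiro-Wilk p-value per simulated minibatch of 1-D codes."""
    return np.array([stats.shapiro(sampler(batch_size)).pvalue
                     for _ in range(n_batches)])

# If minibatches really come from the prior, p-values look Uniform(0,1);
# a heavy-tailed latent distribution instead piles p-values near zero.
p_null = collect_pvalues(lambda n: rng.normal(size=n))
p_alt = collect_pvalues(lambda n: rng.standard_t(df=3, size=n))

# Global check: KS test of the collected p-values against Uniform(0,1).
print("uniformity p (Gaussian codes):   ", stats.kstest(p_null, "uniform").pvalue)
print("uniformity p (heavy-tailed codes):", stats.kstest(p_alt, "uniform").pvalue)
```

A regularization coefficient for which the collected p-values pass such a uniformity check is one for which the latent space is globally indistinguishable from the prior class without being over-regularized.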

2. PRELIMINARIES FOR GOODNESS OF FIT AUTOENCODING

Background. Let (X, P_X) and (Z, P_Z) be two probability spaces. In our setup, P_X is the true, but unknown, non-atomic distribution on the data space X, while P_Z is a prior distribution on the latent space Z. An implicit generative model is defined by sampling a code Z ∼ P_Z and



Figure 1: Latent behaviors of the VAE, WAE, and GoFAE, inspired by Figure 1 of Tolstikhin et al. (2017). (a) The VAE requires the approximate posterior distribution (orange contours) to match the prior P_Z (white contours) for each example. (b) The WAE forces the encoded distribution (green contours) to match the prior P_Z. (c) The GoFAE forces the encoded distribution (purple contours) to match some P_Z in the prior class G. For illustration, several P_Zi ∈ G are visualized (white contours).

