AUTO-ENCODING GOODNESS OF FIT

Abstract

We develop a new type of generative autoencoder called the Goodness-of-Fit Autoencoder (GoFAE), which incorporates GoF tests at two levels. At the minibatch level, it uses GoF test statistics as regularization objectives. At a more global level, it selects a regularization coefficient based on higher criticism, i.e., a test on the uniformity of the local GoF p-values. We justify the use of GoF tests by providing a relaxed $L_2$-Wasserstein bound on the distance between the latent distribution and a distribution class. We prove that optimization based on these tests can be done with stochastic gradient descent on a compact Riemannian manifold. Empirically, we show that our higher criticism parameter selection procedure balances reconstruction and generation using mutual information and uniformity of p-values respectively. Finally, we show that GoFAE achieves comparable FID scores and mean squared errors with competing deep generative models while retaining statistical indistinguishability from Gaussian in the latent space based on a variety of hypothesis tests.

1. INTRODUCTION

Generative autoencoders (GAEs) aim to achieve unsupervised, implicit generative modeling by learning a latent representation of the data (Bousquet et al., 2017). A generative model, known as the decoder, maps elements in a latent space, called codes, to the data space. These codes are sampled from a pre-specified distribution, or prior. GAEs also learn an encoder that maps the data space into the latent space by controlling the probability distribution of the transformed data, or posterior. As an important type of GAE, variational autoencoders (VAEs) maximize a lower bound on the data log-likelihood, which consists of a reconstruction term and a Kullback-Leibler (KL) divergence between the approximate posterior and prior distributions (Kingma & Welling, 2013; Rezende et al., 2014). Another class of GAEs seeks to minimize the optimal transport cost (Villani, 2008) between the true data distribution and the generative model. This objective can be simplified into minimizing a reconstruction error subject to matching the aggregated posterior to the prior distribution (Bousquet et al., 2017). This constraint is relaxed in the Wasserstein autoencoder (WAE) (Tolstikhin et al., 2017) via a penalty on the divergence between the aggregated posterior and the prior, allowing for a variety of discrepancies (Patrini et al., 2018; Kolouri et al., 2018). Regardless of the criterion, training a GAE requires balancing low reconstruction error with a regularization loss that encourages the latent representation to be meaningful for data generation (Hinton & Salakhutdinov, 2006; Ruthotto & Haber, 2021). Overly emphasizing minimization of the divergence metric between the data-derived posterior and the prior in GAEs is problematic. In VAEs, this can manifest as posterior collapse (Higgins et al., 2017; Alemi et al., 2018; Takida et al., 2021), resulting in a latent space that contains little information about the data.
Meanwhile, WAE can suffer from over-regularization when the prior distribution is too simple, e.g., isotropic Gaussian (Rubenstein et al., 2018; Dai & Wipf, 2019). Generally, it is difficult to decide when the posterior is close enough to the prior but not to a degree that is problematic. The difficulty is rooted in several issues: (a) an absence of tight constraints on the statistical distances; (b) differing distributions across the minibatches used in training; and (c) difference in scale between reconstruction and regularization objectives. Unlike statistical distances, goodness-of-fit (GoF) tests are statistical hypothesis tests that assess the indistinguishability between a given (empirical) distribution and a distribution class (Stephens, 2017). In recent years, GAEs based on GoF tests have been proposed to address some of the aforementioned issues of VAEs and WAEs (Ridgeway & Mozer, 2018; Palmer et al., 2018; Ding et al., 2019). However, this emerging GAE approach still has some outstanding issues. In GAEs, GoF test statistics are optimized locally in minibatches. The issue of balancing reconstruction error and meaningful latent representation manifests as the calibration of GoF test p-values. If GoF test p-values are too small (i.e., minibatches are distinguishable from the prior), then sampling quality is poor; conversely, an abundance of large GoF p-values may result in poor reconstruction as the posterior matches the prior too closely at the minibatch level. In addition, there is currently no stochastic gradient descent (SGD) algorithm that is applicable to GoF tests, owing to identifiability issues, unbounded domains, and gradient singularities.

Our Contributions: We study the GoFAE, a framework for parametric test statistic optimization, resulting in a novel GAE that optimizes GoF tests for normality, and an algorithm for regularization coefficient selection.
Note that the GoF tests are not only for Gaussians with nonsingular covariance matrices but for all Gaussians, so they can handle situations where the data distribution is concentrated on or closely around a manifold with dimension smaller than the ambient space. The framework uses Gaussian priors because they are a standard option (Doersch, 2016; Bousquet et al., 2017; Kingma et al., 2019), and also because normality tests are better understood than GoF tests for other distributions, with more tests and more accurate calculation of p-values available. The framework can be modified to use other priors in a straightforward way, provided that the same level of understanding can be gained on GoF tests for those priors as for normality tests. See Fig. 1 for the latent space behavior of VAE, WAE, and our GoFAE. Proofs are deferred to the appendix. Our contributions are summarized as follows.
• We propose a framework (Sec. 2) for bounding the statistical distance between the posterior and a prior distribution class $\mathcal{G}$ in GAEs, which forms the theoretical foundation for a deterministic GAE, the Goodness-of-Fit Autoencoder (GoFAE), that directly optimizes GoF hypothesis test statistics.
• We examine four GoF tests of normality based on correlation and empirical distribution functions (Sec. 3). Each GoF test focuses on a different aspect of Gaussianity, e.g., moments or quantiles.
• A model selection method using higher criticism of p-values is proposed (Sec. 3), which enables global normality testing and is test-based instead of performance-based. This method helps determine the range of the regularization coefficient that balances reconstruction and generation well, using uniformity of p-values (Fig. 3b).
• We show that gradient-based optimization of test statistics for normality can be complicated by identifiability issues, unbounded domains, and gradient singularities; we propose an SGD algorithm that optimizes over a Riemannian manifold (the Stiefel manifold in our case) and effectively solves our GAE formulation, with convergence analysis (Sec. 4).
• We show that GoFAE achieves comparable FID scores and mean squared error on three datasets while retaining statistical indistinguishability from Gaussian in the latent space (Sec. 5).

2. PRELIMINARIES FOR GOODNESS OF FIT AUTOENCODING

Background. Let $(\mathcal{X}, P_X)$ and $(\mathcal{Z}, P_Z)$ be two probability spaces. In our setup, $P_X$ is the true, but unknown, non-atomic distribution on the data space $\mathcal{X}$, while $P_Z$ is a prior distribution on the latent space $\mathcal{Z}$. An implicit generative model is defined by sampling a code $Z \sim P_Z$ and applying a mapping $G$, called the decoder, to produce $G(Z) \in \mathcal{X}$. The distribution of $G(Z)$ is given by the pushforward of $P_Z$ under $G$, commonly denoted by $G_\# P_Z$. Our concern is finding $G$ such that $G_\# P_Z$ is close to $P_X$. One approach is based on optimal transport. If $\mu$ and $\nu$ are two distributions respectively on spaces $E_1$ and $E_2$, and $c(u, v)$ is a cost function on $E_1 \times E_2$, then the optimal transport cost to transfer $\mu$ to $\nu$ is $T_c(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(U,V) \sim \pi}[c(U, V)]$, where $\Pi(\mu, \nu)$ is the set of all joint probability distributions with marginals $\mu$ and $\nu$. If $E_1 = E_2$ is endowed with a metric $d$, and $c(u, v) = d(u, v)^p$ with $p > 0$, then $d_{W_p}(\mu, \nu) = T_c(\mu, \nu)^{1/p}$ is known as the $L_p$-Wasserstein distance. In principle, $G$ can be learned by solving $\min_G T_c(P_X, G_\# P_Z)$. Unfortunately, the minimization is intractable. Instead, the WAE approach entails finding a mapping $F$, known as the encoder, such that the pushforward distribution $P_Y := F_\# P_X$ matches $P_Z$. We will only consider non-random encoders and decoders, that is, a fully deterministic autoencoder. This is theoretically supported by the Monge-Kantorovich equivalence (Villani, 2008; Patrini et al., 2018). In this setting, $F : \mathcal{X} \to \mathcal{Z}$ encodes $X \sim P_X$ to produce the code $Y = F(X)$, and $G : \mathcal{Z} \to \mathcal{X}$ decodes $Y$ to produce the reconstruction $G(Y)$. To emphasize the encode-decode process, we can write the reconstruction as $(G \circ F)(X) := G(F(X)) = G(Y)$ and the reconstruction distribution as the pushforward distribution $(G \circ F)_\# P_X = G_\# F_\# P_X = G_\# P_Y$.
We can define the WAE objective

$$\mathrm{WAE}^{\lambda}_{c}(P_X, G) = \inf_F \left[ \int_{\mathcal{X}} c(x, G(F(x)))\, dP_X(x) + \lambda D(F_\# P_X, P_Z) \right], \tag{1}$$

where $\lambda > 0$ is a regularization coefficient and $D$ is a penalty on statistical discrepancy.

Wasserstein Distance: Bounds. When $\mathcal{X}$ and $\mathcal{Z}$ are Euclidean spaces, a common choice for $c$ is the squared Euclidean distance, leading to the $L_2$-Wasserstein distance $d_{W_2}$ and the following bounds.

Proposition 1. If $\mathcal{X}$ and $\mathcal{Z}$ are Euclidean spaces and the decoder $G$ is differentiable, then (a) $d_{W_2}(G_\# P_Y, G_\# P_Z) \le \|\nabla G\|_\infty\, d_{W_2}(P_Y, P_Z)$, (b) $d_{W_2}(P_X, G_\# P_Z) \le [\mathbb{E}\|X - G(Y)\|^2]^{1/2} + \|\nabla G\|_\infty\, d_{W_2}(P_Y, P_Z)$.

Combined with the triangle inequality for $d_{W_2}$, (a) and (b) in Proposition 1 imply that when the reconstruction error $[\mathbb{E}\|X - G(Y)\|^2]^{1/2}$ is small, proximity between the latent distribution $P_Y$ and the prior $P_Z$ is sufficient to ensure proximity of the generated distribution $G_\# P_Z$ and the data distribution $P_X$. Both (a) and (b) are similar to Patrini et al. (2018), though here a less tight bound, the squared error, is used in (b) as it is an easier objective to optimize. Given a sample of data, $\{X_i\}_{i=1}^n$, the $L_2$-Wasserstein distance between the empirical distribution of $\{Y_i = F(X_i)\}_{i=1}^n$ and $P_Z$ is a natural approximation of $d_{W_2}(P_Y, P_Z)$ because it is a consistent estimator of the latter as $n \to \infty$, although it is biased in general (Xu et al., 2020).

Proposition 2. Let $\hat P_{X,n}$ be the empirical distribution of samples $\{X_i\}_{i=1}^n$, and $\hat P_{Y,n}$ be the empirical distribution of $\{Y_i = F(X_i)\}_{i=1}^n$. Assume that $F(X)$ is differentiable with respect to $X$ with bounded gradients $\nabla F(X)$. Then, (a) $d_{W_2}(\hat P_{Y,n}, P_Y) \le \|\nabla F\|_\infty\, d_{W_2}(\hat P_{X,n}, P_X)$, (b) $d_{W_2}(P_Y, P_Z) \le d_{W_2}(\hat P_{Y,n}, P_Z) + \|\nabla F\|_\infty\, d_{W_2}(\hat P_{X,n}, P_X)$.

For large $n$, $d_{W_2}(\hat P_{X,n}, P_X)$ is small, so from (b), the proximity of the latent distribution and the prior is mainly controlled by the proximity between the empirical distribution $\hat P_{Y,n}$ and $P_Z$.
Together with Proposition 1, this shows that in order to achieve close proximity between $G_\# P_Z$ and $P_X$, we need strong control on the proximity between $\hat P_{Y,n}$ and $P_Z$ in addition to a good reconstruction.

Extending to a Class. So far we have focused on a single completely specified $P_Z$. However, as we will argue in the next section, it can be beneficial to reduce the specificity to allow for a class $\mathcal{G}$ of priors. Letting $d_{W_2}(\cdot, \mathcal{G}) = \inf_{P \in \mathcal{G}} d_{W_2}(\cdot, P)$, all the above results can be easily extended. For example, Proposition 1 (a) gives $\inf_{P_Z \in \mathcal{G}} d_{W_2}(G_\# P_Y, G_\# P_Z) \le \|\nabla G\|_\infty\, d_{W_2}(P_Y, \mathcal{G})$. Once $d_{W_2}(P_Y, \mathcal{G})$ is controlled, we can then identify $P_Z \in \mathcal{G}$ that satisfies $d_{W_2}(P_Y, P_Z) = d_{W_2}(P_Y, \mathcal{G})$ and use it as the prior. Clearly, if $\mathcal{G}$ contains only one distribution, this reduces to the case where the prior is completely specified. Thus, the setting that allows for a class of priors is more general. We propose to use hypothesis tests associated with the Wasserstein distance. This does not preclude tests associated with other statistical distances, which may even be used in conjunction. Many standard statistical distances are not proper metrics, but aim to control statistical distances dominated by the Wasserstein distance. At the population level, the Wasserstein distance provides stronger separation between different distributions; at the sample level, it is useful to use different GoF tests since each GoF test is sensitive to different characteristics of the data.
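To make the empirical quantity $d_{W_2}(\hat P_{Y,n}, P_Z)$ concrete in one dimension, the sketch below approximates the $L_2$-Wasserstein distance between a sample and $N(0, 1)$ by comparing order statistics with standard normal quantiles. The midpoint discretization, the function name `w2_to_standard_normal`, and the synthetic codes are illustrative assumptions, not part of the method described above.

```python
import math
import random
from statistics import NormalDist


def w2_to_standard_normal(sample):
    """Approximate d_W2 between the empirical law of `sample` and N(0, 1).

    In 1-D the squared distance is the integral of the squared difference
    between quantile functions; here it is discretized with a midpoint rule.
    """
    n = len(sample)
    std_normal = NormalDist(0.0, 1.0)
    xs = sorted(sample)
    # Pair the i-th order statistic with the N(0, 1) quantile at (i + 0.5) / n.
    sq = sum((x - std_normal.inv_cdf((i + 0.5) / n)) ** 2
             for i, x in enumerate(xs))
    return math.sqrt(sq / n)


random.seed(0)
codes = [random.gauss(0.0, 1.0) for _ in range(2000)]  # codes close to the prior
shifted = [y + 2.0 for y in codes]                     # codes far from the prior
d_close = w2_to_standard_normal(codes)
d_far = w2_to_standard_normal(shifted)
print(d_close, d_far)
```

For the shifted sample the distance should be close to the size of the shift, since translating a distribution by a constant moves it by that constant in $d_{W_2}$.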

3. THE NEED FOR A HIGHER CRITICISM: FROM LOCAL TO GLOBAL TESTING

GAEs are typically trained using minibatches, or subsamples, of size $m$. A natural starting point is to push for matching the prior at the minibatch level. However, Section 3.2 discusses how focusing only on the minibatch level runs the risk of overfitting and how this can be mitigated with higher criticism. GoF tests assess whether a sample of observed data, $\{y_i\}_{i=1}^m$, is drawn from a distribution class $\mathcal{G}$. The most general opposing hypotheses are $H_0 : P_Y \in \mathcal{G}$ vs. $H_1 : P_Y \notin \mathcal{G}$. A test is specified by a test statistic $T$ and a significance level $\alpha \in (0, 1)$. If the observed value of $T$ is $t$ when it is applied to $\{y_i\}_{i=1}^m$, i.e., $T(\{y_i\}_{i=1}^m) = t$, then the (lower tail) p-value is $P(T(\{Y_i\}_{i=1}^m) \le t \mid H_0)$, where $\{Y_i\}_{i=1}^m \sim P_Y \in \mathcal{G}$. $H_0$ is rejected in favor of $H_1$ at level $\alpha$ if the p-value is less than $\alpha$, or equivalently, if $t \le T_\alpha$, where $T_\alpha$ is the $\alpha$-th quantile of $T$ under $H_0$. The probability integral transform states that if $T$ has a continuous distribution under $H_0$, the p-values are uniformly distributed on $(0, 1)$ (Murdoch et al., 2008).

A random vector $Y$ is multivariate normal (MVN) if and only if every one-dimensional projection $u^\top Y$ is univariate normal (Shao & Zhou, 2010). Thus, one way to test MVN is to apply GoF tests of univariate normality (UVN) to a set of (random) projections of $Y$ and then make a decision based on the resulting collection of test statistics. Many UVN tests exist (Looney, 1995; Mecklin & Mundfrom, 2005), each focusing on different distributional characteristics. We next briefly describe correlation-based (CB) tests because they are directly related to the Wasserstein distance. For other test types, see appendix § B. In essence, CB tests on UVN are based on assessment of the correlation between empirical and UVN quantiles (Dufour et al., 1998). Two common CB tests are Shapiro-Wilk (SW) (Shapiro & Wilk, 1965) and Shapiro-Francia (SF) (Shapiro & Francia, 1972). Their test statistics will be denoted by $T_{SW}$ and $T_{SF}$, respectively. Both are closely related to $d_{W_2}$.
For example, for $m \gg 1$, $T_{SW}$ is a strictly decreasing function of $d_{W_2}(\hat P_{Y,m}, N(0, 1))$ (del Barrio et al., 1999), i.e., large $T_{SW}$ values correspond to small $d_{W_2}$, justifying the rule that $T_{SW} > T_\alpha$ in order not to reject $H_0$.

The Benefits of GoF Testing in Autoencoders. GoF tests on MVN inspect if $P_Y$ is equal to some $P_Z \in \mathcal{G}$ (Fig. 1), where $\mathcal{G}$ is the class of all Gaussians, i.e., MVN distributions. The primary justification for preferring $P_Z$ as a Gaussian to the more commonly specified isotropic Gaussian (Makhzani et al., 2015; Kingma & Welling, 2013; Higgins et al., 2017) is that its covariance $\Sigma$ is allowed to implicitly adapt during training. For example, $\Sigma$ can be of full rank, allowing for correlations between the latent variables, or singular (i.e., a degenerate MVN, Figs. 5b, 5c). The ability to adapt to degenerate MVN is of particular benefit. If $F$ is differentiable and the dimension of the latent space is greater than the intrinsic dimension of the data distribution, then the latent variable $Y = F(X)$ only takes values in a manifold whose dimension is less than that of the latent space (Rubenstein et al., 2018). If $P_Z$ is only allowed to be an isotropic Gaussian, or more generally, a Gaussian with a covariance matrix of full rank, the regularizer in Equation 1 will promote $F$ that can fill the latent space with $F(X)$, leading to poor sample quality and wrong proportions of generated images (Rubenstein et al., 2018). Since GoF tests use the class $\mathcal{G}$ of all Gaussians, they may help avoid such a situation. Note that this precludes the use of whitening, i.e., the transformation $\Sigma^{-1/2}[X - \mathbb{E}(X)]$ to get to $N(0, I)$, as the covariance matrix $\Sigma$ may be singular. In other words, whitening confines available Gaussians to a subset much smaller than $\mathcal{G}$ (Fig. 2).
Notably, a similar benefit was exemplified in the 2-Stage VAE, where decoupling manifold learning from learning the probability measure stabilized FID scores and sharpened generated samples when the intrinsic dimension is less than the dimension of the latent space (Dai & Wipf, 2019) . There are several additional benefits of GoF testing. First, it allows the use of higher criticism (HC), discussed in Section 3.2. GoF tests act at the local level (minibatch), while HC applies to the global level (training set). Second, many GoF tests produce closed-form test statistics, and thus do not require tuning and provide easily understood output. Lastly, testing for MVN via projections is unaffected by rank deficiency. This is not the case with the majority of multivariate GoF tests, e.g., the Henze-Zirkler test (Henze & Zirkler, 1990) . In fact, any affine invariant test statistic for MVN must be a function of Mahalanobis distance, consequently requiring non-singularity (Henze, 2002) . 
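The projection-based testing idea in this section can be sketched as follows: a univariate correlation-based statistic is applied to random one-dimensional projections of a minibatch of codes. This is a minimal illustration, assuming a Shapiro-Francia-style statistic computed with Blom's approximation to the expected normal order statistics; the function names and the synthetic data are hypothetical, and the tests used in the experiments may be implemented differently.

```python
import numpy as np
from statistics import NormalDist


def shapiro_francia(x):
    """Squared correlation between order statistics and approximate normal scores."""
    x = np.sort(np.asarray(x, dtype=float))
    m = x.size
    nd = NormalDist()
    # Blom's approximation to expected standard normal order statistics (assumption).
    scores = np.array([nd.inv_cdf((i - 0.375) / (m + 0.25))
                       for i in range(1, m + 1)])
    centered = x - x.mean()
    return float((scores @ centered) ** 2
                 / ((scores @ scores) * (centered @ centered)))


def projected_statistics(Y, n_proj, rng):
    """Apply the univariate statistic to n_proj random unit-vector projections of Y."""
    d = Y.shape[1]
    stats = []
    for _ in range(n_proj):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)          # random direction on the unit sphere
        stats.append(shapiro_francia(Y @ u))
    return np.array(stats)


rng = np.random.default_rng(0)
m, d = 200, 8
# Correlated (not isotropic) Gaussian codes: every projection is still normal.
gaussian_codes = rng.normal(size=(m, d)) @ rng.normal(size=(d, d))
skewed_codes = rng.exponential(size=(m, d))   # non-Gaussian codes for contrast
t_gauss = projected_statistics(gaussian_codes, 20, rng)
t_skew = projected_statistics(skewed_codes, 20, rng)
print(t_gauss.mean(), t_skew.mean())
```

Because the statistic is computed on projections, no inverse covariance is needed, which is why this style of test remains usable when the covariance is rank deficient.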

3.1. A LOCAL PERSPECTIVE: GOODNESS OF FIT FOR NORMALITY

In light of Section 2, the aim is to ensure proximity of $P_X$ to $G_\# P_Z$ by finding $F, G$ such that the reconstruction loss, $d$, is small and $F_\# P_X$ is sufficiently close to the prior $P_Z$. As discussed above, $P_Z$ is not completely specified but required to belong to $\mathcal{G}$. We can formulate the problem as $\min_{F,G} \mathbb{E}[d(X, G(F(X)))]$ subject to $F_\# P_X \in \mathcal{G}$. To enforce the constraint in the minimization, a hypothesis test can be used to decide whether or not to reject the null $H_0 : F_\# P_X \in \mathcal{G}$. When the network is trained using a single minibatch of size $m$, making $Y$ less statistically distinguishable from $\mathcal{G}$ amounts to increasing $\mathbb{E}[T(\{F(X_i)\}_{i=1}^m)]$, where $X_i$ are i.i.d. $\sim P_X$. However, if the training results in none of the minibatches yielding $T$ values small enough to reject $H_0$ at a given level $\alpha$, it also indicates a mismatch between the latent and prior distributions. This type of mismatch cannot be detected at the minibatch level; a more global view is detailed in the next section.

3.2. A GLOBAL PERSPECTIVE: GOODNESS OF FIT FOR UNIFORMITY -HIGHER CRITICISM

Algorithm 1 Evaluating Higher Criticism
Require: Trained encoder $F_\theta$, data $\{x_i\}_{i=1}^N$, GoF test $T$
1: for $i = 1 : N/m$ do
2:   Randomly sample minibatch $\mathbf{X}$ of size $m$
3:   $\mathbf{Y} = F_\theta(\mathbf{X})$
4:   if projection required then
5:     $T = T(\mathbf{Y}u)$, where $u \in S$
6:   else
7:     $T = T(\mathbf{Y})$
8:   Calculate p-value of $T$ and store it
9: Use $\mathrm{KS}_{\mathrm{unif}}$ to evaluate p-value set.

Neural networks are powerful universal function approximators (Hornik et al., 1989; Cybenko, 1989). With sufficient capacity it may be possible to train $F$ to overfit, i.e., to produce too many minibatches with large p-values. Under $H_0$, the probability integral transform posits p-value uniformity. Therefore, it is expected that after observing many minibatches, approximately a fraction $\alpha$ of them will have p-values that are less than $\alpha$. This idea of a more encompassing test is known as Tukey's higher criticism (HC) (Donoho et al., 2004). While each minibatch GoF test is concerned with optimizing for indistinguishability from the prior distribution class $\mathcal{G}$, the HC test is concerned with testing whether the collection of p-values is uniformly distributed on $(0, 1)$, which may be accomplished through the Kolmogorov-Smirnov uniformity ($\mathrm{KS}_{\mathrm{unif}}$) test. See Algorithm 1 for pseudocode and Fig. 3 for an illustration of the HC process. HC and GoF form a symbiotic relationship; HC cannot exist by itself, and GoF by itself may overfit or underfit, producing p-values in incorrect proportions.

Figure: $U = \{x = [x_1, x_2, x_3] \in \mathbb{R}^3 : 0 \le x_1 \le x_2 \le x_3 \le 1\}$. For every $x \in U$, the coordinates of $x$ are also its order statistics; the min, median, and max coordinates correspond to the x-, y-, and z-axes. The green simplex is where $T_{SW}(\{x_1, x_2, x_3\}) = 1$.
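The final step of Algorithm 1, testing a stored set of p-values for uniformity, can be sketched in a few lines. The version below is an assumption-laden illustration: it computes the Kolmogorov-Smirnov statistic against Uniform(0, 1) and uses the asymptotic Kolmogorov series for the p-value (no finite-sample correction), and the p-value sets are synthetic stand-ins for what steps 1 to 8 would produce.

```python
import math


def ks_uniform(pvalues):
    """Kolmogorov-Smirnov test of `pvalues` against Uniform(0, 1).

    Returns (D, p), where p comes from the asymptotic Kolmogorov distribution.
    """
    p = sorted(pvalues)
    n = len(p)
    d_plus = max((i + 1) / n - v for i, v in enumerate(p))
    d_minus = max(v - i / n for i, v in enumerate(p))
    d = max(d_plus, d_minus)
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The survival function is numerically 1 here; the series converges slowly.
        return d, 1.0
    total = 0.0
    for k in range(1, 101):
        term = math.exp(-2.0 * (k * lam) ** 2)
        total += term if k % 2 == 1 else -term
        if term < 1e-16:
            break
    return d, min(1.0, max(0.0, 2.0 * total))


n = 500
even = [(i + 0.5) / n for i in range(n)]   # perfectly spread p-values
skewed = [v ** 2 for v in even]            # too many small p-values
d_even, p_even = ks_uniform(even)
d_skew, p_skew = ks_uniform(skewed)
print(d_even, p_even)
print(d_skew, p_skew)
```

A small KS p-value signals that the minibatch p-values are not uniform, whichever direction the excess mass lies in.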

4. RIEMANNIAN SGD FOR OPTIMIZING A GOODNESS OF FIT AUTOENCODER

In practice, $F$ and $G$ are neural networks parameterized by $\theta$ and $\phi$, respectively. The empirical GoFAE minibatch loss function based on Equation 2 is written as

$$L(\theta, \phi; \{x_i\}_{i=1}^m) = \frac{1}{m} \sum_{i=1}^m d(x_i, G_\phi(F_\theta(x_i))) \pm \lambda T(\{F_\theta(x_i)\}_{i=1}^m). \tag{3}$$

A GoF test statistic $T$ is not merely an evaluation mechanism. Its gradient will impact how the model learns what characteristics of a sample indicate strong deviation from normality, carrying over to what $P_Y$ becomes. The following desiderata may help when selecting $T$. We denote a collection of sample observations by bold $\mathbf{X}$, $\mathbf{V}$, and $\mathbf{Y}$. We suppose that with probability one under $P_X$ the following two conditions are satisfied.
• GoF-Trainable: $T(F_\theta(\mathbf{X}))$ is almost everywhere continuously differentiable in the feasible region $\Omega$.
• GoF-Consistent: There exists a $\theta$ in $\Omega$ such that $F_\theta(\mathbf{X})$ is consistent with the assumption that $F_\theta(\mathbf{X})$ are i.i.d. samples from the target distribution.

GoF-Trainable is needed if gradient-based methods are used to optimize the network parameters. Consider a common encoder architecture $F_\theta$ that consists of multiple feature extraction layers (forming a mapping $H_\Xi(\mathbf{X})$) followed by a fully connected layer with parameter $\Theta$. Thus, $F_\theta(\mathbf{X}) = H_\Xi(\mathbf{X})\Theta = \mathbf{V}\Theta = \mathbf{Y}$, and $\theta = \{\Xi, \Theta\}$. The last layer is linear with no activation function, as shown in Fig. 4a. With this design, normality can be optimized with respect to the last layer $\Theta$ as discussed below. Given a GoF-Consistent statistic $T$, it is natural to seek a solution $\theta$.

Theorem 3. Suppose $\mathbf{V} \in \mathbb{R}^{m \times k}$ is of full row rank. For $\Theta \in \mathbb{R}^{k \times d}$, define $\mathbf{Y} = \mathbf{V}\Theta$. Denote $T_{SW} = T_{SW}(\mathbf{Y}u)$, where $u \in \mathbb{R}^d$ is a unit vector. Then, $T_{SW}$ is differentiable with respect to $\Theta$ almost everywhere, and $\nabla_\Theta T_{SW} = 0$ if and only if $T_{SW} = 1$.

This theorem justifies the use of $T_{SW}$ as an objective function according to the GoF desiderata. The largest possible value of $T_{SW}$ is 1, corresponding to an inability to detect deviation from normality no matter what level $\alpha$ is specified.
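Theorem 3 can be checked numerically on a small example. The sketch below writes the statistic as a ratio, $T = (a^\top B \Theta u)^2 / \|B \Theta u\|^2$ with $B$ the row-centered, row-sorted $\mathbf{V}$, differentiates the ratio in closed form, and compares against central finite differences. As an assumption, the weights $a$ are normalized Blom normal scores rather than the exact Shapiro-Wilk coefficients; they share the two properties the argument relies on (unit norm and zero sum). All names and sizes are illustrative.

```python
import numpy as np
from statistics import NormalDist


def normal_score_weights(m):
    """Unit-norm, zero-sum weights from Blom normal scores (stand-in for SW coefficients)."""
    nd = NormalDist()
    a = np.array([nd.inv_cdf((i - 0.375) / (m + 0.25)) for i in range(1, m + 1)])
    return a / np.linalg.norm(a)


def statistic_and_grad(V, Theta, u, a):
    """Return T = (a^T B Theta u)^2 / ||B Theta u||^2 and its gradient w.r.t. Theta."""
    order = np.argsort(V @ Theta @ u)          # row permutation sorting the projections
    Vs = V[order]
    B = Vs - Vs.mean(axis=0, keepdims=True)    # row-centered V*
    w = B @ Theta @ u
    s = a @ w
    T = s ** 2 / (w @ w)
    grad = (2.0 * s / (w @ w)) * np.outer(B.T @ (a - s * w / (w @ w)), u)
    return T, grad


rng = np.random.default_rng(0)
m, k, d = 12, 20, 4                            # m <= k so V has full row rank
V = rng.normal(size=(m, k))
Theta = rng.normal(size=(k, d))
u = rng.normal(size=d)
u /= np.linalg.norm(u)
a = normal_score_weights(m)

T, grad = statistic_and_grad(V, Theta, u, a)
num = np.zeros_like(Theta)
eps = 1e-6
for i in range(k):
    for j in range(d):
        E = np.zeros_like(Theta)
        E[i, j] = eps
        num[i, j] = (statistic_and_grad(V, Theta + E, u, a)[0]
                     - statistic_and_grad(V, Theta - E, u, a)[0]) / (2 * eps)
max_err = np.abs(grad - num).max()

# A Theta whose projected codes equal the weights a: T reaches 1 and the gradient vanishes.
Theta_star = np.outer(np.linalg.pinv(V) @ a, u)
T_star, grad_star = statistic_and_grad(V, Theta_star, u, a)
print(T, max_err, T_star, np.abs(grad_star).max())
```

The second part constructs a maximizer directly, which is possible because full row rank of $\mathbf{V}$ lets $\mathbf{V}\Theta u$ take any prescribed value.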
See the appendix for other test choices.

Identifiability, Singularities, and Stiefel Manifold. If $\mathbf{Y} = \mathbf{V}\Theta$ is Gaussian for some $\Theta$, then $\mathbf{V}\Theta M$ is also Gaussian for any nonsingular matrix $M$. Thus, any matrix of the form $\Theta M$ is a solution. This leads to several problems when optimizing an objective function containing $\Theta$ as a variable and the test statistic as part of the objective. First, there is an identifiability issue since, without any restrictions, the set of solutions is infinite. Here, all that matters is the direction of the projection, so restricting the space to the unit sphere is reasonable. Second, almost all GoF tests for normality in the current literature are affine invariant, and the gradient and Hessian of such test statistics are not bounded. Fig. 4b illustrates an example of such a singularity issue.

Theorem 4. Let $T(\{y_i\}_{i=1}^m)$, $m \ge 3$, be any affine invariant test statistic that is non-constant and differentiable wherever the $y_i$ are not all equal. Then, for any $b$, as $(y_1, \ldots, y_m) \to (b, \ldots, b)$, $\sup \|\nabla T(\{y_i\}_{i=1}^m)\| \to \infty$, where $\|\cdot\|$ is the Frobenius norm.

If $\Theta$ is searched without being kept away from zero or the diagonal line, traditional SGD results will not be applicable to yield convergence of $\Theta$. A common strategy might be to lower bound the smallest singular value of $\Theta^\top \Theta$. However, this does not solve the issue that any re-scaling of $\Theta$ leads to another solution, so $\Theta$ must also be upper bounded. It is thus desirable to restrict $\Theta$ in such a way that it avoids getting close to singular, by both upper and lower bounding it in terms of its singular values. We propose to restrict $\Theta$ to the compact Riemannian manifold of orthonormal matrices, i.e., $\Theta \in \mathcal{M} = \{\Theta \in \mathbb{R}^{k \times d} : \Theta^\top \Theta = I_d\}$, which is also known as the Stiefel manifold. This defines the feasible region $\Omega = \{\{\Xi, \Theta\} : \Theta^\top \Theta = I_d\}$ for $\theta$.
Optimization over a Stiefel manifold has been studied previously (Nishimori & Akaho, 2005; Absil et al., 2009; Cho & Lee, 2017; Bécigneul & Ganea, 2018; Huang et al., 2018; 2020; Li et al., 2020) .
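The gradient singularity in Theorem 4 can be observed numerically. The sketch below uses an affine-invariant correlation statistic (an illustrative stand-in built from Blom normal scores, not necessarily one of the tests used here) and shows that shrinking a sample toward a constant vector $(b, \ldots, b)$ by a factor $s$ leaves the statistic unchanged while the gradient norm grows like $1/s$. The sample values, the offset, and the finite-difference step are arbitrary assumptions.

```python
import numpy as np
from statistics import NormalDist


def cb_statistic(y):
    """Affine-invariant statistic: squared correlation with normal scores."""
    y = np.sort(np.asarray(y, dtype=float))
    m = y.size
    nd = NormalDist()
    q = np.array([nd.inv_cdf((i - 0.375) / (m + 0.25)) for i in range(1, m + 1)])
    c = y - y.mean()
    return float((q @ c) ** 2 / ((q @ q) * (c @ c)))


def grad_norm(y, h=1e-7):
    """Euclidean norm of a central finite-difference gradient of cb_statistic."""
    y = np.asarray(y, dtype=float)
    g = np.zeros_like(y)
    for i in range(y.size):
        e = np.zeros_like(y)
        e[i] = h
        g[i] = (cb_statistic(y + e) - cb_statistic(y - e)) / (2 * h)
    return float(np.linalg.norm(g))


y = np.array([0.1, 0.5, 0.7, 1.9, 2.0, 4.5])
b = 3.0
for s in (1.0, 0.1, 0.01):
    # The statistic is unchanged by the affine map; the gradient norm scales as 1/s.
    print(s, cb_statistic(s * y + b), grad_norm(s * y + b))
```

This is the behavior that motivates keeping $\Theta$ away from degenerate configurations rather than relying on unconstrained SGD.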

4.1. OPTIMIZING GOODNESS OF FIT TEST STATISTICS

A Riemannian metric provides a way to measure lengths and angles of tangent vectors of a smooth manifold. For the Stiefel manifold $\mathcal{M}$, in the Euclidean metric, we have $\langle \Gamma_1, \Gamma_2 \rangle = \mathrm{trace}(\Gamma_1^\top \Gamma_2)$ for any $\Gamma_1, \Gamma_2$ in the tangent space of $\mathcal{M}$ at $\Theta$, $T_\Theta \mathcal{M}$. It is known that $T_\Theta \mathcal{M} = \{\Gamma : \Gamma^\top \Theta + \Theta^\top \Gamma = 0, \Theta \in \mathcal{M}\}$ (Li et al., 2020). For an arbitrary matrix $\Theta \in \mathbb{R}^{k \times d}$, the retraction back to $\mathcal{M}$, denoted $\Theta_{\mathcal{M}}$, can be performed via a singular value decomposition, $\Theta_{\mathcal{M}} = R S^\top$, where $\Theta = R \Lambda S^\top$ and $\Lambda$ is the $d \times d$ diagonal matrix of singular values, $R \in \mathbb{R}^{k \times d}$ has orthonormal columns, i.e., $R \in \mathcal{M}$, and $S \in \mathbb{R}^{d \times d}$ is an orthogonal matrix. The retraction of the gradient $D = \nabla_\Theta T$ to $T_\Theta \mathcal{M}$, denoted by $D_{T_\Theta \mathcal{M}}$, can be accomplished by $D_{T_\Theta \mathcal{M}} = D - \Theta(\Theta^\top D + D^\top \Theta)/2$ (Fig. 4c). The process of mapping Euclidean gradients to $T_\Theta \mathcal{M}$, updating $\Theta$, and retracting back to $\mathcal{M}$ is well known (Nishimori & Akaho, 2005). The next result states that the Riemannian gradient for $T_{SW}$ has a finite second moment, which is needed for our convergence theorem in § 4.2.

Theorem 5. Let $T = T_{SW}$ be as in Theorem 3. Denote by $\nabla_\theta$, $\nabla_\Theta$ the Riemannian gradient w.r.t. $\theta$, $\Theta$, respectively, and by $\|\cdot\|$ the Frobenius norm. Let $B = (I - J/m)\mathbf{V}$, i.e., $B$ is obtained by subtracting from each row of $\mathbf{V}$ the mean of the rows. Suppose (a) $\sup_\Xi (\mathbb{E}[\|B\|^4] + \mathbb{E}[\|\nabla_\Xi B\|^4]) < \infty$, and (b) $\sup_{\|x\| = 1, \Xi} \mathbb{E}[\|Bx\|^{-4}] < \infty$. Then, $\sup_\theta \mathbb{E}[\|\nabla_\theta T\|^2] < \infty$.
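The two Stiefel-manifold operations described above, mapping a Euclidean gradient to the tangent space and retracting via SVD, can be written compactly. This is a minimal sketch under stated assumptions: the gradient `D` is a random stand-in for $\nabla_\Theta T$, the step size 0.1 is arbitrary, and the function names are hypothetical.

```python
import numpy as np


def retract_to_stiefel(theta):
    """Map an arbitrary k x d matrix onto the Stiefel manifold via Theta_M = R S^T."""
    R, _, St = np.linalg.svd(theta, full_matrices=False)  # theta = R diag(s) St
    return R @ St


def project_to_tangent(theta, D):
    """Map a Euclidean gradient D to the tangent space at theta (theta^T theta = I)."""
    return D - theta @ (theta.T @ D + D.T @ theta) / 2.0


rng = np.random.default_rng(0)
k, d = 64, 8
theta = retract_to_stiefel(rng.normal(size=(k, d)))     # a point on the manifold
D = rng.normal(size=(k, d))                             # stand-in Euclidean gradient
gamma = project_to_tangent(theta, D)
theta_next = retract_to_stiefel(theta + 0.1 * gamma)    # one ascent step, then retract

print(np.abs(theta_next.T @ theta_next - np.eye(d)).max())
print(np.abs(gamma.T @ theta + theta.T @ gamma).max())
```

Both printed quantities should be at the level of floating-point error: the updated matrix has orthonormal columns, and the projected gradient satisfies the tangent-space condition $\Gamma^\top \Theta + \Theta^\top \Gamma = 0$.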

4.2. CONVERGENCE OF THE GOODNESS OF FIT AUTOENCODER

Since $\Theta \in \mathcal{M}$, model training requires Riemannian SGD. The proof of Bonnabel (2013) for the convergence of SGD on a Riemannian manifold, which was extended from the Euclidean case (Bottou, 1998), requires conditions on step size, differentiability, and a uniform bound on the stochastic gradient. We show that this result holds under a much weaker condition, eliminating the need for a uniform bound. While we apply the resulting theorem to our specific problem with a Stiefel manifold, we emphasize that Theorem 6 is applicable to other suitable Riemannian manifolds.

Theorem 6. Let $\mathcal{M}$ be a connected Riemannian manifold with injectivity radius uniformly bounded from below by $I > 0$. Let $C \in C^{(3)}(\mathcal{M})$ and $R_w$ be a twice continuously differentiable retraction. Let $z_0, z_1, \ldots$ be i.i.d. $\sim \zeta$ taking values in $\mathcal{Z}$. Let $H : \mathcal{Z} \times \mathcal{M} \to T\mathcal{M}$ be a measurable function such that $\mathbb{E}[H(\zeta, w)] = \nabla C(w)$ for all $w \in \mathcal{M}$, where $T\mathcal{M}$ is the tangent bundle of $\mathcal{M}$. Consider the SGD update $w_{t+1} = R_{w_t}(-\gamma_t H(z_t, w_t))$ with step size $\gamma_t > 0$ satisfying $\sum_t \gamma_t^2 < +\infty$ and $\sum_t \gamma_t = +\infty$. Suppose there exists a compact set $K$ such that all $w_t \in K$ and $\sup_{w \in K} \mathbb{E}[\|H(\zeta, w)\|^2] \le A^2$ for some $A > 0$. Then, $C(w_t)$ converges almost surely and $\nabla C(w_t) \to 0$ almost surely.

In our context, $C(\cdot)$ corresponds to Equation 3. The parameters of the GoFAE are $\{\theta, \phi\} = \{\Xi, \Theta, \phi\}$, where $\Xi, \phi$ are defined in Euclidean space, a Riemannian manifold endowed with the Euclidean metric, and $\Theta$ on the Stiefel manifold. Thus, $\{\Xi, \Theta, \phi\}$ lies on a product manifold that is also Riemannian, and $\{\Xi, \Theta, \phi\}$ are updated simultaneously. Convergence of the GoFAE follows from Theorem 5 and Theorem 6 provided that $\Xi, \phi$ stay within a compact set. Algorithms 2 and S1 give the pseudocode for GoFAE optimization and the complete GoFAE and HC pipeline, respectively.
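As a worked instance of the step-size condition in Theorem 6, polynomially decaying steps satisfy both requirements. This is a standard example given for illustration, not a schedule stated in the text; the constants $c$ and $a$ are hypothetical.

```latex
\gamma_t = \frac{c}{(t+1)^{a}}, \qquad c > 0,\ \ \tfrac{1}{2} < a \le 1
\quad\Longrightarrow\quad
\sum_{t \ge 0} \gamma_t = c \sum_{t \ge 0} (t+1)^{-a} = +\infty,
\qquad
\sum_{t \ge 0} \gamma_t^{2} = c^{2} \sum_{t \ge 0} (t+1)^{-2a} < +\infty .
```

Both statements follow from the fact that $\sum_{n \ge 1} n^{-p}$ converges if and only if $p > 1$: the exponent $a \le 1$ makes the first series diverge, while $2a > 1$ makes the second converge.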

5. EXPERIMENTS

We evaluate generation and reconstruction performance, normality, and informativeness of GoFAE using several GoF statistics on MNIST (LeCun et al., 1998), CelebA (Liu et al., 2015), and a third dataset. The following results are from experiments on CelebA.

Algorithm 2 GoFAE Optimization

Require: test $T$, learning rates $\eta_1, \eta_2$, max iterations $J$, regularization coefficient $\lambda$
1: Initialize: $\Xi, \Theta, \phi$
2: while $j < J$ do
3:   Sample minibatch $\mathbf{X}$ of size $m$
4:   $\mathbf{V} = F^{(1)}_{\Xi_j}(\mathbf{X})$, $\mathbf{Y} = F^{(2)}_{\Theta_j}(\mathbf{V}) = \mathbf{V}\Theta_j$
5:   if test requires projection then
6:     $T = T(\mathbf{Y}u)$, where $u \in S$
7:   else
8:     $T = T(\mathbf{Y})$
9:   $L = d(\mathbf{X}, G_\phi(\mathbf{Y})) \pm \lambda T$
10:  $\Xi_{j+1} = \Xi_j - \eta_1 \nabla_\Xi L$ (or another optimizer)
11:  $\phi_{j+1} = \phi_j - \eta_1 \nabla_\phi L$ (or another optimizer)
12:  $D = \nabla_\Theta T$
13:  $\Gamma = D - \Theta_j(\Theta_j^\top D + D^\top \Theta_j)/2$
14:  $\tilde\Theta_{j+1} = \Theta_j + \eta_2 \Gamma$
15:  Compute $R \Lambda S^\top = \mathrm{SVD}(\tilde\Theta_{j+1})$
16:  $\Theta_{j+1} = (\tilde\Theta_{j+1})_{\mathcal{M}} = R S^\top$

Effect of $\lambda$ and Mutual Information (MI). We investigated the balance between generation and reconstruction in the GoFAE by considering minibatch (local) GoF test p-value distributions, (global) normality, and MI as a function of $\lambda$ (Fig. 5). Global normality, as assessed through the higher criticism (HC) principle for p-value uniformity, is computed using Algorithm 1. We trained GoFAE on CelebA using $T_{SW}$ as the test statistic for $\lambda = 10$ (Fig. 5a, blue) and $\lambda = 100{,}000$ (Fig. 5a, yellow). For small $\lambda$, the model emphasizes reconstruction and the p-value distribution will be skewed right since the penalty for deviating from normality is small (Fig. 5a, red dashed line). As $\lambda$ increases, more emphasis is placed on normality and, for some $\lambda$, p-values are expected to be uniformly distributed (Fig. 5a, black line). If $\lambda$ is too large, the p-value distribution will be left-skewed (Fig. 5a, red solid line), which corresponds to overfitting the prior. We assessed the normality of GoFAE minibatch encodings using 30 repetitions of Algorithm 1 for different $\lambda$ (Figs. 5b-5c). The blue points and blue solid line represent the $\mathrm{KS}_{\mathrm{unif}}$ test p-values ($p_{\mathrm{KS}_{\mathrm{unif}}}$) and their mean ($\bar p_{\mathrm{KS}_{\mathrm{unif}}} = \frac{1}{30}\sum_{i=1}^{30} p_{\mathrm{KS}_{\mathrm{unif}}, i}$), respectively.
HC rejects global normality ($p_{\mathrm{KS}_{\mathrm{unif}}} < 0.05$) when the test statistic p-values are right-skewed (too many minibatches deviate from normality) or left-skewed (minibatches are too normal); this GoFAE property discourages the posterior from over-fitting the prior. The range of $\lambda$ values for which the distribution of GoF test p-values is indistinguishable from uniform is where the mean $\bar p_{\mathrm{KS}_{\mathrm{unif}}} \ge 0.05$ (Figs. 5b-5c). Our method uses the class of Gaussians, $\mathcal{G} = \{N(\mu, \Sigma) : \mu \in \mathbb{R}^d, \Sigma \in \mathbb{R}^{d \times d}\}$, as a prior, where $(\mu, \Sigma)$ are the parameters of a Gaussian distribution denoting the mean and covariance matrix, respectively. When a model has finished training, we assume $F_\# P_X \in \mathcal{G}$ and use estimates of the mean and covariance $(\hat\mu, \hat\Sigma)$ for each GoFAE model to generate samples. Specifically, we drew $N = 10^4$ samples $\{z_i\}_{i=1}^N$ to generate images $\{G_\phi(z_i)\}_{i=1}^N$. These images are then encoded, giving the set $\{\tilde z_i\}_{i=1}^N$ with $\tilde z_i = F_\theta(G_\phi(z_i))$, resulting in a Markov chain $\{z_i\}_{i=1}^N \to \{G_\phi(z_i)\}_{i=1}^N \to \{\tilde z_i\}_{i=1}^N$. Assuming the $\tilde z_i$ are also normal, the data processing inequality gives a lower bound on the mutual information between $Z \sim P_Z$ and $\tilde Z \sim F_\# G_\# P_Z$, $I(Z, \tilde Z)$ (Figs. 5b-5c, red line). As $\lambda$ increases, the MI diminishes, suggesting the posterior overfits the prior as the encoding becomes independent of the data. The unimodality of $\bar p_{\mathrm{KS}_{\mathrm{unif}}}$ and monotonicity of MI suggest that $\lambda$ can be selected solely based on $\mathrm{KS}_{\mathrm{unif}}$ without reference to a performance criterion.

Gaussian Degeneracy. Due to noise in the data, numerically $P_Y$ can never be singular. Nevertheless, the experiments indicated that the GoF test pushed $P_Y$ to become "more and more singular" in the sense that the condition number of its covariance matrix, $\kappa(\hat\Sigma)$, became increasingly large. As $\lambda$ continues to increase, $\kappa(\hat\Sigma)$ also increases (Figs. 5b-5c, green line), implying a Gaussian increasingly concentrated around a lower-dimensional linear manifold.
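The two diagnostics discussed here, the mutual information between sampled codes and their re-encodings and the condition number of the estimated covariance, can be sketched as follows. This is an illustration under explicit assumptions: the MI is the closed-form plug-in estimate for jointly Gaussian variables (the exact estimator behind the reported curves is not specified in this section), and the "re-encoding" is simulated by adding noise rather than by passing samples through a trained decoder and encoder.

```python
import numpy as np


def gaussian_mutual_information(z, z_tilde):
    """Plug-in MI in nats between two samples, assuming they are jointly Gaussian."""
    d1 = z.shape[1]
    joint = np.atleast_2d(np.cov(np.hstack([z, z_tilde]), rowvar=False))
    _, logdet_z = np.linalg.slogdet(joint[:d1, :d1])
    _, logdet_t = np.linalg.slogdet(joint[d1:, d1:])
    _, logdet_joint = np.linalg.slogdet(joint)
    # I(Z; Z~) = 0.5 * log( det(S_z) det(S_z~) / det(S_joint) )
    return 0.5 * (logdet_z + logdet_t - logdet_joint)


def covariance_condition_number(y):
    """Condition number of the sample covariance of the rows of y."""
    return float(np.linalg.cond(np.atleast_2d(np.cov(y, rowvar=False))))


rng = np.random.default_rng(0)
n, d = 20000, 4
z = rng.normal(size=(n, d))                      # stand-in for sampled codes
z_tilde = z + rng.normal(size=(n, d))            # simulated noisy re-encoding
unrelated = rng.normal(size=(n, d))              # encoding independent of z

mi_informative = gaussian_mutual_information(z, z_tilde)   # population value: 2 log 2
mi_independent = gaussian_mutual_information(z, unrelated) # population value: 0
kappa = covariance_condition_number(z * np.array([1.0, 1.0, 1.0, 1e-3]))
print(mi_informative, mi_independent, kappa)
```

Note that the log-determinants become unstable as the covariance approaches singularity, so in a nearly degenerate regime this plug-in estimate would need regularization or a restriction to the leading principal directions.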
We observed the same trend in the spectrum of singular values (SV) of $\hat\Sigma$ for each $\lambda$ after training with $T_{SW}$ (Fig. 6); while the spectrum did not exhibit large SVs, it had many small SVs. Notably, even when $\lambda$ was not too large, $\kappa(\hat\Sigma)$ was relatively large, indicating a shift to a lower dimension (Figs. 5b-5c). However, MI remained relatively large and the $\mathrm{KS}_{\mathrm{unif}}$ test suggests that $P_Y$ is still indistinguishable from Gaussian. Together, these results are evidence that the GoFAE can adapt as needed to a reduced-dimension representation while maintaining the representation's informativeness.

We presented the GoFAE, a deterministic GAE based on optimal transport and Wasserstein distances that optimizes GoF test statistics. We showed that gradient-based optimization of GoFAE induces identifiability issues, unbounded domains, and gradient singularities, which we resolve using Riemannian SGD. By using GoF statistics to measure deviation from the prior class of Gaussians, our model is capable of implicitly adapting its covariance matrix during training from full-rank to singular, which we demonstrate empirically. We developed a performance-agnostic model selection algorithm based on higher criticism of p-values for global normality testing. Collectively, empirical results show that GoFAE achieves comparable reconstruction and generation performance while retaining statistical indistinguishability from Gaussian in the latent space.

A.1 PROOF OF PROPOSITION 1.

Proposition 1. If $\mathcal{X}$ and $\mathcal{Z}$ are Euclidean spaces and the decoder $G$ is differentiable, then (a) $d_{W_2}(G_\# P_Y, G_\# P_Z) \le \|\nabla G\|_\infty\, d_{W_2}(P_Y, P_Z)$, (b) $d_{W_2}(P_X, G_\# P_Z) \le [\mathbb{E}\|X - G(Y)\|^2]^{1/2} + \|\nabla G\|_\infty\, d_{W_2}(P_Y, P_Z)$.

Proof. (a) Recall that for any two probability measures $\mu$ and $\nu$ on the same Euclidean space, $d_{W_2}^2(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int \|y - z\|^2\, \pi(dy, dz)$. Let $f(x, y) = (G(x), G(y))$. For any $Q \in \Pi(P_Y, P_Z)$, the measure $Q'(\cdot) = Q \circ f^{-1}(\cdot) \in \Pi(P_{G(Y)}, P_{G(Z)})$, so $d_{W_2}^2(P_{G(Y)}, P_{G(Z)}) \le \int \|y' - z'\|^2\, Q'(dy', dz') = \int \|G(y) - G(z)\|^2\, Q(dy, dz) \le \|\nabla G\|_\infty^2 \int \|y - z\|^2\, Q(dy, dz)$.
Taking the infimum of the right-hand side over all $Q \in \Pi(P_Y, P_Z)$ gives the result. (b) By the triangle inequality for $d_{W_2}$ and part (a),
$$d_{W_2}(P_X, P_{G(Z)}) \le d_{W_2}(P_X, P_{G(Y)}) + d_{W_2}(P_{G(Y)}, P_{G(Z)}) \le [E\|X - G(Y)\|^2]^{1/2} + \|\nabla G\|_\infty \, d_{W_2}(P_Y, P_Z).$$

A.2 PROOF OF PROPOSITION 2.

Proposition 2. Let $\hat P_{X,n}$ be the empirical distribution of samples $\{X_i\}_{i=1}^n$, and $\hat P_{Y,n}$ be the empirical distribution of $\{Y_i = F(X_i)\}_{i=1}^n$. Assume that $F(X)$ is differentiable with respect to $X$ with bounded gradients $\nabla F(X)$. Then, (a) $d_{W_2}(\hat P_{Y,n}, P_Y) \le \|\nabla F\|_\infty \, d_{W_2}(\hat P_{X,n}, P_X)$, (b) $d_{W_2}(P_Y, P_Z) \le d_{W_2}(\hat P_{Y,n}, P_Z) + \|\nabla F\|_\infty \, d_{W_2}(\hat P_{X,n}, P_X)$.

A.3 PROOF OF THEOREM 3.

As described in Section B.1, the Shapiro-Wilk statistic of a sample $X_1, \ldots, X_m$ is
$$T_{SW} = \frac{\left(\sum_{i=1}^m a_i X_{(i)}\right)^2}{\sum_{i=1}^m (X_i - \bar X)^2},$$
where $X_{(1)} \le X_{(2)} \le \cdots \le X_{(m)}$ are the order statistics and the $a_i$ are certain constants that are all different. Let $a = (a_1, \ldots, a_m)^\top$. We only need the fact that $\|a\| = 1$ and $\mathbf{1}^\top a = 0$, where $\mathbf{1}$ is the vector of $m$ 1's.

Theorem 3. Suppose $V \in \mathbb{R}^{m \times k}$ is of full row rank. For $\Theta \in \mathbb{R}^{k \times d}$, define $Y = V\Theta$. Denote $T_{SW} = T_{SW}(Yu)$, where $u \in \mathbb{R}^d$ is a unit vector. Then, $T_{SW}$ is differentiable with respect to $\Theta$ almost everywhere, and $\nabla_\Theta T_{SW} = 0$ if and only if $T_{SW} = 1$.

Proof. Let $V^*$ be a row permutation of $V$ such that the coordinates of $V^* \Theta u$ are in increasing order. Denote by $I$ the $m \times m$ identity matrix and $J = \mathbf{1}\mathbf{1}^\top$. Put $B = (I - J/m)V^*$ and note that $a^\top V^* = a^\top B$, as $a^\top J = a^\top \mathbf{1}\mathbf{1}^\top = (\mathbf{1}^\top a)\mathbf{1}^\top = 0$. Then
$$T_{SW} = \frac{(a^\top V^* \Theta u)^2}{\|(I - J/m)V^* \Theta u\|^2} = \frac{(a^\top B \Theta u)^2}{\|B \Theta u\|^2}. \tag{S4}$$
It is easy to see that $T_{SW}$ is differentiable at $\Theta$ if all $v_i^\top \Theta u$ are different, where $v_1^\top, \ldots, v_m^\top$ are the rows of $V$. Also, in this case, $V^*$ is unique and $B \Theta u \ne 0$. On the other hand, for $i \ne j$, since $u \ne 0$ and $v_i \ne v_j$ as $V$ is of full row rank, the set of $\Theta$ with $(v_i - v_j)^\top \Theta u = 0$ is a strict linear subspace of $\mathbb{R}^{k \times d}$, so has Lebesgue measure 0.
Then $T_{SW}$ is differentiable in $\Theta$ almost everywhere. Fix $\Theta$ with all the coordinates of $V \Theta u$ being different. Then
$$\nabla_\Theta T_{SW} = \frac{2(a^\top B \Theta u)}{\|B \Theta u\|^2} \, B^\top \left[ a - \frac{(a^\top B \Theta u)(B \Theta u)}{\|B \Theta u\|^2} \right] u^\top. \tag{S5}$$
Note that $a^\top B \Theta u$ is a scalar. Also, $a^\top B \Theta u \ne 0$, for otherwise from (S4) $T_{SW} = 0$, contradicting Lemma 3 (Shapiro & Wilk, 1965). Suppose $\nabla_\Theta T_{SW} = 0$. Then from (S5),
$$B^\top \left[ a - \frac{(a^\top B \Theta u)(B \Theta u)}{\|B \Theta u\|^2} \right] u^\top = 0.$$
Let $c = \|B \Theta u\|^2 / (a^\top B \Theta u)$ and $h = B \Theta u - ca$. Then the above equation can be written as
$$B^\top \left[ a - \frac{B \Theta u}{c} \right] u^\top = -c^{-1} B^\top h u^\top = 0,$$
giving $B^\top h u^\top u = \|u\|^2 B^\top h = 0$. From $u \ne 0$, $0 = h^\top B = h^\top (I - J/m) V^* = (h - k\mathbf{1})^\top V^*$ with $k = (h^\top \mathbf{1})/m$. Because $V^*$ is of full row rank, $h = k\mathbf{1}$. However, $\mathbf{1}^\top h = \mathbf{1}^\top B \Theta u - c \mathbf{1}^\top a$. Since $\mathbf{1}^\top B = \mathbf{1}^\top (I - J/m) V^* = 0$ and, as pointed out earlier, $\mathbf{1}^\top a = 0$, then $\mathbf{1}^\top h = 0$. As a result $k = 0$, giving $h = 0$, i.e., $B \Theta u = ca$. Plugging this into (S4) and noting that $\|a\| = 1$, $T_{SW} = 1$. On the other hand, by Lemma 2 of (Shapiro & Wilk, 1965), $T_{SW} = 1$ if and only if $B \Theta u = sa$ for some scalar $s \ne 0$. Clearly in this case all the coordinates of $B \Theta u$ are different, so $T_{SW}$ is differentiable at $\Theta$. Since 1 is the maximum value of $T_{SW}$, then $\nabla_\Theta T_{SW} = 0$.

A.4 PROOF OF THEOREM 4.

Theorem 4. Let $T(\{y_i\}_{i=1}^m)$, $m \ge 3$, be any affine invariant test statistic that is non-constant and differentiable wherever the $y_i$ are not all equal. Then, for any $b$, as $(y_1, \ldots, y_m) \to (b, \ldots, b)$, $\sup \|\nabla T(\{y_i\}_{i=1}^m)\| \to \infty$, where $\|\cdot\|$ is the Frobenius norm.

Proof. First, for all $s > 0$, $T(sy_1 + b, \ldots, sy_m + b) = T(y_1, \ldots, y_m)$, so $\nabla T(sy_1 + b, \ldots, sy_m + b) = \frac{1}{s} \nabla T(y_1, \ldots, y_m)$. By assumption, $\nabla T(y_1, \ldots, y_m) \ne 0$ for some $(y_1, \ldots, y_m)$. Then, letting $s \to 0$, $\|\nabla T(sy_1 + b, \ldots, sy_m + b)\| = \frac{1}{s} \|\nabla T(y_1, \ldots, y_m)\| \to \infty$.

A.5 PROOF OF THEOREM 5.

Recall $X = (X_1, \ldots, X_m)$ and $V = H_\Xi(X)$, where the $X_i$'s are i.i.d. sample observations. Note that $V$ is a function of $\Xi$ and $X$ but not a function of $\Theta$.

Theorem 5. Let $T = T_{SW}$ be as in Theorem 3.
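Denote by $\tilde\nabla_\theta$, $\tilde\nabla_\Theta$ the Riemannian gradients w.r.t. $\theta$, $\Theta$, respectively, and by $\|\cdot\|$ the Frobenius norm. Let $B = (I - J/m)V$, i.e., $B$ is obtained by subtracting from each row of $V$ the mean of the rows. Suppose (a) $\sup_\Xi \left( E[\|B\|^4] + E[\|\nabla_\Xi B\|^4] \right) < \infty$, and (b) $\sup_{\|x\| = 1, \Xi} E[\|Bx\|^{-4}] < \infty$. Then, $\sup_\theta E[\|\tilde\nabla_\theta T\|^2] < \infty$.

Proof. Since all the assumptions and the assertion of Theorem 5 are invariant under any permutation of the rows of $V$, for ease of notation and without loss of generality, suppose the coordinates of $V \Theta u$ are in increasing order. By construction, the coordinates of $B \Theta u$ are also increasing and have mean 0. Then
$$T = \frac{(a^\top B \Theta u)^2}{\|B \Theta u\|^2}.$$
By $\tilde\nabla_\theta T = (\tilde\nabla_\Theta T, \nabla_\Xi T)$, it suffices to show $\sup_\theta E(\|\tilde\nabla_\Theta T\|^2) < \infty$ and $\sup_\theta E(\|\nabla_\Xi T\|^2) < \infty$. Regard the Stiefel manifold $\mathcal{M}$ as a subset of $\mathbb{R}^{k \times d}$ equipped with the inner product $\langle \Gamma_1, \Gamma_2 \rangle = \mathrm{tr}(\Gamma_1^\top \Gamma_2)$, which is simply the Euclidean inner product of the vectorized $\Gamma_1$ and $\Gamma_2$. The Riemannian gradient $\tilde\nabla_\Theta$ is obtained under this inner product and is equal to the orthogonal projection of $\nabla_\Theta$ onto $T_\Theta \mathcal{M}$, where $\nabla_\Theta$ denotes the Euclidean gradient. Then $\|\tilde\nabla_\Theta T\| \le \|\nabla_\Theta T\|$, so it is enough to show $\sup_\theta E(\|\nabla_\Theta T\|^2) < \infty$. From (S5),
$$\nabla_\Theta T = \frac{2(a^\top B \Theta u)}{\|B \Theta u\|^2} \, B^\top \left[ a - \frac{(a^\top B \Theta u)(B \Theta u)}{\|B \Theta u\|^2} \right] u^\top.$$
Then by $\|u\| = 1$, $\|\nabla_\Theta T\| \le 4 \|a\|^2 \|B\| / \|B \Theta u\|$, so by the Cauchy-Schwarz inequality
$$E(\|\nabla_\Theta T\|^2) \le 16 \|a\|^4 \left[ E(\|B\|^4) \, E(\|B \Theta u\|^{-4}) \right]^{1/2} \le 16 \|a\|^4 \left[ E(\|B\|^4) \sup_{\|x\| = 1, \Xi} E(\|Bx\|^{-4}) \right]^{1/2},$$
where the second inequality is due to the independence of $u$ and $B$. Then by assumptions (a) and (b), $\sup_\theta E(\|\nabla_\Theta T\|^2) < \infty$.

Next, since $\Xi$ lives in a Euclidean space, the Riemannian gradient of $T$ w.r.t. $\Xi$ consists of the partial derivatives of $T$ w.r.t. its coordinates. For each coordinate $\xi$ of $\Xi$,
$$\partial_\xi T := \frac{\partial T}{\partial \xi} = 2 \left[ \frac{a^\top (\partial_\xi B) \Theta u \, (a^\top B \Theta u)}{\|B \Theta u\|^2} - \frac{(a^\top B \Theta u)^2 \, (B \Theta u)^\top (\partial_\xi B) \Theta u}{\|B \Theta u\|^4} \right].$$
Then by $\|\Theta u\| = 1$, $|\partial_\xi T| \le 4 \|a\|^2 \|\partial_\xi B\| / \|B \Theta u\|$. Taking the sum of $(\partial_\xi T)^2$ over all the coordinates of $\Xi$ then yields
$$\|\nabla_\Xi T\|^2 \le 16 \|a\|^4 \, \frac{\|\nabla_\Xi B\|^2}{\|B \Theta u\|^2}.$$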
So by the Cauchy-Schwarz inequality
$$E(\|\nabla_\Xi T\|^2) \le 16 \|a\|^4 \left[ E(\|\nabla_\Xi B\|^4) \, E(\|B \Theta u\|^{-4}) \right]^{1/2} \le 16 \|a\|^4 \left[ E(\|\nabla_\Xi B\|^4) \sup_{\|x\| = 1, \Xi} E(\|Bx\|^{-4}) \right]^{1/2},$$
where the second inequality is again due to the independence of $u$ and $B$. Then by assumptions (a) and (b), $\sup_\theta E(\|\nabla_\Xi T\|^2) < \infty$.

A.6 PROOF OF THEOREM 6.

The following result is a relaxation of Theorems 1 and 2 in (Bonnabel, 2013). Let $(\gamma_t)_{t \ge 0} = (\gamma_0, \gamma_1, \gamma_2, \ldots)$ be a sequence of step sizes. Suppose $C(\cdot)$ is a three times continuously differentiable cost function on a smooth connected Riemannian manifold $\mathcal{M}$, $(z_t)_{t \ge 0}$ is a sequence of i.i.d. random variables taking values in a measurable space $\mathcal{Z}$, and $H(\cdot, \cdot)$ is a measurable function from $\mathcal{Z} \times \mathcal{M}$ to $T\mathcal{M}$ such that $E_z H(z, w) = \nabla C(w)$, where $T\mathcal{M}$ is the tangent bundle of $\mathcal{M}$ and $z \sim z_t$. Also suppose $w_0 \in \mathcal{M}$ is independent of $(z_t)_{t \ge 0}$.

Theorem 6. Let $\mathcal{M}$ be a connected Riemannian manifold with injectivity radius uniformly bounded from below by $I > 0$. Let $C \in C^{(3)}(\mathcal{M})$ and $R_w$ be a twice continuously differentiable retraction. Let $z_0, z_1, \ldots$ be i.i.d. $\sim \zeta$ taking values in $\mathcal{Z}$. Let $H : \mathcal{Z} \times \mathcal{M} \to T\mathcal{M}$ be a measurable function such that $E[H(\zeta, w)] = \nabla C(w)$ for all $w \in \mathcal{M}$, where $T\mathcal{M}$ is the tangent bundle of $\mathcal{M}$. Consider the SGD update $w_{t+1} = R_{w_t}(-\gamma_t H(z_t, w_t))$ with step size $\gamma_t > 0$ satisfying $\sum_t \gamma_t^2 < +\infty$ and $\sum_t \gamma_t = +\infty$. Suppose there exists a compact set $K$ such that all $w_t \in K$ and $\sup_{w \in K} E[\|H(\zeta, w)\|^2] \le A^2$ for some $A > 0$. Then, $C(w_t)$ converges almost surely and $\nabla C(w_t) \to 0$ almost surely.

As in (Bonnabel, 2013), the proof of Theorem 6 starts with the following result, which is of interest in its own right.

Proposition S1. Let $(\gamma_t)_{t \ge 0}$ and $\mathcal{M}$ be as in Theorem 6. Consider the update $w_{t+1} = \exp_{w_t}(-\gamma_t H(z_t, w_t))$, where $\exp_w$ is the exponential map at $w$. Suppose there exists a compact set $K$ such that $w_t \in K$ for all $t \ge 0$. We also suppose that, for some $A > 0$, $E_z(\|H(z, w)\|^2) \le A^2$ for all $w \in K$. Then, $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Proof of Proposition S1.
The proof builds upon the one for Theorem 1 (Bonnabel, 2013). Let $\mathcal{F}_t = \sigma(w_0, z_0, \ldots, z_{t-1})$. Then for $t \ge 1$, $w_t$ is $\mathcal{F}_t$-measurable. If $\gamma_t \|H(z_t, w_t)\| < I$, then $\{\exp_{w_t}(-s H(z_t, w_t))\}_{0 \le s \le \gamma_t}$ is the geodesic linking $w_t$ and $w_{t+1}$, so as in the proof of Theorem 1 in (Bonnabel, 2013), the Taylor formula yields
$$C(w_{t+1}) - C(w_t) \le -\gamma_t \langle H(z_t, w_t), \nabla C(w_t) \rangle + \gamma_t^2 \|H(z_t, w_t)\|^2 k_2,$$
where $\langle \cdot, \cdot \rangle$ is the Riemannian inner product at $T_{w_t}\mathcal{M}$. The inequality has the same form as equation (5) in (Bonnabel, 2013) but with $k_2 = \sup_{w \in K_0} \|\nabla^2 C\|$, i.e., $\|(\nabla^2 C(w)) v\| \le k_2 \|v\|$ for all $w \in K_0$ and $v \in T_w \mathcal{M}$, where $K_0$ is the compact set of all points with distance at most $I$ from $K$. Define events
$$E_{-1} = \emptyset, \quad E_t = \{\|H(z_t, w_t)\| < I/\gamma_t\}, \quad t \ge 0. \tag{S6}$$
Denote by $1_E$ the indicator function of an event $E$. Then by $C(w) \ge 0$,
$$C(w_{t+1}) 1_{E_t} \le C(w_t) - \gamma_t \langle H(z_t, w_t), \nabla C(w_t) \rangle 1_{E_t} + \gamma_t^2 \|H(z_t, w_t)\|^2 k_2. \tag{S7}$$
Taking expectations conditional on $\mathcal{F}_t$ on both sides of the inequality,
$$E[C(w_{t+1}) 1_{E_t} \mid \mathcal{F}_t] \le C(w_t) - \gamma_t E(\langle H(z_t, w_t), \nabla C(w_t) \rangle \mid \mathcal{F}_t) + \gamma_t E(\langle H(z_t, w_t), \nabla C(w_t) \rangle 1_{E_t^c} \mid \mathcal{F}_t) + \gamma_t^2 E(\|H(z_t, w_t)\|^2 \mid \mathcal{F}_t) k_2. \tag{S8}$$
Since $z_t$ is independent of $\mathcal{F}_t$ while $w_t$ is $\mathcal{F}_t$-measurable,
$$E(\langle H(z_t, w_t), \nabla C(w_t) \rangle \mid \mathcal{F}_t) = E_z(\langle H(z, w_t), \nabla C(w_t) \rangle) = \|\nabla C(w_t)\|^2, \quad E(\|H(z_t, w_t)\|^2 \mid \mathcal{F}_t) = E_z(\|H(z, w_t)\|^2) \le A^2, \tag{S9}$$
where in $E_z(\cdot)$, $w_t$ is treated as a fixed value. On the other hand,
$$E(\langle H(z_t, w_t), \nabla C(w_t) \rangle 1_{E_t^c} \mid \mathcal{F}_t) = \langle E[H(z_t, w_t) 1_{E_t^c} \mid \mathcal{F}_t], \nabla C(w_t) \rangle \le E[\|H(z_t, w_t)\| 1_{E_t^c} \mid \mathcal{F}_t] \, k_1,$$
where $k_1 = \sup_K \|\nabla C\|$. Since $E_t^c = \{\|H(z_t, w_t)\| \ge I/\gamma_t\}$, by the Markov inequality,
$$E[\|H(z_t, w_t)\| 1_{E_t^c} \mid \mathcal{F}_t] \le E[\|H(z_t, w_t)\|^2 / (I/\gamma_t) \mid \mathcal{F}_t] \le \gamma_t A^2 / I. \tag{S10}$$
Then from (S8),
$$E[C(w_{t+1}) 1_{E_t} \mid \mathcal{F}_t] \le C(w_t) + \gamma_t^2 A^2 k - \gamma_t \|\nabla C(w_t)\|^2 \tag{S11}$$
with $k = k_2 + k_1/I$. Let
$$N_t = C(w_t) 1_{E_{t-1}} + A^2 k \sum_{s \ge t} \gamma_s^2 - \sum_{s < t} C(w_s) 1_{E_{s-1}^c}.$$
Since $\sum_{s < t+1} C(w_s) 1_{E_{s-1}^c}$ is $\mathcal{F}_t$-measurable, then from (S11),
$$E(N_{t+1} \mid \mathcal{F}_t) \le N_t - \gamma_t \|\nabla C(w_t)\|^2. \tag{S12}$$
Therefore, $N_t$ is a supermartingale. Let $\xi = \sum_{s \ge 0} C(w_s) 1_{E_{s-1}^c}$. Let $k_0 = \sup_{w \in K} C(w)$. Then $|N_t| \le k_0 + A^2 k \sum_{s \ge 0} \gamma_s^2 + \xi$. On the other hand, by Fubini's theorem followed by the Markov inequality,
$$E\xi \le k_0 \sum_{s \ge 0} P\{\|H(z_s, w_s)\| \ge I/\gamma_s\} \le k_0 \sum_{s \ge 0} \frac{\gamma_s^2}{I^2} \sup_{w \in K} E_z \|H(z, w)\|^2 < \infty.$$
Then $N_t$ is uniformly integrable, i.e., $\sup_t E(|N_t| 1_{\{|N_t| \ge c\}}) \to 0$ as $c \to \infty$. Then by the martingale convergence theorem, $N_t$ converges a.s. Moreover, from the display and the Borel-Cantelli lemma, all but a finite number of the $1_{E_t^c}$ are 0, a.s. As a result, it then follows that $C(w_t)$ converges a.s.

To show that $\nabla C(w_t) \to 0$ a.s., by Doob's decomposition, $N_t = M_t - Z_t$, where $M_t$ is a martingale and $Z_t$ is increasing and $\mathcal{F}_{t-1}$-measurable with $Z_0 = 0$, and both $M_t$ and $Z_t$ are uniformly integrable, giving $Z_t \uparrow Z \ge 0$ with $EZ < \infty$. Put $p = \|\nabla C\|^2$. From (S12), $\sum_t \gamma_t p(w_t) \le \sum_t (Z_{t+1} - Z_t) = Z < \infty$ a.s. Then, by $\sum_t \gamma_t = \infty$, to show $\nabla C(w_t) \to 0$, it suffices to show that $p(w_t)$ converges a.s. Similar to (S7),
$$p(w_{t+1}) 1_{E_t} - p(w_t) \le -2\gamma_t \langle \nabla C(w_t), (\nabla^2 C(w_t)) H(z_t, w_t) \rangle 1_{E_t} + \gamma_t^2 \|H(z_t, w_t)\|^2 k_4,$$
where $k_4$ is an upper bound on $\|\nabla^2 p\|$ on $K_0$. For ease of notation, write $\langle \nabla C, (\nabla^2 C) H \rangle_t$ for $\langle \nabla C(w_t), (\nabla^2 C(w_t)) H(z_t, w_t) \rangle$, $\|H\|_t$ for $\|H(z_t, w_t)\|$, and so on. Then
$$E(p(w_{t+1}) 1_{E_t} - p(w_t) \mid \mathcal{F}_t) \le -2\gamma_t E(\langle \nabla C, (\nabla^2 C) H \rangle_t 1_{E_t} \mid \mathcal{F}_t) + \gamma_t^2 k_4 E(\|H\|_t^2 \mid \mathcal{F}_t) \le -2\gamma_t E(\langle \nabla C, (\nabla^2 C) H \rangle_t 1_{E_t} \mid \mathcal{F}_t) + \gamma_t^2 k_4 A^2.$$
On the other hand, recalling that $k_1 = \sup_K \|\nabla C\|$ and $k_2 = \sup_{K_0} \|\nabla^2 C\|$,
$$\begin{aligned}
|E(\langle \nabla C, (\nabla^2 C) H \rangle_t 1_{E_t} \mid \mathcal{F}_t)| &\le |E(\langle \nabla C, (\nabla^2 C) H \rangle_t \mid \mathcal{F}_t)| + |E(\langle \nabla C, (\nabla^2 C) H \rangle_t 1_{E_t^c} \mid \mathcal{F}_t)| \\
&\le |\langle \nabla C, (\nabla^2 C) \nabla C \rangle_t| + \|\nabla C\|_t \cdot E(\|(\nabla^2 C) H\|_t 1_{E_t^c} \mid \mathcal{F}_t) \\
&\le k_2 \|\nabla C\|_t^2 + k_2 \|\nabla C\|_t \cdot E(\|H\|_t 1_{E_t^c} \mid \mathcal{F}_t) \\
&\le k_2 \, p(w_t) + (k_1 k_2 / I) A^2 \gamma_t,
\end{aligned}$$
where the last line follows from (S10).
Combining the above two displays,
$$E(p(w_{t+1}) 1_{E_t} - p(w_t) 1_{E_{t-1}} \mid \mathcal{F}_t) \le q(w_t) := p(w_t) 1_{E_{t-1}^c} + 2 k_2 \gamma_t p(w_t) + k^* \gamma_t^2 A^2, \tag{S13}$$
where $k^* = k_4 + 2 k_1 k_2 / I$. From the above proof, $E \sum_t q(w_t) < \infty$. Then by an argument similar to the one for the convergence of $C(w_t)$, based on submartingale convergence, $p(w_t)$ converges a.s.

Proof of Theorem 6. The proof builds upon the one for Theorem 2 (Bonnabel, 2013). There are constants $r > 0$ and $0 < I_0 \le I$ such that $d(R_w(v), \exp_w(v)) \le r \|v\|^2$ for all $w \in K$ and $v \in T_w \mathcal{M}$ with $\|v\| \le I_0$. Without loss of generality, suppose $r > 1$ and $I_0 = I < 1/r$, for otherwise we can decrease $I$. Define the same events $E_t$ as in (S6) and let the constants $k_0, \ldots, k_4$ be defined as in the proof of Proposition S1. Let $w^*_{t+1} = \exp_{w_t}(-\gamma_t H(z_t, w_t))$. Then $d(w^*_{t+1}, w_{t+1}) 1_{E_t} \le r \gamma_t^2 \|H(z_t, w_t)\|^2$. By assumption, $w_{t+1} \in K$. Then on the event $E_t$, $d(w^*_{t+1}, w_{t+1}) \le r I^2 < I$, so $w^*_{t+1} \in K_0$, where $K_0$ is defined in the proof of Proposition S1. Then
$$\begin{aligned}
C(w_{t+1}) 1_{E_t} - C(w_t) &\le C(w^*_{t+1}) 1_{E_t} - C(w_t) + |C(w_{t+1}) - C(w^*_{t+1})| 1_{E_t} \\
&\le C(w^*_{t+1}) 1_{E_t} - C(w_t) + d(w_{t+1}, w^*_{t+1}) \, k_1 1_{E_t} \\
&\le C(w^*_{t+1}) 1_{E_t} - C(w_t) + r \gamma_t^2 \|H(z_t, w_t)\|^2 k_1.
\end{aligned}$$
Then by (S9) and (S11),
$$E(C(w_{t+1}) 1_{E_t} - C(w_t) \mid \mathcal{F}_t) \le \gamma_t^2 A^2 k - \gamma_t \|\nabla C(w_t)\|^2,$$
where $k = k_2 + k_1/I + r k_1$. Then, following the same argument as in the proof of Proposition S1, $C(w_t)$ converges a.s. and $\sum_{t \ge 0} \gamma_t E(\|\nabla C(w_t)\|^2) < \infty$, and to show $p(w_t) := \|\nabla C(w_t)\|^2 \to 0$, it suffices to show $p(w_t)$ converges a.s. We have
$$p(w_{t+1}) 1_{E_t} - p(w_t) \le |p(w_{t+1}) - p(w^*_{t+1})| 1_{E_t} + p(w^*_{t+1}) 1_{E_t} - p(w_t) \le r k_4 \gamma_t^2 \|H(z_t, w_t)\|^2 + p(w^*_{t+1}) 1_{E_t} - p(w_t).$$
Then by (S9) and (S13),
$$E(p(w_{t+1}) 1_{E_t} - p(w_t) 1_{E_{t-1}} \mid \mathcal{F}_t) \le p(w_t) 1_{E_{t-1}^c} + 2 k_2 \gamma_t p(w_t) + ((1 + r) k_4 + 2 k_1 k_2 / I) \gamma_t^2 A^2.$$
Then, with the same argument following (S13), $p(w_t)$ converges a.s.
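The update $w_{t+1} = R_{w_t}(-\gamma_t H(z_t, w_t))$ analyzed in Theorem 6 can be illustrated on a toy problem. The following is a minimal sketch, not the GoFAE training code: it assumes the unit sphere as the compact manifold, normalization as the retraction, the cost $C(w) = -\frac{1}{2} w^\top A w$ with a diagonal $A$, additive Gaussian gradient noise, and the step size $\gamma_t = 0.5/(t+1)^{0.75}$, which satisfies $\sum_t \gamma_t^2 < \infty$ and $\sum_t \gamma_t = \infty$. All function names are hypothetical.

```python
import math
import random


def project_to_tangent(w, g):
    """Orthogonally project a Euclidean gradient g onto the tangent space of the sphere at w."""
    inner = sum(wi * gi for wi, gi in zip(w, g))
    return [gi - inner * wi for wi, gi in zip(w, g)]


def retract(w, v):
    """Retraction on the unit sphere: step in the tangent direction, then renormalize."""
    y = [wi + vi for wi, vi in zip(w, v)]
    norm = math.sqrt(sum(yi * yi for yi in y))
    return [yi / norm for yi in y]


def riemannian_sgd(diag, steps=2000, noise=0.1, seed=0):
    """Minimize C(w) = -0.5 * w^T A w over the unit sphere, with A = diag(diag)."""
    rng = random.Random(seed)
    d = len(diag)
    w = [1.0 / math.sqrt(d)] * d  # starting point on the sphere
    for t in range(steps):
        gamma = 0.5 / (t + 1) ** 0.75  # square-summable but not summable
        # Noisy Euclidean gradient of C; its tangent projection is an unbiased
        # estimate of the Riemannian gradient, playing the role of H(z_t, w_t).
        euclid = [-a * wi + noise * rng.gauss(0.0, 1.0) for a, wi in zip(diag, w)]
        h = project_to_tangent(w, euclid)
        w = retract(w, [-gamma * hi for hi in h])
    return w
```

For example, `riemannian_sgd([3.0, 1.0, 0.5])` should return a unit vector concentrated on the first coordinate, the minimizer of this toy cost up to sign.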

B GOODNESS-OF-FIT TESTS

In this section, we present several commonly used GoF tests. They are grouped into three classes: tests based on correlation (CB), tests based on the empirical distribution function (EDF), and tests based on the empirical characteristic function (ECF). The CB GoF tests were first covered in Section 3. The latter two are covered here.

Empirical Distribution Function (EDF) Tests: UVN EDF tests are based on the discrepancy between the empirical and hypothesized distribution functions (D'Agostino, 2017), encompassing two broad classes: supremum tests and quadratic tests. Kolmogorov-Smirnov (KS) is a supremum test, measuring the largest absolute distance. Two popular quadratic tests are Cramér-von Mises (CVM), which measures the integrated quadratic deviation weighted by a function $\Psi$, and Anderson-Darling (AD) (Anderson et al., 1952), which gives higher weight to distribution tails.

Empirical Characteristic Function (ECF) Tests: ECF tests are based on the weighted integral of the difference between the ECF and its pointwise limit, including Epps-Pulley (EP) for UVN (Epps & Pulley, 1983) and MVN (Baringhaus & Henze, 1988) and the Henze-Zirkler (HZ) test, a generalization of EP (Henze & Zirkler, 1990).

For consistency, $T_*$ and $d_*$ represent, respectively, the test statistic and the corresponding statistical distance for each test. We use $F_X$ to denote both the law of $X$ and its cumulative distribution function. For a random sample $X_1, X_2, \ldots, X_m$, denote its sample mean, sample variance, and empirical cumulative distribution function by $\bar X$, $S_m$, and $\hat F_{X,m}$, respectively. If the $X_i$'s are univariate, we further sort them into order statistics $X_{(1)} \le X_{(2)} \le \cdots \le X_{(m)}$.

B.1 CB CLASS: SHAPIRO-WILK, SHAPIRO-FRANCIA

Let $X_1, \ldots, X_m$ be univariate and i.i.d. $\sim F_X$. To test $H_0 : F_X \in \mathcal{G}$, where $\mathcal{G}$ is the class of normal distributions, the Shapiro-Wilk (SW) test statistic is defined as
$$T_{SW} = \frac{\left(\sum_{i=1}^m a_i X_{(i)}\right)^2}{\sum_{i=1}^m (X_i - \bar X)^2},$$
where $a = (a_1, a_2, \ldots, a_m)^\top$ is obtained via $a = M^{-1} c / \|M^{-1} c\|$, with $c = (c_1, \ldots, c_m)^\top$ and $M$ being the mean vector and covariance matrix, respectively, of the order statistics of $m$ independent $\mathcal{N}(0, 1)$ random variables. The corresponding $L_2$-Wasserstein distance is
$$d^2_{W_2}(\hat F_{X,m}, \mathcal{G}) = \inf_{\mu, \sigma^2} d^2_{W_2}(\hat F_{X,m}, \mathcal{N}(\mu, \sigma^2)) = S_m^2 - \left( \int_0^1 \hat F^{-1}_{X,m}(t) \, \Phi^{-1}(t) \, dt \right)^2,$$
with $\Phi(\cdot)$ being the distribution function of the standard normal (del Barrio et al., 1999). Following the same notation as for the SW test, the Shapiro-Francia (SF) test of normality (Shapiro & Francia, 1972) is defined as
$$T_{SF} = \frac{\left(\sum_{i=1}^m b_i X_{(i)}\right)^2}{\sum_{i=1}^m (X_i - \bar X)^2},$$
where $b = (b_1, b_2, \ldots, b_m)^\top$ is obtained via $b = c / \|c\|$.
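As an illustration of the SF statistic, the following is a minimal sketch that computes $T_{SF}$ for a univariate sample. It makes one assumption that is not part of the definition above: the expected standard normal order statistics $c_i$ are approximated by Blom's formula $\Phi^{-1}((i - 0.375)/(m + 0.25))$ rather than computed exactly. The function name is hypothetical.

```python
from statistics import NormalDist


def shapiro_francia(sample):
    """Shapiro-Francia statistic T_SF with c approximated by Blom's formula (an assumption)."""
    m = len(sample)
    if m < 3:
        raise ValueError("need at least 3 observations")
    x = sorted(sample)  # order statistics X_(1) <= ... <= X_(m)
    nd = NormalDist()
    # Approximate expected order statistics of m independent N(0, 1) variables.
    c = [nd.inv_cdf((i - 0.375) / (m + 0.25)) for i in range(1, m + 1)]
    norm_c = sum(v * v for v in c) ** 0.5
    b = [v / norm_c for v in c]  # b = c / ||c||
    mean = sum(x) / m
    denom = sum((xi - mean) ** 2 for xi in x)
    if denom == 0.0:
        # All observations equal: the statistic is undefined here.
        raise ValueError("statistic undefined when all observations are equal")
    numer = sum(bi * xi for bi, xi in zip(b, x)) ** 2
    return numer / denom
```

The statistic is affine invariant, so `shapiro_francia(x)` and `shapiro_francia([2 * v + 3 for v in x])` agree up to rounding, and values close to 1 indicate agreement with normality.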

B.2 EDF CLASS: CRAMER-VON MISES, KOLMOGOROV-SMIRNOV

Let $X_1, \ldots, X_m$ be univariate and i.i.d. $\sim F_X$. Let $F$ be a specified univariate distribution and suppose we wish to test $H_0 : F_X = F$. The Cramér-von Mises (CVM) test statistic is defined as
$$T_{CVM} = \frac{1}{12m} + \sum_{i=1}^m \left[ F(X_{(i)}) - \frac{2i - 1}{2m} \right]^2.$$
The CVM test statistic corresponds to the statistical distance
$$d_{CVM}(\hat F_{X,m}, F) = \int (\hat F_{X,m} - F)^2 \, \Psi(F) \, dF,$$
where $\Psi$ is a weight function. Note that for $T_{CVM}$, $\Psi \equiv 1$. On the other hand, the Kolmogorov-Smirnov (KS) test statistic and the related statistical distance are defined, respectively, as
$$T_{KS} = \max_{1 \le i \le m} \max\left\{ F(X_{(i)}) - \frac{i - 1}{m}, \; \frac{i}{m} - F(X_{(i)}) \right\}, \qquad d_{KS}(\hat F_{X,m}, F) = \sup |\hat F_{X,m} - F|.$$
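The two EDF statistics can be computed directly from the order statistics. The following is a minimal sketch of $T_{CVM}$ and $T_{KS}$ as defined above; the hypothesized distribution function is passed as a callable, and the function names are hypothetical.

```python
def cvm_statistic(sample, cdf):
    """Cramer-von Mises statistic T_CVM for a fully specified distribution function cdf."""
    x = sorted(sample)
    m = len(x)
    return 1.0 / (12 * m) + sum(
        (cdf(xi) - (2 * i - 1) / (2 * m)) ** 2 for i, xi in enumerate(x, start=1)
    )


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov statistic T_KS: largest gap between the EDF and cdf."""
    x = sorted(sample)
    m = len(x)
    return max(
        max(cdf(xi) - (i - 1) / m, i / m - cdf(xi)) for i, xi in enumerate(x, start=1)
    )
```

For a test against the standard normal, one possible choice for `cdf` is `statistics.NormalDist().cdf`.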

B.3 ECF CLASS: HENZE-ZIRKLER

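Suppose the $X_i$ are $k$-dimensional and i.i.d. $\sim P_X$. Following notation similar to that of (Henze & Zirkler, 1990), define the scaled residuals $Y_j = S_m^{-1/2}(X_j - \bar X)$, $j = 1, \ldots, m$. Let $\varphi_m(t) = \frac{1}{m} \sum_{j=1}^m \exp(i t^\top Y_j)$ denote the empirical characteristic function of the $Y_j$. To test $H_0 : F_X \in \mathcal{G}$, the Henze-Zirkler (HZ) test statistic is defined as
$$T_{HZ} = m \left( 4 \, I\{S_m \text{ is singular}\} + D_{m,\beta} \, I\{S_m \text{ is nonsingular}\} \right),$$
where $\beta > 0$ is a parameter,
$$D_{m,\beta} = \int_{\mathbb{R}^k} \left| \varphi_m(t) - \exp\left(-\tfrac{1}{2}\|t\|^2\right) \right|^2 \psi_\beta(t) \, dt, \qquad \psi_\beta(t) = (2\pi\beta^2)^{-k/2} \exp\left( -\frac{\|t\|^2}{2\beta^2} \right).$$
After some simplification (Korkmaz et al., 2014), we have
$$D_{m,\beta} = \frac{1}{m^2} \sum_{i,j=1}^m e^{-\beta^2 \|Y_i - Y_j\|^2 / 2} - \frac{2}{(1 + \beta^2)^{k/2} \, m} \sum_{i=1}^m e^{-\beta^2 \|Y_i\|^2 / [2(1 + \beta^2)]} + \frac{1}{(1 + 2\beta^2)^{k/2}}.$$
The optimal choice for $\beta$ is proposed to be $\frac{1}{\sqrt{2}} \left( \frac{m(2k + 1)}{4} \right)^{1/(k+4)}$. Denote the Fourier transform of a probability measure $\mu$ on $\mathbb{R}^k$ by $\hat\mu(t) = \int_{\mathbb{R}^k} e^{i t^\top x} \mu(dx)$. Then
$$D_{m,\beta} = \int_{\mathbb{R}^k} |\varphi_m(t) - \hat G(t)|^2 \, \psi_\beta(t) \, dt,$$
where $G$ denotes the $k$-variate standard normal distribution and $\varphi_m$ is the Fourier transform of the empirical distribution of the $Y_j$. Thus, the HZ statistic corresponds to the following statistical distance:
$$d_{HZ}(\mu, \nu) = \left( \int_{\mathbb{R}^k} |\hat\mu(t) - \hat\nu(t)|^2 \, \psi_\beta(t) \, dt \right)^{1/2}.$$

B.4 A COMPARISON BETWEEN THE ASSOCIATED STATISTICAL DISTANCES

Proposition S2. Let $X$, $Y$ denote two $k$-variate random variables. Then, (a) $d_{HZ}(F_X, F_Y) \le C_1 \, d_{W_2}(F_X, F_Y)$, (b) $d_{KS}(F_X, F_Y) \le C_2 \, d_{W_1}(F_X, F_Y)$, (c) if $k = 1$ and the weight function $\Psi(\cdot)$ in the CVM test is bounded, then $d_{CVM}(F_X, F_Y) \le C_3 \, d_{KS}(F_X, F_Y)^2$. In these inequalities the constants $C_1$, $C_2$ may depend on $k$, while $C_3$ may depend on $\Psi(\cdot)$.

Proof. (a) Put $\mu = F_X$ and $\nu = F_Y$. For any $\omega \in \Pi(\mu, \nu)$, by the Cauchy-Schwarz inequality and Fubini's theorem,
$$\begin{aligned}
d_{HZ}(\mu, \nu)^2 &= \int_{\mathbb{R}^k} \left| \int_{\mathbb{R}^k} e^{i t^\top x} \mu(dx) - \int_{\mathbb{R}^k} e^{i t^\top y} \nu(dy) \right|^2 \psi_\beta(t) \, dt \\
&= \int_{\mathbb{R}^k} \left| \int_{\mathbb{R}^k \times \mathbb{R}^k} (e^{i t^\top x} - e^{i t^\top y}) \, \omega(dx, dy) \right|^2 \psi_\beta(t) \, dt \\
&\le \int_{\mathbb{R}^k} \left[ \int_{\mathbb{R}^k \times \mathbb{R}^k} |e^{i t^\top x} - e^{i t^\top y}|^2 \, \omega(dx, dy) \right] \psi_\beta(t) \, dt \\
&= \int_{\mathbb{R}^k \times \mathbb{R}^k} \left[ \int_{\mathbb{R}^k} |e^{i t^\top x} - e^{i t^\top y}|^2 \, \psi_\beta(t) \, dt \right] \omega(dx, dy).
\end{aligned}$$
Using $|e^{ia} - e^{ib}|^2 \le (a - b)^2$ for $a, b \in \mathbb{R}$,
$$\begin{aligned}
d_{HZ}(\mu, \nu)^2 &\le \int_{\mathbb{R}^k \times \mathbb{R}^k} \left[ \int_{\mathbb{R}^k} |t^\top x - t^\top y|^2 \, \psi_\beta(t) \, dt \right] \omega(dx, dy) \\
&\le \int_{\mathbb{R}^k \times \mathbb{R}^k} \left[ \int_{\mathbb{R}^k} \|t\|^2 \|x - y\|^2 \, \psi_\beta(t) \, dt \right] \omega(dx, dy) \\
&= \int_{\mathbb{R}^k} \|t\|^2 \psi_\beta(t) \, dt \cdot \int_{\mathbb{R}^k \times \mathbb{R}^k} \|x - y\|^2 \, \omega(dx, dy) = C \int_{\mathbb{R}^k \times \mathbb{R}^k} \|x - y\|^2 \, \omega(dx, dy),
\end{aligned}$$
with $C = \int_{\mathbb{R}^k} \|t\|^2 \psi_\beta(t) \, dt < \infty$. Since the above inequality holds for all $\omega \in \Pi(\mu, \nu)$, then $d_{HZ}(\mu, \nu) \le \sqrt{C} \, d_{W_2}(\mu, \nu)$. Let $C_1 = \sqrt{C}$. Then (a) follows.

(b) The connection between the Kolmogorov-Smirnov distance and the $L_1$-Wasserstein distance in both the univariate and multivariate cases is well studied; detailed proofs can be found in Corollary 3.1 (Koike, 2019) and Proposition 1.2 (Ross, 2011).

(c) Letting $C_3 = \sup \Psi$,
$$d_{CVM}(F_X, F_Y) = \int (F_X - F_Y)^2 \, \Psi(F_Y) \, dF_Y \le C_3 \int |F_X - F_Y|^2 \, dF_Y \le C_3 \int (\sup |F_X - F_Y|)^2 \, dF_Y = C_3 \, d_{KS}(F_X, F_Y)^2.$$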
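The closed form of $D_{m,\beta}$ in Section B.3 can be evaluated without numerical integration. The following is a minimal sketch under two assumptions: the inputs are already the scaled residuals $Y_j = S_m^{-1/2}(X_j - \bar X)$, so the whitening step and the singular case are not handled, and the second term uses the factor $(1 + \beta^2)^{-k/2}$ as in the standard Henze-Zirkler closed form. The function names are hypothetical.

```python
import math


def hz_beta(m, k):
    """Proposed choice of beta for sample size m and dimension k."""
    return (1.0 / math.sqrt(2.0)) * (m * (2 * k + 1) / 4.0) ** (1.0 / (k + 4))


def hz_distance(residuals, beta):
    """Closed-form D_{m,beta} for scaled residuals given as a list of k-dimensional lists."""
    m = len(residuals)
    k = len(residuals[0])
    b2 = beta * beta

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    origin = [0.0] * k
    # (1/m^2) * sum_{i,j} exp(-beta^2 ||Y_i - Y_j||^2 / 2)
    pair_term = sum(
        math.exp(-b2 * sq_dist(yi, yj) / 2.0) for yi in residuals for yj in residuals
    ) / (m * m)
    # (1/m) * sum_i exp(-beta^2 ||Y_i||^2 / (2 (1 + beta^2)))
    single_term = sum(
        math.exp(-b2 * sq_dist(yi, origin) / (2.0 * (1.0 + b2))) for yi in residuals
    ) / m
    return (
        pair_term
        - 2.0 * (1.0 + b2) ** (-k / 2.0) * single_term
        + (1.0 + 2.0 * b2) ** (-k / 2.0)
    )


def hz_statistic(residuals, beta=None):
    """T_HZ = m * D_{m,beta} in the nonsingular case."""
    m = len(residuals)
    k = len(residuals[0])
    if beta is None:
        beta = hz_beta(m, k)
    return m * hz_distance(residuals, beta)
```

Because $D_{m,\beta}$ is an integral of a squared modulus, `hz_distance` should be nonnegative up to rounding error.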

C MODEL ARCHITECTURE

The architecture for the encoder and decoder of CelebA and MNIST is based on the WAE (Tolstikhin et al., 2017). The discriminator for the WAE-GAN followed Tolstikhin et al. (2017). The architecture for CIFAR10 is based on Lippe (2022). When modeling a specific dataset, the specified encoder and decoder components are the same for all models.

D TRAINING DETAILS

D.1 TEST STATISTIC PROJECTIONS

Once a batch has been encoded, it is projected from 64 dimensions to one dimension in order to be tested. Instead of randomly sampling directions from the unit sphere to implement the projections, we sample an orthonormal basis and project onto each direction, calculating the test statistic along each. This is used in two ways: 1) selecting the most pessimistic direction to optimize, or 2) computing the average over directions to optimize. For example, the SW test statistic value should be large to fail to reject normality. The smallest statistic value over the sampled directions can then be used as a new statistic to optimize.
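The projection step can be sketched as follows. This is an illustrative sketch, not the released implementation: it assumes the orthonormal basis is obtained by Gram-Schmidt orthogonalization of Gaussian vectors, takes the univariate test statistic as a callable, and uses hypothetical function names.

```python
import random


def sample_orthonormal_basis(d, rng):
    """Sample d orthonormal directions in R^d via Gram-Schmidt on Gaussian vectors."""
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in basis:  # remove the components along the directions found so far
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-10:  # skip numerically degenerate draws
            basis.append([a / norm for a in v])
    return basis


def projected_statistics(codes, statistic, rng):
    """Apply a univariate statistic to the batch projected onto each basis direction."""
    d = len(codes[0])
    basis = sample_orthonormal_basis(d, rng)
    return [
        statistic([sum(y_j * u_j for y_j, u_j in zip(y, u)) for y in codes])
        for u in basis
    ]


def aggregate(stats, mode="min"):
    """'min' picks the most pessimistic direction for statistics like SW; 'mean' averages."""
    if mode == "min":
        return min(stats)
    if mode == "mean":
        return sum(stats) / len(stats)
    raise ValueError("mode must be 'min' or 'mean'")
```

With a statistic such as SW, for which small values indicate departure from normality, `aggregate(stats, "min")` corresponds to option 1) above and `aggregate(stats, "mean")` to option 2).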

D.4 FURTHER DETAILS ON MODEL SELECTION

We selected $\lambda$ by grid search using a training and validation set and an array of $\lambda$ values. Each model was evaluated for p-value uniformity using Algorithm 1. Note that Algorithm 1 takes multiple minibatch (local) GoF p-values and produces a single Kolmogorov-Smirnov test statistic and p-value pair for uniformity (a single blue dot in Figs. 5b-5c). Since univariate GoF tests require projection, there are two sources of randomness that come into play when producing GoF p-values from a data set: 1) shuffling minibatches (line 2 in Algorithm 1), and 2) random projections. Instead of selecting $\lambda$ based on a single KS uniformity p-value, Algorithm 1 was run 30 times. The average of these 30 uniformity p-values was computed and compared against a pre-specified threshold, which is 0.05 by default. If the average is larger than the threshold, then $\lambda$ is a possible candidate. There is a region where a variety of $\lambda$ values satisfy the threshold condition of 0.05 (Figs. 5b-5c). Since the loss function, Equation 2, should also have small reconstruction error, the smallest $\lambda$ for which the corresponding average of 30 KS p-values is larger than 0.05 is the chosen hyperparameter and is used in the final model. Pseudo-code for the GoFAE pipeline can be seen in Algorithm S1.

Liu et al. (2015). For fair comparisons, we adopted a common architecture for all methods for each dataset as described in Section C. Both VAE and β-VAE had an initial learning rate of 1e-4, while VAE with learnable γ used 1e-3. For our models, the Adam optimizer (Kingma & Ba, 2014) was used for the encoder (learning rate 3e-3, β = (0.5, 0.999)) and decoder (learning rate 3e-3, β = (0.5, 0.999)). Riemannian SGD with learning rate 5e-3 was used to constrain Θ to the Stiefel manifold using a one-cycle learning rate scheduler with a maximum learning rate of 1e-3. Singular value decomposition was used for retracting Θ back to $\mathcal{M}$ after a parameter update.
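The selection rule described above can be sketched in a few lines. This is a hedged illustration rather than Algorithm 1 itself: it assumes the minibatch GoF p-values have already been computed, uses the asymptotic Kolmogorov distribution to turn the KS uniformity statistic into a p-value, and treats an average at or above the threshold as passing. The function names and the dictionary layout are hypothetical.

```python
import math


def kolmogorov_sf(t, terms=100):
    """Asymptotic Kolmogorov survival function P(sqrt(n) * D_n > t)."""
    if t < 0.1:
        return 1.0  # the alternating series is numerically 1 in this range
    s = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * t * t) for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, s))


def ks_uniform_pvalue(pvalues):
    """KS test of the local GoF p-values against Uniform(0, 1)."""
    p = sorted(pvalues)
    n = len(p)
    d = max(max(pi - (i - 1) / n, i / n - pi) for i, pi in enumerate(p, start=1))
    return kolmogorov_sf(math.sqrt(n) * d)


def select_lambda(uniformity_pvalues_by_lambda, threshold=0.05):
    """Return the smallest lambda whose average KS uniformity p-value meets the threshold."""
    candidates = [
        lam
        for lam, ps in uniformity_pvalues_by_lambda.items()
        if sum(ps) / len(ps) >= threshold
    ]
    return min(candidates) if candidates else None
```

Here each key of `uniformity_pvalues_by_lambda` is a grid value of $\lambda$ and each value is the list of uniformity p-values from the repeated runs (30 in the procedure above); `select_lambda` returns `None` when no grid value passes.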

E.1.2 CELEBA

The CelebA dataset was pre-processed following the same procedure as in Tolstikhin et al. (2017). First, a 140 × 140 center crop is taken and the image is resized to 64 × 64 resolution. For the VAE, the Adam optimizer was used with a learning rate of 1e-4, β1 = 0.5, and β2 = 0.999. The β-VAE used the Adam optimizer with a learning rate of 1e-4, β1 = 0.5, and β2 = 0.999. The VAE with learnable γ used the Adam optimizer with a learning rate of 1e-3, β1 = 0.5, and β2 = 0.999. The WAE-GAN also used the Adam optimizer, where the encoder and decoder parameters had an initial learning rate of 3e-4 and the discriminator was set to 1e-3, with β1 = 0.5 and β2 = 0.999. For the two-stage VAE, we used the trained VAE with learned γ as stage one. A second VAE was trained using the µ, σ from the first stage as inputs, also with a trainable γ. The architecture was a three-layer dense net with ReLU activations, and the decoder was the inverse, with the last layer being linear. Our method has two components for the encoder, as visualized in Figure 3a in the main paper. The first component of the encoder used the Adam optimizer with the learning rate set to 3e-3, β1 = 0.5, and β2 = 0.999. The decoder also used Adam, with the same learning rate and hyperparameters. The dense layer of the encoding process is parameterized by Θ, an element of the Stiefel manifold $\mathcal{M}$, and is trained with Riemannian SGD with a learning rate of 5e-3 while making use of the 1-cycle learning rate scheduler with a maximum learning rate of 1e-3. As our methods require projection, a 64 × 64 dimensional orthonormal basis is sampled and used to project the encoded data.

E.1.3 CIFAR-10

CIFAR10 is the final dataset, and consists of 60 thousand small images. The training set contains 50 thousand, with the remainder going to the test set. After setting a seed, the training set was split into a smaller training set of 45 thousand and a validation set with the rest.
The architecture used follows from the Model Architecture section. A latent dimension size of 64 is used. Models are evaluated on the average reconstruction quality. Here, the mean-squared error of randomly selected batches is computed over a single pass through the test set, and then averaged. This is repeated 30 times, with the mean and standard deviation reported. FID scores for reconstructed and generated data are also computed. The Adam optimizer was used for the encoder and decoder of all methods. The learning rate was set at 3e-4, with β1 = 0.5 and β2 = 0.999. The discriminator for WAE-GAN is the same as in the CelebA setup. All models are trained for 200 epochs, and reduce the learning rate when hitting a plateau (ReduceLROnPlateau in PyTorch). For this, the minimum learning rate was set to 5e-5, with a patience of 10 and a factor of 0.2. For the two-stage VAE, the encoder consisted of 2 dense layers of 32 nodes each with ReLU activation and a 64-dimensional code layer, and the decoder contained 3 dense layers with ReLU activation except the final layer, which was linear. The architecture was explored using the encoded validation set coming from stage 1 (the VAE with learnable γ). This second-stage VAE was trained for 50 epochs, with KL converging to zero quickly. Uniformity for the two-stage VAE was assessed by sampling from the test set, which was encoded after stage 1 training was complete. Similar to the models for MNIST and CelebA, the dense layer encoding for the GoFAE models, Θ, is trained with Riemannian SGD. The learning rate is 5e-4 and uses a 1-cycle learning rate scheduler with a maximum learning rate of 1e-4. Each encoded batch is projected onto a randomly sampled 64 × 64 orthonormal basis. The test statistic $T$ is applied to each direction and the final statistic (min, average, max) is computed.

E.2 MNIST

In the following sections, we visually assess the quality of GoFAE and competing models. For the GoFAE plots, a "good" regularization coefficient $\lambda > 0$ indicates that the model should (i) fail to reject $H_0$ at an α-level chosen a priori (close to normal but not overly so), (ii) accurately reconstruct the input, and finally (iii) generate samples which are both qualitatively and quantitatively convincing. In this section, we present the performance of competing and GoFAE methods in Table S2. Reconstruction error, reconstruction FID, generation FID, and the mean (std) of the $KS_{unif}$ p-values are reported. Additionally, we examine the effects of $\lambda$ with the GoFAE-SW model in Figure S1. The GoF test pushes the transformation to become increasingly singular in the sense that the condition number of the covariance matrix becomes increasingly large (Figure S1). As $\lambda$ gets larger, the average KS uniformity statistic also increases, indicating that $P_Y$ appears to be indistinguishable from the Gaussian class. However, as $\lambda$ continues to increase, the condition number of the estimated covariance matrix also increases. Yet, for a small interval of $\lambda$, the mutual information has not completely diminished, and the KS uniformity suggests $P_Y$ is still indistinguishable from Gaussian. This visualization provides insights on how the class of Gaussians allows the model to adapt as needed during training.

E.3 ADDITIONAL PLOTS

In this section, we include additional experimental results for GoFAE and comparison models.

E.3.1 CELEBA

This section contains images produced from GoFAE and competing GAEs on the CelebA dataset with a latent space dimension of 64. Figure S2 depicts reconstructed faces. In Table 1 in the main paper, we evaluated reconstruction quality with mean-squared error and the Fréchet Inception Distance (FID). The visual quality of the images produced by the GoFAE models is consistent with the low MSE and reconstruction FID observed in Table 1. Figure S3 contains faces generated from the model. The GoFAE models again produced competitive FID scores, with GoFAE-SF producing the lowest (best) of the group. Tables S3 and S4 illustrate the effects of $\lambda$ on the KS uniformity test, corresponding p-value, reconstruction error, and test statistics for GoFAE-SW, GoFAE-SF, GoFAE-CVM and GoFAE-KS, respectively. There are several things to note from these tables of figures demonstrating the effects of $\lambda$ on the training of GoFAE architectures (Table S3 and Table S4). The first row shows the mutual information and GoF p-value distribution computed from the test set as a function of $\log(\lambda)$. As $\lambda$ increases, the models produce $P_Y$ that appear increasingly indistinguishable from normal. However, beyond a certain $\lambda$, the encoding no longer appears normal; GoFAE-SW clearly depicts this occurring in Table S3, row (a), left panel. Row (b) tracks the average p-value produced on the test set while training with a particular GoF test statistic. Larger $\lambda$ can be seen to focus more on normality, as the p-values tend to start at high levels and stay there. The reconstruction error in row (c) converges quickly. Of particular interest in the $\log(\lambda)$ vs. KS p-value plots is the behavior of the average KS p-value; after the average KS uniformity peaks, the mutual information falls rapidly. This illustrates the relationship posited in the experiments section, where the model prioritizes posterior matching.
Similar plots for MNIST can be seen in section E.3.2 with the quantitative information regarding reconstruction error, reconstruction FID and generation FID in Table S2 . Once again, the GoFAE models produce overall lower MSE than competitor models, while maintaining competitive generation FID scores. This section includes MSE, FID scores, and p-values from the higher criticism evaluation in Table S7 and reconstructed and generated samples in Figure S7 and Figure S8 . 



Code can be found at https://github.com/aripalmer/GoFAE.



Figure 1: Latent behaviors for VAE, WAE, and GoFAE, inspired by Figure 1 of Tolstikhin et al. (2017). (a) The VAE requires the approximate posterior distribution (orange contours) to match the prior $P_Z$ (white contours) for each example. (b) The WAE forces the encoded distribution (green contours) to match the prior $P_Z$. (c) The GoFAE forces the encoded distribution (purple contours) to match some $P_Z$ in the prior class $\mathcal{G}$. For illustration, several $P_{Z_i} \in \mathcal{G}$ are visualized (white contours).

Multivariate normality (MVN) has received the most attention in the study of multivariate GoF tests. If $Y \in \mathbb{R}^d$ has a normal distribution, then for every $u \in \mathcal{S}$, $Y^\top u$, the projection of $Y$ on $u$, follows a UVN distribution (Rao et al., 1973), where $\mathcal{S} = \{u \in \mathbb{R}^d : \|u\| = 1\}$ is the unit sphere in $\mathbb{R}^d$. Conversely, if $Y$ has a non-normal distribution, the set of $u \in \mathcal{S}$ with $Y^\top u$ being normal has Lebesgue measure 0

Figure 2: Illustration of distribution family G.

Figure 3: Model selection demonstration. If the regularization coefficient $\lambda$ is too small (a) or too big (c), the model is under (over) regularized and the minibatch p-values ($p$) are skewed right (left). As a result, global normality is rejected through the higher criticism principle for p-value uniformity ($p_{KS_{unif}} < \alpha$). (b) An appropriately chosen $\lambda$ fails to reject global normality ($p_{KS_{unif}} > \alpha$).

When $H_0$ is rejected for small values of the test statistic, we can include this rejection criterion as a constraint to the optimization problem
$$\min_{F, G} E[d(X, G(F(X)))] \quad \text{subject to} \quad E[T(\{F(X_i)\}_{i=1}^m)] > T_\alpha.$$
Rewriting with regularization coefficient $\lambda$ leads to the Lagrangian
$$\min_{F, G} E[d(X, G(F(X)))] + \lambda \left( T_\alpha - E[T(\{F(X_i)\}_{i=1}^m)] \right).$$
Since $\lambda \ge 0$, the final objective can be simplified to
$$\min_{F, G} E\left[ d(X, G(F(X))) - \lambda T(\{F(X_i)\}_{i=1}^m) \right].$$

Figure 4: (a) GoFAE Architecture. (b) Example of singularity. The blue region is a polytope created as $U = \{x = [x_1, x_2, x_3] \in \mathbb{R}^3 : 0 \leq x_1 \leq x_2 \leq x_3 \leq 1\}$. For all $x \in U$, the coordinates of $x$ are also its order statistics; the min, median, and max coordinates correspond to the x-, y-, and z-axes. The green simplex is where $T_{SW}(\{x_1, x_2, x_3\}) = 1$, and the region built with the dotted lines and the red line creates an acceptance region, outside of which is the rejection region. The light blue arrows are the derivatives at the corresponding points, and the red boundary line corresponds to the singularity where the gradient blows up. (c) Visualization for GoF optimization on a Stiefel manifold M.


Figure 5: Effects of λ on p-value distribution and mutual information for GoFAE models.

CIFAR10 (Krizhevsky et al., 2009) datasets and compare to several other GAE models. We emphasize that our goal is not merely to produce competitive evaluation metrics, but to provide a principled way to balance the reconstruction versus prior matching trade-off. For MNIST and CelebA the architectures are based on Tolstikhin et al. (2017), while CIFAR10 is from Lippe (2022). The aim is to keep the architecture consistent across models with the exception of method-specific components. See the appendix for complete architectures (§ C), training details (§ D), and additional results (§ E). The following results are from experiments on CelebA.

Algorithm 2 GoFAE Optimization

Figure 6: SV of Σ

Reconstruction, Generation, and Normality. We assessed the quality of the generated and test set reconstructed images using Fréchet Inception Distance (FID) (Heusel et al., 2017) based on $10^4$ samples and mean-square error (MSE) (Table 1). We compared the GoFAE with the AE (Bengio et al., 2013), VAE (Kingma & Welling, 2013), VAE with learned γ (Dai & Wipf, 2019), 2-Stage VAE (Dai & Wipf, 2019), β-VAE (Higgins et al., 2017) and WAE-GAN (Tolstikhin et al., 2017); convergence was assessed by tracking the test set reconstruction error over training epochs (Tables S3-S9 and Fig. S9). Figs. 7a, 7b are for models presented in Table 1. We selected the smallest λ whose mean p-value of the HC uniformity test, KS unif, was greater than 0.05 for the GoFAE models. The correlation-based GoF tests have the most competitive performance on FID and test set MSE.

We assessed the normality of minibatch encodings across each method using several GoF tests for normality combined with KS unif. We ran 30 repetitions of Algorithm 1 for each method and reported the mean (std) of the KS unif p-values in Table 1. In addition to superior MSE and FID scores, the GoFAE models obtained uniform p-values under varying GoF tests. The variability across GoF tests highlights the fact that different tests are sensitive to different distributional characteristics. Qualitative and quantitative assessments (§ E.3) and convergence plots (§ E.4) are given in the appendix for MNIST, CIFAR-10, and CelebA. Finally, an ablation study provided empirical justification for Riemannian SGD in the GoFAE (§ E.5).
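The selection rule described here (the smallest λ whose mean KS unif p-value exceeds 0.05) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: it assumes SciPy's one-sample Kolmogorov-Smirnov test against the standard uniform distribution, and the names `ks_uniformity_pvalue` and `select_lambda`, as well as the example p-values, are hypothetical.

```python
import numpy as np
from scipy import stats


def ks_uniformity_pvalue(minibatch_pvalues):
    """Higher criticism step: test whether minibatch GoF p-values are Uniform(0, 1)."""
    return stats.kstest(minibatch_pvalues, "uniform").pvalue


def select_lambda(ks_pvalues_by_lambda, alpha=0.05):
    """Return the smallest lambda whose mean KS-uniformity p-value exceeds alpha.

    `ks_pvalues_by_lambda` maps each candidate lambda to the KS-uniformity
    p-values collected over repeated evaluations. Returns None if no
    candidate passes.
    """
    for lam in sorted(ks_pvalues_by_lambda):
        if np.mean(ks_pvalues_by_lambda[lam]) > alpha:
            return lam
    return None


# Made-up p-values for illustration only.
candidates = {0.1: [0.01, 0.03], 1.0: [0.40, 0.20], 10.0: [0.60, 0.70]}
chosen = select_lambda(candidates)

# Minibatch p-values piled up near zero are far from uniform.
skewed_pvalues = np.full(50, 0.001)
ks_skewed = ks_uniformity_pvalue(skewed_pvalues)
```

In practice the inner p-values would come from applying a normality GoF test to the encodings of each minibatch, and the outer KS test would then be repeated over several runs before averaging.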

Figure 7: Reconstruction error on the CelebA training (a) and testing (b) sets.

Proof. (a) is a direct consequence of Proposition 1 (a), except that $G$ therein is replaced with $F$. (b) follows by combining (a) with $d_{W_2}(P_Y, P_Z) \leq d_{W_2}(P_{Y,m}, P_Z) + d_{W_2}(P_{Y,m}, P_Y)$.

E.1.1 MNIST

The MNIST dataset was re-scaled to the unit interval. Training proceeded for 50 epochs using mini-batches of size 128. We specified a 10-dimensional code layer. Competitor models include β-VAE (Higgins et al., 2017), VAE (Kingma & Welling, 2013), WAE-GAN (Tolstikhin et al., 2017), VAE with learned γ (Dai & Wipf, 2019), 2-Stage VAE (Dai & Wipf, 2019), and AE Bengio et al.

Figure S1: Effects of λ. We evaluated the uniformity KS test p-values (blue dots), average KS p-values (blue line), mutual information (red line), log condition number (green line) and reconstruction error (black line) under different λ for GoFAE-SW.

Figure S2: Comparison of reconstruction between competing methods: (b) AE, (c) VAE (fixed γ), (d) VAE (learned γ), (e) β-VAE (β = 10), (f) WAE-GAN, and our framework: (g) GoFAE-SW, (h) GoFAE-SF, (i) GoFAE-CVM, (j) GoFAE-KS with CelebA.

Figure S3: Comparison of generated samples between competing methods: (b) AE, (c) VAE (fixed γ), (d) VAE (learned γ), (e) β-VAE (β = 10), (f) WAE-GAN, (g) 2-Stage VAE, and our framework: (h) GoFAE-SW, (i) GoFAE-SF, (j) GoFAE-CVM, (k) GoFAE-KS with CelebA.

Figure S5: Comparison of reconstruction between competing methods (b) AE, (c) VAE, (d) β-VAE (β = 2), (e) C-β-VAE, (f) WAE-GAN, and our framework: (g) GoFAE-SW, (h) GoFAE-SF, (i) GoFAE-EP, (j) GoFAE-CVM with MNIST.

Figure S7: Comparison of reconstructed images between competing methods (b) VAE (fixed γ), (c) VAE (learned γ), (d) β-VAE (β = 2), (e) WAE-GAN, and our framework: (f) GoFAE-SW, (g) GoFAE-SF, (h) GoFAE-CVM, (i) GoFAE-KS, (j) GoFAE-EP with CIFAR-10.

Figure S8: Comparison of generated samples between competing methods (b) VAE (fixed γ), (c) VAE (learned γ), (d) 2-Stage-VAE, (e) β-VAE (β = 2), (f) WAE-GAN, and our framework: (g) GoFAE-SW, (h) GoFAE-SF, (i) GoFAE-CVM, (j) GoFAE-KS, (k) GoFAE-EP with CIFAR-10.

Evaluation of CelebA by MSE, FID scores, and samples with p-values from higher criticism.

Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

Tianlin Xu, Li Kevin Wenliang, Michael Munn, and Beatrice Acciaio. Cot-gan: Generating sequential data via causal optimal transport. Advances in Neural Information Processing Systems, 33:8798-8809, 2020.

Recall that in the setup of the result, $X$ is a random variable taking values in $\mathbb{R}^m$, $Y$ and $Z$ are random variables taking values in $\mathbb{R}^d$, $F : \mathbb{R}^m \to \mathbb{R}^d$ and $G : \mathbb{R}^d \to \mathbb{R}^m$ are two mappings, and $X$ and $Y$ are linked by $Y = F(X)$.

Proposition 1. If X and Z are Euclidean spaces and the decoder G is differentiable, then (a)

S1 GoFAE Pipeline: Training Locally and Assessing with Higher Criticism

Evaluation of MNIST by MSE, FID scores (Heusel et al., 2017), and samples with p-values from higher criticism.

Tables S5 and S6 demonstrate the effects of λ on the KS uniformity test, corresponding p-value, reconstruction error, and test statistics for GoFAE-SW, GoFAE-SF, GoFAE-CVM and GoFAE-KS, respectively. The experiments on MNIST reach the same conclusion as on CelebA: our GoFAE methods are competitive in both reconstruction and generation.

Tables S8 and S9 demonstrate the effects of λ on the KS uniformity test, corresponding p-value, reconstruction error, and test statistics for GoFAE-SW, GoFAE-SF, GoFAE-CVM and GoFAE-KS, respectively. The 2-Stage VAE uses the VAE with learned γ for reconstruction. The experiments on CIFAR-10 reach the same conclusion as on CelebA and MNIST: our GoFAE methods are competitive in both reconstruction and generation.

Evaluation of CIFAR-10 by MSE, FID scores, and samples with p-values from higher criticism.

Comparison of Riemannian SGD (left) and standard SGD (right) for MNIST.

Comparison of Riemannian SGD (left) and standard SGD (right) for CIFAR-10.

ACKNOWLEDGEMENTS

The authors thank the anonymous reviewers for their comments, which have improved this work. This work was partially supported by a National Science Foundation (NSF) grant: IIS-1718738, a National Institutes of Health (NIH) grant: K02DA043063, and a grant from the US Department of Education: a Graduate Assistance in Areas of National Need (GAANN) program. J. Bi was also supported by NIH grants: 5R01MH119678-02 and 5R01DA051922-02.


Similarly, for tests that fail to reject for small values, the direction associated with the largest value can be selected. We refer to both scenarios as selecting the most pessimistic direction. Alternatively, the mean of all the statistics could be used as a new statistic.

D.2 EMPIRICAL DISTRIBUTIONS

The distribution of a test statistic must be known analytically or be estimable in order to compute p-values. There are a few options for calculating the empirical distribution of a test statistic when using GoF tests with projections. The first possibility is to randomly sample a direction from the unit sphere, project the multivariate data, and then use the corresponding distribution to determine the p-value. However, a single projection may not be particularly informative, and early testing indicated that this method led to slower convergence. Another option is to project along multiple directions and calculate the statistics of these directions. It is then possible to create a new statistic from these, for example the average, minimum, or maximum. Unfortunately, this method precludes using the original test statistic distribution to calculate p-values. To remedy this, we create an empirical distribution of this new statistic by repeated sampling; an empirical p-value is then computed from the test statistic samples.
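The repeated-sampling idea can be sketched as a small Monte Carlo routine. The sketch assumes the Shapiro-Wilk statistic, aggregation by the minimum over directions (small values reject normality for this test), a standard Gaussian null, and an add-one correction in the p-value; all function names and sample sizes are illustrative choices rather than the authors' settings.

```python
import numpy as np
from scipy import stats


def min_projected_statistic(y, directions):
    """Smallest Shapiro-Wilk statistic over all projection directions."""
    projections = y @ directions.T
    return min(stats.shapiro(projections[:, j])[0] for j in range(projections.shape[1]))


def simulate_null(n, dim, num_directions, num_sim, rng):
    """Empirical null distribution of the aggregated statistic under N(0, I)."""
    null = np.empty(num_sim)
    for s in range(num_sim):
        z = rng.standard_normal((n, dim))
        u = rng.standard_normal((num_directions, dim))
        u /= np.linalg.norm(u, axis=1, keepdims=True)  # directions on the unit sphere
        null[s] = min_projected_statistic(z, u)
    return null


def empirical_pvalue(observed, null_samples):
    """Lower-tail empirical p-value with an add-one correction."""
    return (1 + np.sum(null_samples <= observed)) / (1 + null_samples.size)


rng = np.random.default_rng(2)
null_samples = simulate_null(n=64, dim=4, num_directions=4, num_sim=200, rng=rng)

# Evaluate a fresh minibatch of codes against the simulated null.
codes = rng.standard_normal((64, 4))
u = rng.standard_normal((4, 4))
u /= np.linalg.norm(u, axis=1, keepdims=True)
p_value = empirical_pvalue(min_projected_statistic(codes, u), null_samples)
```

The null simulation only depends on the minibatch size, code dimension, and number of directions, so it can be computed once and reused for every minibatch.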

D.3 MUTUAL INFORMATION

The mutual information is estimated after the model has been trained by sampling from $N_{64}(\mu, \Sigma)$, where $(\mu, \Sigma)$ are estimated either during or after training. GoFAE tracks these statistics during training. These samples are decoded, and then encoded. Assuming the encoded data is Gaussian, mutual information may be calculated as

Published as a conference paper at ICLR 2023
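For jointly Gaussian vectors, mutual information has a standard closed form in terms of covariance determinants, $I(A; B) = \tfrac{1}{2}\log\big(\det\Sigma_A \det\Sigma_B / \det\Sigma_{AB}\big)$. The sketch below estimates it from paired samples; it is a generic estimator offered as an illustration under the Gaussian assumption, and it may differ from the exact expression used by the authors. The function name and the toy data are hypothetical.

```python
import numpy as np


def gaussian_mutual_information(a, b):
    """Mutual information (in nats) between paired samples, assuming joint Gaussianity.

    a: array of shape (n, d_a); b: array of shape (n, d_b).
    """
    d_a = a.shape[1]
    joint_cov = np.cov(np.hstack([a, b]), rowvar=False)
    _, logdet_a = np.linalg.slogdet(joint_cov[:d_a, :d_a])
    _, logdet_b = np.linalg.slogdet(joint_cov[d_a:, d_a:])
    _, logdet_joint = np.linalg.slogdet(joint_cov)
    return 0.5 * (logdet_a + logdet_b - logdet_joint)


rng = np.random.default_rng(3)
n, rho = 20000, 0.8

# Toy check: a correlated scalar pair has MI = -0.5 * log(1 - rho^2).
a = rng.standard_normal((n, 1))
b = rho * a + np.sqrt(1 - rho ** 2) * rng.standard_normal((n, 1))
mi_correlated = gaussian_mutual_information(a, b)
mi_theory = -0.5 * np.log(1 - rho ** 2)

# Independent samples should give an estimate close to zero.
mi_independent = gaussian_mutual_information(a, rng.standard_normal((n, 1)))
```

In the setting described above, `a` would hold the sampled codes and `b` their re-encoded counterparts after a decode and encode pass.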

E.4 CONVERGENCE ANALYSIS FOR COMPETING METHODS

In this section, we include the test set reconstruction error over training epochs for the competing methods for MNIST, CelebA and CIFAR-10 data. Since we adopted a common autoencoder architecture (Section C), reconstruction error follows a similar trend across all models (Figure S9). Since the 2-Stage VAE shares the VAE with learned γ as the first stage, we assessed its convergence in the same manner as the VAE with learned γ.

E.5 ABLATION STUDY

We proposed Riemannian SGD as a solution to several problems associated with optimizing GoF test statistics. Theoretically, this approach is necessary to prove convergence. However, it remains to be seen whether there is any impact on model performance when the constraints are removed. To explore this, we considered several GoFAE models and trained them without RSGD on MNIST and CIFAR-10. We used traditional SGD in place of Riemannian SGD; Θ is no longer retracted to the Stiefel manifold.

GoFAE models with SW, SF, CVM and EP were retrained using the same architecture and hyperparameter configuration on MNIST. In place of the orthogonal initialization used for Θ, Kaiming initialization (He et al., 2015) was used. Table S2 summarizes the results with the best performing model in bold for each evaluation measure.

GoFAE models with SW, SF, CVM, and EP were also retrained on CIFAR-10. All architectures and hyper-parameters remained the same as for the Riemannian SGD counterparts, except that Θ was initialized following Kaiming initialization. Table S11 summarizes the results with the best performing model in bold for each evaluation measure. These results show empirical support for
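For readers unfamiliar with the constraint being ablated, a single Riemannian SGD step on the Stiefel manifold can be sketched as below. The sketch assumes a QR-based retraction and the Euclidean-metric tangent projection, which are common textbook choices; the text here does not state which retraction the authors use, and the shape of Θ and the function names are hypothetical.

```python
import numpy as np


def project_to_tangent(theta, grad):
    """Project a Euclidean gradient onto the tangent space of the Stiefel manifold at theta."""
    sym = 0.5 * (theta.T @ grad + grad.T @ theta)
    return grad - theta @ sym


def qr_retraction(matrix):
    """Map a full-rank matrix back to the Stiefel manifold via its QR factor."""
    q, r = np.linalg.qr(matrix)
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs  # flip columns so that R has a positive diagonal


def riemannian_sgd_step(theta, euclidean_grad, lr):
    """One step: project the gradient, move along it, then retract."""
    riemannian_grad = project_to_tangent(theta, euclidean_grad)
    return qr_retraction(theta - lr * riemannian_grad)


rng = np.random.default_rng(4)
theta = qr_retraction(rng.standard_normal((6, 3)))  # orthonormal columns at initialization
grad = rng.standard_normal((6, 3))                  # stand-in for a stochastic gradient
tangent = project_to_tangent(theta, grad)
theta_next = riemannian_sgd_step(theta, grad, lr=0.1)
```

Dropping the projection and retraction recovers plain SGD, which is the unconstrained variant compared against in this ablation.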

