GROMOV-WASSERSTEIN AUTOENCODERS

Abstract

Variational Autoencoder (VAE)-based generative models offer flexible representation learning by incorporating meta-priors, general premises considered beneficial for downstream tasks. However, the incorporated meta-priors often involve ad-hoc model deviations from the original likelihood architecture, causing undesirable changes in their training. In this paper, we propose a novel representation learning method, Gromov-Wasserstein Autoencoders (GWAE), which directly matches the latent and data distributions using the variational autoencoding scheme. Instead of likelihood-based objectives, GWAE models minimize the Gromov-Wasserstein (GW) metric between the trainable prior and given data distributions. The GW metric measures the distance structure-oriented discrepancy between distributions even with different dimensionalities, which provides a direct measure between the latent and data spaces. By restricting the prior family, we can introduce meta-priors into the latent space without changing their objective. The empirical comparisons with VAE-based models show that GWAE models work in two prominent meta-priors, disentanglement and clustering, with their GW objective unchanged.

1. INTRODUCTION

One fundamental challenge in unsupervised learning is capturing the underlying low-dimensional structure of high-dimensional data because natural data (e.g., images) lie in low-dimensional manifolds (Carlsson et al., 2008; Bengio et al., 2013) . Since deep neural networks have shown their potential for non-linear mapping, representation learning has recently made substantial progress in its applications to high-dimensional and complex data (Kingma & Welling, 2014; Rezende et al., 2014; Hsu et al., 2017; Hu et al., 2017) . Learning low-dimensional representations is in mounting demand because the inference of concise representations extracts the essence of data to facilitate various downstream tasks (Thomas et al., 2017; Higgins et al., 2017b; Creager et al., 2019; Locatello et al., 2019a) . For obtaining such general-purpose representations, several meta-priors have been proposed (Bengio et al., 2013; Tschannen et al., 2018) . Meta-priors are general premises about the world, such as disentanglement (Higgins et al., 2017a; Chen et al., 2018; Kim & Mnih, 2018; Ding et al., 2020) , hierarchical factors (Vahdat & Kautz, 2020; Zhao et al., 2017; Sønderby et al., 2016) , and clustering (Zhao et al., 2018; Zong et al., 2018; Asano et al., 2020) . A prominent approach to representation learning is a deep generative model based on the variational autoencoder (VAE) (Kingma & Welling, 2014) . VAE-based models adopt the variational autoencoding scheme, which introduces an inference model in addition to a generative model and thereby offers bidirectionally tractable processes between observed variables (data) and latent variables. In this scheme, the reparameterization trick (Kingma & Welling, 2014) yields representation learning capability since reparameterized latent codes are tractable for gradient computation. The introduction of additional losses and constraints provides further regularization for the training process based on meta-priors. 
However, controlling representation learning remains a challenging task in VAE-based models owing to the deviation from the original optimization. Whereas the existing VAE-based approaches modify the latent space based on the meta-prior (Kim & Mnih, 2018; Zhao et al., 2017; Zong et al., 2018) , their training objectives still partly rely on the evidence lower bound (ELBO). Since the ELBO objective is grounded on variational inference, ad-hoc model modifications cause implicit and undesirable changes, e.g., posterior collapse (Dai et al., 2020) and implicit prior change (Hoffman et al., 2017) in β-VAE (Higgins et al., 2017a) . Under such modifications, it is also unclear whether a latent representation retains the underlying data structure because VAE models implicitly interpolate data points to form a latent space using noises injected into latent codes by the reparameterization trick (Rezende & Viola, 2018a; b; Aneja et al., 2021) . As another paradigm of variational modeling, the ELBO objective has been reinterpreted from the optimal transport (OT) viewpoint (Tolstikhin et al., 2018) . Tolstikhin et al. (2018) have derived a family of generative models called the Wasserstein autoencoder (WAE) by applying the variational autoencoding model to high-dimensional OT problems as the couplings (Appendix A. 4 for more details). Despite the OT-based model derivation, the WAE objective is equivalent to that of Info-VAE (Zhao et al., 2019) , whose objective consists of the ELBO and the mutual information term. The WAE formulation is derived from the estimation and minimization of the OT cost (Tolstikhin et al., 2018; Arjovsky et al., 2017) between the data distribution and the generative model, i.e., the generative modeling by applying the Wasserstein metric. It furnishes a wide class of models, even when the prior support does not cover the entire variational posterior support. 
The OT paradigm also applies to existing representation learning approaches originally derived from re-weighting the Kullback-Leibler (KL) divergence term (Gaujac et al., 2021) . Another technique for optimizing the VAE-based ELBO objective called implicit variational inference (IVI) (Huszár, 2017) has been actively researched. While the VAE model has an analytically tractable prior for variational inference, IVI aims at variational inference using implicit distributions, in which one can use its sampler instead of its probability density function. A notable approach to IVI is the density ratio estimation (Sugiyama et al., 2012) , which replaces the f -divergence term in the variational objective with an adversarial discriminator that distinguishes the origin of the samples. For distribution matching, this algorithm shares theoretical grounds with generative models based on the generative adversarial networks (GANs) (Goodfellow et al., 2014; Sønderby et al., 2017) , which induces the application of IVI toward the distribution matching in complex and high-dimensional variables, such as images. See Appendix A.6 for more discussions. In this paper, we propose a novel representation learning methodology, Gromov-Wasserstein Autoencoder (GWAE) based on the Gromov-Wasserstein (GW) metric (Mémoli, 2011) , an OT-based metric between distributions applicable even with different dimensionality (Mémoli, 2011; Xu et al., 2020; Nguyen et al., 2021) . Instead of the ELBO objective, we apply the GW metric objective in the variational autoencoding scheme to directly match the latent marginal (prior) and the data distribution. The GWAE models obtain a latent representation retaining the distance structure of the data space to hold the underlying data information. The GW objective also induces the variational autoencoding to perform the distribution matching of the generative and inference models, despite the OT-based derivation. 
Under the OT-based variational autoencoding, one can adopt a prior of a GWAE model from a rich class of trainable priors depending on the assumed meta-prior even though the KL divergence from the prior to the encoder is infinite. Our contributions are listed below.

• We propose a novel probabilistic model family GWAE, which matches the latent space to the given unlabeled data via the variational autoencoding scheme. The GWAE models estimate and minimize the GW metric between the latent and data spaces to directly match the latent representation closer to the data in terms of distance structure.
• We propose several families of priors in the form of implicit distributions, adaptively learned from the given dataset using stochastic gradient descent (SGD). The choice of the prior family corresponds to the meta-prior, thereby providing a more flexible modeling scheme for representation learning.
• We conduct empirical evaluations on the capability of GWAE in prominent meta-priors: disentanglement and clustering. Several experiments on image datasets CelebA (Liu et al., 2015), MNIST (LeCun et al., 1998), and 3D Shapes (Burgess & Kim, 2018), show that GWAE models outperform the VAE-based representation learning methods whereas their GW objective is not changed over different meta-priors.

2. RELATED WORK

(Higgins et al., 2017a; Chen et al., 2018; Kim & Mnih, 2018; Achille & Soatto, 2018; Kumar et al., 2018; Zong et al., 2018; Zhao et al., 2017; Sønderby et al., 2016; Zhao et al., 2019; Hou et al., 2019; Detlefsen & Hauberg, 2019; Ding et al., 2020). The standard VAE learns an encoder and a decoder with parameters ϕ and θ, respectively, to learn a low-dimensional representation in its latent variables z using a bottleneck layer of the autoencoder.

Using data $x \sim p_{\mathrm{data}}(x)$ supported on the data space $\mathcal{X}$, the VAE objective is the ELBO formulated by the following optimization problem:

$$\underset{\theta,\phi}{\text{maximize}} \quad \mathbb{E}_{p_{\mathrm{data}}(x)}\left[\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\left(q_\phi(z|x)\,\|\,\pi(z)\right)\right], \tag{1}$$

where the encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$ are parameterized by neural networks, and the prior $\pi(z)$ is postulated before training. The first and second terms (called the reconstruction term and the KL term, respectively) in Eq. (1) are in a trade-off relationship (Tschannen et al., 2018). This implies that learning is guided to autoencoding by the reconstruction term while matching the distribution of latent variables to the pre-defined prior using the KL term.

Implicit Variational Inference. IVI solves the variational inference problem using implicit distributions (Huszár, 2017). A major approach to IVI is density ratio estimation (Sugiyama et al., 2012), in which the ratio between probability distribution functions is estimated using a discriminator instead of their closed-form expression. Since IVI-based and GAN-based models share density ratio estimation mechanisms in distribution matching (Sønderby et al., 2017), the combination of VAEs and GANs has been actively studied, especially from the aspect of the matching of implicit distributions. The successful results achieved by GAN-based models in high-dimensional data, such as natural images, have propelled an active application and research of IVI in unsupervised learning (Larsen et al., 2016; Makhzani, 2018).

Optimal Transport. The OT cost is used as a measure of the difference between distributions supported on high-dimensional space using SGD (Arjovsky et al., 2017; Tolstikhin et al., 2018; Gaujac et al., 2021). This provides the Wasserstein metric for the discrepancy between distributions.
For a constant $\xi \geq 1$, the $\xi$-Wasserstein metric between distributions $r$ and $s$ is defined as

$$W_\xi(r, s) = \left(\inf_{\gamma \in \mathcal{P}(r(x), s(x'))} \mathbb{E}_{\gamma(x, x')}\left[d^\xi(x, x')\right]\right)^{1/\xi}, \tag{2}$$

where $x$ denotes the random variable on which the distributions $r$ and $s$ are defined, and $\mathcal{P}(r(x), s(x'))$ denotes the set consisting of all couplings whose $x$-marginal is $r(x)$ and whose $x'$-marginal is $s(x')$. Owing to the difficulty of computing the exact infimum in Eq. (2) for high-dimensional, large-scale data, several approaches try to minimize the estimated $\xi$-Wasserstein metric using neural networks and SGD (Tolstikhin et al., 2018; Arjovsky et al., 2017). The form in Eq. (2) is the primal form of the Wasserstein metric, particularly compared with its dual form for the case of $\xi = 1$ (Arjovsky et al., 2017). The two prominent approaches for the OT in high-dimensional, complex large-scale data are: (i) minimizing the primal form using a probabilistic autoencoder (Tolstikhin et al., 2018), and (ii) adversarially optimizing the dual form using a generator-critic pair (Arjovsky et al., 2017).

Wasserstein Autoencoder (WAE). WAE (Tolstikhin et al., 2018) is a family of generative models whose autoencoder estimates and minimizes the primal form of the Wasserstein metric between the generative model $p_\theta(x)$ and the data distribution $p_{\mathrm{data}}(x)$ using SGD in the variational autoencoding settings, i.e., the VAE model architecture (Kingma & Welling, 2014). This primal-based formulation induces a representation learning methodology from the OT viewpoint because the WAE objective is equivalent to that of InfoVAE (Zhao et al., 2019), which learns the variational autoencoding model by retaining the mutual information of the probabilistic encoder.

Kantorovich-Rubinstein Duality. The Wasserstein GAN models (Arjovsky et al., 2017) adopt an objective based on the 1-Wasserstein metric between the generative model $p_\theta(x)$ and data distribution $p_{\mathrm{data}}(x)$.
This objective is estimated using the Kantorovich-Rubinstein duality (Villani, 2009; Arjovsky et al., 2017), which holds for the 1-Wasserstein metric as

$$W_1(r, s) = \sup_{f:\,1\text{-Lipschitz}} \mathbb{E}_{r(x)}\left[f(x)\right] - \mathbb{E}_{s(x)}\left[f(x)\right]. \tag{3}$$

To estimate this function $f$ using SGD, a 1-Lipschitz neural network called a critic is introduced, as with a discriminator in the GAN-based models. The training process using mini-batches is adversarially conducted, i.e., by alternately repeating updates of the critic parameters and the generative parameters. During this process, the critic maximizes the objective in Eq. (3) to approach the supremum, whereas the generative model minimizes the objective for the distribution matching $p_\theta(x) \approx p_{\mathrm{data}}(x)$.
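As a small illustration of the primal form in Eq. (2) and the dual form in Eq. (3), the following sketch compares both on a toy one-dimensional example. It relies on the standard fact that, for equal-size 1D samples with uniform weights, pairing the sorted points is an optimal coupling. The function names and the sample values are illustrative assumptions, not part of the method described here.

```python
def wasserstein_1d(xs, ys, xi=1.0):
    """Empirical xi-Wasserstein metric between two equal-size 1D samples.

    In 1D with uniform weights, pairing the sorted points is an optimal
    coupling, so the infimum in Eq. (2) can be evaluated directly.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    cost = sum(abs(x - y) ** xi for x, y in zip(sorted(xs), sorted(ys)))
    return (cost / len(xs)) ** (1.0 / xi)


def dual_value(f, xs, ys):
    """E_r[f(x)] - E_s[f(x)] for one candidate potential f, as in Eq. (3)."""
    return sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys)


r_samples = [0.0, 1.0, 2.0]
s_samples = [1.0, 2.0, 3.0]  # r shifted by +1

primal = wasserstein_1d(r_samples, s_samples)
# f(x) = -x is 1-Lipschitz and attains the supremum for this pure shift.
dual = dual_value(lambda x: -x, r_samples, s_samples)
```

Any other 1-Lipschitz potential, such as f(x) = x, yields a dual value no larger than the primal value, which is why a critic has to be trained toward the supremum.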

3. PROPOSED METHOD

Our GWAE models minimize the OT cost between the data and latent spaces, based on generative modeling in the variational autoencoding. GWAE models learn representations by matching the distance structure between the latent and data spaces, instead of likelihood maximization.

3.1. OPTIMAL TRANSPORT BETWEEN SPACES

Although the OT problem induces a metric between probability distributions, its application is limited to distributions sharing one sample space. The GW metric (Mémoli, 2011) measures the discrepancy between metric measure spaces using the OT of distance distributions. A metric measure space consists of a sample space, metric, and probability measure. Given a pair of different metric spaces, i.e., sample spaces and metrics, the GW metric measures the discrepancy between probability distributions supported on the spaces. In terms of the GW metric, two distributions are considered to be equal if there is an isometric mapping between their supports (Sturm, 2012; Sejourne et al., 2021). For a constant $\rho \geq 1$, the formulation of the $\rho$-GW metric between probability distributions $r(x)$ supported on a metric space $(\mathcal{X}, d_{\mathcal{X}})$ and $s(z)$ supported on $(\mathcal{Z}, d_{\mathcal{Z}})$ is given by

$$GW_\rho(r, s) := \left(\inf_{\gamma \in \mathcal{P}(r(x), s(z))} \mathbb{E}_{\gamma(x, z)} \mathbb{E}_{\gamma(x', z')}\left[\left|d_{\mathcal{X}}(x, x') - d_{\mathcal{Z}}(z, z')\right|^\rho\right]\right)^{1/\rho}, \tag{4}$$

where $\mathcal{P}(r(x), s(z))$ denotes the set of all couplings with $r(x)$ as $x$-marginal and $s(z)$ as $z$-marginal. The metrics $d_{\mathcal{X}}$ and $d_{\mathcal{Z}}$ are the metrics in the spaces $\mathcal{X}$ and $\mathcal{Z}$, respectively.
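To make Eq. (4) concrete, the sketch below evaluates the GW objective for one fixed coupling between two small point sets of different dimensionality. It does not search for the infimum, so the returned value is only an upper bound on the GW metric. The function name, the Euclidean metrics, and the toy points are assumptions chosen for illustration.

```python
import math


def gw_cost(xs, zs, coupling, rho=1.0):
    """Value of the objective in Eq. (4) for one fixed coupling.

    coupling[i][j] is the probability mass on the pair (xs[i], zs[j]).
    Both spaces use the Euclidean metric in this sketch.
    """
    total = 0.0
    for i, x in enumerate(xs):
        for j, z in enumerate(zs):
            for k, x2 in enumerate(xs):
                for l, z2 in enumerate(zs):
                    mass = coupling[i][j] * coupling[k][l]
                    if mass > 0.0:
                        gap = abs(math.dist(x, x2) - math.dist(z, z2))
                        total += mass * gap ** rho
    return total ** (1.0 / rho)


# The same triangle embedded in R^3 and in R^2 (an isometric pair).
xs = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
zs = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]

third = 1.0 / 3.0
matched = [[third, 0.0, 0.0], [0.0, third, 0.0], [0.0, 0.0, third]]
shifted = [[0.0, third, 0.0], [0.0, 0.0, third], [third, 0.0, 0.0]]

cost_matched = gw_cost(xs, zs, matched)  # coupling along the isometry
cost_shifted = gw_cost(xs, zs, shifted)  # coupling that breaks the isometry
```

The coupling that follows the isometry gives a zero cost even though the two point sets live in spaces of different dimensionality, whereas the shifted coupling gives a positive cost.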

3.2. APPLICATION TO REPRESENTATION LEARNING: GROMOV-WASSERSTEIN AUTOENCODER

In this work, we propose a novel GWAE modeling methodology based on the GW metric for distance structure modeling in the variational autoencoding formulation. The objectives of generative models typically aim for distribution matching in the data space, e.g., the likelihood (Kingma & Welling, 2014) and the Jensen-Shannon divergence (Goodfellow et al., 2014) . The GWAE objective differs from these approaches and aims to directly match the latent and data distributions based on their distance structure.

3.2.1. MODEL SETTINGS: VARIATIONAL AUTOENCODING

Given an $N$-sized set of data points $\{x_i\}_{i=1}^{N}$ supported on a data space $\mathcal{X}$, representation learning aims to build a latent space $\mathcal{Z}$ and obtain mappings between both the spaces. For numerical computation, we postulate that the spaces $\mathcal{X}$ and $\mathcal{Z}$ respectively have tractable metrics $d_{\mathcal{X}}$ and $d_{\mathcal{Z}}$ such as the Euclidean distance (see Appendix B.1 for details), and let $M, L \in \mathbb{N} \setminus \{0\}$, $\mathcal{X} \subseteq \mathbb{R}^M$, and $\mathcal{Z} \subseteq \mathbb{R}^L$. We consider the bottleneck case $M \gg L$ similarly to the existing representation learning methods (Kingma & Welling, 2014; Higgins et al., 2017a; Kim & Mnih, 2018) because the data space $\mathcal{X}$ is typically an $L$-dimensional manifold (Carlsson et al., 2008; Bengio et al., 2013). We construct a model with a trainable latent prior $\pi_\theta(z)$ to approach the data distribution $p_{\mathrm{data}}(x)$ in terms of distance structure. Following the standard VAE (Kingma & Welling, 2014), we consider a generative model $p_\theta(x, z)$ with parameters $\theta$ and an inference model $q_\phi(x, z)$ with parameters $\phi$. The generation process consists of the prior $\pi_\theta(z)$ and a decoder $p_\theta(x|z)$ parameterized with neural networks. Since the inverted generation process $p_\theta(z|x) = \pi_\theta(z) p_\theta(x|z) / p_\theta(x)$ is intractable in this scheme, an encoder $q_\phi(z|x) \approx p_\theta(z|x)$ is instead established using neural networks for parameterization. Thus, the generative $p_\theta(x, z)$ and inference $q_\phi(x, z)$ models are defined as

$$p_\theta(x, z) = \pi_\theta(z)\, p_\theta(x|z), \qquad q_\phi(x, z) = p_{\mathrm{data}}(x)\, q_\phi(z|x). \tag{5}$$

The empirical distribution $\hat{p}_{\mathrm{data}}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)$ is used for the estimation of $p_{\mathrm{data}}(x)$. A Dirac decoder and a diagonal Gaussian encoder are used to alleviate deviations from the data manifold as in Tolstikhin et al. (2018) (see Appendix B.1 for these details and formulations).
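The two joint models in Eq. (5) can be read as two sampling procedures, which the following minimal sketch spells out with a reparameterized diagonal Gaussian encoder and a Dirac (deterministic) decoder. The toy encoder, decoder, and prior are hypothetical stand-ins for the neural networks; only the sampling order follows the text.

```python
import random


def reparameterize(mean, std, rng):
    """Draw z = mean + std * eps with eps ~ N(0, I) (diagonal Gaussian)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def sample_inference_pair(data, encoder, rng):
    """Draw (x, z) from q_phi(x, z) = p_data(x) q_phi(z|x)."""
    x = rng.choice(data)  # empirical p_data: uniform over the N data points
    mean, std = encoder(x)
    return x, reparameterize(mean, std, rng)


def sample_generative_pair(prior_sampler, decoder, rng):
    """Draw (x, z) from p_theta(x, z) = pi_theta(z) p_theta(x|z)."""
    z = prior_sampler(rng)
    return decoder(z), z  # Dirac decoder: x is a deterministic function of z


# Hypothetical toy components (M = 3, L = 1); real models use neural networks.
def toy_encoder(x):
    return [x[0]], [0.1]  # mean and standard deviation of q_phi(z|x)


def toy_decoder(z):
    return [z[0], 2.0 * z[0], 0.0]


def toy_prior(rng):
    return [rng.gauss(0.0, 1.0)]


rng = random.Random(0)
data = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]]
x_q, z_q = sample_inference_pair(data, toy_encoder, rng)
x_p, z_p = sample_generative_pair(toy_prior, toy_decoder, rng)
```

Both procedures only require samplers, which is what allows the prior to be an implicit distribution.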

3.2.2. OPTIMAL TRANSPORT OBJECTIVE

Here, we focus on the latent space $\mathcal{Z}$ to transfer the underlying data structure to the latent space. This highlights the main difference between the GWAE and the existing generative approaches. The training objective of GWAE is the GW metric between the metric measure spaces $(\mathcal{X}, d_{\mathcal{X}}, p_{\mathrm{data}}(x))$ and $(\mathcal{Z}, d_{\mathcal{Z}}, \pi_\theta(z))$ as

$$\underset{\theta}{\text{minimize}} \quad GW_\rho\left(p_{\mathrm{data}}(x), \pi_\theta(z)\right)^\rho, \tag{6}$$

where $\rho \geq 1$ is a constant, and we adopt $\rho = 1$ to alleviate the effect of outlier samples distant from the isometry for training stability. Computing the exact GW value is difficult owing to the high dimensionality of both $x$ and $z$. Hence, we estimate and minimize the GW metric using the variational autoencoding scheme, which captures the latent factors of complex data in a stable manner. We recast the GW objective into a main GW estimator $\mathcal{L}_{GW}$ with three regularizations: a reconstruction loss $\mathcal{L}_W$, a joint dual loss $\mathcal{L}_D$, and an entropy regularization $\mathcal{R}_H$.

Estimated GW metric $\mathcal{L}_{GW}$. We use the generative model $p_\theta(x, z)$ as the coupling of Eq. (6) similarly to the WAE (Tolstikhin et al., 2018) methodology. The main loss $\mathcal{L}_{GW}$ estimates the GW metric as:

$$\underset{\theta}{\text{minimize}} \quad \mathcal{L}_{GW} := \mathbb{E}_{p_\theta(x, z)} \mathbb{E}_{p_\theta(x', z')}\left[\left|d_{\mathcal{X}}(x, x') - C\, d_{\mathcal{Z}}(z, z')\right|^\rho\right], \tag{7}$$

$$\text{subject to} \quad p_{\mathrm{data}}(x) = p_\theta(x), \tag{8}$$

where $C$ is a trainable scale constant to cancel out the scale degree of freedom, and $p_\theta(x)$ denotes the marginal $p_\theta(x) = \int_{\mathcal{Z}} p_\theta(x, z)\, dz$.

WAE-based $\mathcal{X}$-marginal condition $\mathcal{L}_W$. To obtain a numerical solution with stable training, Tolstikhin et al. (2018) relax the $\mathcal{X}$-matching condition of Eq. (8) into $\xi$-Wasserstein minimization ($\xi \geq 1$) using the variational autoencoding coupling. The WAE methodology (Tolstikhin et al., 2018) uses the inference model $q_\phi(x, z)$ to formulate the $\xi$-Wasserstein minimization as the reconstruction loss $\mathcal{L}_W$ with a $\mathcal{Z}$-matching condition as:

$$\underset{\theta,\phi}{\text{minimize}} \quad \mathcal{L}_W := \mathbb{E}_{q_\phi(x, z)} \mathbb{E}_{p_\theta(x'|z)}\left[d_{\mathcal{X}}(x, x')\right], \tag{9}$$

$$\text{subject to} \quad q_\phi(z) = \pi_\theta(z), \tag{10}$$

where $d_{\mathcal{X}}$ is a distance function based on the $L_\xi$ metric.
We adopt the setting $\xi = 2$ to retain the conventional Gaussian reconstruction loss.

Merged sufficient condition $\mathcal{L}_D$. We merge the marginal coupling conditions of Eq. (8) and Eq. (10) into the joint $\mathcal{X} \times \mathcal{Z}$-matching sufficient condition $p_\theta(x, z) = q_\phi(x, z)$ to attain bidirectional inferences while preserving the stability of autoencoding. Since such joint distribution matching can also be relaxed into the minimization of $W_1(q_\phi(x, z), p_\theta(x, z))$, this condition is satisfied by minimizing the Kantorovich-Rubinstein duality introduced by Arjovsky et al. (2017) as in Eq. (3). Practically, a 1-Lipschitz neural network (critic) $f_\psi$ estimates the supremum of Eq. (3), and the main model minimizes this estimated supremum as:

$$\underset{\theta,\phi}{\text{minimize}}\ \underset{\psi}{\text{maximize}} \quad \mathcal{L}_D := \mathbb{E}_{q_\phi(x, z)}\left[f_\psi(x, z)\right] - \mathbb{E}_{p_\theta(x, z)}\left[f_\psi(x, z)\right], \tag{11}$$

where $\psi$ denotes the critic parameters. To satisfy the 1-Lipschitz constraint, the critic $f_\psi$ is implemented with techniques such as spectral normalization (Miyato et al., 2018) and gradient penalty (Gulrajani et al., 2017) (see Appendix B.3 for the details of the gradient penalty loss).

Entropy regularization $\mathcal{R}_H$. We further introduce the entropy regularization $\mathcal{R}_H$ using the inference entropy to avoid degenerate solutions in which the encoder $q_\phi(z|x)$ becomes Dirac and deterministic for all data points. In such degenerate solutions, the latent representation simply becomes a look-up table because such a point-to-point encoder maps the set of data points into a set of latent code points with measure zero (Hoffman et al., 2017; Dai et al., 2018), causing overfitting into the empirical data distribution. An effective way to avoid it is a regularization with the inference entropy $H_q$ of the latent variables $z$ conditioned on data $x$ as

$$\mathcal{R}_H := H_q(z|x) = \mathbb{E}_{q_\phi(x, z)}\left[-\log q_\phi(z|x)\right]. \tag{12}$$

Since the conditioned entropy $H_q(z|x)$ diverges to negative infinity in the degenerate solutions, the regularization term $-\mathcal{R}_H$ facilitates the probabilistic learning of GWAE models.
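For a diagonal Gaussian encoder, the entropy in Eq. (12) has a well-known closed form, which the sketch below uses to show how the regularizer behaves as the encoder approaches a Dirac distribution. The helper names and the numerical values are illustrative assumptions; the closed form itself is the standard differential entropy of a Gaussian.

```python
import math


def diag_gaussian_entropy(std):
    """Differential entropy of N(mean, diag(std^2)); independent of the mean."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in std)


def entropy_regularizer(batch_std):
    """Mini-batch estimate of R_H: encoder entropy averaged over data points."""
    return sum(diag_gaussian_entropy(std) for std in batch_std) / len(batch_std)


# Each inner list holds the encoder standard deviations for one data point.
wide = entropy_regularizer([[1.0, 1.0], [0.5, 2.0]])
narrow = entropy_regularizer([[1e-3, 1e-3], [1e-3, 1e-3]])
nearly_dirac = entropy_regularizer([[1e-9, 1e-9], [1e-9, 1e-9]])
```

The estimate decreases without bound as the standard deviations shrink, so subtracting it from the loss penalizes encoders that collapse to point-to-point mappings.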
Stochastic Training with Single Estimated Objective. Applying the Lagrange multiplier method to the aforementioned constraints, we recast the GW metric of Eq. (6) into a single objective $\mathcal{L}$ with multipliers $\lambda_W$, $\lambda_D$, and $\lambda_H$ as

$$\underset{\theta,\phi}{\text{minimize}}\ \underset{\psi}{\text{maximize}} \quad \mathcal{L} := \mathcal{L}_{GW} + \lambda_W \mathcal{L}_W + \lambda_D \mathcal{L}_D - \lambda_H \mathcal{R}_H. \tag{13}$$

One efficient way to optimize this objective is mini-batch gradient descent in alternating steps (Goodfellow et al., 2014; Arjovsky et al., 2017), which can be conducted in automatic differentiation packages, such as PyTorch (Paszke et al., 2019). One step of mini-batch descent is the minimization of the total objective $\mathcal{L}$ in Eq. (13), and the other step is the maximization of the critic objective $\mathcal{L}_D$ in Eq. (11). By alternately repeating these steps, the critic estimates the Wasserstein metric using the expected potential difference $\mathcal{L}_D$ (Arjovsky et al., 2017). Although the objective in Eq. (13) involves three auxiliary regularizations including an adversarial term, the GWAE model can be efficiently optimized because the adversarial mechanism and the variational autoencoding scheme share the goal of distribution matching $p_\theta(x, z) \approx q_\phi(x, z)$ (see Appendix C.5 for more details).
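The following sketch shows one plausible way to estimate the main loss of Eq. (7) from a mini-batch of generated pairs and to combine the four terms as in Eq. (13). It is written in plain Python for readability rather than with an automatic differentiation package. Averaging over distinct pairs only, the Euclidean metrics, and the default multiplier values are assumptions made for this illustration.

```python
import math


def gw_loss(xs, zs, scale=1.0, rho=1.0):
    """Mini-batch estimate of L_GW in Eq. (7) from generated pairs (x_i, z_i).

    Averages |d_X(x_i, x_j) - C * d_Z(z_i, z_j)|^rho over all ordered pairs
    with i != j; `scale` plays the role of the trainable constant C.
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                gap = math.dist(xs[i], xs[j]) - scale * math.dist(zs[i], zs[j])
                total += abs(gap) ** rho
    return total / (n * (n - 1))


def total_objective(l_gw, l_w, l_d, r_h, lam_w=1.0, lam_d=1.0, lam_h=1e-2):
    """Single objective of Eq. (13); the default multipliers are placeholders."""
    return l_gw + lam_w * l_w + lam_d * l_d - lam_h * r_h


# Toy generated batch: latent codes on a line, decoded by x = (2z, 0).
zs = [[0.0], [1.0], [2.0]]
xs = [[2.0 * z[0], 0.0] for z in zs]

loss_matched_scale = gw_loss(xs, zs, scale=2.0)  # C absorbs the factor of 2
loss_unit_scale = gw_loss(xs, zs, scale=1.0)     # scale mismatch remains
```

In this toy batch the decoder doubles every distance, so the loss vanishes once the scale constant equals 2, which illustrates why a trainable scale is needed to cancel the scale degree of freedom.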

3.2.3. PRIOR BY SAMPLING

GWAE models apply to the cases in which the prior $\pi_\theta(z)$ takes the form of an implicit distribution with a sampler. An implicit distribution $\pi_\theta(z)$ provides its sampler $z \sim \pi_\theta(z)$ while a closed-form expression of the probability density function is not available. The adversarial algorithm of GWAE handles such cases and enables a wide class of priors to provide meta-prior-based inductive biases for unsupervised representation learning, e.g., for disentanglement (Locatello et al., 2019b; 2020). Note that the GW objective in Eq. (6) becomes a constant function in non-trainable prior cases.

Neural Prior (NP). A straightforward way to build a differentiable sampler of a trainable prior is using a neural network to convert noises. The prior of the latent variables $z$ is defined via sampling using a neural network $g_\theta : \mathbb{R}^L \to \mathbb{R}^L$ with parameters $\theta$ (see Appendix B.2 for its formulation). Notably, the neural network $g_\theta$ need not be invertible unlike Normalizing Flow (Rezende & Mohamed, 2015) since the prior is defined as an implicit distribution not requiring a push-forward measure.

Factorized Neural Prior (FNP). For disentanglement, we can constitute a factorized prior using an element-wise independent neural network $\tilde{g}_\theta = \{g_\theta^{(i)}\}_{i=1}^{L}$ (see Appendix B.2 for its formulation). Such factorized priors can be easily implemented utilizing the 1-dimensional grouped convolution (Krizhevsky et al., 2012).

Gaussian Mixture Prior (GMP). For clustering structure, we construct a class of Gaussian mixture priors. Given that the prior contains $K$ components, the $k$-th component is parameterized using the weights $w_k$, means $m_k \in \mathbb{R}^L$, and square-root covariances $M_k \in \mathbb{R}^{L \times L}$ as

$$\pi_\theta(z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\left(z \mid m_k, M_k M_k^{\mathsf{T}}\right), \tag{14}$$

where the weights $\{w_k\}_{k=1}^{K}$ are normalized as $\sum_{k=1}^{K} w_k = 1$. To sample from a prior of this class, one randomly chooses a component $k$ from the $K$-way categorical distribution with probabilities $(w_1, w_2, \ldots, w_K)$ and draws a sample $z$ as follows:

$$z = m_k + M_k \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_L), \tag{15}$$

where $0$ and $I_n$ denote the zero vector and the $n$-sized identity matrix, respectively. In this class of priors, the set of trainable parameters consists of $\{(w_k, m_k, M_k)\}_{k=1}^{K}$. Note that this parameterization can be easily implemented in differentiable programming frameworks because $M_k M_k^{\mathsf{T}}$ is positive semidefinite for any $M_k \in \mathbb{R}^{L \times L}$.
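The two-stage sampling of Eq. (15) can be sketched as follows: a component index is drawn from the categorical distribution over the weights, and standard Gaussian noise is then mapped through the square-root covariance. The function name and the toy parameters are assumptions for illustration, and the sketch covers sampling only, not how the parameters are trained.

```python
import random


def sample_gmp(weights, means, sqrt_covs, rng):
    """Draw (k, z) with k ~ Categorical(weights) and z = m_k + M_k eps."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    dim = len(means[k])
    eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # eps ~ N(0, I_L)
    z = [
        means[k][i] + sum(sqrt_covs[k][i][j] * eps[j] for j in range(dim))
        for i in range(dim)
    ]
    return k, z


# Hypothetical two-component prior in a 2-dimensional latent space.
rng = random.Random(0)
weights = [0.5, 0.5]
means = [[-5.0, 0.0], [5.0, 0.0]]
sqrt_covs = [
    [[0.1, 0.0], [0.0, 0.1]],   # M_1: isotropic
    [[0.1, 0.0], [0.05, 0.1]],  # M_2: any real matrix gives a valid covariance
]
samples = [sample_gmp(weights, means, sqrt_covs, rng) for _ in range(200)]
```

Because the covariance is formed as the product of a matrix with its transpose, no constraint on the entries of the square-root covariance is needed to keep it valid.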

4. EXPERIMENTS

We investigated the wide capability of the GWAE models for learning representations based on meta-priors. We evaluated GWAEs in two principal meta-priors: disentanglement and clustering. To validate the effectiveness of GWAE on different tasks for each meta-prior, we conducted each experiment in corresponding experimental settings. We further studied their autoencoding and generation for the inspection of general capability.

4.1. EXPERIMENTAL SETTINGS

We compared the GWAE models with existing representation learning methods (see Appendix A for the details of the compared methods). For the experimental results in this section, we used four visual datasets: CelebA (Liu et al., 2015), MNIST (LeCun et al., 1998), 3D Shapes (Burgess & Kim, 2018), and Omniglot (Lake et al., 2015) (see Appendix C.1 for dataset details). For quantitative evaluations, we selected hyperparameters from $\lambda_W \in [10^{0}, 10^{1}]$, $\lambda_D \in [10^{0}, 10^{1}]$, and $\lambda_H \in [10^{-4}, 10^{0}]$ using their performance on the validation set. For fair comparisons, we trained the networks with a consistent architecture from scratch in all the methods (see Appendix C.2 for architecture details).

4.2. GROMOV-WASSERSTEIN ESTIMATION AND MINIMIZATION

We validated the estimation and minimization of the GW metric in Fig. 1. First, to validate the estimation of the GW metric, we compared the GW metric estimated in GWAE and the empirical GW value computed in the conventional method in Fig. 1a. Against the GWAE models estimating the GW metric as in Eq. (7), the empirical GW values are computed by the standard OT framework POT (Flamary et al., 2021). Although the estimated $\mathcal{L}_{GW}$ is slightly higher than the empirical values, the curves behave in a very similar manner during the entire training process. This result supports that the GWAE model successfully estimated the GW values and yielded their gradients to proceed with the distribution matching between the data and latent spaces. Second, to validate the minimization of the GW metric, we show the histogram of the differences of generated samples in the data and latent space in Fig. 1b. The isometry of generated samples is attained if the generative coupling $p_\theta(x, z)$ attains the infimum in Eq. (4). This histogram result shows that the generative model $p_\theta(x, z)$ acquired nearly-isometric latent embedding, and suggests that the GW metric was successfully minimized although the objective of Eq. (13) contains three regularization loss terms (refer to Appendix C.8 for ablation studies, and Appendix C.4 for comparisons). These two experimental results support that the GWAE models successfully estimated and optimized the GW objective.

4.3. LEARNING REPRESENTATIONS BASED ON META-PRIORS

Disentanglement. We investigated the disentanglement of representations obtained using GWAE models and compared them with conventional VAE-based disentanglement methods. Since the element-wise independence in the latent space is postulated as a meta-prior for disentangled representation learning, we used the FNP class for the prior $\pi_\theta(z)$. Considering practical applications with unknown ground-truth factors, we set a relatively large latent size $L$ to avoid the shortage of dimensionality. The qualitative and quantitative results are shown in Fig. 2 and Table 1, respectively. These results support the ability to learn a disentangled representation in complex data. The scatter plots in Fig. 2 suggest that the GWAE model successfully extracted one underlying factor of variation (object hue) precisely along one axis, whereas the standard VAE (Kingma & Welling, 2014) formed several clusters for each value, and FactorVAE (Kim & Mnih, 2018) obtained the factor in quadrants.

Table 1: Quantitative comparison of disentanglement. The reported scores were calculated in 3D Shapes (Burgess & Kim, 2018), and the latent size L = 16. Since the latent size L is larger than the number of the ground truth factors, the hyperparameter tuning was based on the validation set DCI-C (Eastwood & Williams, 2018) values. To deal with the probabilistic scores (Zaidi et al., 2021)

(Kim & Mnih, 2018)                 0.7963 ± 0.0004    0.7390 ± 0.0004    0.9961 ± 0.0002
DIP-VAE-I (Kumar et al., 2018)     0.8609 ± 0.0003    0.6984 ± 0.0003    0.9961 ± 0.0001
DIP-VAE-II (Kumar et al., 2018)    0.8236 ± 0.0001    0.7498 ± 0.0003    0.9957 ± 0.0002
GWAE (FNP)                         0.9080 ± 0.0002    0.7024 ± 0.0002    0.9966 ± 0.0002
* The ranges are denoted by (mean) ± (standard error of the mean).

Clustering Structure. We empirically evaluated the capabilities of capturing clusters using MNIST (LeCun et al., 1998). We compared the GWAE model using GMP with other VAE-based methods considering the out-of-distribution (OoD) detection performance in Fig. 3.
We used MNIST images as in-distribution (ID) samples for training and Omniglot (Lake et al., 2015) images as unseen OoD samples. Quantitative results show that the GWAE model successfully extracted the clustering structure, empirically implying the applicability of multimodal priors.

4.4. AUTOENCODING MODEL

We additionally studied the autoencoding and generation performance of GWAE models in Table 2 (see Appendix C.7 for qualitative evaluations). Although the distribution matching $p_\theta(x) \approx p_{\mathrm{data}}(x)$ is a collateral condition of Eq. (7), quantitative results show that the GWAE model also favorably compares with existing autoencoding models in terms of generative capacity. This result suggests the substantial capture of the underlying low-dimensional distribution in GWAE models, which can lead to the applications to other types of meta-priors.

[Figure caption fragment] , 2015). We trained these models using MNIST as ID samples and used Omniglot as OoD samples. We upsampled Omniglot to 10,000 samples for data balancing. For the anomaly detection using the latent codes $z$, we applied the negative log-likelihood energy $-\log \pi(z)$ for VAE and DAGMM, and used the estimated Kantorovich potential $\mathbb{E}_{p_\theta(x|z)}\left[f_\psi(x, z)\right]$ for GWAE (see Appendix C.10 for more latent space details).

[Table 2 fragment]
(Zhao et al., 2017)                147.1    19.76
WAE (Tolstikhin et al., 2018)      55 *     22.70
WVI (Ambrogioni et al., 2018)      295.0    14.45
SWAE (Kolouri et al., 2019)        102.2    21.85
OT-based models
RAE (Xu et al., 2020)              52.

5. CONCLUSION

In this work, we have introduced a novel representation learning method that performs the distance distribution matching between the given unlabeled data and the latent space. Our GWAE model family transfers distance structure from the data space into the latent space from the OT viewpoint, replacing the ELBO objective of variational inference with the GW metric. The GW objective provides a direct measure between the latent and data distributions. Qualitative and quantitative evaluations empirically show the performance of GWAE models in terms of representation learning. In future work, further applications remain open to various types of meta-priors, such as spherical representations and non-Euclidean embedding spaces.

REPRODUCIBILITY STATEMENT

We describe the implementation details in Section 4, Appendix B, and Appendix C. The dataset details are provided in Appendix C.1. To ensure reproducibility, our code is available online at https://github.com/ganmodokix/gwae and is provided as the supplementary material.

$$\mathrm{ELBO}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\left(q_\phi(z|x)\,\|\,\pi(z)\right), \tag{16}$$

$$\mathbb{E}_{p_{\mathrm{data}}(x)}\left[\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta D_{\mathrm{KL}}\left(q_\phi(z|x)\,\|\,\pi(z)\right)\right]. \tag{17}$$

The KKT multiplier $\beta$ works as the weight of the regularization to impose a factorized prior (e.g., the standard Gaussian $\mathcal{N}(0, I_L)$) on the latent variables. This re-weighting induces the capability of disentanglement in the case of $\beta > 1$; however, a large value of $\beta$ causes posterior collapse, in which the latent variables "forget" the information of the input data. From the Information Bottleneck (IB) (Tishby et al., 1999) point of view, the β-VAE objective is re-interpreted as the following optimization problem (Alemi et al., 2018; Achille & Soatto, 2018):

$$\underset{\theta,\phi}{\text{maximize}} \quad I_\phi(z; y) \tag{18}$$

$$\text{subject to} \quad I_\phi(z; x) \leq I_c, \tag{19}$$

where $I_c$ is a bottleneck capacity, $y$ is a task to be estimated, and $I_\phi(\cdot\,; \cdot)$ denotes the mutual information on the inference model. Introducing the Lagrange multiplier $\beta$, the IB problem is given as

$$\underset{\theta,\phi}{\text{maximize}} \quad I_\phi(z; y) - \beta I_\phi(z; x). \tag{20}$$

Alemi et al. (2018) have given the lower bound of this IB objective as

$$I_\phi(z; y) - \beta I_\phi(z; x) \geq \underbrace{\mathbb{E}_{p_{\mathrm{data}}(y) q_\phi(z|y)}\left[\log p_\theta(y|z)\right] - H(y)}_{\text{The lower bound of } I_\phi(z; y)} - \beta \underbrace{\mathbb{E}_{p_{\mathrm{data}}(x)}\left[D_{\mathrm{KL}}\left(q_\phi(z|x)\,\|\,\pi(z)\right)\right]}_{\text{The upper bound of } I_\phi(z; x)}, \tag{21}$$

where the task entropy $H(y)$ is independent of the parameters $\theta$ and $\phi$. The autoencoding task $y = x$ gives the objective equivalent to that of the original VAE. This IB-based formulation of the β-VAE objective implies that the larger value of the multiplier $\beta$ guides the training process to minimize the mutual information $I_\phi(z; x)$ to make the encoder forget the input data, i.e., to cause posterior collapse.
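To illustrate the role of $\beta$ in Eq. (17), the sketch below evaluates the KL term in closed form for a diagonal Gaussian encoder and a standard Gaussian prior, and combines it with a given reconstruction term. The function names and the numbers are illustrative assumptions; the closed-form KL divergence between Gaussians is standard.

```python
import math


def kl_diag_gaussian_to_standard(mean, std):
    """D_KL(N(mean, diag(std^2)) || N(0, I)) in closed form."""
    return sum(
        0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mean, std)
    )


def beta_vae_objective(log_likelihood, mean, std, beta=1.0):
    """Per-sample objective: reconstruction term minus beta times the KL term."""
    return log_likelihood - beta * kl_diag_gaussian_to_standard(mean, std)


# An encoder output equal to the prior carries no information about x.
kl_collapsed = kl_diag_gaussian_to_standard([0.0, 0.0], [1.0, 1.0])
# An informative encoder output pays a KL cost that beta re-weights.
kl_informative = kl_diag_gaussian_to_standard([1.0], [1.0])
objective = beta_vae_objective(-10.0, [1.0], [1.0], beta=4.0)
```

The KL term is zero exactly when the encoder output matches the prior, so a large $\beta$ rewards encoders that ignore the input, which is the posterior collapse described above.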
A.1.2 FACTORVAE

FactorVAE (Kim & Mnih, 2018) is a state-of-the-art disentanglement method that minimizes the Total Correlation (TC) of the aggregated posterior $q_\phi(z) = \mathbb{E}_{p_{\mathrm{data}}(x)}[q_\phi(z|x)]$ in addition to the original ELBO objective. The TC is expressed as the KL divergence between a distribution and its factorized counterpart. In the FactorVAE case, the TC of the aggregated posterior is the KL divergence from the factorized aggregated posterior $\bar{q}_\phi(z) = \prod_{i=1}^{L} q_\phi(z_i)$ to the aggregated posterior $q_\phi(z)$. The training objective of FactorVAE is the weighted sum of the ELBO and the TC term as

$$\underset{\theta,\phi}{\text{maximize}} \; \mathrm{ELBO}(x; \theta, \phi) - \gamma\, \mathrm{TC}(q_\phi(z)),$$

where $\mathrm{TC}(z)$ denotes the TC of the latent variables $z$ defined as

$$\mathrm{TC}(z) = D_{\mathrm{KL}}\left(q_\phi(z) \,\|\, \bar{q}_\phi(z)\right) \tag{23}$$

$$= \mathbb{E}_{q_\phi(z)}\left[\log \frac{\mathrm{Disc}(z)}{1 - \mathrm{Disc}(z)}\right]. \tag{24}$$

In Eq. (24), $\mathrm{Disc}(z)$ denotes a discriminator to estimate the TC term by density ratio estimation (Sugiyama et al., 2012) as

$$\mathrm{Disc}(z) = \underset{f: \mathcal{Z} \to [0,1]}{\arg\max} \; \mathbb{E}_{q_\phi(z)}\left[\log f(z)\right] + \mathbb{E}_{\bar{q}_\phi(z)}\left[\log(1 - f(z))\right].$$

In practice, the discriminator is trained with SGD in parallel with the main model, and samples from $\bar{q}_\phi(z)$ are obtained by permuting the latent codes along the batch dimension independently for each latent variable.

$$= \mathbb{E}_{p_{\mathrm{data}}(x)}\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\left(q_\phi(z) \,\|\, \pi(z)\right)$$

The main difference between the VAE and InfoVAE objectives is the use of the regularization term $D_{\mathrm{KL}}(q_\phi(z) \,\|\, \pi(z))$ instead of the original VAE regularization $D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, \pi(z))$. The original KL term becomes zero if all the data points are encoded into the standard Gaussian $\mathcal{N}(0, I_L)$, which is exactly the posterior collapse solution. The InfoVAE KL term $D_{\mathrm{KL}}(q_\phi(z) \,\|\, \pi(z))$ alleviates this problem by adopting the aggregated posterior $q_\phi(z)$ for optimization instead of the encoder $q_\phi(z|x)$. The authors of InfoVAE (Zhao et al., 2019) further provide the model family in which the KL term is replaced with other divergences.
They introduce an alternative divergence $D(q_\phi(z), \pi(z))$ and its weight $\lambda$ to conduct representation learning by the following training objective:

$$\underset{\theta,\phi}{\text{maximize}} \; \mathbb{E}_{p_{\mathrm{data}}(x)}\left[\mathrm{ELBO}(x; \theta, \phi)\right] + I_\phi(x; z) \tag{28}$$

$$= \mathbb{E}_{p_{\mathrm{data}}(x)}\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \lambda D(q_\phi(z), \pi(z)).$$

In the original InfoVAE paper (Zhao et al., 2019), the authors reported that the Maximum Mean Discrepancy (MMD) is the best choice for the divergence $D$. The MMD divergence $\mathrm{MMD}(q_\phi(z), \pi(z))$ is defined as

$$\mathrm{MMD}(q_\phi(z), \pi(z)) = \mathbb{E}_{q_\phi(z)}\mathbb{E}_{q_\phi(z')}\left[k(z, z')\right] + \mathbb{E}_{\pi(z)}\mathbb{E}_{\pi(z')}\left[k(z, z')\right] - 2\,\mathbb{E}_{q_\phi(z)}\mathbb{E}_{\pi(z')}\left[k(z, z')\right],$$

where $k(\cdot, \cdot)$ is any universal kernel, such as the radial basis function kernel

$$k(z, z') = \exp\left(-\|z - z'\|_2^2 / \sigma^2\right) \tag{31}$$

for a constant $\sigma > 0$.
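The MMD definition with the radial basis function kernel can be estimated from samples. The following is a minimal sketch of the plug-in (biased) estimator in plain Python; the helper names, the sample sizes, and the Gaussian toy samples standing in for $q_\phi(z)$ and $\pi(z)$ are assumptions made for illustration and are not taken from the original experiments.

```python
import math
import random


def rbf_kernel(z1, z2, sigma=1.0):
    """k(z, z') = exp(-||z - z'||_2^2 / sigma^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-squared_distance / sigma ** 2)


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd(q_samples, prior_samples, sigma=1.0):
    """Plug-in estimate of the MMD between two sample sets."""
    return (mean_kernel(q_samples, q_samples, sigma)
            + mean_kernel(prior_samples, prior_samples, sigma)
            - 2.0 * mean_kernel(q_samples, prior_samples, sigma))


rng = random.Random(0)


def gaussian_samples(n, dim, mean):
    """Toy stand-in for latent samples: isotropic unit-variance Gaussian."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


prior = gaussian_samples(200, 2, 0.0)    # stands in for pi(z)
matched = gaussian_samples(200, 2, 0.0)  # aggregated posterior matching the prior
shifted = gaussian_samples(200, 2, 3.0)  # aggregated posterior far from the prior

mmd_matched = mmd(matched, prior)
mmd_shifted = mmd(shifted, prior)
```

The estimate stays close to zero when the two sample sets come from the same distribution and grows when the aggregated posterior drifts away from the prior.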

A.2 VAE-BASED METHODS BASED ON HIERARCHICAL FACTORS

Several VAE-based methods postulate the existence of hierarchical factors as their meta-prior to learn representations at different levels of abstraction (Sønderby et al., 2016; Zhao et al., 2017). These methods change their network architecture to utilize the feature hierarchy often captured in the hidden layers of deep neural networks.

A.3 VAE-BASED METHODS INVOLVING PRIOR LEARNING

The standard VAE model has a pre-defined prior, which may cause a discrepancy between the underlying data structure and the postulated prior (Dai & Wipf, 2019). Several methods overcome this problem by involving the prior itself in the training process.

A.3.1 VAMPPRIOR

VampPrior (Tomczak & Welling, 2018) is a type of prior consisting of the mixture of the encoder distributions from several pseudo-inputs. The pseudo-inputs are introduced as trainable parameters, which are input into the encoder to build a mixture prior. Thus, the VAE models with VampPriors have trainable priors while retaining the main training procedure using the reparameterization trick to apply SGD.

A.4 WASSERSTEIN AUTOENCODER (WAE)

WAE (Tolstikhin et al., 2018) is a family of generative models whose autoencoder tries to estimate and minimize the primal form of the Wasserstein metric between the generative model $p_\theta(x)$ and the data distribution $p_{\mathrm{data}}(x)$ using SGD with the following objective:

$$\underset{\theta,\phi}{\text{minimize}} \; \mathbb{E}_{p_{\mathrm{data}}(x)}\mathbb{E}_{q_\phi(z|x)}\mathbb{E}_{p_\theta(x'|z)}\left[d(x, x')\right] + \lambda D(q_\phi(z), \pi(z)),$$

where $\lambda$ is a Lagrange multiplier, the generative model is defined as a latent variable model $p_\theta(x, z) = \pi(z) p_\theta(x|z)$ postulating the prior of the latent variables $\pi(z)$, and the conditional distribution $q_\phi(z|x)$ is a probabilistic encoder that is optimized in place of all couplings supported on $\mathcal{X} \times \mathcal{X}$. The WAE objective is indeed equivalent to that of InfoVAE (Zhao et al., 2019) in Eq. (27), which provides the OT-based perspective on VAE-based models. Following InfoVAE (Zhao et al., 2019), we adopt the MMD for the divergence $D$, which is denoted by "WAE-MMD" in the original WAE paper (Tolstikhin et al., 2018). Although the WAE-based approaches rewrite VAE-based objectives with the Wasserstein metric, these metrics are between $x$-marginal distributions and do not directly include the latent space $\mathcal{Z}$. To learn representations $z$, the Wasserstein-based objective is further modified (Gaujac et al., 2021).

A.5 RELATIONAL REGULARIZED AUTOENCODER (RAE)

Relational Regularized Autoencoder (RAE) (Xu et al., 2020) is a variational autoencoding generative model with a regularization loss based on the fused Gromov-Wasserstein (FGW) metric.
RAE introduces the FGW metric between the aggregated posterior and the latent prior as the regularization divergence to fortify the WAE constraint $\pi_\theta(z) = q_\phi(z)$ introduced by Tolstikhin et al. (2018) for generative modeling. The FGW regularization is introduced with a weight hyperparameter $\beta \in [0, 1]$ and given as

$$\underset{\theta,\phi}{\text{minimize}} \; \mathbb{E}_{p_{\mathrm{data}}(x)}\mathbb{E}_{q_\phi(z|x)}\mathbb{E}_{p_\theta(x'|z)}\left[d(x, x')\right] + \lambda D_{\mathrm{FGW}}(q_\phi(z), \pi_\theta(z); \beta),$$

where $D_{\mathrm{FGW}}$ denotes the FGW metric, which is an upper bound of the weighted sum of the Wasserstein and Gromov-Wasserstein metrics. The FGW metric $D_{\mathrm{FGW}}$ is given as

$$D_{\mathrm{FGW}}(q_\phi(z), \pi_\theta(z); \beta) = \inf_{\gamma \in \mathcal{P}(q_\phi(z), \pi_\theta(z))} (1 - \beta)\,\mathbb{E}_{\gamma(z, z')}\left[d_{\mathcal{Z}}(z, z')\right] + \beta\,\mathbb{E}_{\gamma(z_1, z_1')\gamma(z_2, z_2')}\left[\left|d_{\mathcal{Z}}(z_1, z_2) - d_{\mathcal{Z}}(z_1', z_2')\right|^2\right] \tag{34}$$

$$\ge (1 - \beta) \underbrace{\inf_{\gamma \in \mathcal{P}(q_\phi(z), \pi_\theta(z))} \mathbb{E}_{\gamma(z, z')}\left[d_{\mathcal{Z}}(z, z')\right]}_{\text{Wasserstein term for direct comparison}} + \beta \underbrace{\inf_{\gamma \in \mathcal{P}(q_\phi(z), \pi_\theta(z))} \mathbb{E}_{\gamma(z_1, z_1')\gamma(z_2, z_2')}\left[\left|d_{\mathcal{Z}}(z_1, z_2) - d_{\mathcal{Z}}(z_1', z_2')\right|^2\right]}_{\text{Gromov-Wasserstein term for relational comparison}},$$

where $\mathcal{P}(q_\phi(z), \pi_\theta(z))$ is the set of all couplings whose marginals are $q_\phi(z)$ and $\pi_\theta(z)$. The discrepancy between the prior $\pi_\theta(z)$ and the aggregated posterior $q_\phi(z)$ degrades generative performance, since the processes of decoding $p_{\mathrm{data}}(x) q_\phi(z|x) p_\theta(x|z)$ and generation $\pi_\theta(z) p_\theta(x|z)$ are then modeled in different regions of the latent space. This formulation enables learning a prior distribution $\pi_\theta(z)$ that flexibly reflects the structure of the data; in the original settings by Xu et al. (2020), the prior $\pi_\theta(z)$ is modeled as a Gaussian mixture model. They aim at matching the distributions on the latent space $\mathcal{Z}$, which have an identical dimensionality but may differ in terms of distance structure.
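To make the two terms of the FGW metric concrete, the sketch below evaluates the FGW cost for one fixed discrete coupling between two small point sets. It is a hedged illustration under several assumptions: the coupling is supplied by hand rather than optimized (so the value is only an upper bound on the infimum), the relational term uses the squared difference, $d_{\mathcal{Z}}$ is the Euclidean distance, and all names and toy points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance, used here as d_Z."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fgw_cost(zs, zps, coupling, beta):
    """FGW cost of a fixed coupling between points zs (from q) and zps (from the prior).

    coupling[i][j] is the probability mass transported from zs[i] to zps[j].
    """
    n, m = len(zs), len(zps)
    # Wasserstein term: direct comparison of matched points.
    wasserstein = sum(coupling[i][j] * euclidean(zs[i], zps[j])
                      for i in range(n) for j in range(m))
    # Gromov-Wasserstein term: comparison of pairwise distances within each set.
    gromov = 0.0
    for i in range(n):
        for j in range(m):
            for k in range(n):
                for l in range(m):
                    diff = euclidean(zs[i], zs[k]) - euclidean(zps[j], zps[l])
                    gromov += coupling[i][j] * coupling[k][l] * diff ** 2
    return (1.0 - beta) * wasserstein + beta * gromov


points = [[0.0, 0.0], [1.0, 0.0]]
translated = [[5.0, 0.0], [6.0, 0.0]]  # same shape, shifted by 5
stretched = [[0.0, 0.0], [2.0, 0.0]]   # pairwise distance doubled
identity_coupling = [[0.5, 0.0], [0.0, 0.5]]

cost_translated_gw_only = fgw_cost(points, translated, identity_coupling, beta=1.0)
cost_translated_w_only = fgw_cost(points, translated, identity_coupling, beta=0.0)
cost_stretched_gw_only = fgw_cost(points, stretched, identity_coupling, beta=1.0)
```

The translated set is penalized only by the Wasserstein term, whereas the stretched set is penalized by the relational term, which is the distinction between direct and relational comparison drawn above.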

A.6 IVI METHODS

Beyond the analytically tractable distributions, implicit distributions are applied to variational inference. An implicit distribution only requires its sampling method, which extends the variety of modeling and applications in variational inference and VAE-based models.

A.6.1 DENSITY RATIO ESTIMATION BY ADVERSARIAL DISCRIMINATORS

The density ratio estimation technique (Sugiyama et al., 2012) is essential to the mechanism of GANs (Goodfellow et al., 2014) and IVI methods (Huszár, 2017). It is conducted via an optimal discriminator $f^*$ between distributions $r(x)$ and $s(x)$ as

$$D_{\mathrm{KL}}(r(x) \,\|\, s(x)) = \mathbb{E}_{r(x)}\left[\log \frac{r(x)}{s(x)}\right] = \mathbb{E}_{r(x)}\left[\log \frac{f^*(x)}{1 - f^*(x)}\right] = \mathbb{E}_{r(x)}\left[\log f^*(x) - \log(1 - f^*(x))\right], \tag{36}$$

where

$$f^*(x) = \underset{f: \mathcal{X} \to (0,1)}{\arg\max} \; \mathbb{E}_{r(x)}\left[\log f(x)\right] + \mathbb{E}_{s(x)}\left[\log(1 - f(x))\right]. \tag{37}$$

The discriminator is estimated via maximizing Eq. (37) with a neural network $f \approx f^*$. The training of discriminators often suffers from instability and mode collapse owing to its alternating parameter updates based on Eq. (36) and Eq. (37) (Arjovsky & Bottou, 2017; Arjovsky et al., 2017). One approach to tackle this problem is imposing the Lipschitz continuity on the discriminator based on the Kantorovich-Rubinstein duality (Arjovsky et al., 2017).

A.6.2 ADVERSARIAL VARIATIONAL BAYES (AVB)

Adversarial Variational Bayes (AVB) (Mescheder et al., 2017) is an ELBO optimization method using the adversarial training process instead of the analytical KL term. Let us recall that the KL term in Eq. (1) is defined by the expected log density ratio as

$$D_{\mathrm{KL}}\left(q_\phi(z|x) \,\|\, \pi(z)\right) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{\pi(z)}\right].$$

Adopting the density ratio trick (Sugiyama et al., 2012), the analytical KL term can be replaced with the optimal discriminator, which takes a data point $x$ and its encoder sample $z \sim q_\phi(z|x)$ to estimate the density ratio $q_\phi(z|x)/\pi(z)$. This enables implicit distributions in the prior while retaining the ELBO objective of variational inference.
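The density ratio identity behind Eq. (36), on which AVB also relies, can be checked numerically. The sketch below is an illustration under stated assumptions: instead of training a discriminator, it uses the known maximizer of Eq. (37), $f^*(x) = r(x) / (r(x) + s(x))$, for two one-dimensional Gaussians chosen only for this example, and then estimates the KL divergence by Monte Carlo. All function names are hypothetical.

```python
import math
import random


def gaussian_logpdf(x, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def optimal_discriminator(x, log_r, log_s):
    """f*(x) = r(x) / (r(x) + s(x)), written with log densities."""
    return 1.0 / (1.0 + math.exp(log_s(x) - log_r(x)))


def kl_via_discriminator(samples_from_r, discriminator):
    """Monte Carlo estimate of E_r[log f(x) - log(1 - f(x))]."""
    total = 0.0
    for x in samples_from_r:
        f = discriminator(x)
        total += math.log(f) - math.log(1.0 - f)
    return total / len(samples_from_r)


# Toy distributions: r = N(1, 1) and s = N(0, 1), so D_KL(r || s) = 0.5.
log_r = lambda x: gaussian_logpdf(x, 1.0, 1.0)
log_s = lambda x: gaussian_logpdf(x, 0.0, 1.0)
f_star = lambda x: optimal_discriminator(x, log_r, log_s)

rng = random.Random(0)
samples = [rng.gauss(1.0, 1.0) for _ in range(20000)]
kl_estimate = kl_via_discriminator(samples, f_star)
kl_exact = 0.5
```

With a learned discriminator in place of `f_star`, the same estimator is only as accurate as the discriminator, which is one source of the instability discussed above.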

A.6.3 ADVERSARIALLY LEARNED INFERENCE (ALI) / BIDIRECTIONAL GENERATIVE ADVERSARIAL NETWORKS (BIGAN)

Adversarially Learned Inference (ALI) (Dumoulin et al., 2017) / Bidirectional Generative Adversarial Networks (BiGAN) (Donahue et al., 2017) are models introducing the distribution matching of the generative model and the inference model as implicit distributions. These models have been proposed in different papers (Dumoulin et al., 2017; Donahue et al., 2017); however, they share an equivalent methodology. One can draw samples from the generative model $\pi(z) p_\theta(x|z)$ by decoding prior samples and also from the inference model $p_{\mathrm{data}}(x) q_\phi(z|x)$ by encoding data points. Here the ALI/BiGAN models introduce a discriminator to estimate the Jensen-Shannon divergence between the generative model $p_\theta(x, z)$ and the inference model $q_\phi(x, z)$. The model matching between the encoder and the decoder also learns latent representations by the bidirectional mappings.

The decoder $p_\theta(x|z)$ is modeled with a neural network $D_\theta: \mathcal{Z} \to \mathbb{R}^M$ and its parameters $\theta$ as $p_\theta(x|z) = \delta(x - D_\theta(z))$. Following the standard VAE settings (Kingma & Welling, 2014), the encoder $q_\phi(z|x)$ is defined as a diagonal Gaussian parameterized by neural networks $\mu_\phi: \mathcal{X} \to \mathbb{R}^L$ and $\sigma^2_\phi: \mathcal{X} \to \mathbb{R}^L_+$ with parameters $\phi$ as $q_\phi(z|x) = \mathcal{N}(z \,|\, \mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x)))$. For the distance functions $d_{\mathcal{X}}$ and $d_{\mathcal{Z}}$ in Eq. (7) and Eq. (9), we used the $L_2$ distance defined as

$$d_{\mathcal{X}}(x, x') = \frac{1}{\sqrt{2}}\|x - x'\|, \qquad d_{\mathcal{Z}}(z, z') = \frac{1}{\sqrt{2}}\|z - z'\|.$$

As another choice, we also utilized the adversarially learned metric (Larsen et al., 2016) in Eq. (9). In the adversarially learned metric, the distance is measured in the feature space formed by the hidden outputs of the critic $f_\psi$. Let $h_\psi(x)$ denote the critic hidden outputs in which the critic takes $x$ as its input. We can then define a distance $d'$ based on the adversarially learned metric as

$$d'(x, x') = d_{\mathcal{X}}(x, x')^2 + \frac{1}{2}\|h_\psi(x) - h_\psi(x')\|_2^2.$$
Since the critic network f ψ (x, z) has the Y-shaped architecture (see Appendix C.2) and concatenates x-based features and z-based features in one of the hidden layers to take a pair (x, z) as the inputs, we use the x-side branch as h ψ (x).

B.2 PRIOR DETAILS

Neural Prior (NP). Formally, the NP $\pi_\theta(z)$ with a neural network $g_\theta$ is defined as

$$\pi_\theta(z) = \int \pi(\epsilon) \det \frac{\partial g_\theta(\epsilon)}{\partial \epsilon}\, d\epsilon, \quad \text{where } \pi(\epsilon) = \mathcal{N}(\epsilon \,|\, 0, I_L).$$

We can implement this class of prior by sampling noises $\epsilon$ and computing $z = g_\theta(\epsilon)$, avoiding the calculation of the integral.

Factorized Neural Prior (FNP). For disentanglement in the variational autoencoding settings, element-wise independence is often imposed on latent variables $z$. Following the standard VAE settings (Kingma & Welling, 2014), we postulate $\mathcal{Z} = \mathbb{R}^L$, where the latent variables $z \in \mathcal{Z}$ are expressed as an $L$-dimensional vector $z = [z_1, z_2, \ldots, z_L]^{\mathsf{T}}$. As with the NP, the FNP class of prior is defined as

$$\pi_\theta(z) = \prod_{i=1}^{L} \pi^{(i)}_\theta(z_i),$$

where

$$\pi^{(i)}_\theta(z_i) = \int \pi(\epsilon^{(i)}) \frac{\partial g^{(i)}_\theta(\epsilon^{(i)})}{\partial \epsilon^{(i)}}\, d\epsilon^{(i)}, \qquad \pi(\epsilon^{(i)}) = \mathcal{N}(\epsilon^{(i)} \,|\, 0, 1), \qquad (i = 1, 2, \ldots, L).$$

This prior can be implemented with $L$ disjoint neural networks, or 1-dimensional grouped convolutions. The difference between the NP and the FNP is element-wise independence, in which the prior $\pi_\theta(z)$ is factorized into distributions for each latent variable. Factorized priors enable disentanglement by obtaining a representation comprising independent factors of variation (Higgins et al., 2017a; Chen et al., 2018; Kim & Mnih, 2018).
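The difference between sampling from an NP and an FNP can be sketched in a few lines. The example below is illustrative only: the fixed weight matrix and the scalar maps are hypothetical stand-ins for the trainable networks $g_\theta$ and $g^{(i)}_\theta$, and the sample correlation is used merely as a rough indicator of element-wise (in)dependence.

```python
import math
import random


def neural_prior_sample(rng, weights):
    """NP sketch: z = g(eps), where g mixes all noise coordinates."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(len(weights[0]))]
    return [math.tanh(sum(w * e for w, e in zip(row, eps))) for row in weights]


def factorized_prior_sample(rng, scalar_maps):
    """FNP sketch: z_i = g_i(eps_i), one disjoint scalar map per latent variable."""
    return [g(rng.gauss(0.0, 1.0)) for g in scalar_maps]


def correlation(xs, ys):
    """Sample Pearson correlation between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
mixing_weights = [[1.0, 0.0], [1.0, 0.1]]  # hypothetical dense layer
scalar_maps = [lambda e: math.tanh(2.0 * e), lambda e: math.exp(0.3 * e)]

np_samples = [neural_prior_sample(rng, mixing_weights) for _ in range(5000)]
fnp_samples = [factorized_prior_sample(rng, scalar_maps) for _ in range(5000)]

corr_np = correlation([z[0] for z in np_samples], [z[1] for z in np_samples])
corr_fnp = correlation([z[0] for z in fnp_samples], [z[1] for z in fnp_samples])
```

Because each FNP coordinate depends on its own noise variable only, the coordinates are independent by construction, whereas the NP may produce arbitrarily dependent coordinates.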

B.3 GRADIENT PENALTY

In the case of gradient penalty (Gulrajani et al., 2017), the maximization in Eq. (11) is further modified as

$$\underset{\psi}{\text{maximize}} \; L_D + \lambda_{\mathrm{GP}}\, \mathbb{E}_{q_\phi(x,z)}\mathbb{E}_{p_\theta(x',z')}\mathbb{E}_{\epsilon \sim U(0,1)}\left[\left(\left\|\nabla_{(\hat{x},\hat{z})} f_\psi(\hat{x}, \hat{z})\right\|_2 - 1\right)^2\right],$$

where $\lambda_{\mathrm{GP}} > 0$ is a constant, and $\hat{x} = \epsilon x + (1 - \epsilon) x'$ and $\hat{z} = \epsilon z + (1 - \epsilon) z'$ are samples interpolated with the random uniform noise $\epsilon$. We adopt $\lambda_{\mathrm{GP}} = 10$ in all the experiments reported in this paper. Introducing the gradient penalty together with other techniques such as spectral normalization (Miyato et al., 2018) is effective and essential for adversarial learning in general (Chu et al., 2020; Miyato et al., 2018).
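The penalty term itself, i.e., the expected squared deviation of the critic's gradient norm from one at interpolated points, can be sketched without an automatic differentiation library. The example below is a hedged illustration: it replaces backpropagation with central finite differences, uses toy critics on concatenated $(x, z)$ vectors, and all names and sample distributions are hypothetical.

```python
import math
import random


def numerical_gradient(f, v, h=1e-5):
    """Central finite-difference gradient of a scalar function f at point v."""
    grad = []
    for i in range(len(v)):
        up = v[:i] + [v[i] + h] + v[i + 1:]
        down = v[:i] + [v[i] - h] + v[i + 1:]
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def gradient_penalty(critic, inference_pairs, generative_pairs, rng):
    """Average of (||grad f(x_hat, z_hat)||_2 - 1)^2 over interpolated pairs."""
    total = 0.0
    for (x, z), (xp, zp) in zip(inference_pairs, generative_pairs):
        eps = rng.random()  # eps ~ U(0, 1)
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(x, xp)]
        z_hat = [eps * a + (1.0 - eps) * b for a, b in zip(z, zp)]
        grad = numerical_gradient(critic, x_hat + z_hat)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return total / len(inference_pairs)


def euclidean_norm_critic(v):
    """Toy 1-Lipschitz critic: gradient norm is exactly one away from the origin."""
    return math.sqrt(sum(c * c for c in v))


def steep_critic(v):
    """Toy critic with gradient norm two, violating the 1-Lipschitz constraint."""
    return 2.0 * euclidean_norm_critic(v)


rng = random.Random(0)


def toy_pairs(n):
    """Hypothetical (x, z) pairs with 3-dimensional x and 2-dimensional z."""
    return [([rng.gauss(3.0, 0.5) for _ in range(3)],
             [rng.gauss(3.0, 0.5) for _ in range(2)]) for _ in range(n)]


pairs_q, pairs_p = toy_pairs(32), toy_pairs(32)
penalty_lipschitz = gradient_penalty(euclidean_norm_critic, pairs_q, pairs_p, rng)
penalty_steep = gradient_penalty(steep_critic, pairs_q, pairs_p, rng)
```

In an actual implementation the gradient would come from automatic differentiation so that the penalty can itself be differentiated with respect to the critic parameters $\psi$.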

C EXPERIMENTAL DETAILS

For the reported experimental results, we used a single NVIDIA GeForce RTX 2080 Ti GPU. A single run of the entire GWAE training process until convergence takes about eight hours.

C.1 DATASET DETAILS

For the reported experiments in Section 4, we used the following datasets:

MNIST (LeCun et al., 1998). The MNIST dataset contains 70,000 handwritten digit images of 10 classes, comprising 60,000 training images and 10,000 test images. We used the original test set and randomly split the original training set into 54,000 training images and 6,000 validation images. We used the class information as its approximate factors of variation in the form of 10-dimensional dummy variables. This dataset is available online in its original format or via the torchvision package in the PyTorch (Paszke et al., 2019) tensor format. The MNIST dataset is licensed under the terms of the Creative Commons Attribution-Share Alike 3.0 license.

CelebA (Liu et al., 2015). The CelebA dataset contains 202,599 aligned face images with 40 binary attributes. We cropped 144 × 144 pixels in the center of the 178 × 218-sized aligned images in the original dataset to omit excessive backgrounds. We used the train/validation/test partitions that the original authors provided. We used the binary attributes as its approximate factors of variation in the form of 40-dimensional vectors. As stated on the website of this dataset, the CelebA dataset is available for non-commercial research purposes only.

3D Shapes (Burgess & Kim, 2018). The 3D Shapes dataset contains 480,000 synthetic images with six ground truth factors of variation. The images in this dataset contain a single-colored 3D object, a single-colored wall of a rectangular room, and a single-colored floor. These images are procedurally generated from the independent factors of variation: floor colour, wall colour, object colour, scale, shape, and orientation (Burgess & Kim, 2018). We randomly split the entire dataset into 384,000/48,000/48,000 images for the train/validation/test sets, respectively. Since the factor shape is a categorical variable with four classes, we converted it into four dummy variables to obtain quantitative factors of variation in the form of 9-dimensional vectors. The repository of this dataset is licensed under Apache License 2.0.

Omniglot (Lake et al., 2015). The Omniglot dataset contains 1,623 hand-written characters from 50 different alphabets, each drawn by 20 different people. The images are 105 × 105-sized and binary-valued. We used this dataset as OoD samples over MNIST in the evaluations on the OoD detection utilizing cluster structure. The repository of this dataset is licensed under the MIT License.

CIFAR-10 (Krizhevsky & Hinton, 2009). The CIFAR-10 dataset contains 60,000 images with 10 classes, comprising 50,000 training images and 10,000 test images. The images are 32 × 32 color images in 10 natural image classes, such as airplane and cat. This dataset is provided online without any specific license.

In all the datasets above, we used all the images in the raster (bitmap) representation and resized them to 64 × 64 pixels with three channels, where each image is a 3 × 64 × 64-sized tensor value and M = 12,288. For gray-scale (one-channeled) images such as those in MNIST, we repeated these images along the channel dimension three times to unify the sizes to 3 × 64 × 64 elements.

C.2 ARCHITECTURE DETAILS

The neural networks in GWAE and the compared methods are built with convolutions and deconvolutions (transposed convolutions) in the same settings, as shown in Tables 3 and 4. In all the experiments on GWAE, we applied the gradient penalty and the spectral normalization in the critic networks to impose the 1-Lipschitz continuity on the critic $f_\psi$, as shown in Table 7. In the neural samplers of GWAE models, we used the fully-connected architecture in Table 5 for NP and the grouped-convolutional architecture in Table 6 for FNP. We used fully connected layers for unconstrained priors in NP, and 1-dimensional grouped convolution layers (converting sequences with length 1 and $L$ channels) for factorized priors in FNP. For the optimizers of GWAE, we used RMSProp with a learning rate of $10^{-4}$ for the main autoencoder network and used RMSProp with a learning rate of $5 \times 10^{-5}$ for the critic network. For all the compared methods except for GWAE, we used the Adam (Kingma & Ba, 2015) optimizer with a learning rate of $10^{-4}$. In the experiments, we used an equal batch size of 64 for all evaluated models. The batch size is relatively small, since the computational cost of GWAE for each batch is quadratic in the batch size $B$, and the GW estimation runs in time $O(NB)$ for each epoch using $\lceil N/B \rceil$ batches.

In the case that a batch normalization layer is introduced in the encoder outputs $q_\phi(z|x) = \mathcal{N}(\tilde{\mu}(x), \mathrm{diag}(\tilde{\sigma}^2(x)))$, the mean and variance are computed w.r.t. the aggregated posterior $q_\phi(z)$ rather than the element-wise sample mean and variance of $L$-dimensional output values. The normalized parameters $(\tilde{\mu}(x), \tilde{\sigma}(x))$ against the original parameters $(\mu(x), \sigma(x))$ are given as

$$\tilde{\mu}(x) = \frac{\mu(x) - \mathbb{E}_{q_\phi(z)}[z]}{\sqrt{\mathbb{V}_{q_\phi(z)}[z]}}, \qquad \tilde{\sigma}^2(x) = \frac{\sigma^2(x)}{\mathbb{V}_{q_\phi(z)}[z]},$$

where the division is conducted element-wise, and $\mathbb{V}$ denotes the variance. The mean $\mathbb{E}_{q_\phi(z)}[z]$ and variance $\mathbb{V}_{q_\phi(z)}[z]$ are approximated using unbiased estimators consisting of mini-batch samples.
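Given a mini-batch index set $\mathcal{B} \subseteq \{1, 2, \ldots, N\}$, the unbiased estimations are expressed using the law of total variance as

$$\mathbb{E}_{q_\phi(z)}[z] \approx \frac{1}{\#\mathcal{B}} \sum_{i \in \mathcal{B}} \mu(x_i) =: \bar{\mu}, \qquad \mathbb{V}_{q_\phi(z)}[z] \approx \frac{1}{\#\mathcal{B}} \sum_{i \in \mathcal{B}} \sigma^2(x_i) + \frac{1}{\#\mathcal{B} - 1} \sum_{i \in \mathcal{B}} \left(\mu(x_i) - \bar{\mu}\right)^2. \tag{53}$$

For quantitative evaluations, we used the DCI scores (Eastwood & Williams, 2018) for disentanglement, the FID score (Heusel et al., 2017) for image generation, and the PSNR score for image reconstruction.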
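Returning to the batch-normalized encoder outputs, the mini-batch estimators in Eq. (53) can be sketched as follows. This is a minimal plain-Python illustration; the function name and the toy encoder outputs are hypothetical, and the statistics are computed element-wise over the latent dimensions.

```python
def aggregated_posterior_stats(mus, sigma2s):
    """Mini-batch estimates of the mean and variance of the aggregated posterior.

    mus[i][j] and sigma2s[i][j] are the encoder mean and variance of latent
    variable j for the i-th data point in the mini-batch.
    """
    batch_size = len(mus)
    latent_dim = len(mus[0])
    mean = [sum(mu[j] for mu in mus) / batch_size for j in range(latent_dim)]
    variance = []
    for j in range(latent_dim):
        # Law of total variance: mean of variances plus variance of means.
        within = sum(s[j] for s in sigma2s) / batch_size
        between = sum((mu[j] - mean[j]) ** 2 for mu in mus) / (batch_size - 1)
        variance.append(within + between)
    return mean, variance


# Toy encoder outputs for a mini-batch of three data points and L = 2.
batch_mus = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
batch_sigma2s = [[1.0, 0.5], [1.0, 0.5], [1.0, 0.5]]

batch_mean, batch_variance = aggregated_posterior_stats(batch_mus, batch_sigma2s)
```

In the second latent dimension all encoder means coincide, so the estimated variance reduces to the average encoder variance.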

C.3.1 DCI SCORES

The DCI scores (Eastwood & Williams, 2018) measure a representation in terms of disentangled representation learning. In the DCI scores, disentanglement is measured from three aspects: (i) each representation variable represents a single factor of variation, (ii) each factor of variation is expressed by a single representation variable, and (iii) a representation is informative w.r.t. the original data. The correspondence of variables and factors is computed via estimating the ground truth factors from the representation using random forest (Breiman, 2001). DCI Disentanglement (DCI-D) measures (i) the factor singleness for each variable. DCI Completeness (DCI-C) measures (ii) the variable singleness for each factor. DCI Informativeness (DCI-I) measures (iii) whether the representation is informative for estimating the ground truth factors. These metrics are computed via the variable importances (e.g., the Gini impurity (Breiman, 2001)) of the random forest (Breiman, 2001), in which the random forest regressor estimates the ground truth factors using the representation variables. Using $L$-dimensional representation variables $z$, $V$-dimensional factors $y$ and their importance $R_{ij}$ of the $i$-th variable $z_i$,

Table 7: Model architecture for the critics in the GWAE models. We concatenated the outputs of the x-side and z-side branches and multiplied the concatenated outputs by 0.5 to input into the stem network for the sake of the gradient norm, resulting in a Y-shaped network. We applied spectral normalization (Miyato et al., 2018) to all the layers in the critic networks and used the LeakyReLU (Maas et al., 2013) with negative slope 0.2.

Table 9: The ablation study on generation and reconstruction in CelebA (Liu et al., 2015). The same settings as collateral condition $p_\theta(x, z) \approx q_\phi(x, z)$ in Eq. (11) and the generative modeling $p_\theta(x) \approx p_{\mathrm{data}}(x)$ as its necessary condition.

C.8 ABLATION STUDY

We conducted an ablation study of the losses and regularizations introduced in Eq. (13). Table 9 shows the results of ablating the three sub-constraints $L_W$, $L_D$, and $R_H$. Every ablation degraded the performance of GWAE, especially that of $L_W$. These results suggest the necessity of each regularization term and reveal their roles in representation learning.

Ablation of $L_W$. The ablation of the term $L_W$ led to low-quality reconstruction, which suggests that $L_W$ works as the autoencoding constraint, as expected from the reconstruction loss it contains. It also reduced generation capability as well as reconstruction, suggesting that the generative modeling via autoencoding is inherited from the variational autoencoding architecture of VAEs (Kingma & Welling, 2014).

Ablation of $L_D$. Without the term $L_D$, the GWAE models suffer from a lack of distribution matching in data generation, although they still reconstruct data successfully. These phenomena could be caused by the discrepancy between the encoded latent distribution $q_\phi(z)$ and the prior $\pi_\theta(z)$. Similar results are also obtained in the ablation of the merged sufficient condition (see Eq. (11)) for the regularization $L_D$, where $L_D$ is defined as the MMD loss between the prior $\pi_\theta(z)$ and the encoded latent $q_\phi(z)$, as in the WAE-MMD model (Tolstikhin et al., 2018). This choice of $L_D$ on the low-dimensional space $\mathcal{Z}$ appears to be a replacement for the Kantorovich potential adversarially learned in the high-dimensional joint space $\mathcal{X} \times \mathcal{Z}$; however, lacking the merged sufficient condition seems to have caused a crucial reduction of generation performance, as in the gross ablation of $L_D$. These results imply that the term $L_D$ with adversarial learning regularizes the generative model $p_\theta(x, z)$ to match the inference model $q_\phi(x, z)$.

Ablation of $R_H$. Removing $R_H$ only slightly increased the reconstruction error but deteriorated the generation quality. To confirm this behavior, we also show the samples generated by the GWAE model without the regularization $R_H$ in Fig. 10 and its reconstructions in Fig. 11. These qualitative results show that the decoder without $R_H$ successfully reconstructs the images from the inference $q_\phi(z)$ but generates corrupted images from the prior $\pi_\theta(z)$. This suggests the "hole" problem (Rezende & Viola, 2018b) in the degenerate solution, where each data point is mapped to a single latent point, so that the encoded points cover only a zero-measure area and the latent space is almost everywhere not covered by the inference $q_\phi(z)$. Thus, the entropy regularization $R_H$ seems to have worked for retaining the probabilistic mappings in the encoder $q_\phi(z|x)$ to avoid this phenomenon.

In addition, for ablating $\rho = 1$, we also experimented with the $\rho = \xi$ setting, which appears intuitively natural but causes an unstable training process due to the outlier samples in $L_{GW}$. The GWAE model with $\rho = \xi$ suffered from performance degradation in both generation and reconstruction, suggesting that our setting $\rho = 1 \le \xi$ affects the learning process of the entire model.

C.9 THE META-PRIOR EFFECT ON GW MINIMIZATION AND ESTIMATION

For a further inspection of Section 4.2, we also studied the GW minimization and estimation using FNP in Fig. 12. Compared with the NP case in Fig. 1, GWAE with FNP presents less stable and more biased estimation and minimization. The learning curve of $L_{GW}$ in Fig. 12a is largely biased in the first 40 epochs and then seems to converge at approximately 3.2, a higher value than that of Fig. 1a (lower than 2). The isometry histogram also suggests the degradation of GW minimization in FNP. In Fig. 12b, more samples fell in off-diagonal areas, showing that the isometry is less tight than that of Fig. 1b. These results are presumably due to the mismatch of the disentanglement meta-prior in MNIST (LeCun et al., 1998)



In the tables of the quantitative evaluations, ↑ and ↓ indicate scores in which higher and lower values are better, respectively.

https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf



The estimation of the GW metric in each epoch.

Figure 1: The estimation and minimization of the GW metric. This training trial is conducted in GWAE (NP, $\lambda_D = 1$, $\lambda_W = 1$, $\lambda_H = 1$) using the MNIST (LeCun et al., 1998) dataset. (a) The curves show the GW values estimated by the loss term $L_{GW}$ (solid, blue) and the empirical GW computed by the POT package (Flamary et al., 2021) (dashed, orange). The values are computed using the validation set. (b) The axes $\Delta x = d_{\mathcal{X}}(x, x')$ (vertical) and $\Delta z = d_{\mathcal{Z}}(z, z')$ (horizontal) respectively denote the difference in the data and latent spaces between generated samples $(x, z), (x', z') \sim p_\theta(x, z)$. The histogram contains 10,000 generated sample pairs.

GWAE (FNP, λW =1, λD=10, λH=0.3).

Figure 2: Comparison of the learned latent spaces in 3D Shapes (Burgess & Kim, 2018) and L = 16. The vertical and horizontal axes in the scatter plots respectively represent two of the 16 (= L) latent variables with the highest and the second-highest informativeness (Do & Tran, 2020) w.r.t. the object hue factor. Note that a single factor value varies along only one axis in a disentangled representation.

Figure 3: The ROC curves of the OoD detection in MNIST (LeCun et al., 1998) against Omniglot (Lake et al., 2015). We trained these models using MNIST as ID samples and used Omniglot as OoD samples. We upsampled Omniglot to 10,000 samples for data balancing. For the anomaly detection using the latent codes $z$, we applied the negative log-likelihood energy $-\log \pi(z)$ for VAE and DAGMM, and used the estimated Kantorovich potential $\mathbb{E}_{p_\theta(x|z)}[f_\psi(x, z)]$ for GWAE (see Appendix C.10 for more latent space details).

which is mentioned as the expected objective of the original VAE (Kingma & Welling, 2014) in Eq. (1).

A.1.1 β-VAE

β-VAE (Higgins et al., 2017a) is a VAE-based model for learning disentangled representations by re-weighting the KL term of the ELBO. Given a KKT multiplier $\beta > 0$, the β-VAE objective is expressed as maximize $_{\theta,\phi}$

2-Stage VAE (Dai & Wipf, 2019) is a generative model with two probabilistic autoencoders. The process of 2-Stage VAE consists of two steps: (i) training a standard VAE using the given dataset as the input, and (ii) training another VAE using the latent variables of the previous VAE as the input. The 2-Stage VAE model attempts to overcome the discrepancy between the pre-defined prior and the learned latent representation by introducing the second VAE in stage (ii), which yields the prior training using the VAE in stage (i).

VAE-GAN

VAE-GAN (Larsen et al., 2016) is a hybrid model based on VAE and GANs. The VAE-GAN models introduce a discriminator for the generative modeling w.r.t. the data $x$ and utilize the hidden layers of the discriminator to model the decoder likelihood $p_\theta(x|z)$ along the manifolds supporting the data. It brings strong data generation performance to the VAE framework by measuring the similarity of data with a GAN-like network architecture.

B DETAILS OF PROPOSED METHOD

B.1 MODELING DETAILS

VAE (Kingma & Welling, 2014).

FactorVAE (Kim & Mnih, 2018).

WAE (Tolstikhin et al., 2018).

Figure 4: Histograms of the differences in MNIST (LeCun et al., 1998). Each histogram consists of 10,000 samples of $(\Delta x, \Delta z)$, where $\Delta x$ (vertical) and $\Delta z$ (horizontal) respectively denote the differences $\Delta x = d_{\mathcal{X}}(x, x')$ and $\Delta z = d_{\mathcal{Z}}(z, z')$ of two generative samples $(x, z), (x', z') \sim p_\theta(x, z)$. In all reported results including FactorVAE (Kim & Mnih, 2018) ($\gamma = 3$), WAE (Tolstikhin et al., 2018), and GWAE (NP, $\lambda_D = 1$, $\lambda_W = 1$, $\lambda_H = 1$), the latent dimension $L$ was set to $L = 16$, and their priors were set to the standard Gaussian.

because one of the major generative factors of MNIST images is the kind of digits, a categorical variable typically learned as one-hot variables in contrast to the factorization imposed by FNP.

C.10 PRIORS IN CLUSTERING STRUCTURE

For a more detailed investigation of the capture of clustering structure studied in Fig. 3, we further study the latent spaces of VAE (Kingma & Welling, 2014), DAGMM (Zong et al., 2018), and GWAE with GMP. The t-SNE visualizations (van der Maaten & Hinton, 2008) of the latent spaces are shown in Fig. 13, which suggests that the GWAE model with GMP clearly captured the clustering structure in its latent space. The prior of VAE (Kingma & Welling, 2014) is defined as the standard Gaussian $\mathcal{N}(0, I_L)$, which does not consist of multiple clusters. The learned prior of DAGMM contains multiple clusters; however, adjacent clusters overlap to some extent. In the learned prior of GWAE, we can observe clear clusters that are densely concentrated and well separated from each other. These results support the quantitative OoD results in Fig. 3, in which the GWAE model outperforms the other two models with and without explicit clustering modeling, respectively.

β-VAE (Higgins et al., 2017a) (β=0.1). (c) AVB (Mescheder et al., 2017) (β=1).

(d) WAE (λ=100). (e) GWAE (NP, λD=1, λW=10, λH=0.0001).

Figure 6: Reconstructed images in CelebA (Liu et al., 2015). The images denote original data samples (top rows), reconstructed images (middle rows), and zoomed reconstructions (bottom rows). Each column corresponds to one data instance in the test set.

Figure 7: Reconstructed images in MNIST (LeCun et al., 1998). The images denote original data samples (top rows) and reconstructed images (bottom rows). Each column corresponds to one data instance in the test set.

(a) ALI (Dumoulin et al., 2017). (b) VAE-GAN (Larsen et al., 2016) (γ=1). (c) 2-Stage VAE (Dai & Wipf, 2019). (d) GWAE (NP, λD=1, λW =10, λH=0.0001).

Figure 9: Generated images in CelebA (Liu et al., 2015). We show 100 images sampled from the generative model $p_\theta(x)$ without conducting cherry-picking.

Figure 10: Generated images in CelebA (Liu et al., 2015) using the GWAE model without the regularization term R H .

Figure 11: Reconstructed images in CelebA (Liu et al., 2015) using the GWAE model without the regularization term R H . Each column corresponds to one test data instance. The rows denote original (top) and reconstructed (bottom) images.

DAGMM (Zong et al., 2018).

Figure 13: The t-SNE visualizations (van der Maaten & Hinton, 2008) for latent space samples $z \sim \pi_\theta(z)$ for the OoD detection in Fig. 3. The left plot presents the sampled points of the t-SNE embeddings, and the right one presents the kernel density estimation (KDE) of these embeddings. The sample size is equally 1,024 in each reported model.

, we reported the ranges for five measurements. The details of the scores are provided in Appendix C.3.

Quantitative comparisons of generation and reconstruction. The FID scores (Heusel et al., 2017) evaluate a random sample set from the generative model p_θ(x) (without using dataset images) against the entire test set, each consisting of 19,962 samples. The PSNR scores measure the reconstruction q_ϕ(z)p_θ(x|z) using test images (see Appendix C.3 for details). All reported values were computed on CelebA (Liu et al., 2015) with a latent size of L = 64. For all the methods, we applied early stopping (patience=10) and hyperparameter tuning using the validation set. The bold and underlined values respectively denote the best and the second-best performance in each score.

The values are cited from the original papers annotated after the model names.

Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In Proceedings of the International Conference on Learning Representations (ICLR), pp. 1-19, 2018. URL https://openreview.net/forum?id=BJJLHbb0-.

A DETAILS OF RELATED WORK

For self-containment, we describe VAE-based representation learning methods. As with Section 3, x and z denote data and latent variables, respectively, and the data x are M-dimensional and the latent variables z are L-dimensional. Unless otherwise noted, each VAE-based model consists of a generative model p_θ(x, z) with parameters θ, an inference model q_ϕ(x, z) with parameters ϕ, and a pre-defined (non-trainable) prior π(z) as in the standard VAE model architecture.

A.1 VAE-BASED MODELS WITH ELBO EXTENSION

Utilizing the latent variables of VAE-based models is a prominent approach to representation learning. Several models with extended ELBO-based objectives aim to overcome the shortcomings of the original VAE model, such as posterior collapse. VAE-based models are mainly grounded on the ELBO objective, where we denote the ELBO for the data point x as
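For reference, the standard form of the ELBO (Kingma & Welling, 2014) can be written in this notation as sketched below. The sketch assumes the usual factorization p_θ(x, z) = p_θ(x|z)π(z) and a conditional encoder q_ϕ(z|x); the symbol D_KL for the Kullback-Leibler divergence and the argument list of ELBO(·) are assumptions about notation.

```latex
\mathrm{ELBO}(x; \theta, \phi)
  = \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right]
  - D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, \pi(z) \right)
```

The first term rewards reconstruction of x from its latent code, and the second term penalizes the deviation of the approximate posterior from the prior.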

Whereas the objective is still the ELBO, the LadderVAE model structure has hierarchical latent variables. The generative process is modeled as the Markov chain of several latent variable groups, and the inference model consists of deterministic feature encoders and the decoders shared with generative models. In the original paper (Sønderby et al., 2016), the authors claim that the LadderVAE models provide tighter log-likelihood lower bounds than the standard VAE. Instead of the hierarchical models based on Markov chains, the VLadderAE models introduce the hierarchical structure in the network architecture parameterizing the generative and the inference model. Since it constrains feature hierarchy by the process of feature extraction, VLadderAE also performs disentanglement, e.g., the latent variables from different hidden convolutional layers capture textural or global features of visual data.

Model architecture for the encoders in the GWAE models and the compared models. For the 64 × 64 RGB images used in the experiments, the input size is set to (Channels, Height, Width) = (3, 64, 64). FC and Conv denote fully-connected (linear) layers and convolutional layers, respectively.

Model architecture for the decoders in the GWAE models and the compared models. The image shape is set to the same as Table 3. FC and DeConv denote fully-connected layers and deconvolutional layers, respectively.

Model architecture for the samplers in the GWAE models with NP. FC denotes a fully-connected layer.

Model architecture for the samplers in the GWAE models with FNP. GroupConv denotes 1-dimensional grouped convolutional layers.

activation for the critic to retain the 1-Lipschitz continuity. FC and Conv denote fully-connected layers and convolutional layers, respectively.

Table 2 are applied in these experiments.

ACKNOWLEDGMENTS

This work was partly supported by AMED Grant Number JP21zf0127004 and JSPS KAKENHI Grant Number JP21H03456.


for the k-th factor y_k, the DCI-D and DCI-C scores for each variable and each factor are defined as where q_{ik} \log_V q_{ik}, (k = 1, 2, \ldots, V) where

The DCI-D score for the entire variable set is given by the weighted sum \sum_{i=1}^{L} \rho_i DCI-D_i, where the weight \rho_i is the weighted importance. The DCI-C score for the entire factor set is given by the average score (1/V) \sum_{k=1}^{V} DCI-C_k. The DCI-D and DCI-C metrics take values within the range [0, 1], where higher values indicate better performance.

For DCI-I, we used the normalized definition by Zaidi et al. (2021) because the normalized DCI-I values are within the range [0, 1] and higher values mean better informativeness, whereas the original score DCI-I_Original is the mean squared estimation error. The DCI-I definition that we used is expressed as

DCI-I = 1 - 6 × DCI-I_Original. (58)

Following the original paper (Eastwood & Williams, 2018), we set the number of random trees to 10 and selected the tree depth with cross-validation.
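The aggregation described above can be illustrated with a minimal sketch that follows the entropy-based definitions of Eastwood & Williams (2018). It assumes a non-negative L × V importance matrix whose entry (i, k) is the importance of latent variable z_i for predicting factor y_k. The function name `dci_scores`, the logarithm bases (V for DCI-D, L for DCI-C), and the convention of returning 0 for an all-zero row or column are assumptions of this sketch, not details confirmed by the text.

```python
import math


def dci_scores(importance):
    """Return (DCI-D, DCI-C) for an L x V matrix of non-negative importances.

    importance[i][k] is the (assumed) importance of latent variable i
    for predicting generative factor k.
    """
    num_vars = len(importance)        # L
    num_factors = len(importance[0])  # V
    grand_total = sum(sum(row) for row in importance)
    if grand_total <= 0:
        raise ValueError("importance matrix must contain a positive entry")

    def one_minus_entropy(weights, base):
        # Normalize the weights to a distribution and return 1 - entropy,
        # with the logarithm taken in the given base so the result is in [0, 1].
        total = sum(weights)
        if total == 0 or base < 2:
            return 0.0  # convention chosen for this sketch
        entropy = 0.0
        for w in weights:
            p = w / total
            if p > 0:
                entropy -= p * math.log(p, base)
        return 1.0 - entropy

    # Per-variable disentanglement: entropy over the V factors.
    d_per_var = [one_minus_entropy(row, num_factors) for row in importance]
    # Per-factor completeness: entropy over the L variables.
    c_per_factor = [
        one_minus_entropy([importance[i][k] for i in range(num_vars)], num_vars)
        for k in range(num_factors)
    ]

    # Weight each variable by its share of the total importance.
    rho = [sum(row) / grand_total for row in importance]
    dci_d = sum(r * d for r, d in zip(rho, d_per_var))
    dci_c = sum(c_per_factor) / num_factors
    return dci_d, dci_c
```

A one-to-one importance matrix such as `[[1, 0], [0, 1]]` yields scores near 1 for both metrics, while a uniform matrix such as `[[1, 1], [1, 1]]` yields scores near 0.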

C.3.2 FRÉCHET INCEPTION DISTANCE (FID)

Fréchet Inception Distance (FID) (Heusel et al., 2017) is a score for evaluating the quality of the images generated by generative models. The FID score is defined as the squared 2-Wasserstein metric between the features of the real images, with mean and covariance (\mu_r, \Sigma_r), and those of the generated images, with mean and covariance (\mu_g, \Sigma_g). Assuming that the features are normally distributed in the feature space, the FID score is expressed as

FID = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right). (60)

Since the Wasserstein metric measures the discrepancy between distributions, lower values indicate better generation performance in the FID score. Following the original FID paper (Heusel et al., 2017), we used the features obtained from the final pooling layer outputs of the Inception-v3 pre-trained on the ImageNet dataset (Deng et al., 2009).
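As a small illustration of the Gaussian closed form, the sketch below evaluates the squared 2-Wasserstein distance in the special case where both covariance matrices are diagonal. In that case the trace term reduces to a sum of squared differences of standard deviations. This is a simplification assumed here for readability, not the full-covariance computation used for FID in practice, and the function name `fid_diagonal` is hypothetical.

```python
import math


def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    mu_* are mean vectors and var_* are per-dimension variances.
    Assumes diagonal covariances, which is a simplification of real FID.
    """
    if not (len(mu_r) == len(var_r) == len(mu_g) == len(var_g)):
        raise ValueError("all inputs must have the same length")
    # Squared Euclidean distance between the means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # For diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) equals
    # the sum over dimensions of (sigma_r - sigma_g)^2.
    cov_term = sum(
        (math.sqrt(s) - math.sqrt(t)) ** 2 for s, t in zip(var_r, var_g)
    )
    return mean_term + cov_term
```

Identical statistics give a distance of zero, and the value grows as either the means or the spreads of the two feature distributions move apart.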

C.3.3 PEAK SIGNAL-TO-NOISE RATIO (PSNR)

To measure the quality of image reconstruction, we used the Peak Signal-to-Noise Ratio (PSNR). The PSNR value is defined as

PSNR = 20 \log_{10}(MAX) - 10 \log_{10}(MSE), (61)

where MAX denotes the maximum pixel value, and MSE denotes the mean square error. In all the experiments conducted in Section 4, the value of MAX is set to MAX = 1 because the input images are scaled to the range [0, 1].
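The PSNR definition can be sketched in a few lines. The example below assumes images given as flat sequences of pixel values in [0, 1]; the function name `psnr` and the choice to return infinity for a perfect reconstruction are assumptions of this sketch.

```python
import math


def psnr(original, reconstruction, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB for two equal-length pixel sequences."""
    if len(original) != len(reconstruction) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean square error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return math.inf  # convention chosen here for a perfect reconstruction
    return 20 * math.log10(max_value) - 10 * math.log10(mse)
```

With MAX = 1 the first term vanishes, so the score reduces to -10 log10(MSE); for instance, an MSE of 0.01 corresponds to 20 dB.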

C.4 ISOMETRY COMPARISON

Regarding the evaluations in Section 4.2, we further conducted comparisons on isometry in Fig. 4. The results show that the GWAE models provide more isometric autoencoders than other VAE-based representation learning methods. The existing VAE-based methods did not attain isometry to the extent that GWAE did, which supports that the GW metric works as a different objective class from the ELBO. This implies that the GW metric loss substantially affects the training procedure of learning representations.

Published as a conference paper at ICLR 2023

We present the training process of GWAE in Fig. 5. Although the objective seems complex because it is composed of four different losses, the training process converged, and the values of the terms L_GW, L_W, and L_D jointly descended for most of the training. Although the term -R_H, which prevents degenerate solutions, increased, its values did not diverge. These results imply that the three losses L_GW, L_W, and L_D did not conflict during the training process even for complicated data, and that these three terms were balanced against -R_H in the same way that reconstruction is traded off against regularization in β-VAE (Higgins et al., 2017a; Tschannen et al., 2018).

C.6 PRIOR FAMILY SELECTION

We show the effect of prior family selection regarding a meta-prior, disentanglement, in Table 8. While GWAE models with NP and GMP retain informativeness comparable to that of FNP, these two priors did not disentangle the latent factors as well as FNP. Although the NP covers a more general family of priors, these results suggest that choosing a prior family suited to the postulated meta-prior greatly facilitates learning representations.

C.7 QUALITATIVE EVALUATIONS OF GENERATION AND RECONSTRUCTION

We show the reconstructed images by GWAE and state-of-the-art variational autoencoding methods in Fig. 6. The shown images are the first ten samples of the test split in the CelebA (Liu et al., 2015) dataset under the latent size L = 64. Compared with the other methods, the reconstruction of the GWAE model tends to retain edges (see the bottom rows of Fig. 6), while VAE-based models generate smooth, blurry images due to the noise injected in the latent space to perform probabilistic modeling and manifold learning. We also show the reconstruction results of MNIST (LeCun et al., 1998) in Fig. 7 and CIFAR-10 (Krizhevsky & Hinton, 2009) in Fig. 8. These results support that the GWAE models consistently perform autoencoding on a simpler dataset (MNIST) as well. On a more complex dataset (CIFAR-10), the GWAE model attained the best evaluation in generation despite its reconstruction performance, suggesting that the GWAE model captured the abstract structure of the data rather than reconstructing the given images. This contrast reflects the difference in their objectives, i.e., the GW objective aims at distribution matching in the latent space, while the β-VAE (Higgins et al., 2017a) objective with β < 1 puts weight on reconstruction.

We further study the generated images by GWAE and state-of-the-art VAE-based generative models in Fig. 9. These qualitative results show that the GWAE generation obtains a diverse set of images compared with those of state-of-the-art autoencoding generative models. The ALI model (Dumoulin et al., 2017) (Fig. 9 (a)) also generates various images by the distribution matching of bidirectional models, but the generated images have wavy contours and fail to compose images with a consistent appearance owing to the lack of an autoencoding process. Although the VAE-GAN model (Larsen et al., 2016) (Fig. 9 (b)) adequately yields organized images with smooth textures, the azimuth of these images is less diverse, i.e., the great majority of the images are facing forward or looking slightly sideways. The images generated by 2-Stage VAE (Dai & Wipf, 2019) (Fig. 9 (c)) have diverse azimuth, color, and background; however, these images tend to incline toward the majority attributes, e.g., not wearing eyeglasses or sunglasses. The GWAE model (Fig. 9 (d)) generates facial images with various skin colors, diversified backgrounds, and assorted facial expressions (e.g., wearing a mustache). These results imply that the GWAE models also function as generative models, although GWAE was built as a representation learning method, owing to the

