GROMOV-WASSERSTEIN AUTOENCODERS

Abstract

Variational autoencoder (VAE)-based generative models offer flexible representation learning by incorporating meta-priors, general premises considered beneficial for downstream tasks. However, the incorporated meta-priors often involve ad-hoc model deviations from the original likelihood architecture, causing undesirable changes in training. In this paper, we propose a novel representation learning method, Gromov-Wasserstein Autoencoders (GWAE), which directly matches the latent and data distributions using the variational autoencoding scheme. Instead of a likelihood-based objective, GWAE models minimize the Gromov-Wasserstein (GW) metric between the trainable prior and the given data distribution. The GW metric measures the distance-structure-oriented discrepancy between distributions, even those defined on spaces with different dimensionalities, and thus provides a direct comparison between the latent and data spaces. By restricting the family of the trainable prior, we can introduce meta-priors into the latent space without changing the objective. Empirical comparisons with VAE-based models show that GWAE models are effective for two prominent meta-priors, disentanglement and clustering, with the GW objective unchanged.

1. INTRODUCTION

One fundamental challenge in unsupervised learning is capturing the underlying low-dimensional structure of high-dimensional data, because natural data (e.g., images) lie on low-dimensional manifolds (Carlsson et al., 2008; Bengio et al., 2013). Since deep neural networks have shown their potential for non-linear mapping, representation learning has recently made substantial progress in applications to high-dimensional and complex data (Kingma & Welling, 2014; Rezende et al., 2014; Hsu et al., 2017; Hu et al., 2017). Low-dimensional representations are in growing demand because inferring concise representations extracts the essence of the data and thereby facilitates various downstream tasks (Thomas et al., 2017; Higgins et al., 2017b; Creager et al., 2019; Locatello et al., 2019a). To obtain such general-purpose representations, several meta-priors have been proposed (Bengio et al., 2013; Tschannen et al., 2018). Meta-priors are general premises about the world, such as disentanglement (Higgins et al., 2017a; Chen et al., 2018; Kim & Mnih, 2018; Ding et al., 2020), hierarchical factors (Vahdat & Kautz, 2020; Zhao et al., 2017; Sønderby et al., 2016), and clustering (Zhao et al., 2018; Zong et al., 2018; Asano et al., 2020). A prominent approach to representation learning is deep generative modeling based on the variational autoencoder (VAE) (Kingma & Welling, 2014). VAE-based models adopt the variational autoencoding scheme, which introduces an inference model in addition to the generative model and thereby offers bidirectionally tractable mappings between observed variables (data) and latent variables. In this scheme, the reparameterization trick (Kingma & Welling, 2014) yields the capability for representation learning, since reparameterized latent codes are tractable for gradient computation. Additional losses and constraints provide further regularization of the training process based on meta-priors.
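As a minimal illustration of the reparameterization trick mentioned above, a Gaussian posterior sample z ~ N(mu, diag(sigma^2)) can be rewritten as a deterministic transform of standard noise. The sketch below (numpy only; the variable names and values are ours for illustration, not from a trained model) shows the transform itself, omitting the encoder network that would produce mu and log_var:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for a batch of two data points
# in a 2-D latent space (illustrative values, not from a real model).
mu = np.array([[0.5, -1.0], [2.0, 0.0]])       # posterior means
log_var = np.array([[0.0, -2.0], [1.0, 0.0]])  # posterior log-variances

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
# The randomness is isolated in eps, so z is a deterministic,
# differentiable function of mu and log_var.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps
```

In an actual VAE, mu and log_var are produced by the encoder, and gradients of the loss flow through z back to those outputs, which is precisely what makes the latent codes trainable by gradient descent.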
However, controlling representation learning remains challenging in VAE-based models owing to deviations from the original optimization. Whereas existing VAE-based approaches modify the latent space based on the meta-prior (Kim & Mnih, 2018; Zhao et al., 2017; Zong et al., 2018), their training objectives still partly rely on the evidence lower bound (ELBO). Since the ELBO objective is grounded in variational inference, ad-hoc model modifications cause implicit and undesirable changes, e.g., posterior collapse (Dai et al., 2020) and implicit prior change (Hoffman et al., 2017) in β-VAE (Higgins et al., 2017a). Under such modifications, it is also unclear whether a latent representation retains the underlying data structure, because VAE models implicitly interpolate data points to form a latent space using the noise injected into latent codes by the reparameterization trick (Rezende & Viola, 2018a;b; Aneja et al., 2021). As another paradigm of variational modeling, the ELBO objective has been reinterpreted from the optimal transport (OT) viewpoint (Tolstikhin et al., 2018). Tolstikhin et al. (2018) derived a family of generative models called Wasserstein autoencoders (WAE) by applying the variational autoencoding model to high-dimensional OT problems as couplings (see Appendix A.4 for details). Despite the OT-based derivation, the WAE objective is equivalent to that of InfoVAE (Zhao et al., 2019), which consists of the ELBO and a mutual information term. The WAE formulation is derived from the estimation and minimization of the OT cost (Tolstikhin et al., 2018; Arjovsky et al., 2017) between the data distribution and the generative model, i.e., generative modeling using the Wasserstein metric. It furnishes a wide class of models, even when the prior support does not cover the entire variational posterior support.
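For reference, the ELBO discussed above takes the standard form below (notation ours, following the usual VAE convention); its reconstruction and KL terms are exactly what the modified objectives re-weight or augment:

```latex
\log p_\theta(x) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]}_{\text{reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)}_{\text{regularization}}
\;=\; \mathrm{ELBO}(\theta, \phi; x)
```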
The OT paradigm also applies to existing representation learning approaches originally derived by re-weighting the Kullback-Leibler (KL) divergence term (Gaujac et al., 2021). Another technique for optimizing the VAE-based ELBO objective, called implicit variational inference (IVI) (Huszár, 2017), has been actively researched. Whereas the VAE model has an analytically tractable prior for variational inference, IVI aims at variational inference using implicit distributions, for which one can use a sampler instead of a probability density function. A notable approach to IVI is density ratio estimation (Sugiyama et al., 2012), which replaces the f-divergence term in the variational objective with an adversarial discriminator that distinguishes the origin of samples. For distribution matching, this algorithm shares theoretical grounds with generative models based on generative adversarial networks (GANs) (Goodfellow et al., 2014; Sønderby et al., 2017), which motivates the application of IVI to distribution matching for complex and high-dimensional variables, such as images. See Appendix A.6 for further discussion. In this paper, we propose a novel representation learning methodology, Gromov-Wasserstein Autoencoders (GWAE), based on the Gromov-Wasserstein (GW) metric (Mémoli, 2011), an OT-based metric between distributions that is applicable even to spaces with different dimensionalities (Mémoli, 2011; Xu et al., 2020; Nguyen et al., 2021). Instead of the ELBO objective, we apply a GW-metric objective in the variational autoencoding scheme to directly match the latent marginal (prior) with the data distribution. GWAE models obtain a latent representation that retains the distance structure of the data space, thereby preserving the underlying data information. Despite its OT-based derivation, the GW objective also induces the variational autoencoding scheme to perform distribution matching between the generative and inference models.
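To make the GW metric concrete, the sketch below (numpy only; names and sizes are ours) evaluates the GW cost of a fixed coupling T between two small point clouds of different dimensionalities, using the squared-difference cost sum over i,j,k,l of (D_X[i,j] - D_Z[k,l])^2 T[i,k] T[j,l]. An actual GW solver (e.g., the ot.gromov module of the POT library) additionally optimizes over the coupling T, which we do not attempt here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two point clouds with different dimensionalities: the GW cost
# compares only within-space distances, never cross-space ones.
X = rng.normal(size=(5, 10))  # "data" samples, 10-D
Z = rng.normal(size=(5, 2))   # "latent" samples, 2-D

def pairwise_dist(P):
    # Euclidean distance matrix within one space.
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gw_cost(DA, DB, T):
    # GW cost of coupling T:
    # sum_{i,j,k,l} (DA[i,j] - DB[k,l])^2 * T[i,k] * T[j,l]
    gap = (DA[:, :, None, None] - DB[None, None, :, :]) ** 2
    return float((gap * T[:, None, :, None] * T[None, :, None, :]).sum())

DX, DZ = pairwise_dist(X), pairwise_dist(Z)

# Uniform marginals; T = p q^T is the (untrained) independent coupling.
p = np.full(len(X), 1.0 / len(X))
q = np.full(len(Z), 1.0 / len(Z))
cost = gw_cost(DX, DZ, np.outer(p, q))
```

A sanity check on the definition: comparing a space with itself under the identity coupling np.diag(p) yields a cost of zero, reflecting that the GW cost measures distortion of the distance structure rather than positions of points.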
Under the OT-based variational autoencoding scheme, the prior of a GWAE model can be drawn from a rich class of trainable priors chosen according to the assumed meta-prior, even when the KL divergence from the prior to the encoder is infinite. Our contributions are listed below.
• We propose a novel probabilistic model family, GWAE, which matches the latent space to the given unlabeled data via the variational autoencoding scheme. GWAE models estimate and minimize the GW metric between the latent and data spaces to directly align the latent representation with the distance structure of the data.
• We propose several families of priors in the form of implicit distributions, adaptively learned from the given dataset using stochastic gradient descent (SGD). The choice of the prior family corresponds to the assumed meta-prior, thereby providing a more flexible modeling scheme for representation learning.
• We conduct empirical evaluations of GWAE on prominent meta-priors: disentanglement and clustering. Experiments on the image datasets CelebA (Liu et al., 2015), MNIST (LeCun et al., 1998), and 3D Shapes (Burgess & Kim, 2018) show that GWAE models outperform VAE-based representation learning methods while the GW objective remains unchanged across meta-priors.
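As a sketch of what an implicit, trainable prior looks like, the snippet below (numpy only; the architecture, sizes, and names are ours for illustration, not the paper's) defines a prior purely through a neural sampler z = g(eps): drawing samples is cheap, a density is never evaluated, and the weights would be updated by SGD alongside the rest of the model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer sampler network g: base noise -> latent code.
# These weights would be trained by SGD together with the autoencoder.
W1 = rng.normal(scale=0.1, size=(8, 16))
W2 = rng.normal(scale=0.1, size=(16, 4))

def sample_prior(n):
    eps = rng.standard_normal((n, 8))  # base noise eps ~ N(0, I)
    h = np.tanh(eps @ W1)              # hidden non-linearity
    return h @ W2                      # latent samples z = g(eps)

z = sample_prior(32)  # 32 draws from the implicit prior
```

Restricting the form of such a sampler is how a meta-prior could be encoded: for instance, a sampler that first draws a discrete component before the continuous noise would yield a clustering-style prior family.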



2. RELATED WORK

VAE-based Representation Learning. The VAE (Kingma & Welling, 2014) is a prominent deep generative model for representation learning. Owing to its theoretical consistency and explicit handling of latent variables, many state-of-the-art representation learning methods have been proposed

